Pandas basics - part 2

by Charlie Jackson


Posted on March 15, 2019 at 12:00 PM


displaying a graphic of a 2 dimensional array

In the previous tutorial, titled "Pandas basics" we looked at creating a DataFrame from dictionaries, accessing columns and querying the DataFrame using brackets, iloc and iloc. In this post we will look at looping through DataFrames and creating new columns.

Installs

The two packages we will using are Pandas and NumPy which do not come preinstalled with Python. These can be installed using pip

pip install pandas
pip install numpy
				

For all the preceeding code, you will need to import the respective packages below.

>>> import numpy as np
>>> import pandas as pd
>>>
				

DataFrames and loops

If you are familiar with the basic concepts of loops, you will comfortable with typical loops through strings, lists,dictionaries etc.

Looping through a list is quite straight forward as depicted below. You can also use the keyword "enumerate", which ties an index key to the value in the list.

>>> names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
>>> for name in names:
			print(name)

United States
Australia
Japan
India
Russia
Morocco
Egypt
>>>
>>> for index, value in enumerate(names):
	print(index, value)

0 United States
1 Australia
2 Japan
3 India
4 Russia
5 Morocco
6 Egypt
>>>
				

Looping through a dictionary is near enough the same as the list, although you must 'append .items()' onto the dictionary:

>>> country_dict = {'England':'London', 'France':'Paris', 'Germany':'Berlin', 'Spain':'Madrid'}
>>> for country, city in country_dict.items():
	print("The capital of " + country + " is " + city)

The capital of England is London
The capital of France is Paris
The capital of Germany is Berlin
The capital of Spain is Madrid
>>>
				

With DataFrames, you can loop through as you would a list to get the column titles or append .iterrows(). Note, if it is a NumPy array, you can do the same with np.nditer([numpy array]). Note what is returned is a Series, not a DataFrame.

>>> names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
>>> dr =  [True, False, False, False, True, True, True]
>>> cpc = [809, 731, 588, 18, 200, 70, 45]
>>> diction = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
>>> cars = pd.DataFrame(diction)
>>> row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
>>> cars.index = row_labels
>>> cars
           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
>>>
>>> for car in cars:
	print(car)

country
drives_right
cars_per_cap
>>>
>>> for lab, row in cars.iterrows():
	print(lab)
	print(row)


US
country         United States
drives_right             True
cars_per_cap              809
Name: US, dtype: object
AUS
country         Australia
drives_right        False
cars_per_cap          731
Name: AUS, dtype: object
JAP
country         Japan
drives_right    False
cars_per_cap      588
Name: JAP, dtype: object
ETC
ETC
>>>

				

To loop through the DataFrame and grab certain data (not a whole series) we can apply square brackets or use loc (label based) or iloc (integer based) to filter the Series.

>>> for lab, row in cars.iterrows():
	print(row['country']) # this is the same as loc
	print(row.loc['country'])
	print(row.iloc[0])

United States
United States
United States
Australia
Australia
Australia
Japan
Japan
Japan
India
India
India
Russia
Russia
Russia
Morocco
Morocco
Morocco
Egypt
Egypt
Egypt
>>>
				

Adding columns to a DataFrame

You can simply add a column using the same syntax as you would a dictionary

>>> cars['test'] = 0
>>> cars
           country  drives_right  cars_per_cap  test
US   United States          True           809     0
AUS      Australia         False           731     0
JAP          Japan         False           588     0
IN           India         False            18     0
RU          Russia          True           200     0
MOR        Morocco          True            70     0
EG           Egypt          True            45     0
>>>
				

Clearly this is useless, so let’s delete it and add something which might be slightly more applicable.

>>> del cars['test']
>>> cars
           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
>>>
				

To add a new column and values we can either use a for loop (not optimal - imagine you have millions of rows) or use an 'apply' function.

>>> for lab, row in cars.iterrows():
	cars.loc[lab, 'COUNTRY'] = row['country'].upper()
>>> cars
           country  drives_right  cars_per_cap        COUNTRY
US   United States          True           809  UNITED STATES
AUS      Australia         False           731      AUSTRALIA
JAP          Japan         False           588          JAPAN
IN           India         False            18          INDIA
RU          Russia          True           200         RUSSIA
MOR        Morocco          True            70        MOROCCO
EG           Egypt          True            45          EGYPT
>>>
				

Note how we have created a new column and inserted based upon the label in the for loop (so the data is inserted into the correct row) and also applied a function to our new data. A simpler way to do this is using the apply method which accepts a function.

>>> cars['COUNTRY'] = cars['country'].apply(str.upper)
>>> cars
           country  drives_right  cars_per_cap        COUNTRY
US   United States          True           809  UNITED STATES
AUS      Australia         False           731      AUSTRALIA
JAP          Japan         False           588          JAPAN
IN           India         False            18          INDIA
RU          Russia          True           200         RUSSIA
MOR        Morocco          True            70        MOROCCO
EG           Egypt          True            45          EGYPT
>>>
				

Notice, the apply() method accepts a function i.e you can pass you own function into this for you new column but without the parenthesis as you would a normal function.

>>> def times_2(data):
			data = data * 2
			return data
>>>
>>> cars['times_2'] = cars['cars_per_cap'].apply(times_2)
>>> cars
           country  drives_right  cars_per_cap        COUNTRY  times_2
US   United States          True           809  UNITED STATES     1618
AUS      Australia         False           731      AUSTRALIA     1462
JAP          Japan         False           588          JAPAN     1176
IN           India         False            18          INDIA       36
RU          Russia          True           200         RUSSIA      400
MOR        Morocco          True            70        MOROCCO      140
EG           Egypt          True            45          EGYPT       90
>>>
				

Code takeaway

This code can be added to a script and run in the terminal.

import numpy as np
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
diction = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(diction)
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels
print(cars)

# Looping through a  DataFrame - this returns a Series
for lab, row in cars.iterrows():
	print(lab)
	print(row)

# With the loop as you are getting a series for each iteration, you can access unique row_labelsfor lab, row in cars.iterrows():
for lab, row in cars.iterrows():
	print(row['country']) # this is the same as loc
	print(row.loc['country'])
	print(row.iloc[0])



# Adding columns to a DataFrame - 1 - sqaure brackets
cars['test'] = 0
print(cars)
del cars['test']

# 2 - square brackets in loop
for lab, row in cars.iterrows():
	cars.loc[lab, 'COUNTRY'] = row['country'].upper()
print(cars)
del cars['COUNTRY']

# 3 using apply method
cars['COUNTRY'] = cars['country'].apply(str.upper)
print(cars)

# using a homemade function to the 'apply' method
def times_2(data):
    data = data * 2
    return data
cars['times_2'] = cars['cars_per_cap'].apply(times_2)
print(cars)