Posted on March 15, 2019 at 12:00 PM
In the previous tutorial, titled "Pandas basics" we looked at creating a DataFrame from dictionaries, accessing columns and querying the DataFrame using brackets, iloc and iloc. In this post we will look at looping through DataFrames and creating new columns.
The two packages we will using are Pandas and NumPy which do not come preinstalled with Python. These can be installed using pip
pip install pandas
pip install numpy
For all the preceeding code, you will need to import the respective packages below.
>>> import numpy as np
>>> import pandas as pd
>>>
If you are familiar with the basic concepts of loops, you will comfortable with typical loops through strings, lists,dictionaries etc.
Looping through a list is quite straight forward as depicted below. You can also use the keyword "enumerate", which ties an index key to the value in the list.
>>> names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
>>> for name in names:
print(name)
United States
Australia
Japan
India
Russia
Morocco
Egypt
>>>
>>> for index, value in enumerate(names):
print(index, value)
0 United States
1 Australia
2 Japan
3 India
4 Russia
5 Morocco
6 Egypt
>>>
Looping through a dictionary is near enough the same as the list, although you must 'append .items()' onto the dictionary:
>>> country_dict = {'England':'London', 'France':'Paris', 'Germany':'Berlin', 'Spain':'Madrid'}
>>> for country, city in country_dict.items():
print("The capital of " + country + " is " + city)
The capital of England is London
The capital of France is Paris
The capital of Germany is Berlin
The capital of Spain is Madrid
>>>
With DataFrames, you can loop through as you would a list to get the column titles or append .iterrows(). Note, if it is a NumPy array, you can do the same with np.nditer([numpy array]). Note what is returned is a Series, not a DataFrame.
>>> names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
>>> dr = [True, False, False, False, True, True, True]
>>> cpc = [809, 731, 588, 18, 200, 70, 45]
>>> diction = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
>>> cars = pd.DataFrame(diction)
>>> row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
>>> cars.index = row_labels
>>> cars
country drives_right cars_per_cap
US United States True 809
AUS Australia False 731
JAP Japan False 588
IN India False 18
RU Russia True 200
MOR Morocco True 70
EG Egypt True 45
>>>
>>> for car in cars:
print(car)
country
drives_right
cars_per_cap
>>>
>>> for lab, row in cars.iterrows():
print(lab)
print(row)
US
country United States
drives_right True
cars_per_cap 809
Name: US, dtype: object
AUS
country Australia
drives_right False
cars_per_cap 731
Name: AUS, dtype: object
JAP
country Japan
drives_right False
cars_per_cap 588
Name: JAP, dtype: object
ETC
ETC
>>>
To loop through the DataFrame and grab certain data (not a whole series) we can apply square brackets or use loc (label based) or iloc (integer based) to filter the Series.
>>> for lab, row in cars.iterrows():
print(row['country']) # this is the same as loc
print(row.loc['country'])
print(row.iloc[0])
United States
United States
United States
Australia
Australia
Australia
Japan
Japan
Japan
India
India
India
Russia
Russia
Russia
Morocco
Morocco
Morocco
Egypt
Egypt
Egypt
>>>
You can simply add a column using the same syntax as you would a dictionary
>>> cars['test'] = 0
>>> cars
country drives_right cars_per_cap test
US United States True 809 0
AUS Australia False 731 0
JAP Japan False 588 0
IN India False 18 0
RU Russia True 200 0
MOR Morocco True 70 0
EG Egypt True 45 0
>>>
Clearly this is useless, so let’s delete it and add something which might be slightly more applicable.
>>> del cars['test']
>>> cars
country drives_right cars_per_cap
US United States True 809
AUS Australia False 731
JAP Japan False 588
IN India False 18
RU Russia True 200
MOR Morocco True 70
EG Egypt True 45
>>>
To add a new column and values we can either use a for loop (not optimal - imagine you have millions of rows) or use an 'apply' function.
>>> for lab, row in cars.iterrows():
cars.loc[lab, 'COUNTRY'] = row['country'].upper()
>>> cars
country drives_right cars_per_cap COUNTRY
US United States True 809 UNITED STATES
AUS Australia False 731 AUSTRALIA
JAP Japan False 588 JAPAN
IN India False 18 INDIA
RU Russia True 200 RUSSIA
MOR Morocco True 70 MOROCCO
EG Egypt True 45 EGYPT
>>>
Note how we have created a new column and inserted based upon the label in the for loop (so the data is inserted into the correct row) and also applied a function to our new data. A simpler way to do this is using the apply method which accepts a function.
>>> cars['COUNTRY'] = cars['country'].apply(str.upper)
>>> cars
country drives_right cars_per_cap COUNTRY
US United States True 809 UNITED STATES
AUS Australia False 731 AUSTRALIA
JAP Japan False 588 JAPAN
IN India False 18 INDIA
RU Russia True 200 RUSSIA
MOR Morocco True 70 MOROCCO
EG Egypt True 45 EGYPT
>>>
Notice, the apply() method accepts a function i.e you can pass you own function into this for you new column but without the parenthesis as you would a normal function.
>>> def times_2(data):
data = data * 2
return data
>>>
>>> cars['times_2'] = cars['cars_per_cap'].apply(times_2)
>>> cars
country drives_right cars_per_cap COUNTRY times_2
US United States True 809 UNITED STATES 1618
AUS Australia False 731 AUSTRALIA 1462
JAP Japan False 588 JAPAN 1176
IN India False 18 INDIA 36
RU Russia True 200 RUSSIA 400
MOR Morocco True 70 MOROCCO 140
EG Egypt True 45 EGYPT 90
>>>
This code can be added to a script and run in the terminal.
import numpy as np
import pandas as pd
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
diction = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(diction)
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels
print(cars)
# Looping through a DataFrame - this returns a Series
for lab, row in cars.iterrows():
print(lab)
print(row)
# With the loop as you are getting a series for each iteration, you can access unique row_labelsfor lab, row in cars.iterrows():
for lab, row in cars.iterrows():
print(row['country']) # this is the same as loc
print(row.loc['country'])
print(row.iloc[0])
# Adding columns to a DataFrame - 1 - sqaure brackets
cars['test'] = 0
print(cars)
del cars['test']
# 2 - square brackets in loop
for lab, row in cars.iterrows():
cars.loc[lab, 'COUNTRY'] = row['country'].upper()
print(cars)
del cars['COUNTRY']
# 3 using apply method
cars['COUNTRY'] = cars['country'].apply(str.upper)
print(cars)
# using a homemade function to the 'apply' method
def times_2(data):
data = data * 2
return data
cars['times_2'] = cars['cars_per_cap'].apply(times_2)
print(cars)