Posted on March 14, 2019 at 12:00 PM
In this tutorial we take a look at the basic features of the Python package Pandas. Pandas is an essential package for data science tool kits and well worth taking the time to learn.
Pandas is a library written for data science. It offers great features for data analysis and manipulation in the structure of data frames and series. Visually, it is somewhat like an excel table with columns and rows. It is also built up top of the python package NumPy.
Pandas is not installed by default and thus must be installed using pip:
pip install pandas
Once installed, you must import pandas; the convention is to "import pandas as pd" as depicted below. We then create a DataFrame by passing data to Pandas; this can be in the form of a Python dictionary, a CSV, SQL and many other forms. The code below shows how to pass a list and a CSV.
>>> import pandas as pd
>>> names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
>>> dr = [True, False, False, False, True, True, True]
>>> cpc = [809, 731, 588, 18, 200, 70, 45]
>>> diction = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
>>> cars = pd.DataFrame(diction)
>>> row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
>>> cars.index = row_labels
>>> cars
country drives_right cars_per_cap
US United States True 809
AUS Australia False 731
JAP Japan False 588
IN India False 18
RU Russia True 200
MOR Morocco True 70
EG Egypt True 45
Alternatively, you can pass in a csv. Note, when passing a CSV you can specify which column you want to set as the index label.
>>> cars = pd.read_csv('cars.csv', index_col=0)
>>> cars
country drives_right cars_per_cap
US United States True 809
AUS Australia False 731
JAP Japan False 588
IN India False 18
RU Russia True 200
MOR Morocco True 70
EG Egypt True 45
You can access the columns using a number of ways that will return a data series of data frame depending upon on how you slice the DataFrame. The most simply way is to use square brackets. The final example in the code block shows how you effectively pass in a list to return multiple columns.
>>> cars['country']
US United States
AUS Australia
JAP Japan
IN India
RU Russia
MOR Morocco
EG Egypt
Name: country, dtype: object
>>>
>>> type(cars['country'])
'pandas.core.series.Series'
>>>
>>> cars[['country']]
country
US United States
AUS Australia
JAP Japan
IN India
RU Russia
MOR Morocco
EG Egypt
>>>
>>> type(cars[['country']])
'pandas.core.frame.DataFrame'
>>> cars[['country','drives_right']]
country drives_right
US United States True
AUS Australia False
JAP Japan False
IN India False
RU Russia True
MOR Morocco True
EG Egypt True
>>>
Although the above works quite effectively for simple slicing, Pandas has a more advanced way that slices based upon rows and columns; DataFrame[rows, columns]. To use this feature you must use the loc or iloc attributes:
>>> cars.loc['US'] # returns data series
country United States
drives_right True
cars_per_cap 809
Name: US, dtype: object
>>> cars.loc[['US']] # returns data frame
country drives_right cars_per_cap
US United States True 809
>>>
>>> cars.loc[['US','JAP']]
country drives_right cars_per_cap
US United States True 809
JAP Japan False 588
>>>
>>>
>>> cars.iloc[0]
country United States
drives_right True
cars_per_cap 809
Name: US, dtype: object
>>>
>>> cars.iloc[[1]]
country drives_right cars_per_cap
AUS Australia False 731
>>> cars.iloc[[0,1]]
country drives_right cars_per_cap
US United States True 809
AUS Australia False 731
>>>
Finally, as noted previously, you can slice with DataFrame[rows, columns]. Note in the final code block, how you can replace the list with a colon to get all the rows. The respective syntax will also work with iloc.
>>> cars.loc[['US','JAP'],['country','drives_right']]
country drives_right
US United States True
JAP Japan False
>>>
>>> cars.loc[:,['country', 'drives_right']]
country drives_right
US United States True
AUS Australia False
JAP Japan False
IN India False
RU Russia True
MOR Morocco True
EG Egypt True
>>>
Square brackets
loc (label based)
There are three steps to querying a DataFrame
>>> cf = cars['cars_per_cap']
>>> sl = cf > 500
>>> cars[sl]
country drives_right cars_per_cap
US United States True 809
AUS Australia False 731
JAP Japan False 588
>>>
Note, this can be done on one line as so:
>>> cars[cars['cars_per_cap'] > 500]
country drives_right cars_per_cap
US United States True 809
AUS Australia False 731
JAP Japan False 588
>>>
As with NumPy, you must use logical_and, logical_or method operators for more than one query. You must import ("import numpy as np") at the top of your script to use the logical_and() method.
>>> cars[np.logical_and(cars['cars_per_cap'] >= 100, cars['cars_per_cap'] <= 500)]
country drives_right cars_per_cap
RU Russia True 200
>>>
Currently there is a lot of prints being executed in the terminal; it’s worth playing around, commenting out certain print statements to identify what exactly is going on.
import pandas as pd
import numpy as np
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
diction = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(diction)
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels
print(cars) # returns data frame
# basic square brackets to get index
print(cars['country']) # gets column,returns series
print(type(cars['country'])) #
print(cars[['country']]) # gets column, returns dataframe
print(type(cars[['country']]))
print(cars[['country','drives_right']]) # mutilple columns, returns dataframe
# iloc (integer) and loc (label)
print(cars.loc['US']) # returns data series
print(cars.loc[['US']]) # returns data frame
print(cars.loc[['US','JAP']]) # returns multiple columns in data frame
print(cars.iloc[0]) # returns data series
print(cars.iloc[[1]]) # returns data frame
print(cars.iloc[[0,1]]) # returns multiple columns in data frame
print(cars.loc[['US','JAP'],['country','drives_right']]) # returns specific rows and columns in data frame
print(cars.loc[:,['country', 'drives_right']]) # returns all rows but specific columns in data frame
# querying data frame - 3 steps
cf = cars['cars_per_cap'] # step 1 - select area column to return a series
sl = cf > 500 # step 2 - do comparison to return boolean series
print(cars[sl]) step 3 - use result to index dataframe
# one liner
print(cars[cars['cars_per_cap'] > 500])
# multiple queries using logical and
print(cars[np.logical_and(cars['cars_per_cap'] >= 100, cars['cars_per_cap'] <= 500)])