Pandas basics - part 1

by Charlie Jackson


Posted on March 14, 2019 at 12:00 PM


displaying a graphic of a 2 dimensional array

In this tutorial we take a look at the basic features of the Python package Pandas. Pandas is an essential package for data science tool kits and well worth taking the time to learn.

What is Pandas?

Pandas is a library written for data science. It offers great features for data analysis and manipulation in the structure of data frames and series. Visually, it is somewhat like an excel table with columns and rows. It is also built up top of the python package NumPy.

Installing Pandas

Pandas is not installed by default and thus must be installed using pip:

pip install pandas
				

Pandas basics

Once installed, you must import pandas; the convention is to "import pandas as pd" as depicted below. We then create a DataFrame by passing data to Pandas; this can be in the form of a Python dictionary, a CSV, SQL and many other forms. The code below shows how to pass a list and a CSV.

>>> import pandas as pd
>>> names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
>>> dr =  [True, False, False, False, True, True, True]
>>> cpc = [809, 731, 588, 18, 200, 70, 45]
>>> diction = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
>>> cars = pd.DataFrame(diction)
>>> row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
>>> cars.index = row_labels
>>> cars
           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45

				 

Alternatively, you can pass in a csv. Note, when passing a CSV you can specify which column you want to set as the index label.

>>> cars = pd.read_csv('cars.csv', index_col=0)
>>> cars
           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
			 

Accessing columns

You can access the columns using a number of ways that will return a data series of data frame depending upon on how you slice the DataFrame. The most simply way is to use square brackets. The final example in the code block shows how you effectively pass in a list to return multiple columns.

>>> cars['country']
US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object
>>>
>>> type(cars['country'])
'pandas.core.series.Series'
>>>
>>> cars[['country']]
           country
US   United States
AUS      Australia
JAP          Japan
IN           India
RU          Russia
MOR        Morocco
EG           Egypt
>>>
>>> type(cars[['country']])
'pandas.core.frame.DataFrame'
>>> cars[['country','drives_right']]
           country  drives_right
US   United States          True
AUS      Australia         False
JAP          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          True
>>> 
			

Although the above works quite effectively for simple slicing, Pandas has a more advanced way that slices based upon rows and columns; DataFrame[rows, columns]. To use this feature you must use the loc or iloc attributes:

  • loc = label based
  • iloc = integer based
>>> cars.loc['US'] # returns data series
country         United States
drives_right             True
cars_per_cap              809
Name: US, dtype: object
 >>> cars.loc[['US']] # returns data frame
          country  drives_right  cars_per_cap
US  United States          True           809
>>>
>>> cars.loc[['US','JAP']]
           country  drives_right  cars_per_cap
US   United States          True           809
JAP          Japan         False           588
>>>
>>>
>>> cars.iloc[0]
country         United States
drives_right             True
cars_per_cap              809
Name: US, dtype: object
>>>
>>> cars.iloc[[1]]
       country  drives_right  cars_per_cap
AUS  Australia         False           731
>>> cars.iloc[[0,1]]
           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
>>>
			

Finally, as noted previously, you can slice with DataFrame[rows, columns]. Note in the final code block, how you can replace the list with a colon to get all the rows. The respective syntax will also work with iloc.

>>> cars.loc[['US','JAP'],['country','drives_right']]
           country  drives_right
US   United States          True
JAP          Japan         False
>>> 
>>> cars.loc[:,['country', 'drives_right']]
           country  drives_right
US   United States          True
AUS      Australia         False
JAP          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          True
>>> 
			

Summary of accessing columns

Square brackets

  • Column access: cars['country']
  • Row access: cars[1:4]

loc (label based)

  • Row access: cars.loc[['AUS','US']]
  • Column access: cars.loc[:,['country','drives_right']]
  • Row and Column: cars.loc[['US','AUS'],['country','drives_right']]

Filtering Pandas DataFrame

There are three steps to querying a DataFrame

  1. Select the area column - this must return a Pandas Series
  2. Do comparison on area column - this must return a Boolean series
  3. Use result / Boolean to index and return new DataFrame
>>> cf = cars['cars_per_cap']
>>> sl = cf > 500
>>> cars[sl]
           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
>>>
			

Note, this can be done on one line as so:

>>> cars[cars['cars_per_cap'] > 500]
           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
>>> 
			

As with NumPy, you must use logical_and, logical_or method operators for more than one query. You must import ("import numpy as np") at the top of your script to use the logical_and() method.

>>> cars[np.logical_and(cars['cars_per_cap'] >= 100, cars['cars_per_cap'] <= 500)]
   country  drives_right  cars_per_cap
RU  Russia          True           200
>>> 
			

Code takeaway

Currently there is a lot of prints being executed in the terminal; it’s worth playing around, commenting out certain print statements to identify what exactly is going on.

import pandas as pd
import numpy as np

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
diction = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(diction)
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels
print(cars) # returns data frame

# basic square brackets to get index
print(cars['country']) # gets column,returns series
print(type(cars['country'])) #
print(cars[['country']]) # gets column, returns dataframe
print(type(cars[['country']]))
print(cars[['country','drives_right']]) # mutilple columns, returns dataframe

# iloc (integer) and loc (label)
print(cars.loc['US']) # returns data series
print(cars.loc[['US']]) # returns data frame
print(cars.loc[['US','JAP']]) # returns multiple columns in data frame
print(cars.iloc[0]) # returns data series
print(cars.iloc[[1]]) # returns data frame
print(cars.iloc[[0,1]]) # returns multiple columns in data frame
print(cars.loc[['US','JAP'],['country','drives_right']]) # returns specific rows and columns in data frame
print(cars.loc[:,['country', 'drives_right']]) # returns all rows but specific columns in data frame

# querying data frame - 3 steps
cf = cars['cars_per_cap'] # step 1 - select area column to return a series
sl = cf > 500 # step 2 - do comparison to return boolean series
print(cars[sl]) step 3 - use result to index dataframe
# one liner
print(cars[cars['cars_per_cap'] > 500])
# multiple queries using logical and
print(cars[np.logical_and(cars['cars_per_cap'] >= 100, cars['cars_per_cap'] <= 500)])