my notes: Python Pandas

4 min readMar 12, 2020

Pandas data frame

2D, tabular, mutable, labelled index

1.Creating a pandas data frame

Create empty data frame

import pandas as pd df = pd.DataFrame()

Create using list

tmp = [‘hello', ‘world']df = pd.DataFrame(tmp)

Create using dict

tmp = {’Name' : [‘blair’, ‘cady’, ‘holly’],
       ‘Age' : [20, 18, 25],
       ’Title' : [‘gg’, ‘mg’, ‘bat’]}df = pd.DataFrame(tmp)

Read from file

df = pd.read_csv(‘movie.csv’, index_col = ’Name’)

2.Basic operations on data frame

Column selection

df[’Name’]df[[’Name’, ’Title’]]df.loc[:, [’Name’, ’Title’]]

Row selection

df.loc[‘blair’]

Selection by index

df.iloc[:,0] #retrieve all rows and a single column
df.iloc[:,[0, 1]] #retrieve all rows and multiple single columndf.iloc[2] #retrieve one row
df.iloc[[0, 2]] #retrieve multiple row

Add new columns

ratings = [5, 5, 5]df[‘Ratings’] = ratings

Add new rows

new_row = pd.DataFrame({’Name' : [‘belle’],
                        ‘Age' : [20],
                        ’Title' : [‘bnb’],
                        ‘Ratings' : [5]})df = pd.concat([df, new_row]).reset_index(drop=True) #to ensure that the index resets else new_row will retain its original index

Delete columns

df.drop([’Title’, ‘Ratings], axis=1, inplace=True)

Delete rows

df.set_index('Name', inplace=True) #need to set the index for row deletion
df.drop(['blair', 'holly'], inplace = True)

Drop duplicate rows

df.drop_duplicates([‘Age’, ‘Ratings’])

3.Missing values

Check for missing values using isnull() and notnull()

import pandas as pd
import numpy as nptmp = {’Name' : [‘blair’, ‘cady’, ‘holly’, np.nan],
       ‘Age' : [20, 18, np.nan, 25],
       ’Title' : [‘gg’, ‘mg’, ‘bat’, np.nan]}df = pd.DataFrame(tmp)df.isnull()

Fill missing values with constant values with fillna() and replace()

df.fillna(0) #fills empty cells with 0

Fill values with previous row data

df.fillna(method=‘pad’) #if nan value is in the first row it will not be filled

Fill values with next row data

df.fillna(method=‘bfill’) #if nan value is in the last row it will not be filled

Drop missing values with dropna()

df.dropna(inplace=True) #drop all rows and columns that contains any Nan/Null

4.Convert data type

df.Age = df.Age.astype(‘int64’) #ensure no null values type(df.Age[0]) #type changed from numpy.float64 to numpy.int64

5.Data comparison and selection

Select data

df[df[’Name’]==‘blair]df[df[’Name’]!=‘blair’]

Compare data

checks = df[’Name’]==‘blair’
df[‘Checks’] = checks #returns boolean

6.Joins

Concatenate by rows for 2 dataframe (rbind)

tmp = pd.DataFrame({'Name' : ['blair', 'cady'],
                    'Age' : [20, 18],
                    'Title' : ['gg', 'mg']},
                     index=[0, 1])tmp1 = pd.DataFrame({'Name' : ['holly', 'belle'],
                     'Age' : [25, 20],
                     'Title' : ['bat', 'bnb']},
                      index=[2,3]) #can combine without indexpd.concat([tmp, tmp1])

Concatenate by columns for 2 data frame (cbind)

tmp = pd.DataFrame({'Name' : ['blair', 'cady','holly', 'belle'],
                    'Age' : [20, 18, 25, 20]},
                     index=[0, 1, 2, 3])tmp1 = pd.DataFrame({'Title' : ['gg', 'mg', 'bat', 'bnb'],
                     'Ratings' : [5, 5, 5, 5]},                  
                      index=[0, 1, 2, 3])tmp.join(tmp1)

Joins

tmp = pd.DataFrame({'Name' : ['blair', 'cady', 'belle'],
                    'Age' : [20, 18, 20]})tmp1 = pd.DataFrame({'Name' : ['blair', 'cady','holly', 'belle'],
                     'Title' : ['gg', 'mg', 'bat', 'bnb'],
                     'Ratings' : [5, 5, 5, 5]},                  
                      index=[0, 1, 2, 3])#how: left, right, inner, outer
#on: primary indexpd.merge(tmp, tmp1, how ='left', on ='Name')

7. Data processing

Split with delimiter

df = pd.DataFrame({'First Name' : ['blair', 'cady', 'holly'],
                   'Last Name' : ['waldorf', 'heron', 'golightly'],
                   'Title' : ['gg', 'mg', 'bat'],
                   'Timing' : ['9:00', '11:30', '6:28']})time = df['Timing'].str.split(':', n =1, expand=True) #n=1 indicates the split at the first occurrence of the delimiterdf['Hour'] = time[0]
df['Min'] = time[1]

Combine with delimiter

df["Full Name"]= df["First Name"].str.cat(df['Last Name'], sep =",")

Filter data

df1 = df[df[‘First Name']=='blair'] #contains valuedf1 = df[df[‘First Name']!='blair'] #does not contain value

Frequency of element in column

df[‘First Name'].value_counts() #return count for each elementdf['First Name'].value_counts().max() #return max countdf['First Name'].value_counts().idxmax() #return element with max count

my notes: Python Pandas

Written by Cheryl

No responses yet