Pandas data frame
- 2D, tabular, mutable, labelled index
1.Creating a pandas data frame
- Create empty data frame
import pandas as pd df = pd.DataFrame()
- Create using list
tmp = [‘hello', ‘world']df = pd.DataFrame(tmp)
- Create using dict
tmp = {’Name' : [‘blair’, ‘cady’, ‘holly’],
‘Age' : [20, 18, 25],
’Title' : [‘gg’, ‘mg’, ‘bat’]}df = pd.DataFrame(tmp)
- Read from file
df = pd.read_csv(‘movie.csv’, index_col = ’Name’)
2.Basic operations on data frame
- Column selection
df[’Name’]df[[’Name’, ’Title’]]df.loc[:, [’Name’, ’Title’]]
- Row selection
df.loc[‘blair’]
- Selection by index
df.iloc[:,0] #retrieve all rows and a single column
df.iloc[:,[0, 1]] #retrieve all rows and multiple single columndf.iloc[2] #retrieve one row
df.iloc[[0, 2]] #retrieve multiple row
- Add new columns
ratings = [5, 5, 5]df[‘Ratings’] = ratings
- Add new rows
new_row = pd.DataFrame({’Name' : [‘belle’],
‘Age' : [20],
’Title' : [‘bnb’],
‘Ratings' : [5]})df = pd.concat([df, new_row]).reset_index(drop=True) #to ensure that the index resets else new_row will retain its original index
- Delete columns
df.drop([’Title’, ‘Ratings], axis=1, inplace=True)
- Delete rows
df.set_index('Name', inplace=True) #need to set the index for row deletion
df.drop(['blair', 'holly'], inplace = True)
- Drop duplicate rows
df.drop_duplicates([‘Age’, ‘Ratings’])
3.Missing values
- Check for missing values using isnull() and notnull()
import pandas as pd
import numpy as nptmp = {’Name' : [‘blair’, ‘cady’, ‘holly’, np.nan],
‘Age' : [20, 18, np.nan, 25],
’Title' : [‘gg’, ‘mg’, ‘bat’, np.nan]}df = pd.DataFrame(tmp)df.isnull()
- Fill missing values with constant values with fillna() and replace()
df.fillna(0) #fills empty cells with 0
- Fill values with previous row data
df.fillna(method=‘pad’) #if nan value is in the first row it will not be filled
- Fill values with next row data
df.fillna(method=‘bfill’) #if nan value is in the last row it will not be filled
- Drop missing values with dropna()
df.dropna(inplace=True) #drop all rows and columns that contains any Nan/Null
4.Convert data type
df.Age = df.Age.astype(‘int64’) #ensure no null values type(df.Age[0]) #type changed from numpy.float64 to numpy.int64
5.Data comparison and selection
- Select data
df[df[’Name’]==‘blair]df[df[’Name’]!=‘blair’]
- Compare data
checks = df[’Name’]==‘blair’
df[‘Checks’] = checks #returns boolean
6.Joins
- Concatenate by rows for 2 dataframe (rbind)
tmp = pd.DataFrame({'Name' : ['blair', 'cady'],
'Age' : [20, 18],
'Title' : ['gg', 'mg']},
index=[0, 1])tmp1 = pd.DataFrame({'Name' : ['holly', 'belle'],
'Age' : [25, 20],
'Title' : ['bat', 'bnb']},
index=[2,3]) #can combine without indexpd.concat([tmp, tmp1])
- Concatenate by columns for 2 data frame (cbind)
tmp = pd.DataFrame({'Name' : ['blair', 'cady','holly', 'belle'],
'Age' : [20, 18, 25, 20]},
index=[0, 1, 2, 3])tmp1 = pd.DataFrame({'Title' : ['gg', 'mg', 'bat', 'bnb'],
'Ratings' : [5, 5, 5, 5]},
index=[0, 1, 2, 3])tmp.join(tmp1)
- Joins
tmp = pd.DataFrame({'Name' : ['blair', 'cady', 'belle'],
'Age' : [20, 18, 20]})tmp1 = pd.DataFrame({'Name' : ['blair', 'cady','holly', 'belle'],
'Title' : ['gg', 'mg', 'bat', 'bnb'],
'Ratings' : [5, 5, 5, 5]},
index=[0, 1, 2, 3])#how: left, right, inner, outer
#on: primary indexpd.merge(tmp, tmp1, how ='left', on ='Name')
7. Data processing
- Split with delimiter
df = pd.DataFrame({'First Name' : ['blair', 'cady', 'holly'],
'Last Name' : ['waldorf', 'heron', 'golightly'],
'Title' : ['gg', 'mg', 'bat'],
'Timing' : ['9:00', '11:30', '6:28']})time = df['Timing'].str.split(':', n =1, expand=True) #n=1 indicates the split at the first occurrence of the delimiterdf['Hour'] = time[0]
df['Min'] = time[1]
- Combine with delimiter
df["Full Name"]= df["First Name"].str.cat(df['Last Name'], sep =",")
- Filter data
df1 = df[df[‘First Name']=='blair'] #contains valuedf1 = df[df[‘First Name']!='blair'] #does not contain value
- Frequency of element in column
df[‘First Name'].value_counts() #return count for each elementdf['First Name'].value_counts().max() #return max countdf['First Name'].value_counts().idxmax() #return element with max count