05 Pandas
Pandas as a data exploration tool
• Import the package
import pandas as pd
• Read file
df = pd.read_csv('C:\\Rekha\\InputFiles\\DAUP\\Student.csv')
• Explore/Analyze data
• df.size – Total number of values (rows × columns)
• df.shape – Number of rows and columns
• df.describe() – Descriptive statistics (count, mean, std, min, quartiles, max) for the numeric columns
• df.info() – Displays the column names, their data types and non-null counts
• df.columns – Access column names
• df['gender'], df.gender – Access a column by name; df[0:3] – Access rows by position using a slice
• df["Team"].value_counts() – Frequency Distribution
• df.Team.unique() - All unique values for the column
• Selection of data by applying conditions
• Using loc, iloc, query
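The exploration and selection calls listed above can be tried on a small frame. This is a minimal sketch: the data below is a hypothetical stand-in for Student.csv, and the column names are assumptions rather than the real file's schema.

```python
import pandas as pd

# Hypothetical stand-in for Student.csv; the real file's columns may differ.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "group": ["A", "B", "A", "C", "B"],
    "math_score": [72, 65, 90, 58, 81],
})

n_values = df.size                          # rows * columns
n_rows, n_cols = df.shape                   # (rows, columns)
group_counts = df["group"].value_counts()   # frequency of each group
unique_groups = df["group"].unique()        # distinct values in the column

# Label-based selection: rows labelled 1 through 3 (the end label is included)
by_label = df.loc[1:3, ["gender", "math_score"]]

# Position-based selection: rows at positions 1 and 2 (the end is excluded)
by_position = df.iloc[1:3, 0:2]

# A boolean condition and the equivalent query string
high_scores = df[df["math_score"] > 70]
high_scores_q = df.query("math_score > 70")
```

Note that loc slices include the end label while iloc slices exclude the end position, which is a common source of off-by-one surprises.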
• Cleanup the data
• Rename columns
• Replace values in the dataframe
• Drop columns
• Remove duplicates
• Address null values : Drop
• Prepare the Data
• Add new columns if necessary
• Format the Date Column
• Apply selection, Filter to analyze a subset of data
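The cleanup and preparation steps above might look like this in practice. The frame, the column names (including the hypothetical notes, reading_score and exam_date columns) and the date format are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data with a duplicate row, a missing score and a text date column.
df = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "group": ["A", "B", "B", "C"],
    "math_score": [72.0, 65.0, 65.0, None],
    "reading_score": [80.0, 60.0, 60.0, 75.0],
    "exam_date": ["2023-01-15", "2023-02-20", "2023-02-20", "2023-03-05"],
    "notes": ["", "", "", ""],
})

# Clean up
df = df.rename(columns={"gender": "Gender", "group": "Group"})  # rename columns
df = df.replace({"Gender": {"female": "F", "male": "M"}})       # replace values
df = df.drop(columns=["notes"])                                 # drop a column
df = df.drop_duplicates()                                       # remove duplicate rows
df = df.dropna(subset=["math_score"])                           # drop rows with a null score

# Prepare
df["total"] = df["math_score"] + df["reading_score"]                  # add a new column
df["exam_date"] = pd.to_datetime(df["exam_date"], format="%Y-%m-%d")  # parse the date column
subset = df[df["Group"] == "A"]                                       # filter a subset
```

Reassigning the result of each call, rather than passing inplace=True, keeps the steps easy to chain and reorder; either style works.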
• Analyze the data
• Sorting data
• Aggregation using group by, pivot table, crosstab – sum, count, mean, max
• Visualization
Functions that we will be working on:
• shape, size, index, columns
• df.loc[2:5,['gender','group']]
• Renaming: df.rename(columns = {'gender':'Gender','group':'Group'},inplace=True)
• Replace Values:
df.replace({'gender': {'female':'F', 'male' : 'M'}},inplace=True)
• df.group.value_counts() – To obtain the count for each of the unique values in the column 'group'
• Grouping data :
df.groupby(by='gender').mean(numeric_only=True)
df.groupby(by=['group','gender']).sum()
df.groupby(['gender', 'group']).agg({'total' : ['min', 'max', 'mean', 'std'], 'math_score': ['mean']})
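A small worked example of the grouping calls above, using made-up values so the results can be checked by hand. The frame and its total column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical scores; "total" is an assumed precomputed column.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "group": ["A", "A", "B", "B"],
    "math_score": [70, 60, 90, 80],
    "total": [150, 130, 180, 160],
})

# Mean of every numeric column per gender; numeric_only skips text columns
mean_by_gender = df.groupby("gender").mean(numeric_only=True)

# Sum per (group, gender) pair; the result has a two-level row index
sum_by_group_gender = df.groupby(["group", "gender"]).sum()

# Different aggregations per column; the result has two-level column names
summary = df.groupby("gender").agg({"total": ["min", "max"], "math_score": ["mean"]})

f_mean = mean_by_gender.loc["F", "math_score"]          # (70 + 90) / 2
a_f_total = sum_by_group_gender.loc[("A", "F"), "total"]
f_max_total = summary.loc["F", ("total", "max")]
```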
• df.sort_values(by=['math_score'], na_position='first',ascending=False)[:5]
• df.pivot_table(index=['group','gender'])
• pd.crosstab(df.gender,df.group,margins=True)
• pd.cut(df.Age,bins=bins, labels=bins[1:])
• df.corr(numeric_only=True)
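The calls above can be exercised together on a toy frame. The data, the bins list and the Age column values are assumptions chosen so that each result is easy to verify.

```python
import pandas as pd

# Hypothetical data; one math_score is missing to show na_position.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "group": ["A", "A", "B", "B", "A"],
    "Age": [15, 22, 31, 18, 27],
    "math_score": [70.0, 60.0, None, 80.0, 90.0],
})

# Sort descending with missing values first, then keep the first two rows
top = df.sort_values(by=["math_score"], na_position="first", ascending=False)[:2]

# Mean math_score for each (group, gender) pair
pivot = df.pivot_table(index=["group", "gender"], values="math_score", aggfunc="mean")

# Frequency table of gender against group, with row and column totals
counts = pd.crosstab(df["gender"], df["group"], margins=True)

# Bin ages into (10, 20], (20, 30], (30, 40], labelling each bin by its upper edge
bins = [10, 20, 30, 40]
age_band = pd.cut(df["Age"], bins=bins, labels=bins[1:])

# Pairwise correlation between the numeric columns only
corr = df.corr(numeric_only=True)
```

pd.cut needs one fewer label than bin edges, which is why bins[1:] is used for the labels.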
Data Cleaning
• Exploring Data
• shape, info, columns, index, describe, head, tail
• Filter the data
• Handling Missing Data
• Drop the data (delete), or fill the missing values with the mean, median or mode.
• Use an ML algorithm, such as regression, to predict the most probable value
• Handling Outliers
• Use a box plot or scatter plot to identify outliers, and handle them using the z-score or
interquartile range (IQR) method
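The missing-data and outlier steps above can be sketched as follows. The scores are made up, and the cut-offs (1.5 × IQR and |z| > 2) are illustrative assumptions; a threshold of 3 is also common for z-scores on larger samples.

```python
import pandas as pd

# Hypothetical scores with one missing value and one obvious outlier.
scores = pd.Series([56.0, 60.0, None, 62.0, 58.0, 61.0, 150.0], name="math_score")

# Handle missing data: fill with the median (mean or mode work the same way)
filled = scores.fillna(scores.median())

# IQR method: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = filled[(filled < lower) | (filled > upper)]

# Z-score method: flag values far from the mean in standard-deviation units
z = (filled - filled.mean()) / filled.std()
z_outliers = filled[z.abs() > 2]   # threshold of 2 is an assumption for this tiny sample

# One way to handle outliers: keep only values inside the IQR fences
cleaned = filled[(filled >= lower) & (filled <= upper)]
```

The median is used for filling here because, unlike the mean, it is not pulled toward the outlier.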
Feature Engineering
• Feature Encoding Technique
• One-hot encoding, label encoding, ordinal encoding
• Feature Scaling
• Features have different ranges, magnitudes and units. Scaling brings features measured on
different scales, such as salary and age, onto a comparable scale. This is known as
feature normalization or feature scaling.
• Feature transformation:
• Converting numerical data to categorical data
• Splitting a categorical column into multiple columns
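Both feature transformations above can be done directly in pandas. In this sketch the age bands, the labels and the location column format are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: a numeric age and a combined "city, country" text column.
df = pd.DataFrame({
    "age": [12, 25, 47, 68],
    "location": ["Pune, India", "Austin, USA", "Lyon, France", "Kyoto, Japan"],
})

# Numerical to categorical: bin ages into named bands
df["age_band"] = pd.cut(
    df["age"],
    bins=[0, 18, 40, 65, 100],
    labels=["child", "young_adult", "adult", "senior"],
)

# Split one categorical column into two separate columns
df[["city", "country"]] = df["location"].str.split(", ", expand=True)
```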
Feature encoding
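The three encoding techniques named under Feature Engineering can be sketched with plain pandas. The color and size columns and the size ordering are hypothetical; libraries such as scikit-learn offer dedicated encoders as an alternative.

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: an integer code per category (categories sort alphabetically,
# so blue=0, green=1, red=2); the numbers carry no real order
df["color_label"] = df["color"].astype("category").cat.codes

# Ordinal encoding: integers that follow a meaningful order we define ourselves
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_ordinal"] = df["size"].map(size_order)
```

One-hot encoding suits unordered categories, while ordinal encoding is appropriate only when the categories have a natural ranking.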
Feature Scaling
• Standard scaling or Z-score normalization: derives each value from its z-score.
It is best suited to normally distributed data. Suppose μ is the mean and σ is the standard deviation of
the feature column. Then the z-score of a value x is z = (x - μ) / σ.
• Min-max scaling: this method linearly transforms the original data into a given
range; for the range [0, 1], x' = (x - min) / (max - min). It preserves the relationships
between the scaled data and the original data. If the data is not normally distributed or
the standard deviation is very small, min-max scaling tends to work better. It is, however,
sensitive to outliers, since a single extreme value sets the minimum or maximum and
compresses the remaining values.
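Both scaling methods can be written in a line of pandas each. This is a sketch on made-up age and salary values; using the population standard deviation (ddof=0) is an assumption, since pandas defaults to the sample standard deviation (ddof=1).

```python
import pandas as pd

# Hypothetical features on very different scales.
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "salary": [30000, 50000, 70000, 90000],
})

# Standard scaling: z = (x - mean) / std, computed column by column
standardized = (df - df.mean()) / df.std(ddof=0)

# Min-max scaling to [0, 1]: x' = (x - min) / (max - min)
min_max = (df - df.min()) / (df.max() - df.min())
```

After standard scaling each column has mean 0 and standard deviation 1; after min-max scaling each column runs from 0 to 1, so age and salary become directly comparable.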