1.3 Data Analysis with Python- Data Wrangling 1
Objectives
Pre-processing Data in Python
Describe how to handle missing values
Describe data formatting techniques
Describe data normalization
Demonstrate the use of binning
Demonstrate the use of categorical variables
Pre-processing Data in Python
Data preprocessing is a necessary step in data analysis.
It is the process of converting or mapping data from one raw
form into another format to make it ready for further
analysis.
Data preprocessing is often called data cleaning or data
wrangling. It typically involves:
Identifying and handling missing values
Data formatting
Data normalization (centering/scaling)
Data binning
Turning categorical values into numeric variables
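The last three steps in the list above can be sketched briefly with pandas. This is only an illustration under assumptions: the column names (length, price, fuel), the values, and the bin edges are hypothetical, and min-max scaling is used as one possible normalization, not necessarily the one intended here.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    "length": [150.0, 175.0, 200.0],
    "price": [5000, 15000, 45000],
    "fuel": ["gas", "diesel", "gas"],
})

# Normalization (min-max scaling): rescale "length" into the range [0, 1].
length_range = df["length"].max() - df["length"].min()
df["length_scaled"] = (df["length"] - df["length"].min()) / length_range

# Binning: group "price" into three labeled ranges (bin edges are assumed).
df["price_bin"] = pd.cut(
    df["price"],
    bins=[0, 10000, 30000, 50000],
    labels=["low", "medium", "high"],
)

# Categorical to numeric: one indicator column per category of "fuel".
fuel_dummies = pd.get_dummies(df["fuel"])

print(df)
print(fuel_dummies)
```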
Dealing with Missing Values
A missing value occurs whenever a data entry is left empty,
that is, when no data value is stored for a variable in an
observation.
In a data set, a missing value may appear as a question mark,
a zero, or just a blank cell.
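Before missing values can be handled, they usually have to be made visible to pandas. A minimal sketch, assuming a hypothetical data frame in which "?" and empty strings mark missing entries (a zero should only be treated this way when it cannot be a legitimate value):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "?" and empty strings stand in for missing entries.
df = pd.DataFrame({
    "make": ["audi", "bmw", "", "volvo"],
    "price": ["13950", "?", "16500", "?"],
})

# Convert the placeholder markers to NaN so pandas treats them as missing.
df = df.replace(["?", ""], np.nan)

# Count the missing entries in each column.
missing_counts = df.isna().sum()
print(missing_counts)
```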
How to deal with missing data?
Go back and find what the actual value should be
Remove the data entry where the missing value is found
Drop the whole variable
Replace the missing values:
Replace them with an average
Replace them by frequency (the most common value)
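Both replacement strategies can be sketched as follows. The column names and values are hypothetical, and fillna() is used here as one convenient way to fill the gaps; it is a swapped-in call, not one named in these slides.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one categorical column.
df = pd.DataFrame({
    "normalized_losses": [164.0, np.nan, 158.0, np.nan],
    "num_doors": ["four", "two", np.nan, "four"],
})

# Replace with an average: mean() skips NaN, so this is (164 + 158) / 2.
mean_losses = df["normalized_losses"].mean()
df["normalized_losses"] = df["normalized_losses"].fillna(mean_losses)

# Replace by frequency: mode() returns the most common value(s), ignoring NaN.
most_common = df["num_doors"].mode()[0]
df["num_doors"] = df["num_doors"].fillna(most_common)

print(df)
```

An average suits numeric variables, while the most frequent value is the usual choice for categorical variables, where a mean is not defined.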
Use dataframe.dropna() to drop missing data
inplace=True: writes the result back into the data frame
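A short sketch of dropna() on a hypothetical data frame (the column names and values are assumptions). It shows dropping rows, dropping columns, and the effect of inplace=True:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price.
df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13950.0, np.nan, 16500.0],
})

# axis=0 drops rows; subset limits the check to the "price" column.
rows_dropped = df.dropna(subset=["price"], axis=0)

# axis=1 drops every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# Without inplace=True the original df is unchanged. With it, df itself
# is modified and the call returns None.
df.dropna(subset=["price"], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)  # renumber rows after the drop

print(df)
```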
Use dataframe.replace(missingValue, newValue) to
replace missing data with another value
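As a minimal sketch of the replace() call above, assuming a hypothetical "price" column in which NaN is the missing value and the column mean is the new value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price.
df = pd.DataFrame({"price": [13950.0, np.nan, 16500.0]})

# mean() ignores NaN, so this is (13950 + 16500) / 2.
mean_price = df["price"].mean()

# replace(missingValue, newValue): swap every NaN for the column mean.
df["price"] = df["price"].replace(np.nan, mean_price)

print(df)
```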
Go back and find what the actual value should be:
you can always check for a higher-quality data set or
source
Leave it as missing data:
in some cases you may want to leave the missing data as
missing
Data Formatting in Python
Data are usually collected from different places and
stored in different formats.
What is data formatting? Bringing data into a common
standard of expression, which allows users to make meaningful
comparisons.
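To illustrate what a common standard of expression can mean in practice, the sketch below unifies several spellings of the same city. The column name, the spellings, and the mapping table are all hypothetical.

```python
import pandas as pd

# Hypothetical data: one city written in four different ways.
df = pd.DataFrame({"city": ["N.Y.", "NY", "New York", "ny"]})

# Assumed mapping from lower-cased variants to one canonical spelling.
canonical = {"n.y.": "New York", "ny": "New York", "new york": "New York"}

# Trim whitespace and lower-case first, then map to the canonical form.
# Values missing from the mapping would become NaN.
df["city"] = df["city"].str.strip().str.lower().map(canonical)

print(df["city"].value_counts())
```

Once every entry uses the same spelling, counts and group comparisons treat the rows as one category instead of four.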
Data types in Python and Pandas
object: “B”, “HoaDNT”
int64: 0, 2, 4
float64: 1.345, 78.9
To identify data types: dataframe.dtypes (an attribute, so no parentheses).
To convert data types: dataframe.astype().
Example: convert the data type of the column “price” to integer
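The conversion could look like the sketch below. The data frame is hypothetical, and it assumes the prices were read in as text, which is a common reason a numeric column ends up with the wrong type.

```python
import pandas as pd

# Hypothetical data: prices stored as text instead of numbers.
df = pd.DataFrame({
    "make": ["audi", "bmw"],
    "price": ["13950", "16500"],
})

# Inspect the current types; dtypes is an attribute, not a method.
print(df.dtypes)

# Convert the "price" column to integers and write it back.
df["price"] = df["price"].astype("int64")

print(df.dtypes)
```

Note that astype() raises an error if any entry cannot be converted, so missing-value markers such as "?" need to be handled first.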
Summary
Pre-processing Data in Python
Describe how to handle missing values
Describe data formatting techniques
Q&A