Lecture 4 Data Pre-Processing

The document provides an overview of a lecture on data pre-processing for a machine learning course. 1) It discusses using Pandas to import, clean, and visualize data. Common techniques like handling missing values, encoding categorical features, and feature scaling are covered. 2) Examples demonstrate loading data from CSV, dropping rows with null values, replacing empty cells, and handling incorrect data. 3) The goal is for students to understand these pre-processing techniques and apply them for cleaning machine learning data.


APEX INSTITUTE OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

MACHINE LEARNING (21CSH-286)


Faculty: Prof. (Dr.) Vineet Mehan (E13038)

Lecture – 4 DISCOVER . LEARN . EMPOWER


Data Pre-Processing
Machine Learning: Course Objectives
The course aims to:
1. Understand and apply various data handling and visualization techniques.
2. Understand some basic learning algorithms and techniques and their applications, as well as
general questions related to analysing and handling large data sets.
3. Develop skills in supervised and unsupervised learning techniques and implement them to
solve real-life problems.
4. Develop basic knowledge of machine learning techniques to build an intelligent machine that makes
decisions on behalf of humans.
5. Develop skills for selecting suitable model parameters and applying them to design optimized
machine learning applications.

COURSE OUTCOMES

On completion of this course, the students shall be able to:

CO2 Understand data pre-processing techniques and apply these for data cleaning.

Unit-1 Syllabus
Unit-1: Introduction to Machine Learning
• Introduction to Machine Learning: Definition of Machine Learning, Working principles of Machine
Learning; Classification of Machine Learning algorithms: Supervised Learning, Unsupervised
Learning, Reinforcement Learning, Semi-Supervised Learning; Applications of Machine Learning.
• Data Pre-Processing and Feature Extraction: Data Sourcing and Cleaning, Handling Missing data,
Encoding Categorical data, Feature Scaling, Handling Time Series data; Feature Selection
techniques, Data Transformation, Normalization, Dimensionality reduction.
• Data Visualization: Data Frame Basics, Different types of analysis, Different types of plots,
Plotting fundamentals using Matplotlib, Plotting Data Distributions using Seaborn.

SUGGESTED READINGS
• TEXT BOOKS:
• There is no single textbook covering the material presented in this course. Here is a list of books
recommended for further reading in connection with the material presented:
• T1: Tom M. Mitchell, "Machine Learning", McGraw Hill International Edition.
• T2: Ethem Alpaydin, "Introduction to Machine Learning", Eastern Economy Edition, Prentice Hall of
India, 2005.
• T3: Andreas C. Müller, Sarah Guido, "Introduction to Machine Learning with Python", O'Reilly, 2016.

• REFERENCE BOOKS:
• R1: Sebastian Raschka, Vahid Mirjalili, "Python Machine Learning", 2017.
• R2: Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification", Wiley, 2nd Edition.
• R3: Christopher Bishop, "Pattern Recognition and Machine Learning", illustrated Edition, Springer, 2006.

Data Sourcing
• pandas is used for data sourcing.

• pandas is a Python library for analyzing data.

• Name?
• The name "pandas" is derived from "panel data" and is also a play on
"Python data analysis".
• Panel data is a subset of longitudinal data where observations are for
the same subjects each time.
By: Prof. (Dr.) Vineet Mehan 6
Data Sourcing
• Use of pandas?

• pandas allows us to analyze big data and draw conclusions based on
statistical theories.

• pandas can clean messy data sets and make them readable and
relevant.

• pandas is used in data science.


Data Sourcing
• Data science is a branch of computer science where we study how to
store, use and analyze data to derive information from it.

• How to install pandas?

• 1. Open the command prompt.
• 2. Type:
• python -m pip install pandas



Make a DataFrame that records the types of
vehicles that passed a toll plaza.
• import pandas
• mydataset = {'cars': ["Maruti", "Hyundai", "Tata"], 'passings': [20, 12, 15]}
• myvar = pandas.DataFrame(mydataset)
• print(myvar)



Import pandas as pd and use pd
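
The code for this slide is not included in the text, so the following is a minimal sketch of what importing pandas under the conventional alias pd might look like. It reuses the toll-plaza data from the earlier slide; the variable names are illustrative.

```python
import pandas as pd  # "pd" is the conventional short alias for pandas

# Toll-plaza data: vehicle makes and how many of each passed
mydataset = {"cars": ["Maruti", "Hyundai", "Tata"], "passings": [20, 12, 15]}

# The library is now referenced as pd instead of pandas
myvar = pd.DataFrame(mydataset)
print(myvar)
```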



Read data from a CSV File
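
The original code for this slide is not available, so here is a hedged sketch of reading CSV data with read_csv() and printing all of it with to_string(). The CSV text below is hypothetical sample data held in memory so that the example runs without a file on disk; with a real file you would pass its path to read_csv() instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk (assumption)
csv_text = """Duration,Date,Calories
60,2020/12/01,409.1
60,2020/12/02,479.0
45,2020/12/03,340.0
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# to_string() renders the entire DataFrame as a single string
print(df.to_string())
```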



Reading CSV but print without converting to
string
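
As a rough illustration of the difference, printing a DataFrame directly (without to_string()) abbreviates the output when the frame has more rows than the display.max_rows option allows. The 100-row frame below is made-up data used only to trigger that behaviour.

```python
import pandas as pd

# Hypothetical 100-row frame, longer than the default display limit
df = pd.DataFrame({"Duration": range(100), "Calories": range(100, 200)})

# Row limit that pandas applies when printing a DataFrame directly
print(pd.options.display.max_rows)

# Printing directly shows only the first and last rows of a long frame
print(df)
```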



Checking the pandas version
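
A short sketch of one way to check the installed version: the version string is stored in the __version__ attribute of the package. The printed value depends on the installation.

```python
import pandas as pd

# The installed pandas version is exposed as a string attribute
print(pd.__version__)
```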



Pandas Data Frames
• A pandas DataFrame is a two-dimensional data structure, like a
two-dimensional array, or a table with rows and columns.

• Create a simple pandas DataFrame
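
The slide's code is not present in the text. A minimal sketch, assuming a small dictionary of made-up values, shows how each key becomes a column and each list position becomes a row:

```python
import pandas as pd

# Hypothetical sample values; each key becomes a column
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

df = pd.DataFrame(data)
print(df)

# shape reports (number of rows, number of columns)
print(df.shape)
```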



Load the CSV file into a DataFrame



Data Cleaning
• Data cleaning means fixing bad data in your data set.

• Bad data could be:


• Empty cells

• Data in wrong format

• Wrong data

• Duplicates



The data set contains some empty cells ("Date" in row
22, and "Calories" in row 18 and 28).
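
Empty cells can also be located programmatically rather than by eye. The sketch below is an illustration on hypothetical sample data (not the lecture's data set): isna() marks each empty cell, and sum() counts them per column.

```python
import io
import pandas as pd

# Hypothetical CSV with one empty Date and one empty Calories cell
csv_text = """Duration,Date,Calories
60,2020/12/01,409.1
45,,282.4
60,2020/12/03,
"""
df = pd.read_csv(io.StringIO(csv_text))

# isna() flags empty cells; sum() counts the flags in each column
missing_per_column = df.isna().sum()
print(missing_per_column)
```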



The data set contains wrong format ("Date" in row 26).



The data set contains wrong data ("Duration" in row 7).



The data set contains duplicates (row 11 and 12).



1. Remove Rows
• One way to deal with empty cells is to remove rows that contain
empty cells.

• This is usually OK, since data sets can be very big, and removing a few
rows will not have a big impact on the result.

• See rows 17 and 27 (removed)



The pandas dropna() method lets the user drop
rows or columns that contain null values.

By default, the dropna() method returns a new
DataFrame and does not change the original.
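
A minimal sketch of this default behaviour, using a small made-up frame rather than the lecture's data set: the result of dropna() is assigned to a new variable, and the original keeps its empty cell.

```python
import pandas as pd

# Hypothetical sample data with one empty Calories cell
df = pd.DataFrame({"Duration": [60, 45, 60], "Calories": [409.1, None, 300.0]})

# dropna() returns a new DataFrame without the rows that have empty cells
new_df = df.dropna()

print(new_df)
print(len(df), len(new_df))  # the original still has all of its rows
```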



If you want to change the original DataFrame, use
the inplace = True argument.
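
A short sketch of the in-place variant on the same kind of made-up data: with inplace=True the method returns None and modifies the original DataFrame directly.

```python
import pandas as pd

# Hypothetical sample data with one empty Calories cell
df = pd.DataFrame({"Duration": [60, 45, 60], "Calories": [409.1, None, 300.0]})

# With inplace=True nothing is returned; df itself loses the incomplete row
result = df.dropna(inplace=True)

print(result)
print(df)
```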



3. Replace Empty Values

The fillna() method allows us to replace
empty cells with a value.

In this example it replaces NULL values with the number 130
(see row 17, which now holds 130).
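
As an illustration, fillna() applied to a whole DataFrame fills every empty cell, whatever its column. The frame below is hypothetical sample data; 130 is the replacement value named on the slide.

```python
import pandas as pd

# Hypothetical sample data with empty cells in two different columns
df = pd.DataFrame({"Duration": [60, None, 45], "Calories": [409.1, 282.4, None]})

# Every empty cell, in any column, is replaced with 130
filled = df.fillna(130)
print(filled)
```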



4. Replace value in a particular column

Values are replaced at positions 17, 27, 91,
118, and 141 in the Calories column only.
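
A hedged sketch of restricting the replacement to one column: select the column, fill it, and assign the result back. The data is made up, so the row positions differ from those on the slide.

```python
import pandas as pd

# Hypothetical sample data with empty cells in two different columns
df = pd.DataFrame({"Duration": [60, None, 45], "Calories": [409.1, 282.4, None]})

# Only the Calories column is filled; Duration keeps its empty cell
df["Calories"] = df["Calories"].fillna(130)
print(df)
```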



5. Replace Using Mean, Median, or Mode
• A common way to replace empty cells is to calculate the mean,
median or mode value of the column.

• Mean → Average

• Median → Center value

• Mode → Most commonly occurring value
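
The three options can be sketched as follows. The Calories values are hypothetical, so the computed mean, median and mode differ from the figures quoted for the lecture's data set. Note that mode() returns a Series, because several values can tie, so the first entry is taken here.

```python
import pandas as pd

# Hypothetical sample data with one empty Calories cell
df = pd.DataFrame({"Calories": [300.0, 300.0, 450.0, None, 350.0]})

mean_value = df["Calories"].mean()      # average of the non-empty cells
median_value = df["Calories"].median()  # middle value of the sorted cells
mode_value = df["Calories"].mode()[0]   # most frequent value (first on ties)

# Each call returns a new column with the empty cell filled in
filled_mean = df["Calories"].fillna(mean_value)
filled_median = df["Calories"].fillna(median_value)
filled_mode = df["Calories"].fillna(mode_value)

print(mean_value, median_value, mode_value)
```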



Empty Values are replaced with mean
at position 17, 27, 91, 118, and 141 in
the Calories column only.

Mean here is 375.790244



Empty Values are replaced with median
at position 17, 27, 91, 118, and 141 in
the Calories column only.

Median here is 318.6



Empty Values are replaced with mode
at position 17, 27, 91, 118, and 141 in
the Calories column only.

Mode here is 300.0



Wrong Data
• "Wrong data" does not have to be "empty cells" or "wrong format", it
can just be wrong, like if someone registered "199" instead of "1.99".

• Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.

• If you take a look at our data set, you can see that in row 7, the
duration is 450, but for all the other rows the duration is between 30
and 60.



One way to fix wrong values is to
replace them with something else.

In our example, it is most likely a typo,
and the value should be "45" instead of
"450", and we could just insert "45" in
row 7:
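
The slide's code is missing from the text; a minimal sketch of such a single-cell fix uses loc with the row label and column name. The frame is hypothetical sample data arranged so that the suspicious value sits in row 7.

```python
import pandas as pd

# Hypothetical sample data; row 7 holds the suspicious value 450
df = pd.DataFrame({"Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30]})

# loc addresses one cell by row label and column name
df.loc[7, "Duration"] = 45
print(df)
```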



For Larger Data
• For small data sets you might be able to replace the wrong data one
by one, but not for big data sets.

• To replace wrong data for larger data sets you can create some rules,
e.g. set some boundaries for legal values, and replace any values that
are outside of the boundaries.
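
One possible form of such a rule, sketched on hypothetical data: any Duration above an upper boundary is replaced by the boundary itself. The limit of 120 is an assumption chosen for illustration and is not stated in the lecture.

```python
import pandas as pd

# Hypothetical sample data; row 7 holds an implausible duration
df = pd.DataFrame({"Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30]})

UPPER_LIMIT = 120  # assumed boundary for a legal duration

# Select every row above the boundary and overwrite its Duration
df.loc[df["Duration"] > UPPER_LIMIT, "Duration"] = UPPER_LIMIT
print(df)
```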



Removing Rows
• Another way of handling wrong data is to remove the rows that
contain wrong data.

• This way you do not have to find out what to replace them with, and
there is a good chance you do not need them for your analyses.

• The row at position 7 is removed.
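
A hedged sketch of removing rows by rule, on hypothetical data: keep only the rows whose Duration is within an assumed boundary of 120 (the boundary is illustrative, not taken from the lecture).

```python
import pandas as pd

# Hypothetical sample data; row 7 holds an implausible duration
df = pd.DataFrame({"Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30]})

UPPER_LIMIT = 120  # assumed boundary for a legal duration

# Boolean filtering keeps the rows that satisfy the condition
cleaned = df[df["Duration"] <= UPPER_LIMIT]
print(cleaned)
```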



Duplicate Data
• Duplicate rows are rows that have been registered more than one
time.

• By taking a look at our test data set, we can assume that row 11 and
12 are duplicates.

• To discover duplicates, we can use the duplicated() method.

• The duplicated() method returns a Boolean value for each row.
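
A minimal sketch of duplicated() on a small made-up frame in which row 2 repeats row 1. By default the first occurrence is not flagged, only the later repeats.

```python
import pandas as pd

# Hypothetical sample data; row 2 is an exact copy of row 1
df = pd.DataFrame({
    "Duration": [60, 45, 45, 30],
    "Calories": [409.1, 282.4, 282.4, 195.1],
})

# One Boolean per row: True where the row repeats an earlier row
dup_flags = df.duplicated()
print(dup_flags)
```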


The duplicated() method returns True for every
row that is a duplicate, and False otherwise.



Removing Duplicates
• To remove duplicates, use the drop_duplicates() method.

The duplicate row (row no 12) is now removed
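
As an illustration on hypothetical data (so the duplicate sits in row 2 rather than row 12), drop_duplicates() returns a new DataFrame that keeps the first occurrence of each row:

```python
import pandas as pd

# Hypothetical sample data; row 2 is an exact copy of row 1
df = pd.DataFrame({
    "Duration": [60, 45, 45, 30],
    "Calories": [409.1, 282.4, 282.4, 195.1],
})

# The first occurrence is kept; later repeats are dropped
deduped = df.drop_duplicates()
print(deduped)
```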



Summary
• Methods of Sourcing Data

• Methods of Cleaning Data

Task
• Apply various methods used for sourcing data, using suitable
arrays or datasets. (BT-Level 3)

• Design a model that cleans empty cells, data in the wrong
format, wrong data, and duplicates. (BT-Level 6)


