Lecture 4 Data Pre-Processing

The document provides an overview of a lecture on data pre-processing for a machine learning course. 1) It discusses using Pandas to import, clean, and visualize data. Common techniques like handling missing values, encoding categorical features, and feature scaling are covered. 2) Examples demonstrate loading data from CSV, dropping rows with null values, replacing empty cells, and handling incorrect data. 3) The goal is for students to understand these pre-processing techniques and apply them for cleaning machine learning data.


APEX INSTITUTE OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

MACHINE LEARNING (21CSH-286)

Faculty: Prof. (Dr.) Vineet Mehan (E13038)

Lecture – 4

Data Pre-Processing
Machine Learning: Course Objectives
The course aims to:
1. Understand and apply various data handling and visualization techniques.
2. Understand basic learning algorithms and techniques and their applications, as well as general questions related to analysing and handling large data sets.
3. Develop skills in supervised and unsupervised learning techniques and implement them to solve real-life problems.
4. Develop basic knowledge of machine learning techniques to build an intelligent machine that makes decisions on behalf of humans.
5. Develop skills for selecting suitable model parameters and applying them to design optimized machine learning applications.

COURSE OUTCOMES

On completion of this course, the students shall be able to:

CO2 Understand data pre-processing techniques and apply these for data cleaning.

Unit-1 Syllabus
Unit-1: Introduction to Machine Learning
• Introduction to Machine Learning: Definition of Machine Learning, Working principles of Machine Learning; Classification of Machine Learning algorithms: Supervised Learning, Unsupervised Learning, Reinforcement Learning, Semi-Supervised Learning; Applications of Machine Learning.
• Data Pre-Processing and Feature Extraction: Data Sourcing and Cleaning, Handling Missing data, Encoding Categorical data, Feature Scaling, Handling Time Series data; Feature Selection techniques, Data Transformation, Normalization, Dimensionality reduction.
• Data Visualization: Data Frame Basics, Different types of analysis, Different types of plots, Plotting fundamentals using Matplotlib, Plotting Data Distributions using Seaborn.

SUGGESTED READINGS
• TEXT BOOKS:
• There is no single textbook covering the material presented in this course. The following books are recommended for further reading:
• T1: Tom M. Mitchell, “Machine Learning”, McGraw Hill International Edition.
• T2: Ethem Alpaydin, “Introduction to Machine Learning”, Eastern Economy Edition, Prentice Hall of India, 2005.
• T3: Andreas C. Müller, Sarah Guido, “Introduction to Machine Learning with Python”, O’Reilly (2016).

• REFERENCE BOOKS:
• R1: Sebastian Raschka, Vahid Mirjalili, “Python Machine Learning” (2014).
• R2: Richard O. Duda, Peter E. Hart, David G. Stork, “Pattern Classification”, Wiley, 2nd Edition.
• R3: Christopher Bishop, “Pattern Recognition and Machine Learning”, Illustrated Edition, Springer, 2006.

Data Sourcing
• For data sourcing, pandas is used.

• pandas is a Python library for analyzing data.

• Where does the name come from?
• The name "pandas" combines "Panel Data" and "Python Data Analysis".
• Panel data is a subset of longitudinal data where observations are for the same subjects each time.
By: Prof. (Dr.) Vineet Mehan 6
Data Sourcing
• Why use pandas?

• pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

• pandas can clean messy data sets and make them readable and relevant.

• pandas is widely used in data science.

Data Sourcing
• Data science is a branch of computer science in which we study how to store, use, and analyze data to derive information from it.

• How to install pandas?

• 1. Open a command prompt.
• 2. Type:
• python -m pip install pandas


Make a DataFrame that records the type of vehicles that passed a toll plaza.
    import pandas

    # Each key becomes a column; both lists must have the same length
    mydataset = {'cars': ["Maruti", "Hyundai", "Tata"], 'passings': [20, 12, 15]}
    myvar = pandas.DataFrame(mydataset)
    print(myvar)


Import pandas as pd and use pd
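
A minimal sketch of the alias in use, reusing the toll-plaza data from the previous slide. Only the import line and the prefix change; `pd` is a convention, not a requirement.

```python
import pandas as pd  # "pd" is the conventional short alias for pandas

# Toll-plaza data from the previous slide
mydataset = {"cars": ["Maruti", "Hyundai", "Tata"], "passings": [20, 12, 15]}

# pd.DataFrame is the same class as pandas.DataFrame
myvar = pd.DataFrame(mydataset)
print(myvar)
```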


Read data from a CSV File
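
The original slide's code is not preserved, so the following is only a sketch of reading CSV data with `read_csv()`. The CSV text and its column names are hypothetical, and an in-memory buffer stands in for a real file so that the example runs on its own. In practice you would pass a file path (for example a file named data.csv) to `read_csv()`.

```python
import io
import pandas as pd

# Hypothetical CSV content, used here instead of a file on disk
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,282.4
"""

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

# to_string() renders the entire DataFrame as text, however many rows it has
print(df.to_string())
```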


Reading a CSV file and printing the DataFrame directly, without converting it to a string with to_string()
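
A short sketch of what changes when the DataFrame is printed directly. The 100-row DataFrame is a hypothetical stand-in for a large CSV file. With the default display settings, pandas abbreviates the output once the number of rows exceeds the `display.max_rows` option.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows, standing in for a large CSV file
df = pd.DataFrame({"Duration": range(100)})

# The row limit for printing (60 unless it has been reconfigured)
print(pd.options.display.max_rows)

# Printing directly shows only the first and last rows, plus a size summary
print(df)

# to_string() always renders every row
print(len(df.to_string().splitlines()))
```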


Checking the pandas version
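
The installed version is stored in the `__version__` attribute of the package. A minimal check looks like this:

```python
import pandas as pd

# The version is a plain string; the exact value depends on your installation
print(pd.__version__)
```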


Pandas DataFrames
• A pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

• Create a simple pandas DataFrame
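
As an illustration, a simple DataFrame can be built from a dictionary of equal-length lists. The column names and values below are hypothetical workout data, not taken from the lecture's data set.

```python
import pandas as pd

# Hypothetical data: each key becomes a column, each list holds its values
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

print(df)          # a table with rows labelled 0 to 2 and two columns
print(df.loc[0])   # the first row, returned as a Series
```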


Load the CSV file into a DataFrame


Data Cleaning
• Data cleaning means fixing bad data in your data set.

• Bad data could be:

• Empty cells

• Data in wrong format

• Wrong data

• Duplicates
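
Before cleaning, it can help to count how much bad data of each kind is present. The sketch below uses a small hypothetical DataFrame that contains one empty date, one empty calorie value, one implausible duration, and one duplicated row. It is an illustrative check, not the lecture's own code.

```python
import pandas as pd

# Hypothetical data with empty cells, a suspicious value (450), and a duplicate row
df = pd.DataFrame({
    "Duration": [60, 60, 450, 45, 45],
    "Date": ["2020/12/01", "2020/12/02", None, "20201204", "20201204"],
    "Calories": [409.1, None, 340.0, 282.4, 282.4],
})

empty_per_column = df.isna().sum()      # number of empty cells in each column
duplicate_rows = df.duplicated().sum()  # rows identical to an earlier row

print(empty_per_column)
print(duplicate_rows)
print(df.describe())  # min and max help reveal implausible numeric values
```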


The data set contains some empty cells ("Date" in row 22, and "Calories" in rows 18 and 28).

The data set contains a value in the wrong format ("Date" in row 26).

The data set contains wrong data ("Duration" in row 7).

The data set contains duplicates (rows 11 and 12).

1. Remove Rows
• One way to deal with empty cells is to remove rows that contain
empty cells.

• This is usually acceptable for large data sets, where removing a few rows will not have a big impact on the result.

• See rows 17 and 27 (removed).


The pandas dropna() method lets the user find and drop rows or columns that contain null values.

By default, the dropna() method returns a new DataFrame and does not change the original.
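
A minimal sketch of this default behaviour, using a small hypothetical DataFrame with one empty cell:

```python
import pandas as pd

# Hypothetical data: row 1 has an empty Calories cell
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, None, 195.0]})

# dropna() returns a copy without the rows that contain any empty cell
new_df = df.dropna()

print(new_df)    # only rows 0 and 2 remain
print(len(df))   # the original DataFrame still has all 3 rows
```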


If you want to change the original DataFrame, use the inplace=True argument.
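
The same hypothetical data can illustrate the in-place variant. Note that with `inplace=True` the method returns `None`, so its result should not be assigned back to the variable.

```python
import pandas as pd

# Hypothetical data: row 1 has an empty Calories cell
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, None, 195.0]})

# Modifies df itself; the call returns None
result = df.dropna(inplace=True)

print(result)  # None
print(df)      # df now contains only rows 0 and 2
```

An equivalent alternative is to reassign the returned copy, as in `df = df.dropna()`.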


3. Replace Empty Values

The fillna() method allows us to replace empty cells with a value.

In this example, NULL values are replaced with the number 130 (see row 17, replaced with 130).
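
A short sketch of `fillna()` applied to a whole DataFrame. The data is hypothetical; the replacement value 130 is the one used on the slide.

```python
import pandas as pd

# Hypothetical data with one empty cell in each column
df = pd.DataFrame({"Duration": [60, None, 30], "Calories": [409.1, None, 195.0]})

# Every empty cell, in every column, is replaced with 130
filled = df.fillna(130)

print(filled)
```

Because the value is applied to all columns, the empty Duration cell also becomes 130, which may not be what you want.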


4. Replace Values in a Particular Column

Empty values are replaced at positions 17, 27, 91, 118, and 141 in the Calories column only.
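
To limit the replacement to one column, one option is to select that column, fill it, and assign the result back. The data below is hypothetical. Assigning back is used here instead of `inplace=True` on a column selection, which recent pandas versions warn about.

```python
import pandas as pd

# Hypothetical data with one empty cell in each column
df = pd.DataFrame({"Duration": [60, None, 30], "Calories": [409.1, None, 195.0]})

# Fill only the Calories column and store the result back in the DataFrame
df["Calories"] = df["Calories"].fillna(130)

print(df)  # Calories is filled; the empty Duration cell is left untouched
```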


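5. Replace Using Mean, Median, or Mode
• A common way to replace empty cells is to calculate the mean, median, or mode value of the column.

• Mean: the average value (the sum of all values divided by the number of values)

• Median: the value in the middle, after all values have been sorted

• Mode: the value that occurs most frequently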
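
The three statistics can be computed with `mean()`, `median()`, and `mode()`, and then passed to `fillna()`. The sketch below uses a small hypothetical Calories column, so its numbers differ from those of the lecture's data set.

```python
import pandas as pd

# Hypothetical Calories column with one empty cell
df = pd.DataFrame({"Calories": [300.0, 300.0, None, 340.0, 420.0]})

# Empty cells are skipped when the statistics are computed
mean_value = df["Calories"].mean()       # (300 + 300 + 340 + 420) / 4 = 340.0
median_value = df["Calories"].median()   # middle of 300, 300, 340, 420 = 320.0
mode_value = df["Calories"].mode()[0]    # most frequent value = 300.0

# Fill the empty cell with the statistic of your choice (the mean here)
df["Calories"] = df["Calories"].fillna(mean_value)
print(df)
```

`mode()` returns a Series because several values can be tied for most frequent; `[0]` picks the first of them.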


Empty values are replaced with the mean at positions 17, 27, 91, 118, and 141 in the Calories column only. The mean here is 375.790244.

Empty values are replaced with the median at positions 17, 27, 91, 118, and 141 in the Calories column only. The median here is 318.6.

Empty values are replaced with the mode at positions 17, 27, 91, 118, and 141 in the Calories column only. The mode here is 300.0.

Wrong Data
• "Wrong data" does not have to be "empty cells" or "wrong format", it
can just be wrong, like if someone registered "199" instead of "1.99".

• Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.

• If you take a look at our data set, you can see that in row 7, the
duration is 450, but for all the other rows the duration is between 30
and 60.

One way to fix wrong values is to replace them with something else.

In our example, it is most likely a typo, and the value should be "45" instead of "450", so we could just insert "45" in row 7.
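
A single cell can be overwritten with `loc`, which addresses a value by row label and column name. The DataFrame below is a hypothetical stand-in for the lecture's data set, with the suspicious value placed in row 7.

```python
import pandas as pd

# Hypothetical Duration column; row 7 holds the suspected typo (450)
df = pd.DataFrame({"Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60]})

# Overwrite the value in row 7 of the Duration column
df.loc[7, "Duration"] = 45

print(df.loc[7, "Duration"])
```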


For Larger Data
• For small data sets you might be able to replace the wrong data one
by one, but not for big data sets.

• To replace wrong data for larger data sets you can create some rules,
e.g. set some boundaries for legal values, and replace any values that
are outside of the boundaries.
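
One way to express such a rule is a Boolean condition combined with `loc`. In this sketch the upper boundary of 120 and the sample data are assumptions chosen for illustration; a real limit should come from knowledge of the data.

```python
import pandas as pd

# Hypothetical Duration column with one value far outside the usual range
df = pd.DataFrame({"Duration": [60, 45, 450, 30]})

upper_limit = 120  # assumed boundary for legal values

# Select every row whose Duration exceeds the limit and set it to the limit
df.loc[df["Duration"] > upper_limit, "Duration"] = upper_limit

print(df)
```

The condition is evaluated for all rows at once, so no explicit loop over the rows is needed.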

Removing Rows
• Another way of handling wrong data is to remove the rows that contain wrong data.

• This way you do not have to find out what to replace them with, and there is a good chance you do not need them for your analysis.

• The row at position 7 is removed.
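
Removing the offending rows can be sketched as a filter that keeps only rows within the legal range. As before, the limit of 120 and the data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical Duration column with one value far outside the usual range
df = pd.DataFrame({"Duration": [60, 45, 450, 30]})

upper_limit = 120  # assumed boundary for legal values

# Keep only the rows whose Duration is within the limit
cleaned = df[df["Duration"] <= upper_limit]

print(cleaned)  # row 2 (Duration 450) is gone; the other row labels are unchanged
```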

Duplicate Data
• Duplicate rows are rows that have been registered more than once.

• By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.

• To discover duplicates, we can use the duplicated() method.

• The duplicated() method returns a Boolean value for each row.
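
A minimal sketch of `duplicated()` on a hypothetical DataFrame in which row 2 repeats row 1:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 1
df = pd.DataFrame({
    "Duration": [60, 45, 45, 30],
    "Calories": [409.1, 282.4, 282.4, 195.0],
})

# One Boolean per row: True if the row repeats an earlier row
flags = df.duplicated()

print(flags)
```

By default the first occurrence is not flagged, only the later copies are.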

The duplicated() method returns True for every row that is a duplicate, otherwise False.

Removing Duplicates
• To remove duplicates, use the drop_duplicates() method.

The duplicate row (row 12) is now removed.
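
A short sketch of `drop_duplicates()` on a hypothetical DataFrame in which row 2 repeats row 1. Like `dropna()`, it returns a new DataFrame by default.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 1
df = pd.DataFrame({
    "Duration": [60, 45, 45, 30],
    "Calories": [409.1, 282.4, 282.4, 195.0],
})

# Keeps the first occurrence of each row and drops the later copies
deduped = df.drop_duplicates()

print(deduped)  # rows 0, 1 and 3 remain
```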


Summary
• Methods of Sourcing Data

• Methods of Cleaning Data

Task
• Apply various methods used for sourcing data, using suitable arrays, data sets, etc. (BT-Level3)

• Design a model that cleans empty cells, data in the wrong format, wrong data, and duplicates. (BT-Level6)


REFERENCES
• https://www.javatpoint.com/machine-learning

• https://www.tutorialspoint.com/machine_learning/index.htm

• https://www.w3schools.com/python/

