Lecture 2.2

Apex Institute of Technology

Department of Computer Science & Engineering

Bachelor of Engineering (Computer Science & Engineering)
Python for Machine Learning – (20CST255)
Prepared By: Dr. Dinesh Vij

Lecture - 7
DISCOVER . LEARN . EMPOWER
Introduction to Pandas - II
Python for Machine Learning
Course Objective:
The course aims to:

CO1: Make students understand the structure, semantics and syntax of the Python programming language.
CO2: Make students understand and apply various data handling and visualization techniques.
CO3: Enable students to develop and implement the first principles of data science.

Course Outcome:
Upon successful completion of this course, students will be able to:

CO1: Understand the Python programming language by navigating software documentation.
CO2: Develop Python programs using NumPy and Pandas.
CO3: Visualize data models using Matplotlib and Seaborn.
CO4: Implement simple learning strategies using data science principles.
CO5: Optimize the evaluation results obtained after applying a machine learning model.


Data Formatting and Cleaning

Apex Institute of Technology- CSE

Introduction
• ML depends heavily on data. It’s the most crucial aspect that
makes algorithm training possible and explains why machine
learning became so popular in recent years.
• But no matter how many terabytes of information or how much
data science expertise you have, if you can’t make sense of the
data records, a model trained on them will be nearly useless or
perhaps even harmful.
• The thing is, all datasets are flawed. That’s why data
preparation is such an important step in the machine learning
process.
• In a nutshell, data preparation is a set of procedures that
helps make your dataset more suitable for machine learning.
Data Formatting
• Data formatting sometimes refers to the file format you’re
using. Converting a dataset into the file format that best fits
your machine learning system is rarely a problem.
• Here, we’re talking about the format consistency of the records
themselves.
• If you’re aggregating data from different sources or your
dataset has been manually updated by different people, it’s
worth making sure that all values within a given attribute
are written consistently.
• These may be date formats, sums of money (4.03 or $4.03, or
even 4 dollars 3 cents), addresses, etc.
Data Formatting (contd.)
• The input format should be the same across the
entire dataset.
• Also, there are other aspects of data consistency. For
instance, if you have a set numeric range in an
attribute from 0.0 to 5.0, ensure that there are no
5.5s in your set.
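
The two consistency checks above can be sketched in pandas. This is a minimal illustration, not part of the original slides: the column names `price` and `rating` and the sample values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical records: the same amount written in different ways,
# and a rating attribute that should stay within 0.0 to 5.0.
df = pd.DataFrame({
    "price": ["4.03", "$4.03", " 4.03 "],
    "rating": [4.5, 5.5, 0.0],
})

# Bring every price to one format: strip spaces, drop the "$" sign,
# then convert the text to a float.
df["price"] = (
    df["price"]
    .str.strip()
    .str.replace("$", "", regex=False)
    .astype(float)
)

# Find rows whose rating falls outside the allowed range (bounds inclusive).
out_of_range = df[~df["rating"].between(0.0, 5.0)]
print(df)
print(out_of_range)
```

Textual amounts such as "4 dollars 3 cents" would need a separate parsing rule and are not handled by this sketch.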


Data Cleaning
• Machine learning is all about training and feeding data to
algorithms to perform various compute-intensive tasks.
• However, businesses typically face challenges in feeding the
right data to machine learning algorithms and in cleaning out
irrelevant and erroneous data.
• In other words, when it comes to using data for ML, most of
the time is spent on cleaning datasets or creating a dataset
that is free of errors.
• Setting up a quality plan, filling in missing values, removing
rows, and reducing data size are some of the practices used for
data cleaning in machine learning.


Data Cleaning Techniques
• Your choice of data cleaning techniques depends on many
factors.
• First, what kind of data are you dealing with? Are they
numeric values or strings? Unless you have very few values to
handle, you shouldn’t expect to clean your data with just one
technique.
• You might need to use multiple techniques for a better result.
The more data types you have to handle, the more cleansing
techniques you’ll have to use.
• Being familiar with all of these methods will help you in
rectifying errors and getting rid of useless data.
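
As a first step toward choosing techniques, it helps to check which columns are numeric and which are not. The following is a small sketch; the DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical dataset mixing strings and numbers.
df = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "age": [31, 45],
    "salary": [52000.0, 61000.5],
})

# Show the data type pandas inferred for each column.
print(df.dtypes)

# Split the columns into numeric and non-numeric groups so that
# each group can be cleaned with techniques suited to its type.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```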


Remove Irrelevant Values
• The first and foremost thing you should do is remove
useless pieces of data from your system.
• Irrelevant data is data you don’t need because it does not
fit the context of your problem.
• For example, if you only have to measure the average age of
your sales staff, their email addresses are not required.
• Another example is you might be checking to see how
many customers you contacted in a month. In this case,
you wouldn’t need the data of people you reached in a
prior month.
Remove Irrelevant Values (contd.)
• However, before you remove a particular piece of
data, make sure that it really is irrelevant, because you
might need it later to check correlated values
(for consistency checks).
• You wouldn’t want to delete some values and regret
the decision later on.
• But once you’re sure that the data is irrelevant,
get rid of it.
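
The two examples from this section (average age of sales staff, customers contacted in a given month) might look like this in pandas. All names and values here are hypothetical. The sketch builds new DataFrames rather than modifying the originals, so the dropped data is still available if it turns out to be needed.

```python
import pandas as pd

# Hypothetical sales staff table; email is irrelevant for the average age.
staff = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "age": [31, 45, 38],
    "email": ["asha@example.com", "ravi@example.com", "meena@example.com"],
})
staff_ages = staff.drop(columns=["email"])  # returns a copy; staff is unchanged
average_age = staff_ages["age"].mean()

# Hypothetical contact log; keep only the contacts made in March 2024.
contacts = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "contacted_on": pd.to_datetime(["2024-02-27", "2024-03-03", "2024-03-19"]),
})
in_march = (contacts["contacted_on"].dt.year == 2024) & (
    contacts["contacted_on"].dt.month == 3
)
march_contacts = contacts[in_march]
print(average_age, len(march_contacts))
```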


Get Rid of Duplicate Values
• Duplicates are similar to useless values: you don’t need
them. They only increase the amount of data you have
and waste your time.
• You can get rid of them with simple searches. Duplicate
values could be present in your system for several
reasons.
• Maybe you combined data from multiple sources. Or
perhaps the person submitting the data repeated a value
by mistake, or a user pressed ‘enter’ twice while
filling in an online form. You should remove the
duplicates as soon as you find them.
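
In pandas, exact duplicate rows can be found and removed in a couple of lines. The sketch below uses made-up data in which one form submission was recorded twice.

```python
import pandas as pd

# Hypothetical records where customer 102 was submitted twice.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "city": ["Pune", "Delhi", "Delhi", "Mumbai"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates)
print(deduped)
```

Both methods also accept a `subset` argument when only some columns should decide whether two rows count as duplicates.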
Avoid Typos (and similar errors)
• Typos are a result of human error and can be present
anywhere.
• You can fix typos through multiple algorithms and
techniques. You can map the values and convert them into
the correct spelling.
• Typos are essential to fix because models treat different
values differently.
• String comparisons are sensitive to spelling and case.
• ‘George’ is different from ‘george’ even though they have the
same spelling. Similarly, ‘Mike’ and ‘Mice’ are different from
each other even though they have the same number of
characters.
Avoid Typos (and similar errors) (contd.)
• You’ll need to look for typos such as these and fix them
appropriately.
• Another error similar to typos concerns string length. You might
need to pad strings to keep them in the same format. For
example, your dataset might require 5-digit
numbers only. So if you have a value with only four
digits, such as ‘3994’, you can add a zero at the beginning to
increase its number of digits.
• Its numeric value stays the same when written as ‘03994’, but
the data stays uniform.
• Another common problem with strings is stray whitespace.
Remove leading and trailing whitespace from your strings to
keep them consistent.
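
The fixes described in this section (consistent case, mapping misspellings to the correct value, zero-padding, and stripping whitespace) can be sketched with pandas string methods. The sample values and the `corrections` mapping are assumptions for illustration; in practice the mapping has to be built by inspecting the data.

```python
import pandas as pd

# Hypothetical names with stray spaces, mixed case, and one misspelling.
names = pd.Series([" George", "george ", "Gorge", "Mike"])

# Strip surrounding whitespace and use one capitalisation style.
cleaned = names.str.strip().str.title()

# Map known misspellings to the correct spelling.
corrections = {"Gorge": "George"}
cleaned = cleaned.replace(corrections)

# Hypothetical codes that must all be 5 digits long.
codes = pd.Series(["3994", "12345", " 807 "])
padded = codes.str.strip().str.zfill(5)  # pad on the left with zeros

print(cleaned.tolist())
print(padded.tolist())
```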
Fill in Missing Values
• In terms of machine learning, assumed or approximated
values are “more right” for an algorithm than just missing
ones.
• Even if you don’t know the exact value, methods exist to
better “assume” which value is missing or to bypass the issue.
• How do you clean such data? Choosing the right approach depends
heavily on the data and the domain you have:
• Substitute missing values with dummy values, e.g., n/a for
categorical or 0 for numerical values.
• Substitute missing numerical values with the mean.
• For categorical values, you can also fill in the most
frequent item.
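
The three substitution strategies above can be sketched with `fillna`. The DataFrame, its column names, and its values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing categorical and one missing numeric value.
df = pd.DataFrame({
    "colour": ["red", None, "blue", "red"],
    "score": [10.0, np.nan, 20.0, 30.0],
})

# 1. Dummy values: "n/a" for the categorical column, 0 for the numeric one.
dummy_filled = df.fillna({"colour": "n/a", "score": 0})

# 2. Mean of the observed numeric values (missing entries are skipped).
mean_filled = df["score"].fillna(df["score"].mean())

# 3. Most frequent category; mode() can return ties, so take the first.
mode_filled = df["colour"].fillna(df["colour"].mode()[0])

print(dummy_filled)
print(mean_filled.tolist(), mode_filled.tolist())
```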
Removing rows with missing values
• One of the simplest things to do in data cleansing is to
remove or delete rows with missing values.
• This may not be ideal if your training data contains a large
number of errors, because you would discard too much of it.
• If only a small share of values is missing, then removing
the affected rows can be the right approach.
• You will have to be very sure that the data you are
deleting does not include information that is not present in
the other rows of the training data.
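
Before deleting rows, it is worth measuring how much is missing. The sketch below uses made-up data: it reports the share of missing values per column and then drops every row that has at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with gaps.
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "city": ["Pune", "Delhi", None],
})

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Keep only the rows without any missing value.
complete_rows = df.dropna()

print(missing_share)
print(complete_rows)
```

In this tiny example two of the three rows are lost, which illustrates why row removal is only suitable when few values are missing.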


Reducing data for proper data handling
• It is often useful to reduce the data you are handling.
• A smaller, more focused dataset is easier to handle and can
help you generate more accurate results. There are different
ways of reducing the data in your dataset.
• Whatever data records you have, sample them and
choose the relevant subset from that data. This method
of data handling is called Record Sampling.
• Apart from this method, you can also use Attribute
Sampling. In attribute sampling, you
select a subset of the most important attributes from the
dataset.
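
Both kinds of sampling can be sketched in pandas. The data is hypothetical, and two choices here are assumptions rather than something the slides prescribe: records are picked at random, and the "important" attributes are simply named by hand.

```python
import pandas as pd

# Hypothetical dataset with 100 records and three attributes.
df = pd.DataFrame({
    "id": range(100),
    "age": [20 + i % 40 for i in range(100)],
    "internal_note": ["n/a"] * 100,
})

# Record sampling: keep a random subset of the rows.
# random_state makes the sample reproducible.
record_sample = df.sample(n=10, random_state=0)

# Attribute sampling: keep only the columns judged important.
attribute_sample = df[["id", "age"]]

print(record_sample.shape, attribute_sample.shape)
```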
Encoding categorical data
• Most machine learning and deep learning models require all input
and output variables to be numeric.
• This means that if your data contains categorical data, you must
encode it as numbers before you can fit and evaluate a model.
• The two most popular techniques are label encoding and one-hot
encoding.
• Label encoding: In this encoding, each category is assigned a value
from 0 through N-1 (where N is the number of categories for the
feature).
• One major issue with this approach is that there may be no relation
or order between these classes, but the algorithm might treat the
numbers as if they were ordered or related. In the example below,
the codes may suggest the order Cold < Hot < Very Hot < Warm
(0 < 1 < 2 < 3).
Label encoding
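
A minimal sketch of label encoding with pandas, using temperature categories like those mentioned in the text (the sample values are made up). Converting a column to the `category` dtype sorts the distinct values alphabetically and numbers them from 0, which is what produces the misleading order Cold < Hot < Very Hot < Warm.

```python
import pandas as pd

# Hypothetical temperature feature.
temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot"])

# Convert to a categorical column; categories are sorted alphabetically.
as_category = temps.astype("category")

# Integer code assigned to each row.
codes = as_category.cat.codes

# Mapping from code to category, for reference.
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(mapping)
```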


Encoding categorical data (contd.)
One-hot encoding:
• In this method, we map each category to a vector of
1s and 0s denoting the presence or absence of
the feature.
• The number of vectors depends on the number of
categories for the feature.
• This method produces a lot of columns, which slows down
learning significantly if the number of categories is
very high for the feature.


One hot encoding
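
A minimal sketch of one-hot encoding with `pandas.get_dummies`, again using made-up temperature categories. Each category becomes its own 0/1 column, so four categories produce four columns.

```python
import pandas as pd

# Hypothetical temperature feature.
temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot"])

# One column per category; dtype=int gives 0/1 instead of False/True.
one_hot = pd.get_dummies(temps, dtype=int)

print(one_hot)
```

Every row contains exactly one 1, in the column of its category.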


Data Scaling
• Data scaling belongs to a group of data
normalization procedures that aim at improving the quality of a
dataset by bringing values onto a comparable range, so that
some values do not outweigh others.
• For example, imagine that you run a chain of car dealerships and
most of the attributes in your dataset are either categorical,
depicting models and body styles (sedan, hatchback, van, etc.), or
1-2 digit numbers, for instance, years of use.
• But the prices are 4-5 digit numbers ($10000 or $8000) and you
want to predict the average time for a car to be sold based on
its characteristics (model, years of previous use, body style,
price, condition, etc.).


Data Scaling (contd.)
• While the price is an important criterion, you don’t want
it to outweigh the other ones.
• In this case, min-max normalization can be used.
• It entails transforming numerical values to a fixed range, e.g.,
from 0.0 to 1.0, where 0.0 represents the minimum and 1.0
the maximum value, to even out the weight of the price
attribute with the other attributes in the dataset.
• Besides min-max normalization, various other types of
normalization techniques can also be used.
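
Min-max normalization rescales each value x to (x - min) / (max - min), computed separately for each column. The sketch below applies it to a made-up car table; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical car listings: years of use are 1-2 digit numbers,
# prices are 4-5 digit numbers.
cars = pd.DataFrame({
    "years_used": [1, 5, 10],
    "price": [10000.0, 8000.0, 4000.0],
})

# Column-wise min-max normalization: the smallest value in each
# column maps to 0.0 and the largest to 1.0.
scaled = (cars - cars.min()) / (cars.max() - cars.min())

print(scaled)
```

This formula divides by zero for a column whose values are all equal, so such columns need to be handled separately.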


Suggested Readings
• https://www.einfochips.com/blog/data-cleaning-in-machine-learning-best-practices-and-methods/

• https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/

• https://www.upgrad.com/blog/data-cleaning-techniques/


THANK YOU

For queries
Email: [email protected]
