
Lecture 2.2

Uploaded by

sahillodha1903

Apex Institute of Technology

Department of Computer Science & Engineering


Bachelor of Engineering (Computer Science &
Engineering)
Python for Machine Learning – (20CST255)
Prepared By: Dr. Dinesh Vij

Lecture - 7
DISCOVER . LEARN . EMPOWER
Introduction to Pandas - II
Python for Machine Learning
Course Objective:
The course aims to:

CO Number | Title | Will be covered in this lecture
CO1 | Make students understand the structure, semantics and syntax of the Python programming language. |
CO2 | Make students understand and apply various data handling and visualization techniques. |
CO3 | Enable students to develop and implement the first principles of data science. |

Course Outcome:
Upon successful completion of this course, students will be able to:

CO Number | Title | Will be covered in this lecture
CO1 | Understand the Python programming language by navigating software documentation. |
CO2 | Develop Python programs using NumPy and Pandas. |
CO3 | Visualize data models using Matplotlib and Seaborn. |
CO4 | Implement simple learning strategies using data science principles. |
CO5 | Optimize the evaluation results obtained after applying a machine learning model. |



Data Formatting and Cleaning

Apex Institute of Technology- CSE


Introduction
• ML depends heavily on data. It’s the most crucial aspect that
makes algorithm training possible and explains why machine
learning became so popular in recent years.
• But regardless of how many terabytes of information and how
much data science expertise you have, if you can’t make sense
of the data records, a machine will be nearly useless or perhaps
even harmful.
• The thing is, all datasets are flawed. That’s why data
preparation is such an important step in the machine learning
process.
• In a nutshell, data preparation is a set of procedures that
helps make your dataset more suitable for machine learning.
Data Formatting
• “Data formatting” sometimes refers to the file format
you’re using. Converting a dataset into the file format that
best fits your machine learning system is rarely much of a
problem.
• Here, we’re talking about the format consistency of the
records themselves.
• If you’re aggregating data from different sources, or your
dataset has been manually updated by different people, it’s
worth making sure that all values within a given attribute
are written consistently.
• These may be date formats, sums of money (4.03 or $4.03, or
even 4 dollars 3 cents), addresses, etc.
Data Formatting (contd.)
• The input format should be the same across the
entire dataset.
• Also, there are other aspects of data consistency. For
instance, if you have a set numeric range in an
attribute from 0.0 to 5.0, ensure that there are no
5.5s in your set.
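The consistency checks described above can be sketched in pandas. This is a minimal illustration, not a general-purpose parser: the column names and sample values are assumptions, and free-text amounts such as "4 dollars 3 cents" are not handled.

```python
import pandas as pd

# Hypothetical records collected from several sources with mixed formats.
df = pd.DataFrame({
    "price": ["4.03", "$4.03", " $12.50 ", "7"],
    "rating": [4.5, 5.5, 0.0, 3.2],
})

# Strip whitespace and the currency symbol, then convert to a numeric type
# so that every price is written the same way.
df["price"] = (
    df["price"]
    .str.strip()
    .str.replace("$", "", regex=False)
    .astype(float)
)

# Flag ratings that fall outside the agreed 0.0 to 5.0 range.
out_of_range = df[~df["rating"].between(0.0, 5.0)]

print(df["price"].tolist())
print(out_of_range)
```

Here the 5.5 rating is reported in out_of_range rather than silently changed, so you can decide how to correct it.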



Data Cleaning
• Machine learning is all about feeding data to algorithms and
training them to perform various compute-intensive tasks.
• However, businesses typically face challenges in feeding the
right data to machine learning algorithms and in cleaning out
irrelevant and error-prone data.
• In other words, when it comes to using data for ML, most of
the time is spent on cleaning datasets or creating a dataset
that is free of errors.
• Setting up a quality plan, filling in missing values, removing
rows, and reducing data size are some of the practices used for
data cleaning in machine learning.



Data Cleaning Techniques
• Your choice of data cleaning techniques depends on many
factors.
• First, what kind of data are you dealing with? Are the values
numeric or strings? Unless you have very few values to
handle, you shouldn’t expect to clean your data with just one
technique.
• You might need to use multiple techniques for a better result.
The more data types you have to handle, the more cleansing
techniques you’ll have to use.
• Being familiar with all of these methods will help you in
rectifying errors and getting rid of useless data.



Remove Irrelevant Values
• The first thing you should do is remove useless pieces of
data from your system.
• Useless or irrelevant data is data you don’t need because
it doesn’t fit the context of your problem.
• For example, if you only have to measure the average age
of your sales staff, their email addresses aren’t
required.
• Another example: if you’re checking how many customers
you contacted in a given month, you don’t need the data
of people you reached in a prior month.
Remove Irrelevant Values (contd.)
• However, before you remove a particular piece of
data, make sure that it is irrelevant because you
might need it to check its correlated values later on
(for checking the consistency).
• You wouldn’t want to delete some values and regret
the decision later on.
• But once you’re assured that the data is irrelevant,
get rid of it.
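Both examples (average age of sales staff, customers contacted in one month) can be sketched in pandas. The table contents and column names are hypothetical. The sketch drops data on a copy so the original stays available if it is needed later for a consistency check.

```python
import pandas as pd

# Hypothetical sales-staff table; only "age" matters for the average-age question.
staff = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],
    "email": ["asha@example.com", "ravi@example.com", "meera@example.com"],
    "age": [31, 45, 38],
})

# drop() returns a new DataFrame; the original "staff" keeps the email column.
staff_ages = staff.drop(columns=["email"])
average_age = staff_ages["age"].mean()

# Hypothetical contact log; keep only the rows from the month of interest.
contacts = pd.DataFrame({
    "customer": ["C1", "C2", "C3"],
    "contacted_on": pd.to_datetime(["2024-02-27", "2024-03-03", "2024-03-19"]),
})
in_march = (contacts["contacted_on"].dt.year == 2024) & (
    contacts["contacted_on"].dt.month == 3
)
march = contacts[in_march]

print(average_age)
print(march)
```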



Get Rid of Duplicate Values
• Duplicates are similar to useless values: you don’t need
them. They only increase the amount of data you have
and waste your time.
• You can get rid of them with simple searches. Duplicate
values could be present in your system for several
reasons.
• Maybe you combined data from multiple sources. Or
perhaps the person submitting the data repeated a value
by mistake, for instance by clicking ‘enter’ twice when
filling in an online form. You should remove the
duplicates as soon as you find them.
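A minimal pandas sketch of finding and removing exact duplicate rows, assuming a hypothetical table of form submissions:

```python
import pandas as pd

# Hypothetical form submissions; one user submitted the same row twice.
forms = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meera"],
    "city": ["Pune", "Delhi", "Delhi", "Mohali"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = forms.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = forms.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

Rows count as duplicates only when every column matches exactly. The subset parameter of drop_duplicates() restricts the comparison to chosen columns.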
Avoid Typos (and similar errors)
• Typos are a result of human error and can be present
anywhere.
• You can fix typos through multiple algorithms and
techniques. You can map the values and convert them into
the correct spelling.
• Fixing typos is essential because models treat different
values differently.
• String comparisons depend on exact spelling and case.
• ‘George’ is different from ‘george’ even though they have the
same spelling. Similarly, ‘Mike’ and ‘Mice’ are different from
each other, even though they have the same number of
characters.
Avoid Typos (and similar errors) (contd.)
• You’ll need to look for typos such as these and fix them
appropriately.
• Another error similar to typos concerns string length. You
might need to pad strings to keep them in the same format. For
example, your dataset might require 5-digit numbers only. If a
value has only four digits, such as ‘3994’, you can add a zero
at the beginning to bring it up to five digits.
• As ‘03994’, its numeric value remains the same, but your
data stays uniform. Another common string error is stray
whitespace. Make sure you remove it from your strings to
keep them consistent.
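The string fixes above (case, known misspellings, zero-padding, whitespace) can be sketched with pandas string methods. The sample values and the corrections mapping are assumptions. In practice the mapping has to come from knowledge of the data, because ‘Mice’ could be a valid value elsewhere.

```python
import pandas as pd

# Hypothetical records with inconsistent case, whitespace, a typo and short codes.
df = pd.DataFrame({
    "name": [" George", "george ", "Mice", "Mike"],
    "code": ["3994", "12345", "07", "99999"],
})

# Trim surrounding whitespace and normalise case so "George" and "george" match.
df["name"] = df["name"].str.strip().str.title()

# Map known misspellings to the correct value (this mapping is an assumption).
corrections = {"Mice": "Mike"}
df["name"] = df["name"].replace(corrections)

# Left-pad codes with zeros to a fixed width of five characters.
df["code"] = df["code"].str.zfill(5)

print(df)
```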
Fill in missing values
• In terms of machine learning, assumed or approximated
values are “more right” for an algorithm than just missing
ones.
• Even if you don’t know the exact value, methods exist to
better “assume” which value is missing or bypass the issue.
• How to clean data? Choosing the right approach depends
heavily on your data and your domain:
• Substitute missing values with dummy values, e.g., n/a for
categorical or 0 for numerical values
• Substitute the missing numerical values with mean figures
• For categorical values, you can also use the most frequent
items to fill in.
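The three substitution strategies listed above can be sketched with fillna(). The column names and values are hypothetical, and which strategy is appropriate depends on the data and the domain.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing categorical and one missing numerical value.
df = pd.DataFrame({
    "colour": ["red", None, "blue", "red"],
    "height": [150.0, np.nan, 170.0, 160.0],
})

# Option 1: dummy values, "n/a" for the categorical column and 0 for the numerical one.
dummy_filled = df.fillna({"colour": "n/a", "height": 0})

# Option 2: replace missing numerical values with the column mean.
mean_filled = df["height"].fillna(df["height"].mean())

# Option 3: replace missing categorical values with the most frequent item.
mode_filled = df["colour"].fillna(df["colour"].mode()[0])

print(dummy_filled)
print(mean_filled.tolist())
print(mode_filled.tolist())
```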
Removing rows with missing values
• One of the simplest things to do in data cleansing is to
remove or delete rows with missing values.
• This may not be the ideal step if your training data
contains a huge number of missing values, because too
many rows would be lost.
• If relatively few values are missing, then removing
the affected rows can be the right approach.
• You will have to be very sure that the rows you are
deleting do not include information that is absent from
the other rows of the training data.
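One way to apply this advice is to measure how many rows would be lost before dropping them. The sketch below uses hypothetical data, and the 25% threshold is an arbitrary assumption; what counts as "few" depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with one incomplete row.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 33, 51],
    "salary": [50000, 62000, 58000, 61000, 75000],
})

# Share of rows that contain at least one missing value.
missing_share = df.isna().any(axis=1).mean()

# Hypothetical threshold: drop rows only when few of them are affected.
MAX_MISSING_SHARE = 0.25
if missing_share <= MAX_MISSING_SHARE:
    cleaned = df.dropna()
else:
    cleaned = df  # too much would be lost; consider filling values instead

print(missing_share)
print(cleaned)
```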



Reducing data for proper data handling
• It is good to reduce the data you are handling.
• A downsized dataset can help you generate results that
are more accurate. There are different ways of reducing
data in your dataset.
• Whatever data records you have, sample them and
choose the relevant subset from that data. This method
of data handling is called Record Sampling.
• Apart from this method, you can also use Attribute
Sampling: select a subset of the most important
attributes from the dataset.
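Record sampling and attribute sampling can be sketched in pandas as follows. The dataset is hypothetical, and the choice of "price" and "years_used" as the important attributes is an assumption made purely for illustration.

```python
import pandas as pd

# Hypothetical dataset of ten records with four attributes.
df = pd.DataFrame({
    "car_id": range(1, 11),
    "colour": ["red", "blue"] * 5,
    "years_used": [1, 3, 5, 2, 8, 4, 6, 7, 2, 9],
    "price": [9500, 8700, 7200, 9100, 4300, 8000, 6500, 5200, 9000, 4000],
})

# Record sampling: keep a random half of the rows.
# random_state makes the sample reproducible.
record_sample = df.sample(frac=0.5, random_state=0)

# Attribute sampling: keep only the columns judged most important.
attribute_sample = df[["years_used", "price"]]

print(record_sample)
print(attribute_sample.head())
```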
Encoding categorical data
• Most machine learning and deep learning models require all input
and output variables to be numeric.
• This means that if your data contains categorical data, you must
encode it as numbers before you can fit and evaluate a model.
• The two most popular techniques are label encoding and one-hot
encoding.
• Label encoding: each category is assigned an integer, here
from 0 through N-1 (where N is the number of categories for the
feature).
• One major issue with this approach is that, even when there is no
relation or order between the classes, the algorithm might treat
the integers as ordered or related. For example, the encoding may
look like Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3).
Label encoding
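A minimal sketch of label encoding using the pandas categorical dtype. The "temperature" column is a hypothetical example. pandas sorts the categories alphabetically and numbers them from 0, which reproduces the Cold < Hot < Very Hot < Warm ordering and shows why the integers carry no real meaning.

```python
import pandas as pd

# Hypothetical feature with four unordered categories.
df = pd.DataFrame({"temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot"]})

# Converting to the category dtype sorts the labels alphabetically;
# cat.codes then assigns each label its position in that sorted list.
categories = df["temperature"].astype("category")
df["temperature_label"] = categories.cat.codes

# Lookup table from integer code back to the original label.
mapping = dict(enumerate(categories.cat.categories))

print(df)
print(mapping)
```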



Encoding categorical data (contd.)
One hot encoding:
• In this method, we map each category to a vector of 1s
and 0s denoting the presence or absence of that
category.
• The number of vectors depends on the number of
categories for the feature.
• This method produces a lot of columns, which slows down
learning significantly if the feature has a very large
number of categories.



One hot encoding
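A minimal sketch of one-hot encoding with pandas.get_dummies(), using a hypothetical "temperature" column. The "temp" prefix is an arbitrary choice.

```python
import pandas as pd

# Hypothetical feature with four categories.
df = pd.DataFrame({"temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot"]})

# One new 0/1 column per category; each row has exactly one 1.
one_hot = pd.get_dummies(df["temperature"], prefix="temp", dtype=int)

print(one_hot)
```

Four categories produce four columns, so a feature with thousands of distinct values would produce thousands of columns.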



Data Scaling
• Data scaling belongs to a group of data
normalization procedures that aim at improving the quality of a
dataset by bringing attributes onto comparable ranges, avoiding
the situation where some values outweigh others.
• For example, imagine that you run a chain of car dealerships and
most of the attributes in your dataset are either categorical,
depicting models and body styles (sedan, hatchback, van, etc.), or
1-2 digit numbers, for instance, years of use.
• But the prices are 4-5 digit numbers ($10000 or $8000) and you
want to predict the average time for the car to be sold based on
its characteristics (model, years of previous use, body style,
price, condition, etc.).



Data Scaling (contd.)
• While the price is an important criterion, you don’t want
it to outweigh the other ones.
• In this case, min-max normalization can be used.
• It entails transforming numerical values to a range, e.g.,
from 0.0 to 1.0, where 0.0 represents the minimum and 1.0
the maximum value, to even out the weight of the price
attribute with other attributes in the dataset.
• Besides min-max normalization, various other types of
normalization techniques can also be used.
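Min-max normalization maps each value x of a column to (x - min) / (max - min). The sketch below applies it to a hypothetical car dataset. It assumes every column has at least two distinct values, since a constant column would cause a division by zero.

```python
import pandas as pd

# Hypothetical car data: years of use are 1-2 digit numbers, prices 4-5 digits.
cars = pd.DataFrame({
    "years_used": [1, 5, 10, 3],
    "price": [10000, 8000, 4000, 9000],
})

# x_scaled = (x - min) / (max - min), computed column by column.
scaled = (cars - cars.min()) / (cars.max() - cars.min())

print(scaled)
```

After scaling, both columns lie between 0.0 and 1.0, so the price no longer dominates purely because of its magnitude.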



Suggested Readings
• https://www.einfochips.com/blog/data-cleaning-in-machine-learning-best-practices-and-methods/

• https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/

• https://www.upgrad.com/blog/data-cleaning-techniques/



THANK YOU

For queries
Email: [email protected]

