Copy of ML_preprocessing_introduction.pptx

The document outlines the ML project life cycle and emphasizes the importance of data preprocessing, which involves cleaning and transforming raw data into a usable format. Key steps in data preprocessing include handling missing values, encoding categorical data, and normalizing datasets to enhance model efficiency and accuracy. It also discusses various data types and techniques such as one-hot encoding and feature scaling to prepare data for analysis.

Uploaded by

a0253j
Copyright
© All Rights Reserved

ML Project Life Cycle & Preprocessing Techniques

Basics of ML Project Life Cycle:

https://www.analyticsvidhya.com/blog/2020/09/10-things-know-before-first-data-science-project/

What is preprocessing?

Data preprocessing is the process of making data suitable for use.

The dataset initially provided for training might not be in a ready-to-use state; for example, it might not be formatted properly, or it may contain missing or null values.

Solving these problems using various methods is called data preprocessing.

A properly processed dataset increases the efficiency and accuracy of the models.
Steps in Data Preprocessing:

● Importing the libraries
● Importing the dataset
● Taking care of missing data
● Encoding categorical data
● Normalizing the data
● Splitting the data into test and train
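
The steps listed above can be sketched end to end on a small in-memory table. This is only an illustrative sketch: the column names (age, city, purchased) and the values are hypothetical, the data is built inline instead of being imported from a file, and the split uses plain pandas sampling (scikit-learn's train_test_split is a common alternative).

```python
# Importing the libraries
import numpy as np
import pandas as pd

# Importing the dataset (hypothetical toy data standing in for a CSV file)
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 50, 37, 45, 23, 34],
    "city": ["Pune", "Mumbai", "Pune", "Delhi", "Delhi",
             "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "purchased": [0, 1, 0, 1, 0, 1, 1, 1, 0, 0],
})

# Taking care of missing data: fill the missing age with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Encoding categorical data: one 0/1 column per city
df = pd.get_dummies(df, columns=["city"], dtype=int)

# Normalizing the data: standardize age to mean 0 and standard deviation 1
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)

# Splitting the data into train and test (80% / 20%)
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

print(train.shape, test.shape)
```

The order follows the list above. In practice, the scaling statistics are often computed on the training split only, so that nothing from the test rows influences the transformation.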
As per the World Economic Forum, by 2025 we will be generating about 463 exabytes of data globally per day!
Whenever data is gathered from different sources, it is collected in raw format, which is not suitable for analysis.

Data Preprocessing
Data preprocessing is a technique used to convert raw data into a clean data set.

Preprocessing refers to the transformations applied to our data before feeding it to the algorithm.

A dataset consists of data objects, each described by a number of features that capture its basic characteristics.

Which preprocessing is required depends on the nature of the data.


Data wrangling: This process includes identifying and removing inaccurate and irrelevant data, dealing with missing data, removing duplicate data, etc. It eliminates the major inconsistencies and makes the data easier to work with.
Data cleaning:
Data cleaning aims at filling in missing values, smoothing out noise while identifying outliers, and correcting inconsistencies in the data.

Handling missing values: Example: customer sales data.

Think of the possible values for every field, especially the customer's income or address.

Methods to deal with missing values:

• Ignore the tuple: This method is typically used when the class label is missing. It is useful when only a small number of attributes have missing values.

• Fill in the missing value manually: Entering values by hand is time consuming.

• Fill in the missing value with a global constant: Replace every missing value of the attribute with the same constant, such as the label "unknown".

• Fill in the missing value with the most probable value: This uses regression, a decision tree, or a Bayesian method to predict the missing value.

• Fill in the missing value with the attribute mean: This uses the average of the attribute's values to fill in the missing value.

Prepared by Smita Bhanap
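
Three of the methods above (ignoring the tuple, a global constant, and the attribute mean) can be sketched with pandas as follows. The customer table, its column names, and its values are hypothetical and only illustrate the idea; the "most probable value" method needs a fitted model and is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical customer sales data with missing income and address values
sales = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "address": ["Pune", None, "Mumbai", "Delhi"],
})

# Ignore the tuple: drop every row that has at least one missing value
dropped = sales.dropna()

# Global constant: replace missing addresses with the label "unknown"
filled_constant = sales.fillna({"address": "unknown"})

# Attribute mean: replace missing incomes with the mean of the known incomes
mean_income = sales["income"].mean()  # NaN values are skipped by default
filled_mean = sales.assign(income=sales["income"].fillna(mean_income))

print(dropped)
print(filled_constant)
print(filled_mean)
```

Dropping rows is the simplest option but discards data, while the fill strategies keep every row at the cost of introducing values that were never observed.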
Steps in Cleaning
Data of Wrong Format
● Cells with data of the wrong format can make it difficult, or even impossible, to analyze data.

● To fix this, you have two options: remove the rows, or convert all cells in the column into the same format.

Convert Into a Correct Format

● For example, convert all cells in the 'Date' column into dates.

● Pandas has a to_datetime() function for this.
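
A minimal sketch of that conversion, assuming a hypothetical table whose 'Date' column holds strings in year/month/day form plus one unusable entry. With errors="coerce", values that cannot be parsed become NaT (the datetime equivalent of NaN) and can then be dropped.

```python
import pandas as pd

# Hypothetical data: the last 'Date' cell is not a valid date
df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", "not a date"],
    "Duration": [60, 45, 30],
})

# Convert the column to datetimes; unparseable cells become NaT
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")

# Remove the rows whose date could not be converted
df = df.dropna(subset=["Date"])

print(df)
print(df.dtypes)
```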


import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data)

Import the Pandas library.

Name the data frame health_data.
header=0 means that the headers for the variable names are found in the first row (note that row numbering starts at 0 in Python).
sep="," means that "," is used as the separator between the values. This is because we are using the file type .csv (comma-separated values).
• There are some blank fields
• An average pulse of 9 000 is not possible
• 9 000 will be treated as non-numeric, because of the space separator
• One observation of max pulse is denoted as "AF", which does not make sense

Remove blank data

health_data.dropna(axis=0,inplace=True)

print(health_data)
Use the dropna() function to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value.
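
The effect of dropna() can be seen on a small table. This is an illustrative sketch: the column names other than Average_Pulse and all values are hypothetical. It uses the non-inplace form, which returns a new data frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical health data with two incomplete rows
health = pd.DataFrame({
    "Duration": [30, 45, np.nan, 60],
    "Average_Pulse": [80, np.nan, 100, 90],
    "Calorie_Burnage": [240, 282, 300, 320],
})

# axis=0 drops rows (not columns) that contain at least one NaN
cleaned = health.dropna(axis=0)

print(len(health), "rows before,", len(cleaned), "rows after")
print(cleaned)
```

With inplace=True, as in the slide, the call modifies health_data directly and returns None, so its result should not be assigned to a variable.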
Data Categories
To analyze data, we also need to know the types of data we are dealing with.

Data can be split into three main categories:

Numerical - Contains numerical values. Can be divided into two categories:

Discrete: Numbers are counted as "whole". Example: You cannot have trained 2.5 sessions; it is either 2 or 3.
Continuous: Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes and 20 seconds, or 7.533 hours.

Categorical - Contains values that cannot be measured up against each other. Example: a color or a type of training.

Ordinal - Contains categorical data that can be measured up against each other. Example: school grades, where A is better than B and so on.
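
The difference between categorical and ordinal data can be sketched with pandas' categorical type. The grades and colors below are hypothetical values; the point is only that an ordered category supports comparisons while an unordered one does not define them.

```python
import pandas as pd

# Ordinal: school grades with an explicit order, worst to best (C < B < A)
grades = pd.Series(
    pd.Categorical(["B", "A", "C", "A"], categories=["C", "B", "A"], ordered=True)
)
best = grades.max()                      # highest grade in the order
better_than_c = (grades > "C").tolist()  # element-wise comparison

# Categorical (nominal): colors have no meaningful order
colors = pd.Series(["red", "blue", "red"], dtype="category")

print(best, better_than_c)
print(colors.cat.ordered)
```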

Data Types
We can use the info() function to list the data types within our data set:

health_data.info()
We cannot use values of type object to calculate and perform analysis here. We must convert the type object to float64 (float64 is a 64-bit floating-point number, that is, a number with a decimal part).

We can use the astype() function to convert the data into float64.

Convert "Average_Pulse" and "Max_Pulse" into data type float64 (the other variables are already of data type float64):

health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
health_data.info()  # info() prints its report itself and returns None
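
astype(float) raises an error if a cell still holds text such as "AF", so the conversion only works once such values have been dealt with. The sketch below uses hypothetical values to show one possible way to handle this: strip the space from "9 000" and let pd.to_numeric() turn anything unparseable into NaN. This is an alternative illustration, not the approach used in the slides.

```python
import pandas as pd

# Hypothetical raw columns read as text (dtype object)
raw = pd.DataFrame({
    "Average_Pulse": ["80", "9 000", "85"],
    "Max_Pulse": ["120", "AF", "130"],
})

# astype(float) fails on non-numeric text
try:
    raw["Max_Pulse"].astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

cleaned = raw.copy()
for col in ["Average_Pulse", "Max_Pulse"]:
    # Remove the space used as a thousands separator, e.g. "9 000" -> "9000"
    no_spaces = cleaned[col].str.replace(" ", "", regex=False)
    # Unparseable text such as "AF" becomes NaN instead of raising an error
    cleaned[col] = pd.to_numeric(no_spaces, errors="coerce").astype(float)

print(astype_failed)
print(cleaned)
print(cleaned.dtypes)
```

Note that 9000 is now numeric but is still an implausible pulse value, so it would need to be corrected or removed in a separate step.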
Analyze the data using describe():
print(health_data.describe())
One Hot Encoding
One hot encoding converts a categorical column into numeric values.

One way to do this is to have a column representing each group in the category.
For each column, the values will be 1 or 0, where 1 represents the inclusion of the group and 0 represents the exclusion. This transformation is called one hot encoding.

The Python Pandas module has a function called get_dummies() which does one hot encoding.

import pandas as pd

cars = pd.read_csv('data.csv')
ohe_cars = pd.get_dummies(cars[['Car']])

print(ohe_cars.to_string())
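
To make the resulting columns concrete, here is a small sketch on an inline table instead of data.csv. The car names and volumes are hypothetical; dtype=int is passed so that the output contains 1 and 0 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the contents of data.csv
cars = pd.DataFrame({
    "Car": ["Toyota", "BMW", "Toyota", "Ford"],
    "Volume": [1000, 1600, 1200, 1500],
})

# One column per distinct value of 'Car', filled with 1 (included) or 0 (excluded)
ohe_cars = pd.get_dummies(cars[["Car"]], dtype=int)

print(ohe_cars.to_string())
```

Each row contains exactly one 1, in the column that matches the car in that row.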
Feature Scaling
When your data has different ranges of values, and even different measurement units, it can be difficult to compare them. How do kilograms compare to meters? Or altitude to time?
The answer to this problem is scaling. We can scale data into new values that are easier to compare.

There are different methods for scaling data; here we will use a method called standardization.
The standardization method uses this formula:
z = (x - u) / s
where z is the new value, x is the original value, u is the mean, and s is the standard deviation.

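import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset (data.csv is expected to have 'Weight' and 'Volume' columns)
df = pd.read_csv("data.csv")

# Select the numeric features to scale
X = df[['Weight', 'Volume']]

# Fit the scaler on X and apply z = (x - u) / s to each column
scale = StandardScaler()
scaledX = scale.fit_transform(X)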
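
The formula z = (x - u) / s can also be applied by hand, which makes it easier to see what the scaler computes. The weights and volumes below are hypothetical. The standard deviation is taken with ddof=0 (the population form), which is what StandardScaler uses.

```python
import pandas as pd

# Hypothetical car data in very different units and ranges
X = pd.DataFrame({
    "Weight": [790, 1160, 929, 865],
    "Volume": [1000, 1200, 1000, 900],
})

u = X.mean()       # per-column mean
s = X.std(ddof=0)  # per-column population standard deviation
z = (X - u) / s    # standardized values

print(z.round(3))
print(z.mean().round(6), z.std(ddof=0).round(6))
```

After standardization each column has mean 0 and standard deviation 1, so Weight and Volume can be compared on the same scale.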
https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/

https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179
