Preprocessing techniques
Basics of ML Project Life Cycle:
https://www.analyticsvidhya.com/blog/2020/09/10-things-know-before-first-data-science-project/
What is preprocessing?
The dataset initially provided for training might not be in a ready-to-use state; for example, it might not be formatted
properly, or it may contain missing or null values.
Solving all of these problems using various methods is called Data Preprocessing.
A properly processed dataset increases the efficiency and accuracy of the models.
Steps in Data Preprocessing:
Data Preprocessing
Data Preprocessing is a technique that is used to convert the raw data into a clean data set.
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
A dataset consists of data objects, each described by a number of features that capture the basic characteristics of the object.
• Fill the missing value manually: entering values manually is time consuming.
• Fill the missing value with a global constant: replace every missing value of the attribute with the same constant, such as the label "unknown".
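A minimal sketch of the global-constant strategy using pandas fillna() (the data below is made up purely for illustration):

import pandas as pd

# Illustrative data only: a tiny frame with missing values
df = pd.DataFrame({"Age": [25, None, 31], "City": ["Pune", "Delhi", None]})

# Fill every missing value of the attribute with the same constant label
df["City"] = df["City"].fillna("unknown")
print(df)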
● To fix rows with missing or wrongly formatted values, you have two options: remove the rows, or convert all cells in the affected columns into the same format.
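The snippets below operate on a DataFrame called health_data; a minimal, assumed load step (the file name is hypothetical, as the slides do not show it) would be:

import pandas as pd

# Assumed file name; the slides do not show the load step
health_data = pd.read_csv("health_data.csv", header=0, sep=",")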
Use the dropna() function to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value:

print(health_data)
health_data.dropna(axis=0, inplace=True)
print(health_data)
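Note that with inplace=True the DataFrame is modified in place; without it, dropna() returns a new DataFrame that you would assign back, e.g. health_data = health_data.dropna(axis=0).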
Data Categories
To analyze data, we also need to know the types of data we are dealing with.
Data Types
We can use the info() function to list the data types within our data set:
print(health_data.info())
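info() lists each column's name, its non-null count, and its data type; columns that were read in as text appear with the dtype object.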
We cannot perform calculations or analysis on columns of type object. We must convert the type
object to float64 (float64 is a 64-bit floating-point number, i.e. a number with a decimal point).
We can use the astype() function to convert the data into float64.
Convert "Average_Pulse" and "Max_Pulse" into data type float64 (the other variables are
already of data type float64):
health_data["Average_Pulse"] = health_data['Average_Pulse'].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
print (health_data.info())
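If a cell in these columns contains text that cannot be interpreted as a number, astype(float) will raise an error, so such cells must be cleaned or removed first (see the options above).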
Analyze the data using the describe() function, which summarizes each numeric column (count, mean, standard deviation, min, quartiles, and max):
print(health_data.describe())
One Hot Encoding
Converting a categorical column into corresponding numeric values.
One way to do this is to have a column representing each group in the category.
For each column, the values will be 1 or 0, where 1 represents the inclusion of the group and 0 represents the exclusion. This transformation is called one hot encoding.
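For example, a Car column with the values Toyota, BMW and Ford becomes three columns, one per brand; each row has a 1 in exactly one of them and 0 in the others.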
The Python Pandas module has a function called get_dummies() which does one hot encoding.
import pandas as pd
cars = pd.read_csv('data.csv')
ohe_cars = pd.get_dummies(cars[['Car']])
print(ohe_cars.to_string())
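A minimal sketch (assuming data.csv has a Car column alongside the other columns) of how the dummy columns might be joined back onto the original frame:

import pandas as pd

cars = pd.read_csv('data.csv')
ohe_cars = pd.get_dummies(cars[['Car']])

# Replace the text column with its one-hot columns
cars_encoded = pd.concat([cars.drop(columns=['Car']), ohe_cars], axis=1)
print(cars_encoded.head())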
Feature Scaling
When your data has different values, and even different measurement units, it can
be difficult to compare them. What is kilograms compared to meters? Or altitude
compared to time?
The answer to this problem is scaling. We can scale data into new values that are
easier to compare.
There are different methods for scaling data; in this tutorial we will use a method
called standardization.
The standardization method uses this formula:
z = (x - u) / s
Where z is the new value, x is the original value, u is the mean and s is
the standard deviation.
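For example (illustrative numbers only): if a column has mean u = 1000 and standard deviation s = 200, an original value x = 790 is scaled to z = (790 - 1000) / 200 = -1.05.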
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
scaledX = scale.fit_transform(X)
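The scaled values are typically fed into a model. A hedged continuation of the snippet above, assuming data.csv also contains a 'CO2' target column (the column name is an assumption, and it explains why linear_model is imported):

import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
df = pandas.read_csv("data.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']                       # assumed target column name

scaledX = scale.fit_transform(X)    # standardize each column: z = (x - u) / s

regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

# New observations must be transformed with the same fitted scaler before predicting
scaled_new = scale.transform([[2300, 1.3]])
print(regr.predict(scaled_new))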
https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179