Unit 2 Data Preprocessing

Data preprocessing is essential for preparing raw data for machine learning, addressing issues like noise and missing values to enhance model accuracy. Key steps include importing datasets, handling missing data, encoding categorical data, and splitting datasets into training and test sets. Data transformation further improves data quality and facilitates integration, although it can be time-consuming and complex.


Unit 2 Data Pre-processing

Data Preprocessing
• It is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and most crucial step when creating a machine learning model.
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
• Data preprocessing comprises the tasks needed to clean the data and make it suitable for a machine learning model, which also increases the model's accuracy and efficiency.

Steps in data preprocessing:
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding missing data
• Encoding categorical data
• Splitting the dataset into training and test sets

Getting the dataset
• Data is a valuable asset to any company today.
• The first thing we need is a dataset, as an ML model works entirely on data.
• The data collected for a particular problem, in a proper format, is known as the dataset.
• Datasets may come in different formats for different purposes, e.g. Excel (xlsx), CSV, HTML, images, etc.

Importing libraries
• We need to import some predefined Python libraries; each library is used to perform specific jobs.
• import pandas as pd — for importing and managing datasets; an open-source data manipulation and analysis library.
• import numpy as np — for any type of mathematical operation in the code; the fundamental package for scientific computation in Python. It also supports large, multidimensional arrays and matrices.
• import matplotlib.pyplot as plt — for data visualization; matplotlib is a Python 2D plotting library, and we import its pyplot sub-library to plot any type of chart in Python.
• import seaborn as sns — for advanced data visualization.
• from sklearn.impute import SimpleImputer and from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler — scikit-learn provides preprocessing, cross-validation, and model-selection tools through a unified interface, and supports various machine-learning tasks such as classification, regression, clustering, and many more. (Note: SimpleImputer lives in sklearn.impute, not sklearn.preprocessing.)

Importing datasets
• The next key step is to load the data that will be utilized in the machine learning algorithm.
• Many companies start by storing data in warehouses that require data to pass through an ETL (extract, transform, load) pipeline.
• The problem with this method is that you never know in advance which data will be useful for an ML project.
• As a result, warehouses are commonly accessed through business intelligence interfaces in order to observe metrics that we already know we need to monitor.
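A minimal runnable sketch of the loading-and-inspecting step, using a small inline DataFrame instead of a real file (the column names and values here are made up for illustration; in practice the data would come from pd.read_csv or pd.read_excel):

```python
import numpy as np
import pandas as pd

# Tiny stand-in dataset; in practice: df = pd.read_excel("path/to/file.xlsx")
df = pd.DataFrame({
    "University": ["A", "B", "A", "C"],
    "Program": ["CS", "EE", "CS", "ME"],
    "Salary": [50000, np.nan, 62000, 58000],
})

print(df.head())   # first rows of the dataset
print(df.shape)    # (rows, columns) -> (4, 3)
df.info()          # column types and non-null counts
```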
Cont..
• # Load the dataset
  df = pd.read_excel('E:/Data/Salary.xlsx')
• # Inspect it
  print(df.head())  # or just df.head()
  df.info()
  df.describe()
• # If we want to delete columns from the dataset
  df = df.drop(["Unnamed: 5", "Unnamed: 6", "Unnamed: 7", "Unnamed: 8", "Unnamed: 9", "Unnamed: 10"], axis=1)

Finding Missing Data
• The next step of data preprocessing is to handle missing data in the dataset. If our dataset contains missing data, it may create a huge problem for our machine learning model.
• # Checking the null values in the dataset
  df.isna()
  df.isna().sum()
• To delete the rows containing null values: df.dropna()
• # Handling missing data (replacing missing data with the mean value)
  from sklearn.impute import SimpleImputer
  imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
  (The old sklearn.preprocessing.Imputer class has been removed from scikit-learn; SimpleImputer is its replacement, and it has no axis argument — it always imputes column-wise.)
• # Fitting the imputer object to the independent variables x
  imputer = imputer.fit(x[:, 1:3])
• # Replacing missing data with the calculated mean value
  x[:, 1:3] = imputer.transform(x[:, 1:3])
• Afterwards, check the missing values again.
• Check the dataset / actual dataset size.
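Put together as a self-contained sketch, with a made-up feature matrix standing in for the slides' x:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with missing values (stand-in for the slides' x)
x = np.array([[1.0, 20.0, 50000.0],
              [2.0, np.nan, 62000.0],
              [3.0, 30.0, np.nan]])

# Replace NaNs in columns 1..2 with each column's mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

print(x)  # NaN at [1,1] becomes 25.0 (mean of 20, 30); [2,2] becomes 56000.0
```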

Noisy data
• Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. Smoothing techniques:
• Binning method: works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately: all values in a segment can be replaced by the segment's mean, or the boundary values can be used.
• Regression: data can be smoothed by fitting it to a regression function. The regression may be linear (one independent variable) or multiple (several independent variables).
• Clustering: groups similar data into clusters. Outliers either go undetected or fall outside the clusters.

Encoding Categorical Data
• Categorical data is data that takes a limited set of categories; in our dataset there are two categorical variables, University Name and Program. Other examples are gender and colour. Encoding converts these strings into integers.
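A sketch of the two encoders named earlier in the deck, on a made-up Program column (the real University Name / Program columns are assumed to look similar):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

programs = np.array(["CS", "EE", "CS", "ME"])

# LabelEncoder: string categories -> integer codes, in sorted order
# (CS=0, EE=1, ME=2)
le = LabelEncoder()
codes = le.fit_transform(programs)
print(codes)  # [0 1 0 2]

# OneHotEncoder: one binary column per category, which avoids implying
# an ordering between categories; .toarray() densifies the sparse result
ohe = OneHotEncoder()
onehot = ohe.fit_transform(programs.reshape(-1, 1)).toarray()
print(onehot)  # 4 rows x 3 columns of 0/1
```

LabelEncoder is convenient for target labels; for input features, one-hot encoding is usually preferred so the model does not treat the integer codes as ordered.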
Splitting dataset into training and test set
• We divide our dataset into a training set and a test set.
• This lets us evaluate, and thus improve, the performance of our machine learning model.
• Training set: the subset of the dataset used to train the machine learning model; the outputs are already known.
• Test set: the subset of the dataset used to test the machine learning model; the model predicts the outputs for the test set.
• from sklearn.model_selection import train_test_split
• x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Data Integration
• The process of combining data from multiple sources into a single, unified view.
• This can involve cleaning and transforming the data, as well as resolving any inconsistencies or conflicts that may exist between the different sources.
• The goal of data integration is to make the data more useful and meaningful for the purposes of analysis and decision making.
• Techniques used in data integration include data warehousing, ETL (extract, transform, load) processes, and data federation.
• Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data.
• Formally: G = global schema, S = heterogeneous source schemas, M = mapping between the queries of the source and global schemas.
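The split shown above can be run end-to-end on a small made-up x and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% training, 20% test; random_state fixes the shuffle so the
# split is reproducible across runs
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```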

Data Transformation
• Data transformation in data mining refers to the process of converting raw data into a format that is suitable for analysis and modeling.
• The goal of data transformation is to prepare the data for data mining so that it can be used to extract useful insights and knowledge.
• Common transformation techniques:
  • Data cleaning: removing or correcting errors, inconsistencies, and missing values in the data.
  • Data integration: combining data from multiple sources, such as databases and spreadsheets, into a single format.
  • Data normalization: scaling the data to a common range of values, such as between 0 and 1, to facilitate comparison and analysis.
  • Data reduction: reducing the dimensionality of the data by selecting a subset of relevant features or attributes.
  • Data discretization: converting continuous data into discrete categories or bins.
  • Data aggregation: combining data at different levels of granularity, such as by summing or averaging, to create new features or attributes.
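A sketch of the normalization step on a made-up salary column: MinMaxScaler rescales to the 0–1 range described above, while StandardScaler (imported earlier in the deck) standardizes to mean 0 and unit variance instead.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single-feature column
x = np.array([[50000.0], [62000.0], [58000.0]])

# Min-max normalization: smallest value -> 0, largest -> 1
mm = MinMaxScaler()
x_01 = mm.fit_transform(x)
print(x_01.ravel())

# Standardization: subtract the mean, divide by the standard deviation
ss = StandardScaler()
x_std = ss.fit_transform(x)
print(x_std.mean())  # approximately 0
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers; standardization is the usual choice when features must be on comparable scales for distance-based or gradient-based models.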

Advantages of Data Transformation
• Improves data quality: removes errors, inconsistencies, and missing values from the data.
• Facilitates data integration: enables the integration of data from multiple sources, which can improve the accuracy and completeness of the data.
• Improves data analysis: prepares the data for analysis and modeling by normalizing, reducing dimensionality, and discretizing the data.
• Increases data security: can be used to mask sensitive data, or to remove sensitive information from the data.
• Enhances data mining algorithm performance: reducing the dimensionality of the data and scaling it to a common range of values can improve algorithm performance.

Disadvantages of Data Transformation
• Time-consuming: especially when dealing with large datasets.
• Complexity: requires specialized skills and knowledge to implement and to interpret the results.
• Data loss: e.g. when discretizing continuous data, or when removing attributes or features from the data.
• Biased transformation: if the data is not properly understood or used.
• High cost: can be an expensive process, requiring significant investments in hardware, software, and personnel.
