
Lecture 2

Data Pre-processing
How many of you use ChatGPT / other LLMs?
Well, don't ask everything to ChatGPT
• Some of our recent research work …

1. ChatGPT in the Classroom: An Analysis of Its Strengths and Weaknesses for Solving Undergraduate
Computer Science Questions Link

2. ‘It’s not like Jarvis, but it’s pretty close!’ - Examining ChatGPT’s Usage among Undergraduate Students in
Computer Science Link
Working with Real Data
• Popular Data Sets:
• OpenML.org
• Kaggle.com
• PapersWithCode.com
• UC Irvine Machine Learning Repository
• Amazon's AWS datasets
• TensorFlow datasets
• Meta portals (they list open data repositories):
• DataPortals.org
• OpenDataMonitor.eu
• Other pages listing many popular open data repositories:
• Wikipedia's list of machine learning datasets
• Quora.com
• The datasets subreddit
End to End ML Project - Major Steps
StatLib Data
• California Housing prices dataset (StatLib repo)
• Data based on the 1990 California census
Look at the Big Picture
• The task is to use California census data to build a model of housing prices in the state
• This data includes metrics such as the population, median income, and median housing price for each block
group in California
• Your model should learn from this data and be able to predict the median housing price in any district, given
all the other metrics
• FRAME THE PROBLEM:
- Is building a model the end goal? No: the model's output will be fed into another system to analyse investment decisions
- What does the current solution look like? It is manual, which makes it costly and time consuming
• Data Pipeline
• Check the Assumptions
Performance Measure
• Root Mean Square Error:
\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^{2}}

Performance Measure 2
• The hypothesis h is the model's prediction function, e.g. a linear model of the form

h(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n
Performance Measure

• Mean Absolute Error:
• Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of
predictions and the vector of target values. Various distance measures, or norms, are possible:
• Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: this is the notion of distance we are all familiar with. It is also called the ℓ2 norm, noted ∥ · ∥2 (or just ∥ · ∥).
• Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥ · ∥1. This is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks, i.e. on a grid.
• More generally, the ℓk norm of a vector v containing n elements is defined as ∥v∥k = (|v1|^k + |v2|^k + ... + |vn|^k)^(1/k).
• The higher the norm index, the more it focuses on large values and neglects small ones. This is why the
RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-
shaped curve), the RMSE performs very well and is generally preferred.
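For completeness, a LaTeX rendering of the MAE, using the same notation as the RMSE reconstruction above (m instances, predictions h(x⁽ⁱ⁾), targets y⁽ⁱ⁾):

\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(\mathbf{x}^{(i)}) - y^{(i)} \right|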
Get the Data

• Running via Google Colab
• Data and notebooks: https://homl.info/colab3
• There will be a tutorial after this lecture by TAs on how to use Google Colab / Jupyter Notebook / some Python libraries

• Loading the data
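A minimal loading sketch, assuming the housing CSV has already been downloaded and extracted to a local datasets/ folder (the path is an assumption; the Colab notebook at homl.info/colab3 fetches its own copy):

import pandas as pd

# Assumed local path to the extracted California housing CSV.
housing = pd.read_csv("datasets/housing/housing.csv")
print(housing.head())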
Explore the Data

>>> housing.info()

• 20,640 data instances


• 20,433 non-null values for total_bedrooms – 207 districts are missing this value.
Explore the Data

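A short sketch of the usual first-look commands (the ocean_proximity column name is assumed from the standard version of this dataset; matplotlib is only needed for the histograms):

import matplotlib.pyplot as plt

housing.info()                                      # column types and non-null counts
print(housing.describe())                           # summary statistics of numeric columns
print(housing["ocean_proximity"].value_counts())    # the one categorical attribute
housing.hist(bins=50, figsize=(12, 8))              # histogram per numeric attribute
plt.show()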
Create a Test Set


• Roughly a 20% split (or less if the dataset is very large)
Test Data
• How can you ensure, across various runs of the code, that your model
does not see the test data ?

Unique Identifier

• Have a look at train_test_split() in Scikit-Learn
• Common options: seed the random shuffle so the split is reproducible across runs, or decide each instance's train/test membership from a stable unique identifier (see the sketch below)
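Two sketches of the ideas above: a seeded random split with train_test_split(), and a hash-of-identifier split that keeps each instance's train/test membership stable even if the dataset grows. The helper name is illustrative, not the lecture's exact code.

from zlib import crc32
from sklearn.model_selection import train_test_split

# Seeded random split: roughly 20% held out, reproducible thanks to random_state.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

def is_in_test_set(identifier, test_ratio):
    # Hash the identifier; keep the instance in the test set when the hash
    # falls in the lowest test_ratio fraction of the 32-bit hash range.
    return crc32(str(identifier).encode()) / 2**32 < test_ratio

# Example: use the row index as the identifier.
housing_with_id = housing.reset_index()
in_test = housing_with_id["index"].apply(lambda i: is_in_test_set(i, 0.2))
id_test_set = housing_with_id[in_test]
id_train_set = housing_with_id[~in_test]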
Visualise the Data
Visualise the Data
Look for Correlations
• Standard correlation coefficient
(Pearson’s r)
• The correlation coefficient ranges from –1 to
1.
• When it is close to 1, it means that there is a
strong positive correlation; for example, the
median house value tends to go up when the
median income goes up.
• When the coefficient is close to –1, it means
that there is a strong negative correlation;
you can see a small negative correlation
between the latitude and the median house
value
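A sketch of computing Pearson's r against the target (numeric_only=True is needed on recent pandas versions to skip the text column; the median_house_value column name is assumed from the dataset):

# Correlation of every numeric attribute with the target value.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))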
Correlations
• This scatter matrix plots every numerical attribute against every other numerical attribute, plus a histogram of each numerical attribute's values on the main diagonal (top left to bottom right)
• The main diagonal would be full of straight lines if Pandas plotted each variable against itself, which would not be very useful. So instead, Pandas displays a histogram of each attribute (other options are available; see the Pandas documentation for more details).
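A sketch of the scatter matrix described above, restricted to a few promising attributes (column names assumed from the dataset):

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))   # histograms on the diagonal
plt.show()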
Correlations
• Median income looks like the most promising correlation with the median house value
• The correlation is indeed quite strong; you can clearly see the upward trend, and the points are not too dispersed
• Second, the price cap you noticed earlier is clearly visible as a horizontal line at $500,000
• Less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around $280,000, and a few more below that
Standard Correlation Coefficients for Various Datasets
More Correlations …
Data Preparation
• Structured data in machine learning consists of rows and columns.
• Data preparation is a required step in each machine learning project.
• The routineness of machine learning algorithms means the majority of effort on each project is spent on
data preparation.
• “Data quality is one of the most important problems in data management, since dirty data often leads to
inaccurate data analytics results and incorrect business decisions.”
• “it has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data.
However, being a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting),
it's essential that you become fluent and efficient in data wrangling techniques.”
• Step 1: clean the data (e.g. the total_bedrooms field, where some values are missing). Three options, sketched in code after this list:
• 1. Get rid of the corresponding districts.
• 2. Get rid of the whole attribute.
• 3. Set the missing values to some value (zero, the mean, the median, etc.). This is called imputation.
• Text and categorical attributes need their own handling (encoding / imputation).
• Homework: what are some good imputation approaches?
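A sketch of the three options for total_bedrooms; options 1 and 2 are shown commented out, and option 3 uses Scikit-Learn's SimpleImputer with the median as one reasonable default:

import numpy as np
from sklearn.impute import SimpleImputer

# Option 1: drop districts with a missing total_bedrooms value
# housing = housing.dropna(subset=["total_bedrooms"])
# Option 2: drop the whole attribute
# housing = housing.drop("total_bedrooms", axis=1)

# Option 3: impute missing values with the median of each numeric column
imputer = SimpleImputer(strategy="median")
housing_num = housing.select_dtypes(include=[np.number])
housing_num_imputed = imputer.fit_transform(housing_num)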
Data Cleaning
• Using statistics to detect noisy data and identify outliers
• Identifying columns that have the same value or no variance and
removing them
• Identifying duplicate rows of data and removing them.
• Marking empty values as missing
• Imputing missing values using statistics or a learned model
total_bedrooms attribute
Converting Text to Numbers
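A sketch of the two standard encoders for the dataset's text attribute (the ocean_proximity column name is assumed):

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

housing_cat = housing[["ocean_proximity"]]

# Ordinal encoding: one integer per category (implies an ordering that may not exist).
ordinal_encoder = OrdinalEncoder()
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)

# One-hot encoding: one binary column per category (no artificial ordering).
onehot_encoder = OneHotEncoder()
housing_cat_1hot = onehot_encoder.fit_transform(housing_cat)   # sparse matrix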
Data Cleaning
Feature Scaling and Transformation
• ML algorithms don’t perform well when input numerical attributes have very different scales.
• This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320,
while the median incomes only range from 0 to 15.
• Min-max scaling (normalization): each attribute is shifted and rescaled so that it ends up ranging from 0 to 1.
• Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if, for some reason, you don't want 0–1 (e.g., neural networks work best with zero-mean inputs, so a range of –1 to 1 is preferable). Min-max scaling is strongly affected by outliers.
• Standardization is different: first it subtracts the mean value (so standardized values have a zero mean), then it divides the result by the standard deviation (so standardized values have a standard deviation equal to 1). Standardization is less affected by outliers.
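A sketch of both scalers, assuming housing_num holds the numeric columns as in the imputation sketch above:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-max scaling to a chosen range (here –1 to 1); sensitive to outliers.
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max = min_max_scaler.fit_transform(housing_num)

# Standardization: zero mean and unit standard deviation; less affected by outliers.
std_scaler = StandardScaler()
housing_num_std = std_scaler.fit_transform(housing_num)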
Multimodal Distribution
• Two or more clear peaks
• Either create buckets of the data
• Add a feature for the main modes
• Radial basis function – Gaussian RBF

exp(–γ(x – 35)²)
The hyperparameter γ (gamma) determines
how quickly the similarity measure decays
as x moves away from 35.
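A sketch of this Gaussian RBF similarity feature for the housing_median_age attribute (the column name and the gamma value are assumptions chosen for illustration):

from sklearn.metrics.pairwise import rbf_kernel

# Similarity of each district's median age to 35; gamma controls how fast the
# similarity decays as the age moves away from 35.
age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)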
Transformation Pipeline
Transformation Pipeline
• First, imputation
• Then, feature scaling
• Then, further transformations of the data
• e.g. a ClusterSimilarity transformer
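A minimal sketch of such a pipeline using Scikit-Learn's Pipeline and ColumnTransformer (column names are assumptions, and the lecture's ClusterSimilarity transformer is omitted here):

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: impute the median, then standardize.
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical column: impute the most frequent value, then one-hot encode.
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

preprocessing = make_column_transformer(
    (num_pipeline, ["longitude", "latitude", "housing_median_age", "total_rooms",
                    "total_bedrooms", "population", "households", "median_income"]),
    (cat_pipeline, ["ocean_proximity"]),
)
housing_prepared = preprocessing.fit_transform(housing)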
Model Selection & Training

How good is this model ?


Model Selection & Training
• How is this model ?
Cross-Validation

• The decision tree has an RMSE of about 66,868, with a standard deviation of about 2,061
• RandomForestRegressor – train on many decision trees.
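A sketch of scoring a decision tree with 10-fold cross-validation, assuming housing holds the predictors, housing_labels the median house values, and preprocessing the transformation pipeline from the earlier sketch:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
# Scikit-Learn reports negative RMSEs (higher is better), hence the minus sign.
tree_rmses = -cross_val_score(tree_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)
print(tree_rmses.mean(), tree_rmses.std())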
Hyperparameter Tuning
• Grid Search: GridSearchCV
• All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will use cross-validation to evaluate all the possible combinations
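A sketch of GridSearchCV wrapped around a random forest pipeline; the hyperparameter names and value grids below are illustrative, not the lecture's exact configuration:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
param_grid = [
    {"random_forest__max_features": [4, 6, 8],
     "random_forest__n_estimators": [30, 100]},
]
# Cross-validate every combination in the grid (here 3 x 2 = 6 combinations, 3 folds each).
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring="neg_root_mean_squared_error")
grid_search.fit(housing, housing_labels)
print(grid_search.best_params_)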
Hyperparameter Tuning
Randomized Search
• GridSearch is ok for a few combinations
• RandomizedSearch is good for a large parameter space
• It evaluates a fixed number of combinations, selecting a random value for each hyperparameter at every
iteration
• If some of your hyperparameters are continuous (or discrete but with many possible values), and you let
randomized search run for, say, 1,000 iterations, then it will explore 1,000 different values for each of these
hyperparameters, whereas grid search would only explore the few values you listed for each one.
• Suppose a hyperparameter does not actually make much difference, but you don’t know it yet. If it has 10
possible values and you add it to your grid search, then training will take 10 times longer. But if you add it to
a random search, it will not make any difference.
• If there are 6 hyperparameters to explore, each with 10 possible values, then grid search offers no other
choice than training the model a million times, whereas random search can always run for any number of
iterations you choose.
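A sketch using RandomizedSearchCV with the same pipeline; the distributions and iteration count are illustrative:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "random_forest__max_features": randint(low=2, high=10),
    "random_forest__n_estimators": randint(low=10, high=200),
}
# Evaluate a fixed number of random combinations instead of the full grid.
rnd_search = RandomizedSearchCV(full_pipeline, param_distributions=param_distribs,
                                n_iter=10, cv=3,
                                scoring="neg_root_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing, housing_labels)
print(rnd_search.best_params_)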
Randomized Search
Feature Importance
• Drop less important
features
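A sketch of reading the random forest's feature importances out of the best pipeline found above (step names follow the search sketches; low scorers are candidates for dropping):

final_model = rnd_search.best_estimator_   # preprocessing + random forest pipeline
importances = final_model["random_forest"].feature_importances_
feature_names = final_model["preprocessing"].get_feature_names_out()
# Highest-importance features first.
print(sorted(zip(importances, feature_names), reverse=True))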
Evaluate on Test Set

In this California housing example, the final performance of the system is not much better than the
experts’ price estimates, which were often off by 30%, but it may still be a good idea to launch it,
especially if this frees up some time for the experts so they can work on more interesting and
productive tasks.
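A sketch of the final evaluation on the held-out test set (the target column name is assumed; final_model is the best pipeline from the search above):

import numpy as np
from sklearn.metrics import mean_squared_error

X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

final_predictions = final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))   # RMSE on unseen data
print(final_rmse)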
Additional Data Pre-processing Steps
• Can consider removing variables with zero variance (each
row for that column has the same value)
• Can remove columns of data that have low variance
• Identify rows that contain duplicate data
• Outlier identification and removal (sketched in code after this list):
• Is the data outside 3 standard deviations?
• Inter-quartile range methods
• Automatic outlier removal (LocalOutlierFactor class)
• Marking and removing missing data
• Statistical Imputation
• KNN Imputation
• Iterative imputation
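Sketches of two of the outlier approaches listed above, applied to the numeric columns (the column name and the 1.5 × IQR threshold are common defaults, chosen here for illustration):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Inter-quartile range rule on a single column: flag values far outside the middle 50%.
q1, q3 = housing_num["median_income"].quantile([0.25, 0.75])
iqr = q3 - q1
income_outliers = ((housing_num["median_income"] < q1 - 1.5 * iqr) |
                   (housing_num["median_income"] > q3 + 1.5 * iqr))

# Automatic outlier detection on all numeric columns; -1 marks an outlier.
lof = LocalOutlierFactor()
labels = lof.fit_predict(housing_num.fillna(housing_num.median()))
housing_inliers = housing_num[labels == 1]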
