
Lecture 2

Data Pre-processing
How many of you use ChatGPT / other LLMs?
Well, don't ask everything to ChatGPT
• Some of our recent research work …

1. ChatGPT in the Classroom: An Analysis of Its Strengths and Weaknesses for Solving Undergraduate
Computer Science Questions Link

2. ‘It’s not like Jarvis, but it’s pretty close!’ - Examining ChatGPT’s Usage among Undergraduate Students in
Computer Science Link
Working with Real Data
• Popular Data Sets:
• OpenML.org
• Kaggle.com
• PapersWithCode.com
• UC Irvine Machine Learning Repository
• Amazon's AWS datasets
• TensorFlow datasets
• Meta portals (they list open data repositories):
• DataPortals.org
• OpenDataMonitor.eu
• Other pages listing many popular open data repositories:
• Wikipedia's list of machine learning datasets
• Quora.com
• The datasets subreddit
End to End ML Project - Major Steps
StatLib Data
• California Housing prices dataset (StatLib repo)
• Data based on the 1990 California census
Look at the Big Picture
• The task is to use California census data to build a model of housing prices in the state
• This data includes metrics such as the population, median income, and median housing price for each block
group in California
• Your model should learn from this data and be able to predict the median housing price in any district, given
all the other metrics
• FRAME THE PROBLEM:
- Is building a model the end goal? No: the model's output will be fed into another system to analyse investment decisions
- What does the current solution look like? It is manual, which makes it costly and time consuming
• Data Pipeline
• Check the Assumptions
Performance Measure
• Root Mean Square Error:
\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^{2}}

Performance Measure 2
• The hypothesis h is the model's prediction function, e.g. a linear model of the form

h(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n
Performance Measure

• Mean Absolute Error:
• Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of
predictions and the vector of target values. Various distance measures, or norms, are possible:
• Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: this is the notion of distance we are all familiar with. It is also called the ℓ2 norm, noted ∥ · ∥2 (or just ∥ · ∥).
• Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥ · ∥1. This is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks, i.e. on a grid.
• More generally, the ℓk norm of a vector v containing n elements is defined as ∥v∥k = (|v1|^k + |v2|^k + ... + |vn|^k)^(1/k).
• The higher the norm index, the more it focuses on large values and neglects small ones. This is why the
RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-
shaped curve), the RMSE performs very well and is generally preferred.
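For completeness, a LaTeX rendering of the MAE, using the same notation as the RMSE reconstruction above (m instances, predictions h(x⁽ⁱ⁾), targets y⁽ⁱ⁾):

\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(\mathbf{x}^{(i)}) - y^{(i)} \right|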
Get the Data

• Running via Google Colab
• Data and notebooks: https://homl.info/colab3
• There will be a tutorial after this lecture by TAs on how to use Google Colab / Jupyter Notebook / some Python libraries

• Loading the data
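A minimal loading sketch, assuming the housing CSV has already been downloaded and extracted to a local datasets/ folder (the path is an assumption; the Colab notebook at homl.info/colab3 fetches its own copy):

import pandas as pd

# Assumed local path to the extracted California housing CSV.
housing = pd.read_csv("datasets/housing/housing.csv")
print(housing.head())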
Explore the Data

>>> housing.info()

• 20,640 data instances


• 20,433 non-null values for total_bedrooms – 207 districts are missing this value.
Explore the Data

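A short sketch of the usual first-look commands (the ocean_proximity column name is assumed from the standard version of this dataset; matplotlib is only needed for the histograms):

import matplotlib.pyplot as plt

housing.info()                                      # column types and non-null counts
print(housing.describe())                           # summary statistics of numeric columns
print(housing["ocean_proximity"].value_counts())    # the one categorical attribute
housing.hist(bins=50, figsize=(12, 8))              # histogram per numeric attribute
plt.show()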
Create a Test Set


• Roughly a 20% split (or less if the dataset is very large)
Test Data
• How can you ensure, across various runs of the code, that your model
does not see the test data ?

Unique Identifier

• Have a look at train_test_split() in Scikit-Learn
• Common options: seed the random shuffle so the split is reproducible across runs, or decide each instance's train/test membership from a stable unique identifier (see the sketch below)
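Two sketches of the ideas above: a seeded random split with train_test_split(), and a hash-of-identifier split that keeps each instance's train/test membership stable even if the dataset grows. The helper name is illustrative, not the lecture's exact code.

from zlib import crc32
from sklearn.model_selection import train_test_split

# Seeded random split: roughly 20% held out, reproducible thanks to random_state.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

def is_in_test_set(identifier, test_ratio):
    # Hash the identifier; keep the instance in the test set when the hash
    # falls in the lowest test_ratio fraction of the 32-bit hash range.
    return crc32(str(identifier).encode()) / 2**32 < test_ratio

# Example: use the row index as the identifier.
housing_with_id = housing.reset_index()
in_test = housing_with_id["index"].apply(lambda i: is_in_test_set(i, 0.2))
id_test_set = housing_with_id[in_test]
id_train_set = housing_with_id[~in_test]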
Visualise the Data
Visualise the Data
Look for Correlations
• Standard correlation coefficient
(Pearson’s r)
• The correlation coefficient ranges from –1 to
1.
• When it is close to 1, it means that there is a
strong positive correlation; for example, the
median house value tends to go up when the
median income goes up.
• When the coefficient is close to –1, it means
that there is a strong negative correlation;
you can see a small negative correlation
between the latitude and the median house
value
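A sketch of computing Pearson's r against the target (numeric_only=True is needed on recent pandas versions to skip the text column; the median_house_value column name is assumed from the dataset):

# Correlation of every numeric attribute with the target value.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))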
Correlations
• This scatter matrix plots every numerical attribute against every other numerical attribute, plus a histogram of each numerical attribute's values on the main diagonal (top left to bottom right)
• The main diagonal would be full of straight lines if Pandas plotted each variable against itself, which would not be very useful. So instead, Pandas displays a histogram of each attribute (other options are available; see the Pandas documentation for more details).
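A sketch of the scatter matrix described above, restricted to a few promising attributes (column names assumed from the dataset):

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))   # histograms on the diagonal
plt.show()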
Correlations
• Median income looks like the most promising correlation with the median house value
• The correlation is indeed quite strong; you can clearly see the upward trend, and the points are not too dispersed
• Second, the price cap you noticed earlier is clearly visible as a horizontal line at $500,000
• Less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around $280,000, and a few more below that
Standard Correlation Coefficients for Various Datasets
More Correlations …
Data Preparation
• Structured data in machine learning consists of rows and columns.
• Data preparation is a required step in each machine learning project.
• The routineness of machine learning algorithms means the majority of effort on each project is spent on
data preparation.
• “Data quality is one of the most important problems in data management, since dirty data often leads to
inaccurate data analytics results and incorrect business decisions.”
• “it has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data.
However, being a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting),
it's essential that you become fluent and efficient in data wrangling techniques.”
• Step 1: clean the data (e.g. the total_bedrooms field, where some values are missing). Three options, sketched in code after this list:
• 1. Get rid of the corresponding districts.
• 2. Get rid of the whole attribute.
• 3. Set the missing values to some value (zero, the mean, the median, etc.). This is called imputation.
• Text and categorical attributes need their own handling (encoding / imputation).
• Homework: what are some good imputation approaches?
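A sketch of the three options for total_bedrooms; options 1 and 2 are shown commented out, and option 3 uses Scikit-Learn's SimpleImputer with the median as one reasonable default:

import numpy as np
from sklearn.impute import SimpleImputer

# Option 1: drop districts with a missing total_bedrooms value
# housing = housing.dropna(subset=["total_bedrooms"])
# Option 2: drop the whole attribute
# housing = housing.drop("total_bedrooms", axis=1)

# Option 3: impute missing values with the median of each numeric column
imputer = SimpleImputer(strategy="median")
housing_num = housing.select_dtypes(include=[np.number])
housing_num_imputed = imputer.fit_transform(housing_num)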
Data Cleaning
• Using statistics to detect noisy data and identify outliers
• Identifying columns that have the same value or no variance and
removing them
• Identifying duplicate rows of data and removing them.
• Marking empty values as missing
• Imputing missing values using statistics or a learned model
total_bedrooms attribute
Converting Text to Numbers
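A sketch of the two standard encoders for the dataset's text attribute (the ocean_proximity column name is assumed):

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

housing_cat = housing[["ocean_proximity"]]

# Ordinal encoding: one integer per category (implies an ordering that may not exist).
ordinal_encoder = OrdinalEncoder()
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)

# One-hot encoding: one binary column per category (no artificial ordering).
onehot_encoder = OneHotEncoder()
housing_cat_1hot = onehot_encoder.fit_transform(housing_cat)   # sparse matrix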
Data Cleaning
Feature Scaling and Transformation
• ML algorithms don’t perform well when input numerical attributes have very different scales.
• This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320,
while the median incomes only range from 0 to 15.
• Min-max scaling (normalization): each attribute is shifted and rescaled so that it ends up ranging from 0 to 1.
• Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if, for some reason, you don't want 0–1 (e.g., neural networks work best with zero-mean inputs, so a range of –1 to 1 is preferable). Min-max scaling is strongly affected by outliers.
• Standardization is different: first it subtracts the mean value (so standardized values have a zero mean), then it divides the result by the standard deviation (so standardized values have a standard deviation equal to 1). Standardization is less affected by outliers.
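A sketch of both scalers, assuming housing_num holds the numeric columns as in the imputation sketch above:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-max scaling to a chosen range (here –1 to 1); sensitive to outliers.
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max = min_max_scaler.fit_transform(housing_num)

# Standardization: zero mean and unit standard deviation; less affected by outliers.
std_scaler = StandardScaler()
housing_num_std = std_scaler.fit_transform(housing_num)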
Multimodal Distribution
• Two or more clear peaks
• Either create buckets of the data
• Add a feature for the main modes
• Radial basis function – Gaussian RBF

exp(–γ(x – 35)²)
The hyperparameter γ (gamma) determines
how quickly the similarity measure decays
as x moves away from 35.
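A sketch of this Gaussian RBF similarity feature for the housing_median_age attribute (the column name and the gamma value are assumptions chosen for illustration):

from sklearn.metrics.pairwise import rbf_kernel

# Similarity of each district's median age to 35; gamma controls how fast the
# similarity decays as the age moves away from 35.
age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)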
Transformation Pipeline
Transformation Pipeline
• First, imputation
• Then, feature scaling
• Then, further transformations of the data
• e.g. a ClusterSimilarity transformer
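A minimal sketch of such a pipeline using Scikit-Learn's Pipeline and ColumnTransformer (column names are assumptions, and the lecture's ClusterSimilarity transformer is omitted here):

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: impute the median, then standardize.
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical column: impute the most frequent value, then one-hot encode.
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

preprocessing = make_column_transformer(
    (num_pipeline, ["longitude", "latitude", "housing_median_age", "total_rooms",
                    "total_bedrooms", "population", "households", "median_income"]),
    (cat_pipeline, ["ocean_proximity"]),
)
housing_prepared = preprocessing.fit_transform(housing)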
Model Selection & Training

How good is this model ?


Model Selection & Training
• How is this model ?
Cross-Validation

• The decision tree has an RMSE of about 66,868, with a standard deviation of about 2,061
• RandomForestRegressor – train on many decision trees.
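A sketch of scoring a decision tree with 10-fold cross-validation, assuming housing holds the predictors, housing_labels the median house values, and preprocessing the transformation pipeline from the earlier sketch:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
# Scikit-Learn reports negative RMSEs (higher is better), hence the minus sign.
tree_rmses = -cross_val_score(tree_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)
print(tree_rmses.mean(), tree_rmses.std())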
Hyperparameter Tuning
• Grid Search: GridSearchCV
• All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will use cross-validation to evaluate all the possible combinations
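A sketch of GridSearchCV wrapped around a random forest pipeline; the hyperparameter names and value grids below are illustrative, not the lecture's exact configuration:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
param_grid = [
    {"random_forest__max_features": [4, 6, 8],
     "random_forest__n_estimators": [30, 100]},
]
# Cross-validate every combination in the grid (here 3 x 2 = 6 combinations, 3 folds each).
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring="neg_root_mean_squared_error")
grid_search.fit(housing, housing_labels)
print(grid_search.best_params_)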
Hyperparameter Tuning
Randomized Search
• GridSearch is ok for a few combinations
• RandomizedSearch is good for a large parameter space
• It evaluates a fixed number of combinations, selecting a random value for each hyperparameter at every
iteration
• If some of your hyperparameters are continuous (or discrete but with many possible values), and you let
randomized search run for, say, 1,000 iterations, then it will explore 1,000 different values for each of these
hyperparameters, whereas grid search would only explore the few values you listed for each one.
• Suppose a hyperparameter does not actually make much difference, but you don’t know it yet. If it has 10
possible values and you add it to your grid search, then training will take 10 times longer. But if you add it to
a random search, it will not make any difference.
• If there are 6 hyperparameters to explore, each with 10 possible values, then grid search offers no other
choice than training the model a million times, whereas random search can always run for any number of
iterations you choose.
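A sketch using RandomizedSearchCV with the same pipeline; the distributions and iteration count are illustrative:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "random_forest__max_features": randint(low=2, high=10),
    "random_forest__n_estimators": randint(low=10, high=200),
}
# Evaluate a fixed number of random combinations instead of the full grid.
rnd_search = RandomizedSearchCV(full_pipeline, param_distributions=param_distribs,
                                n_iter=10, cv=3,
                                scoring="neg_root_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing, housing_labels)
print(rnd_search.best_params_)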
Randomized Search
Feature Importance
• Drop less important
features
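A sketch of reading the random forest's feature importances out of the best pipeline found above (step names follow the search sketches; low scorers are candidates for dropping):

final_model = rnd_search.best_estimator_   # preprocessing + random forest pipeline
importances = final_model["random_forest"].feature_importances_
feature_names = final_model["preprocessing"].get_feature_names_out()
# Highest-importance features first.
print(sorted(zip(importances, feature_names), reverse=True))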
Evaluate on Test Set

In this California housing example, the final performance of the system is not much better than the
experts’ price estimates, which were often off by 30%, but it may still be a good idea to launch it,
especially if this frees up some time for the experts so they can work on more interesting and
productive tasks.
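A sketch of the final evaluation on the held-out test set (the target column name is assumed; final_model is the best pipeline from the search above):

import numpy as np
from sklearn.metrics import mean_squared_error

X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

final_predictions = final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))   # RMSE on unseen data
print(final_rmse)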
Additional Data Pre-processing Steps
• Can consider removing variables with zero variance (each
row for that column has the same value)
• Can remove columns of data that have low variance
• Identify rows that contain duplicate data
• Outlier identification and removal (sketched in code after this list):
• Is the data outside 3 standard deviations?
• Inter-quartile range methods
• Automatic outlier removal (LocalOutlierFactor class)
• Marking and removing missing data
• Statistical Imputation
• KNN Imputation
• Iterative imputation
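Sketches of two of the outlier approaches listed above, applied to the numeric columns (the column name and the 1.5 × IQR threshold are common defaults, chosen here for illustration):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Inter-quartile range rule on a single column: flag values far outside the middle 50%.
q1, q3 = housing_num["median_income"].quantile([0.25, 0.75])
iqr = q3 - q1
income_outliers = ((housing_num["median_income"] < q1 - 1.5 * iqr) |
                   (housing_num["median_income"] > q3 + 1.5 * iqr))

# Automatic outlier detection on all numeric columns; -1 marks an outlier.
lof = LocalOutlierFactor()
labels = lof.fit_predict(housing_num.fillna(housing_num.median()))
housing_inliers = housing_num[labels == 1]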
