2_DataPreProcessing_code
Data Pre-processing
How many of you use ChatGPT or other LLMs?
Well, don't ask ChatGPT for everything
• Some of our recent research work …
1. ChatGPT in the Classroom: An Analysis of Its Strengths and Weaknesses for Solving Undergraduate
Computer Science Questions Link
2. ‘It’s not like Jarvis, but it’s pretty close!’ - Examining ChatGPT’s Usage among Undergraduate Students in
Computer Science Link
Working with Real Data
• Popular Data Sets:
• OpenML.org
• Kaggle.com
• PapersWithCode.com
• UC Irvine Machine Learning Repository
• Amazon's AWS datasets
• TensorFlow datasets
Look at the Big Picture
• The task is to use California census data to build a model of housing prices in the state
• This data includes metrics such as the population, median income, and median housing price for each block
group in California
• Your model should learn from this data and be able to predict the median housing price in any district, given
all the other metrics
• FRAME THE PROBLEM:
- Is building a model the end goal?
- No: the model's predictions will be fed into another system that analyses investment decisions
- What is the current solution? A: It is manual, which is costly and time-consuming
• Data Pipeline
• Check the Assumptions
Performance Measure
• Root Mean Square Error:
RMSE(X, h) = √( (1/m) Σᵢ₌₁ᵐ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )² )
• where h is the prediction function (hypothesis), e.g. a linear model h(x) = θ₀ + θ₁x₁ + ... + θₙxₙ, x⁽ⁱ⁾ is the feature vector of the i-th instance, and y⁽ⁱ⁾ is its label
Performance Measure
• Mean Absolute Error:
MAE(X, h) = (1/m) Σᵢ₌₁ᵐ | h(x⁽ⁱ⁾) − y⁽ⁱ⁾ |
• Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of
predictions and the vector of target values. Various distance measures, or norms, are possible:
• Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: this is the notion of distance we are all familiar with. It is also called the ℓ₂ norm, noted ∥·∥₂ (or just ∥·∥).
• Computing the sum of absolutes (MAE) corresponds to the ℓ₁ norm, noted ∥·∥₁. This is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks (i.e. a grid-like street layout).
• More generally, the ℓₖ norm of a vector v containing n elements is defined as ∥v∥ₖ = (|v₁|ᵏ + |v₂|ᵏ + ... + |vₙ|ᵏ)^(1/k).
• The higher the norm index, the more it focuses on large values and neglects small ones. This is why the
RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-
shaped curve), the RMSE performs very well and is generally preferred.
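As a quick sanity check, here is how these measures relate to NumPy's vector norms; the prediction and target values below are made up purely for illustration:

import numpy as np

# Made-up predictions and targets, just to illustrate the norms
y_pred = np.array([210000.0, 320000.0, 150000.0])
y_true = np.array([200000.0, 300000.0, 180000.0])
errors = y_pred - y_true
m = len(errors)

# RMSE is the Euclidean (l2) norm of the error vector divided by sqrt(m)
rmse = np.sqrt(np.mean(errors ** 2))
print(np.isclose(rmse, np.linalg.norm(errors) / np.sqrt(m)))   # True

# MAE is the Manhattan (l1) norm of the error vector divided by m
mae = np.mean(np.abs(errors))
print(np.isclose(mae, np.linalg.norm(errors, ord=1) / m))      # True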
Get the Data
>>> housing.info()
• Create a test set: use a unique identifier so the split stays stable across runs
• Seed the random number generator so the split is reproducible
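A minimal sketch of loading the data and creating a reproducible test set; the housing.csv path is an assumption, and train_test_split is one simple way to do the split:

import pandas as pd
from sklearn.model_selection import train_test_split

housing = pd.read_csv("housing.csv")   # assumed local path to the dataset
housing.info()                          # dtypes and non-null counts per column

# Fixing random_state (seeding) makes the split reproducible across runs;
# for a test set that stays stable even as the dataset grows, hash a
# unique identifier per row instead of sampling randomly
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)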
Visualise the Data
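One possible visualisation, assuming the housing DataFrame from the sketch above with longitude, latitude, population, and median_house_value columns:

import matplotlib.pyplot as plt

# Geographical scatter plot: alpha exposes dense areas, marker size tracks
# population, and colour tracks median house value
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, c="median_house_value",
             cmap="jet", colorbar=True)
plt.show()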
Look for Correlations
• Standard correlation coefficient
(Pearson’s r)
• The correlation coefficient ranges from –1 to 1.
• When it is close to 1, it means that there is a
strong positive correlation; for example, the
median house value tends to go up when the
median income goes up.
• When the coefficient is close to –1, it means
that there is a strong negative correlation;
you can see a small negative correlation
between the latitude and the median house
value
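Pearson's r for every numeric attribute against the target can be computed directly with pandas (assuming the housing DataFrame above):

# Correlation of each numeric attribute with the target, strongest first
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))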
Correlations
• The scatter matrix plots every numerical attribute against every other numerical attribute
• A similarity feature can be computed with a radial basis function, e.g. exp(−γ(x − 35)²), which measures how similar each district's housing median age is to 35
• The hyperparameter γ (gamma) determines how quickly the similarity measure decays as x moves away from 35
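A sketch of such a similarity feature using scikit-learn's rbf_kernel; the column name and the gamma value are illustrative:

from sklearn.metrics.pairwise import rbf_kernel

# Similarity exp(-gamma * (x - 35)^2) of each district's median age to 35;
# larger gamma means the similarity decays faster away from 35
ages = housing[["housing_median_age"]].to_numpy()
age_simil_35 = rbf_kernel(ages, [[35]], gamma=0.1)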
Transformation Pipeline
• First, imputation of missing values
• Scaling of features
• Transform the data with custom transformers
• e.g. a ClusterSimilarity transform
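A minimal numeric pipeline with imputation followed by scaling; it assumes the train_set from the earlier split and leaves out the custom ClusterSimilarity transform:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

housing_labels = train_set["median_house_value"].copy()   # target
housing_num = (train_set.drop(columns=["median_house_value"])
               .select_dtypes(include=[np.number]))       # numeric features

# Impute missing values first, then scale each feature
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
housing_prepared = num_pipeline.fit_transform(housing_num)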
Model Selection & Training
• The decision tree has an RMSE of about 66,868, with a standard deviation of about 2,061
• RandomForestRegressor – trains an ensemble of many decision trees.
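Those RMSE statistics come from cross-validation; here is a sketch of the same evaluation for a random forest, assuming housing_prepared and housing_labels from the pipeline above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

forest = RandomForestRegressor(random_state=42)

# 10-fold cross-validation; scikit-learn returns negated errors, so flip
# the sign to get RMSE values
rmses = -cross_val_score(forest, housing_prepared, housing_labels,
                         scoring="neg_root_mean_squared_error", cv=10)
print(rmses.mean(), rmses.std())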
Hyperparameter Tuning
• Grid Search: GridSearchCV
• All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will use cross-validation to evaluate all the possible combinations
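A sketch with an illustrative grid (the parameter values are assumptions, and forest comes from the previous sketch):

from sklearn.model_selection import GridSearchCV

# 3 x 2 = 6 combinations, each trained and scored with 3-fold CV
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": [4, 8],
}
grid_search = GridSearchCV(forest, param_grid, cv=3,
                           scoring="neg_root_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)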
Randomized Search
• Grid search is fine for exploring a few combinations
• Randomized search (RandomizedSearchCV) is better for a large parameter space
• It evaluates a fixed number of combinations, selecting a random value for each hyperparameter at every
iteration
• If some of your hyperparameters are continuous (or discrete but with many possible values), and you let
randomized search run for, say, 1,000 iterations, then it will explore 1,000 different values for each of these
hyperparameters, whereas grid search would only explore the few values you listed for each one.
• Suppose a hyperparameter does not actually make much difference, but you don’t know it yet. If it has 10
possible values and you add it to your grid search, then training will take 10 times longer. But if you add it to
a random search, it will not make any difference.
• If there are 6 hyperparameters to explore, each with 10 possible values, then grid search offers no other
choice than training the model a million times, whereas random search can always run for any number of
iterations you choose.
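A sketch of the equivalent randomized search; the distributions and the 20-iteration budget are illustrative:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# n_iter fixes the total budget no matter how large the search space is;
# each iteration draws one random value per hyperparameter
param_distribs = {
    "n_estimators": randint(50, 500),
    "max_features": randint(2, 9),
}
rnd_search = RandomizedSearchCV(forest, param_distribs, n_iter=20, cv=3,
                                scoring="neg_root_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)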
Feature Importance
• Drop less important
features
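Random forests expose feature_importances_, which can guide which features to drop (assuming rnd_search and housing_num from the sketches above):

# Rank features by the tuned forest's importance scores
final_model = rnd_search.best_estimator_
for score, name in sorted(zip(final_model.feature_importances_,
                              housing_num.columns), reverse=True):
    print(round(score, 3), name)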
Evaluate on Test Set
In this California housing example, the final performance of the system is not much better than the
experts’ price estimates, which were often off by 30%, but it may still be a good idea to launch it,
especially if this frees up some time for the experts so they can work on more interesting and
productive tasks.
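A sketch of the final evaluation, run once on the held-out test set (test_set, num_pipeline, and final_model come from the earlier sketches):

import numpy as np
from sklearn.metrics import mean_squared_error

X_test = (test_set.drop(columns=["median_house_value"])
          .select_dtypes(include=[np.number]))
y_test = test_set["median_house_value"].copy()

# Reuse the pipeline fitted on the training data; never refit on test data
X_test_prepared = num_pipeline.transform(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test,
                                        final_model.predict(X_test_prepared)))
print(final_rmse)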
Additional Data Pre-processing Steps
• Can consider removing variables with zero variance (every row for that column has the same value)
• Can remove columns of data that have low variance
• Identify rows that contain duplicate data
• Outlier identification and removal:
• Is the data outside 3 standard deviations?
• Interquartile range (IQR) methods
• Automatic outlier removal (LocalOutlierFactor class)
• Marking and removing missing data
• Statistical imputation
• KNN imputation
• Iterative imputation (several of these steps are sketched below)
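A few of these steps sketched on a tiny made-up DataFrame (all values are illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.neighbors import LocalOutlierFactor

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 3.0, 100.0],
                   "b": [5.0, 5.0, 5.0, 5.0, 5.0],    # zero variance
                   "c": [1.0, np.nan, 2.0, 2.0, 3.0]})

df = df.loc[:, df.nunique() > 1]   # drop zero-variance columns
df = df.drop_duplicates()          # drop duplicate rows

# KNN imputation fills the remaining missing values
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                       columns=df.columns)

# LocalOutlierFactor labels inliers 1 and outliers -1; keep the inliers
mask = LocalOutlierFactor(n_neighbors=2).fit_predict(imputed) == 1
cleaned = imputed[mask]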