
Lecture 2: ML Pipeline

CSC 484 / 584, DA 515

Fall 2024

REF: Chapter 2: End to End ML


Ch2. An Example: End to End

1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select models and train them.
6. Fine-tune your models.
7. Present your solution.
8. Launch, monitor, and maintain your system.

2


ML pipeline example

# Create the pipeline (it does not exactly match the steps listed above)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(StandardScaler(),
                         PCA(n_components=8),
                         RandomForestClassifier(criterion='gini', n_estimators=50,
                                                max_depth=2, random_state=1))
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Evaluate the model
print('Model Accuracy: %.3f' % pipeline.score(X_test, y_test))
3
Questions you might ask:
 What does the dataset include?
 District median price, along with population, income, …
 What are the business objectives?
 (summary, visualization, prediction, …)
 What do we have currently? Complex rules (low accuracy)
 What kind of problem is it?
 Supervised / Unsupervised
 Classification / Regression
 Which algorithm (linear, polynomial, neural, kernel, tree, k-NN)?

4
Example: California Housing Prices (1990)
 Given California housing prices (by district)
 Train a model to predict a district’s median housing price

5
Load in the dataset
 Demo Code:
 Create an isolated environment: DA515
 Install scikit-learn (code: pip install scikit-learn)
 Download the dataset (check the book code)
 For us, the data is saved on disk in the same folder as your code:
import pandas as pd
housing = pd.read_csv("CA_housing.csv")
 In my case, the data is saved in the subfolder “datasets”:
housing = pd.read_csv("./datasets/CA_housing.csv")

6
Take a Quick Look at the Data Structure
The first 5 rows:
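For example (a minimal sketch; assumes the housing DataFrame loaded above):

# Show the first 5 rows of the dataset
housing.head()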

7
Explore the dataset
 Check the number of rows and columns
 Check data types (int, float, categorical, …)
 Check missing values
 (column total_bedrooms: 207 rows missing)
 Check duplicated data
 Check statistics (max, min, mean, …)
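A minimal pandas sketch of these checks (assumes the housing DataFrame loaded earlier):

housing.shape                  # number of rows and columns
housing.info()                 # column data types and non-null counts
housing.isnull().sum()         # missing values per column
housing.duplicated().sum()     # number of duplicated rows
housing.describe()             # statistics: count, mean, std, min, max, ...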

8
Missing values
 In this example, we do not have a lot of missing values:

 We need to fix the missing values later.

9
Categorical data
# check the categorical data
ocean_proximity 20640 non-null object

housing["ocean_proximity"].value_counts()
<1H OCEAN 9034
INLAND 6496
NEAR OCEAN 2628
NEAR BAY 2270
ISLAND 5

There are 5 categories; the counts differ.


10
Plotting

# install matplotlib
!pip install matplotlib

# import matplotlib
import matplotlib.pyplot as plt

# plot histograms of all numeric attributes
housing.hist(bins=50, figsize=(20,15))
plt.show()

11
.hist() output: histograms of all numeric attributes

12
Observations
 Data distributions vary. Outliers need to be removed.
 These attributes have very different scales.
 Finally, many histograms are tail-heavy: they extend much farther to the right.

13
For Machine Learning
 Data pre-processing
 Missing values
 Categorical data
 Feature selection/engineering
 Data scaling
 Data Sampling:
 using the stratify parameter:

The stratify parameter makes a split so that the proportion of values in the sample produced will be the same as the proportion of values provided to the stratify parameter (see the sketch below).
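For example, a minimal sketch of a stratified split with scikit-learn's train_test_split (income_cat is a hypothetical categorical column used only for illustration):

from sklearn.model_selection import train_test_split

# Stratified split: train/test keep the same category proportions as housing["income_cat"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=housing["income_cat"], random_state=42)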

14
For missing values 1/2
 You can:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).

median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)

4. Find the closest neighbors and use their average.

15
Use SimpleImputer to fill in 2/2

You can use Scikit-Learn's SimpleImputer:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
imputer.fit(housing[["total_bedrooms"]])
housing["total_bedrooms"] = imputer.transform(housing[["total_bedrooms"]])

16
Text and Categorical Attributes
 Check the first 10 samples of the categorical “ocean_proximity” attribute
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

 Computers cannot deal with text data directly; it must be encoded as numbers.

17
Use value_counts() or distinct
# count all distinct values
housing_cat.ocean_proximity.value_counts()

<1H OCEAN 7276
INLAND 5263
NEAR OCEAN 2124
NEAR BAY 1847
ISLAND 2

There are 5 categories; the counts differ.

18
Convert:
category text => ordinal numbers

Scikit-Learn provides OrdinalEncoder to map the 5 categories to the numbers [0 1 2 3 4], representing:

['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

Don’t use it here.
Reason: ocean_proximity is not ordinal data.

19
DATA TYPES
 Variable Types:
 Continuous
 Discrete:
 Ordinal: can be ordered, such as A, B, C, D, or Mon, Tue, …
 Nominal: blue, red, … or banana, apple, orange, …
 Text -> encode to numerical
 Image -> use pixels
 Voice -> text
 ….

20
Correct Encoding: One-Hot encoding

There are 5 categories in total:
encode each sample as a list [x1, x2, x3, x4, x5],
where xi is 1 for yes, 0 for no.

1-hot example:
21
Problem: TOO MANY VARIABLES
Use a one-hot encoder
# Use pd.get_dummies()
housing_cat_1hot = pd.get_dummies(housing_cat).astype(int)

# then merge it with the numerical attributes
housing = housing.join(housing_cat_1hot)

Now we have 14 features
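As an alternative to pd.get_dummies, a minimal sketch with scikit-learn's OneHotEncoder (an assumption about the workflow, not the lecture's code):

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing[["ocean_proximity"]]).toarray()
print(encoder.categories_)   # the 5 learned categories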


Feature selection/engineering
 Feature engineering:
create new features which make more sense.
 For example: rooms_per_household is better than total_rooms
 Feature selection:
keep only the important, relevant features
 There are several different methods
 Recursive elimination, importance ranking, PCA, etc.
 Here we talk about correlation

23


Experiments with attribute combinations

What you really want is the number of rooms per household:

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]

housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]

housing["population_per_household"] = housing["population"]/housing["households"]

24
Matrix of Correlations

25
Standard correlation coefficient of various
datasets
(source: Wikipedia; public domain image)

26
Feature Selection
Correlations with median_house_value
# Looking for correlations
# for ML, keep only the important features
# (assumes a linear relation)
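A minimal pandas sketch of computing these correlations (assumes the housing DataFrame built earlier; numeric_only is passed in case a text column remains):

# Correlation of every numeric attribute with the target
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))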

27
Separate X and y: features and target
 Separate the dependent variable y from the independent variables X:
# training features X
X = housing.drop("median_house_value", axis=1)

# label y
y = housing["median_house_value"]

28
Splitting and Scaling
 Which one needs to be done first?
 The book code does splitting first,
using random sampling. This is generally fine if your dataset is large enough (especially relative to the number of attributes).
 I prefer to do it later.
29
Feature Scaling
 Now, all data are numerical
----------------------------------------------------------------------------
 Scaling is very important:
 For distance computation
 For optimization

 Two ways:
 Normalization: (x - x_min)/(x_max - x_min) => [0, 1]
 Standardization: (x - mu)/sigma => N(0, 1)
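A minimal scikit-learn sketch of both approaches (assumes numeric arrays X_train and X_test; the scalers are fit on the training set only to avoid leakage):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization to [0, 1]
min_max = MinMaxScaler()
X_train_norm = min_max.fit_transform(X_train)
X_test_norm = min_max.transform(X_test)      # reuse the training min/max

# Standardization to zero mean, unit variance
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)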

30
Why do data scaling:
Lesson of the widow's mite
 This poor widow put in more than all the other contributors
to the treasury. For they have all contributed from their
surplus wealth, but she, from her poverty, has contributed
all she had, her whole livelihood (Wikipedia)

31
Feature Scaling
(Source of figure: http://cs231n.github.io/neural-networks-2/)

y = b + w_1 x_1 + w_2 x_2

(figure: the x1–x2 data cloud before and after scaling)

Make different features have the same scale.


Feature Selection
 Add combined features
 Remove less relevant features:

33
Feature Scaling: y = b + w_1 x_1 + w_2 x_2

(figure: when x1 takes values like 1, 2, … but x2 takes values like 100, 200, …, the loss L over (w1, w2) has elongated elliptical contours; after scaling both features to a similar range, the contours become nearly circular and optimization is easier)
Data Splitting: 80 vs 20
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split():

# random sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

35
Now visualize the training data
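A minimal sketch of one such plot, a geographic scatter of the training districts (an assumption modeled on the book's chapter 2 figure; column names follow the housing dataset):

import matplotlib.pyplot as plt

# circle size ~ population, color ~ median house value
housing_train = X_train.join(y_train)
housing_train.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                   s=housing_train["population"]/100, label="population",
                   c="median_house_value", cmap="jet", colorbar=True, figsize=(10, 7))
plt.legend()
plt.show()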

36
ML models
 Regression: for continuous data (Y)
 Linear or Polynomial
 Tree
 Random Forest
 K-NN (homework)
 SVM
 ANN
 Kernel
 ….
37
Regression Evaluation Metrics (1/3)
1. Mean Square Error (MSE) or Root Mean Square Error (RMSE):
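The standard definitions, where y_i is the observed value and \hat{y}_i the prediction:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}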

38
Regression Evaluation Metrics (2/3)
2. Mean Absolute Error(MAE):
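The standard definition:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|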

39
Regression Evaluation Metrics (3/3)
3. R Squared / Adjusted R Squared:
 Simple linear regression (r2 instead of R2)
 R2 quantifies the degree of any linear correlation between Yobs and Ypred, or assesses the goodness of fit
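The usual formulation (the coefficient of determination; see the Wikipedia link below):

R^2 = 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}}
    = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}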

https://en.wikipedia.org/wiki/Coefficient_of_determination

40
Limitation of using R squared
 R-squared is not valid for nonlinear regression
https://statisticsbyjim.com/regression/r-squared-invalid-nonlinear-regression/

 If you use R-squared for nonlinear models, their study indicates you will experience the following problems:
 R-squared is consistently high for both excellent and appalling models.
 R-squared will not rise for better models all of the time.
 If you use R-squared to pick the best model, it leads to the proper model only 28-43% of the time.

 More info: https://en.wikipedia.org/wiki/Coefficient_of_determination


41
3 Steps of ML
 Import a model
 Train: fit(X, y)
 Evaluate:
 MSE (Mean Squared Error)
 RMSE (Root Mean Squared Error)
 Cross-validation

42
Grid Search: looking for the best user-defined parameters

43
Example:
 Random Forest: 12 combinations + 6 combinations = 18 in total (see the sketch below)
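A minimal sketch of such a grid with scikit-learn's GridSearchCV (the parameter values are an assumption modeled on the book's chapter 2 example, giving 3x4 = 12 plus 2x3 = 6 combinations):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # 3 x 4 = 12 combinations
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    # 2 x 3 = 6 combinations, with bootstrap turned off
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)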

44
Short Summary
 After hyperparameter tuning, compare the RMSEs
 We also need to avoid overfitting
 Feature selection can be done differently
 Feature importances from the Random Forest:
45
Final Pipeline
 Example: Bayesian algorithm

46
Finally
 Evaluate your system on the test set
 Launch, monitor, and maintain your system

47
Homework: K-NN for Appraisal

Data: California Housing (chapter 2)

You cannot use the scikit-learn library.

K-NN: K nearest neighbors:
• Lazy algorithm
• No training
• No distribution assumption
• Based on feature similarity
• Used in classification by majority vote
• For regression, take the average price of the neighbors (a scalar); see the sketch below
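A minimal NumPy sketch of the k-NN regression idea (illustrative only, assuming NumPy arrays; your homework implementation may differ):

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the target for one query point as the mean of its k nearest neighbors."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Regression: average the neighbors' target values
    return y_train[nearest].mean()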

48
FYI: Real data sources
 UCI Data Repository
http://archive.ics.uci.edu/ml/index.php

 Kaggle
https://www.kaggle.com/datasets

 Google datasets
https://cloud.google.com/public-datasets/

 Government (Agriculture/Commerce/Education/FDA…. )
https://catalog.data.gov/dataset
49
END
• Read book Chapter 2
• Practice the code
• Do your homework 1
