
Lecture Slides - ML - Part 2

The document outlines the end-to-end machine learning process, emphasizing the importance of problem definition, data acquisition, visualization, transformation, model selection, evaluation, and deployment. It details various techniques for data transformation, including feature engineering and handling missing values, as well as model evaluation metrics for regression and classification tasks. The document concludes by highlighting the necessity of monitoring and maintaining models post-launch to ensure their effectiveness.

Uploaded by

Neeraja

End-to-end ML

Varol Kayhan, PhD


End-to-End ML (and our Agenda)
• Problem definition
• Get the data
• Discover and visualize the data
• Data transformation
• Select a model and train
• Evaluate the model
• Fine tuning
• Present solution
• Launch, monitor, maintain
Problem Definition
• Select SPECIFIC goals/problems!
• Avoid broad problems!
• Several questions to ask:
• What is the expected benefit?
• Is it supervised or unsupervised?
• Is it regression or classification (binary or multi-class)?

• (There could be a single course on "Problem Definition" alone)


Get the Data
• What is the unit of analysis?
• What are the predictors (features)?
• What is the target/outcome variable?
• Where is the data coming from?
• Internal, external, combined?
• How much data do you have (i.e., how many rows)?

• (There could be a single course on "Getting the Data" alone)


Discover and Visualize
• Use visualization
• Histograms, boxplots, scatter plots
• Use multiple variables in one plot when possible
• Observe descriptive statistics
• Mean, median, std.dev, min, max
• Observe correlations
• Reduce variables (if needed)

• (There could be a single course on "Discovery and Viz." alone)
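
The descriptive statistics and correlation listed above can be sketched with Python's standard library. This is a minimal illustration, not the course's own code: the Age and Salary values are a small sample used only for demonstration, and `describe` and `pearson` are hypothetical helper names.

```python
import statistics

# Small illustrative sample (assumption: used here only for demonstration).
age = [23, 21, 25, 20, 24]
salary = [58560, 78224, 86571, 78841, 63839]

def describe(values):
    """Return mean, median, sample std. dev., min and max of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

age_stats = describe(age)
age_salary_corr = pearson(age, salary)
print(age_stats)
print(round(age_salary_corr, 3))
```

In practice a data-frame library would usually compute these in one call; the hand-written version only makes the definitions explicit.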


Data Transformation
• Perform "feature engineering" (if needed)
• Derive variables using other variables
• Convert continuous to categorical
• Create new categories from existing categories
• Etc.
• Transform skewed variables if needed

• (There could be a single course on "Data Transformation" alone)
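
A minimal sketch of the three feature-engineering ideas above (deriving a variable, converting continuous to categorical, transforming a skewed variable). The rows, the column names, the age cut-offs, and the choice of a log transform are all assumptions made for illustration.

```python
import math

# Hypothetical rows; column names are illustrative assumptions.
rows = [
    {"age": 23, "salary": 58560, "debt": 12000},
    {"age": 41, "salary": 86571, "debt": 4000},
    {"age": 67, "salary": 31000, "debt": 0},
]

def age_group(age):
    """Convert a continuous age into a category (cut-offs are arbitrary)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"

for row in rows:
    # Derive a variable using other variables.
    row["debt_to_salary"] = row["debt"] / row["salary"]
    # Convert continuous to categorical.
    row["age_group"] = age_group(row["age"])
    # Log-transform a variable that is often right-skewed; log1p handles zeros.
    row["log_salary"] = math.log1p(row["salary"])

for row in rows:
    print(row)
```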


Data Transformation
• Convert categorical text to numerical values:
• "Ordinal encoding" OR "one-hot encoding"

Original data:
Size        Color   Price
Compact     White   22,000
Compact     Red     22,750
Mid-size    Black   25,000
Full-size   White   32,000
Mid-size    Red     26,000

Encoded data (Size: ordinal encoding; Color: one-hot encoding):
Size   White   Red   Black   Price
1      1       0     0       22,000
1      0       1     0       22,750
2      0       0     1       25,000
3      1       0     0       32,000
2      0       1     0       26,000
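
The two encodings can be sketched in plain Python using the car rows from this slide. This is only an illustration of the idea: the mapping dictionary and the column order are written by hand here, whereas a library encoder would learn them from the data.

```python
# Rows from the slide: (Size, Color, Price).
cars = [
    ("Compact", "White", 22000),
    ("Compact", "Red", 22750),
    ("Mid-size", "Black", 25000),
    ("Full-size", "White", 32000),
    ("Mid-size", "Red", 26000),
]

# Ordinal encoding: the sizes have a natural order, so map them to ranks.
size_rank = {"Compact": 1, "Mid-size": 2, "Full-size": 3}

# One-hot encoding: one 0/1 column per color, with no order implied.
colors = ["White", "Red", "Black"]

encoded = []
for size, color, price in cars:
    one_hot = [1 if color == c else 0 for c in colors]
    encoded.append([size_rank[size], *one_hot, price])

for row in encoded:
    print(row)  # [Size, White, Red, Black, Price]
```

Ordinal encoding suits Size because Compact < Mid-size < Full-size is meaningful; applying it to Color would invent an order that does not exist.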


Data Transformation
• Standardize continuous variables (why???)
Standardized value = (value - mean) / std.dev.

Original data:
Age   Salary   Default
23    58,560   1
21    78,224   1
25    86,571   0
20    78,841   1
24    63,839   0

Age:    Mean = 22.6,   Std.Dev. = 2.07
Salary: Mean = 73,207, Std.Dev. = 11,595

Standardized data:
Age_std   Salary_std   Default
0.19      -1.26        1
-0.77     0.43         1
1.16      1.15         0
-1.25     0.49         1
0.68      -0.81        0
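
The standardized columns on this slide can be reproduced with a few lines of standard-library Python. One assumption worth flagging: the slide's numbers match the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes; scikit-learn's `StandardScaler` divides by n instead and would give slightly different values.

```python
import statistics

age = [23, 21, 25, 20, 24]
salary = [58560, 78224, 86571, 78841, 63839]

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std. dev. (divides by n - 1)
    return [round((v - mean) / std, 2) for v in values]

age_std = standardize(age)
salary_std = standardize(salary)
print(age_std)     # [0.19, -0.77, 1.16, -1.25, 0.68]
print(salary_std)  # [-1.26, 0.43, 1.15, 0.49, -0.81]
```

After standardization both columns are on the same scale (mean 0, standard deviation 1), so Salary no longer dominates Age simply because its raw numbers are larger.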
Data Transformation
• Identify and deal with missing values
• Drop those rows/columns?
• Impute values (mean/median/etc.)?

(NA marks a missing value)

Table 1:
Age   Salary   Default
NA    58,560   1
NA    78,224   1
25    NA       0
20    NA       1
24    63,839   0
What to do?

Table 2:
Age   Salary   Default
23    58,560   1
21    78,224   1
25    86,571   0
20    NA       1
24    63,839   0
What to do?

Table 3:
Age   Salary   Default
23    NA       1
21    78,224   1
25    NA       0
20    NA       1
24    NA       0
What to do?
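
A minimal sketch of the two options named above, dropping rows and imputing, on an illustrative Age/Salary/Default table with a single missing Salary. Using `None` for the missing value and choosing the median for imputation are assumptions; the mean or another statistic would work the same way.

```python
import statistics

# Illustrative table with one missing Salary (None).
rows = [
    {"age": 23, "salary": 58560, "default": 1},
    {"age": 21, "salary": 78224, "default": 1},
    {"age": 25, "salary": 86571, "default": 0},
    {"age": 20, "salary": None, "default": 1},
    {"age": 24, "salary": 63839, "default": 0},
]

# Option 1: drop every row that has a missing value.
dropped = [r for r in rows if None not in r.values()]

# Option 2: impute the missing value with the median of the observed values.
observed = [r["salary"] for r in rows if r["salary"] is not None]
median_salary = statistics.median(observed)
imputed = [
    {**r, "salary": median_salary if r["salary"] is None else r["salary"]}
    for r in rows
]

print(len(dropped))          # 4 rows remain after dropping
print(imputed[3]["salary"])  # 71031.5
```

Dropping is reasonable when only a few rows are affected; when most of a column is missing, dropping the column is usually the more sensible choice than imputing it.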


Data Transformation
• Identify and deal with outliers:
• How do you define outliers?
• Are outliers legitimate or data entry errors?
• For each outlier: eliminate, include, or correct?
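
One common answer to "How do you define outliers?" is the 1.5 × IQR rule. The slides do not name a specific rule, so this choice is an assumption, as are the price values below.

```python
import statistics

# Hypothetical prices; the last value is deliberately extreme.
prices = [22000, 22750, 25000, 26000, 32000, 250000]

# Quartiles computed by linear interpolation between data points.
q1, _, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower or p > upper]

print(outliers)  # [250000]
```

The rule only flags candidates. Whether a flagged value is a legitimate observation or a data entry error, and whether to eliminate, include, or correct it, still needs a human decision.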
Data Transformation
• Do you do data transformation BEFORE or AFTER the train/test split?
• BEFORE is easier – recommended for beginners
• AFTER is better – recommended for real-world applications
Data Transformation
• There are two functions we need to know about:

• fit_transform()
• Learns and uses the parameters needed to transform the data (e.g., mean, std.dev, etc.)
• Must be used to transform the TRAINING data

• transform()
• Uses the learned parameters to transform the data
• Can only be used after fit_transform() has been used
• Must be used to transform the TEST data
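
A minimal sketch of the two calls with scikit-learn's `StandardScaler`, assuming scikit-learn and NumPy are installed. The price values are hypothetical and the single-feature layout is chosen only to keep the example short.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical prices; each row is one observation with a single feature.
X_train = np.array([[22000.0], [22750.0], [25000.0], [32000.0], [26000.0]])
X_test = np.array([[24000.0], [30000.0]])

scaler = StandardScaler()

# fit_transform(): learn mean and std. dev. from TRAIN, then transform TRAIN.
X_train_scaled = scaler.fit_transform(X_train)

# transform(): reuse the parameters learned from TRAIN to transform TEST.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)  # parameters learned from the training data
print(X_test_scaled)
```

Because the test data never influences the learned mean and standard deviation, the test set stays a fair stand-in for unseen data.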
Data Transformation
• What do fit_transform() and transform() do?

Train set: fit_transform() learns the parameters (Price: Mean=25,550, Stddev=3,537)
Size        Color    Price       Size   White   Red   Black   Price
Compact     White    22,000      1      1       0     0       -1.00
Compact     Red      22,750      1      0       1     0       -0.79
Mid-size    Black    25,000      2      0       0     1       -0.16
Full-size   White    32,000      3      1       0     0       1.82
Mid-size    Red      26,000      2      0       1     0       0.13

Test set: transform() applies the parameters learned from the train set
Size        Color    Price       Size   White   Red   Black   Price
Mid-size    Purple   23,000      2      0       0     0       -0.72
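
The test row above contains a color (Purple) that never appeared in the training data, and it is encoded as all zeros. One way to get that behavior in scikit-learn is `OneHotEncoder(handle_unknown="ignore")`; whether the slides used this exact setting is an assumption.

```python
from sklearn.preprocessing import OneHotEncoder

train_colors = [["White"], ["Red"], ["Black"], ["White"], ["Red"]]
test_colors = [["Purple"]]  # a category the encoder never saw during fitting

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_colors).toarray()
test_encoded = encoder.transform(test_colors).toarray()

print(encoder.categories_[0])  # learned columns, sorted alphabetically
print(test_encoded)
```

Note that the encoder sorts the learned categories, so its column order (Black, Red, White) differs from the order drawn on the slide.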
Data Transformation
• Do not use fit_transform() on train and test separately!!!
Train set: fit_transform() (Price: Mean=23,250, Stddev=1,275)
Size        Color   Price       Size   White   Red   Black   Price
Compact     White   22,000      1      1       0     0       -0.98
Compact     Red     22,750      1      0       1     0       -0.39
Mid-size    Black   25,000      2      0       0     1       1.37

Test set, WRONG: a separate fit_transform() (Price: Mean=46,000, Stddev=2,944)
Size        Color   Price       Size   White   Red   Blue    Price
Mid-size    Blue    43,000      1      0       0     1       -1.02
Full-size   White   45,000      2      1       0     0       -0.34
Mid-size    Red     50,000      1      0       1     0       1.36

Test set, RIGHT: instead, use only transform()
Size        Color   Price       Size   White   Red   Black   Price
Mid-size    Blue    43,000      2      0       0     0       15.49
Full-size   White   45,000      unk    1       0     0       17.06
Mid-size    Red     50,000      2      0       1     0       20.98
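
The Price columns of this slide can be checked with standard-library Python. The sketch assumes the population standard deviation (`statistics.pstdev`, dividing by n), which is what reproduces the slide's numbers; `z_scores` is a hypothetical helper name.

```python
import statistics

train = [22000, 22750, 25000]
test = [43000, 45000, 50000]

def z_scores(values, mean, std):
    """Standardize values with the given mean and standard deviation."""
    return [round((v - mean) / std, 2) for v in values]

# Parameters learned from the training set only.
train_mean, train_std = statistics.mean(train), statistics.pstdev(train)

# Wrong: learn a second set of parameters from the test set.
wrong = z_scores(test, statistics.mean(test), statistics.pstdev(test))

# Right: reuse the training parameters on the test set.
right = z_scores(test, train_mean, train_std)

print(wrong)  # [-1.02, -0.34, 1.36]
print(right)  # [15.49, 17.06, 20.98]
```

The "wrong" values look harmless, which is exactly the problem: they hide the fact that the test prices are far above anything the model saw during training.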
Data Transformation (fit_transform)
• When to use fit_transform() and when to use transform()
Train Models
• Select the right algorithm(s) for the task:
Regression Tasks          Classification Tasks
DecisionTreeRegressor     DecisionTreeClassifier
SVR                       SVC
RandomForestRegressor     RandomForestClassifier
Etc.                      Etc.

• Build alternative models to compare
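
A minimal sketch of building alternative models to compare, using the three regressors named above. It assumes scikit-learn is installed; the synthetic data set, the split ratio, and the hyperparameter values are placeholders, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for a real data set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "svr": SVR(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate on the same training data and score it on the same test data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 for regressors

print(scores)
```

For a classification task the same loop works with the classifier counterparts (DecisionTreeClassifier, SVC, RandomForestClassifier), in which case `score` returns accuracy.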


Model Evaluation
• Regression evaluation metric: (to be discussed later)
• Mean squared error (MSE)
• Root mean squared error (RMSE)
• Classification evaluation metric: (to be discussed later)
• Accuracy
• Sensitivity/specificity
• F1 score
• ROC curve
• Cross-validation! (Next slide)
• Goal: generate robust results (rather than one-off/non-replicable results)
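
Most of the metrics listed above reduce to a few lines of arithmetic. The sketch below computes them by hand on small hypothetical predictions so the definitions are visible; the ROC curve is left out because it needs predicted probabilities rather than hard labels.

```python
import math

# Regression: hypothetical actual vs. predicted prices.
actual = [22000, 25000, 32000]
predicted = [21000, 26000, 30000]
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(mse)  # same units as the target variable

# Classification: hypothetical labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)
sensitivity = tp / (tp + fn)  # share of actual positives that were found
specificity = tn / (tn + fp)  # share of actual negatives that were found
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(mse, round(rmse, 2))
print(accuracy, sensitivity, specificity, round(f1, 3))
```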
Model Evaluation
• Cross-validation has two use cases:
• Data set is small (a single train/test split will not generate robust results)
• Data set is large, but train/test split created lucky/unlucky samples
• There are several influential outliers in the data???
• 5-fold cross-validation (the entire data set is split into 5 parts):
           Part 1   Part 2   Part 3   Part 4   Part 5   Results
Fold #1    Test     Train    Train    Train    Train    Train = …, Test = …
Fold #2    Train    Test     Train    Train    Train    Train = …, Test = …
Fold #3    Train    Train    Test     Train    Train    Train = …, Test = …
Fold #4    Train    Train    Train    Test     Train    Train = …, Test = …
Fold #5    Train    Train    Train    Train    Test     Train = …, Test = …
Overall: Train = Avg(trains), Test = Avg(tests)
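
A minimal sketch of 5-fold cross-validation with scikit-learn's `cross_validate`, which returns one train score and one test score per fold. The synthetic data and the choice of a shallow decision tree are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data stands in for a real data set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# cv=5: each of the 5 parts serves as the test set exactly once.
results = cross_validate(model, X, y, cv=5, return_train_score=True)

# Overall scores are the averages across the five folds.
overall_train = results["train_score"].mean()
overall_test = results["test_score"].mean()

print(results["test_score"])
print(round(overall_train, 3), round(overall_test, 3))
```

A large gap between the averaged train and test scores suggests overfitting; a wide spread across the five test scores suggests that a single train/test split would have been unreliable.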
Fine tuning
• Change hyperparameters
• Grid search!
• Define the set of parameters
• Perform a (random) search!
• (Exhaustive search not recommended!)
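
A minimal sketch of a random search over a defined set of hyperparameters, using scikit-learn's `RandomizedSearchCV`. The parameter values, `n_iter=10`, and the synthetic data are illustrative assumptions. The grid below has 40 combinations, but only 10 are sampled and evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Define the set of parameters to search over (5 * 4 * 2 = 40 combinations).
param_distributions = {
    "max_depth": [2, 3, 5, 8, None],
    "min_samples_leaf": [1, 2, 5, 10],
    "criterion": ["gini", "entropy"],
}

# Sample 10 combinations at random and score each with 5-fold cross-validation.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Swapping in `GridSearchCV` would evaluate all 40 combinations, which becomes expensive quickly as more hyperparameters are added.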
Present Solution
• Is the "error" acceptable?
• Can you interpret the model (black-box or white box)?
• Can you identify "variable importance"?
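
One way to get at "variable importance" is the impurity-based `feature_importances_` attribute that scikit-learn's tree ensembles expose after fitting. This is a sketch under assumptions: the data is synthetic, the feature names `x0`..`x4` are hypothetical, and other importance measures exist.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data in which only 2 of the 5 features carry signal.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1; larger means more influential.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```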
Launch, Monitor, Maintain
• Is the model successful/effective?
• Conduct A/B experiments to test model effectiveness
Conclusion
• Problem definition
• Get the data
• Discover and visualize the data
• Data transformation
• Select a model and train
• Evaluate the model
• Fine tuning
• Present solution
• Launch, monitor, maintain

Diagram callout: "Most time consuming & error prone"

We will perform these steps in Python
