
COMPX310-19A

Machine Learning
An introduction using Python, Scikit-Learn, Keras, and Tensorflow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
C2: A first end-to-end application
 Blueprint:
 Big picture
 Data
 Visualize to understand
 Preprocess data
 Select model and train
 Fine-tune model
 Present
 Launch, monitor, and maintain

03/08/2021 COMPX310 2
Many data sources
 Open data:
 UC Irvine Machine Learning repository
 Kaggle
 Amazon AWS datasets

 Meta portals:
 dataportals.org
 opendatamonitor.eu
 quandl.com

 Other:
 https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
 https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
 https://www.reddit.com/r/datasets

California house prices, 1990 census

One cog in a larger system

Some performance measure: RMSE

Root mean squared error:

RMSE(X, h) = sqrt( (1/m) * Σ_{i=1..m} ( h(x^(i)) − y^(i) )² )

m .. number of examples
x .. input values for this example, e.g. latitude, longitude, district size, median income
y .. target value, e.g. median house price
h .. our regression function for predicting this median price

Often used in regression, but may over-emphasise outliers

Also called L2 norm
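The slide's formula can be sketched directly in NumPy; this is a minimal illustrative helper, not the book's exact code:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt((1/m) * sum((h(x_i) - y_i)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # one error of 2 -> sqrt(4/3)
```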

MAE: mean absolute error

Also called: L1 norm, Manhattan distance, city block distance

MAE(X, h) = (1/m) * Σ_{i=1..m} | h(x^(i)) − y^(i) |

More robust to outliers

Both RMSE and MAE are instances of the Lk norm idea:

||v||_k = ( Σ_i |v_i|^k )^(1/k), where k can be any natural number

L0 counts the number of non-zero elements
Linfinity computes the max absolute value
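A small sketch of the Lk-norm family on an error vector (illustrative helpers, assuming NumPy):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: (1/m) * sum(|h(x_i) - y_i|)."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))

def lk_norm(err, k):
    """Lk norm of an error vector: (sum |e_i|^k)^(1/k)."""
    err = np.abs(np.asarray(err, dtype=float))
    return float((err ** k).sum() ** (1.0 / k))

errs = np.array([0.0, -3.0, 4.0])
# L1 = 7.0, L2 = 5.0;
# "L0" counts the non-zero entries, "Linfinity" is the max absolute value.
l0 = int(np.count_nonzero(errs))       # -> 2
linf = float(np.max(np.abs(errs)))     # -> 4.0
```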
California housing is also on Kaggle

Inspect some more:

And some more:

What about ‘ocean_proximity’?

Some histograms

Notebook "magic" commands start with %


This time we use matplotlib, not seaborn.
Histograms are only plotted for numeric features, so ocean_proximity will be missing.

Have a look at the values and try to make sense of them
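The plotting code itself is not captured in this text version; a minimal sketch of the idea, using a tiny stand-in DataFrame (column names are illustrative, following the housing dataset):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook you would use the %matplotlib magic
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Tiny stand-in for the housing DataFrame.
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.lognormal(1.0, 0.5, 200),
    "housing_median_age": rng.integers(1, 52, 200).astype(float),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], 200),  # non-numeric: no histogram
})

# DataFrame.hist() plots one histogram per numeric column only.
axes = housing.hist(bins=50, figsize=(8, 4))
plt.tight_layout()
```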

Some observations
 Many plots have a long right tail
 Scales are very different, e.g. 0-16 vs. 0-500000
 Some data is preprocessed, e.g. median income 3 means $30k
 Some data is capped, like median_age, median_house_value,
and median_income
 Can be problematic
 Maybe remove
 Maybe try to get correct values

Manually splitting into train and test
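The slide's code was an image; a minimal sketch of a manual split in the spirit of the book's helper (function name and demo data are illustrative):

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio, seed=42):
    """Shuffle row indices with a fixed seed, then carve off a test fraction."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    return data.iloc[shuffled[test_size:]], data.iloc[shuffled[:test_size]]

df = pd.DataFrame({"x": range(10)})
train, test = split_train_test(df, 0.2)  # 8 train rows, 2 test rows
```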

More on splitting
 Generally it is a better idea to use scikit-learn functions, e.g.
 from sklearn.model_selection import train_test_split
 train, test = train_test_split(df, test_size=0.2, random_state=42)

 The textbook then also explains how to use hashing to keep splits similar, even when adding new data
 And how to do stratification of some attribute, and stratified
sampling with regard to such an attribute
 Read this in your own time
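Stratified sampling can also be done directly via train_test_split's stratify argument; a sketch assuming an illustrative income_cat column bucketed in the book's style (cut points are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the housing data.
rng = np.random.default_rng(42)
df = pd.DataFrame({"median_income": rng.lognormal(1.0, 0.5, 1000)})

# Bucket income into categories so the split can preserve their proportions.
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])

train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df["income_cat"])
```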

Visualising

More Visualising

Looking for correlations

Be careful with correlations

Measures linear correlation only: does y increase or decrease with x?

−1 means maximal decrease, +1 maximal increase; values around 0 mean no linear relationship
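A sketch of the usual correlation check with pandas, on synthetic stand-in data (column names follow the housing dataset, values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(3.0, 1.0, 500)
housing = pd.DataFrame({
    "median_income": income,
    # House value made to depend linearly on income, plus noise.
    "median_house_value": 50000 * income + rng.normal(0, 20000, 500),
    "latitude": rng.uniform(32, 42, 500),   # unrelated to the target
})

corr = housing.corr()
# Correlation of every feature with the target, strongest first.
print(corr["median_house_value"].sort_values(ascending=False))
```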
Some scatter plot

Focus

Derived attributes/features
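The slide's code was an image; a sketch of the kind of derived ratio features the book builds for this dataset (the two demo rows are illustrative):

```python
import pandas as pd

# Tiny stand-in rows; column names follow the California housing dataset.
housing = pd.DataFrame({
    "total_rooms": [880, 7099],
    "total_bedrooms": [129, 1106],
    "households": [126, 1138],
    "population": [322, 2401],
})

# Per-household / per-room ratios are often more informative than raw totals.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
```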

Preparing to train a model
 Split the augmented dataframe into train and test

 And then train into input and output (or target): X and y
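These two steps can be sketched as follows, on a tiny stand-in frame (column values are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the augmented housing DataFrame.
df = pd.DataFrame({
    "median_income": [1.5, 3.0, 4.5, 2.0, 5.0, 3.5],
    "median_house_value": [120000.0, 200000.0, 310000.0, 150000.0, 400000.0, 240000.0],
})

# Step 1: split into train and test.
train, test = train_test_split(df, test_size=2, random_state=42)

# Step 2: split train into inputs X and target y.
X_train = train.drop("median_house_value", axis=1)   # everything except the target
y_train = train["median_house_value"].copy()         # the target column
```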

What about missing values
 Most learners do not handle missing values; simple options are:
 Drop examples with missing values
 Drop features with missing values
 Replace missing values somehow: 0, mean, median, smarter …
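The scikit-learn way (shown as code images on the next slides) uses SimpleImputer; a minimal sketch on stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in frame with some missing values.
df = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 1106.0, 190.0],
                   "median_income": [8.3, 7.2, np.nan, 5.6]})

imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputer.statistics_)   # the learned medians, one per column
```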

Or the ‘scikit-learn’ way

And applying it:

Scikit-learn design
 Consistency:
 Estimators: fit(dataset)
 Transformers: transform(), fit_transform()
 Predictors: predict(), score()

 Inspection:
 hyperparameters are public instance variables:
 imputer.strategy -> median
 Learned parameters are public instance variables with ‘_’ suffix:
 imputer.statistics_

 Datasets are NumPy arrays or SciPy sparse matrices; hyperparameters are numbers and strings

 Composition: some transformers + estimator -> Pipeline estimator

 Sensible defaults
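The composition idea above can be sketched as a Pipeline chaining transformers and a final estimator (the data and step names here are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in data with missing values.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Transformers + estimator compose into a single Pipeline estimator.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

pipe.fit(X, y)            # fit() cascades fit_transform() through the steps
preds = pipe.predict(X)   # predict() applies transform() then the model
```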

What about ‘ocean_proximity’?

Or use separate 0/1 feature for each value
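Both encodings (shown as code images on these slides) can be sketched as follows; the category strings follow the housing dataset's ocean_proximity values:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

cats = np.array([["NEAR BAY"], ["INLAND"], ["NEAR BAY"], ["ISLAND"]])

# One integer code per category (alphabetical by default: INLAND=0, ISLAND=1, NEAR BAY=2).
ordinal = OrdinalEncoder().fit_transform(cats)

# One 0/1 column per category value; the result is sparse, so densify for inspection.
onehot = OneHotEncoder().fit_transform(cats).toarray()
```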

Notes
 OrdinalEncoder is perfect for ‘ordinal’ scales, e.g. ‘bad’, ‘average’, ‘good’, ‘excellent’
 But make sure to define this order explicitly

 OneHot can generate too many features; then maybe:
 Replace with some numeric feature, e.g. distance from the sea
 Or one or more reasonable proxies, e.g. zip code with average income, education, …

 Later we will learn about ‘embeddings’
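Defining the ordinal order explicitly looks like this (a small sketch; the scale is the example from the notes above):

```python
from sklearn.preprocessing import OrdinalEncoder

# Pass the scale explicitly so 'bad' < 'average' < 'good' < 'excellent',
# instead of relying on the default alphabetical ordering.
enc = OrdinalEncoder(categories=[["bad", "average", "good", "excellent"]])
codes = enc.fit_transform([["good"], ["bad"], ["excellent"]])
```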

More notes
 The textbook also covers:
 Custom transformers
 Scaling of numeric attributes
 Transformation pipelines

 General warning: always fit estimators and transformers on just the training data, otherwise information will ‘leak’ and may make your results look better than they are

Preparing X

And y and a linear regression model
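The slide's code was an image; a minimal sketch of fitting a linear regression and measuring its training RMSE, on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the prepared X and y.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

lin_reg = LinearRegression().fit(X, y)
rmse = float(np.sqrt(mean_squared_error(y, lin_reg.predict(X))))
```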

How well does it do?

Now try a decision tree:

Try cross-validation to get a better estimate
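A sketch of the cross-validation call (shown as a code image on the slide), on synthetic stand-in data; note scikit-learn's scoring returns negated MSE:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=120)

scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)   # negate, then take the root: one RMSE per fold
```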

CV for linear regression

Now try a RandomForest
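A sketch of the RandomForest fit (the slide's code was an image), again on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=100)

# An ensemble of decision trees; n_estimators kept small here for speed.
forest = RandomForestRegressor(n_estimators=30, random_state=42).fit(X, y)
preds = forest.predict(X)
```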

Plot all cv results

How well do we do on TEST data?

Plot predictions: linear regression

Plot predictions: Random Forest

More book stuff
 Fine tuning the model:
 Grid search
 Random search
 Analyze best model

 More later
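Grid search, the first fine-tuning option above, can be sketched with GridSearchCV (synthetic data; the parameter grid is illustrative, not the book's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in data, kept tiny so the search runs quickly.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=60)

# Every combination in the grid is cross-validated; the best one is refit.
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

best = search.best_estimator_   # the refit model with the best hyperparameters
```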

