
COMPX310-19A

Machine Learning
An introduction using Python, Scikit-Learn, Keras, and Tensorflow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
C2: A first end-to-end application
 Blueprint:
 Big picture
 Data
 Visualize to understand
 Preprocess data
 Select model and train
 Fine-tune model
 Present
 Launch, monitor, and maintain

03/08/2021 COMPX310 2
Many data sources
 Open data:
 UC Irvine Machine Learning repository
 Kaggle
 Amazon AWS datasets

 Meta portals:
 dataportals.org
 opendatamonitor.eu
 quandl.com

 Other:
 https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
 https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
 https://www.reddit.com/r/datasets

California house prices, 1990 census

One cog in a larger system

Some performance measure: RMSE

Root mean squared error:

RMSE(X, h) = sqrt( (1/m) * Σ_{i=1..m} ( h(x^(i)) − y^(i) )² )

m .. number of examples
x .. input values for this example, e.g. latitude, longitude, district size, median income
y .. target value, e.g. median house price
h .. our regression function for predicting this median price

Often used in regression, but may over-emphasise outliers

Also called L2 norm
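The slide's formula can be sketched directly in NumPy; this is a minimal illustrative helper, not the book's exact code:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt((1/m) * sum((h(x_i) - y_i)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # one error of 2 -> sqrt(4/3)
```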

MAE: mean absolute error

Also called: L1 norm, Manhattan distance, city block distance

MAE(X, h) = (1/m) * Σ_{i=1..m} | h(x^(i)) − y^(i) |

More robust to outliers

Both RMSE and MAE are instances of the Lk norm idea:

||v||_k = ( Σ_i |v_i|^k )^(1/k), where k can be any natural number

L0 counts the number of non-zero elements
Linfinity computes the max absolute value
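A small sketch of the Lk-norm family on an error vector (illustrative helpers, assuming NumPy):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: (1/m) * sum(|h(x_i) - y_i|)."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))

def lk_norm(err, k):
    """Lk norm of an error vector: (sum |e_i|^k)^(1/k)."""
    err = np.abs(np.asarray(err, dtype=float))
    return float((err ** k).sum() ** (1.0 / k))

errs = np.array([0.0, -3.0, 4.0])
# L1 = 7.0, L2 = 5.0;
# "L0" counts the non-zero entries, "Linfinity" is the max absolute value.
l0 = int(np.count_nonzero(errs))       # -> 2
linf = float(np.max(np.abs(errs)))     # -> 4.0
```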
California housing is also on Kaggle

Inspect some more:

And some more:

What about ‘ocean_proximity’?

Some histograms

Notebook "magic" commands start with %


This time we use matplotlib, not seaborn.
Histograms are only plotted for numeric features, so ocean_proximity will be missing.

Have a look at the values and try to make sense of them
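The plotting code itself is not captured in this text version; a minimal sketch of the idea, using a tiny stand-in DataFrame (column names are illustrative, following the housing dataset):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook you would use the %matplotlib magic
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Tiny stand-in for the housing DataFrame.
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.lognormal(1.0, 0.5, 200),
    "housing_median_age": rng.integers(1, 52, 200).astype(float),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], 200),  # non-numeric: no histogram
})

# DataFrame.hist() plots one histogram per numeric column only.
axes = housing.hist(bins=50, figsize=(8, 4))
plt.tight_layout()
```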

Some observations
 Many plots have a long right tail
 Scales are very different, e.g. 0-16 vs. 0-500000
 Some data is preprocessed, e.g. median income 3 means $30k
 Some data is capped, like median_age, median_house_value,
and median_income
 Can be problematic
 Maybe remove
 Maybe try to get correct values

Manually splitting into train and test
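The slide's code was an image; a minimal sketch of a manual split in the spirit of the book's helper (function name and demo data are illustrative):

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio, seed=42):
    """Shuffle row indices with a fixed seed, then carve off a test fraction."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    return data.iloc[shuffled[test_size:]], data.iloc[shuffled[:test_size]]

df = pd.DataFrame({"x": range(10)})
train, test = split_train_test(df, 0.2)  # 8 train rows, 2 test rows
```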

More on splitting
 Generally it is a better idea to use scikit-learn functions, e.g.
 from sklearn.model_selection import train_test_split
 train, test = train_test_split(df, test_size=0.2, random_state=42)

 The textbook then also explains how to use hashing to keep splits similar, even when adding new data
 And how to do stratification of some attribute, and stratified
sampling with regard to such an attribute
 Read this in your own time
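Stratified sampling can also be done directly via train_test_split's stratify argument; a sketch assuming an illustrative income_cat column bucketed in the book's style (cut points are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the housing data.
rng = np.random.default_rng(42)
df = pd.DataFrame({"median_income": rng.lognormal(1.0, 0.5, 1000)})

# Bucket income into categories so the split can preserve their proportions.
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])

train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df["income_cat"])
```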

Visualising

More Visualising

Looking for correlations

Be careful with correlations

Measures linear correlation only: does y increase or decrease with x?

−1 means maximal decrease, +1 maximal increase; values around 0 mean no linear relationship
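A sketch of the usual correlation check with pandas, on synthetic stand-in data (column names follow the housing dataset, values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(3.0, 1.0, 500)
housing = pd.DataFrame({
    "median_income": income,
    # House value made to depend linearly on income, plus noise.
    "median_house_value": 50000 * income + rng.normal(0, 20000, 500),
    "latitude": rng.uniform(32, 42, 500),   # unrelated to the target
})

corr = housing.corr()
# Correlation of every feature with the target, strongest first.
print(corr["median_house_value"].sort_values(ascending=False))
```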
Some scatter plot

Focus

Derived attributes/features
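The slide's code was an image; a sketch of the kind of derived ratio features the book builds for this dataset (the two demo rows are illustrative):

```python
import pandas as pd

# Tiny stand-in rows; column names follow the California housing dataset.
housing = pd.DataFrame({
    "total_rooms": [880, 7099],
    "total_bedrooms": [129, 1106],
    "households": [126, 1138],
    "population": [322, 2401],
})

# Per-household / per-room ratios are often more informative than raw totals.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
```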

Preparing to train a model
 Split the augmented dataframe into train and test

 And then train into input and output (or target): X and y
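These two steps can be sketched as follows, on a tiny stand-in frame (column values are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the augmented housing DataFrame.
df = pd.DataFrame({
    "median_income": [1.5, 3.0, 4.5, 2.0, 5.0, 3.5],
    "median_house_value": [120000.0, 200000.0, 310000.0, 150000.0, 400000.0, 240000.0],
})

# Step 1: split into train and test.
train, test = train_test_split(df, test_size=2, random_state=42)

# Step 2: split train into inputs X and target y.
X_train = train.drop("median_house_value", axis=1)   # everything except the target
y_train = train["median_house_value"].copy()         # the target column
```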

What about missing values
 Most learners do not handle missing values; simple options are:
 Drop examples with missing values
 Drop features with missing values
 Replace missing values somehow: 0, mean, median, smarter …
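The scikit-learn way (shown as code images on the next slides) uses SimpleImputer; a minimal sketch on stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in frame with some missing values.
df = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 1106.0, 190.0],
                   "median_income": [8.3, 7.2, np.nan, 5.6]})

imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputer.statistics_)   # the learned medians, one per column
```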

Or the ‘scikit-learn’ way

And applying it:

Scikit-learn design
 Consistency:
 Estimators: fit(dataset)
 Transformers: transform(), fit_transform()
 Predictors: predict(), score()

 Inspection:
 hyperparameters are public instance variables:
 imputer.strategy -> median
 Learned parameters are public instance variables with ‘_’ suffix:
 imputer.statistics_

 Datasets are NumPy arrays or SciPy sparse matrices; hyperparameters are numbers and strings

 Composition: some transformers + estimator -> Pipeline estimator

 Sensible defaults
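The composition idea above can be sketched as a Pipeline chaining transformers and a final estimator (the data and step names here are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in data with missing values.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Transformers + estimator compose into a single Pipeline estimator.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

pipe.fit(X, y)            # fit() cascades fit_transform() through the steps
preds = pipe.predict(X)   # predict() applies transform() then the model
```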

What about ‘ocean_proximity’?

Or use separate 0/1 feature for each value
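Both encodings (shown as code images on these slides) can be sketched as follows; the category strings follow the housing dataset's ocean_proximity values:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

cats = np.array([["NEAR BAY"], ["INLAND"], ["NEAR BAY"], ["ISLAND"]])

# One integer code per category (alphabetical by default: INLAND=0, ISLAND=1, NEAR BAY=2).
ordinal = OrdinalEncoder().fit_transform(cats)

# One 0/1 column per category value; the result is sparse, so densify for inspection.
onehot = OneHotEncoder().fit_transform(cats).toarray()
```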

Notes
 OrdinalEncoder is perfect for ‘ordinal’ scales, e.g. ‘bad’, ‘average’, ‘good’, ‘excellent’
 But make sure to define this order explicitly

 OneHot can generate too many features; then maybe:
 Replace with some numeric feature, e.g. distance from the sea
 Or one or more reasonable proxies, e.g. zip code with average income, education, …

 Later we will learn about ‘embeddings’
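Defining the ordinal order explicitly looks like this (a small sketch; the scale is the example from the notes above):

```python
from sklearn.preprocessing import OrdinalEncoder

# Pass the scale explicitly so 'bad' < 'average' < 'good' < 'excellent',
# instead of relying on the default alphabetical ordering.
enc = OrdinalEncoder(categories=[["bad", "average", "good", "excellent"]])
codes = enc.fit_transform([["good"], ["bad"], ["excellent"]])
```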

More notes
 The textbook also covers:
 Custom transformers
 Scaling of numeric attributes
 Transformation pipelines

 General warning: always fit estimators and transformers on just the training data, otherwise information will ‘leak’ and may make your results look better than they are

Preparing X

And y and a linear regression model
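The slide's code was an image; a minimal sketch of fitting a linear regression and measuring its training RMSE, on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the prepared X and y.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

lin_reg = LinearRegression().fit(X, y)
rmse = float(np.sqrt(mean_squared_error(y, lin_reg.predict(X))))
```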

How well does it do?

Now try a decision tree:

Try cross-validation to get a better estimate
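A sketch of the cross-validation call (shown as a code image on the slide), on synthetic stand-in data; note scikit-learn's scoring returns negated MSE:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=120)

scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)   # negate, then take the root: one RMSE per fold
```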

CV for linear regression

Now try a RandomForest
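A sketch of the RandomForest fit (the slide's code was an image), again on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=100)

# An ensemble of decision trees; n_estimators kept small here for speed.
forest = RandomForestRegressor(n_estimators=30, random_state=42).fit(X, y)
preds = forest.predict(X)
```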

Plot all cv results

How well do we do on TEST data?

Plot predictions: linear regression

Plot predictions: Random Forest

More book stuff
 Fine tuning the model:
 Grid search
 Random search
 Analyze best model

 More later
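Grid search, the first fine-tuning option above, can be sketched with GridSearchCV (synthetic data; the parameter grid is illustrative, not the book's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in data, kept tiny so the search runs quickly.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=60)

# Every combination in the grid is cross-validated; the best one is refit.
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

best = search.best_estimator_   # the refit model with the best hyperparameters
```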

