Quiz 1 Materials
AI-Machine Learning & Analytics
David Gómez-Ullate Oteiza
Type of data
Supervised ML
Contents
What kind of learning would you use?
Supervised learning: Terminology
● Collection of labeled examples. Also called:
○ samples
○ observations
● Several variables per example. Also called:
○ inputs
○ predictors
○ attributes
○ features
○ covariates
○ independent variables
● One of the variables is of special interest:
○ label
○ target
○ output
○ dependent variable
Contents
Homework
4. Fiddle around with your model and make it to the top 10% of the
Leaderboard
Model selection
For every problem, we can choose from a large set of models
No Free Lunch Theorem
There is no classification method
that systematically outperforms
others on a wide range of problems
… but RF or XGBoost on tabular data should work fine out of the box,
and even better after some hyperparameter tuning.
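As an illustration, a minimal out-of-the-box baseline along these lines (a sketch with scikit-learn on a synthetic dataset; swap in your own X and y):

```python
# Sketch: an out-of-the-box Random Forest baseline on tabular data
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0)  # near-default settings
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```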
Choosing a model
Given a supervised learning problem, how to choose the best model / hyperparameters?
● hyperparameter tuning (grid search, stochastic sampling, Bayesian methods, etc.),
● model selection, etc.
3. Pragmatic answer:
● Don’t bother to strive for the best, settle for one that’s good enough for your purposes.
Overfitting
• The larger the training set, the more complex the model can be (without overfitting)
• The complexity/flexibility of a model grows with its number of tunable parameters
Partitioning the Data
Problem: How well will our model perform with new data?
Solution: hold out part of the data to estimate the generalization error
Assessing under/overfitting
We want the model complexity that minimizes the test error
Strategy
Hyperparameter tuning:
Experiment Tracking Tools
https://mlflow.org/
https://wandb.ai/site
Hyperparameter optimization
Balance exploration and exploitation (local vs global search)
https://towardsdatascience.com/hyperparameters-tuning-from-grid-search-to-optimization-a09853e4e9b8
https://en.wikipedia.org/wiki/Hyperparameter_optimization
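A minimal sketch of the two simplest strategies, grid search (exhaustive over a fixed grid) and randomized search (stochastic sampling); the model and grid values are illustrative:

```python
# Sketch: grid search vs. randomized search for hyperparameter optimization
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

# Exhaustive search: tries every combination in the grid
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Stochastic sampling: trades exhaustiveness for speed (exploration vs. exploitation)
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          param_distributions=param_grid, n_iter=4, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```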
Adding a validation set
Train / Val / Test split
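A minimal sketch of a 60/20/20 split via two calls to scikit-learn's train_test_split (proportions are illustrative): train for fitting, validation for model and hyperparameter choice, test for the final estimate of the generalization error.

```python
# Sketch: train / validation / test split in two steps
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```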
Cross Validation
Solution: K-fold cross-validation
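A minimal sketch with scikit-learn's cross_val_score, where each of the K folds serves once as the validation set; the model and K = 5 are illustrative:

```python
# Sketch: 5-fold cross validation; scores are averaged across folds
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())  # average performance and spread across folds
```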
Strategy (Refined)
Loss functions
Classification tasks: categorical cross-entropy, L = -\sum_i \sum_c y_{ic} \log \hat{y}_{ic}

Regularization
Introduce an extra term in the loss function to penalize very complex models. For a linear model:
● L2-regularization (ridge): L = \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j w_j^2
● L1-regularization (lasso): L = \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j |w_j|
● Elastic net: combines the L1 and L2 penalties
Increasing \lambda decreases model complexity; decreasing it allows a more flexible fit.

Regression tasks
MAE is easier to interpret, but RMSE has better properties for optimization (it is differentiable everywhere).

Confusion matrix (rows: true label, columns: predicted label)
          Predicted P        Predicted N
True P    True Positive      False Negative
True N    False Positive     True Negative
From it one derives metrics such as Precision, Recall and the F1-score.
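A minimal sketch computing these metrics with scikit-learn (the toy values are illustrative):

```python
# Sketch: regression and classification metrics from above
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             confusion_matrix, precision_score, recall_score, f1_score)

# Regression: MAE vs RMSE
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mean_absolute_error(y_true, y_pred),
      np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE = sqrt(MSE)

# Classification: confusion matrix and derived scores
y_true_c = [1, 0, 1, 1, 0]
y_pred_c = [1, 0, 0, 1, 1]
print(confusion_matrix(y_true_c, y_pred_c))  # rows: true label, cols: predicted label
print(precision_score(y_true_c, y_pred_c),
      recall_score(y_true_c, y_pred_c),
      f1_score(y_true_c, y_pred_c))
```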
Example of regression as
Supervised Learning problem
https://towardsdatascience.com/what-we-can-learn-from-zillow-on-basing-a-business-around-machine-learning-646ee5daf7e0
AI-Machine Learning & Analytics
If you so wish, feel free to investigate and formulate a hypothesis about what went wrong with Zillow's Zestimate house-pricing model and its house-flipping business.
https://towardsdatascience.com/what-we-can-learn-from-zillow-on-basing-a-business-around-machine-learning-646ee5daf7e0
1st assignment - Supervised ML
For the first assignment I would like you to complete the Kaggle course “Intermediate Machine Learning” and to train your best possible regression model on the Iowa house pricing dataset. To assess how well your model performs, you will be required to report your ranking in the corresponding Kaggle competition.
I am particularly interested (this is in fact new for me) in finding out whether the use of ChatGPT can improve your results, so I would ask you to first work through the material on your own, learning the concepts and trying your own ideas, and see how far that gets you. Then try with ChatGPT to see whether it actually gets you further or not. I would expect that, without using ChatGPT and just by working through the course material, you should be able to get within the top 10% of all participants.
For the submission of the assignment, I would like to ask you to upload a single document with:
Outline
● General introduction
● Clustering
● Outlier detection
● Dimensionality reduction
● Generative models
● Recommender systems (next week)
A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis (2014)
https://scikit-learn.org/stable/modules/clustering.html
K-Means Algorithm
● Hyperparameter: Number of clusters K
● Computes K centroids that are used to define clusters
● An observation belongs to a particular cluster if it is closer to that cluster’s centroid than to any other one
Algorithm:
1. Select a value for K
2. Generate K random centroids
3. Assign observations to each cluster (label examples)
4. Compute the new centroids for each cluster
5. Repeat 3 and 4 until convergence
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
● Centroids are usually initialized uniformly at random: different runs can lead to
different clusters.
● Can also start from previous values: online learning or periodic retraining
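A minimal sketch of the algorithm above with scikit-learn (synthetic blob data; K = 4 and the number of restarts are illustrative):

```python
# Sketch: K-Means clustering on synthetic data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0)  # K = 4; 10 random restarts
labels = km.fit_predict(X)    # assign each observation to its nearest centroid
print(km.cluster_centers_)    # final centroids after convergence
```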
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
t-SNE (t-distributed Stochastic Neighbor Embedding)
Steps:
1. Compute distances and similarities for every pair of points in the high-dimensional space
2. Randomly project onto a 2-dim space and compute the similarity matrix again
3. Move points in the 2-dim space until the similarity matrix is close to the original one
4. Perplexity hyperparameter: determines the local density around a given point
Properties:
● Non-linear: difficult to interpret
● Much slower than PCA
● Captures “clustering” relations
● Mainly used for visualization
● Less common: feature engineering
https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
https://cs.stanford.edu/people/karpathy/tsnejs/
https://clauswilke.com/art/project/t-sne
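A minimal sketch with scikit-learn's TSNE on the digits dataset (the perplexity value is illustrative):

```python
# Sketch: project 64-dimensional digit images to 2-D with t-SNE for visualization
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 64-dimensional inputs
X_2d = TSNE(n_components=2, perplexity=30.0,   # perplexity controls local density
            random_state=0).fit_transform(X)
print(X_2d.shape)                              # (1797, 2)
```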
Deep Learning: Autoencoders
● Anomaly detection
● Dimensionality reduction
● “Black-box”
● Many hyperparameters
https://thispersondoesnotexist.com/
Customer Segmentation
Jupyter notebook
https://colab.research.google.com/drive/1iqt6PvdRSH6tji_4HteHD7-BB6FFihFT?usp=sharing
Netflix Prize
Recommender systems are everywhere:
Netflix, Amazon, Google, …
Training data set
● 100,480,507 ratings
● 480,189 users
● 17,770 movies.
Each training rating is a quadruplet
<user, movie, date of grade, grade>.
The user and movie fields are integer IDs, while grades are
from 1 to 5 (integer) stars
The Netflix Prize (2006-2009)
RecSys conferences
Content based vs Collaborative filtering
Content-based methods describe users and items by their known metadata. Each item i is represented by a set of relevant tags, e.g. movies on the IMDb platform can be tagged as “action”, “comedy”, etc.
Pro: can be used in the cold-start problem.
Con: does not use the full set of user-item interactions; treats each user independently.
Explicit Feedback
To collect explicit feedback, the system must ask users to provide their ratings for items. Since this requires direct participation from the user, it is often not easy to collect.
User-Item Matrix
● Cells are user preferences, r_ij, for items
● Sparse matrix; sometimes better to store the triplets (user, item, rating)
● Preferences can be ratings, or binary (buy, click, like)
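A minimal sketch building such a sparse matrix from its triplets with SciPy (the toy triplets are illustrative):

```python
# Sketch: store the sparse user-item matrix as (user, item, rating) triplets
import numpy as np
from scipy.sparse import csr_matrix

users   = np.array([0, 0, 1, 2])    # row indices
items   = np.array([1, 3, 0, 2])    # column indices
ratings = np.array([5, 3, 4, 1])    # r_ij values

R = csr_matrix((ratings, (users, items)), shape=(3, 4))
print(R.toarray())   # dense view, only for small examples
```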
● Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback
datasets." 2008 Eighth IEEE international conference on data mining. Ieee, 2008.
● Zhang, Shuai, et al. "Deep learning based recommender system: A survey and new
perspectives." ACM computing surveys (CSUR) 52.1 (2019): 1-38.
● Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2010). Recommender systems: an
introduction. Cambridge University Press.
Blog Posts
Motivation
Explainable AI
Interpretability, also often referred to as explainability, in artificial intelligence (AI) refers to
the study of how to understand the decisions of machine learning systems, and how to
design systems whose decisions are easily understood, or interpretable.
Overview
https://www.kaggle.com/learn/machine-learning-explainability
Permutation importance
● Proposed by Leo Breiman in 2001
(the guy who invented Random Forests)
○ Link to RF original paper
https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html
Permutation importance
● Intuitive and model agnostic
● Calculated only on the evaluation set, once the model is trained
(Figure: the original evaluation set next to a copy with one feature column shuffled)
https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html
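A minimal sketch with scikit-learn's permutation_importance (synthetic data; the model choice is illustrative):

```python
# Sketch: permutation importance computed on a held-out evaluation set
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)  # average drop in score when each feature is shuffled
```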
Partial dependence plots
Partial dependence plots (PDP) show the dependence between the target response and
a set of input features of interest, marginalizing over the values of all other input features
(the ‘complement’ features). Intuitively, we can interpret the partial dependence as the
expected target response as a function of the input features of interest.
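A minimal sketch with scikit-learn's partial_dependence (recent scikit-learn versions return a Bunch with an "average" field; the data and model are illustrative):

```python
# Sketch: partial dependence of the prediction on one feature of interest
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

pd_result = partial_dependence(model, X, features=[0], grid_resolution=20)
print(pd_result["average"].shape)  # response averaged over the grid for feature 0
```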
To Learn more…
https://christophm.github.io/interpretable-ml-book/
AI-Machine Learning & Analytics
A brief history
Supervised Learning
Biological inspiration
Perceptron
• Very old model (1962)
• Linear combination of the input variables (like LR)
Adding non-linearity
Input, hidden and output layers
• Solution: Add hidden layer with many units
Representation Learning
• Remember the feature engineering step:
○ Key for good performance (adds value)
○ Very costly, and depends on specific knowledge
• Example: extract numerical features from non-tabular data
like audio, video, images or text
Traditional ML vs Deep Learning
• Deep Learning can be interpreted as a 2-step process:
○ Create new variables by computing linear combinations of the original ones
○ Fit a simple model in the new representation
• Key idea: everything is trained automatically at the same time
End to end approach
• Advantages:
○ No need for specific domain knowledge
○ Less costly, new variables are created automatically
○ New variables tailored to the specific task
Example
• Classify cells into benign or not
• Traditional models: Most important part of the pipeline is to extract
features manually from images
○ Cell segmentation, Nucleus identification, …
• (Deep) NN can automatically extract features from the images that are
useful for this classification task
Stack more layers: Shallow -> Deep
Why Deep?
Why now?
Hardware
Data
• Exponential increase in the storage
capacity
• Internet: Collect and distribute big data
easily
○ Wikipedia (text)
○ YouTube (video)
○ Flickr, Instagram (images)
○ Twitter (graphs)
• Standard benchmark competitions where models can “compete”
Increase in computing power
Scaling laws for SOTA NLP models
Scaling laws relate model performance to scale (compute, dataset size, number of trainable parameters) for large neural NLP models.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Deep Learning: a business perspective
DL frameworks
Counting parameters
MLP
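A worked example of the counting rule for an MLP: each dense layer with n_in inputs and n_out units contributes n_in × n_out weights plus n_out biases (the layer sizes below are illustrative):

```python
# Parameters of a dense (fully connected) layer: n_in * n_out weights + n_out biases
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

# Illustrative MLP: 784 inputs -> 128 hidden units -> 10 outputs
layers = [(784, 128), (128, 10)]
total = sum(dense_params(i, o) for i, o in layers)
print(total)  # 784*128 + 128 + 128*10 + 10 = 101770
```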
Backpropagation
1. Forward pass: calculate the loss (similar to a metric) for the current value of the parameters
2. Backward pass: update the parameters using the gradient of the loss function
3. Go to 1.
Gradient Descent
Mini-batch SGD
● No need to keep the whole dataset in RAM
● Can process the dataset in pieces: batches
● Computationally more efficient
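A minimal NumPy sketch of mini-batch SGD for a linear model (the learning rate, batch size and data are illustrative):

```python
# Sketch: mini-batch SGD for a linear model, in plain NumPy
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w, lr, batch_size = np.zeros(3), 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))                       # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]               # one mini-batch at a time
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient of MSE on the batch
        w -= lr * grad                                  # update step
print(w)  # should approach [2.0, -1.0, 0.5]
```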
Summary
• Very flexible models, capable of performing complex learning tasks
• Very prone to overfitting the data due to their huge number of parameters
• Need very large training datasets to compensate
• Training is computationally costly
• Coding the models has become easier thanks to DL Frameworks
Benchmark Dataset 1: MNIST
Yann LeCun (1998)
Benchmark Dataset 2: ImageNet
https://www.image-net.org/challenges/LSVRC/
Evolution of SOTA for ImageNet Challenge
Update: https://paperswithcode.com/sota/image-classification-on-imagenet
MNIST
Arrays as sequences
Converting a 2D array into a sequence of numbers (Flatten) makes you lose translation invariance.
Transfer Learning
● Take a network trained in a different domain and/or a different task
● Adapt part of it to your domain and task (i.e. don’t start from scratch)
(Figure: a network trained on Task 1 is reused for Task 2)
● You do not need to (re)train the entire model. The base convolutional network
already contains features that are generically useful for classifying pictures.
However, the final, classification part of the pretrained model is specific to the
original classification task, and subsequently specific to the set of classes on which
the model was trained.
● Fine-Tuning: Unfreeze a few of the top layers of a frozen model base and jointly train
both the newly-added classifier layers and the last layers of the base model. This
allows us to "fine-tune" the higher-order feature representations in the base model in
order to make them more relevant for the specific task.
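A minimal Keras sketch of this pattern (the choice of MobileNetV2, the input size and the number of unfrozen layers are illustrative):

```python
# Sketch: freeze a pretrained base, add a new head, then optionally fine-tune
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,   # drop original classifier
                                         weights="imagenet")
base.trainable = False                                        # feature extraction: freeze base

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1)                                  # new head for the new task
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# Fine-tuning: unfreeze the top layers of the base and retrain with a low learning rate
base.trainable = True
for layer in base.layers[:-20]:   # keep all but the last 20 layers frozen
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
```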
To learn more…
Jeremy Howard @ FastAI
https://course.fast.ai/
Very nice & updated course to FastAI library. Covers also model deployment
Andrew Ng @ DeepLearning.ai
https://www.deeplearning.ai/
Standard course for Deep Learning fundamentals (5-course specialization)
Questions?
AI-Machine Learning & Analytics
Softmax layer
Whenever you are dealing with a multi-class classification problem, the last layer has a softmax activation, with as many units as there are classes.
https://towardsdatascience.com/sigmoid-and-softmax-functions-in-5-minutes-f516c80ea1f9
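A minimal NumPy sketch of the softmax function (the logits are illustrative):

```python
# Sketch: softmax turns raw scores (logits) into class probabilities that sum to 1
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # one value per class
print(softmax(logits))               # approx. [0.66, 0.24, 0.10]
```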
Loss function for classification
Categorical cross-entropy: L = -\sum_i \sum_c y_{ic} \log \hat{y}_{ic}
Convolutional Neural Networks (CNNs)
Convolutional layers
https://www.codingninjas.com/codestudio/library/convolution-layer-padding-stride-and-pooling-in-cnn
Stride
https://www.codingninjas.com/codestudio/library/convolution-layer-padding-stride-and-pooling-in-cnn
Padding
Learn the concepts of:
● kernel size
● stride
● padding
https://www.codingninjas.com/codestudio/library/convolution-layer-padding-stride-and-pooling-in-cnn
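These three concepts combine in the standard output-size formula, out = floor((n + 2p - k)/s) + 1, for input size n, kernel size k, stride s and padding p; a minimal sketch (the sizes are illustrative):

```python
# Sketch: output size of a convolution from kernel size k, stride s, padding p
def conv_output_size(n, k, s=1, p=0):
    return (n + 2 * p - k) // s + 1

print(conv_output_size(28, k=3, s=1, p=1))  # 28: "same" padding preserves the size
print(conv_output_size(28, k=3, s=2, p=0))  # 13: stride 2 roughly halves it
```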
Pooling
https://androidkt.com/explain-pooling-layers-max-pooling-average-pooling-global-average-pooling-and-global-max-pooling/
Dropout
https://medium.com/analytics-vidhya/a-simple-introduction-to-dropout-regularization-with-code-5279489dda1e
Early stopping
Famous ConvNet Architectures
https://paperswithcode.com/sota/image-classification-on-imagenet
Residual Connections (ResNets)
Feature extraction + classification task
Transfer Learning
● Take a network trained in a different domain and/or a different task
● Adapt part of it to your domain and task (i.e. don’t start from scratch)
(Figure: a network trained on Task 1 is reused for Task 2)
Transfer Learning Architecture
Latent representation
Autoencoders
Train the network to minimize the reconstruction loss.
The decoder is able to reconstruct the whole image out of a reduced representation (the bottleneck), so we can assume that the bottleneck is a good compressed representation of the original object.
https://www.jeremyjordan.me/autoencoders/
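A minimal Keras sketch of this encoder/bottleneck/decoder structure (the layer sizes and the 784-dim input are illustrative):

```python
# Sketch: a small dense autoencoder with a low-dimensional bottleneck
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
encoded = tf.keras.layers.Dense(32, activation="relu")(inputs)     # bottleneck
decoded = tf.keras.layers.Dense(784, activation="sigmoid")(encoded)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss
# autoencoder.fit(X, X, epochs=10)  # trained to reproduce its own input
```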