Quiz 1 Materials


AI-Machine Learning

& Analytics
David Gómez-Ullate Oteiza

Session 2 - Supervised ML on tabular data

1
Type of data

Structured Data
● Tables → “Traditional” ML

Unstructured Data
● Images/video → Computer Vision (CV)
● Text → Natural Language Processing (NLP)
● Audio → Speech recognition

2
Supervised ML

3
Contents

4
What kind of learning would you use?

Credit card fraud detection Car insurance fraud detection

5
What kind of learning would you use?

Autonomous Driving Medical Image Diagnosis

6
What kind of learning would you use?

Object detection Customer segmentation

7
What kind of learning would you use?

Image generation Playing Games

8
Supervised learning: Terminology
● Collection of labeled examples. Also called:
○ samples
○ observations
● Several variables per example. Also called:
○ inputs
○ predictors
○ attributes
○ features
○ covariates
○ independent variables
● One of the variables is of special interest:
○ label
○ target
○ output
○ dependent variable

9
Contents

1. Train regression models to make predictions on tabular data


2. Take part in Kaggle competitions
3. Split your dataset in train-val-test sets
4. Understand metrics to assess model’s performance
5. Handle missing data
6. Handle categorical predictor variables
7. Create pipelines
8. Hyperparameter tuning and cross validation
9. Avoid overfitting via regularization
10. Learn to detect and avoid data leakage

10
Homework

1. Open an account on Kaggle


2. Go to Kaggle Learn (courses) and work through:
a. Intro to Machine Learning

b. Intermediate Machine Learning


3. Submit your regression model to the Housing Prices Competition

4. Fiddle around with your model and get it into the top 10% of the
Leaderboard

11
Model selection
For every problem, we can choose from a large set of models

To name but a few:


● Linear models (LR, Ridge, Lasso, etc.)
● Classification Trees (CART)
● Support Vector Machines (SVM)
● k-Nearest Neighbors
● Ensemble methods (Random Forest)
● Bagging, AdaBoost, etc.
● Neural networks (a large family)

• Most of them are implemented in the scikit-learn (sklearn) Python library
• They are ready to use with a common syntax and a few shared methods (see the sketch below)
• Each of them has its own set of tunable hyper-parameters
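A minimal sketch of that shared API, using synthetic data and a few of the models listed above (the model and data choices here are illustrative assumptions):

```python
# Any of the estimators can be swapped in: they all expose the same
# fit / predict interface, with model-specific hyper-parameters in the constructor.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

for model in [Ridge(alpha=1.0),
              KNeighborsRegressor(n_neighbors=5),
              RandomForestRegressor(n_estimators=100, random_state=0)]:
    model.fit(X, y)            # estimate model parameters from data
    preds = model.predict(X)   # same prediction syntax for every estimator
    print(type(model).__name__, preds[:3])
```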
12
(Hyper) parameters

● Model parameters are estimated from data


automatically
● Model hyperparameters are set manually*
before estimating model parameters
● Model hyperparameters cannot be estimated
directly from data
● Hyperparameter tuning: Heuristics, rules of
thumb, copy values used on other problems, trial
and error.
COMPUTATIONALLY EXPENSIVE

13
No Free Lunch Theorem
There is no classification method
that systematically outperforms
others on a wide range of problems

The optimal choice of classification


algorithm is problem specific

… but RF or XGBoost on tabular data should work fine out of the box,
and even better after some hyperparameter tuning. 14
Choosing a model
Given a supervised learning problem, how to choose the best model / hyperparameters?

BEST = HIGHER SCORE = SMALLER ERROR


1. State-of-the-art answer (Panoramix approach):

● hyper-parameter tuning (grid search, stochastic sampling, Bayesian methods, etc.),
● model selection, etc.

2. Near future answer: Auto ML (Bender’s approach)

● Automate all of the above -> need a lot of computing power


● (but almost no human intervention)

3. Pragmatic answer:

● Don’t bother to strive for the best, settle for one that’s good enough for your purposes.

15
Overfitting

● We can shrink error almost indefinitely by increasing model complexity


● With enough parameters, you can memorize the dataset used for training!

16
Overfitting

● Model is too simple -> Underfitting (cannot explain the data)


● Model is too flexible/complex -> Overfitting (does not generalize well)

• The larger the training set, the more complex the model can be (without overfitting)
• The complexity/flexibility of a model grows with its number of tunable parameters

17
Partitioning the Data

Problem: How well will our model perform with new data?
Estimate generalization error

Solution: Separate data into two parts


● Training set: tune model parameters
● Test set: estimate performance on “new” data
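As a sketch of this two-way split with sklearn (synthetic data assumed):

```python
# Parameters are tuned on the training set; the untouched test set is used to
# estimate performance on "new" data, i.e. the generalization error.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```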

18
Assessing under/overfitting
We want the point that minimizes test error

19
Strategy

1. Split dataset into train/test


2. Train N models* on train set
3. Evaluate the models on test set
4. Choose best performing model
5. Error on test set ≈ Generalization error

* Here each “model” means a model class together with a specific choice of its hyperparameters

20
Hyperparameter tuning:
Experiment Tracking Tools

https://mlflow.org/

https://wandb.ai/site

21
Hyperparameter optimization
Balance exploration and exploitation (local vs global search)

Grid search Random search Bayesian Optimization

https://towardsdatascience.com/hyperparameters-tuning-from-grid-search-to-optimization-a09853e4e9b8
https://en.wikipedia.org/wiki/Hyperparameter_optimization
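A sketch of the first two strategies with sklearn (the model and parameter ranges are illustrative assumptions):

```python
# Grid search tries every combination exhaustively; random search samples a
# fixed number of configurations, which scales better in high dimensions.
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=5, random_state=0)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 6, None]},
    cv=5,
).fit(X, y)

rand = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 12)},
    n_iter=10, cv=5, random_state=0,
).fit(X, y)

print(grid.best_params_, rand.best_params_)
```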
22
Adding a validation set

We need an extra held-out “train” set to choose the hyperparameters

Solution: Separate data into three parts


● Training set: tune model parameters
● Validation set: tune model hyperparameters
● Test set: estimate performance on “new” data

23
Train / Val / Test split

24
Cross Validation

● When splitting the data into train/val/test we can introduce biases


● Randomly, we can over/under-estimate the performance of the
model
● Simple solution: Generate several splits
● Each one will be biased, but in random directions
● Aggregating, we can estimate the mean of the performance and its
standard deviation

25
Cross Validation

Problem with one hold-out set:


Statistical fluctuations

Solution:
K-fold cross validation

If k=N (number of samples):


Leave one out (LOO) cross validation

26
Cross Validation

● Repeated partitioning = cross-validation (“cv”)


● k-fold cross validation, e.g. k=5
○ For each fold, set aside ⅕ of data as validation
○ Use full remainder as training
○ The validation folds are non-overlapping
○ Better estimation of performance - Less statistical noise
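A k-fold sketch in sklearn (k=5, synthetic data assumed):

```python
# cross_val_score trains and evaluates the model once per fold, so we get a
# mean score and its spread rather than a single noisy estimate.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=5, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=5, scoring="neg_mean_absolute_error")
print("MAE: %.2f +/- %.2f" % (-scores.mean(), scores.std()))
```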

27
Strategy (Refined)

1. Split dataset into train/val/test (possibly k-fold CV)


2. Train N models* on train set
3. Evaluate the models on val set
4. Choose best performing model (on val set)
5. Train ONLY this model on the combined train+val set
6. Error on test set ≈ Generalization error

* Here each “model” means a model class together with a specific choice of its hyperparameters

28
Loss functions

Regression problems: the model outputs predictions $\hat{y}_i$ for real-valued targets $y_i$.

Mean Absolute Error (MAE, L1 norm):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

Mean Squared Error (MSE, L2 norm):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

MAE is more robust to outliers; MSE is easier to optimize because its gradient is well-behaved.
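Both losses are available as sklearn metrics; a tiny sketch with made-up numbers:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(mean_squared_error(y_true, y_pred))   # 0.375
```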

David Gómez-Ullate - Topics in IS 29


Loss functions

Classification problems: the target is a category. Write $p_{ik}$ for the probability the model assigns to instance $i$ belonging to class $C_k$, and $y_{ik} = 1$ if instance $i$ belongs to class $C_k$ (0 otherwise).

Loss for example $i$ (categorical cross-entropy):

$$L_i = -\sum_{k} y_{ik} \log p_{ik}$$

David Gómez-Ullate - Topics in IS 30


Regularization

Introduce an extra term in the loss function to penalize very complex models

Linear model: $\hat{y} = w_0 + \sum_j w_j x_j$

L2-regularization (ridge): $L = \mathrm{MSE} + \alpha \sum_j w_j^2$

L1-regularization (lasso): $L = \mathrm{MSE} + \alpha \sum_j \lvert w_j \rvert$

Elastic net: $L = \mathrm{MSE} + \alpha_1 \sum_j \lvert w_j \rvert + \alpha_2 \sum_j w_j^2$
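All three regularized models share the sklearn API; a sketch (the alpha values are illustrative assumptions):

```python
# alpha controls the strength of the penalty; l1_ratio mixes L1 and L2 for
# elastic net. Lasso tends to drive some coefficients exactly to zero.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=5, random_state=0)
for model in [Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)]:
    model.fit(X, y)
    print(type(model).__name__, "non-zero coefficients:", (model.coef_ != 0).sum())
```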

David Gómez-Ullate - Topics in IS 31


Overfitting


David Gómez-Ullate - Topics in IS 32


Metrics

Choosing what metric to report for a given task can be tricky.


Let us see the standard choices

Regression tasks

Mean Absolute Error (MAE, L1 norm):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

Mean Squared Error (MSE, L2 norm), often reported as its square root RMSE:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

MAE is easier to interpret, but MSE/RMSE have better properties for optimization (the loss is smoothly differentiable).

David Gómez-Ullate - Topics in IS 33


Metrics

Choosing what metric to report for a given task can be tricky.


Let us see the standard choices

Classification tasks (binary)


Just 2 categories: Positive (P) and Negative (N)

                   Predicted P         Predicted N
True label P       True Positive       False Negative
True label N       False Positive      True Negative

Derived metrics:
● Accuracy = (TP + TN) / (TP + TN + FP + FN)
● Precision = TP / (TP + FP)
● Recall = TP / (TP + FN)
● F1-score = 2 · Precision · Recall / (Precision + Recall)
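A sketch computing these metrics with sklearn (the toy labels are made up):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(confusion_matrix(y_true, y_pred))  # rows = true label, columns = predicted
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```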

David Gómez-Ullate - Topics in IS 34


Metrics

When shall we use precision or recall as the relevant metric?

• Cancer test (P=cancer, N=no cancer)


• Spam detection (P=spam, N=ham)
• Fraud detection (P=fraud, N=legit)
(Figures: ROC curve and confusion matrix.)

David Gómez-Ullate - Topics in IS 35


ML paradigms

Supervised Learning problem with structured data

Example of regression as a Supervised Learning problem

David Gómez-Ullate - Topics in IS 36


Supervised ML on tabular data

1. Identify features and target variable


2. Exploratory Data Analysis
3. Missing data & imputation
4. Train-Val-Test split
5. Select loss function for the problem
6. Build baseline model
7. Model selection
8. Hyper-parameter tuning

(Figure: dataset table split into features X and target y.)

David Gómez-Ullate - Topics in IS 37


Kaggle
Now that we are familiar with some basic ML jargon…
Let’s dive into a true ML competition!

David Gómez-Ullate - Topics in IS 38


Homework

1. Complete Kaggle courses on:


○ Intermediate Machine Learning (4 hours)
2. Reach top 10% in Leaderboard for House Pricing Competition
3. Investigate and formulate your hypothesis on what went wrong with Zillow
Zestimates for house pricing

https://towardsdatascience.com/what-we-can-learn-from-zillow-on-basing-a-business-around-machine-learning-646ee5daf7e0
David Gómez-Ullate - Topics in IS 39
AI-Machine Learning
& Analytics
David Gómez-Ullate Oteiza

Session 3 - Supervised ML on tabular data

1
Homework

1. Open an account on Kaggle


2. Go to Kaggle Learn (courses) and work through:
a. Intro to Machine Learning

b. Intermediate Machine Learning


3. Submit your regression model to the Housing Prices Competition

4. Fiddle around with your model and get it into the top 10% of the
Leaderboard

2
If you so wish, feel free to investigate and formulate your hypothesis on what
went wrong with Zillow Zestimates for house pricing and the house flipping
business model.

https://towardsdatascience.com/what-we-can-learn-from-zillow-on-basing-a-business-around-machine-learning-646ee5daf7e0
David Gómez-Ullate - Topics in IS 3
1st assignment - Supervised ML
For the first assignment I would like you to complete the Kaggle course “Intermediate Machine Learning” and to train your best possible
regression model on the Iowa house pricing dataset. To assess how well your model performs, you will be required to report your ranking in
the corresponding Kaggle competition.

In the assignment you will practise:

● different regression models (Random Forest and XGBoost)


● handling numeric and categorical variables
● hyper-parameter tuning and cross-validation
● missing values and data imputation

I am particularly interested (this is in fact new for me) in finding out whether the use of ChatGPT can improve your results. I would therefore
ask you to first work through the material on your own, learning your stuff and trying your own ideas, and see how far that gets you. Then try
with ChatGPT to see whether it actually gets you further or not. I would expect that, without ChatGPT, just by working through the course
material, you should be able to get within the top 10% of all participants.

For the submission of the assignment, I would like to ask you to upload a single document with:

● your certificate of completion of the Intermediate ML Course.


● A snapshot of your ranking in the Leaderboard, based on your own efforts (no ChatGPT aid)
● A snapshot of your ranking in the Leaderboard after being able to use ChatGPT
● A short discussion (1-2 paragraphs) on whether you were able to use ChatGPT to improve your results.

David Gómez-Ullate - Topics in IS 4


AI-Machine Learning
& Analytics
David Gómez-Ullate Oteiza

Session 4 - Unsupervised ML on tabular data

1
Outline

● General introduction
● Clustering
● Outlier detection
● Dimensionality reduction
● Generative models
● Recommender systems (next week)

David Gómez-Ullate Oteiza


2
Unsupervised Learning

● Dataset is a collection of unlabeled examples (only feature vectors)


● Cannot compute error between model output and true value (labels)
● Goal: Transform each feature vector into another vector or a value that can be used to
solve a practical problem.
● “Try to find patterns in data”
● Typical tasks:
○ Recommender systems: Recommend products to customers
○ Clustering: Finding groups of similar objects
○ Dimensionality reduction: Reduce number of features
○ Outlier detection: Detect inputs which are “atypical”

David Gómez-Ullate Oteiza


3
Clustering
● Most popular unsupervised learning technique
● Clustering is useful for finding groups of similar objects in a large collection of
objects, such as images or text documents.
● You can label each cluster afterwards (sampling several examples from each
cluster)
● Goal: samples in the same cluster should be more similar to each other than to
samples from other clusters
● Business knowledge required, not only for interpretation of the results, but also to
create/select features (feature engineering stage)
● Usually you need to iterate the process several times

David Gómez-Ullate Oteiza


4
Clustering
● Client segmentation
● Topic modelling (text document
segmentation)
● Classification of species
● Grouping securities in portfolios
● Grouping firms for structural analysis of
economy
● Army uniform sizes
● Segments of voters
● Feature engineering
● EDA

David Gómez-Ullate Oteiza


5
Clustering: Model zoo

A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis (2014)

David Gómez-Ullate Oteiza


6
Clustering at sklearn

https://scikit-learn.org/stable/modules/clustering.html
David Gómez-Ullate Oteiza
7
K-Means Algorithm
● Hyperparameter: Number of clusters K
● Computes K centroids that are used to define clusters
● An observation belongs to a particular cluster if it is closer to that cluster’s centroid than to
any other one

Algorithm:
1. Select a value for K
2. Generate K random centroids
3. Assign observations to each cluster (label examples)
4. Compute the new centroids for each cluster
5. Repeat 3 and 4 until convergence
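A minimal k-means sketch with sklearn (the synthetic blob data and K=4 are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the K centroids found at convergence
print(km.labels_[:10])      # cluster assignments of the first observations
```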

David Gómez-Ullate Oteiza


8
K-Means Algorithm

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

David Gómez-Ullate Oteiza


9
K-Means hyper-parameters

● Two ingredients needed to initialize K-Means:


○ How to initialize the centroids
○ Value of K

● Centroids are usually initialized uniformly at random: different runs can lead to
different clusters.

● Can also start from previous values: Online learning or periodic trainings

● How to select the appropriate value for K? Later

David Gómez-Ullate Oteiza


10
Hierarchical clustering
● Groups similar points/clusters in a hierarchical manner
● Usually based on Agglomerative algorithms: bottom-up approach
● Simplifies interpretation
● Freedom to choose bigger or smaller clusters
● Drawback: it’s slow… O(N³), not appropriate for larger datasets.

https://www.learndatasci.com/glossary/hierarchical-clustering/ David Gómez-Ullate Oteiza


11
Hierarchical clustering

David Gómez-Ullate Oteiza


12
DBscan
● Density-Based Spatial Clustering of Applications with Noise
● Clusters can have arbitrary shapes
● Noise points can be considered anomalies
● Classifies each point into one of 3 categories:
○ core point (it has at least “min points” neighbours within its radius),
○ border point (not a core point, but has a core point within its vicinity),
○ noise point (any point that is neither a core point nor a border point).

David Gómez-Ullate Oteiza


13
DBscan
● The number of clusters is not a hyperparameter: it follows from the two hyperparameters below.
● There are two hyperparameters, which together act as a proxy for density:
○ epsilon (a radius around a point)
○ min points (the minimum number of points within epsilon for a point to be classified as a core point / dense region)
● There are rules of thumb for setting them (see the sketch below).
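A DBSCAN sketch with sklearn (the eps and min_samples values are illustrative; two-moons data is a classic example of the arbitrary cluster shapes it can handle):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = radius, min_samples = "min points"
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(db.labels_ == -1))  # label -1 marks noise / candidate anomalies
```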

David Gómez-Ullate Oteiza


14
DBscan

https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

David Gómez-Ullate Oteiza


15
Choosing the number of clusters
● Business knowledge. Some applications come with a given value
○ Example: A company want to create 10 segments

● There are automatized ways:


○ Dunn Index = (min(Inter cluster distance)) / (max(Intra cluster distance))
○ Elbow-method
○ Silhouette: how similar an object is to its own cluster (cohesion) compared to
other clusters (separation)
Example in sklearn
○ In general, very hard to define “better”, no performance metric

David Gómez-Ullate Oteiza


16
Elbow method
● Plot the average distance of every point to its cluster centroid (cohesion) as a function of K
● It can be used to assess hyperparameters of different algorithms
● An elbow-shaped curve is generated
● There are simple heuristics to determine where the “elbow” is
● It is a bit arbitrary (see the sketch below)
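A sketch of the elbow plot using k-means inertia (the within-cluster sum of squared distances) as the cohesion measure:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, "o-")   # look for the bend ("elbow") in this curve
plt.xlabel("K")
plt.ylabel("inertia (cohesion)")
plt.show()
```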
Source: “Research on K-Value Selection Method of K-Means Clustering Algorithm” (open-access article)

David Gómez-Ullate Oteiza


17
Clustering: Model zoo

David Gómez-Ullate Oteiza


18
Anomaly/Outlier detection
● Classify observations as “normal” or “anomalous”
● Problem: data is not labelled! We cannot use classification
● Try to find “suspicious” patterns in the dataset
● Goal: the output is a real number indicating how “typical” an example is in the
dataset
● Examples:
○ Network intrusion problem (by detecting abnormal network packets that are
different from a typical packet in “normal” traffic)
○ Detecting novelty (such as a document different from the existing
documents in a collection)
○ Machine failures in industry
○ Electricity consumption peaks

David Gómez-Ullate Oteiza


19
Isolation Forest
● Idea: Isolate anomalies instead of profiling normal points
● Assumption about anomalies / outliers:
○ Few: They are a minority
○ Different: Their features are very different from those of normal instances
● Algorithm:
○ Build a very deep decision tree
○ Outliers are easier to isolate / fewer partitions are needed

David Gómez-Ullate Oteiza


20
Isolation Forest: Example
Outlier: 3 partitions to isolate. Normal: 9 partitions to isolate

David Gómez-Ullate Oteiza


21
Isolation Forest: Example
● Counting partitions is the same as computing the depth of the leaf node
● Assign an outlier score to every observation based on how deep they are in
the tree

https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html David Gómez-Ullate Oteiza
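A sketch with sklearn's IsolationForest (the injected outliers and the contamination value are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(0, 1, size=(200, 2)),    # "normal" points
                    rng.uniform(-6, 6, size=(10, 2))])  # a few injected outliers

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
print(iso.predict(X)[-10:])      # -1 = outlier, 1 = inlier
print(iso.score_samples(X)[:3])  # lower score = isolated in fewer splits = more anomalous
```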


22
Dimensionality Reduction
● The model’s output is a feature vector with fewer dimensions than the input
● Exploratory data analysis / Interpretation:
○ the scientist has a feature vector that is too complex to visualize (it has more
than three dimensions).
○ The dimensionality reduction model can transform that feature vector into a
new feature vector (by preserving the information up to some extent) with
only two or three dimensions.
○ This new feature vector can be plotted on a graph.
● Reduce “noise” in data, reduce overfitting in algorithms
● Anonymize features

David Gómez-Ullate Oteiza


23
PCA
● Principal Component Analysis (PCA) is the oldest of the techniques. It is also, by
far, the fastest option.
● Feature engineering: find the optimal value of the reduced dimensionality
experimentally as part of the hyperparameter tuning process.
● Linear algorithm: less practical for visualization purposes as compared to the other
techniques, but interpretable

David Gómez-Ullate Oteiza


24
PCA: Algorithm
● Find the principal axes of the dataset: diagonalize covariance matrix
● Sort the axes by the amount of variance explained along each one
● Select the top N axes (number of principal components)
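A PCA sketch in sklearn (dimensions chosen for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA

X, _ = make_regression(n_samples=200, n_features=10, random_state=0)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # project onto the top 2 principal components
print(X_2d.shape)                      # (200, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component
```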

David Gómez-Ullate Oteiza


25
t-distributed stochastic neighbor embedding (t-SNE)
Laurens van der Maaten (2009): check his explanation on Google Tech Talk

Properties:
● Non-linear: difficult to interpret
● Much slower than PCA
● Captures “clustering” relations
● Mainly used for visualization
● Less common: feature engineering

Steps:
1. compute distances and similarities for every point in the high-dimensional space
2. random projection onto 2-dim space and compute the similarity matrix again
3. move points in 2-dim space until the similarity matrix is close to the original one
4. perplexity hyper-parameter: determines the local density around a given point

https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding

https://distill.pub/2016/misread-tsne/ David Gómez-Ullate Oteiza


26
t-SNE

Project the semantic content of 200 tweets onto a 2-dimensional space to analyze proximity and similarity between authors.

https://cs.stanford.edu/people/karpathy/tsnejs/

David Gómez-Ullate Oteiza


27
t-SNE as art generator

https://clauswilke.com/art/project/t-sne
David Gómez-Ullate Oteiza
28
Deep Learning: Autoencoders
● Anomaly detection
● Dimensional reduction
● “Black-box”
● Many hyperparameters

Unsupervised Anomaly Detection in Flight Data Using Convolutional Variational Auto-Encoder

David Gómez-Ullate Oteiza


29
Generative Models
● "Generative" describes a class of statistical models that contrasts with
discriminative models.
○ Generative models can generate new data instances
○ Discriminative models discriminate between different kinds of data instances
(supervised)
● Example: A generative model could generate new photos of animals that look
like real animals, while a discriminative model could tell a dog from a cat
● A generative model includes the distribution of the data itself, and tells you how
likely a given example is
● For example, models that predict the next word in a sequence are typically
generative models because they can assign a probability to a sequence of words
● Much harder problem than classification

David Gómez-Ullate Oteiza


30
GANs
● Generative Adversarial Networks
● Combination of two NN: Generator and Discriminator
● They are trained simultaneously to make them compete

David Gómez-Ullate Oteiza


31
GANs
GANs used to be the state of the art for pictures, videos, super-resolution and even music
(they have since been surpassed by diffusion models such as Stable Diffusion)

Generative Fashion Design

Barack Obama DeepFake

https://thispersondoesnotexist.com/

David Gómez-Ullate Oteiza


32
AI-Machine Learning
& Analytics
David Gómez-Ullate Oteiza

Session 5 - Customer Segmentation

1
Customer Segmentation

David Gómez-Ullate Oteiza


2
Customer Segmentation

Dataset: Online Retail II (UCI)

David Gómez-Ullate Oteiza


3
Customer Segmentation

● Learn how to manipulate raw data from an online retailer


● Recency, Frequency and Monetary Value as relevant customer features.
● Learn to build a RFM dataframe from the original raw purchases dataframe.
● Prepare the dataframe for clustering: clean, remove outliers, standardize to same scale.
● Apply clustering with k-means and reason about a good value for k (# of clusters)
● Interpret the results by looking at the average RFM values in each cluster
● If possible, attach labels to each group, decide which segment is the most valuable
for your company, design tailor-made strategies for each group, etc. (see the sketch below)
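A pandas sketch of the RFM construction; the column names (Invoice, InvoiceDate, Quantity, Price, "Customer ID") follow the Online Retail II dataset but should be checked against your copy, and the file path is hypothetical:

```python
import pandas as pd

df = pd.read_excel("online_retail_II.xlsx")          # hypothetical local path
df["Amount"] = df["Quantity"] * df["Price"]          # monetary value of each line
snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = df.groupby("Customer ID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("Invoice", "nunique"),                              # number of distinct invoices
    Monetary=("Amount", "sum"),                                    # total spend
)
print(rfm.head())
```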

Jupyter notebook
https://colab.research.google.com/drive/1iqt6PvdRSH6tji_4HteHD7-BB6FFihFT?usp=sharing

David Gómez-Ullate Oteiza


4
AI-Machine Learning
& Analytics
David Gómez-Ullate Oteiza

Session 6 - Recommender systems

1
Netflix Prize
Recommender systems are everywhere:
Netflix, Amazon, Google, …

In 2006, Netflix started a competition to improve
the accuracy of predictions about movie
recommendations (the prize was awarded in 2009)

More on the Netflix Prize

More on success stories:


● Google ML & AI Success Stories
● Nvidia Deep Learning & AI Success Stories

2
Netflix Prize
Training data set
● 100,480,507 ratings
● 480,189 users
● 17,770 movies.
Each training rating is a quadruplet
<user, movie, date of grade, grade>.
The user and movie fields are integer IDs, while grades are
from 1 to 5 (integer) stars

Qualifying set (2,817,131 ratings) consisting of:


● Test set (1,408K ratings) - determine winners
● Quiz set (1,408K ratings) - calculate leaderboard
scores

Predictions could be real numbers 1-5


Metric is RMSE

3
The Netflix Prize (2006-2009)

4
RecSys conferences

5
Content based vs Collaborative filtering
Content-based methods describe users and
items by their known metadata. Each item i is
represented by a set of relevant tags; e.g.
movies on the IMDb platform can be tagged
as “action”, “comedy”, etc.
Pro: can be used in the cold-start problem.
Con: does not use the full set of user-item
interactions, and treats each user independently.

Collaborative filtering methods do not use
item or user metadata; instead they try to
leverage the feedback or activity history of
all users in order to predict the rating of a user
on a given item, by inferring
interdependencies between users and items
from the observed activities.

David Gómez-Ullate Oteiza


6
Recommender systems
● Main purpose: Product recommendation
● Two approaches (or both approaches simultaneously):
○ User based: Find people with similar interests and recommend our user the
same items.
○ Item based: Look at the items similar to ones which the user bought earlier,
and recommend products which are like them.
● Algorithm:
○ Find out how many users/items in the database are similar to the given
user/item.
○ Assess those users/items to predict what grade the user would give this
product, weighting them by how similar they are to the given user/item.
● What does “most similar” mean in this algorithm?

David Gómez-Ullate Oteiza


7
Similarity
● Like nearest-neighbor algorithm
● Euclidean distance does not do well
● Correlation/Cosine similarity does better
● For each user pair, find the co-rated items, calculate correlation between the
vectors of their ratings for those items
● “Cold start” problem: for users with just one item, or items with just one
neighbor, neither cosine similarity nor correlation produces a useful metric

David Gómez-Ullate Oteiza


8
Explicit vs implicit recommender systems
Implicit Feedback
There is no user participation required to
gather implicit feedback, unlike the
explicit feedback. The system
automatically tracks users’ preferences
by monitoring the performed actions,
such as which item they visited, where
they clicked, which items they
purchased, or how long they stayed on a
web page. Another advantage of
implicit feedback is that it reduces the
cold-start problem.

Explicit Feedback
To collect explicit feedback, the system must
ask users to provide their ratings for items. Because this requires direct
participation from the user, it is often not easy to collect.
David Gómez-Ullate Oteiza
9
User-Item Matrix
● Cells are user preferences, r_ij, for items
● Sparse matrix, sometimes better to save the triplets (user, item, rank)
● Preferences can be ratings, or binary (buy, click, like)

David Gómez-Ullate Oteiza


10
Collaborative Filtering: User based
● The features of each user are their ratings on all
movies (say there are M of them)
● Given a user, use cosine similarity in
M-dimensional space to identify its
k-nearest users (people that gave similar
ratings to the same movies, i.e. users with a
similar taste)
● Consider all the items the neighbours
rated/purchased, except for co-rated ones
● Make predictions on empty cells for the
user, using the ratings made by its
neighbours.
● Can also take the weighted arithmetic
mean according to the degree of similarity
to fill empty cells in the table
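A toy sketch of the similarity step (the 4-user × 5-movie rating matrix is made up; 0 means unrated):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

R = np.array([[5, 4, 0, 0, 1],   # rows = users, columns = movies
              [4, 5, 0, 0, 2],
              [0, 0, 5, 4, 0],
              [1, 0, 4, 5, 0]])

sim = cosine_similarity(R)                  # user-user similarity matrix (4 x 4)
neighbours = np.argsort(sim[0])[::-1][1:]   # users most similar to user 0, excluding itself
print(neighbours)                           # use their ratings to fill user 0's empty cells
```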
David Gómez-Ullate Oteiza
11
Collaborative Filtering: Item based
● Features of each movie are the ratings
given to it by all of the users, say there are
N of them.
● Use cosine-similarity in N-dimensional space
to find the closest movies to a given one.
(i.e. movies are close if they received similar
ratings by users)
● If you know that a user watched a certain
movie, recommend other movies that are
similar according to this feature
representation.

David Gómez-Ullate Oteiza


12
Cold start problem
● If there is a new movie that has not been rated yet, we cannot use
Item-based recommendation (we do not know what its features are).
● If there is a new user in the system, we cannot use User-based
recommendation (we do not know what their taste is like).

David Gómez-Ullate Oteiza


13
Collaborative Filtering: Matrix Factorization
● Based on linear algebra matrix factorization
● Automatically creates embeddings for users and items
● Produces dimensional reduction for free

David Gómez-Ullate Oteiza


14
Alternating Least Squares (ALS)
● Simple explanation: Link
● Further math details: Link

● Factor the matrix into 2 factors, minimizing the sum of squares between the product of the factors and the original matrix.
● Add a regularization term
● The optimization problem is non-convex, but it becomes convex if we solve for one of the factors having fixed the other one… iterate until convergence (see the sketch below).
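The objective ALS minimizes, written out (notation assumed, not from the slides: $u_i$ are user factors, $v_j$ item factors, $\lambda$ the regularization strength):

$$\min_{U,V} \sum_{(i,j)\ \text{observed}} \left( r_{ij} - u_i^{\top} v_j \right)^2 + \lambda \left( \sum_i \lVert u_i \rVert^2 + \sum_j \lVert v_j \rVert^2 \right)$$

Fixing $V$ makes this a least-squares problem in $U$, and vice versa, which is exactly the alternation described above.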

David Gómez-Ullate Oteiza


15
CF: Deep Learning
● Create embeddings: Item2Vec. More on NLP lecture
● Translate into ranking problem, two neural networks:
○ one for candidate generation
○ one for ranking

Deep Neural Networks for YouTube Recommendations (Google, 2016)

David Gómez-Ullate Oteiza


16
MovieLens Dataset
The full dataset contains 26,000,000 ratings
and 750,000 tag applications applied to
45,000 movies by 270,000 users
(almost the same size as the Netflix Prize
dataset)

We will use a smaller version:


● ~100K ratings for ~10K movies
● 600 users
https://grouplens.org/datasets/movielens/
Complete dataset description: here

David Gómez-Ullate Oteiza


17
To learn more…
Research papers

● Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback
datasets.” 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
● Zhang, Shuai, et al. “Deep learning based recommender system: A survey and new
perspectives.” ACM Computing Surveys (CSUR) 52.1 (2019): 1-38.
● Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2010). Recommender Systems: An
Introduction. Cambridge University Press.

Blog Posts

● A nice blog post with basic technical concepts


● Another blog post with a more business and practitioner perspective

David Gómez-Ullate Oteiza


18
AI-Machine Learning
& Analytics
David Gómez-Ullate Oteiza

Session 8 - Explainable ML models

1
Motivation

Prediction is sometimes not enough…

● A bank making a decision on who gets a loan


● A prison making a decision on inmates getting parole
● A doctor making a decision on a medical diagnosis
● … and many more.

Predictive vs prescriptive analytics


Prediction vs inference
Black-box vs White-box models

2
Explainable AI
Interpretability, also often referred to as explainability, in artificial intelligence (AI) refers to
the study of how to understand the decisions of machine learning systems, and how to
design systems whose decisions are easily understood, or interpretable

3
Overview

1. Permutation importance - a feature-importance algorithm that is model agnostic
2. Partial dependence plots - insightful plots that reveal the dependence of the model on some variables or combinations of variables
3. SHAP values - an additive representation that describes the influence of each predictor variable on a single, specific prediction

4
https://www.kaggle.com/learn/machine-learning-explainability

5
Permutation importance
● Proposed by Leo Breiman in 2001
(the guy who invented Random Forests)
○ Link to RF original paper

● Implemented in ELI5 library (Explain Like I’m 5)

https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html
6
Permutation importance
● Intuitive and model agnostic
● Calculated only on the evaluation set, once the model is trained

(Figure: model score on the original evaluation set vs. on a copy where one feature’s column has been shuffled.)

https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html
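Besides ELI5, sklearn ships its own implementation; a sketch (model and data are illustrative):

```python
# Each feature's column is shuffled n_repeats times on the held-out set and the
# average drop in score is reported as that feature's importance.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean score drop per shuffled feature
```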
7
Partial dependence plots
Partial dependence plots (PDP) show the dependence between the target response and
a set of input features of interest, marginalizing over the values of all other input features
(the ‘complement’ features). Intuitively, we can interpret the partial dependence as the
expected target response as a function of the input features of interest.

Number of rented bikes as a function of humidity and temperature
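A sketch of producing such plots with sklearn (the feature indices stand in for e.g. the humidity and temperature columns):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Two one-way PDPs plus a two-way PDP for the interaction of features 0 and 1.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1, (0, 1)])
plt.show()
```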


8
Partial dependence plots

9
To Learn more…

https://christophm.github.io/interpretable-ml-book/

● Model explainability with SHAP


● An introduction to explainable AI with Shapley values

10
AI-Machine Learning
& Analytics
David Gómez-Ullate Oteiza

Session 9 - Introduction to Deep Learning

1
A brief history

David Gómez-Ullate 2
Supervised Learning

David Gómez-Ullate 3
Biological inspiration

David Gómez-Ullate 4
Perceptron
• Very old model (1962)
• Linear combination of the input variables (like LR)

• Followed by a non-linear activation function
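The perceptron computation written out (standard notation, not from the slide: inputs $x_i$, weights $w_i$, bias $b$, non-linear activation $\varphi$):

$$y = \varphi\left( \sum_i w_i x_i + b \right)$$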

David Gómez-Ullate 5
Adding non-linearity

David Gómez-Ullate 6
Input, hidden and output layers
• Solution: Add hidden layer with many units

David Gómez-Ullate 7
Representation Learning
• Remember feature engineering step:
○ Key for good performance (adds value)
○ Very costly, and depends on specific knowledge
• Example: extract numerical features from non-tabular data
like audio, video, images or text

David Gómez-Ullate 8
Traditional ML vs Deep Learning
• Deep Learning can be interpreted as a 2-step process:
○ Creates new variables computing linear combinations of the
original ones
○ Fit a simple model in the new representation
• Key idea: everything is trained automatically at the same time

David Gómez-Ullate 9
End to end approach
• Advantages:
○ No need for specific domain knowledge
○ Less costly, new variables are created automatically
○ New variables tailored to the specific task

David Gómez-Ullate 10
Example
• Classify cells into benign or not
• Traditional models: Most important part of the pipeline is to extract
features manually from images
○ Cell segmentation, Nucleus identification, …
• (Deep) NN can automatically extract features from the images that are
useful for this classification task

Benign or not?

Traditional ML: manually extracted features (number of cells, size, curvature, mean radius) → classification model (SVM, Random Forest, LR, etc.)

Deep Learning: the image goes directly into the classification model (Conv-Net)

David Gómez-Ullate 11
Stack more layers: Shallow -> Deep

David Gómez-Ullate 12
Why Deep ?

David Gómez-Ullate 13
Why now?

Hardware: Increase in computing power & efficiency


Data: Increase in size of datasets (data-centric world)
Algorithmic improvements: CNN, Word embeddings, ...

David Gómez-Ullate 14
Hardware

• GPUs: Graphical processing units


○ Simpler processing units
○ Perform thousands of simpler operations in parallel (matrix
multiplications, …)
○ Computer graphics, blockchain, …
• An NVIDIA Titan X (~$1K) has 350 times the power of
a modern laptop: 6.6 trillion floating-point ops/second

David Gómez-Ullate 15
Data
• Exponential increase in the storage
capacity
• Internet: Collect and distribute big data
easily
○ Wikipedia (text)
○ Youtube (video)
○ Flicker, Instagram (images)
○ Twitter (Graphs)
• Standard benchmark competitions when
models can “compete”

David Gómez-Ullate 16
Increase in computing power

David Gómez-Ullate 17
Scaling laws for SOTA NLP models

Scaling laws relating model efficiency with size (computational time, dataset
size, number of trainable parameters) for large size neural NLP models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020). Scaling laws for neural language models. arXiv
preprint arXiv:2001.08361.

18
Deep Learning: a business perspective

David Gómez-Ullate 19
DL frameworks

David Gómez-Ullate 20
Counting parameters

MLP
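A hedged worked example (the layer sizes are assumptions, not from the slide): a dense layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ units has $n_{\text{in}} \cdot n_{\text{out}}$ weights plus $n_{\text{out}}$ biases. For an MLP with layers 784 → 128 → 10:

$$(784 \cdot 128 + 128) + (128 \cdot 10 + 10) = 100{,}480 + 1{,}290 = 101{,}770 \ \text{parameters}$$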

David Gómez-Ullate 21
Backpropagation

1. Forward pass: calculate loss (similar to a metric) for current value of parameters
2. Backward pass: update value of parameters using gradient of loss function
3. Go to 1.
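The update applied in step 2, written out (a standard formula; $\theta$ are the parameters, $\eta$ the learning rate, $L$ the loss):

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)$$

In mini-batch SGD (two slides ahead), $\nabla_{\theta} L$ is estimated on a small batch of examples instead of the full dataset.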

David Gómez-Ullate 22
Gradient Descent

David Gómez-Ullate 23
Mini-batch SGD
● Don’t need to have all dataset in RAM memory
● Can use the dataset in pieces: batches
● Computationally more efficient

David Gómez-Ullate 24
Summary
• Very flexible models able of performing complex learning tasks
• Very prone to overfitting the data / huge number of parameters
• Need very large training datasets to compensate
• Training is computationally costly
• Coding the models has become easier thanks to DL Frameworks

David Gómez-Ullate 25
Benchmark Dataset 1: MNIST
Yann LeCun (1998)

● 60,000 images train


● 10,000 images test
● 28 x 28 pixels
● grayscale
● 10 categories

26
Benchmark Dataset 2: ImageNet

https://www.image-net.org/challenges/LSVRC/
27
Evolution of SOTA for ImageNet Challenge

Update: https://paperswithcode.com/sota/image-classification-on-imagenet

28
MNIST

29
Arrays as sequences

Converting a 2D array into a sequence of numbers (Flatten) makes you lose translation
invariance.

30
Transfer Learning
● Take a network trained in a different domain and/or a different task
● Adapt part of it to your domain and task (i.e. don’t start from scratch)

David Gómez-Ullate 31
Transfer Learning

Task 1

Task 2

Keep these layers frozen Train these layers


David Gómez-Ullate 32
Transfer Learning
● Feature Extraction: Use the representations learned by a previous network to extract
meaningful features from new samples. You simply add a new classifier, which will
be trained from scratch, on top of the pretrained model so that you can repurpose
the feature maps learned previously for the dataset.

● You do not need to (re)train the entire model. The base convolutional network
already contains features that are generically useful for classifying pictures.
However, the final, classification part of the pretrained model is specific to the
original classification task, and subsequently specific to the set of classes on which
the model was trained.

● Fine-Tuning: Unfreeze a few of the top layers of a frozen model base and jointly train
both the newly-added classifier layers and the last layers of the base model. This
allows us to "fine-tune" the higher-order feature representations in the base model in
order to make them more relevant for the specific task.
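A hedged Keras sketch of the two regimes (the base model, input size and layer counts are assumptions, not from the slides):

```python
import tensorflow as tf

# 1) Feature extraction: freeze the pretrained convolutional base and train
#    only a new classifier head on top of it.
base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                         input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # new head, trained from scratch
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# ... fit the head on the new dataset here ...

# 2) Fine-tuning: unfreeze only the top layers of the base and train them
#    jointly with the head, at a low learning rate.
base.trainable = True
for layer in base.layers[:-20]:   # keep all but the last 20 layers frozen
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy")
```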

David Gómez-Ullate 33
To learn more…
Jeremy Howard @ FastAI
https://course.fast.ai/
Very nice & updated course to FastAI library. Covers also model deployment

Andrew Ng @ DeepLearning.ai
https://www.deeplearning.ai/
Standard course for Deep Learning fundamentals (5 course specialization)

Adrian Rosenbrock @ PyImageSearch


https://www.pyimagesearch.com/
More geared towards Computer Vision. Great for OpenCV

David Gómez-Ullate 34
Questions?

35
AI-Machine Learning
& Analytics
David Gómez-Ullate Oteiza

Session 12 - Convolutional Networks

1
Softmax layer
Whenever you are dealing with a classification problem, the last layer always has a softmax activation, and its number of units coincides with the number of classes.
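The softmax activation written out (a standard formula; $z_k$ is the pre-activation of output unit $k$, $K$ the number of classes):

$$\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

The outputs are positive and sum to 1, so they can be read as class probabilities.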

https://towardsdatascience.com/sigmoid-and-softmax-functions-in-5-minutes-f516c80ea1f9 2
Loss function for classification

Categorical cross-entropy

For each prediction, the cross-entropy loss is minus the log of the probability that the model assigns to the class that had the true label.
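Written out (with $p_c$ the probability the model assigns to the true class $c$):

$$L = -\log p_c$$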

3
Convolutional Neural Networks (CNNs)

3 types of layers: Convolutional, Pooling and Fully Connected.

4
Convolutional layers

Learn the concepts of:


● kernel size
● stride
● padding

https://www.codingninjas.com/codestudio/library/convolution-layer-padding-stride-and-pooling-in-cnn
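The three concepts combine into the standard output-size formula (stated here for reference; $n$ = input width, $k$ = kernel size, $p$ = padding, $s$ = stride):

$$n_{\text{out}} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1$$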

5
Stride

Learn the concepts of:


● kernel size
● stride
● padding

https://www.codingninjas.com/codestudio/library/convolution-layer-padding-stride-and-pooling-in-cnn

6
Padding
Learn the concepts of:
● kernel size
● stride
● padding

https://www.codingninjas.com/codestudio/library/convolution-layer-padding-stride-and-pooling-in-cnn

7
Pooling

Global Average Pooling is used as a replacement for flattening the spatial features before applying the FC layers.

https://androidkt.com/explain-pooling-layers-max-pooling-average-pooling-global-average-pooling-and-global-max-pooling/
8
Dropout

Dropout is a regularization technique


It prevents neural networks from overfitting

Original Dropout paper from 2014 in JMLR [link]

https://medium.com/analytics-vidhya/a-simple-introduction-to-dropout-regularization-with-code-5279489dda1e

9
Early stopping

Early stopping: stop training automatically when a specific performance measure (e.g. validation loss or accuracy) stops improving

10
Famous ConvNet Architectures

This is a bit of a “historical” slide to tell the story of modern Computer Vision development associated with the ImageNet Challenge. The current best-performing models (state of the art) are based on transformers.

https://theaisummer.com/cnn-architectures/

https://paperswithcode.com/sota/image-classification-on-imagenet
11
Residual Connections (ResNets)

Residual connections skip layers and propagate the original input unchanged.

They make it easier for the network to learn the identity transformation “when nothing happens”.

Original ResNet paper from 2015 [link]

12
Feature extraction + classification task

● Convolutional and Pooling layers are feature extractors (the base).
● Fully Connected layers perform classification on the extracted features.
● Where are most of the weight parameters of the CNN?

13
Transfer Learning
● Take a network trained in a different domain and/or a different task
● Adapt part of it to your domain and task (i.e. don’t start from scratch)

Task 1

Task 2

Keep these layers frozen Train these layers

David Gómez-Ullate 14
Transfer Learning Architecture

● Conv layers are able to extract relevant features: keep them frozen
● The last Fully Connected layers (the head) need to be replaced and re-trained with the new task and dataset

David Gómez-Ullate 15
Latent representation

The input image has 227×227×3 pixels (≈150K numbers). The next-to-last layer has 4096 units; we can take these to be the features extracted from the image. The last layer has a softmax activation and 1000 units (= number of classification classes).

David Gómez-Ullate 16
Auto encoders
Train the network to minimize the reconstruction loss. The decoder is able to reconstruct the whole image out of a reduced representation (the bottleneck), so we can assume that the bottleneck is a good compressed representation of the original object.

https://www.jeremyjordan.me/autoencoders/

David Gómez-Ullate 17
