lecture2_introduction_ml
1 / 72
Today’s Lecture
Outline
1 Introduction
3 Feature Engineering
Introduction and Basic ML concepts
What is required to reach True AGI?
Adaptivity or Adaptability
Be able to learn from previous errors and improve overall performance
by learning from experience.
Learning
Learning is the process of acquiring new understanding, knowledge,
behaviors, skills, values, attitudes, and preferences.
Generalization
Lessons and learned knowledge should be general enough that they
also work in (new) previously unseen environments.
The Qualification Problem
Modeling the real world is very difficult. The Qualification
problem is that it is impossible to enumerate or list all
pre-conditions that are required for an action in the real
world to succeed and have the intended effect. For
example:
• To attend an online lecture, you need to have an
internet connection.
• You also need a power source for your computer.
• And the computer should not be broken.
• And the network interface should also work correctly.
• And so on...
Actions can also have unintended consequences, and
enumerating these is also impossible.
Data Driven Paradigms
In our time, the AI field is a combination of numerous
techniques. The (currently) dominant set of techniques is
machine learning, meaning that algorithms learn from
data and/or experience.
Learning Paradigms
Supervised Learning
In supervised learning, the agent learns through supervision; that is, a
teacher tells the agent the correct answer for a given input or stimulus,
and the agent should learn the mapping between input and desired output.
Producing labels for datasets is expensive (in time and money), and
in many cases there might not be a correct answer to provide a label.
Learning Paradigms
Unsupervised Learning
In Unsupervised Learning, in contrast to Supervised Learning, the
agent learns from data directly, without any kind of supervision.
There is no correct answer. The agent should learn structure,
properties, and relationships in the data itself.
Learning Paradigms
Reinforcement Learning
Figure: Diagram representing how the agent interacts with the
environment: the agent performs an action, then receives a reward
and a new state observation. The process is repeated sequentially
until a final state is reached, or the episode ends. The agent
learns through multiple episodes.
Figure from https://en.wikipedia.org/wiki/Reinforcement_learning
Learning Paradigms
Transfer Learning
It is a combination of previous paradigms, where learned knowledge is
used to learn or improve on new tasks. Generally this means part of
the learned knowledge transfers from previous tasks into new ones.
Learning Paradigms
Learning Paradigms
The Cake/LeCake - Relative Importance of
Learning Paradigms
What is not Machine Learning?
Core Idea of Machine Learning
Machine Learning, as a field, has many methods, techniques, and
methodologies, but in the end the core idea or most important
concept is:
Generalization
This means you train a model on a dataset, and it should still work
(make correct predictions) on input data that differs from the
training set; that is, it should be useful outside of its training space.
Outline
1 Introduction
3 Feature Engineering
Basic ML concepts
Model
A model is the actual equations that define how the input is
transformed into output predictions. For example:
f(x) = \sum_i w_i x_i + b
Parameters or Weights
These are the variables that change during the training process,
implicitly encoding the knowledge that is learned. We usually denote
these with letters w and b (weights and biases).
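As a concrete illustration, the linear model above can be sketched in a few lines of NumPy; the weights w and bias b here are made-up values for illustration, not learned ones.

```python
import numpy as np

def linear_model(x, w, b):
    """Compute f(x) = sum_i w_i * x_i + b for a single input vector x."""
    return np.dot(w, x) + b

# Illustrative (not learned) parameters for a 3-feature input.
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 3.0])
prediction = linear_model(x, w, b)  # 0.5*1 - 1.0*2 + 2.0*3 + 0.1
```

Training (covered next) is the process of finding values of w and b that make such predictions match the labels.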
Basic ML concepts
Loss Function
A function that makes a comparison between the model’s prediction
and the ground truth labels. It provides direct supervision to the
model so it can learn successfully. Loss functions are selected to learn
a particular task, in coming lectures we will cover the full details.
Training Data
The data that you use to train the model, represented
mathematically as the x and y variables, where x represents inputs,
and y are the labels associated with that input.
Basic ML Concepts
Classification
Classification is the task where the output variable and/or labels are
discrete, that is, the opposite of continuous. A model that performs
classification is called a classifier.
The set of possible outputs for the classifier is called the class set,
and the individual elements are called classes.
Figure: Binary Classification (two classes)
Figure: Multi-Class Classification (> 2 classes)
Basic ML Concepts
Regression
Regression is the task where the output variable and/or labels are
continuous. A model that performs regression is called a regressor.
Figure: Regression in 2D (training data vs. model prediction)
Basic ML Concepts
Multi-Task Learning
• It is also possible to perform/learn more than one task at a
time.
• For example, performing object detection requires object
recognition (classification) and bounding box estimation
(regression).
• The idea is that tasks have some input or features in common,
so performance is higher when learning these tasks together than
when learning each task in isolation.
Learning as Optimization
All machine learning models learn from data, using a
general formulation that consists of the following
optimization problem:
\theta^* = \arg\min_\theta \sum_{i=1}^{n} L(f(x_i), y_i, \theta) \quad (1)
The solution to the optimization problem is θ∗ , the
weights/parameters that produce the best predictions,
according to what the loss function scores between the
prediction f (xi ) and the labels yi .
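A minimal sketch of solving this optimization problem with gradient descent, for a mean squared loss on a 1-D linear model f(x) = wx + b; the toy data, learning rate, and iteration count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Toy data generated from y = 2x + 1, so the optimum is w = 2, b = 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # initial parameters theta
lr = 0.1          # learning rate (a hyper-parameter)
for _ in range(500):
    pred = w * x + b
    # Gradients of the mean squared loss L = mean((pred - y)^2).
    grad_w = 2.0 * np.mean((pred - y) * x)
    grad_b = 2.0 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b
```

With noiseless data, the parameters converge close to the generating values w = 2 and b = 1.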
Trainable and Non-Trainable Parameters
In the previous formulation, we usually train the
parameters using an optimization algorithm, but there
could be multiple types of parameters.
Trainable Parameters
Parameters that are directly trained using an optimization algorithm.
Non-Trainable Parameters
Parameters that are learned from data but not through the optimization
algorithm; for example, statistics that a model accumulates indirectly
from the data passing through it.
Test Set
The set that will be used for evaluation of the trained model. It
should be an independent dataset containing samples that have not
been seen during training; it can be the remaining 20% to 30% of
the available data.
Linear Separability
This is important as the basic classification concept uses
separating hyperplanes.
Definition
Given points xi ∈ Rn in a high dimensional space (features), with
binary labels yi ∈ {0, 1}, they are linearly separable if there exists a
weight vector w ∈ Rn and a real number k such that w · xi > k for
every point with yi = 1, and w · xi < k for every point with yi = 0.
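One practical way to test this definition is the perceptron algorithm, which finds a separating hyperplane whenever one exists; a sketch under that idea, with hypothetical toy data (if no clean pass over the data is found within the epoch budget, the data is reported as not separable).

```python
import numpy as np

def perceptron_separable(X, y, max_epochs=1000):
    """Return True if a separating hyperplane w, k is found (labels in {0, 1})."""
    y_pm = 2 * y - 1                      # map {0, 1} -> {-1, +1}
    w = np.zeros(X.shape[1])
    k = 0.0                               # threshold
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y_pm):
            if yi * (np.dot(w, xi) - k) <= 0:   # misclassified point
                w += yi * xi
                k -= yi
                errors += 1
        if errors == 0:                   # one clean pass: separable
            return True
    return False

# Separable toy data: class 0 near the origin, class 1 shifted away.
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.2, 1.9]])
y = np.array([0, 0, 1, 1])
separable = perceptron_separable(X, y)

# XOR-like data is the classic non-separable case.
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_xor = np.array([0, 0, 1, 1])
not_separable = not perceptron_separable(X_xor, y_xor, max_epochs=50)
```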
Not Linearly Separable
Figure: Example of not linearly separable data for Binary
Classification.
1 Introduction
3 Feature Engineering
Basic ML Concepts
Features
This is a very important and highly misunderstood term. A feature is
a measurement, measurable property, or characteristic of a
phenomenon or physical object.
The inputs to a machine learning model are feature values, that are
used to perform classification or regression. The quality of the
features is what defines the performance of your model.
• Good features will lead to good performance of your model.
• Bad features will prevent your model from making correct
predictions; the model might not train at all.
• Features should be de-correlated and independent for best
results. If features are related to each other, then they are
redundant and not useful for learning a model.
Basic ML Concepts
Features - Examples
Images
Pixel values, edges, parts, color(s), distribution of pixel
values (histograms), distribution of edges, gradient of
image.
Text
Frequency of words, language, presence or absence of
specific words, # of grammatical errors, overall
Sound
Frequency response (Fourier transform coefficients),
response to filter banks, signal to noise ratio, length of
sound sequences.
Basic ML Concepts
Feature Vectors
To finalize the concept of a feature, we need to talk about the
concept of a feature vector. This is a vector with elements
corresponding to individual features. For example, let's try to predict
the rent for an apartment; our features could be:
A. Area in m2 (continuous).
B. Location in the city (categorical).
C. Distance to nearest train station (continuous).
D. Build year (categorical).
Discrete
Integer numbers that can be ordered and are countable. For example,
number of shoes in your apartment, most currencies (€, $), number
of persons living in a house, population of a country, etc.
Categorical
Categories, where no order can be established. For example, colors,
nationality, course code, etc.
Obtaining Features
Feature Transformations
A simple way to "improve" your features is to apply
aggregations or non-linear transformations of features. For
example:
Aggregation
Mean, standard deviation, variance, skew,
kurtosis, etc.
Polynomials
For a scalar feature x, use x, x^2, x^3, ..., x^p.
The polynomial degree p is a
hyper-parameter.
Frequency
Fourier transform of your data is commonly
used as features, with a frequency cutoff.
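The polynomial transformation above can be sketched as follows; the sample values are illustrative.

```python
import numpy as np

def polynomial_features(x, p):
    """Expand a scalar feature array x into columns [x, x^2, ..., x^p]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x ** d for d in range(1, p + 1)], axis=1)

# Each row holds [x, x^2, x^3] for one sample.
feats = polynomial_features([1.0, 2.0, 3.0], p=3)
```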
Not Useful Features
Outline
1 Introduction
3 Feature Engineering
Pre-Processing
In general, data needs to be pre-processed and normalized
before being used with any model. This is because a learned
model will only work well on data that is similar to the
training data distribution, so normalizing makes the data
more similar to this distribution and keeps it from varying
too much.
Normalization
Min-Max
For each feature x, compute the following:

x_norm = (x - min x) / (max x - min x)    (2)

This transforms all values to the range [0, 1]: the minimum
feature value is mapped to zero, the maximum to one, and
intermediate values to (0, 1). To map to an arbitrary range
[a, b] instead, use:

x_norm = (b - a) (x - min x) / (max x - min x) + a    (3)
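A minimal NumPy sketch of min-max scaling, applied per feature (column); the example matrix is made up.

```python
import numpy as np

def min_max(x, a=0.0, b=1.0):
    """Scale each feature (column) of x into the range [a, b]."""
    x = np.asarray(x, dtype=float)
    x01 = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
    return (b - a) * x01 + a

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X01 = min_max(X)              # columns now span [0, 1]
Xab = min_max(X, a=-1, b=1)   # columns now span [-1, 1]
```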
Normalization
Standard or Mean-Std
For each feature x, compute the following:
z = (x - mean(x)) / std(x) = (x - μ_x) / σ_x    (4)
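A sketch of standardization per feature (column), assuming every feature has nonzero standard deviation; the example matrix is made up.

```python
import numpy as np

def standardize(x):
    """Transform each feature (column) to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Z = standardize(X)  # each column now has mean 0 and std 1
```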
Normalization
Unit Vector
For a feature vector x, compute the following:
v = x / ||x|| = { x_i / ||x|| }_i    (5)

This makes the data lie on an n-dimensional hyper-sphere, where
||x|| is a vector norm, for example the L2 norm ||x|| = sqrt(Σ_i x_i^2).
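A sketch of unit-vector normalization with the L2 norm, here applied per sample (row); the example vectors are made up.

```python
import numpy as np

def unit_norm(x):
    """Scale each sample (row) of x to unit L2 norm."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

V = unit_norm(np.array([[3.0, 4.0], [0.0, 2.0]]))
```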
Normalization - Whitening
Z = W X    (6)

The matrix W is called a whitening matrix, and should satisfy
W^T W = Σ^{-1}, where Σ is the covariance matrix of the data.
Normalization - Whitening
Normalization - PCA Whitening
Here the matrix W is computed using the eigenvalue
decomposition on Σ, which is a matrix decomposition in
the form:
Σ = U Λ U^T    (7)

Where U contains the eigenvectors of Σ as columns, and
Λ is a diagonal matrix containing the eigenvalues of Σ.
This can be computed using functions like
np.linalg.eig. Then W is computed as:

W = Λ^{-1/2} U^T    (8)
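A sketch of computing the PCA whitening matrix with NumPy. Two implementation choices here are assumptions, not from the slides: np.linalg.eigh is used instead of np.linalg.eig because the covariance matrix is symmetric, and a small eps guards against division by zero for tiny eigenvalues.

```python
import numpy as np

def pca_whitening_matrix(X, eps=1e-8):
    """Compute W = Lambda^{-1/2} U^T from the covariance of X (rows = samples)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    return np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T

# Correlated toy data: isotropic noise mixed through a hypothetical matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
W = pca_whitening_matrix(X)
Z = (X - X.mean(axis=0)) @ W.T   # whitened data: covariance close to identity
```

Note the slide's Z = W X assumes column vectors; with samples as rows, the equivalent is Z = Xc W^T as above.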
Normalization - Images
Normalization - Images
Normalization - Histogram Equalization
Normalization - Audio
Amplitude Normalization
The overall volume of audio can also be normalized, to make sure
all samples in the dataset have similar volume. This is done in
two ways:
1. Peak Normalization. Normalizing to the maximum value across
time samples. This is similar to min-max scaling.
2. Loudness Normalization. Normalizing to an aggregate loudness
of the whole audio signal across time. For example, it can be the
root mean square (RMS), or a value related to human perception
of loudness.
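Both normalization schemes can be sketched as follows; the 440 Hz test tone and the target RMS value are arbitrary illustrative choices.

```python
import numpy as np

def peak_normalize(signal):
    """Scale so the maximum absolute amplitude is 1 (analogous to min-max)."""
    return signal / np.max(np.abs(signal))

def rms_normalize(signal, target_rms=0.1):
    """Scale so the root mean square loudness matches target_rms."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / rms)

t = np.linspace(0, 1, 8000)
quiet = 0.01 * np.sin(2 * np.pi * 440 * t)   # a quiet 440 Hz tone
peaked = peak_normalize(quiet)
leveled = rms_normalize(quiet)
```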
Normalization - Text
Text is the most difficult to normalize, since many factors
can vary:
• Stop words are usually removed.
• Numbers, dates, acronyms, and abbreviations can be
standardized to common terms (like transforming 10
to Ten, dates to ISO formats, and acronyms
removed or expanded).
• Non-alphanumeric characters are usually removed.
• Specific words can be standardized, especially if they
have multiple meanings, depending on the
application.
• Spelling across variations of the same language can
also be normalized, for example British vs Australian
vs American English.
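A minimal sketch of a few of these text normalization steps (lowercasing, stripping non-alphanumeric characters, removing stop words); the stop-word list is a tiny illustrative subset, not a standard one.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "of"}   # tiny illustrative stop list

def normalize_text(text):
    """Lowercase, strip non-alphanumeric characters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # remove non-alphanumerics
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

clean = normalize_text("The quick, brown fox is a friend of the lazy dog!")
```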
Normalization - Categorical Variables
Normalization - Categorical Variables
An unbiased way to encode categorical variables is to use
one-hot encoding.
An example:
• Cat = [1.0, 0.0, 0.0]
• Dog = [0.0, 1.0, 0.0]
• Bunny = [0.0, 0.0, 1.0]
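The encoding above can be sketched as follows; ordering the categories by first appearance is an arbitrary choice, since categorical variables have no inherent order.

```python
import numpy as np

def one_hot(labels):
    """Map category names to one-hot vectors (order = first appearance)."""
    categories = list(dict.fromkeys(labels))          # unique, insertion-ordered
    eye = np.eye(len(categories))
    encoding = {c: eye[i] for i, c in enumerate(categories)}
    encoded = np.stack([encoding[l] for l in labels]) # one row per sample
    return encoding, encoded

encoding, encoded = one_hot(["Cat", "Dog", "Bunny", "Dog"])
```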
Normalization - Regression
Normalization - Which to use?
Normalization - Importance
Normalization - (Negative) Example
Outline
1 Introduction
3 Feature Engineering
Challenges in Machine Learning
Quality of Data
The biggest bottleneck is obtaining high quality data (including
features and labels) for the task of interest. Without data, a model
cannot be learned.
Challenges in Machine Learning
Model Misspecification
Model misspecification is using the wrong model for a given dataset.
Challenges in Machine Learning
Generalization
Overall I think the biggest issue currently in AI is the lack of
generalization; that is, solutions are given to specific problems,
but general AI solutions (true AGI) do not exist yet.
• Alpha Go and variations (Alpha Go Zero and Alpha Zero) only
solved playing the game of Go, but these advances have not
benefited other applications.
• Deep Blue can play chess, but it has not served to solve some
of humanity's biggest issues (like peace, hunger, wealth
distribution, etc.).
• All models overfit to some degree, but some are useful.
Challenges in Machine Learning
Explainability
Most AI methods overall are Black Boxes, meaning that it is difficult
to interpret their decisions, or to get an explanation of how a
decision is made.
• Explanations and interpretability are required for AI to gain
human trust, and it is even legally required in some countries
(like in the EU with the GDPR).
• Explaining black box methods is very difficult, as computational
concepts do not always map to human-understandable
concepts. The use of non-linear methods also makes it difficult.
• Building methods that are easily interpretable (AKA white box
methods) is also very difficult, and in many cases they work but
are less accurate than black box models.
Challenges in Machine Learning
Bias, Safety, and Trustworthiness
As AI is used for real world applications, especially ones involving
humans, it is very important to ensure that it will not perform
unexpected actions or behave abnormally. Right now there are AI
systems that are actively hurting people, for example:
• In the US, poorly understood AI systems are being used to
decide if a person will commit a crime, or how long their
sentence should be.
• AI systems being used for monitoring of populations, violating
their human rights.
• AI systems used by banks to review and make recommendations
on loan decisions, and by companies to decide whom to
interview for a job.
• Discrimination by DUO and the Toeslagenaffaire (Benefits
scandal).
Questions to Think About
1. What is a feature?
2. Describe linear separability in your own words.
3. What is the need for normalization and
pre-processing of features?
4. Can regression labels/targets be normalized?
5. What factors can hurt the performance of a Machine
Learning model?
6. Your model does not learn, what is the most likely
cause?
Take Home Messages
Questions?