ML 23 First Lectures 2 3 v0.1
Machine Learning
Practical information
In class:
• Please keep maximum silence during the lecture (recording in progress!); of course, you may ask questions
Connect to ML:
• Material: Moodle https://elearning.di.unipi.it/
• Streaming & recordings of lectures: Teams platform
– See lecture 1 and Moodle: «FAQ and general information»
– Students enroll for attendance “in presence” (and connect to Teams) through the app “Didactic Agenda” (“Agenda Didattica”) for 654AA 23/24
– Please remember to fill in the poll (see INTRO-curricula22)
Introduction to ML: plan of the next lectures
• Introduction aims:
– Critical contextualization of ML in computer science [lect. 1 and 2]
Learning
Machine Learning (I)
Machine Learning (II): When?
Machine Learning (III): When?
A useful framework:
Learning as the approximation of an unknown function from examples
Build a function from examples
[Figure: an 8×8 image is fed to a function f, which outputs a class in 0–9]
Handwritten Digits Recognition
[Figure: an 8×8 image is fed to a function f, which outputs a class in 0–9: a classification problem]
• It is difficult to formalize an exact solution to the problem: noise and ambiguous data may be present;
• It is relatively easy to collect a set of labeled examples
Examples of x - f(x)
• Handwriting Recognition
– x: Data from pen motion.
– f(x): Letter of the alphabet.
• Disease diagnosis (from a database of past medical records)
– x: Properties of patient (symptoms, lab tests)
– f(x): Disease (or maybe, recommended therapy)
– TR (Training Set) <x,f(x)>: database of past medical records
• Face recognition
– x: Bitmap picture of person's face
– f(x): Name of the person.
• Spam Detection
– x: Email message
– f(x): Spam or not spam.
Complex data
• Protein folding
– x: sequence of amino acids
– f(x): sequence of atoms’ 3D coordinates
– TR <x,f(x)>: known proteins
– Type of x: string (variable length)
– Type of f(x): sequence of 3D vectors
• Drug design
– x: a molecule
– f(x): binding strength to HIV protease
– TR <x,f(x)>: molecules already tested
– Type of x: a graph or a relational description of atoms/chemical bonds
– Type of f(x): a real number
Overview of an ML (predictive) System
[Diagram: overview of a predictive ML system, including a VALIDATION stage]
Medical records (index i runs over attributes/columns, p over patterns/rows):
Patients | Age | Smoke | Sex | Lab Test
Pat 1    | 101 | 0.8   | M   | 1
Pat 2    | 30  | 0.0   | F   | ?
Attributes are discrete or continuous; row p is the pattern x_p.
• Each row (x, vector in bold): example, pattern, instance, sample, …
• Dimension of the data set: number of examples l
• Dimension of the input x: number of features n
• If we index the features/inputs/variables by i or j: variable x_i is (typically) the i-th feature/property/attribute/element/component of x (but sometimes, to simplify notation, the subscript index is used with other meanings)
• x_p (or x_i) is (typically) the p-th (or i-th) pattern/example/row (vector)
• x_{p,i} (for example) can be attribute i of pattern p
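As a concrete illustration of this notation, a minimal sketch in Python (the values follow the toy medical-records table above; representing the missing “?” entry as NaN is my choice):

```python
import numpy as np

# Toy data set: l = 2 examples (rows), n = 4 features (columns).
# Columns: Age, Smoke, Sex (M=1/F=0), Lab Test; values are illustrative.
X = np.array([
    [101.0, 0.8, 1.0, 1.0],     # pattern x_1 (Pat 1)
    [ 30.0, 0.0, 0.0, np.nan],  # pattern x_2 (Pat 2); '?' becomes NaN
])

l, n = X.shape   # l = number of examples, n = number of features
x_p  = X[0]      # the p-th pattern (row), here p = 1
x_pi = X[0, 1]   # x_{p,i}: attribute i of pattern p, here x_{1,2} = 0.8 (Smoke)
print(l, n, x_p, x_pi)
```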
DATA Encoding
Flat case:
• Numerical encoding for categories, e.g.:
– 0/1 (or −1/+1) for 2 classes
– More classes:
• 1, 2, 3, … Warning: implies a grade of similarity (1 vs 2 or 3): useful for ordinal categorical variables (e.g. small, medium, large)
• 1-of-k (or one-hot) encoding: useful for symbols; it will be useful for the project!
A → 1 0 0
B → 0 1 0
C → 0 0 1
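A minimal sketch of the 1-of-k encoding above, hand-rolled so the mechanics stay explicit (the alphabet and its ordering are assumptions):

```python
def one_hot(symbol, alphabet):
    # 1-of-k (one-hot): a vector with a single 1 at the symbol's position.
    return [1 if s == symbol else 0 for s in alphabet]

alphabet = ['A', 'B', 'C']
print(one_hot('A', alphabet))  # [1, 0, 0]
print(one_hot('B', alphabet))  # [0, 1, 0]
print(one_hot('C', alphabet))  # [0, 0, 1]
```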
DATA: Structures
Graph/network data
[Figure: a graph with node labels l1 … l5]
DATA
Further terminology
• Outliers: unusual data values that are not consistent with most observations (e.g. due to abnormal measurement errors)
– Outlier detection – preprocessing: removal (see the sketch below)
– Robust modeling methods
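A minimal sketch of the removal step, using a median-based (MAD) rule; both the rule and the threshold are common heuristics chosen here for illustration, not prescriptions from the slides:

```python
import numpy as np

def remove_outliers(values, k=3.5):
    # Drop values whose modified z-score (median/MAD based, robust to the
    # outliers themselves) exceeds the threshold k.
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    z = 0.6745 * np.abs(values - med) / mad
    return values[z < k]

data = np.array([5.1, 4.9, 5.0, 5.2, 50.0])  # 50.0: abnormal measurement
print(remove_outliers(data))                 # the outlier is dropped
```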
Tasks: Supervised Learning
[Figure: the input space of x is mapped by f to categories or to real values (R); centroids mark cluster centers]
▪ Clustering: partition of data into clusters (subsets of “similar” data)
Tasks: Classification
Example
Terminology in statistics:
• Inputs are the “independent variables”
• Outputs are the “dependent variables” or “responses”
Tasks: Classification
Classification may be viewed as the partition of the input space into decision regions (e.g. 0/1).
Example: graphical illustration of a linear separator on an instance space x^T = (x_1, x_2) in R^2, with f(x) = 0/1 (or −1/+1); see the sketch below.
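A minimal sketch of such a linear separator (the weights and threshold are invented for illustration):

```python
import numpy as np

def linear_separator(x, w, b):
    # Classify x in R^2 as 1 if w.x + b >= 0, else 0:
    # the decision boundary is the line w.x + b = 0.
    return 1 if np.dot(w, x) + b >= 0 else 0

w = np.array([1.0, 1.0])  # illustrative weights: boundary x1 + x2 = 1
b = -1.0
print(linear_separator(np.array([0.2, 0.3]), w, b))  # 0 (below the line)
print(linear_separator(np.array([0.8, 0.9]), w, b))  # 1 (above the line)
```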
Tasks: Regression: example
x | target
1 | 2.1
2 | 3.9
3 | 6.1
5 | 9.8
… | …
Guessing f(x) = 2x: via a neural network? Or by …
[Plot: the data points together with the line f(x) = 2x, for x in 1..5]
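A minimal sketch of one of the “or by …” options: a plain least-squares fit of h_w(x) = w_1 x + w_0 to the table data, which roughly recovers the guessed f(x) = 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])
d = np.array([2.1, 3.9, 6.1, 9.8])  # targets from the table

# Least-squares fit of h_w(x) = w1*x + w0 (design matrix with a bias column).
A = np.column_stack([x, np.ones_like(x)])
(w1, w0), *_ = np.linalg.lstsq(A, d, rcond=None)
print(w1, w0)  # w1 ~ 1.94, w0 ~ 0.14: close to the guess f(x) = 2x
```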
Tasks: Regression
Linear hypothesis
• Semi-supervised learning
– combines both labeled and unlabeled examples to generate an
appropriate function or classifier.
Models and survey of useful concepts
• MODEL:
– Aim: to capture/describe the relationships among the data (on the basis of the task) by a “language” (numerical, symbolic, …)
– The “language” is related to the representation used to represent knowledge
– The model defines the class of functions that the learning machine can implement (the hypothesis space)
• E.g. the set of functions h(x, w), where w is the (abstract) parameter
[Figure: a model with inputs Age, Smoke, Alcohol: again, a class of functions!]
Paradigms and methods (Languages for H)
• Theory (No Free Lunch Theorem): there is no universal “best” learning method (without any knowledge, for any problem, …): if an algorithm achieves superior results on some problems, it must pay with inferiority on other problems. In this sense there is no free lunch. E.g. Devroye (1982), Wolpert and Macready (1997), and others
– E.g. the free parameters of the model are fitted to the task at hand
– Examples: best w in linear models, best rules for symbolic models, …
– Remember the regression example: we proposed h(x) = 2x, i.e. h_w(x) = w_1 x + w_0 assuming w_1 = 2 and w_0 = 0 as the best parameter values: how?
• H may not coincide with the set of all possible functions, and the search cannot be exhaustive: we need to make assumptions → (we will see the role of) the inductive bias
Learning Algorithms: search
[Figure: the hypothesis space H; each point represents a different hypothesis (function); the search looks for the hypothesis with minimum “error”]
Recap and next topics
After the introduction of the first four ingredients (Data, Task, Model and Learning Algorithm), we need to focus on three relevant concepts mentioned but not yet discussed:
1. The Role of the Inductive Bias
• We will see that such assumptions are strictly needed to obtain a useful model for the ML aims, i.e. a model with generalization capabilities
An example: Learning Boolean functions
[Table 1: training examples for a Boolean function over four input features (not reproduced here)]
Learning Boolean functions: ill-posed
• There are 2^16 = 2^(2^4) = 65536 possible Boolean functions over four input features. We cannot figure out which one is correct until we have seen every possible input-output pair.
• After 7 examples, we still have 2^9 = 512 possibilities (a brute-force check follows below).
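The counting argument can be checked by brute force; a minimal sketch (the 7 training labels are arbitrary placeholders, since Table 1 is not reproduced in this text):

```python
from itertools import product

inputs = list(product([0, 1], repeat=4))   # the 2^4 = 16 possible inputs

# A Boolean function over 4 inputs is a truth table with one output bit
# per input, hence 2^16 = 65536 functions in total.
functions = list(product([0, 1], repeat=16))
print(len(functions))                      # 65536

# Placeholder training set: 7 distinct inputs with arbitrary labels.
train = [(i, i % 2) for i in range(7)]     # (input index, label)

# Functions consistent with all 7 examples: 7 of the 16 output bits are
# pinned down, so 2^(16-7) = 2^9 functions remain.
consistent = [f for f in functions if all(f[i] == y for i, y in train)]
print(len(consistent))                     # 512
```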
Another discrete H space: Conjunctive rules
Find the Version Space
• We are only interested in noting that these algorithms find the VS:
• The version space, VS_{H,TR}, with respect to hypothesis space H and training set TR, is the subset of hypotheses from H consistent with all training examples (see the sketch below)
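A minimal sketch of finding the version space by direct filtering, over a toy conjunctive-rule hypothesis space in the spirit of the previous slide (each attribute is constrained to '0', '1', or '?' = don't care; the training set is an invented placeholder):

```python
from itertools import product

def covers(h, x):
    # A conjunctive hypothesis h covers x if every non-'?' constraint matches.
    return all(c == '?' or c == v for c, v in zip(h, x))

n = 3                                         # toy case: 3 Boolean attributes
H = list(product(['0', '1', '?'], repeat=n))  # 3^n conjunctive hypotheses

TR = [(('1', '0', '1'), True),                # placeholder labeled examples
      (('1', '1', '1'), True),
      (('0', '0', '0'), False)]

# Version space VS_{H,TR}: hypotheses consistent with every training example.
VS = [h for h in H if all(covers(h, x) == y for x, y in TR)]
print(VS)  # [('1', '?', '1'), ('1', '?', '?'), ('?', '?', '1')]
```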
Unbiased Learner I
Unbiased Learner II (formal)
Recall that the version space, VS_{H,TR}, with respect to hypothesis space H and training set TR, is the subset of hypotheses from H consistent with all training examples
Futility of Bias-Free Learning
In other words, in order to learn the target concept, one would have to
present every single instance in X as a training example (lookup table)
Inductive Systems and Equivalent Deductive Systems
[Diagram: training examples and a new instance feed the learning algorithm, which uses the hypothesis space H and the inductive bias to output a classification of the new instance, or “don’t know”]
Language or search bias?
Why can the search bias be preferred over the language bias?
▪ ML typically uses flexible approaches (expressive hypothesis spaces, models with universal capability, e.g. Neural Networks, DT),
▪ avoiding the language bias, hence without excluding a priori the unknown target function,
▪ while retaining an inductive bias by focusing on the search bias (which is ruled by the learning algorithm),
▪ in practice using an incomplete search strategy.
Conclusions:
• Learning without bias cannot extract any regularities from data (lookup-table:
no generalization capabilities)
• Every state-of-the-art ML approach shows an inductive bias
• Issue: characterize the bias for different models/learning approaches
The Kanizsa triangle
An example of the perception bias of our visual system
2. Tasks & Loss
E(w) = (1/l) Σ_{p=1}^{l} L(h(x_p), d_p)
Note: index p is used for the samples, p = 1..l.
We will change L for different tasks (see the sketch below).
Note: for the moment Error, Risk and Loss are considered equivalent; we will specify the differences later in the course.
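A minimal sketch of this formula, showing how L is swapped across tasks (the function names here are mine, not from the slides):

```python
def error(h, data, L):
    # E(w) = (1/l) * sum_{p=1..l} L(h(x_p), d_p)
    return sum(L(h(x), d) for x, d in data) / len(data)

squared_loss  = lambda y, d: (y - d) ** 2        # regression
zero_one_loss = lambda y, d: 0 if y == d else 1  # classification

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.1)]      # (x_p, d_p) pairs
h = lambda x: 2 * x                              # hypothesis h(x) = 2x
print(error(h, data, squared_loss))              # 0.01
```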
Tasks: Common Tasks review
Regression
• The mean over the data set provides the Mean Squared Error (MSE)
MSE example
E(w) = (1/l) Σ_{p=1}^{l} (y_p − h_w(x_p))^2
w are the free parameters of the linear model.
Note: this plot is taken from elsewhere; I used different colors before: here the line is in blue. Also, the y values therein are the desired (target d) values.
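Continuing the regression example, a minimal numeric check of the MSE of h_w(x) = 2x (i.e. w_1 = 2, w_0 = 0) on the earlier table data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])
d = np.array([2.1, 3.9, 6.1, 9.8])  # desired (target) values
h = 2 * x                           # h_w(x) with w1 = 2, w0 = 0

mse = np.mean((d - h) ** 2)
print(mse)  # (0.01 + 0.01 + 0.01 + 0.04) / 4 = 0.0175
```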
Classification
Clustering and Vector Quantization (*preview)
[Figure: clusters with their centroids]
Generalization
Validation
• Validation !
• Validation !!
• Validation !!!
Exemplification of the Deployment/Inference use
Even the inference part can be costly if you have millions of requests (e.g. at Google).
[Photo: a Google server rack containing multiple Tensor Processing Units, a special-purpose chip designed specifically for machine learning]
The original TPU was designed specifically to work best with Google’s TensorFlow.
• Part I (now)
– Motivations, contextualization in CS
– Course info
• Part II (in Lectures 2 and 3)
– Utility of ML
– Learning as function approximation (pilot example)
– Design components of a ML system, including
• Learning tasks
• Hypothesis space (and first overview)
• Inductive bias (examples in discrete hypothesis spaces)
• Loss and learning tasks
• Generalization (first part)
• Part III (in Lecture 4)
– Generalization and Validation
Aim: overview and terminology
before starting to study models and learning algorithms
For information
Alessio Micheli
[email protected]
http://ciml.di.unipi.it