ML 23 First Lectures 2 3 v0.1

A Short Introduction to

Machine Learning

Introduction to Machine Learning


Lect.s 2 and 3
Alessio Micheli
[email protected]

Dipartimento di Informatica, Università di Pisa - Italy
Computational Intelligence & Machine Learning Group
About ML Dip. Informatica
University of Pisa

• Machine Learning (ML)


– Master Programme in Computer Science
– Master Programme in Data Science and Business Informatics
– Master Programme in Digital Humanities

• Code: 654AA Credits (ECTS): 9 Semester: 1

• Lecturer: Alessio Micheli: [email protected]

A. Micheli 2
Practical information

In class:
• Please keep maximum silence during the lecture (to avoid noise):
recording in progress! (Of course, you can ask questions.)

Connect to ML:
• Material: Moodle https://elearning.di.unipi.it/
• Streaming & recordings of lectures: Teams platform
– See lecture 1 and Moodle: «FAQ and general
information»
– Enrollment for attendance “in presence” (and to connect to
Teams) is through the app “Agenda Didattica” (“Didactic
Agenda”) for 654AA 23/24
– Please, remember to fill in the poll (see INTRO-curricula22)

Introduction to ML:
plan of the next lectures

• Introduction aims:
– Critical contextualization of ML in computer science [lect 1 and 2]

– Overview and Terminologies [lect 2, 3, 4]


• the relevant concepts will be developed later in the course

– First basic models and learning algorithms [lect 5, 6, 7]

– Then, we will start with Neural Networks!

See the “Course structure” slide!

Learning

The problem of learning is arguably


at the very core of the problem of
intelligence, both biological and
artificial

Poggio, Shelton, AI Magazine 1999

i.e., learning as a major challenge
and a strategic way to bring intelligence into systems

Machine Learning (I)

We restrict to the computational framework:

• Principles, methods and algorithms for learning and prediction:
– The system learns from experience (known data) to approach a
defined computational task
– Build a model (or hypothesis) to be used for predictions
❖ (see examples on email-spam or face recognition)

Most common specific framework:

• Infer a model / function from a set of examples which allows
generalization (to provide accurate responses on new data)

Machine Learning (II): When?

Opportunity (if useful) and awareness (needs and limits)

• Utility of predictive models: (in the following cases)


– no (or poor) theory (or knowledge to explain the phenomenon)
– uncertain, noisy or incomplete data (which hinder
formalization of solutions)
• Requests:
– source of training experience (representative data)
– tolerance on the precision of results

Machine Learning (III): When?

• Models to solve real-world problems that are difficult to treat
with traditional techniques (complementary to analytical models
based on previous knowledge, algorithms and imperative
programming, classical AI, ...)
• Examples of appropriate applications versus standard programming:
– Knowledge is too difficult (to be formalized by ‘hand-made’ algorithm)
• e.g. face recognition: humans can do it but cannot describe how they do it
• e.g. voice automatic telephone answering service

– Not enough human knowledge


• e.g., predicting binding strength of molecules to proteins
– Personalized behavior
• scoring email messages or web pages according to user preferences
• individualized (intelligent) human-computer interfaces

• Due to this flexibility, the ML application area is very large: see lecture 1


General challenges

• Build autonomous Intelligent/learning systems:


– Robotics, HRI, search engines, …

• Build powerful tools for emerging challenges in intelligent data


analysis
– Tools for the “data scientist”

• Open new areas of applications in CS: innovative interdisciplinary


open problems (more in general, “machine learning scientist”)
– Imagination is your only limitation!
– ML in the era of “changing of paradigm in science, in which scientific
advances are becoming more and more data-driven”
– Growing data sources opens up a huge application area for ML and
related areas (Web, Social Net., IoT, BioMed, …)

A useful framework:
Learning as an approximation of an
unknown function from examples

Specific vision but widespread in ML


For us:
• Different tasks seen in a uniform framework (Hilbert spaces)
• Enables a rigorous formulation

→ Intro guided by intuitive examples

Please, note that the following example was already


introduced in Lect 1
An Example

• A pilot example: recognition of handwritten digits

• Input: collection of images of handwritten digits (arrays/matrices of values)
• Problem: build a model that receives as input an image of a handwritten
digit and “predicts” the digit

8x8

Build a function from
examples

f: Image (8x8) → Output class {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Handwritten Digits
Recognition

f: Image (8x8) → Output class {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}: a classification problem
• Difficult to formalize exactly the solution of the problem:
possible presence of noise and ambiguous data
• Relatively easy to collect a set of labeled examples

=> Example of a successful application of ML!
Machine Learning

A new extended definition (looking to the pilot example)

• ML studies and proposes methods to build (infer) dependencies /
functions / hypotheses from examples of observed data
– that fit the known examples
– able to generalize, with reasonable accuracy, on new data
• According to verifiable results
• Under statistical and computational conditions and criteria
– Considering the expressiveness and algorithmic complexity of the
models and learning algorithms

Examples of x - f(x)

Inferring general functions from known data:

• Handwriting Recognition
– x: Data from pen motion.
– f(x): Letter of the alphabet.
• Disease diagnosis (from database of past medical records)
– x: Properties of patient (symptoms, lab tests)
– f(x): Disease (or maybe, recommended therapy)
– TR Training Set: <x,f(x)>: database of past medical records
• Face recognition
– x: Bitmap picture of person's face
– f(x): Name of the person.
• Spam Detection
– x: Email message
– f(x): Spam or not spam.

Complex data

• Protein folding
– x: sequence of amino acids
– f(x): sequence of atoms’ 3D coordinates
– TR <x,f(x)>: known proteins
– Type of x: string (variable length)
– Type of f(x): sequence of 3D vectors

• Drug design
– x: a molecule
– f(x): binding strength to HIV protease
– TR <x,f(x)>: molecules already tested
– Type of x: a graph or a relational description of atoms/chemical bonds
– Type of f(x): a real number
Overview of a ML
(predictive) System

Build or improve the agent/model/hypothesis by learning from data (world observations).

DATA (world observations) → MODEL → Prediction, for a given TASK.

LEARNING ALG.: drives the model building by tuning the system parameters to the problem at hand.

VALIDATION.

Also as a guide to the key design choices (ML system “ingredients”).
DATA

• The data represent the available facts (experience).
– Representation problem: to capture the structure of the analyzed objects
Types: Flat, Structured, …
• Flat (attribute-value language):
fixed-size vectors of properties (features), a single table of tuples
(measurements of the objects)

Fruits            Weight   Cost $   Color   Bio
Fruit 1 (lemon)   2.1      0.5      y       1
Fruit 2 (apple)   3.5      0.6      r       ?   (missing data)

Attributes are categorical/discrete or continuous.

Data can be subject to preprocessing: e.g. variable scaling, encoding*, feature selection…
DATA
Examples and terminologies

Medical records:

Patients   Age   Smoke   Sex   Lab Test
Pat 1      101   0.8     M     1
Pat 2      30    0.0     F     ?   (missing data)

Attributes are discrete or continuous.
• Each row (x, vector in bold): example, pattern, instance, sample, …
• Dimension of the data set: number of examples l
• Dimension (of the input x): number of features n
• If we index the features/inputs/variables by i or j: variable xi is
(typically) the i-th feature/property/attribute/element/component of x
(but sometimes, to simplify, we use the subscript index for other meanings)
• xp (or xi) is (typically) the p-th (or i-th) pattern/example/row (vector)
• xp,i (for example) can be the attribute i of the pattern p
DATA Encoding

Flat case:
• Numerical encoding for categories: e.g.
– 0/1 (or –1/+1) for 2 classes
– More classes:
• 1,2,3… Warning: grade of similarity (1 vs 2 or 3): useful for “order
categorical” variables (e.g small, medium, large)
• 1-of-k (or 1-hot) encoding: useful for symbols

A → 1 0 0
B → 0 1 0
C → 0 0 1
It will be useful for the project!

Useful both for input and output variables
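As a minimal sketch (in Python, with the hypothetical symbols A, B, C from the table above), a 1-of-k encoder can be derived directly from the set of observed symbols:

```python
def one_hot(symbols):
    """Map each distinct symbol to a 1-of-k (one-hot) binary vector."""
    index = {s: i for i, s in enumerate(sorted(set(symbols)))}
    k = len(index)
    return {s: [1 if i == index[s] else 0 for i in range(k)] for s in index}

encoding = one_hot(["A", "B", "C"])
# encoding["A"] == [1, 0, 0], encoding["B"] == [0, 1, 0], encoding["C"] == [0, 0, 1]
```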

DATA : Structures

• Structured: Sequences (lists), trees, graphs, Multi-relational data


(table) (in DB)
Examples: images, microarrays, temporal data, strings of a language,
DNA and proteins, hierarchical relationships, molecules, hyperlink
connectivity in web pages, ...
Which natural representation?

Graph/network data
DATA
Further terminologies

• Noise: addition of external factors to the stream of (target) information


(signal); due to randomness in the measurements, not due to the underlying
law: e.g. Gaussian noise

• Outliers: unusual data values that are not consistent with most
observations (e.g. due to abnormal measurement errors)
– outlier detection – preprocessing: removal
– Robust modeling methods

• Feature selection: selection of a small number of informative features: it


can provide an optimal input representation for a learning problem
TASKS

• The task defines the purpose of the application:


– What knowledge do we want to achieve? (e.g. a pattern in DM or a model in ML)
– What is the useful nature of the result?
– What information is available?
Mainly in the ML course
• Predictive (Classification, Regression): function approximation

f: x (Input space) → Categories or real values (R)
E.g. recall the “pilot” example on handwritten digits: Build a function
from examples

• Descriptive (Cluster Analysis, Association Rules): find subsets or groups of


unclassified data

Tasks: Supervised Learning

• Given: training examples as <input, output> = <x, d> (labeled examples)
defined for an unknown function f (known only at the given example points)
– Target value: the desired value d (or t or y …) is given by the teacher
according to f(x) to label the data
• Find: a good approximation to f (a hypothesis h that can be used for
prediction on unseen data x’, i.e. that is able to generalize)

f: x (Input space) → Categories or real values (R)

• Target d (or t or y): a categorical or numerical label


– Classification: discrete value outputs:
f(x)  {1,2,…,K} classes (discrete-valued function)
– Regression: real continuous output values (approximate a real-valued target
function, in R or RK)
Unified vision thanks to the formalism of a
function approximation task
Tasks: Unsupervised Learning

Unsupervised Learning: No teacher!


• TR (Training Set)= set of unlabeled data <x>
• E.g. to find natural groupings in a set of data
– Clustering
– Dimensionality reduction/ Visualization/Preprocessing
– Modeling the data density

Centroids

▪ Clustering:
Partition of data into clusters (subsets of “similar” data)
Tasks: Classification

(Supervised) Classification: patterns (feature vectors) are seen as
members of a class, and the goal is to assign the observed patterns
to classes (labels)

• Classification: f(x) return the correct class for x


• Number of classes:

• =2 : f(x) is a Boolean function: binary classification, concept


learning (T/F or 0/1 or –1/+1 or negative/positive),

• > 2: multi-class problem (C1,C2,C3 ….CK)

Example

From DATA to TASK (e.g. classification)

Patients   Age   Smoke   Sex   Lab Test   Target: diagnose
Pat 1      101   0.8     M     1          +
Pat 2      30    0.0     F     ?          -

f
x : Input space

Terminology in statistics:
• Inputs are the “independent variables”
• Outputs are the “dependent variables” or “responses”

Tasks: Classification

The classification may be viewed as the allocation of the input space into decision
regions (e.g. 0/1).
Example: graphical illustration of a linear separator on an instance space
xT = (x1, x2) in R², f(x) = 0/1 (or -1/+1).

Separating (hyper)plane: x s.t. wTx + w0 = w1x1 + w2x2 + w0 = 0   (PREVIEW)

h(x) = 1 if wTx + w0 ≥ 0, 0 otherwise
or
h(x) = sign(wTx + w0)

Linear threshold unit (LTU), an indicator function.
How many? (H): set of dichotomies induced by hyperplanes
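The LTU above can be sketched in a few lines (the separating line x1 + x2 - 1 = 0, i.e. w = [1, 1] and w0 = -1, and the demo points are hypothetical):

```python
def ltu(w, w0, x):
    """Linear threshold unit: h(x) = 1 if w.x + w0 >= 0, else 0."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if s >= 0 else 0

# Hypothetical separating line x1 + x2 - 1 = 0 (w = [1, 1], w0 = -1)
print(ltu([1, 1], -1, [0.9, 0.8]))  # point above the line: class 1
print(ltu([1, 1], -1, [0.1, 0.2]))  # point below the line: class 0
```

Each choice of (w, w0) is a different hypothesis in H: the set of dichotomies induced by hyperplanes.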
Geometrical 3D (pre)view:
Classifier

The 0/1 classification function in 3D (on a 2D input space):
the region where the output of the classifier is 1
Tasks: Regression: example

• Process of estimating a real-valued function on the basis of a finite set of
noisy samples (supervised task)
– known pairs (x, f(x) + random noise)
Task (exercise): find f for the data in the following table:

x    target
1    2.1
2    3.9
3    6.1
4    8.4
5    9.8
…    …

Via Neural Network? Or by …
Guessing f(x) = 2x: small errors at the points!

Tasks: regression

• Regression: x = input variables (e.g. real values), f(x) real values: curve
fitting (x is 1-dim in the example but becomes k-dim in general)
• Process of estimating a real-valued function on the basis of a finite set of
noisy samples
– known pairs (x, f(x) + random noise)

Point where we know the value of f(x)

Linear hypothesis

Among the infinite possibilities, which is the most appropriate?

An example (linear hypothesis): hw(x) = w1x + w0 = 0.2x - 0.4
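A linear hypothesis can be fitted with a short ordinary-least-squares sketch; here it is applied to the table data from the previous slide (the coefficients 0.2 and -0.4 above belong to a different data set; on the table data the fit recovers values close to the guessed f(x) = 2x):

```python
# Fit hw(x) = w1*x + w0 by ordinary least squares on the earlier table data.
xs = [1, 2, 3, 4, 5]
ds = [2.1, 3.9, 6.1, 8.4, 9.8]

n = len(xs)
mx = sum(xs) / n                         # mean of the inputs
md = sum(ds) / n                         # mean of the targets
w1 = sum((x - mx) * (d - md) for x, d in zip(xs, ds)) / sum((x - mx) ** 2 for x in xs)
w0 = md - w1 * mx
print(round(w1, 2), round(w0, 2))        # close to the guess f(x) = 2x
```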


Tasks: Other Topics …

• Semi-supervised learning
– combines both labeled and unlabeled examples to generate an
appropriate function or classifier.

• Reinforcement Learning (learning with right/wrong critic).


– Adaptation in autonomous systems
– “the algorithm learns a policy of how to act given an observation of the
world. Every action has some impact in the environment, and the
environment provides feedback that guides the learning algorithm”.
– No step-by-step examples
– Toward decision-making aims
– Useful in modern AI

Models
and survey of useful concepts

• MODEL:
– Aim: to capture/describe the relationships among the data (on the
basis of the task) by a “language” (numerical, symbolic, …)
– The “language” is related to the representation used to get knowledge
– The model defines the class of functions that the learning machine can
implement (hypotheses space)
• E.g. set of functions h(x,w), where w is the (abstract) parameter

• Training example (superv.): An example of the form (x, f(x)+noise)


x is usually an input vector of features, (d or t or) y=f(x)+noise is called
the target value
• Target function: The true function f
• Hypothesis: A proposed function h believed to be similar to f. An
expression in a given language that describes the relationships among data
• Hypotheses space H: the space of all hypotheses (specific models) that
can, in principle, be output by the learning algorithm
Models:
few trivial examples…

Just to have a preview of different representations of hypotheses
(because you already know the language of equations, logic, probability):
• Linear models (the representation of H defines a continuously
parameterized space of potential hypotheses);
each assignment of w is a different hypothesis, e.g.:

h(x) = sign(wTx + w0)   (binary classifier)
hw(x) = w1x + w0, e.g. hw(x) = 2x + 150   (simple linear regression)

• Symbolic Rules: (hypothesis space is based on discrete


representations); different rules are possible , e.g:
– if (x1=0) and (x2=1) then h(x)=1 binary classifier
– else h(x)=0
• Probabilistic models: estimate p(x,y)
• K Nearest neighbor regression: Predict mean y value of nearest neighbors
(memory-based)
Neural Networks (just a look)

An example: we will see neural networks, beyond the
neurobiological inspiration, as a computational model for the treatment
of data, capable of approximating complex (non-linear) relationships
between inputs and outputs

f: x (Input space) → Categories or IR (real) values

Input features (e.g.): Age, Smoke, Alcohol
Again, a class of functions!!!

Paradigms and methods
(Languages for H)

• Symbolic and Rule-based (or discrete H)


– Conjunction of literals*, Decision trees (propositional rules)
– Inductive grammars, Evolutionary algorithms, …
– Inductive Logic Programming (first order logic rules)
• Sub-symbolic (or continuous H)
– Linear discriminant analysis, Multiple Linear Regression*, LTU
– Neural networks
– Kernel methods (SVMs, Gaussian kernels, spectral kernels, etc.)
• Probabilistic/Generative
– Traditional parametric models (density estimation, discriminant analysis, polynomial regression,…)

– Graphical models: Bayesian networks, Naïve Bayes, PLSA, Markov models,


Hidden Markov models, …
• Instance-based
– Nearest neighbor*
Note: Underlined → ML
1. Some models can be expressed by different languages
2. * Next lectures
How many models?

• Theory (No Free Lunch Theorem) : there is no universal “best” learning method
(without any knowledge, for any problems,…):
if an algorithm achieves superior results on some problems, it must pay with
inferiority on other problems. In this sense there is no free lunch.
E.g. Devroye (1982), Wolpert and Macready (1997), and others

→ The course provides a


– set of models and the
– critical instruments to compare them

• However, not all the models are equivalent:


– Important differences exist in the flexibility of the approaches, toward models that
can in principle approximate arbitrary functions (e.g. not just the linear approximation
seen in the examples)
– Important differences exist in the control of the complexity (we will see later)
– The use of flexible models and principles for the control of the complexity is the core of
ML
Learning Algorithms

• LEARNING ALGORITHM: based on data, task and model

• (Heuristic) search through the hypothesis space H of the best


hypothesis
– i.e. the best approximation to the (unknown) target function
– Typically searching for the h with the minimum “error”

– E.g. free parameters of the model are fitted to the task at hand:
– Examples: best w in linear models, best rules for symbolic models, ….
– Remember the regression example, we proposed h(x)=2x, for
hw(x)=w1x+w0 assuming w1=2 and w0 =0 as the best parameter value:
how?

• H may not coincide with the set of all possible functions, and the
search cannot be exhaustive: we need to make assumptions →
(we will see the role of) Inductive bias
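The idea of searching H for the hypothesis with minimum error can be sketched with a naive grid search over the slope w1 of h(x) = w1·x, on the regression table seen earlier (a toy search strategy for illustration only, not one of the course's actual learning algorithms):

```python
# Hypothetical toy search: try candidate slopes w1 for h(x) = w1*x and
# keep the one with minimum mean squared error on the known examples.
xs = [1, 2, 3, 4, 5]
ds = [2.1, 3.9, 6.1, 8.4, 9.8]

def mse(w1):
    return sum((d - w1 * x) ** 2 for x, d in zip(xs, ds)) / len(xs)

candidates = [k / 100 for k in range(0, 401)]   # w1 in [0, 4], step 0.01
best = min(candidates, key=mse)
print(best)   # close to 2, the hypothesis proposed in the regression example
```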
Learning Algorithms: search

Hypotheses space H: each point represents a different hypothesis (function);
the search seeks the point with minimum “error”.

Typically, local search approaches are used.
Learning (terminologies)

According to the different paradigms/contexts, “learning” can be
differently termed or have different meanings:
• Inference (statistics)
• Inference: Abduction/Induction (logic)
• Adapting (biology, systems)
• Optimizing (mathematics)
• Training (e.g. Neural Networks)
• Function approximations (mathematics)

Can be more specifically found in other sub-fields:


– Regression analysis (statistics), curve fitting (math, CS), …
– Or using other terminologies e.g. “Fitting a multivariate function”

Recap and next topics

After the introduction of the first four ingredients (Data, Task, Model and
Learning Alg.), we need to focus on three mentioned relevant concepts
not discussed so far:

1. The inductive bias (examples in discrete hypothesis spaces)


2. The loss, used to measure the quality of our approximation
3. The concept of generalization and validation (next lecture)

1. The Role of the
Inductive Bias

In order to set up a model and a learning algorithm we can make assumptions


(about the nature of the target function) concerning either
– Constraints in the model (in the hypothesis space H, due to the set of
hypotheses that we can express or consider) (Language Bias)
– Constraints or preferences in learning algorithm/search strategy (Search
Bias)
– Or Both.

• We will see that such assumptions are strictly needed to obtain a useful model
for the ML aims, i.e. a model with generalization capabilities

• We start to discuss it with examples in discrete hypotheses spaces (rules),
learning a concept (a Boolean function) [Mitchell chapt. 2]
– E.g. x is a “cat” if hcat(x) = 1, otherwise hcat(x) = 0, for x in “animals”

An example:
Learning Boolean functions

Find the function s.t. it fits the examples in Table 1.

This is an ill-posed (inverse) problem:
we may violate either existence, uniqueness,
or stability of the solution(s).

Table 1
Learning Boolean functions:
ill-posed

• There are 2^16 = 2^(2^4) = 65536 possible Boolean functions
over four input features. We cannot figure out which one is
correct until we have seen every possible input-output pair.
• After 7 examples, we still have 2^9 possibilities.

• In the general case, in this discrete hypothesis space H:
|H| = 2^(#input instances) = 2^(2^n)
for binary inputs/outputs, n = input dimension

Lookup table model
• I.e. a rote learner: store/memorize examples, classify x
if and only if it matches a previously observed example
(else “no answer”).
– No inductive bias → no generalization!
Another discrete H space:
Conjunctive rules

• As a second example of discrete H, we can imagine learning a discrete function
with discrete inputs assuming conjunctive rules (propositions with AND
among literals, a language bias)
• i.e. using a language bias to work with a restricted hypothesis space
• E.g. h1 = l2, h2 = (l1 and l2), h3 = true, h4 = not(l1) and l2, …
– Rules such as: if l2(=true) then h(x)=true, else h(x)=false
or equivalently: if (x2=1) then h(x)=1, else h(x)=0
• With n binary inputs we had |H| = 2^(#input instances) = 2^(2^n)
• With only conjunctive rules:
#semantically distinct hypotheses (conjunctions):
3^n (for each of the n positions we can have li, not(li), don’t care) + 1
(+1 because all h with (li AND not(li)) are equivalent to “false”)
(e.g. from 65536 to just 3^4 + 1 = 82 in the example with n=4)
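The 3^n + 1 count can be verified by enumerating the truth tables of all conjunctions for n = 4 (each position either requires li, requires not(li), or is “don't care”; the all-false function is added separately):

```python
from itertools import product

n = 4
inputs = list(product([0, 1], repeat=n))

def conj(pattern):
    # pattern[i] in {1: require xi=1, 0: require xi=0, None: don't care}
    def h(x):
        return int(all(p is None or x[i] == p for i, p in enumerate(pattern)))
    return tuple(h(x) for x in inputs)      # truth table = semantics of the rule

tables = {conj(p) for p in product([0, 1, None], repeat=n)}
tables.add(tuple(0 for _ in inputs))        # "false": any h with (li AND not li)
print(len(tables))                          # 3^4 + 1 = 82
```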

Find the Version Space

• Given the def.: a hypothesis h is consistent with the TR, if


h(x)=d(x) for each training example <x,d(x)> in TR.

• It is possible to perform a complete search (finding the set of all h


consistent with the TR set) in an efficient way in this reduced space
(of conjunctive rules) by cleverer algorithms (Mitchell chap. 2)
– Instead of searching enumerating all the possible combination of literals,
i.e. every h in H

• We are only interested in saying that these algorithms find the VS:
• Call the version space, VS_{H,TR}, with respect to hypothesis space H
and training set TR, the subset of hypotheses from H consistent with
all training examples

Unbiased Learner I

• Hence, this conjunctive assumption for H leads to an efficient solution in


finding a VS.
However, using only conjunctive rules may be too restrictive: if the target
concept is not in H, it cannot be represented in H.
– e.g. if (x1=1) or (x2=1) then h(x)=1, else h(x)=0

• Idea: Choose H that expresses every teachable concept (among


propositions), that means H is the set of all possible subsets of X (instance or
input space): the power set P(X)
• E.g. n = 10 binary inputs: |X| = 2^10 = 1024, |P(X)| = 2^1024 ≈ 10^308 distinct
concepts (much more than the number of atoms in the universe)

• H = disjunctions, conjunctions, negations


• H surely contains the target concept.

• What about generalization?

Unbiased Learner II (formal)

Recall that the version space, VSH,TR , with respect to hypothesis space
H, and training set TR, is the subset of hypotheses from H consistent
with all training examples

The only examples that are unambiguously classified by an unbiased
learner represented with the VS are the training examples themselves,
i.e. the lookup table!

Property: an unbiased learner is unable to generalize (on new instances).
Proof: each unobserved instance will be classified 1 (positive) by precisely half
the hypotheses in VS and 0 (negative) by the other half (rejection: no answer
is made by the VS for new input instances).
Indeed: for every h consistent with TR and an unobserved instance xi (test),
there exists h’ identical to h except h’(xi) ≠ h(xi);
h ∈ VS → h’ ∈ VS (because they are identical on TR).

Futility of Bias-Free Learning

• A learner that makes no prior assumptions regarding the identity of


the target function/concept has no rational basis for classifying any
unseen instances.
• (Restriction, preference) bias is not only assumed for efficiency;
it is needed for the generalization capability
– However, it does not tell us (quantify) which one is the best solution for
generalization yet

• Trivial example (TR = Training Set, TS = Test Set):

     x    d(x)      H = {x, not(x), 0, 1}
TR   0    0         VS = {x, 0}
TS   1    ?         → Can be 1 or 0 … unless you use all X as the TR set.

In other words, in order to learn the target concept, one would have to
present every single instance in X as a training example (lookup table)

Inductive Systems and
Equivalent Deductive Systems

Inductive system: training examples + new instance → learning algorithm using
hypothesis space H → classification of the new instance, or “don’t know”.

Equivalent deductive system: training examples + new instance + inductive bias
→ theorem prover → classification of the new instance, or “don’t know”.
Language or search bias?

Why can the search bias be preferred over the language bias?
▪ In ML typically use flexible approaches (expressive hypothesis spaces,
universal capability of the models, e.g. Neural Networks, DT)
▪ avoiding the language bias, hence without excluding a priori the unknown
target function,
▪ retaining an inductive bias but focusing on the search bias (which is ruled by
the learning algorithm).
▪ In practice using an incomplete search strategy.

Conclusions:
• Learning without bias cannot extract any regularities from data (lookup-table:
no generalization capabilities)
• Every state-of-the-art ML approach shows an inductive bias
• Issue: characterize the bias for different models/learning approaches

The Kanizsa triangle
Example of perception bias of our visual system

2. Tasks & Loss

We said … a “good” approximation to f from examples.
How to measure the quality of the approximation?
▪ Recall that we produce h(x), the value output by the model for input x
▪ We want to measure the “distance” between h(x) and d
(objective function for minimization of errors in training, check of errors in test)

We use an (“inner”) loss function/measure: L(h(x), d) (for a pattern x)
e.g. high value → poor approximation

The Error (or Risk or Loss) is an expected value of this L,
e.g. a “sum” or mean of the inner loss L over the set of samples:

E(w) = (1/l) Σ_{p=1..l} L(h(x_p), d_p)

Note: index p is used for the samples, p = 1..l.
We will change L for different tasks.
Note: at the moment Error, Risk and Loss are considered equivalent; we will
specify differences later through the course.
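The definition E(w) = (1/l) Σ L(h(x_p), d_p) can be sketched with a pluggable inner loss (toy data; the hypothesis h(x) = 2x is the one guessed in the regression example):

```python
# Empirical error E = (1/l) * sum_p L(h(x_p), d_p), with a pluggable inner loss L.
def empirical_error(h, data, L):
    return sum(L(h(x), d) for x, d in data) / len(data)

squared = lambda y, d: (d - y) ** 2          # for regression
zero_one = lambda y, d: 0 if y == d else 1   # for classification

data = [(1, 2.1), (2, 3.9), (3, 6.1)]        # toy <x, d> pairs
h = lambda x: 2 * x
print(empirical_error(h, data, squared))     # small: h fits the data well
```

Changing L changes the task-specific error, as the following slides show.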
Tasks: Common Tasks review

I will show a short survey of common learning tasks by specifying the
(changing) nature
• of the output and hypothesis space
• of the loss function (in particular of L),

i.e. examples of loss functions: use these for future reference
Regression

• Regression: predicting a numerical value

• Output: dp=f(xp) + e (real value function + random error)


• H: a set of real-valued functions

• Loss function L : measures the approximation accuracy/error


• A common loss function for regression: the squared error

L(h(x_p), d_p) = (d_p − h(x_p))²

• The mean over the data set provides the Mean Square Error (MSE)
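A minimal sketch of the MSE on the regression table seen earlier, comparing the guessed hypothesis h(x) = 2x with a deliberately poor one:

```python
def mse(h, pairs):
    """Mean Square Error of hypothesis h over (x, d) pairs."""
    return sum((d - h(x)) ** 2 for x, d in pairs) / len(pairs)

pairs = [(1, 2.1), (2, 3.9), (3, 6.1), (4, 8.4), (5, 9.8)]
print(mse(lambda x: 2 * x, pairs))   # small: a good fit
print(mse(lambda x: x, pairs))       # much larger: a poor fit
```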

MSE example

In the example we have h(x) = w1x + w0 as the blue line, and in green the
errors at the data points (xi, yi) (in red), where the target di for xi is denoted
yi in the example.

The Mean Square Error (MSE) is the mean of the squares of the green errors:

E(w) = (1/l) Σ_{p=1..l} (y_p − h_w(x_p))²

w are the free parameters of the linear model.
Note: this plot is taken from elsewhere; I used different colors before: here the
line is in blue. Also, the y are therein the desired (target d) values.
Classification

• Classification of data into discrete classes

• Output: e.g. {0,1}


• H: a set of indicator functions

• Loss function L: measures the classification error

L(h(x_p), d_p) = 0 if h(x_p) = d_p, 1 otherwise   (0/1 loss)

• The mean over the data set provides the number/percentage of
misclassified patterns
• E.g. 20 out of 100 are misclassified → 20% error, i.e. 80% accuracy
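The 0/1 loss and the resulting error/accuracy percentages can be sketched as follows (the predictions and targets are hypothetical):

```python
def zero_one_loss(y, d):
    """0/1 loss: 0 if the prediction matches the target, 1 otherwise."""
    return 0 if y == d else 1

preds   = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]   # hypothetical model outputs
targets = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]   # hypothetical labels d
errors = sum(zero_one_loss(y, d) for y, d in zip(preds, targets))
error_rate = errors / len(targets)
print(error_rate, 1 - error_rate)           # error rate and accuracy
```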

Clustering and
Vector Quantization *preview

• Goal: optimal partitioning of unknown distribution in x-space into


regions (clusters) approximated by a cluster center or prototype.

Centroids

• H: a set of vector quantizers x → c(x)
(continuous space → discrete space)
• Loss function L: measures the vector quantizer optimality
• A common loss function would be the squared error distortion:

L(h(x_p)) = (x_p − h(x_p)) • (x_p − h(x_p))
where • is the inner product (we’ll see later)

Proximity of the pattern to the centroid of its cluster
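A sketch of the squared error distortion and of a nearest-centroid vector quantizer c(x) (the two centroids and the test point are hypothetical):

```python
def distortion(x, centroid):
    """Squared error distortion: (x - c(x)) . (x - c(x))."""
    diff = [xi - ci for xi, ci in zip(x, centroid)]
    return sum(d * d for d in diff)

def quantize(x, centroids):
    """Vector quantizer c(x): map x to its nearest centroid."""
    return min(centroids, key=lambda c: distortion(x, c))

centroids = [(0.0, 0.0), (5.0, 5.0)]        # hypothetical cluster centers
x = (1.0, 1.0)
c = quantize(x, centroids)
print(c, distortion(x, c))                  # nearest centroid and its loss
```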


Density estimation *preview

• Density estimation (generative, “parametric methods”)


from an assumed class of density

• Output: a density, e.g. a normal distribution with mean m and
variance sigma²: p(x | m, sigma²)
• H: a set of densities (e.g. m and sigma² are the two unknown
parameters)

• A common loss function L for density estimation:

L(h(x_p)) = − ln(h(x_p))   (we’ll see later)

• Related to “maximizing the (log) likelihood function” [not here]
• E.g. P(x1, x2, x3, … | m, sigma²)
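A sketch of the loss L(h(x)) = −ln(h(x)) for a Gaussian density hypothesis (the samples and parameter values are hypothetical; a lower total loss corresponds to a higher likelihood):

```python
import math

def gaussian_pdf(x, m, sigma2):
    """Density of a normal distribution with mean m and variance sigma2."""
    return math.exp(-(x - m) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def loss(x, m, sigma2):
    """L(h(x)) = -ln(h(x)) for the Gaussian hypothesis h = p(. | m, sigma2)."""
    return -math.log(gaussian_pdf(x, m, sigma2))

samples = [4.8, 5.1, 5.3]                    # hypothetical data near 5
# Total negative log-likelihood: the density that fits the samples (mean 5)
# scores a lower loss than a mismatched one (mean 0).
print(sum(loss(x, 5.0, 1.0) for x in samples))
print(sum(loss(x, 0.0, 1.0) for x in samples))
```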
3. Machine Learning &
generalization

This is a fundamental concept of the course

• Learning: search for a good function in a function space from


known data (typically minimizing an Error/Loss)

• Good w.r.t. generalization error: it measures how accurately the


model predicts over novel samples of data
(Error/Loss measured over new data)

Generalization: crucial point of ML!!!


Easy to use ML tools versus correct/good use of ML

Generalization

• Learning phase (training, fitting): build the model from known
data – training data (and bias)
• Predictive or Test phase (deployment/ Inference use of the ML built
model): apply the model to new examples:
– we take the new inputs x’ and we compute the response by the model
– we compare with its target d’ that the model has never seen
– i.e. we make evaluation of the generalization capability of our predictive
hypothesis

Note: performance in ML = generalization accuracy/ predictive accuracy


estimated by the error computed on the (hold out) Test Set
• Theory: E.g. Statistical Learning Theory [Vapnik] :
– under what (mathematical) conditions is a model able to
generalize? → see next lecture (just basic notions)

Validation

• Evaluation of performances for ML systems =


Generalization/Predictive accuracy evaluation, i.e.:

• Validation !
• Validation !!
• Validation !!!

• In the following (next lecture) we will discuss some validation


techniques
– to evaluate (model assessment) and
– to manage the generalization capability (model selection).

Exemplification of the
Deployment/ Inference use


Even the inference part can be costly if you have millions of requests
(e.g. at Google).
A Google server rack contains multiple Tensor Processing Units, a special-
purpose chip designed specifically for machine learning.
The original TPU was designed specifically to work best with Google’s TensorFlow.

Just for inference (mapping)!!!!


Summary of the Intro to ML

• Part I (now)
– Motivations, contextualization in CS
– Course info
• Part II (in Lect.s 2 and 3)
– Utility of ML
– Learning as function approximation (pilot example)
– Design components of a ML system, including
• Learning tasks
• Hypothesis space (and first overview)
• Inductive bias (examples in discrete hypothesis spaces)
• Loss and learning tasks
• Generalization (first part)
• Part III (in Lect. 4)
– Generalization and Validation
Aim: overview and terminology
before starting to study models and learning algorithms
For information

Alessio Micheli
[email protected]

http://ciml.di.unipi.it

Dipartimento di Informatica, Università di Pisa - Italy
Computational Intelligence & Machine Learning Group
