Machine Learning Lecture 1
Annalisa Marsico
OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics
Free University of Berlin
SoSe 2015
What is Machine Learning?
“How can we build computer systems that automatically improve with experience,
and what are the fundamental laws that govern all learning processes?”
Arthur Samuel (1959): field of study that gives computers the ability to learn
without being explicitly programmed
– e.g. playing checkers against Samuel, the computer eventually became much better than him
– this was the first solid refutation of the claim that computers cannot learn
What is Machine Learning?
ML sits between Computer Science and Statistics:
– Computer Science: how can we build machines that solve problems, and which
problems are tractable/intractable?
– Statistics: what can be inferred from the data plus some modeling assumptions,
and with what reliability?
ML's applications
– Army, security
– imaging: object/face detection and recognition, object tracking
– mobility: robotics, action learning, automatic driving
– Computers, internet
– interfaces: brainwaves (for the disabled), handwriting / speech recognition
– security: spam / virus filtering, virus troubleshooting
ML's applications
– Finance
– banking: identify good, dissatisfied or prospective customers
– optimize / minimize credit risk
– market analysis
– Gaming
– intelligent agents: adaptability to the player
– object tracking, 3D modeling, etc...
ML's applications
– Biomedicine, biometrics
– medicine: screening, diagnosis and prognosis, drug discovery etc..
– security: face recognition, signature, fingerprint, iris verification etc..
– Bioinformatics
– motif finder, gene detectors, interaction networks, gene expression
predictors, cancer/disease classification, protein folding prediction, etc..
Examples of Learning problems
• Predict whether a patient, hospitalized due to a heart attack, will
have a second heart attack, based on diet, blood tests, disease
history..
• Identify the risk factors for colon cancer, based on gene expression
and clinical measurements.
• Predict if an e-mail is spam or not based on the most commonly
occurring words (email/spam -> classification problem)
• Predict the price of a stock in 6 months from now, based on
company performance and economic data
You already use it! Some more
examples from daily life..
• Based on past choices, which movies will interest this viewer?
(Netflix)
• Based on past choices and metadata, which music will this user
probably like? (Lastfm, Spotify)
• Based on past choices and profile features, should we match these
people in an online dating service? (Tinder)
• Based on previous purchases, which shoes is the user likely to like?
(Zalando)
• Critical evaluation
Supervised vs Unsupervised Learning
Typical Scenario
We have a quantitative outcome (price of a stock, risk factor..) or a
categorical one (heart attack yes or no) that we want to predict based on some
features. We have a training set of data and build a prediction model, a
learner, able to predict the outcome of new unseen objects
- A good learner accurately predicts such an outcome
Supervised learning
Regression problem (outcome measure is quantitative)
Example 2: Gene expression microarrays
Measure the expression of all genes in a cell simultaneously,
by measuring the amount of RNA present in the cell for each gene.
We do this for several experiments (samples).
p = # of features; N = # of points,
we want to predict the output Y via the model:
$$\hat{Y} = \hat{f}(X) = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j$$
($\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$: the unknown coefficients, i.e. the parameters of the model)
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$$
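As a quick illustration, here is a minimal numpy sketch of evaluating this linear model on a toy expression matrix; the sizes and coefficient values below are made up for illustration.

```python
import numpy as np

# Toy setup: N samples (microarray experiments), p genes (features) -- values are illustrative
N, p = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, p))        # expression matrix, one row per sample

beta0 = 0.5                        # intercept beta_0
beta = np.array([1.0, -2.0, 0.3])  # coefficients beta_1 ... beta_p

# y_hat_i = beta_0 + sum_j x_ij * beta_j, computed for all samples at once
y_hat = beta0 + X @ beta
print(y_hat.shape)                 # (N,)
```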
Linear Models and Least Square
We want to fit a linear model to a set of training data $\{(x_{i1}, \dots, x_{ip}), y_i\}$.
There might be several choices of β.
How do we choose them?
Linear Models and Least Square
• Least square method: we pick the coefficients β to minimize the
residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 = \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 = \sum_{i=1}^{N}\bigl(y_i - x_i^T\beta\bigr)^2$$
$$\mathrm{RSS}(\beta) = (Y - X\beta)^T (Y - X\beta)$$
$$X^T(Y - X\beta) = 0 \qquad \text{(differentiation with respect to } \beta)$$
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
(Figures: least squares fit with one feature vs. two features)
What happens if p > N, i.e. if $X^T X$ is singular?
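A minimal numpy sketch of the closed-form solution above, on simulated data (names and sizes are illustrative); it also shows the usual workaround when $X^T X$ cannot be inverted:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # prepend a column of 1s for beta_0
beta_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# When p > N (or predictors are collinear) X^T X is singular; a pseudoinverse-based
# solver still returns a (minimum-norm) solution instead of failing:
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)        # close to beta_true
print(beta_hat_lstsq)  # same solution here, since X^T X is invertible
```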
Another geometrical interpretation of
linear regression
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i\bigr)^2}$$
($y_i$: real value, $\hat{y}_i$: predicted value)
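For concreteness, a small sketch of the RMSE formula in numpy (the toy numbers are made up):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt((1/N) * sum_i (y_i - y_hat_i)^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([3.0, 1.5, 2.0], [2.5, 1.0, 2.5]))  # 0.5
```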
1. Data transformation
   1. Centering / scaling
   2. Skewed data
   3. Outliers
Skewness
$$s = \frac{\sum_i (x_i - \bar{x})^3}{(n-1)\, v^{3/2}}, \qquad v = \frac{\sum_i (x_i - \bar{x})^2}{n-1}$$
A value of s of 20 indicates high skewness. A log transformation
helps reduce the skewness.
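A small sketch of the skewness statistic above and of the effect of a log transformation, assuming right-skewed (e.g. lognormal) toy data:

```python
import numpy as np

def sample_skewness(x):
    """Skewness statistic from the slide: sum((x - mean)^3) / ((n - 1) * v^(3/2)),
    with v = sum((x - mean)^2) / (n - 1)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    v = np.sum((x - x.mean()) ** 2) / (n - 1)
    return np.sum((x - x.mean()) ** 3) / ((n - 1) * v ** 1.5)

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strongly right-skewed toy data
print(sample_skewness(x))           # clearly positive
print(sample_skewness(np.log(x)))   # close to 0 after the log transformation
```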
Between-Predictor Correlations
Predictors can be correlated. If the correlation among predictors is high, then the
ordinary least squares solution for linear regression will have high variability
and will be unstable -> poor interpretability
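This instability can be seen in a small simulation; the sketch below (all numbers are illustrative) refits OLS on repeatedly simulated data sets and compares the spread of a coefficient for weakly vs. strongly correlated predictors:

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_ols(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta[1:]                                    # drop the intercept

def coef_spread(rho, n_rep=200, N=50):
    """Std of the OLS coefficient of x1 over repeated simulated data sets."""
    coefs = []
    for _ in range(n_rep):
        x1 = rng.normal(size=N)
        x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=N)  # corr(x1, x2) ~ rho
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=N)
        coefs.append(fit_ols(np.column_stack([x1, x2]), y)[0])
    return np.std(coefs)

print(coef_spread(rho=0.1))   # small spread: stable coefficients
print(coef_spread(rho=0.99))  # much larger spread: unstable, hard to interpret
```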
The hope is that the new basis will filter out the noise and reveal the hidden
structure of the data -> in our example it will identify x as the important
direction..
You may have noticed the use of the word linear: PCA makes the stringent
but powerful assumption of linearity -> restricts the set of potential bases
PCA – formal definition
• PCA: orthogonal projection of the data onto a
lower-dimensional space, such that the variance
of the projected data is maximal
Variance and the goal
Geometrical interpretation: find the rotation of the basis (axes) such that the first axis lies
in the direction of greatest variation. In the new system the predictors (PCs) are orthogonal
PCA - Redundancy
Scree plot
PCA example: image compression
Principal Component Analysis (PCA)
PCs are surrogate features / variables and therefore (linear) functions of the
original variables which better re-express the data
Then we can express the PCs as linear combinations of the original predictors.
The first PC is the best linear combination – the one capturing most of the variance
p = # of predictors
$a_{1m}, a_{2m}, \dots, a_{pm}$: component weights / loadings of component m
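A minimal numpy sketch of PCA via the SVD of the centered data (toy data, illustrative names): the columns of the loading matrix are the $a_{jm}$, the scores $Z_m$ are linear combinations of the original predictors, and the explained-variance fractions are what a scree plot displays.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 5
X = rng.normal(size=(N, p))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=N)  # introduce correlation / redundancy

Xc = X - X.mean(axis=0)                  # PCA works on centered predictors

# SVD of the centered data: columns of Vt.T are the loadings a_{jm}
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                          # shape (p, p); column m = (a_{1m}, ..., a_{pm})

# PC scores Z_m = sum_j a_{jm} X_j, i.e. linear combinations of the original predictors
Z = Xc @ loadings                        # shape (N, p)

# Fraction of variance explained by each component (the values a scree plot shows)
explained = S ** 2 / np.sum(S ** 2)
print(explained)                                   # decreasing; PC1 captures the most variance
print(np.round(np.corrcoef(Z, rowvar=False), 2))   # PCs are uncorrelated (near-identity matrix)
```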
Summarizing..
The cool thing is that we have created components (PCs) which are uncorrelated.
Some predictive models prefer predictors which are uncorrelated in order
to find a good solution. PCA creates new predictors with such characteristics!
$$Z_m = \sum_{j=1}^{p} a_{jm} X_j \qquad\qquad y_i = \theta_0 + \sum_{m=1}^{M} \theta_m Z_{im} \quad \text{(fitting a regression model to the } Z_m)$$
The choice of $Z_1 \dots Z_M$ and the selection of the $a_{jm}$ can be achieved in different ways.
One way is Principal Component Regression (PCR) – almost PLS..
E.g. $Z_1 = a_{11} x_1 + a_{21} x_2$: the first principal component in the case of two variables
(the $a_{jm}$ are the loadings, the values of $Z_m$ are the scores)
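A sketch of Principal Component Regression along these lines, assuming the PCA-via-SVD construction above (function names and data are illustrative, not from the lecture):

```python
import numpy as np

def pcr_fit(X, y, M):
    """PCR sketch: build the first M principal components, then regress y on their scores."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt.T[:, :M]                                  # loadings a_{jm} of the first M components
    Z = Xc @ A                                       # scores Z_1 ... Z_M
    Zd = np.column_stack([np.ones(len(y)), Z])
    theta, *_ = np.linalg.lstsq(Zd, y, rcond=None)   # y_i = theta_0 + sum_m theta_m Z_im
    return x_mean, A, theta

def pcr_predict(X_new, x_mean, A, theta):
    Z_new = (X_new - x_mean) @ A
    return theta[0] + Z_new @ theta[1:]

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=80)
model = pcr_fit(X, y, M=3)
print(pcr_predict(X[:5], *model))   # predictions for the first 5 training points
```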
Drawback of PCR
We assume that the directions in which the $x_i$ show the most variation are the
directions associated with the response y..
But this assumption is not always fulfilled, and when $Z_1 \dots Z_M$ are produced in
an unsupervised way there is no guarantee that these directions (which best
explain the input) are also the best to explain the output.
When will PCR perform worse than ordinary least squares regression?
Partial Least Square Regression (PLSR)
• Supervised alternative to PCR. It makes use of the response Y to identify
the new features
• attempts to find directions that help explain both the response and the
predictors
PLS Algorithm
$$Z_m = \sum_{j=1}^{p} a_{jm} X_j$$
1. Compute the first partial least squares direction
$$Z_1 = \sum_{j=1}^{p} a_{j1} X_j$$
by setting each $a_{j1}$ in the formula to the coefficient
from the simple linear regression of Y onto $X_j$
7. The iterative approach can be repeated M times to identify multiple PLS components
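A minimal sketch of step 1 (the first PLS direction) on centered toy data; for a centered predictor the simple-regression coefficient is $\langle x_j, y\rangle / \langle x_j, x_j\rangle$, which is what the code computes (data and names are illustrative):

```python
import numpy as np

def first_pls_direction(X, y):
    """Step 1 of PLS: set a_{j1} to the simple linear regression coefficient of y onto X_j,
    then form Z_1 = sum_j a_{j1} X_j. A minimal sketch on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Simple regression of y onto each single predictor X_j (no intercept after centering)
    a1 = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    Z1 = Xc @ a1
    return a1, Z1

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 4))
y = 2 * X[:, 0] - X[:, 2] + 0.1 * rng.normal(size=60)
a1, Z1 = first_pls_direction(X, y)
print(a1)   # weights a_{j1}: largest in absolute value for the informative predictors
```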
Example from the QSAR modeling
problem - PCR