Machine Learning Lecture 1
Annalisa Marsico
OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics
Free University of Berlin
SoSe 2015
What is Machine Learning?
“How can we build computer systems that automatically improve with experience,
and what are the fundamental laws that govern all learning processes?”
Arthur Samuel (1959): field of study that gives computers the ability to learn
without being explicitly programmed
– e.g. playing checkers against Samuel, the computer eventually became much better than him
– this was the first solid refutation of the claim that computers cannot learn
What is Machine Learning?
ML sits between Computer Science and Statistics:
– Computer Science: how can we build machines that solve problems, and which
problems are tractable/intractable?
– Statistics: what can be inferred from the data plus some modeling assumptions,
and with what reliability?
ML's applications
– Army, security
– imaging: object/face detection and recognition, object tracking
– mobility: robotics, action learning, automatic driving
– Computers, internet
– interfaces: brainwaves (for the disabled), handwriting / speech recognition
– security: spam / virus filtering, virus troubleshooting
ML's applications
– Finance
– banking: identify good, dissatisfied or prospective customers
– optimize / minimize credit risk
– market analysis
– Gaming
– intelligent agents: adaptability to the player
– object tracking, 3D modeling, etc...
ML's applications
– Biomedicine, biometrics
– medicine: screening, diagnosis and prognosis, drug discovery etc..
– security: face recognition, signature, fingerprint, iris verification etc..
– Bioinformatics
– motif finder, gene detectors, interaction networks, gene expression
predictors, cancer/disease classification, protein folding prediction, etc..
Examples of Learning problems
• Predict whether a patient, hospitalized due to a heart attack, will
have a second heart attack, based on diet, blood tests, disease
history..
• Identify the risk factors for colon cancer, based on gene expression
and clinical measurements.
• Predict if an e-mail is spam or not based on the most commonly
occurring words (email/spam -> classification problem)
• Predict the price of a stock in 6 months from now, based on
company performance and economic data
You already use it! Some more
examples from daily life..
• Based on past choices, which movies will interest this viewer?
(Netflix)
• Based on past choices and metadata, which music will this user
probably like? (Lastfm, Spotify)
• Based on past choices and profile features, should we match these
people in an online dating service? (Tinder)
• Based on previous purchases, which shoes is the user likely to like?
(Zalando)
• Critical evaluation
Supervised vs Unsupervised Learning
Typical Scenario
We have a quantitative outcome (price of a stock, risk factor..) or a
categorical one (heart attack yes or no) that we want to predict based on some
features. We have a training set of data and build a prediction model, a
learner, able to predict the outcome of new unseen objects
- A good learner accurately predicts such an outcome
Supervised learning
Regression problem (outcome measure is quantitative)
Example 2: Gene expression microarrays
Measure the expression of all genes in a cell simultaneously,
by measuring the amount of RNA present in the cell for each gene.
We do this for several experiments (samples).
p = # of features; N = # of points,
we want to predict the output Y via the model:
$$\hat{Y} = \hat{f}(X) = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j$$
($\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$: the unknown coefficients, i.e. the parameters of the model)
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$$
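As a quick illustration, here is a minimal numpy sketch of evaluating this linear model on a toy expression matrix; the sizes and coefficient values below are made up for illustration.

```python
import numpy as np

# Toy setup: N samples (microarray experiments), p genes (features) -- values are illustrative
N, p = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, p))        # expression matrix, one row per sample

beta0 = 0.5                        # intercept beta_0
beta = np.array([1.0, -2.0, 0.3])  # coefficients beta_1 ... beta_p

# y_hat_i = beta_0 + sum_j x_ij * beta_j, computed for all samples at once
y_hat = beta0 + X @ beta
print(y_hat.shape)                 # (N,)
```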
Linear Models and Least Square
We want to fit a linear model to a set of training data $\{(x_{i1}, \dots, x_{ip}), y_i\}$.
There might be several choices of β.
How do we choose them?
Linear Models and Least Square
• Least square method: we pick the coefficients β to minimize the
residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 = \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 = \sum_{i=1}^{N}\bigl(y_i - x_i^T\beta\bigr)^2$$
$$\mathrm{RSS}(\beta) = (Y - X\beta)^T (Y - X\beta)$$
$$X^T(Y - X\beta) = 0 \qquad \text{(differentiation with respect to } \beta)$$
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
(Figures: least squares fit with one feature vs. two features)
What happens if p > N, i.e. if $X^T X$ is singular?
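A minimal numpy sketch of the closed-form solution above, on simulated data (names and sizes are illustrative); it also shows the usual workaround when $X^T X$ cannot be inverted:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # prepend a column of 1s for beta_0
beta_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# When p > N (or predictors are collinear) X^T X is singular; a pseudoinverse-based
# solver still returns a (minimum-norm) solution instead of failing:
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)        # close to beta_true
print(beta_hat_lstsq)  # same solution here, since X^T X is invertible
```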
Another geometrical interpretation of
linear regression
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i\bigr)^2}$$
($y_i$: real value, $\hat{y}_i$: predicted value)
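For concreteness, a small sketch of the RMSE formula in numpy (the toy numbers are made up):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt((1/N) * sum_i (y_i - y_hat_i)^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([3.0, 1.5, 2.0], [2.5, 1.0, 2.5]))  # 0.5
```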
1. Data transformation
   1. Centering / scaling
   2. Skewed data
   3. Outliers
Skewness
$$s = \frac{\sum_i (x_i - \bar{x})^3}{(n-1)\, v^{3/2}}, \qquad v = \frac{\sum_i (x_i - \bar{x})^2}{n-1}$$
A value of s of 20 indicates high skewness. A log transformation
helps reduce the skewness.
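A small sketch of the skewness statistic above and of the effect of a log transformation, assuming right-skewed (e.g. lognormal) toy data:

```python
import numpy as np

def sample_skewness(x):
    """Skewness statistic from the slide: sum((x - mean)^3) / ((n - 1) * v^(3/2)),
    with v = sum((x - mean)^2) / (n - 1)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    v = np.sum((x - x.mean()) ** 2) / (n - 1)
    return np.sum((x - x.mean()) ** 3) / ((n - 1) * v ** 1.5)

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strongly right-skewed toy data
print(sample_skewness(x))           # clearly positive
print(sample_skewness(np.log(x)))   # close to 0 after the log transformation
```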
Between-Predictor Correlations
Predictors can be correlated. If the correlation among predictors is high, then the
ordinary least squares solution for linear regression will have high variability
and will be unstable -> poor interpretability
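This instability can be seen in a small simulation; the sketch below (all numbers are illustrative) refits OLS on repeatedly simulated data sets and compares the spread of a coefficient for weakly vs. strongly correlated predictors:

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_ols(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta[1:]                                    # drop the intercept

def coef_spread(rho, n_rep=200, N=50):
    """Std of the OLS coefficient of x1 over repeated simulated data sets."""
    coefs = []
    for _ in range(n_rep):
        x1 = rng.normal(size=N)
        x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=N)  # corr(x1, x2) ~ rho
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=N)
        coefs.append(fit_ols(np.column_stack([x1, x2]), y)[0])
    return np.std(coefs)

print(coef_spread(rho=0.1))   # small spread: stable coefficients
print(coef_spread(rho=0.99))  # much larger spread: unstable, hard to interpret
```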
The hope is that the new basis will filter out the noise and reveal the hidden
structure of the data -> in our example it will identify x as the important
direction..
You may have noticed the use of the word linear: PCA makes the stringent
but powerful assumption of linearity -> restricts the set of potential bases
PCA – formal definition
• PCA: orthogonal projection of the data onto a
lower-dimensional space, such that the variance
of the projected data is maximal
Variance and the goal
Geometrical interpretation: find the rotation of the basis (axes) such that the first axis lies
in the direction of greatest variation. In the new system the predictors (PCs) are orthogonal
PCA - Redundancy
Scree plot
PCA example: image compression
Principal Component Analysis (PCA)
PCs are surrogate features / variables and therefore (linear) functions of the
original variables which better re-express the data
Then we can express the PCs as linear combinations of the original predictors.
The first PC is the best linear combination – the one capturing most of the variance
p = # of predictors
$a_{1m}, a_{2m}, \dots, a_{pm}$: component weights / loadings of component m
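A minimal numpy sketch of PCA via the SVD of the centered data (toy data, illustrative names): the columns of the loading matrix are the $a_{jm}$, the scores $Z_m$ are linear combinations of the original predictors, and the explained-variance fractions are what a scree plot displays.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 5
X = rng.normal(size=(N, p))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=N)  # introduce correlation / redundancy

Xc = X - X.mean(axis=0)                  # PCA works on centered predictors

# SVD of the centered data: columns of Vt.T are the loadings a_{jm}
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                          # shape (p, p); column m = (a_{1m}, ..., a_{pm})

# PC scores Z_m = sum_j a_{jm} X_j, i.e. linear combinations of the original predictors
Z = Xc @ loadings                        # shape (N, p)

# Fraction of variance explained by each component (the values a scree plot shows)
explained = S ** 2 / np.sum(S ** 2)
print(explained)                                   # decreasing; PC1 captures the most variance
print(np.round(np.corrcoef(Z, rowvar=False), 2))   # PCs are uncorrelated (near-identity matrix)
```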
Summarizing..
The cool thing is that we have created components (PCs) which are uncorrelated.
Some predictive models prefer predictors which are uncorrelated in order
to find a good solution. PCA creates new predictors with such characteristics!
$$Z_m = \sum_{j=1}^{p} a_{jm} X_j \qquad\qquad y_i = \theta_0 + \sum_{m=1}^{M} \theta_m Z_{im} \quad \text{(fitting a regression model to the } Z_m)$$
The choice of $Z_1 \dots Z_M$ and the selection of the $a_{jm}$ can be achieved in different ways.
One way is Principal Component Regression (PCR) – almost PLS..
E.g. $Z_1 = a_{11} x_1 + a_{21} x_2$: the first principal component in the case of two variables
(the $a_{jm}$ are the loadings, the values of $Z_m$ are the scores)
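A sketch of Principal Component Regression along these lines, assuming the PCA-via-SVD construction above (function names and data are illustrative, not from the lecture):

```python
import numpy as np

def pcr_fit(X, y, M):
    """PCR sketch: build the first M principal components, then regress y on their scores."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt.T[:, :M]                                  # loadings a_{jm} of the first M components
    Z = Xc @ A                                       # scores Z_1 ... Z_M
    Zd = np.column_stack([np.ones(len(y)), Z])
    theta, *_ = np.linalg.lstsq(Zd, y, rcond=None)   # y_i = theta_0 + sum_m theta_m Z_im
    return x_mean, A, theta

def pcr_predict(X_new, x_mean, A, theta):
    Z_new = (X_new - x_mean) @ A
    return theta[0] + Z_new @ theta[1:]

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=80)
model = pcr_fit(X, y, M=3)
print(pcr_predict(X[:5], *model))   # predictions for the first 5 training points
```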
Drawback of PCR
We assume that the directions in which the $x_i$ show the most variation are the
directions associated with the response y..
But this assumption is not always fulfilled, and when $Z_1 \dots Z_M$ are produced in
an unsupervised way there is no guarantee that these directions (which best
explain the input) are also the best to explain the output.
When will PCR perform worse than ordinary least squares regression?
Partial Least Square Regression (PLSR)
• Supervised alternative to PCR. It makes use of the response Y to identify
the new features
• attempts to find directions that help explain both the response and the
predictors
PLS Algorithm
$$Z_m = \sum_{j=1}^{p} a_{jm} X_j$$
1. Compute the first partial least squares direction
$$Z_1 = \sum_{j=1}^{p} a_{j1} X_j$$
by setting each $a_{j1}$ in the formula to the coefficient
from the simple linear regression of Y onto $X_j$
7. The iterative approach can be repeated M times to identify multiple PLS components
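A minimal sketch of step 1 (the first PLS direction) on centered toy data; for a centered predictor the simple-regression coefficient is $\langle x_j, y\rangle / \langle x_j, x_j\rangle$, which is what the code computes (data and names are illustrative):

```python
import numpy as np

def first_pls_direction(X, y):
    """Step 1 of PLS: set a_{j1} to the simple linear regression coefficient of y onto X_j,
    then form Z_1 = sum_j a_{j1} X_j. A minimal sketch on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Simple regression of y onto each single predictor X_j (no intercept after centering)
    a1 = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    Z1 = Xc @ a1
    return a1, Z1

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 4))
y = 2 * X[:, 0] - X[:, 2] + 0.1 * rng.normal(size=60)
a1, Z1 = first_pls_direction(X, y)
print(a1)   # weights a_{j1}: largest in absolute value for the informative predictors
```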
Example from the QSAR modeling
problem - PCR