Lecture1 2015

This document provides information about the STA 414/2104 Machine Learning course taught by Russ Salakhutdinov at the University of Toronto. It outlines the evaluation criteria including assignments, midterm, and final worth a total of 100%. It lists recommended textbooks and additional books. The document describes statistical machine learning as developing algorithms to learn from data by constructing stochastic models for prediction and decision making. It provides examples of machine learning successes and discusses finding structure in large datasets through methods like matrix factorization, topic modeling, and collaborative filtering. Finally, it outlines tentative course topics and different types of machine learning.


STA 414/2104: Machine Learning

Russ Salakhutdinov
Department of Computer Science
Department of Statistics
[email protected]
http://www.cs.toronto.edu/~rsalakhu/

Lecture 1
Evaluation
• 3 assignments worth 40%.
• Midterm worth 20%.
• Undergrads: final worth 40%.
• Graduates: 10% oral presentation, 30% final.
Text Books
• Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer.

Additional Books
• Kevin Murphy, Machine Learning: A Probabilistic Perspective.
• Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009), The Elements of Statistical Learning.
• David MacKay (2003), Information Theory, Inference, and Learning Algorithms.

• Most of the figures and material will come from these books.
Statistical Machine Learning
Statistical machine learning is a very dynamic field that lies at the intersection of Statistics and the computational sciences.

The goal of statistical machine learning is to develop algorithms that can learn from data by constructing stochastic models that can be used for making predictions and decisions.
Machine Learning's Successes
• Biostatistics / Computational Biology.
• Neuroscience.
• Medical Imaging:
- computer-aided diagnosis, image-guided therapy.
- image registration, image fusion.
• Information Retrieval / Natural Language Processing:
- Text, audio, and image retrieval.
- Parsing, machine translation, text analysis.
• Speech Processing:
- Speech recognition, voice identification.
• Robotics:
- Autonomous car driving, planning, control.
Mining for Structure
Massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements.

Domains: Images & Video, Text & Language, Speech & Audio, Gene Expression, Relational Data / Product Recommendation, Climate Change, Social Network, Geological Data.

Develop statistical models that can discover underlying structure, causes, or statistical correlations from data.
Example: Boltzmann Machine
Model parameters; latent (hidden) variables.

Input data (e.g. pixel intensities of an image, words from webpages, a speech signal); target variables (response) (e.g. class labels, categories, phonemes).

Markov Random Fields, Undirected Graphical Models.


Finding Structure in Data
Vector of word counts on a webpage → latent variables: hidden topics.

Figure: topics discovered from 804,414 newswire stories, e.g. European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings.
Matrix Factorization
Collaborative Filtering / Matrix Factorization / Hierarchical Bayesian Model

The rating value of user i for item j is modeled through a latent user feature (preference) vector and a latent item feature vector: latent variables that we infer from observed ratings.

Prediction: predict a rating r*_ij for user i and query movie j.

Posterior over Latent Variables: infer latent variables and make predictions using Markov chain Monte Carlo.
Finding  Structure  in  Data  
Collabora;ve  Filtering/  
Matrix  Factoriza;on/  
Product  Recommenda;on  

Learned    ``genre’’  
Neflix  dataset:     Fahrenheit  9/11   Independence  Day  
Bowling  for  Columbine   The  Day  Aher  Tomorrow  
480,189  users     The  People  vs.  Larry  Flynt   Con  Air  
17,770  movies     Canadian  Bacon   Men  in  Black  II  
La  Dolce  Vita   Men  in  Black  
Over  100  million  ra;ngs.  
Friday  the  13th  
The  Texas  Chainsaw  Massacre  
Children  of  the  Corn  
Child's  Play  
The  Return  of  Michael  Myers  

•  Part  of  the  wining  solu;on  in  the  Neflix  contest  (1  million  dollar  prize).  
Tentative List of Topics
• Linear methods for regression, Bayesian linear regression
• Linear models for classification
• Probabilistic generative and discriminative models
• Regularization methods
• Model comparison and BIC
• Neural networks
• Radial basis function networks
• Kernel methods, Gaussian processes, Support Vector Machines
• Mixture models and the EM algorithm
• Graphical models and Bayesian networks
Types of Learning
Consider observing a series of input vectors: x1, x2, …

• Supervised Learning: We are also given target outputs (labels, responses): y1, y2, …, and the goal is to predict the correct output given a new input.

• Unsupervised Learning: The goal is to build a statistical model of x, which can be used for making predictions and decisions.

• Reinforcement Learning: The model (agent) produces a set of actions a1, a2, … that affect the state of the world, and receives rewards r1, r2, … The goal is to learn actions that maximize the reward (we will not cover this topic in this course).

• Semi-supervised Learning: We are given only a limited amount of labels, but lots of unlabeled data.
Supervised Learning
Classification: target outputs yi are discrete class labels. The goal is to correctly classify new inputs.

Regression: target outputs yi are continuous. The goal is to predict the output given new inputs.

Example: Handwritten Digit Classification.
Unsupervised Learning
The goal is to construct a statistical model that finds a useful representation of the data:
• Clustering
• Dimensionality reduction
• Modeling the data density
• Finding hidden causes (useful explanations) of the data

Unsupervised learning can be used for:
• Structure discovery
• Anomaly detection / outlier detection
• Data compression, data visualization
• Aiding classification/regression tasks
DNA Microarray Data
Expression matrix of 6830 genes (rows) and 64 samples (columns) for the human tumor data.

The display is a heat map ranging from bright green (under-expressed) to bright red (over-expressed).

Questions we may ask:
• Which samples are similar to other samples in terms of their expression levels across genes?
• Which genes are similar to each other in terms of their expression levels across samples?
Linear Least Squares
• Given a vector of d-dimensional inputs x = (x1, …, xd)^T, we want to predict the target (response) using the linear model:

  y(x, w) = w0 + w1 x1 + … + wd xd

• The term w0 is the intercept, often called the bias term. It will be convenient to include the constant variable 1 in x and write:

  y(x, w) = w^T x

• Observe a training set consisting of N observations (x1, …, xN), together with corresponding target values t1, …, tN.

• Note that X is an N × (d+1) matrix.

Linear Least Squares
One option is to minimize the sum of the squares of the errors between the predictions y(xn, w) for each data point xn and the corresponding real-valued targets tn.

Loss function: sum-of-squared error function:

  E(w) = (1/2) Σn ( y(xn, w) − tn )^2

(Figure source: Wikipedia.)
Linear Least Squares
If X^T X is nonsingular, then the unique solution is given by:

  w* = (X^T X)^{-1} X^T t

where w* is the optimal vector of weights, t is the vector of target values, and the design matrix X has one input vector per row.

(Figure source: Wikipedia.)

• At an arbitrary input x*, the prediction is y(x*, w*) = (w*)^T x*.
• The entire model is characterized by d+1 parameters w*.
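The closed-form solution above is easy to sketch in NumPy (an illustrative toy example, not course code; the small dataset is assumed):

```python
# Minimal sketch: solving linear least squares via the normal equations
# w* = (X^T X)^{-1} X^T t, using NumPy.
import numpy as np

# Toy training set: targets generated by t = 1 + 2*x with no noise,
# so least squares should recover the weights exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x

# Design matrix: prepend the constant 1 to each input (absorbs the bias w0).
X = np.column_stack([np.ones_like(x), x])

# Normal equations; np.linalg.solve is preferred over an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ t)
print(np.round(w, 6))  # → [1. 2.]
```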
Example: Polynomial Curve Fitting
Consider observing a training set consisting of N 1-dimensional observations x1, …, xN, together with corresponding real-valued targets t1, …, tN.

• The green plot is the true function sin(2πx).
• The training data was generated by taking xn spaced uniformly between [0, 1].
• The target set (blue circles) was obtained by first computing the corresponding values of the sin function, and then adding a small Gaussian noise.

Goal: Fit the data using a polynomial function of the form:

  y(x, w) = w0 + w1 x + w2 x^2 + … + wM x^M

Note: the polynomial function is a nonlinear function of x, but it is a linear function of the coefficients w → linear models.
Example:  Polynomial  Curve  Firng  
•  As  for  the  least  squares  example:    we  can  minimize  the  sum  of  the  
squares  of  the  errors  between  the  predic;ons                                    for  each  data  
point  xn  and  the  corresponding  target  values  tn.      

Loss  func;on:  sum-­‐of-­‐squared  


error  func;on:  

•  Similar  to  the  linear  least  squares:  Minimizing  sum-­‐of-­‐squared  error  


func;on  has  a  unique  solu;on  w*.    
•  The  model  is  characterized  by  M+1  parameters  w*.  
•  How  do  we  choose  M?  !  Model  SelecBon.  
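Because the polynomial is linear in w, fitting it is ordinary least squares on a design matrix whose columns are the powers of x. A minimal sketch (the quadratic toy data is assumed, not from the slides):

```python
# Polynomial curve fitting as linear least squares on the columns 1, x, ..., x^M.
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
t = 3.0 - x + 0.5 * x**2                   # targets from a quadratic, no noise

M = 2                                      # polynomial order (model selection!)
X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2

w = np.linalg.solve(X.T @ X, X.T @ t)      # unique solution w*
print(np.round(w, 6))                      # recovers the coefficients [3, -1, 0.5]
```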
Some Fits to the Data
For M=9, we have fitted the training data perfectly.

Overfitting
• Consider a separate test set containing 100 new data points generated using the same procedure that was used to generate the training data.
• For M=9, the training error is zero → the polynomial contains 10 degrees of freedom corresponding to 10 parameters w, and so can be fitted exactly to the 10 data points.
• However, the test error has become very large. Why?
Overfitting
• As M increases, the magnitude of the coefficients gets larger.
• For M=9, the coefficients have become finely tuned to the data.
• Between data points, the function exhibits large oscillations.

More flexible polynomials with larger M tune to the random noise on the target values.
Varying the Size of the Data
9th-order polynomial

• For a given model complexity, the overfitting problem becomes less severe as the size of the dataset increases.
• However, the number of parameters is not necessarily the most appropriate measure of model complexity.
Generalization
• The goal is to achieve good generalization by making accurate predictions for new test data that is not known during learning.
• Choosing the values of parameters that minimize the loss function on the training data may not be the best option.
• We would like to model the true regularities in the data and ignore the noise in the data:
- It is hard to know which regularities are real and which are accidental due to the particular training examples we happen to pick.
• Intuition: We expect the model to generalize if it explains the data well given the complexity of the model.
• If the model has as many degrees of freedom as the data, it can fit the data perfectly. But this is not very informative.
• There is some theory on how to control model complexity to optimize generalization.
A Simple Way to Penalize Complexity
One technique for controlling the over-fitting phenomenon is regularization, which amounts to adding a penalty term to the error function:

  Ẽ(w) = (1/2) Σn ( y(xn, w) − tn )^2 + (λ/2) ||w||^2

where ||w||^2 = w^T w and λ is called the regularization parameter. Note that we do not penalize the bias term w0.

• The idea is to "shrink" estimated parameters towards zero (or towards the mean of some other weights).
• Shrinking to zero: penalize coefficients based on their size.
• For a penalty function which is the sum of the squares of the parameters, this is known as "weight decay", or "ridge regression".
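The penalized loss also has a closed-form minimizer, w* = (X^T X + λI)^{-1} X^T t. A minimal sketch of the shrinkage effect (illustrative only; for simplicity the bias weight is penalized here too, unlike in the slides, and the toy data are assumed):

```python
# Ridge regression: adding λI to the normal equations shrinks the weights.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

X = np.vander(x, 10, increasing=True)      # M = 9 polynomial: 10 columns

def ridge(X, t, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

w_small = ridge(X, t, 1e-3)                # light regularization
w_large = ridge(X, t, 1.0)                 # heavy regularization

# The penalty shrinks coefficients towards zero as λ grows.
print(np.linalg.norm(w_large) < np.linalg.norm(w_small))  # → True
```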
Regularization
Graph of the root-mean-squared training and test errors vs. ln λ for the M=9 polynomial.

How do we choose λ?
Cross Validation
If the data is plentiful, we can divide the dataset into three subsets:
• Training Data: used for fitting/learning the parameters of the model.
• Validation Data: not used for learning but for selecting the model, or choosing the amount of regularization that works best.
• Test Data: used to get the performance of the final model.

For many applications, the supply of data for training and testing is limited. To build good models, we may want to use as much training data as possible. If the validation set is small, we get a noisy estimate of the predictive performance.

S-fold cross-validation:
• The data is partitioned into S groups.
• Then S−1 of the groups are used for training the model, which is evaluated on the remaining group.
• Repeat the procedure for all S possible choices of the held-out group.
• Performance scores from the S runs are averaged.
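The S-fold procedure above can be sketched generically (an illustrative helper, not course code; the `fit`/`error` callables and the toy constant-mean predictor are assumptions):

```python
# S-fold cross-validation: average the held-out error over all S folds.
import numpy as np

def s_fold_cv(x, t, S, fit, error):
    """Partition the data into S groups; train on S-1 groups, evaluate on the rest."""
    folds = np.array_split(np.arange(len(x)), S)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        model = fit(x[train], t[train])
        scores.append(error(model, x[held_out], t[held_out]))
    return np.mean(scores)          # average performance over the S runs

# Toy usage: score a constant-mean predictor with squared error.
x = np.arange(8, dtype=float)
t = 2.0 * x
fit = lambda xs, ts: np.mean(ts)
err = lambda m, xs, ts: np.mean((ts - m) ** 2)
result = s_fold_cv(x, t, S=4, fit=fit, error=err)
print(result)
```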
Basics of Probability Theory
• Consider two random variables X and Y:
- X takes any of the values xi, where i=1,…,M.
- Y takes any of the values yj, j=1,…,L.
• Consider a total of N trials, and let the number of trials in which X = xi and Y = yj be nij.

• Joint probability:

  p(X = xi, Y = yj) = nij / N

• Marginal probability:

  p(X = xi) = ci / N,

where ci = Σj nij is the number of trials in which X = xi, irrespective of the value of Y.
Basics of Probability Theory
• The marginal probability can be written as:

  p(X = xi) = Σj p(X = xi, Y = yj)

• It is called the marginal probability because it is obtained by marginalizing, or summing out, the other variables.
Basics of Probability Theory
• Conditional probability:

  p(Y = yj | X = xi) = nij / ci

• We can derive the following relationship:

  p(X = xi, Y = yj) = p(Y = yj | X = xi) p(X = xi)

which is the product rule of probability.
The Rules of Probability

Sum Rule:     p(X) = ΣY p(X, Y)

Product Rule: p(X, Y) = p(Y | X) p(X)
Bayes' Rule
• From the product rule, together with the symmetry property p(X, Y) = p(Y, X):

  p(Y | X) = p(X | Y) p(Y) / p(X)

• Remember the sum rule:

  p(X) = ΣY p(X | Y) p(Y)

• We will revisit Bayes' Rule later in class.
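A small numeric sketch of how the sum and product rules combine into Bayes' rule (all the probabilities below are assumed for illustration):

```python
# Bayes' rule for a binary Y and an observed event X = "hi".
# Assumed numbers: p(Y=1) = 0.3, p(X=hi | Y=1) = 0.8, p(X=hi | Y=0) = 0.2.
p_y1 = 0.3
p_x_given_y1 = 0.8
p_x_given_y0 = 0.2

# Sum rule: p(X=hi) = sum over Y of p(X=hi | Y) p(Y)
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)

# Bayes' rule: p(Y=1 | X=hi) = p(X=hi | Y=1) p(Y=1) / p(X=hi)
p_y1_given_x = p_x_given_y1 * p_y1 / p_x
print(round(p_y1_given_x, 4))  # → 0.6316
```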


Illustrative Example
• Distribution over two variables: X takes on 9 possible values, and Y takes on 2 possible values.
Probability Density
• If the probability of a real-valued variable x falling in the interval (x, x + δx) is given by p(x) δx as δx → 0, then p(x) is called the probability density over x.

• The probability density must satisfy the following two conditions:

  p(x) ≥ 0  and  ∫ p(x) dx = 1.
Probability Density
• The cumulative distribution function is defined as:

  P(z) = ∫_{−∞}^{z} p(x) dx

which also satisfies P'(x) = p(x).

• The sum and product rules take similar forms:

  p(x) = ∫ p(x, y) dy,   p(x, y) = p(y | x) p(x)
Expectations
• The average value of some function f(x) under a probability distribution (density) p(x) is called the expectation of f(x):

  E[f] = Σx p(x) f(x)   (or  E[f] = ∫ p(x) f(x) dx  for densities)

• If we are given a finite number N of points drawn from the probability distribution (density), then the expectation can be approximated as:

  E[f] ≈ (1/N) Σn f(xn)

• Conditional expectation with respect to the conditional distribution:

  Ex[f | y] = Σx p(x | y) f(x)
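The sample-average approximation can be checked directly (a minimal Monte Carlo sketch with an assumed toy distribution):

```python
# Approximating E[f] by (1/N) sum f(x_n) over draws x_n from p(x).
import random

random.seed(0)
N = 100_000
# Draws from the uniform density on [0, 1]; with f(x) = x^2, E[f] = 1/3.
samples = [random.random() for _ in range(N)]
estimate = sum(x * x for x in samples) / N
print(abs(estimate - 1.0 / 3.0) < 0.01)   # → True
```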


Variances and Covariances
• The variance of f(x) is defined as:

  var[f] = E[ (f(x) − E[f(x)])^2 ] = E[f(x)^2] − E[f(x)]^2

which measures how much variability there is in f(x) around its mean value E[f(x)].

• Note that if f(x) = x, then var[x] = E[x^2] − E[x]^2.


Variances and Covariances
• For two random variables x and y, the covariance is defined as:

  cov[x, y] = E[ (x − E[x]) (y − E[y]) ] = E[xy] − E[x] E[y]

which measures the extent to which x and y vary together. If x and y are independent, then their covariance vanishes.

• For two vectors of random variables x and y, the covariance is a matrix:

  cov[x, y] = E[ (x − E[x]) (y^T − E[y^T]) ]
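The scalar formula is easy to verify on a toy sample (illustrative numbers, not from the slides):

```python
# Sample covariance via cov[x, y] = E[xy] - E[x] E[y].
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])     # y = 2x: x and y vary together

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_xy)                          # positive, since y grows with x

# Here y = 2x, so cov[x, y] = 2 * var[x].
var_x = np.mean(x**2) - np.mean(x)**2
print(np.isclose(cov_xy, 2 * var_x))   # → True
```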
The Gaussian Distribution
• For the case of a single real-valued variable x, the Gaussian distribution is defined as:

  N(x | μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{ −(x − μ)^2 / (2σ^2) }

which is governed by two parameters:
- μ (mean)
- σ^2 (variance)

β = 1/σ^2 is called the precision.

• Next class, we will look at various distributions as well as the multivariate extension of the Gaussian distribution.
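The density formula above can be evaluated directly, and its normalization checked numerically (a minimal sketch; the grid and interval are assumptions):

```python
# Evaluating the univariate Gaussian density N(x | mu, sigma^2).
import math

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 0.0, 1.0
# The density peaks at the mean, with height 1/sqrt(2*pi*sigma^2).
print(round(gaussian_pdf(mu, mu, sigma2), 4))   # → 0.3989

# Crude Riemann-sum check that the density integrates to 1 over [-8, 8].
dx = 0.001
total = sum(gaussian_pdf(-8 + i * dx, mu, sigma2) * dx for i in range(16000))
print(abs(total - 1.0) < 1e-3)                  # → True
```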
The Gaussian Distribution
• For the case of a single real-valued variable x, the Gaussian distribution is defined as:

  N(x | μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{ −(x − μ)^2 / (2σ^2) }

• The Gaussian distribution satisfies:

  N(x | μ, σ^2) > 0  and  ∫ N(x | μ, σ^2) dx = 1,

which are the two requirements for a valid probability density.
Mean and Variance
• The expected value of x takes the following form:

  E[x] = ∫ N(x | μ, σ^2) x dx = μ

Because the parameter μ represents the average value of x under the distribution, it is referred to as the mean.

• Similarly, the second-order moment takes the form:

  E[x^2] = ∫ N(x | μ, σ^2) x^2 dx = μ^2 + σ^2

• It then follows that the variance of x is given by:

  var[x] = E[x^2] − E[x]^2 = σ^2
Sampling Assumptions
• Suppose we have a dataset of observations x = (x1,…,xN)^T, representing N 1-dimensional observations.
• Assume that the training examples are drawn independently from the set of all possible examples, or from the same underlying distribution.
• We also assume that the training examples are identically distributed → the i.i.d. assumption.
• Assume that the test samples are drawn in exactly the same way -- i.i.d. from the same distribution as the training data.
• These assumptions make it unlikely that some strong regularity in the training data will be absent in the test data.
Gaussian Parameter Estimation
• Suppose we have a dataset of i.i.d. observations x = (x1,…,xN)^T, representing N 1-dimensional observations.
• Because our dataset x is i.i.d., we can write down the joint probability of all the data points given μ and σ^2:

  p(x | μ, σ^2) = Πn N(xn | μ, σ^2)

• When viewed as a function of μ and σ^2, this is called the likelihood function for the Gaussian.
Maximum (Log) Likelihood
• The log-likelihood can be written as:

  ln p(x | μ, σ^2) = −(1/(2σ^2)) Σn (xn − μ)^2 − (N/2) ln σ^2 − (N/2) ln(2π)

• Maximizing w.r.t. μ gives the sample mean:

  μ_ML = (1/N) Σn xn

• Maximizing w.r.t. σ^2 gives the sample variance:

  σ^2_ML = (1/N) Σn (xn − μ_ML)^2
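The two ML estimators are one line each (a minimal sketch on assumed toy data):

```python
# ML estimates for a Gaussian: the sample mean and the (biased, 1/N) sample variance.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 6.0])
N = len(x)

mu_ml = np.sum(x) / N
var_ml = np.sum((x - mu_ml) ** 2) / N     # note: divides by N, not N-1

print(mu_ml, var_ml)   # → 3.0 3.5
```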
Properties of the ML Estimation
• The ML approach systematically underestimates the variance of the distribution.
• This is an example of a phenomenon called bias.
• It is straightforward to show that:

  E[μ_ML] = μ,   E[σ^2_ML] = ((N−1)/N) σ^2

• It follows that the following estimate is unbiased:

  σ̃^2 = (N/(N−1)) σ^2_ML = (1/(N−1)) Σn (xn − μ_ML)^2
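The (N−1)/N bias can be seen empirically (an illustrative simulation with assumed parameters, not from the slides):

```python
# Averaged over many datasets of size N, the ML variance estimate comes out
# near ((N-1)/N) * sigma^2, illustrating the bias.
import random

random.seed(1)
N, trials = 2, 200_000        # true distribution: standard Gaussian, sigma^2 = 1
total_ml = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    mu_ml = sum(xs) / N
    total_ml += sum((x - mu_ml) ** 2 for x in xs) / N

avg_ml = total_ml / trials
print(round(avg_ml, 2))       # close to (N-1)/N * sigma^2 = 0.5, not 1.0
```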


Properties of the ML Estimation
• Example of how bias arises in using ML to determine the variance of a Gaussian distribution:

• The green curve shows the true Gaussian distribution.
• Fit three datasets, each consisting of two blue points.
• Averaged across the 3 datasets, the mean is correct.
• But the variance is underestimated, because it is measured relative to the sample (and not the true) mean.
Probabilistic Perspective
• So far we saw that polynomial curve fitting can be expressed in terms of error minimization. We now view it from a probabilistic perspective.
• Suppose that our model arose from a statistical model:

  t = y(x, w) + ε

where ε is a random error having a Gaussian distribution with zero mean, and is independent of x.

Thus we have:

  p(t | x, w, β) = N(t | y(x, w), β^{-1})

where β is a precision parameter, corresponding to the inverse variance.

Note: I will use probability distribution and probability density interchangeably. It should be obvious from the context.
Maximum Likelihood
If the data are assumed to be independently and identically distributed (i.i.d. assumption), the likelihood function takes the form:

  p(t | X, w, β) = Πn N(tn | y(xn, w), β^{-1})

It is often convenient to maximize the log of the likelihood function:

  ln p(t | X, w, β) = −(β/2) Σn ( y(xn, w) − tn )^2 + (N/2) ln β − (N/2) ln(2π)

• Maximizing the log-likelihood with respect to w (under the assumption of Gaussian noise) is equivalent to minimizing the sum-of-squared error function.
• Determine w_ML by maximizing the log-likelihood. Then maximizing w.r.t. β gives:

  1/β_ML = (1/N) Σn ( y(xn, w_ML) − tn )^2
Predictive Distribution
Once we have determined the parameters w_ML and β_ML, we can make predictions for new values of x:

  p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^{-1})

Later we will consider Bayesian linear regression.
Statistical Decision Theory
• We now develop a small amount of theory that provides a framework for developing many of the models we consider.
• Suppose we have a real-valued input vector x and a corresponding target (output) value t with joint probability distribution p(x, t).
• Our goal is to predict the target t given a new value for x:
- for regression: t is a real-valued continuous target.
- for classification: t is a categorical variable representing class labels.

The joint probability distribution p(x, t) provides a complete summary of the uncertainties associated with these random variables. Determining p(x, t) from training data is known as the inference problem.
