
Machine Learning Crash Course: Part I

Ariel Kleiner
August 21, 2012

Machine learning exists at the intersection of
computer science and statistics.
Examples
• Spam filters
• Search ranking
• Click (and clickthrough rate) prediction
• Recommendations (e.g., Netflix, Facebook friends)
• Speech recognition
• Machine translation
• Fraud detection
• Sentiment analysis
• Face detection, image classification
• Many more
A Variety of Capabilities
• Classification
• Regression
• Ranking
• Clustering
• Dimensionality reduction
• Feature selection
• Structured probabilistic modeling
• Collaborative filtering
• Active learning and experimental design
• Reinforcement learning
• Time series analysis
• Hypothesis testing
• Structured prediction
For Today

Classification
Clustering

(with emphasis on implementability and scalability)
Typical Data Analysis Workflow

Obtain and load raw data
Data exploration
Preprocessing and featurization
Learning
Diagnostics and evaluation
Classification
• Goal: Learn a mapping from entities to discrete labels.
– Refer to entities as x and labels as y.
• Example: spam classification
– Entities are emails.
– Labels are {spam, not-spam}.
– Given past labeled emails, want to predict whether a new email is spam or not-spam.
Classification
• Examples
– Spam filters
– Click (and clickthrough rate) prediction
– Sentiment analysis
– Fraud detection
– Face detection, image classification
Classification
Given a labeled dataset (x1, y1), ..., (xN, yN):
1. Randomly split the full dataset into two disjoint parts:
– A larger training set (e.g., 75%)
– A smaller test set (e.g., 25%)
2. Preprocess and featurize the data.
3. Use the training set to learn a classifier.
4. Evaluate the classifier on the test set.
5. Use the classifier to predict in the wild.
Classification

[Diagram: the full dataset is split into a training set and a test set; the training set is used to learn a classifier, whose accuracy is measured on the test set; a new entity is then given to the classifier to produce a prediction.]
Example: Spam Classification

From: [email protected]
"Eliminate your debt by giving us your money..." → spam

From: [email protected]
"Hi, it's been a while! How are you? ..." → not-spam
Featurization
• Most classifiers require numeric descriptions of entities as input.
• Featurization: Transform each entity into a vector of real numbers.
– Straightforward if the data are already numeric (e.g., patient height, blood pressure, etc.)
– Otherwise, some effort is required. But, this provides an opportunity to incorporate domain knowledge.
Featurization: Text
• Often use "bag of words" features for text.
– Entities are documents (i.e., strings).
– Build vocabulary: determine the set of unique words in the training set. Let V be the vocabulary size.
– Featurization of a document:
• Generate a V-dimensional feature vector.
• Cell i in the feature vector has value 1 if the document contains word i, and 0 otherwise.
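As a rough sketch of this featurization (the function names here are my own, not from the slides):

```python
def build_vocabulary(documents):
    """Collect the sorted set of unique words seen in the training documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def featurize(document, vocab):
    """Return a V-dimensional binary vector: cell i is 1 if word i appears."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocab]

training_docs = ["eliminate your debt", "hi how are you"]
vocab = build_vocabulary(training_docs)            # V = 7 unique words
vector = featurize("eliminate debt today", vocab)  # unseen words are ignored
```

Note that words outside the training vocabulary (here, "today") simply drop out of the feature vector.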
Example: Spam Classification

From: [email protected]
"Eliminate your debt by giving us your money..."

From: [email protected]
"Hi, it's been a while! How are you? ..."

Vocabulary: been, debt, eliminate, giving, how, it's, money, while
Example: Spam Classification

From: [email protected]
"Eliminate your debt by giving us your money..."

Feature vector: been → 0, debt → 1, eliminate → 1, giving → 1, how → 0, it's → 0, money → 1, while → 0
Example: Spam Classification
• How might we construct a classifier?
• Using the training data, build a model that will tell us the likelihood of observing any given (x, y) pair.
– x is an email's feature vector
– y is a label, one of {spam, not-spam}
• Given such a model, to predict the label for an email:
– Compute the likelihoods of (x, spam) and (x, not-spam).
– Predict the label which gives the highest likelihood.
Example: Spam Classification
• What is a reasonable probabilistic model for (x, y) pairs?
• A baseline:
– Before we observe an email's content, can we say anything about its likelihood of being spam?
– Yes: p(spam) can be estimated as the fraction of training emails which are spam.
– p(not-spam) = 1 − p(spam)
– Call this the "class prior." Written as p(y).
Example: Spam Classification
• How do we incorporate an email's content?
• Suppose that the email were spam. Then, what would be the probability of observing its content?
Example: Spam Classification
• Example: "Eliminate your debt by giving us your money" with feature vector (0, 1, 1, 1, 0, 0, 1, 0)
• Ignoring word sequence, the probability of the email is
  p(seeing "debt" AND seeing "eliminate" AND seeing "giving" AND seeing "money" AND not seeing any other vocabulary words | email is spam)
• In feature vector notation:
  p(x1=0, x2=1, x3=1, x4=1, x5=0, x6=0, x7=1, x8=0 | email is spam)
Example: Spam Classification
• Now, to simplify, model each word in the vocabulary independently:
– Assume that (given knowledge of the class label) the probability of seeing word i (e.g., eliminate) is independent of the probability of seeing word j (e.g., money).
– As a result, the probability of the email content becomes
  p(x1=0 | spam) p(x2=1 | spam) ... p(x8=0 | spam)
  rather than
  p(x1=0, x2=1, x3=1, x4=1, x5=0, x6=0, x7=1, x8=0 | spam)
Example: Spam Classification
• Now, we only need to model the probability of seeing (or not seeing) a particular word i, assuming that we knew the email's class y (spam or not-spam).
– But, this is easy!
– To estimate p(xi = 1 | y), simply compute the fraction of emails in the set {emails in training set with label y} which contain the word i.
Example: Spam Classification
• Putting it all together:
– Based on the training data, estimate the class prior p(y).
• i.e., estimate p(spam) and p(not-spam).
– Also estimate the (conditional) probability of seeing any individual word i, given knowledge of the class label y.
• i.e., estimate p(xi = 1 | y) for each i and y
– The (conditional) probability p(x | y) of seeing an entire email, given knowledge of the class label y, is then simply the product of the conditional word probabilities.
• e.g., p(x=(0, 1, 1, 1, 0, 0, 1, 0) | y) = p(x1=0 | y) p(x2=1 | y) ... p(x8=0 | y)
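A minimal training sketch for these estimates (function names are mine; I also add Laplace smoothing, which the slides do not discuss, so that words unseen in a class do not get probability zero):

```python
from collections import Counter

def train_naive_bayes(X, y):
    """Estimate the class prior p(y) and, for each class, the per-word
    conditional probabilities p(x_i = 1 | y) from binary feature vectors."""
    n = len(y)
    priors = {label: count / n for label, count in Counter(y).items()}
    conditionals = {}
    for label in priors:
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Fraction of class-y training emails containing word i,
        # with add-one (Laplace) smoothing to avoid zero probabilities.
        conditionals[label] = [
            (sum(r[i] for r in rows) + 1) / (len(rows) + 2)
            for i in range(len(X[0]))
        ]
    return priors, conditionals

X = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]      # toy binary feature vectors
y = ["spam", "spam", "not-spam"]
priors, conditionals = train_naive_bayes(X, y)
```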
Example: Spam Classification
• Recall: we want a model that will tell us the likelihood p(x, y) of observing any given (x, y) pair.
• The probability of observing (x, y) is the probability of observing y, and then observing x given that value of y:
  p(x, y) = p(y) p(x | y)
• Example:
  p("Eliminate your debt...", spam) = p(spam) p("Eliminate your debt..." | spam)
Example: Spam Classification
• To predict the label for a new email:
– Compute log[p(x, spam)] and log[p(x, not-spam)].
– Choose the label which gives the higher value.
– We use logs above to avoid the underflow which otherwise arises in computing the p(x | y), which are products of individual p(xi | y) < 1:
  log[p(x, y)] = log[p(y) p(x | y)]
               = log[p(y) p(x1 | y) p(x2 | y) ...]
               = log[p(y)] + log[p(x1 | y)] + log[p(x2 | y)] + ...
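A sketch of this prediction rule (the toy parameter values below are invented purely for illustration):

```python
import math

def predict(x, priors, conditionals):
    """Return the label maximizing log p(y) + sum_i log p(x_i | y)."""
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        score = math.log(prior)
        for xi, p in zip(x, conditionals[label]):
            # p(x_i = 1 | y) = p, so p(x_i = 0 | y) = 1 - p
            score += math.log(p if xi == 1 else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

priors = {"spam": 0.5, "not-spam": 0.5}
conditionals = {"spam": [0.9, 0.8], "not-spam": [0.1, 0.2]}
```

Working in log space keeps the sums well behaved even when an email has thousands of vocabulary words, where the raw product would underflow to 0.0.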
Classification: Beyond Text
• You have just seen an instance of the Naive Bayes classifier.
• Applies as shown to any classification problem with binary feature vectors.
• What if the features are real-valued?
– Still model each element of the feature vector independently.
– But, change the form of the model for p(xi | y).
Classification: Beyond Text
• If xi is a real number, often model p(xi | y) as Gaussian with mean μiy and variance σiy²:

  p(xi | y) = 1 / (σiy √(2π)) · exp(−(xi − μiy)² / (2σiy²))

• Estimate the mean and variance for a given i, y as the mean and variance of the xi in the training set which have corresponding class label y.
• Other, non-Gaussian distributions can be used if we know more about the xi.
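A sketch of the per-feature Gaussian model (helper names are my own):

```python
import math

def estimate_mean_variance(values):
    """Mean and (population) variance of feature i over the class-y training points."""
    mu = sum(values) / len(values)
    sigma2 = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, sigma2

def gaussian_conditional(xi, mu, sigma2):
    """p(x_i | y) under a Gaussian with mean mu and variance sigma2."""
    return math.exp(-((xi - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```

In the full classifier, `gaussian_conditional` simply replaces the binary p(xi = 1 | y) terms inside the same log-sum prediction rule.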
Naive Bayes: Benefits
• Can easily handle more than two classes and different data types
• Simple and easy to implement
• Scalable
Naive Bayes: Shortcomings
• Generally not as accurate as more sophisticated methods (but still generally reasonable).
• Independence assumption on the feature vector elements
– Can instead directly model p(x | y) without this independence assumption.
• Requires us to specify a full model for p(x, y)
– In fact, this is not necessary!
– To do classification, we actually only require p(y | x), the probability that the label is y, given that we have observed entity features x.
Logistic Regression
• Recall: Naive Bayes models the full (joint) probability p(x, y).
• But, Naive Bayes actually only uses the conditional probability p(y | x) to predict.
• Instead, why not just directly model p(y | x)?
– Logistic regression does exactly that.
– No need to first model p(y) and then separately p(x | y).
Logistic Regression
• Assume that class labels are {0, 1}.
• Given an entity's feature vector x, the probability that the label is 1 is taken to be

  p(y = 1 | x) = 1 / (1 + e^(−bᵀx))

  where b is a parameter vector and bᵀx denotes a dot product.
• The probability that the label is 1, given features x, is determined by a weighted sum of the features.
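As a sketch, computing this probability (the weight values below are arbitrary examples):

```python
import math

def predict_prob(x, b):
    """p(y = 1 | x) = 1 / (1 + exp(-b.x)) for features x and weights b."""
    dot = sum(bi * xi for bi, xi in zip(b, x))
    return 1.0 / (1.0 + math.exp(-dot))
```

When the weighted sum bᵀx is 0 the probability is exactly 0.5; large positive sums push it toward 1, large negative sums toward 0.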
Logistic Regression
• This is liberating:
– Simply featurize the data and go.
– No need to find a distribution for p(xi | y) which is particularly well suited to your setting.
– Can just as easily use binary-valued (e.g., bag of words) or real-valued features without any changes to the classification method.
– Can often improve performance simply by adding new features (which might be derived from old features).
Logistic Regression
• Can be trained efficiently at large scale, but not as easy to implement as Naive Bayes.
– Trained via maximum likelihood.
– Requires use of iterative numerical optimization (e.g., gradient descent, most basically).
– However, implementing this effectively, robustly, and at large scale is non-trivial and would require more time than we have today.
• Can be generalized to the multiclass setting.
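The most basic version, stochastic gradient ascent on the log-likelihood, can be sketched as follows (learning rate and step count are arbitrary choices here; this is a toy illustration, not a robust large-scale implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, steps=200, lr=0.5):
    """Fit weights b by maximum likelihood via stochastic gradient ascent."""
    b = [0.0] * len(X[0])
    for _ in range(steps):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(bj * xj for bj, xj in zip(b, xi)))
            # Gradient of the log-likelihood for one example is (y - p) * x.
            for j in range(len(b)):
                b[j] += lr * (yi - p) * xi[j]
    return b

# First feature acts as a constant bias term.
X = [[1.0, 0.0], [1.0, 1.0]]
y = [0, 1]
b = train_logistic(X, y)
```

Production implementations use careful step-size schedules, regularization, and parallelism; none of that appears in this sketch.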
Other Classification Techniques
• Support Vector Machines (SVMs)
• Kernelized logistic regression and SVMs
• Boosted decision trees
• Random Forests
• Nearest neighbors
• Neural networks
• Ensembles

See The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman for more information.
Featurization: Final Comments
• Featurization affords the opportunity to
– Incorporate domain knowledge
– Overcome some classifier limitations
– Improve performance
• Incorporating domain knowledge:
– Example: in spam classification, we might suspect that the sender is important, in addition to the email body.
– So, try adding features based on the sender's email address.
Featurization: Final Comments
• Overcoming classifier limitations:
– Naive Bayes and logistic regression do not model multiplicative interactions between features.
– For example, the presence of the pair of words [eliminate, debt] might indicate spam, while the presence of either one individually might not.
– Can overcome this by adding features which explicitly encode such interactions.
– For example, can add features which are products of all pairs of bag-of-words features.
– Can also include nonlinear effects in this manner.
– This is actually what kernel methods do.
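A sketch of adding pairwise product features (the helper name is mine):

```python
from itertools import combinations

def add_pairwise_features(x):
    """Append the product of every pair of features, so a linear classifier
    can pick up interactions like [eliminate, debt] co-occurring."""
    return list(x) + [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]

# A 3-word bag-of-words vector gains 3 interaction cells (one per word pair).
expanded = add_pairwise_features([1, 1, 0])
```

For binary features, the product x[i] * x[j] is 1 exactly when both words appear, which is the interaction signal described above.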
Classification
Given a labeled dataset (x1, y1), ..., (xN, yN):
1. Randomly split the full dataset into two disjoint parts:
– A larger training set (e.g., 75%)
– A smaller test set (e.g., 25%)
2. Preprocess and featurize the data.
3. Use the training set to learn a classifier.
4. Evaluate the classifier on the test set.
5. Use the classifier to predict in the wild.
Classification

[Diagram: the full dataset is split into a training set and a test set; the training set is used to learn a classifier, whose accuracy is measured on the test set; a new entity is then given to the classifier to produce a prediction.]
Classifier Evaluation
• How do we determine the quality of a trained classifier?
• Various metrics for quality; the most common is accuracy.
• How do we determine the probability that a trained classifier will correctly classify a new entity?
Classifier Evaluation
• Cannot simply evaluate a classifier on the same dataset used to train it.
– This will be overly optimistic!
• This is why we set aside a disjoint test set before training.
Classifier Evaluation
• To evaluate accuracy:
– Train on the training set without exposing the test set to the classifier.
– Ignoring the (known) labels of the data points in the test set, use the trained classifier to generate label predictions for the test points.
– Compute the fraction of predicted labels which are identical to the test set's known labels.
• Other, more sophisticated evaluation methods are available which make more efficient use of data (e.g., cross-validation).
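A sketch of the split-and-score procedure (function names are mine; the fixed seed just makes the toy example reproducible):

```python
import random

def split_train_test(examples, test_fraction=0.25, seed=0):
    """Randomly split labeled examples into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Fraction of predicted labels identical to the known test labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

train, test = split_train_test(list(range(100)))  # 75 / 25 split
```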
