
Advanced Topics in Machine Learning

Lecture 1 – Introduction

CS/CNS/EE 253
Andreas Krause
Learning from massive data
• Many applications require gaining insights from massive, noisy data sets
• Science
  - Physics (LHC, …), Astronomy (sky surveys, …), Neuroscience (fMRI, micro-electrode arrays, …), Biology (high-throughput microarrays, …), Geology (sensor arrays, …), …
  - Social science, economics, …
• Commercial / civil applications
  - Consumer data (online advertising, viral marketing, …)
  - Health records (evidence-based medicine, …)
• Security / defense related applications
  - Spam filtering / intrusion detection
  - Surveillance, …
Web-scale machine learning
• Predict relevance of search results from click data
• Personalization
• Online advertising
• Machine translation
• Learning to index
• Spam filtering
• Fraud detection
• …
>21 billion indexed web pages
[Images: L. Brouwer, T. Riley]
Analyzing fMRI data [Mitchell et al., Science, 2008]
• Predict activation patterns for nouns
• Google's trillion-word corpus used to measure co-occurrence
Monitoring transients in astronomy [Djorgovski]
[Image panels: Novae, Cataclysmic Variables; Supernovae; Gamma-Ray Bursts; Gravitational Microlensing; Accretion to SMBHs]


Data-rich astronomy [Djorgovski]
• A typical digital sky survey now generates ~10 - 100 TB, plus a comparable amount of derived data products
• PB-scale data sets are on the horizon
• Astronomy today has ~1 - 2 PB of archived data, and generates a few TB/day
• Both data volumes and data rates grow exponentially, with a doubling time of ~1.5 years
• Even more important is the growth of data complexity
• For comparison:
  - Human memory: ~a few hundred MB
  - Human genome: <1 GB
  - 1 TB: ~2 million books
  - Library of Congress (print only): ~30 TB
How is data-rich science different? [Djorgovski]
• The information volume grows exponentially
  - Most data will never be seen by humans
  - ⇒ The need for data storage, network, and database-related technologies, standards, etc.
• Information complexity is also increasing greatly
  - Most data (and data constructs) cannot be comprehended by humans directly
  - ⇒ The need for data mining, KDD, data understanding technologies, hyperdimensional visualization, AI/machine-assisted discovery, …
• We need to create a new scientific methodology for 21st-century, computationally enabled, data-rich science
• ML and AI will be essential components of this new scientific toolkit
Data volume in scientific and industrial applications [Meiron et al.]
[Figure: data volumes across scientific and industrial applications]
How can we gain insight from massive, noisy data sets?
Key questions
• How can we deal with data sets that don't fit in the main memory of a single machine?
  ⇒ Online learning
• Labels are expensive. How can we obtain the most informative labels at minimum cost?
  ⇒ Active learning
• How can we adapt the complexity of classifiers to large data sets?
  ⇒ Nonparametric learning
Overview
• Research-oriented advanced topics course
• 3 main topics
  - Online learning (from streaming data)
  - Active learning (for gathering the most useful labels)
  - Nonparametric learning (for model selection)
• Both theory and applications
• Handouts etc. on the course webpage
  http://www.cs.caltech.edu/courses/cs253/
Overview
• Instructors:
  Andreas Krause ([email protected]) and
  Daniel Golovin ([email protected])
• Teaching assistant:
  Deb Ray ([email protected])
• Administrative assistant:
  Sheri Garcia ([email protected])
Background & Prerequisites
• Formal requirement:
  CS/CNS/EE 156a or instructor's permission
Coursework
• Grading based on
  - 3 homework assignments (one per topic) (50%)
  - Course project (40%)
  - Scribing (10%)
• 3 late days
• Discussing assignments is allowed, but everybody must turn in their own solutions
• Start early!
Course project
• "Get your hands dirty" with the course material
• Implement an algorithm from the course or from a paper you read, and apply it to some data set
• Ideas on the course website (soon)
• Applying techniques you learnt to your own research is encouraged
• Must be something new (e.g., not work done last term)
Project: Timeline and grading
• Small groups (2-3 students)
• January 20: Project proposals due (1-2 pages); feedback by instructor and TA
• February 10: Project milestone
• March ~10: Poster session (TBA)
• March 15: Project report due
• Grading based on quality of the poster (20%), milestone report (20%), and final report (60%)
• We will have a Best Project Award!
Course overview
• Online learning from massive data sets
• Active learning to gather the most informative labels
• Nonparametric learning to adapt model complexity
This lecture: quick overview of all these topics
Traditional classification task
[Figure: spam (+) and ham (-) examples separated by a linear decision boundary]
• Input: labeled data set with positive (+) and negative (-) examples
• Output: decision rule (e.g., a linear separator)
Main memory vs. disk access
• Main memory: fast, random access, expensive
• Secondary memory (hard disk): ~10^4 times slower, sequential access, inexpensive
• Massive data ⇒ sequential access
How can we learn from streaming data?
Online classification task
[Figure: spam/ham examples arriving one at a time; X marks a classification error]
• Data arrives sequentially
• Need to classify one data point at a time
• Use a different decision rule (e.g., linear separator) each time
• Can't remember all data points!
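To make the streaming constraint concrete, here is a minimal sketch of one classical online linear classifier, the perceptron, which keeps only a weight vector in memory and updates it after each mistake. It illustrates learning from a stream; it is not the specific algorithm analyzed in this course, and the toy stream below is an illustrative assumption.

```python
import numpy as np

def online_perceptron(stream, dim):
    """Process (x, y) pairs one at a time; y in {-1, +1}.

    Keeps only a weight vector in memory -- no data points are stored.
    Returns the final weights and the number of online mistakes.
    """
    w = np.zeros(dim)
    mistakes = 0
    for x, y in stream:                    # each example is seen exactly once
        y_hat = 1 if w @ x >= 0 else -1    # predict before seeing the label
        if y_hat != y:                     # mistake: move the separator
            w += y * x
            mistakes += 1
    return w, mistakes

# Toy usage: a linearly separable stream generated on the fly.
rng = np.random.default_rng(0)
def toy_stream(n=1000, dim=5):
    w_true = rng.normal(size=dim)
    for _ in range(n):
        x = rng.normal(size=dim)
        yield x, (1 if w_true @ x >= 0 else -1)

w, mistakes = online_perceptron(toy_stream(), dim=5)
print("online mistakes:", mistakes)
```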


Model: Prediction from expert advice

Losses over time (rows = experts, columns = time steps):

Experts   ℓ_1  ℓ_2  ℓ_3   …  ℓ_T
e_1        0    1    0    1    1
e_2        0    0    1    0    0
e_3        1    1    0    1    0
…
e_n        1    0    0    0    0

Total loss of the chosen experts:  ∑_t ℓ(t, i_t) → min

Expert = someone with an opinion (not necessarily someone who knows something)
Think of an expert as a decision rule (e.g., a linear separator)
Performance metric: Regret
• Best expert in hindsight: i* = argmin_i ∑_t ℓ(t, i)
• Let i_1, …, i_T be the sequence of experts selected
• Instantaneous regret at time t: r_t = ℓ(t, i_t) - ℓ(t, i*)
• Total regret: R_T = ∑_{t=1}^{T} r_t
• Typical goal: want a selection strategy that guarantees R_T / T → 0 as T → ∞
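As a concrete illustration of these definitions, the short sketch below computes the total regret from a loss matrix and a sequence of chosen experts; the names `losses` and `chosen` and the toy numbers are placeholders for this example only.

```python
import numpy as np

# losses[t, i] = loss of expert i at time t; chosen[t] = expert picked at time t
losses = np.array([[0, 1, 1],
                   [1, 0, 1],
                   [0, 0, 1],
                   [1, 0, 0]])
chosen = np.array([1, 1, 2, 0])

T = losses.shape[0]
incurred = losses[np.arange(T), chosen].sum()    # ∑_t ℓ(t, i_t)
best_expert_loss = losses.sum(axis=0).min()      # min_i ∑_t ℓ(t, i)
total_regret = incurred - best_expert_loss
print(total_regret)   # here: (1 + 0 + 1 + 1) - min(2, 1, 3) = 3 - 1 = 2
```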
Expert selection strategies
• Pick an expert (classifier) uniformly at random?
• Always pick the best expert?
Randomized weighted majority
Input:
• Learning rate η ∈ (0, 1)
Initialization:
• Associate weight w_{1,s} = 1 with every expert s
For each round t:
• Choose expert s with probability proportional to its weight: p_{t,s} = w_{t,s} / ∑_{s'} w_{t,s'}
• Obtain losses ℓ(t, s)
• Update weights: w_{t+1,s} = w_{t,s} · (1 - η)^{ℓ(t,s)}
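A minimal sketch of how this update can be implemented, assuming the classical multiplicative (1 - η)^loss rule written above (other variants, e.g. exp(-η·loss), are also common; the toy loss matrix is an illustrative assumption):

```python
import numpy as np

def randomized_weighted_majority(loss_matrix, eta=0.1, seed=0):
    """Randomized Weighted Majority over a T x n matrix of expert losses in [0, 1].

    Each round an expert is drawn with probability proportional to its weight,
    its loss is incurred, and all weights are scaled down multiplicatively.
    Returns the total loss incurred and the regret against the best fixed expert.
    """
    rng = np.random.default_rng(seed)
    T, n = loss_matrix.shape
    w = np.ones(n)                            # w_{1,s} = 1 for every expert s
    incurred = 0.0
    for t in range(T):
        p = w / w.sum()                       # p_{t,s} proportional to w_{t,s}
        s = rng.choice(n, p=p)                # randomized choice of expert
        incurred += loss_matrix[t, s]
        w *= (1 - eta) ** loss_matrix[t]      # w_{t+1,s} = w_{t,s} (1 - eta)^loss
    best_fixed = loss_matrix.sum(axis=0).min()
    return incurred, incurred - best_fixed

# Toy run: 1000 rounds, 5 experts with random 0/1 losses.
rng = np.random.default_rng(1)
losses = rng.integers(0, 2, size=(1000, 5)).astype(float)
total_loss, regret = randomized_weighted_majority(losses)
print(f"total loss {total_loss:.0f}, regret {regret:.0f}")
```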
Guarantees for RWM
Theorem: For an appropriately chosen learning rate, Randomized Weighted Majority obtains sublinear regret: E[R_T] = O(√(T log n)), where n is the number of experts.

Note: no assumption is made about how the loss vectors ℓ are generated!
Practical problems
• In many applications, the number of experts (classifiers) is infinite
  ⇒ Online optimization (e.g., online convex programming)
• Often, only partial feedback is available (e.g., we obtain the loss only for the chosen classifier)
  ⇒ Multi-armed bandits, sequential experimental design
• Many practical problems are high-dimensional
  ⇒ Dimension reduction, sketching
Course overview
• Online learning from massive data sets
• Active learning to gather the most informative labels
• Nonparametric learning to adapt model complexity
This lecture: quick overview of all these topics
Spam or Ham?
[Figure: a few labeled spam (+) and ham (-) examples among many unlabeled examples (o)]
• Labels are expensive (we need to ask an expert)
• Which labels should we obtain to maximize classification accuracy?
Learning binary thresholds
• Input domain: D = [0, 1]
• True concept c:
  c(x) = +1 if x ≥ t
  c(x) = -1 if x < t
  [Figure: points on [0, 1], labeled - to the left of threshold t and + to the right]
• Samples x_1, …, x_n ∈ D drawn uniformly at random
Passive learning
• Input domain: D = [0, 1]
• True concept c:
  c(x) = +1 if x ≥ t
  c(x) = -1 if x < t
• Passive learning: acquire all labels y_i ∈ {+, -}
Active learning
• Input domain: D = [0, 1]
• True concept c:
  c(x) = +1 if x ≥ t
  c(x) = -1 if x < t
• Passive learning: acquire all labels y_i ∈ {+, -}
• Active learning: decide which labels to obtain
Classification error
• After obtaining n labels, D_n = {(x_1, y_1), …, (x_n, y_n)}, the learner outputs a hypothesis h consistent with the labels in D_n
  [Figure: threshold hypothesis on [0, 1]]
• Classification error: R(h) = E_{x~P}[h(x) ≠ c(x)]
Statistical active learning protocol
• Data source P (produces inputs x_i)
• Active learner assembles data set D_n = {(x_1, y_1), …, (x_n, y_n)} by selectively obtaining labels
• Learner outputs hypothesis h
• Classification error: R(h) = E_{x~P}[h(x) ≠ c(x)]
How many labels do we need to ensure that R(h) ≤ ε?
Label complexity for passive learning
[To reach classification error ε on the threshold problem, passive learning needs on the order of 1/ε labels; see the comparison below.]

Label complexity for active learning
[By binary searching for the threshold, active learning needs only on the order of log(1/ε) labels; see the comparison below.]
Comparison

                    Labels needed to learn with classification error ε
Passive learning    Ω(1/ε)
Active learning     O(log 1/ε)

Active learning can exponentially reduce the number of required labels! (See the sketch below.)
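To illustrate where the exponential gap comes from in the 1D threshold example, here is a minimal sketch of an active learner that binary-searches for the threshold over a pool of unlabeled points, querying only about log(1/ε) labels. The pool-based setup and function names are assumptions made for this illustration, not the course's formal protocol.

```python
import numpy as np

def active_threshold_learner(xs, label_oracle, eps):
    """Estimate a 1D threshold by binary search over a sorted pool of inputs.

    xs: unlabeled inputs from [0, 1]; label_oracle(x) returns c(x) in {-1, +1}.
    Assumes the pool contains points on both sides of the true threshold.
    """
    xs = np.sort(xs)
    lo, hi = 0, len(xs) - 1              # bracket that contains the threshold
    labels_used = 0
    while xs[hi] - xs[lo] > eps and hi - lo > 1:
        mid = (lo + hi) // 2
        labels_used += 1
        if label_oracle(xs[mid]) == 1:   # mid is already to the right of t
            hi = mid
        else:                            # mid is still to the left of t
            lo = mid
    t_hat = (xs[lo] + xs[hi]) / 2
    return t_hat, labels_used

# Toy usage with a hidden threshold t = 0.37.
rng = np.random.default_rng(0)
t_true = 0.37
oracle = lambda x: 1 if x >= t_true else -1
xs = rng.uniform(0, 1, size=10_000)
t_hat, n_labels = active_threshold_learner(xs, oracle, eps=0.001)
print(f"estimate {t_hat:.4f} using only {n_labels} labels")
```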
Key questions
• For which classification tasks can we provably reduce the number of labels?
• Can we do worse by active learning?
• Can we implement active learning efficiently?
Course overview
• Online learning from massive data sets
• Active learning to gather the most informative labels
• Nonparametric learning to adapt model complexity
This lecture: quick overview of all these topics
Nonlinear classification
[Figure: a small set of + and - examples requiring a nonlinear decision boundary]
• How should we adapt the classifier complexity to the growing data set size?
Nonlinear classification
[Figure: the same task with more data points]
• How should we adapt the classifier complexity to the growing data set size?
Nonlinear classification
[Figure: the same task with even more data points]
• How should we adapt the classifier complexity to the growing data set size?
Linear classification

Linear classification: learn a weight vector w by minimizing

  min_w  ∑_t loss(y_t, w·x_t) + λ ||w||²

where the first term is the loss function (e.g., the hinge loss max(0, 1 - y_t w·x_t)) and the second term is the complexity penalty.
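A minimal sketch of this trade-off in code, assuming the standard L2-regularized hinge-loss objective written above; the Pegasos-style step sizes and the toy data are illustrative choices, not taken from the slides.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Minimize (1/n) * sum_t max(0, 1 - y_t w.x_t) + lam/2 * ||w||^2
    with stochastic subgradient descent (Pegasos-style step sizes)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y[i] * (w @ X[i])
            grad = lam * w                 # gradient of the penalty term
            if margin < 1:                 # hinge loss is active
                grad -= y[i] * X[i]
            w -= eta * grad
    return w

# Toy usage on a linearly separable 2D problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] >= 0, 1, -1)
w = train_linear_svm(X, y)
print(f"training accuracy: {np.mean(np.sign(X @ w) == y):.2f}")
```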
From linear to nonlinear classification

Linear classification:     min_w  ∑_t loss(y_t, w·x_t) + λ ||w||²

Nonlinear classification:  min_f  ∑_t loss(y_t, f(x_t)) + λ ||f||²

What is an appropriate complexity penalty ||f|| for a function f??
1D Example
[Figure: nonlinear classification of 1D data; a smooth function f takes values near +1 at the + examples and near -1 at the - examples]
Representation of function f

The solution of  min_f ∑_t loss(y_t, f(x_t)) + λ ||f||²

can be written as  f(x) = ∑_t α_t k(x_t, x)

for an appropriate choice of ||f|| (Representer Theorem).

Here, k( · , · ) is called a kernel function (associated with || · ||).
Examples of kernels
[Figure panels: squared exponential kernel, exponential kernel, finite-dimensional kernel]
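For concreteness, below are common textbook forms of these kernels. The exact parameterizations on the slide are not shown, so the length-scale h and the specific formulas are assumptions for this sketch.

```python
import numpy as np

def squared_exponential_kernel(x, y, h=1.0):
    """k(x, y) = exp(-||x - y||^2 / h^2): very smooth sample functions."""
    return float(np.exp(-np.sum((x - y) ** 2) / h ** 2))

def exponential_kernel(x, y, h=1.0):
    """k(x, y) = exp(-||x - y|| / h): rougher, non-differentiable sample functions."""
    return float(np.exp(-np.linalg.norm(x - y) / h))

def linear_kernel(x, y):
    """k(x, y) = x.y: recovers ordinary finite-dimensional linear classification."""
    return float(np.dot(x, y))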
Nonparametric solution

The solution of  min_f ∑_t loss(y_t, f(x_t)) + λ ||f||²

can be written as  f(x) = ∑_t α_t k(x_t, x)

• The function f has one parameter α_t for each data point x_t!
• No finite-dimensional representation ⇒ "nonparametric"
• Large data set ⇒ huge number of parameters!!
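A minimal sketch of such a kernel expansion, fitting one coefficient α_t per training point. Kernel ridge regression on the ±1 labels is used here only as a simple way to obtain the α_t; it is an assumption for this illustration, not necessarily how the course fits them.

```python
import numpy as np

def fit_kernel_predictor(X, y, kernel, lam=0.1):
    """Fit f(x) = sum_t alpha_t k(x_t, x) by kernel ridge regression:
    alpha = (K + lam * I)^{-1} y, so there is one alpha_t per training point."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    def f(x):
        return sum(a * kernel(xt, x) for a, xt in zip(alpha, X))
    return f

# Toy 1D example, as in the slide: -1 labels left of a threshold, +1 to the right.
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.where(X[:, 0] >= 0.4, 1.0, -1.0)
sq_exp = lambda a, b: float(np.exp(-np.sum((a - b) ** 2) / 0.01))
f = fit_kernel_predictor(X, y, sq_exp)
print(np.sign(f(np.array([0.3]))), np.sign(f(np.array([0.7]))))  # -1.0 1.0
```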
Key questions
• How can we determine the right tradeoff between function expressiveness (number of parameters) and computational complexity?
• How can we control model complexity in an online fashion?
• How can we quantify uncertainty in nonparametric learning?
Course overview
[Diagram: the three course topics and their connections: Online Learning, Active Learning, and Nonparametric Learning, linked via bandit optimization, response surface methods, and active set selection]
