
 

 
 
 
 
 
 
 
 
CS525: Advanced Topics in Database Systems
Large-Scale Data Management
Spring 2013

Project 4

Total Points: 130

Release Date: 03/12/2013

Due Date: 03/28/2013

Teams: Project to be done in teams of two.
 
   

Short Description
In this project, you will write map-reduce jobs that implement data mining and machine learning techniques in Hadoop. More specifically, you will implement the K-Means clustering technique and the Naïve Bayes classifier.
 
 
Problem 1 (Naïve Bayes Classifier) [50 points]
The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. In general, any classification technique has two phases: (1) Phase 1: the creation of the classifier, and (2) Phase 2: using the created classifier to classify (label) new objects. In this problem you will implement Phase 1 of Naïve Bayes (i.e., the creation of the classifier) using Hadoop.

Hint: You may reference these links to get some ideas (in addition to the course slides):
http://nickjenkin.com/blog/?p=85
http://en.wikipedia.org/wiki/Naive_Bayes_classifier (especially the 'Sex Classification' example)
 
Step 1 (Creating a Training Dataset) [10 points]:
• Assume we have 20 class labels, namely {C1, C2, ..., C20}, and 50 numeric features, namely {F1, F2, ..., F50}. You need to create a training dataset for the classifier that consists of 2M (2 x 10^6) records, where the first field is the class label, followed by the numeric values of the 50 features.
• Make the probability of labels C1 to C5 10% each, of C6 to C10 6% each, and of the rest 2% each (these probabilities should sum to 1).
• Use a range and distribution of your choice for the values of each feature.
• The values in each record are comma separated.
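As an illustration only (not a required design), the label distribution above can be sampled with a cumulative-probability check. The feature range [0, 100) and the uniform feature distribution below are assumptions, since the handout leaves both to your choice:

```java
import java.io.PrintWriter;
import java.util.Random;

public class TrainingDataGen {
    // Pick a label index using the cumulative probabilities:
    // C1–C5 at 10% each, C6–C10 at 6% each, C11–C20 at 2% each (sums to 1).
    static int pickLabel(Random rnd) {
        double r = rnd.nextDouble();
        double cum = 0.0;
        for (int c = 1; c <= 20; c++) {
            cum += (c <= 5) ? 0.10 : (c <= 10) ? 0.06 : 0.02;
            if (r < cum) return c;
        }
        return 20; // guard against floating-point rounding at r ≈ 1
    }

    // One CSV record: the class label first, then 50 numeric feature values.
    static String record(Random rnd) {
        StringBuilder sb = new StringBuilder("C").append(pickLabel(rnd));
        for (int f = 0; f < 50; f++) sb.append(',').append(rnd.nextDouble() * 100.0);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Random rnd = new Random(42);
        try (PrintWriter out = new PrintWriter("training.csv")) {
            for (int i = 0; i < 2_000_000; i++) out.println(record(rnd));
        }
    }
}
```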
 
 
 
Step 2 (Build the Classifier Model) [20 points]:
Write map-reduce job(s) to build the Naïve Bayes classifier in a distributed fashion. The final output file (refer to Table 1) should have one record for each class label along with its learned probability (the percentage of records having this class label), and the mean and variance of each feature.
• The output file should not include a header line.
• Use an appropriate separator (of your choice) between the values.
 
 
 
Class label   Learned probability     Feature 1                  ...   Feature 50
C1            % of records with C1    mean and variance of F1    ...   mean and variance of F50
                                      values having label C1           values having label C1
C2            % of records with C2    mean and variance of F1    ...   mean and variance of F50
                                      values having label C2           values having label C2
...
C20           % of records with C20   mean and variance of F1    ...   mean and variance of F50
                                      values having label C20          values having label C20
Table 1: Classifier Model
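One way to compute the entries of Table 1 is to aggregate, per (class, feature), a (count, sum, sum-of-squares) triple — these triples are exactly what a Hadoop combiner could merge before the reducer — and then derive the mean and variance from the final triple. A minimal in-memory sketch of that aggregation follows; the row layout [prior, mean1, var1, mean2, var2, ...] is an assumption for illustration, not a required output format:

```java
import java.util.*;

public class NBModelBuilder {
    // Merge two partial aggregates (count, sum, sum of squares) for one
    // (class, feature) pair. In Hadoop, a combiner could merge these triples.
    static double[] merge(double[] a, double[] b) {
        return new double[]{a[0] + b[0], a[1] + b[1], a[2] + b[2]};
    }

    // Mean and (population) variance recovered from a final aggregate triple.
    static double[] meanVar(double[] agg) {
        double n = agg[0], mean = agg[1] / n;
        return new double[]{mean, agg[2] / n - mean * mean};
    }

    // Build {label -> [prior, mean1, var1, ...]} from comma-separated records
    // whose first field is the class label.
    static Map<String, double[]> build(List<String> records, int numFeatures) {
        Map<String, double[][]> aggs = new TreeMap<>();
        for (String rec : records) {
            String[] p = rec.split(",");
            double[][] agg = aggs.computeIfAbsent(p[0], k -> new double[numFeatures][3]);
            for (int f = 0; f < numFeatures; f++) {
                double v = Double.parseDouble(p[f + 1]);
                agg[f] = merge(agg[f], new double[]{1, v, v * v});
            }
        }
        Map<String, double[]> model = new TreeMap<>();
        for (Map.Entry<String, double[][]> e : aggs.entrySet()) {
            double[] row = new double[1 + 2 * numFeatures];
            row[0] = e.getValue()[0][0] / records.size(); // learned prior probability
            for (int f = 0; f < numFeatures; f++) {
                double[] mv = meanVar(e.getValue()[f]);
                row[1 + 2 * f] = mv[0];
                row[2 + 2 * f] = mv[1];
            }
            model.put(e.getKey(), row);
        }
        return model;
    }
}
```

The sum/sum-of-squares formulation is what makes the combiner trick work: partial triples from different mappers can be added freely, while partial means and variances cannot.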
 
 
 

Step 3 (Classify Unseen Values) [20 points]:
• Create a dataset similar to the training one, but without class labels. Create 500K lines, where each line consists of the values of the 50 features.
• Write map-reduce job(s) that read the unseen data records and classify them, i.e., assign a label to each record based on the model created in Step 2.
• The classification equation is given in the course slides, and you can also refer to the example in this link: http://en.wikipedia.org/wiki/Naive_Bayes_classifier (the 'Sex Classification' example)
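The classification step can be sketched locally as follows, assuming a Gaussian likelihood per feature (as in the Wikipedia sex-classification example) and the [prior, mean1, var1, ...] row layout used here purely for illustration; your model format and equation should follow the course slides:

```java
import java.util.Map;

public class NBClassifier {
    // Log-density of a Gaussian with the given mean and variance; working in
    // log space avoids underflow when multiplying 50 small probabilities.
    static double logGaussian(double x, double mean, double var) {
        return -0.5 * Math.log(2 * Math.PI * var) - (x - mean) * (x - mean) / (2 * var);
    }

    // Pick the label maximizing log P(C) + sum over features of log P(F_f | C).
    // Model rows are assumed laid out as [prior, mean_1, var_1, ..., mean_n, var_n].
    static String classify(double[] features, Map<String, double[]> model) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : model.entrySet()) {
            double[] row = e.getValue();
            double score = Math.log(row[0]);
            for (int f = 0; f < features.length; f++)
                score += logGaussian(features[f], row[1 + 2 * f], row[2 + 2 * f]);
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}
```

In the Hadoop job, each mapper would load the Step 2 model (e.g., via the distributed cache) and emit (record, chosen label) pairs; no reduce phase is strictly necessary.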
 
 
 
 
Problem 2 (K-Means Clustering) [50 points]
K-Means clustering is a popular algorithm for grouping similar objects into K groups (clusters). It starts with an initial seed of K points (randomly chosen) as centers, and the algorithm then iteratively refines these centers. The algorithm terminates either when two consecutive iterations generate the same K centers, i.e., the centers did not change, or when a maximum number of iterations is reached.

Hint: You may reference these links to get some ideas (in addition to the course slides):
http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
 
 
Step 1 (Creation of Dataset) [10 points]:
• Create a dataset that consists of 2-dimensional points, i.e., each point has (x, y) values. X and Y values each range from 0 to 10,000. Each point is on a separate line.
• Scale the dataset such that its size is around 100MB.
• Create another file that will contain the K initial seed points. Make the "K" value a parameter to your program, so that in the demo session I can give you a certain K, your program will generate these K seeds, and you then upload the generated file to the cluster.
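A minimal seed-generator sketch, taking K as a command-line parameter as the handout requires; drawing the seeds uniformly at random from [0, 10000) is an assumption, since the seeding strategy is left open:

```java
import java.io.PrintWriter;
import java.util.Random;

public class SeedGen {
    // K random (x, y) seed points in [0, 10000), one comma-separated pair per line.
    static String[] seeds(int k, long seed) {
        Random rnd = new Random(seed);
        String[] out = new String[k];
        for (int i = 0; i < k; i++)
            out[i] = (rnd.nextDouble() * 10000) + "," + (rnd.nextDouble() * 10000);
        return out;
    }

    public static void main(String[] args) throws Exception {
        int k = Integer.parseInt(args[0]); // K is given at demo time as a parameter
        try (PrintWriter out = new PrintWriter("seeds.txt")) {
            for (String s : seeds(k, 42)) out.println(s);
        }
    }
}
```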
 
 
Step 2 (Clustering the Data) [40 points]:
Write map-reduce job(s) that implement the K-Means clustering algorithm as given in the course slides. The algorithm should terminate when either of these two conditions becomes true:
a) The K centers did not change over two consecutive iterations.
b) The maximum number of iterations (make it six (6) iterations) has been reached.
• Apply the tricks given in class and in the 2nd link above, such as:
o Use of a combiner
o Use of a single reducer
o The reducer should indicate in its output file whether the centers have changed or not.

Hint: Since the algorithm is iterative, the program that launches the map-reduce jobs needs to control whether another iteration should start or not.
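The per-iteration logic can be sketched locally as below: the nearest-center assignment plays the role of the map phase, the per-center averaging plays the role of the single reducer, and the driver loop compares consecutive centers and enforces the six-iteration cap. This is an in-memory illustration of the algorithm, not the required Hadoop implementation:

```java
import java.util.*;

public class KMeansSketch {
    // "Map" side: index of the nearest center for one 2-D point
    // (squared Euclidean distance; the square root is not needed for argmin).
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double dx = p[0] - centers[i][0], dy = p[1] - centers[i][1];
            double d = dx * dx + dy * dy;
            if (d < bestD) { bestD = d; best = i; }
        }
        return best;
    }

    // "Reduce" side: new centers as the mean of the points assigned to each one.
    // A combiner would pre-sum (sum_x, sum_y, count) per center on the map side.
    static double[][] step(double[][] points, double[][] centers) {
        double[][] sum = new double[centers.length][2];
        int[] cnt = new int[centers.length];
        for (double[] p : points) {
            int c = nearest(p, centers);
            sum[c][0] += p[0];
            sum[c][1] += p[1];
            cnt[c]++;
        }
        double[][] next = new double[centers.length][2];
        for (int i = 0; i < centers.length; i++)
            next[i] = cnt[i] == 0 ? centers[i].clone()   // keep an empty cluster's center
                    : new double[]{sum[i][0] / cnt[i], sum[i][1] / cnt[i]};
        return next;
    }

    // Driver loop: stop when centers are unchanged or after maxIter iterations.
    static double[][] run(double[][] points, double[][] centers, int maxIter) {
        for (int it = 0; it < maxIter; it++) {
            double[][] next = step(points, centers);
            if (Arrays.deepEquals(next, centers)) return next; // condition (a): converged
            centers = next;
        }
        return centers; // condition (b): iteration cap reached
    }
}
```

In the Hadoop version, the driver reads the "centers changed or not" flag that the reducer writes to its output file to decide whether to launch the next job.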
 
 
 
 
 
 

Problem 3 (Use of Mahout) [30 points]
Mahout is a package that implements data mining and machine learning techniques on top of Hadoop, including Naïve Bayes and K-Means clustering.
• Choose one of the two techniques above and run it using Mahout.
• You need to understand the data format that Mahout accepts and the parameters it takes to run either of the two algorithms.

Hint: You may reference this link to get some ideas (in addition to the course slides):
http://mahout.apache.org/
   

What to Submit
You will submit a single zip file containing the Java code needed to answer the problems above. Also include a .doc or .pdf report file containing any required documentation.
 
 
How to Submit
Use the Blackboard system to submit your files.
 
 
Demonstrating Your Code
Each team will schedule an appointment with the instructor to demonstrate the project. The demonstration should take place within the week after the due date.
 
