CS525: Advanced Topics In Database Systems
Large-Scale Data Management
Spring 2013

Project 4

Total Points: 130

Release Date: 03/12/2013

Due Date: 03/28/2013

Teams: Project to be done in teams of two.

Short Description
In this project, you will write map-reduce jobs that implement data mining and machine learning
techniques in Hadoop. More specifically, you will implement the K-Means clustering technique and the
Naïve Bayes classifier.

Problem 1 (Naïve Bayes Classifier) [50 points]
The Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong
(naive) independence assumptions. In general, any classification technique has two phases: (1) Phase 1:
creating the classifier, and (2) Phase 2: using the created classifier to classify (label) new objects. In
this problem you will implement Naïve Bayes using Hadoop: Step 2 below covers Phase 1 (creating the
classifier) and Step 3 covers Phase 2 (classifying new records).

Hint: You may reference these links to get some ideas (in addition to the course slides):
http://nickjenkin.com/blog/?p=85
http://en.wikipedia.org/wiki/Naive_Bayes_classifier (especially the ‘Sex Classification Example’)

Step 1 (Creating a Training Dataset) [10 points]:
• Assume we have 20 class labels, namely {C1, C2, …, C20}, and 50 numeric features, namely {F1, F2, …,
F50}. You need to create a training dataset for the classifier that consists of 2M (2 x 10^6) records,
where the first field is the class label, followed by the numeric values of the 50 features.
• Make the probability of labels C1 to C5 10% each, of C6 to C10 6% each, and of C11 to C20 2% each
(these should sum to 1: 5 x 0.10 + 5 x 0.06 + 10 x 0.02 = 1).
• Use a range and distribution of your choice for the values in each feature.
• The values in each record are comma separated.
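
The label probabilities in Step 1 can be sampled by inverting a cumulative distribution. The following is a minimal sketch, not a required design: the class name `TrainingDataSketch`, the fixed seed, and the feature distribution (a Gaussian centered on the class index) are assumptions chosen only for illustration, since the handout leaves the range and distribution open.

```java
import java.util.Locale;
import java.util.Random;

public class TrainingDataSketch {
    static final int NUM_CLASSES = 20;
    static final int NUM_FEATURES = 50;

    // Cumulative label probabilities: C1-C5 10% each, C6-C10 6% each, C11-C20 2% each.
    static double[] cumulativeLabelProbabilities() {
        double[] cumulative = new double[NUM_CLASSES];
        double running = 0.0;
        for (int i = 0; i < NUM_CLASSES; i++) {
            running += (i < 5) ? 0.10 : (i < 10) ? 0.06 : 0.02;
            cumulative[i] = running;
        }
        return cumulative;
    }

    // Draws a 1-based class index by inverting the cumulative distribution.
    static int sampleLabel(double[] cumulative, Random rng) {
        double u = rng.nextDouble();
        for (int i = 0; i < cumulative.length; i++) {
            if (u < cumulative[i]) return i + 1;
        }
        return cumulative.length; // guards against rounding in the last bucket
    }

    // Builds one comma-separated record: the label first, then 50 feature values.
    static String makeRecord(double[] cumulative, Random rng) {
        int label = sampleLabel(cumulative, rng);
        StringBuilder sb = new StringBuilder("C").append(label);
        for (int f = 0; f < NUM_FEATURES; f++) {
            // Assumed distribution: Gaussian centered on the class index, unit variance.
            double value = label + rng.nextGaussian();
            sb.append(',').append(String.format(Locale.ROOT, "%.4f", value));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pass 2000000 as the first argument to produce the full training set.
        long records = args.length > 0 ? Long.parseLong(args[0]) : 5;
        double[] cumulative = cumulativeLabelProbabilities();
        Random rng = new Random(42);
        for (long i = 0; i < records; i++) {
            System.out.println(makeRecord(cumulative, rng));
        }
    }
}
```

Redirecting the program's standard output to a file gives the comma-separated training file that Step 2 reads.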

Step 2 (Build the Classifier Model) [20 points]:
Write map-reduce job(s) to build the Naïve Bayes classifier in a distributed fashion. The final output file
(refer to Table 1) should have one record for each class label along with its learned probability (the
percentage of records having this class label), and the mean and variance of each feature.
• The output file should not include a header line.
• Use an appropriate separator (of your choice) between the values.

Class label | Learned probability      | Feature 1                                        | Feature 2                                        | … | Feature 50
C1          | % of records with C1     | Mean and variance of F1 values having label C1  | …                                                | … | …
C2          | % of records with C2     | Mean and variance of F1 values having label C2  | …                                                | … | …
…
C20         | % of records with C20    | …                                                | Mean and variance of F2 values having label C20 | … | …
Table 1: Classifier Model
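
One way to compute the rows of Table 1 in a distributed fashion is to carry mergeable statistics (count, sum, and sum of squares per feature) through the map, combine, and reduce steps. The sketch below runs on one machine without Hadoop. The names `ModelBuilderSketch` and `ClassStats`, the tab and colon separators, and the choice of sample variance (n - 1 denominator, as in the Wikipedia example) are assumptions, and the course slides may define variance differently. The sum-of-squares formula can also lose precision on large values.

```java
import java.util.Map;
import java.util.TreeMap;

public class ModelBuilderSketch {
    // Mergeable per-class statistics: record count plus per-feature sum and sum of squares.
    static class ClassStats {
        long count;
        final double[] sum;
        final double[] sumSq;

        ClassStats(int numFeatures) {
            sum = new double[numFeatures];
            sumSq = new double[numFeatures];
        }

        // Map-side step: fold one record's feature values into the statistics.
        void add(double[] features) {
            count++;
            for (int f = 0; f < features.length; f++) {
                sum[f] += features[f];
                sumSq[f] += features[f] * features[f];
            }
        }

        // Combiner/reducer step: merge partial statistics for the same class.
        void merge(ClassStats other) {
            count += other.count;
            for (int f = 0; f < sum.length; f++) {
                sum[f] += other.sum[f];
                sumSq[f] += other.sumSq[f];
            }
        }

        double mean(int f) {
            return sum[f] / count;
        }

        // Sample variance (n - 1 denominator); assumes count >= 2.
        double variance(int f) {
            double m = mean(f);
            return (sumSq[f] - count * m * m) / (count - 1);
        }
    }

    // Parses "label,v1,v2,..." lines and accumulates statistics per class label.
    static Map<String, ClassStats> accumulate(String[] lines) {
        Map<String, ClassStats> stats = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            double[] features = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                features[i - 1] = Double.parseDouble(parts[i]);
            }
            stats.computeIfAbsent(parts[0], k -> new ClassStats(features.length)).add(features);
        }
        return stats;
    }

    public static void main(String[] args) {
        String[] lines = {"C1,1.0,10.0", "C1,3.0,14.0", "C2,5.0,5.0", "C2,7.0,9.0"};
        Map<String, ClassStats> stats = accumulate(lines);
        for (Map.Entry<String, ClassStats> e : stats.entrySet()) {
            ClassStats s = e.getValue();
            StringBuilder row = new StringBuilder(e.getKey());
            row.append('\t').append((double) s.count / lines.length); // learned probability
            for (int f = 0; f < s.sum.length; f++) {
                row.append('\t').append(s.mean(f)).append(':').append(s.variance(f));
            }
            System.out.println(row);
        }
    }
}
```

Because `merge` is associative, the same statistics object could serve as both the combiner and the reducer value. The learned probability needs the total record count, which in a real job would have to come from a counter or a second pass.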

Step 3 (Classify Unseen Values) [20 points]:
• Create a dataset similar to the training one, but without class labels. Create 500K lines, each
consisting of the values of the 50 features.
• Write map-reduce job(s) that read the unseen data records and classify them, i.e., assign a label to
each record based on the model created in Step 2.
• The classification equation is given in the course slides, and you can also refer to the example in this
link: http://en.wikipedia.org/wiki/Naive_Bayes_classifier (the ‘Sex Classification Example’)
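
As a rough illustration of the classification step, the sketch below scores each class with a Gaussian likelihood per feature, following the style of the Wikipedia example. It is an assumption that the course slides use the same Gaussian form. The class name `GaussianNaiveBayesSketch` and the array layout (`means[c][f]`, `variances[c][f]`) are hypothetical. Scores are summed in log space because a product of 50 small densities can underflow.

```java
public class GaussianNaiveBayesSketch {
    // Log of the Gaussian density N(x; mean, variance); assumes variance > 0.
    static double logGaussian(double x, double mean, double variance) {
        double diff = x - mean;
        return -0.5 * Math.log(2.0 * Math.PI * variance) - (diff * diff) / (2.0 * variance);
    }

    // Returns the index of the class with the highest log posterior score:
    // log P(C) + sum over features of log p(x_f | C).
    static int classify(double[] x, double[] priors, double[][] means, double[][] variances) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < priors.length; c++) {
            double score = Math.log(priors[c]);
            for (int f = 0; f < x.length; f++) {
                score += logGaussian(x[f], means[c][f], variances[c][f]);
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy model with two classes and two features (values made up for illustration).
        double[] priors = {0.5, 0.5};
        double[][] means = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] variances = {{1.0, 1.0}, {1.0, 1.0}};
        int label = classify(new double[]{9.0, 11.0}, priors, means, variances);
        System.out.println("C" + (label + 1));
    }
}
```

In a map-reduce setting this is naturally a map-only job: each mapper loads the small model file from Step 2 once and then labels its records independently.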

Problem 2 (K-Means Clustering) [50 points]
K-Means clustering is a popular algorithm for clustering similar objects into K groups (clusters). It starts
with an initial seed of K points (randomly chosen) as centers, and then the algorithm iteratively tries to
improve these centers. The algorithm terminates either when two consecutive iterations generate the
same K centers, i.e., the centers did not change, or when a maximum number of iterations is reached.

Hint: You may reference these links to get some ideas (in addition to the course slides):
http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering

Step 1 (Creation of Dataset) [10 points]:
• Create a dataset that consists of 2-dimensional points, i.e., each point has (x, y) values. The x and y
values each range from 0 to 10,000. Each point is on a separate line.
• Scale the dataset such that its size is around 100MB.
• Create another file that will contain the K initial seed points. Make K a parameter of your
program: in the demo session, I will give you a value of K, your program will generate the K seeds,
and then you will upload the generated file to the cluster.
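
A generator for the points file and the seeds file might be sketched as follows. The file names `points.txt` and `seeds.txt`, the class name `KMeansDataSketch`, the "x,y" line format, and the use of integer coordinates with a uniform distribution are all assumptions made for illustration. K is read from the first command-line argument, and an optional second argument overrides the default 100MB size target.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansDataSketch {
    static final int MAX_COORD = 10_000;

    // One "x,y" line with integer coordinates drawn uniformly from [0, 10000].
    static String randomPoint(Random rng) {
        return rng.nextInt(MAX_COORD + 1) + "," + rng.nextInt(MAX_COORD + 1);
    }

    // K seed points; duplicates are possible but unlikely and are not filtered here.
    static List<String> generateSeeds(int k, Random rng) {
        List<String> seeds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            seeds.add(randomPoint(rng));
        }
        return seeds;
    }

    // Writes points until about targetBytes have been written; returns the point count.
    static long writePoints(Path file, long targetBytes, Random rng) throws IOException {
        long written = 0;
        long count = 0;
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.US_ASCII)) {
            while (written < targetBytes) {
                String line = randomPoint(rng);
                out.write(line);
                out.write('\n');
                written += line.length() + 1; // ASCII: one byte per character
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        int k = Integer.parseInt(args[0]);
        long targetBytes = args.length > 1 ? Long.parseLong(args[1]) : 100L * 1024 * 1024;
        Random rng = new Random();
        Files.write(Paths.get("seeds.txt"), generateSeeds(k, rng), StandardCharsets.US_ASCII);
        long n = writePoints(Paths.get("points.txt"), targetBytes, rng);
        System.out.println("Wrote " + n + " points and " + k + " seeds");
    }
}
```

Since each line takes roughly ten bytes, a 100MB target corresponds to on the order of ten million points.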

Step 2 (Clustering the Data) [40 points]:
Write map-reduce job(s) that implement the K-Means clustering algorithm as given in the course slides.
The algorithm should terminate when either of these two conditions becomes true:
a) The K centers did not change over two consecutive iterations.
b) The maximum number of iterations (make it six (6) iterations) has been reached.
• Apply the tricks given in class and in the 2nd link above, such as:
o Use a combiner.
o Use a single reducer.
o The reducer should indicate in its output file whether the centers have changed or not.
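
The combiner, single-reducer, and changed-flag ideas listed above can be illustrated on one machine, without Hadoop, as follows. This is a sketch under assumptions rather than the algorithm from the course slides: the class and method names are hypothetical, each input array stands in for one input split, and an empty cluster simply keeps its previous center. The key point is that the combiner emits partial sums and counts rather than averages, so the reducer can merge them exactly.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KMeansIterationSketch {
    // Map step: index of the center nearest to point p (squared Euclidean distance).
    static int nearestCenter(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dx = p[0] - centers[c][0];
            double dy = p[1] - centers[c][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // Combiner step: collapse one split's points into {sumX, sumY, count} per center.
    static Map<Integer, double[]> combine(double[][] split, double[][] centers) {
        Map<Integer, double[]> partial = new HashMap<>();
        for (double[] p : split) {
            double[] acc = partial.computeIfAbsent(nearestCenter(p, centers), k -> new double[3]);
            acc[0] += p[0];
            acc[1] += p[1];
            acc[2] += 1;
        }
        return partial;
    }

    // Reduce step (single reducer): merge the partial sums and compute the new centers.
    // A center that attracted no points keeps its previous position.
    static double[][] reduce(List<Map<Integer, double[]>> partials, double[][] centers) {
        double[][] totals = new double[centers.length][3];
        for (Map<Integer, double[]> partial : partials) {
            for (Map.Entry<Integer, double[]> e : partial.entrySet()) {
                int c = e.getKey();
                for (int i = 0; i < 3; i++) {
                    totals[c][i] += e.getValue()[i];
                }
            }
        }
        double[][] updated = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            updated[c] = totals[c][2] == 0
                    ? centers[c].clone()
                    : new double[]{totals[c][0] / totals[c][2], totals[c][1] / totals[c][2]};
        }
        return updated;
    }

    // Flag the reducer would write to its output: did any center move?
    static boolean centersChanged(double[][] before, double[][] after) {
        return !Arrays.deepEquals(before, after);
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 10}};
        double[][] split1 = {{1, 1}, {9, 9}};
        double[][] split2 = {{3, 3}, {11, 11}};
        double[][] updated = reduce(
                Arrays.asList(combine(split1, centers), combine(split2, centers)), centers);
        System.out.println(Arrays.deepToString(updated)
                + " changed=" + centersChanged(centers, updated));
    }
}
```

An exact equality test for "changed" matches the handout's termination condition. A small distance tolerance is a common alternative when floating-point sums may differ slightly between runs.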

Hint: Because the algorithm is iterative, the driver program that launches the map-reduce jobs must
decide after each iteration whether to start another one.
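
The control flow of such a driver can be sketched independently of Hadoop. In the sketch below, `oneIteration` is a hypothetical stand-in for "submit one map-reduce job and read back the new centers". In a real driver that step would configure and run a Hadoop job and read the reducer's output (including its changed flag) from HDFS. The names `KMeansDriverSketch` and `Result` are assumptions.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

public class KMeansDriverSketch {
    static final int MAX_ITERATIONS = 6;

    // Outcome of the iterative run: final centers, iterations executed, and stop reason.
    static class Result {
        final double[][] centers;
        final int iterations;
        final boolean converged;

        Result(double[][] centers, int iterations, boolean converged) {
            this.centers = centers;
            this.iterations = iterations;
            this.converged = converged;
        }
    }

    // Runs one job per iteration; stops when the centers stop changing
    // or after MAX_ITERATIONS jobs, whichever comes first.
    static Result run(double[][] seeds, UnaryOperator<double[][]> oneIteration) {
        double[][] centers = seeds;
        for (int i = 1; i <= MAX_ITERATIONS; i++) {
            double[][] updated = oneIteration.apply(centers);
            boolean changed = !Arrays.deepEquals(centers, updated);
            centers = updated;
            if (!changed) {
                return new Result(centers, i, true);
            }
        }
        return new Result(centers, MAX_ITERATIONS, false);
    }

    public static void main(String[] args) {
        // Toy job: moves the single center one unit to the right until it reaches x = 3.
        UnaryOperator<double[][]> toyJob =
                c -> new double[][]{{Math.min(c[0][0] + 1.0, 3.0), c[0][1]}};
        Result r = run(new double[][]{{0.0, 0.0}}, toyJob);
        System.out.println(r.iterations + " iterations, converged=" + r.converged);
    }
}
```

Keeping the loop in the driver, rather than inside a job, reflects the hint above: each iteration is a separate job whose output becomes the next job's set of centers.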

Problem 3 (Use of Mahout) [30 points]
Mahout is a package that implements data mining and machine learning techniques on top of Hadoop,
including Naïve Bayes and K-Means clustering.
• Choose one of the two techniques above and run it using Mahout.
• You need to understand the data format that Mahout accepts and the parameters that it takes to run
either of the two algorithms.

Hint: You may reference this link to get some ideas (in addition to the course slides):
http://mahout.apache.org/

What to Submit
You will submit a single zip file containing the Java code needed to solve the problems above. Also include
a .doc or .pdf report file containing any required documentation.

How to Submit
Use the Blackboard system to submit your files.

Demonstrating Your Code
Each team will schedule an appointment with the instructor to demonstrate the project. The demonstration
should take place within the week after the due date.
