CS525: Advanced Topics In Database Systems
Large-Scale Data Management
Spring 2013

Project 1

Total Points: 120

Release Date: 01/22/2013

Due Date: 01/31/2013 (11:59PM)

Teams: Project to be done in teams of two.

Short Description
In this project, you will write map-reduce jobs in Java and run them on Hadoop.

Detailed Description
You are asked to perform three activities in this project: (1) create datasets, (2) upload the datasets into
Hadoop HDFS, and (3) query the data by writing map-reduce Java code.

1-Creating Datasets [20 Points]
Write a Java program that creates two datasets (two files), Customers and Transactions. Each line in the
Customers file represents one customer, and each line in the Transactions file represents one transaction. The
attributes within each line are comma separated.

The Customers dataset should have the following attributes for each customer:
ID: unique sequential number (integer) from 1 to 50,000 (that is, the file will have 50,000 lines)
Name: random sequence of characters of length between 10 and 20 (do not include commas)
Age: random number (integer) between 10 and 70
CountryCode: random number (integer) between 1 and 10
Salary: random number (float) between 100 and 10000

The Transactions dataset should have the following attributes for each transaction:
TransID: unique sequential number (integer) from 1 to 5,000,000 (the file has 5M transactions)
CustID: references one of the customer IDs, i.e., from 1 to 50,000 (on average, a customer has 100 transactions)
TransTotal: random number (float) between 10 and 1000
TransNumItems: random number (integer) between 1 and 10
TransDesc: random text of characters of length between 20 and 50 (do not include commas)

Note: The column names will NOT be stored in the file, only the comma-separated values. From the order of
the columns, you will know what each column represents.
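
The line formats described above can be sketched in plain Java as follows. This is only a minimal illustration, not a reference solution: the class name DatasetSketch, the helper names, the lowercase-only alphabet, and the two-decimal formatting of floats are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Random;

// Hypothetical helper class; names and formatting choices are illustrative only.
final class DatasetSketch {
    private static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

    // Random lowercase string with length in [minLen, maxLen]; it never contains a comma.
    static String randomText(Random rnd, int minLen, int maxLen) {
        int len = minLen + rnd.nextInt(maxLen - minLen + 1);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append(LETTERS.charAt(rnd.nextInt(LETTERS.length())));
        }
        return sb.toString();
    }

    // One Customers line: ID,Name,Age,CountryCode,Salary
    static String customerLine(int id, Random rnd) {
        String name = randomText(rnd, 10, 20);
        int age = 10 + rnd.nextInt(61);                 // 10..70
        int countryCode = 1 + rnd.nextInt(10);          // 1..10
        float salary = 100f + rnd.nextFloat() * 9900f;  // 100..10000
        return String.format(Locale.ROOT, "%d,%s,%d,%d,%.2f",
                id, name, age, countryCode, salary);
    }

    // One Transactions line: TransID,CustID,TransTotal,TransNumItems,TransDesc
    static String transactionLine(int transId, int numCustomers, Random rnd) {
        int custId = 1 + rnd.nextInt(numCustomers);     // references an existing customer ID
        float total = 10f + rnd.nextFloat() * 990f;     // 10..1000
        int numItems = 1 + rnd.nextInt(10);             // 1..10
        String desc = randomText(rnd, 20, 50);
        return String.format(Locale.ROOT, "%d,%d,%.2f,%d,%s",
                transId, custId, total, numItems, desc);
    }

    public static void main(String[] args) {
        // Tiny counts so the sketch runs instantly; the assignment asks for 50,000 and 5,000,000 lines.
        int numCustomers = 3;
        int numTransactions = 5;
        Random rnd = new Random();
        for (int id = 1; id <= numCustomers; id++) {
            System.out.println(customerLine(id, rnd));
        }
        for (int t = 1; t <= numTransactions; t++) {
            System.out.println(transactionLine(t, numCustomers, rnd));
        }
    }
}
```

For the real datasets, the same two methods could be called in loops that write to two separate files through a buffered writer instead of printing to standard output.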

2-Uploading Data into Hadoop [10 Points]
Use Hadoop file system commands (e.g., put) to upload the files you created to the Hadoop cluster.

Note: It is good practice to check your files and see how they are divided into blocks and how each block is replicated.

3-Writing MapReduce Jobs [90 Points]
You will write Java programs to query the data in Hadoop. Before writing your code, you should thoroughly
understand the “WordCount” example in:
http://hadoop.apache.org/common/docs/r0.17.0/mapred_tutorial.html

Notes:
• You should decide whether each query is a map-only job or a map-reduce job, and write your
code based on that. A given query may require more than a single map-reduce job.
• You can always check the query output file from the HDFS website and see its content.
• You can test your code on a small file first to make sure it is working correctly before running it
on the large datasets.

3.1) Query 1 [20 Points]
Write a job(s) that reports the customers whose CountryCode is between 2 and 6 (inclusive).
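
Because this query only selects lines and does not aggregate anything, it is a natural candidate for a map-only job. The per-record test that a map function would apply can be sketched in plain Java, so that it can be checked on a few lines without a cluster. The class name Query1Sketch and the method keep are hypothetical; the field position follows the Customers column order given earlier.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the selection logic only; it does not use the Hadoop API.
final class Query1Sketch {
    // Returns true when CountryCode (the 4th comma-separated field) is in [2, 6].
    static boolean keep(String customerLine) {
        String[] fields = customerLine.split(",");
        int countryCode = Integer.parseInt(fields[3].trim());
        return countryCode >= 2 && countryCode <= 6;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "1,abcdefghijkl,25,1,1500.00",
                "2,mnopqrstuvwx,40,2,3200.50",
                "3,yzabcdefghij,33,6,870.25",
                "4,klmnopqrstuv,58,7,9100.00");
        for (String line : sample) {
            if (keep(line)) {
                System.out.println(line); // a map function would emit the line here
            }
        }
    }
}
```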

3.2) Query 2 [20 Points]
Write a job(s) that reports for every customer, the number of transactions that customer did and the total
sum of these transactions. The output file should have one line for each customer containing:
CustomerID, NumTransactions, TotalSum

Repeat Q2 twice, once with a map-reduce combiner and once without a combiner. In the submitted report,
compare the performance of the two cases and write down your conclusion.
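
One point worth illustrating for the combiner variant: if the map output value already carries a (count, sum) pair, partial results can be merged in any grouping, so the same merge step can plausibly serve as both combiner and reducer. The plain-Java sketch below simulates this on in-memory lines. The names Query2Sketch, Agg, and aggregate are hypothetical, and nothing here uses the Hadoop API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of Query 2 aggregation; not Hadoop code.
final class Query2Sketch {
    // Partial aggregate for one customer: transaction count and running total.
    static final class Agg {
        final long count;
        final double sum;

        Agg(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }

        // Merging two partial aggregates gives the aggregate of their union.
        Agg merge(Agg other) {
            return new Agg(count + other.count, sum + other.sum);
        }
    }

    // Turns each transaction line into (CustID, Agg(1, TransTotal)) and merges by CustID.
    static Map<Integer, Agg> aggregate(List<String> transactionLines) {
        Map<Integer, Agg> byCustomer = new TreeMap<>();
        for (String line : transactionLines) {
            String[] f = line.split(",");
            int custId = Integer.parseInt(f[1]);
            Agg one = new Agg(1, Double.parseDouble(f[2]));
            byCustomer.merge(custId, one, Agg::merge);
        }
        return byCustomer;
    }

    public static void main(String[] args) {
        // Two "input splits", pre-aggregated separately as a combiner would do.
        List<String> split1 = Arrays.asList("1,7,10.50,3,aaa", "2,7,20.25,1,bbb");
        List<String> split2 = Arrays.asList("3,7,5.00,2,ccc", "4,9,1.00,1,ddd");
        Map<Integer, Agg> total = aggregate(split1);
        // The "reduce" step merges the partial aggregates from each split.
        aggregate(split2).forEach((id, agg) -> total.merge(id, agg, Agg::merge));
        total.forEach((id, agg) -> System.out.println(id + "," + agg.count + "," + agg.sum));
    }
}
```

Note that a reducer which simply counts its input values would give wrong counts once a combiner is enabled, because each incoming value may then stand for several transactions; carrying the count inside the value avoids that.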

3.3) Query 3 [20 Points]
Write a job(s) that joins the Customers and Transactions datasets (based on the customer ID) and reports
for each customer the following info:
CustomerID, Name, Salary, NumOfTransactions, TotalSum, MinItems

Where NumOfTransactions is the total number of transactions done by the customer, TotalSum is the sum
of field “TransTotal” for that customer, and MinItems is the minimum number of items in transactions
done by the customer.
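
The join result can be sketched in plain Java by grouping both inputs on the customer ID, which mirrors what a reduce-side join does when records from both files arrive under the same key. This is a simulation under stated assumptions, not Hadoop code: the names Query3Sketch, Row, and join are hypothetical, customers without any transaction are omitted (inner-join behavior, which the assignment does not specify), and TotalSum is printed with two decimals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the Query 3 join; not Hadoop code.
final class Query3Sketch {
    // Everything collected for one customer ID from both inputs.
    static final class Row {
        String name;
        String salary;
        long count;
        double sum;
        int minItems = Integer.MAX_VALUE;
    }

    static List<String> join(List<String> customers, List<String> transactions) {
        Map<Integer, Row> byId = new TreeMap<>();
        for (String line : customers) {
            String[] f = line.split(",");      // ID,Name,Age,CountryCode,Salary
            Row r = byId.computeIfAbsent(Integer.parseInt(f[0]), k -> new Row());
            r.name = f[1];
            r.salary = f[4];
        }
        for (String line : transactions) {
            String[] f = line.split(",");      // TransID,CustID,TransTotal,TransNumItems,TransDesc
            Row r = byId.computeIfAbsent(Integer.parseInt(f[1]), k -> new Row());
            r.count++;
            r.sum += Double.parseDouble(f[2]);
            r.minItems = Math.min(r.minItems, Integer.parseInt(f[3]));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Row> e : byId.entrySet()) {
            Row r = e.getValue();
            if (r.name == null || r.count == 0) {
                continue;                      // assumption: keep only IDs present in both inputs
            }
            out.add(String.format(Locale.ROOT, "%d,%s,%s,%d,%.2f,%d",
                    e.getKey(), r.name, r.salary, r.count, r.sum, r.minItems));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
                "1,alicealice,30,3,500.00",
                "2,bobbobbobb,41,5,750.00");
        List<String> transactions = Arrays.asList(
                "1,1,10.50,3,firstpurchase",
                "2,1,20.25,2,secondpurchase");
        join(customers, transactions).forEach(System.out::println);
    }
}
```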

3.4) Query 4 [30 Points]
Write a job(s) that reports for every country code, the number of customers having this code as well as
the min and max of TransTotal fields for the transactions done by those customers. The output file should
have one line for each country code containing:
CountryCode, NumberOfCustomers, MinTransTotal, MaxTransTotal

Hint: To get full marks for Query 4, you need to do it in a single map-reduce job.
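
One possible way to fit this query into a single job, offered here only as an assumption and not as the intended solution, is to exploit the small size of the Customers file: each map task loads a CustID to CountryCode lookup into memory and tags every transaction with its country code, so the reduce side only needs to track min and max per country. The plain-Java sketch below simulates that idea on in-memory lines. The names Query4Sketch, Stats, and summarize are hypothetical, and country codes without any transaction are skipped by choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of Query 4; not Hadoop code.
final class Query4Sketch {
    // Running statistics for one country code.
    static final class Stats {
        int customers;
        long transactions;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
    }

    static List<String> summarize(List<String> customers, List<String> transactions) {
        Map<Integer, Integer> countryOf = new HashMap<>();   // CustID -> CountryCode lookup
        Map<Integer, Stats> byCountry = new TreeMap<>();
        for (String line : customers) {
            String[] f = line.split(",");                    // ID,Name,Age,CountryCode,Salary
            int custId = Integer.parseInt(f[0]);
            int country = Integer.parseInt(f[3]);
            countryOf.put(custId, country);
            byCountry.computeIfAbsent(country, k -> new Stats()).customers++;
        }
        for (String line : transactions) {
            String[] f = line.split(",");                    // TransID,CustID,TransTotal,...
            Integer country = countryOf.get(Integer.parseInt(f[1]));
            if (country == null) {
                continue;                                    // transaction for an unknown customer
            }
            Stats s = byCountry.get(country);
            double total = Double.parseDouble(f[2]);
            s.transactions++;
            s.min = Math.min(s.min, total);
            s.max = Math.max(s.max, total);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Stats> e : byCountry.entrySet()) {
            Stats s = e.getValue();
            if (s.transactions == 0) {
                continue;                                    // assumption: skip countries with no transactions
            }
            out.add(String.format(Locale.ROOT, "%d,%d,%.2f,%.2f",
                    e.getKey(), s.customers, s.min, s.max));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
                "1,aaaaaaaaaa,20,3,100.00",
                "2,bbbbbbbbbb,30,3,200.00",
                "3,cccccccccc,40,5,300.00");
        List<String> transactions = Arrays.asList(
                "1,1,10.50,1,xxxxxxxxxxxxxxxxxxxx",
                "2,2,99.00,2,yyyyyyyyyyyyyyyyyyyy",
                "3,3,50.00,1,zzzzzzzzzzzzzzzzzzzz");
        summarize(customers, transactions).forEach(System.out::println);
    }
}
```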

Hint: It is important to know how Hadoop reads and writes integers, floats, and text fields. Check
IntWritable, FloatWritable, and Text classes to know which one to use and when.

What to Submit
You will submit a single zip file containing the Java programs for Creating Data Files and MapReduce
Queries, plus a document (.doc or .pdf) containing any required documentation.

How to Submit
Use the Blackboard system to submit your files.

Demonstrating Your Code
Each team will schedule an appointment with the instructor to demonstrate the project. The demonstration
should take place within the week after the due date.
