Assignment 2: Building a Batch Analytics Pipeline on HDFS & Hive

Due Date: 11:59 PM, 7th March

Scenario & Objectives


Your company, MediaCo, gathers large daily logs of user activity from a streaming platform (e.g., plays, skips,
pauses). Your task is to design a batch analytics solution using HDFS for data storage and Hive for querying:

1. Ingest daily log files from a local directory into HDFS, organizing them by date.
2. Create Hive tables to store raw data (CSV/JSON) and a star schema (fact + dimension tables) for analytics.
3. Run analytical queries to generate insights (monthly usage, top content, average session times).

Data Description

1. User Logs: (user_id, content_id, action, timestamp, device, region, session_id, ...)
   ○ Arrive in CSV or JSON format.
   ○ Each day's logs sit in a local folder named YYYY-MM-DD.
2. Content Metadata: (content_id, title, category, length, artist, ...)
   ○ Static reference data about each piece of content.

Core Requirements

1. Ingestion Script
   1. Write a shell script (e.g., ingest_logs.sh) that:
      ■ Accepts a date parameter (e.g., 2023-09-01).
      ■ Parses the year, month, and day from it.
      ■ Copies files into HDFS under directories like /raw/logs/<year>/<month>/<day> and /raw/metadata/<year>/<month>/<day>.
      (An illustrative script is sketched after this list.)
2. Raw Tables in Hive
   1. Create external tables pointing to /raw/logs and /raw/metadata.
   2. Partition the log table by (year, month, day) so queries can filter by date.
3. Star Schema
   1. Fact table: e.g., fact_user_actions, storing user actions (partitioned by date).
   2. Dimension table: e.g., dim_content, storing content metadata.
   3. Store them in a columnar format (e.g., Parquet). (Example DDL for the raw and star schema tables is sketched after this list.)
4. Transformation
   1. Use Hive SQL (INSERT OVERWRITE, CTAS) to move data from the raw tables into the star schema tables. (See the transformation sketch after this list.)
   2. Convert timestamps to proper types, if needed.
5. Queries
   1. Demonstrate 2–3 analytical queries:
      ■ E.g., "Monthly active users by region," "Top categories by play count," "Average weekly session length."
   2. Include GROUP BY, a join (fact + dimension), and filters on the date partitions. (A sample query is sketched after this list.)
6. Deliverables: Create a GitHub repository with 2 files and 1 folder; the PDF write-up is to be uploaded on the LMS.
   1. Input Data: a folder named raw_data containing your generated input files.
   2. Shell Ingestion Script: a short .sh file named ingest_logs.sh.
   3. Hive DDL for the raw and star schema tables. The working queries should be included in the document.
   4. Data Transformation commands. The working queries should be included in the document.
   5. Sample Queries with results (screenshots) included in the document.
   6. Short Write-Up covering the above queries and commands. Explain your design choices and performance considerations, in particular (1) how long the whole pipeline takes to execute and (2) the query execution times.
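
The sketches below are illustrative only, not required solutions. First, a minimal version of what ingest_logs.sh might look like. It assumes each day's generated files sit locally under ./raw_data/<date>/ and that the metadata file is ./raw_data/metadata/content_metadata.csv; both local paths are assumptions, so adjust them to your own layout.

    #!/usr/bin/env bash
    # ingest_logs.sh -- copy one day's logs (and the metadata file) into HDFS.
    # Usage: ./ingest_logs.sh 2023-09-01
    # Sketch only: the local paths below are assumptions, not part of the assignment spec.
    set -euo pipefail

    DATE="$1"                   # e.g., 2023-09-01
    YEAR="${DATE:0:4}"
    MONTH="${DATE:5:2}"
    DAY="${DATE:8:2}"

    LOG_DIR="/raw/logs/${YEAR}/${MONTH}/${DAY}"
    META_DIR="/raw/metadata/${YEAR}/${MONTH}/${DAY}"

    # Create the date-based directories in HDFS.
    hdfs dfs -mkdir -p "${LOG_DIR}" "${META_DIR}"

    # Copy the day's log files and the static metadata file.
    hdfs dfs -put -f ./raw_data/"${DATE}"/*.csv "${LOG_DIR}/"
    hdfs dfs -put -f ./raw_data/metadata/content_metadata.csv "${META_DIR}/"

    echo "Ingested logs for ${DATE} into ${LOG_DIR}"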
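
Next, a hedged sketch of the Hive DDL, using the columns from the Data Description. The raw table names (raw_logs, raw_content), the STRING-typed raw timestamp column ts, and the Parquet column action_ts are assumptions you are free to rename; fact_user_actions and dim_content follow the examples given above.

    -- Raw external table over /raw/logs, partitioned by ingest date.
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (
      user_id    INT,
      content_id INT,
      action     STRING,
      ts         STRING,      -- kept as text here; cast to TIMESTAMP during transformation
      device     STRING,
      region     STRING,
      session_id STRING
    )
    PARTITIONED BY (year STRING, month STRING, day STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/raw/logs'
    TBLPROPERTIES ('skip.header.line.count'='1');

    -- Register each ingested day as a partition, e.g.:
    ALTER TABLE raw_logs ADD IF NOT EXISTS
      PARTITION (year='2023', month='09', day='01') LOCATION '/raw/logs/2023/09/01';

    -- Raw external table over /raw/metadata, partitioned the same way for simplicity.
    -- (Register its partitions with ALTER TABLE ... ADD PARTITION just like raw_logs.)
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_content (
      content_id INT,
      title      STRING,
      category   STRING,
      length     INT,
      artist     STRING
    )
    PARTITIONED BY (year STRING, month STRING, day STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/raw/metadata'
    TBLPROPERTIES ('skip.header.line.count'='1');

    -- Star schema tables stored as Parquet.
    CREATE TABLE IF NOT EXISTS dim_content (
      content_id INT,
      title      STRING,
      category   STRING,
      length     INT,
      artist     STRING
    )
    STORED AS PARQUET;

    CREATE TABLE IF NOT EXISTS fact_user_actions (
      user_id    INT,
      content_id INT,
      action     STRING,
      action_ts  TIMESTAMP,
      device     STRING,
      region     STRING,
      session_id STRING
    )
    PARTITIONED BY (year STRING, month STRING, day STRING)
    STORED AS PARQUET;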
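
A sketch of the transformation step, reusing the assumed table names above. The two SET statements enable Hive's dynamic partitioning so the fact table's partitions are derived from the raw table's partition columns.

    -- Allow dynamic partition inserts.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Populate the dimension table from the raw metadata (deduplicated on all columns).
    INSERT OVERWRITE TABLE dim_content
    SELECT DISTINCT content_id, title, category, length, artist
    FROM raw_content;

    -- Populate the fact table from the raw logs, casting the timestamp on the way in.
    -- The partition columns (year, month, day) must come last in the SELECT list.
    INSERT OVERWRITE TABLE fact_user_actions PARTITION (year, month, day)
    SELECT
      user_id,
      content_id,
      action,
      CAST(ts AS TIMESTAMP) AS action_ts,
      device,
      region,
      session_id,
      year,
      month,
      day
    FROM raw_logs;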
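
Finally, one example of the kind of analytical query expected ("top categories by play count" for a single month), showing a fact-dimension join, an aggregation, and filters on the date partitions; the other required queries follow the same pattern.

    SELECT
      d.category,
      COUNT(*) AS play_count
    FROM fact_user_actions f
    JOIN dim_content d
      ON f.content_id = d.content_id
    WHERE f.action = 'play'
      AND f.year  = '2023'
      AND f.month = '09'
    GROUP BY d.category
    ORDER BY play_count DESC
    LIMIT 10;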

Grading / Assessment Criteria

● Dataset generation: Generate a reasonable dataset; feel free to increase the number of days.
● Ingestion: Correct partitioning, shell script usage.
● Data Modeling: Proper star schema (fact/dimension separation), partition columns.
● Transformation: Successful movement from raw CSV to Parquet, correct field typing.
● SQL Queries: Logical joins, aggregations, beneficial use of date partitions.
● Write-Up: Clear rationale for the design, mention of potential performance optimizations.

Note: There may be vivas for this assignment, so make sure you understand what you are doing!

Helping Resources
1. Hive Documentation:
   ○ https://cwiki.apache.org/confluence/display/Hive/Home
     Covers CREATE EXTERNAL TABLE, partitioning, INSERT OVERWRITE, SerDes for CSV/JSON, etc.
2. HDFS Basics:
   ○ https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
   ○ https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
     Explains file system commands (hdfs dfs -mkdir, -put, etc.).
   ○ Note: Please follow the Pseudo-Distributed Operation instructions to deploy a single-node cluster
     (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html).
3. Introduction to Shell Scripting:
   ○ https://www.shellscript.sh/
4. Dimensional Modeling:
   ○ Ralph Kimball's "The Data Warehouse Toolkit," or any of the numerous online articles about star schemas and fact/dimension design.
5. CSV to Parquet with Hive:
   ○ Example: https://docs.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive.html
     Illustrates how to store the final data in a columnar format.
6. Partitioning in Hive:
   ○ https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
     For dynamic partitioning settings and partition maintenance.

Using an LLM to generate synthetic data (use any free LLM)

System / User Prompt

“Please generate two separate CSV datasets that I can use to simulate a streaming application’s data in a data engineering assignment:

1) User Activity Logs

● Columns: user_id, content_id, action, timestamp, device, region, session_id
● Number of Rows: ~20–30 per day, for at least 7 different days (e.g., 2023-09-01, 2023-09-02, 2023-09-03).
● Provide the logs in CSV format with a header row and valid data.
● The timestamp should be a full date+time (e.g., 2023-09-01 08:23:55).
● action: from {play, pause, skip, forward}, randomly assigned.
● device: from {mobile, desktop, tablet}.
● region: from {US, EU, APAC}, randomly assigned.
● session_id: short alphanumeric IDs, repeated occasionally for the same user’s session.
● user_id: integer range ~100–200; content_id: integer range ~1000–1010.

2) Content Metadata

● Columns: content_id, title, category, length, artist
● ~8–12 rows total, with content_id matching the same range used in the logs (1000–1010).
● title: short text (e.g., “Summer Vibes”, “Rock Anthem”).
● category: {Pop, Rock, Podcast, News, Jazz, etc.}, pick randomly.
● length: integer representing total seconds or minutes (e.g., 180 for 3 minutes).
● artist: random short name (e.g., “DJ Alpha”, “The Beats”).
● Provide separate CSV output for this metadata file, also with a header row.

Output Format:

● Return two code blocks:
  1. The user activity logs for multiple days (with ~20–30 rows per day).
  2. The content metadata (8–12 rows).
● Use valid CSV syntax, comma-delimited, including header rows.

Make sure the content_id in the logs overlaps the content_id in the metadata so we can join them later.

Thank you!”

Tips/Notes:

● Tweak the date range, row count, or field distributions as needed; we need at least 7 days of data.
● For separate files per day, ask the LLM to generate each date’s logs in a separate code block or with a clear label.
● For realism, ask for variations in user_id distribution, session_id formats, or location (region).

Good Luck!
