Assignment 2: Data Storage
Hive
1. Ingest daily log files from a local directory into HDFS, organizing them by date (see the ingestion sketch after this list).
2. Create Hive tables to store raw data (CSV/JSON) and a star schema (fact + dimension tables) for
analytics.
3. Run analytical queries to generate insights (monthly usage, top content, average session times).
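A minimal ingestion sketch for step 1, assuming each day's logs sit in a local ./logs/YYYY-MM-DD folder and are copied into a date-partitioned HDFS path such as /data/raw/logs/dt=YYYY-MM-DD (both paths are placeholders to adapt):

    #!/usr/bin/env bash
    # ingest_logs.sh -- copy each local daily folder into a date-partitioned HDFS directory.
    set -euo pipefail

    LOCAL_ROOT="./logs"          # expects subfolders named YYYY-MM-DD
    HDFS_ROOT="/data/raw/logs"   # hypothetical target path; adjust to your cluster

    for day_dir in "${LOCAL_ROOT}"/*/; do
      day="$(basename "${day_dir}")"            # e.g. 2025-01-01
      target="${HDFS_ROOT}/dt=${day}"
      hdfs dfs -mkdir -p "${target}"
      hdfs dfs -put -f "${day_dir}"*.csv "${target}/"
      echo "Ingested ${day_dir} -> ${target}"
    done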
Data Description
1. User Logs: (user_id, content_id, action, timestamp, device, region, session_id, ...)
○ Arrives in CSV or JSON format.
○ Each day’s logs arrive in a local folder named YYYY-MM-DD.
2. Content Metadata: (content_id, title, category, length, artist, ...)
○ Static reference data about each piece of content.
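For concreteness, one hypothetical local layout and matching CSV headers consistent with the fields above (the exact columns, file names, and dates are up to you):

    logs/
      2025-01-01/logs.csv
      2025-01-02/logs.csv
      ...                       (at least 7 days)
    content_metadata.csv

    user_id,content_id,action,timestamp,device,region,session_id     <- user log header
    content_id,title,category,length,artist                          <- content metadata header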
Core Requirements
● Dataset generation: Generate a reasonable dataset. Feel free to increase the number of days.
● Ingestion: Correct partitioning, shell script usage.
● Data Modeling: Proper star schema (fact/dimension separation), partition columns (see the schema sketch below).
● Transformation: Successful conversion from raw CSV to Parquet, with correct field typing.
● SQL Queries: Logical joins, aggregations, and effective use of date partitions.
● Write-Up: Clear rationale for design, mention of potential performance optimizations.
Note: There may be vivas for this assignment, so make sure you understand what you are doing!
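One possible shape for the tables behind the Data Modeling and Transformation bullets, assuming the HDFS layout from the ingestion sketch above; the table and column names (raw_logs, dim_content, fact_user_actions) are illustrative, not prescribed:

    #!/usr/bin/env bash
    # create_tables.sh -- raw external table over the CSV logs plus a small star schema in Parquet.
    hive -e "
    -- Raw layer: external table over the ingested CSV files, partitioned by ingestion date (dt).
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (
      user_id    STRING,
      content_id STRING,
      action     STRING,
      ts         STRING,   -- kept as STRING in the raw layer, typed later
      device     STRING,
      region     STRING,
      session_id STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/raw/logs';

    -- Dimension: static content metadata.
    CREATE TABLE IF NOT EXISTS dim_content (
      content_id STRING,
      title      STRING,
      category   STRING,
      length_sec INT,
      artist     STRING
    )
    STORED AS PARQUET;

    -- Fact: one row per user action, stored as Parquet and partitioned by date.
    CREATE TABLE IF NOT EXISTS fact_user_actions (
      user_id    STRING,
      content_id STRING,
      action     STRING,
      event_time TIMESTAMP,
      device     STRING,
      region     STRING,
      session_id STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;
    "

After each ingestion run, MSCK REPAIR TABLE raw_logs (or an explicit ALTER TABLE ... ADD PARTITION) makes the new dt= directories visible to Hive.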
Helpful Resources
1. Hive Documentation:
○ https://cwiki.apache.org/confluence/display/Hive/Home
Covers CREATE EXTERNAL TABLE, partitioning, INSERT OVERWRITE, SerDes for CSV/JSON, etc.
2. HDFS Basics:
○ https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
○ https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Explains file system commands (hdfs dfs -mkdir, -put, etc.).
○ Note: Please follow the Pseudo-Distributed Operation instructions for the deployment of a single-node cluster (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html)
3. Introduction to Shell Scripting:
○ https://www.shellscript.sh/
4. Dimensional Modeling:
○ Ralph Kimball’s “The Data Warehouse Toolkit” or numerous online articles about star
schemas, fact and dimension design.
5. CSV to Parquet with Hive:
○ Example: https://docs.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive.html
Illustrates how to store final data in a columnar format.
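For example, a per-day load from a raw CSV table into a Parquet fact table with explicit casts, reusing the hypothetical raw_logs and fact_user_actions tables sketched under Core Requirements:

    #!/usr/bin/env bash
    # load_fact.sh -- convert one day's raw CSV rows into the Parquet fact table.
    # Usage: ./load_fact.sh 2025-01-01   (assumes timestamps like 'yyyy-MM-dd HH:mm:ss')
    set -euo pipefail
    DT="$1"

    hive -e "
    INSERT OVERWRITE TABLE fact_user_actions PARTITION (dt='${DT}')
    SELECT
      user_id,
      content_id,
      action,
      CAST(ts AS TIMESTAMP) AS event_time,   -- raw string becomes a proper TIMESTAMP
      device,
      region,
      session_id
    FROM raw_logs
    WHERE dt = '${DT}';
    "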
6. Partitioning in Hive:
○ https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
For dynamic partitioning settings and partition maintenance.
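For reference, a sketch of the settings usually needed for dynamic partition inserts, plus partition discovery and a query that benefits from date pruning; the table names, the 'play' action value, and the date range are assumptions carried over from the earlier sketches:

    #!/usr/bin/env bash
    # partitions_and_query.sh -- dynamic partitioning, partition discovery, and a pruned query.
    hive -e "
    -- Let INSERT ... PARTITION (dt) derive dt from the SELECT instead of a hard-coded value.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Register the dt=YYYY-MM-DD directories created by the ingestion script.
    MSCK REPAIR TABLE raw_logs;

    -- Load every available day at once; Hive routes rows to partitions by the trailing dt column.
    INSERT OVERWRITE TABLE fact_user_actions PARTITION (dt)
    SELECT user_id, content_id, action, CAST(ts AS TIMESTAMP), device, region, session_id, dt
    FROM raw_logs;

    -- Top content for one month: the dt filter lets Hive scan only that month's partitions.
    SELECT c.title, COUNT(*) AS plays
    FROM fact_user_actions f
    JOIN dim_content c ON f.content_id = c.content_id
    WHERE f.dt BETWEEN '2025-01-01' AND '2025-01-31'
      AND f.action = 'play'
    GROUP BY c.title
    ORDER BY plays DESC
    LIMIT 10;
    "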
Using an LLM to generate synthetic data (use any free LLM)
“Please generate two separate CSV datasets that I can use to simulate a streaming application’s data in a data
engineering assignment:
Output Format:
Make sure the content_id in the logs overlaps the content_id in the metadata so we can join them later.
Thank you!”
Tips/Notes:
● Tweak the date range, row count, or field distributions. We need at least 7 days of data.
● For separate files per day, ask the LLM to generate each date’s logs in a separate code block or with a
clear label.
● For realism, ask for variation in the user_id distribution, session_id formats, or location
(region).
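If you would rather script the dataset than prompt an LLM, a rough generator along these lines also works; every value set, row count, and date below is an arbitrary placeholder:

    #!/usr/bin/env bash
    # gen_logs.sh -- crude synthetic log generator: one folder of CSV logs per day.
    set -euo pipefail

    DAYS=(2025-01-01 2025-01-02 2025-01-03 2025-01-04 2025-01-05 2025-01-06 2025-01-07)
    ACTIONS=(play pause skip like)
    DEVICES=(mobile desktop tv)
    REGIONS=(us-east us-west eu asia)
    ROWS_PER_DAY=1000

    for day in "${DAYS[@]}"; do
      mkdir -p "logs/${day}"
      out="logs/${day}/logs.csv"
      echo "user_id,content_id,action,timestamp,device,region,session_id" > "${out}"
      for ((i = 0; i < ROWS_PER_DAY; i++)); do
        user="u$((RANDOM % 500))"
        content="c$((RANDOM % 100))"      # small id space so the logs overlap the metadata file
        action="${ACTIONS[RANDOM % ${#ACTIONS[@]}]}"
        device="${DEVICES[RANDOM % ${#DEVICES[@]}]}"
        region="${REGIONS[RANDOM % ${#REGIONS[@]}]}"
        hh=$(printf '%02d' $((RANDOM % 24)))
        mm=$(printf '%02d' $((RANDOM % 60)))
        echo "${user},${content},${action},${day} ${hh}:${mm}:00,${device},${region},s${RANDOM}" >> "${out}"
      done
      echo "wrote ${out}"
    done

You would still need a matching content_metadata.csv whose content_id values cover the same range (c0–c99 here, or adjust the ranges).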
Good Luck!