BDA MakeUp Solution


Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division


Semester 1- 2024-2025
Mid-Semester Regular (EC-2)

Course No. : BA ZG525


Course Title : BIG DATA ANALYTICS
Nature of Exam : Closed Book
Weightage : 30%
Duration : 2 Hours
No. of Pages = 2
No. of Questions = 7
Date of Exam : 06 October, 2024 - 01:00 PM
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.

Q.1 Case Study:


XYZ Corp, an e-commerce company, is rapidly expanding its business across multiple countries,
resulting in a significant increase in data volume. They are currently facing high operational costs
and performance issues with their traditional storage systems. As they move to distributed storage,
they are considering Hadoop Distributed File System (HDFS) and cloud-based options like AWS
S3.
In regions where network stability is inconsistent, XYZ Corp also needs to ensure that their chosen
storage solution can handle the distributed nature of data access without escalating costs or
compromising performance.
Given XYZ Corp’s expansion into regions with varying levels of network stability and data volume
growth, argue how HDFS could be a better fit than AWS S3. Justify your answer with two specific
technical advantages of HDFS in terms of scalability, performance, or operational control in this
scenario. [5 Marks]

Solution
HDFS vs AWS S3 for XYZ Corp’s Expansion:
In the context of XYZ Corp’s rapid growth, expansion into multiple regions, and network instability
in some areas, Hadoop Distributed File System (HDFS) offers notable advantages over AWS S3.
The key reasons why HDFS might be a better fit for XYZ Corp are:

Data Locality for Performance Optimization:


Explanation: HDFS provides data locality, where the processing occurs close to the data,
significantly reducing network bandwidth consumption. This is particularly useful in regions with
unstable network conditions, as the majority of data processing happens on the same node or rack
where the data is stored. By reducing data transfer over the network, HDFS can enhance
performance and minimize the dependency on stable network infrastructure.
Justification: AWS S3, being cloud-based, lacks data locality, leading to higher network
dependencies and potentially lower performance in regions with fluctuating network conditions. For
large-scale batch processing tasks, HDFS can outperform S3 by reducing latency. 2.5 Marks

Operational Control and Customization:


Explanation: HDFS offers greater operational control over how data is stored, replicated, and
processed. XYZ Corp can fine-tune the replication factor to ensure higher availability in regions
with network instability. Moreover, HDFS allows for seamless integration with the Hadoop
ecosystem (e.g., MapReduce, Hive), giving XYZ Corp more flexibility to optimize for cost,
performance, and fault tolerance.
Justification: AWS S3 is a managed service, offering limited operational control in comparison to
HDFS. While S3 is convenient, the lack of granular control can result in higher operational costs
and potential performance issues for large-scale, distributed data analytics. 2.5 Marks
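
As an illustration of the operational-control point, the replication factor can be raised for critical data sets on a per-path basis. A minimal sketch, assuming an illustrative HDFS path /sales_data on a running cluster:

hdfs dfs -setrep -w 5 /sales_data                  # raise replication to 5 and wait for it to complete
hdfs fsck /sales_data -files -blocks -locations    # verify which DataNodes hold the replicas

This kind of per-path tuning is not exposed on S3, where durability and placement are managed entirely by the service.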

Q.2 Case Study:


You work as a data engineer at a retail company, and you are tasked with analyzing sales data from
multiple stores. The company has millions of records of daily sales data, and you need to find the
total sales for each store using Hadoop's MapReduce framework. The sales data is stored in a text
file in the following format:
store_id, item_id, amount
1, 101, 100
2, 102, 200
1, 103, 150
2, 104, 300
...
Your task is to write Python scripts for the Mapper and Reducer to process this sales data. The
Mapper should read each record and output the store ID and the sales amount, while the Reducer
should sum the sales for each store and output the total sales for each store.
Write the Mapper and Reducer Python scripts. [7 Marks]

Solution
Mapper Script (mapper.py) 3 Marks
#!/usr/bin/env python

import sys

# Read each sales record from standard input
for line in sys.stdin:
    line = line.strip()  # Remove leading/trailing whitespace
    if line:  # Check if the line is not empty
        # Split the line into store_id, item_id, and amount
        store_id, item_id, amount = line.split(",")
        # Output the store_id and amount, separated by a tab
        print(f"{store_id}\t{amount}")

Reducer Script (reducer.py) 4 Marks


#!/usr/bin/env python

import sys

current_store = None
current_total = 0.0

# Read mapper output (store_id<TAB>amount) from standard input
for line in sys.stdin:
    line = line.strip()  # Remove leading/trailing whitespace
    if line:  # Check if the line is not empty
        store_id, amount = line.split("\t")  # Split by tab
        amount = float(amount)  # Convert amount to float

        if current_store == store_id:
            # Still processing the same store: accumulate
            current_total += amount
        else:
            # New store reached: print the total for the previous store
            if current_store is not None:
                print(f"{current_store}\t{current_total}")
            # Update current store and reset total
            current_store = store_id
            current_total = amount

# Output the last store's total
if current_store is not None:
    print(f"{current_store}\t{current_total}")
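
These two scripts can be submitted with Hadoop Streaming. A minimal sketch of the job submission, assuming the sales file has already been uploaded to /sales_input in HDFS and the streaming jar path matches the local installation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /sales_input \
    -output /sales_output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

The shuffle-and-sort phase between the two scripts guarantees that all records for a given store_id arrive at the reducer together, which is what allows the running-total logic above to work.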

Q.3 State whether the following statements are True or False with proper justification.
Answers without proper justification will not be given any marks. [3Marks]
a) In HDFS, the NameNode stores both metadata and actual data.
b) HDFS follows a master-slave architecture where the NameNode acts as the master and
DataNodes as the slaves.
c) There can be multiple NameNodes in HDFS, each managing its own set of metadata.

Solution
a) In HDFS, the NameNode stores both metadata and actual data.
Answer: False
Justification: In HDFS (Hadoop Distributed File System), the NameNode is responsible solely for
storing metadata about the files in the system, such as file names, permissions, and the structure of
the file system. It does not store the actual data blocks. The actual data is stored on DataNodes,
which manage the storage of data blocks.
b) HDFS follows a master-slave architecture where the NameNode acts as the master and
DataNodes as the slaves.
Answer: True
Justification: HDFS indeed follows a master-slave architecture. The NameNode serves as the master
node, managing metadata and coordinating the DataNodes. The DataNodes act as slave nodes,
storing the actual data blocks and serving read and write requests from clients. This architecture
allows for effective data management and scalability.
c) There can be multiple NameNodes in HDFS, each managing its own set of metadata.
Answer: False
Justification: HDFS is designed to have a single active NameNode at a time that manages the
metadata for the entire file system. Although there can be a secondary NameNode or standby
NameNode for high availability and failover, these do not operate simultaneously with multiple
NameNodes managing their own sets of metadata. Only one NameNode is active at a time to avoid
conflicts in metadata management.
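
The metadata/data split described above can be observed directly on a running cluster. A quick sketch using standard HDFS commands (the path / is just an example):

hdfs fsck / -files -blocks -locations   # NameNode lists block IDs and the DataNodes that hold each replica
hdfs dfsadmin -report                   # summarises the DataNodes where the actual data blocks reside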

Q.4 You are a Hadoop administrator at an online retail company that has recently adopted Hadoop for
processing and analyzing its server logs. The company is setting up a log analysis pipeline using
HDFS and MapReduce. Your task is to prepare the Hadoop environment to ensure smooth log
collection, storage, and processing.
Your task is to write commands for the following:
a) Check disk space usage of the HDFS file system. [0.5Mark]
b) Create a new directory in HDFS named 'log_analysis' under the root directory. [0.5Mark]
c) Create 3 files named 'server.log', 'access.log', and 'error.log' on the local file system. [1Mark]
d) Upload these log files from the local file system to the 'log_analysis' directory in HDFS. [1 Mark]
e) List all files in the 'log_analysis' directory in HDFS. [0.5 Mark]
f) Download the 'error.log' file from HDFS to the local file system. [1 Mark]
g) Remove the 'access.log' file from the HDFS directory 'log_analysis'. [0.5 Mark]
h) Change the ownership of the 'server.log' file in HDFS to a user named 'admin'. [1 Mark]

Solution:
a) Check disk space usage of the HDFS file system. [0.5Mark]
hdfs dfs -df -h
b) Create a new directory in HDFS named 'log_analysis' under the root directory. [0.5Mark]
hdfs dfs -mkdir /log_analysis
c) Create 3 files named 'server.log', 'access.log', and 'error.log' on the local file system. [1Mark]
touch server.log access.log error.log
d) Upload these log files from the local file system to the 'log_analysis' directory in HDFS. [1 Mark]
hdfs dfs -put server.log /log_analysis/
hdfs dfs -put access.log /log_analysis/
hdfs dfs -put error.log /log_analysis/
e) List all files in the 'log_analysis' directory in HDFS. [0.5 Mark]
hdfs dfs -ls /log_analysis/
f) Download the 'error.log' file from HDFS to the local file system. [1 Mark]
hdfs dfs -get /log_analysis/error.log .
g) Remove the 'access.log' file from the HDFS directory 'log_analysis'. [0.5 Mark]
hdfs dfs -rm /log_analysis/access.log
h) Change the ownership of the 'server.log' file in HDFS to a user named 'admin'. [1 Mark]
hdfs dfs -chown admin /log_analysis/server.log

Q.5 You are working with employee data for a large organization. The data includes employee details
such as department and age. Your tasks involve writing queries to analyze this data using Hive and
Pig.
Hive Task:
Write a Hive query to count the number of employees in each department. [2Marks]
Pig Task:
Write a Pig script to filter the employee data, selecting only employees who are above 30 years old.
[2Marks]
Solution
Hive Task:
Write a Hive query to count the number of employees in each department. [2Marks]
SELECT department, COUNT(*) AS employee_count
FROM employee_data
GROUP BY department;
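
The query assumes an employee_data table already exists in Hive. A minimal sketch of such a table, with column names chosen to match the Pig schema below and a comma-delimited text layout (both assumptions):

CREATE TABLE IF NOT EXISTS employee_data (
    name STRING,
    age INT,
    department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;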

Pig Task:
Write a Pig script to filter the employee data, selecting only employees who are above 30 years old.
[2Marks]
-- Load the employee data from a specified source
employee_data = LOAD 'employee_data.csv' USING PigStorage(',')
    AS (name:chararray, age:int, department:chararray);

-- Filter employees who are older than 30
filtered_employees = FILTER employee_data BY age > 30;

-- Store the filtered results (optional, based on requirement)
STORE filtered_employees INTO 'filtered_employees_output' USING PigStorage(',');
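
The script can be tested in local mode before being run on the cluster; a brief usage sketch, assuming it is saved as filter_employees.pig (an illustrative file name):

pig -x local filter_employees.pig    # local mode: reads employee_data.csv from the local file system
pig filter_employees.pig             # MapReduce mode: reads the input from HDFS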

Q.6 Case Study: Real-Time Financial Analytics


You are tasked with implementing a system for monitoring financial transactions to detect fraud in
real-time. The organization has decided to use a Kappa architecture, leveraging Cassandra for stream
processing and employing an event-driven architecture for alert triggering. Additionally, a data lake
will be utilized to store historical transactions for analysis.
Discuss the advantages of using Kappa architecture and Cassandra in this scenario. How do these
technologies contribute to faster fraud detection and reduced losses? Your answer should address
at least two specific benefits of this architecture. [2Marks]

Solution:
In the context of Real-Time Financial Analytics for fraud detection, using Kappa architecture and
Cassandra provides several advantages:

Real-Time Stream Processing with Kappa Architecture:


Low Latency: Kappa architecture is designed for real-time stream processing by
processing data as it arrives. This ensures faster fraud detection as transactions are analyzed
immediately, enabling quicker responses to fraudulent activities.
Simplified Architecture: Unlike Lambda architecture, which maintains separate paths for
batch and stream processing, Kappa only processes streams, reducing complexity. This
enables seamless real-time monitoring without needing to handle a separate batch layer,
improving efficiency and agility in fraud detection.
Scalability and High Availability with Cassandra:

Distributed Data Storage: Cassandra is a distributed NoSQL database known for its
ability to handle large volumes of data across multiple nodes. This is crucial for handling
high transaction volumes in a global financial environment. Its horizontal scalability
ensures the system can grow without performance degradation.
High Write and Read Throughput: Cassandra offers high write speeds and is optimized
for real-time analytics, ensuring transaction data can be quickly stored and retrieved for
fraud analysis, supporting timely alerts and minimizing potential financial losses.
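
As an illustration of the ingest-and-alert path, a minimal Python sketch using the DataStax cassandra-driver. The contact point, keyspace, table schema, and the simple amount threshold are all assumptions standing in for a real deployment and fraud model:

from cassandra.cluster import Cluster

# Connect to the Cassandra cluster (contact point and keyspace are assumptions)
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('fraud_ks')

# Assumed table: transactions(txn_id uuid PRIMARY KEY, account_id int, amount double, ts timestamp)
insert_stmt = session.prepare(
    "INSERT INTO transactions (txn_id, account_id, amount, ts) "
    "VALUES (uuid(), ?, ?, toTimestamp(now()))"
)

def handle_event(account_id, amount):
    # Persist the incoming transaction as it arrives from the stream
    session.execute(insert_stmt, (account_id, amount))
    # Simple threshold rule standing in for a real fraud-detection model
    if amount > 10000:
        print(f"ALERT: suspicious transaction of {amount} on account {account_id}")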

Q.7 Scenario:
You are working with a Cassandra database to manage customer information. Below are examples
of basic CRUD (Create, Read, Update, Delete) operations for a customers table.
• Create: INSERT INTO customers (id, name, balance) VALUES (123, 'John', 1000);
• Read: SELECT * FROM customers WHERE id = 123;
• Update: UPDATE customers SET balance = 1200 WHERE id = 123;
Explain the purpose of each of the four CRUD operations in the context of the customers table (the
Delete example is not shown above but is implied). Provide a brief description of what each operation does. [3 Marks]

Solution:
The CRUD (Create, Read, Update, Delete) operations shown for the Cassandra database managing
customer information can be explained as follows:

Create Operation (INSERT):


Purpose: To add a new record in the table.
Description: The command INSERT INTO customers (id, name, balance) VALUES (123, 'John',
1000); inserts a new customer with ID 123, name "John," and a balance of 1000 into the customers
table. This operation creates a new row in the database with the specified values.

Read Operation (SELECT):


Purpose: To retrieve existing records from the table.
Description: The command SELECT * FROM customers WHERE id = 123; retrieves all the
information for the customer with ID 123 from the customers table. This operation is used to read
or query the stored data.

Update Operation (UPDATE):


Purpose: To modify an existing record in the table.
Description: The command UPDATE customers SET balance = 1200 WHERE id = 123; updates
the balance of the customer with ID 123 to 1200. This operation changes or modifies the values of
the existing data without creating a new record.

Delete Operation (not explicitly shown but implied as part of CRUD):


Purpose: To remove records from the table.
Description: A typical DELETE command in Cassandra would look like DELETE FROM
customers WHERE id = 123; and would remove the customer record with ID 123 from the
database. This operation is used to delete the data.
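
For completeness, the four statements can be exercised end-to-end against a table like the one sketched below; the schema itself is an assumption consistent with the examples in the question:

CREATE TABLE IF NOT EXISTS customers (
    id int PRIMARY KEY,
    name text,
    balance decimal
);

INSERT INTO customers (id, name, balance) VALUES (123, 'John', 1000);   -- Create
SELECT * FROM customers WHERE id = 123;                                 -- Read
UPDATE customers SET balance = 1200 WHERE id = 123;                     -- Update
DELETE FROM customers WHERE id = 123;                                   -- Delete
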
******
