BDA MakeUp Solution


Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division


Semester 1- 2024-2025
Mid-Semester Regular (EC-2)

Course No. : BA ZG525


Course Title : BIG DATA ANALYTICS
Nature of Exam : Closed Book
Weightage : 30%
Duration : 2 Hours
No. of Pages = 2
No. of Questions = 7
Date of Exam : 06 October, 2024 - 01:00 PM
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.

Q.1 Case Study:


XYZ Corp, an e-commerce company, is rapidly expanding its business across multiple countries,
resulting in a significant increase in data volume. They are currently facing high operational costs
and performance issues with their traditional storage systems. As they move to distributed storage,
they are considering Hadoop Distributed File System (HDFS) and cloud-based options like AWS
S3.
In regions where network stability is inconsistent, XYZ Corp also needs to ensure that their chosen
storage solution can handle the distributed nature of data access without escalating costs or
compromising performance.
Given XYZ Corp’s expansion into regions with varying levels of network stability and data volume
growth, argue how HDFS could be a better fit than AWS S3. Justify your answer with two specific
technical advantages of HDFS in terms of scalability, performance, or operational control in this
scenario. [5 Marks]

Solution
HDFS vs AWS S3 for XYZ Corp’s Expansion:
In the context of XYZ Corp’s rapid growth, expansion into multiple regions, and network instability
in some areas, Hadoop Distributed File System (HDFS) offers notable advantages over AWS S3.
The key reasons why HDFS might be a better fit for XYZ Corp are:

Data Locality for Performance Optimization:


Explanation: HDFS provides data locality, where the processing occurs close to the data,
significantly reducing network bandwidth consumption. This is particularly useful in regions with
unstable network conditions, as the majority of data processing happens on the same node or rack
where the data is stored. By reducing data transfer over the network, HDFS can enhance
performance and minimize the dependency on stable network infrastructure.
Justification: AWS S3, being cloud-based, lacks data locality, leading to higher network
dependencies and potentially lower performance in regions with fluctuating network conditions. For
large-scale batch processing tasks, HDFS can outperform S3 by reducing latency. 2.5 Marks

Operational Control and Customization:


Explanation: HDFS offers greater operational control over how data is stored, replicated, and
processed. XYZ Corp can fine-tune the replication factor to ensure higher availability in regions
with network instability. Moreover, HDFS allows for seamless integration with the Hadoop
ecosystem (e.g., MapReduce, Hive), giving XYZ Corp more flexibility to optimize for cost,
performance, and fault tolerance.
Justification: AWS S3 is a managed service, offering limited operational control in comparison to
HDFS. While S3 is convenient, the lack of granular control can result in higher operational costs
and potential performance issues for large-scale, distributed data analytics. 2.5 Marks
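
As an illustration of the operational-control point, the replication factor can be raised for critical data sets on a per-path basis. A minimal sketch, assuming an illustrative HDFS path /sales_data on a running cluster:

hdfs dfs -setrep -w 5 /sales_data                  # raise replication to 5 and wait for it to complete
hdfs fsck /sales_data -files -blocks -locations    # verify which DataNodes hold the replicas

This kind of per-path tuning is not exposed on S3, where durability and placement are managed entirely by the service.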

Q.2 Case Study:


You work as a data engineer at a retail company, and you are tasked with analyzing sales data from
multiple stores. The company has millions of records of daily sales data, and you need to find the
total sales for each store using Hadoop's MapReduce framework. The sales data is stored in a text
file in the following format:
store_id, item_id, amount
1, 101, 100
2, 102, 200
1, 103, 150
2, 104, 300
...
Your task is to write Python scripts for the Mapper and Reducer to process this sales data. The
Mapper should read each record and output the store ID and the sales amount, while the Reducer
should sum the sales for each store and output the total sales for each store.
Write the Mapper and Reducer Python scripts. [7 Marks]

Solution
Mapper Script (mapper.py) 3 Marks
#!/usr/bin/env python

import sys

# Read each sales record from standard input
for line in sys.stdin:
    line = line.strip()  # Remove leading/trailing whitespace
    if line:  # Check if the line is not empty
        # Split the line into store_id, item_id, and amount
        store_id, item_id, amount = line.split(",")
        # Output the store_id and amount, separated by a tab
        print(f"{store_id}\t{amount}")

Reducer Script (reducer.py) 4 Marks


#!/usr/bin/env python

import sys

current_store = None
current_total = 0.0

# Read mapper output (store_id<TAB>amount) from standard input
for line in sys.stdin:
    line = line.strip()  # Remove leading/trailing whitespace
    if line:  # Check if the line is not empty
        store_id, amount = line.split("\t")  # Split by tab
        amount = float(amount)  # Convert amount to float

        if current_store == store_id:
            # Still processing the same store: accumulate
            current_total += amount
        else:
            # New store reached: print the total for the previous store
            if current_store is not None:
                print(f"{current_store}\t{current_total}")
            # Update current store and reset total
            current_store = store_id
            current_total = amount

# Output the last store's total
if current_store is not None:
    print(f"{current_store}\t{current_total}")
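
These two scripts can be submitted with Hadoop Streaming. A minimal sketch of the job submission, assuming the sales file has already been uploaded to /sales_input in HDFS and the streaming jar path matches the local installation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /sales_input \
    -output /sales_output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

The shuffle-and-sort phase between the two scripts guarantees that all records for a given store_id arrive at the reducer together, which is what allows the running-total logic above to work.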

Q.3 State whether the following statements are True or False with proper justification.
Answers without proper justification will not be given any marks. [3Marks]
a) In HDFS, the NameNode stores both metadata and actual data.
b) HDFS follows a master-slave architecture where the NameNode acts as the master and
DataNodes as the slaves.
c) There can be multiple NameNodes in HDFS, each managing its own set of metadata.

Solution
a) In HDFS, the NameNode stores both metadata and actual data.
Answer: False
Justification: In HDFS (Hadoop Distributed File System), the NameNode is responsible solely for
storing metadata about the files in the system, such as file names, permissions, and the structure of
the file system. It does not store the actual data blocks. The actual data is stored on DataNodes,
which manage the storage of data blocks.
b) HDFS follows a master-slave architecture where the NameNode acts as the master and
DataNodes as the slaves.
Answer: True
Justification: HDFS indeed follows a master-slave architecture. The NameNode serves as the master
node, managing metadata and coordinating the DataNodes. The DataNodes act as slave nodes,
storing the actual data blocks and serving read and write requests from clients. This architecture
allows for effective data management and scalability.
c) There can be multiple NameNodes in HDFS, each managing its own set of metadata.
Answer: False
Justification: HDFS is designed to have a single active NameNode at a time that manages the
metadata for the entire file system. Although there can be a secondary NameNode or standby
NameNode for high availability and failover, these do not operate simultaneously with multiple
NameNodes managing their own sets of metadata. Only one NameNode is active at a time to avoid
conflicts in metadata management.
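
The metadata/data split described above can be observed directly on a running cluster. A quick sketch using standard HDFS commands (the path / is just an example):

hdfs fsck / -files -blocks -locations   # NameNode lists block IDs and the DataNodes that hold each replica
hdfs dfsadmin -report                   # summarises the DataNodes where the actual data blocks reside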

Q.4 You are a Hadoop administrator at an online retail company that has recently adopted Hadoop for
processing and analyzing its server logs. The company is setting up a log analysis pipeline using
HDFS and MapReduce. Your task is to prepare the Hadoop environment to ensure smooth log
collection, storage, and processing.
Your task is to write commands for the following:
a) Check disk space usage of the HDFS file system. [0.5Mark]
b) Create a new directory in HDFS named 'log_analysis' under the root directory. [0.5Mark]
c) Create 3 files named 'server.log', 'access.log', and 'error.log' on the local file system. [1Mark]
d) Upload these log files from the local file system to the 'log_analysis' directory in HDFS. [1 Mark]
e) List all files in the 'log_analysis' directory in HDFS. [0.5 Mark]
f) Download the 'error.log' file from HDFS to the local file system. [1 Mark]
g) Remove the 'access.log' file from the HDFS directory 'log_analysis'. [0.5 Mark]
h) Change the ownership of the 'server.log' file in HDFS to a user named 'admin'. [1 Mark]

Solution:
a) Check disk space usage of the HDFS file system. [0.5Mark]
hdfs dfs -df -h
b) Create a new directory in HDFS named 'log_analysis' under the root directory. [0.5Mark]
hdfs dfs -mkdir /log_analysis
c) Create 3 files named 'server.log', 'access.log', and 'error.log' on the local file system. [1Mark]
touch server.log access.log error.log
d) Upload these log files from the local file system to the 'log_analysis' directory in HDFS. [1 Mark]
hdfs dfs -put server.log /log_analysis/
hdfs dfs -put access.log /log_analysis/
hdfs dfs -put error.log /log_analysis/
e) List all files in the 'log_analysis' directory in HDFS. [0.5 Mark]
hdfs dfs -ls /log_analysis/
f) Download the 'error.log' file from HDFS to the local file system. [1 Mark]
hdfs dfs -get /log_analysis/error.log .
g) Remove the 'access.log' file from the HDFS directory 'log_analysis'. [0.5 Mark]
hdfs dfs -rm /log_analysis/access.log
h) Change the ownership of the 'server.log' file in HDFS to a user named 'admin'. [1 Mark]
hdfs dfs -chown admin /log_analysis/server.log

Q.5 You are working with employee data for a large organization. The data includes employee details
such as department and age. Your tasks involve writing queries to analyze this data using Hive and
Pig.
Hive Task:
Write a Hive query to count the number of employees in each department. [2Marks]
Pig Task:
Write a Pig script to filter the employee data, selecting only employees who are above 30 years old.
[2Marks]
Solution
Hive Task:
Write a Hive query to count the number of employees in each department. [2Marks]
SELECT department, COUNT(*) AS employee_count
FROM employee_data
GROUP BY department;
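
The query assumes an employee_data table already exists in Hive. A minimal sketch of such a table, with column names chosen to match the Pig schema below and a comma-delimited text layout (both assumptions):

CREATE TABLE IF NOT EXISTS employee_data (
    name STRING,
    age INT,
    department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;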

Pig Task:
Write a Pig script to filter the employee data, selecting only employees who are above 30 years old.
[2Marks]
-- Load the employee data from a specified source
employee_data = LOAD 'employee_data.csv' USING PigStorage(',')
    AS (name:chararray, age:int, department:chararray);

-- Filter employees who are older than 30
filtered_employees = FILTER employee_data BY age > 30;

-- Store the filtered results (optional, based on requirement)
STORE filtered_employees INTO 'filtered_employees_output' USING PigStorage(',');
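
The script can be tested in local mode before being run on the cluster; a brief usage sketch, assuming it is saved as filter_employees.pig (an illustrative file name):

pig -x local filter_employees.pig    # local mode: reads employee_data.csv from the local file system
pig filter_employees.pig             # MapReduce mode: reads the input from HDFS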

Q.6 Case Study: Real-Time Financial Analytics


You are tasked with implementing a system for monitoring financial transactions to detect fraud in
real-time. The organization has decided to use a Kappa architecture, leveraging Cassandra for stream
processing and employing an event-driven architecture for alert triggering. Additionally, a data lake
will be utilized to store historical transactions for analysis.
Discuss the advantages of using Kappa architecture and Cassandra in this scenario. How do these
technologies contribute to faster fraud detection and reduced losses? Your answer should address
at least two specific benefits of this architecture. [2Marks]

Solution:
In the context of Real-Time Financial Analytics for fraud detection, using Kappa architecture and
Cassandra provides several advantages:

Real-Time Stream Processing with Kappa Architecture:


Low Latency: Kappa architecture is designed for real-time stream processing by
processing data as it arrives. This ensures faster fraud detection as transactions are analyzed
immediately, enabling quicker responses to fraudulent activities.
Simplified Architecture: Unlike Lambda architecture, which maintains separate paths for
batch and stream processing, Kappa only processes streams, reducing complexity. This
enables seamless real-time monitoring without needing to handle a separate batch layer,
improving efficiency and agility in fraud detection.
Scalability and High Availability with Cassandra:

Distributed Data Storage: Cassandra is a distributed NoSQL database known for its
ability to handle large volumes of data across multiple nodes. This is crucial for handling
high transaction volumes in a global financial environment. Its horizontal scalability
ensures the system can grow without performance degradation.
High Write and Read Throughput: Cassandra offers high write speeds and is optimized
for real-time analytics, ensuring transaction data can be quickly stored and retrieved for
fraud analysis, supporting timely alerts and minimizing potential financial losses.
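
As an illustration of the ingest-and-alert path, a minimal Python sketch using the DataStax cassandra-driver. The contact point, keyspace, table schema, and the simple amount threshold are all assumptions standing in for a real deployment and fraud model:

from cassandra.cluster import Cluster

# Connect to the Cassandra cluster (contact point and keyspace are assumptions)
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('fraud_ks')

# Assumed table: transactions(txn_id uuid PRIMARY KEY, account_id int, amount double, ts timestamp)
insert_stmt = session.prepare(
    "INSERT INTO transactions (txn_id, account_id, amount, ts) "
    "VALUES (uuid(), ?, ?, toTimestamp(now()))"
)

def handle_event(account_id, amount):
    # Persist the incoming transaction as it arrives from the stream
    session.execute(insert_stmt, (account_id, amount))
    # Simple threshold rule standing in for a real fraud-detection model
    if amount > 10000:
        print(f"ALERT: suspicious transaction of {amount} on account {account_id}")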

Q.7 Scenario:
You are working with a Cassandra database to manage customer information. Below are examples
of basic CRUD (Create, Read, Update, Delete) operations for a customers table.
• Create: INSERT INTO customers (id, name, balance) VALUES (123, 'John', 1000);
• Read: SELECT * FROM customers WHERE id = 123;
• Update: UPDATE customers SET balance = 1200 WHERE id = 123;
Explain the purpose of each of the four CRUD operations in the context of the customers table (the
Delete example is not shown above but is implied). Provide a brief description of what each operation does. [3 Marks]

Solution:
The CRUD (Create, Read, Update, Delete) operations shown for the Cassandra database managing
customer information can be explained as follows:

Create Operation (INSERT):


Purpose: To add a new record in the table.
Description: The command INSERT INTO customers (id, name, balance) VALUES (123, 'John',
1000); inserts a new customer with ID 123, name "John," and a balance of 1000 into the customers
table. This operation creates a new row in the database with the specified values.

Read Operation (SELECT):


Purpose: To retrieve existing records from the table.
Description: The command SELECT * FROM customers WHERE id = 123; retrieves all the
information for the customer with ID 123 from the customers table. This operation is used to read
or query the stored data.

Update Operation (UPDATE):


Purpose: To modify an existing record in the table.
Description: The command UPDATE customers SET balance = 1200 WHERE id = 123; updates
the balance of the customer with ID 123 to 1200. This operation changes or modifies the values of
the existing data without creating a new record.

Delete Operation (not explicitly shown but implied as part of CRUD):


Purpose: To remove records from the table.
Description: A typical DELETE command in Cassandra would look like DELETE FROM
customers WHERE id = 123; and would remove the customer record with ID 123 from the
database. This operation is used to delete the data.
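
For completeness, the four statements can be exercised end-to-end against a table like the one sketched below; the schema itself is an assumption consistent with the examples in the question:

CREATE TABLE IF NOT EXISTS customers (
    id int PRIMARY KEY,
    name text,
    balance decimal
);

INSERT INTO customers (id, name, balance) VALUES (123, 'John', 1000);   -- Create
SELECT * FROM customers WHERE id = 123;                                 -- Read
UPDATE customers SET balance = 1200 WHERE id = 123;                     -- Update
DELETE FROM customers WHERE id = 123;                                   -- Delete
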
******
