BDA MakeUp Solution
Solution
HDFS vs AWS S3 for XYZ Corp’s Expansion:
In the context of XYZ Corp’s rapid growth, expansion into multiple regions, and network instability
in some areas, Hadoop Distributed File System (HDFS) offers notable advantages over AWS S3.
The key reasons why HDFS might be a better fit for XYZ Corp are:
Solution
Mapper Script (mapper.py) [3 Marks]
#!/usr/bin/env python
import sys

# Read sales records from standard input
for line in sys.stdin:
    line = line.strip()  # Remove leading/trailing whitespace
    if line:  # Skip empty lines
        store_id, amount = line.split("\t")  # Split the record by tab
        # Emit the store ID and sale amount as a tab-separated key-value pair
        print("%s\t%s" % (store_id, amount))
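The per-store totals are produced in the reduce phase. Below is a minimal reducer sketch (reducer.py); it assumes the job sums sale amounts per store and relies on Hadoop Streaming delivering the mapper output sorted by store_id. The field layout follows the mapper above; the totalling logic is an illustrative assumption, not taken from the original paper.

#!/usr/bin/env python
import sys

current_store = None
current_total = 0.0

# Records for the same store_id arrive consecutively because Hadoop
# Streaming sorts mapper output by key before the reduce phase
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    store_id, amount = line.split("\t")
    if store_id == current_store:
        current_total += float(amount)  # accumulate sales for the current store
    else:
        if current_store is not None:
            print("%s\t%.2f" % (current_store, current_total))  # emit finished store
        current_store = store_id
        current_total = float(amount)

# Emit the total for the last store
if current_store is not None:
    print("%s\t%.2f" % (current_store, current_total))

The reducer's output is one line per store ID with its total sales amount.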
Q.3 State whether the following statements are True or False with proper justification.
Answers without proper justification will not be given any marks. [3 Marks]
a) In HDFS, the NameNode stores both metadata and actual data.
b) HDFS follows a master-slave architecture where the NameNode acts as the master and
DataNodes as the slaves.
c) There can be multiple NameNodes in HDFS, each managing its own set of metadata.
Solution
a) In HDFS, the NameNode stores both metadata and actual data.
Answer: False
Justification: In HDFS (Hadoop Distributed File System), the NameNode is responsible solely for
storing metadata about the files in the system, such as file names, permissions, and the structure of
the file system. It does not store the actual data blocks. The actual data is stored on DataNodes,
which manage the storage of data blocks.
b) HDFS follows a master-slave architecture where the NameNode acts as the master and
DataNodes as the slaves.
Answer: True
Justification: HDFS indeed follows a master-slave architecture. The NameNode serves as the master
node, managing metadata and coordinating the DataNodes. The DataNodes act as slave nodes,
storing the actual data blocks and serving read and write requests from clients. This architecture
allows for effective data management and scalability.
c) There can be multiple NameNodes in HDFS, each managing its own set of metadata.
Answer: False
Justification: HDFS is designed around a single active NameNode that manages the metadata for
the entire file system. A Secondary NameNode (for checkpointing) or a Standby NameNode (for
high availability and failover) may exist, but these do not act as additional NameNodes managing
their own sets of metadata. Only one NameNode is active at a time, which avoids conflicts in
metadata management.
Q.4 You are a Hadoop administrator at an online retail company that has recently adopted Hadoop for
processing and analyzing its server logs. The company is setting up a log analysis pipeline using
HDFS and MapReduce. Your task is to prepare the Hadoop environment to ensure smooth log
collection, storage, and processing.
Your task is to write commands for the following:
a) Check disk space usage of the HDFS file system. [0.5 Mark]
b) Create a new directory in HDFS named 'log_analysis' under the root directory. [0.5 Mark]
c) Create 3 files named 'server.log', 'access.log', and 'error.log' on the local file system. [1 Mark]
d) Upload these log files from the local file system to the 'log_analysis' directory in HDFS. [1 Mark]
e) List all files in the 'log_analysis' directory in HDFS. [0.5 Mark]
f) Download the 'error.log' file from HDFS to the local file system. [1 Mark]
g) Remove the 'access.log' file from the HDFS directory 'log_analysis'. [0.5 Mark]
h) Change the ownership of the 'server.log' file in HDFS to a user named 'admin'. [1 Mark]
Solution:
a) Check disk space usage of the HDFS file system. [0.5 Mark]
hdfs dfs -df -h
b) Create a new directory in HDFS named 'log_analysis' under the root directory. [0.5 Mark]
hdfs dfs -mkdir /log_analysis
c) Create 3 files named 'server.log', 'access.log', and 'error.log' on the local file system. [1 Mark]
touch server.log access.log error.log
d) Upload these log files from the local file system to the 'log_analysis' directory in HDFS. [1 Mark]
hdfs dfs -put server.log /log_analysis/
hdfs dfs -put access.log /log_analysis/
hdfs dfs -put error.log /log_analysis/
e) List all files in the 'log_analysis' directory in HDFS. [0.5 Mark]
hdfs dfs -ls /log_analysis/
f) Download the 'error.log' file from HDFS to the local file system. [1 Mark]
hdfs dfs -get /log_analysis/error.log .
g) Remove the 'access.log' file from the HDFS directory 'log_analysis'. [0.5 Mark]
hdfs dfs -rm /log_analysis/access.log
h) Change the ownership of the 'server.log' file in HDFS to a user named 'admin'. [1 Mark]
hdfs dfs -chown admin /log_analysis/server.log
Q.5 You are working with employee data for a large organization. The data includes employee details
such as department and age. Your tasks involve writing queries to analyze this data using Hive and
Pig.
Hive Task:
Write a Hive query to count the number of employees in each department. [2 Marks]
Pig Task:
Write a Pig script to filter the employee data, selecting only employees who are above 30 years old.
[2 Marks]
Solution
Hive Task:
Write a Hive query to count the number of employees in each department. [2 Marks]
SELECT department, COUNT(*) AS employee_count
FROM employee_data
GROUP BY department;
Pig Task:
Write a Pig script to filter the employee data, selecting only employees who are above 30 years old.
[2 Marks]
-- Load the employee data from a specified source
employee_data = LOAD 'employee_data.csv' USING PigStorage(',') AS (name:chararray, age:int, department:chararray);
-- Keep only employees who are above 30 years old
filtered_employees = FILTER employee_data BY age > 30;
-- Display the filtered records
DUMP filtered_employees;
Solution:
In the context of Real-Time Financial Analytics for fraud detection, using Kappa architecture and
Cassandra provides several advantages:
• Distributed Data Storage: Cassandra is a distributed NoSQL database known for its
ability to handle large volumes of data across multiple nodes. This is crucial for handling
high transaction volumes in a global financial environment. Its horizontal scalability
ensures the system can grow without performance degradation.
• High Write and Read Throughput: Cassandra offers high write speeds and is optimized
for real-time analytics, ensuring transaction data can be quickly stored and retrieved for
fraud analysis, supporting timely alerts and minimizing potential financial losses.
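To make the write-throughput point concrete, here is a minimal, hedged sketch using the DataStax Python driver for Cassandra. The fraud_analytics keyspace, transactions table, and column names are assumptions for illustration only; they are not part of the original scenario.

import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])              # contact point(s) of the Cassandra cluster
session = cluster.connect("fraud_analytics")  # hypothetical keyspace

# Prepared statement against a hypothetical transactions table
insert_txn = session.prepare(
    "INSERT INTO transactions (txn_id, account_id, amount) VALUES (?, ?, ?)"
)

# Asynchronous, prepared writes sustain high ingest rates from a streaming pipeline
future = session.execute_async(insert_txn, (uuid.uuid4(), "acct-42", 129.99))
future.result()  # block only when the write acknowledgement is actually needed

Prepared statements combined with asynchronous execution are the usual way to keep write latency low while a Kappa-style stream continuously feeds transaction data into Cassandra.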
Q.7 Scenario:
You are working with a Cassandra database to manage customer information. Below are examples
of basic CRUD (Create, Read, Update, Delete) operations for a customers table.
• Create: INSERT INTO customers (id, name, balance) VALUES (123, 'John', 1000);
• Read: SELECT * FROM customers WHERE id = 123;
• Update: UPDATE customers SET balance = 1200 WHERE id = 123;
• Delete: DELETE FROM customers WHERE id = 123;
Explain the purpose of each of the four CRUD operations shown above in the context of the
customers table. Provide a brief description of what each operation does. [3 Marks]
Solution:
The CRUD (Create, Read, Update, Delete) operations shown for the Cassandra database managing
customer information can be explained as follows: