Question Bank-BDA (Module 1 & 2)
MODULE 1
1. What is big data analytics? Explain the four 'V's of Big Data. Briefly discuss applications
of Big Data.
2. What are the advantages of Hadoop? Explain the Hadoop architecture and its
components with a proper diagram.
3. What are the benefits of Big Data? Discuss the challenges of Big Data.
4. Discuss various components of Hadoop Ecosystem.
5. With proper examples, discuss and differentiate structured, unstructured, and semi-
structured data.
6. With a suitable block diagram, explain the architecture of HDFS. Discuss the roles of the
DataNode and the NameNode in HDFS. Give commands with appropriate arguments to perform
data transfer between the local file system and HDFS.
7. Write commands for the following:
a. Create a directory in HDFS at given path(s)
b. List the contents of a directory.
c. Upload and download a file in HDFS.
d. See contents of a file
e. Copy a file from source to destination
f. Copy a file from the local file system to HDFS and from HDFS to the local file system
g. Move file from source to destination
h. Remove a file or directory in HDFS
i. Display last few lines of a file.
j. Display the aggregate length of a file
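Sample solution sketch (for questions 6 and 7): the following hdfs dfs commands are one possible set of answers; all paths and file names (e.g., /user/hadoop/dir1, localfile.txt) are placeholders.

hdfs dfs -mkdir -p /user/hadoop/dir1 /user/hadoop/dir2            # a. create directories at the given paths
hdfs dfs -ls /user/hadoop/dir1                                    # b. list the contents of a directory
hdfs dfs -put localfile.txt /user/hadoop/dir1/                    # c. upload a local file into HDFS
hdfs dfs -get /user/hadoop/dir1/localfile.txt ./                  # c. download a file from HDFS
hdfs dfs -cat /user/hadoop/dir1/localfile.txt                     # d. see the contents of a file
hdfs dfs -cp /user/hadoop/dir1/localfile.txt /user/hadoop/dir2/   # e. copy from source to destination
hdfs dfs -copyFromLocal localfile.txt /user/hadoop/dir1/          # f. local file system -> HDFS
hdfs dfs -copyToLocal /user/hadoop/dir1/localfile.txt ./          # f. HDFS -> local file system
hdfs dfs -mv /user/hadoop/dir1/localfile.txt /user/hadoop/dir2/   # g. move from source to destination
hdfs dfs -rm /user/hadoop/dir2/localfile.txt                      # h. remove a file
hdfs dfs -rm -r /user/hadoop/dir2                                 # h. remove a directory
hdfs dfs -tail /user/hadoop/dir1/localfile.txt                    # i. display the last kilobyte of a file
hdfs dfs -du -s /user/hadoop/dir1/localfile.txt                   # j. aggregate length of a file in bytes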
Lab Programs
8. You have been given a CSV file named customer_data.csv containing customer
information. The goal is to perform a series of data transformations and clean the
data using PySpark. The CSV file has the following schema:
• customer_id (String): Unique identifier for each customer.
• name (String): Name of the customer.
• age (Integer): Age of the customer.
• gender (String): Gender of the customer ('M' for Male, 'F' for Female).
• country (String): Country of residence.
• signup_date (String): Date when the customer signed up (in the format YYYY-
MM-DD).
• last_purchase_amount (Float): The amount of the last purchase made by the
customer.
• total_purchases (Integer): Total number of purchases made by the customer.
Queries:
a. Load the CSV file into a PySpark DataFrame.
b. Display the first 5 rows of the DataFrame.
c. Print the schema of the DataFrame to verify the data types.
d. Create a new column age_group which categorizes customers into:
• 'Teen' if the age is less than 20
• 'Young Adult' if the age is between 20 and 35 (inclusive)
• 'Adult' if the age is between 36 and 50 (inclusive)
• 'Senior' if the age is above 50
e. Standardize the gender column to have values 'Male' and 'Female' instead of
'M' and 'F'.
f. Create a new column high_value_customer which is 'Yes' if total_purchases
is greater than 100 or last_purchase_amount is greater than 1000,
otherwise 'No'.
g. Filter out rows where age is less than or equal to 0 or greater than 120.
h. Filter out rows where total_purchases is less than 0.
i. Group the data by country and age_group, and calculate the following
aggregations:
• Total number of customers
• Average last purchase amount
• Total number of purchases
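A possible PySpark sketch for queries (a)-(i), assuming the file customer_data.csv is available locally and has the columns listed above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CustomerData").getOrCreate()

# a. Load the CSV file into a DataFrame
df = spark.read.csv("customer_data.csv", header=True, inferSchema=True)

df.show(5)          # b. first 5 rows
df.printSchema()    # c. verify the data types

# d. Categorize customers into age groups
df = df.withColumn("age_group",
    F.when(F.col("age") < 20, "Teen")
     .when(F.col("age") <= 35, "Young Adult")
     .when(F.col("age") <= 50, "Adult")
     .otherwise("Senior"))

# e. Standardize the gender values
df = df.withColumn("gender",
    F.when(F.col("gender") == "M", "Male")
     .when(F.col("gender") == "F", "Female")
     .otherwise(F.col("gender")))

# f. Flag high-value customers
df = df.withColumn("high_value_customer",
    F.when((F.col("total_purchases") > 100) |
           (F.col("last_purchase_amount") > 1000), "Yes").otherwise("No"))

# g. and h. Keep only rows with a plausible age and a non-negative purchase count
df = df.filter((F.col("age") > 0) & (F.col("age") <= 120) &
               (F.col("total_purchases") >= 0))

# i. Aggregations per country and age group
summary = (df.groupBy("country", "age_group")
             .agg(F.count("*").alias("total_customers"),
                  F.avg("last_purchase_amount").alias("avg_last_purchase_amount"),
                  F.sum("total_purchases").alias("total_purchases")))
summary.show()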
9. You are given a DataFrame containing information about employees in a company.
The DataFrame has the following schema:
• employee_id (String): Unique identifier for the employee
• name (String): Name of the employee
• department (String): Department where the employee works
• salary (Float): The employee's salary
• years_of_experience (Integer): The number of years the employee has been
working
Queries:
a. Add a column experience_level: Create a new column that categorizes the
employee's experience into "Junior", "Mid", and "Senior":
• "Junior" if the years_of_experience is less than 3
• "Mid" if the years_of_experience is between 3 and 7 (inclusive)
• "Senior" if the years_of_experience is greater than 7
b. Add a column bonus_eligible: Create a new column that indicates whether
an employee is eligible for a bonus. An employee is eligible for a bonus if their
salary is greater than 70,000 or if they are in the "Senior" experience level.
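A minimal PySpark sketch for (a) and (b), assuming the employee data is already loaded into a DataFrame named df:

from pyspark.sql import functions as F

# a. Bucket experience into Junior / Mid / Senior
df = df.withColumn("experience_level",
    F.when(F.col("years_of_experience") < 3, "Junior")
     .when(F.col("years_of_experience") <= 7, "Mid")
     .otherwise("Senior"))

# b. Bonus eligibility: salary above 70,000 or Senior experience level
df = df.withColumn("bonus_eligible",
    F.when((F.col("salary") > 70000) |
           (F.col("experience_level") == "Senior"), "Yes").otherwise("No"))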
10. You are given a DataFrame containing information about students in a
university. The DataFrame has the following schema:
• student_id (String): Unique identifier for the student
• first_name (String): First name of the student
• last_name (String): Last name of the student
• math_score (Integer): Score in Math subject
• science_score (Integer): Score in Science subject
• english_score (Integer): Score in English subject
Queries:
a. Add a column full_name: Combine first_name and last_name into a new
column named full_name.
b. Add a column average_score: Calculate the average score of the three
subjects (math_score, science_score, and english_score) and add it as a new
column named average_score.
c. Add a column status: Create a new column that indicates whether the
student has passed or failed. A student is considered to have passed if their
average_score is greater than or equal to 50.
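A minimal PySpark sketch, again assuming the student data is already in a DataFrame named df:

from pyspark.sql import functions as F

# a. Combine first and last name
df = df.withColumn("full_name",
    F.concat_ws(" ", F.col("first_name"), F.col("last_name")))

# b. Average of the three subject scores
df = df.withColumn("average_score",
    (F.col("math_score") + F.col("science_score") + F.col("english_score")) / 3)

# c. Pass/fail status based on the average score
df = df.withColumn("status",
    F.when(F.col("average_score") >= 50, "Passed").otherwise("Failed"))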
11. You are given a DataFrame containing sales data for a retail store. The
DataFrame has the following schema:
• transaction_id (String): Unique identifier for the transaction
• customer_id (String): Unique identifier for the customer
• transaction_date (Date): The date when the transaction occurred
• total_amount (Float): The total amount of the transaction
• product_category_1 (String): Category of the first product purchased
• product_category_2 (String): Category of the second product purchased
• product_category_3 (String): Category of the third product purchased
Queries:
a. Drop columns product_category_2 and product_category_3 if they have fewer
than 100 transactions in total. These columns are considered to have low
sales.
b. Drop columns starting with the prefix "product_category_" but not including
the column product_category_1. These columns are no longer required for
analysis.
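One way to approach both drops in PySpark, assuming the sales data is in a DataFrame named df; in (a) the threshold check counts the non-null entries of each column:

from pyspark.sql import functions as F

# a. Drop the low-sales category columns if they have fewer than 100 transactions in total
low_sales_cols = ["product_category_2", "product_category_3"]
counts = df.select([F.count(F.col(c)).alias(c) for c in low_sales_cols]).first()
df = df.drop(*[c for c in low_sales_cols if counts[c] < 100])

# b. Drop every column starting with "product_category_" except product_category_1
prefixed = [c for c in df.columns
            if c.startswith("product_category_") and c != "product_category_1"]
df = df.drop(*prefixed)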
12. Write a PySpark SQL query to find the top three salaries in each department from the
Employee table. If a department has fewer than three employees, include all of its
salaries. The output should contain the following columns:
• Department (String): The department name.
• Employee (String): The employee name.
• Salary (Integer): The employee salary.
Employee Table Schema:
• Id (Integer): Unique identifier for the employee.
• Name (String): The name of the employee.
• Salary (Integer): The salary of the employee.
• DepartmentId (Integer): The department ID where the employee works.
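A possible PySpark SQL sketch: the employee data is assumed to be in a DataFrame named employee_df, and a companion Department (Id, Name) lookup table (department_df) is assumed, since the schema above only stores DepartmentId while the output asks for the department name.

employee_df.createOrReplaceTempView("Employee")
department_df.createOrReplaceTempView("Department")

top3 = spark.sql("""
    SELECT d.Name AS Department,
           e.Name AS Employee,
           e.Salary
    FROM (
        SELECT Name, Salary, DepartmentId,
               DENSE_RANK() OVER (PARTITION BY DepartmentId
                                  ORDER BY Salary DESC) AS rnk
        FROM Employee
    ) e
    JOIN Department d ON e.DepartmentId = d.Id
    WHERE e.rnk <= 3
""")
top3.show()

DENSE_RANK keeps tied salaries together and naturally returns every row for departments with fewer than three employees.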
13. Write a PySpark SQL query to find all employees who earn more than their
managers. The output should contain the following columns:
• Employee (String): The employee name.
• Manager (String): The manager name.
Employee Table Schema:
• Id (Integer): Unique identifier for the employee.
• Name (String): The name of the employee.
• Salary (Integer): The salary of the employee.
• ManagerId (Integer): The ID of the employee's manager. NULL if the employee
has no manager.
QUERIES
• Query to Find All Employees and Their Managers
• Query to Find Employees Earning More Than Their Managers
• Query to Find Employees Without Managers
• Query to Find the Highest Paid Employee
• Query to Calculate the Average Salary of Employees
• Query to Find Employees with a Salary Above a Certain Threshold
• Query to Find the Total Salary Paid to Employees Managed by Each
Manager
• Query to Count the Number of Employees Managed by Each Manager
• Query to List All Managers
• Query to List All Employees Grouped by Their Managers
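A sketch for question 13 and a few of the listed queries, using a self-join of the Employee view with itself; employee_df and an active SparkSession named spark are assumed:

employee_df.createOrReplaceTempView("Employee")

# Employees earning more than their managers (question 13)
spark.sql("""
    SELECT e.Name AS Employee, m.Name AS Manager
    FROM Employee e
    JOIN Employee m ON e.ManagerId = m.Id
    WHERE e.Salary > m.Salary
""").show()

# Employees without managers
spark.sql("SELECT Name FROM Employee WHERE ManagerId IS NULL").show()

# Number of employees managed by each manager
spark.sql("""
    SELECT m.Name AS Manager, COUNT(e.Id) AS EmployeeCount
    FROM Employee m
    JOIN Employee e ON e.ManagerId = m.Id
    GROUP BY m.Name
""").show()

The remaining listed queries follow the same pattern of self-joins, WHERE filters, and GROUP BY aggregations.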
14. You have been given a dataset containing sales information, and you need to
perform several sorting operations to analyze the data effectively. The dataset
contains the following columns:
a. TransactionID: Unique identifier for each transaction.
b. CustomerID: Unique identifier for each customer.
c. Product: The name of the product sold.
d. Quantity: The number of units sold.
e. Price: The price per unit of the product.
f. TransactionDate: The date when the transaction occurred.
Queries:
• Load the Data: Create a PySpark DataFrame from the dataset.
• Sort by Transaction Date: Sort the DataFrame by TransactionDate in
ascending order.
• Sort by Price: Sort the DataFrame by Price in descending order.
• Sort by CustomerID and Transaction Date: Sort the DataFrame first by
CustomerID in ascending order and then by TransactionDate in descending
order.
• Top Transactions: Find the top 5 transactions with the highest total
amount spent (Quantity * Price).
• Calculate Total Sales per Customer: Calculate the total amount spent by
each customer.
• Top Products by Total Sales: Identify the top 3 products by total sales
amount.
• Running Total Sales by Date: Compute the running total of sales for each
day.
• Top N Transactions by Customer: For each customer, find their top N
transactions by amount spent.
• Monthly Sales Trend: Analyze the monthly sales trend by calculating the
total sales for each month.
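A condensed PySpark sketch covering these operations; the file name sales.csv and the choice N = 3 are placeholders:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

# Load the data and precompute the amount spent per transaction
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
sales = sales.withColumn("Amount", F.col("Quantity") * F.col("Price"))

sales.orderBy("TransactionDate").show()                       # by date, ascending
sales.orderBy(F.col("Price").desc()).show()                   # by price, descending
sales.orderBy(F.col("CustomerID").asc(),
              F.col("TransactionDate").desc()).show()         # customer asc, then date desc

# Top 5 transactions by amount spent
sales.orderBy(F.col("Amount").desc()).limit(5).show()

# Total sales per customer
sales.groupBy("CustomerID").agg(F.sum("Amount").alias("TotalSpent")).show()

# Top 3 products by total sales amount
(sales.groupBy("Product").agg(F.sum("Amount").alias("TotalSales"))
      .orderBy(F.col("TotalSales").desc()).limit(3).show())

# Running total of sales by date
daily = sales.groupBy("TransactionDate").agg(F.sum("Amount").alias("DailySales"))
running = Window.orderBy("TransactionDate").rowsBetween(Window.unboundedPreceding, 0)
daily.withColumn("RunningTotal", F.sum("DailySales").over(running)).show()

# Top N transactions per customer (here N = 3)
w = Window.partitionBy("CustomerID").orderBy(F.col("Amount").desc())
sales.withColumn("rank", F.row_number().over(w)).filter(F.col("rank") <= 3).show()

# Monthly sales trend
(sales.withColumn("Month", F.date_format("TransactionDate", "yyyy-MM"))
      .groupBy("Month").agg(F.sum("Amount").alias("MonthlySales"))
      .orderBy("Month").show())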
MODULE 2
1. Describe the roles and functionalities of the Mapper and Reducer in the
MapReduce framework. Provide a simple example of each.
2. Explain the components of the MapReduce architecture. How does MapReduce handle a
data query?
3. What is the role of the Partitioner in the MapReduce framework? Provide an
example of how a custom Partitioner can be implemented.
4. Explain how MapReduce completes a task. Explain how failures are handled.
5. Explain how a job runs on MapReduce. How is a job submitted?
6. How do the JobTracker and the TaskTracker handle a MapReduce job?
7. What is a Combiner? How does a Combiner work? What are the advantages and
disadvantages of Combiners?
APPLY/ANALYZE level Questions
1. Develop a PySpark program for Word Count in MapReduce.
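A minimal RDD-based word count sketch (input.txt is a placeholder path):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("input.txt")                  # read the input lines
            .flatMap(lambda line: line.split())     # map: emit individual words
            .map(lambda word: (word, 1))            # map: (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))       # reduce: sum the counts per word

for word, count in counts.collect():
    print(word, count)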
2. Develop a PySpark program to implement Matrix Multiplication with Hadoop MapReduce.
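One way to express matrix multiplication in the map-reduce style with RDDs; the matrices are assumed to be given as (row, col, value) triples, shown here with small 2x2 example data:

from pyspark import SparkContext

sc = SparkContext(appName="MatrixMultiply")

# A and B stored as (row, col, value) triples
A = sc.parallelize([(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)])
B = sc.parallelize([(0, 0, 5.0), (0, 1, 6.0), (1, 0, 7.0), (1, 1, 8.0)])

# Map: key both matrices by the shared inner dimension k, then join on k
a_by_k = A.map(lambda t: (t[1], (t[0], t[2])))   # (k, (i, A[i][k]))
b_by_k = B.map(lambda t: (t[0], (t[1], t[2])))   # (k, (j, B[k][j]))

# Reduce: sum the partial products for each output cell (i, j)
product = (a_by_k.join(b_by_k)
                 .map(lambda kv: ((kv[1][0][0], kv[1][1][0]),
                                  kv[1][0][1] * kv[1][1][1]))
                 .reduceByKey(lambda x, y: x + y))

print(sorted(product.collect()))
# expected: [((0, 0), 19.0), ((0, 1), 22.0), ((1, 0), 43.0), ((1, 1), 50.0)]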
3. Develop a PySpark program to implement Searching with Hadoop MapReduce.
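A grep-style search sketch with RDDs; input.txt and the keyword are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="Search")

keyword = "error"   # the term being searched for (placeholder)

# Map/filter: keep only the lines containing the keyword
matches = sc.textFile("input.txt").filter(lambda line: keyword in line)

print(matches.count())         # number of matching lines
for line in matches.take(10):  # a sample of the matches
    print(line)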
4. Develop a PySpark program to implement Sorting with Hadoop MapReduce.
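A sorting sketch with RDDs; numbers.txt is a placeholder file with one number per line:

from pyspark import SparkContext

sc = SparkContext(appName="Sort")

# Key each value by itself and let the shuffle sort the keys, which mirrors how
# MapReduce sorts intermediate keys between the map and reduce phases
sorted_values = (sc.textFile("numbers.txt")
                   .map(lambda line: (int(line.strip()), None))
                   .sortByKey(ascending=True)
                   .keys())

print(sorted_values.collect())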