Question Bank-BDA (Module 1 & 2)
MODULE 1
1. What is big data analytics? Explain the four 'V's of Big Data. Briefly discuss applications
of Big Data.
2. What are the advantages of Hadoop? Explain the Hadoop architecture and its
components with a proper diagram.
3. What are the benefits of Big Data? Discuss the challenges of Big Data.
4. Discuss various components of Hadoop Ecosystem.
5. With proper examples, discuss and differentiate structured, unstructured, and semi-
structured data.
6. With a suitable block diagram, explain the architecture of HDFS. Discuss the roles of the
DataNode and the NameNode in HDFS. Give commands with appropriate arguments to perform
data transfer between the local file system and HDFS.
7. Write commands for the following:
a. Create a directory in HDFS at given path(s)
b. List the contents of a directory.
c. Upload and download a file in HDFS.
d. See contents of a file
e. Copy a file from source to destination
f. Copy a file from the local file system to HDFS and from HDFS to the local file system
g. Move file from source to destination
h. Remove a file or directory in HDFS
i. Display last few lines of a file.
j. Display the aggregate length of a file
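Sample solution sketch (for questions 6 and 7): the following hdfs dfs commands are one possible set of answers; all paths and file names (e.g., /user/hadoop/dir1, localfile.txt) are placeholders.

hdfs dfs -mkdir -p /user/hadoop/dir1 /user/hadoop/dir2            # a. create directories at the given paths
hdfs dfs -ls /user/hadoop/dir1                                    # b. list the contents of a directory
hdfs dfs -put localfile.txt /user/hadoop/dir1/                    # c. upload a local file into HDFS
hdfs dfs -get /user/hadoop/dir1/localfile.txt ./                  # c. download a file from HDFS
hdfs dfs -cat /user/hadoop/dir1/localfile.txt                     # d. see the contents of a file
hdfs dfs -cp /user/hadoop/dir1/localfile.txt /user/hadoop/dir2/   # e. copy from source to destination
hdfs dfs -copyFromLocal localfile.txt /user/hadoop/dir1/          # f. local file system -> HDFS
hdfs dfs -copyToLocal /user/hadoop/dir1/localfile.txt ./          # f. HDFS -> local file system
hdfs dfs -mv /user/hadoop/dir1/localfile.txt /user/hadoop/dir2/   # g. move from source to destination
hdfs dfs -rm /user/hadoop/dir2/localfile.txt                      # h. remove a file
hdfs dfs -rm -r /user/hadoop/dir2                                 # h. remove a directory
hdfs dfs -tail /user/hadoop/dir1/localfile.txt                    # i. display the last kilobyte of a file
hdfs dfs -du -s /user/hadoop/dir1/localfile.txt                   # j. aggregate length of a file in bytes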
Lab Programs
8. You have been given a CSV file named customer_data.csv containing customer
information. The goal is to perform a series of data transformations and clean the
data using PySpark. The CSV file has the following schema:
• customer_id (String): Unique identifier for each customer.
• name (String): Name of the customer.
• age (Integer): Age of the customer.
• gender (String): Gender of the customer ('M' for Male, 'F' for Female).
• country (String): Country of residence.
• signup_date (String): Date when the customer signed up (in the format YYYY-
MM-DD).
• last_purchase_amount (Float): The amount of the last purchase made by the
customer.
• total_purchases (Integer): Total number of purchases made by the customer.
Queries:
a. Load the CSV file into a PySpark DataFrame.
b. Display the first 5 rows of the DataFrame.
c. Print the schema of the DataFrame to verify the data types.
d. Create a new column age_group which categorizes customers into:
• 'Teen' if the age is less than 20
• 'Young Adult' if the age is between 20 and 35 (inclusive)
• 'Adult' if the age is between 36 and 50 (inclusive)
• 'Senior' if the age is above 50
e. Standardize the gender column to have values 'Male' and 'Female' instead of
'M' and 'F'.
f. Create a new column high_value_customer which is 'Yes' if total_purchases
is greater than 100 or last_purchase_amount is greater than 1000,
otherwise 'No'.
g. Filter out rows where age is less than or equal to 0 or greater than 120.
h. Filter out rows where total_purchases is less than 0.
i. Group the data by country and age_group, and calculate the following
aggregations:
• Total number of customers
• Average last purchase amount
• Total number of purchases
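A possible PySpark sketch for queries (a)-(i), assuming the file customer_data.csv is available locally and has the columns listed above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CustomerData").getOrCreate()

# a. Load the CSV file into a DataFrame
df = spark.read.csv("customer_data.csv", header=True, inferSchema=True)

df.show(5)          # b. first 5 rows
df.printSchema()    # c. verify the data types

# d. Categorize customers into age groups
df = df.withColumn("age_group",
    F.when(F.col("age") < 20, "Teen")
     .when(F.col("age") <= 35, "Young Adult")
     .when(F.col("age") <= 50, "Adult")
     .otherwise("Senior"))

# e. Standardize the gender values
df = df.withColumn("gender",
    F.when(F.col("gender") == "M", "Male")
     .when(F.col("gender") == "F", "Female")
     .otherwise(F.col("gender")))

# f. Flag high-value customers
df = df.withColumn("high_value_customer",
    F.when((F.col("total_purchases") > 100) |
           (F.col("last_purchase_amount") > 1000), "Yes").otherwise("No"))

# g. and h. Keep only rows with a plausible age and a non-negative purchase count
df = df.filter((F.col("age") > 0) & (F.col("age") <= 120) &
               (F.col("total_purchases") >= 0))

# i. Aggregations per country and age group
summary = (df.groupBy("country", "age_group")
             .agg(F.count("*").alias("total_customers"),
                  F.avg("last_purchase_amount").alias("avg_last_purchase_amount"),
                  F.sum("total_purchases").alias("total_purchases")))
summary.show()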
9. You are given a DataFrame containing information about employees in a company.
The DataFrame has the following schema:
• employee_id (String): Unique identifier for the employee
• name (String): Name of the employee
• department (String): Department where the employee works
• salary (Float): The employee's salary
• years_of_experience (Integer): The number of years the employee has been
working
Queries:
a. Add a column experience_level: Create a new column that categorizes the
employee's experience into "Junior", "Mid", and "Senior":
• "Junior" if the years_of_experience is less than 3
• "Mid" if the years_of_experience is between 3 and 7 (inclusive)
• "Senior" if the years_of_experience is greater than 7
b. Add a column bonus_eligible: Create a new column that indicates whether
an employee is eligible for a bonus. An employee is eligible for a bonus if their
salary is greater than 70,000 or if they are in the "Senior" experience level.
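A minimal PySpark sketch for (a) and (b), assuming the employee data is already loaded into a DataFrame named df:

from pyspark.sql import functions as F

# a. Bucket experience into Junior / Mid / Senior
df = df.withColumn("experience_level",
    F.when(F.col("years_of_experience") < 3, "Junior")
     .when(F.col("years_of_experience") <= 7, "Mid")
     .otherwise("Senior"))

# b. Bonus eligibility: salary above 70,000 or Senior experience level
df = df.withColumn("bonus_eligible",
    F.when((F.col("salary") > 70000) |
           (F.col("experience_level") == "Senior"), "Yes").otherwise("No"))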
10. You are given a DataFrame containing information about students in a
university. The DataFrame has the following schema:
• student_id (String): Unique identifier for the student
• first_name (String): First name of the student
• last_name (String): Last name of the student
• math_score (Integer): Score in Math subject
• science_score (Integer): Score in Science subject
• english_score (Integer): Score in English subject
Queries:
a. Add a column full_name: Combine first_name and last_name into a new
column named full_name.
b. Add a column average_score: Calculate the average score of the three
subjects (math_score, science_score, and english_score) and add it as a new
column named average_score.
c. Add a column status: Create a new column that indicates whether the
student has passed or failed. A student is considered to have passed if their
average_score is greater than or equal to 50.
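A minimal PySpark sketch, again assuming the student data is already in a DataFrame named df:

from pyspark.sql import functions as F

# a. Combine first and last name
df = df.withColumn("full_name",
    F.concat_ws(" ", F.col("first_name"), F.col("last_name")))

# b. Average of the three subject scores
df = df.withColumn("average_score",
    (F.col("math_score") + F.col("science_score") + F.col("english_score")) / 3)

# c. Pass/fail status based on the average score
df = df.withColumn("status",
    F.when(F.col("average_score") >= 50, "Passed").otherwise("Failed"))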
11. You are given a DataFrame containing sales data for a retail store. The
DataFrame has the following schema:
• transaction_id (String): Unique identifier for the transaction
• customer_id (String): Unique identifier for the customer
• transaction_date (Date): The date when the transaction occurred
• total_amount (Float): The total amount of the transaction
• product_category_1 (String): Category of the first product purchased
• product_category_2 (String): Category of the second product purchased
• product_category_3 (String): Category of the third product purchased
Queries:
a. Drop columns product_category_2 and product_category_3 if they have fewer
than 100 transactions in total. These columns are considered to have low
sales.
b. Drop columns starting with the prefix "product_category_" but not including
the column product_category_1. These columns are no longer required for
analysis.
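One way to approach both drops in PySpark, assuming the sales data is in a DataFrame named df; in (a) the threshold check counts the non-null entries of each column:

from pyspark.sql import functions as F

# a. Drop the low-sales category columns if they have fewer than 100 transactions in total
low_sales_cols = ["product_category_2", "product_category_3"]
counts = df.select([F.count(F.col(c)).alias(c) for c in low_sales_cols]).first()
df = df.drop(*[c for c in low_sales_cols if counts[c] < 100])

# b. Drop every column starting with "product_category_" except product_category_1
prefixed = [c for c in df.columns
            if c.startswith("product_category_") and c != "product_category_1"]
df = df.drop(*prefixed)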
12. Write a PySpark SQL query to find the top three salaries in each department from the
Employee table. If a department has fewer than three employees, include all of its
salaries. The output should contain the following columns:
• Department (String): The department name.
• Employee (String): The employee name.
• Salary (Integer): The employee salary.
Employee Table Schema:
• Id (Integer): Unique identifier for the employee.
• Name (String): The name of the employee.
• Salary (Integer): The salary of the employee.
• DepartmentId (Integer): The department ID where the employee works.
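A possible PySpark SQL sketch: the employee data is assumed to be in a DataFrame named employee_df, and a companion Department (Id, Name) lookup table (department_df) is assumed, since the schema above only stores DepartmentId while the output asks for the department name.

employee_df.createOrReplaceTempView("Employee")
department_df.createOrReplaceTempView("Department")

top3 = spark.sql("""
    SELECT d.Name AS Department,
           e.Name AS Employee,
           e.Salary
    FROM (
        SELECT Name, Salary, DepartmentId,
               DENSE_RANK() OVER (PARTITION BY DepartmentId
                                  ORDER BY Salary DESC) AS rnk
        FROM Employee
    ) e
    JOIN Department d ON e.DepartmentId = d.Id
    WHERE e.rnk <= 3
""")
top3.show()

DENSE_RANK keeps tied salaries together and naturally returns every row for departments with fewer than three employees.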
13. Write a PySpark SQL query to find all employees who earn more than their
managers. The output should contain the following columns:
• Employee (String): The employee name.
• Manager (String): The manager name.
Employee Table Schema:
• Id (Integer): Unique identifier for the employee.
• Name (String): The name of the employee.
• Salary (Integer): The salary of the employee.
• ManagerId (Integer): The ID of the employee's manager. NULL if the employee
has no manager.
QUERIES
• Query to Find All Employees and Their Managers
• Query to Find Employees Earning More Than Their Managers
• Query to Find Employees Without Managers
• Query to Find the Highest Paid Employee
• Query to Calculate the Average Salary of Employees
• Query to Find Employees with a Salary Above a Certain Threshold
• Query to Find the Total Salary Paid to Employees Managed by Each
Manager
• Query to Count the Number of Employees Managed by Each Manager
• Query to List All Managers
• Query to List All Employees Grouped by Their Managers
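A sketch for question 13 and a few of the listed queries, using a self-join of the Employee view with itself; employee_df and an active SparkSession named spark are assumed:

employee_df.createOrReplaceTempView("Employee")

# Employees earning more than their managers (question 13)
spark.sql("""
    SELECT e.Name AS Employee, m.Name AS Manager
    FROM Employee e
    JOIN Employee m ON e.ManagerId = m.Id
    WHERE e.Salary > m.Salary
""").show()

# Employees without managers
spark.sql("SELECT Name FROM Employee WHERE ManagerId IS NULL").show()

# Number of employees managed by each manager
spark.sql("""
    SELECT m.Name AS Manager, COUNT(e.Id) AS EmployeeCount
    FROM Employee m
    JOIN Employee e ON e.ManagerId = m.Id
    GROUP BY m.Name
""").show()

The remaining listed queries follow the same pattern of self-joins, WHERE filters, and GROUP BY aggregations.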
14. You have been given a dataset containing sales information, and you need to
perform several sorting operations to analyze the data effectively. The dataset
contains the following columns:
a. TransactionID: Unique identifier for each transaction.
b. CustomerID: Unique identifier for each customer.
c. Product: The name of the product sold.
d. Quantity: The number of units sold.
e. Price: The price per unit of the product.
f. TransactionDate: The date when the transaction occurred.
Queries:
• Load the Data: Create a PySpark DataFrame from the dataset.
• Sort by Transaction Date: Sort the DataFrame by TransactionDate in
ascending order.
• Sort by Price: Sort the DataFrame by Price in descending order.
• Sort by CustomerID and Transaction Date: Sort the DataFrame first by
CustomerID in ascending order and then by TransactionDate in descending
order.
• Top Transactions: Find the top 5 transactions with the highest total
amount spent (Quantity * Price).
• Calculate Total Sales per Customer: Calculate the total amount spent by
each customer.
• Top Products by Total Sales: Identify the top 3 products by total sales
amount.
• Running Total Sales by Date: Compute the running total of sales for each
day.
• Top N Transactions by Customer: For each customer, find their top N
transactions by amount spent.
• Monthly Sales Trend: Analyze the monthly sales trend by calculating the
total sales for each month.
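A condensed PySpark sketch covering these operations; the file name sales.csv and the choice N = 3 are placeholders:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

# Load the data and precompute the amount spent per transaction
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
sales = sales.withColumn("Amount", F.col("Quantity") * F.col("Price"))

sales.orderBy("TransactionDate").show()                       # by date, ascending
sales.orderBy(F.col("Price").desc()).show()                   # by price, descending
sales.orderBy(F.col("CustomerID").asc(),
              F.col("TransactionDate").desc()).show()         # customer asc, then date desc

# Top 5 transactions by amount spent
sales.orderBy(F.col("Amount").desc()).limit(5).show()

# Total sales per customer
sales.groupBy("CustomerID").agg(F.sum("Amount").alias("TotalSpent")).show()

# Top 3 products by total sales amount
(sales.groupBy("Product").agg(F.sum("Amount").alias("TotalSales"))
      .orderBy(F.col("TotalSales").desc()).limit(3).show())

# Running total of sales by date
daily = sales.groupBy("TransactionDate").agg(F.sum("Amount").alias("DailySales"))
running = Window.orderBy("TransactionDate").rowsBetween(Window.unboundedPreceding, 0)
daily.withColumn("RunningTotal", F.sum("DailySales").over(running)).show()

# Top N transactions per customer (here N = 3)
w = Window.partitionBy("CustomerID").orderBy(F.col("Amount").desc())
sales.withColumn("rank", F.row_number().over(w)).filter(F.col("rank") <= 3).show()

# Monthly sales trend
(sales.withColumn("Month", F.date_format("TransactionDate", "yyyy-MM"))
      .groupBy("Month").agg(F.sum("Amount").alias("MonthlySales"))
      .orderBy("Month").show())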
MODULE 2
1. Describe the roles and functionalities of the Mapper and Reducer in the
MapReduce framework. Provide a simple example of each.
2. Explain the components of the MapReduce architecture. How does MapReduce handle a
data query?
3. What is the role of the Partitioner in the MapReduce framework? Provide an
example of how a custom Partitioner can be implemented.
4. Explain how MapReduce completes a task. Explain how failures are handled.
5. Explain how a job runs on MapReduce. How is a job submitted?
6. How do the JobTracker and the TaskTracker handle a MapReduce job?
7. What is a Combiner? How does a Combiner work? What are the advantages and
disadvantages of Combiners?
APPLY/ANALYZE level Questions
1. Develop a PySpark program for Word Count in MapReduce.
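A minimal RDD-based word count sketch (input.txt is a placeholder path):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("input.txt")                  # read the input lines
            .flatMap(lambda line: line.split())     # map: emit individual words
            .map(lambda word: (word, 1))            # map: (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))       # reduce: sum the counts per word

for word, count in counts.collect():
    print(word, count)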
2. Develop a PySpark program to implement Matrix Multiplication with Hadoop MapReduce.
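One way to express matrix multiplication in the map-reduce style with RDDs; the matrices are assumed to be given as (row, col, value) triples, shown here with small 2x2 example data:

from pyspark import SparkContext

sc = SparkContext(appName="MatrixMultiply")

# A and B stored as (row, col, value) triples
A = sc.parallelize([(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)])
B = sc.parallelize([(0, 0, 5.0), (0, 1, 6.0), (1, 0, 7.0), (1, 1, 8.0)])

# Map: key both matrices by the shared inner dimension k, then join on k
a_by_k = A.map(lambda t: (t[1], (t[0], t[2])))   # (k, (i, A[i][k]))
b_by_k = B.map(lambda t: (t[0], (t[1], t[2])))   # (k, (j, B[k][j]))

# Reduce: sum the partial products for each output cell (i, j)
product = (a_by_k.join(b_by_k)
                 .map(lambda kv: ((kv[1][0][0], kv[1][1][0]),
                                  kv[1][0][1] * kv[1][1][1]))
                 .reduceByKey(lambda x, y: x + y))

print(sorted(product.collect()))
# expected: [((0, 0), 19.0), ((0, 1), 22.0), ((1, 0), 43.0), ((1, 1), 50.0)]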
3. Develop a PySpark program to implement Searching with Hadoop MapReduce.
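A grep-style search sketch with RDDs; input.txt and the keyword are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="Search")

keyword = "error"   # the term being searched for (placeholder)

# Map/filter: keep only the lines containing the keyword
matches = sc.textFile("input.txt").filter(lambda line: keyword in line)

print(matches.count())         # number of matching lines
for line in matches.take(10):  # a sample of the matches
    print(line)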
4. Develop a PySpark program to implement Sorting with Hadoop MapReduce.
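A sorting sketch with RDDs; numbers.txt is a placeholder file with one number per line:

from pyspark import SparkContext

sc = SparkContext(appName="Sort")

# Key each value by itself and let the shuffle sort the keys, which mirrors how
# MapReduce sorts intermediate keys between the map and reduce phases
sorted_values = (sc.textFile("numbers.txt")
                   .map(lambda line: (int(line.strip()), None))
                   .sortByKey(ascending=True)
                   .keys())

print(sorted_values.collect())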