
2CEIT702: BIG DATA ANALYTICS

ASSIGNMENT - 1

Instructions:

● Write your solutions on file pages only (use both sides of each page).
● Write programming solutions (code) with output (you may use Databricks Community Edition).

Questions:

1. Explain the key characteristics that make Apache Cassandra a NoSQL database
management system. Compare and contrast these characteristics with those of
traditional relational databases. Provide examples to illustrate your points.

2. Imagine you are designing a database system for a social media platform that needs to
handle a massive amount of user data, including profiles, posts, and messages. Why
might you choose Apache Cassandra as the database solution for this project?
Describe how you would model the data in Cassandra to efficiently handle the
requirements of such a system. Highlight the key considerations and advantages of
using Cassandra in this scenario.
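
For reference, one possible table design for the posts workload, expressed through the DataStax Python driver (the driver choice, node address, keyspace name, schema, and replication settings are all illustrative assumptions, not part of the question):

from cassandra.cluster import Cluster

# Connect to a locally running Cassandra node (address is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS social
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Posts are partitioned by user and clustered newest-first, so the common
# query "latest posts for a given user" reads one partition in order.
session.execute("""
    CREATE TABLE IF NOT EXISTS social.posts_by_user (
        user_id   uuid,
        post_time timeuuid,
        content   text,
        PRIMARY KEY (user_id, post_time)
    ) WITH CLUSTERING ORDER BY (post_time DESC)
""")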

3. You are tasked with building a data processing system for a real-time e-commerce
platform that needs to analyze customer behavior and generate personalized
recommendations. Explain the advantages and disadvantages of using both data
streaming and batch processing approaches for this scenario. Additionally, propose a
hybrid solution that combines elements of both streaming and batch processing to
optimize the recommendation engine's performance. Justify your choice of the hybrid
approach and outline the key components and considerations involved in its
implementation.
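
For orientation, here is a minimal sketch of the streaming leg of such a hybrid design, using Spark Structured Streaming with the built-in rate source as a stand-in for a real event stream (the source, window size, and aggregation are illustrative assumptions; a production system would read customer events from Kafka or similar):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("HybridStreamingSketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows and stands in
# here for a real stream of customer-behavior events.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Streaming leg: aggregate events into one-minute windows; in a real
# pipeline these would update the features the recommender scores online.
windowed = events.groupBy(window(events.timestamp, "1 minute")) \
                 .agg(count("*").alias("event_count"))

# The batch leg (not shown) would periodically retrain the recommendation
# model on the full history and publish it for the streaming leg to use.
query = windowed.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()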

4. Describe the fundamental components of Apache Kafka, including producers, topics, brokers, consumers, and ZooKeeper. Provide a use-case scenario where these components work together to solve a specific problem. Explain how each component plays a role in this scenario and the advantages of using Kafka for this particular use case.
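
As a concrete point of reference, here is a minimal producer/consumer sketch using the third-party kafka-python client (the client library, broker address, topic name, and consumer group are all assumptions for illustration; the question does not prescribe them):

from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes an order event to the "orders" topic on a local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"user-42", value=b'{"item": "book", "qty": 1}')
producer.flush()

# Consumer: joins the "billing" consumer group; the brokers (coordinated via
# ZooKeeper, or KRaft in newer Kafka) assign it a share of the partitions.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.offset, record.key, record.value)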

5. Explain Kafka's message anatomy.
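
As a starting point, every consumed record exposes the parts of a Kafka message directly; the sketch below (again assuming the kafka-python client and a hypothetical "orders" topic) prints each field of the anatomy:

from kafka import KafkaConsumer

consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")
for record in consumer:
    print(record.topic)      # topic the message was published to
    print(record.partition)  # partition within that topic
    print(record.offset)     # position within the partition's log
    print(record.timestamp)  # when the message was produced or appended
    print(record.key)        # optional key, used to choose the partition
    print(record.value)      # the payload itself
    print(record.headers)    # optional application-defined header pairs
    break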


6. You are tasked with analyzing a large dataset containing information about online
customer reviews. The dataset is stored as a text file, where each line represents a
review in the following format:

<product_id>,<review_text>,<rating>
Your goal is to use Apache Spark RDDs in Python to perform the following tasks:
1) Calculate the average rating for each product.
2) Identify the product with the highest average rating.
Write a Python script using Apache Spark RDDs to accomplish these tasks. Your script should read the dataset, perform the calculations, and print the results in the following format:

Product with the highest average rating: <product_id> (Average Rating: <average_rating>)

Note: To help you get started, you can use the following Spark RDD operations:
sc.textFile("input.txt"): Read the text file and create an RDD.
map(): Transform each line of the RDD to extract the product ID, review text, and rating.
mapValues(): Keep the product ID as the key and reduce each value to just the rating.
reduceByKey(): Calculate the sum of ratings for each product.
countByKey(): Count the number of reviews for each product.
mapValues(): Calculate the average rating for each product.
max(): Find the product with the highest average rating.
Please write your Python script and include comments to explain your code.
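
A minimal sketch of one possible solution follows (the file name input.txt comes from the hint above; the average is computed via a (sum, count) pair, a slight variation on the hinted chain of operations):

from pyspark import SparkContext

sc = SparkContext(appName="ProductRatings")

# Each line has the form <product_id>,<review_text>,<rating>.
lines = sc.textFile("input.txt")

# Split from both ends so a comma inside the review text does no harm,
# producing (product_id, rating) pairs.
pairs = lines.map(lambda line: (line.split(",", 1)[0],
                                float(line.rsplit(",", 1)[1])))

# Sum the ratings and count the reviews per product in a single pass.
sums_counts = pairs.mapValues(lambda r: (r, 1)) \
                   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Average rating per product.
averages = sums_counts.mapValues(lambda t: t[0] / t[1])
for pid, avg in averages.collect():
    print("Product %s: average rating %.2f" % (pid, avg))

# Product with the highest average rating.
best_id, best_avg = averages.max(key=lambda kv: kv[1])
print("Product with the highest average rating: %s (Average Rating: %.2f)"
      % (best_id, best_avg))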
7. You are working as a data engineer for a retail company that sells products online.
The company has collected a large amount of data about customer orders, including
information about the products, customers, and order details. Your task is to use
Apache Spark DataFrames in Python to perform the following tasks:

1) Load the provided dataset into a Spark DataFrame.
2) Calculate the total revenue generated by each product (i.e., the product's price multiplied by the quantity sold, summed over all orders).
3) Identify the top 5 products with the highest total revenue.
The dataset is stored in a CSV file with the following columns:
product_id: A unique identifier for each product.
product_name: The name of the product.
price: The price of one unit of the product.
quantity_sold: The quantity of the product sold in each order.
Write a Python script using Spark DataFrames to accomplish these tasks. Your script
should read the dataset, perform the calculations, and print the top 5 products with the
highest total revenue in the following format:
Top 5 Products by Total Revenue:
1. Product Name: <product_name_1>, Total Revenue: <total_revenue_1>
2. Product Name: <product_name_2>, Total Revenue: <total_revenue_2>
3. Product Name: <product_name_3>, Total Revenue: <total_revenue_3>
4. Product Name: <product_name_4>, Total Revenue: <total_revenue_4>
5. Product Name: <product_name_5>, Total Revenue: <total_revenue_5>
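
A minimal sketch of one possible solution follows (the file name orders.csv is an assumption; the column names are the ones listed above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ProductRevenue").getOrCreate()

# 1) Load the CSV into a DataFrame, inferring column types from the data.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# 2) Revenue per order line is price * quantity_sold; sum it per product.
revenue = (orders
           .withColumn("revenue", F.col("price") * F.col("quantity_sold"))
           .groupBy("product_id", "product_name")
           .agg(F.sum("revenue").alias("total_revenue")))

# 3) Take the five products with the highest total revenue.
top5 = revenue.orderBy(F.desc("total_revenue")).limit(5).collect()

print("Top 5 Products by Total Revenue:")
for i, row in enumerate(top5, start=1):
    print("%d. Product Name: %s, Total Revenue: %.2f"
          % (i, row["product_name"], row["total_revenue"]))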

8. Differentiate between Regression and Classification in Machine Learning.
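
If a concrete contrast helps, the toy sketch below fits both kinds of model on the same feature using scikit-learn (the library choice and the invented numbers are illustrative assumptions only):

from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy feature: hours studied.
X = [[1], [2], [3], [4], [5]]

# Regression predicts a continuous target (an exam score).
scores = [52, 60, 71, 80, 88]
reg = LinearRegression().fit(X, scores)
print(reg.predict([[6]]))   # a real-valued prediction

# Classification predicts a discrete label (pass = 1, fail = 0).
passed = [0, 0, 1, 1, 1]
clf = LogisticRegression().fit(X, passed)
print(clf.predict([[6]]))   # a class label, 0 or 1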

9. Explain narrow and wide dependencies in Apache Spark with sample data.
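
For reference, a minimal PySpark sketch that exhibits both dependency types on tiny sample data (the sample words are invented for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="DependencyDemo")

words = sc.parallelize(["spark", "kafka", "spark", "hive", "kafka", "spark"])

# Narrow dependency: each output partition depends on exactly one input
# partition, so map() needs no data movement between executors.
pairs = words.map(lambda w: (w, 1))

# Wide dependency: reduceByKey() must gather every value for a key from
# all partitions, which triggers a shuffle across the cluster.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [('spark', 3), ('kafka', 2), ('hive', 1)]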

10. Define the following terms:
1) Artificial Intelligence
2) Machine Learning
3) Deep Learning
4) Supervised Learning
5) Unsupervised Learning
