
2CEIT702: BIG DATA ANALYTICS

ASSIGNMENT - 1

Instructions:

● Write your solutions on file pages only (use both sides of each page).
● Write programming solutions (code) with output (you may use Databricks Community Edition).

Questions:

1. Explain the key characteristics that make Apache Cassandra a NoSQL database
management system. Compare and contrast these characteristics with those of
traditional relational databases. Provide examples to illustrate your points.

2. Imagine you are designing a database system for a social media platform that needs to
handle a massive amount of user data, including profiles, posts, and messages. Why
might you choose Apache Cassandra as the database solution for this project?
Describe how you would model the data in Cassandra to efficiently handle the
requirements of such a system. Highlight the key considerations and advantages of
using Cassandra in this scenario.
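
For reference, one possible table design for the posts workload, expressed through the DataStax Python driver (the driver choice, node address, keyspace name, schema, and replication settings are all illustrative assumptions, not part of the question):

from cassandra.cluster import Cluster

# Connect to a locally running Cassandra node (address is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS social
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Posts are partitioned by user and clustered newest-first, so the common
# query "latest posts for a given user" reads one partition in order.
session.execute("""
    CREATE TABLE IF NOT EXISTS social.posts_by_user (
        user_id   uuid,
        post_time timeuuid,
        content   text,
        PRIMARY KEY (user_id, post_time)
    ) WITH CLUSTERING ORDER BY (post_time DESC)
""")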

3. You are tasked with building a data processing system for a real-time e-commerce
platform that needs to analyze customer behavior and generate personalized
recommendations. Explain the advantages and disadvantages of using both data
streaming and batch processing approaches for this scenario. Additionally, propose a
hybrid solution that combines elements of both streaming and batch processing to
optimize the recommendation engine's performance. Justify your choice of the hybrid
approach and outline the key components and considerations involved in its
implementation.
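
For orientation, here is a minimal sketch of the streaming leg of such a hybrid design, using Spark Structured Streaming with the built-in rate source as a stand-in for a real event stream (the source, window size, and aggregation are illustrative assumptions; a production system would read customer events from Kafka or similar):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("HybridStreamingSketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows and stands in
# here for a real stream of customer-behavior events.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Streaming leg: aggregate events into one-minute windows; in a real
# pipeline these would update the features the recommender scores online.
windowed = events.groupBy(window(events.timestamp, "1 minute")) \
                 .agg(count("*").alias("event_count"))

# The batch leg (not shown) would periodically retrain the recommendation
# model on the full history and publish it for the streaming leg to use.
query = windowed.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()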

4. Describe the fundamental components of Apache Kafka, including producers, topics, brokers, consumers, and ZooKeeper. Provide a use-case scenario where these components work together to solve a specific problem. Explain how each component plays a role in this scenario and the advantages of using Kafka for this particular use case.
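
As a concrete point of reference, here is a minimal producer/consumer sketch using the third-party kafka-python client (the client library, broker address, topic name, and consumer group are all assumptions for illustration; the question does not prescribe them):

from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes an order event to the "orders" topic on a local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"user-42", value=b'{"item": "book", "qty": 1}')
producer.flush()

# Consumer: joins the "billing" consumer group; the brokers (coordinated via
# ZooKeeper, or KRaft in newer Kafka) assign it a share of the partitions.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.offset, record.key, record.value)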

5. Explain Kafka's message anatomy.
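
As a starting point, every consumed record exposes the parts of a Kafka message directly; the sketch below (again assuming the kafka-python client and a hypothetical "orders" topic) prints each field of the anatomy:

from kafka import KafkaConsumer

consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")
for record in consumer:
    print(record.topic)      # topic the message was published to
    print(record.partition)  # partition within that topic
    print(record.offset)     # position within the partition's log
    print(record.timestamp)  # when the message was produced or appended
    print(record.key)        # optional key, used to choose the partition
    print(record.value)      # the payload itself
    print(record.headers)    # optional application-defined header pairs
    break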


6. You are tasked with analyzing a large dataset containing information about online
customer reviews. The dataset is stored as a text file, where each line represents a
review in the following format:

<product_id>,<review_text>,<rating>
Your goal is to use Apache Spark RDDs in Python to perform the following tasks:
1) Calculate the average rating for each product.
2) Identify the product with the highest average rating.
Write a Python script using Apache Spark RDDs to accomplish these tasks. Your script should read the dataset, perform the calculations, and print the results in the following format:

Product with the highest average rating: <product_id> (Average Rating: <average_rating>)

Note: To help you get started, you can use the following Spark RDD operations:
sc.textFile("input.txt"): Read the text file and create an RDD.
map(): Transform each line of the RDD to extract the product ID, review text, and rating.
mapValues(): Keep the product ID as the key and reduce each value to just the rating.
reduceByKey(): Calculate the sum of ratings for each product.
countByKey(): Count the number of reviews for each product.
mapValues(): Calculate the average rating for each product.
max(): Find the product with the highest average rating.
Please write your Python script and include comments to explain your code.
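
A minimal sketch of one possible solution follows (the file name input.txt comes from the hint above; the average is computed via a (sum, count) pair, a slight variation on the hinted chain of operations):

from pyspark import SparkContext

sc = SparkContext(appName="ProductRatings")

# Each line has the form <product_id>,<review_text>,<rating>.
lines = sc.textFile("input.txt")

# Split from both ends so a comma inside the review text does no harm,
# producing (product_id, rating) pairs.
pairs = lines.map(lambda line: (line.split(",", 1)[0],
                                float(line.rsplit(",", 1)[1])))

# Sum the ratings and count the reviews per product in a single pass.
sums_counts = pairs.mapValues(lambda r: (r, 1)) \
                   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Average rating per product.
averages = sums_counts.mapValues(lambda t: t[0] / t[1])
for pid, avg in averages.collect():
    print("Product %s: average rating %.2f" % (pid, avg))

# Product with the highest average rating.
best_id, best_avg = averages.max(key=lambda kv: kv[1])
print("Product with the highest average rating: %s (Average Rating: %.2f)"
      % (best_id, best_avg))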
7. You are working as a data engineer for a retail company that sells products online.
The company has collected a large amount of data about customer orders, including
information about the products, customers, and order details. Your task is to use
Apache Spark DataFrames in Python to perform the following tasks:

1) Load the provided dataset into a Spark DataFrame.
2) Calculate the total revenue generated by each product (i.e., the product's price multiplied by the quantity sold, summed over all orders).
3) Identify the top 5 products with the highest total revenue.
The dataset is stored in a CSV file with the following columns:
product_id: A unique identifier for each product.
product_name: The name of the product.
price: The price of one unit of the product.
quantity_sold: The quantity of the product sold in each order.
Write a Python script using Spark DataFrames to accomplish these tasks. Your script
should read the dataset, perform the calculations, and print the top 5 products with the
highest total revenue in the following format:
Top 5 Products by Total Revenue:
1. Product Name: <product_name_1>, Total Revenue: <total_revenue_1>
2. Product Name: <product_name_2>, Total Revenue: <total_revenue_2>
3. Product Name: <product_name_3>, Total Revenue: <total_revenue_3>
4. Product Name: <product_name_4>, Total Revenue: <total_revenue_4>
5. Product Name: <product_name_5>, Total Revenue: <total_revenue_5>
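
A minimal sketch of one possible solution follows (the file name orders.csv is an assumption; the column names are the ones listed above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ProductRevenue").getOrCreate()

# 1) Load the CSV into a DataFrame, inferring column types from the data.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# 2) Revenue per order line is price * quantity_sold; sum it per product.
revenue = (orders
           .withColumn("revenue", F.col("price") * F.col("quantity_sold"))
           .groupBy("product_id", "product_name")
           .agg(F.sum("revenue").alias("total_revenue")))

# 3) Take the five products with the highest total revenue.
top5 = revenue.orderBy(F.desc("total_revenue")).limit(5).collect()

print("Top 5 Products by Total Revenue:")
for i, row in enumerate(top5, start=1):
    print("%d. Product Name: %s, Total Revenue: %.2f"
          % (i, row["product_name"], row["total_revenue"]))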

8. Differentiate between Regression and Classification in Machine Learning.
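
If a concrete contrast helps, the toy sketch below fits both kinds of model on the same feature using scikit-learn (the library choice and the invented numbers are illustrative assumptions only):

from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy feature: hours studied.
X = [[1], [2], [3], [4], [5]]

# Regression predicts a continuous target (an exam score).
scores = [52, 60, 71, 80, 88]
reg = LinearRegression().fit(X, scores)
print(reg.predict([[6]]))   # a real-valued prediction

# Classification predicts a discrete label (pass = 1, fail = 0).
passed = [0, 0, 1, 1, 1]
clf = LogisticRegression().fit(X, passed)
print(clf.predict([[6]]))   # a class label, 0 or 1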

9. Explain narrow and wide dependencies in Apache Spark with sample data.
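
For reference, a minimal PySpark sketch that exhibits both dependency types on tiny sample data (the sample words are invented for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="DependencyDemo")

words = sc.parallelize(["spark", "kafka", "spark", "hive", "kafka", "spark"])

# Narrow dependency: each output partition depends on exactly one input
# partition, so map() needs no data movement between executors.
pairs = words.map(lambda w: (w, 1))

# Wide dependency: reduceByKey() must gather every value for a key from
# all partitions, which triggers a shuffle across the cluster.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [('spark', 3), ('kafka', 2), ('hive', 1)]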

10. Define the following terms:
1) Artificial Intelligence
2) Machine Learning
3) Deep Learning
4) Supervised Learning
5) Unsupervised Learning
