Big Data Technologies Lab - 5th Unit
1. A company wants to use Avro to store and transmit employee data across systems in a
compact, efficient binary format. The data needs to include basic employee information
such as name, age, position, and salary. Additionally, the company has offices in different
locations, so each employee record should include a nested address field with details
about the employee's location, such as city, state, and country. The schema should be
structured to allow for future expansion of fields while maintaining compatibility.-NAVIN
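A minimal sketch of how such a schema and a record write might look, assuming Python with the fastavro library; the namespace, the optional department field added to illustrate schema evolution, and the sample values are assumptions, not part of the brief.

```python
# Illustrative Avro schema with a nested address record, written with fastavro
# (the standard avro package works similarly). Field defaults on new optional
# fields are what keep older readers compatible as the schema evolves.
from fastavro import writer, parse_schema

employee_schema = {
    "type": "record",
    "name": "Employee",
    "namespace": "com.example.hr",   # assumed namespace
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "position", "type": "string"},
        {"name": "salary", "type": "double"},
        {"name": "address", "type": {
            "type": "record",
            "name": "Address",
            "fields": [
                {"name": "city", "type": "string"},
                {"name": "state", "type": "string"},
                {"name": "country", "type": "string"},
            ],
        }},
        # Hypothetical optional field with a default: adding fields this way
        # preserves backward/forward compatibility.
        {"name": "department", "type": ["null", "string"], "default": None},
    ],
}

parsed = parse_schema(employee_schema)
records = [{
    "name": "Asha", "age": 31, "position": "Engineer", "salary": 85000.0,
    "address": {"city": "Chennai", "state": "TN", "country": "India"},
    "department": None,
}]

with open("employees.avro", "wb") as out:
    writer(out, parsed, records)   # compact binary Avro container file
```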
2. An e-commerce company needs a system to collect, process, and store server logs
generated by their web application in real-time. They want these logs to be stored in HDFS
for long-term storage and further analysis using Hadoop tools like Apache Hive and
Apache Spark. Since the log data volume is high and grows continuously, they require a
reliable way to capture, buffer, and transport the data from the application server to
HDFS. Apache Flume is chosen for this purpose due to its reliability in streaming data
ingestion. It will read the logs generated in a local directory, process them, and write them
into HDFS in batches.-SHANMITHAA S
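A minimal sketch of what the Flume agent configuration might look like for this scenario. Flume is driven by a properties file rather than code; the agent name, spool directory, NameNode address, and batch/roll settings below are assumptions.

```properties
# Spooling-directory source -> memory channel -> HDFS sink (illustrative)
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: reads completed log files dropped into a local directory
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/log/webapp/spool
agent1.sources.src1.channels = ch1

# Channel: buffers events between source and sink
agent1.channels.ch1.type                = memory
agent1.channels.ch1.capacity            = 10000
agent1.channels.ch1.transactionCapacity = 1000

# Sink: writes events into HDFS in batches as plain text
agent1.sinks.sink1.type                   = hdfs
agent1.sinks.sink1.channel                = ch1
agent1.sinks.sink1.hdfs.path              = hdfs://namenode:8020/logs/webapp/%Y/%m/%d
agent1.sinks.sink1.hdfs.fileType          = DataStream
agent1.sinks.sink1.hdfs.writeFormat       = Text
agent1.sinks.sink1.hdfs.batchSize         = 1000
agent1.sinks.sink1.hdfs.rollInterval      = 300
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
```

The agent could then be started with `flume-ng agent --conf-file flume-hdfs.conf --name agent1` (the file name is assumed).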
3. An analytics team at a retail company needs to analyze customer data stored in a MySQL
database. They want to process this data in Hadoop to derive insights, such as customer
purchasing patterns and preferences. To perform these analyses efficiently, they plan to
import the MySQL data into Hadoop's HDFS using Apache Sqoop. The MySQL database
contains a customers table with fields such as customer_id, name, email, age, and city.
The data needs to be imported as text files into HDFS, where Hadoop and other
processing tools like Hive or Spark can access it for analysis.-SHRUTI
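One possible shape of the Sqoop import command for this scenario; the JDBC host, database name, credentials file, and target directory are placeholders.

```bash
# Import the customers table from MySQL into HDFS as comma-delimited text files
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/retaildb \
  --username analytics_user \
  --password-file /user/hadoop/.mysql_pass \
  --table customers \
  --columns "customer_id,name,email,age,city" \
  --target-dir /user/hadoop/customers \
  --as-textfile \
  --fields-terminated-by ',' \
  --num-mappers 4
```

`--num-mappers` controls how many parallel map tasks Sqoop uses to split the import, typically on the table's primary key.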
4. A news website wants to analyze the frequency of words in their article database to better
understand trending topics and keywords. The articles are stored in a Spark DataFrame
with a text column named content containing the full text of each article. The analytics
team needs to tokenize each article (split the text into individual words), filter out
common stop words, and then count the occurrence of each unique word. They plan to
use Spark NLP for the tokenization and word counting process.-LAYA K
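A word-frequency sketch in PySpark. The brief names Spark NLP; as a stand-in, this sketch uses the built-in pyspark.ml.feature transformers to show the tokenize, remove stop words, and count flow. Only the content column name comes from the scenario; the sample rows are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, lower, regexp_replace
from pyspark.ml.feature import Tokenizer, StopWordsRemover

spark = SparkSession.builder.appName("ArticleWordCount").getOrCreate()

# Hypothetical sample articles standing in for the real DataFrame
articles = spark.createDataFrame(
    [("Breaking news about elections",), ("Elections results and analysis",)],
    ["content"],
)

# Normalize case and strip punctuation before tokenizing
cleaned = articles.withColumn(
    "content_clean", lower(regexp_replace(col("content"), r"[^a-zA-Z\s]", ""))
)

# Split each article into individual words
tokenizer = Tokenizer(inputCol="content_clean", outputCol="words")
tokenized = tokenizer.transform(cleaned)

# Drop common English stop words
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filtered = remover.transform(tokenized)

# Explode the word arrays and count occurrences of each unique word
word_counts = (
    filtered.select(explode(col("filtered")).alias("word"))
            .where(col("word") != "")
            .groupBy("word").count()
            .orderBy(col("count").desc())
)
word_counts.show()
```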
5. A car rental company wants to analyze how the rental price of their cars is influenced by
the number of miles driven. They have a dataset with two columns: miles_driven (number
of miles the car has been driven) and rental_price (price in dollars for renting the car). The
company’s goal is to predict the rental price for any given car based on its mileage. They
decide to use simple linear regression in Spark MLlib, which will help them model the
relationship between miles_driven and rental_price and make future predictions.-
ROHITH
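A minimal sketch of the regression with Spark MLlib (the pyspark.ml API), assuming a DataFrame with the miles_driven and rental_price columns from the scenario; the sample values are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("RentalPriceRegression").getOrCreate()

# Hypothetical training data: (miles_driven, rental_price)
data = spark.createDataFrame(
    [(10000.0, 55.0), (25000.0, 48.0), (40000.0, 41.0), (60000.0, 33.0)],
    ["miles_driven", "rental_price"],
)

# MLlib expects the features packed into a single vector column
assembler = VectorAssembler(inputCols=["miles_driven"], outputCol="features")
train_df = assembler.transform(data).select("features", "rental_price")

# Fit a simple linear model: rental_price = slope * miles_driven + intercept
lr = LinearRegression(featuresCol="features", labelCol="rental_price")
model = lr.fit(train_df)
print("slope:", model.coefficients[0], "intercept:", model.intercept)

# Predict the rental price for a car with 50,000 miles on it
new_car = assembler.transform(spark.createDataFrame([(50000.0,)], ["miles_driven"]))
model.transform(new_car).select("miles_driven", "prediction").show()
```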
7. A financial services company has applications handling high-value transactions, and they
need to ensure that any critical errors (e.g., connectivity issues, transaction failures) are
detected and addressed promptly. Their applications log errors and other messages in
real-time to log files stored on a Hadoop Distributed File System (HDFS). To maintain a
high level of service reliability, the IT team needs a solution that:
• Monitors the log files in real time.
• Identifies specific error messages, such as "TransactionFailure" or
"DatabaseConnectionError."
• Triggers an alert if these error messages occur frequently within a short time frame
(e.g., five instances in five minutes).
The team decides to use Apache Flume to collect logs in real time, Apache Spark
Streaming to process and analyze the logs, and Apache Kafka to manage alerts for
frequent errors.-KARTHIK SIRAM
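A sketch of the alerting logic using Spark Structured Streaming, assuming Flume has already landed the raw log lines as text files in an HDFS directory and that alerts are published to a Kafka topic named "alerts". Paths, topic names, broker addresses, and the watermark are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, to_json, struct, window, when

spark = SparkSession.builder.appName("ErrorAlerting").getOrCreate()

# Stream the log files that Flume writes into HDFS
logs = spark.readStream.text("hdfs://namenode:8020/logs/transactions/")

# Tag each line with the error type it mentions, with a processing-time timestamp
errors = (
    logs.withColumn(
            "error_type",
            when(col("value").contains("TransactionFailure"), "TransactionFailure")
            .when(col("value").contains("DatabaseConnectionError"), "DatabaseConnectionError"))
        .where(col("error_type").isNotNull())
        .withColumn("ts", current_timestamp())
)

# Count each error type over 5-minute windows and keep only windows with 5+ hits
alerts = (
    errors.withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "5 minutes"), col("error_type"))
          .count()
          .where(col("count") >= 5)
          .select(to_json(struct("window", "error_type", "count")).alias("value"))
)

# Publish qualifying windows to Kafka so downstream systems can raise alerts
query = (
    alerts.writeStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-broker:9092")
          .option("topic", "alerts")
          .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/alerts")
          .outputMode("update")
          .start()
)
query.awaitTermination()
```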
9. A manufacturing company uses IoT sensors on its machinery to monitor metrics such as
temperature, vibration, and pressure. These sensors generate large volumes of data in
real time, and the company wants to:
• Collect and store this data in a centralized repository for historical analytics.
• Monitor the sensor data in real time to detect potential machine failures or unusual
patterns that could indicate maintenance needs.
The company decides to use Apache Kafka to handle the high-frequency data ingestion,
Apache Spark Streaming to process the data in real time, and HDFS (Hadoop
Distributed File System) to store the data for further analytics.-JAFRIN NIDHA
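A sketch of the ingestion side with Spark Structured Streaming reading from Kafka; the topic name, broker address, JSON field names, temperature threshold, and HDFS paths are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("SensorPipeline").getOrCreate()

# Assumed JSON layout of a sensor reading
schema = (StructType()
          .add("machine_id", StringType())
          .add("temperature", DoubleType())
          .add("vibration", DoubleType())
          .add("pressure", DoubleType())
          .add("event_time", TimestampType()))

# Consume the high-frequency sensor stream from Kafka
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka-broker:9092")
            .option("subscribe", "machine-sensors")
            .load())

readings = (raw.select(from_json(col("value").cast("string"), schema).alias("r"))
               .select("r.*"))

# 1) Persist every reading to HDFS (Parquet) for historical analytics
to_hdfs = (readings.writeStream
                   .format("parquet")
                   .option("path", "hdfs://namenode:8020/data/sensors/")
                   .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/sensors/")
                   .outputMode("append")
                   .start())

# 2) Flag readings above an illustrative temperature threshold in real time
anomalies = readings.where(col("temperature") > 90.0)
to_console = anomalies.writeStream.format("console").outputMode("append").start()

spark.streams.awaitAnyTermination()
```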
10. A logistics company wants to optimize its package delivery network. They need a
database that can:
1. Store complex relationships between locations, hubs, and routes as a graph to easily
find the shortest or most efficient delivery routes.
2. Track packages as they move through various hubs and locations.
3. Store metadata about packages and delivery personnel in a flexible document format.
The company decides to use ArangoDB for its multi-model capabilities, as it can store
and manage the relationships between locations and packages, track package status
updates, and handle the flexible schema requirements for package metadata.-MENAKA
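A sketch of the multi-model setup using the python-arango driver (an assumption): locations as vertices, routes as edges of a named graph, and package metadata as schema-flexible documents. Database name, credentials, and the sample data are placeholders.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("logistics", username="root", password="password")

# Vertex, edge, and document collections
if not db.has_collection("locations"):
    db.create_collection("locations")
if not db.has_collection("routes"):
    db.create_collection("routes", edge=True)
if not db.has_collection("packages"):
    db.create_collection("packages")

# Graph tying locations together via routes
if not db.has_graph("delivery_graph"):
    db.create_graph("delivery_graph", edge_definitions=[{
        "edge_collection": "routes",
        "from_vertex_collections": ["locations"],
        "to_vertex_collections": ["locations"],
    }])

db.collection("locations").insert_many([
    {"_key": "chennai_hub", "type": "hub", "city": "Chennai"},
    {"_key": "bangalore_hub", "type": "hub", "city": "Bangalore"},
    {"_key": "mysore_store", "type": "store", "city": "Mysore"},
], overwrite=True)

db.collection("routes").insert_many([
    {"_from": "locations/chennai_hub", "_to": "locations/bangalore_hub", "distance_km": 350},
    {"_from": "locations/bangalore_hub", "_to": "locations/mysore_store", "distance_km": 145},
], overwrite=True)

# Package metadata stored as a flexible document (fields can vary per package)
db.collection("packages").insert({
    "tracking_id": "PKG1001",
    "current_location": "locations/chennai_hub",
    "status": "in_transit",
    "courier": {"name": "Ravi", "phone": "+91-9000000000"},
})

# Shortest delivery route between two locations via an AQL graph traversal
cursor = db.aql.execute("""
    FOR v IN OUTBOUND SHORTEST_PATH 'locations/chennai_hub'
        TO 'locations/mysore_store' GRAPH 'delivery_graph'
        RETURN v.city
""")
print(list(cursor))
```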
Question allocation:
1 - NAVIN
2 - SHANMITHAA
3 - SHRUTI
4 - LAYA
5 - ROHITH
6 - MADHUMITHA
7 - KARTHIK SIRAM
8 - PRANITHA
9 - JAFRIN NIDHA
10 - MENAKA