Big Data Technologies Lab - 5th Unit
1. A company wants to use Avro to store and transmit employee data across systems in a
compact, efficient binary format. The data needs to include basic employee information
such as name, age, position, and salary. Additionally, the company has offices in different
locations, so each employee record should include a nested address field with details
about the employee's location, such as city, state, and country. The schema should be
structured to allow for future expansion of fields while maintaining compatibility.-NAVIN
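A minimal sketch of how such a schema and a record write might look, assuming Python with the fastavro library; the namespace, the optional department field added to illustrate schema evolution, and the sample values are assumptions, not part of the brief.

```python
# Illustrative Avro schema with a nested address record, written with fastavro
# (the standard avro package works similarly). Field defaults on new optional
# fields are what keep older readers compatible as the schema evolves.
from fastavro import writer, parse_schema

employee_schema = {
    "type": "record",
    "name": "Employee",
    "namespace": "com.example.hr",   # assumed namespace
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "position", "type": "string"},
        {"name": "salary", "type": "double"},
        {"name": "address", "type": {
            "type": "record",
            "name": "Address",
            "fields": [
                {"name": "city", "type": "string"},
                {"name": "state", "type": "string"},
                {"name": "country", "type": "string"},
            ],
        }},
        # Hypothetical optional field with a default: adding fields this way
        # preserves backward/forward compatibility.
        {"name": "department", "type": ["null", "string"], "default": None},
    ],
}

parsed = parse_schema(employee_schema)
records = [{
    "name": "Asha", "age": 31, "position": "Engineer", "salary": 85000.0,
    "address": {"city": "Chennai", "state": "TN", "country": "India"},
    "department": None,
}]

with open("employees.avro", "wb") as out:
    writer(out, parsed, records)   # compact binary Avro container file
```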
2. An e-commerce company needs a system to collect, process, and store server logs
generated by their web application in real-time. They want these logs to be stored in HDFS
for long-term storage and further analysis using Hadoop tools like Apache Hive and
Apache Spark. Since the log data volume is high and grows continuously, they require a
reliable way to capture, buffer, and transport the data from the application server to
HDFS. Apache Flume is chosen for this purpose due to its reliability in streaming data
ingestion. It will read the logs generated in a local directory, process them, and write them
into HDFS in batches.-SHANMITHAA S
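A minimal sketch of what the Flume agent configuration might look like for this scenario. Flume is driven by a properties file rather than code; the agent name, spool directory, NameNode address, and batch/roll settings below are assumptions.

```properties
# Spooling-directory source -> memory channel -> HDFS sink (illustrative)
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: reads completed log files dropped into a local directory
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/log/webapp/spool
agent1.sources.src1.channels = ch1

# Channel: buffers events between source and sink
agent1.channels.ch1.type                = memory
agent1.channels.ch1.capacity            = 10000
agent1.channels.ch1.transactionCapacity = 1000

# Sink: writes events into HDFS in batches as plain text
agent1.sinks.sink1.type                   = hdfs
agent1.sinks.sink1.channel                = ch1
agent1.sinks.sink1.hdfs.path              = hdfs://namenode:8020/logs/webapp/%Y/%m/%d
agent1.sinks.sink1.hdfs.fileType          = DataStream
agent1.sinks.sink1.hdfs.writeFormat       = Text
agent1.sinks.sink1.hdfs.batchSize         = 1000
agent1.sinks.sink1.hdfs.rollInterval      = 300
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
```

The agent could then be started with `flume-ng agent --conf-file flume-hdfs.conf --name agent1` (the file name is assumed).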
3. An analytics team at a retail company needs to analyze customer data stored in a MySQL
database. They want to process this data in Hadoop to derive insights, such as customer
purchasing patterns and preferences. To perform these analyses efficiently, they plan to
import the MySQL data into Hadoop's HDFS using Apache Sqoop. The MySQL database
contains a customers table with fields such as customer_id, name, email, age, and city.
The data needs to be imported as text files into HDFS, where Hadoop and other
processing tools like Hive or Spark can access it for analysis.-SHRUTI
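One possible shape of the Sqoop import command for this scenario; the JDBC host, database name, credentials file, and target directory are placeholders.

```bash
# Import the customers table from MySQL into HDFS as comma-delimited text files
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/retaildb \
  --username analytics_user \
  --password-file /user/hadoop/.mysql_pass \
  --table customers \
  --columns "customer_id,name,email,age,city" \
  --target-dir /user/hadoop/customers \
  --as-textfile \
  --fields-terminated-by ',' \
  --num-mappers 4
```

`--num-mappers` controls how many parallel map tasks Sqoop uses to split the import, typically on the table's primary key.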
4. A news website wants to analyze the frequency of words in their article database to better
understand trending topics and keywords. The articles are stored in a Spark DataFrame
with a text column named content containing the full text of each article. The analytics
team needs to tokenize each article (split the text into individual words), filter out
common stop words, and then count the occurrence of each unique word. They plan to
use Spark NLP for the tokenization and word counting process.-LAYA K
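A word-frequency sketch in PySpark. The brief names Spark NLP; as a stand-in, this sketch uses the built-in pyspark.ml.feature transformers to show the tokenize, remove stop words, and count flow. Only the content column name comes from the scenario; the sample rows are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, lower, regexp_replace
from pyspark.ml.feature import Tokenizer, StopWordsRemover

spark = SparkSession.builder.appName("ArticleWordCount").getOrCreate()

# Hypothetical sample articles standing in for the real DataFrame
articles = spark.createDataFrame(
    [("Breaking news about elections",), ("Elections results and analysis",)],
    ["content"],
)

# Normalize case and strip punctuation before tokenizing
cleaned = articles.withColumn(
    "content_clean", lower(regexp_replace(col("content"), r"[^a-zA-Z\s]", ""))
)

# Split each article into individual words
tokenizer = Tokenizer(inputCol="content_clean", outputCol="words")
tokenized = tokenizer.transform(cleaned)

# Drop common English stop words
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filtered = remover.transform(tokenized)

# Explode the word arrays and count occurrences of each unique word
word_counts = (
    filtered.select(explode(col("filtered")).alias("word"))
            .where(col("word") != "")
            .groupBy("word").count()
            .orderBy(col("count").desc())
)
word_counts.show()
```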
5. A car rental company wants to analyze how the rental price of their cars is influenced by
the number of miles driven. They have a dataset with two columns: miles_driven (number
of miles the car has been driven) and rental_price (price in dollars for renting the car). The
company’s goal is to predict the rental price for any given car based on its mileage. They
decide to use simple linear regression in Spark MLlib, which will help them model the
relationship between miles_driven and rental_price and make future predictions.-
ROHITH
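A minimal sketch of the regression with Spark MLlib (the pyspark.ml API), assuming a DataFrame with the miles_driven and rental_price columns from the scenario; the sample values are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("RentalPriceRegression").getOrCreate()

# Hypothetical training data: (miles_driven, rental_price)
data = spark.createDataFrame(
    [(10000.0, 55.0), (25000.0, 48.0), (40000.0, 41.0), (60000.0, 33.0)],
    ["miles_driven", "rental_price"],
)

# MLlib expects the features packed into a single vector column
assembler = VectorAssembler(inputCols=["miles_driven"], outputCol="features")
train_df = assembler.transform(data).select("features", "rental_price")

# Fit a simple linear model: rental_price = slope * miles_driven + intercept
lr = LinearRegression(featuresCol="features", labelCol="rental_price")
model = lr.fit(train_df)
print("slope:", model.coefficients[0], "intercept:", model.intercept)

# Predict the rental price for a car with 50,000 miles on it
new_car = assembler.transform(spark.createDataFrame([(50000.0,)], ["miles_driven"]))
model.transform(new_car).select("miles_driven", "prediction").show()
```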
7. A financial services company has applications handling high-value transactions, and they
need to ensure that any critical errors (e.g., connectivity issues, transaction failures) are
detected and addressed promptly. Their applications log errors and other messages in
real-time to log files stored on a Hadoop Distributed File System (HDFS). To maintain a
high level of service reliability, the IT team needs a solution that:
• Monitors the log files in real time.
• Identifies specific error messages, such as "TransactionFailure" or
"DatabaseConnectionError."
• Triggers an alert if these error messages occur frequently within a short time frame
(e.g., five instances in five minutes).
The team decides to use Apache Flume to collect logs in real time, Apache Spark
Streaming to process and analyze the logs, and Apache Kafka to manage alerts for
frequent errors.-KARTHIK SIRAM
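A sketch of the alerting logic using Spark Structured Streaming, assuming Flume has already landed the raw log lines as text files in an HDFS directory and that alerts are published to a Kafka topic named "alerts". Paths, topic names, broker addresses, and the watermark are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, to_json, struct, window, when

spark = SparkSession.builder.appName("ErrorAlerting").getOrCreate()

# Stream the log files that Flume writes into HDFS
logs = spark.readStream.text("hdfs://namenode:8020/logs/transactions/")

# Tag each line with the error type it mentions, with a processing-time timestamp
errors = (
    logs.withColumn(
            "error_type",
            when(col("value").contains("TransactionFailure"), "TransactionFailure")
            .when(col("value").contains("DatabaseConnectionError"), "DatabaseConnectionError"))
        .where(col("error_type").isNotNull())
        .withColumn("ts", current_timestamp())
)

# Count each error type over 5-minute windows and keep only windows with 5+ hits
alerts = (
    errors.withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "5 minutes"), col("error_type"))
          .count()
          .where(col("count") >= 5)
          .select(to_json(struct("window", "error_type", "count")).alias("value"))
)

# Publish qualifying windows to Kafka so downstream systems can raise alerts
query = (
    alerts.writeStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-broker:9092")
          .option("topic", "alerts")
          .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/alerts")
          .outputMode("update")
          .start()
)
query.awaitTermination()
```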
9. A manufacturing company uses IoT sensors on its machinery to monitor metrics such as
temperature, vibration, and pressure. These sensors generate large volumes of data in
real time, and the company wants to:
• Collect and store this data in a centralized repository for historical analytics.
• Monitor the sensor data in real time to detect potential machine failures or unusual
patterns that could indicate maintenance needs.
The company decides to use Apache Kafka to handle the high-frequency data ingestion,
Apache Spark Streaming to process the data in real time, and HDFS (Hadoop
Distributed File System) to store the data for further analytics.-JAFRIN NIDHA
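A sketch of the ingestion side with Spark Structured Streaming reading from Kafka; the topic name, broker address, JSON field names, temperature threshold, and HDFS paths are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("SensorPipeline").getOrCreate()

# Assumed JSON layout of a sensor reading
schema = (StructType()
          .add("machine_id", StringType())
          .add("temperature", DoubleType())
          .add("vibration", DoubleType())
          .add("pressure", DoubleType())
          .add("event_time", TimestampType()))

# Consume the high-frequency sensor stream from Kafka
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka-broker:9092")
            .option("subscribe", "machine-sensors")
            .load())

readings = (raw.select(from_json(col("value").cast("string"), schema).alias("r"))
               .select("r.*"))

# 1) Persist every reading to HDFS (Parquet) for historical analytics
to_hdfs = (readings.writeStream
                   .format("parquet")
                   .option("path", "hdfs://namenode:8020/data/sensors/")
                   .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/sensors/")
                   .outputMode("append")
                   .start())

# 2) Flag readings above an illustrative temperature threshold in real time
anomalies = readings.where(col("temperature") > 90.0)
to_console = anomalies.writeStream.format("console").outputMode("append").start()

spark.streams.awaitAnyTermination()
```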
10. A logistics company wants to optimize its package delivery network. They need a
database that can:
1. Store complex relationships between locations, hubs, and routes as a graph to easily
find the shortest or most efficient delivery routes.
2. Track packages as they move through various hubs and locations.
3. Store metadata about packages and delivery personnel in a flexible document format.
The company decides to use ArangoDB for its multi-model capabilities, as it can store
and manage the relationships between locations and packages, track package status
updates, and handle the flexible schema requirements for package metadata.-MENAKA
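A sketch of the multi-model setup using the python-arango driver (an assumption): locations as vertices, routes as edges of a named graph, and package metadata as schema-flexible documents. Database name, credentials, and the sample data are placeholders.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("logistics", username="root", password="password")

# Vertex, edge, and document collections
if not db.has_collection("locations"):
    db.create_collection("locations")
if not db.has_collection("routes"):
    db.create_collection("routes", edge=True)
if not db.has_collection("packages"):
    db.create_collection("packages")

# Graph tying locations together via routes
if not db.has_graph("delivery_graph"):
    db.create_graph("delivery_graph", edge_definitions=[{
        "edge_collection": "routes",
        "from_vertex_collections": ["locations"],
        "to_vertex_collections": ["locations"],
    }])

db.collection("locations").insert_many([
    {"_key": "chennai_hub", "type": "hub", "city": "Chennai"},
    {"_key": "bangalore_hub", "type": "hub", "city": "Bangalore"},
    {"_key": "mysore_store", "type": "store", "city": "Mysore"},
], overwrite=True)

db.collection("routes").insert_many([
    {"_from": "locations/chennai_hub", "_to": "locations/bangalore_hub", "distance_km": 350},
    {"_from": "locations/bangalore_hub", "_to": "locations/mysore_store", "distance_km": 145},
], overwrite=True)

# Package metadata stored as a flexible document (fields can vary per package)
db.collection("packages").insert({
    "tracking_id": "PKG1001",
    "current_location": "locations/chennai_hub",
    "status": "in_transit",
    "courier": {"name": "Ravi", "phone": "+91-9000000000"},
})

# Shortest delivery route between two locations via an AQL graph traversal
cursor = db.aql.execute("""
    FOR v IN OUTBOUND SHORTEST_PATH 'locations/chennai_hub'
        TO 'locations/mysore_store' GRAPH 'delivery_graph'
        RETURN v.city
""")
print(list(cursor))
```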
Question allocation:
1 - NAVIN
2 - SHANMITHAA
3 - SHRUTI
4 - LAYA
5 - ROHITH
6 - MADHUMITHA
7 - KARTHIK SIRAM
8 - PRANITHA
9 - JAFRIN NIDHA
10 - MENAKA