Week - 5
a. Incorrect - MLlib:
This is Spark’s machine learning library, not specifically for graph processing.
b. Correct - GraphX
c. Incorrect: This option is used for processing real-time data streams, not for graph processing.
d. ALL:
2. Which of the following frameworks is best suited for fast, in-memory data processing
and supports advanced analytics such as machine learning and graph processing?
a) Apache Hadoop MapReduce
b) Apache Flink
c) Apache Storm
d) Apache Spark
b) Apache Flink
● Incorrect: Apache Flink is excellent for real-time stream processing and supports
complex event processing. However, it is primarily optimized for stream
processing rather than providing a comprehensive suite for batch processing,
machine learning, and graph analytics as Spark does.
c) Apache Storm
d) Apache Spark
3. A financial institution needs to analyze historical stock market data to predict market
trends and make investment decisions. Which Big Data processing framework is best
suited for this scenario?
a. Apache Spark
b. Apache Storm
c. Hadoop MapReduce
d. Apache Flume
Explanation: Apache Spark is well-suited for analyzing historical data due to its fast
in-memory processing capabilities. It supports a variety of data analytics tasks and is
ideal for predictive analytics and machine learning.
b. Incorrect - Apache Storm: This is for real-time stream processing, which is less
suited to historical data analysis.
c. Incorrect - Hadoop MapReduce: While it can handle large-scale data
processing, it is slower than Spark for iterative algorithms used in predictive
modeling.
d. Incorrect - Apache Flume: It is primarily used for data ingestion, not analysis.
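The point about iterative algorithms can be pictured with a toy, plain-Python sketch. This is not Spark or MapReduce code: load_dataset, refine, iterate_without_cache, and iterate_with_cache are hypothetical names, the prices are made-up sample values, and the "disk read" is only a counter. It merely illustrates, under those assumptions, why re-reading the input on every pass costs more than keeping it in memory across iterations.

```python
# Toy model: count how often the input is "read from disk" during an
# iterative computation. Every name and value here is a hypothetical
# illustration, not part of Spark or Hadoop.

reads = {"count": 0}


def load_dataset():
    """Pretend to read closing prices from disk; each call counts as one read."""
    reads["count"] += 1
    return [100.0, 102.0, 101.0, 105.0, 107.0]


def refine(estimate, prices, rate=0.5):
    """One iteration: move the estimate part of the way toward the mean price."""
    mean = sum(prices) / len(prices)
    return estimate + rate * (mean - estimate)


def iterate_without_cache(steps):
    """Disk-based style: every iteration loads the input again."""
    estimate = 0.0
    for _ in range(steps):
        estimate = refine(estimate, load_dataset())
    return estimate


def iterate_with_cache(steps):
    """In-memory style: load once, then reuse the same data each iteration."""
    prices = load_dataset()
    estimate = 0.0
    for _ in range(steps):
        estimate = refine(estimate, prices)
    return estimate
```

Both functions return the same estimate; only the number of simulated reads differs (one per iteration versus one in total).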
4. A telecommunications company needs to process real-time call logs from millions of
subscribers to detect network anomalies. Which combination of Big Data tools would be
appropriate for this use case?
a. Apache Hadoop and Apache Pig
b. Apache Kafka and Apache HBase
c. Apache Spark and Apache Hive
d. Apache Storm and Apache Pig
a. Incorrect - Apache Hadoop and Apache Pig: Hadoop is designed for batch
processing, and Pig is used for ETL tasks in batch mode. This combination is not
suited for real-time processing.
c. Incorrect - Apache Spark and Apache Hive: While Spark can handle real-time
processing (with Structured Streaming) and is powerful for analytics, Hive is
more oriented towards batch processing and querying rather than real-time
analytics.
d. Incorrect - Apache Storm and Apache Pig: Storm is good for real-time stream
processing, but Pig is used for batch processing and ETL tasks. Combining
Storm with Pig does not align with the need for real-time analytics and storage.
b. Incorrect - Compaction: Kafka supports log compaction, but it is not a substitute
for log aggregation.
c. Incorrect - Collection: Kafka is used for collecting logs, but the term 'substitute'
refers more directly to log aggregation.
d. Incorrect - All of the mentioned: Not all options are directly applicable.
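To see why compaction is a different thing from aggregation, here is a minimal sketch of what compaction does conceptually: for each key, only the most recent record is kept. This is a plain-Python illustration, not Kafka's implementation; the compact function and the sample records are hypothetical, and details such as tombstones and log segments are ignored. Aggregation, by contrast, is about gathering logs from many sources into one place.

```python
def compact(log):
    """Keep only the most recent (key, value) record per key.

    `log` is a list of (key, value) tuples in append order; the list index
    plays the role of the offset. Surviving records keep their relative order.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        # A later record for the same key replaces the earlier one.
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


# Hypothetical sample log: key "a" and key "b" are each updated once.
sample_log = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
compacted = compact(sample_log)
```

With the sample log above, the older records for "a" and "b" are dropped and the three remaining records stay in offset order.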
a. Incorrect - DAG (Directed Acyclic Graph): While the DAG is essential for
understanding how Spark schedules tasks, it is the lineage information
specifically that ensures fault tolerance.
c. Incorrect - Lazy evaluation: This optimizes execution and resource usage but
does not specifically address fault tolerance.
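The difference between lazy evaluation and lineage-based recovery can be sketched in plain Python. This is a toy model, not Spark's RDD implementation: ToyDataset and lose_partition are hypothetical names introduced only for illustration. Each dataset records its parent and the transformation that produced it (its lineage), computes nothing until collect is called (lazy evaluation), and can rebuild lost results by replaying that lineage.

```python
class ToyDataset:
    """A dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data, set only on the root dataset
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's rows
        self._cached = None         # computed rows, filled in on demand

    def map(self, fn):
        # Lazy: only record the step; nothing is computed here.
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        # Lazy: only record the step; nothing is computed here.
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Compute on demand by walking the lineage back toward the source.
        if self._cached is None:
            if self.parent is None:
                self._cached = list(self.source)
            else:
                self._cached = self.transform(self.parent.collect())
        return list(self._cached)

    def lose_partition(self):
        """Simulate losing the computed rows, as if a worker node failed."""
        self._cached = None


base = ToyDataset(source=[1, 2, 3, 4])
derived = base.map(lambda x: x * 10).filter(lambda x: x > 10)
first = derived.collect()    # triggers the actual computation
derived.lose_partition()     # computed rows are gone
second = derived.collect()   # rebuilt by replaying the recorded lineage
```

In this sketch, laziness only decides when the work happens; it is the recorded parent and transformation that make the recomputation after the simulated failure possible.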
a. Incorrect - Pig Latin scripts are compiled into HiveQL for execution: Pig compiles
Pig Latin scripts into MapReduce jobs, not into HiveQL.
a. Incorrect - Apache HBase for storing student data and Apache Pig for
processing: While HBase is a NoSQL database suitable for real-time read/write
access, Pig is used for ETL tasks and batch processing, which may not be as
efficient as Spark for complex analytics and recommendations.
b. Incorrect - Apache Kafka for data streaming and Apache Storm for real-time
analytics: Kafka is used for streaming data, and Storm is for real-time analytics.
This combination is better suited to real-time data processing than to batch
analytics and recommendation generation.
c. Incorrect - Hadoop MapReduce for batch processing and Apache Hive for
querying: While MapReduce and Hive are both part of the Hadoop ecosystem,
MapReduce is less efficient for iterative processing compared to Spark. Hive is
used for querying but is more oriented towards batch processing rather than
real-time analytics and personalized recommendations.
d. Correct - Apache Spark for data processing and Apache Hadoop for
storage.
● Apache Spark for data processing: Spark is a powerful and versatile data
processing engine that supports complex analytics, machine learning, and
iterative algorithms. It is well-suited for analyzing large datasets and generating
recommendations. Spark's in-memory computation capabilities provide high
performance for such tasks.
● Apache Hadoop for storage: Hadoop's HDFS (Hadoop Distributed File System)
is a scalable and reliable storage system designed for storing large volumes of
data across a distributed cluster. It is ideal for storing the large datasets of
student performance data.
10. A company is analyzing customer behavior across multiple channels (web, mobile
app, social media) to personalize marketing campaigns. Which technology is best suited
to handle this type of data processing?
a. Hadoop MapReduce
b. Apache Kafka
c. Apache Spark
d. Apache Hive
b. Incorrect - Apache Kafka: Primarily used for message streaming, not data
processing.
d. Incorrect - Apache Hive: Used for querying and not for complex data
processing and analytics.