Learn 2
For a healthcare use case scenario, how do the volume, velocity, and variety of data necessitate the
adoption of big data analytics techniques?
The healthcare sector generates vast and complex data, making traditional data management methods inadequate.
The adoption of big data analytics is critical to process, store, and derive actionable insights from this data efficiently,
especially due to its volume, velocity, and variety.
1. Volume
• Healthcare data includes electronic health records (EHRs), medical imaging, genomic data, clinical trials, and
insurance claims.
• Massive data sources such as IoT-enabled devices (wearables, sensors) and public health data contribute
significantly to the volume.
• Big data platforms like Hadoop and cloud storage systems help handle and store this massive amount of
structured and unstructured data.
2. Velocity
• Real-time data from patient monitoring systems, wearables, and emergency response systems requires
immediate processing to ensure timely interventions.
• For example, heart rate or blood pressure data from a wearable device must be analyzed instantly to alert
healthcare providers in case of abnormalities.
• Big data tools like Apache Kafka and Spark Streaming enable real-time data processing and decision-making.
3. Variety
• Healthcare data is highly diverse, encompassing structured data (EHRs, laboratory results), semi-structured
data (sensor logs, XML), and unstructured data (physician notes, radiology images, video recordings).
• Advanced analytics techniques, including natural language processing (NLP) and image recognition, are
employed to extract meaningful insights from varied data formats.
Design a MapReduce algorithm to calculate the total number of occurrences of various words in a
dataset using the map, shuffle, combiner, and reduce phases. Assume that the dataset consists of a single
large file, and your task is to efficiently count the occurrences of each word across the document.
1. Map Phase
• Input: Splits of the large file (key: line number, value: text of the line).
• Process:
o Split each line into individual words using a delimiter (e.g., spaces, punctuation).
o Emit intermediate key-value pairs in the form of (word, 1) for each word in the line.
• Output: List of key-value pairs (word, 1).
2. Combiner Phase (Optional)
• Input: Intermediate (word, 1) pairs produced by each mapper.
• Process:
o Locally sum the counts for each word on the mapper node before the data is sent across the network.
• Output: Partially aggregated pairs (word, partial_count), which reduces network traffic during the shuffle.
3. Shuffle and Sort Phase
• Input: (word, count) pairs from all mappers (or combiners).
• Process:
o Group all values belonging to the same key, sort the keys, and route each key to a single reducer.
• Output: Key-value pairs of the form (word, [count1, count2, ...]).
4. Reduce Phase
• Input: Key-value pairs (word, [1, 1, ..., 1]) or (word, [partial counts]).
• Process:
o Sum the values for each key to calculate the total count.
o Emit final key-value pairs in the form of (word, total_count).
• Output: Final word count results, e.g., (word, total_count) for all words.
Example Walkthrough
Dataset:
Input Text: "cat dog cat mouse dog"
Map Phase Output:
(cat, 1), (dog, 1), (cat, 1), (mouse, 1), (dog, 1)
Combiner Phase Output (Optional):
(cat, 2), (dog, 2), (mouse, 1)
Shuffle and Sort Phase Output (grouping the combiner output):
(cat, [2]), (dog, [2]), (mouse, [1])
Reduce Phase Output:
(cat, 2), (dog, 2), (mouse, 1)
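To make the walkthrough concrete, here is a minimal local sketch in Scala that mimics the phases with plain collections (in a real MapReduce job the phases run distributed across mapper and reducer nodes; the object and variable names are illustrative):
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val input = "cat dog cat mouse dog"
    // Map phase: emit (word, 1) for every word in the input
    val mapped = input.split("\\s+").map(word => (word, 1))
    // Combiner + shuffle/sort: group the pairs by key (here done locally)
    val shuffled = mapped.groupBy(_._1)
    // Reduce phase: sum the counts for each word
    val counts = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    counts.foreach(println)  // (cat,2), (dog,2), (mouse,1)
  }
}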
Describe the role of visualization libraries and tools in creating insightful visualizations for big data
analysis.
Visualization libraries and tools play a crucial role in making sense of large and complex datasets by transforming raw
data into accessible, interpretable visual formats. They help to uncover patterns, trends, and outliers that would be
difficult to detect otherwise.
Key Roles:
1. Simplifying Complexity: They convert massive datasets into intuitive visual representations like graphs,
charts, and heatmaps, making it easier to identify patterns and trends.
2. Real-Time Insights: Many visualization tools support real-time data streaming, allowing for dynamic updates
and immediate insights from big data.
3. Interactive Exploration: Tools like Tableau, D3.js, and Power BI allow users to interact with data, drill down
into specific points, and explore relationships between variables.
4. Enhanced Decision-Making: Visualizations help decision-makers quickly grasp key insights, leading to more
informed decisions in business, healthcare, and other sectors.
5. Storytelling with Data: They enable data storytelling by visually representing insights in a clear, compelling
way to communicate findings to stakeholders.
Visualization tools, such as Tableau, D3.js, and Matplotlib, help bridge the gap between raw data and actionable
insights, improving the efficiency of big data analysis and fostering data-driven decision-making.
RDD (Resilient Distributed Dataset): RDD is the fundamental data structure in Apache Spark. It is an immutable,
distributed collection of objects that can be processed in parallel across a cluster. RDDs are fault-tolerant, meaning
that if a partition of an RDD is lost, it can be recomputed from the original data.
Role in Spark:
• RDDs allow distributed data processing by enabling parallel operations like map(), filter(), and reduce().
• They offer fault tolerance and scalability, making them ideal for handling large-scale data.
• RDDs are the building blocks for Spark transformations and actions, serving as the primary abstraction for
handling data.
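A brief hedged sketch of these operations, assuming an existing SparkContext named sc (for example, the one provided by spark-shell):
val numbers = sc.parallelize(1 to 10)   // distributed collection (RDD)
val doubled = numbers.map(_ * 2)        // transformation
val bigOnes = doubled.filter(_ > 10)    // transformation
val total = bigOnes.reduce(_ + _)       // action: triggers the distributed computation
println(total)                          // 12 + 14 + 16 + 18 + 20 = 80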
How does Spark Streaming facilitate the processing of continuous streams of data, and what are its
advantages over batch processing frameworks like MapReduce?
Spark Streaming: Spark Streaming processes real-time data by breaking it into small batches of data (micro-batches),
which are then processed using the same Spark engine used for batch processing. This allows continuous data
streams (like sensor data, logs, etc.) to be processed in near real-time.
Advantages over Batch Processing (MapReduce):
• Real-Time Processing: Unlike MapReduce, which processes data in large, discrete batches, Spark Streaming
processes data as it arrives, enabling real-time analysis.
• Lower Latency: Spark Streaming reduces the time between data arrival and its processing, offering lower
latency compared to batch processing.
• Ease of Use: Spark Streaming leverages the same APIs used in batch processing (like RDDs), making it easier
for developers to write continuous data processing applications.
• Fault Tolerance: Spark Streaming provides the same fault tolerance as Spark through data replication and
checkpointing mechanisms, ensuring reliable data processing.
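A minimal hedged sketch of a Spark Streaming (DStream) word count over a socket source; the host, port, and 5-second batch interval are illustrative assumptions:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // continuous stream of text lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                        // emitted once per micro-batch
ssc.start()
ssc.awaitTermination()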
Data storage and access in a distributed file system (HDFS):
• Data Ingestion: Data is initially uploaded to the distributed file system (e.g., HDFS). It is split into smaller blocks
and stored across multiple nodes in the cluster.
• Data Distribution: Each block is replicated (typically 3 copies) across different nodes to ensure fault tolerance.
• Data Access: When a user or application requests data, the system retrieves the required block from the node
where it resides, ensuring high availability.
• Fault Tolerance: If a node fails, the system uses the replicated blocks on other nodes to ensure data availability,
without data loss.
• Data Processing: Distributed processing frameworks like Hadoop or Spark access and process these blocks in
parallel, improving speed and scalability.
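As a hedged illustration of how a processing framework reads these distributed blocks in parallel, the Spark sketch below assumes an existing SparkContext sc; the HDFS paths are placeholders:
val logs = sc.textFile("hdfs://namenode:9000/data/events.log")   // roughly one partition per HDFS block
println(logs.getNumPartitions)                                   // reflects how the blocks were split
logs.saveAsTextFile("hdfs://namenode:9000/output/events_copy")   // written back as replicated blocks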
Common RDD transformations with examples:
1. map(): Applies a function to each element in the RDD and returns a new RDD.
o Example: rdd.map(x => x * 2) will multiply each element by 2.
2. filter(): Returns a new RDD containing only the elements that satisfy a predicate function.
o Example: rdd.filter(x => x > 5) keeps only the elements greater than 5.
3. flatMap(): Similar to map(), but can return multiple output elements for each input element.
o Example: rdd.flatMap(x => x.split(" ")) splits each string into words.
4. reduceByKey(): Aggregates data by key using a given function.
o Example: rdd.reduceByKey((a, b) => a + b) sums the values of each key.
5. groupByKey(): Groups data by key, returning a pair of key and a collection of associated values.
o Example: rdd.groupByKey() groups data by their keys, but does not perform any aggregation.
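A short hedged sketch chaining these transformations on sample data (assumes an existing SparkContext sc), which also highlights that reduceByKey aggregates while groupByKey only groups:
val lines = sc.parallelize(Seq("cat dog", "cat mouse"))
val words = lines.flatMap(_.split(" "))        // flatMap: one line -> many words
val pairs = words.map((_, 1))                  // map: word -> (word, 1)
val summed = pairs.reduceByKey(_ + _)          // reduceByKey: (cat,2), (dog,1), (mouse,1)
val grouped = pairs.groupByKey()               // groupByKey: (cat,[1,1]), (dog,[1]), (mouse,[1])
val shortWords = words.filter(_.length == 3)   // filter: keeps cat, dog (drops mouse)
summed.collect().foreach(println)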
Spark RDD persistence (storage) levels:
1. MEMORY_ONLY: Stores data only in memory as deserialized objects. Fastest but may not fit large data in
memory.
2. MEMORY_AND_DISK: Stores data in memory, spilling to disk if necessary. Slower than MEMORY_ONLY but more
fault-tolerant.
3. MEMORY_ONLY_SER: Stores data in memory as serialized objects. More space-efficient but slower than
deserialized storage.
4. DISK_ONLY: Stores data only on disk. Suitable for very large datasets but involves slower I/O operations.
5. MEMORY_AND_DISK_SER: Similar to MEMORY_AND_DISK, but stores data as serialized objects in memory and
on disk, offering better space efficiency with additional processing cost.
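A hedged sketch of selecting a storage level explicitly for an RDD (the HDFS path is a placeholder and an existing SparkContext sc is assumed):
import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs://namenode:9000/data/large.txt")
data.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilling to disk if needed
data.count()                                    // the first action materializes the persisted data
data.unpersist()                                // release the storage when it is no longer needed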
Compare and contrast Pig Latin with SQL in terms of their syntax, capabilities, and performance for data
processing tasks.
1. Syntax
• Pig Latin:
o Designed for data flow processing, Pig Latin uses a procedural style.
o Statements are sequential and describe the series of steps needed to process data.
o Example:
data = LOAD 'data.txt' AS (name:chararray, age:int);
filtered = FILTER data BY age > 30;
grouped = GROUP filtered BY name;
• SQL:
o SQL is declarative and uses a set-based approach, focusing on what the result should look like, not
how to compute it.
o Example:
SELECT name, AVG(age) FROM data WHERE age > 30 GROUP BY name;
2. Capabilities
• Pig Latin:
o Data Flow Model: Supports complex data transformations, often involving multiple steps.
o Flexibility: Can handle semi-structured data (like logs, JSON, etc.), making it ideal for ETL processes.
o User-Defined Functions (UDFs): Extends functionality through custom Java functions.
o Complex transformations: Easier for tasks like joins, grouping, filtering, and aggregating across large
datasets.
• SQL:
o Relational Query Language: Focuses on structured data stored in relational databases.
o Set-based Operations: Primarily used for operations on tables and requires data to be organized in
structured formats (like relational schemas).
o Standardized: Has standardized syntax, making it universally applicable to relational databases.
o Limited to relational data: Not as flexible with semi-structured or unstructured data.
3. Performance
• Pig Latin:
o Optimized for Large-Scale Data: Pig can optimize queries at runtime using the MapReduce model,
though it’s less efficient for simple queries compared to SQL.
o Latency: Due to its procedural nature, Pig might have higher latency for certain tasks that SQL could
handle more efficiently.
o Parallel Execution: Stronger in handling complex transformations over large datasets in parallel.
• SQL:
o Optimized Query Execution: SQL engines (like Apache Hive on Hadoop or PostgreSQL) have highly
optimized query planners and indexes.
o Faster for simple queries: SQL can be faster for simple data retrieval, aggregation, and joins due to
its declarative nature and query optimizers.
o Limited for large-scale distributed processing: While SQL can scale with databases, it often doesn't
handle large, unstructured datasets as efficiently as Pig or other Hadoop-based tools.
Apply the transformation function in Scala to count the occurrences of each word in a text file. Given that
file path is stored in variable path_textFile. Design appropriate code for the below statements.
val textFile = sc.textFile(path_textFile)  // assumes an existing SparkContext `sc` (e.g., in spark-shell)
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Design a Kafka streaming architecture for Uber's ride-booking platform to support real-time data
processing and analytics. Discuss the key components, data flow, and use cases within the Kafka
ecosystem, considering the challenges and requirements specific to Uber's operations. (Not in syllabus.)
Key Components:
1. Producers:
o Uber's mobile apps and backend services (ride requests, user activities, GPS data, etc.) act as Kafka
producers.
o These components send real-time data (e.g., user ride requests, location updates, driver status) to
Kafka topics.
2. Kafka Brokers:
o Kafka brokers handle the ingestion and distribution of incoming data across multiple partitions for
scalability and fault tolerance.
o Uber can have multiple Kafka brokers to handle high traffic loads and ensure high availability.
3. Kafka Topics:
o Topics represent different data streams (e.g., "ride_requests", "driver_status", "trip_updates").
o Each topic can have multiple partitions to ensure parallel processing and fault tolerance.
4. Kafka Consumers:
o Uber’s analytics platforms, real-time monitoring systems, and microservices act as consumers,
subscribing to relevant topics (e.g., real-time analytics for surge pricing, driver availability, and ride
matching).
o Consumers process data for specific use cases such as trip status tracking, dynamic pricing, or
customer notifications.
5. Kafka Streams:
o For real-time processing, Kafka Streams API can be used to analyze streaming data, such as
computing real-time ride availability, monitoring traffic conditions, or generating reports on driver
performance.
6. Sink Systems:
o Processed data is pushed to databases, data lakes, or external storage systems for further analytics
and reporting (e.g., HDFS, relational databases, or NoSQL stores).
Data Flow:
1. Ride Requests:
o When a user requests a ride, the mobile app (producer) sends a message to the "ride_requests"
Kafka topic.
o Kafka brokers store and distribute the messages to consumers.
2. Driver Updates:
o Drivers update their status (available, busy, offline) via mobile app, and the data is sent to the
"driver_status" Kafka topic.
3. Real-Time Processing:
o Kafka consumers, such as real-time analytics engines or microservices, process ride requests and
driver status updates. For example, Kafka Streams can calculate the nearest available drivers,
dynamically adjust pricing based on traffic data, and match riders with drivers.
4. Analytics and Monitoring:
o Data flows to analytics platforms, where Uber's business intelligence teams can run batch or real-
time analytics on ride patterns, driver efficiency, user behavior, and system performance.
Use Cases:
1. Real-Time Ride Matching:
o Kafka helps Uber match riders with available drivers in real-time by processing incoming ride
requests and driver statuses.
2. Dynamic Pricing (Surge Pricing):
o Kafka processes real-time traffic data and demand-supply balance to implement surge pricing
algorithms.
3. Driver and Rider Notifications:
o Kafka enables immediate notifications to riders (ride confirmations, driver arrival) and drivers (new
ride requests, updates).
4. Fraud Detection:
o Real-time processing of patterns from ride requests and payment data can help identify fraudulent
activities.
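As a hedged sketch of the producer side, the Scala snippet below publishes a ride-request event to the "ride_requests" topic using the standard Kafka client; the broker address and JSON payload shape are illustrative assumptions, not Uber's actual schema:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RideRequestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val payload = """{"riderId":"r-42","lat":12.97,"lon":77.59,"ts":1700000000}"""  // hypothetical event
    // Keying by rider ID keeps events for the same rider in the same partition (per-rider ordering)
    producer.send(new ProducerRecord[String, String]("ride_requests", "r-42", payload))
    producer.close()
  }
}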
Design a Pig script to process a large log file containing user activity data. The script should extract
relevant information such as user IDs, timestamps, and actions performed, and perform aggregation
tasks like counting the number of actions per user; at the end, also store the aggregated results.
To process a large log file containing user activity data using Apache Pig, you need to design a Pig script that performs
the following tasks:
1. Extract relevant data: Parse the log file to extract user IDs, timestamps, and actions performed.
2. Aggregate data: Count the number of actions per user.
3. Store results: Save the aggregated results to a desired output location (e.g., HDFS).
Here is a step-by-step Pig script that achieves this:
-- Load the log file (assuming a comma-delimited format with user_id, timestamp, and action columns)
logs = LOAD 'hdfs://path_to_log_file/user_activity.log' USING PigStorage(',') AS
(user_id:chararray, timestamp:chararray, action:chararray);
-- Filter out any empty or malformed records (optional step based on your data)
filtered_logs = FILTER logs BY user_id IS NOT NULL AND action IS NOT NULL;
-- Group by user and count the number of actions performed by each user
grouped_logs = GROUP filtered_logs BY user_id;
action_counts = FOREACH grouped_logs GENERATE group AS user_id, COUNT(filtered_logs) AS action_count;
-- Store the aggregated results (output path is a placeholder)
STORE action_counts INTO 'hdfs://path_to_output/user_action_counts' USING PigStorage(',');
3. Feedback Collection
{
"_id": "feedback_id", // Unique identifier for the feedback
"product_id": "string", // Reference to the associated product
"customer_id": "string", // ID of the customer who gave the feedback
"rating": "number",
"comment": "string"
}
MongoDB Commands
i. Find products with price less than 5000:
db.product.find({ price: { $lt: 5000 } })
ii. Drop the material collection:
db.material.drop()
Discuss the ML pipeline in text processing for large-scale datasets with a suitable example.
An ML pipeline for text processing involves a sequence of stages to preprocess text data, extract features, train
models, and evaluate results. It ensures scalability and efficiency for large datasets.
Key Stages:
1. Data Ingestion: Load large-scale text datasets from distributed storage (e.g., HDFS, S3).
2. Text Preprocessing:
o Tokenization: Split text into words or tokens.
o Stopword Removal: Eliminate common, non-informative words.
o Normalization: Convert text to lowercase, remove punctuation.
3. Feature Extraction:
o TF-IDF (Term Frequency-Inverse Document Frequency).
o Word Embeddings (e.g., Word2Vec, GloVe).
4. Model Training:
o Use scalable ML algorithms like Logistic Regression, SVM, or Neural Networks (e.g., LSTMs for NLP
tasks).
5. Evaluation:
o Assess model performance using metrics like accuracy or F1-score.
Example:
For sentiment analysis of customer reviews:
• Input: A dataset with review texts and labels (positive/negative).
• Process:
o Preprocess text (tokenize, normalize, remove stopwords).
o Extract features using TF-IDF.
o Train a Logistic Regression model on distributed systems (e.g., Spark MLlib).
• Output: Predict sentiment for new reviews.
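A hedged sketch of this sentiment-analysis pipeline using Spark MLlib; it assumes a DataFrame named reviews with columns "text" and "label" has already been loaded:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val tf = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf, lr))
val model = pipeline.fit(reviews)           // train on the labelled reviews
val predictions = model.transform(reviews)  // predict sentiment for (new) reviews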
Page Rank Question
Find page rank of all web pages
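The specific web graph for this question is not reproduced here; as a reference, the standard iterative PageRank formula with damping factor d (typically 0.85) is:
PR(p) = \frac{1 - d}{N} + d \sum_{q \in In(p)} \frac{PR(q)}{L(q)}
where N is the total number of pages, In(p) is the set of pages linking to p, and L(q) is the number of outbound links of q; ranks are initialized to 1/N and the formula is iterated until the values converge.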
How many data nodes would be the minimum requirement for allocating these files to the Hadoop
system? Draw the allocation of blocks in the appropriate data nodes.
The minimum number of data nodes required is 3, based on the maximum replication factor of File3 (3 replicas). The
block allocation is as follows:
File Name Blocks Replication Factor Allocation
File1 [1, 2, 3] 1 DN1: [1], DN2: [2], DN3: [3]
File2 [4, 5, 6] 2 DN1: [4, 5], DN2: [4, 6], DN3: [5, 6]
File3 [7, 8, 9] 3 DN1: [7, 8, 9], DN2: [7, 8, 9], DN3: [7, 8, 9]
File4 [10] 2 DN1: [10], DN2: [10]
File5 [11] 1 DN1: [11]
Consider a scenario where you have a list of transactions, where each transaction is represented as
(Transaction ID, Price, Name). Apply filter transformation to identify transactions where the price is
greater than 500, and for the remaining transactions, apply a discount of 10% to the price. Finally, display
the discounted price and Name for each transaction. Write a Scala/pyspark program that performs the
aforementioned operations on the list of transactions.
val transactions = List( ("T001", 600, "Product A"), ("T002", 450, "Product B"), ("T003", 700, "Product C"),
("T004", 800, "Product D"), ("T005", 300, "Product E") )
import org.apache.spark.sql.SparkSession
// List of transactions
val transactions = List(
("T001", 600, "Product A"),
("T002", 450, "Product B"),
("T003", 700, "Product C"),
("T004", 800, "Product D"),
("T005", 300, "Product E")
)
// Convert to DataFrame (requires an active SparkSession and its implicits)
val spark = SparkSession.builder().appName("TransactionDiscount").master("local[*]").getOrCreate()
import spark.implicits._
val transactionDF = transactions.toDF("TransactionID", "Price", "Name")
// Keep transactions with Price > 500 and apply a 10% discount to those prices
val discountedDF = transactionDF
  .filter($"Price" > 500)
  .withColumn("DiscountedPrice", $"Price" * 0.9)
  .select("DiscountedPrice", "Name")
// Display the discounted price and name for each qualifying transaction
discountedDF.show(false)
• JSON (JavaScript Object Notation): a lightweight, human-readable, text-based format that is widely used for
asynchronous browser/server communication.
• XML (Extensible Markup Language): a markup language that is used to store and transport data. XML is
often used for large amounts of structured data and has a wide range of applications.
• YAML (YAML Ain't Markup Language, originally "Yet Another Markup Language"): a human-friendly data
serialization standard for all programming languages. It is often used for configuration files.
• CBOR (Concise Binary Object Representation): a binary data format that is similar to JSON, but more
compact and efficient.
• BSON (Binary JSON): a binary-encoded serialization of JSON-like documents.
• Protocol Buffers (also known as protobuf): a compact binary format developed by Google for high-
performance communication protocols.
1. Apache Avro: Avro is a popular data serialization system that provides a compact binary format and a rich
data structure for big data analytics. Avro supports schema evolution, which allows for changes to the data
structure over time without breaking compatibility.
2. Apache Parquet: Parquet is a columnar storage format that is optimized for big data analytics. It uses a
binary encoding that is more compact and efficient than traditional row-based formats, making it well-suited
for big data analytics.
3. Apache Thrift: Thrift is a data serialization framework that supports efficient serialization and deserialization
of data. It is widely used in big data analytics for its ability to handle complex data structures and support
multiple programming languages.
4. Apache Arrow: Arrow is a high-performance data serialization format that is optimized for big data analytics.
It provides a high-performance binary format for columnar data storage, making it well-suited for analytics
workflows that involve large amounts of data.
Distributed File System
A distributed file system is a type of file system that allows multiple users to access and manage the same data stored
on different nodes or computers in a network. In a distributed file system, the data is stored across multiple nodes in
a way that makes it appear as if it is stored on a single machine, while providing the ability to scale out storage and
processing capabilities as needed.
1. Data replication: The ability to replicate data across multiple nodes to ensure data availability and reliability.
2. Data distribution: The ability to divide the data into smaller chunks and distribute them across multiple
nodes for improved scalability and performance.
3. Data access: The ability for multiple users to access and manipulate the same data from different nodes in a
network.
4. Data management: The ability to manage the data and metadata, such as permissions, ownership, and data
placement, in a coordinated and consistent manner.
1. Hadoop HDFS (Hadoop Distributed File System): A scalable and fault-tolerant file system designed for use
with the Hadoop big data platform.
2. GlusterFS: An open-source, scalable, and highly available distributed file system that can be used on
commodity hardware.
A distributed file system typically provides a number of interfaces that allow users and applications to access and
manage data stored in the system. The most common interfaces in distributed file systems are:
1. File System API: A set of application programming interfaces (APIs) that allow users and applications to
interact with the file system, such as reading and writing files, creating and deleting directories, and
managing metadata.
2. Network File System (NFS) Protocol: A widely used protocol that allows users to access files stored in a
remote file system as if they were stored locally. NFS provides a common set of file operations and is
supported by many operating systems and applications.
3. Server Message Block (SMB) Protocol: A protocol that allows users to access and manage files on a remote
file system using the same operations as they would with a local file system. SMB is commonly used in
Windows-based environments.
4. Object Storage Interface: A set of APIs that provide a way to store and retrieve unstructured data as objects,
rather than as traditional files and directories. Object storage is commonly used in cloud-based storage
environments.
5. Distributed File System Protocol (DFS): A protocol that provides a unified namespace for a set of distributed
file systems, allowing users to access files stored on different nodes as if they were stored in a single file
system.
Data Ingest
Data ingest is the process of bringing data into a data processing or storage system for further analysis and
processing. Data ingest can involve multiple steps, including data acquisition, data extraction, data transformation,
and data loading.
Data ingest is a critical step in many big data analytics pipelines and is typically performed in a scalable and efficient
manner to handle the volume and velocity of the data being ingested. The choice of data ingest tools and
technologies depends on the specific requirements of the data processing and storage system, such as the type and
format of the data, the scale of the data, and the performance requirements of the system.
Data ingest with Flume and Sqoop: Flume and Sqoop are two popular tools for data ingest in big data analytics.
1. Flume is an open-source, distributed data collection and ingestion framework that is designed to make it easy to
move large volumes of data into Hadoop for further processing and analysis. Flume supports a variety of sources,
including log files, databases, and network sockets, and provides a scalable and reliable mechanism for
transmitting data to a Hadoop cluster. Flume is also extensible and can be configured to handle complex data
processing and transformation requirements.
2. Sqoop is a tool for data ingestion and data export in Hadoop. It is used to transfer data between Hadoop and
other systems such as relational databases, NoSQL databases, and other data stores. Sqoop provides a high-level
command line interface that makes it easy to perform common data transfer tasks, such as loading data into
Hadoop and exporting data from Hadoop. Sqoop also supports parallel data transfer and can be used to perform
data ingestion at scale.
Spark Storage Level and Cache Persistence
Spark storage levels and cache persistence are critical for optimizing the performance of Spark applications. These
mechanisms enable the reuse of intermediate computation results, reducing recomputation and improving
efficiency.
Storage Levels in Spark
1. Disk Storage:
o Provides fault-tolerance.
o Suitable for large datasets.
2. Memory Storage:
o Enables faster access to frequently accessed datasets.
3. Off-Heap Storage:
o Stores data outside the JVM heap.
o Prevents impact on garbage collection processes.
Spark Cache vs Persist
• cache():
o Default storage level: MEMORY_ONLY.
o Used for intermediate computations of RDDs, DataFrames, or Datasets.
• persist():
o Allows specifying storage levels like memory, disk, or both.
o Ensures fault-tolerance; lost partitions are recomputed using original transformations.
Advantages of Caching and Persistence
1. Cost Efficiency: Reduces the cost of repeated computations.
2. Time Efficiency: Speeds up jobs by avoiding recomputation.
3. Execution Time Optimization: Frees up resources for more jobs on the same cluster.
Examples
1. Caching Syntax:
val dfCache = df.cache()
dfCache.show(false)
2. Persist Syntax (requires import org.apache.spark.storage.StorageLevel):
val dfPersist = df.persist(StorageLevel.MEMORY_AND_DISK)
dfPersist.show(false)
3. Unpersist Syntax:
dfPersist.unpersist()
Spark Architecture
Spark follows a master-slave architecture consisting of several key components:
1. Driver: The main program that defines transformations and actions on data. It coordinates and schedules
tasks.
2. Cluster Manager: Manages the cluster resources and allocates them to applications. Examples include
YARN, Mesos, or a standalone manager.
3. Executors: Workers that run tasks and store data. Each node in the cluster runs an executor.
4. Tasks: Individual units of work sent to the executors.
5. RDD (Resilient Distributed Dataset): A distributed collection of data that is fault-tolerant and can be
operated on in parallel.
Resilient Distributed Datasets (RDDs)
RDDs are the core abstraction in Apache Spark, representing a distributed collection of data that is fault-tolerant
and can be processed in parallel across a cluster. They allow efficient and large-scale data processing, supporting
both in-memory and disk storage.
Key Features of RDDs:
1. Fault Tolerance: RDDs can recover data lost due to failures using lineage information.
2. Parallel Processing: Operations are distributed across multiple nodes in a cluster.
3. Lazy Evaluation: RDD operations are only executed when an action is triggered.
Operations on RDDs
The two major types of operations available are transformations and actions.
1. Transformations:
o Return a new, modified RDD based on the original.
o Common transformations include:
▪ map()
▪ filter()
▪ sample()
▪ union()
2. Actions:
o Return a value based on some computation performed on an RDD.
o Common actions include:
▪ reduce()
▪ count()
▪ first()
▪ foreach()
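A brief hedged sketch showing that transformations are lazy and only an action triggers execution (assumes an existing SparkContext sc):
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val evens = rdd.filter(_ % 2 == 0)           // transformation: nothing executes yet
val combined = evens.union(rdd.map(_ * 10))  // still lazy
println(combined.count())                    // action: the whole lineage runs now -> 7
println(combined.first())                    // another action -> 2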
Iterative Operations on MapReduce
MapReduce lacks efficient support for iterative operations since it writes intermediate results to disk after each
Map and Reduce phase. This is inefficient for algorithms that require multiple passes over the same data (e.g.,
machine learning algorithms), leading to high latency and slower processing times.
Spark Ecosystem
The Spark ecosystem consists of several components and libraries that extend Spark’s capabilities for big data
processing:
1. Spark Core: The foundation that provides basic functionalities like task scheduling and memory
management.
2. Spark SQL: Enables querying of structured data using SQL and DataFrame API.
3. Spark Streaming: Allows real-time stream processing.
4. MLlib: A machine learning library for scalable algorithms like classification and clustering.
5. GraphX: Spark’s API for graph processing and graph-parallel computations.
6. SparkR: R language integration for statistical computing.
Spark vs Hadoop (PYQ)
Feature | Apache Spark | Apache Hadoop (MapReduce)
Processing Model | In-memory computing | Disk-based, batch processing
Speed | Faster (due to in-memory processing) | Slower (due to disk I/O for intermediate data)
Ease of Use | Supports APIs in Scala, Java, Python, R; high-level APIs (DataFrames, SQL) | Java-based, lower-level MapReduce programming
Data Processing Type | Batch, real-time (streaming), interactive, and iterative | Primarily batch processing
Fault Tolerance | Uses lineage and DAG to recompute lost data | Uses data replication across nodes for fault tolerance
Data Storage | Can work with various data sources (HDFS, S3, HBase, etc.) | Works mainly with HDFS (Hadoop Distributed File System)
Latency | Low-latency (due to in-memory processing) | High-latency (disk I/O for every MapReduce step)
Streaming Support | Supports real-time stream processing (via Spark Streaming) | Does not support real-time streaming
Machine Learning | Includes MLlib for machine learning tasks | Requires third-party libraries (e.g., Mahout)
Graph Processing | Provides GraphX for graph computations | Lacks built-in graph processing capabilities
Compatibility | Compatible with Hadoop ecosystem (can run on YARN, access HDFS) | Runs only in Hadoop ecosystem (HDFS, YARN)
Resource Management | Can use YARN, Mesos, or standalone cluster manager | Relies on YARN for resource management
Iterative Algorithms | Optimized for iterative algorithms (e.g., ML algorithms) | Not optimized for iterative algorithms (requires multiple MapReduce jobs)
Maturity | Newer, but rapidly growing in popularity | Older, more stable, and widely used
Spark Scheduler
The Spark scheduler handles the execution of jobs by dividing them into stages and tasks. It uses a Dryad-like
Directed Acyclic Graph (DAG) to represent job execution, where nodes are operations (transformations) and
edges represent dependencies.
Key Features of the Spark Scheduler:
1. Pipelines functions within a stage: It groups operations into stages, executing pipelined functions (e.g.,
map, filter) without waiting.
2. Cache-aware work reuse & locality: It optimizes execution by reusing cached data and scheduling tasks
close to where data is stored.
3. Partitioning-aware: To avoid expensive data shuffling, it keeps track of data partitioning.
Example of Stages:
• Stage 1: Operations like groupBy, map (A, B, C, D) are executed.
• Stage 2: Combines transformations like union and join.
• Stage 3: Uses cached data partitions to minimize recomputation.
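A hedged sketch of how this plays out in code: narrow transformations are pipelined into one stage, and the shuffle introduced by reduceByKey starts a new stage (the HDFS path is a placeholder, and an existing SparkContext sc is assumed):
val words = sc.textFile("hdfs://namenode:9000/data/words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))                          // textFile + flatMap + map pipeline into a single stage
val counts = words.reduceByKey(_ + _)   // wide (shuffle) dependency begins a new stage
counts.collect()                        // the action submits both stages to the DAG scheduler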
Hadoop Ecosystem (PYQ)
• Data Storage:
o HDFS: Distributed file system for storing large files.
o HBase: Columnar database for real-time access to large datasets.
• Data Processing:
o MapReduce: Parallel processing framework for handling large-scale data.
o YARN: Manages cluster resources and job scheduling.
• Data Access:
o Hive: SQL-like interface for querying data in HDFS.
o Pig: Data flow scripting for processing large datasets.
o Mahout: Machine learning library for scalable algorithms.
o Avro: Framework for data serialization and RPC.
o Sqoop: Connects and imports data between Hadoop and relational databases.
• Data Management:
o Oozie: Workflow scheduler for managing Hadoop jobs.
o Chukwa: System for monitoring and collecting data.
o Flume: Collects and aggregates log data from various sources.
o ZooKeeper: Coordination and management for distributed applications.
Pig
Apache Pig is a high-level platform for processing large datasets in Hadoop. It uses a language called Pig Latin for
expressing data transformations. Pig simplifies coding in MapReduce by providing a more accessible scripting
interface.
Why do we need Pig?
• Simplified MapReduce: Writing MapReduce directly is complex; Pig provides a more intuitive approach.
• Data Transformation: Useful for tasks like filtering, grouping, and joining large datasets.
• Less Code: With Pig, operations are concise and easier to maintain.
• Extensibility: Supports user-defined functions (UDFs) for custom tasks.
Features of Pig
• Ease of Programming: Pig Latin is easier than raw MapReduce.
• Data Flow Language: Describes transformations as a data flow.
• Schema Flexibility: Works with both structured and unstructured data.
• Optimization: Automatically optimizes execution by generating efficient MapReduce code.
• Extensibility: Supports UDFs in multiple languages (Java, Python).
Applications of Pig
• Log Analysis: Analyze and process web server logs.
• Data Processing: ETL (Extract, Transform, Load) tasks for large datasets.
• Ad Targeting: For marketing data, processing user behavior data.
• Data Research: Quick prototyping of algorithms in big data analytics.
Apache Pig Architecture
Apache Pig’s architecture is designed to execute Pig Latin scripts efficiently over large datasets. The key components
of Pig's architecture include:
1. Pig Latin Script: The user writes a Pig Latin script to specify data transformations.
2. Parser: Converts the Pig Latin script into a logical plan (a series of steps representing the data flow) after
checking syntax and type.
3. Optimizer: Optimizes the logical plan for better performance, generating an optimized logical plan.
4. Compiler: Converts the optimized logical plan into a physical plan of MapReduce jobs.
5. Execution Engine: This executes the physical plan as MapReduce jobs on a Hadoop cluster.
6. HDFS (Hadoop Distributed File System): The data storage and retrieval system, where Pig processes the
data.
Pig vs MapReduce
Feature Pig MapReduce
Language Pig Latin (high-level scripting) Java (low-level programming)
Ease of Use Easier with fewer lines of code Complex and requires more code
Abstraction Higher level; abstracts MapReduce Low level; direct MapReduce coding
Development Speed Faster for developers Slower; requires detailed coding
Optimization Automatically optimized Manual optimization needed
Use Case For ETL, data analysis, and querying Best for complex and custom operations
Pig vs SQL
Feature Pig SQL
Data Type Support Supports both structured and unstructured data Primarily structured (RDBMS)
Data Processing Procedural, step-by-step data flow Declarative, focus on "what" to retrieve
Schema Requirement Can work with or without schema Requires predefined schema
Platform Designed for Hadoop (Big Data) Designed for RDBMS (Relational Databases)
Flexibility More flexible with unstructured data Limited to structured data
Language Pig Latin (procedural) SQL (declarative)
Pig vs Hive
Feature Pig Hive
Language Pig Latin (procedural) HiveQL (SQL-like, declarative)
Data Handling Works with unstructured and structured data Primarily for structured data
Use Case ETL, data processing, and analysis Data querying and reporting
Execution Translates scripts to MapReduce jobs Also translates HiveQL into MapReduce jobs
Learning Curve Easier for programmers Easier for SQL users
Optimization Automatic but procedural control Query optimization via SQL-based execution plan
Hive Architecture
1. User Interfaces
• Web UI: Web-based interaction.
• CLI: Command Line Interface for executing HiveQL queries.
• HDInsight: Cloud-based interface on Azure for Hive.
2. Meta Store
• Stores metadata like table schemas, partitions, and data locations.
• Uses RDBMS (e.g., MySQL) for managing metadata.
3. HiveQL Process Engine
• Parsing: Converts HiveQL queries into a logical plan.
• Optimization: Optimizes query execution using metadata.
4. Execution Engine
• Executes the optimized query using MapReduce, Tez, or Spark based on the configuration.
5. MapReduce
• Hive translates queries into MapReduce jobs for distributed data processing.
6. HDFS or HBase Data Storage
• HDFS: Default data storage for Hive.
• HBase: Supports NoSQL-style data storage for real-time access.
Big Data
Big Data is a term for extremely large and complex datasets that traditional data processing tools can't handle
efficiently. It involves data from diverse sources and is characterized by its vast scale, rapid growth, and varying
formats.
5 V's of Big Data
1. Volume: Amount of data.
2. Velocity: Speed of data generation.
3. Variety: Types of data.
4. Veracity: Data accuracy.
5. Value: Insights and benefits.