Big Data Exam Help
Got it! Please go ahead and upload the images. I'll provide a detailed and beginner-friendly explanation
of each concept and architecture you share.
This image represents the Apache Hadoop YARN Architecture, which stands for "Yet Another Resource
Negotiator." It is an essential component of Hadoop for managing resources in a distributed system and
executing tasks efficiently.
1. Client:
The client is the user or application that submits jobs (tasks) to the YARN system.
A job could be a data processing task like sorting or analyzing large datasets.
2. Resource Manager (RM):
The Resource Manager is the central authority of YARN. It manages all the cluster's resources
and decides how to allocate them.
It performs two main tasks:
1. Monitoring Node Managers: Tracks the health and resource usage of all nodes.
2. Assigning Resources: Based on resource requests, it assigns resources to jobs.
3. Node Manager (NM):
Every machine (node) in the cluster has its own Node Manager.
The NM is responsible for:
Reporting resource usage (CPU, memory) to the Resource Manager.
Managing containers on its node.
4. Application Master (AM):
Each job submitted by the client has its own Application Master.
The AM is responsible for:
Requesting resources from the Resource Manager.
Assigning tasks to the containers and monitoring their progress.
5. Container:
A container is a unit of resources (CPU, memory) allocated by the Resource Manager.
It runs the tasks assigned by the Application Master.
How It Works:
1. Job Submission:
The client submits a job to the Resource Manager, specifying the amount of resources it
needs.
2. Application Master Setup:
The Resource Manager launches the Application Master in a container on one of the nodes.
3. Resource Request and Allocation:
The Application Master requests additional containers from the Resource Manager to execute
the job's tasks.
The Resource Manager finds the appropriate Node Manager and allocates resources.
4. Task Execution:
Node Managers create containers to run the tasks.
The Application Master assigns the tasks to the containers.
5. Monitoring and Completion:
Node Managers report the status of the tasks to the Application Master.
The Application Master updates the client with the job's progress.
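To make the flow above concrete, here is a minimal sketch of step 1 (job submission) from the client side, assuming a standard Hadoop installation: it simply invokes the stock word-count example via `hadoop jar`, after which the Resource Manager, Application Master, and containers take over as described. The jar path and HDFS directories are placeholders, not part of the diagram.

```python
import subprocess

# Placeholder paths -- adjust to your Hadoop installation and HDFS layout.
EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"
INPUT_DIR = "/user/demo/input"
OUTPUT_DIR = "/user/demo/output"

# Step 1 (Job Submission): the client hands the job to YARN. The Resource
# Manager then launches an Application Master, which requests containers
# for the individual tasks (steps 2-5 above).
subprocess.run(
    ["hadoop", "jar", EXAMPLES_JAR, "wordcount", INPUT_DIR, OUTPUT_DIR],
    check=True,
)
```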
Let me know if you'd like me to explain any part further!
Hadoop Streaming
This image represents the Hadoop Streaming Architecture, a feature of Hadoop that allows you to
write MapReduce jobs in any programming language (not just Java). It is particularly useful when you
want to use tools or scripts written in Python, Ruby, or other languages for processing data in a Hadoop
cluster.
1. Input Reader/Format:
This is the starting point of the process. The input data is split into smaller chunks (e.g., lines
of text or other formats).
Each chunk is converted into a `<Key, Value>` pair.
Key: Usually the offset (position) of the line in the file.
Value: The content of the line.
2. Mapper Stream:
The Mapper processes the input `<Key, Value>` pairs and generates intermediate `<Key,
Value>` pairs.
Hadoop Streaming allows you to use your external script (e.g., Python, Bash) as the Mapper.
The script receives the input through `STDIN` (Standard Input) and sends the output through
`STDOUT` (Standard Output).
Example:
Input: `line_offset, "apple orange banana"`
Mapper Output:
`apple → 1`
`orange → 1`
`banana → 1`
3. Intermediate Key-Value Pairs:
After the Mapper processes the input, the intermediate `<Key, Value>` pairs are shuffled and
sorted by Hadoop automatically to group values for the same key.
Example:
Mapper Output: `apple → 1, orange → 1, apple → 1`
Intermediate Output: `apple → [1, 1], orange → [1]`
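Hadoop performs this shuffle and sort automatically, but a small local simulation in Python (using the word-count pairs from the example above, with no Hadoop involved) makes the grouping step concrete:

```python
from itertools import groupby

# Intermediate pairs as emitted by the mapper (unsorted).
mapper_output = [("apple", 1), ("orange", 1), ("apple", 1)]

# Hadoop sorts by key between the map and reduce phases; grouping then
# collects every value that shares a key, which is what the reducer sees.
sorted_pairs = sorted(mapper_output, key=lambda pair: pair[0])
grouped = {
    key: [value for _, value in pairs]
    for key, pairs in groupby(sorted_pairs, key=lambda pair: pair[0])
}

print(grouped)  # {'apple': [1, 1], 'orange': [1]}
```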
4. Reducer Stream:
The Reducer takes the grouped `<Key, List of Values>` and performs aggregation or
computation on the data.
Similar to the Mapper, an external script can be used for this step.
Input to the Reducer: `STDIN`, Output: `STDOUT`.
Example:
Input: `apple → [1, 1]`
Reducer Output: `apple → 2`
5. Output Format:
The final `<Key, Value>` pairs from the Reducer are written to the output in the specified
format.
The output can be a text file, database, or any other storage system supported by Hadoop.
Example
Let's say we want to count word frequencies in a file using Python scripts.
Mapper script (e.g., `mapper.py`):

```python
import sys

# Emit "word<TAB>1" for every word read from STDIN.
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```
Reducer script (e.g., `reducer.py`):

```python
import sys
from collections import defaultdict

# Sum the counts for each word received from the mapper via STDIN.
counts = defaultdict(int)
for line in sys.stdin:
    word, count = line.strip().split("\t")
    counts[word] += int(count)

for word, count in counts.items():
    print(f"{word}\t{count}")
```
Command to Run:

```bash
# A typical Hadoop Streaming invocation; the jar location and HDFS paths are
# illustrative -- adjust them to your installation.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/demo/input \
  -output /user/demo/wordcount_output \
  -mapper "python3 mapper.py" \
  -reducer "python3 reducer.py" \
  -file mapper.py \
  -file reducer.py
```
steps
This image represents a Big Data Pipeline Architecture. A data pipeline is a structured flow of data
from its source to its destination, where it can be analyzed or used for business purposes. This pipeline
involves several steps and processes, as shown in the diagram.
Let’s break it down step by step:
1. Capture
This is the first step, where data is collected from various sources such as:
Mobile devices (e.g., apps).
Web applications (e.g., e-commerce platforms).
Microservices (small, independent services that perform specific tasks).
IoT devices (sensors, smart home devices, etc.).
Batch data (data collected over time, e.g., from logs or databases).
Protocols like MQTT (used for IoT communication).
2. Ingest
Once data is captured, it is ingested into the system for processing. Two main types of ingestion
occur:
Event Queue:
Real-time events are sent to an event queue (e.g., Kafka, RabbitMQ).
The queue buffers events in order so they can be processed continuously.
Batch Data Load:
Historical or periodic data is loaded into the system in batches.
The ingested data is then routed to storage locations:
Data Warehouse: Optimized for querying and analytics (e.g., Snowflake, BigQuery).
Data Lake: Stores raw, unstructured data for future use (e.g., S3, Hadoop HDFS).
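As a small, hedged illustration of the event-queue path, the sketch below publishes one JSON event to Kafka with the `kafka-python` client; the broker address, topic name, and event fields are made up for the example.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker and topic -- replace with your own cluster settings.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# One "capture -> ingest" event, e.g. a click coming from a web application.
producer.send("clickstream-events", {"user_id": 42, "action": "add_to_cart"})
producer.flush()  # make sure the event actually reaches the broker
```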
3. Store
Data Warehouse:
Used for filtered and cleaned data.
Data is structured and ready for analysis (e.g., SQL queries).
Data Lake:
Stores raw data in its native format (e.g., JSON, video files, etc.).
Useful for data scientists and analysts who need access to raw data.
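To illustrate the data-lake side (raw data kept in its native format), here is a minimal sketch that writes one raw JSON event to S3 with `boto3`; the bucket name and key layout are assumptions for the example.

```python
import json
import boto3  # pip install boto3

# Hypothetical bucket and date-partitioned key -- a data lake usually keeps
# raw events in their original format, organized by arrival date.
s3 = boto3.client("s3")
raw_event = {"user_id": 42, "action": "add_to_cart", "ts": "2024-01-01T12:00:00Z"}

s3.put_object(
    Bucket="example-data-lake",
    Key="raw/events/2024-01-01/event-0001.json",
    Body=json.dumps(raw_event).encode("utf-8"),
)
```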
4. Compute
Data processing and analysis occur here, which can be done in two ways:
Stream Processing:
Processes real-time data as it arrives.
Uses tools like Apache Kafka Streams, Flink, or Spark Streaming.
Example: Sending notifications immediately when certain conditions are met.
Batch Processing:
Processes large datasets periodically (e.g., daily, hourly).
Uses frameworks like Apache Spark or Hadoop MapReduce.
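For the batch-processing path, a minimal PySpark sketch could look like the following; the input path, column name, and output location are illustrative, and a streaming job would use Spark Structured Streaming or Flink instead.

```python
from pyspark.sql import SparkSession

# Minimal batch job: count events per user from JSON files in a data lake.
spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

events = spark.read.json("s3a://example-data-lake/raw/events/2024-01-01/")
counts = events.groupBy("user_id").count()

# Write the aggregated result where the warehouse/analytics layer can read it.
counts.write.mode("overwrite").parquet("s3a://example-warehouse/daily_counts/")
spark.stop()
```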
5. Use
Orchestration:
Tools like Apache Airflow manage the workflow of the pipeline.
Ensures all steps occur in the correct sequence and on time.
Data Quality:
Validates and cleans data to ensure it is accurate and reliable.
Catalog:
Maintains metadata about the data, making it easier to find and understand.
Governance:
Enforces security, compliance, and policies (e.g., GDPR compliance).
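As a hedged sketch of the orchestration layer described above, here is a tiny Airflow DAG that runs an ingest step followed by a batch-processing step once a day; the task commands and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A two-step daily pipeline: ingest first, then batch processing.
# (On older Airflow versions the `schedule` argument is `schedule_interval`.)
with DAG(
    dag_id="example_big_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo 'load batch data'")
    compute = BashOperator(task_id="compute", bash_command="echo 'run spark job'")

    ingest >> compute  # compute only runs after ingest succeeds
```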
Key Benefits of This Pipeline:
1. Scalability:
Can handle large volumes of data from diverse sources.
2. Real-Time Insights:
Stream processing enables immediate decision-making.
3. Flexibility:
Supports both structured (SQL-ready) and unstructured (raw) data.
4. Automation:
Orchestration tools automate complex workflows.
Let me know if you’d like me to explain any part further or give examples of specific tools!