
A unit of Realwaves (P) Ltd Unit 3

UNIT-III

3.6 HADOOP ECOSYSTEM: INTRODUCTION TO HADOOP ECOSYSTEM

What is Hadoop Ecosystem?
The Hadoop Ecosystem refers to a set of open-source tools and frameworks
developed to process and analyze Big Data efficiently. The core of the ecosystem
is the Hadoop framework, which manages the storage, processing, and retrieval
of vast data sets. It is scalable, cost-effective, and fault-tolerant.

Core Components of the Hadoop Ecosystem

1. HDFS (Hadoop Distributed File System)


• Function:
HDFS is the storage layer of Hadoop. It breaks large files into smaller pieces
called blocks and stores them across multiple machines in a distributed
manner. It ensures data is available even if one machine fails by maintaining
copies (replication).
• Example:
Flipkart stores its product catalog, transaction logs, and user behavior data in
HDFS. This enables efficient analysis of shopping patterns to recommend
products to customers.
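The block-and-replica idea can be sketched in a few lines of Python. This is a toy illustration, not the real HDFS implementation: the block size, node names, and round-robin placement policy here are all simplified assumptions (real HDFS defaults to 128 MB blocks and a rack-aware placement policy).

```python
# Toy illustration of HDFS-style storage: split a file into fixed-size
# blocks and place each block on several nodes (replication factor 3).
BLOCK_SIZE = 4          # real HDFS defaults to 128 MB; tiny here for clarity
REPLICATION = 3

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the byte string into block-sized chunks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b, _block in enumerate(blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"transactions-from-flipkart"
blocks = split_into_blocks(data)
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])

# Even if node1 fails, every block still has at least two surviving copies.
survivors = {b: [n for n in ns if n != "node1"] for b, ns in placement.items()}
assert all(len(ns) >= 2 for ns in survivors.values())
```

The final assertion captures the fault-tolerance claim in the text: losing one machine never loses a block, because each block lives on three different nodes.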

2. MapReduce
• Function:
MapReduce is the processing engine in Hadoop. It divides tasks into smaller
parts, distributes them across nodes (computers), and processes them
simultaneously. Results are then combined to provide the final output.
• Example:
Indian Railways uses MapReduce to analyze passenger ticket bookings and
cancellations to optimize train schedules and prevent overbooking.
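The divide, shuffle, and combine flow can be shown with the classic word-count example. This is a single-process Python sketch of the three MapReduce phases, not a distributed Hadoop job; the input lines are invented.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the grouped values into one result per key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["delhi mumbai delhi", "mumbai chennai"]
pairs = [p for line in lines for p in map_phase(line)]   # map runs per line, in parallel
counts = reduce_phase(shuffle(pairs))                    # results combined at the end
assert counts == {"delhi": 2, "mumbai": 2, "chennai": 1}
```

In real Hadoop the map calls run on different nodes against different blocks of the input, and the shuffle moves data across the network; the logic of each phase is the same as above.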

Branches: (1) Vidhyadhar Nagar (2) Mansarovar (3) Tonk Phatak (Contact 7737733360)
ONLINE CLASSES on “More Education Plus” App | Recorded Videos Available 3.1

3. YARN (Yet Another Resource Negotiator)


• Function:
YARN is the resource manager in Hadoop. It allocates cluster resources
such as CPU and memory to the applications and tasks running in the
Hadoop cluster.
• Example:
A mapping service such as Google Maps could use a YARN-style resource
manager to allocate resources when processing traffic data collected
from Indian roads and highways, providing real-time traffic updates.
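YARN's bookkeeping can be sketched as a resource ledger: a container request is granted only if the cluster still has enough CPU and memory. The class name, job names, and numbers below are invented for illustration; the real YARN scheduler (capacity or fair scheduler) is far more sophisticated.

```python
class ToyResourceManager:
    """Illustrative stand-in for a YARN-style resource manager."""
    def __init__(self, total_cpu, total_mem_gb):
        self.free_cpu = total_cpu
        self.free_mem = total_mem_gb

    def allocate(self, task, cpu, mem_gb):
        # Grant the container only if enough resources remain.
        if cpu <= self.free_cpu and mem_gb <= self.free_mem:
            self.free_cpu -= cpu
            self.free_mem -= mem_gb
            return True
        return False

rm = ToyResourceManager(total_cpu=8, total_mem_gb=32)
assert rm.allocate("traffic-analysis", cpu=4, mem_gb=16)   # fits
assert rm.allocate("log-cleanup", cpu=4, mem_gb=16)        # fits exactly
assert not rm.allocate("extra-job", cpu=1, mem_gb=1)       # cluster is full
```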

4. Hive
• Function:
Hive is a data warehouse system built on top of Hadoop. It allows users to
query structured data stored in HDFS using a SQL-like language called
HiveQL.
• Example:
Reliance Jio uses Hive to query customer data for marketing campaigns,
such as sending personalized offers based on usage patterns.
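What a HiveQL query computes can be mirrored in plain Python. The query in the comment is illustrative HiveQL over a hypothetical `usage` table (the table, columns, and customer names are made up); the Python code reproduces the same group-and-sum over an in-memory list.

```python
from collections import defaultdict

# Illustrative HiveQL (hypothetical `usage` table):
#   SELECT customer, SUM(data_mb) AS total_mb
#   FROM usage
#   GROUP BY customer;
usage = [
    ("ravi", 500), ("meena", 1200), ("ravi", 300),
]

totals = defaultdict(int)
for customer, data_mb in usage:
    totals[customer] += data_mb        # GROUP BY customer, SUM(data_mb)

assert dict(totals) == {"ravi": 800, "meena": 1200}
```

Hive compiles such a query into MapReduce (or Spark/Tez) jobs behind the scenes, which is why SQL users can work with HDFS data without writing any Java.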

5. Pig
• Function:
Pig is a scripting platform designed to process large datasets. It uses a
language called Pig Latin, which is simpler to write than the Java code
required for raw MapReduce programs.
• Example:
The Indian government uses Pig to analyze demographic data from Aadhaar
to identify population trends and plan welfare schemes.
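A typical Pig Latin pipeline loads, filters, groups, and counts records. The script in the comment is illustrative (the relation and field names are hypothetical); the Python below computes the same result step by step so you can see what each Pig operator does.

```python
# Illustrative Pig Latin (hypothetical relation and field names):
#   people   = LOAD 'census' AS (state:chararray, age:int);
#   adults   = FILTER people BY age >= 18;
#   by_state = GROUP adults BY state;
#   counts   = FOREACH by_state GENERATE group, COUNT(adults);
people = [("RJ", 25), ("RJ", 12), ("MH", 40), ("MH", 70)]

adults = [(state, age) for state, age in people if age >= 18]   # FILTER
counts = {}
for state, _age in adults:                                      # GROUP + COUNT
    counts[state] = counts.get(state, 0) + 1

assert counts == {"RJ": 1, "MH": 2}
```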

6. HBase
• Function:
HBase is a NoSQL database that allows real-time read and write access to
large datasets.


• Example:
SBI stores and retrieves transaction records of millions of customers in
real time using HBase, enabling quick account updates.
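HBase's data model, roughly a sorted map from row key to column values, can be sketched with nested dictionaries. The account rows below are invented; real HBase adds column families, cell versions, and distribution across region servers.

```python
class ToyHBaseTable:
    """Minimal sketch of HBase-style access: fast get/put by row key."""
    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)

accounts = ToyHBaseTable()
accounts.put("acct#1001", "balance", 5000)
accounts.put("acct#1001", "branch", "Jaipur")
accounts.put("acct#1001", "balance", 4500)   # a real-time update overwrites

assert accounts.get("acct#1001", "balance") == 4500
assert accounts.get("acct#1001", "branch") == "Jaipur"
```

The key-addressed layout is what makes single-row reads and writes fast even over billions of rows, unlike a MapReduce scan of flat HDFS files.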

7. Spark
• Function:
Spark is a fast data processing engine that can process large datasets in
memory (RAM), making it much faster than traditional disk-based
processing.
• Example:
Ola uses Spark to analyze ride demand and optimize pricing dynamically in
Indian cities like Bengaluru and Mumbai.
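The in-memory advantage can be illustrated with a tiny cache: an expensive computation is paid for once, then reused from RAM. This mimics what Spark's `cache()` buys you over re-reading from disk; it is not Spark itself, and the city demand numbers are invented.

```python
import functools

calls = {"count": 0}

@functools.lru_cache(maxsize=None)
def demand_for(city):
    # Stand-in for an expensive computation over a large dataset.
    calls["count"] += 1
    rides = {"bengaluru": 120, "mumbai": 95}
    return rides[city]

demand_for("bengaluru")          # computed once...
demand_for("bengaluru")          # ...then served straight from memory
assert calls["count"] == 1       # the expensive step ran only once
```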

8. Sqoop
• Function:
Sqoop bridges traditional databases and Hadoop, enabling data
import/export between them.
• Example:
ICICI Bank transfers customer transaction data from MySQL to Hadoop
using Sqoop for fraud detection and credit risk analysis.

9. Flume
• Function:
Flume is a data ingestion tool that collects, aggregates, and moves data (like
logs and events) into HDFS.
• Example:
Swiggy uses Flume to gather log data from its app to monitor delivery
performance and app usage.

10. Oozie


• Function:
Oozie is a workflow scheduler that automates the execution of Hadoop jobs
in a specific order.
• Example:
Paytm uses Oozie to manage workflows for its cashback processing pipeline,
ensuring tasks like transaction analysis and payment processing happen
sequentially.
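The ordering guarantee can be sketched as a tiny scheduler that runs named steps strictly in sequence and halts if one fails. The step names are hypothetical; real Oozie workflows are XML definitions that also support forks, joins, and retries.

```python
def run_workflow(steps):
    """Run each (name, func) step in order; stop at the first failure."""
    completed = []
    for name, step in steps:
        if not step():
            return completed, name        # a failed step halts the pipeline
        completed.append(name)
    return completed, None

order = []
steps = [
    ("analyse-transactions", lambda: order.append("analyse") or True),
    ("compute-cashback",     lambda: order.append("cashback") or True),
    ("send-notifications",   lambda: order.append("notify") or True),
]
done, failed = run_workflow(steps)
assert done == ["analyse-transactions", "compute-cashback", "send-notifications"]
assert failed is None
assert order == ["analyse", "cashback", "notify"]   # strict sequence preserved
```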

Other Tools in the Ecosystem


• Zookeeper: Manages coordination between distributed systems.
Example: Ensuring synchronization in Netflix India’s content delivery
network.
• Mahout: Provides machine learning capabilities.
Example: Used in Amazon India for personalized product
recommendations.
• Kafka: A messaging system for real-time data streams.
Example: Uber India uses Kafka for event-based data like driver availability
and ride requests.

Advantages of the Hadoop Ecosystem


1. Scalability:
It scales horizontally, allowing more machines to be added as data grows.
2. Cost-Effective:
Built on open-source software and uses commodity hardware.
3. Fault Tolerance:
Data replication ensures high availability.
4. Flexibility:
Can handle structured, unstructured, and semi-structured data.

Relevance of Hadoop Ecosystem in India


1. E-Governance:
Projects like Aadhaar and Digital India use Hadoop to store and analyze
citizen data.
2. E-commerce Boom:
Companies like Flipkart and Amazon India rely on Hadoop for real-time
analytics of user behavior.
3. Banking and Finance:
Banks like ICICI and SBI analyze customer transactions for fraud detection
and risk management.
4. Healthcare:
Platforms like Aarogya Setu analyze COVID-19 data using Hadoop to track
and prevent outbreaks.
5. Telecom:
Jio and Airtel use Hadoop to analyze network usage and optimize
bandwidth.

3.7 INTRODUCTION TO DATA MANAGEMENT AND DATA ACCESS TOOLS




Data management and access tools are essential components of the Big Data
Ecosystem. They streamline the storage, processing, and retrieval of data for
effective analysis and decision-making.
Here, we’ll explore the data management tools (Flume, Oozie, Zookeeper) and
data access tools (Hive, Pig, Avro, Sqoop), providing insights into their
functionality and practical Indian examples.

Data Management Tools


These tools handle the collection, scheduling, coordination, and management of
data within Big Data systems.


1. Flume
• Purpose:
Flume is designed to ingest large volumes of log data and move it into
Hadoop’s storage system (HDFS).
• Key Features:
o Collects unstructured data from multiple sources like web servers or
application logs.
o Ensures reliable and distributed data ingestion.
• Example:
Swiggy uses Flume to collect app usage logs, such as the number of orders
placed or searches performed, and stores them in HDFS for performance
analysis.

2. Oozie
• Purpose:
Oozie is a workflow scheduler that automates the execution of a sequence of
Hadoop jobs.
• Key Features:
o Handles time-based (cron-like) or event-based workflows.
o Integrates seamlessly with other Hadoop tools like Hive, Pig, and
MapReduce.
• Example:
Paytm uses Oozie to manage its data processing workflows, such as
transaction analysis, cashback calculations, and notification delivery.

3. Zookeeper
• Purpose:
Zookeeper is a centralized service for coordinating distributed systems. It
ensures synchronization, configuration management, and fault tolerance.
• Key Features:

o Manages distributed applications effectively.


o Prevents inconsistencies in data processing systems.
• Example:
Netflix India uses Zookeeper to manage its distributed content delivery
system, ensuring synchronized streaming services across the country.
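One concrete coordination primitive Zookeeper provides is leader election: among the members that register, the one holding the lowest sequential node wins, and if it disappears the next in line takes over. The in-process sketch below imitates that rule; the server names are invented and there is no real networking or session handling.

```python
class ToyCoordinator:
    """In-process sketch of Zookeeper-style sequential-node leader election."""
    def __init__(self):
        self.seq = 0
        self.nodes = {}              # member -> sequence number

    def join(self, member):
        self.seq += 1                # each joiner gets the next sequence number
        self.nodes[member] = self.seq

    def leader(self):
        # The member holding the lowest sequence number leads.
        return min(self.nodes, key=self.nodes.get)

    def leave(self, member):
        del self.nodes[member]       # like an ephemeral node disappearing

zk = ToyCoordinator()
for server in ("cache-a", "cache-b", "cache-c"):
    zk.join(server)
assert zk.leader() == "cache-a"
zk.leave("cache-a")                  # the leader crashes
assert zk.leader() == "cache-b"      # the next in line takes over automatically
```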

Data Access Tools


These tools allow users to query, transform, and analyze data stored in Hadoop.

4. Hive
• Purpose:
Hive is a data warehouse tool for querying and managing large datasets
stored in HDFS using SQL-like syntax (HiveQL).
• Key Features:
o Supports structured data and complex queries.
o Ideal for users familiar with SQL.
• Example:
Reliance Jio uses Hive to analyze customer usage data and provide tailored
plans or offers to its users.

5. Pig
• Purpose:
Pig is a platform for analyzing large datasets using a high-level scripting
language called Pig Latin.
• Key Features:
o Simplifies data transformation and analysis.
o Handles semi-structured and unstructured data effectively.
• Example:
The Indian government uses Pig to analyze census data and extract insights
about population growth trends.

6. Avro
• Purpose:
Avro is a serialization framework used to store and exchange data between
Hadoop applications.
• Key Features:
o Highly efficient and compact data format.
o Supports schema evolution, making it easy to update data formats.
• Example:
Flipkart uses Avro to store product data in a consistent and compact format,
enabling seamless communication between different systems.
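Avro's core idea, records validated against a declared schema and then stored compactly, can be imitated in a few lines. This sketch uses a plain dictionary schema and compact JSON; real Avro uses a JSON-defined schema with a binary encoding and formal schema-evolution rules, and the product fields below are invented.

```python
import json

# Hypothetical product schema, in the spirit of an Avro record definition.
SCHEMA = {"name": str, "price": int}

def serialize(record, schema=SCHEMA):
    """Check the record against the schema, then emit a compact encoding."""
    for field, ftype in schema.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"field {field!r} must be {ftype.__name__}")
    return json.dumps(record, separators=(",", ":")).encode()

blob = serialize({"name": "kettle", "price": 1499})
assert blob == b'{"name":"kettle","price":1499}'
assert json.loads(blob) == {"name": "kettle", "price": 1499}
```

Because every writer and reader agrees on the schema, systems exchanging these records never have to guess field names or types, which is the "seamless communication" the example above describes.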

7. Sqoop
• Purpose:
Sqoop bridges the gap between traditional databases (like MySQL, Oracle)
and Hadoop. It is used to import/export data between these systems.
• Key Features:
o Efficient data transfer to/from Hadoop.
o Supports structured data formats.
• Example:
ICICI Bank uses Sqoop to move transactional data from its MySQL
databases to Hadoop for risk analysis and fraud detection.

Why These Tools Matter


• Efficiency: Automate repetitive tasks like data ingestion, transformation,
and querying.
• Scalability: Handle growing datasets seamlessly.
• Flexibility: Work with structured, semi-structured, and unstructured data.
• Cost-Effectiveness: Open-source tools reduce the overall cost of data
processing.

Indian Use Cases for Data Management and Access Tools


1. E-Governance:
Aadhaar uses tools like Hive and Sqoop to analyze demographic data for
policy formulation.
2. E-commerce:
Amazon India leverages Hive and Avro to analyze purchasing trends and
manage inventory.
3. Telecom:
Airtel uses Oozie and Pig to manage and analyze network usage, ensuring
optimal service delivery.
4. Healthcare:
Aarogya Setu relies on Flume to collect data from millions of app users and
analyze the spread of diseases.
