Unit I

The document provides an overview of Big Data Analytics, focusing on its definition, advantages, and the architecture of Hadoop, including its components and characteristics. It outlines the Big Data Analytics Pipeline and the importance of managing large datasets efficiently. Additionally, the document discusses the Hadoop Distributed File System (HDFS), its components, and the High Availability (HA) architecture to ensure continuous operation and data integrity.


Big Data Analytics

Dr. Jagan Mohan Reddy


Associate Professor
Dept. of AIML
[email protected]
+91-91607 38986
Resources
• https://www3.cs.stonybrook.edu/~has/CSE545/Slides/
• https://www.montecarlodata.com/blog-data-pipeline-architecture-explained/
Hadoop Architecture and Components
Type of Digital Data

Digital data is broadly classified as structured, semi-structured, and unstructured.

[Slide figures: examples of each type, including a structured-data table with rows 1…n and attributes a1…an.]
Definition of Big Data
• Big Data refers to large, complex datasets that are difficult to process, store, and analyze using traditional data management tools due to their:
• Volume
• Velocity
• Variety
• Veracity
• Value
5 V’s of Big Data
 Volume: Massive amounts of data generated daily.
 Velocity: High speed at which data is generated and processed.
 Variety: Different types of data (structured, unstructured, semi-structured).
 Veracity: Uncertainty or quality of data.
 Value: Extracting meaningful insights to drive decisions.
Advantages of Big Data
• Improved Decision-Making
• Big Data enables businesses to make data-driven decisions
by analyzing large datasets in real-time.
• Enhanced Customer Experience
• Personalized recommendations and targeted marketing are
possible through customer behavior analysis.
• Operational Efficiency
• Optimizes processes, reduces costs, and improves
productivity by identifying inefficiencies.
• Fraud Detection and Risk Management
• Detects unusual patterns in financial transactions and
reduces risks through predictive analytics.
Advantages of Big Data
• Innovation and Product Development
• Insights from data help in creating new products, services, and
business models.
• Competitive Advantage
• Businesses leveraging Big Data can stay ahead of competitors by
spotting trends and making timely adjustments.
• Improved Healthcare
• Analyzes patient data to improve diagnosis, treatment, and
disease prevention.
• Smarter Cities
• Enables better urban planning and efficient management of
resources in smart cities.
Characteristics of Hadoop
 1. Open Source
Hadoop is an open-source framework, freely available for modification
and use.
 2. Distributed Storage
Hadoop uses HDFS (Hadoop Distributed File System) to store data
across multiple machines, ensuring reliability and scalability.
 3. Fault Tolerance
Automatically replicates data across nodes to ensure no data is lost, even
if a node fails.
 4. Scalability
Easily scales from a single server to thousands of machines without
significant reconfiguration.
 5. Data Locality
Moves computation to where the data is stored, reducing data transfer
and speeding up processing.
Characteristics of Hadoop
 6. High Throughput
Handles large volumes of data by processing it in parallel across
distributed nodes.
 7. Flexibility
Can process structured, unstructured, and semi-structured data (e.g., text,
images, videos).
 8. Cost-Effective
Uses commodity hardware, which is cheaper compared to traditional
systems.
 9. Batch Processing
Processes large datasets in batch mode using tools like MapReduce (a word-count sketch follows this list).
 10. Ecosystem Integration
Works seamlessly with tools like Hive, Pig, HBase, Spark, and YARN for
varied tasks like querying, real-time analytics, and data streaming.
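
To make the MapReduce batch model concrete, here is a minimal word-count sketch for Hadoop Streaming in Python. It is an illustration only: the file name wordcount.py and the mode-selection convention are assumptions, not part of these slides.

```python
#!/usr/bin/env python3
# A classic word-count sketch for Hadoop Streaming. The same file acts
# as mapper or reducer depending on its first argument (an assumed
# convention for this example).
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Streaming sorts mapper output by key, so all counts for a word
    # arrive contiguously; sum them per word.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Such a script would typically be submitted with the hadoop-streaming jar, passing it as both -mapper and -reducer; exact jar and HDFS paths depend on the cluster.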
RDBMS vs Hadoop

Aspect           | RDBMS                                        | Hadoop
Data Type        | Handles structured data only.                | Handles structured, semi-structured, and unstructured data.
Data Storage     | Centralized storage with tables and rows.    | Distributed storage across multiple nodes using HDFS.
Processing Model | Relies on SQL for real-time data processing. | Batch processing using MapReduce or real-time with Spark.
Scalability      | Limited scalability, vertical scaling.       | Highly scalable, supports horizontal scaling (add more nodes).
Fault Tolerance  | Limited fault tolerance.                     | High fault tolerance due to data replication across nodes.
RDBMS vs Hadoop

Aspect       | RDBMS                                                           | Hadoop
Cost         | Expensive, requires high-end hardware.                          | Cost-effective, uses commodity hardware.
Access Speed | Optimized for fast transactional queries.                       | Optimized for processing large datasets in batches.
Ecosystem    | Standalone databases (e.g., MySQL, Oracle).                     | Ecosystem of tools like Hive, Pig, HBase, and Spark.
Schema       | Requires a predefined schema.                                   | Schema-on-read; flexibility for schema changes.
Concurrency  | Supports multiple concurrent transactions.                      | Primarily designed for batch processing.
Use Cases    | OLTP (Online Transaction Processing), banking, inventory mgmt. | Big Data analytics, batch processing, log analysis, machine learning.
Big Data Analytics Pipeline
• The Big Data Analytics Pipeline refers to the end-to-end process of managing, processing, and analyzing large datasets to extract actionable insights.
• It involves a series of interconnected stages, tools, and technologies that handle the complexities of big data (see the sketch below).
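
As a concrete (and deliberately tiny) illustration of the manage-process-analyze flow, the sketch below implements extract, transform, and load steps in plain Python; the file names and the "amount" field are assumptions for the example, not part of these slides.

```python
# A minimal extract-transform-load (ETL) sketch in plain Python.
# Source, field names, and file paths are illustrative assumptions.
import csv
import json

def extract(path):
    """Extract: read raw records from a CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Transform: clean records (drop incomplete rows, normalize a
    hypothetical 'amount' field to float)."""
    cleaned = []
    for r in records:
        if not r.get("amount"):
            continue  # skip incomplete rows
        r["amount"] = float(r["amount"])
        cleaned.append(r)
    return cleaned

def load(records, path):
    """Load: write transformed records to a JSON sink (a stand-in for
    a data lake or warehouse)."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    load(transform(extract("sales.csv")), "sales_clean.json")
```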
Why is the Big Data Analytics Pipeline Important?
• Efficiency: Automates and streamlines the
management of massive datasets.
• Scalability: Handles growing volumes, velocity, and
variety of data.
• Actionable Insights: Transforms raw data into
valuable information for business decisions.
• Cost-Effectiveness: Uses distributed systems and
open-source tools to manage costs.
[Slide figure: a data pipeline architecture diagram. Data sources (IoT devices, SQL databases, web servers, APIs) feed data pipelines that extract, load, and transform data into a data lake and a data warehouse in the analytical data plane, which in turn serve ML, data science, analysis, and dashboards.]
Hadoop Distributions
Year           | Event
2006 – present | Apache Hadoop launched.
2008 – present | Cloudera founded.
2009 – 2020    | MapR launched; AWS EMR introduced.
2011 – 2019    | Hortonworks founded.
2013 – present | Microsoft Azure HDInsight released.
2016 – present | Google Cloud Dataproc introduced.
2019           | Cloudera and Hortonworks merge.
2020           | MapR acquired by HPE.
Google File System
[Slide figure: the Google File System, the design that inspired HDFS.]
HDFS – Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is a core component of the Hadoop ecosystem.
• It is a scalable and distributed file system designed to store and manage large datasets across a cluster of machines.
HDFS replication factor
• Defines how many copies (replicas) of each block are
stored in the cluster.
• Default: 3 (can be configured based on fault tolerance
needs).
• Block Size:
• Default size: 128 MB (can vary, e.g., 64 MB or 256 MB).
• A large file is split into blocks of this size.
• Data Distribution:
• HDFS splits files into blocks, replicates them, and distributes
them across the cluster.
• Each replica is stored on a different DataNode for
redundancy.
Example Calculations
• Scenario
• File size = 600 MB
• Block Size = 128 MB
• Replication Factor = 3
• Cluster has 5 DataNodes

• Step 1: Calculate the Number of Blocks
• Number of Blocks = ⌈File Size / Block Size⌉ = ⌈600 / 128⌉ = ⌈4.69⌉ = 5 blocks (four full 128 MB blocks plus one 88 MB block)
Example Calculations
• Step 2: Calculate Total Data Stored (including Replicas)
• HDFS stores each block Replication Factor times:
• Total Data Stored = 600 MB × 3 = 1800 MB

• Step 3: Data Distribution Across DataNodes
• Block 1: Replica 1 (Node A), Replica 2 (Node B), Replica 3 (Node C)
• Block 2: Replica 1 (Node B), Replica 2 (Node C), Replica 3 (Node D)
• Block 3: Replica 1 (Node C), Replica 2 (Node D), Replica 3 (Node E)
• Block 4: Replica 1 (Node D), Replica 2 (Node E), Replica 3 (Node A)
• Block 5: Replica 1 (Node E), Replica 2 (Node A), Replica 3 (Node B)

A short script reproducing these numbers follows.
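
The same arithmetic can be checked with a short script. The round-robin placement below mirrors the table above; real HDFS placement is rack-aware, so this is only an approximation.

```python
# Reproduces the worked example above: block count, total stored data,
# and a simple round-robin replica placement (a simplification of
# HDFS's rack-aware placement policy).
import math

FILE_SIZE_MB = 600
BLOCK_SIZE_MB = 128
REPLICATION = 3
NODES = ["A", "B", "C", "D", "E"]

num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)  # ceil(4.69) = 5
total_stored = FILE_SIZE_MB * REPLICATION             # 600 * 3 = 1800 MB

print(f"Blocks: {num_blocks}, total stored: {total_stored} MB")

for b in range(num_blocks):
    # Each block's first replica starts one node later than the last.
    replicas = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
    print(f"Block {b + 1}: replicas on nodes {replicas}")
```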
HDFS – Key Features
[Slide figures: summary of HDFS key features.]
HDFS – Components
• NameNode
• Role: The NameNode is the master server in HDFS.
• Responsibilities:
• Manages the file system namespace (e.g., directory tree, file metadata).
• Tracks the location of data blocks across DataNodes.
• Handles client requests for file operations (e.g., read, write, delete).
• DataNode
• Role: The DataNode is the slave node in HDFS.
• Responsibilities:
• Stores the actual data blocks.
• Performs read/write operations as instructed by the NameNode.
• Sends periodic heartbeats and block reports to the NameNode to confirm its
health and block status.
HDFS – Components
• Secondary NameNode
• Role: The Secondary NameNode is not a backup
NameNode but assists the primary NameNode.
• Responsibilities:
• Periodically merges the fsimage (file system metadata snapshot) and
edit logs (recent changes) to reduce the NameNode's startup time.
• Helps in checkpointing the file system metadata.
• Checkpoint Node
• Role: Similar to the Secondary NameNode, but with more
flexibility.
• Responsibilities:
• Periodically creates checkpoints of the file system metadata.
• Can be configured to run on a separate machine.
HDFS – Components
• Backup Node
• Role: Provides a more up-to-date backup of the NameNode's
metadata.
• Responsibilities:
• Maintains an in-memory, up-to-date copy of the file system
namespace.
• Does not require merging of fsimage and edit logs.
• HDFS Client
• Role: The interface through which users interact with HDFS (see the sketch below).
• Responsibilities:
• Communicates with the NameNode for metadata operations (e.g., file creation, deletion).
• Communicates with DataNodes for read/write operations.
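
As an illustration of the client's two conversations (metadata with the NameNode, data with DataNodes), here is a minimal sketch using the third-party hdfs Python package over WebHDFS; the host name, user, and paths are assumptions for the example.

```python
# A minimal HDFS client sketch using the third-party `hdfs` package
# (WebHDFS). The NameNode address, user, and file paths below are
# illustrative assumptions.
from hdfs import InsecureClient

# The WebHDFS endpoint is served by the NameNode (default port 9870).
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Metadata operations go to the NameNode.
print(client.list("/"))  # list the root directory

# Reads and writes are streamed to/from DataNodes.
client.upload("/data/sales.csv", "sales.csv")  # copy a local file into HDFS
with client.read("/data/sales.csv") as reader:
    data = reader.read()
print(data[:100])  # first 100 bytes of the file
```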
HDFS – Components
• Block
• Role: The smallest unit of data storage in HDFS.
• Default block size: 128 MB (configurable).
• Files larger than the block size are split into multiple blocks.
• Each block is replicated across multiple DataNodes (default replication
factor: 3).
HDFS – Hadoop Distributed File System
[Slide figure: HDFS architecture diagram.]
HDFS High Availability (HA) architecture
• Designed to eliminate the single point of failure in
Hadoop's NameNode.
• In a traditional HDFS setup, the NameNode is a critical
component, and its failure can render the entire cluster
unavailable.
• HDFS HA addresses this by introducing two
NameNodes in an active-standby configuration,
ensuring continuous availability of the file system.
HDFS High Availability (HA) architecture

[Slide figure: Active and Standby NameNodes sharing JournalNodes, DataNodes, and ZooKeeper.]

Key Components of HDFS HA Architecture
• Active NameNode
• Handles all client requests and manages the file system namespace.
• Performs metadata operations (e.g., file creation, deletion, block allocation).
• Standby NameNode
• Acts as a hot standby for the Active NameNode.
• Synchronizes with the Active NameNode to maintain an up-to-date state.
• Takes over immediately if the Active NameNode fails.
• JournalNodes (JNs)
• A group of lightweight daemons that manage the edit logs (metadata changes).
• Both Active and Standby NameNodes communicate with JournalNodes to synchronize metadata.
• Typically, there are 3 or 5 JournalNodes for fault tolerance.
HDFS High Availability (HA) architecture
• ZooKeeper (ZK)
• A distributed coordination service used for:
• Failover detection: Detects when the Active NameNode fails.
• Leader election: Elects the Standby NameNode as the new Active NameNode (a toy simulation follows this slide).
• DataNodes:
• Report block locations and heartbeats to both NameNodes (Active and Standby).
• Ensure that both NameNodes have up-to-date information about the data blocks.
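
To illustrate the failover idea (not ZooKeeper's actual protocol), here is a toy heartbeat-and-promotion simulation in Python; all names and the timeout value are invented for the example.

```python
# A toy simulation of heartbeat-based failover: a monitor detects a
# missed heartbeat from the active node and promotes the standby.
# This sketches the idea only; it is not ZooKeeper's real protocol.
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before failover

class NameNode:
    def __init__(self, name, role):
        self.name, self.role = name, role
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

def monitor(active, standby):
    """Failover detection + leader election, greatly simplified."""
    if time.monotonic() - active.last_heartbeat > HEARTBEAT_TIMEOUT:
        active.role, standby.role = "failed", "active"
        print(f"{active.name} failed; {standby.name} promoted to active")
        return standby
    return active

nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")

nn1.heartbeat()                  # healthy heartbeat: no failover
assert monitor(nn1, nn2) is nn1

nn1.last_heartbeat -= 10         # simulate a crashed Active NameNode
assert monitor(nn1, nn2) is nn2  # standby takes over
```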
Advantages of HDFS HA
• Eliminates Single Point of Failure:
• The Standby NameNode ensures continuous availability in case
of Active NameNode failure.
• Fast Failover:
• Failover is automatic and typically completes within seconds.
• No Data Loss:
• Metadata is synchronized between the Active and Standby
NameNodes using JournalNodes.
• Improved Cluster Uptime:
• Ensures that the HDFS cluster remains operational even during
NameNode failures.
