Unit I

The document provides an overview of Big Data Analytics, focusing on its definition, advantages, and the architecture of Hadoop, including its components and characteristics. It outlines the Big Data Analytics Pipeline and the importance of managing large datasets efficiently. Additionally, the document discusses the Hadoop Distributed File System (HDFS), its components, and the High Availability (HA) architecture to ensure continuous operation and data integrity.


Big Data Analytics

Dr. Jagan Mohan Reddy


Associate Professor
Dept. of AIML
[email protected]
+91-91607 38986
Resources
• https://www3.cs.stonybrook.edu/~has/CSE545/Slides/
• https://www.montecarlodata.com/blog-data-pipeline-architecture-explained/
Hadoop Architecture and Components
Type of Digital Data

Digital data is broadly classified as structured, semi-structured, and unstructured.

[Slide figures: examples of each type, including a structured-data table with rows 1…n and attributes a1…an.]
Definition of Big Data
• Big Data refers to large, complex datasets that are difficult to process, store, and analyze using traditional data management tools due to their:
• Volume
• Velocity
• Variety
• Veracity
• Value
5 V’s of Big Data
 Volume: Massive amounts of data generated daily.
 Velocity: High speed at which data is generated and processed.
 Variety: Different types of data (structured, unstructured, semi-structured).
 Veracity: Uncertainty or quality of data.
 Value: Extracting meaningful insights to drive decisions.
Advantages of Big Data
• Improved Decision-Making
• Big Data enables businesses to make data-driven decisions
by analyzing large datasets in real-time.
• Enhanced Customer Experience
• Personalized recommendations and targeted marketing are
possible through customer behavior analysis.
• Operational Efficiency
• Optimizes processes, reduces costs, and improves
productivity by identifying inefficiencies.
• Fraud Detection and Risk Management
• Detects unusual patterns in financial transactions and
reduces risks through predictive analytics.
Advantages of Big Data
• Innovation and Product Development
• Insights from data help in creating new products, services, and
business models.
• Competitive Advantage
• Businesses leveraging Big Data can stay ahead of competitors by
spotting trends and making timely adjustments.
• Improved Healthcare
• Analyzes patient data to improve diagnosis, treatment, and
disease prevention.
• Smarter Cities
• Enables better urban planning and efficient management of
resources in smart cities.
Characteristics of Hadoop
 1. Open Source
Hadoop is an open-source framework, freely available for modification
and use.
 2. Distributed Storage
Hadoop uses HDFS (Hadoop Distributed File System) to store data
across multiple machines, ensuring reliability and scalability.
 3. Fault Tolerance
Automatically replicates data across nodes to ensure no data is lost, even
if a node fails.
 4. Scalability
Easily scales from a single server to thousands of machines without
significant reconfiguration.
 5. Data Locality
Moves computation to where the data is stored, reducing data transfer
and speeding up processing.
Characteristics of Hadoop
 6. High Throughput
Handles large volumes of data by processing it in parallel across
distributed nodes.
 7. Flexibility
Can process structured, unstructured, and semi-structured data (e.g., text,
images, videos).
 8. Cost-Effective
Uses commodity hardware, which is cheaper compared to traditional
systems.
 9. Batch Processing
Processes large datasets in batch mode using tools like MapReduce (a word-count sketch follows this list).
 10. Ecosystem Integration
Works seamlessly with tools like Hive, Pig, HBase, Spark, and YARN for
varied tasks like querying, real-time analytics, and data streaming.
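
To make the MapReduce batch model concrete, here is a minimal word-count sketch for Hadoop Streaming in Python. It is an illustration only: the file name wordcount.py and the mode-selection convention are assumptions, not part of these slides.

```python
#!/usr/bin/env python3
# A classic word-count sketch for Hadoop Streaming. The same file acts
# as mapper or reducer depending on its first argument (an assumed
# convention for this example).
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Streaming sorts mapper output by key, so all counts for a word
    # arrive contiguously; sum them per word.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Such a script would typically be submitted with the hadoop-streaming jar, passing it as both -mapper and -reducer; exact jar and HDFS paths depend on the cluster.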
RDBMS vs Hadoop

Aspect           | RDBMS                                        | Hadoop
Data Type        | Handles structured data only.                | Handles structured, semi-structured, and unstructured data.
Data Storage     | Centralized storage with tables and rows.    | Distributed storage across multiple nodes using HDFS.
Processing Model | Relies on SQL for real-time data processing. | Batch processing using MapReduce or real-time with Spark.
Scalability      | Limited scalability, vertical scaling.       | Highly scalable, supports horizontal scaling (add more nodes).
Fault Tolerance  | Limited fault tolerance.                     | High fault tolerance due to data replication across nodes.
RDBMS vs Hadoop

Aspect       | RDBMS                                                           | Hadoop
Cost         | Expensive, requires high-end hardware.                          | Cost-effective, uses commodity hardware.
Access Speed | Optimized for fast transactional queries.                       | Optimized for processing large datasets in batches.
Ecosystem    | Standalone databases (e.g., MySQL, Oracle).                     | Ecosystem of tools like Hive, Pig, HBase, and Spark.
Schema       | Requires a predefined schema.                                   | Schema-on-read; flexibility for schema changes.
Concurrency  | Supports multiple concurrent transactions.                      | Primarily designed for batch processing.
Use Cases    | OLTP (Online Transaction Processing), banking, inventory mgmt. | Big Data analytics, batch processing, log analysis, machine learning.
Big Data Analytics Pipeline
• The Big Data Analytics Pipeline refers to the end-to-end process of managing, processing, and analyzing large datasets to extract actionable insights.
• It involves a series of interconnected stages, tools, and technologies that handle the complexities of big data (see the sketch below).
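
As a concrete (and deliberately tiny) illustration of the manage-process-analyze flow, the sketch below implements extract, transform, and load steps in plain Python; the file names and the "amount" field are assumptions for the example, not part of these slides.

```python
# A minimal extract-transform-load (ETL) sketch in plain Python.
# Source, field names, and file paths are illustrative assumptions.
import csv
import json

def extract(path):
    """Extract: read raw records from a CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Transform: clean records (drop incomplete rows, normalize a
    hypothetical 'amount' field to float)."""
    cleaned = []
    for r in records:
        if not r.get("amount"):
            continue  # skip incomplete rows
        r["amount"] = float(r["amount"])
        cleaned.append(r)
    return cleaned

def load(records, path):
    """Load: write transformed records to a JSON sink (a stand-in for
    a data lake or warehouse)."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    load(transform(extract("sales.csv")), "sales_clean.json")
```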
Why is the Big Data Analytics Pipeline Important?
• Efficiency: Automates and streamlines the
management of massive datasets.
• Scalability: Handles growing volumes, velocity, and
variety of data.
• Actionable Insights: Transforms raw data into
valuable information for business decisions.
• Cost-Effectiveness: Uses distributed systems and
open-source tools to manage costs.
[Slide figure: a data pipeline architecture diagram. Data sources (IoT devices, SQL databases, web servers, APIs) feed data pipelines that extract, load, and transform data into a data lake and a data warehouse in the analytical data plane, which in turn serve ML, data science, analysis, and dashboards.]
Hadoop Distributions
Year           | Event
2006 – present | Apache Hadoop launched.
2008 – present | Cloudera founded.
2009 – 2020    | MapR launched; AWS EMR introduced.
2011 – 2019    | Hortonworks founded.
2013 – present | Microsoft Azure HDInsight released.
2016 – present | Google Cloud Dataproc introduced.
2019           | Cloudera and Hortonworks merge.
2020           | MapR acquired by HPE.
Google File System
[Slide figure: the Google File System, the design that inspired HDFS.]
HDFS – Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is a core component of the Hadoop ecosystem.
• It is a scalable and distributed file system designed to store and manage large datasets across a cluster of machines.
HDFS replication factor
• Defines how many copies (replicas) of each block are
stored in the cluster.
• Default: 3 (can be configured based on fault tolerance
needs).
• Block Size:
• Default size: 128 MB (can vary, e.g., 64 MB or 256 MB).
• A large file is split into blocks of this size.
• Data Distribution:
• HDFS splits files into blocks, replicates them, and distributes
them across the cluster.
• Each replica is stored on a different DataNode for
redundancy.
Example Calculations
• Scenario
• File size = 600 MB
• Block Size = 128 MB
• Replication Factor = 3
• Cluster has 5 DataNodes

• Step 1: Calculate the Number of Blocks
• Number of Blocks = ⌈File Size / Block Size⌉ = ⌈600 / 128⌉ = ⌈4.69⌉ = 5 blocks (four full 128 MB blocks plus one 88 MB block)
Example Calculations
• Step 2: Calculate Total Data Stored (including Replicas)
• HDFS stores each block Replication Factor times:
• Total Data Stored = 600 MB × 3 = 1800 MB

• Step 3: Data Distribution Across DataNodes
• Block 1: Replica 1 (Node A), Replica 2 (Node B), Replica 3 (Node C)
• Block 2: Replica 1 (Node B), Replica 2 (Node C), Replica 3 (Node D)
• Block 3: Replica 1 (Node C), Replica 2 (Node D), Replica 3 (Node E)
• Block 4: Replica 1 (Node D), Replica 2 (Node E), Replica 3 (Node A)
• Block 5: Replica 1 (Node E), Replica 2 (Node A), Replica 3 (Node B)

A short script reproducing these numbers follows.
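
The same arithmetic can be checked with a short script. The round-robin placement below mirrors the table above; real HDFS placement is rack-aware, so this is only an approximation.

```python
# Reproduces the worked example above: block count, total stored data,
# and a simple round-robin replica placement (a simplification of
# HDFS's rack-aware placement policy).
import math

FILE_SIZE_MB = 600
BLOCK_SIZE_MB = 128
REPLICATION = 3
NODES = ["A", "B", "C", "D", "E"]

num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)  # ceil(4.69) = 5
total_stored = FILE_SIZE_MB * REPLICATION             # 600 * 3 = 1800 MB

print(f"Blocks: {num_blocks}, total stored: {total_stored} MB")

for b in range(num_blocks):
    # Each block's first replica starts one node later than the last.
    replicas = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
    print(f"Block {b + 1}: replicas on nodes {replicas}")
```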
HDFS – Key Features
[Slide figures: summary of HDFS key features.]
HDFS – Components
• NameNode
• Role: The NameNode is the master server in HDFS.
• Responsibilities:
• Manages the file system namespace (e.g., directory tree, file metadata).
• Tracks the location of data blocks across DataNodes.
• Handles client requests for file operations (e.g., read, write, delete).
• DataNode
• Role: The DataNode is the slave node in HDFS.
• Responsibilities:
• Stores the actual data blocks.
• Performs read/write operations as instructed by the NameNode.
• Sends periodic heartbeats and block reports to the NameNode to confirm its
health and block status.
HDFS – Components
• Secondary NameNode
• Role: The Secondary NameNode is not a backup
NameNode but assists the primary NameNode.
• Responsibilities:
• Periodically merges the fsimage (file system metadata snapshot) and
edit logs (recent changes) to reduce the NameNode's startup time.
• Helps in checkpointing the file system metadata.
• Checkpoint Node
• Role: Similar to the Secondary NameNode, but with more
flexibility.
• Responsibilities:
• Periodically creates checkpoints of the file system metadata.
• Can be configured to run on a separate machine.
HDFS – Components
• Backup Node
• Role: Provides a more up-to-date backup of the NameNode's
metadata.
• Responsibilities:
• Maintains an in-memory, up-to-date copy of the file system
namespace.
• Does not require merging of fsimage and edit logs.
• HDFS Client
• Role: The interface through which users interact with HDFS (see the sketch below).
• Responsibilities:
• Communicates with the NameNode for metadata operations (e.g., file creation, deletion).
• Communicates with DataNodes for read/write operations.
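
As an illustration of the client's two conversations (metadata with the NameNode, data with DataNodes), here is a minimal sketch using the third-party hdfs Python package over WebHDFS; the host name, user, and paths are assumptions for the example.

```python
# A minimal HDFS client sketch using the third-party `hdfs` package
# (WebHDFS). The NameNode address, user, and file paths below are
# illustrative assumptions.
from hdfs import InsecureClient

# The WebHDFS endpoint is served by the NameNode (default port 9870).
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Metadata operations go to the NameNode.
print(client.list("/"))  # list the root directory

# Reads and writes are streamed to/from DataNodes.
client.upload("/data/sales.csv", "sales.csv")  # copy a local file into HDFS
with client.read("/data/sales.csv") as reader:
    data = reader.read()
print(data[:100])  # first 100 bytes of the file
```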
HDFS – Components
• Block
• Role: The smallest unit of data storage in HDFS.
• Default block size: 128 MB (configurable).
• Files larger than the block size are split into multiple blocks.
• Each block is replicated across multiple DataNodes (default replication
factor: 3).
HDFS – Hadoop Distributed File System
[Slide figure: HDFS architecture diagram.]
HDFS High Availability (HA) architecture
• Designed to eliminate the single point of failure in
Hadoop's NameNode.
• In a traditional HDFS setup, the NameNode is a critical
component, and its failure can render the entire cluster
unavailable.
• HDFS HA addresses this by introducing two
NameNodes in an active-standby configuration,
ensuring continuous availability of the file system.
HDFS High Availability (HA) architecture

[Slide figure: Active and Standby NameNodes sharing JournalNodes, DataNodes, and ZooKeeper.]

Key Components of HDFS HA Architecture
• Active NameNode
• Handles all client requests and manages the file system namespace.
• Performs metadata operations (e.g., file creation, deletion, block allocation).
• Standby NameNode
• Acts as a hot standby for the Active NameNode.
• Synchronizes with the Active NameNode to maintain an up-to-date state.
• Takes over immediately if the Active NameNode fails.
• JournalNodes (JNs)
• A group of lightweight daemons that manage the edit logs (metadata changes).
• Both Active and Standby NameNodes communicate with JournalNodes to synchronize metadata.
• Typically, there are 3 or 5 JournalNodes for fault tolerance.
HDFS High Availability (HA) architecture
• ZooKeeper (ZK)
• A distributed coordination service used for:
• Failover detection: Detects when the Active NameNode fails.
• Leader election: Elects the Standby NameNode as the new Active NameNode (a toy simulation follows this slide).
• DataNodes:
• Report block locations and heartbeats to both NameNodes (Active and Standby).
• Ensure that both NameNodes have up-to-date information about the data blocks.
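
To illustrate the failover idea (not ZooKeeper's actual protocol), here is a toy heartbeat-and-promotion simulation in Python; all names and the timeout value are invented for the example.

```python
# A toy simulation of heartbeat-based failover: a monitor detects a
# missed heartbeat from the active node and promotes the standby.
# This sketches the idea only; it is not ZooKeeper's real protocol.
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before failover

class NameNode:
    def __init__(self, name, role):
        self.name, self.role = name, role
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

def monitor(active, standby):
    """Failover detection + leader election, greatly simplified."""
    if time.monotonic() - active.last_heartbeat > HEARTBEAT_TIMEOUT:
        active.role, standby.role = "failed", "active"
        print(f"{active.name} failed; {standby.name} promoted to active")
        return standby
    return active

nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")

nn1.heartbeat()                  # healthy heartbeat: no failover
assert monitor(nn1, nn2) is nn1

nn1.last_heartbeat -= 10         # simulate a crashed Active NameNode
assert monitor(nn1, nn2) is nn2  # standby takes over
```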
Advantages of HDFS HA
• Eliminates Single Point of Failure:
• The Standby NameNode ensures continuous availability in case
of Active NameNode failure.
• Fast Failover:
• Failover is automatic and typically completes within seconds.
• No Data Loss:
• Metadata is synchronized between the Active and Standby
NameNodes using JournalNodes.
• Improved Cluster Uptime:
• Ensures that the HDFS cluster remains operational even during
NameNode failures.
