Acropolis Institute of Technology and Research, Indore
www.acropolis.in
Data Analytics
By: Mr. Ronak Jain
Table of Contents
UNIT-III:
PROCESSING BIG DATA: Integrating disparate data stores, Mapping data to the programming framework, Connecting and extracting data from storage, Transforming data for processing, Subdividing data in preparation for Hadoop MapReduce.
Hadoop’s Developers: development timeline (1998, 2003, 2004, 2006, 2013)
Some Hadoop Milestones
• 2008 - Hadoop Wins Terabyte Sort Benchmark (sorted 1 terabyte
of data in 209 seconds, compared to previous record of 297 seconds)
Data Correctness
Use Checksums to validate data – CRC32
File Creation
Client computes checksum per 512 byte
DataNode stores the checksum
File Access
Client retrieves the data and checksum from DataNode
If validation fails, client tries other replicas
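As a rough illustration of this scheme, the sketch below computes one CRC32 value per 512-byte chunk of a buffer, the way a client checksums block data before writing it; the chunk size matches the HDFS default, but the class and the toy buffer are invented for illustration.

import java.util.Arrays;
import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512; // HDFS default chunk size

    // Compute one CRC32 checksum per 512-byte chunk of the block data.
    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int from = i * BYTES_PER_CHECKSUM;
            int to = Math.min(from + BYTES_PER_CHECKSUM, data.length);
            CRC32 crc = new CRC32();
            crc.update(data, from, to - from);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300]; // toy "block": 512 + 512 + 276 bytes = 3 chunks
        Arrays.fill(block, (byte) 7);
        long[] sums = checksums(block);
        // On read, the client recomputes each chunk's CRC32 and compares it with
        // the stored value; a mismatch means corruption, so the client falls
        // back to another replica of the block.
        System.out.println("chunks=" + sums.length + " first=" + Long.toHexString(sums[0]));
    }
}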
Data Pipelining
• The client retrieves from the NameNode a list of DataNodes on which to place replicas of a block
• The client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the client moves on to write the next block of the file
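From the client's point of view the whole pipeline is hidden behind an ordinary output stream. A minimal sketch using the Hadoop FileSystem client API; the NameNode address and file path are placeholders.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelinedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:9000" is a placeholder for the cluster's NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // Ask for 3 replicas; HDFS builds the DataNode pipeline for each block.
        FSDataOutputStream out = fs.create(new Path("/user/demo/example.txt"), (short) 3);
        for (int i = 0; i < 1000; i++) {
            out.write("some record\n".getBytes(StandardCharsets.UTF_8));
        }
        // close() returns once the last block's replicas are acknowledged.
        out.close();
        fs.close();
    }
}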
Goals of HDFS
• Very large distributed file system
o 10K nodes, 100 million files, 10 PB
• Assumes commodity hardware
o Files are replicated to handle hardware failure
o Detects failures and recovers from them
• Optimized for batch processing
o Data locations are exposed so that computations can move to where the data resides
o Provides very high aggregate bandwidth
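The "data locations exposed" point is visible in the client API: a program can ask where each block of a file lives and schedule work on those hosts. A minimal sketch; the NameNode address and path are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        Path p = new Path("/user/demo/example.txt");
        FileStatus st = fs.getFileStatus(p);
        // One entry per block, listing the DataNodes that hold a replica;
        // a scheduler can use these hosts to move computation to the data.
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}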
NameNode Metadata
• Metadata in memory
o The entire metadata is kept in main memory
o No demand paging of metadata
• Types of metadata
o List of files
o List of blocks for each file
o List of DataNodes for each block
o File attributes, e.g. creation time, replication factor
• A transaction log
o Records file creations, file deletions, etc.
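A toy picture of the in-memory structures this implies; all class and field names below are invented for illustration and are not Hadoop's actual INode types.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ToyNameNodeMeta {
    // File path -> ordered list of block IDs making up the file.
    final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block ID -> DataNodes currently holding a replica of that block.
    final Map<Long, Set<String>> blockToDataNodes = new HashMap<>();
    // File path -> attributes such as replication factor and creation time.
    final Map<String, FileAttrs> attrs = new HashMap<>();

    static class FileAttrs {
        final short replication;
        final long creationTime;
        FileAttrs(short replication, long creationTime) {
            this.replication = replication;
            this.creationTime = creationTime;
        }
    }
}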
DataNode
• A block server
o Stores data in the local file system (e.g. ext3)
o Stores metadata of a block (e.g. CRC)
o Serves data and metadata to clients
• Block report
o Periodically sends a report of all existing blocks to the NameNode
• Facilitates pipelining of data
o Forwards data to other specified DataNodes
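A toy sketch of what a block report amounts to: a full inventory of locally stored blocks. The types and names here are invented for illustration; the real exchange goes through Hadoop's DatanodeProtocol.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToyBlockReport {
    // One entry per block stored locally: block ID and its length in bytes.
    static List<String> buildReport(Map<Long, Long> localBlocks) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<Long, Long> e : localBlocks.entrySet()) {
            report.add("blk_" + e.getKey() + " len=" + e.getValue());
        }
        return report;
    }

    public static void main(String[] args) {
        // The DataNode sends this inventory periodically, letting the NameNode
        // rebuild its block-to-DataNode map, e.g. after a restart.
        System.out.println(buildReport(Map.of(1001L, 134217728L, 1002L, 524288L)));
    }
}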
Block Placement
• Current strategy
o One replica on the local node
o Second replica on a remote rack
o Third replica on the same remote rack
o Additional replicas are randomly placed
• Clients read from the nearest replica
• The goal is to make this placement policy pluggable
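A simplified sketch of that default rule; the Node type and cluster list are invented for illustration, while in real Hadoop the pluggable hook is the BlockPlacementPolicy class.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ToyPlacement {
    static class Node {
        final String name, rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    // Choose three replica targets: the writer's own node, a random node on a
    // different rack, and a second node on that same remote rack.
    // Assumes the chosen remote rack holds at least two nodes.
    static List<Node> chooseTargets(Node writer, List<Node> cluster, Random rnd) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer); // 1st replica: local node

        List<Node> offRack = new ArrayList<>();
        for (Node n : cluster) {
            if (!n.rack.equals(writer.rack)) offRack.add(n);
        }
        Node second = offRack.get(rnd.nextInt(offRack.size())); // 2nd: remote rack
        targets.add(second);

        List<Node> sameRemoteRack = new ArrayList<>();
        for (Node n : offRack) {
            if (n.rack.equals(second.rack) && n != second) sameRemoteRack.add(n);
        }
        targets.add(sameRemoteRack.get(rnd.nextInt(sameRemoteRack.size()))); // 3rd: same rack as 2nd
        return targets;
    }
}

Keeping the second and third replicas on one remote rack limits cross-rack write traffic to a single transfer, while still surviving the loss of an entire rack.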
Heartbeats