HDFS:
Hadoop Distributed
File System
CIS 612
Sunnie Chung
Introduction
• What is Big Data?
– Very large volume
– Largely unstructured
• Many applications need to handle huge amounts of data (on the
order of 500+ TB per day)
• If a regular machine needs to transmit 1 TB of data through 4
channels, it takes about 43 minutes.
• What about 500 TB?
Hadoop 2
What is Hadoop?
• Framework for large-scale data processing
• Inspired by Google’s Architecture:
– Google File System (GFS) and MapReduce
• Open-source Apache project
– Nutch search engine project
– Apache Incubator
• Written in Java and shell scripts
Hadoop 3
Hadoop Distributed File System (HDFS)
• Storage unit of Hadoop
• Relies on the principles of a distributed file system
• HDFS has a Master-Slave architecture
• Main Components:
– Name Node : Master
– Data Node : Slave
• 3 replicas of each block by default
• Default block size: 128 MB (64 MB in Hadoop 1.x)
Hadoop 4
Hadoop Distributed File System (HDFS)
• Hadoop Distributed File System (HDFS)
– Runs entirely in userspace
– The file system is dynamically distributed across multiple
computers
– Allows for nodes to be added or removed easily
– Highly scalable in a horizontal fashion
• Hadoop Development Platform
– Uses a MapReduce model for working with data
– Users can program in Java, C++, and other languages
Hadoop 5
Why should I use Hadoop?
• Fault-tolerant hardware is expensive
• Hadoop is designed to run on commodity hardware
• Automatically handles data replication and deals with
node failure
• Does all the hard work so you can focus on processing
data
Hadoop 6
HDFS: Key Features
• Highly Fault Tolerant:
Automatic Failure Recovery System
• High aggregate throughput for streaming large files
• Supports replication and locality features
• Designed for systems with very large files (TB in size) that are
few in number
• Provides streaming access to file system data; especially good
for write-once, read-many files (for example, log files)
Hadoop 7
Hadoop Distributed File System (HDFS)
• Can be built out of commodity hardware; HDFS doesn't need
highly expensive storage devices
– Uses off-the-shelf hardware
• Rapid Elasticity
– Need more capacity, just assign some more nodes
– Scalable
– Can add or remove nodes with little effort or
reconfiguration
• Resistant to failure
– Individual node failure does not disrupt the system
Hadoop 8
Who uses Hadoop?
Hadoop 9
What features does Hadoop offer?
• API and implementation for working with
MapReduce
• Infrastructure
– Job configuration and efficient scheduling
– Web-based monitoring of cluster stats
– Handles failures in computation and data nodes
– Distributed File System optimized for huge amounts of
data
Hadoop 10
When should you choose Hadoop?
• Need to process a lot of unstructured data
• Processing needs are easily run in parallel
• Batch jobs are acceptable
• Access to lots of cheap commodity machines
Hadoop 11
When should you avoid Hadoop?
• Intense calculations with little or no data
• Processing cannot easily run in parallel
• Data is not self-contained
• Need interactive results
Hadoop 12
Hadoop Examples
• Hadoop would be a good choice for:
– Indexing log files
– Sorting vast amounts of data
– Image analysis
– Search engine optimization
– Analytics
• Hadoop would be a poor choice for:
– Calculating Pi to 1,000,000 digits
– Calculating Fibonacci sequences
– A general RDBMS replacement
Hadoop 13
Hadoop Distributed File System (HDFS)
• How does Hadoop work?
– Runs on top of multiple commodity systems
– A Hadoop cluster is composed of nodes
• One Master Node
• Many Slave Nodes
– Multiple nodes are used for storing data & processing
data
– The system abstracts the underlying hardware from
users/software
Hadoop 14
How HDFS works: Split Data
• Data copied into HDFS is split into blocks
• Typical HDFS block size is 128 MB
– (vs. 4 KB on typical UNIX file systems; worked example below)
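For example, a 1 GB file stored with the 128 MB block size occupies
ceil(1024 MB / 128 MB) = 8 blocks; the last block of a file may be
smaller than the configured block size.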
Hadoop 15
How HDFS works: Replication
• Each block is replicated to multiple machines
• This allows for node failure without data loss
Data Node 1: Block #1, Block #2
Data Node 2: Block #2, Block #3
Data Node 3: Block #1, Block #3
Hadoop 16
HDFS Architecture
Hadoop Distributed File System (HDFS)
• HDFS is a multi-node system
– Name Node (Master): single point of failure
– Data Node (Slave): failure tolerant (data replication)
• HDFS consists of data blocks
– Files are divided into data blocks
– Default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x)
– Default replication of blocks is 3
– Blocks are spread out over the Data Nodes
Hadoop 18
Hadoop Architecture Overview
[Cluster diagram: a Client submits jobs to the Job Tracker, which
coordinates Task Trackers; the Name Node manages the Data Nodes]
Hadoop 19
Hadoop Components: Job Tracker
Only one Job Tracker per cluster
Receives job requests submitted by client
Schedules and monitors Hadoop jobs on Task Trackers
Hadoop 20
Hadoop Components: Name Node
One active Name Node per cluster
Manages the file system namespace and metadata
Single point of failure: Good place to spend money on hardware
Hadoop 21
Name Node
• Master of HDFS
• Maintains and manages the metadata for the data stored on the
Data Nodes
• Highly reliable machine (can even use RAID)
• Expensive hardware
• Stores NO application data; holds only metadata!
• Secondary Name Node:
– Periodically reads the file system metadata from the Name
Node's RAM and persists it to disk
• Active & passive Name Nodes available from Gen2 (Hadoop 2.x)
Hadoop 22
Hadoop Components: Task Tracker
There are typically a lot of task trackers
Responsible for executing operations
Reads blocks of data from data nodes
Hadoop 23
Hadoop Components: Data Node
There are typically a lot of data nodes
Data nodes manage data blocks and serve them to clients
Data is replicated so failure is not a problem
Hadoop 24
Data Nodes
• Slaves in HDFS
• Provide data storage
• Deployed on independent machines
• Responsible for serving read/write requests from clients
• Data processing is done on the Data Nodes
Hadoop 25
HDFS Architecture
Hadoop 26
Hadoop Modes of Operation
Hadoop supports three modes of operation:
• Standalone
• Pseudo-Distributed
• Fully-Distributed
Hadoop 27
HDFS Operation
Hadoop 28
HDFS Operation
• Client makes a Write request to Name Node
• Name Node responds with the information about
on available data nodes and where data to be
written.
• Client write the data to the addressed Data Node.
• Replicas for all blocks are automatically created
by Data Pipeline.
• If Write fails, Data Node will notify the Client
and get new location to write.
• If Write Completed Successfully,
Acknowledgement is given to Client
• Non-Posted Write by Hadoop
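As an illustration of the write path above, here is a minimal client-side
sketch using the Hadoop FileSystem API; the Name Node address and file path
are hypothetical, and block placement/replication happen behind this API.

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical Name Node address (Hadoop 1.x style configuration key)
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/hello.txt");   // hypothetical HDFS path
        // create() asks the Name Node where to write; the Data Node pipeline
        // then replicates each block automatically
        try (OutputStream os = fs.create(out)) {
            os.write("Hello HDFS\n".getBytes("UTF-8"));
        }
        fs.close();
    }
}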
Hadoop 29
HDFS: File Write
Hadoop 30
HDFS: File Read
Hadoop 31
Hadoop Stack
• Hadoop Development Platform
– User-written code runs on the system
– The system appears to the user as a single entity
– The user does not need to worry about the distributed system
– Many systems can run on top of Hadoop
• Allows further abstraction from the system
Hadoop 32
Hadoop: Hive & HBase
Hive and HBase are layers on top of Hadoop
HBase & Hive are applications
Provide an interface to data on the HDFS
Other programs or applications may use Hive or HBase as an
intermediate layer
[Stack diagram: HBase and ZooKeeper layered on top of Hadoop/HDFS]
Hadoop 33
Hadoop: Hive
• Hive
– Data warehousing application
– SQL-like commands (HiveQL)
– Not a traditional relational database
– Scales horizontally with ease
– Supports massive amounts of data*
* Facebook has more than 15 PB of information stored in Hive and imports 60 TB each day (as of 2010)
Hadoop 34
Hadoop: HBase
• HBase
– No SQL-like query language
• Uses a custom Java API for working with data (see the client sketch below)
– Modeled after Google's BigTable
– Random read/write operations allowed
– Multiple concurrent read/write operations allowed
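A minimal sketch of the classic HBase Java client API; the table name,
column family, row key, and values here are hypothetical, and newer HBase
releases use a slightly different connection API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table "webtable" with column family "contents"
        HTable table = new HTable(conf, "webtable");
        Put put = new Put(Bytes.toBytes("row-001"));
        put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                Bytes.toBytes("<html>...</html>"));
        table.put(put);                                           // random write
        Result r = table.get(new Get(Bytes.toBytes("row-001")));  // random read
        System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
        table.close();
    }
}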
Hadoop 35
Hadoop MapReduce
Hadoop has its own implementation of MapReduce
Hadoop 1.0.4
API: http://hadoop.apache.org/docs/r1.0.4/api/
Tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Custom Serialization
• Data types implement Writable / WritableComparable
– Text vs String
– LongWritable vs long
– IntWritable vs int
– DoubleWritable vs double
Hadoop 36
Structure of a Hadoop Mapper (WordCount)
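The original slide shows the mapper as a figure; below is a minimal WordCount
mapper sketch using the old Hadoop 1.0.4 (org.apache.hadoop.mapred) API
referenced in these notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once per input record: key = byte offset of the line,
    // value = the line itself (TextInputFormat)
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, ONE);   // emit (word, 1)
        }
    }
}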
Hadoop 37
Structure of a Hadoop Reducer (WordCount)
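Likewise, a minimal WordCount reducer sketch using the same old
(org.apache.hadoop.mapred) API.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Called once per key with all of that key's values
    // (the grouped, sorted map output)
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));   // emit (word, total count)
    }
}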
Hadoop 38
Hadoop MapReduce
Working with Hadoop
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
A quick overview of Hadoop commands:
bin/start-all.sh
bin/stop-all.sh
bin/hadoop fs -put localSourcePath hdfsDestinationPath
bin/hadoop fs -get hdfsSourcePath localDestinationPath
bin/hadoop fs -rmr folderToDelete
bin/hadoop job -kill job_id
Running a Hadoop MR program:
bin/hadoop jar jarFileName.jar programToRun parm1 parm2 …
SS Chung CIS 612 Lecture Notes 39
Useful Application Sites
[1] Apache Hadoop Eclipse plug-in. http://wiki.apache.org/hadoop/EclipsePlugIn
[2] 10gen. MongoDB. http://www.mongodb.org/
[3] Apache. Cassandra. http://cassandra.apache.org/
[4] Apache. Hadoop. http://hadoop.apache.org/
[5] Apache. HBase. http://hbase.apache.org/
[6] Apache. Hive. http://hive.apache.org/
[7] Apache. Pig. http://pig.apache.org/
[8] Apache. ZooKeeper. http://zookeeper.apache.org/
Hadoop 40
How MapReduce Works in Hadoop
Lifecycle of a MapReduce Job
[Figure: a user program consisting of a Map function and a Reduce
function, run as a MapReduce job]
Lifecycle of a MapReduce Job
[Timeline: input splits are consumed by map waves 1 and 2, followed by
reduce waves 1 and 2]
Hadoop MR Job Interface:
Input Format
• The Hadoop MapReduce framework spawns one map task for each
InputSplit
• InputSplit: the input file is split into InputSplits (logical
splits, usually one per block, not physically split chunks) via
InputFormat.getSplits()
• The number of maps is usually driven by the total number of
blocks (InputSplits) of the input files:
with a 128 MB block size, a 10 TB input file yields about 82,000
maps (10 TB / 128 MB = 81,920 blocks)
Hadoop MR Job Interface:
map()
• The framework then calls
map(WritableComparable, Writable, OutputCollector, Reporter)
for each key/value pair (for TextInputFormat: key = byte offset of
the line, value = the line text) in the InputSplit for that task.
• Output pairs are collected with calls to
OutputCollector.collect(WritableComparable, Writable).
Hadoop MR Job Interface:
combiner()
• An optional Combiner can be set via
JobConf.setCombinerClass(Class)
• Performs local aggregation of the intermediate outputs of the
mapper (see the driver sketch below)
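A minimal WordCount driver sketch (old JobConf API) showing where the
combiner plugs in; it reuses the mapper and reducer classes sketched
earlier, and the input/output paths come from the command line.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);   // local aggregation of map outputs
        conf.setReducerClass(WordCountReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}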
Hadoop MR Job Interface:
Partitioner()
• Partitioner controls the partitioning of the keys of the
intermediate map-outputs.
• The key (or a subset of the key) is used to derive the
partition, typically by a hash function.
• The total number of partitions is the same as the number of
reduce tasks for the job
• HashPartitioner is the default Partitioner (a custom one is
sketched below)
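A hypothetical custom Partitioner sketch (old API): it routes keys by
their first character instead of hashing the whole key as HashPartitioner
does; it would be registered with
JobConf.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }

    // Derive the partition from (a subset of) the key: here, its first character
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? ' ' : Character.toLowerCase(s.charAt(0));
        return (first & Integer.MAX_VALUE) % numPartitions;   // non-negative index
    }
}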
Hadoop MR Job Interface:
reducer()
• Reducer has 3 primary phases:
1. Shuffle:
2. Sort
3. Reduce
Hadoop MR Job Interface:
reducer()
• Shuffle
Input to the Reducer is the sorted output of the mappers.
In this phase the framework fetches the relevant
partition of the output of all the mappers, via HTTP.
• Sort
The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this
stage.
• The shuffle and sort phases occur simultaneously;
while map-outputs are being fetched they are merged.
Hadoop MR Job Interface:
reducer()
• Reduce
The framework then calls
reduce(WritableComparable, Iterator, OutputCollector, Reporter)
method for each <key, (list of values)> pair in the
grouped inputs.
• The output of the reduce task is typically written to
the FileSystem via
OutputCollector.collect(WritableComparable, Writable).
MR Job Parameters
• Map parameters
– io.sort.mb
• Shuffle/Reduce parameters
– io.sort.factor
– mapred.inmem.merge.threshold
– mapred.job.shuffle.merge.percent
(example settings are sketched below)
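A small sketch of how these parameters can be set on a JobConf; the
numbers are illustrative only, since (as later slides show) good values
depend on the data, job, and cluster.

import org.apache.hadoop.mapred.JobConf;

public class SortShuffleTuning {
    static void apply(JobConf conf) {
        conf.setInt("io.sort.mb", 200);                           // map-side sort buffer size (MB)
        conf.setInt("io.sort.factor", 100);                       // streams merged at once when sorting
        conf.setInt("mapred.inmem.merge.threshold", 1000);        // map outputs held before in-memory merge
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f); // shuffle memory threshold for merging
    }
}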
Components in a Hadoop MR Workflow
Next few slides are from: http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
Job Submission
Initialization
Scheduling
Execution
Map Task
Sort Buffer
Reduce Tasks
Quick Overview of Other Topics
• Dealing with failures
• Hadoop Distributed FileSystem (HDFS)
• Optimizing a MapReduce job
Dealing with Failures and Slow Tasks
• What to do when a task fails?
– Try again (retries possible because of idempotence)
– Try again somewhere else
– Report failure
• What about slow tasks: stragglers
– Run another version of the same task in parallel. Take
results from the one that finishes first
– What are the pros and cons of this approach?
Fault tolerance is of
high priority in the
MapReduce framework
HDFS Architecture
Lifecycle of a MapReduce Job
[Timeline: input splits are consumed by map waves 1 and 2, followed by
reduce waves 1 and 2]
How are the number of splits, number of map and reduce
tasks, memory allocation to tasks, etc., determined?
Job Configuration Parameters
• 190+ parameters in
Hadoop
• Set manually or defaults
are used
Hadoop Job Configuration Parameters
Image source: http://www.jaso.co.kr/265
Tuning Hadoop Job Conf. Parameters
• Do their settings impact performance?
• What are ways to set these parameters?
– Defaults -- are they good enough?
– Best practices -- the best setting can depend on data, job, and
cluster properties
– Automatic setting
Experimental Setting
• Hadoop cluster on 1 master + 16 workers
• Each node:
– 2GHz AMD processor, 1.8GB RAM, 30GB local disk
– Relatively ill-provisioned!
– Xen VM running Debian Linux
– Max 4 concurrent maps & 2 reduces
• Maximum map wave size = 16x4 = 64
• Maximum reduce wave size = 16x2 = 32
• Not all users can run large Hadoop clusters:
– Can Hadoop be made competitive in the 10-25 node, multi GB
to TB data size range?
Parameters Varied in Experiments
Hadoop 50GB TeraSort
• Varying number of reduce tasks, number of concurrent sorted
streams for merging, and fraction of map-side sort buffer
devoted to metadata storage
Hadoop 50GB TeraSort
• Varying number of reduce tasks for different values
of the fraction of map-side sort buffer devoted to
metadata storage (with io.sort.factor = 500)
Hadoop 50GB TeraSort
• Varying number of reduce tasks for different values of
io.sort.factor (io.sort.record.percent = 0.05, default)
Hadoop 75GB TeraSort
• 1D projection for
io.sort.factor=500
Automatic Optimization? (Not yet in Hadoop)
[Timeline: shuffle overlaps with map waves 1-3, followed by reduce waves
1 and 2. What if the number of reduces is increased to 9? Then: map waves
1-3 followed by reduce waves 1-3]