Big Data Unit 4 Own

The document provides an overview of the MapReduce programming model, detailing how data is stored in HDFS, how MapReduce jobs are executed, and the roles of Job Tracker and Task Trackers. It also covers Hadoop Streaming and Pipes for writing MapReduce programs in various languages, the architecture and design of HDFS, and the process of reading and writing files in HDFS. Additionally, it discusses the heartbeat mechanism for maintaining node health and the roles of the Combiner, Shuffler, and Sorter in MapReduce.


1) Data Flow in the MapReduce Programming Model (Scaling Out)

1. Storing Data in HDFS


o To handle large-scale data, Hadoop stores files in HDFS (Hadoop
Distributed File System).
o This allows Hadoop to process data by sending computation to the machines
where the data is stored.
2. What is a MapReduce Job?
o A MapReduce job is a unit of work that a client asks Hadoop to perform.
o It includes:
 Input data (the data to be processed).
 MapReduce program (logic for processing data).
 Configuration settings.
3. How Hadoop Executes a Job
o Hadoop splits a job into smaller tasks.
o There are two types of tasks:
 Map tasks – Process chunks of data.
 Reduce tasks – Combine and summarize processed data.
4. Role of Job Tracker and Task Trackers
o Job Tracker: Manages the execution of jobs by assigning tasks to machines.
o Task Tracker: Runs the assigned tasks and reports progress to the Job
Tracker.
5. How Data is Processed
o Hadoop divides input data into input splits.
o Each Map task processes one split.
6. Data Locality Optimization
o Hadoop tries to run Map tasks on the same machine where the data is
stored.
o This avoids unnecessary data transfer and saves network bandwidth.
o If tasks are assigned randomly, data must be copied between nodes, slowing
down processing.
7. Where is Map Task Output Stored?
o The Map task’s output is stored on local disk, not in HDFS.
o This is because it is temporary and only needed for the Reduce task.
o If a node fails before the Reduce task uses the output, Hadoop reruns the Map
task on another machine.
8. Reduce Task Execution
o The number of Reduce tasks is not dependent on input size.
o When there are multiple Reduce tasks, each Map task partitions its output so
that similar data goes to the same Reduce task.
9. Combiner Function (Optimization)
o A Combiner function can be used to reduce data before sending it to the
Reduce task.
o However, Hadoop does not guarantee how many times (zero, one, or many) it will
run the Combiner, so it must not change the final result. A sketch of a complete
job that registers a Combiner follows this list.
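
To make the steps above concrete, here is a minimal word-count sketch in Java using the standard org.apache.hadoop.mapreduce API. The class names and the command-line input/output paths are illustrative; the Reducer is reused as the Combiner, which is safe because addition is associative and commutative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit (word, 1) for every word in its input split.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: sum the counts for each word. The same class can act as the
    // Combiner because summing partial counts does not change the final result.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
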
2) Hadoop Streaming

What is Hadoop Streaming?

 Hadoop Streaming is a tool that lets you write MapReduce programs in any language (not
just Java).
 It uses UNIX standard streams (input/output) to communicate between Hadoop and your
program.
 It comes built-in with the Hadoop distribution.

How Does It Work?

 Hadoop Streaming processes text data efficiently.


 It treats each line as a key-value pair, separated by a tab (\t).
 The Reduce function reads sorted input lines and writes the final output.
 It is useful for real-time data analysis when combined with tools like Apache Spark and
Kafka.

Key Features of Hadoop Streaming

1. You can write MapReduce programs in languages like Python, Perl, or C++ (not just Java).
2. Hadoop Streaming monitors job progress and provides logs for debugging.
3. It supports scalability, flexibility, and security just like regular MapReduce jobs.
4. It is easy to develop and requires minimal coding effort.

o The Mapper reads input data from InputReader/Format in the form of key-
value pairs.
o The Mapper processes the data based on the logic written in the code.
o The processed data is then passed through the Reduce stream.
o The Reducer performs data aggregation on the intermediate data.
o The final processed data is released as output.
o Both Map and Reduce functions read input from STDIN (Standard Input)
and write output to STDOUT
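
For example, a Streaming job might be submitted from the command line roughly as follows. This is a sketch: the streaming jar location and the mapper/reducer script names are assumptions that vary by installation.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py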

Hadoop Pipes

What are Hadoop Pipes?

 Hadoop Pipes is a C++ interface for Hadoop MapReduce.


 Unlike Hadoop Streaming, which uses standard input/output, Pipes uses sockets to
communicate between Hadoop and C++ programs.

How Does It Work?

 The Task Tracker communicates with the C++ MapReduce process using sockets.
 This allows higher performance for tasks like numerical calculations in C++.

Execution of Streaming and Pipes

 Hadoop Pipes creates a persistent socket connection between:


o Java Pipes task (on one side)
o External C++ process (on the other side)
 This setup improves efficiency for compute-heavy applications.

Alternatives to Pipes

 Pydoop (for Python) and C wrappers are available as alternatives.


 These often use JNI (Java Native Interface) to interact with Hadoop.

When to Use Higher-Level Tools?

 MapReduce is often part of a larger workflow.


 Tools like Pig, Hive, and Cascading help simplify data processing and transformations.

3) HDFS Design and Architecture

What is HDFS?

 HDFS (Hadoop Distributed File System) is a distributed file system designed to store and
manage large amounts of data.
 It runs on commodity hardware (regular, low-cost servers).

How HDFS Works?

 HDFS separates metadata and actual data.


 It follows a Master-Slave architecture:
o NameNode (Master): Manages metadata (file locations, permissions, etc.).
o DataNodes (Slaves): Store actual data and handle read/write operations.
 All nodes communicate using TCP-based protocols.

Key Features of HDFS

1. Handles Large Data Sets – HDFS is designed to scale across hundreds of nodes.
2. Block-Based Storage – Files are divided into blocks (default size: 128 MB, configurable).
3. Fault Tolerance & Recovery – Data is replicated across multiple nodes to prevent data loss.
4. Hierarchical File Organization – Similar to traditional file systems (directories, file creation,
deletion, etc.).
5. Supports Commodity Hardware – No need for expensive machines; runs on low-cost
servers.

HDFS Design Challenges

1. Commodity Hardware: Uses cheap hardware to reduce costs.


2. Streaming Data Access: Data is mainly written once and read multiple times for efficiency.
3. Single Writer: Only one process can write to a file at a time, and it can only append data.
4. Latency: HDFS is optimized for high throughput rather than low-latency access, so
applications that need millisecond response times are a poor fit.
5. Small Files: Because the NameNode keeps all metadata in memory, very large numbers of
small files are hard to handle; HDFS works best with a smaller number of large files.

HDFS Goals

1. Manage Large Datasets – Handles huge amounts of data efficiently.


2. Fault Detection & Recovery – Automatically detects failures and recovers lost data.
3. Hardware Efficiency – Reduces network traffic and improves processing speed.

Hadoop Architecture

Components of Hadoop Architecture

1. NameNode (Master Node)

 Manages metadata (information about file locations, directories, and blocks).
 Stores two important files:
1. fsimage: Snapshot of the file system when NameNode starts.
2. Edit logs: Records changes made to the file system after NameNode starts.
 Problems with NameNode:
o If edit logs grow too large, they become difficult to manage.
o Restarting NameNode takes a long time because all changes must be merged.
o If NameNode crashes, old fsimage may cause metadata loss.

2. DataNode (Slave Nodes)

 Stores actual data blocks as per instructions from NameNode.


 Handles read and write requests from clients.
 Replicates data to maintain fault tolerance.

3. Secondary NameNode (Checkpoint Node)

 Helps manage NameNode issues by merging edit logs with fsimage at regular intervals.
 Steps in Secondary NameNode working:
1. Collects edit logs from NameNode at regular intervals.
2. Applies them to fsimage to create an updated version.
3. Sends the updated fsimage back to NameNode, reducing restart time.
 It is not a backup node but acts as a helper node to improve performance.

HDFS Block

What is an HDFS Block?

 HDFS is a block-structured file system, meaning files are divided into blocks before storing.
 The default block size is 128 MB in Hadoop 2.x and later (64 MB in older Hadoop 1.x
releases), and it can be changed as needed.

If a DataNode fails, the block is automatically copied to another node.

HDFS Components and Functionality

 Replication: Blocks are replicated across multiple nodes to prevent data loss.
 Heartbeat Signals: DataNodes send signals to NameNode to stay synchronized.
MapReduce and HDFS

 Job Tracker (Master Node):


o Receives job requests from clients.
o Assigns tasks to Task Trackers.
o Prefers assigning tasks to nodes where data is already stored (data locality).
 Task Trackers (Slave Nodes):
o Execute the assigned tasks.
o If a node fails, the job is reassigned to another node with the data copy.

Command Line Interface (CLI) for HDFS

 You can interact with HDFS using command-line commands.


 Commands start with hadoop fs.
 Example: To list files in a directory →
 hadoop fs -ls
 General command format:
 hadoop fs -<command>
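
A few other commonly used commands follow the same pattern (the paths shown are illustrative):
 hadoop fs -mkdir /user/hadoop/data → create a directory
 hadoop fs -put local.txt /user/hadoop/data → copy a local file into HDFS
 hadoop fs -cat /user/hadoop/data/local.txt → print a file's contents
 hadoop fs -rm /user/hadoop/data/local.txt → delete a file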

4) Java Interface in Hadoop File System

Hadoop is written in Java, so most of its file system interactions happen through Java APIs.
The FileSystem class in Java helps manage file operations in Hadoop.

1. Reading Data from HDFS Using a Hadoop URL

One way to read a file from HDFS is by using Java's URL class:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // Process the input stream
} finally {
    IOUtils.closeStream(in);
}

 The URL.openStream() method helps fetch data from an HDFS location.


 Java recognizes Hadoop’s hdfs:// URL scheme through FsUrlStreamHandlerFactory.
 The URL.setURLStreamHandlerFactory() method registers this factory so HDFS URLs are
handled correctly, but it can only be called once per JVM.

Example: Displaying an HDFS File on the Console


import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

What this program does?

 Reads a file from HDFS.


 Displays the file’s content on the console.
 Uses IOUtils.copyBytes() to copy the data.

2. Reading Data Using the FileSystem API

If setting URLStreamHandlerFactory is not possible, we use the Hadoop FileSystem API.

 A file in HDFS is represented by a Path object.


 Two ways to get a FileSystem instance:
 public static FileSystem get(Configuration conf) throws IOException
 public static FileSystem get(URI uri, Configuration conf) throws IOException
 FileSystem.get(conf) finds the correct file system using configuration settings
(core-site.xml).

Example: Opening a File Using FileSystem API


FileSystem fs = FileSystem.get(new Configuration());
FSDataInputStream in = fs.open(new Path("hdfs://host/path"));

 FSDataInputStream is returned instead of a normal Java InputStream.


 It allows random access, so you can read from any position in the file.
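
As a small illustration of that random access (path assumed, error handling omitted), FSDataInputStream.seek() lets you jump back and re-read the file:

FileSystem fs = FileSystem.get(new Configuration());
FSDataInputStream in = fs.open(new Path("hdfs://host/path"));
IOUtils.copyBytes(in, System.out, 4096, false); // stream the whole file once
in.seek(0);                                     // jump back to the start of the file
IOUtils.copyBytes(in, System.out, 4096, false); // stream it a second time
IOUtils.closeStream(in);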

3. Writing Data to HDFS

To create and write a file in HDFS, we use the create() method:

FSDataOutputStream out = fs.create(new Path("hdfs://host/path"));


out.writeBytes("Hello, Hadoop!");
out.close();

 This creates a file and writes data into it.

Appending Data to an Existing File


FSDataOutputStream out = fs.append(new Path("hdfs://host/path"));
out.writeBytes("Appending more data!");
out.close();

 Append operation lets you modify an existing file.


 However, not all Hadoop file systems support appending.

5) Data Flow (How a File is Read and Written in HDFS)

1. How a File is Read in HDFS (Anatomy of a File Read)

Step-by-Step Process:

1. Client requests a file


o The client calls open() on the FileSystem object to read a file.
o HDFS contacts the NameNode to find where the file’s blocks are stored.
o The NameNode replies with a list of DataNodes that store copies of the blocks.
2. Client selects the nearest DataNode
o If the client is itself a DataNode, it reads from the local copy.
o Otherwise, it chooses the closest DataNode based on network location.
3. Client starts reading
o HDFS gives the client an FSDataInputStream, which handles communication.
o The client calls read(), and data starts streaming from the DataNode.
4. Reading happens in blocks
o The client reads one block at a time.
o When a block is finished, HDFS automatically connects to the next DataNode for the
next block.
5. Error Handling
o If one DataNode fails, the client will read from the next closest DataNode.
6. Closing the file
o Once reading is finished, the client calls close() to release resources.
2. How a File is Written in HDFS (Anatomy of a File Write)

Step-by-Step Process:

1. Client requests to create a file


o The client calls create() on the DistributedFileSystem.
o HDFS makes an RPC call to the NameNode to create the file entry.
2. Data is split into packets
o As the client writes, HDFS breaks the data into small packets.
o These packets are stored in an internal queue called the data queue.
3. Data is streamed to DataNodes
o A DataStreamer reads from the data queue and sends packets to a DataNode.
o The first DataNode stores the packet and forwards it to the second DataNode in the
pipeline.
4. Acknowledgment queue
o HDFS maintains an acknowledgment queue, which tracks packets waiting for
confirmation from DataNodes.
5. File is closed
o When writing is complete, the client calls close(), and the file is finalized in HDFS.

6) Heartbeat Mechanism in HDFS

What is a Heartbeat?

 A heartbeat is a signal sent by DataNodes to the NameNode and by TaskTrackers to the


JobTracker to confirm they are alive.
 Without a heartbeat, the system assumes the node has failed and takes action.
How Heartbeats Work in HDFS

1. DataNodes send heartbeats to the NameNode every 3 seconds.


2. The heartbeat contains important information, including:
o Whether the node is active.
o How much storage is available.
o How many data transfers are happening.
3. If a DataNode does not send a heartbeat for 10 minutes, the NameNode marks it as dead.
4. The NameNode then creates new replicas of lost blocks on other healthy DataNodes.
5. Heartbeats also carry instructions from the NameNode, such as:
o Replicating blocks to other DataNodes.
o Deleting blocks that are no longer needed.
o Shutting down a node if necessary.
o Sending an immediate block report for auditing.

This system helps HDFS maintain high availability and reliability.

MapReduce: Role of Sorter, Shuffler, and Combiner

1. Combiner (Mini Reducer)

 A combiner is like a mini reducer that helps optimize performance.


 It groups and summarizes data locally on the Mapper before sending it to the Reducer.
 This reduces the amount of data transferred over the network, improving efficiency.

2. Shuffling (Moving Data from Mapper to Reducer)

 The shuffling phase moves data from the Mapper to the Reducer.
 It groups all values with the same key and ensures they reach the correct Reducer.
 Without shuffling, Reducers would not get any input!

3. Sorting (Arranging Data for Reducers)

 Before the Reducer processes data, Hadoop sorts the mapped output.
 Data is sorted by key, so that each Reducer gets all the related data in order.
 The shuffling and sorting phases happen at the same time.
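
Which Reducer a given key is shuffled to is decided by a Partitioner (a hash of the key by default). A hedged sketch of a custom one, with an invented class name, is shown here; it would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Send all keys that start with the same (lower-cased) letter to the same Reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        if (s.isEmpty()) {
            return 0;
        }
        return Character.toLowerCase(s.charAt(0)) % numPartitions;
    }
}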

Hadoop I/O (Input & Output in Hadoop)

 Hadoop works with huge amounts of data (terabytes or more).


 It has a special input/output system to handle large-scale data efficiently.

7) Data Integrity and Hadoop Local File System in HDFS

What is Data Integrity?

 Data integrity means that data remains accurate, consistent, and unchanged throughout
storage, processing, and retrieval.
 Data can sometimes get corrupted due to errors during disk operations or network
transfers.

How is Data Corruption Detected?

 A checksum is used to detect errors.


 A checksum is a special code generated from the data that helps verify its correctness.
 HDFS uses CRC-32 (a 32-bit checksum) to check for errors.

Data Integrity in HDFS

How HDFS Ensures Data Integrity

1. Checksums for Every Block


o HDFS automatically calculates checksums for all data stored in it.
o The checksum is stored separately from the actual data.
o A checksum is computed for every 512 bytes of data by default, and each CRC-32
checksum is only 4 bytes long.
2. Data Verification Before Storage
o When data is sent to a DataNode, the checksum is verified before storage.
o If corruption is detected, an error message (ChecksumException) is sent to the
client.
3. Data Verification During Reads
o When a client reads data from HDFS, the DataNode checks the checksum again.
o A log is maintained to track verified blocks.
4. Periodic Background Verification
o A process called DataBlockScanner runs in the background to check stored data for
corruption.
5. Self-Healing Mechanism
o Since HDFS stores multiple copies (replicas) of each block, it can replace a
corrupted block with a good copy from another DataNode.
o If a client finds a corrupt block, it reports it to the NameNode.
o The NameNode then:
 Marks the block as corrupt.
 Copies a healthy version of the block to a new DataNode.
 Deletes the corrupt block once a new copy is created.
6. Turning Off Checksum Verification
o You can disable checksum verification by using setVerifyChecksum(false), but
this is not recommended for data safety.
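
A brief client-side sketch (path invented) of what disabling verification looks like; it should only be used when trying to salvage data from a corrupt file:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.setVerifyChecksum(false); // skip client-side checksum verification on reads
FSDataInputStream in = fs.open(new Path("hdfs://host/possibly-corrupt-file"));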

Hadoop Local File System and Checksum Mechanism

How Checksum Works in the Local File System

1. The Hadoop Local File System (used when running Hadoop on a single machine) also uses
checksums to detect errors.
2. When a file is created, a hidden checksum file (.crc file) is also created in the background.
3. Each chunk of 512 bytes has its own checksum stored in the .crc file.
4. If the file is modified or corrupted, an error is thrown.

Handling Corrupted Files in the Local File System

 If a checksum error occurs, the system moves the offending file to a side directory
named bad_files.
 The system administrator must manually check and fix the corrupted files.

8) Apache Avro and File-Based Data Structures (Simplified)

What is Avro?

 Avro is a data serialization system that converts data into a format that can be stored and
transferred efficiently.
 It allows data to be written and read in different programming languages.
 Avro uses a schema (a predefined data structure) to serialize (convert) and deserialize
(read) data.
 The schema is written in JSON, making it easy to understand and modify.
Features of Avro

✔Stores Schema with Data: Avro saves the data along with its schema in a single file. This
makes it self-describing.
✔Supports Compression: Avro files can be compressed to save space.
✔Splitable: Large Avro files can be split into smaller parts, making them efficient for
MapReduce processing.
✔Supports Schema Evolution: The schema used for reading does not need to match the
one used for writing, as long as certain rules are followed.

Data Types in Avro

Avro supports two types of data:

1. Primitive Types: Basic data types like string, int, boolean, float, double, and bytes.
o Example: { "type": "string" } for a text field.
2. Complex Types: More advanced data structures like:
o Records (similar to structs or objects)
o Enums (fixed sets of values)
o Arrays (lists of values)
o Maps (key-value pairs)
o Unions (multiple possible types for a value)
o Fixed (fixed-size binary values)
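
For illustration, a record schema (field names invented) that combines a primitive type, an array, and a union allowing a null value might look like this:

{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "emails", "type": { "type": "array", "items": "string" } },
    { "name": "age", "type": ["null", "int"], "default": null }
  ]
}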

Avro Data File Structure

An Avro data file contains:

1. Header
o Stores metadata, including the schema and a unique sync marker.
2. Data Blocks
o Contains the actual data serialized in Avro format.
o Blocks are separated by a sync marker to allow quick resynchronization when
reading large files.
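
A minimal Java sketch (using the org.apache.avro library and the User schema above; file names are illustrative) of writing and then reading such a container file:

import java.io.File;
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroUserFile {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));

        // Write: create() emits the header (schema + sync marker), then records are appended.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("emails", Arrays.asList("alice@example.com"));
        user.put("age", 30);

        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("users.avro"));
        writer.append(user);
        writer.close();

        // Read: the schema comes from the file header, so the file is self-describing.
        DataFileReader<GenericRecord> reader =
                new DataFileReader<>(new File("users.avro"), new GenericDatumReader<GenericRecord>());
        for (GenericRecord r : reader) {
            System.out.println(r.get("name"));
        }
        reader.close();
    }
}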

File-Based Data Structures in Hadoop

Hadoop supports different file formats, including:

1. Text Files

 The simplest format, stores data in plain text.


 Used for small datasets but not efficient for big data.

2. Binary Files (Sequence Files)

 A flat file format that stores data in key-value pairs.


 Used internally by MapReduce for processing.
 Supports compression to save storage space.

Types of Sequence Files (Based on Compression)

1. Uncompressed: No compression, takes more space.


2. Record Compressed: Compresses each individual record separately.
3. Block Compressed: Compresses multiple records together (more efficient).

Reading and Writing Sequence Files

 Writing: Use createWriter() to write data into a SequenceFile.


 Reading: Use SequenceFile.Reader to read the records one by one.
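
A hedged Java sketch of that write-then-read cycle (file name and key/value types assumed), using the long-standing SequenceFile API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("numbers.seq");

        IntWritable key = new IntWritable();
        Text value = new Text();

        // Write key-value records into the sequence file.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
        for (int i = 0; i < 5; i++) {
            key.set(i);
            value.set("record-" + i);
            writer.append(key, value);
        }
        writer.close();

        // Read the records back one by one.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}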

Sequence File Structure

Structure of a Sequence File

A sequence file is a binary file format in Hadoop that stores key-value pairs. It consists of:

1. Header
o Starts with SEQ (magic number) to identify the file type.
o A version number to indicate the file format version.
o Metadata, including:
 Key and value class names
 Compression details (if any)
 Sync marker (used for easy data access)
2. Records
o Contains actual key-value data stored in sequence format.
o Sync markers are placed between records for quick access.

Cassandra-Hadoop Integration

Cassandra, a distributed NoSQL database, integrates with Hadoop to handle large-scale


data processing using tools like MapReduce, Pig, and Hive.

How Cassandra Works with Hadoop?

1. Reading Data into Hadoop


o ColumnFamilyInputFormat: Splits Cassandra data into small chunks and feeds them
to MapReduce tasks.
2. Writing Data from Hadoop to Cassandra
o ColumnFamilyOutputFormat: Writes MapReduce results back to Cassandra as
column family rows.
o Uses batch processing (lazy write-back caching) to improve performance.
3. Configuration Support
o ConfigHelper: Helps set up Cassandra-Hadoop configurations easily by preventing
manual errors in property names.
4. Bulk Loading for Faster Writes
o BulkOutputFormat: Streams data in binary format instead of inserting records one
by one, which speeds up data writing.
o Uses SSTableLoader for efficient data loading.

9) Compression

Compression helps in two main ways:

1. Saves Storage Space


2. Speeds Up Data Transfer

Common Compression Methods in Hadoop:

 Deflate
 Gzip – A good general-purpose codec; faster than Bzip2 but with a somewhat lower
compression ratio.
 Bzip2 – Compresses more effectively than Gzip, but both compression and decompression
are slower.
 LZO, LZ4, Snappy – Optimized for speed; much faster compression and decompression at
the cost of a lower compression ratio.

All these methods improve speed and storage optimization, but each has its advantages.

What is a Codec?

A codec is an algorithm used to compress and decompress data for storage or transmission.
Hadoop uses different codecs to handle various compression formats.
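
A hedged Java sketch (output path invented) of compressing on write and decompressing on read with a codec; CompressionCodecFactory infers the codec from the file extension.

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("data.txt.gz");

        // Compress while writing: wrap the raw HDFS stream with the codec's stream.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        OutputStream out = codec.createOutputStream(fs.create(path));
        out.write("compressed text\n".getBytes("UTF-8"));
        out.close();

        // Decompress while reading: pick the codec from the .gz extension.
        CompressionCodec readCodec = new CompressionCodecFactory(conf).getCodec(path);
        InputStream in = readCodec.createInputStream(fs.open(path));
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
    }
}
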
Serialization in Hadoop

Serialization is the process of converting data objects into bytes so they can be stored or
sent to another system.

 Deserialization is the reverse process, where the bytes are converted back into their original
data form.
 It makes data easier to store, transfer, and process.

Why is Serialization Important?

 Saves Object State.


 Data Exchange – Helps transfer data between applications.
 Makes it easy to handle complex objects.

There are two main uses of serialization:

1. Saves data in a structured format.


2. Used in Remote Procedure Calls (RPCs) where data is converted into binary, sent to another
node, and then converted back.

For RPC serialization, the format should be:

 Compact – Uses less bandwidth.


 Fast – Quick conversion between formats.
 Extensible – Can be modified when needed.
 Interoperable – Works across different programming languages.

Writable Interface in Hadoop

Hadoop has its own serialization format called Writable, which is:

 Fast
 Compact
 Written in Java

Key Methods in Writable Interface

 write(DataOutput out) – Writes the object to a binary stream.


 readFields(DataInput in) – Reads the object from a binary stream.

WritableComparable Interface

This is a sub-interface of both Writable and Comparable in Java. It allows objects to be


serialized and also compared for sorting in Hadoop.
Why is Writable Important?

 When data is sent from Mapper to Reducer, it goes through a shuffle and sort phase.
 If the keys are not comparable, this phase won’t work correctly.
 Writable ensures that data can be compared using a RawComparator.
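
A hedged sketch of a custom key type (class and field names invented) that implements WritableComparable, so it can be both serialized and sorted during shuffle-and-sort:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTemperature implements WritableComparable<YearTemperature> {
    private int year;
    private int temperature;

    // Serialize the fields to a binary stream.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    // Deserialize the fields in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temperature = in.readInt();
    }

    // Ordering used by the shuffle-and-sort phase: by year, then by temperature.
    @Override
    public int compareTo(YearTemperature other) {
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}

In practice such a key should also override hashCode(), so the default HashPartitioner spreads the keys evenly across Reducers.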

Hadoop Data Types (Writable Classes)

Hadoop provides Writable classes to handle different data types.

Primitive Writable Classes

These are used for basic Java data types:

 BooleanWritable
 ByteWritable
 IntWritable
 VIntWritable (Variable-length integer)
 FloatWritable
 LongWritable
 VLongWritable (Variable-length long)
 DoubleWritable

✅ The size of these types is the same as in Java (e.g., IntWritable = 4 bytes).

Other Writable Classes

 Text – Like Java’s String, but optimized for Hadoop (supports UTF-8, max 2GB).
 BytesWritable – Used for binary data.
 NullWritable – A placeholder with zero-length serialization.

Special Writable Classes

 ObjectWritable – Can handle Java primitives, strings, enums, and arrays.


 GenericWritable – A more general-purpose wrapper.
 ArrayWritable & TwoDArrayWritable – Store arrays of Writable objects.
 MapWritable & SortedMapWritable – Work like Java’s Map<> but for Writable types.
