Big Data Unit 4 Own
Hadoop Streaming
Hadoop Streaming is a tool that lets you write MapReduce programs in any language (not just Java).
It uses UNIX standard streams (input/output) to communicate between Hadoop and your
program.
It comes built-in with the Hadoop distribution.
1. You can write MapReduce programs in languages like Python, Perl, or C++ (not just Java).
2. Hadoop Streaming monitors job progress and provides logs for debugging.
3. It supports scalability, flexibility, and security just like regular MapReduce jobs.
4. It is easy to develop and requires minimal coding effort.
How Hadoop Streaming Works
o The Mapper reads input data, supplied by the InputFormat/RecordReader, in the form of key-value pairs.
o The Mapper processes the data based on the logic written in the code.
o The processed key-value pairs are then passed to the Reduce stream.
o The Reducer performs data aggregation on the intermediate data.
o The final processed data is released as output.
o Both Map and Reduce functions read input from STDIN (Standard Input) and write output to STDOUT (Standard Output); a minimal mapper sketch follows below.
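As a small illustration of that STDIN/STDOUT contract (a sketch, not part of the original notes): Streaming mappers are usually written in scripting languages such as Python, but any stdin/stdout program works, so Java is used here for consistency with the rest of these notes. The hypothetical mapper below emits one "word<TAB>1" pair per word:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical word-count mapper for Hadoop Streaming: reads lines from STDIN
// and writes tab-separated key-value pairs to STDOUT.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // key <TAB> value
                }
            }
        }
    }
}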
Hadoop Pipes
Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output, Pipes uses sockets as the channel over which the Task Tracker communicates with the C++ map or reduce process.
This allows higher performance for tasks like numerical calculations in C++.
Alternatives to Pipes
The main alternatives to Pipes are Hadoop Streaming (any language, communicating over standard input/output) and the native Java MapReduce API.
What is HDFS?
HDFS (Hadoop Distributed File System) is a distributed file system designed to store and
manage large amounts of data.
It runs on commodity hardware (regular, low-cost servers).
1. Handles Large Data Sets – HDFS is designed to scale across hundreds of nodes.
2. Block-Based Storage – Files are divided into blocks (default size: 128 MB, configurable).
3. Fault Tolerance & Recovery – Data is replicated across multiple nodes to prevent data loss.
4. Hierarchical File Organization – Similar to traditional file systems (directories, file creation,
deletion, etc.).
5. Supports Commodity Hardware – No need for expensive machines; runs on low-cost
servers.
HDFS Goals
1. Detect and recover quickly from hardware failures, which are common on large clusters of commodity machines.
2. Provide streaming (high-throughput) access to data, favouring batch processing over low-latency interactive use.
3. Handle very large files and data sets.
4. Follow a write-once, read-many model to keep data coherency simple.
5. Move computation close to the data rather than moving data to the computation.
Hadoop Architecture
NameNode
Manages metadata (information about file locations, directories, and blocks).
Stores two important files:
1. fsimage: Snapshot of the file system when NameNode starts.
2. Edit logs: Records changes made to the file system after NameNode starts.
Problems with NameNode:
o If edit logs grow too large, they become difficult to manage.
o Restarting NameNode takes a long time because all changes must be merged.
o If the NameNode crashes, recovery may rely on an outdated fsimage, so recent metadata changes that exist only in the edit logs can be lost.
Secondary NameNode
Helps manage NameNode issues by merging the edit logs with the fsimage at regular intervals.
Steps in Secondary NameNode working:
1. Collects edit logs from NameNode at regular intervals.
2. Applies them to fsimage to create an updated version.
3. Sends the updated fsimage back to NameNode, reducing restart time.
It is not a backup node but acts as a helper node to improve performance.
HDFS Block
HDFS is a block-structured file system, meaning files are divided into blocks before storing.
The default block size is 128 MB in Hadoop 2.x and later (64 MB in older versions), and it can be changed as needed.
Replication: Blocks are replicated across multiple nodes to prevent data loss.
Heartbeat Signals: DataNodes send signals to NameNode to stay synchronized.
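As a small configuration sketch (not part of the original notes), the block size and replication factor are set in hdfs-site.xml; the property names below are the standard Hadoop 2.x ones, and the values are only illustrative:

<configuration>
  <!-- Block size used for new files, in bytes (134217728 = 128 MB) -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Number of replicas kept for each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>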
MapReduce and HDFS
Hadoop is written in Java, so most of its file system interactions happen through Java APIs.
The FileSystem class in Java helps manage file operations in Hadoop.
One way to read a file from HDFS is by using Java's URL class:
// URL.setURLStreamHandlerFactory can only be called once per JVM; it registers
// Hadoop's handler so that the URL class recognizes the hdfs:// scheme.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    IOUtils.copyBytes(in, System.out, 4096, false);   // process the stream, e.g. copy it to stdout
} finally {
    IOUtils.closeStream(in);
}
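A more common way to read from HDFS, and one that avoids changing the JVM-wide URL handler, is the FileSystem API mentioned above. A minimal sketch, assuming the same hypothetical hdfs://host/path file:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://host/path";                       // hypothetical file location
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                       // returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);    // copy the file to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}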
What is a Heartbeat?
A heartbeat is a periodic signal (by default every 3 seconds) that each DataNode sends to the NameNode to report that it is alive and working. If the NameNode stops receiving heartbeats from a DataNode, it marks that node as dead and re-replicates its blocks on other nodes.
Shuffling and Sorting
The shuffling phase moves data from the Mapper to the Reducer.
It groups all values with the same key and ensures they reach the correct Reducer.
Without shuffling, the Reducers would not receive any input!
Before the Reducer processes data, Hadoop sorts the mapped output.
Data is sorted by key, so that each Reducer gets all the related data in order.
The shuffling and sorting phases happen at the same time.
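As a small hypothetical example (not from the notes): if the Mappers of a word-count job emit the pairs (apple, 1), (banana, 1), (apple, 1), the shuffle and sort phase groups and orders them by key, so a Reducer receives (apple, [1, 1]) and (banana, [1]).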
Data Integrity
Data integrity means that data remains accurate, consistent, and unchanged throughout storage, processing, and retrieval.
Data can sometimes get corrupted due to errors during disk operations or network transfers.
HDFS guards against this by computing checksums for all data written to it and verifying those checksums when the data is read.
1. The Hadoop Local File System (used when running Hadoop on a single machine) also uses
checksums to detect errors.
2. When a file is created, a hidden checksum file (.crc file) is also created in the background.
3. Each chunk of 512 bytes has its own checksum stored in the .crc file.
4. If the data no longer matches its checksum (for example, because of corruption), a ChecksumException is thrown when the file is read.
If a checksum error occurs, the local file system moves the offending file and its .crc file to a side directory named bad_files.
The system administrator must then check and deal with the corrupted files manually.
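As a small illustration (a sketch, not part of the notes), the FileSystem API also exposes checksums directly; the file path below is a hypothetical example:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://host/"), new Configuration());
        Path file = new Path("/data/sample.txt");          // hypothetical file
        FileChecksum checksum = fs.getFileChecksum(file);  // checksum computed by HDFS
        System.out.println(checksum);                      // algorithm name plus checksum bytes
        // fs.setVerifyChecksum(false) would disable verification on subsequent reads
    }
}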
What is Avro?
Avro is a data serialization system that converts data into a format that can be stored and
transferred efficiently.
It allows data to be written and read in different programming languages.
Avro uses a schema (a predefined data structure) to serialize (convert) and deserialize
(read) data.
The schema is written in JSON, making it easy to understand and modify.
Features of Avro
✔Stores Schema with Data: Avro saves the data along with its schema in a single file. This
makes it self-describing.
✔Supports Compression: Avro files can be compressed to save space.
✔Splitable: Large Avro files can be split into smaller parts, making them efficient for
MapReduce processing.
✔Supports Schema Evolution: The schema used for reading does not need to match the
one used for writing, as long as certain rules are followed.
Avro Data Types
1. Primitive Types: Basic data types like string, int, boolean, float, double, and bytes.
o Example: { "type": "string" } for a text field.
2. Complex Types: More advanced data structures like:
o Records (similar to structs or objects)
o Enums (fixed sets of values)
o Arrays (lists of values)
o Maps (key-value pairs)
o Unions (multiple possible types for a value)
o Fixed (fixed-size binary values)
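An illustrative record schema (the record and field names are made-up examples, not from the notes) that combines primitive and complex types could look like this in JSON:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    { "name": "name",   "type": "string" },
    { "name": "age",    "type": "int" },
    { "name": "skills", "type": { "type": "array", "items": "string" } },
    { "name": "email",  "type": ["null", "string"], "default": null }
  ]
}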
Avro File Format
1. Header
o Stores metadata, including the schema and a unique sync marker.
2. Data Blocks
o Contains the actual data serialized in Avro format.
o Blocks are separated by a sync marker to allow quick resynchronization when
reading large files.
Hadoop also provides file-based data structures for storing data:
1. Text Files – store records as plain lines of text; simple, but with no built-in key-value structure.
2. Sequence Files – a sequence file is a binary file format in Hadoop that stores key-value pairs. It consists of:
1. Header
o Starts with SEQ (magic number) to identify the file type.
o A version number to indicate the file format version.
o Metadata, including:
Key and value class names
Compression details (if any)
Sync marker (used for easy data access)
2. Records
o Contains actual key-value data stored in sequence format.
o Sync markers are placed between records for quick access.
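A minimal sketch of writing a sequence file with Hadoop's Java API (the output path and the choice of IntWritable keys and Text values are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq");                  // hypothetical output path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
        try {
            writer.append(new IntWritable(1), new Text("first record"));   // one key-value record
            writer.append(new IntWritable(2), new Text("second record"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}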
Cassandra-Hadoop Integration
Compression
Deflate – the raw DEFLATE algorithm (the same algorithm gzip uses), stored without gzip file headers.
Gzip – a good balance of compression ratio and speed; compresses and decompresses faster than Bzip2.
Bzip2 – better compression ratio than Gzip, but slower at both compression and decompression; its output is splittable for MapReduce.
LZO, LZ4, Snappy – optimized for speed, with lower compression ratios; useful when fast compression and decompression matter more than file size.
Each method trades speed against storage savings, so the best choice depends on the workload.
What is a Codec?
A codec is an algorithm used to compress and decompress data for storage or transmission.
Hadoop represents each codec as a CompressionCodec implementation (for example GzipCodec, BZip2Codec, and SnappyCodec) to handle the various compression formats.
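A minimal sketch of compressing data through a codec with Hadoop's Java API (the output file name and the sample data are illustrative assumptions):

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // GzipCodec is one of the built-in CompressionCodec implementations
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        OutputStream out = new FileOutputStream("/tmp/data.gz");         // hypothetical output file
        CompressionOutputStream compressedOut = codec.createOutputStream(out);
        compressedOut.write("some data to compress".getBytes(StandardCharsets.UTF_8));
        compressedOut.finish();   // flush any remaining compressed data
        compressedOut.close();
    }
}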
Serialization in Hadoop
Serialization is the process of converting data objects into bytes so they can be stored or
sent to another system.
Deserialization is the reverse process, where the bytes are converted back into their original
data form.
It makes data easier to store, transfer, and process.
Hadoop has its own serialization format called Writable, which is:
Fast
Compact
Written in Java
WritableComparable Interface
When data is sent from Mapper to Reducer, it goes through a shuffle and sort phase.
If the keys are not comparable, this phase won’t work correctly.
The WritableComparable interface ensures that keys can be compared; Hadoop can also use a RawComparator to compare keys directly in their serialized (byte) form, without deserializing them first.
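A minimal sketch of a custom key type implementing WritableComparable (the class name and field are made-up examples):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key wrapping a single int, usable as a MapReduce key
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                 // serialize the key to bytes
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                // deserialize the key from bytes
    }

    @Override
    public int compareTo(YearKey other) {   // used during the sort phase
        return Integer.compare(year, other.year);
    }
}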
Writable Wrapper Classes for Java Primitives
BooleanWritable
ByteWritable
IntWritable
VIntWritable (Variable-length integer)
FloatWritable
LongWritable
VLongWritable (Variable-length long)
DoubleWritable
✅ The fixed-length types serialize to the same size as their Java counterparts (e.g., IntWritable = 4 bytes); the variable-length types VIntWritable and VLongWritable use between 1 and 5 or 1 and 9 bytes, depending on the value.
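As a quick illustration of those sizes (a sketch, not from the notes), serializing an IntWritable produces exactly 4 bytes:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableSizeExample {
    public static void main(String[] args) throws Exception {
        IntWritable writable = new IntWritable(163);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writable.write(new DataOutputStream(out));      // Writable.write() serializes the value
        System.out.println(out.toByteArray().length);   // prints 4
    }
}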
Text – Like Java’s String, but optimized for Hadoop (supports UTF-8, max 2GB).
BytesWritable – Used for binary data.
NullWritable – A placeholder with zero-length serialization.