BDT UNIT - III
Data Format
Hive and Impala tables in HDFS can be created using four Hadoop file formats.
1) Text file format
2) Sequence file format
3) Avro data file
4) Parquet file format
1) Text file format:
A text file is the most basic, human-readable file format. It can be read or written in any
programming language and is usually delimited by a comma or a tab.
The text file format consumes more space when a numeric value needs to be stored as a
string.
It is also difficult to represent binary data such as an image.
2) Sequence file format:
The sequence file format can be used to store an image in the binary format.
They store key-value pairs in a binary container format and are more efficient than a text file.
However, it is not human-readable.
3) Avro data file:
The Avro file format has efficient storage due to optimized binary encoding.
It is widely supported both inside and outside the Hadoop ecosystem.
The Avro file format is ideal for long-term storage of important data.
The Avro file format is considered the best choice for general purpose storage in Hadoop.
4) Parquet file format:
Parquet is a columnar format developed by Cloudera and Twitter.
It is supported in Spark, MapReduce, Hive, Pig, Impala, Crunch, and so on.
The Parquet file format uses advanced optimizations described in Google’s Dremel paper.
These optimizations reduce the storage space and increase performance.
It is considered the most efficient for adding multiple records at a time.
We will look into what data serialization is in a later section.
ANALYZING THE DATA WITH HADOOP
To take advantage of the parallel processing that Hadoop provides, we need to express our
query as a MapReduce job.
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. Each phase has key-value pairs as input and output, the types of which may be
chosen by the programmer. The programmer also specifies two functions: the map function and
the reduce function.
Example:
Input Splits:
The input to a MapReduce job is divided into fixed-size pieces called input splits. An input
split is the chunk of the input that is consumed by a single map task.
Mapping:
This is the first phase in the execution of a MapReduce program. In this phase, the data in each
split is passed to a mapping function to produce output values. In our example, the job of the
mapping phase is to count the number of occurrences of each word in its input split and to prepare
a list in the form of <word, frequency>.
Shuffling:
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant
records from the Mapping phase output. In our example, occurrences of the same word are grouped
together along with their respective frequencies.
Reducing:
In this phase, output values from the Shuffling phase are aggregated. This phase combines
values from the Shuffling phase and returns a single output value. In short, this phase summarizes
the complete dataset.
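The following is a minimal sketch of how the Mapping and Reducing phases described above could
be written against the Hadoop Java MapReduce API; the class names are illustrative and error
handling is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapping phase: emit <word, 1> for every word found in the input split.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);              // <word, 1>
        }
    }
}

// Reducing phase: the shuffled values for each word are summed into <word, frequency>.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // <word, frequency>
    }
}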
(Or)
To visualize the way the map works, consider the following sample lines of input data
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The map
function merely extracts the year and the air temperature from each record and emits them
as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to
the reduce function. This processing sorts and groups the key-value pairs by key. So
continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do
now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
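A minimal sketch of the corresponding map and reduce functions is given below. The substring
offsets used to pull out the year and the temperature are assumptions about the fixed-width
record layout (the sample lines above are truncated), and the quality-code check is omitted for
brevity.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: extract (year, air temperature) from each weather record.
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);                    // assumed year columns
        int airTemperature = Integer.parseInt(                   // assumed temperature columns
                line.charAt(87) == '+' ? line.substring(88, 92) : line.substring(87, 92));
        context.write(new Text(year), new IntWritable(airTemperature));
    }
}

// Reduce: iterate through the grouped readings and keep the maximum for each year.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));           // e.g. (1949, 111)
    }
}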
SCALING OUT
Scaling out in Hadoop is achieved by adding more nodes, or computers, to the cluster. Each
node in the cluster runs Hadoop software and contributes to the processing of data. As more nodes are
added to the cluster, the processing power of the cluster increases, allowing for faster processing times
and more efficient use of resources.
Scaling out in Hadoop is particularly useful for businesses whose data volumes keep growing: as
data needs grow, businesses can simply add more nodes to the cluster to handle the increased
workload. This makes Hadoop a highly scalable solution for big data analytics.
One example of how scaling out in Hadoop can be used in a real-world scenario is in
the analysis of social media data. Social media platforms generate enormous amounts of data,
including user-generated content and demographic information. Businesses can use this data
to gain insights into their target audience and improve their marketing efforts.
In order to analyze social media data, businesses need a way to store and process the
data efficiently. Hadoop provides a flexible and scalable platform for analyzing social media
data. By adding more nodes to the cluster, businesses can handle the increased workload that
comes with analyzing large amounts of social media data. In short, scaling out in Hadoop is a
powerful solution for businesses that need to analyze large amounts of data.
HADOOP STREAMING
Hadoop Streaming is a utility that comes with the Hadoop distribution and allows developers or
programmers to write MapReduce programs in different programming languages such as Ruby,
Perl, Python, C++, and so on.
We can use any language that can read from standard input (STDIN) and write to standard
output (STDOUT). Although the Hadoop framework is completely written in Java, programs for
Hadoop do not necessarily need to be written in Java. The Hadoop Streaming feature has been
available since Hadoop version 0.14.1.
How Hadoop Streaming Works:
Input is read from standard input and the output is emitted to standard output by Mapper and the
Reducer. The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and
monitors the progress of the job until completion.
When the mapper task runs, its inputs are converted into lines and fed to the standard input of the
mapper process. Line-oriented outputs are collected from the standard output of that process, and
every line is converted into a key/value pair, which is collected as the output of the mapper.
When the reducer task runs, its input key/value pairs are converted into lines and fed to the
standard input (STDIN) of the reducer process.
Each line of the line-oriented output collected from the standard output (STDOUT) of that process
is converted back into a key/value pair, which is then collected as the output of the reducer.
HADOOP PIPES
Hadoop Pipes is a C++ API that allows developers to write MapReduce applications in C++.
It is a part of the Hadoop MapReduce framework and provides a way to use C++ code as a mapper
and/or reducer in a Hadoop job.
Hadoop Pipes allows you to use your existing C++ codebase and integrate it with Hadoop’s
MapReduce framework. It provides a simple way to write MapReduce programs in C++ by allowing
you to use standard input and output streams to communicate with the Hadoop framework.
If you have large files, you can use Hadoop streaming to map your large files into your C++
executable as-is. Since this will send your data via stdin it will naturally be streaming and buffered.
Your first map task must break up the data into smaller records for further processing. Further tasks
then operate on the smaller records. Pipes uses sockets as the channel over which the task tracker
communicates with the process running the C++ map or reduce function.
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
Before heading over to learn about HDFS (Hadoop Distributed File System), we should know
what a file system actually is. A file system is a data structure or method that an operating system
uses to manage files on disk space. It allows the user to maintain and retrieve data from the
local disk.
Coming to DFS, which stands for distributed file system, it is a concept of storing files on
multiple nodes in a distributed manner. A DFS provides the abstraction of a single large
system whose storage is equal to the sum of the storage of the nodes in a cluster.
Now that we are familiar with the term file system, let us begin with HDFS. HDFS
(Hadoop Distributed File System) is used as the storage layer of a Hadoop cluster. It is mainly
designed to work on commodity hardware devices (devices that are inexpensive), following a
distributed file system design.
HDFS is designed to store data in large blocks rather than in many small blocks. HDFS
provides fault tolerance and high availability to the storage layer and the other devices present
in the Hadoop cluster.
HDFS is capable of handling large data of high volume, velocity, and variety, which makes
Hadoop work more efficiently and reliably with easy access to all its components.
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks
are 128 MB by default, and this is configurable (see the sketch after this list). Files in HDFS
are broken into block-sized chunks, which are stored as independent units. Unlike a local file
system, if a file in HDFS is smaller than the block size, it does not occupy the full block;
for example, a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space.
The HDFS block size is large in order to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master.
The Name Node is the controller and manager of HDFS, as it knows the status and the metadata of
all the files in HDFS; the metadata includes file permissions, names, and the location of each
block. The metadata is small, so it is stored in the memory of the name node, allowing faster
access to data. Moreover, since the HDFS cluster is accessed by multiple clients concurrently,
all this information is handled by a single machine. File system operations such as opening,
closing, and renaming are executed by it. There are two kinds of files in the Name Node:
FsImage files and Edit Log files.
FsImage: It contains all the details about the file system, including all the directories and
files, in a hierarchical format. It is called a file system image because it acts as a snapshot
of the file system.
Edit Logs: The Edit Logs file keeps track of what modifications have been made to the
files of the file system.
3. Data Node: Data nodes store and retrieve blocks when they are told to by the client or the
name node. They report back to the name node periodically with the list of blocks that they are
storing. Being commodity hardware, the data node also does the work of block creation, deletion,
and replication as instructed by the name node.
[Figure: HDFS NameNode and DataNode architecture]
4. Secondary NameNode: The Secondary NameNode periodically performs checkpoints for the
NameNode by merging the Edit Logs with the FsImage, so that the edit log does not grow too
large. The data itself is replicated across the data nodes of the cluster, which store the blocks,
while the information about the files and the location of their blocks (the metadata) is kept by
the NameNode. When a client reads data from the cluster, this metadata is used to determine
which data nodes hold the required blocks.
5. Checkpoint Node: The Checkpoint Node creates checkpoints at specified intervals: it obtains
the FsImage and Edit Logs from the NameNode, merges them to produce a new image, and then
delivers this checkpoint back to the NameNode. Its directory structure is always identical to that
of the name node, so the checkpointed image is always available.
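As noted under Blocks above, the 128 MB block size is only a default. A minimal sketch of
changing it, assuming a reachable HDFS and an output path supplied on the command line;
dfs.blocksize is the standard configuration property, and the particular sizes used here are
arbitrary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);       // client-side default: 256 MB
        FileSystem fs = FileSystem.get(conf);
        // Per-file override: overwrite flag, buffer size, replication factor, block size.
        fs.create(new Path(args[0]), true, 4096, (short) 3, 64L * 1024 * 1024).close();
    }
}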
THE JAVA INTERFACE
In this section, we try to understand the Java interface used for accessing Hadoop’s file system. In
order to interact with Hadoop’s filesystem programmatically, Hadoop provides multiple Java
classes.
The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a
Hadoop filesystem.
Filesystem operations include open, read, write, and close.
Since Hadoop is written in Java, most Hadoop filesystem interactions are mediated through the
Java API. For example, the following program reads a file from HDFS through the java.net.URL
class and copies it to standard output:
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static { URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); } // enable hdfs:// URLs
    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
This code opens and reads the contents of a file whose path on HDFS is passed to the program as a
command-line argument.
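The same task can also be done through the FileSystem class directly rather than java.net.URL.
The sketch below is illustrative (the class name FileSystemCat is not part of Hadoop):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Opens a file through the FileSystem API and copies it to standard output.
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                         // returns an input stream for the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}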
DATA FLOW
Map-Reduce is a processing framework used to process data over a large number of
machines. Hadoop uses Map-Reduce to process the data distributed in a Hadoop cluster.
MapReduce consists of a Map phase and a Reduce phase. The Map is used for transformation,
while the Reducer is used for aggregation-type operations.
Steps of Data-Flow:
A single input split is processed at a time by each map task. The Mapper is overridden by the
developer according to the business logic, and these Mappers run in parallel on all the machines in
our cluster.
The intermediate output generated by the Mapper is stored on the local disk and shuffled to the
reducer for the reduce task.
Once the Mappers finish their tasks, the output is sorted, merged, and provided to the Reducer.
The Reducer performs reducing tasks such as aggregation and other compositional operations, and
the final output is then stored on HDFS in a part-r-00000 file (created by default).
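A minimal driver sketch that wires this data flow together is shown below; it reuses the illustrative
WordCountMapper and WordCountReducer classes sketched earlier, and the input and output paths
are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);               // runs in parallel on each input split
        job.setReducerClass(WordCountReducer.class);             // aggregates the shuffled output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input to be divided into splits
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // part-r-00000 is written here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}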
Data Integrity
Data integrity refers to the accuracy, completeness, and consistency of data.
Maintaining data integrity means making sure the data remains intact and unchanged
throughout its entire life cycle.
Consider a situation: a block of data fetched from datanode arrives corrupted.
This corruption may occur because of faults in a storage device, network faults, or buggy
software.
The usual way to detect corrupt data is through checksums.
An HDFS client computes the checksum of every block of the file it creates and stores the
checksums in a hidden file in the HDFS namespace.
When a client retrieves the contents of a file, it verifies that the data matches the corresponding
checksums.
If they do not match, the client can retrieve the block from another replica.
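From the client side, checksum verification is on by default when reading through the FileSystem
API; it can be switched off with setVerifyChecksum(), for example to salvage whatever data
remains in a corrupt file. A minimal sketch, assuming the file URI is given on the command line
and the class name is illustrative:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create(args[0]), new Configuration());
        fs.setVerifyChecksum(false);                             // must be called before open()
        // The final 'true' closes the streams after copying.
        IOUtils.copyBytes(fs.open(new Path(args[0])), System.out, 4096, true);
    }
}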
Compression
Data compression in Hadoop is a technique that reduces the size of data stored in Hadoop
Distributed File System (HDFS) and the amount of data transferred between nodes in a
Hadoop cluster.
This helps in reducing storage requirements and network usage, thereby improving the
performance of Hadoop jobs.
You can compress data in Hadoop MapReduce at various stages such as input files,
intermediate map output, and output files.
There are many different compression formats available in Hadoop framework such as
Deflate, gzip, bzip2, and Snappy.
The choice of compression format depends on the time it takes to compress, space saving,
and whether the compression format is splittable or not.
Using data compression in the Hadoop framework is usually a tradeoff between I/O and
speed of computation. When compression is enabled, it reduces I/O and network
usage. Compression happens when MapReduce reads the data or when it writes it out.
There are many different compression formats, tools and algorithms, each with different
characteristics.
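A minimal sketch of enabling compression at two of these stages through the job configuration;
the property names are the standard Hadoop 2.x keys, and the choice of Snappy for the map output
and gzip for the final output is only an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSettings {
    public static Job configureJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.output.compress", true);  // compress intermediate map output
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);      // fast codec for the shuffle
        Job job = Job.getInstance(conf, "compressed job");
        FileOutputFormat.setCompressOutput(job, true);           // compress the final job output
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}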
Serialization:
Serialization is the process of converting an object into a sequence of bytes which can be
persisted to storage on disk or in a database, or can be sent through streams. The reverse process
of creating an object from a sequence of bytes is called deserialization. Serialization appears in
two quite distinct areas of distributed data processing: interprocess communication and persistent
storage.
In Hadoop, interprocess communication between nodes in the system is implemented using
remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a
binary stream to be sent to the remote node, which then deserializes the binary stream into the
original message.
In general, it is desirable that an RPC serialization format is:
Compact − to make the best use of network bandwidth, which is the most scarce resource in a
data center.
Fast − since the communication between the nodes is crucial in distributed systems, the
serialization and deserialization process should be quick, producing as little overhead as possible.
Extensible − protocols change over time to meet new requirements, so it should be
straightforward to evolve the protocol in a controlled manner for clients and servers.
Interoperable − the message format should support nodes that are written in different
languages.
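Hadoop's own serialization format is based on the Writable interface. A minimal sketch of
serializing a Writable value to a byte array and reading it back (the class name is illustrative):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        original.write(new DataOutputStream(out));               // serialize: 4 bytes for an IntWritable
        byte[] bytes = out.toByteArray();

        IntWritable copy = new IntWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));  // deserialize
        System.out.println(bytes.length + " bytes, value = " + copy.get());
    }
}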
1. SequenceFile:
SequenceFile files are flat files designed by Hadoop to store <key, value> pairs in binary
form.
SequenceFile files are not sorted by their stored keys.
SequenceFile's internal class Writer provides append functionality.
The key and value in a SequenceFile can be any Writable type, including custom Writable types.
Write Process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) Call SequenceFile.createWriter to get a SequenceFile.Writer object
5) Call SequenceFile.Writer.append to append records to the file
6) Close the stream
Read process:
1) Create a configuration
2) Get filesystem
3) Create the file path
4) Create a SequenceFile.Reader for reading
5) Get the key class and value class
6) Close the stream
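A minimal sketch that follows the write and read steps listed above, assuming an
<IntWritable, Text> record type and a path given on the command line (the older
FileSystem-based factory methods are used here to mirror the steps):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                // 1) create a configuration
        FileSystem fs = FileSystem.get(conf);                    // 2) get filesystem
        Path path = new Path(args[0]);                           // 3) create the file path

        SequenceFile.Writer writer = SequenceFile.createWriter(  // 4) get a SequenceFile.Writer
                fs, conf, path, IntWritable.class, Text.class);
        writer.append(new IntWritable(1), new Text("first record"));   // 5) append records
        writer.append(new IntWritable(2), new Text("second record"));
        writer.close();                                          // 6) close the stream

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);   // 4) reader
        System.out.println(reader.getKeyClass() + " / " + reader.getValueClass()); // 5) key/value classes
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {                        // iterate over the stored pairs
            System.out.println(key + "\t" + value);
        }
        reader.close();                                          // 6) close the stream
    }
}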
2. MapFile:
A MapFile is a sorted SequenceFile with an index, so records can be looked up by key.
Unlike SequenceFile, the MapFile key must implement the WritableComparable interface, that is,
the key must be comparable, and the value is a Writable type.
You can use the MapFile.fix() method to reconstruct the index and convert a
SequenceFile into a MapFile.
When a MapFile is accessed, the index file is loaded into memory, and the index
mapping is used to quickly locate the position in the file where the specified
record is stored.
Therefore, retrieval from a MapFile is efficient relative to a SequenceFile.
The disadvantage is that it consumes a portion of memory to store the index data.
Write Process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) Create a MapFile.Writer object
5) Call MapFile.Writer.append to append records to the file
6) Close the stream
Read process:
1) Create a configuration
2) Get filesystem
3) Create the file path
4) Create a MapFile.Reader for reading
5) Get the key class and value class
6) Close the stream
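A minimal sketch along the same lines for a MapFile, assuming <IntWritable, Text> records and a
directory path given on the command line; note that keys must be appended in sorted order:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                // 1) create a configuration
        FileSystem fs = FileSystem.get(conf);                    // 2) get filesystem
        String dir = args[0];                                    // 3) MapFile directory path

        MapFile.Writer writer = new MapFile.Writer(              // 4) create a MapFile.Writer
                conf, fs, dir, IntWritable.class, Text.class);
        writer.append(new IntWritable(1), new Text("first"));    // 5) append in ascending key order
        writer.append(new IntWritable(2), new Text("second"));
        writer.close();                                          // 6) close the stream

        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);   // 4) create a MapFile.Reader
        Text value = new Text();
        reader.get(new IntWritable(2), value);                   // random lookup via the in-memory index
        System.out.println(value);
        reader.close();                                          // 6) close the stream
    }
}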