Unit II Big Data Analytics
HDFS
When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. Filesystems that manage the
storage across a network of machines are called distributed filesystems. Hadoop comes with a
distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
The Design of HDFS : HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity hardware.
Very large files: “Very large” in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of
data.
Streaming data access : HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from
source, and then various analyses are performed on that dataset over time.
Commodity hardware : Hadoop doesn’t require expensive, highly reliable hardware to run
on. It’s designed to run on clusters of commodity hardware (commonly available hardware
from multiple vendors) for which the chance of node failure across the cluster is high, at least
for large clusters. HDFS is designed to carry on working without a noticeable interruption to
the user in the face of such failure.
These are areas where HDFS is not a good fit today:
Low-latency data access : Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.
Lots of small files : Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single
writer. Writes are always made at the end of the file. There is no support for multiple writers,
or for modifications at arbitrary offsets in the file.
HDFS Concepts
Blocks: HDFS has the concept of a block, but it is a much larger unit, 64 MB by default. Files
in HDFS are broken into block-sized chunks, which are stored as independent units. Having a
block abstraction for a distributed filesystem brings several benefits:
The first benefit : A file can be larger than any single disk in the network. There’s nothing
that requires the blocks from a file to be stored on the same disk, so they can take advantage of
any of the disks in the cluster.
Second: Making the unit of abstraction a block rather than a file simplifies the storage
subsystem. The storage subsystem deals with blocks, simplifying storage management (since
blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and
eliminating metadata concerns.
Third: Blocks fit well with replication for providing fault tolerance and availability. To insure
against corrupted blocks and disk and machine failure, each block is replicated to a small
number of physically separate machines (typically three).
Why Is a Block in HDFS So Large? HDFS blocks are large compared to disk blocks, and the
reason is to minimize the cost of seeks. By making a block large enough, the time to transfer
the data from the disk can be made to be significantly larger than the time to seek to the start
of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk
transfer rate. A quick calculation shows that if the seek time is around 10 ms, and the transfer
rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS installations
use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow
with new generations of disk drives.
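As a quick check of the arithmetic above (using the same assumed figures of a 10 ms seek time and a 100 MB/s transfer rate): transferring a 100 MB block takes 100 MB / 100 MB/s = 1 s, while the seek takes 0.01 s, so the seek overhead is 0.01 s / 1 s = 1% of the transfer time. For a 64 MB block the transfer takes 0.64 s, so the seek overhead rises to roughly 1.6%.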
Namenodes and Datanodes: An HDFS cluster has two types of node operating in a master-
worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode
manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the
files and directories in the tree.
HDFS is one of the primary components of a Hadoop cluster, and HDFS is designed to have a
master-slave architecture.
Master: NameNode
Slaves: {DataNode} ….. {DataNode}
- The Master (NameNode) manages the file system namespace operations, such as opening,
closing, and renaming files and directories, determines the mapping of blocks to DataNodes,
and regulates access to files by clients.
- The Slaves (DataNodes) are responsible for serving read and write requests from the file
system’s clients, and they perform block creation, deletion, and replication upon instruction
from the Master (NameNode).
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with
lists of blocks that they are storing.
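A minimal sketch (not taken from these notes) using the Hadoop FileSystem API, showing how a client can ask the namenode which datanodes hold the blocks of a file. The path /geeks/AI.txt is only the example file used later in these notes and is otherwise hypothetical.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // connects to the namenode
        FileStatus status = fs.getFileStatus(new Path("/geeks/AI.txt"));   // hypothetical example path
        // The block-to-datanode mapping is metadata held by the namenode;
        // the block data itself lives on the datanodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(String.join(",", block.getHosts()));  // datanodes holding this block
        }
        fs.close();
    }
}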
NameNode failure: If the machine running the namenode failed, all the files on the filesystem
would be lost, since there would be no way of knowing how to reconstruct the files from the
blocks on the datanodes.
HDFS Concepts
HDFS is a distributed file system that handles large data sets running on commodity hardware.
It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.
HDFS is one of the major components of Apache Hadoop, the others being MapReduce and
YARN.
DFS is a concept of storing a file on multiple nodes in a distributed manner. DFS actually
provides the abstraction of a single large system whose storage is equal to the sum of the
storage of the other nodes in a cluster.
Let’s understand this with an example. Suppose you have a DFS comprising 4 different
machines, each of size 10 TB; in that case you can store, let’s say, 30 TB across this DFS, as it
provides you a combined machine of size 40 TB. The 30 TB of data is distributed among these
nodes in the form of blocks.
1. ls: This command is used to list all the files. Use lsr for a recursive listing; it is useful
when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables,
so bin/hdfs means we want the hdfs executable, and dfs selects the Distributed File
System commands in particular.
2. cat: To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside the geeks folder
bin/hdfs dfs -cat /geeks/AI.txt
3. copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store.
This is the most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks
4. touchz: It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
5. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So
let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be
created relative to the home directory.
6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from geeks folder will be copied to folder hero present on Desktop.
7. moveFromLocal: This command will move a file from the local file system to HDFS.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
8. cp: This command is used to copy files within hdfs.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
9. mv: This command is used to move files within hdfs.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful
command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then
the directory itself.
Hadoop File System Interface.
• HFTP – this was the first mechanism that provided HTTP access to HDFS. It was
designed to facilitate data copying between clusters with different Hadoop versions.
HFTP is a part of HDFS. It redirects clients to the datanode containing the data for
providing data locality. Nevertheless, it supports only the read operations. The HFTP
HTTP API is neither curl/wget friendly nor RESTful. WebHDFS is a rewrite of HFTP
and is intended to replace HFTP.
• HdfsProxy – an HDFS contrib project. It runs as external servers (outside HDFS) to
provide a proxy service. Common use cases of HdfsProxy are firewall tunneling and
user authentication mapping.
• HdfsProxy V3 – Yahoo!’s internal version, which is a dramatic improvement over
HdfsProxy. It has an HTTP REST API and other features such as bandwidth control.
Nonetheless, it is not yet publicly available.
• Hoop – a rewrite of HdfsProxy that aims to replace it. Hoop has an HTTP REST API.
Like HdfsProxy, it runs as external servers to provide a proxy service. Because it is a
proxy running outside HDFS, it cannot take advantage of some features, such as
redirecting clients to the corresponding datanodes for data locality. It has advantages,
however, in that it can be extended to control and limit bandwidth like HdfsProxy V3,
or to carry out authentication translation from one mechanism to HDFS’s native
Kerberos authentication. Also, it can provide a proxy service to other file systems,
such as Amazon S3, via the Hadoop FileSystem API.
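All of these front ends are reached by clients through the same Hadoop FileSystem API. The following is a minimal sketch, assuming a hypothetical WebHDFS endpoint at namenode:50070 and the example path /geeks/AI.txt, of reading a file through that interface; the webhdfs:// scheme could be swapped for hdfs:// or hftp:// without changing the rest of the code.
Example (sketch):
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadOverHttp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // namenode:50070 is a hypothetical host:port for the HTTP front end
        FileSystem fs = FileSystem.get(URI.create("webhdfs://namenode:50070/"), conf);
        InputStream in = fs.open(new Path("/geeks/AI.txt"));
        IOUtils.copyBytes(in, System.out, 4096, true);  // copy the file contents to stdout, then close the stream
    }
}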
Data Flow
A basic data flow of the Hadoop system can be divided into four phases:
1. Capture Big Data : The sources can be extensive lists that are structured, semi-
structured, and unstructured, some streaming, real-time data sources, sensors, devices,
machine-captured data, and many other sources. For data capturing and storage, we
have different data integrators such as, Flume, Sqoop, Storm, and so on in the Hadoop
ecosystem, depending on the type of data.
2. Process and Structure: We will be cleansing, filtering, and transforming the data by
using a MapReduce-based framework or some other framework that can perform
distributed processing on the data.
Flume: Flume Agents have the ability to transfer data created by a streaming application to
data stores like HDFS and HBase.
Flume is a distributed and reliable tool for efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible architecture based on streaming data flows. It
is robust and fault-tolerant, with tunable reliability mechanisms.
Sqoop: Sqoop can be used to bulk import data from typical RDBMS to Hadoop storage
structures like HDFS or Hive.
Sqoop is a tool used to transfer bulk data between Hadoop and external datastores such as
relational databases (MS SQL Server, MySQL).
Apache Sqoop (which is a portmanteau for “sql-to-hadoop”) is an open source tool that allows
users to extract data from a structured data store into Hadoop for further processing. This
processing can be done with MapReduce programs or with other higher-level tools such as
Hive, Pig, or Spark.
• HDFS stores small files inefficiently, since each file is stored in a block and block metadata
is held in memory by the NameNode.
• Thus, a large number of small files can take a lot of memory on the NameNode. (Note,
however, that a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not
128 MB.)
• Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing NameNode memory usage while still allowing
transparent access to files.
• Hadoop Archives can be used as input to MapReduce.
I/O Compression :
• In the Hadoop framework, where large data sets are stored and processed, you will
need storage for large files.
• These files are divided into blocks and those blocks are stored in different nodes
across the cluster so lots of I/O and network data transfer is also involved.
• In order to reduce the storage requirements and the time spent in network transfer, you
can look at data compression in the Hadoop framework.
• Using data compression in Hadoop, you can compress files at various steps; at each of
these steps it helps to reduce the storage used and the quantity of data transferred.
• You can compress the input file itself.
• That will help you reduce storage space in HDFS.
• You can also configure the output of a MapReduce job to be compressed in Hadoop, as in
the sketch below.
• That helps in reducing storage space if you are archiving the output or sending it to some
other application for further processing.
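A minimal sketch (not part of the original notes) of how a MapReduce job’s output might be compressed using the FileOutputFormat helpers; the surrounding job setup (mapper, reducer, input/output paths) is omitted, and the job name is hypothetical.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed-output");  // hypothetical job name
        // ... set mapper, reducer, and input/output paths here ...
        FileOutputFormat.setCompressOutput(job, true);                     // compress the job output
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);   // use the gzip codec
    }
}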
Serialization
• Data serialization is the process of converting structured data into a stream of bytes;
deserialization converts that stream back to the original form.
• We serialize to translate data structures into a stream of data. This stream of data can be
transmitted over the network or stored in a DB regardless of the system architecture.
• Isn’t storing information in binary form, or as a stream of bytes, already the right approach?
• Serialization does the same, but it isn’t dependent on the architecture.
Consider CSV files, which contain commas (,) between data values; during deserialization,
wrong outputs may occur. Now, if the metadata is stored in XML form, a self-architected form
of data storage, the data can easily be deserialized.
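A minimal sketch (not from the notes) of Hadoop’s Writable serialization, round-tripping a single IntWritable through a byte stream to show that the bytes can be stored or transmitted independently of the machine architecture.
Example (sketch):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));        // serialize: object -> stream of bytes

        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(restored.get());                 // deserialize: bytes -> 163 again
    }
}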
Avro
Avro is an open source project that provides data serialization and data exchange services for
Apache Hadoop. These services can be used together or independently. Avro facilitates the
exchange of big data between programs written in any language. With the serialization service,
programs can efficiently serialize data into files or into messages. The data storage is compact
and efficient. Avro stores both the data definition and the data together in one message or file.
Avro files include markers that can be used to split large data sets into subsets suitable
for Apache MapReduce processing.
Avro handles schema changes like missing fields, added fields and changed fields; as a result,
old programs can read new data and new programs can read old data. Avro includes APIs for
Java, Python, Ruby, C, C++, and more. Data stored using Avro can be passed between programs
written in different languages, even from a compiled language like C to a scripting language
like Apache Pig.
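A minimal sketch, using the Avro generic API and a tiny hypothetical “user” schema, of writing one record to an Avro data file; note that the schema (the data definition) is embedded in the file together with the data, as described above.
Example (sketch):
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema with a single string field called "name"
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"user\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "geek");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("users.avro"));  // the schema is stored in the file header
        writer.append(record);                          // records are stored in Avro's compact binary form
        writer.close();
    }
}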
The systems that are used to organize and maintain data files are known as file-based data
systems. These file systems are used to handle single or multiple files and are not very
efficient.
Two file formats:
1. SequenceFile
2. MapFile
SequenceFile
1. SequenceFile files are flat files designed by Hadoop to store <key,value> pairs in binary
form.
2. A SequenceFile can be used as a container: packaging many small files into a SequenceFile
allows them to be stored and processed efficiently.
3. SequenceFile files are not sorted by their stored keys; SequenceFile's internal Writer class
provides append functionality.
4. The key and value in a SequenceFile can be any Writable type or a custom Writable type.
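A minimal sketch (using the classic FileSystem-based createWriter signature) of appending <key,value> pairs to a SequenceFile; the path /geeks/data.seq and the record contents are hypothetical.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/geeks/data.seq");          // hypothetical output path
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
        for (int i = 0; i < 5; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));  // append <key,value> pairs
        }
        writer.close();
    }
}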
SequenceFile Compression
1. The internal format of the SequenceFile depends on whether compression is enabled and, if
it is, whether it is record compression or block compression.
2. There are three types:
A. No compression type : If compression is not enabled (the default setting), then each record
consists of its record length (number of bytes), the length of the key, the key and the value.
The Length field is four bytes.
B. Record compression type : The record compression format is basically the same as the
uncompressed format; the difference is that the value bytes are compressed with the codec
defined in the header. Note that the key is not compressed.
C. Block compression type : Block compression compresses multiple records at once, so it is
more compact than record compression and is generally preferred. When the number of bytes
collected reaches the minimum size, the records are added to the block. The minimum size is
defined by the io.seqfile.compress.blocksize property; the default value is 1000000
bytes. The format is: record count, key length, key, value length, value.
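Building on the SequenceFileWrite sketch above, the writer could instead be created with an explicit compression type; this variant (block compression with the gzip codec) is a sketch, not taken from the notes, and reuses the same fs, conf and path variables plus the extra imports named in the comments.
Example (sketch):
// Extra imports: org.apache.hadoop.io.compress.CompressionCodec,
//                org.apache.hadoop.io.compress.GzipCodec,
//                org.apache.hadoop.util.ReflectionUtils
CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);  // codec configured with conf
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, path, IntWritable.class, Text.class,
    SequenceFile.CompressionType.BLOCK,   // or RECORD, or NONE (the three types described above)
    codec);                               // codec used to compress the keys/values in each block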
MapFile
A MapFile is a sorted SequenceFile with an index, so records can be looked up by key.
Unlike a SequenceFile, the MapFile key must implement the WritableComparable interface,
that is, the key must be comparable, and the value must be a Writable type.
You can use the MapFile.fix() method to rebuild the index and convert a SequenceFile to a
MapFile.
It has two static member variables:
static final String INDEX_FILE_NAME
static final String DATA_FILE_NAME
By observing its directory structure, we can see that a MapFile consists of two parts: data and
index.
The index is a file of data indexes; it primarily records the key of each indexed record and the
offset at which that record is located in the data file.
When the MapFile is accessed, the index file is loaded into memory, and the index mapping is
used to quickly locate the position in the data file where the specified record lies.
Therefore, retrieval from a MapFile is efficient relative to a SequenceFile; the disadvantage is
that it consumes a portion of memory to store the index data.
It should be noted that the MapFile does not record every record in the index; by default it
stores an index entry for every 128 records. Of course, this indexing interval can be modified,
either through the MapFile.Writer setIndexInterval() method or by changing the
io.map.index.interval property.
Read/write MapFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Create a new MapFile.Writer object
5) Call MapFile.Writer.append() to append and write records to the file
6) Close the stream
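A minimal sketch of the write process listed above, using the classic MapFile.Writer constructor; the directory /geeks/map and the record contents are hypothetical, and keys must be appended in sorted order.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // 1) create a Configuration
        FileSystem fs = FileSystem.get(conf);                  // 2) get the FileSystem
        String dir = "/geeks/map";                             // 3) output path (a directory holding data + index)
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class);  // 4) new MapFile.Writer
        for (int i = 0; i < 1000; i++) {
            writer.append(new IntWritable(i), new Text("value-" + i));         // 5) append keys in sorted order
        }
        writer.close();                                        // 6) close the stream
    }
}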
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a new MapFile.Reader for reading
5) Get the key class and value class
6) Close the stream
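A minimal sketch of the read process, opening a MapFile.Reader on the same hypothetical /geeks/map directory and looking up a record by key through the in-memory index.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                // 1) create a Configuration
        FileSystem fs = FileSystem.get(conf);                     // 2) get the FileSystem
        MapFile.Reader reader =
            new MapFile.Reader(fs, "/geeks/map", conf);           // 3)-4) open a reader on the MapFile directory
        IntWritable key = new IntWritable(42);                    // 5) key and value instances
        Text value = new Text();
        reader.get(key, value);                                   //    index lookup by key
        System.out.println(value);
        reader.close();                                           // 6) close the stream
    }
}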