
EXCEL ENGINEERING COLLEGE

(Autonomous)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
V Semester
20AI501 – Data Analytics
Regulations 2020
Question Bank
UNIT – II (HADOOP DISTRIBUTED FILE SYSTEM)
PART- A
Q.No. Questions Marks CO BL
1 Define HDFS. 2 CO2 R
2 State limitations of HAR files 2 CO2 R
3 Tell about the phases of job execution in Hadoop 2 CO2 R
4 List out the different features of Hadoop 2 CO2 R
5 Classify various compression formats that are used in Hadoop 2 CO2 R
6 State the design concept of HDFS 2 CO2 R
7 Which file format is best for Hadoop? 2 CO2 R
8 Define Hadoop Archive 2 CO2 R
9 Which architecture is used by HDFS 2 CO2 R
10 Recall Sqoop and Flume 2 CO2 R
11 Define data flow in Hadoop 2 CO2 R
12 Write Down the drawback of Hadoop Serialization 2 CO2 R
13 What is serialization in Hadoop 2 CO2 R
14 Write down Avro schema 2 CO2 R
15 Define Hadoop I/O Compression 2 CO2 R
16 What is meant by Avro serialization in Hadoop? 2 CO2 R
17 What is data Replication 2 CO2 R
18 State cluster in Hadoop 2 CO2 R
19 Recall what Flume is used for 2 CO2 R
20 Infer the command line interface in Hadoop 2 CO2 R
PART- B
Q.No. Questions Marks CO BL
1. Examine about 16 CO2 U
(i) The Design of HDFS
(ii) HDFS Concepts
2. Describe data ingest with Flume and Sqoop, and Hadoop archives 16 CO2 U
Explain the following:
3. Hadoop I/O Compression 16 CO2 U
Hadoop I/O Serialization
4. Discuss in detail about Hadoop File system Interface 16 CO2 U
5. Summarize the Command Line Interface 16 CO2 U
6. Explain the Avro and File Based Data Structures 16 CO2 U

(Note: *Blooms Level (R – Remember, U – Understand, AP – Apply, AZ – Analyze, E – Evaluate, C – Create)
PART A - Blooms Level: Remember, Understand, Apply
PART B - Blooms Level: Understand, Apply, Analyze, Evaluate (if possible)
Marks: 16 Marks, 8+8 Marks, 10+6 Marks)

Subject In charge Course Coordinator HOD


IQAC
(Name & Signature) (Name & Signature)
EXCEL ENGINEERING COLLEGE
(Autonomous)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
V Semester
20AI501 – Data Analytics
Regulations 2020
Question Bank
UNIT – II (HADOOP DISTRIBUTED FILE SYSTEM)
PART- A

Q.No. Questions Marks CO BL


Define HDFS
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications. HDFS employs a NameNode and
1 2 CO2 R
DataNode architecture to implement a distributed file system that
provides high-performance access to data across highly scalable Hadoop
clusters.
State limitations of HAR files
Once an archive file is created, you cannot update it to add or
remove files; in other words, .har files are immutable.
An archive file holds a copy of all the original files, so once a .har is
2 created it takes as much space as the original files. Do not mistake 2 CO2 R
.har files for compressed files.
When a .har file is given as input to a MapReduce job, the small files
inside the .har file are still processed individually by separate mappers,
which is inefficient.
Tell about the phases of job execution in Hadoop
3 The phases of MapReduce job execution are: input splits, mapper, combiner, 2 CO2 R
partitioner, shuffling, sorting, and reducer.
List out the different features of Hadoop
Open Source: Hadoop is open-source, which means it is free to use.
Highly Scalable Cluster: Hadoop is a highly scalable model
4 Fault Tolerance is Available 2 CO2 R
High Availability is Provided
Cost-Effective, Hadoop Provides Flexibility
Easy to Use, Hadoop Uses Data Locality
Classify various compression formats that are used in Hadoop
GZIP. Provides High compression ratio. Uses high CPU resources to
compress and decompress data
5 2 CO2 R
BZIP2. Provides High compression ratio (even higher than GZIP)
LZO. Provides Low compression ratio
SNAPPY
State design concept of HDFS
HDFS Design Concepts. HDFS is a distributed file system implemented
on Hadoop’s framework, designed to store vast amounts of data on low-cost
6 commodity hardware while ensuring high-speed processing of that data. 2 CO2 R
The Hadoop Distributed File System design is based on the design of the
Google File System.
Which file format is best for Hadoop?
7 The Avro file format is considered the best choice for general-purpose 2 CO2 R
storage in Hadoop, since its schema evolution can accommodate changes.
Define Hadoop Archive
A Hadoop archive always has a *.har extension. A Hadoop archive
8 2 CO2 R
directory contains metadata (in the form of _index and _masterindex)
and data (part-*) files.
Which architecture is used by HDFS?
master/slave architecture
9 HDFS has a master/slave architecture. An HDFS cluster consists of a 2 CO2 R
single NameNode, a master server that manages the file system
namespace and regulates access to files by clients.
Recall Sqoop and Flume
10 Sqoop is used for loading data from relational databases into HDFS 2 CO2 R
while Flume is used to capture a stream of moving data
Define data flow in Hadoop
A basic data flow of the Hadoop system can be divided into four phases:
Capture Big Data : The sources can be extensive lists that are structured,
11 2 CO2 R
semi-structured, and unstructured, some streaming, real-time data
sources, sensors, devices, machine-captured data, and many other
sources.
Write Down the drawback of Hadoop Serialization
The main drawback of these two mechanisms is
12 2 CO2 R
The Writables and SequenceFiles have only a Java API and they cannot
be written or read in any other language.
What is serialization in Hadoop
Enabling data serialization in Hadoop: data serialization is the process
of converting structured data into a stream of bytes; deserialization
13 converts the stream back into the original structure. The serialized 2 CO2 R
stream can be transmitted over the network or stored in a database
regardless of the system architecture.
Write down Avro schema
Avro schemas, defined in JSON, facilitate implementation in the
languages that already have JSON libraries. Avro creates a self-
14 2 CO2 R
describing file named Avro Data File, in which it stores data along with
its schema in the metadata section. Avro is also used in Remote
Procedure Calls (RPCs).
Define Hadoop I/O Compression
File compression brings two major benefits: it reduces the space needed
to store files, and it speeds up data transfer across the network, or to or
15 2 CO2 R
from disk. When dealing with large volumes of data, both of these
savings can be significant, so it pays to carefully consider how to use
compression in Hadoop.
What is meant Avro serialization in Hadoop?
Avro is an open source project that provides data serialization and data
16 exchange services for Apache Hadoop. These services can be used 2 CO2 R
together or independently. Avro facilitates the exchange of big data
between programs written in any language.
What is data Replication?
HDFS is designed to reliably store very large files across machines in a
large cluster. It stores each file as a sequence of blocks; all blocks in a
17 2 CO2 R
file except the last block are the same size. The blocks of a file are
replicated for fault tolerance. The block size and replication factor are
configurable per file
State cluster in Hadoop
A Hadoop cluster is a special type of computational cluster designed
specifically for storing and analyzing huge amounts of unstructured data
18 in a distributed computing environment. Such clusters run Hadoop's 2 CO2 R
open source distributed processing software on low-cost commodity
computers.

19 Recall Flume used for 2 CO2 R


Flume is an open-source distributed data collection service used
for transferring the data from source to destination. It is a reliable, and
highly available service for collecting, aggregating, and transferring
huge amounts of logs into HDFS
Infer the command line interface in Hadoop
The HDFS can be manipulated through a Java API or through a
command-line interface. The File System (FS) shell includes various
20 2 CO2 R
shell-like commands that directly interact with the Hadoop Distributed
File System (HDFS) as well as other file systems that Hadoop supports

PART- B
Q.No. Questions Marks CO BL
1. Examine about 16 CO2 U
(i) The Design of HDFS
HDFS uses a master/slave architecture in which one device (the
master) controls one or more other devices (the slaves). An HDFS cluster
consists of a single NameNode, a master server that manages the file
system namespace and regulates access to files.

(ii) HDFS Concepts


HDFS works well in the following scenarios:
1. Very large files – Files that are hundreds of megabytes, gigabytes, or
terabytes in size.
2. Streaming data access – HDFS is built around the idea that the most
efficient data processing pattern is a write-once, read-many-times pattern.
3. Commodity hardware – Hadoop doesn’t require expensive, highly
reliable hardware. It’s designed to run on clusters of commodity hardware.

HDFS is not suitable for the following cases:


1. Low-latency data access – Applications that require low-latency access
to data, in the tens of milliseconds range, will not work well with HDFS.
2. Lots of small files – Because the namenode holds filesystem metadata in
memory, the limit to the number of files in a filesystem is governed by the
amount of memory on the namenode

Describe data ingest with Flume and Sqoop, and Hadoop archives

Apache Flume is a reliable and distributed system for collecting,


aggregating and moving massive quantities of log data. It has a simple yet
flexible architecture based on streaming data flows. Apache Flume is used
to collect log data present in log files from web servers and aggregating it
into HDFS for analysis.
Flume in Hadoop supports multiple sources like –
 ‘tail’ (which pipes data from a local file and write into HDFS via
2. Flume, similar to Unix command ‘tail’) 16 CO2 U
 System logs
 Apache log4j (enable Java applications to write event
 to files in HDFS via Flume).

Flume Architecture
A Flume agent is a JVM process which has three components – Flume
Source, Flume Channel and Flume Sink – through which events
propagate after being initiated at an external source.
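As an illustration (not part of the original answer), the sketch below shows what a simple Flume agent definition could look like in the standard properties-file format; the agent name "agent1", the tailed log file and the HDFS path are hypothetical.

# Name the source, channel and sink of agent1
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# 'tail'-style source that pipes a local log file into Flume
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# In-memory channel buffering events between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# HDFS sink writing the events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events/
agent1.sinks.sink1.channel = ch1

Such an agent would typically be started with: flume-ng agent --conf conf --conf-file agent1.conf --name agent1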
Sqoop :
Data stored in a relational database management system often needs
to be transferred into the Hadoop environment. Transferring such a large
amount of data manually is not feasible, but it becomes easy with the help of Sqoop.
Some of the important Features of the Sqoop :
 Sqoop also helps us to connect the result from the SQL Queries into
Hadoop distributed file system.
 Sqoop helps us to load the processed data directly into the hive or
Hbase.
 It performs the security operation of data with the help of Kerberos.
 With the help of Sqoop, we can perform compression of processed data.
 Sqoop is highly powerful and efficient in nature.
There are two major operations performed in Sqoop (illustrated below):
1. Import
2. Export
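A hedged sketch of how these two operations are usually invoked from the command line; the JDBC URL, database, table names and HDFS directories below are hypothetical, not taken from the original answer.

# Import: copy the "orders" table from a relational database into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

# Export: push processed results from HDFS back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --table order_summary \
  --export-dir /user/hadoop/order_summary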

Hadoop Archives (HAR)
Hadoop Archives (HAR) offer an effective way to deal with the small-files
problem. This section explains:
1. The problem with small files
2. What is HAR?
3. Limitations of HAR files
The problem with small files
Hadoop works best with big files and small files are handled
inefficiently in HDFS. As we know, Namenode holds the metadata
information in memory for all the files stored in HDFS. Let’s say we have a
file in HDFS which is 1 GB in size and the Namenode will store metadata
information of the file – like file name, creator, created time stamp, blocks,
permissions etc.
Now assume we decide to split this 1 GB file into 1000 pieces and
store all 1000 “small” files in HDFS. Now the Namenode has to store metadata
information for 1000 small files in memory. This is not very efficient – first,
it takes up a lot of memory, and second, the Namenode will soon become a
bottleneck as it tries to manage so much metadata.
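For illustration (the directory names and archive name are hypothetical, not from the original text), a Hadoop archive is typically created and then read back through the har:// scheme as follows:

# Archive the "input" directory (relative to the parent /user/hadoop) into files.har under /user/hadoop/archive
hadoop archive -archiveName files.har -p /user/hadoop input /user/hadoop/archive

# List the contents of the archive through the har filesystem
hadoop fs -ls har:///user/hadoop/archive/files.har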
Explain the following:
Hadoop I/O Compression
Compression. File compression brings two major benefits: it
reduces the space needed to store files, and it speeds up data transfer across
the network, or to or from disk.
Data compression at various stages in Hadoop
Compressing input files - You can compress the input files, which reduces
storage space in HDFS. If you compress the input files, they are
decompressed automatically when read (the codec is inferred from the filename extension).
Hadoop compression formats
There are many different compression formats available in Hadoop
3. framework. You will have to use one that suits your requirement. 16 CO2 U
Parameters that you need to look for are-
 Time it takes to compress.
 Space saving.
 Compression format is splittable or not.
Let’s go through the list of available compression formats and see which
format provides what characteristics.
Deflate– It is the compression algorithm whose implementation is zlib.
The Deflate compression algorithm is also used by the gzip compression tool.
Filename extension is .deflate.
gzip- gzip compression is based on Deflate compression algorithm. Gzip
compression is not as fast as LZO or snappy but compresses better so space
saving is more.
Gzip is not splittable. Filename extension is .gz.
bzip2- Using bzip2 for compression will provide higher compression ratio
but the compressing and decompressing speed is slow. Bzip2 is splittable,
Bzip2Codec implements SplittableCompressionCodec interface which
provides the capability to compress / de-compress a stream starting at any
arbitrary position.
Filename extension is .bz2.
Snappy– The Snappy compressor from Google provides fast compression
and decompression but compression ratio is less.
Snappy is not splittable. Filename extension is .snappy.
LZO– LZO, just like Snappy, is optimized for speed, so it compresses and
decompresses faster but its compression ratio is lower. LZO is not splittable by
default, but you can index the .lzo files as a pre-processing step to make
them splittable. Filename extension is .lzo.
LZ4– Has fast compression and decompression speed but the compression
ratio is lower. LZ4 is not splittable. Filename extension is .lz4.
Zstandard– Zstandard is a real-time compression algorithm, providing
high compression ratios. It offers a very wide range of compression / speed
trade-off.
Zstandard is not splittable. Filename extension is .zstd.
Codecs in Hadoop
Codec, short for compressor-decompressor, is the implementation of a
compression-decompression algorithm. In the Hadoop framework there are
different codec classes for different compression formats.
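A hedged sketch of how a codec is typically selected for a MapReduce job (the job name is hypothetical and this is only a fragment of a full driver, not code from the original answer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to reduce shuffle traffic (assumed setting).
        conf.setBoolean("mapreduce.map.output.compress", true);

        Job job = Job.getInstance(conf, "compressed-output-job");
        // Compress the final job output and pick the gzip codec class.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // ... mapper, reducer and input/output paths would be configured here ...
    }
}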
Hadoop I/O Serialization
Serialization is the process of turning structured objects into a byte
stream for transmission over a network or for writing to persistent storage.
Deserialization is the reverse process of turning a byte stream back into a
series of structured objects. Serialization appears in two quite distinct areas
of distributed data processing: for interprocess communication and for
persistent storage.
In Hadoop, interprocess communication between nodes in the
system is implemented using remote procedure calls (RPCs). The RPC
protocol uses serialization to render the message into a binary stream to be
sent to the remote node. It is desirable that an RPC serialization format is:
Compact
A compact format makes the best use of network bandwidth, which
is the most scarce resource in a data center.
Fast
Interprocess communication forms the backbone of a distributed
system, so it is essential that there is as little performance overhead as
possible for the serialization and deserialization process.
Extensible
Protocols change over time to meet new requirements, so it should
be straightforward to evolve the protocol in a controlled manner for clients
and servers. For example, it should be possible to add a new argument to a
method call, and have the new servers accept messages in the old format
(without the new argument) from old clients.
Interoperable
For some systems, it is desirable to be able to support clients written in a
different language from the server, so the format should be designed to
make this possible.
Hadoop uses its own serialization format, Writables, which is
certainly compact and fast, but not so easy to extend or use from languages
other than Java. Since Writables are central to Hadoop (most MapReduce
programs use them for their key and value types), we look at them in some
depth in the next three sections, before looking at serialization frameworks
in general, and then Avro (a serialization system that was designed to
overcome some of the limitations of Writables) in more detail.
Persistent Storage
Persistent Storage is a digital storage facility that does not lose its
data with the loss of power supply. Files, folders, databases are the
examples of persistent storage.
Writable Interface
This is the interface in Hadoop that provides the methods for serialization
and deserialization. It defines two methods: write(DataOutput out), which
serializes an object to a binary stream, and readFields(DataInput in), which
deserializes it, as sketched below.
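A minimal sketch of a custom Writable (the PairWritable type and its fields are hypothetical, used only to illustrate the two interface methods):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom type holding an int and a long, usable as a MapReduce value type.
public class PairWritable implements Writable {
    private int left;
    private long right;

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields to the binary stream.
        out.writeInt(left);
        out.writeLong(right);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in the same order they were written.
        left = in.readInt();
        right = in.readLong();
    }
}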

Discuss in detail about Hadoop File system Interface


Vertica provides several ways to interact with data stored in HDFS,
described below. Decisions about Cluster Layout can affect the decisions
you make about which Hadoop interfaces to use.
4. Querying Data Stored in HDFS 16 CO2 U
Vertica can query data directly from HDFS without requiring you to
copy data. Doing so allows you to explore a large data lake without
copying data into Vertica for analysis (and keeping the copy up to date).
Queries against data in columnar formats are optimized to reduce the
amount of data that Vertica has to read.
To query data, you define an external table as for any other external
data source. (This option does not make use of schema definitions from
Hive.) See Working with External Data for more about creating external
tables, and Reading ORC and Parquet Formats for more about
optimizations specific to these native Hadoop file formats.
Vertica can query the data directly from the HDFS nodes, bypassing
the WebHDFS service. To query data directly, Vertica needs access to
HDFS configuration files. See Using HDFS URLs.
Querying Data Using the HCatalog Connector
The HCatalog Connector uses Hadoop services (Hive and
HCatalog) to query data stored in HDFS. Using the HCatalog
Connector, Vertica can use Hive's schema definitions. However,
performance can be poor compared to defining your own external table and
querying the data directly. The HCatalog Connector is also sensitive to
changes in the Hadoop libraries on which it depends; upgrading your
Hadoop cluster might affect your HCatalog connections.
See Using the HCatalog Connector.
Using ROS Data
Storing data in the Vertica native file format (ROS) delivers better
query performance than reading externally-stored data. You can create
storage locations on your HDFS cluster to hold ROS data. Even if your data
is already stored in HDFS, you might choose to copy that data into an
HDFS storage location for better performance. If your Vertica cluster also
uses local file storage, you can use HDFS storage locations for lower-
priority data.
See Using HDFS Storage Locations for information about creating and
managing HDFS storage locations, and Using HDFS URLs for information
about copying data into them.
Exporting Data
You might want to export data from Vertica, either to share it with
other Hadoop-based applications or to move lower-priority data from ROS
to less-expensive storage. You can export a table, or part of one, in Hadoop
columnar format. After export you can still query the data using an external
table, as for any other data.

5. Summarize the Command Line Interface 16 CO2 U


The File System (FS) shell includes various shell-like commands
that directly interact with the Hadoop Distributed File System (HDFS) as
well as other file systems that Hadoop supports, such as Local FS, HFTP
FS, S3 FS, and others. Below are the commands supported
appendToFile
hadoop fs -appendToFile /home/testuser/test/test.txt
/user/haas_queue/test/test.txt
append the content of the /home/testuser/test/test.txt to the
/user/haas_queue/test/test.txt in the hdfs.
Cat- Copies source paths to stdout.
hadoop fs -cat hdfs://nameservice1/user/haas_queue/test/test.txt
checksum-Returns the checksum information of a file.
hadoop fs -checksum hdfs://nameservice1/user/haas_queue/test/test.txt
chgrp- Change group association of files. The user must be the owner of
files, or else a super-user.
hadoop fs -chgrp [-R] GROUP URI [URI …]
Options
The -R option will make the change recursively through the directory
structure.
Chmod - Change the permissions of files. With -R, make the change
recursively through the directory structure. The user must be the owner of
the file, or else a super-user. Additional information is in the Permissions
guide HdfsPermissionsGuide.html.
hadoop fs -chmod [-R] <MODE[,MODE]… | OCTALMODE> URI [URI
…]
Options
The -R option will make the change recursively through the directory
structure.
Chown- Change the owner of files. The user must be a super-user.
Additional information is in the Permissions
Guide HdfsPermissionsGuide.html
hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ]
Options:The -R option will make the change recursively through the
directory structure.

copyFromLocal - This command copies all the files inside the test folder on
the edge node to the test folder in HDFS. Similar to the put command, except
that the source is restricted to a local file reference.
hadoop fs -copyFromLocal /home/sa081876/test/* /user/haas_queue/test
Options: The -f option will overwrite the destination if it already exists.
copyToLocal - This command copies all the files inside the test folder in
HDFS to the test folder on the edge node. Similar to the get command, except
that the destination is restricted to a local file reference.
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
hadoop fs -copyToLocal /user/haas_queue/test/* /home/sa081876/test
count- Count the number of directories, files and bytes under the paths that
match the specified file pattern.
hadoop fs -count [-q] [-h] [-v] <paths>
The output columns with -count are: DIR_COUNT, FILE_COUNT,
CONTENT_SIZE, PATHNAME
The output columns with -count -q are: QUOTA, REMAINING_QUOTA,
SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT,
FILE_COUNT, CONTENT_SIZE, PATHNAME
The -h option shows sizes in human readable format.
The -v option displays a header line.
Example:
hadoop fs -count hdfs://nn1.example.com/file1
hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1
hadoop fs -count -q -h hdfs://nn1.example.com/file1
hdfs dfs -count -q -h -v hdfs://nn1.example.com/file1

cp: Copy files from source to destination. This command allows multiple
sources as well in which case the destination must be a directory.

Usage: hadoop fs -cp [-f] [-p | -p[topax]] URI [URI …] <dest>


Options:The -f option will overwrite the destination if it already exists.
The -p option will preserve file attributes [topx] (timestamps, ownership,
permission, ACL, XAttr). If -p is specified with no arg, then preserves
timestamps, ownership, permission. If -pa is specified, then preserves
permission also because ACL is a super-set of permission. Determination
of whether raw namespace extended attributes are preserved is independent
of the -p flag.
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
createSnapshot
HDFS Snapshots are read-only point-in-time copies of the file system.
Snapshots can be taken on a subtree of the file system or the entire file
system. Some common use cases of snapshots are data backup, protection
against user errors and disaster recovery. For more information refer the
link HdfsSnapshots.html
hdfs dfs -createSnapshot <path> [<snapshotName>]
path – The path of the snapshottable directory.
snapshotName – The snapshot name, which is an optional argument. When
it is omitted, a default name is generated using a timestamp with the format
“‘s’yyyyMMdd-HHmmss.SSS”, e.g. “s20130412-151029.033”

deleteSnapshot
Delete a snapshot from a snapshottable directory. This operation requires
owner privilege of the snapshottable directory. For more information refer to
the link HdfsSnapshots.html
hdfs dfs -deleteSnapshot <path> <snapshotName>
path – The path of the snapshottable directory.
snapshotName – The snapshot name.

df:Displays free space


hadoop fs -df [-h] URI [URI …]
Options:The -h option will format file sizes in a human-readable fashion.
Example:hadoop fs -df /user/hadoop/dir1

du: Displays sizes of files and directories contained in the given directory
or the length of a file in case its just a file.

hadoop fs -du [-s] [-h] URI [URI …]


Options: The -s option will result in an aggregate summary of file lengths
being displayed, rather than the individual files. The -h option will
format file sizes in a “human-readable” fashion
(e.g. 64.0m instead of 67108864).
Example: hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1
hdfs://nn.example.com/user/hadoop/dir1
expunge: Empty the Trash. For more info refer to the link HdfsDesign.html
hadoop fs -expunge

find: Finds all files that match the specified expression and applies selected
actions to them. If no path is specified then defaults to the current working
directory. If no expression is specified then defaults to -print.
hadoop fs -find <path> … <expression> …
hadoop fs -find / -name test -print

get: Copy files to the local file system. Files that fail the CRC check may
be copied with the -ignorecrc option. Files and CRCs may be copied using
the -crc option
hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
Example: hadoop fs -get /user/hadoop/file localfile
hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile

Getfacl: Displays the Access Control Lists (ACLs) of files and directories.
If a directory has a default ACL, then getfacl also displays the default ACL.

hadoop fs -getfacl [-R] <path>


Options:
-R: List the ACLs of all files and directories recursively.
path: File or directory to list.
Examples:
hadoop fs -getfacl /file
hadoop fs -getfacl -R /dir

Explain the Avro and File Based Data Structures


Apache Avro is a language-neutral data serialization system. It was
developed by Doug Cutting, the father of Hadoop. Since Hadoop writable
classes lack language portability, Avro becomes quite helpful, as it deals
6. 16 CO2 U
with data formats that can be processed by multiple languages. Avro is a
preferred tool to serialize data in Hadoop.
Avro has a schema-based system. A language-independent schema
is associated with its read and write operations. Avro serializes the data
which has a built-in schema. Avro serializes the data into a compact binary
format, which can be deserialized by any application.
Avro uses JSON format to declare the data structures. Presently, it supports
languages such as Java, C, C++, C#, Python, and Ruby.
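For illustration (a hypothetical "User" record, not taken from the original answer), an Avro schema declared in JSON typically looks like this:

{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}

Such a schema is stored in the metadata section of the Avro data file, which is what makes the file self-describing.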
File Based Data Structures. Two file formats:
1. SequenceFile
2. MapFile
SequenceFile
1. SequenceFile files are <key,value> flat files designed by Hadoop to store
binary forms of key-value pairs.
2. A SequenceFile can act as a container: small files can be packaged into a
SequenceFile so that they are stored and processed efficiently.
3. SequenceFile files are not sorted by their stored keys; SequenceFile's
internal Writer class provides append functionality.
4. The key and value in a SequenceFile can be any Writable type, including
custom Writable types.

SequenceFile Compression
The internal format of the SequenceFile depends on whether compression is
enabled and, if it is, whether record compression or block compression is used.
There are three types:
A. No compression type : If compression is not enabled (the default
setting), then each record consists of its record length (number of bytes),
the length of the key, the key and the value. The Length field is four bytes.
B. Record compression type : The record compression format is
basically the same as the uncompressed format, and the difference is that
the value byte is compressed with the encoder defined in the header. Note
that the key is not compressed.
C. Block compression type: Block compression compresses multiple
records at once, so it is more compact than record compression and is
generally preferred. When the number of bytes in the buffered records
reaches a minimum size, they are added to the block. The minimum size is
defined by the io.seqfile.compress.blocksize property. The default
value is 1000000 bytes. The format is record count, key length, key, value
length, value.
Benefits of the Sequencefile file format:
A. Supports data compression based on records (record) or blocks (block).
B. Supports splittable, which can be used as input shards for mapreduce.
C. Simple to modify: The main responsibility is to modify the
corresponding business logic, regardless of the specific storage format.
Disadvantages of the Sequencefile file format:
The downside is the need to merge files, and the merged file is
inconvenient to view because it is a binary file.
read/write SequenceFile (a sketch follows this list)
Write process:
1) Create a configuration 2) Get the filesystem 3) Create the file output path
4) Call SequenceFile.createWriter to get a SequenceFile.Writer object
5) Call SequenceFile.Writer.append to append records to the file
6) Close the stream
Read process:
1) Create a configuration 2) Get the filesystem 3) Create the file path
4) Create a SequenceFile.Reader for reading
5) Get the key class and value class
6) Close the stream
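A minimal sketch of these steps in Java (the path and the key/value contents are hypothetical; this uses the option-based SequenceFile API, which does not require obtaining the FileSystem explicitly):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // 1) create a configuration
        Path path = new Path("/tmp/example.seq");   // hypothetical file path

        // Write: obtain a SequenceFile.Writer and append key/value pairs.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }

        // Read: open a SequenceFile.Reader and iterate over the key/value pairs.
        IntWritable key = new IntWritable();
        Text value = new Text();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}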

Subject In charge Course Coordinator HOD IQAC


(Name & Signature) (Name & Signature)
