Unit II Big Data Analytics
HDFS
When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. Filesystems that manage the
storage across a network of machines are called distributed filesystems. Hadoop comes with a
distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
The Design of HDFS : HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity hardware.
Very large files: “Very large” in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of
data.
Streaming data access : HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from
source, and then various analyses are performed on that dataset over time.
Commodity hardware : Hadoop doesn’t require expensive, highly reliable hardware to run
on. It’s designed to run on clusters of commodity hardware (commonly available hardware
from multiple vendors) for which the chance of node failure across the cluster is high, at least
for large clusters. HDFS is designed to carry on working without a noticeable interruption to
the user in the face of such failure.
These are areas where HDFS is not a good fit today:
Low-latency data access : Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.
Lots of small files : Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single
writer. Writes are always made at the end of the file. There is no support for multiple writers,
or for modifications at arbitrary offsets in the file.
HDFS Concepts
Blocks: HDFS has the concept of a block, but it is a much larger unit, 64 MB by default. Files
in HDFS are broken into block-sized chunks, which are stored as independent units. Having a
block abstraction for a distributed filesystem brings several benefits:
The first benefit : A file can be larger than any single disk in the network. There’s nothing
that requires the blocks from a file to be stored on the same disk, so they can take advantage of
any of the disks in the cluster.
Second: Making the unit of abstraction a block rather than a file simplifies the storage
subsystem. The storage subsystem deals with blocks, simplifying storage management (since
blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and
eliminating metadata concerns.
Third: Blocks fit well with replication for providing fault tolerance and availability. To insure
against corrupted blocks and disk and machine failure, each block is replicated to a small
number of physically separate machines (typically three).
Why Is a Block in HDFS So Large? HDFS blocks are large compared to disk blocks, and the
reason is to minimize the cost of seeks. By making a block large enough, the time to transfer
the data from the disk can be made to be significantly larger than the time to seek to the start
of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk
transfer rate. A quick calculation shows that if the seek time is around 10 ms, and the transfer
rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS installations
use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow
with new generations of disk drives.
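As a quick check of the arithmetic above (using the same assumed figures of a 10 ms seek time and a 100 MB/s transfer rate): transferring a 100 MB block takes 100 MB / 100 MB/s = 1 s, while the seek takes 0.01 s, so the seek overhead is 0.01 s / 1 s = 1% of the transfer time. For a 64 MB block the transfer takes 0.64 s, so the seek overhead rises to roughly 1.6%.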
Namenodes and Datanodes: An HDFS cluster has two types of node operating in a master-
worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode
manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the
files and directories in the tree.
HDFS is one of the primary components of a Hadoop cluster, and HDFS is designed to have a
master-slave architecture.
Master: NameNode
Slaves: {DataNode} ….. {DataNode}
- The Master (NameNode) manages the file system namespace operations, such as opening,
closing, and renaming files and directories, determines the mapping of blocks to DataNodes,
and regulates access to files by clients.
- The Slaves (DataNodes) are responsible for serving read and write requests from the file
system’s clients, and they perform block creation, deletion, and replication upon instruction
from the Master (NameNode).
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with
lists of blocks that they are storing.
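A minimal sketch (not taken from these notes) using the Hadoop FileSystem API, showing how a client can ask the namenode which datanodes hold the blocks of a file. The path /geeks/AI.txt is only the example file used later in these notes and is otherwise hypothetical.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // connects to the namenode
        FileStatus status = fs.getFileStatus(new Path("/geeks/AI.txt"));   // hypothetical example path
        // The block-to-datanode mapping is metadata held by the namenode;
        // the block data itself lives on the datanodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(String.join(",", block.getHosts()));  // datanodes holding this block
        }
        fs.close();
    }
}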
NameNode failure: If the machine running the namenode failed, all the files on the filesystem
would be lost, since there would be no way of knowing how to reconstruct the files from the
blocks on the datanodes.
HDFS Concepts
HDFS is a distributed file system that handles large data sets running on commodity hardware.
It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.
HDFS is one of the major components of Apache Hadoop, the others being MapReduce and
YARN.
DFS is a concept of storing a file on multiple nodes in a distributed manner. DFS actually
provides the abstraction of a single large system whose storage is equal to the sum of the
storage of the other nodes in a cluster.
Let’s understand this with an example. Suppose you have a DFS comprising 4 different
machines, each of size 10 TB; in that case you can store, let’s say, 30 TB across this DFS, as it
provides you a combined machine of size 40 TB. The 30 TB of data is distributed among these
nodes in the form of blocks.
1. ls: This command is used to list all the files. Use lsr for a recursive listing; it is useful
when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables,
so bin/hdfs means we want the hdfs executable, and dfs selects the Distributed File
System commands in particular.
2. cat: To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside the geeks folder
bin/hdfs dfs -cat /geeks/AI.txt
3. copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store.
This is the most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks
4. touchz: It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
5. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So
let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be
created relative to the home directory.
6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from geeks folder will be copied to folder hero present on Desktop.
7. moveFromLocal: This command will move a file from the local file system to HDFS.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
8. cp: This command is used to copy files within hdfs.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
9. mv: This command is used to move files within hdfs.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful
command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then
the directory itself.
Hadoop File System Interface.
• HFTP – this was the first mechanism that provided HTTP access to HDFS. It was
designed to facilitate data copying between clusters with different Hadoop versions.
HFTP is a part of HDFS. It redirects clients to the datanode containing the data for
providing data locality. Nevertheless, it supports only the read operations. The HFTP
HTTP API is neither curl/wget friendly nor RESTful. WebHDFS is a rewrite of HFTP
and is intended to replace HFTP.
• HdfsProxy – an HDFS contrib project. It runs as external servers (outside HDFS) to
provide a proxy service. Common use cases of HdfsProxy are firewall tunneling and
user authentication mapping.
• HdfsProxy V3 – Yahoo!’s internal version, which is a dramatic improvement over
HdfsProxy. It has an HTTP REST API and other features such as bandwidth control.
Nonetheless, it is not yet publicly available.
• Hoop – a rewrite of HdfsProxy that aims to replace it. Hoop has an HTTP REST API.
Like HdfsProxy, it runs as external servers to provide a proxy service. Because it is a
proxy running outside HDFS, it cannot take advantage of some features, such as
redirecting clients to the corresponding datanodes for data locality. It has advantages,
however, in that it can be extended to control and limit bandwidth like HdfsProxy V3,
or to carry out authentication translation from one mechanism to HDFS’s native
Kerberos authentication. Also, it can provide a proxy service to other file systems,
such as Amazon S3, via the Hadoop FileSystem API.
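All of these front ends are reached by clients through the same Hadoop FileSystem API. The following is a minimal sketch, assuming a hypothetical WebHDFS endpoint at namenode:50070 and the example path /geeks/AI.txt, of reading a file through that interface; the webhdfs:// scheme could be swapped for hdfs:// or hftp:// without changing the rest of the code.
Example (sketch):
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadOverHttp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // namenode:50070 is a hypothetical host:port for the HTTP front end
        FileSystem fs = FileSystem.get(URI.create("webhdfs://namenode:50070/"), conf);
        InputStream in = fs.open(new Path("/geeks/AI.txt"));
        IOUtils.copyBytes(in, System.out, 4096, true);  // copy the file contents to stdout, then close the stream
    }
}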
Data Flow
A basic data flow of the Hadoop system can be divided into four phases:
1. Capture Big Data : The sources can be extensive lists that are structured, semi-
structured, and unstructured, some streaming, real-time data sources, sensors, devices,
machine-captured data, and many other sources. For data capturing and storage, we
have different data integrators such as, Flume, Sqoop, Storm, and so on in the Hadoop
ecosystem, depending on the type of data.
2. Process and Structure: We will be cleansing, filtering, and transforming the data by
using a MapReduce-based framework or some other framework that can perform
distributed processing on the data.
Flume: Flume Agents have the ability to transfer data created by a streaming application to
data stores like HDFS and HBase.
Flume is a distributed and reliable tool for efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible architecture based on streaming data flows. It
is robust and fault-tolerant, with tunable reliability mechanisms.
Sqoop: Sqoop can be used to bulk import data from typical RDBMS to Hadoop storage
structures like HDFS or Hive.
Sqoop is a tool used to transfer bulk data between Hadoop and external datastores such as
relational databases (MS SQL Server, MySQL).
Apache Sqoop (which is a portmanteau for “sql-to-hadoop”) is an open source tool that allows
users to extract data from a structured data store into Hadoop for further processing. This
processing can be done with MapReduce programs or with other higher-level tools such as
Hive, Pig, or Spark.
• HDFS stores small files inefficiently, since each file is stored in a block and block metadata
is held in memory by the NameNode.
• Thus, a large number of small files can take a lot of memory on the NameNode. (Note,
however, that a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not
128 MB.)
• Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing NameNode memory usage while still allowing
transparent access to files.
• Hadoop Archives can be used as input to MapReduce.
I/O Compression :
• In the Hadoop framework, where large data sets are stored and processed, you will
need storage for large files.
• These files are divided into blocks and those blocks are stored in different nodes
across the cluster so lots of I/O and network data transfer is also involved.
• In order to reduce the storage requirements and the time spent in network transfer, you
can look at data compression in the Hadoop framework.
• Using data compression in Hadoop, you can compress files at various steps; at each of
these steps it helps to reduce the storage used and the quantity of data transferred.
• You can compress the input file itself.
• That will help you reduce storage space in HDFS.
• You can also configure the output of a MapReduce job to be compressed in Hadoop, as in
the sketch below.
• That helps in reducing storage space if you are archiving the output or sending it to some
other application for further processing.
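A minimal sketch (not part of the original notes) of how a MapReduce job’s output might be compressed using the FileOutputFormat helpers; the surrounding job setup (mapper, reducer, input/output paths) is omitted, and the job name is hypothetical.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed-output");  // hypothetical job name
        // ... set mapper, reducer, and input/output paths here ...
        FileOutputFormat.setCompressOutput(job, true);                     // compress the job output
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);   // use the gzip codec
    }
}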
Serialization
• Data serialization is the process of converting structured data into a stream of bytes;
deserialization converts that stream back to the original form.
• We serialize to translate data structures into a stream of data. This stream of data can be
transmitted over the network or stored in a DB regardless of the system architecture.
• Isn’t storing information in binary form, or as a stream of bytes, already the right approach?
• Serialization does the same, but it isn’t dependent on the architecture.
Consider CSV files, which contain commas (,) between data values; during deserialization,
wrong outputs may occur. Now, if the metadata is stored in XML form, a self-architected form
of data storage, the data can easily be deserialized.
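A minimal sketch (not from the notes) of Hadoop’s Writable serialization, round-tripping a single IntWritable through a byte stream to show that the bytes can be stored or transmitted independently of the machine architecture.
Example (sketch):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));        // serialize: object -> stream of bytes

        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(restored.get());                 // deserialize: bytes -> 163 again
    }
}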
Avro
Avro is an open source project that provides data serialization and data exchange services for
Apache Hadoop. These services can be used together or independently. Avro facilitates the
exchange of big data between programs written in any language. With the serialization service,
programs can efficiently serialize data into files or into messages. The data storage is compact
and efficient. Avro stores both the data definition and the data together in one message or file.
Avro files include markers that can be used to split large data sets into subsets suitable
for Apache MapReduce processing.
Avro handles schema changes like missing fields, added fields and changed fields; as a result,
old programs can read new data and new programs can read old data. Avro includes APIs for
Java, Python, Ruby, C, C++, and more. Data stored using Avro can be passed between programs
written in different languages, even from a compiled language like C to a scripting language
like Apache Pig.
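A minimal sketch, using the Avro generic API and a tiny hypothetical “user” schema, of writing one record to an Avro data file; note that the schema (the data definition) is embedded in the file together with the data, as described above.
Example (sketch):
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema with a single string field called "name"
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"user\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "geek");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("users.avro"));  // the schema is stored in the file header
        writer.append(record);                          // records are stored in Avro's compact binary form
        writer.close();
    }
}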
The systems that are used to organize and maintain data files are known as file-based data
systems. These file systems are used to handle single or multiple files and are not very
efficient.
Two file formats:
1. SequenceFile
2. MapFile
SequenceFile
1. SequenceFile files are flat files designed by Hadoop to store <key,value> pairs in binary
form.
2. A SequenceFile can be used as a container: packaging many small files into a SequenceFile
allows them to be stored and processed efficiently.
3. SequenceFile files are not sorted by their stored keys; SequenceFile's internal Writer class
provides append functionality.
4. The key and value in a SequenceFile can be any Writable type or a custom Writable type.
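A minimal sketch (using the classic FileSystem-based createWriter signature) of appending <key,value> pairs to a SequenceFile; the path /geeks/data.seq and the record contents are hypothetical.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/geeks/data.seq");          // hypothetical output path
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
        for (int i = 0; i < 5; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));  // append <key,value> pairs
        }
        writer.close();
    }
}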
SequenceFile Compression
1. The internal format of the SequenceFile depends on whether compression is enabled and, if
it is, whether it is record compression or block compression.
2. There are three types:
A. No compression type : If compression is not enabled (the default setting), then each record
consists of its record length (number of bytes), the length of the key, the key and the value.
The Length field is four bytes.
B. Record compression type : The record compression format is basically the same as the
uncompressed format; the difference is that the value bytes are compressed with the codec
defined in the header. Note that the key is not compressed.
C. Block compression type : Block compression compresses multiple records at once, so it is
more compact than record compression and is generally preferred. When the number of bytes
collected reaches the minimum size, the records are added to the block. The minimum size is
defined by the io.seqfile.compress.blocksize property; the default value is 1000000
bytes. The format is: record count, key length, key, value length, value.
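Building on the SequenceFileWrite sketch above, the writer could instead be created with an explicit compression type; this variant (block compression with the gzip codec) is a sketch, not taken from the notes, and reuses the same fs, conf and path variables plus the extra imports named in the comments.
Example (sketch):
// Extra imports: org.apache.hadoop.io.compress.CompressionCodec,
//                org.apache.hadoop.io.compress.GzipCodec,
//                org.apache.hadoop.util.ReflectionUtils
CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);  // codec configured with conf
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, path, IntWritable.class, Text.class,
    SequenceFile.CompressionType.BLOCK,   // or RECORD, or NONE (the three types described above)
    codec);                               // codec used to compress the keys/values in each block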
MapFile
A MapFile is a sorted SequenceFile with an index, so records can be looked up by key.
Unlike a SequenceFile, the MapFile key must implement the WritableComparable interface,
that is, the key must be comparable, and the value must be a Writable type.
You can use the MapFile.fix() method to rebuild the index and convert a SequenceFile to a
MapFile.
It has two static member variables:
static final String INDEX_FILE_NAME
static final String DATA_FILE_NAME
By observing its directory structure, we can see that a MapFile consists of two parts: data and
index.
The index is a file of data indexes; it primarily records the key of each indexed record and the
offset at which that record is located in the data file.
When the MapFile is accessed, the index file is loaded into memory, and the index mapping is
used to quickly locate the position in the data file where the specified record lies.
Therefore, retrieval from a MapFile is efficient relative to a SequenceFile; the disadvantage is
that it consumes a portion of memory to store the index data.
It should be noted that the MapFile does not record every record in the index; by default it
stores an index entry for every 128 records. Of course, this indexing interval can be modified,
either through the MapFile.Writer setIndexInterval() method or by changing the
io.map.index.interval property.
Read/write MapFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Create a new MapFile.Writer object
5) Call MapFile.Writer.append() to append and write records to the file
6) Close the stream
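A minimal sketch of the write process listed above, using the classic MapFile.Writer constructor; the directory /geeks/map and the record contents are hypothetical, and keys must be appended in sorted order.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // 1) create a Configuration
        FileSystem fs = FileSystem.get(conf);                  // 2) get the FileSystem
        String dir = "/geeks/map";                             // 3) output path (a directory holding data + index)
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class);  // 4) new MapFile.Writer
        for (int i = 0; i < 1000; i++) {
            writer.append(new IntWritable(i), new Text("value-" + i));         // 5) append keys in sorted order
        }
        writer.close();                                        // 6) close the stream
    }
}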
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a new MapFile.Reader for reading
5) Get the key class and value class
6) Close the stream
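A minimal sketch of the read process, opening a MapFile.Reader on the same hypothetical /geeks/map directory and looking up a record by key through the in-memory index.
Example (sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                // 1) create a Configuration
        FileSystem fs = FileSystem.get(conf);                     // 2) get the FileSystem
        MapFile.Reader reader =
            new MapFile.Reader(fs, "/geeks/map", conf);           // 3)-4) open a reader on the MapFile directory
        IntWritable key = new IntWritable(42);                    // 5) key and value instances
        Text value = new Text();
        reader.get(key, value);                                   //    index lookup by key
        System.out.println(value);
        reader.close();                                           // 6) close the stream
    }
}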