
APACHE SQOOP

Introduction to Apache Sqoop


• Most applications interact with their data through a relational database (RDBMS), which makes relational databases one of the most important sources of Big Data. Such data is stored in relational database servers in a relational (table) structure. Apache Sqoop plays an important role in the Hadoop ecosystem by providing a practical bridge between relational database servers and HDFS.
Advantages of Apache Sqoop
• Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer data between HDFS (Hadoop storage) and relational database servers such as MySQL, Oracle, SQLite, Teradata, Netezza, PostgreSQL, etc. Apache Sqoop imports data from relational databases into HDFS and exports data from HDFS back to relational databases. It efficiently transfers bulk data between Hadoop and external data stores such as enterprise data warehouses and relational databases.
• This is how Sqoop got its name – “SQL to Hadoop & Hadoop to SQL”.
• Additionally, Sqoop is used to import data from external datastores into
Hadoop ecosystem’s tools like Hive & HBase.
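The two directions correspond to two Sqoop subcommands (detailed options appear on later slides):

  sqoop import    # RDBMS table   -> files in HDFS
  sqoop export    # files in HDFS -> RDBMS table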
Why Sqoop?
• For Hadoop developers, the real work starts after the data is loaded into HDFS. They explore this data to uncover the insights hidden in it.

• For this analysis, the data residing in relational database management systems needs to be transferred to HDFS. Writing MapReduce code to import and export data between a relational database and HDFS is uninteresting and tedious. This is where Apache Sqoop comes to the rescue and removes that pain: it automates the process of importing and exporting the data.
Why Sqoop?
• Sqoop makes developers' lives easier by providing a CLI for importing and exporting data. They only have to supply basic information such as database authentication details, the source, the destination, and the operation; Sqoop takes care of the rest.

• Sqoop internally converts the command into MapReduce tasks, which are then executed on the Hadoop cluster over HDFS. It uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
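As a rough sketch, a typical import needs little more than the connection details, the source table, and the destination directory; the JDBC URL, credentials, and names below are placeholders:

  sqoop import \
      --connect jdbc:mysql://dbserver:3306/salesdb \
      --username dbuser -P \
      --table customers \
      --target-dir /user/hadoop/customers

Here -P prompts for the password interactively; Sqoop turns the command into map-only MapReduce tasks that write the table's rows into the HDFS directory.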
Key Features of Sqoop
Sqoop provides many salient features (command-line sketches follow this list):
1. Full Load: Apache Sqoop can load a whole table with a single command. You can also load all the tables from a database with a single command.
2. Incremental Load: Apache Sqoop also provides the facility of incremental load, where you can load just the parts of a table that have been updated.
3. Parallel import/export: Sqoop uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
4. Import results of SQL query: You can also import the result returned by an SQL query into HDFS.
5. Compression: You can compress your data using the deflate (gzip) algorithm with the --compress argument, or by specifying the --compression-codec argument. You can also load a compressed table into Apache Hive.
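As rough command-line sketches of the features above (database, table, and column names are placeholders):

  # 1 & 3: full load of one table, split across 4 parallel mappers
  sqoop import --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --table orders --target-dir /data/orders -m 4

  # 2: incremental load that appends only rows whose id exceeds the last imported value
  sqoop import --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --table orders --target-dir /data/orders \
      --incremental append --check-column id --last-value 100000

  # 4: import the result of a free-form SQL query (Sqoop requires the $CONDITIONS token)
  sqoop import --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --query 'SELECT id, amount FROM orders WHERE $CONDITIONS' \
      --split-by id --target-dir /data/order_amounts

  # 5: compress the imported files (gzip/deflate by default; --compression-codec picks another codec)
  sqoop import --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --table orders --target-dir /data/orders_gz --compress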
Key Features of Sqoop
• Connectors for all major RDBMS databases: Apache Sqoop provides connectors for multiple RDBMS databases, covering almost all of the commonly used relational databases.
• Kerberos Security Integration: Kerberos is a computer network
authentication protocol which works on the basis of ‘tickets’ to allow
nodes communicating over a non-secure network to prove their
identity to one another in a secure manner. Sqoop supports Kerberos
authentication.
• Load data directly into HIVE/HBase: You can load data directly into
Apache Hive for analysis and also dump your data in HBase, which is a
NoSQL database.
• Support for Accumulo: You can also instruct Sqoop to import a table into Accumulo rather than a directory in HDFS.
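For example, assuming a hypothetical customers table, the same import can be directed into Hive or HBase instead of a plain HDFS directory:

  # import straight into a Hive table
  sqoop import --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --table customers --hive-import --hive-table customers

  # import into an HBase table, keyed on the customer id
  sqoop import --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --table customers --hbase-table customers --column-family info \
      --hbase-row-key id --hbase-create-table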
Sqoop Architecture & Working
Sqoop Architecture
• The import tool imports individual tables from RDBMS to HDFS. Each row in a table
is treated as a record in HDFS.
• When we submit a Sqoop command, the main task is divided into subtasks, each handled internally by an individual map task. Each map task imports a part of the data into the Hadoop ecosystem; collectively, all the map tasks import the whole data set.
How Sqoop Works?
Sqoop Import
• The import tool imports individual tables from RDBMS to HDFS. Each row in a table
is treated as a record in HDFS. All records are stored as text data in text files or as
binary data in Avro and Sequence files.
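The storage format is chosen with a flag on the import command; a sketch with placeholder names:

  sqoop import --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --table orders --target-dir /data/orders_avro --as-avrodatafile
  # alternatives: --as-textfile (the default), --as-sequencefile, --as-parquetfile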
Sqoop Export
• The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which become rows in the target table. They are read and parsed into a set of records using the user-specified delimiter.
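A minimal export sketch, assuming the HDFS files are comma-delimited and the target table already exists in the database:

  sqoop export --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --table orders_summary --export-dir /data/orders_summary \
      --input-fields-terminated-by ','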
Apache Sqoop - Working
• Export also works in a similar manner.
• The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which become rows in the target table.
• When we submit an export job, it is mapped into map tasks, each of which brings a chunk of data from HDFS. These chunks are exported to a structured data destination. Combining all the exported chunks, we receive the whole data set at the destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).
Apache Sqoop - Working
• A reduce phase is required only when aggregations are performed. Apache Sqoop just imports and exports the data; it does not perform any aggregations, so its jobs are map-only. The map job launches multiple mappers, depending on the number defined by the user.
• For a Sqoop import, each mapper task is assigned a part of the data to be imported. Sqoop distributes the input data evenly among the mappers to get high performance.
• Each mapper then creates a connection to the database using JDBC, fetches the part of the data assigned to it by Sqoop, and writes it into HDFS, Hive, or HBase based on the arguments provided on the CLI.
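The degree of parallelism and the column used to split rows among mappers can be set explicitly; a sketch with placeholder names:

  sqoop import --connect jdbc:mysql://dbserver/salesdb --username dbuser -P \
      --table orders --target-dir /data/orders \
      --num-mappers 8 --split-by order_id
  # Sqoop computes the MIN/MAX of the split column, gives each mapper one range,
  # and each mapper fetches its range over its own JDBC connection.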
Flume vs Sqoop
The major difference between Flume and Sqoop is that:
•Flume only ingests unstructured or semi-structured data into HDFS.
•Sqoop, on the other hand, can both import and export structured data between an RDBMS or enterprise data warehouse and HDFS.
THANK YOU
