Sqoop

Sqoop is a tool used to transfer data between relational databases and Hadoop. It allows importing and exporting large volumes of data between relational databases such as MySQL and HDFS. Sqoop supports operations like import, export, validation and incremental imports. It uses parallel processing to distribute the data transfer workload across multiple nodes for efficiency.

Definition of SQOOP

• Sqoop is defined as a tool used to perform data transfer operations between a relational database management system and the Hadoop server. It thus helps transfer bulk data from one source system to another.
Some of the important features of Sqoop:
• Sqoop can load the result of a SQL query into the Hadoop Distributed File System (HDFS).
• Sqoop can load the processed data directly into Hive or HBase (illustrated in the sketch after this list).
• It secures the data transfer with the help of Kerberos authentication.
• With the help of Sqoop, we can compress the processed data.
• Sqoop is powerful and efficient, since transfers run as parallel map tasks.
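As a rough sketch of how these features come together on the command line (reusing the emp table and userdb database that appear in the commands later in this deck; the Gzip codec is an illustrative choice, not something this document specifies), a single import can write compressed data straight into Hive:

$ sqoop import \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp \
  --hive-import \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.GzipCodec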
Operations of Sqoop
• There are two major operations performed in Sqoop:
• Import
• Export
ARCHITECTURE OF SQOOP
Internal working
Sqoop - Export
• The export tool transfers data from HDFS back to the RDBMS database.
• The target table must already exist in the target database.
• The files given as input to Sqoop contain records, which are called rows in the table.
• Those rows are read and parsed into a set of records, delimited with a user-specified delimiter.
• The default operation is to insert all the records from the input files into the database table using the INSERT statement.
• In update mode, Sqoop generates UPDATE statements that replace the existing records in the database (a sketch of both modes follows below).
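A minimal export sketch under these rules (the table employee, the HDFS directory /emp/emp_data, and the key column id are illustrative placeholders, not taken from this deck). The default behaviour is plain INSERTs; adding --update-key and --update-mode switches to the update mode described above:

$ sqoop export \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table employee \
  --export-dir /emp/emp_data \
  --update-key id \
  --update-mode updateonly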
• Sqoop works in the following manner: it first parses the arguments provided by the user in the command-line interface and then passes those arguments on to the stage where a map-only job is prepared.
• Once the map stage receives the arguments, it launches multiple mappers depending on the number defined by the user as an argument in the command-line interface.
• For the import command, each mapper task is assigned its respective part of the data to be imported, on the basis of the key column defined by the user in the command-line interface.
• To increase the efficiency of the process, Sqoop uses a parallel-processing technique in which the data is distributed equally among all mappers.
• After this, each mapper creates an individual connection with the database using JDBC (Java Database Connectivity) and fetches the part of the data assigned to it by Sqoop.
• Once the data is fetched, it is written to HDFS, HBase, or Hive on the basis of the arguments provided on the command line; thus the Sqoop import process is completed. (A sketch of how the mapper count and split key are controlled follows below.)
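A minimal sketch of controlling this parallelism from the command line (the id split column and the mapper count of 4 are assumptions for illustration): -m sets how many mappers are launched and --split-by names the key column used to divide the data among them:

$ sqoop import \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp \
  --split-by id \
  -m 4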
Sqoop Validation and interfaces
Validation is nothing but validating the data that has been copied: the reason for validation and the steps involved are summarised below.
• a. Sqoop validation simply means validating the data copied, for either import or export, by comparing the row counts from the source as well as the target after the copy.
• b. Moreover, we use this option to compare the row counts between the source and the target just after the data is imported into HDFS.
• c. If rows are deleted or added during the import, Sqoop tracks this change and also updates the log file.
Interfaces of Sqoop Validation
• Basically, there are 3 interfaces of Sqoop validation:

a. ValidationThreshold
• Determines whether the error margin between the source and the target is acceptable: Absolute, Percentage Tolerant, and so on. The default implementation is AbsoluteValidationThreshold, which ensures that the row counts from the source and the target are the same.

b. ValidationFailureHandler
• Responsible for handling failures, such as logging an error/warning, aborting, and so on. The default implementation is LogOnFailureHandler, which logs a warning message to the configured logger.

c. Validator
• Drives the validation logic and delegates failure handling to ValidationFailureHandler. The default implementation is RowCountValidator.
COMMANDS
Import data into HDFS (syntax used with Sqoop validation).
• $ sqoop import (generic-args) (import-args)
• $ sqoop-import (generic-args) (import-args)
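A hedged sketch of enabling validation during an import: the --validate flag turns on the default implementations described above (RowCountValidator, AbsoluteValidationThreshold, LogOnFailureHandler), and custom classes can be supplied with --validator, --validation-threshold and --validation-failurehandler. The connection details reuse the userdb/emp names from this deck:

$ sqoop import \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp \
  --validate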
Incremental Import
• Incremental import is a technique that imports only the newly added rows of a table. It requires adding the ‘incremental’, ‘check-column’, and ‘last-value’ options to the Sqoop import command.
• The following syntax is used for the incremental option in the Sqoop import command.
--incremental <mode>
--check-column <column name>
--last-value <last check column value>
• Let us assume the newly added data into emp table
is as follows −
1206, satish p, grp des, 20000, GR
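A minimal sketch of the corresponding incremental import (assuming id is the check column and 1205 was the highest value already imported; both specifics are assumptions, not stated in this deck):

$ sqoop import \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp \
  --incremental append \
  --check-column id \
  --last-value 1205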
commands
• The following command is used to import
the emp table from MySQL database server to
HDFS.
• $ sqoop import \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp -m 1
• To verify the imported data in HDFS, use the
following command.
• $ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
Import tables
• This imports tables from the RDBMS database server to HDFS. Each table's data is stored in a separate directory, and the directory name is the same as the table name.
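A hedged sketch of importing every table at once with the import-all-tables tool (reusing the userdb connection details from the earlier commands; each table lands in its own directory named after the table):

$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/userdb \
  --username root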
