How Sqoop Works: Relational Database Servers in the Relational Database Structure
When Big Data storage and analysis tools such as MapReduce, Hive, HBase, Cassandra, and
Pig in the Hadoop ecosystem came into the picture, they required a tool to interact
with relational database servers for importing and exporting the Big Data residing in
them. Sqoop occupies this place in the Hadoop ecosystem, providing feasible
interaction between relational database servers and Hadoop’s HDFS.
Sqoop is a tool designed to transfer data between Hadoop and relational database
servers. It is used to import data from relational databases such as MySQL and Oracle
into Hadoop HDFS, and to export data from the Hadoop file system back to relational
databases. It is provided by the Apache Software Foundation.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given
as input to Sqoop contain records, which become rows in the target table. These files
are read and parsed into a set of records, delimited with a user-specified delimiter.
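For reference, the export tool follows the same command pattern as the import tool described next; its syntax is:
$ sqoop export (generic-args) (export-args)
A typical export reads the delimited files under an HDFS directory and inserts them into an existing target table, for example (the table name and HDFS directory here are illustrative, not taken from the sample schema):
sqoop export --connect jdbc:mysql://192.168.80.132:3306/books --username root --table authors_export --export-dir /sqoop/mysqlstage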
Syntax
The following syntax is used to import data into HDFS
$ sqoop import (generic-args) (import-args)
Example in MySQL
list-databases
Lists the databases available on your MySQL server.
$ sqoop list-databases --connect jdbc:mysql://192.168.80.134:3306/employees --username root
information_schema
employees
test
list-tables
Lists the tables in your MySQL database.
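A minimal sketch of the command, assuming the same connection parameters as the list-databases example above:
$ sqoop list-tables --connect jdbc:mysql://192.168.80.134:3306/employees --username root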
departments
dept_emp
dept_manager
employees
employees_exp_stg
employees_export
salaries
titles
Commonly used arguments of the import tool include:
Argument                        Description
--append                        Append data to an existing dataset in HDFS
--as-avrodatafile               Imports data to Avro Data Files
--as-textfile                   Imports data as plain text (default)
--boundary-query <statement>    Boundary query to use for creating splits
--columns <col,col,col…>        Columns to import from table
--direct                        Use direct import fast path
--direct-split-size <n>         Split the input stream every n bytes when importing in direct mode
-m,--num-mappers <n>            Use n map tasks to import in parallel
-e,--query <statement>          Import the results of statement
--split-by <column-name>        Column of the table used to split work units
--table <table-name>            Table to read
--target-dir <dir>              HDFS destination directory
--where <where clause>          WHERE clause to use during import
-z,--compress                   Enable compression
--compression-codec <c>         Use Hadoop codec (default gzip)
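As an illustration of how several of these arguments combine, the following sketch imports only selected columns and rows from a MySQL table and compresses the output (the column names and WHERE clause are assumptions, not taken from the sample schema):
sqoop import --connect jdbc:mysql://192.168.80.132:3306/books --username root \
  --table authors --columns "id,name,email" --where "id > 100" \
  --target-dir /sqoop/authors_filtered -m 1 -z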
Importing a Table
The Sqoop ‘import’ tool is used to import data from an RDBMS table into the Hadoop file
system as a text file or a binary file.
The following command is used to import the authors table from the MySQL database
server to HDFS.
sqoop import --connect jdbc:mysql://192.168.80.132:3306/books --username root
--table authors --target-dir /sqoop/mysqlstage -m 1
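After the import finishes, the records are written as comma-delimited text under the target directory; with a single mapper (-m 1) they end up in one part file, which can be inspected with:
$ hdfs dfs -cat /sqoop/mysqlstage/part-m-00000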
Incremental Import
Incremental import is a technique that imports only the newly added rows of a
table. The ‘incremental’, ‘check-column’, and ‘last-value’ options are required to
perform an incremental import.
Argument                Description
--check-column (col)    Specifies the column to be examined when determining which rows to import.
--incremental (mode)    Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)    Specifies the maximum value of the check column from the previous import.
You should specify append mode when importing a table where new
rows are continually being added with increasing row id values. You
specify the column containing the row's id with --check-column. Sqoop
imports rows where the check column has a value greater than the
one specified with --last-value.
--incremental <mode> --check-column <column name> --last-value <last check column value>
The result is:
1,Vivek,[email protected]
2,Priya,[email protected]
3,Tom,[email protected]
The above options import all the new rows added since the last value.
In HDFS, a new file containing all the new records is created in the same target directory.
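For concreteness, a full incremental append command for the authors table imported earlier might look like the following sketch (the id check column and the starting last value of 0 are assumptions about the schema):
sqoop import --connect jdbc:mysql://192.168.80.132:3306/books --username root \
  --table authors --target-dir /sqoop/mysqlstage -m 1 \
  --incremental append --check-column id --last-value 0
On later runs, --last-value is set to the highest id already imported, so only rows added after that value are fetched.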
Now consider a second case, where existing rows have been updated:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2015-01-01 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
| 103 | Charlotte | NC | 9 | 2013-04-22 |
| 103 | Greenville | SC | 9 | 2013-05-12 |
| 103 | Atlanta | GA | 11 | 2013-08-21 |
| 104 | Dallas | Texas | 4 | 2015-02-02 |
| 105 | Phoenix | Arzona | 17 | 2015-02-24 |
+------+------------+----------+------+------------+
Here we use incremental lastmodified mode, which fetches all the rows updated after a
given date (based on the rDate column), as sketched below.
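A sketch of such a lastmodified import, assuming the rows above live in a table named city_ranks and that the previous import ran on 2014-01-25 (both assumptions); note that if the target directory already exists, recent Sqoop versions may additionally require --append or a --merge-key column to reconcile the updated rows:
sqoop import --connect jdbc:mysql://192.168.80.134:3306/employees --username root \
  --table city_ranks --target-dir /sqoop/city_ranks_updates -m 1 \
  --incremental lastmodified --check-column rDate --last-value "2014-01-25"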
Sqoop Job
The Sqoop job tool creates and saves import and export commands. It specifies
parameters to identify and recall the saved job. This re-calling or re-execution is
used in incremental import, which can import the updated rows from an RDBMS
table to HDFS.
Syntax
The following is the syntax for creating a Sqoop job
$ sqoop job (generic-args) (job-args)
[-- [subtool-name] (subtool-args)]
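As a sketch, the incremental import shown earlier can be saved as a job and re-executed on demand (the job name is arbitrary; connection details are carried over from the earlier examples):
$ sqoop job --create authors_incremental -- import --connect jdbc:mysql://192.168.80.132:3306/books \
  --username root --table authors --target-dir /sqoop/mysqlstage -m 1 \
  --incremental append --check-column id --last-value 0
$ sqoop job --list
$ sqoop job --show authors_incremental
$ sqoop job --exec authors_incremental
Because a saved job records the new last value after each run, re-executing it picks up only the rows added since the previous execution.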