Apache Sqoop
A Data Transfer Tool for Hadoop
Arvind Prabhakar, Cloudera Inc. Sept 21, 2011
What is Sqoop?
Allows easy import and export of data to and from structured data stores:
o Relational Databases
o Enterprise Data Warehouses
o NoSQL Datastores
Allows easy integration with Hadoop based systems:
o Hive
o HBase
o Oozie
Agenda
Motivation
Importing and exporting data using Sqoop
Provisioning Hive Metastore
Populating HBase tables
Sqoop Connectors
Current Status
Motivation
Structured data stored in Databases and EDWs is not easily accessible for analysis in Hadoop.
Access to Databases and EDWs from Hadoop clusters is problematic.
Forcing MapReduce to access data from Databases/EDWs is repetitive, error-prone and non-trivial.
Data preparation is often required for efficient consumption by Hadoop based data pipelines.
Current methods of transferring data are inefficient and ad hoc.
Enter: Sqoop
A tool to automate data transfer between structured
datastores and Hadoop.
Highlights
Uses datastore metadata to infer structure definitions
Uses MapReduce framework to transfer data in parallel
Allows structure definitions to be provisioned in Hive
metastore
Provides an extension mechanism to incorporate high
performance connectors for external systems.
Importing Data
mysql> describe ORDERS;
+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| ORDER_NUMBER    | int(11)     | NO   | PRI | NULL    |       |
| ORDER_DATE      | datetime    | NO   |     | NULL    |       |
| REQUIRED_DATE   | datetime    | NO   |     | NULL    |       |
| SHIP_DATE       | datetime    | YES  |     | NULL    |       |
| STATUS          | varchar(15) | NO   |     | NULL    |       |
| COMMENTS        | text        | YES  |     | NULL    |       |
| CUSTOMER_NUMBER | int(11)     | NO   |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+
7 rows in set (0.00 sec)
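The source schema can also be inspected through Sqoop itself before importing; a minimal sketch using the list-tables and eval tools, assuming the same acmedb connection and credentials used in the following slides:

$ sqoop list-tables --connect jdbc:mysql://localhost/acmedb \
  --username test --password ****
$ sqoop eval --connect jdbc:mysql://localhost/acmedb \
  --username test --password **** \
  --query "SELECT COUNT(*) FROM ORDERS"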
Importing Data
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password ****
...
INFO mapred.JobClient: Counters: 12
INFO mapred.JobClient: Job Counters
INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12873
...
INFO mapred.JobClient: Launched map tasks=4
INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
INFO mapred.JobClient: FileSystemCounters
INFO mapred.JobClient: HDFS_BYTES_READ=505
INFO mapred.JobClient: FILE_BYTES_WRITTEN=222848
INFO mapred.JobClient: HDFS_BYTES_WRITTEN=35098
INFO mapred.JobClient: Map-Reduce Framework
INFO mapred.JobClient: Map input records=326
INFO mapred.JobClient: Spilled Records=0
INFO mapred.JobClient: Map output records=326
INFO mapred.JobClient: SPLIT_RAW_BYTES=505
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.2754 seconds (3.0398 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
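The four map tasks above reflect Sqoop's default degree of parallelism. A minimal sketch of tuning it, assuming eight mappers and ORDER_NUMBER (the primary key) as the split column:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --num-mappers 8 --split-by ORDER_NUMBER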
Importing Data
$ hadoop fs -ls
Found 32 items
....
drwxr-xr-x - arvind staff 0 2011-09-13 19:12 /user/arvind/ORDERS
....
$ hadoop fs -ls /user/arvind/ORDERS
Found 6 items
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_SUCCESS
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_logs
... 8826 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00000
... 8760 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00001
... 8841 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00002
... 8671 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00003
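The part-m-* files hold plain delimited text by default, so the imported records can be spot-checked directly from HDFS, for example:

$ hadoop fs -cat /user/arvind/ORDERS/part-m-00000 | head -3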
Exporting Data
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
--table ORDERS_CLEAN --username test --password **** \
--export-dir /user/arvind/ORDERS
...
INFO mapreduce.ExportJobBase: Transferred 34.7178 KB in 6.7482 seconds (5.1447 KB/sec)
INFO mapreduce.ExportJobBase: Exported 326 records.
$
Default delimiters: ',' for fields, new-lines for records
An escape sequence can optionally be specified
Delimiters can be specified for both import and export, as in the sketch below
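A minimal sketch of overriding the defaults on import; the delimiter and escape values here are illustrative:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --fields-terminated-by '\t' --lines-terminated-by '\n' \
  --escaped-by '\\' --optionally-enclosed-by '\"'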
Exporting Data
Exports can optionally use staging tables, as in the sketch after this list
Map tasks populate the staging table
Each map write is broken down into many transactions
The staging table is then used to populate the target table in a single transaction
In case of failure, the staging table insulates the target table from data corruption
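A minimal sketch, assuming a hypothetical staging table ORDERS_STAGE with the same schema as ORDERS_CLEAN:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS_CLEAN --username test --password **** \
  --export-dir /user/arvind/ORDERS \
  --staging-table ORDERS_STAGE --clear-staging-table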
Importing Data into Hive
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** --hive-import
...
INFO mapred.JobClient: Counters: 12
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.3995 seconds (3.0068 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
INFO hive.HiveImport: Removing temporary files from import process: ORDERS/_logs
INFO hive.HiveImport: Loading uploaded data into Hive
...
WARN hive.TableDefWriter: Column ORDER_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column REQUIRED_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column SHIP_DATE had to be cast to a less precise type in Hive
...
$
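The Hive table name and overwrite behavior can also be set explicitly; a minimal sketch (the table name orders_raw is illustrative):

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hive-import --hive-table orders_raw --hive-overwrite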
Importing Data into Hive
$ hive
hive> show tables;
OK
...
orders
...
hive> describe orders;
OK
order_number int
order_date string
required_date string
ship_date string
status string
comments string
customer_number int
Time taken: 0.236 seconds
hive>
Importing Data into HBase
$ bin/sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--hbase-create-table --hbase-table ORDERS --column-family mysql
...
INFO mapreduce.HBaseImportJob: Creating missing HBase table ORDERS
...
INFO mapreduce.ImportJobBase: Retrieved 326 records.
$
Sqoop creates the missing table if instructed
If no row key is specified, the primary key column is used (it can be set explicitly, as in the sketch below)
Each output column is placed in the same column family
Every record read results in an HBase put operation
All values are converted to their string representation and inserted as UTF-8 bytes
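A minimal sketch of choosing the row key column explicitly, using ORDER_NUMBER as in the earlier examples:

$ bin/sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hbase-create-table --hbase-table ORDERS \
  --column-family mysql --hbase-row-key ORDER_NUMBER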
Importing Data into HBase
hbase(main):001:0> list
TABLE
ORDERS
1 row(s) in 0.3650 seconds
hbase(main):002:0> describe 'ORDERS'
DESCRIPTION                                                    ENABLED
 {NAME => 'ORDERS', FAMILIES => [{NAME => 'mysql',             true
 BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
 COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
 BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
1 row(s) in 0.0310 seconds
hbase(main):003:0>
Importing Data into HBase
hbase(main):001:0> scan 'ORDERS', { LIMIT => 1 }
ROW COLUMN+CELL
 10100    column=mysql:CUSTOMER_NUMBER, timestamp=1316036948264, value=363
 10100    column=mysql:ORDER_DATE, timestamp=1316036948264, value=2003-01-06 00:00:00.0
 10100    column=mysql:REQUIRED_DATE, timestamp=1316036948264, value=2003-01-13 00:00:00.0
 10100    column=mysql:SHIP_DATE, timestamp=1316036948264, value=2003-01-10 00:00:00.0
 10100    column=mysql:STATUS, timestamp=1316036948264, value=Shipped
1 row(s) in 0.0130 seconds
hbase(main):012:0>
Sqoop Connectors
The connector mechanism allows creation of new connectors that improve or augment Sqoop functionality.
Bundled connectors include:
o MySQL, PostgreSQL, Oracle, SQL Server, JDBC
o Direct MySQL, Direct PostgreSQL
Regular connectors are JDBC based.
Direct connectors use native tools for high-performance data transfer.
Import using Direct MySQL Connector
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** --direct
...
manager.DirectMySQLManager: Beginning mysqldump fast path import
...
Direct import works as follows:
o Data is partitioned into splits using JDBC
o Map tasks use mysqldump to do the import, with a conditional selection clause (-w 'ORDER_NUMBER > ...')
o Header and footer information is stripped out
Direct export similarly uses the mysqlimport utility, as in the sketch below.
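A minimal sketch of the corresponding direct export; it assumes the mysqlimport client is available on the task nodes:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS_CLEAN --username test --password **** \
  --export-dir /user/arvind/ORDERS --direct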
Third Party Connectors
Oracle - Developed by Quest Software
Couchbase - Developed by Couchbase
Netezza - Developed by Cloudera
Teradata - Developed by Cloudera
Microsoft SQL Server - Developed by Microsoft
Microsoft PDW - Developed by Microsoft
VoltDB - Developed by VoltDB
Current Status
Sqoop is currently in Apache Incubator
Status Page
http://incubator.apache.org/projects/sqoop.html
Mailing Lists
[email protected]
[email protected]
Release
Current shipping version is 1.3.0
Sqoop Meetup
Monday, November 7, 2011, 8pm - 9pm
at Sheraton New York Hotel & Towers, NYC
Thank you!
Q&A