Sqoop Practice

Sqoop was used to import data from a MySQL table called EMP into HDFS. The initial import failed because the EMP table has no primary key; specifying a single mapper with -m 1 allowed the import to proceed sequentially. Later imports added the --append flag to avoid file-already-exists errors and used --split-by to distribute the rows across multiple mappers despite the missing primary key. Imports can also use the --query option to import only the subset of rows that meets certain conditions.


-------------------------------------------------- DATA INGESTION ON HDFS --------------------------------------------------

---------------------------------------- TO IMPORT THE DATA FROM “RDBMS” TO “HDFS” ----------------------------------------

--I created an EMP table in MySQL without a primary key. To import its data from MySQL to HDFS, I ran the command below on the edge node.
sqoop import --connect jdbc:mysql://localhost/zeyobron_analytics --username root --password cloudera --table EMP --target-dir /user/cloudera/import1;

--It threw the error below:

19/10/19 07:42:30 ERROR tool.ImportTool: Import failed: No primary key could be found for table EMP. Please specify one with --split-by or perform a sequential import with '-m 1'
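
--A quick sanity check (an aside, assuming the mysql client is available on the edge node with the same credentials): an empty result from this query confirms that EMP has no primary key.
mysql -u root -pcloudera zeyobron_analytics -e "SHOW KEYS FROM EMP WHERE Key_name = 'PRIMARY';"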

--So I ran the command below with a single mapper (-m 1):

sqoop import --connect jdbc:mysql://localhost/zeyobron_analytics --username root --password cloudera --table EMP --target-dir /user/cloudera/import1 --m 1;

--This time I got the warning and error below:

19/10/19 07:47:36 WARN security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/import1 already exists

19/10/19 07:47:36 ERROR tool.ImportTool: Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/import1 already exists

--To avoid the already-exists error (we hit it because we ran the command earlier), I added --append:

sqoop import --connect jdbc:mysql://localhost/zeyobron_analytics --username root --password cloudera --table EMP --append --target-dir /user/cloudera/import1 --m 1;
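
--An aside, not part of the original run: if you want to replace the directory instead of accumulating files in it, Sqoop's --delete-target-dir flag removes the target directory before importing (or delete it by hand with hdfs dfs -rm -r):
sqoop import --connect jdbc:mysql://localhost/zeyobron_analytics --username root --password cloudera --table EMP --delete-target-dir --target-dir /user/cloudera/import1 --m 1;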

--One part file was generated.
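
--To verify the output (a quick check; the exact part-file name may differ):
hdfs dfs -ls /user/cloudera/import1
hdfs dfs -cat /user/cloudera/import1/part-m-00000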

--Then I tried with two mappers (-m 2):

sqoop import --connect jdbc:mysql://localhost/zeyobron_analytics --username root --password cloudera --table EMP --append --target-dir /user/cloudera/import1 --m 2;

--Using two mappers on a table without a primary key throws the error below:

19/10/19 09:12:05 ERROR tool.ImportTool: Import failed: No primary key could be found for table EMP. Please specify one with --split-by or perform a sequential import with '-m 1'.

--To overcome the above error, I used --split-by with a column name (an integer column):

sqoop import --connect jdbc:mysql://localhost/zeyobron_analytics --username root --password cloudera --table EMP --append --target-dir /user/cloudera/import1 --m 2 --split-by empno;
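
--How the split works: Sqoop first runs a bounding query (SELECT MIN(empno), MAX(empno) FROM EMP) and divides that range evenly among the mappers. A sketch with hypothetical bounds (the actual empno values do not appear in this log): if MIN is 7369 and MAX is 7934, the two mappers each import roughly half the range:
--mapper 1: SELECT * FROM EMP WHERE empno >= 7369 AND empno < 7652
--mapper 2: SELECT * FROM EMP WHERE empno >= 7652 AND empno <= 7934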

--My table has 13 records but I gave it 14 mappers. The job ran, but very slowly, as shown below:

19/10/19 10:18:44 INFO mapreduce.Job: Running job: job_1570851307430_0024
19/10/19 10:19:03 INFO mapreduce.Job: Job job_1570851307430_0024 running in uber mode : false
19/10/19 10:19:03 INFO mapreduce.Job: map 0% reduce 0%
19/10/19 10:20:43 INFO mapreduce.Job: map 21% reduce 0%
19/10/19 10:20:48 INFO mapreduce.Job: map 36% reduce 0%
19/10/19 10:20:50 INFO mapreduce.Job: map 43% reduce 0%
19/10/19 10:22:54 INFO mapreduce.ImportJobBase: Transferred 541 bytes in 257.9985 seconds (2.0969 bytes/sec)
19/10/19 10:22:54 INFO mapreduce.ImportJobBase: Retrieved 13 records.
19/10/19 10:22:55 INFO util.AppendUtils: Appending to directory import1
19/10/19 10:22:55 INFO util.AppendUtils: Using found partition 6
--14 part files were generated; one of them was empty. Requesting more mappers than there are rows just adds task overhead and produces empty output files.

--Next, --split-by on ename, a varchar column:

sqoop import --connect jdbc:mysql://localhost/zeyobron_analytics --username root --password cloudera --table EMP --append --target-dir /user/cloudera/import1 --m 2 --split-by ename;

--It used the bounding query BoundingValsQuery: SELECT MIN(`ename`), MAX(`ename`) FROM `EMP`
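
--A caveat from the Sqoop documentation (not observed in this run): splitting on a text column is unreliable, and newer Sqoop releases (1.4.7+) refuse it unless you opt in explicitly with a generic -D property placed right after "import":
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect jdbc:mysql://localhost/zeyobron_analytics --username root --password cloudera --table EMP --append --target-dir /user/cloudera/import1 --m 2 --split-by ename;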

----------------------------------------- IMPORT WITH “query” OPTION [\$CONDITIONS] -----------------------------------------

--With --query, the table name goes inside the SQL itself, and the literal \$CONDITIONS placeholder is mandatory: Sqoop replaces it with each mapper's split predicate.

sqoop import --connect jdbc:mysql://localhost/ --username root --password cloudera --query "select * from zeyobron_analytics.EMP where \$CONDITIONS" --append --target-dir /user/cloudera/import1 --m 2 --split-by empno;
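
--A variant (a sketch, not from the original run): with a single mapper you can drop --split-by entirely; Sqoop then replaces \$CONDITIONS with (1 = 1):
sqoop import --connect jdbc:mysql://localhost/ --username root --password cloudera --query "select * from zeyobron_analytics.EMP where \$CONDITIONS" --append --target-dir /user/cloudera/import1 --m 1;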

sqoop import --connect jdbc:mysql://localhost/ --username root --password cloudera --query "select * from zeyobron_analytics.EMP where job = 'MANAGER' AND \$CONDITIONS" --append --target-dir /user/cloudera/import1 --m 2 --split-by empno;

sqoop import --connect jdbc:mysql://localhost/ --username root --password cloudera --query "select * from zeyobron_analytics.EMP where job = 'MANAGER' AND deptno = 10 AND \$CONDITIONS" --append --target-dir /user/cloudera/import1 --m 2 --split-by empno;

--For the boundary values it ran BoundingValsQuery: SELECT MIN(empno), MAX(empno) FROM (select * from zeyobron_analytics.EMP where job = 'MANAGER' AND deptno = 10 AND (1 = 1)) AS t1
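
--What happens per mapper (a sketch with hypothetical split values): each mapper runs its own copy of the query with \$CONDITIONS replaced by its slice of the empno range, for example:
--mapper 1: select * from zeyobron_analytics.EMP where job = 'MANAGER' AND deptno = 10 AND ( empno >= 7566 AND empno < 7750 )
--mapper 2: select * from zeyobron_analytics.EMP where job = 'MANAGER' AND deptno = 10 AND ( empno >= 7750 AND empno <= 7934 )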
