
Week 3

The document discusses using Sqoop to import and export data between databases and HDFS.
Some key points covered include:
- Sqoop Import is used to transfer data from a database to HDFS, while Sqoop Export moves data from HDFS to a database.
- Imports use MapReduce jobs with mappers that divide the work based on the table's primary key by default.
- Imports and exports can be customized through options like compression, column selection, and partitioning.
- Staging tables are used during exports to avoid partial data transfers if a job fails.
- Incremental imports allow importing only new or updated records over time rather than reprocessing the full table each time.


# SQOOP IMPORT EXERCISE

=======================

SESSION - 1
============

Sqoop Import - Databases to HDFS (the more frequently used direction)

Sqoop Export - HDFS to Databases

Sqoop Eval - to run queries on the database

sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera

sqoop-list-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera

sqoop-eval \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera \
--query "select * from retail_db.customers limit 10"

SESSION - 2
============

INSERT INTO people VALUES (101,'Raj','Pali','Itwara chowk','Yavatmal');

Sqoop import
=============

(transfers data from your relational DB to HDFS)

It runs as a MapReduce job - only mappers do the work, there is no reducer.

By default there are 4 mappers, and yes, we can change the number of mappers.

These mappers divide the work based on the primary key.

If there is no primary key, then what will happen? You have two options:

1. you change the number of mappers to 1, or

2. you specify a split-by column.

sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
--query "describe retail_db.orders"

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \
--target-dir peopleresult

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \   {the people table doesn't contain a P.K., therefore we set the mappers to 1}
-m 1 \             {if you don't set the mappers to 1 it will give an error}
--target-dir peopleresult

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \
-m 1 \
--warehouse-dir peopleresult1

Now my path will be: peopleresult1/people

Target dir vs. Warehouse dir
=============================
Say there is an employee table that you are importing from mysql.

In case of target directory, the directory path mentioned is the final path where data is copied:
/data

In case of warehouse directory, the system will create a subdirectory with the table name:
/data/employee
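
A quick way to see the difference is to list both locations after an import. This is a
minimal sketch; the part-file names and count depend on the number of mappers:

hadoop fs -ls /data            {target-dir: the part-m-* files land directly under /data}
hadoop fs -ls /data/employee   {warehouse-dir: Sqoop first creates the employee subdirectory}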

sqoop-import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--as-sequencefile \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir

SESSION - 3
============
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera

sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
-P   {your password will not be shown on the console; you are prompted for it instead}

How to redirect the logs for later use?
----------------------------------------

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /queryresult4 1>query.out 2>query.err

1>query.out will mostly contain the output content (e.g. in the case of an eval command).
2>query.err will contain all the other logs and errors. (You can choose any name for these
files; they are created in the cwd from where the command is run.)
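
For example, after the job finishes you can inspect both files from the directory where the
command was run (file names as chosen above):

cat query.out        {query/eval output, if any}
tail -50 query.err   {MapReduce progress, warnings and errors}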

Boundary query
===============

In sqoop import the work is divided among the mappers based on the primary key.

Employee table
===============
empId, empname, age, salary (empId is the primary key)
0
1
2
3
4
5
6
.
.
100000

the mappers by default will be 4.

Question - how will the mappers distribute the work on the basis of the P.K.?

Sqoop finds the max of the primary key and the min of the primary key, then:

split size = (max_of_pk - min_of_pk) / num_mappers

(100000 - 0)/4
100000/4 = 25000

split size = 25000


mapper1 0 - 25000
mapper2 25001 - 50000
mapper3 50001 - 75000
mapper4 75001 - 100000
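
Under the hood, Sqoop first runs a boundary value query to get the min and max of the split
column, and then gives each mapper its own range. A rough sketch of the generated SQL (not
the exact statements Sqoop emits):

-- boundary value query, run once
SELECT MIN(empId), MAX(empId) FROM Employee;

-- each mapper then reads one range, e.g. mapper1 roughly runs
SELECT * FROM Employee WHERE empId >= 0 AND empId <= 25000;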

SESSION - 4
============

sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--warehouse-dir /user/cloudera/compressresult

sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compression-codec BZip2Codec \
--warehouse-dir /user/cloudera/bzipcompresult

sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('complete','closed')" \   {the where clause is also applied to the boundary value query}
--warehouse-dir /user/cloudera/customimportresult

sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--boundary-query "SELECT 1, 68883" {Here we are hardcoding the min & max for
BVQ due to outlier}
--warehouse-dir /user/cloudera/ordersboundval

SESSION - 5
============

sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('processing')" \   {the where clause is internally added to the boundary value query as well, no matter what}
--warehouse-dir /user/cloudera/whereclauseresult

sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table order_no_pk \   {this will fail because the table has no P.K., so the mappers don't know how to divide the work among themselves}
--warehouse-dir /ordersnopk

sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table order_no_pk \
--split-by order_id \
--target-dir /ordersnopk

sqoop import-all-tables \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--warehouse-dir /user/cloudera/autoreset1mresult \
--autoreset-to-one-mapper \   {uses one mapper if a table with no P.K. is encountered}
--num-mappers 2

{If you have 100 tables and 98 of them have a P.K. while the remaining 2 don't, then for the
tables with a P.K. 2 mappers will work, and for the tables without a P.K. it will
automatically fall back to 1 mapper.}

SESSION - 6
============

sqoop create-hive-table \   {creates an empty table in hive based on the metadata in mysql}
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \   {by default the hive table gets the same name as the source table, but we can change it}
--hive-table emps \   {the table in hive should be named emps, carrying the metadata of the orders table present in mysql}
--fields-terminated-by ','
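
To confirm that the empty emps table was created with the orders metadata, you can describe
it from the hive CLI (a quick check, assuming the table was created in the default database):

hive -e "DESCRIBE emps;"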

# SQOOP EXPORT EXERCISE


=======================

SESSION - 1
============

SQOOP EXPORT

IS USED TO TRANSFER DATA FROM HDFS TO RDBMS.

CREATE TABLE card_transactions (
transaction_id INT,
card_id BIGINT,
member_id BIGINT,
amount INT,
postcode INT,
pos_id BIGINT,
transaction_dt varchar(255),
status varchar(255),
PRIMARY KEY(transaction_id)
);

WE HAVE CARD_TRANS.CSV ON THE DESKTOP LOCALLY IN CLOUDERA.

WE SHOULD BE MOVING THIS FILE FROM LOCAL TO HDFS

hadoop fs -mkdir /data

hadoop fs -put Desktop/card_trans.csv /data

sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--export-dir /data/card_trans.csv \
--fields-terminated-by ","

2 IMPORTANT THINGS:

1. Why did the job fail? {check your job tracking URL}

2. If a job fails, how do we make sure that the target table is not impacted?
{that means nothing should be transferred if the job fails, i.e. there should be no partial transfer}

Caused by:
com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException:
Duplicate entry '345925144288000-10-10-2017 18:02:40' for key 'PRIMARY'

>>Concept: a staging table comes into play to avoid a partial transfer of data.

>>First, create a table with the same schema in the mysql database, with "stage" attached to the name:

CREATE TABLE card_transactions_stage (
card_id BIGINT,
member_id BIGINT,
amount INT(10),
postcode INT(10),
pos_id BIGINT,
transaction_dt varchar(255),
status varchar(255),
PRIMARY KEY (card_id, transaction_dt)
);

>>Now, run the export command with --staging-table <table name>:

sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /data/card_transactions.csv \
--fields-terminated-by ','

>>If only partial records get transferred, the partial records are kept in the stage table
and are not transferred to the main table.
>>If the data is successfully transferred to the staging table, then MySQL migrates the data
from the stage table to the main table, and the stage table becomes empty because the data
has been migrated.
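
You can verify this behaviour with sqoop-eval after the export finishes - the stage count
should be 0 and the main count should match the number of exported records (a sketch using
the table names created above):

sqoop-eval \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--query "select (select count(*) from card_transactions) as main_cnt, (select count(*) from card_transactions_stage) as stage_cnt"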

SESSION - 8
============

sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /user/cloudera/data/card_transactions_new.csv \
--fields-terminated-by ','

SESSION - 9
============

Incremental Import

Say there is an orders table in mysql with 50000 records, and order_id is the primary key.

100 new orders are coming tomorrow in the orders table.

You have already imported the 50000 records using sqoop import, so reprocessing the full
table again would be wasteful. In such a case you should go with incremental import.

2 choices
==========

1. append mode - used when there are no updates in the data, just new inserts.

2. lastmodified mode - used when we need to capture the updates also. In this case we use a
date column on the basis of which we try to fetch the data.

sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0   {meaning: import every record whose order_id is > 0}

insert into orders values(68884,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68885,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68886,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68887,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68888,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68889,'2014-07-23 00:00:00',5522,'COMPLETE');

>>commit

sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 68883 \
--append

SESSION - 10
=============

incremental import using append mode - only inserts, no updates.

incremental import using lastmodified mode - when there are updates as well.

sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \   {here we specify a timestamp/date column}
--last-value 0 \   {basically a date should go here, but in the first load I want to consider everything}
--append

>> '2023-02-07 22:35:59'   {the next time I run this I have to replace 0 with this value, which is why I am noting it down}
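
One simple way to capture the value to plug into --last-value for the next run is to ask the
source table for its latest timestamp (a sketch using sqoop-eval; you can also pick the value
up from the import log):

sqoop-eval \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query "select max(order_date) from orders"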

insert into orders values(68890,current_timestamp,5523,'COMPLETE');
insert into orders values(68891,current_timestamp,5523,'COMPLETE');
insert into orders values(68892,current_timestamp,5523,'COMPLETE');
insert into orders values(68893,current_timestamp,5523,'COMPLETE');
insert into orders values(68894,current_timestamp,5523,'COMPLETE');

update orders set order_status='COMPLETE', order_date=current_timestamp where order_id=68862;
commit;

sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \
--last-value '2023-02-07 22:35:59' \   {you just have to save this date for the next import}
--append   {once we have done one import and we run an incremental import again over the same
output dir, we have to choose either --append or --merge-key based on the requirement}

If a record is updated in your table and we then use incremental import with lastmodified,
we will get the updated record as well.

5000 oldtimestamp in hdfs
5000 newtimestamp in hdfs   {it means that in hdfs you will have 2 records for the same key,
one with the old timestamp and one with the new timestamp, because we are using the --append
parameter}

You want the hdfs file to always be in sync with the table.
{e.g. if you have 1000 records in your MySQL table, there should be 1000 records in hdfs}
{i.e. we only want the latest updated records, not the old ones}

{that means if 5000 is a primary key that has 2 records, only the record with the latest
timestamp should be kept in hdfs, so that there is no duplicate entry in HDFS. For that we
use --merge-key.}

sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \
--last-value '2023-02-07 22:35:59' \
--merge-key order_id   {if I use merge-key instead of append, it makes sure that each key
(order_id) has only one record in hdfs, and the record with the latest timestamp is the one
that is kept}

{After running the above import command it brings in the new records that were added and the
old records that were updated in the table. After receiving those records it starts merging
the duplicate records on the basis of the --merge-key parameter, and once merged it produces
only one part-r file in the output dir, because merging is a reduce activity.}
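
For example, listing the output directory after the merge should show a single reducer output
file instead of the multiple mapper files (a sketch; the exact path and file name may differ):

hadoop fs -ls /user/cloudera/data/orders
{expect a single part-r-00000 file - the merged result written by the reduce phase}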

2 modes
========
1. append - we talk only about new inserts

--incremental append
--check-column order_id
--last-value 0 {any order_id greater than 0 should be imported}

2. lastmodified - when we have updates as well

--incremental lastmodified
--check-column order_date {It should be some date column}
--last-value previousdate {this is a date after which all the records entered should be imported}

>>After the 1st incremental import you have to give one of the two parameters, --append or
--merge-key, otherwise it will show an error that the output dir already exists:

--append   {will create duplicates if old records and their updated versions both land in hdfs}

--merge-key order_id   {will merge the duplicates with the help of a reduce activity on the
basis of the P.K., and we usually keep the new record over the old record on the basis of the
timestamp}

SESSION - 11
=============

incremental import

In this session we will talk about

1. sqoop job

2. password management.

sqoop job \
--create job_orders \ {job name should be unique}
-- import \   {there must be a space after the two hyphens here}
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental append \
--check-column order_id \
--last-value 0

sqoop job --list : This command will show us all the created sqoop jobs.

sqoop job --exec job_orders

sqoop job --show job_orders : To see all the parameters that were saved or stored.

sqoop job --delete job_orders : Deleting a sqoop job

echo -n "cloudera" >> .password.file , it's is created in local cloudera

sqoop job \
--create job_orders \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password-file file:///home/cloudera/.password.file \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental append \
--check-column order_id \
--last-value 0

We expect the above job to run fully automatically, without prompting for a password.

We have successfully created the job.

sqoop job --exec job_orders
