Week 3
=======================
SESSION - 1
============
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera
sqoop-list-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera
sqoop-eval \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera \
--query "select * from retail_db.customers limit 10"
SESSION - 2
============
Sqoop import
=============
A sqoop import runs as a MapReduce job (map-only) under the hood.
sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
--query "describe retail_db.orders"
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult
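A rough sketch of what the target dir should look like after the import: with the default 4 mappers, a map-only import writes one part-m file per mapper (file sizes will vary):

hdfs dfs -ls /queryresult
/queryresult/_SUCCESS
/queryresult/part-m-00000
/queryresult/part-m-00001
/queryresult/part-m-00002
/queryresult/part-m-00003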
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \
--target-dir peopleresult
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \ {the people table doesn't contain a P.K., therefore
we set the number of mappers to 1}
-m 1 \ {if you don't set the mapper count to 1, the import will give an
error}
--target-dir peopleresult
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \
-m 1 \
--warehouse-dir peopleresult1
sqoop-import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--as-sequencefile \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir
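With --warehouse-dir, sqoop creates one subdirectory per table under the given dir (unlike --target-dir, which is the output dir itself), and here each part file inside is a sequence file. A sketch, assuming the standard retail_db tables:

hdfs dfs -ls /user/cloudera/sqoopdir
/user/cloudera/sqoopdir/categories
/user/cloudera/sqoopdir/customers
/user/cloudera/sqoopdir/departments
/user/cloudera/sqoopdir/order_items
/user/cloudera/sqoopdir/orders
/user/cloudera/sqoopdir/products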
SESSION - 3
============
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
-P {your password will not be shown on the console!}
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /queryresult4 1>query.out 2>query.err
1>query.out captures stdout, which mostly contains the actual output content (e.g. in case of an eval command).
2>query.err captures stderr, where all the other logs and errors go. (You can set any
name for these files; they are created in the cwd from where the command is run.)
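A minimal sketch of checking the two files afterwards:

cat query.out    # the query/import output
less query.err   # sqoop and log4j logging, plus any errors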
Boundary query
===============
In a sqoop import, the work is divided among the mappers based on the
primary key.
Employee table
===============
empId, empname, age, salary (empId is the primary key)
0
1
2
3
4
5
6
.
.
100000
Q: how will the mappers distribute the work on the basis of the P.K.?
split size = (max - min) / number of mappers
           = (100000 - 0) / 4 = 25000, so each mapper handles a range of 25000 ids.
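Under the hood, sqoop first runs a boundary value query to get the min and max of the split column, then assigns one id range per mapper. A sketch of the generated queries (the exact ranges are approximate):

SELECT MIN(empId), MAX(empId) FROM Employee   -- the boundary value query

mapper 1: WHERE empId >= 0     AND empId <  25000
mapper 2: WHERE empId >= 25000 AND empId <  50000
mapper 3: WHERE empId >= 50000 AND empId <  75000
mapper 4: WHERE empId >= 75000 AND empId <= 100000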
SESSION - 4
============
sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--warehouse-dir /user/cloudera/compressresult
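--compress uses gzip by default, so the part files come out with a .gz extension. A sketch of the expected output:

hdfs dfs -ls /user/cloudera/compressresult/orders
/user/cloudera/compressresult/orders/part-m-00000.gz
/user/cloudera/compressresult/orders/part-m-00001.gz
...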
sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compression-codec BZip2Codec \
--warehouse-dir /user/cloudera/bzipcompresult
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('complete','closed')" \ {the where clause also gets
applied to the boundary value query}
--warehouse-dir /user/cloudera/customimportresult
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--boundary-query "SELECT 1, 68883" \ {here we hardcode the min & max for the
BVQ because of an outlier value}
--warehouse-dir /user/cloudera/ordersboundval
SESSION - 5
============
sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('processing')" \ {Where clause internally add
to boundary query, no matter what}
--warehouse-dir /user/cloudera/whereclauseresult
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table order_no_pk \ {this will fail because the table has no P.K., so the
mappers don't know how to divide the work among themselves}
--warehouse-dir /ordersnopk
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table order_no_pk \
--split-by order_id \
--target-dir /ordersnopk
sqoop import-all-tables \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--warehouse-dir /user/cloudera/autoreset1mresult \
--autoreset-to-one-mapper \ {uses one mapper if a table with no P.K. is
encountered}
--num-mappers 2
{If you have 100 tables and 98 of them have a P.K. while the remaining 2
don't, then the tables with a P.K. are imported with 2 mappers, and the ones
without a P.K. automatically fall back to 1 mapper!}
SESSION - 7
============
SQOOP EXPORT
=============
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--export-dir /data/card_trans.csv \
--fields-terminated-by ","
2 IMPORTANT THINGS:
Caused by:
com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException:
Duplicate entry '345925144288000-10-10-2017 18:02:40' for key 'PRIMARY'
>> Concept: a staging table comes into play to avoid a partial transfer of data.
>> First, create a table with the same schema in the mysql database, with "stage" attached to its name:
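One way to create that staging table in mysql (a sketch; CREATE TABLE ... LIKE copies the column definitions and keys of the original table):

mysql> CREATE TABLE card_transactions_stage LIKE card_transactions;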
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /data/card_transactions.csv \
--fields-terminated-by ','
>> If only part of the data is transferred, the partial records are kept in the
stage table and are never moved to the main table.
>> If the data reaches the staging table successfully, mysql migrates it from the
stage table to the main table, and the stage table is left empty because the
data has been migrated.
SESSION - 8
============
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /user/cloudera/data/card_transactions_new.csv \
--fields-terminated-by ','
SESSION - 9
============
Incremental Import
2 choices
==========
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0 {i.e. import every record whose order_id is > 0}
>> insert some new rows into mysql and commit, then re-run with the updated last value:
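A hypothetical row to test with (the values are made up; the retail_db orders schema is order_id, order_date, order_customer_id, order_status):

mysql> INSERT INTO orders VALUES (68884, '2023-02-07 00:00:00', 256, 'COMPLETE');
mysql> COMMIT;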
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 68883 \
--append
SESSION - 10
=============
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \ {here we specify the timestamp (date) column}
--last-value 0 \ {normally a date goes here, but for the first load we want to
consider everything}
--append
>> '2023-02-07 22:35:59' {next time we run this, 0 has to be replaced with this
value, which is why we note it down}
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \
--last-value '2023-02-07 22:35:59' \ {just this date has to be saved for the
next import}
--append {once we have imported, and we run an incremental import again over
the same output dir, we have to choose either --append or --merge-key,
depending on the requirement}
You want the hdfs files to always be in sync with the table.
{e.g. if you have 1000 records in your mysql table, there should be 1000
records in hdfs}
{i.e. we want only the latest version of updated records, not the old ones}
{that means if primary key 5000 has 2 records, only the record with the latest
timestamp should be kept in hdfs, so that there is no duplicate entry in hdfs.
For that we use --merge-key.}
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \
--last-value '2023-02-07 22:35:59' \
--merge-key order_id {using --merge-key instead of --append makes sure that
each P.K. (order_id) ends up with only one record in
hdfs, keeping the one with the latest timestamp}
{After running the above import, it brings in the new records that were added
and the old records that were updated in the table, and then merges the
duplicate records on the basis of the --merge-key column. After the merge it
produces only 1 file in the output dir, a part-r file, because merging is a
reduce activity.}
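A sketch of the output dir after the merge: a single reducer output instead of the earlier part-m files:

hdfs dfs -ls /user/cloudera/data/orders
/user/cloudera/data/orders/part-r-00000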
2 modes
========
1. append - we talk only about new inserts
--incremental append
--check-column order_id
--last-value 0 {any order_id greater than 0 should be imported}
2. lastmodified - covers new inserts as well as updates to existing records
--incremental lastmodified
--check-column order_date {it should be some date column}
--last-value previousdate {all records entered after this date are imported}
--append {will create duplicates when old records were updated, since the old
copies stay in hdfs}
--merge-key order_id {will merge the duplicates via a reduce activity on the
basis of the P.K., keeping the newer record over the
older one on the basis of the timestamp}
SESSION - 11
=============
incremental import
1. sqoop job
2. password management.
sqoop job \
--create job_orders \ {the job name should be unique}
-- import \ {there must be a space after the two hyphens, before import}
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental append \
--check-column order_id \
--last-value 0
sqoop job --list : shows all the created sqoop jobs.
sqoop job --show job_orders : shows all the parameters saved/stored for a job.
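To actually run the saved job, and to remove it (running it prompts for the password on the console unless a password file is configured, as in the job below; after a successful run sqoop updates the stored last-value for you):

sqoop job --exec job_orders
sqoop job --delete job_orders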
sqoop job \
--create job_orders \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password-file file:///home/cloudera/.password.file \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental append \
--check-column order_id \
--last-value 0
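The password file has to be created first; echo -n avoids a trailing newline (sqoop would otherwise treat the newline as part of the password), and restrictive permissions are recommended:

echo -n "cloudera" > /home/cloudera/.password.file
chmod 400 /home/cloudera/.password.file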