Incremental Updates in HIVE Using SQOOP
AND
STEPS TO BE TAKEN TO EXECUTE SQOOP JOBS ON OOZIE
- created by [email protected]
This document explains how Sqoop can be used to provide upserts (updates and inserts) in Hive from a MySQL database. It also explains how the entire process can be automated by writing a workflow in Oozie.
Certain precautions need to be observed when executing Sqoop jobs in Oozie. These are noted at the end.
The dataset for this document has been borrowed from www.kaggle.com; please refer to database.csv in spacex-missions.zip. First, we will load the data into MySQL and perform some initial processing, such as adding a timestamp column and a primary key.
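A minimal sketch of that preparation, assuming the CSV has already been loaded into the working table spacex_missions_w (introduced below); the exact ALTER statement is an assumption, but pid and record_ts are the key and timestamp columns used throughout this document:

-- Add a surrogate primary key and a timestamp that MySQL refreshes on every update
ALTER TABLE spacex_missions_w
  ADD COLUMN pid INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ADD COLUMN record_ts TIMESTAMP NOT NULL
      DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;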
'spacex_missions_w' is our working table in MySQL; we will create its replica in Hive. We will be constantly inserting into and updating this table in MySQL, and after applying the four-step incremental approach, these changes will be reflected in Hive.
sqoop import \
--connect jdbc:mysql://10.170.245.155:3306/spacex \
--driver com.mysql.jdbc.Driver \
--username root \
--password cloudera \
--table spacex_missions_w \
--hive-import --create-hive-table \
--hive-table spacex.missions_base -m 1
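Once the import finishes, the replica can be sanity-checked from the Hive shell, for example:

SELECT COUNT(*) FROM spacex.missions_base;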
Now we make changes in the MySQL table spacex_missions_w and load only the updated data into Hive using Sqoop's incremental option. Data changed in MySQL:
UPDATE spacex_missions_w SET customer_name='NA' WHERE pid IN (2023, 2025);
In the updated MySQL table, we can see that record_ts has advanced by 20 minutes for the updated rows. We will use that timestamp in the Sqoop incremental import.
In order to support an ongoing reconciliation between current records in Hive and new change records, two tables should be defined: missions_base and missions_inc.
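Following the Hortonworks four-step pattern, the reconciliation itself can be expressed as a view, missions_reporting, which keeps only the newest record per pid across both tables; the purge step later in this document reads from this view. A sketch, assuming pid is the primary key and record_ts the change-tracking column:

-- Reconcile base and incremental data: for each pid keep the latest record
CREATE VIEW missions_reporting AS
SELECT t1.* FROM
  (SELECT * FROM missions_base
   UNION ALL
   SELECT * FROM missions_inc) t1
JOIN
  (SELECT pid, MAX(record_ts) AS max_ts FROM
     (SELECT * FROM missions_base
      UNION ALL
      SELECT * FROM missions_inc) t2
   GROUP BY pid) s
ON t1.pid = s.pid AND t1.record_ts = s.max_ts;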
sqoop job \
--create spacex_import_job -- import \
--connect jdbc:mysql://quickstart.cloudera:3306/spacex \
--driver com.mysql.jdbc.Driver --username root \
--password cloudera --table spacex_missions_w \
--hive-import --create-hive-table --hive-table spacex.missions_inc -m 1 \
--check-column Record_TS --incremental lastmodified \
--last-value "2017-03-22 03:40:00"
We don't need to reset --last-value when running the job again, as Sqoop updates it implicitly.
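The saved job can then be run, and its stored last-value inspected, from the command line:

sqoop job --exec spacex_import_job
sqoop job --show spacex_import_job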
Purging incremental table data and reloading data into the base table
DROP TABLE IF EXISTS missions_inc;
hdfs dfs -rm -r /user/hive/warehouse/spacex.db/missions_inc
DROP TABLE missions_base;
CREATE TABLE missions_base LIKE missions_reporting;
INSERT OVERWRITE TABLE missions_base SELECT * FROM missions_reporting;
This entire process can be automated by designing an Oozie workflow; we will design the workflow in Hue.
At the time of writing this document, there are three known strategies for automating incremental updates on Oozie.
1) Shell Action
Execute an Impala query and echo the output.
Explanation of the script:
We set environment variables to yarn because Oozie executes its shell actions as the yarn user and not as cloudera; otherwise there may be a conflict between whoami and $USER.
Query output:
+-----------------------+
| max(record_ts) |
+-----------------------+
| 2017-04-07 03:14:29.0 |
+-----------------------+
Therefore we use a regular expression to extract the timestamp and echo the output as:
MAXTS=2017-04-07 03:14:29.0
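A minimal sketch of such a script, assuming impala-shell is available on the worker node; the hostname and HOME path are assumptions, and the action's workflow node must declare <capture-output/>:

#!/bin/bash
# Oozie runs this action as the yarn user, so align $USER with whoami
export USER=yarn
export HOME=/tmp

# Query Impala for the newest change timestamp (-B gives plain, unboxed output)
OUT=$(impala-shell -i quickstart.cloudera -B -q "SELECT MAX(record_ts) FROM spacex.missions_base;")

# Extract the timestamp with a regex and emit it as key=value so that
# <capture-output/> exposes it to the workflow via wf:actionData()
MAXTS=$(echo "$OUT" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?')
echo "MAXTS=$MAXTS"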
2) SQOOP Action
import
--connect
jdbc:mysql://quickstart.cloudera:3306/spacex
--username
root
--password
cloudera
--table
spacex_missions_w
-m
1
--check-column
Record_TS
--incremental
lastmodified
--last-value
${wf:actionData("shell-1860")["MAXTS"]}
--merge-key
pid
--split-by
pid
--target-dir
/user/cloudera/hive/external/missions_inc
It must be noted that --create-hive-table doesn't work with Oozie. --hive-import may work and we could use that facility; here, however, we have uploaded the data into HDFS and will create an external table on top of it (see the sketch below). Specifying --merge-key primary_key is required for the incremental update to execute a second time (the first incremental update may work without specifying a merge key). Everything else is the same as the Sqoop command mentioned in the Hortonworks 4-step incremental guide.
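A sketch of that external table, with the column list abbreviated to the fields referenced in this document (the full list must match database.csv; by default Sqoop writes comma-delimited text to the target directory):

CREATE EXTERNAL TABLE spacex.missions_inc (
  pid INT,
  customer_name STRING,
  -- ... remaining columns from database.csv ...
  record_ts TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/hive/external/missions_inc';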
3) HDFS Action
Possible enhancement: all the Hive actions mentioned above can be executed in Impala to improve performance, especially those involving joins.