
Module IV:

🚚 1. Load Scenarios in Hadoop


In Hadoop, data can be loaded in different ways depending on the use case. Below are
common load scenarios:

Scenario                   | Description                                             | Example
---------------------------|---------------------------------------------------------|------------------------------------
Batch Load                 | Large volume of data loaded periodically (e.g., daily)  | Daily sales data
Real-Time/Streaming Load   | Data is ingested as it arrives                          | IoT sensors, web clicks
Incremental Load           | Only new or updated data is loaded                      | New records from a CRM
Full Load                  | Complete dataset is loaded every time                   | Monthly refresh of product catalog
Change Data Capture (CDC)  | Captures data changes like insert/update/delete         | Database audit logs

📊 Visualization:
+-------------------+
|     Source DB     |
+--------+----------+
         |
     [Extract]
         |
+--------v----------+
|  Hadoop Cluster   |
+-------------------+
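For example, the batch-load scenario above can be as simple as a scheduled copy into a date-partitioned HDFS directory. A minimal sketch, assuming illustrative paths and file names:

# Hypothetical daily batch load (e.g., run from cron); paths are illustrative
DAY=$(date +%Y-%m-%d)
hdfs dfs -mkdir -p /user/hadoop/sales/dt=$DAY
hdfs dfs -put /data/exports/sales_$DAY.csv /user/hadoop/sales/dt=$DAY/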

🗃️ 2. Loading “Data at Rest”


"Data at Rest" refers to data that is already stored in systems like:

●​ File systems (CSV, JSON, XML, etc.)​

● RDBMS (e.g., MySQL, Oracle)

● NoSQL databases

●​ Warehouses (e.g., Teradata, Snowflake)​

✅ How to Load:
● Use tools like Apache Flume, Apache NiFi, Sqoop, or HDFS commands

●​ Data is moved from local/remote storage into HDFS​

📦 Example:
hdfs dfs -put localfile.csv /user/hadoop/input/
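A slightly fuller sketch of the same idea, creating the target directory first and confirming the copy afterwards (directory names are illustrative):

hdfs dfs -mkdir -p /user/hadoop/input/
hdfs dfs -put localfile.csv /user/hadoop/input/
hdfs dfs -ls /user/hadoop/input/    # confirm the file is now in HDFS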

🔗 3. Loading Data from Common Sources


A. Data Warehouse

●​ Typically batch loaded​

●​ Use Sqoop or export in CSV, then move to HDFS​

●​ Example: Importing from Teradata, Oracle​
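A hedged sketch of a Sqoop pull from an Oracle-based warehouse; the host, service name, table, and target directory below are assumptions for illustration:

# Hypothetical warehouse import; connection details are illustrative
sqoop import \
--connect jdbc:oracle:thin:@//dwhost:1521/DWH \
--username dw_reader -P \
--table SALES_FACT \
--target-dir /user/hadoop/dwh/sales_fact \
--num-mappers 4

(-P prompts for the password at runtime instead of putting it on the command line.)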

B. Relational Databases (RDBMS)

●​ Use Apache Sqoop to move data to/from MySQL, PostgreSQL, Oracle​

●​ Allows incremental load based on timestamps/IDs​
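A timestamp-based incremental import could look like the sketch below; the check column, last value, and merge key are assumptions:

sqoop import \
--connect jdbc:mysql://localhost/crm \
--username root --password secret \
--table customers \
--incremental lastmodified \
--check-column updated_at \
--last-value "2024-01-01 00:00:00" \
--merge-key id \
--target-dir /user/hadoop/crm/customers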

C. Web Servers

●​ Web server logs (e.g., Apache logs) stored in files​

●​ Load using Apache Flume or direct copy into HDFS​
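The direct-copy route can be a scheduled shell copy of the rotated log file into a dated HDFS directory; the log path and HDFS layout below are assumptions:

# Copy yesterday's rotated Apache access log into HDFS (illustrative paths)
DAY=$(date -d "yesterday" +%Y-%m-%d)
hdfs dfs -mkdir -p /user/hadoop/weblogs/dt=$DAY
hdfs dfs -put /var/log/apache2/access.log.1 /user/hadoop/weblogs/dt=$DAY/access.log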

D. Database Logs

●​ Capture real-time data changes​


●​ Use tools like Kafka, Flume, or CDC tools like Debezium​
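As a quick sanity check of such a pipeline, the change events that Debezium publishes into Kafka can be read with Kafka's console consumer; the topic name below is an assumption following Debezium's <server>.<database>.<table> convention:

kafka-console-consumer.sh \
--bootstrap-server localhost:9092 \
--topic dbserver1.inventory.customers \
--from-beginning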

📥 Visualization: Source to HDFS Pipeline


+-------------+     +-------------+     +-------------+
|    RDBMS    | --> |    Sqoop    | --> |    HDFS     |
+-------------+     +-------------+     +-------------+

+-------------+     +-------------+     +-------------+
|  Web Logs   | --> |    Flume    | --> |    HDFS     |
+-------------+     +-------------+     +-------------+

🔄 4. What is Apache Sqoop?


Apache Sqoop (SQL-to-Hadoop) is a data transfer tool used to move data:

●​ From RDBMS to Hadoop (HDFS, Hive, HBase)​

●​ From Hadoop to RDBMS​

✅ Features:
●​ Parallel data transfer using MapReduce​

●​ Supports all major RDBMS (MySQL, Oracle, PostgreSQL)​

●​ Supports incremental import​

●​ Can import data into Hive and HBase directly​

●​ Export capability to write back to RDBMS​
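Connectivity is usually verified before running any import; Sqoop's listing commands are handy for that (credentials below are illustrative):

sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root --password secret

sqoop list-tables \
--connect jdbc:mysql://localhost/employees \
--username root --password secret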

🧪 5. Using Sqoop to Import and Export Data


🔹 A. Import Data from RDBMS to HDFS
sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root --password secret \
--table employee \
--target-dir /user/hadoop/employees
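Note: if the source table has no primary key, Sqoop cannot split the import on its own; either name a split column or drop to a single mapper. The same import with those flags added (the split column is an assumption):

sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root --password secret \
--table employee \
--target-dir /user/hadoop/employees \
--split-by id    # or use: --num-mappers 1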
🔹 B. Import to Hive
sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root --password secret \
--table employee \
--hive-import \
--hive-table hive_employees

🔹 C. Incremental Import
sqoop import \
--connect jdbc:mysql://localhost/employees \
--username root --password secret \
--table employee \
--incremental append \
--check-column id \
--last-value 105
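For repeated runs, the incremental import can be saved as a Sqoop job so that the last imported value is tracked in Sqoop's metastore automatically; the job name here is illustrative:

sqoop job --create employee_incr -- import \
--connect jdbc:mysql://localhost/employees \
--username root --password secret \
--table employee \
--incremental append \
--check-column id \
--last-value 0

sqoop job --exec employee_incr    # each run resumes from the stored last value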

🔹 D. Export Data from HDFS to RDBMS


sqoop export \
--connect jdbc:mysql://localhost/employees \
--username root --password secret \
--table employee_export \
--export-dir /user/hadoop/employees
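If the export should update rows that already exist in the target table and insert the new ones, Sqoop's update flags can be added; the key column below is an assumption:

sqoop export \
--connect jdbc:mysql://localhost/employees \
--username root --password secret \
--table employee_export \
--export-dir /user/hadoop/employees \
--update-key id \
--update-mode allowinsert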
🧠 Recap Mind Map (Textual Visualization)
                +-------------------+
                |    Hadoop HDFS    |
                +---------+---------+
                          ^
                          |
+----------+     +--------+--------+     +-----------+
|  RDBMS   | --> |  Sqoop Import   | --> |   Hive    |
+----------+     +-----------------+     +-----------+

+----------+     +-----------------+     +-----------+
|   HDFS   | --> |  Sqoop Export   | --> |   RDBMS   |
+----------+     +-----------------+     +-----------+
