
Module 1: Data Integration in Context

1.1. What is Data Integration (DI)?

Definition
Data Integration (DI) is the process of combining data from different sources to provide a
unified, consistent view of information. It enables businesses to use diverse data assets across
systems to make better decisions, improve operational efficiency, and gain deeper insights.

Key Characteristics

 Involves extraction, transformation, and loading (ETL/ELT)

 Deals with structured, semi-structured, and unstructured data

 Supports both batch and real-time processing

 Often requires data quality and data governance measures

Common Use Cases

 Migrating data from legacy systems

 Synchronizing data across business applications

 Building data warehouses and data lakes

 Enabling business intelligence and advanced analytics

1.2. ETL vs ELT

| Feature | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
|---------|--------------------------------|--------------------------------|
| Order of Operations | Extract → Transform → Load | Extract → Load → Transform |
| Where Transformation Happens | In a DI tool (outside the database) | Inside the target system (e.g., DB, data lake) |
| Best Fit For | On-premise databases, legacy systems | Cloud-based systems, modern data lakes |
| Performance | Limited by external transformation engine | Utilizes power of modern data platforms |
| Flexibility | High control over data transformation | Better suited for massive data volumes |

Example:

 ETL: Extract from Oracle → Clean & Join in Talend → Load into PostgreSQL

 ELT: Extract from MongoDB → Load into Snowflake → Transform using SQL queries in
Snowflake
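
To make the ELT example concrete, here is a minimal sketch of a transform that runs inside the target system after the raw load; the raw_customers table and its columns are illustrative assumptions, not part of the module:

-- Hypothetical in-warehouse transform (table and column names are assumptions)
CREATE TABLE customers_clean AS
SELECT
    id,
    TRIM(UPPER(country))      AS country,      -- standardize country codes
    CAST(signup_date AS DATE) AS signup_date   -- normalize the date type
FROM raw_customers;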

1.3. Traditional vs Modern Data Integration

| Criteria | Traditional DI | Modern DI (Cloud-native, Real-time) |
|----------|----------------|-------------------------------------|
| Tools | ETL tools, DB scripts | Talend, Apache NiFi, dbt, Fivetran |
| Deployment | On-premises | Cloud, Hybrid |
| Architecture | Batch-oriented | Real-time, streaming |
| Data Sources | Mostly structured (DB, CSV, etc.) | Structured, semi-structured, unstructured |
| Target Systems | Data warehouses | Data lakes, cloud warehouses |
| User Personas | IT/Data Engineers only | Data Engineers, Analysts, Citizen Devs |

Modern DI Goals:

 Lower latency

 Scalability with cloud resources

 Faster time to insight

 Self-service capabilities

1.4. Role of DI in Analytics and Reporting

Data integration plays a foundational role in any data-driven initiative. Here’s how:

How DI Enables Analytics:

 Combines disparate datasets into a single analytical model

 Ensures clean, consistent, and reliable data

 Facilitates historical and real-time reporting

 Enables machine learning and predictive modeling

Examples:

 Customer 360° view: Combining CRM, support, sales, and social media data (sketched in SQL after this list)

 Sales Performance Dashboard: Merging transactional data, marketing campaign data, and
external benchmarks
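
To make the Customer 360° example concrete, a minimal SQL sketch is shown below; crm_customers, support_tickets, and sales_orders are hypothetical tables standing in for the CRM, support, and sales sources:

-- Illustrative only: source tables and columns are assumptions, not defined in this module.
-- Pre-aggregating each source avoids double counting when joining one-to-many data.
SELECT
    c.customer_id,
    c.full_name,
    COALESCE(t.open_tickets, 0)     AS open_tickets,
    COALESCE(o.lifetime_revenue, 0) AS lifetime_revenue
FROM crm_customers c
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS open_tickets
    FROM support_tickets
    WHERE status = 'open'
    GROUP BY customer_id
) t ON t.customer_id = c.customer_id
LEFT JOIN (
    SELECT customer_id, SUM(amount) AS lifetime_revenue
    FROM sales_orders
    GROUP BY customer_id
) o ON o.customer_id = c.customer_id;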

Key Concepts:

 Single Source of Truth (SSOT)

 Data Consistency

 Timeliness and Trustworthiness of Data


1.5. Talend in the Data Integration Ecosystem

Talend is a modern, open-source-based platform that provides end-to-end capabilities for integrating, transforming, cleaning, and governing data across any environment.

Why Talend?

 Open-source foundation (TOS - Talend Open Studio)

 Scalable from small to enterprise workloads

 Visual design environment (no/low-code)

 Connects to a wide variety of data sources (DBs, APIs, Big Data, cloud)

Position in DI Ecosystem:

 Works alongside DBs, file systems, APIs, messaging systems, cloud platforms

 Complements data warehousing, BI tools (e.g., Tableau, Power BI), and big data platforms
(e.g., Hadoop, Spark)

1.6. Talend Platform Overview

Talend offers a unified suite of tools across multiple data domains. Here's a breakdown:

| Product Area | Description |
|--------------|-------------|
| Talend Data Integration (DI) | Core ETL/ELT tools for batch and real-time data pipelines |
| Talend Big Data | Native integration with Hadoop, Spark, and other big data platforms |
| Talend Data Quality (DQ) | Profiling, cleansing, standardization, and validation tools |
| Talend Master Data Management (MDM) | Tools to manage and govern shared data entities across systems |
| Talend ESB (Enterprise Service Bus) | For real-time API and service orchestration |
| Talend Cloud | SaaS platform for designing, deploying, and monitoring data flows in the cloud |

Tools Used in This Training:

 Talend Open Studio for Big Data (TOS_BD)


Open-source desktop-based tool supporting data integration and big data features.
We will use it to:

o Connect to PostgreSQL (via Docker)

o Read/write flat files (CSV, Excel, JSON)

o Build and test data pipelines

o Explore big data connectors (e.g., HDFS, Hive)


🧪 Hands-On Exercises – Module 1: Data Integration in Context

✅ Exercise 1: Verify Talend Studio Installation and Workspace Setup

Objective: Ensure Talend is installed and ready for development.

Steps:

1. Launch Talend Open Studio for Big Data.

2. Choose your workspace (e.g., C:\Talend_Workspace).

3. Wait for the environment to load and close the welcome tab.

4. In the Repository pane, right-click Job Designs → Create folder → Name it Module1_Basics.

5. Inside the folder, right-click → Create job → Name: Check_Installation.

Validation:

 You should see the design workspace open with the job Check_Installation.

 Save the project using Ctrl + S.

✅ Exercise 2: Create a Basic Data Flow (File to Console)

Objective: Understand basic component wiring and execution.

Steps:

1. Open the Check_Installation job created in Exercise 1.

2. From Palette, drag tFixedFlowInput and tLogRow onto the canvas.

3. Double-click tFixedFlowInput:

o Click Edit schema → Add columns: id (Integer), name (String)


o Click OK.

o Click Use Inline Table → Add values:

id | name
1  | Alice
2  | Bob

4. Connect tFixedFlowInput to tLogRow using a Main row.

5. Double-click tLogRow → Select Table mode.

6. Run the job (green run button).

Expected Output:

| id | name  |
|----|-------|
| 1  | Alice |
| 2  | Bob   |

✅ Exercise 3: Compare ETL vs ELT Flow Patterns (Simulated)

Objective: Simulate ETL and ELT scenarios for understanding the difference.

A. ETL Simulation: File → Transform in Talend → PostgreSQL

Setup PostgreSQL:

If not already running, create a file docker-compose.yml:

version: '3'
services:
  postgres:
    image: postgres:15
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: talend
      POSTGRES_PASSWORD: talend
      POSTGRES_DB: training
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:

Run:

docker compose up -d
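
Optionally, confirm that the database accepts connections by running a trivial query from any SQL client (for example pgAdmin) against localhost:5432 with the credentials above:

SELECT version();  -- should return a PostgreSQL 15 version string if the container is healthy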

Create sample file: employees.csv

emp_id,full_name,department
1,Alice Johnson,Finance
2,Bob Smith,HR

Steps in Talend:

1. Create new job ETL_Flow.

2. Drag tFileInputDelimited, tMap, tPostgresqlOutput.

3. Configure tFileInputDelimited:
o File name: point to employees.csv

o Schema: emp_id (Integer), full_name (String), department (String)

4. Configure tMap:

o Add output schema with same fields

o No transformation (yet)

5. Configure tPostgresqlOutput:

o Connection: host = localhost, port = 5432, db = training, user = talend, password = talend

o Table: employees

o Action: Create table if not exists, Insert (a reference DDL sketch follows after these steps)

6. Run the job.
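
For reference, the employees table created by step 5's "Create table if not exists" action is roughly equivalent to the DDL below; the column types are an assumption based on the schema defined in step 3, and Talend's generated DDL may differ slightly:

CREATE TABLE IF NOT EXISTS employees (
    emp_id     INTEGER,       -- matches the Integer column in the Talend schema
    full_name  VARCHAR(255),  -- assumed length; Talend may generate a different size
    department VARCHAR(255)
);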

Validation:

SELECT * FROM employees;
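
A quick row-count check can also confirm the load; with the sample employees.csv above, two rows are expected:

SELECT COUNT(*) AS row_count FROM employees;  -- expected: 2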

B. ELT Simulation: File → PostgreSQL → Transform in SQL

1. Create new job ELT_Flow.

2. Use components: tFileInputDelimited → tPostgresqlOutput (no tMap).

3. Load raw data into a staging table: employees_staging

4. After job execution, use PgAdmin or any SQL client:

INSERT INTO employees_final (emp_id, full_name, department)
SELECT emp_id, INITCAP(full_name), department
FROM employees_staging;
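
Note that employees_final must exist before running the INSERT above; a minimal DDL sketch, assuming it mirrors the staging table's structure, is:

CREATE TABLE IF NOT EXISTS employees_final (
    emp_id     INTEGER,
    full_name  VARCHAR(255),  -- assumed types, mirroring employees_staging
    department VARCHAR(255)
);
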
Insight:

 ETL: Transformation in Talend

 ELT: Load raw → Transform in database

✅ Exercise 4: Exploring Talend Ecosystem (Metadata Discovery)

Objective: Use Repository Metadata to connect to PostgreSQL.

Steps:

1. In Talend Repository pane → Right-click Metadata → Db Connections → Create connection

2. Name: PostgreSQL_Local

3. Click Next → Fill in:

o DB Type: PostgreSQL

o Host: localhost

o Port: 5432

o DB Name: training

o User/Password: talend

4. Click Check → Connection Successful.

5. Finish → Right-click on connection → Retrieve Schema → Select employees → Finish.

Result:

 You can now reuse this schema in any job.

✅ Exercise 5: Role of DI in Reporting (Export to File)


Objective: Create an export-ready file from PostgreSQL.

Steps:

1. New Job: Export_Employees_Report

2. Drag tPostgresqlInput → tFileOutputDelimited

3. Use SQL: SELECT * FROM employees (an aggregated alternative is sketched after these steps)

4. Export file to: employees_report.csv
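
If you prefer an aggregated, report-ready export instead of a raw dump, a simple alternative query (using only the employees table loaded in Exercise 3) could be:

-- Headcount per department; replaces SELECT * in step 3 if desired
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
ORDER BY department;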

Validation:

 Open the exported CSV file in Excel or Notepad.


📘 Summary of Learnings

| Concept | Exercise |
|---------|----------|
| Talend Interface + Job Basics | Ex. 1–2 |
| ETL vs ELT | Ex. 3 |
| PostgreSQL Integration | Ex. 3, 4 |
| Repository Metadata | Ex. 4 |
| Data Export for Reporting | Ex. 5 |
