Module 1: Data Integration in Context
1.1. What is Data Integration (DI)?
Definition
Data Integration (DI) is the process of combining data from different sources to provide a
unified, consistent view of information. It enables businesses to use diverse data assets across
systems to make better decisions, improve operational efficiency, and gain deeper insights.
Key Characteristics
Involves extraction, transformation, and loading (ETL/ELT)
Deals with structured, semi-structured, and unstructured data
Supports both batch and real-time processing
Often requires data quality and data governance measures
Common Use Cases
Migrating data from legacy systems
Synchronizing data across business applications
Building data warehouses and data lakes
Enabling business intelligence and advanced analytics
1.2. ETL vs ELT
| Feature | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
|---|---|---|
| Order of Operations | Extract → Transform → Load | Extract → Load → Transform |
| Where Transformation Happens | In a DI tool (outside the database) | Inside the target system (e.g., DB, data lake) |
| Best Fit For | On-premise databases, legacy systems | Cloud-based systems, modern data lakes |
| Performance | Limited by the external transformation engine | Utilizes the power of modern data platforms |
| Flexibility | High control over data transformation | Better suited for massive data volumes |
Example:
ETL: Extract from Oracle → Clean & Join in Talend → Load into PostgreSQL
ELT: Extract from MongoDB → Load into Snowflake → Transform using SQL queries in Snowflake
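To make the ELT transform step concrete: once the raw data has been loaded, the cleaning and joining logic runs as SQL inside the target platform rather than in the integration tool. A minimal sketch, assuming hypothetical raw_orders and raw_customers tables have already been loaded into the warehouse:

-- Transformation executed inside the target platform (hypothetical table and column names)
CREATE TABLE clean_orders AS
SELECT o.order_id,
       TRIM(c.customer_name) AS customer_name,  -- light cleansing done in SQL
       o.order_total
FROM raw_orders o
JOIN raw_customers c ON c.customer_id = o.customer_id;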
1.3. Traditional vs Modern Data Integration
| Criteria | Traditional DI | Modern DI (Cloud-native, Real-time) |
|---|---|---|
| Tools | ETL tools, DB scripts | Talend, Apache NiFi, dbt, Fivetran |
| Deployment | On-premises | Cloud, Hybrid |
| Architecture | Batch-oriented | Real-time, streaming |
| Data Sources | Mostly structured (DB, CSV, etc.) | Structured, semi-structured, unstructured |
| Target Systems | Data warehouses | Data lakes, cloud warehouses |
| User Personas | IT/Data Engineers only | Data Engineers, Analysts, Citizen Developers |
Modern DI Goals:
Lower latency
Scalability with cloud resources
Faster time to insight
Self-service capabilities
1.4. Role of DI in Analytics and Reporting
Data integration plays a foundational role in any data-driven initiative. Here’s how:
How DI Enables Analytics:
Combines disparate datasets into a single analytical model
Ensures clean, consistent, and reliable data
Facilitates historical and real-time reporting
Enables machine learning and predictive modeling
Examples:
Customer 360° view: combining CRM, support, sales, and social media data (see the SQL sketch below)
Sales Performance Dashboard: merging transactional data, marketing campaign data, and external benchmarks
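As a hedged illustration of the Customer 360° case: once the sources land in a single database, the unified view can be expressed as a join. The table and column names below are hypothetical:

-- Hypothetical source tables combined into one customer view
CREATE VIEW customer_360 AS
SELECT c.customer_id,
       c.full_name,
       -- ticket count pulled from the support system
       (SELECT COUNT(*) FROM support_tickets t WHERE t.customer_id = c.customer_id) AS support_tickets,
       -- revenue pulled from the sales system
       (SELECT COALESCE(SUM(o.amount), 0) FROM sales_orders o WHERE o.customer_id = c.customer_id) AS total_sales
FROM crm_customers c;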
Key Concepts:
Single Source of Truth (SSOT)
Data Consistency
Timeliness and Trustworthiness of Data
1.5. Talend in the Data Integration Ecosystem
Talend is a modern open-source-based platform that provides end-to-end capabilities for
integrating, transforming, cleaning, and governing data across any environment.
Why Talend?
Open-source foundation (TOS - Talend Open Studio)
Scalable from small to enterprise workloads
Visual design environment (no/low-code)
Connects to a wide variety of data sources (DBs, APIs, Big Data, cloud)
Position in DI Ecosystem:
Works alongside DBs, file systems, APIs, messaging systems, cloud platforms
Complements data warehousing, BI tools (e.g., Tableau, Power BI), and big data platforms
(e.g., Hadoop, Spark)
1.6. Talend Platform Overview
Talend offers a unified suite of tools across multiple data domains. Here's a breakdown:
| Product Area | Description |
|---|---|
| Talend Data Integration (DI) | Core ETL/ELT tools for batch and real-time data pipelines |
| Talend Big Data | Native integration with Hadoop, Spark, and other big data platforms |
| Talend Data Quality (DQ) | Profiling, cleansing, standardization, and validation tools |
| Talend Master Data Management (MDM) | Tools to manage and govern shared data entities across systems |
| Talend ESB (Enterprise Service Bus) | Real-time API and service orchestration |
| Talend Cloud | SaaS platform for designing, deploying, and monitoring data flows in the cloud |
Tools Used in This Training:
Talend Open Studio for Big Data (TOS_BD)
Open-source desktop-based tool supporting data integration and big data features.
We will use it to:
o Connect to PostgreSQL (via Docker)
o Read/write flat files (CSV, Excel, JSON)
o Build and test data pipelines
o Explore big data connectors (e.g., HDFS, Hive)
🧪 Hands-On Exercises – Module 1: Data Integration in Context
✅ Exercise 1: Verify Talend Studio Installation and Workspace Setup
Objective: Ensure Talend is installed and ready for development.
Steps:
1. Launch Talend Open Studio for Big Data.
2. Choose your workspace (e.g., C:\Talend_Workspace).
3. Wait for the environment to load and close the welcome tab.
4. In the Repository pane, right-click Job Designs → Create folder → Name it Module1_Basics.
5. Inside the folder, right-click → Create job → Name: Check_Installation.
Validation:
You should see the design workspace open with the job Check_Installation.
Save the project using Ctrl + S.
✅ Exercise 2: Create a Basic Data Flow (File to Console)
Objective: Understand basic component wiring and execution.
Steps:
1. Open the Check_Installation job created in Exercise 1.
2. From Palette, drag tFixedFlowInput and tLogRow onto the canvas.
3. Double-click tFixedFlowInput:
o Click Edit schema → Add columns: id (Integer), name (String)
o Click OK.
o Click Use Inline Table → Add values:
id | name
1 | Alice
2 | Bob
4. Connect tFixedFlowInput to tLogRow using a Main row.
5. Double-click tLogRow → Select Table mode.
6. Run the job (green run button).
Expected Output:
| id | name |
|----|-------|
| 1 | Alice |
| 2 | Bob |
✅ Exercise 3: Compare ETL vs ELT Flow Patterns (Simulated)
Objective: Simulate ETL and ELT scenarios for understanding the difference.
A. ETL Simulation: File → Transform in Talend → PostgreSQL
Set up PostgreSQL:
If the database is not already running, create a file named docker-compose.yml with the following content:
version: '3'
services:
  postgres:
    image: postgres:15
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: talend
      POSTGRES_PASSWORD: talend
      POSTGRES_DB: training
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
Run:
docker compose up -d
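Before building the Talend job, it is worth confirming the database is reachable. Connect with any SQL client (PgAdmin, DBeaver, psql) using host localhost, port 5432, database training, user/password talend, and run a quick check:

-- Simple connectivity check against the training database
SELECT version();
SELECT current_database(), current_user;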
Create sample file: employees.csv
emp_id,full_name,department
1,Alice Johnson,Finance
2,Bob Smith,HR
Steps in Talend:
1. Create new job ETL_Flow.
2. Drag tFileInputDelimited, tMap, tPostgresqlOutput.
3. Configure tFileInputDelimited:
o File name: point to employees.csv
o Schema: emp_id (Integer), full_name (String), department (String)
4. Configure tMap:
o Add an output with the same three fields (emp_id, full_name, department)
o Pass the values through unchanged for now; in a real ETL flow this is where the transformation logic would live
5. Configure tPostgresqlOutput:
o Connection: host = localhost, port = 5432, db = training, user = talend, password = talend
o Table: employees
o Action on table: Create table if does not exist; Action on data: Insert
6. Run the job.
Validation:
SELECT * FROM employees;
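Given the two rows in employees.csv, the query should return something like:
| emp_id | full_name     | department |
|--------|---------------|------------|
| 1      | Alice Johnson | Finance    |
| 2      | Bob Smith     | HR         |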
B. ELT Simulation: File → PostgreSQL → Transform in SQL
1. Create new job ELT_Flow.
2. Use components: tFileInputDelimited → tPostgresqlOutput (no tMap).
3. Load raw data into a staging table: employees_staging
4. After job execution, use PgAdmin or any SQL client:
INSERT INTO employees_final(emp_id, full_name, department)
SELECT emp_id, INITCAP(full_name), department
FROM employees_staging;
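The INSERT above assumes employees_final already exists. A minimal sketch that creates it first, assuming the same column types as the staging table:

-- Create the target table before running the transform
-- (column types are assumed to mirror the staging schema)
CREATE TABLE IF NOT EXISTS employees_final (
    emp_id     INTEGER,
    full_name  VARCHAR(255),
    department VARCHAR(255)
);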
Insight:
ETL: Transformation in Talend
ELT: Load raw → Transform in database
✅ Exercise 4: Exploring Talend Ecosystem (Metadata Discovery)
Objective: Use Repository Metadata to connect to PostgreSQL.
Steps:
1. In Talend Repository pane → Right-click Metadata → Db Connections → Create connection
2. Name: PostgreSQL_Local
3. Click Next → Fill in:
o DB Type: PostgreSQL
o Host: localhost
o Port: 5432
o DB Name: training
o User/Password: talend
4. Click Check → Connection Successful.
5. Finish → Right-click on connection → Retrieve Schema → Select employees → Finish.
Result:
You can now reuse this schema in any job.
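Behind the scenes, the connection metadata boils down to standard JDBC settings; for this setup the resulting JDBC URL should look like jdbc:postgresql://localhost:5432/training.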
✅ Exercise 5: Role of DI in Reporting (Export to File)
Objective: Create an export-ready file from PostgreSQL.
Steps:
1. New Job: Export_Employees_Report
2. Drag tPostgresqlInput → tFileOutputDelimited
3. Use SQL: SELECT * FROM employees (or the shaped query shown after these steps)
4. Export file to: employees_report.csv
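If you want the extract to be more report-ready, the query in tPostgresqlInput can shape the output directly; a minimal variation using the columns created in Exercise 3:

-- Optional: shape the report in the extraction query itself
SELECT emp_id, full_name, department
FROM employees
ORDER BY emp_id;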
Validation:
Open the exported CSV file in Excel or Notepad.
📘 Summary of Learnings
| Concept | Exercise |
|---|---|
| Talend Interface + Job Basics | Ex. 1–2 |
| ETL vs ELT | Ex. 3 |
| PostgreSQL Integration | Ex. 3, 4 |
| Repository Metadata | Ex. 4 |
| Data Export for Reporting | Ex. 5 |