Basics of Data Integration

The document discusses the fundamentals of data integration and its significance in business analytics, defining key concepts such as data warehouses, data marts, and the ETL (Extract, Transform, Load) process. It highlights the challenges and approaches to data integration, including schema and instance integration, and various methods like federated databases and memory-mapped data structures. Additionally, it emphasizes the need for data integration in organizations to enhance decision-making and streamline data management.


Unit 2: Basics of Data Integration
Fundamentals of Business Analytics

Data from several heterogeneous data sources is extracted and loaded into a data warehouse.
Data Warehouse
According to William H. Inmon, "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process."

A data mart is meant to provide single-domain data aggregation that can then be used for analysis, reporting, and/or decision support.
BI – The Process

Data Integration → Data Analysis → Reporting


What Is Data Integration?

Process of coherent merging of data from various data sources and presenting a
cohesive/consolidated view to the user

• Involves combining data residing at different sources and providing users with a
unified view of the data.

• Significant in a variety of situations:

 Commercial (e.g., two similar companies trying to merge their databases)

 Scientific (e.g., combining research results from different bioinformatics research repositories)
Challenges in Data Integration

• Development challenges
 Translation of relational database to object-oriented applications
 Consistent and inconsistent metadata
 Handling redundant and missing data
 Normalization of data from different sources

• Technological challenges
 Various formats of data
 Structured and unstructured data
 Huge volumes of data

• Organizational challenges
 Unavailability of data
 Risk and failure of manual integration
Two main approaches in Data Integration
Data integration is divided into two main approaches:
 Schema integration – reconciles schema elements

 Multiple data sources may provide data on the same entity type. The main goal is to allow applications to
transparently view and query this data as one uniform data source, and this is done using various mapping rules to
handle structural differences.
 Schema integration involves merging multiple database schemas into a unified schema. This process addresses
structural differences between databases to create a cohesive data model.

Example:
Consider two databases:

Database A: Contains employee information with fields Emp_ID, Emp_Name, and Dept.
Database B: Contains department details with fields Department_ID, Department_Name, and Manager_ID.

In schema integration, these schemas are combined to form a unified schema:

Unified Schema: Includes Emp_ID, Emp_Name, Dept_ID, Dept_Name, and Manager_ID.


This integration resolves naming conflicts (e.g., Dept vs. Department_ID) and ensures data consistency across
the organization.
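The mapping rules above can be sketched roughly in code. The sketch below assumes pandas and made-up data values; only the field names come from the example:

    import pandas as pd

    # Source tables, following the example fields above (data values are made up)
    db_a = pd.DataFrame({"Emp_ID": [1, 2],
                         "Emp_Name": ["A. Rao", "B. Shah"],
                         "Dept": ["D01", "D02"]})
    db_b = pd.DataFrame({"Department_ID": ["D01", "D02"],
                         "Department_Name": ["Finance", "Purchase"],
                         "Manager_ID": [101, 102]})

    # Mapping rules that resolve the naming conflicts (Dept vs. Department_ID, etc.)
    db_a = db_a.rename(columns={"Dept": "Dept_ID"})
    db_b = db_b.rename(columns={"Department_ID": "Dept_ID",
                                "Department_Name": "Dept_Name"})

    # Unified schema: Emp_ID, Emp_Name, Dept_ID, Dept_Name, Manager_ID
    unified = db_a.merge(db_b, on="Dept_ID", how="left")
    print(list(unified.columns))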
2. Instance Integration
Instance integration focuses on merging actual data records from different databases, addressing
issues like duplicates and inconsistencies.

Instance integration – matches tuples and attribute values


Data integration from multiple heterogeneous data sources has become a high-priority task in many large enterprises. To obtain accurate semantic information about the data content, the information is retrieved directly from the data itself. Instance integration identifies and integrates all instances of data items that represent the same real-world entity, which makes it distinct from schema integration.

Example:
Suppose two customer databases need to be integrated:
Database X: Contains a record: Customer_ID: 001, Name: John Doe, Email: [email protected].
Database Y: Contains a record: Cust_ID: A1, Full_Name: J. Doe, Email_Address:
[email protected].
Instance integration involves identifying that these records refer to the same individual and
merging them into a single, consistent record:
Integrated Record: Customer_ID: 001, Name: John Doe, Email: [email protected].
This process ensures data accuracy and eliminates redundancy.
Both schema and instance integration are crucial for creating a unified, accurate
dataset, facilitating effective data analysis and decision-making.

Entity Identification (EI) and attribute-value conflict resolution (AVCR) comprise the instance-integration task. When common key attributes are not available across different data sources, the rules for EI and for AVCR are expressed as combinations of constraints on their attribute values.
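A minimal sketch of EI and AVCR, assuming normalized e-mail addresses serve as the matching rule (the records and the rule are illustrative, not taken from the text):

    # Illustrative records; the matching rule is an assumption
    record_x = {"Customer_ID": "001", "Name": "John Doe",
                "Email": "JOHN.DOE@example.com"}
    record_y = {"Cust_ID": "A1", "Full_Name": "J. Doe",
                "Email_Address": "john.doe@example.com"}

    def normalize_email(value):
        return value.strip().lower()

    # Entity Identification (EI): no common key, so match on a normalized attribute value
    if normalize_email(record_x["Email"]) == normalize_email(record_y["Email_Address"]):
        # Attribute-value conflict resolution (AVCR): keep the more complete name
        merged = {"Customer_ID": record_x["Customer_ID"],
                  "Name": max(record_x["Name"], record_y["Full_Name"], key=len),
                  "Email": normalize_email(record_x["Email"])}
        print(merged)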
Various Stages in ETL

Cycle initiation
Build reference data (data mapping)
Extract (actual data)
Validate
Transform (clean, apply business rules)
Stage (load into staging tables — data staging)
Audit reports (success/failure log)
Publish (load into target tables)
Archive
Clean up
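These stages can be pictured as a skeletal pipeline. Everything in the sketch below — the data, the validation rule, and the reference mapping — is an illustrative assumption:

    # Skeletal ETL cycle following the stages above; data, rules and tables are illustrative
    reference_data = {"D01": "Finance", "D02": "Purchase"}        # build reference data

    def extract(source):                                          # extract (actual data)
        return list(source)

    def validate(row):                                            # validate
        return row.get("amount") is not None and row.get("dept") in reference_data

    def transform(row):                                           # transform (clean, business rules)
        return {"dept_name": reference_data[row["dept"]],
                "amount": round(float(row["amount"]), 2)}

    sources = [[{"dept": "D01", "amount": "100.50"},
                {"dept": "D09", "amount": None}]]

    staging_table, rejected = [], 0                               # stage (staging tables)
    for src in sources:
        for row in extract(src):
            if validate(row):
                staging_table.append(transform(row))
            else:
                rejected += 1

    print({"loaded": len(staging_table), "rejected": rejected})   # audit report
    target_table = list(staging_table)                            # publish to target tables
    staging_table.clear()                                         # archive / clean up staging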
Various Stages in ETL

[Diagram: ETL stage flow — reference data and data mapping feed Extract; extracted data is Validated, Transformed and Staged (data staging); Audit reports are produced; the data is Published to target tables and then Archived.]
Extract, Transform and Load

• What is ETL?
Extract, transform, and load (ETL) in database usage (and especially in data
warehousing) involves:
 Extracting data from different sources
 Transforming it to fit operational needs (which can include quality
levels)
 Loading it into the end target (database or data warehouse)
• Allows the creation of efficient and consistent databases
• While ETL is usually discussed in the context of a data warehouse, the term in fact applies to a process that loads any database.
• ETL implementations usually store an audit trail of successful and failed process runs.
Data Mapping
 Data mapping is the process of generating data element mappings between two distinct data models. It is the first process performed for a variety of data integration tasks, which include:

 Data transformation between data source and data destination.

 Identification of data relationships.

 Discovery of hidden sensitive data.

 Consolidation of multiple databases into a single database.
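One of these tasks, transforming data between a data source and a data destination, can be sketched with a simple field-to-field mapping (the field names are illustrative assumptions):

    # Illustrative mapping of data elements between a source and a destination model
    field_map = {"Cust_ID": "Customer_ID",
                 "Full_Name": "Name",
                 "Email_Address": "Email"}

    def map_record(source_record, mapping):
        # Move each source field to its destination field; unmapped fields are dropped
        return {dest: source_record[src]
                for src, dest in mapping.items() if src in source_record}

    source = {"Cust_ID": "A1", "Full_Name": "J. Doe",
              "Email_Address": "j.doe@example.com", "Notes": "ignore me"}
    print(map_record(source, field_map))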


Data Staging

A data staging area is an intermediate storage area between the sources of information and the Data Warehouse (DW) or Data Mart (DM).

• A staging area can be used for any of the following purposes:


 Gather data from different sources at different times
 Load information from the operational database
 Find changes against current DW/DM values.
 Data cleansing

 Pre-calculate aggregates.
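A rough sketch of a staging area, assuming SQLite and invented table and column names (the text does not prescribe a tool):

    import sqlite3

    # Illustrative staging area: land rows gathered from different sources at different times
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_sales (source TEXT, item TEXT, amount REAL, extracted_at TEXT)")
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?, ?)",
                     [("erp", "Book", 120.00, "2024-01-05"),
                      ("web", "CD", 35.50, "2024-01-06")])

    # Staged data can now be cleansed, compared with current DW/DM values, or pre-aggregated
    for row in conn.execute("SELECT source, SUM(amount) FROM stg_sales GROUP BY source"):
        print(row)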

Data Extraction

• Extraction is the operation of extracting data from the source system for further use in a data warehouse environment. This is the first step in the ETL process.
• Data extraction is the process of collecting data from various sources and consolidating it into a single location.
• This involves retrieving data from different source systems, which may have varying formats,
such as flat files and relational databases.
• The complexity of the extraction process can depend on the type of source data. It is crucial to
store an intermediate version of the extracted data and back it up and archive it.
• The area where the extracted data is stored is called the staging area.

• Designing this process means making decisions about the following main aspects:
 Which extraction method would I choose?
 How do I provide the extracted data for further processing?
Data Extraction (cont…)

The data has to be extracted both logically and physically.


• Logical extraction methods
 Full extraction
 Incremental extraction

• Physical extraction methods
 Online extraction
 Offline extraction
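The difference between full and incremental (logical) extraction can be sketched as follows, assuming a relational source with a last_modified column (an illustrative assumption):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-02-01")])

    def full_extract(con):
        # Full extraction: pull the complete table on every run
        return con.execute("SELECT id, amount FROM orders").fetchall()

    def incremental_extract(con, since):
        # Incremental extraction: pull only rows changed since the previous run
        return con.execute("SELECT id, amount FROM orders WHERE last_modified > ?",
                           (since,)).fetchall()

    print(full_extract(conn))
    print(incremental_extract(conn, "2024-01-15"))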
Data Transformation

• It is the most complex and, in terms of production, the most costly part of the ETL process.
• A series of rules or functions is applied to the data extracted from the source to obtain derived data, which is then loaded into the end target.
• Depending on the data source, manipulation of the data may be required. If the data source is good, its data may require very little transformation and validation.
• But data from some sources might require one or more transformation types to meet the operational needs and make the data fit for the end target.
• Some transformation types are
* Selecting only certain columns to load.
* Translating a few coded values.
* Encoding some free-form values.
* Deriving a new calculated value.
* Joining together data derived from multiple sources.
* Summarizing multiple rows of data.
* Splitting a column into multiple columns.
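Several of these transformation types can be sketched with pandas; the tool choice and the sample data are assumptions, not part of the text:

    import pandas as pd

    sales = pd.DataFrame({"full_name": ["John Doe", "Jane Roe"],
                          "gender": ["M", "F"],
                          "qty": [2, 3],
                          "unit_price": [10.0, 20.0],
                          "region": ["North", "North"]})

    subset = sales[["full_name", "qty"]]                                  # select only certain columns
    sales["gender"] = sales["gender"].map({"M": "Male", "F": "Female"})   # translate coded values
    sales["total"] = sales["qty"] * sales["unit_price"]                   # derive a new calculated value
    summary = sales.groupby("region")["total"].sum()                      # summarize multiple rows
    sales[["first_name", "last_name"]] = sales["full_name"].str.split(" ", expand=True)  # split a column
    print(subset, sales, summary, sep="\n")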
Data Transformation
• They can range from simple data conversion to extreme data scrubbing
techniques.
• From an architectural perspective, transformations can be performed in two
ways.
 Multistage data transformation
 Pipelined data transformation
Data Transformation

[Diagram: Multistage transformation — load into a staging table (Stage_01), validate customer keys (Stage_02), convert source keys to warehouse keys (Stage_03), then insert into the target warehouse table.]
Data Transformation

[Diagram: Pipelined transformation — validate customer keys, convert source keys to warehouse keys, and insert into the target warehouse table in a single pipelined flow, without intermediate staging tables.]
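A rough sketch of the architectural difference, with plain Python functions standing in for the database steps shown in the diagrams (an illustrative simplification):

    def validate_keys(rows):
        # Keep only rows whose customer key is present
        return [r for r in rows if r["customer_key"] is not None]

    def convert_keys(rows):
        # Convert source keys to warehouse keys
        return [{**r, "customer_key": "WH-" + str(r["customer_key"])} for r in rows]

    source_rows = [{"customer_key": 1}, {"customer_key": None}, {"customer_key": 2}]

    # Multistage: each step materializes an intermediate staging structure
    stage_01 = list(source_rows)
    stage_02 = validate_keys(stage_01)
    stage_03 = convert_keys(stage_02)
    target_multistage = list(stage_03)

    # Pipelined: the same steps are chained without intermediate staging tables
    target_pipelined = convert_keys(validate_keys(source_rows))

    print(target_multistage == target_pipelined)    # True: same result, different architecture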
Data Loading
• The last stage of the ETL process is loading, which loads the extracted and transformed data into the end target, usually a data warehouse.

• Data can also be loaded by using SQL queries.

• Consider the concept of data loading using the case study of the DIIT library. Assume for the sake of simplicity that each department (IT, CS, IS, and SE) stores data in separate sheets of an Excel workbook, except the Issue_Return table, which is in an Access database. Each sheet contains a table. There are five tables in all:
• 1. Book (Excel)

• 2. Magazine (Excel)

• 3. CD (Excel)

• 4. Student (Excel)

• 5. Issue_Return (Access)
Data Loading
Look at the requirements of the DIIT administration: "The DIIT administration is in need of a report that indicates the annual spending on library purchases. The report should further drill down to the spending by each department by category (books, CDs/DVDs, magazines, journals, etc.)."

Prof. Frank had suggested building a data warehouse. In order to implement Prof. Frank's solution, the following requirements should be satisfied:

• Data in all tables should be in proper format and must be consistent.


• The three tables - Book, Magazine, and CD - should be combined into a single table
that contains details of all these three categories of items.
• The Student table should have a single column containing the full name of each student.
• Phone numbers must be stored in a numeric (integer) column.
• Date columns should be in a uniform format, preferably as a datetime data type.
• String columns (Fullname, Address) should be in uniform case (preferably uppercase).
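A rough sketch of these cleanup requirements, assuming pandas and invented sample rows in place of the actual Excel/Access files:

    import pandas as pd

    # In-memory stand-ins for the Excel sheets (Book, Magazine, CD) and the Student sheet
    book = pd.DataFrame({"Title": ["DBMS Basics"], "Category": ["Book"], "Price": [500]})
    magazine = pd.DataFrame({"Title": ["Tech Monthly"], "Category": ["Magazine"], "Price": [150]})
    cd = pd.DataFrame({"Title": ["Python 101"], "Category": ["CD"], "Price": [300]})

    # Combine Book, Magazine and CD into a single table of library items
    items = pd.concat([book, magazine, cd], ignore_index=True)

    student = pd.DataFrame({"First": ["Anu"], "Last": ["Rao"],
                            "Phone": ["98765 43210"], "Joined": ["05/01/2024"],
                            "Address": ["12, mg road"]})
    student["Fullname"] = (student["First"] + " " + student["Last"]).str.upper()            # single, uppercase full name
    student["Phone"] = student["Phone"].str.replace(r"\D", "", regex=True).astype("int64")  # numeric phone
    student["Joined"] = pd.to_datetime(student["Joined"], dayfirst=True)                    # uniform datetime format
    student["Address"] = student["Address"].str.upper()                                     # uniform case
    print(items, student, sep="\n")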
Answer a Quick Question

According to your understanding


What is the need for data integration in the corporate world?
Need for Data Integration

 It is done to provide data in a specific view as requested by users, applications, etc.
 The bigger the organization gets, the more data there is and the more the data needs integration.
 Increases with the need for data sharing.

What it means: [Diagram — heterogeneous sources such as DB2, SQL and Oracle databases presented as a unified view of the data.]
Advantages of Using Data Integration

 Of benefit to decision-makers, who have access to important information from past studies.
 Reduces cost, overlaps and redundancies; reduces exposure to risks.
 Helps to monitor key variables like trends and consumer behaviour, etc.

What it means: [Diagram — DB2, SQL and Oracle sources presented as a unified view of the data.]
Common Approaches to Data Integration
Data Integration Approaches

• There are currently various methods for performing data integration.

• The most popular ones are:


 Federated databases
 Memory-mapped data structure
 Data warehousing
Data Integration Approaches
• Federated database (virtual database):
• A federated database is a virtual database that integrates multiple independent databases while allowing each of them to maintain its autonomy.
• A virtual database is a system that provides a unified view of multiple databases without
physically copying or moving the data. Instead of storing data in one place, it dynamically
retrieves and integrates data from different sources in real-time.
How a Virtual Database Works:
1.When a user queries data, the virtual database does not store the data itself.
2.It translates the query into multiple sub-queries that are sent to the original databases.
3.The results from different databases are combined and presented as a single, unified dataset.
Working:
• A middleware layer connects different databases without physically moving or replicating
the data.
• Queries are processed by translating them into the format of the underlying databases.
• The system returns a unified result to the user without requiring a centralized storage
system.
• The federated (virtual) database is the fully integrated, logical composite of all constituent databases in a federated database management system.
Use Case: Banking System

A bank has separate databases for:


• Customer information (MySQL).
• Loan records (PostgreSQL).
• Credit card transactions (Oracle DB).

Instead of merging all data into a single database, the bank sets up a virtual database that
integrates these three sources. When a bank employee searches for a customer’s financial profile,
the virtual database retrieves and combines real-time data from all three sources without
duplicating it.
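A highly simplified sketch of the middleware idea, with two in-memory SQLite databases standing in for the bank's separate systems (the real drivers, schemas and middleware would differ):

    import sqlite3

    # Two independent source databases (stand-ins for the MySQL / PostgreSQL / Oracle systems)
    customers_db = sqlite3.connect(":memory:")
    customers_db.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
    customers_db.execute("INSERT INTO customers VALUES (1, 'Asha Rao')")

    loans_db = sqlite3.connect(":memory:")
    loans_db.execute("CREATE TABLE loans (cust_id INTEGER, amount REAL)")
    loans_db.execute("INSERT INTO loans VALUES (1, 250000.0)")

    def financial_profile(cust_id):
        # Middleware layer: translate one request into sub-queries, one per source database
        name = customers_db.execute("SELECT name FROM customers WHERE cust_id = ?",
                                    (cust_id,)).fetchone()
        loans = loans_db.execute("SELECT amount FROM loans WHERE cust_id = ?",
                                 (cust_id,)).fetchall()
        # Combine the results into a single unified view without copying or moving the data
        return {"name": name[0] if name else None, "loans": [a for (a,) in loans]}

    print(financial_profile(1))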
Data Integration Approaches

Data Warehousing
• A data warehouse is a centralized repository that integrates structured data from multiple sources
for analysis and reporting.
• How It Works:
• Data is extracted from various operational databases (ETL process).
• The extracted data is cleaned, transformed, and stored (load) in a structured format.
• Users can query, analyze, and generate reports from the warehouse whose purpose is to deliver
business intelligence (BI) solutions.
The various primary concepts used in data warehousing would be:
• ETL (Extract Transform Load)
• Component-based (Data Mart)
• Dimensional Models and Schemas
• Metadata driven
Example:Business Intelligence & Reporting (e.g., sales performance tracking).
 Customer analytics (e.g., analyzing shopping behavior in e-commerce).
Memory-mapped data structure:

Useful when in-memory data manipulation is needed and the data structure is large. Memory-mapped data integration is an in-memory approach where data is accessed directly in RAM instead of through traditional disk-based storage.
How It Works:
• Data from different sources is mapped into memory so that applications can access it instantly.
• It uses shared memory techniques to integrate data without duplicating it.
• Typically used in real-time analytics and high-speed transaction processing.
• It’s mainly used in the dot net platform and is always performed with C# or using VB.NET
• It’s is a much faster way of accessing the data than using Memory Stream.

Example : Financial trading systems that require real-time price updates.


 IoT applications where real-time data processing is crucial.
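Although the text mentions the .NET platform, the same idea can be sketched with Python's mmap module (illustrative only): a file's contents are mapped into memory so that reads and in-place updates avoid intermediate copies.

    import mmap
    import tempfile

    # A file stands in for a large shared data source (e.g., a live price feed)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"AAPL:190.25 GOOG:141.10")
        path = f.name

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapped:
            # Reads and writes go straight to the mapped pages -- no intermediate copies
            print(mapped[:11].decode())        # "AAPL:190.25"
            mapped[5:11] = b"191.00"           # in-place update, visible to other mappings
            print(mapped[:11].decode())        # "AAPL:191.00"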
Comparison

Feature             Federated Databases               Memory-Mapped Data Structures   Data Warehousing
Data Storage        Distributed                       In-memory                       Centralized
Speed               Medium                            Very High                       Medium (Batch)
Real-Time Access    Yes                               Yes                             No
Scalability         High                              Limited by RAM                  High
Complexity          High                              Medium                          High
Use Case            Unified access to multiple DBs    Real-time analytics             Business Intelligence

Which One to Choose?

• Use federated databases if you need real-time access to multiple databases without moving data.
• Use memory-mapped data structures if you need high-speed data access for real-time processing.
• Use data warehousing if you need historical data analysis and business intelligence.
Difference between a federated database and a data warehouse

1. Federated database: Preferred when the databases are present across various locations over a large area (geographically decentralized).
   Data warehouse: Preferred when the source information can be taken from one location.

2. Federated database: Data is present in various servers.
   Data warehouse: The entire data warehouse is present in one server.

3. Federated database: Requires a high-speed network connection.
   Data warehouse: Requires no network connection.

4. Federated database: Easier to create than a data warehouse.
   Data warehouse: Its creation is not as easy as that of a federated database.

5. Federated database: Requires no creation of a new database.
   Data warehouse: Must be created from scratch.

6. Federated database: Requires network experts to set up the network connection.
   Data warehouse: Requires database experts such as a data steward.
Federated Databases Example – Healthcare System Integration
Scenario:
A hospital group has multiple branches, each using its own database for patient records, lab
results, and billing. The hospital wants a unified view of patient data without moving or
duplicating data.

How Federated Databases Work:


The hospital implements a federated database system that connects all the branch databases.
When a doctor queries a patient’s history, the system fetches real-time data from multiple
databases.
The doctor sees a combined view of the patient’s medical history, lab results, and billing
details.

Example Tools: IBM InfoSphere Federation Server, Microsoft SQL Server PolyBase

Benefits:
• Real-time patient data access from different locations
• No need to duplicate sensitive patient data
• Each branch maintains autonomy over its database
Memory-Mapped Data Structures Example – Stock Market Trading

Scenario:
A stock exchange system needs to process millions of stock price updates per second to provide
real-time insights to traders.

How Memory-Mapped Data Works:


Stock price data is stored in memory-mapped files instead of traditional databases.
Traders' software can instantly read and analyze stock prices without disk access.
The system updates stock prices in real time, allowing traders to make quick decisions.

Example Tools: Apache Ignite, Redis, Oracle TimesTen

Benefits:
• Ultra-fast stock price updates with zero disk I/O
• Low latency for real-time trading decisions
• Efficient memory usage for large-scale trading
Data Warehousing Example – Retail Sales Analysis
Scenario:
A large retail company (e.g., Walmart, Amazon) wants to analyze customer purchasing trends
over the past 5 years to improve marketing and inventory planning.

How Data Warehousing Works:


Sales data is collected from all retail stores and online platforms.
The data is extracted, cleaned, and stored in a data warehouse (e.g., Amazon Redshift,
Google BigQuery).
Analysts run queries and reports to identify trends, such as seasonal buying patterns or
popular products.

Example Tools: Snowflake, Amazon Redshift, Google BigQuery

Benefits:
• Enables historical trend analysis for better decision-making
• Supports business intelligence (BI) dashboards
• Data is structured and optimized for fast analytics
Answer a Quick Question

According to your understanding


What are the advantages and limitations of Data Warehouse?
Data Warehouse – Advantages and Limitations

ADVANTAGES
• Integration at the lowest level, eliminating the need for integration queries.
• Runtime schematic cleaning is not needed – it is performed in the data staging environment.
• Independent of the original data source.
• Query optimization is possible.

LIMITATIONS
• The process takes a considerable amount of time and effort.
• Requires an understanding of the domain.
• More scalable only when accompanied by a metadata repository – increased load.
• Tightly coupled architecture.

Metadata and Its Types
Metadata is data about data and helps ensure consistency in data usage.
• Example: Inconsistent interpretations of training attendance in a company can be resolved by
clearly defining how training days are counted.

The "Too Many Watches" Phenomenon


• If different stakeholders interpret data differently (e.g., training days counted differently by team
leaders and employees), it creates confusion.
• Solution: Metadata should clearly define terms and standards to ensure uniformity.

Business Impact of Data Quality Issues


• Poor data quality can lead to incorrect reports and business decisions.
• Example: Rounding off financial data in reports may cause significant discrepancies.

• Real-World Data Quality Issues


Case Study: Census Data Delay
• The population census often takes years to publish. Why is this a data timeliness issue?
• Solution:
• Faster data collection tools.
• Automated processing for real-time updates.
Metadata and Its Types

Business metadata (WHAT)
• Data definitions, metrics definitions, subject models, data models, business rules, data rules, data owners/stewards, etc.

Process metadata (HOW)
• Source/target maps, transformation rules, data cleansing rules, extract audit trail, transform audit trail, load audit trail, data quality audit, etc.

Technical metadata (TYPE)
• Data locations, data formats, technical names, data sizes, data types, indexing, data structures, etc.

Application metadata (WHO, WHEN)
• Data access history: Who is accessing? Frequency of access? When accessed? How accessed? etc.
Data Quality and Data Profiling
Data Quality
Data Quality means that data is accurate, complete, consistent, and useful for making
decisions.

Why is Data Quality Important?


🔹 If data is incorrect, wrong decisions can be made.
🔹 Bad data can cause errors in reports.
🔹 Good data saves time and money in businesses and organizations.
Example:
· If a school records a student’s birthdate incorrectly, the student might not be placed in the
correct grade.
Data Integrity ensures that data is correct, reliable, and follows rules to avoid mistakes or loss
of information.

Imagine a company has an Employee Table storing employee details. It follows data integrity
rules to keep the data correct.
Example of Data Integrity in a Database

EmpNo (Primary Key)   DeptNo (Foreign Key)   EmpName (Not Null)     Age (Check Constraint)
1001                  D01                    John Mathews           25
1010                  D02                    Elan Hayden            27
1011                  D01                    Jennifer Augustine     23

• Each column follows a rule to maintain integrity.


Types of Data Integrity Rules
1.Primary Key (PK) – Uniqueness Rule
· Each employee must have a unique EmpNo (Employee Number).
· Example: No two employees can have EmpNo = 1001.
2. Foreign Key (FK) – Relationship Rule
· DeptNo (Department Number) must exist in the Department Table.
· Example: If the Department Table has "D01" (Finance), an employee can be assigned to it.
· If DeptNo D05 doesn’t exist, we cannot assign an employee to it.

DeptNo (From Department Table) Dept Name


D01 Finance
D02 Purchase
D03 Human Resources
D04 Sales

3. Not Null – No Empty Values Rule


· EmpName must always be filled.
· Example: If we try to insert an employee without a name, the database will reject it .
4. Check Constraint – Business Rule
· Age should be between 18 and 65.
· Example: If someone tries to enter Age 17, the database will reject it.
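A minimal sketch of these four integrity rules as table definitions, assuming SQLite (the DDL follows the example tables above):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")    # enforce the relationship (FK) rule

    conn.execute("CREATE TABLE Department (DeptNo TEXT PRIMARY KEY, DeptName TEXT NOT NULL)")
    conn.execute("""
        CREATE TABLE Employee (
            EmpNo   INTEGER PRIMARY KEY,                    -- uniqueness rule
            DeptNo  TEXT REFERENCES Department(DeptNo),     -- relationship rule
            EmpName TEXT NOT NULL,                          -- no empty values rule
            Age     INTEGER CHECK (Age BETWEEN 18 AND 65)   -- business rule
        )""")
    conn.execute("INSERT INTO Department VALUES ('D01', 'Finance')")
    conn.execute("INSERT INTO Employee VALUES (1001, 'D01', 'John Mathews', 25)")

    try:
        conn.execute("INSERT INTO Employee VALUES (1002, 'D01', 'Too Young', 17)")
    except sqlite3.IntegrityError as err:
        print("rejected:", err)                 # the CHECK constraint blocks Age = 17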

Why is Data Integrity Important?


• Prevents duplicate records
• Ensures data accuracy and reliability
• Protects business operations from errors
• Keeps relationships between tables correct

• Data Quality means data is accurate and useful.


• Data Integrity ensures data follows rules to stay correct.
• Primary Key, Foreign Key, Not Null, and Check Constraints help maintain data integrity.
Definition
• Data quality refers to how well data meets business needs and follows enterprise standards.

• Data quality refers to the appropriateness of data for its intended purpose and its compliance
with enterprise data quality standards. It is broader than data integrity, which is specific to
database technology.

Why is it important?
• Poor data quality can lead to wrong business decisions.
• High-quality data ensures efficiency, accuracy, and trust in systems.

Four Key Dimensions of Data Quality


How Do We Maintain Data Quality?
1. Data Cleaning and Standardization
What is it?
• Fixing incorrect, outdated, or incomplete data.
• Applying consistent rules to format data properly.

Example:
• Converting phone numbers into a standard format:
◦ Incorrect: +1-555.123.4567, (555) 123-4567
◦ Corrected: +1-555-123-4567.
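A minimal sketch of this kind of standardization, assuming a regular-expression approach and US-style numbers (the exact rules would depend on the enterprise standard):

    import re

    def standardize_phone(raw):
        digits = re.sub(r"\D", "", raw)                # keep digits only
        if len(digits) == 10:                          # assume a US-style number without country code
            digits = "1" + digits
        return f"+{digits[0]}-{digits[1:4]}-{digits[4:7]}-{digits[7:]}"

    for raw in ["+1-555.123.4567", "(555) 123-4567"]:
        print(standardize_phone(raw))                  # both become +1-555-123-4567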

2. De-duplication (Removing Duplicate Data)

What is it?
◦ Detecting and removing duplicate records.
◦ Using smart algorithms to recognize duplicate names, addresses, or accounts.
◦ Example: Cleaning up customer names — the customer "Aleck Stevenson" is spelled differently on different invoices.
Solution: Use a common identifier (Social Security No.) to merge the duplicate records into a single consolidated record.

3. Data Enrichment
What is it?
• Adding external data to improve data quality.

Example: An online shopping website wants to improve customer profiles.

🔹 Problem: Customers enter incomplete addresses when placing orders. Some miss the ZIP code, while others
write abbreviations differently (e.g., "St." instead of "Street").

🔹 Solution: The company enriches its customer address data by using a third-party postal database (like the
U.S. Postal Service).

🔹 Before Enrichment:
Customer Entry: "123 Main St, Springfield"
Missing: State and ZIP code
🔹 After Enrichment:
Corrected Address: "123 Main Street, Springfield, IL, 62704"

Benefit: Ensures accurate deliveries, reduces shipping errors, and improves customer satisfaction.
Key Areas in Data Quality Management
a) Data Governance
Data governance ensures that data is accurate, secure, and used appropriately through policies and compliance
regulations.
Example: Banks must comply with data privacy laws (e.g., GDPR, HIPAA) to protect customer information.
b) Metadata Management
Metadata provides details about stored data, such as definitions, formats, and relationships.
Example: A database field for "Customer Name" may have:
• Data type: Text
• Max length: 50 characters

c) Data Architecture & Design: Data architecture involves the structure of data storage, processing, and
integration within an organization.
Example: A company uses ETL (Extract, Transform, Load) pipelines to move raw data into a structured
database.
d) Database Management: Database management ensures optimized performance, backup, and data integrity.
Example: A banking system must ensure all transactions are stored securely and can be recovered in case of
system failure.
e) Data Security : Protecting sensitive data from unauthorized access, leaks, or cyber threats.
Example:
Only authorized healthcare professionals should access patient records in a hospital database.

f) Data Warehousing & Business Intelligence (BI)


Data warehouses store structured information for reporting and analytics. BI tools help businesses
track performance and make informed decisions.
Example: An e-commerce platform uses BI to analyze customer behavior and recommend products.

• Data quality is maintained through cleaning, standardization, de-duplication, and enrichment.


• Data governance and metadata management ensure proper structuring and security.
• Secure storage, architecture, and warehousing enable reliable data usage for decision-making.
Building Blocks of Data Quality Management

• Analyze, Improve and Control


• This methodology is used to
encompass people, processes and
technology.
• This is achieved through five
methodological building blocks,
namely:
 Profiling
 Quality
 Integration
 Enrichment
 Monitoring
Answer a Quick Question

According to your understanding


What is data quality and why is it important?
Data Quality

• Correcting, standardizing and validating the information.

• Creating business rules to correct, standardize and validate your data.

• High-quality data is essential to successful business operations.
Answer a Quick Question

What do you think are the major causes of bad data quality?
Causes of Bad Data Quality

During the process of extraction, loading and archiving:
• Initial conversion of data
• Consolidation of systems
• Manual data entry
• Batch feeds
• Real-time interfaces

Data decay:
• Changes not captured
• System upgrades
• Use of new data
• Loss of expertise
• Automation of processes
Effect of Bad Quality During Data Transformations
• Processing data
• Data scrubbing
• Data purging
Data Quality in Data Integration

• Building a unified view of the database from the information.
• An effective data integration strategy can lower
costs and improve productivity by ensuring
the consistency, accuracy and reliability of
data.
• Data integration enables to:
 Match, link and consolidate multiple data
sources
 Gain access to the right data sources at the
right time
 Deliver high-quality information
 Increase the quality of information
Data Quality in Data Integration

• Understand Corporate Information Anywhere in the Enterprise


• Data integration involves combining processes and technology to ensure an
effective use of the data can be made.

• Data integration can include:


 Data movement
 Data linking and matching
 Data householding
Popular ETL Tools
ETL Tools

• An ETL process can also be created using a programming language.


• Open source ETL framework tools
 CloverETL
 Enhydra Octopus
 Pentaho Data Integration (also known as ‘Kettle’)
 Talend Open Studio

• Popular ETL Tools


 Ab Initio
 Business Objects Data Integrator
 Informatica
 SQL Server 2005/08 Integration services
