Basics of Data Integration
Fundamentals of Business Analytics
Data from several heterogeneous data sources is extracted and loaded into a data warehouse.
Data Warehouse
According to William H. Inmon, “A data warehouse is a
subject-oriented, integrated, time variant and non-volatile
collection of data in support of management’s decision making
process.”
Process of coherent merging of data from various data sources and presenting a
cohesive/consolidated view to the user
• Involves combining data residing at different sources and providing users with a
unified view of the data.
• Development challenges
Translation of relational databases to object-oriented applications
Consistent and inconsistent metadata
Handling redundant and missing data
Normalization of data from different sources
• Technological challenges
Various formats of data
Structured and unstructured data
Huge volumes of data
• Organizational challenges
Unavailability of data
Risk of failure with manual integration
Two main approaches in Data Integration
Integration is divided into two main approaches:
Schema integration – reconciles schema elements
Instance integration – reconciles the actual data records (instances)
Multiple data sources may provide data on the same entity type. The main goal is to allow applications to
transparently view and query this data as one uniform data source, and this is done using various mapping rules to
handle structural differences.
Schema integration involves merging multiple database schemas into a unified schema. This process addresses
structural differences between databases to create a cohesive data model.
Example (schema integration):
Consider two databases:
Database A: Contains employee information with fields Emp_ID, Emp_Name, and Dept.
Database B: Contains department details with fields Department_ID, Department_Name, and Manager_ID.
Schema integration reconciles the structural differences, for example by mapping Dept in Database A to Department_ID in Database B, so that applications see one unified schema.
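A minimal sketch of how such a schema mapping could be expressed in code, assuming pandas and the illustrative column names from the example above; the mapping rule itself is an assumption, not a prescribed method:

```python
# Sketch: schema integration via a mapping rule (pandas assumed as tooling).
import pandas as pd

# Database A: employee records
emp_a = pd.DataFrame({
    "Emp_ID": [101, 102],
    "Emp_Name": ["Alice", "Bob"],
    "Dept": ["D01", "D02"],
})

# Database B: department records
dept_b = pd.DataFrame({
    "Department_ID": ["D01", "D02"],
    "Department_Name": ["IT", "HR"],
    "Manager_ID": [201, 202],
})

# Mapping rule that reconciles the structural difference:
# Dept in Database A corresponds to Department_ID in Database B.
unified = emp_a.rename(columns={"Dept": "Department_ID"}).merge(dept_b, on="Department_ID")
print(unified)
```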
Example (instance integration):
Suppose two customer databases need to be integrated:
Database X: Contains a record: Customer_ID: 001, Name: John Doe, Email: [email protected].
Database Y: Contains a record: Cust_ID: A1, Full_Name: J. Doe, Email_Address: [email protected].
Instance integration involves identifying that these records refer to the same individual and
merging them into a single, consistent record:
Integrated Record: Customer_ID: 001, Name: John Doe, Email: [email protected].
This process ensures data accuracy and eliminates redundancy.
Both schema and instance integration are crucial for creating a unified, accurate
dataset, facilitating effective data analysis and decision-making.
Entity Identification (EI) and attribute-value conflict resolution (AVCR) comprise the instance-integration task. When common key attributes are not available across different data sources, the rules for EI and the rules for AVCR are expressed as combinations of constraints on their attribute values.
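A hedged sketch of rule-based EI and AVCR for the Database X / Database Y records above; the matching rule (same email) and the conflict-resolution rule (prefer the more complete name) are illustrative assumptions:

```python
# Sketch: rule-based Entity Identification (EI) and attribute-value conflict resolution (AVCR).
# The email addresses are placeholders because they are redacted in the source text.

record_x = {"Customer_ID": "001", "Name": "John Doe", "Email": "[email protected]"}
record_y = {"Cust_ID": "A1", "Full_Name": "J. Doe", "Email_Address": "[email protected]"}

def same_entity(x, y):
    # EI rule: no shared key attribute, so compare a constraint on attribute values (here: email).
    return x["Email"].strip().lower() == y["Email_Address"].strip().lower()

def merge(x, y):
    # AVCR rule: keep X's identifier, prefer the more complete (longer) name spelling.
    name = x["Name"] if len(x["Name"]) >= len(y["Full_Name"]) else y["Full_Name"]
    return {"Customer_ID": x["Customer_ID"], "Name": name, "Email": x["Email"]}

if same_entity(record_x, record_y):
    print(merge(record_x, record_y))  # one consistent, consolidated record
```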
Various Stages in ETL
Cycle initiation
Build reference data
Extract
Validate
Transform
Audit reports
Publish
Archive
Clean up
Extract, Transform and Load
• What is ETL?
Extract, transform, and load (ETL) in database usage (and especially in data
warehousing) involves:
Extracting data from different sources
Transforming it to fit operational needs (which can include quality
levels)
Loading it into the end target (database or data warehouse)
• Enables the creation of efficient and consistent databases
• While ETL is most often discussed in the context of a data warehouse, the term in fact applies to a process that loads any database.
• ETL implementations usually store an audit trail of positive and negative process runs.
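As a rough illustration of the three steps, a minimal ETL sketch in Python; the file name, table name, and quality rules are assumptions for illustration only:

```python
# Minimal ETL sketch: extract from a CSV source, transform to meet quality rules,
# and load into a SQLite target standing in for the warehouse.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Example quality rules: drop rows without an ID, standardize names to title case.
    return [
        {"id": int(r["id"]), "name": r["name"].strip().title()}
        for r in rows if r.get("id")
    ]

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (:id, :name)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("customers.csv")), conn)
```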
Data Mapping
It is a process of generating data element mapping between two distinct data
models. is the first process that is performed for a variety of data integration
tasks which include:
Data Extraction
• Extraction is the operation of extracting data from the source system for further use in a data warehouse environment. This is the first step in the ETL process.
• Data extraction is the process of collecting data from various sources and consolidating it into a single location.
• This involves retrieving data from different source systems, which may have varying formats,
such as flat files and relational databases.
• The complexity of the extraction process can depend on the type of source data. It is crucial to
store an intermediate version of the extracted data and back it up and archive it.
• The area where the extracted data is stored is called the staging area.
• Designing this process means making decisions about the following main aspects:
Which extraction method would I choose?
How do I provide the extracted data for further processing?
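One possible shape for this step, assuming one flat-file source and one relational source with hypothetical names, and a local folder standing in for the staging area:

```python
# Sketch: extract from two heterogeneous sources and keep an intermediate copy in a staging area.
import csv
import json
import sqlite3
from pathlib import Path

STAGING = Path("staging")
STAGING.mkdir(exist_ok=True)

# Source 1: flat file (hypothetical orders.csv)
with open("orders.csv", newline="") as f:
    orders = list(csv.DictReader(f))

# Source 2: relational database (hypothetical operational.db / products table)
conn = sqlite3.connect("operational.db")
conn.row_factory = sqlite3.Row
products = [dict(row) for row in conn.execute("SELECT * FROM products")]

# Store the intermediate (extracted) version in the staging area, ready to back up and archive.
(STAGING / "orders.json").write_text(json.dumps(orders))
(STAGING / "products.json").write_text(json.dumps(products))
```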
Data Transformation
• It is the most complex and, in terms of production, the most costly part of the ETL process.
• A series of rules or functions is applied to the data extracted from the source to obtain derived data, which is then loaded into the end target.
• Depending upon the data source, manipulation of the data may be required. If the source data is of good quality, it may require very little transformation and validation.
• But data from some sources might require one or more transformation types to meet operational needs and make the data fit for the end target.
• Some transformation types are
* Selecting only certain columns to load.
* Translating a few coded values.
* Encoding some free-form values.
* Deriving a new calculated value.
* Joining together data derived from multiple sources.
* Summarizing multiple rows of data.
* Splitting a column into multiple columns.
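Several of these transformation types can be illustrated with a short pandas sketch; the column names and code values below are assumptions:

```python
# Sketch of common transformation types from the list above (pandas assumed as tooling).
import pandas as pd

src = pd.DataFrame({
    "cust_name": ["Doe, John", "Roe, Jane"],
    "gender_code": ["M", "F"],
    "qty": [2, 5],
    "unit_price": [10.0, 4.0],
    "internal_flag": ["x", "y"],
})

# Selecting only certain columns to load
df = src[["cust_name", "gender_code", "qty", "unit_price"]].copy()

# Translating coded values
df["gender"] = df["gender_code"].map({"M": "Male", "F": "Female"})

# Deriving a new calculated value
df["total_amount"] = df["qty"] * df["unit_price"]

# Splitting a column into multiple columns
df[["last_name", "first_name"]] = df["cust_name"].str.split(", ", expand=True)

# Summarizing multiple rows of data
summary = df.groupby("gender", as_index=False)["total_amount"].sum()
print(summary)
```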
Data Transformation
• Transformations can range from simple data conversions to extreme data scrubbing techniques.
• From an architectural perspective, transformations can be performed in two
ways.
Multistage data transformation
Pipelined data transformation
Data Transformation
[Figure: Multistage transformation – source table → convert source keys to warehouse keys (staging table) → insert into the target warehouse table]
Data Transformation
[Figure: Pipelined transformation – validate customer keys against an external table, convert source keys to warehouse keys, and insert into the target warehouse table in a single pipelined flow]
Data Loading
• The last stage of the ETL process is loading, which loads the extracted and transformed data into the end target, usually the data warehouse.
• The concept of data loading can be illustrated using the case study of the DIIT library. Assume for the sake of simplicity that each department (IT, CS, IS, and SE) stores data in separate sheets in an Excel workbook, except the Issue_Return table, which is in an Access database. Each sheet contains a table. There are five tables in all:
• 1. Book (Excel)
• 2. Magazine (Excel)
• 3. CD (Excel)
• 4. Student (Excel)
• 5. Issue_Return (Access)
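A hedged sketch of how these tables could be loaded into a small warehouse; the workbook, sheet, and database names are assumptions, and reading the Access table would additionally require an ODBC driver, so it is only indicated in a comment:

```python
# Sketch: load the DIIT library tables into a SQLite database standing in for the warehouse.
import pandas as pd
import sqlite3

warehouse = sqlite3.connect("diit_warehouse.db")

# Load the four Excel tables (assumed file name library.xlsx, one sheet per table).
for sheet in ["Book", "Magazine", "CD", "Student"]:
    df = pd.read_excel("library.xlsx", sheet_name=sheet)
    df.to_sql(sheet.lower(), warehouse, if_exists="replace", index=False)

# Issue_Return lives in Access; with pyodbc and the Access ODBC driver it could be read with
# pd.read_sql("SELECT * FROM Issue_Return", access_connection) and loaded the same way.
```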
Data Loading
Consider the requirements of the DIIT administration: "The DIIT administration is in need of a report that indicates the annual spending on library purchases. The report should further drill down to the spending by each department by category (books, CDs/DVDs, magazines, journals, etc.)."
Prof. Frank had suggested building a data warehouse. In order to implement Prof. Frank's solution, the following requirements should be satisfied:
Data Federation Example – Banking
Instead of merging all data into a single database, the bank sets up a virtual database that integrates these three sources. When a bank employee searches for a customer's financial profile, the virtual database retrieves and combines real-time data from all three sources without duplicating it.
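A toy sketch of this virtual-database idea: the data stays in the separate source databases and is combined only at query time, without duplication. Database, table, and column names are assumptions, and only two of the sources are shown:

```python
# Sketch: federate two source databases at query time instead of copying data.
import sqlite3

branch_db = sqlite3.connect("branch_accounts.db")
loans_db = sqlite3.connect("loans.db")

def customer_profile(customer_id):
    accounts = branch_db.execute(
        "SELECT account_no, balance FROM accounts WHERE customer_id = ?", (customer_id,)
    ).fetchall()
    loans = loans_db.execute(
        "SELECT loan_no, outstanding FROM loans WHERE customer_id = ?", (customer_id,)
    ).fetchall()
    # The combined view is assembled on the fly; nothing is copied into a central store.
    return {"customer_id": customer_id, "accounts": accounts, "loans": loans}
```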
Data Integration Approaches
Data Warehousing
• A data warehouse is a centralized repository that integrates structured data from multiple sources
for analysis and reporting.
• How It Works:
• Data is extracted from various operational databases (ETL process).
• The extracted data is cleaned, transformed, and stored (load) in a structured format.
• Users can query, analyze, and generate reports from the warehouse, the purpose of which is to deliver business intelligence (BI) solutions.
The primary concepts used in data warehousing are:
• ETL (Extract Transform Load)
• Component-based (Data Mart)
• Dimensional Models and Schemas
• Metadata driven
Examples:
• Business Intelligence & Reporting (e.g., sales performance tracking)
• Customer analytics (e.g., analyzing shopping behavior in e-commerce)
Memory-mapped data structure:
Useful when in-memory data manipulation is needed and the data structure is large. Memory-mapped data integration is an in-memory approach where data is accessed directly in RAM instead of through traditional disk-based storage.
How It Works:
• Data from different sources is mapped into memory so that applications can access it instantly.
• It uses shared memory techniques to integrate data without duplicating it.
• Typically used in real-time analytics and high-speed transaction processing.
• It’s mainly used in the dot net platform and is always performed with C# or using VB.NET
• It’s is a much faster way of accessing the data than using Memory Stream.
Use case by approach:
• Data federation – unified access to multiple DBs
• Memory-mapped structures – real-time analytics
• Data warehousing – business intelligence (BI)
1. Data federation is preferred when the databases are present across various locations over a large area (geographically decentralized).
2. Data warehousing is preferred when the source information can be taken from one location.
Example Tools: IBM InfoSphere Federation Server, Microsoft SQL Server PolyBase
Benefits:
• Real-time patient data access from different locations
• No need to duplicate sensitive patient data
• Each branch maintains autonomy over its database
Memory-Mapped Data Structures Example – Stock Market Trading
Scenario:
A stock exchange system needs to process millions of stock price updates per second to provide
real-time insights to traders.
Benefits:
• Ultra-fast stock price updates with zero disk I/O
• Low latency for real-time trading decisions
• Efficient memory usage for large-scale trading
Data Warehousing Example – Retail Sales Analysis
Scenario:
A large retail company (e.g., Walmart, Amazon) wants to analyze customer purchasing trends
over the past 5 years to improve marketing and inventory planning.
Benefits:
• Enables historical trend analysis for better decision-making
• Supports business intelligence (BI) dashboards
• Data is structured and optimized for fast analytics
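A sketch of the kind of historical query such a warehouse supports, assuming a hypothetical sales_fact table with sale_date, category, and amount columns:

```python
# Sketch: yearly sales by category from a warehouse fact table (table name is hypothetical).
import sqlite3

warehouse = sqlite3.connect("retail_warehouse.db")
rows = warehouse.execute("""
    SELECT strftime('%Y', sale_date) AS year,
           category,
           SUM(amount)               AS total_sales
    FROM   sales_fact
    GROUP  BY year, category
    ORDER  BY year, total_sales DESC
""").fetchall()

for year, category, total in rows:
    print(year, category, total)
```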
ADVANTAGES
• Integration at the lowest level, eliminating the need for integration queries
• Runtime schematic cleaning is not needed – it is performed in the data staging environment
• Independent of the original data source
LIMITATIONS
• The process would take a considerable amount of time and effort
• Requires an understanding of the domain
• More scalable only when accompanied by a metadata repository – increased load
• Process metadata – HOW: source/target maps, transformation rules, data cleansing rules, extract audit trail, transform audit trail, load audit trail, data quality audit, etc.
• Technical metadata – TYPE: data locations, data formats, technical names, data sizes, data types, indexing, data structures, etc.
• Application metadata – WHO, WHEN: data access history (who is accessing, frequency of access, when accessed, how accessed), etc.
Data Quality and Data Profiling
Data Quality
Data Quality means that data is accurate, complete, consistent, and useful for making
decisions.
Example of Data Integrity in a Database
Imagine a company has an Employee Table storing employee details. It follows data integrity rules to keep the data correct.
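A brief sketch of such integrity rules enforced at the database level; the column names and constraints are assumptions for illustration:

```python
# Sketch: database-level integrity rules for a hypothetical Employee table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,          -- entity integrity: unique, not null
        name     TEXT NOT NULL,                -- completeness rule
        email    TEXT UNIQUE,                  -- no duplicate emails
        salary   REAL CHECK (salary >= 0)      -- domain rule: salary cannot be negative
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 'Alice', 'alice@example.com', 50000)")

# The next insert violates the CHECK constraint and is rejected by the database itself.
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Bob', 'bob@example.com', -10)")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```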
• Data quality refers to the appropriateness of data for its intended purpose and its compliance
with enterprise data quality standards. It is broader than data integrity, which is specific to
database technology.
Why is it important?
• Poor data quality can lead to wrong business decisions.
• High-quality data ensures efficiency, accuracy, and trust in systems.
1. Data Standardization
Example:
• Converting phone numbers into a standard format:
◦ Incorrect: +1-555.123.4567, (555) 123-4567
◦ Corrected: +1-555-123-4567.
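A small sketch of how this standardization could be automated; the regular expression and the assumption of 10/11-digit North American numbers are illustrative choices:

```python
# Sketch: normalize phone numbers into the +1-555-123-4567 format shown above.
import re

def standardize_phone(raw):
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if len(digits) == 10:
        digits = "1" + digits                # assume a missing country code
    if len(digits) != 11:
        return None                          # cannot standardize; flag for review
    return f"+{digits[0]}-{digits[1:4]}-{digits[4:7]}-{digits[7:]}"

print(standardize_phone("+1-555.123.4567"))  # +1-555-123-4567
print(standardize_phone("(555) 123-4567"))   # +1-555-123-4567
```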
2. Data Deduplication
What is it?
◦ Detecting and removing duplicate records.
◦ Using smart algorithms to recognize duplicate names, addresses, or accounts.
◦ Example: Cleaning Up Customer Names
◦ A customer "Aleck Stevenson" has different spellings in different invoices:
Solution: Use a common identifier (Social Security No.) to merge duplicate records into:
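A hedged sketch of this merge step; the variant spellings and the rule of keeping the most complete spelling are illustrative assumptions, with the SSN used as the common identifier named in the text:

```python
# Sketch: merge duplicate customer records using a common identifier (SSN).
invoices = [
    {"ssn": "123-45-6789", "name": "Aleck Stevenson"},   # spellings are illustrative
    {"ssn": "123-45-6789", "name": "Alec Stevenson"},
    {"ssn": "123-45-6789", "name": "A. Stevenson"},
]

merged = {}
for rec in invoices:
    current = merged.get(rec["ssn"])
    # Resolve conflicting spellings by keeping the most complete (longest) name.
    if current is None or len(rec["name"]) > len(current["name"]):
        merged[rec["ssn"]] = rec

print(list(merged.values()))  # one consolidated record per customer
```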
3. Data Enrichment
What is it?
• Adding external data to improve data quality.
🔹 Problem: Customers enter incomplete addresses when placing orders. Some miss the ZIP code, while others
write abbreviations differently (e.g., "St." instead of "Street").
🔹 Solution: The company enriches its customer address data by using a third-party postal database (like the
U.S. Postal Service).
🔹 Before Enrichment:
Customer Entry: "123 Main St, Springfield"
Missing: State and ZIP code
🔹 After Enrichment:
Corrected Address: "123 Main Street, Springfield, IL, 62704"
Benefit: Ensures accurate deliveries, reduces shipping errors, and improves customer satisfaction.
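A minimal sketch of such enrichment against a reference lookup; the in-memory dictionary stands in for the third-party postal database and its values are hypothetical:

```python
# Sketch: enrich incomplete addresses from a reference table.
POSTAL_REFERENCE = {
    ("123 main st", "springfield"): {"street": "123 Main Street", "city": "Springfield",
                                     "state": "IL", "zip": "62704"},
}

def enrich_address(street, city):
    key = (street.strip().lower(), city.strip().lower())
    match = POSTAL_REFERENCE.get(key)
    if match is None:
        return None  # no reference match; leave the record for manual review
    return f'{match["street"]}, {match["city"]}, {match["state"]}, {match["zip"]}'

print(enrich_address("123 Main St", "Springfield"))
# -> 123 Main Street, Springfield, IL, 62704
```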
Key Areas in Data Quality Management
a) Data Governance
Data governance ensures that data is accurate, secure, and used appropriately through policies and compliance
regulations.
Example: Banks must comply with data privacy laws (e.g., GDPR, HIPAA) to protect customer information.
b) Metadata Management
Metadata provides details about stored data, such as definitions, formats, and relationships.
Example: A database field for "Customer Name" may have:
• Data type: Text
• Max length: 50 characters
c) Data Architecture & Design: Data architecture involves the structure of data storage, processing, and
integration within an organization.
Example: A company uses ETL (Extract, Transform, Load) pipelines to move raw data into a structured database.
d) Database Management: Database management ensures optimized performance, backup, and data integrity.
Example: A banking system must ensure all transactions are stored securely and can be recovered in case of
system failure.
e) Data Security : Protecting sensitive data from unauthorized access, leaks, or cyber threats.
Example:
Only authorized healthcare professionals should access patient records in a hospital database.
What do you think are the major causes of bad data quality?
Causes of Bad Data Quality
• Errors can be introduced during data transformations, for example during data processing, data scrubbing, and data purging.
Data Quality in Data Integration