Topic 03 Data Integration
Management
@CIT 2024 1
Learning Outcome
By the end of this lecture, students should be able to:
• Understand data integration concepts
• Describe ETL (Extract, Transform, Load) processes
• Identify tools for data integration and workflow orchestration
Data Integration
• Imagine a room where puzzle pieces are scattered all
around, each showing part of a picture.
• What do you do if you want to see the complete picture?
You bring the pieces together, connect them, and complete
the puzzle. That is precisely what data integration is
about: combining data from multiple sources into a
centralized repository.
• This repository provides a holistic understanding of the
entire business. When done right, this leads to a single
source of truth (SSOT) that organizations rely on for
accurate decision-making.
Data Integration
• Data integration refers to the process of bringing together
data from multiple sources across an organization to provide
a complete, accurate, and up-to-date dataset for BI, data
analysis and other applications and business processes.
Data Integration
• Data ingestion is the process of importing large, assorted
data files from multiple sources into a single storage
medium, often cloud-based (a data warehouse, data mart or
database), where it can be accessed and analyzed.
• A data lake is a centralized repository designed to store,
process, and secure large amounts of structured,
semi-structured, and unstructured data. It can store data in
its native format and process any variety of it, regardless
of size.
Data Warehouse
• Data warehouses are central repositories of integrated
data from one or more disparate sources.
• They store current and historical data in a single
place, where it is used to create analytical reports for
workers throughout the enterprise.
• This is beneficial for companies as it enables them to
interrogate and draw insights from their data and make
decisions.
Data warehouse and Data mart
Examples of Data Integration
• 1. Integrating customer data to unlock marketing insights
• Consolidating customer data is among the most critical use
cases for data integration.
• It involves bringing together client data from all accessible
sources, such as contact information, account details,
customer lifetime value (CLV) ratings, and information
gathered from customer inquiries, website views, direct sales
initiatives, surveys, social media postings, and other
interactions.
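The consolidation described above can be sketched in a few lines of Python. This is only an illustration: the two source lists, the field names, and the choice of email as the matching key are assumptions, not part of any particular tool.

```python
from collections import defaultdict

# Hypothetical records from two source systems; field names are assumptions.
crm_records = [
    {"email": "Ana@Example.com", "name": "Ana Moshi", "city": "Dodoma"},
    {"email": "juma@example.com", "name": "Juma Ali", "city": None},
]
web_records = [
    {"email": "ana@example.com ", "page_views": 14},
    {"email": "juma@example.com", "page_views": 3},
    {"email": "neema@example.com", "page_views": 7},
]


def normalize_key(email):
    """Make the matching key comparable across sources."""
    return email.strip().lower()


def consolidate(*sources):
    """Merge records that share a key into one profile per customer.

    Later sources fill in fields that earlier sources left empty;
    they never overwrite a value that is already present.
    """
    profiles = defaultdict(dict)
    for source in sources:
        for record in source:
            key = normalize_key(record["email"])
            profile = profiles[key]
            profile["email"] = key
            for field, value in record.items():
                if field == "email" or value is None:
                    continue
                profile.setdefault(field, value)
    return dict(profiles)


customers = consolidate(crm_records, web_records)
for email in sorted(customers):
    print(customers[email])
```

Each customer ends up with a single profile even though the email was written differently in the two sources, which is the "single source of truth" idea on a very small scale.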
Examples of Data Integration
• 2. Integrating IoT data to optimize industrial operations
• Organizations are increasingly moving to combine data
generated by many sensors deployed on internet-connected
industrial equipment, such as manufacturing machines,
automobiles, elevators, pipelines, electricity grids, and oil
rigs (i.e., the internet of things).
• Integrated sensor data sets can be used to evaluate
business processes and run preventive maintenance
simulations that anticipate potential equipment issues
before they occur, reducing unscheduled repair downtime.
Examples of Data Integration
• 3. Integrating store data to operate retail businesses
• Both traditional and online stores deal with a considerable
amount of data.
• Users must centralize all of this information to track
performance, regardless of which retailer or team member
submitted it.
• Data integration enables retailers to manage inventory,
workforce person-hours, revenue data, and other critical
variables across their channels and locations.
Benefits of Data Integration
• Enable inter-department and inter-system
collaboration:
• Employees in all departments and geographically
dispersed locations must have access to the business’s
data for shared and individualized initiatives.
• In addition, almost every department produces
information that the entire organization needs. Data
integration may promote data coordination and
unification throughout the enterprise.
Benefits of Data Integration
• Better data. Delivering more valuable data, in both
integrity and quality.
• Better collaboration. Knowledge moves seamlessly
between systems, which improves collaboration and
reduces errors.
• Fast connections between data stores. An effective
data integration system with seamless connections
ensures you can always reach your data when you need
it.
Benefits of Data Integration
• Unlock time and effort savings:
• When a business integrates its data effectively, it
dramatically reduces the time required to compile and
evaluate it.
• The automated management of centralized views eliminates
the need to collect data manually.
• Staff no longer need to establish connections manually
every time a report has to be run or a new application is
being built.
Benefits of Data Integration
• Get ready access to reports
• Without a data integration system that seamlessly
combines information, reports must be rebuilt
periodically to reflect any changes in the data. With
automatic updates, however, reports can be run
whenever necessary, in real time.
• Maximize the value of information
• Over time, data integration activities increase the value of
enterprise data. Quality problems are detected as
information is brought into a centralized system, and
the required corrections are made, resulting in much
more accurate data – the cornerstone for quality analysis.
Benefits of Data Integration
• Obtain value from big data sets
• Data lakes are often voluminous and complex in structure.
For example, companies such as Facebook and Google
continuously process data from billions of individuals.
• This substantial volume of typically unstructured data is
called “big data”. Intelligent data integration is therefore
vital for large-scale data operations.
Benefits of Data Integration
• Empower business intelligence (BI) apps
• Data integration streamlines business intelligence (BI)
procedures by providing a consistent and uniform view
of data from several sources.
• Organizations can quickly deploy datasets to generate
meaningful insights into the current business
situation.
Benefits of Data Integration
• Increased efficiency and ROI. Quick access to data
cuts down on errors and wasted effort.
• Better customer and partner experiences. When
you keep track of your customers' wants and needs,
you can deliver what they ask for. For example, in a
manufacturing setting, you would be able to order from
vendors as soon as you need to replenish your inventory.
• A comprehensive view of your business. This includes
a complete picture of business analytics, insights, and
intelligence—as well as a complete overview of
processes and performance.
Data Integration Process
• Extract, Transform and Load (ETL) is a standard data
integration approach in which data is physically taken
from several source systems, converted into a new
layout, and loaded into a centralized data store.
What is ETL?
❑ ETL stands for Extract, Transform and Load.
❑ It is the most challenging, costly and time-consuming
step in building any type of data management system
(often, 80% of data automation time is spent on ETL).
ETL Steps (3)
• Extract
❑ Extract relevant data
• Transform
❑ Transform data to the data warehouse (DW) format
❑ Build keys, etc.
❑ Cleanse the data
• Load
❑ Load data into the DW
❑ Build aggregates, etc.
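The three steps above can be sketched as a tiny pipeline using only the Python standard library. This is a minimal illustration under stated assumptions: the CSV content, the table names `fact_orders` and `agg_daily_sales`, and the cleansing rule (skip rows with no amount) are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or a query result.
SOURCE_CSV = """order_id,customer,amount,order_date
1, alice ,120.50,2024-01-03
2,BOB,80.00,2024-01-03
3,alice,,2024-01-04
"""


def extract(text):
    """Extract: read raw rows from the source exactly as they are."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cleanse values and convert them to the warehouse format."""
    cleaned = []
    for row in rows:
        if not row["amount"]:  # cleansing: skip rows with no amount
            continue
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(),
             float(row["amount"]), row["order_date"])
        )
    return cleaned


def load(conn, rows):
    """Load: write the rows into the warehouse, then build an aggregate."""
    conn.execute(
        "CREATE TABLE fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    conn.execute(
        "CREATE TABLE agg_daily_sales AS "
        "SELECT order_date, SUM(amount) AS total FROM fact_orders GROUP BY order_date"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
daily_sales = conn.execute("SELECT order_date, total FROM agg_daily_sales").fetchall()
print(daily_sales)
```

Keeping each step in its own function mirrors the slide: extraction does not change the data, transformation owns all cleansing and formatting, and loading owns the warehouse tables and aggregates.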
Data Extraction
• A data warehouse needs an initial load of the entire data set
from a specific source
◼ Data is captured from the operational source in “as is” status
◼ Extraction is the process of pulling the data required for the
data warehouse out of the source system
• The extracted data can be written to a file or to a database
• Extraction could involve some degree of cleansing or transformation
▪ A big challenge during data extraction is how your ETL tool
handles both structured and unstructured data (e.g., emails,
web pages). The right tool is needed, or you may have to
create a custom solution to help transfer unstructured
data.
Extraction Methods
• The choice of extraction method depends heavily on
the source system and on the business requirements
in the target data warehouse environment.
• There are two types of extraction methods:
• Logical extraction methods
• Physical extraction methods
Logical Extraction Methods
There are two types: full and incremental.
❑ Full extraction: the data is extracted entirely from the source
system.
Because this extraction reflects all the data currently available on the
source system, there is no need to keep track of changes to the data
source since the last successful extraction.
❑ Incremental extraction: only the data that has changed since
the last load is extracted
– The source must be able to detect its changes
– Applying only the changes is much less time-consuming than a full load
– In practice, the DWH and its sources can diverge. A regular full load
(e.g., every quarter) can be used to reconcile sources and DWH
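The difference between full and incremental extraction can be sketched as follows. This assumes one common change-detection technique, a last-modified timestamp column compared against a stored "watermark"; the `customers` table, the `updated_at` column, and the sample rows are hypothetical, and other techniques (triggers, change logs) are equally valid.

```python
import sqlite3

# Hypothetical source table with a last-modified timestamp on every row.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "2024-03-01T09:00:00"), (2, "Baraka", "2024-03-02T10:30:00")],
)


def full_extract(conn):
    """Full extraction: read every row; no change tracking is needed."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers ORDER BY id"
    ).fetchall()


def incremental_extract(conn, watermark):
    """Incremental extraction: read only rows changed after the watermark.

    Returns the changed rows and the new watermark to store for the next run.
    """
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


# First run: initial full load, remembering the latest timestamp seen.
initial_rows = full_extract(source)
watermark = max(row[2] for row in initial_rows)

# The source changes after the first load.
source.execute(
    "UPDATE customers SET name = 'Asha M.', updated_at = '2024-03-05T08:00:00' "
    "WHERE id = 1"
)
source.execute("INSERT INTO customers VALUES (3, 'Chiku', '2024-03-06T12:00:00')")

changed_rows, watermark = incremental_extract(source, watermark)
print(changed_rows)
print(watermark)
```

A timestamp watermark cannot see rows that were deleted at the source, which is one reason the DWH and its sources can drift apart and a periodic full load is used to reconcile them.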
Physical Extraction Methods
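Depending on the chosen logical extraction method and on the capabilities
and restrictions of the source side, the data can be physically extracted
in two ways: online (directly from the source system) and offline (from a
copy staged outside the source system, such as a flat file or dump file).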
Data Transformation
• The transformation stage is where data is cleaned and
organized.
• Data from multiple source systems is normalized and
converted to a single format, improving data quality and
compliance
• The data is transformed in accordance with the business
rules and standards that have been established
• Examples include: format changes, de-duplication, splitting
up fields, replacement of codes, cleaning, joining, sorting,
derived values, and aggregates
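Several of the transformations listed above can be illustrated in one short Python sketch. The input rows, the field names, the gender code table, and the date formats are assumptions chosen for the example; real business rules would replace them.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw rows as extracted from a source system.
raw_rows = [
    {"name": "Asha Mushi", "gender": "F", "date": "03/01/2024", "amount": "100.00"},
    {"name": "Asha Mushi", "gender": "F", "date": "03/01/2024", "amount": "100.00"},
    {"name": "Baraka Juma", "gender": "M", "date": "04/01/2024", "amount": "40.50"},
    {"name": "Asha Mushi", "gender": "F", "date": "04/01/2024", "amount": "59.50"},
]
GENDER_CODES = {"F": "Female", "M": "Male"}  # replacement of codes


def transform(rows):
    """Apply de-duplication, field splitting, code and format changes, sorting."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # de-duplication of identical rows
            continue
        seen.add(key)
        first, last = row["name"].split(" ", 1)  # splitting up a field
        result.append({
            "first_name": first,
            "last_name": last,
            "gender": GENDER_CODES[row["gender"]],
            # format change: DD/MM/YYYY to ISO 8601
            "date": datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat(),
            "amount": float(row["amount"]),
        })
    return sorted(result, key=lambda r: (r["date"], r["last_name"]))  # sorting


def total_by_customer(rows):
    """Aggregate: total amount per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["first_name"], row["last_name"])] += row["amount"]
    return dict(totals)


clean_rows = transform(raw_rows)
totals = total_by_customer(clean_rows)
print(clean_rows)
print(totals)
```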
Data Loading
• Transformation functions end as soon as load images are
created. You create load images to correspond to the target
files to be loaded in the data warehouse database.
• Data is physically moved to the data warehouse
• There are several ways of loading data:
• Initial load—populating all the data warehouse tables for the very
first time.
• Incremental load—applying ongoing changes as necessary in a
periodic manner.
• Full refresh—completely erasing the contents of one or more
tables and reloading with fresh data (initial load is a refresh of all
the tables).
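The three loading modes can be compared side by side in a small sketch. It assumes an in-memory SQLite database as a stand-in for the warehouse and a hypothetical `dim_product` table; SQLite's `INSERT OR REPLACE` is used here as one simple way to apply changes, and other databases offer their own upsert or merge statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")


def initial_load(conn, rows):
    """Initial load: populate the empty table for the first time."""
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", rows)
    conn.commit()


def incremental_load(conn, changes):
    """Incremental load: insert new rows and overwrite changed ones."""
    conn.executemany("INSERT OR REPLACE INTO dim_product VALUES (?, ?, ?)", changes)
    conn.commit()


def full_refresh(conn, rows):
    """Full refresh: erase the table contents and reload from scratch."""
    conn.execute("DELETE FROM dim_product")
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", rows)
    conn.commit()


def snapshot(conn):
    """Return the current table contents, ordered by id."""
    return conn.execute("SELECT id, name, price FROM dim_product ORDER BY id").fetchall()


initial_load(conn, [(1, "Pen", 0.5), (2, "Notebook", 2.0)])
after_initial = snapshot(conn)

incremental_load(conn, [(2, "Notebook", 2.5), (3, "Ruler", 1.0)])
after_incremental = snapshot(conn)

full_refresh(conn, [(1, "Pen", 0.6)])
after_refresh = snapshot(conn)
print(after_initial, after_incremental, after_refresh)
```

The incremental load touches only the rows it is given, while the full refresh discards everything that is not in the new data set, so the choice between them is a trade-off between load time and the risk of stale rows.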
ETL Tool - General Selection Criteria
➢ Business vision and considerations
➢ Overall IT strategy and architecture
➢ Overall cost of ownership
➢ Vendor positioning in the market
➢ Performance
➢ In-house expertise available
➢ User-friendliness
➢ Training requirements for existing users
➢ References from other customers