06 Data Integration Presentation

Data integration is the process of merging data from various sources to create a unified view, addressing issues like redundancy and inconsistency. It employs methods such as ETL and various integration techniques, including manual, middleware, and application-based approaches. The importance of data integration lies in its ability to support large datasets, enhance business intelligence, and facilitate real-time information delivery across industries, including healthcare.


Data Integration

 The process of merging data from several disparate sources
 Addressing challenges like redundancy, inconsistency, and duplication
 Creating a unified view of the data

Data Integration Methods

 Formally characterized as a triple (G, S, M), sketched below
 G: Global schema
 S: Heterogeneous set of source schemas
 M: Mapping between queries over the global schema and queries over the source schemas

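As a concrete illustration, the triple can be written out in Python; the schemas, table names, and the global-as-view style of the mapping are all invented for this example:

# A toy (G, S, M) triple. M rewrites each global-schema relation
# into a query over the heterogeneous source schemas (global-as-view).
G = {"customer": ["id", "name", "region"]}            # global schema
S = {
    "crm.clients": ["cid", "full_name"],              # source schema 1
    "sales.accounts": ["acct_id", "territory"],       # source schema 2
}
M = {
    "customer": """
        SELECT c.cid AS id, c.full_name AS name, a.territory AS region
        FROM crm.clients c JOIN sales.accounts a ON a.acct_id = c.cid
    """
}
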
Why is Data Integration Important?

 Market and consumer data collection
 Supports querying across vast datasets
 Enhances business intelligence
 Facilitates real-time information delivery
 Enables enterprise reporting, predictive analytics, and business intelligence
 Healthcare industry applications

Data Integration Approaches

 Tight Coupling: Uses ETL (Extraction, Transformation, and Loading) to copy data into a central store
 Loose Coupling: Data stays in the source databases; queries are transformed and sent to them at access time

Merging Data from Diverse Sources

 Combining data from multiple systems (databases, files, APIs), as sketched below

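A minimal pandas sketch of such a merge, assuming hypothetical files orders.csv (a database export) and customers.json (an API dump) that share a customer_id key:

import pandas as pd

# Hypothetical sources: a CSV export and a JSON dump from an API.
orders = pd.read_csv("orders.csv")
customers = pd.read_json("customers.json")

# Join on the shared key to produce one unified view of both systems.
unified = orders.merge(customers, on="customer_id", how="left")
print(unified.head())
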
Data Integration Techniques

 Several data integration techniques are used in data mining; the following slides cover each

Manual Integration

 Avoids automation
 A data analyst collects, cleans, and integrates the data by hand
 Suitable for small datasets
 Time-consuming for large datasets

Middleware Integration

 Uses middleware software to normalize and store data
 Used for legacy-to-modern system integration
 Acts as a translator between systems

Application-based Integration

 Software extracts, transforms, and loads data
 Saves time but requires technical skills

Uniform Access Integration

 Data remains in its original locations
 Generates a unified view (see the sketch below)
 No need to store integrated data separately

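A minimal sketch of the idea, with the source systems stubbed out as in-memory functions; in practice these would be live connections to databases or APIs:

# Data stays in each source; the unified view pulls rows on demand
# instead of materializing an integrated copy.
SOURCES = [
    lambda: [{"id": 1, "name": "Ada"}],      # stand-in for a legacy database
    lambda: [{"id": 2, "name": "Grace"}],    # stand-in for a REST API
]

def unified_view():
    for fetch in SOURCES:                    # query each source at access time
        yield from fetch()                   # nothing stored separately

print(list(unified_view()))
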
Data Integration Tools

 On-premise tools: connect legacy databases
 Open-source tools: cost-effective but require security management
 Cloud-based tools: provide integration as a service

Data Transformation

 Converts raw data into a suitable format
 Includes cleaning, reduction, wrangling, and migration
 Improves business processes and decision-making

Types of Data Transformation

 Constructive: Adds, copies, or replicates data
 Destructive: Deletes fields or records
 Aesthetic: Standardizes data to meet requirements
 Structural: Reorganizes data by renaming, moving, or combining columns (see the sketch after this list)

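A short pandas sketch of the structural (and destructive) kinds, using made-up column names:

import pandas as pd

df = pd.DataFrame({"fname": ["Ada"], "lname": ["Lovelace"], "dob": ["1815-12-10"]})

df = df.rename(columns={"dob": "birth_date"})      # structural: rename a column
df["full_name"] = df["fname"] + " " + df["lname"]  # structural: combine columns
df = df.drop(columns=["fname", "lname"])           # destructive: delete fields
print(df)
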
Data Normalization

 Scales data values to a smaller range
 Techniques include Min-Max normalization, Z-Score normalization, and Decimal Scaling

Min-Max Normalization

 Linear transformation, illustrated in the sketch below
 Formula: V'_i = (V_i - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

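A small runnable version of the formula; the income range [12,000, 98,000] and the value 73,600 are illustrative numbers only:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly rescale v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# 73,600 in [12,000, 98,000] maps to about 0.716 in [0, 1].
print(min_max(73600, 12000, 98000))
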
Z-Score Normalization

 Normalizes using mean and standard deviation, as sketched below
 Formula: Z = (V_i - mean) / standard_deviation

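The same formula as a runnable sketch, using Python's statistics module on made-up values:

import statistics

def z_scores(values):
    """Normalize each value by the sample mean and standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

print(z_scores([54000, 16000, 73600, 98000]))
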
Data Discretization

 Converts continuous data into intervals (see the sketch below)
 Improves efficiency for data mining
 Uses decision-tree-based algorithms

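The decision-tree approach mentioned above is one option; a simpler equal-width binning sketch with pandas shows the basic idea of turning continuous values into intervals:

import pandas as pd

ages = pd.Series([3, 17, 25, 42, 68, 81])

# Equal-width binning: each continuous age falls into one of three intervals.
labels = pd.cut(ages, bins=3, labels=["young", "middle-aged", "senior"])
print(labels.tolist())
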
Data Generalization

 Converts low-level data to high-level attributes, as sketched below
 Approaches: OLAP and Attribute-Oriented Induction (AOI)

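A miniature attribute-oriented induction sketch: a hand-written concept hierarchy (invented here) lifts low-level city values to a higher-level region attribute:

# Climb one level of a concept hierarchy: city -> region.
HIERARCHY = {"Mumbai": "Asia", "Delhi": "Asia", "Paris": "Europe"}

records = [{"city": "Mumbai"}, {"city": "Paris"}]
generalized = [{"region": HIERARCHY[r["city"]]} for r in records]
print(generalized)  # [{'region': 'Asia'}, {'region': 'Europe'}]
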
Data Transformation Process (ETL)

 ETL: Extract, Transform, Load
 Steps: Data Discovery, Mapping, Transformation, Extraction, Code Execution, Review, Sending (a minimal sketch follows)

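A minimal end-to-end ETL sketch with Python's standard library; the file, table, and column names are all assumptions for the example:

import csv
import sqlite3

def etl(src_csv="sales.csv", dest_db="warehouse.db"):
    with open(src_csv, newline="") as f:            # Extract from the source file
        rows = list(csv.DictReader(f))
    cleaned = [(r["id"], float(r["amount"] or 0))   # Transform: tidy the values
               for r in rows]
    con = sqlite3.connect(dest_db)                  # Load into the destination
    con.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    con.commit()
    con.close()
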
Exploratory Data Analysis (EDA)

 Used by data scientists to analyze datasets
 Summarizes key characteristics
 Often involves visualization techniques, as in the sketch below

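A typical first pass in pandas (dataset name assumed; the histogram call needs matplotlib installed):

import pandas as pd

df = pd.read_csv("sales.csv")   # hypothetical dataset

print(df.describe())            # summary statistics for numeric columns
print(df.isna().sum())          # missing values per column
df.hist(figsize=(8, 6))         # quick look at each distribution
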
