
Data Integration Approaches

The document discusses various data integration methods including Federated Databases, Memory-Mapped Data Structures, and Data Warehousing, each with distinct advantages, challenges, and use cases. It emphasizes the importance of Data Quality and Data Integrity in maintaining accurate and reliable data within databases. Examples illustrate how these methods are applied in real-world scenarios like healthcare, stock trading, and retail analysis.

Uploaded by

kruthika c g

1. Federated Databases

A federated database system (FDBS) is a type of virtual database that integrates multiple independent databases while allowing them to maintain their autonomy.

How It Works:

 A middleware layer connects different databases without physically moving or replicating the data.
 Queries are processed by translating them into the format of the underlying databases.
 The system returns a unified result to the user without requiring a centralized storage system.

Advantages:

 Real-time data access from multiple sources.
 No need for duplicating or transferring large amounts of data.
 Maintains autonomy of individual databases.

Challenges:

 Performance issues due to query translation overhead.
 Complex implementation, as different databases may have different formats and query languages.

Use Cases:

 Organizations with multiple database systems that need unified access (e.g., banks, hospitals).
 Real-time reporting from distributed databases.

A virtual database is a system that provides a unified view of multiple databases without physically copying or moving the data. Instead of storing data in one place, it dynamically retrieves and integrates data from different sources in real time.

How a Virtual Database Works:

1. When a user queries data, the virtual database does not store the data itself.
2. It translates the query into multiple sub-queries that are sent to the original databases.
3. The results from different databases are combined and presented as a single, unified dataset.

Key Features of a Virtual Database

✅ No data duplication – Data remains in its original source.
✅ Real-time access – Fetches the latest data on demand.
✅ Flexible integration – Can connect different databases (SQL, NoSQL, Cloud, etc.).
✅ Cost-effective – No need for large storage infrastructure.
Example of a Virtual Database

Use Case: Banking System

A bank has separate databases for:

 Customer information (MySQL)
 Loan records (PostgreSQL)
 Credit card transactions (Oracle DB)

Instead of merging all data into a single database, the bank sets up a virtual database
that integrates these three sources. When a bank employee searches for a customer’s
financial profile, the virtual database retrieves and combines real-time data from all
three sources without duplicating it.
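The fan-out-and-merge flow above can be sketched in a few lines of Python. The three dicts below are illustrative stand-ins for the MySQL, PostgreSQL, and Oracle sources (all names and values are invented); a real federated system would route translated sub-queries through middleware rather than reading local structures.

```python
# Minimal sketch of the fan-out-and-merge idea behind a virtual database.
# The dicts stand in for the bank's three independent databases; no data
# is copied into a central store — the unified view is built on demand.

customers = {"C1": {"name": "Asha Rao", "branch": "Mumbai"}}   # "MySQL"
loans     = {"C1": [{"loan_id": "L9", "balance": 250000}]}     # "PostgreSQL"
cards     = {"C1": [{"card_id": "X2", "last_txn": 1999.50}]}   # "Oracle DB"

def financial_profile(customer_id):
    """Fan one query out as sub-queries to each source, then merge."""
    return {
        "customer": customers.get(customer_id),   # sub-query 1
        "loans": loans.get(customer_id, []),      # sub-query 2
        "cards": cards.get(customer_id, []),      # sub-query 3
    }

profile = financial_profile("C1")
print(profile["customer"]["name"])  # unified view, fetched live from all sources
```

The key property is that `financial_profile` holds no data of its own; each call reaches back to the original sources, mirroring how a virtual database returns current data without duplication.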

2. Memory-Mapped Data Structures

Memory-mapped data integration is an in-memory approach where data is directly accessed in RAM instead of traditional disk-based storage.

How It Works:

 Data from different sources is mapped into memory so that applications can access it instantly.
 It uses shared memory techniques to integrate data without duplicating it.
 Typically used in real-time analytics and high-speed transaction processing.
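As a rough illustration of the mechanism (the section does not prescribe a tool), Python's standard `mmap` module maps a file into the process's address space, so byte ranges can be read and updated through memory rather than through explicit disk reads; the file name and contents below are invented:

```python
# Sketch of memory-mapped access: the OS maps the file into RAM-backed
# pages, so slicing the map reads/writes memory, not a read() call.
import mmap
import os
import tempfile

# Create a small data file to map (illustrative content).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"sensor:42;sensor:43;")

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:   # map the whole file
        first = mm[0:9]                    # byte-range read through memory
        mm[7:9] = b"99"                    # update in place through memory
        mm.flush()                         # persist dirty pages (crash safety)

with open(path, "rb") as f:
    final = f.read()
os.remove(path)
print(first, final)  # b'sensor:42' b'sensor:99;sensor:43;'
```

The `flush()` call addresses the volatility caveat below: without it, an update that lives only in mapped pages can be lost on a crash.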

Advantages:

 Extremely fast, as it eliminates disk I/O operations.
 Reduces storage redundancy.
 Ideal for big data and real-time applications.

Challenges:

 Requires large RAM capacity, which can be expensive.
 Data is volatile—if the system crashes, data may be lost unless persisted.

Use Cases:

 Financial trading systems that require real-time price updates.
 IoT applications where real-time data processing is crucial.
 In-memory databases like Redis and Apache Ignite.

3. Data Warehousing

A data warehouse is a centralized repository that integrates structured data from multiple sources for analysis and reporting.

How It Works:

 Data is extracted from various operational databases (the ETL process).
 The extracted data is cleaned, transformed, and stored in a structured format.
 Users can query, analyze, and generate reports from the warehouse.
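The three steps above can be sketched as a toy ETL run. Here `sqlite3` stands in for the warehouse purely for illustration (real warehouses would be systems like Redshift or BigQuery), and the source rows are invented:

```python
# Toy ETL: extract from operational sources, transform, load, then query.
import sqlite3

# Extract: rows from two hypothetical operational systems.
pos_rows = [("2024-01-05", "store", 120.0), ("2024-01-06", "store", None)]
web_rows = [("2024-01-05", "web", 80.0)]

# Transform: combine sources and drop incomplete records.
clean = [r for r in pos_rows + web_rows if r[2] is not None]

# Load: into a structured warehouse table.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE sales (day TEXT, channel TEXT, amount REAL)")
wh.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)

# Query: a typical reporting aggregate over the loaded data.
total = wh.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 200.0
```

Note the batch character of the flow: the warehouse answers queries from its own loaded copy, not live from the operational systems, which is why the challenges below flag it as unsuitable for real-time processing.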

Advantages:

 Stores historical data, enabling trend analysis.
 Optimized for business intelligence (BI) and decision-making.
 Improves data consistency by standardizing formats.

Challenges:

 High initial setup costs and maintenance.
 Not suitable for real-time data processing (batch-oriented).
 Requires ETL processes, which can be complex.

Use Cases:

 Business Intelligence & Reporting (e.g., sales performance tracking).
 Customer analytics (e.g., analyzing shopping behavior in e-commerce).
 Enterprise data consolidation (e.g., integrating data from different departments).

Comparison of Methods

Feature            Federated Databases              Memory-Mapped Data Structures   Data Warehousing
Data Storage       Distributed                      In-memory                       Centralized
Speed              Medium                           Very High                       Medium (Batch)
Real-Time Access   Yes                              Yes                             No
Scalability        High                             Limited by RAM                  High
Complexity         High                             Medium                          High
Use Case           Unified access to multiple DBs   Real-time analytics             Business Intelligence

Which One to Choose?

 Use federated databases if you need real-time access to multiple databases without moving data.
 Use memory-mapped data structures if you need high-speed data access for real-time processing.
 Use data warehousing if you need historical data analysis and business intelligence.


Here are examples of how each data integration method is used in real-world scenarios:

1. Federated Databases Example – Healthcare System Integration

Scenario:

A hospital group has multiple branches, each using its own database for patient records, lab results, and billing. The hospital wants a unified view of patient data without moving or duplicating data.

How Federated Databases Work:

 The hospital implements a federated database system that connects all the branch databases.
 When a doctor queries a patient’s history, the system fetches real-time data from multiple databases.
 The doctor sees a combined view of the patient’s medical history, lab results, and billing details.

Example Tools: IBM InfoSphere Federation Server, Microsoft SQL Server PolyBase

Benefits:

✅ Real-time patient data access from different locations
✅ No need to duplicate sensitive patient data
✅ Each branch maintains autonomy over its database

2. Memory-Mapped Data Structures Example – Stock Market Trading

Scenario:
A stock exchange system needs to process millions of stock price updates per second to provide real-time insights to traders.

How Memory-Mapped Data Works:

 Stock price data is stored in memory-mapped files instead of traditional databases.
 Traders' software can instantly read and analyze stock prices without disk access.
 The system updates stock prices in real time, allowing traders to make quick decisions.
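A minimal sketch of this layout, assuming Python's `mmap` with one fixed-width 8-byte float slot per ticker (the tickers, slot scheme, and prices are invented; production systems would use engines such as Redis or Apache Ignite):

```python
# Prices live in a memory-mapped file of fixed-width float slots: a writer
# overwrites slot i in place, and readers see the new value with no disk
# read() on the hot path.
import mmap
import os
import struct
import tempfile

SLOT = struct.calcsize("d")            # one 8-byte float per ticker
tickers = {"AAPL": 0, "MSFT": 1}       # ticker -> slot index (illustrative)

fd, path = tempfile.mkstemp()
os.ftruncate(fd, SLOT * len(tickers))  # pre-size the shared file
mm = mmap.mmap(fd, SLOT * len(tickers))

def publish(ticker, price):
    """Writer side: overwrite the ticker's slot through memory."""
    i = tickers[ticker] * SLOT
    mm[i:i + SLOT] = struct.pack("d", price)

def read_price(ticker):
    """Reader side: pure memory access to the same mapped pages."""
    i = tickers[ticker] * SLOT
    return struct.unpack("d", mm[i:i + SLOT])[0]

publish("AAPL", 189.25)
publish("MSFT", 411.10)
aapl, msft = read_price("AAPL"), read_price("MSFT")
print(aapl, msft)

mm.close(); os.close(fd); os.remove(path)
```

Because the file can be mapped by several processes at once, the same scheme lets one feed-handler process publish while many trader processes read, which is the shared-memory integration idea described above.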

Example Tools: Apache Ignite, Redis, Oracle TimesTen

Benefits:

✅ Ultra-fast stock price updates with zero disk I/O
✅ Low latency for real-time trading decisions
✅ Efficient memory usage for large-scale trading

3. Data Warehousing Example – Retail Sales Analysis

Scenario:

A large retail company (e.g., Walmart, Amazon) wants to analyze customer purchasing trends over the past 5 years to improve marketing and inventory planning.

How Data Warehousing Works:

 Sales data is collected from all retail stores and online platforms.
 The data is extracted, cleaned, and stored in a data warehouse (e.g., Amazon Redshift, Google BigQuery).
 Analysts run queries and reports to identify trends, such as seasonal buying patterns or popular products.
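The kind of trend query analysts would run can be sketched with `sqlite3` standing in for Redshift or BigQuery (dates and amounts are invented for illustration):

```python
# Monthly sales totals over stored history — the shape of a typical
# warehouse trend query (GROUP BY a derived time bucket).
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE sales (day TEXT, amount REAL)")
wh.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2023-12-20", 500.0), ("2023-12-24", 700.0), ("2024-03-02", 150.0),
])

rows = wh.execute(
    "SELECT strftime('%Y-%m', day) AS month, SUM(amount) "
    "FROM sales GROUP BY month ORDER BY month"
).fetchall()
print(rows)  # [('2023-12', 1200.0), ('2024-03', 150.0)]
```

A query like this over five years of rows is exactly the historical, batch-friendly workload a warehouse is optimized for.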

Example Tools: Snowflake, Amazon Redshift, Google BigQuery

Benefits:

✅ Enables historical trend analysis for better decision-making
✅ Supports business intelligence (BI) dashboards
✅ Data is structured and optimized for fast analytics

Comparison Table with Examples

Feature         Federated Databases (Hospital)           Memory-Mapped (Stock Trading)        Data Warehousing (Retail)
Data Location   Distributed (multiple hospitals)         In-memory (real-time stock prices)   Centralized (retail sales data)
Speed           Medium (real-time queries)               Very High (microseconds)             Medium (batch processing)
Use Case        Patient record access across hospitals   High-speed stock trading             Analyzing sales trends
Example Tools   IBM InfoSphere, SQL PolyBase             Redis, Apache Ignite                 Snowflake, Amazon Redshift

Which One Should You Use?

 If you need real-time access to multiple databases → Use Federated Databases
 If you need extremely fast real-time processing → Use Memory-Mapped Data
 If you need historical data analysis for reporting → Use Data Warehousing

What is Data Quality?

Data Quality refers to how accurate, complete, and reliable data is. Good data quality means the data is correct, up-to-date, and useful for making decisions.

Why is Data Quality Important?

Imagine you are filling out a form to apply for a job. If you enter the wrong phone number or leave your email address blank, the company cannot contact you—this is an example of bad data quality.

What is Data Integrity?

Data Integrity ensures that the data in a database is correct, consistent, and does not get changed accidentally. It involves rules that protect data from mistakes or unauthorized changes.

How Does Data Integrity Work?

In a database, data integrity is maintained using rules and constraints. Let’s understand some key concepts with an example:

Example: Employee Database

Imagine we have an Employee Table in a company's database. The table has the following columns:

EmpNo (Primary Key)   DeptNo (Foreign Key)   EmpName (Not Null)   Salary (Check Constraint)
101                   1                      John Doe             50000
102                   2                      Alice Smith          60000
103                   3                      Bob Williams         55000
Data Integrity Rules Explained with the Example:

Primary Key (PK) – Unique & Not Null

EmpNo (Employee Number) is the Primary Key, so each employee must have a unique EmpNo.
Example: There cannot be two employees with EmpNo 101.
It also cannot be empty (NULL), meaning every employee must have an EmpNo.

Foreign Key (FK) – Linking Tables

DeptNo (Department Number) is a Foreign Key, meaning it must refer to an existing department in the Department Table.
Example: If DeptNo 5 does not exist in the Department Table, you cannot assign an employee to DeptNo 5.

Not Null Constraint – No Empty Values

EmpName has a Not Null constraint, meaning every employee must have a name.
Example: If we try to insert a new employee without an EmpName, the database will reject the entry.

Check Constraint – Business Rules

Salary has a Check Constraint to ensure employees earn at least $30,000.
Example: If someone tries to enter a salary of $20,000, the database won’t allow it because it violates the rule.
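All four rules can be seen in action with a short script. `sqlite3` is used here purely as an illustrative stand-in for any relational database (note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`):

```python
# The four integrity rules from the Employee example, enforced by the
# database itself: each bad row below is rejected with IntegrityError.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")            # SQLite needs this for FKs
db.execute("CREATE TABLE Department (DeptNo INTEGER PRIMARY KEY)")
db.execute("""CREATE TABLE Employee (
    EmpNo   INTEGER PRIMARY KEY,                  -- unique & not null
    DeptNo  INTEGER NOT NULL REFERENCES Department(DeptNo),
    EmpName TEXT NOT NULL,                        -- no empty names
    Salary  INTEGER CHECK (Salary >= 30000)       -- business rule
)""")
db.executemany("INSERT INTO Department VALUES (?)", [(1,), (2,), (3,)])
db.execute("INSERT INTO Employee VALUES (101, 1, 'John Doe', 50000)")

for bad_row in [
    (101, 2, "Alice Smith", 60000),   # duplicate EmpNo  -> PK violation
    (104, 5, "New Hire", 40000),      # DeptNo 5 missing -> FK violation
    (105, 1, None, 45000),            # no name          -> NOT NULL violation
    (106, 1, "Low Pay", 20000),       # below $30,000    -> CHECK violation
]:
    try:
        db.execute("INSERT INTO Employee VALUES (?, ?, ?, ?)", bad_row)
    except sqlite3.IntegrityError as e:
        print("rejected:", e)
```

After the loop, the table still holds only the one valid row: the constraints, not application code, are what kept the bad data out.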
Why is Data Integrity Important?
✅ Prevents duplicate records (e.g., two employees with the same EmpNo).
✅ Ensures accuracy (e.g., an employee cannot be assigned to a non-existent department).
✅ Protects data consistency (e.g., salaries must follow company rules).
✅ Helps in reliable decision-making (e.g., HR can trust that all employees have valid salaries).

Conclusion
Data Quality means the data is correct and useful.
Data Integrity means the data follows rules to keep it accurate and consistent.
Databases use Primary Keys, Foreign Keys, Not Null, and Check Constraints to protect data integrity.

What is Data Quality?

Data Quality means that data is accurate, complete, consistent, and useful for making decisions.

Why is Data Quality Important?

🔹 If data is incorrect, wrong decisions can be made.
🔹 Bad data can cause errors in reports.
🔹 Good data saves time and money in businesses and organizations.

📌 Example:

 If a school records a student’s birthdate incorrectly, the student might not be placed in the correct grade.

What is Data Integrity?

Data Integrity ensures that data is correct, reliable, and follows rules to avoid mistakes or loss of information.

Example of Data Integrity in a Database

Imagine a company has an Employee Table storing employee details. It follows data integrity rules to keep the data correct.

EmpNo (Primary Key)   DeptNo (Foreign Key)   EmpName (Not Null)   Age (Check Constraint)
1001                  D01                    John Mathews         25
1010                  D02                    Elan Hayden          27
1011                  D01                    Jennifer Augustine   23

🔹 Each column follows a rule to maintain integrity.

Types of Data Integrity Rules

1️⃣ Primary Key (PK) – Uniqueness Rule

 Each employee must have a unique EmpNo (Employee Number).
 Example: No two employees can have EmpNo = 1001.

2️⃣ Foreign Key (FK) – Relationship Rule

 DeptNo (Department Number) must exist in the Department Table.
 Example: If the Department Table has "D01" (Finance), an employee can be assigned to it.
 If DeptNo D05 doesn’t exist, we cannot assign an employee to it.

DeptNo (From Department Table)   Dept Name
D01                              Finance
D02                              Purchase
D03                              Human Resources
D04                              Sales

3️⃣ Not Null – No Empty Values Rule

 EmpName must always be filled.
 Example: If we try to insert an employee without a name, the database will reject it.

4️⃣ Check Constraint – Business Rule

 Age should be between 18 and 65.
 Example: If someone tries to enter Age 17, the database will reject it.

Why is Data Integrity Important?

✅ Prevents duplicate records
✅ Ensures data accuracy and reliability
✅ Protects business operations from errors
✅ Keeps relationships between tables correct

Final Summary

📌 Data Quality means data is accurate and useful.
📌 Data Integrity ensures data follows rules to stay correct.
📌 Primary Key, Foreign Key, Not Null, and Check Constraints help maintain data integrity.
