Unit 3
1. Scalability: The architecture should accommodate the growth of data over time.
It should be able to handle increasing data volumes without significant
performance degradation.
2. Flexibility & Adaptability: The architecture should be adaptable to new
technologies, tools, or business needs, enabling the addition of new data sources
or capabilities.
3. Data Quality: The architecture should ensure the accuracy, consistency, and
reliability of data across the organization.
4. Data Security: Security measures should be integrated into the architecture to
protect sensitive data and comply with data privacy regulations (e.g., GDPR,
CCPA).
5. Data Accessibility: Data should be easy to access for authorized users or
applications, promoting collaboration and insight generation.
6. Efficiency: The architecture should be designed to optimize performance,
reducing unnecessary processing and storage costs.
7. Governance and Compliance: The architecture should ensure that data is
managed in accordance with regulatory requirements and internal governance
policies.
8. Automation: The data pipeline should automate data processing, storage,
and integration to reduce human intervention and errors.
Major Architecture Concepts
Cloud vs On-Premise Architecture:
● Cloud Architecture: Data and applications are hosted on cloud providers like
AWS, Azure, or Google Cloud, offering scalability and flexibility.
● On-Premise Architecture: The infrastructure is managed internally within an
organization's data center.
ETL vs ELT:
● ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are
two common patterns for moving data between systems: ETL transforms data
before loading it into the target, while ELT loads raw data first and
transforms it later inside the target system (e.g., a cloud data warehouse).
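The difference between the two patterns can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: the rows, the `transform` function, and the in-memory "warehouse" are hypothetical stand-ins for actual source systems and storage.

```python
# Minimal sketch contrasting ETL and ELT on an in-memory "warehouse".
# All data and storage here are hypothetical stand-ins for real systems.

raw_rows = [
    {"name": " Alice ", "amount": "100"},
    {"name": "Bob", "amount": "250"},
]

def transform(row):
    """Clean a raw row: trim strings, cast numeric fields."""
    return {"name": row["name"].strip(), "amount": int(row["amount"])}

def etl(rows):
    """ETL: transform each row *before* loading it into the target."""
    warehouse = []
    for row in rows:             # Extract
        clean = transform(row)   # Transform
        warehouse.append(clean)  # Load
    return warehouse

def elt(rows):
    """ELT: load raw data first, transform later inside the target."""
    lake = list(rows)                    # Extract + Load raw
    return [transform(r) for r in lake]  # Transform on demand

# Both paths end with the same cleaned data; they differ in *where*
# and *when* the transformation happens.
assert etl(raw_rows) == elt(raw_rows)
```

In practice the choice matters because ELT defers transformation to the (often highly scalable) target system, while ETL keeps the target clean but requires transform capacity up front.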
Data Generation in Source Systems
Data generation in source systems refers to the ways in which raw data is
created or collected. Different types of data sources exist, and understanding
them is critical to building a robust data architecture.
Sources of Data:
○ Logs are records of events that occur in a system, often used for
monitoring and debugging. Logs can be unstructured (e.g., server logs)
and are often ingested into data lakes for processing and analysis.
Example: Web server logs that track user interactions and system events.
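The web server log example above can be sketched as a small parser that turns an unstructured log line into a structured record. The nginx/Apache "combined" log format is an assumption here; real log formats vary by server and configuration.

```python
import re

# Hedged sketch: parse one line of a web server access log
# (nginx/Apache "combined" style is assumed; real formats vary).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Turn an unstructured log line into a structured record, or None."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

# Illustrative log line (IP and path are made up).
line = ('203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326')
record = parse_log_line(line)
# record now holds ip, time, method, path, status, and size fields,
# ready to be loaded into a data lake as structured data.
```

Parsing at ingestion time like this is what lets unstructured logs be queried downstream for monitoring and analysis.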
7. Database Logs: records of the changes (inserts, updates, and deletes)
made to a database, commonly used for replication and for change data
capture (CDC).
Source System Practical Details
1. Data Source Design: It is crucial to design source systems so that the data
is clean, well-structured, and easily accessible for extraction and integration
into the broader data architecture.
2. Data Collection and Sampling: Depending on the use case, data collection
methods (e.g., streaming, batch) should be carefully selected. For example,
real-time systems (e.g., IoT data streams) may require event-based data
collection.
3. Data Consistency and Integrity: Ensuring the consistency and integrity of
data at the source is essential, especially in OLTP systems where transactions
must adhere to ACID properties (Atomicity, Consistency, Isolation, Durability).
4. Data Volume and Velocity: Depending on the system, the data might be
generated in massive volumes (big data systems) or at high velocity (e.g.,
real-time systems). The architecture should be capable of handling both
types effectively.
5. Data Storage and Accessibility: Data generated from these sources must
be stored in a way that is scalable and accessible for downstream processing,
typically leveraging both cloud storage (for large volumes) and databases (for
transactional and analytical needs).
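The batch vs. event-based collection choice from point 2 above can be sketched as follows. The sensor readings and handler are hypothetical; the point is that both modes consume the same source, and only *when* processing happens differs.

```python
# Sketch: batch vs event-based (streaming) collection from a
# hypothetical sensor source. Readings and batch size are illustrative.

readings = [21.5, 21.7, 22.0, 22.4, 21.9, 22.1]

def collect_batch(source, batch_size):
    """Accumulate readings and hand them off in fixed-size batches."""
    batch = []
    for value in source:
        batch.append(value)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any final partial batch
        yield batch

def collect_streaming(source, handler):
    """Invoke a handler once per event, as an IoT stream would."""
    for value in source:
        handler(value)

batches = list(collect_batch(readings, 3))
seen = []
collect_streaming(readings, seen.append)
```

Batch collection amortizes overhead across many records; event-based collection minimizes latency for each individual reading, which is why real-time systems tend to require it.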
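The ACID requirement from point 3 can be illustrated with a classic funds-transfer transaction. SQLite is used here only because it ships with Python; the table and amounts are made up for the sketch.

```python
import sqlite3

# Sketch of ACID atomicity in an OLTP-style transfer using SQLite.
# Table name, accounts, and amounts are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: either both updates apply, or neither."""
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            cur = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
    except ValueError:
        pass  # transaction rolled back; balances are unchanged

transfer(conn, "alice", "bob", 30)   # succeeds: both updates commit
transfer(conn, "alice", "bob", 500)  # fails: both updates roll back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer leaves both balances untouched: the debit that had already been applied inside the transaction is rolled back along with everything else, which is exactly the atomicity guarantee OLTP source systems rely on.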
Conclusion
Understanding Enterprise Architecture and Data Architecture is critical for
building a system that supports business objectives and ensures the effective use
of data across the organization. Data generation from various sources like APIs,
OLTP systems, and logs demands a careful design of the infrastructure to
efficiently handle, process, and secure data. By integrating good data
management practices, architectures, and tools like CDC and OLAP,
organizations can build scalable, flexible, and secure data systems that drive
business insights and decision-making.