Module 2: Data Engineering
Module 2
Data Cleaning Techniques: Batch Processing vs. Stream Processing
Data Integration: Combining Data from Multiple Sources. Data
Integration Tools. Data Quality and Consistency
Data Pipelines: Designing and Implementing Data Pipelines. Workflow
Orchestration Tools (e.g., Apache Airflow). Monitoring and Maintaining
Data Pipelines
Data Security and Privacy: Ensuring Data Security. Data Encryption
Techniques. Compliance with Data Privacy Regulations
Introduction to Data Cleaning
• Data cleaning is the process of detecting and correcting errors or inconsistencies in data.
• It ensures high-quality data for analysis, reporting, and machine learning.
Batch processing involves collecting, processing, and cleaning data in large chunks at scheduled
intervals (e.g., hourly, daily, or weekly). It is suitable for tasks that do not require immediate responses
and can tolerate delays.
Key Characteristics of Batch Processing
✅ Processes large datasets at once.
✅ Suitable for historical data analysis and scheduled reports.
✅ High throughput but increased latency.
✅ Requires storage before processing (e.g., in databases, data lakes, or files).
Batch Processing - Common Data Cleaning Techniques
• Handling Missing Values - Imputation (mean, median, mode) or removal.
Mean: Replacing missing values with the average of the column (useful for normally
distributed data).
Median: Using the middle value of the column (useful for skewed distributions).
Mode: Using the most frequently occurring value (best for categorical data).
• Standardization - Converting data into a uniform format.
• Outlier Detection – Identifying and handling values that deviate strongly from the rest of the data (e.g., using Z-scores or the IQR rule); see the pandas sketch below.
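A minimal pandas sketch of the imputation, standardization, and outlier techniques above (the column names and sample data are hypothetical):

```python
import pandas as pd

# Hypothetical batch of transaction records
df = pd.DataFrame({
    "amount": [100.0, None, 250.0, 80.0, 99999.0],
    "category": ["Food ", "food", None, "Travel", "travel"],
})

# Imputation: median for a skewed numeric column, mode for a categorical one
df["amount"] = df["amount"].fillna(df["amount"].median())
df["category"] = df["category"].fillna(df["category"].mode()[0])

# Standardization: uniform casing and whitespace for categorical values
df["category"] = df["category"].str.strip().str.lower()

# Outlier detection: IQR rule drops values far outside the interquartile range
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)
```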
Stream Processing - Overview
• Processes data in real-time as it arrives.
• Low latency, used in fraud detection, IoT processing, and real-time
analytics.
• Handles high-velocity, high-volume data streams.
Stream Processing
Stream processing handles continuous, real-time data as it arrives, making it
ideal for applications where instant insights are required.
Key Characteristics of Stream Processing
✅ Processes data in real-time or near real-time.
✅ Best for event-driven applications (e.g., fraud detection, monitoring, IoT).
✅ Low latency but requires high computational power.
✅ Does not require intermediate storage (data is processed as it flows).
Examples of Stream Processing Systems
• Apache Kafka – A distributed event-streaming platform.
• Apache Flink – A real-time processing engine optimized for low-latency
streaming.
• Spark Streaming – A streaming extension of Apache Spark.
• AWS Kinesis – A real-time data streaming service.
Stream Processing - Data Cleaning Techniques
1. Handling Duplicate Data
Duplicate records occur due to network retries, producer failures, or incorrect
joins.
Techniques:
• Event Deduplication using Sliding Windows → Store recent event IDs and
discard duplicates.
• Bloom Filters / Hash Sets → Maintain a lightweight structure to track processed
records.
• Exactly-Once Processing (EOS) in Kafka Streams → Ensure transactional writes
to prevent duplication.
Exactly-once semantics guarantee no duplicate processing, no data loss, and state consistency across processing nodes (see the deduplication sketch below).
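A minimal sketch of sliding-window deduplication using an in-memory store of recent event IDs (the window length and event shape are assumptions; production systems would typically rely on Kafka Streams EOS or a state store such as RocksDB):

```python
import time
from collections import OrderedDict

class SlidingWindowDeduplicator:
    """Keeps recently seen event IDs and drops duplicates within the window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = OrderedDict()  # event_id -> arrival time, oldest first

    def is_duplicate(self, event_id, now=None):
        now = now or time.time()
        # Evict IDs that have fallen outside the sliding window
        while self.seen and next(iter(self.seen.values())) < now - self.window:
            self.seen.popitem(last=False)
        if event_id in self.seen:
            return True
        self.seen[event_id] = now
        return False

dedup = SlidingWindowDeduplicator(window_seconds=300)
for event in [{"id": "tx-1"}, {"id": "tx-2"}, {"id": "tx-1"}]:
    if not dedup.is_duplicate(event["id"]):
        print("processing", event["id"])  # processes tx-1 and tx-2; the second tx-1 is dropped
```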
2. Handling Missing Values
Missing data in fields like transaction amount, user ID, or timestamps can impact
downstream analysis.
Techniques:
• Default Value Imputation → Replace missing numeric values with mean/median/mode.
• Forward/Backward Fill → Replace a missing value with the previous valid value (forward fill) or the next valid value (backward fill).
• Drop Records with Critical Missing Fields → Discard records if essential fields (e.g.,
transaction ID) are missing.
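A minimal per-record sketch combining the three options above (field names, defaults, and which fields count as critical are assumptions):

```python
last_valid = {}  # remembers the most recent valid value per field (for forward fill)

def clean_record(record, critical=("transaction_id",), defaults={"amount": 0.0}):
    # Drop the record entirely if a critical field is missing
    if any(record.get(field) is None for field in critical):
        return None
    for field, value in record.items():
        if value is None:
            # Forward fill from the last valid value, else fall back to a default
            record[field] = last_valid.get(field, defaults.get(field))
        else:
            last_valid[field] = value
    return record

print(clean_record({"transaction_id": "t1", "amount": 12.5, "user_id": "u1"}))
print(clean_record({"transaction_id": "t2", "amount": None, "user_id": None}))  # missing fields filled
print(clean_record({"transaction_id": None, "amount": 3.0, "user_id": "u2"}))   # dropped -> None
```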
3. Handling Out-of-Order and Late Arriving Data
Timestamps may be incorrect due to network latency or clock synchronization issues.
Techniques:
• Event Time Processing with Watermarks → Allow processing of slightly late events while
discarding extremely late events.
• Sorting within a Sliding Window → Collect events within a time window and reorder them by event time before emitting them downstream.
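A minimal sketch of event-time processing with a watermark plus in-window reordering (the 10-second lateness bound and the timestamps are assumptions):

```python
import heapq

class WatermarkBuffer:
    """Buffers events and emits them in event-time order once the watermark passes them."""

    def __init__(self, max_lateness=10.0):
        self.max_lateness = max_lateness
        self.heap = []            # min-heap ordered by event time
        self.max_event_time = 0.0

    def add(self, event_time, payload):
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.max_lateness
        if event_time < watermark:
            return []             # extremely late event: discard it
        heapq.heappush(self.heap, (event_time, payload))
        # Emit everything whose event time is now safely behind the watermark
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready

buf = WatermarkBuffer(max_lateness=10.0)
for t, e in [(100, "a"), (105, "b"), (95, "slightly-late"), (120, "c")]:
    print(buf.add(t, e))  # events come out sorted by event time once the watermark passes them
```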
4. Handling Inconsistent or Incorrect Data
Data may have incorrect values due to faulty sensors, incorrect inputs, or format mismatches.
Techniques:
• Schema Validation & Data Type Enforcement → Ensure fields match expected data types.
• Threshold-Based Filtering → Remove records with values that exceed predefined
thresholds (e.g., negative transaction amounts).
• Anomaly Detection Models → Use machine learning to identify and correct errors.
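A minimal sketch of schema validation and threshold-based filtering (the schema and thresholds are illustrative assumptions):

```python
# Hypothetical expected schema: field name -> expected Python type
SCHEMA = {"transaction_id": str, "amount": float, "currency": str}

def is_valid(record):
    # Schema validation & type enforcement: every expected field present with the right type
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            return False
    # Threshold-based filtering: reject impossible values such as negative amounts
    if record["amount"] < 0 or record["amount"] > 1_000_000:
        return False
    return True

print(is_valid({"transaction_id": "t1", "amount": 42.0, "currency": "EUR"}))  # True
print(is_valid({"transaction_id": "t2", "amount": -5.0, "currency": "EUR"}))  # False
```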
5. Standardizing and Normalizing Data
Data from different sources may have variations in formats, units, or structures.
Techniques:
• Standardized Formatting → Convert timestamps, currencies, or categorical values into a
common format.
• Normalization & Scaling → Normalize numerical fields (e.g., Min-Max Scaling) for
consistency.
• Text Cleaning → Remove unnecessary whitespace, convert to lowercase, handle special
characters.
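A minimal sketch of min-max scaling and text cleaning (the value bounds are assumed to be known in advance, as is common in streaming):

```python
import re

def min_max_scale(value, lo, hi):
    # Normalization: map a numeric value into [0, 1] given known bounds
    return (value - lo) / (hi - lo)

def clean_text(text):
    # Text cleaning: trim, lowercase, drop special characters, collapse whitespace
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9 ]+", "", text)
    return re.sub(r"\s+", " ", text)

print(min_max_scale(75.0, lo=0.0, hi=100.0))   # 0.75
print(clean_text("  Hello,   WORLD!! "))        # "hello world"
```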
6. Handling Outliers in Streaming Data
Outliers can skew analysis and lead to incorrect insights.
Techniques:
• Z-Score or IQR-Based Filtering → Remove data points outside standard statistical limits.
• Moving Average Smoothing → Reduce sudden spikes in real-time data.
• Machine Learning Models → Identify outliers using ML-based anomaly detection.
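A minimal sketch of a rolling Z-score filter combined with moving-average smoothing (window size and threshold are assumptions):

```python
from collections import deque
from statistics import mean, stdev

class RollingOutlierFilter:
    """Drops values that deviate strongly from the recent window and smooths the rest."""

    def __init__(self, window_size=50, z_threshold=3.0):
        self.window = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def accept(self, value):
        if len(self.window) >= 2:
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                return None  # treat as an outlier and drop it
        self.window.append(value)
        return mean(self.window)  # moving-average smoothed value

f = RollingOutlierFilter(window_size=5)
for v in [10, 11, 9, 10, 500, 12]:
    print(f.accept(v))  # the spike of 500 is rejected (None); other values are smoothed
```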
7. Ensuring Low Latency in Data Cleaning
Optimizing for speed is critical in real-time data processing.
Optimizations:
• Parallel Processing & Partitioning → Increase Kafka partitions and Flink parallelism.
• Stateful Processing with RocksDB → Store state efficiently for deduplication and time
tracking.
• Efficient Serialization → Use Apache Avro or Protocol Buffers instead of JSON for faster
processing.
• Batching Small Events → Process micro-batches to optimize throughput.
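As a small illustration of the micro-batching optimization above, here is a sketch of a batcher that flushes when either a size limit or a time budget is reached (both limits are arbitrary assumptions):

```python
import time

class MicroBatcher:
    """Buffers small events and hands complete micro-batches to the downstream processor."""

    def __init__(self, max_size=100, max_wait_seconds=0.5):
        self.max_size = max_size
        self.max_wait = max_wait_seconds
        self.buffer = []
        self.started = time.monotonic()

    def add(self, event):
        if not self.buffer:
            self.started = time.monotonic()
        self.buffer.append(event)
        # Flush when the batch is full or the time budget has expired
        if len(self.buffer) >= self.max_size or time.monotonic() - self.started >= self.max_wait:
            batch, self.buffer = self.buffer, []
            return batch
        return None
```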
Issue-by-Issue Comparison: Batch vs. Stream Processing

• Detecting & Removing Outliers
  Batch: Uses statistical methods (mean, Z-score) across historical data.
  Stream: Identifies anomalies in real time using moving averages or ML models.
• Handling Missing Values
  Batch: Imputes missing data with statistical methods (mean, median, mode).
  Stream: Uses default values or interpolates based on recent incoming data.
• Fixing Formatting Issues
  Batch: Standardizes formats (e.g., date/time, case normalization) across large datasets.
  Stream: Ensures consistent formatting in real time before ingestion.
• Correcting Errors
  Batch: Applies bulk validation and correction rules at scheduled intervals.
  Stream: Implements real-time validation and alerts for correction.
• Handling Irrelevant Data
  Batch: Filters out unnecessary data before processing.
  Stream: Filters out irrelevant data before it enters the streaming pipeline.
• Data Type Conversion
  Batch: Converts data types in bulk (e.g., integer to float, string to date).
  Stream: Ensures type consistency during real-time ingestion.
• Language Translation
  Batch: Uses offline translation models or APIs to process large datasets.
  Stream: Implements real-time language detection and translation for streaming data.
What is Data Integration?
• The process of combining data from multiple sources to provide a unified view.
• Ensures seamless data flow across different applications and databases.
• Essential for analytics, business intelligence, and decision-making.
Data Integration refers to the process of collecting, transforming, and merging data from multiple heterogeneous sources into
a single, unified view. This ensures seamless data flow across different applications, databases, and business processes, leading
to better decision-making.
Purpose:
The primary goal of data integration is to overcome the challenges of disparate data formats, allowing organizations to:
• Gain a unified view of their data.
• Improve data quality and consistency.
• Enable better data analysis and decision-making.
• Streamline business processes.
Types of data integration:
Hybrid data integration: Allows users to share and access data from any
application, regardless of its location
Middleware data integration: Normalizes data to transport it to a master data
pool
Data virtualization: Creates a virtual data layer on top of different data sources,
allowing businesses to query and access data without physically moving it
Real-time data integration: Integrates incoming data into existing records in near
real-time
Manual Data Integration
• Data is collected from multiple sources and manually entered or merged into a unified system.
• Suitable for small-scale tasks but inefficient for large datasets.
• More broadly, effective data integration delivers cost savings, improved efficiency, and increased agility.
Techniques for Data Integration
1. ETL (Extract, Transform, Load) – A traditional data integration process where data is extracted from various sources, transformed (cleaned, formatted, etc.), and then loaded into a target system, often a data warehouse (see the sketch below).
2. ELT (Extract, Load, Transform) – A modern approach where data is extracted from sources, loaded into a target system (like a data lake or warehouse), and then transformed within that environment.
3. Data Virtualization – Allows applications to access and manipulate data from multiple sources without physically moving or storing the data, creating a unified view.
4. Data Federation – An approach that enables querying data from multiple disparate sources through a unified interface, abstracting the underlying data sources.
5. Master Data Management (MDM) – A comprehensive approach to managing an organization's critical data assets, ensuring consistency and accuracy across the enterprise.
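A minimal end-to-end ETL sketch: extract rows from a CSV file, transform them, and load them into a SQLite table (the file name, table, and columns are hypothetical):

```python
import csv
import sqlite3

def extract(path="orders.csv"):
    # Extract: read raw rows from a source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: drop incomplete records, standardize names, enforce types
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue
        cleaned.append((row["order_id"],
                        row["customer"].strip().title(),
                        float(row["amount"])))
    return cleaned

def load(rows, db="warehouse.db"):
    # Load: write the cleaned rows into the target warehouse table
    with sqlite3.connect(db) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

load(transform(extract()))
```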
Data Federation – Benefits:
• Real-time access: Data federation allows for real-time access and analysis of data from multiple sources.
• Reduced complexity: It simplifies data access and management by providing a unified view of data from disparate sources.
• Improved data quality: By providing a unified view, data federation can help improve data quality and consistency.
Examples:
• Accessing customer data from a CRM system, sales data from an ERP system, and inventory data from another database through a single interface.
• Analyzing data from different departments or business units without needing to replicate or consolidate the data.
Master Data Management
"HDS" likely refers to a Historical Data Store, while "ODS" stands for
Operational Data Store
Application Integration
• Application integration allows separate applications to work together by moving and syncing data between them.
• The most typical use case is supporting operational needs, such as ensuring that your operational systems hold the same data as your finance system. The integration must therefore keep the data sets consistent.
• Applications usually expose their own APIs for sending and receiving data, so SaaS application automation tools can help you create and maintain native API integrations efficiently and at scale (see the sketch below).
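A minimal sketch of API-based application integration: pull records from one system's REST API and upsert them into another so the two stay consistent (both endpoints and field names are hypothetical):

```python
import requests

SOURCE_API = "https://crm.example.com/api/customers"       # hypothetical source system
TARGET_API = "https://finance.example.com/api/customers"   # hypothetical target system

def sync_customers():
    # Extract records from the source application
    customers = requests.get(SOURCE_API, timeout=10).json()
    for customer in customers:
        payload = {"id": customer["id"], "name": customer["name"], "email": customer["email"]}
        # Upsert into the target application so both systems hold the same data
        response = requests.put(f"{TARGET_API}/{payload['id']}", json=payload, timeout=10)
        response.raise_for_status()

if __name__ == "__main__":
    sync_customers()
```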
Common Data Sources
• Relational Databases (MySQL, PostgreSQL, SQL
Server)
• NoSQL Databases (MongoDB, Cassandra,
Redis)
• Cloud Storage (AWS S3, Google Cloud Storage)
• Enterprise Applications (ERP, CRM, Salesforce)
• Web APIs & IoT Devices
• Flat Files & CSV Data
Data integration involves combining data from various sources, and common sources
include databases, cloud platforms, APIs, and streaming data, all of which can be integrated using
various tools and methodologies.
1. Databases
• Relational Databases (RDBMS): MySQL, PostgreSQL, SQL Server, Oracle
• NoSQL Databases: MongoDB, DynamoDB
2. Flat Files & Documents
• CSV, JSON, XML, Excel
• Text files and log files
3. Cloud Storage & Data Lakes
• Amazon S3, Google Cloud Storage, Azure Blob Storage
• Data lakes like Hadoop HDFS, Databricks Delta Lake
4. APIs & Web Services
• RESTful and SOAP APIs
• Web scraping sources
5. Streaming Data Sources
• Kafka, Apache Flink, AWS Kinesis
• IoT devices, sensor data
6. Enterprise Applications
• ERP (SAP, Oracle ERP)
• CRM (Salesforce, HubSpot)
• HRMS (Workday, BambooHR)
7. Social Media & Web Analytics
• Twitter, Facebook, LinkedIn APIs
• Google Analytics, Adobe Analytics
8. Big Data Platforms
• Hadoop, Spark, Snowflake, Google BigQuery
Popular Data Integration Tools
1. ETL (Extract, Transform, Load) Tools
• Apache NiFi – Open-source tool for automating data flow between systems.
• Talend – Provides ETL and ELT capabilities with a graphical interface.
• Informatica PowerCenter – Enterprise-grade ETL tool for data integration.
• Microsoft SQL Server Integration Services (SSIS) – ETL tool for Microsoft SQL Server.
• AWS Glue – Serverless ETL tool for AWS environments.
• Google Cloud Dataflow – Fully managed ETL service for stream and batch processing.
2. Data Pipelines & Workflow Orchestration
• Apache Airflow – Open-source workflow management tool for scheduling data pipelines (see the sketch below).
• Luigi – Python-based workflow orchestration tool developed by Spotify.
• Prefect – Modern orchestration tool that simplifies workflow management.
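As an illustration of workflow orchestration, here is a minimal Apache Airflow sketch that schedules a two-step daily pipeline (the DAG id and task bodies are placeholders; assumes Airflow 2.4+):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data")             # placeholder for real extraction logic

def transform_and_load():
    print("transforming and loading")    # placeholder for real transform/load logic

with DAG(
    dag_id="daily_sales_pipeline",       # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                   # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)
    extract_task >> load_task            # transform_and_load runs only after extract succeeds
```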
3. Real-time Data Integration & Streaming
• Apache Kafka – Distributed event streaming platform for real-time data integration.
• Apache Flink – Real-time stream processing framework.
• Apache Spark Structured Streaming – Streaming data processing with Spark.
• Confluent Platform – Kafka-based data streaming and integration platform.
4. ELT (Extract, Load, Transform) Tools
• Fivetran – Fully managed ELT tool for cloud data warehouses.
• Stitch – Simple ELT solution with pre-built connectors for various data sources.
• Matillion – Cloud-native ELT tool optimized for modern data warehouses.
5. Data Virtualization Tools
• Denodo – Enterprise-grade data virtualization platform.
• TIBCO (The Information Bus Company) Data Virtualization – Unifies disparate data sources
without physical movement.
6. Cloud-Native Data Integration Tools
• Azure Data Factory – Cloud-based data integration tool for Microsoft Azure.
• Google Cloud Data Fusion – Managed data integration platform using Apache CDAP.
Ensuring Data Quality and Consistency
• Data Accuracy – Ensuring correct and error-free data.
• Completeness – Avoiding missing values in records.
• Consistency – Maintaining uniform data across all systems.
• Timeliness – Keeping data updated in real time.
• Validity – Ensuring data follows predefined rules (see the sketch below).
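A minimal sketch of rule-based quality checks covering completeness, validity, and duplicate detection (the rules and fields are illustrative):

```python
import re

def quality_report(records):
    issues = []
    seen_ids = set()
    for i, r in enumerate(records):
        if r.get("email") is None:                                      # completeness
            issues.append((i, "missing email"))
        elif not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", r["email"]):   # validity
            issues.append((i, "invalid email format"))
        if r.get("id") in seen_ids:                                     # consistency / uniqueness
            issues.append((i, "duplicate id"))
        seen_ids.add(r.get("id"))
    return issues

records = [{"id": 1, "email": "a@example.com"}, {"id": 1, "email": "bad-email"}]
print(quality_report(records))  # [(1, 'invalid email format'), (1, 'duplicate id')]
```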
Challenges in Data Integration
• Data format mismatches (e.g., date formats, units of measure).
• Data duplication and inconsistencies.
• Real-time synchronization issues.
• Security and compliance concerns (GDPR, HIPAA).
• Handling large volumes of data efficiently.
What is Security in Data Engineering?
In data engineering, security encompasses the measures and practices
implemented to protect data throughout its lifecycle, ensuring confidentiality,
integrity, and availability from unauthorized access, use, or modification.
Data Security and Privacy in Data Engineering
Understanding the Landscape
🔒 Data Security:
• Prevents data loss, corruption, and unauthorized access
• Ensures confidentiality, integrity, and availability (CIA triad)
🔏 Data Privacy:
• Manages the proper collection, usage, retention, deletion, and
storage of data
• Protects personal information and ensures compliance with
regulations like GDPR (General Data Protection Regulation) and
CCPA (California Consumer Privacy Act)
Data Lifecycle Considerations:
Data engineers must integrate security & privacy at every stage:
• Data Creation & Collection – Securely acquire data
• Data Storage – Implement encryption & access control
• Data Processing & Analysis – Ensure privacy-preserving
techniques
• Data Sharing & Dissemination – Use secure transmission
protocols
• Data Disposal – Securely erase outdated or unnecessary data
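A minimal sketch of encrypting data at rest with symmetric encryption, using the cryptography library's Fernet API (in practice the key would come from a key management service rather than being generated inline):

```python
from cryptography.fernet import Fernet

# Assumption: in production the key lives in a secrets manager / KMS, not in code
key = Fernet.generate_key()
fernet = Fernet(key)

record = b'{"user_id": "u123", "email": "alice@example.com"}'
encrypted = fernet.encrypt(record)      # ciphertext that can be safely stored at rest
decrypted = fernet.decrypt(encrypted)   # only holders of the key can recover the data

assert decrypted == record
```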
Data protection refers to the practices and strategies employed to safeguard sensitive information from unauthorized access, loss, or damage, ensuring its availability and compliance with relevant regulations.
Data Privacy Best Practices
If you're keen on complying with regulations (as you should be) and building a bond of trust with your customers, embed privacy best practices such as encryption, access control, secure transmission, and secure disposal into every stage of the data lifecycle described above.