Question Data Engineering

The document discusses data cleaning strategies for both batch and stream processing in various scenarios, including retail and healthcare. It outlines techniques such as deduplication, handling missing values, standardizing formats, and ensuring data quality through ETL processes and real-time integration. Additionally, it emphasizes the importance of compliance, security, and automation in maintaining data integrity across different systems.


Scenario Question on Batch Processing - Data Cleaning
• Scenario:
A retail company collects sales data from multiple stores daily. However, they notice
issues in their dataset, including duplicate records, missing values in the "Price"
column, and inconsistencies in date formats. Since the dataset is large, they decide
to clean the data using batch processing.
Question:
How can the company use batch processing to clean the dataset and ensure data
quality?
Answer:
The company can follow these batch processing data cleaning techniques:
• Deduplication
Identify and remove duplicate records based on unique transaction IDs or a combination of
"Date," "Store ID," and "Product ID."
DELETE FROM sales_data
WHERE id NOT IN (
SELECT MIN(id)
FROM sales_data
GROUP BY date, store_id, product_id
);
• Handling Missing Values in the "Price" Column
If a product's price is missing, they can impute values using the mean price of the same product across other
stores.
Example SQL query:
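(A possible sketch in PostgreSQL-style SQL; sales_data, product_id, and price are the assumed table and column names.)
-- Fill a missing price with the average price of the same product in other rows
UPDATE sales_data AS s
SET price = (
    SELECT AVG(price)
    FROM sales_data
    WHERE product_id = s.product_id
      AND price IS NOT NULL
)
WHERE s.price IS NULL;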

• Standardizing Date Formats
Convert all date values into a uniform format, e.g., YYYY-MM-DD.
Example in Python using pandas:
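(A minimal sketch; the CSV file name and the "date" column are illustrative assumptions.)
import pandas as pd

df = pd.read_csv("sales_data.csv")

# Parse the raw date strings; values that cannot be parsed become NaT and can be reviewed
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Rewrite every date in the uniform YYYY-MM-DD format
df["date"] = df["date"].dt.strftime("%Y-%m-%d")

df.to_csv("sales_data_clean.csv", index=False)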
• Removing Anomalies
Identify negative prices or unrealistic values (e.g., a product price of $1,000,000).
• Example SQL query to remove extreme outliers:
DELETE FROM sales_data
WHERE price < 0 OR price > 10000;

• Automating the Process
The company can schedule the batch cleaning process to run daily using Apache Spark, SQL scripts, or Python ETL pipelines (e.g., using Airflow).
• Example in Python (batch script using pandas):
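(A compact sketch of a daily batch job combining the steps above; the file paths, column names, and the 10,000 price ceiling are illustrative assumptions. A scheduler such as Airflow or cron can invoke it once per day.)
import pandas as pd

def clean_sales_batch(in_path: str, out_path: str) -> None:
    df = pd.read_csv(in_path)

    # 1. Deduplicate on the business key
    df = df.drop_duplicates(subset=["date", "store_id", "product_id"])

    # 2. Impute missing prices with the mean price of the same product
    df["price"] = df.groupby("product_id")["price"].transform(
        lambda s: s.fillna(s.mean())
    )

    # 3. Standardize dates to YYYY-MM-DD
    df["date"] = pd.to_datetime(df["date"], errors="coerce").dt.strftime("%Y-%m-%d")

    # 4. Drop anomalies (negative or implausibly large prices)
    df = df[(df["price"] >= 0) & (df["price"] <= 10000)]

    df.to_csv(out_path, index=False)

if __name__ == "__main__":
    clean_sales_batch("sales_raw.csv", "sales_clean.csv")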
Scenario-Based Question on Stream Processing - Data Cleaning
A global e-commerce company processes millions of real-time transactions every
day. The incoming transaction data from various sources (mobile apps, web platforms,
and third-party integrations) contains inconsistencies, including:
• Duplicate Orders due to retries from payment gateways.
• Missing Customer Details such as email or shipping address.
• Incorrect or Out-of-Sequence Timestamps because of delays in different time zones.
These issues need to be fixed in real-time before transactions are stored in the
database and sent to fraud detection models.
You are tasked with implementing a real-time data cleaning pipeline using a stream
processing framework (e.g., Apache Flink, Kafka Streams, or Spark Streaming).
Questions:
• How would you handle duplicate orders in the streaming data?
• What techniques would you use to manage missing values in customer details?
• How would you handle incorrect or out-of-sequence timestamps in real-time
transactions?
• Which stream processing framework would you choose and why?
1. Handling Duplicate Orders in Streaming Data
Challenges with Duplicate Orders
• Duplicate transactions occur due to payment gateway retries.
• Customers may click “Pay” multiple times if there is a delay.
• Duplicate records lead to incorrect revenue calculations and customer
dissatisfaction.
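A minimal, framework-agnostic Python sketch of the usual fix: keyed deduplication with a time-to-live window. The order_id key and the 24-hour TTL are illustrative assumptions; in Flink or Kafka Streams the same idea maps onto keyed state with a state TTL.
import time

class StreamDeduplicator:
    """Drops order events whose order_id was already seen within the TTL window."""

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self.seen = {}  # order_id -> last-seen timestamp

    def is_duplicate(self, order_id, now=None):
        now = time.time() if now is None else now
        # Evict expired entries so the state does not grow without bound
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if order_id in self.seen:
            return True
        self.seen[order_id] = now
        return False

# Usage: only forward first-seen orders downstream
dedup = StreamDeduplicator()
for event in [{"order_id": "A1"}, {"order_id": "A1"}, {"order_id": "B2"}]:
    if not dedup.is_duplicate(event["order_id"]):
        print("process", event)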
2. Handling Missing Customer Details
Challenges with Missing Data
• Missing email or shipping addresses makes order
tracking difficult.
• Customer profiles may be incomplete in the database.
• Orders without critical fields may break downstream
systems.
Techniques to Handle Missing Values
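Common options include enriching missing fields from a customer profile store, applying safe defaults for non-critical fields, and routing records that still lack critical fields to a dead-letter channel for review. A minimal Python sketch (the field names and the in-memory profile store are illustrative assumptions):
CRITICAL_FIELDS = ["customer_id", "shipping_address"]

def clean_customer_details(event, profile_store, dead_letter):
    """Fill missing customer fields from a profile store; divert unrepairable events."""
    # 1. Enrich missing email / address from the customer profile store
    profile = profile_store.get(event.get("customer_id"), {})
    for field in ("email", "shipping_address"):
        if not event.get(field) and profile.get(field):
            event[field] = profile[field]

    # 2. Default non-critical fields instead of dropping the event
    event.setdefault("email", "unknown@placeholder.invalid")

    # 3. Route events still missing critical fields to a dead-letter queue
    if any(not event.get(f) for f in CRITICAL_FIELDS):
        dead_letter.append(event)
        return None
    return event

# Usage with in-memory stand-ins for the profile store and dead-letter queue
profiles = {"C42": {"email": "c42@example.com", "shipping_address": "221B Baker St"}}
dlq = []
print(clean_customer_details({"customer_id": "C42", "email": None}, profiles, dlq))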
Assignment
1. Scenario:
A social media analytics company collects real-time posts, comments, and reviews from
multiple platforms. The raw data contains spam, duplicate posts, and inconsistent text
formatting.
Question:
• Should data cleaning be performed using batch processing or real-time stream processing
for sentiment analysis?
• What preprocessing techniques (e.g., deduplication, stopword removal, text normalization)
should be applied in each approach?
• How does the choice of processing method impact real-time analytics and trend detection?
2. Scenario:
A retail company tracks product stock levels across warehouses, online stores, and third-party
sellers. The inventory data is sometimes outdated, duplicated, or inconsistent across different
sources, leading to stock mismanagement.
Question:
• How can batch processing help in periodically reconciling inventory data for accuracy?
• When should real-time stream processing be used to detect and clean inventory
discrepancies instantly?
• What are the challenges of implementing a hybrid approach combining batch and stream processing?
Scenario 1:
A retail company has data coming in from multiple sources, including online sales,
in-store transactions, and third-party delivery partners. However, the company
notices discrepancies in customer purchase histories across these sources.
Question:
How can the company use data integration techniques to ensure data
consistency and accuracy while consolidating data from multiple sources?
Solution:
To address data discrepancies and ensure consistency, the company can follow
these steps:
1. Implement the ETL (Extract, Transform, Load) Process
The ETL process is essential for integrating data from multiple sources while
ensuring consistency.
• Extract:
• Collect data from online stores, point-of-sale (POS) systems, third-party platforms, and CRM systems.
• Ensure data extraction occurs in real-time or batch mode to reduce delays.
• Use data connectors and APIs to pull data from platforms like Shopify, Amazon,
or custom databases.
• Transform:
• Data Standardization: Convert all records to a common format (e.g., unifying date formats,
standardizing currency, and normalizing product categories).
• Data Deduplication: Identify and merge duplicate records (e.g., when the same customer is
registered multiple times with different email addresses).
• Error Handling: Clean corrupted or missing data using AI-driven cleansing tools.
• Load:
• Store transformed data in a centralized data warehouse or a cloud-based database
(e.g., Amazon Redshift, Google BigQuery, Snowflake).
• Maintain an audit log to track all modifications for compliance and troubleshooting.
• Tools for ETL:
Talend, Apache NiFi, Informatica PowerCenter, Microsoft SSIS
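A small pandas sketch of the transform step: merging two sources, standardizing dates and currency, and deduplicating records before loading. The file names, column names, and the fixed exchange rate are illustrative assumptions, not the company's real pipeline.
import pandas as pd

# Extract: two sources with different conventions (file and column names assumed)
online = pd.read_csv("online_sales.csv")   # ISO dates, amounts already in USD
pos = pd.read_csv("pos_sales.csv")         # DD/MM/YYYY dates, amounts in EUR

# Transform: standardize dates and currency before combining
online["order_date"] = pd.to_datetime(online["order_date"])
pos["order_date"] = pd.to_datetime(pos["order_date"], dayfirst=True)
pos["amount_usd"] = pos.pop("amount_eur") * 1.08   # fixed rate, for the sketch only

combined = pd.concat([online, pos], ignore_index=True)

# Deduplicate: normalize emails and drop repeated order IDs
combined["email"] = combined["email"].str.strip().str.lower()
combined = combined.drop_duplicates(subset=["order_id"])

# Load: write to a warehouse staging location (CSV stands in for Redshift/BigQuery/Snowflake)
combined.to_csv("warehouse_staging_sales.csv", index=False)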
2. Implement Master Data Management (MDM) for a Single Customer View
MDM is crucial for ensuring a single version of truth for customer data across multiple
platforms.
• Create a Unified Customer Profile: Consolidate data from all sources to create a single,
verified customer profile.
• Merge Duplicate Entries: Use machine learning algorithms to detect duplicate customer
profiles and merge them into a single accurate record.
• Update Data in Real-time: Any change in customer information (e.g., address, phone
number) should be reflected across all systems automatically.
MDM Tools:
3. Real-time Data Integration with APIs and Streaming
Platforms
To reduce delays and ensure data is up-to-date, the
company should adopt real-time data streaming and API
integration.
• Use APIs for Instant Data Synchronization:
• API-based integration ensures immediate updates when a
purchase is made, eliminating discrepancies.
Tools: MuleSoft, Dell Boomi, Postman API
• Stream Data in Real-time for Faster Insights:
• Use event-driven streaming platforms like Apache Kafka or AWS
Kinesis to process transactions as they happen.
• Benefits:
• Instant fraud detection.
• Real-time inventory updates.
• Personalized customer recommendations.
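A minimal sketch of consuming such an event stream with the kafka-python client; the topic name, broker address, and the simple fraud threshold are illustrative assumptions.
import json
from kafka import KafkaConsumer

# Consume purchase events as they happen
consumer = KafkaConsumer(
    "retail-transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    txn = message.value
    # Real-time inventory update and a very simple fraud heuristic
    print(f"update stock for {txn['product_id']} at store {txn['store_id']}")
    if txn.get("amount", 0) > 10_000:
        print(f"flag transaction {txn['transaction_id']} for fraud review")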
4. Ensure Data Quality and Governance
Data consistency is directly linked to data quality and governance policies. The company
must:
• Automate Data Cleansing:
• Set up rules to fix incorrect records, missing values, and formatting errors.
• Example: If the country field is blank, fill it using the IP address location.
• Define Data Governance Policies:
• Establish a framework for data ownership and access control.
• Use role-based access control (RBAC) to ensure only authorized employees can modify sensitive
data.
• Audit and Monitor Data Changes:
• Log all data changes to track unauthorized modifications and ensure compliance.
• Data Governance Tools:
Collibra, Alation, Informatica Axon

Outcome:
By implementing these data integration strategies, the retail company will:
✅ Have a centralized and accurate customer database.
✅ Provide consistent purchase history across all channels.
✅ Improve customer satisfaction with personalized marketing.
Scenario 2: Data Integration in a Healthcare Organization
A healthcare organization aims to integrate patient records
from various hospitals and clinics into a centralized system.
However, the data is stored in different formats and structures
across multiple databases, making integration complex.
Question:
What data integration tools and strategies can be used to
unify patient records while maintaining data quality and
ensuring compliance with healthcare regulations?
Solution:
To successfully integrate patient records, the healthcare organization should
implement a robust data integration framework that ensures accuracy,
compliance, and security.
1. Standardize Data Formats for Interoperability
Different hospitals may use different Electronic Health Record (EHR) systems,
leading to inconsistencies. To unify data:
• Convert records into healthcare standards like:
• HL7 (Health Level 7) – widely used for patient data exchange.
• FHIR (Fast Healthcare Interoperability Resources) – modern standard for
web-based EHR integration.
• Automate Data Transformation:
• Convert different date formats, medical terminologies, and patient identifiers
into a common format.
• Use AI-driven data mapping tools to identify mismatched fields.
• Tools for Data Standardization:
• IBM Watson Health, MuleSoft Healthcare API, Microsoft Azure Health Data Services
2. Use ETL and API-based Integration for Data Unification
• ETL for Historical Data Migration:
• Extract data from legacy databases, CSV files, and third-party systems.
• Clean and transform data into a structured, consistent format.
• Load into a centralized cloud-based healthcare database.
• API-based Integration for Real-time Updates:
• Implement FHIR-based APIs to allow hospitals to send real-time updates on patient records.
ETL Tools:
Talend, Informatica Cloud, AWS Glue
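A sketch of pushing a real-time patient update to a FHIR REST endpoint with the requests library. The base URL, bearer token, and patient details are illustrative assumptions; a real deployment would target the organization's FHIR server and use its OAuth flow.
import requests

FHIR_BASE = "https://fhir.example-hospital.org/fhir"   # assumed endpoint
TOKEN = "..."                                           # obtained via the OAuth flow

# Minimal FHIR R4 Patient resource
patient = {
    "resourceType": "Patient",
    "identifier": [{"system": "urn:hospital:mrn", "value": "MRN-001234"}],
    "name": [{"family": "Doe", "given": ["John"]}],
    "birthDate": "1980-05-17",
}

resp = requests.post(
    f"{FHIR_BASE}/Patient",
    json=patient,
    headers={"Authorization": f"Bearer {TOKEN}",
             "Content-Type": "application/fhir+json"},
    timeout=10,
)
resp.raise_for_status()
print("created:", resp.json().get("id"))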

3. Data Quality and Deduplication for Accurate Records


• Identify Duplicate Patient Records:
• Use AI-based matching algorithms to merge duplicate patient profiles.
• Example: If a patient is registered as "John A. Doe" and "J. Doe," AI can detect and unify
them.
• Automate Data Validation:
• Implement rules to detect missing information (e.g., missing age, incorrect patient ID).
Tools:
• IBM Infosphere QualityStage, Trifacta
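A small illustration of rule-plus-similarity matching for duplicate patient records like "John A. Doe" and "J. Doe". The 0.6 threshold and field names are assumptions for the sketch; production systems typically rely on dedicated probabilistic matching tools rather than this toy check.
from difflib import SequenceMatcher

def likely_same_patient(a, b, threshold=0.6):
    """Flag two records as probable duplicates using date of birth plus name similarity."""
    if a["dob"] != b["dob"]:
        return False
    name_a = f"{a['first_name']} {a['last_name']}".lower()
    name_b = f"{b['first_name']} {b['last_name']}".lower()
    # Modest threshold, since one record may hold only an initial
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold

rec1 = {"first_name": "John A.", "last_name": "Doe", "dob": "1980-05-17"}
rec2 = {"first_name": "J.", "last_name": "Doe", "dob": "1980-05-17"}
print(likely_same_patient(rec1, rec2))  # True -> candidates for merge review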
4. Ensure Compliance with Healthcare Regulations
To protect patient privacy, the organization must comply with:
• HIPAA (USA) – Protects patient data privacy.
• GDPR (Europe) – Regulates how patient data is processed.
• NDHM (India) – Ensures secure digital health records.
• Security Measures:
✅ Data Encryption: Encrypt sensitive records during storage and transfer.
✅ Access Controls: Use Role-Based Access Control (RBAC) to prevent unauthorized access.
✅ Audit Logging: Maintain logs of who accessed or modified patient records.

Outcome:
By implementing these data integration strategies, the healthcare
organization will:
✅ Create a single, accurate patient record across hospitals.
✅ Improve patient care through real-time data sharing.
Assignment
1. Scenario:
A city government wants to integrate real-time data from traffic cameras, public transportation
systems, environmental sensors, and emergency response units into a single dashboard for efficient
urban management. However, the data comes from multiple vendors using different protocols and
formats, making real-time analysis difficult.
Question:
What data integration techniques, IoT frameworks, and cloud solutions can be used to unify city-
wide data for improved decision-making and public safety?
2. Scenario:
A multinational company sources raw materials from various suppliers worldwide, with procurement
data stored in separate ERP systems across regions. Due to data silos and inconsistent formats,
procurement teams struggle with accurate demand forecasting and inventory management.
Question:
Which data integration strategies and tools can help unify vendor and inventory data across global
supply chains while ensuring real-time visibility and operational efficiency?
3. Scenario:
A pharmaceutical company conducts clinical trials across multiple research centers, collecting
patient health data, drug efficacy reports, and trial outcomes in different formats and database
systems. Data inconsistency makes it challenging to analyze trial results accurately.
Question:
What data integration strategies and tools can be employed to unify clinical trial data, ensure
regulatory compliance (e.g., FDA, HIPAA), and accelerate drug development?
SCENARIO
A global stock exchange uses real-time Kafka streams to
process billions of financial transactions daily. Some employees
and external brokers are suspected of manipulating stock
price data and insider trading by accessing sensitive
transactions before they are made public.
1. How can the company ensure real-time data integrity in
a streaming environment?
2. How to detect and block unauthorized access to
transaction data?
3. How to prevent financial data tampering while
maintaining ultra-low latency?
4. How to secure APIs that interact with stock data in a
microservices architecture?
5. How to provide auditors with tamper-proof logs of
financial transactions?
Solutions:
1. Data Integrity & Real-Time Tamper Detection
Solution: Use blockchain-based data logging to create an immutable transaction
history.
Implementation:
• Use Hyperledger Fabric to store stock trade records immutably.
• Apply Merkle Trees to detect unauthorized changes in Kafka streams.
• Deploy Apache Flink with anomaly detection ML models to flag suspicious trading
patterns.
✅ Outcome: No unauthorized changes in stock prices; all manipulations are
detected instantly.
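A bare-bones Python illustration of the hash-chaining idea behind Merkle trees and immutable logs: every record commits to the hash of the one before it, so any edit is detectable. The trade records are illustrative; production systems would rely on Hyperledger Fabric or a ledger database rather than this toy chain.
import hashlib
import json

def record_hash(record, previous_hash):
    """Chain each trade record to the hash of the previous one (tamper evidence)."""
    payload = json.dumps(record, sort_keys=True) + previous_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

trades = [
    {"symbol": "ACME", "price": 101.5, "qty": 200},
    {"symbol": "ACME", "price": 101.7, "qty": 150},
]

# Build the chain
chain = []
prev = "0" * 64
for t in trades:
    prev = record_hash(t, prev)
    chain.append(prev)

# Verification: any edit to an earlier trade changes every later hash
trades[0]["price"] = 999.9
prev = "0" * 64
for t, expected in zip(trades, chain):
    prev = record_hash(t, prev)
    if prev != expected:
        print("tampering detected at record:", t)
        break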
2. Zero Trust & Just-in-Time (JIT) Access for Brokers
Solution: Prevent insider trading by applying Zero Trust principles and Just-in-
Time (JIT) access.
• Implementation:
• Use Google BeyondCorp to validate every access attempt.
• Restrict high-frequency traders using behavior-based access control.
• Implement OAuth 2.0 with fine-grained API policies (e.g., stock brokers can only access
data relevant to their firm).
• ✅ Outcome: No unauthorized user can pre-access financial transactions for insider trading.
3. API Security in Microservices Architecture
Solution: Implement Mutual TLS (mTLS) & API Threat Protection.
• Implementation:
• Enforce mTLS authentication between microservices (e.g., Envoy + Istio in
Kubernetes).
• Deploy Cloudflare API Gateway with rate limiting & anomaly detection.
• Use OAuth 2.0 + JWT (JSON Web Tokens) for secure API communication.
✅ Outcome: APIs handling stock transactions are fully encrypted and secure
from API abuse.
4. Immutable Audit Logs for Regulatory Compliance (SEC, FINRA,
GDPR)
Solution: Store audit logs in AWS Quantum Ledger Database (QLDB) or
Azure Confidential Ledger.
• Implementation:
• Encrypt audit logs using SHA-256 hashing.
• Restrict log deletion using WORM (Write Once, Read Many) storage.
• Implement Splunk with AI-powered anomaly detection for fraud monitoring.
✅ Outcome: The company achieves full compliance with SEC, GDPR, and
FINRA regulations.
2. A healthcare analytics company processes electronic health records
(EHRs), medical images, and patient prescriptions in a multi-cloud
data lake (AWS S3, Google Cloud Storage, Azure Data Lake). A recent
misconfiguration in cloud storage exposed 500,000 patient records,
violating HIPAA (Health Insurance Portability and Accountability Act).
The company must now secure patient data, prevent future breaches,
and comply with HIPAA & GDPR while still enabling medical research
and AI-driven diagnostics.
Questions:
1. How to prevent unauthorized access to patient data in a multi-
cloud environment?
2. How to secure AI models that process sensitive medical records?
3. How to implement fine-grained access control for doctors,
researchers, and insurers?
4. How to detect & prevent accidental misconfigurations in cloud
storage?
5. How to anonymize patient data for research without
compromising AI model accuracy?
1. Prevent Unauthorized Access with Zero Trust & Attribute-
Based Access Control (ABAC)
Zero Trust Security + Multi-Factor Authentication (MFA) + Role-
Based & Attribute-Based Access Control (RBAC & ABAC)
✔ Zero Trust Model:
• Require MFA & Identity Federation (Okta, Azure AD) for all users.
• Block direct database access; use secure APIs with OAuth 2.0 &
JWT authentication.
✔ ABAC for Medical Data Access:
• Doctors can access patient records only in their assigned
hospital.
• Researchers get anonymized data only (without names or
SSNs).
• Insurers can only see billing-related data, not full medical history.
• ✅ Outcome: Unauthorized access is blocked, and each role sees only the data it is entitled to.
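A toy illustration of an ABAC decision function enforcing the three rules above. The attribute names and policy logic are assumptions for the sketch; real deployments express such rules in a policy engine or the cloud provider's IAM rather than application code.
def allow_access(user, record, purpose):
    """Attribute-based check: role, hospital assignment, and data category decide together."""
    if user["role"] == "doctor":
        # Doctors: only records from their assigned hospital
        return record["hospital_id"] == user["hospital_id"]
    if user["role"] == "researcher":
        # Researchers: only de-identified records, only for research purposes
        return record["anonymized"] and purpose == "research"
    if user["role"] == "insurer":
        # Insurers: billing-related fields only
        return record["category"] == "billing"
    return False

doctor = {"role": "doctor", "hospital_id": "H-7"}
record = {"hospital_id": "H-7", "anonymized": False, "category": "clinical"}
print(allow_access(doctor, record, purpose="treatment"))  # True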
2. Securing AI Models that Process Medical Data
Homomorphic Encryption & Secure Multi-Party
Computation (SMPC)
✔ AI models should never process raw patient data:
• Use Google’s Differential Privacy API to add noise to data
before training AI models.
• Apply Homomorphic Encryption (HE) so models can process
encrypted medical records without decryption.
✔ Federated Learning for AI on Edge Devices:
• Train AI models on hospital devices instead of centralizing data
in the cloud.
• Use TensorFlow Federated to prevent raw patient data from
leaving hospital networks.
✅ Outcome: AI models maintain accuracy while ensuring full
patient privacy.
3. Privacy-Preserving Data Sharing for Medical Research
Anonymization + Tokenization
✔ De-Identification Techniques:
• Remove names, addresses, and social security numbers.
• Replace patient identifiers with randomized tokens using
Google DLP or AWS Comprehend Medical.
✔ Differential Privacy for Medical Datasets:
• Use synthetic data (AI-generated medical records) for
research instead of real patient data.
• Apply k-anonymity to ensure that at least k patients share
the same attributes, making re-identification far more difficult.
✅ Outcome: The company enables AI-driven healthcare
innovation without compromising patient privacy.
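A quick pandas sketch that checks whether a de-identified dataset satisfies k-anonymity over a set of quasi-identifiers; the column names and k value are illustrative assumptions.
import pandas as pd

def violates_k_anonymity(df, quasi_identifiers, k=5):
    """Return the quasi-identifier combinations shared by fewer than k patients."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return group_sizes[group_sizes < k]

records = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "40-49", "40-49", "40-49"],
    "zip3":      ["941",   "941",   "100",   "100",   "100"],
    "diagnosis": ["A",     "B",     "C",     "D",     "E"],
})
print(violates_k_anonymity(records, ["age_band", "zip3"], k=3))
# Groups below k must be generalized or suppressed before release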
4. Continuous Cloud Storage Security Monitoring
Automated Cloud Security Posture Management (CSPM)
✔ Misconfiguration Prevention:
• Deploy AWS Macie, Google Security Command Center, and Azure Purview to detect
publicly exposed data.
• Enforce real-time alerts for S3 buckets, BigQuery, and Azure Data Lake
misconfigurations.
✔ Encryption & Key Management:
• Encrypt all stored data with AES-256 using Customer-Managed Encryption Keys
(CMEK).
• Rotate encryption keys automatically with AWS KMS, Azure Key Vault, and Google KMS.
✅ Outcome: No accidental public exposure of patient records due to misconfiguration.

Final Security & Privacy Outcomes for Scenario 2:
✔ Zero Trust & MFA prevent unauthorized access.
✔ Homomorphic encryption secures AI-driven diagnostics.
✔ Automated CSPM prevents cloud storage leaks.
✔ De-identification & differential privacy protect patient identities.
✔ Federated Learning prevents centralized medical data exposure.
Result: The company remains HIPAA & GDPR compliant while enabling secure AI-driven
healthcare research.
Assignment 3
1. A global bank maintains a centralized data warehouse storing customer financial transactions, loan records, and account details. An internal data engineer with privileged access exfiltrated sensitive customer data and sold it on the dark web.
1. How to detect and prevent insider threats without disrupting legitimate work?
2. How to ensure least-privilege access while allowing engineers to maintain the system?
3. How to track data access & modifications in real time?
4. How to automatically respond to suspicious data movements (e.g., bulk exports)?
5. How to comply with financial regulations like PCI DSS & GDPR while securing customer
data?

2. A media company stores video streaming analytics, user behavior data, and subscription
details in a multi-cloud data lake (AWS S3, Azure Blob, Google Cloud Storage). A ransomware
attack encrypted petabytes of data, making it inaccessible to both customers and engineers.
Hackers demanded Bitcoin payments to restore access.
1. How to prevent ransomware from encrypting cloud storage data?
2. How to quickly recover encrypted data without paying the ransom?
3. How to enforce immutable backups & versioning in a multi-cloud setup?
4. How to detect unusual encryption activities in real time?
5. How to minimize downtime & revenue loss while restoring data?
3. A medical research lab collaborates with hospitals worldwide
to train AI models for cancer detection using patient MRI scans
and diagnosis reports. However, privacy laws like HIPAA &
GDPR restrict the direct sharing of patient data across borders.
1. How to train AI models on patient data without exposing
sensitive information?
2. How to enable global collaboration without violating HIPAA &
GDPR?
3. How to prevent re-identification when sharing anonymized
patient data?
4. How to ensure that hospitals maintain control over their own
patient records?
5. How to secure AI models against adversarial attacks that
could infer patient data?
