Data Engineering
Data Cleaning
• Scenario:
A retail company collects sales data from multiple stores daily. However, they notice
issues in their dataset, including duplicate records, missing values in the "Price"
column, and inconsistencies in date formats. Since the dataset is large, they decide
to clean the data using batch processing.
Question:
How can the company use batch processing to clean the dataset and ensure data
quality?
Answer:
The company can apply the following batch data cleaning techniques:
• Deduplication
Identify and remove duplicate records based on unique transaction IDs or a combination of
"Date," "Store ID," and "Product ID."
-- Keep only the earliest record in each (date, store_id, product_id) group
DELETE FROM sales_data
WHERE id NOT IN (
    SELECT MIN(id)
    FROM sales_data
    GROUP BY date, store_id, product_id
);
• Handling Missing Values in the "Price" Column
If a product's price is missing, impute it using the mean price of the same product across other stores.
Example SQL query:
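A minimal sketch, assuming missing prices are stored as NULL in the same sales_data table:

-- Impute each missing price with the product's mean price across other stores
UPDATE sales_data AS s
SET price = (
    SELECT AVG(s2.price)
    FROM sales_data AS s2
    WHERE s2.product_id = s.product_id
      AND s2.price IS NOT NULL
)
WHERE s.price IS NULL;

• Standardizing Date Formats
Convert all dates to one canonical format (e.g., ISO 8601) during the batch run. A hedged sketch for PostgreSQL, assuming inconsistent dates arrive as text in a hypothetical raw_date column:

-- Normalize US-style text dates (MM/DD/YYYY) into the proper DATE column
UPDATE sales_data
SET date = TO_DATE(raw_date, 'MM/DD/YYYY')
WHERE raw_date LIKE '__/__/____';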
Outcome:
By applying these batch data cleaning techniques, the retail company will:
✅ Eliminate duplicate transaction records across stores.
✅ Recover missing "Price" values through imputation.
✅ Report on dates in one consistent format.
Assignment
1. Scenario:
A city government wants to integrate real-time data from traffic cameras, public transportation
systems, environmental sensors, and emergency response units into a single dashboard for efficient
urban management. However, the data comes from multiple vendors using different protocols and
formats, making real-time analysis difficult.
Question:
What data integration techniques, IoT frameworks, and cloud solutions can be used to unify city-
wide data for improved decision-making and public safety?
2. Scenario:
A multinational company sources raw materials from various suppliers worldwide, with procurement
data stored in separate ERP systems across regions. Due to data silos and inconsistent formats,
procurement teams struggle with accurate demand forecasting and inventory management.
Question:
Which data integration strategies and tools can help unify vendor and inventory data across global
supply chains while ensuring real-time visibility and operational efficiency?
3. Scenario:
A pharmaceutical company conducts clinical trials across multiple research centers, collecting
patient health data, drug efficacy reports, and trial outcomes in different formats and database
systems. Data inconsistency makes it challenging to analyze trial results accurately.
Question:
What data integration strategies and tools can be employed to unify clinical trial data, ensure
regulatory compliance (e.g., FDA, HIPAA), and accelerate drug development?
Scenario 1:
A global stock exchange uses real-time Kafka streams to process billions of financial transactions daily. Some employees and external brokers are suspected of manipulating stock price data and of insider trading by accessing sensitive transactions before they are made public.
1. How can the company ensure real-time data integrity in a streaming environment?
2. How to detect and block unauthorized access to
transaction data?
3. How to prevent financial data tampering while
maintaining ultra-low latency?
4. How to secure APIs that interact with stock data in a
microservices architecture?
5. How to provide auditors with tamper-proof logs of
financial transactions?
Solutions:
1. Data Integrity & Real-Time Tamper Detection
Solution: Use blockchain-based data logging to create an immutable transaction
history.
Implementation:
• Use Hyperledger Fabric to store stock trade records immutably.
• Apply Merkle Trees to detect unauthorized changes in Kafka streams (see the sketch below).
• Deploy Apache Flink with anomaly detection ML models to flag suspicious trading
patterns.
✅ Outcome: No unauthorized changes in stock prices; all manipulations are
detected instantly.
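As an illustration of the Merkle-tree check, a minimal Python sketch (hypothetical trade records; not tied to a specific Kafka client): the root is computed at ingest, and any later modification of a record changes the recomputed root.

import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute a Merkle root by hashing pairs of nodes level by level."""
    level = [sha256(leaf) for leaf in leaves] or [sha256(b"")]
    while len(level) > 1:
        if len(level) % 2 == 1:      # duplicate the last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical micro-batch of trade records from a Kafka topic
batch = [b"AAPL,100,189.20", b"MSFT,50,402.10", b"TSLA,10,171.05"]
root_at_ingest = merkle_root(batch)

batch[1] = b"MSFT,50,999.99"                  # simulated manipulation
assert merkle_root(batch) != root_at_ingest   # tampering is detected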
2. Zero Trust & Just-in-Time (JIT) Access for Brokers
Solution: Prevent insider trading by applying Zero Trust principles and Just-in-
Time (JIT) access.
• Implementation:
• Use Google BeyondCorp to validate every access attempt.
• Restrict high-frequency traders using behavior-based access control.
• Implement OAuth 2.0 with fine-grained API policies (e.g., stock brokers can only access
data relevant to their firm).
• ✅ Outcome: No unauthorized user can pre-access financial transactions for insider trading (a minimal JIT sketch follows).
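A minimal sketch of the Just-in-Time idea (hypothetical in-memory grant store; a real deployment would enforce this at the OAuth 2.0 authorization server):

import time

# Hypothetical grant store: (broker_id, firm) -> access expiry timestamp
jit_grants: dict[tuple[str, str], float] = {}

def grant_jit_access(broker_id: str, firm: str, ttl_seconds: int = 900) -> None:
    """Grant time-boxed access to a single firm's transaction data."""
    jit_grants[(broker_id, firm)] = time.time() + ttl_seconds

def can_read_transactions(broker_id: str, firm: str) -> bool:
    """Zero Trust: every request is re-validated; expired or missing grants fail."""
    expiry = jit_grants.get((broker_id, firm))
    return expiry is not None and time.time() < expiry

grant_jit_access("broker-42", "ACME Capital")
assert can_read_transactions("broker-42", "ACME Capital")        # scoped and in time
assert not can_read_transactions("broker-42", "Other Firm LLC")  # wrong firm: denied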
3. API Security in Microservices Architecture
Solution: Implement Mutual TLS (mTLS) & API Threat Protection.
• Implementation:
• Enforce mTLS authentication between microservices (e.g., Envoy + Istio in
Kubernetes).
• Deploy Cloudflare API Gateway with rate limiting & anomaly detection.
• Use OAuth 2.0 + JWT (JSON Web Tokens) for secure API communication (see the sketch below).
✅ Outcome: APIs handling stock transactions are fully encrypted and secure
from API abuse.
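The mTLS handshake itself is handled by the service mesh (e.g., Istio); at the application layer, a minimal JWT scope check might look like the following sketch using the PyJWT library (secret and scope names are hypothetical):

import jwt  # PyJWT

SECRET = "demo-secret"  # in production: a key managed by a KMS, not a literal

def issue_token(subject: str, scopes: list[str]) -> str:
    """Issue a JWT carrying the caller's allowed scopes."""
    return jwt.encode({"sub": subject, "scope": " ".join(scopes)},
                      SECRET, algorithm="HS256")

def authorize(token: str, required_scope: str) -> bool:
    """Reject any request whose token is invalid or lacks the required scope."""
    try:
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return False
    return required_scope in claims.get("scope", "").split()

token = issue_token("broker-42", scopes=["trades:read"])
assert authorize(token, "trades:read")       # permitted
assert not authorize(token, "trades:write")  # missing scope: denied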
4. Immutable Audit Logs for Regulatory Compliance (SEC, FINRA,
GDPR)
Solution: Store audit logs in AWS Quantum Ledger Database (QLDB) or
Azure Confidential Ledger.
• Implementation:
• Hash audit logs with SHA-256 so any alteration is detectable (illustrated below).
• Restrict log deletion using WORM (Write Once, Read Many) storage.
• Implement Splunk with AI-powered anomaly detection for fraud monitoring.
✅ Outcome: The company achieves full compliance with SEC, GDPR, and
FINRA regulations.
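A simplified stand-in for what QLDB or a confidential ledger provides: each log entry below stores the SHA-256 hash of its predecessor, so rewriting any past entry breaks the chain (all names are illustrative):

import hashlib
import json

def append_entry(log: list[dict], event: dict) -> None:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    log.append({"event": event, "prev_hash": prev_hash, "hash": entry_hash})

def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; a rewritten entry invalidates the rest of the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

audit_log: list[dict] = []
append_entry(audit_log, {"trade": "AAPL", "qty": 100})
append_entry(audit_log, {"trade": "MSFT", "qty": 50})
assert verify_chain(audit_log)
audit_log[0]["event"]["qty"] = 1_000_000  # simulated tampering
assert not verify_chain(audit_log)        # auditors detect it immediately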
2. A healthcare analytics company processes electronic health records
(EHRs), medical images, and patient prescriptions in a multi-cloud
data lake (AWS S3, Google Cloud Storage, Azure Data Lake). A recent
misconfiguration in cloud storage exposed 500,000 patient records,
violating HIPAA (Health Insurance Portability and Accountability Act).
The company must now secure patient data, prevent future breaches,
and comply with HIPAA & GDPR while still enabling medical research
and AI-driven diagnostics.
Questions:
1. How to prevent unauthorized access to patient data in a multi-
cloud environment?
2. How to secure AI models that process sensitive medical records?
3. How to implement fine-grained access control for doctors,
researchers, and insurers?
4. How to detect & prevent accidental misconfigurations in cloud
storage?
5. How to anonymize patient data for research without
compromising AI model accuracy?
1. Prevent Unauthorized Access with Zero Trust & Attribute-
Based Access Control (ABAC)
Zero Trust Security + Multi-Factor Authentication (MFA) + Role-
Based & Attribute-Based Access Control (RBAC & ABAC)
✔ Zero Trust Model:
• Require MFA & Identity Federation (Okta, Azure AD) for all users.
• Block direct database access; use secure APIs with OAuth 2.0 &
JWT authentication.
✔ ABAC for Medical Data Access:
• Doctors can access patient records only in their assigned
hospital.
• Researchers get anonymized data only (without names or
SSNs).
• Insurers can only see billing-related data, not full medical history.
• ✅ Outcome: Unauthorized access is blocked, ensuring HIPAA-compliant handling of patient data (a minimal ABAC sketch follows).
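A minimal ABAC sketch of the three access rules above (attribute and view names are hypothetical):

# Hypothetical ABAC policy: the decision depends on user and record attributes
def allowed_view(user: dict, record: dict) -> str | None:
    """Return the view of a patient record a user may see, or None (deny)."""
    if user["role"] == "doctor" and user.get("hospital") == record["hospital"]:
        return "full_record"     # doctors: only their assigned hospital
    if user["role"] == "researcher":
        return "anonymized"      # researchers: no names or SSNs
    if user["role"] == "insurer":
        return "billing_only"    # insurers: never the full medical history
    return None                  # default deny (Zero Trust)

record = {"patient_id": "P-1001", "hospital": "St. Mary"}
assert allowed_view({"role": "doctor", "hospital": "St. Mary"}, record) == "full_record"
assert allowed_view({"role": "doctor", "hospital": "Elsewhere"}, record) is None
assert allowed_view({"role": "insurer"}, record) == "billing_only"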
2. Securing AI Models that Process Medical Data
Homomorphic Encryption & Secure Multi-Party
Computation (SMPC)
✔ AI models should never process raw patient data:
• Use Google’s Differential Privacy API to add calibrated noise to data before training AI models (the mechanism is sketched below).
• Apply Homomorphic Encryption (HE) so models can process
encrypted medical records without decryption.
✔ Federated Learning for AI on Edge Devices:
• Train AI models on hospital devices instead of centralizing data
in the cloud.
• Use TensorFlow Federated to prevent raw patient data from
leaving hospital networks.
✅ Outcome: AI models maintain accuracy while ensuring full
patient privacy.
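A minimal illustration of the differential-privacy mechanism (Laplace noise on a counting query; Google's library wraps the same idea in a production-grade API):

import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a patient count with epsilon-differential privacy.
    A counting query has sensitivity 1, so the Laplace scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

# Researchers only ever see a noisy count, never the exact one
print(dp_count(true_count=412, epsilon=0.5))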
3. Privacy-Preserving Data Sharing for Medical Research
Anonymization + Tokenization
✔ De-Identification Techniques:
• Remove names, addresses, and social security numbers.
• Replace patient identifiers with randomized tokens using
Google DLP or AWS Comprehend Medical.
✔ Differential Privacy for Medical Datasets:
• Use synthetic data (AI-generated medical records) for
research instead of real patient data.
• Apply k-anonymity to ensure that at least k patients share the same quasi-identifier attributes, making re-identification far more difficult (see the sketch below).
✅ Outcome: The company enables AI-driven healthcare
innovation without compromising patient privacy.
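A small sketch of the tokenization step plus a k-anonymity check (field names are hypothetical; Google DLP and AWS Comprehend Medical automate the detection side):

import hashlib
import hmac
from collections import Counter

TOKEN_KEY = b"rotate-this-key"  # keyed tokens cannot be reversed without the key

def tokenize(identifier: str) -> str:
    """Replace a direct identifier (name, SSN) with a stable keyed token."""
    return hmac.new(TOKEN_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def is_k_anonymous(rows: list[dict], quasi_ids: list[str], k: int) -> bool:
    """Every combination of quasi-identifiers must appear in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"patient": tokenize("Jane Doe"), "zip": "021**", "age_band": "40-49"},
    {"patient": tokenize("John Roe"), "zip": "021**", "age_band": "40-49"},
]
assert is_k_anonymous(rows, quasi_ids=["zip", "age_band"], k=2)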
4. Continuous Cloud Storage Security Monitoring
Automated Cloud Security Posture Management (CSPM)
✔ Misconfiguration Prevention:
• Deploy AWS Macie, Google Security Command Center, and Azure Purview to detect publicly exposed data (see the scan sketch below).
• Enforce real-time alerts for S3 buckets, BigQuery, and Azure Data Lake
misconfigurations.
✔ Encryption & Key Management:
• Encrypt all stored data with AES-256 using Customer-Managed Encryption Keys
(CMEK).
• Rotate encryption keys automatically with AWS KMS, Azure Key Vault, and Google KMS.
✅ Outcome: No accidental public exposure of patient records due to misconfiguration.
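A minimal CSPM-style sketch using boto3 (the AWS SDK for Python) that flags buckets whose ACLs grant access to everyone; managed services such as AWS Macie cover far more misconfiguration classes:

import boto3

ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"

def find_public_buckets() -> list[str]:
    """Return the names of S3 buckets whose ACLs grant access to all users."""
    s3 = boto3.client("s3")
    public = []
    for bucket in s3.list_buckets()["Buckets"]:
        acl = s3.get_bucket_acl(Bucket=bucket["Name"])
        if any(g["Grantee"].get("URI") == ALL_USERS for g in acl["Grants"]):
            public.append(bucket["Name"])
    return public

if __name__ == "__main__":
    for name in find_public_buckets():
        print(f"ALERT: bucket {name} is publicly readable or writable")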
3. A media company stores video streaming analytics, user behavior data, and subscription
details in a multi-cloud data lake (AWS S3, Azure Blob, Google Cloud Storage). A ransomware
attack encrypted petabytes of data, making it inaccessible to both customers and engineers.
Hackers demanded Bitcoin payments to restore access.
1. How to prevent ransomware from encrypting cloud storage data?
2. How to quickly recover encrypted data without paying the ransom?
3. How to enforce immutable backups & versioning in a multi-cloud setup?
4. How to detect unusual encryption activities in real time?
5. How to minimize downtime & revenue loss while restoring data?
4. A medical research lab collaborates with hospitals worldwide
to train AI models for cancer detection using patient MRI scans
and diagnosis reports. However, privacy laws like HIPAA &
GDPR restrict the direct sharing of patient data across borders.
1. How to train AI models on patient data without exposing
sensitive information?
2. How to enable global collaboration without violating HIPAA &
GDPR?
3. How to prevent re-identification when sharing anonymized
patient data?
4. How to ensure that hospitals maintain control over their own
patient records?
5. How to secure AI models against adversarial attacks that
could infer patient data?