MLOps Case Study Questions and Answers
Q1. System design: Based on the above information, describe the KPI that the
business should track.
Answer:
1. Model-level KPIs (technical performance)
• Accuracy: Measures the overall correctness of the model when classifying images as "Dent," "Scratch," or "No damage."
• Precision: Indicates how much of the detected damage (dent/scratch) is classified correctly, which reduces false positives.
• Confusion matrix: Visualizes performance across the different damage categories, helping to identify misclassification patterns (a short sketch of these metrics follows this list).
• Inference latency: Measures the time the model takes to process an image and return a result, which is critical for real-time applications.
2. Business-level KPIs (business impact)
• Price accuracy index: Evaluates how close the resale price estimated using the model's damage assessment is to actual sale prices in the market.
• Reduction in manual inspection costs: Tracks the cost savings achieved by reducing the need for manual car inspections.
• Lead-to-sale conversion rate: Monitors how faster, automated damage detection improves the rate at which leads are converted into successful sales.
• Operational scalability: Measures the system's ability to handle an increasing volume of car images without a drop in performance.
3. Operational KPIs (model health)
• Model drift detection rate: Tracks changes in data distribution (e.g., different lighting conditions in new images) that may affect model performance.
• Model retraining frequency: Tracks how often the model requires retraining due to performance degradation or the availability of newly annotated data.
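As a quick illustration of how the technical and pricing KPIs above could be computed, here is a minimal sketch using scikit-learn for accuracy, precision, and the confusion matrix, and a mean-absolute-percentage-error for the price accuracy index. The label names, predictions, and prices are placeholder inputs, not data from the case study.
```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Placeholder ground truth and predictions for the three damage classes.
labels = ["Dent", "Scratch", "None"]
y_true = ["Dent", "None", "Scratch", "Dent", "None", "Scratch"]
y_pred = ["Dent", "None", "None", "Dent", "Scratch", "Scratch"]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Price accuracy index as mean absolute percentage error (illustrative prices).
predicted_price = np.array([410_000, 325_000, 560_000])
actual_price = np.array([400_000, 350_000, 545_000])
mape = np.mean(np.abs(predicted_price - actual_price) / actual_price) * 100

print(f"Accuracy: {accuracy:.2f}  Macro precision: {precision:.2f}")
print("Confusion matrix (rows = actual, cols = predicted):")
print(cm)
print(f"Price accuracy index (MAPE): {mape:.1f}%")
```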
Q2. System design: Your company has decided to build an MLOps system. What
advantages would you get from building an MLOps system rather than a simple
model?
Answer:
• Scalability: MLOps allows the system to handle large volumes of car images and data, supporting the company's growing operations.
• Automation: The entire ML workflow (data ingestion, model training, deployment, and monitoring) is automated, reducing manual effort and human error.
• Continuous integration and deployment (CI/CD): Enables fast, reliable model updates and improvements without disrupting existing services.
• Model monitoring and drift detection: Continuously tracks the model's performance in production, detects data drift (e.g., due to poor lighting), and triggers retraining when needed.
• Reproducibility and reusability: Maintains a complete record of model versions, experiments, and datasets, so results can be easily replicated and compared.
• Better collaboration: Offers consistent workflows between data scientists, engineers, and business teams through standardized pipelines and shared tooling.
• Cost efficiency: Reduces operating costs by minimizing manual intervention, optimizing resource usage, and shortening time to market for new models.
Q3: System design: You must create an ML system that has the features of a
complete production stack, from experiment tracking to automated model
deployment and monitoring. For this problem, create an ML system design
(diagram)
Answer:
Q4. System design: After creating the architecture, please specify your reason for choosing the specific tools you chose for the use case.
Answer:
Explanation of why I would choose these tools and technologies for this use case:
Data Sources:
• Input: Car images and annotations from multiple sources (user uploads, partner dealerships, legacy systems).
• Function: Consolidate and store raw data in a centralized data lake (e.g., AWS S3).
Data Preprocessing (ETL):
• Function: Clean, augment, and transform the data to prepare it for training.
Model Training & Experiment Tracking:
• Tools: TensorFlow/Keras for model building; MLflow for experiment tracking (logging parameters, metrics, and artifacts).
Model Registry:
• Function: Maintain version control and organize the best-performing models for production use.
Deployment:
• Tools: Docker for containerization; Kubernetes for orchestration; Flask/FastAPI for RESTful API endpoints.
Monitoring:
• Tools: Prometheus and Grafana for metrics, ELK Stack for logging, and custom drift detection solutions.
• Function: Continuously monitor the deployed model's performance, log issues, and trigger alerts if metrics (such as drift or latency) fall out of acceptable ranges.
CI/CD Pipeline:
• Function: Automate testing, building, and deployment processes, ensuring smooth transitions from development to production.
1. Data Ingestion & ETL
Data Ingestion:
What: Collect car images and annotations from various sources (user uploads, partner dealerships, legacy systems).
Where: Store the raw data in a centralized cloud data lake (e.g., AWS S3, Azure Blob Storage).
ETL Pipeline:
Tool: Airflow/Kubeflow
How:
Schedule and orchestrate ETL jobs that extract raw images, clean them, and apply preprocessing steps.
Use Python libraries (e.g., TensorFlow’s ImageDataGenerator) to perform image augmentation (rotation, scaling, brightness adjustments) and normalization.
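A minimal sketch of that augmentation step, assuming Keras' ImageDataGenerator and a hypothetical data/processed directory with one subfolder per class produced by the ETL job; the augmentation ranges are illustrative defaults, not tuned values.
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation and normalization settings are illustrative defaults.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,            # normalize pixel values to [0, 1]
    rotation_range=20,            # random rotation
    zoom_range=0.15,              # random scaling
    brightness_range=(0.7, 1.3),  # brightness adjustments
    validation_split=0.2,         # hold out 20% for validation
)

# "data/processed" is a hypothetical directory with subfolders "Dent", "Scratch", "None".
train_gen = datagen.flow_from_directory(
    "data/processed", target_size=(224, 224), batch_size=32,
    class_mode="categorical", subset="training",
)
val_gen = datagen.flow_from_directory(
    "data/processed", target_size=(224, 224), batch_size=32,
    class_mode="categorical", subset="validation",
)
```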
2. Model Training & Experiment Tracking
Model Building:
Tool: TensorFlow/Keras
How:
Design a Convolutional Neural Network (CNN) to classify images into “Dent,” “Scratch,” or “None.”
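A minimal Keras sketch of such a CNN; the layer sizes and 224x224 input resolution are illustrative assumptions, not the tuned architecture from the case study.
```python
from tensorflow.keras import layers, models

def build_damage_classifier(input_shape=(224, 224, 3), num_classes=3):
    """Small CNN that outputs probabilities for Dent / Scratch / None."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_damage_classifier()
```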
Experiment Tracking:
Tool: MLflow
How:
Log hyperparameters, training metrics (accuracy, loss), and model artifacts during each experiment.
Integration:
The preprocessed data from the ETL pipeline feeds directly into the training scripts, ensuring consistent input for
experiments.
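A sketch of the MLflow tracking step, assuming the model and the train/validation generators from the earlier sketches; the hyperparameter values and run name are placeholders.
```python
import mlflow
import mlflow.keras

# Hyperparameter values and the run name are placeholders.
params = {"epochs": 10, "batch_size": 32, "learning_rate": 1e-3}

with mlflow.start_run(run_name="damage-cnn-baseline"):
    mlflow.log_params(params)
    history = model.fit(train_gen, validation_data=val_gen, epochs=params["epochs"])
    mlflow.log_metric("val_accuracy", max(history.history["val_accuracy"]))
    mlflow.keras.log_model(model, artifact_path="model")  # saved as a run artifact
```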
3. Evaluation & Model Registry
Evaluation:
What: Assess model performance using metrics such as accuracy, precision, recall, and F1-score.
Model Registry:
How:
Maintain version control and metadata (e.g., training parameters, experiment logs) to enable rollback if necessary.
4. Deployment
Containerization:
Tool: Docker
How: Package the trained model along with its inference server (using Flask or FastAPI) into a Docker container.
Orchestration:
Tool: Kubernetes
How:
Use Kubernetes Ingress and Horizontal Pod Autoscaler to manage load balancing and auto-scale the service.
Inference API:
What: Expose a RESTful endpoint that accepts car images and returns the predicted damage classification.
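A minimal FastAPI sketch of such an endpoint; the model path, 224x224 input size, and class names are assumptions carried over from the training sketches.
```python
import io

import numpy as np
import tensorflow as tf
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI(title="Damage Detection API")
model = tf.keras.models.load_model("model")   # path to the exported model is an assumption
CLASS_NAMES = ["Dent", "Scratch", "None"]

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded image and resize it to the model's expected input size.
    image = Image.open(io.BytesIO(await file.read())).convert("RGB").resize((224, 224))
    batch = np.expand_dims(np.asarray(image, dtype="float32") / 255.0, axis=0)
    probs = model.predict(batch)[0]
    return {
        "label": CLASS_NAMES[int(np.argmax(probs))],
        "confidence": float(np.max(probs)),
    }
```
In this setup the service would be started with uvicorn inside the Docker image, and Kubernetes would scale the resulting pods.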
5. Monitoring & Drift Detection
Performance Monitoring:
Tools: Prometheus (for metrics collection) and Grafana (for dashboard visualization)
How:
Monitor key metrics such as inference latency, throughput, and error rates.
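One way to expose these metrics from the Python inference service is with the prometheus_client library; the metric names and the scrape port below are assumptions.
```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric names and the scrape port are assumptions.
INFERENCE_LATENCY = Histogram("inference_latency_seconds", "Time spent per prediction")
PREDICTIONS_TOTAL = Counter("predictions_total", "Predictions served", ["label"])
ERRORS_TOTAL = Counter("prediction_errors_total", "Failed predictions")

start_http_server(8001)  # expose /metrics for Prometheus to scrape

def instrumented_predict(predict_fn, image):
    """Wrap any prediction callable so latency, throughput, and errors are recorded."""
    start = time.time()
    try:
        label = predict_fn(image)
        PREDICTIONS_TOTAL.labels(label=label).inc()
        return label
    except Exception:
        ERRORS_TOTAL.inc()
        raise
    finally:
        INFERENCE_LATENCY.observe(time.time() - start)
```
Grafana dashboards would then be built on top of the scraped latency, throughput, and error series.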
Drift Detection:
Approach:
Implement custom drift detection scripts or use libraries (e.g., Evidently AI) to continuously compare current input
data distributions against historical baselines.
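A sketch of a drift check with Evidently, assuming image-level feature tables (e.g., mean brightness, contrast per image) have been exported for the training baseline and for recent production traffic; the file names are hypothetical, and the Report/DataDriftPreset interface shown may differ slightly between Evidently releases.
```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical feature tables exported at training time (reference)
# and from recent production traffic (current).
reference = pd.read_csv("reference_features.csv")
current = pd.read_csv("current_features.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # reviewed alongside the Grafana dashboards
```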
Logging & Alerting:
How:
Aggregate logs from the deployed service for debugging and historical analysis.
Set up alerts (using Prometheus Alertmanager or PagerDuty) to notify stakeholders if performance or drift metrics
exceed predefined thresholds.
6. Automated Retraining
Retraining Triggers:
When:
If drift is detected (e.g., due to lighting issues) or when new annotated data is available.
How:
The drift monitoring or data ingestion pipeline (monitored via Airflow/Kubeflow) automatically triggers a retraining
job.
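A sketch of how such a trigger could look as an Airflow DAG; check_drift and retrain_model are hypothetical callables wrapping the drift report and the TensorFlow/MLflow training job, and the daily schedule is an assumed cadence.
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

# Hypothetical task functions: check_drift returns True only when drift is found or
# new annotated data has landed; retrain_model launches the TensorFlow/MLflow job.
from pipelines.tasks import check_drift, retrain_model  # hypothetical module

with DAG(
    dag_id="damage_model_retraining",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # re-evaluate drift once a day (assumed cadence)
    catchup=False,
) as dag:
    gate = ShortCircuitOperator(task_id="check_drift", python_callable=check_drift)
    retrain = PythonOperator(task_id="retrain_model", python_callable=retrain_model)
    gate >> retrain  # retraining runs only when the drift check passes
```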
Retraining Pipeline:
Process:
Log new experiments via MLflow and compare against the current production model.
If the updated model performs better, register the new version in the model registry.
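A sketch of that registration step using the MLflow model registry; the registry name "damage-classifier" and the run_id value are assumptions.
```python
import mlflow
from mlflow.tracking import MlflowClient

# run_id is assumed to come from the retraining run that beat the production model;
# "damage-classifier" is a hypothetical registry name.
run_id = "<mlflow-run-id-of-the-better-model>"
result = mlflow.register_model(f"runs:/{run_id}/model", "damage-classifier")

client = MlflowClient()
client.transition_model_version_stage(
    name="damage-classifier",
    version=result.version,
    stage="Production",
    archive_existing_versions=True,
)
```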
CI/CD Pipeline:
How:
Automatically test, build, and deploy new models as soon as they pass integration and performance tests.
7. Feedback Loop
User Feedback:
How:
Collect user and system feedback (via operational metrics and logs) to identify areas for further improvement.
Continuous Improvement:
Outcome:
The feedback loop feeds back into the data ingestion layer, triggering further retraining and fine-tuning of the
model.
Integration Overview
Data Flow:
Raw images → ETL Pipeline (Airflow/Kubeflow) → Preprocessed Data → Training Pipeline (TensorFlow/Keras,
MLflow)
Experimentation & Versioning:
MLflow experiment tracking → Model Registry (versioned, production-ready models)
Real-Time Inference:
Registered model → Docker container (Flask/FastAPI inference server) → Kubernetes deployment → REST predictions
Monitoring & Retraining:
Continuous drift and performance monitoring → Alerting → Automated retraining (triggered via CI/CD)
CI/CD Integration:
Automated build, test, and deployment cycles ensure that new code or models are smoothly transitioned to
production.