Session 29 - MLOps Tools Overview-New
MLOps tool stack components:
1. Data Management
2. Development Practices
3. Version Control
4. Experiment Tracking/Model Registry
5. Model Serving and CI/CD
6. Automation
7. Monitoring and Retraining
8. Infrastructure Management
9. Collaboration and Operations
10. Governance and Ethics
Benefits of MLOps:
1. Scalability
2. Improved performance
3. Reproducibility
4. Collaboration and efficiency
5. Risk reduction
6. Cost Savings
7. Faster time to market
8. Better compliance and governance
1. Basic understanding of ML
a. Python
b. Git
c. Software development best practices [OOP, Design Patterns]
d. Linux
e. Virtual environments (venv)
f. YAML (Infrastructure as Code)
3. Data Engineering
a. SQL
b. Big Data Tech [Spark, Kafka]
c. Data Storage Solutions [Databases, Data Warehouses, Data lakes]
a. CI/CD Pipeline
b. Automation
6. Containerization technologies
a. Docker
b. Kubernetes
7. Networking Principles
a. Distributed computing
8. Security Fundamentals
a. Cybersecurity fundamentals
9. Soft Skills
An MLOps tool stack is the set of tools and technologies used together to practice
Machine Learning Operations (MLOps): managing the lifecycle of machine learning
models from development through deployment and maintenance by applying DevOps
principles in a machine learning context. The goal of an MLOps tool stack is to
streamline the process of turning data into actionable insights and turning models
into reliable, scalable, and maintainable production systems.
Advantages / Disadvantages
Overall Tools
Data Validation
Data Pipeline
Feature Stores
Data Observability
Data Governance
Data Security
1. Reusability
2. Transfer of Expertise
3. Data Quality
4. Standardization
General Idea
Architecture
Pipelines
Model Building
Model Registry
Experiment Tracking
1. Limited expertise
2. Simpler ML problems
3. Rapid Prototyping
4. Baseline model creation
1. Complex ML problems
2. Data quality issues
3. Projects that need complex feature engineering
4. Scalability issues
5. Interpretability is important
6. Suboptimal hyperparameter tuning
1. Comparison
2. Reproducibility
3. Collaboration
1. Model Versioning
2. Metadata Management
4. Stage Management
CI/CD
Containerization
Orchestration
Security
Model Serving
Monitoring
General Tools
Infrastructure Management
Collaboration
The Problem
The manual deployment process
1. Manual integration
2. Manual building
3. Manual testing
4. Manual deployment
5. Manual verification
6. Manual rollback
7. Difficulty in executing different deployment strategies
Disadvantages
1. Time consuming
2. Labor intensive
3. Error prone
4. Deployment downtime
What is CD
Continuous Deployment (CD) is a software development practice where code changes are
automatically deployed to production as soon as they pass predefined automated tests. CD
extends the principles of Continuous Integration (CI) by automating the release process,
ensuring that software is always in a deployable state and can be released quickly and reliably.
4. Deployment to Staging:
○ The new version is deployed to a staging environment where further automated
tests and validations are performed.
5. Approval (Optional):
○ Some workflows include a manual approval step before deploying to production,
especially for critical applications.
6. Deployment to Production:
○ The new version is automatically deployed to the production environment using
deployment strategies like rolling updates, blue/green deployments, or canary
releases.
Sample Workflow
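The stages above can be sketched as a pipeline configuration. The following hypothetical GitHub Actions workflow is only an illustration: the job names, test command, and `deploy.sh` script are assumptions, not part of any real project.

```yaml
# Hypothetical GitHub Actions workflow mirroring build -> staging -> approval -> production.
name: cd-pipeline
on:
  push:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt   # illustrative build step
      - run: pytest tests/                     # automated tests gate the release

  deploy-staging:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh staging               # hypothetical deploy script

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production                    # a protected environment can require manual approval
    steps:
      - run: ./deploy.sh production            # e.g., rolling or blue/green rollout
```

The `environment: production` line is where the optional manual-approval step from the list above would be enforced, via required reviewers on that environment.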
Model serving is the process of making the deployed model accessible for inference (i.e.,
predictions) by exposing it through an API or other interface so that applications can send data
to the model and receive predictions.
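As a minimal sketch of the serving idea, the handler below parses a JSON request, runs inference, and returns a JSON response. The `DummyModel` and the request format are invented for illustration; a real deployment would load a trained model and expose the handler through a web framework.

```python
# Framework-free model-serving sketch; the model and request schema are hypothetical.
import json


class DummyModel:
    """Stands in for a trained model loaded from a model registry."""

    def predict(self, features):
        # Toy rule: predict class 1 when the feature sum is positive.
        return [1 if sum(row) > 0 else 0 for row in features]


MODEL = DummyModel()


def handle_predict(request_body: str) -> str:
    """Parse a JSON inference request, run the model, return a JSON response."""
    payload = json.loads(request_body)
    predictions = MODEL.predict(payload["instances"])
    return json.dumps({"predictions": predictions})
```

In production, this handler would sit behind an HTTP endpoint (e.g., a FastAPI or Flask route) so client applications can POST feature rows and receive predictions.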
Main tasks
Select Infrastructure
Provision Resources
Objective: Set up the required infrastructure components using automated tools to ensure
consistency and repeatability.
• Infrastructure as Code (IaC): Use IaC tools to automate the provisioning process.
○ Terraform: Platform-agnostic tool for provisioning infrastructure across multiple
providers.
○ AWS CloudFormation: For provisioning AWS resources.
○ Azure Resource Manager: For provisioning Azure resources.
○ Google Cloud Deployment Manager: For provisioning GCP resources.
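As an illustration of IaC, a minimal Terraform configuration might look like the sketch below. The region, AMI ID, instance type, and tag are placeholders, not recommendations.

```hcl
# Hypothetical Terraform sketch: provision a single EC2 instance.
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "inference_server" {
  ami           = "ami-0123456789abcdef0"  # placeholder AMI ID
  instance_type = "t3.medium"

  tags = {
    Name = "mlops-inference"
  }
}
```

Because the configuration is declarative and version-controlled, `terraform apply` can recreate the same infrastructure repeatably across environments.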
Configure Environments
Objective: Install and configure necessary software, libraries, and dependencies in the
provisioned infrastructure.
• Operating Systems: Ensure the correct OS is installed and configured (e.g., Ubuntu,
CentOS).
• Dependencies: Install required libraries and frameworks (e.g., TensorFlow, PyTorch, Scikit-
learn).
• Environment Management: Use tools like Conda or virtual environments to manage
dependencies.
Deploy Essential Services
Objective: Deploy essential services needed for MLOps, including databases, model registries,
and CI/CD pipelines.
• Databases: Set up databases for storing data (e.g., PostgreSQL, MySQL).
• Model Registry: Use a model registry to manage and version models (e.g., MLflow, DVC).
• CI/CD Pipelines: Set up CI/CD pipelines for automating model training, testing, and
deployment (e.g., Jenkins, GitLab CI, GitHub Actions).
Set Up Networking
A load balancer is a device or software that distributes network or application traffic across
multiple servers. By evenly distributing incoming traffic, load balancers ensure that no single
server becomes overwhelmed, which helps in achieving high availability and reliability for
applications.
○ Ensures that applications remain accessible even if one or more servers fail.
○ Redirects traffic from failing servers to healthy ones.
2. Scalability:
○ Balances the load to prevent any single server from becoming a bottleneck.
○ Enhances the overall performance by utilizing all available resources efficiently.
5. Security:
○ Can provide a single point of entry, making it easier to enforce security policies.
○ Often integrates with SSL/TLS to encrypt traffic, ensuring secure communication.
1. Traffic Distribution:
2. Health Checks:
○ Regularly checks the health of backend servers to ensure they can handle requests.
○ Redirects traffic away from unhealthy or down servers.
3. Session Persistence:
○ Ensures that requests from a particular client are directed to the same backend
server for the duration of the session, if needed.
○ Useful for applications that require session affinity (e.g., shopping carts).
4. SSL Termination:
1. Round Robin:
2. Least Connections:
3. IP Hash:
5. Random:
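The first three strategies above can be sketched in a few lines of Python. The server names and connection counts are made up for illustration.

```python
# Sketch of common load-balancing strategies (server pool is hypothetical).
import itertools

SERVERS = ["srv-a", "srv-b", "srv-c"]

# 1. Round robin: hand out servers in a repeating cycle.
_cycle = itertools.cycle(SERVERS)


def round_robin() -> str:
    return next(_cycle)


# 2. Least connections: pick the server with the fewest active connections.
active_connections = {"srv-a": 2, "srv-b": 0, "srv-c": 5}


def least_connections() -> str:
    return min(active_connections, key=active_connections.get)


# 3. IP hash: the same client IP always maps to the same server.
def ip_hash(client_ip: str) -> str:
    return SERVERS[hash(client_ip) % len(SERVERS)]
```

IP hash is what gives session persistence "for free": as long as the server pool is unchanged, a client keeps landing on the same backend.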
Example Workflow
1. Client Request:
2. DNS Resolution:
4. Health Check:
5. Routing Decision:
○ Based on the load balancing algorithm, the load balancer selects a healthy server to
handle the request.
○ The load balancer forwards the client’s request to the selected server.
7. Server Response:
○ The backend server processes the request and sends the response back to the load
balancer.
Amazon Auto Scaling Groups (ASGs) are a fundamental feature in AWS that automatically
adjust the number of Amazon EC2 instances in response to changing demand. This ensures
that the right number of EC2 instances are running to handle the load for your application.
Here’s an in-depth look at how ASGs work in AWS, including details on scaling policies,
monitoring with CloudWatch, and cost considerations.
1. Launch Configuration:
3. Scaling Policies:
○ Predictive Scaling Policies: Uses machine learning to predict future traffic and
schedules scaling actions ahead of time.
○ Custom Policies:
▪ Load on ELB: Scales instances based on the load metrics from an Elastic Load
Balancer.
▪ Scheduled Scaling: Allows you to schedule scaling actions to occur at specific
times to handle predictable traffic patterns.
○ Define a launch configuration or launch template that specifies the EC2 instance
settings.
○ Create an ASG and specify the desired, minimum, and maximum number of
instances.
○ Attach the launch configuration or launch template to the ASG.
○ Define the subnets and Availability Zones where the instances will be launched.
○ Configure scaling policies to determine how the group should scale in and out.
2. Launching Instances:
○ The ASG ensures that the desired number of instances are running. If there are fewer
instances than desired, it launches new instances based on the configuration.
○ The ASG continuously monitors the instances and the metrics specified in the scaling
policies.
○ When a scaling policy is triggered (e.g., CPU utilization exceeds a target), the ASG
adjusts the number of instances accordingly.
○ Scaling Out: Adds instances when the demand increases.
○ Scaling In: Terminates instances when the demand decreases.
4. Health Management:
○ ASGs perform regular health checks on instances. Health checks can be EC2 status
checks or custom health checks from an Elastic Load Balancer (ELB).
○ If an instance is found to be unhealthy, the ASG terminates it and launches a new
one to replace it.
○ ASGs can be integrated with Elastic Load Balancers (ELB) to distribute incoming
traffic across the instances in the group.
○ Ensures that traffic is directed to healthy instances and helps maintain high
availability.
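The target-tracking behavior described above can be approximated with a toy calculation: scale the fleet proportionally so average CPU utilization moves toward the target, clamped to the group's minimum and maximum. The numbers below are illustrative, not AWS defaults.

```python
# Toy sketch of a proportional target-tracking scaling decision.
import math


def desired_capacity(current: int, cpu_util: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Return the instance count that would bring average CPU utilization
    toward the target, clamped to the ASG's configured bounds."""
    if current <= 0 or cpu_util <= 0 or target <= 0:
        return min_size
    desired = math.ceil(current * cpu_util / target)
    return max(min_size, min(max_size, desired))
```

For example, 4 instances running at 80% CPU against a 50% target would scale out to 7 instances, while the same fleet at 20% CPU would scale in to the group's minimum.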
Amazon CloudWatch plays a crucial role in monitoring the performance and health of
resources within an ASG:
• Metrics Collection: CloudWatch collects and tracks metrics for EC2 instances, such as CPU
utilization, network traffic, disk I/O, and more.
• Alarms: CloudWatch Alarms can be configured to trigger scaling actions based on specific
metrics, ensuring that the ASG responds to changes in demand in real time.
• Logs: CloudWatch Logs can store and monitor log data from instances, providing deeper
insights into application performance and issues.
Example Scenario
Suppose you have a web application with variable traffic. You can use an ASG to automatically
scale the number of instances up and down with the traffic load.
There is no additional charge for Auto Scaling Groups themselves. However, you are charged
for the resources that the ASG provisions and uses:
• EC2 Instances: You pay for the EC2 instances launched by the ASG based on the instance
type and usage.
• Storage: Costs associated with the storage of AMIs and EBS volumes.
• Data Transfer: Charges for data transfer between instances and other AWS services.
By leveraging ASGs, you can ensure that your application maintains high availability, handles
varying levels of traffic efficiently, and operates cost-effectively, with the added benefit of
integrated monitoring and security features.
Rollout
Rollback
Retraining