Session 29 - MLOps Tools Overview

Recap

13 May 2024 08:10

Session 29 - MLOps Tools Overview Page 1


Core Aspects [Revision]
13 May 2024 08:11



Benefits of MLOps
13 May 2024 16:06

Core Aspects

1. Data Management
2. Development Practices
3. Version Control
4. Experiment Tracking/Model Registry
5. Model Serving and CI/CD
6. Automation
7. Monitoring and Retraining
8. Infrastructure Management
9. Collaboration and Operations
10. Governance and Ethics

Benefits

1. Scalability
2. Improved performance
3. Reproducibility
4. Collaboration and efficiency
5. Risk reduction
6. Cost savings
7. Faster time to market
8. Better compliance and governance



Challenges
13 May 2024 16:06

1. Complexity of ML models [variability, black-box nature]


2. Multitude of models
3. Quality of data
4. Cost and resource constraints
5. Handling scale
6. Security risks
7. Compliance and regulatory concerns
8. Integration with existing systems
9. Limited Expertise/Skill gap



Prerequisites
13 May 2024 16:07

1. Basic understanding of ML

a. Cleaning and preprocessing


b. Feature engineering
c. Model building

2. Software development skills

a. Python
b. Git
c. Software development best practices [OOP, Design Patterns]
d. Linux
e. Virtual environments (venv)
f. YAML (infrastructure as code)

3. Data Engineering

a. SQL
b. Big Data Tech [Spark, Kafka]
c. Data Storage Solutions [Databases, Data Warehouses, Data lakes]

4. DevOps Principles and Tools

a. CI/CD Pipeline
b. Automation

5. Familiarity with cloud platforms

a. AWS, GCP and Azure

6. Containerization technologies

a. Docker
b. Kubernetes

7. Networking Principles

a. Distributed computing

8. Security Fundamentals

a. Cybersecurity fundamentals

9. Soft Skills



MLOps Tools Stack
14 May 2024 09:23

An MLOps tool stack refers to the set of tools and technologies used together to
facilitate the practice of Machine Learning Operations (MLOps). This involves
managing the lifecycle of machine learning models from development through
deployment and maintenance, incorporating principles from DevOps in the
machine learning context. The goal of an MLOps tool stack is to streamline the
process of turning data into actionable insights and models into reliable, scalable,
and maintainable production systems.



1. End to End MLOps Platforms
13 May 2024 08:16

Advantages

• Easy to set up and use
• Standardization and consistency across projects
• Good for quick experimentation
• Reduced IT overheads
• Enhanced security
• Better support

Disadvantages

• Cost
• Vendor lock-in
• Limited options to customize
• Privacy concerns



2. Data Management Tools
13 May 2024 18:23

Key areas:

• Data Ingestion
• Data Transformation
• Data Validation
• Data Pipelines
• Data Versioning
• Data Annotation
• Feature Stores
• Data Observability
• Data Governance
• Data Security



Data Observability

1. Data drift
2. Pipeline failures
3. Monitoring resource usage
4. Throughput
5. Latency

Data Governance

• Data Cataloguing: Helps organizations organize and access their data assets.
• Policy Management: Allows the creation and enforcement of data governance policies.
• Automated Data Lineage: Tracks the origin, movement, and transformation of data.
• Privacy and Security: Helps organizations comply with data protection regulations like GDPR.

Data Security

1. Data encryption at rest, in transit, and in use
2. Role-based access control
3. Data masking and anonymization
4. Audit trails - keeping detailed logs of all data access and changes to track how data is used and by whom
5. Backup and disaster recovery
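Masking and anonymization (items 2-3 above) can be sketched in a few lines; this is a simplified illustration, not a production-grade scheme, and the field names and salt are invented for the example:

```python
import hashlib

def mask_email(email: str) -> str:
    """Mask the local part of an email, keeping only the first character."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def anonymize_id(user_id: str, salt: str) -> str:
    """Replace an identifier with a salted SHA-256 pseudonym (irreversible)."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

record = {"email": "alice@example.com", "user_id": "u-1042"}
masked = {
    "email": mask_email(record["email"]),
    "user_id": anonymize_id(record["user_id"], salt="demo-salt"),
}
```

Masking keeps data readable but non-identifying; hashing with a salt produces a stable pseudonym that still allows joins across tables without exposing the raw ID.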



Feature Store
19 May 2024 14:12

Why use Feature Store?

1. Reusability
2. Transfer of Expertise
3. Data Quality
4. Standardization

General Idea

Examples - Feast, AWS Feature Store
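The general idea can be shown with a toy sketch: a shared store keyed by entity, so training pipelines and serving code read the same feature values. Names here are illustrative; real systems like Feast separate offline and online stores:

```python
class FeatureStore:
    """Minimal in-memory feature store: values keyed by (entity, feature name)."""

    def __init__(self):
        self._features = {}  # {entity_id: {feature_name: value}}

    def write(self, entity_id, features: dict):
        """Materialize computed features for an entity (done once, reused by all teams)."""
        self._features.setdefault(entity_id, {}).update(features)

    def read(self, entity_id, feature_names):
        """Fetch features by name; both training and inference call this."""
        row = self._features.get(entity_id, {})
        return {name: row.get(name) for name in feature_names}

store = FeatureStore()
store.write("user_42", {"avg_order_value": 31.5, "orders_30d": 4})
features = store.read("user_42", ["avg_order_value", "orders_30d"])
```

Because a feature is defined and computed once, it is reusable across models, its quality checks live in one place, and naming is standardized, which is exactly the list of benefits above.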

Architecture



Recap
21 May 2024 14:58



3. Model Building
13 May 2024 23:55

Key areas:

• Code Editors
• Version Control
• Pipelines
• AutoML
• Hyperparameter Tuning
• Model Explainability
• Experiment Tracking
• Model Registry

Reasons to choose VSCode


1. Code Debugging
2. Extensions for powerful MLOps tools like DVC, Docker, Kubernetes, etc.
3. Support for Jupyter notebooks
4. Git Integration
5. Integrated terminal
6. Code Quality (linting and formatting) and Testing



When to use AutoML

1. Limited expertise
2. Simpler ML problems
3. Rapid Prototyping
4. Baseline model creation

When not to use AutoML

1. Complex ML problems
2. Data quality issues
3. Projects that need complex feature engineering
4. Scalability issues
5. Interpretability is important
6. Suboptimal hyperparameter tuning

Experiment tracking in MLOps refers to the process of systematically recording,


organizing, and managing all aspects of machine learning experiments. This
includes logging details about the datasets, preprocessing steps, model
configurations, hyperparameters, performance metrics, and other relevant
artifacts. The goal is to ensure reproducibility, facilitate comparison, and provide
transparency throughout the machine learning development lifecycle.

Why Experiment Tracking

1. Comparison
2. Reproducibility
3. Collaboration

What exactly is tracked?

1. Parameters (learning_rate, batch_size, regularization parameters)
2. Metrics (accuracy, precision, ROC-AUC score)
3. Artifacts (serialized model file, plots)
4. Source code
5. Environment (Python version, library dependencies)
6. Version control information (version control state of the codebase)
7. Unique experiment ID
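In practice this is handled by a tool such as MLflow or Weights & Biases, but conceptually a tracker is little more than structured logging. A stdlib-only sketch (the experiment name and fields are invented for illustration):

```python
import json
import time
import uuid

class ExperimentTracker:
    """Record parameters, metrics, and run metadata for one experiment run."""

    def __init__(self, experiment: str):
        self.run = {
            "experiment": experiment,
            "run_id": uuid.uuid4().hex,   # unique experiment ID
            "started_at": time.time(),
            "params": {},
            "metrics": {},
        }

    def log_param(self, key, value):
        self.run["params"][key] = value

    def log_metric(self, key, value):
        self.run["metrics"][key] = value

    def save(self, path):
        """Persist the run as JSON so runs can be compared and reproduced later."""
        with open(path, "w") as f:
            json.dump(self.run, f, indent=2)

tracker = ExperimentTracker("churn-model")
tracker.log_param("learning_rate", 0.01)
tracker.log_param("batch_size", 64)
tracker.log_metric("roc_auc", 0.91)
```

Comparison across runs then reduces to loading the saved JSON files and diffing their params and metrics.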

A model registry in MLOps is a centralized repository or system that manages the


storage, versioning, and lifecycle of machine learning models. It serves as a
catalog where models are tracked from development to deployment, ensuring
that the best-performing models are readily available for production. The model
registry is essential for managing the complexity of deploying and maintaining
models in a reliable and reproducible manner.

1. Model Versioning

○ Definition: Track different versions of a model as it evolves over time.


○ Tasks:
▪ Assign unique version identifiers to each model version.
▪ Maintain a history of changes and updates for each model.
▪ Support rollback to previous versions if needed.

2. Metadata Management

○ Definition: Store and manage metadata associated with each model.


○ Tasks:
▪ Record model-related information such as author, creation date,
description, and tags.
▪ Store performance metrics, hyperparameters, and training data
references.
▪ Capture details about the environment in which the model was
trained.

3. Model Lineage Tracking

○ Definition: Track the lineage and provenance of models to understand


their development history.
○ Tasks:
▪ Document the data preprocessing steps, feature engineering, and
model training processes.
▪ Link models to the specific experiments and runs that produced
them.
▪ Provide traceability from raw data to the final model.

4. Stage Management

○ Definition: Manage the lifecycle stages of a model (e.g., development,


staging, production).
○ Tasks:
▪ Define and manage different stages such as "development,"
"staging," "production," and "archived."
▪ Support model promotion and demotion between stages based on
validation and approval processes.
▪ Enforce stage-specific policies and controls.

5. Model Storage and Retrieval

○ Definition: Store and provide access to model artifacts.

○ Tasks:
▪ Securely store model files and artifacts in a centralized repository.
▪ Ensure efficient retrieval and loading of models for inference and
further training.
▪ Provide access controls and permissions to manage who can view
and retrieve models.

6. Integration with CI/CD Pipelines

○ Definition: Integrate with continuous integration and continuous


deployment (CI/CD) pipelines.
○ Tasks:
▪ Automate the process of registering new models as part of CI/CD
workflows.
▪ Trigger model validation, testing, and deployment steps based on
CI/CD pipeline stages.
▪ Ensure seamless integration with tools like Jenkins, GitHub Actions,
GitLab CI/CD, and others.
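The core of versioning and stage management fits in a small sketch. This is a toy, in-memory illustration only (the model names and metrics are invented); real registries such as MLflow's add persistent storage, lineage tracking, and access control:

```python
class ModelRegistry:
    """Toy registry: versioned models with lifecycle stages."""

    STAGES = {"development", "staging", "production", "archived"}

    def __init__(self):
        self._models = {}  # {name: [ {version, artifact, stage, metadata} ]}

    def register(self, name, artifact, metadata=None):
        """Assign the next version number and start it in 'development'."""
        versions = self._models.setdefault(name, [])
        version = len(versions) + 1  # unique, monotonically increasing
        versions.append({"version": version, "artifact": artifact,
                         "stage": "development", "metadata": metadata or {}})
        return version

    def transition(self, name, version, stage):
        """Promote or demote a version between lifecycle stages."""
        assert stage in self.STAGES
        self._models[name][version - 1]["stage"] = stage

    def get_production(self, name):
        """Return the latest version currently in production, if any."""
        prod = [m for m in self._models[name] if m["stage"] == "production"]
        return prod[-1] if prod else None

reg = ModelRegistry()
v1 = reg.register("fraud-detector", artifact="models/v1.pkl",
                  metadata={"roc_auc": 0.88})
v2 = reg.register("fraud-detector", artifact="models/v2.pkl",
                  metadata={"roc_auc": 0.91})
reg.transition("fraud-detector", v2, "production")
```

Rollback is then just another `transition` call: demote v2 and promote v1, with the full history preserved.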



Recap
07 June 2024 11:26



4. Deployment & Monitoring
13 May 2024 23:56

Key areas:

• CI/CD
• Containerization
• Orchestration
• Security
• Model Serving
• Monitoring [general tools, cloud-native tools, ML-specific tools]
• Cloud Platforms
• Provisioning
• Infrastructure Management
• Collaboration


Continuous Integration
04 June 2024 07:46

The Problem

The Solution: Continuous Integration

Continuous Integration (CI) is a software development practice where developers


frequently integrate their code changes into a shared repository, typically
multiple times a day. Each integration is automatically verified by running tests to
detect integration errors as quickly as possible. The primary goals of CI are to
improve software quality, reduce the time taken to deliver software, and ensure
that the codebase remains in a deployable state.
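In practice the workflow is configured in a CI service (Jenkins, GitHub Actions, GitLab CI); the control flow those services implement boils down to something like this sketch, where the step names and failure message are invented stand-ins for real lint/test/build commands:

```python
def run_ci_pipeline(steps):
    """Run CI steps in order; stop at the first failure, as a CI server would."""
    results = []
    for name, step in steps:
        try:
            step()
            results.append((name, "passed"))
        except Exception as exc:
            results.append((name, f"failed: {exc}"))
            break  # later steps are skipped once one fails
    return results

def failing_tests():
    raise AssertionError("2 tests failed")

steps = [
    ("lint", lambda: None),      # placeholder for a linter invocation
    ("tests", failing_tests),    # placeholder for the test suite
    ("build", lambda: None),     # never reached: tests failed first
]
report = run_ci_pipeline(steps)
```

The fail-fast behavior is the point: each integration is verified immediately, and a broken step blocks the pipeline before anything is deployed.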

How does it work exactly?

Exact steps executed during the CI workflow


Continuous Deployment
04 June 2024 10:13

The Problem
The manual deployment process

1. Manual integration
2. Manual building
3. Manual testing
4. Manual deployment
5. Manual verification
6. Manual rollback
7. Difficulty in executing different deployment strategies

Disadvantages

1. Time consuming
2. Labor intensive
3. Error prone
4. Deployment downtime

What is CD
Continuous Deployment (CD) is a software development practice where code changes are
automatically deployed to production as soon as they pass predefined automated tests. CD
extends the principles of Continuous Integration (CI) by automating the release process,
ensuring that software is always in a deployable state and can be released quickly and reliably.

Continuous Deployment Workflow


1. Code Commit:
○ Developers commit code changes to the version control system (e.g., Git).

2. Continuous Integration (CI):


○ The CI pipeline is triggered, which checks out the latest code, runs automated tests,
and builds the application.

3. Build and Package:


○ The application is built and packaged into deployable artifacts (e.g., Docker images,
binaries).

4. Development to Staging:
○ The new version is deployed to a staging environment where further automated
tests and validations are performed.

5. Approval (Optional):
○ Some workflows include a manual approval step before deploying to production,
especially for critical applications.

6. Deployment to Production:
○ The new version is automatically deployed to the production environment using
deployment strategies like rolling updates, blue/green deployments, or canary
releases.

7. Monitoring and Rollback:


○ The production environment is monitored for any issues. Automated rollback
mechanisms are in place to revert to the previous version if necessary.
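The monitoring-and-rollback step (7) reduces to a simple control decision: promote the new version only if it passes a health check, otherwise revert. A toy sketch with an invented version scheme and health check:

```python
def deploy(current_version, new_version, health_check):
    """Deploy new_version; roll back to current_version if the health check fails."""
    live = new_version
    if health_check(live):
        return live, "deployed"
    return current_version, "rolled back"

# Hypothetical smoke test: only versions in this set pass their health check.
healthy_versions = {"1.4.2"}

live, status = deploy("1.4.2", "1.5.0",
                      health_check=lambda v: v in healthy_versions)
```

Real systems make the same decision based on production metrics (error rates, latency) over a monitoring window rather than a single boolean check.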

Sample Workflow



Continuous Deployment vs Continuous Delivery


Model Serving
04 June 2024 11:58

Model serving is the process of making the deployed model accessible for inference (i.e.,
predictions) by exposing it through an API or other interface so that applications can send data
to the model and receive predictions.

Main tasks

1. API development - Develop RESTful or gRPC APIs to expose the model.


2. Load the model into the serving environment, ensuring it’s ready for inference.
3. Accept and validate incoming data, perform inference, and return predictions.
4. Handling scalability
5. Handling latency
6. Security
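The first three tasks (expose an endpoint, load the model, validate and predict) fit in a small handler. A framework-free sketch where the feature names and scoring function are invented; real serving would use Flask/FastAPI behind a server, or a dedicated tool like TorchServe or KServe:

```python
import json

def predict(features):
    """Stand-in model: a hand-written linear scorer (weights are invented)."""
    return 0.5 * features["tenure_months"] + 0.25 * features["monthly_usage"]

def handle_request(body: str) -> str:
    """Validate a JSON request, run inference, and return a JSON response."""
    data = json.loads(body)
    required = {"tenure_months", "monthly_usage"}
    missing = required - data.keys()
    if missing:
        return json.dumps({"error": f"missing fields: {sorted(missing)}"})
    return json.dumps({"prediction": predict(data)})

response = handle_request('{"tenure_months": 10, "monthly_usage": 4}')
```

Scalability, latency, and security (tasks 4-6) are then handled around this core: replicas behind a load balancer, batching for throughput, and authentication on the API.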



Cloud Platforms - AWS vs GCP vs Azure
22 May 2024 08:12



Provisioning
06 June 2024 10:30

What is provisioning

Provisioning in MLOps involves creating, configuring, and managing the infrastructure and resources required to support the lifecycle of machine learning models, from development to deployment and serving. This process can be automated and managed using various tools and platforms to ensure efficiency, scalability, and reliability.

Steps in provisioning: define requirements, select infrastructure, provision resources, configure environments, deploy services, set up networking and security.

Types of provisioning
• Manual
• Automated (Infrastructure as Code, declarative syntax)

Benefits of automated provisioning
• Fast
• Less error prone
• Reproducible

Define Requirements

Objective: Determine the computational, storage, and networking needs based on the specific ML workload and use case.
• Compute Requirements: Decide on the types and number of CPUs, GPUs, memory, and disk space needed.
• Storage Requirements: Identify the storage solutions for data, model artifacts, and logs.
• Network Requirements: Plan for networking components to ensure secure and efficient communication between different parts of the infrastructure.

Select Infrastructure

Objective: Choose the appropriate infrastructure to meet the defined requirements,


considering options like cloud, on-premises, or hybrid setups.
• Cloud Providers: AWS, Google Cloud Platform (GCP), Microsoft Azure.
• On-Premises: Using local data centers with physical servers.
• Hybrid Solutions: Combining cloud and on-premises resources.

Provision Resources

Objective: Set up the required infrastructure components using automated tools to ensure
consistency and repeatability.
• Infrastructure as Code (IaC): Use IaC tools to automate the provisioning process.
○ Terraform: Platform-agnostic tool for provisioning infrastructure across multiple
providers.
○ AWS CloudFormation: For provisioning AWS resources.
○ Azure Resource Manager: For provisioning Azure resources.
○ Google Cloud Deployment Manager: For provisioning GCP resources.

Configure Environments

Objective: Install and configure necessary software, libraries, and dependencies in the
provisioned infrastructure.
• Operating Systems: Ensure the correct OS is installed and configured (e.g., Ubuntu,
CentOS).
• Dependencies: Install required libraries and frameworks (e.g., TensorFlow, PyTorch, Scikit-
learn).
• Environment Management: Use tools like Conda or virtual environments to manage
dependencies.



Deploy Services

Objective: Deploy essential services needed for MLOps, including databases, model registries,
and CI/CD pipelines.
• Databases: Set up databases for storing data (e.g., PostgreSQL, MySQL).
• Model Registry: Use a model registry to manage and version models (e.g., MLflow, DVC).
• CI/CD Pipelines: Set up CI/CD pipelines for automating model training, testing, and
deployment (e.g., Jenkins, GitLab CI, GitHub Actions).

Set Up Networking

Objective: Configure networking to ensure secure and efficient communication between


different components.
• VPCs and Subnets: Configure Virtual Private Clouds (VPCs) and subnets to isolate
resources.
• Load Balancers: Set up load balancers to distribute traffic across multiple instances.
• Firewalls and Security Groups: Configure firewalls and security groups to control access.

Implement Security Measures

Objective: Ensure the infrastructure is secure and complies with relevant


regulations.
• Access Control: Set up IAM roles and policies for access management.
• Encryption: Implement encryption for data at rest and in transit (e.g.,
SSL/TLS).



Load Balancers
06 June 2024 14:57

What is a Load Balancer?

A load balancer is a device or software that distributes network or application traffic across
multiple servers. By evenly distributing incoming traffic, load balancers ensure that no single
server becomes overwhelmed, which helps in achieving high availability and reliability for
applications.

Why Do We Need a Load Balancer?

1. Improved Availability and Reliability:

○ Ensures that applications remain accessible even if one or more servers fail.
○ Redirects traffic from failing servers to healthy ones.

2. Scalability:

○ Facilitates horizontal scaling by adding more servers to handle increased traffic.


○ Automatically distributes traffic to new servers as they are added.

3. Optimized Resource Utilization:

○ Balances the load to prevent any single server from becoming a bottleneck.
○ Enhances the overall performance by utilizing all available resources efficiently.

4. Maintenance and Updates:

○ Allows servers to be taken offline for maintenance without disrupting service.


○ Can direct traffic to servers running the latest updates or versions.

5. Security:

○ Can provide a single point of entry, making it easier to enforce security policies.
○ Often integrates with SSL/TLS to encrypt traffic, ensuring secure communication.

How Exactly Does a Load Balancer Work?

1. Traffic Distribution:

○ Incoming Requests: The load balancer receives incoming client requests.


○ Traffic Routing: It uses various algorithms to determine which backend server should
handle each request.

2. Health Checks:

○ Regularly checks the health of backend servers to ensure they can handle requests.
○ Redirects traffic away from unhealthy or down servers.

3. Session Persistence:

○ Ensures that requests from a particular client are directed to the same backend
server for the duration of the session, if needed.
○ Useful for applications that require session affinity (e.g., shopping carts).

4. SSL Termination:

○ Offloads SSL decryption/encryption from backend servers, improving their


performance.
○ Provides a single point to manage SSL certificates.

5. Types of Load Balancers:

○ Hardware Load Balancers: Physical devices deployed in data centers.


○ Software Load Balancers: Installed on servers and can be cloud-based.
○ Cloud Load Balancers: Managed services offered by cloud providers (e.g., AWS Elastic
Load Balancer, Google Cloud Load Balancer).

Load Balancing Algorithms

1. Round Robin:

○ Distributes requests sequentially across all servers.


○ Simple and effective for evenly distributed workloads.

2. Least Connections:

○ Directs traffic to the server with the fewest active connections.


○ Ideal for environments where connection duration varies significantly.

3. IP Hash:

○ Distributes requests based on the client’s IP address.


○ Ensures that a client is consistently routed to the same server.

4. Weighted Round Robin:

○ Distributes requests based on server weight, allowing more powerful servers to


handle more traffic.

5. Random:

○ Distributes traffic randomly among servers.


○ Can be effective in evenly distributed environments.
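The first two algorithms can be sketched directly (server names are placeholders):

```python
import itertools

class RoundRobinBalancer:
    """Cycle through servers in order, one request each."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Send each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.connections = {s: 0 for s in servers}

    def pick(self):
        server = min(self.connections, key=self.connections.get)
        self.connections[server] += 1
        return server

    def release(self, server):
        """Call when a request finishes, so the count reflects active connections."""
        self.connections[server] -= 1

rr = RoundRobinBalancer(["s1", "s2", "s3"])
order = [rr.pick() for _ in range(4)]  # wraps around: s1, s2, s3, s1

lc = LeastConnectionsBalancer(["s1", "s2"])
first, second = lc.pick(), lc.pick()   # one request lands on each server
```

Weighted round robin is the same cycle with each server repeated in proportion to its weight; IP hash replaces the cycle with `hash(client_ip) % len(servers)`.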

Example Workflow

1. Client Request:

○ A client sends a request to access a web application.

2. DNS Resolution:

○ The client’s DNS query is resolved to the load balancer’s IP address.

3. Load Balancer Receives Request:

○ The load balancer receives the incoming request.

4. Health Check:

○ The load balancer checks the health of the backend servers.

5. Routing Decision:

○ Based on the load balancing algorithm, the load balancer selects a healthy server to
handle the request.

6. Forwarding the Request:

○ The load balancer forwards the client’s request to the selected server.

7. Server Response:

○ The backend server processes the request and sends the response back to the load
balancer.

8. Load Balancer Response:

○ The load balancer forwards the server’s response to the client.



Auto Scaling
06 June 2024 14:28

Amazon Auto Scaling Groups (ASGs) are a fundamental feature in AWS that automatically
adjust the number of Amazon EC2 instances in response to changing demand. This ensures
that the right number of EC2 instances are running to handle the load for your application.
Here’s an in-depth look at how ASGs work in AWS, including details on scaling policies,
monitoring with CloudWatch, and cost considerations.

Key Components of Auto Scaling Groups

1. Launch Configuration:

○ Specifies the EC2 instance configuration, including the instance type,
Amazon Machine Image (AMI), key pair, security groups, and other
instance settings.

2. Auto Scaling Group:

○ Defines the minimum, maximum, and desired number of EC2 instances.


○ Associates with one or more subnets for distributing instances across Availability
Zones (AZs) to improve fault tolerance.

3. Scaling Policies:

○ Dynamic Scaling Policies: Automatically adjust the number of instances based on


real-time metrics (e.g., CPU utilization, network traffic).
▪ Target Tracking Scaling: Adjusts the number of instances to keep a specific
metric (e.g., CPU utilization) at the target value.

○ Predictive Scaling Policies: Uses machine learning to predict future traffic and
schedules scaling actions ahead of time.

○ Custom Policies:

▪ Load on ELB: Scales instances based on the load metrics from an Elastic Load
Balancer.
▪ Scheduled Scaling: Allows you to schedule scaling actions to occur at specific
times to handle predictable traffic patterns.
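Target tracking can be approximated with a simple proportional rule; this sketch is not AWS's exact algorithm (the real service also applies cooldowns and instance warm-up), but it shows the core idea of keeping a per-instance metric near its target:

```python
import math

def target_tracking(current_instances, metric_value, target, min_size, max_size):
    """Scale the fleet so the per-instance metric approaches the target.

    If current CPU is metric_value with current_instances running, the same
    total load spread over `desired` instances hits the target when
    desired = ceil(current_instances * metric_value / target).
    """
    desired = math.ceil(current_instances * metric_value / target)
    return max(min_size, min(max_size, desired))  # clamp to ASG bounds

# CPU at 80% against a 50% target: scale out from 4 to 7 instances.
scaled_out = target_tracking(4, metric_value=80, target=50, min_size=2, max_size=10)

# CPU at 20%: scale in, but never below min_size.
scaled_in = target_tracking(4, metric_value=20, target=50, min_size=2, max_size=10)
```

The clamp to the group's minimum and maximum size is what keeps a noisy metric from terminating the whole fleet or launching unbounded instances.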

How ASGs Work

1. Configuration and Creation:

○ Define a launch configuration or launch template that specifies the EC2 instance
settings.
○ Create an ASG and specify the desired, minimum, and maximum number of
instances.
○ Attach the launch configuration or launch template to the ASG.
○ Define the subnets and Availability Zones where the instances will be launched.
○ Configure scaling policies to determine how the group should scale in and out.

2. Launching Instances:

○ The ASG ensures that the desired number of instances are running. If there are fewer
instances than desired, it launches new instances based on the configuration.

3. Monitoring and Scaling:

○ The ASG continuously monitors the instances and the metrics specified in the scaling
policies.
○ When a scaling policy is triggered (e.g., CPU utilization exceeds a target), the ASG
adjusts the number of instances accordingly.
○ Scaling Out: Adds instances when the demand increases.
○ Scaling In: Terminates instances when the demand decreases.

4. Health Management:

○ ASGs perform regular health checks on instances. Health checks can be EC2 status
checks or custom health checks from an Elastic Load Balancer (ELB).
○ If an instance is found to be unhealthy, the ASG terminates it and launches a new
one to replace it.

5. Load Balancing Integration:

○ ASGs can be integrated with Elastic Load Balancers (ELB) to distribute incoming
traffic across the instances in the group.
○ Ensures that traffic is directed to healthy instances and helps maintain high
availability.

Monitoring with CloudWatch

Amazon CloudWatch plays a crucial role in monitoring the performance and health of
resources within an ASG:

• Metrics Collection: CloudWatch collects and tracks metrics for EC2 instances, such as CPU
utilization, network traffic, disk I/O, and more.

• Alarms: CloudWatch Alarms can be configured to trigger scaling actions based on specific
metrics, ensuring that the ASG responds to changes in demand in real time.

• Logs: CloudWatch Logs can store and monitor log data from instances, providing deeper
insights into application performance and issues.

• Dashboards: CloudWatch Dashboards offer a customizable view of metrics and alarms,


allowing for comprehensive monitoring of the ASG and its instances.

Example Scenario

Suppose you have a web application with variable traffic. You can use an ASG to automatically
scale the number of instances based on traffic load:



Cost Considerations

Using Auto Scaling Groups is a free service in AWS. However, you will be charged for the
resources that the ASG provisions and uses:

• EC2 Instances: You pay for the EC2 instances launched by the ASG based on the instance
type and usage.
• Storage: Costs associated with the storage of AMIs and EBS volumes.
• Data Transfer: Charges for data transfer between instances and other AWS services.

By leveraging ASGs, you can ensure that your application maintains high availability, handles
varying levels of traffic efficiently, and operates cost-effectively, with the added benefit of
integrated monitoring and security features.



5. Workflow Management
14 May 2024 01:42

Rollout
Rollback
Retraining



MLOps Maturity Levels
13 May 2024 16:06



How to select a MLOps tool?
13 May 2024 08:18

