Module 3
Case Study: Mars Climate Orbiter
• NASA's Mars Climate Orbiter
• Launched on December 11, 1998
• Intended to enter an orbit 140–150 km above Mars
• On September 23, 1999, it passed too deep into the planet's atmosphere and was destroyed
• Cost: $328M
Case Study: Mars Climate Orbiter
• Cause of failure
• The software controlling the spacecraft's thrusters used inconsistent units
• The software modules were developed by different teams (Lockheed Martin and NASA's JPL)
• Engineers failed to convert the measure of rocket thrust, as sketched below:
• English unit: pound-force (lbf)
• Metric unit: newton (N = kg·m/s²)
• Difference: a factor of ≈ 4.45
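The class of fault above can be reproduced in a few lines. Below is a minimal Python sketch (hypothetical module and function names, not the actual flight software): two modules exchange an impulse value but disagree on units, and a single conversion at the interface fixes it.

```python
# Illustrative only: the real fault was ground software producing thruster
# impulse data in pound-force seconds while the navigation software expected
# newton-seconds.

LBF_TO_NEWTON = 4.44822  # 1 pound-force ≈ 4.448 newtons


def ground_software_impulse():
    """Module A: reports total impulse in pound-force seconds (lbf·s)."""
    return 1000.0


def trajectory_update(impulse_newton_s):
    """Module B: expects newton-seconds (N·s) for its orbit calculations."""
    return impulse_newton_s  # further navigation maths would follow here


# Faulty integration: the value crosses the module interface unconverted,
# so the trajectory model underestimates the impulse by a factor of ~4.45.
wrong = trajectory_update(ground_software_impulse())

# Correct integration: convert at the interface boundary.
right = trajectory_update(ground_software_impulse() * LBF_TO_NEWTON)

print(wrong, right)  # 1000.0 vs ~4448.2
```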
Integration Testing
• Integration testing (sometimes called integration and testing,
abbreviated I&T) is the phase in software testing in which individual
software modules are combined and tested as a group. It occurs after
unit testing and before validation testing.
Objectives
• Understand the purpose of integration testing
• Distinguish typical integration faults from faults that should be eliminated in
unit testing
• Understand the nature of integration faults and how to prevent as well as
detect them
• Understand strategies for ordering construction and testing
• Understand approaches to incremental assembly and testing that reduce effort and control risk
Integration vs. Unit Testing
• Unit (module) testing is a necessary foundation
• Unit level has maximum controllability and visibility
• Integration testing can never compensate for inadequate unit testing
• Integration testing may serve as a process check
• If module faults are revealed in integration testing, they signal inadequate
unit testing
• If integration faults occur in interfaces between correctly implemented modules, the errors can be traced to the module decomposition and interface specifications
Integration Testing
• The entire system is viewed as a collection of subsystems (sets of
classes) determined during the system and object design
• Goal: Test all interfaces between subsystems and the interaction of
subsystems
• The Integration testing strategy determines the order in which the
subsystems are selected for testing and integration.
Why do we do integration testing?
• Unit tests only test the unit in isolation
• Many failures result from faults in the interaction of subsystems
• Often, many off-the-shelf components are used that cannot be unit tested
• Without integration testing, the system test will be very time-consuming
• Failures that are not discovered in integration testing will be
discovered after the system is deployed and can be very expensive.
Integration Testing Strategy
Subsystem = ?
[Figure: layered example system used throughout — subsystem A (Layer I); B, C, D (Layer II); E, F, G (Layer III). Top-down order: Test A (Layer I), then Test A, B, C, D (Layer I + II), then Test A, B, C, D, E, F, G (all layers). Bottom-up order: Test E, Test F, Test G (Layer III), then Test B, E, F; Test C; Test D, G; then Test A, B, C, D, E, F, G.]
A special program is needed to do the testing, the test stub:
A program or a method that simulates the activity of a missing subsystem by answering the calling sequence of the calling subsystem and returning fake data (a minimal sketch follows below).
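For illustration, a minimal Python sketch of a test stub follows; the component names (ReportGenerator, DatabaseStub) are hypothetical, not from the slides. The stub answers the calls that the upper-layer component makes and returns fake data, so the upper layer can be integration-tested before the lower layer exists.

```python
class DatabaseStub:
    """Simulates the missing Database subsystem: accepts the same calls
    and returns canned (fake) data instead of querying a real database."""

    def fetch_orders(self, customer_id):
        return [{"id": 1, "total": 42.0}, {"id": 2, "total": 13.5}]


class ReportGenerator:
    """Upper-layer component under test; depends only on the database interface."""

    def __init__(self, database):
        self.database = database

    def revenue_for(self, customer_id):
        return sum(order["total"] for order in self.database.fetch_orders(customer_id))


def test_revenue_with_stubbed_database():
    # Top-down integration test: the real Database is replaced by the stub.
    report = ReportGenerator(DatabaseStub())
    assert report.revenue_for(customer_id=7) == 55.5


test_revenue_with_stubbed_database()
```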
Pros and Cons of top-down integration testing: Is this enough?
[Figure: bottom-up layer tests on the example system — Test E, Test F, Test G (Layer III) first, then Test B, E, F; Test D, G; Test A, B, C, D; and finally Test A, B, C, D, E, F, G.]
Modified Sandwich Testing Strategy
[Figure: each subsystem of the example system is first tested on its own (Test A; Test B; Test C; Test D; Test E; Test F; Test G), then in double tests across neighbouring layers (Test A, C; Test B, E, F; Test D, G), and finally in the triple test of the whole system (Test A, B, C, D, E, F, G).]
Self reading
Steps in Integration Testing
1. Based on the integration strategy, select a component to be tested. Unit test all the classes in the component.
2. Put the selected component together; do any preliminary fix-up necessary to make the integration test operational (drivers, stubs).
3. Do functional testing: define test cases that exercise all use cases with the selected component.
4. Do structural testing: define test cases that exercise the selected component.
5. Execute performance tests.
6. Keep records of the test cases and testing activities.
7. Repeat steps 1 to 7 until the full system is tested.

The primary goal of integration testing is to identify errors in the (current) component configuration.

Question: What is the relationship among packages, classes, components, interactions, activities, deployments and use cases?
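As a rough illustration of steps 2 and 3, the Python sketch below acts as a small test driver that exercises one use case across two integrated components; the components (Cart, PaymentService) and the use case are invented for the example.

```python
class PaymentService:
    """Lower-layer component; stands in for a real payment gateway client."""

    def charge(self, amount):
        return amount > 0


class Cart:
    """Upper-layer component; integrated with PaymentService through its interface."""

    def __init__(self, payments):
        self.payments = payments
        self.items = []

    def add(self, price):
        self.items.append(price)

    def checkout(self):
        return self.payments.charge(sum(self.items))


def test_checkout_use_case():
    """Functional integration test: exercises the 'add items, then check out'
    use case end to end across the Cart/PaymentService interface."""
    cart = Cart(PaymentService())
    cart.add(19.99)
    cart.add(5.01)
    assert cart.checkout() is True


test_checkout_use_case()  # the driver selects and runs the integration test
```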
Critical factors to consider in deployment
• Ensures software availability and accessibility
• Affects user experience and satisfaction
• Impacts business operations and revenue
• Security and compliance considerations
• Continuous deployment and integration
Deployment
• Proper deployment is crucial for the success of any software
development project.
• By understanding the importance of deployment and following best
practices, you can ensure that your software is available, accessible,
and secure for your end-users.
Trade-offs in the deployment
pipeline
• 1. Speed vs. Stability:
• Fast deployments: Frequent releases and rapid iteration can get features to users quickly. However, this can increase
the risk of introducing bugs and instability.
• Stable deployments: Thorough testing and staged rollouts minimize disruption. This approach can be slower, delaying
the release of new features.
• Trade-off: Finding the right balance between speed and stability depends on the application's criticality. A banking
application might prioritize stability over speed, while a social media app might favor rapid iteration.
• 2. Automation vs. Control:
• Full automation: CI/CD pipelines automate the entire deployment process, from code commit to production release.
This increases speed and reduces human error but can make it harder to intervene in case of issues.
• Manual control: Manual approvals and interventions at various stages provide greater control but slow down the
deployment process.
• Trade-off: The level of automation should align with the organization's risk tolerance and the complexity of the
deployment process. Highly regulated industries may require more manual control.
• 3. Cost vs. Performance:
• High-performance infrastructure: Using top-tier hardware and cloud services ensures optimal performance but
increases costs.
• Cost-effective infrastructure: Using less expensive resources can reduce costs but may impact performance, especially
under high load.
• Trade-off: Balancing cost and performance requires careful capacity planning and performance testing. Consider using
auto-scaling to dynamically adjust resources based on demand.
• 4. Security vs. Speed:
• Rigorous security checks: Integrating security testing and vulnerability scanning into the pipeline
increases security but can slow down deployments.
• Faster deployments: Skipping or reducing security checks can accelerate deployments but increases
the risk of security vulnerabilities.
• Trade-off: Security should never be an afterthought. Implement security measures throughout the
pipeline, but optimize them to minimize their impact on speed. Shift-left security testing can help.
• 5. Complexity vs. Maintainability:
• Complex pipelines: Highly customized and intricate pipelines can handle complex deployment
scenarios but can be difficult to maintain and troubleshoot.
• Simple pipelines: Simpler pipelines are easier to maintain but may not be able to handle complex
requirements.
• Trade-off: Strive for simplicity while ensuring the pipeline meets the necessary requirements. Use
modular design and well-defined stages to improve maintainability.
• 6. Feature Flags vs. Full Deployments:
• Feature flags: Enable or disable features in production without requiring a full deployment. This
allows for A/B testing and phased rollouts but adds complexity to the code.
• Full deployments: Deploying entire features at once is simpler but carries a higher risk.
• Trade-off: Feature flags offer greater control and flexibility but require careful management. Use
them strategically for features that are experimental or high-risk.
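To make the feature-flag trade-off concrete, here is a minimal Python sketch of a percentage-based rollout behind a flag; the flag names, hashing scheme, and thresholds are illustrative assumptions, not any specific feature-flag product.

```python
import hashlib

FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 20},  # phased rollout
    "dark_mode": {"enabled": False, "rollout_percent": 0},
}


def is_enabled(flag, user_id):
    """Deterministically bucket users so each user always sees the same variant."""
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < cfg["rollout_percent"]


def checkout(user_id):
    # Old and new code paths coexist in production; widening the rollout or
    # rolling back is a configuration change, not a redeployment.
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "legacy checkout flow"


print(checkout("user-42"))
```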
Basic deployment pipeline
Key Components of a Deployment
Pipeline
• Version Control System (VCS): It is a central repository that tracks and manages code changes made by
developers. It allows multiple developers to collaborate on the same codebase, facilitates versioning, and
provides a historical record of changes. Popular VCS options include Git, SVN, and Mercurial.
• Build Server: It is responsible for automatically compiling and packaging the code whenever changes are
pushed to the VCS. It takes the code from the repository and transforms it into executable artifacts that are
ready for testing and deployment. Continuous integration tools like Jenkins, GitLab CI/CD, and Travis CI are
often used to set up build servers.
• Automated Testing: It includes various types of tests, such as unit tests, integration tests, functional tests,
and performance tests. Automated testing ensures that any defects or issues are caught early in the
development process, reducing the risk of introducing bugs into production.
• Artifact Repository: The artifact repository stores the build artifacts generated by the build server. These
artifacts are the build process output and represent the compiled and packaged code ready for deployment.
A centralized artifact repository ensures consistent and reliable deployments across different environments,
such as staging and production.
• Deployment Automation: Deployment automation streamlines the application deployment process to
various environments. It involves setting up automated deployment scripts and configuration management
tools to ensure a consistent deployment process. By automating the deployment process, organizations can
reduce manual errors and maintain high consistency in their deployments.
Main Stages of a Deployment
Pipeline
• Stage 1: Commit Stage:
• The deployment pipeline starts with the Commit stage, triggered by code commits to the version control system (VCS). In this stage, the
code changes are fetched from the VCS, and the build server automatically compiles the code, running any pre-build tasks required. The
code is then subjected to static code analysis to identify potential issues, such as coding standards violations or security vulnerabilities. If
the code passes these initial checks, the build artifacts are generated, which serve as the foundation for subsequent stages.
• Stage 2: Automated Testing Stage:
• After the successful compilation and artifact generation in the Commit stage, the next stage involves automated testing. Various tests are
executed in this stage to ensure the code’s functionality, reliability, and performance.
• Unit tests, which validate individual code components, are run first, followed by integration tests that verify interactions between
different components or modules.
• Functional tests check whether the application behaves as expected from an end-user perspective.
• Stage 3: Staging Deployment:
• The application is deployed to a staging environment once the code changes have successfully passed the automated testing stage. The
staging environment resembles the production environment, allowing for thorough testing under conditions that simulate real-world
usage. This stage provides a final check before the application is promoted to production.
• Stage 4: Production Deployment:
• The final stage of the deployment pipeline is the Production Deployment stage. Once the application has passed all the previous stages
and received approval from stakeholders, it is deployed to the production environment.
• This stage requires extra caution as any issues or bugs introduced into the production environment can have significant consequences.
• To minimize risk, organizations often use deployment strategies such as canary releases, blue-green deployments, or feature toggles to
control the release process and enable easy rollback in case of any problems.
• Continuous production environment monitoring is also essential to ensure the application’s stability and performance.
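The four stages can be sketched as a simple orchestration script. This is only an illustration: the shell commands assume a Python project (ruff, pytest, python -m build), the `deploy` command is hypothetical, and a real pipeline would be defined in a CI system such as Jenkins or GitLab CI/CD.

```python
import subprocess
import sys


def run(cmd):
    print(f"$ {cmd}")
    subprocess.run(cmd, shell=True, check=True)  # a non-zero exit fails the stage


def commit_stage():
    run("ruff check .")       # static code analysis
    run("python -m build")    # compile/package into build artifacts


def automated_testing_stage():
    run("pytest tests/unit")         # unit tests first (fast feedback)
    run("pytest tests/integration")  # then interactions between components
    run("pytest tests/functional")   # then end-user functional tests


def staging_deployment():
    run("deploy --env staging dist/app-1.0.0.tar.gz")  # hypothetical deploy CLI


def production_deployment():
    run("deploy --env production --strategy canary dist/app-1.0.0.tar.gz")


if __name__ == "__main__":
    try:
        for stage in (commit_stage, automated_testing_stage,
                      staging_deployment, production_deployment):
            stage()
    except subprocess.CalledProcessError:
        sys.exit("Pipeline stopped: a stage failed, so nothing is promoted further.")
```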
Deployment Pipeline Practices
• A deployment pipeline (also known as a Continuous Delivery pipeline) is a series of automated steps that take code changes from
version control to production. Effective practices are essential for a smooth, reliable, and efficient pipeline.
• Continuous Integration (CI): This is the foundation. Developers frequently integrate their code changes into a shared repository.
Each integration triggers an automated build and test process. CI helps catch integration issues early.
• Continuous Delivery (CD): CD extends CI by automating the release process. Changes that pass all tests are automatically
deployed to various environments (staging, testing, production). CD emphasizes frequent and reliable releases.
• Infrastructure as Code (IaC): Manage and provision infrastructure through code, enabling automation and repeatability. Tools like
Terraform and CloudFormation are used.
• Automated Testing: Implement a comprehensive suite of automated tests at different levels (unit, integration, system,
acceptance). This ensures code quality and reduces the risk of regressions.
• Version Control: Use a version control system (e.g., Git) to track code changes, collaborate effectively, and revert to previous
versions if needed.
• Configuration Management: Manage application configurations separately from the code. This allows for easy changes to
different environments.
• Monitoring and Alerting: Monitor the application and infrastructure in real-time to detect issues and receive alerts. Tools like
Prometheus and Grafana are commonly used.
• Rollback Strategy: Have a well-defined rollback plan in case a deployment introduces problems. This should allow for quickly
reverting to the previous version.
• Pipeline Visualization: Visualize the pipeline using a tool (e.g., Jenkins, GitLab CI/CD). This provides a clear overview of the
deployment process and helps identify bottlenecks.
• Feedback Loops: Establish feedback loops throughout the pipeline.
Developers should be notified of build failures, test failures, and production
issues.
• Small and Frequent Deployments: Deploy changes in small increments to
reduce risk and make it easier to identify and fix problems.
• Trunk-Based Development: Encourage developers to commit directly to the
main branch (trunk) rather than using long-lived feature branches. This
promotes faster integration and reduces merge conflicts.
• Deployment Strategies: Use appropriate deployment strategies like
blue/green deployments, canary releases, or rolling updates to minimize
downtime and risk.
The Commit Stage
• The commit stage is the very first stage of the deployment pipeline. It's where the journey from code change to
potential release begins. Think of it as the gatekeeper. Its primary goal is to quickly validate code changes before
they are integrated into the main codebase.
• Key activities in the commit stage:
• Code Commit: A developer commits their code changes to the version control system (typically a feature branch
or directly to the trunk in trunk-based development).
• Automated Build: The pipeline automatically triggers a build process. This compiles the code, links libraries, and
creates an executable or deployable artifact.
• Automated Testing (Subset): A subset of automated tests, especially unit tests and some quick integration tests,
are run. The focus here is on speed. These tests should be fast and provide immediate feedback on the code's
basic correctness.
• Static Code Analysis: Tools are used to analyze the code without executing it. This helps identify potential issues
like code style violations, security vulnerabilities, and code smells.
• Code Coverage Check: Ensure that the tests cover a sufficient percentage of the codebase.
• Quick Feedback: The results of the build, tests, and analysis are provided to the developer as quickly as possible.
If the commit stage fails, the developer needs to fix the issues before proceeding.
• Importance of the Commit Stage:
• Early Bug Detection: Catching errors early prevents them from propagating
through the pipeline and causing more significant problems later.
• Fast Feedback: Provides developers with immediate feedback on their code
changes, allowing them to fix issues quickly.
• Improved Code Quality: Encourages developers to write better code by
providing immediate feedback on code style and potential issues.
• Reduced Risk: Prevents broken code from being integrated into the main
codebase, reducing the risk of deployment failures.
• Faster Iteration: Enables developers to iterate quickly by providing rapid
feedback on their changes.
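One of the commit-stage activities listed above, the code coverage check, can be sketched as a small gate script. It assumes the coverage.json layout produced by coverage.py's `coverage json` command and an example 80% threshold.

```python
import json
import sys

THRESHOLD = 80.0  # example policy; teams pick their own minimum


def coverage_gate(report_path="coverage.json"):
    with open(report_path) as fh:
        report = json.load(fh)
    percent = report["totals"]["percent_covered"]  # coverage.py JSON report field
    if percent < THRESHOLD:
        sys.exit(f"Commit stage failed: coverage {percent:.1f}% < {THRESHOLD}%")
    print(f"Coverage OK: {percent:.1f}%")


if __name__ == "__main__":
    coverage_gate()
```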
Automated Acceptance Test Gate
• The Automated Acceptance Test Gate is a crucial step in a Continuous Delivery
pipeline, acting as a quality checkpoint before code changes are released to
production (or even staging environments). It ensures that the software meets
the defined acceptance criteria from a user or business perspective. It's a "gate"
because it can stop the pipeline if the acceptance tests fail.
• What it is:
• The Automated Acceptance Test Gate is a stage in the deployment pipeline
where automated acceptance tests are executed. These tests are designed to
mimic real-world user scenarios and verify that the software behaves as
expected from the user's point of view. They focus on the "what" of the software's
functionality, not the "how" of its implementation (that is the job of unit and
integration tests).
• Why it's important:
• Ensures Business Value: Acceptance tests confirm that the software delivers the intended business value and meets
user needs. They bridge the gap between development and business requirements.
• Reduces Risk: By catching functional issues early, the gate prevents defective software from reaching production,
minimizing the risk of user dissatisfaction and business disruption.
• Increases Confidence: Successful completion of the acceptance tests provides confidence that the software is ready
for release.
• Facilitates Communication: Acceptance tests serve as a clear and shared understanding of the software's
functionality between developers, testers, and stakeholders.
• Enables Continuous Delivery: Automated acceptance tests are essential for achieving continuous delivery, as they
allow for automated and reliable releases.
• How it works:
• Deployment to Test Environment: The code changes that have passed earlier stages of the pipeline (like the commit
stage and integration tests) are deployed to a dedicated test environment. This environment should closely resemble
the production environment.
• Automated Acceptance Tests Execution: The automated acceptance tests are executed against the deployed
software. These tests can be written using various frameworks and tools, such as Selenium, Cucumber, or Cypress.
• Test Results: The results of the acceptance tests are collected. This typically includes a pass/fail status for each test
case.
• Gate Decision: The automated test gate evaluates the test results. If all (or a pre-defined percentage) of the
acceptance tests pass, the pipeline proceeds to the next stage (e.g., deployment to staging or production). If any test
fails, the gate stops the pipeline.
• Feedback: If the gate fails, immediate feedback is provided to the development team. They can then investigate the
failed tests, fix the issues, and restart the pipeline.
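A minimal sketch of the gate decision itself, assuming the acceptance-test results have already been collected as (name, passed) pairs; the default 100% pass policy is an example and could be relaxed to a pre-agreed percentage.

```python
def acceptance_gate(results, required_pass_rate=1.0):
    """Return True to let the pipeline proceed, False to stop it."""
    results = list(results)
    passed = sum(1 for _, ok in results if ok)
    rate = passed / len(results) if results else 0.0
    for name, ok in results:
        if not ok:
            print(f"FAILED acceptance test: {name}")  # immediate feedback to the team
    return rate >= required_pass_rate


if acceptance_gate([("checkout_happy_path", True), ("refund_flow", False)]):
    print("Gate open: promote to the next stage")
else:
    print("Gate closed: pipeline stopped until the failures are fixed")
```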
Factors involved in monitoring
systems
• White-box and black-box monitoring
What is White Box Monitoring?
White box monitoring is the monitoring of applications running on a server. This could be
anything from the number of HTTP requests your web server is getting to the response
codes generated by your application. Other types of white box monitoring include:
•Monitoring MySQL queries running on a database server.
•Looking at the number of users utilizing a web application throughout the day, and
alerting if this goes above a predefined threshold.
•Considering the above example of HTTP requests—splitting these out into monitoring
the different kinds to ascertain how the application is performing, or whether users are
getting served the correct content. For example, a 403 would demonstrate a user has
tried to get to a part of the website they’re not allowed to visit. Likewise, a 200 would
indicate their request was successful and they were served the content.
•Performing advanced detection of behavior we don’t expect to see, such as a user not
going through the normal steps you’d expect when signing into your application or
resetting a password.
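As a small white-box example, the sketch below counts response codes from web server access log lines and raises an alert when the 4xx/5xx share crosses a threshold. The log format (status code as the ninth space-separated field, as in the common log format) and the 5% threshold are assumptions.

```python
from collections import Counter

ERROR_THRESHOLD = 0.05  # alert if more than 5% of requests are 4xx/5xx


def status_code_report(log_lines):
    codes = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) > 8 and fields[8].isdigit():
            codes[fields[8]] += 1
    total = sum(codes.values()) or 1
    errors = sum(n for code, n in codes.items() if code[0] in "45")
    if errors / total > ERROR_THRESHOLD:
        print(f"ALERT: {errors}/{total} requests returned 4xx/5xx")
    return codes


sample = [
    '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /admin HTTP/1.1" 403 199',
    '127.0.0.1 - - [10/Oct/2024:13:55:37 +0000] "GET / HTTP/1.1" 200 2326',
]
print(status_code_report(sample))
```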
What is Black Box Monitoring?
Black box monitoring refers to the monitoring of servers with a focus on areas such as disk space, CPU usage,
memory usage, load averages, etc. These are what most in the industry would deem as the standard system
metrics to monitor. Other types of black box monitoring include:
•Monitoring of network switches and other networking devices, such as load balancers, from the system metrics perspective, as defined above.
•Looking at hypervisor-level resource usage for all virtual machines running on the hypervisor (such as VMware, KVM, Xen, etc.).
•Alerting on hard disk errors that may present a problem if a disk isn't replaced soon (using SMART, for instance).
Differences Between Black Box and White Box Monitoring
There are differences between these two types of monitoring. Traditionally, systems
administrators would take care of both white and black box monitoring; however, with the
advent of DevOps and modern changes in the IT industry, we’re increasingly finding that
application developers are taking responsibility for the monitoring of the applications (white
box) they’re writing and as a result, are building monitoring solutions or writing checks for
monitoring systems deployed by DevOps engineers.
Systems administrators and DevOps engineers tend to take responsibility for the monitoring of black box items
such as servers. There is some crossover where DevOps engineers can also take responsibility for white box
monitoring, but this depends on the business or environment you’re working in.
Building a monitoring system
• Stage 1: Inventory Collection
• Start by gathering a comprehensive inventory of your systems: cloud resources, applications, servers,
processes, databases, and auxiliary systems such as caches and brokers.
• Stage 2: Identifying Metrics
• Once your inventory is ready, identify the critical metrics for each component. This step can be detail-
oriented, and a spreadsheet can prove helpful to keep things organized. Here's what you should
monitor:
• Infrastructure Services: CPU, RAM, Disk usage, Process status (e.g., Nginx web server, PM2 app
server), Load balancer metrics, Auto-scaling group metrics, Container metrics, Kubernetes resource
usage.
• Application Services: HTTP endpoint status codes, Latency, SSL status and expiry, Health checks, and
APM metrics like transaction time, error rates, and slow transactions.
• Databases: CPU/RAM/Disk-I/O, Read/Write throughput, Active connections, Latency.
• Message Brokers (Kafka, RabbitMQ): Message rate, Queue size, Consumer lag, Message delivery rate.
• Caches (Redis): Hit/Miss ratios, Memory usage, Evictions, Average retrieval time.
• Stage 3: Selection of Tools
• For each metric, identify the appropriate agent or exporter to extract data from the source and store it in the
destination. This decision significantly influences the monitoring stack you'll adopt.
• Evaluate the available tools based on your functional requirements, pricing, operational overhead, features,
vendor lock-in, etc. Here are some leading tools in the market:
• LGTM Stack (Formerly Prometheus stack): An array of components from Grafana Labs, including Loki for log
management, Grafana for visualization, Tempo for Tracing, and Mimir for storing metrics in time series.
• ELK Stack: Comprehensive log management and Security Incident and Event Management (SIEM). It supports logs
from various sources (like servers, applications, load balancers, s3 buckets, cloud trail etc), with Elastic APM
support.
• New Relic: New Relic's cloud-based platform offers real-time insights into your application performance, user
experiences, and the overall success of your digital initiatives. It combines four independent yet interoperable tools
- APM, Infrastructure, Logs, and Digital Experience Monitoring - providing a comprehensive observability platform.
• DataDog: A unified, SaaS-based monitoring and analytics platform that provides full visibility into the performance
of your applications, network, and infrastructure. With support for over 400 technologies out-of-the-box, DataDog
integrates with a wide range of applications, cloud platforms, and services.
• Cloud-Native Services: Cloud providers offer built-in monitoring services such as Amazon CloudWatch for AWS,
Google Stackdriver for GCP, and Azure Monitor for Microsoft Azure. These services provide resource and
application monitoring that collects and tracks metrics, collects and monitors log files, and responds to system-
wide performance changes.
• Stage 4: Alerts and Notifications 🔔
• Remember, any alert delivered should be an actionable alert. Avoid false alarms as they tend to bury the actual
ones underneath.
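As an illustration of Stages 2 and 3, the sketch below exposes a few such metrics with the Python prometheus_client library; the metric names, port, and simulated values are example choices, and scraping plus alerting would be configured separately in Prometheus and Alertmanager.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

HTTP_REQUESTS = Counter("app_http_requests_total", "HTTP requests", ["status"])
ACTIVE_CONNECTIONS = Gauge("app_db_active_connections", "Active DB connections")
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency")


def handle_fake_request():
    with REQUEST_LATENCY.time():              # observe request latency
        time.sleep(random.uniform(0.01, 0.1))
    HTTP_REQUESTS.labels(status=random.choice(["200", "200", "500"])).inc()
    ACTIVE_CONNECTIONS.set(random.randint(1, 20))


if __name__ == "__main__":
    start_http_server(8000)                   # metrics exposed at :8000/metrics
    while True:                               # a real app would do this per request
        handle_fake_request()
```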
Monitoring infrastructure and
applications
1. Why Monitoring is Crucial in DevOps?
• Early Issue Detection: Identifies failures before they impact users.
• Performance Optimization: Ensures applications run efficiently.
• Automation & Self-Healing: Triggers auto-scaling and remediation.
• Improved Collaboration: Shared dashboards help DevOps teams work efficiently.
• Faster Deployment Cycles: Detects deployment failures in CI/CD pipelines.
2. Key Monitoring Areas in DevOps
a) Infrastructure Monitoring
• Monitors servers, cloud instances, containers, and networks.
What to Monitor?
• ✅ CPU, Memory, Disk Usage
✅ Network Latency & Bandwidth
✅ Server Uptime & Availability
✅ Kubernetes Cluster Health
Tools
• Prometheus & Grafana – Open-source, best for time-series data.
• Datadog – Full-stack observability with AI-driven insights.
• Zabbix – Enterprise-grade infrastructure monitoring.
b) Application Performance Monitoring (APM)
• Monitors application behavior, dependencies, and transaction flows.
What to Monitor?
• ✅ API Response Times & Throughput
✅ Error Rates & Exception Logs
✅ Database Performance (Slow Queries, Deadlocks)
✅ Microservices & Containerized Workloads
Tools
• New Relic – Comprehensive APM with AI-driven anomaly detection.
c) Log Monitoring & Analysis
• Centralizes and analyzes logs for troubleshooting & security.
What to Monitor?
• ✅ System & Application Logs
✅ CI/CD Pipeline Logs (Build & Deployment Failures)
✅ Security & Audit Logs
Tools
• ELK Stack (Elasticsearch, Logstash, Kibana) – Open-source log aggregation.
• Splunk – AI-powered log analytics.
• Graylog – Scalable, centralized logging platform.
d) CI/CD Pipeline Monitoring
• Ensures smooth software delivery in automated pipelines.
What to Monitor?
• ✅ Build Failures & Errors
✅ Deployment Time & Rollback Frequency
✅ Performance Issues Post-Deployment
Tools
• Jenkins Monitoring Plugins – Tracks build health.
• GitLab CI/CD Monitoring – Monitors pipeline success rates.
• Azure DevOps & AWS CodePipeline Metrics – Provides insights into deployments.
e) Kubernetes & Container Monitoring
• Ensures containerized workloads run efficiently.
What to Monitor?
• ✅ Container Resource Utilization
✅ Pod Health & Node Availability
✅ Kubernetes Cluster Logs & Errors
Tools
• Kubernetes Metrics Server – Native resource monitoring.
• Prometheus + Grafana – Real-time container performance monitoring.
f) Security & Compliance Monitoring
• Protects against vulnerabilities and security breaches.
What to Monitor?
• ✅ Unauthorized Access Attempts
✅ API Security & Endpoint Monitoring
✅ Compliance with DevSecOps Policies
Tools
• Falco – Kubernetes runtime security monitoring.
• Splunk Security – SIEM for DevSecOps monitoring.
• CrowdStrike Falcon – Endpoint and cloud security.
3. Best Practices for Monitoring in DevOps
✅ Integrate Observability into CI/CD Pipelines – Detect issues early in the release process.
✅ Use Unified Dashboards – Centralized visualization with Grafana, Kibana, or Datadog.
✅ Automate Incident Response – Implement self-healing mechanisms.
✅ Enable Distributed Tracing – Use OpenTelemetry or Jaeger for microservices tracking.
✅ Define SLIs, SLOs, & SLAs – Set clear service level indicators and objectives.
collecting data, logging, creating the
dashboard, behaviour-driven
monitoring
• Why? Data collection is the foundation of monitoring, enabling teams to analyze system performance, detect anomalies,
and make informed decisions.
• Types of Data to Collect
• 🔹 Infrastructure Metrics: CPU, Memory, Disk, Network usage
🔹 Application Metrics: Response times, request rates, error rates
🔹 Logs & Events: System, application, security, and audit logs
🔹 Network Data: Latency, packet loss, bandwidth
🔹 User Behavior Data: User interactions, performance bottlenecks
• Data Collection Tools
• ✅ Prometheus: Collects time-series data for infrastructure & applications
✅ Telegraf: Agent-based data collection for various sources (databases, APIs, cloud)
✅ Fluentd & Logstash: Log and event collection from multiple sources
✅ OpenTelemetry: Standardized tracing, logging, and metrics collection
✅ AWS CloudWatch / Azure Monitor / GCP Stackdriver: Cloud-native monitoring solutions
• Best Practices for Data Collection
• ✔ Collect only essential metrics to avoid excessive storage costs
✔ Use tags & labels to categorize data effectively
✔ Implement data aggregation to minimize noise
• 2. Logging in DevOps
• Logging provides historical context, troubleshooting capabilities, and security auditing.
• Types of Logs to Collect
• 📌 System Logs: OS-level logs (CPU, memory, disk, network issues)
📌 Application Logs: API requests, database queries, error messages
📌 Security Logs: Authentication attempts, firewall alerts, intrusion detection
📌 CI/CD Pipeline Logs: Build failures, deployment errors
• Logging Tools & Frameworks
• ELK Stack (Elasticsearch, Logstash, Kibana): Open-source log collection and visualization
• Splunk: Enterprise-grade log analysis with AI-driven insights
• Graylog: Scalable, real-time log monitoring
• Fluentd & Logstash: Log aggregation and forwarding
• Best Practices for Logging
• ✔ Use structured logging (JSON format) for easy searching & parsing
✔ Enable centralized log management for distributed environments
✔ Set up log rotation & retention policies to manage storage
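A minimal structured-logging sketch using only the Python standard library follows; in practice teams often use libraries such as structlog or python-json-logger, but the idea is the same: one JSON object per log line, easy to parse in Elasticsearch/Logstash or Splunk.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("charge completed")   # emits {"ts": ..., "level": "INFO", ...}
log.error("charge failed")
```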
• 3. Creating Dashboards for DevOps Monitoring
• Dashboards provide a visual representation of system health, performance, and
anomalies.
• What Should a DevOps Dashboard Include?
• 📊 Infrastructure Metrics: CPU, memory, disk, network stats
📊 Application Performance: Latency, throughput, error rates
📊 Security & Compliance: Unauthorized access attempts, firewall logs
📊 CI/CD Pipeline Health: Build success/failure rates, deployment trends
📊 User Experience Metrics: Page load time, response time, transaction success rate
• Dashboard Tools
• ✅ Grafana: Open-source, highly customizable monitoring dashboard
✅ Kibana (ELK Stack): Log analytics and visualization
✅ Datadog: Cloud-based monitoring with AI-driven insights
✅ New Relic & AppDynamics: Application performance dashboards
✅ Cloud-Specific Dashboards: AWS CloudWatch, Azure Monitor, GCP Stackdriver
• Best Practices for Dashboarding
• ✔ Use real-time and historical data to detect trends
✔ Implement role-based access control (RBAC) for security
✔ Set up threshold-based alerts for proactive issue detection
• 4. Behavior-Driven Monitoring (BDM) in DevOps
• What is Behavior-Driven Monitoring?
BDM focuses on monitoring actual user interactions, application behavior, and system responses
instead of just system-level metrics. It helps detect issues that traditional monitoring might miss.
• Key Aspects of BDM
• ✅ Real User Monitoring (RUM): Collects real-time user interactions and performance data
✅ Synthetic Monitoring: Simulates user actions to detect performance issues proactively
✅ Distributed Tracing: Tracks requests across microservices to identify bottlenecks
✅ Anomaly Detection & AI-Driven Alerts: Uses machine learning to detect abnormal behavior
• BDM Tools
• Google Lighthouse: Measures real user experience (performance, accessibility)
• Pingdom & Catchpoint: Synthetic monitoring to simulate user behavior
• New Relic Browser & Dynatrace RUM: Real user experience tracking
• Jaeger / OpenTelemetry: Distributed tracing for microservices
• Best Practices for BDM
• ✔ Monitor real user sessions to detect issues affecting end users
✔ Use AI-based anomaly detection to spot unusual behaviors
✔ Integrate BDM with CI/CD pipelines to catch problems before release
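As an example of synthetic monitoring, the sketch below is a tiny probe that requests a health endpoint, measures latency, and flags failures or slow responses; the URL, latency budget, and schedule are illustrative assumptions (real tools such as Pingdom or Catchpoint run similar probes from many locations).

```python
import time
import urllib.request

URL = "https://example.com/health"  # hypothetical health endpoint
LATENCY_BUDGET = 0.5                # seconds


def probe(url=URL):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    latency = time.monotonic() - start
    status = "OK" if ok and latency <= LATENCY_BUDGET else "ALERT"
    print(f"{status} url={url} up={ok} latency={latency:.3f}s")


if __name__ == "__main__":
    for _ in range(3):              # a real probe runs on a fixed schedule
        probe()
        time.sleep(10)
```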
Site reliability engineering
• Site Reliability Engineering (SRE) is a discipline that applies software
engineering principles to IT operations to improve the reliability,
scalability, and efficiency of systems. It was pioneered at Google to
bridge the gap between development and operations by focusing on
automation, monitoring, and incident management.
• Key Goals of SRE:
• ✅ Improve system reliability and uptime
✅ Automate manual operational tasks
✅ Enhance observability (monitoring, logging, tracing)
✅ Establish error budgets to balance innovation and stability
✅ Implement incident management & postmortems
SRE stands for Site Reliability Engineering, which is a software engineering discipline that focuses on the reliability and
maintainability of complex systems. The goal of SRE is to ensure that the systems and services a company provides are reliable,
scalable, and efficient.
Increased reliability: By implementing SRE practices, companies can reduce the number of outages and downtime, thereby
improving the reliability of their services.
Faster incident resolution: SRE practices emphasize incident response and remediation, which means that companies can resolve
issues more quickly and efficiently.
Improved scalability: SRE practices help companies scale their services as demand grows, which means that they can handle
increased traffic without sacrificing performance.
Better collaboration: SRE teams work closely with software development teams, which fosters collaboration and knowledge
sharing.
Continuous improvement: SRE practices involve continuous monitoring and analysis of system
performance, which means that companies can continually identify areas for improvement and make
changes to optimize their systems.
Overall, implementing SRE practices can help companies build more reliable, scalable, and efficient systems,
which can improve customer satisfaction and reduce the risk of costly outages and downtime.
SRE Use Cases
SRE (Site Reliability Engineering) can be applied to a wide range of use cases and industries, where software
systems and services are critical to business operations. Here are a few examples of how SRE can be used in
different contexts:
1. E-commerce: E-commerce websites and applications are critical to business success, as they enable
companies to sell products and services online. SRE practices can help ensure that these systems are
reliable, scalable, and secure, which can improve customer satisfaction and reduce the risk of lost revenue
due to downtime.
2. Social media: Social media platforms need to handle large amounts of traffic and user data, while also
ensuring the privacy and security of their users. SRE can help these platforms optimize their systems for
performance and scalability, while also implementing robust security measures.
3. Finance: Financial systems and services are critical to business operations, as they handle sensitive
customer data and financial transactions. SRE can help ensure that these systems are reliable, secure, and
scalable, which can reduce the risk of costly outages and security breaches.
4. Healthcare: Healthcare systems and services require high levels of reliability and security, as they
handle sensitive patient data and support critical healthcare operations. SRE can help healthcare
organizations optimize their systems for performance, scalability, and security, which can improve patient
outcomes and reduce the risk of data breaches.
5. Gaming: Online gaming platforms and services require high levels of reliability, scalability, and
performance, as they support millions of users around the world. SRE can help these platforms optimize
their systems for performance and scalability, while also ensuring a smooth user experience.
What is site reliability engineering?
• Site reliability engineering (SRE) is the practice of using software tools to automate IT infrastructure tasks such
as system management and application monitoring. Organizations use SRE to ensure their software
applications remain reliable amidst frequent updates from development teams. SRE especially improves the
reliability of scalable software systems because managing a large system using software is more sustainable
than manually managing hundreds of machines.
Why is site reliability engineering important?
• Site reliability describes the stability and quality of service that an application offers after being made
available to end users. Software maintenance sometimes affects software reliability if technical issues go
undetected. For example, when developers make new changes, they might inadvertently impact the existing
application and cause it to crash for certain use cases.
• The following are some benefits of site reliability engineering (SRE) practices.
• Improved collaboration
• SRE improves collaboration between development and operations teams. Developers often have to make
rapid changes to an application to release new features or fix critical bugs. On the other hand, the operations
team has to ensure seamless service delivery. Hence, the operations team uses SRE practices to closely
monitor every update and promptly respond to any issues that arise due to changes.
• Enhanced customer experience
• Organizations use an SRE model to ensure software errors do not impact the customer experience. For
example, software teams use SRE tools to automate the software development lifecycle. This reduces errors,
meaning the team can prioritize new feature development over bug fixes.
• Improved operations planning
• SRE teams accept that there is a realistic chance of software failing. Therefore, teams plan the
appropriate incident response to minimize the impact of downtime on the business and end users.
What are the key principles in site reliability engineering?
• The following are some key principles of site reliability engineering (SRE).
• Application monitoring
• SRE teams accept that errors are a part of the software deployment process. Instead of striving
for a perfect solution, they monitor software performance in terms of service-level agreements
(SLAs), service-level indicators (SLIs), and service-level objectives (SLOs). They observe and
monitor performance metrics after deploying the application in production environments.
• Gradual change implementation
• SRE practices encourage the release of frequent but small changes to maintain system
reliability. SRE automation tools use consistent and repeatable processes to do the following:
• Reduce risks due to changes
• Provide feedback loops to measure system performance
• Increase speed and efficiency of change implementation
• Automation for reliability improvement
• SRE uses policies and processes that embed reliability principles in every step of the delivery
pipeline. Some strategies that automatically resolve problems include the following:
• Developing quality gates based on service-level objectives to detect issues earlier
• Automating build testing using service-level indicators
• Making architectural decisions that ensure system resiliency at the outset of software
development
What is observability in site reliability engineering?
• Observability is a process that prepares the software team for uncertainties when the software goes live
for end users. Site reliability engineering (SRE) teams use tools to detect abnormal behaviors in the
software and, more importantly, collect information that helps developers understand what causes the
problem. Observability involves collecting the following information with SRE tools.
• Metrics
• Metrics are quantifiable values that reflect an application's performance or system health. SRE teams use
metrics to determine if the software consumes excessive resources or behaves abnormally.
• Logs
• SRE software generates detailed, timestamped information called logs in response to specific events.
Software engineers use logs to understand the chain of events that lead to a particular problem.
• Traces
• Traces are observations of the code path of a specific function in a distributed system. For example,
checking out an order cart might involve the following:
• Tallying the price with the database
• Authenticating with the payment gateway
• Submitting the orders to vendors
• Traces consist of an ID, name, and time. They help software developers detect latency issues and improve
software performance.
What is monitoring in site reliability engineering?
• Monitoring is a process of observing predefined metrics in an application. Developers decide which
parameters are critical in determining the application's health and set them in monitoring tools. Site
reliability engineering (SRE) teams collect critical information that reflects the system performance and
visualize it in charts.
• In SRE, software teams monitor these metrics to gain insight into system reliability.
• Latency
• Latency describes the delay when the application responds to a request. For example, a form submission on
a website takes 3 seconds before it directs users to an acknowledgment webpage.
• Traffic
• Traffic measures the number of users concurrently accessing your service. It helps software teams
accordingly budget computing resources to maintain a satisfactory service level for all users.
• Errors
• An error is a condition where the application fails to perform or deliver according to expectations, for
example, when a webpage fails to load or a transaction does not go through. SRE teams use software tools to
automatically track and respond to errors in the application.
• Saturation
• Saturation indicates the real-time capacity of the application. A high level of saturation usually results in
degrading performance. Site reliability engineers monitor the saturation level and ensure it is below a
particular threshold.
• What are the key metrics for site reliability engineering?
• Site reliability engineering (SRE) teams measure the quality of service delivery and reliability using the following
metrics.
• Service-level objectives
• Service-level objectives (SLOs) are specific and quantifiable goals that you are confident the software can achieve at a
reasonable cost to other metrics, such as the following:
• Uptime, or the time a system is in operation
• System throughput
• System output
• Download rate, or the speed at which the application loads
• An SLO is a promise of service delivery that the software makes to the customer. For example, you set a 99.95% uptime SLO for your
company's food delivery app.
• Service-level indicators
• Service-level indicators (SLIs) are the actual measurements of the metric an SLO defines. In real-life situations, you
might get values that match or differ from the SLO. For example, your application is up and running 99.92% of the
time, which is lower than the promised SLO.
• Service-level agreements
• The service-level agreements (SLAs) are legal documents that state what would happen when one or more SLOs are
not met. For example, the SLA states that the technical team will resolve your customer's issue within 24 hours after a
report is received. If your team could not resolve the problem within the specified duration, you might be obligated to
refund the customer.
• Error budgets
• Error budgets are the noncompliance tolerance for the SLO. For example, an uptime of 99.95% in the SLO means that
the allowed downtime is 0.05%. If the software downtime exceeds the error budget, the software team devotes all
resources and attention to stabilizing the application (a worked example of the arithmetic follows below).
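The error-budget arithmetic can be made explicit with a short worked example in Python; the 30-day month and the measured SLI value are assumptions for illustration.

```python
SLO = 0.9995                         # promised uptime (99.95%)
MONTH_MINUTES = 30 * 24 * 60         # 43,200 minutes in a 30-day month

error_budget_fraction = 1 - SLO                               # 0.0005, i.e. 0.05%
allowed_downtime_min = MONTH_MINUTES * error_budget_fraction  # ≈ 21.6 minutes

measured_sli = 0.9992                                # actual uptime this month
consumed_min = MONTH_MINUTES * (1 - measured_sli)    # ≈ 34.6 minutes of downtime

print(f"Allowed downtime: {allowed_downtime_min:.1f} min/month")
print("Budget exceeded" if consumed_min > allowed_downtime_min else "Budget remaining")
```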
• How does site reliability engineering work?
• Site reliability engineering (SRE) involves the participation of site reliability engineers in a
software team. The SRE team sets the key metrics for SRE and creates an error budget
determined by the system's level of risk tolerance. If the number of errors is low, the
development team can release new features. However, if the errors exceed the permitted
error budget, the team puts new changes on hold and solves existing problems.
• For example, a site reliability engineer uses a service to monitor performance metrics and
detect anomalous application behavior. If there are issues with the application, the SRE
team submits a report to the software engineering team. The developers fix the reported
cases and publish the updated application.
• DevOps
• DevOps is a software culture that breaks down the traditional boundary of development and
operation teams. With DevOps, developers and operation engineers no longer work in silos.
Instead, they use software tools to improve collaboration and keep up with the rapid pace of
software update releases.
• SRE compared to DevOps
• SRE is the practical implementation of DevOps. DevOps provides the philosophical
foundation of what must be done to maintain software quality amidst the increasingly
shortened development timeline. Site reliability engineering offers the answers to how to
achieve DevOps success. SRE ensures that the DevOps team strikes the right balance
between speed and stability.
• What are the responsibilities of a site reliability engineer?
• A site reliability engineer is an IT expert who uses automation tools to monitor and observe
software reliability in the production environment. They are also experienced in finding
problems in software and writing code to fix them. They are typically former system
administrators or operations engineers with good coding skills. The following are some site
reliability engineer responsibilities.
• Operations
• Site reliability engineers spend up to half of their time in operations work. This includes several
tasks, such as the following:
• Emergency incident response
• Change management
• IT infrastructure management
• The engineers use SRE tools to automate several operations tasks and increase team efficiency.
• System support
• Site reliability engineers work closely with the development team to create new features and
stabilize production systems. They create an SRE process for the entire software team and are on
hand to support escalation issues. More importantly, site reliability teams provide documented
procedures to customer support to help them effectively deal with complaints.
• Process improvement
• Site reliability engineers improve the software development lifecycle by holding post-incident
reviews. The SRE team documents all software problems and respective solutions in a shared
knowledge base. This helps the software team efficiently respond to similar issues in the future.
What are the common site reliability engineering tools?
• Site reliability engineering (SRE) teams use different types of tools to facilitate monitoring,
observation, and incident response.
• Container orchestrator
• Software developers use a container orchestrator to run containerized applications on various
platforms. Containerized applications store their code files and related resources within a single
package called a container. For example, software engineers use Amazon Elastic Kubernetes Service
(Amazon EKS) to run and scale cloud applications.
• On-call management tools
• On-call management tools are software that allow SRE teams to plan, arrange, and manage support
personnel who deal with reported software problems. SRE teams use the software to ensure there is
always a support team on standby to receive timely alerts on software issues.
• Incident response tools
• Incident response tools ensure a clear escalation pathway for detected software issues. SRE teams
use incident response tools to categorize the severity of reported cases and deal with them
promptly. The tools can also provide post-incident analysis reports to prevent similar problems from
happening again.
• Configuration management tools
• Configuration management tools are software that automate software workflow. SRE teams use
these tools to remove repetitive tasks and become more productive. For example, site reliability
engineers use AWS OpsWorks to automatically set up and manage servers on AWS environments.