FIS Report

Uploaded by

Bimal Kumar

Table of Contents: -

 Fault Injection Simulator
 Simulation Process
 Availability Zone Failover Testing
 Recovery Testing
 Performance Trend Testing
 Apache JMeter

Fault Injection Simulator:


AWS Fault Injection Simulator (FIS) is a fully managed service for running
fault injection experiments on AWS, making it easier to improve application
performance, observability, and resilience.

Workflow of FIS

Simulation Process :-
Running an AWS Fault Injection Simulator (FIS) experiment means injecting
controlled failures into your AWS environment to test the resilience of your
applications. Experiments can stop EC2 instances, inject network latency, or
simulate an Availability Zone (AZ) failure. The steps below walk through the
process:
1. Prerequisites
 IAM Role: Create an IAM role that grants AWS FIS permission to perform
actions on the targeted resources (e.g., EC2, RDS).
 AWS CLI / Console Access: You can use the AWS Management Console, the
AWS CLI, or an SDK to create and manage FIS experiments.
 AWS Services Setup: Ensure you have running resources such as EC2
instances, Auto Scaling Groups, RDS databases, or other AWS services
that you want to test.
2. Create an IAM Role for FIS
FIS needs permission to perform fault injection activities like stopping EC2
instances or adding network latency. To create the role:
1. Go to the IAM Console.
2. Create a role with the following:
o A trust policy that allows the FIS service (fis.amazonaws.com) to
assume the role.
o Permissions for the resources you want to test (EC2, RDS, Auto
Scaling, etc.). AWS provides managed policies for common cases, such
as AWSFaultInjectionSimulatorEC2Access and
AWSFaultInjectionSimulatorRDSAccess.
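As a minimal sketch of the role setup, the Python below builds a trust policy that lets the FIS service principal assume the role and passes it to IAM through a Boto3 client supplied by the caller. The role name fis-experiment-role and the helper names are illustrative assumptions, not part of the original report. The permissions for EC2, RDS, and so on still have to be attached to the role separately.

```python
import json


def fis_trust_policy() -> dict:
    """Trust policy that allows the AWS FIS service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_fis_role(iam_client, role_name: str = "fis-experiment-role") -> str:
    """Create the role and return its ARN.

    iam_client is expected to be boto3.client("iam"). The default role name
    is a hypothetical placeholder.
    """
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(fis_trust_policy()),
        Description="Role assumed by AWS FIS during experiments",
    )
    return response["Role"]["Arn"]
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before it is pointed at a real account.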
3. Define an AWS FIS Experiment Template
An experiment template is a blueprint that defines the actions, target
resources, and stop conditions under which FIS injects faults.
1. Access FIS in the AWS Management Console:
o Navigate to AWS Fault Injection Simulator.
o Click on Create experiment template.

o Select the target account

For “This account”, the experiment template form contains the following sections:


Description and name:
 Description: Field for entering a description of the experiment.
 Name (optional): Field to assign a name or tag to identify the
experiment template.

Actions and targets:

 Actions: Add specific actions for the experiment (e.g., restarting or
stopping services).
 Targets: Add targets to specify which AWS resources the actions will
be applied to.

Experiment options:
 Empty target resolution mode: Defines how the experiment handles a
situation where no resources are targeted (e.g., "Fail" if no targets
are found).

Service access:
 IAM Role: Specifies the IAM role needed to run the experiment. You can
create a new role or use an existing one (in this case, the role
AWSSIMRole-1729234733559 is selected).

Stop conditions:
 Option to add a CloudWatch alarm as a stop condition that will
terminate the experiment if triggered.

Logs:
 Options to send experiment logs to Amazon S3 or CloudWatch Logs,
with a selection for log version (default is Version 2).

Tags:

 Section to add tags (up to 50) to help manage and identify the
experiment.

For “Multiple accounts”, the experiment template form contains the following sections:

Description and name:
 Description: A field to enter a detailed description of the experiment.
 Name (optional): Field to specify a name or tag for the experiment.

Actions and targets:
 Actions: Allows adding actions that the experiment will take (e.g.,
rebooting or stopping resources).
 Targets: Enables adding specific AWS resources as targets for the
actions.

Experiment options:
 Empty target resolution mode: Defines what happens if no targets
are found. In this case, the selected behavior is "Fail," meaning the
experiment fails if there are no targets.

Target account configurations:
 This section is important for multi-account experiments. It allows
configuring the necessary permissions for the experiment to run across
multiple AWS accounts.
 A notice states, "Target account configuration is required for
multi-account experiments."
 Options to add target account role ARNs (Amazon Resource Names)
manually or upload a file containing ARNs for all target accounts are
available.
 Add new role ARN: A button to add the necessary role ARN for the
target accounts.

Service access:
 IAM Role: Specifies the role required by AWS FIS to run the
experiment. In this case, it is creating a new role named
AWSSIMRole-1729235690655.

Stop conditions:
 Option to select a CloudWatch alarm that will stop the experiment if
certain conditions are met.

Logs:
 Option to send logs to either an Amazon S3 bucket or CloudWatch
Logs.
 Log version: Version of the logs to be used (Version 2 is selected
here).

Tags:
 Section to add tags for easier management and organization of the
experiment.

4. Run the Experiment


Once the experiment template is created, you can execute the experiment.
1. Go to the FIS Console.
2. Select the experiment template you created.
3. Start the experiment and confirm the execution.
Once the experiment starts, FIS will perform the fault injection actions as
defined in the template, such as stopping instances or introducing latency.
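The same start step can be scripted instead of run from the console. The sketch below assumes a Boto3 FIS client created by the caller with boto3.client("fis") and a template ID that already exists. The function name is illustrative.

```python
import uuid


def start_experiment(fis_client, template_id: str) -> str:
    """Start an experiment from an existing template and return its ID.

    fis_client is expected to be boto3.client("fis").
    """
    response = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for this request
        experimentTemplateId=template_id,
    )
    return response["experiment"]["id"]
```

The returned experiment ID is what the console shows on the experiment details page, and it can be used later to query the experiment state.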
5. Monitor the Experiment
You can monitor your experiment in several ways:
 Amazon CloudWatch: Set up metrics and alarms to watch the impact of
the fault injection.
 AWS FIS Console: You can view the progress and logs of the experiment
in the FIS dashboard.
6. Analyze the Results
After the experiment runs, analyze the results:
 Check if your application’s failover mechanisms worked as expected.
 Verify if Auto Scaling properly handled terminated instances.
 Ensure your disaster recovery plans kick in when needed.
 Review CloudWatch logs and alarms to measure how the fault affected
your environment and whether it recovered properly.
7. Tear Down
After the test, you can stop or terminate the instances that were affected by
the experiment, especially if they were manually created just for testing
purposes.

Common Use Cases for FIS Simulations


 Availability Zone Failures: Simulate failure of resources in one AZ to
test failover.
 EC2 Instance Termination: Test how your Auto Scaling Group responds
to instance failures.
 Network Latency Injection: Test how increased latency affects
distributed systems.
 Database Failures: Simulate RDS failure to test the resilience of your
database layers.

Availability Zone Failover Test: -


Objective: -
The purpose of this Availability Zone (AZ) Failover Test is to simulate and
validate failover for a deployment with two nodes in each of two Availability
Zones (AZs), four nodes in total. The test checks that if one AZ fails, the
other AZ maintains availability and continues operations.

The test follows a standardized model: when one AZ becomes unavailable, the
application should shift its workload to the healthy AZ with minimal impact
on performance. Validating this behavior is crucial for keeping downtime low
in multi-AZ deployments and for the overall reliability and resilience of the
application.

Test Model
The test is designed around a standardized model that sets up:
 4 nodes in total
o 2 nodes (EC2 instances) in AZ1
o 2 nodes (EC2 instances) in AZ2
The test utilizes AWS Fault Injection Simulator (FIS) to inject faults, simulate a
failure in one AZ, and monitor the system’s response in the other AZ.

Test Setup
1. AWS Resources
 AWS EC2 Instances:
o Two instances in us-east-1a (AZ1)
o Two instances in us-east-1b (AZ2)
 AWS Fault Injection Simulator (FIS):
o Used to stop the EC2 instances in AZ1 to simulate an outage.
 CloudWatch:
o Used to monitor performance metrics like CPU utilization to ensure that
failover is functioning as expected.
2. Test Steps
Step 1: Create EC2 Instances in Multiple AZs
A Python script using Boto3 provisions two EC2 instances in each of two
AZs (us-east-1a and us-east-1b).
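The original script is not reproduced in this report, so the following is only a sketch of what such a provisioning step could look like. It assumes a Boto3 EC2 client created by the caller with boto3.client("ec2", region_name="us-east-1"), an AMI ID supplied by the caller, and a default VPC with a subnet in each AZ. The instance type and the Purpose tag are illustrative choices.

```python
def launch_params(az: str, ami_id: str, instance_type: str = "t3.micro",
                  count: int = 2) -> dict:
    """Build run_instances arguments for `count` instances in one AZ."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"AvailabilityZone": az},
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                # Hypothetical tag used to find the test instances later.
                "Tags": [{"Key": "Purpose", "Value": "az-failover-test"}],
            }
        ],
    }


def provision(ec2_client, ami_id: str,
              zones=("us-east-1a", "us-east-1b")) -> dict:
    """Launch two instances per AZ and return {az: [instance IDs]}."""
    instance_ids = {}
    for az in zones:
        response = ec2_client.run_instances(**launch_params(az, ami_id))
        instance_ids[az] = [i["InstanceId"] for i in response["Instances"]]
    return instance_ids
```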
 Instances Created:
o AZ1 (us-east-1a):
 Instance ID: i-xxxxxxxxxxxxxxxxx
 Instance ID: i-xxxxxxxxxxxxxxxxx
o AZ2 (us-east-1b):
 Instance ID: i-xxxxxxxxxxxxxxxxx
 Instance ID: i-xxxxxxxxxxxxxxxxx
Step 2: Create AWS FIS Experiment Template
A FIS experiment template is created to simulate an AZ failure by stopping
all instances in AZ1 (us-east-1a). This tests the system's ability to handle
an AZ outage and shift traffic or workloads to the healthy nodes in AZ2 (us-
east-1b).

Create the experiment template using the AWS FIS console. In the template,
you specify two actions that run sequentially for three minutes each. The
first action stops one of the two AZ1 test instances, which AWS FIS chooses
at random. The second action stops both AZ1 test instances.
To create an experiment template
1. Open the AWS FIS console at https://console.aws.amazon.com/fis/.
2. In the navigation pane, choose Experiment templates.
3. Choose Create experiment template.
4. For Description and name, enter a description and a name for the
template.
5. For Actions, do the following:
a. Choose Add action.
b. Enter a name for the action. For example, enter stopOneInstance.
c. For Action type, choose aws:ec2:stop-instances.
d. For Target, keep the target that AWS FIS creates for you.
e. For Action parameters, Start instances after duration, specify 3
minutes (PT3M).
f. Choose Save.
6. For Targets, do the following:
a. Choose Edit for the target that AWS FIS automatically created for you
in the previous step.
b. Replace the default name with a more descriptive name. For example,
enter oneRandomInstance.
c. Verify that Resource type is aws:ec2:instance.
d. For Target method, choose Resource IDs, and then choose the IDs of
the two test instances.
e. For Selection mode, choose Count. For Number of resources,
enter 1.
f. Choose Save.
7. Choose Add target and do the following:
a. Enter a name for the target. For example, enter bothInstances.
b. For Resource type, choose aws:ec2:instance.
c. For Target method, choose Resource IDs, and then choose the IDs of
the two test instances.
d. For Selection mode, choose All.
e. Choose Save.
8. From the Actions section, choose Add action. Do the following:
a. For Name, enter a name for the action. For example,
enter stopBothInstances.
b. For Action type, choose aws:ec2:stop-instances.
c. For Start after, choose the first action that you added
(stopOneInstance).
d. For Target, choose the second target that you added
(bothInstances).
e. For Action parameters, Start instances after duration, specify 3
minutes (PT3M).
f. Choose Save.
9. For Service Access, choose Use an existing IAM role, and then
choose the IAM role that you created as described in the prerequisites
for this tutorial. If your role is not displayed, verify that it has the
required trust relationship. For more information, see IAM roles for AWS
FIS experiments.
10. (Optional) For Tags, choose Add new tag and specify a tag key
and tag value. The tags that you add are applied to your experiment
template, not the experiments that are run using the template.
11. Choose Create experiment template. When prompted for
confirmation, enter create and then choose Create experiment
template.
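The console steps above correspond to a template definition that can also be written down as data. The sketch below builds that definition as a Python dictionary in the shape accepted by the Boto3 call create_experiment_template. The account ID, role ARN, and instance IDs are placeholders supplied by the caller, and the API identifies instances by ARN rather than by bare instance ID. The stop condition is set to "none" here because the console steps do not configure one.

```python
def instance_arn(region: str, account_id: str, instance_id: str) -> str:
    """Build the ARN that FIS expects for an EC2 instance target."""
    return f"arn:aws:ec2:{region}:{account_id}:instance/{instance_id}"


def build_template(role_arn: str, instance_arns: list) -> dict:
    """Two sequential stop actions: one random instance, then both."""
    stop_params = {"startInstancesAfterDuration": "PT3M"}  # restart after 3 minutes
    return {
        "description": "Stop one test instance, then both",
        "roleArn": role_arn,
        "stopConditions": [{"source": "none"}],
        "targets": {
            "oneRandomInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": list(instance_arns),
                "selectionMode": "COUNT(1)",
            },
            "bothInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": list(instance_arns),
                "selectionMode": "ALL",
            },
        },
        "actions": {
            "stopOneInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": dict(stop_params),
                "targets": {"Instances": "oneRandomInstance"},
            },
            "stopBothInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": dict(stop_params),
                "targets": {"Instances": "bothInstances"},
                "startAfter": ["stopOneInstance"],  # run after the first action
            },
        },
    }
```

With a client created as boto3.client("fis"), the dictionary could be passed as fis_client.create_experiment_template(clientToken=..., **build_template(role_arn, arns)).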

 Experiment Template Created:


o Template ID: fis-xxxxxxxxxxxxxxxxx
o Action: Stop instances in AZ1
Start the experiment
When you have finished creating your experiment template, you can use it to
start an experiment.
To start an experiment
1. You should be on the details page for the experiment template that you
just created. Otherwise, choose Experiment templates and then
select the ID of the experiment template to open the details page.
2. Choose Start experiment.
3. (Optional) To add a tag to your experiment, choose Add new tag and
enter a tag key and a tag value.
4. Choose Start experiment. When prompted for confirmation,
enter start and choose Start experiment.
Track the experiment progress
You can track the progress of a running experiment until the experiment is
completed, stopped, or failed.
To track the progress of an experiment
1. You should be on the details page for the experiment that you just
started. Otherwise, choose Experiments and then select the ID of the
experiment to open the details page.
2. To view the state of the experiment, check State in the Details page.
3. When the state of the experiment is Running, go to the next step.
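Tracking can also be done by polling the experiment state from a script. This sketch assumes a Boto3 FIS client created by the caller and treats completed, stopped, and failed as the terminal states, matching the description above. The polling interval and the limit on polls are arbitrary illustrative values.

```python
import time

TERMINAL_STATES = {"completed", "stopped", "failed"}


def wait_for_experiment(fis_client, experiment_id: str,
                        poll_seconds: float = 15, max_polls: int = 120) -> str:
    """Poll the experiment until it reaches a terminal state and return it.

    fis_client is expected to be boto3.client("fis").
    """
    for _ in range(max_polls):
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")
```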
Verify the experiment result
You can verify that the instances were stopped and started by the
experiment as expected.
To verify the result of the experiment
1. Open the Amazon EC2 console
at https://console.aws.amazon.com/ec2/ in a new browser tab or
window. This allows you to continue to track the progress of the
experiment in the AWS FIS console while viewing the result of the
experiment on the Amazon EC2 console.
2. In the navigation pane, choose Instances.
3. When the state of the first action changes
from Pending to Running (AWS FIS console), the state of one of the
target instances changes from Running to Stopped (Amazon EC2
console).
4. After three minutes, the state of the first action changes
to Completed, the state of the second action changes to Running,
and the state of the other target instance changes to Stopped.
5. After three minutes, the state of the second action changes
to Completed, the state of the target instances changes to Running,
and the state of the experiment changes to Completed.
Clean up
If you no longer need the test EC2 instances that you created for this
experiment, you can terminate them.
To terminate the instances
1. Open the Amazon EC2 console
at https://console.aws.amazon.com/ec2/.
2. In the navigation pane, choose Instances.
3. Select both test instances and choose Instance state, Terminate
instance.
4. When prompted for confirmation, choose Terminate.
If you no longer need the experiment template, you can delete it.
To delete an experiment template using the AWS FIS console
1. Open the AWS FIS console at https://console.aws.amazon.com/fis/.
2. In the navigation panel, choose Experiment templates.
3. Select the experiment template, choose Actions, Delete experiment
template.
4. When prompted for confirmation, enter delete and then choose Delete
experiment template.

Step 3: Execute the FIS Experiment


Once the experiment template is set, the experiment is executed to simulate
the failure in AZ1.
 Experiment ID: fis-exp-xxxxxxxxxxxxxxx
 Action Duration: 1 hour (stopping EC2 instances in AZ1)
Step 4: Monitor System Performance (CloudWatch Metrics)
After the FIS experiment is triggered, the health and performance of the
instances in AZ2 are monitored to ensure they are handling the increased
load as expected.
Metrics observed:
 CPU Utilization
 Network Traffic
 Disk I/O
 Metrics Captured for AZ2 Instances:
o Instance ID: i-xxxxxxxxxxxxxxxxx
 CPU Utilization: 40% (Pre-experiment), 60% (During failover)
 Network Traffic: 500 KB/s (Pre-experiment), 800 KB/s (During failover)
o Instance ID: i-xxxxxxxxxxxxxxxxx
 CPU Utilization: 42% (Pre-experiment), 62% (During failover)
 Network Traffic: 480 KB/s (Pre-experiment), 770 KB/s (During failover)
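Figures like these can be pulled from CloudWatch programmatically. The sketch below assumes a Boto3 CloudWatch client created by the caller with boto3.client("cloudwatch") and reads the standard AWS/EC2 CPUUtilization metric. The helper names and the 30-minute window are illustrative, and the report does not say how its own numbers were collected.

```python
from datetime import datetime, timedelta


def window_before(moment: datetime, minutes: int = 30):
    """Return (start, end) covering the given minutes before `moment`."""
    return moment - timedelta(minutes=minutes), moment


def average_cpu(cloudwatch_client, instance_id: str, start: datetime,
                end: datetime, period_seconds: int = 300):
    """Mean of the per-period CPU averages, or None if there is no data."""
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=period_seconds,
        Statistics=["Average"],
    )
    datapoints = response["Datapoints"]
    if not datapoints:
        return None
    return sum(point["Average"] for point in datapoints) / len(datapoints)
```

Calling average_cpu once for a window before the experiment and once for a window during it gives the "pre" and "during" values for each AZ2 instance.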
Test Results
AZ    Instance ID           CPU Utilization  CPU Utilization  Network Traffic  Network Traffic
                            (Pre)            (During)         (Pre)            (During)
AZ1   i-xxxxxxxxxxxxxxxxx   38%              Stopped          500 KB/s         Stopped
AZ1   i-xxxxxxxxxxxxxxxxx   40%              Stopped          490 KB/s         Stopped
AZ2   i-xxxxxxxxxxxxxxxxx   40%              60%              500 KB/s         800 KB/s
AZ2   i-xxxxxxxxxxxxxxxxx   42%              62%              480 KB/s         770 KB/s
Summary of Results:
 The instances in AZ2 (us-east-1b) successfully handled the increased
load after the failure in AZ1 (us-east-1a).
 CPU utilization on the AZ2 instances increased by 20 percentage points
(from 40% to 60% and from 42% to 62%), indicating that they took on
additional workloads.
 Network traffic in AZ2 increased by around 60%, showing that failover
processes successfully redirected traffic to the healthy AZ.
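The arithmetic behind these figures can be checked with a short sketch using the AZ2 values from the results table. CPU utilization rises by 20 percentage points on both instances, which is roughly 48% to 50% in relative terms, while network traffic rises by about 60%. The function name is illustrative.

```python
def change(pre: float, during: float):
    """Return (absolute difference, relative increase in percent)."""
    difference = during - pre
    relative_percent = difference / pre * 100
    return difference, relative_percent


# AZ2 values from the results table: (pre, during)
cpu_percent = [(40, 60), (42, 62)]
network_kb_per_s = [(500, 800), (480, 770)]

for pre, during in cpu_percent:
    points, relative = change(pre, during)
    print(f"CPU: +{points} percentage points ({relative:.1f}% relative)")

for pre, during in network_kb_per_s:
    delta, relative = change(pre, during)
    print(f"Network: +{delta} KB/s ({relative:.1f}% relative)")
```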
Conclusions
 Failover Success: The test demonstrated that the two-node failover
system operates effectively across two AZs. When instances in AZ1
were stopped, the AZ2 instances handled the workload without
noticeable degradation in performance.
 Performance Stability: The increased CPU utilization and network
traffic were within acceptable ranges, and the system continued
functioning as expected under simulated failure conditions.
 Automated Testing: Integrating this process into a pytest framework
enables regular automated failover tests, ensuring high availability
across AZs.

Recommendations for Improvement


 Scaling Instances: Depending on the workload, scaling beyond two
instances in each AZ could help balance the load more evenly during a
failover event.
 Enhanced Monitoring: Additional CloudWatch metrics such as disk
I/O, memory usage, and application-level health checks can provide
more granular insights during failover scenarios.

Next Steps
 Schedule recurring failover tests using AWS FIS and CloudWatch.
 Integrate failover testing into the CI/CD pipeline to ensure system
reliability under real-world failure conditions.

Recovery Testing

Recovery testing is a type of testing that ensures a system can
successfully recover from failures, such as hardware failures, software
crashes, or network issues. The goal is to evaluate how well the system
resumes operations after encountering an unexpected fault.

Objective
The primary objective of recovery testing is to verify that the system can
recover from failures and return to its normal operational state within
acceptable limits, with minimal data loss or corruption. This is critical to
ensure business continuity, minimize downtime, and guarantee that users
can rely on the system during and after failures.
Key Scenarios in Recovery Testing
Recovery testing typically involves scenarios where:
1. Hardware Failures: Disk crashes, server shutdowns, or network
disconnections.
2. Software Failures: Application crashes, database connection failures,
or service interruptions.
3. Power Failures: Sudden loss of power, especially in physical data
centers.
4. System Overload: Recovery from high-load conditions where the
system or its components have failed due to resource exhaustion.
Steps for Conducting Recovery Testing
1. Identify Failure Scenarios
First, identify the types of failures your system could experience. Examples
include:
 EC2 instance shutdown in one Availability Zone (AZ).
 Database disconnection or failure.
 Storage failure, such as an impaired or unavailable EBS volume.
 Network partition or disconnect between services.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html

2. Simulate Failures
Use tools like AWS Fault Injection Simulator (FIS) or manual approaches to
simulate the failures:
 AWS FIS: Inject failures such as stopping or rebooting EC2 instances
or causing network latency to simulate a failure.
Example scenario:
 Stop instances in us-east-1a (AZ1) and simulate failure while
monitoring the recovery process in us-east-1b (AZ2).
3. Monitor Recovery Process
Once the failure is induced, monitor the system to assess how it responds:
 Instance Recovery: Did EC2 instances in AZ2 take over the workload
when AZ1 instances failed?
 Service Continuity: Was the service available during recovery, even
with degraded performance?
 Data Integrity: Was any data lost or corrupted during the failure and
recovery process?

4. Assess System Logs and Metrics


Analyze logs and metrics (CloudWatch, application logs, etc.) to gather
detailed information on:
 Time taken to detect failure.
 Time taken to initiate recovery processes.
 Time to full system recovery.
 Any errors encountered during recovery.
5. Validate System State Post-Recovery
Once the system has recovered, verify that it returns to a fully operational
state:
 Restored Availability: Ensure services are running, and users can
access the system without errors.
 Data Recovery: Ensure data is consistent, correct, and no critical
information was lost.
 Performance: Evaluate if the system performance is restored to pre-
failure levels.

Metrics for Recovery Testing


1. Mean Time to Recover (MTTR): Measures the average time required
to recover from a failure. The goal is to minimize this value.
2. Data Loss: Determine if any data was lost or corrupted during the
failure. Ideally, there should be no data loss.
3. System Downtime: The duration for which the system was
unavailable to users during the failure and recovery.
4. Error Frequency During Recovery: Track the number of errors or
retries required to achieve recovery.
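MTTR and total downtime follow directly from the failure and recovery timestamps of each incident. The sketch below shows the calculation. The timestamps in the usage comment are made-up illustrative values, not measurements from this report.

```python
from datetime import datetime


def recovery_durations(incidents) -> list:
    """Seconds between failure and recovery for each (failed_at, recovered_at) pair."""
    return [(recovered - failed).total_seconds() for failed, recovered in incidents]


def mean_time_to_recover(incidents) -> float:
    """MTTR in seconds: the average recovery duration across incidents."""
    durations = recovery_durations(incidents)
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations) / len(durations)


def total_downtime(incidents) -> float:
    """Total downtime in seconds across all incidents."""
    return sum(recovery_durations(incidents))


# Illustrative usage with hypothetical timestamps:
# incidents = [
#     (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5)),
#     (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 15)),
# ]
# mean_time_to_recover(incidents)  # average of 300 s and 900 s
```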

Example: Recovery Testing for EC2 Failover


In the context of an AZ Failover Test like the one described earlier,
recovery testing would involve checking if the system can recover once an
Availability Zone failure is resolved.
1. Simulated Failure: Stop all EC2 instances in us-east-1a (AZ1).
2. Failover Process: Observe whether the instances in us-east-1b
(AZ2) handle the increased load and continue operations.
3. Recovery Step: After 1 hour, restart the EC2 instances in us-east-1a
and assess if they automatically reintegrate into the system.
4. Recovery Metrics:
o Time taken for EC2 instances in AZ1 to restart.
o Time taken for load to rebalance between AZ1 and AZ2.
o Post-recovery performance and stability of both AZ1 and AZ2.
Challenges in Recovery Testing
 Complex Failures: Multiple, simultaneous failures (e.g., database and
network) can make recovery more complicated.
 Data Synchronization: Ensuring all components (e.g., databases,
distributed systems) are synchronized during and after recovery.
 Automation of Recovery: Automating the recovery process with
minimal manual intervention is critical for large-scale systems.

Conclusion
 Recovery testing ensures that your system can return to a stable state
after encountering failures, which is crucial for maintaining availability
and minimizing downtime. By automating recovery processes and
monitoring system performance, you can ensure business continuity
even in the face of unexpected failures.

Performance Trend Testing: -

To conduct Performance Trend Testing using AWS Fault Injection Service
(FIS), follow a structured approach: set up your environment, define your
experiments, and monitor the performance metrics over time.
Step-by-Step Guide to Performance Trend Testing with AWS FIS
Step 1: Define Objectives: -
1. Identify Goals:
 Determine what you want to achieve with your performance
trend testing (e.g., assess how performance metrics change
under load, identify bottlenecks).
2. Select Metrics:
 Decide on key performance indicators (KPIs) to monitor, such as
CPU utilization, memory usage, response times, and error rates.
Step 2: Set Up Your Environment: -
1. Create EC2 Instances:
 Use the AWS Management Console or Boto3 to launch EC2
instances that will simulate workloads.
2. Install Load Testing Tools:
 Install tools like stress-ng or Apache JMeter on your EC2
instances to apply load.
3. Set Up Monitoring:
 Configure Amazon CloudWatch to collect performance metrics
from your EC2 instances.
 Create CloudWatch dashboards to visualize metrics over time.
Step 3: Create Experiment Templates in AWS FIS: -
1. Access AWS FIS:
 Log in to the AWS Management Console and navigate to the AWS
Fault Injection Service.
2. Create Experiment Template:
 Click on Experiment templates and then Create Experiment
template.
 Fill out the required fields such as name and description.
 Define your targets (e.g., EC2 instances) and actions (e.g., inject
CPU stress).
 Set stop conditions using CloudWatch alarms on performance
thresholds (e.g., stop the experiment if CPU utilization exceeds 80%).
 Click Create template once all settings are configured.
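A stop condition in an FIS template refers to a CloudWatch alarm, so the 80% CPU threshold mentioned above needs an alarm behind it. The sketch below builds the arguments for the Boto3 CloudWatch call put_metric_alarm and the matching stop-condition entry for the template. The alarm name, the 5-minute period, and the single evaluation period are illustrative assumptions.

```python
def cpu_alarm_params(instance_id: str, threshold_percent: float = 80.0) -> dict:
    """Arguments for put_metric_alarm on an instance's CPU utilization."""
    return {
        "AlarmName": f"fis-stop-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,  # seconds; matches basic EC2 monitoring granularity
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def alarm_stop_condition(alarm_arn: str) -> dict:
    """Stop-condition entry for an FIS experiment template."""
    return {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
```

With a client created as boto3.client("cloudwatch"), the alarm could be created with cloudwatch_client.put_metric_alarm(**cpu_alarm_params(instance_id)), and its ARN then placed in the template's stopConditions list.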

Step 4: Start Performance Trend Testing


1. Run Load Tests:
 Use your load testing tool to apply load to the EC2 instances at
regular intervals.
 For example, you can run stress-ng commands to simulate both
normal and high usage periods:
sudo stress-ng --cpu 4 --timeout 600 --metrics-brief   # Simulate load for 10 minutes
2. Start FIS Experiment:
 Navigate back to the FIS console and choose Experiment
templates.
 Select the experiment template you created earlier and choose
Start experiment.
 Monitor the experiment's progress through the console.
Step 5: Monitor Performance Metrics
1. Use CloudWatch for Monitoring:
 Track real-time performance metrics such as CPU utilization,
memory usage, and network traffic during the load tests.
 Set up CloudWatch alarms to notify you if any thresholds are
breached.
2. Analyze Data Over Time:
 Collect data from CloudWatch over several runs of your load
tests.
 Look for trends in performance metrics as load is applied.
Step 6: Analyze Results
1. Review Findings:
 After completing your tests, analyze the collected metrics.
 Compare results against previous tests to identify trends in
performance degradation or improvement.
2. Document Insights:
 Create a report summarizing findings, including any unexpected
behaviors observed during the tests.
3. Implement Improvements:
 Based on your analysis, implement changes in architecture or
codebase as needed to optimize performance.
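One simple way to quantify a trend across runs is the least-squares slope of a metric against the run index. The sketch below shows that calculation. A positive slope for response time, for example, suggests degradation from run to run. The sample values in the comment are hypothetical, and this is only one possible way to summarize a trend.

```python
def trend_slope(values) -> float:
    """Least-squares slope of values against run index 0, 1, 2, ...

    The result is the average change in the metric per test run.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two runs are needed to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Illustrative usage with hypothetical average response times (ms) per run:
# trend_slope([210, 215, 230, 250])
```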
Conclusion: -
By following these steps, you can effectively conduct Performance Trend
Testing using AWS Fault Injection Service (FIS). This structured approach
allows you to simulate real-world conditions while monitoring how your
application performs under stress, ultimately leading to improved
performance and resilience.

Apache JMeter
Apache JMeter is an open-source tool designed to load test applications
and measure their performance under heavy loads.
 How to Use It with AWS FIS:
o Simulate Load: Use JMeter to simulate high traffic or load on
your web application (hosted on EC2 instances).
o Failover Testing: While the load is being simulated by JMeter,
you can use AWS FIS to inject failures like EC2 instance
termination, rebooting, or network latency in multiple AZs.
o Test Use Cases: Test how your auto-scaling group, load
balancer, or multi-AZ architecture handles the EC2 instance
failures while under load.
Example Use Case:
1. Create a JMeter test to simulate HTTP requests to your application
(hosted on multiple EC2 instances in different AZs).
2. Run the FIS experiment to terminate EC2 instances in a specific AZ
or introduce network latency between AZs.
3. Monitor the application’s performance under load using CloudWatch
metrics and JMeter reports to verify that failover happens correctly and
the load balancer is distributing traffic effectively.
Steps to Integrate JMeter with AWS FIS:
1. Set up JMeter to run load tests on your application endpoint (e.g., a
web app running on EC2).
2. Use AWS FIS to create a failover scenario (e.g., terminating EC2
instances in one AZ).
3. Analyze the response times, error rates, and failover behavior
through both JMeter and AWS CloudWatch.
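For the analysis step, JMeter results can be summarized with a few lines of Python. The sketch below assumes the results were written in JMeter's CSV format with a header row (for example from a non-GUI run such as jmeter -n -t plan.jmx -l results.jtl, where the file names are placeholders) and relies only on the elapsed and success columns.

```python
import csv
import io


def summarize_jtl(csv_text: str) -> dict:
    """Summarize JMeter CSV results: sample count, mean latency, error rate."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no samples found in results")
    elapsed_ms = [int(row["elapsed"]) for row in rows]
    failures = sum(1 for row in rows if row["success"].strip().lower() != "true")
    return {
        "samples": len(rows),
        "avg_elapsed_ms": sum(elapsed_ms) / len(rows),
        "error_rate": failures / len(rows),
    }
```

Running the summary separately for samples taken before and during the FIS experiment shows how much response time and error rate changed during failover.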
