How to test infrastructure code: automated testing for Terraform, Kubernetes, Docker, Packer and more

Automated testing for:
✓ terraform
✓ docker
✓ packer
✓ kubernetes
✓ and more
Passed: 5. Failed: 0. Skipped: 0.
Test run successful.
How to test infrastructure code

The DevOps world is full of fear

“Fear leads to anger. Anger leads to hate. Hate leads to suffering.”
Scrum Master Yoda

And you all know what
suffering leads to, right?

Many DevOps teams deal
with this fear in two ways:

Sadly, both of these just make
the problem worse!

There’s a better way to deal
with this fear:

Automated tests give you the
confidence to make changes

We know how to write automated
tests for application code…

resource "aws_lambda_function" "web_app" {
  function_name = var.name
  role          = aws_iam_role.lambda.arn
  # ...
}
resource "aws_api_gateway_integration" "proxy" {
  type = "AWS_PROXY"
  uri  = aws_lambda_function.web_app.invoke_arn
  # ...
}
But how do you test your Terraform code
deploys infrastructure that works?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world-app-deployment
spec:
  selector:
    matchLabels:
      app: hello-world-app
  replicas: 1
  template:
    metadata:
      labels:
        app: hello-world-app
    spec:
      containers:
        - name: hello-world-app
          image: gruntwork-io/hello-world-app:v1
          ports:
            - containerPort: 8080
How do you test your Kubernetes code
configures your services correctly?

This talk is about how to write
tests for your infrastructure code.

I’m Yevgeniy Brikman (ybrikman.com)
Co-founder of Gruntwork (gruntwork.io)

Outline
1. Static analysis
2. Unit tests
3. Integration tests
4. End-to-end tests
5. Conclusion

Static analysis: test your code
without deploying it.

Static analysis
1. Compiler / parser / interpreter
2. Linter
3. Dry run

Statically check your code for
syntactic and structural issues

Examples:

Tool        Command
Terraform   terraform validate
Packer      packer validate <template>
Kubernetes  kubectl apply -f <file> --dry-run --validate=true
            (kubectl 1.18+: use --dry-run=client)

Statically validate your code to
catch common errors

Examples:

Tool        Linters
Terraform   1. conftest
            2. terraform_validate
            3. tflint
Docker      1. dockerfile_lint
            2. hadolint
            3. dockerfilelint
Kubernetes  1. kube-score
            2. kube-lint
            3. yamllint

Partially execute the code and
validate the “plan”, but don’t
actually deploy

Examples:

Tool        Dry run options
Terraform   1. terraform plan
            2. HashiCorp Sentinel
            3. terraform-compliance
Kubernetes  kubectl apply -f <file> --server-dry-run
            (kubectl 1.18+: use --dry-run=server)

Unit tests: test a single “unit”
works in isolation.

Unit tests
1. Unit testing basics
2. Example: Terraform unit tests
3. Example: Docker/Kubernetes unit tests
4. Cleaning up after tests

You can’t “unit test” an entire end-to-end architecture

Instead, break your infra code into
small modules and unit test those!

With app code, you can test units
in isolation from the outside world

But 99% of infrastructure code is about
talking to the outside world…

If you try to isolate a unit from the
outside world, you’re left with nothing!

So you can only test infra code by
deploying to a real environment

Key takeaway: there’s no pure
unit testing for infrastructure
code.

Therefore, the test strategy is:
1. Deploy real infrastructure
2. Validate it works
(e.g., via HTTP requests, API calls, SSH commands, etc.)
3. Undeploy the infrastructure
(So it’s really integration testing of a single unit!)
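
The three steps above can be sketched with the Go standard library alone. This is a minimal illustration, not Terratest code: the `env` type and its `deploy`, `validate`, and `undeploy` methods are hypothetical stand-ins for real tooling such as terraform apply and terraform destroy. The point is that `defer` schedules the undeploy step up front, so cleanup still runs when validation fails.

```go
package main

import (
	"errors"
	"fmt"
)

// env is a hypothetical stand-in for a deployed environment.
type env struct {
	log []string // records the order in which steps ran
}

func (e *env) deploy()   { e.log = append(e.log, "deploy") }
func (e *env) undeploy() { e.log = append(e.log, "undeploy") }

// validate simulates a check against the deployed infrastructure.
func (e *env) validate(healthy bool) error {
	e.log = append(e.log, "validate")
	if !healthy {
		return errors.New("validation failed")
	}
	return nil
}

// runTest follows the deploy / validate / undeploy strategy.
func runTest(e *env, healthy bool) error {
	defer e.undeploy() // registered first, so it runs when runTest returns, pass or fail
	e.deploy()
	return e.validate(healthy)
}

func main() {
	e := &env{}
	err := runTest(e, false)
	fmt.Println(e.log, err)
}
```

Even though validation fails in this run, the recorded step order ends with undeploy, which is the behavior you want from a test that creates real resources.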

Tools that help with this strategy:

Tool               Deploy / Undeploy  Validate  Works with
Terratest          Yes                Yes       Terraform, Kubernetes, Packer, Docker, Servers, Cloud APIs, etc.
kitchen-terraform  Yes                Yes       Terraform
Inspec             No                 Yes       Servers, Cloud APIs
Serverspec         No                 Yes       Servers
Goss               No                 Yes       Servers

In this talk, we’ll use Terratest:

Sample code for this talk is at:
github.com/gruntwork-io/infrastructure-as-code-testing-talk

An example of a Terraform
module you may want to test:

infrastructure-as-code-testing-talk
└ examples
  └ hello-world-app
    └ main.tf
    └ outputs.tf
    └ variables.tf
└ modules
└ test
└ README.md
hello-world-app: deploy a “Hello,
World” web service

Under the hood, this example runs on
top of AWS Lambda & API Gateway

$ terraform apply
Outputs:
url = ruvvwv3sh1.execute-api.us-east-2.amazonaws.com
$ curl ruvvwv3sh1.execute-api.us-east-2.amazonaws.com
Hello, World!
When you run terraform apply, it
deploys and outputs the URL

Let’s write a unit test for
hello-world-app with Terratest

└ examples
└ modules
└ test
  └ hello_world_app_test.go
└ README.md
Create hello_world_app_test.go

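package test
import (
	"testing"
	"time" // used by validate

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper" // used by validate
	"github.com/gruntwork-io/terratest/modules/terraform"
)
func TestHelloWorldAppUnit(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../examples/hello-world-app",
	}
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)
	validate(t, terraformOptions)
}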
The basic test structure

1. Tell Terratest where your Terraform code lives.
2. Run terraform init and terraform apply to deploy your module.
3. Validate the infrastructure works. We’ll come back to this shortly.
4. Run terraform destroy at the end of the test to undeploy everything.

func validate(t *testing.T, opts *terraform.Options) {
	url := terraform.Output(t, opts, "url")
	// Depending on your Terratest version, HttpGetWithRetry may also take
	// a *tls.Config argument right after the URL.
	http_helper.HttpGetWithRetry(t,
		url,             // URL to test
		200,             // Expected status code
		"Hello, World!", // Expected body
		10,              // Max retries
		3*time.Second,   // Time between retries
	)
}
The validate function

1. Run terraform output to get the web service URL.
2. Make HTTP requests to the URL.
3. Check the response for an expected status and body.
4. Retry the request up to 10 times, as deployment is asynchronous.

Note: since we’re testing a
web service, we use HTTP
requests to validate it.

Examples of other ways to validate:

Infrastructure  Example             Validate with…  Tooling example
Web service     Dockerized web app  HTTP requests   Terratest http_helper package
Server          EC2 instance        SSH commands    Terratest ssh package
Cloud service   SQS                 Cloud APIs      Terratest aws or gcp packages
Database        MySQL               SQL queries     MySQL driver for Go

$ export AWS_ACCESS_KEY_ID=xxxx
$ export AWS_SECRET_ACCESS_KEY=xxxxx
To run the test, first authenticate to
AWS

$ go test -v -timeout 15m -run TestHelloWorldAppUnit
…
--- PASS: TestHelloWorldAppUnit (31.57s)
Then run go test. You now have a unit
test you can run after every commit!

What about other tools, such
as Docker + Kubernetes?

└ examples
  └ hello-world-app
  └ docker-kubernetes
    └ Dockerfile
    └ deployment.yml
└ modules
└ test
└ README.md
docker-kubernetes: deploy a “Hello,
World” web service to Kubernetes

FROM ubuntu:18.04
EXPOSE 8080
RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
    apt-get install -y busybox
RUN echo 'Hello, World!' > index.html
CMD ["busybox", "httpd", "-f", "-p", "8080"]
Dockerfile: Dockerize a simple “Hello,
World!” web service

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world-app-deployment
spec:
  selector:
    matchLabels:
      app: hello-world-app
  replicas: 1
  template:
    metadata:
      labels:
        app: hello-world-app
    spec:
      containers:
        - name: hello-world-app
          image: gruntwork-io/hello-world-app:v1
          ports:
            - containerPort: 8080
deployment.yml: define how to deploy a
Docker container in Kubernetes

$ cd examples/docker-kubernetes
$ docker build -t gruntwork-io/hello-world-app:v1 .
Successfully tagged gruntwork-io/hello-world-app:v1
$ kubectl apply -f deployment.yml
deployment.apps/hello-world-app-deployment created
service/hello-world-app-service created
$ curl localhost:8080
Hello, World!
Build the Docker image, deploy to
Kubernetes, and check URL

Let’s write a unit test for this
code.

└ examples
└ modules
└ test
  └ docker_kubernetes_test.go
└ README.md
Create docker_kubernetes_test.go

func TestDockerKubernetes(t *testing.T) {
	buildDockerImage(t)
	path := "../examples/docker-kubernetes/deployment.yml"
	options := k8s.NewKubectlOptions("", "", "")
	defer k8s.KubectlDelete(t, options, path)
	k8s.KubectlApply(t, options, path)
	validate(t, options)
}

1. Build the Docker image. You’ll see the buildDockerImage method shortly.
2. Tell Terratest where your Kubernetes deployment is defined.
3. Configure kubectl options to authenticate to Kubernetes.
4. Run kubectl apply to deploy the web app to Kubernetes.
5. Check the app is working. You’ll see the validate method shortly.
6. At the end of the test, remove all Kubernetes resources you deployed.

func buildDockerImage(t *testing.T) {
	options := &docker.BuildOptions{
		Tags: []string{"gruntwork-io/hello-world-app:v1"},
	}
	path := "../examples/docker-kubernetes"
	docker.Build(t, path, options)
}
The buildDockerImage method

func validate(t *testing.T, opts *k8s.KubectlOptions) {
	k8s.WaitUntilServiceAvailable(t, opts, "hello-world-app-service", 10, 1*time.Second)
	http_helper.HttpGetWithRetry(t,
		serviceUrl(t, opts), // URL to test
		200,                 // Expected status code
		"Hello, World!",     // Expected body
		10,                  // Max retries
		3*time.Second,       // Time between retries
	)
}
The validate method

1. Wait until the service is deployed.
2. Make HTTP requests.
3. Use the serviceUrl method to get the URL.

func serviceUrl(t *testing.T, opts *k8s.KubectlOptions) string {
	service := k8s.GetService(t, opts, "hello-world-app-service")
	endpoint := k8s.GetServiceEndpoint(t, opts, service, 8080)
	return fmt.Sprintf("http://%s", endpoint)
}
The serviceUrl method

$ kubectl config set-credentials …
To run the test, first authenticate to a
Kubernetes cluster.

Note: Kubernetes is now part of
Docker Desktop. Test 100% locally!

$ go test -v -timeout 15m -run TestDockerKubernetes
…
--- PASS: TestDockerKubernetes (5.69s)
Run go test. You can validate your
config after every commit in seconds!

Note: tests create and destroy
many resources!

Pro tip #1: run tests in completely
separate “sandbox” accounts

Pro tip #2: run these tools in cron jobs to clean up left-over resources

Tool              Clouds             Features
cloud-nuke        AWS (GCP planned)  Delete all resources older than a certain date; in a certain region; of a certain type.
Janitor Monkey    AWS                Configurable rules of what to delete. Notify owners of pending deletions.
aws-nuke          AWS                Specify specific AWS accounts and resource types to target.
Azure PowerShell  Azure              Includes native commands to delete Resource Groups.

Integration tests: test multiple
“units” work together.

Integration tests
1. Example: Terraform integration tests
2. Test parallelism
3. Test stages
4. Test retries

└ examples
  └ hello-world-app
  └ proxy-app
  └ web-service
└ modules
└ test
└ README.md
Let’s say you have two Terraform
modules you want to test together:

proxy-app: an app that acts as an HTTP
proxy for other web services.

web-service: a web service that you
want proxied.

variable "url_to_proxy" {
  description = "The URL to proxy."
  type        = string
}
proxy-app takes in the URL to proxy via
an input variable

output "url" {
  value = module.web_service.url
}
web-service exposes its URL via an
output variable

└ examples
└ modules
└ test
  └ docker_kubernetes_test.go
  └ proxy_app_test.go
└ README.md
Create proxy_app_test.go

func TestProxyApp(t *testing.T) {
	webServiceOpts := configWebService(t)
	defer terraform.Destroy(t, webServiceOpts)
	terraform.InitAndApply(t, webServiceOpts)
	proxyAppOpts := configProxyApp(t, webServiceOpts)
	defer terraform.Destroy(t, proxyAppOpts)
	terraform.InitAndApply(t, proxyAppOpts)
	validate(t, proxyAppOpts)
}

1. Configure options for the web service.
2. Deploy the web service.
3. Configure options for the proxy app (passing it the web service options).
4. Deploy the proxy app.
5. Validate the proxy app works.
6. At the end of the test, undeploy the proxy app and the web service.

func configWebService(t *testing.T) *terraform.Options {
	return &terraform.Options{
		TerraformDir: "../examples/web-service",
	}
}
The configWebService method

func configProxyApp(t *testing.T, webServiceOpts *terraform.Options) *terraform.Options {
	url := terraform.Output(t, webServiceOpts, "url")
	return &terraform.Options{
		TerraformDir: "../examples/proxy-app",
		Vars: map[string]interface{}{
			"url_to_proxy": url,
		},
	}
}
The configProxyApp method

1. Read the url output from the web-service module.
2. Pass it in as the url_to_proxy input to the proxy-app module.

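func validate(t *testing.T, opts *terraform.Options) {
	url := terraform.Output(t, opts, "url")
	http_helper.HttpGetWithRetry(t,
		url,                        // URL to test
		200,                        // Expected status code
		`{"text":"Hello, World!"}`, // Expected body
		10,                         // Max retries
		3*time.Second,              // Time between retries
	)
}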
The validate method

$ go test -v -timeout 15m -run TestProxyApp
…
--- PASS: TestProxyApp (182.44s)
Run go test. You’re now testing
multiple modules together!

But integration tests can take (many)
minutes to run…

Infrastructure tests can take a
long time to run

One way to save time: run
tests in parallel

func TestHelloWorldAppUnit(t *testing.T) {
	t.Parallel()
	// The rest of the test code
}
func TestDockerKubernetes(t *testing.T) {
	t.Parallel()
	// The rest of the test code
}
Enable test parallelism in Go by adding
t.Parallel() as the 1st line of each test.

$ go test -v -timeout 15m
=== RUN TestHelloWorldApp
=== RUN TestDockerKubernetes
=== RUN TestProxyApp
Now, if you run go test, all the tests
with t.Parallel() will run in parallel

But there’s a gotcha:
resource conflicts

resource "aws_iam_role" "role_example" {
  name = "example-iam-role"
}
resource "aws_security_group" "sg_example" {
  name = "security-group-example"
}
Example: module with hard-coded IAM
Role and Security Group names

If two tests tried to deploy this module
in parallel, the names would conflict!

Key takeaway: you must
namespace all your resources

resource "aws_iam_role" "role_example" {
  name = var.name
}
resource "aws_security_group" "sg_example" {
  name = var.name
}
Example: use variables in all resource
names…

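uniqueId := random.UniqueId()
return &terraform.Options{
	TerraformDir: "../examples/proxy-app",
	Vars: map[string]interface{}{
		"url_to_proxy": url,
		"name":         fmt.Sprintf("text-proxy-app-%s", uniqueId),
	},
}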
At test time, set the variables to a
randomized value to avoid conflicts
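
For readers who want to see what a randomized name looks like without pulling in Terratest, here is a standard-library sketch. The `uniqueID` and `namespaced` helpers are hypothetical and only approximate the idea behind a random suffix; the 6-character length and the alphabet are assumptions made for this example.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// uniqueID returns a random alphanumeric string of length n.
func uniqueID(n int) (string, error) {
	b := make([]byte, n)
	max := big.NewInt(int64(len(alphabet)))
	for i := range b {
		k, err := rand.Int(rand.Reader, max)
		if err != nil {
			return "", err
		}
		b[i] = alphabet[k.Int64()]
	}
	return string(b), nil
}

// namespaced appends a fresh random suffix to a base resource name.
func namespaced(base string) (string, error) {
	id, err := uniqueID(6)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s-%s", base, id), nil
}

func main() {
	for i := 0; i < 2; i++ {
		name, err := namespaced("proxy-app")
		if err != nil {
			panic(err)
		}
		fmt.Println(name) // a different suffix on each call, with overwhelming probability
	}
}
```

Because each test run gets its own suffix, two parallel runs of the same module end up with distinct IAM role and security group names.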

Consider the structure of the
proxy-app integration test:

1. Deploy web-service
2. Deploy proxy-app
3. Validate proxy-app
4. Undeploy proxy-app
5. Undeploy web-service

When iterating locally, you sometimes
want to re-run just one of these steps.

But as the code is written now, you
have to run all steps on each test run.

And that can add up to a lot of overhead:
1. Deploy web-service (~3 min)
2. Deploy proxy-app (~2 min)
3. Validate proxy-app (~30 seconds)
4. Undeploy proxy-app (~1 min)
5. Undeploy web-service (~2 min)

Key takeaway: break your
tests into independent test
stages

The original test structure

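func TestProxyApp(t *testing.T) {
	stage := test_structure.RunTestStage
	// RunTestStage expects a func(), so each stage is wrapped in a closure over t.
	defer stage(t, "cleanup_web_service", func() { cleanupWebService(t) })
	stage(t, "deploy_web_service", func() { deployWebService(t) })
	defer stage(t, "cleanup_proxy_app", func() { cleanupProxyApp(t) })
	stage(t, "deploy_proxy_app", func() { deployProxyApp(t) })
	stage(t, "validate", func() { validate(t) })
}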
The test structure with test stages

1. RunTestStage is a helper function
from Terratest.

2. Wrap each stage of your test with a
call to RunTestStage

3. Define each stage in a function
(you’ll see this code shortly).

4. Give each stage a unique name

Any stage foo can be skipped by
setting the env var SKIP_foo=true

$ export SKIP_cleanup_web_service=true
$ export SKIP_cleanup_proxy_app=true
Example: on the very first test run, skip
the cleanup stages.

Running stage 'deploy_web_service'…
Running stage 'deploy_proxy_app'…
Running stage 'validate'…
Skipping stage 'cleanup_proxy_app'…
Skipping stage 'cleanup_web_service'…
That way, after the test finishes, the
infrastructure will still be running.

$ export SKIP_deploy_web_service=true
$ export SKIP_deploy_proxy_app=true
Now, on the next several test runs, you
can skip the deploy stages too.

Skipping stage 'deploy_web_service'…
Skipping stage 'deploy_proxy_app'…
This allows you to iterate on solely the
validate stage…

Which dramatically speeds up your
iteration / feedback cycle!

$ export SKIP_validate=true
$ unset SKIP_cleanup_web_service
$ unset SKIP_cleanup_proxy_app
When you’re done iterating, skip
validate and re-enable cleanup

Skipping stage 'deploy_proxy_app'…
Skipping stage 'validate'…
Running stage 'cleanup_proxy_app'…
Running stage 'cleanup_web_service'…
This cleans up everything that was left
running.

func deployWebService(t *testing.T) {
	opts := configWebService(t)
	test_structure.SaveTerraformOptions(t, "/tmp", opts)
	terraform.InitAndApply(t, opts)
}
func cleanupWebService(t *testing.T) {
	opts := test_structure.LoadTerraformOptions(t, "/tmp")
	terraform.Destroy(t, opts)
}
Note: each time you run test stages via
go test, it’s a separate OS process.

So to pass data between stages, one
stage needs to write the data to disk…

And the other stages need to read that
data from disk.

Real infrastructure can fail for
intermittent reasons
(e.g., bad EC2 instance, Apt downtime, Terraform bug)

To avoid “flaky” tests, add
retries for known errors.

&terraform.Options{
	RetryableTerraformErrors: map[string]string{
		"net/http: TLS handshake timeout": "Terraform bug",
	},
	MaxRetries:         3,
	TimeBetweenRetries: 3 * time.Second,
}
Example: retry up to 3 times on a
known TLS error in Terraform.

End-to-end tests: test your
entire infrastructure works
together.

How do you test this entire thing?

You could use the same strategy…
1. Deploy all the infrastructure
2. Validate it works
(e.g., via HTTP requests, API calls, SSH commands, etc.)
3. Undeploy all the infrastructure

But it’s rare to write end-to-
end tests this way. Here’s why:

Test pyramid (top to bottom):
1. E2E tests
2. Integration tests
3. Unit tests
4. Static analysis

Cost, brittleness, and run time all increase toward the top of the pyramid.

Typical run time per test type:

Test type          Run time
E2E tests          60 – 240+ minutes
Integration tests  5 – 60 minutes
Unit tests         1 – 20 minutes
Static analysis    1 – 60 seconds

E2E tests are too slow to be useful

Another problem with E2E
tests: brittleness.

Assume a single resource (e.g.,
EC2 instance) has a 1/1000
(0.1%) chance of failure.

The more resources your tests deploy, the flakier they will be:

Test type          # of resources  Chance of failure
Unit tests         10              1%
Integration tests  50              5%
End-to-end tests   500+            40%+

You can work around the failure rate
for unit & integration tests with retries

Key takeaway: E2E tests from
scratch are too slow and too
brittle to be useful

Instead, you can do
incremental E2E testing!

1. Deploy a persistent test
environment and leave it running.

2. Each time you update a module,
deploy & validate just that module

3. Bonus: test your deployment
process is zero-downtime too!

Technique: Static analysis
Strengths:
1. Fast
2. Stable
3. No need to deploy real resources
4. Easy to use
Weaknesses:
1. Very limited in errors you can catch
2. You don’t get much confidence in your code solely from static analysis

Technique: Unit tests
Strengths:
1. Fast enough (1 – 10 min)
2. Mostly stable (with retry logic)
3. High level of confidence in individual units
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code

Technique: Integration tests
Strengths:
1. Mostly stable (with retry logic)
2. High level of confidence in multiple units working together
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code
3. Slow (10 – 30 min)

Technique: End-to-end tests
Strengths:
1. Build confidence in your entire architecture
Weaknesses:
1. Need to deploy real resources
2. Requires writing non-trivial code
3. Very slow (60 min – 240+ min)*
4. Can be brittle (even with retry logic)*
All of them!
They all catch different types of bugs.

Keep in mind the test pyramid:
1. Lots of unit tests + static analysis
2. Fewer integration tests
3. A handful of high-value e2e tests

Infrastructure code
without tests is scary

Fight the fear & build confidence in
your code with automated tests
