CI/CD with Jenkins pipelines, part 1: .NET Core application deployments on AWS ECS
by Alexander Savchuk | Xero Developer
In theory, deploying a dockerised .NET Core app is easy (because Docker simplifies
everything, right?). Just trigger your CI/CD pipeline on any new commit to the GitHub
repository, build an image, run the tests, push the image to the ECR repository, update
the ECS task definition to point to the new image, and then update the ECS service to
use the new task revision. Rinse and repeat for all environments.
The branching model of the repository is pretty simple. There is a master branch, which
is built on every push and is always deployable, and a bunch of feature branches that are
mostly ignored by the deployment system.
There are also a lot of infrastructure bits and pieces which need to be deployed for each
microservice: an ECR repository, ECS task and service definitions, an IAM role with policies,
and in some cases security groups, Route53 records, a Redis cluster, and an ALB
with listeners and target groups. We use Terraform to manage most of our
infrastructure. Infrastructure deployments use a different pipeline and will be covered in
the next blog post.
Build
All Docker images follow the same pattern. Dockerfiles are multi-stage: we use a larger
SDK image to compile the application and a much smaller runtime image to deploy it.
Base images are pinned to an exact version (e.g. microsoft/aspnetcore-build:2.0.5-2.1.4)
to avoid the surprises which could happen if we just used 'latest'.
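As a minimal sketch of this pattern (the project path and runtime tag are illustrative, and the csproj-copying trick described below is omitted for brevity):

    # Build stage: a larger SDK image, pinned to an exact version
    FROM microsoft/aspnetcore-build:2.0.5-2.1.4 AS build
    WORKDIR /app
    COPY . .
    RUN dotnet publish src/MyApp/MyApp.csproj -c Release -o /out   # hypothetical project path

    # Runtime stage: a much smaller image, also pinned
    FROM microsoft/aspnetcore:2.0.5 AS runtime
    WORKDIR /app
    COPY --from=build /out .
    ENTRYPOINT ["./entrypoint.sh"]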
First, we need to restore NuGet packages, which are referenced in *.csproj files. Copying
all files in one go would invalidate the Docker cache and trigger a lengthy restore
whenever any part of the application code changes. We copy *.csproj files separately
from C# source files to avoid this problem. There are complex inter-dependencies
between the projects, and copying *.csproj files one by one would be quite a chore. We
copy them in bulk (which, unfortunately, flattens the directory structure) and then run a
simple script to move them to the correct folders. The end goal here is to optimise
layering, leverage the cache, and reduce build times.
    COPY src/*/*.csproj ./
    RUN for file in $(ls *.csproj); do mkdir -p ${file%.*}/ && mv $file ${file%.*}/; done
Next comes the compilation. This step usually can’t be cached, so it takes some time.
Lastly, the compiled assets are copied into the runtime image, along with the entrypoint
script. This script is responsible for setting up the execution environment correctly and
bailing out early if any of the mandatory environment variables are missing. The
applications are typically environment-agnostic (they do not care whether they run
locally on a Windows laptop or in a Linux Docker container on AWS), but certain
environment variables must always be set. All conditional logic is pushed into the
entrypoint script, which determines the environment and then, if necessary, fetches
secrets from Parameter Store using the pstore utility, grabs metadata like the task
revision and container ID from the container metadata file, and so on. All logs emitted
by the application contain this metadata, which is crucial for identifying
which of the hundreds of containers running at a given point in time is having issues.
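A stripped-down sketch of such an entrypoint script (the pstore invocation, variable names, and assembly name are assumptions, not our exact script; it relies on jq being present in the image):

    #!/bin/sh
    set -e

    # Bail out early if mandatory configuration is missing.
    for var in ASPNETCORE_ENVIRONMENT APP_NAME; do        # illustrative variable list
      eval "value=\${$var}"
      [ -n "$value" ] || { echo "FATAL: $var is not set" >&2; exit 1; }
    done

    # On ECS, enrich the environment with secrets and task metadata.
    if [ -n "$ECS_CONTAINER_METADATA_FILE" ]; then
      export DB_PASSWORD=$(pstore get "/$APP_NAME/db-password")   # hypothetical pstore syntax
      export TASK_REVISION=$(jq -r '.TaskDefinitionRevision' "$ECS_CONTAINER_METADATA_FILE")
      export CONTAINER_ID=$(jq -r '.ContainerID' "$ECS_CONTAINER_METADATA_FILE")
    fi

    exec dotnet MyApp.dll   # hypothetical assembly name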
We add several tags to our Docker images. The build number comes from Jenkins and is
used during application rollouts. We also add a timestamp and a git hash; the hash
points to the last commit included in the build and helps establish which application
changes a given Docker image includes.
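For illustration, the tagging could look like this in a build script (the registry path is a placeholder; BUILD_NUMBER is the standard Jenkins environment variable):

    IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service   # placeholder ECR repo
    GIT_HASH=$(git rev-parse --short HEAD)
    TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)

    docker tag "$IMAGE:latest" "$IMAGE:$BUILD_NUMBER"
    docker tag "$IMAGE:latest" "$IMAGE:$GIT_HASH"
    docker tag "$IMAGE:latest" "$IMAGE:$TIMESTAMP"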
Another handy tag is the base image version (e.g. '2.1-runtime' for .NET Core apps),
which helps us understand whether we need to update an app if any security
vulnerabilities are discovered in the base image. The CI system needs to be aware of the
base image version, so instead of hard-coding it in the Dockerfile we pass it in as an
argument. The Dockerfile expects this argument to be set at build time, but defaults to
something sensible for easier local builds:
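A minimal sketch of the idea (the image name and default are illustrative):

    ARG BASE_IMAGE_VERSION=2.1-runtime
    FROM microsoft/dotnet:${BASE_IMAGE_VERSION}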
Jenkins sources the file that stores the base image version and passes it to the 'docker
build' command:
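Again as a sketch (the file and image names are assumptions):

    sh '''
      . ./base-image-version   # hypothetical file exporting BASE_IMAGE_VERSION
      docker build --build-arg BASE_IMAGE_VERSION=$BASE_IMAGE_VERSION -t my-service .
    '''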
Deploy
As involved as building a Docker image can be, deploying it in a repeatable and safe
manner turned out to be more complicated. The ECS task definition is rendered from a
template, and the only thing that changes between application deployments is the
image reference:
    {
      "name": "${app}",
      "image": "${image}",    <<< THE CHANGE GOES HERE
      "cpu": ${cpu},
      ...
    }
The question is: how would Terraform know which image to deploy? The tool does
accept parameters, but we have a strict rule that the only external parameter
allowed is the AWS region, and even that is determined internally by the 'terraform-
deployer' Docker container which we use for infrastructure deployments. All other
arguments are either defined in the configuration files, if they are static, or fetched by
Terraform dynamically using external sources (usually a remote Terraform state, AWS
API calls, or custom data sources). This approach allows us to use standard boilerplate
deployments for all our 100-odd infrastructure projects and not have to maintain
separate scripts for any of them.
We needed some way to deploy new images quickly, so the workaround was to just hard-code
the task definition to use 'latest'. That wasn't enough though: because the task definition
never changed ('latest' is always the same tag/string/hash, even though the underlying
image changes), Terraform did not detect any changes in the task and correspondingly
did not update the ECS service. We had to force its hand by passing a timestamp
environment variable to the task definition. Its only purpose was to change the hash of
the task definition, so Terraform would have to create a new task revision and then
update the service.
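A sketch of that workaround in current Terraform syntax (resource and variable names are illustrative, and this is the hack we later abandoned, not the final approach):

    # An ever-changing timestamp in the rendered container definitions forces a
    # diff, so Terraform registers a new task revision and updates the service.
    resource "aws_ecs_task_definition" "app" {
      family = "my-service"    # hypothetical
      container_definitions = templatefile("${path.module}/task-def.json.tpl", {
        app       = "my-service"
        image     = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:latest"
        cpu       = 256
        timestamp = timestamp()   # injected as an environment variable in the template
      })
    }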
Another problem was that each application deployment was touching all infrastructure
components even though it only needed to change one line in the task definition. This
slowed down the deployments and increased the blast radius if anything went wrong.
Even worse, Terraform was designed to manage infrastructure, not
application deployments, and it would report a deployment as successful even if the
application could not start consistently or failed its health checks.
In summary, this was a quick and dirty way to get something to the test environment. It
was not suitable for production deployments, but it worked for a while during the initial
development phase.
Not surprisingly, this caused other sorts of issues. We were now changing ECS task
definitions from two separate pipelines, and they did not always agree with each other.
Terraform only knew about the 'latest' image, and each time we refreshed our
infrastructure it reset the task definition back to the 'latest' image. Frequently this was the
same image that was already deployed, so it was just a relatively harmless refresh of the
service. Occasionally, however, the 'latest' image was not the same as the one currently
running, and in that case Terraform would perform a surreptitious application
deployment.
Another problem was that with this setup, all promotions between environments (test →
UAT → prod) happened in the same pipeline. If a deployment to, say, UAT failed for
whatever reason (frequently some transient network issue), we had to re-run the whole
pipeline from the start. There was also no clean way to roll back with this setup.
There were several reasons why this wasn't a good fit for our needs. One was that the
suggested solution only works with web services behind a load balancer. Some of our
most important services are worker-type console applications, which would require a
different approach. Ideally, we wanted something that could work for all
our services in a similar fashion, to keep the maintenance overhead low. Another reason
was that the provided template relied on the Code* family of services for deployments. These
technologies duplicate capability that we already possess via Jenkins and have some
issues when configured with non-public GitHub Enterprise and cross-account access.
First of all, we needed a way for Terraform to leave the current image ID
in ECS task definitions alone. The image was now managed outside of Terraform,
and Terraform should not try to reset it.
There didn't seem to be a clean way of achieving this with any of the existing Terraform
providers and data sources, so we ended up extending Terraform by writing an external
data source. The next blog post will cover this in more detail.
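As a rough sketch of the idea (the script and query fields are assumptions; the real implementation is covered in the next post), Terraform's external data source runs a program and consumes its JSON output:

    # The program prints JSON such as {"image": "...:42"}, and the task
    # definition then references data.external.current_image.result["image"]
    # instead of resetting the image to 'latest'.
    data "external" "current_image" {
      program = ["python3", "${path.module}/scripts/current_image.py"]   # hypothetical script
      query = {
        cluster = var.cluster_name
        service = var.service_name
      }
    }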
On the Jenkins side, the build pipeline finishes by triggering a parameterised deployment job:

    stage('Start deployment') {
      when {
        branch 'master'
      }
      steps {
        build job: "Deployment/${serviceName}/${env.BRANCH_NAME}",
          propagate: true,
          wait: true,
          parameters: [
            [$class: 'StringParameterValue', name: 'imageName', value: imageName],
            [$class: 'StringParameterValue', name: 'serviceName', value: serviceName],
            [$class: 'StringParameterValue', name: 'tag', value: "${env.BUILD_NUMBER}"]
          ]
      }
    }
Deployments to test and UAT are defined in a single Jenkinsfile used by all services,
which is parameterised with a service name, an image name, and an image tag. It
deploys the new image to the test environment, runs some integration tests, and then
promotes the image to UAT. Only one image can be updated at a time: if a task
contains several containers (for example, an app container and an nginx container),
updating them requires separate deployments. This setup limits the blast radius and
makes rollbacks easier in case anything goes wrong.
After going through the same sequence of steps in UAT, this pipeline triggers yet another
pipeline to start the deployment to the prod environment.
Production deployments are similar, but require explicit approval from one of the
authorised users before the actual deployment starts. This stage proved tricky to
implement correctly. Without a timeout, the confirmation step would block the pipeline
indefinitely, so at one point we had dozens of hanging builds soaking up our
executors. After we added a timeout, the builds started to error out, which was also not
what we wanted. In the end, we added a rather clunky workaround which timed
out and marked the build as successful even if it wasn't approved to proceed to
production.
    // Excerpt from the production deployment Jenkinsfile; earlier lines omitted.
      choice(choices: services, description: 'Name of the ECS service to deploy', name: 'serviceName')
      choice(choices: services, description: 'Name of Docker image to update', name: 'imageName')
      string(defaultValue: 'Tag to deploy', description: 'Docker image tag', name: 'tag')
    }

    stage("Confirm") {
      when {
        branch 'master'
      }
      options {
        timeout(time: 5, unit: 'MINUTES')
      }
      agent none
      steps {
        script {
          approvalMap = deploy.getSignoff('prod', 'app', "${params.serviceName} image ${params.imageName}")
          if (approvalMap['Release']) {
            release = approvalMap['Release']
          }
        }
      }
    }

    // Other deployment steps
    // <...>

    post {
      failure {
        script {
          if (!release) {
            currentBuild.result = "SUCCESS"
          }
        }
      }
    }
    }
The deployment is then reported to Slack and recorded in our monitoring and auditing
systems.
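From a Jenkinsfile, the Slack notification could be as simple as this (assuming the Slack Notification plugin's slackSend step; the channel and message format are illustrative, and our exact reporting mechanism isn't shown here):

    post {
      success {
        // slackSend is provided by the Jenkins Slack Notification plugin.
        slackSend channel: '#deployments',    // hypothetical channel
                  color: 'good',
                  message: "Deployed ${params.serviceName}:${params.tag} to prod (build ${env.BUILD_NUMBER})"
      }
    }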
In summary, we have pipelines triggering other pipelines, which in turn invoke more
pipelines: effectively, a pipeline chain. A failure in any of the links propagates
back, and it's straightforward to re-run any of the previous build jobs.
The actual rollout is controlled by ECS, and we can influence it by setting minimum and
maximum deployment percentages. The configuration below, for example, will
temporarily double the number of running tasks before scaling them back, while
maintaining at least 70% of tasks running at all times.
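In Terraform terms, such a configuration looks like this (the service details are placeholders):

    resource "aws_ecs_service" "app" {
      name            = "my-service"    # hypothetical
      cluster         = var.cluster_name
      task_definition = aws_ecs_task_definition.app.arn
      desired_count   = 4

      # Allow up to 200% of the desired count during a rollout (temporarily
      # doubling the tasks) while never dropping below 70% healthy.
      deployment_maximum_percent         = 200
      deployment_minimum_healthy_percent = 70
    }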
During this rollout, the ecs-deploy utility monitors the status of the ECS service. If the
new revision deployed successfully, the utility marks the old ECS task revision as INACTIVE
and the deployment is considered successful. If the tasks consistently failed to start,
ecs-deploy would, after a configurable timeout, attempt to roll back to the previous ECS task
revision.
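An invocation along these lines (assuming the open-source ecs-deploy script; the cluster, service, and image are placeholders):

    # -c cluster, -n service, -i image, -t timeout in seconds
    ./ecs-deploy -c my-cluster -n my-service \
      -i 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:42 \
      -t 300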
There may be cases where the tasks start successfully, but the latest changes introduce
a bug which is only found after the deployment. In that case, we roll forward
by configuring the ECS service to use a new ECS task revision that points to the old image.
As far as Jenkins is concerned, this is a standard deployment, triggered
manually by providing the image tag to the build job as a parameter. Alternatively (and
more easily), we can just re-run an older build job, which already has all the parameters set.

Such a setup allows us to compose complex pipelines and to re-run them arbitrarily at
different stages. It also ensures that we have the latest version of the master branch
running in all environments, while keeping the deployment to production gated.