How to Prepare for the AWS Solutions Architect Professional Exam (SAP-C02)

You’re here! You’ve reached the point in your AWS career where you’re seriously considering taking the Solutions Architect Professional exam.

Likely, you’ve already heard a lot about this test. It has its own legend in the AWS community because it’s appropriately hard for what’s expected of a person who sits it: someone with roughly 2-5 years of AWS experience, typically working towards a Solutions Architect role or something similar.

However, I’m here to tell you that if you’re doing that kind of work every day, this test won’t be as hard as you think. If you aren’t, well…I can’t really say. Maybe it will be that hard for you!

Who It’s For and Why It’s Hard

From the horse’s mouth:

“The AWS Certified Solutions Architect – Professional (SAP-C02) exam is intended for individuals who perform a solutions architect role. The exam validates a candidate’s advanced technical skills and experience in designing optimized AWS solutions that are based on the AWS Well-Architected Framework.”

https://d1.awsstatic.com/training-and-certification/docs-sa-pro/AWS-Certified-Solutions-Architect-Professional_Exam-Guide.pdf

This test isn’t about knowing tons of features, in-depth configuration, or even a 300-400 level on any individual service. It’s a test for architects, which means you’ll have to wade through a series of questions that assess your ability to make design decisions based on business requirements.

The old analogy of a foot deep and a mile wide plays well here. I’d say that it’s twice as deep as the Associate exam but also twice as wide. Take a peek at the appendix of the document I cited above and you’ll see what I mean. You’re about to see services you’ve likely never touched in your life!

That breadth of the AWS portfolio is, in my opinion, what makes it hard. I had the immense privilege of taking the Advanced Networking Specialty exam 6 months prior to this one, and it was hard because it was so deep: 300-400 level questions across 65 questions. This one is roughly 200-250 level questions across 75 questions, so slightly longer but less deep, in exchange for a 10x increase in portfolio size.

Personally, I found this test to be about 25% easier than taking the ANS-C01. I didn’t fail this one, whereas I failed the other one. I found breadth to be easier to tolerate than raw depth. Depending on how you think, you might believe otherwise! Regardless, it’s a hard test. That much is certain.

Exam Resources

Okay, onto the good stuff!

This test consists of 4 domains, which are:

  • Domain 1: Design Solutions for Organizational Complexity – 26%
  • Domain 2: Design for New Solutions – 29%
  • Domain 3: Continuous Improvement for Existing Solutions – 25%
  • Domain 4: Accelerate Workload Migration and Modernization – 20%

Notice that it’s a fairly balanced test. There’s no “gaming” this one in terms of studying hard for a few domains and slacking on one. While Domain 2 is the heaviest, it’s not by much.

If you do this work daily, I think you’ll find that some of these resonate more than others. I found Domains 1 and 4 easier to grasp, whereas I spend less time with 2 and 3. Many of the customers I work with are in the process of migrating and building new solutions, so naturally the conversations I have around those lend themselves to this test. You might be in a different boat: if you work with very mature cloud organizations, Domains 2 and 3 might be more your style.

This is something I appreciate about this test–it maps well to real-world work. It is a practical exam.

Unlike some of the AWS Specialty exams, which have fewer resources available, there is an immense amount of material for this exam. The number of courses and books that exist for it is borderline overwhelming, so I’m going to offer you what I used and nothing else.

Adrian Cantrill. I’ve used his materials since I began my trek in 2020 and he has never let me down. This test is no different. Adrian Cantrill videos are the standard: in-depth, beautifully illustrated, and well-designed. You will get the maximum value per hour of study if you use his materials. I didn’t use any other courses to study, just this one. There are a handful of services that he doesn’t cover, but they are so few that I was able to simply look them up instead. Please refer to the AWS study guide to fill those gaps; I promise they are few.

Tutorials Dojo exams (aka Jon Bonso exams). Similarly, I’ve used these since 2020, and once again I’m here to say that they are the most accurate reproduction of the style and content of real AWS exam questions. Please, don’t buy dumps of the actual exam. If you can get through one of these and get a passing score, I promise you’ll do well on this exam. Serious kudos to the Tutorials Dojo team for providing this level of quality!

The AWS Skill Builder Exam Review course. This is available on the Tutorials Dojo site or on Skill Builder itself, but I’ve linked to the TD site here. I thought it was a great recap of everything you need to know to sit the exam, and it offers some probing questions to practice how you approach answering them. That’s an important skill for the SAP-C02.

User guides. I can’t tag them all here, but I spent a considerable amount of time in user guides for many of the services on this exam. As I said earlier, if this is your day-to-day job, you’ve likely already been in dozens of user guides. This test is simply assessing the knowledge that you’ve accrued over the years.

What You Should Know Well Going into this Exam

This is such a broad test that it’s hard to encapsulate every possible thing that you’d need to know, so I’m going to take a different approach. Let’s talk through architect tasks and map them to test items.

  • Migrations
    • Assess ways of migrating data and applications to AWS. Opt for native AWS tools like Application Discovery service, Server Migration Service, AWS DataSync, and the Database Migration Service.
    • Learn your 6 (or is it 7?) R’s. You’ll frequently see decisions between lift-and-shift/rehost and replatform. Rearchitecting/refactoring is also a hot topic that favors event-driven architectures and serverless options.
    • Establish Hybrid Connectivity. It’s rare to see organizations that are 100% in the Cloud, and this test acknowledges that. Know your options when it comes to VPNs, Direct Connect, etc. for extending on-prem to Cloud. I put this under migrations because it’s often a concern of new Cloud adopters, but obviously, it’s relevant for any organization.
    • Integrate Active Directory. Windows shops typically don’t give up AD when they migrate. Know Managed AD and AD Connector, and how to federate with AD.
  • Storage Solutions
    • Develop a storage solution against business requirements. Think S3, FSx, EFS. Know where they apply and why to choose them. For example, you can present S3 storage as an SMB file share, so maybe you don’t need to use FSx just because it’s a Windows environment. You might need FSx if DFS is a requirement throughout and you want a highly performant file system. And if it’s Linux hosts where a highly elastic shared file storage space is required…you already know, it’s EFS. Beyond this, understand S3 lifecycles, bucket permissions, tiering, access points, etc. You need to know storage well to pass this exam.
    • Connecting storage into AWS. Storage Gateway, File Gateway, Tape Gateway. Why would you choose this vs. DataSync? Know the trade-offs. Also oddly enough, AWS Transfer family had a showing on the test so don’t forget that.
    • Move data on a short timeline. Snowball. Sometimes it’s the fastest way into AWS. Know how it works and when to use one.
  • Organizations + Identity Management
    • Design an AWS organization mapped to the business. Learn how Control Tower can do this in an automated way. Learn OU structure, inheritance, and how SCPs apply at each level. How do you centralize billing and apply chargeback models? How can you share resources across an Organization effectively?
    • Build Principle of Least Privilege environments. Learn how to use IAM roles effectively; that’s the preferred option. Learn about cross-account roles and how to make them. How would an auditing account be permitted read access? These are the things that will come up in the test.
    • Manage identities. IAM, Cognito, SSO. These are all in play and for good reason. It’s hard to find a single customer that isn’t making use of these, so expect to see a lot of questions geared around federating identities and how that works in AWS terms. Temporary credentials via STS are a must-know item.
  • Networking for Complex Organizations
    • Design a hub and spoke network for a multi-account, multi-AZ or multi-regional organization. You better know VPC inside and out at a minimum. Get familiar with AWS best practices in this space like using Transit Gateway effectively, ingress/egress architectures for inspection, integration of security appliances, etc.
    • Know the right hybrid network connectivity option. VIFs…Private, Public, Transit. Where does a DX Gateway make sense? When is it better to just use a private VIF? Is a VPN performant enough? All good to know items on the test.
    • Leverage Endpoints to access internal services. You won’t see too much of this topic but know your options and why they’re important. If you need a private option that scales well for high bandwidth requirements, or simply for the privacy it provides, you’ll want to be familiar with PrivateLink, interface endpoints, gateway endpoints, and gateway load balancer endpoints for this exam. You’ll also want to know how to design these endpoints and where to put them.
    • Integrate hybrid DNS solutions. You’ll want to have a solid understanding of DNS records at a minimum. Can you clearly state the difference between an Alias and CNAME record? Think through Inbound and Outbound endpoints and where to use them (Managed AD perhaps?). Understand the use case of private DNS records. Learn how to integrate an on-prem environment with AWS.
  • Content Distribution Networks + Web-Facing Content
    • Design a CDN to serve static content to customers. This is an ultra-common scenario on this test in particular. Know CloudFront inside and out, how it connects to S3, how to secure your connection with HTTPS + SSL certs, how OAIs work, how the edge cache works, TTLs, etc. Also know when not to use CloudFront and instead use Global Accelerator (when you need to move the network closer to the customer and not content).
    • Leverage Lambda@Edge to perform functions closer to users. It needs to be said, it comes up a few times. Know the common use cases like inspecting URLs or headers, checking cookies, generating HTTP responses, etc. It will almost certainly be on the test so expect to know when to use it.
    • Connect to web servers from the internet. How are customers accessing web servers? Via CF, via a load balancer? How are certificates managed? Where do you terminate your SSL connection? How do web servers scale to support increased traffic? Learn the many ways this could be done and be prepared to see a lot of questions related to this topic.
    • Make it resilient. Get familiar with the ASG + Load Balancer combo, and failover routing with Route53. How do you respond to health checks? What happens when there’s a failure? Assume there will be a failure and plan for it effectively.
  • Monitoring Your Environment
    • See what’s happening and respond accordingly. There are a lot of questions about detecting changes in your environment, whether it be user actions or resource changes. Understand how you can use CloudWatch Logs and CloudTrail to identify and aggregate these items of interest and then leverage tools like SQS + Lambda to provide remediation. Know how alerts can be sent out via SNS to responsible parties. Ideally, you can automate a lot of this!
    • Correct user actions. While AWS Config is only detective in nature and doesn’t provide remediation, it can be used in combination with Lambda to correct unwanted user actions.
    • Detect PII in your environment. Macie gets a few mentions here and there, so understand where it’s applicable. IMO, this is a great one to score easy points…if you see PII, think Macie.
  • App Modernization & Event-Driven Architectures
    • Decouple the monolith using AWS services. SQS and Lambda are common features in event-driven architectures. Knowing their capabilities and limitations is a must. Lambda in particular has some constraints around startup, concurrency, and time to run. You might find using Batch to be the better option for long-running tasks, and maybe you’ll want to use Step Functions for sequential items.
    • Leverage APIs in your environment with native AWS services. API Gateway is a frequent mention in this exam, although I found it wasn’t covered in depth. One thing you definitely want to know is the errors it throws (4XX & 5XX errors). You should also be familiar with how it stages and deploys, and the interaction it has with other AWS services, especially Lambda.
    • Assess container services against each other as a matter of best fit. In the real world, this is often a matter of customers figuring out what level of management they need over their resources and what tasks they’d rather hand over to AWS. One common scenario you’ll see in this exam is comparing a few different container solutions and choosing the best one. Should you use ECS in EC2 mode, Fargate mode, or pick EKS? I think at a basic level you have to understand containers well to figure any of these out, but more so you need an understanding of the trade-offs of each AWS service and where it is most appropriate. You’ll see some explicit mentions of Kubernetes, nodes, etc., so brush up on those terms if you’re not already familiar.
  • Infrastructure as Code
    • Make use of IaC to simplify operations and control changes. AWS heavily favors using CloudFormation (duh) and this test is a continuation of that. No mention of Terraform here, just CloudFormation. This means you should be familiar with CloudFormation and how it works (how it’s built, how it deploys). Know terms directly related to the service, some of the functions it includes, and the variety of “stack” options, including Nested Stacks, StackSets, cross-stack references, etc. Also know the problem it solves and why you would choose CloudFormation in a list of options to build a new solution.
    • Use the Serverless Application Model (SAM). AWS is pushing this, but if you’re not aware of what it is and what it does, it’s worth checking it out. SAM is like a combination of serverless + CloudFormation which is fairly easy to understand and implement.
  • Databases and Data Warehousing
    • Pick the right database for the job. This is harder than it sounds because AWS has an extensive list of DB options that can achieve a high level of specificity for each use case. I found RDS, Aurora, and DynamoDB to all be heavily featured on this exam. You’ll see one of them in at least 50% of the questions on this test because the DB layer is in nearly every architecture. Just as in the real world, many of the questions are about features of the services themselves, especially around the areas of Business Continuity and Disaster Recovery. How does each of these fail, and which one addresses the given RTO and RPO requirements? You should know the capabilities of each in depth in this area. The robustness of these solutions is exam fodder, so allocate a considerable amount of your study time to each of them. I didn’t see as many mentions of some of the lesser-known DBs, but that doesn’t mean you should avoid them. I’d still brush up on Neptune, DocumentDB, and QLDB.
    • Solutioning with Redshift. Understand how Redshift ingests data, what sources it uses, and how it works internally. It’s an OLAP database–if you’re unclear what that is, it’s worth reading through why that’s important and what it solves. You’ll occasionally see Redshift compared to other DBs which either makes it a dead giveaway or allows you to scratch it out, but either way it does get attention so be ready for it. One last item–know how it fails, it might surprise you.
    • Make use of caching solutions. Get familiar with some of the AWS solutions like ElastiCache and DAX and the problems they solve. If you see read-heavy workloads, it might be the right fit. You’ll see scenarios where customers are struggling to keep pace with traffic and this is often a good solution for that.
  • Data Analytics and Streaming
    • Architect for demanding levels of ingest and analytics. Kinesis is the go-to product for real-time and near-real-time solutions. If you hear those words, your ears should be perking up already. It’s a bit of a tricky product, especially if you don’t interact with it that much, but rest assured that if you understand it conceptually along with the problem it solves, you should be able to take it on in the exam. Be able to explain the differences between Data Streams and Data Firehose. Understand where they can ingest from, what comes out, and where that data can go. Know how to differentiate it from SQS, which is similar but still different. Kinesis is for big-scale ingestion!
    • Analyze your data. Make use of Kinesis Data Analytics for real-time data processing. Use other products for non-real-time analytics. You might see some EMR mentions (AWS-managed Hadoop), mentions of QuickSight (business intelligence dashboards), or AWS Batch (for long-running batch jobs that don’t fit Lambda). I think familiarity with each product will serve you well here, but don’t over-prioritize it.
    • I’ll be the first to admit, I don’t spend a lot of time with Data Streaming or Analytics products. Because of this, I tried to get to the point where I could easily distinguish these services, but my hands-on knowledge is lacking. I still passed, so it’s doable.
  • Dev Products
    • Leverage the AWS Code suite for CI/CD. The Code suite refers to the list of AWS Code products–CodeCommit, CodeBuild, CodeDeploy, CodePipeline. It’s a blessing and a curse…they all work together, but that can also make them difficult to separate. In my mind, CodeCommit = AWS GitHub, CodeDeploy = AWS Jenkins, CodePipeline is self-explanatory, and CodeBuild is…complicated. Remember it’s used for builds and that you can customize it with buildspec.yml. All in all, if you can understand how these all work together as part of a functioning CI/CD pipeline, you should do well.
    • Know when to use Elastic Beanstalk. One of the strangest named AWS services, this odd product is a way of abstracting the AWS away from the developer. The good thing about it is that it emphasizes being very managed, so the use case will likely stand out when you see it. Elastic Beanstalk is aimed at smaller teams, where abstracting the infrastructure is paramount.
  • Cost Optimization
    • Save money in your designs. I am so happy to see this prioritized in this exam because it speaks to real concerns. How can we save money? Know things like Savings Plans (EC2, Compute), using Reserved Instances, Spot Instances, and the timeframes related to spending. Design using ASGs to save money when stuff isn’t running. Make use of storage tiers based on frequency of access, and leverage lifecycle policies to move data or delete it when necessary (see the sketch after this list).
    • Implement tagging. This might only get a mention or two, but understanding how to do cost allocation is a useful skill regardless of the exam. Know how to apply tags, where to create them, who can modify them, and where they will be inherited.
  • Security
    • Bake security into all your designs. If you’ve used AWS for any period of time, you’ll know that AWS by design nudges you to choose secure options. What this exam wants you to do is default to AWS best practices and pick the best-fit option for every scenario. Big items include key management, secrets management, and detective services. KMS, Secrets Manager, GuardDuty, Shield, Inspector, WAF, and Config will be there. Choose roles over users most of the time. Apply the principle of least privilege in all your policies. Reinforce your security posture using features like NACLs in combination with Security Groups.
    • This is not the Security Specialty exam. I think at this point in time you’ll be equipped if you understand the portfolio of security products and general AWS best practices–there are no “gotcha” security items on this exam.
  • Random Stuff (The Extended AWS Portfolio)
    • Systems Manager. I couldn’t fit this elsewhere but let it be known, you should know about Systems Manager on this exam. Understand how the agent works, how patching works, and how you can use Systems Manager to make your life easier in operations. A common scenario involves simplifying operations, so you’ll see it come up in questions like that.
    • IoT. Greengrass, IoT Core…I had a surprising number of questions about these products. In hindsight, I would have allocated an hour or two to learn these products so I could easily differentiate them. While IoT is a niche solution, it is certainly a feature on this exam.
    • Amazon Text + Speech products. There’s a long list of them, including Lex, Polly, Textract, Transcribe, and Translate. Being able to distinguish between them is important as they are all different and frequently used in combination with one another.
    • Workspaces. AppStream. Know the difference.
    • SageMaker. Do not bury yourself trying to learn it, but it’s worth knowing about and what it does before you head in. It’s an incredibly complex product that might confuse you without some preparation.
    • X-Ray. This understated service might be on the exam, so know the use for it. It might help a dev in need!
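
Since lifecycle policies come up so often under both storage and cost optimization, it’s worth having seen one. Below is a minimal sketch in Terraform (my IaC of choice on this blog–the exam itself will frame this through CloudFormation or the console), with a made-up bucket name and arbitrary day thresholds:

# A hypothetical bucket plus a tier-then-expire lifecycle policy
resource "aws_s3_bucket" "logs" {
  bucket = "example-log-archive-bucket-0123" # must be globally unique
}

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "tier-then-expire"
    status = "Enabled"
    filter {} # applies to every object in the bucket

    transition {
      days          = 30
      storage_class = "STANDARD_IA" # infrequent access after 30 days
    }

    transition {
      days          = 90
      storage_class = "GLACIER" # archive after 90 days
    }

    expiration {
      days = 365 # delete after a year
    }
  }
}

If you can read a policy like this and explain why each tier was chosen, the storage and cost optimization questions get much easier.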

Conclusion

This is without a doubt the longest review I’ve written for an AWS exam, and it’s still too short to cover the portfolio of services that are included in this exam.

I felt like this test was a perfect assessment for a budding Architect–it covers a lot of ground, presents you with challenging questions, and dives into services that you don’t run into often that you still ought to know. I would not recommend this test to anyone with less than 2 years of AWS experience, and I certainly would not recommend it to someone who has never held an AWS job before. I would highly recommend it to someone who has been working with AWS for 2+ years and is ready to push themselves on a certifiably difficult exam.

With adequate preparation, this test is very achievable. Take it seriously, get your hands dirty, take on AWS design problems at work, and you’ll certainly pass this. Best of luck!

How to Prepare for the AWS Advanced Networking Specialty Certification (ANS-C01)

So, you’ve heard everything about how hard the AWS Advanced Networking Specialty test is and you’re looking for more information about it…

I can confirm: the legends are true. It’s a hard test. It might or might not be the hardest AWS test–I haven’t taken SA Professional yet, so I can’t say for sure. I did take this one twice–once more than any other AWS exam.

What I can say definitively is that this test will strain your knowledge of AWS cloud networking. If you hear networking and flinch, this might be the hardest AWS test for you.

Who It’s For and Why It’s Hard

So who is it for exactly? Let’s see what AWS has to say:

“The exam is for individuals who perform complex networking tasks.”

That’s a great start, but here’s who I think it can be useful to:

  • Network Engineers
  • Cloud Engineers
  • Solutions Engineers and Architects

That’s it, really. I think there are some other niche roles that would benefit from this knowledge, but this is a focused exam that has only tiny bits of overlap with other knowledge domains. It’s networking-focused, point blank.

Why is this test so hard, you ask?

It’s not just knowledge. Sure, there’s a lot of knowledge required to take it but the domain itself is fairly narrow in AWS scope. There aren’t that many services that you need to know, but you do need to know them well.

What does make it a challenge is the visualization you’ll have to perform for most of the questions on the exam. There are no diagrams, no topologies, no visuals to help you. You will have to rapidly imagine what these topologies look like and apply scrutiny to recommend a particular AWS solution.

Questions are often asked in the familiar “AWS-pillar” format:

“What is the most OPERATIONALLY efficient solution?” “What is the most COST-EFFICIENT solution?”

Or you’ll get it like this: “What solution would best solve this problem?” That one can be dangerous if you can’t distinguish between a good solution and a great one.

Looking at large network architectures, visualizing them in your head, and in 3 minutes (about what you’ll have per question) determining the best solution is what makes this exam hard. I’ll get into more of this later, but there’s an element of imagination that I think makes this more difficult than other AWS exams.

Exam Resources

The ANS-C01 is a 65-question, 170-minute exam. You’ll need all the resources you can get to tackle it.

Here are the most helpful that I’ve found. I’ve categorized them as well:

What You Should Know Going Into the Exam

This exam assumes you have above-average network knowledge. For anyone who isn’t a network engineer, what does that mean?

Here’s a short list of things to know, certainly not all-inclusive:

  • IP Addressing & Subnetting
  • OSI Model
  • CIDR conventions
  • BGP
  • VLANs
  • Cable types (to the extent it applies to Direct Connect and your Customer Gateway)
  • DNS function
  • DHCP
  • Routing Protocols (OSPF, Static Routing)
  • TCP + UDP transmission

It also assumes you have a strong prior understanding of VPC. This includes most of the core functions and features of a VPC (subnets, availability zones, security groups, NACLs, endpoints, etc.).

Domain breakdown is like so:

  • Network Design (30%)
  • Network Implementation (26%)
  • Network Management and Operation (20%)
  • Network Security, Compliance, and Governance (24%)

I feel like this accurately reflects what I saw in the exam. There are many AWS services at play in any of these scenarios, so be prepared to see all of them presented in different domains.

As for AWS networking services you’re expected to know for the exam, I think there are a few staples that you need to know inside and out in no particular order.

Route 53
Hybrid DNS Hub & Spoke Architecture From the Deep Dive on Hybrid DNS

How well do you know Route 53? I doubt you know it nearly well enough to pass this exam unless you are using it every day. Making full use of this service involves hybrid configurations, so it’s not enough to know it as the AWS-only DNS service.

You should have a strong understanding of Public and Private Zones, Inbound and Outbound endpoints, the Route53 Resolver (or .2 resolver), Hosting & Registrar options, and migrating a domain to AWS.

Transit Gateway
Transit Gateway peering architecture (hub and spoke)

You’re probably familiar with Transit Gateway (TGW) for the peering capability it provides and the fact that it makes AWS networking 10x simpler. However, this exam goes in-depth on the benefits and use cases of TGW, as well as its constraints.

The most important thing you can know is when to use TGW and when another alternative might be the better choice (PrivateLink, VPC Peering). You should also understand how to use TGW in situations where network isolation is required, and how to control associations and propagations to do so.

Have a strong understanding of the use of TGW (or many of them) within a wider architecture. You’ll have to imagine it in play a LOT throughout the exam, so work to internally visualize it in every situation.

Direct Connect
Basic Direct Connect Architecture with Public and Private VIFs

AWS’ premier hybrid connectivity service is everywhere on this exam. You can scarcely find a question where Direct Connect (DX) isn’t explicitly mentioned or in-use. Knowing DX and the million ways you can configure it to work for a variety of needs is important.

What makes it particularly tricky is the fact you can’t spin it up like an EC2 instance. You’re going to have to memorize theory and do your best to incorporate it into your other learning.

Know how to use DX with VPNs, how to connect it to TGW, DX Gateway, all the Virtual Interfaces (VIFs) DX supports, how to request and set it up, how to use BGP on top of it, what encryption options are available (it isn’t encrypted by default!!), and finally how to use it for multi-region requirements.

Elastic Load Balancing

Choosing between load balancers is a frequent decision point on the exam, as well as how to make better use of the one you’re already using. AWS has a vested interest in having you use their load balancers and wants you to know them inside and out. You likely won’t see a single question about 3rd party load balancers on this exam.

Know the capabilities and limitations of each load balancer, especially the Application Load Balancer and Network Load Balancer. The Gateway Load Balancer isn’t as heavily featured (simply because there are fewer places to use it), but it’s still important due to its integration role with 3rd party security appliances. Be aware of the encryption options across each, the layer each operates at, X-Forwarded headers and Proxy Protocol v2, security policies, as well as their architectures.

BGP

I know this isn’t an AWS service, but you need to know it well. You need to know about BGP communities in AWS, how BGP selects a best path, how AWS prioritizes BGP routes, how it’s formatted, Active/Passive vs. Active/Active configurations, and in general how this protocol connects your on-prem and AWS workloads. Prioritize this knowledge–it pays off here.

Container and Kubernetes Networking (AWS-specific)

I was surprised to see this featured so heavily, but it’s definitely there. Knowing how to autoscale, load balance, and network in some of the AWS native container services (ECS, EKS) is a must. While not as in-depth as the other mentions above, don’t get caught off-guard by this. It could be the difference between a pass and a fail.

Other Services

There are obviously far more services featured than this, but those were the main ones I saw. If you’d like a more complete list, check the Appendix of the Official Guide.

Summary

Anyone who attempts this test has my respect. It’s not an easy one, and rightfully earns its legend. It’s doable, but hard.

I’d recommend that anyone who takes it get a practice exam if possible to get familiar with the questions. I didn’t, and I regretted that because it could have spared me a failure.

I hope this guide helps you! Feel free to reach out to me on my LinkedIn page if you have any questions about this exam.

How to Setup AWS Elastic Disaster Recovery

AWS Elastic Disaster Recovery, or EDR, is a managed pilot-light-style disaster recovery solution. It is perfectly suited to those wanting great RTO and RPO times at a lower cost and with minimal hassle.

Today, I will walk you through how to setup and use EDR. Before that, a few fast facts about the product:

  • Agent-based
  • On-Prem or Cloud-based workloads
  • Uses point-in-time recovery (snapshots)
  • AWS Managed
  • RPO times of seconds
  • RTO times of minutes
  • 300 replicating source servers per account per AWS region
  • Can perform test drills with no impact to production
  • Allows failback to source servers

One thing to point out is that “AWS Managed” in this case means that the disaster recovery processes are managed by AWS. You are still responsible for installing the agent, configuring replication settings, and initiating the recovery.

AWS manages the snapshots, the replication server, and the conversion of a snapshot into a recovery instance to provide a relatively seamless experience for you.

Architecture

AWS EDR General Architecture

Immediately, you’ll notice there’s no mention of EDR in here. Amazon DRS is the same thing as EDR–the naming convention has changed over the years.

EDR uses the agent on your source servers to replicate data across to a staging subnet in any AWS region that you select. This ensures that point-in-time snapshots are kept ready in the event of a failover or drill.

EDR supports point-in-time snapshots:

  • Every 10 minutes for the last hour
  • Once per hour for the previous 24 hours
  • Once per day for the previous 7 days (can be modified for anywhere between 1 and 365 days)

When a failover happens, EDR converts a snapshot into recovery instances within a specified recovery subnet. The conversion uses either the most recent snapshot or one you select.

You’ll notice that it creates a conversion server in EC2 to process all your snapshots into recovery instances. Upon process completion, the conversion server shuts down and the recovery instances are created. You can access these instances like any other via the EC2 console. You’ll also be notified in the EDR dashboard that the job is complete.

Installing the Agent

The first step in setting up EDR is turning it on in the console. To do that, navigate to the EDR service in the region you would like to host your replication server in and enable it. In my demo, you’ll see that I use us-east-2 for my replication region.

In addition, you’ll want to specify the staging area subnet; otherwise, it will be placed in the default VPC for that region. You can also pick the size of the replication server, its volumes, security groups, point-in-time policies, etc.

Once that is complete, we can install the replication agent on the VM/server that we want replicated. Successful installation requires either temporary or permanent AWS credentials. I won’t cover that here, but however you choose to do it, ensure you have credentials to perform this. I will use an access key in the demo.

Installation is performed by downloading the installer and then running a command on your source server. I’m using an Ubuntu Linux distro, but you can find the installer located here.

Something I also want to mention is that you can install this agent on any server in any AWS region. The source server I will replicate is in us-east-1, but it could be anywhere.

The installer download is below:

sudo wget -O ./aws-replication-installer-init.py https://aws-elastic-disaster-recovery-us-east-1.s3.us-east-1.amazonaws.com/latest/linux/aws-replication-installer-init.py

Then run it as a python script:

sudo python3 aws-replication-installer-init.py

Doing this will prompt you for some information, including the desired replication region, access key information, and the disk path to replicate:

Running the aws-replication-installer file & configuration requirements

Once you’ve filled that out, the agent will install on the source server and then eventually begin syncing with the EDR console. Upon completion of that, it will output a source server ID and state that the replication agent was successfully installed.

In the EDR dashboard of your replication region (us-east-2), you’ll see that the source server has started syncing:

EDR Dashboard with Source Server syncing

You’ll also notice that a new recovery replication server was created in EC2 as part of the installation and syncing process:

Newly created replication server instance

This recovery replication server handles the point-in-time snapshots and runs in the background ready to support any EDR-related failovers. Otherwise, you can ignore it for the rest of this demo.

Configuring EDR Settings

There are a few important configuration items you should determine before going through with a failover drill or recovery. You can find all of these in the EDR dashboard.

Go to Source Server and click on the source server you synced up earlier. This will lead you to an overview that looks like this:

The source server overview and configuration page

Underneath the overview, you’ll see a few tabs. On top of the recovery dashboard itself that shows you the data replication status, you’ll want to look at the replication and launch settings.

Within the replication settings, you can choose the subnet for the replication server, the instance type, and a few other important criteria:

Replication settings

Perhaps the most important item in here is the replication subnet itself–where do you want your replication servers to be based? You can also choose to have a dedicated replication server.

You should also check launch settings before initiating a failover. Here you can specify general launch settings to include instance right-sizing (which I’ve turned off to save money) and configure your EC2 launch template for any instances replicated from this source server:

Launch settings

Within the launch template, you can modify the size of the recovery instance(s) to be created and the subnet you’d like to create them in, as well as any other settings available in a launch template. However, the instance size can only be modified IF you turned off instance right-sizing; otherwise, AWS will continuously overwrite the template, even after you’ve set your selections as the default template.

Failover (Recovery Jobs)

EDR provides you with two ways to do a failover–either as a drill or as a real recovery.

The drill has zero impact to production servers, and can be done whenever you want. It can take a snapshot you choose and create recovery instances from it. This is a great option to test your DR plan without impacting production, and gives you the benefit of a real recovery time to benchmark off of.

You can initiate a recovery job from the source server panel. In order for it to work, the source server must be healthy (indicated in the dashboard) and there must be an available snapshot to use:

Initiate recovery job

In this scenario, I’ll select initiate drill and then a snapshot to use:

Select point-in-time snapshots

If this was a real failover, you’d likely want to use the most up-to-date snapshot, but as you can see there are point-in-time recovery points that follow the guidance posted earlier. Select a snapshot and click Initiate Drill to begin.

To see the progress, go to Recovery Job History and select the most recent job:

Recovery job history dashboard

At this point, you’ll see some job progress listed. Snapshotting occurs at the same second as the job is initiated, and then conversion begins:

Job in process

This conversion takes a few minutes to complete, at which point a new recovery instance is created. This process takes about 10-20 minutes, easily satisfying RTO times of an hour or less.

Upon completion, you’ll be notified in the Job Log that the job has ended and that a new recovery instance was created. A link to it is also provided:

Job complete

Following that link will take you to the Recovery Instances page, also located in the left taskbar. From there you can access the instance link and see any details about the newly created recovery instance:

Recovery instance page (drill instance)

You’ll notice a failback option in the upper right corner. This failback feature, although outside the scope of this demo, can be used to fail this newly created instance back to the source server. It requires additional configuration to perform. It can also be used as part of a mass failback for multiple VMware vCenter machines. You can read more about that here.

At this point, we’ve successfully configured EDR, installed the agent, synced the source server, and performed a failover drill. The demo is complete!

Summary

AWS EDR is a managed, point-in-time DR solution that works best as a pilot light. It can help you achieve RTO times of minutes and RPO times of seconds, all while reducing cost. It can be a valuable tool in your DR strategy and is a solid entry point into cloud DR compared to more complex approaches like load-balanced, autoscaled compute or multi-region failover routing policies.

If you are looking for a warm standby solution, this isn’t it.

If you’re looking for an active/active solution, this also isn’t it.

But for workloads where you can afford the RTO/RPO times that this product meets and where you prefer to leave the heavy lifting to someone else, I’d highly recommend using EDR.

I hope this was helpful! Please reach out to me on LinkedIn for any questions/concerns related to this documentation.

How to Deploy an NGINX Host on EC2 Using Terraform

Today I’m going to talk through how to set up an NGINX Host on EC2 using Terraform! You can reference my work over on GitHub.

For anyone not familiar with NGINX (pronounced “Engine-X”), you can use it for a wide variety of applications. It’s most commonly known as a web server, but it could work as a reverse proxy server, or even as a mail proxy.

Deploying an NGINX host is pretty straightforward, and a nice little exercise to expand on that is to do it using Infrastructure-as-Code. I’ve chosen to use Terraform today, and I’ll be provisioning it onto AWS infrastructure.

Building AWS Infrastructure

A few design choices I’ve made for this exercise:

  • AWS EC2 Host
  • Terraform Cloud backend
  • Ubuntu 20.04 AMI

The Terraform Cloud backend here is optional. I’ve gotten comfortable using it to store my state files when I work, but that’s totally up to you if you prefer to store them locally. One reason to consider using it is if you want to connect to a Version Control System like GitHub down the line.

Below is the backends.tf code:

# --- root/backends.tf ---

terraform {
  backend "remote" {
    organization = "YOUR-ORG-HERE"

    workspaces {
      name = "terraform-nginx-host"
    }
  }
}

What the above does is connect me to my organization on Terraform Cloud and establish which workspace I’ll have my state file stored in. As I said before, this is an optional step.

If you don’t opt to use TF Cloud, use the code below instead and throw it in the providers.tf file:

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region                  = "us-west-1"
  shared_credentials_file = "~/.aws/credentials"
  profile                 = var.aws_profile
}

Carrying on, below is the providers.tf file that I used:

# --- root/providers.tf ---

provider "aws" {
  region                  = "us-west-1"
  shared_credentials_file = "~/.aws/credentials"
  profile                 = var.aws_profile
}

The above providers.tf file dictates what provider we’ll be referencing as we create infrastructure. These providers add a set of resource types and data sources for Terraform to manage. You’ll want to run a terraform init once you’ve added your backend and providers.tf files.

Next, I’ll go through the Virtual Private Cloud (VPC) Infrastructure in my main.tf file. I’ll do my best to break it down so it’s quick and easy to understand:

# --- root/main.tf ---

resource "aws_vpc" "trobsec_vpc" {
  cidr_block           = "10.87.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "trobsec"
  }
}

resource "aws_subnet" "trobsec_public_subnet" {
  vpc_id                  = aws_vpc.trobsec_vpc.id
  cidr_block              = "10.87.1.0/24"
  map_public_ip_on_launch = true
  availability_zone       = "us-west-1a"

  tags = {
    Name = "trobsec-public"
  }
}

resource "aws_internet_gateway" "trobsec_igw" {
  vpc_id = aws_vpc.trobsec_vpc.id

  tags = {
    Name = "trobsec-igw"
  }
}

resource "aws_route_table" "trobsec_public_rt" {
  vpc_id = aws_vpc.trobsec_vpc.id

  tags = {
    Name = "trobsec-public-rt"
  }
}

resource "aws_route" "default_route" {
  route_table_id         = aws_route_table.trobsec_public_rt.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.trobsec_igw.id
}

resource "aws_route_table_association" "trobsec_public_association" {
  subnet_id      = aws_subnet.trobsec_public_subnet.id
  route_table_id = aws_route_table.trobsec_public_rt.id
}

resource "aws_security_group" "trobsec_sg" {
  name        = "trobsec_sg"
  description = "trobsec security group"
  vpc_id      = aws_vpc.trobsec_vpc.id

  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = [var.my_public_ip]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

The VPC itself falls in the CIDR block of 10.87.0.0/16 and contains one subnet, 10.87.1.0/24, located in us-west-1a. Truth be told, I don’t really need 251 IPs (the number of available IPs in the /24 range minus AWS-reserved IPs) to do this, but for the sake of ease bear with me.

This VPC is accessible by the public internet via an Internet Gateway (IGW). You’ll notice in my subnet configuration that I’ve included a property called “map_public_ip_on_launch.” This ensures instances launched in this subnet will be automatically assigned a public IP and accessible via the IGW.

There is one main route table, and one default route that points traffic from inside the VPC to the outside world (0.0.0.0/0) using the IGW I set up earlier. I’ve associated the one subnet to this route table as well. What this allows is for traffic to always have a way out of the VPC and to the public internet.

Lastly, I’ve added a security group here. A security group in AWS acts like a stateful firewall, tracking traffic entering and leaving the resources it’s attached to. In this case, I’ve authorized a single ingress rule allowing any port and protocol from my own public IP address, and I’ve authorized egress on any port to any destination. This security group will also apply to the EC2 instance I’ll cover next.

Create an EC2 Instance and Key Pair

Creating an EC2 instance with NGINX on it requires a little custom configuration.

You need to:

  • Select an Amazon Machine Image (AMI)
  • Write a bootstrap script (Bash)

Surprisingly, selecting an AMI requires some extra configuration. You’ll want to create a datasources.tf file and a data resource:

# --- root/datasources.tf ---

data "aws_ami" "server_ami" {
  most_recent = true
  owners      = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
}

Data sources are like read-only references in Terraform. They aren’t resources in the sense that they create anything, but they can be referenced by actual resources to pull information from AWS where applicable.

In this case, we are using an Ubuntu 20.04 AMI. Some of this information is tricky to get. You’ll want to go to EC2 => Images => AMI Catalog => Ubuntu 20.04 LTS (HVM) and get the AMI ID off of it (e.g. ami-01f87c43e618bf8f0).

AMI Catalog

Using the AWS CLI (a separate download, check this site out to get it), you can actually grab the owner ID which you’ll need to pull the correct image.

aws ec2 describe-images --region us-west-1 --image-ids ami-01f87c43e618bf8f0

Towards the top, you’ll see the Owner ID. Use that number for the owners property, and be sure to encapsulate it in brackets too!

You can also use the ImageLocation for values. In my case, I’ve opted to use an * at the very end to denote all potential values after amd64-server. This will allow that value to change without breaking my data request.

After this is complete, the EC2 Instance itself isn’t too much of a hassle:

# --- root/main.tf ---

resource "aws_instance" "trobsec_host" {
  instance_type          = "t2.micro"
  ami                    = data.aws_ami.server_ami.id
  key_name               = aws_key_pair.trobsec_key.id
  vpc_security_group_ids = [aws_security_group.trobsec_sg.id]
  subnet_id              = aws_subnet.trobsec_public_subnet.id
  user_data              = file("userdata.tpl")

  root_block_device {
    volume_size = 10
  }

  tags = {
    Name = "trobsec-host"
  }
}

Going through this, you’ll see that I’ve opted for the free tier (t2.micro) and used the AMI data source referenced previously. I’ve assigned this EC2 instance to the subnet and VPC I created earlier, and I’ve given it a 10 GB root volume.

Some unfamiliars in there…where is the key? I’ll get to that in just a second.

First, let’s talk through the userdata.tpl file I’ve created:

#!/bin/bash
sudo apt update -y &&
sudo apt install -y nginx
echo "Hello World from $(hostname -f)" > /var/www/html/index.html

This is a very simple Bash script that will run on the instance at startup, also known as bootstrapping. It will update apt, install NGINX on the host, and then overwrite the default index.html with “Hello World from HOSTNAME.” No other configuration is necessary to get NGINX to work.

Now for that key:

# --- root/main.tf ---

resource "aws_key_pair" "trobsec_key" {
  key_name   = "trobsec-key"
  public_key = file("~/.ssh/${var.aws_key_name}.pub")
}

You can generate and store a private and public key locally on your computer and use the file function to use that public key when provisioning the instance. You can generate the keys whatever way you like as long as it’s a common AWS encryption standard. Use this documentation to get a better understanding of this particular resource.
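
If you’d rather not manage key files by hand at all, one alternative (just a sketch using the hashicorp/tls provider–not what I did in this build) is to have Terraform generate the key pair itself, as a drop-in replacement for the block above:

# Alternative: let Terraform generate the key pair via the hashicorp/tls provider
resource "tls_private_key" "trobsec" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "aws_key_pair" "trobsec_key" {
  key_name   = "trobsec-key"
  public_key = tls_private_key.trobsec.public_key_openssh
}

Just be aware that the generated private key is stored in your Terraform state, so treat that state file as sensitive if you go this route.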

Other Configuration Files

Variables.tf:

# --- root/variables.tf ---

variable "aws_profile" {
  type        = string
  description = "AWS Profile"
}

variable "aws_key_name" {
  type        = string
  description = "AWS Key"
}

variable "my_public_ip" {
  type        = string
  description = "My public IP address + CIDR"
}

These are all strings. I’m actually storing them in a terraform.tfvars file and throwing that in my .gitignore, so you can fill in the blanks for this particular one.
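
For reference, a terraform.tfvars file for these variables might look like the following (all values below are placeholders–swap in your own):

# --- root/terraform.tfvars ---

aws_profile  = "default"
aws_key_name = "trobsec-key"     # expects ~/.ssh/trobsec-key.pub to exist
my_public_ip = "203.0.113.25/32" # your public IP in CIDR notation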

Outputs.tf:

# --- root/outputs.tf ---

output "public_ip_addr" {
  value = aws_instance.trobsec_host.public_ip
}

This is handy for getting the public IP address posted when you finish a Terraform apply (which I’ll do in just a minute!).

Terraform Plan and Apply

Now for the fun stuff. Throughout this, it would be preferable to test piece by piece, but for the sake of simplicity–let’s run a terraform plan now:

9 to add, 0 to change, 0 to destroy

If you’ve done everything correctly, you should have 9 resources to add.

Next, let’s run a terraform apply -auto-approve and watch it go! This should take approximately a minute.

Apply complete! Includes outputs.

After this, you’ll get a sweet little notification that your terraform apply is complete, and you’ll see that output with the listed public IP address!

Testing

Now we have to test that NGINX is running as it should. First, let’s SSH into the host to confirm it’s up (do this from the folder where your private key is stored):

ssh -i KEY_NAME ubuntu@IP_ADDRESS

Then check the status:

systemctl status nginx

You should see it running. If not, try doing a systemctl restart nginx to see if it goes. If it didn’t install correctly, it will let you know that as well. Be sure to check your Bash script and ensure there are no typos.

Last, the most glorious test of all:

Hello World! from our Public IP

Hooray! If you’re seeing a Hello World from ip-X-X-X-X.us-west-1.etc you’re in luck. This demonstrates that your NGINX server is publicly accessible and serves web pages.

After you’re finished, make sure to use terraform destroy -auto-approve to delete your infrastructure! You should destroy all 9 resources in the process.

Conclusion

Congrats on completing this! Playing with Terraform is always fun and makes the whole process a lot easier when it comes time to tear down everything.

I hope this was helpful to you! If you have any questions about this, please feel free to reach out to me in the comments below or on LinkedIn (see my about page!).

How to Prepare for the HashiCorp Terraform Associate Exam

HashiCorp Certified: Terraform Associate Certification : Terraform

Today I’m going to talk you through how to prepare for and pass the HashiCorp Terraform Associate Exam! Studying for and eventually taking this exam was a great experience, and a lot of that is owed to how simple Terraform is to work with.

What is Terraform?

If you’ve arrived here, you’re probably already clued in to what Terraform is and why you’d use it. However, I’ll do a quick explanation of it just so we’re both on the same page.

Terraform is an Infrastructure-as-Code (IaC) tool that enables you to build virtual infrastructure with templates written in HashiCorp Configuration Language (HCL). These templates can be used to create infrastructure in all major cloud service providers as well as many you may not have heard of. You can find a list of providers here.

What makes Terraform so compelling is that it is a cloud-agnostic tool that can be used in multiple environments simultaneously! In simpler terms, Terraform uses a common language to speak to multiple providers.

Following a simple workflow of Write -> Plan -> Apply, you are able to write IaC, plan your deployment before it occurs, and then finally apply those changes to achieve the desired state (what was in the plan).

Study Resources

Fortunately, there are many great resources to study for the Terraform Associate Exam! The key to making the most of any of these resources is to apply the knowledge by writing code yourself. I can’t emphasize enough how important it is to write your own Terraform code.

With that said, here is a list of resources I used to prepare:

  • HashiCorp Learn: This is the authoritative source for Terraform tutorials, and I found myself referencing this repeatedly when I needed a quick text explanation of configuration language or CLI commands
  • More Than Certified in Terraform: This course by Derek Morgan is a superb course to get you a rock solid understanding of Terraform for the exam. I can say confidently that if you take this course, follow the instructions, write the code, and retain the knowledge, you will pass the Terraform exam. It goes above and beyond what you’ll need on the exam and builds excellent workflow habits that will make you a better Terraformer. Between this and the Registry, you can be dangerous with Terraform in a few weeks.
  • HashiCorp Registry: You will repeatedly reference this as you write code. With an understanding of how to build resource blocks, data blocks, and modules, you can take this registry and apply your knowledge to any provider and build your own infrastructure with relative ease.
  • Terraform Practice Exams: These practice exams by Bryan Krausen are the perfect way to round out your preparation for the exam by exposing you to the style of questions you’ll receive.

There are more resources than this on the web, but these were the 4 that I would recommend.

The Skills You Need to Write Terraform Code and Pass the Exam

I think it’s far more important to be able to understand and write Terraform code than simply pass the exam. Passing the exam should be a byproduct of your understanding!

There are certainly more aspects of Terraform than writing code, and looking at the study domains here you can see what that includes.

Outside of writing code, here is what I would focus on:

  • Learn the CLI commands you will use the most (terraform init, terraform validate, terraform plan, terraform apply, terraform destroy, etc.)
  • Understand Terraform State, and specifically how Terraform references state files and uses them to determine necessary changes to infrastructure
  • Learn the utility of modules and how information moves from parent to child modules and back again; they make your life easy if you understand them (see the graphic and sketch below)
  • Understand how plugins work and how you can reference providers, upgrade/update them if necessary, etc.
  • Learn how secrets are managed contextually, and what security risks there are inherent to using Terraform and how to mitigate them
  • Get familiar with dependencies and how Terraform makes decisions about them (in what order, explicit vs. implicit)
  • Learn what workspaces are and how they use different state files as references
  • Practice setting up in Terraform Cloud and see how it differs from using a local backend
Terraform module flows
Terraform Module Flows
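
To make that flow concrete, here’s a minimal sketch (the module path and names are made up for illustration). Values travel parent-to-child through the module block’s arguments, and child-to-parent through the child’s outputs:

# --- root/main.tf (parent) ---
module "networking" {
  source   = "./modules/networking" # hypothetical child module path
  vpc_cidr = "10.0.0.0/16"          # parent -> child via a variable
}

output "vpc_id" {
  value = module.networking.vpc_id # child -> parent via an output
}

# --- modules/networking/variables.tf (child) ---
variable "vpc_cidr" {
  type = string
}

# --- modules/networking/main.tf (child) ---
resource "aws_vpc" "this" {
  cidr_block = var.vpc_cidr
}

# --- modules/networking/outputs.tf (child) ---
output "vpc_id" {
  value = aws_vpc.this.id
}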

Writing Your Own Code

Writing code should be your bread and butter before heading into the exam. Using the More Than Certified in Terraform Course is a great way to do that.

I think there’s something to be said though for building something yourself, even if it’s as simple as provisioning an AWS S3 bucket or KMS key.

Practice building a Terraform deployment from scratch. Create that main.tf file, reference a provider, build a resource block, use the CLI, and make a simple resource.
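
For example, a first from-scratch deployment can be as small as this sketch (the bucket name is made up and must be globally unique):

# --- main.tf (minimal practice example) ---

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# A simple resource to exercise the init -> plan -> apply -> destroy loop
resource "aws_s3_bucket" "practice" {
  bucket = "my-terraform-practice-bucket-20240101" # hypothetical; change it
}

Run terraform init, terraform plan, terraform apply, and then terraform destroy against it and you’ve exercised the entire core workflow end to end.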

Through this practice, you’ll develop workflows and an understanding of best practices in Terraform. This includes dividing your main.tf file into multiple files with separate functions, like a providers.tf file or outputs.tf. You can go further and build separate folders for child modules, and then build your own modules.

Get used to using the Registry as your go-to reference. If you can take and apply the templates from the Registry documentation, you can make whatever you want. Sure, you’ll mess up quite a few times and things won’t always work like you want them to…but that’s part of the fun.

I would recommend starting with a single provider like AWS to hone your skills. It can get confusing if you bring in too many cloud providers at once, as their slight differences can muddy your understanding. Once you are comfortable, though, terraform apply away and watch infrastructure get provisioned everywhere!

Summary

Getting your hands dirty with Terraform is a rewarding experience. There are few things more satisfying in tech than building whole infrastructure stacks with some code and a few simple commands.

Perhaps the most important benefit of using IaC is getting yourself out of the GUI. Building IaC shows you how much faster it is when you code. You can build a VPC in the console, but so many mistakes can happen along the way. If you mess up the order or skip a step, you might spend hours scratching your head trying to figure it out.

Code alleviates that problem. Once you have a functional bit of it, you can create and destroy with speed and efficiency. It removes much of the potential for errors and allows scalability like you wouldn’t believe.

At some point, you’ll actually wonder how you did it without IaC.

Terraform is a very worthy investment of your time, and I hope this guide helps you make it. Feel free to contact me at my LinkedIn profile here if you have any questions or need assistance!

VPC Fun with HashiCorp Terraform

Today I had some fun building out a VPC on Terraform with the ol’ MTC Course. Fortunately, this is easy with Terraform!

Also, side note…I’ve given up the Day X format because it doesn’t make sense anymore. I end up taking breaks between lessons to review the material, so each day ends up taking a lot longer and covering a lot less.

VPC Documentation

VPC documentation, along with all other documentation for integrations with Terraform, is available on the HashiCorp Registry site. The utility of this is unmatched. It takes almost all the guesswork out of anything you need to write in Terraform.

VPC Components

Here’s the code for the network/main.tf file:

# --- networking/main.tf ---

data "aws_availability_zones" "available" {}

resource "random_integer" "random" {
  min = 1
  max = 100
}

resource "random_shuffle" "az_list" {
  input        = data.aws_availability_zones.available.names
  result_count = var.max_subnets
}

resource "aws_vpc" "mtc_vpc" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true #provides a dns hostname for any resource deployed in a public environment
  enable_dns_support   = true

  tags = {
    Name = "mtc_vpc-${random_integer.random.id}"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_subnet" "mtc_public_subnet" {
  count                   = var.public_sn_count
  vpc_id                  = aws_vpc.mtc_vpc.id
  cidr_block              = var.public_cidrs[count.index]
  map_public_ip_on_launch = true #not necessary for private subnets
  availability_zone       = random_shuffle.az_list.result[count.index]

  tags = {
    Name = "mtc_public_${count.index + 1}"
  }
}

resource "aws_subnet" "mtc_private_subnet" {
  count                   = var.private_sn_count
  vpc_id                  = aws_vpc.mtc_vpc.id
  cidr_block              = var.private_cidrs[count.index]
  map_public_ip_on_launch = false
  availability_zone       = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "mtc_private_${count.index + 1}"
  }
}

resource "aws_route_table_association" "mtc_public_assoc" {
  count          = var.public_sn_count
  subnet_id      = aws_subnet.mtc_public_subnet.*.id[count.index]
  route_table_id = aws_route_table.mtc_public_rt.id
}

resource "aws_internet_gateway" "mtc_internet_gateway" {
  vpc_id = aws_vpc.mtc_vpc.id
  tags = {
    Name = "mtc_igw"
  }
}

resource "aws_route_table" "mtc_public_rt" {
  vpc_id = aws_vpc.mtc_vpc.id
  tags = {
    Name = "mtc_public"
  }
}

resource "aws_route" "default_route" {
  route_table_id         = aws_route_table.mtc_public_rt.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.mtc_internet_gateway.id
}

resource "aws_default_route_table" "mtc_private_rt" {
  default_route_table_id = aws_vpc.mtc_vpc.default_route_table_id

  tags = {
    Name = "mtc_private"
  }
}

resource "aws_security_group" "mtc_sg" {
  name        = "public_sg"
  description = "Security Group for Public Access"
  vpc_id      = aws_vpc.mtc_vpc.id
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.access_ip]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

That’s a lot! Breaking it down reveals a bunch of components of the VPC.

What’s great about Terraform is what you see is (for the most part) what you get. It’s not particularly difficult to read any of this code and most of it (especially if you’re accustomed to looking at AWS stuff) is immediately understandable.

There are a few variables mixed in there that complicate this, but the actual resources and their sub-characteristics are pretty clear.

In this code for the network/main.tf file, I’ve created a single VPC with public subnet(s), private subnet(s), an internet gateway, route tables, a default route out, and finally a public security group.

Using modules abstracts away SO much of the code from the root module, as you can see below:

# --- root/main.tf ---

locals {
  vpc_cidr = "10.124.0.0/16"
}

module "networking" {
  source           = "./networking"
  vpc_cidr         = local.vpc_cidr
  access_ip = var.access_ip
  public_sn_count  = 2
  private_sn_count = 3
  max_subnets      = 20
  #public_cidrs = ["10.123.2.0/24", "10.123.4.0/24"]
  public_cidrs  = [for i in range(2, 255, 2) : cidrsubnet(local.vpc_cidr, 8, i)]
  private_cidrs = [for i in range(1, 255, 2) : cidrsubnet(local.vpc_cidr, 8, i)]
  
}

The utility of modules becomes visually obvious now. Reusable modules mean configuration that is simpler, cleaner, and better. DRY principles reinforced. Voilà!

I’m able to customize the number of subnets in the root main.tf file, which is huge. This is a massive time saver and minimizes errors made in trying to write it up or click through it every single time. I think IaC is pretty great for this reason…

What’s Next?

On the immediate end: keep going with this MTC course. I need to get rid of the hardcoding for that security group, and then I’ll keep adding features until this is a full 3-tier architecture.

In other news, I went ahead and booked my TF Associate exam for 17 November. I’m nervous to get back into the test game (I haven’t done one since I passed SCS this summer), but fortunately this one has a more limited scope than the previous exam.

This is a busy month for me. I might not be writing here as often, but I will be active regardless. Be back soon!

Day 7 of Hashicorp Terraform

Today, I finally arrived at the topic I’ve waited a while for now to begin–building AWS infrastructure using Terraform! For anyone tuning in, I am using the More Than Certified in Terraform Course by Derek Morgan, which you can find here.

Setup an AWS VPC in Terraform

Learning about modules was DRY, but it served a purpose. Right out the gate, I am using modules to build a VPC with Terraform.

I can understand their value now. Using a module for networking components allows me to focus on building repeatable components at a lower level rather than trying to do everything at the root level.

This consists of 3 files:

  • networking/main.tf
  • networking/outputs.tf
  • networking/variables.tf

The main.tf file at the networking level is a series of resources necessary to build the VPC itself. At a bare minimum, this is the VPC resource, which contains a CIDR block and two AWS settings: enabling DNS hostnames (for publicly reachable resources) and DNS support. If you’re wondering, these settings can also be found in the AWS GUI console under VPC => Actions => Edit DNS Hostnames/Edit DNS Resolution.

resource "aws_vpc" "mtc_vpc" {
    cidr_block = var.vpc_cidr
    enable_dns_hostnames = true #provides a dns hostname for any resource deployed in a public environment
    enable_dns_support = true
    
    tags = {
        Name = "mtc_vpc-${random_integer.random.id}"
    }
}

The configuration above is enough to get started. You’ll notice that there are some variables in there, which are included in the variables.tf file.

In addition to the VPC, I defined 2 subnets in different AZs.

resource "aws_subnet" "mtc_public_subnet" {
    count = length(var.public_cidrs)
    vpc_id = aws_vpc.mtc_vpc.id
    cidr_block = var.public_cidrs[count.index]
    map_public_ip_on_launch = true #not necessary for private subnets
    availability_zone = ["us-west-2a", "us-west-2b", "us-west-2c", "us-west-2d"][count.index]
    
    tags = {
        Name = "mtc_public_${count.index + 1}"
    }
}

The first thing this resource does is count how many subnets with public CIDRs it needs to create, and what VPC to assign them to. It’s helpful to know that the cidr_block and vpc_id attributes are always required when building a subnet on Terraform. In this case, the AZs are all hardcoded in a single list for reference (not ideal but that’s what I have for now). These AZs are assigned when the subnet is created because subnets themselves always reside in AZs.

You can reference the Terraform Documentation for AWS here if you’re interested in reading more about configuration options.

Terraform Cloud Backend

I could have mentioned this earlier, but this infrastructure is referencing a state stored in the Terraform Cloud backend which needs to be set up separately from AWS. All these states are available within the TF Cloud GUI under the selected workspace, under States. Usually, you’d see the tfstate files stored locally in Cloud9, but this provides a cloud storage solution that works just as well.
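
For reference, pointing Terraform at that backend looks roughly like this (the organization and workspace names here are made up; newer Terraform versions also offer a cloud block that does the same job):

terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "my-org" # hypothetical organization

    workspaces {
      name = "terraform-docker" # hypothetical workspace
    }
  }
}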

Another thing to note: using Cloud9 avoids some of the IAM role permission requirements of setting up Terraform to work with AWS. Because I’m using the Cloud9 AWS-integrated IDE and its associated role permissions to AWS services, there’s stuff going on in the background which I don’t need to fuss with. It’s not that it isn’t happening, but it isn’t visible here.

Summary

I had a lot of fun with this. It’s a blast to see AWS infrastructure get rapidly built on things other than CloudFormation and I’m excited to go further with this topic. Until next time!

Day 6 of Hashicorp Terraform

Now I’m getting into the real meat of the course–Modules!

I started this weeks ago and got off track due to a move and life events. It’s been a chaotic period and I’m happy to pick up where I left off, so here goes…

A module, as HashiCorp describes it, is “a container for multiple resources that can be used together.” This allows you to “describe infrastructure in terms of architecture rather than physical objects.” Sounds enticing, but I still don’t get it.

I think this is a vague answer, so here’s a slightly better explanation from FreeCodeCamp.com:  “A module allows you to group resources together and reuse this group later, possibly many times.” I think the emphasis on reusability is the most important part.

In today’s lesson, I will provide a brief introduction to modules and why you should care about them.

Why a Module?

In keeping with the idea of DRY coding principles (Don’t Repeat Yourself), modules provide a simple way to expedite your coding in Terraform! They encourage coders to stay general and avoid hard coding, allowing re-use of modules to create multiple resources.

Visually, it looks something like this following a terraform apply command:

A visual representation of information flow from the MTC Terraform Course

Modules are referenced in the main.tf file (in this case, both image and container modules).

Input calls are made in the root module main.tf file. They reference the child modules via the source argument (source = "./DIR"). These calls pass along variables that are accepted by the child module and output as values via its outputs.tf file. Those values are then returned to the root module main.tf file.

Once the run completes, these values are returned via the root module outputs.tf file to the Terraform CLI.

In this example, outputs from the Image module are later referenced by the Container module, which uses the image provided (the grafana image) and inputs it into a Docker container.

The utility of this is that it is repeatable. You can take any image, process it, and package it into a container while referencing the same code.
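
A rough sketch of that flow, with illustrative module and variable names rather than the course’s exact code:

# --- root/main.tf ---
module "image" {
  source   = "./image"
  image_in = "grafana/grafana:latest" # input variable consumed by the image module
}

module "container" {
  source   = "./container"
  image_id = module.image.image_out # output returned by the image module
}

# --- image/outputs.tf ---
output "image_out" {
  value = docker_image.this.image_id # docker_image.this is defined in image/main.tf
}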

I’m going to go deeper into modules very soon, so I think I’ll leave it here. Next time I will cover some more substantial examples of using modules with AWS!

Day 5 of Hashicorp Terraform

Technically, I stretched this over 2 days…but because it was a pretty short amount of material I thought it was more reasonable to do a single post. I’ve finally cleared the 2nd section of the course which covered Terraform basics, and now I begin modular deployments.

Path References and String Interpolation

This was a short and sweet lesson, which covered the utility of path references with Terraform-provided commands.

Rather than hardcoding a directory, it is recommended to use a dynamic directory reference that reflects the current directory in case it changes. This can be done using path.cwd, which returns the current working directory.

Coupling this with interpolation puts that data into string format and lets you reference variables and attributes and call functions inside strings. Interpolation uses this format: ${var.foo}
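
A quick illustration of my own:

output "working_dir" {
  # path.cwd is interpolated into the string when Terraform runs
  value = "Terraform is running from ${path.cwd}"
}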

Maps and Lookups

Maps are similar to dictionaries in Python, at least to my eyes. They allow you to map keys to values within a single variable.

Mapping
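
A map variable along those lines might look like this (the image tags are placeholders):

variable "image" {
  type = map(string)
  default = {
    dev  = "nodered/node-red:latest"
    prod = "nodered/node-red:latest-minimal"
  }
}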

Lookups correspond to map types, and provide you a way of "looking up" mapped values.

resource "docker_image" "nodered_image" {
   name = lookup(var.image, var.env)
}

In this case, the lookup gets the image based on the provided env variable.

It is possible to perform a lookup using map keys, which doesn’t require the lookup function at all and cuts down a few characters. It shortens the code to look more like this instead of the lookup version:

resource "docker_image" "nodered_image" {
  name = var.image[terraform.workspace]
}

Workspaces

Probably the most useful part of this lesson, Terraform workspaces allow you to maintain multiple independent state files side by side. Creating new workspaces is easy enough, and once you create a workspace you can switch between them just as simply.

In the course, I created two workspaces (dev and prod) and was able to perform a terraform apply to each one of them and run each set of containers according to ports specific to the workspace. This was possible using the terraform.workspace feature in the code (seen above). Running a terraform show | grep external displays the running ports, which should match the list provided in the .tfvars file.

.tfvars file

Conclusion

I enjoyed learning some more Terraform tonight, but I have to admit I bombed the quizzes on the way out with a D average. Points were lost for very particular things that I could definitely see popping up on the Terraform exam.

I’m trying out a new method of notetaking at the finish of each lesson, and although I’m seeing overall improvement, there are still details that don’t make it into my notes. I’ll work on tightening those notes up or maybe running more demos to get extra familiar with some of the features. I’m weak at some of the state-related lesson material, and I also struggled with some of the commands and knowing exactly what they output.

I’m looking forward to becoming a little better every day with this stuff!

Day 4 of Hashicorp Terraform

Today I got to work with variables some more, learn about persistent storage in Terraform, and use local values. This was a great set of lessons that touched on a lot of nagging questions I had going into them, so I’m grateful for that!

Variables

In the previous session, I learned how to create and set variables. This time, I learned how to validate variables as well as some tips and tricks for working with them.

Validation of variables occurs in the actual declaration of a variable and is as simple as adding a little condition and error message.

Variable validation

Validation is important so your code actually functions as you want it to. If any value could be put in there, you could create a problem without even knowing it!
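
Something like this, where the variable name and port range are just an example:

variable "ext_port" {
  type = number

  validation {
    condition     = var.ext_port > 0 && var.ext_port <= 65535
    error_message = "The external port must be between 1 and 65535."
  }
}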

Next, I learned about the .tfvars file, which is essentially a record of variable values that Terraform consults when deciding what value to assign to each variable. The order of precedence is:

  • An explicit declaration of the value at runtime (-var variable_name=something)
  • The .tfvars main file or alternate .tfvars file (also specifically referenced with -var-file=alternate.tfvars)
  • The default value declared in main.tf or variables.tf (wherever the variables are stored)

Another thing about the .tfvars file is that it should always be included in the .gitignore file when put into production.

The last topic pertaining to variables involved using a sensitivity flag. This flag indicates that the value of a variable should be kept confidential in Terraform’s output. This can be done by adding sensitive = true to both the variable and any output which includes that variable.

This sensitivity flag does NOT prevent users from pulling this information in other ways, including using docker ps -a or by simply looking at the state file. It’s recommended to use access control measures to ensure that unwanted users cannot view the contents of any files with sensitive information.
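
In practice it looks something like this (the variable and output names are illustrative):

variable "ext_port" {
  type      = number
  sensitive = true
}

output "container_port" {
  value     = var.ext_port
  sensitive = true # any output referencing a sensitive value must also be marked sensitive
}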

Persisting Storage

It is possible in Terraform to persist storage using the local-exec provisioner. It isn’t the recommended approach, as there are better ways to do so, but for learning purposes it’s a helpful tool.

So far I’ve been unable to persist my Docker container’s data across multiple deployments, but with local-exec it’s possible to do this.

To do this, I created a null_resource named “dockervol” whose job is to create the directory that stores the actual volume (which I’ll get to soon). This resource uses the local-exec provisioner to run a bash command that first creates the directory or, if it already exists, ensures ownership of the directory is changed to the specified Node-RED user (1000:1000).

Null resource with command script

The || true component ensures idempotence, which allows the Terraform configuration to be applied multiple times and maintain the same state without errors. Without it, a single terraform apply works, but a subsequent run errors out when it detects that the directory has already been created.
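
A rough reconstruction of that resource, following the description above (directory name and ownership per the lesson):

resource "null_resource" "dockervol" {
  provisioner "local-exec" {
    # Create the volume directory, or if it already exists, just make sure the
    # Node-RED user (1000:1000) owns it; "|| true" keeps repeat applies from failing
    command = "mkdir noderedvol/ || true && sudo chown -R 1000:1000 noderedvol/"
  }
}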

Next, I added a volume to the existing Docker container code, which specifies a host path to store that container’s data.

Adding the volumes property with the container and host path to the docker container resource
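
The relevant part of the container resource looks roughly like this (paths and names are illustrative, and attribute names can vary slightly by provider version):

resource "docker_container" "nodered_container" {
  name  = "nodered"
  image = docker_image.nodered_image.image_id

  # Persist Node-RED's /data directory to the host path created by the null_resource
  volumes {
    container_path = "/data"
    host_path      = "${path.cwd}/noderedvol"
  }
}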

Once this is complete and a terraform apply has run, the same Docker container’s data persists across multiple deployments. This way, you won’t lose your container’s data every time you destroy it in Terraform.

Local Values

The last (quick) lesson I took involved using local values. These are named expressions that you can reference across the .tf files of the same module.

Locals declaration

This can be referenced using local.variable_name.
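
For instance (the names are made up):

locals {
  env  = "dev"
  name = "nodered-${local.env}" # locals can reference other locals
}

output "deployment_name" {
  value = local.name
}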

Conclusion

All in all another fantastic lesson! I’m making slow progress on Terraform because I really want to spend quality time learning the tool itself. Getting to experiment with it means I remember a lot more of the material.

I plan on taking the Terraform certification at some point after finishing this course. In the meantime, I’m staying focused on recording high-quality notes for it. Whether or not I end up sitting the exam, I know this knowledge will be important in my line of work.