Example Tech Spec - Transactional Email
Overview
[ Link to the product spec and supporting documents, if they exist]
We currently use the default emailer included in our web framework. Templates
are checked in with code and are part of the release cycle, and we use our own SMTP server.
This solution has limitations and should be improved; see ticket #1845. We want to move
transactional emails to a hosted 3rd party service that supports templates and tracking
delivery/open/click rates. We propose using Mailchimp. Send-email jobs will be enqueued to
RabbitMQ, and consumed by a new Email Sending Service. This is a medium-sized project that
will take about one engineer-month of effort from our team, and about two person-weeks of
combined work from other teams.
Goals
[ You can skip Goals, Product Requirements if there is a product spec that explained these in
detail. ]
1. Allow marketing and design teams to modify email templates without engineering
involvement.
2. Make email metrics available: open rates, click rates, and deliverability.
3. Make it easier and faster to add new emails to our system.
Product Requirements
● Emails use a template (with variable substitution and basic if-then-else logic) which can
be edited directly by the marketing and design teams in a self-serve fashion.
● Email designs can be tested in a staging environment that does not affect production.
Assumptions
● We must have the capacity to send up to 2 emails to each customer per day (though we’ll
certainly not plan to send this many emails on a regular basis). At current growth
projections, assuming we are planning for a 2-year horizon, that means we need
capacity for up to 20,000 emails/day. [ It’s a good idea to provide SLAs and capacity
whenever possible. ]
● We only care about HTML email; sorry, Mutt/Pine users!
● We will want to segregate our email sending system (APIs, templates) so there is a
development/test sandbox which is separate from production; e.g. a change in a
development/test email template won’t impact production.
● It is the responsibility of the marketing and design teams to copy changes forward from
development/test to production. Note that this is a potentially error-prone process that will
be difficult to test; this is likely something we’ll want to address soon. [ This is a gross
problem - it’s likely going to be a pain point, but I’m not sure we can solve this right
away, so I’d rather call it out and move on. ]
● We don’t want to accidentally trigger emails to real people from our development/test
domains, so we should add a filter that only sends to @ourcompany.com.
● Because it is possible that a bug could cause us to e.g. send 50 emails to the same
customer in 5 minutes, it is highly desired to include a safeguard throttle that (a) raises
an internal alarm if we send more than 3 emails to the same email address in one hour,
and (b) prevents sending more than 5 emails to the same email address in one hour.
[These numbers were pulled out of thin air. I don’t care about the actual numbers, but I
find that people are often more inclined to engage when there’s a specific stake in the
ground to talk about.]
● This system only applies to transactional emails, triggered by in-app user events. Email
blasts for e.g. marketing purposes will be handled by a separate system.
● We will not support A/B testing at this stage, though it’s a likely future project.
● Cron-job triggered emails (e.g. send a customer daily summary at 4am) might be
handled by our system in the future, but this is not supported for now.
● We do not have a multilingual or localized solution yet; only English emails using a US
date format are supported. [It never hurts to think about international…]
● While we will certainly be able to more quickly add new emails than today (e.g. “let’s
send a thank-you email when a customer refers a friend”), this is not just a matter of
filing an engineering ticket because it requires coordination between three teams: the
email content owner; engineering; and QA. I’ve noted this as an open question below.
[Ugh, this will be a tricky process. For now, let’s just say it’s not all on us, and sort
through the details later.]
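The development/test recipient filter from the assumptions above could be as small as the following sketch. The function name and the `is_production` flag are hypothetical; only the @ourcompany.com rule comes from this spec.

```python
ALLOWED_TEST_DOMAIN = "ourcompany.com"  # from the spec; all other names are hypothetical


def may_send(recipient: str, is_production: bool) -> bool:
    """Return True if an email to `recipient` may leave this environment."""
    if is_production:
        return True
    # Compare the exact domain part, case-insensitively, so that an address
    # like "x@ourcompany.com.evil.example" is rejected outside production.
    _, sep, domain = recipient.strip().rpartition("@")
    return bool(sep) and domain.lower() == ALLOWED_TEST_DOMAIN
```

Matching the whole domain rather than using a suffix or substring check is the main design point here; a substring match would let look-alike domains through.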
Open Questions
[ It’s totally OK to publish a tech spec with holes in it, as long as you call out the holes. ]
● What’s our process for adding new emails? Who owns streamlining this?
● How will we manage email retry?
● How should we throttle emails (so a bug doesn’t send a slew of emails to one
customer?)
● How should we configure our Nginx/Firewall rules to allow Mailchimp to make callbacks to
our systems securely?
● What happens if someone messes up a Mailchimp template such that it e.g. has broken
if-then-else logic and cannot be sent? How is this monitored, and who is responsible for
checking? We are trusting marketing to own this, but does there need to be an
engineering safe-guard too?
● Do we need a manual “pause” switch that will turn off the email worker (so jobs enqueue,
but emails aren’t sent)? This might be useful for maintenance or bugs.
● How do we absolutely, 100% guarantee that someone will stay on top of Mailchimp
billing? E.g. if someone in finance leaves the company, we don’t want an expired credit
card notice to go unnoticed.
Approach
[ Here, I’m combining a presentation of the chosen approach alongside the alternatives we
considered. It would also be fine to separate these out into a section “Other Options
Considered” ]
The first question we considered is: do we want to send email ourselves (from our own server)
or use a third party system? While it’s cheap to send emails using our own SMTP server, we
know from experience that maintaining email deliverability (IP address reputation, bounces,
unsubscribes) is a huge operational burden, and we strongly lean towards using a third party.
Our ideal third party system will handle both sending emails and templates (with a template
editor supporting variable substitution and if-then-else logic). We would prefer an all-in-one
solution to reduce the number of external touch points; this means we won’t consider solutions
like e.g. SendWithUs which just manage templates.
Two of the more popular solutions for transactional emails with templates are SendGrid and
MailChimp. Because our team has had experience with MailChimp and they have a solid
operational reputation, we suggest using MailChimp. (Note that their transactional system used
to be called Mandrill, but this was pulled into the MailChimp brand name.)
Components
[ This kind of summary is optional, but sometimes it helps the reader to re-state what we
proposed in terms of what we’ll be building. ]
We will write a new “Transactional Email Sending Service” that consumes from the jobs queue
to send emails. There will be at least 4 workers running on at least 2 different machines. These
workers are responsible for the (simple) mechanics of making remote API calls to MailChimp to
send emails, populating additional standardized variables (see below), and retrying on failure
(also see below).
Because we already use RabbitMQ, we will also be using RabbitMQ for our email queues (with
a new queue). There will be separate instances of the queue and worker fleets for production
and development/staging. When an email queue worker is started, it will know whether it is
running in production or not.
This method just adds the email to the send-queue. It is assumed that all internal applications
will use our standardized configuration management. We will add hooks into this system so on
application start, email configurations (how to connect to the right RabbitMQ queue) are
automatically set up.
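To make the queue contract concrete, here is one possible shape for a send-email job as it travels through RabbitMQ. This is only a sketch: the class name, field names, and the choice of JSON are assumptions, not something this spec has settled.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EmailJob:
    """Hypothetical message body for the send-email queue."""

    to_address: str
    template: str  # logical name such as "welcome", without an environment prefix
    variables: dict = field(default_factory=dict)  # client-specified template variables

    def to_message(self) -> bytes:
        """Serialize to UTF-8 JSON bytes, usable as a queue message body."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def from_message(cls, body: bytes) -> "EmailJob":
        """Rebuild the job on the worker side."""
        return cls(**json.loads(body.decode("utf-8")))
```

Keeping the job limited to an address, a logical template name, and client variables leaves environment prefixes and standard variables to the email sending service, as described in the sections that follow.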
Template Names
[ I might be rambling a bit here and getting off into the weeds - but you know what? I’m talking to
myself and figuring out an approach, and that’s largely the point of the exercise. ]
We suggest that we use ONE MailChimp account for both test and production emails. Each
email template must start with either “test_” or “prod_” to distinguish between the environments,
e.g. there would be both “test_welcome” and “prod_welcome” templates.
When our software triggers an email, it would just specify the e.g. “welcome” template; if it was
in a staging environment, the email sending service will prepend “test_” to the template name.
Note that the marketing team owns creating and managing these email templates.
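The prefixing rule could look like the sketch below. The function name is hypothetical, and two behaviours are assumptions rather than decisions in this spec: that production prepends “prod_”, and that a client passing an already-prefixed name is treated as an error.

```python
def resolve_template_name(template: str, is_production: bool) -> str:
    """Map a logical template name like "welcome" to the MailChimp template name."""
    # Assumption: clients never pass an environment prefix themselves.
    if template.startswith(("test_", "prod_")):
        raise ValueError(f"template name must not include an environment prefix: {template!r}")
    prefix = "prod_" if is_production else "test_"
    return prefix + template
```

For example, `resolve_template_name("welcome", is_production=False)` yields “test_welcome”, so application code never needs to know which environment it is running in.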
Standard Variables
We will want to provide all emails with a standard list of variables. These will be automatically
populated by the email sending service; clients do not need to specify these. Note that we can’t
expect to pull in the universe of all data about a customer (e.g. full click-stream history) because
some data might require complex queries. Instead, we will maintain a list of standard variables
on this wiki page. Initially our variables include:
● First name
● Date joined
● Membership tier
● User timezone (as both current UTC offset in hours, and name like “US/Pacific”)
Client-specified variables are merged into the set of variables given by the email service. In the
case where there is a conflict (the client specifies a variable name which is provided by the
service) we will log an ERROR and override the client variable.
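A minimal sketch of that merge rule follows. The function and logger names are hypothetical; the behaviour (log an ERROR, service value wins) is the one described above.

```python
import logging

logger = logging.getLogger("email_sending_service")  # hypothetical logger name


def merge_variables(client_vars: dict, standard_vars: dict) -> dict:
    """Merge client variables with the service's standard variables.

    On a name conflict the service-provided value wins and an ERROR is logged.
    """
    for name in sorted(set(client_vars) & set(standard_vars)):
        logger.error(
            "Client variable %r conflicts with a standard variable; using the service value",
            name,
        )
    merged = dict(client_vars)
    merged.update(standard_vars)  # standard variables override on conflict
    return merged
```

Neither input dict is mutated, so the caller can still log or inspect what the client originally sent.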
Email Retry
If the email sending service is unable to send an email via MailChimp due to connectivity
problems or because of some other outage, we will attempt to re-send the email using
backing-off retry logic. The mechanics are TBD. [ It’s OK to not know this yet, let’s move on... ]
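While the mechanics are still open, one common shape for backing-off retry is exponential backoff with jitter. The sketch below is one option under stated assumptions, not the chosen design: the exception type, the attempt limit, and the delays are all placeholders.

```python
import random
import time


class TransientSendError(Exception):
    """Hypothetical: raised by `send` on connectivity problems or a provider outage."""


def send_with_retry(send, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call `send()` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except TransientSendError:
            if attempt == max_attempts:
                raise  # give up; the caller decides what happens to the job
            # Exponential backoff capped at max_delay, with full jitter so that
            # many workers do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Sleeping inside a worker ties that worker up for the duration of the backoff; re-enqueueing the job with a delay is an alternative worth weighing when the mechanics are decided.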
Throttle
Note that this feature might not be included in the initial rollout.
It is highly desired to detect sending a slew of emails to the same email address in a short
period of time. We could use Redis counters with a TTL; but the specifics are TBD. [ Again, we
capture this as an open question and move on.]
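To make the thresholds from the assumptions concrete (alarm above 3 emails per address per hour, never more than 5), here is an in-memory sliding-window sketch. It is a stand-in for illustration only: a real deployment with several workers would need shared state such as the Redis counters mentioned above, and the class and method names are hypothetical.

```python
import time
from collections import defaultdict, deque

ALARM_THRESHOLD = 3    # alarm when MORE than 3 emails go to one address in an hour
BLOCK_THRESHOLD = 5    # never send more than 5 emails to one address in an hour
WINDOW_SECONDS = 3600


class EmailThrottle:
    """Per-address sliding-window counter (single-process sketch)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._sent = defaultdict(deque)  # address -> timestamps of recent sends

    def check_and_record(self, address: str) -> str:
        """Return "send", "send_and_alarm", or "block"; record the send unless blocked."""
        now = self._clock()
        window = self._sent[address.lower()]
        # Drop sends that have fallen out of the one-hour window.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) >= BLOCK_THRESHOLD:
            return "block"
        window.append(now)
        return "send_and_alarm" if len(window) > ALARM_THRESHOLD else "send"
```

With these numbers, the first three sends to an address within an hour return "send", the fourth and fifth return "send_and_alarm", and the sixth returns "block".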
The email sending service will also check the email_address_status table before sending emails
(to skip non-active addresses) and will upsert on sending.
Schema Changes
[ In this example, I’m being almost absurdly terse because I’m assuming I’m working with a
particular team and process where more details wouldn’t be necessary. ]
We will create a new database table email_address_status, discussed above. It will be
approximately the same size as the users table. The only two users are the email sending
service and the email callback service.
Our biggest risk is that we send too many emails, which annoy or alienate customers. And of
course, there must be an opt-out option for users so they are in control of how we use their
emails.
In terms of privacy, we will be sending some customer PII (email address, first name, pricing
tier) to a third party (Mailchimp) via an encrypted (HTTPS) API; this shouldn’t be a big deal, but
the legal and devops teams should be made aware of this.
Our email callback service will need to receive external callbacks from Mailchimp, so we’ll need
to be careful to ensure that these external endpoints are not a vector of attack. Precise
firewall/Nginx configuration TBD.
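Independently of the firewall rules, the callback service can reject requests that do not carry a valid signature. The sketch below shows generic HMAC verification with a shared secret. It is not Mailchimp's documented scheme: the exact signed string, hash algorithm, and header name must be taken from the vendor's webhook documentation, and the function names here are hypothetical.

```python
import base64
import hashlib
import hmac


def sign(payload: bytes, secret: bytes) -> str:
    """Compute a base64-encoded HMAC-SHA256 signature (the algorithm is an assumption)."""
    digest = hmac.new(secret, payload, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def is_authentic(payload: bytes, received_signature: str, secret: bytes) -> bool:
    """Return True only if the signature matches the payload under the shared secret."""
    expected = sign(payload, secret)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    return hmac.compare_digest(expected.encode("utf-8"), received_signature.encode("utf-8"))
```

Requests failing this check would be dropped before any database write, so a forged callback could not mark a valid address as bounced.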
Test Plan
For major releases with a black-box (manual) QA cycle, we’d like to ask the QA team to add the
welcome email as part of their test plan. Signing up to create an account should trigger an
email, which the QA team will verify arrives and looks visually correct.
We would also like to include email integration tests to our Selenium automated tests. The
golden-path sign-up flow should include an end check of an email inbox to ensure a welcome
email arrived within a few seconds of sign-up.
Operations
Deployment
We will work with the devops team to include the email sending service and email callback
service as new components (along with config file management, secrets, etc) in our deployment
system. We will own adding the new DB schema ourselves (as a checked-in migration), but the
DBAs will be keeping an eye on things like e.g. query performance on prod.
Initially, our expected load is quite low (we’re just taking small data blobs off a queue and
making remote HTTPS calls), so we suggest deploying to a fleet of 2 separate VMs (on separate
physical servers, of course) for each service.
RabbitMQ is owned by the infrastructure team; they are on board with adding and supporting the
new email queues.
Rollout Plan
Because this is a new system, we can deploy the services and components in advance of
actually sending customer emails. The order of operation is:
● Deploy full infrastructure (but don’t send welcome emails)
○ Marketing creates templates
○ RabbitMQ setup
○ Database changes
○ Deploy email sending service
○ Deploy email callback service
● Manually test (trigger emails)
○ Validate integrations
○ Validate monitoring and metrics
● Deploy code that sends actual emails
Rollbacks
[ Always include some kind of playbook for what people should do if something goes wrong with
the release - even if it’s just one line. ]
If there are unforeseen issues, it should be easy to simply shut down the email workers. We will
be closely looking at logfile ERRORs when this goes live.
In addition to standard machine-health metrics and monitoring, we will use logfile monitoring
(with Graylog).
Metrics
The project success metrics (customer engagement rates by cohort, customer support request
rate) are provided by the BI team, and we’ll check regularly to see how this project is affecting
these.
We suggest that we start tracking some additional email-status metrics from logfile entries, with
a Graylog dashboard report:
● Count emails sent, by template
● Count bounces reported by Mailchimp
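The two counts above amount to a simple aggregation over log events. The sketch below shows that aggregation in plain Python to pin down what the dashboard should report; the event shape (`type` and `template` keys) is hypothetical, and the real report would be built in Graylog rather than in code like this.

```python
from collections import Counter


def summarize(events):
    """Return (emails sent per template, total bounces) from an iterable of event dicts."""
    sent_by_template = Counter()
    bounces = 0
    for event in events:
        if event["type"] == "sent":
            sent_by_template[event["template"]] += 1
        elif event["type"] == "bounce":
            bounces += 1
    return sent_by_template, bounces
```

Logging one structured entry per send and per bounce callback, with the template name as a field, is what makes this kind of grouping possible on the dashboard side.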
Note that MailChimp has good dashboards for open rate, click rates, etc. For now, we won’t pull
these into our system but will give the business team access to log into the MailChimp account.
Long-term Support
The platform team owns the email sending service and email callback service going forward;
the infrastructure team owns RabbitMQ. Note that while we’re using Mailchimp for now, our
system is generic enough that there’s not all that much vendor lock-in. We expect to spend on
the order of $2500/month sending emails. [ N.b. this number is totally made up ].
As noted in the “Out of Scope” section in the beginning, adding new email templates is not just
an engineering project, but a process involving several teams. [ It’s OK to repeat yourself to
CYA ].