Projections
Event Sourcing is a very interesting topic and there do in fact exist some systems which can be built solely around Event Sourcing. I have built them. In reality however the vast majority of systems built with Event Sourcing cannot be built using Event Sourcing alone.
The problem tends to be that we have these annoying users who want to query
data in ways that are not conducive to being done with a log.
As an example they might want to see all the customers who have spent more
than $1000 in the shop in a given year.
They then complain when we do implement it on the Event Log, which happens to be running 50,000 shops around the world, and ask why it takes almost a day to get them back their result.
It is entirely possible to go through the log and figure out an answer to this
question, but it is far from an optimal way of doing so.
Assuming we want to run this query, and we do not also own the power company and thus have a vested interest in using more power to come up with the answer, there is likely a better way of doing things.
This is where projections come in. A Projection is a cache. See we already know
what many of the questions people want to ask are. We do this during analysis.
We find out that they want to be able to query for all of the customers who have
spent more than $1000 for a given year and instead of calculating it every time,
precalculate it.
The results of this precalculation are then stored somewhere else, say in a SQL database, and the user can then just run a report which queries this SQL database. It is not even uncommon to have them connect Excel or similar tools directly to such a database and do it themselves.
This process of taking events and moving the information into a useful structural
model of some type is known as a Projection. The best way to think of them is
simply as a cache. Any query being run on them could also be run across the
entire Event Log, but it would be prohibitively expensive. As such the “query”
will be run across the entire Event Log and the output of that query will be put
into another model, so it can be quickly accessed.
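To make this concrete, below is a minimal sketch of such a projection for the “more than $1000 in a given year” example. The OrderPlaced event, the customer_yearly_spend table, and the PostgreSQL-style upsert are all assumptions made for illustration, not a prescribed implementation.

using System;
using System.Data;

//OrderPlaced, customer_yearly_spend, and the upsert syntax are assumed
public class OrderPlaced {
    public Guid CustomerId;
    public int Year;
    public decimal Amount;
}

public class CustomerSpendProjection {
    private readonly IDbConnection _db;
    public CustomerSpendProjection(IDbConnection db) { _db = db; }

    //keep a running total per customer per year as each event arrives
    public void Handle(OrderPlaced e) {
        var cmd = _db.CreateCommand();
        cmd.CommandText =
            @"INSERT INTO customer_yearly_spend (customer_id, year, total)
              VALUES (@id, @year, @amount)
              ON CONFLICT (customer_id, year)
              DO UPDATE SET total = customer_yearly_spend.total + EXCLUDED.total";
        //bind @id, @year, @amount from e, then cmd.ExecuteNonQuery()
    }
}

The report is then a trivial indexed query against customer_yearly_spend rather than a scan of the entire Event Log.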
Projections will run against the entire Event Log until they reach the end. They
will then continue following the Event Log. This transition of reaching the end of
the log is important for projections as they can then consider themselves “live”.
As Projections are following the Event Log it is sometimes important for them
to notify people reading the data they are producing whether they are “live”.
Making decisions off of data which is 72 hours old is likely not the best idea in many situations. What should be the expectation of the client who is reading
the information that they are producing?
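One possible shape for this is sketched below: the projection host tracks its position and a live flag, assuming a hypothetical event store interface whose subscription signals when the end of the log has been reached.

//IEventStore, RecordedEvent, and the SubscribeToAll signature are assumed
public class ProjectionHost {
    public bool IsLive { get; private set; }
    public long Position { get; private set; }

    public void Start(IEventStore store) {
        store.SubscribeToAll(
            fromPosition: Position,
            onEvent: e => { Project(e); Position = e.LogPosition; },
            onCaughtUp: () => IsLive = true); //the transition to "live"
    }

    private void Project(RecordedEvent e) { /* update the read model */ }
}

Query responses can then carry IsLive and Position (as headers, for example) so that the client can judge how fresh the data it is reading actually is.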
Most Projections are furthermore durable. They are storing their data in a database or some other form of persistent storage. As such Projections generally run with a Checkpoint of some sort. If there is a restart or failure of the system they will start back from where they last were. This can get a bit tricky at times when dealing with the atomicity of updating the Checkpoint and the Projection itself, and the Projection may need to be idempotent in certain circumstances, but it generally works reasonably well.
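That startup sequence might look roughly like the following sketch, assuming a hypothetical checkpoint store and an event store subscription which can start from a given position.

//checkpointStore, eventStore, and Project are assumptions for illustration
long checkpoint = checkpointStore.Load("customer-spend") ?? 0L;
eventStore.SubscribeToAll(
    fromPosition: checkpoint,
    onEvent: e => {
        Project(e); //update the read model
        checkpointStore.Save("customer-spend", e.LogPosition);
        //if these two writes are not atomic, a crash between them means
        //the event is re-delivered on restart, so Project must be idempotent
    });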
One of my favorite Checkpoints to use with a Projection is to actually keep the Checkpoint with the Projection data itself, later referred to as a DatabaseCheckpoint. As an example:

id   currentcount   checkpoint
17   47             383672
25   52             376389
82   117            376389
This gives extremely fine-grained control over the checkpointing involved with the Projection. You know that the data for the row with id 25 is good up to position 376389 in the log.
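A sketch of the corresponding write, against the hypothetical counts table shown above, makes the mechanics clear: the row’s data and its checkpoint move together in a single statement.

using System.Data;

//the row's data and its checkpoint are updated in one atomic statement
public void ApplyItemCounted(IDbCommand cmd, int id, long position) {
    cmd.CommandText =
        @"UPDATE counts
             SET currentcount = currentcount + 1,
                 checkpoint   = @position
           WHERE id = @id
             AND checkpoint < @position"; //skip events this row already saw
    //bind @id and @position, then ExecuteNonQuery()
}

Replaying an event the row has already seen updates nothing, which is also part of what makes the multiple-writer setups discussed below safe.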
This fine-grained checkpointing might not seem that valuable until you realize that Projections are almost always eventually consistent. You can return that value to the client, and the client then knows what position of the log the data in the projection reflected when they were looking at it. There will be significant discussion about eventual consistency later, but this allows for optimistic concurrency control.
The client can say “the data I showed the user is good up until . . . 383672” and the domain can then decide whether anything relevant happened past that point; if so, the operation can be returned to the client, telling them to see the newer data before continuing.
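Sketched as code, assuming commands carry the log position the client’s read was based on (ChangeItemPrice, ObservedPosition, and the helpers here are hypothetical names):

public void Handle(ChangeItemPrice cmd) {
    //cmd.ObservedPosition is the checkpoint the client read, e.g. 383672
    if (AnyRelevantEventsAfter(cmd.ObservedPosition, cmd.ItemId))
        throw new StaleDataException(
            "newer data exists - re-read before continuing");
    //otherwise proceed with the operation
}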
You can do this same thing with the read model as a whole just by remembering the checkpoint for the entire model, but you gain much finer-grained control by putting the checkpoint with the projection data itself.
One can look at even a specific row in a table and know where that row is in
terms of the log overall. You can choose to prioritize certain projections over others in terms of when they should be run. Optimistic concurrency checks can be done at a row level as opposed to a database level too. The checkpoint now applies to the data itself, not the “bucket” the data is in.
There are two further benefits here that many do not consider until they start
actually working with projections in production.
The first, and this is a major benefit: what if you wanted to have not one “projection engine” running, but three of them running on the same database? As they are all now using row-based versioning they would . . . not conflict with each other. It would just be “oh, someone else already did that . . . ”. This is in fact a common way of reaching the goal of having multiple concurrent sets of projection engines running on the same database.
If one of those projection engines were to go down . . . we have two more!
You might be surprised at how efficient many databases are at handling multiple writers deliberately doing the same writes, with the duplicates failing due to conflicts. Be wary however of which databases you apply this to: not all have strong, or even functioning, consistency assurances, and this can wind up in some interesting situations.
Much more common however is to run against multiple independent databases.
Why not run three sets of projections to three distinct instances of a database?
If you have one set of projections which runs against say SQL Server or MongoDB
why not just have three distinct instances doing the same things, but doing them completely independently?
Which do you trust more? Three distinct instances of them running or . . . their
clustering?
Beyond this another big gain can be found here . . . geographic distribution.
Instead of putting all three into a single data center, say us-east-1, you could instead spread them across data centers. You could then have one in us-east-1, one in us-west-1, one in us-west-2, and, because you also have a team in India, one in ap-south-1-del-1a.
Geographic distribution of projections is perhaps surprisingly rather trivial. Most
event stores already support replication. They are after all a store of events.
Even if the event store you were using did not explicitly support replication, it likely supports a SubscribeToAll operation, the same one projections use, which you can use to write a “replication bridge” between multiple event stores in a relatively easy fashion: SubscribeToAll on one node and write to the other.
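Such a bridge might look roughly like this sketch, where source, target, and the checkpoint helpers are assumptions for illustration:

//subscribe on one store and append everything received to the other
long bridgeCheckpoint = LoadBridgeCheckpoint();
source.SubscribeToAll(
    fromPosition: bridgeCheckpoint,
    onEvent: e => {
        target.Append(e.Stream, e);          //copy the event to the replica
        SaveBridgeCheckpoint(e.LogPosition); //so the bridge can resume
    });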
While this latter option of building your own bridge does carry more risk (that
code is really going to need to be tested well), it is completely feasible to do
with most systems.
A main thing to consider when geographically distributing projections is how to
deal with discovery. A client is going to need some mechanism of figuring out
which model it should talk to. This should be decided relatively early in the
process as it is quite key in terms of how the system operates overall.
Beyond this, although projections can be discussed “on the outside” as just being a series of Event Handlers that are bundled together, there are a number of patterns within projections that can be found. Understanding what type of projection is being dealt with can often be quite valuable in terms of how to manage it. A projection which is focused on doing bulk inserts of historical data into a reporting database is quite different in terms of management than one which is focused on updating a single order in a document database as quickly as it can.
Types of Projections
Although projections are often discussed “from the outside” there are numerous
different sub-categories of projections. It should not be surprising that some
patterns are noticed in dealing with the varying mechanisms of I/O.
Intending to bulk insert, once per day, a set of items which were mailed to customers is a very different situation from wanting a near real-time model showing how things are moving through the warehouse.
The simplest and most common form of projection is in this text called a Simple
Projection. A Simple Projection will listen to some events and update state
when it sees an event.
public class SampleProjection : Handles<ItemCreated>, Handles<ItemUpdated> {
    public void Handle(ItemCreated e) {
        //INSERT INTO TABLE ...
    }

    public void Handle(ItemUpdated e) {
        //UPDATE TABLE ...
    }
}
Inserting Projections, which only ever insert new rows and never update existing ones, make up a large number of the projections which are dealt with, though the proportion can vary by industry. If you are for example working in finance it would not even be unusual to see that 70-80% of your projections are Inserting Projections.
The reason you tend to see this in finance in particular is that everything is based on transactions. Beyond this, everything tends to be temporal. As such you tend to be inserting a huge amount of information and updating/deleting almost nothing aside from summary tables and the like.
A special type of Inserting Projection is a Batched Projection. Very often when
doing a large number of inserts the main bottleneck is the time it takes to
get the data from the projection to the storage and receive a response. The
network overhead is a significant portion of the overall operation time. Batched
Projections seek to minimize this overhead by sending ten rows at once instead
of doing a round trip for each insert.
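A minimal sketch of the buffering involved might look like this, with the batch size and flush mechanics as assumptions:

using System.Collections.Generic;

public class BatchedItemProjection : Handles<ItemCreated> {
    private readonly List<ItemCreated> _buffer = new List<ItemCreated>();
    private const int BatchSize = 10;

    public void Handle(ItemCreated e) {
        _buffer.Add(e);
        if (_buffer.Count >= BatchSize) Flush();
    }

    private void Flush() {
        //one multi-row INSERT (or a bulk API) for the whole buffer:
        //INSERT INTO items (...) VALUES (...), (...), ...
        _buffer.Clear();
    }
}

A real implementation would also want a timer so that a partially filled buffer still gets flushed when the stream goes quiet.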
Batched Projections can even go one step further than this (perhaps even
deserving a different name) to make the batch size essentially . . . everything.
This is most commonly used during large replays. Instead of inserting record by record, all of the data is written to, say, a CSV (comma separated values) file. When the system has caught up it then does a bulk insert of the entire CSV file, which could even be millions of rows.
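As a rough sketch of this variant (ReplayAllEvents and the row shape are assumptions; the bulk load shown is SQL Server’s, though PostgreSQL’s COPY serves the same purpose):

using System.IO;

//during replay, write rows to a file instead of round-tripping per event
using (var csv = new StreamWriter("items.csv")) {
    foreach (var e in ReplayAllEvents())
        csv.WriteLine($"{e.Id},{e.Name}");
}
//then, once caught up, a single bulk load such as:
//  BULK INSERT items FROM 'items.csv' WITH (FIELDTERMINATOR = ',');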
Batched Projections especially on large replays can make a significant difference
in terms of how long it takes for the projection to replay (read: an order of
magnitude). It is quite common as well to make a projection that is only a Batched Projection when it is in “replay mode”. Once it has caught up it switches to being a regular Inserting Projection. The reason for this is that there are two separate goals at different points in the projection’s lifecycle. When it is live, the goal is to get that specific piece of data in at a low latency. When it is replaying history the goal is to get ALL of the data in quickly.
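The switch itself can be quite small, as in this sketch which reuses the assumed catch-up signal from earlier:

//batch during replay, insert row by row once live
store.SubscribeToAll(
    fromPosition: checkpoint,
    onEvent: e => {
        if (_live) InsertRow(e);  //live: low latency, row by row
        else AppendToBatch(e);    //replaying: throughput, batch it up
    },
    onCaughtUp: () => { FlushBatch(); _live = true; });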
While we are discussing replays: it is crucial that you are . . . good at them. I have
as a consultant seen numerous clients who had reached a point where doing a
full replay was something they could not do. They had all sorts of weird logic
attempting to do partial replays etc. Your read models are cattle not pets. You
should be killing them off regularly just for the sake of killing them off.
You might even do it on every release. In a Continuous Integration environment
this is probably a bit extreme but in such an environment doing it say biweekly
or monthly on a release makes a lot of sense. The key here is that you know
how to do it, you know how it works, you know that it does work, and you have
some general expectation of how long things will take.
If rebuilding an entire read model is a regular thing for you to do . . . when
something really nasty happens in production, and you actually need to do it . . .
it will be Tuesday, not an emergency. You will likely even have a pretty solid idea of how long it will actually take.
One trap many fall into is that they have never actually done a full replay, or have not done one in a very long time. When something does happen it is a panic; nobody knows what is involved, how long it will take, or if it will even work. As the old expression goes:

“Doctor, it hurts when I . . . ” Doctor: “well, do that more.”
You should be extremely comfortable replaying projections. It should be a
common operation that is done. Ideally it should take somewhere around the
amount of time to go out for a relaxed lunch. “We are replaying projections.”
is actually quite a reasonable excuse to go out for a bit longer team lunch on
Friday.
In most systems projections should not be treated as a single “unit”. There are
likely some natural partitions within the model. Identifying these partitions
and treating them as being separate can help a lot with reducing replay times
etc. When identifying partitions you are not generally looking for “these three tables”; you are looking for larger conceptual units. It might be that your read model is made up of three “sub-models” and those three “sub-models” can be replayed/hosted independently. This will not only help with replays of projections but will also help if you need to scale at a later point, as you can move each “sub-model” off to its own distinct store.
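One way of expressing this, with ProjectionGroup as a hypothetical unit that owns its own checkpoint and target database:

var groups = new[] {
    new ProjectionGroup("orders",    ordersDb),
    new ProjectionGroup("inventory", inventoryDb),
    new ProjectionGroup("reporting", reportingDb),
};
foreach (var g in groups)
    g.Start(); //each follows the log independently, with its own checkpoint
//rebuilding "reporting" means resetting only that group's checkpoint;
//"orders" and "inventory" keep serving reads while it replays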
Replaying projections is something that can take quite a bit of getting used to,
but it is also a highly valuable feature to have. Remember that read models are
cattle, not pets.
While the concept of a projection is relatively simple, there can however be a lot of details associated with them. These details exist in their design, implementation, and operation. There is not really a good way of understanding all the details aside from actually building and then running some in production, as many things can be subtle and environment specific.
A good initial strategy would be to try moving over only a service or two (non-vital ones with a small surface area) and to let them run in production for a while to see how things operate. Try doing things like providing multiple models of the same data to see how you might be able to leverage it. Run into the need to replay the production models and learn how to actually handle it. Much of this sounds quite complex at first but can, in most circumstances and with some practice, be handled in relatively simple ways.
Get into “good habits” associated with your projections. Release often. Do not
be afraid of doing a full replay; avoid partial replays until you actually have a need for them that cannot be solved in some other manner. Over time, especially in development and integration environments, you will begin to get good at it. You will start seeing some simple tools which would make things “a bit easier”, and so you will build them.
The only way to get good at managing your projections is to actually do it. Start
early in your development process getting into a habit of dealing with them.
You will find all sorts of not-necessarily-obvious things which can help.