Dgraph: Synchronously Replicated, Transactional and Distributed Graph Database
Manish Jain
[email protected]
Abstract

Dgraph is a distributed graph database which provides horizontal scalability, distributed cluster-wide ACID transactions, low-latency arbitrary-depth joins, synchronous replication, high availability and crash resilience. Aimed at real-time transactional workloads, Dgraph shards and stores data in a way that optimizes joins and traversals, while still providing data retrieval and aggregation. Dgraph's unique take is to provide low-latency arbitrary-depth joins in a constant number of network calls (typically, just one network call) required to execute a single join, irrespective of the size of the cluster or the size of the result set.

1 Introduction

Distributed systems or databases tend to suffer from the join depth problem. That is, as the number of traversals of relationships within a query increases, the number of network calls required (in a sufficiently sharded dataset) increases. This is typically due to entity-based data sharding, where entities are randomly (sometimes with a heuristic) distributed across servers, containing all the relationships and attributes along with them. This approach suffers from high-fanout result sets in intermediate steps of a graph query, forcing a broadcast across the cluster to perform joins on the entities. Thus, a single graph query results in network broadcasts, causing a jump in query latency as the cluster grows.

Dgraph is a distributed database with a native graph backend. It is the only native graph database to be horizontally scalable and support full ACID-compliant cluster-wide distributed transactions. In fact, Dgraph is the first graph database to have been Jepsen [?] tested for transactional consistency.

Dgraph automatically shards data into machines as the amount of data or the number of servers changes, and automatically reshards data to move it across servers to balance the load. It also supports synchronous replication backed by the Raft [?] protocol, which allows queries to seamlessly fail over to provide high availability.

Dgraph solves the join depth problem with a unique sharding mechanism. Instead of sharding by entities, as most systems do, Dgraph shards by relationships. Dgraph's unique way of sharding data is inspired by research at Google [?], which shows that the overall latency of a query is greater than the latency of the slowest component. The more servers a query touches to execute, the slower the query latency would be. By doing relationship-based sharding, Dgraph can execute a join or traversal in a single network call (with a backup network call to a replica if the first is slow), irrespective of the size of the cluster or the input set of entities. Dgraph executes arbitrary-depth joins without network broadcasts or collecting data in a central place. This allows the queries to be fast and latencies to be low and predictable.

2 Dgraph Architecture

Dgraph consists of Zeros and Alphas, each representing a group that they are serving. Zeros serve group zero and Alphas serve group one, group two and onwards. Each group forms a Raft cluster of 1, 3 or 5 members, configurable by a human operator (henceforth referred to as the operator). All updates made to the group are serialized via the Raft consensus algorithm and applied in that order to the leader and followers.

Zeros store and propagate metadata about the cluster while Alphas store user data. In particular, Zeros are responsible for membership information, which keeps track of the group each Alpha server is serving, its internal IP address for communication within the cluster, the shards it is serving, etc. Zeros do not keep track of the health of the Alphas and take actions on them – that is considered the job of the operator. Using this information, Zero can tell a new Alpha to either join and serve an existing group, or form a new group.

The membership information is streamed out from Zero to all the Alphas. Alphas can use this membership information to route queries (or mutations) which hit the cluster. Every instance in the cluster forms a connection with every other instance (thus forming 2 × (N choose 2), i.e. N × (N − 1), open connections, where N = number of Dgraph instances in the cluster), however, the
protocol buffer [?] data format and not interchanged among
the two.
{
"uid" : "0xab",
"type" : "Astronaut",
"name" : "Mark Watney",
"birth" : "2005/01/02",
"follower": { "uid": "0xbc", ... },
}
show Badger to provide equivalent or faster writes than other
LSM based DBs, while providing equivalent read latencies
compared to B+-tree based DBs (which tend to provide much
faster reads than LSM trees).
As mentioned above, all records with the same predicate
form one shard. Within a shard, records sharing the same
subject-predicate are grouped and condensed into one single
key-value pair in Badger. This value is referred to as a posting
list, a terminology commonly used in search engines to refer
to a sorted list of doc ids containing a search term. A posting
list is stored as a value in Badger, with the key being derived
from subject and predicate.
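As a rough sketch of this layout (illustrative only; the exact byte encoding, separator and names below are assumptions, not Dgraph's actual format), a data key can be thought of as the predicate plus the subject uid, with the sorted object uids stored as the posting list value. An index key (Section 3) follows the same idea, with a token in place of the uid.

package main

import "encoding/binary"

// Illustrative only: not Dgraph's real key encoding.
// A data key groups every posting for one (subject, predicate) pair.
func dataKey(predicate string, subjectUid uint64) []byte {
	key := make([]byte, 0, len(predicate)+1+8)
	key = append(key, predicate...)
	key = append(key, 0x00) // assumed separator byte
	var uid [8]byte
	binary.BigEndian.PutUint64(uid[:], subjectUid)
	return append(key, uid[:]...)
}

// postingList is the value stored against such a key in Badger:
// the sorted object uids for that subject-predicate pair.
type postingList struct {
	Uids []uint64 // sorted ascending
}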
2.3 Data Sharding
While Dgraph shares a lot of features of NoSQL and dis-
tributed SQL databases, it is quite different in how it handles
its records. In other databases, a row or document would be
the smallest unit of storage (guaranteed to be located together),
while sharding could be as simple as generating equal sized
chunks consisting of many of these records.
Dgraph’s smallest unit of record is a triple (subject-
predicate-object, described below), with each predicate in
its entirety forming a shard. In other words, Dgraph logically
groups all the triples with the same predicate and considers
them one shard. Each shard is then assigned a group (1..N)
which can then be served by all the Alphas serving that group,
as explained in section ??.
This data sharding model allows Dgraph to execute a com-
plete join in a single network call and without any data fetch-
ing across servers by the caller. This, combined with grouping records on disk in a way that turns operations which would typically require expensive disk iterations into fewer, cheaper disk seeks, makes Dgraph's internal working quite efficient.
To elaborate this further, consider a dataset which contains information about where people live (predicate: "lives-in") and what they eat (predicate: "eats"). Data might look something like this:

<person-a> <lives-in> <sf> .
<person-a> <eats> <sushi> .
<person-a> <eats> <indian> .
...
<person-b> <lives-in> <nyc> .
<person-b> <eats> <thai> .

Figure 3: Data sharding

In this case, we'll have two shards: lives-in and eats. Assume the worst case scenario where the cluster is so big that each shard lives on a separate server. For a query which asks for [people who live in SF and eat Sushi], Dgraph would execute one network call to the server containing lives-in and do a single lookup for all the people who live in SF (* <lives-in> <sf>). In the second step, it would take those results and send them over to the server containing eats, do a single lookup to get all the people who eat Sushi (* <eats> <sushi>), and intersect with the previous step's result set to generate the final list of people from SF who eat Sushi. In a similar fashion, this result set can then be further filtered/joined, each join executing in one network call.

As we learnt in section ??, the result set is a list of sorted 64-bit unsigned integers, which makes the retrieval and intersection operations very efficient.
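Because each step produces a sorted list of uids, the intersection in the example above can be done with a single linear merge. A minimal sketch (not Dgraph's actual implementation):

// intersectSorted walks both sorted uid lists once and keeps the
// uids present in both, e.g. people who live in SF and eat sushi.
func intersectSorted(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}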
2.4 Data Rebalancing

As explained above, each shard contains a whole predicate in its entirety, which means Dgraph shards can be of uneven size. The shards not only contain the original data, but also all of their indices. Dgraph groups contain many shards, so the groups can also be of uneven size. The group and shard sizes are periodically communicated to Zero. Zero uses this information to try to achieve a balance among groups, using heuristics. The current one being used is just data size, with the idea that equal sized groups would allow similar resource usage across servers serving those groups. Other heuristics, particularly around query traffic, could be added later.

To achieve balance, Zero would move shards from one group to another. It does so by marking the shard read-only, then asking the source group to iterate over the underlying key-values concurrently and stream them over to the leader of the destination group. The destination group leader proposes these key-values via Raft, gaining all the correctness that comes with it. Once all the proposals have been successfully applied by the destination group, Zero would mark the shard as being served by the destination group. Zero would then tell the source group to delete the shard from its storage, thus finalizing the process. A sketch of these steps follows this list of violations.

While this process sounds pretty straightforward, there are many race and edge conditions here which can cause transactional correctness to be violated, as shown by Jepsen tests [?]. We'll showcase some of these violations here:

1. A violation can occur when a slightly behind Alpha server would think that it is still serving the shard (despite the shard having moved to another group) and allow mutations to be run on itself. To avoid this, all transaction states keep the shard and the group info for the writes (along with their conflict keys, as we'll see in section ??). The shard-group information is then checked by Zero to ensure that what the transaction observes (via the Alpha it talked to) and what Zero has is the same – a mismatch would cause a transaction abort.

2. Another violation happens when a transaction commits after the shard was put into read-only mode – this would cause that commit to be ignored during the shard transfer. Zero catches this by assigning a timestamp to the move operation. Any commits (on this shard) at a higher timestamp would be aborted, until the shard move has completed and the shard is brought back to the read-write mode.

3. Yet another violation can occur when the destination group receives a read below the move timestamp, or a source group receives a read after it has deleted the shard. In both cases, no data exists, which can cause the reads to incorrectly return nil values. Dgraph avoids this by informing the destination group of the move timestamp, which it can use to reject any reads for that shard below it. Similarly, Zero includes a membership mark which the source Alpha must reach before the group can delete the shard; thus, every Alpha member of the group would know that it is no longer serving the data before deleting it.

Overall, the mechanism of membership information synchronization during a shard move proved the hardest to get right with respect to transactional correctness.
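The move itself can be summarized in a short sketch. This is only an outline of the steps described above; all type and method names here are hypothetical, not Dgraph's actual API, and error handling and retries are omitted.

// Hypothetical interfaces standing in for a Dgraph group and Zero.
type group interface {
	StreamKeyValues(shard string) [][]byte
	ProposeViaRaft(kvs [][]byte) error
	DeleteShard(shard string)
}

type zeroServer interface {
	LeaseTimestamp() uint64
	MarkReadOnly(shard string, moveTs uint64)
	SetServingGroup(shard string, dst group, moveTs uint64)
}

// moveShard outlines how Zero rebalances a shard from src to dst.
func moveShard(z zeroServer, src, dst group, shard string) error {
	moveTs := z.LeaseTimestamp()      // commits above this ts abort until the move completes
	z.MarkReadOnly(shard, moveTs)     // step 1: freeze writes on the shard
	kvs := src.StreamKeyValues(shard) // step 2: source streams its key-values
	if err := dst.ProposeViaRaft(kvs); err != nil { // step 3: destination replicates them via Raft
		return err
	}
	z.SetServingGroup(shard, dst, moveTs) // step 4: membership now points at the destination
	src.DeleteShard(shard)                // step 5: source drops its copy
	return nil
}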
3 Indexing

Dgraph is designed to be a primary database for applications. As such, it supports most of the commonly needed indices. In particular, for strings, it supports regular expression, full-text search, term matching, exact and hash matching indices. For datetime, it supports year, month, day and hour level indices. For geo, it supports nearby, within, etc. operations, and so on.

All these indices are stored by Dgraph using the same posting list format described above. The difference between an index and data is the key. A data key is typically <predicate, uid>, while an index key is <predicate, token>. A token is derived from the value of the data, using an index tokenizer. Each index tokenizer supports this interface:

type Tokenizer interface {
	Name() string

	// Type returns the string representation of
	// the typeID that we care about.
	Type() string

	// Tokens return tokens for a given value. The
	// tokens shouldn't be encoded with the byte
	// identifier.
	Tokens(interface{}) ([]string, error)

	// Identifier returns the prefix byte for this
	// token type. This should be unique. The range
	// 0x80 to 0xff (inclusive) is reserved for
	// user-provided custom tokenizers.
	Identifier() byte

	// IsSortable returns true if the tokenizer can
	// be used for sorting/ordering.
	IsSortable() bool

	// IsLossy() returns true if we don't store the
	// values directly as index keys during
	// tokenization. If a predicate is tokenized
	// using a lossy tokenizer, we need to fetch
	// the actual value and compare.
	IsLossy() bool
}

Every tokenizer has a globally unique identifier (Identifier() byte), including custom tokenizers provided by operators. The tokens generated are prefixed with the tokenizer identifier to be able to traverse through all tokens belonging to only that tokenizer. This is useful when doing iteration for inequality queries (greater than, less than, etc.). Note that inequality queries can only be done if a tokenizer is sortable (IsSortable() bool). For example, in strings, an exact index is sortable, but a hash index is not.

Depending upon which index a predicate has set in the schema, every mutation on that predicate would invoke one or more of these tokenizers to generate the tokens. Note that indices only operate on values, not objects. A set of tokens would be generated with the before-mutation value and another set with the after-mutation value. Mutations would be added to delete the subject uid from the posting lists of the before tokens and to add the subject uid to the after tokens.

Note that all indices have object values, so they largely deal only in uids. Indices in particular can suffer from the high fan-out problem, which is solved using posting list splits described in section ??.
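For illustration, a minimal "exact" tokenizer satisfying this interface might look like the following. This is a hypothetical example, not one of Dgraph's built-in tokenizers; the name and identifier byte are made up (0x80 is simply the start of the range reserved for custom tokenizers).

package main

import "fmt"

// exactTokenizer treats the whole string as a single token, so the
// resulting index preserves ordering (sortable) and stores the value
// itself as the token (not lossy).
type exactTokenizer struct{}

func (exactTokenizer) Name() string { return "my_exact" }
func (exactTokenizer) Type() string { return "string" }
func (exactTokenizer) Tokens(v interface{}) ([]string, error) {
	s, ok := v.(string)
	if !ok {
		return nil, fmt.Errorf("my_exact: expected string, got %T", v)
	}
	return []string{s}, nil
}
func (exactTokenizer) Identifier() byte { return 0x80 } // custom tokenizer range
func (exactTokenizer) IsSortable() bool { return true }
func (exactTokenizer) IsLossy() bool    { return false }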
4 Multiple Version Concurrency Control

As described in section ??, data is stored in posting list format, which consists of postings sorted by integer ids. All posting list writes are stored as deltas to Badger on commit, using the commit timestamp. Note that timestamps are monotonically increasing globally across the DB, so any future commits are guaranteed to have a higher timestamp.

It is not possible to update this list in-place, for multiple reasons. One is that Badger (and most LSM trees) writes are
immutable, which plays very well with filesystems and rsync.
Second is that adding an entry within a sorted list requires
moving following entries, which depending upon the position
of the entry can be expensive. Third, as the posting list grows,
we want to avoid rewriting a large value every time a mutation
happens (for indices, it can happen quite frequently).
Dgraph considers a posting list as a state. Every future
write is then stored as a delta with a higher timestamp. A delta
would typically consist of postings with an operation (set or
delete). To generate a posting list, Badger would iterate the
versions in descending order, starting from the read timestamp,
picking all deltas until it finds the latest state. To run a posting
list iteration, the right postings for a transaction would be
picked, sorted by integer ids, and then merge-sort operation is
run between these delta postings and the underlying posting
list state.
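A simplified sketch of this read path is shown below; it assumes a posting is just a uid plus a set/delete flag, ignores Badger's actual iteration API, and uses a map plus a final sort instead of the merge-sort described above, so it is illustrative only.

package main

import "sort"

// Illustrative only; Dgraph's real posting and iterator types differ.
type delta struct {
	uid uint64
	del bool // true if this delta removes the uid
}

// readPostingList rebuilds the posting list visible to a read: deltas are
// the postings picked from versions at or below the read timestamp, newest
// first, and state is the last full posting list found below them.
func readPostingList(state []uint64, deltas []delta) []uint64 {
	decided := make(map[uint64]bool, len(deltas)) // newest op per uid wins
	live := make(map[uint64]bool)
	for _, d := range deltas {
		if decided[d.uid] {
			continue
		}
		decided[d.uid] = true
		if !d.del {
			live[d.uid] = true
		}
	}
	for _, uid := range state { // uids untouched by any delta keep their state value
		if !decided[uid] {
			live[uid] = true
		}
	}
	out := make([]uint64, 0, len(live))
	for uid := range live {
		out = append(out, uid)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}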
Earlier iterations of this mechanism were aimed at keep-
ing the delta layer sorted by integer ids as well, overlaying it
on top of the state to avoid doing sorting during the reads —
any addition or deletion made would be consolidated based
on what was already in the delta layer and the state. These
iterations proved too complex to maintain for the team and
suffered from hard to find bugs. Ultimately, that concept was
dropped in favor of a simple, understandable solution of picking the right postings for a read and sorting them before iteration. Additionally, earlier APIs implemented both forward and backward iteration, adding complexity. Over time, it became clear that only forward iteration was required, simplifying the design.

There are many benefits in avoiding having to regenerate the posting list state on every write. At the same time, as deltas accumulate, the work of list regeneration gets delegated to the readers, which can slow down the reads. To find a balance and avoid accumulating deltas indefinitely, we added a rollup mechanism.

Rollups: As keys get read, Dgraph would selectively regenerate the posting lists which have a minimum number of deltas, or haven't been regenerated for a while. The regeneration is done by starting from the latest state, then iterating over the deltas in order and merging them with the state. The final state is then written back at the latest delta timestamp, replacing the deltas and forming a new state. All previous deltas and states for that key can then be discarded to reclaim space.

This system allows Dgraph to provide MVCC. Each read is operating upon an immutable version of the DB. Newer deltas are being generated at higher timestamps and would be skipped during a read at a lower timestamp.

Figure 4: MVCC

5 Transactions

Dgraph has a design goal of being simple to operate. As such, one of the goals is to not depend upon any third party system. This proved quite hard to achieve while providing high availability for not only data but also transactions.

While designing transactions in Dgraph, we looked at papers from Spanner [?], HBase [?], Percolator [?] and others. Spanner most famously uses atomic clocks to assign timestamps to transactions. This comes at the cost of lower write throughput on commodity servers which don't have a GPS based clock sync mechanism. So, we rejected that idea in favor of having a single Zero server, which can hand out logical timestamps at a much faster pace.

To avoid Zero becoming a single point of failure, we run multiple Zero instances forming a Raft group. But, this comes with a unique challenge of how to do handover in case of leader re-election. The Omid, Reloaded [?] paper (referenced as Omid2) handles this problem by utilizing an external system. In Omid2, they run a standby timestamp server to take over in case the leader fails. This standby server doesn't need to get the latest transaction state information, because Omid2 uses Zookeeper [?], a centralized service, for maintaining transaction logs. Similarly, TiDB built TiKV, which uses a Raft-based replication model for the key-values. This allows every write by TiDB to automatically be considered highly-available. Similarly, Bigtable [?] uses Google Filesystem [?] for distributed storage. Thus, no direct information transfer needs to happen among the multiple servers forming the quorum.

While this concept achieves simplicity in the database, we were not entirely thrilled with this idea for two reasons. One, we had an explicit goal of non-reliance on any third-party system to make running Dgraph operationally easier, and felt that a solution should be possible without pushing
synchronous replication within Badger (storage). Second, we wanted to avoid touching disk unless necessary. By having Raft be part of the Dgraph process, we can fine-tune when things get written to state to achieve better efficiency. In fact, our implementation of transactions doesn't write to DB state on disk until they are committed (they are still written to the Raft WAL).

We closely looked at HBase papers ([?], [?]) for other ideas, but they didn't directly fit our needs. For example, HBase pushed a lot of transaction information back to the client, giving them critical information about what they should or should not read to maintain the transactional guarantees. This, however, makes the client libraries harder to build and maintain, something we did not like. On top of that, a graph query can touch millions of keys in the intermediate steps; it's expensive to keep track of all that information and propagate it to the client.

The aim for Dgraph client libraries was to keep as minimal state as possible, to allow open-source users unfamiliar with the internals of Dgraph to build and maintain libraries in languages unfamiliar to us (for example, Elixir).

// TODO: Do I describe the first iteration?

We simply could not find a paper at the time which described how to build a simple to understand, highly-available transactional system which could be run without assuming that the storage layer is highly available. So, we had to come up with a new solution. Our second iteration still faced many issues as proven by Jepsen tests. So, we simplified our second iteration to a third one, which is as follows.

5.1 Lock-Free High Availability Transaction Processing

Dgraph follows a lock-free transaction model. Each transaction pursues its course concurrently, never blocking on other transactions, while reading the committed data at or below its start timestamp. As mentioned before, the Zero leader maintains an Oracle which hands out logical transaction timestamps to Alphas. The Oracle also keeps track of a commit map, storing a conflict key → latest commit timestamp mapping. As shown in Algorithm ??, every transaction provides the Oracle the list of conflict keys, along with the start timestamp of the transaction. Conflict keys are derived from the modified keys, but are not the same. For each write, a conflict key is calculated depending upon the schema. When a transaction requests a commit, Zero would check if any of those keys has a commit timestamp higher than the start timestamp of the transaction. If the condition is met, the transaction is aborted. Otherwise, a new timestamp is leased by the Oracle, set as the commit timestamp, and the conflict keys in the map are updated.

Algorithm 1 Commit(Ts, Keys)
 1: for each key k ∈ Keys do
 2:   if lastCommit(k) > Ts then
 3:     Propose(Ts ← abort)
 4:     return
 5:   end if
 6: end for
 7: Tc ← GetTimestamps(1)
 8: for each key k ∈ Keys do
 9:   lastCommit(k) ← Tc
10: end for
11: Propose(Ts ← Tc)

The Zero leader then proposes this status update (commit or abort) in the form of a start → commit ts (where commit ts = 0 for abort) to the followers and achieves quorum. Once quorum is achieved, the Zero leader streams out this update to the subscribers, which are Alpha leaders. To keep the design simple, Zero does not push to any Alpha leader. It is the job of (whoever is) the latest Alpha leader to establish an open stream from Zero to receive transaction status updates.

Along with the transaction status update, the Zero leader also sends out a MaxAssigned timestamp. MaxAssigned is calculated using the Watermark algorithm (Algorithm ??), which maintains a min-heap of all allocated timestamps, both start and commit timestamps. As consensus is achieved, the timestamps are marked as done and MaxAssigned gets advanced to the maximum timestamp up until which everything has achieved consensus as needed. Note that start timestamps don't typically need a consensus (unless the lease needs to be updated) and get marked as done immediately. Commit timestamps always need a consensus to ensure that the Zero group achieves quorum on the status of the transaction. This allows a Zero follower to become a leader and have full knowledge of transaction statuses. This ordering is crucial to achieve the transactional guarantees, as we will see below.

Algorithm 2 Watermark: CalculateDoneUntil(T, isPending)
 1: if T ∉ MinHeap then
 2:   MinHeap ← T
 3: end if
 4: pending(T) ← isPending
 5: curDoneTs ← DoneUntil
 6: for each minTs ∈ MinHeap.Peek() do
 7:   if pending(minTs) then
 8:     break
 9:   end if
10:   MinHeap.Pop()
11:   curDoneTs ← minTs
12: end for
13: DoneUntil ← curDoneTs

Once Alpha leaders receive this update, they would propose it to their followers, applying the updates in the same order. All Raft proposal applications in Alphas are done serially.
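A compact sketch of the Oracle's commit decision, mirroring Algorithm 1 (illustrative names and structure, not Dgraph's actual implementation):

// oracle tracks, per conflict key, the latest commit timestamp, plus
// the next logical timestamp to lease.
type oracle struct {
	lastCommit map[string]uint64
	nextTs     uint64
}

func newOracle(startTs uint64) *oracle {
	return &oracle{lastCommit: make(map[string]uint64), nextTs: startTs}
}

// commit mirrors Algorithm 1: abort (return 0) if any conflict key has
// committed after the transaction's start timestamp; otherwise lease a
// commit timestamp and record it against every conflict key.
func (o *oracle) commit(startTs uint64, conflictKeys []string) uint64 {
	for _, k := range conflictKeys {
		if o.lastCommit[k] > startTs {
			return 0 // abort: proposed to the Zero group as startTs -> 0
		}
	}
	o.nextTs++
	commitTs := o.nextTs
	for _, k := range conflictKeys {
		o.lastCommit[k] = commitTs
	}
	return commitTs // proposed as startTs -> commitTs, then streamed to Alpha leaders
}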
Alphas also have an Oracle, which keeps track of the pending transactions. They maintain the start timestamp, along with a transaction cache which keeps all the updated posting lists in memory. On a transaction abort, the cache is simply dropped. On a transaction commit, the posting lists are written to Badger using the commit timestamp. Finally, the MaxAssigned timestamp is updated.

Every read or write operation must have a start timestamp. When a new query or mutation hits an Alpha, it would ask Zero to assign a timestamp. This operation is typically batched to only allow one pending assignment call to the Zero leader per Alpha. If the start timestamp of a newly received query is higher than the MaxAssigned registered by that Alpha, it would block the query until its MaxAssigned reaches or exceeds the start ts. This solution nicely tackles a wide array of edge case scenarios, including an Alpha falling back or going behind a network partition from its peers, or just restarting after a crash, etc. In all those cases, the queries would be blocked until the Alpha has seen all updates up until the timestamp of the query, thus maintaining the guarantee of transactions and linearizable reads.

Figure 5: MaxAssigned watermark. Open circles represent pending and filled circles represent done. Start timestamps 1, 2, and 4 are immediately marked as done. Commit timestamp 3 begins and must have consensus before it is done. Watermark keeps track of the highest timestamp at and below which everything is done.

Figure 6: The MaxAssigned system ensures linearizable reads. Reads at timestamps higher than the current MaxAssigned (MA) must block to ensure the writes up until the read timestamp are applied. Txn 2 receives start ts 3, and a read at ts 3 must acknowledge any writes up to ts 2.

For correctness, only the Zero leader is allowed to assign timestamps, uids, etc. There are edge cases where Zero followers would mistakenly think they're the leaders and serve stale data — Dgraph does multiple things to avoid these scenarios.

1. If the Zero leadership changes, the new leader would lease out a range of timestamps higher than the previous leader has seen. However, an older commit proposal stuck with the older leader can get forwarded to the new one. This can allow a commit to happen at an older timestamp, causing failure of transactional guarantees. We avoid this by disallowing Zero followers from forwarding requests to the leader, and rejecting those proposals.

// TODO: We should have a membership section, which explains how membership works and is transmitted to Alphas.

2. Every membership state update streamed from Zero requires a read-quorum (a check with Zero peers to find the latest Raft index update seen by the group). If the Zero is behind a partition, for example, it wouldn't be able to achieve this quorum and send out a membership update. Alphas expect an update periodically and if they don't hear from the Zero leader after a few cycles, they'd consider the Zero leader defunct, abolish the connection and retry to establish a connection with a (potentially different) healthy leader.

6 Consistency Model

Dgraph supports MVCC, Read Snapshots and Distributed ACID transactions. The transactions are cluster-wide across the universal dataset – not limited by any key level or server level restrictions. Transactions are also lockless. They don't block/wait on seeing pending writes by uncommitted transactions. They can all proceed concurrently and Zero would choose to commit or abort them depending on conflicts.

Considering the expense of tracking all the data read by a single graph query (could be millions of keys), Dgraph does not provide Serializable Snapshot Isolation. Instead, Dgraph provides Snapshot Isolation, tracking writes, which is a much more contained set than reads.

Dgraph hands out monotonically increasing timestamps (represented by T) for transactions (represented by Tx). Ergo, if any transaction Txi commits before Txj starts, then Tcommit(Txi) < Tstart(Txj). Any commit at Tcommit is guaranteed to be seen by a read at timestamp Tread by any client, if Tread > Tcommit. Thus, Dgraph reads are linearizable. Also, all reads are snapshots across the entire cluster, seeing all previously committed transactions in full.

As mentioned, Dgraph reads are linearizable. While this is great for correctness, it can cause performance issues when a lot of reads and writes are going on simultaneously. All reads are supposed to block until the Alpha has seen all the writes up until the read timestamp. In many cases, operators would opt for performance over achieving linearizability.
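As a sketch of the blocking rule described above (illustrative only, not Dgraph's code), an Alpha can hold a query until its locally applied MaxAssigned catches up to the read timestamp:

package main

import "sync"

// maxAssigned tracks the highest timestamp this Alpha has fully applied.
type maxAssigned struct {
	mu   sync.Mutex
	cond *sync.Cond
	ts   uint64
}

func newMaxAssigned() *maxAssigned {
	m := &maxAssigned{}
	m.cond = sync.NewCond(&m.mu)
	return m
}

// Advance is called as transaction status updates from Zero get applied.
func (m *maxAssigned) Advance(ts uint64) {
	m.mu.Lock()
	if ts > m.ts {
		m.ts = ts
		m.cond.Broadcast()
	}
	m.mu.Unlock()
}

// WaitFor blocks a query whose read timestamp is ahead of MaxAssigned,
// so that it observes every write at or below readTs.
func (m *maxAssigned) WaitFor(readTs uint64) {
	m.mu.Lock()
	for m.ts < readTs {
		m.cond.Wait()
	}
	m.mu.Unlock()
}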
Dgraph provides two options for speeding up reads:

1. A typical read-write transaction would allocate a new timestamp to the client. This would update MaxAssigned, which would then flow via the Zero leader to Alpha leaders and then get proposed. Until that happens, a read can't proceed. Read-only transactions would still require a read timestamp from Zero, but Zero would opportunistically hand out the same read timestamp to multiple callers, allowing Alpha to amortize the cost of reaching MaxAssigned across multiple queries.

2. Best-effort transactions are a variant of read-only transactions, which would use an Alpha's observed MaxAssigned timestamp as the read timestamp. Thus, the receiver Alpha does not have to block at all and can continue to process the query. This is the equivalent of the eventual consistency model typical in other databases. Ultimately, every Dgraph read is a snapshot over the entire distributed database and none of the reads would violate the snapshot guarantee. [1]

[1] Note however that a typical Dgraph query could hit multiple Alphas in various groups — some of these Alphas might not have reached the read timestamp (the initial Alpha's MaxAssigned timestamp) yet. In those cases, the query could still block until those Alphas catch up.

7 Replication

Most updates to Dgraph are done via Raft. Let's start with Alphas, which can push a lot of data through the system. All mutations and transaction updates are proposed via Raft and are made part of the Raft write-ahead logs. On a crash and restart, the Raft logs are replayed from the last snapshot to bring the state machine back up to the correct latest state. On the flip side, the longer the logs, the longer it takes for Alpha to replay them on a restart, causing a start delay. So, the logs must be trimmed by taking a snapshot, which indicates that the state up until that point has been persisted and does not need to be replayed on a restart.

As mentioned above, Alphas write mutations to the Raft WAL, but keep them in memory in a transaction cache. When a transaction is committed, the mutations are written to the state at the commit timestamp. This means that on a restart, all the pending transactions must be brought back to memory via the Raft WAL. This requires a calculation to pick the right Raft index to trim the logs at, which would keep all the pending transactions in their entirety in the logs.
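That calculation can be sketched as taking the smallest Raft index still needed by any pending transaction; the structure below is hypothetical, not Dgraph's actual implementation.

// snapshotIndex picks a Raft index to trim the write-ahead log at, such
// that every still-pending transaction keeps all of its mutations in the
// retained portion of the log. firstIndex maps a pending transaction's
// start ts to the Raft index of its first mutation; appliedIndex is the
// highest index applied to the state machine.
func snapshotIndex(firstIndex map[uint64]uint64, appliedIndex uint64) uint64 {
	snap := appliedIndex
	for _, first := range firstIndex {
		if first > 0 && first-1 < snap {
			snap = first - 1 // keep this transaction's entries in the log
		}
	}
	return snap
}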
One of the lessons we learnt while fixing Jepsen issues was that, to improve debuggability of a complex distributed system, the system should run like clockwork. In other words, once an event in one system has happened, events in other systems should almost be predictable. This guiding principle determined how we take snapshots.

The Raft paper allows leaders and followers to take snapshots independently of each other. Dgraph used to do that, but that brought unpredictability to the system and made debugging much harder. So, keeping with the hard-learnt lesson of the predictability principle, we changed it to make the leader calculate the snapshot index and propose this result. This allowed the leader and followers to all take a snapshot at the same index, at exactly the same time (if they're generally caught up). Furthermore, this group level snapshot event is then communicated to Zero to allow it to trim the conflict map by removing all entries below the snapshot timestamp. Following this chain of events in logs has improved debuggability of the system dramatically.

Dgraph only keeps metadata in Raft snapshots; the actual data is stored separately. Dgraph does not make a copy of that data during a snapshot. When a follower falls behind and needs a snapshot, it asks the leader for it and the leader would stream the snapshot from its state (Badger, just like Dgraph, supports MVCC and when doing a read at a certain timestamp, is operating upon a logical snapshot of the DB). In previous versions, the follower would wipe out its current state before accepting the updates from the leader. In newer versions, the leader can choose to send only the delta state update to the follower, which can decrease the data transmitted considerably.

8 High Availability and Scalability

Dgraph's architecture revolves around Raft groups for update log serialization and replication. In the CAP theorem, this follows CP, i.e. in a network partition, Dgraph would choose consistency over availability. However, the concepts of the CAP theorem should not be confused with high availability, which is determined by how many instances can be lost without the service getting affected.

In a three-node group, Dgraph can lose one instance per group without causing any measurable impact on the functionality of the database. However, losing two instances from the same group would cause Dgraph to block, considering all updates go through Raft. In a five-node group, the number of instances that can be lost without affecting functionality is two. We do not recommend running more than five replicas per group.

Given the central managerial role of Dgraph Zero, one might assume that Zero would be the single point of failure. However, that's not the case. In the scenario where a Zero follower dies, nothing really changes. If the Zero leader dies, one of the Zero followers would become the leader, renew its timestamp and uid assignment lease, pick up the transaction status logs (stored via Raft) and start accepting requests from Alphas. The only things that could be lost during this transition are transactions which were trying to commit with the lost Zero. They might error out, but could be retried. The same goes for Alphas. All Alpha followers have the same information as the Alpha leader and any of the members of the group can be lost without losing any state.
Dgraph can support as many groups as can be represented by a 32-bit integer (even that is an artificial limit). Each group can have one, three, or five (potentially more, but not recommended) replicas. The number of uids (graph nodes) that can be present in the system is limited by a 64-bit unsigned integer; the same goes for transaction timestamps. All of these are very generous limits and not a cause of concern for scalability.

use trigram indexing, geo-spatial queries use S2-cell based geo indexing, and so on. As described in the section above, indexing keys encode predicate and token, instead of a predicate and uid. So, the mechanism to fill up the matrix is the same as in any other task query. Only this time, we use a list of tokens instead of a list of Uids as the query set.
caching — such a cache would be difficult to maintain in an MVCC environment where each read can have different results, based on its timestamp.

Sorted integer encoding and intersection is a hotly researched topic and there is a lot of room for optimization here in terms of performance. As mentioned earlier, work is underway in experimenting with a switch to Roaring Bitmaps.

We also plan to work on a query optimizer, which can better determine the right sequence in which to execute a query. So far, the simple nature of GraphQL has let the operators manually optimize their queries — but surely Dgraph can do a better job knowing the state of the data.

Future work here is to allow writes during the shard move, which, depending upon the size of the shard, can take some time.

TODO: Add a conclusion.

11 Acknowledgments

Dgraph wouldn't have been possible without the tireless contributions of its core dev team and extended community. This work also wouldn't have been possible without funding from our investors. A full list of contributors is present here:

github.com/dgraph-io/dgraph/graphs/contributors

Dgraph is open source software, available at

https://github.com/dgraph-io/dgraph

More information about Dgraph is available at

https://dgraph.io