Chapter 2 discusses aggregate data models, which provide a way to perceive and manipulate data in databases, distinguishing between data models and storage models. It highlights the shift from relational models to NoSQL models, specifically focusing on aggregate orientation, which allows for more complex data structures and easier manipulation of related data as a unit. The chapter also contrasts key-value and document data models, emphasizing their aggregate-oriented nature and the differences in how data is accessed and structured.
Chapter 2
——
Aggregate Data Models
A data model is the model through which we perceive and manipulate our data.
For people using a database, the data model describes how we interact with the
data in the database. This is distinct from a storage model, which describes how
the database stores and manipulates the data internally. In an ideal world, we
should be ignorant of the storage model, but in practice we need at least some
inkling of it—primarily to achieve decent performance.
In conversation, the term “data model” often means the model of the specific
data in an application. A developer might point to an entity-relationship diagram
of their database and refer to that as their data model containing customers, or-
ders, products, and the like. However, in this book we'll mostly be using “data
model” to refer to the model by which the database organizes data—what might
be more formally called a metamodel.
The dominant data model of the last couple of decades is the relational data
model, which is best visualized as a set of tables, rather like a page of a spread-
sheet. Each table has rows, with each row representing some entity of interest.
We describe this entity through columns, each having a single value. A column
may refer to another row in the same or different table, which constitutes a rela-
tionship between those entities. (We’re using informal but common terminology
when we speak of tables and rows; the more formal terms would be relations
and tuples.)
One of the most obvious shifts with NoSQL is a move away from the relational
model. Each NoSQL solution has a different model that it uses, which we put
into four categories widely used in the NoSQL ecosystem: key-value, document,
column-family, and graph. Of these, the first three share a common characteristic
of their data models which we will call aggregate orientation. In this chapter
we'll explain what we mean by aggregate orientation and what it means for data
models.
2.1 Aggregates
The relational model takes the information that we want to store and divides it
into tuples (rows). A tuple is a limited data structure: It captures a set of values,
so you cannot nest one tuple within another to get nested records, nor can you
put a list of values or tuples within another. This simplicity underpins the
relational model; it allows us to think of all operations as operating on and
returning tuples.
Aggregate orientation takes a different approach. It recognizes that often, you
want to operate on data in units that have a more complex structure than a set
of tuples. It can be handy to think in terms of a complex record that allows lists
and other record structures to be nested inside it. As we'll see, key-value, docu-
ment, and column-family databases all make use of this more complex record.
However, there is no common term for this complex record; in this book we use
the term “aggregate.”
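To make the contrast concrete, here is a minimal Python sketch, not taken from any particular database. The class and function names are our own illustrative assumptions; the values (order 99, product 27, price 32.45) come from the sample data used later in this chapter. It shows the same order held as flat tuples and as one nested record.

```python
from typing import NamedTuple


# Relational style: each tuple holds a flat set of single values.
class OrderRow(NamedTuple):
    id: int
    customer_id: int


class OrderItemRow(NamedTuple):
    order_id: int
    product_id: int
    price: float


order_rows = [OrderRow(id=99, customer_id=1)]
order_item_rows = [OrderItemRow(order_id=99, product_id=27, price=32.45)]

# Aggregate style: one record nests the list of items inside the order.
order_aggregate = {
    "id": 99,
    "customerId": 1,
    "orderItems": [{"productId": 27, "price": 32.45}],
}


def items_for_order_relational(order_id: int) -> list[OrderItemRow]:
    """Reassemble an order's items by matching the foreign key."""
    return [row for row in order_item_rows if row.order_id == order_id]


def items_for_order_aggregate(order: dict) -> list[dict]:
    """The items are already nested inside the aggregate."""
    return order["orderItems"]
```

In the flat version the items live apart from the order and are matched up by key; in the nested version they travel with the order as one unit.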
Aggregate is a term that comes from Domain-Driven Design [Evans]. In
Domain-Driven Design, an aggregate is a collection of related objects that we
wish to treat as a unit. In particular, it is a unit for data manipulation and
management of consistency. Typically, we like to update aggregates with atomic
operations and communicate with our data storage in terms of aggregates. This
definition matches really well with how key-value, document, and column-
family databases work. Dealing in aggregates makes it much easier for these
databases to handle operating on a cluster, since the aggregate makes a natural
unit for replication and sharding. Aggregates are also often easier for application
programmers to work with, since they often manipulate data through aggregate
structures.
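The idea of the aggregate as a unit for sharding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the key format "order:99", and the simple hash-modulo placement are all hypothetical, and real databases typically use more elaborate placement schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster nodes


def node_for(aggregate_key: str) -> str:
    """Pick a node from a stable hash of the aggregate's key."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# One in-memory dictionary per node stands in for that node's storage.
cluster = {node: {} for node in NODES}


def put(aggregate_key: str, aggregate: dict) -> None:
    """Store the whole aggregate on the single node chosen by its key."""
    cluster[node_for(aggregate_key)][aggregate_key] = aggregate


def get(aggregate_key: str) -> dict:
    """Fetch the whole aggregate with one lookup on one node."""
    return cluster[node_for(aggregate_key)][aggregate_key]


put("order:99", {"id": 99, "orderItems": [{"productId": 27, "price": 32.45}]})
```

Because the order and its items form one aggregate, they always land on the same node, and reading the order never requires gathering pieces from several nodes.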
2.1.1 Example of Relations and Aggregates
At this point, an example may help explain what we're talking about. Let's assume
we have to build an e-commerce website; we are going to be selling items directly
to customers over the web, and we will have to store information about users,
our product catalog, orders, shipping addresses, billing addresses, and payment
data. We can use this scenario to model the data using a relational data store as
well as NoSQL data stores and talk about their pros and cons. For a relational
database, we might start with a data model shown in Figure 2.1.
Figure 2.2 presents some sample data for this model.
As we're good relational soldiers, everything is properly normalized, so that
no data is repeated in multiple tables. We also have referential integrity. A
realistic order system would naturally be more involved than this, but this is the
benefit of the rarefied air of a book.
Now let's see how this model might look when we think in more
aggregate-oriented terms (Figure 2.3).
Figure 2.1 Data model oriented around a relational database (using UML notation
[Fowler UML])
Figure 2.2 Typical data using RDBMS data model
Figure 2.3 An aggregate data model
Again, we have some sample data, which we'll show in JSON format as that’s
a common representation for data in NoSQL land.
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
In this model, we have two main aggregates: customer and order. We've used
the black-diamond composition marker in UML to show how data fits into the
aggregation structure. The customer contains a list of billing addresses; the order
contains a list of order items, a shipping address, and payments. The payment
itself contains a billing address for that payment.
A single logical address record appears three times in the example data, but
instead of using IDs it's treated as a value and copied each time. This fits the
domain where we would not want the shipping address, nor the payment's billing
address, to change. In a relational database, we would ensure that the address
rows aren't updated for this case, making a new row instead. With aggregates,
we can copy the whole address structure into the aggregate as we need to.
The link between the customer and the order isn't within either aggregate; it's
a relationship between aggregates. Similarly, the link from an order item would
cross into a separate aggregate structure for products, which we haven't gone
into. We've shown the product name as part of the order item here. This kind
of denormalization is similar to the tradeoffs with relational databases, but is
more common with aggregates because we want to minimize the number of
aggregates we access during a data interaction.
The important thing to notice here isn't the particular way we've drawn the
aggregate boundary so much as the fact that you have to think about accessing
that data—and make that part of your thinking when developing the application
data model. Indeed we could draw our aggregate boundaries differently, putting
all the orders for a customer into the customer aggregate (Figure 2.4).
Using the above data model, an example Customer and Order would look
like this:
// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
Figure 2.4 Embed all the objects for customer and the customer's orders
Like most things in modeling, there's no universal answer for how to draw your
aggregate boundaries. It depends entirely on how you tend to manipulate your
data. If you tend to access a customer together with all of that customer's
orders at once, then you would prefer a single aggregate. However, if you tend
to focus on accessing a single order at a time, then you should prefer having
separate aggregates for each order. Naturally, this is very context-specific; some
applications will prefer one or the other, even within a single system, which is
exactly why many people prefer aggregate ignorance.
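A small Python sketch can show how the choice of boundary changes the number of aggregates we touch. Everything here is a hypothetical illustration: CountingStore is an in-memory stand-in for a database, the key format is our own, and the orderIds field is an assumption added so that the separate-aggregate version can find a customer's orders.

```python
class CountingStore:
    """Hypothetical in-memory store that counts aggregate reads."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.data[key]


order_99 = {
    "id": 99,
    "customerId": 1,
    "orderItems": [{"productId": 27, "price": 32.45}],
}

# Boundary A: customer and order are separate aggregates.
separate = CountingStore({
    "customer:1": {"id": 1, "name": "Martin", "orderIds": [99]},
    "order:99": order_99,
})

# Boundary B: the orders are embedded in the customer aggregate.
embedded = CountingStore({
    "customer:1": {"id": 1, "name": "Martin", "orders": [order_99]},
})


def customer_with_orders_separate(store, customer_id):
    """One read for the customer plus one read per order."""
    customer = store.get(f"customer:{customer_id}")
    orders = [store.get(f"order:{oid}") for oid in customer["orderIds"]]
    return customer, orders


def customer_with_orders_embedded(store, customer_id):
    """A single read returns the customer and every order."""
    customer = store.get(f"customer:{customer_id}")
    return customer, customer["orders"]


customer_with_orders_separate(separate, 1)
customer_with_orders_embedded(embedded, 1)
```

With one order, the separate layout needs two reads and the embedded layout needs one. The flip side is that fetching a single order from the embedded layout means loading the whole customer, which is why the access pattern should drive the boundary.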
2.1.2 Consequences of Aggregate Orientation
While the relational mapping captures the various data elements and their rela-
tionships reasonably well, it does so without any notion of an aggregate entity.
In our domain language, we might say that an order consists of order items, a
shipping address, and a payment. This can be expressed in the relational model
in terms of foreign key relationships—but there is nothing to distinguish relation-
ships that represent aggregations from those that don’t. As a result, the data-
base can’t use a knowledge of aggregate structure to help it store and distribute
the data.
Various data modeling techniques have provided ways of marking aggregate
or composite structures. The problem, however, is that modelers rarely provide
any semantics for what makes an aggregate relationship different from any other;
where there are semantics, they vary. When working with aggregate-oriented
databases, we have a clearer semantics to consider by focusing on the unit of
interaction with the data storage. It is, however, not a logical data property: It's
all about how the data is being used by applications—a concern that is often
outside the bounds of data modeling.
Relational databases have no concept of aggregate within their data model, so
we call them aggregate-ignorant. In the NoSQL world, graph databases are
also aggregate-ignorant. Being aggregate-ignorant is not a bad thing. It’s often
difficult to draw aggregate boundaries well, particularly if the same data is used
in many different contexts. An order makes a good aggregate when a customer
is making and reviewing orders, and when the retailer is processing orders.
However, if a retailer wants to analyze its product sales over the last few months,
then an order aggregate becomes a trouble. To get to product sales history, you'll
have to dig into every aggregate in the database. So an aggregate structure may
help with some data interactions but be an obstacle for others. An aggregate-
ignorant model allows you to easily look at the data in different ways, so it is a
better choice when you don't have a primary structure for manipulating your data.
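The product sales case can be sketched as follows. This is a hypothetical in-memory illustration: order 99 comes from the sample data above, while order 100 and customer 2 are invented here only so that the scan has more than one aggregate to visit.

```python
from collections import defaultdict

orders = {
    99: {"id": 99, "customerId": 1, "orderItems": [
        {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"},
    ]},
    # Hypothetical second order, added only to make the scan visible.
    100: {"id": 100, "customerId": 2, "orderItems": [
        {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"},
    ]},
}


def sales_by_product(order_store: dict) -> dict:
    """Total revenue per product; this must visit every order aggregate."""
    totals = defaultdict(float)
    for order in order_store.values():
        for item in order["orderItems"]:
            totals[item["productId"]] += item["price"]
    return dict(totals)
```

Nothing in the order aggregates lets us jump straight to a product, so the cost of this question grows with the total number of orders rather than with the number of orders that contain the product.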
The clinching reason for aggregate orientation is that it helps greatly with
running on a cluster, which as you'll remember is the killer argument for the rise
of NoSQL. If we're running on a cluster, we need to minimize how many nodes
we need to query when we are gathering data. By explicitly including
aggregates, we give the database important information about which bits of data
will be manipulated together, and thus should live on the same node.
Aggregates have an important consequence for transactions. Relational
databases allow you to manipulate any combination of rows from any tables in
a single transaction. Such transactions are called ACID transactions: Atomic,
Consistent, Isolated, and Durable. ACID is a rather contrived acronym; the real
point is the atomicity: Many rows spanning many tables are updated as a
single operation. This operation either succeeds or fails in its entirety, and
concurrent operations are isolated from each other so they cannot see a partial update.
It's often said that NoSQL databases don't support ACID transactions and
thus sacrifice consistency. This is a rather sweeping simplification. In general,
it's true that aggregate-oriented databases don't have ACID transactions that
span multiple aggregates. Instead, they support atomic manipulation of a single
aggregate at a time. This means that if we need to manipulate multiple aggregates
in an atomic way, we have to manage that ourselves in the application code.
In practice, we find that most of the time we are able to keep our atomicity needs
to within a single aggregate; indeed, that's part of the consideration for deciding
how to divide up our data into aggregates. We should also remember that graph
and other aggregate-ignorant databases usually do support ACID transactions
similar to relational databases. Above all, the topic of consistency is much more
involved than whether a database is ACID or not, as we'll explore in Chapter 5.
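Atomic manipulation of a single aggregate can be illustrated with a short Python sketch. AggregateStore is a hypothetical in-memory class, not the API of any real database; it assumes that copying the aggregate, changing the copy, and swapping it in under a lock is an acceptable stand-in for what a real store does internally.

```python
import copy
import threading


class AggregateStore:
    """Hypothetical in-memory store: atomic per aggregate, nothing more."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data[key])

    def update(self, key, change):
        """Apply `change` to a copy and swap it in only if it succeeds."""
        with self._lock:
            draft = copy.deepcopy(self._data[key])
            change(draft)  # an exception here leaves the stored aggregate untouched
            self._data[key] = draft


store = AggregateStore()
store.put("order:99", {"id": 99, "orderItems": [], "orderPayment": []})


def add_item_and_payment(order):
    """Two changes inside one aggregate, applied as a single unit."""
    order["orderItems"].append({"productId": 27, "price": 32.45})
    order["orderPayment"].append({"ccinfo": "1000-1000-1000-1000"})


store.update("order:99", add_item_and_payment)


def failing_change(order):
    order["orderItems"].clear()
    raise ValueError("simulated failure part-way through the change")


try:
    store.update("order:99", failing_change)
except ValueError:
    pass  # the half-applied change was discarded
```

The item and the payment are added together or not at all because they sit in the same aggregate. A change that had to span two keys, say an order and a separate product aggregate, would need two update calls, and keeping those consistent would fall to the application code.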
2.2 Key-Value and Document Data Models
We said earlier on that key-value and document databases were strongly
aggregate-oriented. What we meant by this was that we think of these databases
as primarily constructed through aggregates. Both of these types of databases
consist of lots of aggregates with each aggregate having a key or ID that's used
to get at the data.
The two models differ in that in a key-value database, the aggregate is opaque
to the database—just some big blob of mostly meaningless bits. In contrast, a
document database is able to see a structure in the aggregate. The advantage of
opacity is that we can store whatever we like in the aggregate. The database may
impose some general size limit, but other than that we have complete freedom.
A document database imposes limits on what we can place in it, defining allowable
structures and types. In return, however, we get more flexibility in access.
With a key-value store, we can only access an aggregate by lookup based on
its key. With a document database, we can submit queries to the database based
on the fields in the aggregate, we can retrieve part of the aggregate rather than
the whole thing, and the database can create indexes based on the contents of the
aggregate.
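The difference in access can be sketched with two tiny in-memory classes. Both KeyValueStore and DocumentStore are hypothetical illustrations, and their method names (put, get, find) are our own assumptions rather than the API of any real product; indexing is left out to keep the sketch short.

```python
import json


class KeyValueStore:
    """Hypothetical key-value store: values are opaque bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, key: str, blob: bytes) -> None:
        self._blobs[key] = blob

    def get(self, key: str) -> bytes:
        # Lookup by key is the only way in; the store never parses the blob.
        return self._blobs[key]


class DocumentStore:
    """Hypothetical document store: it can look inside each aggregate."""

    def __init__(self):
        self._docs = {}

    def put(self, key: str, doc: dict) -> None:
        self._docs[key] = doc

    def get(self, key: str, fields=None) -> dict:
        """Return the whole document, or only the requested fields."""
        doc = self._docs[key]
        if fields is None:
            return doc
        return {name: doc[name] for name in fields}

    def find(self, field: str, value) -> list:
        """Query on a field inside the aggregate."""
        return [doc for doc in self._docs.values() if doc.get(field) == value]


order = {"id": 99, "customerId": 1, "shippingAddress": [{"city": "Chicago"}]}

kv = KeyValueStore()
kv.put("order:99", json.dumps(order).encode("utf-8"))

docs = DocumentStore()
docs.put("order:99", order)
```

With the key-value store, the application has to fetch the blob by key and decode it itself. With the document store, calls such as docs.find("customerId", 1) or docs.get("order:99", fields=["id"]) are possible because the store understands the structure.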
In practice, the line between key-value and document gets a bit blurry. People
often put an ID field in a document database to do a key-value style lookup.
Databases classified as key-value databases may allow you structures for data
beyond just an opaque aggregate. For example, Riak allows you to add metadata
to aggregates for indexing and interaggregate links; Redis allows you to break
down the aggregate into lists or sets. You can support querying by integrating
search tools such as Solr. As an example, Riak includes a search facility that uses
Solr-like searching on any aggregates that are stored as JSON or XML structures.
Despite this blurriness, the general distinction still holds. With key-value
databases, we expect to mostly look up aggregates using a key. With document
databases, we mostly expect to submit some form of query based on the
internal structure of the document; this might be a key, but it's more likely to
be something else.
2.3 Column-Family Stores
One of the early and influential NoSQL databases was Google's BigTable
[Chang etc.]. Its name conjured up a tabular structure which it realized with
sparse columns and no schema. As you'll soon see, it doesn't help to think of this
structure as a table; rather, it is a two-level map. But, however you think about
the structure, it has been a model that influenced later databases such as HBase
and Cassandra.
These databases with a bigtable-style data model are often referred to as column
stores, but that name has been around for a while to describe a different animal.
Pre-NoSQL column stores, such as C-Store [C-Store], were happy with SQL and
the relational model. The thing that made them different was the way in which
they physically stored data. Most databases have a row as a unit of storage which,
in particular, helps write performance. However, there are many scenarios where
writes are rare, but you often need to read a few columns of many rows at once.
In this situation, it's better to store groups of columns for all rows as the basic
storage unit, which is why these databases are called column stores.
Bigtable and its offspring follow this notion of storing groups of columns
(column families) together, but part company with C-Store and friends by
abandoning the relational model and SQL. In this book, we refer to this class of
databases as column-family databases.
Perhaps the best way to think of the column-family model is as a two-level
aggregate structure. As with key-value stores, the first key is often described as
a row identifier, picking up the aggregate of interest. The difference with column-
family structures is that this row aggregate is itself formed of a map of more
detailed values. These second-level values are referred to as columns. As well as
accessing the row as a whole, operations also allow picking out a particular
column, so to get a particular customer's name from Figure 2.5 you could do
something like get('1234', 'name').
Column-family databases organize their columns into column families.
Each column has to be part of a single column family, and the column acts as
unit for access, with the assumption that data for a particular column family will
be usually accessed together.
This also gives you a couple of ways to think about how the data is structured.
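As a rough illustration, the two-level structure can be modeled with nested Python dictionaries. This is a sketch under assumptions: the family names "profile" and "orders", the column values, and the helper get_family are hypothetical; only the row key '1234' and the get('1234', 'name') call shape come from the text.

```python
# Row key -> column family -> column -> value.
rows = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago"},
        "orders": {"order:99": "placeholder order data"},
    }
}


def get(row_key: str, column: str):
    """Return one column's value; each column lives in exactly one family."""
    for family in rows[row_key].values():
        if column in family:
            return family[column]
    raise KeyError(column)


def get_family(row_key: str, family: str) -> dict:
    """Return all columns of one family, the group expected to be read together."""
    return rows[row_key][family]
```

Here get("1234", "name") picks out a single column, while get_family("1234", "profile") fetches the columns that are assumed to be accessed together.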