
NoSQL Module 1 Part1

The document discusses the emergence and characteristics of NoSQL databases, highlighting their shift away from the relational model to support large volumes of data and cluster efficiency. It introduces the concept of aggregates, which are collections of related objects treated as a unit for data manipulation, and explains the differences between key-value, document, and column-family data models. The document also emphasizes the benefits of aggregate-oriented databases in managing data storage and transactions compared to traditional relational databases.

Uploaded by

athulatk6

Module 1

NoSQL
The Emergence of NoSQL

It’s a wonderful irony that the term “NoSQL” first made its appearance in the late 90s as the name of
an open-source relational database [Strozzi NoSQL].

Led by Carlo Strozzi, this database stores its tables as ASCII files, each tuple represented by a line
with fields separated by tabs.

The name comes from the fact that the database doesn’t use SQL as a query language.

Instead, the database is manipulated through shell scripts that can be combined into the usual UNIX
pipelines.

Key Points

• Relational databases have been a successful technology for twenty years, providing persistence,
concurrency control, and an integration mechanism.

• Application developers have been frustrated with the impedance mismatch between the relational
model and the in-memory data structures.

• There is a movement away from using databases as integration points towards encapsulating
databases within applications and integrating through services.

• The vital factor for a change in data storage was the need to support large volumes of data by
running on clusters. Relational databases are not designed to run efficiently on clusters.

• NoSQL is an accidental neologism. There is no prescriptive definition; all you can make is an
observation of common characteristics.

• The common characteristics of NoSQL databases are

• Not using the relational model

• Running well on clusters

• Open-source

• Built for 21st-century web estates

• Schemaless

• The most important result of the rise of NoSQL is polyglot persistence: the practice of using
multiple data storage technologies, chosen according to how the data is used by individual
applications or by components of a single application.
Yojana Kiran Kumar,Asst. Professor,Dept of BVOC ,SDM

Aggregate Data Models

A data model is the model through which we perceive and manipulate our data.

For people using a database, the data model describes how we interact with the data in the
database.

This is distinct from a storage model, which describes how the database stores and manipulates the
data internally.

In conversation, the term “data model” often means the model of the specific data in an application.

A developer might point to an entity-relationship diagram of their database and refer to that as their
data model containing customers, orders, products, and the like.

In these notes, however, “data model” mostly refers to the model by which the database organizes
data, which is more formally called a metamodel.

The dominant data model of the last couple of decades is the relational data model, which is best
visualized as a set of tables, rather like a page of a spreadsheet. Each table has rows, with each row
representing some entity of interest.

We describe this entity through columns, each having a single value. A column may refer to another
row in the same or different table, which constitutes a relationship between those entities.

One of the most obvious shifts with NoSQL is a move away from the relational model.
Each NoSQL solution has a different model that it uses, which we put into four categories widely
used in the NoSQL ecosystem:

key-value,

document,

column-family, and

graph.

Each database has its own query language.


Of these, the first three share a common characteristic of their data models which we will call
aggregate orientation.

Aggregates

The relational model takes the information that we want to store and divides it into tuples (rows).

A tuple is a limited data structure: It captures a set of values, so you cannot nest one tuple within
another to get nested records, nor can you put a list of values or tuples within another.

This simplicity underpins the relational model—it allows us to think of all operations as operating on
and returning tuples.

Aggregate orientation takes a different approach.

It recognizes that often you want to operate on data in units that have a more complex structure
than a set of tuples. It can be handy to think in terms of a complex record that allows lists and
other record structures to be nested inside it.

As we’ll see, key-value, document, and column-family databases all make use of this more complex
record.

Aggregate is a term that comes from Domain-Driven Design [Evans]. In Domain-Driven Design, an
aggregate is a collection of related objects that we wish to treat as a unit.
In particular, it is a unit for data manipulation and management of consistency.

Typically, we like to update aggregates with atomic operations and communicate with our data
storage in terms of aggregates.

This definition matches really well with how key-value, document, and column-family databases
work.

Dealing in aggregates makes it much easier for these databases to handle operating on a cluster,
since the aggregate makes a natural unit for replication and sharding.

Aggregates are also often easier for application programmers to work with, since they often
manipulate data through aggregate structures.

Example of Relations and Aggregates

Let’s assume we have to build an e-commerce website; we are going to be selling items directly to
customers over the web, and we will have to store information about users, our product catalog,
orders, shipping addresses, billing addresses, and payment data. We can use this scenario to model
the data using a relational data store as well as NoSQL data stores and talk about their pros and cons.

As we’re good relational soldiers, everything is properly normalized, so that no data is repeated in
multiple tables. We also have referential integrity. A realistic order system would naturally be more
involved than this, but this is the benefit of the rarefied air of a book.
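
The normalized layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names, and sample values below are assumptions, not the schema from the original figure.

```python
import sqlite3


def build_db():
    """Create a tiny normalized order schema in memory (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id)
        );
        CREATE TABLE order_item (
            order_id   INTEGER NOT NULL REFERENCES orders(id),
            product_id INTEGER NOT NULL REFERENCES product(id),
            price      REAL NOT NULL
        );
        INSERT INTO customer   VALUES (1, 'Martin');
        INSERT INTO product    VALUES (27, 'NoSQL Distilled');
        INSERT INTO orders     VALUES (99, 1);
        INSERT INTO order_item VALUES (99, 27, 32.45);
    """)
    return conn


def order_lines(conn, order_id):
    """Reassemble one order by joining the rows that normalization split apart."""
    return conn.execute(
        """
        SELECT c.name, p.name, i.price
        FROM orders o
        JOIN customer   c ON c.id = o.customer_id
        JOIN order_item i ON i.order_id = o.id
        JOIN product    p ON p.id = i.product_id
        WHERE o.id = ?
        """,
        (order_id,),
    ).fetchall()
```

Each name is stored once and referenced by ID elsewhere, so reading a whole order means joining several tables, and the foreign keys reject rows that point at a missing customer or product.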

The same data can also be modeled in aggregate-oriented terms, with sample data shown in JSON,
a common representation for data in NoSQL land.

In this model, we have two main aggregates: customer and order. We’ve used the black-diamond
composition marker in UML to show how data fits into the aggregation structure.

The customer contains a list of billing addresses; the order contains a list of order items, a shipping
address, and payments. The payment itself contains a billing address for that payment.

A single logical address record appears three times in the example data, but instead of using IDs it’s
treated as a value and copied each time.

This fits the domain where we would not want the shipping address, nor the payment’s billing
address, to change. In a relational database, we would ensure that the address rows aren’t updated
for this case, making a new row instead.

With aggregates, we can copy the whole address structure into the aggregate as we need to. The
link between the customer and the order isn’t within either aggregate—it’s a relationship between
aggregates. Similarly, the link from an order item would cross into a separate aggregate structure for
products, which we haven’t gone into.

We’ve shown the product name as part of the order item here—this kind of denormalization is
similar to the tradeoffs with relational databases, but is more common with aggregates because we
want to minimize the number of aggregates we access during a data interaction.
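
The two aggregates described above might look like the following nested structures. This is a sketch using plain Python dictionaries: the field names and sample values are assumptions for illustration, and a real store would typically hold them as JSON.

```python
import copy
import json

# One logical address, copied as a value wherever it is needed.
address = {"city": "Chicago"}

customer = {
    "id": 1,
    "name": "Martin",
    "billingAddress": [copy.deepcopy(address)],
}

order = {
    "id": 99,
    "customerId": 1,  # link between aggregates: an ID, not nested data
    "orderItems": [
        # productName is denormalized into the item to avoid a second lookup
        {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"},
    ],
    "shippingAddress": copy.deepcopy(address),
    "orderPayment": [
        {"txnId": "txn-1", "billingAddress": copy.deepcopy(address)},
    ],
}


def move_customer(customer_aggregate, city):
    """Change the customer's first billing address in place."""
    customer_aggregate["billingAddress"][0]["city"] = city


def to_json(aggregate):
    """Serialize an aggregate the way it might be sent to a store."""
    return json.dumps(aggregate, sort_keys=True)
```

Because the address is copied rather than shared by ID, calling `move_customer(customer, "Boston")` leaves the shipping address recorded on the existing order untouched.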

The important thing to notice here isn’t the particular way we’ve drawn the aggregate boundary so
much as the fact that you have to think about accessing that data—and make that part of your
thinking when developing the application data model. Indeed we could draw our aggregate
boundaries differently, putting all the orders for a customer into the customer aggregate.

Consequences of Aggregate Orientation


Relational databases have no concept of aggregate within their data model, so we call them
aggregate-ignorant.

In the NoSQL world, graph databases are also aggregate-ignorant. Being aggregate-ignorant is not a
bad thing.

It’s often difficult to draw aggregate boundaries well, particularly if the same data is used in many
different contexts.

An order makes a good aggregate when a customer is making and reviewing orders, and when the
retailer is processing orders. However, if a retailer wants to analyze its product sales over the last
few months, then an order aggregate becomes a hindrance. To get to product sales history, you’ll
have to dig into every aggregate in the database.

So an aggregate structure may help with some data interactions but be an obstacle for others. An
aggregate-ignorant model allows you to easily look at the data in different ways, so it is a better
choice when you don’t have a primary structure for manipulating your data.
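
The cost of the sales-analysis case can be sketched as follows. The order layout (an `orderItems` list with `productName` and `price`) is an assumption for illustration; the point is only that the loop has to visit every order aggregate.

```python
from collections import defaultdict


def product_sales(orders):
    """Total sales per product, computed by scanning every order aggregate."""
    totals = defaultdict(float)
    for order in orders:                  # no shortcut: each aggregate is opened
        for item in order["orderItems"]:  # the data we want is nested inside
            totals[item["productName"]] += item["price"]
    return dict(totals)
```

Order aggregates are keyed by order, not by product, so the only way to reach the product figures in this sketch is a full scan.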

The clinching reason for aggregate orientation is that it helps greatly with running on a cluster,
which as you’ll remember is the killer argument for the rise of NoSQL.

If we’re running on a cluster, we need to minimize how many nodes we need to query when we are
gathering data. By explicitly including aggregates, we give the database important information about
which bits of data will be manipulated together, and thus should live on the same node.
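
One way to picture this is to place each aggregate on a node according to a hash of its key. The sketch below uses simple modulo hashing, which is an assumption made for brevity; real databases use their own placement schemes, and the function names here are hypothetical.

```python
import hashlib


def node_for(aggregate_key: str, num_nodes: int) -> int:
    """Pick a node for an aggregate from a stable hash of its key."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def place(aggregates: dict, num_nodes: int) -> dict:
    """Distribute whole aggregates across nodes; an aggregate is never split."""
    nodes = {n: {} for n in range(num_nodes)}
    for key, aggregate in aggregates.items():
        nodes[node_for(key, num_nodes)][key] = aggregate
    return nodes
```

Since the unit of placement is the aggregate, everything inside one order lands on the same node, and reading that order needs a single node lookup.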

Aggregates have an important consequence for transactions. Relational databases allow you to
manipulate any combination of rows from any tables in a single transaction. Such transactions are
called ACID transactions: Atomic, Consistent, Isolated, and Durable.

ACID is a rather contrived acronym; the real point is the atomicity: Many rows spanning many tables
are updated as a single operation.

This operation either succeeds or fails in its entirety, and concurrent operations are isolated from
each other so they cannot see a partial update. It’s often said that NoSQL databases don’t support
ACID transactions and thus sacrifice consistency.

This is a rather sweeping simplification. In general, it’s true that aggregate-oriented databases don’t
have ACID transactions that span multiple aggregates. Instead, they support atomic manipulation of
a single aggregate at a time. This means that if we need to manipulate multiple aggregates in an
atomic way, we have to manage that ourselves in the application code.

In practice, we find that most of the time we are able to keep our atomicity needs to within a single
aggregate; indeed, that’s part of the consideration for deciding how to divide up our data into
aggregates. We should also remember that graph and other aggregate-ignorant databases usually
do support ACID transactions similar to relational databases.
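
The difference between single-aggregate atomicity and application-managed multi-aggregate updates can be sketched with a toy in-memory store. The class and function names are hypothetical, and the single lock stands in for whatever mechanism a real database uses.

```python
import copy
import threading


class AggregateStore:
    """Toy store: each update touches exactly one aggregate, all or nothing."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data[key])

    def update(self, key, change):
        """Apply change(aggregate) to a draft; store it only if change succeeds."""
        with self._lock:
            draft = copy.deepcopy(self._data[key])
            change(draft)            # may raise; the stored copy stays untouched
            self._data[key] = draft


def update_both(store, key_a, change_a, key_b, change_b):
    """Application-level coordination across two aggregates (not isolated)."""
    before_a = store.get(key_a)
    store.update(key_a, change_a)
    try:
        store.update(key_b, change_b)
    except Exception:
        store.put(key_a, before_a)   # compensate by hand, then report failure
        raise
```

In `update_both`, other readers can observe the first aggregate changed before the second one is, which is exactly the gap that the application code has to manage itself.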

Key-Value and Document Data Models

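In a key-value database, the aggregate is opaque to the database: just a blob of bits whose
structure it does not interpret. A document database, in contrast, can see structure in the
aggregate. The advantage of opacity is that we can store whatever we like in the aggregate. The
database may impose some general size limit, but other than that we have complete freedom.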

A document database imposes limits on what we can place in it, defining allowable structures and
types. In return, however, we get more flexibility in access. With a key-value store, we can only
access an aggregate by lookup based on its key. With a document database, we can submit queries
to the database based on the fields in the aggregate, we can retrieve part of the aggregate rather
than the whole thing, and the database can create indexes based on the contents of the aggregate.
In practice, the line between key-value and document gets a bit blurry.

People often put an ID field in a document database to do a key-value style lookup. Databases
classified as key-value databases may offer structures for data beyond just an opaque
aggregate. For example, Riak allows you to add metadata to aggregates for indexing and
interaggregate links, and Redis allows you to break down the aggregate into lists or sets. You can
support querying by integrating search tools such as Solr. As an example, Riak includes a search facility that
uses Solr-like searching on any aggregates that are stored as JSON or XML structures. Despite this
blurriness, the general distinction still holds. With key-value databases, we expect to mostly look up
aggregates using a key. With document databases, we mostly expect to submit some form of query
based on the internal structure of the document; this might be a key, but it’s more likely to be
something else.
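
The general distinction can be sketched with two toy in-memory classes. They are hypothetical and do not mirror the API of Riak, Redis, or any document database; they only contrast lookup by key with queries on fields.

```python
import json


class KeyValueStore:
    """The value is an opaque blob: the store can only fetch it by key."""

    def __init__(self):
        self._blobs = {}

    def put(self, key: str, blob: bytes):
        self._blobs[key] = blob

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class DocumentStore:
    """The store sees the document's fields, so it can query and project them."""

    def __init__(self):
        self._docs = {}

    def put(self, doc: dict):
        self._docs[doc["id"]] = doc

    def get(self, doc_id):
        return self._docs[doc_id]      # key-value style lookup via the ID field

    def find(self, field, value, project=None):
        hits = [d for d in self._docs.values() if d.get(field) == value]
        if project is None:
            return hits
        # Return only the requested part of each matching aggregate.
        return [{name: d[name] for name in project if name in d} for d in hits]
```

With `KeyValueStore` the application must decode the blob itself, for example with `json.loads`, whereas `DocumentStore.find` can answer a question such as "orders for customer 1" directly.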

Column-Family Stores

A column store database is a type of database that stores data using a column-oriented model.

A column store database can also be referred to as a:

∙ Column database

∙ Column family database
∙ Column oriented database
∙ Wide column store database
∙ Wide column store
∙ Columnar database
∙ Columnar store

The Structure of a Column Store Database

Column store databases use a concept called a keyspace. A keyspace is kind of like a schema in the
relational model. The keyspace contains all the column families (kind of like tables in the relational
model), which contain rows, which contain columns.

Like this:

[Figure: a keyspace containing column families.]


Here’s a closer look at a column family:

[Figure: a column family containing 3 rows. Each row contains its own set of columns.]

As the above diagram shows:

∙ A column family consists of multiple rows.


∙ Each row can contain a different number of columns from the other rows, and the columns
don’t have to match the columns in the other rows (i.e. they can have different column
names, data types, etc.).
∙ Each column is confined to its row. It doesn’t span all rows as in a relational database.
∙ Each column contains a name/value pair, along with a timestamp. Note that this example
uses Unix/Epoch time for the timestamp.

Here’s how each row is constructed:

Here’s a breakdown of each element in the row:


∙ Row Key. Each row has a unique key, which is a unique identifier for that row.
∙ Column. Each column contains a name, a value, and timestamp.
∙ Name. This is the name of the name/value pair.
∙ Value. This is the value of the name/value pair.

∙ Timestamp. This provides the date and time that the data was inserted. This can be used to
determine the most recent version of data.
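
The row structure broken down above can be sketched as a small data structure. The names `Column` and `write`, the sample rows, and the rule of keeping only the newest timestamp are assumptions for illustration, not the behaviour of any particular DBMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # Unix epoch seconds


def write(column_family, row_key, column):
    """Store a column under a row key, keeping only the most recent version."""
    row = column_family.setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


# A column family maps row keys to rows; each row holds its own set of columns.
users = {}
write(users, "bob", Column("email", "bob@example.com", 1465676582))
write(users, "bob", Column("gender", "male", 1465676582))
write(users, "britney", Column("email", "brit@example.com", 1465676432))
write(users, "britney", Column("country", "Sweden", 1465676432))
# An older write arrives late and is ignored because its timestamp is smaller.
write(users, "bob", Column("email", "old@example.com", 1465000000))
```

The two rows end up with different column names, and the timestamp decides which version of `email` survives for the row `bob`.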

Some DBMSs expand on the column family concept to provide extra functionality/storage ability. For
example, Cassandra has the concept of composite columns, which allow you to nest objects inside a
column.

Benefits of Column Store Databases

Some key benefits of columnar databases include:

∙ Compression. Column stores are very efficient at data compression and/or partitioning.
∙ Aggregation queries. Due to their structure, columnar databases perform particularly well
with aggregation queries (such as SUM, COUNT, AVG, etc).
∙ Scalability. Columnar databases are very scalable. They are well suited to massively parallel
processing (MPP), which involves having data spread across a large cluster of machines –
often thousands of machines.
∙ Fast to load and query. Columnar stores can be loaded extremely fast. A billion row table
could be loaded within a few seconds. You can start querying and analysing almost
immediately.
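
The compression and aggregation benefits listed above can be illustrated with a small sketch. The sample data is made up, and run-length encoding is only one possible compression technique, chosen here as an assumption because it is easy to show.

```python
from itertools import groupby

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"country": "IN", "amount": 10},
    {"country": "IN", "amount": 20},
    {"country": "US", "amount": 5},
    {"country": "US", "amount": 15},
    {"country": "US", "amount": 50},
]


def to_columns(records):
    """Pivot records into one list per column name."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
total = sum(columns["amount"])                   # SUM reads a single column
encoded = run_length_encode(columns["country"])  # repeated values compress well
```

Storing the values of one column next to each other is what lets an aggregate such as SUM skip the other fields, and it tends to place similar values side by side, which helps compression.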

Key Points

• An aggregate is a collection of data that we interact with as a unit. Aggregates form the boundaries
for ACID operations with the database.

• Key-value, document, and column-family databases can all be seen as forms of aggregate-oriented
database.

• Aggregates make it easier for the database to manage data storage over clusters.

• Aggregate-oriented databases work best when most data interaction is done with the same
aggregate; aggregate-ignorant databases are better when interactions use data organized in many
different formations.

*******