
CCS334 BIG DATA ANALYTICS

UNIT II NOSQL DATA MANAGEMENT


Introduction to NoSQL – aggregate data models – key-value and document data models –
relationships – graph databases – schemaless databases – materialized views – distribution models –
master-slave replication – consistency - Cassandra – Cassandra data model – Cassandra examples –
Cassandra clients

INTRODUCTION TO NoSQL
A NoSQL database, also called "Not Only SQL," is an approach to data management and database design
that's useful for very large sets of distributed data. NoSQL is a whole new way of thinking about a
database. NoSQL is not a relational database. The reality is that a relational database model may not
be the best solution for all situations. The easiest way to think of NoSQL is as a database which
does not adhere to the traditional relational database management system (RDBMS) structure.
Sometimes you will also see it referred to as 'not only SQL'. The most popular NoSQL database is
Apache Cassandra. Cassandra, which was once Facebook's proprietary database, was released as
open source in 2008. Other NoSQL implementations include SimpleDB, Google BigTable, Apache
Hadoop, MapReduce, MemcacheDB, and Voldemort. Companies that use NoSQL include Netflix,
LinkedIn and Twitter.
Why Are NoSQL Databases Interesting? / Why should we use NoSQL? / When should we use NoSQL?
There are several reasons why people consider using a NoSQL database.
Application development productivity. A lot of application development effort is spent on
mapping data between in-memory data structures and a relational database. A NoSQL database may
provide a data model that better fits the application’s needs, thus simplifying that interaction and
resulting in less code to write, debug, and evolve.

Large data. Organizations are finding it valuable to capture more data and process it more quickly.
They are finding it expensive, if even possible, to do so with relational databases. The primary
reason is that a relational database is designed to run on a single machine, but it is usually more
economical to run large data and computing loads on clusters of many smaller and cheaper machines.
Many NoSQL databases are designed explicitly to run on clusters, so they make a better fit for big
data scenarios.
Analytics. One reason to consider adding a NoSQL database to your corporate infrastructure is that
many NoSQL databases are well suited to performing analytical queries.

Scalability. NoSQL databases are designed to scale; it’s one of the primary reasons that people
choose a NoSQL database. Typically, with a relational database like SQL Server or Oracle, you scale
by purchasing larger and faster servers and storage or by employing specialists to provide
additional tuning. Unlike relational databases, NoSQL databases are designed to easily scale out as
they grow. Data is partitioned and balanced across multiple nodes in a cluster, and aggregate
queries are distributed by default.

Massive write performance. This is probably the canonical use case, based on Google's influence.
High volume: Facebook needs to store 135 billion messages a month, and Twitter has the
problem of storing 7 TB of data per day with the prospect of this requirement doubling multiple times
per year. This is the "data too big to fit on one node" problem. At 80 MB/s it takes a day to store 7 TB,
so writes need to be distributed over a cluster, which implies key-value access, MapReduce,
replication, fault tolerance, consistency issues, and all the rest. For faster writes, in-memory systems
can be used.

Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mindset.
When latency is important, it's hard to beat hashing on a key and reading the value directly from
memory or in as little as one disk seek. Not every NoSQL product is about fast access; some are more
about reliability, for example. But what people have wanted for a long time was a better memcached,
and many NoSQL systems offer that.

Flexible data model and flexible datatypes. NoSQL products support a whole range of new data
types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced
data structures, document-oriented, and key-value. Complex objects can be easily stored without a
lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure
allows for much more flexibility. We also have program- and programmer-friendly datatypes
like JSON.

Schema migration. Schemalessness makes it easier to deal with schema migrations without so
much worrying. Schemas are in a sense dynamic, because they are imposed by the application at
run-time, so different parts of an application can have a different view of the schema.

Write availability. Do your writes need to succeed no matter what? Then we can get into
partitioning, CAP, eventual consistency and all that jazz.

Easier maintainability, administration and operations. This is very product specific, but many
NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are
spending a lot of effort on ease of use, minimal administration, and automated operations. This can
lead to lower operations costs as special code doesn't have to be written to scale a system that was
never intended to be used that way.

No single point of failure. Not every product delivers on this, but we are seeing a definite
convergence on relatively easy-to-configure and manage high availability with automatic load
balancing and cluster sizing: a perfect cloud partner.

Generally available parallel computing. We are seeing MapReduce baked into products, which
makes parallel computing something that will be a normal part of development in the future.

Programmer ease of use. Accessing your data should be easy. While the relational model is
intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok
keys, values, JSON, JavaScript stored procedures, HTTP, and so on. NoSQL is for programmers; this
is a developer-led coup. The response to a database problem can't always be to hire a really
knowledgeable DBA, get your schema right, and denormalize a little; programmers would prefer a
system that they can make work for themselves. It shouldn't be so hard to make a product perform.
Money is part of the issue: if it costs a lot to scale a product, won't you go with the cheaper
product, that you control, that's easier to use, and that's easier to scale?

Use the right data model for the right problem. Different data models are used to solve different
problems. Much effort has been put into, for example, wedging graph operations into a relational
model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now
seeing a general strategy of trying to find the best fit between a problem and solution.

Distributed systems and cloud computing support. Not everyone is worried about scale or
performance over and above that which can be achieved by non-NoSQL systems. What they need is
a distributed system that can span datacenters while handling failure scenarios without a hiccup.
NoSQL systems, because they have focused on scale, tend to exploit partitions and tend not to use
heavy, strict consistency protocols, so they are well positioned to operate in distributed scenarios.

Difference between SQL and NoSQL


● SQL databases are primarily called Relational Databases (RDBMS), whereas NoSQL databases
are primarily called non-relational or distributed databases.

● SQL databases are table-based databases whereas NoSQL databases are document-based, key-
value pairs, graph databases or wide-column stores. This means that SQL databases represent
data in the form of tables consisting of n rows of data, whereas NoSQL databases are
collections of key-value pairs, documents, graphs or wide columns which do not have standard
schema definitions to which they need to adhere.

● SQL databases have a predefined schema whereas NoSQL databases have a dynamic schema for
unstructured data.

● SQL databases are vertically scalable whereas NoSQL databases are horizontally scalable.
SQL databases are scaled by increasing the horsepower of the hardware. NoSQL databases are
scaled by increasing the number of database servers in the pool of resources to reduce the load.

● SQL databases use SQL (Structured Query Language) for defining and manipulating the data,
which is very powerful. In NoSQL databases, queries are focused on collections of documents.
Sometimes this is also called UnQL (Unstructured Query Language). The syntax of UnQL
varies from database to database.

● SQL database examples: MySQL, Oracle, SQLite, Postgres and MS SQL Server. NoSQL database
examples: MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase, Neo4j and CouchDB.

● For complex queries: SQL databases are a good fit for complex, query-intensive
environments whereas NoSQL databases are not a good fit for complex queries. At a high level,
NoSQL databases don't have standard interfaces to perform complex queries, and the queries
themselves in NoSQL are not as powerful as the SQL query language.

● For the type of data to be stored: SQL databases are not the best fit for hierarchical data storage.
NoSQL databases fit better for hierarchical data storage as they follow the key-value pair
way of storing data, similar to JSON data. NoSQL databases are highly preferred for large
datasets (i.e., for big data); HBase is an example for this purpose.

● For scalability: In most typical situations, SQL databases are vertically scalable. You can manage
increasing load by increasing the CPU, RAM, SSD, etc., on a single server, preserving the
integrity of the data. While you can use NoSQL for transactional purposes, it is still not
comparable or stable enough under high load and for complex transactional applications. On the
other hand, NoSQL databases are horizontally scalable: you can just add a few more servers to
your NoSQL database infrastructure to handle large traffic.

● For properties: SQL databases emphasize the ACID properties (Atomicity, Consistency, Isolation
and Durability) whereas NoSQL databases follow Brewer's CAP theorem (Consistency,
Availability and Partition tolerance).

● For DB types: At a high level, we can classify SQL databases as either open-source or closed-
source from commercial vendors. NoSQL databases can be classified, on the basis of the way
they store data, as graph databases, key-value store databases, document store databases,
column store databases and XML databases.

● For high transactional based applications: SQL databases are the best fit for heavy-duty
transactional applications, as they are more stable and promise atomicity and data integrity.

● For support: Excellent support is available for all SQL databases from their vendors. There are
also lots of independent consultants who can help you with SQL databases for very large scale
deployments. For some NoSQL databases you still have to rely on community support, and only
limited outside experts are available to help you set up and deploy your large-scale NoSQL
deployments.

Types of NoSQL Databases: There are four general types of NoSQL databases, each with their
own specific attributes:

1. Key-Value storage

This is the first category of NoSQL database. Key-value stores have a simple data
model: a map/dictionary that allows clients to put and request values per key. In
key-value storage, each key has to be unique to provide unambiguous identification
of values.
2. Document databases

Document databases store documents in JSON format. JSON-based documents with
completely different sets of attributes can be stored together. Such stores hold
highly unstructured data as named value pairs, and suit applications that look at
user behavior, actions, and logs in real time.

3. Column storage

Columnar databases are almost like tabular databases. Keys in wide column stores
can have many dimensions, resulting in a structure similar to a multi-dimensional,
associative array. The example below shows storing data in a wide column system using a
two-dimensional key.

4. Graph storage

Graph databases are best suited for representing data with a high, yet flexible,
number of interconnections, especially when information about those
interconnections is at least as important as the represented data. In a graph NoSQL
database, data is stored in graph-like structures, so that the data can be
made easily accessible. Graph databases are commonly used on social networking
sites, as shown in the figure below.

Example databases

Pros and Cons of Relational Databases

• Advantages
• Data persistence
• Concurrency – ACID, transactions, etc.
• Integration across multiple applications
• Standard model – tables and SQL
• Disadvantages
• Impedance mismatch
• Integration databases vs. application databases
• Not designed for clustering

Database Impedance Mismatch:

Impedance mismatch means the difference between the relational data model and in-memory data
structures. Impedance is the measure of how much one object resists (or obstructs)
the flow of another object.
Imagine you have a low-current flashlight that normally uses AAA batteries. Suppose you
could attach your car battery to the flashlight. The low-current flashlight will pitifully
output a fraction of the light energy that the high-current battery is capable of producing.
However, match the AAA batteries to the flashlight and they will run with maximum
efficiency.
The data representation in an RDBMS does not match the data structures used in memory.
In-memory data structures are lists, dictionaries, and nested and hierarchical data structures,
whereas a relational database stores only atomic values; there are no lists or nested
records. Translating between these representations can be costly and confusing, and it limits
application development productivity.
Some common characteristics of NoSQL include:

● Does not use the relational model (mostly)

● Generally open-source projects (currently)

● Driven by the need to run on clusters

● Built for the need to run 21st century web properties

● Schema-less

● Polyglot persistence: The point of view of using different data stores in different
circumstances is known as polyglot persistence.
Today, most large companies are using a variety of different data storage technologies for different
kinds of data. Many companies still use relational databases to store some data, but the persistence
needs of applications are evolving from predominantly relational to a mixture of data sources.
Polyglot persistence is commonly used to define this hybrid approach. The definition of polyglot is
“someone who speaks or writes several languages.” The term polyglot is redefined for big data as a
set of applications that use several core database technologies.
● Auto Sharding: NoSQL databases usually support auto-sharding, meaning that they
natively and automatically spread data across an arbitrary number of servers, without
requiring the application to even be aware of the composition of the server pool
NoSQL data model
Relational and NoSQL data models are very different. The relational model takes data and
separates it into many interrelated tables that contain rows and columns. Tables reference each other
through foreign keys that are stored in columns as well. When looking up data, the desired information
needs to be collected from many tables (often hundreds in today’s enterprise applications) and
combined before it can be provided to the application. Similarly, when writing data, the write needs to
be coordinated and performed on many tables.

NoSQL databases have a very different model. For example, a document-oriented NoSQL database takes
the data you want to store and aggregates it into documents using the JSON format. Each JSON
document can be thought of as an object to be used by your application. A JSON document might, for
example, take all the data stored in a row that spans 20 tables of a relational database and aggregate it into
a single document/object. Aggregating this information may lead to duplication of information, but since
storage is no longer cost prohibitive, the resulting data model flexibility, ease of efficiently distributing
the resulting documents and read and write performance improvements make it an easy trade-off for web-
based applications.

Another major difference is that relational technologies have rigid schemas while NoSQL models
are schemaless. Relational technology requires strict definition of a schema prior to storing any
data into a database. Changing the schema once data is inserted is a big deal, extremely disruptive
and frequently avoided – the exact opposite of the behavior desired in the Big Data era, where
application developers need to constantly – and rapidly – incorporate new types of data to enrich
their apps.

Aggregate data models in NoSQL


Data Model: A data model is the model through which we perceive and manipulate our data. For
people using a database, the data model describes how we interact with the data in the database.

Relational Data Model: The relational model takes the information that we want to store and
divides it into tuples. A tuple is a limited data structure: it captures a set of values and can't be
nested. Aggregate data models relax exactly this restriction.

AGGREGATE DATA MODELS


Aggregate Model: Aggregate is a term that comes from Domain-Driven Design. An aggregate is a
collection of related objects that we wish to treat as a unit; it is a unit for data manipulation and
management of consistency.
● Atomicity holds within an aggregate

● Communication with data storage happens in units of aggregates

● Dealing with aggregates is much more efficient in clusters

● It is often easier for application programmers to work with aggregates

Example of Relations and Aggregates


Let’s assume we have to build an e-commerce website; we are going to be selling items directly to
customers over the web, and we will have to store information about users, our product catalog,
orders, shipping addresses, billing addresses, and payment data. We can use this scenario to model
the data using a relation data store as well as NoSQL data stores and talk about their pros and cons.
For a relational database, we might start with a data model shown in the following figure
The following figure presents some sample data for this model.

In the relational model, everything is properly normalized, so that no data is repeated in multiple
tables, and we have referential integrity. A realistic order system would naturally be more involved
than this. Now let's see how this model might look when we think in more aggregate-oriented terms.

Again, we have some sample data, which we'll show in JSON format as that's a common representation for data in NoSQL.
In this model, we have two main aggregates: customer and order. We’ve used the black-diamond
composition marker in UML to show how data fits into the aggregation structure. The customer
contains a list of billing addresses; the order contains a list of order items, a shipping address,
and payments. The payment itself contains a billing address for that payment.
A single logical address record appears three times in the example data, but instead of using IDs
it’s treated as a value and copied each time. This fits the domain where we would not want the
shipping address, nor the payment’s billing address, to change. In a relational database, we
would ensure that the address rows aren’t updated for this case, making a new row instead. With
aggregates, we can copy the whole address structure into the aggregate as we need to.
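Since the sample-data figure is not reproduced here, the following is a minimal sketch of the two aggregates as Python dictionaries (structurally equivalent to the JSON the text describes); all names and values are illustrative. Note how the same address appears as a copied value in the customer, the shipping address, and the payment, rather than as a shared reference:

    # Illustrative customer and order aggregates; all values are made up.
    customer = {
        "id": 1,
        "name": "Martin",
        "billingAddress": [{"city": "Chicago"}],
    }

    order = {
        "id": 99,
        "customerId": 1,
        "orderItems": [
            {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"},
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
            {
                "ccinfo": "1000-1000-1000-1000",
                "txnId": "abelif879rft",
                # The address is copied into the payment as a value:
                "billingAddress": {"city": "Chicago"},
            },
        ],
    }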
Aggregate-Oriented Databases: Aggregate-oriented databases work best when most data
interaction is done with the same aggregate; aggregate-ignorant databases are better when
interactions use data organized in many different formations.
Key-value databases
• Stores data that is opaque to the database
• The database cannot see the structure of records
• Application needs to deal with this
• Allows flexibility regarding what is stored (i.e. text or binary data)
Document databases
• Stores data whose structure is visible to the database
• Imposes limitations on what can be stored
• Allows more flexible access to data (i.e. partial records) via querying
Both key-value and document databases consist of aggregate records accessed by ID

Column-family databases
• Two levels of access to aggregates (and hence, two parts to the "key" used to access an
aggregate's data)
• ID is used to look up aggregate record
• Column name – either a label for a value (name) or a key to a list entry (order id)
• Columns are grouped into column families
Schemaless Databases
A common theme across all the forms of NoSQL databases is that they are
schemaless. When you want to store data in a relational database, you first have to define a
schema—a defined structure for the database which says what tables exist, which
columns exist, and what data types each column can hold. Before you store some data, you
have to have the schema defined for it.
With NoSQL databases, storing data is much more casual. A key-value store allows
you to store any data you like under a key. A document database effectively does the same
thing, since it makes no restrictions on the structure of the documents you store. Column-
family databases allow you to store any data under any column you like. Graph databases
allow you to freely add new edges and freely add properties to nodes and edges as you
wish.
Why Schemaless?
• A schemaless store also makes it easier to deal with nonuniform data
• When starting a new development project you don't need to spend the same
amount of time on up-front design of the schema.
• No need to learn SQL or database specific stuff and tools.
• The rigid schema of a relational database (RDBMS) means you have to absolutely
follow the schema. It can be harder to push data into the DB as it has to perfectly fit
the schema. Being able to add data directly without having to tweak it to match the
schema can save you time
• With minor changes to the model, you would otherwise have to change both your code
and the schema in the DBMS. With no schema, you don't have to make changes in two
places, which is less time consuming
• With a NoSQL DB you have fewer ways to pull the data out
• Less overhead for DB engine
• Less overhead for developers related to scalability
• Eliminates the need for Database administrators or database experts -> fewer
people involved and less waiting on experts
• Save time writing complex SQL joins -> more rapid development
Pros and cons of schemaless data
Pros:

● More freedom and flexibility

● You can easily change your data organization

● You can deal with nonuniform data

Cons:
• A program that accesses data:
a. almost always relies on some form of implicit schema
b. It assumes that certain fields are present
c. carry data with a certain meaning
• The implicit schema is shifted into the application code that accesses data
• To understand what data is present you have to look at the application code
• The schema cannot be used to:
a. decide how to store and retrieve data efficiently
b. ensure data consistency
• Problems if multiple applications, developed by different people, access the
same database.
• Relational schemas can be changed at any time with standard SQL commands
Key-Value and Document Data Models
Key-value databases
A key-value store is a simple hash table, primarily used when all access to the database is via
primary key.
Key-value stores are the simplest NoSQL data stores to use from an API perspective. The
client can either get the value for the key, put a value for a key, or delete a key from the data
store. The value is a BLOB (Binary Large Object) that the data store just stores, without
caring or knowing what's inside; it's the responsibility of the application to understand
what was stored. Since key-value stores always use primary-key access, they generally
have great performance and can be easily scaled.
It is an associative container such as a map or dictionary (in query processing, an index). It is
an abstract data type composed of a collection of unique keys and a collection of values,
where each key is associated with one value (or set of values). The operation of finding the
value associated with a key is called a lookup or indexing. The relationship between a key
and its value is sometimes called a mapping or binding.
Some of the popular key-value databases are Riak, Redis, Memcached DB, Berkeley DB,
HamsterDB, Amazon DynamoDB.
A key-value model is great for lookups of simple or even complex values. When the values
are themselves interconnected, you've got a graph, as shown in the following figure, which
lets you traverse quickly among all the connected values.

In Key-Value database,

● Data is stored sorted by key.

● Callers can provide a custom comparison function to override the sort order.

● The basic operations are Put(key,value), Get(key), Delete(key).

● Multiple changes can be made in one atomic batch.

● Users can create a transient snapshot to get a consistent view of data.

● Forward and backward iteration is supported over the data.
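To make these operations concrete, here is a minimal in-memory sketch in Python of the basic interface listed above (put/get/delete plus forward iteration in key order). This is only an illustration; a production store such as Redis or LevelDB provides the same operations with persistence, atomic batches, snapshots, and replication:

    class KeyValueStore:
        """Minimal in-memory key-value store sketch (illustrative only)."""

        def __init__(self):
            self._data = {}  # key -> opaque value; the store ignores structure

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)  # None if the key is absent

        def delete(self, key):
            self._data.pop(key, None)

        def scan(self):
            # Forward iteration in key order, as in stores that sort by key.
            for key in sorted(self._data):
                yield key, self._data[key]

    store = KeyValueStore()
    store.put("user:1234", '{"name": "Ann"}')   # the value is an opaque blob
    print(store.get("user:1234"))
    store.delete("user:1234")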


In key-value databases, a single object stores all the data for a record and is put into a single
bucket. Buckets are used to define a virtual keyspace and provide the ability to define
isolated, non-default configuration. Buckets might be compared to tables or folders in
relational databases or file systems, respectively.
As their name suggests, they store key/value pairs. For example, for search engines, a store
may associate with each keyword (the key) a list of documents containing it (the
corresponding value).
One approach to implementing a key-value store is to use a file decomposed into blocks. As the
following figure shows, each block is associated with a number (ranging from 1 to n). Each
block manages a set of key-value pairs: the beginning of the block contains, after some
header information, an index of keys and the positions of the corresponding values. These
values are stored starting from the end of the block (like a memory heap). The free
space available is delimited by the end of the index and the end of the values.

In this implementation, the size of a block is important since it defines the largest
value that can be stored (for example, the longest list of document identifiers containing a
given keyword). Moreover, it supposes that a block number is associated with each key. These
block numbers can be assigned in two different ways:

1. The block number is obtained directly from the key, typically by using a hash
function. The size of the file is then defined by the largest block number computed
over every possible key.

2. The block number is assigned incrementally. When a new pair must be stored, the
first block that can hold it is chosen. In practice, a given amount of space is reserved
in each block in order to manage updates of existing pairs (a new value can replace an
older, smaller one). This limits the size of the file to the amount of values to store.
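As a rough illustration of the first (hash-based) assignment scheme, here is a hedged Python sketch; the block count, block size, and in-block layout are simplified assumptions, not a real storage engine:

    import hashlib

    NUM_BLOCKS = 64      # assumed fixed number of blocks, for illustration only
    BLOCK_SIZE = 4096    # bounds the largest value a block can hold

    def block_number(key: str) -> int:
        """Map a key to a block number with a hash function (scheme 1)."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_BLOCKS

    # A real block keeps a key index at its start while values grow backwards
    # from the end; here each block is modeled as a dict, and we only enforce
    # the capacity constraint that the block size imposes.
    blocks = [{} for _ in range(NUM_BLOCKS)]

    def put(key: str, value: bytes):
        b = blocks[block_number(key)]
        used = sum(len(k) + len(v) for k, v in b.items())
        if used + len(key) + len(value) > BLOCK_SIZE:
            raise ValueError("block full: value too large for this scheme")
        b[key] = value

    def get(key: str) -> bytes:
        return blocks[block_number(key)][key]

    put("databases", b"doc1,doc42,doc97")   # keyword -> posting list
    print(get("databases"))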

Document Databases
In a relational database system you must define a schema before adding records to a
database. The schema is the structure described in a formal language supported by the
database and provides a blueprint for the tables in a database and the relationships
between tables of data. Within a table, you need to define constraints in terms of rows and
named columns as well as the type of data that can be stored in each column.
In contrast, a document-oriented database contains documents, which are records that
describe the data in the document as well as the actual data. Documents can be as complex
as you choose; you can use nested data to provide additional sub-categories of information
about your object. You can also use one or more documents to represent a real-world object.
The following compares a conventional table with document-based objects:
In this example we have a table that represents beers and their respective attributes: id,
beer name, brewer, bottles available and so forth. As we see in this illustration, the
relational model conforms to a schema with a specified number of fields which represent a
specific purpose and data type. The equivalent document-based model has an individual
document per beer; each document contains the same types of information for a specific
beer.
In a document-oriented model, data objects are stored as documents; each document
stores your data and enables you to update the data or delete it. Instead of columns with
names and data types, we describe the data in the document, and provide the value for that
description. If we wanted to add attributes to a beer in a relational mode, we would need to
modify the database schema to include the additional columns and their data types. In the
case of document-based data, we would add additional key-value pairs into our documents
to represent the new fields.
Another characteristic of relational databases is data normalization; this means you
decompose data into smaller, related tables. The figure below illustrates this:

In the relational model, data is shared across multiple tables. The advantage to this model
is that there is less duplicated data in the database. If we did not separate beers and
brewers into different tables and had one beer table instead, we would have repeated
information about breweries for each beer produced by that brewer.
The problem with this approach is that when you change information across tables, you
need to lock those tables simultaneously to ensure information changes across the table
consistently. Because you also spread information across a rigid structure, it makes it more
difficult to change the structure during production, and it is also difficult to distribute the
data across multiple servers.

In the document-oriented database, we could choose to have two different document
structures: one for beers, and one for breweries. Instead of splitting your application objects
into tables and rows, you would turn them into documents. By providing a reference in the
beer document to a brewery document, you create a relationship between the two entities:

In this example we have two different beers from the Amtel brewery. We represent each
beer as a separate document and reference the brewery in the brewer field. The document-
oriented approach provides several upsides compared to the traditional RDBMS model.
First, because information is stored in documents, updating a schema is a matter of
updating the documents for that type of object. This can be done with no system downtime.
Secondly, we can distribute the information across multiple servers with greater ease.
Since records are contained within entire documents, it makes it easier to move, or
replicate an entire object to another server.
Using JSON Documents
JavaScript Object Notation (JSON) is a lightweight data-interchange format which is easy to
read and change. JSON is language-independent although it uses similar constructs to
JavaScript. The following are basic data types supported in JSON:

● Numbers, including integer and floating point,

● Strings, including all Unicode characters and backslash escape characters,

● Boolean: true or false,

● Arrays, enclosed in square brackets: [“one”, “two”, “three”]

● Objects, consisting of key-value pairs, and also known as an associative array or hash.
The key must be a string and the value can be any supported JSON data type.
For instance, if you are creating a beer application, you might want a particular
document structure to represent a beer:
{
"name":
"description":
"category":
"updated":
}
For each of the keys in this JSON document you would provide unique values to represent
individual beers. If you want to provide more detailed information in your beer application
about the actual breweries, you could create a JSON structure to represent a brewery:
{
"name":
"address":
"city":
"state":
"website":
"description":
}
Performing data modeling for a document-based application is no different from the work
you would need to do for a relational database. For the most part it can be much more
flexible, it can provide a more realistic representation of your application data, and it also
enables you to change your mind later about data structure. For more complex items in
your application, one option is to use nested pairs to represent the information:
{
"name":
"address":
"city":
"state":
"website":
"description":
"geo":
{
"location": ["-105.07", "40.59"],
"accuracy": "RANGE_INTERPOLATED"
}
"beers": [ _id4058, _id7628]
}
In this case we added a nested attribute for the geo-location of the brewery and for beers.
Within the location, we provide an exact longitude and latitude, as well as a level of accuracy
for plotting it on a map. The level of nesting you provide is your decision; as long as a
document is under the maximum storage size for the server, you can provide any level of
nesting that you can handle in your application.
In traditional relational database modeling, you would create tables that contain a subset of
information for an item. For instance, a brewery may offer types of beers which are
stored in a separate table and referenced by the beer id. In the case of JSON documents, you
use key-value pairs, or even nested key-value pairs.
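To illustrate, here is a small Python sketch that builds the brewery document above as a nested dictionary and serializes it with the standard json module; all the field values are illustrative placeholders:

    import json

    # Illustrative brewery document with nested geo data and referenced beers.
    brewery = {
        "name": "Example Brewing Co.",          # placeholder values throughout
        "address": "123 Main St",
        "city": "Fort Collins",
        "state": "CO",
        "website": "http://example.com",
        "description": "A small illustrative brewery.",
        "geo": {
            "location": ["-105.07", "40.59"],
            "accuracy": "RANGE_INTERPOLATED",
        },
        "beers": ["_id4058", "_id7628"],        # references to beer documents
    }

    doc = json.dumps(brewery, indent=2)  # serialize for storage in a document DB
    print(doc)
    restored = json.loads(doc)           # parse back into native structures
    print(restored["geo"]["location"])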
Column-Family Stores
The name conjures up a tabular structure, which it realizes with sparse columns and no
schema. The column-family model is a two-level aggregate structure. As with key-value
stores, the first key is often described as a row identifier, picking out the aggregate of
interest. The difference with column-family structures is that this row aggregate is itself
formed of a map of more detailed values. These second-level values are referred to as
columns. As well as accessing the row as a whole, operations also allow picking out a
particular column, so to get a particular customer's name from the following figure you could
do something like
get('1234', 'name').

Column-family databases organize their columns into column families. Each column has to
be part of a single column family, and the column acts as unit for access, with the
assumption that data for a particular column family will be usually accessed together.
The data is structured into:
• Row-oriented: Each row is an aggregate (for example, customer with the ID of 1234)
with column families representing useful chunks of data (profile, order history) within that
aggregate.
• Column-oriented: Each column family defines a record type (e.g., customer profiles)
with rows for each of the records. You then think of a row as the join of records in all
column families.
Even though a document database declares some structure to the database, each document
is still seen as a single unit. Column families give a two-dimensional quality to column-
family databases.
Cassandra uses the terms "wide" and "skinny." Skinny rows have few columns, with the
same columns used across many different rows. In this case, the column family defines
a record type, each row is a record, and each column is a field. A wide row has many
columns (perhaps thousands), with rows having very different columns. A wide column
family models a list, with each column being one element in that list.
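A hedged Python sketch of this two-level structure may help: rows keyed by ID, each holding a map of columns, so the get('1234', 'name') access above becomes a two-step lookup. The data and the family name are illustrative; a real store such as Cassandra adds column families, timestamps, and sorted storage on top of this idea:

    # Column-family data as a two-level map: row key -> column name -> value.
    profile_family = {
        "1234": {"name": "Ann", "billing_address": "...", "payment": "..."},
        "5678": {"name": "Pramod"},   # rows may have very different columns
    }

    def get(row_key: str, column: str):
        """Pick out one column from a row aggregate."""
        return profile_family[row_key].get(column)

    print(get("1234", "name"))    # -> Ann
    row = profile_family["1234"]  # or access the whole row aggregate at once
    print(sorted(row))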
Relationships
Atomic Aggregates
Aggregates allow one to store a single business entity as one document, row or key-value
pair and update it atomically:

Graph Databases
Graph databases are one style of NoSQL databases that uses a distribution model similar to
relational databases but offers a different data model that makes it better at handling data
with complex relationships.

● Entities are also known as nodes, which have properties

● Nodes are organized by relationships, which allow us to find interesting patterns
between the nodes
● The organization of the graph lets the data be stored once and then
interpreted in different ways based on relationships
Let's follow along some graphs, using them to express themselves. We'll "read" a graph by
following arrows around the diagram to form sentences.
A Graph contains Nodes and Relationships

A Graph –[:RECORDS_DATA_IN]–> Nodes –[:WHICH_HAVE]–> Properties.

The simplest possible graph is a single Node, a record that has named values referred to as
Properties. A Node could start with a single Property and grow to a few million, though that
can get a little awkward. At some point it makes sense to distribute the data into multiple
nodes, organized with explicit Relationships.
Query a Graph with a Traversal

A Traversal –navigates–> a Graph; it –identifies–> Paths –which order–> Nodes.

A Traversal is how you query a Graph, navigating from starting Nodes to related Nodes
according to an algorithm, finding answers to questions like “what music do my friends like
that I don’t yet own,” or “if this power supply goes down, what web services are affected?”

Example
In this context, a graph refers to a graph data structure of nodes connected by edges. In the
above figure we have a web of information whose nodes are very small (nothing more than
a name) but there is a rich structure of interconnections between them. With this structure,
we can ask questions such as "find the books in the Databases category that are
written by someone whom a friend of mine likes."
Graph databases specialize in capturing this sort of information—but on a much larger
scale than a readable diagram could capture. This is ideal for capturing any data consisting
of complex relationships such as social networks, product preferences, or eligibility rules.
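A minimal Python sketch of a traversal over an adjacency-list graph, answering a question in the spirit of the examples above ("what books do my friends like?"); the nodes, edge labels, and data are all illustrative:

    # Graph as adjacency lists: node -> list of (relationship, target) edges.
    graph = {
        "me":      [("FRIEND", "anna"), ("FRIEND", "barbara")],
        "anna":    [("LIKES", "NoSQL Distilled")],
        "barbara": [("LIKES", "NoSQL Distilled"), ("LIKES", "Refactoring")],
    }

    def traverse(start, rel_path):
        """Follow a sequence of relationship labels from a starting node."""
        frontier = {start}
        for rel in rel_path:
            nxt = set()
            for node in frontier:
                for label, target in graph.get(node, []):
                    if label == rel:
                        nxt.add(target)
            frontier = nxt
        return frontier

    # Follow FRIEND edges, then LIKES edges, starting from "me".
    print(traverse("me", ["FRIEND", "LIKES"]))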

Materialized Views
In computing, a materialized view is a database object that contains the results of a query.
For example, it may be a local copy of data located remotely, or may be a subset of the rows
and/or columns of a table or join result, or may be a summary based on aggregations of a
table's data. Materialized views can be used within the same aggregate.
Materialized views, which store data based on remote tables, are also known as snapshots.
A snapshot can be redefined as a materialized view.

A materialized view is computed in advance and cached on disk.


Strategies for building a materialized view:
Eager approach: the materialized view is updated at the same time as the base data. It is
good when you have more frequent reads than writes.
Detached approach: batch jobs update the materialized views at regular intervals. It is
good when you don't want to pay an overhead on each update.
NoSQL databases do not have views; instead they have precomputed and cached queries, usually
called "materialized views".
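A hedged Python sketch contrasting the two strategies: an eager view updated on every write versus a detached view recomputed by a batch job. The order data and the total-per-customer view are illustrative:

    from collections import defaultdict

    orders = []                               # base data
    totals_by_customer = defaultdict(float)   # materialized view (eager)

    def add_order_eager(customer, amount):
        """Eager approach: update the view at the same time as the base data."""
        orders.append((customer, amount))
        totals_by_customer[customer] += amount

    def rebuild_view_detached():
        """Detached approach: a batch job recomputes the view at intervals."""
        view = defaultdict(float)
        for customer, amount in orders:
            view[customer] += amount
        return view

    add_order_eager("ann", 32.0)
    add_order_eager("ann", 18.5)
    print(totals_by_customer["ann"])          # always current: 50.5
    print(rebuild_view_detached()["ann"])     # current only as of the last batch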
Distribution Models
Multiple servers: In NoSQL systems, data is distributed over large clusters.
Single server – simplest model, everything on one machine. Run the database on a single
machine that handles all the reads and writes to the data store. We prefer this option
because it eliminates all the complexities. It’s easy for operations people to manage and
easy for application developers to reason about.
Although a lot of NoSQL databases are designed around the idea of running on a cluster, it
can make sense to use NoSQL with a single-server distribution model if the data model of
the NoSQL store is more suited to the application. Graph databases are the obvious
category here— these work best in a single-server configuration.
If your data usage is mostly about processing aggregates, then a single-server document or
key-value store may well be worthwhile because it's easier on application developers.
Orthogonal aspects of data distribution models:
Sharding: DB Sharding is nothing but horizontal partitioning of data. Different people are
accessing different parts of the dataset. In these circumstances we can support horizontal
scalability by putting different parts of the data onto different servers—a technique that’s
called sharding.
A table with billions of rows can be partitioned using range partitioning, for example on the
customer transaction date. In a Real Application Clusters setup, irrespective of which instance
accesses the data, the data is not horizontally partitioned (although Global Enqueue Resources
own certain blocks in each instance, that ownership can move around). In a sharded
environment, by contrast, the data is horizontally partitioned. For example, United States
customers can live in one shard, European Union customers in another shard, and customers
from other countries in a third; but from an access perspective there is no need to know where
the data lives, because the sharding layer goes to the appropriate shard to pick up the data.
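A minimal Python sketch of shard routing under the geographic scheme just described, plus the common hash-based alternative; the shard names and routing rules are illustrative assumptions:

    # Geographic shard routing, as in the US / EU / rest-of-world example.
    REGION_SHARDS = {"US": "shard-us", "EU": "shard-eu"}
    DEFAULT_SHARD = "shard-other"

    def shard_for_region(region: str) -> str:
        return REGION_SHARDS.get(region, DEFAULT_SHARD)

    # Hash-based alternative: spread keys evenly across N shards. Note that
    # Python's built-in hash() is not stable across runs; real systems use a
    # stable hash (and often consistent hashing) for routing.
    NUM_SHARDS = 4

    def shard_for_key(key: str) -> int:
        return hash(key) % NUM_SHARDS

    print(shard_for_region("EU"))        # -> shard-eu
    print(shard_for_key("customer:42"))  # -> some shard in [0, 3]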

Different parts of the data are placed onto different servers:

● Horizontal scalability

● Ideal case: different users all talking to different server nodes

● Data accessed together is kept on the same node: the aggregate is the unit

Pros: sharding can improve both reads and writes.

Cons: clusters use less reliable machines, so resilience decreases.

Many NoSQL databases offer auto-sharding: the database takes on the responsibility of sharding.
Improving performance: main rules of sharding:
1. Place the data close to where it's accessed
(orders for Boston: data in your eastern US data center)
2. Try to keep the load even
(all nodes should get equal amounts of the load)
3. Put together aggregates that may be read in sequence
(same order, same node)

Master-Slave Replication
Master
• is the authoritative source for the data
• is responsible for processing any updates to that data
• can be appointed manually or automatically
Slaves
• A replication process synchronizes the slaves with the master
• After a failure of the master, a slave can be appointed as the new master very quickly
Pros and cons of Master-Slave Replication
Pros
• More read requests:
• Add more slave nodes
• Ensure that all read requests are routed to the slaves
• Should the master fail, the slaves can still handle read requests
• Good for read-intensive datasets
Cons
• The master is a bottleneck
• Limited by its ability to process updates and to pass those updates on
• Its failure eliminates the ability to handle writes until:
a) the master is restored, or
b) a new master is appointed
• Inconsistency due to slow propagation of changes to the slaves
• Bad for datasets with heavy write traffic

Consistency
The consistency property ensures that any transaction will bring the database from one
valid state to another. Any data written to the database must be valid according to all
defined rules, including constraints, cascades, triggers, and any combination thereof.
This is the biggest change when moving from a centralized relational database to a cluster-oriented
NoSQL database.
Relational databases have strong consistency whereas NoSQL systems mostly have eventual
consistency.
ACID: A DBMS is expected to support “ACID transactions,” processes that are:

● Atomicity: either the whole process is done or none is

● Consistency: only valid data are written


● Isolation: one operation at a time

● Durability: once committed, it stays that way

Various forms of consistency


1. Update Consistency (or write-write conflict):
Martin and Pramod are looking at the company website and notice that the phone
number is out of date. Incredibly, they both have update access, so they both go in at the
same time to update the number. To make the example interesting, we’ll assume they
update it slightly differently, because each uses a slightly different format. This issue is
called a write-write conflict: two people updating the same data item at the same time.
When the writes reach the server, the server will serialize them—decide to apply one,
then the other. Let’s assume it uses alphabetical order and picks Martin’s update first, then
Pramod’s. Without any concurrency control, Martin’s update would be applied and
immediately overwritten by Pramod’s. In this case Martin’s is a lost update. We see this as
a failure of consistency because Pramod’s update was based on the state before Martin’s
update, yet was applied after it.
Solutions:
• Pessimistic approach
a. Prevent conflicts from occurring
i. Usually implemented with write locks managed by the system
• Optimistic approach
a. Lets conflicts occur, but detects them and takes action to sort them out
b. Approaches:
i. conditional updates: test the value just before updating
ii. save both updates: record that they are in conflict and then
merge them
• These approaches do not work if there's more than one authoritative server (peer-to-peer replication)
2. Read Consistency (or read-write conflict)
Alice and Bob are using the Ticketmaster website to book tickets for a specific show.
Only one ticket is left for the show. Alice signs on to Ticketmaster first, finds
the one ticket left, and finds it expensive. Alice takes time to decide. Bob signs on, finds one
ticket left, and orders it instantly. Bob purchases and logs off. Alice decides to buy a ticket,
only to find there are no tickets. This is a typical read-write conflict situation.
Another example where Pramod has done a read in the middle of Martin’s write as
shown in below.
We refer to this type of consistency as logical consistency. To avoid a logically
inconsistent read, Martin wraps his two writes in a transaction; the system then
guarantees that Pramod will either read both data items before the update or both after
the update. The length of time an inconsistency is present is called the inconsistency
window.
Replication consistency
Let’s imagine there’s one last hotel room for a desirable event. The hotel reservation
system runs on many nodes. Martin and Cindy are a couple considering this room, but
they are discussing this on the phone because Martin is in London and Cindy is in
Boston. Meanwhile Pramod, who is in Mumbai, goes and books that last room. That
updates the replicated room availability, but the update gets to Boston quicker than it
gets to London. When Martin and Cindy fire up their browsers to see if the room is
available, Cindy sees it booked and Martin sees it free. This is another inconsistent
read—but it’s a breach of a different form of consistency we call replication
consistency: ensuring that the same data item has the same value when read from
different replicas.
Eventual consistency: At any time, nodes may have replication inconsistencies but, if there
are no further updates, eventually all nodes will be updated to the same value. In other words,
eventual consistency is a consistency model used in NoSQL databases to achieve high
availability that informally guarantees that, if no new updates are made to a given data item,
eventually all accesses to that item will return the last updated value.
Eventually consistent services are often classified as providing BASE (Basically
Available, Soft state, Eventual consistency) semantics, in contrast to traditional ACID
(Atomicity, Consistency, Isolation, Durability) guarantees.
Basic Availability. The NoSQL database approach focuses on availability of data even in
the presence of multiple failures. It achieves this by using a highly distributed approach
to database management. Instead of maintaining a single large data store and focusing
on the fault tolerance of that store, NoSQL databases spread data across many storage
systems with a high degree of replication. In the unlikely event that a failure disrupts
access to a segment of data, this does not necessarily result in a complete database
outage.
Soft state. BASE databases abandon the consistency requirements of the ACID model
pretty much completely. One of the basic concepts behind BASE is that data consistency
is the developer's problem and should not be handled by the database.
Eventual Consistency. The only requirement that NoSQL databases have regarding
consistency is to require that at some point in the future, data will converge to a
consistent state. No guarantees are made, however, about when this will occur. That is a
complete departure from the immediate consistency requirement of ACID that
prohibits a transaction from executing until the prior transaction has completed and
the database has converged to a consistent state.
Version stamp: A field that changes every time the underlying data in the record
changes. When you read the data you keep a note of the version stamp, so that when
you write data you can check to see if the version has changed.
You may have come across this technique with updating resources with HTTP. One way
of doing this is to use etags. Whenever you get a resource, the server responds with an
etag in the header. This etag is an opaque string that indicates the version of the
resource. If you then update that resource, you can use a conditional update by
supplying the etag that you got from your last GET method. If the resource has changed
on the server, the etags won’t match and the server will refuse the update, returning a
412 (Precondition Failed) error response. In short,
• It helps you detect concurrency conflicts.
• When you read data, then update it, you can check the version stamp to ensure
nobody updated the data between your read and write
• Version stamps can be implemented using counters, GUIDs (a large random
number that’s guaranteed to be unique), content hashes, timestamps, or a
combination of these.
• With distributed systems, a vector of version stamps (a set of counters, one for
each node) allows you to detect when different nodes have conflicting updates.
• Sometimes this is called a compare-and-set (CAS) operation.
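A minimal Python sketch of a version-stamped conditional update (compare-and-set); the record layout and the counter-based stamp are illustrative assumptions:

    class VersionedStore:
        """Record store where every write bumps a version stamp (a counter)."""

        def __init__(self):
            self._records = {}   # key -> (version, value)

        def read(self, key):
            return self._records.get(key, (0, None))   # (version, value)

        def conditional_update(self, key, expected_version, new_value):
            """Compare-and-set: apply only if nobody wrote since our read."""
            version, _ = self._records.get(key, (0, None))
            if version != expected_version:
                return False   # like HTTP 412 Precondition Failed with etags
            self._records[key] = (version + 1, new_value)
            return True

    store = VersionedStore()
    v, _ = store.read("phone")
    assert store.conditional_update("phone", v, "555-0100")      # succeeds
    assert not store.conditional_update("phone", v, "555-0199")  # stale stamp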

Relaxing consistency
The CAP Theorem: The basic statement of the CAP theorem is that, given the three properties of
Consistency, Availability, and Partition tolerance, you can only get two.
• Consistency: all people see the same data at the same time
• Availability: if you can talk to a node in the cluster, it can read and write data
• Partition tolerance: the cluster can survive communication breakages that separate
the cluster into partitions unable to communicate with each other

Network partition: The CAP theorem states that if you get a network partition, you have to
trade off availability of data versus consistency.
Very large systems will "partition" at some point:

● That leaves either C or A to choose from (traditional DBMS prefers C over A and P )

● In almost all cases, you would choose A over C (except in specific applications
such as order processing)

CA systems
• A single-server system is the obvious example of a CA system
• CA cluster: if a partition occurs, all the nodes would go down
• A failed, unresponsive node doesn't imply a lack of CAP availability
• A system that suffers partitions must trade off consistency vs. availability
• Give up some consistency to get some availability
An example
• Ann is trying to book a room at the Ace Hotel in New York on a node of the booking
system located in London
• Pathin is trying to do the same on a node located in Mumbai
• The booking system uses peer-to-peer distribution
• There is only one room available
• The network link breaks

Possible solutions
• CP: Neither user can book any hotel room, sacrificing availability
• caP: Designate Mumbai node as the master for Ace hotel
• Pathin can make the reservation
• Ann can see the inconsistent room information
• Ann cannot book the room
• AP: both nodes accept the hotel reservation
• Overbooking!
Cassandra
The Cassandra data store is an open-source Apache project available at
http://cassandra.apache.org. Cassandra originated at Facebook in 2007 to solve that company's
inbox search problem, in which they had to deal with large volumes of data in a way that was
difficult to scale with traditional methods.
Main features
• Decentralized
Every node in the cluster has the same role. There is no single point of failure. Data is
distributed across the cluster (so each node contains different data), but there is no master as
every node can service any request.
• Supports replication and multi data center replication
Replication strategies are configurable. Cassandra is designed as a distributed
system, for deployment of large numbers of nodes across multiple data centers. Key features
of Cassandra's distributed architecture are specifically tailored for multiple-data-center
deployment, for redundancy, and for failover and disaster recovery.
• Scalability
Read and write throughput both increase linearly as new machines are added, with no
downtime or interruption to applications.
• Fault-tolerant
Data is automatically replicated to multiple nodes for fault-tolerance. Replication
across multiple data centers is supported. Failed nodes can be replaced with no downtime.
• Tunable consistency
Writes and reads offer a tunable level of consistency, all the way from "writes never
fail" to "block for all replicas to be readable", with the quorum level in the middle.
• MapReduce support
Cassandra has Hadoop integration, with MapReduce support. There is support also for
Apache Pig and Apache Hive.
• Query language
Cassandra introduces CQL (Cassandra Query Language), a SQL-like alternative to the
traditional RPC interface. Language drivers are available for Java (JDBC), Python, Node.js and
Go.
Why use Cassandra?
1. Quick writes
2. Fail safe
3. Quick reporting
4. Batch processing too, with MapReduce
5. Ease of maintenance
6. Ease of configuration
7. Tunably consistent
8. Highly available
9. Fault tolerant
10. The peer-to-peer design allows for high performance with linear scalability
and no single points of failure
11. Decentralized databases
12. Supports 12 different client languages
13. Automatic provisioning of new nodes
Cassandra Data Model and Cassandra Examples
Cassandra is a hybrid between a key-value and a column-oriented NoSQL database. The key-
value nature is represented by a row object, in which the value is generally organized in
columns. In short, Cassandra uses the following terms:
1. Keyspace: can be seen as a DB schema in SQL.
2. Column family: resembles a table in the SQL world (read below why this analogy is misleading).
3. Row: has a key and, as its value, a set of Cassandra columns, but without the relational
schema corset.
4. Column: a triplet := (name, value, timestamp).
5. Super column: a tuple := (name, collection of columns).
6. Data types: validators and comparators.
7. Indexes.
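To make these terms concrete, here is a hedged sketch using CQL through the DataStax Python driver (cassandra-driver); the keyspace, table, columns, and contact point are illustrative, and modern CQL tables take the place of the Thrift-era column families described below:

    from cassandra.cluster import Cluster

    # Connect to a local node; every node can service any request.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # Keyspace: analogous to a DB schema; replication is set per keyspace.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS shop
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # Table (column family): rows keyed by a primary key, holding columns.
    session.execute("""
        CREATE TABLE IF NOT EXISTS shop.customers (
            id text PRIMARY KEY,
            name text,
            city text
        )
    """)

    session.execute(
        "INSERT INTO shop.customers (id, name, city) VALUES (%s, %s, %s)",
        ("1234", "Ann", "Boston"),
    )

    # The CQL equivalent of get('1234', 'name') in the column-family model.
    row = session.execute(
        "SELECT name FROM shop.customers WHERE id = %s", ("1234",)
    ).one()
    print(row.name)
    cluster.shutdown()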

Cassandra data model is illustrated in the following figure


Key Spaces

Key Spaces are the largest container, with an ordered list of Column Families, similar to
a database in an RDBMS.
Column
A Column is the most basic element in Cassandra: a simple tuple that contains a name,
value and timestamp. All values are set by the client. That's an important consideration for the
timestamp, as it means you'll need clock synchronization.

Super Column
A Super Column is a column that stores an associative array of columns. You could think of it as
similar to a HashMap in Java, with an identifying column (name) that stores a list of columns
inside (value). The key difference between a Column and a Super Column is that the value of a
Column is a string, where the value of a Super Column is a map of Columns. Note that Super
Columns have no timestamp, just a name and a value.

Column Family
A Column Family holds a number of Rows: a sorted map that matches column names to
column values. A row is a set of columns, similar to the table concept from relational databases.
The column family holds an ordered list of columns which you can reference by
column name. A Column Family can be of two types, Standard or Super. Standard Column
Families contain a map of normal columns.

Example
Super Column Families contain rows of Super Columns.

Example

Data Types
There are predefined data types in Cassandra, in which:
● The data type of row key is called a validator.

● The data type for a column name is called a comparator.


You can assign predefined data types when you create your column family (which is
recommended), but Cassandra does not require it. Internally Cassandra stores column names
and values as hex byte arrays (BytesType). This is the default client encoding.
Indexes
An understanding of indexes in Cassandra is requisite. There are two kinds of them:
● The Primary index for a column family is the index of its row keys. Each node
maintains this index for the data it manages.
● The Secondary indexes in Cassandra refer to indexes on column values.
Cassandra implements secondary indexes as a hidden column family.
The primary index determines cluster-wide row distribution. Secondary indexes are very
important for custom queries.
Differences Between RDBMS and Cassandra
1. No Query Language: SQL is the standard query language used in relational databases.
Cassandra has no query language. It does have an API that you access through its RPC
serialization mechanism, Thrift.
2. No Referential Integrity: Cassandra has no concept of referential integrity, and
therefore has no concept of joins.
3. Secondary Indexes: A second column family can act as an explicit secondary index in
Cassandra.
4. Sorting: In RDBMS, you can easily change the order of records by using ORDER BY or
GROUP BY in your query. There is no support for ORDER BY and GROUP BY statements in
Cassandra. In Cassandra, however, sorting is treated differently; it is a design decision.
Column family definitions include a CompareWith element, which dictates the order in
which your rows will be sorted.
5. Denormalization: In the relational world, denormalization violates Codd's normal forms,
and we try to avoid it. But in Cassandra, denormalization is, well, perfectly normal. It's not
required if your data model is simple.
6. Design Patterns: Cassandra design pattern offers a Materialized View, Valueless Column,
and Aggregate Key.
Cassandra Clients

1. Thrift

Thrift is the driver-level interface; it provides the API for client
implementations in a wide variety of languages. Thrift was developed at Facebook
and donated as an Apache project.
Thrift is a code generation library for clients in C++, C#, Erlang, Haskell, Java,
Objective C/Cocoa, OCaml, Perl, PHP, Python, Ruby, Smalltalk, and Squeak. Its goal
is to provide an easy way to support efficient RPC calls in a wide variety of popular
languages, without requiring the overhead of something like SOAP.
The design of Thrift offers the following features:
• Language-independent types
• Common transport interface
• Protocol independence
• Versioning support
2. Avro
The Apache Avro project is a data serialization and RPC system targeted as
the replacement for Thrift in Cassandra. Avro provides many features similar to
those of Thrift and other data serialization and RPC mechanisms including:
• Robust data structures
• An efficient, small binary format for RPC calls
• Easy integration with dynamically typed languages such as Python,
Ruby, Smalltalk, Perl, PHP, and Objective-C
Avro is the RPC and data serialization mechanism for Cassandra. It
generates code that remote clients can use to interact with the
database. It's well-supported in the community and has the strength of growing
out of the larger and very well-known Hadoop project. It should serve
Cassandra well for the foreseeable future.
3. Hector
Hector is an open source project written in Java using the MIT license. It was
created by Ran Tavory of Outbrain (previously of Google) and is hosted at GitHub. It
was one of the early Cassandra clients and is used in production at Outbrain. It
wraps Thrift and offers JMX, connection pooling, and failover.
Hector is a well-supported and full-featured Cassandra client, with many
users and an active community. It offers the following:
• High-level object-oriented API
• Failover support
• Connection pooling
• JMX (Java Management eXtensions) support
4. Chirper
Chirper is a port of Twissandra to .NET, written by Chaker Nakhli. It’s
available under the Apache 2.0 license, and the source code is on GitHub
5. Chiton

Chiton is a Cassandra browser written by Brandon Williams that uses the
Python GTK framework.
6. Pelops
Pelops is a free, open-source Java client written by Dominic Williams. It is
similar to Hector in that it’s Java-based, but it was started more recently. This
has become a very popular client. Its goals include the following:
• To create a simple, easy-to-use client
• To completely separate concerns for data processing from lower-level
items such as connection pooling
• To act as a close follower to Cassandra so that it's readily up to date
7. Kundera
Kundera is an object-relational mapping (ORM) implementation for
Cassandra written using Java annotations.
8. Fauna
Ryan King of Twitter and Evan Weaver created a Ruby client for the Cassandra
database called Fauna.
