9 NoSQL Database

The document provides an overview of NoSQL databases, covering various data models, comparisons, and the CAP theorem which states that consistency, availability, and partition tolerance cannot all be achieved simultaneously in distributed systems. It discusses the characteristics of NoSQL databases, including key-value stores, document databases, and graph databases, emphasizing their scalability and flexibility compared to traditional relational databases. Additionally, it highlights the importance of NoSQL in managing large volumes of unstructured data generated by modern web applications.

Uploaded by

bharathkesav1275

NoSQL Databases - CO4, BTL4

As in Syllabus:
(1) NoSQL Data Models.
(2) Comparisons of various NoSQL Databases.
(3) CAP Theorem.
(4) Storage Layout.
(5) Query models.
(6) Key-Value Stores.
(7) Document-databases – Apache CouchDB, MongoDB.
(8) Column Oriented Databases – Google’s Big Table, Cassandra.

Text Book:
Andreas Meier, Michael Kaufmann, “SQL & NoSQL Databases: Models, Languages, Consistency
Options and Architectures for Big Data Management”, Springer Verlag 2019
Chapter 1, pp. 1 to 23, Chapter 4, pp. 135, Chapter 7, pp. 201 & Internet materials.
Relational Database Management System
Structured Query Language (SQL)
1.3 Big Data
The term Big Data is used to label large volumes of data that push the
limits of conventional software.

This data is usually unstructured (Sect. 5.1) and may originate from a wide
variety of sources: social media postings, e-mails, electronic archives with
multimedia content, search engine queries, document repositories of
content management systems, sensor data of various kinds, rate
developments at stock exchanges, traffic flow data and satellite images,
smart meters in household appliances, order, purchase, and payment
processes in online stores, e-health applications, monitoring systems, etc.
There is no binding definition for Big Data yet, but most data specialists will agree on three v’s:
(1) volume (extensive amounts of data),
(2) variety (multiple formats: structured, semi-structured, and unstructured data), and
(3) velocity (high-speed and real-time processing).
Gartner Group’s IT glossary offers the following definition:
1.4 NoSQL Databases
The term NoSQL is now used for any nonrelational data management approaches meeting two criteria:

• Firstly: The data is not stored in tables.

• Secondly: The database language is not SQL.

NoSQL is also sometimes interpreted as ‘Not only SQL’ to express that other technologies besides relational data technology are used in massively distributed web applications.

NoSQL technologies are especially necessary if the web service requires high availability.
The basic structure of a NoSQL database management system:

• NoSQL database management systems mostly use a massively distributed storage architecture.

• The actual data is stored in key-value pairs, columns or column families, document stores, or graphs.

• In order to ensure high availability and avoid outages in NoSQL database systems, various redundancy concepts are supported.

• The massively distributed and
• In NoSQL databases, the actual data is stored in key-value pairs, columns or column families, document stores, or graphs.

• Key-value stores are the simplest version. Data is stored as an identification key <key = “key”> and a list of values <value = “value 1”, “value 2”, …>.

Example: an online store with session management and a shopping basket. The session ID is the identification key; the individual items from the basket are stored as values in addition to the customer profile.

• In document stores, records are managed as documents within the NoSQL database.

• These documents are structured text files, e.g., in JSON or XML format, which can be searched for by a unique key or attributes

• The graph database consists of nodes (concepts, objects) and directed edges (relationships) connecting the nodes.
1.4.1 Graph-based Model

A graph abstractly presents the nodes and edges with their properties.

This Figure shows part of a movie collection as an example.

It contains the nodes MOVIE with attributes Title and Year (of release), GENRE with the respective Type (e.g., crime, mystery, comedy, drama, thriller, Western, science fiction, documentary, etc.), ACTOR with Name and Year of Birth, and DIRECTOR with Name and Nationality.
The example uses three directed edges:

• The edge ACTED_IN shows which artist from the ACTOR node starred in which film from the MOVIE node.
• This edge also has a property, the Role of the actor in the movie.
• The other two edges, HAS and DIRECTED_BY, go from the MOVIE node to the GENRE and DIRECTOR node, respectively.

In the manifestation level, i.e., the graph database, the property graph contains the concrete values
The property graph model for databases is formally based on graph theory.

Depending on their maturity, relevant software products may offer algorithms to calculate the following traits:

• Connectivity: A graph is connected when every node in the graph is connected to every other node by at least one path.

• Shortest path: The shortest path between two nodes of a graph is the one with the fewest edges.

• Nearest neighbor: In graphs with weighted edges (e.g., by distance or time in a transport network), the nearest neighbors of a node can be determined by finding the minimal intervals (shortest route in terms of distance or time).

• Matching: Matching in graph theory means finding a set of edges that have no common nodes.
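
Two of the traits listed above, connectivity and shortest path, can be sketched with a breadth-first search. This is only a minimal illustration on an unweighted, undirected graph; the graph `GRAPH` and its node names are made up and are not part of the slides.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency list (node -> neighbours).
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "F": [],  # isolated node, so the graph as a whole is not connected
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns a path with the fewest edges, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

def is_connected(graph):
    """An undirected graph is connected if every node is reachable from one start node."""
    if not graph:
        return True
    start = next(iter(graph))
    return all(shortest_path(graph, start, node) is not None for node in graph)

print(shortest_path(GRAPH, "A", "E"))
print(is_connected(GRAPH))
```

Graph database products typically ship optimized versions of such algorithms; the sketch only shows the underlying idea.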
1.4.2 Graph Query Language Cypher
Cypher is Neo4j’s graph query language that lets you retrieve data from the graph.
Cypher is a declarative query language for extracting patterns from graph databases.

Users define their query by specifying nodes and edges.

The database management system then calculates all patterns meeting the criteria by
analyzing the possible paths (connections between nodes via edges).
This graph shows a segment of a graph database on movies and actors.

To keep things simple, only two types of node are shown: ACTOR and MOVIE.

ACTOR nodes contain two attribute-value pairs, specifically (Name: FirstName LastName) and (YearOfBirth: Year).

Edges: The ACTED_IN relationship represents which actors starred in which movies. Edges can also have properties if attribute-value pairs are added to them. For the ACTED_IN relationship, the respective roles of the actors in the movies are listed.

Activity: Learn Neo4j and Cypher combination
Nodes can be connected by multiple relationship edges. The movie ‘Man of Tai Chi’ and actor Keanu Reeves are linked not only by the actorʼs role (ACTED_IN), but also by the director position (DIRECTED_BY).

The diagram therefore shows that Keanu Reeves both directed the movie ‘Man of Tai Chi’ and starred in it as Donaka Mark.

If we want to analyze this graph database on movies, we can use Cypher.

It uses the following basic query elements:
• MATCH: Specification of nodes and edges, as well as declaration of search patterns.
• WHERE: Conditions for filtering results.
• RETURN: Specification of the desired search result, aggregated if necessary

For instance, the Cypher query to find the year the movie ‘The Matrix’ was released would be:

MATCH (m: Movie {Title: “The Matrix”}) RETURN m.Year
This query sends out the variable m for the movie ‘The Matrix’ to return the movieʼs year of release by m.Year.

In Cypher, parentheses always indicate nodes, i.e., (m: Movie) declares the control variable m for the MOVIE node.

In addition to control variables, individual attribute-value pairs can be included in curly brackets.

Since we are specifically interested in the movie ‘The Matrix’, we can add {Title: “The Matrix”} to the node (m: Movie).
Queries regarding the relationships within the graph database are a bit more complicated.

Relationships between two arbitrary nodes (a) and (b) are expressed in Cypher by the arrow symbol “->”, i.e., the path from (a) to (b) is declared as “(a)->(b)”.

If the specific relationship between (a) and (b) is of importance, the edge [r] can be inserted in the middle of the arrow.

To find out who played Neo in ‘The Matrix’, we use the following query to analyze the ACTED_IN path between ACTOR and MOVIE:

Cypher will return the result Keanu Reeves.

For a list of movie titles (m), actor names (a), and respective roles (r), the query would have to be:

The result: ‘Man of Tai Chi’ with actor Keanu Reeves in the role of Donaka Mark and the movie ‘The Matrix’ with Keanu Reeves as Neo.

To limit the above search only to ‘Keanu Reeves’, the below query is used
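
The idea behind pattern matching along the ACTED_IN path can be imitated in plain Python. The sketch below is not Cypher and not the Neo4j API; the dictionary layout and the helper `match` are assumptions made for illustration, and the release years are not stated in the slides and are filled in only as sample data.

```python
# Hypothetical in-memory property graph mirroring the movie example.
# Names, titles, and roles come from the text; the layout is an assumption.
nodes = {
    "keanu": {"label": "Actor", "Name": "Keanu Reeves"},
    "matrix": {"label": "Movie", "Title": "The Matrix", "Year": 1999},
    "taichi": {"label": "Movie", "Title": "Man of Tai Chi", "Year": 2013},
}
# Each edge: (source node, relationship type, target node, edge properties).
edges = [
    ("keanu", "ACTED_IN", "matrix", {"Role": "Neo"}),
    ("keanu", "ACTED_IN", "taichi", {"Role": "Donaka Mark"}),
]

def match(src_label, edge_type, dst_label, where=lambda a, r, b: True):
    """Imitates the pattern (a:src_label)-[r:edge_type]->(b:dst_label) with a filter."""
    results = []
    for src, etype, dst, props in edges:
        a, b = nodes[src], nodes[dst]
        if (etype == edge_type and a["label"] == src_label
                and b["label"] == dst_label and where(a, props, b)):
            results.append((a, props, b))
    return results

# Who played Neo in 'The Matrix'?
neo = [a["Name"] for a, r, m in match(
    "Actor", "ACTED_IN", "Movie",
    where=lambda a, r, m: m["Title"] == "The Matrix" and r["Role"] == "Neo")]

# Movie titles, actor names, and roles for every ACTED_IN edge.
roles = [(m["Title"], a["Name"], r["Role"])
         for a, r, m in match("Actor", "ACTED_IN", "Movie")]

print(neo)
print(roles)
```

The `where` callback plays the part of the WHERE clause, and the list comprehensions play the part of RETURN.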


CAP Theorem for Massive Distributed Data
Chapter 4.3, page 134
For large and distributed data storage systems, consistency cannot always
be the primary goal; sometimes availability and partition tolerance
take priority.
In relational database systems, transactions are always atomic,
consistent, isolated, and durable (ACID).

Web-based applications, on the other hand, are geared towards high availability and the ability to continue working if a computer node or a network connection fails.

Such partition tolerant systems use replicated computer nodes and a softer consistency requirement called BASE (basically available, soft state, eventually consistent).

This allows replicated computer nodes to temporarily hold diverging data versions and only be updated with a delay.
During a symposium in 2000, Eric Brewer of the University of California, Berkeley, presented the hypothesis that the three properties of consistency, availability, and partition tolerance cannot exist simultaneously in a massive distributed computer system.

This hypothesis was later proven by researchers at MIT and established as the CAP theorem.
The CAP theorem states that in any massive distributed data management
system, only two of the three properties consistency, availability, and
partition tolerance can be ensured.
In short, massive distributed systems can have a combination of
either consistency and availability (CA), consistency and partition
tolerance (CP), or availability and partition tolerance (AP), but it is
impossible to have all three at once.
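
The trade-off during a network partition can be illustrated with a toy two-replica cluster. This is a deliberately simplified sketch, not how any real database implements it; the class names, the `mode` switch, and the naive merge in `heal` are all assumptions.

```python
class Replica:
    """One node holding a copy of the data."""
    def __init__(self):
        self.data = {}

class Cluster:
    """Toy two-replica cluster; mode is 'CP' or 'AP' (a simplification)."""
    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, key, value):
        """Returns True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False           # refuse: stay consistent, give up availability
            replica.data[key] = value  # AP: accept locally, replicas diverge
            return True
        replica.data[key] = value
        other.data[key] = value        # no partition: replicate immediately
        return True

    def heal(self):
        """End the partition; on conflicts replica a's values win (naive merge)."""
        self.partitioned = False
        self.b.data.update(self.a.data)
        self.a.data.update(self.b.data)

cp = Cluster("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
print(cp.write(cp.a, "x", 2))    # write is rejected while partitioned

ap = Cluster("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap.write(ap.a, "x", 2)
ap.write(ap.b, "x", 3)
print(ap.a.data, ap.b.data)      # diverging versions during the partition
ap.heal()
print(ap.a.data, ap.b.data)      # consistent again after the delay
```

The AP cluster shows the BASE behaviour described earlier: replicas temporarily hold diverging versions and converge only after the partition ends.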
Use cases of the CAP theorem may include:
NoSQL Database
Chapter 7, page 201
The term NoSQL was first used in 1998 for a database that (although relational) did not have an SQL interface.

It became of growing importance during the 2000s, especially with the rapid expansion of the internet.

The growing popularity of global web services saw an increase in the use
of web-scale databases, since there was a need for data management
systems that could handle the enormous amounts of data generated by
web services.
Relational/SQL database systems are much more than mere data storage systems.

They provide a large degree of processing logic:
• Powerful declarative language constructs
• Schemas and metadata
• Consistency assurance
• Referential integrity and triggers
• Recovery and logging
• Multi-user operation and synchronization
• Users, roles, and security
• Indexing

• These SQL functionalities offer numerous benefits regarding data consistency and security.
• This goes to show that SQL databases are mainly designed for integrity and transaction protection, as required in banking applications or insurance software, among others.
• However, since data integrity control requires much work and processing power, relational databases quickly reach their limits with large amounts of data.
• The powerfulness of the database management system is disadvantageous for efficiency and performance, as well as for flexibility in data processing.
NoSQL databases have the following properties

Core NoSQL technologies are:
• Key-value stores
• Column family databases
• Document stores
• Graph databases

Figures: Massively distributed key-value store with sharding and hash-based key distribution. Storing data in the Bigtable model. Illustration of an XML document represented by tables. Example of a document store. Example of a graph database with user data of a website.
Core NoSQL technologies are:
• Key-value stores
• Column family databases
• Document stores
• Graph databases

• These four database models, also called core NoSQL models, are discussed in this chapter.
• Other types of NoSQL databases fall in the category of Soft NoSQL, e.g., object databases, grid databases, and the family of XML databases
Key-Value Store
The simplest way of storing data is assigning a value to a variable or a key.

The simplest database model possible is data storage that stores a data object as a value for another data object as key.

In key-value stores, a specific value can be stored for any key with a simple command, such as SET.

Below is an example in which data for users of a website is stored: first name, last name, e-mail, and encrypted password.

Data objects can be retrieved with a simple query using the key, with the command GET.

For instance, the value John is stored for the key User:U17547:firstname.
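
The SET and GET commands can be mimicked with a thin wrapper around a Python dictionary. This is a minimal sketch, not the command set of any particular product; the class `KeyValueStore` is hypothetical, and apart from the key User:U17547:firstname with the value John, all stored values are invented.

```python
class KeyValueStore:
    """Schema-less store: any key can be assigned any value at any time."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        """SET: store the value under the key, overwriting any earlier value."""
        self._data[key] = value

    def get(self, key):
        """GET: return the value for the key, or None if the key is unknown."""
        return self._data.get(key)

db = KeyValueStore()
db.set("User:U17547:firstname", "John")               # key and value from the text
db.set("User:U17547:lastname", "Doe")                 # hypothetical value
db.set("User:U17547:email", "john.doe@example.org")   # hypothetical value
db.set("User:U17547:pwhash", "<encrypted password>")  # placeholder, not a real hash

print(db.get("User:U17547:firstname"))
```

Note that the user ID and attribute name are encoded in the key itself; the store knows nothing about "users" or "columns".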
Additional Examples
A database is a key-value store if it has the following properties:
• There is a set of identifying data objects, the keys.
• For each key, there is exactly one associated descriptive data object, the value for that key.
• Specifying a key allows querying the associated value in the database.

Key-value stores have seen a large increase in popularity as part of the NoSQL trend, since they are scalable for huge amounts of data.

It is possible to write and read extensive amounts of data efficiently.

• Key-value stores are schema-less, i.e., data objects can be stored at any time and in arbitrary formats, without the need for any metadata objects such as tables or columns to be defined beforehand.
• Going without a schema makes key-value stores easy to partition and flexible.

Processing speed can be enhanced even further if the key-value pairs are buffered in the main memory of the database. Such setups are called in-memory databases.
• There is almost no limit to increasing a key-value storeʼs scalability with fragmentation or sharding of the data content.

• Partitioning is rather easy in key-value stores, due to the simple model.

• Individual computers within the cluster, called shards, take on only a part of the keyspace.

• This allows for the distribution of the database onto a large number of individual machines.

• The keys are usually distributed according to the principles of consistent hashing.
The Figure shows a distributed architecture for a key-value store (massively distributed key-value store with sharding and hash-based key distribution):

• A numerical value (hash) is generated from a key; using the modulo operator, this value can now be positioned on a defined number of address spaces (hash slots) in order to determine on which shard within the distributed architecture the value for the key will be stored.

• The distributed database can also be copied to additional computers and updated there to improve partition tolerance, a process called replication.

• The original data content in the master cluster is synchronized with multiple replicated data sets, the slave clusters.

• The master cluster contains three computers (shards A, B, and C).
• The data is kept directly in the main memory (RAM) to reduce response times.
• The data content is replicated to a slave cluster for permanent storage on a hard drive.
• Another slave cluster further increases performance by providing another replicated computer cluster for complex queries and analyses.
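
The hash-slot placement and the replication step can be sketched as follows. The sketch uses plain modulo hashing with CRC32, which is an arbitrary choice and simpler than a full consistent-hashing ring; the slot count, the dictionaries `master` and `replica`, and the helper names are assumptions.

```python
import zlib

HASH_SLOTS = 12            # hypothetical number of hash slots
SHARDS = ["A", "B", "C"]   # the three master shards from the figure

def slot_for(key: str) -> int:
    """Hash the key (CRC32 here, an arbitrary choice) and map it to a slot via modulo."""
    return zlib.crc32(key.encode("utf-8")) % HASH_SLOTS

def shard_for(key: str) -> str:
    """Assign contiguous slot ranges to shards: 0-3 -> A, 4-7 -> B, 8-11 -> C."""
    return SHARDS[slot_for(key) * len(SHARDS) // HASH_SLOTS]

master = {name: {} for name in SHARDS}    # in-memory master cluster
replica = {name: {} for name in SHARDS}   # replicated copy (a "slave cluster" in the text)

def set_value(key: str, value: str) -> None:
    """Write to the responsible master shard and copy the entry to its replica."""
    shard = shard_for(key)
    master[shard][key] = value
    replica[shard][key] = value           # immediate replication, for simplicity

def get_value(key: str):
    """Read from the master shard that owns the key's hash slot."""
    return master[shard_for(key)].get(key)

set_value("User:U17547:firstname", "John")
print(shard_for("User:U17547:firstname"), get_value("User:U17547:firstname"))
```

Because the shard is computed from the key alone, any client can locate a value without a central lookup table.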
Examples of Popular Key-Value Databases
Column family databases
Even though key-value stores are able to process large amounts of data efficiently, their structure is still quite elementary.

Often, the data needs to be structured with a schema.

Column-family stores enhance the key-value concept accordingly by providing additional structure.

For read-intensive databases, especially those that aggregate a few columns over many rows, storing the data in column-major order is better than storing it in row-major order.

Therefore, in order to optimize access, it is useful to structure the data in such groups of columns (column families) as storage units.

Column-family store - store data not in
Look at the above image: to perform a query like "average price" for all dates, a row-oriented database has to read over large areas, whereas in a column-oriented database the prices are stored as one sequential region and we can read just that region.

Column-oriented databases are therefore extremely quick at aggregate queries (sum, average, min, max, etc.).
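
The difference between the two layouts can be sketched in a few lines. The price table below is invented, and Python lists only imitate what a storage engine does with disk pages or memory blocks.

```python
# Hypothetical price table with the columns date, symbol, price.
rows = [
    ("2023-06-01", "ABC", 10.0),
    ("2023-06-02", "ABC", 12.0),
    ("2023-06-03", "ABC", 14.0),
]

# Row-major layout: records are stored one after another, fields interleaved.
row_major = [field for row in rows for field in row]

# Column-major layout: each column is stored as one contiguous sequence.
column_major = {
    "date": [r[0] for r in rows],
    "symbol": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

def average_price_row_major(flat, width=3, price_index=2):
    """Strides across the interleaved records to pick out every price field."""
    prices = flat[price_index::width]
    return sum(prices) / len(prices)

def average_price_column_major(columns):
    """Reads only the contiguous price column."""
    prices = columns["price"]
    return sum(prices) / len(prices)

print(average_price_row_major(row_major))
print(average_price_column_major(column_major))
```

Both functions return the same average; the column-major version simply never has to step over dates and symbols to reach the prices.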
Examples of Column Store DBMSs
• Bigtable
• Cassandra
• HBase
• Vertica
• Druid
• Accumulo
• Hypertable
Column family databases: Bigtable
Google presented its Bigtable database model for the distributed storage of
structured data, and it influenced the development of column-family stores.
Google Bigtable is a distributed, column-oriented data store created by Google Inc. to handle very
large amounts of structured data associated with the company's Internet search and Web services
operations.
It has the following
properties:
Bigtable stores data in massively scalable tables, each of which is a sorted key/value map.

The table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row.

Each row is indexed by a single row key.
Columns that are related to one another are typically grouped together into a column family.

Each column is identified by a combination of the column family and a column qualifier, which is a
unique name within the column family.

Each row/column intersection can contain multiple cells.

Each cell contains a unique timestamped version of the data for that row and column.

Storing multiple cells in a column provides a record of how the stored data for that row and column
has changed over time.

Bigtable tables are sparse; if a column is not used in a particular row, it does not take up any space.
An example…
Additional Examples….
In Bigtable, a table has three dimensions: It maps an entry of the database for one row and one column at a certain time as a string:

The storage unit addressed with a certain combination of row key, column key, and time stamp is called a cell.
It has the following
properties:
• The figure summarizes another example for how data is stored in the Bigtable model.

• A data cell is addressed with row key and column key.

• In the given example, there is one row key per user.

• The content is additionally historicized with a time stamp.

• Several columns are grouped into column families:
(1) The columns Mail, Name, and Phone form the column family Contact.
(2) Access data, such as user names and passwords, could be stored in the column family Access.

• In the example the row U17547 contains a value for the column Contact:Mail, but not for the column Contact:Phone.
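
The addressing of cells by row key, column key, and time stamp can be sketched with nested dictionaries. This is only an illustration of the data model, not the Bigtable or Cassandra API; the helpers `put` and `get` and all stored values are hypothetical.

```python
from collections import defaultdict

# table[row_key][column_key] -> {timestamp: value}
# A column key has the form "family:qualifier", e.g. "Contact:Mail".
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, column_key, timestamp, value):
    """Store one timestamped version of a cell."""
    table[row_key][column_key][timestamp] = value

def get(row_key, column_key, timestamp=None):
    """Return the cell at the given timestamp, or the newest version if omitted."""
    versions = table.get(row_key, {}).get(column_key, {})
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)
    return versions.get(timestamp)

put("U17547", "Contact:Mail", 1, "john@example.org")      # hypothetical values
put("U17547", "Contact:Mail", 2, "john.doe@example.org")  # newer version, same cell
put("U17547", "Contact:Name", 1, "John Doe")

print(get("U17547", "Contact:Mail"))       # newest version
print(get("U17547", "Contact:Mail", 1))    # historic version
print(get("U17547", "Contact:Phone"))      # unused column: nothing is stored
```

The missing Contact:Phone entry illustrates sparseness: an unused column takes up no space in the row.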
• Read this website – for column-family store
https://database.guide/what-is-a-column-store-database/
• Read this website – for Cassandra
https://www.tutorialspoint.com/cassandra/index.htm
Document Stores
• Document stores store structured data in records which are called documents.

• Document stores are completely schema-free, i.e., there is no need to define a schema before inserting data structures.

• The absence of a schema allows for extreme flexibility in storing a wide range of data and also facilitates fragmentation of the data.

• On the first level, document stores are a kind of key-value store.

• For every key (document ID), a record can be stored as value. These records are called documents.

• On the second level, these documents have their own internal structure.

Figure 7.3 shows a sample document store D_USERS that stores data on the users of a website.

For every user key with the attribute _id, an object containing all user information, such as

The visitHistory attribute holds a nested attribute value as an associative array, which again contains key-value pairs.

This nested structure lists the date of the last visit to the website as the associated value.
To summarize, a document store is a database management system with the following properties:

• It is a key-value store.

• The data objects stored as values for keys are called documents; the keys are used for identification.

• The documents contain data structures in the form of recursively nested attribute value pairs without referential integrity.

• These data structures are schema-free, i.e., arbitrary attributes can be used in every document without defining a schema
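
These properties can be sketched with nested Python dictionaries serialized as JSON. The sketch is not the MongoDB or CouchDB API; the helpers `insert` and `find` are hypothetical, and apart from the names D_USERS, _id, and visitHistory, all attributes and values are invented.

```python
import json

# D_USERS maps a document ID (_id) to a schema-free, nested document.
D_USERS = {}

def insert(doc):
    """First level: the document is stored as the value for its key (_id)."""
    D_USERS[doc["_id"]] = doc

def find(**criteria):
    """Second level: search by top-level attributes inside the documents."""
    return [d for d in D_USERS.values()
            if all(d.get(k) == v for k, v in criteria.items())]

# Two documents with different attributes; no schema was declared beforehand.
insert({"_id": "U17547", "firstName": "John",
        "visitHistory": {"index": "2023-06-01"}})   # nested key-value pairs
insert({"_id": "U17548", "firstName": "Jane", "city": "Basel"})

print(json.dumps(D_USERS["U17547"], indent=2))
print([d["_id"] for d in find(firstName="John")])
```

Unlike a pure key-value store, the database can look inside the value, which is what makes attribute queries such as `find(city="Basel")` possible.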
Document Stores

Document-databases
• Apache CouchDB
• MongoDB
DBs to Practice…
• Object Relational Database – SQL

• NoSQL databases
• Document Store – MongoDB

• Columnar DB – Cassandra

• Graph Database - Neo4j -Cypher


Course Outcomes (COs) and plan for remaining days:
CO1: Understand and analyse the RDBMS and its internal organization. Plan: << Completed >>, except (1) Indexing and (2) Database Tuning.
CO2: Apply algorithms for query processing and optimization. Plan: << Completed >>
CO3: Apply transaction processing and concurrency control techniques for real-world applications. Plan: << Completed >>
CO4: Understand and apply the design of Object relational and NoSQL databases. Plan: << Completed >>. One quiz/tutorial on Object relational model and NoSQL data model – 13th June 2023.
CO5: Understand and implement solutions on big data and graph databases. Plan: << Completed >>, except MongoDB and Python.
One lab evaluation on graph database – Neo4j with Cypher, MongoDB, Object relational model – 21st June 2023.
Final Project Review – 28th June 2023. (ModernDB concept should have been implemented in the project)
• Thank you…
