DBMS-unit 5-Nosql Databases

anna university

NOSQL

Emergence of NOSQL Systems


Many companies and organizations are faced with applications that store vast amounts of data.
Consider a free e-mail application such as Google Mail or Yahoo Mail or another similar service:
this application can have millions of users, and each user can have thousands of e-mail messages.
There is a need for a storage system that can manage all these e-mails; a structured relational
SQL system may not be appropriate because
(1) SQL systems offer too many services (powerful query language, concurrency control, etc.),
which this application may not need; and
(2) A structured data model such as the traditional relational model may be too restrictive.

Some of the organizations that were faced with these data management and storage applications
decided to develop their own systems:
■ Google developed a proprietary NOSQL system known as BigTable, which is used in
many of Google’s applications that require vast amounts of data storage, such as Gmail,
Google Maps, and Web site indexing. Apache Hbase is an open source NOSQL system
based on similar concepts. Google’s innovation led to the category of NOSQL systems
known as column-based or wide column stores; they are also sometimes referred to
as column family stores.
■ Amazon developed a NOSQL system called DynamoDB that is available through
Amazon’s cloud services. This innovation led to the category known as key-value data
stores or sometimes key-tuple or key-object data stores.
■ Facebook developed a NOSQL system called Cassandra, which is now open source and
known as Apache Cassandra. This NOSQL system uses concepts from both key-value
stores and column-based systems.
■ Other software companies started developing their own solutions and making them
available to users who need these capabilities, for example MongoDB and CouchDB,
which are classified as document-based NOSQL systems or document stores.
■ Another category of NOSQL systems is the graph-based NOSQL systems, or graph
databases; these include Neo4j and GraphBase, among others.
Characteristics of NOSQL Systems
We divide the characteristics into two categories—those related to distributed databases and
distributed systems, and those related to data models and query languages.
NOSQL characteristics related to distributed databases and distributed systems. NOSQL
systems emphasize high availability, so replicating the data is inherent in many of these systems.
Scalability is another important characteristic, because many of the applications that use
NOSQL systems tend to have data that keeps growing in volume. High performance is another
required characteristic, whereas serializable consistency may not be as important for some of
the NOSQL applications.
1. Scalability: There are two kinds of scalability in distributed systems: horizontal and
vertical. In NOSQL systems, horizontal scalability is generally used, where the
distributed system is expanded by adding more nodes for data storage and processing as
the volume of data grows. Vertical scalability, on the other hand, refers to expanding the
storage and computing power of existing nodes.
2. Availability, Replication and Eventual Consistency: Many applications that use
NOSQL systems require continuous system availability. To accomplish this, data is
replicated over two or more nodes in a transparent manner, so that if one node fails, the
data is still available on other nodes. Replication improves data availability and can also
improve read performance, because read requests can often be serviced from any of the
replicated data nodes. However, writes become more cumbersome because an update must be
applied to every copy of the replicated data item; this can slow down writes if serializable
consistency is required. Many NOSQL applications do not require serializable consistency, so
a more relaxed form of consistency known as eventual consistency is used.
3. Replication Models: Two major replication models are used in NOSQL systems: master-slave
and master-master replication. Master-slave replication requires one copy to be the
master copy; all write operations must be applied to the master copy and then propagated
to the slave copies, usually using eventual consistency (the slave copies will eventually be
the same as the master copy). The master-master replication allows reads and writes at
any of the replicas but may not guarantee that reads at nodes that store different copies
see the same values. Different users may write the same data item concurrently at
different nodes of the system, so the values of the item will be temporarily inconsistent.
A reconciliation method to resolve conflicting write operations on the same data item at
different nodes must be implemented as part of the master-master replication scheme.
4. Sharding of Files: In many NOSQL applications, files (or collections of data objects) can
have many millions of records (or documents or objects), and these records can be
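accessed concurrently by thousands of users. So it is not practical to store the whole file
in one node. Instead, the records of the file are sharded (horizontally partitioned) across
multiple nodes, which distributes both the storage and the access load.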
5. High-Performance Data Access: In many NOSQL applications, it is necessary to find
individual records or objects (data items) from among the millions of data records or
objects in a file. To achieve this, most systems use one of two techniques: hashing or
range partitioning on object keys. The majority of accesses to an object will be by
providing the key value rather than by using complex query conditions. In hashing, a
hash function h(K) is applied to the key K, and the location of the object with key K is
determined by the value of h(K). In range partitioning, the location is determined via a
range of key values; for example, location i would hold the objects whose key values K are
in the range $K_{i,\min} \le K \le K_{i,\max}$.
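The two placement techniques can be sketched as follows. This is only an illustration under stated assumptions: the cluster size NUM_NODES, the split points in LOWER_BOUNDS, and the use of MD5 as h(K) are all hypothetical choices, and the ranges are modeled as half-open intervals (lower bound inclusive) rather than the closed intervals written above.

```python
import bisect
import hashlib

NUM_NODES = 4  # assumed cluster size, for illustration only


def h(key: str) -> int:
    """Deterministic hash h(K) of an object key (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def hash_location(key: str, num_nodes: int = NUM_NODES) -> int:
    """Hashing: the location is determined by the value of h(K)."""
    return h(key) % num_nodes


# Range partitioning: location i holds the keys K with
# LOWER_BOUNDS[i] <= K < LOWER_BOUNDS[i + 1] (hypothetical split points).
LOWER_BOUNDS = ["a", "h", "o", "u"]


def range_location(key: str) -> int:
    """Range partitioning: find the location whose key range contains K."""
    i = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    return max(i, 0)  # keys below the first bound are assigned to location 0
```

Hashing spreads keys evenly but scatters neighboring keys, whereas range partitioning keeps keys that are close in sort order on the same location, which helps when a query asks for a range of key values.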
NOSQL characteristics related to data models and query languages. NOSQL systems
emphasize performance and flexibility over modeling power and complex querying.
1. Not Requiring a Schema: The flexibility of not requiring a schema is achieved in many
NOSQL systems by allowing semi-structured, self-describing data. The users can
specify a partial schema in some systems to improve storage efficiency, but it is not
required to have a schema in most of the NOSQL systems. As there may not be a schema
to specify constraints, any constraints on the data would have to be programmed in the
application programs that access the data items. There are various languages for
describing semistructured data, such as JSON (JavaScript Object Notation) and XML
(Extensible Markup Language).
2. Less Powerful Query Languages: Many applications that use NOSQL systems may not
require a powerful query language such as SQL, because search (read) queries in these
systems often locate single objects in a single file based on their object keys. NOSQL
systems typically provide a set of functions and operations as a programming API
(application programming interface), so reading and writing the data objects is
accomplished by calling the appropriate operations by the programmer. In many cases,
the operations are called CRUD operations, for Create, Read, Update, and Delete. In
other cases, they are known as SCRUD because of an added Search (or Find) operation.
Some NOSQL systems also provide a high-level query language, but it may not have the
full power of SQL; only a subset of SQL querying capabilities would be provided. In
particular, many NOSQL systems do not provide join operations as part of the query
language itself; the joins need to be implemented in the application programs.
3. Versioning: Some NOSQL systems provide storage of multiple versions of the data
items, with the timestamps of when the data version was created.
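As a rough sketch of these ideas, the following toy in-memory collection offers CRUD operations plus a Search operation over schemaless, self-describing records, and shows a join carried out in the application program. The class Collection, its method names, and the sample employees and departments are hypothetical and do not correspond to the API of any particular NOSQL system.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Document = Dict[str, Any]  # a self-describing record: attribute names travel with the data


class Collection:
    """Toy schemaless collection accessed by object key (hypothetical API)."""

    def __init__(self) -> None:
        self._items: Dict[str, Document] = {}

    def create(self, key: str, doc: Document) -> None:
        if key in self._items:
            raise KeyError(f"duplicate key: {key}")
        self._items[key] = dict(doc)

    def read(self, key: str) -> Optional[Document]:
        doc = self._items.get(key)
        return dict(doc) if doc is not None else None

    def update(self, key: str, changes: Document) -> None:
        self._items[key].update(changes)

    def delete(self, key: str) -> None:
        self._items.pop(key, None)

    def search(self, predicate: Callable[[Document], bool]) -> List[Document]:
        """The added S of SCRUD: a scan with an application-supplied condition."""
        return [dict(d) for d in self._items.values() if predicate(d)]


def employee_department_names(emps: Collection, depts: Collection) -> List[Tuple[str, str]]:
    """A join written in the application, since the store offers no join operation."""
    result = []
    for emp in emps.search(lambda d: "dno" in d):
        dept = depts.read(emp["dno"])  # one key lookup per employee
        if dept is not None:
            result.append((emp["name"], dept["name"]))
    return result


departments = Collection()
employees = Collection()
departments.create("d5", {"name": "Research"})
employees.create("e1", {"name": "Smith", "dno": "d5"})
# No schema: this record simply carries an extra multivalued attribute.
employees.create("e2", {"name": "Wong", "dno": "d5", "phones": ["555-0101"]})
```

Because no schema is declared, nothing stops a record from omitting dno or storing it with the wrong type; any such constraint would have to be checked in the application code, as noted above.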

Categories of NOSQL Systems


NOSQL systems have been characterized into four major categories, with some additional
categories that encompass other types of systems. The most common categorization lists the
following four major categories:
1. Document-based NOSQL systems: These systems store data in the form of
documents using well known formats, such as JSON (JavaScript Object Notation).
Documents are accessible via their document id, but can also be accessed rapidly using
other indexes.
2. NOSQL key-value stores: These systems have a simple data model based on fast
access by the key to the value associated with the key; the value can be a record or an
object or a document or even have a more complex data structure.
3. Column-based or wide column NOSQL systems: These systems partition a table by
column into column families (a form of vertical partitioning), where each column
family is stored in its own files. They also allow versioning of data values.
4. Graph-based NOSQL systems: Data is represented as graphs, and related nodes can
be found by traversing the edges using path expressions.
Additional categories can be added as follows to include some systems that are not easily
categorized into the above four categories, as well as some other types of systems that
have been available even before the term NOSQL became widely used.
5. Hybrid NOSQL systems: These systems have characteristics from two or more of
the above four categories.
6. Object databases:
7. XML databases:

The CAP Theorem


When we discussed concurrency control in distributed databases, we assumed that the distributed
database system (DDBS) is required to enforce the ACID properties (atomicity, consistency,
isolation, durability) of transactions that are running concurrently. In a system with data
replication, concurrency control becomes more complex because there can be multiple copies of
each data item. So if an update is applied to one copy of an item, it must be applied to all other
copies in a consistent manner. The possibility exists that one copy of an item X is updated by a
transaction T1 whereas another copy is updated by a transaction T2, so two inconsistent copies of
the same item exist at two different nodes in the distributed system. If two other transactions T3
and T4 want to read X, each may read a different copy of item X.
The CAP theorem, which was originally introduced as the CAP principle, can be used to explain
some of the competing requirements in a distributed system with replication. The three letters in
CAP refer to three desirable properties of distributed systems with replicated data: consistency
(among replicated copies), availability (of the system for read and write operations) and partition
tolerance (in the face of the nodes in the system being partitioned by a network fault).
Consistency means that the nodes will have the same copies of a replicated data item visible for
various transactions.
Availability means that each read or write request for a data item will either be processed
successfully or will receive a message that the operation cannot be completed.
Partition tolerance means that the system can continue operating if the network connecting the
nodes has a fault that results in two or more partitions, where the nodes in each partition can only
communicate among each other.
The CAP theorem states that it is not possible to guarantee all three of the desirable properties—
consistency, availability, and partition tolerance—at the same time in a distributed system with
data replication. If this is the case, then the distributed system designer would have to choose two
properties out of the three to guarantee.
It is generally assumed that in many traditional (SQL) applications, guaranteeing consistency
through the ACID properties is important. On the other hand, in a NOSQL distributed data store,
a weaker consistency level is often acceptable, and guaranteeing the other two properties
(availability, partition tolerance) is important. Hence, weaker consistency levels are often used in
NOSQL systems instead of guaranteeing serializability. In particular, a form of consistency known
as eventual consistency is often adopted in NOSQL systems.
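The trade-off can be made concrete with a small simulation of one data item X replicated on two nodes. This is a toy model and not a real replication protocol: the class names, the "CP"/"AP" mode flag, and the single shared logical clock used to pick the newest write are all simplifying assumptions. In "CP" mode a write is refused while the network is partitioned, so the copies never diverge; in "AP" mode the write is accepted on the reachable copy, and the copies converge only after the partition heals.

```python
from typing import List, Optional


class Replica:
    """One copy of data item X, tagged with the logical time of its last write."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value: Optional[str] = None
        self.version = 0


class ReplicatedItem:
    """Data item X replicated on two nodes (toy model)."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas: List[Replica] = [Replica("n1"), Replica("n2")]
        self.partitioned = False
        self._clock = 0  # assumption: one shared logical clock orders all writes

    def write(self, node: int, value: str) -> bool:
        if self.partitioned and self.mode == "CP":
            return False  # refuse: the update cannot reach every copy
        self._clock += 1
        # During a partition only the local copy is reachable.
        targets = [self.replicas[node]] if self.partitioned else self.replicas
        for replica in targets:
            replica.value, replica.version = value, self._clock
        return True

    def read(self, node: int) -> Optional[str]:
        return self.replicas[node].value

    def heal(self) -> None:
        """Reconnect the nodes and reconcile: the most recent write wins."""
        self.partitioned = False
        newest = max(self.replicas, key=lambda r: r.version)
        for replica in self.replicas:
            replica.value, replica.version = newest.value, newest.version
```

The reconciliation step in heal() is what makes the "AP" variant eventually consistent: reads at the two nodes may disagree during the partition, but they agree again once the copies have been reconciled.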

NOSQL Key-Value Stores


Key-value stores focus on high performance, availability, and scalability by storing data in a
distributed storage system. The data model used in key-value stores is relatively simple, and in
many of these systems, there is no query language but rather a set of operations that can be used
by the application programmers. The key is a unique identifier associated with a data item and is
used to locate this data item rapidly. The value is the data item itself, and it can have very different
formats for different key-value storage systems. The main characteristic of key-value stores is the
fact that every value (data item) must be associated with a unique key, and that retrieving the
value by supplying the key must be very fast.
There are many systems that fall under the key-value store label, so rather than provide a lot of
details on one particular system, we will give a brief introductory overview for some of these
systems and their characteristics.

DynamoDB Overview
The DynamoDB system is an Amazon product and is available as part of Amazon’s AWS/SDK
platforms (Amazon Web Services/Software Development Kit). It can be used as part of Amazon’s
cloud computing services, for the data storage component.
DynamoDB data model. The basic data model in DynamoDB uses the concepts of tables, items,
and attributes. A table in DynamoDB does not have a schema; it holds a collection of self-
describing items. Each item will consist of a number of (attribute, value) pairs, and attribute
values can be single-valued or multivalued. So basically, a table will hold a collection of items,
and each item is a self-describing record (or object). DynamoDB also allows the user to specify
the items in JSON format, and the system will convert them to the internal storage format of
DynamoDB.
When a table is created, it is required to specify a table name and a primary key; the primary key
will be used to rapidly locate the items in the table. Thus, the primary key is the key and the item
is the value for the DynamoDB key-value store. The primary key attribute must exist in every
item in the table. The primary key can be one of the following two types:
■ A single attribute. The DynamoDB system will use this attribute to build a hash index
on the items in the table. This is called a hash type primary key. The items are not
ordered in storage on the value of the hash attribute.
■ A pair of attributes. This is called a hash and range type primary key. The primary
key will be a pair of attributes (A, B): attribute A will be used for hashing, and because
there will be multiple items with the same value of A, the B values will be used for
ordering the records with the same A value. A table with this type of key can have
additional secondary indexes defined on its attributes.
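The hash and range type primary key can be pictured with a small in-memory model: items are grouped by the value of the hash attribute A, and within each group they are kept ordered on the range attribute B. The class HashRangeTable and its methods are hypothetical names used for illustration; this is not the AWS SDK, and real DynamoDB tables are distributed over many storage nodes.

```python
import bisect
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Item = Dict[str, Any]  # a self-describing item: a set of (attribute, value) pairs


class HashRangeTable:
    """Toy table with a (hash attribute A, range attribute B) primary key."""

    def __init__(self, hash_attr: str, range_attr: str) -> None:
        self.hash_attr = hash_attr
        self.range_attr = range_attr
        # value of A -> list of (value of B, item), kept sorted on B
        self._buckets: Dict[Any, List[Tuple[Any, Item]]] = defaultdict(list)

    def put_item(self, item: Item) -> None:
        # A KeyError here means the item lacks a primary key attribute.
        a, b = item[self.hash_attr], item[self.range_attr]
        bucket = self._buckets[a]
        pos = bisect.bisect_left([rk for rk, _ in bucket], b)
        if pos < len(bucket) and bucket[pos][0] == b:
            bucket[pos] = (b, item)  # same primary key: replace the item
        else:
            bucket.insert(pos, (b, item))

    def get_item(self, a: Any, b: Any) -> Optional[Item]:
        bucket = self._buckets.get(a, [])
        pos = bisect.bisect_left([rk for rk, _ in bucket], b)
        if pos < len(bucket) and bucket[pos][0] == b:
            return bucket[pos][1]
        return None

    def query(self, a: Any, low: Any, high: Any) -> List[Item]:
        """All items with hash value a and low <= B <= high, in B order."""
        bucket = self._buckets.get(a, [])
        keys = [rk for rk, _ in bucket]
        return [item for _, item in
                bucket[bisect.bisect_left(keys, low):bisect.bisect_right(keys, high)]]
```

For example, a table of messages keyed on (user, timestamp) can return one user's messages for a time interval in timestamp order without examining the items of any other user.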

Voldemort Key-Value Distributed Data Store


Voldemort is an open source system available through Apache 2.0 open source licensing rules. It
is based on Amazon’s Dynamo design (the basis for DynamoDB). The focus is on high performance and horizontal scalability,
as well as on providing replication for high availability and sharding for improving latency
(response time) of read and write requests. All three of those features—replication, sharding, and
horizontal scalability—are realized through a technique to distribute the key-value pairs among
the nodes of a distributed cluster; this distribution is known as consistent hashing. Voldemort has
been used by LinkedIn for data storage. Some of the features of Voldemort are as follows:
■ Simple basic operations. A collection of (key, value) pairs is kept in a Voldemort
store. We will assume the store is called s. The basic interface for data storage and
retrieval is very simple and includes three operations: get, put, and delete. The
operation s.put(k, v) inserts an item as a key-value pair with key k and value v. The
operation s.delete(k) deletes the item whose key is k from the store, and the operation
v = s.get(k) retrieves the value v associated with key k. The application can use these
basic operations to implement its own requirements. At the basic storage level, both keys
and values are arrays of bytes (strings).
■ High-level formatted data values. The values v in the (k, v) items can be specified
in JSON (JavaScript Object Notation), and the system will convert between JSON and
the internal storage format. Other data object formats can also be specified if the
application provides the conversion (also known as serialization) between the user
format and the storage format as a Serializer class. The Serializer class must be
provided by the user and will include operations to convert the user format into a
string of bytes for storage as a value, and to convert back a string (array of bytes)
retrieved via s.get(k) into the user format. Voldemort has some built-in serializers for
formats other than JSON.
■ Consistent hashing for distributing (key, value) pairs. A variation of the data
distribution algorithm known as consistent hashing is used in Voldemort for data
distribution among the nodes in the distributed cluster of nodes. A hash function h(k)
is applied to the key k of each (k, v) pair, and h(k) determines where the item will be
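stored. The method assumes that h(k) is an integer value, usually in the range 0 to
$H_{max} = 2^{n} - 1$, where n is chosen based on the desired range for the hash values. This
method is best visualized by considering the range of all possible integer hash values
0 to $H_{max}$ to be evenly distributed on a circle (or ring). The nodes of the cluster are also
assigned positions on the same ring, and an item (k, v) is stored on the node whose position
follows h(k) in the clockwise direction.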

■ Consistency and versioning. Voldemort uses a method similar to the one developed
for DynamoDB for consistency in the presence of replicas. Basically, concurrent write
operations are allowed by different processes so there could exist two or more
different values associated with the same key at different nodes when items are
replicated. Consistency is achieved when the item is read by using a technique known
as versioning and read repair. Concurrent writes are allowed, but each write is
associated with a vector clock value. When a read occurs, it is possible that different
versions of the same value (associated with the same key) are read from different
nodes. If the system can reconcile to a single final value, it will pass that value to the
read; otherwise, more than one version can be passed back to the application, which
will reconcile the various versions into one version based on the application semantics
and give this reconciled value back to the nodes.
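The get, put, and delete operations and the ring-based placement of (key, value) pairs can be sketched together as follows. This is a minimal illustration of the general consistent hashing technique, not Voldemort's actual implementation: the classes Ring and Store, the choice of n = 32 bits, the truncated SHA-1 hash, and the three ring positions per node are all assumptions, and replication and versioning are left out.

```python
import bisect
import hashlib
from typing import Dict, List, Optional, Tuple

N_BITS = 32                # assumed n
H_MAX = 2 ** N_BITS - 1    # hash values lie in the range 0..H_MAX


def h(key: str) -> int:
    """Integer hash of a key in the range 0..H_MAX (truncated SHA-1, an arbitrary choice)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[: N_BITS // 8], "big")


class Ring:
    """Nodes placed at several positions on the ring of hash values."""

    def __init__(self, nodes: List[str], points_per_node: int = 3) -> None:
        self.nodes = list(nodes)
        self._points: List[Tuple[int, str]] = sorted(
            (h(f"{node}#{i}"), node) for node in nodes for i in range(points_per_node)
        )
        self._positions = [pos for pos, _ in self._points]

    def node_for(self, key: str) -> str:
        """The node whose ring position follows h(key) clockwise."""
        i = bisect.bisect_left(self._positions, h(key))
        return self._points[i % len(self._points)][1]  # wrap around past H_MAX


class Store:
    """A store s offering s.put(k, v), s.get(k), and s.delete(k); values are bytes."""

    def __init__(self, ring: Ring) -> None:
        self.ring = ring
        self._data: Dict[str, Dict[str, bytes]] = {node: {} for node in ring.nodes}

    def put(self, k: str, v: bytes) -> None:
        self._data[self.ring.node_for(k)][k] = v

    def get(self, k: str) -> Optional[bytes]:
        return self._data[self.ring.node_for(k)].get(k)

    def delete(self, k: str) -> None:
        self._data[self.ring.node_for(k)].pop(k, None)
```

The benefit of the ring shows up when the cluster changes: if a node is removed, only the keys that were stored on that node move to other nodes, while every other key stays where it was.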

Examples of Other Key-Value Stores


Other examples of key-value stores include Oracle NoSQL Database, Redis, and Cassandra.

Column-Based or Wide Column NOSQL Systems


Another category of NOSQL systems is known as column-based or wide column systems. The
Google distributed storage system for big data, known as BigTable, is a well-known example of
this class of NOSQL systems, and it is used in many Google applications that require large
amounts of data storage, such as Gmail. BigTable uses the Google File System (GFS) for data
storage and distribution. An open source system known as Apache Hbase is somewhat similar to
Google BigTable, but it typically uses HDFS (Hadoop Distributed File System) for data
storage. HDFS is used in many cloud computing applications. Hbase can also use Amazon’s Simple
Storage Service (known as S3) for data storage.
BigTable (and Hbase) is sometimes described as a sparse multidimensional distributed persistent
sorted map, where the word map means a collection of (key, value) pairs (the key is mapped to
the value). One of the main differences that distinguish column-based systems from key-value
stores is the nature of the key. In column-based systems such as Hbase, the key is multidimensional
and so has several components: typically, a combination of table name, row key, column, and
timestamp. As we shall see, the column is typically composed of two components: column family
and column qualifier.
Hbase Data Model and Versioning
Hbase data model. The data model in Hbase organizes data using the concepts of namespaces,
tables, column families, column qualifiers, columns, rows, and data cells. A column is identified
by a combination of (column family:column qualifier). Data is stored in a self-describing form by
associating columns with data values, where data values are strings. Hbase also stores multiple
versions of a data item, with a timestamp associated with each version, so versions and timestamps
are also part of the Hbase data model. As with other NOSQL systems, unique keys are associated
with stored data items for fast access, but the keys identify cells in the storage system.
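The multidimensional key can be illustrated with a small in-memory map from (table name, row key, column) to a list of timestamped versions. This is only a sketch of the data model and not the Hbase client API: the class VersionedCellStore, its methods, the max_versions limit, and the sample table and column names are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

CellKey = Tuple[str, str, str]  # (table name, row key, "column family:column qualifier")


class VersionedCellStore:
    """Toy sparse map: cell key -> versions of the cell value, newest first."""

    def __init__(self, max_versions: int = 3) -> None:
        self.max_versions = max_versions  # assumed limit on versions kept per cell
        self._cells: Dict[CellKey, List[Tuple[int, str]]] = {}

    def put(self, table: str, row: str, column: str, value: str, timestamp: int) -> None:
        if ":" not in column:
            raise ValueError("column must have the form 'family:qualifier'")
        versions = self._cells.setdefault((table, row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # discard the oldest versions

    def get(self, table: str, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[str]:
        """Latest version, or the latest version written at or before `timestamp`."""
        for ts, value in self._cells.get((table, row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def row(self, table: str, row: str) -> Dict[str, str]:
        """Latest value of each column present in the row, in sorted column order."""
        return {c: versions[0][1]
                for (t, r, c), versions in sorted(self._cells.items())
                if t == table and r == row}
```

Only the cells that were actually written occupy space, which is what makes the map sparse: two rows of the same table may contain entirely different column qualifiers.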

NOSQL Graph Databases and Neo4j


Another category of NOSQL systems is known as graph databases or graph oriented NOSQL
systems. The data is represented as a graph, which is a collection of vertices (nodes) and edges.
Both nodes and edges can be labeled to indicate the types of entities and relationships they
represent, and it is generally possible to store data associated with both individual nodes and
individual edges. Many systems can be categorized as graph databases. We will focus our
discussion on one particular system, Neo4j, which is used in many applications. Neo4j is an open
source system, and it is implemented in Java.

Neo4j Data Model


The data model in Neo4j organizes data using the concepts of nodes and relationships. Both
nodes and relationships can have properties, which store the data items associated with nodes and
relationships. Nodes can have labels; the nodes that have the same label are grouped into a
collection that identifies a subset of the nodes in the database graph for querying purposes. A
node can have zero, one, or several labels. Relationships are directed; each relationship has a start
node and end node as well as a relationship type, which serves a similar role to a node label by
identifying similar relationships that have the same relationship type. Properties can be specified
via a map pattern, which is made of one or more “name: value” pairs enclosed in curly brackets; for
example {Lname : ‘Smith’, Fname : ‘John’, Minit : ‘B’}.
In conventional graph theory, nodes and relationships are generally called vertices and edges. The
Neo4j graph data model somewhat resembles how data is represented in the ER and EER models,
but with some notable differences. Comparing the Neo4j graph model with ER/EER concepts,
nodes correspond to entities, node labels correspond to entity types and subclasses, relationships
correspond to relationship instances, relationship types correspond to relationship types, and
properties correspond to attributes. One notable difference is that a relationship is directed in
Neo4j, but is not in ER/EER. Another is that a node may have no label in Neo4j, which is not
allowed in ER/EER because every entity must belong to an entity type. A third crucial difference
is that the graph model of Neo4j is used as a basis for an actual high-performance distributed
database system whereas the ER/EER model is mainly used for database design.
■ Labels and properties. When a node is created, the node label can be specified. It is
also possible to create nodes without any labels. Properties are enclosed in curly
brackets { … }. It is possible that some nodes have multiple labels; for example the
same node can be labeled as PERSON and EMPLOYEE and MANAGER by listing
all the labels separated by the colon symbol as follows:
PERSON:EMPLOYEE:MANAGER.
■ Relationships and relationship types. Relationships are directed, but a relationship
can be traversed in either direction. The relationship types (labels) used in the example
are WorksFor, Manager, LocatedIn, and WorksOn.

■ Paths. A path specifies a traversal of part of the graph. It is typically used as part of a
query to specify a pattern, where the query will retrieve from the graph data that
matches the pattern. A path is typically specified by a start node, followed by one or
more relationships, leading to one or more end nodes that satisfy the pattern.

■ Optional Schema. A schema is optional in Neo4j. Graphs can be created and used
without a schema, but in Neo4j version 2.0, a few schema-related functions were
added. The main features related to schema creation involve creating indexes and
constraints based on the labels and properties. For example, it is possible to create the
equivalent of a key constraint on a property of a label, so all nodes in the collection of
nodes associated with the label must have unique values for that property.

■ Indexing and node identifiers. When a node is created, the Neo4j system creates an
internal unique system-defined identifier for each node. To retrieve individual nodes
using other properties of the nodes efficiently, the user can create indexes for the
collection of nodes that have a particular label. Typically, one or more of the
properties of the nodes in that collection can be indexed. For example, Empid can be
used to index nodes with the EMPLOYEE label, Dno to index the nodes with the
DEPARTMENT label, and Pno to index the nodes with the PROJECT label.
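The concepts above can be tied together in a small in-memory property graph. This is an illustrative sketch in Python and not Neo4j code or the Cypher language: the classes Node, Relationship, and Graph, their method names, and the sample values Research and Houston are hypothetical, while the labels, relationship types, and the Empid property follow the names used in the text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Iterable, List, Optional, Tuple


@dataclass
class Node:
    node_id: int                 # system-defined internal identifier
    labels: FrozenSet[str]       # zero, one, or several labels
    properties: Dict[str, Any]


@dataclass
class Relationship:
    start: int                   # identifier of the start node
    end: int                     # identifier of the end node
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


class Graph:
    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.relationships: List[Relationship] = []
        # (label, property) -> {property value -> node id}: an index and a key constraint
        self._unique: Dict[Tuple[str, str], Dict[Any, int]] = {}

    def create_unique_index(self, label: str, prop: str) -> None:
        """Must be declared before nodes are created (a simplification of this sketch)."""
        self._unique[(label, prop)] = {}

    def create_node(self, labels: Iterable[str] = (), **properties: Any) -> Node:
        label_set = frozenset(labels)
        touched = [(index, properties[prop])
                   for (label, prop), index in self._unique.items()
                   if label in label_set and prop in properties]
        for index, value in touched:
            if value in index:
                raise ValueError(f"duplicate value {value!r} violates a key constraint")
        node = Node(len(self.nodes), label_set, dict(properties))
        self.nodes[node.node_id] = node
        for index, value in touched:
            index[value] = node.node_id
        return node

    def relate(self, start: Node, rel_type: str, end: Node, **properties: Any) -> Relationship:
        rel = Relationship(start.node_id, end.node_id, rel_type, dict(properties))
        self.relationships.append(rel)
        return rel

    def find(self, label: str, prop: str, value: Any) -> Optional[Node]:
        index = self._unique.get((label, prop))
        if index is not None:                 # indexed lookup
            node_id = index.get(value)
            return None if node_id is None else self.nodes[node_id]
        for node in self.nodes.values():      # no index: scan the nodes with this label
            if label in node.labels and node.properties.get(prop) == value:
                return node
        return None

    def follow(self, node: Node, rel_type: str, reverse: bool = False) -> List[Node]:
        """Traverse relationships of one type from a node, with or against their direction."""
        if reverse:
            return [self.nodes[r.start] for r in self.relationships
                    if r.rel_type == rel_type and r.end == node.node_id]
        return [self.nodes[r.end] for r in self.relationships
                if r.rel_type == rel_type and r.start == node.node_id]


g = Graph()
g.create_unique_index("EMPLOYEE", "Empid")
smith = g.create_node(["PERSON", "EMPLOYEE"], Empid="1", Lname="Smith", Fname="John")
research = g.create_node(["DEPARTMENT"], Dno="5", Dname="Research")  # hypothetical values
houston = g.create_node(["LOCATION"], Lname="Houston")               # hypothetical values
g.relate(smith, "WorksFor", research)
g.relate(research, "LocatedIn", houston)
# A path: EMPLOYEE -WorksFor-> DEPARTMENT -LocatedIn-> LOCATION
locations = [loc for dept in g.follow(smith, "WorksFor")
             for loc in g.follow(dept, "LocatedIn")]
```

The reverse flag of follow reflects the point that a directed relationship can still be traversed from its end node back to its start node, for example to list the employees who work for a department.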

The Cypher Query Language of Neo4j


Neo4j has a high-level query language, Cypher. There are declarative commands for creating
nodes and relationships, as well as for finding nodes and relationships based on specifying
patterns. Deletion and modification of data is also possible in Cypher. We introduced the
CREATE command in the previous section, so we will now give a brief overview of some of the
other features of Cypher.
A Cypher query is made up of clauses. When a query has several clauses, the result from one
clause can be the input to the next clause in the query. The Cypher language can specify complex
queries and updates on a graph database.
