DBMS-unit 5-Nosql Databases
Some of the organizations that were faced with these data management and storage applications
decided to develop their own systems:
■ Google developed a proprietary NOSQL system known as BigTable, which is used in
many of Google’s applications that require vast amounts of data storage, such as Gmail,
Google Maps, and Web site indexing. Apache HBase is an open source NOSQL system
based on similar concepts. Google’s innovation led to the category of NOSQL systems
known as column-based or wide column stores; they are also sometimes referred to
as column family stores.
■ Amazon developed a NOSQL system called DynamoDB that is available through
Amazon’s cloud services. This innovation led to the category known as key-value data
stores or sometimes key-tuple or key-object data stores.
■ Facebook developed a NOSQL system called Cassandra, which is now open source and
known as Apache Cassandra. This NOSQL system uses concepts from both key-value
stores and column-based systems.
■ Other software companies started developing their own solutions and making them
available to users who need these capabilities, for example MongoDB and CouchDB,
which are classified as document-based NOSQL systems or document stores.
■ Another category of NOSQL systems is the graph-based NOSQL systems, or graph
databases; these include Neo4j and GraphBase, among others.
Characteristics of NOSQL Systems
We divide the characteristics into two categories—those related to distributed databases and
distributed systems, and those related to data models and query languages.
NOSQL characteristics related to distributed databases and distributed systems. NOSQL
systems emphasize high availability, so replicating the data is inherent in many of these systems.
Scalability is another important characteristic, because many of the applications that use
NOSQL systems tend to have data that keeps growing in volume. High performance is another
required characteristic, whereas serializable consistency may not be as important for some of
the NOSQL applications.
1. Scalability: There are two kinds of scalability in distributed systems: horizontal and
vertical. In NOSQL systems, horizontal scalability is generally used, where the
distributed system is expanded by adding more nodes for data storage and processing as
the volume of data grows. Vertical scalability, on the other hand, refers to expanding the
storage and computing power of existing nodes.
2. Availability, Replication and Eventual Consistency: Many applications that use
NOSQL systems require continuous system availability. To accomplish this, data is
replicated over two or more nodes in a transparent manner, so that if one node fails, the
data is still available on other nodes. Replication improves data availability and can also
improve read performance, because read requests can often be serviced from any of the
replicated data nodes. However, writes become more cumbersome because
an update must be applied to every copy of the replicated data items; this can slow down
writes if serializable consistency is required. Many NOSQL applications do not require
serializable consistency, so more relaxed forms of consistency known as eventual
consistency are used.
3. Replication Models: Two major replication models are used in NOSQL systems: master-
slave and master-master replication. Master-slave replication requires one copy to be the
master copy; all write operations must be applied to the master copy and then propagated
to the slave copies, usually using eventual consistency (the slave copies will eventually be
the same as the master copy). Master-master replication allows reads and writes at
any of the replicas but may not guarantee that reads at nodes that store different copies
see the same values. Different users may write the same data item concurrently at
different nodes of the system, so the values of the item will be temporarily inconsistent.
A reconciliation method to resolve conflicting write operations of the same data item at
different nodes must be implemented as part of the master-master replication scheme.
4. Sharding of Files: In many NOSQL applications, files (or collections of data objects) can
have many millions of records (or documents or objects), and these records can be
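accessed concurrently by thousands of users. So it is not practical to store the whole file
in one node. Sharding (also known as horizontal partitioning) splits the file's records
across multiple nodes, which distributes the load of accessing them.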
5. High-Performance Data Access: In many NOSQL applications, it is necessary to find
individual records or objects (data items) from among the millions of data records or
objects in a file. To achieve this, most systems use one of two techniques: hashing or
range partitioning on object keys. The majority of accesses to an object will be by
providing the key value rather than by using complex query conditions. In hashing, a
hash function h(K) is applied to the key K, and the location of the object with key K is
determined by the value of h(K). In range partitioning, the location is determined via a
range of key values; for example, location i would hold the objects whose key values K are
in the range $K_i^{\min} \le K \le K_i^{\max}$.
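The two placement techniques in item 5 can be sketched in a few lines of Python. This is a minimal illustration, not how any particular NOSQL system does it: the function names, the choice of SHA-256 as h(K), the modulo step, and the sample boundary keys are all assumptions made for the example.

```python
import bisect
import hashlib


def hash_node(key: str, num_nodes: int) -> int:
    """Hash partitioning: the node is determined by h(K).

    Assumption: h(K) is the first 8 bytes of SHA-256, reduced modulo the
    number of nodes. Real systems often use other hash functions.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def range_node(key: str, upper_bounds: list[str]) -> int:
    """Range partitioning: node i holds keys with
    upper_bounds[i-1] < K <= upper_bounds[i].

    Keys greater than the last bound go to one extra final node, so
    len(upper_bounds) bounds describe len(upper_bounds) + 1 nodes.
    upper_bounds must be sorted in ascending order.
    """
    return bisect.bisect_left(upper_bounds, key)


# Hypothetical boundaries: node 0 holds keys up to "g", node 1 up to "n",
# node 2 up to "t", and node 3 holds everything after "t".
bounds = ["g", "n", "t"]
placement = {k: range_node(k, bounds) for k in ["alice", "mike", "zoe"]}
```

Hashing spreads keys evenly but scatters neighbouring keys across nodes, while range partitioning keeps neighbouring keys together, which helps when a query asks for a range of key values.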
NOSQL characteristics related to data models and query languages. NOSQL systems
emphasize performance and flexibility over modeling power and complex querying.
1. Not Requiring a Schema: The flexibility of not requiring a schema is achieved in many
NOSQL systems by allowing semi-structured, self-describing data. The users can
specify a partial schema in some systems to improve storage efficiency, but it is not
required to have a schema in most of the NOSQL systems. As there may not be a schema
to specify constraints, any constraints on the data would have to be programmed in the
application programs that access the data items. There are various languages for
describing semistructured data, such as JSON (JavaScript Object Notation) and XML
(Extensible Markup Language).
2. Less Powerful Query Languages: Many applications that use NOSQL systems may not
require a powerful query language such as SQL, because search (read) queries in these
systems often locate single objects in a single file based on their object keys. NOSQL
systems typically provide a set of functions and operations as a programming API
(application programming interface), so the programmer reads and writes the data
objects by calling the appropriate operations. In many cases,
the operations are called CRUD operations, for Create, Read, Update, and Delete. In
other cases, they are known as SCRUD because of an added Search (or Find) operation.
Some NOSQL systems also provide a high-level query language, but it may not have the
full power of SQL; only a subset of SQL querying capabilities would be provided. In
particular, many NOSQL systems do not provide join operations as part of the query
language itself; the joins need to be implemented in the application programs.
3. Versioning: Some NOSQL systems provide storage of multiple versions of the data
items, each with a timestamp recording when that version was created.
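To make the SCRUD idea and the point about joins concrete, here is a small in-memory sketch. The class `KeyValueStore`, its method names, and the employee/department sample data are hypothetical and chosen only for illustration; real NOSQL systems each expose their own API.

```python
class KeyValueStore:
    """Hypothetical in-memory store exposing SCRUD-style operations."""

    def __init__(self):
        self._items = {}

    def create(self, key, item):
        if key in self._items:
            raise KeyError(f"key already exists: {key!r}")
        self._items[key] = dict(item)

    def read(self, key):
        # Returns None when the key is absent.
        return self._items.get(key)

    def update(self, key, changes):
        self._items[key].update(changes)

    def delete(self, key):
        self._items.pop(key, None)

    def search(self, predicate):
        # The "S" in SCRUD: scan all items and keep those that match.
        return [item for item in self._items.values() if predicate(item)]


def join_employees_with_departments(employees, departments):
    """A join written in the application, because the store offers none."""
    rows = []
    for emp in employees.search(lambda e: True):
        dept = departments.read(emp.get("dno"))
        rows.append({"name": emp["name"],
                     "dept": dept["dname"] if dept else None})
    return rows


departments = KeyValueStore()
employees = KeyValueStore()
departments.create(5, {"dname": "Research"})
employees.create("e1", {"name": "Smith", "dno": 5})
employees.create("e2", {"name": "Wong"})  # no dno: items need not share a schema
rows = join_employees_with_departments(employees, departments)
```

Note that the second employee has no `dno` attribute at all. With no schema to enforce it, the application code has to decide what a missing department means.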
DynamoDB Overview
The DynamoDB system is an Amazon product and is available as part of Amazon’s AWS/SDK
platforms (Amazon Web Services/Software Development Kit). It can be used as part of Amazon’s
cloud computing services, for the data storage component.
DynamoDB data model. The basic data model in DynamoDB uses the concepts of tables, items,
and attributes. A table in DynamoDB does not have a schema; it holds a collection of self-
describing items. Each item will consist of a number of (attribute, value) pairs, and attribute
values can be single-valued or multivalued. So basically, a table will hold a collection of items,
and each item is a self-describing record (or object). DynamoDB also allows the user to specify
the items in JSON format, and the system will convert them to the internal storage format of
DynamoDB.
When a table is created, it is required to specify a table name and a primary key; the primary key
will be used to rapidly locate the items in the table. Thus, the primary key is the key and the item
is the value for the DynamoDB key-value store. The primary key attribute must exist in every
item in the table. The primary key can be one of the following two types:
■ A single attribute. The DynamoDB system will use this attribute to build a hash index
on the items in the table. This is called a hash type primary key. The items are not
ordered in storage on the value of the hash attribute.
■ A pair of attributes. This is called a hash and range type primary key. The primary
key will be a pair of attributes (A, B): attribute A will be used for hashing, and because
there will be multiple items with the same value of A, the B values will be used for
ordering the records with the same A value. A table with this type of key can have
additional secondary indexes defined on its attributes.
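The hash and range type primary key can be pictured with a small in-memory model. This sketch does not use the AWS SDK and is not DynamoDB's implementation; the class `HashRangeTable`, its methods, and the customer/order attributes are assumptions made up for the example.

```python
import bisect
from collections import defaultdict


class HashRangeTable:
    """Toy table keyed by a pair of attributes (A, B).

    Items are grouped by the hash attribute A and, within one A value,
    kept ordered by the range attribute B.
    """

    def __init__(self, hash_attr, range_attr):
        self.hash_attr = hash_attr
        self.range_attr = range_attr
        # A value -> list of (B value, item), sorted by B value
        self._partitions = defaultdict(list)

    def put(self, item):
        if self.hash_attr not in item or self.range_attr not in item:
            raise ValueError("every item must contain the primary key attributes")
        a, b = item[self.hash_attr], item[self.range_attr]
        rows = self._partitions[a]
        pos = bisect.bisect_left([rb for rb, _ in rows], b)
        if pos < len(rows) and rows[pos][0] == b:
            rows[pos] = (b, item)      # same primary key: replace the item
        else:
            rows.insert(pos, (b, item))

    def query(self, a, low=None, high=None):
        """Return items with hash value a, ordered by B, optionally in [low, high]."""
        rows = self._partitions.get(a, [])
        return [item for b, item in rows
                if (low is None or b >= low) and (high is None or b <= high)]


orders = HashRangeTable("customer", "order_date")
orders.put({"customer": "c1", "order_date": "2024-03-01", "total": 30})
orders.put({"customer": "c1", "order_date": "2024-01-15", "total": 10, "coupon": "X"})
orders.put({"customer": "c2", "order_date": "2024-02-01"})
c1_orders = orders.query("c1")
```

Only the two key attributes are mandatory. The other attributes differ from item to item, which is what "self-describing" means in practice.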
Voldemort Overview
■ Consistency and versioning. Voldemort uses a method similar to the one developed
for DynamoDB for consistency in the presence of replicas. Basically, concurrent write
operations are allowed by different processes so there could exist two or more
different values associated with the same key at different nodes when items are
replicated. Consistency is achieved when the item is read by using a technique known
as versioning and read repair. Concurrent writes are allowed, but each write is
associated with a vector clock value. When a read occurs, it is possible that different
versions of the same value (associated with the same key) are read from different
nodes. If the system can reconcile to a single final value, it will pass that value to the
read; otherwise, more than one version can be passed back to the application, which
will reconcile the various versions into one version based on the application semantics
and give this reconciled value back to the nodes.
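The read-time reconciliation described above can be sketched with a generic vector clock. This is an illustration of the general technique, not Voldemort's actual code: the function names, the dictionary representation of a clock (node name mapped to a counter), and the shopping-cart values are assumptions.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if clock a has seen every write recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def merge(a, b):
    """Element-wise maximum, used after the application reconciles versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def reconcile(versions):
    """Drop versions strictly dominated by another; return the survivors.

    versions is a list of (value, clock) pairs read from different nodes.
    One survivor means the system found a single final value. Several
    survivors are concurrent writes that the application must reconcile.
    """
    survivors = []
    for i, (value, clock) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (_, other) in enumerate(versions)
        )
        if not dominated and (value, clock) not in survivors:
            survivors.append((value, clock))
    return survivors


base = (["a"], increment({}, "n1"))                  # first write, at node n1
left = (["a", "b"], increment(base[1], "n1"))        # later write at n1
right = (["a", "c"], increment(base[1], "n2"))       # concurrent write at n2
conflict = reconcile([base, left, right])            # base is dropped
resolved = (["a", "b", "c"], increment(merge(left[1], right[1]), "n1"))
```

Here `conflict` holds two versions because neither clock descends from the other, so the application merges them and writes the result back with a clock that dominates both.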
Neo4j Overview
■ Paths. A path specifies a traversal of part of the graph. It is typically used as part of a
query to specify a pattern, where the query will retrieve from the graph data that
matches the pattern. A path is typically specified by a start node, followed by one or
more relationships, leading to one or more end nodes that satisfy the pattern.
■ Optional Schema. A schema is optional in Neo4j. Graphs can be created and used
without a schema, but in Neo4j version 2.0, a few schema-related functions were
added. The main features related to schema creation involve creating indexes and
constraints based on the labels and properties. For example, it is possible to create the
equivalent of a key constraint on a property of a label, so all nodes in the collection of
nodes associated with the label must have unique values for that property.
■ Indexing and node identifiers. When a node is created, the Neo4j system creates an
internal unique system-defined identifier for each node. To retrieve individual nodes
using other properties of the nodes efficiently, the user can create indexes for the
collection of nodes that have a particular label. Typically, one or more of the
properties of the nodes in that collection can be indexed. For example, Empid can be
used to index nodes with the EMPLOYEE label, Dno to index the nodes with the
DEPARTMENT label, and Pno to index the nodes with the PROJECT label.
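The relationship between labels, indexes, and a key-style constraint can be sketched with a tiny in-memory graph. This is not Neo4j or Cypher; the class `Graph`, its methods, and the `unique` flag are hypothetical names used only to illustrate the ideas, and the sketch enforces uniqueness only, not the presence of the property.

```python
import itertools


class Graph:
    """Toy labeled property graph with per-label property indexes."""

    def __init__(self):
        self._ids = itertools.count()   # internal system-defined identifiers
        self.nodes = {}                 # node id -> (labels, properties)
        self._indexes = {}              # (label, prop) -> {value: [node ids]}
        self._unique = set()            # (label, prop) pairs that must be unique

    def create_index(self, label, prop, unique=False):
        index = {}
        for nid, (labels, props) in self.nodes.items():
            if label in labels and prop in props:
                index.setdefault(props[prop], []).append(nid)
        if unique and any(len(ids) > 1 for ids in index.values()):
            raise ValueError(f"existing nodes violate uniqueness of {label}.{prop}")
        self._indexes[(label, prop)] = index
        if unique:
            self._unique.add((label, prop))

    def create_node(self, labels, **props):
        # Check every uniqueness constraint before storing anything.
        for label in labels:
            for prop, value in props.items():
                key = (label, prop)
                if key in self._unique and value in self._indexes[key]:
                    raise ValueError(f"duplicate {label}.{prop} = {value!r}")
        nid = next(self._ids)
        self.nodes[nid] = (frozenset(labels), dict(props))
        for label in labels:
            for prop, value in props.items():
                index = self._indexes.get((label, prop))
                if index is not None:
                    index.setdefault(value, []).append(nid)
        return nid

    def find(self, label, prop, value):
        index = self._indexes.get((label, prop))
        if index is not None:
            return list(index.get(value, []))       # indexed lookup
        return [nid for nid, (labels, props) in self.nodes.items()
                if label in labels and props.get(prop) == value]  # full scan


g = Graph()
g.create_index("EMPLOYEE", "Empid", unique=True)
smith = g.create_node(["EMPLOYEE"], Empid="1", Lname="Smith")
wong = g.create_node(["EMPLOYEE"], Empid="2", Lname="Wong")
found = g.find("EMPLOYEE", "Empid", "2")
```

Without the index, `find` falls back to scanning every node, which is the cost the index on a label's property is meant to avoid.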