Module 5 Part II NoSQL DB
• Many companies use applications that need to store vast amounts of data.
• For example, consider an application such as Facebook, with millions of users who submit
posts, many with images and videos. User profiles, user relationships, and posts must
all be stored in a huge collection of data stores.
• Traditional relational database systems are generally not well suited to this type of application.
• To store such large amounts of data, some organizations developed their own systems:
o Google developed a proprietary NOSQL system known as BigTable, which is
used in many of Google’s applications that require vast amounts of data storage,
such as Gmail, Google Maps, and Web site indexing. Apache HBase is an open
source NOSQL system based on similar concepts. Google’s innovation led to
the category of NOSQL systems known as column-based or wide column
stores.
o Amazon developed a NOSQL system called DynamoDB that is available
through Amazon’s cloud services. This innovation led to the category known as
key-value data stores.
o Facebook developed a NOSQL system called Cassandra, which is now open
source and known as Apache Cassandra. This NOSQL system uses concepts
from both key-value stores and column-based systems.
o Other software companies started developing their own solutions and making
them available to users who need these capabilities—for example, MongoDB
and CouchDB, which are classified as document-based NOSQL systems or
document stores.
o Another category of NOSQL systems is the graph-based NOSQL systems, or
graph databases; these include Neo4J and GraphBase.
Characteristics of NOSQL Systems
1) NOSQL characteristics related to distributed databases and distributed systems.
• Scalability:
o A system that can continuously evolve to support a growing
amount of work is considered scalable.
o There are two kinds of scalability in distributed systems: horizontal and
vertical. In NOSQL systems, horizontal scalability is generally used,
where the distributed system is expanded by adding more nodes for data
storage and processing as the volume of data grows.
o Vertical scalability, on the other hand, refers to expanding the storage
and computing power of existing nodes.
• Availability, Replication and Eventual Consistency
o Many applications that use NOSQL systems require continuous system
availability.
o To accomplish this, data is replicated over two or more nodes in a
transparent manner, so that if one node fails, the data is still available on
other nodes.
o Replication improves data availability and can also improve read
performance, because read requests can often be serviced from any of
the replicated data nodes.
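o Keeping every replica identical at all times slows down writes, because
each update must be applied to every copy. Many NOSQL applications
therefore accept a relaxed form of consistency known as eventual
consistency, in which replicas may differ briefly but converge once
updates have propagated.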
• Replication Models:
o Two major replication models are used in NOSQL systems: master-
slave and master-master replication.
o Master-slave replication requires one copy to be the master copy; all
write operations must be applied to the master copy and then propagated
to the slave copies.
o Master-master replication allows reads and writes at any of the
replicas but may not guarantee that reads at nodes that store different
copies see the same values.
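The master-slave model can be illustrated with a minimal sketch. The class MasterSlaveStore and its methods are hypothetical names made up for this example, and in-memory dictionaries stand in for real nodes. It is meant only to show that a slave may return an old value until the master's writes have been propagated, which is the behaviour eventual consistency permits.

```python
class MasterSlaveStore:
    """Toy master-slave replication: writes go to the master, then propagate."""

    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self.pending = []  # writes accepted by the master but not yet propagated

    def write(self, key, value):
        # All write operations are applied to the master copy first.
        self.master[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=None):
        # replica=None reads the master; an integer reads that slave copy.
        source = self.master if replica is None else self.slaves[replica]
        return source.get(key)

    def propagate(self):
        # Apply the queued writes to every slave, in the order they were made.
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()


store = MasterSlaveStore()
store.write("user:1", "Alice")
stale = store.read("user:1", replica=0)  # the slave has not seen the write yet
store.propagate()
fresh = store.read("user:1", replica=0)  # the replicas have now converged
print(stale, fresh)
```

A real system would propagate writes over the network and handle node failures; the sketch leaves both out.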
• Sharding of Files
o In NOSQL applications, files can have many millions of records and
these records can be accessed concurrently by thousands of users. So it
is not practical to store the whole file in one node.
o Sharding (also known as horizontal partitioning) of the file records is
often employed in NOSQL systems.
o Sharding is a type of database partitioning in which a large database is
divided into smaller parts (shards) that are stored on different nodes.
• High-Performance Data Access:
o In many NOSQL applications, it is necessary to find individual records
or objects (data items) from among the millions of data records or
objects in a file.
o To achieve this, most systems use one of two techniques: hashing or
range partitioning on object keys.
o In hashing, a hash function h(K) is applied to the key K, and the location
of the object with key K is determined by the value of h(K).
o In range partitioning, the location is determined by a range of key
values: each location holds the objects whose keys fall within one
contiguous range.
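Both placement techniques can be sketched in a few lines. The node count, the choice of MD5 as the hash function h(K), and the range boundaries below are assumptions made for illustration; they are not taken from any particular NOSQL system.

```python
import bisect
import hashlib

NUM_NODES = 4  # assumed cluster size


def hash_partition(key, num_nodes=NUM_NODES):
    """Hashing: the node is determined by h(K) reduced modulo the node count."""
    # Python's built-in hash() is randomized per process for strings,
    # so a stable digest is used here instead.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


# Assumed boundaries: node 0 holds keys below "g", node 1 holds "g" up to
# (but not including) "n", node 2 holds "n" up to "t", node 3 holds the rest.
RANGE_BOUNDS = ["g", "n", "t"]


def range_partition(key, bounds=RANGE_BOUNDS):
    """Range partitioning: the node is determined by the range containing K."""
    return bisect.bisect_right(bounds, key)


for k in ["alice", "mary", "nina", "zoe"]:
    print(k, "hash ->", hash_partition(k), "range ->", range_partition(k))
```

Range partitioning keeps neighbouring keys on the same node, which helps queries that retrieve a range of keys. Hashing spreads keys more evenly but scatters neighbouring keys across nodes.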
2) NOSQL characteristics related to data models and query languages.
• Not Requiring a Schema:
o It is not required to have a schema in most of the NOSQL systems.
o This allows semi-structured, self-describing data.
o Since there is no schema, any constraints on the data would have to be
programmed in the application programs that access the data items.
o There are various languages for describing semistructured data, such as
JSON (JavaScript Object Notation) and XML (Extensible Markup Language).
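As a small illustration of self-describing data, the sketch below parses two JSON documents that carry different fields and enforces a constraint in application code. The field names and the validate_post function are hypothetical and chosen only for this example.

```python
import json

# Two documents in the same collection; the second carries extra fields.
posts = [
    '{"id": 1, "author": "alice", "text": "Hello"}',
    '{"id": 2, "author": "bob", "text": "Photo", "image_url": "photo.jpg", "tags": ["trip"]}',
]


def validate_post(doc):
    """Constraint enforced by the application, not by the data store:
    every post needs an integer id and a non-empty author."""
    return isinstance(doc.get("id"), int) and bool(doc.get("author"))


docs = [json.loads(p) for p in posts]
for doc in docs:
    # Each document names its own fields, so no shared schema is needed to read it.
    print(sorted(doc.keys()), "valid:", validate_post(doc))
```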
• Less Powerful Query Languages
o NOSQL systems may not require a powerful query language such as
SQL, because search (read) queries in these systems often locate single
objects in a single file based on their object keys.
o NOSQL systems typically provide a set of functions and operations as a
programming API, so the programmer reads and writes data objects by
calling the appropriate operations.
o The basic operations are called CRUD operations, for Create, Read,
Update, and Delete.
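A CRUD-style programming API might look like the following minimal sketch. The KeyValueStore class and its method names are hypothetical; real NOSQL systems each define their own operation names and error behaviour.

```python
class KeyValueStore:
    """Toy in-memory store exposing the four CRUD operations by object key."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def read(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        # Deleting a missing key is treated as a no-op in this sketch.
        self._data.pop(key, None)


db = KeyValueStore()
db.create("post:1", {"author": "alice", "text": "Hello"})
db.update("post:1", {"author": "alice", "text": "Hello, world"})
print(db.read("post:1"))
db.delete("post:1")
print(db.read("post:1"))
```

Every operation locates a single object by its key, which is why these systems can often do without a full query language.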
• Versioning:
o Some NOSQL systems provide storage of multiple versions of the data
items, with a timestamp recording when each version was created.
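Versioning can be sketched as a store that keeps every (timestamp, value) pair for a key. The VersionedStore class is a hypothetical example, and the timestamps are passed in as plain integers to keep the behaviour deterministic; a real system would assign them itself.

```python
class VersionedStore:
    """Toy store that keeps all versions of each data item with a timestamp."""

    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value) pairs

    def put(self, key, value, timestamp):
        # Old versions are kept rather than overwritten.
        self._versions.setdefault(key, []).append((timestamp, value))

    def get(self, key, as_of=None):
        # Latest version overall, or latest version created at or before as_of.
        candidates = [
            (ts, value)
            for ts, value in self._versions.get(key, [])
            if as_of is None or ts <= as_of
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


vs = VersionedStore()
vs.put("profile:1", "v1", timestamp=100)
vs.put("profile:1", "v2", timestamp=200)
print(vs.get("profile:1"), vs.get("profile:1", as_of=150))
```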
CAP Theorem
• The CAP theorem explains some of the competing requirements in a distributed system
with replication.
• The three letters in CAP refer to three desirable properties of distributed systems with
replicated data:
o Consistency
▪ Consistency means that the nodes will have the same copies of a
replicated data item visible for various transactions.
o Availability
▪ Availability means that each read or write request for a data item will
either be processed successfully or will receive a message that the
operation cannot be completed.
o Partition tolerance
▪ Partition tolerance means that the system can continue operating if the
network connecting the nodes has a fault that results in two or more
partitions, where the nodes in each partition can only communicate
with each other.
• The CAP theorem states that it is not possible to guarantee all three of the desirable
properties—consistency, availability, and partition tolerance—at the same time in a
distributed system with data replication.
• If this is the case, then the distributed system designer would have to choose two
properties out of the three to guarantee.
• In a NOSQL distributed data store, a weaker consistency level is often acceptable, and
guaranteeing the other two properties (availability, partition tolerance) is important.
• In particular, a form of consistency known as eventual consistency is often adopted in
NOSQL systems.
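The trade-off can be made concrete with a toy simulation of two replicas during a network partition. All names below (Replica, write, heal, accept_during_partition) are hypothetical, and the reconciliation step is deliberately simplistic. The sketch only shows that, while the partition lasts, a design must either refuse the write so the replicas stay identical, or accept it and let the replicas differ until they are reconciled.

```python
class Replica:
    def __init__(self):
        self.value = None


def write(replicas, value, partitioned, accept_during_partition):
    """Write via replica 0; returns True if the write was accepted."""
    if not partitioned:
        # Normal operation: every replica receives the write.
        for replica in replicas:
            replica.value = value
        return True
    if accept_during_partition:
        # Only the reachable replica is updated, so the copies now differ.
        replicas[0].value = value
        return True
    # The write is refused, so the copies remain identical.
    return False


def heal(replicas):
    # Simplistic reconciliation once the partition ends: copy replica 0's value.
    for replica in replicas[1:]:
        replica.value = replicas[0].value


a, b = Replica(), Replica()
write([a, b], "v1", partitioned=False, accept_during_partition=True)
accepted = write([a, b], "v2", partitioned=True, accept_during_partition=True)
diverged = a.value != b.value
heal([a, b])
converged = a.value == b.value
print(accepted, diverged, converged)
```

The second branch, followed by heal, corresponds to the eventual consistency adopted by many NOSQL systems.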