Module 7
NoSQL
Dr. S. RENUKA DEVI
Professor
SCOPE
VIT Chennai Campus
What is NoSQL?
Not Only SQL
Most NoSQL systems are distributed databases or
distributed storage systems, with a focus on semi-
structured data storage, high performance, availability,
data replication, and scalability
A structured relational SQL system may not be appropriate
for applications storing vast amounts of data because
SQL systems offer too many services which the
application may not need
Traditional relational model may be too restrictive
Some of the organizations decided to develop their own
systems:
Google developed a proprietary NoSQL system known as
BigTable - used in many of Google’s applications such as
Gmail, Google Maps, and Web site indexing.
Apache Hbase is an open source NoSQL system based on
similar concepts.
This Google innovation led to the category known as column-based
or wide column stores, also referred to as column family stores.
Amazon developed a NoSQL system called DynamoDB –
available through Amazon’s cloud services.
This innovation led to the category known as key-value
data stores or sometimes key-tuple or key-object data
stores.
Facebook developed a NoSQL system called Cassandra,
which is now open source and known as Apache Cassandra.
It uses concepts from both key-value stores and column-
based systems.
Other software companies started developing their own
solutions—for example, MongoDB and CouchDB, which
are classified as document-based NoSQL systems or
document stores.
Another category of NoSQL systems is the graph-based
NoSQL systems, or graph databases
Examples: Neo4j, GraphBase
Characteristics of NOSQL Systems
NOSQL characteristics related to distributed databases and
distributed systems
Scalability - In NoSQL systems, horizontal scalability is generally
used, where the distributed system is expanded by adding more nodes
for data storage and processing as the volume of data grows.
Availability, Replication and Eventual Consistency
Replication Models
Master-slave replication requires one copy to be the master copy;
all write operations must be applied to the master copy and then
propagated to the slave copies
Master-master replication allows reads and writes at any of the
replicas but may not guarantee that reads at nodes that store
different copies see the same values.
Sharding of Files (also known as horizontal partitioning) -
distribute the load of accessing the file records to multiple nodes
High-Performance Data Access - hashing or range partitioning on
object keys
Characteristics of NOSQL Systems
NoSQL characteristics related to data models and query
languages
NoSQL systems emphasize performance and flexibility over
modeling power and complex querying
Not Requiring a Schema - allows semi-structured, self-
describing data
Less Powerful Query Languages - CRUD operations
Versioning
Categories of NOSQL Systems
Document-based NOSQL systems: These systems store data
in the form of documents using well-known formats, such as
JSON (JavaScript Object Notation)
Documents are accessible via their document id
NoSQL key-value stores: a simple data model based on fast
access by the key to the value associated with the key
Column-based or wide column NoSQL systems: These
systems partition a table by column into column families where
each column family is stored in its own files.
Graph-based NOSQL systems: Data is represented as graphs,
and related nodes can be found by traversing the edges using
path expressions.
Hybrid NoSQL systems
Object databases
XML databases
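As a rough illustration of the four main categories, the same fact (a worker assigned to a project) could be laid out as shown here. This is a sketch using plain Python structures; the field names, key format, column family names, and edge label are assumptions chosen for illustration, not the storage format of any particular system.

```python
import json

# Document store: a self-describing document, reachable by its document id.
document = {"_id": "W1", "Ename": "John Smith", "ProjectId": "P1", "Hours": 32.5}

# Key-value store: a value looked up by a unique key; the store does not
# interpret the value (here it is just a serialized string).
key_value = {
    "worker:W1": json.dumps({"Ename": "John Smith", "ProjectId": "P1", "Hours": 32.5})
}

# Wide column store: per row key, columns are grouped into column families.
wide_column = {
    "W1": {
        "Name": {"Ename": "John Smith"},
        "Assignment": {"ProjectId": "P1", "Hours": "32.5"},
    }
}

# Graph store: labeled nodes plus labeled edges that can carry their own data.
nodes = {
    "W1": {"label": "WORKER", "Ename": "John Smith"},
    "P1": {"label": "PROJECT", "Pname": "ProductX"},
}
edges = [("W1", "WORKS_ON", "P1", {"Hours": 32.5})]
```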
The CAP Theorem
CAP refers to three desirable properties of distributed systems with
replicated data: consistency, availability, and partition
tolerance
The CAP theorem states that it is not possible to guarantee all
three of the desirable properties at the same time in a distributed
system with data replication.
If this is the case, then the distributed system designer would have
to choose two properties out of the three to guarantee.
In a NoSQL distributed data store, a weaker consistency level is
often acceptable, and guaranteeing the other two properties
(availability, partition tolerance) is important.
Hence, eventual consistency is often adopted in NoSQL systems.
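Eventual consistency can be pictured with a small simulation. The sketch below is an assumption-laden toy, not how any real NoSQL system is implemented: the class names are hypothetical, writes go to one primary copy, and updates reach the other copies only when propagate() is called. A read from a secondary always gets an answer (availability) but may see a stale value until the copies converge.

```python
class Replica:
    """One copy of the data held at a node."""

    def __init__(self):
        self.data = {}


class ReplicatedStore:
    """Writes go to the primary; secondaries catch up later."""

    def __init__(self, num_secondaries=2):
        self.primary = Replica()
        self.secondaries = [Replica() for _ in range(num_secondaries)]
        self.pending = []  # updates not yet propagated to the secondaries

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # Always answers, but the value may be stale.
        return self.secondaries[replica_index].data.get(key)

    def propagate(self):
        # Apply every pending update to every secondary, in write order.
        for key, value in self.pending:
            for replica in self.secondaries:
                replica.data[key] = value
        self.pending.clear()


store = ReplicatedStore()
store.write("x", 1)
stale = store.read("x", 0)   # update has not reached the secondary yet
store.propagate()
fresh = store.read("x", 0)   # copies have converged
```

Here stale is None and fresh is 1: the secondary was temporarily behind the primary, then caught up.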
Document-Based NOSQL Systems
These systems store data as collections of similar documents
Sometimes known as document stores
Examples include MongoDB and CouchDB
MongoDB Data Model
MongoDB documents are stored in BSON (Binary JSON)
format
Individual documents are stored in a collection.
The operation createCollection is used to create each
collection.
For example, the following command can be used to create a
collection called project to hold PROJECT objects from the
COMPANY database
db.createCollection("project", { capped: true, size: 1310720, max: 500 })
db.createCollection("worker", { capped: true, size: 5242880, max: 2000 })
MongoDB Data Model
Each document in a collection has a unique ObjectId
field, called _id, which is automatically indexed in the
collection
The value of ObjectId can be specified by the user, or it
can be system-generated
System-generated ObjectIds have a specific format,
which combines the timestamp when the object is
created (4 bytes, in an internal MongoDB format), the
node id (3 bytes), the process id (2 bytes), and a counter
(3 bytes) into a 12-byte Id value.
User-generated ObjectIds can have any value specified
by the user as long as it uniquely identifies the document
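The field widths listed above add up to 12 bytes (4 + 3 + 2 + 3). The sketch below packs four such parts into one identifier to make the layout concrete. It is an illustration only: the function name make_object_id and the sample values are hypothetical, big-endian byte order is an assumption, and real drivers generate ObjectIds themselves.

```python
import struct
import time


def make_object_id(timestamp, node_id, process_id, counter):
    """Pack the four parts into 12 bytes (4 + 3 + 2 + 3), big-endian."""
    return (
        struct.pack(">I", timestamp)       # 4-byte timestamp (seconds)
        + node_id.to_bytes(3, "big")       # 3-byte node id
        + process_id.to_bytes(2, "big")    # 2-byte process id
        + counter.to_bytes(3, "big")       # 3-byte counter
    )


oid = make_object_id(int(time.time()), node_id=0xA1B2C3, process_id=4242, counter=7)
print(oid.hex())   # 24 hexadecimal characters
print(len(oid))    # 12
```

Because the timestamp occupies the leading bytes, identifiers built this way sort roughly by creation time.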
MongoDB Data Model
A collection does not have a schema
The structure of the data fields in documents is
chosen based on how documents will be accessed and
used
Denormalized document design with embedded
subdocuments
Embedded array of document references
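The two design choices can be compared side by side. The sketch below uses plain Python dictionaries in place of MongoDB documents; the field names Workers and WorkerIds are assumptions for illustration, while the project and worker values reuse the examples in this module.

```python
# Design 1: denormalized project document with embedded worker subdocuments.
project_embedded = {
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "Workers": [
        {"Ename": "John Smith", "Hours": 32.5},
        {"Ename": "Joyce English", "Hours": 20.0},
    ],
}

# Design 2: project document holds an array of references to worker documents.
project_refs = {
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "WorkerIds": ["W1", "W2"],
}
workers = {
    "W1": {"_id": "W1", "Ename": "John Smith", "Hours": 32.5},
    "W2": {"_id": "W2", "Ename": "Joyce English", "Hours": 20.0},
}

# Design 1 answers "who works on P1?" from a single document.
names_embedded = [w["Ename"] for w in project_embedded["Workers"]]

# Design 2 needs one extra lookup per reference.
names_refs = [workers[wid]["Ename"] for wid in project_refs["WorkerIds"]]
```

Embedding favors reads that always fetch a project together with its workers; references avoid duplicating worker data when the same worker appears under several projects.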
MongoDB CRUD Operations
MongoDB has several CRUD (create, read, update,
delete) operations
Documents can be created and inserted into their
collections using the insert operation, whose format is:
db.<collection_name>.insert(<document(s)>)
The parameters of the insert operation can include either a
single document or an array of documents
Example:
db.project.insert( { _id: "P1", Pname: "ProductX", Plocation: "Bellaire" } )
db.worker.insert( [ { _id: "W1", Ename: "John Smith", ProjectId: "P1", Hours: 32.5 },
                    { _id: "W2", Ename: "Joyce English", ProjectId: "P1", Hours: 20.0 } ] )
MongoDB CRUD Operations
The delete operation is called remove, and the format
is:
db.<collection_name>.remove(<condition>)
There is also an update operation, which has a
condition to select certain documents, and a $set
clause to specify the update.
For read queries, the main command is called find,
and the format is:
db.<collection_name>.find(<condition>)
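The behavior of the four operations can be mimicked in a few lines. The class below is a hypothetical in-memory stand-in, not MongoDB or one of its drivers: conditions are exact-match field/value pairs only, update handles only a $set clause and applies it to every matching document, and none of the real operations' options are modeled.

```python
class Collection:
    """In-memory stand-in for a collection; conditions are exact-match dicts."""

    def __init__(self):
        self.docs = []

    @staticmethod
    def _matches(doc, condition):
        return all(doc.get(field) == value for field, value in condition.items())

    def insert(self, documents):
        # Accept a single document or a list of documents.
        if isinstance(documents, dict):
            documents = [documents]
        self.docs.extend(dict(d) for d in documents)

    def find(self, condition=None):
        condition = condition or {}
        return [d for d in self.docs if self._matches(d, condition)]

    def update(self, condition, change):
        # Only the $set clause is modeled: overwrite the listed fields.
        for d in self.find(condition):
            d.update(change["$set"])

    def remove(self, condition):
        self.docs = [d for d in self.docs if not self._matches(d, condition)]


worker = Collection()
worker.insert([
    {"_id": "W1", "Ename": "John Smith", "ProjectId": "P1", "Hours": 32.5},
    {"_id": "W2", "Ename": "Joyce English", "ProjectId": "P1", "Hours": 20.0},
])
worker.update({"_id": "W1"}, {"$set": {"Hours": 40.0}})
worker.remove({"_id": "W2"})
remaining = worker.find({"ProjectId": "P1"})
```

After these calls, remaining holds only the W1 document, with Hours changed to 40.0.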
MongoDB Distributed Systems Characteristics
Because MongoDB is a distributed system, the two-phase
commit method is used to ensure atomicity and
consistency of multi-document transactions
Replication in MongoDB
The concept of a replica set is used in MongoDB to
create multiple copies of the same data set on different
nodes in the distributed system
It uses a variation of the master-slave approach - all
write operations must be applied to the primary copy
and then propagated to the secondaries
MongoDB Distributed Systems Characteristics
Sharding in MongoDB
Sharding of the documents in the collection—also known as
horizontal partitioning— divides the documents into disjoint
partitions known as shards.
This allows the system to add more nodes as needed by a
process known as horizontal scaling of the distributed
system
It stores the shards of the collection on different nodes to
achieve load balancing
Each node will process only those operations pertaining to
the documents in the shard stored at that node
MongoDB Distributed Systems Characteristics
Two ways to partition a collection into shards – range
partitioning and hash partitioning
Both require that the user specify a particular document field to
be used as the basis for partitioning the documents into shards.
The partitioning field, known as the shard key, must have two
characteristics:
it must exist in every document in the collection
it must have an index
Range partitioning creates the chunks by specifying a range of key
values
Hash partitioning applies a hash function h(K) to each shard key
K, and the partitioning of keys into chunks is based on the hash
values
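The two partitioning methods can be sketched as two small functions that map a shard key to a chunk number. This is an illustration under stated assumptions, not MongoDB's implementation: the function names are hypothetical, the range boundaries are made up, and MD5 merely stands in for the hash function h(K).

```python
import bisect
import hashlib


def range_chunk(shard_key, boundaries):
    """Range partitioning: chunk i holds keys K with
    boundaries[i-1] <= K < boundaries[i] (boundaries must be sorted)."""
    return bisect.bisect_right(boundaries, shard_key)


def hash_chunk(shard_key, num_chunks):
    """Hash partitioning: apply h(K) and use the hash value to pick a chunk."""
    digest = hashlib.md5(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_chunks


boundaries = ["H", "P"]   # chunk 0: keys < "H"; chunk 1: "H" up to "P"; chunk 2: the rest
for key in ["Adams", "Jones", "Smith"]:
    print(key, range_chunk(key, boundaries), hash_chunk(key, 3))
```

Range partitioning keeps neighboring key values in the same chunk, which suits range queries; hash partitioning scatters neighboring keys, which tends to spread load more evenly.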
MongoDB Distributed Systems Characteristics
When sharding is used, MongoDB queries are submitted to a
module called the query router, which keeps track of which
nodes contain which shards based on the particular
partitioning method used on the shard keys
The query (CRUD operation) will be routed to the nodes that
contain the shards that hold the documents that the query is
requesting
Sharding and replication are used together; sharding focuses
on improving performance via load balancing and horizontal
scalability, whereas replication focuses on ensuring system
availability when certain nodes fail in the distributed system
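The routing step can be sketched as follows. The QueryRouter class, the node names, and the range boundaries are all hypothetical. The sketch assumes range partitioning on a single shard key, and sends an operation to every node when its condition does not mention the shard key.

```python
import bisect


class QueryRouter:
    """Tracks which node holds which shard and routes operations accordingly."""

    def __init__(self, shard_key, boundaries, nodes):
        # One node per range: n boundaries split the key space into n + 1 shards.
        assert len(nodes) == len(boundaries) + 1
        self.shard_key = shard_key
        self.boundaries = boundaries
        self.nodes = nodes

    def route(self, condition):
        if self.shard_key in condition:
            # Targeted: only the node holding the matching shard is contacted.
            index = bisect.bisect_right(self.boundaries, condition[self.shard_key])
            return [self.nodes[index]]
        # No shard key in the condition: every shard might hold matches.
        return list(self.nodes)


router = QueryRouter("Ename", ["H", "P"], ["node1", "node2", "node3"])
targeted = router.route({"Ename": "Jones"})   # ["node2"]
broadcast = router.route({"Hours": 20.0})     # all three nodes
```

Conditions that include the shard key touch a single node, which is why the choice of shard key has such a large effect on performance.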
NoSQL Key-Value stores
The key is a unique identifier associated with a data item
and is used to locate this data item rapidly
The value is the data item itself, and it can have very
different formats for different key-value storage systems
The main characteristic of key-value stores is the fact that
every value (data item) must be associated with a unique
key, and that retrieving the value by supplying the key must
be very fast
NoSQL Key-Value stores
DynamoDB
It is an Amazon product, available as part of Amazon’s
AWS/SDK platforms
The basic data model in DynamoDB uses the concepts of
tables, items, and attributes
A table in DynamoDB does not have a schema; it holds a
collection of self-describing items
Each item will consist of a number of (attribute, value) pairs,
and attribute values can be single-valued or multivalued
DynamoDB also allows the user to specify the items in JSON
format, and the system will convert them to the internal
storage format of DynamoDB
NoSQL Key-Value stores
When a table is created, it is required to specify a table
name and a primary key
The primary key will be used to rapidly locate the items in
the table
Thus, the primary key is the key and the item is the value
for the DynamoDB key-value store
The primary key attribute must exist in every item in the
table
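The table/item/attribute model can be sketched with a small in-memory class. This is a hypothetical stand-in, not the AWS SDK: the class and method names, the table name, and the sample attributes are assumptions. It shows the two rules stated above: items are self-describing, so they need not share attributes, and every item must carry the primary key attribute.

```python
class Table:
    """Schemaless table: each item is a dict of (attribute, value) pairs."""

    def __init__(self, name, primary_key):
        self.name = name
        self.primary_key = primary_key
        self.items = {}   # primary key value -> item

    def put_item(self, item):
        if self.primary_key not in item:
            raise ValueError(
                f"item is missing primary key attribute {self.primary_key!r}"
            )
        self.items[item[self.primary_key]] = item

    def get_item(self, key_value):
        # The primary key is the key; the whole item is the value.
        return self.items.get(key_value)


project = Table("PROJECT", primary_key="ProjectId")
# Items may have different attributes; Locations is a multivalued attribute.
project.put_item({"ProjectId": "P1", "Pname": "ProductX",
                  "Locations": {"Bellaire", "Houston"}})
project.put_item({"ProjectId": "P2", "Budget": 50000})
found = project.get_item("P1")
```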
Column-Based or Wide Column NOSQL
Systems
BigTable - Google's distributed storage system for big data
An open source system known as Apache Hbase is similar
to Google BigTable, but it typically uses HDFS (Hadoop
Distributed File System) for data storage
HDFS is used in many cloud computing applications
Another well-known example of column-based NOSQL
systems is Cassandra
Column-Based or Wide Column NOSQL
Systems
BigTable (and Hbase) is sometimes described as a sparse
multidimensional distributed persistent sorted map, where
the word ‘map’ means a collection of (key, value) pairs
One of the main differences that distinguish column-based
systems from key-value stores is the nature of the key
In Hbase, the key is multidimensional and so has several
components: typically, a combination of table name, row
key, column, and timestamp
Hbase data model
The data model in Hbase organizes data using the concepts of
namespaces, tables, column families, column qualifiers, columns,
rows, and data cells
A column is identified by a combination of (column family:column
qualifier)
Data is stored in a self-describing form by associating columns with
data values, where data values are strings
Hbase also stores multiple versions of a data item, with a timestamp
associated with each version, so versions and timestamps are also
part of the Hbase data model
Hbase data model
Column Families, Column Qualifiers, and Columns
A table is associated with one or more column families
Each column family will have a name
Column families must be specified when the table is created and cannot
be changed later
The table name is followed by the names of the column families
associated with the table.
When the data is loaded into a table, each column family can be
associated with many column qualifiers
A column is specified by a combination of
ColumnFamily:ColumnQualifier.
The concept of a column family is somewhat similar to vertical partitioning:
columns (attributes) that are accessed together belong to the same
column family and are stored in the same files.
Examples in Hbase
A cell holds a basic data item in Hbase. The key (address)
of a cell is specified by a combination of (table, rowid,
columnfamily, columnqualifier, timestamp)
If timestamp is left out, the latest version of the item is
retrieved
A namespace is a collection of tables.
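Cell addressing can be sketched as a map from the multidimensional key to a value. The CellStore class below is a hypothetical in-memory illustration, not the Hbase client API; the table, row, and column names are assumptions. It keeps every version of a cell and returns the latest one when no timestamp is given.

```python
class CellStore:
    """Maps (table, rowid, columnfamily, columnqualifier, timestamp) to a value."""

    def __init__(self):
        # (table, rowid, family, qualifier) -> {timestamp: value}
        self.cells = {}

    def put(self, table, rowid, family, qualifier, value, timestamp):
        versions = self.cells.setdefault((table, rowid, family, qualifier), {})
        versions[timestamp] = value

    def get(self, table, rowid, family, qualifier, timestamp=None):
        versions = self.cells.get((table, rowid, family, qualifier), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)   # latest version
        return versions.get(timestamp)


store = CellStore()
store.put("EMPLOYEE", "row1", "Name", "Fname", "John", timestamp=1)
store.put("EMPLOYEE", "row1", "Name", "Fname", "Jonathan", timestamp=2)
latest = store.get("EMPLOYEE", "row1", "Name", "Fname")               # "Jonathan"
older = store.get("EMPLOYEE", "row1", "Name", "Fname", timestamp=1)   # "John"
```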
NOSQL Graph Databases
The data is represented as a graph, which is a collection of
vertices (nodes) and edges
Both nodes and edges can be labeled to indicate the types
of entities and relationships they represent
It is generally possible to store data associated with both
individual nodes and individual edges
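A labeled graph with data on both nodes and edges can be sketched with two plain structures. The node ids, labels, and property values below are made up for illustration, and the related() helper is hypothetical; a real graph database would offer a query language for such traversals.

```python
# Nodes: id -> properties, including a label naming the entity type.
nodes = {
    "e1": {"label": "EMPLOYEE", "Ename": "John Smith"},
    "d1": {"label": "DEPARTMENT", "Dname": "Research"},
    "p1": {"label": "PROJECT", "Pname": "ProductX"},
}

# Edges: (source, relationship label, target, properties stored on the edge).
edges = [
    ("e1", "WorksFor", "d1", {"Since": 2019}),
    ("e1", "WorksOn", "p1", {"Hours": 32.5}),
]


def related(node_id, edge_label):
    """Follow outgoing edges with the given label and return the target nodes."""
    return [
        nodes[target]
        for source, label, target, _properties in edges
        if source == node_id and label == edge_label
    ]


projects = related("e1", "WorksOn")
```

Here projects contains the single PROJECT node for ProductX, found by traversing the WorksOn edge from the employee node.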
Any Queries?