Module 7 - NoSQL

The document discusses NoSQL databases and provides details about MongoDB and key-value stores. It describes the characteristics of NoSQL systems, different categories including document, key-value, column and graph databases. It also covers MongoDB data model, operations and distributed features like replication and sharding.

Module 7

NoSQL

Dr. S. RENUKA DEVI


Professor
SCOPE
VIT Chennai Campus
What is NoSQL?
Not Only SQL

Most NoSQL systems are distributed databases or distributed
storage systems, with a focus on semi-structured data storage,
high performance, availability, data replication, and scalability

A structured relational SQL system may not be appropriate for
applications storing vast amounts of data because SQL systems
offer too many services which the application may not need
Traditional relational model may be too restrictive
Some of the organizations decided to develop their own
systems:
Google developed a proprietary NoSQL system known as
BigTable - used in many of Google’s applications such as
Gmail, Google Maps, and Web site indexing.
Apache HBase is an open source NoSQL system based on
similar concepts.
Google's innovation led to the category known as column-based
or wide column stores, also referred to as column family stores.
Amazon developed a NoSQL system called DynamoDB –
available through Amazon’s cloud services.
This innovation led to the category known as key-value
data stores or sometimes key-tuple or key-object data
stores.
Facebook developed a NoSQL system called Cassandra,
which is now open source and known as Apache Cassandra.
It uses concepts from both key-value stores and column-
based systems.

Other software companies started developing their own
solutions, for example MongoDB and CouchDB, which
are classified as document-based NoSQL systems or
document stores.

Another category of NoSQL systems is graph-based
NoSQL systems, or graph databases
Examples: Neo4j, GraphBase
Characteristics of NOSQL Systems
NOSQL characteristics related to distributed databases and
distributed systems
• Scalability - In NoSQL systems, horizontal scalability is generally
used, where the distributed system is expanded by adding more nodes
for data storage and processing as the volume of data grows.
• Availability, Replication and Eventual Consistency
• Replication Models
  • Master-slave replication requires one copy to be the master copy;
  all write operations must be applied to the master copy and then
  propagated to the slave copies
  • Master-master replication allows reads and writes at any of the
  replicas but may not guarantee that reads at nodes that store
  different copies see the same values.
• Sharding of Files (also known as horizontal partitioning) -
distributes the load of accessing the file records across multiple nodes
• High-Performance Data Access - uses hashing or range partitioning on
object keys
Characteristics of NOSQL Systems
NoSQL characteristics related to data models and query
languages
NoSQL systems emphasize performance and flexibility over
modeling power and complex querying
Not Requiring a Schema - allows semi-structured, self-
describing data
Less Powerful Query Languages - CRUD operations
Versioning
Categories of NOSQL Systems
• Document-based NOSQL systems: These systems store data
in the form of documents using well-known formats, such as
JSON (JavaScript Object Notation)
  • Documents are accessible via their document id
• NoSQL key-value stores: a simple data model based on fast
access by the key to the value associated with the key
• Column-based or wide column NoSQL systems: These
systems partition a table by column into column families where
each column family is stored in its own files.
• Graph-based NOSQL systems: Data is represented as graphs,
and related nodes can be found by traversing the edges using
path expressions.
• Hybrid NoSQL systems
• Object databases
• XML databases
The CAP Theorem
• CAP refers to three desirable properties of distributed systems with
replicated data: consistency, availability and partition
tolerance

• The CAP theorem states that it is not possible to guarantee all
three of the desirable properties at the same time in a distributed
system with data replication.

• If this is the case, then the distributed system designer would have
to choose two properties out of the three to guarantee.

• In a NoSQL distributed data store, a weaker consistency level is
often acceptable, and guaranteeing the other two properties
(availability, partition tolerance) is important.

• Hence, eventual consistency is often adopted in NoSQL systems.


Document-Based NOSQL Systems

Document-based systems store data as collections of similar documents

Sometimes known as document stores

Examples include MongoDB and CouchDB


MongoDB Data Model
MongoDB documents are stored in BSON (Binary JSON)
format
• Individual documents are stored in a collection.
The operation createCollection is used to create each
collection.
For example, the following commands can be used to create a
collection called project to hold PROJECT objects from the
COMPANY database, and another collection called worker:
db.createCollection("project", { capped : true, size : 1310720, max : 500 } )
db.createCollection("worker", { capped : true, size : 5242880, max : 2000 } )
MongoDB Data Model
Each document in a collection has a unique ObjectId
field, called _id, which is automatically indexed in the
collection
The value of ObjectId can be specified by the user, or it
can be system-generated
System-generated ObjectIds have a specific format,
which combines the timestamp when the object is
created (4 bytes, in an internal MongoDB format), the
node id (3 bytes), the process id (2 bytes), and a counter
(3 bytes) into a 12-byte Id value.
User-generated ObjectIds can have any value specified
by the user as long as it uniquely identifies the document
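The field widths listed on this slide add up to 4 + 3 + 2 + 3 = 12 bytes. The following is a minimal sketch of that layout, not the MongoDB driver's implementation: the function name make_object_id and its parameters are hypothetical, and recent MongoDB versions are documented as using a random value in place of the node id and process id.

```python
import struct
import time


def make_object_id(node_id: bytes, process_id: int, counter: int, timestamp=None) -> bytes:
    """Pack timestamp, node id, process id and counter into one 12-byte value.

    Illustrative only: mirrors the layout described on the slide.
    """
    if len(node_id) != 3:
        raise ValueError("node_id must be exactly 3 bytes")
    if timestamp is None:
        timestamp = int(time.time())
    ts_bytes = struct.pack(">I", timestamp & 0xFFFFFFFF)      # 4 bytes, big-endian seconds
    pid_bytes = struct.pack(">H", process_id & 0xFFFF)        # 2 bytes
    counter_bytes = (counter & 0xFFFFFF).to_bytes(3, "big")   # 3 bytes
    return ts_bytes + node_id + pid_bytes + counter_bytes


# Hypothetical inputs chosen only to make the layout visible.
oid = make_object_id(b"\x01\x02\x03", process_id=4242, counter=1, timestamp=1_700_000_000)
print(len(oid), oid.hex())
```

Because the timestamp occupies the leading bytes, ids generated later compare greater than ids generated earlier, as long as the byte order is big-endian as assumed here.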
MongoDB Data Model
A collection does not have a schema
The structure of the data fields in documents is
chosen based on how documents will be accessed and
used
Denormalized document design with embedded
subdocuments
Embedded array of document references
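The two design options above can be sketched with plain Python dictionaries standing in for documents. This is only an illustration: the field names Workers and WorkerIds are assumptions, and the sample values reuse the project and worker data from the insert examples in this module.

```python
# Option 1: denormalized design, workers embedded as subdocuments.
project_embedded = {
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "Workers": [  # hypothetical field name
        {"Ename": "John Smith", "Hours": 32.5},
        {"Ename": "Joyce English", "Hours": 20.0},
    ],
}

# Option 2: the project holds an array of references (worker _id values),
# and the worker documents live in their own collection.
project_with_refs = {
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "WorkerIds": ["W1", "W2"],  # hypothetical field name
}
worker_collection = {
    "W1": {"_id": "W1", "Ename": "John Smith", "Hours": 32.5},
    "W2": {"_id": "W2", "Ename": "Joyce English", "Hours": 20.0},
}


def total_hours_embedded(project):
    """One document read is enough: the workers travel with the project."""
    return sum(w["Hours"] for w in project["Workers"])


def total_hours_refs(project, workers):
    """Each reference needs an extra lookup in the worker collection."""
    return sum(workers[wid]["Hours"] for wid in project["WorkerIds"])


print(total_hours_embedded(project_embedded))
print(total_hours_refs(project_with_refs, worker_collection))
```

The embedded form favors reads that always need the workers together with the project; the reference form avoids duplicating worker data when workers are accessed on their own.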
MongoDB CRUD Operations
MongoDB has several CRUD (create, read, update,
delete) operations
Documents can be created and inserted into their
collections using the insert operation, whose format is:
db.<collection_name>.insert(<document(s)>)
The parameters of the insert operation can include either a
single document or an array of documents
Example:
db.project.insert( { _id: "P1", Pname: "ProductX", Plocation: "Bellaire" } )
db.worker.insert( [ { _id: "W1", Ename: "John Smith", ProjectId: "P1", Hours: 32.5 }, { _id: "W2", Ename: "Joyce English", ProjectId: "P1", Hours: 20.0 } ] )
MongoDB CRUD Operations
The delete operation is called remove, and the format
is:
db.<collection_name>.remove(<condition>)

There is also an update operation, which has a
condition to select certain documents, and a $set
clause to specify the update.
For read queries, the main command is called find,
and the format is:
db.<collection_name>.find(<condition>)
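To make the role of the <condition> argument concrete, here is a toy in-memory collection in Python. It is a sketch under simplifying assumptions, not MongoDB's behavior: ToyCollection is a hypothetical class, conditions support only field equality, and update changes every matching document.

```python
class ToyCollection:
    """In-memory stand-in for a collection, keyed by _id."""

    def __init__(self):
        self._docs = {}

    def insert(self, docs):
        # Accept a single document or a list of documents.
        if isinstance(docs, dict):
            docs = [docs]
        for doc in docs:
            if doc["_id"] in self._docs:
                raise KeyError(f"duplicate _id: {doc['_id']}")
            self._docs[doc["_id"]] = dict(doc)

    @staticmethod
    def _matches(doc, condition):
        # A document matches when every field in the condition is equal.
        return all(doc.get(field) == value for field, value in condition.items())

    def find(self, condition=None):
        condition = condition or {}
        return [dict(d) for d in self._docs.values() if self._matches(d, condition)]

    def update(self, condition, set_fields):
        # set_fields plays the role of the $set clause.
        changed = 0
        for doc in self._docs.values():
            if self._matches(doc, condition):
                doc.update(set_fields)
                changed += 1
        return changed

    def remove(self, condition):
        doomed = [key for key, d in self._docs.items() if self._matches(d, condition)]
        for key in doomed:
            del self._docs[key]
        return len(doomed)


worker = ToyCollection()
worker.insert([
    {"_id": "W1", "Ename": "John Smith", "ProjectId": "P1", "Hours": 32.5},
    {"_id": "W2", "Ename": "Joyce English", "ProjectId": "P1", "Hours": 20.0},
])
worker.update({"_id": "W2"}, {"Hours": 25.0})
worker.remove({"_id": "W1"})
remaining = worker.find({"ProjectId": "P1"})
print(remaining)
```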
MongoDB Distributed Systems Characteristics
Because MongoDB is a distributed system, the two-phase
commit method is used to ensure atomicity and
consistency of multi-document transactions

Replication in MongoDB
The concept of a replica set is used in MongoDB to
create multiple copies of the same data set on different
nodes in the distributed system
It uses a variation of the master-slave approach - all
write operations must be applied to the primary copy
and then propagated to the secondaries
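The write path described above can be sketched as a small Python model. It is a toy under stated assumptions, not MongoDB's replication protocol: ToyReplicaSet and its methods are hypothetical names, and propagation is triggered by hand so the window of stale reads is visible.

```python
class ToyReplicaSet:
    """One primary copy plus several secondary copies of the same data set."""

    def __init__(self, secondary_count=2):
        self.primary = {}
        self.secondaries = [{} for _ in range(secondary_count)]
        self.pending = []  # writes applied at the primary but not yet propagated

    def write(self, key, value):
        # All writes go to the primary copy first.
        self.primary[key] = value
        self.pending.append((key, value))

    def propagate(self):
        # Later, the same writes are replayed on every secondary, in order.
        for key, value in self.pending:
            for copy in self.secondaries:
                copy[key] = value
        self.pending.clear()

    def read(self, key, secondary=None):
        source = self.primary if secondary is None else self.secondaries[secondary]
        return source.get(key)


rs = ToyReplicaSet()
rs.write("P1", "ProductX")
print(rs.read("P1"))               # read at the primary sees the write
print(rs.read("P1", secondary=0))  # a secondary may still be stale
rs.propagate()
print(rs.read("P1", secondary=0))  # after propagation the copies agree
```

The stale read between write and propagate is the behavior that eventual consistency tolerates.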
MongoDB Distributed Systems Characteristics
Sharding in MongoDB
• Sharding of the documents in the collection (also known as
horizontal partitioning) divides the documents into disjoint
partitions known as shards.
• This allows the system to add more nodes as needed by a
process known as horizontal scaling of the distributed
system
• The system stores the shards of the collection on different nodes to
achieve load balancing
• Each node will process only those operations pertaining to
the documents in the shard stored at that node
MongoDB Distributed Systems Characteristics
Two ways to partition a collection into shards – range
partitioning and hash partitioning
Both require that the user specify a particular document field to
be used as the basis for partitioning the documents into shards.
The partitioning field, known as the shard key, must have two
characteristics:
• it must exist in every document in the collection
• it must have an index
Range partitioning creates the chunks by specifying a range of key
values
Hash partitioning applies a hash function h(K) to each shard key
K, and the partitioning of keys into chunks is based on the hash
values
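The two partitioning methods can be compared with a short Python sketch. The function names, the split points and the use of SHA-256 are assumptions made for illustration; MongoDB uses its own hash function and chunk metadata.

```python
import bisect
import hashlib


def range_chunk(shard_key, boundaries):
    """Range partitioning: boundaries is a sorted list of split points.

    Keys below boundaries[0] fall in chunk 0, keys from boundaries[0] up to
    (but not including) boundaries[1] fall in chunk 1, and so on.
    """
    return bisect.bisect_right(boundaries, shard_key)


def hash_chunk(shard_key, num_chunks):
    """Hash partitioning: apply h(K) to the shard key K, then pick a chunk."""
    digest = hashlib.sha256(str(shard_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_chunks


boundaries = ["H", "P"]  # hypothetical split points on a name field
for name in ["Adams", "Jones", "Smith"]:
    print(name, range_chunk(name, boundaries), hash_chunk(name, 3))
```

Range partitioning keeps neighboring keys in the same chunk, which suits range queries; hash partitioning spreads keys more evenly but scatters neighboring keys across chunks.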
MongoDB Distributed Systems Characteristics
• When sharding is used, MongoDB queries are submitted to a
module called the query router, which keeps track of which
nodes contain which shards based on the particular
partitioning method used on the shard keys

• The query (CRUD operation) will be routed to the nodes that
contain the shards that hold the documents that the query is
requesting

• Sharding and replication are used together; sharding focuses
on improving performance via load balancing and horizontal
scalability, whereas replication focuses on ensuring system
availability when certain nodes fail in the distributed system
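The routing step can be sketched as follows, assuming range partitioning on the shard key. ToyQueryRouter, the node names and the split points are hypothetical; the sketch only shows how a lookup table of key ranges lets a query reach the relevant nodes and skip the rest.

```python
import bisect


class ToyQueryRouter:
    """Tracks which node holds which key range and routes queries accordingly."""

    def __init__(self, boundaries, nodes):
        # nodes[i] stores the shard for the i-th key range.
        if len(nodes) != len(boundaries) + 1:
            raise ValueError("need exactly one node per key range")
        self.boundaries = list(boundaries)
        self.nodes = list(nodes)

    def route_key(self, shard_key):
        """A query on a single shard key value goes to exactly one node."""
        return self.nodes[bisect.bisect_right(self.boundaries, shard_key)]

    def route_range(self, low, high):
        """A range query goes to every node whose range overlaps [low, high]."""
        first = bisect.bisect_right(self.boundaries, low)
        last = bisect.bisect_right(self.boundaries, high)
        return self.nodes[first:last + 1]


router = ToyQueryRouter(["H", "P"], ["node-a", "node-b", "node-c"])
print(router.route_key("Jones"))
print(router.route_range("Adams", "Jones"))
```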
NoSQL Key-Value stores
The key is a unique identifier associated with a data item
and is used to locate this data item rapidly

The value is the data item itself, and it can have very
different formats for different key-value storage systems

The main characteristic of key-value stores is the fact that
every value (data item) must be associated with a unique
key, and that retrieving the value by supplying the key must
be very fast
NoSQL Key-Value stores
DynamoDB
DynamoDB is an Amazon product and is available as part of Amazon’s
AWS/SDK platforms
The basic data model in DynamoDB uses the concepts of
tables, items, and attributes
A table in DynamoDB does not have a schema; it holds a
collection of self-describing items
Each item will consist of a number of (attribute, value) pairs,
and attribute values can be single-valued or multivalued
DynamoDB also allows the user to specify the items in JSON
format, and the system will convert them to the internal
storage format of DynamoDB
NoSQL Key-Value stores
When a table is created, it is required to specify a table
name and a primary key

The primary key will be used to rapidly locate the items in
the table

Thus, the primary key is the key and the item is the value
for the DynamoDB key-value store

The primary key attribute must exist in every item in the
table
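The table, item and primary key concepts above can be modeled in a few lines of Python. This is a toy, not the DynamoDB API: ToyTable, its put and get methods, and the Employee attributes are hypothetical names chosen for illustration.

```python
class ToyTable:
    """Schema-less table: only the table name and primary key are fixed."""

    def __init__(self, name, primary_key):
        self.name = name
        self.primary_key = primary_key
        self._items = {}  # primary key value -> item (the "value" of the store)

    def put(self, item):
        # The primary key attribute must exist in every item.
        if self.primary_key not in item:
            raise ValueError(f"item is missing primary key {self.primary_key!r}")
        self._items[item[self.primary_key]] = dict(item)

    def get(self, key_value):
        # Lookup by primary key is a single dictionary access.
        return self._items.get(key_value)


employees = ToyTable("Employee", primary_key="EmpId")
# Items are self-describing: each carries its own (attribute, value) pairs,
# and an attribute value may be multivalued (here, a list).
employees.put({"EmpId": "E1", "Name": "John Smith", "Skills": ["SQL", "Java"]})
employees.put({"EmpId": "E2", "Name": "Joyce English", "Office": "Bellaire"})
print(employees.get("E1"))
print(employees.get("E2"))
```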
Column-Based or Wide Column NOSQL
Systems
BigTable - Google’s distributed storage system for big data

An open source system known as Apache HBase is similar
to Google BigTable, but it typically uses HDFS (Hadoop
Distributed File System) for data storage

HDFS is used in many cloud computing applications

Another well-known example of column-based NOSQL
systems is Cassandra
Column-Based or Wide Column NOSQL
Systems
BigTable (and HBase) is sometimes described as a sparse
multidimensional distributed persistent sorted map, where
the word ‘map’ means a collection of (key, value) pairs

One of the main differences that distinguish column-based
systems from key-value stores is the nature of the key

In HBase, the key is multidimensional and so has several
components: typically, a combination of table name, row
key, column, and timestamp
HBase data model
• The data model in HBase organizes data using the concepts of
namespaces, tables, column families, column qualifiers, columns,
rows, and data cells

• A column is identified by a combination of (column family:column
qualifier)

• Data is stored in a self-describing form by associating columns with
data values, where data values are strings

• HBase also stores multiple versions of a data item, with a timestamp
associated with each version, so versions and timestamps are also
part of the HBase data model
HBase data model
Column Families, Column Qualifiers, and Columns
• A table is associated with one or more column families
• Each column family will have a name
• Column families must be specified when the table is created and cannot
be changed later
• The table name is followed by the names of the column families
associated with the table.
• When the data is loaded into a table, each column family can be
associated with many column qualifiers
• A column is specified by a combination of
ColumnFamily:ColumnQualifier.
• The concept of column family is somewhat similar to vertical partitioning
because columns (attributes) that belong to the same column family,
and are therefore accessed together, are stored in the same files.
Examples in HBase

A cell holds a basic data item in HBase. The key (address)
of a cell is specified by a combination of (table, rowid,
columnfamily, columnqualifier, timestamp)

If timestamp is left out, the latest version of the item is
retrieved

A namespace is a collection of tables.
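Cell addressing and versioning can be sketched in Python as a map from (rowid, columnfamily, columnqualifier) to timestamped versions. This is a toy, not the HBase client API: ToyHBaseTable, its methods, and the EMPLOYEE table with its Name and Address families are hypothetical.

```python
class ToyHBaseTable:
    """Cells addressed by (rowid, columnfamily, columnqualifier, timestamp)."""

    def __init__(self, name, column_families):
        self.name = name
        # Column families are fixed when the table is created.
        self.column_families = tuple(column_families)
        self._cells = {}  # (rowid, family, qualifier) -> {timestamp: value}

    def put(self, rowid, column, value, timestamp):
        family, qualifier = column.split(":", 1)
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are created on the fly; values are stored as strings.
        self._cells.setdefault((rowid, family, qualifier), {})[timestamp] = str(value)

    def get(self, rowid, column, timestamp=None):
        family, qualifier = column.split(":", 1)
        versions = self._cells.get((rowid, family, qualifier), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # no timestamp given: latest version
        return versions.get(timestamp)


employee = ToyHBaseTable("EMPLOYEE", ["Name", "Address"])
employee.put("row1", "Name:Fname", "John", timestamp=1)
employee.put("row1", "Name:Fname", "Jon", timestamp=2)
print(employee.get("row1", "Name:Fname"))               # latest version
print(employee.get("row1", "Name:Fname", timestamp=1))  # an older version
```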


NOSQL Graph Databases

The data is represented as a graph, which is a collection of
vertices (nodes) and edges

Both nodes and edges can be labeled to indicate the types
of entities and relationships they represent

It is generally possible to store data associated with both
individual nodes and individual edges
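A labeled graph with data on both nodes and edges can be sketched in Python as follows. ToyGraph, the labels and the property names are hypothetical; graph databases such as Neo4j provide their own storage model and query language.

```python
class ToyGraph:
    """Labeled nodes and labeled, directed edges, each with its own properties."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": ..., "props": {...}}
        self.edges = []  # (source id, edge label, target id, props)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, source, label, target, **props):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((source, label, target, props))

    def neighbors(self, node_id, edge_label):
        """Follow outgoing edges with the given label from node_id."""
        return [t for s, lbl, t, _ in self.edges if s == node_id and lbl == edge_label]


g = ToyGraph()
g.add_node("e1", "EMPLOYEE", name="John Smith")
g.add_node("d1", "DEPARTMENT", name="Research")
g.add_edge("e1", "WorksFor", "d1", since="2020")  # data stored on the edge itself
print(g.neighbors("e1", "WorksFor"))
print(g.nodes["d1"]["props"]["name"])
```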
Any Queries?
