Module 5 Part II NoSQL DB

NoSQL systems are designed to handle vast amounts of data that traditional databases cannot manage, exemplified by applications like Facebook and Google's BigTable. They offer various characteristics such as scalability, availability, and flexibility in data models, with categories including document-based, key-value, column-based, and graph databases. The CAP theorem highlights the trade-offs between consistency, availability, and partition tolerance in distributed systems, with NoSQL often prioritizing availability and partition tolerance over strict consistency.

Uploaded by

2514dhruv
Copyright
© © All Rights Reserved

Introduction to NoSQL systems

• Many companies use applications that need to store vast amounts of data.
• For example, consider an application such as Facebook, with millions of users who submit
posts, many with images and videos. User profiles, user relationships, and posts must
all be stored in a huge collection of data stores.
• Traditional database systems are not well suited to this type of application.
• To store large amounts of data, some organizations developed their own systems:
o Google developed a proprietary NOSQL system known as BigTable, which is
used in many of Google’s applications that require vast amounts of data storage,
such as Gmail, Google Maps, and Web site indexing. Apache Hbase is an open
source NOSQL system based on similar concepts. Google’s innovation led to
the category of NOSQL systems known as column-based or wide column
stores;
o Amazon developed a NOSQL system called DynamoDB that is available
through Amazon’s cloud services. This innovation led to the category known as
key-value data stores
o Facebook developed a NOSQL system called Cassandra, which is now open
source and known as Apache Cassandra. This NOSQL system uses concepts
from both key-value stores and column-based systems.
o Other software companies started developing their own solutions and making
them available to users who need these capabilities—for example, MongoDB
and CouchDB, which are classified as document-based NOSQL systems or
document stores.
o Another category of NOSQL systems is the graph-based NOSQL systems, or
graph databases; these include Neo4J and GraphBase
Characteristics of NOSQL Systems
1) NOSQL characteristics related to distributed databases and distributed systems.
• Scalability:
o A system that can continuously evolve to support a growing
amount of work is considered scalable.
o There are two kinds of scalability in distributed systems: horizontal and
vertical. In NOSQL systems, horizontal scalability is generally used,
where the distributed system is expanded by adding more nodes for data
storage and processing as the volume of data grows.
o Vertical scalability, on the other hand, refers to expanding the storage
and computing power of existing nodes.
• Availability, Replication and Eventual Consistency
o Many applications that use NOSQL systems require continuous system
availability.
o To accomplish this, data is replicated over two or more nodes in a
transparent manner, so that if one node fails, the data is still available on
other nodes
o Replication improves data availability and can also improve read
performance, because read requests can often be serviced from any of
the replicated data nodes.
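o Replication can slow down writes, because an update must be applied to
every copy. Many NOSQL applications do not need strict consistency, so
a relaxed form known as eventual consistency is used.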
• Replication Models:
o Two major replication models are used in NOSQL systems: master-
slave and master-master replication.
o Master-slave replication requires one copy to be the master copy; all
write operations must be applied to the master copy and then propagated
to the slave copies.
o The master-master replication allows reads and writes at any of the
replicas but may not guarantee that reads at nodes that store different
copies see the same values.
• Sharding of Files
o In NOSQL applications, files can have many millions of records and
these records can be accessed concurrently by thousands of users. So it
is not practical to store the whole file in one node.
o Sharding (also known as horizontal partitioning) of the file records is
often employed in NOSQL systems.
o Sharding is a type of database partitioning in which a large database is
divided into smaller parts that are stored on different nodes.
• High-Performance Data Access:
o In many NOSQL applications, it is necessary to find individual records
or objects (data items) from among the millions of data records or
objects in a file.
o To achieve this, most systems use one of two techniques: hashing or
range partitioning on object keys.
o In hashing, a hash function h(K) is applied to the key K, and the location
of the object with key K is determined by the value of h(K).
o In range partitioning, the location is determined via a range of key values
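The two lookup techniques described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the use of MD5 as h(K), the modulo step, and the split points "h" and "q" are all made up for the example and do not come from any particular NOSQL system.

```python
import hashlib
from bisect import bisect_right

NODES = ["node0", "node1", "node2"]  # hypothetical node names


def hash_partition(key: str) -> str:
    """Locate a key by computing h(K) and reducing it modulo the node count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Illustrative split points: keys below "h" go to node0,
# keys from "h" up to (not including) "q" go to node1, the rest to node2.
RANGE_BOUNDS = ["h", "q"]


def range_partition(key: str) -> str:
    """Locate a key by finding the key range it falls into."""
    return NODES[bisect_right(RANGE_BOUNDS, key)]


if __name__ == "__main__":
    for k in ["alice", "mona", "zoe"]:
        print(k, hash_partition(k), range_partition(k))
```

Hashing spreads keys evenly but scatters neighbouring keys across nodes, while range partitioning keeps neighbouring keys together, which helps queries that read a whole range of key values.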
2) NOSQL characteristics related to data models and query languages.
• Not Requiring a Schema:
o Most NOSQL systems do not require a schema.
o This allows semi-structured, self-describing data.
o Since there is no schema, any constraints on the data would have to be
programmed in the application programs that access the data items.
o There are various languages for describing semistructured data, such as
JSON (JavaScript Object Notation) and XML
• Less Powerful Query Languages
o NOSQL systems may not require a powerful query language such as
SQL, because search (read) queries in these systems often locate single
objects in a single file based on their object keys.
o NOSQL systems typically provide a set of functions and operations as a
programming API, so the programmer reads and writes data objects by
calling the appropriate operations.
o The basic operations are called CRUD operations, for Create, Read,
Update, and Delete.
• Versioning:
o Some NOSQL systems provide storage of multiple versions of the data
items, with the timestamps of when the data version was created
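Because a schemaless store accepts any self-describing item, constraints such as "every employee has a name" have to be checked by the application, as noted under "Not Requiring a Schema". A minimal sketch of such a check follows; the function name and the fields `name` and `salary` are hypothetical and chosen only for illustration.

```python
def validate_employee(doc: dict) -> list:
    """Return a list of constraint violations for one self-describing item.

    The rules below are illustrative; a real application would define its own.
    """
    errors = []
    name = doc.get("name")
    if not isinstance(name, str) or not name:
        errors.append("name must be a non-empty string")
    if "salary" in doc:
        salary = doc["salary"]
        if not isinstance(salary, (int, float)) or salary < 0:
            errors.append("salary must be a non-negative number")
    return errors


if __name__ == "__main__":
    # Extra attributes such as "dept" are allowed because there is no schema.
    print(validate_employee({"name": "John Smith", "dept": "Research"}))
    print(validate_employee({"salary": -5}))
```

The store itself would accept both items; only the application code decides that the second one is invalid.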

CAP Theorem
• The CAP theorem explains some of the competing requirements in a distributed system
with replication.
• The three letters in CAP refer to three desirable properties of distributed systems with
replicated data:
o Consistency
▪ Consistency means that the nodes will have the same copies of a
replicated data item visible for various transactions.
o Availability
▪ Availability means that each read or write request for a data item will
either be processed successfully or will receive a message that the
operation cannot be completed.
o Partition tolerance
▪ Partition tolerance means that the system can continue operating if the
network connecting the nodes has a fault that results in two or more
partitions, where the nodes in each partition can only communicate
among each other.
• The CAP theorem states that it is not possible to guarantee all three of the desirable
properties—consistency, availability, and partition tolerance—at the same time in a
distributed system with data replication.
• If this is the case, then the distributed system designer would have to choose two
properties out of the three to guarantee.
• In a NOSQL distributed data store, a weaker consistency level is often acceptable, and
guaranteeing the other two properties (availability, partition tolerance) is important.
• In particular, a form of consistency known as eventual consistency is often adopted in
NOSQL systems.
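The trade-off can be made concrete with a toy simulation of two replicas. This is a sketch under assumptions, not how any specific system works: the `Replica` class, the explicit timestamps, and the "newest timestamp wins" rule used by `sync` are all choices made for the example. During the simulated partition both replicas keep answering requests (availability), one of them returns a stale value (weaker consistency), and after the partition heals the copies converge (eventual consistency).

```python
class Replica:
    """One copy of the data; each key maps to (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Keep the write only if it is newer than what this replica holds.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions; the newer timestamp wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.data.items()):
            dst.write(key, value, ts)


r1, r2 = Replica("N1"), Replica("N2")
for r in (r1, r2):
    r.write("x", "old", timestamp=1)  # both replicas start in agreement

# Network partition: this write reaches N1 only.
r1.write("x", "new", timestamp=2)
stale = r2.read("x")  # N2 still answers, but with the old value

# Partition heals: the replicas exchange updates and converge.
sync(r1, r2)
converged = r1.read("x") == r2.read("x") == "new"
```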

Categories of NOSQL Systems


1. Document-based NOSQL systems
2. NOSQL key-value stores
3. Column-based or wide column NOSQL systems
4. Graph-based NOSQL systems

Document-Based NOSQL Systems and MongoDB


• Document stores hold collections of similar documents.
• Documents are accessible via their document id, but can also be accessed rapidly using
other indexes.
• Since schema is not needed, the documents are specified as self-describing data
• Although the documents in a collection should be similar they can have different data
elements (attributes)
• Documents can be specified in various formats, such as XML or JSON.
• Examples of document based NOSQL systems- MongoDB, CouchDB
MongoDB Data Model:
• Documents are stored in BSON (Binary JSON) format, which is a variation of JSON
with some additional data types and is more efficient for storage than JSON
• Individual documents are stored in a collection.
• The operation createCollection is used to create each collection
o For example, the following command can be used to create a collection called
project to hold PROJECT objects from the COMPANY database:
db.createCollection("project", { capped : true, size : 1310720, max : 500 })
o The first parameter "project" is the name of the collection, which is followed by an
optional document that specifies collection options. In our example, the collection
is capped; this means it has upper limits on its storage space (size) and number of
documents (max).
o Create another document collection called worker to hold information about the
EMPLOYEEs who work on each project:
db.createCollection("worker", { capped : true, size : 5242880, max : 2000 })
• A collection does not have a schema. So for the structure of the data fields in documents,
the user can choose a normalized design (similar to normalized relational tuples) or a
denormalized design (similar to XML documents or complex objects).

The figure shows simplified MongoDB documents.

In figure a) the workers' information is embedded in the project document, so there is
no need for the “worker” collection. This is known as the denormalized pattern.
In figure b) worker references are embedded in the project document, but the worker
documents themselves are stored in a separate “worker” collection.

A third option in Figure 24.1(c) would use a normalized design.
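The three designs can be pictured as plain Python dictionaries standing in for documents. All field names and values below (Pname, Plocation, Workers, WorkerIds, ProjectId, and so on) are illustrative assumptions rather than a copy of the figure, and the shape used for design (c), where each worker document refers back to its project, is one plausible reading of "normalized".

```python
# (a) Denormalized: worker information is embedded in the project document.
project_embedded = {
    "_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire",
    "Workers": [
        {"Ename": "John Smith", "Hours": 32.5},
        {"Ename": "Joyce English", "Hours": 20.0},
    ],
}

# (b) The project embeds worker references; worker documents live in a
# separate "worker" collection (modelled here as a dict keyed by _id).
project_with_refs = {
    "_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire",
    "WorkerIds": ["W1", "W2"],
}
worker_collection = {
    "W1": {"_id": "W1", "Ename": "John Smith", "Hours": 32.5},
    "W2": {"_id": "W2", "Ename": "Joyce English", "Hours": 20.0},
}

# (c) Normalized: the project document has no worker data at all; each
# worker document carries the id of its project, like a foreign key.
project_normalized = {"_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire"}
workers_normalized = [
    {"_id": "W1", "Ename": "John Smith", "ProjectId": "P1", "Hours": 32.5},
    {"_id": "W2", "Ename": "Joyce English", "ProjectId": "P1", "Hours": 20.0},
]


def names_embedded(project):
    return [w["Ename"] for w in project["Workers"]]


def names_by_reference(project, workers):
    # Design (b) needs one extra lookup per referenced worker.
    return [workers[wid]["Ename"] for wid in project["WorkerIds"]]


def names_normalized(project_id, workers):
    # Design (c) needs a search of the worker documents.
    return [w["Ename"] for w in workers if w["ProjectId"] == project_id]
```

The embedded design answers "who works on P1?" from a single document, while the other two designs avoid duplicating worker data at the cost of extra lookups.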

MongoDB CRUD Operations:


• MongoDB has several CRUD operations, where CRUD stands for create, read, update,
and delete.
• Insertion:
o Documents can be created and inserted into their collections using the insert
operation
o The parameters of the insert operation can include either a single document or
an array of documents
db.<collection_name>.insert(<document(s)>)
• Remove
o The delete operation is called remove.
o The documents to be removed from the collection are specified by a Boolean
condition.
db.<collection_name>.remove(<condition>)
• Update
o The update operation has a condition to select certain documents, and a $set
clause to specify the update.
• Read:
o For read queries, the main command is called find.
o General Boolean conditions can be specified as <condition>, and the documents
in the collection that return true are selected for the query result.
db.<collection_name>.find(<condition>)
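To see how the four operations fit together without a running server, the sketch below uses a tiny in-memory stand-in for a collection. `ToyCollection` is hypothetical and is not MongoDB's implementation or API: conditions are limited to simple field equality, and `update` changes every matching document, which is a simplification chosen for the example.

```python
class ToyCollection:
    """In-memory stand-in for a document collection (illustration only)."""

    def __init__(self):
        self.docs = []

    @staticmethod
    def _matches(doc, condition):
        # A condition is a dict of field: value pairs that must all be equal.
        return all(doc.get(f) == v for f, v in condition.items())

    def insert(self, documents):
        # Accept a single document or a list of documents.
        if isinstance(documents, dict):
            documents = [documents]
        self.docs.extend(dict(d) for d in documents)

    def find(self, condition=None):
        condition = condition or {}
        return [dict(d) for d in self.docs if self._matches(d, condition)]

    def update(self, condition, change):
        # Apply the fields listed under "$set" to every matching document.
        for d in self.docs:
            if self._matches(d, condition):
                d.update(change.get("$set", {}))

    def remove(self, condition):
        self.docs = [d for d in self.docs if not self._matches(d, condition)]


project = ToyCollection()
project.insert([
    {"_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire"},
    {"_id": "P2", "Pname": "ProductY", "Plocation": "Sugarland"},
])
project.update({"_id": "P1"}, {"$set": {"Plocation": "Houston"}})
project.remove({"_id": "P2"})
result = project.find({"Plocation": "Houston"})
```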

MongoDB Distributed Systems Characteristics:


• Since MongoDB is a distributed system, the two-phase commit method is used to
ensure atomicity and consistency of multidocument transactions.
• Replication in MongoDB:
o The concept of replica set is used in MongoDB to create multiple copies of the
same data set on different nodes
o MongoDB uses a variation of the master-slave approach for replication.
o For example,
▪ Suppose that we want to replicate a particular document collection C.
▪ A replica set will have one primary copy of the collection C stored in
one node N1, and at least one secondary copy (replica) of C stored at
another node N2. Additional copies can be stored in nodes N3, N4, etc.
▪ The total number of participants in a replica set must be at least three,
so if only one secondary copy is needed, a participant in the replica set
known as an arbiter must run on the third node N3.
▪ The arbiter does not hold a replica of the collection but participates in
elections to choose a new primary if the node storing the current primary
copy fails.
▪ In MongoDB replication, all write operations must be applied to the
primary copy and then propagated to the secondaries.
▪ The default read preference processes all reads at the primary copy, so
all read and write operations are performed at the primary node.
• Sharding in MongoDB
▪ Storing all the documents in one node can lead to performance problems.
▪ Sharding of the documents in the collection (also known as horizontal
partitioning) divides the documents into disjoint partitions known as
shards.
▪ This allows the system to add more nodes as needed by a process known
as horizontal scaling of the distributed system.
▪ There are two ways to partition a collection into shards in MongoDB:
• Both require that the user specify a particular document field to
be used as the basis for partitioning the documents into shards.
This partitioning field is known as the shard key.
• Range partitioning creates the chunks by
specifying a range of key values; each chunk contains
the key values in one range.
• Hash partitioning applies a hash function h(K) to each shard
key K, and the partitioning of keys into chunks is based on the
hash values.
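The role of the arbiter in the replication example above can be illustrated with a simplified election. The sketch assumes that a new primary can only be chosen while a majority of the participants are reachable, and it ignores everything else a real election would weigh; the `Member` class and `elect_primary` function are hypothetical names used only here.

```python
class Member:
    """A replica set participant; an arbiter votes but holds no data copy."""

    def __init__(self, name, holds_data=True):
        self.name = name
        self.holds_data = holds_data
        self.alive = True


def elect_primary(members):
    """Pick a live data-bearing member if a majority of members is alive."""
    alive = [m for m in members if m.alive]
    if len(alive) * 2 <= len(members):
        return None  # no majority, so no primary can be elected
    candidates = [m for m in alive if m.holds_data]
    return candidates[0].name if candidates else None


# Primary N1, secondary N2, arbiter N3 (three participants).
n1, n2, n3 = Member("N1"), Member("N2"), Member("N3", holds_data=False)
n1.alive = False  # the node holding the primary copy fails
new_primary = elect_primary([n1, n2, n3])

# Same failure with only two participants and no arbiter.
m1, m2 = Member("N1"), Member("N2")
m1.alive = False
no_arbiter_result = elect_primary([m1, m2])
```

With three participants the surviving secondary and the arbiter form a majority, so N2 can take over; with only two participants the single survivor cannot.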
NOSQL Key-Value Stores
• The key is a unique identifier associated with a data item and is used to locate that
data item rapidly.
• The value is the data item itself.
o There can be different formats for different key-value storage systems.
o E.g., a string of bytes, structured data rows (tuples) similar to relational
data, or semistructured data using JSON.
• The main characteristic of key-value stores is every value (data item) must be associated
with a unique key
• Examples:
o DynamoDB
o Oracle key-value store.
o Redis key-value cache and store.
o Apache Cassandra.
DynamoDB overview:
• The DynamoDB system is an Amazon product and is available as part of Amazon’s
AWS/SDK platforms (Amazon Web Services/Software Development Kit).
• It can be used as part of Amazon’s cloud computing services, for the data storage
component.
DynamoDB data model
• A table in DynamoDB does not have a schema; it holds a collection of self-describing
items.
• Each item will consist of a number of (attribute, value) pairs, and attribute values can
be single-valued or multivalued.
• When a table is created, it is required to specify a table name and a primary key; the
primary key will be used to rapidly locate the items in the table. Thus, the primary key
is the key and the item is the value for the DynamoDB key-value store.
• The primary key attribute must exist in every item in the table. The primary key can be
one of the following two types:
o A single attribute: The DynamoDB system will use this attribute to build a
hash index on the items in the table. This is called a hash type primary key.
o A pair of attributes: The primary key will be a pair of attributes (A, B):
attribute A will be used for hashing, and because there will be multiple items
with the same value of A, the B values will be used for ordering the records with
the same A value. This is called a hash and range type primary key.
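A hash and range type primary key can be pictured as a dictionary keyed by attribute A whose entries are kept sorted by attribute B. The sketch below is an in-memory illustration only; `HashRangeTable` and its methods are hypothetical names and are not part of any Amazon SDK.

```python
from bisect import insort
from collections import defaultdict


class HashRangeTable:
    """Toy table with a (hash attribute, range attribute) primary key."""

    def __init__(self):
        # hash attribute value -> list of (range value, item), kept sorted
        self._partitions = defaultdict(list)

    def put(self, hash_value, range_value, item):
        bucket = self._partitions[hash_value]
        # Drop any existing item with the same full key, then insert in order.
        bucket[:] = [(r, it) for r, it in bucket if r != range_value]
        insort(bucket, (range_value, item))

    def get(self, hash_value, range_value):
        for r, item in self._partitions.get(hash_value, []):
            if r == range_value:
                return item
        return None

    def query(self, hash_value):
        """All items that share one hash value, ordered by range value."""
        return [item for _, item in self._partitions.get(hash_value, [])]


# Hypothetical usage: A is a user id, B is a posting time.
posts = HashRangeTable()
posts.put("u1", 20, {"text": "second"})
posts.put("u1", 10, {"text": "first"})
posts.put("u2", 5, {"text": "other"})
ordered = posts.query("u1")
```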
Voldemort Key-Value Distributed Data Store
• Voldemort is an open source system available through Apache 2.0 open source
licensing rules. It is based on Amazon's Dynamo.
• Voldemort has been used by LinkedIn for data storage
• Features of Voldemort are as follows:
o Simple basic operations:
▪ A collection of (key, value) pairs is kept in a Voldemort store. We will
assume the store is called s.
▪ The operation s.put(k, v) inserts an item as a key-value pair with key k
and value v
▪ The operation s.delete(k) deletes the item whose key is k from the store
▪ The operation v = s.get(k) retrieves the value v associated with key k.
o High-level formatted data values.
▪ The values v in the (k, v) items can be specified in JSON and the system
will convert between JSON and the internal storage format
o Consistent hashing for distributing (key, value) pairs.
▪ A variation of the data distribution algorithm known as consistent
hashing is used for data distribution among the nodes
▪ A hash function h(k) is applied to the key k and h(k) determines where
the item will be stored
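The put, get and delete operations and the idea of consistent hashing can be combined in one small sketch. It is a basic hash ring without the refinements a production system would add, and it is not Voldemort's actual code or client API: `ConsistentHashStore`, the node names, and the use of MD5 for h(k) are assumptions made for the example.

```python
import hashlib
from bisect import bisect_right


def h(key: str) -> int:
    """Hash function placing keys and nodes on the same ring of integers."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashStore:
    """Toy key-value store whose items are spread over nodes on a hash ring."""

    def __init__(self, node_names):
        # Each node sits on the ring at the position given by h(node name).
        self.ring = sorted((h(name), name) for name in node_names)
        self.nodes = {name: {} for name in node_names}

    def node_for(self, k):
        # An item belongs to the first node clockwise from h(k).
        positions = [pos for pos, _ in self.ring]
        index = bisect_right(positions, h(k)) % len(self.ring)
        return self.ring[index][1]

    def put(self, k, v):
        self.nodes[self.node_for(k)][k] = v

    def get(self, k):
        return self.nodes[self.node_for(k)].get(k)

    def delete(self, k):
        self.nodes[self.node_for(k)].pop(k, None)


s = ConsistentHashStore(["n1", "n2", "n3"])
s.put("k1", "v1")
v = s.get("k1")
```

The benefit of the ring is that adding a node only takes over keys from its neighbour on the ring; every other key stays where it was, so little data has to move when the system grows.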

Column-Based or Wide Column NOSQL Systems


• Example: BigTable
o The Google distributed storage system for big data, known as BigTable, is a
well-known example of this class.
o It is used in many Google applications that require large amounts of data
storage, such as Gmail.
o BigTable uses the Google File System (GFS) for data storage and distribution.
• Example: Apache Hbase
o Uses HDFS (Hadoop Distributed File System) for data storage.
o HDFS is used in many cloud computing applications.
• BigTable (and Hbase) is sometimes described as a sparse multidimensional
distributed persistent sorted map, where the word map means a collection of (key,
value) pairs
• One of the main differences between column-based systems and key-value stores is
the nature of the key. In column-based systems such as Hbase, the key is
multidimensional and so has several components: typically, a combination of table
name, row key, column, and timestamp.
Hbase data model
• Hbase organizes data using the concepts of namespaces, tables, column families,
column qualifiers, columns, rows, and data cells.
• Tables and Rows.
o Data in Hbase is stored in tables, and each table has a table name.
o Data in a table is stored as self-describing rows.
o Each row has a unique row key, and row keys are strings that must have the
property that they can be lexicographically ordered
• Column Families, Column Qualifiers, and Columns
o A table is associated with one or more column families. Each column family
has a name; it must be specified when the table is created and cannot be
changed later.
o When the data is loaded into a table, each column family can be associated
with many column qualifiers.
o A column is specified by a combination of
ColumnFamily:ColumnQualifier.
• Versions and Timestamps. Hbase can keep several versions of a data item, along
with the timestamp associated with each version
• Cells. A cell holds a basic data item in Hbase.
• Namespaces. A namespace is a collection of tables
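The multidimensional key can be sketched as a map from (row key, column, timestamp) to a cell value, with one such map per table. This is an in-memory illustration and not the Hbase client API; `ToyWideColumnTable`, the table name, and the column families used in the example are hypothetical.

```python
class ToyWideColumnTable:
    """Toy table: (row key, 'Family:Qualifier', timestamp) -> cell value."""

    def __init__(self, name, column_families):
        self.name = name
        # Column families are fixed when the table is created.
        self.column_families = frozenset(column_families)
        self.cells = {}

    def put(self, row_key, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers need no declaration; a new one appears on first use.
        self.cells[(row_key, column, timestamp)] = value

    def get(self, row_key, column):
        """Return the most recent version of a cell, or None if absent."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row_key and c == column]
        return max(versions)[1] if versions else None

    def row_keys(self):
        """Row keys in lexicographic order."""
        return sorted({r for r, _, _ in self.cells})


employee = ToyWideColumnTable("EMPLOYEE", ["Name", "Details"])
employee.put("row2", "Name:Fname", "Alicia", timestamp=1)
employee.put("row1", "Name:Fname", "John", timestamp=1)
employee.put("row1", "Details:Job", "Engineer", timestamp=1)
employee.put("row1", "Details:Job", "Manager", timestamp=2)  # newer version
latest_job = employee.get("row1", "Details:Job")
```

The map is sparse: row2 simply has no Details:Job cell, and nothing is stored for it.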
NOSQL Graph Databases and Neo4j
• The data is represented as a graph, which is a collection of vertices (nodes) and
edges.
• Example: Neo4j
Neo4j Data Model
• The data model in Neo4j organizes data using the concepts of nodes and
relationships.
o Nodes can have labels; the nodes that have the same label are grouped
into a collection.
o Relationships are directed; each relationship has a start node and end
node as well as a relationship type.
o Both nodes and relationships can have properties, which can be
specified via a map pattern made of one or more “name :
value” pairs enclosed in curly brackets;
for example {Lname : ‘Smith’, Fname : ‘John’, Minit : ‘B’}
• The Neo4j graph data model somewhat resembles how data is represented in
the ER and EER models:
o nodes correspond to entities,
o node labels correspond to entity types and subclasses,
o relationships correspond to relationship instances,
o relationship types correspond to relationship types,
o and properties correspond to attributes.
• To create nodes and relationships we use a high-level query language, Cypher
o A Cypher query is made up of clauses. When a query has several clauses,
the result from one clause can be the input to the next clause in the query

• Labels and properties.
o When a node is created, the node label can be specified.
o Here the node labels are EMPLOYEE, DEPARTMENT, PROJECT, and
LOCATION.
• Relationships and relationship types.
o The → specifies the direction of the relationship
o The relationship types are WorksFor, Manager, LocatedIn, and WorksOn
• Paths.
o A path specifies a traversal of part of the graph. It is typically used as part of a
query to specify a pattern where the query will retrieve from the graph data that
matches the pattern.
• Optional Schema.
o A schema is optional in Neo4j.
• Indexing and node identifiers.
o When a node is created, the Neo4j system creates an internal unique system-
defined identifier for each node.
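The concepts in this section (labelled nodes, directed typed relationships, properties, paths, and system-assigned identifiers) can be modelled with a very small property graph. The sketch is written in Python rather than Cypher and is not Neo4j's API; `ToyGraph` and its method names are hypothetical, and the sample data only borrows the labels and relationship types listed above.

```python
class ToyGraph:
    """Tiny in-memory property graph (illustration only)."""

    def __init__(self):
        self.nodes = {}   # node id -> {"labels": set, "props": dict}
        self.rels = []    # (start id, relationship type, end id, props)
        self._next_id = 0

    def create_node(self, labels, **props):
        # The graph assigns an internal identifier to each new node.
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"labels": set(labels), "props": props}
        return node_id

    def create_rel(self, start, rel_type, end, **props):
        # Relationships are directed from start to end.
        self.rels.append((start, rel_type, end, props))

    def nodes_with_label(self, label):
        return [i for i, n in self.nodes.items() if label in n["labels"]]

    def follow(self, start, rel_type):
        """End nodes reached from start over relationships of one type."""
        return [e for s, t, e, _ in self.rels if s == start and t == rel_type]


g = ToyGraph()
emp = g.create_node(["EMPLOYEE"], Lname="Smith", Fname="John", Minit="B")
dept = g.create_node(["DEPARTMENT"], Dname="Research")
loc = g.create_node(["LOCATION"], Lname="Houston")
g.create_rel(emp, "WorksFor", dept)
g.create_rel(dept, "LocatedIn", loc)

# A two-step path: EMPLOYEE -WorksFor-> DEPARTMENT -LocatedIn-> LOCATION
locations = [g.nodes[l]["props"]["Lname"]
             for d in g.follow(emp, "WorksFor")
             for l in g.follow(d, "LocatedIn")]
```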
