Module-2
22AML64A
Next-Gen Database Technology using MongoDB
Sushmitha M
ASSISTANT PROFESSOR
DEPARTMENT OF AI & ML
Module 2
NoSQL Big Data Management: Introduction, NoSQL Data Store, NoSQL Data
Architecture Patterns, NoSQL to Manage Big Data, Shared-Nothing Architecture
for Big Data Tasks.
Textbook 3: Chapter 3
INTRODUCTION
2. Flexibility
Makes it very easy to install, implement and debug new services in a distributed
environment.
3. Sharding
Storing the different parts of data onto different sets of data nodes, clusters or
servers.
For example, a university's student database is huge. Sharding divides it into smaller
databases, called shards. Each shard may correspond to the database for an
individual course and year. Each shard is stored on a different node or server.
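The university example above can be sketched in a few lines of Python. This is only an illustration: the record fields, the sample names and the choice of (course, year) as the shard key are assumptions made for the sketch, and the shards are plain in-memory lists rather than separate servers.

```python
from collections import defaultdict

# Hypothetical student records; the field names are illustrative only.
students = [
    {"id": 1, "name": "Asha", "course": "AIML", "year": 2},
    {"id": 2, "name": "Ravi", "course": "CSE", "year": 1},
    {"id": 3, "name": "Meena", "course": "AIML", "year": 2},
    {"id": 4, "name": "John", "course": "CSE", "year": 3},
]

def shard_key(record):
    """Shard key: one shard per (course, year) pair."""
    return (record["course"], record["year"])

# Group the records into shards. In a real cluster each shard
# would be stored on its own node or server.
shards = defaultdict(list)
for record in students:
    shards[shard_key(record)].append(record)

for key, rows in sorted(shards.items()):
    print(key, [r["name"] for r in rows])
```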
4. Speed: Computing power increases in a distributed computing system because shards
run in parallel and independently on individual data nodes in clusters (no data sharing
between shards).
8. Performance:
The collection of processors in the system provides higher performance than a
centralized computer, owing to the lower cost of communication among machines
(cost here means the time taken up in communication).
Textbook 2: Chapter 3
NOSQL DATA STORE
[Figure: languages listed are Prolog, Lisp, Haskell, Miranda, Erlang and SQL (in the broadest sense)]
Consistency in transactions means that a transaction must maintain the integrity
constraint, and follow the consistency principle.
For example, the difference between the sum of deposited amounts and the sum of
withdrawn amounts in a bank account must equal the last balance.
All three values need to be consistent.
Durability means a transaction must persist once completed.
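The bank-account rule above (deposits minus withdrawals must equal the last balance) can be written as a simple check. The amounts and the function name `is_consistent` are made up for this sketch.

```python
# Illustrative amounts only.
deposits = [5000, 1500, 2000]
withdrawals = [1200, 800]
last_balance = 6500

def is_consistent(deposits, withdrawals, balance):
    """True when sum(deposits) - sum(withdrawals) equals the recorded balance."""
    return sum(deposits) - sum(withdrawals) == balance

print(is_consistent(deposits, withdrawals, last_balance))  # consistent state
print(is_consistent(deposits, withdrawals, 7000))          # violated constraint
```

A transaction that updates only one of the three values would leave the account in a state where this check fails.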
NoSQL
A new category of data stores is NoSQL (Not Only SQL) data stores.
NoSQL is an altogether new way of thinking about databases, with features such as schema
flexibility, simple relationships, dynamic schemas, auto-sharding, replication,
integrated caching, horizontal scalability of shards, distributable tuples,
semi-structured data, and flexibility in approach.
Issues with NoSQL data stores are lack of standardization in approaches, processing
difficulties for complex queries, and dependence on eventually consistent results in
place of consistency in all states.
CAP Theorem
Have you ever seen an advertisement for a
landscaper, house painter, or some other
tradesperson that starts with the headline,
“Cheap, Fast, and Good: Pick Two”?
CAP theorem states that in networked shared-data systems or distributed systems, we can
only achieve at most two out of three guarantees for a database: Consistency, Availability,
and Partition Tolerance.
A distributed system is a network that stores data on more than one node (physical or virtual
machines) at the same time.
At least two of C, A and P are present for an application, service or process. Consistency
means all copies have the same value, as in traditional DBs. Availability means at least one
copy is available in case a partition becomes inactive or fails. For example, in web
applications, the other copy in the other partition remains available. Partition means parts
that are active but may not cooperate (share), as in distributed DBs.
Consistency in distributed databases means that all nodes observe the same data at the same
time. Therefore, the operations in one partition of the database should be reflected in the
other related partitions of a distributed database. For example, an operation that changes the
sales data of a specific showroom in one table should also be reflected in the related tables
that use that sales data.
Availability means that during transactions, the field values must be available in other
partitions of the database so that each request receives a response, on success as well as
on failure. (On failure, a replica of the data serves the request.) Distributed
databases require transparency between one another. A network failure may lead to data
unavailability in a certain partition if there is no replication. Replication ensures availability.
Partition means a division of a large database into different databases, by adopting
specified procedures, without affecting their operations.
Partition tolerance: Refers to the continuation of operations as a whole even in case of
message loss, node failure, or node not reachable.
The CAP theorem is also called Brewer’s Theorem, because it was first advanced by Professor
Eric A. Brewer during a talk he gave on distributed computing in 2000.
Consistency means that all clients see the same data at the same time, no matter which node
they connect to. For this to happen, whenever data is written to one node, it must be instantly
forwarded or replicated to all the other nodes in the system before the write is deemed
‘successful.’
Availability means that any client making a request for data gets a response, even if one or
more nodes are down. Another way to state this—all working nodes in the distributed system
return a valid response for any request, without exception.
A partition is a communications break within a distributed system—a lost or temporarily
delayed connection between two nodes. Partition tolerance means that the cluster must
continue to work despite any number of communication breakdowns between nodes in the
system.
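The trade-off can be illustrated with a toy two-replica store. This is a sketch under stated assumptions, not how any real database implements it: the class names and the "CP"/"AP" mode labels are introduced here for illustration, and a partition is simulated with a boolean flag.

```python
class Replica:
    def __init__(self):
        self.data = {}

class TwoNodeStore:
    """Toy store with replicas a and b; mode is "CP" or "AP" (labels assumed for this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when a and b cannot communicate

    def write(self, key, value):
        """Client writes through node a; returns True if the write is accepted."""
        if self.partitioned:
            if self.mode == "CP":
                return False          # refuse: stay consistent, give up availability
            self.a.data[key] = value  # AP: accept locally, replicas now diverge
            return True
        self.a.data[key] = value      # no partition: replicate before acknowledging
        self.b.data[key] = value
        return True

    def read(self, key, node):
        replica = self.a if node == "a" else self.b
        return replica.data.get(key)

cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write("x", 1)
    store.partitioned = True          # simulate a communication break

cp_accepted = cp.write("x", 2)        # rejected during the partition
ap_accepted = ap.write("x", 2)        # accepted, but node b still holds the old value
```

During the partition the CP store stays consistent but rejects the write, while the AP store stays available but its two nodes return different values.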
Schema-less Database
The flexibility of NoSQL databases is responsible for the rising popularity of the schemaless
approach, which is often considered more user-friendly than scaling a schema-based (SQL) database.
NoSQL data does not necessarily have a fixed table schema.
The systems do not use the concept of Join (between distributed datasets).
A highly distributed cluster of nodes manages a single large data store with a NoSQL DB.
Data written at one node replicates to multiple nodes. Therefore, the replicas are identical,
and the data store is fault-tolerant and partitioned into shards.
Distributed databases can store and process a set of information on more than one computing
node.
BASE Properties
BA stands for basically available, S stands for soft state and E stands for eventually
consistent.
Basically Available – Rather than enforcing immediate consistency, BASE-modelled NoSQL databases
will ensure availability of data by spreading and replicating it across the nodes of the database cluster.
Soft State – Due to the lack of immediate consistency, data values may change over time. The BASE
model breaks off with the concept of a database that enforces its own consistency, delegating that
responsibility to developers.
Eventually Consistent – The fact that BASE does not enforce immediate consistency does not mean
that it never achieves it. However, until it does, data reads are still possible (even though they might
not reflect reality).
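A small simulation can make "eventually consistent" concrete. The sketch below assumes a hypothetical store in which a write lands on one replica first and reaches the others only when `sync()` runs. The class and method names are invented for illustration.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy store: writes hit replica 0 first and reach the others only on sync()."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = deque()  # updates not yet applied to the other replicas

    def write(self, key, value):
        self.replicas[0][key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica):
        # Reads are always answered (basically available), possibly with stale data.
        return self.replicas[replica].get(key)

    def sync(self):
        """Apply all pending updates; afterwards every replica agrees."""
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i][key] = value

store = EventuallyConsistentStore()
store.write("stock", 10)
stale = store.read("stock", replica=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("stock", replica=2)  # now consistent with replica 0
```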
NOSQL DATA ARCHITECTURE PATTERNS
Key-Value Store
• The simplest way to implement a schema-less data store is to use key-value pairs.
• The data store characteristics are high performance, scalability and flexibility.
• Data retrieval is fast in key-value pairs data store.
• A simple string, called a key, maps to a large data string or BLOB (Binary Large Object).
• A key-value store uses a primary key to access the values. Therefore, the store can be easily
scaled up for very large data.
• The concept is similar to a hash table, where a unique key points to a particular item (or items) of data.
Figure shows key-value pairs architectural pattern and example of students' database as key-value pairs
5. Values returned by queries can be converted into lists, table columns, and data-frame fields
and columns.
6. Have (i) scalability, (ii) reliability, (iii) portability and (iv) low operational cost.
7. The key can be synthetic or auto-generated. The key is flexible and can be represented in many
formats: (i) Artificially generated strings created from a hash of a value, (ii) Logical path names to
images or files, (iii) REST web-service calls (request response cycles), and (iv) SQL queries.
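The key formats listed above can be tried out with an ordinary Python dictionary standing in for the key-value store. This is a minimal sketch: the helper names `make_key`, `put` and `get`, and the sample values, are assumptions for illustration.

```python
import hashlib

store = {}  # a plain dict stands in for the key-value store

def make_key(value: bytes) -> str:
    """Artificially generated key: a hash of the value (format (i) in the list above)."""
    return hashlib.sha256(value).hexdigest()

def put(value: bytes) -> str:
    key = make_key(value)
    store[key] = value
    return key

def get(key):
    """Access is by key only; the store does not interpret the value."""
    return store.get(key)

blob = b"student transcript, semester 6"
key = put(blob)

# A logical path name can serve as a key as well (format (ii) in the list above).
store["images/students/1001.png"] = b"<binary image data>"
```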
Document Store
A document store database (also known as a document-oriented database,
aggregate database, or simply document store or document database) is a database
that uses a document-oriented model to store data.
Document store databases store each record and its associated data within a single
document. Each document contains semi-structured data that can be queried
against using various query and analytics tools of the DBMS.
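As a rough illustration, documents can be modelled as Python dictionaries and a collection as a list. The field names and the `find` helper below are hypothetical and are not the API of any particular document database.

```python
# Each record and its associated data live in one document.
# The documents need not share the same fields (semi-structured data).
documents = [
    {"_id": 1, "name": "Asha", "courses": [{"code": "22AML64A", "grade": "A"}]},
    {"_id": 2, "name": "Ravi", "email": "ravi@example.com", "courses": []},
]

def find(collection, predicate):
    """Return the documents for which predicate(document) is true."""
    return [doc for doc in collection if predicate(doc)]

# Query on a nested field: who is enrolled in course 22AML64A?
enrolled = find(
    documents,
    lambda d: any(c["code"] == "22AML64A" for c in d["courses"]),
)
print([d["name"] for d in enrolled])
```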
JSON:
JSON is a data exchange format that stands for JavaScript Object Notation with
the extension .json. JSON is known as a lightweight data format type and is
favored for its human readability and nesting features. It is often used in
conjunction with APIs and data configuration.
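The nesting feature can be seen with Python's standard `json` module. The student record below is invented for the example.

```python
import json

# A nested record: an object inside an object, plus a list.
student = {
    "name": "Asha",
    "year": 3,
    "address": {"city": "Bengaluru", "pin": "560001"},
    "skills": ["Python", "MongoDB"],
}

text = json.dumps(student, indent=2)  # serialize to human-readable JSON text
restored = json.loads(text)           # parse the text back into Python objects
print(text)
```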
CSV:
CSV is a data storage format that stands for Comma Separated Values with the
extension .csv. CSV files store data values (plain text) in a list format separated
by commas. Notably, CSV files tend to be smaller in size and can be opened in
text editors.
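For comparison, a flat table can be written and read back with Python's standard `csv` module. The rows are illustrative, and an in-memory buffer is used instead of a `.csv` file so the sketch has no side effects.

```python
import csv
import io

rows = [
    {"id": "1", "name": "Asha", "year": "3"},
    {"id": "2", "name": "Ravi", "year": "2"},
]

# Write the rows as comma-separated text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "year"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back; every value comes back as a plain string.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

Unlike JSON, CSV has no nesting and no data types: each row is a flat list of text values.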
XML
• Scalability
• Partitionability
• Availability
• Tree-like columnar
• Ease of adding new data
• Replication of columns
• No optimization for Join
Scalability
The database uses row IDs and column names to locate a column and values
in the column fields.
The interface for the fields is simple.
The back-end system can distribute queries over a large number of
processing nodes without performing any Join operations.
Retrieval of data from the distributed nodes can be kept simple by intelligent
planning of row IDs and columns, thereby increasing performance.
Scalability here means the addition of a number of rows.
The number of processing instructions is proportional to the number of
ACVMs due to scalable operations.
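Locating a value by row ID and column name can be sketched with nested dictionaries. The row IDs, the column family "sales" and the column names below are placeholders chosen for the example.

```python
# row ID -> column family -> column name -> value
table = {
    "row1": {"sales": {"kitkat": 120, "milk": 80}},
    "row2": {"sales": {"kitkat": 95}},  # sparse: this row has no "milk" column
}

def get_value(table, row_id, family, column, default=None):
    """Locate a field by row ID and column name; no Join is involved."""
    return table.get(row_id, {}).get(family, {}).get(column, default)

# Adding a row is just one more entry; existing rows are untouched.
table["row3"] = {"sales": {"milk": 60}}

print(get_value(table, "row1", "sales", "milk"))
print(get_value(table, "row2", "sales", "milk"))
```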
Partitionability:
• For example, large data of ACVMs can be partitioned into datasets of, say, 1 MB
each, as a number of row groups.
• Values in the columns of each row group are processed in memory at a partition.
• The row groups are processed independently and in parallel, in memory, at the
partitioned nodes.
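The idea of processing row groups independently and in parallel can be sketched as follows. Two simplifications are assumed: row groups are cut by row count rather than by size in MB, and a thread pool stands in for the partitioned nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def row_groups(rows, group_size):
    """Split the rows into consecutive row groups of at most group_size rows."""
    for start in range(0, len(rows), group_size):
        yield rows[start:start + group_size]

def column_sum(group, column):
    """Work done at one partition: aggregate one column of its own row group."""
    return sum(row[column] for row in group)

rows = [{"sales": n} for n in range(1, 11)]    # ten illustrative rows
groups = list(row_groups(rows, group_size=4))  # groups of 4, 4 and 2 rows

# Each row group is processed independently; the threads stand in for nodes.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(lambda g: column_sum(g, "sales"), groups))

total = sum(partial_sums)  # combine the per-partition results
```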
Availability:
• The cost of replication is lower since the system scales on distributed nodes
efficiently.
• The lack of Join operations enables storing a part of a column-family matrix
on remote computers.
• Thus, the data is always available in case of failure of any node.
Querying all the field values in a column of a family, all columns in the family, or
a group of column families is fast in an in-memory column-family data store.
No optimization for Join: Column-family data stores are similar to sparse matrix
data. The data is not optimized for Join operations.
Graph Database
Examples of graph model usages are social networks of connected people. The connections to
related persons become easier to model when using the graph model.
Figure: Graph database for Car Model Sale
Characteristics of graph databases are:
Link analysis,
Friend of friend queries,
Rules and inference,
Rule induction and
Pattern matching.
Link analysis is needed to perform searches and look for patterns and relationships in situations,
such as social networking, telephone, or email.
Examples of graph DBs are:
Neo4j, AllegroGraph, HyperGraphDB, InfiniteGraph, Titan, and FlockDB.
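A friend-of-friend query can be sketched on a small social graph held as an adjacency dictionary. The people and connections are invented. A graph database would answer the same question with its own query language.

```python
# Undirected friendship graph: person -> set of direct friends.
friends = {
    "Asha": {"Ravi", "Meena"},
    "Ravi": {"Asha", "John"},
    "Meena": {"Asha", "John", "Sara"},
    "John": {"Ravi", "Meena"},
    "Sara": {"Meena"},
}

def friends_of_friends(graph, person):
    """People exactly two links away: friends of friends who are not already friends."""
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    return result - direct - {person}

print(sorted(friends_of_friends(friends, "Asha")))
```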
3.4 NoSQL to Manage Big Data
A Big Data solution needs scalable storage of terabytes and petabytes, dropping
of support for database Joins, and storing data differently on several
distributed servers (data nodes) together as a cluster.
Solutions such as CouchDB, DynamoDB, MongoDB and Cassandra follow the
CAP theorem (compromising on consistency) to make
transactions faster and easier to scale. A solution must also be partition
tolerant.
High and easy scalability: NoSQL data stores are designed to expand
horizontally. Horizontal scaling means scaling out by adding more
machines as data nodes (servers) to the pool of resources (processing,
memory, network connections). The design scales out using multi-utility cloud
services.
Use of less expensive servers: NoSQL data stores
require less management effort. They support many features, such as automatic
repair, easier data distribution, and simpler data models, that make database
administrator (DBA) and tuning requirements less stringent.
Use of open-source tools: NoSQL data stores are cheap and open-source.
Database implementation is easy and typically uses cheap servers to manage
the exploding data and transactions, while RDBMS databases are expensive
and use big servers and storage systems. So, the cost per gigabyte of data storage
and processing can be many times less than the cost of an RDBMS.
SHARED-NOTHING ARCHITECTURE
FOR BIG DATA TASKS
In an RDBMS, the columns of two tables relate through a relationship, which a relational
algebraic equation specifies. Keys are shared between two or more SQL tables.
Shared nothing (SN), by contrast, is a cluster architecture in which a node does not share
data with any other node.
The data of different data stores is partitioned among a number of nodes (assigning different
computers to deal with different users or queries). Processing may require every node to
maintain its own copy of the application's data, using a coordination protocol.
Examples of systems that use this partitioning and processing are Hadoop, Flink and Spark.
1. Independence: Each node shares no memory with other nodes and is therefore
computationally self-sufficient
2. Self-Healing: A link failure causes the creation of another link
3. Each node functioning as a shard: Each node stores a shard (a partition of a large DB)
4. No network contention
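The features above can be sketched with a toy cluster in which each node owns one shard and computes on it alone. The `Node` class, the three-node cluster, the hash-based placement and the sample sales figures are all assumptions made for illustration.

```python
import zlib

class Node:
    """One shared-nothing node: it owns its shard and shares no memory with peers."""

    def __init__(self, name):
        self.name = name
        self.shard = {}  # this node's partition of the data

    def store(self, key, value):
        self.shard[key] = value

    def local_total(self):
        # Computed from local data only; no other node is consulted.
        return sum(self.shard.values())

nodes = [Node(f"node-{i}") for i in range(3)]

def node_for(key):
    """Route a key to exactly one node; crc32 is stable across runs."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]

sales = {"showroom-A": 12, "showroom-B": 7, "showroom-C": 20, "showroom-D": 5}
for key, value in sales.items():
    node_for(key).store(key, value)

# Each node works independently; only the small partial results are combined.
partial_totals = [node.local_total() for node in nodes]
grand_total = sum(partial_totals)
```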
Choosing the Distribution Models
Big Data requires distribution on multiple data nodes at clusters.
Distributed software components give the advantage of parallel processing, thus providing
horizontal scalability.
Distribution gives (i) the ability to handle large-sized data, and (ii) processing of many
read and write operations simultaneously in an application.
A resource manager manages, allocates, and schedules the resources of each processor,
memory and network connection.
Four models for distribution of the data store are given below:
Single Server Model
The simplest distribution option for NoSQL data store and access is Single Server Distribution
(SSD) of an application.
A graph database processes the relationships between nodes at a server.
The SSD model suits graph DBs well.
Aggregates of datasets may be key-value, column-family or BigTable data stores that require
sequential processing. These data stores also use the SSD model. An application processes the
data sequentially on a single server.
Figure shows the SSD model. Process and datasets distribute to a single server which runs the
application.
Sharding Very Large Databases
Figure shows sharding of very large datasets into four divisions,
each running the application on one of four different servers, i, j, k and l,
at the cluster. DBi, DBj, DBk and DBl are the four shards.
The application programming model in the SN architecture is such
that an application process runs on multiple shards in parallel.
Sharding provides horizontal scalability.
A data store may add an auto-sharding feature.
Performance improves in the SN architecture. However, in case of a link
failure with the application, the application can migrate the shard
DB to another node.
Master Slave Distribution
The master directs the slaves. In the Master Slave Distribution (MSD) model,
data replicates on multiple slave servers.
When a process updates the master, the update also propagates to the slaves. A
process uses the slaves for read operations.
Processing performance improves when a process runs on large
datasets distributed onto the slave nodes.
Figure shows an example of MongoDB. The MongoDB database
server is mongod and the client is mongo.
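The read and write paths of the MSD model can be sketched with a toy in-memory store. This is not MongoDB's API or replication mechanism. The class `MasterSlaveStore` and its round-robin read policy are assumptions for illustration.

```python
import itertools

class MasterSlaveStore:
    """Toy MSD model: writes go to the master, reads are served by the slaves."""

    def __init__(self, n_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]
        self._next_slave = itertools.cycle(range(n_slaves))  # round-robin reads

    def write(self, key, value):
        self.master[key] = value   # update the master first
        for slave in self.slaves:  # then propagate the update to every slave
            slave[key] = value

    def read(self, key):
        slave = self.slaves[next(self._next_slave)]
        return slave.get(key)

store = MasterSlaveStore(n_slaves=2)
store.write("course", "22AML64A")
first_read = store.read("course")   # served by slave 0
second_read = store.read("course")  # served by slave 1
```

Adding slaves spreads the read load, but every write still passes through the single master.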
Peer-to-Peer Distribution Model
Peer-to-Peer distribution (PPD) model and replication show the following characteristics:
• All replication nodes accept read requests and send the responses.
• All replicas function equally.
• Node failures do not cause loss of write capability, as another replica node responds.
Cassandra adopts the PPD model. The data is distributed among all the nodes in a cluster.
Performance can be further enhanced by adding nodes. Since nodes both read and write, a replicated
node also has updated data.
The biggest complication in the model is consistency. When writes to the same data occur on
different nodes, write inconsistency occurs.
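Write inconsistency between peers can be shown with two toy replicas that each accept a write for the same key before exchanging updates. The `Peer` class, the `conflicts` helper and the seat-booking data are invented for this sketch.

```python
class Peer:
    """Toy peer: every peer accepts writes locally."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

def conflicts(p, q):
    """Keys that both peers hold but with different values."""
    shared = p.data.keys() & q.data.keys()
    return {k for k in shared if p.data[k] != q.data[k]}

peer_a, peer_b = Peer("a"), Peer("b")

# Two clients book the same seat on different peers before the peers synchronize.
peer_a.write("seat-12", "Asha")
peer_b.write("seat-12", "Ravi")

print(conflicts(peer_a, peer_b))
```

Both writes succeeded, so the model stays available for writes, but the peers now disagree and the conflict has to be detected and resolved.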
Choosing Master Slave versus Peer-to-Peer
Master-slave replication provides greater scalability for read operations. Replication provides resilience
for reads. A single master does not provide resilience for writes. Peer-to-peer replication provides
resilience for both reads and writes.
Combining Sharding with Replication
Master-slave replication combined with sharding creates multiple masters. However, each data item has
a single master. Configuration assigns a master to a group of datasets. Peer-to-peer replication and
sharding use the same strategy for column-family data stores. The shards replicate on the nodes, which
perform both read and write operations.
Ways of Handling Big Data Problems
Use replication to horizontally distribute the client read-requests: Replication means
creating backup copies of data in real time. Many Big Data clusters use replication to
make data retrieval failure-proof in a distributed environment. Replication
enables horizontal scaling out of client requests.
Moving queries to the data, not the data to the queries: Most NoSQL data stores use
cloud utility services (large graph databases may use enterprise servers). Moving
client-node queries to the data is efficient, and is a requirement in Big Data
solutions.
Distributing queries to multiple nodes: Client queries for the DBs are analyzed at the
analyzers, which distribute the queries evenly to the data nodes or replica nodes.
High-performance query processing requires the use of multiple nodes. Query execution
takes place separately from query evaluation (evaluation means interpreting
the query and generating a plan for its execution sequence).
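Even distribution of queries over replica nodes can be sketched with a round-robin assignment. The node names, the query labels and the `distribute` function are hypothetical, and a real analyzer would also take node load and data placement into account.

```python
from collections import Counter
from itertools import cycle

replica_nodes = ["node-1", "node-2", "node-3"]

def distribute(queries, nodes):
    """Assign queries to nodes in round-robin order so the load is spread evenly."""
    return {query: node for query, node in zip(queries, cycle(nodes))}

queries = [f"q{i}" for i in range(6)]
plan = distribute(queries, replica_nodes)  # which node will execute each query
load = Counter(plan.values())              # number of queries per node

print(plan)
print(load)
```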