Module 2 Notes
The demerits of distributed computing are: (i) issues in troubleshooting in a larger networking
infrastructure, (ii) additional software requirements and (iii) security risks for data and resources.
Big Data solutions require a scalable distributed computing model with a shared-nothing architecture. One
solution is to store Big Data in HDFS files.
NoSQL data stores also hold Big Data and support random read/write access, whereas access to HDFS data
is sequential. HBase is one NoSQL solution. Examples of other solutions are MongoDB and Cassandra.
The MongoDB and Cassandra DBMSs create HDFS-compatible distributed data stores and include their own
query processing languages.
NoSQL
Big Data NoSQL Solutions
NoSQL DBs are needed for Big Data solutions. They play an important role in handling Big Data challenges.
Table 3.1 gives examples of widely used NoSQL data stores.
CAP Theorem
At least two of C, A and P are present for an application/service/process.
Consistency means all copies hold the same value, as in traditional DBs.
Availability means at least one copy remains available if a partition becomes inactive or
fails. For example, in web applications, the copy in another partition is still available.
Partition means parts which are active but may not cooperate (share data), as in distributed DBs.
3 SUSHMITHA M, Asst. Prof., Dept. of AI & ML, GAT, Bengaluru
Module 2 Next – Gen Database technology using MongoDB 22AML64A
Consistency in distributed databases means that all nodes observe the same data at the same time.
Therefore, in a distributed database, operations in one partition should be reflected in the other related
partitions. Operations that change the sales data of a specific showroom in one table should also be
reflected in the related tables that use that sales data.
Availability means that during transactions the field values must be available in other partitions of the
database, so that each request receives a response, whether it reports success or failure. (On a failure, the
response is served from a replica of the data.) Distributed databases require transparency between one another.
Network failure may lead to data unavailability in a certain partition in case of no replication. Replication
ensures availability.
Partition means division of a large database into different databases without affecting the operations on
them by adopting specified procedures.
Partition tolerance refers to the continuation of operations as a whole even in case of message loss, node
failure or an unreachable node.
Brewer’s CAP (Consistency, Availability and Partition Tolerance) theorem states that a distributed system
cannot guarantee C, A and P all together.
Consistency: All nodes observe the same data at the same time.
Availability: Each request receives a response on success/failure.
Partition tolerance: The system continues to operate as a whole even in case of message loss, node failure
or an unreachable node.
Partition tolerance cannot be overlooked when reliability is required in a distributed database system. Thus,
in case of a network failure, the choice is between:
(i) The database must answer, and that answer may be old or wrong data (AP).
(ii) The database should not answer unless it has received the latest copy of the data (CP).
The CAP theorem implies that, once a network partition occurs, consistency and availability are mutually
exclusive choices. CA means consistency and availability, AP means availability and partition tolerance
and CP means consistency and partition tolerance. Figure 3.1 shows the CAP theorem usage in Big Data
Solutions.
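The AP and CP choices above can be illustrated with a small sketch. It is only a toy model under stated assumptions: the `Replica` class, its `mode` and `partitioned` fields, and the sample values are hypothetical names introduced here, not part of any database product.

```python
class Replica:
    """One copy of a single value, held on one side of a network partition."""

    def __init__(self, value, mode):
        self.value = value          # last value this replica received
        self.mode = mode            # "AP" or "CP", the two choices named in the notes
        self.partitioned = False    # True when the replica cannot reach its peers

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP choice: refuse to answer rather than risk returning stale data.
            raise RuntimeError("unavailable: cannot confirm the latest copy")
        # AP choice (or no partition): always answer, possibly with an old value.
        return self.value


def demo():
    ap = Replica(value=100, mode="AP")
    cp = Replica(value=100, mode="CP")
    # A newer value (120) is written elsewhere, but the partition
    # prevents it from reaching these two replicas.
    latest = 120
    ap.partitioned = cp.partitioned = True
    ap_answer = ap.read()           # answers with the old value
    try:
        cp_answer = cp.read()
    except RuntimeError:
        cp_answer = None            # no answer at all
    return latest, ap_answer, cp_answer
```

Calling `demo()` shows the trade-off: the AP replica stays available but returns the old value, while the CP replica stays consistent by giving no answer.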
Schema-less Models
• The schema of a database system refers to the design of a structure for the datasets and of the data
structures used for storing them in the database. NoSQL data stores do not necessarily have a fixed table schema.
• These systems do not use the concept of Join (between distributed datasets). A cluster of highly
distributed nodes manages a single large data store with a NoSQL DB.
• Data written at one node replicates to multiple nodes. The copies are therefore identical and
fault-tolerant, and the data is partitioned into shards. Distributed databases can store and process a
set of information on more than one computing node.
• The NoSQL data model relaxes one or more of the ACID properties (atomicity, consistency,
isolation and durability) of the database. Distribution follows the CAP theorem, which states that
at least two of the three properties must be present for the application/service/process.
• Figure 3.2 shows the characteristics of the schema-less model for data stores. ER stands for
entity-relation modelling. Relations in a database build the connections between various tables of data.
• NoSQL data stores use non-mathematical relations but store this information as an aggregate called
metadata. Metadata refers to data describing and specifying an object or objects. Metadata is a
record with all the information about a particular dataset and its inter-linkages. Metadata helps in
selecting an object and in specifying the data and its usage (where and when it is used). Metadata
specifies access permissions and attributes of the objects, and enables the addition of an attribute
layer to the objects. Files, tables, documents and images are also objects.
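As a rough illustration of the schema-less idea in the bullets above, the sketch below keeps records with different fields in one collection and queries them without a fixed table schema. The collection name, the sample records and the helper functions `find` and `fields_in_use` are hypothetical and belong to no particular NoSQL product.

```python
# A hypothetical "collection": each record is a dict and carries its own fields.
showroom_sales = [
    {"_id": 1, "model": "Hatchback", "units": 12},
    {"_id": 2, "model": "Sedan", "units": 7, "colour": "red"},
    {"_id": 3, "model": "SUV", "dealer": {"city": "Bengaluru"}},
]


def find(collection, **criteria):
    """Return the records whose fields match all criteria.

    A record that lacks a requested field simply does not match;
    no schema change is needed to add or omit fields.
    """
    missing = object()
    return [
        doc for doc in collection
        if all(doc.get(field, missing) == wanted
               for field, wanted in criteria.items())
    ]


def fields_in_use(collection):
    """Collect every field name that appears in at least one record."""
    return sorted({field for doc in collection for field in doc})
```

Here `find(showroom_sales, colour="red")` matches only the record that actually has a `colour` field, and `fields_in_use` shows that the set of fields is discovered from the data rather than declared in advance.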
A key-value store lets a client read and write values using a key, as follows:
(i) Get(key) returns the value associated with the key.
(ii) Put(key, value) associates the value with the key, and updates the value if the key is already
present.
(iii) Multi-get(key1, key2, ..., keyN) returns the list of values associated with the list of keys.
(iv) Delete(key) removes a key and its value from the data store.
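The four operations listed above can be sketched with an in-memory dictionary. This is a minimal illustration, not the interface of any specific product: the class name `KeyValueStore` is hypothetical, and returning `None` for a missing key is an assumption made here.

```python
class KeyValueStore:
    """Minimal in-memory key-value store with the four operations above."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns the value for the key, or None if the key is absent (assumption).
        return self._data.get(key)

    def put(self, key, value):
        # Associates the value with the key; overwrites an existing value.
        self._data[key] = value

    def multi_get(self, *keys):
        # Returns the values in the same order as the requested keys.
        return [self._data.get(key) for key in keys]

    def delete(self, key):
        # Removes the key and its value; a missing key is silently ignored.
        self._data.pop(key, None)
```

A short usage example:

```python
store = KeyValueStore()
store.put("model", "Sedan")
store.put("units", 7)
store.put("units", 9)            # Put on an existing key updates the value
values = store.multi_get("model", "units")
store.delete("model")
```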
CSV and JSON File Formats
CSV is a data store format for records. CSV does not represent object-oriented databases or hierarchical
data records. JSON and XML represent semi-structured data, object-oriented records and hierarchical data
records. JSON (JavaScript Object Notation) is a language format for semi-structured data. JSON
represents object-oriented and hierarchical data records, objects and resource arrays in JavaScript notation.
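The difference between the two formats can be seen by writing the same record both ways. The sketch below uses Python's standard `json` and `csv` modules; the sample record and the flattened column names (`marks_maths`, `marks_physics`) are made up for illustration.

```python
import csv
import io
import json

# A hierarchical record: a nested object and an array.
record = {
    "student": "S001",
    "marks": {"maths": 81, "physics": 74},
    "clubs": ["robotics", "chess"],
}

# JSON keeps the nesting and the array as they are.
json_text = json.dumps(record)
restored = json.loads(json_text)

# CSV holds flat rows only, so the hierarchy has to be flattened by hand.
flat = {
    "student": record["student"],
    "marks_maths": record["marks"]["maths"],
    "marks_physics": record["marks"]["physics"],
    "clubs": ";".join(record["clubs"]),
}
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
csv_text = buffer.getvalue()

# Reading the CSV back gives strings only; types and structure are lost.
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

After the round trip, `restored` equals the original nested record, while each CSV row is a flat mapping of strings.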
XML (eXtensible Markup Language) is an extensible, simple and scalable language. Its self-describing
format describes structure and contents in an easy-to-understand way. XML is widely used. The
document model consists of a root element and its sub-elements, so an XML document has a hierarchical
structure. The XML document model also has features of object-oriented records. The XML format is widely
used for data stores and for data exchange over the network. An XML document is semi-structured.
Document JSON Format: CouchDB Database
Apache CouchDB is an open-source database. Its features are:
• CouchDB provides mapping functions during querying, combining and filtering of information.
• CouchDB deploys the JSON data store model for documents. Each document maintains separate data
and metadata (schema).
• CouchDB is a multi-master application. Writes do not require field locking to control
concurrency in a multi-master setup.
• The CouchDB query language is JavaScript, which is also the language used to transform
documents. CouchDB indices can be queried from a web browser. CouchDB accesses documents
using an HTTP API. The HTTP methods are GET, PUT and DELETE.
• CouchDB data replication is the distribution model that results in fault tolerance and reliability.
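The HTTP access mentioned in the list above can be sketched with Python's standard `urllib`. The sketch only builds the requests and does not send them. The server address, the database name `sales` and the helper names are assumptions for illustration, and the `rev` query parameter on DELETE reflects the usual CouchDB convention of naming the document revision being deleted.

```python
import json
from urllib.request import Request

# Hypothetical local CouchDB server and database name.
BASE = "http://localhost:5984/sales"


def put_doc(doc_id, doc):
    """Build a PUT request that stores a JSON document under doc_id."""
    body = json.dumps(doc).encode("utf-8")
    return Request(
        f"{BASE}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def get_doc(doc_id):
    """Build a GET request that fetches the document with doc_id."""
    return Request(f"{BASE}/{doc_id}", method="GET")


def delete_doc(doc_id, rev):
    """Build a DELETE request for one revision of the document."""
    return Request(f"{BASE}/{doc_id}?rev={rev}", method="DELETE")
```

Passing one of these objects to `urllib.request.urlopen` would send it, which requires a running server at the assumed address.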
XQuery and XPath are query languages for finding and extracting elements and attributes from XML
documents. The query commands use sub-trees and attributes of documents. Querying is similar to
querying databases with SQL. XPath treats an XML document as a tree of nodes. XPath queries are
expressed in the form of XPath expressions.
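A few XPath expressions can be tried with Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The XML document below is a made-up sample used for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: a root element with nested sub-elements.
xml_text = """
<showroom city="Bengaluru">
  <car model="Sedan"><sales week="1">7</sales><sales week="2">9</sales></car>
  <car model="SUV"><sales week="1">4</sales></car>
</showroom>
"""

root = ET.fromstring(xml_text)

# Child step: every <car> element directly under the root.
models = [car.get("model") for car in root.findall("./car")]

# Attribute predicate followed by a child step.
sedan_sales = [int(s.text) for s in root.findall("./car[@model='Sedan']/sales")]

# Descendant search: every <sales> element for week 1, at any depth.
week1_sales = [int(s.text) for s in root.findall(".//sales[@week='1']")]
```

Each expression walks the tree of nodes: the first selects children, the second filters on an attribute, and the third searches all descendants.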
3.3.4.1 Object Relational Mapping
The following example explains object relational mapping.
Example: How do an HTML object and an XML-based web service relate to tabular data stores?
Solution: Figure 3.7 shows the object relational mapping of an HTML document and an XML web services
store with a tabular data store.
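Since Figure 3.7 is not reproduced here, the idea can be sketched in code: records arriving as XML are mapped to objects, and the objects are mapped to rows of a table. This is only an illustration under stated assumptions, not the mapping shown in the figure. The `Car` class, the `car` table and the sample XML are hypothetical; the sketch uses Python's standard `sqlite3` module with an in-memory database.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Car:
    """Object form of one record (hypothetical fields)."""
    model: str
    units: int


def cars_from_xml(xml_text):
    """Map each <car> element to a Car object."""
    root = ET.fromstring(xml_text)
    return [Car(e.get("model"), int(e.get("units"))) for e in root.findall("car")]


def save(conn, cars):
    """Map each object to one row: attribute names become column names."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS car (model TEXT PRIMARY KEY, units INTEGER)"
    )
    conn.executemany(
        "INSERT INTO car (model, units) VALUES (?, ?)",
        [(c.model, c.units) for c in cars],
    )


def load(conn):
    """Map each row back to an object."""
    rows = conn.execute("SELECT model, units FROM car ORDER BY model")
    return [Car(*row) for row in rows]


conn = sqlite3.connect(":memory:")
sample = '<cars><car model="Hatchback" units="12"/><car model="Sedan" units="7"/></cars>'
save(conn, cars_from_xml(sample))
```

Calling `load(conn)` afterwards returns the same `Car` objects, rebuilt from the table rows.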
Figure 3.8 Section of the graph database for car-model sales
(ii) The yearly sales are computed by path traversals from the nodes for weekly sales to the
yearly sales data.
(iv) The path traversals exhibit BASE properties because consistency is not maintained
along the intermediate paths. Eventually, when all the path traversals complete, the
data becomes consistent.
Graph databases enable fast network searches. Graphs suit linked datasets, such as
social media data. The data store uses graphs with nodes and edges connected to each other
through relations, associations and properties. Querying for data uses graph traversal
along the paths. Traversal may use single-step, path expressions or full recursion. A
relationship represents a key. A node possesses properties, including an ID. An edge may have
a label, which may specify a role.
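Single-step and recursive traversal can be sketched on a tiny graph held in memory. The node IDs, the `friend` edge label and the helper names below are hypothetical, and the breadth-first search is just one simple way to walk the paths, not the method of any particular graph database.

```python
from collections import deque

# Hypothetical property graph: nodes carry an ID and properties,
# edges carry a label that names the relationship.
nodes = {
    "u1": {"name": "User A"},
    "u2": {"name": "User B"},
    "u3": {"name": "User C"},
    "u4": {"name": "User D"},
}
edges = [
    ("u1", "friend", "u2"),
    ("u2", "friend", "u3"),
    ("u3", "friend", "u4"),
]


def neighbours(node_id, label):
    """Single-step traversal: follow edges with the given label once."""
    return [dst for src, lbl, dst in edges if src == node_id and lbl == label]


def reachable(start, label, max_hops=None):
    """Breadth-first traversal along edges with the given label.

    With max_hops=None the traversal continues until no new node is found.
    """
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if max_hops is not None and hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours(node, label):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, hops + 1))
    return found
```

With this data, `reachable("u1", "friend", max_hops=2)` covers the node's friends and their friends, while omitting `max_hops` follows the relationship until no new node is found.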
Characteristics of graph databases are:
• Use specialized query languages; for example, RDF uses SPARQL.
• Create a database system which models the data in a completely different way than the
key-value, document, columnar and object data store models.
• Can have hyper-edges. A hyper-edge is a set of vertices of a hypergraph. A hypergraph
is a generalization of a graph in which an edge can join any number of vertices (not
only the neighbouring vertices).
• Consist of a collection of small data size records, which have complex interactions
between graph nodes and hypergraph nodes.
When a new relationship is added in an RDBMS, the schema changes and data needs to be transferred from
one field to another. Adding relations in a graph database is simpler. The database assigns internal
identifiers to the nodes and uses these identifiers to join the network. Traversing the joins or relationships
is fast in graph databases because of the simpler form of graph nodes. The graph data may be kept in RAM
only. The relationship between nodes is consistent in a graph store. Graph databases have poor scalability:
they are difficult to scale out on multiple servers because of the close connectivity of each node in
the graph. Data can be replicated on multiple servers to enhance read and query processing
performance. Write operations to multiple servers, and graph queries that span multiple nodes, can be
complex to implement.
Typical uses of graph databases are: (i) link analysis, (ii) friend-of-friend queries, (iii) rules and inference,
(iv) rule induction and (v) pattern matching. Link analysis is needed to perform searches and look for
patterns and relationships in situations, such as social networking, telephone, or email records (Sections 9.4