Unit 2 – BDA
Introduction to NoSQL – aggregate data models – key-value and document data models – relationships – graph databases – schemaless databases – materialized views – distribution models – master-slave replication – consistency – Cassandra – Cassandra data model – Cassandra examples – Cassandra clients
1. Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible
data models that can adapt to changes in data structures and are capable of scaling horizontally to
handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term
has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a
wide range of different database architectures and data models.
Graph Databases
Graph databases are purpose-built to store and navigate relationships. Relationships are first-
class citizens in graph databases, and most of the value of graph databases is derived from these
relationships. Graph databases use nodes to store data entities, and edges to store relationships
between entities. An edge always has a start node, end node, type, and direction, and an edge can
describe parent-child relationships, actions, ownership, and the like. There is no limit to the
number and kind of relationships a node can have.
A graph in a graph database can be traversed along specific edge types or across the entire graph.
In graph databases, traversing the joins or relationships is very fast because the relationships
between nodes are not calculated at query time but are persisted in the database. Graph
databases have advantages for use cases such as social networking, recommendation engines,
and fraud detection, when you need to create relationships between data and quickly query these
relationships.
The following graph shows an example of a social network graph. Given the people (nodes) and
their relationships (edges), you can find out who the "friends of friends" of a particular person
are—for example, the friends of Howard's friends.
A social network is a good example of a graph. The people in the network would be the nodes,
the attributes of each person (such as name, age, and so on) would be properties, and the lines
connecting the people (with labels such as “friend” or “mother” or “supervisor”) would indicate
their relationship.
In a conventional database, queries about relationships can take a long time to process. This is
because relationships are implemented with foreign keys and queried by joining tables. As any
SQL DBA can tell you, performing joins is expensive, especially when you must sort through
large numbers of objects—or, worse, when you must join multiple tables to perform the sorts of
indirect (e.g. “friend of a friend”) queries that graph databases excel at.
Graph databases work by storing the relationships along with the data. Because related nodes are
physically linked in the database, accessing those relationships is as immediate as accessing the
data itself. In other words, instead of calculating the relationship as relational databases must do,
graph databases simply read the relationship from storage. Satisfying queries is a simple matter
of walking, or “traversing,” the graph.
A graph database not only stores the relationships between objects in a native way, making
queries about relationships fast and easy, but also allows you to include different kinds of objects
and different kinds of relationships in the graph. Like other NoSQL databases, a graph database is
schema-less. Thus, in terms of performance and flexibility, graph databases hew closer to
document databases or key-value stores than to relational or table-oriented databases.
Again, a social network is a useful example. Graph databases reduce the amount of work needed
to construct and display the data views found in social networks, such as activity feeds, or
determining whether or not you might know a given person due to their proximity to other
friends you have in the network.
Another application for graph databases is finding patterns of connection in graph data that
would be difficult to tease out via other data representations. Fraud detection systems use graph
databases to bring to light relationships between entities that might otherwise have been hard to
notice.
Similarly, graph databases are a natural fit for applications that manage the relationships or
interdependencies between entities. You will often find graph databases behind recommendation
engines, content and asset management systems, identity and access management systems, and
regulatory compliance and risk management solutions.
Neo4j
Neo4j is easily the most mature (11 years and counting) and best-known of the graph databases
for general use. Unlike previous graph database products, it doesn’t use a SQL back-end. Neo4j
is a native graph database that was engineered from the inside out to support large graph
structures, as in queries that return hundreds of thousands of relations and more.
Neo4j comes in both free open-source and for-pay enterprise editions, with the latter having no
restrictions on the size of a dataset (among other features). You can also experiment with Neo4j
online by way of its Sandbox, which includes some sample datasets to practice with.
Azure Cosmos DB
The Azure Cosmos DB cloud database is an ambitious project. It’s intended to emulate multiple
kinds of databases—conventional tables, document-oriented, column family, and graph—all
through a single, unified service with a consistent set of APIs.
To that end, a graph database is just one of the various modes Cosmos DB can operate in. It uses
the Gremlin query language and API for graph-type queries, and supports the Gremlin console
created for Apache TinkerPop as another interface.
JanusGraph
JanusGraph was forked from the TitanDB project, and is now under the governance of the Linux
Foundation. It uses any of a number of supported back ends—Apache Cassandra, Apache
HBase, Google Cloud Bigtable, Oracle BerkeleyDB—to store graph data, supports the Gremlin
query language (as well as other elements from the Apache TinkerPop stack), and can also
incorporate full-text search by way of the Apache Solr, Apache Lucene, or Elasticsearch
projects.
IBM, one of the JanusGraph project’s supporters, offers a hosted version of JanusGraph on IBM
Cloud, called Compose for JanusGraph. Like Azure Cosmos DB, Compose for JanusGraph
provides autoscaling and high availability, with pricing based on resource usage.
Materialized Views
A materialized view is a replica of a target master from a single point in time. The master can be
either a master table at a master site or a master materialized view at a materialized view site.
Whereas in multimaster replication tables are continuously updated by other master sites,
materialized views are updated from one or more masters through individual batch updates,
known as refreshes, from a single master site or master materialized view site. You can use
materialized views to achieve goals such as easing network loads, enabling data subsetting, and
supporting disconnected computing.
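Cassandra itself offers a form of materialized view (in versions where the feature is enabled): a read-only table that the cluster keeps in sync with a base table, keyed differently so it can serve a different query. A minimal CQL sketch; the students table and its columns are assumptions chosen for illustration −
-- Hypothetical base table.
CREATE TABLE students (
   student_id int PRIMARY KEY,
   student_name text,
   course_name text
);
-- A view of the same data keyed by course, so it can be queried by course
-- instead of by student_id. Every view primary-key column must be
-- filtered with IS NOT NULL.
CREATE MATERIALIZED VIEW students_by_course_mv AS
   SELECT course_name, student_id, student_name
   FROM students
   WHERE course_name IS NOT NULL AND student_id IS NOT NULL
   PRIMARY KEY (course_name, student_id);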
Master-Slave Replication
● Master − The master is the authoritative source for the data and is responsible for processing any updates to that data. It can be appointed manually or automatically.
● Slaves − A replication process synchronizes the slaves with the master. After a failure of the master, a slave can be appointed as the new master very quickly.
Pros and cons of master-slave replication: reads can be scaled out across the slaves, and a slave allows quick recovery if the master fails; however, the master is a bottleneck and a single point of failure for writes, and replication lag means slaves can serve stale data. (Credit: Jimmy Lin, University of Maryland.)
Cassandra
Cassandra has become so popular because of its outstanding technical features. Given below are
some of the features of Cassandra:
● Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as per requirement.
● Always on architecture − Cassandra has no single point of failure, and it is continuously available for business-critical applications that cannot afford a failure.
In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is
detected that some of the nodes responded with an out-of-date value, Cassandra will return the
most recent value to the client. After returning the most recent value, Cassandra performs a read
repair in the background to update the stale values.
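Whether a read returns the latest value depends on the consistency level of the request relative to the replication factor. A minimal sketch using cqlsh (Cassandra's shell, described below); the keyspace and table names are assumptions for illustration −
-- With a replication factor of 3, QUORUM means 2 of the 3 replicas must
-- respond; pairing QUORUM reads with QUORUM writes guarantees that a read
-- contacts at least one replica holding the latest value.
CONSISTENCY QUORUM;
SELECT student_name FROM university.students WHERE student_id = 1001;
-- If the contacted replicas disagree, the newest value (by timestamp) is
-- returned and a read repair updates the stale replicas in the background.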
The following figure shows a schematic view of how Cassandra uses data replication among the
nodes in a cluster to ensure no single point of failure.
Users can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL
treats the database (Keyspace) as a container of tables. Programmers use cqlsh, a command-line
prompt for working with CQL, or separate application-language drivers.
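A minimal cqlsh session might look like the following; the node address, keyspace and table names are placeholders for illustration −
-- Start the shell against any node of the cluster:  cqlsh 127.0.0.1 9042
DESCRIBE KEYSPACES;       -- list the keyspaces defined on the cluster
USE university;           -- make one keyspace the default for the session
SELECT * FROM students;   -- run CQL statements against its tables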
Cassandra database is distributed over several machines that operate together. The outermost
container is known as the Cluster. For failure handling, every node contains a replica, and in case
of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster, in a ring format,
and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic
attributes of a Keyspace in Cassandra are −
● Replication factor − It is the number of machines in the cluster that will
receive copies of the same data.
● Replica placement strategy − It is nothing but the strategy to place replicas
in the ring. We have strategies such as simple strategy (rack-unaware), old
network topology strategy (rack-aware), and network topology strategy (datacenter-aware).
● Column families − Keyspace is a container for a list of one or more
column families. A column family, in turn, is a container of a collection
of rows. Each row contains ordered columns. Column families represent
the structure of your data. Each keyspace has at least one and often
many column families.
The syntax of creating a Keyspace is as follows −
CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
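For a cluster that spans more than one data centre, NetworkTopologyStrategy lets you give each data centre its own replication factor. A sketch; the keyspace and data-centre names (university, dc1, dc2) are assumptions for illustration −
CREATE KEYSPACE university
WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};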
The following illustration shows a schematic view of a Keyspace.
Column
A column is the basic data structure of Cassandra with three values, namely the key or column name,
the value, and a timestamp. Given below is the structure of a column.
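The timestamp of a column can be inspected from CQL with the WRITETIME function; the table and column names below are assumptions carried over from the earlier sketch −
-- Returns the value together with the (microsecond) timestamp recorded
-- when that column was last written.
SELECT student_name, WRITETIME(student_name)
FROM students
WHERE student_id = 1001;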
SuperColumn
A super column is a special column, therefore, it is also a key-value pair. But a super column
stores a map of sub-columns.
Generally column families are stored on disk in individual files. Therefore, to optimize
performance, it is important to keep columns that you are likely to query together in the same
column family, and a super column can be helpful here. Given below is the structure of a super
column.
The following points differentiate the data model of Cassandra from that of an RDBMS:
● An RDBMS deals with structured data and enforces a fixed schema; Cassandra handles unstructured and semi-structured data with a flexible schema.
● In an RDBMS the database is the outermost container; in Cassandra it is the keyspace.
● RDBMS tables correspond to column families in Cassandra.
● In an RDBMS a row is an individual record; in Cassandra a row (partition) is the unit of replication.
● An RDBMS supports foreign keys and joins; Cassandra represents relationships through denormalized tables and collections.
If the data for a query is spread over many partitions, then all of those partitions need to be visited
to collect the query results.
This does not mean that partitions should not be created. If your data is very large, you cannot keep
that huge amount of data in a single partition; a single oversized partition will itself slow down.
With a badly chosen primary key, data retrieval under this data model will therefore be slow.
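Choosing a primary key that matches the query keeps each read on a single partition. A minimal sketch with a hypothetical time-series table, where sensor_id is the partition key and reading_time a clustering column −
-- One partition per sensor; rows within a partition are ordered by time.
CREATE TABLE readings_by_sensor (
   sensor_id text,
   reading_time timestamp,
   value double,
   PRIMARY KEY (sensor_id, reading_time)
);
-- Served from a single partition:
SELECT value FROM readings_by_sensor
WHERE sensor_id = 's-42' AND reading_time > '2024-01-01';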
Unlike an RDBMS, Cassandra does not support operations such as:
● Joins
● Group by
● Filtering on arbitrary (non-key) columns, etc.
Tables therefore have to be designed around the queries you intend to run, as the example after this list shows.
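For example, with the hypothetical students table used earlier (partition key student_id), filtering on a non-key column is rejected unless you explicitly opt in to scanning every partition −
-- Rejected: student_name is neither part of the primary key nor indexed.
SELECT * FROM students WHERE student_name = 'Asha';
-- Accepted, but scans the whole table and does not scale:
SELECT * FROM students WHERE student_name = 'Asha' ALLOW FILTERING;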
Consider the case where a course can be studied by many students, and I want to search for all the
students that are studying a particular course.
So by querying on the course name, I will get all the students that are studying that particular
course.
Now suppose a course can be studied by many students, and a student can also study many
courses.
So in this case, I will have two tables, i.e., I will divide the problem into two cases.
First, I will create a table by which you can find the courses taken by a particular student. Second,
I will create a table by which you can find all the students studying a particular course. Both
tables are sketched below.
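A minimal CQL sketch of the two query-driven tables; the table and column names (courses_by_student, students_by_course, and so on) are assumptions for illustration −
-- Query 1: find all courses taken by a given student.
CREATE TABLE courses_by_student (
   student_id int,
   course_name text,
   PRIMARY KEY (student_id, course_name)
);
-- Query 2: find all students studying a given course.
CREATE TABLE students_by_course (
   course_name text,
   student_id int,
   student_name text,
   PRIMARY KEY (course_name, student_id)
);
-- Each enrolment is written to both tables, so each query is answered
-- from a single partition:
SELECT course_name FROM courses_by_student WHERE student_id = 1001;
SELECT student_name FROM students_by_course WHERE course_name = 'NoSQL';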