Unit 2 BDA

Computer Science and Engineering (Sree Sastha Institute of Engineering and Technology)

Downloaded by Rethina Kumari M ([email protected])
CCS334 BIGDATA ANALYTICS

UNIT II NOSQL DATA MANAGEMENT

Introduction to NoSQL – aggregate data models – key-value and document data models –
relationships – graph databases – schemaless databases – materialized views – distribution
models – master-slave replication – consistency – Cassandra – Cassandra data model – Cassandra
examples – Cassandra clients

1. Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible
data models that can adapt to changes in data structures and are capable of scaling horizontally to
handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term
has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a
wide range of different database architectures and data models.

NoSQL databases are generally classified into four main categories:

1. Document databases: These databases store data as semi-structured documents, such as


JSON or XML, and can be queried using document-oriented query languages.
2. Key-value stores: These databases store data as key-value pairs, and are optimized for
simple and fast read/write operations.
3. Column-family stores: These databases store data as column families, which are sets of
columns that are treated as a single entity. They are optimized for fast and efficient querying
of large amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are designed to
handle complex relationships between data.
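
The four categories can be illustrated with plain Python data structures. This is a hypothetical sketch of how the same facts (two users, one "follows" relationship) might be shaped in each category; the field names and structures are illustrative, not any particular database's API:

```python
# 1. Document store: a self-contained, semi-structured document per entity.
document = {"_id": "u1", "name": "Alice", "follows": ["u2"]}

# 2. Key-value store: an opaque value (here a JSON string) looked up by key.
key_value = {"user:u1": '{"name": "Alice", "follows": ["u2"]}'}

# 3. Column-family store: rows grouped under a named column family,
#    each row key mapping to its set of columns.
column_family = {
    "users": {                        # column family
        "u1": {"name": "Alice"},      # row key -> columns
        "u2": {"name": "Bob"},
    }
}

# 4. Graph: nodes hold entities, edges hold the relationships between them.
graph = {
    "nodes": {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}},
    "edges": [("u1", "follows", "u2")],
}
```

Note how only the graph model stores the relationship as a first-class item; in the other three it is just data inside a record.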
NoSQL databases are often used in applications where there is a high volume of data that
needs to be processed and analyzed in real-time, such as social media analytics, e-commerce,
and gaming. They can also be used for other applications, such as content management systems,
document management, and customer relationship management.
However, NoSQL databases may not be suitable for all applications, as they may not provide
the same level of data consistency and transactional guarantees as traditional relational
databases. It is important to carefully evaluate the specific needs of an application when
choosing a database management system.
NoSQL, originally referring to "non-SQL" or "non-relational," describes a database that provides a
mechanism for storage and retrieval of data modeled in means other than the tabular relations
used in relational databases. Such databases came into existence in the late 1960s, but did not
obtain the NoSQL moniker until a surge of popularity in the early twenty-first century.



NoSQL databases are used in real-time web applications and big data, and their use is increasing
over time.
● NoSQL systems are also sometimes called "Not only SQL" to emphasize the fact that they may
support SQL-like query languages. Advantages of a NoSQL database include simplicity of design,
simpler horizontal scaling to clusters of machines, and finer control over availability. The data
structures used by NoSQL databases are different from those used by default in relational
databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL
database depends on the problem it should solve.
● NoSQL databases, also known as “not only SQL” databases, are a new type of database
management system that have gained popularity in recent years. Unlike traditional relational
databases, NoSQL databases are designed to handle large amounts of unstructured or semi-
structured data, and they can accommodate dynamic changes to the data model. This makes
NoSQL databases a good fit for modern web applications, real-time analytics, and big data
processing.
● Data structures used by NoSQL databases are sometimes also viewed as more flexible than
relational database tables. Many NoSQL stores compromise consistency in favor of
availability, speed and partition tolerance. Barriers to the greater adoption of NoSQL stores
include the use of low-level query languages, lack of standardized interfaces, and huge
previous investments in existing relational databases.
● Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability)
transactions, but a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE,
Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB,
have made them central to their designs.
● Most NoSQL databases offer a concept of eventual consistency, in which database changes
are propagated to all nodes eventually, so queries for data might not return updated data
immediately or might return data that is no longer accurate, a problem known as stale reads.
Also, some NoSQL systems may exhibit lost writes and other forms of data loss; some
NoSQL systems provide concepts such as write-ahead logging to avoid data loss.
● One simple example of a NoSQL database is a document database. In a document database,
data is stored in documents rather than tables. Each document can contain a different set of
fields, making it easy to accommodate changing data requirements.
● For example, take a database that holds data regarding employees. In a relational database,
this information might be stored in tables, with one table for employee information and
another table for department information. In a document database, each employee would be
stored as a separate document, with all of their information contained within the document.
● NoSQL databases are a relatively new type of database management system that have gained
popularity in recent years due to their scalability and flexibility. They are designed to handle
large amounts of unstructured or semi-structured data and can handle dynamic changes to the
data model. This makes NoSQL databases a good fit for modern web applications, real-time
analytics, and big data processing.
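
The stale-read behavior described above can be simulated with a toy two-replica store. This is a hypothetical model, not a real database's replication protocol: a write lands on one node and only reaches the replica when replication runs, so a read from the lagging replica in between returns old (or no) data:

```python
class EventuallyConsistentStore:
    """Toy model of two replicas with asynchronous replication."""

    def __init__(self):
        self.node_a = {}    # node that accepts writes
        self.node_b = {}    # replica, updated only when replicate() runs
        self.pending = []   # replication log not yet applied to node_b

    def write(self, key, value):
        self.node_a[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        # May return stale data: replication is not synchronous.
        return self.node_b.get(key)

    def replicate(self):
        # Batch-apply the pending log to the replica.
        for key, value in self.pending:
            self.node_b[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("balance", 100)
stale = store.read_from_replica("balance")   # None: write not propagated yet
store.replicate()
fresh = store.read_from_replica("balance")   # 100 after propagation
```

Real systems bound this window with tunable consistency levels (e.g., requiring a quorum of replicas to acknowledge a read or write), but the basic trade-off is the one shown here.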
Key Features of NoSQL :
1. Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate
changing data structures without the need for migrations or schema alterations.



2. Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to
a database cluster, making them well-suited for handling large amounts of data and high
levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-based data
model, where data is stored in semi-structured format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model,
where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-based data
model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to be highly
available and to automatically handle node failures and data replication across multiple nodes
in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and
dynamic manner, with support for multiple data types and changing data structures.
8. Performance: NoSQL databases are optimized for high performance and can handle a high
volume of reads and writes, making them suitable for big data and real-time applications.
Advantages of NoSQL: There are many advantages of working with NoSQL databases such as
MongoDB and Cassandra. The main advantages are high scalability and high availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling. Sharding is the
partitioning of data and placing it on multiple machines in such a way that the order of the
data is preserved. Vertical scaling means adding more resources to the existing machine,
whereas horizontal scaling means adding more machines to handle the data; horizontal
scaling is generally easier to implement than vertical scaling. Examples of horizontally
scaling databases are MongoDB, Cassandra, etc. NoSQL databases can handle huge amounts
of data because of this scalability: as the data grows, the database scales itself out to handle
it in an efficient manner.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data,
which means that they can accommodate dynamic changes to the data model. This makes
NoSQL databases a good fit for applications that need to handle changing data requirements.
3. High availability: The auto-replication feature in NoSQL databases makes them highly
available, because in case of any failure the data can be recovered from a replica in its last
consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they can handle large
amounts of data and traffic with ease. This makes them a good fit for applications that need
to handle large amounts of data or traffic
5. Performance: NoSQL databases are designed to handle large amounts of data and traffic,
which means that they can offer improved performance compared to traditional relational
databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than traditional
relational databases, as they are typically less complex and do not require expensive
hardware or software.
7. Agility: Ideal for agile development.
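
The sharding described in point 1 can be sketched in a few lines. This is an illustrative toy, not any database's actual partitioner: keys are split across "machines" by ranges of their first letter, so key order is preserved across shards (everything on shard 0 sorts before everything on shard 1), and capacity grows by adding another shard:

```python
# Toy range-based sharding: each shard owns a half-open range of first
# letters. "{" is the character right after "z", so it closes the last range.
SHARDS = [
    {"range": ("a", "i"), "data": {}},   # shard 0: keys a..h
    {"range": ("i", "r"), "data": {}},   # shard 1: keys i..q
    {"range": ("r", "{"), "data": {}},   # shard 2: keys r..z
]

def shard_for(key):
    """Route a lowercase key to the shard owning its range."""
    for shard in SHARDS:
        lo, hi = shard["range"]
        if lo <= key[0] < hi:
            return shard
    raise KeyError(key)

def put(key, value):
    shard_for(key)["data"][key] = value

def get(key):
    return shard_for(key)["data"].get(key)

put("alice", 1)
put("mallory", 2)
put("zed", 3)
```

Production systems route by hashed keys or rebalance ranges dynamically, but the principle is the same: the routing function decides which machine owns which keys.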
Disadvantages of NoSQL: NoSQL has the following disadvantages.
1. Lack of standardization : There are many different types of NoSQL databases, each with
its own unique strengths and weaknesses. This lack of standardization can make it difficult to
choose the right database for a specific application



2. Lack of ACID compliance : NoSQL databases are not fully ACID-compliant, which means
that they do not guarantee the consistency, integrity, and durability of data. This can be a
drawback for applications that require strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a very narrow focus: they are mainly designed for
storage and provide relatively little functionality beyond it. Relational databases are a better
choice in the field of transaction management than NoSQL.
4. Open-source: NoSQL databases are open-source, and there is no reliable standard for
NoSQL yet. In other words, two database systems are likely to be unequal.
5. Lack of support for complex queries : NoSQL databases are not designed to handle
complex queries, which means that they are not a good fit for applications that require
complex data analysis or reporting.
6. Lack of maturity : NoSQL databases are relatively new and lack the maturity of traditional
relational databases. This can make them less reliable and less secure than traditional
databases.
7. Management challenge: The purpose of big data tools is to make the management of a
large amount of data as simple as possible, but it is not so easy. Data management in NoSQL
is much more complex than in a relational database, and NoSQL in particular has a reputation
for being challenging to install and even harder to manage on a daily basis.
8. GUI not available: GUI tools for accessing NoSQL databases are not widely available in
the market.
9. Backup: Backup is a weak point for some NoSQL databases like MongoDB, which has no
standard approach for backing up data in a consistent manner.
10. Large document size: Some database systems like MongoDB and CouchDB store data in
JSON format. This means that documents can be quite large (costing network bandwidth and
processing speed), and having descriptive key names actually hurts, since they increase the
document size.
Types of NoSQL database: Types of NoSQL databases and the name of the databases system
that falls in that category are:
1. Graph databases: Examples – Amazon Neptune, Neo4j
2. Key-value stores: Examples – Memcached, Redis, Coherence
3. Tabular: Examples – HBase, Bigtable, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
When should NoSQL be used:
1. When a huge amount of data needs to be stored and retrieved.
2. The relationship between the data you store is not that important.
3. The data changes over time and is not structured.
4. Support for constraints and joins is not required at the database level.
5. The data is growing continuously and you need to scale the database regularly to handle it.

2. Aggregate Data Models


A data model is the model through which we perceive and manipulate our data. For
people using a database, the data model describes how we interact with the data in the database.
This is distinct from a storage model, which describes how the database stores and manipulates



the data internally. In an ideal world, we should be ignorant of the storage model, but in practice
we need at least some inkling of it—primarily to achieve decent performance.
In conversation, the term “data model” often means the model of the specific data in an
application.
A developer might point to an entity-relationship diagram of their database and refer to
that as their data model containing customers, orders, products, and the like. However, in this
book we’ll mostly be using “data model” to refer to the model by which the database organizes
data—what might be more formally called a metamodel.
The dominant data model of the last couple of decades is the relational data model, which
is best visualized as a set of tables, rather like a page of a spreadsheet. Each table has rows, with
each row representing some entity of interest. We describe this entity through columns, each
having a single value. A column may refer to another row in the same or different table, which
constitutes a relationship between those entities. (We’re using informal but common terminology
when we speak of tables and rows; the more formal terms would be relations and tuples.)
One of the most obvious shifts with NoSQL is a move away from the relational model.
Each NoSQL solution has a different model that it uses, which we put into four categories widely
used in the NoSQL ecosystem: key-value, document, column-family, and graph. Of these, the
first three share a common characteristic of their data models which we will call aggregate
orientation.
3. Aggregates
The relational model takes the information that we want to store and divides it into tuples
(rows). A tuple is a limited data structure: It captures a set of values, so you cannot nest one
tuple within another
to get nested records, nor can you put a list of values or tuples within another. This simplicity
underpins the relational model—it allows us to think of all operations as operating on and
returning tuples.
Aggregate orientation takes a different approach. It recognizes that often, you want to
operate on data in units that have a more complex structure than a set of tuples. It can be handy
to think in terms of a complex record that allows lists and other record structures to be nested
inside it. As we’ll see, key-value, document, and column-family databases all make use of this
more complex record. However, there is no common term for this complex record; in this book
we use the term “aggregate.”
Aggregate is a term that comes from Domain-Driven Design. In Domain-Driven Design,
an aggregate is a collection of related objects that we wish to treat as a unit. In particular, it is a
unit for data manipulation and management of consistency. Typically, we like to update
aggregates with atomic operations and communicate with our data storage in terms of
aggregates. This definition matches really well with how key-value, document, and column-
family databases work. Dealing in aggregates makes it much easier for these databases to handle
operating on a cluster, since the aggregate makes a natural unit for replication and sharding.
Aggregates are also often easier for application programmers to work with, since they often
manipulate data through aggregate structures.
Example of Relations and Aggregates
At this point, an example may help explain what we’re talking about. Let’s assume we
have to build an e-commerce website; we are going to be selling items directly to customers over
the web, and we will have to store information about users, our product catalog, orders, shipping
addresses, billing addresses, and payment data. We can use this scenario to model the data using



a relation data store as well as NoSQL data stores and talk about their pros and cons. For a
relational database, we might start with a data model shown in the below diagram.



As we’re good relational soldiers, everything is properly normalized, so that no data is repeated
in
multiple tables. We also have referential integrity. A realistic order system would naturally be
more
involved than this, but this is the benefit of the rarefied air of a book.
Now let’s see how this model might look when we think in more aggregate-oriented
terms



In this model, we have two main aggregates: customer and order. We've used the black-diamond
composition marker in UML to show how data fits into the aggregation structure. The customer
contains a list of billing addresses; the order contains a list of order items, a shipping address,
and payments. The payment itself contains a billing address for that payment.
A single logical address record appears three times in the example data, but instead of
using IDs it’s treated as a value and copied each time. This fits the domain where we would not
want the shipping address, nor the payment’s billing address, to change.
In a relational database, we would ensure that the address rows aren’t updated for this
case, making a new row instead. With aggregates, we can copy the whole address structure into
the aggregate as we need to. The link between the customer and the order isn’t within either
aggregate—it’s a relationship between aggregates. Similarly, the link from an order item would
cross into a separate aggregate structure for products, which we haven’t gone into. We’ve shown
the product name as part of the order item here—this kind of denormalization is similar to the
tradeoffs with relational databases, but is more common with aggregates because we want to
minimize the number of aggregates we access during a data interaction. The important thing to
notice here isn’t the particular way we’ve drawn the aggregate boundary so much as the fact that
you have to think about accessing that data—and make that part of your thinking when
developing the application data model. Indeed we could draw our aggregate boundaries
differently, putting all the orders for a customer into the customer aggregate.
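
The two aggregates can be sketched as JSON-like nested structures. The field names here are illustrative, not taken from any particular system; the point is that the address appears as a copied value inside each aggregate, and the only cross-aggregate link is the customer ID:

```python
import copy

billing_address = {"street": "12 Main St", "city": "Chennai"}

# Customer aggregate: embeds its own list of billing addresses.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [copy.deepcopy(billing_address)],
}

# Order aggregate: copies the address again as the shipping address and
# inside the payment, instead of pointing at a shared address row.
order = {
    "id": 99,
    "customer_id": 1,     # relationship BETWEEN aggregates, not inside one
    "order_items": [{"product_name": "NoSQL Distilled", "price": 32.45}],
    "shipping_address": copy.deepcopy(billing_address),
    "payments": [{
        "card": "amex",
        "billing_address": copy.deepcopy(billing_address),
    }],
}

# Because each address is an independent copy (a value, not a reference),
# changing one leaves the others untouched.
order["shipping_address"]["city"] = "Madurai"
```

This is exactly the value-copy semantics the text describes: a relational design would instead keep one address row and guard it against updates.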



Aggregates have an important consequence for transactions. Relational databases allow
you to manipulate any combination of rows from any tables in a single transaction. Such
transactions are called ACID transactions: Atomic, Consistent, Isolated, and Durable. ACID is
a rather contrived acronym; the real point is the atomicity: Many rows spanning many tables are
updated as a single operation.
This operation either succeeds or fails in its entirety, and concurrent operations are
isolated from each other so they cannot see a partial update. It’s often said that NoSQL databases
don’t support ACID transactions and thus sacrifice consistency.
This is a rather sweeping simplification. In general, it’s true that aggregate-oriented
databases don’t have ACID transactions that span multiple aggregates. Instead, they support
atomic manipulation of a single aggregate at a time. This means that if we need to manipulate
multiple aggregates in an atomic way, we have to manage that ourselves in the application code.
In practice, we find that most of the time we are able to keep our atomicity needs to
within a single aggregate; indeed, that’s part of the consideration for deciding how to divide up
our data into aggregates. We should also remember that graph and other aggregate-ignorant
databases usually do support ACID transactions similar to relational databases.
Key-Value and Document Data Models
We said earlier on that key-value and document databases were strongly aggregate-oriented.
What we meant by this was that we think of these databases as primarily constructed through
aggregates. Both of these types of databases consist of lots of aggregates, with each aggregate
having a key or ID that's used to get at the data.



The two models differ in that in a key-value database, the aggregate is opaque to the
database—just some big blob of mostly meaningless bits. In contrast, a document database is
able to see a structure in the aggregate. The advantage of opacity is that we can store whatever
we like in the aggregate. The database may impose some general size limit, but other than that
we have complete freedom. A document database imposes limits on what we can place in it,
defining allowable structures and types.
In return, however, we get more flexibility in access. With a key-value store, we can only
access an aggregate by lookup based on its key. With a document database, we can submit
queries to the database based on the fields in the aggregate, we can retrieve part of the aggregate
rather than the whole thing, and the database can create indexes based on the contents of the
aggregate.
In practice, the line between key-value and document gets a bit blurry. People often put
an ID field in a document database to do a key-value style lookup. Databases classified as
key-value databases may allow you to store structures beyond just an opaque aggregate. For
example, Riak allows you to add metadata to aggregates for indexing and interaggregate links,
and Redis allows you to break down the aggregate into lists or sets. You can support querying
by integrating search tools such as Solr.
As an example, Riak includes a search facility that uses Solr-like searching on any
aggregates that are stored as JSON or XML structures. Despite this blurriness, the general
distinction still holds. With key-value databases, we expect to mostly look up aggregates using a
key. With document databases, we mostly expect to submit some form of query based on the
internal structure of the document; this might be a key, but it’s more likely to be something else.
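
The access-pattern difference can be shown with two toy stores (plain Python, not any real store's API; names are illustrative). The key-value side can only fetch a whole blob by key; the document side can match on fields and return part of a document:

```python
import json

# Key-value style: the aggregate is an opaque blob. The ONLY access path
# is lookup by key; the store cannot query inside the value.
kv_store = {"order:99": json.dumps({"customer": "Ann", "total": 32.45})}

def kv_get(key):
    # Caller must know the exact key and decode the blob itself.
    return json.loads(kv_store[key])

# Document style: the store sees the structure of each aggregate, so we
# can query by field and project out just part of a matching document.
doc_store = [
    {"_id": 99,  "customer": "Ann", "total": 32.45},
    {"_id": 100, "customer": "Raj", "total": 10.00},
]

def doc_find(field, value, project=None):
    for doc in doc_store:
        if doc.get(field) == value:
            return {k: doc[k] for k in project} if project else doc
    return None

whole = kv_get("order:99")                        # key-value: by key only
partial = doc_find("customer", "Raj", ["total"])  # document: by field, partial
```

A real document database would also maintain indexes over those fields so that `doc_find` does not have to scan every document.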

The graph database defined

Graph databases are purpose-built to store and navigate relationships. Relationships are first-
class citizens in graph databases, and most of the value of graph databases is derived from these
relationships. Graph databases use nodes to store data entities, and edges to store relationships
between entities. An edge always has a start node, end node, type, and direction, and an edge can
describe parent-child relationships, actions, ownership, and the like. There is no limit to the
number and kind of relationships a node can have.
A graph in a graph database can be traversed along specific edge types or across the entire graph.
In graph databases, traversing the joins or relationships is very fast because the relationships
between nodes are not calculated at query times but are persisted in the database. Graph
databases have advantages for use cases such as social networking, recommendation engines,
and fraud detection, when you need to create relationships between data and quickly query these
relationships.
The following graph shows an example of a social network graph. Given the people (nodes) and
their relationships (edges), you can find out who the "friends of friends" of a particular person
are—for example, the friends of Howard's friends.
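
The "friends of friends" query can be expressed as a two-hop traversal over an adjacency structure. This is a minimal sketch with made-up names (it stores the graph as Python sets rather than using a graph database's query language):

```python
# Toy undirected social graph: person -> set of direct friends.
friends = {
    "Howard": {"Alice", "Bob"},
    "Alice":  {"Howard", "Carol"},
    "Bob":    {"Howard", "Dan"},
    "Carol":  {"Alice"},
    "Dan":    {"Bob"},
}

def friends_of_friends(person):
    """People reachable in exactly two hops, excluding the person
    themselves and their direct friends."""
    direct = friends.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= friends.get(friend, set())
    return two_hops - direct - {person}

fof = friends_of_friends("Howard")   # {"Carol", "Dan"}
```

In a native graph database the same query is a stored-edge traversal (e.g., a two-hop pattern match), which avoids the join work a relational database would need for the equivalent self-join.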



Graph database features
The term “graph” comes from the use of the word in mathematics. There it’s used to describe a
collection of nodes (or vertices), each containing information (properties), and with labeled
relationships (or edges) between the nodes.

A social network is a good example of a graph. The people in the network would be the nodes,
the attributes of each person (such as name, age, and so on) would be properties, and the lines
connecting the people (with labels such as “friend” or “mother” or “supervisor”) would indicate
their relationship.

In a conventional database, queries about relationships can take a long time to process. This is
because relationships are implemented with foreign keys and queried by joining tables. As any
SQL DBA can tell you, performing joins is expensive, especially when you must sort through
large numbers of objects—or, worse, when you must join multiple tables to perform the sorts of
indirect (e.g. “friend of a friend”) queries that graph databases excel at.

Graph databases work by storing the relationships along with the data. Because related nodes are
physically linked in the database, accessing those relationships is as immediate as accessing the
data itself. In other words, instead of calculating the relationship as relational databases must do,
graph databases simply read the relationship from storage. Satisfying queries is a simple matter
of walking, or “traversing,” the graph.

A graph database not only stores the relationships between objects in a native way, making
queries about relationships fast and easy, but allows you to include different kinds of objects and
different kinds of relationships in the graph. Like other NoSQL databases, a graph database is
schema-less. Thus, in terms of performance and flexibility, graph databases hew closer to
document databases or key-value stores than they do to relational or table-oriented databases.



Graph database use cases
Graph databases work best when the data you’re working with is highly connected and should be
represented by how it links or refers to other data, typically by way of many-to-many
relationships.

Again, a social network is a useful example. Graph databases reduce the amount of work needed
to construct and display the data views found in social networks, such as activity feeds, or
determining whether or not you might know a given person due to their proximity to other
friends you have in the network.

Another application for graph databases is finding patterns of connection in graph data that
would be difficult to tease out via other data representations. Fraud detection systems use graph
databases to bring to light relationships between entities that might otherwise have been hard to
notice.

Similarly, graph databases are a natural fit for applications that manage the relationships or
interdependencies between entities. You will often find graph databases behind recommendation
engines, content and asset management systems, identity and access management systems, and
regulatory compliance and risk management solutions.

Popular graph databases


Because graph databases serve a relatively niche use case, there aren’t nearly as many of them as
there are relational databases. On the plus side, that makes the standout products easier to
identify and discuss.

Neo4j

Neo4j is easily the most mature (11 years and counting) and best-known of the graph databases
for general use. Unlike previous graph database products, it doesn’t use a SQL back-end. Neo4j
is a native graph database that was engineered from the inside out to support large graph
structures, as in queries that return hundreds of thousands of relations and more.

Neo4j comes in both free open-source and for-pay enterprise editions, with the latter having no
restrictions on the size of a dataset (among other features). You can also experiment with Neo4j
online by way of its Sandbox, which includes some sample datasets to practice with.

See InfoWorld’s review of Neo4j for more details.

Microsoft Azure Cosmos DB

The Azure Cosmos DB cloud database is an ambitious project. It’s intended to emulate multiple
kinds of databases—conventional tables, document-oriented, column family, and graph—all
through a single, unified service with a consistent set of APIs.

To that end, a graph database is just one of the various modes Cosmos DB can operate in. It uses
the Gremlin query language and API for graph-type queries, and supports the Gremlin console
created for Apache TinkerPop as another interface.



Another big selling point of Cosmos DB is that indexing, scaling, and geo-replication are
handled automatically in the Azure cloud, without any knob-twiddling on your end. It isn’t clear
yet how Microsoft’s all-in-one architecture measures up to native graph databases in terms of
performance, but Cosmos DB certainly offers a useful combination of flexibility and scale.

See InfoWorld’s review of Azure Cosmos DB for more details.

JanusGraph

JanusGraph was forked from the TitanDB project, and is now under the governance of the Linux
Foundation. It uses any of a number of supported back ends—Apache Cassandra, Apache
HBase, Google Cloud Bigtable, Oracle BerkeleyDB—to store graph data, supports the Gremlin
query language (as well as other elements from the Apache TinkerPop stack), and can also
incorporate full-text search by way of the Apache Solr, Apache Lucene, or Elasticsearch
projects.

IBM, one of the JanusGraph project’s supporters, offers a hosted version of JanusGraph on IBM
Cloud, called Compose for JanusGraph. Like Azure Cosmos DB, Compose for JanusGraph
provides autoscaling and high availability, with pricing based on resource usage.

What is a schemaless database?


A schemaless database manages information without the need for a blueprint. Building a
schemaless database doesn't rely on conforming to certain fields, tables, or data model
structures, and there is no Relational Database Management System (RDBMS) to enforce any
specific kind of structure. In other words, it's a non-relational database that can handle any
database type, whether that be a key-value store, document store, in-memory, column-oriented,
or graph data model. This flexibility is responsible for the rising popularity of the schemaless
approach, which is often considered more user-friendly than scaling a schema-based or SQL
database.
How does a schemaless database work?
With a schemaless database, you don’t need to have a fully-realized vision of what your data
structure will be. Because it doesn’t adhere to a schema, all data saved in a schemaless database
is kept completely intact. A relational database, on the other hand, picks and chooses what data it
keeps, either changing the data to fit the schema, or eliminating it altogether. Going schemaless
allows every bit of detail from the data to remain unaltered and be completely accessible at any
time. For businesses whose operations change according to real-time data, it’s important to have
that untouched data as any of those points can prove to be integral to how the database is later
updated. Without a fixed data structure, schemaless databases can include or remove data types,
tables, and fields without major repercussions, like complex schema migrations and outages.
Because it can withstand sudden changes and parse any data type, schemaless databases are
popular in industries that are run on real-time data, like financial services, gaming, and social
media.
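As a rough sketch of the idea (hypothetical toy code, not any particular NoSQL product), a schemaless store simply keeps whatever document shape it is given — two documents with completely different fields can live side by side and are retrieved exactly as stored:

```python
# Minimal sketch of a schemaless document store (illustrative only):
# documents with different shapes coexist in one collection, and each
# document is stored exactly as given — no schema is enforced.
class SchemalessStore:
    def __init__(self):
        self._docs = {}          # doc id -> document (any shape)
        self._next_id = 0

    def insert(self, doc):
        """Store a document as-is; no fields are required or coerced."""
        self._next_id += 1
        self._docs[self._next_id] = doc
        return self._next_id

    def get(self, doc_id):
        return self._docs[doc_id]

    def find(self, **criteria):
        """Return documents whose fields match; missing fields never error."""
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

store = SchemalessStore()
a = store.insert({"name": "Alice", "email": "a@example.com"})
b = store.insert({"name": "Bob", "tags": ["gaming"], "score": 42})  # different fields
```

Note how `find` simply skips fields a document does not have, rather than failing: nothing forced the two documents into a common shape.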
Schemaless vs. schema databases: pros and cons
How much do you know about your new database setup? Can you see its structure well ahead of time and know for certain it will never change? If so, you may be dealing with a situation that best suits a schema database: its strictness is the basis of its appeal. Let's get granular and weigh the pros and cons of going one way or the other.

What is a Materialized View?

A materialized view is a replica of a target master from a single point in time. The master can be either a master table at a master site or a master materialized view at a materialized view site. Whereas in multimaster replication tables are continuously updated by other master sites, materialized views are updated from one or more masters through individual batch updates, known as refreshes, from a single master site or master materialized view site.



Why Use Materialized Views?

You can use materialized views to achieve one or more of the following goals:

Ease Network Loads


If one of your goals is to reduce network loads, then you can use materialized views to distribute
your corporate database to regional sites. Instead of the entire company accessing a single
database server, user load is distributed across multiple database servers. Through the use of
multitier materialized views, you can create materialized views based on other materialized
views, which enables you to distribute user load to an even greater extent because clients can
access materialized view sites instead of master sites. To decrease the amount of data that is
replicated, a materialized view can be a subset of a master table or master materialized view.
While multimaster replication also distributes a corporate database among multiple sites, the
networking requirements for multimaster replication are greater than those for replicating with
materialized views because of the transaction by transaction nature of multimaster replication.
Further, the ability of multimaster replication to provide real-time or near real-time replication
may result in greater network traffic, and might require a dedicated network link.
Materialized views are updated through an efficient batch process from a single master site or
master materialized view site. They have lower network requirements and dependencies than
multimaster replication because of the point in time nature of materialized view replication.
Whereas multimaster replication requires constant communication over the network,
materialized view replication requires only periodic refreshes.
In addition to not requiring a dedicated network connection, replicating data with materialized
views increases data availability by providing local access to the target data. These benefits,
combined with mass deployment and data subsetting (both of which also reduce network loads),
greatly enhance the performance and reliability of your replicated database.
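The point-in-time, batch-refresh model described above can be sketched with SQLite (which has no native materialized views, so the snapshot is an ordinary table rebuilt by a batch refresh; the table and column names here are invented for illustration):

```python
import sqlite3

# Sketch of materialized-view-style replication (illustrative only; SQLite
# has no built-in materialized views). A "master" table is copied into a
# local snapshot by a batch refresh; between refreshes the snapshot is a
# frozen point-in-time copy, and it replicates only a subset of the rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.execute("CREATE TABLE orders_west_mv (id INTEGER, region TEXT, amount REAL)")

def refresh():
    # Complete refresh: rebuild the snapshot from the master in one batch,
    # replicating only the 'west' subset (row-level data subsetting).
    conn.execute("DELETE FROM orders_west_mv")
    conn.execute(
        "INSERT INTO orders_west_mv SELECT * FROM orders WHERE region = 'west'")

conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "west", 10.0), (2, "east", 20.0)])
refresh()
# A later write to the master is not visible until the next refresh.
conn.execute("INSERT INTO orders VALUES (3, 'west', 30.0)")
stale = conn.execute("SELECT COUNT(*) FROM orders_west_mv").fetchone()[0]
refresh()
fresh = conn.execute("SELECT COUNT(*) FROM orders_west_mv").fetchone()[0]
```

Between the two refreshes the snapshot is stale (it still holds one row), which is exactly the trade: no constant network communication, at the cost of point-in-time rather than real-time data.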
Create a Mass Deployment Environment
Deployment templates allow you to precreate a materialized view environment locally. You can
then use deployment templates to quickly and easily deploy materialized view environments to



support sales force automation and other mass deployment environments. Parameters allow you
to create custom data sets for individual users without changing the deployment template. This
technology enables you to roll out a database infrastructure to hundreds or thousands of users.
Enable Data Subsetting
Materialized views allow you to replicate data based on column- and row-level subsetting, while
multimaster replication requires replication of the entire table. Data subsetting enables you to
replicate information that pertains only to a particular site. For example, if you have a regional
sales office, then you might replicate only the data that is needed in that region, thereby cutting
down on unnecessary network traffic.
Enable Disconnected Computing
Materialized views do not require a dedicated network connection. Though you have the option
of automating the refresh process by scheduling a job, you can manually refresh your
materialized view on-demand, which is an ideal solution for sales applications running on a
laptop. For example, a developer can integrate the replication management API for refresh on-
demand into the sales application. When the salesperson has completed the day's orders, the
salesperson simply dials up the network and uses the integrated mechanism to refresh the
database, thus transferring the orders to the main office.
Master and Slave Replication

● Master − the authoritative source for the data; it is responsible for processing any updates to that data and can be appointed manually or automatically.
● Slaves − a replication process synchronizes the slaves with the master; after a failure of the master, a slave can be appointed as the new master very quickly.
(Credits: Jimmy Lin, University of Maryland)

Pros and cons of Master-Slave Replication



Pros
● More read requests: add more slave nodes and ensure that all read requests are routed to the slaves.
● Should the master fail, the slaves can still handle read requests.
● Good for read-intensive datasets.
Cons
● The master is a bottleneck, limited by its ability to process updates and to pass those updates on.
● A failure of the master eliminates the ability to handle writes until the master is restored or a new master is appointed.
● Inconsistency due to slow propagation of changes to the slaves.
● Bad for datasets with heavy write traffic.

Cassandra
What is Cassandra?

Cassandra is an open-source, distributed and decentralized storage system (database) for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra −
● It is scalable, fault-tolerant, and consistent.
● It is a column-oriented database.
● Its distribution design is based on Amazon’s Dynamo and its data model on Google’s
Bigtable.
● Created at Facebook, it differs sharply from relational database management systems.
● Cassandra implements a Dynamo-style replication model with no single point of failure,
but adds a more powerful “column family” data model.
● Cassandra is being used by some of the biggest companies, such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.
Features of Cassandra

Cassandra has become so popular because of its outstanding technical features. Given below are
some of the features of Cassandra:
● Elastic scalability − Cassandra is highly scalable; it allows you to add more
hardware to accommodate more customers and more data as per
requirement.
● Always on architecture − Cassandra has no single point of failure and it is
continuously available for business-critical applications that cannot
afford a failure.



● Fast linear-scale performance − Cassandra is linearly scalable, i.e., it
increases your throughput as you increase the number of nodes in the
cluster. Therefore it maintains a quick response time.
● Flexible data storage − Cassandra accommodates all possible data
formats including: structured, semi-structured, and unstructured. It can
dynamically accommodate changes to your data structures according
to your need.
● Easy data distribution − Cassandra provides the flexibility to distribute
data where you need by replicating data across multiple data centers.
● Transaction support − Cassandra supports the Atomicity, Consistency,
Isolation, and Durability (ACID) properties at the level of individual
writes, though it does not offer full multi-row ACID transactions.
● Fast writes − Cassandra was designed to run on cheap commodity
hardware. It performs blazingly fast writes and can store hundreds of
terabytes of data, without sacrificing the read efficiency.
History of Cassandra

● Cassandra was developed at Facebook for inbox search.


● It was open-sourced by Facebook in July 2008.
● Cassandra was accepted into Apache Incubator in March 2009.
● It has been an Apache top-level project since February 2010.
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.
● All the nodes in a cluster play the same role. Each node is independent and at the same
time interconnected to other nodes.
● Each node in a cluster can accept read and write requests, regardless of where the data is
actually located in the cluster.
● When a node goes down, read/write requests can be served from other nodes in the
network.
Data Replication in Cassandra

In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is
detected that some of the nodes responded with an out-of-date value, Cassandra will return the
most recent value to the client. After returning the most recent value, Cassandra performs a read
repair in the background to update the stale values.
The following figure shows a schematic view of how Cassandra uses data replication among the
nodes in a cluster to ensure no single point of failure.



Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with
each other and detect any faulty nodes in the cluster.
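The read-repair behaviour described above can be sketched as follows (a simplified, hypothetical model: replicas are plain dicts, values carry integer timestamps, and the "coordinator" is an ordinary function):

```python
# Sketch of read repair (illustrative, not Cassandra's implementation):
# each replica stores key -> (value, timestamp); the coordinator returns
# the most recent value and writes it back to any replica that is stale.
def read_with_repair(replicas, key):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    responses = [r[key] for r in replicas if key in r]
    value, ts = max(responses, key=lambda vt: vt[1])  # newest timestamp wins
    for r in replicas:                                # background read repair
        if r.get(key, (None, -1))[1] < ts:
            r[key] = (value, ts)
    return value

n1 = {"user:1": ("alice@old.com", 1)}   # replica holding an out-of-date value
n2 = {"user:1": ("alice@new.com", 2)}   # up-to-date replica
n3 = {}                                  # replica missing the key entirely
result = read_with_repair([n1, n2, n3], "user:1")
```

After the read, all three replicas hold the newest value, so a later read can be served consistently from any of them.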
Components of Cassandra

The key components of Cassandra are as follows −


● Node − It is the place where data is stored.
● Data center − It is a collection of related nodes.
● Cluster − A cluster is a component that contains one or more data
centers.
● Commit log − The commit log is a crash-recovery mechanism in
Cassandra. Every write operation is written to the commit log.
● Mem-table − A mem-table is a memory-resident data structure. After the
commit log, the data is written to the mem-table. Sometimes, for a
single column family, there will be multiple mem-tables.
● SSTable − It is a disk file to which the data is flushed from the mem-
table when its contents reach a threshold value.
● Bloom filter − These are quick, nondeterministic algorithms for testing
whether an element is a member of a set. A Bloom filter is a special kind
of cache that is consulted on every query.
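As a rough illustration (hypothetical toy code, not Cassandra's implementation), a Bloom filter can be sketched in a few lines: it never gives a false negative, so a "no" answer lets Cassandra skip an SSTable entirely:

```python
import hashlib

# Toy Bloom filter (illustrative): fast, probabilistic set membership
# with no false negatives. Cassandra keeps one per SSTable so reads can
# skip files that cannot contain the requested partition.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions from independent hashes of the item.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("partition-42")
```

The nondeterminism is one-sided: `might_contain` can return a false positive for an item that was never added, but never a false negative for one that was.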
Cassandra Query Language

Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL
treats the database (Keyspace) as a container of tables. Programmers work with CQL either
through cqlsh, an interactive prompt, or through separate application language drivers.



Clients approach any of the nodes for their read-write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.
Write Operations
Every write on a node is first captured in the commit log; the data is then written to the mem-table. Whenever the mem-table is full, its contents are flushed to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically compacts the SSTables, discarding unnecessary data.
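The write path above can be sketched as a toy model (hypothetical; real commit logs and SSTables are durable on-disk structures with far more machinery, and the flush threshold is a size, not a row count):

```python
# Sketch of Cassandra's write path (greatly simplified): every write is
# appended to a commit log, then applied to an in-memory mem-table; when
# the mem-table reaches a threshold it is flushed to an immutable,
# sorted "SSTable".
class WritePath:
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # crash-recovery record of every write
        self.memtable = {}
        self.sstables = []        # list of flushed, sorted snapshots
        self.limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # Flush mem-table contents to an immutable, sorted SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

db = WritePath(memtable_limit=2)
db.write("k1", "v1")
db.write("k2", "v2")   # hits the threshold, triggers a flush
db.write("k3", "v3")   # starts filling a fresh mem-table
```

Notice that writes never read or rewrite old data in place, which is why writes in Cassandra are so cheap; compaction (not shown) later merges SSTables and discards superseded values.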
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom filter to
find the appropriate SSTable that holds the required data.
Cassandra Data Model
The data model of Cassandra is significantly different from what we normally see in an RDBMS.
This chapter provides an overview of how Cassandra stores its data.
Cluster

The Cassandra database is distributed over several machines that operate together. The outermost container is known as the Cluster. For failure handling, every node contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes of a cluster in a ring format and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic
attributes of a Keyspace in Cassandra are −
● Replication factor − It is the number of machines in the cluster that will
receive copies of the same data.
● Replica placement strategy − It is nothing but the strategy to place replicas
in the ring. We have strategies such as simple strategy (rack-unaware),
old network topology strategy (rack-aware), and network topology
strategy (datacenter-aware).
● Column families − Keyspace is a container for a list of one or more
column families. A column family, in turn, is a container of a collection
of rows. Each row contains ordered columns. Column families represent
the structure of your data. Each keyspace has at least one and often
many column families.
The syntax for creating a keyspace is as follows −
CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
The following illustration shows a schematic view of a Keyspace.



Column Family
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered
collection of columns. The following table lists the points that differentiate a column family from
a table of relational databases.

A Cassandra column family has the following attributes −


● keys_cached − It represents the number of locations to keep cached per
SSTable.
● rows_cached − It represents the number of rows whose entire contents
will be cached in memory.
● preload_row_cache − It specifies whether you want to pre-populate the
row cache.



Note − Unlike relational tables, a column family's schema is not fixed: Cassandra does
not force individual rows to have all the columns.
The following figure shows an example of a Cassandra column family.

Column
A column is the basic data structure of Cassandra with three values, namely key or column name,
value, and a time stamp. Given below is the structure of a column.

SuperColumn
A super column is a special column; therefore, it is also a key-value pair, but a super column
stores a map of sub-columns.
Generally, column families are stored on disk in individual files. Therefore, to optimize
performance, it is important to keep columns that you are likely to query together in the same
column family, and a super column can be helpful here. Given below is the structure of a super
column.
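The column and super column structures described above (shown as figures in the original) can be sketched as plain data types; the field names and example values below are invented for illustration:

```python
from dataclasses import dataclass, field

# A column is a (name, value, timestamp) triple; a super column maps
# sub-column names to columns. This is a structural sketch only, not
# Cassandra's actual storage representation.
@dataclass
class Column:
    name: str
    value: str
    timestamp: int          # used to resolve conflicting writes (newest wins)

@dataclass
class SuperColumn:
    name: str
    columns: dict = field(default_factory=dict)   # sub-column name -> Column

    def put(self, col: Column):
        self.columns[col.name] = col

home = SuperColumn("home_address")
home.put(Column("street", "21 Main St", timestamp=1))
home.put(Column("city", "Chennai", timestamp=1))
```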

Data Models of Cassandra and RDBMS

The following table lists down the points that differentiate the data model of Cassandra from that
of an RDBMS.



Cassandra Data Model Rules
In Cassandra, writes are not expensive. Cassandra does not support joins, GROUP BY, OR clauses, aggregations, and so on, so you have to store your data in such a way that it is completely retrievable. The following rules must be kept in mind while modelling data in Cassandra.

Maximize the number of writes


In Cassandra, writes are very cheap: Cassandra is optimized for high write performance. So try to maximize your writes for better read performance and data availability. There is a tradeoff between data writes and data reads, so optimize your data read performance by maximizing the number of data writes.

Maximize Data Duplication


Data denormalization and data duplication are a fact of life in Cassandra. Disk space is cheaper than memory, CPU processing, and I/O operations. Because Cassandra is a distributed database, data duplication provides instant data availability and no single point of failure.

Cassandra Data Modeling Goals


You should have following goals while modelling data in Cassandra:

Spread Data Evenly Around the Cluster



You want an equal amount of data on each node of the Cassandra cluster. Data is spread to different nodes based on the partition key, which is the first part of the primary key. So, try to choose a partition key that spreads data evenly around the cluster.

Minimize number of partitions read while querying data


A partition is a group of records with the same partition key. When a read query is issued, it collects data from different partitions on different nodes.

If there are many partitions, then all these partitions need to be visited to collect the query data.

This does not mean that partitions should not be created. If your data is very large, you cannot keep that huge amount of data on a single partition; the single partition would be slowed down.

So try to choose a balanced number of partitions.
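The two goals above pull against each other, and a small hypothetical sketch makes the trade-off concrete (plain Python, not Cassandra; the row data and candidate keys are invented): partitioning by song id gives a perfectly even spread but a per-year query must touch many partitions, while partitioning by year answers that query from a single partition at the cost of coarser spread.

```python
from collections import Counter

# Group the same rows under two candidate partition keys and compare how
# many partitions exist and how evenly rows spread across them.
rows = [
    {"song_id": i, "year": 2020 + (i % 2), "name": f"song-{i}"}
    for i in range(6)
]

def partition_sizes(rows, key_fn):
    """Count rows per partition for a candidate partition key."""
    return Counter(key_fn(r) for r in rows)

by_song = partition_sizes(rows, lambda r: r["song_id"])  # 6 partitions, 1 row each
by_year = partition_sizes(rows, lambda r: r["year"])     # 2 partitions, 3 rows each
```

A query filtered by year reads one partition under the second scheme but all six under the first, which is why the partition key should be chosen from the queries you intend to run.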

Good Primary Key in Cassandra


Let’s take an example and find which primary key is good.

Here is the table MusicPlaylist.

In the above example, in the table MusicPlaylist:

● SongId is the partition key, and
● SongName is the clustering column.

Data will be clustered on the basis of SongName. Only one partition will be created per SongId; there will not be any other partition in the table MusicPlaylist.

Data retrieval will be slow with this data model due to the bad primary key.

Here is another table MusicPlaylist.



In the above example, in the table MusicPlaylist:

● SongId and Year together form the partition key, and
● SongName is the clustering column.

Data will be clustered on the basis of SongName. In this table, a new partition will be created for each year, and all the songs of that year will be on the same node. This primary key will be very useful for the data.

Data retrieval will be fast with this data model.

Model Your Data in Cassandra


Following things should be kept in mind while modelling your queries:

Determine what queries you want to support


First of all, determine what queries you want to support.

For example, do you need:

● joins,
● GROUP BY, or
● filtering on a particular column?

Create table according to your queries


Create a table that will satisfy your queries, and try to design it in such a way that a minimum number of partitions needs to be read.

Handling One to One Relationship in Cassandra


A one-to-one relationship means two tables have a one-to-one correspondence. For example, a student can register for only one course, and I want to search, for a particular student, the course in which that student is registered.



So, in this case, your table schema should encompass all the details of the student corresponding to that particular course, like the name of the course, the roll number of the student, the student's name, etc.

One to One Relationship in Cassandra


Create table Student_Course
(
Student_rollno int PRIMARY KEY,
Student_name text,
Course_name text
);
Handling One to Many Relationship in Cassandra
A one-to-many relationship means having a one-to-many correspondence between two tables.

For example, a course can be studied by many students. I want to search for all the students that are studying a particular course.

So by querying on the course name, I will get the names of the many students who are studying that course.

One to Many Relationship in Cassandra


Create table Student_Course
(
Student_rollno int,
Student_name text,
Course_name text,
PRIMARY KEY (Course_name, Student_rollno)  -- Course_name as partition key so we can query by course
);
I can retrieve all the students for a particular course by the following query.

Select * from Student_Course where Course_name='Course Name';


Handling Many to Many Relationship in Cassandra
A many-to-many relationship means having a many-to-many correspondence between two tables.

For example, a course can be studied by many students, and a student can also study many
courses.

Many to Many Relationship in Cassandra


I want to search for all the students that are studying a particular course. Also, I want to search for all the courses that a particular student is studying.

So, in this case, I will have two tables; that is, I divide the problem into two cases.

First, I will create a table by which you can find courses by a particular student.

Create table Student_Course
(
Student_rollno int,
Student_name text,
Course_name text,
PRIMARY KEY (Student_rollno, Course_name)  -- one row per (student, course) pair
);
I can find all the courses by a particular student by the following query.

Select * from Student_Course where student_rollno=rollno;


Second, I will create a table by which you can find how many students are studying a particular
course.



Create table Course_Student
(
Course_name text,
Student_name text,
Student_rollno int,
PRIMARY KEY (Course_name, Student_rollno)  -- one row per (course, student) pair
);
I can find the students in a particular course with the following query.

Select * from Course_Student where Course_name='CourseName';

