BIG Data 2
1) What is NoSQL?
Summary
NoSQL is a non-relational data management system (DMS) that does not require a fixed
schema, avoids joins, and is easy to scale.
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc., which deal with huge volumes of data.
In 1998, Carlo Strozzi used the term NoSQL for his lightweight, open-source
relational database.
NoSQL databases do not follow the relational model; they are either schema-free or have
relaxed schemas.
The four types of NoSQL database are 1) key-value pair based, 2) column-oriented,
3) graph-based, and 4) document-oriented.
NoSQL can handle structured, semi-structured, and unstructured data equally well.
The CAP theorem concerns three guarantees: Consistency, Availability, and Partition Tolerance.
BASE stands for Basically Available, Soft state, Eventual consistency.
"Eventual consistency" arises when copies of data are kept on multiple machines to
get high availability and scalability: the copies may differ briefly but converge over time.
NoSQL offers limited query capabilities.
A NoSQL database is non-relational and typically scales out better than a relational database,
because NoSQL systems were designed with web applications in mind.
1998- Carlo Strozzi used the term NoSQL for his lightweight, open-source relational
database
2000- Graph database Neo4j was launched
Features of NoSQL
Non-relational
Schema-free
Simple API
Offers easy-to-use interfaces for storing and querying data
Distributed
Shared Nothing Architecture. This enables less coordination and higher distribution.
2) Aggregate Data Models:
An aggregate is a collection of data that we interact with as a unit. These units of data, or
aggregates, form the boundaries for ACID operations with the database. Key-value, document,
and column-family databases can all be seen as forms of aggregate-oriented database.
Aggregates make it easier for the database to manage data storage over clusters, since the unit
of data can reside on any machine and, when retrieved from the database, brings all the
related data along with it. Aggregate-oriented databases work best when most data interaction
is done with the same aggregate. For example, when there is a need to get an order and all its
details, it is better to store the order as one aggregate object. Working across these aggregates,
for instance to get item details on all the orders, is not elegant.
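The trade-off described above can be illustrated with a small Python sketch. It is only a model of the idea, not any database's API: the order ids, field names, and the two helper functions are hypothetical.

```python
# Hypothetical order aggregates: each order carries all of its line items.
orders = {
    "order-1001": {
        "customer": "alice",
        "items": [
            {"sku": "book-42", "qty": 1, "price": 30.0},
            {"sku": "pen-7", "qty": 3, "price": 2.0},
        ],
    },
    "order-1002": {
        "customer": "bob",
        "items": [{"sku": "book-42", "qty": 2, "price": 30.0}],
    },
}


def order_total(order_id):
    """One lookup returns the whole aggregate, so no join is needed."""
    order = orders[order_id]
    return sum(item["qty"] * item["price"] for item in order["items"])


def quantity_sold(sku):
    """A question across aggregates: every order must be opened and scanned."""
    return sum(
        item["qty"]
        for order in orders.values()
        for item in order["items"]
        if item["sku"] == sku
    )
```

Reading one order touches a single aggregate, while the per-item question has to walk every order, which is the awkward case the paragraph describes.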
Distribution Models:
Aggregate-oriented databases make distribution of data easier, since the distribution
mechanism only has to move the aggregate and does not have to worry about related data: all the
related data is contained in the aggregate. There are two styles of distributing data:
Sharding: Sharding distributes different data across multiple servers, so each server acts
as the single source for a subset of data.
Replication: Replication copies data across multiple servers, so each bit of data can be
found in multiple places. Replication comes in two forms:
Master-slave replication makes one node the authoritative copy that handles
writes while slaves synchronize with the master and may handle reads.
Peer-to-peer replication allows writes to any node; the nodes coordinate to
synchronize their copies of the data.
Master-slave replication reduces the chance of update conflicts, but peer-to-peer replication
avoids loading all writes onto a single server, which would create a single point of failure. A
system may use either or both techniques. For example, the Riak database shards the data and also
replicates it based on the replication factor.
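As a rough illustration of master-slave replication, the Python sketch below routes every write to one master and lets slaves copy its state before serving reads. The class names and the explicit sync() step are assumptions made for the example; real systems replicate continuously and handle failures.

```python
class Node:
    """A server holding a copy of the data (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveCluster:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        # All writes go to the master, so updates have a single order
        # and cannot conflict with each other.
        self.master.data[key] = value

    def sync(self):
        # Slaves synchronize by copying the master's current state.
        for slave in self.slaves:
            slave.data = dict(self.master.data)

    def read(self, key, replica=0):
        # Reads may be served by any slave; they can be stale until sync().
        return self.slaves[replica].data.get(key)


cluster = MasterSlaveCluster(Node("master"), [Node("slave-1"), Node("slave-2")])
cluster.write("order:1", "shipped")
stale = cluster.read("order:1")       # slave has not synchronized yet
cluster.sync()
fresh = cluster.read("order:1", replica=1)
```

The master remains the single place where writes land, which is why this style avoids update conflicts but concentrates write load on one server.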
NoSQL Databases are mainly categorized into four types: Key-value pair, Column-oriented,
Graph-based and Document-oriented. Every category has its unique attributes and limitations.
No single type of database is best for every problem. Users should select
the database based on their product needs.
Key-Value Pair Based
Data is stored in key/value pairs. This kind of store is designed to handle lots of data and
heavy load.
Key-value pair storage databases store data as a hash table where each key is unique, and the
value can be JSON, a BLOB (Binary Large Object), a string, etc.
For example, a key-value pair may contain a key like "Website" associated with a value like
"Guru99".
It is one of the most basic types of NoSQL database. This kind of NoSQL database is used like a
collection, dictionary, or associative array. Key-value stores help the developer to store
schema-less data. They work well for data such as shopping cart contents.
Redis, Dynamo, and Riak are some examples of key-value store databases. Riak's design is
based on Amazon's Dynamo paper.
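The hash-table view of a key-value store can be sketched in a few lines of Python. This is a toy in-memory model with hypothetical method names, not the client API of Redis, Dynamo, or Riak; the "cart:alice" key is also made up for the example.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store: unique keys, opaque values."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # Writing an existing key replaces its value.
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)


store = KeyValueStore()
store.put("Website", "Guru99")
# The store does not interpret the value; here it is a JSON string.
store.put("cart:alice", json.dumps({"items": ["book-42", "pen-7"]}))
cart = json.loads(store.get("cart:alice"))
```

The store only looks up by key; any structure inside the value is the application's business, which is what makes the model schema-less.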
Column-based
Column-oriented databases work on columns and are based on the BigTable paper by Google.
Every column is treated separately. The values of a single column are stored contiguously.
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as the
data is readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business
intelligence, CRM, and library card catalogs.
HBase, Cassandra, and Hypertable are examples of column-based NoSQL databases.
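To see why aggregation is cheap when a column's values sit together, here is a small Python sketch that turns row-shaped records into per-column lists. The sample records and the to_columns helper are hypothetical and only mimic the layout idea; real column stores add compression and on-disk formats.

```python
# Row-oriented view: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "amount": 120},
    {"id": 2, "region": "west", "amount": 80},
    {"id": 3, "region": "east", "amount": 50},
]


def to_columns(records):
    """Regroup records so that each column's values are stored contiguously."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Aggregations read one list and never touch the other columns.
total_amount = sum(columns["amount"])
max_amount = max(columns["amount"])
row_count = len(columns["id"])
```

An aggregation such as SUM only scans the "amount" list, whereas the row layout would force a pass over every full record.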
Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value part is
stored as a document. The document is stored in JSON or XML formats. The value is understood
by the DB and can be queried.
In the diagram, the left side shows rows and columns, and the right side shows a
document database, which has a structure similar to JSON. For the relational database, you
have to know in advance which columns you have. For a document database, you
store data like a JSON object and do not need to define its structure up front, which makes it
flexible.
The document type is mostly used for CMS systems, blogging platforms, real-time analytics, and
e-commerce applications. It should not be used for complex transactions which require multiple
operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-oriented
DBMS systems.
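The difference from a plain key-value store is that the database can look inside the value. The Python sketch below models that with nested dictionaries and a hypothetical find() helper that matches on a dotted field path; it is not the query API of any of the products listed above.

```python
_MISSING = object()  # marker for "this document has no such field"

# Documents are keyed by id; they need not share the same fields.
documents = {
    "post-1": {
        "title": "Intro to NoSQL",
        "tags": ["nosql", "db"],
        "author": {"name": "Ann"},
    },
    "post-2": {"title": "Graph basics", "tags": ["graph"]},  # no author field
}


def lookup(document, field_path):
    """Follow a dotted path such as 'author.name' into a nested document."""
    current = document
    for part in field_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def find(field_path, value):
    """Return the ids of documents whose field equals the given value."""
    return [
        doc_id
        for doc_id, document in documents.items()
        if lookup(document, field_path) == value
    ]
```

For example, find("author.name", "Ann") selects the first post even though the second post has no author at all, which shows the flexible structure in practice.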
Graph-Based
A graph database stores entities as well as the relations among those entities. Each entity is
stored as a node, and each relationship as an edge between nodes.
Every node and edge has a unique identifier.
Compared to a relational database, where tables are loosely connected, a graph database is
multi-relational in nature. Traversing relationships is fast, as they are already captured in the
DB and there is no need to calculate them.
Graph-based databases are mostly used for social networks, logistics, and spatial data.
Neo4j, InfiniteGraph, OrientDB, and FlockDB are some popular graph-based databases.
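A short Python sketch can show what "relationships are already captured" means: edges are stored next to each node, so a traversal just follows them. The people and the "follows" relationships are made-up sample data, and the breadth-first search is one simple traversal, not how any particular graph database is implemented.

```python
from collections import deque

# Hypothetical "follows" edges, stored directly with each node.
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def within_hops(start, max_hops):
    """Breadth-first traversal: nodes reachable in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    reached = []
    while frontier:
        node, distance = frontier.popleft()
        if distance == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                reached.append(neighbour)
                frontier.append((neighbour, distance + 1))
    return reached
```

Each step is a direct lookup of a node's stored edges, in contrast to a relational design where the same question would be answered with repeated joins.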
Sharding
The goal is for users to get all, or most of, their data from one server.
Many NoSQL databases perform automatic sharding.
Sharding can improve both read and write performance.
Sharding allows horizontal scaling for both reads and writes.
However, sharding alone does not improve resilience.
Since sharding distributes data across many machines, there is a larger chance of failure,
particularly compared to a single machine that is highly maintained.
Data can be placed close to where it is accessed, for example by locating the Vancouver
accounts on Vancouver servers.
Locate aggregates that are likely to be accessed together or in sequence in the same
location.
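The placement ideas above can be sketched in Python under a few assumptions: the server names are hypothetical, hash-based placement is just one common strategy, and the city table stands in for whatever locality rule a real deployment would use.

```python
import hashlib

SHARDS = ["server-a", "server-b", "server-c"]   # hypothetical servers
LOCAL_SHARDS = {"Vancouver": "vancouver-1"}     # hypothetical locality rule


def shard_for(key):
    """Hash-based placement: the same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def shard_for_account(account_id, city):
    """Prefer a server near the user; otherwise fall back to hashing."""
    return LOCAL_SHARDS.get(city) or shard_for(account_id)
```

Each key lives on exactly one shard, so losing that server makes its keys unavailable. This is why sharding on its own does not add resilience and is often combined with replication.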
What is the CAP Theorem?
The CAP theorem is also called Brewer's theorem. It states that it is impossible for a distributed
data store to offer more than two out of three guarantees:
1. Consistency
2. Availability
3. Partition Tolerance
Consistency:
The data should remain consistent even after the execution of an operation. This means that once
data is written, any future read request should return that data. For example, after updating
the order status, all the clients should be able to see the same data.
Availability:
The database should always be available and responsive. It should not have any downtime.
Partition Tolerance:
Partition Tolerance means that the system should continue to function even if the
communication among the servers is not stable. For example, the servers can be partitioned
into multiple groups which may not communicate with each other. In that case, if part of the
database is unavailable, the other parts keep working.
Eventual Consistency
The term "eventual consistency" arises from keeping copies of data on multiple machines to get
high availability and scalability. Changes made to any data item on one machine have to be
propagated to the other replicas.
Data replication may not be instantaneous: some copies will be updated immediately while
others are updated in due course of time. These copies may be mutually inconsistent for a while,
but in due course of time they become consistent. Hence the name eventual consistency.
Basically available means the DB is available all the time, in the sense of the CAP theorem
Soft state means the system state may change even without an input
Eventual consistency means that the system will become consistent over time
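The behaviour described above can be simulated with two replicas that accept writes independently and later exchange their state. The sketch assumes a last-write-wins rule based on a caller-supplied timestamp, which is only one of several reconciliation strategies; the Replica class and anti_entropy function are hypothetical names.

```python
class Replica:
    """One copy of the data; each entry remembers when it was written."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        # Keep the incoming value only if it is newer than what we hold.
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Exchange entries in both directions; the newer timestamp wins."""
    for key, (timestamp, value) in list(a.store.items()):
        b.write(key, value, timestamp)
    for key, (timestamp, value) in list(b.store.items()):
        a.write(key, value, timestamp)


r1, r2 = Replica(), Replica()
r2.write("order:1", "packed", timestamp=1)
r1.write("order:1", "shipped", timestamp=2)
before = (r1.read("order:1"), r2.read("order:1"))  # replicas disagree
anti_entropy(r1, r2)
after = (r1.read("order:1"), r2.read("order:1"))   # replicas agree
```

Between the two writes and the exchange, a reader may see either value depending on which replica answers; after the exchange both replicas return the same one.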
Advantages of NoSQL
Easy Replication
Can handle structured, semi-structured, and unstructured data equally well
Handles big data, coping with data velocity, variety, volume, and complexity
Excels at distributed database and multi-data center operations
Offers a flexible schema design which can easily be altered without downtime or service
disruption
Disadvantages of NoSQL
No standardization rules
Many NoSQL databases do not offer some traditional database capabilities, such as consistency
when multiple transactions are performed simultaneously.
When the volume of data increases, it becomes difficult to maintain unique values as keys
SQL                                          | NoSQL
Data arrive from one or a few locations      | Data arrive from many locations
Supports complex transactions (with joins)   | Supports simple transactions
Not suited for hierarchical data storage     | Best suited for hierarchical data storage
Best suited for complex queries              | Not so good for complex queries
Apache Cassandra Features
Cassandra provides the following features.
Massively Scalable Architecture: Cassandra has a masterless design where all nodes are
at the same level which provides operational simplicity and easy scale out.
Masterless Architecture: Data can be written and read on any node.
Linear Scale Performance: As more nodes are added, the performance of Cassandra
increases.
No Single Point of Failure: Cassandra replicates data on different nodes, which ensures there
is no single point of failure.
Fault Detection and Recovery: Failed nodes can easily be restored and recovered.
Flexible and Dynamic Data Model: Supports flexible data types with fast writes and reads.
Data Protection: Data is protected by the commit log design and by built-in mechanisms such as
backup and restore.
Tunable Data Consistency: Consistency can be tuned, up to strong consistency, across a
distributed architecture.
Multi Data Center Replication: Cassandra provides a feature to replicate data across
multiple data centers.
Data Compression: Cassandra can compress data by up to 80% without any overhead.
Cassandra Query Language: Cassandra provides a query language that is similar to SQL.
This makes it easy for relational database developers to move from a relational
database to Cassandra.
…
Primary key(ColumnName)
) with PropertyName=PropertyValue;
Drop Table
The 'Drop Table' command drops the specified table, including all its data, from the keyspace.
Before dropping the table, Cassandra takes a snapshot of the data (not the schema) as a backup.
Syntax
Drop Table KeyspaceName.TableName
Example
Here is the snapshot of the executed command 'Drop Table' that will drop table Student from
the keyspace 'University'.
After successful execution of the command 'Drop Table', table Student will be dropped from
the keyspace University.
Truncate Table
The 'Truncate Table' command removes all the data from the specified table. Before truncating
the data, Cassandra takes a snapshot of the data as a backup.
Syntax
Truncate KeyspaceName.TableName
Example
Here is the snapshot of the executed command 'Truncate table' that will remove all the data
from the table Student.
After successful execution of the command 'Truncate Table', all the data will be removed from
the table Student.
Insert Data
The Cassandra insert statement writes data into Cassandra columns in row form. An insert
query stores only those columns that are given by the user. Only the primary key column
must be specified.
No space is taken for values that are not given. No results are returned after insertion.
Syntax
Insert into KeyspaceName.TableName(ColumnName1, ColumnName2, ColumnName3 . . . .)
values (Column1Value, Column2Value, Column3Value . . . .)
Example
Here is the snapshot of the executed Cassandra Insert into table query that will insert one
record in Cassandra table 'Student'.
Upsert Data
Cassandra does upsert. Upsert means that Cassandra will insert a row if the primary key does
not already exist; if the primary key already exists, it will update that row.
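The upsert rule can be modelled with a Python dictionary keyed by primary key. This is only a sketch of the semantics, not Cassandra driver code: the insert() helper and the 'dept' column are hypothetical, while rollno and name follow the Student example used in these notes.

```python
student = {}  # primary key (rollno) -> dict of column values


def insert(table, rollno, **columns):
    """Insert the row if rollno is new; otherwise update the given columns."""
    row = table.setdefault(rollno, {})
    row.update(columns)  # columns that are not mentioned keep their old values


insert(student, 1, name="Clark", dept="CS")   # no row with rollno 1 yet: insert
insert(student, 1, name="Hayden")             # rollno 1 exists: acts as update
insert(student, 2, name="Maya")               # a different key: new row
```

After these calls the row for rollno 1 holds the new name alongside the untouched 'dept' value, and no error is raised for the repeated key.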
Update Data
The Cassandra Update query is used to update the data in a Cassandra table. If no results are
returned after updating, the data was updated successfully; otherwise an error is
returned. Column values are changed in the 'Set' clause, while rows are filtered with the 'Where'
clause.
Syntax
Update KeyspaceName.TableName
Set ColumnName1=new Column1Value,
… .
Where ColumnName=ColumnValue
Example
Here is the snapshot of the executed Cassandra Update command that updates the record in
the Student table.
Update University.Student
Set name='Hayden'
Where rollno=1;
After successful execution of the update query, the name of the student with rollno 1
will be changed from 'Clark' to 'Hayden'.
Delete from University.Student where rollno=1;
After successful execution of the CQL Delete command, one row will be deleted from the table
Student: the row whose rollno value is 1.