BIG Data 2

Uploaded by

navata

Unit 2

1) What is NoSQL?

Summary
• NoSQL is a non-relational database management system (DMS) that does not require a fixed
schema, avoids joins, and is easy to scale.
• NoSQL databases became popular with Internet giants such as Google, Facebook, and Amazon,
which deal with huge volumes of data.
• In 1998, Carlo Strozzi used the term NoSQL for his lightweight, open-source relational
database.
• NoSQL databases do not follow the relational model; they are either schema-free or have
relaxed schemas.
• The four types of NoSQL database are 1) key-value pair based, 2) column-oriented,
3) graph-based, and 4) document-oriented.
• NoSQL can handle structured, semi-structured, and unstructured data.
• The CAP theorem covers three guarantees: Consistency, Availability, and Partition Tolerance.
• BASE stands for Basically Available, Soft state, Eventual consistency.
• "Eventual consistency" means that copies of data kept on multiple machines for high
availability and scalability may differ briefly but converge over time.
• NoSQL databases offer limited query capabilities.

NoSQL databases are non-relational and were designed with web applications in mind, so they
scale out better than relational databases.

Brief History of NoSQL Databases

• 1998: Carlo Strozzi used the term NoSQL for his lightweight, open-source relational database
• 2000: Graph database Neo4j is launched
• 2004: Google BigTable is launched
• 2005: CouchDB is launched
• 2007: The research paper on Amazon Dynamo is released
• 2008: Facebook open-sources the Cassandra project
• 2009: The term NoSQL is reintroduced

Features of NoSQL

Non-relational

• NoSQL databases do not follow the relational model
• Do not provide tables with flat, fixed-column records
• Work with self-contained aggregates or BLOBs
• Do not require object-relational mapping or data normalization
• Omit complex features such as query languages, query planners, referential integrity,
joins, and ACID

Schema-free

• NoSQL databases are either schema-free or have relaxed schemas
• Do not require any definition of the schema of the data
• Allow heterogeneous data structures in the same domain

Simple API

• Offers easy-to-use interfaces for storing and querying data
• APIs allow low-level data manipulation and selection methods
• Text-based protocols are common, mostly HTTP REST with JSON
• Mostly no standards-based NoSQL query language is used
• Web-enabled databases run as internet-facing services

Distributed

• Multiple NoSQL databases can run in a distributed fashion
• Offers auto-scaling and fail-over capabilities
• ACID guarantees are often sacrificed for scalability and throughput
• Mostly no synchronous replication between distributed nodes; instead, asynchronous
multi-master replication, peer-to-peer replication, or HDFS replication is used
• Provides only eventual consistency
• Shared-nothing architecture, which enables less coordination and higher distribution

2) Aggregate Data Models:

An aggregate is a collection of data that we interact with as a unit. These units of data, or
aggregates, form the boundaries for ACID operations with the database. Key-value, document,
and column-family databases can all be seen as forms of aggregate-oriented database.

Aggregates make it easier for the database to manage data storage over clusters, since a unit
of data can reside on any machine and, when retrieved from the database, brings all its
related data along with it. Aggregate-oriented databases work best when most data interaction
is done with the same aggregate. For example, when there is a need to get an order and all its
details, it is better to store the order as one aggregate object; however, working through
these aggregates to get item details across all orders is not elegant.

Aggregate-oriented databases make inter-aggregate relationships more difficult to handle than
intra-aggregate relationships. Aggregate-ignorant databases are better when interactions use
data organized in many different formations. Aggregate-oriented databases often compute
materialized views to provide data organized differently from their primary aggregates. This is
often done with map-reduce computations, such as a map-reduce job to get items sold per day.
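
The "items sold per day" job mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the order layout, field names (order_id, date, items, sku, qty), and sample values are hypothetical, and the map and reduce steps run in one process rather than across a cluster.

```python
from collections import defaultdict

# Each order is one aggregate: the order and all of its line items travel together.
# Field names and values below are hypothetical sample data.
orders = [
    {"order_id": 1, "date": "2024-01-01",
     "items": [{"sku": "pen", "qty": 2}, {"sku": "ink", "qty": 1}]},
    {"order_id": 2, "date": "2024-01-01",
     "items": [{"sku": "pen", "qty": 3}]},
    {"order_id": 3, "date": "2024-01-02",
     "items": [{"sku": "ink", "qty": 4}]},
]


def map_order(order):
    """Map step: emit one ((date, sku), qty) pair per line item of an aggregate."""
    for item in order["items"]:
        yield (order["date"], item["sku"]), item["qty"]


def reduce_counts(pairs):
    """Reduce step: sum the quantities emitted for each (date, sku) key."""
    totals = defaultdict(int)
    for key, qty in pairs:
        totals[key] += qty
    return dict(totals)


def items_sold_per_day(all_orders):
    """Run the map step over every aggregate, then reduce the emitted pairs."""
    pairs = (pair for order in all_orders for pair in map_order(order))
    return reduce_counts(pairs)


sold = items_sold_per_day(orders)
```

Fetching one order with all its details is a single lookup, while the per-day totals need a pass over every aggregate, which is why such results are often precomputed as materialized views.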

Distribution Models:

Aggregate-oriented databases make distribution of data easier, since the distribution
mechanism only has to move the aggregate and does not have to worry about related data; all
the related data is contained in the aggregate. There are two styles of distributing data:

• Sharding: Sharding distributes different data across multiple servers, so each server acts
as the single source for a subset of data.

• Replication: Replication copies data across multiple servers, so each bit of data can be
found in multiple places. Replication comes in two forms:

  ▪ Master-slave replication makes one node the authoritative copy that handles
  writes, while slaves synchronize with the master and may handle reads.

  ▪ Peer-to-peer replication allows writes to any node; the nodes coordinate to
  synchronize their copies of the data.

Master-slave replication reduces the chance of update conflicts, but peer-to-peer replication
avoids loading all writes onto a single server, which would create a single point of failure.
A system may use either or both techniques. For example, the Riak database shards the data and
also replicates it according to the replication factor.
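
A minimal sketch of combining sharding with a replication factor is shown below. It assumes a simple "hash the key, then take the next nodes in order" placement; the function names and the modulo scheme are illustrative assumptions and are not how Riak or any particular database actually places data.

```python
import hashlib


def shard_for(key, num_nodes):
    """Pick the primary node for a key from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def replica_nodes(key, num_nodes, replication_factor):
    """Return the primary node plus the following nodes, one per replica."""
    if replication_factor > num_nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    first = shard_for(key, num_nodes)
    return [(first + i) % num_nodes for i in range(replication_factor)]


# Hypothetical keys placed on a 5-node cluster with 3 copies of each key.
placement = {key: replica_nodes(key, num_nodes=5, replication_factor=3)
             for key in ("order:1", "order:2", "order:3")}
```

Sharding decides which node owns a key, and replication decides how many further nodes also hold a copy, so losing one node does not lose the data.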

3) Types of NoSQL Databases

NoSQL databases are mainly categorized into four types: key-value pair, column-oriented,
graph-based, and document-oriented. Every category has its own attributes and limitations.
No single type is best for all problems; users should select the database based on their
product needs.

Types of NoSQL Databases:

• Key-value pair based
• Column-oriented
• Graph-based
• Document-oriented

Key Value Pair Based

Data is stored in key/value pairs. This type of database is designed to handle lots of data
and heavy load.

Key-value stores keep data as a hash table in which each key is unique and the value can be
JSON, a BLOB (Binary Large Object), a string, etc.

For example, a key-value pair may contain a key like "Website" associated with a value like
"Guru99".

It is one of the most basic types of NoSQL database. This kind of NoSQL database is used for
collections, dictionaries, associative arrays, etc. Key-value stores help the developer to
store schema-less data. They work well for use cases such as shopping cart contents.

Redis, Dynamo, and Riak are some examples of key-value store databases. Riak's design is
based on Amazon's Dynamo paper.
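
As a rough illustration of the model, the sketch below wraps a Python dictionary as a toy in-memory key-value store. The class name, method names, and the "cart:42" key are hypothetical; real key-value databases add persistence, networking, and distribution on top of this idea.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store: unique keys, opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Writing an existing key replaces its value, so keys stay unique.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("Website", "Guru99")
# The value can also be a serialized JSON document; the store does not inspect it.
store.put("cart:42", json.dumps({"items": ["pen", "ink"]}))
```

Because the store treats values as opaque, it can only look data up by key; any filtering on the contents of a value has to happen in the application.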

Column-based

Column-oriented databases work on columns and are based on the BigTable paper by Google.
Every column is treated separately, and the values of a single column are stored contiguously.

They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as the
data is readily available in a column.

Column-based NoSQL databases are widely used to manage data warehouses, business
intelligence, CRM, and library card catalogs.

HBase, Cassandra, and Hypertable are examples of column-based databases.
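
To see why contiguous column storage helps aggregation, the sketch below pivots a few row-style records into one list per column and then aggregates a single column. The records and field names are hypothetical, and the pivot assumes every record has the same fields; it shows the layout idea only, not any database's storage format.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 25},
    {"id": 3, "region": "east", "amount": 5},
]


def to_columns(records):
    """Pivot a list of records into one contiguous list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate touches only the one column it needs.
total = sum(columns["amount"])
largest = max(columns["amount"])
average = total / len(columns["amount"])
```

With the row layout, the same SUM would have to read every field of every record just to reach the amount values.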

Document-Oriented:

A document-oriented NoSQL database stores and retrieves data as key-value pairs, but the value
part is stored as a document. The document is stored in JSON or XML format. The value is
understood by the database and can be queried.

A relational database arranges data in rows and columns, whereas a document database stores
records with a structure similar to JSON. For a relational database, you have to know in
advance which columns you have. For a document database, the data is stored like a JSON
object and you do not need to define its structure beforehand, which makes it flexible.

The document type is mostly used for CMS systems, blogging platforms, real-time analytics, and
e-commerce applications. It should not be used for complex transactions that require multiple
operations or for queries against varying aggregate structures.

Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-oriented
DBMS systems.
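
The difference from a plain key-value store is that the database can look inside the value. A minimal sketch, assuming documents are Python dictionaries and using hypothetical helper names (get_path, find) and sample data:

```python
# Documents in one collection may have different fields; no schema is declared.
# All names and values here are hypothetical sample data.
collection = {
    "post:1": {"title": "Intro to NoSQL", "tags": ["nosql"], "author": {"name": "Ann"}},
    "post:2": {"title": "CAP basics", "tags": ["cap", "nosql"]},
    "post:3": {"title": "Hello", "author": {"name": "Bob"}, "draft": True},
}


def get_path(document, path):
    """Follow a dotted path such as 'author.name'; return None if any part is missing."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, path, expected):
    """Return the keys of documents whose value at `path` equals `expected`."""
    return sorted(key for key, doc in docs.items() if get_path(doc, path) == expected)


by_author = find(collection, "author.name", "Ann")
drafts = find(collection, "draft", True)
```

Documents that lack the queried field are simply skipped, which is how heterogeneous documents can share one collection.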

Graph-Based

A graph database stores entities as well as the relations among those entities. Each entity is
stored as a node, and each relationship is stored as an edge between nodes. Every node and
edge has a unique identifier.

Compared to a relational database, where tables are loosely connected, a graph database is
multi-relational in nature. Traversing relationships is fast because they are already captured
in the database, and there is no need to calculate them.

Graph databases are mostly used for social networks, logistics, and spatial data.

Neo4J, Infinite Graph, OrientDB, and FlockDB are some popular graph-based databases.
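
The sketch below shows the node-and-edge model with a breadth-first traversal. The node ids, the FRIEND label, and the reachable helper are hypothetical, and edges are treated as directed; it illustrates why stored relationships can be followed without computing joins, not how any graph database is implemented.

```python
from collections import deque

# Nodes and edges each carry a unique identifier; all data here is hypothetical.
nodes = {"n1": "Alice", "n2": "Bob", "n3": "Carol", "n4": "Dave"}
edges = {
    "e1": ("n1", "FRIEND", "n2"),
    "e2": ("n2", "FRIEND", "n3"),
    "e3": ("n1", "FRIEND", "n4"),
}

# Store each node's outgoing edges with the node so traversal needs no join.
adjacency = {node_id: [] for node_id in nodes}
for edge_id, (source, label, target) in edges.items():
    adjacency[source].append((label, target))


def reachable(start, label, max_hops):
    """Breadth-first traversal: node ids reachable from `start` within `max_hops` edges."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node_id, hops = queue.popleft()
        if hops == max_hops:
            continue
        for edge_label, target in adjacency[node_id]:
            if edge_label == label and target not in seen:
                seen.add(target)
                found.append(target)
                queue.append((target, hops + 1))
    return found


friends = reachable("n1", "FRIEND", max_hops=1)
friends_of_friends = reachable("n1", "FRIEND", max_hops=2)
```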

Query Mechanism tools for NoSQL

Sharding distributes data between nodes

• The goal is for users to get all, or most of, their data from one server
• Many NoSQL databases perform automatic sharding
• Sharding can improve both read and write performance
• Sharding allows horizontal scaling for both reads and writes
• However, sharding does not improve resilience: since it distributes data across many
machines, there is a larger chance of failure, particularly compared to a single, highly
maintained machine
• Place data close to where it is accessed, for example locate the Vancouver accounts on
Vancouver servers
• Locate aggregates that are likely to be accessed together or in sequence in the same
location

What is the CAP Theorem?

The CAP theorem is also called Brewer's theorem. It states that it is impossible for a
distributed data store to offer more than two out of three guarantees:

1. Consistency

2. Availability

3. Partition Tolerance

Consistency:

The data should remain consistent even after the execution of an operation. This means that
once data is written, any future read request should return that data. For example, after
updating the order status, all the clients should be able to see the same data.

Availability:

The database should always be available and responsive. It should not have any downtime.

Partition Tolerance:

Partition tolerance means that the system should continue to function even if the
communication among the servers is not stable. For example, the servers can be partitioned
into multiple groups which may not communicate with each other. Here, if part of the database
is unavailable, the other parts remain unaffected.

Eventual Consistency

The term "eventual consistency" describes systems that keep copies of data on multiple
machines to get high availability and scalability. Changes made to any data item on one
machine therefore have to be propagated to the other replicas.

Data replication may not be instantaneous: some copies are updated immediately, while others
are updated in due course. These copies may be mutually inconsistent for a while, but in due
course they become consistent. Hence the name eventual consistency.
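
The behavior described above can be simulated in a few lines. In this sketch a write lands on one replica immediately and is queued for the others; the Replica and Cluster classes, the version counter, and the last-write-wins rule are assumptions made for illustration, not the protocol of any specific database.

```python
class Replica:
    """One copy of the data, holding a value and the version that wrote it."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Keep the write with the highest version (last write wins).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class Cluster:
    def __init__(self, replica_count):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # undelivered updates: (replica_index, key, version, value)
        self.version = 0

    def write(self, key, value):
        """Apply the write to replica 0 now and queue it for the other replicas."""
        self.version += 1
        self.replicas[0].apply(key, self.version, value)
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, self.version, value))

    def propagate(self):
        """Deliver every queued update; afterwards all replicas agree."""
        for index, key, version, value in self.pending:
            self.replicas[index].apply(key, version, value)
        self.pending.clear()

    def read_all(self, key):
        return [replica.read(key) for replica in self.replicas]


cluster = Cluster(replica_count=3)
cluster.write("order:1", "shipped")
before = cluster.read_all("order:1")   # replicas disagree until propagation runs
cluster.propagate()
after = cluster.read_all("order:1")    # every replica now returns the same value
```

A client that reads from a lagging replica between the write and the propagation sees stale data, which is the price paid for accepting the write without waiting for every copy.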

BASE: Basically Available, Soft state, Eventual consistency

• Basically available means the database is available all the time, in the sense of
availability in the CAP theorem
• Soft state means the system state may change even without input
• Eventual consistency means that the system will become consistent over time

Advantages of NoSQL

• Can be used as a primary or analytic data source
• Big data capability
• No single point of failure
• Easy replication
• No need for a separate caching layer
• Provides fast performance and horizontal scalability
• Can handle structured, semi-structured, and unstructured data
• Supports object-oriented programming that is easy to use and flexible
• Does not need a dedicated high-performance server
• Supports key developer languages and platforms
• Simpler to implement than an RDBMS
• Can serve as the primary data source for online applications
• Handles big data, managing data velocity, variety, volume, and complexity
• Excels at distributed database and multi-data-center operations
• Eliminates the need for a specific caching layer to store data
• Offers a flexible schema design that can easily be altered without downtime or service
disruption

Disadvantages of NoSQL

• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively more mature
• Does not offer some traditional database capabilities, such as consistency when multiple
transactions are performed simultaneously
• As the volume of data increases, it becomes difficult to maintain unique keys
• Doesn't work as well with relational data
• The learning curve is steep for new developers
• Many options are open source, which makes them less popular with enterprises

Relational databases and NoSQL databases compared in a tabular format.

Relational Database                        | NoSQL Database
Handles data coming in at low velocity     | Handles data coming in at high velocity
Data arrives from one or a few locations   | Data arrives from many locations
Manages structured data                    | Manages structured, unstructured, and semi-structured data
Supports complex transactions (with joins) | Supports simple transactions
Single point of failure with failover      | No single point of failure
Handles data in moderate volume            | Handles data in very high volume
Centralized deployments                    | Decentralized deployments
Transactions written in one location       | Transactions written in many locations
Gives read scalability                     | Gives both read and write scalability
Deployed in vertical fashion               | Deployed in horizontal fashion

Key highlights on SQL vs NoSQL:

SQL                                           | NoSQL
Relational database management system (RDBMS) | Non-relational or distributed database system
Fixed, static, or predefined schema           | Dynamic schema
Not suited for hierarchical data storage      | Best suited for hierarchical data storage
Best suited for complex queries               | Not so good for complex queries
Vertically scalable                           | Horizontally scalable
Follows ACID properties                       | Follows CAP (consistency, availability, partition tolerance)

What is Apache Cassandra?

Cassandra is a distributed database management system designed for handling a high volume
of structured data across commodity servers.
Cassandra handles huge amounts of data with its distributed architecture. Data is placed on
different machines with a replication factor greater than one, which provides high
availability and no single point of failure.
[Figure: circles represent Cassandra nodes and the lines between them show the distributed
architecture, while the client sends data to a node.]

Apache Cassandra Features
Cassandra provides the following features.
• Massively Scalable Architecture: Cassandra has a masterless design where all nodes are
at the same level, which provides operational simplicity and easy scale-out.
• Masterless Architecture: Data can be written and read on any node.
• Linear Scale Performance: As more nodes are added, the performance of Cassandra
increases.
• No Single Point of Failure: Cassandra replicates data on different nodes, which ensures
no single point of failure.
• Fault Detection and Recovery: Failed nodes can easily be restored and recovered.
• Flexible and Dynamic Data Model: Supports a range of data types with fast writes and reads.
• Data Protection: Data is protected by the commit log design and by built-in safeguards
such as backup and restore mechanisms.
• Tunable Data Consistency: Supports strong data consistency across the distributed
architecture.
• Multi Data Center Replication: Cassandra can replicate data across multiple data
centers.
• Data Compression: Cassandra can compress data by up to 80% with little overhead.
• Cassandra Query Language: Cassandra provides a query language that is similar to SQL,
which makes it easy for relational database developers to move to Cassandra.

Cassandra Table: Create, Alter, Drop & Truncate (with Example)

The syntax of Cassandra Query Language (CQL) resembles SQL.
• Create Table
• Alter Table
• Drop Table
• Truncate Table
How to Create Table
A column family in Cassandra is similar to an RDBMS table and is used to store data.
The command 'Create Table' is used to create a column family in Cassandra.
Syntax
Create table KeyspaceName.TableName
(
ColumnName DataType,
Primary key(ColumnName)
) with PropertyName=PropertyValue;

Cassandra Alter table

The command 'Alter Table' is used to drop a column, add a new column, rename a column,
change a column type, or change a property of the table.
Syntax
Following is the syntax of the command 'Alter Table'.
Alter table KeyspaceName.TableName +
Alter ColumnName TYPE ColumnDataType |
Add ColumnName ColumnDataType |
Drop ColumnName |
Rename ColumnName To NewColumnName |
With propertyName=PropertyValue
Example
In this example, the 'Alter Table' command adds a new column to the table Student.

Drop Table
The command 'Drop Table' drops the specified table, including all its data, from the keyspace.
Before dropping the table, Cassandra takes a snapshot of the data, but not the schema, as a
backup.
Syntax
Drop Table KeyspaceName.TableName
Example
The following command drops the table Student from the keyspace 'University'.

Drop Table University.Student;

After successful execution of the command 'Drop Table', the table Student is dropped from
the keyspace University.

Truncate Table
The command 'Truncate Table' removes all the data from the specified table. Before truncating
the data, Cassandra takes a snapshot of the data as a backup.
Syntax
Truncate KeyspaceName.TableName
Example
The following command removes all the data from the table Student.

Truncate University.Student;

After successful execution of the command 'Truncate Table', all the data is removed from
the table Student.

Cassandra Query Language (CQL): Insert Into, Update (Example)

This section covers Cassandra commands with CQL examples:
• Insert Data
• Upsert Data
• Update Data
• Delete Data
• Cassandra Where Clause

Insert Data
The Cassandra insert statement writes data into Cassandra columns in row form. An insert
query stores only those columns that are given by the user; the only column you must specify
is the primary key column.
No space is taken for values that are not given. No results are returned after insertion.
Syntax
Insert into KeyspaceName.TableName(ColumnName1, ColumnName2, ColumnName3 . . . .)
values (Column1Value, Column2Value, Column3Value . . . .)
Example
The following Cassandra insert query inserts one record into the Cassandra table 'Student'.

Insert into University.Student(RollNo,Name,dept,Semester) values(2,'Michael','CS', 2);

After successful execution of the insert command, one row is inserted into the Cassandra
table Student with RollNo 2, Name Michael, dept CS, and Semester 2.

Upsert Data
Cassandra does upserts. An upsert means that Cassandra inserts a row if the primary key does
not already exist; if the primary key already exists, it updates that row.
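
The upsert rule can be modelled with a dictionary keyed by primary key. This is a sketch of the semantics only, under the assumption that a write to an existing key overwrites just the columns it names; the upsert helper and the row for key 3 are hypothetical, and nothing here reflects Cassandra's storage engine.

```python
def upsert(table, primary_key, **columns):
    """Insert the row if the key is new; otherwise overwrite only the given columns."""
    row = table.setdefault(primary_key, {})
    row.update(columns)
    return row


student = {}  # stands in for the Student table, keyed by RollNo
upsert(student, 2, Name="Michael", dept="CS", Semester=2)  # key 2 is new: insert
upsert(student, 2, Semester=3)                             # key 2 exists: update
upsert(student, 3, Name="Guru99")                          # only given columns are stored
```

The second call does not fail with a duplicate-key error as a relational insert would; it silently changes the existing row.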
Update Data
The Cassandra Update query is used to update data in a Cassandra table. If no results are
returned after updating, the data was updated successfully; otherwise an error is
returned. Column values are changed in the 'Set' clause, while the rows to update are
selected with the 'Where' clause.
Syntax
Update KeyspaceName.TableName
Set ColumnName1=new Column1Value,
… .
Where ColumnName=ColumnValue
Example
The following Cassandra Update command updates a record in the Student table.

Update University.Student
Set name='Hayden'
Where rollno=1;
After successful execution of the update query, the name of the student with rollno 1 is
changed from 'Clark' to 'Hayden'.

Cassandra Delete Data

The command 'Delete' removes an entire row or some columns from a table. When data is
deleted, it is not removed from the table immediately. Instead, deleted data is marked with
a tombstone and is removed after compaction.
Syntax
Delete from KeyspaceName.TableName
Where ColumnName1=ColumnValue
The above syntax deletes one or more rows, depending on the filter in the where clause.
Delete ColumnNames from KeyspaceName.TableName
Where ColumnName1=ColumnValue
The above syntax deletes some columns from the table.
Example
The following command removes one row from the table Student.

Delete from University.Student where rollno=1;
After successful execution of the CQL Delete command, the row whose rollno value is 1 is
deleted from the table Student.
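
The tombstone mechanism described in this section can be sketched as a toy table that marks deleted rows and removes them only when compaction runs. The Table class and its method names are hypothetical, and the sketch omits details of the real system, such as the grace period Cassandra waits before a tombstone may be dropped.

```python
TOMBSTONE = object()  # marker meaning "this row was deleted"


class Table:
    """Toy table keyed by primary key; a sketch of tombstone-style deletes."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, row):
        self.rows[key] = row

    def delete(self, key):
        # Do not remove the entry yet; just mark it as deleted.
        if key in self.rows:
            self.rows[key] = TOMBSTONE

    def select(self, key):
        row = self.rows.get(key)
        return None if row is TOMBSTONE else row

    def compact(self):
        """Physically drop every tombstoned entry; return how many were removed."""
        dead = [key for key, row in self.rows.items() if row is TOMBSTONE]
        for key in dead:
            del self.rows[key]
        return len(dead)


student = Table()
student.insert(1, {"name": "Hayden"})
student.insert(2, {"name": "Michael"})
student.delete(1)
visible_after_delete = student.select(1)      # reads no longer see the row
stored_before_compaction = len(student.rows)  # but the marker still takes space
removed = student.compact()
```

Marking instead of erasing lets a delete be recorded as just another write, which other replicas can then learn about like any other update.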

What Cassandra does not support

The Cassandra Query Language (CQL) has the following limitations.
1. CQL does not support aggregation queries like max, min, avg (this applies to early
versions; later Cassandra releases added built-in aggregate functions).
2. CQL does not support group by, having queries (GROUP BY was added in later releases).
3. CQL does not support joins.
4. CQL does not support OR queries.
5. CQL does not support wildcard queries.
6. CQL does not support Union, Intersection queries.
7. Table columns cannot be filtered without creating an index.
8. Greater than (>) and less than (<) queries are supported only on clustering columns.
Because of these limitations, CQL is not well suited to analytics.
Cassandra Where Clause
In Cassandra, data retrieval is a sensitive issue. To filter on a non-primary-key column, an
index must first be created on that column.
Syntax
Select ColumnNames from KeyspaceName.TableName
Where ColumnName1=Column1Value AND
ColumnName2=Column2Value AND
..
Example
The following query retrieves all the data from the Student table without filtering.

select * from University.Student;

Two records are retrieved from the Student table.

The following query retrieves data from Student with filtering. The data is filtered by the
name column, so all records whose name equals Guru99 are retrieved. One record is retrieved.

select * from University.Student where name='Guru99';
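
Why an index is needed for this kind of filter can be illustrated with a small lookup structure that maps each column value to the primary keys holding it. The helper names (build_index, select_where) and the sample rows are hypothetical; this is a sketch of the idea of a secondary index, not of how Cassandra builds or stores one.

```python
from collections import defaultdict

# Rows keyed by primary key (rollno); values are hypothetical sample data.
rows = {
    1: {"name": "Hayden", "dept": "CS"},
    2: {"name": "Guru99", "dept": "CS"},
}


def build_index(table, column):
    """Map each value of `column` to the set of primary keys that hold it."""
    index = defaultdict(set)
    for key, row in table.items():
        if column in row:
            index[row[column]].add(key)
    return index


def select_where(table, index, value):
    """Answer `where column = value` from the index instead of scanning every row."""
    return [table[key] for key in sorted(index.get(value, ()))]


name_index = build_index(rows, "name")
matches = select_where(rows, name_index, "Guru99")
```

Without the index, the only way to find rows by name would be to read every row, which is expensive when the rows are spread over many nodes.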
