Data Analytics using NoSQL
DHINESHKUMAR S K
Taxonomy of NoSQL
Key-value
Graph database
Document-oriented
Column family
Typical NoSQL architecture
Hashing: a hash function maps each key K to a server (node)
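A hypothetical sketch in plain JavaScript of the idea behind this diagram: a hash function deterministically maps each key to one of the nodes. The hash function and node count here are illustrative, not taken from any particular database.

```javascript
// Illustrative only: hash a key, then pick a server by modulo.
// Real systems usually use consistent hashing so that adding or
// removing a node moves only a fraction of the keys.
function hashKey(key) {
  // djb2-style string hash; deterministic and simple, not cryptographic.
  let h = 5381;
  for (const ch of key) {
    h = (h * 33 + ch.charCodeAt(0)) >>> 0;
  }
  return h;
}

function nodeFor(key, numNodes) {
  return hashKey(key) % numNodes;
}

// The same key always routes to the same node.
console.log(nodeFor("user:42", 4));
```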
CAP theorem for NoSQL
What the CAP theorem really says:
• If you cannot limit the number of faults, and requests can be directed to any server, and you insist on serving every request you receive, then you cannot possibly be consistent
How it is interpreted:
• You must always give something up: consistency, availability, or tolerance to failure and reconfiguration
Theory of NoSQL: CAP
GIVEN:
• Many nodes
• Nodes contain replicas of partitions of the data
• Consistency
• All replicas contain the same version of the data
• Clients always have the same view of the data (no matter which node)
• Availability
• System remains operational on failing nodes
• All clients can always read and write
• Partition tolerance
• Multiple entry points
• System remains operational on system split (communication malfunction)
• System works well across physical network partitions
CAP Theorem: satisfying all three at the same time is impossible.
Sharding of data
Distributes a single logical database system across a cluster of machines
Uses range-based partitioning to distribute documents based
on a specific shard key
Automatically balances the data associated with each shard
Can be turned on and off per collection (table)
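The range-based partitioning described above can be sketched in plain JavaScript; the shard names and the split point are hypothetical, chosen only to illustrate routing by shard key.

```javascript
// Hypothetical sketch: each shard owns a contiguous range of shard-key
// values; the router sends a document to the shard whose range holds its key.
const shards = [
  { name: "shard0", min: null, max: "m" },  // keys below "m"
  { name: "shard1", min: "m", max: null },  // keys from "m" upward
];

function shardFor(shardKey) {
  const s = shards.find(s =>
    (s.min === null || shardKey >= s.min) &&
    (s.max === null || shardKey < s.max));
  return s.name;
}

console.log(shardFor("alice")); // routed to shard0
console.log(shardFor("zoe"));   // routed to shard1
```

A balancer would additionally split and migrate ranges as they grow, which is what "automatically balances the data" refers to.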
Replica Sets
Redundancy and failover
Zero downtime for upgrades and maintenance
Master-slave replication
Strong consistency
Delayed consistency
Geospatial features
How does NoSQL vary from RDBMS?
Looser schema definition
Applications written to deal with specific documents/data
Applications are aware of the schema definition, as opposed to the data
Designed to handle distributed, large databases
Trade-offs:
No strong support for ad hoc queries, but designed for speed and growth of the database
Query language through the API
Relaxation of the ACID properties
Benefits of NoSQL
Elastic Scaling
• RDBMS scale up: bigger load, bigger server
• NoSQL scale out: distribute data across multiple hosts seamlessly
Big Data
• Huge increase in data; RDBMS capacity and constraints on data volumes are at their limits
• NoSQL designed for big data
DBA Specialists
• RDBMS require highly trained experts to monitor the DB
• NoSQL require less management: automatic repair and simpler data models
Benefits of NoSQL
Flexible Data Models
• Change management for an RDBMS schema has to be carefully managed
• NoSQL databases are more relaxed about the structure of data
• Database schema changes do not have to be managed as one complicated change unit
• Applications are already written to address an amorphous schema
Economics
• RDBMS rely on expensive proprietary servers to manage data
• NoSQL: clusters of cheap commodity servers manage the data and transaction volumes
• Cost per gigabyte or per transaction/second for NoSQL can be lower than the cost for an RDBMS
Drawbacks of NoSQL
• Support • Maturity
• RDBMS vendors • RDMS mature
provide a high level of product: means stable
support to clients and dependable
• Stellar reputation • Also means old no
• NoSQL – are open longer cutting edge nor
interesting
source projects
with startups • NoSQL are still
supporting them implementing their
• Reputation not yet basic feature set
established
Drawbacks of NoSQL
• Administration
• RDBMS administrator is a well-defined role
• NoSQL's goal: no administrator necessary; however, NoSQL still requires effort to maintain
• Lack of Expertise
• Whole workforce of trained and seasoned RDBMS developers
• Still recruiting developers to the NoSQL camp
• Analytics and Business Intelligence
• RDBMS designed to address this niche
• NoSQL designed to meet the needs of a Web 2.0 application, not for ad hoc query of the data
• Tools are being developed to address this need
RDB ACID to NoSQL BASE
ACID: Atomicity, Consistency, Isolation, Durability
BASE: Basically Available, Soft-state (state of system may change over time), Eventually consistent (asynchronous propagation)
MongoDB
What is MongoDB?
Developed by 10gen
Founded in 2007
A document-oriented, NoSQL database
Written in C++
Supports APIs (drivers) in many computer languages
JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, Erlang
Functionality of MongoDB
• Dynamic schema
• No DDL
• Document-based database
• Secondary indexes
• Query language via an API
• Atomic writes and fully-consistent reads
• If system configured that way
• Master-slave replication with automated failover (replica sets)
• Built-in horizontal scaling via automated range-based partitioning of data (sharding)
Why use MongoDB?
Simple queries
Functionality provided applicable to most web applications
Easy and fast integration of data
No ERD diagram
Not well suited for heavy and complex transactional systems
MongoDB: CAP approach
Focus on Consistency and Partition tolerance
• Consistency
• All replicas contain the same version of the data
• Availability
• System remains operational on failing nodes
• Partition tolerance
• Multiple entry points
• System remains operational on system split
CAP Theorem: satisfying all three at the same time is impossible.
MongoDB: Hierarchical Objects
A MongoDB instance may have zero or more ‘databases’.
A database may have zero or more ‘collections’.
A collection may have zero or more ‘documents’.
A document may have one or more ‘fields’.
MongoDB ‘indexes’ function much like their RDBMS counterparts.
RDB Concepts to NoSQL
RDBMS MongoDB
Database Database
Table, View Collection
Row Document (BSON)
Column Field
Index Index
Join Embedded Document
Foreign Key Reference
Partition Shard
Choices made for Design of MongoDB
Scale horizontally over commodity hardware
Lots of relatively inexpensive servers
Keep the functionality that works well in RDBMSs
Ad hoc queries
Fully featured indexes
Secondary indexes
What doesn’t distribute well in RDB?
Long running multi-row transactions
Joins
Both artifacts of the relational data model (row x column)
BSON format
Binary-encoded serialization of JSON-like documents
Zero or more key/value pairs are stored as a single entity
Each entry consists of a field name, a data type, and a value
Large elements in a BSON document are prefixed with a
length field to facilitate scanning
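The length-prefix idea can be illustrated in Node.js. This is a deliberately simplified sketch of the concept, not the actual BSON wire format; the element layout here is made up for illustration.

```javascript
// Simplified illustration of length-prefixing (NOT real BSON encoding):
// writing an element's byte length before its payload lets a scanner
// skip over the element without parsing the payload itself.
function encodeStringElement(name, value) {
  const header = Buffer.from(name + "\0", "utf8"); // field name, NUL-terminated
  const len = Buffer.alloc(4);
  const payload = Buffer.from(value, "utf8");
  len.writeInt32LE(payload.length, 0);             // the length prefix
  return Buffer.concat([header, len, payload]);
}

const elem = encodeStringElement("name", "will");
// A scanner reads the 4-byte length after the field name and can jump
// straight past the payload: 5 ("name\0") + 4 (length) + 4 ("will") = 13.
console.log(elem.length); // 13
```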
Schema Free
MongoDB does not need any pre-defined data schema
Every document in a collection could have different data
Addresses NULL data fields
{name: "will",
 eyes: "blue",
 birthplace: "NY",
 aliases: ["bill", "la ciacco"],
 loc: [32.7, 63.4],
 boss: "ben"}

{name: "jeff",
 eyes: "blue",
 loc: [40.7, 73.4],
 boss: "ben"}

{name: "brendan",
 aliases: ["el diablo"]}

{name: "matt",
 pizza: "DiGiorno",
 height: 72,
 loc: [44.6, 71.3]}

{name: "ben",
 hat: "yes"}
JSON format
Data is in name / value pairs
A name/value pair consists of a field name followed by a colon, followed by a value:
Example: "name": "R2-D2"
Data is separated by commas
Example: "name": "R2-D2", "race": "Droid"
Curly braces hold objects
Example: {"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}
An array is stored in brackets []
Example: [ {"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}, {"name": "Yoda", "affiliation": "rebels"} ]
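These rules map directly onto JavaScript's built-in JSON support, which is what MongoDB drivers and the shell build on. A minimal sketch:

```javascript
// Parse a JSON array of objects, access fields, and serialize back.
const text = '[{"name": "R2-D2", "race": "Droid", "affiliation": "rebels"},' +
             ' {"name": "Yoda", "affiliation": "rebels"}]';

const droids = JSON.parse(text);        // string -> array of objects
console.log(droids[0].name);            // "R2-D2"
console.log(JSON.stringify(droids[1])); // object -> JSON string
```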
MongoDB Features
Document-oriented storage
Full index support
Replication & high availability
Auto-sharding
Querying
Fast in-place updates
Map/Reduce functionality
Agile and scalable
Index Functionality
• B+ tree indexes
• An index is automatically created on the _id field (the primary key)
• Users can create other indexes to improve query performance or to enforce unique values for a particular field
• Supports single-field indexes as well as compound indexes
• Like SQL, the order of the fields in a compound index matters
• If you index a field that holds an array value, MongoDB creates separate index entries for every element of the array
• The sparse property of an index ensures that the index only contains entries for documents that have the indexed field (records that do not have the field defined are ignored)
• If an index is both unique and sparse, the system will reject records that have a duplicate key value but allow records that do not have the indexed field defined
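The unique-and-sparse rule in the last bullet can be modeled in plain JavaScript. This sketch captures only the acceptance rule, not MongoDB's actual index implementation; the field name and values are made up.

```javascript
// Model of a unique + sparse index: inserts are rejected only when the
// indexed field IS present and its value is already in the index;
// documents missing the field bypass the index entirely.
function makeSparseUniqueIndex(field) {
  const seen = new Set();
  return {
    tryInsert(doc) {
      if (!(field in doc)) return true;       // sparse: field absent, not indexed
      if (seen.has(doc[field])) return false; // unique: duplicate value rejected
      seen.add(doc[field]);
      return true;
    },
  };
}

const idx = makeSparseUniqueIndex("email");
console.log(idx.tryInsert({ email: "[email protected]" })); // true
console.log(idx.tryInsert({ email: "[email protected]" })); // false (duplicate)
console.log(idx.tryInsert({ name: "no email" }));      // true (field absent)
```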
Hands ON!!!!!
Example: Mongo Document
{
  name: 'Brad Steve',
  address: {
    street: 'Oak Terrace',
    city: 'Denton'
  }
}
Example: Mongo Collection
{
  "_id": ObjectId("4efa8d2b7d284dad101e4bc9"),  // obligatory, and automatically generated by MongoDB
  "Last Name": "DUMONT",
  "First Name": "Jean",
  "Date of Birth": "01-22-1963"
},
{
"_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
"Last Name": "PELLERIN",
"First Name": "Franck",
"Date of Birth": "09-19-1983",
"Address": "1 chemin des Loges",
"City": "VERSAILLES"
}
Sample!
BLOG
A blog post has an author, some text, and many comments
The comments are unique per post, but one author has many posts
How would you design this in SQL?
Blog – BAD Design
Collections for posts, authors, and comments
References by manually created ID
post = {
  id: 150,
  author: 100,
  text: 'This is a pretty awesome post.',
  comments: [100, 105, 112]
}

author = {
  id: 100,
  name: 'Michael Arrington',
  posts: [150]
}

comment = {
  id: 105,
  text: 'Whatever this is good comment'
}
Sample: Better Design
Collection for posts
Embed comments, author name
post = {
  author: 'Michael Arrington',
  text: 'This is a pretty awesome post.',
  comments: [ 'Whatever this post sux.', 'I agree, lame!' ]
}
Installation
CRUD Operations
• Create
  • db.collection.insert( <document> )
  • db.collection.save( <document> )
  • db.collection.update( <query>, <update>, { upsert: true } )
• Read
  • db.collection.find( <query>, <projection> )
  • db.collection.findOne( <query>, <projection> )
• Update
  • db.collection.update( <query>, <update>, <options> )
• Delete
  • db.collection.remove( <query>, <justOne> )
‘collection’ specifies the collection, or the ‘table’, in which to store the document.
Create Operations
db.collection specifies the collection, or the ‘table’, in which to store the document.
• db.collection_name.insert( <document> )
  • Omit the _id field to have MongoDB generate a unique key
  • Example: db.parts.insert( { type: "screwdriver", quantity: 15 } )
  • Example: db.parts.insert( { _id: 10, type: "hammer", quantity: 1 } )
• db.collection_name.update( <query>, <update>, { upsert: true } )
  • Will update one or more records in a collection satisfying the query
• db.collection_name.save( <document> )
  • Updates an existing record or creates a new record
Read Operations
• db.collection.find( <query>, <projection> )
  • Provides functionality similar to the SELECT command
  • <query> is the where condition; <projection> lists the fields in the result set
  • Example: var partsCursor = db.parts.find( { type: "hammer" } ).limit(5)
  • Returns a cursor to handle the result set
  • The cursor can be modified to impose limits, skips, and sort orders
  • Can specify to return the ‘top’ number of records from the result set
• db.collection.findOne( <query>, <projection> )
Query Operators
Name	Description
$eq	Matches values that are equal to a specified value
$gt, $gte	Matches values that are greater than (or equal to) a specified value
$lt, $lte	Matches values that are less than (or equal to) a specified value
$ne	Matches values that are not equal to a specified value
$in	Matches any of the values specified in an array
$nin	Matches none of the values specified in an array
$or	Joins query clauses with a logical OR; returns documents that match either clause
$and	Joins query clauses with a logical AND
$not	Inverts the effect of a query expression
$nor	Joins query clauses with a logical NOR
$exists	Matches documents that have a specified field
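To make the comparison operators concrete, here is a toy evaluator in plain JavaScript. It illustrates the semantics of a few operators from the table; it is not MongoDB's implementation, and the sample documents are made up.

```javascript
// Evaluate a condition object like { $gte: 10 } against a single value.
function matches(value, cond) {
  for (const [op, arg] of Object.entries(cond)) {
    switch (op) {
      case "$eq":  if (value !== arg) return false; break;
      case "$ne":  if (value === arg) return false; break;
      case "$gt":  if (!(value > arg)) return false; break;
      case "$gte": if (!(value >= arg)) return false; break;
      case "$lt":  if (!(value < arg)) return false; break;
      case "$lte": if (!(value <= arg)) return false; break;
      case "$in":  if (!arg.includes(value)) return false; break;
      case "$nin": if (arg.includes(value)) return false; break;
      default: throw new Error("unsupported operator: " + op);
    }
  }
  return true;
}

const parts = [
  { type: "hammer", quantity: 1 },
  { type: "screwdriver", quantity: 15 },
];

// Analogue of db.parts.find( { quantity: { $gte: 10 } } )
const big = parts.filter(p => matches(p.quantity, { $gte: 10 }));
console.log(big); // only the screwdriver qualifies
```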
Update Operations
• db.collection_name.insert( <document> )
  • Omit the _id field to have MongoDB generate a unique key
  • Example: db.parts.insert( { type: "screwdriver", quantity: 15 } )
  • Example: db.parts.insert( { _id: 10, type: "hammer", quantity: 1 } )
• db.collection_name.save( <document> )
  • Updates an existing record or creates a new record
• db.collection_name.update( <query>, <update>, { upsert: true } )
  • Will update one or more records in a collection satisfying the query
• db.collection_name.findAndModify( <query>, <sort>, <update>, <new>, <fields>, <upsert> )
  • Modifies existing record(s) and can retrieve either the old or the new version of the record
Delete Operations
• db.collection_name.remove( <query>, <justOne> )
  • Deletes all records from a collection, or those matching a criterion
  • <justOne> specifies to delete only one record matching the criterion
  • Example: db.parts.remove( { type: /^h/ } ) removes all parts whose type starts with h
  • db.parts.remove() deletes all documents in the parts collection
SQL vs. MongoDB entities

MySQL:
START TRANSACTION;
INSERT INTO contacts VALUES (NULL, 'joeblow');
INSERT INTO contact_emails VALUES
  ( NULL, '[email protected]', LAST_INSERT_ID() ),
  ( NULL, '[email protected]', LAST_INSERT_ID() );
COMMIT;

MongoDB:
db.contacts.save( {
  userName: 'joeblow',
  emailAddresses: [ '[email protected]', '[email protected]' ]
} );

MongoDB separates physical structure from logical structure.
Designed to deal with large & distributed data.
Aggregation
Aggregation Framework Operators
$project
$match
$limit
$skip
$sort
$unwind
$group
…
$match
Filter documents
Uses existing query syntax
If using $geoNear, it has to be first in the pipeline
$where is not supported
Matching Field Values
Sample documents:
{
  "_id" : 271421,
  "amenity" : "pub",
  "name" : "Sir Walter Tyrrell",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.6192422, 50.9131996 ]
  }
}
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
A $match stage filtering on a field value (here, the pub named "The Red Lion") passes through only the matching document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
$project
Reshape documents
Include, exclude or rename fields
Inject computed fields
Create sub-document fields
Including and Excluding Fields
Input document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
Projection stage:
{ "$project": {
  "_id": 0,
  "amenity": 1,
  "name": 1
}}
Result:
{
  "amenity" : "pub",
  "name" : "The Red Lion"
}
Reformatting Documents
Input document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
Projection stage:
{ "$project": {
  "_id": 0,
  "name": 1,
  "meta": { "type": "$amenity" }
}}
Result:
{
  "name" : "The Red Lion",
  "meta" : { "type" : "pub" }
}
$group
• Group documents by an _id
  • Field reference, object, constant
• Other output fields are computed
  • $max, $min, $avg, $sum
  • $addToSet, $push, $first, $last
• Processes all data in memory
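The $match and $group stages above can be mimicked in plain JavaScript to show what each contributes. This is an analogue of the pipeline semantics, not the real engine; the documents and the city field are made up for illustration.

```javascript
// Plain-JS analogue of a $match -> $group pipeline (not the real engine).
const pubs = [
  { amenity: "pub",  city: "Lyndhurst", name: "Sir Walter Tyrrell" },
  { amenity: "pub",  city: "Boldre",    name: "The Red Lion" },
  { amenity: "cafe", city: "Boldre",    name: "The Tea Room" },
];

// $match: { amenity: "pub" } keeps only the pubs.
const matched = pubs.filter(d => d.amenity === "pub");

// $group: { _id: "$city", count: { $sum: 1 } } counts pubs per city.
const grouped = {};
for (const d of matched) {
  grouped[d.city] = (grouped[d.city] || 0) + 1;
}

console.log(grouped); // { Lyndhurst: 1, Boldre: 1 }
```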
Aggregation Framework Benefits
Real-time
Simple yet powerful interface
Declared in JSON, executes in C++
Runs inside MongoDB on local data
Limitations:
− Adds load to your DB
− Limited operators
− Data output is limited