NoSQL D
▪ NoSQL is an umbrella term for all databases and data stores that don’t follow the
RDBMS principles
▪ A class of products
▪ A collection of several (related) concepts about data storage and manipulation
▪ Often related to large data sets
▪ Non-relational DBMSs are not new
▪ But NoSQL represents a new incarnation
▪ Due to massively scalable Internet applications
▪ Based on distributed and parallel computing
▪ Development
▪ Starts with Google
▪ First research paper published in 2003
▪ Continues thanks to Lucene's developers/Apache (Hadoop) and Amazon (Dynamo)
▪ Then many products and much interest came from Facebook, Netflix, Yahoo, eBay, Hulu, IBM, and many more
▪ Three major papers were the seeds of the NoSQL movement
▪ BigTable (Google)
▪ Dynamo (Amazon)
▪ Distributed key-value data store
▪ Eventual consistency
▪ CAP Theorem
▪ NoSQL comes from the Internet, thus it is often related to the “big data” concept
▪ How big is “big data”?
▪ Over a few terabytes: enough to start spanning multiple storage units
▪ Challenges
▪ Efficiently storing and accessing large amounts of data is difficult, even more so when considering fault tolerance and backups
▪ Manipulating large data sets involves running massively parallel processes
▪ Managing continuously evolving schema and metadata for semi-structured and unstructured data is difficult
▪ Explosion of social media sites (Facebook, Twitter) with large
data needs
▪ Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
▪ Just as programming moved to dynamically-typed languages (Python, Ruby, Groovy), data is shifting to dynamically-typed structures with frequent schema changes
▪ Open-source community
▪ The context is the Internet
▪ RDBMSs assume that data are
▪ Dense
▪ Largely uniform (structured data)
▪ With massive sparse data sets, the typical storage mechanisms and access methods
get stretched
▪ Large data volumes
▪ Google’s “big data”
▪ Scalable replication and distribution
▪ Potentially thousands of machines
▪ Potentially distributed around the world
▪ Asynchronous Inserts & Updates
▪ Schema-less
▪ ACID transaction properties are not needed: BASE
▪ CAP Theorem
▪ Open source development
▪ Key-value stores
▪ Are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value.
▪ Graph databases
▪ Are used to store information about networks of data, such as social connections.
▪ Documents
▪ Loosely structured sets of key/value pairs in documents, e.g., XML, JSON
▪ Encapsulate and encode data in some standard formats or encodings
▪ Are addressed in the database via a unique key
▪ Documents are treated as a whole, avoiding splitting a document into its constituent
name/value pairs
▪ Allow documents to be retrieved by key or by content
▪ Notable for:
▪ MongoDB (used in Foursquare, GitHub, and more)
▪ CouchDB (used in Apple, BBC, Canonical, CERN, and more)
▪ The central concept is the notion of a “document”, which roughly corresponds to a row in an RDBMS.
▪ A document comes in a standard format such as JSON (or BSON, its binary encoding).
▪ Documents are addressed in the database via a unique key that represents
that document.
▪ The database offers an API or query language that retrieves documents
based on their contents.
▪ Documents are schema free, i.e., different documents can have structures
and schema that differ from one another. (An RDBMS requires that each row
contain the same columns.)
{
_id: ObjectId("51156a1e056d6f966f268f81"),
type: "Article",
author: "Derick Rethans",
title: "Introduction to Document Databases with MongoDB",
date: ISODate("2013-04-24T16:26:31.911Z"),
body: "This arti…"
},
{
_id: ObjectId("51156a1e056d6f966f268f82"),
type: "Book",
author: "Derick Rethans",
title: "php|architect's Guide to Date and Time Programming with PHP",
isbn: "978-0-9738621-5-7"
}
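The document-store ideas above (a unique key per document, retrieval by key or by content, no fixed schema) can be sketched in a few lines of Python. This is a minimal illustration that assumes a plain in-memory dictionary stands in for the database. It is not MongoDB's API: `TinyDocumentStore`, `insert`, `get` and `find` are hypothetical names, and the two sample documents are abbreviated from the example above.

```python
class TinyDocumentStore:
    """Toy in-memory document store (illustrative only, not a real database API)."""

    def __init__(self):
        self._docs = {}  # maps unique key -> document (a dict)

    def insert(self, doc):
        # Each document is addressed by its unique "_id" key.
        self._docs[doc["_id"]] = doc

    def get(self, key):
        # Retrieval by key returns the document as a whole (or None).
        return self._docs.get(key)

    def find(self, **criteria):
        # Retrieval by content: keep documents whose fields match all criteria.
        # Documents lacking a field simply do not match; no schema is required.
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = TinyDocumentStore()
store.insert({"_id": "51156a1e056d6f966f268f81", "type": "Article",
              "author": "Derick Rethans",
              "title": "Introduction to Document Databases with MongoDB"})
store.insert({"_id": "51156a1e056d6f966f268f82", "type": "Book",
              "author": "Derick Rethans",
              "isbn": "978-0-9738621-5-7"})

by_author = store.find(author="Derick Rethans")  # both documents match
books = store.find(type="Book")                  # only the second one matches
```

Note that the second document has an `isbn` field the first one lacks, and nothing in the store has to be changed to accept it.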
▪ Key-value stores
▪ Store data in a schema-less way
▪ Store data as maps
▪ HashMaps or associative arrays
▪ Provide very efficient average-case access time (a hash lookup is O(1) on average)
▪ Notable for:
▪ Couchbase (Zynga, Vimeo, NAVTEQ, ...)
▪ Redis (Craigslist, Instagram, Stack Overflow, Flickr, ...)
▪ Amazon Dynamo (Amazon, Elsevier, IMDb, ...)
▪ Apache Cassandra (Facebook, Digg, Reddit, Twitter, ...)
▪ Voldemort (LinkedIn, eBay, …)
▪ Riak (GitHub, Comcast, Mochi, ...)
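The "data as maps" idea can be sketched with a Python dictionary, which is itself a hash map. This is only an illustration under that assumption: `TinyKeyValueStore` and its methods are hypothetical names and do not correspond to the API of Redis, Dynamo or any product listed above.

```python
class TinyKeyValueStore:
    """Toy key-value store backed by a hash map (a Python dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store never inspects the value, so no schema is enforced.
        self._data[key] = value

    def get(self, key, default=None):
        # Hash-map lookup: constant time on average.
        return self._data.get(key, default)

    def delete(self, key):
        # Returns True if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


kv = TinyKeyValueStore()
kv.put("user:42:name", "Ada")            # a string value
kv.put("user:42:cart", ["book", "pen"])  # a list value under another key
kv.put("page:home:hits", 1017)           # an integer value
```

Values of different shapes live side by side because the only thing the store knows about an item is its key.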
▪ Data are stored in a column-oriented way
▪ Data efficiently stored
▪ Avoids consuming space for storing nulls
▪ Columns are grouped in column-families
▪ Data isn’t stored as a single table but is stored by column families
▪ Unit of data is a set of key/value pairs
▪ Identified by “row-key”
▪ Ordered and sorted based on row-key
▪ Notable for:
▪ Google's Bigtable (used in many Google services)
▪ HBase (Facebook, StumbleUpon, Hulu, Yahoo!, ...)
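A rough sketch of the column-family layout described above, assuming nested Python dictionaries as the storage: data is kept per column family, each row is a set of key/value pairs identified by a row-key, missing cells take no space, and scans return rows in row-key order. `TinyColumnFamilyStore` and the family and column names are hypothetical, and the sketch sorts keys at scan time, whereas a real store keeps them sorted on disk.

```python
class TinyColumnFamilyStore:
    """Toy column-family store: family -> row-key -> {column: value}."""

    def __init__(self, families):
        # Column families are fixed up front; columns inside them are not.
        self._data = {family: {} for family in families}

    def put(self, row_key, family, column, value):
        if family not in self._data:
            raise KeyError(f"unknown column family: {family}")
        # Only cells that are actually written occupy space (no stored nulls).
        self._data[family].setdefault(row_key, {})[column] = value

    def get_row(self, row_key):
        # Assemble the row from every family that holds cells for this key.
        return {
            family: dict(rows[row_key])
            for family, rows in self._data.items()
            if row_key in rows
        }

    def scan(self, start, stop):
        # Return rows with start <= row-key < stop, ordered by row-key.
        keys = set()
        for rows in self._data.values():
            keys.update(rows)
        return [(k, self.get_row(k)) for k in sorted(keys) if start <= k < stop]


table = TinyColumnFamilyStore(["name", "location"])
table.put("row2", "name", "first", "Jane")
table.put("row1", "name", "first", "John")
table.put("row1", "name", "last", "Doe")
table.put("row1", "location", "zip", "10001")

ordered_keys = [key for key, _ in table.scan("row1", "row3")]
```

Here `row2` has no `location` cells at all, so that family stores nothing for it.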
▪ Graph-oriented
▪ Everything is stored as an edge, a node or an attribute.
▪ Each node and edge can have any number of attributes.
▪ Both the nodes and edges can be labelled.
▪ Labels can be used to narrow searches.
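The graph model above can be sketched as follows, assuming a simple in-memory structure: nodes and edges both carry a label and an arbitrary set of attributes, and labels are used to narrow a search. `TinyGraph`, its methods and the sample data are hypothetical and do not reflect any particular graph database's API.

```python
class TinyGraph:
    """Toy labelled property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": str, "attrs": dict}
        self.edges = []  # list of (source id, target id, label, attrs)

    def add_node(self, node_id, label, **attrs):
        # A node can have any number of attributes.
        self.nodes[node_id] = {"label": label, "attrs": attrs}

    def add_edge(self, source, target, label, **attrs):
        # Edges are labelled too and can also carry attributes.
        self.edges.append((source, target, label, attrs))

    def neighbours(self, node_id, edge_label=None, node_label=None):
        # Follow outgoing edges, optionally narrowing by edge or node label.
        result = []
        for source, target, label, _attrs in self.edges:
            if source != node_id:
                continue
            if edge_label is not None and label != edge_label:
                continue
            if node_label is not None and self.nodes[target]["label"] != node_label:
                continue
            result.append(target)
        return result


g = TinyGraph()
g.add_node("alice", "Person", age=30)
g.add_node("bob", "Person")
g.add_node("acme", "Company")
g.add_edge("alice", "bob", "FRIEND_OF", since=2010)
g.add_edge("alice", "acme", "WORKS_AT")

friends = g.neighbours("alice", edge_label="FRIEND_OF")
employers = g.neighbours("alice", node_label="Company")
```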
▪ Issues with scaling up when the dataset is just too big
▪ RDBMSs were not designed to be distributed
▪ Traditional DBMSs are designed to run best on a “single” machine
▪ Larger volumes of data/operations require upgrading the server with faster CPUs or more memory, known as ‘scaling up’ or ‘vertical scaling’
▪ NoSQL solutions are designed to run on clusters or multi-node database solutions
▪ Larger volumes of data/operations require adding more machines to the cluster, known as ‘scaling out’ or ‘horizontal scaling’
▪ Different approaches include:
▪ Master-slave
▪ Sharding (partitioning)
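Sharding can be illustrated with a minimal hash-partitioning sketch, in which each key is assigned to one of N shards by hashing it. The code assumes each shard is just a Python dictionary standing in for a separate machine. `shard_for` and `ShardedStore` are hypothetical names, and the choice of SHA-256 is only a convenient stable hash, not something the slides prescribe.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard index with a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy store that spreads keys over several shards (one dict per 'machine')."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        # Each key lives on exactly one shard, chosen by its hash.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # The same hash sends the read to the shard that holds the key.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user:{i}", i)

sizes = [len(shard) for shard in store.shards]  # how the 100 keys were spread
```

With plain modulo partitioning, changing the number of shards relocates most keys. Systems such as Dynamo use consistent hashing to limit that movement, which this sketch does not attempt.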
▪ RDBMSs are based on ACID (Atomicity, Consistency, Isolation, and Durability)
properties
▪ NoSQL
▪ Does not give importance to ACID properties
▪ In some cases completely ignores them
▪ In distributed parallel systems it is difficult/impossible to ensure ACID properties
▪ Long-running transactions don't work because keeping resources blocked for a
long time is not practical
▪ BASE: an acronym contrived to be the opposite of ACID
▪ Basically Available,
▪ Soft state,
▪ Eventually Consistent
▪ Characteristics
▪ Weak consistency – stale data OK
▪ Availability first
▪ Best effort
▪ Approximate answers OK
▪ Aggressive (optimistic)
▪ Simpler and faster
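The "stale data OK" and "eventually consistent" characteristics can be sketched with one primary and one replica, assuming asynchronous replication is modelled as a queue of pending writes that is drained later. `Primary`, `Replica` and `replicate` are hypothetical names, and real systems ship writes over a network rather than through an in-process queue.

```python
from collections import deque


class Replica:
    """Read-only copy that lags behind the primary."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Accepts writes immediately and ships them to replicas later."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self._pending = deque()  # writes not yet applied on the replicas

    def write(self, key, value):
        # Availability first: acknowledge without waiting for the replicas.
        self.data[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Apply queued writes in order; afterwards the replicas have caught up.
        while self._pending:
            key, value = self._pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replica = Replica()
primary = Primary([replica])

primary.write("likes", 10)
stale = replica.read("likes")   # None: the replica has not seen the write yet
primary.replicate()
fresh = replica.read("likes")   # 10: the replica is now consistent
```

Between `write` and `replicate` the system is in the "soft state" that BASE tolerates: a reader may get an old answer, but not forever.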
A coherent and logical way to assess the problems involved in ensuring ACID-like guarantees in distributed systems is provided by the CAP theorem
At most two of the following three properties can be guaranteed at the same time
▪ Consistency
▪ Each client has the same view of the data
▪ Availability
▪ Each client can always read and write
▪ Partition tolerance
▪ System keeps working despite network partitions between physically distributed nodes
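The trade-off among the three properties can be made concrete with a deliberately simplified two-node sketch. It assumes a cluster that, while partitioned, either refuses writes (giving up availability to stay consistent, labelled "CP" here) or accepts them on one side only (staying available but letting the nodes diverge, labelled "AP"). `TwoNodeCluster` and its `mode` flag are hypothetical, and real systems rely on quorums and conflict resolution rather than a boolean switch.

```python
class Node:
    def __init__(self):
        self.data = {}


class TwoNodeCluster:
    """Toy cluster showing the choice a system faces during a partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.nodes = [Node(), Node()]
        self.partitioned = False  # True when the nodes cannot talk to each other

    def write(self, node_index, key, value):
        if not self.partitioned:
            # Healthy network: update both nodes, so every client sees the same data.
            for node in self.nodes:
                node.data[key] = value
            return True
        if self.mode == "CP":
            # Keep consistency: refuse the write, so the system is not available.
            return False
        # Keep availability: accept the write locally, so the nodes now disagree.
        self.nodes[node_index].data[key] = value
        return True

    def read(self, node_index, key):
        return self.nodes[node_index].data.get(key)


cp = TwoNodeCluster("CP")
cp.write(0, "x", 1)
cp.partitioned = True
cp_accepted = cp.write(0, "x", 2)   # False: the write is rejected

ap = TwoNodeCluster("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap_accepted = ap.write(0, "x", 2)   # True: the write is accepted on node 0 only
```

After the partition, both `cp` nodes still return 1 for `x`, while the `ap` nodes return 2 and 1 respectively.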
▪ CAP theorem: at most two of the three properties can be guaranteed at once
▪ The choices could be as follows: