Unit5_Notes_Short_DB

The lecture covers NOSQL databases and their role in managing big data, highlighting their characteristics such as scalability, availability, and lack of required schemas. It discusses various types of NOSQL systems, including document-based, key-value stores, and graph databases, along with examples like MongoDB and DynamoDB. The document also explains replication models, the CAP theorem, and the importance of sharding for load balancing in distributed systems.
Lecture: NOSQL Databases and Big Data Storage Systems
Readings: Chapter 24, Fundamentals of Database Systems, Seventh Edition (R. Elmasri, S. Navathe).

NOSQL Databases and Big Data Storage Systems


NOSQL: not only SQL
most NOSQL systems are distributed databases or distributed storage systems that focus on semi-structured data storage, high performance, availability, data replication, and scalability
why NOSQL?
SQL systems offer too many services (powerful query language, concurrency control, etc.) that some applications do not need
a structured data model may be too restrictive
relational systems require a schema; NOSQL systems do not
NOSQL systems focus on storage of ‘big data’
typical applications that use NOSQL
social media, web links, user profiles, marketing and sales, posts and tweets, road maps and spatial data, email, etc.
Examples
DynamoDB (Amazon): key-value data store
BigTable: Google’s proprietary NOSQL system. Column-based or wide column store
Cassandra (Facebook): uses concepts from both key-value stores and column-based systems
MongoDB and CouchDB: document stores
Neo4J and GraphBase: graph-based NOSQL systems
OrientDB: combines several concepts
NOSQL characteristics
scalability
horizontal scalability (adding more nodes) is employed while the system is operational, so techniques for distributing the existing data among new nodes without interrupting system operation are necessary
availability, replication, and eventual consistency
many applications require continuous system availability, so data is replicated over multiple nodes in a transparent manner
if one node fails, the data is still available on other nodes
replication improves data availability and read performance; however, writes become more expensive because every copy of a replicated data item must be updated
this can slow down writes considerably if serializable consistency is required, so a more relaxed form of consistency known as eventual consistency is often used
sharding of files
NOSQL applications can have millions of records, and these records can be accessed concurrently by thousands of users, so it is not practical to store the whole file on one node
sharding (also known as horizontal partitioning) of the file records is employed
this distributes the load of accessing the file records across multiple nodes
combining sharding with replication of the shards improves both load balancing and data availability
schema not required
semi-structured, self-describing data makes it possible to operate without a schema
because there is no schema, any constraints on the data have to be programmed in the application
languages for describing semi-structured data include
JSON (JavaScript Object Notation)
XML (Extensible Markup Language)
less powerful query languages
a powerful query language such as SQL may not be required, because search (read) queries often locate a single object in a single file based on its object key
in many cases, the operations are called CRUD operations (create, read, update, delete)
only a subset of SQL querying capabilities is provided (many NOSQL systems do not provide join operations)
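A minimal sketch of these characteristics, assuming a hypothetical in-memory store (`DocumentStore` and `require_email` are illustrative names, not from the reading). Records are self-describing JSON objects, access is by object key through CRUD operations, and the one constraint is programmed in the application because no schema enforces it.

```python
import json


class DocumentStore:
    """Hypothetical in-memory store: one 'file' of self-describing records."""

    def __init__(self):
        self._records = {}  # object key -> JSON text

    def create(self, key, doc):
        if key in self._records:
            raise KeyError(f"duplicate key: {key}")
        self._records[key] = json.dumps(doc)

    def read(self, key):
        return json.loads(self._records[key])

    def update(self, key, changes):
        doc = self.read(key)
        doc.update(changes)
        self._records[key] = json.dumps(doc)

    def delete(self, key):
        del self._records[key]


def require_email(doc):
    """Constraint enforced by the application, since no schema does it."""
    if "email" not in doc:
        raise ValueError("document must have an 'email' field")
    return doc


store = DocumentStore()
# Two records in the same file with different fields: no schema is declared.
store.create("u1", require_email({"name": "Ana", "email": "ana@example.org"}))
store.create("u2", require_email({"email": "bo@example.org", "tags": ["admin"]}))
store.update("u1", {"city": "Lima"})
store.delete("u2")
print(store.read("u1"))
```

Every operation locates a record by its key; there is no join and no query over field values, which is the reduced query power the notes describe.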
replication models
master-slave
requires one copy to be the master copy
all write operations must be applied to the master copy and then propagated to the slave copies
usually uses eventual consistency (the slave copies will eventually be the same as the master copy)
master-master replication
allows reads and writes at any of the replicas
may not guarantee that reads at nodes that store different copies see the same values
different users may write the same data item concurrently at different nodes of the system (so the values of the item will be temporarily inconsistent)
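The master-slave model with eventual consistency can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the classes `Replica` and `MasterSlaveCluster` are hypothetical, everything runs in one process, and propagation is triggered by an explicit call rather than by a network protocol.

```python
class Replica:
    def __init__(self):
        self.data = {}


class MasterSlaveCluster:
    """Hypothetical model: writes go to the master and propagate lazily."""

    def __init__(self, n_slaves):
        self.master = Replica()
        self.slaves = [Replica() for _ in range(n_slaves)]
        self.pending = []  # writes not yet applied to the slaves

    def write(self, key, value):
        # All writes are applied to the master copy first.
        self.master.data[key] = value
        self.pending.append((key, value))

    def read_from_slave(self, i, key):
        return self.slaves[i].data.get(key)

    def propagate(self):
        """Apply queued writes to every slave, in the order the master saw them."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()


cluster = MasterSlaveCluster(n_slaves=2)
cluster.write("x", 1)
stale = cluster.read_from_slave(0, "x")  # slave has not seen the write yet
cluster.propagate()
fresh = cluster.read_from_slave(0, "x")  # slave now matches the master
print(stale, fresh)
```

Between `write` and `propagate` a read at a slave can return an old value; once propagation completes, every copy agrees with the master.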
categories of NOSQL systems
document-based NOSQL systems: documents are accessible via their document id, but can also be accessed rapidly using other indexes
NOSQL key-value stores: simple data model based on fast access by the key to the value associated with the key (hashing)
graph-based NOSQL systems: data is represented as graphs, and related nodes can be found by traversing the edges
column-based or wide column NOSQL systems: a table is partitioned by column into column families, where each column family is stored in its own files
hybrid NOSQL systems: these systems have characteristics from two or more of the above four categories
consistency
various levels of consistency exist among replicated data items (enforcing serializability is the strongest form of consistency)
ACID properties
atomicity: transaction performed in its entirety or not at all
consistency preservation: takes database from one consistent state to another
isolation: not interfered with by other transactions
durability or permanency: changes must persist in the database
strong consistency has high overhead: it can reduce the performance of read and write operations (especially on replicated NOSQL systems)
the CAP theorem
the CAP theorem refers to three desirable properties of distributed systems with replicated data
consistency: all replicated copies hold the same value (consider a variable X1 replicated 4 times and updated concurrently by 6 users)
availability: every request receives a non-error response (without a guarantee that it reflects the most recent write)
partition tolerance: the system continues to operate despite messages being lost by the network between nodes
it is not possible to guarantee all three simultaneously in a distributed system with data replication
a weaker consistency level is often acceptable in a NOSQL distributed data store (eventual consistency is often adopted)
guaranteeing availability and partition tolerance is more important
with eventual consistency, if no new updates are made, eventually all accesses to an item will return the last updated value

MongoDB
collections of similar documents
individual documents resemble complex objects or XML documents
documents are self-describing
can have different data elements
documents can be specified in various formats: XML, JSON
MongoDB supports CRUD operations
documents stored in binary JSON (BSON) format
individual documents stored in a collection
each document in a collection has a unique identifier field called _id (an ObjectId by default)
a collection does not have a schema
structure of the data fields in documents is chosen based on how the documents will be accessed
user can choose normalized or denormalized design
replication
uses the concept of a replica set to create multiple copies of the same data on different nodes
variation of the master-slave approach
a replica set will have one primary copy of a collection C stored in one node N1, and at least one secondary copy (replica) of C stored at another node N2
a replica set can contain a primary copy, secondary copies, and an arbiter
the arbiter participates in elections to select a new primary if needed
all write operations are applied to the primary copy and propagated to the secondaries
the user can choose a read preference: by default reads are processed at the primary, but the preference can be set so that read requests can be processed at any replica (a read at a secondary is not guaranteed to return the latest version)
sharding
horizontal partitioning divides the documents into disjoint partitions (shards)
allows adding more nodes as needed
shards stored on different nodes to achieve load balancing
the partitioning field (shard key) must exist in every document in the collection and must have an index
range partitioning
creates chunks by specifying a range of key values
works best with range queries
hash partitioning
partitions documents based on the hash value of each shard key
applies a hash function h(K) to each shard key K to determine the shard
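The two partitioning strategies can be compared with a short sketch. It is a simplified model, not MongoDB's actual chunk mechanism: the shard count `NUM_SHARDS`, the split points `RANGE_BOUNDS`, and the functions `range_shard` and `hash_shard` are all assumptions made for illustration, and integer shard keys are assumed.

```python
import bisect
import hashlib

NUM_SHARDS = 3

# Range partitioning: shard 0 holds keys < 100, shard 1 holds 100..199,
# shard 2 holds keys >= 200 (assumed split points).
RANGE_BOUNDS = [100, 200]


def range_shard(key):
    """Pick the shard whose key range contains the shard key."""
    return bisect.bisect_right(RANGE_BOUNDS, key)


def hash_shard(key):
    """h(K): a stable hash of the shard key, reduced modulo the shard count."""
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


keys = list(range(120, 130))
# A range query over these adjacent keys touches a single shard.
print({k: range_shard(k) for k in keys})
# Hashing ignores key order, so the same keys may land on different shards.
print({k: hash_shard(k) for k in keys})
```

Range partitioning keeps neighbouring keys together, which is why it suits range queries; hashing gives up that locality in exchange for a placement that does not depend on how key values are distributed.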

NOSQL Key-Value Stores


key-value stores focus on high performance, availability, and scalability
can store structured, unstructured, or semistructured data
key: unique identifier associated with a data item (used for fast retrieval)
value: the data item itself (can be string or array of bytes)
in many cases there is no query language, only a set of operations (e.g. get, put, delete)
DynamoDB
DynamoDB is part of Amazon's Web Services/SDK platforms (proprietary)
a table holds a collection of self-describing items
an item consists of attribute-value pairs (analogous to a record or tuple)
attribute values can be single or multi-valued
the primary key is used to locate items within a table
it can be a single attribute or a pair of attributes
when the primary key is a pair of attributes (A, B): attribute A is used for hashing, and because there can be multiple items with the same value of A, the B values are used for ordering the items with the same A value
a table with this type of key can have additional secondary indexes defined
examples of other key-value stores
Oracle key-value store: Oracle NOSQL Database
Redis key-value cache and store
caches data in main memory to improve performance
offers master-slave replication and high availability
offers persistence by backing up cache to disk
Apache Cassandra (used by Facebook and others)
offers features from several NOSQL categories
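The DynamoDB primary key made of a pair of attributes (A, B), described earlier in this section, can be sketched as follows. This is an in-memory illustration only: `HashRangeTable` and its methods are hypothetical and do not correspond to the DynamoDB API.

```python
import bisect
from collections import defaultdict


class HashRangeTable:
    """Hypothetical table keyed by (A, B): hash on A, items ordered by B."""

    def __init__(self):
        self._partitions = defaultdict(list)  # A -> list of (B, item) sorted by B

    def put(self, a, b, item):
        rows = self._partitions[a]  # hashing on A selects the group of items
        keys = [row_b for row_b, _ in rows]
        i = bisect.bisect_left(keys, b)
        if i < len(rows) and rows[i][0] == b:
            rows[i] = (b, item)        # same (A, B): overwrite the item
        else:
            rows.insert(i, (b, item))  # keep items ordered by B

    def get(self, a, b):
        for row_b, item in self._partitions.get(a, []):
            if row_b == b:
                return item
        return None

    def query(self, a, b_low, b_high):
        """All items with hash key a and b_low <= B <= b_high, in B order."""
        return [item for row_b, item in self._partitions.get(a, [])
                if b_low <= row_b <= b_high]


table = HashRangeTable()
table.put("user42", "2024-03-01", {"total": 30})
table.put("user42", "2024-01-15", {"total": 12})
table.put("user7", "2024-02-02", {"total": 5})
orders = table.query("user42", "2024-01-01", "2024-12-31")
print(orders)
```

Here A is a user id and B is an order date (both invented for the example), so all orders of one user are found through a single hash lookup and come back already ordered by date.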

NOSQL Graph Databases and Neo4j


graph databases
data represented as a graph
collection of vertices (nodes) and edges
possible to store data associated with both individual nodes and individual edges
Neo4j
open source system
uses concepts of nodes and relationships
nodes can have labels: zero, one, or several
both nodes and relationships can have properties
each relationship has a start node, end node, and a relationship type
properties specified using a map pattern
creating nodes
CREATE command
part of high-level declarative query language Cypher
node label can be specified when node is created
properties are enclosed in curly brackets
path
traversal of part of the graph
typically used as part of a query to specify a pattern
schema optional in Neo4j
indexing and node identifiers
users can create indexes for the collection of nodes that have a particular label
one or more properties of the nodes can be indexed
a Cypher query is made up of clauses
the result from one clause can be the input to the next clause in the query
Neo4j has a graph visualization interface, so that a subset of the nodes and edges in a database graph can be displayed as a graph
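The data model described above (labelled nodes, typed relationships, properties on both, and path patterns) can be modelled in a few lines of Python. This is a sketch of the concepts only, not of Neo4j or Cypher: the classes `Node`, `Relationship`, and `Graph`, and the labels and property names in the example, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set  # zero, one, or several labels
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    end: Node
    rel_type: str
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = []
        self.relationships = []

    def create_node(self, labels=(), **properties):
        node = Node(set(labels), properties)
        self.nodes.append(node)
        return node

    def create_relationship(self, start, rel_type, end, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def match(self, start_label, rel_type, end_label):
        """Return (start, end) node pairs connected by a one-edge path
        from a start_label node to an end_label node via rel_type."""
        return [(r.start, r.end) for r in self.relationships
                if r.rel_type == rel_type
                and start_label in r.start.labels
                and end_label in r.end.labels]


g = Graph()
alice = g.create_node(["EMPLOYEE"], name="Alice")
dept = g.create_node(["DEPARTMENT"], name="Research")
unlabelled = g.create_node()  # a node with zero labels
g.create_relationship(alice, "WorksFor", dept, since=2020)
pairs = g.match("EMPLOYEE", "WorksFor", "DEPARTMENT")
names = [(s.properties["name"], e.properties["name"]) for s, e in pairs]
print(names)
```

The `match` method plays the role of a path pattern in a query: it names a start label, a relationship type, and an end label, and returns the node pairs that fit. A real graph database would also use indexes on labels and properties instead of scanning every relationship.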

knowledge is maintained by diegocasmo.

