CT113H Lecture 1 - Introduction To NoSQL
● Everything is in the cloud
○ flexibility and distributed nature of the systems
Agenda
● Current trends in data management & computing
● Big Data
● Relational vs. NoSQL databases
○ the value of relational databases
○ new requirements
○ NoSQL features, strengths and challenges
● Types of NoSQL databases
○ key-value stores, document databases,
column-family databases, graph databases
○ principles and examples
Big Data
“Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” (Gartner, 2012)
Sources of Big Data
● Social networks
○ this data is huge, but the volumes are relatively limited
● Logs of various web/email servers or routers
○ growing beyond limits
● Sensor networks
○ this sector is expected to grow even faster
● Internet of things (IoT)
● Computer-driven machines, like airplanes:
○ one overseas flight of a Boeing aircraft reportedly generates 640 TB of data
● etc.
Processing (Traditional) Data
● OLTP: Online Transaction Processing
○ Standard databases (DBMSs) and database applications
○ Storing, querying, multi-user access
● OLAP: Online Analytical Processing (Warehousing)
○ Answer multi-dimensional analytical queries
○ Financial/marketing reporting, budgeting, forecasting, …
● RTAP: Real-Time Analytic Processing
(Big Data Architecture & Technology)
○ Data gathered & processed in real-time (streaming)
○ Real-time and historical data combined
Technologies for Big Data
● Distributed file systems (GFS, HDFS, etc.)
● MapReduce
○ and other models for distributed programming
● NoSQL databases
● Data Warehouses
● Grid computing, cloud computing
● Large-scale machine learning
Relational Database Management Systems
● RDBMSs are the predominant database technology
○ the relational model was first defined in 1970 by Edgar Codd of IBM's research lab
● Data modeled as relations (tables)
○ object = tuple of attribute values
■ each attribute has a certain domain
○ a table is a set of objects (tuples, rows) of the same type
■ a relation is a subset of the Cartesian product of the attribute domains
○ each tuple identified by a key
■ field (or a set of fields) that uniquely identifies a row
○ tables and objects “interconnected” via foreign keys
● Relational calculus, SQL query language
RDBMS Example
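The relational concepts listed above (tables, keys, foreign keys, SQL) can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the `department` and `employee` tables and their contents are hypothetical and are not the example from the original slide.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute("""
    CREATE TABLE department (
        id   INTEGER PRIMARY KEY,   -- key: uniquely identifies a row
        name TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE employee (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(id)  -- foreign key
    )""")

conn.executemany("INSERT INTO department VALUES (?, ?)",
                 [(1, "Research"), (2, "Sales")])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(10, "Alice", 1), (11, "Bob", 2), (12, "Carol", 1)])

# A query connects the two tables through the foreign key.
rows = conn.execute("""
    SELECT e.name, d.name
    FROM employee AS e JOIN department AS d ON e.dept_id = d.id
    ORDER BY e.id""").fetchall()

# A row that references a non-existent department is rejected.
try:
    conn.execute("INSERT INTO employee VALUES (13, 'Dave', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Each tuple in `rows` pairs an employee with the department that its foreign key points to.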
Trends and Requirements
● Trends: volume of data, cloud computing (IaaS)
○ requirement: real database scalability
■ massive database distribution
■ dynamic resource management
■ horizontally scaling systems
● Trend: velocity of data
○ requirement: frequent update operations
● Trend: big users
○ requirement: massive read throughput
● Trend: variety of data
○ requirement: flexible database schema
■ semi-structured data
RDBMS for Big Data
● relational schema
○ data in tuples
○ a priori known schema
○ but: current data are naturally flexible
● schema normalization
○ data split into tables (3NF)
○ queries merge the data
○ but: inefficient for large data
○ but: slow in a distributed environment
● transaction support
○ transaction management with ACID
○ Atomicity, Consistency, Isolation, Durability
○ safety first
○ but: full transactions are very inefficient in a distributed environment
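The "safety first" nature of ACID transactions can be seen in a minimal single-node sketch with Python's sqlite3 module. The `account` table, the `transfer` helper and the amounts are assumptions made for illustration. The point is atomicity: either both updates of a transfer take effect or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)  -- consistency rule
    )""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 30)        # both updates are applied
failed = transfer(conn, 1, 2, 500)   # debit violates the CHECK; the credit is undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

On one machine this is cheap. In a distributed setting the same guarantee requires coordination between nodes, which is the inefficiency the slide points to.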
NoSQL Databases
● What is “NoSQL”?
○ term used in the late 90s for a different type of technology:
Carlo Strozzi: http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/
○ “Not Only SQL”?
■ but many RDBMS are also “not just SQL”
http://basho.com/about/customers/
https://www.mongodb.com/who-uses-mongodb
http://planetcassandra.org/companies/
http://neo4j.com/customers/
The End of Relational Databases?
● Relational databases are not going away
○ they are ideal for a lot of structured data and are reliable, mature, etc.
● RDBMS became one option for data storage
Polyglot persistence – using different data stores in
different circumstances [Sadalage & Fowler: NoSQL Distilled, 2012]
Two trends:
1. NoSQL databases implement standard RDBMS features
2. RDBMS are adopting NoSQL principles
NoSQL Technologies
● MapReduce programming model
○ running over a distributed file system
● Key-value stores
● Document databases
● Column-family stores
● Graph databases
source: http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases
MapReduce: Principles
source: Dean, J. & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters
MapReduce: Features
● MapReduce is a generic approach for distributed
processing of large data collections
Amazon Elastic MapReduce
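The MapReduce model can be sketched with the classic word-count task. This is a single-process toy, assuming hypothetical helper names (`map_phase`, `shuffle`, `reduce_phase`); a real framework would run the map and reduce tasks on many machines over a distributed file system.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: combine all values of one key into a single result."""
    return word, sum(counts)

def map_reduce(documents):
    intermediate = [pair
                    for doc_id, text in documents.items()
                    for pair in map_phase(doc_id, text)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

docs = {
    "d1": "big data needs new tools",
    "d2": "NoSQL tools for big data",
}
word_counts = map_reduce(docs)
```

Because each call to `map_phase` sees only one document and each call to `reduce_phase` sees only one key, both phases can in principle run in parallel on separate nodes.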
Key-value Stores: Basics
● A simple hash table (map), primarily used when
all accesses to the database are via primary key
○ key-value mapping
● In the RDBMS world: a table with two columns:
○ ID column (primary key)
○ DATA column storing the value (unstructured BLOB)
● Basic operations:
○ Put a value for a key: put(key, value)
○ Get the value for a key: value := get(key)
○ Delete a key-value pair: delete(key)
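The three basic operations can be sketched as a thin wrapper around an in-memory hash table. The class name `KeyValueStore` and the sample key are hypothetical; real key-value stores add persistence, replication and distribution on top of this interface.

```python
class KeyValueStore:
    """Toy in-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Insert the value or overwrite the existing value for the key.
        self._data[key] = value

    def get(self, key):
        # Return the stored value, or None if the key is absent.
        return self._data.get(key)

    def delete(self, key):
        # Remove the key-value pair; deleting a missing key is a no-op.
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", b'{"name": "Alice"}')   # the value is an unstructured blob
value = store.get("user:42")
store.delete("user:42")
missing = store.get("user:42")
```

The store never inspects the value, so any query other than lookup by key has to be handled by the application.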
Key-value Stores: Architecture
1. Embedded systems
○ the system is a library and the DB runs within your system
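As a rough illustration of the embedded architecture, Python's standard `dbm.dumb` module is a key-value store that is simply imported as a library and runs inside the calling process. The file name and keys below are made up for the sketch, and `dbm.dumb` is a stand-in here, not one of the systems named in the lecture.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_db")

    # No server process: the database code runs inside this program.
    with dbm.dumb.open(path, "c") as db:      # "c": create the files if missing
        db[b"session:1"] = b"alice"
        db[b"session:2"] = b"bob"
        del db[b"session:2"]

    # Reopen the same files to show that the data was persisted to disk.
    with dbm.dumb.open(path, "r") as db:
        stored = db[b"session:1"]
        keys = sorted(db.keys())
```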
Project Voldemort
MS Azure DocumentDB
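[Figure: one row of a column-family table. Row key: "com.cnn.www". Column names: "contents:html", "param:lang", "param:enc", "a:cnnsi.com", "a:ihned.cz"; columns that share a prefix form a column family. Each cell holds timestamped versions (t2 ... t8): several <html>... versions under "contents:html", and the values EN, UTF-8, CNN.com and CNN in the remaining columns.]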
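The data model in the figure can be read as a nested map: row key, then column name ("family:qualifier"), then timestamp, then value. The sketch below follows that reading; the class `ColumnFamilyTable`, the integer timestamps and the exact assignment of versions to columns are assumptions made for illustration, not the API of any particular system.

```python
from collections import defaultdict

class ColumnFamilyTable:
    """Toy column-family table: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, timestamp, value):
        # Store one version of a cell; older versions are kept.
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        # Return the version at `timestamp`, or the newest one by default.
        versions = self._rows.get(row_key, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def family(self, row_key, family):
        # Return the newest value of every column in one column family.
        prefix = family + ":"
        return {column: versions[max(versions)]
                for column, versions in self._rows.get(row_key, {}).items()
                if column.startswith(prefix)}

table = ColumnFamilyTable()
row = "com.cnn.www"
table.put(row, "contents:html", 2, "<html>v1")
table.put(row, "contents:html", 6, "<html>v2")
table.put(row, "contents:html", 8, "<html>v3")
table.put(row, "param:lang", 2, "EN")
table.put(row, "param:enc", 2, "UTF-8")
table.put(row, "a:cnnsi.com", 3, "CNN.com")
table.put(row, "a:ihned.cz", 7, "CNN")

latest = table.get(row, "contents:html")
older = table.get(row, "contents:html", timestamp=2)
params = table.family(row, "param")
```

Rows need not share the same columns, which is how this model accommodates a flexible schema.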
Column-family Stores: Representatives
Memcached http://memcached.org/
○ distributed key-value store
○ used as a cache between web servers and MySQL servers in the early days of Facebook
sources: http://goo.gl/SZ6jia http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/
Facebook: Database Tech. Behind (3)
Apache Giraph http://giraph.apache.org/
○ graph processing system
○ Facebook users and their connections form one very large graph
○ used since 2013 for various analytic tasks (trillion edges)
RocksDB http://rocksdb.org/
○ high-performance key-value store
○ developed internally at Facebook, now open source
sources: https://code.facebook.com/posts/509727595776839/scaling-apache-giraph-to-a-trillion-edges/ http://goo.gl/XNtG6p
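The idea of users and connections forming one large graph, mentioned for Apache Giraph above, can be sketched on a single machine with adjacency sets. The user names and the `friends_of_friends` analysis are hypothetical, and this is not Giraph's API; Giraph distributes this kind of computation across a cluster.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected graph as a map from user to the set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def friends_of_friends(graph, user):
    """Users exactly two hops away: candidates for a friend suggestion."""
    direct = graph.get(user, set())
    reachable = set()
    for friend in direct:
        reachable |= graph[friend]
    return reachable - direct - {user}

edges = [("ann", "bob"), ("bob", "cid"), ("cid", "dan"), ("ann", "eve")]
graph = build_graph(edges)
suggestions = friends_of_friends(graph, "ann")
```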
Questions?