Lecture 6 Document Databases Data Formats
Document Databases
Lecture 6 of NoSQL Databases (PA195)
David Novak & Vlastislav Dohnal
Faculty of Informatics, Masaryk University, Brno
http://disa.fi.muni.cz/vlastislav-dohnal/teaching/nosql-databases-fall-2019/
Agenda
● Text (Document) Data Types
○ JSON: JavaScript Object Notation
○ XML: usage and comparison with JSON
NoSQL Databases and Data Types
1. Key-value stores:
○ Can store any (text or binary) data
■ often, if using JSON data, additional functionality is available
2. Document databases
○ Structured text data - Hierarchical tree data structures
■ typically JSON, XML
3. Column-family stores
○ Rows that have many columns associated with a row key
■ can be written as JSON
Part 1: Document Data Types
Data Formats
● Binary Data (previous lecture)
○ often, we want to store objects (class instances)
○ objects can be binary serialized (marshalled)
■ and kept in a key-value store
○ there are several popular serialization formats
■ Protocol Buffers, Apache Thrift
source: I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015.
JSON: Data Types (1)
● object – an unordered set of name+value pairs
○ these pairs are called properties (members) of an object
○ syntax: { "name": value, "name": value, "name": value, ... }
■ each name is a string in double quotes
JSON: Data Types (2)
● value – string in double quotes / number / true
or false (i.e., Boolean) / null / object / array
JSON: Data Types (3)
● string – sequence of zero or more Unicode
characters, wrapped in double quotes
○ Backslash escaping
JSON: Data Types (4)
● number – like a C or Java number
○ Integer or float
○ Octal and hexadecimal formats are not used
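The value types from the preceding slides can be checked with a short JavaScript sketch. The document content (names, fields) and the helper jsonType are made up for illustration.

```javascript
// Hypothetical JSON text that uses every JSON value type.
const text = `{
  "name": "Ada \\"Countess\\" Lovelace",
  "age": 36,
  "ratio": 1.5e-3,
  "active": true,
  "spouse": null,
  "address": { "city": "London" },
  "tags": ["math", "computing"]
}`;

const doc = JSON.parse(text);

// Classify a parsed value using the JSON type names.
function jsonType(v) {
  if (v === null) return "null";
  if (Array.isArray(v)) return "array";
  return typeof v === "object" ? "object" : typeof v; // string, number, boolean
}

for (const [name, value] of Object.entries(doc)) {
  console.log(name, "->", jsonType(value));
}

// Octal and hexadecimal number literals are not valid JSON.
let hexRejected = false;
try {
  JSON.parse('{"n": 0x1F}');
} catch (e) {
  hexRejected = e instanceof SyntaxError;
}
console.log("hex literal rejected:", hexRejected);
```

Note that the backslash-escaped quotes inside "name" survive parsing as plain double quotes.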
JSON Properties
● There is no way to write comments in JSON
○ Originally there was, but comments were removed because they were being misused for parsing directives, which hurt interoperability
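A minimal sketch of what the missing comment syntax means in practice: a standard parser rejects text that contains a comment. The helper isValidJson and the "_comment" property are illustrative choices, not part of any standard.

```javascript
// Returns true when the text is valid JSON, false on a syntax error.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const plain = '{ "item": "chips" }';
const withComment = '{ "item": "chips" /* salty */ }';

console.log(isValidJson(plain));
console.log(isValidJson(withComment));

// A common workaround (a convention only): store the note as an ordinary property.
const workaround = '{ "_comment": "salty", "item": "chips" }';
console.log(isValidJson(workaround));
```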
Document with JSON Schema
XML: Basic Information
● XML: eXtensible Markup Language
○ W3C standard (first draft 1996, Recommendation since 1998)
● both human and machine readable
● example:
source: http://en.wikipedia.org/wiki/XML
XML: Features and Comparison
● Standard ways to specify XML document schema:
○ DTD, XML Schema, etc.
○ concept of Namespaces; XML editors (for given schema)
● Technologies for parsing: DOM, SAX
● Many associated technologies:
○ XPath, XQuery, XSLT (transformation)
Document Databases: Fundamentals
● Basic concept of data: Document
● Documents are self-describing pieces of data
○ Hierarchical tree data structures
○ Nested associative arrays (maps), collections, scalars
○ XML, JSON (JavaScript Object Notation), BSON, …
● Documents in a collection should be “similar”
○ Their schema can differ
● Often: documents are stored as the values of key-value pairs
○ Key-value stores where the values are examinable
○ Building search indexes on various keys/fields
Why Document Databases
● XML and JSON are popular for data exchange
○ Recently mainly JSON
● Data stored in document DB can be used directly
Document Databases: Representatives
● MS Azure DocumentDB
MongoDB
● Initial release: 2009
○ Written in C++
○ Open-source
○ Cross-platform
● JSON documents
● Basic features:
○ High performance – many indexes
○ High availability – replication + eventual consistency +
automatic failover
○ Automatic scaling – automatic sharding across the cluster
○ MapReduce support
http://www.mongodb.org/
MongoDB: Terminology
RDBMS              MongoDB
database instance  MongoDB instance
schema             database
table              collection
row                document
rowid              _id
● each JSON document:
○ belongs to a collection
○ has a field _id
■ unique within the collection
● each collection:
○ belongs to a “database”
Documents
● Use JSON for API communication
● Internally: BSON
○ Binary representation of JSON
○ For storage and inter-server communication
Document Fields
● Every document must have field _id
○ Used as a primary key
○ Unique within the collection
○ Immutable
○ Any type other than an array
○ Can be generated automatically
Schema: Embedded Docs
● Related data in a single document structure
○ Documents can have subdocuments (in a field or array)
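As an illustration, an embedded schema might look like the following sketch. The customer document and the helper getPath are hypothetical; the dotted path mimics how nested fields are usually addressed.

```javascript
// Hypothetical customer document: one subdocument and one array of subdocuments.
const customer = {
  _id: 1,
  name: "Alice",
  address: { street: "Main Street 1", city: "Brno" }, // subdocument in a field
  orders: [                                            // subdocuments in an array
    { item: "keyboard", qty: 1 },
    { item: "cable", qty: 3 },
  ],
};

// Resolve a dotted path such as "address.city" against a document.
function getPath(doc, path) {
  return path
    .split(".")
    .reduce((node, key) => (node == null ? undefined : node[key]), doc);
}

console.log(getPath(customer, "address.city"));
console.log(getPath(customer, "orders.1.item"));
```

All related data arrives with the one document, so no second lookup is needed.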
Schema: Embedded Docs (2)
● Denormalized schema
● Main advantage:
Manipulate related data in a single operation
● Use this schema when:
○ One-to-one relationships: one doc “contains” the other
○ One-to-many: if children docs have one parent document
● Disadvantages:
○ Documents may grow significantly over time
○ This impacts both read and write performance
■ A document must be relocated on disk if its size exceeds the allocated space
■ This may lead to data fragmentation on the disk
Schema: References
● Links/references from one document to another
● Normalization of the schema
Schema: References (2)
● More flexibility than embedding
● Use references:
○ When embedding would result in duplication of data
■ and only insignificant boost of read performance
○ To represent more complex many-to-many relationships
○ To model large hierarchical data sets
● Disadvantages:
○ Can require more roundtrips to the server
■ Documents are accessed one by one
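The extra roundtrips can be made visible with a small sketch. The two collections are modeled as in-memory Maps, findById stands in for one request to the server, and all names and data are hypothetical.

```javascript
// Two hypothetical collections, modeled as Maps keyed by _id.
const publishers = new Map([
  [10, { _id: 10, name: "Example Press" }],
]);
const books = new Map([
  [1, { _id: 1, title: "Book A", publisher_id: 10 }],
  [2, { _id: 2, title: "Book B", publisher_id: 10 }],
]);

let roundtrips = 0;

// Stand-in for one request to the server.
function findById(collection, id) {
  roundtrips += 1;
  return collection.get(id);
}

// Following the reference takes a second lookup.
const book = findById(books, 1);
const publisher = findById(publishers, book.publisher_id);

console.log(book.title, "published by", publisher.name, "in", roundtrips, "roundtrips");
```

The publisher is stored once and shared by both books, which is the normalization benefit; the price is the second lookup.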
Querying: Basics
● Mongo query language
● A MongoDB query:
○ Targets a specific collection of documents
○ Specifies criteria that identify the returned documents
○ May include a projection to specify returned fields
○ May impose limits, sort orders, …
Querying: Selection
db.inventory.find({ type: "snacks" })
● All documents from collection inventory where the type field
has the value snacks
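The matching rule behind this query can be sketched without a server. The function find below is a simplified stand-in written for this example (top-level equality only), not MongoDB's implementation, and the inventory contents are hypothetical.

```javascript
// Hypothetical contents of the inventory collection.
const inventory = [
  { _id: 1, type: "snacks", item: "chips", qty: 40 },
  { _id: 2, type: "drinks", item: "cola", qty: 12 },
  { _id: 3, type: "snacks", item: "nuts", qty: 7 },
];

// Simplified stand-in for find(): every field in the query must equal
// the corresponding top-level field of the document.
function find(collection, query) {
  return collection.filter((doc) =>
    Object.entries(query).every(([field, value]) => doc[field] === value)
  );
}

const snacks = find(inventory, { type: "snacks" });
console.log(snacks.map((d) => d.item));
```

An empty query object matches every document, which mirrors how an empty filter selects the whole collection.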
● upsert: true (option of an update operation)
○ if no document in the inventory collection matches the query, a new document is created (with a generated _id)
■ it contains fields _id, type, item, qty
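The upsert behavior can be sketched as follows. The function updateOne here is a simplified in-memory stand-in written for this example, and the counter nextId replaces the automatically generated _id; neither reflects MongoDB's actual implementation.

```javascript
let nextId = 1; // stand-in for an automatically generated _id

// Simplified update with the upsert option: set fields on the first match,
// or insert a new document built from the query plus the new fields.
function updateOne(collection, query, fields, { upsert = false } = {}) {
  const match = collection.find((doc) =>
    Object.entries(query).every(([k, v]) => doc[k] === v)
  );
  if (match) {
    Object.assign(match, fields);
    return { matched: 1, upserted: 0 };
  }
  if (upsert) {
    collection.push({ _id: nextId++, ...query, ...fields });
    return { matched: 0, upserted: 1 };
  }
  return { matched: 0, upserted: 0 };
}

const inventory = [];
const query = { type: "snacks", item: "chips" };
const first = updateOne(inventory, query, { qty: 10 }, { upsert: true });
const second = updateOne(inventory, query, { qty: 25 }, { upsert: true });
console.log(first, second, inventory);
```

The first call inserts a document with the fields _id, type, item and qty; the second call finds that document and only updates qty.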
MapReduce
collection "accesses":
{
"user_id": <ObjectId>,
"login_time": <time_the_user_entered_the_system>,
"logout_time": <time_the_user_left_the_system>,
"access_type": <type_of_the_access>
}
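A minimal in-memory sketch of the MapReduce idea over such documents. The task (total session time per user) is an assumed example, user_id is a plain string instead of an ObjectId, and times are plain numbers in seconds. In MongoDB itself the map function reads the document through this and calls a global emit; here emit is passed as a parameter to keep the sketch self-contained.

```javascript
// Hypothetical documents of the "accesses" collection.
const accesses = [
  { user_id: "u1", login_time: 100, logout_time: 160, access_type: "web" },
  { user_id: "u2", login_time: 200, logout_time: 230, access_type: "api" },
  { user_id: "u1", login_time: 300, logout_time: 340, access_type: "web" },
];

// Map: emit (key, value) pairs for one document.
function map(doc, emit) {
  emit(doc.user_id, doc.logout_time - doc.login_time);
}

// Reduce: combine all values emitted for one key.
function reduce(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Group emitted values by key, then reduce each group.
function mapReduce(collection, mapFn, reduceFn) {
  const groups = new Map();
  for (const doc of collection) {
    mapFn(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const result = {};
  for (const [key, values] of groups) result[key] = reduceFn(key, values);
  return result;
}

const timePerUser = mapReduce(accesses, map, reduce);
console.log(timePerUser);
```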
Indexes: Example of Use (2)
● Compound index on userid (ascending) AND score field (descending)
● Multikey index on the addr.zip field
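The order in which a compound index on userid (ascending) and score (descending) arranges its entries can be sketched with a comparator. The records and the function compareCompound are hypothetical; the key pattern would usually be written as { userid: 1, score: -1 }.

```javascript
const records = [
  { userid: "b", score: 50 },
  { userid: "a", score: 70 },
  { userid: "b", score: 90 },
  { userid: "a", score: 20 },
];

// Comparator for the key pattern { userid: 1, score: -1 }:
// userid ascending first, then score descending within the same userid.
function compareCompound(x, y) {
  if (x.userid !== y.userid) return x.userid < y.userid ? -1 : 1;
  return y.score - x.score;
}

const indexOrder = [...records].sort(compareCompound);
console.log(indexOrder.map((r) => `${r.userid}:${r.score}`).join(" "));
```

Entries for one userid are adjacent and already sorted by score, so a query for one user's top scores can read a contiguous range.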
Index Types (3)
● Ordered Index
○ B-Tree (see above)
● Hashed Indexes
○ Fast O(1) lookups: the index stores the hash of the value of a field
■ Only equality matches
● Geospatial Index (operators docs)
○ 2d indexes = use planar geometry when returning results
■ For data representing points on a two-dimensional plane
○ 2dsphere indexes = spherical (Earth-like) geometry
■ For data representing latitude, longitude
● Text Indexes
○ Searching for string content in a collection
Part 2.3: MongoDB - Behind the Scene
MongoDB: Behind the Scene
● BSON format
● Distribution models
○ Replication
○ Sharding
○ Balancing
● MapReduce
● Transactions
● Journaling
BSON (Binary JSON) Format
● Binary-encoded serialization of JSON documents
○ Representation of documents, arrays, JSON simple data
types + other types (e.g., date)
http://www.bsonspec.org/
BSON: Basic Types
● byte – 1 byte (8-bits)
● int32 – 4 bytes (32-bit signed integer)
● int64 – 8 bytes (64-bit signed integer)
● double – 8 bytes (64-bit IEEE 754 floating point)
BSON Grammar
document ::= int32 e_list "\x00"
● BSON document
● int32 = total number of bytes in document
BSON Grammar (2)
element ::= "\x01" e_name double      Floating point
          | "\x02" e_name string      UTF-8 string
          | "\x03" e_name document    Embedded document
          | "\x04" e_name document    Array
          | "\x05" e_name binary      Binary data
          | …                         …
e_name ::= cstring
● Field key
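The grammar can be followed byte by byte with a small encoder sketch for Node.js. It covers only the element types 0x01 to 0x04 and, as a simplification, writes every number as a double. The function names are made up for this example.

```javascript
// cstring: UTF-8 bytes followed by a 0x00 terminator.
function cstring(s) {
  return Buffer.concat([Buffer.from(s, "utf8"), Buffer.from([0])]);
}

// int32: 4 bytes, little-endian.
function int32(n) {
  const b = Buffer.alloc(4);
  b.writeInt32LE(n);
  return b;
}

// element ::= type byte, e_name, value
function encodeElement(name, value) {
  if (typeof value === "number") {
    const b = Buffer.alloc(8);
    b.writeDoubleLE(value);
    return Buffer.concat([Buffer.from([0x01]), cstring(name), b]);
  }
  if (typeof value === "string") {
    // string ::= int32 (byte count including terminator), bytes, 0x00
    const body = cstring(value);
    return Buffer.concat([Buffer.from([0x02]), cstring(name), int32(body.length), body]);
  }
  if (Array.isArray(value)) {
    // An array is encoded as a document with the keys "0", "1", ...
    const asDoc = Object.fromEntries(value.map((v, i) => [String(i), v]));
    return Buffer.concat([Buffer.from([0x04]), cstring(name), encodeDocument(asDoc)]);
  }
  if (typeof value === "object" && value !== null) {
    return Buffer.concat([Buffer.from([0x03]), cstring(name), encodeDocument(value)]);
  }
  throw new TypeError(`unsupported type for field ${name}`);
}

// document ::= int32 e_list "\x00"
function encodeDocument(doc) {
  const elements = Buffer.concat(
    Object.entries(doc).map(([k, v]) => encodeElement(k, v))
  );
  const total = 4 + elements.length + 1; // length prefix + elements + terminator
  return Buffer.concat([int32(total), elements, Buffer.from([0])]);
}

const bytes = encodeDocument({ hello: "world" });
console.log(bytes.length, bytes.toString("hex"));
```

For { hello: "world" } the result is 22 bytes long, and the leading int32 (0x16) equals that total length.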
Replica Set: CAP
● Let us have three nodes in the replica set
○ Let’s say that the master is disconnected from the other two
■ The distributed system is partitioned
○ The master finds out that it is alone
■ Specifically, that it can communicate with fewer than half of the nodes
■ And it steps down from being master (handles just reads)
○ The other two slaves “think” that the master failed
■ Because they form a partition with more than half of the nodes
■ And elect a new master
● In case of just two nodes in RS
○ Both partitions will become read-only
■ Similar case can occur with any even number of nodes in RS
○ Therefore, we can always add an arbiter node to an even-sized RS
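The majority rule behind these scenarios can be written down in a few lines. This is a sketch of the rule only, not of the election protocol; the function names are made up.

```javascript
// A partition may elect (or keep) a master only if it contains
// a strict majority of all voting members of the replica set.
function hasMajority(reachableVoters, totalVoters) {
  return reachableVoters > totalVoters / 2;
}

// Describe each side of a network partition, given the side sizes.
function partitionOutcome(sizes) {
  const total = sizes.reduce((a, b) => a + b, 0);
  return sizes.map((n) => (hasMajority(n, total) ? "writable" : "read-only"));
}

console.log(partitionOutcome([1, 2])); // three nodes, master isolated
console.log(partitionOutcome([1, 1])); // two nodes
console.log(partitionOutcome([2, 2])); // four nodes, even split
console.log(partitionOutcome([2, 3])); // four nodes plus one arbiter
```

With an even number of voters an even split leaves no side with a majority, which is why adding an arbiter helps.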
Sharding
● MongoDB enables collection partitioning (sharding)
Collection Partitioning
● Mongo partitions collection’s data by the shard key
○ Indexed field(s) that exist in each document in the collection
■ Since Mongo 4.2, the value is mutable
○ Divided into chunks, distributed across shards
■ Range-based partitioning
■ Hash-based partitioning
○ When a chunk grows beyond the size limit, it is split
■ Metadata change, no data migration
● Data balancing:
○ Background chunk migration
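The difference between the two partitioning schemes can be sketched in Node.js. The shard names, the chunk boundaries, the use of MD5 and the modulo step are all assumptions chosen for brevity; they do not describe MongoDB's actual hash function or chunk layout.

```javascript
const crypto = require("crypto");

const SHARDS = ["shardA", "shardB", "shardC"];

// Range-based: chunks are contiguous shard-key ranges [min, max).
const rangeChunks = [
  { min: -Infinity, max: 100, shard: "shardA" },
  { min: 100, max: 200, shard: "shardB" },
  { min: 200, max: Infinity, shard: "shardC" },
];

function shardByRange(key) {
  return rangeChunks.find((c) => key >= c.min && key < c.max).shard;
}

// Hash-based: the hash of the key, not the key itself, decides placement.
function shardByHash(key) {
  const digest = crypto.createHash("md5").update(String(key)).digest();
  return SHARDS[digest.readUInt32BE(0) % SHARDS.length];
}

const keys = [101, 102, 103, 104];
console.log(keys.map(shardByRange)); // neighbouring keys share one chunk
console.log(keys.map(shardByHash));  // neighbouring keys are typically scattered
```

Range-based placement keeps range queries local to few shards, while hash-based placement spreads monotonically increasing keys more evenly.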
Sharding: Components
● MongoDB runs in cluster of different node types:
● Shards – store the data
○ Each shard is a replica set
■ Can be a single node
Journaling
● Write operations are applied in memory and written to a journal before they are applied to the data files (on disk)
○ To restore consistent state after a hard shutdown
○ Can be switched on/off
● Journal directory – holds journal files
● Journal file = write-ahead redo logs
○ Append only file
○ Deleted when all the writes are durable
○ When size > 1GB of data, MongoDB creates a new file
■ The size can be modified
● Clean shutdown removes all journal files
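The write-ahead idea can be sketched with an in-memory model. The class Store and its methods are hypothetical stand-ins for the journal and the data files; the sketch leaves out the in-memory view of the data and real disk I/O.

```javascript
// In-memory stand-ins for the journal file and the data files.
class Store {
  constructor() {
    this.journal = [];   // append-only redo log
    this.dataFiles = {}; // durable state
  }

  // Log the write first; the data files are updated later.
  write(key, value) {
    this.journal.push({ key, value });
  }

  // Apply the logged writes to the data files, then drop the journal entries.
  checkpoint() {
    for (const { key, value } of this.journal) this.dataFiles[key] = value;
    this.journal = [];
  }

  // After a hard shutdown: replay whatever is still in the journal.
  recover() {
    this.checkpoint();
  }
}

const store = new Store();
store.write("a", 1);
store.checkpoint();
store.write("b", 2); // journaled, but the "crash" happens before a checkpoint
const beforeRecovery = { ...store.dataFiles };
store.recover();
console.log(beforeRecovery, store.dataFiles);
```

Before recovery the data files contain only the first write; replaying the journal restores the second one.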
Transactions
● Write ops: atomic at the level of single document
○ Including nested documents
○ Sufficient for many cases, but not all
○ When a write operation modifies multiple documents,
other operations may interleave
○ The $isolated operator is deprecated
● Transactions: (docs)
References
● I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a
NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.