4-NoSQL _v1_

The document provides an introduction to NoSQL databases, highlighting their origins, characteristics, and types, including key-value, document, wide-column, and graph databases. It discusses the need for NoSQL due to the limitations of traditional relational databases in handling large data volumes and the demands of modern web applications. Additionally, it covers concepts like polyglot persistence, distribution models, and the CAP theorem, emphasizing the trade-offs between consistency, availability, and partition tolerance in distributed systems.

Introduction to NoSQL Databases (v1)
Juan Manuel Gimeno Illa
Bibliography
• Pramod Sadalage and Martin Fowler. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education (2012)
• Dan Sullivan. NoSQL for Mere Mortals. Addison-Wesley (2015)
• Eric Redmond and Jim R. Wilson. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. The Pragmatic Programmers (2012)
• Luc Perkins, Jim Wilson and Eric Redmond. Seven Databases in Seven Weeks, Second Edition: A Guide to Modern Databases and the NoSQL Movement. The Pragmatic Programmers (2018)
• Martin Kleppmann. Designing Data-Intensive Applications: The Big Ideas behind Reliable, Scalable and Maintainable Systems. O’Reilly (2017)
• Alex Petrov. Database Internals: A Deep Dive into How Distributed Data Systems Work. O’Reilly (2019)

2024/2025 Ampliació de BBDD i EP



Index
• Origins
• Polyglot Persistence
• Main characteristics
• The “four” main types
• key-value
• document / MongoDB
• wide-column
• graph / Neo4j

Origins

• Born out of the needs of modern web applications
• large data volumes
• worldwide distribution of clients and servers
• Application servers horizontally scalable
• clusters of commodity servers
• different regions for low latency
• reliability via replication

Origins
• RDBMSs didn’t fit these types of architectures well
• The architecture best suited to an RDBMS:
• DB as a central point of integration of different applications (single source of truth)
• concurrency control not scalable to the needs of web applications
• single (vertically scalable) server
• In addition, there’s the impedance mismatch between the relational model and in-memory data structures.

Origins
• The term NoSQL, in its current sense, surfaced as the Twitter hashtag for an informal meetup held on June 11, 2009
• So, it’s no surprise that it is ill-defined
• It’s generally applied to new non-relational databases such as Redis, MongoDB, Cassandra, Neo4j, etc.
• run well on clusters (most of them)
• schema-less data
• sacrifice consistency for other useful properties

Origins
• The applications built by Google, Amazon, Facebook, etc. created the need for new forms of data storage
• So, NoSQL is an accidental neologism, without a prescriptive definition, and with only a list of common characteristics
• not all of which are satisfied by every NoSQL database
• and with many dissimilarities among the databases

Polyglot Persistence
• RDBMSs have long been the default choice
• they fit application architectures very well
• SQL is mostly standardized
• normalization allows data to be flexibly queried
• Compare this to the difficulty of choosing the right data structure in a program
• only one type of data structure?
• for all the data?
• Different data have different needs!

Polyglot Persistence
• Does all data need the same durability?
• Is losing session data as serious as losing completed orders?
• Does all data need the same queryability?
• Are the queries we run on orders and customer info the same as those used to retrieve session data or shopping cart info?
• Does all data access need the same latency?
• Does BI/DW analysis, which can be done offline, need the same low latency as session data queries?

Polyglot Persistence
• Easier to include new types of data in the application
• Each one with its specific model
• But now “data coherence” must be checked at the application level
• And there is no universal language or set of concepts for
• modelling
• querying
• Consistent with the trend to structure the architecture around independent services

Main characteristics

• The (mostly) common characteristics of NoSQL databases are
• Not using the relational model
• Running well on clusters
• Open source (most of them, though often with enhanced services offered around them)
• Built for the 21st century web architecture needs
• Better fit for Big Data scenarios
• (Better aligned with programming productivity)

A note on development productivity
• Agile processes emphasise short development iterations
• sometimes this clashes with traditional DBA practices
• Remember: the database as the source of truth
• A lot of development effort is spent on mapping data between the relational database and in-memory data structures
• Impedance mismatch
• NoSQL provides models that better fit the application’s needs
• less code to write, debug and evolve

Classification of NoSQL databases
• This classification by data model is useful, but crude
• The lines between them are often blurry
• Key-value:
• simply associates keys with values
• Document:
• associates a key with a document (e.g. a JSON object)
• Wide-column (or column-family):
• evolution of documents mainly for analytical (and big data) scenarios
• Graph:
• data is represented as nodes connected by edges, both with properties

Classification of NoSQL databases
• But these four “types” can be further classified as

• Schemaless orientation: key-value, document, wide-column, graph
• Aggregate orientation: key-value, document, wide-column

Main characteristics
• Aggregate model
• Schema-less
• Distribution model
• Eventual Consistency
• Map-reduce

Aggregate model
• RDBs have no concept of aggregate within their data model
• In the NoSQL world, graph databases are also aggregate-ignorant
• Being aggregate-ignorant is not a bad thing.
• An aggregate structure may help with some data interactions but be an
obstacle for others
• Aggregates have an important consequence for transactions.
• Relational databases allow you to manipulate any combination of rows from
any tables in a single transaction (ACID transactions).
• NoSQL databases (in general) don’t support ACID transactions outside
aggregates and thus sacrifice consistency (or efficiency).
• Graph and other aggregate-ignorant databases do support ACID transactions
similar to relational databases

Aggregate model
• Scenario:
• Let's assume we have to build an e-commerce website; we are going to be
selling items directly to customers over the Web, and we will have to store
information about users, our product catalog, orders, shipping addresses,
billing addresses, and payment data.

Traditional relational model

[Sadalage & Fowler 2012]

Traditional relational model

• Normalized
• no duplication
• Referential integrity

Aggregate model

[Sadalage & Fowler 2012]

Aggregate model
• Black-diamond composition marker
• how data fits into the aggregation structure.
• Two main aggregates: customer and order.
• The customer contains a list of billing addresses.
• The order contains a list of order items, a shipping address, and payments.
• The payment itself contains a billing address for that payment.
• The link between the customer and the order isn't within either aggregate
• it's a relationship between aggregates.
• The link from an order item would cross into a separate aggregate
structure for products

Aggregate model
• Some sample data
• In JSON format
• Note the duplicated data
• No JOIN needed on query!

// in customers
{
  "_id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}

// in orders
{
  "_id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnid": "abelif879rft",
      "billingAddress": [{"city": "Chicago"}]
    }
  ]
}

Schema-less
• A key-value store allows you to store any data you like under a key
• A document store effectively does the same thing since it makes no
restrictions on the structure of the documents you store
• Column-family stores allow you to store any data under any column
you like
• Graph stores allow you to freely add new edges and freely add
properties to nodes and edges as you wish.

Schema-less
• Without a schema binding you, you can store whatever you need
• easily changing it as you learn more
• no need to delete old unneeded data
• simplifies working with non-uniform data
• But complexity doesn’t disappear, it simply moves
• code has to deal with it !!!!
• And if code has to deal with it, code creates an implicit schema
• all the assumptions the code makes when accessing the data (e.g. field names, value types, etc.)
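
The idea of an implicit schema can be illustrated with a small sketch. The documents and field names below (`name`, `fullName`, `qty`) are hypothetical and not taken from the slides; the point is only that the reading code ends up encoding every assumption about field names and value types that the store itself does not enforce.

```javascript
// Hypothetical documents written over time to a schema-less store:
// the customer field was renamed and qty changed from number to string.
const orders = [
  { _id: 1, name: "Martin", qty: 2 },
  { _id: 2, fullName: "Pramod", qty: "3" },
];

// The reader must know every shape that was ever written:
// this knowledge is the implicit schema.
function normalizeOrder(doc) {
  const customer = doc.name ?? doc.fullName; // assumption: one of two field names
  const quantity = Number(doc.qty);          // assumption: number or numeric string
  if (customer === undefined || Number.isNaN(quantity)) {
    throw new Error(`Unexpected document shape: ${JSON.stringify(doc)}`);
  }
  return { id: doc._id, customer, quantity };
}

const normalized = orders.map(normalizeOrder);
console.log(normalized);
```

Any other application reading the same data would need an equivalent of `normalizeOrder`, which is why the assumptions are hard to discover when several applications share the store.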

Schema-less
• And this creates some problems:
• to understand the data, you must look at all the code (complex if the data is accessed by multiple applications)
• code is much more bloated and brittle
• the database cannot use the schema to optimize data access and organization
• How can these problems be reduced?
• clearly delineate different areas of an aggregate for access by different applications
• different sections of a document in a document store (but this can duplicate data)
• different families in a column-family store
• encapsulate all database interaction within a single application and integrate with other applications using services

Distribution models
• As data volume increases
• Scaling up (vertically) is neither easy nor cheap
• So, one must scale out (horizontally)
• So, databases must run on clusters
• Aggregates fit well to this model
• Different distribution models:
• Single server (as baseline)
• Sharding
• Replication
• Primary-Secondaries
• Peer-to-peer

Distribution models
• Single Server
• Simplest option: no distribution at all
• Most, but not all, NoSQL databases can be run in this mode
• Takes advantage of the NoSQL data model (e.g. document, graph) but not of its scaling possibilities
• Mainly useful as a baseline for comparison or for very simple systems.
• Single point of failure
• That’s why we will normally use sharding/replication or a mixture of
both

Distribution models
• Sharding
• Often a busy data store is busy because different people are accessing
different parts of the data set
• Sharding distributes different data across multiple servers, so each server acts
as the single source for a subset of data
• Data that is accessed together is clumped together
• Load is balanced out between servers
• One has to try to keep the load even
• Many NoSQL databases do auto-sharding
• Shards based on geographic regions (lower latency)
• Scales out both reads & writes
• But each shard is a single point of failure for a subset of the data set
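
As a rough illustration of sharding, the sketch below routes each key to exactly one of three in-memory shards. The hash function, the key names and the number of shards are all assumptions made for the example; real databases use their own partitioning schemes (hash-based, range-based, geographic, etc.).

```javascript
// Assumed setup: three shards, each modelled as an in-memory Map.
const shards = [new Map(), new Map(), new Map()];

// Simple deterministic string hash (illustrative, not a production hash).
function hashKey(key) {
  let h = 0;
  for (const ch of key) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep it an unsigned 32-bit integer
  }
  return h;
}

// Each key belongs to exactly one shard, which is the single source for it.
function shardFor(key) {
  return shards[hashKey(key) % shards.length];
}

function put(key, value) { shardFor(key).set(key, value); }
function get(key) { return shardFor(key).get(key); }

put("customer:1", { name: "Martin" });
put("customer:2", { name: "Pramod" });
put("order:99", { customerId: 1 });

console.log(get("order:99"));
console.log(shards.map((s) => s.size)); // how the keys were spread over the shards
```

Because every read and write for a key goes to one shard, both reads and writes scale out, but losing that shard makes its subset of the data unavailable.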

Distribution models
• Replication
• Sometimes a data store mostly serves read operations and data is written only occasionally
• Replication copies data across multiple servers, so each bit of data can be found in multiple places
• Writes can be managed in two ways:
• Primary-Secondaries: replication makes one node (the primary) the authoritative copy that handles writes and then synchronizes the secondaries; secondaries may serve reads
• Peer-to-peer: allows writes to any node; nodes coordinate to synchronize their copies
• No single point of failure
• But introduces consistency problems !!!
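
The consistency problem introduced by replication can be seen in a toy primary-secondaries model. Everything here (the `pending` queue, the explicit `replicate` step) is an assumption for illustration; real systems replicate in the background with their own protocols.

```javascript
// One primary and two secondaries, each an in-memory Map (assumed toy model).
const primary = new Map();
const secondaries = [new Map(), new Map()];
const pending = []; // writes accepted by the primary but not yet replicated

function write(key, value) {
  primary.set(key, value);    // the primary is the authoritative copy
  pending.push([key, value]); // replication happens later
}

// Synchronize the secondaries with everything the primary has accepted.
function replicate() {
  for (const [key, value] of pending) {
    for (const s of secondaries) s.set(key, value);
  }
  pending.length = 0;
}

function readFromSecondary(i, key) {
  return secondaries[i].get(key);
}

write("stock:paper", 100);
const staleRead = readFromSecondary(0, "stock:paper"); // read before replication
replicate();
const freshRead = readFromSecondary(0, "stock:paper"); // read after replication
console.log({ staleRead, freshRead });
```

The window between `write` and `replicate` is the inconsistency window: a client reading from a secondary in that interval does not see the latest write.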

Consistency
• Relational databases exhibit strong consistency
• Write-write conflict (lost update)
• pessimistic (using locks)
• optimistic (conditional updates)
• Read-write conflict (inconsistent read)
• logical inconsistency (half an update)
• ACID transactions
• But with replication, we have replication consistency
• not all replicas updated at the same time
• eventually consistent
• can exacerbate logical inconsistency by widening the inconsistency window
• particularly problematic when one gets inconsistent with oneself
• read-your-writes consistency
• session consistency
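
The optimistic approach to write-write conflicts (conditional updates) can be sketched with a version number per record. The `version` field and the function names are hypothetical; they only illustrate the technique of accepting a write when the value has not changed since it was read.

```javascript
// A record carries a version number; an update succeeds only if the
// version the writer read is still the current one (conditional update).
const store = new Map([["room:101", { value: "free", version: 1 }]]);

function read(key) {
  const { value, version } = store.get(key);
  return { value, version };
}

function conditionalUpdate(key, expectedVersion, newValue) {
  const current = store.get(key);
  if (current.version !== expectedVersion) return false; // someone else wrote first
  store.set(key, { value: newValue, version: current.version + 1 });
  return true;
}

// Two clients read the same version, then both try to write.
const a = read("room:101");
const b = read("room:101");
const aOk = conditionalUpdate("room:101", a.version, "booked by A");
const bOk = conditionalUpdate("room:101", b.version, "booked by B");
console.log(aOk, bOk, store.get("room:101").value);
```

The second writer is rejected instead of silently overwriting the first, so the lost update is detected; that client would have to re-read and retry.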

CAP Theorem
• CAP Theorem:
• Describes the trade-offs involved in distributed systems
• 1998: Original conjecture made by Eric Brewer
• 2002: (Proven) by Nancy Lynch and Seth Gilbert
• C: Consistency
• A: Availability
• every request received by a non-failing node results in a response
• P: Partition tolerance

CAP Theorem
• The popular (oversimplified) version is:
• “Of the Consistency, Availability and Partition tolerance guarantees, pick two”.
• But this interpretation implies there are three possible systems:
• CA, AP and CP
• And this is a problem, because P is a reality in a distributed system
• So only AP and CP?
• Which means: A or C but not both?

CAP Theorem
• “CAP Twelve Years Later: How the ‘Rules’ Have Changed” (E. Brewer, 2012)
• “The ‘2 of 3’ formulation was always misleading because it tended to
oversimplify the tensions among properties...
• CAP prohibits only a tiny part of the design space: perfect availability and
consistency in the presence of partitions”
• Availability and consistency are not a 0-1 decision
• We can relax consistency in favour of availability without giving it up entirely

CAP Theorem
• So, it is possible to be C, A and P
• Until a partition occurs
• Then you have to choose between C and A
• tuneable consistency
• tuneable availability
• How to make that choice is one of the decisions to make when
designing the system
• which NoSQL database to use
• which tuning of the configuration parameters

CAP Theorem
• Tuneable consistency
• Strong consistency:
• copy data to all servers before the client is acknowledged
• all servers will have the update
• reads will be consistent
• Eventual consistency:
• copy data on the server and acknowledge immediately, then replicate to all servers
• there’s a window of inconsistent reads
• Tuneable availability shows up as latency
• The more available the system, the quicker it responds (so the lower the latency)
• Very high latency is indistinguishable from unavailability

CAP Theorem
• So, the real trade-off is between
• Strong consistency and low availability (high latency)
• Weak consistency and high availability (low latency)
• And there is a scale:
• R: number of nodes in synchronous read
• W: number of nodes in synchronous write
• For instance, in Cassandra’s documentation:
• “The CAP theorem states... you can't have the three at the same time and get
an acceptable latency. Trade-offs... are tuneable in Cassandra. You can get
strong consistency with Cassandra (with an increased latency).”

CAP Theorem
• “Replication and the latency-consistency trade-off” (D. Abadi, 2011):
• “Unlike CAP, where consistency and availability are only traded off in the
event of a network partition, the latency vs. consistency trade-off is present
even during normal operations of the system.”
• The only condition is that the system replicates data, which is the case for all
distributed databases

CAP Theorem
• “Consistency Trade-offs in Modern Database System Design” (D. Abadi, 2012)
• PACELC: “if there is a partition (P) how does the system trade-off between
availability and consistency (A and C); else (E) when the system is running as
normal in the absence of partitions, how does the system trade-off between
latency (L) and consistency (C)?”
• For example: the default configurations of Dynamo, Cassandra and Riak are
PA/EL:
• if partition: give up consistency for availability
• under normal operation: give up consistency for lower latency

Map-Reduce
• If data is stored in a cluster, we have to find a way to compute with it
that is cluster-friendly
• Map-reduce is a pattern to allow computations to be parallelized over
a cluster
• Map: reads data from an aggregate and transforms it to relevant key-value
pairs
• As it only reads a single record, it can be parallelized and run on the node that stores the data
• Reduce: takes many values corresponding to a single key and summarizes them into a single output
• Each reducer operates on the pairs of a single key, so it can be parallelized by key
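
The map and reduce steps described above can be sketched on a single machine. The order data and the explicit `shuffle` step are assumptions for the example; in a real cluster the framework distributes the map calls to the nodes holding the data and groups the emitted pairs by key.

```javascript
// Hypothetical order aggregates, as they might be stored on different nodes.
const orders = [
  { id: 1, items: [{ product: "tea", qty: 2 }, { product: "mug", qty: 1 }] },
  { id: 2, items: [{ product: "tea", qty: 3 }] },
];

// Map: reads ONE aggregate and emits key-value pairs.
function map(order) {
  return order.items.map((item) => [item.product, item.qty]);
}

// Shuffle: groups the emitted values by key (done by the framework in practice).
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce: summarizes all the values of ONE key into a single output.
function reduce(key, values) {
  return [key, values.reduce((sum, v) => sum + v, 0)];
}

const pairs = orders.flatMap(map);
const totals = new Map(
  [...shuffle(pairs)].map(([key, values]) => reduce(key, values))
);
console.log(totals); // total quantity per product
```

Since `map` looks at one order at a time and `reduce` at one key at a time, both steps could run in parallel without coordinating with each other.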

Map-Reduce
• Reducers that have the same form of input and output can be combined
into pipelines
• This improves parallelism and reduces the data to be transferred
• Map-reduce operations can be composed when output of one reduce is
the input of another operation’s map
• If the result of a map-reduce computation is widely used, it can be stored as a materialized view
• Which can also be updated incrementally
• Examples:
• HBase based on Hadoop
• Views in CouchDB
• MongoDB allows map-reduce but has frameworks built upon it
Key-Value Databases

Key-Value
• They are the simplest NoSQL data stores to use from an API perspective
• The client can either
• get the value associated with the key
• put a value for a key
• delete a key
• The value associated with the key is (normally) a blob that the database stores without interpreting it
• So, no queries on the value
• Only primary-key access
• great performance
• great scalability
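
A minimal sketch of this API, assuming an in-memory `Map` as the storage and a hypothetical `KeyValueStore` class (not the interface of any particular product), could look as follows.

```javascript
// A toy key-value store: values are opaque blobs (here, strings).
class KeyValueStore {
  constructor() { this.data = new Map(); }
  put(key, value) { this.data.set(key, value); }
  get(key) { return this.data.get(key); }
  delete(key) { return this.data.delete(key); }
}

const kv = new KeyValueStore();
// The application, not the store, decides how the blob is encoded.
kv.put("session:42", JSON.stringify({ user: "martin", cart: [27] }));

const session = JSON.parse(kv.get("session:42"));
console.log(session.user);

kv.delete("session:42");
console.log(kv.get("session:42")); // undefined: the key is gone
```

The store never looks inside the value, so a question such as "which sessions contain product 27" cannot be answered without reading every key.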

Key-Value
• Amazon’s Dynamo:
• Dynamo: Amazon’s Highly Available Key-value Store
• Not open-source but origin of great ideas in modern distributed databases
• Project Voldemort:
• Open-source implementation of some of Dynamo’s ideas
• Redis:
• Data Structures Server
• MemcachedDB:
• In-memory DB

Key-Value
• Good for
• Storing session information
• User profiles and preferences
• Shopping cart data
• Bad for
• Relationships among data
• Multi-operation transactions
• Query by data
• Operation over sets of multiple keys

Document Databases

Document
• They store documents in the value part of the key-value stores
• So, they are key-value stores in which the value is examinable
• What do we mean by document?
• They are self-describing, hierarchical tree data structures which can consist of
maps, collections and scalar values
• They can be XML, JSON, BSON, etc.
• The documents stored are similar to each other but need not have exactly the same structure
• Since version 3.2 MongoDB allows schema validation

MongoDB
• Database:
• A container for collections
• Collection:
• A grouping of documents
• Typically similar
• Document:
• Displayed as JSON
• Stored as BSON (all JSON datatypes, dates, numbers, ObjectIds, ...)
• Every document requires an _id field

MongoDB
• Every document requires an _id field which acts as a primary key
• If the document does not include it, Mongo will create this field with an
ObjectId as value
• A document may contain different fields
• A field may contain different values
• Flexible schema
• But optional schema validation

MongoDB
• Atlas Platform:
• Database as a service
• Interact and manage data
• Compass:
• GUI interface
• Query, compose aggregation pipelines, analyse data
• Mongosh:
• Node.js REPL environment
• MongoDB drivers:
• To connect app with Database

MongoDB
• Insert documents:
• db.collection.insertOne({document})
• db.collection.insertMany([{doc1}, {doc2}, …])

db.inventory.insertMany([
{ item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
{ item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
{ item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
{ item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
{ item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]);

MongoDB
• Finding documents:
• db.collection.find({condition})
• db.inventory.find( {} )
• db.inventory.find( { status: "D" } )
• db.inventory.find( { status: { $in: [ "A", "D" ] } } )
• db.inventory.find( { status: "A", qty: { $lt: 30 } } )
• db.inventory.find( { $or: [ { status: "A" },
{ qty: { $lt: 30 } } ] } )

• SQL to MongoDB Mapping Chart

MongoDB
• Replacing a document (except _id field)
• db.collection.replaceOne({filter}, {replacement}, {options})

db.inventory.replaceOne(
{ item: "paper" },
{ item: "paper", instock: [ { warehouse: "A", qty: 60 }, { warehouse: "B", qty: 40 } ] }
)

MongoDB
• Updating documents:
• db.collection.updateOne(filter, update, options)
• db.collection.updateMany(filter, update, options)

db.inventory.updateOne(
  { item: "paper" },
  {
    $set: { "size.uom": "cm", status: "P" },
    $currentDate: { lastModified: true }
  })

db.inventory.updateMany(
  { "qty": { $lt: 50 } },
  {
    $set: { "size.uom": "in", status: "P" },
    $currentDate: { lastModified: true }
  })

MongoDB
• Deleting documents:
• db.collection.deleteOne()
• db.collection.deleteMany()

• db.inventory.deleteOne({ status: "D" })


• db.inventory.deleteMany({ status: "A" })
• db.inventory.deleteMany({})

MongoDB
• Aggregation operations process multiple documents and return
computed results
• You can use them to
• Group values from different documents
• Perform operations on the grouped data
• To perform aggregations
• Aggregation pipelines (preferred)
• Single purpose aggregation methods

MongoDB
• An aggregation pipeline consists of a series of stages that process the
documents
• Each stage performs an operation on the input documents
• The output of a stage is the input of the next
• An aggregation pipeline can return results for groups of documents

db.orders.aggregate( [
// Stage 1: Filter pizza order documents by pizza size
{ $match: { size: "medium" } },
// Stage 2: Group remaining documents by pizza name and calculate total quantity
{ $group: { _id: "$name", totalQuantity: { $sum: "$quantity" } } }
])
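
To see what the two stages compute, the same logic can be emulated in plain JavaScript without a database. The sample documents are made up for the example, and the code only mimics the semantics of `$match` and `$group` with `$sum`; it is not how MongoDB executes a pipeline.

```javascript
// Hypothetical documents in the orders collection.
const orders = [
  { name: "Pepperoni", size: "medium", quantity: 20 },
  { name: "Pepperoni", size: "small", quantity: 10 },
  { name: "Cheese", size: "medium", quantity: 50 },
  { name: "Pepperoni", size: "medium", quantity: 5 },
];

// Stage 1, like { $match: { size: "medium" } }
const matched = orders.filter((doc) => doc.size === "medium");

// Stage 2, like { $group: { _id: "$name", totalQuantity: { $sum: "$quantity" } } }
const groups = new Map();
for (const doc of matched) {
  groups.set(doc.name, (groups.get(doc.name) ?? 0) + doc.quantity);
}
const result = [...groups].map(([name, totalQuantity]) => ({ _id: name, totalQuantity }));
console.log(result);
```

The output of the first stage (`matched`) is the input of the second, which is the defining property of a pipeline.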

MongoDB
• Without indexes, MongoDB reads all documents in the collection
(collscan)
• By default, the only index is on _id
• There are three types of indexes:
• Single field index
• Multikey Indexes: for array fields
• Compound indexes: on multiple fields

db.collection.createIndex( { name: -1 } )

MongoDB
• You can ask MongoDB how it resolves a given query
• hint: forces MongoDB to use an index
• explain: shows execution statistics

db.people.find(
{ name: "John Doe", zipcode: { $gt: "63000" } }
).hint( { zipcode: 1 } ).explain("executionStats")

MongoDB
• Replica sets improve availability by replicating data using a primary/secondaries setup
• The same data is available on multiple nodes and clients can get to the data even when the primary is down
• Usually, the application does not have to determine if the primary node is down (this is controlled by the cluster)
• Writes go to the primary and reads can be served by the same node or also by the secondaries.
[MongoDB Documentation]

MongoDB
• The database is configured by using replica sets
• A primary node
• Some secondary nodes
• By default all reads & writes go to the primary
• Every write can be configured to wait for the writes to be replicated to
• Only primary
• Primary + some secondaries
[MongoDB Documentation]

MongoDB
• If the primary node goes down, the remaining nodes in the replica set vote among themselves to elect a new primary
• When the node that failed comes back online, it joins in as a secondary and catches up with the rest of the nodes by pulling all the data it needs to get current.
• Some secondaries can be in another datacenter
[MongoDB Documentation]

MongoDB
• When we want to scale for writes, we can start sharding the data.
• In sharding, the data is split by a certain field and then moved to different Mongo nodes.
• The data is dynamically moved between nodes to ensure that shards are always balanced.
• We can add more replica sets to the cluster and increase the number of writable nodes, enabling horizontal scaling for writes.
[MongoDB Documentation]

Document databases
• Good for
• Content Management Systems (CMS), Blogging Platforms
• Web Analytics and Real-Time Analytics
• E-commerce Applications
• Event Logging
• Bad for
• Complex transactions
• Querying against varying aggregates

Wide Column Databases

Wide Column
• A wide-column (or column-family) store keeps data in tables, rows and columns
• But the names and formats of the columns can vary from row to row
• It can be interpreted as a two-dimensional key-value store
• A column-family contains columns of related data
• It is a key-value pair, where the key is mapped to a value that is a set of
columns
• Each column is a triple consisting of a column name, a value and a timestamp
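
The "two-dimensional key-value store" view can be sketched as nested maps: row key, then column family, then column name, ending in a value with a timestamp. The helper names and the rule that the newest timestamp wins are assumptions made for this example, not the behaviour of a specific product.

```javascript
// row key -> column family -> column name -> { value, timestamp }
const table = new Map();

function putColumn(rowKey, family, column, value, timestamp = Date.now()) {
  if (!table.has(rowKey)) table.set(rowKey, {});
  const row = table.get(rowKey);
  if (!row[family]) row[family] = {};
  const existing = row[family][column];
  // Assumption: keep the newest version of the column, judged by timestamp.
  if (!existing || existing.timestamp <= timestamp) {
    row[family][column] = { value, timestamp };
  }
}

function getColumn(rowKey, family, column) {
  return table.get(rowKey)?.[family]?.[column]?.value;
}

// Rows in the same family need not share the same columns.
putColumn("customer:1", "profile", "name", "Martin", 1);
putColumn("customer:1", "profile", "city", "Chicago", 1);
putColumn("customer:2", "profile", "name", "Pramod", 1);
putColumn("customer:1", "profile", "city", "Boston", 2); // newer timestamp

console.log(getColumn("customer:1", "profile", "city"));
console.log(getColumn("customer:2", "profile", "city")); // undefined: no such column
```

Here `customer:2` simply has no `city` column, which is what "the names and formats of the columns can vary from row to row" means in practice.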

Wide Column

[From NoSQL for Mere Mortals]

Wide Column
• Since columns can be added freely, you can model a list of items by
making each item a separate column
• This is very odd if you think of a column family as a table, but quite
natural if you think of a column-family row as an aggregate
• Cassandra uses the terms "wide" and "skinny."
• Skinny rows have few columns with the same columns used across the many
different rows
• In this case, the column family defines a record type, each row is a record, and each
column is a field
• A wide row has many columns (perhaps thousands), with rows having very
different columns
• A wide column family models a list, with each column being one element in that list

Wide Column
• As they do not allow joins, relationships are represented in denormalized form within a column family
• Both column names and column values can store data

[From NoSQL for Mere Mortals]

Wide Column
• Writes are atomic at the row level, which means that inserting or updating columns for a given row key is treated as a single write that either succeeds or fails
• Scaling an existing Cassandra cluster is a matter of adding more nodes
• As no single node is a primary, when we add nodes to the cluster, we
are improving the capacity of the cluster to support more writes and
reads
• This type of horizontal scaling allows you to have maximum uptime,
as the cluster keeps serving requests from the clients while new
nodes are being added to the cluster

Wide Column
• ONE Consistency
• When a write is received by Cassandra, the data is first recorded in a commit log, then written to an in-memory structure
• A write operation is considered successful once it's written to the commit log and the in-memory structure
• On a read, Cassandra returns the data from the first replica, even if the data is stale
• If the data is stale, subsequent reads will get the latest (newest) data; this process is known as read repair
• This low consistency level is good to use when you do not care if you get stale data and/or if you have high read performance requirements

Wide Column
• QUORUM Consistency
• During write operations, the QUORUM consistency setting means that the
write has to propagate to the majority of the nodes before it is considered
successful, and the client is notified
• Using QUORUM for both read and write operations ensures that a majority of the nodes respond to the read and that the column with the newest timestamp is returned to the client, while the replicas that do not have the newest data are repaired via read repair operations

Wide Column
• ALL Consistency
• All nodes will have to respond to reads or writes, which makes the cluster intolerant to faults: even when only one node is down, the write or read is blocked and reported as a failure. It is therefore up to the system designers to tune the consistency levels as the application requirements change
• Within the same application, there may be different requirements of consistency; they can
also change based on each operation
• for example, showing review comments for a product has different consistency requirements
compared to reading the status of the last order placed by the customer

Wide Column
• This kind of database prioritizes availability by means of peer-to-peer replication
• Consistency and availability are tuned through R, W and N; reads are strongly consistent when
(R + W) > N
where
• W is the minimum number of nodes where the write must be successfully written
• R is the minimum number of nodes that must respond successfully to a read
• N is the number of nodes participating in the replication of data
• http://www.ecyrd.com/cassandracalculator/
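
The condition (R + W) > N can be checked for the consistency levels discussed earlier. The function names are hypothetical and N = 3 is just an example replication factor; the sketch only evaluates the inequality, it does not talk to a cluster.

```javascript
// N: replicas, W: nodes acknowledging a write, R: nodes answering a read.
function isStronglyConsistent(n, w, r) {
  return r + w > n; // read and write sets must overlap in at least one node
}

// A quorum is a majority of the replicas.
function quorum(n) {
  return Math.floor(n / 2) + 1;
}

const n = 3;
const oneOne = isStronglyConsistent(n, 1, 1);                         // ONE writes, ONE reads
const quorumQuorum = isStronglyConsistent(n, quorum(n), quorum(n));   // QUORUM for both
const allOne = isStronglyConsistent(n, n, 1);                         // ALL writes, ONE reads
console.log({ oneOne, quorumQuorum, allOne });
```

With three replicas, ONE/ONE gives 1 + 1 = 2, which is not greater than 3, so a read may miss the latest write; QUORUM/QUORUM gives 2 + 2 = 4 and ALL/ONE gives 3 + 1 = 4, so in both cases every read overlaps with the latest write.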
Wide Column
• Good for
• Content Management Systems (CMS), Blogging Platforms
• Web Analytics
• Event Logging
• Bad for
• ACID transactions
• Not great for prototypes (column family design should be stable)

Graph Databases

• Graph databases are an odd fish in the NoSQL pond
• Most NoSQL databases were inspired by the need to run on clusters, which led
to aggregate-oriented data models
• Graph databases are motivated by a different frustration with relational
databases and thus have an opposite model
• Small records with complex interconnections, that is, a graph data structure of nodes
connected by edges



Graph Databases
• Graph databases specialize in capturing richly connected information and
answering queries such as:
• "find the books in the Databases category that are written by someone whom
a friend of mine likes."
• They do so on a much larger scale than a readable diagram could capture
• This is ideal for capturing any data consisting of complex relationships
such as social networks, product preferences, or eligibility rules
• The fundamental data model of a graph database is very simple:
• nodes connected by edges (also called arcs)
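
To make the model concrete, here is a small in-memory sketch of the query "find the books in the Databases category that are written by someone whom a friend of mine likes". All people, books and relationship names are invented toy data, and the dictionary-based storage is only an illustration, not how a graph database stores its data.

```python
# edges[(source, relationship)] -> set of target nodes
edges = {}


def connect(source, relationship, target):
    """Add a directed, named edge between two nodes."""
    edges.setdefault((source, relationship), set()).add(target)


def targets(source, relationship):
    """Nodes reachable from `source` over one edge of the given kind."""
    return edges.get((source, relationship), set())


# Hypothetical toy data.
connect("me", "FRIEND", "anna")
connect("me", "FRIEND", "bob")
connect("anna", "LIKES", "author_x")
connect("bob", "LIKES", "author_y")
connect("author_x", "AUTHOR_OF", "Intro to Graph Stores")
connect("author_x", "AUTHOR_OF", "Clean Refactorings")
connect("author_y", "AUTHOR_OF", "Cooking Basics")
connect("Intro to Graph Stores", "IN_CATEGORY", "Databases")
connect("Clean Refactorings", "IN_CATEGORY", "Programming")
connect("Cooking Basics", "IN_CATEGORY", "Food")


def recommended_books(person, category):
    """Follow FRIEND -> LIKES -> AUTHOR_OF and keep books in the category."""
    books = set()
    for friend in targets(person, "FRIEND"):
        for author in targets(friend, "LIKES"):
            for book in targets(author, "AUTHOR_OF"):
                if category in targets(book, "IN_CATEGORY"):
                    books.add(book)
    return books


print(recommended_books("me", "Databases"))
```

Each step of the query is one hop along an edge, which is the kind of traversal a graph database is built to make cheap.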



Graph Databases
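• Differences between graph and relational databases:
• A relational database navigates relationships through foreign keys and joins,
which can get expensive
• A graph database lets you query the network of nodes and edges with query
operations designed with this kind of graph in mind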
• Graph databases make traversal along the relationships very cheap
• A large part of this is because graph databases shift most of the work of
navigating relationships from query time to insert time
• This naturally pays off for situations where querying performance is more
important than insert speed
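
The shift of work from query time to insert time can be illustrated by comparing two toy stores. This is a deliberate simplification under stated assumptions: real relational engines use indexes rather than full scans, and real graph stores use more elaborate structures. The class names `EdgeList` and `AdjacencyIndex` are made up for this sketch.

```python
from collections import defaultdict


class EdgeList:
    """Flat rows of (source, target): cheap insert, neighbours need a scan."""

    def __init__(self):
        self.rows = []

    def insert(self, source, target):
        self.rows.append((source, target))

    def neighbours(self, node):
        return {target for source, target in self.rows if source == node}


class AdjacencyIndex:
    """Per-node adjacency sets kept up to date on insert: neighbours is a lookup."""

    def __init__(self):
        self.outgoing = defaultdict(set)
        self.incoming = defaultdict(set)

    def insert(self, source, target):
        # Extra work at insert time: both directions are recorded.
        self.outgoing[source].add(target)
        self.incoming[target].add(source)

    def neighbours(self, node):
        return set(self.outgoing.get(node, ()))

    def predecessors(self, node):
        return set(self.incoming.get(node, ()))


for store in (EdgeList(), AdjacencyIndex()):
    store.insert("john", "sally")
    store.insert("john", "book")
    store.insert("sally", "book")
    print(type(store).__name__, sorted(store.neighbours("john")))
```

Both stores return the same neighbours; the difference is that `AdjacencyIndex` pays on every insert so that each traversal step avoids scanning unrelated edges.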



Neo4j
• Neo4j started as a
graph database using
property graphs
• It has evolved into a
rich ecosystem with
numerous tools,
applications, and
libraries.

[Neo4j Documentation]
Neo4j
• Neo4j Graph Database
• Core product
• Neo4j Aura
• AuraDB database as a service
• Neo4j Sandbox
• Database in the cloud to do initial experiments
• Neo4j Developer Tools
• Desktop / Browser / Data Importer / Operations Manager / Bloom
• Neo4j Graph Examples



Neo4j
• Information is organized as nodes,
relationships, and properties.
• Nodes are the entities in the graph.
• Nodes can be tagged with labels,
representing their different roles in
your domain (for example, Person).
• Nodes can hold any number of key-
value pairs, or properties (for
example, name).
• Node labels may also attach
metadata (such as index or constraint
information) to certain nodes.
[Neo4j Documentation]
Neo4j
• Information is organized as nodes,
relationships, and properties.
• Relationships provide directed,
named connections between two
node entities
• Relationships always have a direction,
a type, a start node, and an end node,
and they can have properties, just like
nodes.
• Nodes can have any number or type
of relationships without sacrificing
performance.
• Although relationships are
always directed, they can be
navigated efficiently in any direction.
[Neo4j Documentation]
Neo4j
• Let’s take an example domain and model it using Neo4j
• Scenario:
• Two people, Sally and John, are friends. Both John and Sally have read the
book, Graph Databases.
• We can use the information given to
• Identify nodes
• Identify labels
• Identify relationships
• Identify properties



Neo4j
• Nodes are often used to
represent entities
• Nodes can have properties
• They can be assigned roles or
types
• You can find nodes by identifying
names in your domain
• In the scenario:
• John
• Sally
• Graph Databases
[Neo4j Documentation]



Neo4j
• Labels are used to group nodes
into sets
• Queries can then work with those
sets rather than with the whole
graph
• We can identify labels with
generic nouns or groups of
things
• In our domain:
• Person
• Book
[Neo4j Documentation]



Neo4j
• A Relationship connects two
nodes
• It has a source node and a target node
• Neo4j can nevertheless navigate it in
both directions
• In our domain:
• John is friends with Sally
• Sally is friends with John
• John has read Graph Databases
• Sally has read Graph Databases
[Neo4j Documentation]



Neo4j
• Properties are name-value pairs
that you store on nodes and
relationships
• To identify properties in our scenario, we can ask:
• When did John and Sally become
friends?
• What is the average rating of the
Graph Databases book?
• Who is the author of the Graph
Databases book?
• How old is Sally?
• How old is John?
[Neo4j Documentation]
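
The four modelling steps (nodes, labels, relationships, properties) can be sketched as plain Python data structures. This only illustrates the property graph model, it is not Neo4j's API: the `Node` and `Relationship` classes, the relationship type names and every property value (ages, ratings, the year of the friendship) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    type: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


# Nodes with labels and properties (all values are made up).
sally = Node({"Person"}, {"name": "Sally", "age": 32})
john = Node({"Person"}, {"name": "John", "age": 27})
book = Node({"Book"}, {"title": "Graph Databases"})

# Directed, typed relationships that carry their own properties.
relationships = [
    Relationship("IS_FRIENDS_WITH", john, sally, {"since": 2013}),
    Relationship("IS_FRIENDS_WITH", sally, john, {"since": 2013}),
    Relationship("HAS_READ", john, book, {"rating": 5}),
    Relationship("HAS_READ", sally, book, {"rating": 4}),
]

# "What is the average rating of the Graph Databases book?"
ratings = [
    rel.properties["rating"]
    for rel in relationships
    if rel.type == "HAS_READ" and rel.end is book
]
average_rating = sum(ratings) / len(ratings)
print(average_rating)

# "Who has read the book?" navigates the relationship against its direction.
readers = sorted(
    rel.start.properties["name"]
    for rel in relationships
    if rel.type == "HAS_READ" and rel.end is book
)
print(readers)
```

Questions about a single entity (How old is Sally?) are answered by node properties, while questions about a connection (When did they become friends? How was the book rated?) are answered by relationship properties.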



Neo4j
• You can use Cypher statements to create the graph
• There are many ways to do it

[Neo4j Documentation]



Neo4j
• You can then view the graph in Neo4j:

[Neo4j Documentation]



Neo4j
• Cypher is Neo4j’s query language
• Created by Neo4j and opened up through the openCypher project
• Cypher provides a visual (ASCII-art) way of matching patterns and
relationships
• (nodes)-[:CONNECTED_TO]->(other nodes)



Neo4j
// Co-actors: people who acted in the same movie as Tom Hanks
MATCH (tom:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(:Movie)<-[:ACTED_IN]-(coActor:Person)
RETURN coActor.name



Neo4j
// Co-co-actors: people who acted with a co-actor of Tom Hanks but never with Tom Hanks himself
MATCH (tom:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(movie1:Movie)
<-[:ACTED_IN]-(coActor:Person)-[:ACTED_IN]->(movie2:Movie)
<-[:ACTED_IN]-(coCoActor:Person)
WHERE tom <> coCoActor
AND NOT (tom)-[:ACTED_IN]->(:Movie)<-[:ACTED_IN]-(coCoActor)
RETURN coCoActor.name



Neo4j
// Top five co-co-actors of Tom Hanks, ranked by the number of connecting paths
MATCH (tom:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(movie1:Movie)
<-[:ACTED_IN]-(coActor:Person)-[:ACTED_IN]->(movie2:Movie)
<-[:ACTED_IN]-(coCoActor:Person)
WHERE tom <> coCoActor
AND NOT (tom)-[:ACTED_IN]->(:Movie)<-[:ACTED_IN]-(coCoActor)
RETURN coCoActor.name, count(coCoActor) as frequency
ORDER BY frequency DESC
LIMIT 5
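
For readers new to pattern matching, the following Python sketch spells out roughly what the query above computes, using nested loops over a toy dataset. It is an approximation for illustration, not how Neo4j evaluates Cypher. The cast data and the function name `co_co_actors` are invented, and the exclusions inside the loops mimic Cypher's rule that one relationship is not traversed twice within a single pattern.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: actor -> movies acted in.
acted_in = {
    "Tom Hanks": {"M1", "M2"},
    "A": {"M1", "M3"},
    "B": {"M2", "M3", "M4"},
    "C": {"M3"},
    "D": {"M3", "M4"},
    "E": {"M1"},
}


def co_co_actors(start, acted_in, limit=5):
    """Rank people two hops away from `start` by the number of connecting paths."""
    cast = defaultdict(set)  # movie -> actors
    for actor, movies in acted_in.items():
        for movie in movies:
            cast[movie].add(actor)

    # People who share a movie with `start` (the NOT ... clause of the query).
    direct = {actor for movie in acted_in[start] for actor in cast[movie]} - {start}

    frequency = Counter()
    for movie1 in acted_in[start]:
        for co_actor in cast[movie1] - {start}:
            for movie2 in acted_in[co_actor] - {movie1}:
                for candidate in cast[movie2] - {co_actor}:
                    if candidate != start and candidate not in direct:
                        frequency[candidate] += 1  # one matched path
    return frequency.most_common(limit)


print(co_co_actors("Tom Hanks", acted_in))
```

Four nested loops are needed for a pattern that Cypher expresses in one MATCH clause, which is the readability argument for a pattern-based query language.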



Neo4j
// Paths linking Tom Hanks to Tom Cruise through a shared co-actor, provided the two never acted in the same movie
MATCH (tom:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(movie1:Movie)
<-[:ACTED_IN]-(coActor:Person)-[:ACTED_IN]->(movie2:Movie)
<-[:ACTED_IN]-(cruise:Person {name: 'Tom Cruise'})
WHERE NOT (tom)-[:ACTED_IN]->(:Movie)<-[:ACTED_IN]-(cruise)
RETURN tom, movie1, coActor, movie2, cruise



Graph Databases
• Good for
• Social Networks
• Topology of networked devices
• Recommendation systems
• Credit Fraud Detection
• Bad for
• Bulk operations and updates
• Very large datasets, because graphs are difficult to shard
