Notes After Midterm
TABLE OF CONTENTS
SQL VS NoSQL
SQL Reminder
NoSQL
KEY-VALUE DATABASES
Caching in Databases
Caching in Key-Value Databases
CODING - CACHING IN KEY-VALUE DATABASES
WHAT IS REDIS
LISTS
HASHES
SETS
SORTED SETS
DOCUMENT DATABASES
Comparison to other DB
MongoDB
Data Normalization
COLUMN-FAMILY DATABASES
Row-Oriented vs Column-Oriented
Column-Family Databases
Database Comparisons
Cassandra
GRAPH DATABASES
Graphs Reminder
Components of Graph Databases
Comparison of Databases
Pros and Cons
Examples of Graph DB - Gremlin
PageRank
FUTURE TRENDS
In-Memory Databases
Blockchain
Quantum Computing
Probabilistic data bases
TOY EXAMPLE
CONFUSION MATRIX
EXAMPLE APPLICATIONS
BLOOM FILTERS IN A GENERAL KEY-VALUE STORE
SQL VS NoSQL
SQL Reminder
SQL is all about tables that interact and are connected with each other
Basic Operations:
Limitations of SQL
These are limitations because nowadays we operate at massive scale: thanks to the internet there are
many reads/writes per second (a lot of operations happening very fast), and there are many types of data
(text, media, user ratings, IoT, etc.)
NoSQL
Types of NoSQL
1. Key-Value databases: Like dictionaries, or like a large table with two columns: the key and the value
2. Document databases: Like key-value databases, but the “values” are collections of items, for example words
3. Column-family databases: Like tables in SQL, but different rows can have different columns
4. Graph databases: Structured as nodes (vertices) and edges that connect the nodes with one another
KEY-VALUE DATABASES
- Adding Data
- Overwriting data (changing a value)
- Deleting data
How It works?
Caching in Databases
Cache = Device that stores data for future requests to be served faster
1. Temporal Locality Principle = If certain data is needed now, it will probably be needed again
soon
Example: People looked up “covid” when the pandemic started, so they are likely to look it up again
2. Spatial Locality Principle = If certain data is needed, other data that is located close in
memory/disk will probably be needed
Example: You are printing a[0], a[1], a[2], etc. you are likely to print out a[3]…
Caching strategies
Caches are a tradeoff: need more space but serve results faster
When you evict from the cache, you delete the entry that is least likely to be requested next; for
example, under the temporal locality principle you would delete the oldest Google search
Memory is a cache of disk (through paging). Memory has a cache of its own, and that cache has its
own cache with a smaller capacity but faster access. The smallest, fastest cache is checked first; the
next, larger level is only accessed if the requested data isn’t found there
Making new room in the cache uses the Least Recently Used (LRU) principle:
LRU = Keep cached values sorted by last access time, when you need to make space you
delete the value that has not been used for the longest time
A doubly linked list keeps two references per node, to the previous and to the next
node. This enables list traversal in both directions.
The HEAD element of the doubly linked list points to the most
recently used entry.
Very fast data access: accessing the least recently used item and
updating the cache are operations with a runtime of O(1), so cache size does not impact access time.
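The LRU idea above can be sketched in a few lines of Python. This is only an illustration: the class name LRUCache and the capacity of 4 are assumptions, and collections.OrderedDict stands in for the hash map plus doubly linked list. Here the most recently used entry sits at the end of the ordering rather than at the HEAD, which is just a convention.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: the most recently used key sits at the end of the OrderedDict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                      # cache miss
        self.data.move_to_end(key)           # mark as most recently used, O(1)
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the least recently used entry, O(1)


cache = LRUCache(capacity=4)                 # capacity chosen only for the example
for city, country in [("Madrid", "Spain"), ("Paris", "France"),
                      ("Rome", "Italy"), ("Lisbon", "Portugal")]:
    cache.put(city, country)
cache.get("Paris")                           # Paris becomes the most recently used entry
cache.put("Berlin", "Germany")               # cache is full, so Madrid is evicted
print(list(cache.data))                      # ['Rome', 'Lisbon', 'Paris', 'Berlin']
```

Both get and put only touch the ends of the ordering, which is why the cost does not grow with the cache size.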
CODING CACHE
d = {}
d['Madrid'] = 'Spain'
d['Paris'] = 'France'
d['Rome'] = 'Italy'
d['Lisbon'] = 'Portugal'
print(d['Paris'])
d['Berlin'] = 'Germany'
WHAT IS REDIS
REDIS: open source in-memory data structure store, used as a database, cache and message broker.
Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries,
bitmaps, hyperloglogs, geospatial indexes, and streams. Redis has built-in replication, Lua scripting,
LRU eviction, transactions, and different levels of on-disk persistence, and provides high availability
via Redis Sentinel and automatic partitioning with Redis Cluster.
ENVIRONMENT SETUP
Let us first run Redis without persistence and sending logs to standard output:
It is possible to start Redis with data persistence (--appendonly yes) and exposing port 6379 and data
to host machine:
basic commands
CODE CONSOLE
Set/Get/Delete your first key:
- SET key value [EX seconds|PX milliseconds|KEEPTTL] [NX|XX]
NX = set only if the key does not exist yet; XX = set only if the key already exists
- GET key
- DEL key [key ...]
Check existence:
- EXISTS key [key ...]
Expire Keys:
- Use SET with the EX (seconds) or PX (milliseconds) option to set an expiration. Useful to manage sessions.
- TTL (Time To Live) returns the remaining time, in seconds, before the key expires.
-2 = key does not exist (for example, it has already expired)
-1 = key exists but has no expiration set
- PTTL returns the TTL in milliseconds.
- EXPIRE and PEXPIRE set an expiration on an existing key:
- expire mykey 10 -> key mykey expires in 10 seconds.
- pexpire mykey 10000 -> key mykey expires in 10000 milliseconds.
- PERSIST can be used on a key to remove its expiration time.
- RENAMENX key newkey: renames key to newkey only if newkey does not already exist. If newkey already exists it is not overwritten and the command returns 0.
- TOUCH key [key ...]: sets the last access time of key to now.
- UNLINK key [key ...]: removes key asynchronously, i.e. the memory is reclaimed in a different thread, not in the current one.
- TYPE key: returns the type of the value stored at key (string, list, set, hash...)
STRINGS
CODE CONSOLE
APPEND key value
Appends value to the end of the string stored at key
DECR key & DECRBY key decrement
DECR: decreases the number stored at key by 1
DECRBY: decreases the number stored at key by the given decrement
LISTS
CODE CONSOLE
- LPOP key
Removes and returns the first element
of the list stored at key.
- RPOP key
Removes and returns the last element
of the list stored at key
LINDEX key index
Returns the element at position index in the
list stored at key
HASHES
CODE CONSOLE
HSET key field value [field value ...]
Sets field in the hash stored at key to
value. If key does not exist, a new key
holding a hash is created. If field already
exists in the hash, it is overwritten.
HGETALL key
Returns all fields and values of the hash
stored at key. In the returned value, every
field name is followed by its value, so the
length of the reply is twice the size of the
hash.
- HVALS key
Returns all values in the hash stored
at key
- HKEYS key
Returns all field names in the hash
stored at key
- HEXISTS key field
Returns if field is an existing field in
the hash stored at key.
- HLEN key
Returns the number of fields
contained in the hash stored at key.
- HSETNX key field value
Sets field in the hash stored at key
to value, only if field does not yet exist.
If key does not exist, a new key holding
a hash is created. If field already exists,
this operation has no effect.
- HDEL key field [field ...]
Removes the specified fields from
the hash stored at key. Specified fields
that do not exist within this hash are
ignored. If key does not exist, it is
treated as an empty hash and this
command returns 0.
SETS
CODE CONSOLE
SADD key member [member ...]
Add the specified members to the set
stored at key.
SMEMBERS key
Returns all the members of the set value
stored at key.
SCARD key
Returns the set cardinality (number of
elements) of the set stored at key
SMOVE source destination member
Move member from the set at source to
the set at destination. This operation is
atomic. In every given moment the
element will appear to be a member of
source or destination for other clients.
SRANDMEMBER key [count]
When count is negative, the behavior changes and the command can
return the same element multiple times. In
this case the number of returned elements
is the absolute value of the specified count.
SORTED SETS
Ordered sets where each element is associated with a score (a floating point value)
CODE CONSOLE
ZADD key [NX|XX] [CH] [INCR] score
member [score member ...]
Adds all the specified members
with the specified scores to the
sorted set stored at key.
ZCARD key
Returns the sorted set cardinality
(number of elements) of the sorted
set stored at key.
WEIGHTS lets us specify a multiplication factor for the scores of each input sorted set. WEIGHTS
defaults to 1
AGGREGATE specifies how the scores of an intersection are combined (SUM, MIN or MAX).
AGGREGATE defaults to SUM
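To see what WEIGHTS and AGGREGATE do to the scores, here is a small pure-Python sketch. It does not talk to Redis: zinter is a hypothetical helper, sorted sets are modelled as plain {member: score} dicts, and the sample names and scores are made up.

```python
def zinter(sorted_sets, weights=None, aggregate=sum):
    """Intersect sorted sets given as {member: score} dicts.

    Hypothetical helper that only mimics the WEIGHTS and AGGREGATE options.
    """
    if weights is None:
        weights = [1] * len(sorted_sets)          # WEIGHTS defaults to 1
    common = set(sorted_sets[0])
    for zset in sorted_sets[1:]:
        common &= set(zset)                       # keep members present in every set
    result = {}
    for member in common:
        weighted = [zset[member] * w for zset, w in zip(sorted_sets, weights)]
        result[member] = aggregate(weighted)      # sum (default), min or max
    return result


math_scores = {"ana": 8.0, "bob": 6.0, "eva": 9.0}
art_scores = {"ana": 5.0, "eva": 7.0}

print(zinter([math_scores, art_scores]))                  # SUM with weights 1 and 1
print(zinter([math_scores, art_scores], weights=[2, 1]))  # math counts double
print(zinter([math_scores, art_scores], aggregate=max))   # AGGREGATE MAX
```

"bob" never appears in the result because he is missing from one of the input sets.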
Redis transactions
CODE CONSOLE
MULTI/EXEC blocks
- Transactions are blocks of commands
executed sequentially
- Transactions guarantee that the commands
are executed as a single isolated
operation
- Transactions are atomic: either all
commands are processed, or none
- MULTI: Marks the start of a transaction
block. Subsequent commands will be
queued for atomic execution using EXEC.
- EXEC: Executes all previously queued
commands in a transaction and restores
the connection state to normal.
UNWATCH:
Flushes all the previously watched keys for
a transaction
unwatch does not affect keys which were
watched on a different client!
ERRORS INSIDE TRANSACTIONS
Errors thrown at execution time by syntactically correct commands (for example, a list command applied
to a string key) do not affect the rest of the commands in the transaction
Geospatial features
GEO BASICS
- Redis can store longitude and latitude pairs as objects, and then perform queries on those
objects
- Geospatial objects are stored as 52-bit GeoHash (a string of letters and digits)
- GeoHash allows for efficient geo-queries, enhanced by Redis low latency
- GeoHash is stored as a sorted set
o GeoHash is the score
o Location name is the value of the set member
- Valid longitudes are from -180 to 180 degrees
- Valid latitudes are from -85.05112878 to 85.05112878 degrees
- Areas very near to the Poles are not indexable
- Assumes the Earth is a perfect sphere, which gives a worst-case error of up to 0.5% (consider this in
error-critical applications)
GEO COMMANDS
CODE CONSOLE
GEOADD key longitude latitude member
[longitude latitude member ...]
Adds the specified geospatial items
(longitude, latitude, name) to the specified
key
GEODIST key member1 member2 [m|km|ft|mi]
Returns the distance between two
members in the geospatial index
represented by the sorted set.
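Since the distance is computed assuming a spherical Earth, the kind of calculation behind GEODIST can be approximated by hand with the haversine formula. The sketch below is an illustration only: the function name haversine_m, the sphere radius constant and the city coordinates are assumptions, and arguments are given in (longitude, latitude) order like GEOADD.

```python
import math

EARTH_RADIUS_M = 6372797.560856   # assumed sphere radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


madrid = (-3.7038, 40.4168)   # approximate coordinates, longitude first
paris = (2.3522, 48.8566)

distance_km = haversine_m(*madrid, *paris) / 1000
print(round(distance_km), "km")
```

Converting the result to km, ft or mi is just a division, which is what the optional unit argument of GEODIST selects.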
DOCUMENT DATABASES
Comparison to other DB
Key-Value Databases
You can say that document databases are an extension of key-value databases; the difference
between the two is that in document databases the values connected to keys are documents
Unlike in key-value databases, in document databases you can search among the values
Relational
Remember that relational means SQL-style tables, which are very rigid. Document databases are much more
flexible than relational tables: you don’t need to have the same fields in every document
Relational: Columns in SQL tables point to other columns in other tables like a network
MongoDB
JSON: XML:
Advantages
- Use of JSON documents, one of the most portable formats: many systems speak in terms
of JSON
- Data is flexible and unstructured; if you require changes, it can adapt easily and fast
Disadvantages
- No data validation by default: you can write anything and MongoDB won’t notice (so you can write
things in the wrong places, which might be bad)
- SQL databases are built around transactions (several write operations applied together); MongoDB historically did not offer this across documents, and multi-document transactions were only added in version 4.0
Example
There will always be another field called “_id” that is automatically created for each document. This
field is important and will always appear in the results unless you tell it not to
Coding
The first argument of .find() is the filter: it returns only the documents that match the condition you want
.find({ }, { "X" : 1, "_id" : 0 })
The second argument is the projection: it selects exactly which fields you want returned. Here "X" : 1
says that you want “X” to be shown, and "_id" : 0 says that you don’t want “_id” to be shown.
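To make the filter and the projection concrete without a running MongoDB server, here is a simplified pure-Python stand-in. It is not the real driver: the find function below is hypothetical, supports only equality conditions and inclusion projections, and the sample documents are made up.

```python
def find(collection, condition=None, projection=None):
    """Hypothetical, simplified stand-in for a collection's .find().

    Supports only equality conditions and inclusion projections.
    """
    condition = condition or {}
    results = []
    for doc in collection:
        if not all(doc.get(field) == value for field, value in condition.items()):
            continue                              # document does not match the filter
        if projection is None:
            results.append(dict(doc))             # no projection: return every field
            continue
        shown = {f: doc[f] for f, flag in projection.items() if flag == 1 and f in doc}
        if projection.get("_id", 1) != 0 and "_id" in doc:
            shown["_id"] = doc["_id"]             # "_id" appears unless explicitly set to 0
        results.append(shown)
    return results


people = [
    {"_id": 1, "name": "Ana", "city": "Madrid"},
    {"_id": 2, "name": "Bob", "city": "Paris"},
]

print(find(people, {"city": "Madrid"}, {"name": 1, "_id": 0}))   # [{'name': 'Ana'}]
print(find(people, {}, {"name": 1}))                             # "_id" is still shown
```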
Example:
Examples:
Data Normalization
Normalization
Normalization = To structure a database in multiple collections (the collections can reference other
collections)
This imitates the philosophy of relational databases where there are many tables and they
point to other tables
Denormalization
This is more related to document databases (not so much relational like the normalization
technique was)
- Advantages: Queries are faster; all the information is within reach, so there is no need to jump
between collections to answer a query
- Disadvantages: Bulky (there is repeated information) and harder to keep consistent (if
you want to modify something that is repeated in various places, you have to
modify every copy, and if you miss one the data is no longer consistent)
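The trade-off above can be shown with plain Python dictionaries standing in for collections. The collection names, fields and values below are made up for the illustration.

```python
# Normalized: two collections, orders reference customers by id.
customers = {1: {"name": "Ana", "city": "Madrid"}}
orders_normalized = [
    {"order_id": 100, "customer_id": 1, "item": "book"},
    {"order_id": 101, "customer_id": 1, "item": "lamp"},
]

# Denormalized: the customer data is copied into every order.
orders_denormalized = [
    {"order_id": 100, "customer": {"name": "Ana", "city": "Madrid"}, "item": "book"},
    {"order_id": 101, "customer": {"name": "Ana", "city": "Madrid"}, "item": "lamp"},
]


def city_of_order_normalized(order_id):
    order = next(o for o in orders_normalized if o["order_id"] == order_id)
    return customers[order["customer_id"]]["city"]    # second lookup (the "jump")


def city_of_order_denormalized(order_id):
    order = next(o for o in orders_denormalized if o["order_id"] == order_id)
    return order["customer"]["city"]                  # everything is in one document


# Ana moves: one update when normalized, one update per copy when denormalized.
customers[1]["city"] = "Paris"
orders_denormalized[0]["customer"]["city"] = "Paris"  # the second copy is forgotten on purpose

print(city_of_order_normalized(101))                  # Paris
print(city_of_order_denormalized(100), city_of_order_denormalized(101))  # Paris Madrid
```

The last line shows the consistency problem: the two copies of the same customer now disagree.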
Normalized Denormalized
Summary
COLUMN-FAMILY DATABASES
Row-Oriented vs Column-Oriented
This doesn’t refer to two types of column-family databases, it is simply two different ways of storing
data
Row-Oriented Databases
For example: a row-oriented database with population information would store each person
individually with their Name, ID, City, etc. It groups the data by rows (people) instead of grouping it by
columns (categories). A column-oriented database would instead store
all of the names together, all of the IDs together, etc.
Column-Oriented Databases
Comparison
For queries that only need a few columns, column-oriented storage returns data faster than row-oriented
storage, because you don’t have to go through every single record to extract the information: you just read
one contiguous section of the storage and you have all the values for City, or Age, etc.
Compression is better too, because there are more patterns: you are more likely
to find a pattern if all of the cities are stored next to each other, for example ‘Madrid’ repeated 10
times in a row
To a programmer however they are the same, they are “transparent to the programmer”
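A small Python sketch of the two layouts and of why the column layout compresses well. The sample records are made up, and run-length encoding is used only as one simple example of a compression scheme that benefits from repeated neighbouring values.

```python
from itertools import groupby

rows = [
    ("Ana", 31, "Madrid"),
    ("Bob", 45, "Madrid"),
    ("Eva", 28, "Madrid"),
    ("Leo", 52, "Paris"),
]

# Row-oriented layout: one complete record after another.
row_store = list(rows)

# Column-oriented layout: one list per column.
names, ages, cities = (list(col) for col in zip(*rows))
column_store = {"name": names, "age": ages, "city": cities}


def run_length_encode(values):
    """Collapse runs of equal neighbours into [value, count] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


# Reading one column touches a single contiguous list...
print(column_store["city"])                      # ['Madrid', 'Madrid', 'Madrid', 'Paris']
# ...and repeated neighbours compress well.
print(run_length_encode(column_store["city"]))   # [['Madrid', 3], ['Paris', 1]]
```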
Column-Family Databases
Column-Family DB = Contains groups of columns, each group contains related variables (this is NOT
the same as column-oriented database)
How it Works
- The columns are grouped into “families” of similar columns (by “similar” we mean that they
are often accessed together)
- Each row will have a unique identifier, the identifier is like the ID of the object
- Each row can have several column families
Pros
- Different records can have different columns, they don’t all need to have the same ones (like
different fields in a document database)
- High performance (it is fast to add/update/remove many records that share the same
columns)
- Scalable (You can make the database huge without altering the performance too much)
- Good interface – similar to SQL which makes things easier
Cons
Database Comparisons
Everything is put into value like text, and so you can’t
really search by “city” because the program won’t be able
to find city within the values
Column-Family:
Cassandra
Cassandra is fully distributed – This means that the database is shared among many computers not
just one, and all of them are equally important
To check if a huge table has a specific row ID, you can use probabilistic data structures (bloom filters)
Now activate the keyspace:
USE keyspace_name;
CREATE TABLE keyspace_name.table_name( column1 text PRIMARY KEY, column2 text, column3 text);
If you don’t name a column when doing the insert into, then for that column that row will
get a NULL
DESCRIBE TABLES;
DESCRIBE table_name;
Will give you the columns and details for that table
GRAPH DATABASES
Graphs Reminder
Edges are the lines that connect the vertices (they can be directed or not; if they are directed they are
drawn as arrows)
Degree of a Vertex = Number of edges going to and from the vertex (in both directions)
Example
Cycle = A path that starts and ends at the same vertex
Tree = Connected acyclic graph (a graph without cycles)
- Vertices
- Directed Edges (MUST be directed)
- Vertex Properties
- Edge Properties
Example
Edges: They are directed, and can also have properties such as
“directed” and “acted in”
Comparison of Databases
Graph databases are the most flexible, however this is a tradeoff with efficiency, they might not be
able to handle hundreds of queries at a time
Graph vs Document
Document databases can be translated into graph databases because they are hierarchical (they
have no cycles). However, graph databases usually can’t be turned into document databases
because of the cycles they contain
Think of Facebook: there are cycles because A knows B, who knows C, who knows A. That is a
cycle between the 3 nodes that cannot be represented in a document database. A document
database is like a tree, not like a general graph (trees are graphs,
however, a graph isn’t always a tree)
- Hierarchical information
- Fields can contain other fields (subfields)
Graph vs Relational
Remember relational is like SQL, it is tables that are connected by keys, which means that it might
look like a graph from time to time:
However, it is NOT a graph
Difference:
In SQL, a vertex is like a table, and that table is a collection of records with the same columns
Key-Value
Column-Family
This looks sort of like the relational (SQL) one however with more
column families:
The difference here is that many families are allowed not just one
(however the families are still rigid like in relational tables)
Pro
However, these shouldn’t be queries that are asked very often, Facebook isn’t asking
how many friends each person has every second, it is something more occasional. If
there is a need for constant queries, then graph databases are not efficient
Con
1. There is no Join
There is no real need to join, there are no columns here, there is nothing to join, you
can just add more nodes and edges if you want to add something on
Neo4j
- Based on Java
- Supports billions of vertices
- Uses query language “cypher” (very powerful)
- Has graph visualization tools
Gremlin
graph = TinkerFactory.createModern()
g = graph.traversal()
.outV()
.bothE()              Finds all the incoming and outgoing edges of that vertex
.in()                 Finds the incoming neighbours of the vertex
.out()                Finds the outgoing neighbours of the vertex
.both()               Finds all outgoing and incoming neighbours, but won’t tell you which is which
Math functions        Must be combined with a specific value of the vertices, e.g. .values('X').count()
.count()              Counts the results
.max()                Finds the max value
.repeat(X).times(Y)   Repeats the step X (for example .out()) Y times
.path()               Shows the path followed by the traversal
.limit(X)             Returns at most X results (usually the first X results)
To load an existing graph dataset, you need the path to where the file is stored on your
computer:
g.io("C:/Users/YourUserName/Downloads/air-routes.xml").read().iterate()
PageRank
Each website is a vertex and each link is an edge; the more links going in and out of a
website, the more edges it has
Algorithm:
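One common way to compute PageRank is power iteration with a damping factor. The sketch below is a minimal version under stated assumptions: the damping value of 0.85, the number of iterations, the function name pagerank and the tiny four-page web are all chosen for the example, and pages without outgoing links spread their rank evenly over all pages.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to.

    Every linked page must also appear as a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}          # start with equal rank
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share         # pass rank along each link
            else:
                for target in pages:                  # no outgoing links: spread evenly
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy web, C receives links from three pages and ends up with the highest rank, while D, which nobody links to, ends up with the lowest.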
FUTURE TRENDS
In-Memory Databases
Databases
In-memory Database = A database that is stored entirely in RAM. This works because RAM capacities have
become huge
Changes:
- Advantages: Traditional databases use caches in RAM; if the whole database is
already in RAM, we don’t need them anymore
- Disadvantages: Persistence (the data in RAM disappears when the power turns off, which
is risky)
The usual way to deal with this is to keep a transaction log (journal) that lives on disk
(it is persistent, it does not disappear). It keeps track of all changes in the database
and is synchronized often. When the power comes back on, the database is restored
by reading the journal and redoing the logged changes
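The journal idea can be sketched as a toy key-value store that keeps its data in a Python dict (the "RAM") and appends every change to a file on disk. This is only an illustration: the class name InMemoryDB, the JSON-lines log format and the file name are assumptions, not how any particular product does it.

```python
import json
import os
import tempfile


class InMemoryDB:
    """Toy in-memory key-value store with an append-only log on disk."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}                       # lives in RAM only
        self._replay()

    def _replay(self):
        # On start-up, rebuild the RAM contents from the log, if it exists.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                op, key, value = json.loads(line)
                self._apply(op, key, value)

    def _apply(self, op, key, value):
        if op == "set":
            self.data[key] = value
        elif op == "del":
            self.data.pop(key, None)

    def _log(self, op, key, value=None):
        with open(self.log_path, "a") as log:
            log.write(json.dumps([op, key, value]) + "\n")
            log.flush()
            os.fsync(log.fileno())           # force the entry onto the disk

    def set(self, key, value):
        self._log("set", key, value)         # write to the log first, then to RAM
        self._apply("set", key, value)

    def delete(self, key):
        self._log("del", key)
        self._apply("del", key, None)


log_file = os.path.join(tempfile.mkdtemp(), "journal.log")
db = InMemoryDB(log_file)
db.set("Madrid", "Spain")
db.set("Paris", "France")
db.delete("Paris")

restored = InMemoryDB(log_file)              # simulates a restart after a power loss
print(restored.data)                         # {'Madrid': 'Spain'}
```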
Blockchain
With a normal database you have to trust someone; with blockchain there is no need for trust.
With normal databases you trust either someone else’s code (open source) or, if you run it in a
cloud, the host, etc.
In blockchain all transactions are checked by all participants; it’s like voting by show of hands
- Distributed = All network participants have a copy of the ledger for complete transparency
- Immutable = Any validated record is irreversible; a transaction timestamp is recorded on
each block. Each participant has a private key to sign (encrypt) their own transactions, and
everyone knows the public key to verify (decrypt) others’ transactions
- Anonymous = The identity of participants is anonymous or pseudonymous
- Unanimous = All network participants agree to the validity of each of the records
- Secure = All records are individually encrypted
How it Works
- So far, most cryptocurrencies are largely used for speculation; there is no other mainstream
blockchain application yet
- Criminal Connection: The identity of participants is anonymous
- Low Scalability: Blockchain works fine for a small number of users; however, when there is a
large number of users each transaction might take up to hours to process
- High Energy Consumption: You need a lot of computational power which uses a lot of
electricity
- Lack of Privacy: Blockchain is a public distributed ledger, so what if you want to store
sensitive information? You can’t
- Lots of inertia in favour of traditional DBs versus blockchain databases (millions of programmers use
databases in their code, billions of people use and benefit from databases)
- Lots of established big players, which makes it hard to agree on a universal
standard
Quantum Computing
Every day electronic chips are getting smaller and smaller; however, is there a limit to this?
This would mean the end of Moore’s Law for computing power (not the storage one)
Quantum computers can do many tasks at the same time; they no longer work sequentially
How
In classical computers the information is stored in “bits” that are either 0 or 1; when you add the two
bits b1 and b2 the result will always be b1+b2. However, quantum computers use “qubits” that can
be 0, 1 or both at the same time (superposition), so if you add two qubits q1 and q2 you can get 0, q1,
q2 or q1+q2
Quantum Queries
If you had a tree, and you were trying to look for a specific
number on the tree, a classical computer would look at all
of the nodes of the tree one by one in order from 11 to 6:
Disadvantages/Challenges
Probabilistic data bases
myset = {1, 2, 3}
print(4 in myset)  # prints False with 100% probability
PROBABILISTIC DATA STRUCTURES sometimes give wrong answers (with a small probability). In
exchange, they need much less space. It is a similar idea to the lossless vs. lossy compression choice
Bloom filter
1. Start with an array of m bits, all set to 0
2. Choose k different hash functions. Each hash function maps an item to an integer between 0
and m-1
k = # of hash functions
To insert an item: get its k hashes, then set all those bits to 1.
In this example,
To query an item: get its k hashes, check if all those bits are 1.
To check if w is in the set, we need to check
that the 3 positions to which it maps are 1s.
We can observe that one of them is not,
hence w is not in our set.
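The insert and query rules above fit in a short Python class. This is a sketch, not a production filter: the class name BloomFilter, the sizes m=64 and k=3, and the trick of deriving the k hash functions by salting SHA-256 with an index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                  # number of bits in the filter
        self.k = k                  # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k hash values by salting SHA-256 with the index i (an assumption
        # of this sketch; any k independent hash functions would do).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1      # set all k bits to 1

    def might_contain(self, item):
        # True means "possibly in set"; False means "definitely not in set".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
for word in ["x", "y", "z"]:
    bf.add(word)

print(bf.might_contain("x"))    # True: inserted items are always found
print(bf.might_contain("w"))    # False unless all 3 of its bits collide with set bits
```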
TOY EXAMPLE
m=7
k=2
Let’s insert two elements (n=2): {100, 1000}
We need to pass each number we want to insert through both hash functions (f and g). The
remainder of each operation gives us a position, and we set the bits at those positions to 1.
f(100) = 200 % 7 = 4
g(100) = 300 % 7 = 6
f(1000) = 2000 % 7 = 5
g(1000) = 3000 % 7 = 4
To check if the number 10,000 is in our set, we need to check if its positions from the hash functions
are 1s.
f(10000) = 20000 % 7 = 1
g(10000) = 30000 % 7 = 5
Position 1 holds a zero bit, hence the number is not in the set.
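The toy example can be replayed in Python. The definitions f(x) = 2x % 7 and g(x) = 3x % 7 are inferred from the worked numbers (200 % 7, 300 % 7, and so on), so treat them as an assumption about the two hash functions.

```python
M = 7                         # filter size in bits (m = 7), with k = 2 hash functions


def f(x):
    return (2 * x) % M        # inferred from f(100) = 200 % 7


def g(x):
    return (3 * x) % M        # inferred from g(100) = 300 % 7


bits = [0] * M
for item in (100, 1000):      # insert the two elements
    bits[f(item)] = 1
    bits[g(item)] = 1


def might_contain(x):
    return bits[f(x)] == 1 and bits[g(x)] == 1


print(bits)                                         # [0, 0, 0, 0, 1, 1, 1]
print(might_contain(100), might_contain(10000))     # True False
```

Only three bits end up set, because f(100) and g(1000) both map to position 4.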
CONFUSION MATRIX
A Bloom filter query returns either “possibly in set” or “definitely not in set”:
- When the filter predicts “Positive”, the actual value could be Positive or Negative.
- When the filter predicts “Negative”, the actual value is certainly “Negative”.
Knowing how many items we expect to insert (n), we design (optimize) for a given FP rate (p),
using formulas and simulators to get the optimal k (number of hash functions) and m (filter size).
APPROXIMATE FP RATE
OPTIMAL K
(from kb to mb /1024)
QUESTION 1
After inserting two items (n=2) in an empty Bloom filter, it looks like this:
A) k = 2
B) k = 3
QUESTION 2
You are dealing with a dataset of up to 100,000 items that will be put through a Bloom Filter to test
membership.
If the required False Positive Probability is 0.000001, how much storage do you need to allocate for
the filter and what is the optimal number of hash functions to use (k)?
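A sizing question like this one can be worked through with the standard Bloom filter approximations: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hash functions, and a false positive rate of about (1 - e^(-kn/m))^k. The sketch below plugs in the numbers from the question; the function names are made up, and rounding m up and k to the nearest integer is a choice of this sketch.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k): filter size in bits and number of hash functions."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k


def false_positive_rate(n, m, k):
    """Approximate false positive probability after inserting n items."""
    return (1 - math.exp(-k * n / m)) ** k


n, p = 100_000, 0.000001
m, k = bloom_parameters(n, p)
size_kb = m / 8 / 1024            # bits -> bytes -> KB

print("bits:", m)
print("hash functions:", k)
print("storage in KB:", round(size_kb, 1))
print("resulting FP rate:", false_positive_rate(n, m, k))
```

With these inputs the filter needs roughly 2.9 million bits (about 351 KB) and 20 hash functions.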
EXAMPLE APPLICATIONS
READING FILES OVER SEVERAL DISKS: millions of files