
ADVANCED DATA STRUCTURES AND STORAGE – ANA OCARIZ SANCHEZ NOTES

TABLE OF CONTENTS

SQL VS NoSQL
    SQL Reminder
    NoSQL
KEY-VALUE DATABASES
    Caching in Databases
    Caching in Key-Value Databases
    Coding - Caching in Key-Value Databases
WHAT IS REDIS
    Lists
    Hashes
    Sets
    Sorted Sets
DOCUMENT DATABASES
    Comparison to Other DB
    MongoDB
    Data Normalization
COLUMN-FAMILY DATABASES
    Row-Oriented vs Column-Oriented
    Column-Family Databases
    Database Comparisons
    Cassandra
GRAPH DATABASES
    Graphs Reminder
    Components of Graph Databases
    Comparison of Databases
    Pros and Cons
    Examples of Graph DB - Gremlin
    PageRank
FUTURE TRENDS
    In-Memory Databases
    Blockchain
    Quantum Computing
PROBABILISTIC DATABASES
    Toy Example
    Confusion Matrix
    Example Applications
    Bloom Filters in a General Key-Value Store

SQL VS NoSQL

SQL Reminder

SQL is all about tables that interact and are connected with each other

Basic Operations:

Limitations of SQL

Born in 1970:

- Created for small businesses (not for big data)
- No internet → infrequent data accesses
- "Boring" data: customer names, delivery addresses, etc.

These are limitations because nowadays we operate at massive scale: thanks to the internet there are many reads/writes per second (a lot of code running very fast), and there are many types of data (text, media, user ratings, IoT, etc.).

SQL (relational databases) → inefficient and slow in these cases.

The answer to all of these limitations is "NoSQL".

NoSQL

NoSQL = a collection of technologies that go beyond the traditional limitations of relational databases.

Types of NoSQL

Classified based on the type of data they handle:

1. Key-Value databases: like dictionaries, a large table with two columns: the key and the value
2. Document databases: like key-value, but the "values" are a collection of items, for example words
3. Column-family databases: like tables in SQL, but different rows can have different columns
4. Graph databases: structured as vertices (nodes) and edges that connect the nodes with one another

DE MORGAN'S LAWS: SIMPLIFY LOGIC EXPRESSIONS

- Reducing the number of negated expressions:

(not is_expensive or not is_well_built) == not (is_expensive and is_well_built)

- Flipping inequality relations:

not (price < 9.99 or quantity > 30) == (price >= 9.99 and quantity <= 30)

- Reducing overly complex expressions:

not A or not B or not (A or B)
= not A or not B or (not A and not B)
= not A or (not A and not B) or not B
= not A or not B
= not (A and B)

(The third step drops (not A and not B) by absorption: X or (X and Y) == X.)
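A quick brute-force check of both laws in Python: since each law only involves two booleans, we can test every combination.

from itertools import product

# Verify both De Morgan laws for all four combinations of truth values.
for A, B in product([True, False], repeat=2):
    assert (not A or not B) == (not (A and B))
    assert (not A and not B) == (not (A or B))
print("De Morgan's laws hold for all four combinations")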

KEY-VALUE DATABASES

These behave like dictionaries

In NoSQL there is no fixed type of items: within the same column you can have different types of data (unlike in SQL).

Basic operations supported:

- Adding Data
- Overwriting data (changing a value)
- Deleting data

Example of Key-Value database: “Shelve” in python

Key-value databases can use “namespaces”
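A minimal sketch of the three basic operations using Python's shelve module (the filename is made up):

import shelve

# shelve stores pickled values in an on-disk key-value database.
with shelve.open('capitals') as db:
    db['Madrid'] = 'Spain'             # adding data
    db['Madrid'] = 'Kingdom of Spain'  # overwriting data (changing a value)
    del db['Madrid']                   # deleting data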

How does it work?

They use a tree for the keys, which makes finding a key very fast.

Each node (key) knows where to find its data (value).

Heavy-Duty Example: HDF5

“HDF” = Hierarchical Data Format

- Works for huge datasets
- Does not require loading the entire dataset in memory at the same time
- Optional built-in compression
- Usable from Python
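A small sketch of those points using the h5py package (assuming h5py and numpy are installed; the file and dataset names are made up):

import h5py
import numpy as np

# Write a large array with the built-in gzip compression.
with h5py.File('measurements.h5', 'w') as f:
    f.create_dataset('readings', data=np.random.rand(1_000_000), compression='gzip')

# Read back only a slice: the whole dataset never enters memory.
with h5py.File('measurements.h5', 'r') as f:
    first_ten = f['readings'][:10]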

Caching in Databases

Cache = Device that stores data for future requests to be served faster

Main Problem: How do you know what data will be requested?

1. Temporal Locality Principle = If certain data is needed now, it will probably be needed again
soon

Example: People looking up "covid" when the pandemic started were likely to look it up again.

2. Spatial Locality Principle = If certain data is needed, other data that is located close in
memory/disk will probably be needed

Example: If you are printing a[0], a[1], a[2], etc., you are likely to print a[3] next.

Caching strategies

Caches are a tradeoff → they need more space, but serve results faster.

When deleting from the cache, you delete the entry that is least likely to be requested next; under the temporal locality principle, for example, you would delete the oldest Google search.

50% of computer science is about caching.

Memory is a cache of disk (through paging). Memory has a cache of its own, and that cache has its own cache with a smaller capacity but higher speed; this last cache is only accessed if the requested data isn't found in the first cache.

The CPU also has a cache of memory

Caching in Key-Value Databases

Strategy: keep the database on disk, but cache it in memory.

When the cache is full, how do we make room? Least Recently Used (LRU): keep cached values sorted by last access time; when we need to make room, delete the value that has not been used for the longest time.

Implementation: Hash Map + Doubly Linked List

Doubly linked lists have two references per node, to the previous and the next node → this enables list traversal in both directions.

The HEAD element of the doubly linked list points to the most recently used entry. The TAIL points to the least recently used entry.

Very fast data access: accessing the least recently used item and updating the cache are O(1) operations → cache size does not impact access time.
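A minimal Python sketch of such an LRU cache. Instead of hand-coding the doubly linked list, it uses collections.OrderedDict, which internally combines a hash map with a doubly linked list and therefore gives the same O(1) operations:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # most recently used entries at the end

    def get(self, key):
        if key not in self.data:
            return None                 # cache miss
        self.data.move_to_end(key)      # mark as most recently used, O(1)
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used, O(1)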

CODING - CACHING IN KEY-VALUE DATABASES

CODING CACHE

d = {}
d['Madrid'] = 'Spain'
d['Paris'] = 'France'
d['Rome'] = 'Italy'
d['Lisbon'] = 'Portugal'
print(d['Paris'])
d['Berlin'] = 'Germany'

The slides build this dictionary up one line at a time: each assignment adds an entry, print(d['Paris']) is a read (a cache hit), and the final insertion is the point where a capacity-limited cache would have to evict an entry.
WHAT IS REDIS

REDIS: open source in-memory data structure store, used as a database, cache and message broker.

Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries,
bitmaps, hyperloglogs, geospatial indexes, and streams. Redis has built-in replication, Lua scripting,
LRU eviction, transactions, and different levels of on-disk persistence, and provides high availability
via Redis Sentinel and automatic partitioning with Redis Cluster.

Redis Labs (Enterprise Solution)

Redis Open Source

ENVIRONMENT SETUP

Let us first run Redis without persistence and sending logs to standard output:

docker run --name myredis redis

To terminate your Redis session use SHUTDOWN [NOSAVE|SAVE] → NOSAVE|SAVE applies to the current session. Data that existed prior to your session will remain (unless it was modified in the current session and SHUTDOWN SAVE is used).

It is possible to start Redis with data persistence (--appendonly yes), exposing port 6379 and the data directory to the host machine:

docker run --name myredis -h redis -d -p 6379:6379 -v $PWD/redis/data/:/data redis redis-server --appendonly yes

BASIC COMMANDS

From another terminal, run a command to enter the redis CLI

docker exec -it myredis redis-cli

CODE CONSOLE
Set/Get/Delete your first key:
- SET key value [EX seconds|PX milliseconds|KEEPTTL] [NX|XX] → NX = only set if the key does not exist; XX = only set if it exists
- GET key
- DEL key [key ...]
- EXISTS key [key ...]

Check existence:
- EXISTS key [key ...]

Expire keys:
- Use SET with the EX (seconds) or PX (milliseconds) options for expiration. Useful to manage sessions.
- TTL (Time To Live) returns the time available before expiration:
  -2 = key does not exist (already expired)
  -1 = key without an expiration set
- PTTL returns the TTL in milliseconds!
- EXPIRE and PEXPIRE set an expiration on existing keys:
  - EXPIRE mykey 10 → key mykey expires in 10 seconds
  - PEXPIRE mykey 10000 → key mykey expires in 10000 milliseconds
- PERSIST can be used against a key to remove its expiration time.
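The same session-expiry idea from Python, as a sketch using the redis-py client (assumes a server is reachable on localhost:6379, e.g. the Docker container above; the key name is made up):

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

r.set('session:42', 'javi', ex=10)  # SET with EX: expires in 10 seconds
print(r.ttl('session:42'))          # remaining seconds before expiration
r.persist('session:42')             # remove the expiration
print(r.ttl('session:42'))          # -1: key exists, no expiration set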

PATTERN MATCHING → KEYS command

It uses glob-style patterns:
- h?llo: matches hello, hallo and hxllo
- h*llo: matches hllo and heeeeeello
- h[ae]llo: matches hallo and hello, but not hillo
- h[^e]llo: matches hallo, hillo, hollo, hullo, but not hello
- h[a-b]llo: matches hallo and hbllo

Some other commands:

- RANDOMKEY: return a random key from the currently selected database.
- RENAME key newkey → this potentially overwrites newkey if it existed.
- RENAMENX key newkey → prevents newkey from being overwritten if it already existed, and returns 0.
- TOUCH key [key ...]: set the last access time of key to now.
- UNLINK key [key ...]: remove key asynchronously, i.e. not in the current thread.
- TYPE key: returns the type of the key (String, List, Set, Hash...)

BACKUP AND RESTORE

DUMP key: provides the serialized value for key. Values cannot be restored across Redis versions (for example between Redis 4 and 5).

RESTORE key ttl serialized-value [REPLACE] [ABSTTL] [IDLETIME seconds] [FREQ frequency]

REDIS DATA TYPES

STRINGS

CODE CONSOLE
- APPEND key value → appends value to the end of key.
- INCR key / INCRBY key increment → INCR increases key by 1; INCRBY increases it by the given number.
- DECR key / DECRBY key decrement → DECR decreases key by 1; DECRBY decreases it by the given number.
- INCRBYFLOAT key increment → increase by a decimal number (e.g. 0.5).
- GETSET key value → set a new value and get the previous value printed out.
- MSET key value [key value ...] → set multiple key-value pairs.
- MGET key [key ...] → get multiple keys at once.
- MSETNX key value [key value ...] → like MSET, but only if none of the supplied keys exist.
- GETRANGE key start end → provides substring functionality.
- SETEX key seconds value → set a value with an expiry time. Remaining time can be checked with the TTL command.
- PSETEX key milliseconds value → like SETEX but time in milliseconds.
- SETRANGE key offset value → replaces characters in the current value with value, starting at the offset position.
- STRLEN key → returns the string character count.

LISTS

CODE CONSOLE
- LPOP key → removes and returns the first element of the list stored at key.
- RPOP key → removes and returns the last element of the list stored at key.
  These serve to implement important communication patterns such as queues.
- LTRIM key start stop → trims an existing list so that it will contain only the specified range of elements.
- LSET key index element → sets the list element at index to element.
- LINDEX key index → returns the element at index in the list stored at key.
- LINSERT key BEFORE|AFTER pivot element → inserts element in the list stored at key either before or after the reference value pivot.
- LREM key count element → removes the first count occurrences of elements equal to element from the list stored at key. The count argument influences the operation in the following ways:
  - count > 0: remove elements equal to element moving from head to tail.
  - count < 0: remove elements equal to element moving from tail to head.
  - count = 0: remove all elements equal to element.

HASHES

Maps between string fields and string values. Good for object representation.

CODE CONSOLE
- HSET key field value [field value ...] → sets field in the hash stored at key to value. If key does not exist, a new key holding a hash is created. If field already exists in the hash, it is overwritten.
- HGET key field / HMGET key field [field ...] → returns the values associated with the specified fields in the hash stored at key. For every field that does not exist in the hash, a nil value is returned.
- HGETALL key → returns all fields and values of the hash stored at key. In the returned value, every field name is followed by its value, so the length of the reply is twice the size of the hash.
- HVALS key → returns all values in the hash stored at key.
- HKEYS key → returns all field names in the hash stored at key.
- HEXISTS key field → returns whether field is an existing field in the hash stored at key.
- HLEN key → returns the number of fields contained in the hash stored at key.
- HSETNX key field value → sets field in the hash stored at key to value, only if field does not yet exist. If key does not exist, a new key holding a hash is created. If field already exists, this operation has no effect.
- HDEL key field [field ...] → removes the specified fields from the hash stored at key. Specified fields that do not exist within this hash are ignored. If key does not exist, it is treated as an empty hash and this command returns 0.
- HINCRBY key field increment → increments the number stored at field in the hash stored at key by increment.
- HSTRLEN key field → returns the string length of the value associated with field in the hash stored at key.
- HINCRBYFLOAT key field increment → increments the specified field of a hash stored at key, representing a floating point number, by the specified increment. If the increment value is negative, the hash field value is decremented instead of incremented.

SETS

Unordered collections of strings. A set can store up to 2^32 - 1 (4 billion+) members!

CODE CONSOLE
- SADD key member [member ...] → adds the specified members to the set stored at key.
- SISMEMBER key member → returns whether member is a member of the set stored at key.
- SMEMBERS key → returns all the members of the set value stored at key.
- SCARD key → returns the set cardinality (number of elements) of the set stored at key.
- SMOVE source destination member → moves member from the set at source to the set at destination. This operation is atomic: at any given moment the element will appear to be a member of source or destination for other clients.
- SPOP key [count] → removes and returns one or more random elements from the set value stored at key.
- SREM key member [member ...] → removes the specified members from the set stored at key.
- SDIFF key [key ...] → returns the members of the set resulting from the difference between the first set and all the successive sets.
- SDIFFSTORE destination key [key ...] → equal to SDIFF, but instead of returning the resulting set, it is stored in destination.
- SINTERSTORE destination key [key ...] → equal to SINTER (the intersection of all the given sets), but instead of returning the resulting set, it is stored in destination.
- SUNION key [key ...] → returns the members of the set resulting from the union of all the given sets.
- SUNIONSTORE destination key [key ...] → equal to SUNION, but instead of returning the resulting set, it is stored in destination.
- SRANDMEMBER key [count] → when called with just the key argument, returns a random element from the set value stored at key. When called with the additional count argument, returns an array of count distinct elements if count is positive. If called with a negative count, the behavior changes and the command can return the same element multiple times; in this case the number of returned elements is the absolute value of the specified count.

SORTED SETS

Ordered sets where each element is associated with a score (floating point value).

CODE CONSOLE
- ZADD key [NX|XX] [CH] [INCR] score member [score member ...] → adds all the specified members with the specified scores to the sorted set stored at key.
- ZRANGE key start stop [WITHSCORES] → returns the specified range of elements in the sorted set stored at key. The elements are considered to be ordered from the lowest to the highest score. Lexicographical order is used for elements with equal score.
- ZREVRANGE key start stop [WITHSCORES] → like ZRANGE, but in DESCENDING score & lexicographical order.
- ZINCRBY key increment member → increments the score of member in the sorted set stored at key by increment.
- ZCARD key → returns the sorted set cardinality (number of elements) of the sorted set stored at key.
- ZREM key member [member ...] → removes the specified members from the sorted set stored at key. Non-existing members are ignored.
- ZSCORE key member → returns the score of member in the sorted set at key.
- ZRANK key member → returns the rank of member in the sorted set stored at key, with the scores ordered from low to high. The rank (or index) is 0-based, which means that the member with the lowest score has rank 0.
- ZREVRANK key member → like ZRANK, but in DESCENDING score & lexicographical order.
- ZCOUNT key min max → returns the number of elements in the sorted set at key with a score between min and max.
- ZPOPMAX key [count] → removes and returns up to count members with the highest scores in the sorted set stored at key.
- ZPOPMIN key [count] → removes and returns up to count members with the lowest scores in the sorted set stored at key.
- ZRANGEBYSCORE key min max [WITHSCORES] [LIMIT offset count] → returns all the elements in the sorted set at key with a score between min and max (including elements with score equal to min or max). The elements are considered to be ordered from low to high scores.
- ZRANGEBYLEX key min max [LIMIT offset count] → when all the elements in a sorted set are inserted with the same score, in order to force lexicographical ordering, this command returns all the elements in the sorted set at key with a value between min and max. If the elements in the sorted set have different scores, the returned elements are unspecified.
- ZINTERSTORE destination numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE SUM|MIN|MAX] → computes the intersection of numkeys sorted sets given by the specified keys, and stores the result in destination.
- ZUNIONSTORE destination numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE SUM|MIN|MAX] → like ZINTERSTORE, but computes the union.

WEIGHTS lets us specify a multiplication factor for the score of each input sorted set. WEIGHTS defaults to 1.

AGGREGATE specifies how the results of an intersection or union are aggregated (SUM, MIN or MAX). AGGREGATE defaults to SUM.
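A typical use of sorted sets is a leaderboard. A sketch with redis-py (the key and member names are made up; assumes a local Redis):

import redis

r = redis.Redis(decode_responses=True)

r.zadd('scores', {'ana': 120, 'javi': 95, 'luis': 150})
r.zincrby('scores', 10, 'javi')                    # ZINCRBY: 95 -> 105
print(r.zrange('scores', 0, -1, withscores=True))  # ascending by score
print(r.zrevrank('scores', 'luis'))                # 0: the top scorer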

REDIS TRANSACTIONS

CODE CONSOLE
MULTI / EXEC blocks:
- Transactions are blocks of commands executed sequentially.
- Transactions guarantee that commands are executed as a single isolated operation.
- Transactions are atomic: either all commands are processed, or none.
- MULTI: marks the start of a transaction block. Subsequent commands will be queued for atomic execution using EXEC.
- EXEC: executes all previously queued commands in a transaction and restores the connection state to normal.

WATCH key [key ...] → marks the given keys to be watched for conditional execution of a transaction. Provides check-and-set behavior.

UNWATCH → flushes all the previously watched keys for a transaction. UNWATCH does not affect keys which were watched on a different client!

ERRORS INSIDE TRANSACTIONS

Syntax errors invalidate the whole transaction. However, errors thrown by syntactically correct commands do not affect the rest of the commands in the transaction.

There is no support for rollbacks in Redis.
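From Python, redis-py exposes MULTI/EXEC through a pipeline. A minimal sketch (the key names are made up):

import redis

r = redis.Redis(decode_responses=True)

# transaction=True wraps the queued commands in MULTI ... EXEC,
# so both writes are applied as a single atomic operation.
pipe = r.pipeline(transaction=True)
pipe.set('account:a', 100)
pipe.set('account:b', 200)
pipe.execute()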

GEOSPATIAL FEATURES

GEO BASIS

- Redis allows storing longitude and latitude pairs as objects, and then performing queries on those objects
- Geospatial objects are stored as a 52-bit GeoHash (a string of letters and digits)
- GeoHash allows for efficient geo-queries, enhanced by Redis's low latency
- GeoHash data is stored as a sorted set:
  - the GeoHash is the score
  - the location name is the value of the set member
- Valid longitudes are from -180 to 180 degrees
- Valid latitudes are from -85.05112878 to 85.05112878 degrees
- Areas very near the Poles are not indexable
- Assumes the Earth is a sphere, with a worst-case error of up to 0.5% (consider this in error-critical applications)

GEO COMMANDS

CODE CONSOLE
- GEOADD key longitude latitude member [longitude latitude member ...] → adds the specified geospatial items (longitude, latitude, name) to the specified key.
- GEOHASH key member [member ...] → returns valid GeoHash strings representing the position of one or more elements in a sorted set value representing a geospatial index.
- GEOPOS key member [member ...] → returns the positions (longitude, latitude) of all the specified members of the geospatial index represented by the sorted set at key.
- GEODIST key member1 member2 [m|km|ft|mi] → returns the distance between two members in the geospatial index represented by the sorted set.
- GEORADIUSBYMEMBER key member radius m|km|ft|mi [WITHCOORD] [WITHDIST] [WITHHASH] [COUNT count] [ASC|DESC] [STORE key] [STOREDIST key] → exactly like GEORADIUS, with the sole difference that instead of taking a longitude and latitude value as the center of the area to query, it takes the name of a member already existing inside the geospatial index represented by the sorted set.
- GEORADIUS key longitude latitude radius m|km|ft|mi [WITHCOORD] [WITHDIST] [WITHHASH] [COUNT count] [ASC|DESC] [STORE key] [STOREDIST key] → returns the members of a sorted set populated with geospatial information using GEOADD, which are within the borders of the area specified by the center location and the maximum distance from the center (the radius).
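A quick sketch of GEOADD/GEODIST from Python. execute_command is used so the example does not depend on any particular redis-py version of the geo helpers; the coordinates are approximate:

import redis

r = redis.Redis(decode_responses=True)

# GEOADD key longitude latitude member
r.execute_command('GEOADD', 'cities', -3.70, 40.42, 'Madrid')
r.execute_command('GEOADD', 'cities', 2.35, 48.86, 'Paris')
print(r.execute_command('GEODIST', 'cities', 'Madrid', 'Paris', 'km'))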
DOCUMENT DATABASES

Comparison to other DB

Key-Value Databases

You can say that document databases are an extension of key-value databases; the difference between the two is that in document databases the values connected to keys are documents.

Document = A structured collection of data fields (for example words or properties)

Examples: JSON, HTML, XML, etc.

Unlike in Key-Value, in document databases you can search among the values

Example: “Find all blog posts written by X author”

Relational

Remember that relational is like SQL, so it is very rigid. Document databases are much more flexible than relational tables: you don't have to have the same fields for every document.

Relational: columns in SQL tables point to other columns in other tables, like a network.

Document: fields of each document can contain and index other fields. Emphasis on "contain": it is like an organized tree; there is a hierarchy in the information.

Examples of document databases:

- A webpage that has nested widgets in it: each widget is like a document that has other things within, and may contain other widgets within.
- Blog posts: posts have metadata, text, embedded elements, user comments, likes, etc. These are things within each blog post (a document), and the whole website holds all of these documents.

MongoDB

MongoDB is a document database: a "document-oriented NoSQL database".

You can use it from Python with "pymongo".

It works with JSON files → JSON is a text-based serialization format.

JSON uses lists and dictionaries like Python; XML doesn't. (The original figure shows the same record side by side as JSON and as XML.)

Advantages

- Use of JSON files, which are one of the most portable formats, many systems speak in terms
of JSON
- Data is flexible and unstructured, if you require changes, it can adapt easily and fast

Disadvantages

- No data validation, you can write anything, and MongoDB won’t realize it (so you can write
things in the wrong places and it won’t notice, which might be bad)
- SQL allows multiple write operations at once (transactions), MongoDB doesn’t

Example

In this example we have a database in which there is a collection called "students" that contains other documents, where each of these documents is a student.

Notice that each document has structured fields: name, age, city, sports, etc.

However, this doesn't mean that every single document needs to have the same fields; for example, they don't all have "sports".

There will always be another field called "_id" that is automatically created for each document. This field is important and will always appear unless you tell it not to.

Coding

Command → what it does:

database.collection → references the collection you want to work with (where you want to perform your actions).

.find({ "X": "Y" }) → like a select or filter: it looks for all of the documents that match the condition you want.

.find({ ... }, { "X": 1, "_id": 0 }) → the second part (the projection) selects exactly what you want returned: the 1 says "show X", the 0 says "don't show _id". The _id field is special: if you put a 1 next to X then automatically nothing else will show, EXCEPT _id, which always shows unless you tell it not to.

Pattern matching (can be combined with lists using $in):
- /^pattern/ → begins with
- /pattern$/ → ends with
- /pattern/ → contains
- /pattern/i → not case sensitive

.aggregate([ { "$group": { "_id": "", "X": { "$sum": 1 } } } ]) → does the math for you: it looks at all of the values, groups them, and prints out "X = ...". Summing 1 creates a running count for every document it finds. You could also say, for example, "maximum": { "$max": "$salary" } → this looks at everyone's salary, finds the highest, and prints out maximum = ...

Example: average age of students.

$match → you add this stage before the $group in order to apply a condition, so that you aren't looking at all of the students in the database, only at the ones that MATCH a specific condition. Example: average age of handball players.
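Both ideas as a pymongo sketch (the database, collection and field names are hypothetical; assumes a local MongoDB):

from pymongo import MongoClient

db = MongoClient('mongodb://localhost:27017')['school']

# Filter + projection: names of students from Madrid, hiding _id.
for doc in db.students.find({'city': 'Madrid'}, {'name': 1, '_id': 0}):
    print(doc)

# $match before $group: average age of handball players only.
pipeline = [
    {'$match': {'sports': 'handball'}},
    {'$group': {'_id': None, 'avg_age': {'$avg': '$age'}}},
]
print(list(db.students.aggregate(pipeline)))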

Examples:

1. "Greater than 25"
2. NOT lower than or equal to 25

They mean the same thing, but are expressed differently (for example $gt versus $not combined with $lte).

Data Normalization

Normalization

Normalization = structuring a database in multiple collections (the collections can reference other collections).

This imitates the philosophy of relational databases, where there are many tables and they point to other tables.

- Advantages: compact (there is no redundant information; only the _id is redundant)
- Disadvantages: some queries can be slow

Denormalization

Denormalization = structuring a database in fewer collections.

Information is nested and repeated where necessary.

This is closer to the document database philosophy (not so much the relational one, like the normalization technique was). See the sketch below.

- Advantages: faster queries; all the information is within reach, so there is no need to jump between collections to answer a query
- Disadvantages: bulky (there is repeated information); harder to preserve data consistency (if you want to modify something that is repeated in various places, you have to modify it several times and it may end up inconsistent)
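A sketch of the same (made-up) data both ways, written as Python dictionaries:

# Normalized: two collections, students reference cities by _id.
students = [{'_id': 1, 'name': 'Ana', 'city_id': 10},
            {'_id': 2, 'name': 'Luis', 'city_id': 10}]
cities = [{'_id': 10, 'name': 'Madrid', 'country': 'Spain'}]

# Denormalized: one collection, the city is nested and repeated.
students_denormalized = [
    {'_id': 1, 'name': 'Ana', 'city': {'name': 'Madrid', 'country': 'Spain'}},
    {'_id': 2, 'name': 'Luis', 'city': {'name': 'Madrid', 'country': 'Spain'}},
]

Renaming Madrid in the normalized layout is one update; in the denormalized layout it must be repeated in every student document.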

Summary: normalization gives a compact, consistent database with potentially slower queries; denormalization gives faster queries at the cost of bulk and consistency.
COLUMN-FAMILY DATABASES

Row-Oriented vs Column-Oriented

This doesn't refer to two types of column-family databases; it is simply two different ways of storing data.

Row-Oriented Databases

Row-oriented = a table-based (SQL) database that is stored row-major on disk.

For example: if you had a database with population information, it would store each person individually with their name, ID, city, etc. A column-oriented database would instead store all of the names together, all of the IDs together, etc. Instead of grouping by columns (categories), a row-oriented database groups by rows (people).

Column-Oriented Databases

The opposite of row-oriented: data is stored by category, not by person.

Comparison

With columns you get data faster than with rows, because you don't have to go through every single datapoint to extract the information; you just take out a whole section of the storage and then you have all the information about city, or age, etc.

This makes column-oriented storage more useful for data analytics.

It also compresses better, because there are more patterns: you are more likely to find a pattern if all of the cities are stored next to each other; you might have 'Madrid' stored 10 times in a row. See the sketch below.

To a programmer, however, they are the same; the layout is "transparent to the programmer".
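A toy illustration of the two layouts in Python (the data is made up):

# Row-oriented: one tuple per person.
rows = [('Ana', 21, 'Madrid'),
        ('Luis', 23, 'Madrid'),
        ('Javi', 22, 'Paris')]

# Column-oriented: one list per category; runs of equal values
# ('Madrid', 'Madrid') are what makes compression effective.
columns = {'name': ['Ana', 'Luis', 'Javi'],
           'age':  [21, 23, 22],
           'city': ['Madrid', 'Madrid', 'Paris']}

# An analytics query (average age) touches a single column:
print(sum(columns['age']) / len(columns['age']))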

Column-Family Databases

Column-Family DB = contains groups of columns, where each group contains related variables (this is NOT the same as a column-oriented database).

A column-oriented database is SQL; this is NoSQL.

Examples: BigTable (Google), HBase (Apache), Cassandra (Apache)

How it Works

- The columns are grouped into "families" of similar columns (similar meaning that they are often accessed together)
- Each row has a unique identifier, which is like the ID of the object
- Each row can have several column families

Keyspace = collection of rows

Pros

- Different records can have different columns, they don’t all need to have the same ones (like
different fields in a document database)
- High performance (it is fast to add/update/remove many records that share the same
columns)
- Scalable (You can make the database huge without altering the performance too much)
- Good interface – similar to SQL which makes things easier

Cons

- No JOIN, and limited GROUP BY and ORDER BY support

Database Comparisons

Key-Value vs Document: in a key-value database everything is put into the value, like text, so you can't really search by "city"; the program won't be able to find "city" within the values. In a document database you can.

Column-Family: the columns are grouped, and they become a sort of subgroup.

Flexibility: Relational (SQL) < Column-family < Document

Cassandra

Popular Column-family Database

In Cassandra each column family is a table

It uses SQL-like syntax (CQL - Cassandra Query Language).

There is no "join" → solved through denormalization.

Remember denormalization means duplicating data, which is OK because Cassandra is scalable, so it doesn't matter if you have huge duplicates. Cassandra optimizes for data retrieval, not storage.

Scalability = read/write speed increases linearly as machines are added.

Cassandra is fully distributed. This means that the database is shared among many computers, not just one, and all of them are equally important.

To check if a huge table has a specific row ID, you can use probabilistic data structures (bloom filters).

CREATE KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

^ Now activate the keyspace:

USE keyspace_name;

CREATE TABLE keyspace_name.table_name (column1 text PRIMARY KEY, column2 text, column3 text);

Then to add rows into the table:

INSERT INTO keyspace_name.table_name (column1, column2) VALUES ('value_to_insert_column1', 'value_to_insert_column2');

If you don't name a column when doing the INSERT INTO, that column gets a NULL for that row.

DESCRIBE TABLES;

Will tell you the names of the tables.

DESCRIBE table_name;

Will give you the columns and details for that table.

Print all table contents:

SELECT * FROM keyspace_name.table_name;

Get only one column:

SELECT column_name FROM keyspace_name.table_name WHERE topic = 'Column Family Databases' ALLOW FILTERING;

GRAPH DATABASES

Graphs Reminder

Edges and Paths

Path = a set of edges that connect two vertices

- Direct path = a path that follows arrow directions
- Indirect path = a path that ignores arrow directions

Path length = number of edges in a path

Degrees and Vertices

Vertices (nodes) are the points: A, B, C, D

Edges are the lines that connect them (they can be directed or not; if they are directed, they are arrows).

Degree of a vertex = number of edges going to and from the vertex (in both directions)

- Incoming degree = number of edges going into it (arrow pointing at it)
- Outgoing degree = number of edges going out of it (arrow pointing away from it)

Example

Vertex | Incoming Degree | Outgoing Degree | Degree
A      | 0               | 2               | 2
B      | 1               | 1               | 2
C      | 2               | 0               | 2
D      | 1               | 1               | 2

Trees and Cycles

Cycle = a path that repeats vertices (can be direct or indirect)

Tree = acyclic graph = a graph without cycles

Components of Graph Databases

- Vertices
- Directed Edges (MUST be directed)
- Vertex Properties
- Edge Properties

Example

Vertices: there are two different kinds of vertices, movies and people, and they have different properties such as date or year born. They don't necessarily all have the same properties, or the same categories within the properties: someone might have died and therefore have a date of death, while someone else might not have one yet.

Edges: they are directed, and can also have properties such as "directed" and "acted in".

Comparison of Databases

Graph databases are the most flexible; however, this is a tradeoff with efficiency: they might not be able to handle hundreds of queries at a time.

Flexibility (lowest to highest):

Relational < Column-Family < Document < Graph

Graph vs Document

Document databases can be translated into graph databases because they are hierarchical and have no cycles. However, graph databases generally can't be turned into document databases, because of the cycles.

Think of Facebook: there are cycles, because A knows B, who knows C, who knows A. That is a cycle between the 3 nodes that cannot be represented in a document database. A document database is like a tree, not like a general graph (trees are graphs, but a graph isn't always a tree).

The graph structure of a document database is a tree:

- Hierarchical information
- Fields can contain other fields (subfields)

Graph vs Relational

Remember relational is like SQL, it is tables that are connected by keys, which means that it might
look like a graph from time to time:

However, it is NOT a graph.

Difference: in SQL, a vertex would be like a table, and that table is a collection of records with the same columns. In a graph, a vertex is ONE record, NOT a collection of records.

If you tried to model a relational database as a graph it would look like this:

- Only two types of vertices (columns and rows)
- Rigid structure: each Type1 node is connected to each Type2 node, and they aren't connected to each other

Key-Value

This is what a key-value database would look like as a graph:

- Each key points to ONE value only
- Keys don't point to each other
- Values don't point to each other

Column-Family

This looks sort of like the relational (SQL) case, but with more column families: many families are allowed, not just one (though the families are still rigid, like relational tables).

Pros and Cons

Pros

1. Great for network-like data.
   Example: migration and geographical mobility, social relationships.
2. Can efficiently answer queries such as "Is there a path connecting A and B?"
   However, these shouldn't be queries that are asked very often. Facebook isn't asking how many friends each person has every second; it is something more occasional. If there is a need for constant queries, then graph databases are not efficient.

Cons

1. There is no JOIN.
   There is no real need to join: there are no columns here, nothing to join. You can just add more nodes and edges if you want to add something.

Examples of Graph DB - Gremlin

Neo4j

- Based on Java
- Supports billions of vertices
- Uses query language “cypher” (very powerful)
- Has graph visualization tools

Gremlin

- Part of the Apache TinkerPop Suite
- Great and easy language to traverse graphs
- Interactive shell

Load a built-in graph:

graph = TinkerFactory.createModern()

g = graph.traversal()

Command → what it does:

- .V() → "gets" all of the vertices; on its own it will print them all out
- .V(X) → specific to vertex X
- .values('X') → prints out the property X of that vertex
- .E() → prints out all of the edges
- .E(X) → specific to edge X (by its id)
- .has('X', 'Y') → works like a filter: asks for the vertices (or edges) whose property X equals Y
- .out('X') → follows the edges that LEAVE the vertex and have the edge label X (examples of edge labels: 'knows' or 'directed')
- gt(X) → not a . command like the others; it goes within things like .has() and works like a filter: "greater than X"
- .inE() → finds the incoming edges
- .outV() → from an edge, returns the vertex the edge goes out of (its source)
- .bothE() → finds all the incoming and outgoing edges of the vertex
- .in() → finds the incoming neighbors of the vertex
- .out() → finds the outgoing neighbors of the vertex
- .both() → finds all outgoing and incoming neighbors, but won't tell you which is which
- Math functions → must be combined with a specific value of the vertices, e.g. .values('X').count()
- .count() → counting
- .max() → finds the max value
- .repeat(X).times(Y) → repeats the command X (for example .out()) Y times
- .path() → finds the path
- .limit(X) → prints out at most X results (usually the top X results)

Loading an already-made graph database (you need the path to where it lives on your computer):

g.io("C:/Users/YourUserName/Downloads/air-routes.xml").read().iterate()

The quoted string is the path.

PageRank

It was the first algorithm used by Google to rank websites by importance.

It is computed using sparse matrices.

Each website is a vertex, and each link is an edge; the more links coming in and out of a website, the more edges it has.

Algorithm (see the sketch below):

1. Start on a random vertex
2. Move to a random neighbor, following the arrows
3. Repeat many times
4. Count how many times we visited each vertex

Hubs = the most frequently visited vertices (important). Vertices that are rarely visited are not important; they appear at the end of Google.
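A minimal random-walk sketch of these four steps on a made-up link graph (it ignores the damping/teleport factor used in the real algorithm):

import random

links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}
visits = {page: 0 for page in links}

page = random.choice(list(links))
for _ in range(100_000):
    visits[page] += 1
    page = random.choice(links[page])  # follow a random outgoing link

# Hubs = the most visited vertices.
print(sorted(visits.items(), key=lambda kv: -kv[1]))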

FUTURE TRENDS

In-Memory Databases

Moore's Law for Storage Capacity

- The number of transistors in an integrated circuit doubles every two years.
  This holds for storage capacity, but not for IO throughput, measured in IOPS (every time you get data from your disk, that is 1 IO; IOPS is the number of IOs you can do per second).
- Memory (RAM) capacity increases exponentially and its cost decreases exponentially.

Databases

In-memory Database = a database that is stored entirely in RAM. This works because RAM has become huge.

Remember that RAM is a temporary storage area.

Changes:

- Advantages: traditional databases use caches in RAM; if the whole database is already in RAM, we don't need them anymore.
- Disadvantages: persistence (the data in RAM disappears when the power turns off, which is risky).
  The way to deal with this is to keep a transaction log that lives on disk (it does not disappear; it is persistent). It keeps track of all changes in the database and synchronizes often → when the power comes back on, the database is restored by reading the journal and replaying the logged changes.

Blockchain

Blockchain = a distributed, peer-to-peer storage and contractual system (examples: Bitcoin, the Ethereum blockchain).

It is like a ledger keeping track of transactions.

With a normal database there must be trust; with blockchain there is no need for trust. With normal databases you trust either someone else's code (open source), or you run it in a cloud (trust the host), etc.

In blockchain all transactions are checked by all participants; it's like voting by show of hands.

Distributed Ledger Technology

- Distributed = all network participants have a copy of the ledger, for complete transparency
- Immutable = any validated record is irreversible; a transaction timestamp is recorded on each block. Each participant has a private key to sign their own transactions, and everyone knows the public key to verify others' transactions
- Anonymous = the identity of participants is anonymous or pseudonymous
- Unanimous = all network participants agree on the validity of each of the records
- Secure = all records are individually encrypted

How it Works

Uses "P2P" → peer to peer:

1. You change the data → this change is packaged in a "block"
2. You send the change to others (peers)
3. Each peer sends it on to other peers
4. Everyone checks that the change is valid
5. The block is added to the chain (using a checksum and a digital signature)

Advantage: nobody can touch your data.

Disadvantages: data is very hard to remove once it is out in the open.

- So far, cryptocurrencies are largely used for speculation; there is no other mainstream blockchain application yet
- Criminal connection: the identity of participants is anonymous
- Low scalability: blockchain works fine for a small number of users, but with a large number of users each transaction might take hours to process
- High energy consumption: you need a lot of computational power, which uses a lot of electricity
- Lack of privacy: blockchain is a public distributed ledger, so what if you want to store sensitive information? You can't
- Lots of inertia favoring traditional DBs versus blockchain databases (millions of programmers use databases in their code; billions of people use and benefit from databases)
- Lots of established big players, which makes it hard to agree on a universal standard

Quantum Computing

Everyday electronic chips are getting smaller and smaller; however, is there a limit to this?

Reaching that limit would mean the end of Moore's Law for computing power (not the storage one).

Quantum computers can do many tasks at the same time; they no longer think linearly.

How

In classical computers, information is stored in "bits" that are either 0 or 1; when you add two bits b1 and b2, the result will always be b1+b2. Quantum computers, however, use "qubits" that can be 0, 1 or both at the same time, so if you add two qubits q1 and q2 you can get 0, q1, q2 or q1+q2.

Quantum Queries

If you were looking for a specific number in a tree, a classical computer would look at all of the nodes of the tree one by one, in order. A quantum computer, however, can check many nodes at the same time: instead of visiting the children one after another, it can explore several branches of the tree simultaneously.

Disadvantages/Challenges

- Many technical problems with building these computers
- We can only build them with around 127 qubits for now (some believe these computers will never become useful; others believe we are simply at the bottom of Moore's Law for these new computers)

PROBABILISTIC DATABASES

Until now, you have studied deterministic data structures only. They tell you whether an object is present or not, with 100% certainty:

myset = {1, 2, 3}
print(4 in myset)  # prints False with 100% probability

PROBABILISTIC DATA STRUCTURES sometimes give wrong answers (with a small probability). In exchange, they need much less space. It is a similar idea to the lossless vs. lossy compression choice.

Bloom filter

Example: the "BLOOM FILTER" (a special kind of set) → it can only answer yes/no queries like "Is item X in the set?"

HOW DOES IT WORK?

1. Create a vector of m bits, initially all 0.
2. Choose k different hash functions → each hash function maps an item to an integer between 0 and m-1.

m = length of the vector → determines the size of the filter
k = number of hash functions

To insert an item: get its k hashes, then set all those bits to 1.

In this example, k = 3 → 3 hash functions (x, y, z); m = 18 → with zero-indexing, the last position is 17. Wherever each hash points, set a 1.

CLASH: 2 hash functions map to the same position.

To query an item: get its k hashes, then check whether all those bits are 1. To check if w is in the set, we check the 3 positions to which it maps. We can observe that one of them is not 1, hence w is not in our set.

PROS AND CONS

- The filter does not store its items; it only knows whether an item is present or not!
- You cannot query "print all items contained in the set".
- Can give false positives (FP): claiming that X is in the set when in fact it isn't. Happens rarely (~1% of all negatives).
- Can only add items, not remove or update them (see Cuckoo filters).
- Needs much less space than deterministic methods (hash tables, trees):
  - ~10 bits per item (the size of the item doesn't matter) for 1% FP
  - ~17 bits per item for 0.1% FP
- Can test non-membership in O(k), where k is the number of hash functions. Very fast for very large n.

TOY EXAMPLE

m=7

k=2

Let's insert two elements (n=2): {100, 1000}

We pass each number we want to insert through both hash functions (f and g). The REMAINDER of each operation gives us a position, and we set those positions to 1.

Let’s insert 100:

f(100) = 200 % 7 = 4

g(100) = 300 % 7 = 6

Let’s insert 1000:

f(1000) = 2000 % 7 = 5

g(1000) = 3000 % 7 = 4

Now, let’s query an element: 10,000.

To check if the number 10,000 is in our set, we need to check if its positions from the hash functions
are 1s.

f(10000) = 20000 % 7 = 1

g(10000) = 30000 % 7 = 5

→ there is one zero bit, hence the number is not in the set.
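The same toy example in Python, with m = 7 and the two hash functions written as f(x) = 2x % 7 and g(x) = 3x % 7:

m = 7
bits = [0] * m
hashes = [lambda x: (2 * x) % m, lambda x: (3 * x) % m]

def insert(item):
    for h in hashes:
        bits[h(item)] = 1   # set every position the item hashes to

def might_contain(item):
    return all(bits[h(item)] == 1 for h in hashes)

insert(100)                   # sets bits 4 and 6
insert(1000)                  # sets bits 5 and 4 (a clash on position 4)
print(might_contain(10000))   # False: bit 1 is still 0
print(might_contain(100))     # True ("possibly in set")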

CONFUSION MATRIX

Bloom filter queries return either "possibly in set" or "definitely not in set":

- When the filter predicts "Positive", the actual answer could be Positive or Negative.
- When the filter predicts "Negative", the actual answer is surely Negative.

Knowing how many items we expect to insert (n), we design (optimize) for a given FP rate (p), using the formulas below (or simulators) to get the optimal k (number of hash functions) and m (filter size).

APPROXIMATE FP RATE: p ≈ (1 − e^(−k·n/m))^k

OPTIMAL K: k = (m/n) · ln 2

VECTOR (ARRAY) SIZE: m = −n · ln(p) / (ln 2)^2  (in bits; divide by 1024 to go from Kb to Mb)

QUESTION 1

After inserting two items (n=2) in an empty Bloom filter, it looks like this:

One of these is correct: which one?

A) k = 2

B) k = 3

QUESTION 2

You are dealing with a dataset of up to 100,000 items that will be put through a Bloom Filter to test
membership.

If the required False Positive Probability is 0.000001, how much storage do you need to allocate for
the filter and what is the optimal number of hash functions to use (k)?
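One way to apply the sizing formulas above to this question (a sketch; the printed numbers are rounded):

from math import ceil, log

n = 100_000   # expected items
p = 0.000001  # required false-positive probability

m = ceil(-n * log(p) / log(2) ** 2)  # filter size in bits
k = round(m / n * log(2))            # optimal number of hash functions

print(m, 'bits ~', round(m / 8 / 1024), 'KB')  # ~2.88M bits ~ 351 KB
print(k, 'hash functions')                     # ~20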

EXAMPLE APPLICATIONS

SPELL CHECKING: 10 million words in the dictionary
→ Goal: check whether a word is correct or not

Easy solution: keep all 10M words in memory.
→ Size = 10,000,000 × (8 bytes/word) = 80 MB

Better solution: use a Bloom filter.
- For 1% FP: size = 10,000,000 × (10 bits/item) = 100M bits = 12.5 MB
- For 0.1% FP: size = 10,000,000 × (17 bits/item) = 170M bits = 21.25 MB
- We will have a tiny chance of letting an incorrect word go undetected.

CONTENT HOST (MEDIUM.COM): ~100M users, ~50M posts
→ Goal: know if a user U has already read a post P

MALICIOUS LINK DETECTION (Google Chrome): billions of websites
→ Goal: know if a website has been listed as malicious in the past

READING FILES SPREAD OVER SEVERAL DISKS: millions of files
→ Goal: know which disk contains a certain file

DATABASES (Cassandra): billions of rows & columns
→ Goal: know if a table contains a row ID

BLOOM FILTERS IN A GENERAL KEY-VALUE STORE

- A Bloom filter can be used to speed up answers in a key-value store system.
- Values are stored on a disk, which has slow access times; Bloom filter decisions are much faster.
- However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall, answer speed is better with the Bloom filter than without it.
