Big Data Analytics
Georgios Gousios
2018-01-26
Contents
1 Preface
5.4 Consistency
6 Distributed databases
6.1 Replication
6.2 Partitioning
6.3 Transactions
6.4 Distributed transactions
List of Figures
2.1 Instagram
2.2 Facebook
2.3 Data growth rate
2.4 The big data landscape
2.5 Importance of data
4.1 A hash table
4.2 Warning: Java code ahead!
4.3 Types of joins
4.4 Immutability
4.5 A tree
4.6 The tree after addition
9.1 Ken Thompson and Dennis Ritchie, the original authors of Unix
9.2 Brian Kernighan and Rob Pike, authors of the homonymous seminal book
9.3 Richard Stallman, founder of the Free Software movement and co-author of many of the Unix tools we use on Linux
9.4 The Unix way
Chapter 1
Preface
TI2736-B: Big Data Processing is a second-year BSc course at TU Delft that,
as its title says, aims to teach students how to process big data. It is part
of the “Data Processing” variant (variantblok), which also includes courses
like Data Mining and Computational Intelligence.
In January 2017, I took over the course; my colleague who taught it before
me had already done a fantastic job with it. I nevertheless decided to follow
a different route: instead of presenting specific systems or technologies, I
would focus mostly on how big data systems are programmed.
Acknowledgments
A lot of people helped me when I was writing this book. Wouter Zorgdrager
Georgios Gousios
Copyright information
Chapter 2
Big and fast data
• 2 Billion users
• 1.32 Billion active users per day
• 300 million photos per day (136k/min)
• Every min: 510k comments, 293k status updates
2.2 The many Vs of big data
Variety
• Structured data: SQL tables, images, format is known
Large-scale computing
Not a new discipline:
• Cray-1 appeared in the late ’70s
• Physicists used supercomputers for simulations in the ’80s
• Shared-memory designs are still in large-scale use (e.g. TOP500 supercomputers)
What is new?
Large scale processing on distributed, commodity computers, enabled by
advanced software using elastic resource allocation.
Software (not HW!) is what drives the Big Data industry
Figure by Banko and Brill, 2001. They showed that simple algorithms per-
form better than complex ones when the data is big enough.
Chapter 3
Languages for big data processing
• Python is interpreted
• Python is indentation sensitive: blocks are delimited by consistent indentation (conventionally 4 spaces).
3.3 Declarations
Scala
val a: Int = 5
val b = 5
b = 6 // re-assignment to val
var a = "Foo"
a = "Bar"
a = 4 // type mismatch
a = ImportantClass(...)
def bigger(x: Int, y: Int,
           f: (Int, Int) => Int): Int = {
  f(x, y)
}
Python
def bigger(x, y, f):
return f(x, y)
# The type of a is inferred
a = Foo(3,2)
print a.x
a.x = "foo"
print a.x
Scala
class Foo(val x: Int,
var y: Double = 0.0)
trait Printable {
val s: String
def asString() : String
}
Python
class Foo():
def __init__(self, x, y):
self.x = x
self.y = y
class Bar(Foo):
def __init__(self, x, y, z):
Foo.__init__(self, x, y)
self.z = z
p1 == p2 // True
Case classes are blueprints for immutable objects. We use them to repre-
sent data records. Scala automatically implements hashCode and equals for
them, so we can compare them directly.
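As a minimal sketch (the Point class here is hypothetical, not from the text), a case class gives us structural equality and cheap immutable copies:
case class Point(x: Int, y: Int)

val p1 = Point(1, 2)
val p2 = Point(1, 2)
p1 == p2                  // true: equals/hashCode compare the fields, not the references
val p3 = p1.copy(y = 5)   // "modifying" a record returns a new immutable value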
value match {
// Match on a value, like if
case 1 => "One"
// Match on the contents of a list
case x :: xs => "The remaining contents are " + xs
// Match on a case class, extract values
case Email(addr, title, _) => s"New email: $title..."
// Match on the type
case xs : List[_] => "This is a list"
// With a pattern guard
case xs : List[Int] if xs.head == 5 => "This is a list of integers"
case _ => "This is the default case"
}
Reading ahead
This is by no means a complete introduction to either programming language.
Please read more here:
• Scala documentation
• Python documentation
Pick one and become good at it!
• BSc student? Pick Scala
• Minor student? Pick Python
Chapter 4
Programming for big data
4.1 Basic data types
In this section, we review the basic data types we use when processing data.
Types of data
• Unstructured: Data whose format is not known
– Raw text documents
– HTML pages
• Semi-Structured: Data with a known format.
– Pre-parsed data to standard formats: JSON, CSV, XML
• Structured: Data with known formats, linked together in graphs or
tables
– SQL or Graph databases
– Images
D: What types of data are more convenient to process?
Sequences / Lists
Sequences or Lists or Arrays represent consecutive items in memory
In Python:
a = [1, 2, 3, 4]
In Scala
val l = List(1,2,3,4)
Basic properties:
• Size is bounded by memory
• Items can be accessed by an index: a[1] (Python) or l(3) (Scala)
• Items can only be inserted at the end (append)
• Can be sorted
Sets
Sets store values, without any particular order, and no repeated values.
scala> val s = Set(1,2,3,4,4)
s: scala.collection.immutable.Set[Int] = Set(1, 2, 3, 4)
Basic properties:
• Size is bounded by memory
• Can be queried for containment
• Set operations: union, intersection, difference, subset
Maps or Dictionaries
Maps (also called Dictionaries or Associative Arrays) are collections of (k, v)
pairs in which each key k appears only once.
Some languages have built-in support for dictionaries:
a = {'a' : 1, 'b' : 2}
Basic properties:
• One key always corresponds to one value.
• Accessing a value given a key is very fast (≈ 𝑂(1))
Nested data types: Graphs
A graph data structure consists of a finite set of vertices or nodes, together
with a set of unordered pairs of these vertices for an undirected graph or a
set of ordered pairs for a directed graph.
• Nodes can contain attributes
If we parse the above JSON in almost any language, we get a series of nested
maps
Map(id -> 5542101946,
type -> PushEvent,
actor -> Map(id -> 801183.0, login -> tvansteenburgh),
repo -> Map(id -> 4.2362423E7, name -> juju-solutions/review-queue)
)
Relations
Q: How can we get a list of buildings and the people that work there?
. . .
Key/Value pairs
A key/value pair (or KV) is a special type of Map, where a key k may be
associated with more than one value.
Key/Value pairs are usually implemented as a Map whose keys are of a
sortable type K (e.g. Int) and values are a Set of elements of type V.
val kv = Map[K, Set[V]]()
K and V are flexible: this is why the key/value abstraction is central to NoSQL databases.
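A minimal sketch of this abstraction, with assumed example data: a key of a sortable type maps to a set of values, and adding a value produces a new map.
val kv = Map(1 -> Set("a", "b"), 2 -> Set("c"))
// add a value under key 1; the result is a new map, kv itself is unchanged
val kv2 = kv.updated(1, kv.getOrElse(1, Set.empty[String]) + "d")
// kv2: Map(1 -> Set(a, b, d), 2 -> Set(c))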
4.2 Functional programming in a nutshell
A function has a side effect if it modifies some state outside its scope or
has an observable interaction with its calling functions or the outside world
besides returning a value.
As a general rule, any function that returns nothing (void or Unit) has a
side effect!
Examples of side effects
• Setting a field on an object: OO is not FP!
• Modifying a data structure in place: In FP, data structures are always
persistent.
• Throwing an exception or halting with an error: In FP, we use types
that encapsulate and propagate erroneous behaviour
• Printing to the console, reading user input, or reading from and writing
to files or the screen: In FP, we encapsulate external resources in monads.
How can we write code that does something useful given those restrictions?
From OO to FP
Buying coffee in OO
class Cafe {
  def buyCoffee(cc: CreditCard, p: Payments): Coffee = { ... }
}
Here, charging the credit card through the payments service happens as a side effect inside buyCoffee.
Buying 10 coffees
Calling buyCoffee 10 times charges the card 10 times as a side effect, and testing it requires a real (or mocked) payments service. In FP, buyCoffee instead returns the charge as a value, together with the coffee:
class Cafe {
def buyCoffee(cc: CreditCard): (Coffee, Charge) = {
val cup = new Coffee()
(cup, Charge(cc, cup.price))
}
}
class Cafe {
def buyCoffee(cc: CreditCard): (Coffee, Charge) = { ... }
def buyCoffees(cc: CreditCard, num: Int): Seq[(Coffee, Charge)] =
(1 to num).map(_ => buyCoffee(cc))
. . .
This example was adapted from the (awesome) FP in Scala book, by Chiusano
and Bjarnason
Higher-Order functions
A higher order function is a function that can take a function as an argument
or return a function.
class Array[A] {
// Return elements that satisfy f
def filter(f: A => Boolean) : Array[A]
}
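For example (a sketch, not from the text), we can pass an anonymous function or a named function value to filter:
val xs = Array(1, 2, 3, 4, 5)
xs.filter(x => x % 2 == 0)          // Array(2, 4): the predicate is passed as an argument
val isSmall: Int => Boolean = _ < 3 // functions are values and can be stored
xs.filter(isSmall)                  // Array(1, 2)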
In tools like Spark and Flink, we always express computations in a lazy man-
ner. This allows for optimizations before the actual computation is executed
# Word count in PySpark
text_file = sc.textFile("words.txt")
counts = text_file \
.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("results.txt")
getArgument("foo").
flatMap(processArgument).
getOrElse(new Result("default"))
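To make the snippet above self-contained, here is one possible (assumed) set of definitions: both helpers may fail, so both return an Option, and flatMap propagates the None case automatically.
case class Result(value: String)

def getArgument(name: String): Option[String] = sys.env.get(name)
def processArgument(arg: String): Option[Result] =
  if (arg.nonEmpty) Some(Result(arg.toUpperCase)) else None

val r: Result = getArgument("foo").
  flatMap(processArgument).
  getOrElse(Result("default"))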
. . .
public class Converter {
  public Integer toInt(String a) throws NumberFormatException {
    return Integer.parseInt(a);
  }
  public String toString(Integer a) throws NullPointerException {
    return a.toString();
  }
}
We now need to encode error handling, so that the compiler can check that we handle all possible error cases.
Typical usage
val it = Array(1,2,3,4).iterator
while(it.hasNext) {
val i = it.next + 1
println(i)
}
Observation
Observation allows us to process datasets of (almost) unbounded size, where
the data source controls the processing rate.
// Consumer
trait Observer[A] {
def onNext(a: A): Unit
def onError(t: Throwable): Unit
def onComplete(): Unit
}
// Producer
trait Observable[A] {
def subscribe(obs: Observer[A]): Unit
}
Typical usage
Observable.from(1,2,3,4,5).
map(x => x + 1).
subscribe(x => println(x))
Observable.from_([1,2,3,4]).map(lambda x: x + 1) # Python
4.4 Operations
Operations are transformations, aggregations or cross-referencing of data
stored in data types. All container data types can be iterated over.
• Conversion: Transform every data item from one format to another
– Celsius to Kelvin
– € to $
• Filtering: Only present data items that match a condition
– All adults from a list of people
– Remove duplicates
• Projection: Only present parts of each data item
– From a list of cars, only display their brand
from random import randint

people = []
genders = ['Male', 'Female', 'Other']
for i in range(1000):
p = {'id': i,
'age': randint(10,80),
'height': randint(60, 200),
'weight': randint(40, 120),
'gender': genders[randint(0,2)] }
people.append(p)
## people[0]: {'gender': 'Male', 'age': 55, 'id': 0, 'weight': 62, 'height': 63}
Conversion
def to_m(person):
person['height'] = person['height'] * 1.0 / 100
return person
## people[0]: {'gender': 'Male', 'age': 55, 'id': 0, 'weight': 62, 'height': 0.63}
Projection
Projection allows us to select parts of a Tuple, Relation or other nested data
type for further processing. To implement it, we iterate over all items of a
collection and apply a function that extracts the desired fields.
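A short sketch in Scala (the Car type and data are assumed): projecting each record onto one of its fields is just a map with an extraction function.
case class Car(brand: String, model: String, price: Int)

val cars = List(Car("Fiat", "Panda", 9000), Car("BMW", "i3", 35000))
val brands = cars.map(_.brand)   // List("Fiat", "BMW"): keep only the brand field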
Filtering
To filter values from a list, we traverse a collection and apply a predicate
function to each individual element.
filter(xs: [A], f: (A) → Boolean): [A]
Q: How can we implement filter?
. . .
def filter(xs, pred):
result = []
for i in xs:
if pred(i):
result.append(i)
return result
def total_weight(acc, person):
    return acc + person['weight']
Then we can calculate the total weight of all people in our collection as follows
total = reduce(total_weight, people, 0)
# or equivalently, using an anonymous function
total = reduce(lambda x,y: x + y['weight'], people, 0)
## (((((((((0 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)
## (1 + (2 + (3 + (4 + (5 + (6 + (7 + (8 + (9 + 0)))))))))
reduceR != reduceL
The answer to the previous question is: it depends on whether the reduction
operation is associative and commutative.
An op ∘ is commutative iff x ∘ y = y ∘ x, and associative iff (x ∘ y) ∘ z = x ∘ (y ∘ z).
## 2.7557319224e-06
print reduceR(lambda x, y: x * 1.0 / y, range(1,10), 1)
## 2.4609375
Distinct assumes that items have identities we can compare; in Python, this
is what the __hash__() and __eq__() methods provide (similar to hashCode()
and equals() in Java).
Aggregation functions Aggregation functions have the following generic
signature:
f: [A] → Number
Their job is to reduce sequences of elements to a single measurement. Some
examples are:
• Mathematical functions: min, max, count
• Statistical functions: mean, median, stdev
Grouping
Grouping splits a sequence of items to groups given a classification function.
groupBy(xs: [A], f: A → K): Map[K, [A]]
def group_by(classifier, xs):
result = dict()
for x in xs:
k = classifier(x)
if k in result.keys():
result[k].append(x)
else:
result[k] = [x]
return result
def number_classifier(x):
if x % 2 == 0:
return "even"
else:
return "odd"
a = [1,2,3,4,5,6,7]
print group_by(number_classifier, a)
print avg_age_per_gender
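The computation behind avg_age_per_gender is not shown above; the following Scala sketch (with assumed data) shows how grouping and an aggregation combine to produce such a per-group average:
case class Person(age: Int, gender: String)

val people = List(Person(25, "Male"), Person(40, "Female"), Person(31, "Female"))
val avgAgePerGender = people.
  groupBy(_.gender).                                                  // Map[String, List[Person]]
  map { case (g, ps) => (g, ps.map(_.age).sum.toDouble / ps.size) }
// Map(Male -> 25.0, Female -> 35.5)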
KV databases / systems
KV stores are the most common form of distributed database.
What KV systems enable us to do effectively is to process data locally (e.g.,
per key) before re-distributing it for further processing. Keys are
naturally used to aggregate data before distribution. They also enable (dis-
tributed) data joins.
The most common data structure in big data processing is key-value pairs.
. . .
groupByKey: Group the values for each key into a single sequence.
groupByKey(kv: [(K, V)]): [(K, [V])]
. . .
reduceByKey: Combine all elements mapped by the same key into one
reduceByKey(kv: [(K, V)], f: (V, V) → V): [(K, V)]
. . .
join: Return a sequence containing all pairs of elements with matching keys
join(kv1: [(K, V)], kv2: [(K, W)]): [(K, (V, W))]
Joining datasets
First attempt
. . .
In Scala, flatMap has special syntactic support: for-comprehensions are translated into flatMap, map and filter calls.
def deanAddresses2: Seq[(Dean, Addr)] = {
for (
d <- deans;
v <- addr.filter(a => a.k == d.k)
) yield (d, v)
}
Types of joins
4.7 Immutability
COW is the basis for many operating system mechanisms, such as process
creation (forking), while many new filesystems (e.g. BTRFS, ZFS) use it as
their storage format.
COW enables systems with multiple readers and few writers to efficiently
share resources.
Immutable data structures
Immutable or persistent data structures always preserve the previous version
of themselves when they are modified [Okasaki, 1999].
With immutable data structures, we can:
• Avoid locking while processing them, so we can process items in parallel
• Maintain and share old versions of data
• Seamlessly persist the data on disk
• Reason about changes in the data
They come at a cost of increased memory usage (data is never deleted).
Scala has both mutable and immutable versions of many common data struc-
tures. If in doubt, use immutable.
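As a small illustration (using Scala's built-in immutable List rather than the tree in the example below): "adding" an element creates a new version that shares the existing nodes instead of copying them.
val v1 = List(2, 3, 4)
val v2 = 1 :: v1   // List(1, 2, 3, 4); the tail of v2 is v1 itself, no copying
// v1 is unchanged and both versions remain usable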
Example: immutable tree
Image credits
• Transformation image
• Hashtable image
• Join types
Chapter 5
Introduction to distributed systems
• MTTR (mean time to repair) is about 5 minutes
• Redundancy is not very effective
• Most failures are due to misconfiguration
The data is for a professionally managed data centre by a single company.
On the public cloud, failures may affect thousands of systems in parallel.
Timeouts
Timeouts are a fundamental design choice in asynchronous networks: Ethernet,
TCP and most application protocols work with timeouts.
The problem with timeouts is that delays in asynchronous systems are un-
bounded. This can be due to:
• Queueing of packets at the network level, due to high or spiking traffic
• Queueing of requests at the application level, e.g. because the applica-
tion is busy processing other requests
Queues also experience a snowball effect: they grow larger on already busy
systems.
Timeouts usually follow an exponential back-off rule: we double the time we
wait for an answer, up to an upper bound. More fine-grained approaches use
the response times of successful requests to calibrate appropriate timeouts.
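A minimal sketch (assumed, not from the text) of a retry loop with exponential back-off in Scala:
import scala.util.{Failure, Success, Try}

def retryWithBackoff[A](retries: Int, delayMs: Long, maxDelayMs: Long)(op: => A): Try[A] =
  Try(op) match {
    case s @ Success(_) => s
    case Failure(_) if retries > 0 =>
      Thread.sleep(delayMs)   // wait, then try again with a doubled (but bounded) delay
      retryWithBackoff(retries - 1, math.min(delayMs * 2, maxDelayMs), maxDelayMs)(op)
    case f @ Failure(_) => f
  }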
Figure 5.4: Soft Watch At The Moment Of First Explosion, by Salvador Dali
At this point, LT(E6) < LT(E4), but it does not mean that E6 → E4!
Events 4 and 6 are independent.
Vector clocks [Mattern, 1988] can accurately capture the causal order of events.
5.4 Consistency
• strong: at any time, concurrent reads from any node return the same
values
• eventual: if writes stop, all reads will return the same value after a
while
Consensus is the basis upon which we build consistency
The CAP conjecture
By Eric Brewer [Brewer, 2012]: a distributed system can only provide 2 of
the following 3 guarantees
• Consistency: all nodes see the same data at the same time
• Availability: every request receives a response about whether it suc-
ceeded or failed
• Partition tolerance: the system continues to operate despite arbi-
trary partitioning due to network failures
While widely cited, it is only indicative; when the network is working, sys-
tems can offer all 3 guarantees. The real choice is therefore between being
consistent or being available when a partition occurs.
Linearisability
At any time, concurrent reads from any node return the same values. As
soon as writes complete successfully, the result is immediately replicated to
all nodes in an atomic manner and is made available to reads. In that sense,
linearisability is a timing constraint[Herlihy and Wing, 1990].
Note, however, that while a write operation is in flight, the system cannot
return a consistent answer.
Linearisability primitives
All operations last for a time block called a transaction; this involves setting
up a connection to a remote system and executing the command.
• reads: must always return the latest value from the storage.
• writes: change the value of the shared memory; last-one-wins is the
most common strategy to deal with multiple writers
• compare and swap (atomic writes): compares the contents of a memory
location with a given value and, only if they are the same, modifies the
contents of that memory location to a new given value. Concurrent writes
whose comparison value is stale will fail.
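On a single machine, compare-and-swap is what the JDK's atomic classes provide; a small sketch:
import java.util.concurrent.atomic.AtomicInteger

val register = new AtomicInteger(0)
register.compareAndSet(0, 42)   // true: the value was 0, it is now 42
register.compareAndSet(0, 7)    // false: the expected value is stale, the write fails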
Linearisability example
. . .
General advice
• Avoid implementing a distributed system
• Avoid relying on a distributed system
• If the above fail, only use available solutions
Content credits
• Distributed systems, by Miym
• Lamport clock example
• 2 generals problem picture, by Jens Erat
• Images in Raft explanation, by Diego Ongaro
• Linearisability examples are from Kleppmann[Kleppmann, 2017]
Chapter 6
Distributed databases
. . .
. . .
6.1 Replication
Why replicate?
With replication, we keep identical copies of the data on different nodes.
D: Why do we need to replicate data?
. . .
• To allow the system to work, even if parts of it are down
• To have the data (geographically) close to the data clients
• To increase read throughput, by allowing more machines to serve read-
only requests
Replication Architectures
In a replicated system, we have two node roles:
• Leaders or Masters: Nodes that accept writes from clients
• Followers, Slaves or replicas: Nodes that provide read-only access
to data
Depending on how replication is configured, we can see the following archi-
tectures
• Single leader or master-slave: A single master accepts writes, which
are distributed to slaves
• Multi leader or master-master: Multiple masters accept writes,
keep themselves in sync, then update slaves
• Leaderless replication All nodes are peers in the replication network
How does replication work?
The general idea in replicated systems is that when a write occurs on a node,
this write is distributed to the other nodes in one of the following modes:
Statement-based replication: the master ships all write statements to the
slaves, e.g. all INSERT or UPDATE statements, intact. However, this is problematic:
UPDATE foo
SET updated_at=NOW()
WHERE id = 10
NOW() is evaluated independently on each replica, so each slave may record a different timestamp than the master.
Most databases write their data to data structures, such as B+-trees. How-
ever, before actually modifying the data structure, they write the intended
change to an append-only write-ahead log (WAL).
WAL-based replication ships all changes written to the master's WAL to the
slaves as well. The slaves apply the WAL entries to arrive at a consistent copy of the data.
Logical-based replication
The database generates a stream of logical updates for each update to the
WAL. Logical updates can be:
Master
> SHOW MASTER STATUS;
+--------------------+----------+
| File | Position |
+--------------------+----------+
| mariadb-bin.004252 | 30591477 |
+--------------------+----------+
1 row in set (0.00 sec)
Slave
>CHANGE MASTER TO
MASTER_HOST='10.0.0.7',
MASTER_USER='replicator',
MASTER_PORT=3306,
MASTER_CONNECT_RETRY=20,
MASTER_LOG_FILE='mariadb-bin.452',
MASTER_LOG_POS= 30591477;
Master_Host: 10.0.0.7
Master_User: replicator
Master_Port: 3306
Master_Log_File: mariadb-bin.452
Read_Master_Log_Pos: 34791477
Relay_Log_File: relay-bin.000032
Relay_Log_Pos: 1332
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
# User 2
git clone git://.... # t+2
git add foo.c # t+5
##hack hack hack
git commit -m 'Hacked new file' # t+6
git push # fails # t+7
git pull # CONFLICT # t+7
If we replace user with master node, we have exactly the same problem
How to avoid write conflicts?
• One master per session If session writes do not interfere (e.g., data
are only stored per user), this will avoid issues altogether.
• Converge to a consistent state: apply a last-write-wins policy, ordering
writes by timestamp (this may lose data), or report the conflict to the
application and let it resolve it (as git or Google Docs do)
• Use version vectors: modelled after vector clocks, they encode happens-
before relationships at the level of individual objects.
6.2 Partitioning
Why partitioning?
With partitioning, each host contains a fraction of the whole dataset.
The main reason is scalability:
• Queries can be run in parallel, on parts of the dataset
• Reads and writes are spread on multiple machines
Partitioning is always combined with replication. The reason is that without
replication, a node failure would result in irreversible data loss.
How to partition?
The following three strategies are commonly used:
Request routing
To hide partitioning details from the client, most partitioned systems feature
a query router component sitting between the client and the partitions.
The query router knows the employed partitioning scheme and directs re-
quests to the appropriate partitions.
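For example, with hash partitioning the router can compute the target partition directly from the key; a minimal sketch (function name assumed):
// map a record key to one of numPartitions partitions
def partitionFor(key: String, numPartitions: Int): Int =
  ((key.hashCode % numPartitions) + numPartitions) % numPartitions

partitionFor("user-42", 8)   // the same key always routes to the same partition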
6.3 Transactions
Many clients on one database
What could potentially go wrong?
• Many clients try to update the same data store at the same time,
when….
. . .
• the network fails, and then…
. . .
• the database master server cannot reach its network-mounted disk, so…
. . .
• the database tries to fail over to a slave, but it is unreachable, and
then…
. . .
• the application writes timeout.
What is the state of the data after this scenario?
As programmers, we are mostly concerned about the code’s happy path. Sys-
tems use transactions to guard against catastrophic scenarios.
What are transactions?
Isolation level      Dirty reads    Non-repeatable reads    Phantom reads
Read uncommitted     possible       possible                possible
Read committed       prevented      possible                possible
Repeatable read      prevented      prevented               possible
Serializable         prevented      prevented               prevented
Transactions may also span multiple systems; for example, we may try to
remove a record from a database and add it to a queue service in an atomic
way.
The most common mechanism used to deal with distributed atomic commits
is the two-phase commit (2PC) protocol.
2-phase commits
A transaction in 2PC
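The protocol is usually presented as a diagram; as a rough sketch (assumed interfaces, not a real implementation), the coordinator's logic looks like this:
trait Participant {
  def prepare(): Boolean   // phase 1 vote: can this node commit?
  def commit(): Unit
  def abort(): Unit
}

def twoPhaseCommit(participants: Seq[Participant]): Boolean = {
  val votes = participants.map(_.prepare())      // phase 1: ask every participant to vote
  val decision = votes.forall(identity)          // commit only if everyone voted yes
  if (decision) participants.foreach(_.commit()) // phase 2: apply the decision everywhere
  else participants.foreach(_.abort())
  decision
}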
Parallelism
Parallelism is about speeding up computations by utilising clusters of ma-
chines.
• Task parallelism: Distribute computations across different processors
• Data parallelism: Apply the same computation on different data sets
Distributed data parallelism involves splitting the data over several dis-
tributed nodes, where nodes work in parallel, and combine the individual
results to come up with a final one.
Issues with data parallelism
• Latency: Operations are 1.000x (disk) or 1.000.000x (network) slower
than accessing data in memory
• (Partial) failures: Computations on 100s of machines may fail at any
time
This means that our programming model and execution should hide (but not
forget!) those.
Hadoop: Pros and Cons
DryadLINQ
Microsoft’s DryadLINQ combined the Dryad distributed execution engine
with the LINQ language for defining queries.
FlumeJava
PTable<String,Collection<Integer>>
groupedWordsWithOnes = wordsWithOnes.groupByKey();
PTable<String,Integer> wordCounts =
groupedWordsWithOnes.combineValues(SUM_INTS);
What is Spark?
Spark is an open source cluster computing framework that:
• automates distribution of data and computations on a cluster of com-
puters
• provides a fault-tolerant abstraction to distributed datasets
• is based on functional programming primitives
• provides two abstractions to data, list-like (RDDs) and table-like
(Datasets)
In Python
text_file = sc.textFile("odyssey.mb.txt")
counts = text_file.flatMap(lambda line: line.split(" ")). \
map(lambda word: (word, 1)). \
reduceByKey(lambda a, b: a + b)
For Python, Spark uses Py4J, which allows Python programs to access Java
objects in a remote JVM. The PySpark API is designed to do most compu-
tations in the remote JVM; if processing needs to happen in Python, data
must be copied; this incurs a performance penalty.
How to create an RDD?
RDDs can only be created in the following 3 ways
1. Reading data from external sources
val rdd1 = sc.textFile("hdfs://...")
val rdd2 = sc.textFile("file://odyssey.txt")
val rdd3 = sc.textFile("s3://...")
. . .
2. Convert a local memory dataset to a distributed one
val xs: Range[Int] = Range(1, 10000)
val rdd: RDD[Int] = sc.parallelize(xs)
. . .
3. Transform an existing RDD
rdd.map(x => x.toString) //returns an RDD[String]
odyssey.map(_.toLowerCase).
filter(Seq("a", "the").contains(_))
. . .
reduce, fold: Combine all elements to a single result of the same type.
RDD[A].reduce(f: (A, A) → A): A
. . .
aggregate: Aggregate the elements of each partition, and then the results of all the partitions
RDD[A].aggr(init: B)(seqOp: (B, A) → B, combOp: (B, B) → B): B
Examples of RDD actions
How many words are there?
val odyssey = sc.textFile("datasets/odyssey.mb.txt").flatMap(_.split(" "))
odyssey.count // an action: triggers the computation and returns the number of words
Pair RDDs
RDDs can represent any complex data type, if it can be iterated.
Spark treats RDDs of the type RDD[(K,V)] as special, named PairRDDs, as
they can be both iterated and indexed.
Operations such as join are only defined on Pair RDDs, meaning that we
can only combine RDDs if their contents can be indexed.
We can create Pair RDDs by applying an indexing function or by grouping
records:
val rdd = sc.parallelize(List("foo", "bar", "baz")) // RDD[String]
val pairRDD = rdd.map(x => (x.charAt(0), x)) // RDD[(Char, String)]
pairRDD.collect
// Array((f,foo), (b,bar), (b,baz))
val pairRDD2 = rdd.groupBy(x => x.charAt(0)) // RDD[(Char, Iterable(String))]
pairRDD2.collect
//Array((b,CompactBuffer(bar, baz)), (f,CompactBuffer(foo)))
reduceByKey: Merge the values for each key using an associative and commutative reduce function
reduceByKey(f: (V, V) → V): RDD[(K, V)]
. . .
aggregateByKey: Aggregate the values of each key, using the given combine functions and a neutral zero value
aggrByKey(zero: U)(f: (U, V) → U, g: (U, U) → U): RDD[(K, U)]
. . .
join: Return an RDD containing all pairs of elements with matching keys
join(b: RDD[(K, W)]): RDD[(K, (V, W))]
odyssey.keyBy(partOfSpeach).
  aggregateByKey(0)((acc, x) => acc + 1,
                    (x, y) => x + y)
Q: What are the types of ps and as? How can we join them?
. . .
val pairPs = ps.keyBy(_.id)
val pairAs = as.keyBy(_.person_id)
Join types
Given a “left” RDD[(K,A)] and a “right” RDD[(K,B)]
• Inner Join (join): The result contains only records that have the keys
in both RDDs.
• Outer joins (left/rightOuterJoin): The result contains records that
have keys in either the “left” or the “right” RDD in addition to the
inner join results.
left.loj(right): RDD[(K, (A, Option[B]))]
left.roj(right): RDD[(K, (Option[A], B))]
• Full outer join: The result contains records that have keys in any of
the “left” or the “right” RDD in addition to the inner join results.
left.foj(right): RDD[(K, (Option[A], Option[B]))]
key.hashCode() % numPartitions
Partition dependencies
res3: String =
(16) MapPartitionsRDD[3] at map at <console>:30 []
| CartesianRDD[2] at cartesian at <console>:28 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
| ParallelCollectionRDD[1] at parallelize at <console>:24 []
Shuffling
Shuffling is very expensive, as it involves moving data across the network and
possibly spilling it to disk (e.g. when more data arrives at a node than fits in
its memory). Avoid it at all costs!
Shuffling example
Persistence
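The snippet that defines persisted is not reproduced here; an assumed equivalent would be:
val persisted = sc.textFile("datasets/odyssey.mb.txt").
  map(_.toLowerCase).
  persist()   // mark the RDD to be kept in memory after it is first computed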
persisted is now cached. Further accesses will avoid reloading it and apply-
ing the map function.
7.3 RDDs under the hood
A new stage starts whenever we need to shuffle/repartition the data. Stages
(per RDD) are always executed serially. Each stage consists of one or more tasks.
Tasks are the minimum unit of execution; a task applies a narrow-dependency
function on a data partition. The scheduler starts as many tasks as there are
data partitions.
A job graph
We can see how our job executes if we connect to the driver's WebUI (port
4040 on the driver machine).
Here is the graph for the word counting job we saw before.
Requesting resources
Here is an example of how to start an application with a custom resource
configuration.
spark-shell \
--master spark://spark.master.ip:7077 \
--deploy-mode cluster \
--driver-cores 12 \
--driver-memory 5g \
--num-executors 52 \
--executor-cores 6 \
--executor-memory 30g
Fault tolerance
Spark uses RDD lineage information to know which partition(s) to recompute
in case of a node failure.
Recomputing happens at the stage level.
To minimize recompute time, we can use checkpointing. With checkpointing
we can save job stages to reliable storage. Checkpointing effectively truncates
the RDD lineage graph.
Spark clusters are resilient to node failures, but not to master failures. Run-
ning Spark on middleware such as YARN or Mesos is the most common way
to run multi-master setups.
events.count
users.count
or to distributed file systems like HDFS, Amazon S3, Azure Data Lake etc
Optimizing Partitioning
Partitioning becomes an important consideration when we need to run iter-
ative algorithms. Some cases benefit a lot from defining custom partitioning
schemes:
• Joins between a large, almost static dataset and a much smaller, con-
tinuously updated one.
• reduceByKey or aggregateByKey on RDDs with numeric keys benefit
from range partitioning, as the shuffling stage is minimal (or non-existent)
because most of the reduction happens locally.
Broadcasts
From the docs: Broadcast variables allow the programmer to keep a read-
only variable cached on each machine rather than shipping a copy of it with
tasks.
Broadcasts are often used to ship precomputed items, e.g. lookup tables or
machine learning models, to workers so that they do not have to retransfer
them on every shuffle.
With broadcasts, we can implement efficient in-memory joins between a pro-
cessed dataset and a lookup table.
val curseWords = List("foo", "bar") // Use your imagination here!
val bcw = sc.broadcast(curseWords)
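A sketch (assumed) of using the broadcast value inside a task, with odyssey being the word RDD from earlier: every executor reads bcw.value from its local copy instead of receiving the list with each task.
val clean = odyssey.filter(word => !bcw.value.contains(word))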
Accumulators
Sometimes we need to keep track of variables like performance counters,
debug values or line counts while computations are running.
// Bad code
var taskTime = 0L
odyssey.map{x =>
val ts = System.currentTimeMillis()
val r = foo(x)
taskTime += (System.currentTimeMillis() - ts)
r
}
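A sketch (assumed) of the accumulator-based version: the driver can read the aggregated value once an action has run.
val taskTime = sc.longAccumulator("taskTime")
odyssey.map { x =>
  val ts = System.currentTimeMillis()
  val r = foo(x)                                  // foo is the same (hypothetical) function as above
  taskTime.add(System.currentTimeMillis() - ts)
  r
}.count()                                         // accumulators are only updated when an action runs
taskTime.value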
Spark Datasets
//Option 1
people.keyBy(_.id).join(addresses.filter(x => x._2._2.city == "Delft"))
//Option 2
people.keyBy(_.id).join(addresses).filter(x => x._2._2.city == "Delft")
//Option 3
people.keyBy(_.id).cartesian(addresses).filter(x => x._2._2.city == "Delft")
. . .
In Spark SQL, we trade some of the freedom provided by the RDD API to
enable:
The price we have to pay is to bring our data to a (semi-)tabular format and
describe its schema. Then, we let relational algebra work for us.
It can directly connect and use structured data sources (e.g. SQL databases)
and can import CSV, JSON, Parquet, Avro and data formats by inferring
their schema.
A blog post by MySQL experts Percona wanted to find the number of delayed
flights per airline using the air traffic dataset.
On the same server, they used Spark SQL to connect to MySQL, partitioned
the Dataframe that resulted from the connection and ran the query in Spark
SQL. It took 192 seconds!
This was the result of Catalyst rewriting the SQL query: instead of 1 complex
query, SparkSQL ran 24 parallel ones using range conditions to restrict the
examined data volumes. MySQL cannot do this!
The SparkSession
Similarly to normal Spark, SparkSQL needs a context object to invoke its
functionality. This is the SparkSession.
If a SparkContext object exists, it is straightforward to get a SparkSession:
val ss = SparkSession.builder().config(sc.getConf).getOrCreate()
map(_.split(" ")).
map(r => Row(r(0), new Date(r(1)), r(2).toInt,
r(3), r(4)))
or
df = sqlContext.read.csv("/datasets/pullreqs.csv", sep=",", header=True,
inferSchema=True)
or in Scala
df("team_size")
$"team_size" //scala only
• Selection
df.filter(df.team_size.between(1,4)).show()
Joins
Dataframes can be joined irrespective of the underlying implementation, as
long as they share a key.
people = sqlContext.read.csv("people.csv")
department = sqlContext.read.jdbc("jdbc:mysql://company/departments")
• Full outer:
people.join(department, people.deptId == department.id,
how="full_outer")
The key to both features is that the code passed to higher order functions
(e.g. the predicate to filter) is syntactic sugar that generates expression trees.
df.filter(df("team_size") > (3 + 1))
is converted to
df.filter(GreaterThan(
UnresolvedAttribute("team_size"),
Add(Literal(3), Literal(1))))
Optimization
The optimizer uses tree patterns to simplify the AST. For example, the follow-
ing tree:
val ast = GreaterThan(
UnresolvedAttribute("team_size"),
Add(Literal(3), Literal(1)))
The constant-folding rule rewrites Add(Literal(3), Literal(1)) to Literal(4), so the filter becomes a simple comparison of team_size with 4.
Chapter 9
Data processing at the command line
Figure 9.1: Ken Thompson and Dennis Ritchie, the original authors of Unix
The same permission set can be expressed with the octal number 0755.
Filesystem paths
A path is a sequence of directories to reach a certain file, i.e. /home/gousiosg/foo.txt
Paths can be:
• Absolute - starting from the root directory “/” e.g. /var/log/messages
- The system log file
• Relative to the current directory . e.g. if the current directory is /var,
the relative path to the system log is ./log/messages or log/messages
File listing commands
• ls: list files in a directory
– -l: list details
– -a: list hidden files (files that start with .)
• find <dir>: walk through a file hierarchy starting from <dir>
– -type [dfl]: Only display directories, files or links
– -name str: Only display entries whose name matches str
– -{max|min}depth d
File manipulation commands
• touch <file>: Create an empty file named <file> or update the
modification time for the existing file <file>
Documentation
NAME
ls - list directory contents
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is speci-
fied.
Figure 9.2: Brian Kernighan and Rob Pike, authors of the homonymous
seminal book
xargs by default appends each line at the end of cmd. Sometimes, it may
be necessary to insert it in the middle; we then use the -I {} option and {}
as a placeholder:
$ find . -type f -maxdepth 1|xargs -I {} echo File {} is in `pwd`
File ./labcontents.doc is in /Users/gousiosg/Documents/course-material/isrm
File ./Makefile is in /Users/gousiosg/Documents/course-material/isrm
[...]
cut allows us to split a line into columns, given a character, and extract
specific fields.
# Get a list of users and home directories
$ cut -f1,6 -d: /etc/passwd
sed is a domain specific language of its own. You can find a thorough manual
here.
Sorting data
sort writes a (lexicographical) sorted concatenation of all input files to stan-
dard output, using Mergesort
• -r: reverse the sort
• -n: do a numeric sort
• -k and -t: sort by the nth column (the argument to -k); -t specifies
the field separator character
uniq finds unique records in a sorted file
# Print the 10 most used lines in foo
$ cat foo| sort | uniq -c |sort -rn |head -n 10
Joining data
join joins lines of two sorted files on a common field
• -1, -2 specify fields in files 1 (first argument) and 2 (second argument)
that represent keys
$ cat foodtypes.txt
3 Fat
1 Protein
2 Carbohydrate
$ cat foods.txt
Potato 2
Cheese 1
Butter 3
Orchestrating pipelines
Make topologically sorts the specified dependency graph and executes com-
mands (in parallel, if -j is specified) to generate all output files. If some of
those already exist, make skips them.
result : file.csv
file.txt :
curl "http://a/web/page/file.txt" > file.txt
More make
# Find all Jupyter files
JUPYTER_INPUTS = $(shell find . -type f -name '*.ipynb')
Figure 9.3: Richard Stallman, founder of the Free Software movement and
co-author of many of the Unix tools we use on Linux
ssh provides a way to securely login to a remote server and get a prompt.
In addition, it enables us to remotely execute a command and capture its
output
# List of files on host dutihr
ssh dutihr ls
# On another terminal
$ mongo localhost:27017
Variables in bash are strings followed by =, e.g. cwd="foo" and are derefer-
enced with $, e.g. echo $cwd.
# Store the results of running ls in a variable
listing=`ls -la`
echo $listing
Conditionals
Bash supports if / else blocks
if [ -e 'test' ]; then
echo "File exists"
else
echo "File does not exist"
fi
The for loop iterates over all items in the list provided as argument:
# Print 1 2 3 4...
for i in `seq 1 10`; do
echo $i
done
argA="defaultvalue"
while getopts ":a:" opt; do
case $opt in
a)
echo "-a was triggered!" >&2
argA=$OPTARG
;;
\?)
echo "Invalid option: -$OPTARG" >&2
;;
esac
done
Content credits
• Unix license plate, by the Open Group
Bibliography
E. Brewer. Cap twelve years later: How the ”rules” have changed. Computer,
45(2):23–29, Feb 2012. ISSN 0018-9162. doi: 10.1109/MC.2012.37.
Bernadette Charron-Bost. Concerning the size of logical clocks in distributed
systems. Information Processing Letters, 39(1):11–16, 1991.
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility
of distributed consensus with one faulty process. J. ACM, 32(2):374–382,
April 1985. ISSN 0004-5411. doi: 10.1145/3149.214121. URL http://doi.acm.org/10.1145/3149.214121.
Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. Understanding net-
work failures in data centers: Measurement, analysis, and implications.
SIGCOMM Comput. Commun. Rev., 41(4):350–361, August 2011. ISSN
0146-4833. doi: 10.1145/2043164.2018477. URL http://doi.acm.org/10.1145/2043164.2018477.
Jim Gray. The transaction concept: Virtues and limitations (invited paper).
In Proceedings of the Seventh International Conference on Very Large Data
Bases - Volume 7, VLDB ’81, pages 144–154. VLDB Endowment, 1981.
URL http://dl.acm.org/citation.cfm?id=1286831.1286846.
Theo Haerder and Andreas Reuter. Principles of transaction-oriented
database recovery. ACM Comput. Surv., 15(4):287–317, December 1983.
ISSN 0360-0300. doi: 10.1145/289.291. URL http://doi.acm.org/10.1145/289.291.
Pat Helland. Immutability changes everything. Queue, 13(9):40, 2015.
Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness
condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12