Advanced Distributed Paradigms Notes - Google Docs
The objective of the chapter is to be able to write two applications A and B that do not
necessarily run on the same machine: this brings up the concept of Information Exchange (not
just data: it is not just raw bytes, but making sense of these bytes and understanding one
another)
Such an information exchange is handled by the operating system.
The OS is the piece of software that takes control of the machine: you boot it, it gets loaded by
the BIOS, and it takes control to manage the operation of your machine.
The main motivation behind an OS is avoiding redundancy (repeating the same low-level code in
every program), so as to avoid inconsistencies.
There are some common tasks needed by many programs, such as opening files, allocating
space in memory, reading input from the keyboard, using peripherals, … These are taken care
of by the OS, so you are able to focus on the main part of your program.
It is made up of several modules each taking care of specific tasks.
- Asking the networking manager to open a connection to a remote host: we open a virtual
connection (a logical representation of an actual connection open with the other side).
Again we do not care about the low-level details.
These system calls that belong to the same family can be grouped into some kind of library.
From the user perspective (as in programmer) you only care about the API.
For networking management, we can talk about the networking library/API. The “fancy” name
for it is The Socket API. (This name comes from the idea of abstracting a sort of pipe that
connects two programs, where whatever you put in one end of the pipe ends up at the other
end.)
Using the socket API, you do not care about the geographical location of the programs (only a
logical location).
An analogy:
For a programming language, you may master it, but if you do not have a solution in mind or
an algorithm, you would not be able to write a single line of code.
The Socket API gives you a powerful way to open connections, send and receive. But what are
you going to send and receive? You need a protocol to tell you what to do.
Java Socket API Overview (java.net)
A connection has a representation on both sides (client and server) through variables. But as
they represent the same concept, they are variables of the same type. In OOP, these would be
two objects of the same class. The class from which we create these two objects is called
Socket.
Classes:
● ServerSocket (server)
- ServerSocket (int port)
- Socket accept()
● Socket (client, server)
- Socket(String hostNameOrIPAddress, int port)
Instead of having dedicated send and receive methods, sockets make use of the java.io library:
a Socket encapsulates two objects of types InputStream and OutputStream.
Once we have these two objects, everything else is simple IO operations as dictated by the
protocol. (we no longer care about the networking infrastructure and such)
Client/Server Application Example
● Purpose:
This client/server application shall allow the client to download and upload files from/to
the server (Fx application, for File Exchange)
● Protocol:
1. The client opens a connection with the server and informs the server whether it wants
to download or upload a file, using a header
2. If the client wants to download a file then
2.1. The header will be as the following:
download[one space][file name][Line Feed]
2.2. Upon receiving this header, the server searches for the specified file.
2.3. If the file is not found then the server shall reply with a header as the following:
NOT[one space]FOUND[Line Feed]
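The header formats above can be sketched as two small JavaScript helpers (the function names are illustrative; the protocol text — `download[one space][file name][Line Feed]` — is exactly as specified above):

```javascript
// Build the Fx download request header: download[space][file name][LF]
function buildDownloadHeader(fileName) {
  return 'download ' + fileName + '\n';
}

// Parse a header line into its command and argument.
function parseHeader(line) {
  const [command, ...rest] = line.trim().split(' ');
  return { command, argument: rest.join(' ') };
}

const header = buildDownloadHeader('notes.txt'); // "download notes.txt\n"
const parsed = parseHeader(header);              // { command: 'download', argument: 'notes.txt' }
```

A server receiving this header would compare `parsed.command` against the operations the protocol defines, and reply with `NOT FOUND\n` when the file lookup fails.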
The notion of the header:
We always need some sort of metadata that explains what the data is. We need to send a
description of what we are sending, before we actually send anything. So the protocol describes
what this metadata looks like. Hence, how the header looks is specified at the level of the
protocol.
Data flows between different processes according to different protocols. But if we analyze the
traffic of data flowing on the Internet, we find that around 90% of the traffic adheres to the
HTTP protocol. Your browser allows you to browse the Internet using the HTTP protocol (an
HTTP client, talking to HTTP servers out there); it is end-to-end, implemented at the level of
your browser and on whatever server out there.
In SMTP (Simple Mail Transfer Protocol), the protocol defines the sender, cc, subject, body, …
in order to enable a sender and a receiver to exchange mail. (default port 25)
In FTP (File Transfer Protocol), we have not only the number of bytes being exchanged that are
specified at the level of the protocol, but also the file types for example. (default port 21)
Telnet (the “void” protocol) doesn’t specify anything: other than establishing a connection, it
allows the client to send whatever command to the server. (default port 23)
Tim Berners-Lee (father of the Web (WWW)), who worked at CERN, worked on HTTP 0.9 (which
was called the one-line protocol).
All protocols have a default port for the server (for example Fx (and HTTP) is bound to port 80).
The default ports are not reserved for those specific protocols. (It’s more of a convention,
especially for the standard protocols.)
Throughout the history of computing, programs started on standalone machines. Engineers and
computer scientists realized that there was a need for going outside the box and into another,
as we strive to do even more with less and to be more productive.
Our objective:
Can we imagine a programming paradigm which provides us software developers with the
luxury of invoking remote services/functionalities as if they were local? (In traditional
programming, when you call a function, this function is implemented within your application
(perhaps in a different library, package, class, …), but it is still imported and is part of the
process at runtime. What we are talking about here is having this functionality not at all as
part of your application (it is remote), but still being able to invoke it, while the technology
gives the application developer the impression that the functionality is there locally. Hence we
need the technology, and a layer underneath, that generates the necessary code to fill the gap
and make the invocation reach the other side.)
Such a paradigm should hide all the programming hassle and details mentioned above.
This would:
● Increase developer productivity (leveraging the abstraction) (so you expand your own
code, without having to pay the price for all the remote infrastructure)
● Promote software integration for:
○ Richer functionality (extending the functionality of your application, without
doing it yourself, through an external application (e.g. translation or weather
services))
○ Higher performance (extending your application by leveraging applications that
perform the heavy computations on your behalf, e.g. when you don’t have the
necessary processing power for a certain kind of computation (you have the
formula or algorithm, but you lack the computing power))
Cloud computing wouldn’t be possible without such a paradigm (externalizing workloads to have
applications, or parts of them, running on premise/in the cloud)
Read about:
● Serialization
● XML (eXtensible Markup Language)
● JSON (JavaScript Object Notation)
Brainstorming:
Let’s say we are trying to design and develop a traditional client/server application that allows
the client to perform the four basic math operations (+ - x :) on the server.
-> Instead of thinking in terms of headers and commands, we want to adopt a new mindset: we
will think in terms of functions. These functions would be implemented on the server side, and
I would be invoking them (through simple calls) and getting back the results.
We need a generic command (function name corresponding to some functionality on the
server), as well as a way to tell the other side the number of parameters it should expect and
their types (some sort of metadata about this). This is where serialization comes in (or
marshaling).
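A minimal sketch of such marshaling, using JSON as the wire format (an assumption for illustration; the function names and the exact shape of the message are not from the course):

```javascript
// Marshal a remote call into a self-describing string: the function name,
// the number of parameters, and the parameters themselves.
function marshalCall(name, params) {
  return JSON.stringify({ name, paramCount: params.length, params });
}

// Unmarshal on the other side: recover the call description from raw text.
function unmarshalCall(wire) {
  return JSON.parse(wire);
}

const wire = marshalCall('add', [5, 7]);
const call = unmarshalCall(wire);
// call.name === 'add', call.paramCount === 2, call.params[0] === 5
```

The point is that both sides agree on a generic format, so the server can dispatch any function by name instead of the protocol hard-coding one header per command.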
We need to make the distinction between RPC (Remote Procedure Call), being the concept of
invoking remote functions as if they were local, and the technology which provides us with the
generic protocol and which generates the proxy code.
The stub, other than marshaling and unmarshaling, also needs to know the location of the
server skeleton in order to open a connection and such.
This service API, if the RPC technology were Java based for example, would be an interface or a
couple of interfaces; if C, it would be a header file.
There are however some other languages that are used to specify interfaces, such as
XML (WSDL).
Code-first:
In this approach, developers start by coding the service business implementation, or at least
defining its business interface in a target programming language, such as Java, Python,
JavaScript, etc. Then, they use an appropriate tool for the chosen technology to generate the
service API. This won't be possible if such a tool doesn't exist for the chosen technology and
target programming language.
The API generator creates an explicit interface from an implicit interface that is present at the
level of the business implementation.
The service developer, in this code-first approach, implicitly plays the role of service designer as
well.
Then came XML (eXtensible Markup Language), a language just like IDLs (which are not
imperative or OOP languages, but descriptive languages) allowing you to describe data,
functionality, services, … It represents a new generation that came with the rise of
language-agnostic technologies. It allows you to structure data (and attach metadata).
Something like:
● <fname> …….. </fname>
● <lname> …….. </lname>
These are tag-based (Markup). It is eXtensible because these tags are user defined.
We need to provide a schema (a sort of dictionary) that describes our attributes: what is valid,
whether an attribute is simple or complex (for example a <address> tag which includes
<street>, <city> and such). We also describe the type of each attribute for rigor, so that the
parser can tell us whether our file is okay.
So we have XML + our own schema, which gives us a specialization of XML that yields a
specific XML-based language
Example of a schema:
<method>
  <name> sum </name>
  <param>
    <name> …. </name>
    <sthg> …. </sthg>
  </param>
</method>
WSDL (Web Services Description Language) (the basis is HTTP, hence "Web", which is why
these are classified under Web services)
HTTP needed to be augmented, which gave us SOAP (Simple Object Access Protocol), which
is XML-based.
So we have XML/SOAP as this specialization of XML.
Layers of abstraction:
WSDL
XML
SOAP
HTTP
Socket API
SOAP (which is included as the body of the HTTP request) is used to specify details about the
names of functions, arguments, …
HTTP supports several verbs, but XML/SOAP uses only one HTTP verb/command/method,
which is POST.
Then we have a number of lines that represent a variable number of attributes, all following the
format of Attribute: ___ [CRLF]
Accept-Encoding: gzip,deflate
Content-Type: text/xml;charset=UTF-8
Content-Type: text/xml;charset=UTF-8
SOAPAction: ""
Content-Length: 321
Host: localhost:9000
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.5
In the body, we can have anything. Its type is specified earlier as part of the header in the
Content-type attribute.
In the case of XML/SOAP, the body is following the SOAP Protocol and is XML-based.
<soapenv:Envelope
  xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
  xmlns:prov="http://provider.calculator.xs.integration.paradigms.sse.aui.ma/">
  <soapenv:Header/>
  <soapenv:Body>
    (The stub generated this:)
    <prov:computeAll>
      <arg0>7.0</arg0>
      <arg1>5.0</arg1>
    </prov:computeAll>
  </soapenv:Body>
</soapenv:Envelope>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
  <S:Body>
    <ns0:computeAllResponse
      xmlns:ns0="http://provider.calculator.xs.integration.paradigms.sse.aui.ma/">
      <return>
        (We notice marshaled attributes:)
        <sum>12.0</sum>
        <difference>2.0</difference>
        <product>35.0</product>
        <ratio>1.4</ratio>
      </return>
    </ns0:computeAllResponse>
  </S:Body>
</S:Envelope>
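In practice the stub builds the request envelope for you, but a hand-rolled sketch makes the structure concrete (the helper below is illustrative; real SOAP stacks also handle namespaces, faults, and typing that this skips):

```javascript
// Assemble a SOAP request envelope as a string, mirroring the
// computeAll request shown above.
function soapEnvelope(nsPrefix, nsUri, operation, args) {
  const body = args.map((v, i) => `<arg${i}>${v}</arg${i}>`).join('');
  return (
    '<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" ' +
    `xmlns:${nsPrefix}="${nsUri}">` +
    '<soapenv:Header/><soapenv:Body>' +
    `<${nsPrefix}:${operation}>${body}</${nsPrefix}:${operation}>` +
    '</soapenv:Body></soapenv:Envelope>'
  );
}

const xml = soapEnvelope(
  'prov',
  'http://provider.calculator.xs.integration.paradigms.sse.aui.ma/',
  'computeAll',
  ['7.0', '5.0']
);
```

This string is what ends up as the body of the HTTP POST, with `Content-Type: text/xml` in the header, exactly as in the captured request above.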
Development process:
In order to use API-first, we will need to design the API, and hence master WSDL.
@WebService from javax.jws (this annotation is used to support the XML/SOAP technology)
After instantiating an instance of the service, it only gets published once we make use of the
Endpoint.publish method:
Endpoint.publish(URL, calculator);
(this publishes the calculator object under this URL.
So the client side only needs the WSDL file that describes the interface (API), as well as the
location (URL))
This URL is used to reach that service.
When we want to develop any application, we need a package manager that fetches the
libraries, dependencies and such for that specific application. We make use of build tools: in
JavaScript we use NPM; its equivalent in Java is Gradle / Maven.
● Change the service location URL within the generated
src/main/resources/CalculatorService.wsdl (<soap:address
location="REPLACE_WITH_ACTUAL_URL"/>) to: http://localhost:9000/calculator
To fulfill these requirements, REST defines a set of constraints / design principles:
● Client/Server architecture
● Statelessness (as opposed to statefulness)
We have Stateless and Stateful.
Between a client and a server, over an interaction or conversation, we have a set of
requests and responses. For a given interaction, the question is: does the server
remember (preserve/recall) what has happened with that client or not (preserve the state
of the conversation: what has happened so far)?
For example, in an e-commerce application, we have a shopping cart (the question is
whether the server actually remembers what you have chosen so far, or whether it is the
client that keeps track of what you have chosen). This constitutes the state of the interaction.
Drawbacks of statefulness include: resources at the level of the server (the more
clients you have, the more state you need to preserve (scalability concerns)). We also
need load balancing.
In statelessness we do not keep any data on the server, which helps with scalability (we
do not care about data replication and synchronization issues).
● Cacheability
● Self-descriptive messages
● HATEOAS: Hypermedia As The Engine Of Application State
If we want to fetch something in brands: GET url/brands/7 (brands here is a collection, and 7 identifies one resource within it)
We are looking at a case study of a calculator app. We can make use of REST with OpenAPI
(here, there is no natural mapping between its methods and HTTP methods, as it is
processing-driven).
For the server side we will use a Spring Boot app server (IoC container). (This container runs
within the JVM, which runs within the OS.) Spring exposes a web server making use of HTTP,
responds to REST requests and maps them to the methods we expose (invokes them). We then
generate our OpenAPI description to get an API, and a stub generator makes use of it to
generate a Python stub.
If we want to make use of the add method, our request would be like:
GET http://……:8080/calculator/add?x=5&y=7
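The mapping from such a URL to an operation can be sketched as follows (the course's server is a Spring Boot app; this is a toy JavaScript version, and the function name `handleCalcRequest` is illustrative):

```javascript
// Parse a calculator request URL and compute the result,
// mimicking GET .../calculator/add?x=5&y=7
function handleCalcRequest(rawUrl) {
  const url = new URL(rawUrl);
  const x = Number(url.searchParams.get('x'));
  const y = Number(url.searchParams.get('y'));
  const op = url.pathname.split('/').pop();  // last path segment, e.g. "add"
  switch (op) {
    case 'add':      return x + y;
    case 'subtract': return x - y;
    case 'multiply': return x * y;
    case 'divide':   return x / y;
    default: throw new Error('unknown operation: ' + op);
  }
}

handleCalcRequest('http://localhost:8080/calculator/add?x=5&y=7'); // 12
```

Note how the operation name travels in the path and the parameters in the query string: because the calculator is processing-driven, the verb is always GET and the "resource" is really a function name.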
We annotate our app using @RestController, which instantiates it on our behalf and starts
listening for requests, and we note that this component is mapped to /calculator.
Callback methods are not called by us directly; they are called by Spring so that it calls us
back (we expose them for Spring).
We only specify the what, not the when and how (which is taken care of by Spring).
IoC stands for Inversion of Control.
Performance is measured from a user perspective using the following metric: response time
(users want an acceptable, if not good or very good, response time (less than 300 ms)).
Increasing processing power is one of the ways, but it is costly. So we need to optimize at the
software level first.
One of the techniques at the software level is asynchronous programming (other techniques
exist, such as caching)
● Asynchrony as means for supporting performance
Before getting into this subject, we will cover other traditional ways to support performance
Single-threading:
- Statements execute synchronously (statement i+1 never executes before statement i
terminates execution), one after the other (a set of statements, with one line of
execution)
There are cases though, where statement i+1 doesn’t depend on the outcome of
statement i. They might be completely independent that they would even be able to be
executed in parallel. But in this single threaded concurrency model, even though from a
logical perspective you would want them to run in parallel, the fact that you are using this
model stops you from doing so.
- Statements areblocking(they block what comes next)
- One call stack is used to keep track of where we are (when you call a function, this call
stack is maintained by the OS for each process to keep track of where we were each
time we call a function, so that we are able to come back). (One stack keeps track of one
line of execution, so in case we want multiple lines of execution we would need
multiple call stacks.)
- The function that is currently executing: pushed on the top of the stack
- Its stack trace: elements on which it is stacked, all the way to the main
- Where it should return within its calling function
- Once it returns, it gets popped from the top of the stack
- Synchronous/Blocking I/O is a huge waste of CPU time!
When we have a certain DMA (Direct Memory Access) operation f1, for example, that takes too
much time, there is an opportunity to run the subsequent statements (as long as the CPU is
idle and those statements do not depend on f1)
Multi-threading:
It is based on having an independent call stack per line of execution.
The main characteristic of threads is that they share the same memory, as opposed to
processes.
Multithreading in an application needs support from the runtime (for example the JVM (Java
Virtual Machine) and the JRE (Java Runtime Environment), which understand the Java program
better than the OS process does)
When we add multithreading to our application, this has no incidence on the rules of the
communication (so there is no need to change or upgrade the protocol)
Multi-threading:
Based on different lines of execution (threads), sharing memory (heap)
E.g. To handle several requests/clients concurrently
Requires:
Programming language support to request thread creation
Runtime support for actual thread creation and management
An independent call stack per thread to keep track of where we are
Leverages thread interleaving by the runtime to optimize CPU usage
I/O waiting threads get preempted (by the runtime)
Ready threads get executed concurrently (by the runtime)
Thread safety / state consistency is the responsibility of the developer!
Within each thread, the execution is still synchronous: one line after the other
We would want to be able to call the constructor (s= new Socket …) without blocking
For this to be done asynchronously, we would need an additional parameter: the logic to apply
on the result once the connection (the result) is ready, in the future. This recipe would be in the
form of a function definition that would eventually be called (thanks to the support of the
runtime): a callback.
af1(..., r => { … });   // line l1 registers cb1
l2 → cb2
l3 → cb3
A callback function whose result is ready needs to be pushed onto the main call stack, e.g.:
  cb2
  main
Event loop (async function terminating, its callback function is invoked, …..)
In multithreading (with multiple call stacks), if you launch many threads you end up having so
many call stacks which consumes your memory. In this model, we want to only have one call
stack.
When a callback gets to the call stack, the subsequent callback cannot get into the call stack
until the first one terminates. So here, for example, cb2 would be waiting for cb1. In case a
callback is performing some heavy processing (not an async I/O operation), we would have the
other callbacks waiting forever. In that case, a programming model with only one call stack is
not appropriate (heavy-processing applications); runtimes like Node.js (as part of the JS
ecosystem) are not suitable for such applications. But most applications that we would be
working with (like in our projects (capstones, internships), basic common needs (HR systems))
can be done using Node, as most scenarios are just basic I/O.
This model is for I/O-intensive apps.
It is more efficient (consumes less memory) than multithreading, as it only makes use of
a single call stack, which is enough.
If we have multiple asynchronous functions, each calling other asynchronous functions within
their callback functions… => callback hell
(Using callback-based asynchrony, functions do not return a result, so the result can only be
accessed within the callback function, which is why we get so much indentation and confusing
code.)
In the callback, we might have two parameters, first an error in case the async function doesn’t
terminate properly, and another one if it terminates properly and yields a result.
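This error-first callback convention can be sketched as follows (the function `divideAsync` is illustrative; `setImmediate` stands in for a real asynchronous I/O operation):

```javascript
// Error-first callback convention: cb(error, result).
// The "async" part is simulated with setImmediate.
function divideAsync(x, y, cb) {
  setImmediate(() => {
    if (y === 0) cb(new Error('division by zero')); // first parameter: the error
    else cb(null, x / y);                           // second parameter: the result
  });
}

divideAsync(10, 2, (err, result) => {
  if (err) console.error(err.message);
  else console.log(result);                         // 5
});
```

Note that `divideAsync` itself returns immediately with nothing; the only way to observe either outcome is inside the callback, which is exactly the limitation the next paragraphs address.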
The root cause of the problem with async functions is the fact that they cannot return a result.
We might return a pointer to the result: not a pointer in memory (an address space), but in
time (so that when the result is available in the future, we get access to it). So it is not the
result, but a promise of the result.
This async function we are thinking of launches in a non-blocking call, and it returns a promise
that it will deliver a result once it is ready.
Pseudo code:
launch execution in a different thread
create and return a promise
promise = asyncFunc(data);
promise.then(r => { f(r) });
(we have the promise, and once it is fulfilled (the async function completes execution and its
result is returned), then the callback runs)
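The pseudo code above can be made concrete by wrapping a callback-style operation in a Promise by hand (an illustrative sketch; `setTimeout` stands in for the slow operation, and `asyncFunc` is a hypothetical name):

```javascript
// Return a promise of the result instead of the result itself.
function asyncFunc(data) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {                       // stands in for slow I/O
      if (typeof data !== 'number') reject(new Error('bad input'));
      else resolve(data * 2);                // the result, delivered "in time"
    }, 10);
  });
}

const promise = asyncFunc(21);               // returns immediately, non-blocking
promise.then((r) => console.log(r));         // logs 42 once the result is ready
```

The call site no longer nests logic inside the producing function: it receives a handle (the promise) and attaches the continuation to it with .then().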
In the case of our calculator app, if we were to have an async version of it, it would look
something like:
soap.createClient(url, (err, calculator) => {
  calculator.add(args, (er, result) => {
    console.log(args.arg0 + ' + ' + args.arg1 + ' = ' + result.return);
  });
  calculator.subtract(…….
This would still put us in a situation where we might encounter callback hell, so a promise-based
version of this would look like:
calculatorP = soap.createClient("_________")
(this operation is done only once, and is cached within the promise; it resolves once, and its
value is the same no matter how many times you call .then on this same promise)
calculatorP.then(calculator => calculator.add(args))
(the result of this is passed to then once it is ready, which is then returned as a promise to the
next line)
  .then(result => console.log(result));
calculatorP.then(c => c.subtract(args))
  .then(r => console.log(r))
  .then(....)
  .then(....)
In case we chain multiple thens on the same initial calculatorP, the subsequent operations can
only run once the promises for the previous operations have resolved. In that case, we are not
really leveraging asynchronicity.
In general, whenever we deal with I/O we should favor asynchronous (non-blocking) libraries
over synchronous libraries (blocking). Within the asynchronous libraries, we should favor
promise-based ones over callback-based ones
● The then() method returns a promise
● A value returned inside a then() handler becomes the resolution value of the promise
returned from that then()
● If the value returned inside then() is a promise, then the promise returned by then() will
“adopt the state” of that promise and resolve/reject just as that promise does.
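The three rules above in a minimal runnable sketch:

```javascript
const p  = Promise.resolve(1);

// Rule 2: a plain value returned in a handler becomes the
// resolution value of the promise then() returns.
const p2 = p.then((v) => v + 1);                      // resolves to 2

// Rule 3: a promise returned in a handler is "adopted":
// p3 resolves/rejects exactly as that inner promise does.
const p3 = p2.then((v) => Promise.resolve(v * 10));   // resolves to 20

p3.then((v) => console.log(v)); // 20
```

Rule 1 is implicit throughout: every .then() call itself returns a promise, which is what makes flat chaining possible instead of nesting.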
Promise Orchestration:
Promise.all(iterable)
- Returns a promise that either fulfills when all of the promises in the iterable argument
have fulfilled, or rejects as soon as one of the promises in the iterable argument rejects
have fulfilled or rejects as soon as one of the promises in the iterable argument rejects
- If the returned promise fulfills, it is fulfilled with an array of the values from the fulfilled
promises in the same order as defined in the iterable
- If the returned promise rejects, it is rejected with the reason from the first promise in the
iterable that rejected. This method can be useful for aggregating results of multiple
promises
Promise.race(iterable)
- Returns a promise that fulfills or rejects as soon as one of the promises in the iterable
fulfills or rejects, with the value or reason from that promise
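Both orchestration methods in a small sketch (the timings are illustrative):

```javascript
const fast = new Promise((res) => setTimeout(() => res('fast'), 10));
const slow = new Promise((res) => setTimeout(() => res('slow'), 50));

// all(): waits for every promise; the result array keeps the
// iterable's order, not the completion order.
Promise.all([fast, slow]).then(([a, b]) => console.log(a, b)); // fast slow

// race(): settles as soon as the first promise settles.
Promise.race([fast, slow]).then((winner) => console.log(winner)); // fast
```

This is why all() is the natural tool for aggregating independent calls (e.g. firing add and subtract concurrently instead of chaining them), while race() fits timeouts and fallbacks.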
Await:
let response = await fetch('____________')
let text = await response.text()
console.log(text)
(we cannot do this directly, as we need to define the enclosing function as async (boxed))
In this context we define the function not for modularity or code reuse, but because we have
awaitable operations that need boxing.
Here we do not need to define the function first and then call it. We just define and invoke the
wrapping function in one go.
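A runnable sketch of that "box" as an immediately invoked async function (assumption: `fetchText` is a stand-in for the elided fetch call in the notes, so the example stays self-contained):

```javascript
// Stand-in for a real network fetch: resolves with text after a short delay.
function fetchText() {
  return new Promise((resolve) => setTimeout(() => resolve('hello'), 10));
}

// The async "box": defined and invoked in one go (an async IIFE),
// because await is only legal inside an async function.
(async () => {
  const text = await fetchText();   // non-blocking wait for the promise
  console.log(text);                // "hello"
})();
```

Under the hood, await is just syntax over .then(): the function suspends at the await, the runtime keeps going, and the rest of the body runs as the continuation when the promise fulfills.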
● Single threading cannot be used in professional applications, especially as the server
side deals with multiple clients concurrently (which we cannot afford to do
synchronously, with the CPU idle during I/O)
● We either use multithreading (you need to create the threads yourself, and these threads
consume memory) or async programming (either traditional callback-based or promise-based)
● Another option is making only the I/O operations asynchronous, which favors using
promise-based async programming
Let’s imagine a context where we have two windows which are synced, meaning that whenever
we make a change in the first window, this change is mirrored in the second one.
Let’s say that we perform checks at a certain frequency, to detect whether these changes
happened and reflect them accordingly. The more we increase the check frequency, the more
we decrease the inconsistency window (time frame).
In this context we have two different actors: the windows themselves and the Operating System
(specifically the FileSystem, which creates the folder, renames it, deletes it and such). The
window just shows the state of the filesystem (created, kept track of, managed at that level).
The truth holder is the filesystem (it is the main actor, the acting entity). Because of that,
instead of making the windows ask each time, let the filesystem notify them whenever there is
a change. Each window registers with the filesystem (stating that it is interested in whatever
changes happen in that path). The filesystem takes note of this by keeping a pointer to that
window.
In this scenario, the window observes the filesystem (but in reality the window doesn’t
actively observe; it registers and asks to be notified back, so the active entity is the filesystem.
The window is passive, and just reacts). The party that reacts is the observer, and the truth
holder is the observable (active).
We have this interface because we might need different implementations (concrete observers).
It also acts as the contract between the observables and the observers, as if the observables
are telling concrete observers: if you want to get notified of events, then expose and implement
this update() method (the only way to get this support is by implementing this method).
In this context, it is the callee that gets the advantage of the function being called, not its caller.
This is why update() is considered a callback. The subject (the caller) isn’t the one that benefits
from making the call.
notify() should iterate through the subscribers and call update() on them.
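The course sketches this pattern in Java; a minimal JavaScript version of the same structure (class and variable names are illustrative) shows subscribe() keeping the list of observers and notify() iterating over it to call update():

```javascript
// The observable (truth holder): keeps pointers to registered observers.
class Observable {
  constructor() { this.observers = []; }
  subscribe(observer) { this.observers.push(observer); }
  notify(event) {
    // The callback: the callee (observer) benefits, not the caller.
    this.observers.forEach((o) => o.update(event));
  }
}

// A concrete observer (like a window): passive, it just reacts.
class WindowView {
  constructor() { this.log = []; }
  update(event) { this.log.push(event); }    // the "contract" method
}

const fileSystem = new Observable();
const win1 = new WindowView();
const win2 = new WindowView();
fileSystem.subscribe(win1);
fileSystem.subscribe(win2);
fileSystem.notify('folder renamed');          // both windows react
```

The only coupling between the two sides is the update() signature, which is exactly the contract the interface expresses.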
The only requirement the filesystem imposes on whatever subscribes is that it exposes an
update method: hence we just need an interface exposing one method, update() -> Observer.
We can make use of inheritance, since the LinkedList<Observer> will be needed by every
observable we define.
Every observer has its own execution context, so it doesn’t miss any notification.
In Rx, an observable is code, and that code gets one execution context per observer: the
observable is the data producer, and an Rx observable produces the data from the beginning to
the end for each observer that subscribes to it.
myObservable.subscribe(myObserver);
In RxJS (since in JavaScript we have functions as first-class citizens), instead of extending
Observable (into a FileSystem subclass, for example) to customize a behavior, we just
instantiate it and pass the recipe (which is a function):
new Observable(recipe);
This recipe takes as a parameter its subscriber.
complete() gets executed by the observable once it’s done executing its recipe (the
successful execution of its recipe).
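A hand-rolled sketch of this idea (deliberately not the real RxJS library, so it stays dependency-free; the names mimic RxJS): the recipe is a plain function that receives the subscriber, and because it re-runs per subscription, the observable is cold:

```javascript
// Minimal cold observable: the recipe runs from scratch for every subscriber.
function createObservable(recipe) {
  return {
    subscribe(observer) { recipe(observer); },
  };
}

const numbers$ = createObservable((subscriber) => {
  for (let i = 0; i < 3; i++) subscriber.next(i);  // emit the data elements
  subscriber.complete();                           // recipe finished successfully
});

const seen = [];
numbers$.subscribe({
  next: (x) => seen.push(x),
  complete: () => seen.push('done'),
});
// seen is now [0, 1, 2, 'done']
```

Each additional subscribe() would replay the whole sequence from the beginning, which is exactly the per-observer execution context described above.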
- You can use StackBlitz, as a JS playground, to try things out.
Contrast:
- The function which is the data producer is a passive entity, while the caller is an active
entity.
- The observable as the data producer, on the other hand, is the active entity, while the
observer is the passive one.
Producer vs. consumer:
In common: a plain function and an observable are both cold, and both produce data.
An interesting way of defining a for loop, for example (we have an engine, making use of range):
const r$ = range(0, 1000);
r$.subscribe((x) => console.log(x));
const my$ = interval(2000);
my$.subscribe((x) => console.log(x));
(Diagram: interval — or other creation operators — acts as a factory; the emitted stream T1,
T2, T3, … is piped through operators into a customized recipe/observable.)
The raw data stream that is being “emitted” is actually only a recipe to emit the data.
If the recipe were hot, we might get overwhelmed at the level of execution. But since it is cold,
we are able to take our time planning and applying all necessary transformations to get our
customized observable; it is the subscription by an observer that triggers the whole
process.
Promises: asynchronous only, emit only once, are hot, and mix planning and execution.
Observables: can be synchronous or async, can emit many data elements, and are cold. Rx
observables provide artifacts for planning and transformation (through .pipe(operators)) and an
artifact for execution (subscribe()).
From now on, we should favor Rx-based libraries for asynchronous operations.
- How does the unsubscribe method cancel and stop the execution context?
Scalability is one of the non-functional requirements we take into consideration in the context of
software development.
Some people tend to mix performance and scalability: they are related but different.
Performance reflects the number of transactions executed per unit of time, or, from the user’s
perspective, the response time (from the time they click a button to the time they get back the
response, fully loaded on their screen).
In this chapter, we are not considering I/O operations (which was the main purpose behind
asynchronous programming), instead we are dealing with heavy computations, and hence
asynchronous programming wouldn’t help.
We first need to maximize the CPU usage we have; then and only then, think of increasing
processing power.
Scalability is especially needed in the field of Big Data and High Performance Computing (HPC)
Spark
From Spark, as a distributed platform, we expect:
● Resource management (nodes (which have failed, are full, are back, …) / processing,
memory and storage, …)
● Running workloads in a parallel fashion (distributed processing)
Parallel operations on data would be, for example, applying the same function on
different elements of an array (as a collection), happening in a parallel manner.
- RDDs:
Spark has introduced Resilient (replicated) (we can add speed as an objective) Distributed (partitioned) (we can add parallelism as a means) Datasets (RDDs).
→ A main characteristic of these RDDs is that they are immutable (to avoid inconsistencies: if they were mutable, we would have inconsistencies, since we have replicas and we would need synchronization and such).
(An example of an immutable object is the String class in Java: if we have a String and call substring on it, a new string is returned and the original stays untouched.)
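The same behavior can be seen in Python, whose strings are also immutable: every "modification" returns a new string and leaves the original alone.

```python
s = "immutable"
t = s[2:]                  # like Java's substring: returns a NEW string
assert t == "mutable"
assert s == "immutable"    # the original is untouched
```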
- Operations:
The two types of operations we have seen before are Map and Reduce; in Spark, the new jargon is Transformations (allowing us to go from an initial RDD to a target RDD) & Actions.
These operations are supported by the platform and offered by the API, which gives us access to them.
- Transformations:
These are all methods of the RDD class.
● Map (we apply a function on each element in a set, which returns a new set)
● Filter (for example, from an initial RDD containing numbers, we move to one containing only prime numbers or such, i.e. only the elements fulfilling a certain criterion; the function passed as a parameter analyzes each element it is given and returns true or false, and the false ones are filtered out while the true ones are kept (its prototype starts with bool))
● flatMap (it can map an element to a list of elements (not one-to-one), e.g. we can have an RDD of numbers and flatMap it to another RDD where each element maps to the list of numbers that divide it)
● Union (takes another RDD as an argument)
● reduceByKey (a traditional reduce operation takes n elements and reduces them to 1 element; here we take a collection of key/value pairs and a collection is returned, with the values reduced per key)
● … Read about other transformations
All of these transformations do not touch or change the original RDD; instead they return a new one.
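Ignoring distribution and partitioning, the transformations above compute the following (a plain-Python sketch of the semantics, not the Spark API itself):

```python
from collections import defaultdict

rdd = [1, 2, 3, 4]

# map: exactly one output element per input element
squared = [x * x for x in rdd]                       # [1, 4, 9, 16]

# filter: keep only the elements for which the predicate returns True
evens = [x for x in rdd if x % 2 == 0]               # [2, 4]

# flatMap: each element may map to several elements, flattened into one result
divisors = [d for x in rdd for d in range(1, x + 1) if x % d == 0]

# union: merge with another dataset
other = [5, 6]
both = rdd + other                                   # [1, 2, 3, 4, 5, 6]

# reduceByKey: reduce the values per key; a collection of (key, value) comes back
pairs = [("a", 1), ("b", 2), ("a", 3)]
acc = defaultdict(int)
for k, v in pairs:
    acc[k] += v                                      # the reduce function here is addition
reduced = dict(acc)                                  # {"a": 4, "b": 2}

assert squared == [1, 4, 9, 16]
assert reduced == {"a": 4, "b": 2}
```

In each case a new collection is produced and the input list is never modified, mirroring RDD immutability.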
In order to fulfill the performance promise of Spark, there is a smart optimization at the level of Transformations: transformations are lazy (cold), which means that when we apply an operation on an RDD, it is added to a plan (Spark buffers them), which is what gives Spark the opportunity to optimize. It is only when you call an Action that the optimized plan is triggered (executed).
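The lazy-plan idea can be sketched in plain Python (a hypothetical LazyDataset class, not Spark's API): transformations only extend a buffered plan, and only the action replays it.

```python
class LazyDataset:
    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan              # buffered transformations; nothing executed yet

    def map(self, f):                  # transformation: just extends the plan
        return LazyDataset(self._data, self._plan + (("map", f),))

    def filter(self, p):               # transformation: just extends the plan
        return LazyDataset(self._data, self._plan + (("filter", p),))

    def collect(self):                 # action: the whole plan runs only now
        out = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

executed = []
ds = LazyDataset([1, 2, 3, 4])
ds2 = ds.map(lambda x: (executed.append(x), x * 10)[1]).filter(lambda x: x > 15)
assert executed == []                  # nothing has run yet: transformations are lazy
assert ds2.collect() == [20, 30, 40]   # the action triggers the buffered plan
assert executed == [1, 2, 3, 4]        # only now has the map function executed
```

Buffering the plan like this is what gives a real engine the window to optimize before anything runs.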
Whenever you write an application for Spark, it is considered a Driver (as the computations are not performed on the same machine where your driver is executed, but on remote nodes that are part of the cluster).
Spark Architecture:
In this example, we have two nodes and the Manager, which is managing them and making these nodes appear as one. At the level of each node, you install the Spark binaries (which are the same between the Worker Nodes and the Cluster Manager, just labeled differently). This is our Spark platform for now, which makes our cluster. The difference between a Worker and an Executor is that at the level of each node box, you run the software as a worker, but each worker can manage several executors per box. A Spark cluster can run the workloads of several applications (but here we need some separation or isolation), so each worker node creates a separate executor for each application. The worker node is a JVM, but when it receives a workload, it creates a JVM per Driver program.
The Driver program: it is the entry point to the cluster (to acquire resources from the cluster and start submitting transformations and actions through the API to the cluster). It uses a main class (part of the API), which is SparkContext: when instantiated, you specify the resources needed by your application (how many nodes, how much processing power, how much RAM), and the SparkContext takes these parameters as well as the location of the cluster manager. So it expresses its needs while communicating with the Cluster Manager, and the hassle of making this communication happen is hidden under the SparkContext class.
The cluster manager returns pointers to the Worker Nodes, so the driver then no longer communicates with the Cluster Manager for the resource-allocation aspect (while the distributed computing is now done directly between the driver program and the worker nodes, which execute its workloads).
● Where does the driver program itself run? It can run from any laptop (containing the SparkContext), or it can be submitted to the cluster (through the cluster manager, which would pick one of the nodes and run the driver on it). So there are two modes of deploying the driver program: client mode (it runs on the machine from which it is deployed) and cluster mode (it is submitted to the cluster).
But if we see this through time, there was an event which happened, which is entering an initial value, and then another event happened, which is changing it and entering a new value.
If we had recorded this as a stream of events, we would have an immutable event (a create event) of entering the first value (at t=t1), then another one (an update event) of entering the second value (at t=t2).
→ event-driven system
Just as a state-oriented system needs an infrastructure, which can for example be a relational DBMS, an event-driven system needs an infrastructure specialized in handling streams of events.
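To make the contrast concrete, here is a minimal sketch (the event shapes are hypothetical) of rebuilding the current state by replaying an immutable event log:

```python
# An append-only log of immutable events (t1: create, t2: update)
events = [
    {"t": 1, "type": "create", "field": "email", "value": "a@old.example"},
    {"t": 2, "type": "update", "field": "email", "value": "a@new.example"},
]

# A state-oriented system keeps only the latest value; an event-driven one
# can reconstruct that state at any time by replaying the log in order.
state = {}
for e in sorted(events, key=lambda e: e["t"]):
    state[e["field"]] = e["value"]

assert state == {"email": "a@new.example"}   # current state, as a DBMS would store it
assert len(events) == 2                      # but the full history is preserved
```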
● The producer may be producing data at a rate that the consumer is not able to handle.
● In case of complete failure of the consumer, does data get lost? Would the producer stop sending and try again later? This places a burden on the producer.
In case we have multiple producers and consumers, we end up with multiple issues, such as spaghetti connections:
These issues stem from the fact that we have direct interaction / tight coupling between stream producers and consumers.
So, we insert a middleware (a guy in the middle), which plays the role of the stream infrastructure.
We hence have brokered communication or interaction / loose coupling between stream producers and consumers.
It may act as a buffer to allow the consumer to consume at its own rate. The broker would be running on a solid, robust, fast, correctly sized box. It records the events produced by the producers.
If the consumer fails, there is no effect on the production of the data, and whenever the consumer comes back, it can read the events from the broker. The broker is a solid platform, and since we can make it highly available, it is redundant as well.
The state of the art now is Kafka. (Prior to it, there were RabbitMQ, RocketMQ, …)
Kafka
Kafka is the de facto distributed Event Store and Stream Processing Platform.
Wiki: Kafka was originally developed at LinkedIn and was subsequently open-sourced in early 2011. Jay Kreps, Neha Narkhede and Jun Rao helped co-create Kafka. Graduation from the Apache Incubator occurred on 23 October 2012. Jay Kreps chose to name the software after the author Franz Kafka because it is "a system optimized for writing", and he liked Kafka's work.
Let's explore the different ways in which this reading and writing (production and consumption) can occur:
The most traditional (no longer used) is called Polling: the consumer connects to the broker and asks if it has any new events for it. The broker says no, for example, and the connection closes. Then, at a certain rate, the consumer tries again, and so on. The issue here comes from the fact that we have a static approach (a fixed interval at which we check each time), so data produced between polls sits undelivered until the next poll.
Pub/Sub (publish/subscribe): the consumer subscribes with the broker (saying it is interested in events from a certain stream). A persistent connection is maintained, and the broker pushes any new events that come in from that producer. So here, events are pushed as soon as possible.
The broker records the state of the conversation here (the last event sent, with its ID, to know what the next event to be sent is).
There are some drawbacks:
● The broker may overload the consumer.
● If an event is pushed but the consumer doesn't read it properly, the consumer should send feedback saying it has read it correctly; if not, the event needs to be resent. This can cause some issues.
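The push model, with the broker tracking per-subscriber state, can be sketched like this (hypothetical classes, not Kafka's or any broker's real API):

```python
class PushBroker:
    """Pub/Sub sketch: the broker records, per subscriber, how far it has pushed."""
    def __init__(self):
        self._log = []
        self._subscribers = {}            # callback -> index of last event pushed

    def subscribe(self, callback):
        self._subscribers[callback] = 0   # the BROKER keeps the conversation state

    def publish(self, event):
        self._log.append(event)
        # push the new event to every subscriber as soon as it arrives
        for cb in self._subscribers:
            cb(event)
            self._subscribers[cb] = len(self._log)

received = []
broker = PushBroker()
broker.subscribe(received.append)
broker.publish("e1")
broker.publish("e2")
assert received == ["e1", "e2"]   # events delivered immediately, no polling
```

Note that delivery happens at the broker's pace, which is exactly the overload drawback listed above.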
Long Polling:
You let the consumer ask for the data itself, just like in polling, but with an enhancement: it consumes at its own rate. (The connection is persistent, so the consumer makes a request and receives the data whenever it becomes available, rather than constantly requesting and having to open and close connections continuously.)
Since the consumer takes the initiative, it keeps track of what the latest message received is: each consumer records the state.
This lightens the broker's load and keeps its implementation as small as possible, letting each consumer be responsible for the reading, the state and such. The state is decentralized at the level of each consumer.
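A minimal long-polling sketch (hypothetical classes, not a real broker API): the consumer's request parks at the broker until data past its own offset is available, and the consumer, not the broker, tracks that offset.

```python
import threading

class LongPollBroker:
    """The consumer asks; the broker blocks the request until data is available."""
    def __init__(self):
        self._log = []
        self._cond = threading.Condition()

    def publish(self, event):
        with self._cond:
            self._log.append(event)
            self._cond.notify_all()        # wake any parked fetch requests

    def fetch(self, offset, timeout=5.0):
        # Block until there is at least one event past `offset` (the consumer's state).
        with self._cond:
            self._cond.wait_for(lambda: len(self._log) > offset, timeout)
            return self._log[offset:]

broker = LongPollBroker()
results = []

def consumer():
    offset = 0                             # the CONSUMER tracks its own position
    while offset < 2:
        events = broker.fetch(offset)      # parks until data arrives; own rate
        results.extend(events)
        offset += len(events)

t = threading.Thread(target=consumer)
t.start()
broker.publish("e1")                       # answers the consumer's pending request
broker.publish("e2")
t.join()
assert results == ["e1", "e2"]
```

Compared with plain polling there is no fixed check interval, and compared with pub/sub the broker holds no per-consumer state.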
Kafka Architecture:
Topic is Kafka's vocabulary for a stream.
At the level of the broker infrastructure, there could be many streams, produced by many producers. Each stream corresponds to a topic. Topics are functional (domain-specific entities).
A topic may be handled in parallel by two or more brokers that are part of the same infrastructure. Events belonging to the same topic may be written in parallel to two or more brokers in the cluster. So a given broker doesn't have the whole truth about a topic, and hence we come back to the concept of partition. A partition is non-functional: it exists for performance and scalability.
Consumer Groups:
We have a cluster of two brokers (called servers here), and we have a topic with 6 partitions, P0 to P5, where each broker is the leader for 3 of the 6 partitions. To give more sense to partitions, there is the concept of consumer groups: in group A, for example, we have 3 consumers configured to be in the same group. It is the same consumer, but for the purpose of scalability it runs on 3 different machines (C0 to C2, which are all the same consumer). Here we do not want, say, C1 and C2 to process the same event (just as, on a single machine, you would not process the same event twice, so we do the same here). From within a consumer group, we read from different partitions, so as not to miss an event, and no two consumers within a consumer group read from the same partition (no duplicate processing of events).
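The partition-to-consumer mapping within a group can be illustrated with a simple round-robin assignment (a sketch of the idea, not Kafka's actual assignor):

```python
partitions = ["P0", "P1", "P2", "P3", "P4", "P5"]
group_a = ["C0", "C1", "C2"]     # the same consumer, run on 3 machines

# Round-robin: partition i goes to consumer i mod group size
assignment = {c: [] for c in group_a}
for i, p in enumerate(partitions):
    assignment[group_a[i % len(group_a)]].append(p)

# Every partition is read by exactly one consumer in the group, so no event
# is processed twice within the group, and none is missed.
assigned = sorted(p for ps in assignment.values() for p in ps)
assert assigned == partitions
assert assignment == {"C0": ["P0", "P3"], "C1": ["P1", "P4"], "C2": ["P2", "P5"]}
```

With 6 partitions and 3 consumers, each consumer handles 2 partitions; this partition count is what caps how far a group can scale out.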
Kafka is even marketing itself today as an alternative to Spark (even though, initially, they competed in different spaces: distributed processing vs. distributed streaming).
Done