Lecture - 30
Parameter Servers
Refer Slide Time :( 0:17)
Preface and content of this lecture: In this lecture, we will discuss parameter servers, and also discuss
the stale synchronous parallel model.
In machine learning systems, especially the scalable machine learning algorithms that we have seen for
big data computing, we require abstractions for building scalable systems.
Refer Slide Time :( 0:48)
And here we can see the machine learning systems landscape, which has three different sections: dataflow
systems, graph systems, and shared memory systems. Machine learning algorithms such as Naive Bayes are
supported as dataflow systems and require abstractions from underlying platforms like Hadoop and Spark;
graph systems require abstractions from GraphLab and TensorFlow; and similarly, shared memory systems
such as Bosen and DMTK require parameter servers.
So, what we will see here is that, in the simplest case, the parameters of the machine learning model are
stored in a distributed hash table that is accessible over the network, which forms the shared memory
system. Such parameter servers are used at Google and Yahoo, and also in academic work by Smola and Xing.
And now we have to see the parameter server and how it is used as a framework for machine learning. To
build scalable machine learning systems, a scalable machine learning framework uses a parameter server:
it distributes the model over multiple machines and offers two different operations. One is called 'Pull',
which queries a part of the model, and the other is 'Push', which updates a part of the model. So, a
machine learning update equation of the form w_i ← w_i + Δ is handled in the model by pushing Δ, that is,
in the form of an addition operation. In stochastic gradient descent, and likewise in collapsed Gibbs
sampling for topic modeling, the aggregation of pushed updates is done via this addition operation. So,
whenever this addition operation occurs, the parameters are pushed into the model, and that is handled
by the parameter server.
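To make the pull/push idea concrete, here is a minimal Python sketch of a single-machine, in-memory
parameter store with a query (pull) and an additive update (push) operation. The class and method names
(SimpleParameterServer, pull, push) and the worker-side SGD step are illustrative assumptions, not any
particular system's API.

class SimpleParameterServer:
    def __init__(self):
        self.weights = {}          # parameter id -> current value

    def pull(self, key):
        # Query part of the model: return the latest stored value.
        return self.weights.get(key, 0.0)

    def push(self, key, delta):
        # Update part of the model: aggregate by addition, as in the
        # update equation w_i <- w_i + delta.
        self.weights[key] = self.weights.get(key, 0.0) + delta

# Hypothetical worker-side stochastic gradient descent step using pull/push.
server = SimpleParameterServer()
w_i = server.pull("w_i")
gradient = 0.4                     # placeholder gradient for illustration
learning_rate = 0.01
server.push("w_i", -learning_rate * gradient)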
So, the parameter server is also used in the training phase. In the training stage the model is stored in
parameter shards and asynchronous updates are handled. The example figure shows that the model is stored
across the parameter server shards, and whenever there is an update, it has to be applied at the shards
managing those parameters. In this big data scenario the parameter servers are maintained across a
cluster, which is what we are going to understand in this part of the discussion.
So, parameter servers are flexible. The number of model parameters that can be supported is quite large,
and the number of cores supporting the parameter server is shown for different parts of the machine
learning landscape, for example topic modeling, matrix factorization, CNNs (convolutional neural
networks) and DNNs (deep neural networks). All of these require flexible parameter servers, and huge
numbers of parameters are managed efficiently, as shown in the example figures.
So, the extensions available in the parameter server take the form of a push-pull interface, used to send
or receive the most recent copy of a subset of the parameters, where blocking is optional. As an
extension, a call can block until all push-pulls with clock bounded by t − ρ have completed.
Refer Slide Time :( 7:14)
So, let us see data parallel learning with the parameter server. As shown in this picture, different parts
of the model are available on different servers, and the workers retrieve the parts of the model they need.
Refer Slide Time :( 7:39)
So therefore, abstractions are provided that are internally implemented as data parallel operations in the
parameter server, such as a key/value API for the workers: they may get the model values with a get(key)
operation, and similarly add(key, δ) sends a value δ to be added to the parameter in the parameter server,
so this addition is performed using the add operation.
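As a rough illustration of this key/value API, the following sketch hashes keys across several in-memory
shards, matching the idea that different parts of the model live on different servers. The sharding
scheme, the class name ShardedParameterStore, and the example key are assumptions for illustration only.

class ShardedParameterStore:
    def __init__(self, num_shards):
        # Each shard holds a disjoint portion of the model's parameters.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def get(self, key):
        # Worker pulls only the part of the model it currently needs.
        return self._shard_for(key).get(key, 0.0)

    def add(self, key, delta):
        # Worker pushes an additive update delta to the owning shard.
        shard = self._shard_for(key)
        shard[key] = shard.get(key, 0.0) + delta

store = ShardedParameterStore(num_shards=4)
store.add("topic_word_count[3][world]", 1.0)   # e.g. a Gibbs-sampling count update
print(store.get("topic_word_count[3][world]"))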
So, let us see how iterations in MapReduce are supported. Iterations in MapReduce are often required to
implement scalable machine learning, and this requires the training data to be provided at each iteration.
The output between stages is then redundantly saved by the MapReduce framework, which increases the cost
of these iterations.
And therefore parameter servers are very much needed to support scalable machine learning efficiently. The
parameter server requires the stale synchronous parallel model, which we will understand shortly. Here, the
model parameters are stored on the parameter server machines and accessed via the key/value interface as
distributed shared memory.
So, here we can compare the two. In the typical MapReduce model, the data model is independent records,
the programming abstraction is Map and Reduce, and the execution semantics is bulk synchronous parallel,
which is used to support iterative machine learning algorithms. If we compare this with the parameter
server, the data model is also independent records, but the programming abstraction is not MapReduce; it
is simply a key/value store, provided in the form of distributed shared memory. What the execution
semantics is, whether bulk synchronous parallel or some other form of synchrony, is what we are going
to see.
Now, the problem here is that networks are slow. That means, when the parameter servers maintain the
parameters of the model and the get and add operations on keys have to be performed over the network,
these operations become quite slow, since the network is slow compared to local memory access. So, we want
to explore options for handling this with the help of caching.
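As a hedged illustration of the caching idea, the sketch below keeps pulled values in a local worker-side
cache and only refreshes them from the server after a fixed number of reuses, so that not every parameter
access pays the network round-trip cost. The refresh policy and the CachedClient name are assumptions, and
it reuses the SimpleParameterServer sketch from earlier.

class CachedClient:
    def __init__(self, server, max_reads_before_refresh=10):
        self.server = server
        self.max_reads = max_reads_before_refresh
        self.cache = {}            # key -> (value, reads since last pull)

    def get(self, key):
        value, reads = self.cache.get(key, (None, self.max_reads))
        if reads >= self.max_reads:
            value, reads = self.server.pull(key), 0   # refresh over the network
        self.cache[key] = (value, reads + 1)
        return value

client = CachedClient(server)      # server: the SimpleParameterServer sketch above
client.get("w_i")                  # first access pulls over the network
client.get("w_i")                  # subsequent accesses are served from the cache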
The second solution for implementing efficient access to the parameter server over the network is
asynchronous operations. In the synchronous case, machine one, machine two and machine three each perform
their computation and iterate, and then they have to wait until the others complete their iterations; after
that they exchange their communication, and then the computation can go on.
So, in every iteration we see that there is a barrier after a machine finishes its iteration and starts the
communication, and another before all the communication completes, so that the next iteration can begin.
If we collapse these barriers, we can introduce a much more efficient, asynchronous way of implementing
this, enabling more frequent coordination with the parameter servers.
So, if these asynchronous operations are used, let us see the problems in asynchronous execution.
Asynchronous execution lacks theoretical guarantees, as the distributed environment can have arbitrary
delays from the network and from stragglers.
Let us see what problems this creates when using asynchronous communication. We therefore need to decide
on the execution semantics so that this asynchronous operation is not unbounded; it has to be a bounded
asynchronous operation that is supported.
Refer Slide Time :( 16:14)
So, parameter servers use a model called 'Stale Synchronous Parallel'. In this model, if the global time
is, say, t, then the parameters a worker gets can be out of date, but cannot be older than t − τ. Here τ is
the tolerance that is introduced, and therefore it is called 'bounded staleness'; τ controls the staleness,
so the model is also called 'Stale Synchronous Parallel'. These are the stale synchronous parallel
execution semantics, which are supported in the parameter servers.
In the stale synchronous parallel figure, we have workers 1, 2, 3 and 4. These workers update the
parameters asynchronously, and the staleness is bounded by s = 3. That means worker 2 will be blocked until
worker 1 reaches a clock of 2. In the figure, black means the updates are guaranteed, green means the
updates are visible to worker 2, blue indicates incomplete updates that have been sent to worker 2, and
updates not yet sent to worker 2 are not visible to it. The time axis is divided into clock slots, with the
staleness bound s = 3. In this way, SSP interpolates between the BSP and asynchronous modes and subsumes
both. It allows the workers to usually run at their own pace, but the fastest and slowest threads are not
allowed to drift more than s clocks apart. Cached parameters are implemented efficiently in this manner.
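As a small sketch of this stale synchronous parallel rule (workers advance their clocks freely, but the
fastest may not be more than s clocks ahead of the slowest), the following Python code uses a condition
variable to block a worker that gets too far ahead. The threading setup and the SSPClock name are
illustrative assumptions, not a specific system's implementation.

import threading

class SSPClock:
    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def advance(self, worker_id):
        # Called when a worker finishes an iteration: bump its clock, then
        # block while it is more than `staleness` clocks ahead of the
        # slowest worker.
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                self.cond.wait()

# With staleness s = 3, a worker at clock 5 would block here until the
# slowest worker reaches clock 2, matching the figure's description.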
So, consistency matters here. BSP gives strong consistency, and we relax that consistency, but the
staleness remains bounded; therefore a suitable delay gives a bigger speedup in this case.
So, in this lecture, we have discussed parameter servers and the stale synchronous parallel model of
operation, which is supported in the parameter server. Thank you.