A Data Driven Priority Scheduling Technique for a Stream Processing Platform
by
Oluwatobi Ajila
Abstract
Big data processing has become essential for businesses in recent years as it enables
organizations to gather insights from streaming data in near real-time and capitalize on business
opportunities. One drawback of stream processing engines is the lack of support for priority
scheduling. There are cases where businesses need to ensure that important input data items are
processed with low latencies thus avoiding a missed business opportunity. This thesis proposes a
technique that enables users to prioritize important input data so that they are processed in time
even when the system is under high or bursty input load. Using a prototype this thesis demonstrates
the efficacy of the proposed technique. Performance analysis demonstrates that there is a significant latency improvement for high priority data over low priority data, especially when the system is subjected to a high or bursty input load.
Acknowledgements
I would like to thank my supervisor, Dr. Shikharesh Majumdar, for his patience and
guidance during the process of writing this thesis and also throughout my time in the graduate
program. I appreciate the time and effort that he has put into helping me complete my thesis.
I also would like to thank my parents, Samuel and Victoria, for supporting me throughout
my academic career. This would not have been possible without their help.
Abstract........................................................................................................................................... i
Acknowledgements ....................................................................................................................... ii
2.3 Automatic Resource Scaling ......................................................................................... 16
4.1.4 Implementation of the Message Sending Application and the Synthetic Spark
4.1.5 Experimental Results ............................................................................................ 50
References .................................................................................................................................... 94
A.1 Workload Parameters .................................................................................................. 100
List of Tables
List of Figures
Figure 4.2: Effect of Mean Arrival Rate on Average End-to-end Latency .................................. 53
Figure 4.3: Effect of Mean Arrival Rate on Average Job Queuing Latency ................................ 55
Figure 4.4: Effect of Mean Arrival Rate on Average Scanning Latency ...................................... 55
Figure 4.6: Difference in Scheduling Overhead between DDPS and FIFO ................................. 57
Figure 4.7: Effect of Mean Arrival Rate on Average End-to-end Latency with 3 Priority Levels ........ 58
Figure 4.11: Effect of Batch Duration on Average Job Queuing Latency.................................... 64
Figure 4.20: Effect of Message Length on Average Job Queuing Latency .................................. 73
Figure 4.23: Effect of Message Service Time on Average End-to-end Latency .......................... 75
Figure 4.24: Effect of Message Service Time on Average Job Queuing Latency ........................ 76
Figure 4.25: Cloud: Effect of Number of Cores on Average End-to-end Latency ....................... 77
Figure 4.26: Cloud: Effect of Number of Cores on Average Job Queuing Latency..................... 78
Figure 4.33: Sentiment Analysis Scenario: Average Job Queuing Latency ................................. 87
Figure 4.34: Sentiment Analysis Scenario: Average Scanning Latency ...................................... 87
Figure A.2: Cloud: Effect of λ on Average Job Queuing Latency ............................................. 101
Figure A.6: Cloud: Effect of Batch Duration on Average End-to-end Latency ......................... 103
Figure A.7: Cloud: Effect of Batch Duration on Scheduling Overhead ..................................... 104
Figure A.8: Cloud: Effect of Batch Duration on Average Job Queuing Latency ....................... 104
List of Abbreviations
GB Gigabyte
MB Memory Block
List of Symbols
Aq Average job queuing latency for jobs in Spark Streaming job queue
BD Batch duration
L Length of messages
msg Message
s Second
SO Scheduling overhead
Δ1 Difference between the average end-to-end latency for low and high priority messages
Δ2 Difference between the average end-to-end latency for medium and high priority messages
Δ3 Difference between the average end-to-end latency for low and medium priority messages
Chapter 1: Introduction
In recent years big data has become one of the important factors to consider when designing
business applications. Businesses often need to deal with large amounts of data in a timely manner.
The characteristics of big data that make its processing challenging are often described as the 3Vs: Volume, Velocity and Variety [1]. Volume refers to the amount of data that is
processed; for some applications this can range from several Gigabytes to petabytes of data.
Velocity is the speed at which data needs to be processed and variety is the diversity of the data.
Diversity can result from various data forms such as text and video as well as various data formats
such as ASCII, JSON and XML. Real-time stream processing engines such as Apache Spark [2]
and Storm [3] are widely used to tackle challenges that concern the velocity and variety
characteristics of big data. Some of the major use cases of real-time stream processing engines are
sentiment analysis, fraud detection, log monitoring and processing customer behaviour. Platforms
like Spark and Storm allow users to write applications that can perform data analytics in near real-
time while offering fault tolerance, scalability and guaranteed message processing [4].
Sentiment analysis is the process of gathering and categorizing public opinion in order to
determine a feeling or sentiment about a certain issue [5] [6]. It is becoming increasingly common
for businesses and government organizations to perform sentiment analysis on message boards or
social media to gather actionable intelligence for marketing, security and automation purposes [7].
Some businesses use sentiment analysis to determine what consumers think about a particular
product [8]. Sentiment analysis is also used by companies in the financial industry to perform real-
time analysis for stock trades. Security agencies use sentiment analysis to analyze public forums
for potential illegal or terrorist activity [9]. There may be cases where one may want to assign priority to certain input data items. For example, suppose a newspaper company is analyzing tweets
in real-time for a potential newsworthy event. A tweet containing the word “fire” would warrant
immediate attention. Ideally, one would want to process this tweet immediately to determine if
there is an emergency or if the word “fire” is being used in a more benign context. One could also
construct a similar example in a security context. Security agencies may be monitoring social
media for signs of potential criminal activity. Or, a financial organization may want to take
immediate action upon news of a new directive from an elected official. In these cases, it is
imperative that the analysis is performed in a timely manner as a delay in analysis could result in
a missed opportunity. If incoming data is arriving at a high rate or the input load is very bursty this
may delay the processing of high priority data. The ability to assign priorities to data inputs helps
to ensure that the most important data is processed first in these situations [10].
Streaming platforms are often the tool of choice for performing data analytics. One of the
difficulties with stream processing is responding to changes in input load in an efficient manner.
Ideally, one would allocate enough resources such that the system can handle the input load
without causing queuing delays. This requires that the system is able to process the input load as
fast as it enters the system. Figure 1.1 demonstrates a batch processing scenario where the Y-axis
indicates the amount of time it takes to process a batch of data. The horizontal blue line indicates
the time it takes for a new batch of data to be created. A stable system is a system where the current
batch will be processed before the new batch is created. This occurs when the black line, which represents the batch processing time, remains below the blue line.
There are many cases where it is not possible to predict the system load that the system needs
to support at any point in time. If the system does not have enough resources to handle the load,
queuing delays will occur resulting in large processing delays or a system failure. This is captured
as the red portion of the graph in Figure 1.1. In these cases, one solution is to overprovision the
system. Overprovisioning is a technique where the system is allocated enough resources to handle
the peak load. This often results in a waste of resources as most of the time only a fraction of the
resources will be in use. This is shown as the green portion of the graph in Figure 1.1. Another
issue with overprovisioning is that it requires that one is able to provision for the worst case scenario, which may be difficult to predict ahead of time.
Another alternative is to dynamically provision resources, where the system adds or removes
resources based on the input workload. This is typically done by either vertical or horizontal
scaling [11]. Horizontal scaling is a technique where more processing nodes are added to the
system. This can be done by adding more machines to a cluster. Vertical scaling is where the
processing capabilities of the existing nodes are increased. In a case where one wants to increase the processing capacity of a CPU bound application with vertical scaling, one can increase the number of virtual CPUs that the application has access
to. Another example of vertical scaling is increasing the amount of memory available to a memory
bound application. The challenge with this approach is that scaling is not instantaneous: it takes time for dynamic provisioning systems to add and remove resources. In cases where the input load
is bursty, meaning the interval between arrivals varies dramatically, this may result in processing
delays when the load suddenly spikes (see Figure 1.2). The problem is exacerbated if the input
load returns to normal just before the system allocates more resources, as the system will be
overprovisioned.
This thesis proposes a data driven priority scheduler (DDPS) which helps to provide stability
to a data streaming engine under varying loads without requesting more computing resources. This
scheduling technique requires that the user assign priorities to all the different types of input data
ahead of time. At runtime the scheduler prioritizes input data with the higher priorities. In cases
where the system load increases dramatically, the latency of higher priority jobs remains relatively
stable, and the lower priority jobs remain in the queue and as a result incur processing delays.
When the input load reduces to a level where all jobs can be processed by the system, both higher and lower priority jobs are processed with low latencies.
This scheduling technique does not require additional resources to function, however, it can
be used in conjunction with dynamic resource allocation techniques as well. The data driven priority scheduler targets micro-batching engines, which are streaming engines that divide the stream of input data into small batches, after which the data
is then processed at a batch granularity. In this thesis, the proposed data driven priority scheduler
is implemented on the Spark Streaming framework. A new API is added to the Spark Streaming
framework to allow users to assign priorities to input data types as they are writing their
application. The Spark Streaming scheduler itself is modified to scan the input data and ensure that jobs containing high priority data items are processed before jobs with lower priorities.
This thesis focuses on a data driven priority scheduler for handling a single streaming
application. This technique requires the user of the given streaming application to specify the
priority of the input data types. This lets the scheduler know what input data is most important so
that it can prioritize the processing of the high priority data items. Determining the priority mappings automatically from the previous history of system operation by using machine learning is a direction for future research.
The cases identified in the thesis that benefit from this technique are text streaming
applications for performing sentiment analysis of Twitter feeds or message boards, for example.
In these cases, it is possible to assign priorities to input data that contain certain words, or are sent
from a certain sender. The future works section discusses further approaches that can be used to
improve the data driven priority scheduler. The DDPS technique does not allow users to update
the priorities of input data types at runtime; all priorities must be determined a priori. Although
this approach is apt for handling situations in which priorities are static, further research is
warranted for handling dynamic changes in priority. For example, a priority change may be
required for a sentiment analysis application analysing Twitter feeds after the occurrence of an important real-world event.
The scheduling technique investigated in this thesis focuses on reducing the end-to-end processing latency for high priority messages. Techniques that can handle deadlines associated with messages are a direction for future research. The main contributions of this thesis are:
• A novel data driven priority based scheduling technique for streaming applications that
prioritizes high priority data items and can effectively handle bursty loads.
• A proof of concept prototype of the data driven priority scheduler implemented on the Spark Streaming framework.
• Demonstration of the efficacy of the proposed technique using a synthetic workload and a sentiment analysis based workload.
• Insights into system behavior and performance resulting from a rigorous performance
analysis of the prototype that include the impact of various system and workload
parameters on performance.
The remainder of this thesis is organized as follows. Chapter 2 provides background on Apache Spark, Spark Streaming, machine learning and language classification, and discusses related work. Chapter 3 introduces DDPS and describes the implementation of the DDPS prototype.
Chapter 4 presents the performance evaluation of the DDPS prototype and discusses the results.
Finally, Chapter 5 concludes the thesis and presents topics for future research directed at enhancing the DDPS technique.
Chapter 2: Background and Related Work
This chapter presents background information on the systems and technologies used in this research and a literature survey of the state of the art related to the thesis. It introduces the Apache Spark and the Spark Streaming systems. It also presents background information on sentiment analysis techniques and machine learning. A literature survey of auto-scaling and scheduling techniques for stream processing systems is then presented.
2.1 Spark
Apache Spark is an open source cluster computing framework for large-scale data processing. It is written using Scala [12] and runs in a Java Virtual Machine (JVM) environment. Spark provides a
functional programming interface that enables users to write algorithms to process data in parallel.
Spark is mainly used with big data workloads, much like its predecessor Hadoop. However, Spark
differs in that it offers in-memory processing, giving it a large performance advantage over Hadoop.
Spark represents input data as a Resilient Distributed Dataset (RDD). RDDs are immutable collections of data elements represented as Scala objects that can be partitioned across
a cluster of machines [12]. RDDs are fault tolerant and can be rebuilt if a node storing a partition
fails.
Spark enables users to perform parallel operations on RDDs. There are two kinds of
operations that can be performed: transformations and actions [13]. Spark defines transformations
as operations that produce a new RDD (or multiple RDDs) from an existing one. These
transformations include map, which takes a function and applies it to each element of an RDD and generates a new one [14], or join, which combines two RDDs that contain key-value tuples and
produces a new RDD that is a combination of the input RDDs. Spark defines actions as operations
that are performed on RDDs which do not generate a new RDD. An example is count, which returns the number of elements in an RDD.
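The distinction between transformations and actions can be illustrated with the following Scala sketch (a minimal, self-contained example with hypothetical input values; it is not taken from the thesis prototype):

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD from an in-memory collection (hypothetical data).
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // map is a transformation: it lazily defines a new RDD.
    val doubled = numbers.map(_ * 2)

    // count is an action: it triggers computation and returns a value to the driver.
    println(s"Number of elements: ${doubled.count()}")

    sc.stop()
  }
}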
Apache Spark runs with multiple nodes for parallel data processing. The main node is
called the driver node, and all others are worker nodes [15]. Figure 2.1 illustrates the relationship
between the driver and worker nodes. When running Spark in the cluster mode, each node is a
JVM instance which may have multiple threads executing on it. In the local mode all the nodes are
in a single JVM instance. The Spark driver program (an instance of SparkContext) is an application
that executes on the driver node. An application may have multiple threads. The SparkContext
defines a directed acyclic graph (DAG) of RDDs where the operations are edges and the RDDs
are nodes [12]. The Spark scheduler splits the graph into stages and attempts to pipeline operations.
Tasks are created for each stage, where a task is defined in terms of the data and the computation
to be performed on it. The Spark scheduler assigns tasks to executors which run on worker nodes.
Spark Streaming is a framework that is built on top of Apache Spark. It offers fault-tolerant
near real-time processing of big data. It can process data from input sources such as Kinesis, TCP
sockets, Kafka [16] and can output processed data to databases, filesystems and dashboards. Spark
Streaming allows users to apply transformations on streams. These transformations are similar to the ones that the Spark APIs allow on RDDs. An example of these transformations is map, which applies a function to each element of a stream. The stream abstraction in Spark Streaming is the discretized stream (DStream). A DStream is a continuous stream of RDDs where the elements of each RDD are data items from a
specific time interval. This technique is commonly known as micro batching. The time interval is
called the batch interval. The batch interval impacts the latency of processing the data. Figure 2.2
demonstrates an example where an input source sends data items to a Spark Streaming instance.
The RDDs contain input data items that were sent within a certain batch interval, t (seen on x-axis
of Figure 2.2). The continuous stream of RDDs over time is the DStream.
Each operation on a DStream translates to operations on its underlying RDDs [17]. As a result, DStream transformations are essentially RDD transformations continuously
applied over time. The RDDs are eventually processed by the spark computing engine and the
result is a series of output batches. A DStream graph is a directed acyclic graph where each stream
is an edge and transformations are nodes. Figure 2.3 illustrates an example where there are three
DStreams. The second and third DStream are produced by the first and second transformation
respectively. The output of the preceding DStream is the input of the following DStream in the
next batch interval. For example, the output RDD of DStream1 at time t is the input RDD for DStream2 in the next batch interval.
The major components that compose Spark Streaming are the receiver, job scheduler, job
manager, receiver tracker and the DStream graph. The receiver executes on a worker node and
accepts input data items. The receiver tracker keeps track of all the input data items received and
places them into an RDD at the end of each batch interval. For each batch interval the job scheduler
creates jobs for each RDD in the DStream graph. A job represents all the input data items (RDD)
for a specific time window and the transformations that will be applied to it. After the job is created,
it is placed in a job queue. The job queue is a First In First Out (FIFO) queue managed by the job
manager. The job manager only removes a single job from the queue at a time and hands it off to
the Spark scheduler. The Spark scheduler divides the job into tasks which are then processed on
executors. Figure 2.4 shows the high level components of Spark Streaming. Similar to
SparkContext for Apache Spark, Spark Streaming has a SparkStreamingContext. This represents
an instance of the streaming application. It is the receiver for streaming API calls.
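As an illustration, the following Scala sketch (a minimal example, with a hypothetical host and port for the input socket) creates a StreamingContext with a one second batch interval, applies a map transformation to every batch, and prints the results:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-example").setMaster("local[2]")

    // The batch interval determines how often a new RDD (and job) is created.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Receive lines of text from a socket (hypothetical host and port).
    val lines = ssc.socketTextStream("localhost", 9999)

    // A DStream transformation: applied to the RDD of every batch interval.
    val upper = lines.map(_.toUpperCase)
    upper.print()

    ssc.start()
    ssc.awaitTermination()
  }
}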
Spark Streaming does not support any queueing discipline other than FIFO for processing
jobs. However, the underlying Spark framework does support different scheduling modes for
scheduling jobs within an application. By default, when an application’s threads submit tasks, they
are processed in a first come first served manner. If a thread’s tasks are very large, they will use
up as many resources as required to process them. This may cause delays for the other application
threads that are waiting. If the application does not require all the resources, the next thread in the
queue may begin running its tasks. The second mode Spark offers for scheduling within jobs is the
fair scheduler. This mode runs jobs in a round-robin fashion so that all the threads in the queue get
an equal share of resources regardless of order or size. The fair mode also supports grouping of
threads into pools and assigning weights to each pool. Pools make it possible to give a certain group of threads a larger share of the cluster's resources.
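As an illustration, the fair scheduling mode can be selected through Spark's standard configuration; the sketch below uses the spark.scheduler.mode setting and the spark.scheduler.pool local property, with a hypothetical pool name:

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingExample {
  def main(args: Array[String]): Unit = {
    // Select the fair scheduling mode for jobs within this application.
    val conf = new SparkConf()
      .setAppName("fair-scheduling-example")
      .setMaster("local[*]")
      .set("spark.scheduler.mode", "FAIR")

    val sc = new SparkContext(conf)

    // Jobs submitted from this thread are associated with a (hypothetical) pool.
    sc.setLocalProperty("spark.scheduler.pool", "highPriorityPool")
    val total = sc.parallelize(1 to 1000).sum()
    println(s"Sum: $total")

    sc.stop()
  }
}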
Spark’s default (standalone) mode provides each application its own JVM and there is no sharing
of computing resources between applications. The resources in the cluster are assigned to
applications in a FIFO manner. One can also statically partition the cluster's resources for each
application. Spark also supports third party resource schedulers such as Mesos [19] and YARN
[20]. Both of these offer static partitioning. Mesos also supports resource sharing where an application can relinquish its own resources to another application if it does not need them. None of
the existing schedulers for Spark support dynamic job priority for which the priority of a job can
change dynamically based on the data elements it contains. This thesis is directed at addressing
this gap in the state of the art by introducing a data driven priority scheduler for Spark-based stream
processing systems.
Sentiment analysis is a branch of computational intelligence that is used to identify the public’s emotions or sentiments and act on
it with computational algorithms. This thesis makes use of a sentiment analysis scenario to test the
performance of DDPS.
In recent years sentiment analysis has become an increasingly popular field of study. Its
applications range from combating cyber-bullying to election predictions [21]. Sentiment analysis
is the process of extracting, identifying and quantifying opinions or emotions of a population [22].
The origin of sentiment analysis dates back to the early 1900s in the form of public opinion
analysis. This type of analysis involved creating a series of questions or statements to represent
opinions [23]. The questions and statements were then rated by people on a scale. A ranking of
opinions was then created using the results from the ratings. More recently in the 1990’s
computational linguistics was used to conduct subjectivity analysis of fictional narrative texts [21].
This was done by writing algorithms which made use of the syntactic, lexical and morphological
elements of the text to determine whether a sentence expresses an opinion, feeling or emotion [24].
In recent times natural language processing techniques, machine learning and lexicon based
approaches have been used to extract and analyze sentiment from text [22]. These techniques often
involve some sort of sentiment classification and sentiment rating prediction. Sentiment
classification involves classifying an opinion or sentiment into distinct categories (e.g. positive,
negative). Sentiment rating prediction is the process of determining the intensity of the sentiment.
Machine learning is a field of study that makes use of computers to simulate the learning
capabilities of humans. This is often done to acquire new information, identify existing information
or to optimize performance of a system [25]. Machine learning has its roots in computational
learning theory and pattern recognition [26]. In recent years machine learning has become one of the most widely used techniques for data analysis.
The four main aspects of machine learning are: external environment, learning, repository
and execution [25]. The external environment represents the information that is available outside
the machine learning system. This could be tweets, traffic data, medical history of patients [26]
and so on. Learning is the aspect of processing data from the external environment to obtain
knowledge. A repository is a system used to store the knowledge obtained in the learning phase [25]. Knowledge can be represented in various forms such as semantic networks, production rules
and logic statements [25]. The final aspect, execution, is the process of taking the stored knowledge
from the repository and using it to complete a task. The results of the execution phase are then fed back to the learning aspect so that the acquired knowledge can be refined.
The common types of machine learning are supervised learning, unsupervised learning,
semi-supervised learning, and reinforcement learning [27]. Machine learning systems typically use
labeled data, unlabeled data or a combination of the two as an input to the system [28]. Unlabeled
data is any piece of information (natural or artificially created) that is available to the machine
learning system. This can be tweets, pictures, seismic data, news articles, etc. Labelled data is an
association between unlabeled data and an identifier or classification that describes what the
unlabeled data is. Supervised learning is the process of comparing the computed output with the
expected output for a given input. The machine learning system records the error between
computed and expected output and attempts to minimize this error [27]. Unsupervised learning is
the process of learning based on the discovery of input patterns. Reinforcement learning is a type
of learning where an incorrect output is penalized and correct output is rewarded. The goal of this
approach is to maximize a long term reward. This is different from supervised learning in that the
input and output pairs are never given [27]. Semi-supervised learning is a hybrid of supervised and
unsupervised learning where both user input and machine learning discovery are used for training the system.
2.3 Automatic Resource Scaling
An auto-scaler is a system that automatically adjusts the amount of resources an application is using with minimal or no human
involvement [30]. As previously stated in Section 1.1, there are two ways for auto-scalers to
dynamically allocate and deallocate resources. One is vertical scaling and the other is horizontal
scaling. Horizontal scaling involves adjusting the number of nodes that the application is being
run on. Increasing the number of nodes is referred to as scaling out, and reducing the number of nodes is called scaling in. Vertical scaling typically involves adjusting the CPU or memory capabilities of the existing nodes on which an application is being run. Increasing the amount of computing resources via vertical scaling is called scaling up, and reducing the amount of resources is called scaling down. Vertical scaling often requires the auto-scaler to reboot the operating system that is hosting the application [30]. For this reason, most cloud providers primarily support horizontal scaling.
There are two main techniques for autoscaling, reactive and proactive [30] [31]. Auto-
scalers often monitor performance metrics such as CPU utilization, input load, service response
time or dropped requests [32]. Reactive auto-scalers will set thresholds for one or more
performance metrics to determine when more resources are required or if resources need to be
released. The performance metrics are monitored at a specified sampling frequency. Increasing the
sampling frequency increases the responsiveness of the auto-scaler, but also increases the overhead
of the auto-scaler [30]. An example of a reactive auto-scaler is presented in [33]. This technique
runs an algorithm on every request arrival. The algorithm decides if the system needs to acquire
more resources or if the incoming request must be rejected. The goal of the algorithm is to
minimize the cost of computing resources while ensuring that the pre-determined grade of service
is met. Proactive auto-scalers also monitor performance metrics, however, they use the
performance metrics to predict future loads. This lets the auto-scaler pre-emptively allocate or
deallocate resources. The analyses required to predict future loads contributes to the overhead of
the auto-scaler. An example of this is the proactive automatic resource scheduler presented in [34].
This technique uses machine learning to predict the number of resources required to handle the
upcoming input load based on previous input load patterns. There are some auto-scalers that
combine both proactive and reactive auto-scaling techniques when provisioning computing
resources. An example is the work presented in [35] which uses a proactive machine learning
approach in conjunction with a reactive scaling algorithm in order to scale the machine resources to meet the demands of the input load. It invokes the reactive algorithm after each request arrival and the proactive algorithm at regular intervals.
The literature suggests that there are two ways to solve the problem of bursty input loads
in stream processing environments. These techniques are auto-scaling, presented in Section 2.4.1, and load balancing, presented in Section 2.4.2.
Most of the existing studies attempt to mitigate the effects of varying input loads by using
auto scaling techniques. Addressing the variance in input loads is important as this could affect the
response time of an application. If an application suddenly receives a large burst of input data this
could result in missing SLAs or perhaps running out of memory to store all the received data. An
example of an auto-scaler is seen in [36] which implements a dynamic scaling system on the
Nephele stream processing engine. In this work, the researchers created a model to estimate the
latency of processing data based on the current resource allocations. Using this model they were
able to scale in or out the amount of resources required to meet latency constraints. Research
presented in [37] also demonstrates an auto scaling technique for real time stream processing
engines based on system utilization thresholds and reinforcement learning where past scaling
decisions and outcomes are recorded and used to determine future scaling decisions. Similarly,
[38] provides an auto scaling mechanism to stream processing engines that makes use of machine
learning to classify workloads and predict whether scaling of resources will be required in the near
future. The advantage of using the data driven priority scheduler presented in this thesis over auto-
scaling techniques is that it functions in situations where it is not possible to acquire more
resources.
Another technique to address the effects of bursty input loads is load balancing. Load
balancing [39] is the process of taking load from one node that is over-utilised and sharing or
moving it entirely to another node which has unused processing capacity [40]. Examples of load
balancing techniques are Aurora and Medusa [41]. Both of these run a daemon on each node to
monitor its utilization. If the node is operating at maximum capacity it will attempt to either move
all computations to other nodes or transfer part of the computations to another node and continue
processing. The downside to load balancing is that it requires the system to have extra nodes which
are under-utilised or not utilised at all in order to function properly. This thesis proposes a
technique that addresses varying input loads without requiring the system to have under-utilised
nodes.
The scheduling techniques discussed earlier rely on having the ability to increase the
computational resources of the cluster. This is done either by horizontal scaling where more nodes
are added to the cluster or by load balancing where nodes that are not being fully utilized are used
to alleviate the pressure on nodes that are over utilized. There may be cases where it is not possible
to acquire more nodes or load balance. For example, a company hosting an on premise data center
would not be able to easily increase their computing capacity if the input load suddenly increases.
Also, if all their nodes are fully utilized, load balancing will not work as there will be no nodes to
transfer computations to. Research that attempts to address fluctuating input loads without the use of auto-scaling or load balancing is discussed next.
To the best of our knowledge, there are only three other bodies of work that address the
effects of dynamic input loads in real-time micro batching engines like Spark Streaming without
the use of auto scaling techniques or load balancing. The first is research on adaptive scheduling
with Spark Streaming [42]. This research introduced a technique that dynamically modifies the
batch duration based on input load. It demonstrates that there is a trade-off between latency and
throughput. When running Spark Streaming with a small batch duration the end to end processing
latency is smaller, but throughput is reduced. With a large batch duration throughput improves but
the latency gets larger. In cases where the input load varies over time, the adaptive stream scheduler
attempts to stabilize the system by either increasing the batch duration when the input load
increases, or reducing the batch duration when the input load reduces. This technique sacrifices
responsiveness at peak loads and minimizes latency when the system is under a reduced load. One
advantage that DDPS, presented in this thesis research, has over the adaptive scheduler is that
when the system load increases, only the response time of low priority data items gets worse.
However, with the adaptive scheduler, the response time of all data items increases.
The second technique is DSlash proposed by Birke et al. This work [43] highlights the fact
that workloads may vary greatly over time and that this may cause issues for time sensitive data
analysis. The authors propose a data controller, DSlash, that processes data as fast as it can given
an application target latency. It achieves this by delaying data processing and load shedding of
data older than a supplied threshold. Load shedding is a technique where input data items are
discarded when the system is unable to keep up with the demands [44]. By discarding input data
the system is able to recover and improve its responsiveness. The advantage of using the data
driven priority scheduler over DSlash is that it doesn’t discard input data items to respond to peak
loads. Load shedding is a lossy operation which may affect the application’s results.
The third approach is the A-Scheduler [45], which is implemented on Spark streaming. It
improves performance by dynamically modifying the number of jobs that are processed in parallel.
By default, Spark streaming only processes a single job at a time. The A-Scheduler assesses data
dependencies between jobs to determine if jobs can be run in parallel. By increasing parallelism,
both throughput and latency are improved. The drawback to this approach is that the A-scheduler
only improves performance when the system is underutilized. Increasing parallelism requires more
resources; if there are no more resources on the system, there will be no improvement. The benefit
of using the technique presented in this thesis over the A-scheduler is that it minimizes the latency
of high priority data items even when the system is running at maximum capacity.
The main drawback with the three techniques [42] [43] [45] is that they do not allow one
to assign priorities to input data items. The ability to assign priority is essential if there is important
information that must be analyzed in a timely manner. Load shedding [43] is not an option because
it can cause important input data items to be discarded. Adaptive scheduling techniques [42] can
improve the efficiency of the streaming engine, which improves throughput, but sacrifices responsiveness at peak loads.
To the best of our knowledge there is only one other body of work that demonstrates
priority scheduling in a real-time micro batching engine like Spark Streaming. In this research [46]
a priority based resource scheduling technique for real-time distributed streaming processing
systems is introduced. The technique enables users to assign priorities to certain nodes in the
stream processing graph. Input data is assigned a priority based on the nodes that it is destined to
reach. The scheduler re-orders the input data such that the high priority data items are processed
first. The difference between the technique proposed in [46] and DDPS presented in this thesis is
that DDPS assigns priorities to input data types rather than nodes in the DAG. Thus our proposed
technique lets users differentiate between input data items that will reach the same nodes in a DAG
by assigning high priority keywords to input data items, whereas [46] does not provide any such capability.
There has been similar research done in the context of complex event processing. This
work [47] presents a deadline aware scheduling technique where a deadline is assigned to an event
stream. The researchers propose a pre-emptive time-sliced scheduling technique that can prioritize
event streams based on priority. This work [47] differs from the research described in this thesis
in that it [47] assigns a priority to the entire stream whereas this thesis allows users to assign priorities to individual input data items.
2.5 Summary
This chapter presents the Apache Spark and Spark Streaming systems. It shows that
Spark Streaming is a micro-batching stream processing engine that associates batches with jobs
and executes jobs on the Apache Spark engine in a FIFO manner. This chapter introduces
sentiment analysis and machine learning and describes some of their uses in data analytics. The motivations for priority based scheduling are discussed and a survey of priority based scheduling
and techniques to address fluctuating input loads in stream processing environments is presented.
None of the papers presented has addressed the problem of bursty input loads where input data items have differing priorities.
Chapter 3: The Data Driven Priority Scheduling Technique
This chapter presents an overview of the Data Driven Priority Scheduling technique. It also
introduces a priority mapping strategy and discusses the key algorithms for the DDPS technique
with an analysis of the algorithmic complexity. Lastly, it describes the implementation of DDPS on the Spark Streaming framework.
In the default Spark Streaming system, the system receives a stream of input data items
and places the data items in an RDD. Each RDD is associated with a job and jobs are placed in a
job queue. Jobs in the job queue are processed in a First-In-First-Out manner. A FIFO queue is a
queue where the item that entered the queue first is removed from the queue first. Figure 3.1
illustrates an example where an input data stream sends input data items to the streaming system.
Spark Streaming places the input data item into a RDD. The RDD is associated with a job and the
job is placed in the job queue. Note that with a FIFO approach, if there is an important input data
item in Job N + 2, the data item cannot be processed until Job N and Job N + 1 have been processed.
The goal of the DDPS technique is to give precedence to higher priority input data items.
This is accomplished in two phases. The first is classifying the priority of input data items. In this thesis this is done by the user before starting the streaming application. The second phase is performing priority scheduling. Priority
scheduling is done in two steps. The first step is to scan the input data items to determine the
priority of the data items. Once the priority of the data items is known it is used to determine the
priority of the job. The second step is to reorganize the job queue such that the highest priority
jobs are processed first and lowest priority jobs are processed last. Figure 3.2 shows a scenario
where an input data item in Job N + 2 is marked as high priority. With the default Spark Streaming
system the high priority data item would only be processed after the two preceding jobs. But with
DDPS, the job containing the high priority input data item is moved to the head of the queue and is processed next.
As input data items enter the streaming system, they are scanned to determine their priority.
In order for DDPS to know the priority of an input data item, the user must provide a list of priority
mappings before the system is started. A priority mapping is a general technique that is used to
assign a priority to an input data item. A priority mapping is a key value pair (i.e. {key, value})
where the key is a feature or characteristic of a data item. For example, if data items were composed
of strings, a key could be a word or a phrase. Or, if the input data items were composed of JSON objects, the key could be a specific value for a property in the JSON object. The value in the priority
mapping is an integer that represents the priority. The larger the value of the integer the higher the
priority and vice versa. A user may supply multiple priority mappings (i.e. a list of priority mappings).
Input data types can be composed of ASCII messages, JSON objects, files, tweets, GPS
coordinates, database entries, etc. A stream will only be composed of one type of input data item.
This means that a stream of JSON objects will only contain JSON objects. There cannot be a
mixture of multiple types of input data items in the same stream. For this thesis, the focus is on streams of ASCII text messages.
In a case where an input data stream contains ASCII sentences, a user could provide the
following list of priority mappings for example {{“fire”, 10}, {“explosion”, 100}}. With the
supplied list of priority mappings, the words “fire” and “explosion” are the keys and the integers
10 and 100 are the values. Therefore, the words “fire” and “explosion” are assigned priorities of 10
and 100 respectively. This means that the word “explosion” has a higher priority than the word
“fire”.
The priority of a job is the priority of all the input data items contained in its RDD summed
together. For example, if the system were to contain an RDD with the following input data items:
Input Data Item1: a message containing no priority keywords
Input Data Item2: a message containing the word “fire”
Input Data Item3: a message containing the word “explosion”
The priority of the job associated with the RDD would be 10 (Input Data Item2) + 100
(Input Data Item3) = 110. Only the words with mappings contribute to the priority of the Job. This
means that words without mappings in effect have a priority of zero. Since Input Data Item 1 does
not contain any words in the list of priority mappings it does not contribute to the priority of the
job.
In the following example there are two jobs with RDDs that contain the following input
data items:
Job 1:
RDD: Input Data Item1 - “Fire, fire, fire, fire, fire, fire, fire, fire, fire, fire, fire!!”
Job 2:
RDD: Input Data Item1 - a message containing the word “explosion” once
In this case Job 1 has a priority of 110 because the word “fire” appears 11 times in its associated RDD. Job 2 has a priority of 100 because the word “explosion” appears only once in its RDD. As a result, Job 1 will be treated with higher precedence than Job 2.
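The priority computation can be pictured with the following simplified Scala sketch (a stand-alone model rather than the DDPS implementation; the names priorityMappings and jobPriority are illustrative, and keyword matching is case insensitive for simplicity):

object JobPriorityExample {
  // Illustrative priority mappings: keyword -> priority value.
  val priorityMappings: Map[String, Int] = Map("fire" -> 10, "explosion" -> 100)

  // The priority of a job is the sum of the priorities of every keyword
  // occurrence over all input data items in its RDD.
  def jobPriority(items: Seq[String]): Int =
    items.map { item =>
      priorityMappings.map { case (keyword, value) =>
        value * keyword.r.findAllIn(item.toLowerCase).length
      }.sum
    }.sum

  def main(args: Array[String]): Unit = {
    val job1 = Seq("Fire, fire, fire, fire, fire, fire, fire, fire, fire, fire, fire!!")
    val job2 = Seq("An explosion was reported downtown")
    println(jobPriority(job1)) // prints 110
    println(jobPriority(job2)) // prints 100
  }
}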
There is no restriction on what the value in a key-value pair is allowed to be. Also, multiple
keys may be mapped to the same value. This means that the words “loud” and “boisterous” can be
assigned the same priority. In addition, there is no limit on the size of the list of priority mappings that a user may supply.
The previous discussion has focused on textual data streams. The RDD scanning
algorithm can detect any keyword specified by the user and has been used effectively in the context
of sentiment analysis on live tweets described in Section 4.2. Other examples include scanning
data containing text based transaction logs of bank accounts to detect fraudulent activity [48] and
live scanning of sales transactions for merchants. Handling a stream characterized by a sequence
of structures where each structure corresponds to the current reading of a sensor value in a sensor
based system will require changing the data scanning algorithm. The user will need to define
specific values for a designated field within the structure that will determine the priority of the input data item.
The priority scheduler proposed in this thesis assigns a priority to input data items, and
allows the system to give precedence to higher priority data items over lower priority data items.
This is performed in two phases. The first is the classification stage and second is the scheduling
stage. A classification of data items by the user is done ahead of time before the streaming
application begins, as shown in Figure 3.3. In this stage the user assigns priorities for input data
types by creating priority mappings. The priority mappings are then supplied to the streaming
engine. The streaming engine will use the priority mappings to determine which input data items should be given precedence.
Figure 3.3: Data Driven Priority Scheduler Overview
The scheduler is invoked on every batch interval and schedules the jobs that process the
input data items in the current batch. The scheduling stage is performed at runtime as input data
items are being placed in the RDD. The data items are scanned for determining their priority and
the number of high priority elements in the RDD are added up to determine the priority of the job.
A short description of the scheduling technique is provided next. Algorithm 1 concerns the
receiving of messages and their storage into memory blocks (MB). Algorithm 2 assigns the
memory blocks to a new RDD. Algorithm 3 is responsible for scanning RDDs, creating Jobs and
assigning job priorities. Algorithm 4 is responsible for ordering the jobs in the job queue.
Algorithm 5 is responsible for recycling memory blocks that are no longer in use. Algorithm 1, 2,
and 5 are required for both DDPS and the default FIFO scheduler. This is because algorithm 1, 2
and 5 are required for any job scheduling policy as they are responsible for allocating memory blocks
for input data items, creating RDDs and freeing the unused memory blocks. Algorithms 3 and 4
are specific to DDPS, therefore, the only overheads that DDPS introduces are Algorithm 3 and 4.
Minor changes are made to Algorithm 5 in order to support DDPS. More details on this are provided below.
Algorithm 1: The input parameter to Algorithm 1 is the unattached blocks list. This is a list
of memory blocks that have not yet been assigned to an RDD. First, the receiver reads in input
data from a socket connection (line 3). The read is a blocking call and will only complete once
there is data to be read. The current memory block is the memory block that the receiver will add
the received input data to. If the current memory block is full or has been removed from the list of
unattached memory blocks, the receiver will allocate a new memory block (line 5) and add it to
the list of unattached memory blocks. The newly allocated memory block becomes the current
memory block. Algorithm 1 runs continuously on the receiver for the lifetime of the application.
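A simplified stand-alone model of this receive loop is sketched below (it is not the actual Spark Streaming receiver code; the MemoryBlock class and its capacity are hypothetical):

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import scala.collection.mutable.ListBuffer

// Hypothetical memory block holding received messages up to a fixed capacity.
class MemoryBlock(capacity: Int = 512) {
  val messages = ListBuffer.empty[String]
  def isFull: Boolean = messages.size >= capacity
  def add(msg: String): Unit = messages += msg
}

object ReceiverSketch {
  // List of memory blocks not yet assigned to an RDD (the input to Algorithm 1).
  val unattachedBlocks = ListBuffer.empty[MemoryBlock]

  def receive(socket: Socket): Unit = {
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
    var current = new MemoryBlock()
    unattachedBlocks += current

    // Run for the lifetime of the application: block on the socket read,
    // then store the message in the current memory block.
    var line = reader.readLine()
    while (line != null) {
      if (current.isFull || !unattachedBlocks.contains(current)) {
        // Allocate a new memory block and make it the current one.
        current = new MemoryBlock()
        unattachedBlocks += current
      }
      current.add(line)
      line = reader.readLine()
    }
  }
}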
Algorithm 2: The input parameters to algorithm 2 are the unattached blocks list and the
RDD list. The RDD list is a list of all the RDDs that have been created. Algorithm 2 is executed
by the receiver tracker. At the beginning of each batch interval the receiver tracker creates a RDD
and assigns it an ID that includes the time at which the RDD was created. The RDD is also added
to the list of RDDs (line 4). Next, all the memory blocks from the unattached blocks list are added to the newly created RDD (lines 4 – 8). Once this is completed the newly created RDD is returned.
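A corresponding simplified sketch of RDD creation at each batch interval is shown below (again a stand-alone model with illustrative types, not Spark's internals; here a memory block is modelled simply as the sequence of messages it holds):

import scala.collection.mutable.ListBuffer

// Simplified model: an RDD is an ID (derived from its creation time) plus its blocks.
case class RDDSketch(id: Long, blocks: ListBuffer[Seq[String]] = ListBuffer.empty)

object ReceiverTrackerSketch {
  val rddList = ListBuffer.empty[RDDSketch]

  // Executed at the beginning of each batch interval (Algorithm 2): a new RDD is
  // created and all currently unattached memory blocks are moved into it.
  def createRdd(unattachedBlocks: ListBuffer[Seq[String]]): RDDSketch = {
    val rdd = RDDSketch(id = System.currentTimeMillis())
    rddList += rdd
    rdd.blocks ++= unattachedBlocks
    unattachedBlocks.clear()
    rdd
  }

  def main(args: Array[String]): Unit = {
    val unattached = ListBuffer[Seq[String]](Seq("msg1", "msg2"), Seq("msg3"))
    val rdd = createRdd(unattached)
    println(s"RDD ${rdd.id} holds ${rdd.blocks.size} blocks; unattached blocks left: ${unattached.size}")
  }
}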
Algorithm 3: The input parameters to algorithm 3 are a new RDD, the Job map and the list
of priority mappings. The RDD is the newly created RDD from Algorithm 2. The job map is a
key-value map where the key is an ID and the value is a job. The receiver tracker scans the data
items in the RDD to see if they contain any items listed in the priority mappings list (lines 2 – 8).
This process calculates the cumulative priority of all the data items in the RDD, and the result is
used to determine the priority of the job. Once this is complete, the RDD is assigned to a new job
which keeps track of the RDD and all the operations that will be applied to the RDD (line 16). The
job will be assigned an ID that is the same as the RDD’s ID. This ID represents the age of the job
and it is used as the key when entering the job in the job map. Next, the job is handed to the job manager.
Algorithm 4: The input parameters to Algorithm 4 are the job queue and a newly created
job. Algorithm 4 is executed by the job manager. The job queue is a queue consisting of jobs where
the jobs are ordered first by their priority and secondly by their age. If two jobs have the same
After the receiver tracker executes Algorithm 3, it hands the job to the job manager. The
job manager checks and ensures that the queue is sorted in order of job priority: higher the priority
of a job, higher is its position in the queue. This is done by comparing the priority of the incoming
job with that of the highest priority job in the priority queue (line 3). If the incoming job has a
higher priority it is inserted at the head of the queue (line 4). If it has the same priority as the
priority of the job at the head of the queue and is older (created before) it is placed at the head of
the queue (lines 7-8). Otherwise, the incoming job’s priority is compared with that of the next job in the queue, and this process is repeated until the correct position for the job is found.
A Spark Streaming application may have multiple receivers that accept data in parallel.
This creates a race condition where two threads are submitting jobs to the job manager at the same
time. It is possible that an older job may be submitted later because its thread was de-scheduled by
the Operating System. For this reason, the job manager must check the age of the job by looking at its time field rather than relying on the order of submission.
After the currently executing job has completed, the Job manager takes the job at the head
of the priority queue and sends it to the Spark engine which divides the job into tasks to be
processed on executors. Note that each message is processed by a separate task. The order in which
messages are processed on executors is determined by the Spark engine and is not modified by the
DDPS technique proposed in this thesis. Once the job has been processed, it is removed from the
Job map.
If no priority mappings are supplied, the DDPS technique will behave as a FIFO scheduling
policy for processing all the jobs. This is equivalent to disabling DDPS.
Algorithm 5: The input parameters to algorithm 5 are the RDD and the Job map. The
Algorithm 5 is run after a job has been processed. The algorithm iterates through all the RDDs in
the RDD list and checks if the RDD is older than the remember duration (line 3). The remember
duration is the duration for which the streaming engine must keep the input data. This must be
done because Spark Streaming offers operations that can operate on input data that was processed
by previous jobs. As a result, the remember duration is required for any scheduling policy used by
Spark Streaming. As a default, the remember duration is the batch duration multiplied by 2,
however, this duration can be increased if users wish to operate on data from a larger time window.
To determine if the RDD is older than the remember duration, the RDDs ID is compared with the
current time. If the RDD’s age exceeds the remember duration, the Job map is then consulted to
determine if the job associated with the RDD has been processed or is still waiting in the priority job
queue (line 4). The Job map is queried using the RDD’s ID; if no entry is found then the RDD is
removed from the RDD list and all memory blocks associated with the RDD are freed (line 5). If
an entry is found the RDD is not removed and the next RDD in the RDD list is searched.
Algorithm 5: Algorithm for freeing unused memory blocks
// The following algorithm is run each time
// a job has completed processing
// Input:
// RDDList a list of all the created RDDs
// JOBMap a key-value map
1: Begin
// loop through all the RDDs in the RDD list
2: for each RDDi in RDDList
// check if the RDD is older than
// the remember duration
3: if age(RDDi) > (REMEMBER_DURATION)
// search the Job map to see if the job
// associated with the RDD has been processed
4: if (JOBMap.get(RDDi.ID) == empty)
// if the job is not in the Job map free
// all memory blocks associated with the RDD
5: freeAllMemoryBlocks(RDDi)
6: endif
7: endif
// Check the next RDD in the RDD list
8: endfor
9: End
This section discusses the implementation of the prototype Data Driven Priority Scheduler
running on Spark Streaming. It discusses the five algorithms that are used by DDPS and focuses
on the implementation details. It also highlights the key changes made to Spark Streaming and
some of the APIs that were used in the DDPS implementation. The proof of concept prototype
system was built using Spark Streaming Version 2.1.2 running on top of Apache Spark 2.1.2.
The Spark Streaming API [49] was extended to support DDPS on Spark Streaming. A new
method “addPriorityMappings” was added to the Spark Streaming API which enables
users to provide priority mappings to DDPS scheduler. The new method takes in a java.util.List of
Tuples and saves it in an internal buffer. Since java.util.List is an interface, users can supply any implementation of the List interface [50]. The type parameter for the list is a Tuple2. This type is a tuple type that contains two elements. In this case, the elements must be a string and an integer. The string will be the priority keyword and the integer will be the priority value.
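A usage sketch of the extended API is shown below. The addPriorityMappings call is specific to the DDPS prototype and is therefore shown as a comment; its exact placement and signature in the modified code base may differ from this illustration:

import java.util.Arrays
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PriorityMappingUsage {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ddps-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Priority mappings: keyword -> priority value (illustrative values).
    val mappings: java.util.List[(String, Integer)] = Arrays.asList(
      ("fire", Int.box(10)),
      ("explosion", Int.box(100))
    )

    // Hypothetical call to the API added by the DDPS prototype (not part of stock Spark):
    // ssc.addPriorityMappings(mappings)

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}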
Implementation of Algorithm 1
Section 3.3 outlines the five key algorithms that comprise DDPS. Algorithm 1 is
responsible for receiving incoming data and placing it into memory blocks. When the streaming
engine is started, a socket connection is established to receive input data items by launching a receiver thread that receives data items from the socket connection, as shown in Algorithm 1. The
receiver will continue to receive data from the socket connection until the streaming engine is
shutdown. Algorithm 1 is required for any scheduler that runs on Spark Streaming.
Implementation of Algorithm 2
Algorithm 2 is responsible for taking the data in memory blocks and placing it into RDDs.
This process is started when the main thread launches a job scheduler instance. The job scheduler is responsible for triggering the creation of jobs. It accomplishes this by starting a timer that is set to
generate a job creation event on every batch interval. The job creation event is asynchronous, so
after the job scheduler sends the job creation event, it does not wait for a response but sleeps for
the duration specified by the batch duration and repeats this process. The job creation event is sent
to the ReceiverTracker which then creates the RDD. This is done by first collecting all the memory
blocks in the unattached memory blocks list and assigning them to the RDD as shown in lines 4 – 7
of Algorithm 2. The RDD is also given an ID that contains a time stamp which is identical to the
time at which the job scheduler sent the job creation event. Algorithm 2 is required for any scheduler that runs on Spark Streaming.
Implementation of Algorithm 3
Algorithm 3 is responsible for scanning RDDs to determine if they contain any priority
keywords. The scanning stage occurs immediately after the RDD is created. The following
describes the implementation: First, an accumulator used to track the priority of the RDD is set to
zero. The next step is to iterate over all the mappings in the internally stored list of priority
mappings. In each iteration a “foreach” [51] operation is done on the RDD to search each element
in the RDD for the priority keyword, as shown in line 4 of Algorithm 3. A foreach operation
applies a function to all the elements in a collection [51]. The elements of the RDD will be
instances of “java.lang.String”. The search is done by looking for the first occurrence of the priority
keyword in the RDD element. To do this, the “java.lang.String.indexOf(String str, int fromIndex)”
[52] utility function is used. This “indexOf” function takes in a string and an index as the
parameters. It searches for the first occurrence of the string starting at the given index. If the string
is found it returns a positive value indicating the index at which the string was found. If the string
is not found it returns “-1”. When scanning the element, the “indexOf” function is given the
priority keyword as a parameter and “0” as the offset. If “-1” is returned this means the element
does not contain a priority keyword and the next element in the RDD is searched. If a non-negative
number is returned this means the priority keyword was found. The priority value associated with
the priority keyword is then added to an accumulator for the RDD (line 5 of Algorithm 3). The
“indexOf” function is then called again on the same RDD element except this time the index of
the previous call to the function is used as a parameter. This is done in case the same priority keyword appears multiple times in the element, so that every occurrence is counted.
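The scanning loop can be pictured with the following simplified Scala sketch (a stand-alone model of the behaviour described above, not the modified Spark source; a plain variable stands in for the accumulator):

object ScanSketch {
  // Hypothetical priority mappings: keyword -> priority value.
  val priorityMappings = Seq(("fire", 10), ("explosion", 100))

  // Scan every element of the batch and accumulate the priority of each keyword
  // occurrence, using String.indexOf as described above.
  def scan(elements: Seq[String]): Int = {
    var accumulator = 0
    for ((keyword, value) <- priorityMappings; element <- elements) {
      var index = element.indexOf(keyword, 0)
      while (index != -1) {
        accumulator += value
        // Continue searching from just past the previous match.
        index = element.indexOf(keyword, index + 1)
      }
    }
    accumulator
  }

  def main(args: Array[String]): Unit =
    println(scan(Seq("fire fire", "an explosion occurred"))) // prints 120 (10 + 10 + 100)
}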
In Spark Streaming jobs are represented as a Job class. The Job class contains a callsite
field which corresponds to the operations that will be applied to the RDD associated with the job.
The Job class also contains a start and end time field which are used to record when the
Spark Engine began and finished processing the job. The association between the Job and the RDD
is done through the Job “id” field. The “id” for the job and RDD will correspond to each other
(line 10 of Algorithm 3). The job also has a time field which is a time that corresponds to the ID.
The time field can be used to perform comparisons with other time instances. The receiver tracker
then sends the job to the Job manager. The job manager adds the job to a job map where the key
is the time attribute of the job and the value is the job. Algorithm 3 is implemented specifically to support DDPS.
Implementation of Algorithm 4
Algorithm 4 is responsible for ordering jobs within the job queue. In order to accomplish
this, the Job class was modified to implement the “java.lang.Comparable” [53] interface as shown
in Figure 3.4. The Comparable interface declares one abstract function “int compareTo(Object o)”.
This instance function compares the input parameter “o” with the receiver of the method call. If they are the same, the method returns zero. If the parameter is greater the result is negative, and if the parameter is smaller the result is positive.
A new method “compareTo” is added to the Job class to implement the Comparable
interface. This method compares two Job instances and returns the result. Consider for example
Job A that represents the receiver of the call and Job B that represents the parameter for the
“compareTo” method call (i.e., A.compareTo(B)). If the priority of Job A and Job B are the same,
then the result of subtracting the time field of Job B from the time field of Job A is returned. Recall
that the time field of a job indicates the time at which the job was created. If Job A is older a negative number is returned and if Job B is older a positive number is returned, so that older jobs are ordered ahead of newer ones. This technique
works because the time field is the Unix Epoch time [54]. This is the time in milliseconds from
the first of January, 1970. This means that the larger the value of the time, the more recent it is.
If the priorities of the Jobs are different, the result of subtracting the priority of Job A from the priority of Job B is returned. If Job A has a higher priority the result is negative, and if Job B has a higher priority the result is positive, so that higher priority jobs are ordered ahead of lower priority jobs. The method “compareTo” is shown in lines 3 – 10 of
Algorithm 4.
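A simplified Java sketch of this ordering is shown below. It assumes a reduced Job class with only an integer priority and an epoch-millisecond time field, and is an illustration of the comparison logic rather than the actual Job class in Spark Streaming.

public class Job implements Comparable<Job> {
    private final int priority;   // accumulated priority of the messages in the job
    private final long time;      // Unix epoch time (ms) at which the job was created

    public Job(int priority, long time) {
        this.priority = priority;
        this.time = time;
    }

    @Override
    public int compareTo(Job other) {
        if (this.priority == other.priority) {
            // Same priority: the older job (smaller time stamp) comes first.
            return Long.compare(this.time, other.time);
        }
        // Different priority: the job with the larger priority value comes first.
        return Integer.compare(other.priority, this.priority);
    }

    public int getPriority() { return priority; }
    public long getTime() { return time; }
}

With this ordering, a job that compares as "smaller" sits closer to the head of a priority queue, which is what allows higher priority jobs, and within the same priority the older jobs, to be dispatched first.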
The Job manager was modified to replace the existing job queue with a PriorityBlockingQueue whose elements implement “Comparable”. The queue orders the elements according to the ordering determined by the “compareTo” method. The two key functions are “put” and “take”, which place an element into and remove an element from the queue, respectively.
Once the receiver tracker creates a job it sends it to the Job manager. The job manager then
places the job into the queue. The PriorityBlockingQueue ensures that all the elements are ordered
with the highest priority jobs first and the lowest priority jobs last as shown in Algorithm 4. With
the implementation of “compareTo”, if no priority mappings are supplied to the streaming engine,
the queue behaves as a FIFO queue because if all the priorities are the same the older jobs are
given precedence. The job manager removes the job at the head of the queue and hands it to the
Spark Engine to be processed. After the job has been processed, the Job manager sends an event
to the receiver tracker to free unused memory blocks then it hands the next job in the queue to the
Spark Engine. Algorithm 4 is required to support DDPS on Spark Streaming. Other schedulers on Spark Streaming, such as the default FIFO scheduler, do not require this priority-based reordering of the job queue.
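The queue behaviour described above can be demonstrated with a small, self-contained usage example that relies on the simplified Job class sketched earlier and Java's PriorityBlockingQueue; the priorities and time stamps are made-up values.

import java.util.concurrent.PriorityBlockingQueue;

public class JobQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        PriorityBlockingQueue<Job> queue = new PriorityBlockingQueue<>();

        queue.put(new Job(0, 1000L)); // low priority, oldest
        queue.put(new Job(5, 2000L)); // high priority, newer
        queue.put(new Job(0, 1500L)); // low priority, newer

        // Expected take() order: the priority-5 job first, then the priority-0
        // job created at time 1000, then the priority-0 job created at time 1500,
        // i.e. highest priority first and FIFO within equal priorities.
        while (!queue.isEmpty()) {
            Job next = queue.take();
            System.out.println("priority=" + next.getPriority() + " time=" + next.getTime());
        }
    }
}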
Figure 3.4: Job Class Diagram
Implementation of Algorithm 5
Algorithm 5 is responsible for recycling memory blocks after the job has been processed.
This process begins when the receiver tracker receives the event to free memory blocks, it searches
through the list of RDDs and filters out the RDDs that are younger than the “rememberDuration” by doing a “filter(test: () => Boolean)” operation on the RDD list (line 3 of Algorithm 5). The filter operation takes a lambda function as a parameter. The lambda function is the test that will be applied to each element in the list. In this case the test checks whether the RDD’s time stamp is earlier than the current time minus the “rememberDuration”, that is, whether the RDD is older than the “rememberDuration”. If that is the case, True is returned; otherwise False is returned. The “filter” operation iterates through all the elements and
tests each one to see whether the element passes the condition specified by the filter operation, if
it does the element is added to a list. Once all the elements have been tested the list is returned.
The receiver tracker then checks to see if any of the RDDs in the list are associated with jobs that are in the job map (line 4 of Algorithm 5). This is done by taking the ID of the RDD, converting it to a time, and then using the time as a key to query the job map. If querying the job
map doesn’t provide any results, it means that all the memory blocks that belong to the RDD can
be freed. This process continues for the rest of the RDDs. Querying the job map (line 4 of
Algorithm 5) is the additional step required to support DDPS. This step is necessary because DDPS
reorders the jobs in the job queue which means that a job waiting to be processed can be older than
the “rememberDuration”. The remainder of Algorithm 5 is required for any scheduler that runs on
Spark Streaming.
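The following Java sketch illustrates the two checks just described: the age filter and the DDPS-specific job-map lookup. The TrackedRdd class and the findFreeableRdds method are hypothetical simplifications of the Spark Streaming internals.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MemoryRecycler {

    public static class TrackedRdd {
        final long time; // creation time stamp (ms), also used as the RDD/job id
        TrackedRdd(long time) { this.time = time; }
    }

    public static List<TrackedRdd> findFreeableRdds(List<TrackedRdd> rdds,
                                                    Map<Long, Object> jobMap,
                                                    long rememberDurationMs) {
        long cutoff = System.currentTimeMillis() - rememberDurationMs;
        return rdds.stream()
                   // Keep only RDDs older than the remember duration (line 3 of Algorithm 5).
                   .filter(rdd -> rdd.time < cutoff)
                   // Skip RDDs whose job is still pending in the job map
                   // (line 4 of Algorithm 5, the DDPS-specific step).
                   .filter(rdd -> !jobMap.containsKey(rdd.time))
                   .collect(Collectors.toList());
    }
}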
A sequence diagram of the DDPS implementation is shown in Figure 3.5. On the left is the
Receiver which is run in a continuous loop to populate the unattached memory blocks. To the right
of the receiver is the Job scheduler which has a timer to trigger job events. The receiver tracker
creates RDDs and a Job which is then passed to the Job manager. The Job manager orders all
the jobs based on the priority in a job queue. The job at the head of the queue is sent to the Spark
Engine which processes the job. When the job has been processed the unused memory blocks are freed and recycled.
3.5 Summary
This chapter describes the Data Driven Priority Scheduler and how it differs from the
default FIFO scheduler on Spark Streaming. It also introduces a priority mapping technique which
is used by DDPS to determine the priority of input data items. This chapter presents the five key
algorithms required by DDPS and provides an analysis of the runtime performance of these
algorithms in terms of the workload and system parameters. The next chapter will focus on the
performance assessment of DDPS and compare its performance to that of the default FIFO scheduler.
Chapter 4: Performance Evaluation of DDPS
In this chapter the performance of DDPS is evaluated using a prototype and measurement
approach under two scenarios. The first scenario constructs a synthetic workload and measures the
performance of DDPS when workload and system parameters are varied. The second scenario
performs real-time sentiment analysis on live tweets and records the performance of DDPS. In
both scenarios the performance of DDPS is analyzed revealing key insights on the behaviour of
DDPS. Section 4.1 describes the synthetic workload scenario. Section 4.2 describes the sentiment
analysis scenario.
In this scenario a series of experiments are conducted to evaluate the performance of DDPS
with a synthetic workload. The experiments consist of launching a message sending application
and a Spark Streaming application, no other applications are run. The message sending application
connects to the Spark Streaming application and sends messages to the streaming application.
Upon receiving the messages the streaming application performs some computations to model the
consumption of CPU cycles by a program. In these experiments the streaming engine is run with
both the default FIFO scheduler and DDPS. In each experiment only one workload or system parameter (Section 4.1.1 and Section 4.1.2) is varied at a time. Key performance metrics (in Section
4.1.3) are measured and the results (Section 4.1.5) are presented. Section 4.1.4 describes the
implementation of the message sending application and the Spark Streaming application in more
detail.
The parameters of the synthetic workload used in the evaluation of the proposed scheduling technique are described below.
Mean Arrival rate (λ) is the rate at which the streaming application receives messages, measured in messages per second.
Coefficient of variation of interval time (Ca) captures the variability in message inter-arrival times.
Batch duration (BD) is the duration of the time interval in which the streaming engine collects data for a single batch (RDD).
Number of priority mappings (K): is the number of priority mappings registered with the
streaming engine. For the experiments there are three priority levels: high, medium and low. For N
priority levels only N-1 priority mappings need to be supplied because any message that does not
contain a priority keyword is treated as a low priority message. Each time the streaming engine
scans messages it searches all of the priority mappings specified by the user.
Message Length (L) (characters) is the length of the message sent by the message sending process
measured in characters. All messages sent by the message sending application are the same length.
Message Service Time (S) is the amount of time in seconds a CPU takes to process a message.
Message service times are fixed and do not vary from message to message.
Percentage of high priority messages (%) is the percentage of messages that contain at least one high priority keyword. In the synthetic workload scenario a high priority message will only contain a single high priority keyword and no medium priority keywords.
Percentage of medium priority messages (%) is the percentage of messages that contain at least one medium priority keyword and no high priority keywords. In the
synthetic workload scenario a medium priority message will only contain a single medium priority keyword.
4.1.2 System Parameters
Number of Cores (C) is the number of cores (logical processing units) available to the Spark
Streaming application.
The primary performance metrics used to evaluate system performance are discussed in this
section.
• The message is received by the receiver and placed in a memory block. At the end of the batch
interval the message is placed in a RDD. The batch delay is the time from when the message is
received by the receiver to the time at which the current batch interval ends. On average this
will be half of the batch duration. A further discussion on this is presented in Section 4.1.5.2.
Let the batch delay for the ith message be given by $t_{Bdelay}^{i}$.
• The messages in the RDD are scanned by the receiver tracker to detect priority keywords. Let
the time required to scan the ith message be given by $t_{scan}^{i}$.
• The RDD is assigned to a job that is placed in the priority job queue. The job remains in the
queue until it is dispatched by the job manager. Let the time the job for the ith message waits in
the job queue be given by $t_{queue}^{i}$.
• Once the job is dispatched it is processed by the Spark engine on executors. Let the time taken
to process the ith message in a job be given by $t_{process}^{i}$.
Since all the previous steps are performed in sequence, the end-to-end processing latency for the ith message is given by the sum of these components:
$t_{Bdelay}^{i} + t_{scan}^{i} + t_{queue}^{i} + t_{process}^{i}$   (4.1)
Average End-to-End Latency (Ae) (ms): The end-to-end latency for a message is the difference
between the time at which the message processing is completed and the time at which the message
is sent by the message sending application. The average end-to-end latency is the average of the end-to-end latencies of all the messages processed during an experiment. Latency is recorded and grouped by message priority level (i.e., high priority (HP), medium priority (MP) and low priority (LP)).
Average job queuing latency (Aq) (ms): The duration the ith job waits in the job queue is given
by $t_{jobqueue}^{i}$. The average job queuing latency in a given experiment is computed as:
$A_q = \frac{1}{J}\sum_{i=0}^{J} t_{jobqueue}^{i}$   (4.3)
Where J is the total number of jobs that are created during the experiment. Each message has a
priority level associated with it. For these experiments there are only three priority levels for
messages (high, medium and low). The priority of a job is calculated by summing the priority of all
the messages that it contains. Therefore, there may be a very large number of different priority
levels for jobs. For this reason, the average job queuing latency is not grouped in terms of priority
levels, instead, the average job queuing latency for all the jobs is presented.
Average scanning latency (As) (ms): The scanning latency for all the messages in an RDD
is given by:
$t_{rddscan} = \frac{1}{M}\sum_{i=0}^{M} t_{scan}^{i}$   (4.4)
Where M is the number of messages in the RDD. The average scanning latency is given by:
$A_s = \frac{1}{R}\sum_{i=0}^{R} t_{rddscan}^{i}$   (4.5)
Where R is the total number of RDDs created during the experiment and $t_{rddscan}^{i}$ is the time taken to scan all the messages in the ith RDD.
Scheduling overheads: in addition to the latencies described above, Spark Streaming incurs a number of scheduling related overheads. The total scheduling overhead is composed of the following overheads:
• The message is received by the receiver and placed in a memory block. The memory block
must be allocated before messages are stored. When all the messages contained in a memory
block are processed the memory block is freed. Let the time spent allocating and freeing all
memory blocks for the duration of the experiment be given by $t_{mb}^{total}$.
• On every batch interval, memory blocks are associated with an RDD. Let the time spent
associating memory blocks with RDDs for the duration of the experiment be given by
$t_{mbToRdd}^{total}$.
• On every batch interval, the messages in the RDD are scanned by the receiver tracker to
detect high priority keywords. Let the time taken to scan all the messages in all the RDDs for the duration of the experiment be given by:
$t_{rddscan}^{total} = \sum_{i=0}^{R} t_{rddscan}^{i}$   (4.6)
• On every batch interval, a newly created RDD is assigned to a job that is placed in the
priority job queue. Let the time taken to assign all RDDs to jobs for the duration of the
experiment be given by $t_{rddToJob}^{total}$.
• On every batch interval, the job manager places a new job in the queue. The job manager
then orders the job queue based on job priority. Let the time the job manager spends ordering
the job queue for the duration of the experiment be given by $t_{orderqueue}^{total}$.
Let $t_{so}^{total}$ denote the total scheduling overhead that Spark Streaming incurs. $t_{so}^{total}$ is computed as:
$t_{so}^{total} = t_{mb}^{total} + t_{mbToRdd}^{total} + t_{rddscan}^{total} + t_{rddToJob}^{total} + t_{orderqueue}^{total}$   (4.7)
Scheduling Overhead (SO) (%): The total number of cores used by Spark Streaming during
the experiments is denoted by C (as stated in Section 4.1.2). Let the processing time for the ith core be given by $t_{core}^{i}$. The total processing time over all the cores is computed as:
$t_{core}^{total} = \sum_{i=0}^{C} t_{core}^{i}$   (4.8)
The scheduling overhead percentage is the proportion of time spent on scheduling overheads for
the duration of the experiment. The scheduling overhead percentage is computed as:
$SO = \frac{t_{so}^{total}}{t_{core}^{total}} \times 100$   (4.9)
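As a purely illustrative example with made-up numbers (not measured results): if the overhead components in equation 4.7 sum to $t_{so}^{total} = 30$ seconds over an experiment and the cores together provide $t_{core}^{total} = 1200$ seconds of processing time, then equation 4.9 gives
$SO = \frac{30}{1200} \times 100 = 2.5\%$.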
4.1.4 Implementation of the Message Sending Application and the Synthetic Spark
Streaming Application
Each experiment begins by first launching the synthetic message sending application
written in Java that starts a server socket and blocks until a connection is established. Once the connection is established the application sends ASCII messages via the socket connection. The messages can be
appended with a special keyword to indicate the priority of the message. For these experiments
there are three special keywords. One for low, medium and high priority messages. The message
sending application can be modified to vary the percentage of low, medium and high priority
messages. This is done by setting a range for each priority level where the range represents the
proportion of messages that will contain that priority. For example, if the goal is to ensure that
99% of the messages are low priority, 0.9% of the messages are medium priority and 0.1% are
high priority, the ranges would be: less than 1 for high priority, between 1 and 10 for medium priority, and between 10 and 1000 for low priority. The message sending application uses the Math.random() [56] function to generate a random number. This number is multiplied by 1000 and used to determine which priority is selected for the message. The proportion of low, medium
and high priority messages vary depending on the experiment. The message proportions used in each experiment are given along with the other workload parameters.
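A minimal Java sketch of this selection logic, assuming the example ranges given above and illustrative keyword strings, is shown below.

public class PrioritySelector {

    /** Returns the keyword to append to the next message, or "" for low priority. */
    public static String selectPriorityKeyword() {
        double r = Math.random() * 1000; // uniform value in [0, 1000)
        if (r < 1) {
            return "HIGH_PRIORITY";    // appended to roughly 0.1% of messages
        } else if (r < 10) {
            return "MEDIUM_PRIORITY";  // appended to roughly 0.9% of messages
        } else {
            return "";                 // no keyword: treated as low priority (~99%)
        }
    }
}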
After each message is sent, the message sending application sleeps for a certain amount of time before sending the next message. The time that the application sleeps for is determined by a random number generator. The random number generator generates values using an
exponential distribution. The generator takes two parameters, a mean arrival rate and a coefficient
of variation for the mean inter-arrival time. The generator returns an inter-arrival time based on
the input parameters. Using the generator one can specify a message arrival rate. For example, if
the goal is to send an average of 18 messages per second and the Poisson arrival
process is used, one would pass in 18 as the mean arrival rate and 1 as the coefficient of variation.
The generator would generate inter-arrival times which would be used as the duration that the
message sending application would sleep before sending the next message. After a message is sent,
the message sending application generates a new number and repeats the process.
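The following Java sketch shows one way such a generator could be implemented for the Poisson case (Ca = 1), using the standard inverse-transform method for an exponential distribution. The class name and the omission of other Ca values are assumptions of this sketch, not the thesis implementation.

import java.util.Random;

public class InterArrivalGenerator {
    private final Random random = new Random();
    private final double meanArrivalRate; // messages per second (lambda)

    public InterArrivalGenerator(double meanArrivalRate) {
        this.meanArrivalRate = meanArrivalRate;
    }

    /** Returns the next exponentially distributed inter-arrival time in milliseconds. */
    public long nextInterArrivalMs() {
        double u = random.nextDouble();                       // uniform in [0, 1)
        double seconds = -Math.log(1.0 - u) / meanArrivalRate; // inverse-transform sample
        return Math.round(seconds * 1000.0);
    }

    public static void main(String[] args) throws InterruptedException {
        InterArrivalGenerator gen = new InterArrivalGenerator(18.0); // 18 msg/s on average
        for (int i = 0; i < 5; i++) {
            long sleepMs = gen.nextInterArrivalMs();
            Thread.sleep(sleepMs); // the sender sleeps, then sends the next message
            System.out.println("sent a message after sleeping " + sleepMs + " ms");
        }
    }
}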
Next the Spark Streaming application is launched. The streaming application defines a set
of operations that can be characterized as a DAG shown below in Figure 4.1. The first node of the
DAG is the SocketTextStream connection to the message sending application. All the received messages are passed to a transformation that performs computations to simulate a service time for processing messages. The computations are performed in a loop and
the loop iterations can be increased or decreased to increase or decrease the service time. Other
researchers [57] [58] have constructed artificial workloads in a similar manner to assess the performance of stream processing systems. Each message is then mapped, via the ‘MapToPair’ transformation, to a key-value tuple. This is represented as ‘<Priority, Int>’ in Figure 4.1, where Priority is the priority
of the message (high, medium or low) and Int is an integer value. The tuples generated from the
‘MapToPair’ transformation will contain the priority of the message and an integer value of ‘1’.
An example of this is shown in Figure 4.1 between the ‘MapToPair’ and ‘ReduceByKey’ vertices.
The result of the ‘MapToPair’ transformation is a PairDStream (stream of tuples) which is passed
to the ‘ReduceByKey’ transformation. The ‘ReduceByKey’ transformation receives the tuples and
adds together all the values that correspond to the same priority. For example, a sequence of the
following tuples: {Low, 1}, {Low, 1}, {High, 1}, {Low, 1} would result in {Low, 3}, {High, 1}.
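A sketch of this DAG using the Spark Streaming Java API is shown below. The keyword strings, iteration count and host/port are illustrative values, and the spin loop is only a stand-in for the synthetic message service time.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class SyntheticStreamingApp {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("SyntheticStreamingApp").setMaster("local[3]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // SocketTextStream connection to the message sending application.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // MapToPair: burn CPU cycles to model the service time, then emit a <Priority, 1> tuple.
        JavaPairDStream<String, Integer> pairs = lines.mapToPair(msg -> {
            long sink = 0;
            for (int i = 0; i < 1_000_000; i++) { sink += i; } // synthetic service time
            String priority = msg.contains("HIGH_PRIORITY") ? "High"
                    : msg.contains("MEDIUM_PRIORITY") ? "Medium" : "Low";
            return new Tuple2<>(priority, 1);
        });

        // ReduceByKey: sum the counts per priority level,
        // e.g. {Low,1},{Low,1},{High,1} -> {Low,2},{High,1}.
        JavaPairDStream<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
        counts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}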
4.1.5 Experimental Results
The experiments were conducted on a 4-core (8 logical CPUs) Intel i7-3720QM (3.6GHz)
machine with 16GB of RAM and on an Amazon cluster consisting of four m4.2xlarge [59] nodes
where each node is rated as having 8 vCPUs (2.4GHz) and 32GB of RAM. Running the
experiments on a single machine and on the AWS cloud yielded similar trends in the performance
metrics of interest. For this reason the experimental results on a single machine are presented in the main text and the results for running on the cloud are presented in Appendix A. The only exception is
the experiment which varies the number of cores. This experiment is only run on the cloud and is
presented in the main text. Table 4.1 and Table 4.2 show the workload and system parameters used in the experiments presented in the main text. The experiments follow a one-factor-at-a-time method, meaning that only a single workload or system parameter is varied during an experiment
while the other parameters are held at their default values. The range of the workload and system
parameters is shown in the ‘Possible Values’ column. This is the range that is used when varying
workload and system parameters for the experiments. The range is chosen such that the results will
demonstrate some meaningful change in the performance metrics. For example, the range of the
mean arrival rate is chosen such that the knee in the average end-to-end latency curve is shown.
The default values are the values of the workload parameters used when that parameter is not being
varied. The default values for batch duration and message service time fall within the range of
default values used by other researchers [60] [58]. These values tend to lie somewhere near the
middle of the possible values range. The experiments are run for 2 hours and each run of the
experiment is repeated such that a confidence interval of less than or equal to ±3% is achieved at the chosen confidence level.
Table 4.1: Workload Parameters
4.1.5.1 Effect of Mean Arrival Rate
In these experiments messages are sent using a random number generator to generate the inter-
arrival times. In all experiments except the one that investigates the impact of Ca on performance,
arrival of messages is modeled by a Poisson process (Ca = 1). These experiments vary the mean arrival rate of messages and measure the impact on various performance metrics. Increasing λ increases the number of messages that the streaming system receives per unit
time. To the best of our knowledge there is no other priority based data driven scheduling algorithm
for Spark Streaming. We have thus compared the effect of the mean arrival rate on average end-
to-end latency of DDPS with that of the well-known First In First Out (FIFO) scheduler. In Spark
Streaming when the priority scheduler is turned off, the system uses the default scheduler which
is a FIFO scheduler in which the jobs are executed based on their arrival time: a job with a lower
arrival time is executed prior to the jobs that arrive later. Two experiments are run. The first is run
with two priority levels: high and low. The second experiment is run with three priority levels: high, medium and low.
Figure 4.2 shows four lines: HP and LP correspond to the system running the data driven
priority scheduler proposed in this thesis while HP-FIFO and LP-FIFO correspond to a system
where the priority scheduler is turned off. Figure 4.2 shows that as increases the average end-to-
end latency for HP and LP messages with both the DDPS and FIFO scheduler (LP, HP, LP-FIFO
and HP-FIFO) increase. Figure 4.2 shows that for any given , LP messages are processed with a
higher average end-to-end latency than the HP messages when running with DDPS. This
demonstrates the utility of the data driven priority scheduler. The performance difference is
observed to increase with an increase in λ. When running the FIFO scheduler, the system
performance achieved with low (LP-FIFO) and high (HP-FIFO) priority messages are comparable
to one another. Figure 4.2 shows the benefit of using DDPS; the average end-to-end latency of HP
messages with the priority scheduler is significantly lower than that achieved with the HP
messages when the FIFO scheduler is used. However, this comes at a cost to LP messages. Figure
4.2 shows that the average end-to-end latency of LP messages when running with DDPS is higher
than both LP-FIFO and HP-FIFO. This occurs as a result of DDPS favouring HP messages at the
expense of LP messages. When jobs containing HP messages are prioritized by DDPS, jobs
without HP messages spend more time in the job queue, and as a result the average end-to-end latency of LP messages increases.
A similar relationship between average end-to-end latency and λ was captured on a cloud
system and the results are presented in Figure A.1 of Appendix A.2.
Figure 4.3 shows the effect of λ on the average job queuing latency when running with the DDPS and FIFO schedulers. Figure 4.3 shows that the effect of λ on average job queuing latency is very similar for both the DDPS and FIFO schedulers. As λ increases so will the average number
of messages in each RDD. This means that it will take longer for jobs to be processed resulting in
increased job queuing latencies. DDPS prioritizes jobs with HP messages, so these jobs will spend
less time in the job queue. Conversely, jobs without HP messages will spend more time in the job
queue. The results show that when the average of all jobs (jobs with and without HP messages) is
considered, the measured average job queuing latency with DDPS is the same as the FIFO
scheduler. This implies that on average, the reduction of the average queuing time of jobs containing HP messages is accompanied by a corresponding increase in the average queuing time of jobs that do not contain HP messages.
The results in Figure 4.3 also show that as λ increases the average job queuing latency also increases. The rate of change remains fairly constant until about the point where λ reaches 19.5, at
which the rate of change increases dramatically. This also appears to be the same point at which
the average latencies of HP and LP messages begin to differ in Figure 4.2. DDPS prioritizes HP
messages by reordering jobs in the job queue based on priority. This means that the more jobs
there are in the job queue the greater the effect of DDPS. The fewer jobs there are in the queue, the smaller
the effect of DDPS. As the average job queuing latency increases, so will the average number of
jobs in the job queue. This explains why the difference between the average end-to-end latency of
HP and LP jobs increases when the average job queueing time increases.
A similar relationship between average job queuing latency and λ is captured on a cloud
system and the results are presented in Figure A.2 of Appendix A.2.
Figure 4.4 shows the effect of λ on the average scanning latency when running with DDPS. The results show that as λ increases the average scanning latency also increases. The increase in average scanning latency appears to be linear with respect to λ. The value of λ determines the rate at which messages arrive. As the mean arrival rate of messages increases the average number of messages in a RDD will increase also. The scanning time is proportional to the number of messages in a RDD since each message in the RDD must be scanned. This explains the linear increase in the average scanning latency with λ.
A similar relationship between average scanning latency and λ is captured on a cloud
system and the results are presented in Figure A.3 of Appendix A.2.
Figure 4.3: Effect of Mean Arrival Rate on Average Job Queuing Latency
Figure 4.5 shows the effect of λ on the scheduling overhead when running with DDPS and the FIFO scheduler. The results show that as λ increases the scheduling overhead also increases for both the DDPS and FIFO schedulers. As λ increases the number of messages that enter the system per unit time also increases. This means more memory blocks need to be allocated (and eventually deallocated) to store the messages. The allocation and deallocation of memory blocks contribute to the scheduling overhead. A similar relationship between scheduling overhead and λ is captured on a cloud system and the results are presented in Figure A.4 of Appendix A.2.
The results in Figure 4.5 also show that for any given λ the scheduling overhead of DDPS
is larger than the scheduling overhead of the FIFO scheduler. DDPS incurs additional overheads
when it scans the RDDs for priority mappings, whereas, the FIFO scheduler does not incur this
overhead. As a result, DDPS will have a larger scheduling overhead in comparison to the FIFO
scheduler. This analysis is confirmed in Figure 4.6. Figure 4.6 shows the difference between the
scheduling overheads of the DDPS and FIFO schedulers. It is the result of subtracting the FIFO line from the DDPS line in Figure 4.5. Figure 4.6 shows that the difference between the scheduling overhead of the DDPS scheduler and the FIFO scheduler grows monotonically as λ increases. The scanning time of the RDD also grows linearly as λ increases. Since DDPS has to scan RDDs and the FIFO scheduler does not, this explains the growing difference between the scheduling overheads of the two schedulers.
The previous discussion has concerned a system with two priority levels; the performance
of a system having three priority levels (HP, MP and LP) is discussed next. Figure 4.7 shows three
lines: HP, MP and LP. These lines correspond to the average end-to-end latency of high, medium
and low priority messages respectively. Figure 4.7 shows that for any given λ, LP always has the largest latency, followed by MP, and HP messages always have the lowest latencies. This highlights the impact of DDPS when multiple priority mappings are supplied, as DDPS is able to prioritize messages according to their priority levels. A similar relationship between average end-to-end latency and λ with three priority levels is captured on a cloud system and the results are presented in Figure A.5 of Appendix A.2.
Figure 4.7: Effect of Mean Arrival Rate on Average End-to-end Latency with 3 Priority
Levels
4.1.5.2 Effect of Batch Duration
This experiment focuses on the impact of varying the batch duration on various
performance metrics. The batch duration determines the amount of time in which the streaming
system collects data for a batch (RDD). Increasing the batch duration will result in more messages
being placed in a single RDD, and reducing the batch duration reduces the number of messages being placed in the RDD. Since each experiment is run for a fixed duration of time, increasing the batch duration reduces the number of RDDs that are created during the experiment, and reducing the batch duration increases the number of RDDs created.
Figure 4.8 shows the results of varying the batch duration with the DDPS and FIFO
scheduler. The two lines HP and LP show the average end-to-end latency for high and low priority
messages achieved with DDPS. HP-FIFO and LP-FIFO show the average end-to-end latency for
high and low priority messages when running with the FIFO scheduler. The results show that with
the FIFO scheduler there is minimal difference between low and high priority messages as
expected. Figure 4.8 shows that as batch duration increases the average end-to-end latency of LP-
FIFO and HP-FIFO increases. Figure 4.8 shows that when the batch duration increases the average
end-to-end latency of LP and HP also increases. However, when running with DDPS at small batch
durations the results show a large difference between high and low priority messages which does
not appear with the FIFO scheduler. Figure 4.8 shows that when the batch duration is between 1.2
and 1.4 seconds there is a large difference between the average end-to-end latency of HP and that
of LP. When the batch duration is greater than two seconds the difference between the average end-to-end latencies of HP and LP messages becomes small.
A similar relationship between average end-to-end latency and batch duration is captured
on a cloud system and the results are presented in Figure A.6 of Appendix A.3.
Figure 4.8: Effect of BD on Average End-to-end Latency
Figure 4.9 shows the effect of batch duration on the scheduling overhead with the DDPS
and FIFO scheduler. The results show that with a small batch duration the scheduling overhead
with both the DDPS and FIFO scheduler is large, and as the batch duration increases the scheduling overhead
reduces. When the batch duration is small more batches are created during the course of the
experiment. As a result, all the events that occur on every batch interval such as, assigning memory
blocks to RDDs, scanning RDDs, creating Jobs, recycling memory blocks, etc. occur more
frequently. The effect is that the scheduling overheads consume a larger proportion of computing
resources in comparison to an experiment where fewer batches are created. This means that there
will be fewer computing resources to process jobs, which will result in increased job queuing
times. This is why Figure 4.9 shows that as the batch duration increases the scheduling overhead
reduces. The scheduling overhead is higher when running with DDPS because of the additional
overheads of scanning RDDs. The effect of these scheduling overheads on latency can also be seen in Figure 4.8. Between 1.2 and 1.4 seconds the average end-to-end latency of LP, LP-FIFO and HP-FIFO decreases. This occurs because the scheduling overheads are decreasing.
A similar relationship between scheduling overhead and batch duration is captured on a cloud system and the results are presented in Figure A.7 of Appendix A.3.
In Section 4.1.3 the batch delay ($t_{Bdelay}^{i}$) is introduced. Batch delay is the time difference
between when the receiver receives a message to the end of the batch duration as shown in Figure
4.10. In Figure 4.10 the message arrives at time t. Time t will occur somewhere within a batch
interval. The start and end of the batch interval is indicated as BStart and BEnd respectively.
Therefore time t is greater than BStart and smaller than BEnd. The batch delay is the difference
between BEnd and time t. The batch delay is the duration the message waits in the receiver queue.
As shown in equation 4.1, the batch delay contributes to the overall latency of processing a
message. This means that any workload parameter that has an impact on the batch delay will also affect the average end-to-end latency.
All the experiments use a random number generator with an exponential distribution to
determine inter-arrival times. This means that a message arrival can occur at any time within a
batch duration. On average message arrivals will occur at the midpoint between BStart and BEnd
as shown in Figure 4.10. A message that arrives at the midpoint between BStart and BEnd will
have a batch delay that is equal to half of the batch duration. Thus, the batch delay is linearly
proportional to the batch duration. This means that increasing the batch duration will increase the
batch delay.
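To make the half-batch-duration claim concrete, the following short derivation, assuming only that the arrival instant t is uniformly distributed over the batch interval as argued above, gives the expected batch delay:
$E[t_{Bdelay}] = E[BEnd - t] = BEnd - \frac{BStart + BEnd}{2} = \frac{BEnd - BStart}{2} = \frac{BD}{2}$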
Since batch delay also contributes to the overall latency of processing messages, this would
imply that increasing the batch duration would also increase the average end-to-end latency. Figure
4.8 shows that increasing the batch duration increases the average end-to-end latency of HP, LP,
HP-FIFO and LP-FIFO. This behaviour occurs because the batch delay is a major contributing factor to the end-to-end latency.
The effect of batch duration on scheduling overhead, batch delay and average job queuing latency is similar with both
the DDPS and the FIFO scheduler. However, the effect of batch duration on average end-to-end
latency differs between the DDPS and the FIFO scheduler when it comes to high priority messages
but not low priority messages. DDPS always gives precedence to high priority messages, however
if there is no contention the difference in the average end-to-end latency between higher priority
messages and lower priority messages will be small. When the batch duration is small, the resource
contention is at its highest due to the high scheduler overheads (as shown in Figure 4.9). This
means that when the batch duration is small the difference in average end-to-end latency between
HP and LP messages will be at its highest. This explains why the HP curve is lower than the LP
curve in Figure 4.8 when the batch duration is less than 1.5 seconds. As the batch duration
increases, contention for resources decreases and average end-to-end latency for both LP and HP
exhibit comparable values. Since the FIFO scheduler does not prioritize high priority messages,
this behaviour is not observed in the LP-FIFO and HP-FIFO curves. The reason why the HP curve
does not continue to remain smaller than the LP curve after batch duration exceeds 1.5 seconds is
because at that point the impact of the batch delay is greater. The result is that the DDPS and the
FIFO scheduler will behave in the same manner after 1.5 seconds.
Figure 4.11 shows the effect of the batch duration on the average job queuing latency. Two
lines are shown: DDPS and FIFO which correspond to the streaming engine being run with the
DDPS and FIFO schedulers respectively. Figure 4.11 shows that the average job queuing latencies
of DDPS and FIFO are comparable for a given value of BD. The figure shows that when the batch
duration is small the average job queuing latency is high. But as the batch duration increases, the
average job queuing latency also decreases. A similar behaviour was seen in Figure 4.9 where the
scheduling overhead reduced as the batch duration increased. The reason why the average job
queuing latency is high when BD is small is because the high scheduling overhead (shown in
Figure 4.9) causes resource contention. This will increase the time it takes to process jobs and
therefore increase job queuing latency. As the scheduling overhead reduces, so will the resource contention, and as a result jobs will be processed faster, thus reducing the average job queuing
latency.
A similar relationship between average job queuing latency and batch duration is captured
on a cloud system and the results are presented in Figure A.8 of Appendix A.3.
4.1.5.3 Effect of Coefficient of Variation for Interval Time
This experiment measures the effect of varying the Ca. The Ca determines the variability
of inter-arrival times, so increasing Ca increases the burstiness of message arrivals. Figure 4.12
shows four lines: HP, LP, HP-FIFO and LP-FIFO. HP and LP are the average end-to-end latencies
of the high and low priority messages when running with DDPS. HP-FIFO and LP-FIFO are the
average latencies of the high and low priority messages when running with the FIFO scheduler. The results in Figure 4.12 show that as Ca increases the average latencies of HP, LP, HP-FIFO and LP-
FIFO all increase. This means that increasing the variability of inter-arrival times increases the
average end-to-end latency when using DDPS and the FIFO scheduler. The results also show that
for any given value of Ca HP messages have a lower latency than LP messages. It also shows that
as message arrivals become more bursty (as Ca increases) the performance difference between the
average end-to-end latency of LP and HP messages increases. This indicates that DDPS processes
HP messages earlier than LP messages when the system has a bursty input load. However, with
the FIFO scheduler this behaviour is not seen. For any value of Ca the average end-to-end latency
values for HP-FIFO and LP-FIFO are comparable to one another as expected. This is because the
FIFO scheduler makes no distinction between jobs that contain high priority messages and jobs that do not. A similar relationship between average end-to-end latency and Ca is captured on a cloud system and the results are presented in Figure A.9 of Appendix A.4.
Figure 4.13 shows the effect of Ca on the average job queueing time. The results show that
as Ca increases so does the average job queuing latency for both the DDPS and the FIFO scheduler.
In fact, the increase in average end-to-end latency seen in Figure 4.12 is caused largely by the
increases in average job queuing latency. The reason why increasing Ca increases queuing delays
is because a high Ca increases the variability of message inter-arrival times which makes the
message arrivals more bursty. As a result the number of messages owned by a Job will become
more varied as Ca increases. The time it takes to process a job depends on the number of messages
that it contains. Therefore, as Ca increases the variability of job processing times will also increase.
Increasing the variability of processing times in queues is known to be one of the factors that
increases waiting time [61]. The increase in average job queuing latency for DDPS in Figure 4.13 also
seems to correspond with an increase in the difference between HP and LP in Figure 4.12. This
means the effect of Ca on average job queuing latency is similar when running with the DDPS and the FIFO schedulers.
A similar relationship between average job queuing latency and Ca is captured on a cloud
system and the results are presented in Figure A.10 of Appendix A.4.
Figure 4.13: Effect of Ca on Average Job Queuing Latency
The previous discussion has concerned a system with two priority levels; the performance
of a system having three priority levels (HP, MP and LP) is discussed next. The effect of Ca on
average end-to-end latency with three priority levels when running with DDPS is shown below in
Figure 4.14. There are three lines: HP, MP and LP. These correspond to the average end-to-end
latencies of high, medium and low priority messages respectively. The results show that as Ca
increases the average latencies for high, medium and low priority messages also increase. For
every value of Ca HP has the lowest average end-to-end latency, followed by MP and LP. This
demonstrates the ability of DDPS to give precedence to high priority messages over medium priority messages, and to medium priority messages over low priority messages, and shows the benefit of DDPS under bursty input loads. A similar relationship between average end-to-end latency and Ca with three priority levels is captured on a cloud system and the results are presented in Figure A.11 of Appendix A.4.
Figure 4.14: Effect of Ca on Average End-to-end Latency with 3 Priority Levels
4.1.5.4 Effect of Number of Priority Mappings
In this experiment the number of priority mappings, K, is varied and the impact on various
performance metrics is measured. The number of priority mappings determines the keywords that
DDPS will search for during the scanning phase. If no priority mappings are supplied, no scanning is performed.
Figure 4.15 shows the effect of K on average end-to-end latency. The results show that
when K is low the average end-to-end latency is also low. As K increases so does the average end-
to-end latency. The increase in average end-to-end latency with respect to K appears to be linear.
The reason why this occurs is that as the number of priority mappings increases, the number of
keywords that need to be searched for when scanning RDDs also increases. The rationale for this is discussed further in the context of Figure 4.16.
Figure 4.15: Effect of K on Average End-to-end Latency
The effect of K on the average scanning latency is shown below in Figure 4.16. The results
show that when K is zero the average scanning latency is also zero. As K increases the average scanning
latency increases in a linear fashion. The scanning time increases linearly because all messages
need to be scanned for each priority mapping. This increase in scanning time contributes to the increase in average end-to-end latency observed in Figure 4.15.
Figure 4.17 shows the effect of K on scheduling overhead. The results show that as K increases the
scheduling overhead increases. This is similar to the impact of K on average scanning latency that
was captured in Figure 4.16. The average scanning latency contributes to the overall scheduling
overhead. Therefore, when the average scanning latency increases the scheduling overhead also
increases.
Figure 4.16: Effect of K on Average Scanning Latency
The effect of K on the average job queuing latency is shown below in Figure 4.18. The
results show that for any value of K the average job queuing latency remains fairly constant. This
means that K does not have any effect on the average job queuing latency. Increasing K increases
the average scanning latency, which affects the overall latency. However, messages are scanned before the job is placed in the job queue; the scanning therefore does not prolong the duration the job spends in the job queue. As a result, any delay that increases the average scanning latency does not increase the average job queuing latency.
4.1.5.5 Effect of Message Length
In this experiment the length of messages sent by the message sending process is varied.
The effect of message length on average end-to-end latency when running with DDPS is shown in
Figure 4.19. The figure shows the average end-to-end latency of HP and LP messages. Both HP
and LP appear to increase as the message length increases. For every value of L the average latency of LP messages is higher than that of HP messages, as expected. The difference
between the average latencies of HP and LP also appear to increase as message length increases.
The effect of message length on average job queuing latency is shown in Figure 4.20. The results
show that the average job queuing latency remains fairly constant with respect to message length.
This means the message length does not have an impact on job queuing latency. The reason for
this is as follows. The time it takes to process a job depends on the number of messages contained
in a job and the amount of time it takes to process each message. The number of messages
contained in a job is determined by λ, which is held constant in this experiment. The time it takes to
process each message is determined by the message service time which is a workload parameter
(as described in Section 4.1.1). In this experiment neither λ nor the message service time is affected
by increasing the message length. As a result the average queuing time remains fairly constant.
Figure 4.20: Effect of Message Length on Average Job Queuing Latency
The effect of message length on average scanning latency is shown in Figure 4.21. The
results show that as message length increases there is also an increase in average scanning latency.
This increase appears to be linear with respect to message length. The reason for this result is that
DDPS scans all messages to search for priority keywords. The longer the message is, the longer it
takes to scan the message. As a result, increasing the message length will increase the average
scanning latency. The average scanning latency contributes to the total latency for processing a
message (equation 4.1). Therefore increasing the average scanning latency will also increase the
average end-to-end latency. This explains why in Figure 4.19 the average end-to-end latency increases as the message length increases.
Figure 4.22 presents the effect of message length on scheduling overhead. The results show
that as the message length increases the scheduling overhead also increases. This is because the
time taken to scan messages contributes to the scheduling overhead. Increasing the message length
increases the average scanning latency that in turn increases the scheduling overhead.
4.1.5.6 Effect of Message Service Time
In this experiment the message service time is varied and the average end-to-end latency
and average job queuing latency are measured when running with DDPS. The message service time
determines the amount of time taken to process a message. The effect of message service time on
average end-to-end latency is shown below in Figure 4.23. The results show that when message
service time increases the average end-to-end latency of both HP and LP messages also increase.
It also shows that for any message service time HP messages have a lower average end-to-end
latency than LP messages. As the message service time increases, the difference between HP and
LP also increases. This is similar to the results seen when varying λ (Figure 4.2) and Ca (Figure 4.12).
Figure 4.24 shows the effect of message service time on the average job queuing latency.
The results show that as message service time increases the average job queuing latency also increases.
When the message service time is around 137ms, the increase in average job queuing latency is
more dramatic. This is also the same point where the difference between the average end-to-end
latency of HP and LP messages begins to increase in Figure 4.23. Increasing the mean service time
increases the amount of time each message takes to be processed. As a result, the amount of time
it takes to process a job will increase also. An increase in job processing time will increase the job
queuing delays.
Figure 4.24: Effect of Message Service Time on Average Job Queuing Latency
4.1.5.7 Effect of Number of Cores
In this experiment the number of CPUs assigned to the Spark Streaming engine is varied. This experiment is run on the Amazon EC2 cloud. Figure 4.25 shows the effect of C on average end-
to-end latency. The figure has four lines, HP, LP, HP-FIFO and LP-FIFO. These correspond to
high and low priority messages with the DDPS and FIFO scheduler respectively. The results show
that when C is low, the average end-to-end latency of all four lines is at its highest. As C increases
the average end-to-end latency reduces. The results also show that for any value of C, HP has a lower latency than LP. When C is low the difference between HP and LP is high. As C increases the difference becomes smaller. Figure 4.25 shows that the average end-to-end latencies of HP-FIFO and LP-FIFO are comparable to one another for all values of C.
Figure 4.26 shows the effect of C on the average job queuing latency. The results show that
the average job queuing latency is similar when running with the DDPS and the FIFO scheduler.
The results also show that when there are fewer CPUs, the average job queuing latency is high.
But as the number of CPUs increases the average job queuing latency reduces. This occurs because
increasing the number of CPUs increases the parallelism when processing a job. This means the
amount of time it takes to process a job is reduced. This explains why in Figure 4.25 the average end-to-end latency reduces as C increases.
Figure 4.26: Cloud: Effect of Number of Cores on Average Job Queuing Latency
In this experiment the percentage of high priority messages is varied and the results are
shown in Figure 4.27. The y-axis on Figure 4.27 shows the result of subtracting the average end-
to-end latency of high priority messages from the average end-to-end latency of low priority
messages (Δ1). On the x-axis is the percentage of high priority messages presented on a logarithmic
scale. The results show that when the percentage of HP messages is low the latency difference is
high. But as the percentage of HP messages increase the difference reduces. This occurs because
when there are fewer HP messages, the job that contains the HP message takes precedence over
other jobs in the priority queue, and as a result the average latency of processing the messages in
that job are lower than jobs without HP messages. When the number of HP messages increases
there are more jobs containing HP messages, and these jobs will all have a higher priority than
jobs without HP messages. When multiple jobs have the same priority, they are processed in a
FIFO order with respect to each other. Therefore, as the number of high priority jobs increases, the difference between the average end-to-end latencies of HP and LP messages reduces.
In this experiment three priority levels for messages are used, high, medium and low. The
percentage of medium priority messages is varied and the average end-to-end latency for low,
medium and high priority messages is recorded while running with DDPS. Only the difference
between two priority levels is presented at a time for clarity. The difference between the average
end-to-end latency of medium and high priority messages is shown in Figure 4.28. The difference
between the average end-to-end latency of low and medium priority messages is shown in Figure
4.29. Figure 4.28 shows the results of varying the percentage of medium priority messages. The
y-axis shows the result of subtracting the average end-to-end latency of high priority messages from that of medium priority messages (Δ2). The results show that Δ2 is always greater than zero.
This means that the average end-to-end latency of medium priority messages is always larger than
that of high priority messages. The results also show that when there is a small percentage of
medium priority messages, Δ2 is small. As the percentage of medium priority messages increases, Δ2 begins to increase. Figure 4.29 shows the results of varying the percentage of medium priority
messages and recording the difference in average end-to-end latencies between low priority
messages and medium priority messages (Δ3). The results show that Δ3 is always greater than
zero. This means that the average end-to-end latency of low priority messages is always larger
than that of medium priority messages. The results also show that when there is a small percentage
of medium priority messages, Δ3 is large, but as the percentage of medium priority messages increases, Δ3 reduces.
As the percentage of medium priority messages increases the proportion of high priority
messages remains the same, the proportion of medium priority messages increases and the
proportion of low priority messages reduces. Therefore, as the percentage of medium priority
messages increases, a given medium priority message on average will see a higher number of
medium priority messages ahead of it in the queue. As a result the average end-to-end latency of
medium priority messages will increase. Since Δ2 is the difference between the average end-to-end latency of high and medium priority messages, when the percentage of medium priority messages increases, Δ2 will increase also. Similarly, Δ3 is the difference between the average end-to-end latency of medium and low priority messages; therefore, as the average end-to-end latency of medium priority messages increases, Δ3 reduces.
Figure 4.28: Effect of Percentage of Medium Priority Messages on Δ2
4.2 Live Twitter Sentiment Analysis Scenario
This experiment aims to test the performance of DDPS in a real-world sentiment analysis
scenario. The experiment begins by launching a Spark Streaming Application. The streaming
application trains a language model classifier [62] by feeding it messages as well as data describing whether each message has a ‘positive’, ‘neutral’ or ‘negative’ sentiment. This data comes from a training set [63] of over one million entries containing English tweets. Each tweet is marked with its sentiment, and the tweets are not restricted to any specific topic. This training set has been used by other researchers [64] [65] [66] to conduct
research on sentiment analysis techniques. The experiment uses a supervised machine learning
technique [67] where a dataset (in this case a series of tweets) is passed to the classifier along with
the category (positive, neutral or negative sentiment) that it is associated with. Once this process
is complete, the classifier can be queried with a message and it will produce an estimate describing the sentiment of the message.
The streaming application defines a set of operations that can be characterized by a DAG
shown in Figure 4.30 that displays the different stages of processing after it receives the input data
from Twitter Server. The application establishes a connection to Twitter Server using the Twitter
API for Spark Streaming [68]. Once the connection is made the twitter stream will produce tweets.
Each tweet is represented as a ‘twitter4j.Status’ object. Note that “twitter4j” is short for “Twitter for Java”, the API used for connecting to the Twitter Server. This object contains the
contents of the tweet as well as the name of the sender, the location that the tweet was sent from
and the time at which the tweet was sent. These tweets are from all over the world and can be
written in any language. The tweets are passed to a Filter transformation which filters out the non-English tweets. The English tweets are then passed to a MapToPair transformation through an
InputDStream. The MapToPair transformation evaluates the message using the classifier and
determines its category. The transformation returns a key-value tuple where the key is the category
of the message and the value is an integer of 1 via a PairDStream. This stream is passed to a
ReduceByKey transformation which takes the key-value tuples and groups them based on their
key, then adds up the values. The final result is a set of tuples containing the number of messages belonging to each sentiment category (i.e. {positive, 1}, {neutral, 1}, {positive, 1}, {positive, 1}, {negative, 1} becomes {positive, 3}, {neutral, 1}, {negative, 1}). After this, the results are output by the streaming application.
The experiment was conducted on a 4-core (8 logical CPUs) Intel i7-3720QM machine. The Spark Streaming application was launched with access to 3 logical CPUs: one for the receiver and
two for the executors. The experiment was run on May 12, 2019 between 3pm and 3am EST. Any tweet containing the name Trump was treated as a high priority tweet. The results were recorded and divided into 15-minute sections. The average of each 15-minute section is displayed in the graphs
below. The performance metrics used to measure the performance of the system are average end-
to-end latency, average job queuing latency and average scanning latency. These are defined in
Section 4.1.3.
Figure 4.31 shows the average end-to-end latency of processing tweets. The results show two lines, HP and LP, where HP corresponds to high priority messages and LP corresponds to low
priority messages. The results in Figure 4.31 show that between 3pm and 7pm the average end-to-
end latency of LP tweets is significantly higher than HP tweets. Between 7 and 9pm the average
end-to-end latency of HP tweets reduces and becomes comparable to that of LP tweets. Between
9 and 11pm the average end-to-end latency of LP tweets increases again, but the average end-to-end latency of HP tweets remains relatively stable. After 11pm the average end-to-end latency of LP reduces and becomes similar to the average end-to-end latency of HP tweets. This behaviour
continues until 3am. One possible explanation for the spikes in LP latency between 3pm to 7pm,
and 9pm to 11pm is that 3 to 7pm is around the time most people in North America finish school
and work, so there would be an increase in Twitter activity. Between 7 and 9pm most people in
North America would be eating dinner, so twitter activity would reduce. Twitter activity would
increase after dinner and reduce after 11pm when most people would have gone to bed. The results
in Figure 4.32 are in line with these observations. The figure plots the number of tweets received
for each 15-minute window for the duration of the experiment. The results show high Twitter activity
between 3 to 7pm and 9 to 11pm. An increase in tweets would increase resource contention and
as a result job queuing latencies will increase. The impact of DDPS on performance is more
prominent when resource contention is higher. When there is high frequency of tweets there is a
high resource contention and the average end-to-end latency for HP tweets is significantly lower
than that of the LP tweets. When there is a low frequency of tweets the resource contention is low and the average end-to-end latencies of HP and LP tweets are comparable.
Figure 4.33 shows the average job queuing latencies for the experiment. The average job queuing latency increases between 3pm and 7pm and between 9pm and 11pm. This is in line with Figure 4.31 in which the difference between the average end-to-end latency achieved with HP
and LP tweets is highest between 3 and 7pm and between 9 and 11pm.
Figure 4.34 shows the average scanning latency during the course of the experiment. The
figure shows a correlation between the number of messages received during a 15-minute window
and the average scanning latency within that window. The reason for this is that the time it takes
to scan the tweets is proportional to the number of tweets that are in the RDD. As the number of
tweets in the RDD increases, so does the average scanning latency.
The results in Figure 4.31 show the performance of DDPS in a real-world sentiment
analysis scenario. They show that DDPS is able to effectively prioritize HP tweets when the input
load increases.
4.3 Summary
This chapter discusses the evaluation of DDPS using the prototype implementation on
Spark Streaming. Two scenarios were constructed: the first exercised DDPS with a synthetic
workload and the second concerned real-time sentiment analysis of live tweets. The results of the
evaluation showed that DDPS is able to give preference to high priority data items. They also showed
that the higher the resource contention, the more DDPS favoured high priority data items.
Figure 4.33: Sentiment Analysis Scenario: Average Job Queuing Latency
Chapter 5: Summary and Conclusions
This chapter presents a summary of the thesis and provides concluding remarks. A
summary of the thesis research is presented in Section 5.1. Section 5.2 presents the key conclusions.
5.1 Summary
One of the main problems with the state of the art is that it does not provide mechanisms
for prioritizing certain input data items over others. The ability to prioritize input data items is
essential in cases where one needs to ensure that important data is always processed with low
latency. Ensuring this becomes more challenging under bursty loads as the streaming system may
be temporarily overwhelmed resulting in large queuing delays. This thesis addresses this challenge
by proposing a scheduling technique called the Data Driven Priority Scheduler (DDPS) that can be used on
stream processing engines where certain input data items can be given higher priority than others.
This thesis implements the DDPS on the Spark Streaming platform. The default scheduler that
comes with Spark Streaming is a FIFO scheduler. In cases where the input load suddenly increases,
the latency for processing input data items will increase due to resource contention. If there are
input data items that are more important than others, these data items will also suffer from
increased processing latencies because the FIFO scheduler does not prioritize the processing of
important data items over less important data items. To solve this problem, this thesis introduces
a priority-based scheduling technique in which the priority of an input data item is based on its
content. The technique assigns priorities to input data items such that higher priority data items are
processed before lower priority data items. A user can define the priority of input data types by
creating priority mappings. A priority mapping is a key-value pair where the key is a feature or
characteristic (e.g., words, phrases, locations) of the input data item, and the value is an integer
representing the priority of the input data item. Once the user has constructed the priority
mappings, they are supplied to the streaming engine. At runtime the streaming engine groups input
data into jobs. The priority of each job is calculated based on the collective priority of all the data
items that it contains. The job manager ensures that jobs with highest priority are processed first.
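The sketch below illustrates how a user could express priority mappings and how a job's priority could be derived from the data items it contains. The keyword "Trump" mirrors the sentiment analysis scenario; the numeric scale, the substring matching, and the use of the maximum as the aggregation rule are illustrative assumptions rather than the exact DDPS rules.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative priority mappings: feature/keyword -> integer priority.
Map<String, Integer> priorityMappings = new HashMap<>();
priorityMappings.put("Trump", 10);    // high priority keyword
priorityMappings.put("weather", 1);   // low priority keyword

// One possible rule for a job's priority: the highest priority matched by any
// of the data items the job contains.
static int jobPriority(List<String> dataItems, Map<String, Integer> mappings) {
    int priority = 0;
    for (String item : dataItems) {
        for (Map.Entry<String, Integer> mapping : mappings.entrySet()) {
            if (item.contains(mapping.getKey())) {
                priority = Math.max(priority, mapping.getValue());
            }
        }
    }
    return priority;
}

The job manager can then order queued jobs by this value, for example using a priority queue, so that the job with the highest priority is dispatched first.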
To evaluate the performance of DDPS a prototype of DDPS is constructed using the Spark
Streaming platform. Performance analysis is performed both on a single machine and on a cloud
comprising multiple nodes. The conclusions derived from the results of these experiments are presented in Section 5.2.
5.2 Conclusions
The performance analysis of the prototype demonstrates the efficacy of the DDPS
scheduler. The experiments performed on the single machine and on the cloud lead to a similar set
of conclusions regarding the effect of various system and workload parameters on performance
and are described in the following subsections, each of which focuses on experiments
corresponding to a different scenario. The first scenario exercises DDPS under a synthetic
workload and compares DDPS with the FIFO scheduler. The second scenario exercises DDPS
with live sentiment analysis of Twitter tweets and analyzes the effectiveness of DDPS in this real-
world scenario. Key performance metrics such as average end-to-end latency, average job
queuing latency and average scanning latency are measured for each of these scenarios.
In this scenario a message sending application is run in parallel with the Spark Streaming
application. The message sending application sends messages to the streaming engine. The
workload parameters are the mean arrival rate of messages, the coefficient of variation of inter-arrival
times, the message length, the message service time, the batch duration, and the number of cores (a minimal
sketch of one way to generate inter-arrival times with a given mean rate and coefficient of variation appears
after the list of insights below). The experiments vary one workload parameter at a time and measure the following performance
metrics: average latency, average job queuing time, scheduling overhead and average scanning
time. The key insights into system behaviour and performance resulting from this performance analysis are summarized:
• High priority messages are processed with a lower average latency than low priority
messages when running with DDPS. As the average arrival rate increases, the difference
between the average latency of LP and HP messages gets larger. With the default FIFO
scheduler high and low priority messages are processed with the same average latency.
When running DDPS with multiple priority levels, the highest priority messages have the
lowest average latency, and the lowest priority messages have the highest processing
latency.
• The average job queuing time increases as the arrival rate increases. This behaviour occurs
with both the DDPS and the FIFO scheduler. As the arrival rate increases, contention for
resources increases, leading to an increase in the average job queuing time. When the
average job queuing time is high, the difference between the average processing latency of
HP and LP messages is also high.
• The average scanning latency for messages in an RDD increases as the arrival rate increases.
The scheduling overhead (which comprises scanning messages as well as allocating and
freeing memory blocks, associating RDDs with jobs, re-ordering the job queue, etc.) also
increases with the arrival rate.
• The experiments conducted show that the smaller the batch duration, the larger the
difference between the average end-to-end latency for HP and LP messages when running
with DDPS. They also show that when the batch duration increases, the average end-to-
end latency for both high and low priority messages also increases.
• For any given value of Ca, the average latency of processing LP messages is always
larger than that of the HP messages. As Ca increases, so does the difference between the
average latency of LP and HP messages. Increasing Ca also increases the average job
queuing latency.
• Increasing the number of cores reduces the average processing latency of both high and low priority
messages. As the number of cores increases, the difference between the average latencies
of low and high priority messages reduces. Increasing the number of cores reduces the
contention for resources and thus the average job queuing latency.
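As noted above, the synthetic workload is characterized by a mean arrival rate and a coefficient of variation of inter-arrival times (Ca). The sketch below shows one common way to draw inter-arrival times with a prescribed mean rate and Ca: an exponential distribution when Ca = 1 and a balanced-means two-phase hyperexponential distribution when Ca > 1. This is an illustrative assumption; the message sending application used in the experiments may generate arrivals differently.

import java.util.Random;

// Returns an inter-arrival time with mean 1/lambda and coefficient of variation ca.
static double nextInterArrival(double lambda, double ca, Random rng) {
    double mean = 1.0 / lambda;
    double c2 = ca * ca;                                  // squared coefficient of variation
    if (c2 <= 1.0) {                                      // exponential case (ca = 1)
        return -mean * Math.log(1.0 - rng.nextDouble());
    }
    // Balanced-means two-phase hyperexponential fit for c2 > 1.
    double p = 0.5 * (1.0 + Math.sqrt((c2 - 1.0) / (c2 + 1.0)));
    double rate = (rng.nextDouble() < p) ? 2.0 * p / mean : 2.0 * (1.0 - p) / mean;
    return -Math.log(1.0 - rng.nextDouble()) / rate;
}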
In this scenario a streaming application was constructed to perform sentiment analysis of live
tweets. The application first constructed a classifier based on a large dataset of historical tweets.
The classifier was used at runtime to evaluate the sentiment of tweets. The word “Trump” was
used to signify a high priority tweet. Key insights into system behavior and performance are
summarized:
• The rate at which tweets arrived varied over time. When there was a sudden increase in
tweets, DDPS was able to prioritize tweets containing the high priority keyword. This
resulted in high priority tweets being processed with a lower average latency than low
priority tweets. When the tweet arrival rate reduced, the difference between the
average latency of processing high and low priority tweets also reduced.
91
• When the tweet arrival rate was high the measured average job queuing latency was also
high.
• As the tweet arrival rate increased, the average scanning time also increased.
• Investigate the impact of additional workload and system parameters on DDPS system
performance.
• A variable batch duration is a batch duration that can be dynamically increased or decreased
while the streaming application is running. Variable batch durations may be able to improve
DDPS's ability to process high priority data items with low latency. In order to give
immediate attention to high priority data items one can collapse the batch duration as soon
as a high priority data item arrives. Reducing the batch duration reduces the batch
delay for the items in the batch. It also reduces the number of data items contained in a job,
and a high priority job with a small number of data items will be processed faster. Such a
variable batch duration for DDPS warrants further investigation.
• DDPS prioritizes high priority data items at the expense of low priority data items. This can
lead to starvation of low priority data items. Investigation of aging techniques for low
priority data items (similar to OS task schedulers), such that the starvation of low priority
data items can be bounded based on user requirements, forms an important direction for
future research; a minimal sketch of one possible aging rule is given after this list.
• DDPS requires users to input priority mappings in order to determine which data items
should be given precedence. Investigation into machine learning techniques that enable
DDPS to learn how to distinguish high priority data items from low priority data items is
another direction for future research.
• This thesis demonstrated the efficacy of DDPS with a live Twitter sentiment analysis
scenario. A similar experiment with Internet of Things (IoT) technology, stock trading
applications or geophysics log monitoring applications would strengthen the case for DDPS
and may uncover other areas of improvement for this scheduling technique.
• Investigate dynamic priority mapping techniques where priorities for keywords can be
updated at runtime while the streaming application is executing.
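As mentioned in the aging item above, one simple way to bound the starvation of low priority data items is to raise a queued job's effective priority in proportion to how long it has been waiting. The sketch below shows one such linear aging rule; the function, its parameters and the linear form are illustrative assumptions rather than part of DDPS.

// Effective priority of a queued job: base priority plus a boost that grows with
// waiting time, capped at maxBoost so priorities remain bounded.
static int effectivePriority(int basePriority, long enqueuedAtMillis, long nowMillis,
                             long agingStepMillis, int maxBoost) {
    long waited = Math.max(0, nowMillis - enqueuedAtMillis);
    int boost = (int) Math.min(maxBoost, waited / agingStepMillis);
    return basePriority + boost;   // the job queue would be ordered by this value
}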
Appendix A: Performance Measurements on the Cloud
A.2 Effect of Mean Arrival Rate
Figure A.3: Cloud: Effect of λ on Average Scanning Latency
Figure A.5: Cloud: Effect of λ with 3 Priority Levels
Figure A.7: Cloud: Effect of Batch Duration on Scheduling Overhead
Figure A.8: Cloud: Effect of Batch Duration on Average Job Queuing Latency
A.4 Effect of Coefficient of Variation of Inter-arrival Times
Figure A.11: Cloud: Effect of Ca with 3 Priority Levels