
Statistics-Driven Workload Modeling for the Cloud

Archana Ganapathi, Yanpei Chen, Armando Fox, Randy Katz, David Patterson
Computer Science Division, University of California at Berkeley
{archanag, ychen2, fox, randy, pattrsn}@cs.berkeley.edu

Abstract— A recent trend for data-intensive computations is to use pay-as-you-go execution environments that scale transparently to the user. However, providers of such environments must tackle the challenge of configuring their systems to provide maximal performance while minimizing the cost of resources used. In this paper, we use statistical models to predict resource requirements for Cloud computing applications. Such a prediction framework can guide system design and deployment decisions such as scale, scheduling, and capacity. In addition, we present the initial design of a workload generator that can be used to evaluate alternative configurations without the overhead of reproducing a real workload. This paper focuses on statistical modeling and its application to data-intensive workloads.

I. INTRODUCTION

The computing industry has recently uncovered the potential of large-scale data-intensive computing. Internet companies such as Google, Yahoo!, Amazon, and others rely on the ability to process large quantities of data to drive their core business. Traditional decision support databases no longer suffice because they do not provide adequate scaling of compute and storage resources. To satisfy their data-processing needs, many Internet services turn to frameworks like MapReduce [1], a big-data computation paradigm complementary to parallel databases. At the same time, the advent of cloud computing infrastructures such as Amazon EC2 and Rackspace makes large-scale cluster computing accessible even to small companies [2]. The prevalence of SQL-like interfaces such as Hive [3] and Pig [4] further eases the migration of traditional database workloads to the Cloud.

Data-intensive computing in the Cloud presents new challenges for system management and design. Key questions include how to optimize scheduling, reduce resource contention, and adapt to changing loads. The penalty for suboptimal decisions is amplified by the large number of simultaneous users and the sheer volume of data. As we will argue, heuristics and cost functions traditionally used for query optimization no longer suffice, and resource management strategies increasingly involve multiple metrics of success. Thus, there is a need for accurate performance prediction mechanisms to guide scheduling and resource management decisions, and for realistic workload generators to evaluate the choice of policies prior to full production deployment.

In this work, we describe and evaluate a statistical framework that uses Kernel Canonical Correlation Analysis (KCCA) to predict the execution time of MapReduce jobs (Section III). This framework is an extension of our work in [5], where we demonstrated the effectiveness of the KCCA technique for predicting query performance in parallel databases. Our key technical finding is that with the right choice of predictive features, KCCA leads to highly accurate predictions that improve with the quality and coverage of performance data.

These features form the basis of a statistics-driven workload generator that synthesizes realistic workloads using the models developed in the KCCA framework (Section IV). This workload generator allows us to evaluate MapReduce optimizations on realistic workloads in the absence of a widely accepted performance benchmark. As we will detail, the workload generator relies on useful features identified by the prediction framework, as well as components that are common across different applications and computing paradigms. In other words, we argue that an effective prediction model is a prerequisite for a good workload generator. Conversely, a good workload generator allows the KCCA prediction framework to guide hardware, configuration, and system management choices without committing to a full implementation on a production cluster.

We begin our discussion with an overview of MapReduce (Section II), and we interleave reviews of relevant related work where appropriate. Our paper builds the case for statistics-driven distributed system modeling and design, an approach that is extensible to computing paradigms other than parallel databases and MapReduce.

II. MAPREDUCE OVERVIEW

MapReduce was initially developed at Google for parallel processing of large datasets [1]. Today, MapReduce powers Google's flagship web search service, as well as clusters at Yahoo!, Facebook, and others [6]. Programs written using MapReduce are automatically executed in a parallel fashion on the cluster. Also, MapReduce can run on clusters of cheap commodity machines, an attractive alternative to expensive, specialized clusters. MapReduce is highly scalable, allowing petabytes of data to be processed on thousands and even millions of machines. Most importantly for our work, MapReduce is especially suitable for the KCCA prediction framework because it has a homogeneous execution model, and production MapReduce workloads often have repeated queries on similar or identical datasets.

At its core, MapReduce has two user-defined functions. The Map function takes in a key-value pair and generates a set of intermediate key-value pairs. The Reduce function takes in all intermediate pairs associated with a particular key and emits a final set of key-value pairs. Both the input pairs to Map and the output pairs of Reduce are placed in an underlying distributed file system (DFS).


The run-time system takes care of retrieving from and outputting to the DFS, partitioning the data, scheduling parallel execution, coordinating network communication, and handling machine failures.
A MapReduce execution occurs in several stages. There is a master daemon that coordinates a cluster of workers. The master divides the input data into many splits, each read and processed by a Map worker. The intermediate key-value pairs are periodically written to the local disk at the Map workers, usually separate machines from the master, and the locations of the pairs are sent to the master. The master forwards these locations to the Reduce workers, who read the intermediate pairs from Map workers using remote procedure call (RPC). After a Reduce worker has read all the intermediate pairs, it sorts the data by the intermediate key, applies the Reduce function, and appends the output pairs to a final output file for the Reduce partition. If any of the Map or Reduce executions lags behind, backup executions are launched. An entire MapReduce computation is called a job, and the execution of a Map or Reduce function on a worker is called a task. Each worker node allocates resources in the form of slots, and each Map task or Reduce task uses one slot.
For our work, we select the Hadoop implementation of MapReduce [7]. The Hadoop distributed file system (HDFS) implements many features of the Google DFS [8]. The open source nature of Hadoop has made it a target for optimizations, e.g., an improved way to launch backup tasks [9], a fair scheduler for multi-user environments [10], pipelined task execution and streaming queries [11], and resource managers for multiple computational frameworks including MapReduce [12].

There have been several efforts to extend Hadoop to accommodate different data processing paradigms. Most relevant to our work, Hive [3] is an open source data warehouse infrastructure built on top of Hadoop. Users write SQL-style queries in a declarative language called HiveQL, which is compiled into MapReduce jobs and executed on Hadoop. Hive represents a natural place to begin our effort to extend the KCCA prediction framework to Hadoop.

III. PREDICTING PERFORMANCE OF HADOOP JOBS

Our goal is to predict Hadoop job performance by correlating pre-execution features and post-execution performance metrics. We take inspiration from the success of using statistical techniques to predict query performance in parallel databases [5]. KCCA allows us to simultaneously predict multiple performance metrics using a single model. This property captures interdependencies between multiple metrics, a significant advantage over more commonly used techniques such as regression, which model a single metric at a time. For a more detailed comparison of KCCA to other statistical techniques, we refer the reader to [13].

Since Hive's query interface is similar to that of commercial parallel databases, it is a natural extension to evaluate the prediction accuracy of Hive queries in Hadoop using the KCCA technique as described in [5].

Fig. 1. Training: KCCA projects vectors of Hadoop job features and performance features onto dimensions of maximal correlation across the data sets. Furthermore, its clustering effect causes "similar" jobs to be collocated.

A. KCCA Prediction Framework

Figure 1 summarizes our adaptation of KCCA for Hadoop performance modeling. The first step in using KCCA-based modeling is to represent each Hadoop job as a feature vector of job characteristics and a corresponding vector of performance metrics. This step is the only place where we deviate from the winning methodology in [5]. We explain our choice of feature vectors in Section III-B.

The core idea behind using the KCCA algorithm is that multi-dimensional correlations are difficult to extract from the raw data of job features and performance features. KCCA allows us to project the raw data onto subspaces α and β such that the projections of the data are maximally correlated.

The precise mathematical construction is as follows. We start with N job feature vectors x_k and corresponding performance vectors y_k. We form an N × N matrix Kx whose (i, j)th entry measures the similarity between (x_i, x_j), and another N × N matrix Ky whose (i, j)th entry is the similarity between (y_i, y_j). Our similarity metric is constructed from Gaussian kernel functions as discussed in [5]. Our prior work in [5] also discusses the impact of different kernel functions.

We then project Kx and Ky onto subspaces α and β. For this projection step, KCCA calculates the projection matrices A and B, respectively consisting of the basis vectors of subspaces α and β. In particular, the matrices A and B are calculated using the generalized eigenvector problem formulation in Figure 1 such that the projections Kx × A and Ky × B are maximally correlated. In other words, job feature projections Kx × A and performance feature projections Ky × B are collocated on subspaces α and β. Thus, we can leverage subspaces α and β for performance prediction.
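For reference, the generalized eigenvector problem sketched in Figure 1 has the standard KCCA form below. It is written here without the regularization terms usually added in practice; the exact variant used in this work is the one described in [5].

\[
\begin{bmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{bmatrix}
\begin{bmatrix} A \\ B \end{bmatrix}
= \lambda
\begin{bmatrix} K_x K_x & 0 \\ 0 & K_y K_y \end{bmatrix}
\begin{bmatrix} A \\ B \end{bmatrix}
\]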
Once we build the KCCA model, performance prediction is as follows. Beginning with a Hive query whose performance we want to predict, we create a job feature vector x̂ and calculate its coordinates in subspace α. We infer the job's coordinates on the performance projection subspace β by using its 3 nearest neighbors in the job projection. This inference step is possible because KCCA projects the raw data onto dimensions of maximal correlation, thereby collocating points on the job and performance projections. Finally, our performance prediction is calculated using a weighted average of the 3 nearest neighbors' raw performance metrics.
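The training and prediction steps can be summarized in a short sketch. The code below is illustrative rather than the authors' implementation: the Gaussian kernel bandwidths, the ridge term added for numerical stability, and the inverse-distance weighting of the 3 nearest neighbors are assumptions not specified in the text.

import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def gaussian_kernel(U, V, sigma):
    # K[i, j] = exp(-||u_i - v_j||^2 / (2 * sigma^2))
    return np.exp(-cdist(U, V, "sqeuclidean") / (2.0 * sigma ** 2))

def train_kcca(X, Y, sigma_x, sigma_y, n_dims=2, reg=1e-3):
    # X: N x d_x job features, Y: N x d_y performance metrics
    N = X.shape[0]
    Kx, Ky = gaussian_kernel(X, X, sigma_x), gaussian_kernel(Y, Y, sigma_y)
    Z, I = np.zeros((N, N)), np.eye(N)
    lhs = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])                      # cross-correlation terms
    rhs = np.block([[Kx @ Kx + reg * I, Z], [Z, Ky @ Ky + reg * I]])  # ridge keeps this positive definite
    vals, vecs = eigh(lhs, rhs)                                       # generalized eigenvector problem
    top = np.argsort(vals)[::-1][:n_dims]                             # most correlated directions
    A, B = vecs[:N, top], vecs[N:, top]
    return {"X": X, "Y": Y, "A": A, "job_proj": Kx @ A, "sigma_x": sigma_x}

def predict(model, x_new, k=3):
    # project the new job onto subspace alpha, then average its k nearest neighbors' metrics
    kx = gaussian_kernel(x_new[None, :], model["X"], model["sigma_x"])
    new_proj = kx @ model["A"]
    dist = np.linalg.norm(model["job_proj"] - new_proj, axis=1)
    nn = np.argsort(dist)[:k]
    w = 1.0 / (dist[nn] + 1e-9)                                       # inverse-distance weights (an assumption)
    return (w[:, None] * model["Y"][nn]).sum(axis=0) / w.sum()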
We evaluate this methodology using data from a production Hadoop deployment at a major web service. This deployment was a multi-user environment comprising hundreds of homogeneous nodes, with a fixed number of map and reduce slots per node based on available memory.

We extracted our data from Hadoop job history logs, which are collected by the cluster's Hadoop master node to track details of every job's execution on the cluster. From these job logs, we construct performance feature vectors that include map time, reduce time, and total execution time. These metrics are central to any scheduling decisions. We also include data metrics such as map output bytes, HDFS bytes written, and locally written bytes.
There are several possible options for job feature vectors. The choice greatly affects prediction accuracy. Luckily, the best choice is intuitive and leads to good prediction accuracy.

B. Prediction Accuracy for Hive

The first choice of job feature vectors is an extension of the feature vectors in [5] that proved effective for parallel databases. Like relational database queries, Hive queries are translated into execution plans involving sequences of operators. We observed 25 recurring Hive operators, including Create Table, Filter, Forward, Group By, Join, Move, and Reduce Output, to name a few. Our initial job feature vector contained 25 features, corresponding to the number of occurrences of each operator in a job's execution plan.

Figure 2 compares the predicted and actual execution time using Hive operator instance counts as job features. The prediction accuracy was very low, with a negative R2 value, indicating poor correlation between predicted and actual values¹. Our results suggest that Hive operator occurrence counts are insufficient for modeling Hive query performance.

¹ Negative R2 values are possible since the training data and test data are disjoint. Note that this metric is sensitive to outliers. In several cases, the R2 value improved significantly by removing the top one or two outliers.

Fig. 2. Predicted vs. actual execution time for Hive queries, modeled using Hive operator instance counts as job features. The model training and test sets contained 5000 and 1000 Hive queries respectively. The diagonal green line represents the perfect prediction scenario. Note: the results are plotted on a log-log scale to accommodate the variance in execution time.

This finding is somewhat unsurprising. Unlike relational databases, Hive execution plans are an intermediate step before determining the number and configuration of maps and reduces to be executed as a Hadoop job. Job count and configuration are likely to form more effective job feature vectors, since they describe the job at the lowest level of abstraction visible prior to executing the job.

Thus, our next choice of job feature vector used each Hive query's configuration parameters and input data characteristics. We included the number and location of maps and reduces required by all Hadoop jobs generated by each Hive query, and data characteristics such as bytes read locally, bytes read from HDFS, and bytes input to the map stage.
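Concretely, each job reduces to a pair of numeric vectors before training. The field names below are hypothetical placeholders for whatever a parsed job-history record exposes; the entries mirror the job features just described and the performance metrics of Section III-A.

import numpy as np

def job_feature_vector(rec):
    # pre-execution features: job configuration and input data characteristics
    return np.array([rec["num_maps"], rec["num_reduces"],
                     rec["bytes_read_local"], rec["bytes_read_hdfs"],
                     rec["map_input_bytes"]], dtype=float)

def performance_vector(rec):
    # post-execution metrics extracted from the Hadoop job history log
    return np.array([rec["map_time_sec"], rec["reduce_time_sec"], rec["total_time_sec"],
                     rec["map_output_bytes"], rec["hdfs_bytes_written"],
                     rec["local_bytes_written"]], dtype=float)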
Figure 3 shows our prediction results for the same training and test set of Hive queries as in Figure 2. Our R2 prediction accuracy is now 0.87 (R2 = 1.00 signifies perfect prediction).

Fig. 3. Predicted vs. actual execution time for Hive queries, modeled using job configuration and input data characteristics as job features. The model training and test sets contained 5000 and 1000 Hive queries respectively. The diagonal green line represents the perfect prediction scenario. Note: the results are plotted on a log-log scale to accommodate the variance in execution time.

Using the same model, our prediction accuracy was 0.84 for map time, 0.71 for reduce time, and 0.86 for bytes written. Our prediction accuracy was lower for reduce time since the reduce step is fundamentally exposed to more variability due to data skew and uneven map finishing times. These results convincingly demonstrate that feature vectors with job configuration and input data characteristics enable effective modeling of Hive query performance.
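As a reminder of how the reported R2 values behave (and why they can go negative on a disjoint test set, per the footnote above), the coefficient of determination can be computed as follows. Whether it is evaluated on raw or log-scaled execution times is not specified here, so treat this as a sketch.

import numpy as np

def r_squared(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((actual - predicted) ** 2)          # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot                        # negative when worse than predicting the mean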

In the future, we can improve prediction accuracy even further by using more customized kernel functions, two-step KCCA prediction, and better training set coverage [5].

C. Prediction for Other Hadoop Jobs

A significant advantage of our chosen job and performance feature vectors is that they contain no features that limit their scope to Hive queries. As a natural extension, we evaluate our performance prediction framework on another class of Hadoop jobs that mimic data warehouse Extract Transform Load (ETL) operations. ETL involves extracting data from outside sources, transforming it to fit operational needs, then loading it into the end target data warehouse. KCCA prediction is especially effective for Hadoop ETL jobs because the same jobs are often rerun periodically with varying quantities and granularities of data. Also, KCCA prediction can bring great value because ETL jobs are typically long-running. Thus, it is important to anticipate job execution times so that system administrators can plan and schedule the rest of their workload.

Fig. 4. Predicted vs. actual time for ETL jobs, modeled using job configuration and input data characteristics as job features. The model training and test sets contained 5000 and 1000 ETL jobs respectively. The diagonal green line represents the perfect prediction scenario. Note: the results are plotted on a log-log scale to accommodate the variance in execution time.

Figure 4 shows the predicted vs. actual execution times of ETL jobs at the same Hadoop deployment. Our R2 prediction accuracy for job execution time was 0.93. Prediction results for other metrics were equally good, with R2 values of 0.93 for map time and 0.85 for reduce time. While the R2 values are better than the Hive predictions, there are some visible prediction errors for jobs with short execution times. These jobs are inherently more difficult to predict because the setup time for big-data Hadoop jobs is also on the same time scale. Nevertheless, the results indicate that the KCCA framework is also effective for predicting Hadoop ETL jobs.

D. Potential Applications

Our finding has several implications beyond our immediate work. The high prediction accuracy suggests that related work on MapReduce optimization should consider augmenting their mechanisms with KCCA predictions.

For example, other researchers have adapted traditional query progress heuristics to estimate the execution time of Pig queries [14]. The technique estimates the remaining query execution time by comparing the number of tuples processed to the number of remaining tuples. This heuristic is effective when there is a constant tuple processing rate during query execution. In contrast, our statistical approach remains effective even if the tuple processing rate changes due to changing data locality or other effects. Another advantage of our technique is that we can make an accurate prediction prior to query execution, instead of halfway through the execution.

In addition, we have a good way to identify lagging tasks once we have accurate, statistics-driven predictions of task finishing times. The LATE scheduler can use these predictions instead of the finishing time estimation heuristics in [9], leading to performance improvements.

Moreover, the Hadoop FAIR scheduler [10] can perform resource allocation using predicted execution time in addition to using the number of slots as a resource consumption proxy. This combination could lead to better resource sharing and decrease the need for killing or preempting tasks.

Accurate finishing time predictions also enable several other scheduler policies. One such example is a shortest-job-first scheduler, where we minimize per-job wait time by executing jobs in order of increasing predicted execution time. We can also implement a deadline-driven scheduler that allows jobs with approaching deadlines to jump the queue "just in time".
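As an illustration of the shortest-job-first policy just described (a sketch; the job objects and the predictor are placeholders rather than an actual Hadoop scheduler interface):

import heapq

class PredictiveSJFQueue:
    """Dispatch jobs in order of increasing predicted execution time."""
    def __init__(self, predict_fn):
        self.predict = predict_fn   # e.g., the KCCA-based predictor from Section III
        self._heap, self._count = [], 0

    def submit(self, job):
        # the counter breaks ties so arbitrary job objects never get compared directly
        heapq.heappush(self._heap, (self.predict(job), self._count, job))
        self._count += 1

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None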
Furthermore, the prediction framework can help with resource provisioning. Given predicted resource requirements and desired finishing time, one can evaluate whether there are enough resources in the cluster or if more nodes should be pooled in. If predicted resource requirements are low, one can assign nodes for other computations, or turn off unnecessary nodes to reduce power without impacting performance.
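A back-of-the-envelope version of that provisioning check might look like the following (purely illustrative; the slot-second inputs would come from the predicted map and reduce times multiplied by their task counts):

import math

def nodes_needed(pred_map_slot_sec, pred_reduce_slot_sec, deadline_sec,
                 map_slots_per_node, reduce_slots_per_node):
    # nodes required so that the predicted work fits within the desired finishing time
    map_nodes = pred_map_slot_sec / (map_slots_per_node * deadline_sec)
    reduce_nodes = pred_reduce_slot_sec / (reduce_slots_per_node * deadline_sec)
    return math.ceil(max(map_nodes, reduce_nodes, 1.0))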
E. Ongoing Work

To realize the full potential of our prediction framework, there are several concerns to address before integrating KCCA with scheduling and resource management infrastructures.

Different Hadoop workloads: Different organizations have different computational needs. We need to evaluate prediction effectiveness for different workloads to increase confidence in the KCCA framework. In the absence of a representative MapReduce benchmark, we must rely on case studies of production logs. However, access to production logs involves data privacy and other logistical issues. Thus, we need a way to anonymize sensitive information while preserving a statistical description of the MapReduce jobs and their performance.

Different job types: Here, we have obtained prediction results for Hive queries and ETL jobs. However, we believe the prediction framework is applicable to a more generic set of Hadoop jobs. The features we used for our predictions are common to all Hadoop jobs. Given that resource demands vary by the map and reduce functions, we can augment feature vectors with specific map and reduce function identifiers and/or the language in which these functions were implemented.

Different resources: A good scheduler should understand whether particular jobs have conflicting or orthogonal resource demands. However, due to hardware and configuration differences, we cannot directly compare resource consumption between different clusters. If we can "rerun" the workload with good instrumentation, we can monitor the utilization of various resources including CPU, disk, and network. This capability allows us to augment the job feature vector with resource consumption levels, providing more accurate predictions to the scheduler. Furthermore, we can "rerun" the workload on different hardware and cluster configurations, turning the KCCA framework into an even more powerful tool that can guide decisions on hardware and configuration choices.

The next section describes a workload generation framework that takes advantage of these opportunities.
IV. TOWARDS A WORKLOAD GENERATION FRAMEWORK

Recent research on MapReduce has relied on sort jobs and gridmix [15] to generate evaluation workloads. These techniques are criticized for their inadequate diversity in data transfer patterns and insufficient evaluation of performance under steady but non-continuous job streams. In contrast, our predecessor work benefited from the widely accepted TPC-DS database benchmark for its evaluations. In this section, we describe our initial work towards developing a MapReduce workload generation framework that captures the statistical properties of production MapReduce traces. The workload generator can act as a richer synthetic benchmark compared to current MapReduce evaluation methods.

We have several design goals. First, the framework must mask the actual computation done by jobs to prevent leakage of confidential competitive information. Often, MapReduce job source code reveals trade secrets about the scale and granularity of data processed for companies' core business decisions. Hiding job-specific characteristics would encourage companies to contribute traces towards the workload generation framework. Second, the framework has to be agnostic to the hardware, MapReduce implementation, and cluster configurations. This goal allows our framework to be used without the need to replicate production cluster environments. Third, the framework has to accommodate different cluster sizes and workload durations. This property allows us to investigate provisioning the right cluster size for a particular type of workload, and to understand both short- and long-term performance characteristics.

The following design satisfies these goals. We start by extracting a statistical summary of the production MapReduce trace. This summary includes a distribution of inter-job arrival times and a distribution of job counts according to job name. For each job name, we also extract the distribution of job input sizes and input/shuffle/output data ratios. We scale the job input sizes by the number of nodes in the cluster. This adjustment preserves the amount of data processed by each node, facilitating a comparison of different node types, e.g., big nodes with more cores vs. small nodes with few cores. We then probabilistically sample the distributions to generate the workload as vectors of [launch time, job name, input size, input/shuffle/output data ratios]. The job finishing time is not among the inputs to the workload generator because it is the prediction output of our model.

It may require an unwieldy amount of data to accurately represent the statistical distributions. Therefore, we only extract the 1st, 25th, 50th, 75th, and 99th percentiles for the distributions of inter-job arrival times, input sizes, and data ratios. We do linear interpolation between these percentiles to approximate the full distributions.
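The summary-and-sampling idea can be sketched with two small helpers: one that reduces an observed distribution to the five stored percentiles, and one that draws a sample by inverse-CDF lookup with linear interpolation between them. Clamping the uniform draw to the stored 1st-99th percentile range is our own assumption.

import numpy as np

PERCENTILES = [1, 25, 50, 75, 99]

def summarize(values):
    # statistical summary kept per distribution extracted from the trace
    return np.percentile(np.asarray(values, dtype=float), PERCENTILES)

def sample(summary, rng):
    # inverse-CDF sampling with linear interpolation between the stored percentiles
    u = rng.uniform(0.01, 0.99)
    return float(np.interp(u, np.array(PERCENTILES) / 100.0, summary))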
The precise algorithm is as follows:

1) From the production trace:
   • Compute the 1st, 25th, 50th, 75th, and 99th percentiles of inter-job arrival times.
   • Compute the CDF of job counts by jobName.
   • For each job, compute the 1st, 25th, 50th, 75th, and 99th percentiles of scaled input sizes, shuffle-input data ratio, and output-shuffle data ratio.
2) Perform probabilistic sampling on:
   • The approximated distribution of inter-job arrival times to get t.
   • The truncated CDF of job counts to get jobName.
   • The approximated distribution of scaled input sizes for jobName to get inputSize.
   • The approximated distribution of shuffle-input data ratio for jobName to get shuffleInputRatio.
   • The approximated distribution of output-shuffle data ratio for jobName to get outputShuffleRatio.
3) Add [t, jobName, inputSize, shuffleInputRatio, outputShuffleRatio] to the workload.

We repeat steps 2 and 3 until we have the required workload size in terms of either the number of jobs or the time duration of the workload.
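Steps 2 and 3 translate almost directly into a sampling loop. The sketch below reuses summarize() and sample() from above; job_cdf is a list of (cumulative probability, jobName) pairs and job_stats maps each jobName to its three per-job percentile summaries. Both structures are our own naming, not the paper's, and generating up to a target duration instead of a job count only changes the loop condition.

import numpy as np

def generate_workload(arrival_summary, job_cdf, job_stats, num_jobs, seed=0):
    rng = np.random.default_rng(seed)
    workload, t = [], 0.0
    for _ in range(num_jobs):
        t += sample(arrival_summary, rng)                       # step 2: next launch time
        u = rng.random()
        job_name = next(name for p, name in job_cdf if u <= p)  # sample the job-name CDF
        s = job_stats[job_name]
        workload.append((t, job_name,
                         sample(s["input_size"], rng),
                         sample(s["shuffle_input_ratio"], rng),
                         sample(s["output_shuffle_ratio"], rng)))  # step 3
    return workload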
This algorithm requires us to have a MapReduce job that adheres to specified shuffle-input and output-shuffle data ratios. This job is a straightforward modification of randomwrite, a code example included in recent Hadoop distributions.

There are several trade-offs associated with such a workload generator. Most visibly, we do not capture the compute part of MapReduce jobs. As explained by our goals, this trade-off is necessary, since knowledge of the computation could lead to a leakage of confidential information. On the other hand, it allows us to compare workloads from organizations using MapReduce for different computations. Moreover, we still capture the workload's data transfer activity, which is often the biggest contributor to overall job finishing time. With a full implementation of our workload generator, we can concretely verify whether this is a sufficiently common case.

Another trade-off is the loss of hardware, data locality, and cluster configuration information from the initial job trace. The advantage of a hardware-, locality-, and configuration-independent workload is that we can "re-run" workloads on different choices of hardware, MapReduce schedulers/implementations, and cluster configurations. Thus, we would be able to anticipate which choice would lead to the fastest finishing time for the workload at hand. This is a fundamentally different approach from detailed Hadoop simulators such as Mumak [16], which are aimed at expediting the design and evaluation cycle of heuristics-driven mechanisms.

We also do not capture any data skew that may affect MapReduce finishing time. This is also a conscious decision. Data skew would cause some map or reduce tasks to finish more slowly than the others and hold up the entire job. Thus, well-written MapReduce jobs should include mechanisms like hash functions to remove data skew, and our workload generator encompasses this desired operating mode.

Lastly, despite appearances, the choice of the 1st, 25th, 50th, 75th, and 99th percentiles is anything but ad hoc. These percentiles correspond to the five-number summary of statistical distributions, i.e., the minimum, maximum, median, and quartiles, with the 1st and 99th percentiles standing in for the extremes. The key strength of this summary is that it makes no assumptions about the underlying statistical distribution while capturing both the dispersion and the skew in the data. Any sensitivity of quantile boundaries due to perturbations in the data would be bounded by the neighboring quantile boundaries. While capturing the complete distribution would be ideal, we believe the five-number summary is an acceptable trade-off.

We believe that such a workload generator would allow us to apply our KCCA prediction framework on different hardware, MapReduce implementations, and cluster configurations, while preserving the data transfer characteristics of the production traces. Thus, the KCCA prediction framework becomes a powerful tool that can guide the choice of hardware, MapReduce schedulers/optimizers, and cluster configurations. At the same time, this architecture provides a framework for MapReduce operators to contribute anonymized production traces that would benefit the research community as a whole.

V. SUMMARY AND OPEN QUESTIONS

We have presented a statistics-driven modeling framework for data-intensive applications in the Cloud. We demonstrated good prediction accuracy on production Hadoop data analytics and warehousing jobs. We can leverage KCCA-based predictions for making decisions including job scheduling, resource allocation, and workload management. A precondition to implementing and validating frameworks for resource management is the presence of representative, realistic, portable workloads. We have also described a statistics-driven workload generator to meet this prerequisite.

There are several unanswered questions for resource management in the Cloud. First, it is unclear what granularity of scheduling decisions is appropriate in Hadoop. Concurrently scheduled jobs lead to variability in a job's performance, while concurrent tasks on the same node create variability in task finishing times. This problem is further complicated in clusters shared among various applications and not exclusively used by Hadoop [12]. One could devise per-application scheduling techniques, per-job techniques, per-task techniques, or per-node techniques; the limitations of each remain unexplored.

Next, several cloud computing infrastructures use virtual machines to provide isolation and abstract away resource availability. However, the additional layer of abstraction also creates complexity for performance modeling and decision making. An open challenge is to optimize resource usage through VM placement. The KCCA prediction framework could help address this challenge, provided we verify its effectiveness in VM environments.

Lastly, an opportunity yet to be capitalized on in performance modeling is to appropriately account for variability inherent within a platform. It would be useful to normalize measured performance metrics with respect to each node's historical behavior, such as resource availability, average response time, and failure profile.
in Symposium on Operating Systems Design and Implementation, 2008.
agement is the presence of representative, realistic, portable [10] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and
workloads. We have also described a statistics-driven workload I. Stoica, “Job Scheduling for Multi-User MapReduce Clusters,” EECS
generator to meet this prerequisite. Department, University of California, Berkeley, Tech. Rep. UCB/EECS-
2009-55, 2009.
There are several unanswered questions for resource man- [11] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy,
agement in the Cloud. First, it is unclear what granularity of and R. Sears, “MapReduce Online,” EECS Department, University of
scheduling decisions is appropriate in Hadoop. Concurrently California, Berkeley, Tech. Rep. UCB/EECS-2009-136, 2009.
[12] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph,
scheduled jobs lead to variability in a job’s performance, while S. Shenker, and I. Stoica, “Nexus: A Common Substrate for Cluster
concurrent tasks on the same node create variability in task Computing,” Workshop on Hot Topics in Cloud Computing, 2009.
finishing times. This problem is further complicated in clusters [13] A. Ganapathi, “Predicting and Optimizing System Utilization and Per-
formance via Statistical Machine Learning,” Ph.D. dissertation, UC
shared among various applications and not exclusively used Berkeley, 2009.
by Hadoop [12]. One could devise per-application scheduling [14] K. Morton, A. Friesen, M. Balazinska, and D. Grossman, “Estimating the
techniques, per-job techniques, per-task techniques, or per- Progress of MapReduce Pipelines,” in Proc. International Conference on
Data Engineering, 2010.
node techniques; the limitations of each remain unexplored. [15] “Gridmix,” HADOOP-HOME/src/benchmarks/gridmix in all recent
Next, several cloud computing infrastructures use virtual Hadoop distributions.
machines to provide isolation and abstract away resource [16] “Mumak,” http://issues.apache.org/jira/browse/MAPREDUCE-728, last
retrieved Nov. 2009.
availability. However, the additional layer of abstraction also
creates complexity for performance modeling and decision

