
GOOGLE-WIDE PROFILING: A CONTINUOUS PROFILING INFRASTRUCTURE FOR DATA CENTERS

Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses. Furthermore, GWP introduces novel applications of its profiles, such as application-platform affinity measurements and identification of platform-specific, microarchitectural peculiarities.

Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt
Google

As cloud-based computing grows in pervasiveness and scale, understanding datacenter applications' performance and utilization characteristics is critically important, because even minor performance improvements translate into huge cost savings. Traditional performance analysis, which typically needs to isolate benchmarks, can be too complicated or even impossible with modern datacenter applications. It's easier and more representative to monitor datacenter applications running on live traffic. However, application owners won't tolerate latency degradations of more than a few percent, so these tools must be nonintrusive and have minimal overhead. As with all profiling tools, observer distortion must be minimized to enable meaningful analysis. (For additional information on related techniques, see the "Profiling: From single systems to data centers" sidebar.)

Sampling-based tools can bring overhead and distortion to acceptable levels, so they're uniquely qualified for performance monitoring in the data center. Traditionally, sampling tools run on a single machine, monitoring specific processes or the system as a whole. Profiling can begin on demand to analyze a performance problem, or it can run continuously.1

Google-Wide Profiling can be theoretically viewed as an extension of the Digital Continuous Profiling Infrastructure (DCPI)1 to data centers. GWP is a continuous profiling infrastructure; it samples across machines in multiple data centers and collects various events—such as stack traces, hardware events, lock contention profiles, heap profiles, and kernel events—allowing cross-correlation with job scheduling data, application-specific data, and other information from the data centers.

Sidebar: Profiling: From single systems to data centers

From gprof1 to Intel VTune (http://software.intel.com/en-us/intel-vtune), profiling has long been standard in the development process. Most profiling tools focus on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important.

Continuous, always-on profiling became more important with Morph and DCPI for Digital UNIX. Morph uses low-overhead system-wide sampling (approximately 0.3 percent) and binary rewriting to continuously evolve programs to adapt to their host architectures.2 DCPI gathers much more robust profiles, such as precise stall times and causes, and focuses on reporting information to users.3

OProfile, a DCPI-inspired tool, collects and reports data much in the same way for a plethora of architectures, though it offers less sophisticated analysis. GWP uses OProfile (http://oprofile.sourceforge.net) as a profile source.

High-performance computing (HPC) shares many profiling challenges with cloud computing, because profilers must gather representative samples with minimal overhead over thousands of nodes. Cloud computing has additional challenges—vastly disparate applications, workloads, and machine configurations. As a result, different profiling strategies exist for HPC and cloud computing. HPCToolkit uses lightweight sampling to collect call stacks from optimized programs and compares profiles to identify scaling issues as parallelism increases.4 Open|SpeedShop (www.openspeedshop.org) provides similar functionality. Similarly, Upshot5 and Jumpshot6 can analyze traces (such as MPI calls) from parallel programs but aren't suitable for continuous profiling.

Sidebar references
1. S.L. Graham, P.B. Kessler, and M.K. Mckusick, "Gprof: A Call Graph Execution Profiler," Proc. Symp. Compiler Construction (CC 82), ACM Press, 1982, pp. 120-126.
2. X. Zhang et al., "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 15-26.
3. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
4. N.R. Tallent et al., "Diagnosing Performance Bottlenecks in Emerging Petascale Applications," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC 09), ACM Press, 2009, no. 51.
5. V. Herrarte and E. Lusk, Studying Parallel Program Behavior with Upshot, tech. report, Argonne National Laboratory, 1991.
6. O. Zaki et al., "Toward Scalable Performance Visualization with Jumpshot," Int'l J. High Performance Computing Applications, vol. 13, no. 3, 1999, pp. 277-288.

GWP collects daily profiles from several thousand applications running on thousands of servers, and the compressed profile database grows by several Gbytes every day. Profiling at this scale presents significant challenges that don't exist for a single machine. Verifying that the profiles are correct is important and challenging because the workloads are dynamic. Managing profiling overhead becomes far more important as well, as any unnecessary profiling overhead can cost millions of dollars in additional resources. Finally, making the profile data universally accessible is an additional challenge. GWP is also a cloud application, with its own scalability and performance issues.

With this volume of data, we can answer typical performance questions about datacenter applications, including the following:

- What are the hottest processes, routines, or code regions?
- How does performance differ across software versions?
- Which locks are most contended?
- Which processes are memory hogs?
- Does a particular memory allocation scheme benefit a particular class of applications?
- What is the cycles per instruction (CPI) for applications across platforms?

Additionally, we can derive higher-level data to answer more complex but interesting questions, such as which compilers were used for applications in the fleet, whether there are more 32-bit or 64-bit applications running, and how much utilization is being lost by suboptimal job scheduling.

Infrastructure

Figure 1 provides an overview of the entire GWP system.

Collector

GWP samples in two dimensions. At any moment, profiling occurs only on a small subset of all machines in the fleet, and event-based sampling is used at the machine level. Sampling in only one dimension would

be unsatisfactory; if event-based profiling were active on every machine all the time, at a normal event-sampling rate, we would be using too many resources across the fleet. Alternatively, if the event-sampling rate is too low, profiles become too sparse to drill down to the individual machine level. For each event type, we choose a sampling rate high enough to provide meaningful machine-level data while still minimizing the distortion caused by the profiling on critical applications. The system has been actively profiling nearly all machines at Google for several years with only rare complaints of system interference.

Figure 1. An overview of the Google-Wide Profiling (GWP) infrastructure. The whole system consists of collector, symbolizer, profile database, Web server, and other components.

A central machine database manages all machines in the fleet and lists every machine's name and basic hardware characteristics. The GWP profile collector periodically gets a list of all machines from that database and selects a random sample of machines from that pool. The collector then remotely activates profiling on the selected machines and retrieves the results. It retrieves different types of sampled profiles sequentially or concurrently, depending on the machine and event type. For example, the collector might gather hardware performance counters for several seconds each, then move on to profiling for lock contention or memory allocation. It takes a few minutes to gather profiles for a specific machine.

For robustness, the GWP collector is a distributed service. It helps improve availability and reduce additional variation from the collector itself. To minimize distortion on the machines and the services running on them, the collector monitors error conditions and ceases profiling if the failure rate reaches a predefined threshold. Aside from the collector, we monitor all other GWP components to ensure an always-on service to users.

On top of the two-dimensional sampling approach, we apply several techniques to further reduce the overhead. First, we measure the event-based profiling overhead on a set of benchmark applications and then conservatively set the maximum rates to ensure the overhead is always less than a few percent. Second, we don't collect whole call stacks for the machine-wide profiles to avoid the high overhead associated with unwinding (but we collect call stacks for most server profiles at lower sampling frequencies). Finally, we save the profile and metadata in their raw format and perform symbolization on a separate set of machines. As a result, the aggregated profiling overhead is negligible—less than 0.01 percent. At the same time, the derived profiles are still meaningful, as we show in the "Reliability analysis" section.

Profiles and profiling interfaces

GWP collects two categories of profiles: whole-machine and per-process. Whole-machine profiles capture all activities happening on the machine, including user applications, the kernel, kernel modules, daemons, and other background jobs. The whole-machine profiles include hardware performance monitoring (HPM) event profiles, kernel event traces, and power measurements. Users without root access cannot directly invoke most of the whole-machine profiling systems, so we deploy lightweight daemons on every machine to let remote users (such as GWP collectors) access those profiles. The daemons act as gatekeepers to control access, enforce sampling-rate limits, and collect system variables that must be synchronized with the profiles.
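As an illustration of how these pieces fit together, the following sketch mimics one collection round that samples in both dimensions. It is not GWP's actual code; the machine-database and collector interfaces (list_machines, profile_machine, save_raw, record_failure) and the per-event budgets are invented for the example.

import random
import time

# Hypothetical per-event collection budgets (not GWP's real values): each event
# type is gathered for a bounded duration to keep per-machine overhead low.
EVENT_BUDGETS_SEC = {"cycles": 5, "instructions": 5, "lock_contention": 10, "heap": 10}

def choose_machines(machine_db, fraction=0.001):
    """First sampling dimension: profile only a small random subset of the fleet."""
    machines = machine_db.list_machines()          # assumed machine-database API
    k = max(1, int(len(machines) * fraction))
    return random.sample(machines, k)

def collect_once(machine_db, collector):
    """One round: visit the sampled machines, gathering one event type at a time."""
    for machine in choose_machines(machine_db):
        for event, budget in EVENT_BUDGETS_SEC.items():
            try:
                # Second sampling dimension: event-based sampling on the machine,
                # mediated by the per-machine daemon, for a bounded duration.
                raw_profile = collector.profile_machine(machine, event, duration_sec=budget)
                collector.save_raw(machine, event, raw_profile, timestamp=time.time())
            except IOError:
                # A real collector tracks failure rates and ceases profiling a
                # machine if they exceed a predefined threshold.
                collector.record_failure(machine, event)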

We use OProfile (http://oprofile.sourceforge.net) to collect HPM event profiles. OProfile is a system-wide profiler that uses HPM to generate event-based samples for all running binaries at low overhead. To hide the heterogeneity of events between architectures, we define some generic HPM events on top of the platform-specific events, using an approach similar to PAPI.2 The most commonly used generic events are CPU cycles, retired instructions, L1 and L2 cache misses, and branch mispredictions. We also provide access to some architecture-specific events. Although the aggregated profiles for those events are biased to specific architectures, they provide useful information for machine-specific scenarios.

In addition to whole-machine profiles, we collect various types of profiles from most applications running on a machine using the Google Performance Tools (http://code.google.com/p/google-perftools). Most applications include a common library that enables process-wide, stacktrace-attributed profiling mechanisms for heap allocation, lock contention, wall time and CPU time, and other performance metrics. The common library includes a simple HTTP server linked with handlers for each type of profiler. A handler accepts requests from remote users, activates profiling (if it's not already active), and then sends the profile data back. The GWP collector learns from the cluster-wide job management system what applications are running on a machine and on which port each can be contacted for remote profiling invocation. Some programs lack this remote profiling support, so the per-process profiling doesn't capture them. However, these are few; comparison with system-wide profiles shows that remote per-process profiling captures the vast majority of Google's programs. Most programs' users have found profiling useful or unobtrusive enough to leave it enabled.

Together with profiles, GWP collects other information about the target machine and applications. Some of the extra information is needed to postprocess the collected profiles, such as a unique identifier for each running binary that can be correlated across machines with unstripped versions for offline symbolization. The rest is mainly used to tag the profiles so that we can later correlate the profiles with job, machine, or datacenter attributes.

Symbolization and binary storage

After collection, the Google File System (GFS) stores the profiles.3 To provide meaningful information, the profiles must correlate to source code. However, to save network bandwidth and disk space, applications are usually deployed into data centers without any debug or symbolic information, which can make source correlation impossible. Furthermore, several applications, such as Java and QEMU, dynamically generate and execute code. The code is not available offline and can therefore no longer be symbolized. The symbolizer must also symbolize operating system kernels and kernel loadable modules.

Therefore, the symbolization process becomes surprisingly complicated, although it's usually trivial for single-machine profiling. Various strategies exist to obtain binaries with debug information. For example, we could try to recompile all sampled applications at specific source milestones. However, that is too resource-intensive and sometimes impossible for applications whose source isn't readily available. An alternative is to persistently store binaries that contain debug information before they're stripped.

Currently, GWP stores unstripped binaries in a global repository, which other services use to symbolize stack traces for automated failure reporting. Since the binaries are quite large and many unique binaries exist, symbolization for a single day of profiles would take weeks if run sequentially. To reduce the result latency, we distribute symbolization across a few hundred machines using MapReduce.4

Profile storage

Over the past years, GWP has amassed several terabytes of historical performance data. GFS archives the entire performance logs and corresponding binaries. To make the data useful and accessible, we load the samples into a read-only dimensional database that is distributed across hundreds of machines. That service is accessible to all users for ad hoc queries and to systems for automated analyses.
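Before turning to how these profiles are queried, here is a concrete illustration of the remote per-process collection interface described earlier. The pprof-style URL layout and parameters are assumptions for the example, not GWP's documented interface.

import urllib.request

def fetch_process_profile(host, port, profile_kind="profile", seconds=30, timeout=60):
    """Pull one profile from a process's embedded profiling HTTP handler.

    The URL layout mimics the pprof-style handlers exposed by servers linked
    against google-perftools (for example, /pprof/profile or /pprof/heap); the
    exact paths and parameters GWP uses are an assumption.
    """
    url = f"http://{host}:{port}/pprof/{profile_kind}?seconds={seconds}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()  # raw profile bytes, symbolized later offline

# Example: gather a 30-second CPU profile and a heap profile from one task.
# cpu = fetch_process_profile("some-task.example.com", 8080, "profile", 30)
# heap = fetch_process_profile("some-task.example.com", 8080, "heap")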

The database supports a subset of SQL-like semantics. Although the dimensional database is well suited to perform queries that aggregate over the large data set, some individual queries can take tens of seconds to complete. Fortunately, most queries are seen frequently, so the profile server uses aggressive caching to hide the database latency.

User interfaces

For most users, GWP deploys a webserver to provide a user interface on top of the profile database. This makes it easy to access profile data and construct ad hoc queries for the traditional use of application profiles (with additional freedom to filter, group, and aggregate profiles differently).

Query view. Several visual interfaces retrieve information from the profile database, and all are navigable from a web browser. The primary interface (see Figure 2) displays the result entries, such as functions or executables, that match the desired query parameters. This page supplies links that let users refine the query to more specific data. For example, the user can restrict the query to only report samples for a specific executable collected within a desired time period. Additionally, the user can modify or refine any of the parameters to the current query to create a custom profile view. The GWP homepage has links to display the top results, Google-wide, for each performance metric.

Figure 2. An example query view: an application-level profile (a) and a function-level profile (b).

Call graph view. For most server profile samples, the profilers collect full call stacks with each sample. Call stacks are aggregated to produce complete dynamic call graphs for a given profile. Figure 3 shows an example call graph. Each node displays the function name and its percentage of samples, and the nodes are shaded based on this percentage. The call graph is also displayed through the web browser, via a Graphviz plug-in.

Source annotation. The query and call graph views are useful in directing users to specific functions of interest. From there, GWP provides a source annotation view that presents the original source file with a header describing overall profile information about the file and a histogram bar showing the relative hotness of each source file line. Because different versions of each file can exist in source repositories and branches, we retrieve a hash signature from the repository for each source file and aggregate samples on files with identical signatures.
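All of these views ultimately issue dimension-filtered aggregations against the profile database. As a rough illustration only (the table and column names are invented, and the actual query language is described only as supporting a subset of SQL-like semantics), such a query might look as follows.

# Hypothetical client and schema; GWP's real query interface is not documented here.
TOP_FUNCTIONS_QUERY = """
    SELECT function, SUM(sample_value) AS cycles
    FROM profiles
    WHERE event = 'cycles'
      AND executable = 'websearch_frontend'      -- invented executable name
      AND date BETWEEN '2010-03-01' AND '2010-03-07'
    GROUP BY function
    ORDER BY cycles DESC
    LIMIT 30
"""

def top_functions(profile_db):
    """Return the hottest functions for one executable over one week."""
    return profile_db.query(TOP_FUNCTIONS_QUERY)   # assumed client API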

Profile data API. In addition to the webserver, we offer a data-access API to read profiles directly from the database. It's more suitable for automated analyses that must process a large amount of profile data (such as reliability studies) offline. We store both raw profiles and symbolized profiles in ProtocolBuffer formats (http://code.google.com/apis/protocolbuffers). Advanced users can access and reprocess them using their preferred programming language.

Figure 3. An example dynamic call graph. Function names are intentionally blurred.

Application-specific profiling

Although the default sampling rate is high enough to derive top-level profiles with high confidence, GWP might not collect enough samples for applications that consume relatively few cycles Google-wide. Increasing the overall sampling rate to cover those profiles is too expensive because they're usually sparse. Therefore, we provide an extension to GWP for application-specific profiling on the cloud. The machine pool for application-specific profiling is usually much smaller than GWP's, so we can achieve a high sampling rate on those machines for the specific application. Several application teams at Google use application-specific profiling to continuously monitor their applications running on the fleet.

Application-specific profiling is generic and can target any specific set of machines. For example, we can use it to profile a set of machines deployed with the newest kernel version. We can also limit the profiling duration to a small time period, such as the application's running time. It's useful for batch jobs running on data centers, such as MapReduce, because it facilitates collecting, aggregating, and exploring profiles collected from hundreds or thousands of their workers.

Reliability analysis

To conduct continuous profiling on datacenter machines serving real traffic, extremely low overhead is paramount, so we sample in both time and machine dimensions. Sampling introduces variation, so we must measure and understand how sampling affects the profiles' quality. But the nature of datacenter workloads makes this difficult; their behavior is continually changing. There's no direct way to measure the datacenter applications' profiles' representativeness. Instead, we use two indirect methods to evaluate their soundness. First, we study the stability of aggregated profiles themselves using several different metrics. Second, we correlate profiles with the performance data from other sources to cross-validate both.

Stability of profiles

We use a single metric, entropy, to measure a given profile's variation. In short, entropy is a measure of the uncertainty associated with a random variable, which in this case is profile samples. The entropy H of a profile is defined as

H(W) = -\sum_{i=1}^{n} p(x_i) \log(p(x_i))

where n is the total number of entries in the profile and p(x) is the fraction of profile samples on the entry x.5
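Restating the definition in code, the following sketch computes the entropy of a profile represented as a map from entry name to sample count. It is illustrative only, not GWP's implementation.

import math

def profile_entropy(samples):
    """Entropy of a profile given as {entry_name: sample_count}."""
    total = sum(samples.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in samples.values():
        if count > 0:
            p = count / total            # fraction of samples on this entry
            entropy -= p * math.log(p)   # natural log; another base only rescales
    return entropy

# Example: a flat profile has higher entropy than a concentrated one.
flat = {"app_a": 100, "app_b": 100, "app_c": 100, "app_d": 100}
skewed = {"app_a": 370, "app_b": 10, "app_c": 10, "app_d": 10}
print(profile_entropy(flat), profile_entropy(skewed))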

In general, a high entropy implies a flat profile with many samples. A low entropy usually results from a small number of samples or from most samples being concentrated in a few entries. We're not concerned with entropy itself. Because entropy is like the signature of a profile, measuring its inherent variation, it should be stable between representative profiles.

Entropy doesn't account for differences between entry names. For example, a function profile with x percent on foo and y percent on bar has the same entropy as a profile with y percent on foo and x percent on bar. So, when we need to identify the changes on the same entries between profiles, we calculate the Manhattan distance of two profiles by adding the absolute percentage differences between the top k entries, defined as

M(X, Y) = \sum_{i=1}^{k} | p_x(x_i) - p_y(x_i) |

where X and Y are two profiles, k is the number of top entries to count, and p_y(x_i) is 0 when x_i is not in Y. Essentially, the Manhattan distance is a simplified version of relative entropy between two profiles.

Profiles' entropy. First, we compare the entropies of application-level profiles, where samples are broken down by individual applications. Figure 2a shows an example of such an application-level profile. Figure 4 shows daily application-level profiles' entropies for a series of dates, together with the total number of profile samples collected for each date. Unless specified, we used CPU cycles in the study, and our conclusions also apply to the other types. As the graph shows, the entropy of daily application-level profiles is stable between dates, and it usually falls into a small interval. The correlation between the number of samples and the profile's entropy is loose. Once the number of samples reaches some threshold, it doesn't necessarily lead to a lower entropy, partly because GWP sometimes samples more machines than necessary for daily application-level profiles. This is because users frequently must drill down to specific profiles with additional filters on certain tags, such as application names, which select only a small fraction of all profiles collected.

Figure 4. The number of samples and the entropy of daily application-level profiles. The primary y-axis (bars) is the total number of profile samples. The secondary y-axis (line) is the entropy of the daily application-level profile.
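The Manhattan distance defined above is equally direct to compute. The sketch below is illustrative only; the value of k and the normalization details are assumptions rather than GWP's exact choices.

def manhattan_distance(profile_x, profile_y, k=100):
    """Manhattan distance between two profiles given as {entry: sample_count}.

    Profiles are first normalized to fractions; the distance sums the absolute
    differences over the top-k entries of profile_x, treating entries missing
    from profile_y as zero.
    """
    total_x = sum(profile_x.values()) or 1
    total_y = sum(profile_y.values()) or 1
    top_entries = sorted(profile_x, key=profile_x.get, reverse=True)[:k]
    return sum(
        abs(profile_x[e] / total_x - profile_y.get(e, 0) / total_y)
        for e in top_entries
    )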

Figure 5. Function-level profiles. The number of samples and the entropy for a single application (a). The correlation between the number of samples and the entropy for all per-machine profiles (b).


We can conduct a similar analysis on an application's function-level profile (for example, the application in Figure 2b). The result, shown in Figure 5a, is from an application whose workload is fairly stable when aggregating from many clients. Its entropy is actually more stable. It's interesting to analyze how entropy changes among machines for an application's function-level profiles. Unlike the aggregated profiles across machines, an application's per-machine profiles can vary greatly in the number of samples. Figure 5b plots the relationship between the number of samples per machine and the entropy of function-level profiles. As expected, when the total number of samples is small, the profile's entropy is also small (limited by the maximum possible uncertainty). But once the threshold is reached at the maximum number of samples, the entropy becomes stable. We can observe two clusters in the graph; some entropies are concentrated between 5.5 and 6, and the others fall between 4.5 and 5. The application's two behavioral states can explain the two clusters. We've seen various clustering patterns on different applications.

The Manhattan distance between profiles. We use the Manhattan distance to study the variation between profiles considering the changes on entry names, where a smaller distance implies less variation. Figure 6a illustrates the Manhattan distance between the daily application-level profiles for a series of dates. The results from the Manhattan distances for both application-level and function-level profiles are similar to the results with entropy.

In Figure 6b, we plot the Manhattan distance for several profile types, leading to two observations:

- In general, memory and thread profiles have smaller distances, and their variations appear less correlated with the other profiles.
- Server CPU time profiles correlate with HPM profiles of cycles and instructions in terms of variations, which could imply that those variations resulted naturally from external causes, such as workload changes.

To further understand the correlation between the Manhattan distance and the number of samples, we randomly pick a subset of machines from a specific machine set and then compute the Manhattan distance of the selected subset's profile against the whole set's profile. We could use a power function's trend line to capture the change in the Manhattan distance over the number of samples. The trend line roughly approximates a square-root relationship between the distance and the number of samples,

M(X) = C / \sqrt{N(X)}

where N(X) is the total number of samples in a profile and C is a constant that depends on the profile type.

Derived metrics. We can also indirectly evaluate the profiles' stability by computing some derived metrics from multiple profiles. For example, we can derive CPI from HPM profiles containing cycles and retired instructions. Figure 7 shows that the derived CPI is stable across dates. Not surprisingly, the daily aggregated profiles' CPI falls into a small interval between 1.7 and 1.8 for those days.

Comparing with other sources

Beyond measuring profiles' stability across dates, we also cross-validate the profiles with performance and utilization data from other Google sources. One example is the utilization data that the data center's monitoring system collects. Unlike GWP, the monitoring system collects data from all machines in the data center, but at a coarser granularity, such as overall CPU utilization. Its CPU utilization data, in terms of core-seconds, matches the measurement from GWP's CPU cycles profile with the following formula:

CoreSeconds = Cycles \times SamplingRate_{machine} \times SamplingPeriod / CPUFrequency_{average}
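Both checks reduce to small computations over aggregated samples, as the following sketch illustrates. The parameter names and the exact scaling conventions are illustrative assumptions rather than GWP's configuration.

def derived_cpi(cycle_count, instruction_count):
    """CPI derived from aggregated cycle and retired-instruction profiles."""
    return cycle_count / instruction_count

def core_seconds_from_profile(cycle_count, machine_sampling_rate,
                              sampling_period, avg_cpu_frequency_hz):
    """Estimate core-seconds from GWP cycle profiles, per the formula above.

    The reading of the factors here is an assumption: machine_sampling_rate and
    sampling_period scale the sampled machines and sampled time windows up to
    the whole fleet and full period, and dividing by the average CPU frequency
    converts cycles to seconds of core time, which can then be compared against
    the monitoring system's independently measured core-seconds.
    """
    return (cycle_count * machine_sampling_rate * sampling_period
            / avg_cpu_frequency_hz)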

Figure 6. The Manhattan distance between daily application-level profiles for various profile types (a). The correlation between the number of samples and the Manhattan distance of profiles (b). (Profile types shown: Cycles, Instr, L1_Miss, L2_Miss, CPU, Heap, Threads.)


Figure 7. The correlation between the number of samples and derived cycles per instruction (CPI).

Profile uses

In each profile, GWP records the samples of interesting events and a vector of associated information. GWP collects roughly a dozen events, such as CPU cycles, retired instructions, L1 and L2 cache misses, branch mispredictions, heap memory allocations, and lock contention time. The sample definition varies depending on the event type—it can be CPU cycles or cache misses, bytes allocated, or the sampled thread's locking time. Note that the sample must be numeric and capable of aggregation. The associated vector contains information such as application name, function name, platform, compiler version, image name, data center, kernel information, build revision, and builder's name. Assuming that the vector contains m elements, we can represent a record GWP collected as a tuple <event, sample counter, m-dimension vector>.

When aggregating, GWP lets users choose k keys from the m dimensions and groups the samples by those keys. Basically, it filters the samples by imposing one or more restrictions on the remaining (m - k) dimensions and then projects the samples onto the k key dimensions. GWP finally displays the sorted results to users, delivering answers to various performance queries with high confidence. Although not every query makes sense in practice, even a small subset of them is demonstrably informative in identifying performance issues and providing insights into computing resources in the cloud.
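In other words, aggregation is a group-by over sample records. A minimal sketch of that model follows; the record layout and tag names are invented for illustration.

from collections import defaultdict

# Each record: (event, sample_count, info), where info models the m-dimension
# vector of tags as a dict. Field names and values below are illustrative only.
records = [
    ("cycles", 120, {"application": "websearch", "function": "ParseQuery", "platform": "P1"}),
    ("cycles",  80, {"application": "websearch", "function": "ScoreDocs",  "platform": "P2"}),
    ("cycles",  40, {"application": "ads_mixer", "function": "ScoreDocs",  "platform": "P1"}),
]

def aggregate(records, event, keys, filters=None):
    """Filter records on some dimensions, then group sample counts by the chosen keys."""
    filters = filters or {}
    grouped = defaultdict(int)
    for ev, count, info in records:
        if ev != event:
            continue
        if any(info.get(dim) != val for dim, val in filters.items()):
            continue
        grouped[tuple(info.get(k) for k in keys)] += count
    return sorted(grouped.items(), key=lambda kv: kv[1], reverse=True)

# Example: hottest functions on platform P1, across all applications.
print(aggregate(records, "cycles", keys=("function",), filters={"platform": "P1"}))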

Cloud applications' performance

GWP profiles provide performance insights for cloud applications. Users can see how cloud applications are actually consuming machine resources and how the picture evolves over time. For example, Figure 2 shows cycles distributed over top executables and functions, which is useful for many aspects of designing, building, maintaining, and operating data centers. Infrastructure teams can see the big picture of how their software stacks are being used, aggregated across every application. This helps identify performance-critical components, create representative benchmark suites, and prioritize performance efforts.

At the same time, an application team can use GWP as the first stop for the application's profiles. As an always-on profiling service, GWP collects a representative sample of the application's running instances over time. Application developers are often surprised by their application's profiles when browsing GWP results. For example, Google's speech-recognition team quickly optimized a previously unknown hot function that they couldn't have easily located without the aggregated GWP results. Application teams also use GWP profiles to design, evaluate, and calibrate their load tests.

Finding the hottest shared code. Shared code is remarkably abundant. Profiling each program independently might not identify hot shared code if it's not hot in any single application, but GWP can identify routines that don't account for a significant portion of any single application yet consume the most cycles overall. For example, the GWP profiles revealed that the zlib library (www.zlib.net) accounted for nearly 5 percent of all CPU cycles consumed. That motivated an effort to optimize zlib routines and evaluate compression alternatives. In fact, some users have used GWP numbers to calculate the estimated savings of performance tuning efforts on shared functions. Not surprisingly, given the Google fleet's scale, a single-percent improvement on a core routine could potentially save significant money per year. As a result, a new informal metric, "dollar amount per performance change," has become popular among Google engineers. We're considering providing a new metric, "dollar per source line," in annotated source views.

At the same time, some users have used GWP profiles as a source of coverage information, to assess the feasibility of deprecating library functions and to uncover remaining users of functions slated for deprecation. Because of the profiles' dynamic nature, this approach might miss less common clients, but the biggest, most important callers are easy to find.

Evaluating hardware features. The low-level information GWP provides about how CPU cycles (and other machine resources) are spent is also used for early evaluation of new hardware features that datacenter operators might want to introduce. One interesting example has been to evaluate whether it would be beneficial to use a special coprocessor to accelerate floating-point computation, by looking at its percentage of all Google's computations. As another example, GWP profiles can identify the applications running on old hardware configurations and evaluate whether those configurations should be retired for efficiency.

Optimizing for application affinities

Some applications run better on a particular hardware platform due to sensitivity to architectural details, such as processor microarchitecture or cache size. It's generally very hard or impossible to predict which application will fare best on which platform. Instead, we measure an efficiency metric, CPI, for each application and platform combination. We can then improve job scheduling so that applications are scheduled on the platforms where they do best, subject to availability. The example in Table 1 shows how the total number of cycles needed to run a fixed number of instructions on a fixed machine capacity drops from 500 to 400 using preferential scheduling. Specifically, although the application NumCrunch runs just as well on Platform 1 as on Platform 2, application MemBench does poorly on Platform 2 because of the smaller L2 cache. Thus, the scheduler should preferentially assign MemBench to Platform 1.

Table 1. Platform affinity example.

Random assignment of instructions (CPI in brackets):
              Platform 1    Platform 2
NumCrunch     100 (1)       100 (1)
MemBench      100 (1)       100 (2)
Total cycles  200           300

Optimal assignment of instructions (CPI in brackets):
              Platform 1    Platform 2
NumCrunch     0 (1)         200 (1)
MemBench      200 (1)       0 (2)
Total cycles  200           200

The overall optimization process has several steps. First, we derive cycle and instruction samples from GWP profiles. Then, we compute an improved assignment table by moving instructions away from application-platform combinations with the worst relative efficiency. We use cycle and instruction samples over a fixed period of time, aggregated per job and platform. We then compute CPI and normalize it by clock rate. We can formulate finding the optimal assignment as a linear programming problem. The one unknown is Load_{ij}, the number of instruction samples of application j on platform i. The constants are:

- CPI_{ij}, the measured CPI of application j on platform i.
- TotalLoad_j, the total measured number of instruction samples of application j.
- Capacity_i, the total capacity for platform i, measured as the total number of cycle samples for platform i.

The equation is:

Minimize \sum_{i,j} CPI_{ij} \times Load_{ij}

subject to \sum_{i} Load_{ij} = TotalLoad_j for each application j, and \sum_{j} CPI_{ij} \times Load_{ij} \le Capacity_i for each platform i.
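For a small instance such as Table 1, this linear program can be solved directly with an off-the-shelf solver. The sketch below uses SciPy's linprog purely for illustration; as described next, GWP's own solver is based on simulated annealing.

import numpy as np
from scipy.optimize import linprog

# Table 1's example: 2 platforms x 2 applications.
# cpi[i][j] = measured CPI of application j on platform i.
cpi = np.array([[1.0, 1.0],    # Platform 1: NumCrunch, MemBench
                [1.0, 2.0]])   # Platform 2: NumCrunch, MemBench
total_load = np.array([200.0, 200.0])   # instruction samples per application
capacity = np.array([200.0, 300.0])     # cycle samples available per platform

n_plat, n_app = cpi.shape
# Decision variables Load[i][j], flattened row-major; objective = total cycles.
c = cpi.flatten()

# Equality: each application's instructions must all be placed somewhere.
A_eq = np.zeros((n_app, n_plat * n_app))
for j in range(n_app):
    for i in range(n_plat):
        A_eq[j, i * n_app + j] = 1.0

# Inequality: cycles consumed on each platform cannot exceed its capacity.
A_ub = np.zeros((n_plat, n_plat * n_app))
for i in range(n_plat):
    A_ub[i, i * n_app:(i + 1) * n_app] = cpi[i]

res = linprog(c, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=total_load)
print(res.x.reshape(n_plat, n_app))   # optimal Load assignment
print(res.fun)                        # minimized total cycles (400 for this example)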

We use a simulated annealing solver that approximates the optimal solution in seconds for workloads of around 100 jobs running on thousands of machines of four different platforms over one month. Although application developers have already mapped major applications to their best platform through manual assignment, we've measured 10 to 15 percent potential improvement in most cases where many jobs run on multiple platforms. Similarly, users can use GWP data to identify how to colocate multiple applications on a single machine to achieve the best throughput.6

Datacenter performance monitoring

GWP users can also use GWP queries with computing-resource-related keys, such as data center, platform, compiler, or builder, for auditing purposes. For example, when rolling out a new compiler, users inspect the versions that running applications are actually compiled with. Users can easily measure such transitions from time-based profiles. Similarly, a user can measure how soon a new hardware platform becomes active or how quickly old ones are retired. This also applies to new versions of applications being rolled out. Grouping by data center, GWP displays how the applications, CPU cycles, and machine types are distributed across different locations.

Beyond providing profile snapshots, GWP can monitor the changes between chosen profiles from two queries. The two profiles should be similar in that they must have identical keys and events, but different in one or more other dimensions. For applications, GWP usually focuses on monitoring function profiles. This is useful for two reasons:

- performance optimization normally starts at the function level, and
- function samples can reconstruct the entire call graph, showing users how the CPU cycles (and other events) are distributed in the program.

A dramatic change to hot functions in the daily profiles could trigger a finer-grained comparison, which eventually points blame to the source code revision number, compiler, or data center.

Feedback-directed optimization

Sampled profiles can also be used for feedback-directed compiler optimization (FDO), outlined in work by Chen et al.7 GWP collects such sampled profiles and offers a mechanism to extract profiles for a particular binary in a format that the compiler understands. This profile will be of higher quality than any profile derived from test inputs because it was derived from running on live data. Furthermore, as long as developers make no changes in critical program sections, we can use an aged profile to improve a freshly released binary's performance. Many Web companies have release cycles of two weeks or less, so this approach works well in practice.

Similar to load-test calibration, users also use GWP for quality assurance of profiles generated from static benchmarks. Benchmarks can represent some codes well, but not others, and users use GWP to identify which codes are ill-suited for FDO compilation.

Finally, users use GWP to estimate the quality of HPM events and microarchitectural features, such as cache latencies of various incarnations of processors with the same instruction set architecture (ISA). For example, if the same application runs on two platforms, we can compare the HPM counters and identify the relevant differences.

Besides adding more types of performance events to collect, we're now exploring more directions for using GWP profiles. These include not only user-interface enhancements but also advanced data-mining techniques to detect interesting patterns in profiles. It's also interesting to mash up GWP profiles with performance data from other sources to address complex performance problems in datacenter applications. MICRO

References
1. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
2. J. Dongarra et al., "Using PAPI for Hardware Performance Monitoring on Linux Systems," Conf. Linux Clusters: The HPC Revolution, 2001.
3. S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," Proc. 19th ACM Symp. Operating Systems Principles (SOSP 03), ACM Press, 2003, pp. 29-43.
4. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. 6th Conf. Symp. Operating System Design and Implementation (OSDI 04), Usenix Assoc., 2004, pp. 137-150.
5. S. Savari and C. Young, "Comparing and Combining Profiles," Proc. Workshop on Feedback-Directed Optimization (FDO 02), ACM Press, 2000, pp. 50-62.
6. J. Mars and R. Hundt, "Scenario Based Optimization: A Framework for Statically Enabling Online Optimizations," Proc. 2009 Int'l Symp. Code Generation and Optimization (CGO 09), IEEE CS Press, 2009, pp. 169-179.
7. D. Chen, N. Vachharajani, and R. Hundt, "Taming Hardware Event Samples for FDO Compilation," Proc. 8th Ann. IEEE/ACM Int'l Symp. Code Generation and Optimization (CGO 10), ACM Press, 2010, pp. 42-52.

Gang Ren is a senior software engineer at Google, where he's working on datacenter application performance analysis and optimization. His research interests include application performance tuning in data centers and building tools for datacenter-scale performance analysis and optimization. Ren has a PhD in computer science from the University of Illinois at Urbana-Champaign.

Eric Tune is a senior software engineer at Google, where he's working on a system that allocates computational resources and provides isolation between jobs that share machines. Tune has a PhD in computer engineering from the University of California, San Diego.

Tipp Moseley is a software engineer at Google, where he's working on datacenter-scale performance analysis. His research interests include program analysis, profiling, and optimization. Moseley has a PhD in computer science from the University of Colorado, Boulder. He is a member of ACM.

Yixin Shi is a software engineer at Google, where he's working on performance analysis tools and large-volume imagery data processing. His research interests include architectural support for securing program execution, cache design for wide-issue processors, computer architecture simulators, and large-scale data processing. Shi has a PhD in computer engineering from the University of Illinois at Chicago. He is a member of ACM.

Silvius Rus is a senior software engineer at Google, where he's working on datacenter application performance optimization through compiler and library transformations based on memory allocation and reference analysis. Rus has a PhD in computer science from Texas A&M University. He is a member of ACM.

Robert Hundt is a tech lead at Google, where he's working on compiler optimization and datacenter performance. Hundt has a Diplom Univ. in computer science from the Technical University in Munich. He is a member of IEEE and SIGPLAN.

Direct questions and comments about this article to Gang Ren, Google, 1600 Amphitheatre Pkwy., Mountain View, CA 94043; [email protected].