Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers
Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses. Furthermore, GWP introduces novel applications of its profiles, such as application-platform affinity measurements and identification of platform-specific, microarchitectural peculiarities.
Profiling: From single systems to data centers

From gprof [1] to Intel VTune (http://software.intel.com/en-us/intel-vtune), profiling has long been standard in the development process. Most profiling tools focus on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important.

Continuous, always-on profiling became more important with Morph and DCPI for Digital UNIX. Morph uses low-overhead system-wide sampling (approximately 0.3 percent) and binary rewriting to continuously evolve programs to adapt to their host architectures [2]. DCPI gathers much more robust profiles, such as precise stall times and causes, and focuses on reporting information to users [3].

OProfile, a DCPI-inspired tool, collects and reports data in much the same way for a plethora of architectures, though it offers less sophisticated analysis. GWP uses OProfile (http://oprofile.sourceforge.net) as a profile source. High-performance computing (HPC) shares many profiling challenges with cloud computing, because profilers must gather representative samples with minimal overhead over thousands of nodes. Cloud computing has additional challenges: vastly disparate applications, workloads, and machine configurations. As a result, different profiling strategies exist for HPC and cloud computing. HPCToolkit uses lightweight sampling to collect call stacks from optimized programs and compares profiles to identify scaling issues as parallelism increases [4]. Open|SpeedShop (www.openspeedshop.org) provides similar functionality. Similarly, Upshot [5] and Jumpshot [6] can analyze traces (such as MPI calls) from parallel programs but aren't suitable for continuous profiling.

References

1. S.L. Graham, P.B. Kessler, and M.K. McKusick, "Gprof: A Call Graph Execution Profiler," Proc. Symp. Compiler Construction (CC 82), ACM Press, 1982, pp. 120-126.
2. X. Zhang et al., "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 15-26.
3. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
4. N.R. Tallent et al., "Diagnosing Performance Bottlenecks in Emerging Petascale Applications," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC 09), ACM Press, 2009, no. 51.
5. V. Herrarte and E. Lusk, Studying Parallel Program Behavior with Upshot, tech. report, Argonne National Laboratory, 1991.
6. O. Zaki et al., "Toward Scalable Performance Visualization with Jumpshot," Int'l J. High Performance Computing Applications, vol. 13, no. 3, 1999, pp. 277-288.
The profile database grows by several Gbytes every day.

Profiling at this scale presents significant challenges that don't exist for a single machine. Verifying that the profiles are correct is important and challenging because the workloads are dynamic. Managing profiling overhead becomes far more important as well, as any unnecessary profiling overhead can cost millions of dollars in additional resources. Finally, making the profile data universally accessible is an additional challenge. GWP is also a cloud application, with its own scalability and performance issues.

With this volume of data, we can answer typical performance questions about datacenter applications, including the following:

- Does a particular memory allocation scheme benefit a particular class of applications?
- What is the cycles per instruction (CPI) for applications across platforms?

Additionally, we can derive higher-level data to answer more complex but interesting questions, such as which compilers were used for applications in the fleet, whether there are more 32-bit or 64-bit applications running, and how much utilization is being lost by suboptimal job scheduling.

Infrastructure

Figure 1 provides an overview of the entire GWP system.
User interfaces

For most users, GWP deploys a webserver to provide a user interface on top of the profile database. This makes it easy to access profile data and construct ad hoc queries for the traditional use of application profiles (with additional freedom to filter, group, and aggregate profiles differently).
Source annotation. The query and call graph views are useful in directing users to specific functions of interest. From there, GWP provides a source annotation view that presents the original source file with a header …

Profile data API. In addition to the webserver, we offer a data-access API to read profiles directly from the database. It's more suitable for automated analyses that must process a large amount of profile data (such as reliability studies) offline. We store both raw profiles and symbolized profiles in ProtocolBuffer formats (http://code.google.com/apis/protocolbuffers). Advanced users can access and reprocess them using their preferred programming language.
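As an illustration of the kind of offline analysis this API enables, here is a minimal Python sketch. The record layout (ProfileRecord, its fields, and its tag names) is hypothetical, a stand-in for GWP's actual ProtocolBuffer schema, which the article doesn't specify.

    from dataclasses import dataclass, field
    from typing import Dict, Iterable, List, Tuple

    @dataclass
    class ProfileRecord:
        """Hypothetical stand-in for one symbolized GWP profile record."""
        event: str   # e.g., 'cycles', 'instructions', 'heap_bytes'
        count: int   # the sample counter for this record
        tags: Dict[str, str] = field(default_factory=dict)  # e.g., {'application': ..., 'function': ...}

    def top_functions(records: Iterable[ProfileRecord], k: int) -> List[Tuple[str, int]]:
        """Offline analysis example: the k hottest functions by cycle samples."""
        totals: Dict[str, int] = {}
        for rec in records:
            if rec.event == 'cycles':
                name = rec.tags.get('function', 'unknown')
                totals[name] = totals.get(name, 0) + rec.count
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]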
Application-specific profiling

Although the default sampling rate is high enough to derive top-level profiles with high confidence, GWP might not collect enough samples for applications that consume relatively few cycles Google-wide. Increasing the overall sampling rate to cover those pro…

Reliability analysis

To conduct continuous profiling on datacenter machines serving real traffic, extremely low overhead is paramount, so we sample in both the time and machine dimensions. Sampling introduces variation, so we must measure and understand how sampling affects the profiles' quality. But the nature of datacenter workloads makes this difficult; their behavior is continually changing. There's no direct way to measure the representativeness of the datacenter applications' profiles. Instead, we use two indirect methods to evaluate their soundness. First, we study the stability of the aggregated profiles themselves using several different metrics. Second, we correlate the profiles with performance data from other sources to cross-validate both.

Figure 3. An example dynamic call graph. Function names are intentionally blurred.

Stability of profiles

We use a single metric, entropy, to measure a given profile's variation. In short, entropy is a measure of the uncertainty associated with a random variable, which in this case is the profile samples. The entropy H of a profile is defined as

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

where n is the number of entries in the profile and p(x_i) is the fraction of samples falling on entry x_i.
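For concreteness, the following sketch computes this entropy in Python, representing a profile as a plain mapping from entry name to sample count (our simplification, not GWP's actual data structure).

    import math

    def profile_entropy(samples):
        """Shannon entropy (in bits) of a profile; `samples` maps a profile
        entry (an application or function name) to its sample count."""
        total = sum(samples.values())
        entropy = 0.0
        for count in samples.values():
            if count > 0:
                p = count / total
                entropy -= p * math.log2(p)
        return entropy

    # A flat profile maximizes entropy; one dominant hot entry minimizes it.
    print(profile_entropy({'foo': 50, 'bar': 50}))  # 1.0
    print(profile_entropy({'foo': 99, 'bar': 1}))   # ~0.08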
Figure 4. The number of samples and the entropy of daily application-level profiles. The primary y-axis (bars) is the total number of profile samples. The secondary y-axis (line) is the entropy of the daily application-level profile.
… variation, it should be stable between representative profiles.

Entropy doesn't account for differences between entry names. For example, a function profile with x percent on foo and y percent on bar has the same entropy as a profile with y percent on foo and x percent on bar. So, when we need to identify changes on the same entries between profiles, we calculate the Manhattan distance of two profiles by adding the absolute percentage differences between the top k entries, defined as

M(X, Y) = \sum_{i=1}^{k} \left| p_X(x_i) - p_Y(x_i) \right|

where X and Y are two profiles, k is the number of top entries to count, and p_Y(x_i) is 0 when x_i is not in Y. Essentially, the Manhattan distance is a simplified version of the relative entropy between two profiles.
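Transcribed directly into Python, the definition might look as follows (a sketch; profiles are again mappings, here from entry name to percentage of samples).

    def manhattan_distance(x, y, k):
        """Manhattan distance over the top-k entries of profile `x`; an entry
        missing from `y` contributes its full percentage in `x`."""
        top_entries = sorted(x, key=x.get, reverse=True)[:k]
        return sum(abs(x[e] - y.get(e, 0.0)) for e in top_entries)

    # Equal entropy, maximal distance: the two hot entries swapped names.
    a = {'foo': 80.0, 'bar': 20.0}
    b = {'foo': 20.0, 'bar': 80.0}
    print(manhattan_distance(a, b, k=2))  # 120.0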
Profiles' entropy. First, we compare the entropies of application-level profiles, where samples are broken down by individual applications. Figure 2a shows an example of such an application-level profile. Figure 4 shows daily application-level profiles' entropies for a series of dates, together with the total number of profile samples collected for each date. Unless specified otherwise, we used CPU cycles in this study, and our conclusions also apply to the other profile types. As the graph shows, the entropy of daily application-level profiles is stable between dates, and it usually falls into a small interval. The correlation between the number of samples and a profile's entropy is loose: once the number of samples reaches some threshold, more samples don't necessarily lead to a lower entropy, partly because GWP sometimes samples more machines than necessary for daily application-level profiles. It does so because users frequently must drill down to specific profiles with additional filters on certain tags, such as application names, which match only a small fraction of all profiles collected.

We can conduct a similar analysis on an application's function-level profile (for example, for the application in Figure 2b). The result, shown in Figure 5a, is from an application whose workload is fairly stable when aggregated from many clients. Its entropy is actually more stable. It's also interesting to analyze how entropy changes across machines for an application's function-level profiles. Unlike the profiles aggregated across machines, an application's per-machine profiles can vary greatly in the number of samples.
Figure 5. Function-level profiles. The number of samples and the entropy for a single application (a). The correlation between the number of samples and the entropy for all per-machine profiles (b).
Figure 5b plots the relationship between the number of samples per machine and the entropy of function-level profiles. As expected, when the total number of samples is small, the profile's entropy is also small (limited by the maximum possible uncertainty). But once the number of samples reaches a threshold, the entropy becomes stable. We can observe two clusters in the graph: some entropies are concentrated between 5.5 and 6, and the others fall between 4.5 and 5. The application's two behavioral states can explain the two clusters. We've seen various clustering patterns on different applications.
The Manhattan distance between profiles. We use the Manhattan distance to study the variation between profiles while accounting for changes in entry names, where a smaller distance implies less variation. Figure 6a illustrates the Manhattan distance between the daily application-level profiles for a series of dates. The results from the Manhattan distances for both application-level and function-level profiles are similar to the results with entropy.
In Figure 6b, we plot the Manhattan distance for several profile types, leading to two observations:

- In general, memory and thread profiles have smaller distances, and their variations appear less correlated with the other profiles.
- Server CPU time profiles correlate with HPM profiles of cycles and instructions in terms of variations, which could imply that those variations resulted naturally from external causes, such as workload changes.
To further understand the correlation between the Manhattan distance and the number of samples, we randomly pick a subset of machines from a specific machine set and then compute the Manhattan distance of the selected subset's profile against the whole set's profile. We could use a power function's trend line to capture the change in the Manhattan distance over the number of samples. The trend line roughly approximates a square-root relationship between the distance and the number of samples,

M(X) = C / \sqrt{N(X)}

where N(X) is the total number of samples in a profile and C is a constant that depends on the profile type.
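The experiment can be sketched as follows, under stated assumptions: a synthetic skewed profile stands in for the whole machine set, random subsampling stands in for picking machines, and the constant C is fit by least squares. The manhattan_distance function is the one sketched earlier.

    import math
    import random

    def normalize(counts):
        """Convert raw sample counts into percentages."""
        total = sum(counts.values())
        return {k: 100.0 * v / total for k, v in counts.items()}

    def subsample_distance(population, n, k=50):
        """Distance between an n-sample random subset's profile and the whole set's."""
        drawn = random.choices(list(population), weights=list(population.values()), k=n)
        subset = {}
        for entry in drawn:
            subset[entry] = subset.get(entry, 0) + 1
        return manhattan_distance(normalize(population), normalize(subset), k)

    # Synthetic skewed profile: entry i gets weight roughly 1/(i + 1).
    population = {'fn_%d' % i: 1000 // (i + 1) for i in range(100)}
    sizes = [100, 400, 1600, 6400, 25600]
    distances = [subsample_distance(population, n) for n in sizes]

    # Least-squares fit of M = C / sqrt(N):  C = sum(M/sqrt(N)) / sum(1/N).
    c = (sum(m / math.sqrt(n) for m, n in zip(distances, sizes))
         / sum(1.0 / n for n in sizes))
    for n, m in zip(sizes, distances):
        print('N=%6d  measured M=%6.2f  fitted C/sqrt(N)=%6.2f' % (n, m, c / math.sqrt(n)))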
Derived metrics. We can also indirectly evaluate the profiles' stability by computing derived metrics from multiple profiles. For example, we can derive CPI from HPM profiles containing cycles and retired instructions. Figure 7 shows that the derived CPI is stable across dates. Not surprisingly, the daily aggregated profiles' CPI falls into a small interval, between 1.7 and 1.8, for those days.
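As a sketch, deriving CPI from the event profiles is a simple ratio of aggregated sample counts (using the hypothetical ProfileRecord from earlier, and assuming both events are sampled with the same period):

    def derived_cpi(records):
        """Cycles per instruction from HPM profiles: total cycle samples over
        total retired-instruction samples (same sampling period assumed)."""
        cycles = sum(r.count for r in records if r.event == 'cycles')
        instructions = sum(r.count for r in records if r.event == 'instructions')
        return cycles / instructions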
Comparing with other sources

Beyond measuring profiles' stability across dates, we also cross-validate the profiles against performance and utilization data from other Google sources. One example is the utilization data that the data center's monitoring system collects. Unlike GWP, the monitoring system collects data from all machines in the data center, but at a coarser granularity, such as overall CPU utilization. Its CPU utilization data, in terms of core-seconds, matches the measurement from GWP's CPU cycles profile via the following formula:

CoreSeconds = Cycles × SamplingRate_machine × SamplingPeriod / CPUFrequency_average
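In code, the conversion is direct (a sketch; the parameter names are ours, with the machine sampling rate as the fraction of machines profiled and the sampling period in seconds):

    def core_seconds(cycle_samples, machine_sampling_rate,
                     sampling_period_s, avg_cpu_frequency_hz):
        """Core-seconds implied by a GWP CPU-cycles profile, per the formula above."""
        return (cycle_samples * machine_sampling_rate * sampling_period_s
                / avg_cpu_frequency_hz)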
selected subset’s profile against the whole bytes allocated, or the sampled thread’s lock-
set’s profile. We could use a power function’s ing time. Note that the sample must be
trend line to capture the change in the Man- numeric and capable of aggregation.
hattan distance over the number of samples. The associated vector contains information
The trend line roughly approximates a square such as application name, function name,
Figure 6. The Manhattan distance between daily application-level profiles for various profile types (a). The correlation between the number of samples and the Manhattan distance of profiles (b). Profile types: Cycles, Instr, L1_Miss, L2_Miss, CPU, Heap, Threads.
Figure 7. The correlation between the number of samples and derived cycles per instruction (CPI).
The associated vector contains information such as application name, function name, platform, compiler version, image name, data center, kernel information, build revision, and builder's name. Assuming that the vector contains m elements, we can represent a record GWP collected as a tuple <event, sample counter, m-dimension vector>.

When aggregating, GWP lets users choose k keys from the m dimensions and groups the samples by those keys. Basically, it filters the samples by imposing one or more restrictions on the remaining (m - k) dimensions and then projects the samples onto the k key dimensions. GWP finally displays the sorted results to users, delivering answers to various performance queries with high confidence. Although not every query makes sense in practice, even a small subset of them is demonstrably informative in identifying performance issues and providing insights into computing resources in the cloud.
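This filter-group-project model can be sketched on the hypothetical ProfileRecord from earlier; the tag names and values here are illustrative, not GWP's actual dimension names.

    def query(records, event, group_by, **filters):
        """Filter samples by restrictions on some dimensions, project them onto
        the group-by keys, and sort by aggregated sample count."""
        totals = {}
        for rec in records:
            if rec.event != event:
                continue
            if any(rec.tags.get(dim) != value for dim, value in filters.items()):
                continue
            key = tuple(rec.tags.get(dim, 'unknown') for dim in group_by)
            totals[key] = totals.get(key, 0) + rec.count
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

    # Example: hottest functions of one (hypothetical) application on one platform.
    # query(records, event='cycles', group_by=('function',),
    #       application='app_x', platform='platform_a')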
Cloud applications' performance

GWP profiles provide performance insights for cloud applications. Users can see how cloud applications are actually consuming machine resources and how the picture evolves over time. For example, Figure 2 shows cycles distributed over the top executables and functions, which is useful for many aspects of designing, building, maintaining, and operating data centers. Infrastructure teams can see the big picture of how their software stacks are being used, aggregated across every application. This helps identify performance-critical components, create representative benchmark suites, and prioritize performance efforts.

At the same time, an application team can use GWP as the first stop for an application's profiles. As an always-on profiling service, GWP collects a representative sample of the application's running instances over time. Application developers are often surprised by their application's profiles when browsing GWP results. For example, Google's speech-recognition team quickly optimized a previously unknown hot function that they couldn't have easily located without the aggregated GWP results. Application teams also use GWP profiles to design, evaluate, and calibrate their load tests.

Finding the hottest shared code. Shared code is remarkably abundant. Profiling each program independently might not identify hot shared code if it's not hot in any single application, but GWP can identify routines that don't account for a significant portion …
Gang Ren is a senior software engineer at Google, where he's working on datacenter application performance analysis and optimization. His research interests include application performance tuning in data centers and building tools for datacenter-scale performance analysis and optimization. Ren has a PhD in computer science from the University of Illinois at Urbana-Champaign.

Eric Tune is a senior software engineer at Google, where he's working on a system that allocates computational resources and provides isolation between jobs that share machines. Tune has a PhD in computer engineering from the University of California, San Diego.

Tipp Moseley is a software engineer at Google, where he's working on datacenter-scale performance analysis. His research interests include program analysis, profiling, and optimization. Moseley has a PhD in computer science from the University of Colorado, Boulder. He is a member of ACM.

Yixin Shi is a software engineer at Google, where he's working on performance analysis tools and large-volume imagery data processing. His research interests include architectural support for securing program execution, cache design for wide-issue processors, computer architecture simulators, and large-scale data processing. Shi has a PhD in computer engineering from the University of Illinois at Chicago. He is a member of ACM.

Silvius Rus is a senior software engineer at Google, where he's working on datacenter application performance optimization through compiler and library transformations based on memory allocation and reference analysis. Rus has a PhD in computer science from Texas A&M University. He is a member of ACM.