SCaLE Linux Vs Solaris Performance2014 PDF
Solaris performance
and vice-versa
Brendan Gregg
Lead Performance Engineer
SCaLE12x, February, 2014
@brendangregg
Linux vs Solaris Performance Differences
CPU scalability
CONFIGurable
real 0m18.534s
user 0m18.450s
sys 0m0.018s
systemB$ time perl -e 'for ($i = 0; $i < 100_000_000; $i++) { $s = "SCaLE12x" }'
real 0m16.253s
user 0m16.230s
sys 0m0.010s
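To put the two sets of timings side by side, here is a small Python sketch of the arithmetic. It assumes the unlabeled first output (real 18.534s) comes from the other system, called `first` here because its name is not shown above; the function names are illustrative.

```python
# Timings copied from the two `time` outputs above (seconds).
# Assumption: the unlabeled first output belongs to the other system.
timings = {
    "first":   {"real": 18.534, "user": 18.450, "sys": 0.018},
    "systemB": {"real": 16.253, "user": 16.230, "sys": 0.010},
}

def on_cpu_share(t):
    """Fraction of wall-clock time spent on-CPU: (user + sys) / real."""
    return (t["user"] + t["sys"]) / t["real"]

def slowdown(slow, fast):
    """How many times longer the slower run took, by wall-clock time."""
    return slow["real"] / fast["real"]

if __name__ == "__main__":
    for name, t in timings.items():
        print(f"{name}: {on_cpu_share(t):.1%} of real time on-CPU")
    ratio = slowdown(timings["first"], timings["systemB"])
    print(f"first run took {ratio:.2f}x as long as systemB")
```

Both runs spend over 99% of their wall-clock time on-CPU, so the roughly 14% gap is in CPU time itself rather than in waiting.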
The kernel may also control the CPU clock speed (eg, Intel
SpeedStep), and vary it for temp or power reasons
Sure, but would that happen for this simple Perl program?
Possible Differences: Kernel, cont.
During a perturbation, the kernel CPU scheduler may
migrate the thread to another CPU, which can hurt
performance (cold caches, memory locality)
Sure, but would that happen for this simple Perl program?
# dtrace -n 'profile-99 /pid == $target/ { @ = lquantize(cpu, 0, 16, 1); }' -c ...
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |                                         1
               1 |@@@@@                                    483
               2 |                                         1
               3 |@@@@@@@                                  663
               4 |                                         2
               5 |@@@                                      276
               6 |                                         0
               7 |@@@@@@                                   512
               8 |                                         1
               9 |@@@                                      288
              10 |                                         0
              11 |@@@@@@                                   576
              12 |                                         0
              13 |@@@@@                                    442
              14 |                                         2
              15 |@@@                                      308
              16 |                                         0
Yes, a lot! This shows the CPUs Perl ran on. It should stay put, but instead
runs across many. We've been fixing this in SmartOS
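On Linux, a rough userland analogue of this check can be sketched in Python by polling procfs. This is not equivalent to the DTrace one-liner: it reads the "last ran on" CPU from field 39 of /proc/<pid>/stat at a fixed interval rather than sampling on-CPU at 99 Hertz. The function names, sample count and interval are illustrative assumptions.

```python
import time
from collections import Counter

def cpu_from_stat(stat_line):
    """Return the 'processor' field (field 39) of a /proc/<pid>/stat line."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 is at index 36.
    return int(rest[36])

def sample_cpus(pid="self", samples=99, interval=0.01):
    """Poll which CPU a process last ran on; return a Counter of CPU ids."""
    counts = Counter()
    for _ in range(samples):
        with open(f"/proc/{pid}/stat") as f:
            counts[cpu_from_stat(f.read())] += 1
        time.sleep(interval)
    return counts

if __name__ == "__main__":
    # Linux only: prints one row per CPU id seen, with its sample count.
    for cpu, count in sorted(sample_cpus().items()):
        print(f"{cpu:4d} | {count}")
```

Pass the PID of the process under test instead of the default `"self"` to watch another process. A thread that stays put shows a single row.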
Kernel Myths and Realities
Myth: "The kernel gets out of the way for applications"
The only case where the kernel gets out of the way is
when your software calls halt() or shutdown()
SmartOS
Mature: Zones, ZFS, DTrace, fully pre-emptable kernel
Microstate accounting, symbols by default, CPU scalability,
MPSS, libumem, FireEngine, Crossbow, binary /proc,
process swapping
Big Differences: Linux
Up-to-date packages: latest application versions, with the latest performance fixes
Large community: weird perf issue? May be answered on stackoverflow, or discussed at meetups
More device drivers: there can be better coverage for high performing network cards or driver features
futex: fast user-space mutex
CPU scalability
CONFIGurable
SmartOS
perf tools by default, kstat, vfsstat, iostat -e, ptime -m,
CPU-only load averages, some STREAMS leftovers, ZFS
SCSI cache flush by default, different TCP slow start
default, ...
Small Differences, cont.
Small differences change frequently: a feature is added to one
kernel, then the other a year later; a difference may only exist
for a short period of time.
These small kernel differences may still make a significant
performance difference, but are classified as "small" based on
engineering cost.
System Similarities
It's important to note that many performance-related features
are roughly equivalent:
Both are Unix-like systems: processes, kernel, syscalls,
time sharing, preemption, virtual memory, paged virtual
memory, demand paging, ...
SmartOS
SMF/FMA, contracts, privileges, mdb (postmortem
debugging), gcore, crash dumps by default, ...
WARNING
Linux SmartOS
Compiler Options, cont.
Can be addressed by tuning packages in the repo
Also file bugs/patches with developers to tune Makefiles
Someone has to do this, eg, package repo staff/community
who find and do the workarounds anyway
likely()/unlikely()
These become compiler hints (__builtin_expect) for branch
prediction, and are throughout the Linux kernel:
net/ipv4/tcp_output.c, tcp_transmit_skb():
[...]
if (likely(clone_it)) {
if (unlikely(skb_cloned(skb)))
skb = pskb_copy(skb, gfp_mask);
else
skb = skb_clone(skb, gfp_mask);
if (unlikely(!skb))
return -ENOBUFS;
}
[...]
This shows the initial I/O control flow. There are optimizations/
variants for improving the HW Virt I/O path, esp for Xen.
Zones, cont.
Comparing 1 GB instances on Joyent
Max network throughput:
KVM: 400 Mbits/sec
Zones: 4.54 Gbits/sec (over 10x)
Max network IOPS:
KVM: 18,000 packets/sec
Zones: 78,000 packets/sec (over 4x)
Numbers go much higher for larger instances
http://dtrace.org/blogs/brendan/2013/01/11/virtualization-performance-zones-kvm-xen
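The "over 10x" and "over 4x" figures follow directly from the numbers above. A short Python check, taking 1 Gbit as 1000 Mbits (an assumption about the units used on the slide):

```python
# Figures from the slide, converted to common units.
kvm_throughput_mbits = 400.0
zones_throughput_mbits = 4.54 * 1000   # 4.54 Gbits/sec, taking 1 Gbit = 1000 Mbits
kvm_pps = 18_000
zones_pps = 78_000

throughput_ratio = zones_throughput_mbits / kvm_throughput_mbits
pps_ratio = zones_pps / kvm_pps

if __name__ == "__main__":
    print(f"throughput: {throughput_ratio:.2f}x")
    print(f"packets/sec: {pps_ratio:.2f}x")
```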
Zones, cont.
Performance analysis for Zones is also easy. Analyze the
applications as usual:
[Diagram: Operating System stack. Applications inside a Zone (analyze), System
Libraries, System Call Interface, Kernel (VFS, Sockets, Scheduler, ...), Metal]
Zones, cont.
[Diagram, labels only: Compared. Applications, System Libraries, System Call
Interface, Linux kernel (VFS, Sockets, Scheduler, File Systems, TCP/UDP, Volume
Managers, IP, Block Device Interface, Ethernet, Virtual Memory, Resource
Controls, Device Drivers), Virtual Devices; observability boundary (correlate);
Host Applications (QEMU), System Libraries, System Call Interface, KVM host
kernel (VFS, Sockets, Scheduler, File Systems, TCP/UDP, ...), Metal]
Zones, cont.
Linux has been learning: LXC & cgroups, but not widespread
adoption yet. Docker will likely drive adoption.
STREAMS
AT&T modular I/O subsystem
Like Unix shell pipes, but for kernel messages. Can push
modules into the stream to customize processing
Introduced (fully) in 8th Edition Research Unix, became SVr4
STREAMS, and was used by Solaris for the network TCP/IP stack
With greater demands for TCP/IP performance, the overheads
of STREAMS reduced scalability
Sun switched high-performing paths to be direct function calls
STREAMS, cont.
A cautionary tale: not good for high performance code paths
Symbols
Compilers on Linux strip symbols by default, making perf
profiler output inscrutable without the dbgsym packages
57.14% sshd libc-2.15.so [.] connect
|
--- connect
|
|--25.00%-- 0x7ff3c1cddf29
|
|--25.00%-- 0x7ff3bfe82761
| 0x7ff3bfe82b7c
|
|--25.00%-- 0x7ff3bfe82dfc
--25.00%-- [...]
What??
fbt::vfs_read:entry, fbt::vfs_write:entry
/stringof(((struct file *)arg0)->f_path.dentry->d_sb->s_type->name) == "ext4"/
{
@[execname, probefunc + 4] = quantize(arg2);
}
dtrace:::END
{
printa("\n %s %s (bytes)%@d", @);
}
# ./ext4rwsize.d
dtrace: script './ext4rwsize.d' matched 3 probes
^C
CPU ID FUNCTION:NAME
1 2 :END
[...]
vi read (bytes)
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |                                         1
             512 |@@@@@@@                                  17
            1024 |@                                        2
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         75
            8192 |                                         0
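The rows above are power-of-two buckets: each row counts values greater than or equal to its label and less than the next label. A minimal Python sketch of that bucketing for non-negative integers (negative values are left out of this sketch; the function names and sample sizes are illustrative):

```python
from collections import Counter

def quantize_bucket(value):
    """Power-of-two bucket for a non-negative integer: largest 2**k <= value."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    if value == 0:
        return 0
    return 1 << (value.bit_length() - 1)

def quantize(values):
    """Count values per power-of-two bucket, like the rows of the histogram."""
    return Counter(quantize_bucket(v) for v in values)

if __name__ == "__main__":
    # Hypothetical read sizes in bytes, for illustration only.
    sizes = [300, 600, 700, 1500, 4096, 4096, 5000]
    for bucket, count in sorted(quantize(sizes).items()):
        print(f"{bucket:8d} | {count}")
```

Read this way, the 4096 row above means 75 reads returned between 4096 and 8191 bytes.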
dtrace4linux: Example
Tracing TCP retransmits (tcpretransmit.d for 3.11.0-17):
#!/usr/sbin/dtrace -qs
fbt::tcp_retransmit_skb:entry {
this->so = (struct sock *)arg0;
this->d = (unsigned char *)&this->so->__sk_common; /* 1st is skc_daddr */
printf("%Y: retransmit to %d.%d.%d.%d, by:", walltimestamp,
this->d[0], this->d[1], this->d[2], this->d[3]);
stack(99);
}
# ./tcpretransmit.d
Tracing TCP retransmits... Ctrl-C to end.
1970 Jan 1 12:24:45: retransmit to 50.95.220.155, by:
kernel`tcp_retransmit_skb
kernel`dtrace_int3_handler+0xcc
kernel`dtrace_int3+0x3a
kernel`tcp_retransmit_skb+0x1
kernel`tcp_retransmit_timer+0x276
kernel`tcp_write_timer
kernel`tcp_write_timer_handler+0xa0
kernel`tcp_write_timer+0x6c
kernel`call_timer_fn+0x36
kernel`tcp_write_timer
kernel`run_timer_softirq+0x1fd
kernel`__do_softirq+0xf7
kernel`call_softirq+0x1c
[...]
Slide note: "that used to work..."
perf_events
In the Linux tree. perf-tools package. Can do sampling, static
and dynamic tracing, with stack traces and local variables
Often involves an enable -> collect -> dump -> analyze cycle
A powerful profiler, loaded with features (eg, libunwind stacks!)
Isn't programmable, and so has limited ability for processing data
in-kernel. Does counts.
You can post-process in user-land, but passing all event data
incurs overhead; can be Gbytes of data
perf_events: Example
Dynamic tracing of tcp_sendmsg() with size:
# perf probe --add 'tcp_sendmsg size'
[...]
# perf record -e probe:tcp_sendmsg -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.052 MB perf.data (~2252 samples) ]
# perf script
# ========
# captured on: Fri Jan 31 23:49:55 2014
# hostname : dev1
# os release : 3.13.1-ubuntu-12-opt
[...]
# ========
#
sshd 1301 [001] 502.424719: probe:tcp_sendmsg: (ffffffff81505d80) size=b0
sshd 1301 [001] 502.424814: probe:tcp_sendmsg: (ffffffff81505d80) size=40
sshd 2371 [000] 502.952590: probe:tcp_sendmsg: (ffffffff81505d80) size=27
sshd 2372 [000] 503.025023: probe:tcp_sendmsg: (ffffffff81505d80) size=3c0
sshd 2372 [001] 503.203776: probe:tcp_sendmsg: (ffffffff81505d80) size=98
sshd 2372 [001] 503.281312: probe:tcp_sendmsg: (ffffffff81505d80) size=2d0
[...]
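Post-processing this output in user-land can be as simple as the following Python sketch, which sums the size argument per process. It assumes the `size=` values are hexadecimal (as "b0" and "3c0" suggest) and that command names contain no spaces; the regular expression and function name are illustrative, not part of perf.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   sshd 1301 [001] 502.424719: probe:tcp_sendmsg: (ffffffff81505d80) size=b0
LINE_RE = re.compile(
    r"^\s*(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[\d+\]\s+[\d.]+:\s+"
    r"probe:tcp_sendmsg:.*\bsize=(?P<size>[0-9a-fA-F]+)"
)

def bytes_by_pid(lines):
    """Sum tcp_sendmsg sizes per (comm, pid); size is assumed to be hex."""
    totals = defaultdict(int)
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # skip comment lines and anything else that does not match
            totals[(m["comm"], int(m["pid"]))] += int(m["size"], 16)
    return dict(totals)

# The event lines shown above, as they would arrive from `perf script`.
SAMPLE = """\
# ========
sshd 1301 [001] 502.424719: probe:tcp_sendmsg: (ffffffff81505d80) size=b0
sshd 1301 [001] 502.424814: probe:tcp_sendmsg: (ffffffff81505d80) size=40
sshd 2371 [000] 502.952590: probe:tcp_sendmsg: (ffffffff81505d80) size=27
sshd 2372 [000] 503.025023: probe:tcp_sendmsg: (ffffffff81505d80) size=3c0
sshd 2372 [001] 503.203776: probe:tcp_sendmsg: (ffffffff81505d80) size=98
sshd 2372 [001] 503.281312: probe:tcp_sendmsg: (ffffffff81505d80) size=2d0
"""

if __name__ == "__main__":
    for (comm, pid), total in sorted(bytes_by_pid(SAMPLE.splitlines()).items()):
        print(f"{comm} {pid}: {total} bytes")
```

Every event still crosses from kernel to user space before it is summed, which is the overhead noted earlier.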
ktap
A new static/dynamic tracing tool for Linux
Lightweight, simple, based on lua. Uses bytecode for
programmable and safe tracing
Suitable for use on embedded Linux
http://www.ktap.org
Features are limited (still in development), but I've been impressed so far
In development, so I can't recommend production use yet
ktap: Example
Summarize read() syscalls by return value (size/err):
# ktap -e 's = {}; trace syscalls:sys_exit_read { s[arg2] += 1 }
trace_end { histogram(s); }'
^C
           value  ------------- Distribution ------------- count
             -11 |@@@@@@@@@@@@@@@@@@@@@@@@                 50
              18 |@@@@@@                                   13
              72 |@@                                       6
            1024 |@                                        4
               0 |                                         2
               2 |                                         2
             446 |                                         1
             515 |                                         1
              48 |                                         1
histogram of a key/value table
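The keys in this table are read() return values: positive keys are byte counts and negative keys are negated errno values, so on Linux -11 is EAGAIN. A small Python sketch for decoding them (the function name and wording of the descriptions are illustrative; errno numbering is platform-specific):

```python
import errno

def describe_read_return(ret):
    """Describe a read() return value as seen at syscall exit."""
    if ret < 0:
        # The kernel reports failure as a negated errno value.
        name = errno.errorcode.get(-ret, "unknown errno")
        return f"error {name}"
    if ret == 0:
        return "0 bytes (typically end of file)"
    return f"{ret} bytes"

if __name__ == "__main__":
    for ret in (-11, 18, 0):
        print(ret, "->", describe_read_return(ret))
```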
# more /usr/share/doc/systemtap/README.Debian
[...]
supported yet, see Debian bug #691167). To use systemtap you need to
manually install the linux-image-*-dbg and linux-header-* packages
that match your running kernel. To simplify this task you can use the
stap-prep command. Please always run this before reporting a bug.
# stap-prep
You need package linux-image-3.11.0-17-generic-dbgsym but it does not seem
to be available
Ubuntu -dbgsym packages are typically in a separate repository
Follow https://wiki.ubuntu.com/DebuggingProgramCrash to add this
repository
SystemTap: Setup, cont.
After following Ubuntu's DebuggingProgramCrash site:
# apt-get install linux-image-3.11.0-17-generic-dbgsym
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  linux-image-3.11.0-17-generic-dbgsym
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 834 MB of archives.
After this operation, 2,712 MB of additional disk space will be used.
Get:1 http://ddebs.ubuntu.com/ saucy-updates/main linux-image-3.11.0-17-generic-dbgsym amd64 3.11.0-17.31 [834 MB]
0% [1 linux-image-3.11.0-17-generic-dbgsym 1,581 kB/834 MB 0%] 215 kB/s 1h 4min 37s
but my perf issue is happening now...
In fairness:
1. The Red Hat SystemTap developers' primary focus is to
get it working on Red Hat (where they say it works fine)
probe begin
{
printf("\n%6s %6s %16s %s\n", "UID", "PID", "COMM", "PATH");
}
probe syscall.open
{
printf("%6d %6d %16s %s\n", uid(), pid(), execname(), filename);
}
Output:
# ./opensnoop.stp
UID PID COMM PATH
0 11108 sshd <unknown>
0 11108 sshd <unknown>
0 11108 sshd /lib/x86_64-linux-gnu/libwrap.so.0
0 11108 sshd /lib/x86_64-linux-gnu/libpam.so.0
0 11108 sshd /lib/x86_64-linux-gnu/libselinux.so.1
0 11108 sshd /usr/lib/x86_64-linux-gnu/libck-connector.so.0
[...]
LTTng
Profiling, static and dynamic tracing
Based on Linux Trace Toolkit (LTT), which dabbled with
dynamic tracing (DProbes) in 2001
Involves an enable -> start -> stop -> view cycle
Designed to be highly efficient
I haven't used it properly yet,
so I don't have an informed
opinion (sorry LTTng, not
your fault)
LTTng, cont.
Example sequence:
# lttng create session1
# lttng enable-event sched_process_exec -k
# lttng start
# lttng stop
# lttng view
# lttng destroy session1
DTrace, cont.
2014 is an exciting year for dynamic tracing and Linux;
one of these may reach maturity and win!
DTrace, final word
What Oracle Solaris can learn from dtrace4linux:
Dynamic tracing is crippled without source code
Oracle could give customers scripts to run, but customers
lose any practical chance of writing them themselves
[Diagram: top layer, strace layer, Kernel, tcpdump layer]
If only it were this simple...
What about the other tools and metrics that are part of Linux?
perf_events, tracepoints/kprobes/uprobes, schedstats, I/O
accounting, blktrace, etc.
Culture, cont.
Understand the system, and measure if at all possible
Hypothesis -> instrumentation -> data -> hypothesis
Use perf_events (and others once they are stable/safe)
strace(1) is intermediate, not advanced
High performance doesn't just mean hardware, system, and
config. It foremost means analysis of performance limiters.
What Both can Learn
Get better at benchmarking
Benchmarking
How Linux vs Solaris performance is often compared
Results incorrect or misleading almost 100% of the time
Get reliable benchmark results by active benchmarking:
Analyze performance of all components during the
benchmark, to identify limiters
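As a first step in that direction, a benchmark harness can at least record where its own time went. The Python sketch below (Unix only, function names illustrative) runs a workload and reports real, user and sys time for the current process, plus the on-CPU share. It covers only one component, the benchmark process itself; a full analysis would also look at disks, network and other processes while the benchmark runs.

```python
import resource
import time

def run_with_accounting(workload):
    """Run workload() and return (real, user, sys) seconds for this process."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    workload()
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (real,
            after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)

def on_cpu_fraction(real, user, sys_time):
    """Share of wall-clock time spent on-CPU; low values mean waiting."""
    return (user + sys_time) / real if real > 0 else 0.0

if __name__ == "__main__":
    def cpu_bound():
        sum(i * i for i in range(2_000_000))

    def mostly_waiting():
        time.sleep(0.2)

    for fn in (cpu_bound, mostly_waiting):
        real, user, sys_time = run_with_accounting(fn)
        share = on_cpu_fraction(real, user, sys_time)
        print(f"{fn.__name__}: real={real:.3f}s user={user:.3f}s "
              f"sys={sys_time:.3f}s on-CPU={share:.0%}")
```

A low on-CPU share says the limiter is something the process waits on, and that is where to look next. With multiple threads the share can exceed 100%, since CPU time is summed across threads.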
http://www.ktap.org/doc/tutorial.html
http://www.brendangregg.com/activebenchmarking.html
https://blogs.oracle.com/OTNGarage/entry/doing_more_with_dtrace_on
Thank You
More info:
illumos: http://illumos.org
SmartOS: http://smartos.org
DTrace: http://dtrace.org
Joyent: http://joyent.com
Systems Performance book:
http://www.brendangregg.com/sysperf.html