
BCSP 801 AOS Lab File

This document outlines five experiments to be completed as part of an advanced operating systems lab session. The experiments focus on exploring I/O behavior, kernel implications of inter-process communication, microarchitectural implications of IPC, TCP state machines, and TCP latency and bandwidth. A benchmark for analyzing I/O behavior using DTrace is described. Students will use this benchmark in the first experiment and Jupyter notebooks to integrate various tools like DTrace for performance analysis.


PRACTICAL FILE

ADVANCED OPERATING SYSTEM


SESSION 22-23

Submitted To:
Ms. Ruqayya Saifi
Assistant Professor

Submitted By:
Name: -
Branch: CSE
Sem: 08th
Roll No: -
LIST OF EXPERIMENTS

S.NO.  NAME OF THE EXPERIMENT                      DATE OF       DATE OF       PAGE  FACULTY
                                                   EXPERIMENT    SUBMISSION    NO.   SIGNATURE

1.     Getting started with kernel tracing – I/O   06/03/2023    20/03/2023    3
2.     Kernel Implications of IPC                  20/03/2023    03/04/2023    9
3.     Micro-Architectural Implications of IPC     03/04/2023    24/04/2023    15
4.     The TCP State Machine                       24/04/2023    15/05/2023    19
5.     TCP Latency and Bandwidth                   15/05/2023    29/05/2023    25

Lab 1 - Getting Started with Kernel Tracing - I/O

The goals of this lab are to:


• Introduce you to our experimental environment and DTrace.
• Have you explore user-kernel interactions via system calls and traps.
• Gain experience tracing I/O behaviour in UNIX.
• Build intuitions about the probe effect.
You will do this by using DTrace to analyse the behaviour of a potted, kernel-
intensive block-I/O benchmark.

Background: POSIX I/O system calls


POSIX defines a number of synchronous I/O APIs, including the read() and write()
system calls, which accept a file descriptor to a file or other storage object, a pointer to
a buffer to read into or write from, and a buffer length as arguments. Although they are
synchronous calls, some underlying aspects of I/O are frequently deferred; for example,
write() will store changes in the buffer cache, but they may not make it to disk for some
time unless a call is made to fsync() to force a writeback.
You may wish to read the FreeBSD read(2), write(2), and fsync(2) system-call manual
pages to learn more about the calls before proceeding with this lab. You can do this
using the man command; you may need to specify the manual section to ensure that
you get API documentation rather than program documentation; for example, man 2
write for the system call to ensure that you don’t get the write(1) program man page.

The benchmark
Our I/O benchmark is straightforward: it performs a series of read() or write() I/O
system calls in a loop using configurable buffer and total I/O sizes. Optionally, an
fsync() system call can be issued at the end of the I/O loop to ensure that buffered data
is written to disk. The benchmark samples the time before and after the I/O loop,
optionally displaying an average bandwidth. The lab bundle will build two versions
of the benchmark: io-static and io-dynamic. The former is statically linked (i.e.,
without use of the run-time linker or dynamic libraries), whereas the latter is
dynamically linked.

Compiling the benchmark


The laboratory I/O benchmark has been preinstalled onto the BeagleBone Black
(BBB) SD card image. However, you will need to build it before you can begin
work. Once you have configured the BBB so that you can log in, build the bundle:

# cd io
# make
# cd ..
Running the benchmark
Once built, you can run the benchmark binaries as follows, with command-line
arguments specifying various benchmark parameters:
# io/io-static
or:
# io/io-dynamic
The benchmark must be run in one of three operational modes: create, read, or
write, specified by flags. In addition, the target file must be specified. If you run the
io-static or io-dynamic benchmark without arguments, a small usage statement will
be printed, which will also identify the default buffer and total I/O sizes configured
for the benchmark.
In your experiments, you will need to be careful to hold most variables constant
in order to isolate the effects of specific variables; for example, you may wish to
hold the total I/O size constant as you vary the buffer size. You may wish to
experiment initially using /dev/zero – the kernel’s special device node providing an
unlimited source of zeros – but you will also want to run the benchmark against a file in
the filesystem.

Required operation flags


Specify the mode in which the benchmark should operate:

-c Create the specified file using the default (or requested) total I/O size.
-r Benchmark read() of the target file, which must already have been created.
-w Benchmark write() of the target file, which must already have been created.

Optional I/O flags


-b buffersize Specify an alternative buffer size in bytes. The total I/O size must be a
multiple of buffer size.
-t totalsize Specify an alternative total I/O size in bytes. The total I/O size must be a
multiple of buffer size.
-B Bare mode disables the normal preparatory work done by the benchmark, such as
flushing outstanding disk writes and sleeping for one second. This mode is preferred
when performing whole-program benchmarking.
-d Direct I/O mode disables use of the buffer cache by the benchmark by specifying the
O_DIRECT flag to the open() system call on the target file. When switching from

buffered mode, the first measurement using -d should be discarded, as some cached data
may still be used.
-s Synchronous mode causes the benchmark to call fsync() after writing the file to cause
all buffered writes to complete before the benchmark terminates – and in particular
before the final timestamp is taken.

Terminal output flags


The following arguments control terminal output from the benchmark; remember
that output can substantially change the performance of the system under test, and
so you should ensure that output is either entirely suppressed during tracing and
benchmarking, or that tracing and benchmarking only occurs during a period of
program execution unaffected by terminal I/O:

-q Quiet mode suppresses all terminal output from the benchmark, which is preferred
when performing whole-program benchmarking.
-v Verbose mode causes the benchmark to print additional information, such as the
time measurement, buffer size, and total I/O size.

Example benchmark commands


This command creates a default-sized data file in the /data filesystem:
# io/io-static -c iofile
This command runs a simple read() benchmark on the data file, printing additional
information about the benchmark run:
# io/io-static -v -r iofile
This command runs a simple write() benchmark on the data file, printing additional
information about the benchmark run:
# io/io-static -v -w iofile
If performing whole-program analysis using DTrace, be sure to suppress output and
run the benchmark in bare mode:
# io/io-static -B -q -r iofile
The following command disables use of the buffer cache when running a read
benchmark; be sure to discard the output of the first run of this command:
# io/io-static -d -r iofile
To better understand kernel behaviour, you may also wish to run the benchmark
against /dev/zero, a pseudodevice that returns all zeroes, and discards all writes:
# io/io-static -r /dev/zero
To get a high-level summary of execution time, including a breakdown of total wall-
clock time, time in userspace, and ‘system time’, use the UNIX time command:

# /usr/bin/time -p io/io-static -r -B -d -q iofile
real 1.31
user 0.00
sys 0.31
However, it may be desirable to use DTrace to collect more granular timestamps by
instrumenting return from execve() and entry of exit() for the program under test.
This run of io-static, reading the data file /data/iofile, bypassing the buffer cache,
and running in bare mode (i.e., without quiescing prior to the benchmark) took 1.31
seconds of wall-clock time to complete. Of this, (roughly) 0.00 seconds were spent
in userspace, and (roughly) 0.31 seconds were spent in the kernel on behalf of the
process. From this output, it is unclear where the remaining 1.00 seconds were
spent, but presumably a substantial fraction was spent blocked on (slow) SD Card
I/O. Time may also have been spent running other processes – and lower-precision
time measurement, such as provided by time, may suffer from non-trivial rounding
error.
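
For example, a minimal DTrace sketch of the execve()/exit() instrumentation suggested
above (assuming the statically linked io-static binary; substitute io-dynamic as needed)
might look like the following:

syscall::execve:return
/execname == "io-static"/
{
        /* First timestamp: the new program image has just started running. */
        self->start = timestamp;
}

syscall::exit:entry
/self->start/
{
        /* Second timestamp: the program is about to terminate. */
        printf("whole-program time: %d ns\n", timestamp - self->start);
}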

Objects on which to run the benchmark


You may wish to point the benchmark at one of two objects:

/dev/zero The zero device: an infinite source of zeroes, and also an infinite sink for
any data you write to it.
/data/iofile A writable file in a journalled UFS filesystem on the SD card. You will
need to create the file using -c before performing read() or write() benchmarks.

Jupyter
The Jupyter Notebook is a web application that allows you to create and
share documents that contain live code, equations, visualisations and
explanatory text. Uses include: data cleaning and transformation,
numerical simulation, statistical modelling, machine learning and much
more. (jupyter.org)
The laboratory work requires students to bring together a diverse range of skills and
knowledge: from shell commands, scripting languages and DTrace to knowledge of
microarchitectural features and statistical analysis. The course’s aim is to focus on
the intrinsic complexity of the subject matter and not the extrinsic complexity
resulting from integrating disparate tools and platforms. Jupyter Notebooks support
this goal by providing a unified environment for:
• Executing benchmarks.
• Measuring the performance of these benchmarks with DTrace.
• Post-processing performance measurements.
• Plotting performance measurements.
• Performing statistical analysis on performance measurements.

Further information about the Jupyter Notebooks can be found at the project’s
website: jupyter.org.

Template
The BeagleBone Black comes preinstalled with a template Jupyter Notebook, 2017-2018-l41-lab-template.ipynb.
This template is designed to give working examples of all
the features necessary to complete the first laboratory: including measuring the
performance of the benchmarks using DTrace, producing simple graphs with matplotlib
and performing basic statistics on performance measurements using pandas.
Details of working with the Jupyter Notebook template are given in the L41: Setup
guide.

Notes on using DTrace


You will want to use DTrace in two ways during this lab:
• To analyse just the I/O loop itself; or
• To analyse whole-program execution including setup, run-time linking, etc.
In the former case, it is useful to know that the clock_gettime system call is
run both immediately before, and immediately after, the I/O loop. You may wish to
bracket tracing between a return probe for the former, and an entry probe for the
latter. For example, you might wish to include the following in your DTrace scripts:
syscall::clock_gettime:return
/execname == "io-static" && !self->in_benchmark/
{ self->in_benchmark = 1; }

syscall::clock_gettime:entry
/execname == "io-static" && self->in_benchmark/
{ self->in_benchmark = 0; }

END { /* You might print summary statistics here. */ exit(0); }

Other DTrace predicates can then refer to self->in_benchmark to determine whether


the probe is occurring during the I/O loop. The benchmark is careful to set up the
runtime environment suitably before performing its first clock read, and to perform
terminal output only after the second clock read, so it is fine to leave benchmark
terminal output enabled.
In the latter case, you will wish to configure the benchmark for whole-program
analysis by disabling all output (-q) and enabling bare mode (-B). In this case, you
will simply want to match the executable name for system calls or traps using
/execname == "io-static"/ (or io-dynamic as required).

Notes on benchmark
This benchmark calculates average I/O rate for a particular run. Be sure to run the
benchmark more than once (ideally, perhaps a dozen times), and discard the first

output which may otherwise be affected by prior benchmark runs (e.g., if data is left
in the buffer cache and the benchmark is not yet in the steady state). Do be sure that
terminal I/O from the benchmark is not included in tracing or time measurements.

Experimental questions
Your lab report will compare several configurations of the benchmark, exploring
(and explaining) performance differences between them. Do ensure that your
experimental setup quiesces other activity on the system, and also use a suitable
number of benchmark runs. The following questions are with respect to the
benchmark reading a file through the buffer cache:

• Holding the total I/O size constant (16MB), how does varying I/O buffer size
affect I/O-loop performance?
• Holding the buffer size constant (16K) and varying total I/O size, how does
static vs. dynamic linking affect whole-program performance?
• At what file-size threshold does the performance difference between static and
dynamic linking fall below 5%? At what file-size threshold does the
performance difference fall below 1%?
• Consider the impact of the probe effect on your causal investigation.
For the purposes of performance graphs, plot measured performance (the dependent
variable, or Y axis) with respect to I/O bandwidth rather than literal execution time.
This will make it easier to analyse relative I/O efficiency (per unit of data) as file
and buffer sizes vary.

Lab 2 - Kernel Implications of IPC

The goals of this lab are to:


• Continue to gain experience tracing user-kernel interactions via system calls and
traps.
• Explore the performance of varying IPC models, buffer sizes, and process models.
• Gather data to support writing your first assessed lab report.
You will do this by using DTrace to analyse the behaviour of a potted, kernel-intensive
IPC benchmark.

Background: POSIX IPC objects


POSIX defines several types of Inter-Process Communication (IPC) objects, including
pipes (created using the pipe() system call) and sockets (created using the socket() and
socketpair() system calls).

Pipes are used most frequently between pairs of processes in a UNIX process pipeline:
a chain of processes started by a single command line, whose output and input file
descriptors are linked. Although pipes can be set up between unrelated processes,
the primary means of acquiring a pipe is through inheritance across fork(), meaning
that they are used between closely related processes (e.g., with a common parent
process).
Sockets are used when two processes are created in independent contexts and must later
rendezvous – e.g., via the filesystem, but also via TCP/IP. In typical use, each
endpoint process creates a socket via the socket() system call, which are then
interconnected through use of bind(), listen(), connect(), and accept(). However,
there is also a socketpair() system call that returns a pair of interconnected
endpoints in the same style as pipe() – convenient for us as we wish to compare the
two side-by-side.
Both pipes and sockets can be used to transmit ordered byte streams: a sequence of bytes
sent via one file descriptor that will be received reliably on the other without loss or
reordering. As with file I/O, the read() and write() system calls can be used to read and write
data on file descriptors for pipes and sockets. It is useful to know that these system calls
are permitted to return partial reads and partial writes: i.e., a buffer of some size (e.g.,
1k) might be passed as an argument, but only a subset of the requested bytes may be
received or sent, with the actual size returned via the system call’s return value. This
may happen if the in-kernel buffers for the IPC object are too small for the full amount,
or if non-blocking I/O is enabled. When analysing traces of IPC behaviour, it is
important to consider both the size of the buffer passed and the number of bytes returned
in evaluating the behaviour of the system call.

You may wish to read the FreeBSD pipe(2) and socketpair(2) manual pages to learn
more about these APIs before proceeding with the lab.
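
As a concrete illustration, the following minimal DTrace sketch (assuming the statically
linked ipc-static benchmark described below; adjust the execname as needed) shows the
distribution of byte counts actually returned by read() and write(), making partial
transfers visible:

syscall::read:return,
syscall::write:return
/execname == "ipc-static" && arg0 > 0/
{
        /* arg0 on a return probe is the system call's return value: bytes transferred. */
        @bytes[probefunc] = quantize(arg0);
}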

The benchmark
As with our earlier I/O benchmark, the IPC benchmark is straightforward: it sets up a
pair of IPC endpoints referencing a shared pipe or socket, and then performs a series of
write() and read() system calls on the file descriptors to send (and then receive) a total
number of bytes of data. Data will be sent using a smaller userspace buffer size –
although as hinted above, there is no guarantee that a full user buffer will be sent or
received in any individual call. Also as with the I/O benchmark, there are several modes
of operation: sending and receiving within a single thread, a pair of threads in the same
process, or between two threads in two different processes.
The benchmark will set up any necessary IPC objects, threads, and processes, sample
the start time using the clock_gettime() system call, perform the IPC loop (perhaps split
over two threads), and then sample the finish time using the clock_gettime() system call.
Optionally, both the average bandwidth across the IPC object, and also more verbose
information about the benchmark configuration, may be displayed. Both statically and
dynamically linked versions of the binary are provided: ipc-static and ipc-dynamic.

Compiling the benchmark


The laboratory IPC benchmark has been preinstalled onto the BeagleBone Black (BBB)
SD card image. However, you will need to build it before you can begin work. Once
you have configured the BBB so that you can log in (see L41: Lab Setup), you can build
the benchmark as follows:
# cd /data
# make -C ipc

Running the benchmark


Once built, you can run the benchmark binaries as follows, with command-line
arguments specifying various benchmark parameters:
# ipc/ipc-static
or:
# ipc/ipc-dynamic
If you run the benchmark without arguments, a small usage statement will be printed,
which will also identify the default IPC object type, IPC buffer, and total IPC sizes
configured for the benchmark. As in the prior lab, you will wish to be careful to hold
most variables constant in order to isolate the effects of specific variables. For example,
you might wish to vary the IPC object type while holding the total IPC size constant.

Required operation argument
Specify the mode in which the benchmark should operate:
1thread Run the benchmark entirely within one thread; note that, unlike other
benchmark configurations, this mode interleaves the IPC calls and must place the
file descriptors into non-blocking mode or risk deadlock. This may have observable
effects on the behaviour of the system calls with respect to partial reads or writes.
2thread Run the benchmark between two threads within one process: one as a ‘sender’
and the other as a ‘receiver’, with the sender capturing the first timestamp, and the
receiver capturing the second. System calls are blocking, meaning that if the in-
kernel buffer fills during a write(), then the sender thread will sleep; if the in-kernel
buffer empties during a read(), then the receiver thread will sleep.
2proc As with the 2thread configuration, run the benchmark in two threads – however,
those threads will be in two different processes. The benchmark creates a second
process using fork() that will run the sender. System calls in this variation are
likewise blocking.

Optional I/O flags


-b buffersize Specify an alternative userspace IPC buffer size in bytes – the amount of
memory allocated to hold to-be-sent or received IPC data. The same buffer size
will be used for both sending and receiving. The total IPC size must be a multiple
of buffer size.
-i ipctype Specify the IPC object to use in the benchmark: pipe, local, or tcp (default
pipe).
-t totalsize Specify an alternative total IPC size in bytes. The total IPC size must be a
multiple of userspace IPC buffer size.
-B Run in bare mode: disable normal quiescing activities such as using sync() to cause
the filesystem to synchronise before the IPC loop runs, and using sleep() to await
terminal-I/O quietude. This will be the more appropriate mode in which to perform
whole-program analysis but may lead to greater variance if simply analysing the
IPC loop.
-s When operating on a socket, explicitly set the in-kernel socket-buffer size to match
the userspace IPC buffer size rather than using the kernel default. Note that per-
process resource limits will prevent use of very large buffer sizes.

Terminal output flags


The following arguments control terminal output from the benchmark; remember that
output can substantially change the performance of the system under test, and you
should ensure that output is either entirely suppressed during tracing and benchmarking,

or that tracing and benchmarking only occurs during a period of program execution
unaffected by terminal I/O:
-q Quiet mode suppresses all terminal output from the benchmark, which is preferred
when performing whole-program benchmarking.
-v Verbose mode causes the benchmark to print additional information, such as the time
measurement, buffer size, and total IPC size.

Example benchmark commands


This command performs a simple IPC benchmark using a pipe and default userspace
IPC buffer and total IPC sizes within a single thread of a single process:
# ipc/ipc-static -i pipe 1thread
This command performs the same pipe benchmark, but between two threads of the same
process:
# ipc/ipc-static -i pipe 2thread
And this command does so between two processes:
# ipc/ipc-static -i pipe 2proc
This command performs a socket-pair benchmark, and requests non-default socket-
buffer sizes synchronised to a userspace IPC buffer size of 1k:
# ipc/ipc-static -i local -s -b 1024 2thread
As with the I/O benchmark, additional information can be requested using verbose
mode:
# ipc/ipc-static -v -i pipe 1thread
And, likewise, all output can be suppressed, and bare mode can be used, for whole-
program analysis:
# ipc/ipc-static -q -B -i pipe 1thread

Note on kernel configuration


By default, the kernel limits the maximum per-socket socket-buffer size that can be
configured, in order to avoid resource starvation. You will need to tune the kernel’s
default limits using the following command, run as root, prior to running benchmarks.
Note that this should be set before any benchmarks are run, whether or not they are
explicitly configuring the socket-buffer size, as the limit will also affect socket-buffer
auto-sizing.
# sysctl kern.ipc.maxsockbuf=33554432

Notes on using DTrace
On the whole, this lab will be concerned with just measuring the IPC loop, rather than
whole-program behaviour. As in the last lab, it is useful to know that the system call
clock_gettime is run both immediately before, and immediately after, the IPC loop. In
this benchmark, these events may occur in different threads or processes, as the sender
performs the initial timestamp before transmitting the first byte over IPC, and the
receiver performs the final timestamp after receiving the last byte over IPC. You may
wish to bracket tracing between a return probe for the former, and an entry probe for
the latter; see the notes from the last lab for an example.
As with the last lab, you will want to trace the key system calls of the benchmark:
read() and write(). For example, it may be sensible to inspect quantize() results for both
the execution time distributions of the system calls, and the amount of data returned by
each (via arg0 in the system-call return probe). You will also want to investigate
scheduling events using the sched provider. This provider instruments a variety of
scheduling-related behaviours, but it may be of particular use to instrument its on-cpu
and off-cpu events, which reflect threads starting and stopping execution on a CPU. You
can also instrument sleep and wakeup probes to trace where threads go to sleep waiting
for new data in an empty kernel buffer (or for space to place new data in a full buffer).
When tracing scheduling, it is useful to inspect both the process ID (pid) and thread ID
(tid) to understand where events are taking place.
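
A minimal sketch of such a script (assuming the statically linked ipc-static binary;
adjust the execname as needed) might record on-CPU time distributions and sleep counts
keyed by process and thread ID:

sched:::on-cpu
/execname == "ipc-static"/
{ self->ts = timestamp; }

sched:::off-cpu
/self->ts/
{
        /* Distribution of time spent on CPU per scheduling interval. */
        @oncpu[pid, tid] = quantize(timestamp - self->ts);
        self->ts = 0;
}

sched:::sleep
/execname == "ipc-static"/
{
        /* Sleeps typically correspond to blocking on a full or empty in-kernel buffer. */
        @sleeps[pid, tid] = count();
}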
By its very nature, the probe effect is hard to investigate, as the probe effect does, of
course, affect investigation of the effect itself! However, one simple way to approach
the problem is to analyse the results of performance benchmarking with and without
DTrace scripts running. When exploring the probe effect, it is important to consider not
just the impact on bandwidth average/variance, but also on systemic behaviour: for
example, when performing more detailed tracing, causing the runtime of the benchmark
to increase, does the number of context switches increase, or does the distribution of read()
return values change?
overhead of terminal I/O from the DTrace process – you may wish to suppress that
output during the benchmark run so that you can focus on probe overhead.

Notes on benchmark
As with the prior lab, it is important to run benchmarks more than once to collect a
distribution of values, allowing variance to be analysed. You may wish to discard the
first result in a set of benchmark runs as the system will not yet have entered its steady
state. Do be sure that terminal I/O from the benchmark is not included in tracing or time
measurements (unless that is the intent).

Experimental questions (part 1/2)
You will receive a separate handout during the next lab describing Lab Report 2;
however, this description will allow you to begin to prepare for the assignment, which
will also depend on the outcome of the next lab. Your lab report will compare several
configurations of the IPC benchmark, exploring (and explaining) performance
differences between them. Do ensure that your experimental setup suitably quiesces
other activity on the system, and also use a suitable number of benchmark runs; you
may wish to consult the FreeBSD Benchmarking Advice wiki page linked to from the
module’s reading list for other thoughts on configuring the benchmark setup. The
following questions are with respect to a fixed total IPC size with a statically linked
version of the benchmark, and refer only to IPC-loop, not whole-program, analysis.
Using 2thread and 2proc modes, explore how varying IPC model (pipes, sockets, and
sockets with -s) and IPC buffer size affect performance:

• How does increasing IPC buffer size uniformly change performance across IPC
models – and why?
• Is using multiple threads faster or slower than using multiple processes?
Graphs and tables should be used to illustrate your measurement results. Ensure that,
for each question, you present not only results, but also a causal explanation of those
results – i.e., why the behaviour in question occurs, not just that it does. For the purposes
of graphs in this assignment, use achieved bandwidth, rather than total execution time,
for the Y axis, in order to allow you to more directly visualise the effects of
configuration changes on efficiency.

Lab 3: Micro-architectural implications of IPC

The goals of this lab are to:


• Introduce hardware performance counters and a sketch of the ARM Cortex A8
memory hierarchy
• Extend Lab 2 from OS effects to architecture/micro-architecture
• Gather further data for assessed Lab Report 2

Sketch of ARM Cortex A8 Memory Hierarchy

(Figure: a very rough sketch of the ARM Cortex A8 memory hierarchy – not reproduced
here.)

• Architectural refers to an ISA-level view of execution
• Micro-architectural refers to behaviours below the ISA

Hardware performance counters (1/2)


• Seems simple enough:
• Source code compiles to instructions
• Instructions are executed by the processor
• But some instructions take longer than others:
• Register-register operations are generally single-cycle (or less)
• Multiply and divide may depend on the specific numeric values
• Floating point may take quite a while
• Loads/stores cost different amounts depending on TLB/cache use

Hardware performance counters (2/2)
• Optimisation is therefore not just about reducing instruction count
• Optimisation must take into account microarchitectural effects
• TLB/cache effects tricky as they vary with memory footprint
• How can we tell when the cache overflows?
• Hardware performance counters let us directly ask the processor about
architectural and micro-architectural events
• #instructions, #memory accesses, #cache misses, DRAM traffic...

The benchmark – now with PMC


root@l41-beaglebone:/data/ipc # ./ipc-static
ipc-static [-Bqsv] [-b buffersize] [-i pipe|local] [-t totalsize] mode

Modes (pick one - default 1thread):
1thread                  IPC within a single thread
2thread                  IPC between two threads in one process
2proc                    IPC between two threads in two different processes

Optional flags:
-B                       Run in bare mode: no preparatory activities
-i pipe|local            Select pipe or socket for IPC (default: pipe)
-P l1d|l1i|l2|mem|tlb|axi
                         Enable hardware performance counters
-q                       Just run the benchmark, don't print stuff out
-s                       Set send/receive socket-buffer sizes to buffersize
-v                       Provide a verbose benchmark description
-b buffersize            Specify a buffer size (default: 131072)
-t totalsize             Specify total I/O size (default: 16777216)

• The -P argument requests profiling of load/store instructions, L1 D-cache, L1 I-cache,
L2 cache, I-TLB, D-TLB, and AXI traffic.
Example: Profile memory instructions

root@l41-beaglebone:/data/ipc # ./ipc-static -vP mem -b 1048576 -i local 1thread
Benchmark configuration:
buffersize: 1048576
totalsize: 16777216
blockcount: 16
mode: 1thread
ipctype: socket
time: 0.084140708
pmctype: mem
INSTR_EXECUTED: 25463397
CLOCK_CYCLES: 46233168
CLOCK_CYCLES/INSTR_EXECUTED: 1.815672
MEM_READ: 8699699
MEM_READ/INSTR_EXECUTED: 0.341655
MEM_READ/CLOCK_CYCLES: 0.188170
MEM_WRITE: 7815423
MEM_WRITE/INSTR_EXECUTED: 0.306928
MEM_WRITE/CLOCK_CYCLES: 0.169044
194721.45 KBytes/sec

Example: Profile memory instructions (1/2)


• The benchmark run pushed 16M of data through a socket using 1M buffers for reads
and writes
• Reasonable expectation of load and store memory footprints of roughly 16M × 2 + ε,
reflecting copies to and from kernel buffers
• Memory reads: 8,699,699
• Word size in ARMv7: 32 bits
• 8,699,699 × 4 ≈ 32M – the sum of buffer accesses in user and kernel memory

Example: Profile memory instructions (2/2)


• Could now query the L1 and L2 caches
• How many of those accesses hit in each cache, and how does that affect
performance?
• How does the L1/L2 cache miss rate relate to cycles/instruction?
• How would DTrace profiling show changed behaviour as cycles/instruction goes
up?

Experimental questions for the lab report


• Experimental questions (2/2):
• How does changing the IPC buffer size affect architectural and micro-
architectural memory behaviour – and why?

• Can we reach causal conclusions about the scalability of pipes vs. sockets from
processor performance counters?
• Remember to consider the hypotheses the experimental questions are exploring.
• Ensure that you directly consider the impact of the probe effect on your causal
investigation.

This lab session


• Use this session to continue to build experience:
• Ensure that you can use PMC to collect information about the memory
subsystem: instructions, cache behaviour, AXI behaviour
• Continue data collection for Lab Report 2
• Identify inflection points where performance trends change as a result of
architectural or micro-architectural thresholds
• Remember to use data from both Lab 2 and Lab 3 to write the lab report.
• Do ask us if you have any questions or need help.

Lab 4 - The TCP State Machine

The goals of this and the following lab are to:


• Use DTrace to investigate the actual TCP state machine and its interactions with
the network stack
• Use DTrace and DUMMYNET to investigate the effects of latency on TCP state
transitions
In this lab, we begin that investigation, which will be extended to include additional
exploration of TCP bandwidth in Lab 5.

Background: Transmission Control Protocol (TCP)


The Transmission Control Protocol (TCP) is a near-universally used protocol that
provides reliable, bi-directional, ordered byte streams over the Internet Protocol (IP)
between two communication endpoints. TCP connections are built between a pair of IP
addresses, identifying host network interfaces, and port numbers selected by applications
(or automatically by the kernel) on either endpoint – collectively, a 4-tuple. While other
models are possible, typical TCP use has one side play the role of a ‘server’, which
provides some network-reachable service on a well-known port, and the other the
‘client’, building a connection to reach that service from an ephemeral port randomly
selected by the client TCP implementation.
The BSD (and now POSIX) sockets API offers a portable and simple interface for
TCP/IP client and server programming. The server opens a socket using the socket(2)
system call, binds a well-known or previously negotiated port number using bind(2),
and performs listen(2) to begin accepting new connections, returned as additional
connected sockets from calls to accept(2). The client application similarly calls
socket(2) to open a socket, and connect(2) to connect to a target address and port
number. Once open, both sides can use system calls such as read(2), write(2), send(2),
and recv(2) to send and receive data over the connection. The close(2) system call both
initiates a connection close (if not already closed) and releases the socket – whose state
may persist for some further period to allow data to drain and prevent premature re-use
of the 4-tuple.
As discussed in lecture, TCP connections are implemented by a pair of state machine
instances, one on each communications endpoint. Once in the ESTABLISHED steady
state, data passes in each direction via segments that are acknowledged by packets
passing in the other direction. The rate of data flow is controlled by TCP’s flow-control
and congestion-control mechanisms that respectively prevent the sender from sending
more data than the receiver or network can handle. Congestion control operates in three
phases: slow start, in which use of bandwidth is rapidly ramped up to exploit available
network bandwidth either at the start of the connection or following a timeout, and two
tightly coupled phases of congestion avoidance and fast recovery as TCP discovers and
maintains a congestion window close to the available fair bandwidth limit.

TCP identifies every byte in one direction of a connection via a sequence number.
Data segments contain a starting sequence number and length, describing the range of
transmitted bytes. Acknowledgment packets contain the sequence number of the byte
that follows the last contiguous byte they are acknowledging. Acknowledgments are
piggybacked onto data segments traveling in the opposite direction to the greatest extent
possible to avoid additional packet transmissions. In slow start, TCP performance is
directly limited by latency, as the congestion window can be opened only by receiving
ACKs – which require successive round trips. These periods are referred to as latency
bound for this reason, and network latency is a critical factor in effective utilisation of path
bandwidth.

The benchmark
Our IPC benchmark also supports a tcp socket IPC type which requests use of TCP over
the loopback interface on port 10141. Use of a fixed port number makes it easy to
identify and classify experimental packets on the loopback interface using packet-
sniffing tools such as tcpdump, and also via DTrace predicates. You are advised to
minimise network activity during the running of TCP-related benchmarks, and when
using DTrace, to reduce the degree of interference both from the perspective of
analysing behaviour, and for reasons of the probe effect.

Compiling the benchmark


Labs 4 and 5 use the same IPC benchmark utilized in Labs 2 and 3. Follow instructions
present in those lab assignments to download, untar, and build the IPC benchmark.

Running the benchmark


As before, you can run the benchmark using the ipc-static and ipc-dynamic commands,
specifying various benchmark parameters. For the purposes of this benchmark, we
recommend the following configuration:
• Use ipc-static
• Use 2-thread mode
• Do not set the socket-buffer size flag
• Do not modify the total I/O size
Do ensure that, as in Labs 2 and 3, you have increased the kernel’s maximum socket-
buffer size.

IPFW and DUMMYNET


To control latency for our experimental traffic, we will employ the IPFW firewall for
packet classification, and the DUMMYNET traffic-control facility to pass packets over
simulated ‘pipes’. To configure two 1-way DUMMYNET pipes, each carrying a 10ms
one-way latency, run the following commands as root:
ipfw pipe config 1 delay 10
ipfw pipe config 2 delay 10
During your experiments, you will wish to change the simulated latency to other values,
which can be done by reconfiguring the pipes. Do this by repeating the above two
commands but with modified last parameters, which specify one-way latencies in
milliseconds (e.g., replace ‘10’ with ‘5’ in both commands). The total Round-Trip Time
(RTT) is the sum of the two latencies – i.e., 10ms in each direction comes to a total of
20ms RTT. Note that DUMMYNET is a simulation tool, and subject to limits on
granularity and precision. Next, you must assign traffic associated with the experiment,
classified by its TCP port number and presence on the loopback interface (lo0), to the
pipes to inject latency:
ipfw add 1 pipe 1 tcp from any 10141 to any via lo0
ipfw add 2 pipe 2 tcp from any to any 10141 via lo0
You should configure these firewall rules only once per boot.

Configuring the loopback MTU


Network interfaces have a configured Maximum Transmission Unit (MTU) – the size,
in bytes, of the largest packet that can be sent. For most Ethernet and Ethernet-like
interfaces, the MTU is typically 1,500 bytes, although larger ‘jumbograms’ can also be
used in LAN environments. The loopback interface provides a simulated network
interface carrying traffic for loopback addresses such as 127.0.0.1 (localhost), and
typically uses a larger (16K+) MTU. To allow our simulated results to more closely
resemble LAN or WAN traffic, run the following command as root to set the loopback-
interface MTU to 1,500 bytes after each boot:
ifconfig lo0 mtu 1500
Example benchmark command
This command instructs the IPC benchmark to perform a transfer over TCP in 2-thread
mode:
./ipc-static -v -i tcp 2thread

DTrace probes for TCP


FreeBSD’s DTrace implementation contains a number of probes pertinent to TCP,
which you may use in addition to system-call and other probes you have employed in
prior labs:
fbt::syncache_add:entry
FBT probe when a SYN packet is received for a listening socket, which will lead to a
SYN cache entry being created. The third argument (args[2]) is a pointer to a struct
tcphdr.

fbt::syncache_expand:entry
FBT probe when a TCP packet converts a pending SYN cookie or SYN cache
connection into a full TCP connection. The third argument (args[2]) is a pointer to a
struct tcphdr.
fbt::tcp_do_segment:entry
FBT probe when a TCP packet is received in the ‘steady state’. The second argument
(args[1]) is a pointer to a struct tcphdr that describes the TCP header (see RFC 793).
You will want to classify packets by port number to ensure that you are collecting data
only from the flow of interest (port 10141), and associating collected data with the right
direction of the flow. Do this by checking the TCP header fields th_sport (source port) and
th_dport (destination port) in your DTrace predicate. In addition, the fields th_seq
(sequence number in transmit direction), th_ack (ACK sequence number in return
direction), and th_win (TCP advertised window) will be of interest. The fourth argument
(args[3]) is a pointer to a struct tcpcb that describes the active connection.
fbt::tcp_state_change:entry
FBT probe that fires when a TCP state transition takes place. The first argument
(args[0]) is a pointer to a struct tcpcb that describes the active connection. The tcpcb
field t_state is the previous state of the connection. Access to the connection’s port
numbers at this probe point can be achieved by following t_inpcb->inp_inc.inc_ie, which
has fields ie_fport (foreign, or remote, port) and ie_lport (local port) for the connection.
The second argument is the new state to be assigned.
When analysing TCP states, the D array tcp_state_string can be used to convert an integer
state to a human-readable string (e.g., 0 to TCPS_CLOSED). For these probes, the port
number will be in network byte order; the D function ntohs() can be used to convert to
host byte order when printing or matching values in th_sport, th_dport, ie_lport, and
ie_fport. Note that sequence and acknowledgment numbers are cast to unsigned integers.
When analysing and graphing data, be aware that sequence numbers can (and will) wrap
due to the 32-bit sequence space.

Sample DTrace scripts


The following script prints out, for each received TCP segment beyond the initial SYN
handshake, the sequence number, ACK number, and state of the TCP connection prior
to full processing of the segment:
dtrace -n 'fbt::tcp_do_segment:entry {
    trace((unsigned int)args[1]->th_seq);
    trace((unsigned int)args[1]->th_ack);
    trace(tcp_state_string[args[3]->t_state]);
}'

Trace state transitions printing the receiving and sending port numbers for the
connection experiencing the transition:
dtrace -n 'fbt::tcp_state_change:entry {
    trace(ntohs(args[0]->t_inpcb->inp_inc.inc_ie.ie_lport));
    trace(ntohs(args[0]->t_inpcb->inp_inc.inc_ie.ie_fport));
    trace(tcp_state_string[args[0]->t_state]);
    trace(tcp_state_string[args[1]]);
}'
These scripts can be extended to match flows on port 10141 in either direction as
needed.
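For example, one possible extension of the first script above, restricted to segments of
the experimental flow in either direction (using the th_sport/th_dport fields and htons()
as described earlier), might look like:

dtrace -n 'fbt::tcp_do_segment:entry
/args[1]->th_sport == htons(10141) || args[1]->th_dport == htons(10141)/
{
    trace((unsigned int)args[1]->th_seq);
    trace((unsigned int)args[1]->th_ack);
    trace(tcp_state_string[args[3]->t_state]);
}'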

Exploratory questions
These questions are intended to help you understand the TCP state machine and
behaviour of TCP with simulated latencies, and should help provide supporting
evidence for your experimental questions. However, they are just suggestions – feel free
to approach the problem differently! These questions do not need to be addressed in
your lab report.
1. Exploring the TCP state machine:
• Trace state transitions occurring for your test TCP connections.
• Using DTrace’s stack() function, determine which state transitions are
triggered by packets received over the network (e.g., passing via tcp_input()) vs.
those that are triggered by local system calls; one possible script for this is
sketched after this list.
2. Baseline benchmark performance analysis:
• As you vary one-way latency between 0ms and 40ms, with 5ms intervals, what
is the net effect on performance?
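
A minimal sketch of one way to approach the stack()-based exploration above, counting
state transitions for the experimental flow keyed by the previous state, the new state,
and the kernel call stack that triggered them (the aggregation is printed when the script
exits), might be:

dtrace -n 'fbt::tcp_state_change:entry
/args[0]->t_inpcb->inp_inc.inc_ie.ie_lport == htons(10141) ||
 args[0]->t_inpcb->inp_inc.inc_ie.ie_fport == htons(10141)/
{
    /* Count transitions by (previous state, new state, kernel call stack). */
    @transitions[tcp_state_string[args[0]->t_state],
        tcp_state_string[args[1]], stack()] = count();
}'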

Experimental questions (part 1)


These questions form the first part of your lab report spanning Labs 4 and 5. As
described above, configure the IPC benchmark to use TCP in 2thread mode. When
exploring TCP state-machine behaviour, use whole-program analysis. When exploring
the effects of latency on performance, use only I/O-loop analysis. We recommend the
use of GraphViz as a state-machine plotting tool, which with suitable scripting will
allow diagrams to be automatically generated from measurements.

• Plot an effective (i.e., as measured) TCP state-transition diagram for the two
directions of a single TCP connection: states will be nodes, and transitions will be
edges. Where state transitions diverge between the two directions, be sure to label
edges indicating ‘client’ vs. ‘server’.

• Extend the diagram to indicate, for each edge, the TCP header flags of the received
packet triggering the transition, or the local system call (or other event – e.g., timer)
that triggers the transition.
• Compare the graphs you have drawn with the TCP state diagram in RFC 793.
• Using DUMMYNET, explore the effects of simulated latency at 5ms intervals
between 0ms and 40ms. What observations can we make about state-machine
transitions as latency increases?
In Lab 5, we will extend our analysis using knowledge of TCP’s congestion-control
model, illustrating behaviours using time–sequence-number diagrams. Be sure, in your
lab report, to describe any apparent simulation or probe effects.

Lab 5 - TCP Latency and Bandwidth

The goals of this lab are to:


• Learn to draw TCP time-bandwidth graphs.
• Evaluate the effects of latency on effective TCP bandwidth.
• Evaluate the effects of socket-buffer size on effective TCP bandwidth.
Lab 5 builds on the investigation started in Lab 4, and uses the same TCP benchmark.

Background: TCP, latency, and bandwidth


The Transmission Control Protocol (TCP) layers a reliable, ordered, octet-stream
service over the Internet Protocol (IP). As explored in the previous lab, TCP goes
through complex setup and shutdown procedures, but (ideally) spends the majority of
its time in the ESTABLISHED state, in which stream data can be transmitted to the
remote endpoint. TCP specifies two rate-control mechanisms:
Flow control allows a receiver to limit the amount of unacknowledged data transmitted
by the remote sender, preventing receiver buffers from being overflowed. This is
implemented via window advertisements sent via acknowledgments back to the sender.
When using the sockets API, the advertised window size is based on available space in
the receive socket buffer, meaning that it will be sensitive to both the size configured
by the application (using socket options) and the rate at which the application reads data
from the buffer.
Contemporary TCP implementations auto-resize socket buffers if a specific size has not
been requested by the application, avoiding use of a constant default size that may
substantially limit overall performance (as the sender may not be able to fully fill the
bandwidth-delay product of the network). Note that this requirement for large buffer
sizes is in tension with local performance behaviour explored in prior IPC labs.
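As a purely illustrative example of the bandwidth-delay product at work: with a 20ms
round-trip time and a fixed window of at most 64KB of unacknowledged data in flight, a
sender can achieve no more than roughly 64KB / 0.02s ≈ 3MB/s, however fast the
underlying path may be, because it must wait a full round trip for acknowledgments
before it can send further data.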
Congestion control allows the sender to avoid overfilling the network path to the
receiving host, avoiding unnecessary packet loss and negative impact on other traffic
on the network (fairness). This is implemented via a variety of congestion-detection
techniques, depending on the specific algorithm and implementation – but most
frequently, interpretation of packet-loss events as a congestion indicator. When a
receiver notices a gap in the received sequence-number series, it will return a duplicate
ACK, which hints to the sender that a packet has been lost and should be retransmitted.
TCP congestion control maintains a congestion window on the sender – similar in effect
to the flow-control window, in that it limits the amount of unacknowledged data a
sender can place into the network. When a connection first opens, and also following a
timeout after significant loss, the sender will enter slow start, in which the window is

‘opened’ gradually as available bandwidth is probed. The name ‘slow start’ is initially
confusing as it is actually an exponential ramp-up. However, it is in fact slow compared
to the original TCP algorithm, which had no notion of congestion and overfilled the
network immediately!
When congestion is detected (i.e., because the congestion window has grown above the
available bandwidth, triggering a loss), a cycle of congestion recovery and avoidance is
entered. The congestion window will be reduced, and then the window will be more
slowly reopened, causing the congestion window to continually (gently) probe for
additional available bandwidth, (gently) falling back when it re-exceeds the limit. In the
event a true timeout is experienced – i.e., significant packet loss – then the congestion
window will be cut substantially and slow start will be re-entered.
The steady state of TCP is therefore responsive to the continual arrival and departure of
other flows, as well as changes in routes or path bandwidth, as it detects newly available
bandwidth, and reduces use as congestion is experienced due to over-utilisation.
TCP composes these two windows by taking the minimum: it will neither send too much
data for the remote host, nor for the network itself. One limit is directly visible in the
packets themselves (the advertised window from the receiver), but the other must either
be intuited from wire traffic, or more preferably, monitored using end-host
instrumentation. Two further informal definitions will be useful:
Latency is the time it takes a packet to get from one endpoint to another. TCP
implementations measure Round-Trip Time (RTT) in order to tune timeouts for detecting
packet loss. More subtly, RTT also limits the rate at which TCP will grow the
congestion window, especially during slow start: the window can grow only as data is
acknowledged, which requires round-trip times as ACKs are received.
Bandwidth is the throughput capacity of a link (or network path) to carry data, typically
measured in bits or bytes per second. TCP attempts to discover the available bandwidth
by iteratively expanding the congestion-control window until congestion is experienced,
and then backing off. While bandwidth and latency are notionally independent of one
another, they are entangled in TCP as the protocol relies on acknowledgments to control
the rate at which the congestion window is expanded, which is dependent upon round-
trip time.

Background: Plotting TCP connections


TCP time-bandwidth graphs plot time on a linear X axis, and bandwidth achieved by
TCP on a linear or log Y axis. Bandwidth may be usefully calculated as the change in
sequence number (i.e., bytes) over a window of time – e.g., a second. Care should be
taken to handle wrapping in the 32-bit sequence space; for shorter measurements this
might be accomplished by dropping traces from experimental runs in which sequence
numbers wrap.
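A minimal sketch of one way to gather such data at the sender, recording the highest
cumulative ACK number seen for the benchmark flow and emitting the per-second delta
(which post-processing can convert into a bandwidth estimate; the global variable names
last_ack and prev_ack are purely illustrative), might be:

fbt::tcp_do_segment:entry
/args[1]->th_dport == htons(10141)/
{
        /* ACKs arriving at the sender: record the cumulative ACK sequence number. */
        last_ack = (unsigned int)args[1]->th_ack;
}

profile:::tick-1sec
{
        /* Bytes acknowledged over the last second (ignoring 32-bit sequence wrap). */
        printf("%d %d\n", timestamp, last_ack - prev_ack);
        prev_ack = last_ack;
}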
This graph type may benefit from overlaying of additional time-based data, such as
specific annotation of trace events from the congestion-control implementation, such as

packet-loss detection or a transition out of slow start. Rather than directly overlaying,
which can be visually confusing, a better option may be to “stack” the graphs: place
them on the same X axis (time), horizontally aligned but vertically stacked. Possible
additional data points (and Y axes) might include advertised and congestion-window
sizes in bytes.

The benchmark
This lab uses the same IPC benchmark as prior labs. You will run the benchmark both
with, and without, setting the socket-buffer size, allowing you to explore the effects of
manual versus automatic socket-buffer tuning. The benchmark continues to send its data
on the accepted server-side socket on port 10141. This means that data segments
carrying benchmark data from the sender to the receiver will have a source port of
10141, and acknowledgements from the receiver to the sender will have a destination
port of 10141. Do ensure that, as in Lab 2, you have increased the kernel’s maximum
socket-buffer size.

DTrace probes
As in Lab 4, you will utilise the tcp_do_segment FBT probe to track TCP input. However,
you will now take advantage of access to the TCP control block (the tcpcb structure –
args[3] to the tcp_do_segment FBT probe) to gain additional insight into TCP behaviour.
The following fields may be of interest:

snd_wnd On the sender, the last received advertised flow-control window.
snd_cwnd On the sender, the current calculated congestion-control window.
snd_ssthresh On the sender, the current slow-start threshold – if snd_cwnd is less than or
equal to snd_ssthresh, then the connection is in slow start; otherwise, it is in congestion
avoidance.
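
For example, a minimal sketch of a script that samples these sender-side fields each
time an acknowledgment for the benchmark flow is processed (suitable for later plotting
against time) might be:

fbt::tcp_do_segment:entry
/args[1]->th_dport == htons(10141)/
{
        /* Timestamp (ns), congestion window, last advertised window, and ssthresh. */
        printf("%d %d %d %d\n", timestamp,
            args[3]->snd_cwnd, args[3]->snd_wnd, args[3]->snd_ssthresh);
}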
When writing DTrace scripts to analyse a flow in a particular direction, you can use the
port fields in the TCP header to narrow analysis to only the packets of interest. For
example, when instrumenting tcp_do_segment to analyse received acknowledgments, it
will be desirable to use a predicate of /args[1]->th_dport == htons(10141)/ to select only
packets being sent to the server port (e.g., ACKs), and the similar (but subtly different)
/args[1]->th_sport == htons(10141)/ to select only packets being sent from the server
port (e.g., data). Note that you will wish to take care to ensure that you are reading fields
from within the tcpcb at the correct end of the connection – the ‘send’ values, such as
the last received advertised window and congestion window, are properties of the server,
and not client, side of this benchmark, and hence can only be accessed from instances
of tcp_do_segment that are processing server-side packets.
To calculate the length of a segment in the probe, you can use the tcp:::send probe to
trace the ip_plength field in the ipinfo_t structure (args[2]):

typedef struct ipinfo {
uint8_t ip_ver; /* IP version (4, 6) */
uint16_t ip_plength; /* payload length */
string ip_saddr; /* source address */
string ip_daddr; /* destination address */
} ipinfo_t;
As is noted in the DTrace documentation for this probe, ip_plength is the expected IP
payload length, so no further corrections need be applied.
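
As an illustrative sketch, the following script summarises IP payload lengths for
segments sent from the server port. It assumes the usual tcp provider argument layout
(ipinfo_t as args[2], tcpinfo_t as args[4]) and that the translated tcpinfo_t port fields
are presented in host byte order, so no ntohs() is applied – both worth verifying on your
system. The aggregations are printed when the script exits.

tcp:::send
/args[4]->tcp_sport == 10141/
{
        /* Distribution and running total of IP payload bytes sent on the benchmark flow. */
        @sizes = quantize(args[2]->ip_plength);
        @total = sum(args[2]->ip_plength);
}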
Data for the two types of graphs described above is typically gathered at (or close to)
one endpoint in order to provide timeline consistency – i.e., the viewpoint of just the
client or the server, not some blend of the two time lines. As we will be measuring not
just data from packet headers, but also from the TCP implementation itself, we
recommend gathering most data close to the sender. As described here, it may seem
natural to collect information on data-carrying segments on the receiver (where they are
processed by tcp_do_segment), and to collect information on ACKs on the server (where
they are similarly processed). However, given a significant latency between client and
server, and a desire to plot points coherently on a unified real-time X axis, capturing
both at the same endpoint will make this easier.
It is similarly worth noting that tcp_do_segment’s entry FBT probe is invoked before the
ACK or data segment has been processed – so access to the tcpcb will take into account
only state prior to the packet that is now being processed, not that data itself. For
example, if the received packet is an ACK, then printed tcpcb fields will not take that
ACK into account.

Flushing the TCP host cache


FreeBSD implements a host cache that stores sampled round-trip times, bandwidth
estimates, and other information to be used across different TCP connections to the
same remote host. Normally, this feature improves performance, for example by allowing
past estimates of bandwidth to trigger a transition from slow start to steady state without
‘overshooting’ and potentially triggering significant loss. However, in the
context of this lab, carrying of state between connections reduces the independence of
our experimental runs. As such, we recommend issuing the following command (as
root) between runs of the IPC benchmark:
sysctl net.inet.tcp.hostcache.purgenow=1
This will flush all entries from the host cache, preventing information that may affect
congestion-control decisions from being carried between runs.

Experimental questions (part 2)
These questions supplement the experimental questions in the Lab 4 handout. Configure
the benchmark as follows:

• To use the statically linked version: ipc-static
• To use TCP: -i tcp
• To use a 2-thread configuration: 2thread
• To use a fixed 1MB buffer: -b 1048576
• To set (or not set) the socket-buffer size: -s
• To use only I/O-loop analysis
• To flush the TCP host cache between all benchmark runs
Explore the following experimental questions, which consider only the TCP steady
state, and not the three-way handshake or connection close:

• Plot DUMMYNET-imposed latency (0ms .. 40ms in 5ms intervals) on the X axis
and effective bandwidth on the Y axis, considering both the case where the socket-
buffer size is set versus allowing it to be auto-resized. Is the relationship between
round-trip latency and bandwidth linear? How does socket-buffer auto-resizing
help, hurt, or fail to affect performance as latency varies?
• Plot a time–bandwidth graph comparing the effects of setting the socket-buffer size
versus allowing it to be auto-resized by the stack. Stack additional graphs showing
the sender last received advertised window and congestion window on the same X
axis. How does socket-buffer auto-resizing affect overall performance, as
explained in terms of the effect of window sizes?
• Be sure, in your lab report, to describe any apparent simulation or probe effects.
Ensure that your final lab report answers all of the experimental questions in both labs
4 and 5.

