TS-Perf: General Performance Measurement of Trusted Execution Environment and Rich Execution Environment on Intel SGX, Arm TrustZone, and RISC-V Keystone
ABSTRACT A trusted execution environment (TEE) is a new hardware security feature that is isolated from
a normal OS (i.e., rich execution environment (REE)). The TEE enables us to run a critical process, but
the behavior is invisible from the normal OS, which makes it difficult to debug and tune the performance.
In addition, the hardware/software architectures of TEEs differ among CPUs. For example, Intel SGX
allows user-mode only, whereas Arm TrustZone and RISC-V Keystone run a trusted OS. Moreover,
each TEE has its own SDK for programming. Each SDK offers its own APIs, which makes it difficult to write a
common program. These features make it difficult to compare the performance fairly between TEE and
REE on different CPUs. To obtain precise performance and behavior in TEE, we propose TS-perf, which is
a compiler-based performance measurement method. TS-perf accesses the hardware timestamp counter in
TEE as well as REE and keeps a precise log. The codes for measurement are inserted in a TEE binary by
the compiler options (i.e., profile option, constructor, and destructor). Furthermore, we utilize the separate
compilation technique, and the same benchmark binary is used for a fair comparison between TEE and
REE. The architecture of TS-perf is general and implemented for three TEE architectures (Arm TrustZone,
Intel SGX, and RISC-V Keystone). TS-perf measures the performance of GlobalPlatform’s TEE internal
APIs, matrix multiplication, memory access, and storage access. The comparisons show the difference in
performance between TEE and REE and the unusual behavior of trusted applications (TAs).
INDEX TERMS Trusted execution environment (TEE), rich execution environment (REE), performance
measurement, Arm TrustZone, Intel SGX, RISC-V Keystone.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
133520 VOLUME 9, 2021
K. Suzaki et al.: TS-Perf: General Performance Measurement of TEE and REE
many smartphones with a vendor specific trusted OS; KNOX for Samsung [19]–[21], RTOSck for Huawei [22], QSEE for Qualcomm [23], [24], etc. RISC-V Keystone also offers some choices: eyrie [14], [15], seL4 [25], etc.)
In addition, each TEE has a software development kit (SDK) for programming a TA. Unfortunately, the abstraction depends on the SDK, and APIs for TAs are not standardized (namely, there is no POSIX-like standard). These features make it difficult to write a portable program for TEE.
On the other hand, availability is an important factor for security, but current TEEs do not support adequate performance measurement tools because a TEE hides its behavior from the normal OS. Therefore, TA debugging and performance tuning are not easy. To solve this issue, some approaches have been proposed [26]–[34], but they incur high overhead or depend on special hardware. In addition, the previous approaches are not based on the same reliable time counter between TEE and REE, so the performance comparison is not accurate.
To solve these problems, we have implemented the portable GlobalPlatform TEE internal API library for different TEE architectures, but the performance details have not been measured because we lacked a suitable measurement method. In addition, we cannot confirm that the performance in TEE is the same as that in REE. This has motivated us to establish a fair performance comparison that runs the same binary in TEE and REE and measures the performance with the same time resource in both environments.
This paper proposes the precise measurement method TS-perf, which utilizes a CPU hardware timestamp counter in TEE and REE, and GCC [35] compiler options to insert the codes for measurement. TS-perf is combined with the separate compilation technique and makes it possible to run the same binary in TEE and REE to enable fair performance comparison. The performance measurement results show some unusual behaviors (e.g., strange core change and uncertain CPU load), and we confirm that they come from the difference in TEE implementations or a bug.
Contributions and Challenges:
1) General Performance measurement based on the hardware timestamp in TEE and REE: We design TS-perf, which is a compiler-based performance measurement method that can be applied in many TEE implementations. TS-perf obtains the hardware timestamp in TEE as well as in REE. The codes for measurement are inserted by the GCC compiler options (i.e., profile option, constructor, and destructor). The result is reported to the REE after the full logging process, which does not cause runtime overhead. TS-perf is implemented for three TEE architectures (i.e., Arm TrustZone, Intel SGXv2, and RISC-V Keystone).
2) Comparison of the same benchmark in TEE and REE: In many cases, different compiler options are used for TEE and REE applications and result in different binaries. For a fair comparison, we utilize the separate compilation technique, and the same object is linked to the benchmark binaries in TEE and REE. The performance is measured by the same time resource between TEE and REE using TS-perf, which enables fair performance comparison on a state-changing style TEE. The results can show the precise differences between TEE and REE on different CPUs.
3) Implementation benefits: The techniques used by TS-perf (i.e., hardware timestamp, compiler options) and the separate compilation are general techniques and can be applied to many TEE implementations. The ability is shown by the implementation for Arm Cortex-A (TrustZone), Intel x86-64 (SGX), and RISC-V U540 (Keystone).
4) TEE behavior is compared with the view of the normal OS: The behavior of the TA is monitored from the view of the normal OS and compared with the results of TS-perf. We confirm some unusual behaviors (i.e., core change and uncertain CPU load) and find an interrupt-handler bug between REE and TEE.
The remainder of this paper is organized as follows. Section II describes the background knowledge. Section III depicts the design of TS-perf and the separate compilation. Section IV describes the implementation of TS-perf, and section V shows the measured performance and behavior. Section VI describes some related topics, and section VII concludes this paper.
II. BACKGROUND
This section describes the background of the TEE architecture, time counter, common TEE programming API, and related works for performance and behavior measurement in TEE.
A. TEE ARCHITECTURE
The three TEE architectures used in this paper are described. These TEEs are implemented by changing the state of the CPU cores.
1) ARM TrustZone
Since smartphones, game machines, and set-top boxes use Arm Cortex-A TrustZone [3], [4]¹ for critical processing, it is the most popular TEE. TrustZone offers a two-world model, i.e., the secure world (i.e., TEE) and the normal world (i.e., REE). The world (namely, the state of REE and TEE) is changed by the SMC (secure monitor call) instruction. Each world has user- and supervisor-mode and runs applications on a kernel (i.e., trusted OS or normal OS). Many trusted OSes on TrustZone (e.g., QSEE [23], [24], OP-TEE [36]) offer the APIs defined by GlobalPlatform [37], [38] for TA programming. Some APIs require the help of a normal OS (e.g., secure storage).
¹ Arm Cortex-M also has a different TrustZone. This paper considers Cortex-A only.
The memory allocation for TEE is flexible, but most implementations allocate the limited memory at the boot time to keep a small trusted computing base (TCB). When an interrupt is issued, a trusted OS can handle it if the interrupt is caused by the secure world's peripheral. However, many interrupts are passed to a normal OS with a context switch because they are issued for the normal world.
2) INTEL SGX
Intel Software Guard Extensions (SGX) [5]–[8] offers a single address model of TEE, named enclave. SGX allows a TA as a part of a normal application on a host OS, and the TA runs in user-mode (ring 3) in an enclave, which is the state of TEE. An enclave is created dynamically, and the TA is loaded on it. The TA is implemented as a shared library offered by Intel SDK. When an interrupt is issued, the processing is passed to the normal OS in REE. The total memory for enclaves is defined by UEFI and reserved at power-on. The maximum size is fixed (128MB is reserved on SGX version 1, and 256MB is reserved on version 2). The memory region is encrypted with the key generated at power-on.
Intel SDK includes the enclave definition language (EDL) named edger8r, which offers glue codes for secure communications between the normal application and the TA (i.e., OCALL from the TA to the application and ECALL from the application to the TA). The glue codes check the region of the pointer and the size of the buffer. They wrap the edge/boundary and serialize/deserialize the arguments/results. The glue codes are crafted with formal verification to reduce attack surfaces. An exhaustive analysis of the EDL is described in [39].
3) RISC-V KEYSTONE
RISC-V is an open instruction set architecture (ISA) and has some TEE implementations [10]–[18]. This paper utilizes Keystone because it is open-source and runs on a real CPU (SiFive's Unleashed board). Keystone, a mixed architecture of Arm TrustZone and Intel SGX, can create a TEE dynamically as SGX, although TrustZone offers only one TEE at boot time. The TEE is named enclave as SGX but has user- and supervisor-mode as TrustZone. Each TEE has its own runtime, which works as an OS kernel. The memory for an enclave is dispatched dynamically from REE (i.e., Linux) using the physical memory protection (PMP) mechanism. The dispatched memory is sanitized and used for critical processing. The PMP also manages the state of cores and changes from REE to TEE. In the same manner as Intel SGX, Keystone offers the EDL named keyedger, and the glue codes protect the OCALL.²
² Current Keystone has not implemented ECALL yet.
B. TWO TYPES OF TIME COUNTERS
Current computers keep two types of time. One is the hardware timestamp, and the other is the system clock, which indicates the global clock (i.e., calendar clock).
1) TIMESTAMP CYCLE
The hardware timestamp counts the number of internal processor clock cycles. The timestamp counter is a special hardware resource in a CPU, and access is allowed by special instructions. For example, the Arm CPU's counter-timer physical count register CNTPCT_EL0 is allowed for the supervisor only, but the counter-timer virtual count register CNTVCT_EL0 is allowed for the user. In addition, some CPUs can change their CPU speed frequency to reduce power consumption, but the internal processor clock cycle is not affected, and the timestamp counter remains monotonic.
Table 1 shows the timestamp counters of the target architectures of this paper. They are Intel x86-64's timestamp counter (TSC) with rdtsc instruction, Arm Cortex-A's counter-timer virtual count register (CNTVCT_EL0) with mrs instruction, and RISC-V's hardware performance monitor (HPM) with rdcycle instruction. The table also includes the frequency of the hardware timestamp of the CPU used in this paper. Arm Cortex-A53 has a slower hardware timestamp than the CPU speed frequency (1,400 MHz). Thus, the resolution is not high but is sufficient to measure a function-level amount of code, as shown in Section V.
TABLE 1. Hardware timestamp counter, reading instruction, and frequency for each architecture.
2) SYSTEM CLOCK
The system clock indicates the global clock, which is used for verifying certificates and logging. In general, the source of the system clock is based on an external peripheral named real-time clock (RTC) at boot time. The RTC has a battery power supply and keeps working even if the CPU power is down.
The system clock is obtained by a system call gettimeofday() on Linux. If gettimeofday() is implemented purely, it accesses the RTC. However, much time is needed to access the external peripheral. To reduce the access overhead, the initial value of the system clock is obtained from the RTC, and the system clock is maintained by other clock resources (e.g., timestamp counter). In addition, most implementations of gettimeofday() use a virtual dynamic shared object (VDSO) and omit the context switch of a system call.
C. COMMON TEE PROGRAMMING API
In general, each TEE has an SDK for its TA programming. For example, Intel offers SGX-SDK [5], and RISC-V Keystone also offers Keystone SDK [40]. To solve this problem,
we implemented the portable library of GlobalPlatform's TEE Internal API [41] because the specification is open and widely used on smartphones.
On Arm TrustZone, the trusted OS OP-TEE offers GlobalPlatform's TEE Internal API already, and we implemented the portable library for Intel SGX and RISC-V Keystone. Intel SGX has no supervisor mode, and RISC-V Keystone offers eyrie runtime at the supervisor level, which provides limited API for a TA. GlobalPlatform's TEE Internal API includes approximately 300 APIs with many arguments that include many cipher suites. We selected notable APIs for common applications.
We divide the GlobalPlatform APIs into 2 categories: CPU-independent and CPU-dependent. CPU-independent APIs are cryptographic functions and can be implemented easily as a portable library, whereas CPU-dependent APIs are functions related to secure storage, timer, and random number. These APIs must be customized for each architecture but are also implemented as a library.
D. RELATED WORKS
The importance of TEE performance measurement has been recognized, and some methods have been proposed.
Table 2 summarizes the related works. Gjerdrum et al. [26], SGX-Perf [27], and TEEMon [29] are Intel SGX specific solutions and cannot be applied to other architectures. Gjerdrum's and SGX-Perf are early approaches and use the timestamp outside TEE. Their main target is the performance of SGX primitive functions (e.g., context switches), and they do not provide a method to measure the functions of a TA. TEEMon is a real-time performance monitoring framework that uses the timestamp inside TEE, but the overhead is reported to be high, from 5% to 17%.
TABLE 2. Comparison of related works.
TEE-Perf [28] is a general perf implementation but assumes a recorder process that occupies a core along with the perf process. The recorder process has a software counter on a shared memory, which is incremented by an infinite loop. The perf process obtains the number of the software counter via shared memory. TEE-Perf can be applied to REE, but the implementation is redundant and inaccurate (the overhead is reported to be up to 5.7× higher than that for Linux perf).
On Arm TrustZone, some trusted OSs offer measurement functions because they are based on the GlobalPlatform APIs which include time measurement (i.e., TEE_GetREETime() and TEE_GetSystemTime()). Furthermore, OP-TEE offers gprof [30], [31] to measure the performance of each function. The paper [32] customizes OP-TEE to measure the performance of the stress test benchmark STRESS-NG [42]. They are useful but are not generic enough to enable comparison to other architectures.
Some CPU vendors offer hardware assistance to obtain the performance information. Intel Processor Trace (PT) [33] and Arm CoreSight [34] are well-known mechanisms and integrated into current perf tools, but they are disabled by default on SGX and TrustZone.
TS-perf measures performance using a general hardware timestamp counter inside TEE and REE, and the results are precise. We offer the same APIs on different TEE architectures and make it possible to compare the performance between different architectures.
III. DESIGN
This section describes two design issues for TS-perf and the method of comparing REE and TEE.
A. DESIGN OF TS-PERF
TS-perf is a compiler-based performance measurement method that obtains the time from the hardware timestamp counter in TEE and REE. The time data are saved in memory during performance measurement because the writing of data involves high overhead. TS-perf writes the time data in memory to a file after the measurement.
To implement these functions, TS-perf has three challenges. The first concerns the access to the hardware timestamp counter in TEE and measuring the time before and after a function. The second is the memory allocation for logging in TEE. The third is the writing of data from TEE to a file in REE. TS-perf solves these challenges using the features of the GNU Compiler Collection (GCC) [35] (i.e., profile option, constructor, and destructor).
The means of accessing the hardware timestamp counter in TEE is dependent on each architecture. On Intel SGX, TS-perf is implemented for SGX version 2 (SGXv2) with rdtsc instruction because SGX version 1 (SGXv1) does not allow access to TSC in user-mode. SGXv2 is offered for limited CPUs only, but we choose an available machine. Accessing the timestamp counter in SGX is not our contribution, but we utilize it for fair performance comparison. Arm TrustZone and RISC-V Keystone allow access to their timestamp counters in user-mode (i.e., CNTVCT_EL0 and HPM with mrs and rdcycle instructions, respectively). The code to obtain the timestamp must be inserted on the top and bottom of a target function. Fortunately, the profile option of GCC offers this feature, and TS-perf utilizes it.
The second challenge concerns how to keep the time on memory in TEE. The memory region must be assigned before measurement because it reduces the runtime overhead. Fortunately, the constructor of GCC offers the mechanism to insert code before the main program. TS-perf utilizes the constructor to reserve memory for logging.
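The counter access discussed for the first challenge can be sketched in C as follows. This is a minimal illustration rather than the TS-perf source: the names read_timestamp and measure_ticks are hypothetical, user-mode access to each counter is assumed to be enabled by the OS or TEE runtime, and the clock_gettime() fallback for other CPUs is an assumption that is not part of the paper.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Read the per-architecture counter listed in Table 1.
 * Assumption: user-mode access is permitted on the target. */
static inline uint64_t read_timestamp(void)
{
#if defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));     /* TSC */
    return ((uint64_t)hi << 32) | lo;
#elif defined(__aarch64__)
    uint64_t v;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));   /* virtual count register */
    return v;
#elif defined(__riscv) && (__riscv_xlen == 64)
    uint64_t v;
    __asm__ volatile("rdcycle %0" : "=r"(v));           /* cycle counter */
    return v;
#else
    /* Fallback for other CPUs (illustration only, not from the paper). */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical helper: counter ticks spent in fn(arg),
 * taken at the top and bottom of the call. */
static uint64_t measure_ticks(void (*fn)(void *), void *arg)
{
    uint64_t start = read_timestamp();
    fn(arg);
    return read_timestamp() - start;
}
```

Dividing a tick difference by the counter frequency in Table 1 gives elapsed time; the frequency differs per CPU, so raw ticks are not comparable across architectures.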
The third challenge is reporting log data to REE after measurement because access to REE includes heavy overhead. The destructor of GCC enables insertion of a code after the main function, and TS-perf utilizes it. The way to save data from TEE to a file in REE depends on the TEE implementation. TS-perf follows the manner suitable for each architecture.
FIGURE 1. Separate compilation for TEE and REE benchmark.
B. DESIGN OF A METHOD FOR COMPARING REE AND TEE
TS-perf enables precise performance measurement based on the same timestamp in TEE and REE, but it is insufficient because the same binary cannot be run in TEE and REE in general. The reason is that the system call is different, and a TA must link special libraries for TEE execution. In addition, the compiler options are different in many cases. If link time optimization (LTO) is enabled, the functions may be optimized differently. To solve this problem, we utilize the separate compilation technique.
Figure 1 shows the overview. The functions for measuring are saved in an original source file and compiled to an object file (in Figure 1, memory and CPU are the measuring targets). The benchmark main and TS-perf codes are prepared for TEE and REE because they depend on these environments. The benchmark main code includes the function to call the measuring target, and the TS-perf code uses GCC profiling. TS-perf offers measuring codes for TEE and REE. Hence, TS-perf does not use the GCC original prof because of the objective of fair comparison.
The measuring target object file is linked to the benchmark main and TS-perf object and creates a benchmark binary. The same measuring target binaries are thus used in REE and TEE.
Unfortunately, this technique is not applied to the function that uses TEE architecture-dependent APIs (e.g., secure storage). In that case, the source code is written as similarly as possible and compiled with the same options.
IV. IMPLEMENTATION
TS-perf is implemented for Arm TrustZone, Intel SGX, and RISC-V Keystone. The implementation in TS-perf consists of three steps on GCC [35]. TS-perf is also implemented for REE, but this section concentrates on TEE because the implementation is almost the same and simple.
Step 1 (Compiler Phase): The profile option of GCC requires the compile flag to build an object for measuring functions. The flag is -finstrument-functions, which inserts a code at the top and the bottom of every function. These inserted codes access the timestamp counter from TEE on each architecture.
TS-perf also requires the codes for preparing the log buffer and reporting the log to REE (i.e., normal OS). These codes must be executed before or after the main part of TA execution. The GCC compiler offers __attribute__((constructor)) and __attribute__((destructor)) to insert code before and after the main function. However, the linker script for the TA is not compatible on each architecture, and it is not easy to insert arbitrary codes.
Fortunately, each TA has its own entry point which enables the insertion of arbitrary codes before and after the TA. The entry points are eapp_entry, ecall_ta_main and TA_InvokeCommandEntryPoint on RISC-V Keystone, Intel SGX, and Arm TrustZone, respectively. The code for obtaining a 64KB log buffer is inserted before the entry point. Its size is fixed for simplicity, and it is large enough to profile on each function. The code to report the log data to REE is inserted after the entry point. The implementation depends on each architecture described as follows.
Step 2 (Recorder Phase): During the TA execution, the address of the function and the timestamp counter value of each enter/exit are logged into the buffer. After the TA is finished, the buffer data must be written to the log file in REE. (Note: TS-perf runs a log-saving code in REE and writes the log file using the system call of Linux.)
In TS-perf in TEE, the __profiler_map_info and __profiler_unmap_info functions are registered in the entry point of each TEE and play an important role in logging. The __profiler_map_info function is inserted before the main function of TA by __attribute__((constructor)) and prepares the log buffer (64KB). The __profiler_unmap_info function is inserted after the main function of TA by __attribute__((destructor)) and saves the log to a file in REE (i.e., Linux).
The implementation of the __profiler_unmap_info function is different on each TEE because it needs to collaborate with Linux. In Keystone and SGX, OCALL functions are used to write the log buffer to the file in REE. On the other hand, in TrustZone (namely, OP-TEE), the log data are moved into the shared buffer. After the end of the TA, it is written to the file in REE.
In RISC-V Keystone, the entry point is the eapp_entry function, which should be marked with the EAPP_ENTRY keyword together with the __profiler_unmap_info. The __profiler_unmap_info uses OCALL functions such as open, write, and close for the log file on Linux (in Code 1).
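The compiler and recorder phases can be sketched for the REE (Linux) side as follows. This is a minimal illustration under stated assumptions, not the TS-perf implementation: the record layout, the file name tsperf.log, and the helper names ts_now, record, profiler_map_info, and profiler_unmap_info (written without the leading double underscore) are hypothetical, and ts_now falls back to clock_gettime() on CPUs other than x86-64. Only the GCC hook names __cyg_profile_func_enter and __cyg_profile_func_exit, which code built with -finstrument-functions calls, are fixed by the compiler.

```c
#define _POSIX_C_SOURCE 199309L
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NO_INSTR  __attribute__((no_instrument_function))
#define LOG_BYTES (64 * 1024)   /* fixed 64KB log buffer, as in the text */

/* Hypothetical record layout: function address, counter value, enter/exit. */
struct log_entry { uint64_t func; uint64_t ts; uint64_t is_exit; };

static struct log_entry *log_buf;
static size_t log_len, log_cap;

NO_INSTR static uint64_t ts_now(void)
{
#if defined(__x86_64__)
    return __builtin_ia32_rdtsc();          /* TSC via compiler builtin */
#else
    struct timespec t;                      /* stand-in for other CPUs */
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000ull + (uint64_t)t.tv_nsec;
#endif
}

/* Append one record in memory; entries are dropped once the buffer is full. */
NO_INSTR static void record(void *fn, uint64_t is_exit)
{
    if (log_buf && log_len < log_cap)
        log_buf[log_len++] =
            (struct log_entry){ (uint64_t)(uintptr_t)fn, ts_now(), is_exit };
}

/* Hooks called at the top and bottom of every instrumented function. */
NO_INSTR void __cyg_profile_func_enter(void *fn, void *call_site)
{ (void)call_site; record(fn, 0); }

NO_INSTR void __cyg_profile_func_exit(void *fn, void *call_site)
{ (void)call_site; record(fn, 1); }

/* Runs before main: reserve the log buffer. */
NO_INSTR __attribute__((constructor)) static void profiler_map_info(void)
{
    log_buf = malloc(LOG_BYTES);
    log_cap = log_buf ? LOG_BYTES / sizeof *log_buf : 0;
    log_len = 0;
}

/* Runs after main: write the whole log in one pass, then release it. */
NO_INSTR __attribute__((destructor)) static void profiler_unmap_info(void)
{
    if (!log_buf)
        return;
    int fd = open("tsperf.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        ssize_t n = write(fd, log_buf, log_len * sizeof *log_buf);
        (void)n;
        close(fd);
    }
    free(log_buf);
    log_buf = NULL;
}
```

Compiling the measuring target with gcc -finstrument-functions and linking it with this file should leave a binary log of enter/exit records when the process exits. In a TEE build, the file write would be replaced by the OCALL or shared-buffer path described for each architecture.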
Code 2. Saving log in REE by OCALL on RISC-V Keystone.
In Intel SGX, the __profiler_unmap_info is also registered at the SGX entry point ecall_ta_main. The OCALL functions are used in the same way as Keystone, but the arguments are not the same (in Code 2).
In Arm TrustZone, a secure OS OP-TEE manages a TA. OP-TEE has no OCALL as Keystone and SGX, and we have to write a program to share log data. We prepare the buffer both in REE and in TEE explicitly to share the log data. In REE, the shared buffer is allocated with the output flag. The buffer is conveyed to the TA by TEEC_InvokeCommand, which starts the TA in TEE (in Code 3).
Code 3. Getting Buffer between TEE and REE on OP-TEE/Arm TrustZone.
Inside the TA, the __profiler_unmap_info function is also registered by TA_InvokeCommandEntryPoint. In TEE, after the main program, the __profiler_unmap_info function is called (in Code 4). In the __profiler_unmap_info function, the log data are simply moved to the shared buffer.
When the TA terminates, the TEEC_InvokeCommand function returns in REE. The buffer is packed with the log data for each function. The log data are saved with POSIX-compliant functions such as open, write, and close in REE (in Code 5).
Step 3 (Analyzer Phase): The performance result is saved as a binary file. The analysis tool parses the data, organizes it into a readable format, and compares the figures between the different architectures.
TABLE 3. Target machines (The Intel Pentium CPU does not include hyperthreading.).
V. EVALUATION
TS-perf measured some types of benchmarking in TEE and REE on three different architectures listed in Table 3. The CPU speed frequency is fixed at 1,000 MHz by cpufreq-set to prevent an automatic change on Intel x86-64. However, we performed evaluations on 1,400 MHz Arm because the Raspberry Pi3 B+ offers 600
or 1,400 MHz frequency only. Each benchmark is measured 200 times, and the average is shown.
The evaluations aim to show (1) the accuracy and precision of performance measurement, (2) the difference in TEE implementation on different CPUs, and the difference between TEE and REE on the same CPU, and (3) the unusual behavior in TEE.
A. OVERHEAD FOR OBTAINING TIME
To show the accuracy of TS-perf, we measured the time functions of GlobalPlatform TEE Internal APIs: TEE_GetREETime() and TEE_GetSystemTime(). TEE_GetREETime() obtains the system clock from REE, and TEE_GetSystemTime() obtains the hardware timestamp in the user-mode of TEE, which is the same as that in TS-perf.
Table 4 shows the results. TEE_GetREETime() causes OCALL, and the average time is more than 15 µ-seconds on each architecture. On the other hand, the average time of TEE_GetSystemTime() is less than 0.5 µ-seconds on each architecture. Hence, the average time of TEE_GetSystemTime() is 30 times faster than that of TEE_GetREETime(). The maximum time and standard deviation in TEE_GetREETime() on the Arm TrustZone and Intel SGX were higher than those on the RISC-V Keystone. We speculate that the differences are caused by the complex hardware on Arm and Intel (e.g., cache hierarchy, branch prediction). The relative maximum time and standard deviation on TEE_GetSystemTime() are less than those on TEE_GetREETime(), but the absolute values of TEE_GetSystemTime() are shorter than those on TEE_GetREETime().
TABLE 4. Timestamp Counts and Time (µ-seconds) for TEE_GetREETime() and TEE_GetSystemTime().
The time-related functions were measured by TS-perf, which uses the hardware timestamp as TEE_GetSystemTime(). The standard deviations were low; therefore, TS-perf is stable and accurate.
B. TEE AND REE BENCHMARKS
We use three original benchmarks because existing benchmarks are not suitable for TEE. They assume input/output or system calls that are not supported in TEE. In addition, we want to show a fair performance comparison between TEE and REE. The three benchmarks are CPU, memory, and storage intensive.
1) FEATURES OF BENCHMARKS
CPU Intensive: CPU-intensive benchmarks measure 25,000,000 multiplications of integers or double float numbers. They are simple arithmetic benchmarks and are assumed to have no difference in TEE and REE. The benchmarks utilize the separate compilation technique for a fair comparison. (Note: The iteration number is determined by the average elapsed time on all architectures.)
Memory Intensive: Memory-intensive benchmarks measure 1MB memory read/write access sequentially or randomly. The benchmarks may cause performance differences when the memory is encrypted. However, cache and branch prediction may hide the performance difference. The benchmarks utilize the separate compilation technique for a fair comparison. (Note: the memory size is decided to compare all architectures.)
Storage Intensive: Storage-intensive benchmarks measure the 1MB file read or write sequential access only because the current implementations of GlobalPlatform APIs for storage (i.e., TEE_WriteObjectData() and TEE_ReadObjectData()) do not allow random access. Read or write access occurs for each 32KB unit due to the TEE buffer size. Storage depends on the different API implementations in TEE and REE. Therefore, the separate compilation technique cannot be used.
2) RESULTS OF BENCHMARKS
Table 5 summarizes the results for TEE and REE, and Figure 2 visualizes the results for the CPU- and memory-intensive benchmarks.
CPU Intensive: The results show almost the same performance for the multiplication of integers and double float numbers in TEE and REE. These results are quite natural because TEE and REE run on the same core architecture. However, each architecture has a slight difference. Arm TrustZone shows that TEE is approximately 3% slower; this impact is the highest among the three CPUs. The maximum and minimum times did not have large differences, but the differences were stable. We hypothesize that there are architectural differences, and analysis is left to future work.
Memory Intensive: The results show almost the same performance for sequential and random memory access in TEE and REE of Intel SGX and RISC-V Keystone. Arm TrustZone shows that TEE is slower, especially with 11% overhead on random access. Arm has a large impact on random memory access in TEE, and programmers should exercise caution. We expected that SGX's memory encryption mechanism would cause performance degradation, but the results do not show this feature. We analyzed further performance on Intel, and we changed the memory size from 1MB to 32MB. Figure 3 shows the results. The sequential access
performance is almost proportional to the memory size, but the random accesses are slower for TEE than for REE. Table 6 shows the detailed results. The performance degradation is not clear until 4MB. We expect the same performances on small memory to be caused by the CPU cache because Pentium J5005 has a 4MB L2 cache. We also expect the degradation in random access in TEE to be caused by memory encryption because the effects of the cache are the same in REE and TEE. The overhead for memory encryption is exposed upon large memory random access.
TABLE 5. Performance comparison between TEE and REE on Arm Cortex-A, Intel X86-64, and RISC-V U540.
FIGURE 2. Performance comparison between TEE and REE on Arm Cortex-A, Intel X86-64, and RISC-V U540.
TABLE 6. Memory access performance on SGX (cycle).
Storage Intensive: The results show the difference between TEE and REE. As expected, the results were unstable because TEE requires OCALL to save the encrypted data in REE (i.e., Linux). On TrustZone, both read and write performance in TEE showed a large difference, perhaps because implementation is complex for OP-TEE in terms of file access.
SGX and Keystone cause OCALLs, which include glue code created by the EDL. We expected that the EDL affects the performance. However, the stability is not good, and the results cannot clearly show the effect of the EDL.
FIGURE 3. Memory Access Performance on SGX.
C. COMPARING FROM THE VIEW OF REE
The behaviors of TEE benchmarks were monitored by htop on Linux (REE), which shows the load on each core. The results showed two unusual behaviors: (1) the TEE running core was changed, and (2) the core load did not remain at 100% even if a heavy benchmark was run. Table 7 summarizes the results.
TABLE 7. The view of the TA from REE.
1) CORE CHANGE
Because the core maintained a 100% CPU load, htop informed which core was used for the TEE benchmark. The CPU load category shown by htop was different in each TEE architecture: system load for TrustZone and Keystone and user load for SGX, which indicated the view of the TA from REE. In addition, htop showed that the 100% load core changed sometimes. This behavior was unusual. We confirmed that the TA's core was changed when the normal application in REE was changed by the taskset command, which can designate the running application core.
2) CPU LOAD
The htop showed the core load changed from 100% to 0% sometimes on RISC-V; this was unusual because the TA should consume the CPU until the finish, and the results of TS-perf did not show a large difference. We analyzed the code of the eyrie runtime, and the unusual behavior led us to find a bug. The bug is the treatment of handle_timer_interrupt, which shares the platform-level interrupt controller (PLIC) between Linux and eyrie. The bug omits the CPU load on a core. We posit that this result was caused by a design mismatch between TEE and REE and subsequently discuss this topic in section VI-C.
VI. DISCUSSIONS
A. APPLYING TS-PERF TO ANOTHER TEE ARCHITECTURE
TEE implementation is not limited to core sharing, e.g., Apple iPhone's Secure Enclave [46], [47]. The secure enclave is implemented on another CPU and does not cause core change. The performance information is also hidden from the normal OS. This style of TEE architecture is not related to core change and does not affect application performance in REE. In addition, the separated-CPU TEE can avoid microarchitectural vulnerabilities (e.g., Spectre [48]; the vulnerability also infects the TEE, e.g., ForeShadow [49]). However, the implementation results in a higher cost. Even if the core-sharing style TEE is used, hyperthreading technology causes vulnerabilities for side channel attacks. Disabling hyperthreading is recommended for some CPUs. Fortunately, the CPU used in this paper has no hyperthreading.
The portability of TS-perf can be preserved on separated-CPU TEE if the time measurement works and communication between REE and TEE is guaranteed. This extension is enabled by the compiler-based performance measurement method.
On the other hand, some core-shared TEE has another cryptographic accelerator, e.g., Secure Element (SE) [50] or Rambus CryptoManager [51], which work as a root of trust [52]. GlobalPlatform defines the API from core-shared TEE to SE [53]. This style hides the performance of the cryptographic accelerator, and current TS-perf cannot cover the performance measurement.
B. COVERAGE OF TS-PERF
Fair performance comparison is a fundamental issue because
These results indicate that the Linux scheduler changes the current hardware and OS have performance hiding mecha-
process’s core even if the process uses TEE. Therefore, the nisms, e.g., cache hierarchy, branch prediction, and Linux’s
current Linux scheduler does not recognize whether a process page cache for I/O. As mentioned in section V-B, the per-
uses TEE. We regard this as a next research topic for the formance degradation caused by memory encryption was not
scheduler to collaborate with TEE. easy to disclose. These performance hiding mechanisms are
TrustZone showed more unusual behavior. The TA some- effective for small access sizes and fixed patterns. In general,
times did not follow the normal application, and thus, it ran on traditional TAs have been used for cryptographic processing,
a different core from the normal application. We imagine that and the binaries were small, which can yield the effect of per-
OP-TEE changes the core when it accepts SMC instruction. formance hiding mechanisms. However, current TAs are used
This does not violate the rule of the trusted OS, but we could by machine learning, genome analysis, privacy processing,
not determine why, leaving this question to future research. etc. The codes and data are large, and the processing shows
native performance. Since performance tuning becomes more
important, TS-perf aims to assist the development of these TAs.
TS-perf is not limited to the same benchmark library
and can measure the precise performance of different binaries
using the hardware timestamp counter. For example,
TS-perf can measure a binary that is optimized for REE
or TEE. The results may show another perspective on this
difference. This topic is the subject of our future work.
C. INTEGRATED DESIGN BETWEEN REE AND TEE
As mentioned in section I, the programming and execution
environments are different between REE and TEE, which
includes hardware architecture as well as software architecture.
This style was effective on smartphones because
the target applications are limited (e.g., key management,
DRM management). However, TEE has become popular,
and many normal applications want to be executed in TEE
(e.g., machine learning and genome analysis). They require
the execution of the same normal program in TEE.
To run normal applications without customization,
SGX-LKL [54] and SCONE [55] have been developed; however,
they cannot offer complete compatibility. For example,
SGX-LKL does not support fork(). Current SCONE supports
fork() but recommends avoiding fork() because of
the performance problem.
We think that these problems are caused by the unfixed
abstraction of TEE. As mentioned in section V-C, the mismatch
between TEE and REE causes some unusual behavior.
A seamless programming style in REE and TEE is desired,
but the abstraction model and the formal verification tools
that support it are not established. Hence, TEE remains in use in many
research fields. TS-perf is a compiler-based performance
measurement method and offers a seamless programming
tool that can bridge REE and TEE.
VII. CONCLUSION
TS-perf is a general compiler-based performance measurement
method and can be applied to many TEE implementations.
TS-perf is based on the timestamp counter that
is available in REE and TEE on three architectures (Arm
Cortex-A, Intel x86-64, and RISC-V U540), and this method
enables a fair comparison between REE and TEE. To conduct
a fair comparison, we also propose to utilize separate
compilation and enable the use of the same binary in REE and
TEE. The performance results showed the similarities (arithmetic
performance) and differences (memory encryption and
storage) between REE and TEE. The TEE results were also
compared with the view from REE, and a strange core change
and an interrupt-handler bug were found.
REFERENCES
[1] S. Dambra, L. Bilge, and D. Balzarotti, “SoK: Cyber insurance—Technical challenges and a system security roadmap,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2020, pp. 293–309.
[2] Y. Shin and L. Williams, “An empirical model to predict security vulnerabilities using code complexity metrics,” in Proc. 2nd ACM-IEEE Int. Symp. Empirical Softw. Eng. Meas. (ESEM), Oct. 2008, pp. 315–317.
[3] S. Pinto and N. Santos, “Demystifying Arm TrustZone: A comprehensive survey,” ACM Comput. Surv., vol. 51, no. 6, pp. 1–36, Feb. 2019.
[4] D. Cerdeira, N. Santos, P. Fonseca, and S. Pinto, “SoK: Understanding the prevailing security vulnerabilities in TrustZone-assisted TEE systems,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2020, pp. 18–20.
[5] Intel. Intel Software Guard Extensions (Intel SGX) Developer Guide. Accessed: Sep. 25, 2021. [Online]. Available: https://software.intel.com/content/www/us/en/develop/download/intel-software-guard-extensions-intel-sgx-developer-guide.html
[6] V. Costan and S. Devadas, “Intel SGX explained,” IACR Cryptol. ePrint Arch., vol. 2016, no. 86, pp. 1–118, 2016.
[7] V. Costan, I. Lebedev, and S. Devadas, “Secure processors—Part I: Background, taxonomy for secure enclaves and Intel SGX architecture,” Found. Trends Electron. Des. Autom., vol. 11, nos. 1–2, pp. 1–248, 2017.
[8] V. Costan, I. Lebedev, and S. Devadas, “Secure processors—Part II: Intel SGX security analysis and MIT sanctum architecture,” Found. Trends Electron. Des. Autom., vol. 11, no. 3, pp. 249–361, 2017.
[9] R. Buhren, C. Werling, and J.-P. Seifert, “Insecure until proven updated: Analyzing AMD SEV’s remote attestation,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Nov. 2019, pp. 1087–1099.
[10] V. Costan, I. Lebedev, and S. Devadas, “Sanctum: Minimal hardware extensions for strong software isolation,” in USENIX Secur. Symp. (USENIX Sec), 2016, pp. 857–874.
[11] T. Bourgeat, I. Lebedev, A. Wright, S. Zhang, Arvind, and S. Devadas, “MI6: Secure enclaves in a speculative out-of-order processor,” in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, Oct. 2019, pp. 42–56.
[12] (2018). HexFive. [Online]. Available: https://hex-five.com/
[13] S. Weiser, M. Werner, F. Brasser, M. Malenko, S. Mangard, and A.-R. Sadeghi, “TIMBER-V: Tag-isolated memory bringing fine-grained enclaves to RISC-V,” in Proc. Netw. Distrib. Syst. Secur. Symp., Feb. 2019, pp. 1–16.
[14] D. Lee, D. Kohlbrenner, S. Shinde, D. Song, and K. Asanović, “Keystone: An open framework for architecting TEEs,” 2019, arXiv:1907.10119. [Online]. Available: http://arxiv.org/abs/1907.10119
[15] D. Lee, D. Kohlbrenner, S. Shinde, K. Asanović, and D. Song, “Keystone: An open framework for architecting trusted execution environments,” in Proc. 15th Eur. Conf. Comput. Syst., Apr. 2020, pp. 1–16.
[16] P. Nasahl, R. Schilling, M. Werner, and S. Mangard, “HECTOR-V: A heterogeneous CPU architecture for a secure RISC-V execution environment,” 2020, arXiv:2009.05262. [Online]. Available: http://arxiv.org/abs/2009.05262
[17] R. Bahmani, F. Brasser, G. Dessouky, P. Jauernig, M. Klimmek, A.-R. Sadeghi, and E. Stapf, “CURE: A security architecture with CUstomizable and resilient enclaves,” in Proc. USENIX Secur. Symp. (USENIX Sec), 2021, pp. 1073–1090.
[18] D. Oliveira, T. Gomes, and S. Pinto, “UTango: An open-source TEE for the Internet of Things,” 2021, arXiv:2102.03625. [Online]. Available: http://arxiv.org/abs/2102.03625
[19] Samsung. Samsung KNOX. [Online]. Available: https://www.samsungknox.com/en
[20] U. Kanonov and A. Wool, “Secure containers in Android: The Samsung KNOX case study,” in Proc. 6th Workshop Secur. Privacy Smartphones Mobile Devices, Oct. 2016, pp. 3–12.
[21] M. Dorjmyagmar, M. Kim, and H. Kim, “Security analysis of Samsung Knox,” in Proc. 19th Int. Conf. Adv. Commun. Technol. (ICACT), Feb. 2017, pp. 550–553.
[22] D. Shen, “Exploiting TrustZone on Android,” Black Hat USA, Aug. 2015.
[23] D. Rosenberg, “QSEE TrustZone kernel integer overflow vulnerability,” Black Hat USA, Aug. 2014.
[24] K. Ryan, “Hardware-backed heist: Extracting ECDSA keys from Qualcomm’s TrustZone,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Nov. 2019, pp. 181–194.
[25] Keystone-seL4. Accessed: Sep. 25, 2021. [Online]. Available: https://github.com/keystone-enclave/keystone-sel4
[26] A. T. Gjerdrum, R. Pettersen, H. D. Johansen, and D. Johansen, “Performance of trusted computing in cloud infrastructures with Intel SGX,” in Proc. 7th Int. Conf. Cloud Comput. Services Sci., 2017, pp. 668–675.
[27] N. Weichbrodt, P.-L. Aublin, and R. Kapitza, “Sgx-perf: A performance analysis tool for Intel SGX enclaves,” in Proc. 19th Int. Middleware Conf., Nov. 2018, pp. 201–213.
[28] M. Bailleu, D. Dragoti, P. Bhatotia, and C. Fetzer, “TEE-perf: A profiler for trusted execution environments,” in Proc. 49th Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. (DSN), Jun. 2019, pp. 414–421.
[29] R. Krahn, D. Dragoti, F. Gregor, D. L. Quoc, V. Schiavoni, P. Felber, C. Souza, A. Brito, and C. Fetzer, “TEEMon: A continuous performance monitoring framework for TEEs,” in Proc. 21st Int. Middleware Conf., Dec. 2020, pp. 178–192.
[30] Linaro. Gprof in OP-TEE Documentation. Accessed: Sep. 25, 2021. [Online]. Available: https://optee.readthedocs.io/en/latest/debug/gprof.html
[31] I. Opaniuk and J. Forissier, “Benchmark and profiling in OP-TEE,” Linaro Connect Budapest, Mar. 2017.
[32] J. Amacher and V. Schiavoni, “On the performance of ARM TrustZone,” in Proc. IFIP Int. Conf. Distrib. Appl. Interoperable Syst. Denmark: Springer, 2019, pp. 133–151.
[33] J. R. Blackbelt. (2013). Processor Tracing. Accessed: Sep. 25, 2021. [Online]. Available: https://software.intel.com/content/www/us/en/develop/blogs/processor-tracing.html
[34] Arm. (2010). Coresight Trace Memory Controller Technical Reference Manual. Accessed: Sep. 25, 2021. [Online]. Available: https://developer.arm.com/documentation/ddi0461/b/
[35] GNU Compiler Collection (GCC). Accessed: Sep. 25, 2021. [Online]. Available: https://gcc.gnu.org/
[36] OP-TEE.Org. OP-TEE. Accessed: Sep. 25, 2021. [Online]. Available: https://github.com/op-tee/
[37] GlobalPlatform. Accessed: Sep. 25, 2021. [Online]. Available: https://globalplatform.org
[38] GlobalPlatform API Archives. Accessed: Sep. 25, 2021. [Online]. Available: https://globalplatform.org/specs-library/
[39] J. Van Bulck, D. Oswald, E. Marin, A. Aldoseri, F. D. Garcia, and F. Piessens, “A tale of two worlds: Assessing the vulnerability of enclave shielding runtimes,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Nov. 2019, pp. 1741–1758.
[40] D. Lee. Keystone SDK. Accessed: Sep. 25, 2021. [Online]. Available: https://github.com/keystone-enclave/keystone-sdk
[41] K. Suzaki, K. Nakajima, T. Oi, and A. Tsukamoto, “Library implementation and performance analysis of GlobalPlatform TEE internal API for Intel SGX and RISC-V keystone,” in Proc. IEEE 19th Int. Conf. Trust, Secur. Privacy Comput. Commun. (TrustCom), Dec. 2020, pp. 1200–1208.
[42] (2013). Stress-NG. Accessed: Sep. 25, 2021. [Online]. Available: https://kernel.ubuntu.com/~cking/stress-ng/
[43] Raspberry Pi Foundation. Raspberry Pi 3 Model B+. Accessed: Sep. 25, 2021. [Online]. Available: https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
[44] Intel. NUC PJYH. Accessed: Sep. 25, 2021. [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/126137/intel-nuc-kit-nuc7pjyh.html
[45] SiFive. SiFive Unleashed Board. Accessed: Sep. 25, 2021. [Online]. Available: https://www.sifive.com/boards/hifive-unleashed
[46] Apple. Secure Enclave Overview. Accessed: Sep. 25, 2021. [Online]. Available: https://support.apple.com/en-am/guide/security/sec59b0b31ff/web
[47] T. Mandt, M. Solnik, and D. Wang, “Demystifying the secure enclave processor,” Black Hat USA, Aug. 2016.
[48] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, “Spectre attacks: Exploiting speculative execution,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2019, pp. 1–19.
[49] J. Van Bulck, M. Minkin, O. Weisse, D. Genkin, B. Kasikci, F. Piessens, M. Silberstein, T. F. Wenisch, Y. Yarom, and R. Strackx, “Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution,” in Proc. USENIX Secur. Symp. (USENIX Sec), 2018, pp. 1–18.
[50] GlobalPlatform. (2018). Introduction to Secure Elements. [Online]. Available: https://globalplatform.org/wp-content/uploads/2018/05/Introduction-to-Secure-Element-15May2018.pdf
[51] Rambus. CryptoManager Trusted Provisioning Services. Accessed: Sep. 25, 2021. [Online]. Available: https://www.rambus.com/security/provisioning-and-key-management/cryptomanager-trusted-provisioning-services/
[52] S. Marisetty. (2017). Demystifying Security Root of Trust. Linaro Connect SFO. [Online]. Available: https://www2.slideshare.net/linaroorg/sfo17-304-demystifying-ro-tfinallc-83555369
[53] GlobalPlatform. (2013). TEE Secure Element API Version 1.0. [Online]. Available: https://globalplatform.org/wp-content/uploads/2018/06/GPD_TEE_SE_API_v1.0.pdf
[54] C. Priebe, D. Muthukumaran, J. Lind, H. Zhu, S. Cui, V. A. Sartakov, and P. Pietzuch, “SGX-LKL: Securing the host OS interface for trusted execution,” 2019, arXiv:1908.11143. [Online]. Available: http://arxiv.org/abs/1908.11143
[55] S. Arnautov, B. Trach, F. Gregor, T. Knauth, A. Martin, C. Priebe, J. Lind, D. Muthukumaran, D. O’keeffe, M. L. Stillwell, and D. Goltzsche, “SCONE: Secure Linux containers with Intel SGX,” in Proc. Symp. Operating Syst. Design Implement. (OSDI), 2016, pp. 689–703.
KUNIYASU SUZAKI (Member, IEEE) received
the B.E. and M.E. degrees in computer science
from Tokyo University of Agriculture and Technology,
and the Ph.D. degree in computer science
from The University of Tokyo, Tokyo, Japan. He is
currently a Senior Researcher with the National
Institute of Advanced Industrial Science and Technology
(AIST) and a Researcher with the Technology
Research Association of Secure IoT Edge
Application Based on RISC-V Open Architecture
(TRASIO). His research interests include security on CPU,
operating systems, and hypervisor.
KENTA NAKAJIMA received the M.S. degree in
mathematical informatics from The University of
Tokyo. He is currently a Researcher with the Technology
Research Association of Secure IoT Edge
Application Based on RISC-V Open Architecture
(TRASIO). His research interests include software
engineering on operating systems, system security,
and software automation. He is interested in how
the Linux OS and container libraries work.
TSUKASA OI is currently a Researcher with
the Technology Research Association of Secure
IoT Edge Application Based on RISC-V Open
Architecture (TRASIO). His research interests
include security of operating systems and virtual
machines.
AKIRA TSUKAMOTO received the M.S. degree
in computer science from Columbia University,
New York. He currently works with the National
Institute of Advanced Industrial Science and
Technology (AIST). He has worked on products
based on Cell/B.E. and Arm. His research interests
include software engineering on networks,
operating systems, and system security, and he
is enthusiastic regarding any kind of technical
development.