Digital Fingerprinting
Cliff Wang, Ryan M. Gerdes, Yong Guan, Sneha Kumar Kasera (Editors)
Springer
Editors
Cliff Wang, Computing and Information Science Division, Army Research Office, Durham, NC, USA
Yong Guan, Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA
Contents

Introduction
Yong Guan, Sneha Kumar Kasera, Cliff Wang and Ryan M. Gerdes
1 Overview
2 Applications and Requirements of Fingerprints
3 Types of Fingerprints

Types and Origins of Fingerprints
Davide Zanetti, Srdjan Capkun and Boris Danev
1 Introduction
2 Physical-Layer Device Identification
2.1 General View
2.2 Device Under Identification
2.3 Identification Signals
2.4 Features
2.5 Device Fingerprints
2.6 Physical-Layer Identification System
2.7 System Performance and Design Issues
2.8 Improving Physical-Layer Identification Systems
3 State of the Art
3.1 Transient-Based Approaches
3.2 Modulation-Based Approaches
3.3 Other Approaches
3.4 Attacking Physical-Layer Device Identification
3.5 Summary and Conclusion
4 Future Research Directions
5 Conclusion
References
Introduction

Yong Guan, Sneha Kumar Kasera, Cliff Wang and Ryan M. Gerdes
1 Overview
Authentication has traditionally relied on what an entity knows (e.g., passwords), what it has (e.g., physical objects such as tokens), and what it is (e.g., fingerprints). The first two techniques are subject to theft of the passwords or physical objects. The third technique is robust against such thefts. However, most of the existing fingerprint-based authentication systems have focused on authenticating human beings only.
There is a strong need to extend the fingerprinting ideas to devices for the purpose
of building robust and more convenient authentication systems for devices. Further-
more, device fingerprints may also have many other related applications including
forensics, intrusion detection, and assurance monitoring that need to be explored as
well. Device fingerprint research requires cross-disciplinary expertise in hardware
and software spanning different branches of engineering and computer science.
2 Applications and Requirements of Fingerprints

Fingerprints have broad applications and are useful in many contexts, security and
otherwise, including
1. determining whether an entity is a friend or a foe,
2. detecting intrusions,
3. collecting and analyzing data for forensic purposes,
4. authenticating before allowing access to resources, services, or networks, and
5. assurance monitoring.
Fingerprints must be robust to environmental changes and aging, resistant to attacks, accurate (low false negative and false positive rates), easy to measure in a predictable manner, and convenient to use. Some applications, e.g., deciding whether someone is a friend or foe, require very quick decision-making, and thus the fingerprints must lend themselves to quick measurement and verification, while other applications, e.g., forensics, allow more time to measure and verify fingerprints. For some applications, gross classification may be enough. The choice of fingerprinting method depends largely on the consequences of incorrect authentication.
Fingerprints should be immutable and inimitable (at least with a high proba-
bility). Fingerprints, not necessarily of the same kind, should be able to deal with
different types of adversaries, potentially those that can spoof and reproduce finger-
prints, and/or can jam the communication channel. Fingerprints and fingerprinting
techniques must be adaptive depending on the nature of the applications and the
adversary.
3 Types of Fingerprints
The possibility of fingerprinting a specific device stems from the fact that each individual device exhibits variations in some features (in some components) introduced when it is manufactured or by its operating context. The environment, the channel, and many other factors leave their imprints in the output of a device.
Types and Origins of Fingerprints

Davide Zanetti, Srdjan Capkun and Boris Danev

1 Introduction
Devices are traditionally identified by some unique information that they hold, such as a public identifier or a secret key. Besides what they hold, devices can be identified by what they are, i.e., by some unique characteristics that they exhibit and that can be observed. Examples include characteristics related to device components such as the operating system, drivers, clocks, radio circuitry, etc. Analyzing these components for identifiable information is commonly referred to as fingerprinting, since the goal is to create fingerprints similar to their biometric counterparts [2].
Here, we focus on techniques that allow wireless devices to be identified by unique characteristics of their analog (radio) circuitry; this type of identification is also referred to as physical-layer device identification. More precisely, physical-layer device identification is the process of fingerprinting the analog circuitry of a device by analyzing the device's communication at the physical layer for the purpose of identifying a device or a class of devices. Physical-layer device identification is possible due to hardware imperfections in the analog circuitry introduced during the manufacturing process. These hardware imperfections appear in the transmitted signals, which makes them measurable. While more precise manufacturing and quality
control could minimize such artifacts, doing so is often impractical due to significantly higher production costs.
The use of physical-layer device identification has been suggested for defensive
and offensive purposes. It has been proposed for intrusion detection [4, 15, 45], access
control [3, 48], wormhole detection [33], cloning detection [6, 23], malfunction
detection [49], secure localization [44], rogue access point detection [21], etc. It
has also been discussed as one of the main hurdles in achieving anonymity and
location privacy [29, 30]. Wireless platforms for which physical-layer identification
has been shown to be feasible include HF Radio Frequency IDentification (RFID)
transponders, UHF (CC1000) sensor nodes, analog VHF transmitters, IEEE 802.11
and 802.15.4 (CC2420) transceivers.
Being able to assess, for a given wireless platform, whether physical-layer identification is feasible, and under which assumptions, with what accuracy, and at what cost, is important for the construction of accurate attacker models and consequently for the analysis and design of security solutions in wireless networks. So far, to the best of our knowledge,
physical-layer device identification has not been systematically addressed in terms
of feasibility, design, implementation and evaluation. This lack of systematization
often results in misunderstanding the implications of device identification on the
security of wireless protocols and applications.
The goal of this work is to enable a better understanding of device identification
and its implications by systematizing the existing research on the topic. We review
device identification systems, their design, requirements, and properties, and provide
a summary of the current state-of-the-art techniques. We finally summarize issues
that are still open and need to be addressed for this topic to be fully understood.
2 Physical-Layer Device Identification

2.1 General View

Physical-layer device identification relies on characteristics extracted from signals acquired from devices during communications, with the ultimate aim of identifying (or verifying) devices or their affiliation classes.

Fig. 1 Entities involved in the physical-layer identification of wireless devices and their main components
Such an identification system can be viewed as a pattern recognition system typically composed of (Fig. 1): an acquisition setup to acquire signals from devices under identification, also referred to as identification signals; a feature extraction module to obtain identification-relevant information from the acquired signals, also referred to as fingerprints; and a fingerprint matcher for comparing fingerprints and notifying the application system that requested the identification of the comparison results.
Typically, there are two modules in an identification system: one for enrollment
and one for identification. During enrollment, signals are captured from either each
device or each (set of) class-representative device(s) considered by the application
system. Fingerprints obtained from the feature extraction module are then stored in a
database (each fingerprint may be linked with some form of unique ID representing
the associated device or class). During identification, fingerprints obtained from the
devices under identification are compared with reference fingerprints stored during
enrollment. The task of the identification module can be twofold: either recognize
(identify) a device or its affiliation class from among many enrolled devices or classes
(1:N comparisons), or verify that a device identity or class matches a claimed identity
or class (1:1 comparison).
The typical operation of an identification module flows as follows: the acquisition
setup (Sect. 2.6) acquires the signals transmitted (Sect. 2.3) from the device under
identification (Sect. 2.2), which may be a response to a specific challenge sent by
the acquisition setup. Then, the feature extraction module (Sect. 2.6) extracts fea-
tures (Sect. 2.4) from the acquired signals and obtains device fingerprints (Sect. 2.5).
Subsequently, the fingerprint matcher (Sect. 2.6) retrieves the reference fingerprints associated with the device under identification from the fingerprint database and compares them against the obtained fingerprints to determine or verify the identity (or the class) of the device under identification. The results of the fingerprint matcher can then be incorporated in the decision-making process of the application system requesting the identification (e.g., to grant or not to grant access to a certain location).
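To make the enrollment/identification flow above concrete, the following Python sketch abstracts the three modules into a small class. The Euclidean-distance matcher, the `extract_features` callable, and the averaging of enrollment samples are illustrative assumptions, not the design of any specific system discussed here.

```python
# Minimal sketch of the enrollment/identification flow described above.
# Feature extraction is abstracted as a user-supplied function; the
# Euclidean matcher and averaging step are illustrative choices.
import numpy as np

class FingerprintDB:
    def __init__(self):
        self.references = {}          # device_id -> reference fingerprint

    def enroll(self, device_id, signals, extract_features):
        # Average fingerprints over several acquired samples (cf. Sect. 2.8).
        fps = np.array([extract_features(s) for s in signals])
        self.references[device_id] = fps.mean(axis=0)

    def verify(self, claimed_id, signal, extract_features, threshold):
        # 1:1 comparison: accept if the distance to the claimed reference
        # fingerprint is below a decision threshold.
        fp = extract_features(signal)
        dist = np.linalg.norm(fp - self.references[claimed_id])
        return dist < threshold

    def identify(self, signal, extract_features):
        # 1:N comparison: return the enrolled device whose reference
        # fingerprint is closest to the observed one.
        fp = extract_features(signal)
        return min(self.references,
                   key=lambda d: np.linalg.norm(fp - self.references[d]))
```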
The design specification of an identification system usually includes requirements
for system accuracy (allowable error rates), computational speed, exception handling,
and system cost [2]. We detail those aspects, as well as strategies to improve device identification performance, in Sects. 2.7 and 2.8, respectively.
Fig. 2 Block diagrams of two classes of wireless devices. a RFID transponder. b IEEE 802.11 transceiver
Such hardware artifacts can then be located in the modulator sub-circuit of the transceivers.
Table 1 shows a non-exhaustive list of reported identification experiments together with the considered devices and (possible) causes of imperfections. Knowing the components that make devices uniquely identifiable may have relevant implications for both attacks and applications, which makes the investigation of such components an important open problem and research direction.
2.3 Identification Signals

Considering devices communicating through radio signals, i.e., sending data according to some defined specification and protocol, identification at the physical layer aims at extracting unique characteristics from the transmitted radio signals and using those characteristics to distinguish among different devices or classes of devices. We define identification signals as the signals that are collected for the purpose of identification. Signal characteristics are mainly based on observing and extracting information from the properties of the transmitted signals, such as amplitude, frequency, or phase, over a certain period of time. These time windows can cover different parts of the transmitted signals. Mainly, we distinguish between data-related and non-data-related parts. The data parts of signals directly relate to data transmission (e.g., preamble, midamble, payload), which leads to data-related properties such as modulation errors [3], preamble (midamble) amplitude, frequency and phase [25, 34], and spectral transformations [17, 25]. Non-data-related parts of signals are not associated with data transmission. Examples include turn-on transients [45, 46], near-transient regions [35, 50], and RF burst signals [6]. Figure 3 shows a non-exhaustive list of signal regions that have been used to identify active wireless transceivers (IEEE 802.11, 802.15.4) and passive transponders (ISO 14443 HF RFID).
2.4 Features
Features are characteristics extracted from identification signals. These can be predefined or inferred. Table 1 shows a non-exhaustive list of reported identification experiments together with the deployed features.

Predefined features relate to well-understood signal characteristics. These can be classified as in-specification and out-specification. Specifications are used for quality control and specify error tolerances. Examples of in-specification characteristics include modulation errors such as frequency offset, I/Q origin offset, and magnitude and phase errors [3], as well as time-related parameters such as the duration of the response [32]. Examples of out-specification characteristics include clock skew [21] and the duration of the turn-on transient [33]. Figure 4a, b show a predefined, in-specification feature used to identify EPC C1G2 RFID tags [52]. The explored feature relates to the tag's transmitted data rate (BLF = 1/T_cycle).
Table 1 Non-exhaustive list of reported identification experiments together with feature-related information

Device(a) | Signal part(b) | Feature(c) | Type(d) | Cause of imperfections(e) | Reference
Analog VHF txmtr | Transient | Wavelets | Inferred | Frequency synthesizer | Toonstra and Kinsner [45]
Bluetooth trx | Transient | Wavelets | Inferred | | Hall et al. [17]
IEEE 802.15.4 trx | Transient | FFT spectra | Inferred | | Danev and Capkun [5]
IEEE 802.11 trx | Data | Modulation errors | Predefined (in-spec) | Modulator circuitry | Brik et al. [3]
ISO 14443 RFID txpndr | RF burst | FFT spectra | Inferred | Antenna, charge pump | Danev et al. [6]
IEEE 802.11 trx | Data | Clock skew | Predefined (out-spec) | Trx analog circuitry | Jana and Kasera [21]
UHF trx | Transient | Transient length | Predefined (out-spec) | | Rasmussen and Capkun [33]
IEEE 802.11 trx | Data (preamble) | Wavelets | Inferred | | Klein et al. [25]
EPC C1G2 RFID txpndr | Data | Timing errors | Predefined (out-spec) | Oscillator | Zanetti et al. [51]
GSM trx | Near-transient, Data | Amp., freq., phase | Predefined | | Williams et al. [50]

(a) Device: class of considered devices. (b) Signal part: the signal part used to extract fingerprints. (c) Feature: basic signal characteristic. (d) Type: type of the considered features. Predefined: well-understood signal characteristics. Inferred: various signal transformations. (e) Cause of imperfections: device component likely to be the cause of exploited hardware variations.
Fig. 3 Several signal parts (regions) commonly used for identification. a Turn-on transient of an IEEE 802.15.4 (CC2420) transceiver. b ISO 14443 HF RFID tag response to an out-of-specification RF burst signal. c Preamble and data modulated regions in IEEE 802.11 transceivers. Signal parts can be either analyzed at RF or at baseband (I/Q). d HF/UHF RFID tag response to in-specification commands
The EPC C1G2 standard [11] allows a maximum tolerance of ±22 % around the nominal data rate: different tags transmit at different data rates.
In contrast to predefined features, where the considered characteristics are known in advance, prior to recording the signals, we say that features are inferred when they are extracted from signals, e.g., by means of some spectral transformation such as the Fast Fourier Transform (FFT) or Discrete Wavelet Transform (DWT), without a priori knowledge of a specific signal characteristic. For example, wavelet transformations have been applied to signal turn-on transients [17, 18] and different data-related signal regions [25, 26]. The Fourier transformation has also been used to extract features from the turn-on transient [5] and other technology-specific device responses [6]. Figure 4c, d show an inferred feature used to identify EPC C1G2 RFID tags [51]. The explored feature relies on the spectral transformation (FFT) of the tag's data-related signal region: different tags present different signal spectra.
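As an illustration of an inferred feature, the sketch below computes a coarse FFT-based spectral fingerprint from a data-related signal region. The window, region boundaries, and number of bins are arbitrary assumptions, meant only to mirror the spirit of the FFT-based features cited above.

```python
# Sketch of an inferred (spectral) feature: the magnitude spectrum of a
# fixed data-related signal region, compressed into a few frequency bins.
# Region slice and bin count are illustrative values.
import numpy as np

def spectral_fingerprint(samples, region=slice(0, 1024), n_bins=10):
    segment = np.asarray(samples, dtype=float)[region]
    spectrum = np.abs(np.fft.rfft(segment * np.hanning(len(segment))))
    # Collapse the spectrum into coarse bins, similar in spirit to the
    # FFT-based features reported in [5, 6, 51].
    bins = np.array_split(spectrum, n_bins)
    return np.array([b.mean() for b in bins])
```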
Both predefined and inferred features can be subject to further statistical analysis in order to improve their quality. We discuss such improvements in more detail in Sect. 2.8.
Fig. 4 Different features used for identification. Predefined feature: a data-modulated region of EPC C1G2 RFID tags [52] and b the considered predefined feature, i.e., the tag's data rate (BLF = 1/T_cycle) for different tags, as well as the given nominal data rate and tolerances according to the EPC C1G2 specifications [11]. Inferred feature: c data-modulated region of EPC C1G2 RFID tags [51] and d the considered inferred feature, i.e., the signal spectral transformation (FFT), for different tags
2.5 Device Fingerprints

Fingerprints are sets of features (or combinations of features, Sect. 2.8) that are used to identify devices. The properties that fingerprints need to present in order to achieve practical implementations (adapted from [2]) are:

Universality: every device (in the considered device-space) should have the considered features.
Uniqueness: no two devices should have the same fingerprints.
Permanence: the obtained fingerprints should be invariant over time.
Collectability: it should be possible to capture the identification signals with existing (available) equipment.
2.6 Physical-Layer Identification System

A physical-layer identification system (Fig. 1) has three tasks: acquire the identification signals (acquisition setup), extract features and obtain fingerprints from the identification signals (feature extraction module), and compare fingerprints (fingerprint matcher). The system may either passively collect identification signals or actively challenge devices under identification to produce the identification signals.
The acquisition setup is responsible for the acquisition and digitization of the identification signals. We refer to a single acquired and digitized signal as a sample. Depending on the features to be extracted, the identification signals may be modified, e.g., downconverted, before digitization. The acquisition process should neither influence nor degrade (e.g., by adding noise) the signals needed for the identification, but should preserve and bring into the digital domain the unique signal characteristics on which the identification relies. Therefore, high-quality (and expensive) equipment may be necessary. Typically, high-quality measurement equipment has been used to capture and digitize signal turn-on transients [17] and baseband signals [3].
The acquisition setup may also challenge devices under identification to trans-
mit specific identification signals. Under passive identification, the acquisition setup
acquires the identification signals without interacting with the devices under identi-
fication, e.g., identification signals can simply relate to data packets sent by devices
under identification during standard communication with other devices. In contrast, under active identification, the acquisition setup acquires the identification signals after challenging the devices under identification to transmit them.

2.7 System Performance and Design Issues

The accuracy of an identification (verification) system is commonly characterized by two error rates: the rate at which the system wrongly accepts an impostor device (False Accept Rate or FAR) and the rate at which it wrongly rejects a genuine device (False Reject Rate or FRR). These error rates are usually expressed in
in the Receiver Operating Characteristic (ROC) that shows the FRRs at different
FAR levels. The operating point in ROC, where FAR and FRR are equal, is referred
to as the Equal Error Rate (EER). The EER is a commonly used accuracy metric
because it is a single value metric and also tells that one recognition system is better
than another for the range of FAR/FRR operating points encompassing the EER. For
the accuracy at other operating points, one has to consider the ROC. We note that it
is also common to provide the FRR for certain benchmark operating points such as
FAR of 0.01, 0.1, 1 %.
The ROC and EER are the most commonly used metrics for the comparison of identification (verification) systems [13].
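For concreteness, the following sketch shows one way FAR/FRR curves and the EER can be computed from matcher scores. It assumes that lower scores (distances) indicate better matches; the score convention and the threshold sweep are assumptions made for illustration.

```python
# Sketch of computing FAR, FRR, and the EER from matcher scores, assuming
# a comparison is accepted when its score (distance) is below a threshold.
import numpy as np

def eer(genuine_scores, impostor_scores):
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    frr = np.array([(genuine_scores > t).mean() for t in thresholds])
    far = np.array([(impostor_scores <= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))   # operating point where FAR ~= FRR
    return (far[i] + frr[i]) / 2.0
```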
We note that physical-layer device identification systems in current state-of-the-art works (Sect. 3) were often evaluated as classification systems [1]. In a classification system, unknown device fingerprints are classified (correctly or incorrectly) to their respective reference device fingerprints. The error rate is referred to as the classification error rate and is the ratio of the number of incorrectly classified device fingerprints to all classified fingerprints. The classification error rate captures neither the acceptance of impostors nor the rejection of genuine devices, and is therefore typically not an appropriate metric for evaluating the accuracy of identification (verification) systems.
The requirements on computational resources, cost, and exception handling need to be considered as well. In physical-layer identification techniques, the complexity of the extracted fingerprints directly relates to the quality and speed of signal acquisition and processing; the higher the quality and speed, the higher the cost. Acquisition setups depend on environmental factors, which makes exception handling a critical component (e.g., signals may be difficult to acquire from certain locations; alternatively, acquired signals may not have acceptable quality for feature extraction). Therefore, appropriate procedures need to be devised in order to fulfill given requirements.
Last but not least, the evaluation of a physical-layer device identification system
must address related security and privacy issues. Can the device fingerprints be forged
and therefore compromise the system? How can one defend against attacks on the
integrity of the system? Related works on these systems have largely neglected these
issues.
2.8 Improving Physical-Layer Identification Systems

Before enrollment and identification modules can be deployed, the identification system must go through a building phase where design decisions (e.g., features, feature extraction methods, etc.) are tested and, if necessary, modified to fulfill the requirements on the above-mentioned system properties: accuracy, computational speed, exception handling, and costs.
Although these last three may significantly affect the design decisions, accuracy
is usually the most considered property to test and evaluate an identification system.
Typically, to improve the accuracy of a (physical-layer) identification system (for
wireless devices), i.e., to improve its overall error rates, different strategies can be
deployed: (i) acquire signals with multiple acquisition setups, (ii) acquire signals from
multiple transmitters on the same device (e.g., when devices are MIMO2 systems),
(iii) consider several acquisitions of the same signals, (iv) consider different signal
parts (e.g., both transients and data) and different features, and (v) deploy different
approaches for both feature extraction and matching.
So far, neither MIMO systems as devices under identification nor multiple acquisition setups have been considered. MIMO systems as devices under identification may offer a wider range of characteristics on which the identification process can be based. This can lead to more robust fingerprints (by analogy with human fingerprints, it is like verifying a human identity by scanning two different fingers). Using multiple acquisition setups may increase the accuracy of the identification; e.g., acquiring a signal from different locations at the same time may lead to more robust fingerprints. The impact of MIMO systems and of multiple acquisition setups is still unexplored.
Considering several acquisitions (samples) of the same signal is the common approach to obtaining more reliable fingerprints [5, 17, 33]. Generally, the acquired samples are averaged into one representative sample, which is then used by the feature extraction module to create fingerprints.
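The sketch below illustrates this averaging step. The cross-correlation alignment and circular shift are illustrative choices and are not taken from any specific work.

```python
# Sketch of the sample-averaging step: several acquisitions of the same
# signal are roughly time-aligned to the first one and then averaged into
# a single representative sample before feature extraction.
import numpy as np

def average_acquisitions(samples):
    samples = [np.asarray(s, dtype=float) for s in samples]
    reference = samples[0]
    aligned = [reference]
    for s in samples[1:]:
        # Estimate the lag of s relative to the reference by cross-correlation.
        lag = np.argmax(np.correlate(s, reference, mode="full")) - (len(s) - 1)
        aligned.append(np.roll(s, -lag))     # crude circular alignment
    return np.mean(aligned, axis=0)
```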
Considering different signal parts, features, and feature extraction methods is often referred to as multi-modal biometrics, where different modalities are combined to increase the identification accuracy and bring more robustness to the identification process [38]. Several works have already considered combining different modalities. For example, different signal properties (e.g., frequency, phase) were used in [3, 17], and different signal regions, signal properties, and statistics (e.g., skewness, kurtosis) were explored in [25, 35]. Different modalities extracted from device responses to various challenge requests were studied in [6]. The use of more modalities has resulted in significant improvement of the overall device identification accuracy. It should be noted that the above modalities were mostly combined before the feature matching (classification) procedure. Therefore, the combination of different classification techniques remains to be explored [20, 24].
In addition to the above-mentioned strategies to improve the accuracy of an identification system, it is worth mentioning feature selection and statistical feature extraction. Feature selection aims at selecting, from a set of features, the subset that leads to the best accuracy [19] (that subset will then be used in the enrollment and identification modules). Statistical feature extraction exploits statistical methods to choose
and/or transform features of objects (in our case, devices) such that the similarities between same objects are preserved, while the differences between different objects are enhanced [1]. Statistical feature extraction is a powerful technique to improve the features' discriminant quality.

2 MIMO refers to multiple-input and multiple-output. Such wireless systems use multiple antennas for transmitting and receiving for the purpose of improving communication performance.
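As one possible instantiation of statistical feature extraction, the sketch below uses Linear Discriminant Analysis from scikit-learn to project raw features into a space that emphasizes between-device differences. The choice of LDA, the library, and the number of components are illustrative assumptions, not the method of any specific paper cited here.

```python
# Sketch of statistical feature extraction: LDA learns a projection that
# preserves same-device similarity while enhancing between-device
# differences; transformed vectors serve as improved fingerprints.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_statistical_extractor(raw_features, device_labels, n_components=3):
    # raw_features: (n_samples, n_features); device_labels: one label per row.
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    lda.fit(raw_features, device_labels)
    return lda        # lda.transform(x) then yields the improved fingerprints
```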
3 State of the Art

Identification of radio signals gained interest in the early development of radar systems during World War II [22, 28]. In a number of battlefield scenarios it became critical to distinguish one's own radars from enemy radars. This was achieved by visually comparing oscilloscope photos of received signals to previously measured profiles [28]. Such approaches gradually became impractical due to the increasing number of transmitters and greater consistency in the manufacturing process.
In the mid and late 1990s a number of research works appeared in the open literature on detecting illegally operated VHF FM radio transmitters [4, 18, 45, 46]. Subsequently, physical-layer identification techniques were investigated for device cloning detection [6, 23], defective device detection [49], and access control in wireless personal and local area networks [3, 15, 16, 21, 33]. A variety of physical properties of the transmitted signals were researched and related identification systems proposed.
Here we review the most prominent physical-layer identification techniques available in the open literature. We structure them into three categories, namely transient-based, modulation-based, and other approaches, based on the signal part used for feature extraction. For each category, we discuss the works in chronological order. A concise summary is provided in Table 2.
Table 2 (excerpt) Summary of reported state-of-the-art identification experiments

Reference | Signal part | Feature | Device | # devices | Device set(a) | Evaluation conditions | Evaluation | Error rate
Danev and Capkun [5] | Transient | FFT spectra | IEEE 802.15.4 trx | 50 | D3 | Distance, location, voltage, temp. | Verification, attacks | 0.24 %
Suski et al. [40] | Preamble | Power spectrum density | IEEE 802.11 trx | 3 | D3 | Proximity, SNR | Classification | 13 %
Danev et al. [6] | RF burst | FFT spectra, modulation | ISO 14443 HF RFID txpndr | 50 | D3 | Varied position, distance | Verification | 4 %
Periaswamy et al. [31] | Preamble | Minimum power response | EPC C1G2 UHF RFID txpndr | 50 | D3 | Fixed position | Verification | 5 %
Williams et al. [50] | Data, near transient | Amplitude, frequency, phase, statistics | GSM trx | 16 | D1 | Fixed position, SNR | Verification | 5–20 %

(a) D1: devices from different manufacturers and some of the same model; D2: devices from different manufacturers and models; D3: devices from the same manufacturer and model (identical)

3.3 Other Approaches
A number of physical-layer identification techniques have been proposed [6, 21, 40]
that could not be directly related to the aforementioned categories. These approaches
usually targeted a specific wireless technology and/or exploited additional properties
from the signal and logical layer.
Suski et al. [40] proposed using the baseband power spectrum density of the packet
preamble to uniquely identify wireless devices. A device fingerprint was created by
measuring the power spectrum density (PSD) of the preamble of an IEEE 802.11a
(OFDM) packet transmission. Subsequently, device fingerprints were matched by
spectral correlation. The authors evaluated the accuracy of their approach on 3 devices
and achieved an average classification error rate of 20 % for packet frames with SNR
greater than 6 dB. Klein et al. [25, 26] further explored IEEE 802.11a (OFDM)
device identification by applying complex wavelet transformations and multiple dis-
criminant analysis (MDA). The classification performance of their technique was
evaluated on 4 same model Cisco wireless transceivers. The experimental results
showed SNR improvement of approx. 8 dB for a classification error rate of 20 %.
Varying SNR and burst detection error were also considered.
Various signal characteristics, signal regions and statistics were recently investi-
gated on GSM devices [34, 35, 50]. The authors used the near-transient and midamble
regions of GSM-GMSK burst signals to classify devices from 4 different manufac-
turers. They observed that the classification error using the midamble is significantly
higher than using transient regions. Various factors were identified as potential areas
of future work on the identification of GMSK signals. In a follow-up work [35], it
has been shown that near-transient RF fingerprinting is suitable for GSM. Additional
performance analysis was provided for GSM devices from the same manufacturer
in [50]. The analysis revealed that a significant SNR increase (20–25 dB) was required in order to achieve high classification accuracy within same manufacturer devices.
Recently, a number of works investigated physical-layer identification of different
classes of RFID [6, 31, 32, 36, 37, 51]. Periaswamy et al. [31, 32] considered
fingerprinting of UHF RFID tags. In [31], the authors showed that the minimum
power response characteristic can be used to accurately identify large sets of UHF
RFID tags. An identification accuracy of 94.4 % (with FAR of 0.1 %) and 90.7 %
(with FAR of 0.2 %) was achieved on two independent sets of 50 tags from two
manufacturers. Timing properties of UHF RFID tags have been explored in two
independent works [32, 51]. The authors showed that the duration of the response
can be used to distinguish same manufacturer and type RFID tags independent of the
environment. This poses a number of privacy concerns for users carrying a number of these tags, e.g., unauthorized tracking of users can be achieved with high accuracy by a network of readers [51].
In the context of HF RFID, Danev et al. [6] explored timing, modulation, and spec-
tral features extracted from device responses to purpose-built in- and out-specification
signals. The authors showed that timing and modulation-shape features could only
be used to distinguish between tags from different manufacturers. On the other hand, spectral features would be the preferred choice for identifying same manufacturer and model
transponders. Experimental results on 50 identical smart cards and a set of elec-
tronic passports showed an EER of 2.43 % from close proximity. Similarly, Romero
et al. [36] demonstrated that the magnitude and phase at selected frequencies allow
fingerprinting different models of HF RFID tags. The authors validated their tech-
nique on 4 models, 10 devices per model. Recently, the same authors extended their
technique to enable identification of same model and manufacturer transponders [37].
The above works considered inductively coupled HF RFID tags, and the proposed features work from close proximity.
Jana and Kasera [21] proposed an identification technique based on clock skews in order to protect against unauthorized access points (APs) in a wireless local area network. A device fingerprint is built for each AP by computing its clock skew at the client station; this technique had previously been shown to be effective in wired networks [27]. The authors showed that they could distinguish between different APs and therefore detect an intruder AP with high accuracy. The ability to compute the clock skew relies on the fact that the AP association request contains time-stamps sent in the clear.
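A minimal sketch of clock-skew estimation in this spirit is shown below. It assumes that pairs of (AP timestamp, local receive time), both in microseconds, are already available, and it uses a least-squares line fit, which is one of the estimators considered in [21, 27].

```python
# Sketch of clock-skew estimation: the offset between the AP's advertised
# timestamps and the client's local receive times grows (approximately)
# linearly with time; the slope of a least-squares fit estimates the skew.
import numpy as np

def estimate_clock_skew_ppm(ap_timestamps_us, local_rx_times_us):
    ap = np.asarray(ap_timestamps_us, dtype=float)
    rx = np.asarray(local_rx_times_us, dtype=float)
    offsets = ap - rx                       # per-frame clock offset
    t = rx - rx[0]                          # elapsed local time
    slope, _ = np.polyfit(t, offsets - offsets[0], 1)
    return slope * 1e6                      # dimensionless skew, in ppm
```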
3.4 Attacking Physical-Layer Device Identification

The large majority of works have focused on exploring feature extraction and matching techniques for physical-layer device identification. Only recently has the security of these techniques started being addressed [5, 7, 9]. In these works, attacks on physical-layer identification systems can be divided into signal replay and feature replay attacks. In the former, the attacker's goal is to observe analog identification signals of a targeted device, capture them in a digital form, and then transmit (replay) these signals towards the identification system by some appropriate means (e.g., through purpose-built devices or more generic ones like software-defined radios [12], high-end signal analyzers, or arbitrary waveform generators). In contrast, feature replay attacks aim at creating, modifying, or composing identification signals that reproduce only the features considered by the identification system. Such attacks can be launched by special devices such as arbitrary waveform generators that produce the modified or composed signals, by finding a device that exhibits similar features to the targeted device, or by replicating the entire circuitry of the targeted device, or at least the components responsible for the identification features.
Edman and Yener [9] developed impersonation attacks on modulation-based iden-
tification techniques [3]. They showed that low-cost software-defined radios [12]
could be used to reproduce modulation features (feature replay attacks) and imper-
sonate a target device with a success rate of 50–75 %. Independently, Danev et al. [7]
have designed impersonation attacks (both feature and signal replay attacks) on tran-
sient and modulation-based approaches using both software-defined radios and high-
end arbitrary waveform generators. They showed that modulation-based techniques
are vulnerable to impersonation with high accuracy, while transient-based techniques
are likely to be compromised only from the location of the target device. The authors
pointed out that this is mostly due to the presence of wireless channel effects in the considered device fingerprints; the channel therefore needs to be taken into consideration for successful impersonation. In addition, Danev and Capkun [5] showed
that their identification system may be vulnerable to hill-climbing attacks if the num-
ber of signals used for building the device fingerprint is not carefully chosen. This
attack consists of repeatedly sending signals to the device identification system with
modifications that gradually improve the similarity score between these signals and a
target genuine signal. They also demonstrated that transient-based approaches could
easily be disabled by jamming the transient part of the signal while still enabling
reliable communication.
3.5 Summary and Conclusion

A detailed look at the state of the art yields a number of observations with respect to the design, properties, and evaluation of physical-layer identification systems. A broad spectrum of wireless devices (technologies) has been investigated. The devices under identification cover VHF FM transmitters, IEEE 802.11 network interface cards (NICs) and access points (APs), IEEE 802.15.4 sensor node devices, Bluetooth mobile phones, and RFID transponders. Identification at the physical layer has been shown to be feasible for all the considered types of devices.
In terms of feature extraction, most works explored inferred features for device
identification [5, 10, 15, 40, 42, 45, 48]. Few works used predefined features [3,
21, 33] with only one work [3] exploiting predefined in-specification features. Typ-
ically, predefined features would be more controlled by device manufacturers (e.g.,
standard compliance) and are therefore likely to exhibit less discriminative properties
compared to inferred features. The inferred features are however more difficult to dis-
cover and study given that purpose-built equipment and tailored analysis techniques
are required. Both transient and data parts of the physical-layer communication were
used for extracting device fingerprints.
The majority of works used standard classifiers such as Neural Network, Nearest
Neighbor, and Support Vector Machines classifiers [1] to classify (match) fingerprints
from different devices. Classification error rate was used as a metric of accuracy in [3,
10, 15, 33, 40, 42, 45, 48], while identification (verification) accuracy in terms of
FAR, FRR and EER metrics is used in [5, 6]. In Sect. 2.7, we discuss the differences
between those metrics and suggest an appropriate usage.
In terms of system evaluation, earlier works mostly considered heterogeneous
devices from different manufacturers and models, while recent works focused on
the more difficult task of identifying same model and manufacturer devices (see
Table 2). In addition to hardware artifacts in the analog circuitry introduced during manufacturing, physical-layer identification of devices that differ in hardware design, implementation, or manufacturing process may benefit from those differences. In contrast, physical-layer identification of devices that share the same hardware design, implementation, and manufacturing process is based exclusively on hardware variability in the analog circuitry introduced during manufacturing, which makes the physical-layer identification of those devices a harder task.
Proper investigations on the actual components that make devices uniquely iden-
tifiable have been so far neglected. Although in some (few) works these components
can be easily identified (e.g., Toonstra and Kinsner [45] based their device identifi-
cation on signals generated by the local frequency synthesizer), in most of the other
works only suggestions were provided (e.g., the device antenna and charge pump [6]
or the modulator sub-circuit of the transceiver [3]).
Only a few works have considered evaluating the robustness of the extracted fingerprints to environmental and in-device effects (see Table 2). Although parameters like temperature and voltage (at which the device under identification is powered) were considered, robustness evaluations mainly focused on determining the impact of the distance and orientation of the device under identification with respect to the identification system. Obviously, features not (or only minimally) affected by distance and orientation will be easily integrated into real-world applications. Results show that inferred
features based on spectral transformations such as Fast Fourier Transform or Dis-
crete Wavelet Transform are particularly sensitive to distance and orientation [5, 6]
(i.e., the identification accuracy significantly decreases when considering different
distances and orientations), while features less affected by the transmission medium
(i.e., the wireless channel) like clock skews or (some) modulation errors [3] are less
sensitive.
In general, the proposed system evaluations rarely considered acquisition cost
and time, feature extraction overhead and device fingerprint size. For example, some
brief notes on feature extraction overhead and fingerprint size can be found in [5, 6]
and on signal acquisition time in [3], but they are rather an exception in the reviewed
state-of-the-art works.
Security and privacy considerations were largely neglected. Only recently,
researchers considered attacks on selected physical-layer techniques [7, 9], but no
comprehensive security and privacy analysis has been attempted.
4 Future Research Directions

Identification systems can benefit from more tailored features and detailed attack analysis, while attackers can use this information for advanced feature replay attacks.
Robust fingerprints.
Analyzing the robustness of fingerprints with respect to application-related environmental and in-device aspects would help in both understanding the limitations of and finding improvements to the considered features. Potential, and currently unexplored, areas of improvement include MIMO systems, multiple acquisition
setups, and multi-modal fingerprints. Deploying multiple acquisition setups may
increase the accuracy of the identification while MIMO systems as devices under
identification may offer a wider range of identification features. Considering dif-
ferent signal parts, features, and feature extraction methods and combining them
to obtain multi-modal fingerprints may increase the identification accuracy and
bring more robustness to the identification process.
Security and privacy of device identification.
Attacks on both security and privacy of physical-layer identification entities need
to be thoroughly investigated and appropriate countermeasures designed and eval-
uated. Investigation of data-dependent properties in device fingerprints might be
a promising direction to improve the resilience against replay attacks.
5 Conclusion
multi-modal features) can be exploited for improving the accuracy and increasing
the robustness of these systems. Similarly, data-dependent properties could largely
enhance the resilience to replay attacks.
References

24. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998)
25. Klein, R.W., Temple, M.A., Mendenhall, M.J.: Application of wavelet-based RF fingerprinting to enhance wireless network security. Secur. Commun. Netw. 11(6), 544–555 (2009)
26. Klein, R.W., Temple, M.A., Mendenhall, M.J.: Application of wavelet denoising to improve OFDM-based signal detection and classification. Secur. Commun. Netw. 3(1), 71–82 (2010)
27. Kohno, T., Broido, A., Claffy, K.: Remote physical device fingerprinting. IEEE Trans. Dependable Secure Comput. 2(2), 93–108 (2005)
28. Margerum, D.: Pinpointing location of hostile radars. Microwaves (1969)
29. Mitra, M.: Privacy for RFID systems to prevent tracking and cloning. Int. J. Comput. Sci. Netw. Secur. 8(1), 1–5 (2008)
30. Pang, J., Greenstein, B., Gummadi, R., Seshan, S., Wetherall, D.: 802.11 user fingerprinting. In: Proceedings of ACM International Conference on Mobile Computing and Networking (MobiCom), pp. 99–110 (2007)
31. Periaswamy, S.C.G., Thompson, D., Di, J.: Fingerprinting RFID tags. IEEE Trans. Dependable Secure Comput. 8(6), 938–943 (2011)
32. Periaswamy, S.C.G., Thompson, D.R., Romero, H.P.: Fingerprinting radio frequency identification tags using timing characteristics. In: Proceedings of Workshop on RFID Security (RFIDSec Asia) (2010)
33. Rasmussen, K., Capkun, S.: Implications of radio fingerprinting on the security of sensor networks. In: Proceedings of International ICST Conference on Security and Privacy in Communication Networks (SecureComm) (2007)
34. Reising, D.R., Temple, M.A., Mendenhall, M.J.: Improved wireless security for GMSK-based devices using RF fingerprinting. Int. J. Electron. Secur. Digit. Forensics 3(1), 41–59 (2010)
35. Reising, D.R., Temple, M.A., Mendenhall, M.J.: Improving intra-cellular security using air monitoring with RF fingerprints. In: Proceedings of IEEE Wireless Communications and Networking Conference (WCNC) (2010)
36. Romero, H.P., Remley, K.A., Williams, D.F., Wang, C.M.: Electromagnetic measurements for counterfeit detection of radio frequency identification cards. IEEE Trans. Microw. Theory Tech. 57(5), 1383–1387 (2009)
37. Romero, H.P., Remley, K.A., Williams, D.F., Wang, C.M., Brown, T.X.: Identifying RF identification cards from measurements of resonance and carrier harmonics. IEEE Trans. Microw. Theory Tech. 58(7), 1758–1765 (2010)
38. Ross, A., Jain, A.: Multimodal biometrics: an overview. In: Proceedings of European Signal Processing Conference (EUSIPCO), pp. 1221–1224 (2004)
39. Shaw, D., Kinsner, W.: Multifractal modeling of radio transmitter transients for classification. In: Proceedings of IEEE Conference on Communications, Power and Computing (WESCANEX), pp. 306–312 (1997)
40. Suski, W., Temple, M., Mendenhall, M., Mills, R.: Using spectral fingerprints to improve wireless network security. In: Proceedings of IEEE Global Communications Conference (GLOBECOM), pp. 1–5 (2008)
41. Suski, W.C., Temple, M.A., Mendenhall, M.J., Mills, R.F.: Radio frequency fingerprinting commercial communication devices to enhance electronic security. Int. J. Electron. Secur. Digit. Forensics 1(3), 301–322 (2008)
42. Tekbas, O., Ureten, O., Serinken, N.: An experimental performance evaluation of a novel radio-transmitter identification system under diverse environmental conditions. Can. J. Electr. Comput. Eng. 29(3), 203–209 (2004)
43. Tekbas, O., Ureten, O., Serinken, N.: Improvement of transmitter identification system for low SNR transients. Electron. Lett. 40(3), 182–183 (2004)
44. Tippenhauer, N.O., Rasmussen, K.B., Pöpper, C., Capkun, S.: Attacks on public WLAN-based positioning. In: Proceedings of ACM/USENIX International Conference on Mobile Systems, Applications and Services (MobiSys), pp. 29–40 (2009)
45. Toonstra, J., Kinsner, W.: Transient analysis and genetic algorithms for classification. In: Proceedings of IEEE Conference on Communications, Power, and Computing (WESCANEX), vol. 2, pp. 432–437 (1995)
46. Toonstra, J., Kinsner, W.: A radio transmitter fingerprinting system ODO-1. In: Proceedings of Canadian Conference on Electrical and Computer Engineering, vol. 1, pp. 60–63 (1996)
47. Ureten, O., Serinken, N.: Detection of radio transmitter turn-on transients. Electron. Lett. 35, 1996–1997 (2007)
48. Ureten, O., Serinken, N.: Wireless security through RF fingerprinting. Can. J. Electr. Comput. Eng. 32(1), 27–33 (2007)
49. Wang, B., Omatu, S., Abe, T.: Identification of the defective transmission devices using the wavelet transform. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 696–710 (2005)
50. Williams, M., Temple, M., Reising, D.: Augmenting bit-level network security using physical layer RF-DNA fingerprinting. In: Proceedings of IEEE Global Telecommunications Conference (GLOBECOM), pp. 1–6 (2010)
51. Zanetti, D., Danev, B., Capkun, S.: Physical-layer identification of UHF RFID tags. In: Proceedings of ACM Conference on Mobile Computing and Networking (MobiCom), pp. 353–364 (2010)
52. Zanetti, D., Sachs, P., Capkun, S.: On the practicality of UHF RFID fingerprinting: how real is the RFID tracking problem? In: Proceedings of Privacy Enhancing Technologies Symposium (PETS), pp. 97–116 (2011)
Device Measurement and Origin of Variation
Abstract In this chapter a methodology is set forth that allows one to determine whether or not a particular device component causes, or contributes significantly to, the differences in signalling behaviour between devices that allow for their identification.
1 Introduction
Fig. 1 Depiction of a two-port model for a component (input voltage/current denoted by V_1/I_1
and output voltage/current by V_2/I_2). Note it is assumed that voltage/current measurements are
carried out at device terminals
The black box, or port, model of a component specifies only the port characteristics of the device; i.e., it only indicates what the voltage/current will be at one port given a voltage/current applied at another, but not why this is so. While this model may not explain the behaviour of a component, it does capture the behaviour of a component precisely, within the limits of the measured inputs/outputs and under the assumption of linearity.
In a two-port model (Fig. 1) an input voltage, V_1, and current, I_1, are related to the output voltage, V_2, and current, I_2, via linear combination. Given four variables, there are six ways to choose two dependent and two independent variables; these six choices represent the possible two-port models (also called parameters) we must choose from. (It is actually slightly more complicated than this, as we must choose which dependent variable is to be written first. Furthermore, taking linear combinations of independent variables together with linear combinations of dependent variables further increases the number of possible equations without bound. Refer to [2] for a detailed discussion.) The parameter type chosen usually depends on how multiple two-port models are to be connected [1].
Before proceeding with our discussion of how a two-port model of a component can be constructed in order to determine the influence the component has on the unique behaviour of the device, it should be noted that if the operating frequency of the component being modelled is high, as it is in wired networking technologies beyond 10 Mb Ethernet and for wireless devices, then it may be necessary to use two-port models in which the independent/dependent variables are themselves linear combinations of independent/dependent variables (see Sect. 2). In addition, a two-port model of a device is only valid if a true ground plane exists; i.e., the ground is of zero potential, zero resistance, and is continuous [2].
For our analysis we chose ABCD parameters because of the ease of combining ABCD models for multiple components in a cascaded or chained fashion (when using ABCD parameters with multiple components connected in a cascaded configuration, it is only necessary to perform simple matrix multiplication to determine their combined response). This allows us to build more complicated models with ease, in which additional component models are added to the chain to see how different components affect a device's signal in combination.

ABCD parameters treat V_1 and I_1 as dependent variables and V_2 and I_2 as independent ones; the input voltage/current and output voltage/current are related via
\begin{bmatrix} V_1 \\ I_1 \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} V_2 \\ I_2 \end{bmatrix} \qquad (1)
V_1 = \frac{Z_{IN}}{Z_{IN} + Z_S}\, V_S \qquad (5)

where Z_{IN} is the input impedance of the cascade (the component plus the shunt load). As per the definition of ABCD parameters, dividing (2a) by (2c) gives Z_{IN} = A/C. The input impedance needed for (5) is then

Z_{IN} = \frac{a + b\,Y_L}{c + d\,Y_L} \qquad (6)
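The cascading property and the input-impedance calculation can be illustrated with a short numerical sketch. The component values, the load admittance, and the helper constructors below are hypothetical and serve only to show the matrix bookkeeping.

```python
# Sketch of cascading two-port ABCD models by matrix multiplication and of
# computing the input impedance as in (6). All component values are
# placeholders chosen for illustration.
import numpy as np

def series_impedance(Z):
    # ABCD matrix of a series impedance Z
    return np.array([[1.0, Z], [0.0, 1.0]], dtype=complex)

def shunt_admittance(Y):
    # ABCD matrix of a shunt admittance Y
    return np.array([[1.0, 0.0], [Y, 1.0]], dtype=complex)

# Cascade: the overall ABCD matrix is the ordered product of the stages.
M = series_impedance(50) @ shunt_admittance(1 / 200) @ series_impedance(10)
(a, b), (c, d) = M
Y_L = 1 / 75.0                                # hypothetical shunt load admittance
Z_in = (a + b * Y_L) / (c + d * Y_L)          # Eq. (6)
print(Z_in)
```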
2 Measuring Parameters
For illustrative purposes, we shall assume that voltage and current are measured at the terminals of the component; i.e., wave phenomena may be ignored. While this simplification is acceptable for low-speed/low-frequency components of devices, high-speed networking devices, whose components operate in the radio frequency range of the spectrum, would require S-parameters defined in terms of travelling waves (see chapter three of [2]).
In reference to Fig. 1, voltage-referenced S-parameters for a two-port component are defined as [2]

S_{11} = \left.\frac{V_1 - I_1 Z_1}{V_1 + I_1 Z_1}\right|_{V_2 = I_2 Z_2} \qquad (8a)

S_{21} = \left.\frac{V_2 - I_2 Z_2}{V_1 + I_1 Z_1}\,\frac{|\mathrm{Re}(Z_1)|}{|\mathrm{Re}(Z_2)|}\right|_{V_2 = I_2 Z_2} \qquad (8b)

S_{12} = \left.\frac{V_1 - I_1 Z_1}{V_2 + I_2 Z_2}\,\frac{|\mathrm{Re}(Z_2)|}{|\mathrm{Re}(Z_1)|}\right|_{V_1 = I_1 Z_1} \qquad (8c)

S_{22} = \left.\frac{V_2 - I_2 Z_2}{V_2 + I_2 Z_2}\right|_{V_1 = I_1 Z_1} \qquad (8d)
where Z_1 and Z_2 are the impedances of the source and load, respectively, causing the excitation of the component. Under the assumption that Z_0 = Z_1 = Z_2 and that the source and load impedances used in Fig. 2 are equivalent to Z_1 and Z_2, respectively, the equivalent ABCD parameters are [2]
An idealised signal (i.e., a signal composed of attributes unique only to the particular brand/model i of a networking technology), A_i, could be constructed by taking n sampled signals from each device, aligning them, and then averaging. This signal, A_i, serves as the input, V_S, to the model depicted in Fig. 2.

The output of the model constructed for the jth device, V_2^j, is found by using (7) with V_S = A_i
Fig. 2 Model to examine how an input signal (V_S) is affected by a component with ABCD
parameters of M (Z_S is the impedance of the source generating V_S and Z_L is the impedance of
a test load)
V_2^j = \frac{A_i}{a_j + d_j + b_j Y_L + c_j / Y_L} \qquad (10)
where a_j(k), b_j(k), c_j(k), d_j(k) are the ABCD parameters measured at the frequency of the kth bin. The time-domain output of the model for the jth device's component is then v_2^j = \mathcal{F}^{-1}\{V_2^j\}, where \mathcal{F}^{-1}\{\cdot\} denotes the inverse Fourier transform. As each bin is of finite width, it is not possible to have exact ABCD parameters; interpolation between measured parameters, to fill in for unmeasured frequencies, and/or the addition of multiple frequencies, to account for the frequencies contained within a bin, must therefore be used.
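A sketch of this frequency-domain procedure is given below. The per-bin parameter arrays, the use of the real FFT, and the absence of interpolation are simplifying assumptions made for illustration.

```python
# Sketch of applying per-bin ABCD parameters to an idealised input signal
# A_i, following (10), and returning the time-domain output via the
# inverse FFT. The arrays a, b, c, d are assumed to hold one parameter set
# per FFT bin (no interpolation is performed here).
import numpy as np

def model_output(ideal_signal, a, b, c, d, Y_L):
    A_i = np.fft.rfft(ideal_signal)                  # spectrum of the input
    V2 = A_i / (a + d + b * Y_L + c / Y_L)           # Eq. (10), applied per bin
    return np.fft.irfft(V2, n=len(ideal_signal))     # time-domain output
```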
Significance
Identity
4 Conclusion
References
1. Kumar, S., Suresh, K.K.S.: Electric Circuits and Networks. Pearson Education (2009)
2. Weber, R.J.: Introduction to Microwave Circuits: Radio Frequency and Design Applications.
IEEE Press (2001)
Crypto-Based Methods and Fingerprints
Abstract Device fingerprints primarily provide underlying seeds and keys for the
cryptographic operations of authentication and secret key generation. In this chapter,
we present techniques and technologies to use this information with cryptography. We
also present cryptographic functions derived from authentication and key generation
that use device fingerprints.
1 Introduction
Device fingerprints serve two primary functions in cryptography. Those functions are
authentication and key generation. All other cryptographic operations (e.g., secure
hash functions, certified code generation) involving fingerprints are derived from
these.
1.1 Authentication
Because device fingerprints are unique to a device, they provide positive identifica-
tion of the device. This proof of identity authenticates the device to a system. For
most applications, it is not sufficient to repeatedly present the fingerprint to a system
for authentication because that value may be intercepted and copied by an attacker.
This issue is resolved with challenge-response models. In one model, the system
mathematically derives the response a device will return given a specific challenge.
In another model, the manufacturer or a trusted third party applies a very large number
of random challenges to the device before it is placed into service. The device com-
bines its fingerprint with challenges to generate responses. The manufacturer stores
these challenge-response pairs in a database for later retrieval. A system authen-
ticates a device by sending a challenge to it and verifying the response it returns
either against a derived response or against the challenge-response pair stored in the
database.
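As a concrete illustration of the database model, the sketch below enrolls a device by recording challenge-response pairs and later verifies it by replaying one of them. The function names and the use of 16-byte random challenges are illustrative assumptions, not part of any particular system.

import secrets

def enroll(device_respond, num_pairs=1000):
    # Performed by the manufacturer or trusted third party before deployment:
    # apply many random challenges and record the device's responses.
    database = {}
    for _ in range(num_pairs):
        challenge = secrets.token_bytes(16)
        database[challenge] = device_respond(challenge)
    return database

def authenticate(device_respond, database):
    # Pick a previously unused challenge and compare the device's answer
    # against the stored response; the pair is discarded after use.
    challenge, expected = database.popitem()
    return device_respond(challenge) == expected

Discarding each pair after a single use keeps a recorded response from being replayed by an attacker.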
In contrast to authentication, a device keeps its fingerprint secret for key genera-
tion applications. Because the key is kept secret, a device may repeatedly use its
fingerprint for cryptographic key operations. Since a fingerprint is an intrinsic char-
acteristic of a device, it is not explicitly stored on the device. This makes it difficult
to forcibly extract the key. It also makes the fingerprint difficult to clone. Many cryp-
tographic algorithms require keys of specific lengths or which have certain mathe-
matical properties. If a device fingerprint does not natively have these properties, a
hash is performed on the fingerprint to map it to a key appropriate for the algorithm
[42].
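For example, a fingerprint of arbitrary length can be reduced to a fixed-length key with a standard hash; the minimal sketch below assumes Python's hashlib and an illustrative fingerprint value, and maps a raw fingerprint to a 128-bit key.

import hashlib

def fingerprint_to_key(fingerprint: bytes, key_len: int = 16) -> bytes:
    # Hash the raw fingerprint and truncate to the key length the algorithm expects.
    return hashlib.sha256(fingerprint).digest()[:key_len]

key = fingerprint_to_key(b"\x9a\x01\x7f raw PUF response bits")  # e.g. a 16-byte AES-128 key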
A symmetric or pre-shared key application uses a common secret key to encrypt and
decrypt data at both sides of the exchange. Since a robust fingerprint is unique to a
single device, only the fingerprinted side of the exchange has intrinsic, secure access
to the shared secret key; the other side must store a copy of the fingerprint in memory,
which may not be secure. Consequently, device
fingerprints are often used in scenarios where one side of the exchange is in a secure
location but the other side is not. The unsecured side does not need to store the secret
key because it is intrinsic to the device. It is possible to implement cryptographic
protocols so that both sides of an exchange can determine a shared secret key based
on challenge-response pairs over an unsecured channel. In this case, both sides of
the exchange may be located in unsecured environments.
Asymmetric key cryptography consists of a public and private key pair. In this type
of cryptographic operation, the fingerprint of a device serves as, or is hashed to derive, the private key. Device
fingerprints can then be used in asymmetric cryptographic algorithms and operations
such as RSA, Diffie-Hellman key exchange, and digital signatures.
2 Techniques
Device fingerprints are an active area of research and few commercial devices make
use of them at the time of this writing. As the area evolves, more systems will benefit
from their use. This section describes how fingerprints may be used in practice.
The multiplexer-based arbiter PUF [12, 42] implementation applies an input chal-
lenge sequence to the select inputs of a series of multiplexers. The device outputs
the response to the challenge. The implementation generates a single bit of output in
the following manner. The inputs to two 2-to-1 multiplexers are tied high. The device
applies a single bit from the challenge to the multiplexer select input. This creates a
race condition where the timing of the outputs of the multiplexers is dependent on
the lengths of the wires and the switching time of the transistors. The delays will
vary between microchips because of manufacturing imperfections. This is one stage
of the arbiter circuit. Signals pass through a series of stages to create the delay path
of the output bit. Successive bits of the challenge are applied to the select inputs of
the series of multiplexers. In the final stage of the circuit, one multiplexer output is
tied to the data input of a D flip-flop. The other multiplexer output is tied to the clock
input of the flip-flop. If the data path bit arrives first, the output bit of the arbiter
circuit is a 1. If the clock path bit arrives first, it is a 0. Multiple instances of the
circuit are combined to generate an output response of any desired length.
For example, consider the eight-bit challenge 10110100 applied to an arbiter
where the least significant bit is bit number zero. The device breaks the challenge
into two four-bit challenges to generate two bits of output. It applies challenge bits
seven through four to one sequence of multiplexers (circuit A) and it simultaneously
applies bits three through zero to a separate sequence of multiplexers (circuit B).
Dividing the challenge in this manner serves two purposes. First, it creates a more
robust fingerprint because each sequence of multiplexers will have different wire
delays. Second, it allows the response bits to be generated in parallel, decreasing the
amount of time required to generate the response.
The high bit of the challenge has a value of 1 which causes the signals to travel
through the high inputs of the multiplexers in the first stage of circuit A. The next
bit of the challenge has a value of 0 and the signals travel through the low inputs of
the multiplexers. The next two bits of the challenge both have values of 1 and the
signals travel through the high inputs of the third and fourth stages. Because the path
lengths differ and because the switching time of the transistors in the multiplexers
differ, the signals will arrive at the D-flip flop at different times. If the D-input signal
arrives before the clock input, the most significant response bit will have a value
of 1. If the clock input arrives before the D-input signal, the most significant response
bit will have a value of 0.
The low order bits of the challenge cause the signals to flow in a similar manner
through circuit B. Challenge bit three has a value of 0 which causes the signals to
flow through the low inputs of the multiplexers. Challenge bit two has a value of 1
and the signals flow through the high inputs of the next stage. Bits one and zero of
the challenge both have values of 0 and the signals flow through the low inputs of
the third and fourth stages. The race between the D-input and the clock input paths
to the D flip-flop determines the value of the least significant response bit in the same
manner as circuit A determines the value of the most significant response bit.
The PUF concatenates the output bits of circuits A and B to create the response to
the challenge. In an actual implementation, a challenge is typically thousands of bits
in length. Each series of multiplexers consists of hundreds or thousands of stages.
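The race can be modelled numerically: each stage contributes a small, manufacturing-dependent delay to the path selected by its challenge bit, and the sign of the accumulated delay difference at the arbiter yields the response bit. The sketch below is a deliberately simplified model with Gaussian per-stage delays; all numbers and names are illustrative, not taken from any real device.

import random

def make_arbiter_puf(stages=64, seed=0):
    # Fixed per-chip delay pairs, standing in for wire-length and transistor variation.
    rng = random.Random(seed)
    stage_delays = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(stages)]
    def respond(challenge_bits):
        # diff accumulates the arrival-time difference between the two racing signals;
        # each stage contributes one of two offsets depending on its challenge bit.
        diff = 0.0
        for bit, (d_low, d_high) in zip(challenge_bits, stage_delays):
            diff += d_high if bit else d_low
        return 1 if diff > 0 else 0   # which signal reached the arbiter first
    return respond

puf = make_arbiter_puf(seed=42)
print(puf([1, 0, 1, 1, 0, 1, 0, 0] * 8))   # one response bit for a 64-bit challenge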
In a delay-based PUF, the output of an AND gate transitions from low to high when both paths have completed. The
response is the amount of time taken between applying the challenge to the input
and the transition of the output of the AND gate from low to high.
As an example, assume a PUF consists of a series of 2-input by 2-output switches.
Each switch either routes its two input signals straight through if the select input is
0 or routes them to the opposite outputs if the value is 1. The device applies the
challenge 0101 to the delay-based circuit. The pseudorandom function permutes
the input and applies the sequence 1001 to the switches. After some setup time,
the control logic applies the input transition to both inputs of the first switch at the
same time. The pseudorandom function applies a switch select value of 1 to the first
stage of the circuit. The switch routes the input signals to the opposite outputs. The
pseudorandom function applies select values of 0 to the next two stages of the circuit
and the switches route the signals straight through to their respective outputs. The
pseudorandom function applies a select value of 1 to the final stage and the signals
are routed to the opposite outputs of the switch. The final stage feeds its output to
the inputs of an AND gate. The gate transitions from low to high when both signal
paths arrive at the gate. The device fingerprint is the amount of time taken from the
input transition to the output transition.
Each unique challenge produces a different delay through the circuit since the path
lengths and delays of individual wires vary. The lengths and delays vary between
identical microchips because of manufacturing imperfections.
It is possible to implement delay-based PUFs with any suitable self-timed circuit.
For example, consider a system using a self-timed floating point processor [31] as
shown in Fig. 1. There is no system clock in the circuit. Control signals flow through
the self-timed circuit based on delay elements rather than a system clock. A device
operates this type of self-timed circuit by first applying the data, and then by sending
a pulse on the request input. When the self-timed circuit has completed its operation,
it sends a pulse on the acknowledge output. The challenge to this system consists of
Fig. 1 Self-timed delay. A delay-based PUF implemented with a self-timed circuit. The response
is the time taken to compute the output of the floating point operation
the operands and function of the floating point operation such as addition or division.
The response is not the result of the computation because it is a predictable value.
Instead, the response is the time taken to compute the result. The floating point result
is discarded.
Another type of PUF creates fingerprints from the difference in frequencies of logi-
cally identical ring oscillators [12, 42] or other types of astable multivibrators. Phys-
ical realizations commonly use ring oscillators such as those shown in Fig. 2 because
they do not require mixed analog and digital fabrication. A circuit may replace the
first inverter in the chain with a NAND gate to enable or disable the oscillator based
on an external logic signal. Any odd number of inverting logic gates produces a
square wave output. The circuit designer adjusts the frequency of the waveform by
adding or removing inverters [28]. Adding logic gates to the ring increases the time
it takes for the signal to propagate through the circuit resulting in a longer period and
lower frequency.
The oscillator circuit is initially not powered and the output of each inverter has a
logic value of 0. When power is applied, noise causes the output of the transistors of
the inverters to be some small voltage value. The inverting amplifier in the inverters
will invert the voltage and increase its magnitude. This process continues through the
inverter chain until the value is fed back into the loop and oscillations begin. It takes
time for the signal to propagate through the chain because each inverter introduces
Fig. 2 Ring oscillators. PUF implementations use ring oscillators such as these. Any odd number
of inverting logic gates produces a square wave output. More gates result in more delay through the
circuit and lower frequency
some delay. Eventually, the voltage will be amplified to a logic value of 0 or 1. This
amplification combined with the propagation delay results in a square wave at the
output of the circuit. The period of the square wave is proportional to the number of
inverters in the chain.
Physical realizations commonly use macro blocks to create logically identical
oscillator circuits on a microchip. Despite this, logically identical blocks produce
slightly different frequencies. The primary contributor to this phenomenon is man-
ufacturing variation. It is this variation which allows the circuit to compare pairs of
logically identical oscillators to produce unique fingerprints per device.
The circuit generates output response bits by comparing the frequencies of logi-
cally identical pairs of self oscillating loops. The challenge consists of a bit pattern
that selects the pairs of self oscillating loops to be compared. A bit in the response
indicates which self oscillating loop in the respective pair has a higher frequency.
Consider a simplified implementation consisting of four oscillators (0 through 3),
two multiplexers (A and B), and two counters (A and B). Oscillator 0 is tied to the
0 input of both multiplexers. Similarly, oscillator 1 is tied to the 1 input, oscillator
2 to input 2, and oscillator 3 to input 3. The outputs of multiplexers A and B are
sent to counters A and B, respectively. The challenge is 4 bits in length. The two
least significant bits are tied to the select input of multiplexer A and the two most
significant bits to the select input of multiplexer B. The challenge selects which
oscillators are compared. For example, a challenge of 1000 selects oscillator 0 from
multiplexer A and oscillator 2 from multiplexer B. The circuit resets the counters and
then allows them to run until the frequency of each selected oscillator is determined.
The comparator compares the counter values to determine which oscillator has the
higher frequency. If the frequency of the oscillator selected by multiplexer B is higher
than that of the oscillator selected by multiplexer A, the output is a 1. Otherwise, the output is a 0. The output is a single bit in the
response.
In an actual implementation, there are hundreds or thousands of oscillators and
the challenge is much longer. The multiplexers accept more inputs and there are
multiple instances of the circuit. The challenge is divided into groups and applied to
the instances of the circuit to generate the response bits.
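The simplified four-oscillator circuit described above can be sketched as follows; the nominal frequency, the spread used to stand in for manufacturing variation, and the function names are all illustrative assumptions.

import random

def make_ro_puf(num_osc=4, nominal_hz=200e6, sigma_hz=50e3, seed=1):
    # Per-device oscillator frequencies: logically identical, physically slightly different.
    rng = random.Random(seed)
    freqs = [rng.gauss(nominal_hz, sigma_hz) for _ in range(num_osc)]
    def respond(challenge):
        # Low two bits select the oscillator for counter A, high two bits for counter B.
        sel_a = challenge & 0b11
        sel_b = (challenge >> 2) & 0b11
        return 1 if freqs[sel_b] > freqs[sel_a] else 0
    return respond

puf = make_ro_puf(seed=7)
print(puf(0b1000))   # compares oscillator 2 (via multiplexer B) against oscillator 0 (via A)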
2.1.4 Noise
For cryptographic operations such as secret key generation, the response to a chal-
lenge must be repeatable. Noise caused by voltage fluctuations or temperature
changes can cause bits in the response to change. Error correcting codes are used to
compensate for these effects [11]. The manufacturer or a trusted third party places
the device in ideal environmental conditions when challenge-response pairs are ini-
tially generated and recorded in the database. Error correcting syndrome values are
computed during this initialization process. They are stored along with the challenge-
response pairs. The syndrome values do not need to be kept secret, but they do leak a small amount of information about the response.
As a simplified example of syndrome generation with a (6, 3) linear block code, consider a three-bit response d = 101 and the generator matrix G used below. The code word is

$$w = dG = \begin{bmatrix} 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 2 & 1 & 1 \end{bmatrix} \bmod 2 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 1 \end{bmatrix}.$$
The first three bits of the code word, 101, are the response and the last three bits,
011, are the syndrome values. The manufacturer records the response and syndrome
values in the database along with the challenge c. At a later time a user presents the
syndrome values 011 in addition to the challenge c to the device. The device uses
the syndrome values to correct bit errors in the response before using the response
for cryptographic operations. Assume the device incorrectly produces a response
of 111 to challenge c because of noise. Before using the response in cryptographic
operations, the device performs error correction to determine the correct value 101.
To perform the error correction, the device first generates the vector r by concate-
nating the uncorrected response 111 with the syndrome 011 supplied by the user. It
then multiplies the vector r by the syndrome matrix S to produce an error estimate d.
$$d = rS = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 3 & 3 \end{bmatrix} \bmod 2 = \begin{bmatrix} 0 & 1 & 1 \end{bmatrix}$$
An estimate d of all zeros indicates no bit errors within the tolerance of the block
code. Since the estimate is not all zeros, the device performs an error correction step
on the response before producing the final output. It matches the estimate against
the rows in the syndrome matrix S. The matching row indicates the bit position of
the error. In this example, the estimate 011 matches the second row in the syndrome
matrix. The device knows that the second bit of the response 111 is in error. It then
inverts the second bit to produce the correct response of 101.
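The worked example can be checked mechanically. The sketch below, assuming NumPy is available, reuses the generator matrix G and syndrome matrix S shown above to encode the response, detect the flipped bit, and correct it.

import numpy as np

G = np.array([[1, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 1, 1],
              [0, 0, 1, 1, 1, 0]])
S = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

d = np.array([1, 0, 1])                  # enrolled response
w = d @ G % 2                            # code word: response bits 101, syndrome 011

noisy = np.array([1, 1, 1])              # response reproduced with one bit error
r = np.concatenate([noisy, w[3:]])       # append the stored syndrome values
est = r @ S % 2                          # error estimate: [0 1 1]
if est.any():
    # A nonzero estimate matches a row of S; the row index is the erroneous bit.
    error_pos = next(i for i, row in enumerate(S) if np.array_equal(row, est))
    noisy[error_pos] ^= 1
print(noisy)                             # corrected response: [1 0 1]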
Ring oscillator implementations are susceptible to environmental variation
because the frequency changes with temperature. This is particularly problematic
when the temperature in one area of a microchip changes faster than in another. In
this situation, the frequency of one oscillator changes faster than the frequency of
another and the relationship between the frequencies may be reversed resulting in
bit errors [42].
Consider the system physically arranged as shown in Fig. 3. During normal operat-
ing conditions, the left side of the cryptographic device has slightly elevated temper-
ature compared to the right side of the device. The waveforms of oscillators A, B, and
C at normal temperatures are shown on the left side of Fig. 4. During periods of heavy
CPU load, the temperature at oscillator A increases more than at oscillators B or C
because oscillator A is physically located near the CPU which is generating heat. The
waveforms during these times are shown at the center of Fig. 4. As the temperature at
oscillator A increases, its frequency decreases. The frequencies of oscillators B and
C decrease also but to a lesser degree. The consequence is that oscillator A is now
lower in frequency than oscillator B. Comparing these two oscillators during periods
of high CPU activity generates a different output than when measured at periods of
low CPU activity. During periods of heavy graphics operations the temperature at
Fig. 3 Temperature variation. Physical arrangement of this system demonstrates the effects of temperature variation on oscillators in a cryptographic device
2.1.5 Privacy
As with any type of hardware identifier, device fingerprints raise privacy concerns
[10, 34, 38]. For example, in the case of tracking applications users may not want
to be identified by their devices. One solution to this is to allow a user to provide
an additional parameter that is combined with the challenge to produce the response
returned by the device. The additional parameter is called a personality [10, 11].
The device combines the personality with the challenge before they are applied
to the PUF. Consequently, personalities must be part of the initialization process and
recorded along with the challenge-response pairs in the database. The PUF appears
to be a different cryptographic device depending on which personality is applied
with the challenge. This prevents applications from collaborating to identify a user
based on the device fingerprint.
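One way to realize this, sketched below, is simply to hash the personality together with the challenge before the value reaches the PUF; the puf argument stands in for the hardware and the function name is illustrative.

import hashlib

def respond_with_personality(puf, challenge: bytes, personality: bytes) -> bytes:
    # The PUF sees a value that depends on both the challenge and the personality,
    # so different personalities make the same hardware appear to be different devices.
    mixed = hashlib.sha256(personality + challenge).digest()
    return puf(mixed)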
2.1.6 Improvements
2.2.1 Primitives
There are two cryptographic primitives on a CPUF. The first primitive generates
a response to a challenge. The device computes the response by first hashing the
challenge and the submitted program. The user codes a literal consisting of the
challenge into the program so that a hash of the same program with a different
challenge produces different results. The device submits the result of the hash to the
PUF portion of the hardware to obtain the response. This process is shown at the top
of Fig. 5. The second primitive generates a secret key. The device generates a secret
key by first submitting a challenge to the PUF to obtain a response. It then hashes
the response with the program to create the secret key. As with response generation,
the user codes a literal of the challenge into the program to create a unique hash. The
secret key function is shown at the bottom of Fig. 5.
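Expressed as hash compositions, the two primitives look roughly as follows; puf stands in for the physical function, program_hash for a hash of the submitted program, and the helper names are illustrative.

import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def generate_response(puf, challenge: bytes, program_hash: bytes) -> bytes:
    # Response primitive: hash the challenge with the program, then query the PUF.
    return puf(h(challenge, program_hash))

def generate_secret(puf, challenge: bytes, program_hash: bytes) -> bytes:
    # Secret-key primitive: query the PUF with the challenge, then hash with the program.
    return h(puf(challenge), program_hash)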
Fig. 5 CPUF primitives. Response generation generate_response (top) and secret key generation
generate_secret (bottom) cryptographic primitives of a CPUF
2.2.2 Initialization
Initial challenge-response pairs must be obtained over a secure channel. The man-
ufacturer or trusted third party obtains the initial pairs by submitting a program to
the device that contains a function call to the primitive that generates a response.
For example, to obtain the response to challenge 123, the manufacturer submits the
following program to the device.
{
response = generate_response(123);
return response;
}
A user establishes a shared secret key with a device by submitting a program similar to
that shown in the example below. Suppose the manufacturer has given the challenge-
response pair (123, 456) to the user. Further suppose the hash of the result 456 and a
hash of the program below is 789. The device computes the secret by first applying
123 to the PUF to obtain the response 456. It then hashes 456 with a hash of the
program to obtain the secret 789. Because the user already knows the response to the
challenge, which is the output of the PUF, she can compute the same secret herself:
she hashes the known response 456 with a hash of the program to obtain 789.
{
key = generate_secret(123);
}
A user should select previously unused challenge-response pairs for each program
execution to avoid leaking information to an attacker. The user obtains an initial
challenge-response pair from the manufacturer over a secure channel. The manu-
facturer need provide only a single pair to the user. The user may obtain additional
challenge-response pairs directly from the device over an unsecured channel.
Suppose the user knows the challenge-response pair (123, 456) and wants to
obtain a new challenge-response pair. The user selects an arbitrary number, 987
in this example. The arbitrary number is not the challenge, but it is used later to
determine the challenge. To obtain the new response, the user submits the program
below.
{
new_response = generate_response(987);
key = generate_secret(123);
enc_response = encrypt(new_response, key);
mac = create_mac(new_response, key);
return enc_response, mac;
}
Fig. 6 Challenge-response pair generation. The device executes the program to create a new
challenge-response pair. Numbers from the example are in parentheses. The functions inside the
dashed line are computed by the user to obtain the new challenge
The MAC allows the user to verify that the result came from the device and not
from an attacker as a result of a person-in-the-middle attack. An attacker may attempt
to modify the old challenge to obtain the new response, but this attempt fails because
the new response and the secret key are partially derived from a hash of the program.
Modifying the old challenge creates a different hash.
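On the user's side, recovering the new pair amounts to recomputing the shared key from the known pair and the program hash, verifying the MAC, and decrypting. The sketch below uses stdlib HMAC and a keystream XOR purely as stand-ins for whatever cipher and MAC the device actually implements; all names are illustrative.

import hashlib, hmac

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def recover_new_response(enc_response, mac, known_response, program_hash):
    # The key the device used is hash(PUF(old challenge), program) = hash(known response, program).
    key = h(known_response, program_hash)
    stream = h(key, b"enc")                            # stand-in keystream for the cipher
    new_response = bytes(a ^ b for a, b in zip(enc_response, stream))
    expected_mac = hmac.new(key, new_response, hashlib.sha256).digest()
    if not hmac.compare_digest(expected_mac, mac):
        raise ValueError("MAC mismatch: response did not come from the device")
    # Per Fig. 6, the new challenge itself is computed by the user from the
    # arbitrary number and a hash of the program.
    return new_response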
{
key = generate_secret(123);
result = perform_some_computation();
mac = create_mac(result, key);
return result, mac;
}
The program generates a shared secret key derived from the challenge 123. Then
the program executes the user's computation. The device returns the result of the
computation along with a MAC of the result. The user computes the shared secret as
described in Sect. 2.2.3 and computes the MAC of the result. If the MAC returned
by the device matches the MAC computed by the user, then the computation was
performed on the intended device.
2.3.1 Definition
Clock skew is defined as the difference between the frequencies of two clocks [29].
The frequency of a clock is the rate at which the clock progresses. Formally, if
Ct (t) = t is true clock time, then the offset of clock Ca (t) is the difference between
the time reported by that clock and the time reported by Ct (t). Similarly, the offset of
clock Ca (t) with respect to clock Cb (t) is the difference between the time reported
by clock Ca (t) and the time reported by clock Cb (t). The offset Oa (t) of clock Ca (t)
is given by
$$O_a(t) = C_a(t) - C_t(t).$$
The offset O_ab(t) of clock C_a(t) with respect to clock C_b(t) is given by

$$O_{ab}(t) = C_a(t) - C_b(t).$$
The frequency f_a of clock C_a(t) is the first derivative of the clock with respect to
time. It is given by

$$f_a = \frac{d}{dt} C_a(t).$$
The skew s_ab of clock C_a(t) with respect to clock C_b(t) is the difference between
their respective frequencies. It is given by

$$s_{ab} = \frac{d}{dt} C_a(t) - \frac{d}{dt} C_b(t) = f_a - f_b.$$
Clock skew cannot be directly measured. To estimate the clock skew, an algorithm
compares the clock of the device being fingerprinted to the clock of the measuring
device, or fingerprinter. The clock of the device being fingerprinted is Ca (t) and
the clock of the fingerprinter is Cb (t). Clock offset data points are plotted and an
algorithm fits a line to the plot. The slope of the line is the estimate of the clock skew.
2.3.2 Estimation
Two common methods of estimating the line are a method based on linear program-
ming and a least squares fit [22, 29, 46]. These techniques are illustrated in Fig. 7. The
solid line is the clock skew estimate computed with the linear programming method.
The line is above all data points. The dashed line is the clock skew estimate when
computed with least squares fit. The line passes through the center of the data points.
Clock offset measurements have some degree of variability from sample to sam-
ple. A line cannot therefore be drawn from the origin to an arbitrary data point to
approximate the data set. The linear programming approach is to fit a line above
all data points [29]. The linear programming method fits a line by minimizing the
objective function
$$z = \frac{1}{N} \sum_{i=1}^{N} (m x_i + b - y_i)$$
Fig. 7 Clock skew estimation. The slope of the line is the clock skew. The solid line is the clock
skew computed with the linear programming method. The dashed line is the clock skew computed
with least squares fit
with constraints
$$m x_i + b \geq y_i, \quad i = 1, \ldots, N.$$
In these equations, N is the number of data points, xi is the clock offset of the
device, yi is the clock offset of the fingerprinter, m is the slope of the line, and b is
the y-intercept of the line. Outputs of the linear programming method are the slope,
m, and y-intercept, b. The slope is the fingerprint.
Least squares fit minimizes the sum of the squares of the errors between the actual
data points and the estimate of the data points calculated with the line that is fit to
the data [44]. Formally, the function
$$z = \sum_{i=1}^{N} \left(y_i - (m x_i + b)\right)^2$$
is minimized.
As in the linear programming method, N is the number of data points, xi is the
clock offset of the device, yi is the clock offset of the fingerprinter, m is the slope of
the line, and b is the y-intercept of the line. The least squares fit method determines
the slope, m, and the y-intercept, b, of the line. The slope is the fingerprint.
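Both estimators are short with standard numerical tooling. The sketch below assumes NumPy and SciPy are available; the offset samples x (device) and y (fingerprinter) would come from the timestamp exchanges described in the next subsection.

import numpy as np
from scipy.optimize import linprog

def skew_least_squares(x, y):
    # Ordinary least squares: minimize sum (y_i - (m x_i + b))^2; the slope m is the skew.
    m, b = np.polyfit(x, y, 1)
    return m, b

def skew_linear_programming(x, y):
    # Minimize (1/N) sum (m x_i + b - y_i) subject to m x_i + b >= y_i for all i,
    # i.e., the fitted line lies on or above every data point.
    x, y = np.asarray(x, float), np.asarray(y, float)
    c = [x.mean(), 1.0]                                # objective reduces to m*mean(x) + b (plus a constant)
    A_ub = np.column_stack([-x, -np.ones_like(x)])     # -m*x_i - b <= -y_i
    res = linprog(c, A_ub=A_ub, b_ub=-y, bounds=[(None, None), (None, None)])
    m, b = res.x
    return m, b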
Systems use clock skew in a number of ways to identify a node on either wired
or wireless networks. Systems exploit TCP timestamps or ICMP timestamps over
IEEE 802.11 and wired networks to obtain the clock skew of nodes over multiple
hops [23]. They can exploit the Time Synchronization Function in the beacon probe
response frames of the IEEE 802.11 protocol to obtain time stamp information in
order to detect fake access points [22]. Timestamps are not readily available in
some types of networks such as the IEEE 802.15.4 wireless sensor node network
[18]. In these situations, systems employ additional protocols such as the Flooding
Time Synchronization Protocol to capture clock skew information [17]. In a beacon-
enabled IEEE 802.15.4 network, a coordinating node periodically broadcasts beacon
frames. The duration between transmissions is precisely timed because other devices
in the network rely on this timing information to determine when to transmit data.
Systems can use the time between beacon frames to compute the clock skew of the
coordinating node [32].
Regardless of the technique employed to obtain clock skew information, the pro-
cedure is similar. The system either sends a probe request to the device being fin-
gerprinted or monitors a periodic transmission from the device. In either case, the
device being fingerprinted transmits timing information. The system uses this infor-
mation to estimate the clock skew of the device. The clock skew authenticates the
device to the network. Figure 8 depicts a fingerprinter requesting timing information
from the device being fingerprinted. The device responds with a packet containing
the timestamp required for the fingerprinter to calculate the clock skew. Figure 9
illustrates a coordinating device periodically broadcasting precisely timed beacon
frames.
Fig. 8 Timestamp information. The fingerprinter obtains timestamp information from the device
Fig. 9 Beacon-enabled IEEE 802.15.4 network. The coordinator periodically broadcasts framing information. Broadcasts are precisely timed
Hidden servers connect to the Tor network to advertise a service and clients request services from the
Tor network. The Tor network acts as a rendezvous point so that clients and servers
are hidden from each other. A server typically has a public IP address through which it
accesses the Tor network. Assume an attacker determines a range of IP addresses for
all servers in the Tor network but does not know which server is running a particular
service. The attacker's goal is to find the public IP address of the server. To perform
the attack, the attacker requests the service at specific intervals, which increases the
load on the CPU of the server running the service. This induced load pattern affects
the temperature of the CPU. The attacker additionally requests TCP timestamps of all
servers through their public IP addresses to monitor each server's clock skew. Clocks
are affected by temperature and the clock skew of the attacked server will have high
correlation with the induced load pattern. A high correlation between load pattern
and a hidden server's clock skew authenticates the hidden server to the attacker. The
attacker uses this technique to determine the public IP address of the server running
the targeted service.
Suppose an attacker wants to know which hidden server in a Tor network with
three hidden servers contains a certain file that is 100 MB in length. The objective is
to determine the public Internet IP address of the server hosting the file. The attacker
repeatedly downloads the file at certain times from the hidden server through the Tor
network. This induces a load pattern on the server affecting the temperature of the
CPU and consequently the clock skew patterns of that server. Multiple downloads
can be made simultaneously to further increase load. The attacker requests TCP
timestamps from all candidate hidden servers and saves the results. The attacker
performs an analysis on the TCP timestamps to determine the clock skew patterns of
each server. If the attacker detects a correlation between the change in clock skews
and the induced load pattern, the hidden server has been located and the attack is
successful.
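The final decision reduces to a correlation test between the induced load pattern and each candidate's skew trace; a minimal sketch is given below, with NumPy assumed and the names, data layout, and threshold purely illustrative.

import numpy as np

def locate_hidden_server(load_pattern, skew_traces, threshold=0.8):
    # load_pattern: 1/0 per measurement interval (downloading or idle).
    # skew_traces: {public_ip: per-interval clock-skew estimate from TCP timestamps}.
    scores = {ip: np.corrcoef(load_pattern, trace)[0, 1]
              for ip, trace in skew_traces.items()}
    best_ip = max(scores, key=scores.get)
    return best_ip if scores[best_ip] > threshold else None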
One possible countermeasure to the attack is to maintain a constant clock skew on
the hidden servers. One way to accomplish this is to maintain a constant maximum
load on the hidden server. The server runs a process that detects when the system is
not running at full capacity and forces additional load on the server to return it to full
capacity. The benefit of this method is that no changes to the hardware are required.
The drawback is that it places unnecessary load on the system and wastes energy.
The wasted energy generates additional heat that must be expelled by the cooling
system. In a server room, the additional heat generated by multiple servers may be
significant and may increase energy cost. An alternative is to use different types
of oscillators such as oven controlled crystal oscillators (OCXOs). The drawback
to this is the expense of the oscillator and the expense of modifying the hardware
to use the oscillator. Another countermeasure is to obscure or prevent access to the
timing information of the hidden server. Extensive changes may be required to system
software to hide timing caused by low level events such as timer interrupts.
2.3.5 Pitfalls
Crystal oscillators are affected by changes in the environment such as temperature and
humidity. Because most devices are manufactured with crystal oscillators, clock skew
is affected by these changes. Of these effects, temperature has the greatest impact
[11, 30, 46]. However, the clock skew of a crystal oscillator varies incrementally
over time in response to a change in temperature [46]. Authentication systems allow
for these variations when authenticating a device by permitting some variance in the
skew of the device. It is possible for the fingerprinter in a network intrusion detection
system to dynamically adjust its estimate of the clock skew of a device as long as the
change in skew does not exceed an appropriate threshold [22]. Tracking systems use
thresholds to identify devices. As long as the estimated clock skew remains within
lower and upper thresholds, the tracking system knows it is monitoring a single
device.
Difficulties are created by time synchronization protocols. When the clock of
a device is synchronized to a global source, the timestamp jumps by a significant
amount. This causes a corresponding jump in the clock skew that may exceed the
Fig. 10 Time synchronization. Authentication systems use thresholds to allow for variance. Large
jumps in clock skew are ignored because they are most likely caused by time synchronization
allowed threshold [22]. Tracking and authentication systems can account for this by
ignoring large jumps in clock skew. This is illustrated in Fig. 10. Similar problems
occur if a device switches to an alternate power source. If a laptop computer, for
instance, is switched from A/C power to battery power, the operating system may
select a different clock source which affects the device characteristics including clock
skew [42].
the equation for the period, 2 is a negative quantity. A measuring device such as a
real-time spectrum analyzer must be capable of higher frequency and resolution than
that of the signal from the device being fingerprinted to detect the frequency error.
Time domain methods analyze characteristics of a signal with respect to time.
These methods include those listed below [13, 15, 38].
Transients: Analysis of the initial waveform of a signal in which the amplitude
rises from channel noise to full power.
Amplitude: Differences in the overall shape of the envelope of the signal.
Nulls: Time offset of the locations of low signal values.
Power: Transmission power fluctuation patterns over time.
Frequency domain analysis typically analyzes similar components to time domain
analysis, but does so in the frequency domain by applying Fourier or similar trans-
forms to the waveforms [4, 38].
Fig. 12 Key extraction. System block diagram of the key extraction procedure
the received signals [35]. Environments that change over time lead to time varying
multipath effects. Because of these effects, the characteristics of the received signal
change over time and by location. Since fading channels such as this tend to be
symmetric between communicating nodes, both sides of the exchange can extract
the same information about the communications link [45]. Communicating nodes
use bits extracted from the link between them as a secret key. Receivers at other
locations cannot extract the same information mainly because of multipath effects
but also because interference patterns and signal-to-noise ratios change by location.
One metric commonly available to off-the-shelf wireless components is the
received signal strength indicator (RSSI). This is an indication of the strength of
the signal received from the transmitter. Typical transceivers are half-duplex and
cannot send and receive simultaneously. Consequently, the nodes cannot measure
the strength of the signal from the other at exactly the same time. Because of this
and because of hardware imperfections and limitations, the detected RSSI is not
completely symmetric [21]. This results in some of the bits of the link being inter-
preted differently by each side of the communication. These bit differences must be
resolved before they can be used as a shared secret key. The most common method of
compensating for the differences is to use a system that employs information recon-
ciliation to correct unmatched bits and privacy amplification to prevent information
about the channel characteristics from being exposed to an attacker [3]. A system
block diagram of key extraction is shown in Fig. 12.
The transmitter first permutes the data stream to use in error correction at the
receiver. The data are then transmitted to the receiver. The receiver quantizes the
RSSI value using an upper and lower threshold. An RSSI value above the upper
threshold receives a bit value of 1. An RSSI value below the lower threshold is
given a bit value of 0. Any RSSI value between the two thresholds is discarded
to allow for measurement inaccuracies. Higher resolution hardware may quantize
the signal into more values to increase the bit extraction rate. After quantization,
the system computes error correction data and feeds the data to the reconciliation
stage to remove bit errors. The reconciled bit stream may contain segments that are
highly correlated because sampling of the received signal may occur more frequently
than changes occur in the channel characteristics. Privacy amplification eliminates
correlation from the data stream by removing certain data bits [19]. This lowers the
secret key bit extraction data rate, but increases the randomness of the resulting key.
The device uses the resulting data stream as the secret key. Because the data is a
continuous stream, devices may refresh keys periodically. Coordination between the
nodes is required to determine when to discard expired key information.
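The dual-threshold quantizer at the heart of the bit-extraction step can be sketched as follows; the thresholds are illustrative, and the index list is returned so that the two sides can later agree on which samples were kept (reconciliation and privacy amplification are assumed to follow).

def quantize_rssi(rssi_samples, lower, upper):
    # Keep a bit only when the sample is decisively above or below the thresholds;
    # samples in between are discarded to tolerate measurement error.
    bits, kept_indices = [], []
    for i, rssi in enumerate(rssi_samples):
        if rssi > upper:
            bits.append(1)
            kept_indices.append(i)
        elif rssi < lower:
            bits.append(0)
            kept_indices.append(i)
    return bits, kept_indices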
To generate an uncorrelated bit stream, there must be frequent changes in the
environment. Key extraction from wireless links works well in environments such
as moving vehicles and crowded areas [2, 21]. It does not work well in static envi-
ronments.
The benefits of wireless link key extraction are most apparent when the tech-
nique is compared to public key cryptography [2, 3, 21, 45]. Public key cryptogra-
phy requires great effort in key distribution and management. With wireless links,
there is no need to maintain keys in central repositories because both sides of the
exchange can extract the same secret key from the link characteristics. Wireless
link key extraction also requires far less computational power than public key cryp-
tography. This is beneficial for devices such as wireless sensor nodes which have
constrained resources.
Data are encoded on optical media such as compact discs (CDs) and digital versatile
discs (DVDs) in concentric rings with a series of lands and pits. Pits are physical
deformations of the disc and lands are the unmodified sections between the pits. A
transition between a land and a pit indicates a logical value of 1. The length of a land
or pit determines the number of consecutive 0 bits. A manufacturer mass produces
a disc by pressing the pits into thermoplastic. A CD burner or DVD burner creates a
disc with a laser that heats the dye on the disc to darken sections corresponding to
the pits. A CD player or DVD player reads a disc by aiming a laser at the disc and
monitoring the reflection. A pit diffuses the light whereas a land reflects the light
without diffusing it.
Figure 13 depicts a close up view of a series of lands and pits on an optical disc.
Manufacturing variation results in lands and pits with widths that are slightly differ-
ent between discs that contain the same data. The differences are within the tolerance
required to correctly reproduce the bit stream. Variation occurs regardless of the method used to create
the disc. In the case of pressing a disc, variation is caused mainly by thermal effects
when the disc is created. In the case of burning a disc, variation is caused by manu-
facturing of the burner. These variations are used to create a certificate of authenticity
for the disc. The certificate is verified by the software on the disc to ensure the disc
is legitimate and has not been pirated [14].
Figure 14 shows one possible method for creating and verifying a certificate of
authenticity with public key cryptography techniques. The manufacturer or a trusted
third party generates the certificate of authenticity. It is created by reading a specific
sequence of sectors distributed across the disc to extract a fingerprint. The manu-
facturer signs the fingerprint with the manufacturer's private key to create a digital
signature of the fingerprint. The digital signature is the certificate of authenticity. A
user purchases the disc and installs the software on a computer. The manufacturer
provides the certificate with the disc. The certificate is stored on the computer along
with the software. When the software is executed, it reads the same sequence of sec-
tors from the optical disc to extract the fingerprint. It also verifies the certificate with
the manufacturer's public key. The additional verification step is performed to ensure
the certificate originated from the manufacturer. If this step is not performed, it is
possible for a counterfeiter to substitute a false manufacturing process to sell copies
of the disc. If the extracted fingerprint matches the verified certificate, the software
continues to execute. If they do not match, the software concludes that the disc is not
an original copy and terminates. In this scheme, the disc must be physically present
for the software to execute.
Over time, the optical disc will degrade or may become scratched. Sufficient care
must be taken in generating the fingerprint to account for this. The fingerprint should
be generated from lands and pits distributed across the disc. Error correction and
fuzzy extraction should be used to extract the fingerprint [14].
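A sketch of the sign-and-verify flow is shown below, using the Ed25519 signature scheme from the Python cryptography package purely as a stand-in for whatever scheme a manufacturer would actually choose; the fingerprint extraction and fuzzy-extraction steps are abstracted into a placeholder byte string.

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Manufacturer side: extract the fingerprint from chosen sectors and sign it.
manufacturer_key = Ed25519PrivateKey.generate()
fingerprint = b"widths of lands and pits read from selected sectors"  # illustrative placeholder
certificate = manufacturer_key.sign(fingerprint)

# Software side: re-read the disc, then check the certificate with the public key.
public_key = manufacturer_key.public_key()
reread_fingerprint = fingerprint  # in reality re-extracted, with error correction and fuzzy extraction
try:
    public_key.verify(certificate, reread_fingerprint)
    print("disc accepted")
except InvalidSignature:
    print("disc rejected")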
The authors of this chapter are investigating a new device fingerprinting technique
called software controlled manifestation of hardware fingerprints. The basic idea is
to execute a program on hardware such as a multicore CPU or GPU that results in
different outputs depending on the processor on which the program runs. Identically
manufactured devices produce different output when executing the same program.
The output is the device fingerprint.
For example, one experiment creates a race condition on a multicore CPU. One
core does not participate in the race because it runs the main controlling program that
starts and monitors races on the remaining cores. A different response is generated
depending on which core acts as the controller. Each remaining core executes exactly
the same code and begins execution at the same time. In the challenge-response
framework, the challenge consists of the program to be executed and the controlling
core number. The response is the execution completion order. The number of unique
fingerprints scales with the number of cores.
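A user-level prototype of the completion-order idea is sketched below with Python threads. Note that in CPython the global interpreter lock means the ordering mostly reflects the scheduler rather than memory arbitration, so the sketch only illustrates the challenge-response shape; a real implementation would pin native threads to specific cores.

import threading

def race_fingerprint(num_workers=4, iterations=200_000):
    order = []
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)
    def worker(core_id):
        barrier.wait()                      # start all workers at the same time
        x = 0
        for _ in range(iterations):         # identical workload on every worker
            x += core_id
        with lock:
            order.append(core_id)           # record completion order
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tuple(order)                     # the response is the completion order

print(race_fingerprint())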
Preliminary experiments suggest that programs involving bias in memory arbitra-
tion tend to create unique fingerprints, but tests that consist of only control structures
and floating point and integer arithmetic do not. We suspect this is because the state of
the art in desktop multicore processors at the time of this writing uses a globally dis-
tributed clock [20]. However, as globally asynchronous locally synchronous (GALS)
systems [5] in which each core has its own clock become popular, the proposed model
will become viable.
3 Tradeoffs
3.1 Benefits
3.2 Drawbacks
Additional hardware real estate may be required to produce a fingerprint. For exam-
ple, in the case of PUFs, a significant amount of real estate is required to implement a
sufficient number of arbiters or oscillators to create a robust set of challenge-response
pairs.
Adoption of device fingerprints into computer systems is an obstacle. While no
modifications to standard cryptographic algorithms are required, extensive changes
to system level software may be needed to supply those algorithms with the device
fingerprint.
4 Summary
Device fingerprints are used with cryptography in two primary ways. Authentication
algorithms use fingerprints to positively identify a device or an individual who is
in possession of a device. Algorithms involving secret or private keys use a device
fingerprint as the key. A fingerprint may be hashed to obtain a key with certain
mathematical properties if required by an algorithm. A fingerprint may be combined
with user input to allow additional features such as multiple secret keys per device.
Once a secret key is derived from a fingerprint, no modifications are necessary to use
the fingerprint in existing cryptographic algorithms.
There are numerous useful and important applications for device fingerprints.
They have applications in criminology, cargo transport, network security, and in
the military to name a few. They are used to create more secure and accurate net-
work intrusion detection systems and to allow a system to determine when a device
becomes defective. One of the more important but less cited applications is deter-
mining friend from foe in military operations.
Device characteristics may be altered by changing environmental conditions. For
example, electron mobility and oscillator frequency vary with change in temperature.
These changes impact fingerprinting methods to varying degrees. Characteristics of
circuitry in different areas on a single microchip may be altered differently with a
change in the environment [42]. Additional measures such as error correction algo-
rithms are incorporated into the device to compensate for changing environmental
conditions.
Device fingerprints are an ongoing and active area of research. Extracting robust
fingerprints from devices is a difficult task. A reliable fingerprint must be hard to
duplicate, resistant to attacks, and immune to environmental effects. A fingerprint
feature must provide a large number of unique fingerprints to accurately and reliably
identify individual devices. As advances are made in the area, combining device
fingerprints with cryptography will become an increasingly important technique.
References
1. Agrawal, D., Baktir, S., Karakoyunlu, D., Rohatgi, P., Sunar, B.: Trojan detection using IC fingerprinting. In: IEEE Symposium on Security and Privacy, pp. 296–310 (2007)
2. Aono, T., Higuchi, K., Ohira, T., Komiyama, B., Sasaoka, H.: Wireless secret key generation exploiting reactance-domain scalar response of multipath fading channels. IEEE Trans. Antennas Propag. 53(11), 3776–3784 (2005)
3. Azimi-Sadjadi, B., Kiayias, A., Mercado, A., Yener, B.: Robust key generation from signal envelopes in wireless networks. In: CCS '07: Proceedings of the 14th ACM Conference on Computer and Communications Security, pp. 401–410 (2007)
4. Brik, V., Banerjee, S., Gruteser, M., Oh, S.: Wireless device identification with radiometric signatures. In: MobiCom '08: Proceedings of the 14th ACM International Conference on Mobile Computing and Networking, New York, NY, USA, pp. 116–127 (2008)
5. Chapiro, D.M.: Globally-Asynchronous Locally-Synchronous Systems (Performance, Reliability, Digital). Ph.D. thesis, Stanford University (1985)
6. Danev, B., Capkun, S.: Transient-based identification of wireless sensor nodes. In: Proceedings of the ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN) (2009)
7. Conduct of the Persian Gulf War, Final Report to the Congress, Department of Defense, Appendix M, April 1992, pp. M-1 and M-2 (1992)
8. Douceur, J.: The Sybil attack. In: First IPTPS (2002)
9. Franklin, J., McCoy, D., Tabriz, P., Neagoe, V., Van Randwyk, J., Sicker, D.: Passive data link layer 802.11 wireless device driver fingerprinting. In: USENIX Security Symposium (2006)
10. Gassend, B., Clarke, D., van Dijk, M., Devadas, S.: Controlled physical random functions. In: Proceedings of the 18th Annual Computer Security Applications Conference, December (2002)
11. Gassend, B., Clarke, D., van Dijk, M., Devadas, S.: Silicon physical random functions. In: Proceedings of the Computer and Communications Security Conference, November (2002)
12. Gassend, B., Clarke, D., van Dijk, M., Devadas, S.: Delay-based circuit authentication and applications. In: Proceedings of the 2003 ACM Symposium on Applied Computing, March (2003)
13. Gerdes, R., Daniels, T., Mina, M., Russell, S.: Device identification via analog signal fingerprinting: a matched filter approach. In: NDSS (2006)
14. Hammouri, G., Dana, A., Sunar, B.: CDs have fingerprints too. In: CHES '09: Proceedings of the 11th International Workshop on Cryptographic Hardware and Embedded Systems, pp. 348–362 (2009)
15. Hall, J., Barbeau, M., Kranakis, E.: Radio frequency fingerprinting for intrusion detection in wireless networks. In: Dependable and Secure Computing (2005)
16. Hu, Y., Perrig, A., Johnson, D.: Packet leashes: a defense against wormhole attacks in wireless networks. In: IEEE Annual Conference on Computer Communications (INFOCOM), pp. 1976–1986 (2003)
17. Huang, D.-J., Teng, W.-C., Wang, C.-Y., Huang, H.-Y., Hellerstein, J.: Clock skew based node identification in wireless sensor networks. In: IEEE Globecom, New Orleans, LA, USA (2008)
18. IEEE Std 802.15.4-2006: Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for Low-Rate Wireless Personal Area Networks (WPANs), September (2006). http://standards.ieee.org/getieee802/download/802.15.4-2006.pdf
19. Impagliazzo, R., Levin, L., Luby, M.: Pseudo-random generation from one-way functions. In: Proceedings of the 20th ACM Symposium on Theory of Computing (1989)
20. Intel 64 and IA-32 Architectures Software Developer's Manual, vol. 1: Basic Architecture. Intel Corporation, June (2010)
21. Jana, S., Premnath, S.N., Clark, M., Kasera, S.K., Patwari, N., Krishnamurthy, S.V.: On the effectiveness of secret key extraction from wireless signal strength in real environments. In: MobiCom (2009)
22. Jana, S., Kasera, S.K.: On fast and accurate detection of unauthorized access points using clock skews. IEEE Trans. Mobile Comput. 9(3), 449–462 (2010)
23. Kohno, T., Broido, A., Claffy, K.: Remote physical device fingerprinting. In: Proceedings of the IEEE Symposium on Security and Privacy, May (2005)
24. Lee, J.-W., Lim, D., Gassend, B., Suh, G.E., van Dijk, M., Devadas, S.: A technique to build a secret key in integrated circuits for identification and authentication applications. In: Proceedings of the IEEE VLSI Circuits Symposium, June (2004)
25. Li, Z., Trappe, W., Zhang, Y., Nath, B.: Robust statistical methods for securing wireless localization in sensor networks. In: Proceedings of IPSN, April (2005)
26. Majzoobi, M., Koushanfar, F., Potkonjak, M.: Lightweight secure PUFs. In: Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design, IEEE Press, pp. 670–673 (2008)
27. Majzoobi, M., Koushanfar, F., Potkonjak, M.: Testing techniques for hardware security. In: Proceedings of the International Test Conference (ITC), pp. 1–10 (2008)
28. Michal, V.: On the low-power design, stability improvement and frequency estimation of the CMOS ring oscillator. In: Radioelektronika, 2012 22nd International Conference, IEEE (2012)
29. Moon, S.B., Skelly, P., Towsley, D.: Estimation and removal of clock skew from network delay measurements. In: Proceedings of IEEE INFOCOM, vol. 1, March 1999, pp. 227–234 (1999)
30. Murdoch, S.J.: Hot or not: revealing hidden services by their clock skew. In: 13th ACM Conference on Computer and Communications Security (CCS 2006), Alexandria, VA, November (2006)
31. Novak, J.H., Brunvand, E.: Using FPGAs to prototype a self-timed floating point co-processor. In: Proceedings of the Custom Integrated Circuits Conference (CICC), pp. 85–88 (1994)
32. Novak, J.H., Kasera, S.K., Patwari, N.: Preventing wireless network configuration errors in patient monitoring using device fingerprints. In: 14th International Symposium and Workshops on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), IEEE, pp. 1–6 (2013)
33. Patwari, N., Hero III, A.O., Perkins, M., Correal, N.S., O'Dea, R.J.: Relative location estimation in wireless sensor networks. IEEE Trans. Signal Process. 51(8), 2137–2148 (2003)
34. Race is on to fingerprint phones, PCs. The Wall Street Journal. http://online.wsj.com/article/SB10001424052748704679204575646704100959546.html. Accessed January 10, 2012
35. Rappaport, T.S.: Wireless Communications: Principles and Practice, 2nd edn. Prentice-Hall PTR, New Jersey (2002)
36. Ravikanth, P.S.: Physical One-Way Functions. Ph.D. thesis, Massachusetts Institute of Technology (2001)
37. Reed, M.G., Syverson, P.F., Goldschlag, D.M.: Anonymous connections and onion routing. IEEE J. Sel. Areas Commun. 16(4), 482–494 (1998)
38. Remley, K.A., Grosvenor, C.A., Johnk, R.T., Novotny, D.R., Hale, P.D., McKinley, M.D., Karygiannis, A., Antonakakis, E.: Electromagnetic signatures of WLAN cards and network security. In: ISSPIT (2005)
39. Rasmussen, K.B., Capkun, S.: Implications of radio fingerprinting on the security of sensor networks. In: Proceedings of IEEE SecureComm (2007)
40. Saadah, D.M.: Friendly fire: will we get it right this time? In: 31st U.S. Army Operations Research Symposium, Fort Lee, Virginia, November (1992)
41. Sigg, S., Budde, M., Ji, Y., Beigl, M.: Entropy of audio fingerprints for unobtrusive device authentication. In: Lecture Notes in Computer Science: Modeling and Using Context, vol. 6967, pp. 296–299 (2011)
42. Suh, G.E., Devadas, S.: Physical unclonable functions for device authentication and secret key generation. In: Proceedings of the 44th Design Automation Conference, IEEE, pp. 9–14 (2007)
43. Tehranipoor, M., Koushanfar, F.: A survey of hardware trojan taxonomy and detection. IEEE Des. Test Comput. 27(1), 10–25 (2010)
44. Thomas Jr., G.B., Finney, R.L.: Calculus and Analytic Geometry, 6th edn. Addison-Wesley, Reading (1984)
45. Tope, M.A., McEachen, J.C.: Unconditionally secure communications over fading channels. In: Military Communications Conference (MILCOM 2001), vol. 1, October 2001, pp. 54–58 (2001)
46. Uddin, M., Castelluccia, C.: Toward clock skew based wireless sensor node services. In: Wireless Internet Conference (WICON), 2010 The 5th Annual ICST, March 2010, pp. 1–9 (2010)
47. Zanetti, D., Danev, B., Capkun, S.: Physical-layer identification of UHF RFID tags. In: Proceedings of the 16th ACM Conference on Mobile Computing and Networking (MobiCom '10), ACM SIGMOBILE, pp. 353–364 (2010)
Fingerprinting by Design: Embedding
and Authentication
Abstract In this chapter we consider the design of fingerprints for the purpose of
authenticating a message. We begin with a background discussion of fingerprinting
and related ideas, progressing to a communications point of view. Fingerprint embed-
ding for message authentication is motivated by the desire to make an authentication
tag less accessible to an eavesdropper. We consider metrics for good fingerprint
design, and apply these to develop an embedding scheme for wireless communica-
tions. Wireless software defined radio experiments validate the theory and demon-
strate the applicability of our approach.
1 Background
The basic system is diagrammed in Fig. 1. The transmitter (Alice) generates the
authentication tag using the data and a shared secret symmetric key. The problem of
key distribution is well studied and we assume that the keys have already been dis-
tributed. (A key might also be available through some other means, such as deriving
it from the common channel [32].) The tag is embedded into the MIMO transmission
by employing small coded modulation shifts as a fingerprint. This symbol synchro-
nous approach ensures low complexity of the overall authentication process. At the
receiver (Bob), the message is validated by comparing the received authentication
Fig. 1 System Diagram. The transmitter generates a data-dependent authentication tag and super-
imposes it with the data. The receiver estimates the data and generates the corresponding expected
authentication tag. This is compared with the received tag to validate the transmitter's identity
[31, Fig. 1]
tag with the expected authentication that is locally generated at the receiver using
the demodulated data and the shared key.
$$Y = HX + W \tag{1}$$

where H is the MIMO channel matrix (N × M), Y is the received signal (N × L),
and W is white Gaussian noise (N × L).
We introduce a stealthy fingerprint to the transmitted frame X so that it contains
both data and a unique authentication tag. Alice and Bob have exclusive knowledge
of their shared secret key K that is used to generate the authentication tag. Figure 1
shows the overall approach. We first detail the construction of the authenticated frame,
and then describe the receiver processing to carry out the authentication hypothesis
test.
There are many possible ways to combine data and tag, e.g., through convolution.
However, to keep the scheme conceptually clear we adopt the symbol-synchronous
superposition approach. Note that when the tag is not symbol synchronous, the basis
expansion model is a useful tool to describe the signal [8]. The frame X is the
weighted superposition of the data S with its associated authentication tag T (S and T are complex M × L matrices), i.e.,

X = ρ_S F_S P_S^{1/2} S + ρ_T F_T P_T^{1/2} T.    (2)

The scalars ρ_S and ρ_T determine the relative power allocation between the tag and the data. F_S and F_T are M × M unitary matrices that steer the energy (e.g., onto eigenmodes), and P_S (respectively, P_T) are M × M diagonal matrices that allocate power between the columns of F_S (respectively, F_T). The Appendix contains a brief review of capacity-optimal precoding and power allocation strategies for the cases of (1) no channel state information (CSI), (2) perfect CSI, or (3) statistical CSI available at the transmitter.
The tag is generated using a cryptographic hash [19], that is,

T = g(S, K),    (3)

for the data S given the shared key K. By design, it is infeasible to find the input given the output of a cryptographic hash. Therefore it is reasonable to assume that the data and tag are uncorrelated, i.e.,

E[S^H T] = 0.    (4)

We assume each element of the S and T matrices has the same expected power. To ensure that adding authentication to the signal does not change the signal's expected power, ρ_S and ρ_T are chosen to ensure that

ρ_S² + ρ_T² = 1.    (5)
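As a concrete illustration of Eqs. (2)–(5), the following sketch builds an authenticated frame in NumPy. The HMAC-based tag generator standing in for g(S, K), the QPSK mapping, and the identity precoders are illustrative assumptions, not the construction used in the experiments reported later.

```python
import hmac, hashlib
import numpy as np

def qpsk(bits):
    """Map pairs of bits to unit-power QPSK symbols."""
    b = bits.reshape(-1, 2)
    return ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)

def make_tag(S, key, M, L):
    """Hypothetical g(S, K): hash the data symbols with the shared key and
    map the digest bits to an M-by-L matrix of QPSK tag symbols."""
    digest = hmac.new(key, S.tobytes(), hashlib.sha256).digest()
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
    bits = np.resize(bits, 2 * M * L)          # stretch/trim to the frame size
    return qpsk(bits).reshape(M, L)

def build_frame(S, T, rho_t2=0.001, F_S=None, P_S=None, F_T=None, P_T=None):
    """Superimpose data and tag per Eq. (2), with rho_S^2 + rho_T^2 = 1 (Eq. (5))."""
    M = S.shape[0]
    F_S = np.eye(M) if F_S is None else F_S    # no-CSI default: isotropic precoding
    F_T = np.eye(M) if F_T is None else F_T
    P_S = np.eye(M) if P_S is None else P_S
    P_T = np.eye(M) if P_T is None else P_T
    rho_s, rho_t = np.sqrt(1.0 - rho_t2), np.sqrt(rho_t2)
    return rho_s * F_S @ np.sqrt(P_S) @ S + rho_t * F_T @ np.sqrt(P_T) @ T

M, L, key = 4, 400, b"shared-secret-key"
S = qpsk(np.random.randint(0, 2, 2 * M * L)).reshape(M, L)   # data frame
T = make_tag(S, key, M, L)                                    # authentication tag
X = build_frame(S, T, rho_t2=0.001)                           # 0.1 % tag power
```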
The receiver processing and authentication steps are shown in Fig. 1. The receiver
first equalizes the channel and obtains a data estimate. From this data estimate and
the shared key the receiver generates the expected authentication tag and compares
it with the received tag, declaring authentication if a match is obtained.
The receiver first obtains a noisy channel estimate Ĥ in a conventional manner, for example, through the use of training symbols [3]. We model the channel estimate by

Ĥ = H + Z,    (6)

where Z is the channel estimation error.
The received signal is then

Y = H X + W = H(ρ_S F_S P_S^{1/2} S + ρ_T F_T P_T^{1/2} T) + W.    (7)

The receiver equalizes the channel using its estimate,

X̂ = Ĥ^{-1} Y.    (8)
Note that the frame is corrupted through the channel estimation error as well as the
additive noise.
We assume the receiver knows the CSI available to the transmitter, and so can
undo any transmitter-applied precoding. Therefore the receiver estimates the data
signal via

Ŝ = ρ_S^{-1} P_S^{-1/2} F_S^{-1} X̂,    (9)

and generates the expected authentication tag from the data estimate and the shared key,

T̂ = g(Ŝ, K).    (10)
If the data was correctly recovered, then T̂ = T. When the data is recovered incorrectly, then T̂ ≠ T with high probability, since g(·) is a collision-resistant function. We are only concerned with the case where Ŝ = S, because if this were not the case then the data would have been received in error, and authentication should not proceed.
To facilitate the comparison of the expected tag with the received tag, we compensate T̂ for the transmission processing using

Q̂ = ρ_T Ĥ F_T P_T^{1/2} T̂.    (11)

This compensates for the precoding, power loading, and propagation through the channel, yielding an estimate of the tag as it appears at the receiver, denoted Q̂.
Having generated the expected tag given the received data, the receiver also must recover the received tag for comparison. We do this by calculating the residual

Q = Y − ρ_S Ĥ F_S P_S^{1/2} Ŝ.    (12)

Here, we compensate the error-corrected data Ŝ for precoding, power loading, and propagation through the channel so that it can be directly subtracted from the receiver input Y to form the residual matrix Q.
To carry out the authentication hypothesis test, we correlate the expected authentication tag Q̂ against the received tag Q, so our test statistic is given by

τ = ℜ[Tr(Q̂^H Q)].    (13)

We set the detection threshold τ₀ to limit the probability of false alarm p_fa = p(τ > τ₀ | H₀). By the central limit theorem, τ is approximately Gaussian distributed, and simulations verify that this is a good assumption, even for small M and N. Thus, the receiver sets the detection threshold according to the allowable false alarm probability,

τ₀ = F^{-1}(p_fa),    (14)

where F(·) is the complementary CDF of τ under H₀.
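A receiver-side sketch of Eqs. (8)–(14), continuing the notation of the previous listing. The Gaussian threshold in the last step stands in for F^{-1}(p_fa) under the simplifying assumption that the residual under H₀ behaves as white noise of known variance; that shortcut, and the identity precoders, are illustrative choices only.

```python
import numpy as np
from scipy.stats import norm

def authenticate(Y, H_hat, S_hat, T_hat, sigma_w2, rho_t2=0.001, p_fa=0.01):
    """Correlate the residual against the expected tag and threshold the result."""
    rho_s, rho_t = np.sqrt(1.0 - rho_t2), np.sqrt(rho_t2)
    Q = Y - rho_s * H_hat @ S_hat          # residual, Eq. (12) (identity precoders)
    Q_hat = rho_t * H_hat @ T_hat          # expected tag at the receiver, Eq. (11)
    tau = np.real(np.trace(Q_hat.conj().T @ Q))          # test statistic, Eq. (13)
    # Under H0 the residual is modeled as white noise of variance sigma_w2, so
    # tau is approximately Gaussian with variance 0.5 * sigma_w2 * ||Q_hat||^2.
    sigma0 = np.sqrt(0.5 * sigma_w2 * np.sum(np.abs(Q_hat) ** 2))
    tau_0 = sigma0 * norm.ppf(1.0 - p_fa)  # threshold for the allowed p_fa, Eq. (14)
    return tau > tau_0, tau, tau_0
```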
When no authentication tag is transmitted, we expand Eq. (12) with (2) and (6) to see that

Q|H₀ = ((1 − ρ_S)H + Z) F_S P_S^{1/2} S + W.    (15)

Now, using (15) in (13), and applying our assumption that the data and tag are uncorrelated (4), we can show that τ|H₀ is approximately zero mean and has variance approximated by

σ²_{τ|H₀} = ρ_T² Var[Tr((Ĥ F_T P_T^{1/2} T̂)^H W)].    (16)

When the tag is transmitted, the test statistic has non-zero mean,

E[τ|H₁] = ρ_T² ‖Ĥ F_T P_T^{1/2} T̂‖² ≠ 0,    (17)

and the residual becomes

Q|H₁ = ρ_T Ĥ F_T P_T^{1/2} T + Z(ρ_S F_S P_S^{1/2} S + ρ_T F_T P_T^{1/2} T) + W.    (18)
Using (13) and carrying out some lengthy algebra, we can find σ²_{τ|H₁}, the variance of τ in (13). Thus, the authentication test corresponds to a Gaussian problem with zero mean under H₀ and non-zero mean under H₁, and generally different variance expressions under the two hypotheses. These expressions can be used to predict the test performance.
The impact of superimposing the authentication tag is twofold: it takes power from the data signal and acts as interference to its demodulation. When the tag is superimposed at low power (ρ_T² < 1 %), the interference to data demodulation may be modeled as an increase in noise, i.e., as a decrease in data SNR. For example, suppose that a given channel eigenmode has 10 dB SNR. If the tag uses 1 % of the power on that eigenmode, the data SNR becomes 9.942 dB. With 0.1 % power for the tag, the data SNR becomes 9.996 dB. Hence, the data BER is essentially unchanged at such low
authentication powers and the interference caused by the tag is minimal. Simulation
and experimental results show this to be the case. In Sect. 5 we show that while the
change in coded data BER is small for apparently minuscule tag powers, these low
tag power levels are sufficient for authentication.
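A quick numeric check of this SNR-penalty argument, treating the tag purely as lost data power; the exact figures quoted above come from the full interference model in [31], so small differences from this simplified calculation are expected.

```python
import math

def data_snr_after_tag(snr_db, tag_fraction):
    """Data SNR when a fraction of the transmit power is given to the tag,
    modeling the tag only as lost data power (no interference term)."""
    snr = 10 ** (snr_db / 10)
    return 10 * math.log10((1 - tag_fraction) * snr)

print(data_snr_after_tag(10.0, 0.01))    # ~9.96 dB with a 1 % tag
print(data_snr_after_tag(10.0, 0.001))   # ~9.996 dB with a 0.1 % tag
```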
Fig. 2 Authentication performance versus SNR for various CSI scenarios and tag policies in the Rayleigh fading case. The false alarm probability is set to 2⁻³², corresponding to the probability of guessing a 32-bit word, and the tag power is 0.1 % of the total transmit power. Putting authentication tags into the strongest eigenmodes greatly improves authentication detection [31, Fig. 9]
Though Eve does not initially have the secret key K shared by Alice and Bob, she
gains key information by observing their communications. When she learns K , she
is able to impersonate Alice at will by generating legitimate tags for her messages
using Eq. (3). The protection of K against Eve is therefore crucial to the security of
the authentication system.
For a given data frame, the number of distinct tags that can be assigned to it
is bounded by the number of keys. If sufficiently many tags are observed without
noise, as is the case with conventional message authentication codes (MACs), Eve
will eventually obtain the key and can mount successful impersonation or substitution
attacks [17]. However, if the tags are viewed in noise as with our approach, there is
a non-zero probability of error in the observed tag and the effort required to recover
the secret key can be significantly increased. The following analysis quantifies this
increase.
In the following we assume that Eve, just like Bob, is able to recover the data S = s from her observation Y without error. Further, we assume that Eve knows the tag-generating function g(·), which defines the dependency of T on S and K as per (3).
Given a noisy observation t of the transmitted tag, Eve can assign to each candidate tag t_i the normalized likelihood

f(t_i | t) = p(t_i | t) / Σ_{t_i′ ∈ T_s} p(t_i′ | t),    (23)

where f(k_i | t) = f(t_i | t) because we assume zero key equivocation in the noiseless channel, i.e., knowledge of the key is equivalent to knowledge of the tag.
Multiple Observations:
The analysis is easily extended to multiple observations of data/tag pairs, assuming the same key is used. With the data s₁, s₂, …, s_n, Eve enumerates the possible tags for each observation. The subsequent analysis is identical, except that in Eq. (22) the term log₂|T| is replaced with n log₂|T| due to making n observations.
When the authentication false alarm probability p_fa ≠ 0 (see Eq. (14)), the probability of a successful impersonation attack given the previous observations t₁, …, t_{i−1} is lower bounded in terms of the remaining key equivocation [17, Theorem 3]. The decrease in the key equivocation is approximately linear in the number of observations [31].
This bound indicates that the probability of a successful substitution attack may grow
rapidly as the key equivocation drops. For example, when the equivocation drops
by 1, the lower bound on the success probability doubles. In the following section,
Fig. 4 shows the key equivocation for an example authentication system.
The above theorems bound performance from below, therefore the adversary may
be able to do much better. An interesting question is how well the bounds can predict
adversary success probability. We note that in some cases the bound is necessarily
tight: with typical MACs, there is no key equivocation and therefore the substitution
attack is successful w.p. 1. As our method guarantees positive key equivocation, we
assert that we protect the key better than typical MAC systems.
4.4 Complexity
The additional complexity required by our approach is linear in the size of the transmitted signal. First, the transmitter must calculate the authentication tag using the function g(·), as in conventional MACs. Typically this is easy to calculate. Then, depending on her power allocation strategy for the tag, she scales the data and tag and adds them together. This involves O(ML) additions and multiplications (recall that the transmitted matrices are of dimension M × L).
The receiver calculates the residual by re-encoding the estimated data and subtracting it from his observation. This includes any modulation, error-correction coding, and pulse shaping that may occur. Then, he calculates the expected authentication tag using g(·). He calculates the threshold for the given false alarm probability and SNR; this involves a single inverse normal CDF calculation. Finally, he correlates the residual with the expected authentication tag and compares the result with the threshold. This involves another O(ML) complex multiplications.
5 Experimental Results
Figure 3 shows the authentication probability for various tag powers. The packets contain 400 data symbols and the tag power ranges from 0.1 to 1 % of the transmit power. Thus, each packet contains 400 QPSK tag symbols, i.e., each packet carries an 800-bit tag. The false alarm probability is 1 %.
As previously discussed, authentication performance is improved by increasing the tag energy. This figure shows the effect of changing the tag power while holding the packet length constant. Other experiments show that modifying the packet length has a similar effect, and we again obtain very good agreement between theory and experiment.
Figure 4 shows the key equivocation in Eq. (24) for various tag powers. The key
equivocation is calculated based on the observed tag bit error rate for each SNR.
Recalling the discussion from Sect. 4.3.1, this figure considers the case where
Fig. 3 Results from SDR experiment. Authentication probability for various tag powers (from 0.1 to 1 % of the transmit power). Packets contain 400 QPSK symbols and the false alarm probability is 1 %. High-powered tags have high authentication performance. [Plot: probability of authentication versus SNR (dB) from 0 to 20 dB, with theory and measured curves for tag powers 0.001, 0.005, and 0.01]
K = 256 bits and T = 800 bits. We assume that there is zero key equivocation
in the noiseless case (i.e., each (message, tag) pair is associated with a unique key).
This is a pessimistic assumption, so typical results will be better than those shown
in Fig. 4.
Note that higher channel SNR decreases the key equivocation. Intuitively, a cleaner
observation leads to less uncertainty of the tag and hence the key. (Taken to the
extreme, a perfect observation leads to zero key equivocation.) For the scenarios of
interest (low tag power), the key equivocation is seen to be very high as a proportion
of its 256-bit maximum.
Also note that lower tag power increases the key equivocation. As with the effect
of channel SNR, reducing the tag power reduces the ability of the receiver to make
an accurate estimate of the tag. A large increase in key equivocation is apparent
when reducing the tag power from 1 to 0.1 %. It should be noted, however, that
Fig. 4 Results derived from SDR experiment. Key equivocation for various tag powers, with 256-bit keys and 800-bit tags. [Plot: key equivocation in bits (236 to 256) versus SNR (dB) from 0 to 18 dB, for tag powers 0.001, 0.005, and 0.01]
the intended receiver and the eavesdropper have very different goals. The intended
receiver has a detection problem: deciding if the tag corresponding to the data and key
is present. The eavesdropper has the much harder estimation problem: determining
the transmitted tag and then the secret key. As shown in Fig. 3, reducing the tag
power does impact the ability of the intended receiver to authenticate properly. So,
a design balance is sought to achieve the desired authentication performance while
maintaining a high level of security.
Figure 5 shows the impact of the authentication on the data BER. As discussed in
Sect. 4.1, small tag power leads to small reductions in data SNR, and hence the SNR
penalty is minimal. This figure shows that the BER curves are, for practical purposes,
coincident. The theoretical BER curve is overlaid for comparison. The experimental
results show good agreement with the theoretical curve, though the experimental
variability increases slightly with SNR.
Because the impact on data BER is shown to be so slight for tag powers as high as 1 %, the designer has ample room to choose an operating point that balances authentication probability and key equivocation. For example, suppose we have 800-bit QPSK tags with 0.5 % of the total power. Then, from the figures above, we have > 99 % authentication probability, 252 bits of key equivocation, and < 10⁻³ message BER at 10 dB.
We emphasize the flexibility of this framework. If the tag was lengthened (e.g.,
by spreading over multiple messages), the power of the tag could be reduced while
maintaining or increasing the authentication probability. The lower-powered tag then
yields higher key equivocation as well as having lower impact on the message BER.
Fig. 5 Data BER versus SNR for various tag powers, compared with the situation where no authentication is transmitted. [Plot: BER on a logarithmic scale versus SNR (dB) from 0 to 16 dB]
6 Conclusions

Appendix

Alice can improve the performance of the system by shaping her transmissions based on her available CSI. Generally, the frame can be decomposed as

X = F_S P_S^{1/2} S,    (30)

subject to the power constraint

Tr(P_S) = M.    (31)
In the following we consider three cases where Alice has (1) no CSI, (2) perfect
CSI, or (3) knowledge of the statistics of the channel. We briefly review the capacity-
optimal precoding and power allocation strategies for each case.
No CSI

When the transmitter has no CSI, e.g., in the absence of feedback from the receiver, there are no preferred transmission modes and transmission is isotropic, so that

F_S = I,    (32)
P_S = I,    (33)

resulting in an identity input covariance.
Perfect CSI

In this case the transmitter has knowledge of the realization of H, and the capacity-achieving channel input covariance has eigenvectors equal to those of H^H H. Because the eigenvectors are orthogonal, the optimal power allocation is given by the water-filling solution [23]. That is, the transmissions are shaped using

F_S = V,    (34)
P_S(i) = (μ − n(i))⁺,    (35)

where

H^H H = V D V^H.    (36)

Here P_S(i) (resp., D(i)) is the i-th element on the diagonal of P_S (resp., D), n(i) = σ_w²/D(i) is the i-th channel noise component, and μ is chosen to satisfy the power constraint

Σ_{i=1}^{M} P_S(i) = M.    (37)
In Rayleigh fading (where the Ricean factor K = 0 and hence the channel mean H̄ = 0), we have F_S = U_T.
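A small sketch of the water-filling allocation in Eqs. (34)–(37): the water level μ is found by bisection so that the per-eigenmode powers sum to M. The bisection routine and the noise-variance argument are illustrative choices, not part of the original presentation.

```python
import numpy as np

def waterfill_precoder(H, sigma_w2=1.0, iters=100):
    """Return F_S = V and the diagonal of P_S from Eqs. (34)-(37)."""
    M = H.shape[1]
    # Eigendecomposition of H^H H gives the transmit directions V and gains D
    D, V = np.linalg.eigh(H.conj().T @ H)
    n = sigma_w2 / np.maximum(D, 1e-12)        # per-mode noise levels n(i)
    # Bisection on the water level mu so that sum((mu - n)^+) = M
    lo, hi = 0.0, n.max() + M
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - n, 0.0).sum() > M:
            hi = mu
        else:
            lo = mu
    P_S = np.maximum(mu - n, 0.0)              # Eq. (35); satisfies Tr(P_S) = M
    return V, P_S

H = (np.random.randn(4, 4) + 1j * np.random.randn(4, 4)) / np.sqrt(2)
F_S, P_S_diag = waterfill_precoder(H)
print(P_S_diag, P_S_diag.sum())                # mode powers and their total (= M = 4)
```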
Statistical CSI

Although not as good as precise knowledge of the realization of H, when the transmitter has knowledge of the Gaussian channel statistics (mean and covariance), she is still able to improve beyond isotropic transmissions. Conditioned on the knowledge of the channel statistics, the capacity-achieving channel input has eigenvectors equal to those of E[H^H H] [25]. That is, the transmissions are shaped using

F_S = V,    (38)

where

E[H^H H] = V D V^H.    (39)
References

25. Venkatesan, S., Simon, S., Valenzuela, R.: Capacity of a Gaussian MIMO channel with nonzero mean. In: Vehicular Technology Conference (VTC), vol. 3, pp. 1767–1771 (2003). doi:10.1109/VETECF.2003.1285329
26. Verma, G., Yu, P.L.: A MATLAB Library for Rapid Prototyping of Wireless Communications Algorithms with the USRP radio family. Tech. rep., U.S. Army Research Laboratory (2013)
27. Wyner, A.D.: The wire-tap channel. Bell Syst. Tech. J. 54, 1355–1387 (1975)
28. Xiao, L., Greenstein, L., Mandayam, N., Trappe, W.: Using the physical layer for wireless authentication in time-variant channels. IEEE Trans. Wirel. Commun. 7(7), 2571–2579 (2008). doi:10.1109/TWC.2008.070194
29. Yu, P., Baras, J., Sadler, B.: Multicarrier authentication at the physical layer. In: International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), pp. 1–6 (2008). doi:10.1109/WOWMOM.2008.4594926
30. Yu, P., Baras, J., Sadler, B.: Physical-layer authentication. IEEE Trans. Inf. Forensics Secur. 3(1), 38–51 (2008). doi:10.1109/TIFS.2007.916273
31. Yu, P., Sadler, B.: MIMO authentication via deliberate fingerprinting at the physical layer. IEEE Trans. Inf. Forensics Secur. 6(3), 606–615 (2011). doi:10.1109/TIFS.2011.2134850
32. Zhou, Y., Fang, Y.: Scalable and deterministic key agreement for large scale networks. IEEE Trans. Wirel. Commun. 6(12), 4366–4373 (2007). doi:10.1109/TWC.2007.06088
Digital Fingerprint: A Practical Hardware
Security Primitive
Abstract Digital fingerprinting was introduced for the protection of VLSI design intellectual property (IP). Since each copy of the IP receives a distinct fingerprint, the fingerprint can also be used as an identifier for the IP or the integrated circuit (IC). This enables the IP/IC designer to trace each piece of the IP/IC and thus identify the dishonest user should piracy or misuse occur. In this chapter, after defining the basic requirements of fingerprinting, we focus on how to solve the core challenge in digital fingerprinting, namely, how to effectively create a large number of distinct but functionally identical IPs. We first use the graph coloring problem as an example to demonstrate a general approach based on constraint manipulation; then we show how the popular iterative improvement paradigm can be leveraged for fingerprinting. The highlight is three recently developed post-silicon fingerprinting techniques that can be automatically integrated into the design and test phases: the first two approaches take advantage of Observability Don't Cares and Satisfiability Don't Cares, which are almost always present in IC designs, to generate fingerprints, while the third method utilizes the different interconnect styles between flip-flops in a scan chain to create unique fingerprints that can be detected with ease. These techniques have high practical value.
1 Introduction
In recent years, the system-on-a-chip (SoC) paradigm has increased in popularity due to its modular nature. System designers can pick integrated circuits (ICs), considered as intellectual property (IP), that are produced for specific functionality and fit them together to achieve a specific goal. This leads to a culture of reuse-based design. As a result, IP theft has become profitable as well as a threat to IP developers, vendors, and the SoC industry in general, which motivates the IP protection problem [1].
Physical unclonable functions (PUFs) [8, 9] rely on the uncontrollable variations during the fabrication process, which are believed to be random and unique. They can be used for identification and authentication of an IC or IP. However, when an IP or IC is illegally reproduced or overbuilt, the illegal copies will carry their own fabrication-variation-based fingerprints, which will be different from the fingerprint of the original genuine IP or IC. So PUFs cannot be used directly for the protection of IP and IC.
Fingerprints are characteristics of an object that are completely unique and incontrovertible, so they can be used to identify a particular object among its peers.
They have been used for human identification for ages and have been adopted in
multimedia for copyright protection of the widely distributed digital data. In the
semiconductor and IC industry, the concept of digital fingerprinting was proposed
in the late 1990s with the goal of protecting design IP from being misused [1012].
In this context, digital fingerprints refer to additional features that are embedded
during the design and fabrication process to make each copy of the design unique.
These features can be extracted from the IP or IC to establish the fingerprint for the
purposes of identification and protection.
As a quick motivational example, Fig. 1 illustrates what an ideal fingerprint may look like in an IC. Each circled location would be a place where a certain small modification could be made without changing the functionality of the circuit or its performance. A 1-bit fingerprint could be embedded with a simple scheme (which is also referred to as a fingerprinting mechanism, protocol, or technique):
• to embed a 0, do not make the modification;
• to embed a 1, make the modification.
Such fingerprint information can be easily identified by checking whether the modifications exist or not in the circuit layout, as the sketch below illustrates.
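A toy sketch of this one-bit-per-location embedding and readback protocol; the location names and the dictionary-based "layout" are purely illustrative.

```python
def embed_fingerprint(locations, bits):
    """Mark each candidate location as modified (1) or untouched (0)."""
    assert len(locations) == len(bits)
    return {loc: bit for loc, bit in zip(locations, bits)}

def read_fingerprint(layout, locations):
    """Recover the bit string by checking which locations were modified."""
    return [layout[loc] for loc in locations]

locations = ["L1", "L2", "L3", "L4"]        # circled locations in Fig. 1 (hypothetical)
layout = embed_fingerprint(locations, [1, 0, 1, 1])
print(read_fingerprint(layout, locations))  # [1, 0, 1, 1]
```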
With the promise of giving each copy of the IC or IP a unique fingerprint, digital fingerprinting has become a hardware security primitive and an enabling technique for applications such as IP metering, identifying IP piracy, and detecting IC counterfeiting and overbuilding. Early work on IP fingerprinting [10–12] demonstrated the possibility of creating a large number of functionally identical IPs with distinct implementations. However, these techniques are not practical because all fingerprinted designs will be different and require different masks for fabrication, which is prohibitively expensive.
To understand this and many other cyber threats in IC design, let us review the traditional VLSI design cycle shown in Fig. 2. The pre-silicon design phase includes all the steps before the chip fabrication process, and the post-silicon phase focuses on the testing and packaging after the silicon is fabricated. The early fingerprinting techniques [10–12] are not practical because they generate the fingerprints in the pre-silicon phase, either at the logic design, circuit design, or physical design stages. Although each fingerprinted copy will have a rather unique implementation, which is good for fingerprint metrics, this requires a different mask for each copy of the design, which no one can afford given today's cost of masks and fabrication.
In the early days, before reuse-based design emerged, IC design and fabrication were conducted in-house, where companies had their own design teams and foundries and the design process was strictly controlled. Starting from the late 1990s, with new applications such as embedded systems that require a shorter time-to-market window and more sophisticated functionality, it became more cost-efficient, and perhaps the only option to meet those demanding design constraints, to split the design steps among different groups, some in different organizations or even different countries, that specialize in different stages of VLSI design. As a simple example, one may develop the system specification internally and then hand that data to another party who designs everything from the architecture to the physical layout of the device. This layout could be given back to the party who created the system specification, who in turn gives it to a foundry, most likely overseas, for fabrication.
At this point two new parties, the design house and the foundry, have had access to the IP of the creator. The addition of these parties, as well as the use of third-party design tools, technology libraries, and IPs in the VLSI design cycle, creates a substantial security risk, and although the simple example above only uses two additional parties, many more could be added. Logically then, with every party involved, the chance of malicious behavior increases. One risk of the multi-party design cycle is the insertion of hardware Trojan horses, or simply Trojans, into the design. Trojans can be simple changes to the circuit-level design or significant changes to the functional or logical design, with the intention of damaging the circuit, stopping functionality at critical points, or siphoning off data that was meant to be secure.
In addition to the multi-party design cycle issue, the entities that purchase or lease an IP introduce risk as well. Once a design is completed, it can be considered an IP core, and at that point it is vulnerable to theft by duplication. Even if a group does all of the design work themselves, or with a trusted third party, weak points for duplication occur once a client leases or buys the design or the physical IC, or once the IP design is sent to a third party for fabrication. At each point, the third party can simply copy the physical layout of the device and claim it as their own.
In the rest of this chapter, we will first survey the early work on fingerprinting. Then we will elaborate on three practical circuit-level fingerprinting techniques. The common feature that makes these techniques practical is that they all create fingerprints at the post-silicon stage. However, this may also introduce security vulnerabilities, because such post-silicon fingerprints will not be as robust and secure as those embedded in the design. We provide theoretical analysis of such tradeoffs in this chapter, and readers can find the experimental validations in [11–14].
As we have pointed out in the previous section, IP vendors have to protect both themselves and their legal customers. The IP ownership needs to be protected to recover the high R&D cost. This can be achieved by legal means such as patents and copyright. The constraint-based watermarking paradigm, which embeds the IP provider's signature as additional design constraints during the design and synthesis process to create rather unique implementations of the IP, can also help establish the ownership of the IP [1, 4, 5].
It is also crucial to distribute IPs with the same functionality but different appearances to different users, because the problem of determining the legality of ownership becomes insurmountable if all users get the exact same IP and one of them illegally redistributes it. The fingerprint concept described above is one promising solution. However, IPs are usually error-sensitive, which violates the error-tolerance assumption. Therefore, we cannot directly apply the existing fingerprinting techniques for IP protection.
Another option is to build a unique IP for every legal user by applying the same watermarking technique to the users' signatures. Embedding different watermarks will ensure that the resulting IPs are distinct. However, this incurs a very high design cost, and it won't be practical until we find ways to perform it at the post-silicon stage. This brings us to the challenge of how to efficiently and effectively create distinct realizations of functionally identical IPs, which we refer to as the digital fingerprinting problem. Efficiency can be measured by the time and effort needed to create fingerprinted copies, while effectiveness is measured by the ease of identifying the embedded fingerprints and the confidence in the obtained fingerprint.
A fingerprint, being the signature of the buyer, should satisfy all the requirements of any effective watermark, namely, it should provide
• high credibility: the fingerprint should be readily detectable in proving legal ownership, and the probability of coincidence should be low;
• low overhead: once the demand for fingerprinted solutions exceeds the number of available good solutions, the solution quality will necessarily degrade; nevertheless, we seek to minimize the impact of fingerprinting on the quality of the software or design.
Figure 3 outlines the proposed approach. Lines 1 and 2 generate an initial watermarked solution S₀ using an (iterative) optimization heuristic in from-scratch mode. Then we use this solution as the seed to create fingerprinted solutions as follows: Lines 3 and 4 embed the buyer's signature into the design as a fingerprint (e.g., by perturbing the weights of edges in a weighted graph) to yield a fingerprinted instance. This fingerprinted instance is then solved by an incremental iterative optimization using S₀ as the initial solution.
The addition of each individual user's fingerprint ensures that all the fingerprinted solutions will be different. The shortened run-time comes from the fact that in Line 5 we generate a new solution from the existing one, not from scratch. Moreover, adding fingerprinting constraints changes the optimization cost surface and can actually lead to improved solution quality, which is a well-known fact in the metaheuristics literature. This approach has been applied to fingerprint classic iterative optimization algorithms, such as those designed to solve the partitioning and standard-cell placement problems, and optimization problems that may not be solved by iterative improvement, such as the graph coloring problem [11]. In the following, we show how it can also be adopted to fingerprint solutions to a Boolean satisfiability (SAT) problem, a representative decision problem.
The SAT problem seeks to decide, for a given formula F, whether there exists a truth assignment for the variables that makes the formula true. For a satisfiable formula, a solution is an assignment of 0 (false), 1 (true), or – (don't care) to each of the variables.
Fig. 4 Pseudocode of the iterative fingerprinting approach for the Boolean satisfiability problem
In step 3.7 of Fig. 4, we watermark the formula by adding two more clauses based on the user's fingerprint; a sketch of this clause-addition idea follows.
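A minimal sketch of SAT-solution fingerprinting by clause addition: clauses derived from the user's fingerprint bits are appended to the base formula and the augmented instance is re-solved. The tiny brute-force solver and the way fingerprint bits are turned into unit clauses are illustrative assumptions, not the exact procedure of Fig. 4.

```python
from itertools import product

def satisfies(formula, assignment):
    """A clause is a list of signed ints (+v / -v); formula is a list of clauses."""
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in formula)

def solve(formula, n_vars):
    """Brute-force SAT solve (fine for toy instances only)."""
    for values in product([False, True], repeat=n_vars):
        assignment = {v + 1: values[v] for v in range(n_vars)}
        if satisfies(formula, assignment):
            return assignment
    return None

def fingerprint_solution(base_formula, n_vars, fp_bits, fp_vars):
    """Append one unit clause per fingerprint bit, forcing chosen variables
    to chosen polarities, then re-solve the augmented instance."""
    extra = [[v if bit else -v] for v, bit in zip(fp_vars, fp_bits)]
    return solve(base_formula + extra, n_vars)

# (x1 or x2) and (not x1 or x3) over three variables
base = [[1, 2], [-1, 3]]
print(fingerprint_solution(base, 3, fp_bits=[1, 0], fp_vars=[2, 3]))
```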
By coloring a pair of vertices with the same or different colors, we can create different solutions. This can be implemented by selecting a pair of unconnected vertices and connecting one to all the neighbors of the other, as well as connecting the two selected vertices to each other. In Fig. 6, vertices B and E are selected, and when we color the new graph, B and E will receive different colors, say red and green. Now we can build 4 solutions where B and E are colored as (red, red), (red, green), (green, red), or (green, green).
In both of these techniques, we add new constraints to the graph. By coloring the new graph only once, we can perform simple post-processing on the solution to obtain multiple guaranteed-distinct solutions, as sketched below. Thus, the run-time overhead in creating fingerprinted copies can be eliminated. However, it is hard to control the quality of the fingerprinted copies obtained by this method.
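An illustrative sketch of the bridge technique described above, using networkx's greedy coloring as a stand-in solver; the toy graph, vertex labels, and solver choice are assumptions made for the example.

```python
import networkx as nx

def bridge_fingerprints(G, u, v):
    """Connect u to v and to all of v's neighbors (and vice versa), color the
    augmented graph once, then enumerate the post-processed solutions in which
    u and v independently take either of their two colors."""
    G2 = G.copy()
    G2.add_edge(u, v)
    G2.add_edges_from((u, n) for n in G.neighbors(v))
    G2.add_edges_from((v, n) for n in G.neighbors(u))
    base = nx.greedy_color(G2, strategy="largest_first")
    cu, cv = base[u], base[v]
    solutions = []
    for u_color in (cu, cv):                 # 4 combinations of the two colors
        for v_color in (cu, cv):
            sol = dict(base)
            sol[u], sol[v] = u_color, v_color
            solutions.append(sol)
    return solutions

G = nx.cycle_graph(6)                        # toy graph; vertices 1 and 4 are unconnected
for sol in bridge_fingerprints(G, 1, 4):
    print(sol)
```

Each of the four returned colorings is valid in the original graph, because in the augmented graph both selected vertices were forced to avoid each other's neighborhoods.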
The iterative fingerprinting and constraint-addition fingerprinting approaches successfully address the core challenge of digital fingerprinting: how to efficiently and effectively create distinct realizations of functionally identical IPs. Their common drawback is that they are both pre-silicon techniques, and therefore different masks have to be made for chip fabrication. This makes them, as well as most other IP fingerprinting methods, impractical. In the following three sections, we present three practical post-silicon IP fingerprinting techniques.
We first use a small example to show the basic idea behind this fingerprinting approach. The left circuit in Fig. 7 realizes the function (A·B)·(C + D) = F. When the Y input to the AND gate is zero, the output F will be zero regardless of the value of the X input; however, when Y = 1, F will be determined by the X input. So when we also feed signal Y to the AND gate that generates X, as shown on the right of Fig. 7, we can easily verify that this circuit implements the same function F. However, these two circuits are clearly distinct. Moreover, if one makes a copy of either of these circuits, this distinction remains. Thus we can embed one bit of fingerprint information by controlling whether X depends on Y.
One key feature of this approach is that the changes we make to the circuit are minute. We can make a connection, as shown in the right circuit of Fig. 7, during placement and routing, and then determine whether to keep this connection based on the fingerprint bits at the post-silicon phase. This avoids the expensive redesign and fabrication based on a new layout required by the fingerprinting approaches discussed in the previous section.
Fig. 7 Left: a circuit realizing F = (A·B)·(C + D), with X = A·B and Y = C + D. Right: the same function with signal Y also fed into the AND gate that generates X
Observability Don't Cares (ODCs) are a concept in Boolean computation. An ODC occurs under the conditions for which a local signal change cannot be observed at a primary output (see Fig. 7). ODCs can be several layers deep and can cause several different signals to be blocked, depending on the input to the circuit.
Formally, the ODC conditions of a function F with respect to one of its input signals x can be defined as the complement of the Boolean difference:

ODC_x = ¬(∂F/∂x) = ¬(F_x ⊕ F_x̄) = F_x F_x̄ + ¬F_x ¬F_x̄,

where F_x and F_x̄ denote the cofactors of F with respect to x = 1 and x = 0.
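For the Fig. 7 circuit, F = (A·B)·(C + D) with X = A·B, the ODC of F with respect to X is exactly the condition C + D = 0. A brute-force check of the definition, written directly from the formula above and intended purely as illustration:

```python
from itertools import product

def F(x, c, d):
    """F as a function of the internal signal X and the primary inputs C, D."""
    return x and (c or d)

# ODC_X: assignments of the remaining inputs under which flipping X is not
# observable at F, i.e. F|_{X=1} == F|_{X=0}
odc = [(c, d) for c, d in product([0, 1], repeat=2)
       if F(1, c, d) == F(0, c, d)]
print(odc)   # [(0, 0)] -> X is unobservable exactly when C = D = 0 (Y = C + D = 0)
```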
Every logic circuit that is created uses a library of gates that determines the logical relationships that can occur. Most libraries contain gates that create ODCs as defined above, but not every instance of these gates can be modified to accommodate a fingerprint. There are four necessary conditions that must be met for a gate to be considered a fingerprint location; they are enumerated in the following definition.
Definition 1 (Fingerprint Location)
1. The primary gate must have at least one input that is not a primary input of the circuit.
2. The primary gate must have at least one input which is the output signal of a fanout-free cone (FFC), which means that this signal only goes into the primary gate.
3. The FFC in criterion 2 must have either a gate with a non-zero ODC or a single-input gate (e.g., an inverter).
4. The primary gate must have a non-zero ODC with respect to one or more of its input signals other than the one from the FFC.
Fig. 8 A generic fingerprint modification: gate 1, inside the FFC, produces signal Y, which feeds the primary gate 2; X, A, and B are primary inputs and O is the primary output
Criterion 1 is necessary for making local, minor changes to the circuit (for fingerprinting purposes). Criterion 2 ensures that the changes made to the FFC will not affect the functionality of the circuit elsewhere. Criteria 3 and 4 provide a possible signal (from criterion 4) that can be added to a gate in the FFC (from criterion 3). Each ODC gate in a circuit is analyzed using Definition 1, and if it satisfies all the criteria in the definition, it is considered to be a fingerprint location: a location where the circuit can be modified to add the fingerprint.
For each fingerprint location that is found, a modification can be applied to the gate's inputs. A generic modification for a fingerprint is depicted in Fig. 8.
Figure 8 has two generic gates, represented as boxes 1 and 2, three primary inputs (X, A, and B), and one primary output (O). Gate 2 represents the primary gate, gate 1 represents the gate within the FFC that generates signal Y, and signal X is independent of the FFC that generates signal Y. Suppose that signal X satisfies ODC_Y; then we can add signal X into the FFC of Y, for example at gate 1 as shown in Fig. 8, either in its regular form X or its complemented form X′. However, when we make this addition, we need to guarantee that when signal X takes a value that does not satisfy ODC_Y, it will not change the correct output value Y. In the rest of this section, signal X will be known as an ODC trigger signal, as defined below.
Definition 2 (ODC Trigger Signal) An ODC trigger signal is a signal that feeds into a gate with a non-zero set of ODC conditions and causes the ODC condition to activate. In the context of this work it also represents the signal that is used to modify the input gate to the primary gate for the fingerprint modification.
In order for this to work, the relationship between the signal X, gate 2, and gate 1 must be analyzed so that X only changes gate 1's output, Y, when it also triggers the ODC (criterion 3 in the definition of a fingerprint location). For every possible pair of gates that can be considered a fingerprint location, similar to gate 1 and gate 2, a structural change must be proposed in order to modify that location. This requires a maximum of n² proposed changes, where n is the number of ODC and single-input gates in a library.
For simple changes like the one in Fig. 7, or those in the motivational example, each such location can be considered a position to embed one bit of a bit string that represents the fingerprint. For each circuit that is manufactured, a fingerprint location can be either modified (a 1) or left alone (a 0). This means that for a circuit with n potential fingerprint locations there are at least 2ⁿ possible fingerprints and n bits of data in the bit string.
The fingerprint modifications proposed here can cause a large overhead relative to the circuit's initial performance. Rerouting paths, increasing the fan-in of gates, and introducing new inverters are the causes of the overhead. Two heuristic methods have been considered for reducing this overhead: a reactive method and a proactive method.
Of the two methods, the reactive method is easier to implement but is difficult to scale. This method involves taking a fully fingerprinted circuit and, by removing one fingerprint modification at a time, analyzing the difference in overhead, whether it be area, delay, power, or something else. The modification that results in the largest change to the overhead is removed, and the resulting circuit is tested again. This is done until a certain overhead constraint is met or there are no more modifications to remove.
The proactive method is more difficult to implement, but because it is applied as modifications are made, it scales well to larger circuits. This heuristic requires that each modification be analyzed before being implemented. For area and power this is simple, because any new gates or changes to gates result in overhead that can be estimated using information about the cells in the library. Delay is more difficult to analyze because not every modification will slow the circuit down. As modifications are added, the critical path may change, which changes where new modifications should be considered. The delay can be estimated by determining the slack on each gate and updating this information every time a modification is made, but this can be time consuming for large circuits that will have a large number of modifications. For this proactive method, modifications are added until a certain overhead constraint is met, the opposite of the reactive method.
The designer can easily detect a fingerprint embedded by our proposed approach, because the designer can compare the fingerprinted IP with the design that does not have any fingerprint to check whether, and what, change has occurred at each fingerprint location and thereby obtain the fingerprint.
However, it is infeasible for an attacker to reveal the fingerprint locations from a single copy of the IC. This is because, when the fingerprint information is embedded at a fingerprint location, the FFC of the fingerprinted IP will include a signal that is not in the FFC of the original design at the time the fingerprint location is identified. Consider the left circuit in Fig. 7 of the motivational example: the FFC that generates signal X contains only the 2-input AND gate with A and B as inputs. When signal Y is added to this AND gate, the FFC will include the 2-input OR gate with C and D as inputs. This invalidates this portion of the circuit as a fingerprint location (criterion 4 is violated).
When the attacker has multiple copies of fingerprinted ICs, he can compare the layouts of these ICs and identify the fingerprint locations where different fingerprint bits were embedded. This collusion attack is a powerful attack against all known fingerprinting methods. Carefully designed fingerprint copy distribution schemes may help [6, 7, 12], but they require a large number of fingerprinted copies. As we have demonstrated through experiments [13], the proposed approach can generate such large numbers of copies and thus can reduce the damage of the collusion attack. In addition, it is also known that as long as the collusion attackers do not remove all the fingerprint information, all the copies involved in the collusion can be traced [6, 7, 12].
Satisfiability Don't Cares (SDCs) are a Boolean concept used in circuit design optimization. Considering all the primary input (PI) signals and the internal signals produced by each logic gate in a circuit, SDC conditions describe the signal combinations that cannot occur. For example, consider the 2-input NAND gate in Fig. 9 below, C = NAND(A, B): we cannot have {A = 1, B = 1, C = 1}, {A = 0, C = 0}, or {B = 0, C = 0}.
In general, for a signal y generated from a logic gate G(x₁, x₂, …, x_k), the SDC at this gate can be computed by the equation

SDC = G(x₁, x₂, …, x_k) ⊕ y.

In the example of the above 2-input NAND gate, the SDC conditions can be obtained from this equation as follows:

(A·B)′ ⊕ C = (A·B)′·C′ + A·B·C.
Fig. 9 SDC modification. a Gate replacement: in both circuits C = NAND(A, B), and the output gate computing D from B and C is a 2-input NAND on the left and a 2-input XOR on the right. b Truth table for both circuits:

A  B  C  D (NAND)  D (XOR)
0  0  1  1         1
0  1  1  0         0
1  0  1  1         1
1  1  0  1         1
When some of these signals fan in to the same gate later in the circuit, the SDC conditions can be used to optimize the design. In our approach, we will use these SDC conditions to embed fingerprints, as illustrated in Fig. 9. We clearly see two different circuits in Fig. 9a, where the only difference is the output logic gate: it is a 2-input NAND in the left circuit and a 2-input XOR gate in the right. However, when we list the truth tables for the two circuits (see the table in Fig. 9b), we find that the two circuits have identical output signals regardless of the input combination. This is because the difference between the NAND gate and the XOR gate only occurs when both input signals B and C are 0, and, as Fig. 9b shows, this condition is never satisfied. So we can hide one bit of fingerprinting information by deliberately choosing which gate we use to implement the circuit.
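The equivalence claimed in Fig. 9 can be checked exhaustively; the following sketch enumerates all input combinations and confirms that the NAND-based and XOR-based output gates never disagree, because the disagreement case B = C = 0 is an SDC.

```python
from itertools import product

nand = lambda a, b: 1 - (a & b)
xor = lambda a, b: a ^ b

rows = []
for A, B in product([0, 1], repeat=2):
    C = nand(A, B)                                  # internal signal, C = NAND(A, B)
    rows.append((A, B, C, nand(B, C), xor(B, C)))   # D for each variant of Fig. 9a

print("A B C D_nand D_xor")
for r in rows:
    print(*r)
# The two D columns are identical: B = 0, C = 0 (where NAND and XOR differ)
# can never occur, so the gate choice silently encodes one fingerprint bit.
assert all(d1 == d2 for *_, d1, d2 in rows)
```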
By locating gates that have SDCs leading into them, which we refer to as fingerprint
locations, and finding alternative gates, we can modify the circuit by using either
the original gate or one of its alternatives at each fingerprint location, to generate
different fingerprinted copies. We now analyze and solve the following SDC based
fingerprint location problem:
Given an IP in the form of a gate level netlist, find a set of fingerprint locations, determine
the alternative gates at each location, and define a fingerprint embedding scheme to create
fingerprinted copies of the IP with any k-bit fingerprint.
Before presenting our solution to the problem, we list the necessary assumptions
and define the terminologies.
A1. The given netlist should be sufficiently large to accommodate the k-bit finger-
print.
A2. The given netlist is optimized and does not have internal gates producing con-
stant outputs. Circuits normally can be simplified if we replace constant-valued
variables with their value (0 or 1).
Fig. 10 Example of a dependent line. B is a dependent line for gate G2 and A is a dependent line for G4
A3. All primary inputs (PIs) to the circuit are independent. If one PI depends on other PIs (e.g., the complementary variables in dual-rail logic), we consider this PI as an internal signal.
For a given gate g in a circuit, a cone rooted at gate g is any sub-circuit that directly or indirectly produces a fan-in for gate g. A dependent line/fan-in for gate g is defined as a signal that directly or indirectly impacts two or more of gate g's fan-ins. For example, in Fig. 10, gates {G3, G4} form a cone rooted at G4 with inputs {A, D, E}; {G1, G2, G3, G4} is also one, with inputs {A, B, E}. A is a dependent line for G4 and B is a dependent line for G2 (but not for G3 or G4).
A necessary condition for fingerprint locations: a gate must have dependent lines to be a fingerprint location.
When a gate, say G1 in Fig. 10, does not have any dependent lines, its fan-ins are independent and thus all possible fan-in combinations may happen; no SDC can be found. On the other hand, dependent lines do not guarantee that a gate is a fingerprint location. Consider the dependent line A for gate G4 in Fig. 10: it is easy to see that when B = 0 we have F = E, which is independent of A, so all four combinations of A and F can appear as fan-ins to G4, and thus G4 cannot be a fingerprint location.
Based on the above observations, we propose the following heuristics to find finger-
print locations for k-bit fingerprints:
We search the gates for fingerprint locations following a topological order (Lines 1–2). If a fingerprint location is found, we mark the output of that gate as a PI (Line 12). In Line 3, we trace each fan-in of gate G_i back to PIs; whenever we see two fan-ins share the same signal, that signal is a dependent line. Then we construct the cone rooted at G_i by backtracking each fan-in of G_i until we find the source of the dependent line or the closest (intermediate or primary) input signals to the cone for G_i that don't include the dependent lines (note that here the PIs can be either the PIs of the entire circuit or the fan-outs of a fingerprint location, as marked in Line 12). Next we simulate all the combinations of input signals to this cone and observe whether they create any SDC at G_i's fan-ins (Line 5). If so, we have found a new fingerprint location at G_i and update the number of fingerprint bits (FP) we can produce (Lines 6–8). When FP becomes larger than k, the number of bits in the required fingerprint, we force the program to stop (Lines 9–10). The way to update FP depends on how the fingerprint will be embedded, which we discuss next.
Correctness of the heuristics: the heuristics may not find all the fingerprint locations, but the ones they find, as well as the SDC conditions (Line 7), are all valid. This claim states that our heuristics will not report any false fingerprint location or SDC condition. It ensures that when we do the gate replacement, the function of the original circuit will not be altered, one of the most important requirements for digital fingerprinting.
Complexity of the heuristics: this is dominated by the size of the cone rooted at the gate under investigation. In Line 5 (other operations are either O(1) or O(n)), we have to either solve a Boolean satisfiability instance to check whether each fan-in combination can occur, or simply do an exhaustive search over all the combinations of the inputs to the cone (which is what we implement here). In both cases, the complexity is exponential in the number of inputs to the cone. However, after we also treat the fan-outs of fingerprint locations as PIs, our simulation shows that the average number of inputs to a cone is only 5.24. The heuristics' run time is in seconds for all the benchmarks. A simplified version of this exhaustive check is sketched below.
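In this simplified sketch, the netlist is a dictionary of two-input gates, every primary-input combination is simulated, and any gate whose observed fan-in patterns miss some combination has SDCs and is reported as a candidate fingerprint location. The flat enumeration over all PIs (rather than per-cone, with fingerprint fan-outs re-marked as PIs) is a simplification made purely for illustration.

```python
from itertools import product

OPS = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "XOR":  lambda a, b: a ^ b,
}

def candidate_locations(netlist, primary_inputs):
    """netlist: output_name -> (gate_type, input_a, input_b), in topological order."""
    seen = {g: set() for g in netlist}               # fan-in patterns observed per gate
    for values in product([0, 1], repeat=len(primary_inputs)):
        sig = dict(zip(primary_inputs, values))
        for out, (op, a, b) in netlist.items():      # simulate in topological order
            seen[out].add((sig[a], sig[b]))
            sig[out] = OPS[op](sig[a], sig[b])
    # A gate with unobserved fan-in combinations has SDCs -> candidate location
    return {g: sorted(set(product([0, 1], repeat=2)) - pats)
            for g, pats in seen.items() if len(pats) < 4}

# Fig. 9 example: C = NAND(A, B), D = NAND(B, C)
netlist = {"C": ("NAND", "A", "B"), "D": ("NAND", "B", "C")}
print(candidate_locations(netlist, ["A", "B"]))
# {'D': [(0, 0)]} -> B = C = 0 never occurs at gate D, so D is a fingerprint location
```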
For each fingerprint location and its SDC conditions, we propose two replacement methods to embed the fingerprint:
R1. Replace the gate at the fingerprint location by another library gate, where the two gates differ in their outputs only on the SDC conditions at that location.
R2. Replace the gate at the fingerprint location by a multiplexer.
Figure 9 shows one example of R1, where a 2-input NAND gate and a 2-input XOR gate become interchangeable when the input combination 00 is an SDC condition. Suppose that there are p_i different library gates (including G_i) that can replace gate G_i; by choosing one of them, we can embed log₂(p_i) bits. So we update FP by this amount in Line 8 of our heuristics.
Fig. 11 Multiplexer replacement technique. Left: an unconfigured MUX with select inputs A and B. Right: a MUX configured to act as a 2-input NAND gate producing C
We first briefly discuss fingerprint detection, because it is directly related to most attacks. When an adversary can detect the fingerprint, he may have an easier time removing or changing it than he would with no knowledge of the fingerprint.
Fingerprint detection: when we are allowed to open up the chip and view its layout, we can recover the fingerprint by identifying the gate type at each fingerprint location (for R1) or by checking the configuration of each MUX (for R2).
As the authors have shown in [13, 14], there are abundant fingerprint locations in real-life circuits. Therefore we can choose to embed the fingerprint bits (or part of them) at gates that are visible from output pins. Then, when we inject the SDC conditions at a fingerprint location, we can tell the gate type (and thus the fingerprint bit) from the output values. Consider Fig. 9: if we inject B = 0 and C = 0 and observe 1 as the value of D, we know the gate is a NAND; otherwise it is an XOR.
Now we consider the following attack scenarios based on the adversary's capabilities.
Simple Removal Attack. The most obvious attack against a fingerprint is to simply remove it. This requires that an adversary know every location on an IC that our fingerprinting algorithm has modified and, more importantly, a way to remove these fingerprints without affecting the functionality of the original IC. In both R1 and R2, because the fingerprint locations are required to provide the correct functionality of the circuit/IP, simply removing them will destroy the design and make the IP useless.
Figure 12 depicts a 5-stage scan chain whose five scan cells (scan flip-flops, or SFFs) are labeled D1 through D5 from left to right. It gives the test engineer the ability to put the core under test (CUT) into any desired state (represented as the values of the SFFs) by inputting values, called test vectors, through the scan-in (SI) port, and then to observe how the core behaves through the scan-out (SO) port. Assume that in this case we have two test vectors X1 = 00000 and X2 = 01001. The corresponding responses (or next states) are Y1 = 00000 and Y2 = 10011.
Our fingerprinting approach takes advantage of the fact that scan cells can be chained using either the Q-SD or the Q′-SD connection style [19, 20]. Suppose that we have identified two pairs of SFFs, (D2, D3) and (D4, D5), as the locations to embed the fingerprint. We use the Q-SD connection to embed a bit 0 and the Q′-SD connection to embed a bit 1 (see Fig. 12). This allows us to embed any 2-bit fingerprint, 00, 01, 10, or 11, by selecting different connection styles.
Fig. 12 A 5-bit Scan Chain with the second and fourth connections chosen as the fingerprinting
locations. A 2-bit fingerprint can be created by selecting how the flip flops are connected at these
two locations
Table 1 Test vectors and output responses for all four different 2-bit fingerprints

f₁f₂   X1      Y1      X2      Y2
00     00000   00000   01001   10011
01     00001   11110   01000   01101
10     00111   11000   01110   01011
11     00110   00110   01111   10101
Suppose the original design uses the Q-SD connection at both locations, that is, it carries the fingerprint 00. To embed the fingerprint 01, for example, we connect the Q′ port of D4 to the SD port of D5. As a result, when data moves from D4 to D5, its value is flipped. Therefore, we have to change the two test vectors to X1 = 00001 and X2 = 01000 to ensure that the CUT is still tested with states 00000 and 01001, respectively. Similarly, the output responses Y1 and Y2 change in a corresponding fashion. Table 1 lists the two test vectors and their corresponding output responses for all four possible fingerprinted designs.
To identify each copy of the design, we can simply check the test vector. If the test
vector or its output response is different from Table 1, then the design is not genuine.
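The test-vector and response adjustments in Table 1 follow mechanically from how many inverting (Q′-SD) links a bit crosses while being shifted in or out. The sketch below reproduces the table from that rule; the link indexing convention (link i joins D_i to D_{i+1}) is an assumption made for the example.

```python
def adjust_scan_vectors(vector, response, inverting_links):
    """vector/response: bit strings for D1..Dn; inverting_links: set of link
    indices i (link i joins D_i and D_{i+1}) that use the Q'-SD style."""
    # A bit destined for D_j is inverted once per inverting link it crosses on
    # the way in (links i < j); a captured bit in D_j crosses links i >= j on
    # its way out to the SO port.
    vec = [int(b) ^ (sum(1 for i in inverting_links if i < j) % 2)
           for j, b in enumerate(vector, start=1)]
    resp = [int(b) ^ (sum(1 for i in inverting_links if i >= j) % 2)
            for j, b in enumerate(response, start=1)]
    return "".join(map(str, vec)), "".join(map(str, resp))

# Fingerprint locations are links 2 (D2-D3) and 4 (D4-D5); bit 1 = Q'-SD.
for f1, f2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    links = {i for i, bit in ((2, f1), (4, f2)) if bit}
    x1, y1 = adjust_scan_vectors("00000", "00000", links)
    x2, y2 = adjust_scan_vectors("01001", "10011", links)
    print(f1, f2, x1, y1, x2, y2)   # reproduces the rows of Table 1
```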
Scan design adds testability to an IC by giving the tester direct access to the internal state, which avoids complex sequential automatic test pattern generation. The main change a scan chain makes to a circuit is to replace the normal D flip-flops (DFFs) with so-called scan flip-flops (SFFs). As depicted in Fig. 12, an SFF consists of a normal DFF as well as a multiplexer and two new input signals: scan data (SD) and test control (TC). The SFFs are chained together by connecting the output port (Q or Q′) of one SFF to the SD input of the next SFF. The TC signal is used to switch the operating mode of the core under test (CUT) between normal and testing. In the normal operating mode, the SFFs act as the DFFs that the circuit design originally had. In testing mode, test data comes in from the scan-in (SI) port, and the test results are supplied to the scan-out (SO) port.
In most technology libraries, DFFs provide two outputs: Q and its complement Q′. Both are used by the CUT. Since SFFs are built on top of DFFs, both the Q and Q′ ports are available on SFFs. This allows two adjacent SFFs to be connected either in the Q-SD style or in the Q′-SD style [19].
We utilize the Q-SD and Q′-SD connection styles between SFFs to create a fingerprint for a design in the following steps:
Step 1. Perform the normal scan design to obtain the best possible solution. This normally includes determining (1) a set of test vectors that achieves the best fault coverage; (2) the order of the scan chain, that is, which SFF follows a given SFF; and (3) the connection style between each pair of adjacent SFFs.
Step 2. Identify the fingerprint locations. By deliberately choosing whether two adjacent flip-flops have a Q-SD or a Q′-SD connection, we can create one bit of information for the fingerprint. If the design has n flip-flops in its scan chain, we can embed any of the 2ⁿ possible n-bit fingerprints. When we only need k-bit fingerprints (k < n), the problem becomes how to select k pairs of SFFs as fingerprint locations so as to minimize the performance overhead in the fingerprinted copies.
Step 3. Develop fingerprint embedding protocols. This can be as simple as the scheme in the illustrative example, where 0 and 1 are embedded as Q-SD and Q′-SD connection styles. A good fingerprint embedding protocol should balance (1) low design cost, (2) low or no performance degradation, (3) easy detectability, and (4) high robustness and resilience.
Step 4. Modify the set of test vectors. While the fingerprints take the form of Q-SD or Q′-SD connection styles, we want to maintain the test vectors' fault coverage. Therefore, the set of test vectors has to be updated based on the fingerprint embedded in the design, as shown in the illustrative example.
In this section, we first discuss the advantages of the proposed scan-chain-based fingerprinting technique and then conduct a security analysis of potential attacks, as well as the corresponding countermeasures.
Our approach can easily be implemented by local rewiring to set a specific connection style for certain pairs of scan cells. Since the change is local, it would
successfully avoid the high design overhead introduced by scan chain reordering or rerouting. Therefore, the proposed fingerprinting technique incurs only low overhead in terms of area, power, and speed.
Another consequence is that the applied test vectors need to be adjusted for different fingerprinting configurations in order to maintain high fault coverage. This gives us two ways to detect fingerprints: on one hand, we could physically open up the chip and check the connection styles to directly determine the fingerprinting bits; on the other hand, fingerprints can also be extracted from the associated test vectors and output responses. Clearly the second method, detecting fingerprints from test vectors, is non-intrusive and essentially free, which makes the proposed scheme easily detectable.
Next we analyze various possible attacks on our proposed approach and present
corresponding countermeasures to show that the approach is resistant to tampering.
Fingerprint Denial. The most straightforward attack is to simply declare that the
existing fingerprint in the IP is merely a coincidence, without making any change
to the IC. We can defeat this attack by showing that the probability of coincidence
is very low when the fingerprint is long enough. In our proposed scheme,
it is rational to assume that the connection style at any position is equally likely
to be Q-SD or Q'-SD. Thus, the probability that the m bits produced by a non-
fingerprinted design match a specific m-bit fingerprint is 1/2^m. As a result, we
can see that a long fingerprint provides strong authorship proof. More importantly,
since the fingerprinted design inevitably incurs a power overhead compared
to the optimal design [19], it would make no sense for the designer to choose this
specific connection style unless it is used for embedding fingerprint bits.
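As a quick back-of-the-envelope illustration (hypothetical numbers, not part of the original analysis), the coincidence probability shrinks exponentially with the fingerprint length:

# Probability that a non-fingerprinted design matches an m-bit fingerprint by chance.
for m in (16, 32, 64):
    print(m, 2 ** -m)   # m = 64 gives roughly 5.4e-20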
Fingerprint Removal. In regards to removability, the result is similar to that in
[20]. If an adversary wishes to remove the fingerprint, they would need to reverse
engineer the device, or have access to a netlist, at which point they would need to
remove the entire scan chain. Reverse engineering the entire device and attempting
to rebuild a new scan chain in a netlist would both impose an extreme cost on the adversary,
making it unlikely that they would attempt to remove the scan chain fingerprint or
redesign the circuit without it.
Fingerprint Modification. In this type of attack, the adversary attempts to alter
the fingerprints. We discuss this attack with respect to the two detection
methods: checking the test vectors or opening up the chip. In the first case,
adversaries could change the test vectors associated with a certain device,
making it difficult or impossible to identify the fingerprint. This only works when
test vectors are used to detect fingerprints, and it is difficult to perpetrate because
adjusting the test vectors alone leads to lower fault coverage of the scan chain. Without
proper test coverage, a circuit may malfunction without the end user knowing.
Furthermore, we can detect this attack by observing the mismatch between the
test vectors and the output responses of the fake IP.
If fingerprints are detected by opening up the chip, the attacker would instead turn to
modifications of the interior structure of the IC rather than of the test vectors alone. In this
case, the attacker could randomly change the connection styles between SFFs such
that the IP author is not able to obtain accurate evidence to establish ownership.
6 Conclusion
Fingerprinting is one of the most powerful and efficient methods to discourage illegal
distribution. A fingerprinted IP will not directly prevent misuse of the IP, but it
allows the IP provider to detect the source of a redistributed IP and therefore trace
the traitor. The key problem in using fingerprinting for IP protection is
the tradeoff between collusion resilience and the run-time overhead of generating a large
number of distinct IP instances. In this chapter, we provide a comprehensive review of
the existing research on digital fingerprinting for IP protection. We analyze the needs
and basic requirements for digital fingerprints. We present two generic approaches
that can be used to create fingerprinted copies of IP at many of the pre-silicon design
and synthesis stages. This demonstrates that multiple (and many) distinct copies of
IP can be generated within a short run-time. However, these approaches are not practical
because the IPs they create require different masks for fabrication. Therefore, we
further report several practical digital fingerprinting methods at the post-silicon stage.
For each of these methods, we show the key idea with illustrative examples, elaborate
the technical details, perform security analysis on potential attacks, and propose
corresponding countermeasures.
Acknowledgments This work is supported in part by the Army Research Office under grants W911NF-1210416 and W911NF-1510289, and by an AFOSR MURI under award number FA9550-14-1-0351.
References
1. Qu, G., Potkonjak, M.: Intellectual Property Protection in VLSI Designs: Theory and Practice.
Kluwer Academic Publishers, ISBN 1-4020-7320-8, January 2003
2. Guin, U., Huang, K., DiMase, D., Carulli Jr., J.M., Tehranipoor, M., Makris, Y.: Counterfeit integrated circuits: a rising threat in the global semiconductor supply chain. Proc. IEEE 102(8), 1207–1228 (2014)
3. IHS Technology: Reports of counterfeit parts quadruple since 2009, challenging US defense industry and national security. https://technology.ihs.com/389481/. Accessed 14 Feb 2012
4. Qu, G., Yuan, L.: Secure hardware IPs by digital watermark. In: Introduction to Hardware Security and Trust, pp. 123–142. Springer, ISBN 978-1-4419-8079-3 (2012)
5. Kahng, A.B., Lach, J., Mangione-Smith, W.H., Mantik, S., Markov, I.L., Potkonjak, M., Tucker, P., Wang, H., Wolfe, G.: Watermarking techniques for intellectual property protection. In: 35th Design Automation Conference Proceedings, pp. 776–781 (1998)
6. Patel, H.J., Crouch, J.W., Kim, Y.C., Kim, T.C.: Creating a unique digital fingerprint using existing combinational logic. In: IEEE International Symposium on Circuits and Systems, Taipei (2009)
7. Jin, Y., Makris, Y.: Hardware Trojan detection using path delay fingerprint. In: IEEE International Workshop on Hardware-Oriented Security and Trust, Anaheim, CA (2008)
8. Pappu, R., Recht, B., Taylor, J., Gershenfeld, N.: Physical one-way functions. Science 297(5589), 2026–2030 (2002)
9. Zhang, J., Qu, G., Lv, Y., Zhou, Q.: A survey on silicon PUFs and recent advances in ring oscillator PUFs. J. Comput. Sci. Technol. 29(4), 664–678 (2014). doi:10.1007/s11390-014-1458-1
10. Lach, J., Mangione-Smith, W.H., Potkonjak, M.: FPGA fingerprinting techniques for protecting intellectual property. In: Proc. CICC (1998)
11. Caldwell, A.E., et al.: Effective iterative techniques for fingerprinting design IP. In: Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, New York, NY (1999)
12. Qu, G., Potkonjak, M.: Fingerprinting intellectual property using constraint-addition. In: Proceedings of the 37th Annual ACM/IEEE Design Automation Conference, New York, NY (2000)
13. Dunbar, C., Qu, G.: A practical circuit fingerprinting method utilizing observability don't care conditions. In: Design Automation Conference (DAC'15), June 2015
14. Dunbar, C., Qu, G.: Satisfiability don't care condition based circuit fingerprinting techniques. In: 20th Asia and South Pacific Design Automation Conference (ASP-DAC'15), pp. 815–820, January 2015
15. Wagner, N.R.: Fingerprinting. In: IEEE Computer Society Proceedings of the 1983 Symposium on Security and Privacy, pp. 18–22 (1983)
16. Biehl, I., Meyer, B.: Protocols for collusion-secure asymmetric fingerprinting. In: Reischuk, Morvan (eds.) STACS'97: Proceedings of the 14th Annual Symposium on Theoretical Aspects of Computer Science, pp. 399–412. Springer (1997)
17. Boneh, D., Shaw, J.: Collusion-secure fingerprinting for digital data. In: Coppersmith (ed.) Advances in Cryptology – CRYPTO'95, Proceedings of the 15th Annual International Cryptology Conference, pp. 452–465. Springer (1995)
18. Chang, K., Markov, I.L., Bertacco, V.: Automating post-silicon debugging and repair. In: IEEE/ACM Int'l Conference on Computer-Aided Design, San Jose, CA (2007)
19. Gupta, S., Vaish, T., Chattopadhyay, S.: Flip-flop chaining architecture for power-efficient scan during test application. In: Proceedings of the Asian Test Symposium, pp. 410–413. Kolkata, India (2005)
20. Cui, A., Qu, G., Zhang, Y.: Dynamic watermarking on scan design for hard IP protection with ultra-low overhead. IEEE Trans. Inf. Forensics Secur. 10(11), 2298–2313 (2015). doi:10.1109/TIFS.2015.2455338
Operating System Fingerprinting
The purpose of operating system fingerprinting varies, ranging from serving as a tool
for internal auditing or external vulnerability assessment, detecting unauthorized
devices in a network, or tracking hosts' operating system deployment status, to
tailoring offensive exploits.
J. Gurary Y. Zhu
Cleveland State University, Cleveland, OH 44115, USA
e-mail: [email protected]
Y. Zhu
e-mail: [email protected]
R. Bettati
Texas A&M University, College Station, TX 77840, USA
e-mail: [email protected]
Y. Guan (B)
Iowa State University, Ames, IA 50011, USA
e-mail: [email protected]
Springer Science+Business Media New York (outside the USA) 2016 115
C. Wang et al. (eds.), Digital Fingerprinting,
DOI 10.1007/978-1-4939-6601-1_7
Just like a human fingerprint's unique pattern (i.e., the positions and shapes of ridge
endings, bifurcations, and dots) serves to identify an individual in the real phys-
ical world, an Operating System (OS) has unique characteristics in its own design as
well as in its communication implementation variations. By analyzing protocol flags,
option fields, and payload in the packets a device sends to the network, one can make
useful and relatively accurate guesses about the OS of the host that sent those packets
(a.k.a. operating system fingerprinting).
Operating system fingerprinting can generally be done using two complementary
approaches: active scanning approaches send carefully crafted queries to
hosts, while passive analysis examines captured network traffic, both with the
purpose of identifying the OS on the host being analyzed. Active scanning includes
the automated or semi-automated use of tools such as nmap, together with
manual analysis of the responses from these hosts. Active scanning generally allows
a more precise estimate of the OS on each host, but its use is often limited by
overhead, privacy, and other legal or policy constraints. In contrast, passive
OS fingerprinting does not send specially crafted probe messages to the host being
analyzed. Instead, it only examines the values of fields in the TCP/IP packet headers of
passively collected network packets. For the same reason, passive
fingerprinting may sometimes not be as accurate as active fingerprinting.
The requirements of operating system fingerprinting include:
Accuracy: low Type 1 and Type 2 error rates in terms of falsely detected OSes.
Firewall and IDS neutrality: OS fingerprinting should neither be disturbed by, nor
disturb, existing firewalls and IDSes in networked IT systems.
Politeness: OS fingerprinting should not create an overly large volume of network
traffic, nor cause harm to networked IT systems.
Adaptiveness and extensibility: OS fingerprinting should be adaptive and easily
extensible to new or updated OSes.
Complexity: OS fingerprinting should minimize time complexity and other complexity
such as space requirements.
2.1 OS Fingerprinting
Fingerprinting can be used for beneficial and malicious purposes. Many profes-
sionals in network management consider fingerprinting a valuable tool, allowing
them to adjust their services based on the OS of the user. A common issue is main-
taining networks that allow Bring Your Own Device (BYOD) policies, i.e., the user
can bring a device, such as a mobile phone or laptop, into the network. By enabling
the network to detect the type and OS of the device, security and connectivity can
be simplified for network administrators. In the security field, fingerprinting is con-
sidered a type of reconnaissance: the attacker uses fingerprinting to determine the
nature of the victim's system and plans an attack according to its vulnerabilities.
Passive fingerprinting does not interfere with traffic to or from the target. There are
only a few situations where passive fingerprinting is possible:
1. The victim connects to the attacker, and the attacker wants to determine what sort
of system is connecting to them. Sometimes this can involve tricking the victim
into connecting to the attacker in some way.
2. The attacker connects to the victim in an innocuous way, for example by visiting
a web site hosted on the victim's server. The traffic is not altered from normal
traffic to the victim in order to avoid detection; thus we consider this a passive
approach even though packets are being sent by the attacker.
3. The attacker sits between the victim and the destination server to intercept their
traffic on the wire. This can include capturing traffic from the target's WiFi con-
nection or sniffing for traffic on the target's gateway.
Network analysis tools such as p0f, developed as a part of the Honeynet Project
[32], fingerprint the OS by checking TCP signatures. These tools generally examine
the TCP SYN packet. For a pair of computers to establish a TCP connection, they must
first perform a TCP handshake across the network. To start the handshake,
the client sends a SYN packet to the server. It contains the client's desired TCP
settings in the header, for example the window size and Time-to-Live (TTL). Using
signatures in the SYN packet's header, tools such as p0f can build a classification
system that determines which OS generated the SYN packet.
Here we discuss the most common TCP signatures used in fingerprinting SYN
packets.
Two commonly used TCP signatures are the TTL and the TCP window size, as
discussed in [27].
TTL is often OS specific; for example, many Unix OSes use a TTL of 64, while
most versions of Windows use a TTL of 128. Since many different OSes share the
same TTL value, TTL alone is seldom enough.
Window size can change between different OS releases. Windows XP uses a
TCP window size of 65535, while Windows 7 uses a size of 8192.
Two additional TCP signatures, the Don't Fragment (DF) bit and the Type of
Service (ToS) flags, also vary between different OSes and OS releases by the same
manufacturer.
Older operating systems seldom use the DF bit; a handful of older OSes, for
example SCO and OpenBSD, do not use the DF flag at all.
The ToS flag is typically 0 during the SYN exchange, but some OSes set ToS to
another value. Several versions of FreeBSD, OpenBSD, and AIX set the ToS flag to
minimize delay (16) instead of 0.
Some tools examine the Selective Acknowledgement (SackOk), No-Operation
(NOP), and End of Option List (EOL) options, as well as the Window Scale value and
Maximum Segment Size (MSS).
Most Linux and Windows releases set the SackOk flag, while many Mac, Cisco
IOS, and Solaris releases do not set SackOk.
Taleck [36] presents a table with different ways NOP options can be padded
onto TCP options.
EOL can be used as padding, and thus depends on other options.
Most newer OSes implement window scaling; however, older OSes (such as
pre-2000 Windows releases) do not.
MSS specifies the maximum packet size the host can receive in one segment.
This value is determined by the OS and varies by release. Novell uses an MSS of
1368, and FreeBSD uses an MSS of 512.
Further common fingerprinting values can be found in [34] or by studying the
p0f database. A simple illustration of how such signatures can be combined is shown below.
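The sketch below matches a handful of observed SYN header fields against a tiny signature table. The table entries and the choice of fields are illustrative assumptions for exposition only; they are not the actual p0f signature format or database.

# Hypothetical signature table keyed by (initial TTL, window size, DF bit, SackOk).
SIGNATURES = {
    (128, 65535, 1, 1): "Windows XP",
    (128, 8192, 1, 1): "Windows 7",
    (64, 5840, 1, 1): "Linux 2.6",
    (64, 65535, 1, 0): "FreeBSD",
}

def guess_os(ttl, window, df, sackok):
    """Return the best OS guess for an observed SYN, or 'unknown'.
    The observed TTL is rounded up to the nearest common initial value
    (32/64/128/255) because each router hop decrements it."""
    for initial in (32, 64, 128, 255):
        if ttl <= initial:
            ttl = initial
            break
    return SIGNATURES.get((ttl, window, df, sackok), "unknown")

print(guess_os(ttl=121, window=8192, df=1, sackok=1))  # -> "Windows 7"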
Sometimes, it is possible for the attacker to obtain the SYN-ACK response from
the server as well as the SYN, or perhaps the attacker is only able to capture the SYN-
ACK. This may be the case when fingerprinting a device that seldom initiates TCP
connections but often receives them, for example a printer. A SYN-ACK packet does
not necessarily carry the same information as the SYN, since the server likely has
its own TCP settings and may choose certain parts of the packet differently (e.g., the DF
or NOP bits). However, a server's reply to a SYN packet often varies depending on
the SYN packet's settings (and, by extension, the sender's OS). Thus it is possible
for an attacker to create a database of SYN-ACK responses, covering the responses
to SYNs from various OSes, and use these to fingerprint the sender. The database required to
accurately fingerprint an OS in this manner would be significantly larger than one based on
SYN packets alone.
Taleck [36] implements a TCP SYN mapping tool to identify 42 different operating
systems based on many of the options described above. Beverly [6] uses signatures
collected by the p0f community to train a classifier and perform passive finger-
printing. Other passive fingerprinting tools similar to p0f include DISCO [39] and
Ettercap.
TCP timestamps are also frequently used to examine traffic. Kohno et al. [22] present
a technique for determining a system's clock skew from TCP timestamps and
using this information to determine the OS. Operating systems can also send their packets
in specific patterns or bursts, and this too can be used to identify the OS.
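The clock-skew idea can be approximated as follows: collect (arrival time, TCP timestamp) pairs, convert the timestamps to seconds, and fit a line to the offset between the two clocks; the slope is the skew. The sketch below is a simplified illustration that assumes a known timestamp tick rate and uses an ordinary least-squares fit, rather than the exact estimator of [22].

import numpy as np

def estimate_clock_skew(arrival_times, tcp_timestamps, hz=100):
    """Estimate clock skew in ppm from (arrival time [s], TCP timestamp tick) pairs.
    Assumes the remote timestamp clock runs at `hz` ticks per second."""
    t = np.asarray(arrival_times, dtype=float)
    remote = np.asarray(tcp_timestamps, dtype=float) / hz
    # Offset between the remote clock and the measurement clock over time.
    offset = (remote - remote[0]) - (t - t[0])
    slope, _intercept = np.polyfit(t - t[0], offset, 1)  # least-squares line fit
    return slope * 1e6  # parts per million

# Example: a host whose clock gains about 50 microseconds per second (~50 ppm).
times = np.arange(0, 600, 10.0)
ticks = (times * (1 + 50e-6)) * 100
print(round(estimate_clock_skew(times, ticks), 1))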
Kollmann [23] proposes to fingerprint the OS based on the implementation of the
DHCP protocol, as different OSes request different combinations of DHCP
options. For example, most Windows OSes need to look up their NetBIOS servers,
while most Mac OSes are not interested in these options. Several proprietary ser-
vices such as InfoBlox [20] utilize DHCP fingerprinting to help network managers iden-
tify devices on their network. Since DHCP messages are broadcast through the local
network, it is easy for an attacker (or network administrator) to identify devices on
the local network by connecting to the network and listening in.
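DHCP fingerprinting typically keys on the order of the parameter request list (DHCP option 55) in DISCOVER/REQUEST messages. The option lists below are made-up placeholders used only to show the lookup structure; real databases such as the one maintained by InfoBlox contain vendor-curated lists.

# Hypothetical mapping from DHCP option-55 parameter request lists to OS labels.
DHCP_FINGERPRINTS = {
    (1, 3, 6, 15, 31, 33): "Windows (generic)",    # placeholder option order
    (1, 3, 6, 15, 119, 252): "Mac OS X (generic)",  # placeholder option order
    (1, 121, 3, 6, 15, 119): "Linux (dhclient)",    # placeholder option order
}

def fingerprint_dhcp(param_request_list):
    """Return the OS label for an observed option-55 list, or 'unknown'."""
    return DHCP_FINGERPRINTS.get(tuple(param_request_list), "unknown")

print(fingerprint_dhcp([1, 3, 6, 15, 119, 252]))  # -> "Mac OS X (generic)"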
ICMP messages can also be used for fingerprinting, as demonstrated in [3]. For
example, the IP-level TTL of ICMP Echo replies on most Windows OSes is 128,
while many Unix distributions use an IP TTL of 255. In addition to header data, ICMP
messages can also contain important identification information in the reply. Windows
2000 zeros out the ToS field in its ICMP Echo reply when sent an Echo request with
a non-zero ToS, a unique behavior that only a few relatively unpopular other OSes
share (namely Novell NetWare and Ultrix).
A number of methods for identifying the computer OS by inspecting application-layer
data in traffic, such as server banners in HTTP, SSH, and FTP, as well as HTTP client
User-Agent strings, are also discussed in [27].
Many passive fingerprinting methods can be defeated by simple modifications
to the OS and network settings. TCP options can be changed to obfuscate the OS,
for example changing the TCP window size of a Windows XP PC to 8192 in order
to masquerade as Windows 7. Changing OS settings is not ideal, however, and probably
inaccessible to most users. Most consumer firewalls disallow ICMP Echo requests by
default, and many modern firewalls block ICMP timestamps as well. Furthermore, a large
percentage of Internet users are behind NAT devices, where it is difficult to tell how
many unique devices sit behind each NAT device. In 2002, Armitage [4] estimated
that 17–25 % of Internet users access the Internet through a NAT-enabled gateway,
router, or firewall; this number is likely much higher today. Finally, accessing a victim's
local area network can be difficult.
The most popular tool for active fingerprinting is the nmap program [28]. Nmap
has several useful features to fingerprint a system:
1. Port Scanning: Nmap finds open ports on the target network. Generally speaking,
avoiding detection is not a goal of the nmap program. There are several ways in
which nmap can find open ports:
TCP SYN scan: Nmap sends a SYN packet to ports on the victim's network
until it receives a response, indicating the port is open. This method is very
fast, potentially scanning thousands of ports per second, but can be detected
by a well-configured Intrusion Detection System (IDS).
TCP connect scan: This method is used when the attacker's OS does not allow them
to create raw packets. Nmap asks the OS to initiate a TCP connection to ports
on the victim's system. In addition to being significantly slower than the SYN
scan, this method is more likely to be detected as it actually completes the
TCP connection.
UDP scan: Scanning for UDP ports is complicated. Open ports do not have
to send any response, since there is no connection setup as there is in TCP.
Closed ports generally send back an ICMP port-unreachable error, but most
hosts will only send a certain number of ICMP port-unreachable messages
in a given timeframe. Nmap automatically slows down the rate at which ports are
scanned when it determines packets are being dropped.
SCTP scan: SCTP is a newer protocol that combines features from TCP and
UDP. Like TCP, SCTP initiates connections by handshaking. An INIT packet is
sent to targeted ports on the victim's network. An INIT-ACK response indicates
an open port, while an ABORT, an ICMP-unreachable message, or a series of timeouts is
considered a closed port.
TCP NULL, FIN, and Xmas scans: These scans all utilize loopholes in TCP
defined in the official TCP Request for Comments (RFC). A NULL scan sends
a TCP packet with a flag header of 0, a FIN scan sets only the FIN bit, and
an Xmas scan sets the FIN, PSH, and URG flags. When receiving these ill-
constructed packets, a system configured properly according to the TCP RFC will
send a Reset (RST) packet if the port is closed and no response if the port is
open. Most major operating systems send a RST packet regardless, and are
thus immune to this scan.
ACK scan: To determine if certain ports are filtered, nmap sends ACK packets
to target ports on the victim's network. Open or closed ports return an RST,
while filtered ports will not respond or will send an ICMP error.
TCP Window scan: An extension of the ACK scan, this method also examines
the TCP Window field of the RST response. Some systems use a positive
window size for open ports and a size of zero for closed ports.
2. Host Discovery: A local network can be allocated thousands or even millions of
IP addresses, but use only a tiny fraction of them. Nmap determines which IP
addresses map to real hosts. By default, nmap sends an ICMP Echo request, a TCP
SYN on port 443, a TCP ACK on port 80, and an ICMP Timestamp request to
determine which hosts reply.
3. Version Detection: Certain ports generally map to certain services; for example,
SMTP mail servers often listen on port 25. Nmap maintains a list of which services
map to which ports, and is also able to probe open ports to see what service is
running on them based on patterns in its database.
4. OS Detection: Perhaps the best-known feature of nmap, OS detection sends a
series of TCP and UDP packets to the victim and compares the responses against
a database of over 2,500 (mostly community-generated) known OS fingerprints.
The following probes are sent to open ports on the target's network:
A series of six TCP SYN packets are sent to the target 100 ms apart. The
window scale, MSS, NOP, timestamp, SackOk, EOL, and window fields are
set to specific values. More detailed information about these settings can
be found in [28].
Two ICMP Echo requests are sent to the target.
A TCP explicit congestion notification (ECN) probe is sent to the target. This is a
special TCP SYN packet that sets certain congestion control flags.
An additional six TCP packets are sent with a variety of unique settings, for
example a TCP NULL packet with no flags set, several TCP ACK packets, and
a TCP SYN packet addressed to a closed port.
A UDP packet is sent to a closed port on the target.
In addition to examining headers in responses from the target, as described pre-
viously, nmap studies timing data from the responses to fingerprint the OS. The
responses to the initial six TCP SYN packets are studied for their initial sequence
numbers (ISNs). Some operating systems increment the ISN in predictable ways, for
example incrementing it by a multiple of 64,000 for each new connection. First,
nmap calculates the differences between consecutive probe responses, i.e., subtracting ISN1
from ISN2, and so on, for a total of five values. Nmap then calculates the greatest common
divisor of the differences between the six TCP SYN responses to determine if
there is a pattern to the ISN choice. Nmap also uses these differences to calculate
the ISN counter rate (ISR): the differences are divided by the time elapsed in
seconds, and the average of these values is recorded. With these two values, nmap
can also attempt to predict what the target's next assigned ISN will be.
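A rough sketch of this ISN-based computation is shown below; it mirrors the description above (consecutive differences, their GCD, and the ISN counter rate) but omits nmap's additional modular-difference and wrap-around handling.

from math import gcd
from functools import reduce

def isn_metrics(isns, send_times):
    """Given six ISNs and their probe send times (seconds), return the GCD of
    the consecutive differences and the average ISN counter rate (ISR)."""
    diffs = [(b - a) % 2**32 for a, b in zip(isns, isns[1:])]   # five values
    gcd_of_diffs = reduce(gcd, diffs)
    rates = [d / (t2 - t1) for d, t1, t2 in zip(diffs, send_times, send_times[1:])]
    isr = sum(rates) / len(rates)
    return gcd_of_diffs, isr

# Example: an OS that increments its ISN by a multiple of 64,000 per connection.
isns = [64000 * k for k in (3, 5, 6, 9, 11, 12)]
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print(isn_metrics(isns, times))   # GCD of the differences is 64000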
Various other network scanning tools, such as ZMap [12], use active fingerprinting
to remotely collect information about nodes connected to the Internet. Some tools,
such as SinFP [5], combine active and passive approaches to fingerprinting. In SinFP,
the attacker actively probes the server with TCP SYN packets or intercepts TCP SYN
+ ACK responses passively. Kohno et al. [22] use ICMP Timestamp requests actively
aimed at the target to identify its system clock skew and, further, to determine
its OS. Arackaparambil et al. [2] expand upon this method and propose an
attack that spoofs the clock skew of a real system. This is similar to work in [1] that
uses ICMP timestamps to estimate network-internal delays.
Various countermeasures have been proposed that are designed to defeat OS fin-
gerprinting. Smart et al. [35] developed a TCP/IP stack fingerprint scrubber to defend
against active and passive OS fingerprinting attacks based on the TCP/IP stack.
The scrubber sanitizes packets from a group of hosts at both the network and transport
layers to block fingerprinting scans. These sanitized packets will not reveal OS
information.
Blocking certain traffic, for example ICMP Echo requests, can help thwart tools
such as nmap; however, this is not a practical approach for servers that need to answer
all sorts of traffic. Systems can also be configured not to respond to malformed or
unusual packets such as a NULL TCP packet or a stray ACK. As with passive finger-
printing, it is also possible to modify network settings to alter a system's responses
from the norm; however, not all network settings can be modified by the user, and this
is an impractical approach for most users.
Lastly, we note that all these approaches require access to the packet headers or
packet content. As a result, these methods are largely ineffective when applied to
intercepted encrypted traffic.
Liberatore and Levine [25] proposed traffic analysis on encrypted HTTP streams
to infer the source of a web page retrieved over an encrypted HTTP connection. A profile of
each known website is created in advance. The traffic analysis then identifies the source
by comparing observed traffic with the established profiles using classification algorithms.
They used a sample size of 2,000 websites with 400,000 traffic traces.
Inferring Users' Online Activities Through Traffic Analysis: Zhang et al. [42]
use short traces of encrypted traffic on IEEE 802.11 wireless local area networks
(WLANs) to infer the activities of a specific user (e.g., web browsing, file downloading,
or video streaming). Their experiments include traffic traces from web browsing,
online chatting, online gaming, file downloading, and video conversations. They
developed a hierarchical learning-based classification system to discover the web activities
associated with a traffic trace. They performed their experiments in a home
environment, on a university campus, and on a public network. They were able to infer
a user's activities with 80 % accuracy using 5 s of traffic and 90 % accuracy with
1 min of traffic.
Hidden Services: Hidden services are used in anonymity networks like Tor to resist
censorship and attacks such as denial of service. Øverlier and Syverson [30]
proposed attacks to reveal the location of a hidden server in the Tor network. Using
one corrupt Tor node, they were able to locate a hidden server in minutes. They then
proposed changes to the Tor network to resist their attacks.
A very similar effort in [7] investigates flaws in the Tor network and its hidden
services. Three practical cases, including a botnet with hidden services for command
and control channels, a hidden service used to sell drugs, and the DuckDuckGo
search engine, are used for evaluation. Their method involves first gaining control of
the descriptors of a hidden service and then performing a traffic correlation attack
on the hidden service. Zander and Murdoch [41] aim to improve the clock-skew
measurement technique for revealing hidden services. The original method [26]
correlates clock-skew changes with times of high load. They identified two sources of
noise, network jitter and timestamp quantization error, and aim to reduce the latter
by synchronizing measurements to the clock ticks. They were able to reduce the
timestamp quantization error and increase their accuracy by two orders of magnitude.
Smartphone traffic has been analyzed for various purposes. In [37], Tzagkarakis
et al. proposed to use Singular Spectrum Analysis to characterize the network load
in a large WLAN. This is beneficial for monitoring the load and placing access points
accordingly. Their findings can help design large-scale WLANs that can be used by
smartphones in large public areas.
The fingerprinting and traffic analysis attacks above are based on information in
packet headers and payload data. Since the use of encrypted traffic to ensure privacy is
becoming more popular, our experiments focus only on encrypted traffic (i.e., the
only available information is packet timing and size). There has been much other
work on attacks based on encrypted traffic. Wright et al. [40] aimed to classify
network traffic using only packet timing, size, and direction. They wanted to classify
the traffic so that network security monitors could still enforce security policies
on encrypted traffic. Their classification accuracy was as high as 90 %. In [38],
Wang et al. proposed a watermark-based approach to correlation attacks on attacker
stepping stones. They showed through their experiments that their active
attacks perform better than passive correlation timing attacks and also require fewer
packets. Zhu et al. [43] aimed to show the security threats created by the silence
suppression feature (i.e., packets are not sent unless speech is detected) of online
speech communications. They showed that talk patterns can be recovered by
looking only at the packet timing. They proposed packet-timing traffic analysis attacks on
encrypted speech communication using different codecs, and were able to detect
speakers of encrypted speech communications with high accuracy from only 15 min
of traffic timing data. Bissias et al. aimed to identify the source of encrypted HTTP
traffic streams in [8]. Their initial results were as low as 23 % accuracy, but they were
able to achieve 100 % accuracy with only three guesses. These experiments show
that traffic analysis attacks on encrypted traffic using packet timing can be, and are,
very successful.
communication capabilities such as WiFi, 3G, 4G, and, in the next couple of years,
5G networks, (c) the user-friendly interfaces supporting touch and gesture based input,
and (d) the huge number of mobile apps being developed and used have revolutionized
the ways of living and working for billions of online users. With the increasing reliance
on smartphones, users are increasingly using them to share sensitive data via social
network apps. Smartphones are also adopted in business and military environments
[16] because of their portability and constant network access. As a result, smartphone
security is of great importance nowadays.
For the same reason, we have seen more and more serious security attacks against
mobile platforms and the apps running on them. Smartphone reconnaissance, usually the
first step of a security attack [13], is aimed at collecting information on a target. In
order to launch an effective attack on a particular smartphone, an attacker usually
needs to tailor the attack to the target smartphone's platform. This in turn requires
that the attacker be able to identify the operating system running on the target smart-
phone. Once the attacker knows the target OS, he or she becomes able to exploit
known vulnerabilities both of the smartphone OS and of the applications and ser-
vices running on the OS. The most readily obtainable information that enables OS
identification is the wireless traffic generated by the target smartphone. Since more
and more smartphone traffic is encrypted to protect the confidentiality of the wireless
communications [14], the OS identification must not rely on either the content of
packets or packet headers.
The ability to identify smartphone OSes can enable many applications, some of
which are benign, while many others are not (i.e., they are malicious in nature):
(1) As a smartphone owner or a smartphone defense designer, we would like to know
how susceptible a particular OS platform is to identification based on encrypted traf-
fic. (2) OS identification can enable content providers, including websites, to tailor
content for different applications running on smartphones with different OSes.
(3) OS identification in conjunction with application identification enables
network operators, especially mobile network operators, to predict the bandwidth
requirements of a smartphone so that they can better allocate
resources to match expected bandwidth requirements.
When the traffic is encrypted, the observer cannot access packet content, and his or
her ability to monitor the traffic is limited to the timing of the packets. Observations
indicate, however, that different OSes still cause the smartphone to generate traffic
with different timing. Differences in timing footprints are caused by differences in OS
implementations (e.g. CPU scheduling, TCP/IP protocol stack), and by differences in
resource management (e.g. memory management or power management). Similarly,
differences in applications caused by the OS differences (e.g. audio/video codecs
available for multimedia communications) become visible in the timing footprint of
sent packets as well.
We will describe how differences in OSes can be identified by analyzing the timing
traces of the generated traffic in the frequency domain. Frequency domain analysis
is a classical tool to analyze temporal signals [29], including the timing behavior of
traffic in our project, by converting signals from the time domain to the frequency
domain.
The main challenge in OS identification with frequency analysis comes from
the fact that the frequency spectrum contains many noise frequency components,
i.e., frequency components that are not caused by OS features but rather by
application or user behavior. Noise frequency components can also be caused
by network dynamics (such as network congestion and round-trip time) and traffic
content (such as periodicity in the video content when streaming a video clip). In
this work, we refer to the frequency components that are helpful for OS identifica-
tion, i.e., the frequency components caused by OS features, as the characteristic
frequency components. The effectiveness of any frequency-domain-based identifi-
cation clearly depends on its ability to filter out noise and retain the characteristic
frequency components.
Once the frequency spectrum of a device has been collected, it must be matched
against training data; that is, the spectrum of interest needs to be correlated with the
spectra generated by known smartphone OSes. The complexity of the correlation
grows with the number of retained frequency components, so careful attention must
be given to the selection of the latter. In this section, we will show that approaches to
identify characteristic frequency components allow for efficient and accurate iden-
tification of smartphone OSes. Our major contributions in this case study are sum-
marized as follows: (1) The identification algorithm uses the frequency spectrum
of packet timing to capture the differences between smartphone OSes. Correlation is used
to match the spectrum of interest to the spectra generated by known smartphone
OSes, and noise frequency components are removed to improve identification accuracy.
(2) We evaluate the OS identification algorithm with extensive empirical experiments,
based on over 489 GB of smartphone traffic collected over 3 months, to
show that the proposed algorithm can identify smartphone OSes with very high
accuracy from only small amounts of smartphone traffic. (3) We extend the OS iden-
tification algorithms to remotely identify the applications running on smartphones
with different OSes. High identification accuracy can be achieved with as little as 30 s
of smartphone traffic. To the best of our knowledge, this is the first attempt to
extract characteristic frequency components from frequency spectra for identifica-
tion. A traffic flow contains many frequency components caused by various factors such
as OS features, network dynamics, and traffic content. The extraction enables a new
series of identification applications that can possibly identify each factor.
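As a bare-bones sketch of the correlation-based matching described above, the function below scores an observed spectrum against averaged training spectra and returns the best-matching label. The OS labels and the plain Pearson correlation are illustrative assumptions; the actual algorithm additionally removes noise components before matching.

import numpy as np

def identify_os(spectrum, training_spectra):
    """Return the OS label whose averaged training spectrum correlates best
    with the observed spectrum. `training_spectra` maps label -> list of spectra."""
    best_label, best_score = None, -np.inf
    for label, spectra in training_spectra.items():
        reference = np.mean(np.vstack(spectra), axis=0)   # average training spectrum
        score = np.corrcoef(spectrum, reference)[0, 1]    # Pearson correlation
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Usage (hypothetical data): identify_os(observed, {"Android": [...], "iOS": [...]})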
This section is organized as follows: The system and threat models are presented
in Sect. 3.1. We explain the rationale behind the proposed identification approach
and describe the details of the smartphone OS identification algorithm in Sect. 3.2.
In Sect. 3.3, we evaluate the smartphone OS identification algorithm against a large
volume of traffic data.
Our goal is to identify smartphone operating systems (OSes) when the smartphone
communicates using encrypted traffic. The capability of OS identification is needed
for two reasons. First, the identification of the OS and running applications during a
reconnaissance step enables an informed and targeted attack. The attack can exploit
known vulnerabilities and select a vector that is specific to the OS and the applications.
On the other hand, defending against attacks benefits from an understanding of how
effective such reconnaissance can be.
We are particularly interested in identification based on WiFi traffic (as
opposed to 3G, for example) for three reasons: First, although current smartphones
have various communication capabilities, such as WiFi, 3G, or even 4G, nearly every
smartphone on the market is capable of WiFi communication. Next, the majority of
traffic from smartphones is sent through WiFi [10], partly because of its low cost and
relatively high bandwidth. Finally, WiFi-based passive attacks are easy to stage; a
drive-by or walk-by detection of the smartphone OS is therefore straightforward.
In this work, we assume a passive adversary who is able to capture packets exchanged
by a smartphone of interest that uses encryption for its communication. This reflects
the increasing popularity of encryption tools available for smartphones [17]. The
encryption used by such tools disables access to packet content and renders traffic
analysis based on packet content ineffective. In summary, we assume that the adver-
sary has the following capabilities: (1) The adversary is able to eavesdrop on WiFi
communications from the target smartphones and collect encrypted traffic for the
identification. (2) The adversary is able to collect traffic from known smartphone
OSes and analyze the traffic for future identification. (3) We assume a passive adver-
sary. That is, the adversary is not allowed to add, delete, delay, or modify existing
traffic for OS identification. (4) The traffic traces, including the traffic traces col-
lected for training on known smartphone OSes and the traffic traces of interest for
identification by the adversary, may be collected independently. In other words, the
traffic traces may be collected in different network sessions and possibly on different
WiFi networks.
Other attack scenarios can be very easily taken into consideration. For example
one can envision a scenario where the observer does not have access to the wireless
link, but rather collects data on the wired part of the path downstream. In this section,
we focus on data collection on the wireless link.
3.2.1 Rationale
Fig. 1 Sample frequency spectrum and its magnitude distribution (the spectrum is based on 50 min of YouTube streaming traffic on Android v2.3 OS with an 8 ms sample interval)
Fig. 2 Correspondence between the periodicities in a time-domain signal (traffic rate in kBps versus time in seconds) and the characteristic frequency component in the spectrum (magnitude versus frequency in Hz)
The identification can be divided into two phases, as shown in Fig. 3: a training phase
and an identification phase. The training phase consists of two steps: a spectrum gener-
ation step, followed by a feature extraction step. The identification phase uses the
same spectrum generation step, followed by an OS identification step. We describe
the details of each step below.
Fig. 3 Identification framework
Spectrum Generation: The spectrum generation step converts traffic traces into
frequency spectra. The input of this step is a vector S = [s_1, s_2, ..., s_N], where s_i is
the number of bytes received during the i-th sample interval of length T, and N is the
number of samples. The output of this step is the corresponding frequency spectrum
F^S = [f_1^S, f_2^S, ..., f_M^S], where M denotes the length of the spectrum. The spectrum
F^S is calculated in two steps. First, we apply the Discrete Fourier Transform (DFT)
to the vector S as follows:

    y_k = \sum_{j=1}^{N} s_j \omega_N^{(j-1)(k-1)}, \quad k = 1, 2, \ldots, M,

where y_k denotes the transform coefficients, \omega_N = e^{-2\pi i / N}, and N denotes the number of samples.
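As a rough illustration of this step, the sketch below computes a one-sided magnitude spectrum of a byte-count trace with an off-the-shelf FFT. The synthetic trace, the use of magnitudes, and the one-sided form are assumptions made for the example rather than details taken from the chapter.

import numpy as np

def spectrum_from_trace(byte_counts):
    """Convert a traffic trace (bytes received per sample interval of length T)
    into a one-sided magnitude spectrum F^S."""
    s = np.asarray(byte_counts, dtype=float)
    y = np.fft.rfft(s)              # DFT coefficients y_k
    return np.abs(y)                # spectrum magnitudes f_k^S

# Example: a synthetic trace with an 8 ms sample interval (T = 0.008 s) and a
# strong 2 Hz periodic component; frequency f maps to bin index f * N * T.
T, N = 0.008, 5000
t = np.arange(N) * T
trace = 1000 + 500 * np.sin(2 * np.pi * 2.0 * t)
spec = spectrum_from_trace(trace)
print(np.argmax(spec[1:]) + 1)      # -> 80, i.e., 2 Hz = 80 / (N * T)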
The experimental setup is shown in Fig. 4. Smartphones with different OSes are
used to watch different YouTube streaming videos, download files over the HTTP
protocol from different web sites, and make video calls with Skype. If multitasking
is supported by a smartphone OS, we also use the smartphone for video streaming, file
downloading, and Skype video calls at the same time. The wireless traffic from the
smartphone is collected by an HP dc7800 computer. The data collection is performed through
a Linksys Compact Wireless USB adapter (WUSB54GC) installed on the computer.
The wireless access points used in the experiments include both the wireless router
in our research lab and wireless access points managed by the university.1
The smartphone OSes included in our experiments are Apple's iOS, Google's
Android OS, Windows Phone OS, and Nokia's Symbian OS. For each possible com-
bination of smartphone OS and application, at least 30 traffic traces of 50 min
each are collected.
1 In a different scenario, the data-collecting machine may be monitoring the traffic on the wired
portion of the traffic path. The scenario chosen for our experiments is representative of a drive-by
or walk-by attack.
Our first experiments focus on the length of the traffic traces used for OS identi-
fication. The traffic used in the OS identification includes YouTube video streaming
traffic, file downloading traffic, Skype traffic, and combined traffic. The combined
traffic is collected by running YouTube video streaming, file downloading, and Skype
video calls simultaneously on the OSes that support multitasking. We call the four
types of traffic YouTube, Download, Skype, and Combined, respectively, in the rest of
the chapter.
The sample interval used in this set of experiments is 8 ms long. For each
type of traffic and each smartphone OS, we collected 30 traces. We used 20 of these
traces as labeled traces and the remaining 10 traces as test traces. The experimental results
are obtained with 1000 random combinations of the 20 labeled traces and 10 test
traces. The results for the four proposed algorithms are shown in Fig. 5.
Fig. 5 Identification rate versus the length of data used (minutes)
(Figure: identification rate, with an 8 ms sample interval, versus sample length in minutes)
We observe that the identification rates are very high, even for short traces of
observed traffic. Compared to the 25 % identification rate of a random identifier,
the algorithms in most cases display rates of around 70 % for short traces (30 s) and
around 90 % and above for long traces (5 min or more). For the Combined traffic, the
identification rate can reach 100 % with only 30 s of traffic.
Table 2 Empirical running times (traffic length: 15 min, sample interval: 8 ms, computer configuration: HP Z220, Intel Core [email protected] GHz CPU, 8 GB memory)

                 Training (minute)         Identification (second)
                 YouTube     Skype         YouTube     Skype
                 977.63      1682.24       0.0683      0.0320
Table 2 shows the empirical running times of the training phase and the identification
phase. It can be observed that the identification takes less than 0.7 s, which means that
identification is very efficient and feasible once training is finished.
Acknowledgments This work draws in part from [21, 24, 33]. We would like to thank our co-
authors of those works, including Riccardo Bettati, Yong Guan, Jonathan Gurary, Kenneth Johnson,
Jeff Kramer, Rudy Libertini, Nicholas Ruffing, and Ye Zhu, as well as reviewers of those original
papers who provided us with valuable feedback. This work is supported in part by the U.S. National
Science Foundation under grants CNS-1338105, CNS-1343141 and CNS-1527579.
References
1. Anagnostakis, K.G., Greenwald, M., Ryger, R.S.: Cing: measuring network-internal delays using only existing infrastructure. In: INFOCOM 2003, Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 3, pp. 2112–2121. IEEE (2003)
2. Arackaparambil, C., Bratus, S., Shubina, A., Kotz, D.: On the reliability of wireless fingerprinting using clock skews. In: Proceedings of the Third ACM Conference on Wireless Network Security, pp. 169–174. ACM (2010)
3. Arkin, O.: ICMP usage in scanning. Black Hat Briefings (2000)
4. Armitage, G.J.: Inferring the extent of network address port translation at public/private internet boundaries. Centre for Advanced Internet Architectures, Swinburne University of Technology, Melbourne, Australia, Tech. Rep. A 20712 (2002)
5. Auffret, P.: SinFP, unification of active and passive operating system fingerprinting. J. Comput. Virol. 6(3), 197–205 (2010)
6. Beverly, R.: A robust classifier for passive TCP/IP fingerprinting. In: Passive and Active Network Measurement, pp. 158–167. Springer (2004)
7. Biryukov, A., Pustogarov, I., Weinmann, R.P.: Trawling for Tor hidden services: detection, measurement, deanonymization. In: Proceedings of the IEEE Symposium on Security and Privacy (2013)
8. Bissias, G., Liberatore, M., Jensen, D., Levine, B.: Privacy vulnerabilities in encrypted HTTP streams. In: Danezis, G., Martin, D. (eds.) Privacy Enhancing Technologies, Lecture Notes in Computer Science, vol. 3856, pp. 1–11. Springer, Berlin, Heidelberg (2006). doi:10.1007/11767831_1
9. Cai, X., Zhang, X., Joshi, B., Johnson, R.: Touching from a distance: website fingerprinting attacks and defenses. In: Proceedings of the 19th ACM Conference on Computer and Communications Security (CCS 2012) (2012)
10. Charts, M.: WiFi mobile phone traffic grows. http://www.marketingcharts.com/wp/direct/wifi-mobile-phone-traffic-grows-19604/ (2011)
11. Chen, X., Jin, R., Suh, K., Wang, B., Wei, W.: Network performance of smart mobile handhelds in a university campus WiFi network. In: Proceedings of the 2012 ACM Conference on Internet Measurement Conference, pp. 315–328. ACM, New York, NY, USA (2012). doi:10.1145/2398776.2398809
12. Durumeric, Z., Wustrow, E., Halderman, J.A.: ZMap. http://zmap.io/
13. Engebretson, P.: The Basics of Hacking and Penetration Testing: Ethical Hacking and Penetration Testing Made Easy. Syngress (2011)
14. Gayle, D.: This is a secure line: the groundbreaking encryption app that will scramble your calls and messages. http://www.dailymail.co.uk/sciencetech/article-2274597/How-foil-eavesdroppers-The-smartphone-encryption-app-promises-make-communications-private-again.html (2013)
15. Gong, X., Borisov, N., Kiyavash, N., Schear, N.: Website detection using remote traffic analysis. In: Proceedings of the 12th Privacy Enhancing Technologies Symposium (PETS 2012). Springer (2012)
16. Greenemeier, L.: Cloud warriors: U.S. army intelligence to arm field ops with hardened network and smartphones. http://www.scientificamerican.com/article.cfm?id=us-army-intelligence-cloud-smartphone (2013)
17. Grimes, S.: App to provide military-level encryption for smartphones. http://www.ksl.com/?nid=1014&sid=22513938 (2012)
18. Herrmann, D., Wendolsky, R., Federrath, H.: Website fingerprinting: attacking popular privacy enhancing technologies with the multinomial naïve-Bayes classifier. In: Proceedings of the 2009 ACM Workshop on Cloud Computing Security, pp. 31–42 (2009). http://doi.acm.org/10.1145/1655008.1655013
19. Huang, J., Xu, Q., Tiwana, B., Mao, Z.M., Zhang, M., Bahl, P.: Anatomizing application performance differences on smartphones. In: Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, MobiSys '10, pp. 165–178 (2010). doi:10.1145/1814433.1814452
20. InfoBlox: Infoblox DHCP fingerprinting. https://www.infoblox.com/sites/infobloxcom/files/resources/infoblox-note-dhcp-fingerprinting.pdf/
21. Johnson, K.: Windows 8 forensics: journey through the impact of the recovery artifacts in Windows 8. MS thesis, Iowa State University (2013)
22. Kohno, T., Broido, A., Claffy, K.C.: Remote physical device fingerprinting. IEEE Trans. Dependable Secur. Comput. 2(2), 93–108 (2005)
23. Kollmann, E.: Chatter on the wire: a look at extensive network traffic and what it can mean to network security. http://chatteronthewire.org/download/OS%20Fingerprint.pdf (2005)
24. Kramer, J.: Droidspotter: a forensic tool for Android location data collection and analysis. MS thesis, Iowa State University (2013)
25. Liberatore, M., Levine, B.N.: Inferring the source of encrypted HTTP connections. In: Proceedings of the 13th ACM Conference on Computer and Communications Security, pp. 255–263 (2006)
26. Murdoch, S.J.: Hot or not: revealing hidden services by their clock skew. In: Proceedings of CCS 2006 (2006)
27. Netresec.com: Passive OS fingerprinting. http://www.netresec.com/?page=Blog&month=2011-11&post=Passive-OS-Fingerprinting (2011)
28. Nmap.org: Nmap network scanning. http://nmap.org/book/osdetect.html
29. Oppenheim, A.V., Willsky, A.S., Nawab, S.H.: Signals & Systems, 2nd edn. Prentice-Hall Inc., Upper Saddle River, NJ, USA (1996)
30. Øverlier, L., Syverson, P.: Locating hidden servers. In: Proceedings of the 2006 IEEE Symposium on Security and Privacy. IEEE CS (2006)
31. Panchenko, A., Niessen, L., Zinnen, A., Engel, T.: Website fingerprinting in onion routing based anonymization networks. In: Proceedings of the Workshop on Privacy in the Electronic Society (WPES 2011). ACM (2011)
32. Project, H.: Know your enemy: passive fingerprinting. http://old.honeynet.org/papers/finger/ (2002)
33. Ruffing, N., Zhu, Y., Libertini, R., Guan, Y., Bettati, R.: Smartphone reconnaissance: operating system identification. In: 2016 13th IEEE Annual Consumer Communications & Networking Conference (CCNC), pp. 1086–1091 (2016). doi:10.1109/CCNC.2016.7444941
34. Sanders, C.: Practical Packet Analysis: Using Wireshark to Solve Real-World Network Problems. No Starch Press (2011)
35. Smart, M., Malan, G.R., Jahanian, F.: Defeating TCP/IP stack fingerprinting. In: Proceedings of the 9th Conference on USENIX Security Symposium, SSYM'00, vol. 9, p. 17. USENIX Association, Berkeley, CA, USA (2000). http://dl.acm.org/citation.cfm?id=1251306.1251323
36. Taleck, G.: Ambiguity resolution via passive OS fingerprinting. In: Recent Advances in Intrusion Detection, pp. 192–206. Springer (2003)
37. Tzagkarakis, G., Papadopouli, M., Tsakalides, P.: Singular spectrum analysis of traffic workload in a large-scale wireless LAN. In: Proceedings of the 10th ACM Symposium on Modeling, Analysis, and Simulation of Wireless and Mobile Systems, MSWiM '07, pp. 99–108. ACM, New York, NY, USA (2007). doi:10.1145/1298126.1298146
38. Wang, X., Reeves, D.: Robust correlation of encrypted attack traffic through stepping stones by flow watermarking. IEEE Trans. Dependable Secur. Comput. 8(3), 434–449 (2011). doi:10.1109/TDSC.2010.35
1 Introduction
A. Bates (B)
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
e-mail: [email protected]
D.J. Pohly
Pennsylvania State University, University Park, State College, PA 16801, USA
e-mail: [email protected]
K.R.B. Butler
University of Florida, Gainesville, FL 32611, USA
e-mail: [email protected]
Springer Science+Business Media New York (outside the USA) 2016 141
C. Wang et al. (eds.), Digital Fingerprinting,
DOI 10.1007/978-1-4939-6601-1_8
how can we be assured that the chain of custody for data, from the time it was
originated until it arrived in its current state, is both secure and trustworthy?
Data provenance provides a compelling means of answering this chal-
lenging question. The term provenance comes from the art world, where it refers
to the ability to trace all activities related to a piece of art in order to establish
that it is genuine. An example of this usage is Jan van Eyck's Arnolfini portrait,
currently hanging in the National Gallery in London. The provenance of this cele-
brated portrait can be traced back almost 600 years to its completion in 1432, with
metadata in the form of markings associated with its owners painted on the painting's
protective shutters helping to establish the hands through which it has passed over
the centuries [23].
More recently, data provenance has become a desired feature in the computing
world. From its initial deployment in the database community [17] to its more recently
proposed use as an operating system feature [42], data provenance
provides a broad new capability for reasoning about the genesis and subsequent
modification of data. In contrast to the current computing paradigm, in which interac-
tions between system components are largely opaque, data provenance allows users
to track and understand how a piece of data came to exist in its current state. The
realization of provenance-aware systems will fundamentally redefine how comput-
ing systems are secured and monitored, and will provide important new capabilities
to the forensics community. Ensuring its efficacy in a computer system, though, is
an extremely challenging problem, to the extent that the Department of Homeland
Security has included provenance as one of its Hard Problems in Computing [14].
Ensuring that information is collected in a trustworthy fashion is the first problem that
needs to be solved in order to assure the security of provenance. Without adequate
protections in place, adversaries can target the collection mechanisms to destroy or
tamper with provenance metadata, calling the trustworthiness of data into question
or using it in a malicious fashion.
This book chapter focuses on how to ensure the secure and trustworthy collection
of data provenance within computing systems. We will discuss past approaches to
provenance collection and where and why those fall short, and discuss how taking a
systems security approach to defining trustworthy provenance collection can provide
a system that fulfills the qualities necessary for a secure implementation. We will then
discuss approaches from the research community that have attempted to ensure the
fine-grained secure collection of provenance, as well as our own work in this area
to provide a platform for deploying secure provenance collection as an operating
system service.
2 Provenance-Aware Systems
Data provenance provides the ability to describe the history of a data object, including
the conditions that led to its creation and the actions that delivered it to its present
state. The potential applications for this kind of information are virtually limitless;
The earliest efforts in provenance tracking arose from the scientific processing and
database management communities. While the potential use cases for data prove-
nance have broadened in scope over time, early investigators aims were to maintain
virtual data descriptions that would allow them to explain data processing results and
re-constitute those results in the event of their deletion. One of the earliest efforts
in this space was Chimera [17], which provided a virtual data management system
that allowed for tracking the derivations of data through computational procedures.
Chimera is made up of a virtual data catalog that represents computation proce-
dures used to derive data and a virtual data language interpreter for constructing
and querying the catalog. It uses transformation procedures (i.e., processes) as its
integral unit; its database is made up of transformations that represent executable
programs and derivations that represent invocations of transformations. All other
information (e.g., input files, output files, execution environment) is stored as sub-
fields of the process entry.
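To make the catalog structure concrete, the following minimal sketch models how a Chimera-style virtual data catalog might relate transformations (executable programs) to derivations (their invocations). The class and field names are illustrative assumptions made for this example; they are not Chimera's actual virtual data language.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Transformation:
    """An executable program registered in the catalog."""
    name: str
    executable: str

@dataclass
class Derivation:
    """A recorded invocation of a transformation."""
    transformation: str
    inputs: List[str]
    outputs: List[str]
    environment: Dict[str, str] = field(default_factory=dict)

class VirtualDataCatalog:
    """Toy catalog mapping outputs back to the derivations that produced them."""

    def __init__(self):
        self.transformations: Dict[str, Transformation] = {}
        self.derivations: List[Derivation] = []

    def register(self, t: Transformation) -> None:
        self.transformations[t.name] = t

    def record(self, d: Derivation) -> None:
        self.derivations.append(d)

    def how_was_derived(self, output: str) -> List[Derivation]:
        # Query: which invocations produced this data object?
        return [d for d in self.derivations if output in d.outputs]

catalog = VirtualDataCatalog()
catalog.register(Transformation("calibrate", "/usr/local/bin/calibrate"))
catalog.record(Derivation("calibrate", ["raw.dat"], ["calibrated.dat"],
                          {"HOST": "node01"}))
print(catalog.how_was_derived("calibrated.dat"))
```

A query such as how_was_derived illustrates the re-constitution use case described above: given a derivation, the corresponding transformation could be re-invoked to regenerate a deleted result.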
The Earth Science System Workbench, used for processing satellite imagery, also
offered support for provenance annotations [18], as did the Collaboratory for the
Multi-scale Chemical Sciences [46] and the Kepler system [41]. Specification-based
approaches, which generated data provenance based on process documentation [40]
also appeared in the literature at this time. Chimera and other early systems relied on
manual annotations or inferences from other metadata as sources for data provenance,
and are therefore referred to as disclosed provenance-aware systems.
Capturing data provenance at the operating system layer offers a broad perspective
into system activities, providing insight into all applications running on the host.
Muniswamy-Reddy et al.'s Provenance-Aware Storage System (PASS) instruments
the VFS layer of the Linux kernel to automatically collect, maintain, and provide
search for data provenance [42]. PASS defines provenance as a description of the
execution history that produced a persistent object (file). Provenance records are
attribute/value pairs that are referenced by a unique pnode number. PASS prove-
nance facilitates a variety of useful tasks, including script generation and document
reproduction, detecting system changes, intrusion detection, retrieving compile-time
flags, build debugging, and understanding system dependencies. One limitation of
the PASS system is that the model for provenance collection was fixed, and did not
provide a means of extending the system with additional provenance attributes or
alternate storage models.
Gehani and Tariq present SPADE in response to requests for coarser-grained
information and the ability to experiment with different provenance attributes,
novel storage and indexing models, and handling provenance from diverse sources
[21]. SPADE is a Java-based daemon that offers provenance reporter modules for
Windows, Linux, OSX, and Android. The reporters are based on a variety of methods
of inference, including polling of basic user space utilities (e.g., ps for process info,
lsof for network info), audit log systems (e.g., Windows ETW, OS X's BSM), and
interposition via user space file systems like FUSE. Due to its modular design, SPADE
can be easily extended to support additional provenance streams.
Both the PASS and SPADE systems facilitate provenance collection through ad
hoc instrumentation or polling efforts, making it difficult to provide any assurance
of the completeness of the provenance that they collect. In fact, there are several
examples of how these systems fail to provide adequate tracking for explicit data
flows through a system. As SPADE records provenance in part through periodic
polling of system utilities, there is the potential for race conditions in which
short-lived processes or messages are created and destroyed between polling
intervals, escaping observation entirely. By observing the VFS layer, PASS provides
support for non-persistent data such as network sockets, which are represented by a
system file; however, it fails to track a variety of forms of interprocess communication,
such as signals or shared memory, which leaves a covert channel open to communicating
applications.
CPL allows applications to disclose provenance by defining provenance objects and
describing the flows between those objects. CPL offers the advantages of avoiding
version disconnect between files that are seemingly distinct to the operating system
but are actually ancestors, of integrating different provenance-aware applications
through a look-up function in the API, and of reconciling different notions of
provenance in a unified format.
DPAPI is a component of the PASSv2 project [43]. Its intended purpose is to
create provenance-aware applications whose provenance can be layered on top of
information collected by the PASS system, allowing system operators to reason holis-
tically about activities at multiple system layers. Several exemplar applications were
made provenance-aware as part of this effort: Kepler, Lynx, and a set of general-purpose
Python wrappers. Similarly, the SPADE system offers support for provenance layer-
ing by exposing a named pipe and generic domain-specific language for application
layer provenance disclosure [21].
Other work has sought out alternate deployment models to create provenance-
aware applications at a lower cost and without developer cooperation. Hasan et al.
present Sprov, a modified version of the stdio library that captures provenance
for file I/O system calls at the application layer. By replacing the glibc library
with the modified version, Sprov is able to record file provenance for all dynamically
linked applications on the system. This system also provides integrity for provenance
records through the introduction of a tamper-evident provenance chain primitive. By
cryptographically binding time-ordered sequences of provenance records together
for a given document, Sprov is able to prevent undetected rewrites of the document's
history.
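The tamper-evident chain can be illustrated with a small sketch: each provenance record carries a hash over its own content and the previous record's hash, so any rewrite of earlier history breaks verification. This is a conceptual approximation in Python, not Sprov's actual implementation, and the record fields are assumptions made for the example.

```python
import hashlib
import json

def _digest(record: dict, prev_hash: str) -> str:
    # Bind the record to its predecessor by hashing both together.
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def append(chain: list, record: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    chain.append({"record": record, "hash": _digest(record, prev_hash)})

def verify(chain: list) -> bool:
    prev_hash = "0" * 64
    for entry in chain:
        if entry["hash"] != _digest(entry["record"], prev_hash):
            return False  # history was rewritten after the fact
        prev_hash = entry["hash"]
    return True

chain = []
append(chain, {"op": "write", "file": "report.doc", "pid": 1042})
append(chain, {"op": "write", "file": "report.doc", "pid": 1077})
assert verify(chain)
chain[0]["record"]["pid"] = 9999   # tamper with an earlier entry
assert not verify(chain)
```

Sprov additionally binds the chain cryptographically to principals, which a plain hash chain like this one does not attempt to capture.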
Recent efforts have also attempted to reconstitute application workflow prove-
nance through analysis of system layer audit logs, requiring only minimally inva-
sive and automated transformation of the monitored application. A significant
consequence of provenance tracking at the operating system layer is dependency
explosion: for long-lived processes, each new output from an application must con-
servatively be assumed to have been derived from all prior inputs, creating false
provenance. The BEEP system resolves this problem through analysis and trans-
formation of binary executables [32]. Leveraging the insight that most long-lived
processes are made up of an initialization phase, main work loop, and tear-down
phase, BEEP procedurally identifies the main work loop in order to decompose the
process into autonomous units of work. After this execution partitioning (EP) step,
the system audit log can then be analyzed to build causal provenance graphs for the
monitored applications. Ma et al. go on to adapt these techniques to Windows and other
proprietary software [35], where EP can be performed through regular expression
analysis of audit logs in order to identify autonomous units of work. The LogGC
system extends BEEP by introducing a garbage collection filtering mechanism to
improve the forensic clarity of the causal graphs [33]; for example, if a process
creates and makes use of a short-lived temporary file that no other process ever
reads, this node contains no semantic value in the causal graph, and can therefore
be pruned. These techniques could be applied in tandem with Chapman
et al.'s provenance factorization techniques, which find common subtrees and manipu-
late them to reduce the size of the provenance [11]. Although these systems do not modify
the operating system, by operating at the system call or audit log level, Sprov,
LogGC, and BEEP provide provenance at a granularity similar to that of provenance-
aware operating systems; they offer only limited insight into application-layer seman-
tics.
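As a rough illustration of the LogGC-style garbage collection described above, the sketch below prunes temporary-file nodes that were written but never read by any other process from a causal graph. The graph representation and node naming are assumptions made for this example; the real system operates over audit logs and BEEP's execution units.

```python
# Causal graph: edges are (source, sink) pairs, e.g., a process writing a file
# or a file being read by a process. Nodes prefixed "tmp:" are temporary files.
edges = [
    ("proc:firefox", "tmp:/tmp/cache123"),   # write to a temp file nobody reads
    ("proc:firefox", "file:/home/u/dl.pdf"), # write to a persistent file
    ("file:/home/u/dl.pdf", "proc:evince"),  # another process reads that file
]

def prune_dead_temps(edges):
    """Drop temp-file nodes that have writers but no readers."""
    sources = {src for src, _ in edges}          # nodes that something was read from
    dead = {dst for _, dst in edges
            if dst.startswith("tmp:") and dst not in sources}
    return [(s, d) for s, d in edges if s not in dead and d not in dead]

print(prune_dead_temps(edges))
# [('proc:firefox', 'file:/home/u/dl.pdf'), ('file:/home/u/dl.pdf', 'proc:evince')]
```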
Unfortunately, while the above work has shown that provenance is an invaluable capa-
bility when securing systems, less attention has been given to securing provenance-
aware systems. Provenance itself is a ripe attack vector; adversaries may seek to
tamper with provenance to hide evidence of their misdeeds, or to subvert the decisions
of other parties that rely on the provenance record.
However, Lyle and Martin's proposal did not describe a complete provenance-aware
operating system, as its provenance proofs did not cover configuration files, environ-
ment variables, generated code, or load information, nor could the system explain
who accessed a piece of data.
To address the challenges discussed above, McDaniel et al. [38] developed the con-
cept of a provenance monitor, where provenance authorities accept host-level prove-
nance data from validated monitors to assemble a trustworthy provenance record.
Subsequent users of the data obtain a provenance record that identifies not only
the inputs, systems, and applications leading to a data item, but also evidence of
the identity and validity of the recording instruments that observed its evolution.
At the host level, the provenance monitor acts as the recording instrument that
observes the operation of a system and securely records each data manipulation.
The concept for a provenance monitor is based on the reference monitor proposed
by Anderson (cite), which has become a cornerstone for evaluating systems security.
The two concepts share the following three fundamental properties.
The host level provenance monitor should enforce the classic reference monitor
guarantees of complete mediation of relevant operations, tamper-proofness of the
monitor itself, and basic verification of correct operation. For the purpose of the
provenance monitor, these guarantees are interpreted with respect to the observation
and recording of provenance rather than the enforcement of policy; their precise
definitions are revisited in the analysis of LPM later in this chapter.
The provenance monitor provides powerful guarantees for the secure and trust-
worthy collection of provenance. However, while the idea is seemingly simple in
concept, its execution requires considerable design and implementation considera-
tions. As we have seen above, a large amount of provenance related proposals, while
pushing forth novel functionality and advancing the state of research, do not pass
the provenance monitor criteria. Complete mediation and tamperproofness cannot be
guaranteed if the mechanisms used to collect provenance are subject to compromise,
and collecting sufficiently fine-grained provenance to ensure complete mediation is
a challenge unaddressed by other systems discussed. The next two sections discuss
recent attempts to satisfy the provenance monitor concept and detail the challenges
and design decisions made to assure a practical and functional collection system.
Hi-Fi consists of three components: the provenance collector, the provenance log, and
the provenance handler. An important difference between Hi-Fi and previous work
is that rather than collecting events at the file system level, Hi-Fi ensures complete
mediation by collecting events as a Linux Security Module (LSM) (cite). Because the
collector is an LSM, it resides below the application layer in the operating system's
kernel space, and is notified whenever a kernel object access is about to take place.
When invoked, the collector constructs an entry describing the action and writes it to
the provenance log. The log is a buffer which presents these entries to userspace as
a file. The provenance handler can then access this file using the standard file API,
process it, and store the provenance record. The handler used in our experiments
simply copies the log data to a file on disk, but it is possible to implement a custom
handler for any purpose, such as post-processing, graphical analysis, or storage on a
remote host.
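To make the collector-log-handler pipeline concrete, the following user-space sketch shows the kind of minimal handler described above: it drains entries from the relay-backed log file and appends them to a store on disk. The log path and the byte-stream record format are assumptions for illustration; Hi-Fi's actual relay interface and record layout differ.

```python
import time

LOG_PATH = "/sys/kernel/debug/provenance0"   # hypothetical relay file
STORE_PATH = "/var/log/provenance.log"       # hypothetical on-disk store

def run_handler(one_shot: bool = False) -> None:
    """Drain provenance entries from the kernel log and persist them."""
    with open(LOG_PATH, "rb") as log, open(STORE_PATH, "ab") as store:
        while True:
            chunk = log.read(4096)
            if chunk:
                store.write(chunk)
                store.flush()
            elif one_shot:
                break            # shutdown mode: exit once the log is drained
            else:
                time.sleep(0.1)  # wait for the collector to produce more data

if __name__ == "__main__":
    run_handler()
```

The one_shot flag anticipates the shutdown handling discussed later in this section, where the handler must exit after consuming any remaining entries rather than running as a daemon.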
Such a construction allows for a far more robust adversarial model. Hi-Fi main-
tains the fidelity of provenance collection regardless of any compromise of the OS
user space by an adversary. This is a strictly stronger guarantee than those provided
by any previous system-level provenance collection system. Compromises are possi-
ble against the kernel, but other techniques for protecting kernel integrity, including
disk-level versioning [57] or a strong write-once read-many (WORM) storage sys-
tem [55], can mitigate the effects of such compromises. Because provenance never
changes after being written, a storage system with strong WORM guarantees is
particularly well-suited to this task. For socket provenance, Hi-Fi guarantees that
incoming data will be recorded accurately; to prevent on-the-wire tampering by an
adversary, standard end-to-end protection such as IPsec should be used.
The responsibility of the provenance handler is to interpret, process, and store
the provenance data after it is collected, and it should be flexible enough to support
different needs. Hi-Fi decouples provenance handling from the collection process,
allowing the handler to be implemented according to the system's needs.
For the purposes of recording provenance, each object which can appear in the
log must be assigned an identifier which is unique for the lifetime of that object.
Some objects, such as inodes, are already assigned a suitable identifier by the kernel.
Others, such as sockets, require special treatment. For the rest, Hi-Fi generates a
provid, a small integer which is reserved for the object until it is destroyed. These
provids are managed in the same way as process identifiers to ensure that two objects
cannot simultaneously have the same provid.
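The provid scheme can be approximated in user space as follows: identifiers are drawn from a bounded space and recycled only after the owning object is destroyed, so no two live objects ever share a provid. This is a conceptual sketch; in Hi-Fi the allocation is performed inside the kernel with the same machinery used for process identifiers.

```python
class ProvidAllocator:
    """Allocate small integer identifiers that are unique among live objects."""

    def __init__(self, max_ids: int = 1 << 15):
        self._free = set(range(max_ids))
        self._live = set()

    def alloc(self) -> int:
        provid = self._free.pop()     # raises KeyError if the space is exhausted
        self._live.add(provid)
        return provid

    def release(self, provid: int) -> None:
        # Only once the object is destroyed may its provid be reused.
        self._live.remove(provid)
        self._free.add(provid)

allocator = ProvidAllocator()
a, b = allocator.alloc(), allocator.alloc()
assert a != b
allocator.release(a)
```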
Data flows between processes use one of the objects described in subsequent sections.
However, several actions are specific to processes: forking, program execution, and
changing subjective credentials.
Since LSM is designed to cover kernel actions as well as user processes, it does not represent actors using
a PID or task_struct structure. Instead, LSM hooks receive a cred structure,
which holds the user and group credentials associated with a process or kernel action.
Whenever a process is forked or new credentials are applied, a new credential struc-
ture is created, allowing us to use these structures to represent individual system
actors. As there is no identifier associated with these cred structures, we generate
a provid to identify them.
Regular files are the simplest and most common means of storing data and sharing
it between processes. Data enters a file when a process writes to it, and a copy of this
data leaves the file when a process reads from it. Both reads and writes are mediated
by a single LSM hook, which identifies the actor, the open file descriptor, and
whether the action is a read or a write. Logging file operations is then straightforward.
Choosing identifiers for files, however, requires considering that files differ from
other system objects in that they are persistent, not only across reboots of a single
system, but also across systems (like a file on a portable USB drive). Because of this,
it must be possible to uniquely identify a file independent of any running system. In
this case, already-existing identifiers can be used rather than generating new ones.
Each file has an inode number which is unique within its filesystem, which can
be combined with a UUID that identifies the filesystem itself to obtain a suitable
identifier that will not change for the lifetime of the file. UUIDs are generated for
most filesystems at creation.
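A persistent file identifier of the kind described above can be sketched as the pair (filesystem UUID, inode number). The snippet below illustrates the idea from user space; the way the filesystem UUID is looked up here (via /dev/disk/by-uuid) is a best-effort assumption for the example, whereas Hi-Fi reads the UUID from the superblock inside the kernel.

```python
import os

def filesystem_uuid(path: str) -> str:
    """Best-effort lookup of the UUID of the filesystem holding `path`."""
    dev = os.stat(path).st_dev
    by_uuid = "/dev/disk/by-uuid"
    for uuid in os.listdir(by_uuid):
        # Each entry is a symlink to a block device; match its device number.
        if os.stat(os.path.join(by_uuid, uuid)).st_rdev == dev:
            return uuid
    raise LookupError("no UUID found for filesystem of %s" % path)

def file_identifier(path: str) -> tuple:
    """(filesystem UUID, inode number): stable across reboots and hosts."""
    return (filesystem_uuid(path), os.stat(path).st_ino)

print(file_identifier("/etc/hostname"))
```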
Files can also be mapped into one or more processes' address spaces, where they
are used directly through memory accesses. This differs significantly from normal
reading and writing in that the kernel does not mediate accesses once the mapping is
established. Hi-Fi only records the mapping when it occurs, along with the requested
access mode (read, write, or both). This does not affect the notion of complete
mediation if it is assumed that flows via memory-mapped files take place whenever
possible.
Shared memory segments are managed and interpreted in the same way. POSIX
shared memory is implemented using memory mapping, so it behaves as described
above. XSI shared memory, though managed using different system calls and medi-
ated by a different LSM hook, also behaves the same way, so our model treats them
identically. In fact, since shared memory segments are implemented as files in a tem-
porary filesystem, their identifiers can be chosen in the same way as file identifiers.
The remaining objects have stream or message semantics, and they are accessed
sequentially. In these objects, data is stored in a queue by the writer and retrieved
by the reader. The simplest such object is the pipe, or FIFO. Pipes have stream
semantics and, like files, they are accessed using the read and write system calls.
Since a pipe can have multiple writers or readers, it cannot be directly represented
as a flow from one process to another. Instead, flow is split into two parts, modeling
the data queue as an independent file-like object. In this way, a pipe behaves like
a sequentially-accessed regular file. In fact, since named pipes are inodes within a
regular filesystem, and unnamed pipes are inodes in the kernel's pipefs pseudo-
filesystem, pipe identifiers can be chosen similarly to file identifiers.
Message queues are similar to pipes, with two major semantic differences: the data
is organized into discrete messages instead of a single stream, and these messages can
be delivered in a different order than that in which they are sent. However, because
LSM handles messages individually, a unique identifier can be created for each,
allowing reliable identification of which process receives the message regardless of
the order in which the messages are dequeued. Since individual messages have no
natural identifier, a provid is generated for each.
Sockets are the most complex form of inter-process communication handled by
Hi-Fi but can be modeled very simply. As with pipes, a socket's receive queue can
be represented as an intermediary file between the sender and receiver. Sending data
merely requires writing to this queue, and receiving data is reading from it. The details
of network transfer are hidden by the socket abstraction. Stream sockets provide the
simplest semantics with respect to data flow: they behave identically to pipes. Since
stream sockets are necessarily connection-mode, all of the data sent over a stream
socket will arrive in the same receive queue. Message-oriented sockets, on the other
hand, do not necessarily have the same guarantees. They may be connection-mode or
connectionless, reliable or unreliable, ordered or unordered. Each packet therefore
needs a separate identifier, since it is unclear at which endpoint the message will
arrive.
Socket identifiers must be chosen carefully. An identifier must never be re-used,
since a datagram can have an arbitrarily long lifetime. The identifier should also
be associated with the originating host. Associating messages with a per-boot UUID
addresses these requirements. By combining this UUID with an atomic counter, a
sufficiently large number of identifiers can be generated.
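The identifier scheme sketched here pairs a per-boot UUID with a monotonically increasing counter, so identifiers are never reused and are always traceable to the originating host. The structure and width of the identifier are assumptions made for illustration; the encoding Hi-Fi places in the IP options field differs.

```python
import itertools
import threading
import uuid

BOOT_UUID = uuid.uuid4()          # generated once per boot of the host
_counter = itertools.count(1)
_lock = threading.Lock()          # make increment-and-read explicit and atomic

def next_packet_identifier() -> bytes:
    """Host-unique, never-reused identifier for an outgoing datagram."""
    with _lock:
        seq = next(_counter)
    return BOOT_UUID.bytes + seq.to_bytes(8, "big")

a, b = next_packet_identifier(), next_packet_identifier()
assert a != b and a[:16] == b[:16]   # same host UUID, different sequence numbers
```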
Early Boot Provenance. Events that occur early in the boot process are also subject
to LSM mediation. In fact, the LSM is initialized before the VFS, which has a peculiar
consequence for the relay we use to implement the provenance log. Since filesystem
caches have not yet been allocated, the relay cannot be created when the LSM is
initialized, which would violate Hi-Fi's goal of fidelity. In response, Hi-Fi separates
relay creation from the rest of the module's initialization and registers it as a callback
in the kernel's generic initcall system. This allows it to be delayed until after core
subsystems such as the VFS have been initialized. In the meantime, provenance data
is stored in a small temporary buffer. Inspection of this early boot provenance reveals
that a one-kilobyte buffer is sufficiently large to hold the provenance generated by the
kernel during this period. Once the relay is created, the temporary boot-provenance
buffer is flushed of its contents and freed.
OS Integration. One important aspect of Hi-Fis design is that the provenance han-
dler must be kept running to consume provenance data as it is written to the log. Since
the relay is backed by a buffer, it can retain a certain amount of data if the handler
is inactive or happens to crash. It is important, though, that the handler is restarted
in this case. Fortunately, this is a feature provided by the operating system's init
process. By editing the configuration in /etc/inittab, we can specify that the
handler should be started automatically at boot, as well as respawned if it should
ever crash.
Provenance must also be collected and retained for as much of the operating
system's shutdown process as possible. At shutdown time, the init process takes
control of the system and executes a series of actions from a shutdown script. This
script asks processes to terminate, forcefully terminates those which do not exit
gracefully, unmounts filesystems, and eventually powers the system off. Since the
provenance handler is a regular user space process, it is subject to this shutdown
procedure as well. However, there is no particular order in which processes are
terminated during the shutdown sequence, so it is possible that another process may
outlive the handler and perform actions which generate provenance data.
In response, Hi-Fi handles the shutdown process similarly to a system crash.
The provenance handler must be restarted, and this is accomplished by modifying
the shutdown script to re-execute the handler after all other processes have been
terminated but before filesystems are unmounted. This special case requires a one-
shot mode in the handler which, instead of forking to the background, exits after
handling the data currently in the log. This allows it to handle any remaining shutdown
provenance and then return control to init to complete the shutdown process.
Bootstrapping Filesystem Provenance. Intuitively, a complete provenance record
contains enough information to recreate the structure of an entire filesystem. This
requires three things: a list of inodes, filesystem metadata for each inode, and a list of
hard links (filenames) for each inode. Hi-Fi includes a hook corresponding to each
of these items, to ensure all information appears in the provenance record starting
from an empty filesystem. However, this is difficult to do in practice, as items may
have been used elsewhere or provenance may be collected on an existing, populated
filesystem. Furthermore, it is actually impossible to start with an empty filesystem.
Without a root inode, which is created by the corresponding mkfs program, a filesys-
tem cannot even be mounted. Unfortunately, mkfs does this by writing directly to a
block device file, which does not generate the expected provenance data.
Therefore, provenance must be bootstrapped on a populated filesystem. To have
a complete record for each file, a creation event for any pre-existing inodes must be
generated. Hi-Fi implements a utility called pbang (for provenance Big Bang)
which does this by traversing the filesystem tree. For each new inode it encounters,
it outputs an allocation entry for the inode, a metadata entry containing its attributes,
and a link entry containing its filename and directory. For previously encountered
inodes, it only outputs a new link entry. All of these entries are written to a file to
complete the provenance record. A new filesystem is normally created using mkfs,
then made provenance-aware by executing pbang immediately afterward.
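The traversal that pbang performs can be sketched in user space as below: walk the filesystem, emit an allocation and metadata entry the first time an inode is seen, and emit only a link entry for additional hard links to an already-seen inode. The textual entry format is an assumption for illustration; it is not Hi-Fi's actual record layout.

```python
import os
import sys

def provenance_big_bang(root: str, out) -> None:
    """Emit creation provenance for every inode in an existing filesystem tree."""
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_ino not in seen:
                seen.add(st.st_ino)
                out.write(f"alloc inode={st.st_ino}\n")
                out.write(f"meta  inode={st.st_ino} mode={oct(st.st_mode)} "
                          f"uid={st.st_uid} gid={st.st_gid}\n")
            # Every name is a hard link, including the first one encountered.
            out.write(f"link  inode={st.st_ino} dir={dirpath} name={name}\n")

if __name__ == "__main__":
    provenance_big_bang(sys.argv[1] if len(sys.argv) > 1 else ".", sys.stdout)
```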
Opaque Provenance. Early versions of Hi-Fi generated continuous streams of prove-
nance even when no data was to be collected. Inspection of the provenance record
showed that this data described the actions of the provenance handler itself. The
handler would call the read function to retrieve data from the provenance log,
which then triggered the file_permission LSM hook. The collector would
record this action in the log, where the handler would again read it, triggering
file_permission, and so on, creating a large amount of feedback in the
provenance record. While technically correct behavior, this floods the provenance
record with data which does not provide any additional insight into the system's
operation. One option for solving this problem is to make the handler completely
exempt from provenance collection. However, this could interfere with filesystem
reconstruction. Instead, the handler is provenance-opaque, treated as a black box
which only generates provenance data if it makes any significant changes to the
filesystem.
To achieve this, Hi-Fi informs the LSM which process is the provenance handler
by leveraging the LSM framework's integration with extended filesystem
attributes. The provenance handler program is identified by setting an attribute
called security.hifi. The security attribute namespace, which is reserved
for attributes used by security modules, is protected from tampering by malicious
users. When the program is executed, the bprm_check_security hook exam-
ines this attribute for the value opaque and sets a flag in the process's credentials
indicating that it should be treated accordingly. In order to allow the handler to create
new processes without reintroducing the original problem (for instance, if the han-
dler is a shell script), this flag is propagated to any new credentials that the process
creates.
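Marking the handler binary can be illustrated with the extended-attribute call below. Whether a tool may set an attribute in the security namespace depends on privileges and the loaded security module, so treat this as a sketch of the labeling step rather than a complete installation procedure; the handler path is hypothetical.

```python
import os

HANDLER = "/usr/sbin/provenance-handler"   # hypothetical handler binary

# Label the handler as provenance-opaque so the collector ignores its reads.
# Typically requires CAP_SYS_ADMIN, since security.* attributes are protected.
os.setxattr(HANDLER, "security.hifi", b"opaque")

print(os.getxattr(HANDLER, "security.hifi"))   # b'opaque'
```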
Socket Provenance. Network socket behavior is designed to be both transparent and
incrementally deployable. To allow interoperability with existing non-provenanced
hosts, packet identifiers are placed in the IP Options header field. Two Netfilter hooks
process packets at the network layer. The outgoing hook labels each packet with the
correct identifier just before it encounters a routing decision, and the incoming hook
reads this label just after the receiver decides the packet should be handled locally.
Note that even packets sent to the loopback address will encounter both of these
hooks.
In designing the log entries for socket provenance, Hi-Fi aims to make the recon-
struction of information flows from multiple system logs as simple as possible. When
the sender and receiver are on the same host, these entries should behave the same as
reads and writes. When they are on different hosts, the only added requirement should
be a partial ordering placing each send before all of its corresponding receives. Lam-
port clocks [30] would satisfy this requirement. However, the socket_recvmsg
hook, which was designed for access control, executes before a process attempts to
receive a message. This may occur before the corresponding socket_sendmsg
hook is executed. To solve this, a socket_post_recvmsg hook is placed after
the message arrives and before it is returned to the receiver; this hook generates the
entry for receiving a message.
Support for TCP and UDP sockets is necessary to demonstrate provenance for both
connection-mode and connectionless sockets, as well as both stream and message-
oriented sockets. Support for the other protocols and pseudo-protocols in the Linux
IP stack, such as SCTP, ping, and raw sockets, can be implemented using similar tech-
niques. For example, SCTP is a sequential packet protocol, which has connection-
mode and message semantics.
TCP Sockets. TCP and other connection-mode sockets are complicated in that a
connection involves three different sockets: the client socket, the listening server
socket, and the server socket for an accepted connection. The first two are created in
the same way as any other socket on the system: using the socket function, which
calls the socket_create and socket_post_create LSM hooks. However,
sockets for an accepted connection on the server side are created by a different
sequence of events. When a listening socket receives a connection request, it creates
a mini-socket instead of a full socket to handle the request. If the client completes
the handshake, a new child socket is cloned from the listening socket, and the relevant
information from the mini-socket (including our IP options) is copied into the child.
In terms of LSM hooks, the inet_conn_request hook is called when a mini-
socket is created, and the inet_csk_clone hook is called when it is converted
into a full socket. On the client side, the inet_conn_established hook is
called when the SYN+ACK packet is received from the server.
Hi-Fi must treat the TCP handshake with care, since there are two different sockets
participating on the server side. A unique identifier is created for the mini-socket in
the inet_conn_request hook, and this identifier is later copied directly into
the child socket. The client must then be certain to remember the correct identifier,
namely, the one associated with the child socket. The first packet that the client
receives (the SYN+ACK) will carry the IP options from the listening parent socket.
To keep this from overriding the child socket's identifier, the inet_conn_established
hook clears the saved identifier so that it is later replaced by the correct one.
UDP Sockets. Since UDP sockets are connectionless, an LSM hook must assign
a different identifier to each datagram. In addition, this hook must run in process
context to record the identifier of the process which is sending or receiving. The only
existing LSM socket hook with datagram granularity is the sock_rcv_skb hook,
but it is run as part of an interrupt when a datagram arrives, not in process context. The
remaining LSM hooks are placed with socket granularity; therefore, two additional
hooks are placed to mediate datagram communication. If the file descriptor of the
receiving socket is shared between processes, they can all receive the same datagram
by using the MSG_PEEK flag. In fact, multiple processes can also contribute data
when sending a single datagram by using the MSG_MORE flag or the UDP_CORK
socket option. Because of this, placing send and receive hooks for UDP is a very
subtle task.
Since each datagram is considered to be an independent entity, the crucial points
to mediate are the addition of data to the datagram and the reading of data from it. The
Linux IP implementation includes a function which is called from process context to
append data to an outgoing socket buffer. This function is called each time a process
adds data to a corked datagram, as well as in the normal case where a single process
constructs a datagram and immediately sends it. This makes it an ideal candidate for
the placement of the send hook, which we call socket_dgram_append. Since
this hook is placed in network-layer code, it can be applied to any message-oriented
protocol and not just UDP.
The receive hook is placed in protocol-agnostic code, for similar flexibility. The
core networking code provides a function which retrieves the next datagram from
a sockets receive queue. UDP and other message-oriented protocols use this func-
tion when receiving, and it is called once for each process that receives a given
datagram. This is an ideal location for the message-oriented receive hook, so the
socket_dgram_post_recv hook is placed in this function.
Hi-Fi represents a significant step forward in provenance collection, being the first
system to be designed with the provenance monitor concept in mind. The com-
plexity of its design and implementation attests to the goal of complete mediation of
provenance. However, it fails to address other security challenges identified in this
chapter.
Hi-Fi does not completely satisfy the provenance monitor concept; enabling
Hi-Fi blocks the installation of other LSMs, such as SELinux or Tomoyo, effec-
tively preventing the installation of a mandatory access control (MAC) policy that
could otherwise be used to protect the kernel. This leaves the entire system, includ-
ing Hi-Fi's trusted computing base, vulnerable to attack, and Hi-Fi is therefore not
tamperproof. Hi-Fi is also vulnerable to network attacks. Hi-Fi embeds an identifier
into each IP packet transmitted by the host, which the recipient host can later use
to query the sender for the provenance of the packet. However, because
these identifiers are not cryptographically secured, an attacker in the network can
strip the provenance identifiers off of packets in transit, violating the forensic validity
of Hi-Fi's provenance in distributed environments. Finally, Hi-Fi does not provide
support for provenance-aware applications. Provenance layering is vital to obtaining
a comprehensive view of system activity; however, rather than providing an insecure
disclosure mechanism like PASSv2 [43], Hi-Fi does not offer layering support at all,
meaning that its provenance is not complete in its observations of relevant operations.
The LPM project provides an explicit definition for the term whole-system prove-
nance introduced in the Hi-Fi work that is broad enough to accommodate the needs
of a variety of existing provenance projects. To arrive at a definition, four past propos-
als were inspected that collect broadly scoped provenance: SPADE [21], LineageFS
[52], PASS [42], and Hi-Fi [48]. SPADE provenance is structured around primitive
operations of system activities with data inputs and outputs. It instruments file and
process system calls, and associates each call with a process ID (PID), user identi-
fier, and network address. LineageFS uses a similar definition, associating process
IDs with the file descriptors that the process reads and writes. PASS associates a
process's output with references to all input files and the command line and process
environment of the process; it also appends out-of-band knowledge such as OS and
hardware descriptions, and random number generator seeds, if provided. In each of
these systems, networking and IPC activity is primarily reflected in the provenance
record through manipulation of the underlying file descriptors. Hi-Fi takes an even
broader approach to provenance, treating non-persistent objects such as memory,
IPC, and network packets as principal objects.
In all instances, provenance-aware systems are exclusively concerned with oper-
ations on controlled data types, which are identified by Zhang et al. as files, inodes,
superblocks, socket buffers, IPC messages, IPC message queues, semaphores, and
shared memory [64]. Because controlled data types represent a superset of the
objects tracked by system layer provenance mechanisms, LPM defines whole-system
provenance as a complete description of agents (users, groups) controlling activities
(processes) interacting with controlled data types during system execution.
We also determine that beyond the reference monitor-inspired properties that
comprise the provenance monitor concept, two additional goals are necessary to
support whole-system provenance: an authenticated channel between provenance-aware
hosts, and authenticated disclosures from provenance-aware applications. Without
them, a network adversary may attempt to strip provenance from data in transit.
Because captured provenance can be put to use in other applications, the adversary's
goal may even be to target the provenance monitor itself; the implications and methods
of such an attack are domain-specific.
LPM defines a provenance trusted computing base (TCB) to be the kernel mech-
anisms, provenance recorder, and storage back-ends responsible for the collection
and management of provenance. Provenance-aware applications are not considered
part of the TCB.
An overview of the LPM architecture is shown in Fig. 1. The LPM patch places
a set of provenance hooks around the kernel; a provenance module then registers
to control these hooks, and also registers several Netfilter hooks; the module then
observes system events and transmits information via a relay buffer to a provenance
recorder in user space that interfaces with a datastore. The recorder also accepts
disclosed provenance from applications after verifying their correctness using the
Integrity Measurement Architecture (IMA) [51].
The LPM patch introduces a set of hook functions in the Linux kernel. These hooks
behave similarly to the LSM framework's security hooks in that they facilitate mod-
ularity, and default to taking no action unless a module is enabled. Each provenance
hook is placed directly beneath a corresponding security hook. The return value of
the security hook is checked prior to calling the provenance hook, thus assuring that
the requested activity has been authorized prior to provenance capture. A workflow
for the hook architecture is depicted in Fig. 2. The LPM patch places over 170 prove-
nance hooks, one for each of the LSM authorization hooks. In addition to the hooks
that correspond to existing security hooks, LPM also supports a hook introduced by
Hi-Fi that is necessary to preserve Lamport timestamps on network messages [30].
[Fig. 2: Workflow of the LPM hook architecture. A request first passes DAC checks;
the LSM hook then consults the LSM module, which examines the context, decides
whether the request passes policy, and grants or denies the access; the LPM hook then
invokes the LPM module, which examines the context and collects provenance; if
collection succeeds, access to the inode proceeds.]
To link flows across hosts, Hi-Fi embedded a provenance sequence number in the IP
options field [49] of each outbound packet. This approach allowed Hi-Fi to communicate
as normal with hosts that were not provenance-aware, but unfortunately was not secure
against a network adversary. In LPM, provenance
sequence numbers are replaced with Digital Signature Algorithm (DSA) signatures,
which are space-efficient enough to embed in the IP Options field. LPM implements
full DSA support in the Linux kernel by creating signing routines to use with the
existing DSA verification function. DSA signing and verification occur in the Net-
filter inet_local_out and inet_local_in hooks. In inet_local_out,
LPM signs over the immutable fields of the IP header, as well as the IP payload. In
inet_local_in, LPM checks for the presence of a signature, then verifies the
signature against a configurable list of public keys. If the signature fails, the packet
is dropped before it reaches the recipient application, thus ensuring that there are no
breaks in the continuity of the provenance log. The key store for provenance-aware
hosts is obtained from a PKI and transmitted to the kernel during the boot process by
writing to securityfs. LPM registers the Netfilter hooks with the highest priority
levels, such that signing occurs just before transmission (i.e., after all other IPTables
operations), and signature verification occurs just after the packet enters the interface
(i.e., before all other IPTables operations).
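The sign-on-send, verify-on-receive flow can be mimicked in user space with the sketch below, which drops any message whose signature does not verify against a configured list of trusted public keys. It uses the pyca/cryptography package and SHA-256 purely for illustration; LPM performs DSA inside the kernel over the immutable IP header fields and payload, with a signature sized to fit the IP options field.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import dsa

sender_key = dsa.generate_private_key(key_size=2048)
trusted_public_keys = [sender_key.public_key()]   # stand-in for the PKI key store

def sign_outbound(payload: bytes) -> bytes:
    # Corresponds to inet_local_out: sign just before transmission.
    return sender_key.sign(payload, hashes.SHA256())

def verify_inbound(payload: bytes, signature: bytes) -> bool:
    # Corresponds to inet_local_in: drop the packet if no trusted key verifies.
    for key in trusted_public_keys:
        try:
            key.verify(signature, payload, hashes.SHA256())
            return True
        except InvalidSignature:
            continue
    return False

msg = b"immutable IP header fields || payload"
sig = sign_outbound(msg)
assert verify_inbound(msg, sig)
assert not verify_inbound(msg + b"tampered", sig)
```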
To support layered provenance while preserving our security goals, LPM requires a
means of evaluating the integrity of user space provenance disclosures. To accomplish
this, LPM Provenance Recorders make use of the Linux Integrity Measurement
Architecture (IMA) [51]. IMA computes a cryptographic hash of each binary before
execution, extends the measurement into a TPM Platform Configuration Register (PCR),
and stores the measurement in kernel memory. This set of measurements can be used
by the Recorder to make a decision about the integrity of a Provenance-Aware
Application (PAA) prior to accepting the disclosed provenance. When a PAA wishes
to disclose provenance, it opens a new UNIX domain socket to send the provenance
data to the Provenance Recorder. The Recorder uses its own UNIX domain socket
to recover the process's pid, then uses the /proc filesystem to find the full path of
the binary, then uses this information to look up the PAA in the IMA measurement
list. The disclosed provenance is recorded only if the measurement of the PAA matches
a known-good cryptographic hash.
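The recorder-side check can be approximated in user space as follows: recover the peer's pid from the UNIX domain socket, resolve its binary through /proc, and compare a hash of that binary against a known-good list before accepting the disclosure. This is a simplification for illustration; LPM consults the kernel's IMA measurement list rather than hashing the binary itself, and the allow-list shown is a placeholder.

```python
import hashlib
import os
import socket
import struct

KNOWN_GOOD = {"4f2d..."}   # placeholder hashes of trusted PAA binaries

def peer_binary_hash(conn: socket.socket) -> str:
    """Identify the connecting process and hash the binary it is running."""
    creds = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                            struct.calcsize("3i"))
    pid, _uid, _gid = struct.unpack("3i", creds)
    exe = os.readlink(f"/proc/{pid}/exe")
    with open(exe, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def accept_disclosure(conn: socket.socket) -> bool:
    """Accept a disclosure only if the peer matches a known-good measurement."""
    return peer_binary_hash(conn) in KNOWN_GOOD
```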
A demonstration of this functionality is shown in Fig. 3 for the popular ImageMag-
ick utility.1 ImageMagick contains a batch conversion tool for image reformatting,
mogrify. Shown in Fig. 3, mogrify reads and writes multiple files during exe-
cution, leading to an overtainting problem: at the kernel layer, LPM is forced to
conservatively assume that all outputs were derived from all inputs, creating false
dependencies in the provenance record. To address this, we extended the Provmon
protocol to support a new message, provmsg_imagemagick_convert, which
links an input file directly to its output file. When the recorder receives this message,
it first checks the list of IMA measurements to confirm that ImageMagick is in a
good state. If successful, it then annotates the existing provenance graph, connecting
the appropriate input and output objects with WasDerivedFrom relationships. LPM
presents a minimally modified version of ImageMagick that supports layered prove-
nance at no additional cost over other provenance-aware systems [21, 42], and does
so in a manner that provides assurance of the integrity of the provenance log.
[Fig. 3: Provenance graph for mogrify -format jpg *.png. The mogrify activity Used
each input (a.png, b.png); each output (a.jpg, b.jpg) WasGeneratedBy the activity; the
disclosed provenance adds a direct WasDerivedFrom edge from each output to its
corresponding input.]
1 See http://www.imagemagick.org.
After booting into the provenance-aware kernel, the runtime integrity of the TCB
must also be assured. To protect the runtime integrity of the kernel, we deploy a
Mandatory Access Control (MAC) policy, as implemented by Linux Security Mod-
ules. On our prototype deployments, we enabled SELinux's MLS policy, the security
of which was formally modeled by Hicks et al. [25]. Refining the SELinux policy to
prevent Access Vector Cache (AVC) denials on LPM components required minimal
effort; the only denial we encountered was when using the PostgreSQL recorder,
which was quickly remedied with the audit2allow tool. Preserving the integrity
of LPMs user space components, such as the provenance recorder, was as simple
as creating a new policy module. We created a policy module to protect the LPM
recorder and storage back-end using the sepolicy utility. Uncompiled, the policy
module was only 135 lines.
In this section, we briefly consider how to evaluate the provenance monitor
solutions that we have discussed in this chapter, specifically Hi-Fi and LPM. We
consider these systems from both a coverage and a performance perspective.
Our first task is to show that the data collected by Hi-Fi is of sufficient fidelity to
be used in a security context. We focus our investigation on detecting the activity of
network-borne malware. A typical worm consists of several parts. First, an exploit
allows it to execute code on a remote host. This code can be a dropper, which serves
to retrieve and execute the desired payload, or it can be the payload itself. A payload
can then consist of any number of different actions to perform on an infected system,
such as exfiltrating data or installing a backdoor. Finally, the malware spreads to
other hosts and begins the cycle again.
For our experiment, we chose to implement a malware generator which would
allow us to test different droppers and payloads quickly and safely. The generator
is similar in design to the Metasploit Framework [39], in that an exploit, a dropper,
and a payload can be chosen to create a custom attack. However, our tool also
includes a set of choices for generating malware which automatically spreads from
one host to another; this allows us to demonstrate what socket provenance can record
about the flow of information between systems. The malware behaviors that we
implement and test are drawn from Symantec's technical descriptions of actual Linux
malware [58].
To collect provenance data, we prepare three virtual machines on a common
subnet, all of which are running Hi-Fi. The attacker generates the malware on machine
A and infects machine B by exploiting an insecure network daemon. The malware
then spreads automatically from machine B to machine C. For each of the malicious
behaviors we wish to test, we generate a corresponding piece of malware on machine
A and launch it. Once C has been infected, we retrieve the provenance logs from all
three machines for examination.
Each malware behavior that we test appears in some form in the provenance
record. In each case, after filtering the log to view only the vulnerable daemon and
its descendants, the behavior is clear enough to be found by manual inspection. Below
we describe each behavior and how it appears in the provenance record.
Frequently, the first action a piece of malware takes is to ensure that it will continue
to run for as long as possible. In order to persist after the host is restarted, the malware
must write itself to disk in such a way that it will be run when the system boots. The
most straightforward way to do this on a Linux system is to infect one of the startup
scripts run by the init process. Our simulated malware has the ability to modify
rc.local, as the Kaiten trojan does. This shows up clearly in the provenance log:
[6fe] write B:/etc/rc.local
In this case, the process with provid 0x6fe has modified rc.local on B's root
filesystem. Persistent malware can also add cron jobs or infect system binaries to
ensure that it is executed again after a reboot. Examples of this behavior are found
in the Sorso and Adore worms. In our experiment, these behaviors result in similar
log entries:
[701] write B:/bin/ps
on that machine. We include this behavior in our experiment as well, and it appears
simply as:
[707] write B:/usr/local/include/stdio.h
Once the malware has established itself as a persistent part of the system, the next
step is to execute a payload. This commonly includes installing a backdoor which
allows the attacker to control the system remotely. The simplest way to do this is to
create a new root-level user on the system, which the attacker can then use to log in.
Because of the way UNIX-like operating systems store their account databases, this
is done by creating a new user with a UID of 0, making it equivalent to the root user.
This is what the Zab trojan does, and when we implement this behavior, it is clear to
see that the account databases are being modified:
[706] link (new) to B:/etc/passwd+
[706] write B:/etc/passwd+
[706] link B:/etc/passwd+ to B:/etc/passwd
[706] unlink B:/etc/passwd+
[706] link (new) to B:/etc/shadow+
[706] write B:/etc/shadow+
[706] link B:/etc/shadow+ to B:/etc/shadow
[706] unlink B:/etc/shadow+
A similar backdoor technique is to open a port which listens for connections and
provides the attacker with a remote shell. This approach is used by many pieces
of malware, including the Plupii and Millen worms. Our experiment shows that
the provenance record includes the shell's network communication as well as the
attacker's activity:
[744] exec B:/bin/bash -i
[744] socksend B:173
[744] sockrecv unknown
[744] socksend B:173
[751] exec B:/bin/cat /etc/shadow
[751] read B:/etc/shadow
[751] socksend B:173
[744] socksend B:173
[744] sockrecv unknown
[744] socksend B:173
[744] link (new) to B:/testfile
[744] write B:/testfile
Here, the attacker uses the remote shell to view /etc/shadow and to write a
new file in the root directory. Since the attacker's system is unlikely to be running
a trusted instance of Hi-Fi, we see unknown socket entries, which indicate data
received from an unprovenanced host. Remote shells can also be implemented as
reverse shells, which connect from the infected host back to the attacker. Our tests
on a reverse shell, such as the one in the Jac.8759 virus, show results identical to a
normal shell.
6.1.4 Exfiltration
Another common payload activity is data exfiltration, where the malware reads infor-
mation from a file containing password hashes, credit card numbers, or other sensitive
information and sends this information to the attacker. Our simulation for this behav-
ior reads the /etc/shadow file and forwards it in one of two ways. In the first test,
we upload the file to a web server using HTTP, and in the second, we write it directly
to a remote port. Both methods result in the same log entries:
[85f] read B:/etc/shadow
[85f] socksend B:1ae
Emailing the information to the attacker, as is done by the Adore worm, would create
a similar record.
6.1.5 Spread
Our experiment also models three different mechanisms used by malware to spread
to newly infected hosts. The first and simplest is used when the entire payload can
be sent using the initial exploit. In this case, there does not need to be a separate
dropper, and the resulting provenance log is the following (indentation is used to
distinguish the two hosts):
[807] read A:/home/evil/payload
[807] socksend A:153
[684] sockrecv A:153
[684] write B:/tmp/payload
The payload is then executed, and the malicious behavior it implements appears in
subsequent log entries.
Another mechanism, used by the Plupii and Sorso worms, is to fetch the payload
from a remote web server. We assume the web server is unprovenanced, so the log
once again contains unknown entries:
[7ff] read A:/home/evil/dropper
[7ff] socksend A:15b
[685] sockrecv A:15b
[685] write B:/tmp/dropper
[6ef] socksend B:149
[6ef] sockrecv unknown
[6ef] write B:/tmp/payload
If the web server were a provenanced host, this log would contain host and socket
IDs in the sockrecv entry corresponding to a socksend on the server.
Finally, to illustrate the spread of malware across several hosts, we tested a relay
dropper which uses a randomly-chosen port to transfer the payload from each infected
host to the next. The combined log of our three hosts shows this process:
[83f] read A:/home/evil/dropper
[83f] socksend A:159
[691] sockrecv A:159
[691] write B:/tmp/dropper
[6f5] exec B:/tmp/dropper
[844] read A:/home/evil/payload
[844] socksend A:15b
[6fc] sockrecv A:15b
[6fc] write B:/tmp/payload
[74e] read B:/tmp/dropper
[74e] socksend B:169
[682] sockrecv B:169
[682] write C:/tmp/dropper
[6e6] exec C:/tmp/dropper
[750] read B:/tmp/payload
[750] socksend B:16b
[6ed] sockrecv B:16b
[6ed] write C:/tmp/payload
Here we can see the attacker transferring both the dropper and the payload to the
first victim using two different sockets. This victim then sends the dropper and the
payload to the next host in the same fashion.
We now turn our focus to LPM, which provides additional features for demonstrating
the provenance monitor concept beyond what Hi-Fi enforces. We demonstrate that
LPM meets all of the required security goals for trustworthy whole-system prove-
nance. In this analysis, we consider an LPM deployment on a physical machine that
was enabled with the Provmon module, which mirrors the functionality of Hi-Fi.
Complete. We defined whole-system provenance as a complete description of
agents (users, groups) controlling activities (processes) interacting with controlled
data types during system execution (Sect. 5.1). LPM attempts to track these system
objects through the placement of provenance hooks (Sect. 5.3.1), which directly
follow each LSM authorization hook. The LSM framework's complete mediation property has
been formally verified [15, 64]; in other words, there is an authorization hook prior to
every security-sensitive operation. Because every interaction with a controlled data
type is considered security-sensitive, we know that a provenance hook resides on all
control paths to the provenance-sensitive operations. LPM is therefore capable of
collecting complete provenance on the host.
It is important to note that, as a consequence of placing provenance hooks beneath
authorization hooks, LPM is unable to record failed access attempts. However, insert-
ing the provenance layer beneath the security layer ensures accuracy of the prove-
nance record. Moreover, failed authorizations are a different kind of metadata than
provenance because they do not describe processed data; this information is better
handled at the security layer, e.g., by the SELinux Access Vector Cache (AVC) Log.
Tamperproof. The runtime integrity of the LPM trusted computing base is assured
via the SELinux MLS policy, and we have written a policy module that protects the
LPM user space components. Therefore, the only way to disable LPM would be to
reboot the system into a different kernel; this action can be disallowed through secure
boot techniques and is detectable by remote hosts via TPM attestation.
Verifiable. While we have not conducted an independent formal verification
of LPM, our argument for its correctness is as follows. A provenance hook fol-
lows each LSM authorization hook in the kernel. The correctness of LSM hook
placement has been verified through both static and dynamic analysis techniques
[15, 19, 27]. Because an authorization hook exists on the path of every sensitive
operation to controlled data types, and LPM introduces a provenance hook behind
each authorization hook, LPM inherits LSMs formal assurance of complete media-
tion over controlled data types. This is sufficient to ensure that LPM can collect the
provenance of every sensitive operation on controlled data types in the kernel (i.e.,
whole-system provenance).
Authenticated Channel. Through use of Netfilter hooks [59], LPM embeds a
DSA signature in every outbound network packet. Signing occurs immediately prior
to transmission, and verification occurs immediately after reception, making it impos-
sible for an adversary-controlled application running in user space to interfere. For
both transmission and reception, the signature is invisible to user space. Signatures
are removed from the packets before delivery, and LPM feigns ignorance that the
options field has been set if get_options is called. Hence, LPM can enforce that
all applications participate in the commitment protocol.
Prior to implementing our own message commitment protocol in the kernel, we
investigated a variety of existing secure protocols. The integrity and authenticity of
provenance identifiers could also be protected via IPsec [29], SSL tunneling,2 or
other forms of encapsulation [3, 66]. We elected to move forward with our approach
because (1) it ensures the monitoring of all processes and network events, includ-
ing non-IP packets, (2) it does not change the number of packets sent or received,
ensuring that our provenance mechanism is minimally invasive to the rest of the
Linux network stack, and (3) it preserves compatibility with non-LPM hosts.
Alternatives to DSA signing include HMAC [6], which offers better performance
but requires pairwise keying and sacrifices non-repudiation; BLS, which
approaches the theoretical maximum security parameter per byte of signature [7]; and
online/offline signature schemes [9, 16, 20, 53].
Authenticated Disclosures. We make use of IMA to protect the channel between
LPM and provenance-aware applications wishing to disclose provenance. IMA is able
2 See http://docs.oracle.com/cd/E23823_01/html/816-5175/kssl-5.html.
to prove to the provenance recorder that the application was unmodified at the time
it was loaded into memory, at which point the recorder can accept the provenance
disclosure into the official record. If the application is known to be correct (e.g.,
through formal verification), this is sufficient to establish the runtime integrity of
the application. However, if the application is compromised after execution, this
approach is unable to protect against provenance forgery.
A separate consideration for all of the above security properties is Denial of
Service (DoS) attacks. DoS attacks on LPM do not break its security properties. If
an attacker launches a resource exhaustion attack in order to prevent provenance
from being collected, all kernel operations will be disallowed and the host will cease
to function. If a network attacker tampers with a packet's provenance identifier, the
packet will not be delivered to the recipient application. In all cases, the provenance
record remains an accurate reflection of system events.
In this chapter, we have discussed the provenance monitor approach to secure and
trustworthy collection of data provenance, which can be an extraordinary source of
metadata for forensics investigators. The ability to use provenance for this goal is
predicated on its complete collection in an environment that cannot be tampered with.
As we discussed through our exploration of the Hi-Fi and Linux Provenance Modules
systems, the goals of a provenance monitor can be seen as a superset of reference
monitor goals because of the need for integration of layers and the notion of attested
disclosure, which are properties unique to the provenance environment.
A common limitation shared by provenance collection systems, including not only
Hi-Fi and LPM but also proposals such as SPADE and PASS, is that provenance col-
lection at the operating system layer demands large amounts of storage. For example,
in short-lived benchmark trials, each of these systems generated gigabytes of prove-
nance over the course of just a few minutes [21, 42]. There are some promising
methods of reducing the costs of collection. Ma et al.'s ProTracer system offers dra-
matic improvement in storage cost by making use of a hybrid audit-taint model for
provenance collection [36]. ProTracer only flushes new provenance records to disk
when system writes occur (e.g., file write, packet transmission); on system reads, Pro-
Tracer propagates a taint label between kernel objects in memory. By leveraging this
approach along with other garbage collection techniques [32, 33], ProTracer reduces
the burden of provenance storage to just tens of megabytes per day. Additionally,
Bates et al. [4] observed that much of the provenance collected by high-fidelity
systems is simply uninteresting; that is, it records data that provides no new
information essential to system reconstruction or forensic analysis. By focusing on
information deemed important through its inclusion in the system's trusted computing
base, as inferred from its mandatory access control policy, it is possible to identify
the subset of processes and applications critical to enforcing the system's security
goals. Restricting collection to these processes and applications reduces the amount
of data that needs to be collected by over 90%. Such an approach can be complementary
to other proposals for data transformation to assure the efficient storage of provenance
metadata [11] and the use of techniques such as provenance deduplication [61, 62].
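The alternation between auditing and tainting can be pictured with the following schematic sketch (ours, heavily simplified relative to ProTracer [36]): read-like events only update in-memory taint sets, while write-like events flush a single summarizing record to the log.

    from collections import defaultdict

    # Reads propagate taint labels between kernel objects in memory; only
    # write-like events (file writes, packet sends) flush a provenance record.
    taint = defaultdict(set)   # object id -> set of provenance sources
    log = []                   # stand-in for the on-disk provenance log

    def on_read(subject, obj):
        # e.g., a process reads a file: the process inherits the file's taint.
        taint[subject] |= taint[obj] | {obj}

    def on_write(subject, obj):
        # Emit one record summarizing everything that could have influenced
        # this output, then taint the written object as well.
        log.append((obj, frozenset(taint[subject] | {subject})))
        taint[obj] |= taint[subject] | {subject}

    # Example: process P reads f1 and f2, then writes f3 -> one flushed record.
    on_read("P", "f1"); on_read("P", "f2"); on_write("P", "f3")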
Extending provenance beyond a single host to distributed systems also poses a
considerable challenge. In distributed environments, provenance-aware hosts must
attest to one another's integrity before sharing provenance metadata [34], and in
layered provenance systems there may be no means to attest provenance disclosures
[43]. Kernel-based provenance mechanisms [42, 48] and sketches for trusted
provenance architectures [34, 38] fall short of providing a fully provenance-aware
system for distributed, malicious environments. Complicating matters further, data
provenance is conceptualized in dramatically different ways throughout the litera-
ture, such that any solution to provenance security would need to be general enough
to support the needs of a variety of diverse communities. Extending provenance
monitors into these environments can provide a wealth of new information to the
forensics investigator but must be carefully designed and implemented.
While we focus on the collection of provenance in this chapter, it is also important
to be able to efficiently query the provenance once it has been collected. Provenance
queries regarding the transitive causes or effects of a single system state or event
can be answered by a recursive procedure that retrieves the relevant portions of a
provenance graph [66, 67]. While such queries are useful in many applications, e.g.,
to find the root causes of a detected policy violation, further research is necessary
into efficient query languages that allow system operators to perform more complex
queries, identifying user-specified subgraphs of the collected provenance in a manner
that is easily usable and that facilitates the inference of analytics.
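As a minimal sketch of such a recursive retrieval (our illustration; deployed systems operate over much richer graph stores and query engines [66, 67]), the function below walks cause edges to collect the transitive causes of a given state or event.

    # Given a mapping from each object/event to its direct causes, return all
    # transitive causes of a single state or event (its provenance ancestry).
    def transitive_causes(event, causes_of):
        seen, stack = set(), [event]
        while stack:
            node = stack.pop()
            for parent in causes_of.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    # Example: trace a detected policy violation "v" back to its root causes.
    graph = {"v": {"proc_A"}, "proc_A": {"file_X", "sock_Y"}, "file_X": {"proc_B"}}
    print(transitive_causes("v", graph))   # {'proc_A', 'file_X', 'sock_Y', 'proc_B'}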
To conclude, provenance represents a powerful new means for gathering data
about a system for a forensics investigator. Being able to establish the context within
which data was created and to generate a chain of custody describing how the data
came to take its current form can provide vast new capabilities. However, as the sys-
tems discussed in this chapter demonstrate, ensuring that provenance is securely col-
lected is a challenging task. Future systems can build on existing work to address
the challenges we outlined above in order to bring the promises of provenance to
practical reality.
Acknowledgments This work draws in part from [5, 38, 48]. We would like to thank our co-authors
of those works, including Patrick McDaniel, Thomas Moyer, Stephen McLaughlin, Erez Zadok,
Marianne Winslett, and Radu Sion, as well as reviewers of those original papers who provided us
with valuable feedback. This work is supported in part by the U.S. National Science Foundation
under grants CNS-1540216, CNS-1540217, and CNS-1540128.
References
1. Aldeco-Pérez, R., Moreau, L.: Provenance-based auditing of private data use. In: Proceedings of the 2008 International Conference on Visions of Computer Science: BCS International Academic Conference, VoCS'08, pp. 141–152. British Computer Society, Swinton, UK (2008)
2. Bates, A., Mood, B., Valafar, M., Butler, K.: Towards secure provenance-based access control in cloud environments. In: Proceedings of the 3rd ACM Conference on Data and Application Security and Privacy, CODASPY '13, pp. 277–284. ACM, New York, NY, USA (2013). doi:10.1145/2435349.2435389
3. Bates, A., Butler, K., Haeberlen, A., Sherr, M., Zhou, W.: Let SDN be your eyes: secure forensics
in data center networks. In: NDSS Workshop on Security of Emerging Network Technologies,
SENT (2014)
4. Bates, A., Butler, K.R.B., Moyer, T.: Take only what you need: leveraging mandatory access control policy to reduce provenance storage costs. In: Proceedings of the 7th International Workshop on Theory and Practice of Provenance, TaPP'15 (2015)
5. Bates, A., Tian, D., Butler, K.R.B., Moyer, T.: Trustworthy whole-system provenance for the Linux kernel. In: Proceedings of the 2015 USENIX Security Symposium (Security'15). Washington, DC, USA (2015)
6. Bellare, M., Canetti, R., Krawczyk, H.: Keyed hash functions and message authentication. In: Proceedings of Crypto'96, LNCS, vol. 1109, pp. 1–15 (1996)
7. Boneh, D., Lynn, B., Shacham, H.: Short signatures from the Weil pairing. In: Boyd, C. (ed.) Advances in Cryptology – ASIACRYPT (2001)
8. Carata, L., Akoush, S., Balakrishnan, N., Bytheway, T., Sohan, R., Seltzer, M., Hopper, A.: A primer on provenance. Commun. ACM 57(5), 52–60 (2014). doi:10.1145/2596628. http://doi.acm.org/10.1145/2596628
9. Catalano, D., Di Raimondo, M., Fiore, D., Gennaro, R.: Off-line/on-line signatures: theoretical aspects and experimental results. In: PKC'08: Proceedings of the Practice and Theory in Public Key Cryptography, 11th International Conference on Public Key Cryptography, pp. 101–120. Springer, Berlin, Heidelberg (2008)
10. Centers for Medicare & Medicaid Services: The health insurance portability and accountability
act of 1996 (HIPAA). http://www.cms.hhs.gov/hipaa/ (1996)
11. Chapman, A., Jagadish, H., Ramanan, P.: Efficient provenance storage. In: Proceedings of the 2008 ACM Special Interest Group on Management of Data Conference, SIGMOD'08 (2008)
12. Chiticariu, L., Tan, W.C., Vijayvargiya, G.: DBNotes: a post-it system for relational databases based on provenance. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD'05 (2005)
13. Clark, D.D., Wilson, D.R.: A comparison of commercial and military computer security poli-
cies. In: Proceedings of the IEEE Symposium on Security and Privacy. Oakland, CA, USA
(1987)
14. Department of Homeland Security: A Roadmap for Cybersecurity Research (2009)
15. Edwards, A., Jaeger, T., Zhang, X.: Runtime verification of authorization hook placement for the Linux Security Modules framework. In: Proceedings of the 9th ACM Conference on Computer and Communications Security, CCS'02 (2002)
16. Even, S., Goldreich, O., Micali, S.: On-line/off-line digital signatures. In: Proceedings on Advances in Cryptology, CRYPTO '89, pp. 263–275. Springer, New York, USA (1989). http://portal.acm.org/citation.cfm?id=118209.118233
17. Foster, I.T., Vöckler, J.S., Wilde, M., Zhao, Y.: Chimera: a virtual data system for representing, querying, and automating data derivation. In: Proceedings of the 14th Conference on Scientific and Statistical Database Management, SSDBM'02 (2002)
18. Frew, J., Bose, R.: Earth system science workbench: a data management infrastructure for earth science products. In: Proceedings of the 13th International Conference on Scientific and Statistical Database Management, pp. 180–189. IEEE Computer Society (2001)
19. Ganapathy, V., Jaeger, T., Jha, S.: Automatic placement of authorization hooks in the Linux Security Modules framework. In: Proceedings of the 12th ACM Conference on Computer and Communications Security, CCS '05, pp. 330–339. ACM, New York, USA (2005). doi:10.1145/1102120.1102164
20. Gao, C.Z., Yao, Z.A.: A further improved online/offline signature scheme. Fundam. Inf. 91, 523–532 (2009). http://portal.acm.org/citation.cfm?id=1551775.1551780
21. Gehani, A., Tariq, D.: SPADE: support for provenance auditing in distributed environments. In: Proceedings of the 13th International Middleware Conference, Middleware '12 (2012)
22. Glavic, B., Alonso, G.: Perm: processing provenance and data on the same data model through query rewriting. In: Proceedings of the 25th IEEE International Conference on Data Engineering, ICDE '09 (2009)
23. Hall, E.: The Arnolfini Betrothal: Medieval Marriage and the Enigma of Van Eyck's Double Portrait. University of California Press, Berkeley, CA (1994)
24. Hasan, R., Sion, R., Winslett, M.: The case of the fake Picasso: preventing history forgery with secure provenance. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST'09. San Francisco, CA, USA (2009)
25. Hicks, B., Rueda, S., St.Clair, L., Jaeger, T., McDaniel, P.: A logical specification and analysis for SELinux MLS policy. ACM Trans. Inf. Syst. Secur. 13(3), 26:1–26:31 (2010). doi:10.1145/1805874.1805982
26. Holland, D.A., Braun, U., Maclean, D., Muniswamy-Reddy, K.K., Seltzer, M.I.: Choosing a data model and query language for provenance. In: Proceedings of the 2nd International Provenance and Annotation Workshop, IPAW'08 (2008)
27. Jaeger, T., Edwards, A., Zhang, X.: Consistency analysis of authorization hook placement in the Linux Security Modules framework. ACM Trans. Inf. Syst. Secur. 7(2), 175–205 (2004). doi:10.1145/996943.996944
28. Jones, S.N., Strong, C.R., Long, D.D.E., Miller, E.L.: Tracking emigrant data via transient provenance. In: 3rd Workshop on the Theory and Practice of Provenance, TaPP'11 (2011)
29. Kent, S., Atkinson, R.: RFC 2406: IP Encapsulating Security Payload (ESP) (1998)
30. Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7), 558–565 (1978). doi:10.1145/359545.359563
31. Lampson, B.W.: A note on the confinement problem. Commun. ACM 16(10), 613–615 (1973)
32. Lee, K.H., Zhang, X., Xu, D.: High accuracy attack provenance via binary-based execution par-
tition. In: Proceedings of the 20th ISOC Network and Distributed System Security Symposium,
NDSS (2013)
33. Lee, K.H., Zhang, X., Xu, D.: LogGC: garbage collecting audit log. In: Proceedings of the
2013 ACM Conference on Computer and Communications Security, CCS (2013)
34. Lyle, J., Martin, A.: Trusted computing and provenance: better together. In: 2nd Workshop on the Theory and Practice of Provenance, TaPP'10 (2010)
35. Ma, S., Lee, K.H., Kim, C.H., Rhee, J., Zhang, X., Xu, D.: Accurate, low cost and instrumentation-free security audit logging for Windows. In: Proceedings of the 31st Annual Computer Security Applications Conference, ACSAC 2015, pp. 401–410. ACM (2015). doi:10.1145/2818000.2818039
36. Ma, S., Zhang, X., Xu, D.: ProTracer: towards practical provenance tracing by alternating
between logging and tainting. In: Proceedings of the 23rd ISOC Network and Distributed
System Security Symposium, NDSS (2016)
37. Macko, P., Seltzer, M.: A general-purpose provenance library. In: 4th Workshop on the Theory and Practice of Provenance, TaPP'12 (2012)
38. McDaniel, P., Butler, K., McLaughlin, S., Sion, R., Zadok, E., Winslett, M.: Towards a secure
and efficient system for end-to-end provenance. In: Proceedings of the 2nd conference on
Theory and practice of provenance. USENIX Association, San Jose, CA, USA (2010)
39. Metasploit Project. http://www.metasploit.com
40. Moreau, L., Groth, P., Miles, S., Vazquez-Salceda, J., Ibbotson, J., Jiang, S., Munroe, S., Rana, O., Schreiber, A., Tan, V., Varga, L.: The provenance of electronic data. Commun. ACM 51(4), 52–58 (2008). http://doi.acm.org/10.1145/1330311.1330323
41. Mouallem, P., Barreto, R., Klasky, S., Podhorszki, N., Vouk, M.: Tracking files in the Kepler provenance framework. In: SSDBM 2009: Proceedings of the 21st International Conference on Scientific and Statistical Database Management (2009)
42. Muniswamy-Reddy, K.K., Holland, D.A., Braun, U., Seltzer, M.: Provenance-aware storage systems. In: Proceedings of the 2006 USENIX Annual Technical Conference (2006)
43. Muniswamy-Reddy, K.K., Braun, U., Holland, D.A., Macko, P., Maclean, D., Margo, D., Seltzer, M., Smogor, R.: Layering in provenance systems. In: Proceedings of the 2009 Conference on USENIX Annual Technical Conference, ATC'09 (2009)
44. Nguyen, D., Park, J., Sandhu, R.: Dependency path patterns as the foundation of access control in provenance-aware systems. In: Proceedings of the 4th USENIX Conference on Theory and Practice of Provenance, TaPP'12, p. 4. USENIX Association, Berkeley, CA, USA (2012)
45. Ni, Q., Xu, S., Bertino, E., Sandhu, R., Han, W.: An access control language for a general
provenance model. In: Secure Data Management (2009)
46. Pancerella, C., Hewson, J., Koegler, W., Leahy, D., Lee, M., Rahn, L., Yang, C., Myers, J.D., Didier, B., McCoy, R., Schuchardt, K., Stephan, E., Windus, T., Amin, K., Bittner, S., Lansing, C., Minkoff, M., Nijsure, S., von Laszewski, G., Pinzon, R., Ruscic, B., Wagner, A., Wang, B., Pitz, W., Ho, Y.L., Montoya, D., Xu, L., Allison, T.C., Green Jr., W.H., Frenklach, M.: Metadata in the collaboratory for multi-scale chemical science. In: Proceedings of the 2003 International Conference on Dublin Core and Metadata Applications: Supporting Communities of Discourse and Practice – Metadata Research & Applications, pp. 13:1–13:9. Dublin Core Metadata Initiative (2003)
47. Park, J., Nguyen, D., Sandhu, R.: A provenance-based access control model. In: Proceedings of the 10th Annual International Conference on Privacy, Security and Trust (PST), pp. 137–144 (2012). doi:10.1109/PST.2012.6297930
48. Pohly, D.J., McLaughlin, S., McDaniel, P., Butler, K.: Hi-Fi: collecting high-fidelity whole-system provenance. In: Proceedings of the 2012 Annual Computer Security Applications Conference, ACSAC '12. Orlando, FL, USA (2012)
49. Postel, J.: RFC 791: Internet Protocol (1981)
50. Revkin, A.C.: Hacked E-mail is new fodder for climate dispute. New York Times 20 (2009)
51. Sailer, R., Zhang, X., Jaeger, T., van Doorn, L.: Design and implementation of a TCG-based
integrity measurement architecture. In: Proceedings of the 13th USENIX Security Symposium.
San Diego, CA, USA (2004)
52. Sar, C., Cao, P.: Lineage file system. http://crypto.stanford.edu/cao/lineage.html (2005)
53. Shamir, A., Tauman, Y.: Improved online/offline signature schemes. In: Advances in Cryptology – CRYPTO 2001 (2001)
54. Silva, C.T., Anderson, E.W., Santos, E., Freire, J.: Using VisTrails and provenance for teaching scientific visualization. Comput. Graph. Forum 30(1), 75–84 (2011)
55. Sion, R.: Strong WORM. In: Proceedings of the 28th International Conference on Distributed Computing Systems (2008)
56. Spillane, R.P., Sears, R., Yalamanchili, C., Gaikwad, S., Chinni, M., Zadok, E.: Story book: an
efficient extensible provenance framework. In: First Workshop on the Theory and Practice of
Provenance. USENIX (2009)
57. Sundararaman, S., Sivathanu, G., Zadok, E.: Selective versioning in a secure disk system. In:
Proceedings of the 17th USENIX Security Symposium (2008)
58. Symantec: Symantec security response. http://www.symantec.com/security_response (2015)
59. The Netfilter Core Team: The netfilter project: packet mangling for Linux 2.4. http://www.netfilter.org/ (1999)
60. U.S. Code: 22 U.S. Code § 2778 – Control of arms exports and imports. https://www.law.cornell.edu/uscode/text/22/2778 (1976)
61. Xie, Y., Muniswamy-Reddy, K.K., Long, D.D.E., Amer, A., Feng, D., Tan, Z.: Compressing
provenance graphs. In: Proceedings of the 3rd USENIX Workshop on the Theory and Practice
of Provenance (2011)
62. Xie, Y., Feng, D., Tan, Z., Chen, L., Muniswamy-Reddy, K.K., Li, Y., Long, D.D.: A hybrid approach for efficient provenance storage. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12 (2012)
63. Zanussi, T., Yaghmour, K., Wisniewski, R., Moore, R., Dagenais, M.: Relayfs: an efficient unified approach for transmitting data from kernel to user space. In: Proceedings of the 2003 Linux Symposium, pp. 494–506. Ottawa, ON, Canada (2003)
64. Zhang, X., Edwards, A., Jaeger, T.: Using CQUAL for static analysis of authorization hook
placement. In: Proceedings of the 11th USENIX Security Symposium (2002)
65. Zhou, W., Sherr, M., Tao, T., Li, X., Loo, B.T., Mao, Y.: Efficient querying and maintenance of network provenance at internet-scale. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (2010)
66. Zhou, W., Fei, Q., Narayan, A., Haeberlen, A., Loo, B.T., Sherr, M.: Secure network provenance.
In: ACM Symposium on Operating Systems Principles (SOSP) (2011)
67. Zhou, W., Mapara, S., Ren, Y., Haeberlen, A., Ives, Z., Loo, B.T., Sherr, M.: Distributed time-
aware provenance. In: Proceedings of VLDB (2013)
Conclusion
Yong Guan, Sneha Kumar Kasera, Cliff Wang and Ryan M. Gerdes
Abstract We identify a series of research questions for further work in the broad
area of fingerprinting.
1 Overview
Y. Guan (B)
Iowa State University, Ames, IA 50011, USA
e-mail: [email protected]
S.K. Kasera
University of Utah, Salt Lake City, UT 84112, USA
e-mail: [email protected]
C. Wang
Army Research Office, Research Triangle Park, NC 27709, USA
e-mail: [email protected]
R.M. Gerdes
Utah State University, Logan, UT 84341, USA
e-mail: [email protected]
2 Measurements of Fingerprints
The effort expended in collecting data about various communications and sensor
devices would not only allow researchers to further validate existing fingerprinting
techniques, but the resulting data would also inform and motivate efforts to understand
the science, origin, and resiliency of fingerprints. We should thus view measurement
not only as necessary to establish the scientific validity of fingerprinting but also
as a critical component of a self-sustaining feedback loop in fingerprinting research:
data gathered on new and existing technologies, under differing deployment scenarios,
would need to be examined using existing fingerprinting theories and frameworks, and,
should our ability to fingerprint the data be negatively affected, new methods could
be proposed and tested.
Currently, fingerprinting researchers are working in isolation with small, disparate
datasets, collected using different experimental methodologies, under different
environmental conditions, and in scenarios that may not reflect real-world use. This makes
it difficult to compare the relative merits of each approach and judge their ability to
scale (i.e. the ability to identify many devices). What is needed is a statistically valid
sample of a device population for a given technology, acquired at different times and
from differing deployments, and available to all interested researchers. Such a dataset
would allow us to investigate (1) channel effects due to device mobility, (2) device
aging, (3) methods for tracking devices after they have been absent from, and then
returned to, the network, and (4) fingerprint drift.
Researchers should also propose new deployment scenarios that require measure-
ment, e.g., if devices are mobile, is the Doppler effect significant? It is also probable
that environmental factors that impact our ability to fingerprint devices will not
become apparent until researchers have analyzed large amounts of data. Uncovering
such impacts would then suggest new scenarios for measurement.
Gathering such data will most likely require assistance from manufacturers, as
we would be interested in, for instance, dates of manufacture, information on device
architecture, and insight into the manufacturing process. This information would
be useful for creating device models that could then be used to inform us about
which aspects of device behavior should be fingerprinted and how. However, we
have identified several reasons why manufacturers may be unwilling to lend their
assistance: (1) privacy (users of a device could be identified); (2) liability (a user
could be tracked and then harmed in some way using fingerprinting); (3) trade secrets
(the inner workings of devices could be revealed); and (4) cost (not only to provide
data but also to change the manufacturing process if extrinsic fingerprints are used).
A possible way to
interest manufacturers in fingerprinting work would be to focus research efforts on
reducing costs associated with acquiring and processing fingerprints to accelerate
deployment of fingerprinting technology in the enterprise. This would open another
avenue for sales, as manufacturers would be able to expand the sale of fingerprinting
beyond the military and government (where the hitherto high costs associated with
fingerprinting can be justified).
4 Science of Fingerprints
5 Security of Fingerprints
Index

Note: Page numbers followed by f and t indicate figures and tables respectively

K
K-bit fingerprints, 106
Key extraction, 60f, 61
Key generation, 40
  asymmetric keys, 40
  symmetric keys, 40

L
LineageFS, 160
Linear programming method, 53–54
Linksys CompactWireless USB adapter (WUSB54GC), 132
Linux Integrity Measurement Architecture (IMA), 163
Linux kernel's boot-time initialization process, 154
Linux provenance modules (LPM), 159
  augmenting whole-system provenance, 159–160
  deploying, 164–165
  design of, 161, 161f
  netfilter hooks, 162–163
  provenance hooks, 162
  workflow provenance, 163–164
  security analysis of, 169
    authenticated channel, 170
    authenticated disclosures, 170–171
    complete, 169–170
    tamperproof, 170
    verifiable, 170
  threat model, 160–161
Linux Security Module (LSM), 150
LogGC system, 146, 148

M
Mandatory Access Control (MAC), 148, 158, 165
Marking assumption, 93
Maximum Segment Size (MSS), 118
Measurements of fingerprints, 178–179
Memory-backed pseudo-filesystems, 154
Message authentication code (MAC), 51–52
Messages, authenticating, 71
MIMO systems, 16, 26
MIMO transmission, 72
Modification, fingerprint, 112
Modulation-based identification techniques, 21

N
Network analysers, 35
Network authentication, 55
Network traffic fingerprinting, 3
NI-USRP software-defined radios, 82
nmap program, 120
Noise, 45–48
Noise frequency components, 126, 129
No-Operation (NOP), 118

O
Observability Don't Care (ODC) fingerprinting, 100
  conditions, 101
  determining potential fingerprinting modifications, 102–103
  finding locations for circuit modification based on, 101–102
  illustrative example, 100
  ODC trigger signal, 102
  overhead constraints, maintaining, 103
  security analysis, 103–104
Operating System (OS) fingerprinting, 115
  active fingerprinting, 119–122
  detection, 120
  encrypted traffic, analysis of, 124
  future directions, 135
  packet-content agnostic traffic analysis, 122
    hidden services, 123
    inferring users' online activities through traffic analysis, 123
    website fingerprinting, 122–123
  passive fingerprinting, 117–119
  smartphone OS reconnaissance, 124
    empirical evaluation, 132–135
    identifying, 128–132
    system model, 127
    threat model, 127
  smartphone traffic, analysis of, 123–124
Optical disc fingerprints, 62f
Optical media, 61–63
Oscillator implementation, 44–45
Out-specification characteristics, 9
Oven controlled crystal oscillators (OCXOs), 57