Heat Behind The Meter: A Hidden Threat of Thermal Attacks in Edge Colocation Data Centers
Abstract—The widespread adoption of Internet of Things and latency-critical applications has fueled the burgeoning development of edge colocation data centers (a.k.a., edge colocation) — small-scale data centers in distributed locations. In an edge colocation, multiple entities/tenants house their own physical servers together, sharing the power and cooling infrastructures for cost efficiency and scalability. In this paper, we discover that the sharing of cooling systems also exposes edge colocations' potential vulnerabilities to cooling load injection attacks (called thermal attacks) by an attacker which, if left at large, may create thermal emergencies and even trigger system outages. Importantly, thermal attacks can be launched by leveraging the emerging architecture of built-in batteries integrated with servers, which can conceal the attacker's actual server power (or cooling load). We consider both one-shot attacks (which aim at creating system outages) and repeated attacks (which aim at causing frequent thermal emergencies). For repeated attacks, we present a foresighted attack strategy which, using reinforcement learning, learns on the fly a good timing for attacks based on the battery state and benign tenants' load. We also combine prototype experiments with simulations to validate our attacks and show that, for a small 8kW edge colocation, an attacker can potentially cause significant losses. Finally, we suggest effective countermeasures to the potential threat of thermal attacks.

Fig. 1. An attacker uses its built-in batteries to stealthily inject additional heat to overload the cooling system. (Metered load: e.g., 200W; generated heat: e.g., 300W, with 100W from the integrated battery in the server's power supply unit.)

I. INTRODUCTION

In the wake of the Internet of Things and ubiquitous computing demand, edge computing has recently emerged as a game-changing paradigm that brings computation to the Internet edge, thereby enabling ultra-low latencies for many critical applications such as augmented reality and assisted driving [1]. Consequently, the rise of edge computing spurs the burgeoning development of multi-tenant edge colocation data centers (a.k.a., edge colocation). An edge colocation is a small-scale shared colocation data center built at numerous distributed locations for hosting latency-ultrasensitive workloads such as assisted driving [2]. In such a colocation, the operator provides power and cooling resources to multiple entities (i.e., tenants) for housing their own physical servers. Thus, this fundamentally differs from a multi-tenant cloud platform where users/tenants share the cloud resources without owning the physical servers.

Edge colocations have become the preferred choice for edge service providers. For example, Vapor IO, an edge colocation operator, is rolling out thousands of edge colocations in partnership with wireless tower companies [3]. Moreover, a recent Uptime Institute survey [4] shows that more than 75% of the respondents will use edge colocations to house their physical servers and deploy edge applications.

The criticality of hosted applications, such as assisted driving [3], clearly mandates a high level of security for edge colocations. While securing servers and networks from cyber attacks remains a key issue, recent research has also identified critical vulnerabilities in data center physical infrastructures. More concretely, the practice of infrastructure oversubscription exposes data centers to well-timed power load attacks that aim at overloading the power capacity and compromising the data center availability [5]–[8]. Likewise, the data center cooling system removes server heat to avoid overheating and hence is also crucial for service uptime. If not properly managed, malicious workloads can create more hot spots that expose servers to an adverse thermal environment and thus more thermal emergencies [9]. Importantly, the cooling system has emerged as a leading root cause of downtime incidents in state-of-the-art data centers (e.g., Microsoft's) [10], [11].

To meet the power capacity constraints and avoid outages [12]–[14], the colocation operator has power meters to continuously monitor tenants' server power usage. Meanwhile, power meters are also used as a proxy to measure servers' cooling loads,¹ ensuring that the designed cooling capacity is not violated. The reason is that nearly 100% of server power is eventually converted into heat or cooling load [6], [15], [16]. Therefore, with proper heat dissipation, meeting the power capacity constraints also implicitly means meeting the cooling capacity constraints [16].

This work was supported in part by the NSF CNS-1551661 (CAREER).
¹Heat generated by servers is "cooling load" for the cooling system.
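The operator's capacity check described above can be sketched in a few lines. This is an illustrative toy (the capacity value and per-tenant draws are made-up numbers, not from the paper): metered power serves as the cooling-load estimate because nearly all server power becomes heat, so any unmetered, battery-supplied power silently breaks the check.

```python
# Toy model of the operator's check: metered server power doubles as the
# cooling-load estimate, since nearly 100% of server power turns into heat.
# The capacity and tenant draws below are illustrative, not from the paper.

CAPACITY_KW = 8.0  # shared power/cooling capacity C of the edge colocation

def within_cooling_capacity(metered_draws_kw):
    """Approximate the total cooling load by the total metered power."""
    return sum(metered_draws_kw) <= CAPACITY_KW

tenants_kw = [2.0, 1.5, 2.5, 0.8]            # metered draws of four tenants
print(within_cooling_capacity(tenants_kw))   # True: looks safe to the operator

# A battery-assisted attacker adds heat that the meters never see:
unmetered_battery_kw = 1.5
actual_heat_kw = sum(tenants_kw) + unmetered_battery_kw
print(actual_heat_kw <= CAPACITY_KW)         # False: cooling is overloaded
```

The gap between the two checks is exactly the "behind the meter" heat that the rest of the paper exploits.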
Fig. 2. An edge colocation data center with an attacker. (Benign tenants and the attacker share the operator-managed UPS, PDU, and cooling system.)

Fig. 3. Overview of a cooling system in an edge colocation. (① supply, ② inlet, ③ internal, and ④ outlet temperatures along the cold-aisle/hot-aisle air circulation.)

oversubscription for capital cost saving in modern data centers (e.g., Facebook aggressively oversubscribes its power capacity by 47% on average) [13], [25]–[27].

The colocation operator provides non-IT infrastructure support (i.e., power and cooling systems), while each tenant brings and controls its own physical servers.² The non-IT infrastructure is expensive and/or time-consuming to construct, taking nearly 60% of the total cost of ownership over a 10-year lifespan for a colocation operator [12], [15], [21]. Thus, like network bandwidth, the operator's power and cooling infrastructure capacity is a limited resource carefully sized based on the tenants' demand.

A. Power Infrastructure

As illustrated in Fig. 2, typically, an edge colocation data center uses a tree-type power hierarchy with total capacity in the range of a few kilowatts to a few tens of kilowatts shared by multiple tenants. Utility power first enters the data center through an uninterruptible power supply (UPS). Then, the UPS-protected power goes into a power distribution unit (PDU), which distributes the power to its downstream servers.

B. Cooling Infrastructure

While various cooling methods (e.g., computer room air conditioner, chiller, and "free" outside air cooling) are available [28], an edge colocation usually uses a computer room air conditioner to remove servers' heat due to its small size and often rugged deployment (e.g., outdoor with a wireless tower). Fig. 3 illustrates a typical cooling system in an edge colocation. For the best cooling efficiency, today's edge colocations also implement hot/cold aisle containment to prevent the hot air from mixing with the cold air [23], [29].

There are four different notions of temperature in a data center: supply air temperature Tsup, server inlet temperature Tinlet (i.e., temperature of cold air entering a server), server internal temperature Tinternal (e.g., CPU temperature), and server outlet temperature Toutlet (i.e., temperature of hot air exiting a server). With heat containment installed, all the servers' inlet temperatures are nearly identical to the supply air temperature. Thus, the supply air temperature and server inlet temperature are the lowest and serve as the baseline, whose increase will lead to increases in server internal and outlet temperatures. The server outlet temperature is typically elevated by 10+°C compared to the inlet temperature, while the server internal temperature is the highest and regulated by servers' internal fans. Hence, with heat containment, we have the following [30]–[32]:

Tinlet ≈ Tsup < Toutlet < Tinternal.    (1)

In a data center, the server inlet temperature is the most important thermal metric [31], [33], because servers' internal temperature control uses the inlet temperature as a reference [34]. For example, in modern data centers, the server inlet temperature is conditioned at 27°C for cooling efficiency, as recommended by ASHRAE [33], [35]. Also note that, while server heat is responsible for increases in internal and outlet temperatures, neither Tinternal nor Toutlet is a reliable indicator of a server's cooling load, since they depend on the server's internal heat management (e.g., fan speed) and air flow rate.

III. THERMAL ATTACK

The main focus of this section is to present the potential threat of battery-assisted thermal attacks (when concealed cooling loads are behind the meter and not promptly detected) and help strengthen edge colocations. As a precursor, we first introduce our threat model that outlines the scenario considered for thermal attacks. We then present the potential impacts on edge colocations. Finally, we introduce two possible attack strategies, followed by discussions on their feasibility.

A. Threat Model

We consider an edge colocation data center with a total power/cooling capacity of C, housing a few racks of servers owned by multiple tenants. There exists a malicious tenant (i.e., attacker) that runs artificial workloads without real value and has bad intentions.

What the attacker can do. The attacker houses its own physical servers in the edge colocation, sharing the power and cooling infrastructures with benign tenants. As illustrated in Figs. 1 and 4(a), the attacker's server power supply units have built-in battery units, which can conceal the attacker's actual server power/cooling load from the operator's power meters. Fig. 4(a) shows an overview of the attacker's server.

The attacker subscribes a data center capacity of ca from the colocation operator and keeps its power drawn from the operator's PDU below ca at all times (even during an attack), in order to meet the operator's requirement.

When launching a thermal attack, the attacker runs power-hungry applications (e.g., intensive computation) to increase

²A tenant can share a fraction of a rack space with other tenants.
Fig. 4. (a) Attacker's server with built-in batteries (an ADC, a 12V PSU, a voltage regulator, and battery units inside the power supply). (b) Supermicro's power supply [17]. The built-in battery module is highlighted in a red circle.

its actual server power consumption to pa > ca, where the ca amount comes from the operator's PDU and the rest from its built-in battery. In practice, when running at the peak load, a single server equipped with multiple CPUs and/or GPUs can easily consume several hundred watts, or even more than 1kW [36]. Thus, the attacker can inject an additional cooling load of pb = pa − ca beyond its subscribed capacity by discharging built-in battery units (which can be achieved by a dual-source power supply that can simultaneously draw power from the PDU and the battery units [21], [37]–[40]).

The attacker uses a voltage side channel, as proposed in [5], to estimate benign tenants' real-time total server load with high accuracy (Fig. 5(b)).

What the attacker cannot do. We do not consider naive attacks, such as self-explosion and tampering with the physical infrastructures, which are beyond the scope of our work. Moreover, other attacks, such as network DDoS attacks, are also orthogonal to our focus.

B. Impact of Thermal Attacks

Although non-trivial efforts are needed in the threat model, a successful thermal attack can overload a data center's cooling system and possibly increase the server inlet temperature to a dangerous level, triggering frequent performance degradation and even system outages [34], [41].

1) Performance degradation: Before system shutdown, a preventative mechanism is to temporarily cap the data center-wide cooling load (i.e., server power) below the cooling capacity [20], [42]. Specifically, when the server inlet temperature exceeds a threshold (e.g., 32°C) for a certain amount of time [20], it is considered that a data center exception, called a thermal emergency, has occurred, and servers are forcibly put in a low power state. The wait-time between inlet temperature violation and thermal emergency declaration depends on the operator's risk management policy. The temperature threshold for a thermal emergency is set lower than the server's automatic shutdown temperature to proactively handle an emergency. For example, in a Google-type data center, disk speeds and/or CPUs are throttled to lower the server power load (i.e., cooling load) in the event of a thermal emergency [20], [43]. Similar mechanisms also exist in multi-tenant colocations to handle a thermal emergency. Concretely, without controlling tenants' servers, the operator sends signals to tenants' own server management systems such that tenants can cap power loads below a certain level (a.k.a. power capping). The actual amount and duration of power capping can be either pre-determined based on SLA terms [35] or decided at runtime through a dynamic coordination mechanism [12], [44].

Nonetheless, handling a thermal emergency by capping tenants' server power (through, e.g., CPU throttling) inevitably results in performance degradation, which can in turn cause user dissatisfaction, revenue loss, and/or SLA violation [12], [13], [18], [45]. Some workloads may be re-routed to other unaffected data centers for service continuity, but this comes at a higher latency, since otherwise those workloads would have been processed in the preferred site to achieve the best performance without being re-routed.

2) System outage: In order to prevent permanent hardware damage, if the server inlet temperature continues rising despite cooling load capping, an automatic system shutdown may occur, leading to a system outage (e.g., the shared PDU can power off when the inlet temperature reaches 45°C) and service interruptions [33]. Such system outages can cause loss of working data sets and also suffer from long restart waiting times. Financially, a system outage can cost thousands of dollars every minute [10]. For latency-critical applications, an outage event may cause even more catastrophic consequences, such as decreased safety in edge-assisted driving [46].

We also run a prototype experiment to demonstrate the potential impact of thermal attacks on benign tenants; the results are in Appendix A.

C. Attack Strategies

We introduce two possible strategies for battery-assisted thermal attacks with different goals.

One-shot attack. It aims at creating a system outage by increasing the server inlet temperature beyond the safety limit (e.g., 45°C [33]). It can also be coordinated across multiple edge colocations for a wide-area service interruption. Even if successfully launched only once, the caused damage may be significant, especially for safety-critical applications (e.g., edge-assisted driving) [46].

Repeated attacks. Instead of aggressively overheating and shutting down the entire edge colocation, repeated attacks aim at frequently degrading the performance of benign tenants' latency-sensitive applications over a long period (e.g., one year) by triggering thermal emergencies and cooling load capping. Thus, repeated attacks compromise the long-term cooling system availability in edge colocations.

In general, a one-shot attack requires a higher battery capacity to support more intense attack loads (which may still be feasible, as shown in Section VI). On the other hand, repeated attacks require relatively fewer (though still considerable) resources, but they require more sophisticated timing of the attacks and can be easier to detect.

D. Feasibility of Thermal Attacks

Motivation for thermal attacks. A one-shot attack is as motivating as traditional DDoS attacks, as it can potentially create service outages. Likewise, repeated attacks can result
Fig. 5. (a) Server voltage carries the servers' load information. (b) Load estimation error (%) of the voltage side channel.

in frequent performance degradation for latency-sensitive applications, which in turn causes user dissatisfaction, revenue loss, and/or SLA violation. Thus, although the cost barrier is non-trivial, a battery-assisted thermal attack might still be inviting for potential attackers, such as the target colocation's ill-intentioned competitor or state-sponsored attackers.

Attacker's malicious cooling load. In recent years, vendors have integrated built-in batteries into servers' power supply units as an emerging backup power solution (e.g., Supermicro BBP [17], shown in Fig. 4(b)). Thus, an attacker can discharge built-in batteries to supply additional power to its servers, generating malicious cooling loads without being monitored by the colocation operator's power meters. Moreover, without air flow meters, temperature sensors that only monitor server inlet/outlet temperature cannot reliably locate the malicious cooling load. Consequently, if left neglected, thermal attacks can be launched behind the meter. This is also illustrated in Fig. 1: an attacker generates a 300W cooling load, but the colocation operator only measures 200W from the power meter, and the additional 100W load is supported by the attacker's internal batteries.

Availability of off-the-shelf hardware. Servers with built-in batteries are commercially available (e.g., Supermicro [17]). The current battery energy density is enough to fit into servers and supply sufficient additional power to mask the attacker's malicious cooling loads [47], even for a one-shot attack that requires more attack load than repeated attacks. Moreover, servers with large peak-to-average ratios are also available for generating a large amount of heat during an attack. For example, Dell-manufactured PowerEdge R740/R740xd servers can be equipped with up to three Nvidia Tesla GPUs, each with 225W peak and 20W idle power [48], [49].

Voltage side channel to time thermal attacks. Due to time-varying loads, the attacker needs to find a good timing for successful attacks (especially for repeated attacks) when benign tenants' aggregate power load (or cooling load) is high. The attacker can utilize a side channel — the voltage side channel in our study — to estimate benign tenants' power draw from the shared PDU. The voltage side channel is robust against changes in the environment and provides high accuracy due to its wired signal [5]. Utilizing the voltage side channel requires one analog-to-digital converter (ADC) that can fit on a server's power supply unit (as demonstrated in an orthogonal study for USB-powered IoT devices [50]). As shown in Fig. 4(a), the ADC taps into the server's input voltage to sample the PDU-level voltage.

For the readers' understanding, we show in Fig. 5(a) the fundamental principle behind the voltage side channel as recently proposed in [5]. The key idea is that, because of the voltage drop along the shared power cable, the total load information (proportional to current) is contained in the voltage signal, e.g., V1, entering any server connected to the PDU. Meanwhile, all of today's servers have power factor correction (PFC) circuits that generate high-frequency voltage ripples, whose amplitude is strongly correlated with the server load. Thus, the attacker can sense the incoming voltage signal, extract the voltage ripples, and estimate the total load at runtime.

We run a 24-hour real-world workload trace in our prototype and collect the voltage signal using an NI digital data acquisition (DAQ) device as an ADC proxy to extract the servers' total power load. We plot in Fig. 5(b) the probability distribution of load estimation errors, confirming that the voltage side channel can be leveraged for precisely timing thermal attacks.

Possibility of being detected. Detection of battery-assisted thermal attacks is not difficult, but contingent upon the edge colocation operator's practice of environment monitoring. Specifically, if the operator solely relies on power meters for monitoring tenants' loads and temperature sensors for conditioning the thermal environment, thermal attacks may possibly remain undetected until they cause damage. A service outage (due to a one-shot attack) or more frequent thermal emergencies (due to repeated attacks) can trigger a thorough inspection, thus exposing the attacker. In order to proactively prevent such damage, as discussed in Section VII, the operator can install additional monitoring apparatus, such as server outlet air flow meters, which are not widely used in many data centers. Thus, although thermal attacks do not have a high degree of stealthiness, there is a need for attention to potential thermal attacks.

Relationship to power attacks. Power attacks exploit oversubscribed power capacity and can be launched without the need of a battery [5]–[8], [51]. On the other hand, our proposed thermal attacks are launched with the help of built-in batteries for concealment of malicious cooling loads. Moreover, for repeated attacks, thermal attacks are stateful due to battery charging/discharging that results in temporal correlation of battery states, whereas power attacks are stateless and can be launched at any time without being constrained by the available battery energy. Thus, our thermal attacks are complementary to power attacks and present a potential threat by leveraging servers' built-in batteries for a malicious purpose.

IV. LEARNING AN ATTACK POLICY

A one-shot attack is a special case of repeated attacks if the attacker sets a sufficiently high threshold on benign tenants' load (above which an attack is launched) and greedily uses up its large built-in battery energy. Thus, we now study a general repeated attack policy, Foresighted, by formulating it as a discrete-time Markov decision process (MDP) and using reinforcement learning. The repeated attack policy has
a structural property: attack when both the benign tenants' server load and the battery energy level are sufficiently high.

A. MDP formulation

We divide the entire time horizon into time slots (e.g., 1 minute each) indexed by k = 0, 1, 2, · · · , ∞, and present our MDP formulation below.
• System state: s = (b, u) ∈ S
• Action: a(s) ∈ A(s)
• State transition probabilities: P(s, a, s′)
• Reward function: R(s, a, s′)
• Discount factor: γ ∈ (0, 1)

The tuple (s, a, s′) means that, given an action a, the system state evolves from s to s′. In our problem, the system state includes two sub-states: the battery state (the amount of remaining energy b in the battery units) and the attacker's estimated benign tenants' load state u (using a voltage side channel in Section III-D [5]). Note that we consider the estimated load as part of the system state, because the true value of servers' total load is not available to the attacker.

We consider three actions: (1) charging the battery units; (2) launching a thermal attack by running the servers at peak power and discharging batteries; and (3) standby, i.e., running dummy workloads without charging or discharging batteries. The battery's charging rate is fixed at the vendor-recommended value, while the effective discharging rate (i.e., power actually delivered to servers, excluding battery losses) is set to pb which, if combined with the attacker's subscribed capacity ca, can support the attacker's total server power consumption pa for thermal attacks. The state transitions are governed by benign tenants' load, which is exogenous to the attacker, and the battery energy evolution, which is controlled by the attacker's charging/discharging decision.

We define the attacker's reward function as follows:

R(s, a, s′) = w · [T(s, a) − T0]⁺ − β(a),    (2)

where T(s, a) is the resulting server inlet temperature, T0 is the server inlet temperature conditioned by the operator without attacks, β(a) is a cost term, and the operator [·]⁺ means max(·, 0). Note that the attacker can easily sense the resulting inlet temperature T(s, a), because today's servers have built-in temperature sensors to monitor the server inlet temperature for safety reasons (i.e., if the server inlet temperature is too high, the server may shut down by itself [34]). Clearly, after discharging batteries, the attacker needs to recharge them, which hence draws more energy from the operator's PDU than otherwise. To account for this, we add a normalized cost term: β(a) = 1 during an attack and β(a) = 0 otherwise. The cost is normalized to 1, because the attacker discharges a fixed amount of energy for each attack. The weight w ≥ 0 governs the tradeoff between server inlet temperature increase and total battery usage (or attack time): the larger w, the more importance of server inlet temperature increase and hence more attacks.

In a standard MDP, the goal is to find an optimal policy π∗ : S → A (i.e., deciding an optimal action given each system state) which maximizes the total discounted reward Σ_{k=0}^{∞} γ^k R(sk, ak, sk+1). The discount factor γ ∈ (0, 1) is imposed to ensure the convergence of the summation and implies in practice that future rewards are relatively less important than immediate rewards [52]. Nonetheless, the resulting server inlet temperature T(s, a) is an involved function that also depends on external factors, such as the edge colocation layout, and the dynamics of benign tenants' power usage is unknown to the attacker. Thus, we need an online learning approach to identify the optimal policy π∗ on the fly.

B. Batch Q-learning

Reinforcement learning can effectively assist an agent with finding optimal actions in an unknown environment. The cooling load state is essentially uncontrollable and exogenous to the attacker. On the other hand, the battery state is fully controllable and, with simplification, can be approximated as bk+1 = min(bk + ek, B̄), where ek is the charged energy during one time slot (a negative value means battery discharging for attacks) and B̄ is the total battery capacity. Thus, we adopt batch Q-learning [53], by extending the widely-used standard Q-learning [52], [53]. Concretely, by introducing an intermediate state (also called the post state s̃k), we have two state transition processes: from sk to s̃k, we only update the battery state, whose transition, according to the attacker's action, is fully determined; then, from s̃k to sk+1, we update the cooling demand state based on observations. More specifically, for each time slot k, our proposed batch Q-learning works as follows:

ak ← arg max_{a∈A(sk)} [Q(sk, a) + γV(s̃k(sk, a))]    (3)
s̃k(sk, ak) ← f(sk, ak)    (4)
Q(sk, ak) ← (1 − δ)Q(sk, ak) + δR(sk, ak, sk+1)    (5)
C(sk) = max_{a∈A(sk)} [Q(sk, a) + γV(s̃k)]    (6)
V(s̃k) = (1 − δ)V(s̃k) + δC(sk+1)    (7)

where δ ∈ (0, 1) is the learning rate, and only the battery state is updated, based on the attacker's charging/discharging action, when setting the post state s̃k(sk, a) in Eqn. (4).

Unlike standard Q-learning, three different value matrices are used for batch learning: the state-action value Q(sk, ak), the post-state value V(s̃k), and the normal state value C(sk). First, after observing the system state sk, the attacker takes an action ak based on Q(sk, a) and the post-state value V(s̃k(sk, a)) according to Eqn. (3). Then, the post state s̃k can be obtained based on the attacker's action. Next, the reward Rk is obtained based on the attacker's observed server inlet temperature and its reward function in Eqn. (2). Meanwhile, the next state sk+1 is obtained by estimating the cooling state through a voltage side channel, as discussed in Section III-D. Thus, the three value matrices can be updated recursively according to Eqns. (5), (6) and (7), respectively, making the learning process converge more quickly.
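The updates in Eqns. (3)-(7) can be sketched as follows. This is a hedged toy, not the paper's implementation: the temperature response and tenant-load process below are crude stand-ins, and the state discretization, epsilon-greedy exploration, and all constants are our own assumptions.

```python
# Sketch of the batch Q-learning updates (Eqns. (3)-(7)) on discretized
# battery/load states. The environment model here is a stand-in.
import random

random.seed(1)
N_B, N_U = 5, 5                      # battery-level bins, tenant-load bins
ACTIONS = ("charge", "attack", "standby")
Q, V, C = {}, {}, {}                 # Q(s,a), post-state value V, state value C
GAMMA, DELTA, W, EPS = 0.99, 0.1, 1.0, 0.1

def feasible(s):                     # attacking needs remaining battery energy
    return [a for a in ACTIONS if not (a == "attack" and s[0] == 0)]

def post_state(s, a):                # Eqn. (4): only the battery sub-state moves
    b, u = s
    b = min(b + 1, N_B - 1) if a == "charge" else (b - 1 if a == "attack" else b)
    return (b, u)

def step(s):
    # Eqn. (3): pick the action maximizing Q + gamma * V of its post state
    # (with epsilon-greedy exploration added for this toy)
    if random.random() < EPS:
        a = random.choice(feasible(s))
    else:
        a = max(feasible(s), key=lambda x: Q.get((s, x), 0.0)
                + GAMMA * V.get(post_state(s, x), 0.0))
    s_post = post_state(s, a)
    # Stand-in environment: an attack under high tenant load lifts the inlet
    # temperature above T0; Eqn. (2): R = w * [T - T0]^+ - beta(a)
    temp_rise = 3.0 if (a == "attack" and s[1] >= 3) else 0.0
    r = W * max(temp_rise, 0.0) - (1.0 if a == "attack" else 0.0)
    s_next = (s_post[0], random.randrange(N_U))  # exogenous load transition
    Q[(s, a)] = (1 - DELTA) * Q.get((s, a), 0.0) + DELTA * r        # Eqn. (5)
    C[s_next] = max(Q.get((s_next, x), 0.0)                         # Eqn. (6)
                    + GAMMA * V.get(post_state(s_next, x), 0.0)
                    for x in feasible(s_next))
    V[s_post] = (1 - DELTA) * V.get(s_post, 0.0) + DELTA * C[s_next]  # Eqn. (7)
    return s_next

s = (N_B - 1, 0)
for _ in range(20000):
    s = step(s)
# The learned Q values should favor attacking when battery and load are high:
print(Q.get(((N_B - 1, N_U - 1), "attack"), 0.0) > 0)
```

Note how the split mirrors the paper's design: Q holds the immediate reward of a state-action pair, while V carries the discounted future value through C, and the battery transition into the post state is fully determined by the action.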
TABLE I
LIST OF PARAMETERS WITH THE DEFAULT VALUES.

Parameter | Value
Data Center Capacity | 8 kW
Number of Tenants | 4
Number of Servers | 40
Number of Server Racks | 2
Attacker's Capacity (ca) | 0.8 kW
Attacker's Total Battery Capacity (B̄) | 0.2 kWh
Attack Thermal Load from Battery | 1 kW
Charging Rate of the Battery | 0.2 kW
Temperature Threshold for Emergency (Tth) | 32°C
Q-learning Discount Factor (γ) | 0.99
Q-learning Learning Rate (δ(t)) | 1/t^0.85

Fig. 6. (a) Data center layout. (1) Server racks. (2) Heat containment. (3) Air conditioner. (4) Supply air duct. (b) 24-hour snapshot of the power trace (aggregate power against the cooling capacity).
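As a quick sanity check, the battery defaults in Table I imply a natural attack duty cycle. The back-of-the-envelope arithmetic below is our own illustration and ignores battery conversion losses:

```python
# Duty cycle implied by Table I's defaults (ignoring battery losses).
B_KWH = 0.2        # attacker's total battery capacity
ATTACK_KW = 1.0    # extra thermal load injected from the battery
CHARGE_KW = 0.2    # battery charging rate

attack_minutes = B_KWH / ATTACK_KW * 60     # burst length on a full battery
recharge_minutes = B_KWH / CHARGE_KW * 60   # time to refill an empty battery

print(attack_minutes, recharge_minutes)     # 12.0 60.0
```

A full battery thus sustains at most a 12-minute burst of concealed 1kW heat, followed by roughly an hour of recharging, which is one reason the timing of repeated attacks in Section IV matters so much.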
Fig. 7. Experimental validation of our simulation model. ((a) Measured vs. simulated server inlet temperature over 30 minutes. (b) UPS power and battery level during discharge and re-charge over 60 minutes.)

lasts for 5 minutes for each thermal emergency. If the inlet temperature continues rising to reach 45°C, automatic shutdown occurs (e.g., the shared PDU can power off), creating a system outage and service interruptions [33].

Power trace. For the three benign tenants, we use workload traces from Facebook and Baidu [13], [14], and generate a year-long synthetic power trace from request-level logs using server power models validated in real systems [58]–[60]. The total power usage is scaled to have a 75% average utilization in our 8kW data center. We show a 24-hour snapshot of the power trace in Fig. 6(b). To demonstrate its robustness across different load patterns, we also run an alternate power trace and show the results in Fig. 13 in Section VI-F.

Application performance. For delay-sensitive workloads, high-percentile latency is the most critical metric [61]. Here, we consider the 95th-percentile response time as the performance metric and model the tenants' performance based on experiments on our small cluster (Fig. 15 in Appendix A).

Q-learning parameters. Following the literature [62], we set the default discount factor γ = 0.99 and a dynamic learning rate that is updated every day using δ(t) = 1/t^0.85, where t is the number of days elapsed. We use one minute as each time slot, and show the other parameters when presenting the results in Section VI. To initialize the table of Q values, we use random power traces offline based on an initial attack policy. Our results show that during the online learning stage, the action policy can converge quickly (often within 1–4 weeks).

Evaluation metrics. For the adverse thermal environment, we consider the average server inlet temperature increase, the probability distribution of the temperature, and the total emergency hours due to repeated thermal attacks. For benign tenants, we examine their performance degradation. We also study the average response time during the emergency periods normalized to that without any emergencies.

B. Experimental Validation of Our Simulation Model

While simulation-based evaluation is widely used in data center research [6], [9], [21], we validate our simulation model using real experiments on our prototype consisting of 14 servers and a 600VA CyberPower UPS battery. We look into the two important aspects of our simulation model — thermal dynamics and the battery charging/discharging model.

Temperature dynamics. We place our server rack in a sealed environment with a comparable dimension to an edge colocation, and inject a 1.5kW thermal overload beyond the limit that can be handled by the top air vents. We obtain the heat distribution model based on CFD analysis. In Fig. 7(a), we show the monitored server inlet temperature change along with the temperature change simulated using our model. We see that both the heat distribution model and temperature sensor readings exhibit very similar dynamics. This is expected, since we adopt well-established CFD-based simulation [6], [31].

Battery energy dynamics. In our Q-learning and the simulation, we need to validate that the linear battery model bk+1 = min(bk + ek, B̄), where bk is the battery level at time k, is accurate to model the battery energy changes with respect to the charging/discharging decisions. For this, we connect two Dell desktops with a total load of ∼175W to our UPS battery. We connect a power meter between the UPS and the AC power outlet to measure the total power consumption of the battery and the desktops. We connect another power meter between the UPS and the desktops to record the total power of the two desktops. Subtracting the latter from the former gives the total power consumption of the UPS. To demonstrate the battery dynamics, we first run the UPS in the battery discharging mode by unplugging it from the AC outlet. After 10 minutes, we reconnect the UPS to the AC outlet, which puts it in the battery charging mode. We show the battery energy levels in Fig. 7(b). In our experiment, the charging rate is lower than the discharging rate because of the additional UPS loss to power the running desktops. This experiment conforms to our choice of a linear battery energy model. While even more complicated and detailed battery models (e.g., impact of ambient temperature) may be adopted [63], they do not offer much additional insight for our purpose, and our observations still hold.

To sum up, our simulation methodology (i.e., using CFD-based analysis for modeling temperature dynamics and using a linear charging/discharging model for battery energy dynamics) matches well with the real-world observations and hence can be used to evaluate thermal attacks with good confidence.

VI. EVALUATION RESULTS

We first show an example of a one-shot attack. Then, for repeated attacks, we compare Foresighted with another attack policy, Myopic, which launches thermal attacks in a greedy manner whenever there is enough energy in the battery and the benign tenants' aggregate load is sufficiently high. Besides Myopic and Foresighted, we also consider Random as a benchmark, where the attacker randomly launches thermal attacks whenever it has enough battery energy, without considering benign tenants' power loads.

A. Thermal Attack Demonstration

1) One-shot attack: We consider a 30-minute snapshot and demonstrate a one-shot attack in Fig. 8, where the attacker injects 3kW of intense attack load at around the 18th minute, causing the server inlet temperature to rise quickly. At around
data center. The rack is cooled by the building’s central cooling the 21st minute, a thermal emergency is triggered and power
system and has air vents on the top. We create an additional capping is applied, limiting the total metered load below 5KW.
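The Q-learning setup described in Section V (discount factor γ = 0.99, a per-day learning rate δ(t) = 1/t^0.85, one-minute time slots, and a Q table over battery/load states) can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the state encoding, reward, and ε-greedy exploration are assumptions.

```python
import random
from collections import defaultdict

GAMMA = 0.99  # discount factor from Section V

def learning_rate(days_elapsed):
    # Dynamic learning rate delta(t) = 1 / t^0.85, updated every day
    return 1.0 / (days_elapsed ** 0.85)

# Q table over discretized (battery level, benign load) states and two
# actions: 0 = stay idle, 1 = inject the attack load in this time slot.
Q = defaultdict(float)

def choose_action(state, epsilon=0.1):
    # Epsilon-greedy exploration (illustrative; the paper only states that
    # the table is initialized offline from random power traces).
    if random.random() < epsilon:
        return random.choice((0, 1))
    return max((0, 1), key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, days_elapsed):
    # Standard one-step Q-learning update, applied once per one-minute slot.
    delta = learning_rate(days_elapsed)
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += delta * (reward + GAMMA * best_next
                                   - Q[(state, action)])
```

With this tabular form, the learned policy is simply the greedy action per state, which is how the structure in Fig. 10 (attack only at high load and high battery) can emerge.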
Nonetheless, the attack load remains on, keeping the server inlet temperature beyond the safety threshold of 45◦C [33] and successfully resulting in a system outage. This is also consistent with other orthogonal studies that demonstrate a very quick rise of the inlet temperature in case of a cooling system malfunction [41]. If the one-shot attack is coordinated across multiple colocations, a service interruption may occur and create significant damage.

2) Repeated attacks: We illustrate how repeated attacks create emergencies under different attack policies in Fig. 9 by considering a four-hour snapshot when the total power/cooling load is relatively high. In our illustration, Random launches attacks for 8% of the time, Myopic sets the attack threshold at 7.4kW, and Foresighted uses a weight w = 14. These settings are chosen to yield similar attack times (i.e., 8% of the time) across the different attack policies. The total power drawn from the operator's PDU is shown as "Metered Power", while the actual server power consumption also includes the contribution from the attacker's batteries ("Attack Load") and hence is larger than the metered power during the attacks. On the other hand, the actual server power is smaller than the metered power during battery charging. The discrepancy between the metered power and the actual server power highlights the attacker's "behind-the-meter" cooling loads that are not monitored by the operator.

We see in Fig. 9 that thermal attacks using Random, which remains oblivious to the high cooling load, fail to create any thermal emergencies. Note that Random's attacks look sparser in Fig. 9 since they are more spread out over time, while Myopic's and Foresighted's attacks are concentrated in the high power/cooling load periods. Myopic exploits the voltage side channel [5] to detect benign tenants' high power loads and launches thermal attacks between hours 0 and 1. Since the power/cooling load remains at a high level, attacks continue until the operator announces a thermal emergency. At that point, attacks are stopped and the power consumption is capped to comply with the operator's emergency handling protocol. The power returns to a normal level after being capped for 5 minutes to handle the thermal emergency.

While it also launches thermal attacks between hours 0 and 1, Foresighted does not launch a series of unsuccessful short-duration attacks like Myopic. Instead, it waits to regain battery energy and launches a sustained thermal attack to trigger a second thermal emergency near hour 2. This shows the benefit of reinforcement learning, which considers the impact of current actions on the future to maximize the long-term benefit. Note that, even if Myopic only launches long-duration attacks with fully charged batteries, unlike Foresighted, these attacks will more likely occur at the wrong times due to the lack of learning and of accounting for battery level dynamics.

B. Attack Policy Learnt by Foresighted

We show in Fig. 10 the structural property of our repeated attack policy learnt by Foresighted: attack when both the benign tenants' server load and the battery energy level are sufficiently high. For illustration, we consider two different values of w (the larger w, the more weight on creating temperature increases and hence more attacks). For w = 9 in Fig. 10(a), attacks are launched only when the estimated power load (including the attacker's subscribed power capacity) is above 7.5kW and more than 60% of the battery energy is left. For w = 14, we see that attacks are launched even with 40% remaining battery energy when the power is above 7.5kW. Meanwhile, Foresighted launches attacks at a lower power of 7kW when it has more than 80% battery energy.

C. Cost Estimate

Benign tenants' cost. With a one-shot attack, benign tenants can suffer from service outages, which may be costly or even indirectly cause fatal damages (e.g., decreased safety for assisted driving [46]); with repeated attacks, tenants can potentially experience more frequent performance degradation. The monetary impact of thermal attacks is generally difficult

Fig. 8. Demonstration of a one-shot attack (metered load, attack load, and server inlet temperature over a 30-minute window).

Fig. 9. 4-hour snapshot of thermal attacks (metered power, attack load, battery energy, and thermal emergencies under Random, Myopic, and Foresighted).

Fig. 10. Attack policy learnt by Foresighted. (a) Weight w = 9. (b) Weight w = 14.
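The three repeated-attack policies compared in this section reduce to simple decision rules per time slot. The sketch below is illustrative: Myopic's 7.4kW threshold and Random's 8% attack probability come from the snapshot settings above, while the battery-sufficiency check and the Q-table encoding for Foresighted are assumptions.

```python
import random

def random_policy(battery_kwh, attack_energy_kwh, p_attack=0.08):
    # Random: attack with a fixed probability whenever the battery holds
    # enough energy, ignoring the benign tenants' load.
    return battery_kwh >= attack_energy_kwh and random.random() < p_attack

def myopic_policy(battery_kwh, attack_energy_kwh, estimated_load_kw,
                  threshold_kw=7.4):
    # Myopic: greedily attack whenever the side-channel estimate of the
    # aggregate load exceeds a threshold and the battery has enough energy.
    return (battery_kwh >= attack_energy_kwh
            and estimated_load_kw >= threshold_kw)

def foresighted_policy(q_table, battery_state, load_state):
    # Foresighted: follow the learned Q table; attack (action 1) only when
    # its long-term value beats staying idle (action 0).
    s = (battery_state, load_state)
    return q_table.get((s, 1), 0.0) > q_table.get((s, 0), 0.0)
```

The key difference is that only Foresighted's rule depends on values learned from the consequences of past attacks, which is why it can defer an attack to preserve battery energy.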
to estimate. To offer an approximate point of reference, we provide a ballpark estimate for repeated attacks following prior studies [10], [12], [64] that calculate the cost impact resulting from the increased 95-percentile latency. Under our setting, Foresighted causes a total performance cost of roughly $60+K/year to benign tenants in our 8kW edge colocation (roughly 80% of the benign tenants' total rental costs plus amortized server costs), noting that the actual cost highly depends on the affected tenants' applications and can include additional indirect costs such as business reputation.

Attacker's cost. The attacker's cost involves the power capacity subscription cost, electricity cost, and server purchase cost: $150/kW/month power subscription cost, $0.1/kWh energy cost, and $4500 for each server [12]. It is on a par with the cost of other related attacks [5]–[9], and can be affordable for institutional or state-sponsored attackers.

D. Impact of Thermal Attacks

For repeated attacks, we first show in Fig. 11(a) how long it takes for the server inlet temperature to exceed the 32◦C threshold. Naturally, the temperature exceeds the threshold sooner with an increased cooling overload. Similarly, when the data center is already running hotter (i.e., with a higher supply temperature Ts), its temperature reaches the limit faster. We see that it takes less than four minutes to increase the data center temperature from 27◦C to 32◦C with one kW of additional cooling load, demonstrating the potential danger of thermal attacks.

We then vary the total attack energy injected into the edge colocation (i.e., the total attack time), while keeping the attack load from the battery fixed at 1kW. We vary the attack probability for Random from 0% to 15%, the load threshold (including the attacker's own power subscription) for launching an attack under Myopic from 6.5kW to 8.0kW, and the weight parameter for Foresighted from w = 0 to w = 30. Figs. 11(b) and 11(c) show the average server inlet temperature increase (ΔT) beyond 27◦C and the amount of attack-induced emergencies (measured in % of the total time) given different average daily attack times, respectively. In Fig. 11(c), we exclude Random because it fails to create any thermal emergency.

Temperature increase. We see in Fig. 11(b) that, with more attacks, the temperature increase caused by Random also rises. For Myopic and Foresighted, the temperature increase rises very fast initially, when attacks are conservatively launched. However, as more attacks are launched, the temperature increase for Myopic peaks at around an attack time of 1.1 hours per day and then starts to decrease. This is because Myopic launches premature attacks which deplete the battery energy and hence miss future attack opportunities. We see a similar impact on the annual thermal emergency time in Fig. 11(c), where Myopic's performance starts to deteriorate around an attack time of 1.5 hours per day.

Foresighted takes the future into account and hence sustains the increases in both the average temperature and the annual emergency time with more thermal attacks. However, beyond an attack time of 1.5 hours per day, Foresighted cannot create further temperature increases or more thermal emergencies. This is mainly because the total available attack opportunities are limited (i.e., benign tenants do not always have high power loads) and recharging batteries takes time. Nonetheless, given any amount of thermal attacks, Foresighted can create higher server inlet temperature increases and more thermal emergencies than Myopic.

Attack-induced thermal emergencies. In Fig. 11(c), we see that the attack-induced thermal emergencies for both Myopic and Foresighted are close to zero at low attack times. This is because the operator declares a thermal emergency when the data center temperature exceeds 32◦C and stays there for at least two minutes. Hence, at low attack times, which also correspond to low average temperature increases in Fig. 11(b), there are almost no thermal emergencies due to attacks.

Performance impacts. We normalize the tenants' 95-percentile response time to that without any emergencies. We take the average of the normalized response time during the emergency periods and show the result in Fig. 11(d). We see that Myopic has a slightly higher average performance impact than Foresighted. This is because Myopic mainly captures the most prominent attack opportunities, while Foresighted intelligently picks up even the subtle opportunities with relatively lower impact, resulting in a lower average performance impact. Nonetheless, since Foresighted seizes both the prominent and the subtle attack opportunities, it results in more frequent thermal emergencies and thus a greater cost impact.

E. Sensitivity Study

We now study how the battery capacity, side channel accuracy, attack load, and data center average utilization affect the resulting thermal attacks. We also study the impact

Fig. 11. (a) Overload time required to exceed the temperature limit of 32◦C. (b) Average temperature increase vs. attack time. (c) Total attack-induced emergency time vs. attack time. (d) Tenants' performance during emergencies.
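The operator's emergency rule used in our evaluation — declare a thermal emergency once the inlet temperature stays above 32◦C for at least two consecutive minutes, then cap power for five minutes — can be replayed over a per-minute temperature trace. The sketch below is a simplification: it does not feed the effect of power capping back into the temperature.

```python
def simulate_emergencies(inlet_temps_c, limit_c=32.0, sustain_min=2,
                         cap_min=5):
    # Replay a per-minute inlet temperature trace and apply the operator's
    # rule: after the temperature stays above limit_c for sustain_min
    # consecutive minutes, declare an emergency and cap power for cap_min
    # minutes before checking again.
    emergencies = []
    hot_minutes = 0
    cap_left = 0
    for minute, temp in enumerate(inlet_temps_c):
        if cap_left > 0:  # power is being capped; skip emergency checks
            cap_left -= 1
            hot_minutes = 0
            continue
        hot_minutes = hot_minutes + 1 if temp > limit_c else 0
        if hot_minutes >= sustain_min:
            emergencies.append(minute)
            cap_left = cap_min
            hot_minutes = 0
    return emergencies
```

For example, a trace that sits above 32◦C during minutes 2 through 4 yields a single emergency, declared at minute 3 (the second consecutive hot minute).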
Fig. 12. Sensitivity of Foresighted. (a) Battery capacity. (b) Load estimation error due to random noise in the side channel. (c) Attack load. (d) Average utilization of data center capacity. (e) Required battery capacity for extra cooling capacity.

Fig. 13. Results with an alternate power trace. (a) A 24-hour snapshot of the alternate power trace. (b) Tenants' performance during emergencies.

of additional cooling capacity on the attacker's battery capacity requirement. We exclude Random from our study here since it fails to create any thermal emergency.

Battery capacity. Considering repeated attacks, we vary the battery capacity from 0.1kWh to 0.4kWh, and show the annual duration of thermal emergencies due to the attacks in Fig. 12(a). Naturally, a larger battery provides greater flexibility in launching thermal attacks. Hence, we see that the annual thermal emergency time increases with the battery capacity. We also see that the difference between Myopic and Foresighted decreases with a larger battery, as the battery is more likely to be available whenever Myopic needs it, like in Foresighted.

Load estimation accuracy. To test robustness against voltage side channel errors, we add varying degrees of random errors to the estimated loads of benign tenants and show our results in Fig. 12(b). As expected, the thermal emergency time decreases for both Myopic and Foresighted when there is more noise in the side channel. Nonetheless, Foresighted can still create a significant amount of thermal emergencies, even using a noisy voltage side channel.

Attack load. The attack load determines how much additional cooling load is injected during each attack. We show the results in Fig. 12(c), where we keep the attacker's subscribed capacity at 0.8kW and scale the thermal attack load from 0.5kW to 2kW. We see that the annual emergency time greatly increases with a higher attack load and that Foresighted consistently outperforms Myopic by a great margin.

Capacity utilization. We study the impact of the average data center utilization on the thermal attack by scaling the power trace of all the servers while maintaining the peak power at 8kW. Fig. 12(d) shows that the total thermal emergency time increases with increased capacity utilization. This is intuitive, since an increased utilization means the data center more frequently operates close to its capacity, thus leading to more thermal attack opportunities.

Extra cooling capacity. We study the impact of the operator's extra cooling capacity on Foresighted's battery requirement to maintain a similar impact (i.e., 2.3% emergency time). In Fig. 12(e), we see that extra cooling capacity mandates a higher battery capacity. Specifically, the increase in battery capacity for 10% extra cooling capacity is about 0.3kWh, which can still be feasible given today's battery energy density. Note, however, that upgrading an existing data center cooling system to add extra cooling capacity is non-trivial due to constraints such as space limitation, data center uptime, etc. Thus, as discussed in Section VII, other defenses are more effective and cost-efficient, especially for an existing data center that has limited cooling capacity.

F. Results with an Alternate Power Trace

We conduct our year-long evaluation with an alternate power trace to demonstrate that Foresighted is effective regardless of the benign tenants' load patterns. We use the Google cluster trace from [40] as the alternate total power trace. We show a 24-hour snapshot of the alternate power trace in Fig. 13(a). Like in the default setting, we scale the power trace to a 75% average utilization in our 8kW edge colocation. We keep the same default settings as in Section V for Myopic and Foresighted. Fig. 13(b) shows that, with the alternate power trace, benign tenants suffer from similar performance degradation as in our earlier results. While we omit detailed discussion due to space limitations, these findings are consistent with our earlier results.

VII. DEFENSE MECHANISM

Tenants generally expect reliable power and cooling supplies (subject to contractual terms) from the colocation operator, which manages the non-IT systems. Thus, we offer possible defenses from the operator's perspective. We first discuss defenses that aim at preventing potential thermal attacks, followed by defenses that detect thermal attacks.

A. Prevention

The following defense strategies are proactive measures to inhibit potential thermal attacks.

Infrastructure resilience. A straightforward defense against thermal attacks is to reinforce an edge colocation's physical infrastructure for handling thermal overloads. For this, the operator can deploy a cooling system with additional redundancies. This approach, however, can increase the capital
cost [24], [65] and be particularly challenging for existing systems. Alternatively, the operator can lower its server inlet temperature set point (e.g., to 20◦C instead of the recommended 27◦C) to have more margin before triggering thermal emergencies. The drawback is the increased cooling energy cost [15], [31]. Thus, while oversubscribing data center cooling capacity [15], [16], [24] and increasing the temperature set point [31] have been suggested for cost efficiency, they should be carefully exercised, balancing the benefits against the risk of potential thermal attacks.

Rigorous move-in inspection. The colocation operator can employ a more rigorous background check and move-in inspection process for all tenants' servers to detect and remove integrated batteries. Note that, without built-in batteries, the attacker cannot have additional power sources to support thermal attacks behind the meter or overload the shared cooling capacity, unless the data center cooling capacity is oversubscribed as suggested by recent studies [15], [16], [24]. Besides, the operator can also enforce on-site power load tests to ensure that the server power is consistent with the tenant's data center capacity subscription. The operator should be particularly careful about the servers' peak power.

Degrading physical side channels. The colocation operator may increase the attacker's uncertainty about timing attacks by degrading/eliminating the physical side channel. For example, it can add jamming noise signals into the colocation power network and/or use power line noise filters. Additionally, the operator may also prohibit unusual sensors (e.g., microphones) on tenants' servers in order to prevent an attacker from exploiting other possible but unknown side channels.

B. Detection

Detection strategies can be implemented to catch an attacker that may circumvent prevention approaches.

Detecting behind-the-meter cooling loads. The same power reading can result in different cooling loads and server inlet/outlet temperatures, depending on whether malicious thermal attacks are launched or not. Thus, by using anomaly detection algorithms (e.g., cross-checking the readings of temperature sensors and power meters), the operator can detect an irregular thermal environment possibly caused by thermal attacks.

Identifying attacks from impacts. One-shot attacks can be easily identified through a thorough inspection if a system outage occurs. By contrast, repeated attacks that inject milder loads to trigger more frequent thermal emergencies can require more effort. Since precise temperature management is difficult with open airflow cooling, there can be occasional thermal emergencies in colocations even without thermal attacks; colocation operators often offer a long-term temperature SLA (e.g., the inlet temperature is conditioned below 27◦C for 99% or more of the time) [66], [67]. This may potentially allow an attacker to hide behind the statistics for a longer time. Thus, advanced algorithms can be implemented to monitor SLA metrics and detect the presence of thermal attacks early.

Improved data center monitoring. While the aforementioned approaches can detect thermal attacks, pinpointing the attacker's servers — the source of the injected cooling load — is still needed to hold the attacker accountable. Thus, to monitor the servers' actual cooling loads, the operator can measure each server's outlet temperature as well as the hot air flows. Alternatively, thermal cameras may be employed to identify the servers that are running extra hot. Likewise, microphone arrays can be used along with the thermal cameras to pinpoint servers whose fans spin at a high speed (needed by servers that have higher cooling loads) [7]. While these monitoring apparatuses are not used in all data centers, they are readily available and can be easily installed by data centers to identify malicious cooling loads.

To sum up, there exist readily-available defenses, such as move-in inspection to disallow built-in batteries, advanced anomaly detection, and installation of monitoring apparatuses to locate the attacker. Given the potential threat of thermal attacks, which is currently neglected, the edge colocation operator can implement one or more of the suggested defenses to safeguard its thermal environment for tenants.

VIII. RELATED WORKS

Power and thermal management. The common practice of aggressive capacity oversubscription can create occasional capacity overloads when the demand peaks [12]–[14], [18], [19], [21]. To safely ride through power emergencies, numerous graceful power capping techniques have been proposed, such as throttling CPU frequencies [13], migrating/deferring workloads [14], [45], and discharging batteries to boost the power supply [18], [19], [21]. Likewise, managing server loads to handle thermal emergencies is equally crucial [16], [20], [43]. These studies, however, are not applicable to colocations whose operators have no control over tenants' servers. Moreover, they do not consider an adversarial setting. More recent works [12], [68] propose market approaches to coordinate tenants' power demand in colocations, but they assume that tenants are all benign without any malicious intentions.

Data center security and thermal fault attacks. Securing data centers against cyber attacks, such as network DDoS [69] and data/privacy breaches [70], has been extensively investigated. Prior studies have also considered malicious thermal load attacks on a single device [71]. More recently, data center power and cooling system security has been emerging as a crucial concern [5]–[9], [51], [72]. However, these works focus on overloading the power infrastructure (i.e., power attacks) of large data centers with multi-level redundancy, or on creating hotspots (i.e., thermal attacks) in Amazon-type clouds with frequent VM shuffling. In contrast, we focus on novel battery-assisted thermal attacks in a shared edge colocation. Moreover, our repeated battery-assisted thermal attacks are stateful, whereas prior attacks are stateless, as the current attack does not depend on any past/future attacks.

Battery management and others. Prior studies have exploited batteries for various purposes, such as better energy capacity [63], concealing a household's electricity usage information from the utility for better privacy [73], and smoothing data center power demand [18], [19], [21], among many others.
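The cross-checking defense described in Section VII-B can be grounded in a simple sensible-heat balance: nearly all of a server's electrical power leaves as heat in the exhaust air, so the heat inferred from the inlet/outlet temperature difference and the airflow should match the metered power. The sketch below is illustrative; the function names, airflow value, and tolerance are assumptions, not part of the paper.

```python
def expected_outlet_temp_c(inlet_c, metered_kw, airflow_kg_s, cp_kj=1.006):
    # Sensible-heat balance for one server: outlet = inlet + P / (m_dot * c_p),
    # with c_p the specific heat of air in kJ/(kg*K).
    return inlet_c + metered_kw / (airflow_kg_s * cp_kj)

def behind_meter_load_kw(inlet_c, outlet_c, metered_kw, airflow_kg_s,
                         cp_kj=1.006, tol_kw=0.05):
    # Cross-check temperature sensors against the power meter: exhaust heat
    # in excess of the metered power suggests an unmetered, battery-fed
    # load. The tolerance absorbs sensor noise and is illustrative.
    heat_kw = (outlet_c - inlet_c) * airflow_kg_s * cp_kj
    excess_kw = heat_kw - metered_kw
    return excess_kw if excess_kw > tol_kw else 0.0
```

A server metering 0.5kW but exhausting roughly 1kW of heat would thus be flagged with about 0.5kW of behind-the-meter load.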
To our knowledge, however, our study is the first to leverage batteries for a malicious purpose — one-shot or repeated thermal attacks in edge colocations — which highlights the need for attention to this potential threat.

IX. CONCLUSION

In this paper, we discovered that the sharing of cooling systems may expose edge colocations' potential vulnerabilities to both one-shot and repeated thermal attacks assisted with built-in batteries. For repeated attacks, we presented a foresighted attack policy which, using reinforcement learning, learns on the fly a good timing for thermal attacks. We also ran simulations to validate our attacks and showed that, for an 8kW edge colocation, an attacker can cause performance degradation for affected tenants. Finally, we suggested effective countermeasures against potential thermal attacks that are currently neglected in many data centers.

REFERENCES

[1] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, pp. 637–646, Oct 2016.
[2] DatacenterKnowledge, "NTT plans global data center network for connected cars," http://www.datacenterknowledge.com/archives/2017/03/27/ntt-plans-global-data-center-network-for-connected-cars.
[3] Vapor IO, "The edge data center," https://www.vapor.io/.
[4] Uptime Institute, "Data center industry survey," 2018, https://uptimeinstitute.com/2018-data-center-industry-survey-results.
[5] M. A. Islam and S. Ren, "Ohm's law in data centers: A voltage side channel for timing power attacks," in CCS, 2018.
[6] M. A. Islam, S. Ren, and A. Wierman, "Exploiting a thermal side channel for power attacks in multi-tenant data centers," in CCS, 2017.
[7] M. A. Islam, L. Yang, K. Ranganath, and S. Ren, "Why some like it loud: Timing power attacks in multi-tenant data centers using an acoustic side channel," in SIGMETRICS, 2018.
[8] Z. Xu, H. Wang, Z. Xu, and X. Wang, "Power attack: An increasing threat to data centers," in NDSS, 2014.
[9] X. Gao, Z. Xu, H. Wang, L. Li, and X. Wang, "Reduced cooling redundancy: A new security vulnerability in a hot data center," in NDSS, 2018.
[10] Ponemon Institute, "2016 cost of data center outages," 2016, https://www.ponemon.org/blog/2016-cost-of-data-center-outages.
[11] P. Jones, "Overheating brings down microsoft data center," Datacenter Dynamics, 2013, https://www.datacenterdynamics.com/news/overheating-brings-down-microsoft-data-center/.
[12] M. A. Islam, X. Ren, S. Ren, A. Wierman, and X. Wang, "A market approach for handling power emergencies in multi-tenant data center," in HPCA, 2016.
[13] Q. Wu, Q. Deng, L. Ganesh, C.-H. R. Hsu, Y. Jin, S. Kumar, B. Li, J. Meza, and Y. J. Song, "Dynamo: Facebook's data center-wide power management system," in ISCA, 2016.
[14] G. Wang, S. Wang, B. Luo, W. Shi, Y. Zhu, W. Yang, D. Hu, L. Huang, X. Jin, and W. Xu, "Increasing large-scale data center capacity by statistical power control," in EuroSys, 2016.
[15] M. Skach, M. Arora, C.-H. Hsu, Q. Li, D. Tullsen, L. Tang, and J. Mars, "Thermal time shifting: Leveraging phase change materials to reduce cooling costs in warehouse-scale computers," in ISCA, 2015.
[16] I. Manousakis, I. Goiri, S. Sankar, T. D. Nguyen, and R. Bianchini, "Coolprovision: Underprovisioning datacenter cooling," in SoCC, 2015.
[17] Supermicro, "Battery backup power - evolutionary design to replace UPS," http://www.supermicro.com/products/nfo/files/bbp/f_bbp.pdf.
[18] B. Aksanli, T. Rosing, and E. Pettis, "Distributed battery control for peak power shaving in datacenters," in IGCC, 2013.
[19] V. Kontorinis, L. E. Zhang, B. Aksanli, J. Sampson, H. Homayoun, E. Pettis, D. M. Tullsen, and T. S. Rosing, "Managing distributed UPS energy for effective power capping in data centers," in ISCA, 2012.
[20] Y. Kim, J. Choi, S. Gurumurthi, and A. Sivasubramaniam, "Managing thermal emergencies in disk-based storage systems," Dec 2008.
[21] D. Wang, C. Ren, A. Sivasubramaniam, B. Urgaonkar, and H. Fathy, "Energy storage in datacenters: what, where, and how much?," in SIGMETRICS, 2012.
[22] Y. Sverdlik, "Google to build and lease data centers in big cloud expansion," in DataCenterKnowledge, April 2016.
[23] DatacenterKnowledge, "Vapor IO to sell data center colocation services at cell towers," http://www.datacenterknowledge.com/archives/2017/06/21/vapor-io-to-sell-data-center-colocation-services-at-cell-towers.
[24] M. Skach, M. Arora, D. Tullsen, L. Tang, and J. Mars, "Virtual melting temperature: Managing server load to minimize cooling overhead with phase change materials," in ISCA, 2018.
[25] S. Malla, Q. Deng, Z. Ebrahimzadeh, J. Gasperetti, S. Jain, P. Kondety, T. Ortiz, and D. Vieira, "Coordinated priority-aware charging of distributed batteries in oversubscribed data centers,"
[26] V. Sakalkar, V. Kontorinis, D. Landhuis, S. Li, D. De Ronde, T. Blooming, A. Ramesh, J. Kennedy, C. Malone, J. Clidaras, and P. Ranganathan, "Data center power oversubscription with a medium voltage power plane and priority-aware capping," in ASPLOS, 2020.
[27] A. Kumbhare, R. Azimi, I. Manousakis, A. Bonde, F. Frujeri, N. Mahalingam, P. Misra, S. A. Javadi, B. Schroeder, M. Fontoura, and R. Bianchini, "Prediction-based power oversubscription in cloud platforms," 2020.
[28] T. Evans, "The different technologies for cooling data centers," http://www.apcmedia.com/salestools/VAVR-5UDTU5/VAVR-5UDTU5_R2_EN.pdf.
[29] Google, "Heat containment," http://www.google.com/about/datacenters/efficiency/external/.
[30] D. L. Moss, "Dynamic control optimizes facility airflow delivery," Dell White Paper, March 2012.
[31] Q. Tang, S. K. S. Gupta, and G. Varsamopoulos, "Thermal-aware task scheduling for data centers through minimizing heat recirculation," in CLUSTER, 2007.
[32] S. V. Patankar, "Airflow and cooling in a data center," Journal of Heat Transfer, vol. 132, p. 073001, July 2010.
[33] R. A. Steinbrecher and R. Schmidt, "Data center environments: Ashrae's evolving thermal guidelines," ASHRAE Technical Feature, pp. 42–49, December 2011.
[34] Dell, "Integrated dell remote access controller 9 (iDRAC9) version 3.00.00.00."
[35] 365DataCenters, "Master services agreement," http://www.365datacenters.com/master-services-agreement/.
[36] R. A. Bridges, N. Imam, and T. M. Mintz, "Understanding gpu power: A survey of profiling, modeling, and simulation methods," ACM Comput. Surv., vol. 49, pp. 41:1–41:27, Sept. 2016.
[37] Keysight Technology, "Learn to connect power supplies in parallel for higher current output," https://www.keysight.com/main/editorial.jspx?cc=US&lc=eng&ckey=520808&nid=-11143.0.00&id=520808.
[38] S. Govindan, D. Wang, A. Sivasubramaniam, and B. Urgaonkar, "Aggressive datacenter power provisioning with batteries," ACM Trans. Comput. Syst., vol. 31, pp. 2:1–2:31, Feb. 2013.
[39] D. Wang, C. Ren, and A. Sivasubramaniam, "Virtualizing power distribution in datacenters," in ISCA, 2013.
[40] L. Liu, C. Li, H. Sun, Y. Hu, J. Gu, T. Li, J. Xin, and N. Zheng, "Heb: Deploying and managing hybrid energy buffers for improving datacenter efficiency and economy," in ISCA, 2015.
[41] P. Lin, S. Zhang, and J. VanGilder, "Data center temperature rise during a cooling system outage," APC White Paper 179, 2014.
[42] Intel, "Intel cloud builders guide to power management in cloud design and deployment using Supermicro platforms and NMView management software," 2013.
[43] L. Ramos and R. Bianchini, "C-Oracle: Predictive thermal management for data centers," in HPCA, 2008.
[44] L. Zhang, S. Ren, C. Wu, and Z. Li, "A truthful incentive mechanism for emergency demand response in colocation data centers," in INFOCOM, 2015.
[45] D. Wang, S. Govindan, A. Sivasubramaniam, A. Kansal, J. Liu, and B. Khessib, "Underprovisioning backup power infrastructure for datacenters," in ASPLOS, 2014.
[46] S. Baidya, Y.-J. Ku, H. Zhao, J. Zhao, and S. Dey, "Vehicular and edge computing for emerging connected and autonomous vehicle applications," in DAC, 2020.
[47] "Calb 100 ah se series lithium iron phosphate battery," https://www.evwest.com/catalog/product_info.php?products_id=51.
[48] "PowerEdge R740xd rack server," https://www.dell.com/en-us/work/shop/povw/poweredge-r740xd.
[49] L. Brochard, V. Kamath, J. Corbalán, S. Holland, W. Mittelbach, and M. Ott, Energy-Efficient Computing and Data Centers. John Wiley & Sons, 2019.
[50] K. Lee, N. Klingensmith, S. Banerjee, and Y. Kim, "VoltKey: Continuous secret key generation based on power line noise for zero-involvement pairing and authentication," Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 3, Sept. 2019.
[51] X. Gao, Z. Gu, M. Kayaalp, D. Pendarakis, and H. Wang, "ContainerLeaks: Emerging security threats of information leakages in container clouds," in DSN, 2017.
[52] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
[53] J. Xu, L. Chen, and S. Ren, "Online learning for offloading and autoscaling in energy harvesting mobile edge computing," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 3, pp. 361–373, 2017.
[54] Google, "Google's Data Center Efficiency," http://www.google.com/about/datacenters/.
[55] Vertiv, "SmartMod modular data center infrastructure."
[56] X. Wang, X. Wang, G. Xing, and C.-X. Lin, "Leveraging thermal dynamics in sensor placement for overheating server component detection," in IGCC, 2012.
[57] NVidia, https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3080/.
[58] D. G. Feitelson, D. Tsafrir, and D. Krakov, "Experience with using the parallel workloads archive," Journal of Parallel and Distributed Computing, vol. 74, no. 10, pp. 2967–2982, 2014.
[59] Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload/.
[60] X. Fan, W.-D. Weber, and L. A. Barroso, "Power provisioning for a warehouse-sized computer," in ISCA, 2007.
[61] M. E. Haque, Y. h. Eom, Y. He, S. Elnikety, R. Bianchini, and K. S. McKinley, "Few-to-many: Incremental parallelism for reducing tail latency in interactive services," in ASPLOS, 2015.
[62] E. Even-Dar and Y. Mansour, "Learning rates for Q-learning," Journal of Machine Learning Research, vol. 5, pp. 1–25, 2003.
[63] L. He, E. Kim, and K. G. Shin, "*aware charging of lithium-ion battery cells," in ICCPS, 2016.
[64] P. X. Gao, A. R. Curtis, B. Wong, and S. Keshav, "It's not easy being green," SIGCOMM Comput. Commun. Rev., 2012.
[65] I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, "ApproxHadoop: Bringing approximations to MapReduce frameworks," in ASPLOS, 2015.
[66] Internap, "Colocation services and SLA," http://www.internap.com/internap/wp-content/uploads/2014/06/Attachment-3-Colocation-Services-SLA.pdf.
[67] Equinix, "Colocation services and SLA," https://enterprise.verizon.com/service_guide/reg/cp_colocation_equinix_data_centers_sla.pdf.
[68] M. A. Islam, H. Mahmud, S. Ren, and X. Wang, "Paying to save: Reducing cost of colocation data center via rewards," in HPCA, 2015.
[69] S. Yu, Y. Tian, S. Guo, and D. O. Wu, "Can we beat DDoS attacks in clouds?," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 2245–2254, September 2014.
[70] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Cross-VM side channels and their use to extract private keys," in CCS, 2012.
[71] S. Skorobogatov, "Local heating attacks on flash memory devices," in Workshop on Hardware-Oriented Security and Trust, 2009.
[72] C. Li, Z. Wang, X. Hou, H. Chen, X. Liang, and M. Guo, "Power attack defense: Securing battery-backed data centers," in ISCA, 2016.
[73] L. Yang, X. Chen, J. Zhang, and H. V. Poor, "Optimal privacy-preserving energy management for smart meters," in INFOCOM, 2014.
[74] "CloudSuite - The Search Benchmark," http://cloudsuite.ch/.

APPENDIX A
PROTOTYPE DEMONSTRATION OF THERMAL ATTACKS

To see the impact of thermal attacks, we run experiments on a rack of 14 Dell PowerEdge servers in a scaled environment with hot-cold aisles to mimic an edge colocation. The cooling system can support a cooling load of up to 3kW. We inject an additional 1.5kW load to overload the cooling system and measure the server inlet temperature. As shown in Fig. 14(a), the inlet temperature rises to nearly 40°C within minutes. Our experiment, albeit on a small scale, demonstrates the rapid increase of server inlet temperature due to an overloaded cooling system. This is also corroborated by other studies that demonstrate rapid temperature rises in data centers due to cooling malfunction [41]. We follow the ASHRAE safety limit and do not further overload our system [33].

Fig. 14. Experiment in our server rack. (a) Server inlet temperature increases due to a cooling capacity overload by 1.5kW. (b) Latency performance is compromised due to server power capping for handling an emergency.

We implement the CloudSuite Web Service benchmark [74] on a set of 4 servers with a workload of 600 requests/s and show the impact of power capping on the 95th-percentile response time, which is the key performance metric [61]. An x-percentile response time means that x% of the requests have a latency less than this response time. For illustration, we throttle the CPU speed to cap the total server power to 60% of the peak power. We see from Fig. 14(b) that during the emergency, the response time jumps nearly four times to 400ms.

Fig. 15. Performance degradation due to power capping.

We also extend our experiments to the Web Search implementation from CloudSuite [74]. We show the 95th-percentile response time normalized to the service level agreement (100ms) for two different numbers of users for Web Service in Fig. 15(a) and two different request rates for Web Search in Fig. 15(b), respectively. The server power consumption is normalized to the peak. We see that when the server power consumption decreases, the response times for both applications increase for any given workload level. This reveals the degree of performance degradation faced by tenants when they reduce their power consumption while the workload remains unchanged.
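The inlet-temperature transient observed in the appendix can be roughed out with a first-order lumped-capacitance model, in which all excess heat beyond the cooling capacity accumulates in the room's thermal mass. A minimal sketch; the thermal-mass value is purely illustrative and is not a measurement from our testbed:

```python
def inlet_temp_c(t_min, overload_kw, thermal_mass_kj_per_c, t0_c=25.0):
    """First-order estimate of inlet temperature under a cooling overload:
    excess heat (kW x seconds = kJ) divided by the room's lumped thermal
    mass (kJ/degC) gives the temperature rise above the starting point."""
    return t0_c + overload_kw * (t_min * 60.0) / thermal_mass_kj_per_c

# Illustrative numbers only: a 1.5 kW overload against an assumed 90 kJ/degC
# thermal mass reaches about 40 degC from a 25 degC start within 15 minutes,
# i.e., a rise on the same "minutes" timescale as the experiment.
print(round(inlet_temp_c(15, 1.5, 90.0), 1))  # 40.0
```

In practice the rise flattens as heat leaks through walls and ducts, so this linear model only bounds the early transient.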
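The x-percentile response time used as the performance metric above has a direct empirical computation over recorded latencies. A short sketch with hypothetical sample data (not traces from our benchmark runs):

```python
def percentile_response_time(latencies_ms, x):
    """x-percentile response time: the smallest recorded latency such that
    x% of the requests have latency at or below it (x is an integer percent)."""
    ordered = sorted(latencies_ms)
    # 0-indexed position of ceil(x% of n), computed in integer arithmetic
    k = max(0, (x * len(ordered) + 99) // 100 - 1)
    return ordered[k]

# hypothetical sample: 95 fast requests and 5 slow stragglers
samples = [100] * 95 + [400] * 5
print(percentile_response_time(samples, 95))  # 100
print(percentile_response_time(samples, 99))  # 400
```

The tail (95th/99th) is reported rather than the mean because a few stragglers dominate user-perceived latency in interactive services [61].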
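The power capping by CPU throttling used in the experiment can be viewed as a simple control loop that steps the DVFS frequency down until server power falls under the cap. The sketch below substitutes a toy linear power model for a real sensor; the frequency levels and wattages are hypothetical, not measurements from our servers:

```python
def cap_power(power_of_freq, freq_levels, peak_w, cap_frac=0.6, max_steps=20):
    """Step the CPU frequency down one DVFS level at a time until the modeled
    server power falls under cap_frac * peak_w (or the lowest level is hit)."""
    cap = cap_frac * peak_w
    i = len(freq_levels) - 1  # start at the highest frequency
    for _ in range(max_steps):
        if power_of_freq(freq_levels[i]) > cap and i > 0:
            i -= 1  # throttle one step down
        else:
            break
    return freq_levels[i]

# toy linear model: 150 W idle + 0.1 W per MHz (illustrative numbers)
model = lambda f: 150 + 0.1 * f
levels = [1200, 1600, 2000, 2400, 2800, 3200]
peak = model(3200)                    # 470 W, cap = 282 W at 60%
f = cap_power(model, levels, peak)
print(f, model(f))                    # 1200 270.0
```

A real controller would close the loop on a power meter reading rather than a model, and the resulting frequency drop is exactly what inflates the tail response times in Figs. 14(b) and 15.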