EEDCS Tutorial – IISWC 2011
Austin, Texas
6 November 2011
Sources: 1. Koomey, “Worldwide Electricity Used in Data Centers”, Environmental Research Letters, 2008
2. Report to Congress on Server and Data Center Energy Efficiency, U.S. Environmental Protection Agency, 2007
Schedule
10:00 AM BREAK
12:00 PM LUNCH
3:15 PM BREAK
4:30 PM END
[Figure: Data center energy cascade – of roughly 100 units of power entering the facility, only about 33-35 units are delivered to the processors; the rest is lost in HVAC, UPS, and power distribution (roughly a 45%/55% split between facility overhead and IT) and, within the server, in power supplies, fans, memory, and other components versus the processor (roughly 70%/30%). Server resources are up to 95% idle, used at 5-20% on average for application load.]
[Figure: ASHRAE recommended temperature and humidity envelope, with expanded allowable classes reaching 27ºC, 35ºC, 40ºC, and 45ºC.]
Source: 2011 Thermal Guidelines for Data Processing Environments – Expanded Data Center Classes and Usage Guidance, American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE), 2011.
Wasting provisioned resources
-- Stranded power
Using nameplate power (worst-case) to allocate power on circuit breakers
– However, real workloads do not use that much power
– Result: available power is stranded and cannot be used
Stranded power is a problem at all levels of the data center
Example: IBM HS20 blade server – nameplate power is 56 W above the maximum measured for real workloads (see figure below).
[Figure: Measured server power (W) for the IBM HS20 blade across SPEC CPU2000 benchmarks (one and two copies), SPECjbb, LINPACK, and idle. Nameplate power is 308 W; the real-workload maximum is well below it, leaving 56 W of stranded power. Workloads: SPEC CPU, SPEC JBB, LINPACK.]
Power supply still a vexation for the NSA
"The spy agency has delayed the deployment of some new data-processing equipment because it is short on power and space. Outages have shut down some offices in NSA headquarters for up to half a day… Some of the rooms that house the NSA's enormous computer systems were not designed to handle newer computers that generate considerably more heat and draw far more electricity than their predecessors."
-- Baltimore Sun, June 2007
Snafus forced Twitter datacenter move
"A new, custom-built facility in Utah meant to house computers that power the popular messaging service by the end of 2010 has been plagued with everything from leaky roofs to insufficient power capacity, people familiar with the plans told Reuters."
-- Reuters, April 1, 2011
4. Power is a first-class design constraint for servers
Power and performance are top design parameters for microprocessors and servers
Power constraints exist across all classes of computing equipment
– Laptop (30 – 90 W)
– Desktop (100s W)
– Server (200W - 5kW)
– Data center (1-20 MW)
Server components
– CPU socket power (air cooling limits and fan noise considerations)
– Memory power
– Electrical cord limits
– Power supply limits (physical size, reuse of standard designs)
¾ capital costs, ¼ operating costs
[Figure: Power Usage Effectiveness (PUE) for about 20 measured data centers, ranging from roughly 1.4 to 2.8 – typical existing data centers are around 1.7, the best new data centers in 2010 are below 1.2, and the best possible PUE is 1.0.]
Source: Tschudi et al., "Measuring and Managing Energy Use in Data Centers." HPAC Engineering, LBNL/PUB-945, 2005.
Rewards inefficiency in the server (e.g. poor AC/DC conversion and fan power are counted as IT energy)
PUE is insufficient for "proving" and managing energy efficiency
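As an illustration of that limitation, here is a minimal sketch (not from the tutorial; the energy figures are made up) that computes PUE for a hypothetical facility before and after its servers are made less efficient. Because the extra server losses land in the IT-energy denominator, the reported PUE improves even though total energy rises.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power Usage Effectiveness = total facility energy / IT equipment energy."""
    return total_facility_kwh / it_equipment_kwh

# Hypothetical facility: 600 MWh of cooling/distribution overhead on 1000 MWh of IT load.
overhead_kwh = 600_000.0
it_kwh = 1_000_000.0
print(f"Baseline PUE: {pue(overhead_kwh + it_kwh, it_kwh):.2f}")         # 1.60

# Same facility after swapping in servers with worse AC/DC conversion and fans:
# IT energy grows by 100 MWh of pure loss while the overhead stays the same.
wasteful_it_kwh = it_kwh + 100_000.0
print(f"PUE with less efficient servers: "
      f"{pue(overhead_kwh + wasteful_it_kwh, wasteful_it_kwh):.2f}")     # 1.55 -- looks 'better'
```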
Energy-efficiency for High Performance Computing
– Large clusters are costly to operate (ASC Purple @ 4.5 MW, $0.12/kWh ≈ $4.7M/year; a small cost sketch follows the table below)
– Use accelerators or GPUs
– Site must be designed to supply power
Green500 list reorders the Supercomputing TOP500 list for energy-efficiency
– Metric: LINPACK performance / power for the computer
– Does not include computer room cooling
[Figure: Efficiency (MFLOPS/Watt), up to about 2500, plotted against Green500 rank 1-500.]
Green500 Rank | MFLOPS/W | Site | Computer | Power (kW) | TOP500 Rank
1 | 2097 | IBM – Watson Research Center | NNSA/SC Blue Gene/Q Prototype 2 | 41.0 | 109
2 | 1684 | IBM – Watson Research Center | NNSA/SC Blue Gene/Q Prototype 1 | 38.8 | 165
3 | 1375 | Nagasaki U. – self-made | Intel i5, ATI Radeon GPU, Infiniband QDR | 34.24 | 430
4 | 958 | GSIC Center, Tokyo Institute of Technology | HP ProLiant SL390s G7, Xeon 6C X5670, Nvidia GPU | 1243.8 | 5
5 | 891 | CINECA/SCS – SuperComputing Solution | IBM iDataPlex DX360M3, Xeon 2.4, nVidia GPU, Infiniband | 160 | 54
CUE (Carbon Usage Effectiveness)
– CUE = total CO2 emissions caused by the total data center energy / IT equipment energy
– Measures sustainability
– Used in addition to PUE
Lack of realism
– Benchmarks do not include network and remote storage loads
• SAP System Power benchmark will include network and storage
– No task switching
– Very strong affinity
[Figure: example deployment – a secured vault and a network operating center, with fiber connectivity terminating on a frame relay switch.]
(1) http://hightech.lbl.gov/DCTraining/graphics/ups-efficiency.html
(2) N. Rasmussen. “Electrical Efficiency Modeling for Data Centers”, APC White Paper, 2007
(3) http://hightech.lbl.gov/documents/PS/Sample_Server_PSTest.pdf
(4) “ENERGY STAR® Server Specification Discussion Document”, October 31, 2007.
(5) IBM internal sources
Cooling infrastructure for a typical large data center
Condensation water loop (with condensation water pump)
– Usually ends in a cooling tower
– Needed to remove heat out of the facility
Racks
– Arranged in a hot-aisle / cold-aisle configuration
Computer room air conditioning (CRAC) units
– Located in the raised-floor room or right outside of it
– Blower moves air across the raised floor and across the cooling element
– Most common type in large data centers uses chilled water (CW) from the facilities plant
– Adjusts water flow to maintain a constant return temperature
– Often a subset of CRACs on the raised floor also controls humidity
Hot-aisle containment
Purpose
– Reduce localized hotspots
– Allow higher power density in older facilities
– Optimize cooling by rack
– No raised floor required
Implementation
– Self-contained air cooling solution (water or glycol for taking heat from the air)
– Air movement
Types
– Enclosures – create a cool microclimate for selected 'problem' equipment (example: Liebert XDF™ Enclosure (1))
– Sidecar heat exchanger – to address rack-level hotspots without increasing HVAC load (example: APC InfraStruXure InRow RP (2))
(1) "Liebert XDF™ High Heat-Density Enclosure with Integrated Cooling", http://www.liebert.com/product_pages/ProductDocumentation.aspx?id=40
(2) "APC InfraStruXure InRow RP Chilled Water", http://www.apcc.com/resource/include/techspec_index.cfm?base_sku=ACRP501
Source: Chris Page, "Air & Water Economization & Alternative Cooling Solutions – Customer Presented Case Studies", Data Center Efficiency Summit, 2010
Open Compute Project
– Data center
• Electrical
• Mechanical
• Racks
• Battery cabinet
– Server
• Chassis
• Motherboard
• Power supply
http://opencompute.org
[Figure: Server peak power by hardware component from a Google data center (2007) – CPUs 33%, DRAM 30%, Disks 10%, Networking 5%, other 22%.]
Source: Luiz André Barroso and Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool, 2009.
[Figure: IBM POWER7 server internals – airflow path, 8 removable disks, POWER7 processor, DDR3 DIMMs with DIMM buffers, processor card slots, and connector.]
Address variability in hardware and operating environment
Complex environment
– Installed component count, ambient temperature, component variability, etc.
– How to guarantee power management constraints across all possibilities?
Feedback-driven control
– Capability to adapt to environment, workload, varying user requirements
– Regulate to desired constraints even with imperfect information
Models
– Estimate unmeasured quantities
– Predict impact of actuators
Actuate
– Set performance state (e.g. frequency)
– Set low-power modes (e.g. DRAM power-down)
– Set fan speeds
Sensors: temperature
On-chip/-component sensors
– Measure temperatures at specific locations on the processor or in specific units
– Need rapid response times to feed faster actuations, e.g. clock throttling.
– Proprietary interfaces with on-chip control and standard interfaces for off-chip control.
– Example: POWER7 processor has 44 digital thermal sensors per chip
– Example: Nehalem EX has 9 digital thermal sensors per chip
– Example: DDR3 specification has thermal sensor on each DIMM
AC power
– External components – Intelligent PDU, SmartWatt
– Intelligent power supplies – PMBus standard
– Instrumented power supplies (example: IBM DPI PDU+)
DC power
– Most laptops – battery discharge rate
– IBM Active Energy Manager – system power
– Measure at VRM
Within a chip (core-level)
– Power proxy (model using performance counters)
Sensor must suit the application:
– Access rate (second, ms, us)
– Accuracy
– Precision
– Accessibility (I2C, Ethernet)
‘Performance’ Counters
– Traditionally part of processor performance monitoring unit
– Can track microarchitecture and system activity of all kinds
– Provide fast feedback on activity, and have also been shown to serve as potential proxies for power and even temperature (see the sketch below)
– Example: Instructions fetched per cycle
– Example: Non-halted cycles
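As an illustration of the power-proxy idea, here is a minimal sketch (my own, assuming a calibration set of counter readings with matched power measurements is available): it fits a linear model by least squares and then estimates power from new counter samples without a power sensor.

```python
import numpy as np

# Calibration data (hypothetical): per-interval counter rates and measured power (W).
# Columns: instructions fetched per cycle, non-halted cycles (billions), memory accesses (millions)
counters = np.array([
    [0.4, 1.1,  20.0],
    [1.2, 2.3,  85.0],
    [1.8, 2.4, 140.0],
    [0.1, 0.3,   5.0],
])
measured_power = np.array([38.0, 61.0, 74.0, 25.0])

# Fit power ~= w0 + w1*c1 + w2*c2 + w3*c3 with ordinary least squares.
X = np.hstack([np.ones((counters.shape[0], 1)), counters])
weights, *_ = np.linalg.lstsq(X, measured_power, rcond=None)

def estimate_power(sample):
    """Estimate power (W) for one interval of counter readings."""
    return float(weights[0] + np.dot(weights[1:], sample))

print(estimate_power([1.0, 2.0, 70.0]))  # proxy estimate, no power sensor needed
```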
[Figure: Normalized socket power versus normalized socket frequency, both from 0% to 100%.]
[Figure: Race-to-idle versus just-in-time – power over time for tasks A and B with deadlines; race-to-idle runs each task at full power and idles until the deadline, while just-in-time stretches each task to finish right at its deadline at lower power.]
Actuators: DRAM
Source: Kenneth Wright (IBM) et al., "Emerging Challenges in Memory System Design", tutorial, 17th IEEE International Symposium on High Performance Computer Architecture, 2011.
Thermal constraints
Source: Kenneth Wright (IBM) et al., "Emerging Challenges in Memory System Design", tutorial, 17th IEEE International Symposium on High Performance Computer Architecture, 2011.
Power variability
Source: Karthick Rajamani, IBM
Power can vary due to manufacturing variation of the same part (e.g. processor leakage power)
Power consumption is different across vendors
[Figure: Memory power specifications for DDR2 parts with identical performance specifications – normalized current/power for Max Active (Idd7) and Max Idle (Idd3N) across vendors, showing a 2X difference in active power and a 1.5X difference in idle power.]
Today, it is still not uncommon for idle servers to consume 50% of their peak power.
– Many components (memory, disks) do not have a wide range for active states.
– Power supplies are not highly efficient at every utilization level.
Energy efficiency = utilization / power
[Figure 2: Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work.]
[Figure 1: Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum utilization levels. Annotation: averages found outside the cloud.]
Virtualization – opportunities for power reduction
[Figure: Precision power capping – measured power is compared against the desired power consumption, and a PI controller sets the throttle level (frequency) of the component. The trace shows frequency settling between about 3000 and 4000 over 1000-1400 ms with only a small undershoot. Lefurgy et al., ICAC 2007.]
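The feedback structure in the figure reduces to a few lines of control code. This is a generic proportional-integral power capper of my own, not the published controller from Lefurgy et al.: it reads server power each interval and adjusts the frequency setting to hold a power budget; read_power_w and set_frequency_mhz are hypothetical platform hooks.

```python
import time

def power_cap_loop(read_power_w, set_frequency_mhz, budget_w,
                   f_min=1600.0, f_max=3800.0, kp=8.0, ki=2.0, dt=0.1):
    """Simple PI power capper: drive measured power toward the budget by adjusting frequency."""
    integral = 0.0
    while True:
        error = budget_w - read_power_w()            # >0: headroom, <0: over budget
        integral += error * dt
        freq = f_min + kp * error + ki * integral    # positional PI form
        set_frequency_mhz(max(f_min, min(f_max, freq)))
        time.sleep(dt)                               # control interval
```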
Example: power shifting across CPU and memory for a PPC 970 computer
– Points show execution intervals of many workloads with no limit on power budget
– "Static" encloses unthrottled intervals for a 40 W budget split statically: 27 W CPU, 13 W memory
– "Dynamic" encloses unthrottled intervals for a 40 W budget with power shifting
• Better performance than the static design for 40 W
• Lower cost than the conventional 78 W power supply design
[Figure: CPU power (W, 0-60, with Maximum Expected CPU Power and the conventional 78 W CPU budget marked) versus memory power (W, 0-60, with Maximum Expected Memory Power marked), showing the Static and Dynamic regions. A minimal sketch of a shifting policy follows the opportunities list below.]
Opportunities
– Intra-chip: shift between function units, between cores, and between cores and caches
– Intra-node: shift between processors, between processors and DRAM, and between leakage and fans
– Intra-rack: shift between nodes, between storage and compute, and with disaggregated DRAM
– Intra-data center: cross-node optimization (placement, migration, consolidation)
– Across data centers: time shifting, power arbitrage, enhanced reliability
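To illustrate the shifting idea, here is a minimal sketch of my own (not the PPC 970 implementation) that re-divides a fixed node budget between CPU and memory each interval in proportion to recent demand, instead of a fixed 27 W / 13 W split:

```python
def shift_budget(total_budget_w, cpu_demand_w, mem_demand_w,
                 cpu_floor_w=10.0, mem_floor_w=5.0):
    """Split a node power budget between CPU and memory in proportion to demand.

    Demands are the recent unthrottled power draws of each subsystem (hypothetical inputs
    from power sensors or proxies). Floors keep each subsystem minimally functional.
    """
    shiftable = total_budget_w - cpu_floor_w - mem_floor_w
    demand = max(cpu_demand_w + mem_demand_w, 1e-6)
    cpu_share = cpu_floor_w + shiftable * (cpu_demand_w / demand)
    mem_share = mem_floor_w + shiftable * (mem_demand_w / demand)
    return cpu_share, mem_share

# Memory-heavy interval under a 40 W node budget: a static 27/13 split would throttle memory.
print(shift_budget(40.0, cpu_demand_w=15.0, mem_demand_w=22.0))  # ~(20.1, 19.9)
```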
Problem: Many enterprise data centers spend upwards of 40% of their IT power on storage
– SAS (15K, FC, …) drives are fastest,
but highest power and cost
– Optimizing performance can drive low
resource utilization (spread data across many spindles)
– Other parts of the data center are becoming
more energy proportional but storage is not
– Optimizing for better energy efficiency can potentially
reduce performance
Standards bodies are including power and energy metrics along with performance
– Storage Performance Council (SPC)
– SNIA
– EPA
Opportunities:
– Move away from high-cost, high-power enterprise SAS/FC drives
– Consolidation (fewer spinning disks = less energy, but less throughput)
– Hybrid Configurations (Tiering/Caching): Replace power-hungry SAS with SATA (for capacity) and
flash/PCM (for IOPS)
• SATA consumes ~60% less energy per byte
• Flash can deliver over 10X SAS performance for random accesses
• Flash has lower active energy and enables replacing SAS with SATA and spindown (by absorbing I/O activity)
• Issue: what data should be placed in which storage technology, and when? (A placement sketch follows this list.)
– Opportunistic spindown
– Write offloading
– Deduplication/Compression
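On the placement question above, here is a minimal heat-based tiering sketch (my own illustration, not a product algorithm): blocks whose recent I/O rate is high are promoted to flash up to its capacity, and everything else stays on SATA, which then becomes a candidate for spindown.

```python
def place_blocks(block_iops, flash_capacity_blocks, promote_iops=50.0):
    """Assign each block to 'flash' or 'sata' based on recent I/O heat.

    block_iops: dict mapping block id -> recent I/Os per second (hypothetical telemetry).
    Hot blocks are promoted until flash capacity runs out; everything else goes to SATA.
    """
    hot = sorted((b for b, rate in block_iops.items() if rate >= promote_iops),
                 key=lambda b: block_iops[b], reverse=True)
    flash_set = set(hot[:flash_capacity_blocks])
    return {block: ("flash" if block in flash_set else "sata") for block in block_iops}

print(place_blocks({"a": 500.0, "b": 3.0, "c": 120.0, "d": 0.1}, flash_capacity_blocks=1))
# {'a': 'flash', 'b': 'sata', 'c': 'sata', 'd': 'sata'}
```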
[Figure: storage system requirements – availability, low latency, reliability, throughput, redundancy, capacity, accessibility, fault tolerance, and feature richness (extensible management options via simple software).]
Yesterday
– Excessive power: no spin-down; fans and controllers
– Costly: 15K RPM SAS; wasted capacity
– Wasted capacity: RAID configuration; short-stroking
Today
– Reduced power: aggressive spindown; system power management; flash to absorb I/O
– Reduced cost: SATA + SSD; storage virtualization
– Increased capacity: dense SATA drives; no short-stroking; deduplication
[Figure: IT equipment power breakdown – 1U/2U+ compute servers 42%, storage 34%, communications 23%; server components – CPUs 31%, CPU VR 8%, memory 6%, HDD/RMSD 13%, PSU loss 20%, misc 22%.]
Source: http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
Storage Power Capping
Empirical experiments: 8-disk RAID-6 SAS array vs. 8-disk RAID-6 SATA array + SSD cache
Some environments/workloads are already tuned with the right spindown approach
– e.g., backup/archive (MAID)
Some environments/workloads cannot tolerate any multi-second latency from a spinup delay
– e.g., OLTP
Spindown is appropriate for medium-duty workload environments
– e.g., email, virtualization, filers
Caution: do not exceed the drive's rated spindown cycles
Thin-provisioning
– Software tools to report and advise on data management/usage for energy- and capital-
conserving provisioning
Deduplication
– Either at the file or block level
– Ensures only one copy of data is stored on disk (duplicate copies are turned into pointers to the original; a minimal sketch follows this list)
Storage virtualization
– Allows for more storage systems to be hidden behind and controlled by a central
controller that can more efficiently manage the different storage systems
– Abstract physical devices to allow for more functionality
– Enables powerful volume management
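As a sketch of the block-level variant described above (my own toy illustration, not a particular product's implementation), a deduplicating store hashes each block and keeps only one physical copy, turning duplicates into pointers:

```python
import hashlib

class DedupStore:
    """Toy block-level deduplication: identical blocks are stored once and shared by reference."""
    def __init__(self):
        self.blocks = {}      # content hash -> block data (single physical copy)
        self.refcount = {}    # content hash -> number of logical references

    def write(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:
            self.blocks[key] = data            # first copy is actually stored
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key                             # logical pointer to the stored block

    def read(self, key: str) -> bytes:
        return self.blocks[key]

store = DedupStore()
k1 = store.write(b"backup image chunk")
k2 = store.write(b"backup image chunk")        # duplicate: no new physical block
assert k1 == k2 and len(store.blocks) == 1
```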
Data is growing and the storage systems to satisfy the capacity demand are not energy-
proportional
As the rest of the data center becomes more energy-efficient, storage energy consumption is becoming dominant
Storage Power Capping and Energy Saving Techniques
– Hybrid storage architectures
Metrics and benchmarks adopted by the industry are beginning to drive attention to and focus on this area
References:
– Dennis Colarelli and Dirk Grunwald. Massive Arrays of Idle Disks for Storage Archives. In Proceedings of the 2002 ACM/IEEE International Conference on Supercomputing, pages 1–11, 2002.
– Charles Weddle, Mathew Oldham, Jin Qian, An-I Andy Wang, Peter Reiher, and Geoff Kuenning. PARAID: A Gear-Shifting Power-Aware RAID. ACM Transactions on Storage, 3(3):33, October 2007.
– D. Chen, G. Goldberg, R. Kahn, R. I. Kat, K. Meth, and D. Sotnikov. Leveraging Disk Drive Acoustic Modes for Power Management. In Proceedings of the 26th IEEE Conference on Mass Storage Systems and Technologies (MSST), 2010.
– Wes Felter, Anthony Hylick, and John Carter. Reliability-Aware Energy Management for Hybrid Storage Systems. In Proceedings of the 27th IEEE Symposium on Massive Storage Systems and Technologies (MSST), 2011.
Source: HP
Source: Priya Mahadevan, Sujata Banerjee, Puneet Sharma: Energy Proportionality of an Enterprise Network. Green Networking
Workshop 2010
Switch Power Example
Low-power idle (LPI) allows PHY Tx side to shut off when no packet is being transmitted
– Saves ~400mW per 1 Gbps port
Rx side remains on continuously
Makes Ethernet more energy-proportional
Cloud: computing infrastructure designed for dynamic provisioning of resources for computing tasks.
"Cloud will grow from a $3.8 billion opportunity in 2010, representing over 600,000 units, to a $6.4 billion
market in 2014, with over 1.3 million units.”. In Worldwide Enterprise Server Cloud Computing 2010–
2014 Forecast Abstract (IDC Market Analysis Doc # 223118), Apr 2010.
“cloud computing to reduce data centers energy consumption from 201.8TWh of electricity in 2010 to
139.8 TWh in 2020, a reduction of 31%”, in Pike Research Report on “Cloud Computing Energy
Efficiency”, as reported in Clean Technology Business Review, December, 2010.
Efficiency from Cloud Model for Computing
Better utilization of systems drives increased efficiency
– Increased sharing of resources – lower instance of unused resources.
– Less variability in aggregate load for larger population of workloads – better sizing of
infrastructure to total load.
Computing on a large scale saves materials and energy
– A study shows savings through less material for larger cooling and UPS units
– Similar savings are also possible in IT equipment
Economies of scale fund newer technologies
– Favor exploitation of newer (riskier), cheaper cooling technologies because of scaled up
benefits.
– Favor re-design of IT equipment with greater modularity, homogeneity with efficiency as
a driving concern.
[Figure: Workloads W1-W4 running in virtual machines (VMs) across a network, spread over two servers with both Server 1 and Server 2 powered on.]
Dynamic Consolidation with Live Migration
[Figure: Applications App 0 through App 5 consolidated into virtual machines (VMs) over the network, allowing Servers 1-5 to be turned off.]
*Study conducted by Wael El-essawy, Karthick Rajamani, Juan Rubio, Tom Keller
Assume a linear power model for each server between:
– Pmax (power at 100% utilization)
– Pidle (idle power)
Pmax:
– Determined by the server derated power
– Represents how much power is allocated
– Usually less than nameplate power
– Assumes maximum configuration
Pidle:
– Determined by the server age and model
– Assumes no DVFS
[Figure: Relative power (0.5-1.0) versus utilization (0-100%) for several server classes – Old-large, Old-medium, Old-small, New-large, New-medium, New-small, New-blade.]
A minimal sketch of the model follows below.
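Here is the linear model as code (my own sketch; the class parameters are made-up placeholders, not the study's values), together with the consolidation arithmetic it enables:

```python
def server_power(utilization, p_idle_w, p_max_w):
    """Linear server power model: P(u) = Pidle + (Pmax - Pidle) * u, with u in [0, 1]."""
    u = min(max(utilization, 0.0), 1.0)
    return p_idle_w + (p_max_w - p_idle_w) * u

# Hypothetical parameters for two server classes (illustrative only).
old_large = dict(p_idle_w=220.0, p_max_w=400.0)
new_blade = dict(p_idle_w=90.0, p_max_w=300.0)

# One consolidated server at 60% utilization beats six old servers idling at 10% each.
print(6 * server_power(0.10, **old_large))   # 1428.0 W
print(1 * server_power(0.60, **new_blade))   # 216.0 W
```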
Case Study: Consolidation Results
Data center servers are mostly underutilized
– Input utilization: average = 8.24%
– 76% of the servers are less than 10% utilized
Cluster-level consolidation significantly raised server utilization
– Average consolidated utilization = 34.6%
[Figure: Input utilization histogram (percentage of servers per utilization bin, 0-100%, average 8.24%; many servers are off or near idle) and consolidated utilization histogram (average 34.6%), plus group power (kVA) plotted per cluster ID, roughly 1-550.]
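A consolidation pass like the one in this case study can be sketched as first-fit-decreasing bin packing on utilization (my own illustration; the study's actual placement policy and headroom rules are not described in the slides):

```python
def consolidate(server_utilizations, target_utilization=0.6):
    """Pack workloads (given as per-server utilization fractions) onto as few hosts as possible.

    First-fit decreasing: sort workloads by demand, place each on the first host with room,
    keeping every host at or below the target utilization. Returns the list of host loads.
    """
    hosts = []
    for demand in sorted(server_utilizations, reverse=True):
        for i, load in enumerate(hosts):
            if load + demand <= target_utilization:
                hosts[i] = load + demand
                break
        else:
            hosts.append(demand)    # no existing host fits: power on another
    return hosts

# Ten lightly loaded servers (average ~8% utilization) collapse onto two hosts.
loads = [0.05, 0.10, 0.08, 0.03, 0.12, 0.07, 0.09, 0.06, 0.11, 0.09]
print(consolidate(loads))           # [0.59, 0.21]
```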
Thermal Management
– Consolidation can increase the diversity in data center thermal distribution
– Thermal-aware consolidation/task placement strategies to mitigate thermal impact of
consolidation.
– Modular cooling infrastructure controls are an important complement to consolidation
solutions to reduce overall datacenter energy consumption.
– Integration of energy-aware task placement/consolidation with thermal management solutions can be a successful approach to full data center energy optimization.
Ideal energy proportionality is still far from reality, so continue with server
consolidation.
Clusters of servers heterogeneous in their efficiencies would continue to benefit from
energy-aware task placement/consolidation.
Cooling solutions without good tuning options can interact sub-optimally with energy-proportional hardware, requiring intelligent task consolidation/placement to improve overall datacenter efficiency.
Basic Motivation
– Charging and incentivizing customers to allow better infrastructure utilization.
– Identify inefficiencies and unanticipated consumption.
– Adapt resource provisioning and allocation with energy-usage information for more
efficient operation.
– Energy profiling of software to guide more efficient execution.
(2) LiteGreen: Saving Energy in Networked Desktops Using Virtualization. Tathagata Das, Pradeep Padala, Venkata N. Padmanabhan, Ramachandran Ramjee, Kang G. Shin. USENIX 2010.
(1) SleepServer: A Software-Only Approach for Reducing the Energy Consumption of PCs within Enterprise Environments. Yuvraj Agarwal, Stefan Savage, Rajesh Gupta. USENIX 2010.
Cloud is an attractive computing infrastructure model with rapid growth because of its on-
demand resource provisioning feature.
– Growth of cloud computing would lead to growth of large data centers. Large-scale
computing in turn enables increased energy efficiency and overall cost efficiency.
The business models around clouds incentivize the cloud provider to optimize energy efficiency, creating a big consumer for energy-efficiency research.
– The cloud’s transparent physical resource usage model facilitates sharing and
efficiency improvements through virtualization and consolidation.
– Energy-proportionality and consolidation need to co-exist to drive Cloud energy-
efficiency
– End-to-end (total DC optimization) design and operations’ optimization for efficiency
will also find a ready customer in Cloud Computing.
Efficiency optimization while guaranteeing SLAs will continue to drive research directions
in the Cloud.
Applications
Libraries
Hypervisor / Virtual Machine Monitor
[Figure: roles across the software stack – manage resource states, provision and schedule resources, and consume/utilize resources.]
The Many Roles of Software in Energy-efficient Computing
Exploiting lower energy states and lower power operating modes
– Support all hardware modes e.g. S3/S4, P-states in virtualized environments
– Detect and/or create idleness to exploit modes.
– Software stack optimizations to reduce mode entry/exit/transition overheads.
Energy-aware resource management
– Understand and exploit energy vs performance trade-offs, e.g. Just-in-time vs Race-to-idle (see the sketch after this list)
– Avoid resource waste (bloat) that leads to wasted energy
– Adopt energy-conscious resource management methods e.g. polling vs interrupt, synchronizations.
Energy-aware data management
– Understand and exploit energy vs performance trade-offs e.g. usage of compression
– Energy-aware optimizations for data layout and access methods e.g. spread data vs consolidate
disks, inner tracks vs outer tracks
– Energy-aware processing methods e.g. database query plan optimization
Energy-aware software productivity
– Understand and limit energy costs of modularity and flexibility
– Target/eliminate resource bloat in all forms
– Develop resource-conscious modular software architectures
Enabling hardware with lower energy consumption
– Parallelization to support lower power multi-core designs
– Compiler and Runtime system enhancements to help accelerator-based designs
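To make the Just-in-time vs Race-to-idle trade-off above concrete, here is a toy comparison with made-up numbers (my own illustration): both policies finish the same work before a deadline but split the period differently between high-power, low-power, and idle operation.

```python
def task_energy(active_power_w, active_time_s, idle_power_w, idle_time_s):
    """Energy in joules for one deadline period split between active and idle states."""
    return active_power_w * active_time_s + idle_power_w * idle_time_s

deadline_s = 10.0

# Race-to-idle: run at full speed (100 W) for 4 s, then idle (20 W) until the deadline.
race = task_energy(100.0, 4.0, 20.0, deadline_s - 4.0)   # 520 J

# Just-in-time: stretch the task to the deadline at reduced frequency/voltage (45 W), never idle.
jit = task_energy(45.0, deadline_s, 20.0, 0.0)           # 450 J

print(race, jit)  # which policy wins depends on idle power and how power scales with speed
```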
Source: Mondira (Mandy) Pant, Intel, presentation at GLSVLSI, May 2010
[Figure: Normalized power versus SPECpower_ssj2008 load levels, from 100% down to 0% (active idle).]
[Figure: Normalized results for the core-pool approach and the average-utilization approach at smt4, smt2, and smt1, with a minimum utilization threshold marked.]
*Source: Power-performance Management on an IBM POWER7 Server
[Figure: memory ranks (2 GB, 2 GB, 16 GB) with only part of the capacity idle.]
Large regions of memory need to be idle before a lower power mode can be used.
The higher-savings, higher-latency mode (O(µs) exit) needs even larger regions to be idle, which is often infeasible.
Granularity needs worsen with larger-capacity devices/DIMMs, i.e., the situation can be worse than shown.
Hypervisor and system software involvement to consolidate data into the fewest power domains can maximize idle opportunities (a minimal allocation sketch follows below).
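A minimal sketch of that consolidation idea (my own illustration, not an actual hypervisor policy): an allocator that fills already-active memory ranks before touching idle ones, so whole ranks stay idle and eligible for low-power states.

```python
def allocate_pages(num_pages, ranks):
    """Place pages on the fullest ranks first so untouched ranks can stay in low-power modes.

    ranks: list of dicts with 'capacity' and 'used' page counts (hypothetical bookkeeping).
    Returns the number of pages placed on each rank.
    """
    placement = [0] * len(ranks)
    # Prefer ranks that are already partially used, fullest first; empty ranks come last.
    order = sorted(range(len(ranks)), key=lambda i: ranks[i]["used"], reverse=True)
    remaining = num_pages
    for i in order:
        free = ranks[i]["capacity"] - ranks[i]["used"]
        take = min(free, remaining)
        placement[i] = take
        ranks[i]["used"] += take
        remaining -= take
        if remaining == 0:
            break
    return placement

ranks = [{"capacity": 1000, "used": 700}, {"capacity": 1000, "used": 0}, {"capacity": 1000, "used": 0}]
print(allocate_pages(200, ranks))   # [200, 0, 0] -- two ranks remain fully idle
```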
Source: The Interplay of Software Bloat, Hardware Energy Proportionality and System Bottlenecks, HotPower 2011.
Source for figures: Energy Efficiency: The New Holy Grail of Data Management Systems Research, S. Harizopoulos, M. A. Shah, J. Meza, P. Ranganathan, CIDR Perspectives 2009.
JouleMeter
– http://research.microsoft.com/en-us/projects/joulemeter/default.aspx
Energy Modeling
Section Outline
Modeling Workflow
Initial Applications
[Figure: modeling workflow – an electric model and a thermal model connected by a feedback pass.]
Data:
– Measurement-based: use real workloads or systems, and sensors
– Analytical: use models of system to estimate state variables
Execution:
– Real-time
– Off-line
Electrical/Power Modeling
Thermal Modeling
System characterization:
– Perform experiments on system
– Build polynomial models for components
– Obtain "steady-state" behavior by solving a system of equations
Source:
– University of Michigan
– “Understanding and
Abstracting Total Data Center Power”,
Workshop on Energy-Efficient Design
(WEED), held in conjunction with ISCA 2009.
Focus:
– Electric power of data centers
Approach:
– Power is a function of equipment utilization
and ambient outside temperature
Source:
– University of Michigan
– “Stochastic Queuing Simulation for Data Center Workloads”, Workshop on Exascale Evaluation
and Research Techniques (EXERT), 2010.
Focus:
– Integrate workload characteristics in a data center power model
Approach:
– Characterize equipment
– Characterize workloads and build distributions
Suited for data center design, and “what-if” modeling, not for runtime management
Source:
– Frank Bodi, “Super Models in Mission Critical Facilities”, INTELEC 2010
Focus:
– Electrical and mechanical modeling of power distribution and its use to detect failures
Approach:
– Develop an electrical model for each component
– Models are connected according to topology of data center
Virtual stress test
– Monte-Carlo simulation of failures: loss of main power, loss of redundant power,
switching sequences
– Determine mean time between failures (MTBF), useful for identifying equipment deficiencies
Components modeled:
– Incoming AC grid, standby power generator, transfer switches, UPS, transformers, power panels, breakers, computer and air-conditioning loads
Data center is represented as a “net-list” of components
Source:
– Univ. Virginia
– “HotSpot: A Compact Thermal Modeling
Methodology for Early-Stage VLSI
Design”, Transactions of VLSI, 2006
– http://lava.cs.virginia.edu/HotSpot/
Focus:
– Thermal modeling of microprocessors
– Integration with power simulation
Approach:
– All thermal interfaces (heat sink, heat
spreader, silicon) are represented as
resistors or capacitors
– Values are obtained from "basic principles"
– RC-network for the microprocessor and
package is iteratively solved
Source:
– Univ. of Pittsburgh
– “Thermal Faults Modeling Using a RC
Model with an Application to Web
Farms”, ECRTS, 2007
Focus:
– Thermal modeling of servers
Approach:
– Abstracts properties of the system and
develops network inspired by electrical
components
Source:
– Rutgers University
– “Mercury and Freon: Temperature Emulation and Management for Server Systems”, ASPLOS,
2006.
Focus:
– Thermal modeling of servers
Approach:
– Build graphs to represent heat transfer paths and air flow paths
– Based on conservation of energy laws
Provides a good link between data center level and chip level thermal models
Basic principles:
– Conservation of energy
Popular tools:
– ANSYS: Fluent
– Mentor Graphics: FloTherm
– SolidWorks: FloWorks
Challenges:
– Not particularly addressing data centers
– Need accurate input data (dimensions of
equipment, thermal properties, etc.)
– Steep learning curves
– Can be expensive
Source:
– http://dcsg.bcs.org/welcome-dcsg-simulator
– http://www.romonet.com/content/prognose
Focus:
– High-level data center cost and energy
simulator
Approach:
– Use component efficiency curves from
manufacturers
– Build electrical and thermal “topology” of data
center
– Energy balance of components
Reports:
– Detailed information and classification of
energy consumption
– Simulation across seasons
– Cost
Source:
– The Green Grid consortium
– http://thegreengrid.org/library-and-tools.aspx?category=All&type=Tool
Focus:
– Address multiple aspects of data center energy efficiency
Approach:
– High level tools, useful for planning or rough estimation
Source:
– IBM Research
– “Measurement-based modeling for data
centers”, ITHERM, 2010
Focus:
– Thermal modeling of data centers for
improving energy efficiency
Approach:
– Thermal scanning of the data center to build a thermal model
– Thermal sensors are strategically placed, allowing use of a simpler version of the thermodynamic equations
Research Opportunities
Hybrid models:
– Off-line models are for planning and design
– Real-time models require sensor data
[Figure (shown twice): an optimal cooling setpoint balances more IT fan power against less chiller power.]
Source: Uptime Institute white paper, "Tier Classifications Define Site Infrastructure Performance"
Data Center: Uptime Institute Tier 4 reliability support
…and more energy-efficient operation for chips with high levels of circuit variability
Aging-Induced Degradation
[Figure: Aging-induced degradation – histogram of the number of processors at each figure-of-merit (FOM) value (0-1200), with a "worry point" around 4000 F marked; projected FOM growth gives a worst-case FOM of 1058.]
Improvements to Reduce Thermal Cycling
[Flowchart: each interval, calculate the average temperature and compare it with the previous value (T_current vs. T_prev) – if the temperature rose, step down to the next lower voltage (Vi-1); if it is unchanged, keep the same voltage (Vi); if it fell, step up to the next higher voltage (Vi+1) – counteracting swings to keep the die temperature steady.]
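The flowchart translates into a very small control loop; this is my own sketch of that logic (read_avg_temp_c, set_voltage_level, and the level list are hypothetical hooks), stepping the voltage against the direction of the temperature change to damp thermal cycles:

```python
import time

def thermal_cycling_controller(read_avg_temp_c, set_voltage_level, levels,
                               start_index, interval_s=1.0):
    """Step the voltage level to counteract temperature swings, per the flowchart above.

    levels: ordered list of available voltage settings, lowest to highest.
    """
    i = start_index
    t_prev = read_avg_temp_c()
    while True:
        time.sleep(interval_s)                       # wait one averaging interval
        t_current = read_avg_temp_c()
        if t_current > t_prev and i > 0:
            i -= 1                                   # temperature rising -> lower voltage (Vi-1)
        elif t_current < t_prev and i < len(levels) - 1:
            i += 1                                   # temperature falling -> higher voltage (Vi+1)
        # unchanged -> keep the same voltage (Vi)
        set_voltage_level(levels[i])
        t_prev = t_current
```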
[Figure: power overhead of controlled versus uncontrolled operation.]
[Figure: two systems, each with compute power and on-demand cooling power, cooled by redundant series fan pairs (Fan 1/Fan 2 and Fan 3/Fan 4) blowing across the components.]
• Redundant series fan pairs: in normal mode, only one fan in each set is on (Fan 1 and Fan 3)
• Assign additional cooling (Fan 2 or Fan 4) on demand
• When one fan fails, the other fan is switched on just in time before a thermal emergency (a few seconds, as observed on a real system); from then on, use normal mode
• When a failed fan is replaced, higher performance can be resumed when utilization requires it
Research in Emerging Technologies and Solutions
*RAMCloud: http://fiz.stanford.edu:8081/display/ramcloud/Home
[Figure: flash cell structure – control gate, floating gate, source, and drain.]
Source: Scalable High Performance Main Memory System Using Phase-Change Memory Technology, Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, Jude A. Rivers, ISCA 2009.
PCM + DRAM
– PCM as main memory, DRAM as large cache
– Virtual memory managed – IBM Watson
– Hardware managed – University of Pittsburgh
3D architectures
– 3D chip containing processors and DRAM/PCM
• IBM, University of Pittsburgh
– Used to reduce power consumption
– Diskless (HP Nanostore)
Benefits
– Lower distribution losses from higher voltage on-board power distribution.
– Lower energy spent for droop control, as regulation is closer to load.
– Enables fine-grained voltage control (spatial and temporal) leading to better
load-matching and improved energy-efficiency
– Reduction in board/system costs, reducing voltage regulation needs on board.
Challenges
– Space overheads on processor chip
– Difficulty realizing good discrete components in same technology as digital
circuits.
Opportunities
– 3D packaging can help address both challenges above.
System on Chip
3D Integration
3D Benefits:
High core-cache bandwidth
Integration of disparate technologies
Reduction in wire length
Reduced interconnect and I/O cost – eliminating off-chip drivers gives lower power overhead, higher speed, and higher energy-efficiency
Challenges
New technology: initial development costs, tool costs.
Cooling
[Figure: 3D-stack cooling options – microchannel and pin-fin micro-channel liquid coolers with a heat exchanger, keeping the CMOS at about 80ºC.]
Problem
– Overprovisioning of power distribution components in data centers for availability and to
handle workload spikes
Solutions
– Provision for the average load, reducing stranded power, and use power capping.
– Oversubscribe with redundancy and power cap upon failure of one of the supplies/PDUs.
– Employ power distribution topologies with overhead power busses to spread secondary
power feeds over larger number of PDUs, reducing the reserve PDU capacity at each
PDU.
– Use power-distribution-aware workload scheduling strategies to match load more evenly
with power availability.
Challenges
– Separated IT and facilities operations, not enough instrumentation – no integrated,
complete view of power consumption versus availability for optimizations.
– Existing methods for increased availability of the power delivery infrastructure have high
energy/power costs.
*Power Routing: Dynamic Power Provisioning in the Data Center, Steven Pelley, David Meisner, Pooya Zandevakili,
Thomas F Wenisch, Jack Underwood, ASPLOS 2010
Power Sensors
[Figure: Power Distribution Units in a data center; a Power Distribution Unit showing 2 panels of circuit breakers.]
[Figure: packaging hierarchy with Ethernet and power connections – 5a. Midplane: 16 node cards; 6. Rack: 2 midplanes.]
SeaMicro SM10000-64
Goal is to improve compute/power and compute/space metrics
10U server
– 512 ATOM 64 bit cores (256 sockets)
– 4GB per socket
– 2.5KW
– Operates as 256 node cluster
– High-speed internal network – 1.2Tbit/s
– External 64 x 1Gb ethernet ports
– Virtualized I/O
• All I/O is shared between sockets
• Improves efficiency – very low
overhead per socket.
Designed for high volume of modest
computational workloads
– Web servers
– Hadoop
ZT Systems R1801e
Maybe the future will look like this…