
Top High Performance Computing Systems

A Short Overview

CRAY - JAGUAR
Jaguar was a petascale supercomputer built by Cray at Oak Ridge National Laboratory with a peak
performance of 1.75 PFlops. In 2012 the XT5 Jaguar was upgraded to Titan, a Cray XK7 system that is
currently ranked as the world's fastest computer. The key features of Jaguar are described first,
followed by the main differences introduced by its upgrade, Titan.

System Specifications (most important):


Processor: 224,256 AMD Opteron cores @ 2.6 GHz
Memory: 300 TB
Cabinets: 200 (8 rows of 25 cabinets)
Compute Nodes: 18,688
Performance: 1.75 PFlops peak
Size: 4,600 square feet
Power: 480 V power per cabinet

Compute Node:

2 AMD Opteron 2435 hex-core processors
Cray-specific SeaStar interconnect
- Improves scalability
- Low latency, high bandwidth
Upgradable processor, memory and interconnect

Topology:

3D torus network (a wrap-around addressing sketch follows below)
192 I/O nodes distributed across the torus to prevent hot spots, connected to the storage via
InfiniBand
Fabric connections provide redundant paths
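
To make the wrap-around property of the torus concrete, the following minimal Python sketch
(illustrative only; the dimensions are assumed, not Jaguar's actual torus sizes) enumerates the six
neighbors of a node in a 3D torus:

    # Illustrative sketch of 3D-torus addressing; the dimensions are assumed.
    def torus_neighbors(coord, dims):
        """Return the six wrap-around neighbors of a node in a 3D torus."""
        x, y, z = coord
        X, Y, Z = dims
        return [
            ((x + 1) % X, y, z), ((x - 1) % X, y, z),
            (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
            (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
        ]

    # Example: a corner node still has six neighbors because every edge wraps.
    print(torus_neighbors((0, 0, 0), (25, 32, 24)))

Because every dimension wraps, no node sits on a boundary, which keeps worst-case hop counts low.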

Figure 1: XT5 Topology

Packaging Hierarchy:

Blade: 4 network connections
Cabinet (rack): 96 nodes, 192 Opteron processors (1,152 cores)
System: 200 cabinets

Figure 2: Building the Cray XT5 System

Titan - Main Differences:

Jaguar was fitted with 18,688 of the first shipping NVIDIA Tesla K20 GPUs and renamed Titan
Compute node: one 16-core AMD Opteron 6274 processor and one Tesla K20 GPU accelerator
Memory: 710 TB
Peak performance: more than 20 PFlops

Titan Interconnect:

1 Gemini routing and communications ASIC per two nodes
48 switch ports per Gemini chip (160 GB/s internal switching capacity per chip)
3D torus interconnect

Gemini Interconnect:

Enables tens of millions of MPI messages per second
Each hybrid node interfaces to the Gemini interconnect through HyperTransport 3.0 technology
This architecture bypasses PCI bottlenecks and provides a peak of 20 GB/s of injection bandwidth
per node (a back-of-envelope aggregate figure is sketched below)
Connectionless protocol; highly scalable architecture
Powerful bisection and global bandwidth
Dynamic routing of messages
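
Combining the figures above with the node count from the Jaguar table, a quick back-of-envelope
sketch (illustrative arithmetic on the quoted numbers, not vendor data) gives the scale of the
network:

    # Back-of-envelope sketch using the figures quoted in this section.
    nodes = 18_688                   # compute nodes (from the Jaguar/Titan specs)
    nodes_per_gemini = 2             # one Gemini ASIC serves two nodes
    injection_bw_gb_s = 20           # peak injection bandwidth per node (GB/s)

    gemini_chips = nodes // nodes_per_gemini
    aggregate_tb_s = nodes * injection_bw_gb_s / 1000

    print(gemini_chips)              # 9344 Gemini chips
    print(aggregate_tb_s)            # roughly 374 TB/s peak aggregate injection bandwidth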

Figure 3: Gemini Interconnect

IBM - Blue Gene/Q


Blue Gene/Q is the third generation of IBM's Blue Gene series of HPC systems, following the Blue
Gene/L and /P architectures. It is a massively parallel computer whose purpose is to solve large-scale
scientific problems, such as protein folding.

Key Features:

Small footprint
High power efficiency (world's most efficient supercomputer - Green500, June 2011)
Low-latency, high-bandwidth inter-processor communication system
1,024 nodes per rack
Scales to 512 racks (currently implemented with 256)
Peak performance: 100 PFlops
Integrated 5D torus with optical cables for inter-processor communication; tremendous
bisection bandwidth

System Specifications (most important):

Processor: IBM PowerPC A2, 1.6 GHz, 16 cores per node
Memory: 16 GB SDRAM-DDR3 per node (1,333 MT/s)
Networks:
- 5D torus, 40 GB/s; 2.5 us latency
- Collective network part of the 5D torus; collective logic operations supported
- Global barrier/interrupt part of the 5D torus
- PCIe x8 Gen2 based I/O
- 1 Gb control network: system boot, debug, monitoring
I/O Nodes (10 GbE or InfiniBand): 16-way SMP processor; configurable as 8, 16 or 32 I/O nodes per rack
Performance: peak performance per rack 209.7 TFlops
Power: typical 80 kW per rack (estimated); 380-415 or 480 VAC, 3-phase; maximum 100 kW per rack;
4 x 60 A service per rack
Dimensions:
- Height: 2095 mm, Width: 1219 mm, Depth: 1321 mm
- Weight: 4,500 lbs with coolant (LLNL 1 I/O drawer configuration)
- Service clearances: 914 mm on all sides

Compute Chip:
System-on-a-Chip design: integrates processors, memory and networking logic into a single chip
- 360 mm2 in Cu-45 (SOI) technology
- 16 user cores + 1 service processor, plus 1 redundant processor; all processors are symmetric
- 1.6 GHz
- L1 I/D cache: 16 KB / 16 KB, with L1 prefetch engines
- Peak performance 204.8 GFlops @ 55 W (checked arithmetically below)
- Central shared L2 cache: 32 MB
- Dual memory controller
- Chip-to-chip networking: router logic integrated into the BQC chip
- External I/O
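
As a quick sanity check of the per-chip and per-rack peak figures, assume each A2 core issues a
4-wide double-precision fused multiply-add per cycle (8 flops/cycle per core); this assumption is
only used for the arithmetic below.

    # Arithmetic check of the quoted peaks (assumes 8 flops/cycle/core).
    cores = 16                 # user-visible compute cores per chip
    clock_ghz = 1.6
    flops_per_cycle = 8        # 4 multiplies + 4 adds per cycle

    peak_chip_gflops = cores * clock_ghz * flops_per_cycle
    print(peak_chip_gflops)                             # 204.8 GFlops per chip

    nodes_per_rack = 1024
    print(peak_chip_gflops * nodes_per_rack / 1000)     # about 209.7 TFlops per rack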

Crossbar Switch:

Central connection structure between the processors, the L2 cache, the networking logic and various
low-bandwidth units
Frequency: 800 MHz (half-frequency clock grid)
3 separate switches:
- Request traffic: write bandwidth 12 B/PUnit @ 800 MHz
- Response traffic: write bandwidth 12 B/PUnit @ 800 MHz
- Invalidate traffic
22 master ports
18 slave ports
Peak on-chip bisection bandwidth: 563 GB/s

Networking Logic:

Communication ports:
- 11 bidirectional chip-to-chip links @ 2 GB/s
- 2 links can be used for PCIe Gen2 x8
On-chip networking logic:
- Implements a 14-port router
- Designed to support point-to-point, collective and barrier messages

Packaging Hierarchy:

Basic Topology - Interconnects:

Node board:
- Connects 32 compute cards into a 5D torus of length 2 in each dimension; each card consists of
one compute integrated circuit and 72 memory chips
- Signaling rate @ 4 Gbps
- Boards on the same midplane are connected via the midplane connector
- Boards on different midplanes are connected via 8 link chips on each midplane; these link chips
drive optical transceivers (@ 10 Gbps) through optical cables

Midplane:
- The midplane interconnection is a 512-node 5D torus of size 4 x 4 x 4 x 4 x 2 (a small
enumeration sketch follows below)

Rack:
- Contains one or two fully populated midplanes
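
To make the 5D coordinates concrete, here is a minimal Python sketch (illustrative only, not IBM
software) that enumerates a midplane's 4 x 4 x 4 x 4 x 2 torus and lists the neighbors of one node:

    # Enumerate the 512 nodes of a midplane's 5D torus and inspect one node's
    # wrap-around neighborhood.
    from itertools import product

    dims = (4, 4, 4, 4, 2)
    nodes = list(product(*(range(d) for d in dims)))
    print(len(nodes))                          # 512 nodes per midplane

    def neighbors(coord):
        result = set()
        for axis, size in enumerate(dims):
            for step in (+1, -1):
                n = list(coord)
                n[axis] = (n[axis] + step) % size
                result.add(tuple(n))
        return result

    # 9 distinct neighbors: the size-2 dimension wraps back onto the same neighbor.
    print(len(neighbors((0, 0, 0, 0, 0))))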

Figure 4: Compute torus dimensionality of Blue Gene/Q building blocks

CRAY - XC30 Series


The Cray XC30 supercomputer is Cray's next-generation supercomputing series, upgradable to 100
PFlops per system. Known as the "Cascade" program, it was made possible in part by Cray's participation
in the Defense Advanced Research Projects Agency's (DARPA) High Productivity Computing Systems
program. One of the key features of this system is the introduction of the Aries interconnect, an
interconnect chipset that follows on from Gemini, with a new system interconnect topology called
"Dragonfly". This topology provides scalable system size and global network bandwidth.

System Specifications (most important):

Processor: 64-bit Intel Xeon E5-2600 series processors; up to 384 per cabinet
Memory: 32-128 GB per node; memory bandwidth up to 117 GB/s per node
Compute Cabinet: initially up to 3,072 processor cores per system cabinet, upgradeable
Interconnect:
- 1 Aries routing and communications ASIC per four compute nodes
- 48 switch ports per Aries chip (500 GB/s switching capacity per chip)
- Dragonfly interconnect: low-latency, high-bandwidth topology
Performance: peak performance initially up to 66 TFlops per system cabinet
Power:
- 88 kW per compute cabinet, maximum configuration
- Circuit requirements (2 per compute cabinet): 100 A at 480/277 VAC or 125 A at 400/230 VAC
(three-phase, neutral and ground)
- 6 kW per blower cabinet; 20 A at 480 VAC, 16 A at 400 VAC (three-phase, ground)
Dimensions:
- H 80.25 in. x W 35.56 in. x D 62.00 in. (compute cabinet)
- H 80.25 in. x W 18.00 in. x D 42.00 in. (blower cabinet)
- 3,450 lbs per compute cabinet (liquid cooled); 243 lbs per square foot floor loading
- 750 lbs per blower cabinet

Compute Node - Components:

A pair of Intel Xeon E5 processors (16 or more cores)
8 DDR3 memory channels
Memory capacity: 64 or 128 GB

Aries Device - Interconnects:

System-on-chip device comprising four Network Interface Controllers, a 48-port tiled router and
a multiplexer
It provides connectivity for four nodes on a blade
Aries devices are connected to each other via cables and backplanes

Figure 5: Aries Device Connecting 4 Nodes

Dragonfly Topology - Features and Benefits:

Each chassis consists of 16 four-node blades (64 nodes per chassis)
Each cabinet has three chassis (192 nodes)
The Dragonfly network is constructed from two-cabinet electrical groups with 384 nodes per group
(the packaging arithmetic is sketched below)
All connections within a group are electrical @ 14 Gbps per lane
This topology provides good global bandwidth and low latency at lower cost compared to a fat tree
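
A short sketch of that packaging arithmetic, using only the counts quoted above (illustrative only):

    # Dragonfly group sizing from the counts quoted in this section.
    nodes_per_blade = 4           # one Aries serves the four nodes on a blade
    blades_per_chassis = 16
    chassis_per_cabinet = 3
    cabinets_per_group = 2

    nodes_per_chassis = nodes_per_blade * blades_per_chassis      # 64
    nodes_per_cabinet = nodes_per_chassis * chassis_per_cabinet   # 192
    nodes_per_group = nodes_per_cabinet * cabinets_per_group      # 384
    print(nodes_per_chassis, nodes_per_cabinet, nodes_per_group)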

Figure 6: Chassis backplane provides all-to-all connectivity between blades. Chassis within the same group connect to each
other via electrical connectors. Groups connect together via optical cables.

Figure 7: 2-D all-to-all structure is used within a group

FUJITSU - K COMPUTER
The K computer is a massively parallel computer system developed by Fujitsu and RIKEN as part of the
High Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education,
Culture, Sports, Science and Technology. It is designed to perform extremely complex scientific
calculations and was ranked as the top-performing supercomputer in the world in 2011.

System Specifications (most important):


CPU (SPARC64 VIIIfx):
- Cores/Node: 8 cores @ 2 GHz
- Performance: 128 GFlops
- Architecture: SPARC V9 + HPC extension
- Cache: L1 (I/D) 32 KB / 32 KB; L2 6 MB
- Power: 58 W (typical, 30 C)
- Memory bandwidth: 64 GB/s
Node:
- Configuration: 1 CPU per node
- Memory capacity: 16 GB (2 GB/core)
System Board (SB):
- Nodes: 4 nodes per SB
Rack:
- System boards: 24 SBs per rack
System:
- Nodes: >80,000
- Peak performance: >10 PFlops
Interconnect:
- Topology: 6D mesh/torus
- Performance: 5 GB/s per link
- Links: 10 links per node
- Architecture: routing chip structure (no external switch box)

Compute Chip(CPU):

Extended SPARC64 VII architecture for HPC
High performance and low power consumption
Highly reliable design
1,271 signal pins

Compute Node:

Single CPU and interconnect controller
10 links for inter-node connection @ 10 GB/s per link (5 GB/s in each direction)

Topology - Tofu Interconnect:

Logical 3D torus network
Physical topology: 6D mesh/torus
10 links per node: 6 for the 3D torus (coordinates x, y, z) and 4 for the redundant 3D mesh/torus
(coordinates a, b, c); a node-degree sketch follows below
Bisection bandwidth: >30 TB/s
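
The sketch below illustrates how a per-node link count arises in a mixed mesh/torus. The dimension
sizes and the torus/mesh split used here are assumptions for illustration, not the exact Tofu
configuration:

    # Node degree in a 6D mesh/torus (illustrative; dimension sizes assumed).
    def node_degree(coord, dims, is_torus):
        degree = 0
        for c, size, torus in zip(coord, dims, is_torus):
            if torus or 0 < c < size - 1:
                degree += 2      # neighbors in both directions along this axis
            else:
                degree += 1      # on a mesh boundary: one neighbor only
        return degree

    dims = (6, 6, 6, 2, 3, 2)                            # hypothetical (x, y, z, a, b, c) sizes
    is_torus = (True, True, True, False, True, False)    # assumed torus/mesh mix
    print(node_degree((0, 0, 0, 0, 0, 0), dims, is_torus))   # 10 links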

Figure 8: Topology

6-D Benefits:

High scalability
- Twelve times higher than the 3D torus
High performance and operability
- Low hop count (average hop count about half that of a 3D torus)
High fault tolerance
- 12 possible alternate paths to bypass faulty nodes
- A redundant node can be assigned while preserving the torus topology

Packaging Hierarchy:

Figure 9: K-Computer Packaging Hierarchy

DELL - STAMPEDE
Stampede was developed by Dell at the Texas Advanced Computing Center (TACC) as an NSF-funded project.
It is a new HPC system and one of the largest computing systems in the world for open science research.
It was ranked 7th in the Top500 list in 2012.

System Specifications (most important):

Processor:
- Intel 8-core Xeon E5 processors, 2 per node, core frequency @ 2.7 GHz
- Intel 61-core Xeon Phi coprocessor, 1 per node, core frequency @ 1.1 GHz
Memory: 32 GB per node (host memory), plus an additional 8 GB on the coprocessor card
Packaging: 6,400 nodes in 160 racks (40 nodes per rack), along with 2 36-port Mellanox leaf switches
per rack
Interconnect: nodes interconnected with Mellanox FDR InfiniBand in a 2-level (core and leaf)
fat-tree topology @ 56 Gb/s
Performance: peak performance 10 PFlops
Power: >6 MW total power

Compute Node:

Figure 10: Stampede Zeus Compute Node

Interconnect:

It consists of Mellanox switches, fiber cables and host channel adapters - approximately 75 miles
of cable in total
Eight 648-port SX6536 core switches and more than 320 36-port SX6025 endpoint switches form a
2-level Clos fat-tree topology
They have switching capacities of roughly 73 Tb/s and 4.0 Tb/s, respectively
5/4 oversubscription at the endpoint (leaf) switches; the port split behind this ratio is sketched below
Key feature: any MPI message travels at most 5 hops from source to destination
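
The 5/4 figure follows from how the 36 ports of each leaf switch are divided. The split below
(20 node-facing ports, 16 uplinks) is an assumption consistent with 40 nodes and two leaf switches
per rack:

    # Leaf-level oversubscription sketch (port split assumed, see above).
    from fractions import Fraction

    ports = 36
    downlinks = 20                 # ports toward compute nodes (assumed: 40 nodes / 2 leaves)
    uplinks = ports - downlinks    # 16 ports toward the core switches

    print(Fraction(downlinks, uplinks))   # 5/4 oversubscription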

Figure 11: Stampede Interconnect

NUDT - TIANHE-I
Developed by the Chinese National University of Defense Technology, it was the fastest computer in the
world from October 2010 to June 2011, and it is now one of the few petascale supercomputers in the world.

System Specifications (most important):

Processor:
- Two 6-core Intel Xeon X5670 CPUs @ 2.93 GHz and one NVIDIA M2050 GPU @ 1.15 GHz per compute node
- Two 8-core FT-1000 CPUs @ 1.0 GHz per service node
Memory: 262 TB total
Nodes/Packaging: 7,168 compute nodes and 1,024 service nodes installed in 120 racks
Interconnect: fat-tree structure with bidirectional bandwidth @ 160 Gbps
Performance: peak performance 4,700 TFlops
Power: 4.04 MW

Interconnect:

Use of high-radix Network Routing Chips (NRC) and high-speed Network Interface Chips (NIC)
The NRC and NIC chips are proprietary, in-house designs
The topology is an optic-electronic hybrid hierarchical fat-tree structure
Bidirectional bandwidth @ 160 Gbps; latency 1.57 us; switch density 61.44 Tbps in a single
backboard

Figure 12: Architecture of the Interconnect Network

The first layer consists of 480 switching boards. In each rack, every 16 nodes connect with each
other through the switch board in the rack. The main boards are connected to the switch board via
a back board. Communication on the switching boards uses electrical transmission.
The second layer contains 11 384-port switches connected with QSFP optical fibers. There are
12 leaf switch boards and 12 root switch boards in each 384-port switch. A high-density back board
connects them in an orthogonal arrangement. The switch boards in the racks are connected to the
384-port switches through optical fibers. The port arithmetic implied by these counts is sketched below.
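
As a rough idea of the scale, the port counts implied by those figures can be tallied directly
(simple arithmetic on the numbers quoted above; it is not a complete accounting of how every node
attaches to the tree):

    # Port-count arithmetic for the two-layer fat tree described above.
    first_layer_boards = 480
    nodes_per_board = 16
    second_layer_switches = 11
    ports_per_switch = 384

    print(first_layer_boards * nodes_per_board)        # 7680 first-layer node ports
    print(second_layer_switches * ports_per_switch)    # 4224 second-layer ports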

References:
[1] Top500 supercomputer sites: http://www.top500.org/
[2] A. Bland, R. Kendall, D. Kothe, J. Rogers, G. Shipman. Jaguar: The World's Most Powerful
Computer
[3] B. Bland. The World's Most Powerful Computer, presentation at the Cray User Group 2009
Meeting
[4] Jaguar (supercomputer): http://en.wikipedia.org/wiki/Jaguar_(supercomputer)
[5] Titan (supercomputer): http://en.wikipedia.org/wiki/Titan_(supercomputer)
[6] Cray XK7: http://www.cray.com/Products/Computing/XK7.aspx
[7] Blue Gene/Q: http://en.wikipedia.org/wiki/Blue_Gene
[8] IBM. IBM System Blue Gene/Q, IBM Systems and Technology Data Sheet
[9] J. Milano, P. Lembke. IBM System Blue Gene Solution: Blue Gene/Q Hardware Overview and
Installation Planning, Redbooks
[10] M. Ohmacht / IBM Blue Gene Team. Memory Speculation of the Blue Gene/Q Compute Chip,
presentation, 2011
[11] R. Wisniewski / IBM Blue Gene Team. Blue Gene/Q: Architecture, CoDesign, Path to Exascale,
presentation, 2012
[12] Cray XC30: http://www.cray.com/Products/Computing/XC.aspx
[13] B. Alverson, E. Froese, L. Kaplan, D. Roweth / Cray Inc. Cray XC Series Network,
http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
[14] K computer: http://en.wikipedia.org/wiki/K_computer
[15] T. Inoue. The 6D Mesh/Torus Interconnect of K Computer, presentation
[16] T. Maruyama. SPARC64 VIIIfx: Fujitsu's New Generation Octo-Core Processor for PETA-Scale
Computing, presentation, 2009
[17] M. Sato. The K computer and XcalableMP parallel language project, presentation
[18] Y. Ajima, T. Inoue, S. Hiramoto, T. Shimizu. Tofu: Interconnect for the K computer, Fujitsu Sci.
Tech. J., Vol. 48, No. 3, pp. 280-285 (July 2012)
[19] H. Maeda, H. Kubo, H. Shimamori, A. Tamura, J. Wei. System Packaging Technologies for the K
Computer, Fujitsu Sci. Tech. J., Vol. 48, No. 3, pp. 286-294 (July 2012)
[20] Stampede: http://www.tacc.utexas.edu/resources/hpc/#stampede
[21] Stampede: https://www.xsede.org/web/guest/tacc-stampede#overview
[22] D. Stanzione. The Stampede is Coming: A New Petascale Resource for the Open Science
Community, presentation
[23] S. Lantz. Parallel Programming on Ranger and Stampede, presentation, 2012
[24] Tianhe-1A: http://www.nscc-tj.gov.cn/en/resources/resources_1.asp#TH-1A
[25] X. Yang, X. Liao, K. Lu et al. The TianHe-1A supercomputer: Its hardware and software,
Journal of Computer Science and Technology, 26(3): 344-351, May 2011
