HPC Overview
A Short Overview
CRAY - JAGUAR
Jaguar was a petascale supercomputer built by Cray at Oak Ridge National Laboratory with a peak
performance of 1.75 PFlops. In 2012 the XT5 Jaguar was upgraded to Titan, a Cray XK7 system, which is
currently ranked as the world's fastest computer. Key features of Jaguar follow, and then the
differences introduced by its upgrade, Titan.
Compute Node:
Topology:
3D torus network (see the hop-count sketch below)
192 I/O nodes distributed across the torus to prevent hot-spots, connected to the storage via InfiniBand
Fabric connections provide redundant paths
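To make the torus concrete, below is a minimal Python sketch of how a minimal-hop route length is computed in a torus: each dimension is a ring, so traffic takes the shorter way around. The dimensions used are illustrative assumptions, not Jaguar's documented configuration.

    def torus_hops(a, b, dims):
        """Minimal hop count between nodes a and b of a torus: in each
        dimension (a ring of length k), take the shorter way around."""
        return sum(min(abs(x - y), k - abs(x - y))
                   for x, y, k in zip(a, b, dims))

    # Illustrative 3D torus dimensions (an assumption, not Jaguar's spec):
    dims = (25, 32, 24)
    print(torus_hops((0, 0, 0), (24, 16, 1), dims))  # 1 + 16 + 1 = 18 hops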
Packaging Hierarchy:
Jaguar was loaded with 18,688 of the first shipping NVIDIA Tesla K20 GPUs and renamed
Titan
Compute node: consists of a 16-core AMD Opteron 6274 processor and a Tesla K20 GPU accelerator
Memory: 710 TB
Peak Performance: more than 20 PFlops (a rough check follows)
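A back-of-the-envelope check of the peak figure, sketched in Python; the per-GPU and per-CPU peaks assumed below (about 1.31 TFlops for a K20-class GPU, about 0.14 TFlops for the 16-core Opteron) are assumptions, not numbers from this overview.

    nodes = 18_688
    gpu_tflops = 1.31  # assumed double-precision peak per K20-class GPU
    cpu_tflops = 0.14  # assumed peak per 16-core Opteron 6274
    print(nodes * (gpu_tflops + cpu_tflops) / 1000)  # ~27.1 PFlops, i.e. > 20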
Titan Interconnect: Gemini
IBM - BLUE GENE/Q
Key Features:
Small footprint
High power efficiency (world's most efficient supercomputer, Green500 June 2011)
Low latency, high bandwidth inter-processor communication system
1,024 nodes/rack
Scales to 512 racks (currently implemented with 256)
Peak Performance: 100 PFlops
Integrated 5D torus with optical cables for inter-processor communication
Tremendous bisection bandwidth
(Specification table: Processor, Memory, Networks, Dimensions)
Compute Chip:
System-on-a-Chip Design: integrates processors, memory and networking logic into a single chip
- 360 mm² Cu-45 technology (SOI)
- 16 user + 1 service processors
- plus 1 redundant processor
- all processors are symmetric
- 1.6 GHz
- L1 I/D cache = 16kB/16kB
- L1 prefetch engines
- peak performance 204.8 GFlops @ 55 W (see the arithmetic check after this list)
- Central shared L2 cache: 32 MB
- Dual memory controller
- Chip-to-chip networking
- Router logic integrated into BQC chip
- External IO
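The 204.8 GFlops figure follows from the clock and core count, assuming each user core issues 8 flops per cycle (a 4-wide double-precision FMA unit):

    cores, flops_per_cycle, clock_ghz = 16, 8, 1.6  # 8 = 4-wide FMA (assumed)
    print(cores * flops_per_cycle * clock_ghz)       # 204.8 GFlops, as quoted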
Crossbar Switch:
Central connection structure between processors, L2 cache, networking logic and various low-bandwidth units
Frequency: 800 MHz (half-frequency clock grid)
3 separate switches:
Request traffic -- write bandwidth 12B/PUnit @ 800 MHz
Response traffic -- write bandwidth 12B/PUnit @ 800 MHz
Invalidate traffic
22 master ports
18 slave ports
Peak on-chip bisection bandwidth 563 GB/s
Networking Logic:
Communication ports:
11 bidirectional chip-to-chip links @ 2 GB/s (see the bandwidth note after this list)
2 links can be used for PCIe Gen2 x8
On-chip networking logic
Implements 14-port router
Designed to support point-to-point, collective and barrier messages
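Assuming 10 of the 11 links form the 5D torus and the eleventh is the I/O link, the per-node torus bandwidth works out as:

    torus_links = 10       # assumed: 11 links minus one I/O link
    gbs_per_direction = 2  # 2 GB/s per direction, as quoted above
    print(torus_links * gbs_per_direction * 2)  # 40 GB/s torus bandwidth per node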
Packaging Hierarchy:
Node Board
Connects 32 compute cards into a 5-D torus of length 2 (2 × 2 × 2 × 2 × 2). Each card consists of one
compute integrated circuit and 72 memory chips
Signaling rate @ 4 Gbps
Boards on the same midplane are connected via the midplane connector
Boards on different midplanes are connected via 8 link chips on each midplane
These link chips drive optical transceivers (@ 10 Gbps) through optical cables
Midplane
Midplane interconnection is a 512-node 5-D torus of size 4 × 4 × 4 × 4 × 2
Rack
Contains one or two fully populated midplanes
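Multiplying out the packaging hierarchy reproduces the headline numbers quoted in the key features; the 16 node boards per midplane below is inferred from 512 nodes per midplane at 32 cards per board.

    nodes_per_board = 32
    boards_per_midplane = 512 // nodes_per_board  # 16 node boards per midplane
    nodes_per_rack = 2 * 512                      # two midplanes -> 1,024 nodes/rack
    full_system_nodes = 512 * nodes_per_rack      # 524,288 nodes at 512 racks
    print(full_system_nodes * 204.8 / 1e6)        # ~107 PFlops, the ~100 PF peak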
CRAY - XC30
The Cray XC30 is the successor to Cray's XE/XK line, built around the Aries interconnect and a Dragonfly topology.
(Specification table: Processor, Memory, Compute Cabinet, Interconnect, Performance, Power, Dimensions)
Aries Interconnect:
System-on-chip device comprising four Network Interface Controllers, a 48-port tiled router and
a multiplexer
It provides connectivity for four nodes on a blade
Aries devices are connected to each other via cables and backplanes
Figure 6: Chassis backplane provides all-to-all connectivity between blades. Chassis within the same group connect to each
other via electrical connectors. Groups connect together via optical cables.
FUJITSU - K COMPUTER
The K computer is a massively parallel computer system developed by Fujitsu and RIKEN as part of the
High Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education,
Culture, Sports, Science and Technology. It is designed to perform extremely complex scientific
calculations, and it was ranked as the top-performing supercomputer in the world in 2011.
Node:
  Cores/Node: 8 cores @ 2 GHz
  Performance: 128 GFlops (see the arithmetic check after this table)
  Architecture: SPARC V9 + HPC extension
  Cache: L1 (I/D) 32 KB/32 KB, L2 6 MB
  Power: 58 W (typ., 30°C)
  Mem. Bandwidth: 64 GB/s
  Configuration: 1 CPU/node
  Mem. Capacity: 16 GB (2 GB/core)
System Board (SB):
  No. of Nodes: 4 nodes/SB
Rack:
  No. of SBs: 24 SBs/rack
System:
  Peak Performance: >10 PFlops
  Nodes/System: >80,000
Interconnect:
  Topology: 6D Mesh/Torus
  Performance: 5 GB/s per link
  No. of Links: 10 links/node
  Architecture: routing chip structure (no outside switch box)
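Two quick arithmetic checks on the table. The per-CPU peak assumes 8 flops per cycle per core (the SPARC64 VIIIfx's SIMD FMA pipelines, an assumption not stated in the table); the system peak then follows from the node count.

    gflops_per_cpu = 8 * 2.0 * 8         # 8 cores x 2 GHz x 8 flops/cycle (assumed)
    print(gflops_per_cpu)                # 128.0 GFlops, matching the table
    print(80_000 * gflops_per_cpu / 1e6) # 10.24 PFlops, i.e. the ">10 PFlops" row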
Compute Chip (CPU):
Compute Node:
Figure 8: Topology
6-D Benefits:
High Scalability
Twelve times higher than a 3D torus
High Performance and Operability
Low hop count (average hop count about 1/2 that of a 3D torus; see the sketch after this list)
High Fault Tolerance
12 possible alternate paths to bypass faulty nodes
A redundant node can be assigned while preserving the torus topology
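A minimal sketch of the hop-count claim: for an equal node count, spreading a torus over six dimensions roughly halves the average hop count of a 3D torus. The dimension sizes are illustrative, not the actual Tofu configuration.

    def avg_hops(dims):
        """Average minimal hop count between two uniformly random torus
        nodes: per dimension, the mean shorter-way-around ring distance."""
        return sum(sum(min(d, k - d) for d in range(k)) / k for k in dims)

    # Equal node counts: 16*16*16 == 4**6 == 4096 nodes
    print(avg_hops([16, 16, 16]))  # 12.0 average hops in a 3D torus
    print(avg_hops([4] * 6))       #  6.0 average hops in a 6D torus, about half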
Packaging Hierarchy:
DELL - STAMPEDE
Stampede was developed by Dell at the Texas Advanced Computing Center (TACC) as an NSF-funded project.
It is a new HPC system and one of the largest computing systems in the world for open science research.
It was ranked 7th on the Top500 list in 2012.
(Specification table: Processor, Memory, Packaging, Interconnect, Performance, Power)
Compute Node:
Interconnect:
It consists of Mellanox switches, fiber cables and Host Channel Adapters, with approximately 75 miles of
cables
Eight core 648-port SX6536 switches and more than 320 36-port SX6025 endpoint switches form a 2-level
Clos fat-tree topology
They have switching capacities of 73 Tb/s and 4.0 Tb/s, respectively
5/4 oversubscription at the endpoint switches (leaves); see the sketch after this list
Key feature: any MPI message is at most 5 hops from source to destination
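The 5/4 figure follows from how a 36-port leaf switch splits its ports between hosts and the core; the 20-down/16-up split below is an assumption consistent with that ratio, not a documented port map.

    down_ports, up_ports = 20, 16  # assumed host-facing / core-facing split
    assert down_ports + up_ports == 36
    print(down_ports / up_ports)   # 1.25, i.e. 5/4 oversubscription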
NUDT - TIANHE-1A
Developed by the Chinese National University of Defense Technology, it was the fastest computer in the
world from October 2010 to June 2011, and it is now one of the few petascale supercomputers in the world.
(Specification table: Processor, Memory, Nodes/Packaging, Interconnect, Performance, Power)
Memory: 262 TB total
Nodes/Packaging: 7,168 compute nodes and 1,024 service nodes, installed in 120 racks
Interconnect:
Use of high-radix Network Routing Chips (NRC) and high-speed Network Interface Chips (NIC)
The NRC and NIC chips are in-house designs with independent intellectual property
Topology is a hybrid optical-electronic hierarchical fat-tree structure
Bi-directional bandwidth @ 160 Gbps (see the sketch after this list), latency 1.57 μs, switch density
61.44 Tbps in a single backboard
The first layer consists of 480 switching boards. In each rack, every 16 nodes connect with each
other through the switch board in the rack. The main boards are connected with the switch
board via a back board. Communication on switching boards uses electrical transmission.
The second layer contains 11 384-port switches connected with QSFP optical fibers. There are
12 leaf switch boards and 12 root switch boards in each 384-port switch. A high-density back
board connects them orthogonally. The switch boards in the rack are connected to the 384-port
switches through optical fibers.
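For scale, the quoted 160 Gbps bidirectional link bandwidth converts as follows; the 40 Gbps per direction of a QDR InfiniBand 4x link, used for comparison, is an assumption rather than a number from this overview.

    bidir_gbps = 160
    per_direction_gbps = bidir_gbps / 2  # 80 Gbps each way
    print(per_direction_gbps / 8)        # 10.0 GB/s per direction
    print(per_direction_gbps / 40)       # 2.0x a QDR InfiniBand 4x link (assumed)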