High Performance Computing With Accelerators
Seminar Report Submitted in Partial Fulfilment of the Requirements
for the Degree of
Bachelor of Engineering
In
Information Technology
Submitted by
Aman Goyal (Roll No. 19UITE9001)
CERTIFICATE
This is to certify that the work contained in this report entitled “High Performance
Computing With Accelerators” is submitted by Mr. Aman Goyal (Roll No. 19UITE9001)
to the Department of Computer Science & Engineering, M.B.M. University, Jodhpur,
in partial fulfilment of the requirements for the degree of Bachelor of Engineering in
Information Technology.
He has carried out his seminar work under my supervision. This work has not been
submitted elsewhere for the award of any other degree or diploma.
The seminar work, in our opinion, has reached the standard fulfilling the requirements
for the degree of Bachelor of Engineering in Information Technology in accordance
with the regulations of the Institute.
Dr Anil Gupta
Professor
(Supervisor)
Dept. of Computer Science & Engg.
M.B.M. University, Jodhpur
Dr NC Barwar
(Head)
Dept. of Computer Science & Engg.
M.B.M. University, Jodhpur
DECLARATION
I, Aman Goyal, hereby declare that this seminar titled “High Performance Computing
With Accelerators” is a record of original work done by me under the supervision and
guidance of Dr. Anil Gupta.
I further certify that this work has not formed the basis for the award of any
Degree/Diploma/Associateship/Fellowship or similar recognition to any candidate of
any university, and that no part of this report has been reproduced verbatim from any
other source without appropriate reference and permission.
SIGNATURE OF STUDENT
Aman Goyal
7th Semester, IT
Enroll. No. – 18R/04502
Roll No. – 19UITE9001
ACKNOWLEDGEMENT
I would like to thank my esteemed supervisor, Dr. Anil Gupta, for his valuable
suggestions, keen interest, constant encouragement, incessant inspiration and continuous
help throughout this work. His excellent guidance has been instrumental in making this
work a success.
I would also like to express my thanks to all the staff members of the Department of
Computer Science & Engineering for their direct or indirect support. Special thanks go
to my parents and all my friends for their support in the completion of this work.
Aman Goyal
7th Semester, IT
ABSTRACT
In the past few years, a new class of HPC systems has emerged. These systems employ
unconventional processor architectures, such as IBM's Cell processor and graphics
processing units (GPUs), for heavy computations and use conventional central
processing units (CPUs) mostly for non-compute-intensive tasks, such as I/O and
communication. Prominent examples of such systems include the Los Alamos National
Laboratory's Cell-based Roadrunner and the Chinese National University of Defense
Technology's ATI GPU-based Tianhe-1 cluster.
The main reason computational scientists consider using accelerators is the need to
increase application performance, either to decrease the compute time, to increase the
size of the science problem they can compute, or both. The HPC space is challenging
since it is dominated by applications that use 64-bit floating-point calculations and
frequently have little data reuse. As the size of conventional HPC systems increases,
their space and power requirements and operational cost quickly outgrow the available
resources and budgets. Thus, metrics such as flops per machine footprint, flops per
watt of power, or flops per dollar spent on the hardware and its operation are becoming
increasingly important. Accelerator-based HPC systems look particularly attractive
considering these metrics.
Contents
1. Introduction
2. Introduction to Accelerators
2.2 Accelerators
3. Types of Accelerators
3.1 GPU
3.2 FPGAs
3.3 ClearSpeed
3.4 Cell
4. Case Study
4.1 Overview
5. Summary
References
List of Figures
Chapter 1
INTRODUCTION
For many years, microprocessor single-thread performance increased at rates consistent
with Moore's Law for transistors. From the 1970s to the 1990s the improvement was
mostly obtained by increasing clock frequencies. Clock speeds are now improving only
slowly, and microprocessor vendors are instead increasing the number of cores per chip
to obtain improved performance. This approach does not allow microprocessors to
increase single-thread performance at the rates customers have come to expect.
Alternative technologies include:
· General Purpose Graphics Processing Units (GPGPUs)
· Field Programmable Gate Array (FPGA) boards
· ClearSpeed's floating-point boards
· IBM's Cell processors
These have the potential to provide single-thread performance orders of magnitude
faster than current “industry standard” microprocessors from Intel and AMD.
Unfortunately, performance expectations cited by vendors and in the press are frequently
unrealistic, owing to very high theoretical peak rates but much lower sustainable ones.
Many customers are also constrained by the energy required to power and cool today's
computers. Some accelerator technologies require little power per Gflop/s of
performance and are attractive for this reason alone; others require so much power and
cooling that their use is limited. The HPC space is challenging since it is dominated by
applications that use 64-bit floating-point calculations, and these frequently have little
data reuse. HPCD personnel are also doing joint work with software tool vendors to
help ensure their products work well in the HPC environment. This report gives an
overview of accelerator technologies, the HPC applications space and hardware
accelerators, along with recommendations on which technologies hold the most promise
and speculation on the future of these technologies.
Although the potential of accelerators in HPC is evident for many (but not all)
applications, it might remain unrealized unless the scientific computing community,
computer scientists, technology vendors, and funding agencies do their part to advance
the technology. Several existing challenges can point the way forward.
First, many current efforts to move scientific codes to accelerators are undertaken by the
domain application developers themselves. Although domain experts’ application
insights are hard to replicate, they often end up spending too much time on the
idiosyncrasies of the new systems and programming tools and don’t always obtain the
full benefits of the accelerator capabilities. Teaming up with computer scientists or
application specialists intimately familiar with these systems can help produce better
code and achieve better performance.
Second, many application developers are reluctant to start porting code to application
accelerators because they can’t predict performance benefits without an actual prototype
implementation, which by itself is a substantial effort. We therefore need to develop
performance models against which developers can evaluate the suitability of candidate
applications for application accelerators. We also need to create a set of guides on how
to port applications to different accelerator platforms.
In the past few years, a new class of HPC systems has emerged. These systems employ
unconventional processor architectures, such as IBM's Cell processor and graphics
processing units (GPUs), for heavy computations and use conventional central
processing units (CPUs) mostly for non-compute-intensive tasks, such as I/O and
communication. Prominent examples of such systems include the Los Alamos National
Laboratory's Cell-based Roadrunner (ranked second on the November 2009 TOP500
list) and the Chinese National University of Defense Technology's ATI GPU-based
Tianhe-1 cluster (ranked fifth on the same TOP500 list).
Currently, there is only one large GPU-based cluster serving the US computational
science community, namely Lincoln, a TeraGrid resource available at NCSA. This
will be augmented in the near future by Keeneland, a Georgia Institute of Technology
system funded by the NSF Track 2D HPC acquisition program. On the more exotic front,
the Novo-G cluster, which is based on Altera field-programmable gate arrays (FPGAs),
is deployed at the University of Florida's NSF Center for High-Performance
Reconfigurable Computing (CHREC). By all indications, this trend toward the use of
unconventional processor architectures will continue, especially as new GPUs, such as
Nvidia's Fermi, are introduced. The top eight systems on the November 2009 Green500
list of the world's most energy-efficient supercomputers are accelerator-based.
Early adopters aren't overly concerned about code portability, because in their view,
efforts such as OpenCL and the development of standard libraries (such as Magma, a
matrix algebra library for GPU and multicore architectures) will eventually deliver on
cross-platform portability. Many early adopters are still porting code kernels to a single
accelerator, but a growing number of teams are starting to look beyond simple kernels
and single accelerator chips.
Metrics
There are many metrics that can be used to measure the benefit of accelerators. Some
important ones to consider are listed below; a short illustrative calculation follows the
list.
· Price/performance (Gflop/s per dollar) – the more costly the accelerator, the faster
it must be to succeed.
· Computational density (Gflop/s per cubic metre) – accelerators can improve this
significantly.
· Power efficiency (Gflop/s per watt) – some technologies require very little power,
while others require so much that they cannot be used in low-power systems.
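As a rough illustration (the figures below are invented placeholders rather than
measurements of any real system), each metric is simply a sustained Gflop/s figure
divided by cost, power or volume:

#include <cstdio>

int main()
{
    // Hypothetical accelerator node, purely for illustration.
    const double sustained_gflops = 1000.0;  // sustained Gflop/s (assumed)
    const double price_dollars    = 4000.0;  // hardware cost (assumed)
    const double power_watts      = 250.0;   // power draw (assumed)
    const double volume_m3        = 0.01;    // machine footprint in cubic metres (assumed)

    printf("Gflop/s per dollar      : %.2f\n", sustained_gflops / price_dollars);
    printf("Gflop/s per watt        : %.2f\n", sustained_gflops / power_watts);
    printf("Gflop/s per cubic metre : %.0f\n", sustained_gflops / volume_m3);
    return 0;
}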
Chapter 3
TYPES OF ACCELERATORS
3.1 GPU
3.1.1 Introduction
A graphics processing unit or GPU (also occasionally called visual processing unit
or VPU) is a specialized microprocessor that offloads and accelerates 3D or 2D graphics
rendering from the microprocessor. It is used in embedded systems, mobile phones,
personal computers, workstations, and game consoles. Modern GPUs are very efficient
at manipulating computer graphics, and their highly parallel structure makes them more
effective than general-purpose CPUs for a range of complex algorithms. In a personal
computer, a GPU can be present on a video card, or it can be on the motherboard. More
than 90% of new desktop and notebook computers have integrated GPUs, which are
usually far less powerful than those on a dedicated video card.
The most common operations for early 2D computer graphics include the BitBLT
operation, combining several bitmap patterns using a RasterOp, usually in special
hardware called a "blitter", and operations for drawing rectangles, triangles, circles, and
arcs.
The model for GPU computing is to use a CPU and GPU together in a heterogeneous
co-processing model. The sequential part of the application runs on the CPU, while the
computationally intensive part is accelerated by the GPU. From the user's perspective,
the application simply runs faster because it exploits the high performance of the GPU.
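A minimal CUDA sketch of this division of labour is shown below; the array size, block
size and the SAXPY operation itself are arbitrary choices for illustration. The CPU
handles allocation, initialisation and I/O, while the data-parallel loop runs as a GPU
kernel.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Compute-intensive, data-parallel part: one GPU thread per array element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Sequential part: setup and I/O stay on the CPU.
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Offload the heavy loop to the GPU, then copy the result back.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", hy[0]);   // expected: 4.0
    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}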
The IBM Professional Graphics Controller was one of the very first 2D/3D graphics
accelerators available for the IBM PC.
As the processing power of GPUs has increased, so has their demand for electrical
power: high-performance GPUs often consume more energy than current CPUs and
therefore need substantial cooling. They are fine for a workstation, but not for systems
such as blades that are heavily constrained by cooling. However, floating-point
calculations require much less power than graphics calculations, so a GPU running
floating-point code might use only half the power of one running pure graphics code.
Most GPUs achieve their best performance by operating on four-tuples, each element of
which is a 32-bit floating-point number. The four components are packed together into a
128-bit word that is operated on as a group, making it effectively a vector of length four,
similar to the SSE vector extensions on x86 processors.
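CUDA exposes this packing directly through its built-in float4 type. The short kernel
below is a sketch only; the kernel name and the scaling operation are invented for
illustration, but the 128-bit load and store of one four-tuple per thread are exactly the
packed access pattern described above.

// Scale each packed four-tuple (e.g. an RGBA pixel) by a constant gain.
__global__ void scaleFour(float4 *data, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 v = data[i];   // one 128-bit load of the whole four-tuple
    v.x *= gain;
    v.y *= gain;
    v.z *= gain;
    v.w *= gain;
    data[i] = v;          // one 128-bit store
}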
Today, parallel GPUs have begun making computational inroads against the CPU, and a
subfield of research, dubbed GPU computing or GPGPU (general-purpose computing on
GPUs), has found its way into fields as diverse as oil exploration, scientific image
processing, linear algebra [4], 3D reconstruction and even stock-option pricing.
Nvidia's CUDA platform is the most widely adopted programming model for GPU
computing, with OpenCL also offered as an open standard.
The GPUs of the most powerful class typically interface with the motherboard by
means of an expansion slot such as PCI Express (PCIe) or Accelerated Graphics Port
(AGP) and can usually be replaced or upgraded with relative ease, assuming the
motherboard is capable of supporting the upgrade. A few graphics cards still use
Peripheral Component Interconnect (PCI) slots, but their bandwidth is so limited that
they are generally used only when a PCIe or AGP slot is not available.
A dedicated GPU is not necessarily removable, nor does it necessarily interface with the
motherboard in a standard fashion. The term "dedicated" refers to the fact that dedicated
graphics cards have RAM that is dedicated to the card's use, not to the fact that most
dedicated GPUs are removable.
Hybrid solutions:
This newer class of GPUs competes with integrated graphics in the low-end desktop and
notebook markets. The most common implementations of this are ATI's HyperMemory
and NVIDIA's TurboCache. Hybrid graphics cards are somewhat more expensive than
integrated graphics, but much less expensive than dedicated graphics cards. These share
memory with the system and have a small dedicated memory cache, to make up for the
high latency of the system RAM. Technologies within PCI Express can make this
possible. While these solutions are sometimes advertised as having as much as 768MB
of RAM, this refers to how much can be shared with the system memory.
A new concept is to use a general purpose graphics processing unit as a modified form
of stream processor. This concept turns the massive floating-point computational power
of a modern graphics accelerator's shader pipeline into general-purpose computing
power, as opposed to being hard wired solely to do graphical operations. In certain
applications requiring massive vector operations, this can yield several orders of
magnitude higher performance than a conventional CPU. The two largest discrete GPU
designers, ATI and NVIDIA, are beginning to pursue this new approach with an array
of applications. Both NVIDIA and ATI have teamed with Stanford University to create a
GPU-based client for the Folding@Home distributed computing project, for protein
folding calculations. In certain circumstances the GPU calculates forty times faster than
the conventional CPUs traditionally used by such applications.
Since 2005 there has been interest in using the performance offered by GPUs for
evolutionary computation in general, and for accelerating the fitness evaluation in
genetic programming in particular. Most approaches compile linear or tree programs on
the host PC and transfer the executable to the GPU to be run. Typically the performance
advantage is only obtained by running the single active program simultaneously on
many example problems in parallel, using the GPU's SIMD architecture. However,
substantial acceleration can also be obtained by not compiling the programs, and instead
transferring them to the GPU, to be interpreted there. Acceleration can then be obtained
by either interpreting multiple programs simultaneously, simultaneously running
multiple example problems, or combinations of both. A modern GPU (e.g. 8800 GTX
or later) can readily simultaneously interpret hundreds of thousands of very small
programs.
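The sketch below illustrates this "one small program, many example problems" pattern.
The tiny stack-based instruction set, the kernel name and the 16-entry stack limit are all
invented for illustration, and the host code that copies the program and data to the GPU
is omitted. Each thread interprets the same program on a different test case, so thousands
of cases are evaluated in parallel.

// Invented mini instruction set, purely for illustration.
enum Op { OP_PUSH_X, OP_PUSH_CONST, OP_ADD, OP_MUL, OP_END };
struct Instr { Op op; float arg; };

// Each thread interprets the same program on one example problem (input x).
__global__ void evalProgram(const Instr *prog, const float *inputs,
                            float *outputs, int nCases)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCases) return;

    float stack[16];
    int   sp = 0;
    float x  = inputs[i];

    for (int pc = 0; prog[pc].op != OP_END; ++pc) {
        switch (prog[pc].op) {
            case OP_PUSH_X:     stack[sp++] = x;            break;
            case OP_PUSH_CONST: stack[sp++] = prog[pc].arg; break;
            case OP_ADD:        --sp; stack[sp - 1] += stack[sp]; break;
            case OP_MUL:        --sp; stack[sp - 1] *= stack[sp]; break;
            default:            break;
        }
    }
    // A fitness evaluation would compare this result against a target value.
    outputs[i] = (sp > 0) ? stack[sp - 1] : 0.0f;
}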
3.1.3 Hardware
There are two dominant producers of high-performance GPU chips: NVIDIA and ATI.
ATI was purchased by AMD in November 2006. Until recently both GPU companies
were very secretive about the internals of their processors. However, both are now
opening up their architectures to encourage third-party vendors to produce better-
performing products. ATI has its Close To Metal (CTM) API, which is claimed to be
an Instruction Set Architecture (ISA) for ATI GPUs, so that software vendors can
develop code using the CTM instructions instead of writing everything in terms of
graphics operations.
AMD has also announced the Fusion program, which will place CPU and GPU cores on
a single chip by 2009. An open question is whether the GPU component on the Fusion
chips will be performance-competitive with ATI's high-power GPUs.
3.1.4 Software
Most GPU programs are written in a shader language such as GLSL (the OpenGL
shading language, on Linux and Windows) or HLSL (Windows). These languages are
very different from C, Fortran or the other common high-level languages usually used
by HPC scientists; hence the need to explore other languages that would be more
acceptable to HPC users.
The most popular alternatives to shader languages are stream languages – so named
because they operate on streams (vectors of arbitrary length) of data. These are well
suited to parallelism, and hence to GPUs, since each element in a stream can be operated
on by a different functional unit. The first two stream languages for GPUs were
BrookGPU and Sh (now named RapidMind). BrookGPU is a language that originated in
the Stanford University Graphics Lab to provide a general-purpose language for GPUs.
It contains extensions to C that can be used to operate on the four-tuples with single
instructions. This effort is currently in maintenance mode because its creator has left
Stanford, so our team is not pursuing it. However, in October 2006 PeakStream
announced their successor to BrookGPU. Although they claim their language is really
C++ with new classes, it looks like a new language. They have created some 32-bit
The other language we investigated for programming GPUs is RapidMind. This effort
started at the University of Waterloo and led to the founding of the company RapidMind
to productize the language and compilers. It is a language that is embedded in C++
programs and allows GPUs to be abstracted without directly programming in a shader
language. While rooted in graphics programming, it is also a general-purpose language
that can be used for other technical applications. In addition, the user does not have to
explicitly manage the data passing between the CPU and GPU, as the RapidMind
compiler takes care of setting up and handling this communication. Since this language
was the only viable GPU language suitable for our market, the authors began a series of
technical exchanges with RapidMind personnel. RapidMind has also simplified the
syntax of their language to make it easier to use.
3.2 FPGAs
3.2.1 Introduction
Field Programmable Gate Arrays (FPGAs) have a long history in embedded processing
and specialized computing, including DSP, ASIC prototyping, medical imaging, and
other specialized compute-intensive areas.
An important differentiator between FPGAs and other accelerators is that they are
reprogrammable: you can program them for one algorithm and then reprogram them to
do a different one. This reprogramming step may take several milliseconds, so to be
most effective it needs to be done in anticipation of the next algorithm that will be
needed. FPGA chips seem primitive compared to standard CPUs, since some things that
are basic on standard processors require a lot of effort on FPGAs. For example, CPUs
have functional units that perform 64-bit floating-point multiplication directly, whereas
on FPGAs such an operation must be assembled from several primitive multipliers.
Running code on FPGAs is cumbersome, as it involves several steps that are not
necessary for CPUs. Assuming an application is written in C/C++, the steps include:
· Profile to identify code to run on the FPGA
· Modify the code to use an FPGA C language (such as Handel-C, Mitrion-C, etc.)
· Compile this into a hardware description language (VHDL or Verilog)
· Perform FPGA place-and-route and produce the FPGA "bitfile"
· Download the bitfile to the FPGA
· Compile the complete application and run it on the host processor and FPGA
For example, the latest-generation and largest Xilinx Virtex-5 chip has 192 25x18-bit
primitive multipliers. It takes 5 of these to perform a 64-bit floating-point multiply, and
they can run at speeds up to 500 MHz, so an upper limit on double-precision
multiplication is [192/5] * 0.5 GHz = 19 Gflop/s. A matrix-matrix multiplication
includes multiplications and additions, and the highest claim seen for a complete
DGEMM is about 4 Gflop/s, although numbers as high as 8 Gflop/s have been reported
for data local to the FPGA. Cray XD1 results using an FPGA that is about half the size
of current FPGAs show DGEMM and double-precision 1-D FFTs performing at less
than 2 Gflop/s. Single-precision routines should run several times faster. FPGAs are
very good at small-integer and floating-point calculations with a small number of bits.
The manager of one university reconfigurable computing site noted: "If FPGAs
represent Superman, then double precision calculations are kryptonite."
3.2.2 Hardware
The dominant FPGA chip vendors are Xilinx and Altera. Both companies produce many
different types of FPGAs: some are designed to perform integer calculations while
others are designed for floating-point calculations. Each type comes in many different
sizes, so most HPC users would be interested in the largest (but most expensive) FPGA
that is optimized for floating-point calculations. Other chip companies include the
startup Velogix (FPGAs) and MathStar (FPOAs).
3.2.3 Software
Once again, the software environment is not what the HPC community is used to. There
is a spectrum of FPGA software development tools. At one end is the popular hardware
design language Verilog, used by hardware designers; this gives very good performance,
but the language is very different from what HPC researchers expect. Some vendors
have solutions that are much closer to conventional C++. The conventional wisdom is
that the closer to standard C the solution is, the worse the resulting performance.
3.3 ClearSpeed
ClearSpeed has a beta release of a software development kit that includes a compiler.
There are two ways to use the ClearSpeed boards. One is to make a call to a routine
from their math library, which contains an optimized version of the matrix-matrix
multiply subprogram DGEMM. The other way to access the boards is to write routines
using the ClearSpeed accelerator language Cn. The first accelerator-enhanced system to
make the TOP500 list was the TSUBAME grid cluster in Tokyo: it is entry 9 on the
November 2006 list and derives about a quarter of its performance from ClearSpeed
boards and the rest from Opteron processors.
3.4 Cell
3.4.1 Introduction
The Cell Broadband Engine—or Cell (Refer Fig 3.4.1.1) as it is more commonly
known—is a microprocessor designed to bridge the gap between conventional desktop
processors (such as the Athlon 64 and Core 2 families) and more specialized high-
performance processors, such as the NVIDIA and ATI graphics-processors (GPUs). The
longer name indicates its intended use, namely as a component in current and future
digital distribution systems; as such it may be utilized in high-definition displays
and recording equipment, as well as computer entertainment systems for the HDTV era.
Additionally the processor may be suited to digital imaging systems (medical, scientific,
etc.) as well as physical simulation (e.g., scientific and structural engineering
modeling).
To achieve the high performance needed for mathematically intensive tasks, such as
decoding/encoding MPEG streams, generating or transforming three-dimensional data,
or undertaking Fourier analysis of data, the Cell processor marries the SPEs and the
PPE via the EIB to give access, via fully cache-coherent DMA (direct memory access),
to both main memory and other external data storage. To make the best use of the EIB,
and to
overlap computation and data transfer, each of the nine processing elements (PPE and
SPEs) is equipped with a DMA engine. Since the SPE's load/store instructions can only
access its own local memory, each SPE entirely depends on DMAs to transfer data to
and from the main memory and other SPEs' local memories. A DMA operation can
transfer either a single block area of size up to 16KB, or a list of 2 to 2048 such blocks.
One of the major design decisions in the architecture of Cell is the use of DMAs as a
central means of intra-chip data transfer, with a view to enabling maximal asynchrony
and concurrency in data processing inside a chip.
3.4.2 Architecture
Cell has a total of nine cores and is a heterogeneous multiprocessor with a unique
design, boasting an impressive theoretical peak performance of over 200 Gflop/s.
"Heterogeneous" refers to the nine cores being of two different types, each specializing
in different tasks. This is a completely different approach from other multi-core
processors from, for example, Intel and AMD, where all cores are of the same type and
therefore have the same strengths and weaknesses. Initial research by IBM and others
has shown that Cell outperforms these commodity processors by several factors for
certain types of scientific kernels and can achieve near-peak performance.
The first major commercial application of Cell was in Sony's PlayStation 3 game
console. Mercury Computer Systems has a dual Cell server, a dual Cell blade
configuration, a rugged computer, and a PCI Express accelerator board available in
different stages of production. Toshiba has announced plans to incorporate Cell in high
definition television sets. Exotic features such as the XDR memory subsystem and
coherent Element Interconnect Bus (EIB) interconnect appear to position Cell for future
applications in the supercomputing space to exploit the Cell processor's prowess in
floating point kernels. IBM has announced plans to incorporate Cell processors as add-
on cards into IBM System z9 mainframes, to enable them to be used as servers for
MMORPGs.
Several typical HPC kernels have been ported to the Cell processor and their
performance compared to other leading commodity processors. Initial results are very
impressive, showing that Cell can be up to 25x faster than other leading commodity
processors from Intel and AMD. Results also show that near-peak performance is
achievable on the Cell processor. Besides the HPC kernels, a small subset of a well-
known real-world image library has also been ported to the Cell processor. This has
resulted in a Cell extension for the library, with optimized library functions that take
advantage of Cell's unique architecture. The extended library encapsulates the
complexity of the Cell processor, is very easy to use and requires almost no knowledge
of the Cell processor. Comparing the performance of the extended library functions on
the Cell processor with the performance of the original library functions on the
commodity processors shows that the extended Cell library can perform up to 30 times
better. Thus initial results are very encouraging and show that Cell has real potential for
image-processing workloads.
Since the emergence of the first computers, the appetite for more computing
performance has proved to be insatiable. Every year, companies and research institutes
around the world crave more and more performance. Traditionally this performance has
been delivered by monolithic supercomputers; in recent times, however, these have
given way to High Performance Computing (HPC) systems, which are now the most
prevalent way of achieving large amounts of computing performance. According to
International Data Corporation (IDC), the HPC market grew more than 102% from 2002
to 2006, and IDC expects it to grow by an additional 9.1% annually [66, 67]. This turn
toward HPC is due to the relatively low cost and high performance of commercial
off-the-shelf hardware, which means that supercomputer performance can now be
achieved at a fraction of its former cost.
A hybrid Cell cluster also has the advantage that cluster applications based on
general-purpose (GP) processors can run directly on the hybrid Cell HPC cluster,
without porting or recompilation. One can then port key parts of the applications to use
the Cell processors as accelerators. Therefore, due to the limitations of the PPE and the
above benefits of hybrid clusters, such a Cell HPC cluster must, in addition to Cell
processors, also contain a number of GP processors.
At the center of this are commodity processors, whose prices have plummeted over the
years while their performance has increased. Commodity processors are now standard in
many HPC systems, and developers are on a constant lookout for new, affordable
processors that will increase the performance of their next-generation HPC systems. In
this light the Cell processor is very interesting: it outperforms commodity processors
from Intel and AMD by several factors, both in terms of theoretical peak performance
and actual performance, and is therefore an ideal candidate for HPC.
PlayStation 3 cluster
The considerable computing capability of the PlayStation 3's Cell microprocessors has
raised interest in using multiple, networked PS3s for various tasks that require
affordable high-performance computing.
PS3 Clusters
The NCSA has already built a cluster based on the PlayStation 3. Terra Soft Solutions
has a version of Yellow Dog Linux for the PlayStation 3, and sells PS3s with Linux
pre-installed, both as single units and as 8- and 32-node clusters. In addition, RapidMind
is pushing their stream programming package for the PS3.
Single PS3
Even a single PS3 can be used to significantly accelerate some computations. Marc
Stevens, Arjen K. Lenstra, and Benne de Weger have demonstrated the use of a single
PS3 to speed up the search for MD5 hash collisions.
Although game developers are only now beginning to take advantage of the PS3's
processing ability, the United States Air Force has taken the claim literally. The Stars
and Stripes newspaper announced a $2 million government project to create a research
supercomputer using 2,000 PS3s. The project will be headed by the Air Force Research
Laboratory in Rome, New York.
Case Study
4.1 Overview
The NVIDIA® Tesla™ 20-series is designed from the ground up for high-performance
computing. It is based on the next-generation CUDA GPU architecture, codenamed
“Fermi”, which is the third-generation CUDA architecture, and it supports many “must
have” features for technical and enterprise computing. These include ECC memory for
uncompromised accuracy and scalability, support for C++, and 8x the double-precision
performance compared to Tesla 10-series GPU computing products. When compared to
the latest quad-core CPU, Tesla 20-series GPU computing processors deliver equivalent
performance at 1/20th the power consumption and 1/10th the cost.
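As an illustration of how such a card is typically used for double-precision work, the
sketch below offloads a DGEMM (double-precision matrix multiply) to the GPU through
NVIDIA's cuBLAS library. The matrix size and contents are arbitrary and error checking
is omitted for brevity.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int n = 1024;                                   // arbitrary matrix dimension
    std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0), hC(n * n, 0.0);

    double *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(double));
    cudaMalloc(&dB, n * n * sizeof(double));
    cudaMalloc(&dC, n * n * sizeof(double));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    // C = alpha * A * B + beta * C, all matrices n x n, column-major.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(double), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}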
An indicator of CUDA adoption is the ramp of the Tesla GPU for GPU computing.
There are now more than 700 GPU clusters installed around the world at Fortune 500
companies ranging from Schlumberger and Chevron in the energy sector to BNP
Paribas in banking.
The GPU Computing System seamlessly fits into enterprise server clusters and scales to
solve the most complex computing problems.
Using an SGI Altix cluster with compute nodes and SGI TP storage (64 cores) for
CAD/CAM applications: MSC Nastran, Star-CD, UG, etc.
Using an SGI cluster and 4700-series HPC system (64 cores, SMP) with storage for
weather modeling and high-performance computing
Using an SGI Altix 4700 with 32 cores and SGI TP storage (SMP) for geological and
GIS applications
5. INCOIS, Hyderabad
7. TIFR, Mumbai: Using an HP Intel Itanium and Xeon based cluster with storage for
home-grown applications
8. Delhi University: Using an HP C8000 chassis-based cluster with InfiniBand and
Gigabit Ethernet for home-grown applications
9. Tata Steel: Using an SGI Altix 1300 80-core cluster with InfiniBand and Gigabit
Ethernet for analysis and Fluent
Summary
There are multiple families of accelerators suitable for executing applications from
portions of the HPC space. These include GPGPUs, FPGAs, ClearSpeed boards and the
Cell processor. Each type is good for specific types of applications, but they all need
applications with a high ratio of calculation to memory references. They are best at the
following:
· GPGPUs: graphics, 32-bit floating point
· FPGAs: embedded applications, applications that require a small number of bits
· ClearSpeed: matrix-matrix multiplication, 64-bit floating point
· Cell: graphics, 32-bit floating point
Common traits of today's accelerators include slow clock frequencies, performance
obtained through parallelism, low-bandwidth connections to the CPU, and a lack of
standard software tools.
References
[5] FPGAs: http://www.xilinx.com/support/documentation/white_papers/wp375_HPC_Using_FPGAs.pdf