Beyond Graphics - The Present and Future of GP-GPU: Cuda Nvidia
It wasn't so long ago that 3D graphics cards were only expected to deliver higher frames-per-
second in your favorite 3D games. Sure, the graphics companies fought over some image quality
issues like the internal color processing precision and the quality of anti-aliasing or anisotropic
filtering, but even that was targeted at game performance and quality. Of course, there have been
graphics cards for years now designed for the professional 3D market—CAD/CAM, industrial
design, folks like that. Still, it's all 3D rendering of some form or another.
The first hint of graphics cards doing something "more than 3D" was with the introduction of video
acceleration. It started out simple, with partial decoding of MPEG video, moving gradually into full
acceleration of the MPEG2 used in DVDs, and today is quite robust. Modern graphics cards
accelerate much of the decoding steps required for sophisticated codecs like VC-1 and H.264 (both
used in Blu-ray movies), along with de-interlacing, noise reduction, dynamic contrast control, and
more. Much of this work is done in dedicated video hardware on the GPU.
The release of DirectX 10, with unified vertex/pixel/geometry shaders and stream-out functions, brought with it a class of hardware that is more flexible and more easily able to handle other computing tasks. Research into using the powerful parallel processing of GPUs has been going on for years. The field is known as "GP-GPU," for general-purpose computing on a GPU, and it's about ready for the mainstream. Here's some of what you can look forward to.
Nvidia's CUDA is relatively simple, as stream processing languages go. It's based on C, with some
extensions, so it's pretty familiar to developers. Writing code that is highly parallel and manages
data to work optimally in a GPU's memory systems is tricky, but the payoffs are great. In the high
performance computing (HPC) environment, where large clusters or supercomputers are purpose-
built to perform specific tasks with custom software, CUDA has gained a lot of traction. Financial
analysis models, oil and gas exploration software, medical imaging, fluid dynamics, and other
tough "big iron" tasks are already using CUDA together with Nvidia GPUs to realize speed
improvements an order of magnitude or two greater than when running on CPUs.
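To give a flavor of what "C with some extensions" looks like in practice, here's a minimal, hedged sketch of a CUDA program that scales a million floats in parallel on the GPU. The kernel name, data, and launch configuration are our own illustrative choices, not code taken from any shipping CUDA application.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each GPU thread scales exactly one element of the array.
__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                  // one million floats
    size_t bytes = n * sizeof(float);

    // Prepare data on the CPU side.
    float *hostData = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        hostData[i] = 1.0f;

    // Copy to GPU memory, launch thousands of threads, copy the results back.
    float *devData;
    cudaMalloc(&devData, bytes);
    cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleArray<<<blocks, threadsPerBlock>>>(devData, 2.0f, n);

    cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);
    printf("first element after scaling: %.1f\n", hostData[0]);

    cudaFree(devData);
    free(hostData);
    return 0;
}

The tricky part isn't boilerplate like this; it's structuring real workloads so thousands of threads stay busy and memory accesses line up well, which is where the engineering effort in those HPC ports actually goes.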
ATI's approach is similar in some ways, different in others. The company's first attempt at
exposing the GPU to general purpose computing tasks was through a "Close to the Metal" or CTM
driver. Nobody really wants to work at the very low level required for CTM, so ATI has evolved
their stream computing efforts into the Stream SDK, which includes a number of features. Low-
level access is provided through a more accessible CAL (Compute Abstraction Layer), and code
libraries include the AMD Core Math Library (ACML), AMD Performance Library (APL), and a video
transcode library called COBRA. Unfortunately, not all of these parts are publicly available as part
of the SDK just yet. For example, the ACML is an existing product to optimize math functions on
AMD CPUs, and the alpha version with GPU optimized math libraries is only available to limited
parties by request.
Though AMD's stream computing efforts haven't seen HPC adoption quite as broad as Nvidia's, they have still gained traction, with a number of vendors building very similar applications. Our focus today isn't on these apps, though. We're concerned with the consumer space. What's in it for us?
Folding@Home
Stanford University runs one of the most popular distributed computing applications
around, Folding@Home. It calculates protein folding on a massive scale, using thousands of
computers and PlayStation 3s around the world. The idea is to better understand how proteins
"fold" or assemble themselves, and how the process goes wrong sometimes—which is thought to
be at the heart of diseases like Alzheimer's and Parkinson's.
For a while now, the labs at Stanford have been working with ATI to produce a GPU-accelerated
version of their folding software. Now, the second generation of this GPU folding app is freely
available in beta, and it uses the more reliable and better-performing CUDA for Nvidia GPUs, or
CAL for ATI cards.
A quick look at the Client Stats page shows that there are, at the time of this writing, about 7600
active GPUs running the FAH app, generating around 840 teraflops of computational power (that's not a theoretical peak number; it's real work running the FAH computations). That's somewhere around 110 gigaflops per GPU, on average. To put that in perspective, the regular Windows CPU client is about one gigaflop per client (it's a mix of the single-threaded client and the multi-core SMP version). The PS3 looks like it leads the pack with a total of 1,358 teraflops, but that's from over 48,000 active PS3s, so each one is actually delivering only about 28 gigaflops.
In other words, the average GPU client is four times faster than a PS3. That includes relatively
older and low-end graphics cards, too. Newer cards are likely six to eight times faster.
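As a quick back-of-the-envelope check on those averages (a sketch using the totals quoted above, which will drift as clients come and go), the per-client math works out roughly like this:

#include <stdio.h>

// Rough per-client throughput from the Folding@Home client stats quoted above.
int main()
{
    double gpuTeraflops = 840.0,  activeGpus = 7600.0;
    double ps3Teraflops = 1358.0, activePs3s = 48000.0;

    double perGpu = gpuTeraflops * 1000.0 / activeGpus;  // ~110 gigaflops per GPU
    double perPs3 = ps3Teraflops * 1000.0 / activePs3s;  // ~28 gigaflops per PS3

    printf("per GPU: %.0f GFLOPS, per PS3: %.0f GFLOPS, ratio: %.1fx\n",
           perGpu, perPs3, perGpu / perPs3);             // ratio comes out near 4x
    return 0;
}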
Right now, Nvidia's cards are better folders, due primarily to better optimized code. With the latest
drivers, most GeForce cards are getting pretty close to peak utilization. ATI's cards, which rely on
their CAL driver, still seem to have a lot of headroom. In fact, the new Radeon HD 4800 cards have 800 stream processors, but the current client runs on them as if they were older cards with only 320.
The GPU2 Folding@Home client itself, and the CAL/CUDA drivers from the graphics manufacturers,
still need some optimization. The performance, display features, and reliability are all going to
improve over the coming weeks. It's simply too early to compare one architecture or graphics card against another with regard to FAH. Still, the current client is stable and fast enough to be usable,
and you should use it if you can.
Over here, we all band together on the truly impressive DL.TV folding team, currently #10 in the
worldwide rankings and climbing fast. We'd love it if you devoted some of your computer power to
helping discover cures for some truly nasty diseases with us. Just join team 57391 when you
install Folding@Home and you're all set. You can see some of our team stats here.
Video Transcoding
Perhaps one of the worst experiences in modern consumer PC use is video transcoding. You have a
video downloaded from the 'net, or taken from your camcorder, and you want to put it on your
iPhone or Zune or burn it to a DVD. So you fire up your video conversion tool or iTunes or video
editing software and you wait…and wait…and wait. Your whole computer becomes unusable as you
wait for 15 minutes, watching a progress bar slowly crawl across the window. Or worse, you
transcode hi-def video or advanced formats like H.264 and have to walk away from your computer
for hours.
While graphics cards have performed high-quality video decoding for a while now, we're just now on
the cusp of good GPU-accelerated transcoding. A good $200 graphics card can make your video
conversions go faster than real-time, sometimes much faster, turning those 15 minute waits into a
minute or less. The overnight conversions can be done during a coffee break.
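Why do GPUs chew through this work so well? A video frame is just a big grid of pixels, and many encoding stages apply the same math to every pixel independently, which maps neatly onto thousands of GPU threads. As a rough, hypothetical illustration (not code from RapiHD, BadaBOOM, or any shipping transcoder), here's a CUDA sketch of one such stage: converting an RGB frame to the luma plane an encoder works on, using the standard BT.601 weights.

#include <stdint.h>
#include <cuda_runtime.h>

// One thread per pixel: convert interleaved 8-bit RGB to 8-bit luma (BT.601 weights).
// A real encoder runs dozens of stages like this per frame (scaling, motion search,
// transforms), which is why the work parallelizes so well on a GPU.
__global__ void rgbToLuma(const uint8_t *rgb, uint8_t *luma, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    int idx = (y * width + x) * 3;           // interleaved R, G, B bytes
    float r = rgb[idx], g = rgb[idx + 1], b = rgb[idx + 2];
    luma[y * width + x] = (uint8_t)(0.299f * r + 0.587f * g + 0.114f * b);
}

// Launch for a 720p frame already resident in GPU memory.
void convertFrame(const uint8_t *devRgb, uint8_t *devLuma)
{
    dim3 block(16, 16);
    dim3 grid((1280 + block.x - 1) / block.x, (720 + block.y - 1) / block.y);
    rgbToLuma<<<grid, block>>>(devRgb, devLuma, 1280, 720);
}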
Nvidia is working together with Elemental Technologies on a couple of products to encode video
with GeForce cards. RapiHD is a plug-in for Adobe Premiere Pro CS3 that handles video scaling and conversion, audio and video encoding and decoding, and video capture, all on the GPU. For general users who aren't video professionals with a budget for
professional editing software and plugins, the same company will soon release
the BadaBOOM media converter. There's no set price yet, but expect it to be in the range of
general consumer software (maybe $20–50).
In the beta we used, you're limited to transcoding to a number of popular portable formats (PSP, iPhone, iPod, Zune), and you won't find many options beyond that. Part of the bargain price
appears to be its limited utility—you can't convert a hi-def AVCHD clip to a hi-def H.264 clip to
burn onto Blu-ray, for instance. But there's no denying the performance. We transcoded a 1
minute, 50 second MPEG clip at 720p resolution to an iPhone MP4 format in about 30 seconds with
excellent quality. Even the fastest CPU-based encoders, optimized for quad-core CPUs, would take
four times as long.
Of course, Nvidia isn't the only one with accelerated video conversion on the
way. CyberLink supports GPU-acceleration on ATI graphics cards in PowerDirector 7, which is a
pretty big win and is already shipping. You can even transcode several files at once with pretty
incredible speed. We're told plug-ins for professional products like Adobe Premiere have been in
the works for some time already, though brand names and release dates are still under wraps.
Flash Player 10 adds a host of effects and capabilities, many of which leverage graphics cards.
Pixel Bender lets Flash authors use filters to warp or change the visual display of their Flash
application or video. A new 3D API built into Flash gives devs the ability to more easily draw and
manipulate 3D objects. All modern Nvidia and ATI GPUs are used to accelerate this stuff, as well as
to accelerate video rendering for smoother performance and better video quality.
Of course, one can't mention Adobe without bringing to mind its iconic image editing package, Photoshop. It's so popular, it has become a verb. The good news here is that the next version of Photoshop is going to accelerate several functions on both ATI and Nvidia GPUs. In early demos at both ATI and Nvidia press events, we saw zooming, panning, and rotating of enormous many-megapixel images go as fast as you can move the mouse. Manipulating
and painting on 3D models is fast and smooth, and panoramas can be stitched together into a 3D
sphere and directly manipulated. There's more to come, but the next Photoshop is still deep in
development. There's no set release date yet, but it's estimated to hit the market before the end
of the year.
Game Physics Battle
You may have heard by now that Nvidia purchased AGEIA a while back, and is now in control of the PhysX middleware. Of course, the first thing they did was take the acceleration layer that used to let the PhysX stuff run on an AGEIA PPU card and port it to run on GeForce cards. A current beta
driver from Nvidia enables this support on GeForce 9800 GTX and GeForce GTX 260/280 cards,
along with a new version of the PhysX driver.
Right now there aren't many apps that support hardware-accelerated physics with the PhysX
middleware. Unreal Tournament 3 is perhaps the best-known, but all the maps and core gameplay are designed with the limitations of CPU-based physics in mind. So if you want to get anything out of physics in UT3, you really need to run special "PhysX enabled maps" that are loaded with
breakable stuff and so on.
The other prominent app to use PhysX is 3DMark Vantage. The PhysX middleware is used for one
particular CPU test, while some other individual feature tests and the game tests do some physics
stuff (particles, cloth) through DirectX with Futuremark's own code. With the aforementioned GPU
and PhysX driver, that one CPU test goes much faster. On a Core 2 Extreme QX9650, we get about
18 operations a second with the CPU. With a GeForce GTX 260, that goes up to around 130
operations per second.
Some would argue that this is cheating. And in fact, Futuremark's own rules and guidelines
prohibit graphics drivers from substantially affecting the CPU tests. Further, some will claim that a single GPU handling physics in that test isn't also being stressed with graphics work, so it's not a realistic representation of how much physics acceleration you'd get with one graphics card in a real game running both PhysX and graphics.
On the other hand, this is not a driver that is specific to 3DMark Vantage—it does accelerate any
game or application that supports the PhysX PPU hardware acceleration, and thus is a "valid"
measure of game physics acceleration. Either way, the drivers to enable PhysX acceleration have
not been submitted to Futuremark as of this writing, so they won't comment on whether or not
they're "legal" and appropriate for comparisons. It can make a substantial difference in the overall
3DMark score on the Entry or Performance settings, but much less so on High or Extreme, where
the CPU tests are weighted less.
Will ATI owners be left in the dust as physics acceleration comes online on Nvidia cards? Not
necessarily. Equally big in the physics middleware space, if not bigger, is Havok. Since Havok was recently purchased by Intel, one would think it wouldn't be interested in working with AMD, but the opposite is true. It just announced a close working partnership with AMD to optimize its physics middleware for AMD's processors, and that includes ATI graphics cards. They're much
further from shipping a real working solution than Nvidia, however.
Then there's this juicy rumor: An enthusiast hacker from NGOHQ.com supposedly toyed around with ATI's drivers and Nvidia's PhysX driver and got PhysX acceleration running on a Radeon HD 3800 series card.
It's one of those "interesting if true" moments that makes you think it shouldn't be very hard to
get PhysX working on ATI's hardware as well.
Certainly there's a lot of politics at play here. Nvidia has a financial stake in making sure PhysX is widely adopted, and adds value to their graphics cards by enabling support only on them.
AMD/ATI has no reason to increase developer support for PhysX—middleware owned by a
competitor—by enabling hardware support for it. Or perhaps there are licensing fees involved that
make it financially prohibitive. We may never know.
What we do know is this: The current situation is bad for consumers. For the foreseeable future,
there will be plenty of games coming to market using the PhysX middleware, and just as many
using Havok middleware. The idea that you would only get hardware acceleration of physics on
your graphics card for one or the other depending on whether you opted for ATI or Nvidia is bad
news. It's like buying a CPU and finding that some future game runs 10 times better on your system if you chose an AMD chip, while another runs 10 times better if you bought Intel, all because of licensing and software support issues.
Fortunately, there are some standards in the works. Apple recently began work with the Khronos
Group, which manages the OpenGL and OpenGL ES graphics standards and the OpenSL ES audio standard, to develop a heterogeneous computing API called OpenCL (for "Open Computing Language"). The goal is to build
programming standards for data and task parallel computing for a variety of devices, including
GPUs and CPUs. It seems like everyone is on board, with the initial participants in the working
group including Nvidia, AMD, Apple, IBM, Intel, ARM, and a host of others.
Not much is known about OpenCL just yet because it's in the early stages. The idea is to have a C-based language for which GPU and CPU vendors would write drivers. An application developer would write an OpenCL-based stream computing app, and it would then run on any hardware with an OpenCL driver.
The downside is that, if history is any indication, the movement of the Khronos Group can be a
little slow. The rush to tap into the massive compute power of GPUs is on, and OpenCL may not be
of much help if it takes another 18 months to get the standard ratified and drivers on the market.
On the other hand, Apple is hot to include it as a core feature of OS X v10.6 ("Snow Leopard"), so
that might push things along quickly.
The other standard worth watching is, once again, DirectX. DirectX 11 is set to include a few new
major features, one of which is what they're calling a "Compute Shader." Again, DX11 is still a
ways out and we don't know much yet about this new Compute Shader or how it will be accessed
by programmers. It will certainly be Windows-only as current DirectX technologies are, but that
hardly stopped Direct3D from dominating the consumer graphics landscape. The question is: will the Compute Shader in DX11 be general enough to work well in non-graphics applications? Will the
programming model be something that is relatively easy to work with, and makes sense for a wide
variety of applications? And when will it hit the market? Will it be part of Windows 7, or will
Windows Vista users be able to download and use DX11?
Right now there are more questions than answers about both OpenCL and DirectX 11, but that
should change over the course of the year as these standards take shape. The good news is that
both Nvidia and AMD/ATI are perfectly eager to support these standards.
Over the next couple of years, we should see an increased emphasis on how well GPUs run not
only graphics applications, but general purpose applications, with hardware features devoted to
speeding up GP-GPU tasks. This has already begun, of course, and will accelerate over time.
Ideally, your graphics card should be able to smoothly balance the demands of a graphics
application with general purpose computing tasks, whether it is video encoding, protein folding,
game physics, or AI computation.
No discussion of GP-GPU's future would be complete without mentioning Intel's future GPU, code-
named Larrabee. We don't know a lot about the product yet, but we do know that Intel will position it to compete with high-end graphics products from Nvidia and ATI. It will feature some silicon devoted specifically to graphics functions, but the main unified shader block will be composed of a whole mess of x86-compatible cores. Of course, they'll have newfangled SIMD
extensions on there, too. The goal is to make a graphics product that excels at rasterization, but
most importantly, takes general-purpose computing forward by a big step by making the GPU
easier to program for non-graphics tasks.
Intel certainly has a lot to prove when it comes to graphics, and we've heard promises of future
greatness many times before. On the other hand, we know it's a major focus of the company and
Intel doesn't often stumble on those. Larrabee isn't due out until late 2009 or 2010, and nobody
really knows what the market will look like by then. Will standards like OpenCL and DirectX 11
make an "easier to program GPU" a non-issue by that point? Will the efforts of AMD and Nvidia,
together with their own new hardware, make for the kind of still competition that Intel simply can't
overcome in its first real attempt?
It wasn't that long ago that Intel was going to take over the peripherals market with webcams and
game controllers, after all, and we know how well that went. Advanced silicon is the lifeblood of Intel, and owning fabs that seem to stay one step ahead of the rest of the world can give the company a real competitive advantage.
The GP-GPU market right now is like peering into a pretty hazy crystal ball. There's so much movement along several fronts that it's hard to know how it will
shake out. For consumers, there's still not a lot to do with your GPU besides graphics, but that will
change by the end of the year as more video encoding applications come online. For really broad,
industry-wide usage of the GPU as a processing resource in your computer, we need standards,
and those are coming around as well. It will still take a year or more, but the day is coming when
a graphics card review will have as many charts devoted to GP-GPU performance as game
benchmarks. And when that day comes, all our computers will suddenly seem a lot more powerful.