
UNIT IV: Visible Surface Detection & GPU Architecture
Visible-surface detection
 It is also referred to as hidden-surface elimination.
 Given a set of 3D objects and a viewing specification, we wish to determine which lines or surfaces
of the objects are visible, so that we can display only the visible lines or surfaces. This process is
known as hidden-surface (or hidden-line) elimination, or visible-surface determination.
 A hidden-line or hidden-surface algorithm determines the lines, edges, surfaces, or volumes that
are visible or invisible to an observer located at a specific point in space.
 We can broadly classify visible-surface detection algorithms according to whether they deal with the
object definitions or with their projected images.

 These two approaches are called object-space methods and image-space methods, respectively:
 An object-space method compares objects and parts of objects to each other within the scene definition to
determine which surfaces, as a whole, we should
label as visible.
 In an image-space algorithm, visibility is decided point by point at each pixel position on the projection
plane.
Back-face detection

 A fast and simple object-space method for locating the back faces of a
polyhedron is based on front-back tests. A point (x, y, z) is behind a polygon
surface if
Ax + By + Cz + D < 0 (1)
where A, B, C, and D are the plane parameters for the polygon.
 When this position is along the line of sight to the surface, we must be looking
at the back of the polygon. Therefore, we could use the viewing position to
test for back faces.
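A minimal sketch of this test (hypothetical helper names; assuming, as is common, a right-handed viewing system with the viewer looking along the negative z axis, so that a polygon is a back face when the z component C of its normal is non-positive):

/* Back-face test sketch. Plane: Ax + By + Cz + D = 0, normal N = (A, B, C). */
typedef struct { double A, B, C, D; } Plane;

int is_behind(Plane p, double x, double y, double z) {
    return p.A * x + p.B * y + p.C * z + p.D < 0;   /* Eq. (1) */
}

int is_back_face(Plane p) {
    return p.C <= 0;   /* normal has no component toward the viewer */
}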
Depth-buffer Method
 A commonly used image-space approach for detecting visible surfaces is the depth-buffer method,
which compares surface depth values throughout a scene for each pixel position on the projection
plane.
 Each surface of a scene is processed separately, one pixel position at a time, across the surface.
 The algorithm is usually applied to scenes containing only polygon surfaces, because depth values
can be computed very quickly and the method is easy to implement.
 This visibility-detection approach is also frequently referred to as the z-buffer method, because
object depth is usually measured along the z axis of a viewing system.
 Drawback: A drawback of the depth-buffer method is that it identifies only
one visible surface at each pixel position. In other words, it deals only with
opaque surfaces and cannot accumulate color values for more than one
surface, as is necessary if transparent surfaces are to be displayed.
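A minimal sketch of the method (assuming normalized depths in [0, 1], with 1.0 the farthest value; buffer sizes and the plot routine are hypothetical):

/* Depth-buffer (z-buffer) sketch: one depth and one color per pixel. */
#define W 640
#define H 480

double   depth_buf[H][W];   /* nearest depth seen so far at each pixel    */
unsigned color_buf[H][W];   /* color of the nearest surface at each pixel */

void clear_buffers(unsigned background) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            depth_buf[y][x] = 1.0;          /* farthest possible depth */
            color_buf[y][x] = background;
        }
}

/* Called for every projected point of every polygon surface. */
void plot(int x, int y, double z, unsigned color) {
    if (z < depth_buf[y][x]) {              /* closer than what is stored? */
        depth_buf[y][x] = z;
        color_buf[y][x] = color;
    }
}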
A-Buffer Method

 An extension of the depth-buffer idea is the A-buffer procedure (at the
other end of the alphabet from “z-buffer,” where z represents depth).
 This depth-buffer extension is an antialiasing, area-averaging, visibility-
detection method developed at Lucasfilm Studios for inclusion in the surface-
rendering system called REYES (an acronym for “Renders Everything You Ever
Saw”).
 The buffer region for this procedure is referred to as the accumulation buffer,
because it is used to store a variety of surface data, in addition to depth
values.
Surface information in the A-buffer
includes: RGB intensity components, an opacity parameter (percent of transparency), depth,
the percent of area coverage, a surface identifier, and other surface-rendering parameters.
Scan-line Method

 This image-space method for identifying visible surfaces computes and
compares depth values along the various scan lines for a scene.
 As each scan line is processed, all polygon surface projections intersecting
that line are examined to determine which are visible.
 Across each scan line, depth calculations are performed to determine which
surface is nearest to the view plane at each pixel position.
 When the visible surface has been determined for a pixel, the surface color
for that position is entered into the frame buffer.
 Surfaces are processed using the information stored in the polygon tables.
 The edge table contains coordinate endpoints for each line in the scene, the inverse slope of
each line, and pointers into the surface-facet table to identify the surfaces bounded by each
line.
 The surface-facet table contains the plane coefficients, surface material properties, other
surface data, and possibly pointers into
the edge table.
 To facilitate the search for surfaces crossing a given scan line, an active list of edges is formed
for each scan line as it is processed.
 The active edge list contains only those edges that cross the current scan line, sorted in order of
increasing x.
 In addition, we define a flag for each surface that is set to “on” or “off” to indicate whether a
position along a scan line is inside or outside the surface.
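A minimal sketch of the bookkeeping these tables imply (hypothetical field names; the text describes the tables only abstractly):

/* Edge-table and surface-facet-table entries for the scan-line method. */
typedef struct {
    double x_top, y_top, x_bottom, y_bottom; /* coordinate endpoints of the line */
    double inv_slope;                        /* 1/m, for stepping x per scan line */
    int    surface_id[2];                    /* surfaces bounded by this edge     */
} Edge;

typedef struct {
    double   A, B, C, D;  /* plane coefficients, give depth at any (x, y)     */
    unsigned color;       /* surface material / rendering properties          */
    int      flag;        /* "on" while the scan line is inside the surface   */
} Surface;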
Depth-sorting Method

 Using both image-space and object-space operations, the depth-sorting
method performs the following basic functions:
1. Surfaces are sorted in order of decreasing depth.
2. Surfaces are scan-converted in order, starting with the surface of greatest
depth.

 Sorting operations are carried out in both image and object space, and the
scan conversion of the polygon surfaces is performed in image space.
 This visibility-detection method is often referred to as the painter’s
algorithm.
Painter’s Algorithm
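A minimal sketch of the idea, assuming surfaces that have already passed the method's overlap tests (hypothetical names; scan_convert stands in for rasterizing one surface into the frame buffer):

#include <stdlib.h>

typedef struct { double depth; /* farthest distance from the view plane */ } Surf;

void scan_convert(Surf *s);   /* hypothetical rasterization routine */

static int farthest_first(const void *a, const void *b) {
    double da = ((const Surf *)a)->depth, db = ((const Surf *)b)->depth;
    return (da < db) - (da > db);   /* descending: deepest surface first */
}

void painters_algorithm(Surf *s, int n) {
    qsort(s, n, sizeof *s, farthest_first);
    for (int i = 0; i < n; i++)
        scan_convert(&s[i]);   /* nearer surfaces, drawn later, simply
                                  overwrite deeper ones in the frame buffer */
}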
Constructive Solid-Geometry Methods

 This method creates a new volume using Boolean set operations (e.g.- union, intersection or
difference operation) on two specified volumes.
 Initially a set of primitive 3d volumes are available e.g.- cubes, spheres ,pyramids, cylinders,
cones and closed spline surfaces.
Tree representation for a CSG object
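A minimal sketch of a CSG tree node (hypothetical types; leaves hold primitives, internal nodes hold a Boolean set operation):

typedef enum { CSG_UNION, CSG_INTERSECTION, CSG_DIFFERENCE, CSG_PRIMITIVE } CsgOp;

typedef struct CsgNode {
    CsgOp op;                     /* set operation, or CSG_PRIMITIVE for a leaf */
    struct CsgNode *left, *right; /* operand subtrees (NULL for leaves)         */
    void *primitive;              /* leaf payload: cube, sphere, cylinder, ...  */
} CsgNode;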
Binary Space Partitioning (BSP) Tree

•One of a class of “list-priority” algorithms – returns an ordered list of
polygon fragments for a specified view point (static pre-processing
stage).

•Choose a polygon arbitrarily.

•Divide the scene into front (relative to the normal) and back half-spaces.

•Split any polygon lying on both sides.

•Choose a polygon from each side – split the scene again.

•Recursively divide each side until each node contains only 1 polygon.

[Figures: a five-polygon scene viewed from above. Polygon 3 is chosen first,
splitting polygon 5 into fragments 5a and 5b; successive splits yield a tree
with root 3, a front subtree containing 2, 5a, and 1, and a back subtree
containing 4 and 5b. An alternate formulation starts the partitioning at
polygon 5.]
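A minimal sketch of how such a tree is then used, with back-to-front traversal for a given viewpoint (hypothetical names; the invariant is that everything on the far side of a node's plane is drawn before the node's polygon, then everything on the near side):

struct Polygon;                                    /* opaque polygon type    */
int  in_front(struct Polygon *p, const double eye[3]);  /* hypothetical test */
void render(struct Polygon *p);                    /* hypothetical rasterizer */

typedef struct BspNode {
    struct Polygon *poly;          /* the splitting polygon at this node */
    struct BspNode *front, *back;  /* half-space subtrees                */
} BspNode;

void draw_back_to_front(BspNode *n, const double eye[3]) {
    if (!n) return;
    if (in_front(n->poly, eye)) {           /* eye is in the front half-space */
        draw_back_to_front(n->back, eye);   /* far side first                 */
        render(n->poly);
        draw_back_to_front(n->front, eye);  /* near side last (drawn on top)  */
    } else {
        draw_back_to_front(n->front, eye);
        render(n->poly);
        draw_back_to_front(n->back, eye);
    }
}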
Area-subdivision Method

 This technique for hidden-surface removal is essentially an image-
space method, but object-space operations can be used to
accomplish depth ordering of surfaces.
 The area-subdivision method takes advantage of area coherence
in a scene by locating those projection areas that represent part of
a single surface.
Octrees
 Hierarchical tree structure to represent solid objects
 Each node corresponds to a region of three-dimensional space.
 Octree representation is commonly used in Medical Imaging & other
applications that require display of object cross-sections.
 Extension of Quadtree encoding in 2D space.
Quadtree representation
Octree representation

 Various manipulation
routines can be
applied to the solid
(e.g. union,
intersection or
difference)
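A minimal sketch of an octree node (hypothetical layout; each internal node subdivides its region of space into eight octants):

typedef enum { OCT_EMPTY, OCT_FULL, OCT_MIXED } OctState;

typedef struct OctNode {
    OctState state;            /* region empty, fully inside the solid, or mixed */
    struct OctNode *child[8];  /* the eight octants; used only when OCT_MIXED    */
} OctNode;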
Shading Methods

 Shading refers to the application of an illumination model at the
pixel positions or polygon surfaces of graphics objects.
 A shading model is used to compute the intensities and colors to display for a
surface. The shading model has two primary ingredients: the properties of the
surface and the properties of the illumination falling on it.
 The principal surface property is its reflectance, which determines how much
of the incident light is reflected. If a surface has different reflectance for
light of different wavelengths, it will appear to be colored.
 An object's illumination is also significant in computing intensity. The scene
may have illumination that is uniform from all directions, called
diffuse illumination.
 The simplest form of shading considers only diffuse illumination:
Epd = Rp Id
 where Epd is the energy coming from point P due to diffuse illumination, Id is the diffuse illumination falling on the
entire scene, and Rp is the reflectance coefficient at P, which ranges from 0 to 1. The shading contribution from specific
light sources will cause the shade of a surface to vary as its orientation with respect to the light sources changes, and
will also include specular-reflection effects. For a point P on a surface, with light arriving at an angle of incidence i (the
angle between the surface normal Np and a ray to the light source), if the energy Ips arriving from the light source is
reflected uniformly in all directions, called diffuse reflection, we have
Eps = (Rp cos i) Ips
 This equation shows the reduction in the intensity of a surface as it is tipped obliquely away from the light source.
Constant Intensity Shading

 A fast and straightforward method for rendering an object
with polygon surfaces is constant-intensity shading, also
called flat shading.
 In this method, a single intensity is calculated for each
polygon. All points over the surface of the polygon are then
displayed with the same intensity value.
 Constant shading can be useful for quickly displaying the
general appearance of a curved surface.
Gouraud Shading

 This intensity-interpolation scheme, developed by Gouraud and usually
referred to as Gouraud shading, renders a polygon surface by linearly
interpolating intensity values across the surface.
 Intensity values for each polygon are matched with the values of adjacent
polygons along the common edges, thus eliminating the intensity
discontinuities that can occur in flat shading.
 Each polygon surface is rendered with Gouraud Shading by performing the
following calculations:
1. Determine the average unit normal vector at each polygon vertex.
2. Apply an illumination model to each vertex to determine the vertex
intensity.
3. Linearly interpolate the vertex intensities over the surface of the polygon
(see the sketch below).
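A minimal sketch of step 3 along one scan line (hypothetical names; Ia and Ib are the intensities at the two edge intersections, themselves already interpolated from the vertex intensities):

/* Linear interpolation of intensity between the two edge intersections
 * (xa, Ia) and (xb, Ib) of the current scan line. */
double gouraud_intensity(double x, double xa, double Ia, double xb, double Ib) {
    double t = (x - xa) / (xb - xa);   /* 0 at the left edge, 1 at the right */
    return (1.0 - t) * Ia + t * Ib;
}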
Phong Shading

 A more accurate method for rendering a polygon surface is to interpolate the
normal vectors and then apply the illumination model at each surface point.
 This method, developed by Phong Bui Tuong, is called Phong shading or normal-
vector interpolation shading. It displays more realistic highlights on a surface
and greatly reduces the Mach-band effect.
 A polygon surface is rendered using Phong shading by carrying out the
following steps:
1. Determine the average unit normal vector at each polygon vertex.
2. Linearly & interpolate the vertex normals over the surface of the polygon.
3. Apply an illumination model along each scan line to calculate projected
pixel intensities for the surface points.
Global Illumination

 Illumination models in computer graphics are often approximations of the physical
laws that describe surface-lighting effects.
 To reduce computations, most packages use empirical models based on simplified
photometric calculations.
 Surface rendering is performed through calculating the interaction of an object's
surface with the light striking it.
 This type of illumination model is known as local illumination, and considers only
the properties of that object and the light that directly strikes it.
 To produce more realistic lighting effects, we must also consider the contribution
of light that is reflected from other objects onto the surface of the object being
shaded.
 This type of illumination, called global illumination, can be more accurate, but
that accuracy comes at the expense of additional computation.
Ray Tracing Method

 Some global-illumination methods, such as ray tracing, attempt to determine
surface shading by following light rays from the eyepoint back into the scene
through the pixels of the image plane.

 Ray casting is used in constructive solid geometry for locating surface
intersections along a ray from a pixel position.

 Ray casting is also a means for identifying visible surfaces in a scene.

 Ray tracing is the generalization of the basic ray-casting procedure.
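A minimal sketch of the core operation, a ray-surface intersection test, using a sphere as a representative primitive (hypothetical names; derived from substituting the ray equation into the sphere equation):

#include <math.h>

/* Ray: origin o + t * direction d (d normalized). Sphere: center c, radius r.
 * Returns the nearest positive t, or -1.0 if the ray misses the sphere. */
double ray_sphere(const double o[3], const double d[3],
                  const double c[3], double r) {
    double oc[3] = { o[0]-c[0], o[1]-c[1], o[2]-c[2] };
    double b = oc[0]*d[0] + oc[1]*d[1] + oc[2]*d[2];          /* oc . d        */
    double k = oc[0]*oc[0] + oc[1]*oc[1] + oc[2]*oc[2] - r*r; /* |oc|^2 - r^2  */
    double disc = b*b - k;               /* quadratic discriminant (over 4)   */
    if (disc < 0.0) return -1.0;         /* no real root: ray misses sphere   */
    double t = -b - sqrt(disc);          /* nearer of the two intersections   */
    return (t > 0.0) ? t : -1.0;
}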


Radiosity Lighting Model

 We can model lighting effects more precisely by considering the physical laws
governing the radiant-energy transfers within an illuminated scene. This
method for computing pixel color values is
generally referred to as the radiosity model.
GPU Architecture
Contents:

 Introduction of CUDA.

 Programmable Model / Parallel Computing with CUDA.

 CUDA Application Programmable Interface.

 Hardware Implementation.

 Performance Guidelines / Memory Optimizations.

 Introduction of Multi-GPU Programming.


Introduction to CUDA:
 The Graphics Processor Unit (GPU) as a Data-Parallel Computing Device:
GPU: a specialized circuit designed to rapidly manipulate and alter memory so as to accelerate the
building of images in a frame buffer intended for output to a display.

Fig.1. Floating-point operations per second


i. The GPU is specialized for compute-intensive, highly parallel computation.

ii. It is designed so that more transistors are devoted to data processing.

Fig.2. The GPU Devotes More Transistors to Data Processing


Many Core GPU-Block Diagram

 G80 (launched Nov 2006 – GeForce 8800 GTX)


 128 Thread Processors execute kernel threads.
 Up to 12,288 parallel threads active.
 Per-block shared memory (PBSM) accelerates processing.

Fig.3. Block Diagram of many core GPU


Fig.4. GPU architecture
CUDA GPU Roadmap
Heterogeneous Computing with CUDA:

 CUDA C programming involves running code on two different platforms concurrently: a host system with one or
more CPUs and one or more devices with CUDA enabled NVIDIA GPUs.

 NVIDIA devices have traditionally been associated with rendering graphics.

 They are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel.

 Difference between Host and Device:


i. Threading resources
ii. Threads
iii. RAM
Difference between Host and Device:

 Threading resources: Execution pipelines on host systems can support a limited number of concurrent
threads.
➢ Four quad-core processors can run only 16 threads concurrently.
➢ All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor.
➢ NVIDIA GeForce GTX 280 can support more than 30,000 active threads.

 Threads: Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and
off of CPU execution channels to provide multithreading capability. Threads on GPU are extremely lightweight.

 RAM: Both the host system and the device have RAM.
➢ On the host system, RAM is generally equally accessible to all code.
➢ On the device, RAM is divided virtually and physically into different types.
Maximum Performance Benefit:

 High Priority: To get the maximum benefit from CUDA, focus first on finding ways to parallelize sequential code.

 Amdahl’s law specifies the maximum speed-up.

 Amdahl’s Law: also known as Amdahl’s argument.


➢ This law is used to find the maximum expected improvement to an overall system when only part of the system is
improved.
➢ It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
➢ The maximum speed-up (S) of a program is

S = 1 / ((1 − P) + P / N)

where P is the fraction of the total serial execution time taken by the portion of code that can be
parallelized, and N is the number of processors.
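For example, with P = 0.9 of the execution time parallelizable and N = 16 processors,
S = 1 / (0.1 + 0.9/16) = 6.4; even with unlimited processors the speed-up cannot exceed
1 / (1 − P) = 10.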
Understanding the Programming Environment:

 CUDA Compute Capability: It describes:


1. Features of hardware.
2. Set of instructions supported by the device.
3. Maximum number of threads per block.
4. Number of registers per multiprocessor.

The compute capability of the GPU can be queried programmatically, as in the deviceQuery
sample in the CUDA SDK.

The information is obtained by calling cudaGetDeviceProperties().
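A minimal sketch of such a query using the standard runtime API (the fields shown are a small subset of cudaDeviceProp):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
        printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  Registers per block:   %d\n", prop.regsPerBlock);
    }
    return 0;
}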


Fig.7. Sample CUDA configuration data reported by deviceQuery
 C Runtime for CUDA and Driver API Version:
The CUDA driver API and the C runtime for CUDA are two of the programming interface to CUDA.

Version number is important because the CUDA driver API is backward compatible but not forward compatible.

➢ Applications, plug-ins, and libraries (including the C runtime for CUDA) compiled against a particular version of
the driver API will continue to work on subsequent driver releases.

➢ Applications, plug-ins, and libraries compiled against a particular version of the driver API may not work on
earlier versions of the driver.
CUDA: A new architecture for computing on the GPU
 CUDA stands for Compute Unified Device Architecture .

 It is a new hardware and software architecture for issuing and managing computations on the GPU as a data-
parallel computing device.

 There is no need to map computations to a graphics API.

 It is available for the GeForce 9 Series, Quadro FX 5600/4600, and Tesla solutions.

 CUDA comprises a hardware driver, an application programming interface (API) with its runtime, and two higher-level
mathematical libraries of common usage, CUFFT and CUBLAS.

 The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance.
 Developed by NVIDIA in late 2006.

 CUDA is a compiler and toolkit for programming NVIDIA GPUs.

 CUDA API extends the C programming language.

 Runs on thousands of threads.

 It is a scalable model.

 Objectives:
 Express Parallelism
 Give high level abstraction from hardware
Fig.9. CUDA Software Stack
Systematically Processing Flow on CUDA

 Copy data from main memory to GPU memory.

 The CPU instructs the GPU to process the data.

 The GPU executes in parallel on each core.

 Copy the result from GPU memory to main memory.
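A minimal sketch of this four-step flow with the standard runtime calls (myKernel is a hypothetical kernel defined elsewhere):

#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n);   /* hypothetical kernel */

void run(float *h, int n) {
    float *d = NULL;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void**)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* 1. host -> device   */
    myKernel<<<(n + 255) / 256, 256>>>(d, n);         /* 2-3. CPU launches,
                                                         GPU runs in parallel */
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* 4. device -> host   */
    cudaFree(d);
}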


Processing Flow on CUDA

Fig.10. Processing Flow on CUDA


Programming Model: A Highly Multithreaded Coprocessor
The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host).

 Data-parallel, compute intensive functions should be off-loaded to the device.

 Functions that are executed many times, but independently on different data, are prime candidates.

▪ i.e. body of for-loops

 A function compiled for the device is called a Kernel.

 The kernel is executed on the device by many different threads.

 Both host (CPU) and device (GPU) manage their own memory, host memory and device memory.

▪ Data can be copied between them


Thread Batching: Grid of thread blocks

 The computational grid consists of a grid of thread blocks.

 Each thread executes the kernel.

 The application specifies the grid and block dimensions.

 The grid layouts can be 1, 2, or 3-dimensional.

 The maximal sizes are determined by GPU memory and kernel complexity.

 Each block has a unique block ID.

 Each thread has a unique thread ID (within the block).
Fig.3. Grid of Thread Blocks
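A minimal sketch of how a thread derives a unique global index from these IDs (hypothetical kernel, 1-dimensional layout):

__global__ void kernel(float *data, int n) {
    /* unique global index = block ID * block size + thread ID within block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   /* each thread handles one element */
}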
Grid of Thread Blocks

Fig.4. Computational Grid


Some Design Goals

 Enable heterogeneous systems (i.e., CPU+GPU)


 CPU & GPU are separate devices with separate DRAMs

 Scale to 100’s of cores, 1000’s of parallel threads.

 Programmers focus on parallel algorithms.


Heterogeneous Programming

 CUDA = serial program with parallel kernels, all in C


 Serial C code executes in a host thread (i.e. CPU thread)
 Parallel kernel C code executes in many device threads across multiple processing elements (i.e.
GPU threads).

Fig.5. Programming Between Host and Device


Kernel = Many Concurrent Threads

 One kernel is executed at a time on the device.

 Many threads execute each kernel:
 Each thread executes the same code…
 … on different data, based on its thread ID.

 CUDA threads might be:
 Physical threads, as on NVIDIA GPUs, where GPU thread creation and
context switching are essentially free.
 Or virtual threads: e.g., one CPU core might execute
multiple CUDA threads.
Fig.6. Thread Execution
Hierarchy of Concurrent Threads

 Threads are grouped into thread blocks.

 Kernel = grid of thread blocks.

Fig.7. Hierarchy of parallel threads

 By definition, threads in the same block may synchronize with barriers:
threads wait at the barrier until all threads in the same block reach the
barrier (see the sketch below).
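A minimal sketch of a block-level barrier with __syncthreads() (hypothetical kernel, assuming a launch with 256 threads per block; a staging pattern through shared memory):

__global__ void reverse_in_block(float *d) {
    __shared__ float s[256];              /* per-block shared memory        */
    int t = threadIdx.x;
    s[t] = d[blockIdx.x * 256 + t];       /* each thread stages one value   */
    __syncthreads();                      /* wait until the whole block has
                                             finished writing s[]           */
    d[blockIdx.x * 256 + t] = s[255 - t]; /* safe: all writes now visible   */
}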
Transparent Scalability

 Thread blocks cannot synchronize with one another:
 So they can run in any order, concurrently or sequentially.
 This independence gives scalability:
 A kernel scales across any number of parallel cores.

• There is an implicit barrier between different kernels.

Fig.8. Kernel Grid with multiple core device


Programming Model: Memory Model (with Thread Hierarchy)

Fig.9. Thread processing model


Heterogeneous Memory Model

Fig.10. Memory model between host and devices


Memory Model : Computational Grid

Fig.11. Basic need for Memory model


Memory Model

Fig.12. Interfacing of memories with per block


Hardware Implementation

 A set of SIMD Multiprocessors with On-Chip shared memory.

 Execution Model.
A set of SIMD Multiprocessors with On-Chip shared memory

• The device is implemented as a set of multiprocessors.

• Each multiprocessor has a Single Instruction, Multiple Data (SIMD) architecture.

• Each multiprocessor has on-chip memory of the following four types:

▪ One set of local 32-bit registers per processor,

▪ A parallel data cache or shared memory,

▪ A read-only constant cache,

▪ A read-only texture cache.


SIMD Multiprocessor

Fig.13. Single Instruction, Multiple Data processing


A set of SIMD multiprocessors with on-chip shared memory

Fig.14. Set of SIMD


multiprocessors
Execution Model

 A grid is executed on the device by executing one or more blocks on each multiprocessor, using time slicing.

 Each block is split into SIMD groups of threads called warps.

 Each of these warps contains the same number of threads, called the warp size.

 It is executed by the multiprocessor in a SIMD (Single Instruction Multiple Data) fashion.

 A thread scheduler periodically switches from one warp to another to maximize the use of the
multiprocessor’s computational resources.
CUDA Application Programming Interface.

 C Runtime for CUDA.

 Language Extension.

 Common Runtime Component.

 Device Runtime Component.

 Host Runtime Component.


Extension to the C programming language:

 GOAL: To provide a relatively simple path for users familiar with the C programming language.

 To easily write programs for execution by the device.

It consists of:

 A minimal set of extensions to the C language: allow the programmer to target portions of the source
code for execution on the device.

 A runtime library split into:

➢ A host component

➢ A device component

➢ A common component
C Runtime for CUDA

 Handles kernel loading.

 Setting up kernel parameters.

 Launch configuration before the kernel launched.

C runtime for CUDA can perform:

 Implicit code initialization

 CUDA context management

 CUDA module management

 Kernel configuration

 Parameter passing
Language Extension

The extensions to the C programming language are four-fold:

 Function type qualifiers

 Variable type qualifiers

 A new directive

 Four built-in variables

Each source file containing these extensions must be compiled with the CUDA

compiler nvcc
Function type qualifiers

 __device__
The __device__ qualifier declares a function that is:
1. Executed on the device
2. Callable from the device only
 __global__
The __global__ qualifier declares a function as being a kernel. Such a function is:
1. Executed on the device
2. Callable from the host only
 __host__
The __host__ qualifier declares a function that is:
1. Executed on the host
2. Callable from the host only
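A minimal sketch showing all three qualifiers together (hypothetical functions):

__device__ float square(float x) {            /* device-only helper        */
    return x * x;
}

__global__ void squareAll(float *v, int n) {  /* kernel: launched from host */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

__host__ void launch(float *d_v, int n) {     /* ordinary host function     */
    squareAll<<<(n + 255) / 256, 256>>>(d_v, n);
}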
Variable Type Qualifiers

 __device__
The __device__ qualifier declares a variable that resides on the device:
1. Resides in global memory space.
2. Has the lifetime of an application.
3. Is accessible from all the threads within the grid and from the host through the runtime library.
 __constant__
The __constant__ qualifier, optionally used together with __device__, declares a variable that:
1. Resides in constant memory space.
2. Has the lifetime of an application.
3. Is accessible from all the threads within the grid and from the host through the runtime library.
 __shared__
The __shared__ qualifier, optionally used together with __device__, declares a variable that:
1. Resides in the shared memory space of a thread block.
2. Has the lifetime of the block.
3. Is only accessible from the threads within the block.
Common Runtime Component

The common runtime component can be used by both host and device functions:

Built-in vector types: These are vector types derived from the basic integer and floating-point types.

 The 1st, 2nd, 3rd, and 4th components are accessible through the fields x, y, z, and w respectively.

 All come with a constructor function of the form make_<type name>:

int2 make_int2(int x, int y);

 This creates a vector of type int2 with value (x, y).
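For example, a minimal usage:

int2 p = make_int2(3, 4);   /* p.x == 3, p.y == 4 */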


Device Runtime Component

 Mathematical Functions

 Synchronization function: __syncthreads()

 Type conversion functions: The suffixes in the functions below indicate IEEE-754 rounding modes:
➢ rn is round-to-nearest-even
➢ rz is round-towards-zero
➢ ru is round-up
➢ rd is round-down

 Texture Functions: tex1Dfetch()


Host Runtime Component

 Device management

 Context management

 Memory management

 Code module management

 Execution control

 Texture reference management

 Interoperability with OpenGL and Direct3D


Performance Guidelines

 Instruction Performance.

 Number of Threads per Block.

 Memory Optimization.

 Data Transfer between Host and Device.

 Benefits of Device, Shared, Local, Texture, and Constant Memory.


Programming Model: Managing Memory

 Host code manages device memory:

➢ Allocate / free

➢ Copy data

➢ Applies to global and constant device memory (DRAM)

 Shared memory is statically allocated

 Host manages texture data:

➢ Stored on GPU

➢ Takes advantage of texture caching / filtering / clamping

 Host manages non-pageable CPU memory:

➢ Allocate / free


CUDA Device Memory Space Overview

 Each thread can:
 R/W per-thread registers
 R/W per-thread local memory
 R/W per-block shared memory
 R/W per-grid global memory
 Read only per-grid constant memory
 Read only per-grid texture memory

• The host can R/W global, constant, and texture memories.

[Figure: a device grid containing Block (0, 0) and Block (1, 0); each block has its
own shared memory, and each thread has its own registers and local memory. The
host accesses global, constant, and texture memory.]
Global, Constant, and Texture Memories
(Long Latency Accesses)

 Global memory
 Main means of communicating R/W data between host and device
 Contents visible to all threads

 Texture and constant memories
 Constants initialized by host
 Contents visible to all threads

[Figure: the same device memory hierarchy as above. Courtesy: NVIDIA]
GPU Memory Allocation / Release

 cudaMalloc(void **pointer, size_t nbytes)

 cudaMemset(void *pointer, int value, size_t count)

 cudaFree(void *pointer)

int n = 1024;
int nbytes = n * sizeof(int);      /* size of the allocation in bytes */
int *d_a = 0;                      /* device pointer                  */
cudaMalloc((void**)&d_a, nbytes);  /* allocate n ints on the device   */
cudaMemset(d_a, 0, nbytes);        /* zero the device allocation      */
cudaFree(d_a);                     /* release the device memory       */
Compilable Example:
Element-wise Matrix Addition
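The slide's code listing did not survive extraction; a minimal sketch of element-wise matrix addition in CUDA C (hypothetical names, one thread per element in a 2D grid):

#include <stdio.h>
#include <cuda_runtime.h>

#define N 16   /* matrix dimension (N x N) */

__global__ void matAdd(const float *a, const float *b, float *c) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        c[row * N + col] = a[row * N + col] + b[row * N + col];
}

int main(void) {
    size_t bytes = N * N * sizeof(float);
    float h_a[N * N], h_b[N * N], h_c[N * N];
    for (int i = 0; i < N * N; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 block(8, 8);                     /* 64 threads per block          */
    dim3 grid((N + 7) / 8, (N + 7) / 8);  /* enough blocks to cover N x N  */
    matAdd<<<grid, block>>>(d_a, d_b, d_c);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f, c[N*N-1] = %f\n", h_c[0], h_c[N * N - 1]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}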
Introduction of Multi-GPU

 NVIDIA® Maximus® technology transformed the design process by
combining the industry-leading graphics capability of NVIDIA Quadro®
graphics processing units (GPUs) and the high-performance computing
power of NVIDIA Tesla® GPUs.

 Substantial time savings: a multi-GPU system helps address the pressure of
delivering a high-quality product to market more quickly by providing ultra-fast
processing of computations, renderings, and other computationally and visually
intensive projects.

 Multiple iterations: the ability to revise a product multiple times in a resource-
and time-constrained environment leads to a better end result. Completing each
iteration of an automobile design, animated movie, or seismic data processing run
faster leaves room for additional refinements.
