Graphics 5
These two approaches are called object-space methods and image-space methods, respectively:
An object-space method compares objects and parts of objects to each other within the scene definition to
determine which surfaces, as a whole, we should
label as visible.
In an image-space algorithm, visibility is decided point by point at each pixel position on the projection
plane.
Back-face detection
A fast and simple object-space method for locating the back faces of a
polyhedron is based on front-back tests. A point (x, y, z) is behind a polygon
surface if
Ax + By + Cz + D < 0 (1)
where A, B, C, and D are the plane parameters for the polygon.
When this position is along the line of sight to the surface, we must be looking
at the back of the polygon. Therefore, we could use the viewing position to
test for back faces.
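As an illustration (a minimal sketch, not from the original slides), the test can be implemented by checking the sign of the dot product between the viewing direction V and the polygon normal N = (A, B, C):

/* Back-face test sketch: the polygon faces away from the viewer
 * when the viewing direction and the surface normal point in
 * roughly the same direction, i.e., when V . N > 0. */
typedef struct { float x, y, z; } Vec3;

int is_back_face(Vec3 view_dir, Vec3 normal)
{
    float dot = view_dir.x * normal.x
              + view_dir.y * normal.y
              + view_dir.z * normal.z;
    return dot > 0.0f;   /* back face: not visible from this viewpoint */
}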
Depth-buffer Method
A commonly used image-space approach for detecting visible surfaces is the depth-buffer method,
which compares surface depth values throughout a scene for each pixel position on the projection
plane.
Each surface of a scene is processed separately, one pixel position at a time, across the surface.
The algorithm is usually applied to scenes containing only polygon surfaces, because depth values
can be computed very quickly and the method is easy to implement.
This visibility-detection approach is also frequently alluded to as the z-buffer method, because
object depth is usually measured along the z axis of a viewing system.
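A minimal sketch of the per-pixel depth comparison (illustrative only; the buffer sizes and helper names are assumptions):

#include <float.h>

#define WIDTH  640
#define HEIGHT 480

static float depthBuff[HEIGHT][WIDTH];   /* nearest depth seen so far */
static int   frameBuff[HEIGHT][WIDTH];   /* color of that surface     */

/* Reset both buffers before processing the surfaces of a scene. */
void clear_buffers(int background)
{
    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH; x++) {
            depthBuff[y][x] = FLT_MAX;   /* farthest possible depth */
            frameBuff[y][x] = background;
        }
}

/* Called once per pixel covered by each surface: keep the sample
 * only if it is nearer than everything recorded at this pixel. */
void depth_test(int x, int y, float z, int color)
{
    if (z < depthBuff[y][x]) {
        depthBuff[y][x] = z;
        frameBuff[y][x] = color;
    }
}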
Drawback: A drawback of the depth-buffer method is that it identifies only
one visible surface at each pixel position. In other words, it deals only with
opaque surfaces and cannot accumulate color values for more than one
surface, as is necessary if transparent surfaces are to be displayed.
A-Buffer Method
The A-buffer method extends the depth-buffer idea by storing, at each pixel position, a list of surface data rather than a single depth value, so that color contributions from more than one surface (e.g., transparent surfaces) can be accumulated.
Depth-Sorting Method (Painter's Algorithm)
Surfaces are sorted in order of decreasing depth and then scan-converted from back to front, so that nearer surfaces paint over farther ones.
Sorting operations are carried out in both image and object space, and the
scan conversion of the polygon surfaces is performed in image space.
This visibility-detection method is often referred to as the painter's
algorithm.
Constructive Solid-Geometry Methods
This method creates a new volume by applying Boolean set operations (e.g., union, intersection, or difference) to two specified volumes.
Initially, a set of primitive 3D volumes is available, e.g., cubes, spheres, pyramids, cylinders, cones, and closed spline surfaces.
Tree representation for a CSG object
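As a sketch of how such a tree can be evaluated (not from the slides; the sphere primitive and field names are assumptions), point-membership classification recurses through the Boolean operators:

typedef enum { PRIMITIVE, UNION, INTERSECTION, DIFFERENCE } NodeOp;

typedef struct CsgNode {
    NodeOp op;
    struct CsgNode *left, *right;   /* children of a Boolean node        */
    float cx, cy, cz, r;            /* sphere primitive: center, radius  */
} CsgNode;

/* A point is in a union if it is in either child, in an intersection
 * if it is in both, and in a difference if it is in the left child
 * but not the right. */
int inside(const CsgNode *n, float x, float y, float z)
{
    switch (n->op) {
    case PRIMITIVE: {
        float dx = x - n->cx, dy = y - n->cy, dz = z - n->cz;
        return dx*dx + dy*dy + dz*dz <= n->r * n->r;
    }
    case UNION:        return inside(n->left, x, y, z) || inside(n->right, x, y, z);
    case INTERSECTION: return inside(n->left, x, y, z) && inside(n->right, x, y, z);
    case DIFFERENCE:   return inside(n->left, x, y, z) && !inside(n->right, x, y, z);
    }
    return 0;
}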
Binary Space Partitioning (BSP) Tree
•Choose a polygon arbitrarily.
•Divide the scene into front (relative to the polygon's normal) and back half-spaces.
•Split any polygon lying on both sides.
•Recursively divide each side until each node contains only one polygon.
[Figure: step-by-step BSP subdivision of a scene of polygons 1-5, in which polygon 5 is split into fragments 5a and 5b]
Various manipulation routines (e.g., union, intersection, or difference) can be applied to the resulting solid.
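Once built, a BSP tree yields a correct back-to-front rendering order from any viewpoint. A minimal traversal sketch (side_of and draw_polygon are assumed helpers, not part of the slides):

typedef struct { float x, y, z; } Vec3;
typedef struct Polygon Polygon;            /* defined elsewhere          */

float side_of(const Polygon *p, Vec3 eye); /* >0: eye in front of plane  */
void  draw_polygon(const Polygon *p);

typedef struct BspNode {
    struct BspNode *front, *back;          /* half-space subtrees        */
    Polygon *poly;                         /* partitioning polygon       */
} BspNode;

/* Draw the subtree farther from the eye first, then this node's
 * polygon, then the nearer subtree, so near polygons paint over far. */
void render_bsp(const BspNode *n, Vec3 eye)
{
    if (!n) return;
    if (side_of(n->poly, eye) > 0) {       /* eye in front half-space */
        render_bsp(n->back, eye);
        draw_polygon(n->poly);
        render_bsp(n->front, eye);
    } else {                               /* eye in back half-space  */
        render_bsp(n->front, eye);
        draw_polygon(n->poly);
        render_bsp(n->back, eye);
    }
}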
Shading Methods
We can model lighting effects more precisely by considering the physical laws
governing the radiant-energy transfers within an illuminated scene. This
method for computing pixel color values is
generally referred to as the radiosity model.
GPU Architecture
Contents:
Introduction to CUDA
Hardware Implementation
CUDA C programming involves running code on two different platforms concurrently: a host system with one or
more CPUs, and one or more devices with CUDA-enabled NVIDIA GPUs.
Threading resources: Execution pipelines on host systems can support a limited number of concurrent
threads.
➢ Four quad-core processors can run only 16 threads concurrently.
➢ All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor.
➢ NVIDIA GeForce GTX 280 can support more than 30,000 active threads.
Threads: Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and
off of CPU execution channels to provide multithreading capability. Threads on GPU are extremely lightweight.
RAM: Both the host system and the device have RAM.
➢ On the host system, RAM is generally equally accessible to all code.
➢ On the device, RAM is divided virtually and physically into different types.
Maximum Performance Benefit:
High Priority: To get the maximum benefit from CUDA, focus first on finding ways to parallelize sequential code.
The maximum speedup S available from parallelization is given by Amdahl's law:
S = 1 / ((1 − P) + P / N)
where P is the fraction of the total serial execution time taken by the portion of code that can be parallelized, and N is the number of processors. For example, if P = 3/4 and N = 4, then S = 1 / (1/4 + 3/16) ≈ 2.3.
Understanding the Programming Environment:
The compute capability of the GPU can be queried programmatically, as demonstrated by the deviceQuery sample in the CUDA SDK.
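A minimal sketch of such a query using the CUDA runtime API (the API calls are real; the output format is illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, warp size %d\n",
               dev, prop.name, prop.major, prop.minor, prop.warpSize);
    }
    return 0;
}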
Version number is important because the CUDA driver API is backward compatible but not forward compatible:
➢ Applications, plug-ins, and libraries (including the C runtime for CUDA) compiled against a particular version of
the driver API will continue to work on later driver releases.
➢ Applications, plug-ins, and libraries compiled against a particular version of the driver API may not work on
earlier versions of the driver.
CUDA: A new architecture for computing on the GPU
CUDA stands for Compute Unified Device Architecture.
It is a new hardware and software architecture for issuing and managing computations on the GPU as a data-
parallel computing device.
It is available for the GeForce 9 Series, Quadro FX 5600/4600, and Tesla solutions.
It comprises a hardware driver, an application programming interface (API) and its runtime, and two higher-level
mathematical libraries of common usage, CUFFT and CUBLAS.
The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance.
Developed by NVIDIA in late 2006.
It is a scalable programming model.
Objectives:
Express Parallelism
Give high level abstraction from hardware
Fig. 9. CUDA Software Stack
Processing Flow on CUDA
Functions that are executed many times, but independently on different data, are prime candidates for rewriting as device kernels (see the compilable matrix-addition example at the end of this section).
Both the host (CPU) and the device (GPU) manage their own memory: host memory and device memory, respectively.
Execution Model
The device is implemented as a set of SIMD multiprocessors with on-chip shared memory.
A grid of thread blocks is executed on the device by executing one or more blocks on each multiprocessor using time slicing.
Each block is split into SIMD groups of threads called warps; each of these warps contains the same number of threads, called the warp size.
A thread scheduler periodically switches from one warp to another to maximize the use of the
multiprocessor’s computational resources.
CUDA Application Programming Interface
Language Extension
GOAL: To provide a relatively simple path for users familiar with the C programming language.
It consists of:
➢ A minimal set of extensions to the C language, which allow the programmer to target portions of the source
code for execution on the device.
➢ A runtime library split into a host component, a device component, and a common component.
C Runtime for CUDA
The C runtime for CUDA handles:
➢ Kernel configuration
➢ Parameter passing
(see the launch sketch below)
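A minimal sketch of both (the kernel name scale and the sizes are illustrative): the <<<blocks, threads>>> configuration selects the launch shape, and the ordinary argument list passes parameters to the kernel.

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;               /* each thread handles one element */
}

void launch_scale(float *d_data, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;     /* round up */
    scale<<<blocks, threads>>>(d_data, n, 2.0f);   /* configuration + parameters */
}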
Language Extension
A new directive specifies how a kernel is executed on the device (the <<<...>>> execution configuration shown above).
Each source file containing these extensions must be compiled with the CUDA
compiler, nvcc.
Function type qualifiers
__device__
The __device__ qualifier declares a function that is:
1. Executed on the device
2. Callable from the device only
__global__
The __global__ qualifier declares a function as being a kernel. Such a function is:
1. Executed on the device
2. Callable from the host only
__host__
The __host__ qualifier declares a function that is:
1. Executed on the host
2. Callable from the host only
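For example (a minimal sketch; the function names are illustrative):

__device__ float square(float x)              /* device only            */
{
    return x * x;
}

__global__ void square_all(float *a, int n)   /* kernel, host-launched  */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);
}

__host__ void run(float *d_a, int n)          /* host only              */
{
    square_all<<<(n + 255) / 256, 256>>>(d_a, n);
}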
Variable Type Qualifiers
__device__
The __device__ qualifier declares a variable that resides on the device:
1. Resides in global memory space
2. Has the lifetime of an application
3. Is accessible from all the threads within the grid and from the host through the runtime library
__constant__
The __constant__ qualifier, optionally used together with __device__, declares a variable that:
1. Resides in constant memory space
2. Has the lifetime of an application
3. Is accessible from all the threads within the grid and from the host through the runtime library
__shared__
The __shared__ qualifier, optionally used together with __device__, declares a variable that:
1. Resides in the shared memory space of a thread block
2. Has the lifetime of the block
3. Is accessible only from the threads within the block
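For example (a minimal sketch; names are illustrative, and the kernel assumes a launch with 256 threads per block):

__device__   float d_scale = 2.0f;   /* global memory, application lifetime   */
__constant__ float d_bias  = 0.5f;   /* constant memory, application lifetime */

__global__ void transform(float *out)
{
    __shared__ float tile[256];      /* shared memory, lifetime of the block  */
    tile[threadIdx.x] = d_scale * threadIdx.x + d_bias;
    __syncthreads();                 /* make the tile visible to all threads  */
    out[threadIdx.x] = tile[threadIdx.x];
}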
Common Runtime Component
The common runtime component can be used by both host and device functions:
Built-in Vector Types: These are vector types derived from the basic integer and floating-point types (e.g., float4, int2).
The 1st, 2nd, 3rd, and 4th components are accessible through the fields x, y, z, and w, respectively.
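For example:

float4 v   = make_float4(1.0f, 2.0f, 3.0f, 4.0f);   /* constructor function */
float  sum = v.x + v.y + v.z + v.w;                  /* component access     */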
Mathematical Functions
Type Conversion Functions: The suffixes in the functions below indicate IEEE-754 rounding modes:
➢ rn is round-to-nearest-even
➢ rz is round-towards-zero
➢ ru is round-up (towards +infinity)
➢ rd is round-down (towards -infinity)
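For example, using the single-precision device conversion intrinsics (these functions are part of the CUDA math API):

__device__ void round_examples(float f, int *out)
{
    out[0] = __float2int_rn(f);   /* round to nearest even */
    out[1] = __float2int_rz(f);   /* round towards zero    */
    out[2] = __float2int_ru(f);   /* round up              */
    out[3] = __float2int_rd(f);   /* round down            */
}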
The host runtime component provides functions to handle:
➢ Device management
➢ Context management
➢ Memory management
➢ Execution control
Key performance topics: instruction performance and memory optimization.
Device memory:
➢ Allocate / free
➢ Copy data to and from the host
➢ Stored on the GPU
Host memory:
➢ Allocate / free
Memory Model
[Figure: CUDA memory model — the host and a device grid with global, constant, and texture memories; global, constant, and texture memories have long-latency accesses. Courtesy: NVIDIA]
GPU Memory Allocation / Release
cudaMalloc(void** pointer, size_t nbytes)
cudaMemset(void* pointer, int value, size_t count)
cudaFree(void* pointer)

int n = 1024;
int nbytes = n * sizeof(int);       /* size of the array in bytes */
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);   /* allocate device memory     */
cudaMemset(d_a, 0, nbytes);         /* zero it                    */
cudaFree(d_a);                      /* release it                 */
Compilable Example: Element-wise Matrix Addition
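A minimal compilable sketch (array sizes and values are illustrative) showing the full flow: allocate device memory, copy inputs to the device, launch the kernel, and copy the result back:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread adds one element; matrices are flat row-major arrays. */
__global__ void matAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int N = 256 * 256;               /* total number of elements */
    const int nbytes = N * sizeof(float);

    float *h_a = (float*)malloc(nbytes);
    float *h_b = (float*)malloc(nbytes);
    float *h_c = (float*)malloc(nbytes);
    for (int i = 0; i < N; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMalloc((void**)&d_b, nbytes);
    cudaMalloc((void**)&d_c, nbytes);

    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, nbytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;   /* round up */
    matAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, nbytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);               /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}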
Introduction to Multi-GPU