An Introduction to CUDA Programming
Irina Mocanu
University Polytechnica of Bucharest
Introduction
The programmable graphics processing unit (GPU) has evolved into a genuine computing
workhorse. Today's GPUs offer substantial resources for both graphics and non-graphics processing. Data-parallel processing maps data elements to parallel processing threads. Many applications that process
large data sets such as arrays can use a data-parallel programming model to speed up the computations. In
3D rendering large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media
processing applications can map image blocks and pixels to parallel processing threads. Many other
algorithms beyond image rendering and processing can likewise be accelerated by data-parallel
processing.
In this scope, Nvidia developed CUDA (Compute Unified Device Architecture) [1], a new
hardware and software architecture for issuing and managing computations on the GPU as a data-parallel
computing device, without the need to map them to a graphics API. It is available for the GeForce 8
Series, Quadro FX 5600/4600, and Tesla solutions.
This paper presents the principal features of CUDA programming. It describes the CUDA
architecture and the CUDA application programming interface, based on the documentation from [2]
and [4]. It also includes a simple CUDA program, presented in [3], that adds two matrices using the
parallel capabilities of CUDA.
When programmed with CUDA, the GPU is viewed as a compute device capable of executing a
very high number of threads in parallel. It operates as a coprocessor to the main CPU (host). The host
and the device maintain their own DRAM, the host memory and device memory, respectively, as in [2].
The batch of threads that executes a kernel is organized as a grid of thread blocks, as shown in Figure 2,
following [2].
The goal of CUDA programming is to provide a relatively simple path for users familiar with
the C language. Based on [2], it consists of:
• A runtime library (presented in Table 1) split into:
• A host component, that runs on the host and provides functions to control and access one or
more compute devices from the host;
• A device component, that runs on the device and provides device-specific functions;
• A common component, that provides built-in vector types and a subset of the C standard
library that are supported in both host and device code.
• A minimal set of extensions to the C language, that allow the programmer to target portions of the
source code for execution on the device (composed of four parts presented in Table 2).
A new directive to specify how a kernel is executed on the device from the host:
Any call to a __global__ function must specify the execution configuration for that call.
The execution configuration defines the dimension of the grid and blocks that will be used
to execute the function on the device. It is specified by the expression <<<Dg,Db,Ns>>>
inserted between the function name and the parenthesized argument list, where: (i) Dg is of
type dim3 and specifies the dimension and size of the grid (Dg.x*Dg.y is the number of
blocks being launched), (ii) Db is of type dim3 and specifies the dimension and size of each
block (Db.x*Db.y*Db.z is the number of threads per block), (iii) Ns is of type size_t and
specifies the number of bytes in shared memory that is dynamically allocated per block for
this call in addition to the statically allocated memory.
Four built-in variables that specify the grid and block dimensions and the block and thread indices:
gridDim is of type dim3 and contains the dimensions of the grid.
blockIdx is of type uint3 and contains the block index within the grid.
blockDim is of type dim3 and contains the dimensions of the block.
threadIdx is of type uint3 and contains the thread index within the block.
Table 2. The set of extensions to the C language
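As a brief illustration of the execution configuration and of the built-in variables listed in Table 2, the following sketch launches a simple kernel that scales a device vector. The kernel name scale, the vector length n and the block size are assumptions made for this illustration only and do not come from [2]:

#include <cuda_runtime.h>

// hypothetical kernel: each thread scales one element of the vector
__global__ void scale(float *v, float s, int n)
{
    // global thread index computed from the built-in variables
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = s * v[i];
}

int main()
{
    const int n = 4096;
    float *v_dev;
    cudaMalloc((void**)&v_dev, n * sizeof(float));

    // execution configuration <<<Dg, Db, Ns>>>:
    // Dg blocks of Db threads each, Ns = 0 bytes of dynamic shared memory
    dim3 Db(256);
    dim3 Dg((n + Db.x - 1) / Db.x);
    scale<<<Dg, Db, 0>>>(v_dev, 2.0f, n);

    cudaFree(v_dev);
    return 0;
}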
A CUDA example
In the following, a CUDA program for adding two matrices in parallel is presented, which [3] compares
with the same program written in plain C. The complete source of the program, as presented in [3], is
given below:
#include <cstdlib>   // for EXIT_SUCCESS

// set grid size
const int N = 1024;
const int blocksize = 16;

// compute kernel: each thread adds one element of the matrices
__global__
void add_matrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j*N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main()
{
    // CPU memory allocation
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for (int i = 0; i < N*N; ++i) { a[i] = 1.0f; b[i] = 3.5f; }

    // GPU memory allocation
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc((void**)&ad, size);
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);

    // copy data to GPU
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    // execute kernel
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);

    // copy result back to CPU
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

    // clean up and return
    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}
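For reference, the serial version of the same computation mentioned above can be written as a plain C++ loop. The following is a minimal sketch prepared for this example (the function name add_matrix_cpu is introduced here for illustration only, not reproduced from [3]):

#include <cstdlib>

const int N = 1024;

// serial matrix addition: one loop over all N*N elements, run entirely on the CPU
void add_matrix_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n * n; ++i)
        c[i] = a[i] + b[i];
}

int main()
{
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for (int i = 0; i < N*N; ++i) { a[i] = 1.0f; b[i] = 3.5f; }

    add_matrix_cpu(a, b, c, N);

    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}

Comparing the two versions shows that the CUDA program replaces the loop over elements with a grid of threads, one thread per matrix element, while the per-element computation stays the same.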
Conclusions
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU)
using graphics APIs:
• It uses the standard C language, with some simple extensions; there is no need to learn a graphics API;
• Scattered writes – code can write to arbitrary addresses in memory;
• Shared memory – CUDA exposes a fast shared memory region that can be shared amongst
threads (a brief sketch follows this list).
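The shared-memory point can be made concrete with a small sketch. The kernel below is a hypothetical example written for this paper, not code from the CUDA documentation; it stages one block's worth of data in __shared__ memory and writes it back reversed within the block, assuming the array length is a multiple of the block size:

#define BLOCK 256

// each block copies its elements into fast on-chip shared memory, synchronizes,
// then writes them back in reversed order within the block
__global__ void reverse_in_block(float *d)
{
    __shared__ float s[BLOCK];   // visible to all threads of the same block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = d[i];
    __syncthreads();             // wait until every thread of the block has written s

    d[i] = s[blockDim.x - 1 - threadIdx.x];
}

Such a kernel would be launched with one thread per element, for example reverse_in_block<<<n/BLOCK, BLOCK>>>(d_ptr) for a device array d_ptr of n floats.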
But CUDA has some limitations:
• Recursive functions are not supported and must be converted to loops;
• Threads should be run in groups of at least 32 for best performance;
• CUDA-enabled GPUs are only available from Nvidia (GeForce 8 series and above, Quadro and Tesla).
Bibliography
1. http://www.nvidia.com/object/cuda_home.html
2. http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf
3. Johan Seland, CUDA Programming, Winter School in Parallel Computing, Geilo, January 20-25, 2008
(http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf)
4. http://www.gpgpu.org/sc2007/