GPU Computing with CUDA
Lecture 8 - CUDA Libraries - CUFFT, PyCUDA
Christopher Cooper
Boston University
August, 2011
UTFSM, Valparaíso, Chile
1
Outline of lecture
‣ Overview:
‣ Algorithm
‣ Motivation, examples
2
CUDA Libraries
‣ Fourier Transform
$$\hat{u}(k) = \int_{-\infty}^{\infty} e^{-ikx}\, u(x)\, dx \qquad \text{(real space} \rightarrow \text{wave space)}$$
4
Discrete Fourier Transform (DFT)
[Figure: plots of the basis functions sin(x), sin(3x), sin(5x), sin(7x)]
‣ DFT
$$\hat{u}_k = \sum_{j=0}^{N-1} u_j\, e^{-\frac{2\pi i}{N} kj}, \qquad k = 0, 1, \dots, N-1$$
‣ Inverse DFT
$$u_j = \sum_{k=0}^{N-1} \hat{u}_k\, e^{\frac{2\pi i}{N} kj}, \qquad j = 0, 1, \dots, N-1$$
6
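As a reference for what follows, here is a minimal CPU sketch (my illustration, not from the slides) that evaluates the DFT sum above directly; the double loop makes the O(N²) cost of the direct transform explicit.

#include <complex.h>
#include <math.h>

/* Direct O(N^2) evaluation of u_hat[k] = sum_j u[j] * exp(-2*pi*i*k*j/N) */
void dft(const double complex *u, double complex *u_hat, int N)
{
    for (int k = 0; k < N; k++)
    {
        u_hat[k] = 0.0;
        for (int j = 0; j < N; j++)
            u_hat[k] += u[j] * cexp(-2.0 * M_PI * I * k * j / (double)N);
    }
}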
Fast Fourier Transform (FFT)
‣ The FFT evaluates the DFT in O(N log N) operations instead of the O(N²) of the direct sum
- For N = 10⁴: roughly 10⁸ operations for the direct DFT vs. about 10⁵ for the FFT
7
Fast Fourier Transform (FFT)
‣ Cooley-Tukey radix 2
$$\hat{u}_k = \sum_{j=0}^{N-1} u_j\, e^{-\frac{2\pi i}{N} kj}$$
$$\hat{u}_k = \underbrace{\sum_{j=0}^{N/2-1} u_{2j}\, e^{-\frac{2\pi i}{N} k(2j)}}_{\text{Even}} + \underbrace{\sum_{j=0}^{N/2-1} u_{2j+1}\, e^{-\frac{2\pi i}{N} k(2j+1)}}_{\text{Odd}}$$
8
Fast Fourier Transform (FFT)
$$\hat{u}_k = \sum_{j=0}^{N/2-1} u_{2j}\, e^{-\frac{2\pi i}{N/2} kj} + e^{-\frac{2\pi i}{N} k} \sum_{j=0}^{N/2-1} u_{2j+1}\, e^{-\frac{2\pi i}{N/2} kj}$$
‣ By applying this splitting recursively until no sums remain, you get log(N) levels
‣ 4-point transform
$$\hat{u}_k = u_0 + u_1 e^{-\frac{2\pi}{4} ik} + u_2 e^{-\frac{2\pi}{4} i2k} + u_3 e^{-\frac{2\pi}{4} i3k}$$
$$\hat{u}_k = u_0 + u_2 e^{-\frac{2\pi}{4} i2k} + e^{-\frac{2\pi}{4} ik}\left(u_1 + u_3 e^{-\frac{2\pi}{4} i2k}\right)$$
$$\hat{u}_k = u_0 + u_2 e^{-\pi ik} + e^{-\frac{\pi}{2} ik}\left(u_1 + u_3 e^{-\pi ik}\right), \qquad k = 0, 1, 2, 3$$
9
Fast Fourier Transform (FFT)
$$\hat{u}_0 = u_0 + u_2 e^{0} + e^{0}\left(u_1 + u_3 e^{0}\right)$$
$$\hat{u}_1 = u_0 + u_2 e^{-\pi i} + e^{-\frac{\pi}{2} i}\left(u_1 + u_3 e^{-\pi i}\right)$$
$$\hat{u}_2 = u_0 + u_2 e^{-2\pi i} + e^{-\pi i}\left(u_1 + u_3 e^{-2\pi i}\right)$$
$$\hat{u}_3 = u_0 + u_2 e^{-3\pi i} + e^{-\frac{3\pi}{2} i}\left(u_1 + u_3 e^{-3\pi i}\right)$$
periodicity ⇒ $e^{0} = e^{-2\pi i} = 1$, $e^{-\pi i} = e^{-3\pi i} = -1$
[Figure: butterfly diagram combining inputs u_0, u_2, u_1, u_3 into outputs û_0, û_1, û_2, û_3]
10
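The even/odd splitting above can be coded as a short recursion. The sketch below is my illustration (not from the slides); it assumes N is a power of two, writes the result out of place, and combines each half with the twiddle factor e^{-2πik/N}.

#include <complex.h>
#include <math.h>

/* Recursive radix-2 Cooley-Tukey FFT: transform the even- and odd-indexed
   subsequences of length N/2, then recombine them. stride selects every
   stride-th element of the input; call with stride = 1 initially. */
void fft_radix2(const double complex *u, double complex *u_hat, int N, int stride)
{
    if (N == 1)
    {
        u_hat[0] = u[0];
        return;
    }
    fft_radix2(u,          u_hat,       N/2, 2*stride);   // even terms
    fft_radix2(u + stride, u_hat + N/2, N/2, 2*stride);   // odd terms

    for (int k = 0; k < N/2; k++)
    {
        double complex even = u_hat[k];
        double complex odd  = cexp(-2.0 * M_PI * I * k / (double)N) * u_hat[k + N/2];
        u_hat[k]       = even + odd;
        u_hat[k + N/2] = even - odd;
    }
}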
FFT - Motivation
‣ Signal processing
‣ Convolution, filters
11
FFT - Motivation
$$u_j = \sum_{k=0}^{N-1} \hat{u}_k\, e^{ikx_j}$$
$$\frac{\partial u_j}{\partial x} = \sum_{k=0}^{N-1} ik\, \hat{u}_k\, e^{ikx_j} \qquad \Rightarrow \qquad \widehat{\frac{\partial u}{\partial x}} = ik\, \hat{u}$$
$$\frac{\partial^2 u_j}{\partial x^2} = \sum_{k=0}^{N-1} -k^2\, \hat{u}_k\, e^{ikx_j}$$
12
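On the GPU this is just an element-wise scaling of the spectral coefficients. A minimal kernel sketch (my illustration, assuming a cufftComplex array of N coefficients and a precomputed wavenumber array k, as used in the Poisson example later):

#include <cufft.h>

// Multiply each Fourier coefficient by i*k:
// (a + i*b) * (i*k) = -k*b + i*(k*a)
__global__ void spectral_derivative(const cufftComplex *u_hat,
                                    cufftComplex *du_hat,
                                    const float *k, int N)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N)
    {
        du_hat[i].x = -k[i] * u_hat[i].y;
        du_hat[i].y =  k[i] * u_hat[i].x;
    }
}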
FFT - Motivation
‣ Advantages
- Spectral accuracy: the error decays as $O(c^N)$, $0 < c < 1$
‣ Limitations
- Grid constraints
13
CUFFT
‣ Supported by NVIDIA
‣ Features:
‣ cufftHandle
‣ cufftResult
‣ cufftReal
‣ cufftDoubleReal
‣ cufftComplex
‣ cufftDoubleComplex
15
CUFFT - Transform types
‣ Single precision: CUFFT_R2C (real to complex), CUFFT_C2R (complex to real), CUFFT_C2C (complex to complex)
‣ Double precision: CUFFT_D2Z, CUFFT_Z2D, CUFFT_Z2Z
16
CUFFT - Plans
‣ cufftPlan1d()
‣ cufftPlan2d()
‣ cufftPlan3d()
‣ cufftPlanMany()
17
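cufftPlanMany() is the most general of these. A hedged sketch based on the CUFFT documentation (this call is not shown in the lecture) for planning a batch of 1D complex transforms:

cufftHandle plan;
int n[1] = {256};                   // length of each 1D transform

// 10 back-to-back transforms of length 256, tightly packed in memory;
// NULL embed pointers select the default (contiguous) data layout.
cufftPlanMany(&plan, 1, n,
              NULL, 1, 256,         // input: stride 1, distance 256 between signals
              NULL, 1, 256,         // output: same layout
              CUFFT_C2C, 10);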
CUFFT - Functions
‣ cufftExecC2C(), cufftExecR2C(), cufftExecC2R() (and the double precision variants cufftExecZ2Z(), cufftExecD2Z(), cufftExecZ2D())
‣ cufftDestroy()
18
CUFFT - Performance considerations
‣ Performance recommendations
19
CUFFT - Performance considerations
‣ CUFFT vs FFTW
http://www.sharcnet.ca/~merz/CUDA_benchFFT/
20
CUFFT - Example
#include <cufft.h>

#define NX 256
#define BATCH 10

cufftHandle plan;
cufftComplex *data;
cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);

/* Create a 1D FFT plan. */
cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);

/* Use the CUFFT plan to transform the signal in place. */
cufftExecC2C(plan, data, data, CUFFT_FORWARD);

/* Destroy the CUFFT plan. */
cufftDestroy(plan);
cudaFree(data);
21
CUFFT - Example
$$\nabla^2 u = \frac{r^2 - 2\sigma^2}{\sigma^4}\, e^{-\frac{r^2}{2\sigma^2}}$$
$$u_{an} = e^{-\frac{r^2}{2\sigma^2}}, \qquad r = \sqrt{(x - 0.5)^2 + (y - 0.5)^2}$$
[Figure: surface plot of the analytical solution u(x, y) on the unit square]
22
CUFFT - Example
‣ Steps
$$\nabla^2 u = f$$
FFT the system: $-k^2 \hat{u} = \hat{f}$
Solve for $\hat{u}$: $\hat{u} = -\dfrac{\hat{f}}{k^2}$
Transform back: $u = \mathrm{ifft}\left(-\dfrac{\hat{f}}{k^2}\right)$
with $k^2 = k_x^2 + k_y^2$
23
CUFFT - Example
int main()
{
    int N = 64;
    float xmax = 1.0f, xmin = 0.0f, ymin = 0.0f,
          h = (xmax - xmin)/((float)N), s = 0.1, s2 = s*s;
    float *x = new float[N*N], *y = new float[N*N], *u = new float[N*N],
          *f = new float[N*N], *u_a = new float[N*N], *err = new float[N*N];
    float r2;

    for (int j=0; j<N; j++)
        for (int i=0; i<N; i++)
        {
            x[N*j+i] = xmin + i*h;
            y[N*j+i] = ymin + j*h;
            r2 = (x[N*j+i]-0.5)*(x[N*j+i]-0.5) + (y[N*j+i]-0.5)*(y[N*j+i]-0.5);
            f[N*j+i] = (r2-2*s2)/(s2*s2)*exp(-r2/(2*s2));
            u_a[N*j+i] = exp(-r2/(2*s2));   // analytical solution
        }

    float *k = new float[N];
    for (int i=0; i<=N/2; i++)
    {
        k[i] = i * 2*M_PI;
    }
    for (int i=N/2+1; i<N; i++)
    {
        k[i] = (i - N) * 2*M_PI;
    }
24
CUFFT - Example
    // Allocate arrays on the device
    float *k_d, *f_d, *u_d;
    cudaMalloc((void**)&k_d, sizeof(float)*N);
    cudaMalloc((void**)&f_d, sizeof(float)*N*N);
    cudaMalloc((void**)&u_d, sizeof(float)*N*N);

    cudaMemcpy(k_d, k, sizeof(float)*N, cudaMemcpyHostToDevice);
    cudaMemcpy(f_d, f, sizeof(float)*N*N, cudaMemcpyHostToDevice);

    cufftComplex *ft_d, *f_dc, *ft_d_k, *u_dc;
    cudaMalloc((void**)&ft_d, sizeof(cufftComplex)*N*N);
    cudaMalloc((void**)&ft_d_k, sizeof(cufftComplex)*N*N);
    cudaMalloc((void**)&f_dc, sizeof(cufftComplex)*N*N);
    cudaMalloc((void**)&u_dc, sizeof(cufftComplex)*N*N);

    dim3 dimGrid(int((N-0.5)/BSZ) + 1, int((N-0.5)/BSZ) + 1);
    dim3 dimBlock(BSZ, BSZ);
    real2complex<<<dimGrid, dimBlock>>>(f_d, f_dc, N);

    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_C2C);
25
CUFFT - Example
cufftExecC2C(plan, f_dc, ft_d, CUFFT_FORWARD);
solve_poisson<<<dimGrid, dimBlock>>>(ft_d, ft_d_k, k_d, N);
cufftExecC2C(plan, ft_d_k, u_dc, CUFFT_INVERSE);
complex2real<<<dimGrid, dimBlock>>>(u_dc, u_d, N);
cudaMemcpy(u, u_d, sizeof(float)*N*N, cudaMemcpyDeviceToHost);
    float constant = u[0];
    for (int i=0; i<N*N; i++)
    {
        u[i] -= constant;   // subtract u[0] to force the arbitrary constant to be 0
    }
26
CUFFT - Example
__global__ void solve_poisson(cufftComplex *ft, cufftComplex *ft_k, float *k, int N)
{
    int i = threadIdx.x + blockIdx.x*BSZ;
    int j = threadIdx.y + blockIdx.y*BSZ;
    int index = j*N+i;
    if (i<N && j<N)
    {
        float k2 = k[i]*k[i] + k[j]*k[j];
        if (i==0 && j==0) {k2 = 1.0f;}
        ft_k[index].x = -ft[index].x/k2;
        ft_k[index].y = -ft[index].y/k2;
    }
}
27
CUFFT - Example
__global__ void real2complex(float *f, cufftComplex *fc, int N)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int j = threadIdx.y + blockIdx.y*blockDim.y;
    int index = j*N+i;
    if (i<N && j<N)
    {
        fc[index].x = f[index];
        fc[index].y = 0.0f;
    }
}

__global__ void complex2real(cufftComplex *fc, float *f, int N)
{
    int i = threadIdx.x + blockIdx.x*BSZ;
    int j = threadIdx.y + blockIdx.y*BSZ;
    int index = j*N+i;
    if (i<N && j<N)
    {
        f[index] = fc[index].x/((float)N*(float)N);   // divide by number of elements to recover value
    }
}
28
PyCUDA
‣ http://mathema.tician.de/software/pycuda
29
PyCUDA
‣ Scripting language
‣ PyCUDA
30
PyCUDA
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1), grid=(1,1))

print dest - a*b
31
PyCUDA
‣ Transferring data
import numpy
import pycuda.autoinit
import pycuda.driver as cuda

# a is assumed to exist already as a numpy array (e.g. a = numpy.random.randn(4,4))
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
‣ Executing a kernel
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")

...
# Allocate, generate and transfer
func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
32