016 JCIT Vol6 No12

This document discusses parallel implementation of compressive sensing based synthetic aperture radar (SAR) imaging using graphics processing units (GPUs). The authors propose modifying the iterative shrinkage/thresholding algorithm to better utilize parallel computing on the GPU. Experimental results showed GPU implementation provided significant speedup compared to CPU implementation, enabling real-time SAR image reconstruction. Key aspects of compressive sensing theory, iterative shrinkage/thresholding algorithm, and GPU architecture are also summarized.

Parallel Implementation of Compressive Sensing Based SAR Imaging with GPU Tian Jihua, Sun Jinping, Zhang Yuxi,

Najeeb Ahmad, Zhang Bingchen

Parallel Implementation of Compressive Sensing Based SAR Imaging with GPU


Tian Jihua (1), Sun Jinping (2), Zhang Yuxi (3), Najeeb Ahmad (4), Zhang Bingchen (5)
(1) School of Electronic and Information Engineering, Beihang University, [email protected]
(2, 3, 4) School of Electronic and Information Engineering, Beihang University
(5) Nat. Key Lab of MW Imaging Tech., Institute of Electronics, CAS

Abstract
This paper proposes a new scheme for the parallel implementation of compressive sensing based SAR imaging on a GPU using the Iterative Shrinkage/Thresholding (IST) algorithm. To speed up recovery, we modify the structure of the existing IST algorithm and implement it on the GPU. The experimental results show that the parallel computing capability of the GPU gives a significant speedup over the CPU implementation.

Keywords: Compressive Sensing, Synthetic Aperture Radar, Graphics Processing Unit, CUDA

1. Introduction
As a major remote sensing instrument, synthetic aperture radar (SAR) can produce high resolution images from a moving platform, such as an airplane or a satellite. A SAR system produces 2D (range and azimuth) terrain reflectivity images by emitting a sequence of closely spaced radio frequency pulses and sampling the echoes scattered from the ground targets [1]. The main advantage of SAR is that images of the illuminated area can be obtained independent of time of day or weather conditions (e.g., fog, cloud level, rain, and snow). Modern airborne and spaceborne SAR systems can produce very high resolution images and are widely used in many civilian and military applications [1, 2].

Compressive sensing (CS), proposed by Donoho [3], Emmanuel Candès [4], Michael Elad [5] et al., is a newly developing theory that enables perfect recovery of signals and data from what appear to be highly sub-Nyquist-rate samples. CS states that an unknown sparse signal (or a signal that is sparse under a certain basis) can be exactly recovered with high probability from a very limited number of measurements by solving a convex l1 optimization problem. Built on rigorous mathematics, CS has attracted much attention in image processing, multi-sensor data fusion, radar applications and so on. So far, several publications have addressed the use of CS in radar applications, including SAR and Inverse Synthetic Aperture Radar (ISAR) [6-9].

However, the reconstruction of a sparse signal requires numerous matrix-vector multiplications, which imposes a heavy computational burden, especially when the sensing matrix is large and dense. Meanwhile, the computational load of compressive sensing based SAR imaging keeps growing with the increasing demand for high resolution SAR images. As a result, reconstructing a SAR image on a CPU takes a long time and cannot be done in real time.

Recently, the fast processing performance of the graphics processing unit (GPU) has offered an alternative for fast reconstruction of sparse signals. This paper implements the fast reconstruction of compressive sensing based SAR images by taking advantage of the efficient parallel computing capabilities of the GPU.

2. GPU Architecture and Software Framework


Recently, driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable GPU has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth [10]. NVIDIA released the Compute Unified Device Architecture (CUDA) in November 2006, the first development environment and software framework for GPU. It is an extension to the standard C language that allows users to manage the GPU as a computational device without the help of a graphics API.

In the CUDA architecture, tasks are split into a grid of multiple thread blocks, each of which consists of a series of threads. The thread blocks are distributed over the streaming multiprocessors of the GPU. CUDA adopts the single instruction multiple thread (SIMT) model, which means all the threads in one block share the same instruction code but operate on different data, and possibly run in different states. In CUDA, every thread has its own dedicated registers, and communication between the threads of a block is realized through shared memory and a synchronization mechanism. This design helps minimize the cost of context switching on the GPU. The GPU is designed to realize parallel computation through numerous threads, making it suitable for large scale parallel computing tasks that are intensive in computation and simple in logic.

Journal of Convergence Information Technology (JCIT), Volume 6, Number 12, December 2011, doi:10.4156/jcit.vol6.issue12.16

3. Theory of Compressive Sensing


The theory of CS reveals that exact recovery of an unknown sparse signal is possible from very limited samples by solving an inverse problem through either a linear program or a greedy pursuit. Suppose that the signal $s \in \mathbb{R}^N$ is K-sparse on an overcomplete dictionary $\Psi$, i.e.

$$ s = \Psi x \tag{1} $$

where $\Psi = \{\psi_1, \psi_2, \ldots, \psi_N\}$ is an $N \times N$ matrix constructed from a sparse basis $\{\psi_i\}$, and $x$ is a vector in which all but K entries are zero. Various expansions, including wavelets, the DCT, and Gabor frames, are widely used for the representation and compression of natural signals, images, and other data. The matrix $\Psi$ is constructed according to the selected expansion. In order to reconstruct the signal $s$, a set of $M$ measurements is acquired ($M < N$), which are linear combinations of the points within $s$. More precisely

$$ y = \Phi s = \Phi \Psi x \tag{2} $$

where $\Phi$ is an $M \times N$ matrix, hereinafter called the measurement matrix. Since $M < N$, recovery of the signal $s$ from the measurements $y$ is ill-posed in general. The CS theory reveals that when the matrix $A = \Phi\Psi$ has the Restricted Isometry Property (RIP) [3-5], $x$ or the signal $s$ can be recovered from a similarly sized set of $M = O(K \log(N/K))$ measurements $y$ with high probability. The RIP is closely related to an incoherency property between $\Phi$ and $\Psi$, which means that the rows of $\Phi$ cannot provide a sparse representation of the columns of $\Psi$ and vice versa. The RIP and incoherency hold for many pairs of bases, such as delta spikes and Fourier sinusoids, or sinusoids and wavelets. It can be proved that a (pseudo) random noise-like matrix, such as a matrix constructed from Bernoulli or Gaussian random variables, performs excellently as $\Phi$. Another choice for the measurement matrix that offers good performance in many cases is a causal, quasi-Toeplitz matrix [3-6]. When the RIP holds, $x$ or the signal $s$ can be recovered from the solution of a convex optimization problem. Formally, with high probability, $x$ is the unique solution to

$$ \min_x \|x\|_1 \quad \text{s.t.} \quad y = \Phi\Psi x \tag{3} $$

which can be solved efficiently with linear programming techniques. Current reconstruction methods include iterative greedy algorithms such as Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP), and convex relaxation algorithms such as Basis Pursuit (BP) and Iterative Shrinkage/Thresholding (IST).
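The measurement model of Equation 2 can be illustrated with a small sketch. It is not taken from the paper: it assumes $\Psi$ is the identity (so $s = x$ and $A = \Phi$), a Gaussian measurement matrix, and an arbitrary constant of 4 in the sizing rule $M \approx c\,K \log(N/K)$. All function names are hypothetical.

```python
import math
import random


def gaussian_matrix(m, n, rng):
    """M x N matrix with i.i.d. Gaussian entries of variance 1/M."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(m)]


def matvec(a, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def k_sparse_vector(n, k, rng):
    """Length-n vector with exactly k nonzero entries at random positions."""
    x = [0.0] * n
    for idx in rng.sample(range(n), k):
        x[idx] = rng.choice([-1.0, 1.0]) * rng.uniform(1.0, 2.0)
    return x


rng = random.Random(0)
N, K = 64, 4
# Illustrative sizing only: the constant 4 is an assumption, not from the paper.
M = math.ceil(4 * K * math.log(N / K))

x = k_sparse_vector(N, K, rng)    # sparse coefficient vector
Phi = gaussian_matrix(M, N, rng)  # measurement matrix
y = matvec(Phi, x)                # measurements, Equation (2) with Psi = I

print(f"N = {N}, K = {K}, M = {M}")
```

Here 45 measurements are taken of a length-64 signal with 4 nonzero entries. Recovering `x` from `y` is the job of the reconstruction algorithm.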

4. Iterative Shrinkage/Thresholding Algorithm and GPU Realization


4.1 Iterative Shrinkage/Thresholding Algorithm

In practice, the measurements are always disturbed by noise or other interference, so it is no longer suitable to enforce $y = \Phi\Psi x$ as in the constraint of Equation 3. Generally, we transform the constrained convex optimization problem into the following unconstrained convex optimization problem

$$ \min_x \; \frac{1}{2}\|y - Ax\|_2^2 + t\|x\|_1 \tag{4} $$

where $\|\cdot\|_2$ denotes the Euclidean norm and $t$ is the regularization parameter, which provides a tradeoff between fidelity to the measurements and noise sensitivity. Iterative Shrinkage/Thresholding (IST) [11, 12] is a state-of-the-art algorithm for solving this unconstrained convex optimization problem, with the following iterative scheme

$$ x_{k+1} = \mathrm{soft}\left(x_k + A^H(y - Ax_k),\, t\right) \tag{5} $$

where $\mathrm{soft}(x, t) = \mathrm{sgn}(x)\max(|x| - t, 0)$ is the shrinkage operator. The IST algorithm has been applied extensively to the unconstrained convex optimization problems arising in sparse signal recovery, image restoration and other linear inverse problems. Each iteration of the IST algorithm requires only matrix-vector multiplications and additions, which suits the efficient parallel computing capabilities of the GPU. This paper therefore implements the fast reconstruction of compressive sensing based SAR images using the IST algorithm on GPU. The detailed procedure of the IST algorithm is as follows:

1. Initialize $x_0 = 0$ and the residual vector $r_0 = y$, and set the iteration step $k = 0$;
2. Compute the correlation of $A$ with the current residual, and the next estimate from the current one
$$ x_{k+1} = x_k + A^H r_k \tag{6} $$
3. Process the new estimate with the shrinkage operator
$$ x_{k+1} = \mathrm{soft}(x_{k+1}, t) \tag{7} $$
4. Update the residual vector
$$ r_{k+1} = y - Ax_{k+1} \tag{8} $$
5. Compute the objective function value $f_{k+1} = 0.5\|r_{k+1}\|_2^2 + t\|x_{k+1}\|_1$ and the relative change $\Delta f = |f_{k+1} - f_k| / f_k$. If $\Delta f$ is smaller than the stopping threshold, terminate the iteration and output the estimate $\hat{x}$. Otherwise, go to step 2 for the next iteration.

In the IST algorithm, the most computationally expensive part is the matrix-vector multiplication involving $A$ and $A^H$, with a computational complexity of $O(MN)$. Each iteration step requires two such multiplications, so the computation of the whole recovery process is very large, especially when the matrix is large and dense. However, if we rewrite Equation 5 as Equation 9, $A^H y$ takes part in the computation as a constant vector, and the two multiplications reduce to a single multiplication involving only $A^H A$ in each iteration. Although the computational complexity of $A^H A x_k$ is $O(N^2)$, larger than that of the matrix-vector multiplications involving $A$ and $A^H$ on CPU, it does reduce the overall computation time when implemented in parallel.

$$ x_{k+1} = \mathrm{soft}\left(x_k + A^H y - A^H A x_k,\, t\right) \tag{9} $$
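The following sketch checks numerically that the update of Equation 5 and its rearranged form in Equation 9 give the same iterates. It is an illustration under stated assumptions, not the authors' code: the data are real-valued (so $A^H$ is simply the transpose), the small matrix `A`, the vector `y` and the threshold `t` are made-up values, and all function names are hypothetical.

```python
import math


def matvec(a, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def soft(x, t):
    """Shrinkage operator sgn(x) * max(|x| - t, 0), applied element-wise."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in x]


def ist_step_eq5(a, at, y, x, t):
    """Equation (5): two matrix-vector products, one with A and one with A^H."""
    residual = [yi - vi for yi, vi in zip(y, matvec(a, x))]
    corr = matvec(at, residual)
    return soft([xi + ci for xi, ci in zip(x, corr)], t)


def ist_step_eq9(gram, aty, x, t):
    """Equation (9): one matrix-vector product with the precomputed A^H A."""
    gx = matvec(gram, x)
    return soft([xi + bi - gi for xi, bi, gi in zip(x, aty, gx)], t)


# Made-up real-valued example (M = 2, N = 4); values are assumptions.
A = [[0.5, 0.1, 0.0, 0.2],
     [0.0, 0.4, 0.3, 0.1]]
y = [1.0, -0.5]
t = 0.05

At = transpose(A)
gram = matmul(At, A)   # A^H A, computed once before the loop
aty = matvec(At, y)    # A^H y, computed once before the loop

x5 = [0.0] * 4
x9 = [0.0] * 4
for _ in range(20):
    x5 = ist_step_eq5(A, At, y, x5, t)
    x9 = ist_step_eq9(gram, aty, x9, t)

max_diff = max(abs(p - q) for p, q in zip(x5, x9))
print("largest difference between the two forms:", max_diff)
```

Per iteration, the first form needs about $2MN$ multiply-adds in two dependent products, while the second needs about $N^2$ in a single product. The second form trades arithmetic for one large, regular operation.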

Based on the above analysis, we propose a new scheme: precompute $A^H A$ and $A^H y$ before the iteration, so that only one matrix-vector multiplication is required in each iteration step. This not only reduces the cost of the matrix-vector multiplication, but also removes the time spent on data transfer between the two multiplications. In addition, the residual vector must be available when calculating the objective function value. However, calculating the residual vector involves the matrix $A$, which conflicts with the proposed method requiring only $A^H A$ and $A^H y$. The calculation of the residual vector therefore has to be changed to fit the proposal. Fortunately, we find that if the objective function value is computed with Equation 10 instead, the two methods behave the same when judging whether the termination criterion is satisfied based on the relative change of two consecutive objective function values, although $\tilde{f}$ is different from $f$. In addition, SAR images are not sparse over all range gates, so we need to add a constraint to avoid recovering the non-sparse ones. If the objective function value in one step is no less than the one obtained in the previous step, we regard the scene as not sparse and exit the recovery.

$$ \tilde{f} = 0.5\,\|A^H y - A^H A x\|_2^2 + t\|x\|_1 \tag{10} $$

The basic procedure of the modified version of the IST algorithm is:

1. Initialize $x_0 = 0$, compute $A^H y$ and $A^H A$, and set the iteration step $k = 0$;
2. Compute the correlation of $A$ with the current residual
$$ x_{temp} = A^H y - A^H A x_k \tag{11} $$
3. Compute the new estimate
$$ x_{k+1} = \mathrm{soft}(x_k + x_{temp},\, t) \tag{12} $$
4. Compute the objective function value $\tilde{f}_{k+1} = 0.5\|x_{temp}\|_2^2 + t\|x_{k+1}\|_1$. If $\tilde{f}_{k+1} > \tilde{f}_k$ and $k > 2$, terminate the iteration. Otherwise, compute the relative change $\Delta\tilde{f} = |\tilde{f}_{k+1} - \tilde{f}_k| / \tilde{f}_k$. If $\Delta\tilde{f}$ is smaller than the stopping threshold, terminate the iteration and output the estimate $\hat{x}$. Otherwise, go to step 2 for the next iteration.

4.2 GPU implementation of IST


In the CUDA framework, communication between the host CPU and the GPU device often costs a lot of time, so such communication should be used as little as possible [13, 14]. In this paper, data communication between the host CPU and the GPU device occurs only at the start and at the end of the algorithm. At the start, the precomputed $A^H y$ and $A^H A$, the regularization parameter $t$ and other necessary parameters are transmitted to the GPU, while the reconstructed results are transmitted back to the CPU at the end of the recovery. $A^H A$ is stored in global memory, and $A^H y$ is stored in constant memory because it does not change during the recovery. To save further communication time, we transmit back only the nonzero elements and their indices instead of all elements of the reconstructed signal [15]. Besides, when a large amount of data needs to be reconstructed, a series of streams can be built, each of which is responsible for the transmission and execution of different data. Processing with two streams allows the memory copies of one stream to overlap with the kernel execution of the other stream. The time cost of memory copies between CPU and GPU can then be hidden efficiently, which improves performance.

The GPU device begins to execute the iterative recovery once it receives the data from the CPU. As mentioned above, the dominant computation during the recovery is the matrix-vector multiplication. Note that in the matrix-vector multiplication, the column vectors of the matrix are mutually independent, which makes it well suited to parallel implementation. The matrix-vector multiplication is realized with coarse-grained parallelism over thread blocks, which cannot communicate with each other, together with fine-grained parallelism over threads. In detail, the multiplications between the column vectors of $A^H A$ and $x_k$ are realized in coarse-grained parallelism, while the element-by-element products inside each vector multiplication are realized in fine-grained parallelism.
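The idea of returning only the nonzero elements and their indices can be sketched on the host side as follows. The function names are hypothetical, and how the packing is done on the device is not described in the paper.

```python
def pack_nonzeros(x):
    """Return the indices and values of the nonzero entries of x."""
    indices = [i for i, v in enumerate(x) if v != 0]
    values = [x[i] for i in indices]
    return indices, values


def unpack_nonzeros(indices, values, n):
    """Rebuild the full length-n vector from its packed form."""
    x = [0.0] * n
    for i, v in zip(indices, values):
        x[i] = v
    return x


x_rec = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0]
idx, vals = pack_nonzeros(x_rec)
print(idx, vals)  # prints [1, 4] [1.5, -2.0]
```

For a K-sparse result of length N, the packed form carries 2K numbers instead of N, which pays off when K is much smaller than N/2.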
The matrix $A^H A$ is stored in global memory, so global memory has to be accessed whenever the matrix is needed. To limit the memory latency, we use shared memory, which is as fast as registers. For instance, the column vectors of $A^H A$ and $x_k$ are all stored in the shared memory of each thread block.

During the IST recovery, some multidimensional data have to be reduced to a single value, as in the calculation of the Euclidean norm and the l1 norm. Take the l1 norm as an example: normally we add all the elements one by one. On the GPU, however, there are at most 512 threads in each block, which is smaller than the dimension of the vector, so we split the task into a parallel accumulation over multiple thread blocks, where each block is responsible for adding part of the data and the partial sums are added together at the end. To make the most efficient use of the compute power of the GPU, we should use enough thread blocks to keep the maximum number of active thread blocks per multiprocessor. However, communication between thread blocks works only through global memory, which is limited on the GPU and has a long access latency, so too many blocks would cost a lot of memory access time. A tradeoff is therefore needed when selecting the number of thread blocks.

Both the vector multiplication realized in fine-grained parallelism and the reduction of multidimensional data to a single value require each thread block to sum many values. This paper adopts the parallel summation reduction method to make the most of the parallel performance of the GPU. Figure 1 shows the procedure of parallel summation reduction with 8 elements. The traditional serial summation method requires $n$ steps to sum $n$ elements, while the parallel summation reduction method requires only $\log_2 n$ steps. Meanwhile, the parallel summation reduction works with sequential addressing, which is free of bank conflicts and so avoids a reduction in effective access bandwidth. In addition, the threads in each warp either all execute the summation or all skip it, which avoids the performance degradation caused by divergence.
Figure 1. Parallel summation reduction with 8 elements
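The reduction pattern can be emulated sequentially to see where the $\log_2 n$ step count comes from. This is a sketch in Python rather than a CUDA kernel: the inner loop stands in for the threads of one block that are active in a given step, the block size is assumed to be a power of two, and the function names are hypothetical.

```python
def block_reduce_sum(values):
    """Tree reduction with sequential addressing over one block.

    Returns (sum, number of steps). The length must be a power of two.
    """
    buf = list(values)
    n = len(buf)
    if n == 0 or n & (n - 1):
        raise ValueError("block size must be a power of two")
    steps = 0
    stride = n // 2
    while stride > 0:
        # On a GPU, "threads" 0..stride-1 would run this body in parallel.
        for tid in range(stride):
            buf[tid] += buf[tid + stride]
        stride //= 2
        steps += 1
    return buf[0], steps


def l1_norm_blocks(x, block_size=512):
    """l1 norm as per-block partial sums followed by a final sum."""
    partial = []
    for start in range(0, len(x), block_size):
        chunk = [abs(v) for v in x[start:start + block_size]]
        chunk += [0.0] * (block_size - len(chunk))  # pad the last block
        block_sum, _ = block_reduce_sum(chunk)
        partial.append(block_sum)
    return sum(partial)


total, steps = block_reduce_sum([0, 1, 2, 3, 4, 5, 6, 7])
print(total, steps)  # prints 28 3
```

With 8 elements the strides are 4, 2 and 1, giving 3 steps. In each step the active indices form a contiguous range starting at 0, which is the sequential addressing pattern described above.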

5. Experiment
To validate the speedup of the parallel implementation of compressive sensing based SAR imaging on GPU, we reconstructed the same SAR image using the IST algorithm on the CPU and on the GPU separately. The CPU used in this paper is an Intel Core2 Quad 8400 at 2.66 GHz, and the GPU is a Tesla C1060. The data used in the experiment are real airborne SAR data collected by an X-band SAR with a resolution of 2 m. For details of the compressive sensing based SAR imaging technique, please refer to [16]. We implemented the CPU and GPU code in single precision floating point and computed the average processing time over 100 repeated executions on the CPU and the GPU separately. Figure 2(a) shows the conventional SAR imaging result with full samples, while Figure 2(b) shows the compressive sensing based SAR image reconstructed from 50% of the samples on the GPU. The execution times on CPU and GPU are shown in Table 1. Table 1 shows that the GPU implementation is about 35 times faster than the CPU implementation.

Figure 2. (a) Conventional SAR imaging result with full samples. (b) Compressive sensing based SAR imaging result with 50% samples, implemented on GPU

Table 1. The average execution times on CPU and GPU

          CPU      GPU      Speedup
Time/s    8.995    0.258    35
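The speedup in Table 1 follows directly from the two average times. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
cpu_time_s = 8.995  # average CPU time from Table 1
gpu_time_s = 0.258  # average GPU time from Table 1

speedup = cpu_time_s / gpu_time_s
print(f"speedup = {speedup:.1f}x")  # prints "speedup = 34.9x"
```

Rounded to the nearest integer, this gives the factor of 35 reported in the table.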

6. Conclusion
This paper presents a parallel implementation of compressive sensing based SAR imaging on GPU, in which the Iterative Shrinkage/Thresholding algorithm is adopted to reconstruct the SAR image. To make the most efficient use of the parallelism of the GPU, we modified the structure of the existing IST algorithm and implemented it on the CUDA platform. The experimental results show that the parallel computing capability of the GPU gives a significant speedup over the CPU implementation.

7. Acknowledgement
This work was supported by the 973 Program of China under Grant 2010CB731903 and the National Natural Science Foundation of China (Grant No. 60901056).

8. References
[1] W. G. Carrara, R. S. Goodman, R. M. Majewski, Spotlight Synthetic Aperture Radar: Signal Processing Algorithms, Norwood, MA: Artech House, 1995.
[2] I. G. Cumming and F. Wong, Digital Processing of Synthetic Aperture Radar, Norwood, MA: Artech House, 2005.
[3] D. L. Donoho, Compressed Sensing, IEEE Trans. on Info. Theory, vol. 52, no. 4, pp. 1289-1306, 2006.
[4] E. Candès, J. Romberg and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Trans. on Info. Theory, vol. 52, no. 2, pp. 489-509, 2006.
[5] M. Elad, Optimized Projections for Compressed Sensing, IEEE Trans. on Signal Process., vol. 55, no. 12, pp. 5695-5702, 2007.
[6] R. Baraniuk, P. Steeghs, Compressive radar imaging, IEEE Radar Conference, pp. 128-133, 2007.
[7] M. Herman, T. Strohmer, Compressed sensing radar, IEEE Radar Conference, pp. 1-6, 2008.
[8] V. M. Patel, G. R. Easley, D. M. Healy and R. Chellappa, Compressed Synthetic Aperture Radar, IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 244-254, 2010.
[9] J. H. G. Ender, On compressive sensing applied to radar, Signal Processing, vol. 90, no. 5, pp. 1402-1414, 2010.
[10] NVIDIA, CUDA Programming Guide, Version 2.3.1, August 2009.
[11] M. A. T. Figueiredo and R. D. Nowak, An EM algorithm for wavelet-based image restoration, IEEE Transactions on Image Processing, vol. 12, no. 8, pp. 906-916, 2003.
[12] Ingrid Daubechies, Michel Defrise, Christine De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Communications on Pure and Applied Mathematics, vol. 57, pp. 1413-1457, 2004.
[13] Zhiyong Yuan, Yuanyuan Zhang, Jianhui Zhao, Yihua Ding, Chengjiang Long, Lu Xiong, Dengyi Zhang, Guozhong Liang, Real-time Simulation for 3D Tissue Deformation with CUDA Based GPU Computing, JCIT: Journal of Convergence Information Technology, vol. 5, no. 4, pp. 109-119, 2010.
[14] Xiangyun Liao, Zhiyong Yuan, Weixin Si, Zhaoliang Duan, Ruixue Mao, Jianhui Zhao, Research and Application of Parallel Computing Technologies based on CUDA and OpenCL, Journal of Convergence Information Technology, vol. 6, no. 6, 2011.
[15] Sangkyun Lee, Stephen J. Wright, Implementing algorithms for signal and image reconstruction on graphical processing units, Computer Sciences Department, University of Wisconsin-Madison, Tech. Rep., November 2008.
[16] Jihua Tian, Jinping Sun, Xiao Han, Bingchen Zhang, Motion Compensation for Compressive Sensing SAR Imaging with Autofocus, The 6th IEEE Conference on Industrial Electronics & Applications, pp. 1564-1567, 2011.
