016 JCIT Vol6 No12
1Tian Jihua, 2Sun Jinping, 3Zhang Yuxi, 4Najeeb Ahmad, 5Zhang Bingchen
1 School of Electronic and Information Engineering, Beihang University, [email protected]
2,3,4 School of Electronic and Information Engineering, Beihang University
5 Nat. Key Lab of MW Imaging Tech., Institute of Electronics, CAS
Abstract
This paper proposes a new scheme for the parallel implementation of compressive sensing based SAR imaging on GPU using the Iterative Shrinkage/Thresholding (IST) algorithm. To obtain a faster recovery speed, we modify the structure of the existing IST algorithm and realize a fast implementation on GPU. The experimental results show that the parallel computing capabilities of the GPU deliver a significant speedup compared with the CPU.
Keywords: Compressive Sensing, Synthetic Aperture Radar, Graphics Processing Unit, CUDA
1. Introduction
As a major remote sensing sensor, synthetic aperture radar (SAR) can produce high resolution images from a moving platform, such as an airplane or a satellite. A SAR system produces 2D (range and azimuth) terrain reflectivity images by emitting a sequence of closely spaced radio frequency pulses and sampling the echoes scattered from the ground targets [1]. The main advantage of SAR is that images of the illuminated area can be obtained independent of time of day or weather conditions (e.g., fog, cloud level, rain, and snow). Modern airborne and spaceborne SAR systems can produce very high resolution images and are widely used in many civilian and military applications [1, 2].
Compressive sensing (CS), proposed by Donoho [3], Emmanuel Candès [4], Michael Elad [5] et al., is a newly developed theory that enables perfect recovery of signals and data from what appear to be highly sub-Nyquist-rate samples. CS states that an unknown signal that is sparse (or sparse in a certain basis) can be exactly recovered with high probability from a very limited number of measurements by solving a convex l1 optimization problem. Built on rigorous mathematics, CS has attracted much attention in image processing, multi-sensor data fusion, radar applications and so on. Several publications have already addressed the adoption of CS in radar applications, including SAR and Inverse Synthetic Aperture Radar (ISAR) [6-9].
However, the reconstruction of a sparse signal requires numerous matrix-vector multiplications, which imposes a heavy computational burden, especially when the sensing matrix is large and dense. Meanwhile, the computational load of compressive sensing based SAR imaging keeps growing with the increasing demand for high resolution SAR images. As a result, reconstructing a SAR image on a CPU takes so long that real-time processing is not possible.
Recently, the fast processing performance of the graphics processing unit (GPU) has offered an alternative for the fast reconstruction of sparse signals. This paper realizes the fast reconstruction of compressive sensing based SAR images by taking advantage of the efficient parallel computing capabilities of the GPU.
Parallel Implementation of Compressive Sensing Based SAR Imaging with GPU Tian Jihua, Sun Jinping, Zhang Yuxi, Najeeb Ahmad, Zhang Bingchen
language that allows users to manage the GPU as a computational device without the help of a graphics API. In the CUDA architecture, a task is split into a grid of thread blocks, each of which consists of a series of threads. The thread blocks are distributed among the streaming multiprocessors of the GPU. CUDA adopts the single instruction multiple thread (SIMT) model, which means that all the threads in one block execute the same instruction code on different data, possibly in different states. In CUDA, every thread has its own dedicated registers, and communication between the threads of a block is realized through shared memory and a synchronization mechanism. This design helps minimize the cost of context switching on the GPU. The GPU is designed to realize parallel computation through numerous threads, making it suitable for large-scale parallel computing tasks that are intensive in computation and simple in logic.
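As a rough illustration of the grid/block/thread decomposition described above, the following Python sketch emulates on the CPU how each thread of a one-dimensional grid derives its global index and processes one element. The names `launch_1d` and `vector_add` and the block size of 4 are assumptions made for this illustration only. A real CUDA kernel is written in CUDA C and its threads run concurrently rather than in a loop.

```python
def launch_1d(kernel, n, block_dim):
    """Emulate a 1D CUDA launch: call kernel once per (block, thread) pair."""
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx)
    return grid_dim


def vector_add(a, b):
    """Add two vectors with one emulated thread per element."""
    out = [0] * len(a)

    def kernel(block_idx, block_dim, thread_idx):
        i = block_idx * block_dim + thread_idx  # global index of this thread
        if i < len(a):  # threads past the end of the data do nothing
            out[i] = a[i] + b[i]

    blocks = launch_1d(kernel, len(a), block_dim=4)
    return out, blocks


result, blocks_used = vector_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60])
print(result, blocks_used)
```

Every emulated thread runs the same kernel body and differs only in the index it computes, which is the essence of the SIMT model. The bounds check is needed because the last block is usually only partly filled.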
$$ s = \Psi x \tag{1} $$
where $\Psi = \{\psi_1, \psi_2, \ldots, \psi_N\}$ is an $N \times N$ matrix constructed from a sparse basis $\{\psi_i\}$, and $x$ is a vector in which all but $K$ entries are zero. Various expansions, including wavelets, the DCT, and Gabor frames, are widely used for the representation and compression of natural signals, images, and other data. The matrix $\Psi$ is constructed according to the selected expansion. In order to reconstruct the signal $s$, a set of $M$ measurements ($M < N$) is acquired, which are linear combinations of the points within $s$. More precisely,
$$ y = \Phi s = \Phi \Psi x \tag{2} $$
where $\Phi$ is an $M \times N$ matrix, hereinafter called the measurement matrix. Since $M < N$, recovery of the signal $s$ from the measurements $y$ is ill-posed in general. CS theory shows that when the matrix $A = \Phi \Psi$ has the Restricted Isometry Property (RIP) [3-5], $x$ (and hence the signal $s$) can be recovered with high probability from a set of only $M = O(K \log(N/K))$ measurements $y$. The RIP is closely related to an incoherence property between $\Phi$ and $\Psi$, which means that the rows of $\Phi$ cannot provide a sparse representation of the columns of $\Psi$ and vice versa. The RIP and incoherence hold for many pairs of bases, such as delta spikes and Fourier sinusoids, or sinusoids and wavelets. It can be proved that a (pseudo) random noise-like matrix, such as one constructed from Bernoulli or Gaussian random variables, performs excellently as $\Phi$. Another choice for the measurement matrix that offers good performance in many cases is a causal, quasi-Toeplitz matrix [3-6]. When the RIP holds, $x$ or the signal $s$ can be recovered from the solution of a convex optimization problem. Formally, with high probability, $x$ is the unique solution to
$$ \min_x \|x\|_1 \quad \text{s.t.} \quad y = \Phi \Psi x \tag{3} $$
which can be solved efficiently with linear programming techniques. Current reconstruction methods include iterative greedy algorithms, such as Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP), and convex relaxation algorithms, such as Basis Pursuit (BP) and Iterative Shrinkage/Thresholding (IST).
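The measurement model of Equation 2 can be sketched in a few lines of Python. This is a toy setup under stated assumptions, not the configuration used in the paper: the sparse basis is taken as the identity (so the sensing matrix equals the measurement matrix and s = x), the measurement matrix is Gaussian, and the constant 2 in front of K log(N/K) is an arbitrary illustrative choice. The helper names are hypothetical.

```python
import math
import random


def sparse_vector(n, k, rng):
    """Return a length-n vector with exactly k nonzero entries."""
    x = [0.0] * n
    for i in rng.sample(range(n), k):
        x[i] = rng.choice([-1.0, 1.0]) * (1.0 + rng.random())
    return x


def gaussian_matrix(m, n, rng):
    """M x N matrix of i.i.d. Gaussian entries, scaled by 1/sqrt(M)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]


def matvec(a, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


rng = random.Random(0)
N, K = 64, 4
M = math.ceil(2 * K * math.log(N / K))  # illustrative constant, not from the paper
x = sparse_vector(N, K, rng)
Phi = gaussian_matrix(M, N, rng)
y = matvec(Phi, x)  # Psi is the identity here, so y = Phi x
print(M, len(y), sum(1 for v in x if v != 0.0))
```

With these toy values the number of measurements is well below the signal length, which is the regime in which recovery must rely on sparsity rather than on inverting the measurement matrix.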
In practice, the measurements are always disturbed by noise or other interference, so it is no longer appropriate to enforce $y = \Phi \Psi x$ exactly as in the constraint of Equation 3. Generally, the problem is solved by transforming the constrained convex optimization problem into the following unconstrained convex optimization problem
$$ \min_x \; \frac{1}{2} \|y - Ax\|_2^2 + \tau \|x\|_1 \tag{4} $$
where $\|\cdot\|_2$ denotes the Euclidean norm and $\tau$ is the regularization parameter, which provides a tradeoff between fidelity to the measurements and noise sensitivity. Iterative Shrinkage/Thresholding (IST) [11, 12] is a state-of-the-art algorithm for solving this unconstrained convex optimization problem, with the following iterative scheme
$$ x_{k+1} = \mathrm{soft}\left(x_k + A^H (y - A x_k),\, \tau\right) \tag{5} $$
where $\mathrm{soft}(x, \tau) = \mathrm{sgn}(x) \max(|x| - \tau, 0)$ is the shrinkage operator. The IST algorithm has been applied extensively to the unconstrained convex optimization problems arising in sparse signal recovery, image restoration and other linear inverse problems. Each iteration of IST requires only matrix-vector multiplications and additions, which makes it well suited to the efficient parallel computing capabilities of the GPU. This paper therefore realizes the fast reconstruction of compressive sensing based SAR images using the IST algorithm on GPU. The detailed procedure of the IST algorithm is as follows:
1. Initialize $x_0 = 0$, the residual vector $r_0 = y$, and set the iteration step $k = 0$;
2. Compute the correlation of $A$ with the current residual, and the next estimate from the current one
$$ x_{k+1} = x_k + A^H r_k \tag{6} $$
3. Process the new estimate with the shrinkage operator
$$ x_{k+1} = \mathrm{soft}(x_{k+1}, \tau) \tag{7} $$
4. Update the residual vector
$$ r_{k+1} = y - A x_{k+1} \tag{8} $$
5. Compute the objective function value $f_{k+1} = 0.5 \|r_{k+1}\|_2^2 + \tau \|x_{k+1}\|_1$ and its change $\Delta f = f_{k+1} - f_k$. If the relative change $|\Delta f| / f_k$ is smaller than the stopping threshold, terminate the iteration and output the estimate $x$. Otherwise, go to step 2 for the next iteration.
In the IST algorithm, the most computationally expensive portion is the matrix-vector multiplication involving $A$ and $A^H$, each with a computational complexity of $O(MN)$. Moreover, each iteration requires two such multiplications, so the cost of the whole recovery process is very large, especially when the matrix is large and dense. However, if we rewrite Equation 5 as Equation 9, $A^H y$ enters the computation as a constant vector, and the two multiplications reduce to a single multiplication involving only $A^H A$ in each iteration. Although the computational complexity of $A^H A x_k$ is $O(N^2)$, which is larger than that of the matrix-vector multiplications involving $A$ and $A^H$ on CPU, it does reduce the overall computation time when implemented in parallel.
$$ x_{k+1} = \mathrm{soft}\left(x_k + A^H y - A^H A x_k,\, \tau\right) \tag{9} $$
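To make the equivalence of the two update forms concrete, the sketch below runs both on a small problem and compares the iterates. It is a plain-Python illustration under assumptions: the 2 x 4 matrix, the measurement vector and the value of the regularization parameter are made up, and the helper names (`step_eq5`, `step_eq9`, and so on) are hypothetical. The first form needs two matrix-vector products per iteration, the second only one product with the precomputed Gram matrix.

```python
def soft(x, tau):
    """Shrinkage operator sgn(x) * max(|x| - tau, 0); also works for complex x."""
    mag = abs(x)
    return x * (1.0 - tau / mag) if mag > tau else 0.0


def matvec(a, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def hermitian(a):
    """Conjugate transpose A^H of a matrix stored as a list of rows."""
    return [[a[i][j].conjugate() for i in range(len(a))] for j in range(len(a[0]))]


def matmul(a, b):
    """Dense matrix-matrix product."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def step_eq5(a, ah, y, x, tau):
    """One update in the residual form: products with A and then with A^H."""
    r = [yi - axi for yi, axi in zip(y, matvec(a, x))]
    g = matvec(ah, r)
    return [soft(xi + gi, tau) for xi, gi in zip(x, g)]


def step_eq9(gram, ahy, x, tau):
    """One update in the Gram form: a single product with A^H A."""
    gx = matvec(gram, x)
    return [soft(xi + bi - gi, tau) for xi, bi, gi in zip(x, ahy, gx)]


# Made-up toy data; the entries are small so that the iteration is stable.
A = [[0.5, -0.2, 0.1, 0.3],
     [0.1, 0.4, -0.3, 0.2]]
y = [1.0, -0.5]
tau = 0.05

AH = hermitian(A)
gram, ahy = matmul(AH, A), matvec(AH, y)  # computed once, before the loop

x5 = [0.0] * 4
x9 = [0.0] * 4
for _ in range(20):
    x5 = step_eq5(A, AH, y, x5, tau)
    x9 = step_eq9(gram, ahy, x9, tau)

max_diff = max(abs(u - v) for u, v in zip(x5, x9))
print(max_diff)
```

The two sequences agree up to floating-point rounding. In terms of operation counts, the residual form costs about 2MN multiply-adds per iteration and the Gram form about N^2, so the Gram form performs more arithmetic whenever M < N/2. Its appeal on a GPU is that each iteration becomes a single dense product on data that stays on the device.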
We therefore propose a new scheme based on the above analysis: precompute $A^H A$ and $A^H y$ before the iteration, so that only one matrix-vector multiplication is required in each iteration step. This not only reduces the cost of matrix-vector multiplication, but also reduces the time spent on data transfer between the two multiplications. In addition, the residual vector must be available when the objective function value is calculated. However, the calculation of the residual vector involves the matrix $A$, which conflicts with the proposed method requiring only $A^H A$ and $A^H y$. So we need some changes to the calculation of the residual vector to fit the proposal. Fortunately, we find that if we replace the formula for the objective function value with Equation 10, the two methods behave the same when judging whether the termination criterion is satisfied based on the relative change of two consecutive objective function values, although $\tilde{f}$ is different from $f$. In addition, SAR images are not sparse in every range gate, so we need to add a constraint to avoid recovering the non-sparse ones. If the objective function value in one step is no less than the one obtained in the previous step, we consider the scene not sparse and exit the recovery.
$$ \tilde{f} = 0.5 \|A^H y - A^H A x\|_2^2 + \tau \|x\|_1 \tag{10} $$
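The basic procedure of the modified version of the IST algorithm is as follows:
1. Initialize $x_0 = 0$, compute $A^H y$ and $A^H A$, and set the iteration step $k = 0$;
2. Compute the correlation of $A$ with the current residual
$$ x_{temp} = A^H y - A^H A x_k \tag{11} $$
3. Compute the new estimate
$$ x_{k+1} = \mathrm{soft}(x_k + x_{temp}, \tau) \tag{12} $$
4. Compute the objective function value $\tilde{f}_{k+1}$ according to Equation 10 and its change $\Delta \tilde{f} = \tilde{f}_{k+1} - \tilde{f}_k$. If the relative change $|\Delta \tilde{f}| / \tilde{f}_k$ is smaller than the stopping threshold, terminate the iteration and output the estimate $x$. Otherwise, go to step 2 for the next iteration.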
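The modified procedure can be sketched in plain Python as follows. This is a CPU illustration under assumptions rather than the GPU code of the paper: the function name `modified_ist`, the default threshold and iteration cap, and the order of the two exit checks (convergence is tested before the non-sparse exit) are choices made for this example. The sketch reuses the vector of Equation 11 computed at the new estimate both for the objective value and for the next update, so each iteration performs one product with the Gram matrix.

```python
def soft(x, tau):
    """Shrinkage operator sgn(x) * max(|x| - tau, 0); also works for complex x."""
    mag = abs(x)
    return x * (1.0 - tau / mag) if mag > tau else 0.0


def matvec(a, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def modified_ist(gram, ahy, tau, tol=1e-6, max_iter=500):
    """Run IST using only gram = A^H A and ahy = A^H y.

    Returns (x, status, iterations); status is "converged",
    "not_sparse" or "max_iter".
    """
    x = [0.0] * len(ahy)
    x_temp = list(ahy)  # A^H y - A^H A x_0 with x_0 = 0
    f_prev = 0.5 * sum(abs(t) ** 2 for t in x_temp)  # modified objective at x_0
    for k in range(1, max_iter + 1):
        x = [soft(xi + ti, tau) for xi, ti in zip(x, x_temp)]
        # One product with A^H A serves the objective and the next update.
        x_temp = [b - g for b, g in zip(ahy, matvec(gram, x))]
        f = 0.5 * sum(abs(t) ** 2 for t in x_temp) + tau * sum(abs(v) for v in x)
        change = f - f_prev
        if abs(change) < tol * f_prev:  # relative change below the threshold
            return x, "converged", k
        if change >= 0:  # objective did not decrease: treat the scene as not sparse
            return x, "not_sparse", k
        f_prev = f
    return x, "max_iter", max_iter


# Assumed toy problem: A is the 4 x 4 identity, so A^H A = I and A^H y = y.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x_hat, status, iters = modified_ist(identity, [2.0, 0.0, -1.0, 0.0], tau=0.5)
print(x_hat, status, iters)
```

For the identity example the first update already yields the soft-thresholded measurements, and the second iteration detects that the objective no longer changes. For a general sensing matrix, IST is usually analysed under a scaling in which the spectral norm of A does not exceed 1, so the data may need to be normalized before this loop is used.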
we add all the elements step by step. On the GPU, however, each block holds at most 512 threads, which is fewer than the dimension of the vector, so we split the task into a parallel accumulation over multiple thread blocks: each block is responsible for adding part of the data, and the partial sums are added together at the end. To make the most efficient use of the computing power of the GPU, we should use enough thread blocks to keep the maximum number of active thread blocks per multiprocessor. However, communication between thread blocks works only through global memory, which is limited on the GPU and has a long access latency, so memory access costs a lot of time if too many blocks are used. We therefore have to make a tradeoff when selecting the number of thread blocks. In the fine-grained parallel realization of vector multiplication, and when reducing multidimensional data to one dimension, each thread block has to sum many values. This paper adopts the parallel summation reduction method to make the most of the parallel performance of the GPU; Figure 1 shows the procedure of parallel summation reduction with 8 elements. The traditional serial summation method requires n steps to sum n elements, while the parallel summation reduction method requires only log2 n steps. Meanwhile, the parallel summation reduction works with sequential addressing, which is free of bank conflicts and thus avoids a reduction in effective access bandwidth. In addition, the threads in each warp either all execute the summation or all skip it, which avoids the performance degradation caused by divergence.
Figure 1. Procedure of parallel summation reduction with 8 elements (indexed 0 to 7)
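The reduction pattern with sequential addressing can be emulated on the CPU as in the sketch below. It is an illustration only: the function names are hypothetical, the inner loop stands in for threads that would run concurrently on a GPU, and the list `data` stands in for the shared memory of one block. The block size is assumed to be a power of two.

```python
def block_reduce(values):
    """Tree-sum the data of one block; len(values) must be a power of two.

    At each step the first `stride` threads add the element that lies
    `stride` positions away. Returns (total, number_of_steps).
    """
    data = list(values)  # stands in for the shared memory of the block
    stride, steps = len(data) // 2, 0
    while stride > 0:
        for tid in range(stride):  # on a GPU these additions run concurrently
            data[tid] += data[tid + stride]
        stride //= 2
        steps += 1
    return data[0], steps


def grid_reduce(values, block_dim):
    """Split the input over blocks, reduce each block, then add the partial sums."""
    partial = []
    for start in range(0, len(values), block_dim):
        chunk = list(values[start:start + block_dim])
        chunk += [0] * (block_dim - len(chunk))  # pad the last block with zeros
        partial.append(block_reduce(chunk)[0])
    return sum(partial)


total, steps = block_reduce([0, 1, 2, 3, 4, 5, 6, 7])
print(total, steps)
print(grid_reduce(list(range(1000)), block_dim=512))
```

With 8 elements the active thread counts are 4, 2 and 1, so the sum is complete after three steps. Because the active threads are always the lowest-numbered ones and each reads a contiguous range, this layout corresponds to the sequential addressing mentioned above.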
5. Experiment
To validate the speedup of the parallel realization of compressive sensing based SAR imaging on GPU, we reconstructed the same SAR image using the IST algorithm on CPU and on GPU. The CPU used in this paper is an Intel Core2 Quad 8400 at 2.66 GHz, and the GPU is a Tesla C1060. The data used in the experiment are real airborne SAR data collected by an X-band SAR with a resolution of 2 m. For details of the compressive sensing based SAR imaging technique, please refer to [16]. We implemented the CPU and GPU code in single-precision floating point and computed the average processing time over 100 repeated executions on CPU and GPU separately. Figure 2(a) shows the conventional SAR imaging result with full samples, while Figure 2(b) shows the compressive sensing based SAR image reconstructed from 50% of the samples on GPU. The execution times on CPU and GPU are shown in Table 1. From Table 1, we can see that the GPU implementation is about 35 times faster than the CPU implementation.
Figure 2. (a) Conventional SAR imaging result with full samples. (b) Compressive sensing based SAR imaging result with 50% samples implemented on GPU.

Table 1. The average execution times on CPU and GPU
          CPU      GPU      Speedup
Time/s    8.995    0.258    35
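The averaging over repeated executions and the speedup figure in Table 1 can be reproduced with a few lines of Python. The helper names `average_time` and `speedup` are hypothetical, and the workload in the last line is a stand-in used only to show the call. When timing GPU code in this way, the device must be synchronized before the clock is stopped, because kernel launches return before the work has finished.

```python
import time


def average_time(fn, repeats=100):
    """Mean wall-clock time of fn() over `repeats` executions, in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def speedup(t_cpu, t_gpu):
    """Ratio of the CPU time to the GPU time."""
    return t_cpu / t_gpu


# Times taken from Table 1 (seconds).
print(round(speedup(8.995, 0.258), 1))

# Stand-in workload, only to demonstrate the timing helper.
print(average_time(lambda: sum(range(1000)), repeats=100) >= 0.0)
```

The ratio of the two tabulated times is slightly below 35, which the table rounds to 35.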
6. Conclusion
This paper realizes the parallel implementation of compressive sensing based SAR imaging on GPU, adopting the Iterative Shrinkage/Thresholding algorithm to reconstruct the SAR image. To make the most efficient use of the parallelism of the GPU, we modify the structure of the existing IST algorithm and realize a fast implementation on the CUDA platform. The experimental results show that the parallel computing capabilities of the GPU deliver a significant speedup compared with the CPU.
7. Acknowledgement
This work was supported by the 973 Program of China under Grant 2010CB731903 and the National Natural Science Foundation of China (Grant No. 60901056).
8. References
[1] W. G. Carrara, R. S. Goodman, R. M. Majewski, Spotlight Synthetic Aperture Radar: Signal Processing Algorithms, Norwood, MA: Artech House, 1995.
[2] I. G. Cumming and F. Wong, Digital Processing of Synthetic Aperture Radar, Norwood, MA: Artech House, 2005.
[3] D. L. Donoho, Compressed Sensing, IEEE Trans. on Info. Theory, vol.52, no.4, pp.1289-1306, 2006.
[4] E. Candès, J. Romberg and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Trans. on Info. Theory, vol.52, no.2, pp.489-509, 2006.
[5] M. Elad, Optimized Projections for Compressed Sensing, IEEE Trans. on Signal Process., vol.55, no.12, pp.5695-5702, 2007.
[6] R. Baraniuk, P. Steeghs, Compressive radar imaging, IEEE Radar Conference, pp.128-133, 2007.
[7] M. Herman, T. Strohmer, Compressed sensing radar, IEEE Radar Conference, pp.1-6, 2008.
[8] V. M. Patel, G. R. Easley, D. M. Healy and R. Chellappa, Compressed Synthetic Aperture Radar, IEEE Journal of Selected Topics in Signal Processing, vol.4, no.2, pp.244-254, 2010.
[9] J. H. G. Ender, On compressive sensing applied to radar, Signal Processing, vol.90, no.5, pp.1402-1414, 2010.
[10] NVIDIA, CUDA Programming Guide, Version 2.3.1, August 2009.
[11] M. A. T. Figueiredo and R. D. Nowak, An EM algorithm for wavelet-based image restoration, IEEE Transactions on Image Processing, vol.12, no.8, pp.906-916, 2003.
[12] Ingrid Daubechies, Michel Defrise, Christine De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Communications on Pure and Applied Mathematics, vol.57, pp.1413-1457, 2004.
[13] Zhiyong Yuan, Yuanyuan Zhang, Jianhui Zhao, Yihua Ding, Chengjiang Long, Lu Xiong, Dengyi Zhang, Guozhong Liang, Real-time Simulation for 3D Tissue Deformation with CUDA Based GPU Computing, JCIT: Journal of Convergence Information Technology, vol.5, no.4, pp.109-119, 2010.
[14] Xiangyun Liao, Zhiyong Yuan, Weixin Si, Zhaoliang Duan, Ruixue Mao, Jianhui Zhao, Research and Application of Parallel Computing Technologies based on CUDA and OpenCL, Journal of Convergence Information Technology, vol.6, no.6, 2011.
[15] Sangkyun Lee, Stephen J. Wright, Implementing algorithms for signal and image reconstruction on graphical processing units, Computer Sciences Department, University of Wisconsin-Madison, Tech. Rep., November, 2008.
[16] Jihua Tian, Jinping Sun, Xiao Han, Bingchen Zhang, Motion Compensation for Compressive Sensing SAR Imaging with Autofocus, The 6th IEEE Conference on Industrial Electronics & Applications, pp.1564-1567, 2011.