A Demonstration of Exact String Matching Algorithms With CUDA
Author List
Raymond Tay (Autodesk, formerly Linden Lab)
Summary
In this chapter, the author presents a demonstration application of three commonly used exact
string matching algorithms implemented with NVIDIA CUDA technology: Brute-force, QuickSearch
and Horspool. The author applies known CUDA techniques to implement, test and, where
applicable, optimize them. The main challenge the author faced was mapping CUDA's threading
and memory model onto algorithms that are normally designed to execute on a single-core CPU.
Through this effort, the author hopes to demonstrate the power of CUDA to the budding GPU
developer.
The work for all three algorithms revolves around getting a CUDA thread to scan the text and
locate a match; if it finds one, the CUDA thread updates a data structure that records the
position where the pattern was found. The data structures needed by the CUDA threads are
provided by the CUDA kernel.
In the CUDA version of Brute-force, N threads conduct the same search in parallel. Each of the
N threads attempts to match the pattern against the text, and when a thread discovers a match
it updates a data structure that stores the found indices.
The source code for the sequential and parallelized (CUDA) versions is shown below for
illustration purposes.
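As a minimal illustration of the idea (a sketch under stated assumptions, not the chapter's original listing), the C code below shows a sequential brute-force search next to the work a single thread would do in a one-thread-per-alignment scheme. The names `bf_sequential`, `bf_thread_body` and `bf_emulated_launch`, and the `found` flag array used to record match positions, are hypothetical; the parameter `tid` stands in for the global thread index that a CUDA kernel would compute from `blockIdx`, `blockDim` and `threadIdx`.

```c
#include <stddef.h>
#include <string.h>

/* Sequential brute force: try every alignment j of pattern x (length m)
   against text y (length n) and store each matching j in out[].
   Returns the number of matches found. */
size_t bf_sequential(const char *x, size_t m, const char *y, size_t n,
                     size_t *out)
{
    size_t count = 0;
    if (m == 0 || m > n) return 0;
    for (size_t j = 0; j + m <= n; ++j)
        if (memcmp(x, y + j, m) == 0)
            out[count++] = j;
    return count;
}

/* Work of ONE thread when each thread owns one alignment of the text.
   tid is an assumed stand-in for the CUDA global thread index.
   found[] must be zero-initialized by the caller; found[tid] becomes 1
   when the pattern occurs at position tid. */
void bf_thread_body(size_t tid, const char *x, size_t m,
                    const char *y, size_t n, unsigned char *found)
{
    if (m == 0 || m > n || tid > n - m) return; /* thread has no window */
    size_t i = 0;
    while (i < m && x[i] == y[tid + i]) ++i;
    found[tid] = (unsigned char)(i == m);
}

/* Host-side stand-in for a kernel launch: runs every thread id in turn. */
void bf_emulated_launch(const char *x, size_t m, const char *y, size_t n,
                        unsigned char *found)
{
    for (size_t tid = 0; tid < n; ++tid)
        bf_thread_body(tid, x, m, y, n, found);
}
```

Because each thread writes only to its own slot of `found`, no atomic operations are needed in this layout; a compacted list of indices would require an atomic counter or a separate compaction pass.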
Each CUDA thread can potentially read every character of the pattern and obtain a match, for
example when occurrences of the pattern follow one another in the text; this translates to
(N*m) bytes of data being read. Each CUDA thread potentially writes at most n/m times (n being
the text length and m the pattern length, and assuming occurrences of the pattern follow one
after another), but in general the text and pattern could be completely random.
QuickSearch
The sequential QuickSearch is a variant of the popular Boyer-Moore algorithm that does not
suffer from sub-optimal performance when matching patterns drawn from small alphabets such
as DNA.
In the classic QuickSearch, the inventor of the algorithm dropped the “good suffix shift”
(also known as the “matching shift”) computation in favour of the “bad-character shift” (also
known as the “occurrence shift”) computation. The algorithm precomputes the “bad-character
shift” for the pattern and then uses the resulting table to aid its search for the pattern in
the text body.
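The precompute-then-scan structure described above can be sketched in C as follows. This is a minimal sketch of the textbook Quick Search, not the chapter's original listing: the names `qs_pre_bc` and `qs_search` and the 256-entry table size `ASIZE` (one entry per byte value) are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define ASIZE 256 /* assumed alphabet size: one entry per byte value */

/* Bad-character (occurrence) shift table for Quick Search.
   A character absent from x shifts the window by m + 1; otherwise the
   shift is m minus the index of its rightmost occurrence in x. */
void qs_pre_bc(const char *x, size_t m, size_t qs_bc[ASIZE])
{
    for (size_t c = 0; c < ASIZE; ++c)
        qs_bc[c] = m + 1;
    for (size_t i = 0; i < m; ++i)
        qs_bc[(unsigned char)x[i]] = m - i;
}

/* Sequential Quick Search: stores every match position in out[] and
   returns the number of matches. */
size_t qs_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out)
{
    size_t qs_bc[ASIZE], count = 0, j = 0;
    if (m == 0 || m > n) return 0;
    qs_pre_bc(x, m, qs_bc);
    while (j + m <= n) {
        if (memcmp(x, y + j, m) == 0)
            out[count++] = j;
        if (j + m >= n) break; /* no character follows the window */
        /* The shift depends only on the character just past the window,
           whether or not the window matched. */
        j += qs_bc[(unsigned char)y[j + m]];
    }
    return count;
}
```

For the pattern `GCAGAGAG`, the table holds 2 for `A`, 7 for `C`, 1 for `G` and 9 for every character that does not occur in the pattern.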
In the CUDA version, the classic QuickSearch has been reorganized so that the “bad-character
shift” computation is parallelized. For the scanning code, a “skipping distance” data structure
is pre-computed and then used by the CUDA kernel: it is a 1D array containing the skipping
distances, regardless of whether a position yields a match or a mismatch, in which each valid
element is a CUDA thread's id. In the CUDA kernel, a thread executes the scanning code only if
it can locate its id in this “skipping distance” data structure.
The source code for the sequential and CUDA versions of QuickSearch is presented below:
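One possible reading of the “skipping distance” idea is sketched below in C, with the CUDA threads emulated on the host. It is not the chapter's original listing. The sketch assumes the pre-computed structure can be represented as a flag array `active`, where `active[j]` is 1 exactly when the sequential QuickSearch would place its window at position `j`; the names `build_qs_shift`, `qs_mark_active` and `qs_thread_body` are hypothetical, and `tid` stands in for the CUDA global thread index.

```c
#include <stddef.h>
#include <string.h>

#define ALPHABET 256 /* assumed alphabet size: one entry per byte value */

/* Quick Search occurrence shifts (same rule as the sequential version). */
void build_qs_shift(const char *x, size_t m, size_t bc[ALPHABET])
{
    for (size_t c = 0; c < ALPHABET; ++c)
        bc[c] = m + 1;
    for (size_t i = 0; i < m; ++i)
        bc[(unsigned char)x[i]] = m - i;
}

/* Pre-computation: follow the shifts from position 0 and flag every
   alignment the sequential scan would visit. The shift never depends on
   whether the window matched, so no comparison is needed here.
   active[] must have n entries. */
void qs_mark_active(const char *x, size_t m, const char *y, size_t n,
                    unsigned char *active)
{
    size_t bc[ALPHABET];
    memset(active, 0, n);
    if (m == 0 || m > n) return;
    build_qs_shift(x, m, bc);
    for (size_t j = 0; j + m <= n; ) {
        active[j] = 1;
        if (j + m >= n) break; /* no character follows the window */
        j += bc[(unsigned char)y[j + m]];
    }
}

/* Work of ONE emulated thread: scan only if this thread's id was flagged.
   found[] must be zero-initialized by the caller. */
void qs_thread_body(size_t tid, const char *x, size_t m,
                    const char *y, size_t n,
                    const unsigned char *active, unsigned char *found)
{
    if (tid >= n || !active[tid]) return; /* id not in the structure */
    found[tid] = (unsigned char)(memcmp(x, y + tid, m) == 0);
}
```

In this layout the chain of shifts is followed once up front, and the per-thread comparisons are then independent of one another. A sorted list of ids searched by each thread would be an alternative representation of the same structure.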
Horspool
The “bad-character shift” computation is the same as the one shown in the sequential
QuickSearch.
In the CUDA version, the approach the author has taken is very similar to that of the CUDA
version of QuickSearch, i.e. the “bad-character shift” computation is parallelized, and for the
scanning code a “skipping distance” data structure (a 1D array containing the skipping
distances, regardless of a match or mismatch, in which each valid element is a CUDA thread's
id) is pre-computed and then used by the CUDA kernel. In the CUDA kernel, a thread executes the
scanning code only if it can locate its id in this “skipping distance” data structure.
The source code for the sequential and CUDA versions of Horspool is shown below:
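For the sequential side, a minimal C sketch of the textbook Horspool algorithm is given here; it is not the chapter's original listing. The names `hs_pre_bc` and `hs_search` and the 256-entry table size `HS_ASIZE` are assumptions. The textbook table is built with the same kind of loop as the QuickSearch one, but it measures distances from the last pattern character and is indexed by the last character of the current window.

```c
#include <stddef.h>
#include <string.h>

#define HS_ASIZE 256 /* assumed alphabet size: one entry per byte value */

/* Horspool bad-character table: a character that does not occur in
   x[0..m-2] shifts by m; otherwise the shift is the distance from its
   rightmost occurrence in x[0..m-2] to the last pattern position. */
void hs_pre_bc(const char *x, size_t m, size_t hs_bc[HS_ASIZE])
{
    for (size_t c = 0; c < HS_ASIZE; ++c)
        hs_bc[c] = m;
    for (size_t i = 0; i + 1 < m; ++i)
        hs_bc[(unsigned char)x[i]] = m - 1 - i;
}

/* Sequential Horspool: stores every match position in out[] and
   returns the number of matches. */
size_t hs_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out)
{
    size_t hs_bc[HS_ASIZE], count = 0, j = 0;
    if (m == 0 || m > n) return 0;
    hs_pre_bc(x, m, hs_bc);
    while (j + m <= n) {
        unsigned char c = (unsigned char)y[j + m - 1]; /* last window char */
        if ((unsigned char)x[m - 1] == c && memcmp(x, y + j, m - 1) == 0)
            out[count++] = j;
        j += hs_bc[c]; /* shift is taken whether or not the window matched */
    }
    return count;
}
```

As with QuickSearch, the shift at each position depends only on a single text character, which is the property that lets the visited positions be determined ahead of the comparisons.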
Two sorts of tests were conducted: (1) the pattern was shorter than the alphabet size, and
(2) the pattern was longer than the alphabet size.
One observation from the tests is that the speedup of the CUDA code over the sequential code
ranges from 31x to 106x. Another observation is that the CUDA versions of the code exhibit
branch divergence and bank conflicts, and that this behavior is highly dependent on the pattern
and the text involved.
Table 2: Test results when pattern is longer than the size of the alphabet
Final Evaluation
The author believes that the performance gains would be better if the implementation used
asynchronous concurrent execution, since executing multiple kernels concurrently could
improve the run times. The author also found that compiler optimizations beyond -O2 for the
sequential algorithms did not seem to affect the overall run times.
The author hoped to implement a multi-GPU solution but, due to a lack of resources, this cannot
be pursued in the near future, though the author would get a big kick out of it!
References
• KIRK D. and HWU W., 2010, Programming Massively Parallel Processors, first edition.
• AHO A.V., 1990, Algorithms for finding patterns in strings, in Handbook of Theoretical
Computer Science, Volume A, Algorithms and complexity, J. van Leeuwen ed., Chapter
5, pp 255-300, Elsevier, Amsterdam.
• HORSPOOL R.N., 1980, Practical fast searching in strings, Software - Practice &
Experience, 10(6):501-506.
• SUNDAY D.M., 1990, A very fast substring search algorithm, Communications of the
ACM, 33(8):132-142.
• Quick Search Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• Horspool Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• Brute-force Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• NVIDIA CUDA Programming Guide 3.0
• NVIDIA CUDA Reference Manual 3.0
• NVIDIA CUDA Best Practices Guide