A Demonstration of Exact String Matching Algorithms With CUDA

The document summarizes the author's demonstration of implementing three exact string matching algorithms (Brute-force, QuickSearch, and Horspool) using CUDA. The author maps the typically sequential algorithms to CUDA's parallel programming model and graphics hardware. Benchmark results show the CUDA implementations achieve speedups from 31x to 106x compared to sequential versions. While optimization opportunities remain, such as asynchronous execution, the author believes GPU acceleration can benefit applications using string matching.

Demonstration of Exact String Matching Algorithms using CUDA

Author List
Raymond Tay (Autodesk, formerly Linden Lab)

Summation
In this chapter, the author presents a demonstration application of three commonly used exact
string matching algorithms using NVIDIA CUDA technology: Brute-force, QuickSearch and
Horspool. The author applies known CUDA techniques to implement, test and, where applicable,
optimize them. The main challenge the author faced was mapping CUDA's threading and memory
model onto algorithms that are normally designed to execute on a single CPU core. Through this
effort, the author hopes to demonstrate the power of CUDA to the budding GPU developer.

Introduction, Problem Statement, and Context

String matching is a very important subject in the wider domain of text processing. String-
matching algorithms are basic components in the implementation of practical software
under most operating systems. String matching consists of finding one or more
occurrences of a pattern in a body of text. All the algorithms in this work locate all occurrences
of the pattern in the text body, aided by GPU acceleration. The algorithms were tested with
patterns whose lengths are both shorter than and greater than the alphabet size. The pattern is
denoted by x=x[0..m-1], where m denotes its length; the text is denoted by y=y[0..n-1], where n
denotes its length. The alphabet of the text and pattern is the set of all symbols used to represent
strings and is denoted by ∑ (e.g. the alphabet of a binary string is ∑={0,1}); its size is denoted by
∂ (e.g. the size of the alphabet for binary strings is ∂=2).
The author is aware of the wide applicability of string matching algorithms, ranging from text
editors and the popular Unix tool grep to virus scanning technology and locating DNA sequences.
The author believes that the techniques devised here can be leveraged by current mid-range
workstations, as they normally come equipped with CUDA/OpenCL-enabled graphics cards.

Core Method
The methods applied during development include the following:

1) Find ways to parallelize the sequential code

2) Minimize data transfer between the host and device

3) Coalesce global memory accesses as much as possible

4) Avoid branch divergence within a CUDA warp

For all three algorithms, the work revolves around getting a CUDA thread to scan the text and
locate a match; if the thread finds a match, it updates a data structure that records the
position where the pattern was found. The data structures needed by the CUDA threads are
provided by the CUDA kernel.

Algorithms, Implementations, and Evaluations

Brute-force
The sequential form consists of a function, BF (short for brute force), which attempts to
match the pattern against the text by scanning the text from left to right. In the sequential
code, a single thread conducts the search and, when it finds a match, prints to the console the
position at which the match was found.

In the CUDA version, N threads can conduct the same search. Each of the N threads scans the
text for a match in parallel, and when a thread discovers a match it updates a data structure
that stores the found indices.

The source code for the sequential and parallelized (CUDA) versions is shown below for
illustration purposes.

Illustration 1: Sequential Brute Force
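
As a minimal sketch of the sequential search described above (not the author's original listing), the brute-force scan can be written in C as follows. The function name bf_search and the out/max_out parameters are assumptions made for illustration; unlike the described version, which prints each position to the console, this sketch stores the positions in an array so that the result can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: sequential brute-force search.
 * Scans y[0..n-1] from left to right and records every start index at
 * which x[0..m-1] occurs. At most max_out indices are stored in out[];
 * the return value is the total number of matches found. */
size_t bf_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t max_out)
{
    size_t count = 0;
    assert(m > 0);
    for (size_t j = 0; j + m <= n; ++j) {      /* every window start */
        if (memcmp(x, y + j, m) == 0) {        /* compare the whole window */
            if (count < max_out)
                out[count] = j;
            ++count;
        }
    }
    return count;
}
```

Every window start is tested, so the work is proportional to n*m in the worst case; this is the loop that the CUDA version spreads across threads.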

In the worst case, where occurrences of the pattern follow one another in the text, each CUDA
thread reads every character of its window and obtains a match; this translates to (N*m) bytes
of data being read. Each CUDA thread potentially writes at most n/m times (assuming the
occurrences follow one after another), but in general the text and pattern could be completely
random.

Illustration 2: CUDA Brute Force
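
The thread mapping described above can be illustrated without a GPU. The sketch below is plain C, not CUDA: bf_thread stands in for the body one CUDA thread would execute, and the loop in bf_parallel_emulated stands in for the kernel launch. It assumes one logical thread per text position and a found[] array with one flag per position; both are assumptions for illustration and are not confirmed as the author's design.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one logical thread: test the window that starts at index tid.
 * Each thread writes only to its own slot found[tid], so no two threads
 * ever write to the same location. */
static void bf_thread(size_t tid, const char *x, size_t m,
                      const char *y, size_t n, unsigned char *found)
{
    if (tid + m > n)
        return;                      /* window would run past the text */
    for (size_t i = 0; i < m; ++i)
        if (x[i] != y[tid + i])
            return;                  /* mismatch: nothing to record */
    found[tid] = 1;                  /* record the match position */
}

/* Host-side stand-in for a kernel launch: a loop over thread ids.
 * found[] must have room for n flags. */
void bf_parallel_emulated(const char *x, size_t m, const char *y, size_t n,
                          unsigned char *found)
{
    assert(m > 0);
    for (size_t tid = 0; tid < n; ++tid) {
        found[tid] = 0;
        bf_thread(tid, x, m, y, n, found);
    }
}
```

Because the iterations are independent of one another, they could run in any order or concurrently, which is the property a CUDA kernel relies on. The early return on a mismatch is also where threads of the same warp may diverge.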

QuickSearch
The sequential QuickSearch is a variant of the popular Boyer-Moore algorithm that does not
suffer from the problem of sub-optimal performance when matching patterns drawn from small
alphabets, such as DNA.

In the classic QuickSearch, the inventor of the algorithm dropped the “good suffix shift” (aka
“matching shift”) computation in favour of the “bad-character shift” (aka “occurrence shift”)
computation. The algorithm precomputes the “bad-character shift” for the pattern and then uses
the result of that computation to aid its search for the pattern in the text body.

In the CUDA version, the classic QuickSearch has been reorganized so that the “bad-character
shift” computation is parallelized. For the scanning code, a “skipping distance” data structure
is pre-computed for use by the CUDA kernel: it is a 1D array that holds the skipping distances
regardless of a match or mismatch, and each valid element is a CUDA thread's id. In the CUDA
kernel, a thread executes the scanning code only if it can locate its id in this “skipping
distance” data structure.

The source code for the sequential and CUDA versions of QuickSearch is presented below:

Illustration 3: Sequential QuickSearch
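
A minimal C sketch of a sequential QuickSearch, in the textbook form and not the author's original listing, is given here. The names qs_pre, qs_search and ASIZE are assumptions for illustration. The table gives a shift of m - i for the last occurrence i of each pattern character and m + 1 for characters that do not occur in the pattern; the shift is chosen from the text character just after the current window.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ASIZE (UCHAR_MAX + 1)   /* alphabet size: all byte values */

/* Bad-character shift table for QuickSearch. */
static void qs_pre(const char *x, size_t m, size_t qs_bc[ASIZE])
{
    for (size_t c = 0; c < ASIZE; ++c)
        qs_bc[c] = m + 1;                       /* character not in pattern */
    for (size_t i = 0; i < m; ++i)
        qs_bc[(unsigned char)x[i]] = m - i;     /* last occurrence wins */
}

/* Records up to max_out match positions in out[]; returns the match count. */
size_t qs_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t max_out)
{
    size_t qs_bc[ASIZE];
    size_t count = 0, j = 0;
    assert(m > 0);
    qs_pre(x, m, qs_bc);
    while (j + m <= n) {
        if (memcmp(x, y + j, m) == 0) {
            if (count < max_out)
                out[count] = j;
            ++count;
        }
        if (j + m >= n)
            break;              /* no character after the window to shift on */
        j += qs_bc[(unsigned char)y[j + m]];    /* shift on y[j + m] */
    }
    return count;
}
```

Note that the shift depends only on the text character y[j + m], not on whether the window matched.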

Illustration 4: CUDA QuickSearch
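
One possible reading of the “skipping distance” scheme described above is sketched below in plain C (not CUDA). It assumes that a pre-pass follows the QuickSearch shifts once and flags each window start it visits, and that each logical thread then compares its window only if its id is flagged. The names qs_mark_starts, qs_thread and qs_parallel_emulated, and the use of flag arrays, are hypothetical; the author's actual data layout is not shown in the text.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ASIZE (UCHAR_MAX + 1)

/* Pre-pass: flag every window start that the QuickSearch shifts visit.
 * The shifts depend only on the text, so this needs no comparisons. */
static void qs_mark_starts(const char *x, size_t m, const char *y, size_t n,
                           unsigned char *is_start)
{
    size_t qs_bc[ASIZE];
    for (size_t c = 0; c < ASIZE; ++c)
        qs_bc[c] = m + 1;
    for (size_t i = 0; i < m; ++i)
        qs_bc[(unsigned char)x[i]] = m - i;

    memset(is_start, 0, n);
    for (size_t j = 0; j + m <= n; ) {
        is_start[j] = 1;
        if (j + m >= n)
            break;
        j += qs_bc[(unsigned char)y[j + m]];
    }
}

/* Work of one logical thread: compare only if this id was flagged. */
static void qs_thread(size_t tid, const char *x, size_t m, const char *y,
                      const unsigned char *is_start, unsigned char *found)
{
    if (!is_start[tid])
        return;
    if (memcmp(x, y + tid, m) == 0)
        found[tid] = 1;
}

/* Stand-in for the kernel launch; is_start[] and found[] hold n flags each. */
void qs_parallel_emulated(const char *x, size_t m, const char *y, size_t n,
                          unsigned char *is_start, unsigned char *found)
{
    assert(m > 0);
    memset(found, 0, n);
    qs_mark_starts(x, m, y, n, is_start);
    for (size_t tid = 0; tid < n; ++tid)
        qs_thread(tid, x, m, y, is_start, found);
}
```

Under this reading, the pre-pass decides which threads do any work, and the character comparisons themselves are independent across the flagged threads.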

Horspool
The classic Horspool algorithm relies on the bad-character shift computation alone, and it is
not very efficient when the pattern is shorter than the alphabet, i.e. m < ∂.

The “bad-character shift” computation is the same as the one shown in the sequential
QuickSearch.

In the CUDA version, the author's approach is very similar to the CUDA version of QuickSearch:
the classic algorithm has been reorganized so that the “bad-character shift” computation is
parallelized. For the scanning code, a “skipping distance” data structure is pre-computed for
use by the CUDA kernel: it is a 1D array that holds the skipping distances regardless of a
match or mismatch, and each valid element is a CUDA thread's id. In the CUDA kernel, a thread
executes the scanning code only if it can locate its id in this “skipping distance” data
structure.

The source code for the sequential and CUDA versions of Horspool is shown below:

Illustration 5: Sequential Horspool
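
A minimal C sketch of a sequential Horspool search, in its textbook form and not the author's original listing, is given here. The names hp_pre, hp_search and ASIZE are assumptions for illustration. This sketch uses the usual Horspool table, which is built from the first m-1 pattern characters with a default shift of m; that is an assumption and may differ from the table in the author's code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ASIZE (UCHAR_MAX + 1)   /* alphabet size: all byte values */

/* Bad-character shift table for Horspool: built from x[0..m-2] only. */
static void hp_pre(const char *x, size_t m, size_t bm_bc[ASIZE])
{
    for (size_t c = 0; c < ASIZE; ++c)
        bm_bc[c] = m;                               /* default shift */
    for (size_t i = 0; i + 1 < m; ++i)
        bm_bc[(unsigned char)x[i]] = m - 1 - i;     /* last occurrence wins */
}

/* Records up to max_out match positions in out[]; returns the match count. */
size_t hp_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t max_out)
{
    size_t bm_bc[ASIZE];
    size_t count = 0, j = 0;
    assert(m > 0);
    hp_pre(x, m, bm_bc);
    while (j + m <= n) {
        unsigned char c = (unsigned char)y[j + m - 1];  /* last window char */
        if ((unsigned char)x[m - 1] == c && memcmp(x, y + j, m - 1) == 0) {
            if (count < max_out)
                out[count] = j;
            ++count;
        }
        j += bm_bc[c];          /* shift on the last character of the window */
    }
    return count;
}
```

The shift is taken from the last character inside the window, whereas QuickSearch uses the character just after it.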

Illustration 6: CUDA Horspool

Evaluation
The author benchmarked the three sequential algorithms and their CUDA equivalents, applying
some, but not all, CUDA techniques and technology. Each test was run for 100 iterations and the
average was taken. The tests were run on a 32-bit Ubuntu OS, an NVIDIA GTX480 card, an 8-core
Intel i7 CPU and 6GB of system RAM.

Two sorts of tests were conducted: (1) the pattern was shorter than the alphabet size and (2) the
pattern was longer than the alphabet size.

One observation from the tests is that the speedup of the CUDA code over the sequential code
ranges from 31 to 106. Another observation is that the CUDA versions of the code do exhibit
branch divergence and bank conflicts, and this behavior is highly dependent on the pattern and
the text involved.

Here is the summary:

Algorithm    Type  Optimization        Search runtime  GPU effective     Speedup
                                       (milliseconds)  bandwidth (GBps)  factor

brute-force  SEQ   -O2                 24              N/A               -
brute-force  CUDA  None                0.24            11.9              100
brute-force  CUDA  Shared memory       0.24            11.9
brute-force  CUDA  Page-locked memory  0.41            7.1               59
QuickSearch  SEQ   -O2                 16              N/A               -
QuickSearch  CUDA  None                0.18            15.87             88
QuickSearch  CUDA  Shared memory       0.15            19.77             106
Horspool     SEQ   -O2                 16              N/A               -
Horspool     CUDA  None                0.19            15.62             84
Horspool     CUDA  Shared memory       0.16            18.55             100

Table 1: Test results for pattern shorter than alphabet size

Algorithm    Type  Optimization        Search runtime  GPU effective     Speedup
                                       (milliseconds)  bandwidth (GBps)  factor

brute-force  SEQ   -O2                 21.2            N/A               -
brute-force  CUDA  Shared memory       0.55            5.35              38
QuickSearch  SEQ   -O2                 17.2            N/A               -
QuickSearch  CUDA  Shared memory       0.47            6.29              36
Horspool     SEQ   -O2                 16.8            N/A               -
Horspool     CUDA  Shared memory       0.53            5.56              31

Table 2: Test results when pattern is longer than the size of the alphabet
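
The speedup factors in the two tables are consistent with the ratio of the sequential runtime to the CUDA runtime. The short C sketch below makes that arithmetic explicit; the function name speedup is hypothetical, and the assumption that the tabulated factor is this plain ratio is inferred from the numbers and is not stated by the author.

```c
#include <assert.h>

/* Speedup as the ratio of sequential to CUDA runtime (both in milliseconds).
 * Assumption: this is how the tabulated speedup factor was derived. */
double speedup(double seq_ms, double cuda_ms)
{
    assert(cuda_ms > 0.0);
    return seq_ms / cuda_ms;
}
```

For example, speedup(24, 0.24) is 100 for brute-force in Table 1, speedup(16, 0.15) is about 106.7 for QuickSearch with shared memory, and speedup(16.8, 0.53) is about 31.7 for Horspool in Table 2. The tabulated values agree with these ratios to within one unit.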

Final Evaluation
The author believes that the performance gains would be better if the implementation used
asynchronous concurrent execution, since multiple kernels executing concurrently would
possibly improve the run times. The author also found that optimizations beyond -O2 for the
sequential algorithms did not seem to affect the overall run times.

The author's initial experimentation with page-locked/zero-copy memory was not encouraging, as
effective bandwidth lagged significantly on the Linux operating system; the author cannot offer
an explanation at this point in time as to why this is the case.

The author hoped to implement a multi-GPU solution, but owing to a lack of resources it cannot
be pursued in the near future, though the author would get a big kick out of it!

References
• KIRK D. and HWU W., 2010, Programming Massively Parallel Processors, first edition.
• AHO A.V., 1990, Algorithms for finding patterns in strings, in Handbook of Theoretical
Computer Science, Volume A, Algorithms and complexity, J. van Leeuwen ed., Chapter
5, pp 255-300, Elsevier, Amsterdam.
• HORSPOOL R.N., 1980, Practical fast searching in strings, Software - Practice &
Experience, 10(6):501-506.
• SUNDAY D.M., 1990, A very fast substring search algorithm, Communications of the
ACM, 33(8):132-142.
• Quick Search Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• Horspool Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• Brute-force Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• NVIDIA CUDA Programming Guide 3.0
• NVIDIA CUDA Reference Manual 3.0
• NVIDIA CUDA Best Practices Guide
