Parallelization of SAT Algorithms in GPUs: Carlos Filipe Costa
To demonstrate our results, we will run standard benchmarks against our project and compare the results with state-of-the-art solvers, as well as with solvers that also leverage the GPU.
2. CUDA
CUDA is a programming framework for machines with NVIDIA GPUs. It is composed of a programming language, a compiler and a runtime environment.

The CUDA programming model is called Single Instruction Multiple Threads (SIMT), and it has some advantages as well as some drawbacks when compared with Simultaneous Multithreading (SMT), the programming model of multi-core CPUs. An advantage is that a much higher number of threads can run at the same time, enabling a cheap high-throughput, high-latency design; moreover, each thread may use up to 63 registers. A disadvantage is that SMT provides concurrency mechanisms, such as locks and semaphores, that are not present in GPUs.
CUDA programs are called kernels and are organized into blocks of threads. Each block may have up to 1024 threads, organized in up to three dimensions.

The threads in a block can be synchronized with barriers and can communicate using the per-block shared memory. Concurrency control is limited and handled solely by atomic functions that operate on a single memory position. This is the only form of concurrency permitted: atomic accesses to one memory position. It limits the ways in which concurrency can be applied, as there is no way to guarantee that several operations are carried out in succession.
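As a minimal illustration of this restriction, the hypothetical kernel below counts the positive entries of a vector; since every thread updates the same counter, the only safe way to combine the results is an atomic update of that single memory position (the kernel and its names are illustrative and not part of any solver discussed here):

    __global__ void countPositive(const int *values, int n, int *counter) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n && values[idx] > 0) {
            // atomicAdd serializes the updates to one memory position;
            // no broader locking mechanism is available.
            atomicAdd(counter, 1);
        }
    }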
Each group of 32 consecutive threads is called a warp (32 is the current warp size; it is architecture dependent and may change in the future). A warp is the fundamental unit of execution: the same instruction is issued to all threads in the warp. This means that if one thread in a warp executes conditional code, the others have to wait until that branch is done. Therefore, if the code has a high degree of branch divergence, its execution is serialized and performance is lost. Another cause of performance loss is data dependency: if a thread in a warp is waiting for data, the others have to wait as well.
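The toy kernel below, written only to make branch divergence concrete, forces the two halves of every warp down different branches; since one instruction is issued at a time for the whole warp, the two branches run one after the other and every thread pays for both:

    __global__ void divergentKernel(float *data, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n) return;
        // Threads 0-15 of each warp take one branch, threads 16-31 the other;
        // the hardware serializes the two paths.
        if ((threadIdx.x & 31) < 16)
            data[idx] = data[idx] * 2.0f;
        else
            data[idx] = data[idx] + 1.0f;
    }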
Blocks are organized into grids, which can have one or two dimensions. The maximum number of blocks that can run at the same time is hardware specific.

2.1 CUDA memory model
In CUDA there are several memory levels with different speeds and characteristics. The efficient usage of all these levels is key to achieving the maximum possible performance. However, this is not always easy to achieve, mostly due to space constraints.
2.1.1 Global memory
Global memory has the duration of the application and can be used by different kernels. However, its access speed is slow, usually taking between 400 and 800 cycles per access. The preferred way of accessing global memory is through coalesced accesses. A coalesced access occurs when all the threads in a half warp (the first or the last 16 threads of a warp) access contiguous memory positions. When this happens, the accesses are condensed into a single transaction, which reduces the access time and is the only way to achieve peak memory throughput. If the memory positions are not sequential, the read instruction is repeated until all accesses are performed. However, the GPU only stalls in case of data dependency, which makes the accesses asynchronous.
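The two access patterns can be contrasted with the illustrative kernels below (not taken from the solver): in the first, consecutive threads read consecutive positions and a half warp is served by a single transaction; in the second, a stride spreads the same reads over several transactions:

    __global__ void coalescedRead(const float *in, float *out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = in[idx];            // thread i reads position i: coalesced
    }

    __global__ void stridedRead(const float *in, float *out, int n, int stride) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx * stride < n)
            out[idx] = in[idx * stride];   // neighbouring threads read distant positions
    }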
2.1.2 Pinned memory
Pinned memory is an extension to global memory that allows the GPU to access memory allocated on the CPU. It is called pinned memory because it cannot be swapped out. This memory is well suited for data that is read only once, but not for data that is reused several times during the execution of the kernel.
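A minimal allocation sketch, assuming the CUDA runtime API and a device that supports mapped host memory (the buffer name and size are illustrative):

    #include <cuda_runtime.h>

    void allocatePinnedBuffer(void) {
        int *hostBuf, *devAlias;
        // Page-locked (pinned) host allocation: it cannot be swapped out and
        // the GPU can read it directly across the bus.
        cudaHostAlloc((void **)&hostBuf, 1024 * sizeof(int), cudaHostAllocMapped);
        // Device-side pointer aliasing the same allocation, so a kernel can
        // read the data without an explicit cudaMemcpy.
        cudaHostGetDevicePointer((void **)&devAlias, hostBuf, 0);
        /* ... launch a kernel that reads devAlias once ... */
        cudaFreeHost(hostBuf);
    }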
2.1.3 Shared memory
Shared memory uses the same physical storage as the L1 cache. It is the fastest memory directly available to the programmer, taking about two cycles per access, but it is very small, much like the L1 cache, ranging from 16 KB to 48 KB. Shared memory is shared between all threads in a block and can be used for communication. The memory is divided into 32 banks that cannot be accessed at the same time; if concurrent accesses to the same bank occur, they are serialized, resulting in a repetition of the instruction and in performance degradation. Shared memory should be used when the access pattern to the data is random, a pattern that would degrade performance if global memory were used. This memory has the lifetime of a block.
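The illustrative kernel below stages a tile of global memory in shared memory so that each thread can afterwards read its neighbours' values at shared-memory speed (the names and the averaging computation are ours, chosen only to show the pattern; it assumes blockDim.x == TILE):

    #define TILE 256

    __global__ void averageNeighbours(const float *in, float *out, int n) {
        __shared__ float tile[TILE];        // lives only for the duration of the block
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            tile[threadIdx.x] = in[idx];    // one coalesced load from global memory
        __syncthreads();                    // barrier: the whole tile is now visible
        if (idx + 1 < n && threadIdx.x > 0 && threadIdx.x < TILE - 1)
            out[idx] = (tile[threadIdx.x - 1] + tile[threadIdx.x] + tile[threadIdx.x + 1]) / 3.0f;
    }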
2.2 Kernel performance limitations
When profiling, kernels fall into three types, and this classification shows us where the most effective optimizations can be made. The limiters are usually the number of instructions one can execute and the speed at which one can read from memory. To better explain these cases, we present the extreme case of each type of kernel.

A kernel is considered instruction bound when its execution time is spent doing calculations rather than memory accesses. The way to optimize this kind of kernel is to choose a more efficient way to compute the result, for instance by using better instructions or a completely different algorithm.

A kernel is considered memory bound when memory accesses dominate its execution time. This means that even with the best access patterns possible, the kernel will still be limited by the speed at which it can obtain new information. When neither instructions nor memory accesses dominate the execution time, there is another limiter to the performance: latency. Latency happens when instructions are repeated, mostly due
to bad memory access patterns or due to serialization as a result of branch divergence, atomic operations or bank conflicts in shared memory. This happens mostly when the data model of the problem does not fit the GPU memory model well, and that limits the performance of the kernel.
3. RELATED WORK
With the advent of multi-core CPUs, and with per-core speed coming to a stall, there was a need to develop SAT solvers that could take advantage of these systems. Moreover, GPUs, which were once only used to perform graphics processing, are now able to perform more general computing. Therefore, several attempts to take advantage of the massive parallelism of GPU architectures to speed up the solving process have recently been proposed.
3.1 Portfolio based Solvers
The portfolio approach to SAT solving leverages the fact that heuristics play an important role in the effectiveness of the solving process and that, with the right heuristic, a problem that is otherwise hard to solve can be solved in a fraction of the time it once took. The main challenge is finding the right heuristic for each problem. Portfolio solvers address this by having different solvers employ different heuristics and compete to solve the same problem, thus increasing the chance of having a solver with a heuristic suitable for the problem at hand. This approach has yet another advantage over its predecessor: since the different solvers work on the same search tree, when one of the solvers reaches a solution, that result is in fact the solution for the problem, hence there is no need for load balancing. These ideas, coupled with the clause sharing techniques mentioned before, make up the state of the art in parallel SAT solving.
3.2 GPU enhanced Solvers
In MESP (Minisat Enhanced with Survey Propagation) [12] the authors proposed using a GPU implementation of Survey SAT to enhance Minisat's variable picking solution, VSIDS. The authors chose Survey SAT because it is easily parallelizable, as the key parts of the algorithm have no data dependencies, an aspect that also makes it very well suited for a GPU.
Fujii et al. [10] took another approach and proposed to use the GPU as an accelerator for the unit propagation procedure of 3-SAT problems, mimicking a previous work by Davis et al. [7], who used an FPGA instead of a GPU to accelerate the analysis procedure. This solver uses a basic DPLL approach and only parallelizes the clause analysis procedure. Every time a variable is picked and set, the GPU is called to analyse all clauses in search of implications. If an implication is found, the GPU is called again with this new information; if no implication is found, a variable is picked instead. In this implementation the CPU holds the state of the problem, and since the objective of the work was only to speed up the analysis, backtracks are done on the CPU and are chronological.
The CUD@SAT [1] project was developed by the Constraint and Logic Programming Lab of the University of Udine. This solver has several possible configurations, ranging from CPU-only approaches to having the CPU and the GPU solve the same problem cooperatively. There are several methods to achieve this cooperation:
• they use the same approach as Fujii et al. [10], doing only clause analysis in the GPU, but they introduce conflict analysis, as opposed to Fujii's [10] method;

• they run a solver in the CPU and, when the search reaches an advanced state, they pass the search to the GPU;

• they do everything in the GPU, alternating between a search kernel and a conflict analysis kernel, with the CPU serving as the synchronisation point.
Note that all these methods can be run with or without a watched-literal scheme. In all its flavours the project remains constant in that the only processing being parallelized is the clause analysis. However, the project does not propose a default configuration.
3.2.1 Scalable GPU Framework
Meyer [18] proposed a parallel 3-SAT solver on the GPU that featured a pipeline of kernels processing the problem. The author discarded most common techniques and relied only on the massive thread parallelism and the scalability of the GPU to solve problems. The focus of the work was to determine the scalability and applicability of GPUs to SAT solving rather than to devise the next high-end SAT solver [18].
4. MINISATGPU
This section presents our concept solution for solving SAT, which relies on the cooperation between the CPU and the GPU. First, in Section 4.1, we present the idea and a top-level view of the system, its components and data structures. Afterwards, in Section 4.2, we present relevant implementation details of both the CPU and the GPU components of the solver, as well as some decisions that led us to the final solver.
4.1 System Overview
To take full advantage of the GPU capabilities we opted for an approach similar to CUD@SAT, in which the CPU and the GPU work in cooperation. However, unlike CUD@SAT, we decided to execute the BCP and the variable picking in the GPU, while the conflict analysis is executed on the CPU, as it is a single-threaded procedure. This decision allows us to send a new problem every time we return to the GPU; thus, we are able to add clauses to the problem, enabling us to learn from conflicts.
4.1.1 A Novel Approach
Like us, most GPU solvers use both the CPU and the GPU together. However, they use the GPU in a different way: they use the entire GPU to run a single search thread. (We call search thread the path a solver takes during the search. For instance, minisat is single threaded, meaning it only has one search path and consequently one search thread. Contrastingly, a portfolio solver such as Plingeling searches through a different path with each sub-solver, meaning it has several search threads. When we refer to search threads we will always write search thread, as opposed to CPU or GPU threads, which we will refer to simply as threads.) Each block is given a different set of clauses, which it analyzes in order to propose propagations. They can only propose because GPU blocks cannot communicate with each other, which has a very practical effect: to make propagations and decisions, they must return to the CPU, as it serves as the synchronization point. This results in frequent focus switches between the CPU and the GPU, which means that an extra overhead is added to the computation.
We decided against this for two reasons. First, focus switches are expensive, and it is good practice to maximize the work done in each kernel call. Second, analyzing all clauses in every kernel call, although it results in an extremely regular kernel, is very inefficient and fails to scale to bigger problems, as we will show in Section 5. The problem with this approach is that the number of clauses is the limiting factor, as there is a limit to the number of blocks that can run in parallel and thus to the number of clauses that can be analyzed at a time.
We propose a different way to use the GPU. Instead of analyzing all clauses, we choose to analyze only the clauses we need to, by analyzing the effects caused by one assignment at a time. By doing this careful clause selection, we can do in a single block what takes the others the entire GPU. This is possible because, with this approach, the limiting factor changes from the total number of clauses to the ratio of clauses per variable. Fitting the propagation step in a single block enables us to do more in the GPU in two different ways. First, as we eliminated the need for inter-block communication, we can do the entire search in the GPU. Second, as we can do the entire search in a single block, we can use the rest of the GPU to run additional search threads.

This being said, our approach to using GPUs to solve SAT, as depicted in Figure 1, consists of three key phases: 1) the search phase, which comprises both the decision phase and the unit propagation phase; 2) the problem analysis phase, which, after checking whether a solution was found, either SAT or UNSAT, analyzes the conflicts that were returned by the GPU; 3) the problem regeneration phase, in which the problem definition is converted to a more GPU-friendly notation and sent to the GPU.
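A high-level outline of this loop, sketched only to make the three phases concrete, could look as follows; the names, types and helper functions are hypothetical and do not correspond to the actual implementation:

    // Illustrative host-side outline of the three phases; names and types
    // are hypothetical and only meant to show the control flow.
    int solveLoop(ClauseDB *db, state *states, int nSearchThreads) {
        for (;;) {
            // 1) search phase: decisions and unit propagation run on the GPU,
            //    one block per search thread
            searchKernel<<<nSearchThreads, BLOCK_SIZE>>>(db->gpuClauses, states);
            cudaDeviceSynchronize();
            // 2) problem analysis phase: check for SAT/UNSAT and analyze the
            //    conflicts returned by the GPU, learning new clauses
            int status = checkStates(states, nSearchThreads);
            if (status != UNSOLVED)
                return status;                  // SAT or UNSAT
            analyzeConflicts(db, states, nSearchThreads);
            // 3) problem regeneration phase: rebuild the GPU-friendly clause
            //    database (now including the learnt clauses) and copy it over
            regenerateAndCopy(db);
        }
    }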
[Figure 1: Diagram of the system architecture, showing the CPU side (Problem Parser, Problem Generation Module, Clause DB Management Module, kernel setup and launch) and the GPU side (per-search-thread states and the Variable Selection, Variable and Clause Sorting, Unit Propagation and Update State steps), with the kernel launch serving as the synchronisation point between the two devices.]

The solver is composed of four components:

• Search Engine

• Conflict Analyzer

• Clause Database

• Search Thread Handler
The Search Engine is the only component that is executed in the GPU, while the others run exclusively in the CPU. This partitioning was adopted because the execution pattern of the search process has little divergence, so it suits the GPU extremely well. Furthermore, searching is the bulk of the work in CPU solvers, consuming around 80% of the search time [19], meaning that in this case the majority of the workload will be done on the GPU.

The other three components are executed on the CPU. The first is the Conflict Analyzer. It takes the states returned by the GPU and analyzes the conflicts, computing a minimized conflict clause as a result.

The second is the Clause Database. This component holds all the clauses, both the initial set and the ones generated during conflict analysis. It was modified to be able to handle the database even though, unlike in minisat, the CPU side of the solver is stateless.

The last component is the Search Thread Handler, which enforces the search heuristics picked for each search thread, allowing us to have a portfolio of search threads. These handlers are read and updated both in the GPU and in the CPU.

4.1.3 Data structures
These structures are used both in the GPU and in the CPU.
A crucial structure is the Clause Database. This database is moved to the GPU with a modification when compared with the CPU version. As shown in Figure 2, we need some information in the GPU that is not necessary in the CPU, and vice versa. So, when we create the database that we are going to copy to the GPU, we replace this information with the information that better suits our needs.

[Figure 2: The clause information in the CPU and in the GPU. On the CPU each clause is prefixed by M (metadata) and L (LBD weight); in the copy sent to the GPU these fields are replaced by S (size) and W (watched literal index), followed by the clause literals, e.g. -2 4 3 -5 and -2 -3 -6 1.]
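Assuming the GPU layout of Figure 2 (a size field and a watched-literal index followed by the literals, stored contiguously in a flat array), a clause stored at offset off could be accessed as in this illustrative fragment; the accessor names are ours:

    // Hypothetical accessors for the GPU clause layout of Figure 2:
    // db[off] holds the clause size, db[off + 1] the watched literal index,
    // and the literals follow.
    int clauseSize(const int *db, int off)       { return db[off]; }
    int watchedIndex(const int *db, int off)     { return db[off + 1]; }
    int literalAt(const int *db, int off, int i) { return db[off + 2 + i]; }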
The literal vector itself is of little use, as there is no way to know where a clause starts and another ends. We make the search engine aware of this by sending an additional structure that, for each variable, holds the list of clauses it belongs to. This structure is depicted in Figure 3. It has pointers to the start of each list, and the first position of each list holds the number of elements. This way, we can loop over the list without crossing bounds into the clauses of another variable.
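A possible realization of this structure, sketched here with hypothetical names only to make the description concrete, is a flat array plus one offset per variable, where the first cell of every list stores its length:

    // occList is the flat array of per-variable clause lists;
    // varOccStart[v] points to the start of variable v's list, whose first
    // cell holds the number of clause indices that follow.
    int numClausesOf(const int *occList, const int *varOccStart, int v) {
        return occList[varOccStart[v]];
    }

    int kthClauseOf(const int *occList, const int *varOccStart, int v, int k) {
        // valid for k in 0 .. numClausesOf(occList, varOccStart, v) - 1
        return occList[varOccStart[v] + 1 + k];
    }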
typedef struct {
    // current level of assignment
    int curLev;
    // clause that implied the variable
    int *varClause;
    // assignment level of each variable
    int *varLevel;
    // polarity of each variable
    int *var;
    // assignment trail
    int *trail;
    // trail limits
    int *trailLim;
    // number of variables on the trail
    int trailSize;
    // number of new variables on the trail (set by CPU)
    int newTrail;
    // number of variables in the problem
    int varSize;
    // conflict clause (set by GPU)
    int confl;
} state;

Figure 4: Structure that represents the state of each search thread.