
Parallelization of SAT algorithms in GPUs

Carlos Filipe Costa


Instituto Superior Técnico
[email protected]
May 2014

ABSTRACT

The boolean satisfiability problem is a very important NP-complete problem. However, in recent years the performance of SAT solvers has come to a halt, which led to work being done to use GPUs to aid in the solving process. However, the techniques employed misuse the GPU and limit its applicability. In this thesis we present a novel approach to using the GPU that removes these limitations and allows the GPU to solve problems of any size, which was not possible with former approaches.

Keywords

GPU, Boolean Satisfiability

1. INTRODUCTION
The Boolean Satisfiability Problem was the first NP-Complete problem to be identified[6]. Several factors still make it one of the most important problems in computer science: its simplicity, and its numerous practical applications to real-life problems such as software testing with model checkers, theorem proving and Electronic Design Automation. But perhaps the most important factor is the ability to reduce any NP-Complete problem to SAT[14], making it a cornerstone in solving this class of problems.

The Davis-Putnam-Logemann-Loveland (DPLL) algorithm[8], introduced in 1962, solves the problem by recursively traversing the entire search space, with several improvements over a more basic brute-force algorithm, finishing only when a solution is found or when the algorithm has exhaustively checked every possibility and found no solution. This algorithm serves, even today, as the basis that modern SAT solvers[11] build upon. However, in recent years the global performance of state-of-the-art solvers has been stalling, and only minor optimizations have been introduced to improve them despite important efforts of the whole community[16][2].

Moreover, CPUs are hitting their physical limit in terms of single-core speed, so to continue to improve in speed SAT solvers had to keep up with the hardware and go parallel[16]. The initial method to parallelize SAT solving was to split the search space between the cores/threads and, with load balancing techniques, re-split the space every time a core/thread finished its search[20][4]. These solvers could also share important clauses they found during their search[5]. The most successful method so far is to use portfolio solvers[17]. A portfolio solver consists of several different single-threaded solvers, one running in each core/thread, racing against each other while also sharing meaningful clauses[13].

Meanwhile, in recent years, with the introduction of General Purpose GPUs, GPUs have ceased to be graphics-only hardware and have become able to do everyday computing. GPUs have hundreds of cores and a computing power that, if used properly[15], outperforms that of the Central Processing Unit (CPU). However, even though SAT solvers made the jump to multi-cores, they are not taking advantage of this piece of hardware that, like the processor, exists in every computer.

To address this gap, in this thesis we propose a way of using the GPU to aid the process of SAT solving. Even though we believe GPUs can be used to obtain substantial improvements in the performance of state-of-the-art SAT solver algorithms, their application poses several challenges. First, GPUs make use of a different parallel paradigm than multi-core systems. Whereas in multi-core systems each CPU can execute an arbitrary instruction at any point in time, in GPUs all cores have to be synchronized, executing the same instruction (even if on different data). Second, the memory model of the GPU is also different. While the CPU is optimized for random accesses to memory, the GPU, in order to achieve top performance, should have all threads accessing sequential memory positions. So, to achieve optimal performance improvements, we believe that the creation of a novel mechanism that reconciles GPUs and SAT solvers is mandatory.

With this in mind, we propose a way of using the GPU to aid the CPU in achieving better results at solving SAT instances. To do so, we present a method to split the computation in order to regularize it, using the CPU to tackle the parts of the problem that cause the irregularity instead of having them run on the GPU. This separation comes from the fact that some parts of the DPLL algorithm add irregularity if they are executed in the GPU, which, due to the SIMT paradigm, becomes very expensive. More specifically, these are conflict analysis and clause learning: the first adds irregularity to the algorithm, and the second is difficult as well, since the GPU has no proper synchronization methods. However, without conflict analysis and clause learning, we fall back to a GPU adaptation of the simplistic DPLL. So, to tackle this challenge, we have the highly regular search
done in the GPU, while having the complicated and single-threaded analysis and clause management done in the CPU.

To demonstrate our results, we run standard benchmarks against our project and compare results with some state-of-the-art solvers, as well as with some solvers that also leverage the GPU.
2. CUDA
CUDA is a programming framework for machines with nVidia GPUs. It is composed of a programming language, a compiler and a runtime environment.

The CUDA programming model is called Single Instruction Multiple Threads (SIMT), and it has some advantages as well as some drawbacks when compared to Simultaneous Multithreading (SMT), the programming model of multi-core CPUs. An advantage is that a much higher number of threads can be run at the same time, enabling a cheap high-throughput, high-latency design; moreover, each thread may use up to 63 registers. A disadvantage is that SMT provides mechanisms of concurrency that are not present in GPUs, such as locks and semaphores.

CUDA programs are called kernels and are organized into blocks of threads. Each block may have up to 1024 threads, organized in up to three dimensions.

The threads in a block can be synchronized within the block with barriers and can communicate using the per-block shared memory. Concurrency treatment is limited and solely handled by atomic functions that operate on a single memory position. This is the only concurrency permitted: atomic accesses to one memory position. This limits the ways in which concurrency can be applied, as there is no way to allow several operations to be carried out in succession.

Each group of 32 consecutive threads is called a warp (32 is the current warp size; it is architecture dependent, so it may change in the future). A warp is the fundamental unit of execution in the program. The same instruction is issued to all threads in the warp. This means that if one thread in a warp executes conditional code, the others will have to wait until that branch is done. Therefore, if the code has a high degree of branch divergence, this results in performance loss, as its execution is serialized. Another cause of performance loss is data dependency: if a thread in a warp is waiting for data, the others have to wait as well.
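For illustration, consider the following small kernel (a sketch of our own, not taken from any solver): the even and odd threads of each warp take different branches, so the two branch bodies are executed one after the other, with half of the warp idle at each step.

__global__ void divergent(int *data)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0) {
        data[tid] *= 2;    // even lanes execute, odd lanes wait
    } else {
        data[tid] += 1;    // odd lanes execute, even lanes wait
    }
}

Branching at warp granularity instead (for instance, on threadIdx.x / 32) avoids this serialization, since all threads of a warp then take the same path.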
These blocks are organized into grids. Grids can also be organized in one or two dimensions. The maximum number of blocks that can run at the same time is hardware specific.
2.1 CUDA memory model
In CUDA there are several memory levels with different speeds and characteristics. The efficient usage of all these levels is key to achieving the maximum possible performance. However, this is not always easy to achieve, mostly due to space constraints.

2.1.1 Global memory
This memory is generally on the order of gigabytes and is accessible by all threads. It has the duration of the application and can be used by different kernels. However, its access speed is slow, usually taking between 400 and 800 cycles per access. The preferred way of accessing global memory is through coalesced accesses. A coalesced access is made when all the threads in a half warp (the first or the last 16 threads of a warp) access contiguous memory positions. If this happens, these accesses are condensed into a single transaction. This reduces the access time and is the only way to achieve peak memory throughput. If the memory positions are not sequential, the read instruction is repeated until all accesses are performed. However, the GPU only stalls in case of a data dependency, which makes the accesses asynchronous.
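As a minimal sketch of our own, the two kernels below contrast the access patterns: in the first, the threads of a half warp read contiguous positions and the loads are condensed into one transaction; in the second, the reads are strided and the load instruction is replayed.

__global__ void coalescedRead(const int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];              // neighbouring threads, neighbouring addresses
}

__global__ void stridedRead(const int *in, int *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];     // addresses far apart: the load is replayed
}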
2.1.2 Pinned Memory
Pinned memory is an extension of global memory which allows the GPU to access memory allocated on the CPU. It is called pinned memory because it cannot be swapped out. This memory is good for data that is read only once during the execution of the kernel.
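A small sketch of how such memory is obtained with the CUDA runtime (the variable names are ours):

#include <cuda_runtime.h>

void allocMapped(int **hostPtr, int **devPtr, size_t bytes)
{
    // page-locked host allocation that the GPU can address directly
    cudaHostAlloc((void **)hostPtr, bytes, cudaHostAllocMapped);
    // device-side alias of the same memory, usable inside kernels
    cudaHostGetDevicePointer((void **)devPtr, *hostPtr, 0);
}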
2.1.3 Shared memory
Shared memory uses the same memory banks as the L1 cache. It is the fastest memory directly available for the programmer to use. It takes two cycles per access, but it is very small, much like the L1 cache, ranging from 16KB to 48KB. Shared memory is shared between all threads in a block and can be used for communication. The memory is divided into 32 banks, which cannot be accessed concurrently. If concurrent accesses to the same bank occur, they are serialized, resulting in a repetition of the instruction and causing performance degradation. This memory should be used when the access pattern to the data is random and would cause performance degradation if global memory were used. This memory has the lifetime of a block.
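As an illustrative sketch (ours), the kernel below stages data in shared memory so that all threads of a block can communicate through it; the 192-element buffer assumes blocks of at most 192 threads, matching the propagation thread count used later in this thesis.

__global__ void blockSum(const int *in, int *out)
{
    __shared__ int buf[192];                      // fast on-chip staging
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid]; // one element per thread
    __syncthreads();                              // block-wide barrier
    if (tid == 0) {                               // thread 0 combines the results
        int sum = 0;
        for (int i = 0; i < blockDim.x; i++)
            sum += buf[i];
        out[blockIdx.x] = sum;
    }
}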
2.2 Kernel performance limitations
When profiling, there are three types of kernels. This classification shows us where to perform the most effective optimizations. The limiters are often the number of instructions one can execute and the speed at which one can read from memory. To better explain these cases, we present extreme cases of each type of kernel.

A kernel is considered instruction bound when the execution time of the kernel is spent doing calculations, not memory accesses. The way to optimize this kind of kernel is to choose a more efficient way to compute the result, like using better instructions or a completely different algorithm.

A kernel is considered memory bound when memory accesses dominate the kernel's execution time. This means that even with the best access patterns possible, the kernel will still be limited by the speed at which it can obtain new information.

When neither instructions nor memory dominate the execution time, there is another limiter to the performance: latency. Latency happens when instructions are repeated, mostly due to bad memory access patterns or due to serialization as a result of branch divergence, atomic operations or bank conflicts in shared memory. This happens mostly when the data model of the problem does not fit the GPU memory model well, and that limits the performance of the kernel.
3. RELATED WORK
With the advent of multi-core CPUs, and with the per-core speed coming to a stall, there was a need to develop SAT solvers that could take advantage of these systems. Moreover, GPUs, which were once only used to perform graphics processing, are now able to perform more general computing. Therefore, several attempts to take advantage of the massive parallelism of GPU architectures to speed up the solving process have recently been proposed.

3.1 Portfolio based Solvers
The portfolio approach to SAT solving tries to leverage the fact that heuristics play an important role in the effectiveness of the solving process and that, with the right heuristic, a problem that is otherwise hard to solve can be solved in a fraction of the time it once took. The main challenge is finding the right heuristic for each problem. Portfolio solvers address this by having different solvers employing different heuristics and competing to solve the same problem, thus increasing the chance of having a solver with a heuristic suitable for the problem at hand. This approach has yet another advantage over its predecessor: by having different solvers working on the same search tree, when one of the solvers reaches a solution, this result is in fact the solution for the problem, hence there is no need for load balancing. These ideas, coupled with the clause sharing techniques mentioned before, make up the state of the art in parallel SAT solving.
3.2 GPU enhanced Solvers
In MESP (Minisat Enhanced with Survey Propagation)[12], the authors proposed using a GPU implementation of Survey SAT to enhance minisat's variable picking solution, VSIDS. The authors chose Survey SAT because it was easily parallelizable, as the key parts of the algorithm do not have data dependencies, an aspect that also makes it very well suited for a GPU.

Fujii et al.[10] took another approach and proposed to use the GPU as an accelerator for the unit propagation procedure of 3-SAT problems, mimicking a previous work by Davis et al.[7], who used an FPGA instead of a GPU to accelerate the analysis procedure. This solver uses a basic DPLL approach and only parallelizes the clause analysis procedure. Every time a variable is picked and set, the GPU is called to analyse all clauses to search for implications. If an implication is found, the GPU is called again with this new information; if no implication is found, a variable is picked instead. In this implementation, the CPU holds the state of the problem, and as the objective of the work was only to speed up analysis, the backtracks are done on the CPU and are chronological.

The CUD@SAT[1] project was developed by the Constraint and Logic Programming Lab of the University of Udine. This solver has several possible configurations that range from CPU-only approaches to having the CPU and the GPU solve the same problem cooperatively. There are several methods to achieve this cooperation:

• they use the same approach as Fujii et al.[10], doing only clause analysis in the GPU, but they introduce conflict analysis, as opposed to Fujii's[10] method;

• they run a solver in the CPU and, when the search reaches an advanced state, they pass the search to the GPU;

• they do everything in the GPU, alternating between a search kernel and a conflict analysis kernel, with the CPU serving as synchronisation.

Note that all these methods can be run with or without a watched literal scheme. In all its flavours the project remains constant in the fact that the only processing being parallelized is the clause analysis. However, the project does not propose a default configuration.

3.2.1 Scalable GPU Framework
Meyer[18] proposed a parallel 3-SAT solver on the GPU that featured a pipeline of kernels that processed the problem. The author discarded most common techniques and relied only on the massive thread parallelism and the scalability of the GPU to solve problems. The focus of the work was to determine the scalability and applicability of GPUs to SAT solving rather than trying to devise the next high-end SAT solver[18].

4. MINISATGPU
This section presents our concept solution for solving SAT, which relies on the cooperation between the CPU and the GPU. First, in Section 4.1, we present the idea and a top-level view of the system, its components and data structures. Afterwards, in Section 4.2, we present relevant implementation details of both the CPU and the GPU components of the solver, as well as some decisions that led us to the final solver.

4.1 System Overview
To take full advantage of the GPU capabilities we opted to use an approach similar to CUD@SAT, where the CPU and GPU are used in cooperation. However, unlike CUD@SAT, we decided to execute the BCP and variable picking in the GPU, while the conflict analysis is executed on the CPU, as it is a single-threaded procedure. This decision allows us to send a new problem every time we return to the GPU; thus, we are able to add clauses to the problem, enabling us to learn from conflicts.

4.1.1 A Novel Approach
Like us, most GPU solvers use both the CPU and the GPU together. However, they use the GPU in a different way: they use the entire GPU to run a single search thread.

We call "search thread" the path a solver takes during search. For instance, minisat is single threaded, meaning it only has one search path and consequently one search thread. Contrastingly, a portfolio solver, such as Plingeling, searches through a different path with each sub-solver, meaning it has several search threads. We always write "search thread" in full, as opposed to CPU or GPU threads, which we refer to simply as "threads".
Each block is given a different set of clauses, and they will analyze that set of clauses and propose propagations. They only propose because GPU blocks cannot communicate with each other, which has a very practical effect: to make propagations and decisions, they must return to the CPU, as it serves as the synchronization point. This results in frequent focus switches between the CPU and the GPU, which means that an extra overhead is added to the computation.

We decided against this for two reasons. First, focus switches are expensive, and it is good practice to maximize the work done in each kernel call. Second, analyzing all clauses in every kernel call, although it results in an extremely regular kernel, is very inefficient and will fail to scale to bigger problems, as we will show in Section 5. The problem with this approach is that the number of clauses is the limiting factor, as there is a limit to the number of blocks that can be run in parallel and thus to the number of clauses that can be analyzed at a time.

We propose a different way to use the GPU. Instead of analyzing all clauses, we choose to analyze only the clauses we need to, by analyzing the effects caused by one assignment at a time. By doing this careful clause selection, we can do in a single block what takes the others the entire GPU. This is possible because, with this approach, the limiting factor changes from being the total number of clauses to being the ratio of clauses per variable. Fitting the propagation step in one block only enables us to do more in the GPU in two different ways. First, as we eliminated the need to do inter-block communication, we can do the entire search in the GPU. Second, as we can do the entire search in a single block, we can use the rest of the GPU to run additional search threads.

[Figure 1: Diagram of the system architecture. The CPU side holds the problem parser, the problem generation module, conflict analysis, the decision module, search thread heuristic enforcement and the clause database management module; the GPU side runs the search kernel (variable selection, unit propagation and search thread state updates); the kernel launch is the synchronization point, with cross-device memory accesses to the search thread states.]

This being said, our approach to using GPUs to solve SAT, as depicted in Figure 1, consists of three key phases: 1) the search phase, which comprises both the decision phase and the unit propagation phase; 2) the problem analysis phase, which, after checking whether a solution was found, either SAT or UNSAT, analyzes the conflicts that were returned by the GPU; 3) the problem regeneration phase, in which the problem definition is converted to a more GPU-friendly notation and sent to the GPU.

4.1.2 Architecture
Our system is partitioned into four modules. One of these components is executed in the GPU, while the others are executed in the CPU. The four components are:

• Search Engine

• Conflict Analyzer

• Clause Database

• Search Thread Handler
The Search Engine is the only component that is executed in the GPU, while the others run exclusively in the CPU. This partitioning was adopted because, as the execution pattern of the search process has little divergence, it suits the GPU extremely well. Furthermore, searching is the bulk of the work in CPU solvers, consuming around 80% of the search time[19], meaning that in this case the majority of the workload will be done on the GPU.

The other three components are executed on the CPU. The first component is the Conflict Analyzer. It takes the states returned by the GPU and analyzes the conflicts, computing a minimized conflict clause as a result.

The second is the Clause Database. This component holds all the clauses, both the initial set and the ones generated during conflict analysis. This component was modified to be able to handle the database even though, unlike in minisat, the CPU side of the solver is stateless.

The last component is the Search Thread Handler, which enforces the search heuristics picked for each thread, allowing us to have a portfolio of search threads. These handlers are read and updated both in the GPU and in the CPU.

4.1.3 Data structures
These structures are used both in the GPU and in the CPU.

A crucial structure is the clause database, which is moved to the GPU with a modification when compared with the CPU version. As shown in Figure 2, we need some information in the GPU that is not necessary in the CPU, and vice versa. So, when we create the database that we are going to copy to the GPU, we replace this information with the information that better suits our needs.
CPU clause layout:  M L | -2 4 3 -5    M L | -2 -3 -6 1    (M: Metadata, L: LBD Weight)
GPU clause layout:  S W | -2 4 3 -5    S W | -2 -3 -6 1    (S: Size, W: Watched Literal Index)

Figure 2: The clause information in the CPU and in the GPU

The literal vector itself is of little use, as there is no way to know where a clause starts and another ends. We make the search engine aware of this by sending an additional structure that, for each variable, has the list of clauses it belongs to. This structure is depicted in Figure 3. It has pointers to the start of each list, and in the first position of the list there is a value representing the number of elements. This way, we can loop over the list without crossing bounds into the clauses of another variable.

Starting point of clauses:  0 3 6
Clause list (with the size in the first position):  2 x -y | 2 y -z | 2 -x -y

Figure 3: Compressed matrix representation
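As a sketch of how the search engine can walk this structure (the identifiers are ours, assuming the layout of Figure 3):

__device__ void visitClausesOf(int v, const int *start, const int *lists)
{
    int base = start[v];               // offset of v's list
    int count = lists[base];           // first entry: number of elements
    for (int i = 1; i <= count; i++) {
        int clause = lists[base + i];  // one clause containing v
        // in the real kernel, each thread of the block analyzes
        // one of these clauses in parallel
        (void)clause;
    }
}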

Lastly, we need the information about each search thread's state. This information, depicted in Figure 4, is stored in a vector of structs; the vectors that hold all the additional information, such as variable assignments and the clause that caused each assignment, are stored in one vector only, allocated after the problem is processed. We do this because this information is used both in the GPU and the CPU, and copying data from the GPU that is sequential is faster than doing the same with data that is scattered.

typedef struct {
    // current level of assignment
    int curLev;
    // clause that implied the variable
    int *varClause;
    // assignment level of each variable
    int *varLevel;
    // polarity of each variable
    int *var;
    // assignment trail
    int *trail;
    // trail limits
    int *trailLim;
    // number of variables on the trail
    int trailSize;
    // number of new variables on trail (set by CPU)
    int newTrail;
    // number of variables in the problem
    int varSize;
    // conflict clause (set by GPU)
    int confl;
} state;

Figure 4: Structure that represents the state of each search thread
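A possible layout, sketched below with hypothetical names, is to carve every per-thread array out of a single contiguous allocation, so that one sequential transfer moves the states of all search threads at once:

void bindStateVectors(state *states, int *pool, int nThreads, int nVars)
{
    for (int t = 0; t < nThreads; t++) {
        // five nVars-sized arrays per search thread, side by side
        // (the trail limit array is over-sized here for simplicity)
        int *base = pool + (size_t)t * 5 * nVars;
        states[t].varClause = base;
        states[t].varLevel  = base + nVars;
        states[t].var       = base + 2 * nVars;
        states[t].trail     = base + 3 * nVars;
        states[t].trailLim  = base + 4 * nVars;
    }
}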
4.1.4 Workflow
Our system's components cooperate and work with each other's information so that together they can solve problems. The general workflow of our system is the following: we process the clause database in the CPU and send it to the GPU; in the GPU we assign values to variables and propagate those decisions until either a conflict or a solution is found; after that, we return to the CPU, where we do conflict analysis and, if necessary, a clause database cleanup. At this point we are back at the beginning, where we process the clause database before sending it to the GPU.

In a more detailed view, as depicted in Figure 1, the whole process starts when a problem is given to the system. Minisat's built-in parsers read the problem and store it in minisat's own data structures. After this, the solver reads the parameters given to it, some native to minisat, some new, and configures itself accordingly.

Parameter      Description                                        Default Value
--blocks       Number of solvers to be launched concurrently      64
               (one per GPU block)
--pfreq        Frequency at which the clause database is          1
               pushed to the GPU
--gpu-threads  Number of threads to be used during unit           192
               propagation
--l-lin        Increase the clause database capacity linearly,    true
               glucose style

Table 1: MinisatGPU's configuration parameters

The next step is to set up the search threads. Each search thread will have its own context and heuristics for restarts and polarity mode. This constitutes the setup phase. Afterwards comes the solving phase, which will only end when a solution is found. The solving phase uses the GPU, so the first step is to read the clause database and send the problem definition to the GPU in a GPU-friendly way. In this step, as said before, additional information is stored in each clause before it is sent to the GPU, to help during search.

At this point, the problem definition is on the GPU, so we can launch the GPU to start working on it. When we launch the GPU, we work in parallel in the CPU to do some activities that can be done in advance.
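The solving phase loop can be summarized by the following sketch (the helper names are ours, standing in for the corresponding routines):

typedef enum { UNSOLVED, SAT, UNSAT } Result;

void   sendProblemToGPU(void);
void   launchSearchKernel(void);
void   sortVariablesByVSIDS(void);
void   waitForKernel(void);
Result checkReturnedStates(void);
void   analyzeConflictsAndLearn(void);

Result solveLoop(void)
{
    Result r = UNSOLVED;
    sendProblemToGPU();                  // problem regeneration phase
    while (r == UNSOLVED) {
        launchSearchKernel();            // search phase, on the GPU
        sortVariablesByVSIDS();          // CPU advances work meanwhile
        waitForKernel();
        r = checkReturnedStates();       // solution, or one conflict per thread
        if (r == UNSOLVED) {
            analyzeConflictsAndLearn();  // analysis phase, may clean the DB
            sendProblemToGPU();          // regenerate and resend
        }
    }
    return r;
}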
In the GPU, we do a one-time copy of each search thread state to a local container. These states, as said before, are stored in pinned memory, which can be accessed from the GPU. The next phase is to pre-load, into the appropriate container, the implication vector: the variables that are going to be propagated, either as a result of conflict analysis or as a result of other search thread work. If these don't exist, we select one variable instead. After this, each variable is picked in succession, one at a time, and the clauses where it shows up are processed in parallel, one per thread. If, during this stage, an implication is found, it is added to the end of the implication vector. We leave this stage under one of two conditions: 1) we have no more variables to propagate, in which case we proceed to select another one; 2) we have found a conflict, in which case the search thread ends its search. We have a solution when no more variables are left unassigned and there is no more work to do. We leave the GPU when all threads have finished their work and have either a conflict or a solution.
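A simplified sketch of this stage is given below, with one block per search thread (the identifiers are ours; the conflict detection corresponds to Figure 5):

__global__ void propagate(state *states, const int *start, const int *lists)
{
    state *s = &states[blockIdx.x];            // one search thread per block
    for (int pos = 0; pos < s->trailSize && s->confl == 0; pos++) {
        int v = s->trail[pos];                 // next variable to propagate
        int base = start[v];
        int count = lists[base];
        // the clauses containing v are analyzed one per thread
        for (int i = 1 + threadIdx.x; i <= count; i += blockDim.x) {
            int clause = lists[base + i];
            // analyzing 'clause' may append an implied variable to the
            // trail (atomically, see Figure 5) or record a conflict in
            // s->confl
            (void)clause;
        }
        __syncthreads();  // the whole block finishes before the next variable
    }
}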
When we return to the CPU, we will either have a solution, in which case the solver returns it and ends, or as many conflicts as search threads. If it is the latter, we have to analyze the returned conflicts. We use a modified version of minisat's conflict analysis procedure. These modifications have to do with the fact that minisat stores its search thread state in the problem definition and makes use of it in several steps of the solving process. However, as we have several search threads and, consequently, several states, we cannot have state built in. After we are done with conflict analysis, we add the resulting clauses to the database. However, clauses that have only one literal are not added to the database; they are added directly to each search thread as an assumption instead. If two assumptions collide, the solving process ends with UNSAT. When adding these new clauses, if we exceed the limit of clauses that the database can have, we clean the database using a modified procedure, we increase the limit, either exponentially or linearly, and we send the problem to the GPU again. At this point we are back at the beginning of the solving phase loop.
As said previously, when we launch the GPU kernel we do some work in parallel in the CPU. This work consists of sorting the variables with respect to their VSIDS weights; also, when a database cleanup is imminent, we pre-sort the clauses so that we do not need to do that before removing the clauses. This way, we use the CPU to advance work when it would otherwise be idle.
4.2 Implementation
To implement our system, we decided against starting a solver from scratch and instead integrated our changes into minisat[9], a lightweight and extensible solver. This decision came from the fact that, by having facilities such as problem parsing and clause management routines, our job would be less prone to errors. Moreover, minisat[9] has been extensively tested, so when integrating our changes we could be fairly certain that wrong answers to problems were the result of our modifications. However, integrating with minisat[9] also had disadvantages, as we had to make the GPU side of the solver minisat[9]-compatible. That is, we had to store and return all the information that minisat[9] needed to do its job. Also, as we use minisat's[9] own conflict analysis and clause strengthening routines, we had to make them GPU-friendly, as in our solver, unlike in minisat[9], the CPU side does not have an implicit state. This section is organized as follows: first we introduce the changes we made to minisat[9] for it to suit our needs; after that, we present the adaptations we had to make to the GPU code, some of which was reused from the previous DPLL approach; finally, we present the mechanisms used for the cooperation between the CPU and the GPU.
4.2.1 Minisat adaptation
In adapting minisat[9] to our needs, we had to make several changes to the way minisat[9] operated. This happens because minisat[9] is highly optimized to have only one search thread during the solving process. As we wanted to make minisat[9] able to handle several search threads, we had to prepare it to do so. The first thing we had to change was to make minisat[9] stateless. Besides the assignment stack, minisat[9] stores state in the problem itself, as it changes the order of the literals in a clause so that the two watched literals are always the first two literals in the clause. As we have multiple search threads running in the GPU, they cannot change this order, as the order for one search thread would not suit another. However, minisat[9] makes use of this order when it comes to conflict analysis and conflict clause strengthening.

Conflict analysis makes use of this ordering because the first literal in a clause is always the last one to be assigned due to that same clause, more specifically, due to unit propagation.

This means that, because during analysis we follow these propagations backwards, when we leave a clause to analyze another, the variable we followed to reach the new clause points to the first one. If we were to use this literal in our analysis again, we would start an endless analysis loop. To prevent that, we had to change the analysis procedure to keep in mind that clauses are stateless. We did this by introducing a marker that skips the analysis of variables that already were, or are, in the queue to be analyzed.

Another problem was that this queue should have the asserting literal added in last place. However, due to concurrency issues in the GPU, the assignment trail may not be in the assigning order that minisat[9], which runs only in one thread, understands. So we had to use a special queue that recognizes which literal is the asserting literal and returns it in last place, so the analysis can proceed normally.

Another place where minisat[9] needs to be altered is the database management procedure, especially during cleanups. Minisat[9] works with a single thread and is fine-tuned to excel at doing so. However, when we have no state and, instead of a single search thread, we have many, minisat's[9] clause locking mechanisms cease to function properly, as they once again rely on the ordering of the literals. This results in the elimination of clauses that are part of some search thread's reasoning. This adaptation is a two-step process. The first step is to prevent clauses that belong to reasonings from being removed. This is achieved by storing the clause pointers the search threads are using, and checking whether those clause pointers are stored when trying to remove clauses. The second part is when we actually clean the database. Minisat[9] cleans the database by creating another, empty database into which it copies the clauses. This means that all clauses end up with new clause pointers.

To be able to update our stored pointers accordingly, we had to intercept and modify minisat's[9] clause allocation routine so it would return us the new position. As we store the two pointers, the new and the old, in a hashtable, we just need to replace the pointers stored in each search thread's state with the value returned by the hash table for the pointer they currently hold.
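A minimal sketch of this relocation step (the container and names are ours; minisat's allocator is intercepted to fill it during the database copy):

#include <unordered_map>

void remapStoredPointers(const std::unordered_map<void *, void *> &moved,
                         void **stored, int n)
{
    for (int i = 0; i < n; i++) {
        // swap each old clause pointer for the clause's new address
        auto it = moved.find(stored[i]);
        if (it != moved.end())
            stored[i] = it->second;
    }
}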
These are the modifications that enable minisat[9] to hold several search states instead of just one, like it usually does. To actually turn minisat[9] into a host for several search threads, some additional steps are required. The first, as we do the search in the GPU, is to modify the main loop to skip the procedures that make up search: the literal selection and unit propagation routines. After that, we replace these two routines with a kernel call that will launch the GPU, which will in turn run the search kernel. After the launch, when the focus returns to the CPU, we need to verify whether the search is over before doing conflict analysis, so we look for solutions, SAT or UNSAT, in all the returned states. If a solution is found, we return it; if not, we proceed to conflict analysis. After conflict analysis we backtrack each state and add the conflict clause to the problem. If needed, a clause database cleanup is performed. After this, we return to the beginning: we regenerate the problem and send it back to the GPU for further analysis.

The conflict analysis and backtrack steps can be performed regardless of the number of search threads that are running in the GPU, and so can the solution checking routine.

We decided to handle our search threads with a portfolio approach. This happened for two reasons: 1) there is no need to do load balancing; 2) with hundreds of search threads, having different approaches is crucial to succeed. However, it is hard to come up with hundreds of different combinations of heuristics, so we followed the approach used in Plingeling[3], where they do not have a lot of different heuristics, just one: they add diversity by randomly attributing different initial variable selection orderings to each search thread[3]. So, to accommodate hundreds of different solvers, we chose different restart heuristics and polarity modes, and we gave each solver a different initial variable ordering.
We chose, for restart heuristics, the two provided by minisat[9], exponential and Luby, and the LBD restart strategy introduced in glucose[2]. Additionally, we have three polarity modes: 1) all variables are assigned true; 2) all variables are assigned false; 3) variables are assigned their last implied value, also known as phase saving.

In running a portfolio solver, we differ in the way we handle learnt clauses. Normally, solvers would share only clauses that were considered important. However, they do this because sharing all clauses would cause an overhead that would make the benefits of clause sharing irrelevant[5]. We do not face this problem: because all search threads are halted while conflict analysis and clause database management are being done, we are able to have only one clause database that receives all clauses from all search threads. Another reason for this decision is that there is not enough space in the GPU to have a clause database for each search thread, so frequent database cleanups would be required, generating additional overhead. Moreover, if we had a clause database per thread, the overhead of sending the databases to the GPU would increase greatly, as we would have to send hundreds of databases instead of one single database. There is one last benefit to having one clause database for all threads: as we left the limits that trigger a cleanup unchanged from the single-threaded minisat[9], in the beginning frequent cleanups are made, leaving us with a very optimized set of clauses in the database.
4.2.2 GPU Side
Even with all the changes made to minisat[9], further adaptations were necessary on the GPU side to make the two work together. We started with the solver described in the previous section and stripped it down so only the search loop was left. At this point we only had the literal selection and the clause analysis procedure. To make the two sides work together, we had to store information we did not need before: the trail, where the assignment order is stored; the trail limits, which store the index where each level ends in the trail; the level itself; and the variable information (their assignment, the clause that implied them, and the implication level they belong to).

That being said, we have to update this information during the search routine. Much of the information regarding variables was already stored, as it was needed to do conflict analysis; the adaptation was in the location where we stored it. However, there was no trail, no trail limits, and no information about implication levels before. Our search procedure uses an atomic operation to let us do several operations atomically, as can be seen in Figure 5. To achieve this we leverage the fact that only one thread will have the result required to enter the 'if' branch shown in Figure 5. This change, although it increases code divergence, is an optimization, as the third step, the confirmation step, would have to check all variables for new information. This is an issue for two reasons: the first is that we had to do additional memory accesses; the second is that, when the problem at hand had thousands of variables, this step was extremely costly.

// we do an atomic compare and set with -1 (unassigned);
// if the comparison is successful we replace the value in
// memory with the new polarity; the operation returns the
// previous value stored in memory
int oldval = atomicCAS(&vars[UnassignedLit], -1, ourPolarity);
// if this value is equal to our polarity or was unassigned
// there is no conflict; if that is not the case, we report
// a conflict
int conf = (oldval != ourPolarity) && oldval != -1;
if (conf) {
    return conf;
}
if (oldval != -1) {
    return -1;
}
// here we set the rest of the information

Figure 5: Using an atomic compare-and-set operation to provide concurrency for more than one operation
The first optimization we made was to the clause analysis procedure, for it to take advantage of one fact regarding memory accesses in the GPU: as said before, reads are asynchronous. With this in mind, there are two pieces of information that we need to get in order to analyze a clause. The first is the clause itself, which will be in contiguous positions in memory. The second is the value that each variable associated with the literals of the clause holds. To read these values in the most optimized way, we fetch four literals at a time. As all literals are stored in contiguous positions, when we fetch the first, as the GPU will get a chunk of memory with each fetch, we will most likely get the others as well, meaning that we fetch all four literals for the price of the first. After that, when we fetch the value associated with each variable, as we fetch the four in succession, we can start processing the first values while waiting for the others to complete.
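A sketch of this access pattern (ours, assuming the four literals are contiguous and 16-byte aligned; int4 is CUDA's built-in four-integer vector type):

__device__ void fetchQuad(const int *lits, const int *vals, int *out)
{
    // a single transaction brings in four contiguous literals
    int4 q = *reinterpret_cast<const int4 *>(lits);
    // the four scattered value reads are issued back to back; since
    // a read only stalls when its result is used, the first value can
    // be processed while the remaining three are still in flight
    out[0] = vals[abs(q.x)];
    out[1] = vals[abs(q.y)];
    out[2] = vals[abs(q.z)];
    out[3] = vals[abs(q.w)];
}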
This optimization attacks one of the most crucial performance issues of SAT solvers in the GPU: the fact that this problem has low memory locality. When we read the clause, as the positions are sequential, we can take advantage of caches to speed up the process. However, when we need to read the values assigned to the variables, the reads will be completely random, and these accesses will be issued repeatedly until all threads have fetched the value they need. On average, these instructions are issued eight to nine times, instead of the optimal value of one.

In order to further optimize this procedure, we implemented watched literals in the GPU. However, the procedure where we verified the literals was so unoptimized with respect to memory locality that the benefits of using watched literals were hidden by the memory access overhead. Furthermore, the fact that all threads must wait for the others in the same warp to finish the work in a clause means that if one thread fails to fall into the watched literal rule, there will not be a noticeable difference in performance for that particular round of analysis.
4.2.3 GPU/CPU cooperation
The last part of the implementation was to actually connect the GPU with minisat[9]. The cooperation is mainly made through information exchange; this being said, optimizing memory transfers and postponing some CPU work to be done while the GPU is busy is crucial to achieve optimal performance.

To achieve this, we aimed to reduce the time needed to send the clause database to the GPU. This procedure was dominated, execution-wise, by the successive calls to the memory copy API, so instead of copying each clause to the GPU, we modified the procedure to copy the clauses to an auxiliary buffer. With the aid of this buffer we could copy the entire database in one operation, making it as efficient as possible. Furthermore, we noticed that these copies, however small, were still taking some time to complete, and there was work to be done after them. To address this, we started using asynchronous memory copies, so we could continue our work while the data was being copied. These asynchronous operations are only asynchronous with respect to the CPU and will maintain order with respect to other GPU operations, so no synchronization errors can occur.
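A sketch of the resulting transfer path (the buffer names are ours, and we assume the staging buffer is allocated pinned so the asynchronous copy can proceed without an extra driver-side copy):

#include <cuda_runtime.h>
#include <string.h>

void pushDatabase(void *d_db, const void *clauses, void *staging,
                  size_t bytes, cudaStream_t stream)
{
    // pack the whole database once into the pinned staging buffer
    memcpy(staging, clauses, bytes);
    // one asynchronous transfer replaces many small copies; the call
    // returns immediately, so the CPU keeps working during the copy
    cudaMemcpyAsync(d_db, staging, bytes, cudaMemcpyHostToDevice, stream);
}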
Another way we used asynchronous operations between the GPU and the CPU was to reduce the effects of ordering all the variables with respect to the VSIDS weights. As discussed before, with hundreds of search threads come hundreds of sort operations. By delaying the ordering by one search and performing it after the kernel launch, we can search in the GPU while sorting these variables in the CPU. Another thing we can sort in parallel with the GPU is the clause database. If we do that, we can skip the sorting process when the time comes to reduce the database size. This optimization works because clauses added in the last round of searches will be locked, as they will belong to a reasoning, and will never be removed regardless of their place in the clause list.
With this, we end up with a system that tries to use the GPU and the CPU cooperatively in the best possible way. We leave to the next section an evaluation of our solver against solvers that use the CPU exclusively and solvers that use both cooperatively.

5. EVALUATION
To evaluate our solver we compare it against the state of the art in SAT solvers that use the GPU; as a reference, we also test our solver against minisat, the original solver, and glucose, from which we took the clause rating mechanism and the restart heuristic. The test suite is a subset of well-known problems taken from SATLib, a popular repository of SAT problems. We first test our solver against the best solver we have found that uses the GPU during the solving process, CUD@SAT.

CUD@SAT has several configurations, presented on its website. Therefore, we present the minimum time it took to solve each problem with the best combination of options, instead of having several columns, one for each configuration.

The tests used are described in Table 2, and we chose these problems because these sets reflect well what happens across other problem sets, in general, when we compare the two solvers.
Type        Name                 Description
Industrial  bmc-ibm series       Model checking problems from IBM
Industrial  bmc-gallileo series  Model checking problems from Galileo
Encodings   holeN series         Encodings of pigeon hole problems
Encodings   parity series        Encodings of parity function problems
Encodings   gus-md5              Search for MD5 hash collisions

Table 2: Problem sets

5.1 Evaluation against GPU Solvers
As can be seen in Table 3, our solver is very competitive with CUD@SAT on small problems, and both solvers struggle with hard problems like hole10. However, when it comes to big problems, CUD@SAT cannot compete with us, as it does not scale well.
Problem          MinisatGPU   CUD@SAT (best)
bmc-gallileo-8   13.96        >600 secs
bmc-galilleo-9   17.28        >600 secs
bmc-ibm-10       19.47        >600 secs
bmc-ibm-11       16.90        >600 secs
bmc-ibm-12       40.65        >600 secs
bmc-ibm-13       3.99         >600 secs
bmc-ibm-1        4.17         >600 secs
bmc-ibm-2        0.39         0.07005
bmc-ibm-3        5.89         >600 secs
bmc-ibm-4        3.77         >600 secs
bmc-ibm-5        0.72         >600 secs
bmc-ibm-6        8.66         >600 secs
bmc-ibm-7        0.23         >600 secs
hole8            9.27         4.90
hole10           672.71       >1000 secs
gus-md5-04       128          22.29
gus-md5-07       160.09       >600 secs
par16-1-c        0.63         >600 secs
par16-2-c        1.18         32.50
par16-3-c        1.66         >600 secs
par16-4-c        0.88         360.91
par16-5-c        2.37         35.00

Table 3: Results against the best option of CUD@SAT (times in seconds)
5.2 Evaluation against CPU Solvers
With our new way of using the GPU, we can scale to problems of any size, which was not possible with previous approaches, while remaining competitive on problems of smaller dimensions. This being said, we now compare our solver with minisat, the starting point, and glucose, from which we took the clause rating and restart heuristics. We use the same problem sets, and both minisat and glucose are tested using their default settings. Even though we have a highly competitive solver against other GPU implementations, there is still work to be done when it comes to competing with CPU-only implementations; see Table 4.

Problem          MinisatGPU   minisat   glucose (2.3)
bmc-gallileo-8   13.96        0.25      0.05
bmc-galilleo-9   17.28        0.27      0.12
bmc-ibm-10       19.47        0.28      0.12
bmc-ibm-11       16.90        0.32      0.09
bmc-ibm-12       40.65        0.98      1.04
bmc-ibm-13       3.99         0.48      0.45
bmc-ibm-1        4.17         0.09      0.03
bmc-ibm-2        0.39         0.01      0.01
bmc-ibm-3        5.89         0.09      0.04
bmc-ibm-4        3.77         0.09      0.03
bmc-ibm-5        0.72         0.02      0.01
bmc-ibm-6        8.66         0.33      0.05
bmc-ibm-7        0.23         0.01      0.01
hole8            9.27         0.43      4.90
hole10           672.71       211.46    71.66
gus-md5-04       128          1.41      0.98
gus-md5-07       160.09       24.41     54.50
par16-1-c        0.63         0.06      0.06
par16-2-c        1.18         0.12      0.08
par16-3-c        1.66         0.07      0.06
par16-4-c        0.88         0.02      0.01
par16-5-c        2.37         0.09      0.03

Table 4: Results against minisat and glucose (times in seconds)

This difference in performance can be attributed to two reasons. First, minisat and glucose are highly optimized to run a single thread and do not have to deal with all the overhead that the GPU introduces. Tests we made show that 100,000 empty kernel launches, if synchronized, take 1 second. This is relevant to the applicability of GPUs to SAT, as both minisat and glucose exceed this number in conflicts per second on several problems. In addition to this synchronization step, there is still overhead in moving clauses and other important data from the CPU's memory to the GPU, which limits the applicability of GPUs to larger problems.

However, the challenges do not end with these CPU-GPU interoperability issues. While in the GPU, the access pattern to memory is suboptimal. Even if we can optimize clause fetching, which we do, we still cannot optimize the verification of the variable polarity. This second access will most likely not be coalesced or cached, and it will therefore result in the repetition of these read instructions, which are issued on average 8 to 9 times, close to the limit of 16 repetitions (memory accesses are handled with the granularity of a half warp, or 16 threads). This is the main issue in trying to solve SAT with the GPU: SAT solving is memory intensive, and this poor access pattern renders our kernel latency bound, as we spend 80% of the execution time of each kernel waiting for these read operations to terminate.

6. CONCLUSION
In this thesis we have presented a new method to use GPUs to solve SAT problems. Our method uses three key insights: 1) instead of using the whole GPU to run a single search thread, pick clauses carefully and run a search thread in a single block; 2) run the entire search in the GPU to reduce focus changes; 3) scale by adding more search threads to the GPU. Using these ideas, we built a solver, MinisatGPU, that implements a portfolio of search threads that are run in parallel in the GPU, while conflict analysis is done in the CPU to avoid code divergence in the GPU.
The evaluation of MinisatGPU shows that it scales well to big problems when compared to other GPU solvers, and it can solve problems that were once unsolvable by GPU-based solvers. However, when compared with CPU-based solvers, it is still behind in terms of performance. This is due to poor memory access patterns during the solving process, which becomes dominated by the latency of instruction repetition. And while we believe that, with new algorithms, GPUs can be an alternative to CPUs, with current SAT and GPU technology CPU-only solvers will most likely have the upper hand for the foreseeable future.

7. FUTURE WORK
We believe our solver can be improved in two ways. When we return from kernel execution, the GPU is left waiting until the next round of search arrives. The first improvement is to try to capitalize on this and use the GPU while the CPU is busy doing analysis, by launching a second search kernel when the first returns to the CPU. However, we did not invest in this because we do work in the CPU in parallel with the GPU, meaning the CPU is busy most of the time. The other improvement builds upon the first, by using different streams, which are basically a way of telling the GPU which operations can be done in parallel and which cannot with respect to the GPU. With streams, we can hide the time we spend transferring the clause database by doing work in the GPU at the same time. This can only be achieved with two or more kernels, which means that in our implementation we are subject to these transfer times.
8. REFERENCES
[1] Exploiting unexploited computing resources for computational logics, June 2012.
[2] G. Audemard and L. Simon. Predicting learnt clauses quality in modern SAT solvers. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI'09, pages 399-404, San Francisco, CA, USA, 2009. Morgan Kaufmann Publishers Inc.
[3] A. Biere. Lingeling, Plingeling and Treengeling entering the SAT competition 2013. In Proceedings of SAT Competition 2013, A. Balint, A. Belov, M. Heule, M. Järvisalo (editors), 2013.
[4] M. Böhm and E. Speckenmeyer. A fast parallel SAT-solver - efficient workload balancing. Annals of Mathematics and Artificial Intelligence, 1996.
[5] W. Chrabakh and R. Wolski. GrADSAT: a parallel SAT solver for the grid. 2003.
[6] S. A. Cook. The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, STOC '71, pages 151-158, New York, NY, USA, 1971. ACM.
[7] J. D. Davis, Z. Tan, F. Yu, and L. Zhang. Designing an efficient hardware implication accelerator for SAT solving. In H. K. Büning and X. Zhao, editors, Theory and Applications of Satisfiability Testing - SAT 2008, number 4996 in Lecture Notes in Computer Science, pages 48-62. Springer Berlin Heidelberg, Jan. 2008.
[8] M. Davis, G. Logemann, and D. Loveland. A machine program for theorem-proving. Communications of the ACM, 5(7):394-397, July 1962.
[9] N. Eén and N. Sörensson. An extensible SAT-solver. In E. Giunchiglia and A. Tacchella, editors, Theory and Applications of Satisfiability Testing, number 2919 in Lecture Notes in Computer Science, pages 502-518. Springer Berlin Heidelberg, Jan. 2004.
[10] H. Fujii and N. Fujimoto. GPU acceleration of BCP procedure for SAT algorithms. 2012.
[11] E. Goldberg and Y. Novikov. BerkMin: a fast and robust SAT-solver. Discrete Applied Mathematics, 155(12):1549-1561, June 2007.
[12] K. Gulati and S. P. Khatri. Boolean satisfiability on a graphics processor. In Proceedings of the 20th Symposium on Great Lakes Symposium on VLSI, GLSVLSI '10, pages 123-126, New York, NY, USA, 2010. ACM.
[13] Y. Hamadi and L. Sais. ManySAT: a parallel SAT solver. Journal on Satisfiability, Boolean Modeling and Computation (JSAT), 6, 2009.
[14] R. M. Karp. Reducibility among combinatorial problems. In M. Jünger, T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey, editors, 50 Years of Integer Programming 1958-2008, pages 219-241. Springer Berlin Heidelberg, Jan. 2010.
[15] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Computer Architecture News, 38(3):451-460, June 2010.
[16] M. Lewis, T. Schubert, and B. Becker. Multithreaded SAT solving. In Proceedings of the 2007 Asia and South Pacific Design Automation Conference, ASP-DAC '07, pages 926-931, Washington, DC, USA, 2007. IEEE Computer Society.
[17] R. Martins, V. Manquinho, and I. Lynce. Improving search space splitting for parallel SAT solving. In 2010 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), volume 1, pages 336-343, 2010.
[18] Q. Meyer, F. Schönfeld, M. Stamminger, and R. Wanka. 3-SAT on CUDA: towards a massively parallel SAT solver. In 2010 International Conference on High Performance Computing and Simulation (HPCS), pages 306-313, 2010.
[19] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff: engineering an efficient SAT solver. In Proceedings of the 38th Annual Design Automation Conference, DAC '01, pages 530-535, New York, NY, USA, 2001. ACM.
[20] H. Zhang, M. P. Bonacina, and J. Hsiang. PSATO: a distributed propositional prover and its application to quasigroup problems. Journal of Symbolic Computation, 21:543-560, 1996.
