Multiverse: A Deep Learning 4X4 Sudoku Solver
Schendowich, C., Ben Isaac, E. and Azoulay, R.
Abstract: This paper presents a novel deep learning-based approach to solving 4x4 Sudoku puzzles by viewing Sudoku as a complex multi-level sequence completion problem. It introduces a neural network model, termed "Multiverse", which comprises multiple parallel computational units, or "verses". Each unit is designed for sequence completion based on Long Short-Term Memory (LSTM) modules. The paper's novel perspective views Sudoku as a sequence completion task rather than a pure constraint satisfaction problem. The study generated its own dataset of 4x4 Sudoku puzzles and proposed variants of the Multiverse model for comparison and validation purposes. Comparative analysis shows that the proposed model is competitive with, and potentially superior to, state-of-the-art models. Notably, the proposed model was able to solve the puzzles in a single prediction, which offers promising avenues for further research on larger, more complex Sudoku puzzles.
1 INTRODUCTION
In this paper we propose a deep learning model for the solution of 4x4 Sudoku puzzles and show that the model can solve over 99 percent of the puzzles provided to it in just a single prediction, while state-of-the-art systems need more prediction iterations to attain similar results.

Figure 1: Depiction of the Sudoku variables. (a) Boxes. (b) Row Groups. (c) Column Groups.
The Sudoku puzzle was first introduced in the 1970s in a Dell magazine. It became fairly known throughout Japan. In the early 2000s, the puzzle started becoming extremely popular in Europe and then in the United States, igniting an interest not only in the form of competitions but also in the form of scientific research (see (Hayes, 2006) for a detailed background). The popularity of the puzzle caused a lot of research to be done regarding its logic-based properties and the diverse methods for solving it.

The Sudoku puzzle is composed of a square of cells with $o^2$ rows and $o^2$ columns for some natural constant $o$ called the order of the puzzle. The square is subdivided into its $o^2$ primary $o \times o$ squares called boxes or sub-squares. The boxes divide the rows and columns into $o$ sets of rows and $o$ sets of columns called row groups and column groups respectively. The puzzle begins with some placement of values $1 \dots o^2$ in some of the cells, called givens or hints. The object of the puzzle is to fill the rest of the cells with values so that every row, column, and sub-square will have all the values from $1 \dots o^2$. A block is either a row, a column or a sub-square, interchangeably.

Sudoku is considered a logic based puzzle. In fact, there are many types of logic that must be combined to solve the hardest of puzzles, ranging from trial and error to deduction and from inference to elimination.

A Sudoku can, technically, have more than one correct solution. A Well Posed Sudoku is a puzzle that has only one correct solution. A puzzle can be easier to solve if there are redundant hints, namely givens that can be deduced from one or more other givens. A puzzle with no redundant hints is called Locally Minimal (Simonis, 2005). Figures 2 and 3 demonstrate examples of the types of puzzles and their solutions. The puzzle in Fig. 2a is locally minimal because removing any of the numbers will cause it to have more than one solution, thus no longer being well posed. For example, removing the 2 will cause Fig. 3b to be a valid solution as well; removing the 1 will make Fig. 3d be a valid solution, and so on.
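To make these structural terms concrete, the following small Python sketch (ours, not from the paper) maps a cell $(i, j)$ of an order $o$ puzzle to the blocks and groups that constrain it:

```python
# Illustrative only: cell (i, j) of an order-o puzzle belongs to row i,
# column j, and box (i // o) * o + (j // o); row group i // o and
# column group j // o index the bands of boxes crossing that cell.
def blocks_of_cell(i: int, j: int, o: int = 2) -> dict:
    return {
        "row": i,
        "column": j,
        "box": (i // o) * o + (j // o),
        "row_group": i // o,
        "column_group": j // o,
    }

# Example for an order 2 (4x4) puzzle:
print(blocks_of_cell(2, 3))
# {'row': 2, 'column': 3, 'box': 3, 'row_group': 1, 'column_group': 1}
```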
Figure 2: Examples for well posed and locally minimal puzzles. (a) Well posed, locally minimal. (b) Only well posed. (c) Not well posed (4 solutions).
Sudoku can further be classified as a discrete constraint satisfaction problem (CSP). In particular, Sudoku is a special case of CSPs called an exact cover problem, where each constraint must be satisfied by only one variable in the solution assignment. Any algorithm that can solve discrete CSPs or exact cover problems will be able to provide a solution for a Sudoku, albeit usually in exponential run-time.

Sudoku can also be approached as a Machine Learning problem, thus described as a multi-class multi-label classification problem, meaning that a given puzzle has multiple values that need to be determined (namely the values of the unassigned cells) and each of those is one discrete value of a shared domain of integers. Therefore, there are various machine learning and deep learning methods that might be applicable to it. Unfortunately, a very high level of generalization is required for learning the implicit connections between the cells, so simple methods are not good enough here (Palm et al., 2018). Some progress has been made in this avenue of research, particularly combining deep learning with other methods (Wang et al., 2019) or using networks with recurrent prediction steps (Palm et al., 2018; Yang et al., 2023).

In this study, we introduce a novel approach to tackle Sudoku solving by leveraging a neural network architecture. Our methodology is rooted in the notion that Sudoku puzzles share similarities with intricate sequential data completion problems. To address this, we devised a specialized sequential completion unit based on Long Short-Term Memory (LSTM) and interconnected multiple such units in a parallel fashion. Remarkably, our resultant model exhibits comparable competence to state-of-the-art models, while obviating the need for multiple prediction iterations. Notably, we demonstrate a substantial disparity between our model and existing approaches in terms of performance when such iterations are excluded.

Drawing upon the notion that Sudoku represents a complex sequential data completion task, we adopted bespoke deep learning techniques to tackle its solution. Recognizing that the puzzle's completion process entails a multi-dimensional sequence, our proposed model incorporates multiple parallel computational units. For the sake of simplicity, our investigation primarily focuses on training the model on 4x4 puzzles that exhibit local minimality. Through one-shot prediction, we achieved an impressive completion rate exceeding 99 percent, effectively showcasing the neural network's capacity to grasp the abstract relationships among the puzzle's cells.

To the best of our knowledge, even though Large Language Models (LLMs) have been used for Sudoku solving, there has not been previous research based on the premise that Sudoku is a sequential completion problem. We show that this approach is effective and justifiable.

2 RELATED WORKS

Since solution by logic based algorithms has proven intractable for difficult puzzles of high order, automated Sudoku solving has become the object of a large amount of highly varied research.

Simonis (Simonis, 2005) did a thorough job defining the basic constraints in Sudoku puzzles. He also showed that these constraints can be described in various ways, creating additional redundant constraints with which the puzzles can be solved more efficiently. In his paper, he compares 15 different strategies based on constraint propagation and showed that the well posed puzzles in his dataset can all be solved by an all-different hyperarc solver, if the execution starts with one shaving move.

Hunt et al. (Hunt et al., 2007) took this one step further by showing that the constraints can be realized in a binary matrix solvable by the DLX algorithm provided by Knuth (Knuth, 2000) and that many of the common logical solving techniques can be summarized in a single theorem based on that matrix.
Figure 3: Solutions for puzzles in Fig. 2. (a) The solution for Fig. 2. (b) 2nd solution for Fig. 2c. (c) 3rd solution for Fig. 2c. (d) 4th solution for Fig. 2c.
The advantage of using DLX is that the algorithm is faster than other backtracking algorithms and also provides the number of possible solutions, thus giving an indication of whether a puzzle is well posed or not. The disadvantage of using DLX is that, since it is a backtracking algorithm, its run-time is exponential by nature.

Another related approach can be found in the works of Weber (Weber, 2005), Ist et al. (Ist et al., 2006) and Posthoff and Steinbach (Posthoff and Steinbach, 2010), who all model Sudoku as a SAT problem and use various SAT solvers to solve it. This method utilizes powerful existing systems but requires explicit definition of the rules and layout of each puzzle, a number of clauses which could expand exponentially with more complex puzzles. In our study we favoured machine learning because it obviates the need for an explicit description of the logic problem.

As mentioned in the introduction, deep learning is also a good candidate for Sudoku solving. The advantage of machine learning systems in general and deep learning systems in particular is that they can generalize a direct solution for a problem without having to use algorithms of exponential or worse runtimes to solve particular instances of the problem. Once trained, such a model could be significantly better even than the DLX algorithm.

Park (Park, 2018) provides a model based on a standard convolutional neural network that can solve a Sudoku in a sequence of interdependent steps. It solved 70 percent of the puzzles it was tested on, using a loop that predicted the value of the one highest probability cell in each iteration.

Palm et al. (Palm et al., 2018) created a graph neural network that solves problems that require multiple iterations of interdependent steps and showed that it solved more than 96 percent of the Sudokus presented to it, including very hard ones, although the solution in this method had to be done in a sequence of interdependent steps. While the results achieved are noteworthy, it is important to acknowledge that they necessitated both an understanding of the problem structure coded manually and a series of predictive steps towards a solution. In contrast, our study eliminates the need for prior knowledge of the problem's architecture and successfully resolves the puzzles in a single prediction step.

Wang et al. (Wang et al., 2019) combined a SAT solver with neural networks to add logical learning methods that can overcome the difficulty traditional neural networks have with global constraints in discrete logical relationships. Using that combination they succeeded in attaining a 98.3 percent completion rate without any hand coded knowledge of the problem structure. This approach differs from ours in that it requires the use of a SAT solver to complement the prediction provided by the network. Moreover, Chang et al. (Chang et al., 2020) showed that the good results presented by Wang et al. (Wang et al., 2019) are limited to easy puzzles and that the results are significantly worse than those of Palm et al. (Palm et al., 2018) when trying to solve hard puzzles.

Mehta (Mehta, 2021) created a reward system for a Q-agent and achieved a 7 percent full puzzle completion rate in easy Sudokus and 2.1 and 1.2 percent win rates in medium and hard Sudokus respectively, all with no rules provided. She did this unaware of Poloziuk and Smrkovska (Poloziuk and Smrkovska, 2020), who tested more complex Q-based agents and Monte-Carlo Tree Search (MCTS) algorithms and came to the conclusion that they require too much computation power to be used reasonably to solve Sudoku. With MCTS they performed a small number of experiments achieving 35-46 percent accuracy and called it a success, even though the results are fairly low, considering that in each experiment the training took them days to perform.

Du et al. (Du et al., 2021) used a Multi Layered Perceptron (MLP) to solve order 2 puzzles. They created a small 4 layer dense neural network and performed their prediction stage by stage, each time filling in only the one highest probability cell. Their dataset included puzzles none of which were locally minimal - all missing between 4 and 10 values. Their model solved more than 99 percent of the puzzles tested.
However, the vast majority of the puzzles tested were not locally minimal, and the number of prediction iterations they required was equal to the number of missing values. In our study, we not only performed experiments on locally minimal puzzles with 10, 11, or 12 missing values - the hardest well posed order 2 puzzles - and attained a completion rate greater than 99 percent, but did so in a single prediction step, albeit with a more intricate model.

Yang et al. (Yang et al., 2023) trained a generative pre-trained transformer (GPT) based model with the dataset used by Palm et al. (Palm et al., 2018) and tested it with iterative predictions. They had superior results, solving more than 99 percent of the puzzles, although when restricted to the hardest puzzles they achieved a 96.7 percent completion rate. However, Yang et al. (Yang et al., 2023) required a sequence of prediction steps to reach the solution and did not provide a solution in one end-to-end prediction. They needed 32 prediction iterations to achieve their results on order 3 puzzles. The baseline for our study is the model used by Yang et al. (Yang et al., 2023) modified for order 2 puzzles. We show that our model has competitively good results in fewer prediction iterations than required by their model.

As can be seen above, the significant results so far have been achieved only in systems that integrate sequences of interdependent prediction stages. In this paper we propose a model that achieves competent results in a single prediction stage.
3 DEEP LEARNING METHODS

Our paper introduces a deep learning approach specifically targeted at tackling order 2 Sudoku puzzles. These are 4x4 puzzles that, while smaller in scale compared to the standard order 3 Sudokus, present a unique appeal for scientific investigation. Given their relative simplicity, both in terms of representation and analysis, focusing our research on order 2 puzzles enables more rapid training and facilitates quicker attainment of results.

We consider Sudoku to be akin to a sophisticated, multi-layered sequence completion problem. With this perspective, we developed a deep learning neural network that leverages LSTM modules designed for sequence completion. This approach has yielded results that are on par with current leading models.

While our demonstrated results are limited to order 2 puzzles, we maintain the belief that these puzzles are sufficiently complex to serve as a sound foundation for creating a successful model for higher-order Sudoku problems. The importance of studying 4x4 puzzles lies in the opportunity they provide to build, test and refine models that could be efficiently scaled to more intricate Sudoku variants. This makes them an essential stepping stone in advancing deep learning methodologies for solving larger and more complex problems.

In this section we present the technical information of our methods, in particular the data composition and the structure of our models.

3.1 Datasets

Since most of the research into Sudoku has been on order 3 puzzles, we did not find an existing dataset of order 2 puzzles, so we created our own.

There exist exactly 288 unique order 2 solved Sudoku boards. Those boards represent 85632 puzzles which are both well posed and locally minimal, each containing only 4, 5, or 6 hints. It is possible to create a larger number of well posed puzzles by adding more hints, although those puzzles are not locally minimal. Figures 2 and 3 demonstrate examples of the various possible types of puzzles and their solutions. Table 1 shows the full number of well posed puzzles with 4 to 14 hints and how many of them are locally minimal.

Table 1: Well Posed Order 2 Puzzle Count.

H    WP        LM
4    25728     25728
5    284160    58368
6    1041408   1536
7    2141184   0
8    2961024   0
9    2958336   0
10   2204928   0
11   1239552   0
12   522624    0
13   161280    0
14   34560     0

H - The number of hints in the puzzle. WP - Well Posed, LM - Locally Minimal.

Our primary dataset consists of all 85632 well posed and locally minimal order 2 puzzles. Details on the generation process of the puzzles are provided in the appendix. The training of our models was performed on a subset of 77069 puzzles using 9-fold cross validation (we divided the puzzles into 10 subsets and left one out of the process).

The puzzles and their solutions are composed of strings of digits, where missing values are denoted as zeros. Since the values are discrete and categorical, we one-hot encoded them into five digit binary vectors in order to make processing easier.
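For concreteness, a minimal sketch of this encoding (the example puzzle string is hypothetical):

```python
# Each puzzle is a string of 16 digits, '0' marking a missing value;
# every digit is one-hot encoded into a five-digit binary vector.
import numpy as np

def encode_puzzle(puzzle: str, r: int = 5) -> np.ndarray:
    """E.g. '0040100002003000' -> array of shape (16, 5), one row per cell."""
    digits = np.array([int(c) for c in puzzle])
    return np.eye(r, dtype=np.float32)[digits]

x = encode_puzzle("0040100002003000")
print(x.shape)  # (16, 5)
```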
3.2 Machine Learning Models

Below we describe the models used in this study. The first subsection describes our main model, a neural network architecture we call the Multiverse, which is followed by the model variants used in our convergence and ablation studies and by the baseline model.
Multiverse Model

In the model description below, we use the following variables: $o$ is the order of the puzzle ($o = 2$ for all the tests in this study); $r$ is the size of the range of values possible in a given puzzle ($r = 5$ for all the tests in this study, to include the 4 possible values and zero); $s$ is the length of a side of a given puzzle ($s = o^2$ always); $a$ is the number of cells in a given puzzle ($a = s^2$ always).

Sudoku in many respects is a sequence completion problem. Each row must be completed with a permutation of the values in the range. Each column and box must be completed likewise. Each value must have a location in each row, forming a sequence of value-in-row location indices that requires completion. Similarly, such sequences are also formed by values in columns and boxes. There may exist more aspects of Sudoku puzzles that can also be viewed as completion problems, but that is a subject for a separate study.

Our initial sequence completion unit (below referred to as a "verse") is the following sequence of deep learning layers, with reshape layers between them where necessary:

1. Conv1D, where: input size = $(a, r)$, output size = $(a, r)$, filters = $r$, kernel size = 1, strides = 1
2. Dense, where: input size = $(a \times r)$, output size = $(a \times r)$
3. Bidirectional LSTM, where: input size = $(a, r)$, output size = $(a, 2 \times r)$, return sequences is set to true.
4. Bidirectional LSTM, where: input size = $(a, 2 \times r)$, output size = $(a, 2 \times r)$, return sequences is set to true.
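The paper specifies the verse only as the layer list above; the following Keras sketch is our reconstruction of it, with activations, initializers, and the exact placement of the reshape layers assumed:

```python
# A sketch of one "verse": Conv1D -> Dense -> 2x Bidirectional LSTM.
# o, r, s, a are as defined in the text (o = 2, r = 5, s = 4, a = 16).
import tensorflow as tf
from tensorflow.keras import layers

o = 2            # puzzle order
r = o**2 + 1     # value range including zero (= 5)
s = o**2         # side length (= 4)
a = s**2         # number of cells (= 16)

def build_verse(x):
    # 1. Conv1D: (a, r) -> (a, r), r filters, kernel size 1, stride 1
    y = layers.Conv1D(filters=r, kernel_size=1, strides=1)(x)
    # 2. Dense over the flattened puzzle: (a*r,) -> (a*r,)
    y = layers.Reshape((a * r,))(y)
    y = layers.Dense(a * r)(y)
    y = layers.Reshape((a, r))(y)
    # 3.-4. Two bidirectional LSTMs returning full sequences: (a, 2r)
    y = layers.Bidirectional(layers.LSTM(r, return_sequences=True))(y)
    y = layers.Bidirectional(layers.LSTM(r, return_sequences=True))(y)
    return y
```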
Our complete model is composed of a number of parallel verses with their output concatenated into a dense softmax termination layer (hence the name "Multiverse"). A Multiverse model for order 2 puzzles with 6 parallel verses (M6) is depicted in Figure 4. The motivation for this architecture is that the combination of the convolutional and the dense layers provides a basic embedding feature that allows for different interpretations by the parallel verses. The Bidirectional LSTM is a good sequence modeler when the direction is unimportant. Unlike NLP problems, Sudoku data is discrete and all-different, removing the semantically related aspects from the problem, so other techniques that have greater effect when used on text related problems are not required here.

Figure 4: An order 2 Multiverse model with 6 parallel verse modules.
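Continuing the sketch, a minimal assembly of the full Multiverse under the same assumptions, in particular that the softmax termination layer is applied per cell over the $r$ classes:

```python
# Several parallel verses, concatenated into a dense softmax layer.
def build_multiverse(n_verses: int = 6) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(a, r))          # one-hot encoded puzzle
    verses = [build_verse(inputs) for _ in range(n_verses)]
    merged = layers.Concatenate(axis=-1)(verses)   # (a, 2r * n_verses)
    outputs = layers.Dense(r, activation="softmax")(merged)  # (a, r)
    return tf.keras.Model(inputs, outputs)

model = build_multiverse(6)   # the M6 configuration
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```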
Convergence Study Models

Some of the configurations of the baseline model surpassed a 99 percent completion rate, albeit with more than one prediction, so we performed a study testing the number of verses required to attain such prestigious results in one-shot predictions. This study was performed on Multiverse models with 6 (see Fig. 4), 10 and 12 parallel verses (called M6, M10, and M12 respectively). M6 was our first robust model with significantly good results. We based most of our study around that model, but also tested the more powerful M10 and M12 models to show that the results achieved by the baseline are attainable by our model as well. The results themselves will be detailed in Section 4.

Ablation Study Models

We performed our ablation study on the following incomplete Multiverse models with 6 verses:

• M6 - 6 complete verses.
• No Conv - 6 verses that have no Conv1D layers.
• No Dense - 6 verses that have no Dense layers.
• No LSTM - 6 verses that have no LSTM layers.
• One LSTM - 6 verses that have only the first LSTM layer.
• M5 - Only 5 verses, all complete.

Results will be detailed in Section 4.

Baseline Model
As a baseline we modified the transformer based model provided by Yang et al. (Yang et al., 2023) to fit our order 2 data. The model is based on a MinGPT module (Karpathy, 2020) set to the input of a Sudoku puzzle. MinGPT performs the computations on sparse categorical data. Therefore, the input is not one-hot encoded but rather left in numerical format, with one change: all the values in the solution that correspond to hints were changed to -100. We maintained this data format in our modified model.
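As an illustration, the hint-masking step can be written as follows; -100 is the conventional ignore label for PyTorch cross-entropy losses (MinGPT is a PyTorch codebase), though the exact tensor layout here is our assumption:

```python
# Replace solution values at hint positions with -100 so the loss
# only scores the cells the model actually has to fill in.
import torch

def mask_hints(puzzle: torch.Tensor, solution: torch.Tensor) -> torch.Tensor:
    """puzzle, solution: 1-D integer tensors of length 16 (0 = empty)."""
    target = solution.clone()
    target[puzzle != 0] = -100   # do not score cells that were given
    return target
```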
For a convincing comparison with our model we ran tests on the baseline model with a variety of settings for the following parameters:

• Recurrences - The number of prediction iterations. We refer to this parameter by the name Iterations so as not to confuse it with RNN.
• Heads - The number of transformer heads used.
• Embedding - The size of the puzzle embedding.

The outcomes will be elaborated upon in Section 4.

Table 2: Convergence Study Results.

Model   No. Epochs   Completion Rate
M6      100          92 percent
M10     28           97 percent
M12     22           >99 percent

Figure 5: Completion percentages of M6, M10 and M12 models by number of training epochs.
REFERENCES

neering and Computer Science (EIECS), pages 306–310. IEEE.

Hayes, B. (2006). Unwed numbers. American Scientist, 94(1):12.

Hunt, M., Pong, C., and Tucker, G. (2007). Difficulty-driven sudoku puzzle generation. The UMAP Journal, 29(3):343–362.

Ist, I. L., Lynce, I., and Ouaknine, J. (2006). Sudoku as a SAT problem. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics (AIMATH), pages 1–9.

Karpathy, A. (2020). minGPT.

Knuth, D. E. (2000). Dancing links. arXiv preprint cs/0011047.

Mehta, A. (2021). Reinforcement learning for constraint satisfaction game agents (15-puzzle, minesweeper, 2048, and sudoku). arXiv preprint arXiv:2102.06019.

Palm, R., Paquet, U., and Winther, O. (2018). Recurrent relational networks. Advances in Neural Information Processing Systems, 31.

Park, K. (2018). Can convolutional neural networks crack sudoku puzzles.

Poloziuk, K. and Smrkovska, V. (2020). Neural networks and Monte-Carlo method usage in multi-agent systems for sudoku problem solving. Technology Audit and Production Reserves, 6(2):56.

Posthoff, C. and Steinbach, B. (2010). The solution of discrete constraint problems using boolean models - the use of ternary vectors for parallel SAT-solving. In International Conference on Agents and Artificial Intelligence, volume 2, pages 487–493. SCITEPRESS.

Simonis, H. (2005). Sudoku as a constraint problem. In CP Workshop on Modeling and Reformulating Constraint Satisfaction Problems, volume 12, pages 13–27. Citeseer.

Wang, P.-W., Donti, P., Wilder, B., and Kolter, Z. (2019). SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pages 6545–6554. PMLR.

Weber, T. (2005). A SAT-based sudoku solver. In LPAR, pages 11–15.

Yang, Z., Ishay, A., and Lee, J. (2023). Learning to solve constraint satisfaction problems with recurrent transformer. In The Eleventh International Conference on Learning Representations.

APPENDIX

Order 2 Puzzle Dataset Generation

The generation process of locally minimal well posed order 2 puzzles and their solutions was a 3 stage process. The first stage entailed creating a file containing all the 288 unique solutions for order 2 puzzles (solutions). We also needed to create a file with all the 65534 possible 16 bit binary strings, except for an all-0 string, which is an empty puzzle and obviously not well posed, and an all-1 string, which is a completed solution (bitmasks). The bitmasks file was created in descending order of the amount of zeroes in the strings. In the final stage, for each solution in solutions we iterated over bitmasks and filtered the well posed and locally minimal puzzles, resulting in a file with all the 85632 well posed and locally minimal puzzles and their respective solutions (puzzles).

Stage 1: Order 2 Solutions

Since there are only 288 unique solutions for the order 2 Sudoku problem, this stage is very straightforward. We simply took an empty puzzle, applied to it the DLX algorithm, and saved to a file all the resulting solutions.
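The paper uses DLX here; as a simple stand-in at order 2 scale, the following sketch counts solutions by plain backtracking. A puzzle is well posed iff the count is exactly 1, and running it on an empty grid with a large limit counts all 288 complete boards:

```python
# Count the solutions of an order-o Sudoku given as a flat list of ints
# (0 = empty). `limit` allows early exit once "more than one" is known.
def count_solutions(grid: list[int], o: int = 2, limit: int = 2) -> int:
    s = o * o
    def ok(pos: int, v: int) -> bool:
        i, j = divmod(pos, s)
        for k in range(s):                      # row and column checks
            if grid[i * s + k] == v or grid[k * s + j] == v:
                return False
        bi, bj = (i // o) * o, (j // o) * o     # box check
        return all(grid[(bi + di) * s + bj + dj] != v
                   for di in range(o) for dj in range(o))
    try:
        pos = grid.index(0)
    except ValueError:
        return 1                                # grid full: one solution
    count = 0
    for v in range(1, s + 1):
        if ok(pos, v):
            grid[pos] = v
            count += count_solutions(grid, o, limit)
            grid[pos] = 0
            if count >= limit:                  # only need "1 vs. more"
                break
    return count
```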
Stage 2: Binary Strings

Creation of the bitmasks file was done with array manipulation. We created empty arrays and added to them indices of locations where a bit in a mask should be '1', by gradually appending to the arrays more arrays that are based on them and have more indices added to them. After that we scanned the arrays and transformed them into bit strings.

The algorithm runs in $\Theta(2^{16})$ for order 2 puzzles, and if modified for order $o$ puzzles its runtime is $\Theta(2^{\mathit{area}})$, where $\mathit{area} = \mathit{side}^2$ and $\mathit{side} = o^2$. It uses a similar amount of memory.
Stage 3: Order 2 Puzzles

This stage approaches the binary strings in bitmasks as binary masks. We consider a mask $a$ to cover another mask $b$ if for every $i$ where $b[i] =$ '1', also $a[i] =$ '1'.

For each solution in solutions we iterate over the binary masks in bitmasks and for each mask $m$ we check the following conditions:

• Has a puzzle already been added whose mask is covered by $m$ (and therefore the puzzle is not locally minimal)?
• Is the corresponding puzzle well posed?

If the first condition is false and the second is true, we add the corresponding puzzle and its solution to puzzles. In order to check if a puzzle is well posed we apply to it the DLX algorithm and test if the number of solutions is equal to 1.
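A sketch of this filtering loop, reusing count_solutions from the Stage 1 sketch in place of DLX; apply_mask (keep a solution digit as a hint where the mask bit is '1') is an assumed helper:

```python
def covers(a: str, b: str) -> bool:
    """Mask a covers mask b: every '1' of b is also '1' in a."""
    return all(x == "1" for x, y in zip(a, b) if y == "1")

def apply_mask(solution: str, mask: str) -> str:
    """Keep a solution digit as a hint where the mask bit is '1'."""
    return "".join(d if b == "1" else "0" for d, b in zip(solution, mask))

def filter_puzzles(solution: str, bitmasks: list[str]) -> list[str]:
    kept: list[tuple[str, str]] = []   # (mask, puzzle) pairs already added
    for m in bitmasks:                 # descending number of zeroes
        # condition 1: an already-added puzzle's mask is covered by m,
        # so the puzzle for m cannot be locally minimal
        if any(covers(m, prev) for prev, _ in kept):
            continue
        puzzle = apply_mask(solution, m)
        # condition 2: well posed (exactly one solution)
        if count_solutions([int(c) for c in puzzle]) == 1:
            kept.append((m, puzzle))
    return [p for _, p in kept]
```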