Report
Hangman Challenge
Success rate in recorded session = 0.675
Algorithm Part 1
Base Algorithm (CatBoost)
1. Ideation
1.1 The default algorithm for solving word puzzles or games begins by making
initial guesses. These guesses are based on two factors: the length of the
unknown word and the frequency of letters in commonly used words.
Instead of considering where letters appear in the word, the algorithm focuses
on the nature of the dictionary you're using and how often individual letters tend
to appear in words. In other words, it prioritizes which letters are more likely to
be in the word based on their overall usage rather than their specific position.
1.2 It can be observed that long words often share common prefixes and
suffixes, such as 'tion', 'ous', 'ing', 'dis', 'pre', and so on. These common
prefixes and suffixes should be used strategically in the early stages of guessing
words.
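To make the default strategy of 1.1 concrete, here is a minimal sketch of a frequency-based guesser. It assumes the dictionary is available as a plain Python list of words; the function names are illustrative and are not part of the submitted code.

    from collections import Counter

    def frequency_order(dictionary, word_length):
        # Rank letters by how many dictionary words of this length contain them.
        counts = Counter()
        for word in dictionary:
            if len(word) == word_length:
                counts.update(set(word))        # count each letter once per word
        return [letter for letter, _ in counts.most_common()]

    def next_guess(dictionary, masked_word, guessed):
        # Guess the most frequent letter (for this word length) not tried yet.
        for letter in frequency_order(dictionary, len(masked_word)):
            if letter not in guessed:
                return letter

For example, next_guess(["apple", "angle", "mango"], "_____", {"a"}) skips the already guessed 'a' and returns the next most common letter among the five-letter words.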
2. Dataset Creation
2.1 Simulated Hangman Instances
A unique set of letters is formed for each word in the dataset. For example, for
the word "tree," the set would be {‘t’, ’r’, ’e’}.
Then, for each combination of letters in this set, a hangman problem instance is
generated.
_ree  { 't' }
t_ee  { 'r' }
tr__  { 'e' }
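The instances above can be generated mechanically from each dictionary word. The sketch below, using only the Python standard library, enumerates every subset of the word's unique letters and masks those letters out; the names are illustrative.

    from itertools import combinations

    def hangman_instances(word):
        # Yield (masked_word, hidden_letters) for every subset of the word's letters.
        letters = sorted(set(word))                  # e.g. ['e', 'r', 't'] for "tree"
        for k in range(1, len(letters) + 1):
            for hidden in combinations(letters, k):
                masked = "".join("_" if c in hidden else c for c in word)
                yield masked, set(hidden)

    # list(hangman_instances("tree")) contains ('_ree', {'t'}), ('t_ee', {'r'}),
    # ('tr__', {'e'}), as well as the instances with two and three hidden letters.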
We need to create a dataset that utilizes these observations for initial guesses and
gives the final model an advantage in correctly identifying suffixes and prefixes.
Each hangman word is represented with a fixed-length encoding: positions still to be
guessed are marked '0', revealed letters are given their numeric codes, and padding
positions outside the word are marked '-1'.
Example 1 - masked word "_pp_e" (row of size 20 for illustration):

Position:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
Letter:    _   p   p   _   e                                       _   p   p   _   e
Encoding:  0  16  16   0   5  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1   0  16  16   0   5

Example 2 - masked word "tr__qu_nt":

Position:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
Letter:    t   r   _   _   q   u   _   n   t           t   r   _   _   q   u   _   n   t
Encoding: 20  18   0   0  17  21   0  14  20  -1  -1  20  18   0   0  17  21   0  14  20
A row of size 80 is used (20 in the examples above). The letter values are inserted at
both the front and the back of the row, according to the word's length, so that the word
fits; underscore positions are filled with '0', and the remaining middle positions are
filled with '-1'. This encoding gives the model the positions of letters in prefixes and
suffixes relative to each other.
The size 80 was chosen because the longest word in the given dictionary had 26 letters,
and the longest word in the testing dictionary was assumed to be at most 40 letters.
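A minimal sketch of this encoding, assuming letters are coded a = 1 ... z = 26 and '_' = 0, consistent with the examples above (the function name is illustrative):

    ROW_SIZE = 80      # 20 is used in the worked examples above

    def encode(masked_word, row_size=ROW_SIZE):
        # Letter codes are written at both the front and the back of the row;
        # '_' becomes 0 and unused middle positions are -1.
        codes = [0 if c == "_" else ord(c) - ord("a") + 1 for c in masked_word]
        row = [-1] * row_size
        row[:len(codes)] = codes           # word aligned to the front
        row[-len(codes):] = codes          # word aligned to the back
        return row

    # encode("_pp_e", row_size=20)
    # -> [0, 16, 16, 0, 5, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 16, 16, 0, 5]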
This dataset was created from the given dictionary using the attached file
"dataset_creation.ipynb", and the resulting dataset was saved in Parquet format as
"fd.parquet".
3.1 For each letter, a classification model is trained that identifies whether the letter
should be guessed (1) or not guessed (0) for a given instance of the hangman game.
Since the dataset only contains categorical features ('0' for '_', '3' for 'c', etc.), the
classification model chosen was the CatBoost Classifier, due to its proficiency on
categorical datasets.
A model class is created that includes 26 models, one for each letter of the alphabet.
This class has a 'predict' function that returns a guessing probability for each letter,
based on each model's predicted probability for class '1' (i.e., guess).
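A minimal sketch of such a model class, assuming the catboost package and a feature matrix X built from the encoding above; the class name, the structure of the labels Y, and the training details are illustrative rather than a copy of the submitted code.

    import string
    import numpy as np
    from catboost import CatBoostClassifier

    class LetterModels:
        # One binary classifier per letter: should this letter be guessed (1) or not (0)?
        def __init__(self):
            self.models = {c: CatBoostClassifier(verbose=0)
                           for c in string.ascii_lowercase}

        def fit(self, X, Y):
            # Y holds one 0/1 target column per letter (e.g. a DataFrame or dict).
            for c, model in self.models.items():
                model.fit(X, Y[c])

        def predict(self, x):
            # Probability of class '1' (guess) for each of the 26 letters.
            x = np.asarray(x).reshape(1, -1)
            return {c: model.predict_proba(x)[0][1]
                    for c, model in self.models.items()}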
The model achieves an average balanced accuracy of 0.89 for all letters on the
test split. This accuracy measures how accurately the model determines
whether a letter should be guessed in a given instance.
(Balanced accuracy is the average of the sensitivity (true positive rate) and specificity
(true negative rate) of a binary classifier.
See: sklearn.metrics.balanced_accuracy_score)
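For reference, the metric can be computed with scikit-learn; the toy labels below are only there to illustrate its behaviour.

    from sklearn.metrics import balanced_accuracy_score

    y_true = [1, 1, 0, 0, 0, 1]   # letter actually belongs in the hidden positions
    y_pred = [1, 0, 0, 0, 1, 1]   # model's guess / no-guess decision
    print(balanced_accuracy_score(y_true, y_pred))   # (2/3 + 2/3) / 2 ≈ 0.667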
3.2 Some imbalance is observed in the prediction probabilities for letters with
lower occurrence frequencies, such as 'z,' due to its infrequent appearance in
the dictionary.
However, the model performs well in instances where words contain suffixes like
"..._ized" because of how the dataset was created to account for letter positions in
suffixes.
While the model excels at providing accurate initial guesses and identifying prefixes
and suffixes, it encounters challenges when the word contains less common letters.
This is because the relative positioning of letters makes sense for the beginning and
end of the word but not for the middle.
(The model trained from this approach was stored as a pickle file named
"model_trained/multilabel_catboost_model.pkl")
Algorithm Part 2
Fine-tuning algorithm
The base model often leaves less common letters unguessed, as these cannot be
predicted reliably from the training dictionary. These unguessed letters often have
their neighboring letters already guessed, which the fine-tuning algorithm exploits by
searching the dictionary.
6. Dictionary Creation
6.1 For length five and, for instance, the masked word "une_ualizer", the following
substrings containing exactly one missing letter position were created:
Substrings of length 5
une_u
ne_ua
e_ual
_uali
6.2 All substrings of length 5 for each word in the dictionary are created and stored in
a new_dictionary.
This is then used to create a frequency_array for each substring, containing the
number of occurrences of each letter at the missing position whenever the remaining
letters match a substring in new_dictionary (see the sketch after the table below).
Substrings A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
une_u 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0
ne_ua 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0
e_ual 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 4 0 0 0 0 0 0
_uali 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 4 0 0 0 0
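A minimal sketch of this construction, assuming plain Python and a dictionary given as a list of words; new_dictionary here is a simple list of substrings, and the names mirror the report's terminology rather than the actual submitted code.

    import string

    def build_new_dictionary(dictionary, length=5):
        # All substrings of the given length taken from every dictionary word.
        subs = []
        for word in dictionary:
            subs.extend(word[i:i + length] for i in range(len(word) - length + 1))
        return subs

    def substrings_with_blank(masked_word, length=5):
        # Windows of the masked word that contain exactly one missing position.
        return [masked_word[i:i + length]
                for i in range(len(masked_word) - length + 1)
                if masked_word[i:i + length].count("_") == 1]

    def frequency_array(pattern, new_dictionary):
        # Count which letter fills the blank in every stored substring that
        # matches the pattern at all other positions.
        counts = dict.fromkeys(string.ascii_lowercase, 0)
        blank = pattern.index("_")
        for sub in new_dictionary:
            if len(sub) == len(pattern) and all(
                    p == "_" or p == s for p, s in zip(pattern, sub)):
                counts[sub[blank]] += 1
        return counts

    # substrings_with_blank("une_ualizer") -> ['une_u', 'ne_ua', 'e_ual', '_uali']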
6.3 The frequencies are then divided by the total frequency of each substring, so that
each frequency_array is normalized to sum to 1. These normalized frequency_arrays
are then added across all substrings of the instance.
Note that this ensures that if a substring's frequency_array has a single letter with
non-zero frequency, or a letter with a very high frequency ratio, the resulting
confidence value for that letter can exceed 1, prioritizing a uniquely positioned
substring over substrings with higher raw frequencies.
This final array represents the confidence level with which each letter should be
guessed. For each length value, the single letter with the highest confidence level is
returned.
If, at any comparison level, the confidence level of a letter exceeds the probability of
the letter predicted by the base model, it is returned as the guessed letter for that
instance of the hangman game. Otherwise, the model's predicted letter is returned.
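A minimal sketch of the confidence computation and the comparison against the base model, reusing the helpers from the previous sketch; it shows a single substring length only, whereas the actual algorithm repeats the comparison over several lengths.

    import string

    def dictionary_confidence(masked_word, new_dictionary, guessed, length=5):
        # Normalize each frequency_array and sum them over all substrings of the instance.
        confidence = dict.fromkeys(string.ascii_lowercase, 0.0)
        for pattern in substrings_with_blank(masked_word, length):
            counts = frequency_array(pattern, new_dictionary)
            total = sum(counts.values())
            if total:
                for letter, count in counts.items():
                    confidence[letter] += count / total
        for letter in guessed:               # never propose an already guessed letter
            confidence[letter] = 0.0
        return confidence

    def choose_guess(model_probs, confidence):
        # Prefer the dictionary letter when its confidence exceeds the base model's
        # probability for its own best letter; otherwise keep the model's prediction.
        dict_letter = max(confidence, key=confidence.get)
        model_letter = max(model_probs, key=model_probs.get)
        if confidence[dict_letter] > model_probs[model_letter]:
            return dict_letter
        return model_letter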
8. Modification
For substrings of length 10, 9, and 8, the number of missing values allowed in the
selected substring was increased to 2, because two or more low-frequency letters can
appear in words of such considerable length.
9. Results
The base algorithm showed promising performance in initial tests, but it achieved
significantly better results when integrated with the fine-tuning algorithm. The second
algorithm, specifically designed to address the limitations of the base algorithm,
proved to be an ideal solution. After conducting 1000 recorded sessions, the final
success_rate reached an impressive 0.625.