Linear Programming Word Problems Formulation Using EnsembleCRF NER Labeler and T5 Text Generator with Data Augmentations
Infrrd AI Lab
Infrrd.ai
{jianglong,mamathan,shivvignesh,deepakumar,akshayuppal}@infrrd.ai
Abstract
1 Introduction
Linear programming problems are used in business applications to optimize a solution to a problem. A linear programming problem is defined as a maximization or minimization problem: depending on the requirements, we may have to maximize profit or minimize operational costs. People conceived the idea of formulating business decisions as mathematical problems to run a business and drive society to the next level. Such a problem arises when there are multiple ingredients or resources that scale only in a linear manner; blending two or more materials to maximize profit or minimize cost is an example of this linearity in the material view.
The formulation of linear programming problems is a powerful concept for solving a set of business problems with monetary benefits. The solution of linear programming problems, however, is still a difficult concept for non-experts to understand. Dantzig developed a method known as the simplex method to solve linear programming problems Nash [2000]. Here, a set of constraints defines the feasible region over which the linear objective is optimized; the method locates the corners of the feasible region and evaluates them to find the optimal solution. Solutions to linear programming problems improved with interior point methods. In an interior point-based approach, the evaluation begins from inside the feasible region, which reduces the computational time compared to the simplex method when the number of variables is large Karmarkar [1984]. Nowadays, readily available software can solve the problem CPLEX [2022]. Such software solves mathematically framed linear programming problems, but it still requires an expert to formulate a linear programming problem from the word problem. The NL4Opt competition NL4Opt [2022] creates an opportunity to develop an automated solution that converts a word problem into a mathematically framed problem, which is then passed on to readily available software to find the optimal solution Ramamonjison et al. [2022]. The competition conceptualizes the hurdle faced by non-experts while formulating a linear programming problem. There are two tasks in this competition: named entity recognition and generation of the precise meaning representation. In the entity recognition task, a method has to identify all the entities present in the word problems. In the generation task, a solution uses the located entities to generate a mathematical representation. The text generation task depends on the ground truth entities of the first task, which act as its reference input.
Our contributions and observations are summarized as follows:
• The number of samples in the training set is too small to train a large language model. So, we introduce the data augmentation strategies LwTR, SR, MR, and SiS Dai and Adel [2020] as a preprocessing step. Thus, we increase the number of samples in the data to produce more plausible outputs.
• We introduce an ensemble of models to learn the weights for each label predicted from the
single models.
• We introduce multi-task learning and train a large language model to generate text through
different prompts. An entity wrapper is placed around an entity to enhance the data. We
combine the task-specific text from the same model to form a meaningful representation for
linear programming solvers.
Our approach follows the flair framework Akbik et al. [2019] and uses PyTorch and hugging face Face [2022] transformers for building and experimentation. Our base model is a ‘Text Embedding + BiLSTM + CRF’ transformer model, as shown in Figure 1. We had previously used this base model to train a multilingual transformer for the MultiCONER competition Malmasi et al. [2022a,b], where the text embedding layer consisted of XLM-Roberta embeddings He et al. [2022]. With Roberta-base Liu et al. [2019] as the text embedding layer, the F1 score of the model is equivalent to the baseline results. Since the number of training samples available to train the model is small, we performed data augmentation on the training samples.
Figure 1: The main architecture of the model where the text embedding layer is replaced with different
types of embeddings available in hugging face.
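As a sketch only: under the flair framework, a base model of this shape can be assembled roughly as follows. The corpus paths, column format, embedding stack, and training hyperparameters are illustrative assumptions rather than the exact competition settings.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import StackedEmbeddings, TransformerWordEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# hypothetical CoNLL-style files: one token and its IOB label per line
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

# 'Text Embedding' layer: roberta-base stacked with GloVe (one of the Table 2 variants)
embeddings = StackedEmbeddings([
    TransformerWordEmbeddings("roberta-base", fine_tune=True),
    WordEmbeddings("glove"),
])

# BiLSTM + CRF on top of the stacked embeddings
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner",
                        use_rnn=True,   # BiLSTM layer
                        use_crf=True)   # CRF decoding layer

ModelTrainer(tagger, corpus).train("ner-model/",
                                   learning_rate=0.1,
                                   mini_batch_size=16,
                                   max_epochs=50)
```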
2.1 Data augmentation
Transformer-based language models require a lot of labeled data to produce good performance. Here, we are working on linear programming word problems, whose availability is scarce. In addition, annotation is a time-consuming task that requires expert knowledge. Both factors are critical in this case, and we need to overcome this data deficiency. So, we made use of token-level data augmentation techniques Dai and Adel [2020]. These techniques expand the number of training samples by adding transformed samples without changing the labels. There are four techniques in this method: Label-wise token replacement (LwTR), Synonym replacement (SR), Mention replacement (MR), and Shuffle within segments (SiS). We used all four techniques to augment the training samples; the development set was left untouched for evaluation purposes. Table 1 illustrates an example of the generation of augmented data, and a sketch of one technique follows the table.
Table 1: The different types of data augmentation techniques: Label-wise token replacement (LwTR),
Synonym replacement (SR), Mention replacement (MR), and Shuffle within segments (SiS).
Method Instance
Original A serving of chicken costs $ 10
O O O B-VAR B-OBJ_NAME O B-PARAM
LwTR A flour of cereal profit $ 25
O O O B-VAR B-OBJ_NAME O B-PARAM
SR A serving of volaille cost $ X
O O O B-VAR B-OBJ_NAME O B-PARAM
MR A serving of Durian TV savings $ 0.32
O O O B-VAR I-VAR B-OBJ_NAME O B-PARAM
SiS serving of a chicken costs $ 25
O O O B-VAR B-OBJ_NAME O B-PARAM
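To make the augmentation step concrete, below is a minimal sketch of Label-wise token replacement. The per-token replacement probability p and the helper names are our own illustrative assumptions (Dai and Adel [2020] draw the replacement decisions from a binomial distribution).

```python
import random
from collections import defaultdict

def build_label_vocab(sentences):
    """Collect, for every label, the tokens seen with that label in the training set."""
    vocab = defaultdict(set)
    for tokens, labels in sentences:
        for tok, lab in zip(tokens, labels):
            vocab[lab].add(tok)
    return {lab: sorted(toks) for lab, toks in vocab.items()}

def lwtr(tokens, labels, label_vocab, p=0.3, rng=random):
    """Label-wise token replacement: swap each token, with probability p,
    for another training token carrying the same label; labels are unchanged."""
    out = []
    for tok, lab in zip(tokens, labels):
        if rng.random() < p and len(label_vocab.get(lab, [])) > 1:
            tok = rng.choice(label_vocab[lab])
        out.append(tok)
    return out, list(labels)
```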
Table 2: The performance of a single model with different text embeddings on the development set.
Model number    Text embeddings    Development F1 score
1 roberta-base + no augmentation 0.8117
2 roberta-base + LwTR & SR augmentation 0.8641
3 xlm-roberta-base + 4 augmentations 0.8898
4 roberta-base + 4 augmentations 0.9127
5 roberta-base + glove + 4 augmentations 0.9151
6 only glove + 4 augmentations 0.787
7 roberta-base + glove + nobilstm + 4 augmentations 0.9154
8 roberta-base + glove + nobilstmcrf + 4 augmentations 0.9034
2.2 Ensemble architecture
A simple ensemble strategy of Majority Voting was developed by He et al. [2022]. Consider a set of $M$ sequence labeling models denoted as $C = \{c_1, c_2, \ldots, c_M\}$ and an input sentence denoted as $S = \{w_1, w_2, \ldots, w_n\}$, where each $w$ is a word from $S$. Each model from $C$ outputs a sequence of predictions for each word $w$ in sentence $S$. Let $O_S^{c_i} = \{O_{w_1}^{c_i}, O_{w_2}^{c_i}, \ldots, O_{w_n}^{c_i}\}$ denote the prediction output of model $c_i$ on sentence $S$, where $O_{w_j}^{c_i}$ denotes the prediction of model $c_i$ on word $w_j$ in IOB format. The set of outputs of all models in $C$ on sentence $S$ is denoted as
$$O_S = \{O_S^{c_1}, O_S^{c_2}, \ldots, O_S^{c_M}\}. \tag{1}$$
The Majority Voting strategy takes all models’ predictions for word $w_j$ and outputs the most frequent prediction as the final prediction for $w_j$. An obvious issue with Majority Voting is the IOB scheme constraint: the final ensemble result is not guaranteed to satisfy the constraints that an I tag must follow a B tag and that neighboring B and I tags must belong to the same entity.
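A minimal sketch of per-token Majority Voting (ties fall back to the first-seen tag here), illustrating how the voted sequence is formed:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of M tag sequences (one per model) for the same sentence.
    Returns the per-token most frequent tag. Note: the result can violate the
    IOB scheme, e.g. an I- tag without a preceding B- tag of the same entity."""
    voted = []
    for position_tags in zip(*predictions):
        voted.append(Counter(position_tags).most_common(1)[0][0])
    return voted

# e.g. three models disagreeing on two tokens
preds = [["O", "B-VAR", "I-VAR"], ["O", "B-VAR", "O"], ["O", "O", "I-VAR"]]
print(majority_vote(preds))  # ['O', 'B-VAR', 'I-VAR']
```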
We introduce an ensemble learning approach via sequence labeling called ‘EnsembleCRF’, as shown in Figure 2.
Figure 2: The architecture of the EnsembleCRF model. Given a sentence as input, each of the sequence labeling models outputs its named entity predictions in IOB format; a one-hot encoder combines them, and a Conditional Random Field model generates the ensemble output.
The model outputs are stacked together and passed through a one-hot encoder, three linear layers, and a CRF layer. The CRF layer is trained to optimally combine the model predictions into a new set of predictions, and the addition of the three linear layers helped improve performance. The EnsembleCRF model is of the form
$$\hat{Y} = \mathrm{CRF}(\mathrm{Linear}_3(\mathrm{Linear}_2(\mathrm{Linear}_1(\mathrm{OneHot}(O_S))))),$$
where $O_S$ is the stacked set of single-model predictions from equation (1).
$D_{en}$ is the ensemble learning dataset composed of $X$ and $Y$. $X = \{O_{S_1}, O_{S_2}, \ldots, O_{S_K}\}$ is created by applying the model set $C$ to a set of input sentences $\{S_1, S_2, \ldots, S_K\}$ of size $K$. Each element of $X$ is defined as in equation (1). $Y$ holds the ground truth entity labels in IOB format. During the training phase, some models in set $C$ were trained on the training dataset only. Thus, we choose to perform
the ensemble learning using an augmented training set created with the data augmentation strategies explained in Section 2.1. Since the second-layer classifier is a CRF, the problem of breaking the IOB constraints is solved. By learning to optimally combine the model predictions, EnsembleCRF also learns to avoid mistakes made by the single sequence labeling models.
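A minimal PyTorch sketch of such a second-stage classifier, assuming the pytorch-crf package for the CRF layer; the hidden size and method names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchcrf import CRF  # pytorch-crf package, assumed as the CRF implementation

class EnsembleCRF(nn.Module):
    """One-hot encode each single model's IOB prediction, concatenate across
    models, and pass the result through three linear layers and a CRF."""

    def __init__(self, num_models: int, num_tags: int, hidden: int = 128):
        super().__init__()
        self.num_tags = num_tags
        self.linear1 = nn.Linear(num_models * num_tags, hidden)
        self.linear2 = nn.Linear(hidden, hidden)
        self.linear3 = nn.Linear(hidden, num_tags)  # emission scores for the CRF
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, preds: torch.Tensor) -> torch.Tensor:
        # preds: (batch, seq_len, num_models) integer tag ids from the single models
        x = F.one_hot(preds, num_classes=self.num_tags).float()
        x = x.flatten(start_dim=2)  # (batch, seq_len, num_models * num_tags)
        x = torch.relu(self.linear1(x))
        x = torch.relu(self.linear2(x))
        return self.linear3(x)

    def loss(self, preds, gold_tags, mask=None):
        # negative log likelihood of the gold tag sequence under the CRF
        return -self.crf(self.emissions(preds), gold_tags, mask=mask)

    def decode(self, preds, mask=None):
        # Viterbi decoding; learned transition scores penalize invalid IOB moves
        return self.crf.decode(self.emissions(preds), mask=mask)
```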
We experimented with creating $D_{en}$ from the augmented training dataset. However, we found no positive correlation between the number of models in $C$ and the macro-averaged F1 score on the development dataset. Treating every possible combination of the model set $C$ as a hyperparameter to optimize yields the optimal result.
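A sketch of that combination search, assuming a hypothetical eval_f1 callback that trains and scores an EnsembleCRF on a given subset of single models:

```python
from itertools import combinations

def best_ensemble(models, eval_f1):
    """Exhaustively treat every non-empty subset of single models as a candidate
    ensemble and keep the subset with the highest development-set macro F1."""
    best_subset, best_score = None, -1.0
    for r in range(1, len(models) + 1):
        for subset in combinations(models, r):
            score = eval_f1(subset)  # assumed: trains/evaluates EnsembleCRF on subset
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```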
In the MultiCONER competition Malmasi et al. [2022a,b], an ensemble model provided a boost in the F1 score. We followed a similar approach and used the same architecture. The single models evaluated in the earlier section are combined to form an ensemble, and the results are tabulated in Table 3. The ensemble of all six models with 4 augmentations gave an F1 score of 0.9173 on the development set, but the best model evaluated on the test set had a different combination, shown in bold. The results vary only at the third decimal place, indicating that combining models provides an improvement but not a large change in the F1 score. However, the results are consistent across ensemble configurations. One of the main concerns with the ensemble model is choosing the number of single models that maximizes the F1 score, since this has to be determined by trial and error.
Table 3: The performance of ensemble with different text embeddings on the development set.
Model number    Text embeddings    Development F1 score
1 roberta-base + 4 augmentations 0.9154
(3 models) roberta-base + glove + 4 augmentations
xlm-roberta-base + 4 augmentations
2 roberta-base + 4 augmentations 0.9164
(3 models) roberta-base + glove + 4 augmentations
roberta-base + glove + nobilstm + 4 augmentations
3 roberta-base + 4 augmentations 0.9161
(4 models) roberta-base + glove + 4 augmentations
roberta-base + glove + nobilstm + 4 augmentations
roberta-base + glove + nobilstmcrf + 4 augmentations
4 roberta-base + 4 augmentations 0.9167
(5 models) roberta-base + glove + 4 augmentations
roberta-base + glove + nobilstm + 4 augmentations
roberta-base + glove + nobilstmcrf + 4 augmentations
xlm-roberta-base + 4 augmentations
5 roberta-base + 4 augmentations 0.9173
(6 models) roberta-base + glove + 4 augmentations
roberta-base + glove + nobilstm + 4 augmentations
roberta-base + glove + nobilstmcrf + 4 augmentations
xlm-roberta-base + 4 augmentations
glove + 4 augmentations
The test results are tabulated in Table 4. Our ensemble approach placed at the top with an F1 score of 0.939. The results were tabulated after a reproducibility test performed by the competition organizers, which confirms that our programs are reproducible and consistent in the F1 score.
Table 4: The performance of different submissions on the test set of sub-task 1 NL4Opt [2022].
Rank Team Name Affiliation(s) F1 score
1 Infrrd AI Lab Infrrd 0.939
2 mcmc OPD 0.933
3 PingAn-zhiniao PingAn Technology 0.932
4 Long BDAA-BASE 0.931
5 VTCC-NLP Viettel 0.929
6 Sjang POSTECH 0.927
7 DeepBlueAI DeepBlueAI 0.921
8 TeamFid Fidelity 0.920
9 KKKKKi Netease 0.917
10 holajoa Imperial College London 0.910
11 Dream 0.884
Baseline (xlm-roberta-base) NL4Opt 0.906
The generation of declarations was first conceptualized as entity relationship mapping. The relationship between the ground truth entities was missing, and our refined objective was to map the relationships so as to capture the declarations expressed in the natural language. To do so, we tried to implement an entity relationship style of mapping the entities, but we could not progress further and moved to the next step, the text generation approach. In our experiments, we observed that text generation was much easier than entity relationship mapping.
We use a text generator to generate the declarations. We use the Text-to-Text Transfer Transformer (T5) in our experiments Raffel et al. [2019]. The encoder attends to all tokens of the input text, while the output text is generated autoregressively, conditioned on the previously predicted outputs. We fine-tuned the publicly available T5 model from the hugging face library Face [2022]. In the first stage, we provided the raw text as input to the transformer, and the expected output was as defined in sub-task 2 NL4Opt [2022]. Our output is a generated declaration, and the results were on par with the baseline results shared by the competition organizers.
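A minimal sketch of the fine-tuning and generation setup with the hugging face API; the model size, prompt and target strings, and the abbreviated tag list are illustrative assumptions (the 600/750 length limits match the hyperparameters in Table 8).

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# the declaration tags are registered as extra tokens (abbreviated list)
tokenizer.add_tokens(["<DECLARATION>", "</DECLARATION>", "<OBJ_DIR>", "</OBJ_DIR>"])
model.resize_token_embeddings(len(tokenizer))

prompt = "generate linear program mapping: A factory runs rickshaws and ox carts ..."
target = "<s> <DECLARATION> <OBJ_DIR> maximize </OBJ_DIR> ... </DECLARATION> </s>"

# one illustrative training step; real training loops over a DataLoader
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=600)
labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=750).input_ids
loss = model(**inputs, labels=labels).loss
loss.backward()  # followed by an optimizer step in a full loop

# inference: the decoder generates autoregressively, conditioned on the prefixed input
generated = model.generate(**inputs, max_length=750)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```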
Figure 3: The generation of text using the T5 model for different types of prompts.
We observed that the score was on par with the baseline but lacked the anticipated improvements. An initial analysis identified two issues: the small number of samples available to train the model, and the repeated occurrence of declarations in the form of constraints. To address these points, we decompose each sample into pieces for multi-task learning and thus implicitly increase the number of training samples in our dataset.
We believe the performance was low due to the small number of samples. We split the single task into multiple tasks, as shown in Figure 3, which invariably introduces many more examples into the training set. This approach helped the model map each declaration to its probable entities. First, we trained the model to predict only the objective of the problem; it was then scaled to generate the constraint declarations. There were seven different types of constraints in the competition: sum constraints, upper bound constraints, lower bound constraints, linear constraints, ratio constraints, xby constraints, and xy constraints. Thus, each training sample is repeated once per constraint type, and we combine all the generated declarations afterwards.
Every linear programming problem has a definite objective statement, but some problems lack one or more constraint declarations. In our multi-task learning setting, the missing constraint statements of a problem act as negative samples in the training process, which helps the model predict the correct constraints without much effort. Tables 5 and 7 show the original mapping and the proposed mapping for sub-task 2, which adds multiple samples to the training dataset.
Table 5: The original mapping for text generation in sub-task 2 NL4Opt [2022].
Original mapping:
<s>
<DECLARATION>
<OBJ_DIR> maximize </OBJ_DIR>
<OBJ_NAME> number of coconuts </OBJ_NAME> [is]
<VAR> rickshaws </VAR> [TIMES] <PARAM> 50 </PARAM>
<VAR> ox carts </VAR> [TIMES] <PARAM> 30 </PARAM>
</DECLARATION>
<DECLARATION>
<CONST_DIR> at most </CONST_DIR>
<OPERATOR> LESS_OR_EQUAL </OPERATOR>
<LIMIT> 200 </LIMIT>
<CONST_TYPE> [LINEAR_CONSTRAINT] </CONST_TYPE> [is]
<VAR> rickshaws </VAR> [TIMES] <PARAM> 10 </PARAM>
<VAR> ox carts </VAR> [TIMES] <PARAM> 8 </PARAM>
</DECLARATION>
<DECLARATION>
<CONST_DIR> must not exceed </CONST_DIR>
<OPERATOR> LESS_OR_EQUAL </OPERATOR>
<CONST_TYPE> [XY_CONSTRAINT] </CONST_TYPE>
<VAR> ox carts </VAR> [is] <VAR> rickshaws </VAR>
</DECLARATION>
</s>
Table 6: The exploratory data analysis of the number of tokens for different training strategies.
Type of training Prefix + input length (max) Output mapping length (max)
Original: prefix=generate Train: 203 Train: 716
linear program mapping Dev: 214 Dev: 601
augmented input Train: 545 Train: 716
Dev: 529 Dev: 601
multi-task Train: 248 Train: 564
Dev: 258 Dev: 438
augmented input + multi-task Train: 591 Train: 564
Dev: 570 Dev: 438
Table 7: The proposed multi-task mapping for text generation.
Multi-task mapping:
prompt <OBJ_DECLARATION> </OBJ_DECLARATION>:
<OBJ_DECLARATION>
<OBJ_DIR> maximize </OBJ_DIR>
<OBJ_NAME> number of coconuts </OBJ_NAME> [is]
<VAR> rickshaws </VAR> [TIMES] <PARAM> 50 </PARAM>
<VAR> ox carts </VAR> [TIMES] <PARAM> 30 </PARAM>
</OBJ_DECLARATION>
negative samples
prompt <CONST_DECLARATION> [SUM_CONSTRAINT] </CONST_DECLARATION>:
prompt <CONST_DECLARATION> [XY_CONSTRAINT] </CONST_DECLARATION>:
prompt <CONST_DECLARATION> [RATIO_CONSTRAINT] </CONST_DECLARATION>:
prompt <CONST_DECLARATION> [UPPER_BOUND] </CONST_DECLARATION>:
prompt <CONST_DECLARATION> [LOWER_BOUND] </CONST_DECLARATION>:
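A sketch of this decomposition, assuming each problem provides its objective declaration and a dictionary of constraint declarations keyed by type; the prompt strings mirror Table 7, the [XBY_CONSTRAINT] tag spelling is assumed, and absent constraint types yield empty targets that serve as negative samples:

```python
CONSTRAINT_TYPES = [
    "[SUM_CONSTRAINT]", "[UPPER_BOUND]", "[LOWER_BOUND]",
    "[LINEAR_CONSTRAINT]", "[RATIO_CONSTRAINT]",
    "[XBY_CONSTRAINT]", "[XY_CONSTRAINT]",
]

def make_multitask_samples(problem_text, obj_declaration, const_declarations):
    """Decompose one word problem into eight prompt/target pairs: one for the
    objective and one per constraint type. const_declarations maps a constraint
    type to its declaration text; missing types become negative samples."""
    samples = [("prompt <OBJ_DECLARATION> </OBJ_DECLARATION>: " + problem_text,
                obj_declaration)]
    for ctype in CONSTRAINT_TYPES:
        prompt = f"prompt <CONST_DECLARATION> {ctype} </CONST_DECLARATION>: {problem_text}"
        samples.append((prompt, const_declarations.get(ctype, "")))  # "" = negative sample
    return samples
```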
We fine-tune the model through prompting: we prefix the input text so that the model generates the appropriate output for the prefix task. We performed exploratory data analysis to compute the maximum number of tokens, after prefixing, at both the input and the output. The numbers are tabulated in Table 6; the output lengths exceed the limit of 512 tokens even in the original training set. After splitting the main task into multiple tasks and increasing the data, we observed that the results started moving up and away from the baseline. The results were still not as expected after this step, and another round of analysis was carried out to identify why the model was not picking up the correct declarations from the statements.
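The exploratory analysis behind Table 6 reduces to tokenizing each pair and taking maxima; a sketch, assuming (prefixed input, output mapping) string pairs and a hugging face tokenizer:

```python
def max_token_lengths(pairs, tokenizer):
    """Maximum tokenized input and output lengths over (prefixed_input, output_mapping) pairs."""
    in_max = max(len(tokenizer(src).input_ids) for src, _ in pairs)
    out_max = max(len(tokenizer(tgt).input_ids) for _, tgt in pairs)
    return in_max, out_max

# e.g. max_token_lengths(train_pairs, tokenizer) would reproduce one row of Table 6
```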
We thought of teaching the model by explicitly marking the actual locations of the entities in the input text. Thus, we prepared a wrapper for all the entities, with the notation shown in Figure 4. The entities present in the input text get wrapped before training the model.
Figure 4: An example word problem with the wrapped entities used in the training.
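Since the exact wrapper notation of Figure 4 is not reproduced here, the sketch below assumes XML-style type tags around each labeled entity span:

```python
def wrap_entities(tokens, labels):
    """Wrap each labeled entity span in the input text with its entity type,
    e.g. 'chicken' tagged B-VAR becomes '<VAR> chicken </VAR>'."""
    out, i = [], 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            etype = labels[i][2:]
            span = [tokens[i]]
            i += 1
            while i < len(tokens) and labels[i] == f"I-{etype}":
                span.append(tokens[i])
                i += 1
            out += [f"<{etype}>"] + span + [f"</{etype}>"]
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

# e.g. the original example from Table 1
print(wrap_entities(["A", "serving", "of", "chicken", "costs", "$", "10"],
                    ["O", "O", "O", "B-VAR", "B-OBJ_NAME", "O", "B-PARAM"]))
# A serving of <VAR> chicken </VAR> <OBJ_NAME> costs </OBJ_NAME> $ <PARAM> 10 </PARAM>
```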
Table 8: The declaration-level accuracy on the development set for different types of training with the
hyperparameters.
Type of training Hyperparameters Declaration-level accuracy
Original: prefix=generate max_seq_length: 600 62.8%
linear program mapping max_output_length: 750
training_epochs: 20
augmented input max_seq_length: 600 75.9%
max_output_length: 750
training_epochs: 24
multi-task max_seq_length: 600 65.6%
max_output_length: 750
training_epochs: 17
augmented input + multi-task max_seq_length: 600 83.5%
max_output_length: 750
training_epochs: 50
The number of tokens doubled with the addition of the wrapper tokens. Table 6 shows the increase in the number of tokens processed by the model: the maximum number of tokens in the training set before and after the addition of the wrapper is 240 and 590, respectively. With the addition of the new tokens, the sequence length crossed the limit of 512 suggested in the paper Raffel et al. [2019]. We therefore fixed the input size of the transformer to 600 and trained the model. Since this size is not a power of 2, performance deteriorates to some extent. Table 8 shows the ablation study for the different training strategies.
The results of sub-task 2 on the test set are tabulated in Table 9. We stand in fifth position after the competition organizers independently evaluated our program; the results were updated after the organizers reproduced the model. There is a slight drop in accuracy due to the retrieval of multiple instances for the same prompt question.
Table 9: The performance of different submissions on the test set of sub-task 2 NL4Opt [2022].
Rank Team Name Affiliation(s) F1 score
1 UIUC-NLP UIUC 0.899
2 Sjang POSTECH 0.878
3 Long BDAA-BASE 0.867
4 PingAn-zhiniao PingAn Technology 0.866
5 Infrrd AI Lab Infrrd 0.780
6 KKKKKi Netease 0.634
Baseline (BART) NL4Opt 0.608
3.2 Discussion
The additional wrapper tokens increase the length of the input sequence and are redundant for many repeated items. Each wrapper token may have to be replaced with a single unique token so that the sequence length does not cross the prescribed limit of 512. We performed data augmentation in sub-task 1, and we may need to use this approach to increase the number of samples in the training set for sub-task 2 as well. However, the objective and constraint declarations contain many numerical values, and it is difficult to decide whether to replace a numerical value or the word reflecting an entity in a problem. The retrieval behaviour of the model under prompting has one serious concern to be addressed: when an entity appears at multiple locations, the model may pick only one of them and predict it as the output. If the same entity has different values at several locations, the model picks one and does not retrieve all of them. This has to be addressed in the future so that the model learns, during training, to pick up all the values belonging to the same entity.
4 Conclusion
We proposed an ensemble of models to predict the labels and generate the texts. These approaches consistently produce better results than a single model. We would like to explore including decoders in the ensemble, since the text generation model has a decoder component. We would also like to use large models in place of the base models to know whether they are better at predicting the labels. The OBJ_NAME entity is difficult to capture with the present set of models and has to be addressed in a different way. We need to overcome the drawbacks of prompting, such as multiple instances of the same entity in a question, to improve the score.
References
Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf.
Flair: An easy-to-use framework for state-of-the-art nlp. In NAACL 2019, 2019 Annual Conference
of the North American Chapter of the Association for Computational Linguistics (Demonstrations),
pages 54–59, 2019.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Fran-
cisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised
cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
CPLEX. IBM ILOG CPLEX Optimization Studio, 2022. URL www.ibm.com/products/
ilog-cplex-optimization-studio.
Xiang Dai and Heike Adel. An analysis of simple data augmentation for named entity recognition.
arXiv preprint arXiv:2010.11683, 2020.
Hugging Face. Hugging face. https://huggingface.co/docs/transformers/index, 2022.
Accessed: 2022-12-23.
Jianglong He, Akshay Uppal, Mamatha N, Shiv Vignesh, Deepak Kumar, and Aditya Kumar Sarda.
Infrrd.ai at semeval-2022 task 11: A system for named entity recognition using data augmentation,
transformer-based sequence labeling model, and ensemblecrf. In Proceedings of the 16th Interna-
tional Workshop on Semantic Evaluation (SemEval-2022), pages 1501–1510, Seattle, United States,
July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.semeval-1.206.
URL https://aclanthology.org/2022.semeval-1.206.
N Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4:373–395,
1984. doi: https://doi.org/10.1007/BF02579150.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.
Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, and Oleg Rokhlenko. MultiCoNER: a
Large-scale Multilingual dataset for Complex Named Entity Recognition. 2022a.
Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, and Oleg Rokhlenko. SemEval-2022 Task
11: Multilingual Complex Named Entity Recognition (MultiCoNER). In Proceedings of the 16th
International Workshop on Semantic Evaluation (SemEval-2022). Association for Computational
Linguistics, 2022b.
John C Nash. The (dantzig) simplex method for linear programming. Computing in Science &
Engineering, 2(1):29–31, 2000.
NL4Opt, 2022. URL https://nl4opt.github.io/. Accessed: 2022-12-23.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer, 2019. URL https://arxiv.org/abs/1910.10683.
Rindranirina Ramamonjison, Haley Li, Timothy T. Yu, Shiqi He, Vishnu Rengan, Amin Banitalebi-
Dehkordi, Zirui Zhou, and Yong Zhang. Augmenting operations research with auto-formulation of
optimization models from problem descriptions, 2022. URL https://arxiv.org/abs/2209.
15565.