Table 1: Summary of the evaluated models, their base models and the data sets used for pre-training.
2021), a model that relied on Operator Trees during pre-training. However, neither the model nor the code are publicly available.

BERT, ALBERT and RoBERTa were trained on a general natural language corpus and serve as baselines. Six models were further pre-trained from a base model like BERT or RoBERTa, while three were developed from scratch. The data sources the models were trained on are rather diverse: five models use ARQMath, others use math textbooks, school curricula, paper abstracts, or discussion posts other than ARQMath. SciBERT is the only model that was not specifically trained on mathematical content, but on scientific publications. All models can be found on Huggingface via their model identifier.

4.1 Data

Our probe is trained on formulas which were parsed to OPTs by a custom LaTeX parser written in Python, adapted from the parser rules of the mathematical formula search engine Approach0 (Zhong and Zanibbi, 2019; Zhong et al., 2020). We could not use existing parsers because it is necessary to associate each LaTeX token with its node in the OPT, and existing parsers only output the entire parse tree without annotating which token a node corresponds to in the formula. We selected 50k training examples at random from the corpus of all formulas in ARQMath 2020 (Mansouri et al., 2020), which contains question and answer posts from the Q&A community Mathematics StackExchange (https://math.stackexchange.com). From the remaining set we chose 10k formulas for development and an additional 10k as the test set. The average number of nodes in all three sets is 16.5, while the average tree depth is 4.8. The most common node types are variables and numbers, followed by LaTeX braces and relation symbols. Among the relation symbols, the equal sign "=" occurs most often.
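Since the probe is supervised with distances in the OPT, each LaTeX token has to carry a pointer to its tree node. The sketch below illustrates such a token-annotated node and the pairwise path lengths that serve as gold distances; the names (OptNode, tree_distances) and the simplified node granularity are our own illustration, not the actual parser interface.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class OptNode:
    """One Operator Tree node, annotated with the index of the LaTeX token
    it was produced from (hypothetical, simplified structure)."""
    label: str                  # e.g. "=", "\\cup", "U_1"
    token_index: int            # position of the token in the LaTeX token sequence
    children: List["OptNode"] = field(default_factory=list)

def tree_distances(root: OptNode) -> Dict[Tuple[int, int], int]:
    """Number of edges on the path between every pair of nodes,
    keyed by their token indices (the gold distances d_T)."""
    # collect an undirected adjacency list over token indices
    adjacency: Dict[int, List[int]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        adjacency.setdefault(node.token_index, [])
        for child in node.children:
            adjacency[node.token_index].append(child.token_index)
            adjacency.setdefault(child.token_index, []).append(node.token_index)
            stack.append(child)
    # breadth-first search from every node yields all pairwise path lengths
    distances: Dict[Tuple[int, int], int] = {}
    for start in adjacency:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen[neighbour] = seen[current] + 1
                    queue.append(neighbour)
        for end, dist in seen.items():
            distances[(start, end)] = dist
    return distances

# toy example: the OPT of "U_1 \cup U_2" with "\cup" as parent of both operands
u1 = OptNode("U_1", token_index=0)
u2 = OptNode("U_2", token_index=2)
cup = OptNode("\\cup", token_index=1, children=[u1, u2])
print(tree_distances(cup)[(0, 2)])  # 2: U_1 -> \cup -> U_2
```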
4.2 Metrics

We follow Hewitt and Manning and evaluate the performance using UUAS (Undirected Unlabeled Attachment Score), which denotes the percentage of correctly identified edges in the predicted tree, and distance Spearman (DSpr.), which is determined by first calculating the Spearman correlation between the predicted distances d_B and the gold-standard distances d_T. These correlations are then averaged among all formulas of a fixed length. Finally, the average across formulas of lengths 5–50 is reported as DSpr. We decided to include both metrics since it was shown that their scores can result in opposite trends (Hall Maudslay et al., 2020).
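The following sketch shows how the two metrics can be computed from a matrix of predicted distances and the gold OPT. It follows the description above; the function names and the exact aggregation details are our simplification, not the original evaluation code.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.stats import spearmanr

def uuas(predicted_distances: np.ndarray, gold_edges: set) -> float:
    """Fraction of gold OPT edges recovered by the minimum spanning tree
    over the predicted pairwise distances (undirected, unlabeled)."""
    mst = minimum_spanning_tree(predicted_distances)
    rows, cols = mst.nonzero()
    predicted_edges = {frozenset((int(i), int(j))) for i, j in zip(rows, cols)}
    return len(predicted_edges & gold_edges) / len(gold_edges)

def dspr(formulas) -> float:
    """DSpr.: Spearman correlation between predicted and gold distances,
    averaged per formula length, then averaged over lengths 5-50.
    `formulas` is a list of (predicted, gold) n x n distance matrices."""
    by_length = {}
    for predicted, gold in formulas:
        n = predicted.shape[0]
        upper = np.triu_indices(n, k=1)          # count each token pair once
        rho, _ = spearmanr(predicted[upper], gold[upper])
        by_length.setdefault(n, []).append(rho)
    per_length = [np.mean(values) for length, values in by_length.items()
                  if 5 <= length <= 50]
    return float(np.mean(per_length))
```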
4.3 Setup

To train and evaluate the probing classifier, we used the original code provided by Hewitt and Manning (https://github.com/john-hewitt/structural-probes) and adapted it to the transformers library (https://pypi.org/project/transformers/). We used the L1 loss and a maximum probe rank of 768, as reported by the authors. We trained the probes on one A100 GPU with 40 GB of GPU memory. Depending on the base model, training a probe took between 15 min and 1.5 h. Each model was trained with five different random seeds.
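For reference, the distance probe of Hewitt and Manning learns a linear map B such that the squared distance between transformed embeddings approximates the tree distance between the corresponding tokens, trained with the L1 loss mentioned above. The PyTorch sketch below is a condensed re-statement of this idea, not our adapted probing code; the hidden size, learning rate and stand-in tensors are illustrative.

```python
import torch
import torch.nn as nn

class DistanceProbe(nn.Module):
    """Structural probe in the style of Hewitt and Manning: a linear map B whose
    squared distances between transformed embeddings approximate tree distances."""
    def __init__(self, hidden_dim: int = 768, rank: int = 768):
        super().__init__()
        self.proj = nn.Parameter(torch.randn(hidden_dim, rank) * 0.01)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (seq_len, hidden_dim) from one layer of the model
        transformed = embeddings @ self.proj                  # (seq_len, rank)
        diffs = transformed.unsqueeze(1) - transformed.unsqueeze(0)
        return (diffs ** 2).sum(dim=-1)                       # predicted squared distances d_B

def probe_loss(predicted: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """L1 loss between predicted squared distances and gold OPT distances,
    normalized by the number of token pairs in the formula."""
    seq_len = gold.shape[0]
    return torch.abs(predicted - gold).sum() / (seq_len ** 2)

# sketch of a single training step (data loading and evaluation omitted)
probe = DistanceProbe(hidden_dim=768, rank=768)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
embeddings = torch.randn(12, 768)                        # stand-in for 12 contextualized token embeddings
gold_distances = torch.randint(1, 6, (12, 12)).float()   # stand-in for gold OPT distances
loss = probe_loss(probe(embeddings), gold_distances)
loss.backward()
optimizer.step()
```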
5 Results

Tab. 2 summarizes the highest values across all layers. We report our results using UUAS and DSpr., where higher values indicate a larger percentage of correctly reconstructed edges and a higher correlation between the predicted and the gold distances, respectively. Each value is the mean of the five runs. It is visible that almost all adapted models improve over their natural language baselines. The highest performance overall is demonstrated by AnReu/math_pretrained_bert. Only the performance of MathBERT-custom drops in comparison to bert-base-cased. The DSpr. scores of the best models in comparison to their baselines are visualized in Fig. 2.

Model                            DSpr.       UUAS
albert-base-v2                   0.631 (3)   0.477 (3)
AnReu/math_albert                0.680 (2)   0.513 (2)
bert-base-cased                  0.713 (7)   0.532 (6)
allenai/scibert-scivocab-cased   0.727 (7)   0.545 (7)
AnReu/math_pretrained_bert       0.815 (7)   0.700 (6)
tbs17/MathBERT                   0.718 (6)   0.550 (5)
tbs17/MathBERT-custom            0.686 (5)   0.530 (5)
roberta-base                     0.703 (5)   0.526 (5)
roberta-large                    0.706 (9)   0.536 (13)
AnReu/math_pretrained_roberta    0.746 (5)   0.576 (5)
shauryr/arqmath-roberta-base     0.715 (5)   0.541 (4)
uf-aice-lab/math-roberta         0.711 (9)   0.547 (11)
witiko/mathberta                 0.752 (5)   0.574 (5)

Table 2: Results of the reconstruction of OPTs using UUAS and DSpr., displaying only the best result across all layers; the best layer is indicated in parentheses.
In general, the models pre-trained on ARQMath demonstrated a better performance across both metrics compared to models pre-trained on other data sets. A possible reason could be that this data set contains a large variety of formulas written in LaTeX, while this is unclear for the other data sets, since they are not publicly available. We also validated these results using a second OPT data set based on the MATH data set (Hendrycks et al., 2021), which contains formulas written in LaTeX extracted from competition math problems. Since there was no drop in performance among the models pre-trained on ARQMath, we can conclude that the models did not benefit from the overlap between the pre-training data and the probing formulas.

BERT- and RoBERTa-based models show that the best extractability of Operator Trees lies in the middle layers, between layers 4 and 7 for base models and between layers 9 and 13 for large models. This pattern is consistent with the results reported by Hewitt and Manning for dependency structures. Notably, the same pattern does not emerge for ALBERT and AnReu/math_albert. Here, the highest scores are in layers 2 and 3. Overall, the scores for both ALBERT-based models are significantly lower, even after training on ARQMath. Interestingly, this model was among the best for the ARQMath Lab 3 on Mathematical Answer Retrieval and also outperformed AnReu/math_pretrained_roberta, which is the second best model for UUAS in this study. A similar mismatch between the performance on downstream natural language tasks and syntactic parsing was also found by Glavaš and Vulić (2021). Therefore, this finding casts doubt on whether the models rely on their OPT knowledge when solving the downstream task of Mathematical Answer Retrieval. However, the limitations of probing classifiers such as the one used in this work do not allow conclusions about the models' usage of this knowledge. Hence, further research in this direction is required to investigate whether and how these models use structural knowledge during downstream tasks. In addition, Appendix A shows examples of reconstructed Operator Trees, while Appendix B contains the mean scores and standard deviations for each model in each layer.

6 Conclusion

This work aims to answer the question: Are Operator Trees extractable from the models' contextualized embeddings? We trained a structural probe that learns to approximate the distances between nodes in the trees. The results show that models
Figure 2: DSpr. scores of the best models (AnReu/math_pretrained_bert, AnReu/math_albert, witiko/mathberta) in comparison to their baselines (bert-base-cased, albert-base-v2, roberta-base).

for Information Services and HPC (ZIH) at TU Dresden.

References

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
A Reconstructed Operator Trees

Figure 3: Operator Trees calculated from the predicted squared distances between the tokens, shown for the formula U_1 ∪ U_2 = \mathbb{R} and for each evaluated model (albert-base-v2, AnReu/math_albert, AnReu/math_pretrained_bert, AnReu/math_pretrained_roberta, allenai/scibert-scivocab-cased, bert-base-cased, roberta-base, roberta-large, shauryr/arqmath-roberta-base, tbs17/MathBERT, tbs17/MathBERT-custom, uf-aice-lab/math-roberta, witiko/mathberta). The black edges above each formula are the gold edges from the OPT parser, while the red edges are the ones predicted by each model, taken from one seed of the best layer by DSpr. In the large majority of cases the models correctly identified the edges of the displayed formula. Most differences can be seen in the second part of the left-hand side of the equation, where the models mostly struggle with the parent-child relationships of the equal sign.
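The trees in Figure 3 are obtained by interpreting the probe's predicted squared distances between tokens as edge weights and extracting a minimum spanning tree, analogous to the UUAS computation in Section 4.2. The snippet below sketches this reconstruction for a single formula; the tokenization, the gold edges and the random stand-in for the probe output are illustrative placeholders, and the plotting of the black and red edge sets is omitted.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def predicted_tree_edges(squared_distances: np.ndarray) -> set:
    """Undirected edges of the minimum spanning tree over the predicted
    squared distances between all token pairs of one formula."""
    mst = minimum_spanning_tree(squared_distances)
    rows, cols = mst.nonzero()
    return {frozenset((int(i), int(j))) for i, j in zip(rows, cols)}

# illustrative comparison of gold and predicted edges for one formula
tokens = ["U", "_", "1", "∪", "U", "_", "2", "=", "\\mathbb", "R"]      # illustrative tokenization
gold_edges = {frozenset(e) for e in [(3, 2), (3, 6), (7, 3), (7, 9)]}   # illustrative gold OPT edges
pred = np.random.rand(len(tokens), len(tokens))
pred = (pred + pred.T) / 2            # symmetric stand-in for predicted squared distances
np.fill_diagonal(pred, 0.0)
missed = gold_edges - predicted_tree_edges(pred)   # gold (black) edges not recovered (red) by the probe
print(f"gold edges missed: {missed}")
```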
B Results for all layers
Mean scores and standard deviations per layer for the RoBERTa-based models:

        roberta-base         shauryr/arqmath-roberta-base   witiko/mathberta     AnReu/math_pretrained_roberta
Layer   mean      stdev      mean      stdev                mean      stdev      mean      stdev
0       0.4456    0.00127    0.4619    0.00044              0.4268    0.00078    0.4543    0.00068
1       0.4825    0.00042    0.5010    0.00083              0.5086    0.00068    0.5296    0.00084
2       0.4988    0.00053    0.5184    0.00053              0.5388    0.00065    0.5477    0.00095
3       0.5187    0.00015    0.5352    0.00059              0.5618    0.00065    0.5690    0.00063
4       0.5224    0.00042    0.5414    0.00061              0.5732    0.00027    0.5731    0.00043
5       0.5255    0.00041    0.5378    0.00017              0.5742    0.00084    0.5759    0.00026
6       0.5172    0.00053    0.5358    0.00026              0.5687    0.00057    0.5672    0.00024
7       0.5088    0.00035    0.5247    0.00059              0.5692    0.00038    0.5591    0.00029
8       0.5106    0.00033    0.5311    0.00061              0.5704    0.00055    0.5599    0.00026
9       0.5091    0.00057    0.5324    0.00016              0.5659    0.00025    0.5590    0.00041
10      0.4899    0.00032    0.5167    0.00033              0.5471    0.00043    0.5411    0.00027
11      0.4713    0.00051    0.4902    0.00030              0.5294    0.00035    0.5238    0.00037