Implementation of the Seq2Seq models proposed in the paper "Enhancing Sequence-to-Sequence Modelling for RDF triples to Natural Text", built with Fairseq, a sequence modeling toolkit. Instructions to reproduce the experiments are also provided.
The following repositories must be downloaded. Please clone them into the main directory; .gitignore keeps them from being pushed.
git clone https://github.com/rsennrich/subword-nmt.git
git clone https://github.com/moses-smt/mosesdecoder.git
The next steps require Python >= 3.6 and PyTorch >= 1.2.0. All requirements can be installed by running:
pip install -r requirements.txt
Once all requirements are met, install Fairseq.
pip install fairseq
The ./data directory holds different types of data:
- Original data taken from the WebNLG corpus:
  - data/datasets/original, referred to in the paper as the release_v2.1 version.
  - data/benchmark/original, referred to in the paper as the webnlg_challenge_2017 version.
- Preprocessed data:
  - data/datasets/preprocessed
  - data/benchmark/preprocessed
- Fairseq data format:
  - data/datasets/format
  - data/benchmark/format
- Monolingual data and its predicted RDF triples:
  - data/monolingual/data
In this directory, we also included data related to the train-valid loss (data/loss) and to predictions (data/predictions) to allow analysis. The data/vocab folder is meant for pretrained embeddings; everything placed there will be ignored.
Monolingual data can be obtained by means of WikiExtractor. Alternatively, the targeted monolingual data described in the paper, which gives better results than the previous monolingual data, can be generated from data/monolingual/:
python3 scrapper.py [DATASET] > [OUTPUT_TEXT-1]
If data/datasets/original is going to be used as real data in Back Translation (BT), the [DATASET] argument must be release_v2.1; if data/benchmark/original is going to be used, provide webnlg as the argument. This script requires the Wikipedia2Vec embeddings, in pickle format, to be placed in data/vocab.
To clean the Wikipedia text and bound the instance length, two scripts must be executed.
python3 preprocessing_wiki.py [OUTPUT_TEXT-1] [OUTPUT_TEXT-2]
python3 filter.py [OUTPUT_TEXT-2] [OUTPUT_CLEAN_TEXT-3]
Synthetic data can be generated with a Transformer model or with parsing techniques. The latter showed better results and is detailed below. How to run the Transformer architecture on other data is presented later on; to generate synthetic data with the Transformer, only the data directory needs to change.
The parsing method requires Stanford CoreNLP and Stanford Parser to be installed. Both can be installed in the main directory, where they will be ignored. In that case the code needs no modification; otherwise, adapt the global variables of data/monolingual/RDF_Triple.py to the corresponding path of the Stanford Parser.
The parsing algorithm is taken from TPetrou; we introduced some updates and modifications to improve it and make it compatible with our task.
To parse the monolingual text, first start a Java process in the background to initialise the parsing server, then run the parser. Both steps are run from data/monolingual/, except that the Java process must be launched from inside the Stanford CoreNLP folder.
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos,lemma,ner,parse,depparse -status_port 9000 -port 9000 -timeout 15000
python3 RDF_Triple.py [OUTPUT_CLEAN_TEXT-3] > [OUTPUT_RDF-4]
Finally, we can clean this output by removing empty RDF triples and aligning the remaining ones with the monolingual data.
python3 corpus_alignment.py [OUTPUT_RDF-4] [OUTPUT_CLEAN_TEXT-3]
This will generate two files, rdf_aligned.txt and text_aligned.txt, corresponding to the output of the Back Translation model.
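The cleaning step above amounts to dropping every sentence for which no triple was extracted, while keeping the two files line-aligned. The following is a minimal sketch of that idea, not the code of corpus_alignment.py: it assumes one instance per line in both files, and the function name `align_corpus` and the sample triples are hypothetical.

```python
def align_corpus(rdf_lines, text_lines):
    """Keep only the (RDF, text) pairs whose RDF line is non-empty.

    Assumption: line i of the RDF file was extracted from line i of the text file.
    """
    if len(rdf_lines) != len(text_lines):
        raise ValueError("both files must have the same number of lines")
    kept = [
        (rdf.strip(), text.strip())
        for rdf, text in zip(rdf_lines, text_lines)
        if rdf.strip()  # an empty line means no triple was extracted
    ]
    rdf_aligned = [rdf for rdf, _ in kept]
    text_aligned = [text for _, text in kept]
    return rdf_aligned, text_aligned


# Illustrative input only; the real triple format is produced by RDF_Triple.py.
rdf = ["Alan_Bean | birthPlace | Wheeler\n", "\n", "Turin | country | Italy\n"]
text = ["Alan Bean was born in Wheeler.\n", "An unparsed sentence.\n", "Turin is in Italy.\n"]
rdf_aligned, text_aligned = align_corpus(rdf, text)
print(len(rdf_aligned), len(text_aligned))
```

Filtering both sides in a single pass keeps the two output lists the same length, which is what a parallel corpus requires.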
To reproduce Tagged Back Translation, follow the same steps. However, during preprocessing and before converting the data to the Fairseq format (explained below), run the following from ./preprocessing/:
python3 tagged_bt -f | --file ) [INPUT_PATH]
                  -l | --line ) [LINE_TAGGING]
                  -o | --overwrite ) [OVERWRITE]
The option -f [INPUT_PATH] gives the path of the generated corpus, and -l [LINE_TAGGING] lets the user specify the line from which tags should be added. Finally, -o [OVERWRITE] is a boolean indicating whether to overwrite the generated file.
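Tagged Back Translation marks the synthetic source sentences with a reserved token so the model can tell them apart from real data. The sketch below illustrates the role of the line option described above under stated assumptions: the tag string `<BT>`, the 1-based line index, and the function name `tag_synthetic` are all hypothetical and may differ from what the tagged_bt script does.

```python
def tag_synthetic(source_lines, first_line, tag="<BT>"):
    """Prefix `tag` to every source line from the 1-based index `first_line` on.

    Assumption: real instances come first and synthetic ones follow.
    """
    return [
        f"{tag} {line}" if index >= first_line else line
        for index, line in enumerate(source_lines, start=1)
    ]


sources = ["real triple 1", "real triple 2", "synthetic triple 1"]
tagged = tag_synthetic(sources, first_line=3)
print(tagged)
```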
We show how to preprocess the original data from .xml format to Fairseq format. Some preprocessing steps can be skipped, as in some of our experiments, but we show the entire preprocessing pipeline described in our work.
The first step turns the .xml files into source and target plain text, split according to the default train, dev and test separation. It also outputs a lexicalised and a delexicalised version. From the ./preprocessing directory, run:
sh xml_to_text.sh
In some experiments, where the entire pipeline is not followed, camelCase must be removed and all words lowercased. This can be done as follows:
sh lower_and_camelCase.sh
The lower_and_camelCase.sh script can be modified to read from and write to any path.
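As an illustration of what removing camelCase and lowercasing means for a WebNLG-style triple, here is a minimal sketch. It is not the code behind lower_and_camelCase.sh; the function name and the rule used (insert a space wherever a lowercase letter or digit is followed by an uppercase letter) are assumptions.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def lower_and_split_camel_case(text):
    """Split camelCase words at their case boundaries, then lowercase everything."""
    return _CAMEL_BOUNDARY.sub(" ", text).lower()


print(lower_and_split_camel_case("Alan_Bean | birthPlace | Wheeler,_Texas"))
```

Under this rule, a predicate such as `birthPlace` becomes `birth place`, while entity names joined by underscores are only lowercased.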
Then, we apply Byte Pair Encoding and Moses tokenization.
export MOSESDECODER=../mosesdecoder/ #Provide the directory of the cloned repository
export BPE=../subword-nmt/ #Provide the directory of the cloned repository
sh token_and_bpe.sh
The token_and_bpe.sh script can be modified to read from and write to any path.
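For readers unfamiliar with Byte Pair Encoding, the toy sketch below shows how merge operations are learned: the most frequent adjacent symbol pair is merged repeatedly. It follows the well-known illustrative algorithm and is not the subword-nmt implementation that the script calls; the vocabulary and the function names are made up for the example.

```python
import collections
import re


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the list of learned merges, most frequent pair first."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Words are pre-split into characters, with </w> marking the end of a word.
toy_vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = learn_bpe(toy_vocab, num_merges=3)
print(merges)
```

On this toy vocabulary the first three merges build the suffix `est</w>` step by step, which is how frequent subwords end up as single tokens.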
Lastly, we preprocess with fairseq to make data compatible with the software.
sh fairseq_format.sh
It will dump the data in data/datasets/format/ or in data/benchmark/format/. The fairseq_format.sh script can be modified to read from any path.
In order to run the models, we provide a wrapping script ./models/run_model.sh that accepts several parameters to adjust the training procedure.
sh run_model.sh -a | --architecture) [ARCHITECTURE_NAME]
-c | --config-file) [CONFIGURATION_FILE]
-p | --data-path) [RELATIVE_DATA_PATH]
-s | --emb-source) [EMBEDDINGS_SOURCE]
-d | --emb-dimension) [EMBEDDINGS_DIMENSION]
-fp16 | --fp16) [MIXED PRECISION TRAINING]
All of the options take a value, except for -fp16, which is a flag indicating whether float16 mixed precision training should be used.
Below, we provide several examples to reproduce the best results reported in our work. Third parties are free to reproduce other experiments, since the experimental data is already processed and available in this repository.
Vanilla Convolutional Model
sh run_model.sh -a fconv_self_att_wp -c 2 -p '../data/datasets/format/DELEX_BPE_5_000/'
Byte Pair Encoding
sh run_model.sh -a transformer -c 1 -p '../data/datasets/format/DELEX_BPE_5_000/'
Pretrained Embeddings
sh run_model.sh -a transformer -c 2 -s glove -d 300 -p '../data/datasets/format/LEX_LOW_CAMEL_BPE'
Back Translation
sh run_model.sh -a transformer -c 3 -p '../data/datasets/format/LEX_LOW_CAMEL_SYNTHETIC_2_ENRICHED_BPE'
Once the model is trained, we can predict using Fairseq. If needed, the output will be delexicalised; this is inferred automatically. Fairseq does not emit predictions in the order of the input instances, so the output has to be processed before delexicalising the predictions. Fairseq predictions directly remove the BPE and Moses tokenization. All of this can be done as follows from the ./postprocessing directory.
sh predict.sh [MODEL_CHECKPOINTS] [DATA] [OUTPUT_FILE]
sh relexicalise.sh [FILE_NAME] [FILE_PATH]
This creates a folder ../data/predictions/[OUTPUT_FILE] containing the predicted output, the output aligned with the source, and the postprocessed output. Pass this folder as [FILE_PATH] and [OUTPUT_FILE] as [FILE_NAME].
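To make the re-ordering step concrete, the sketch below restores the input order of the hypotheses from a generation log. It assumes the usual fairseq-generate layout, where each hypothesis line has the form `H-<id><TAB><score><TAB><text>`. The function name is hypothetical and this is not the code of predict.sh.

```python
def ordered_hypotheses(generate_output):
    """Return the hypothesis texts sorted by their sample id."""
    hypotheses = {}
    for line in generate_output.splitlines():
        if not line.startswith("H-"):
            continue  # skip source (S-), target (T-) and other log lines
        tag, _score, text = line.split("\t", 2)
        hypotheses[int(tag[2:])] = text
    return [hypotheses[sample_id] for sample_id in sorted(hypotheses)]


# Hand-written log for illustration; the ids are deliberately out of order.
log = "\n".join([
    "S-2\tc d",
    "H-2\t-0.41\tthird sentence",
    "S-0\ta b",
    "H-0\t-0.12\tfirst sentence",
    "H-1\t-0.30\tsecond sentence",
])
predictions = ordered_hypotheses(log)
print(predictions)
```

Sorting on the integer id rather than on the raw tag avoids the lexicographic trap where `H-10` would sort before `H-2`.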
To compute the performance metrics (BLEU, TER, METEOR and chrF++), we adopted the script provided by the WebNLG Challenge 2020, placed in ./metrics. This requires downloading METEOR into metrics/metrics; it is ignored when pushing.
wget https://www.cs.cmu.edu/~alavie/METEOR/download/meteor-1.5.tar.gz
tar -xvf meteor-1.5.tar.gz
mv meteor-1.5 metrics
rm meteor-1.5.tar.gz
One can run a single evaluation or evaluate all predictions in the data/predictions/ directory. The model's name and performance metrics are stored in models_metrics.json for history tracking, plotting, etc.
sh run_eval.sh [PREDICTIONS] [TARGET] # Single evaluation
sh run_full_evaluation.sh # Multiple evaluation
If you find our work or the code useful, please consider citing our paper using:
@inproceedings{domingo-etal-2020-rdf2text,
title = "Enhancing Sequence-to-Sequence Modelling for {RDF} triples to Natural Text",
author = "Oriol Domingo and David Bergés and Roser Cantenys and Roger Creus and José A.R. Fonollosa",
booktitle = {Proceedings of the 3rd WebNLG Workshop on Natural Language Generation from the Semantic Web (WebNLG+ 2020)},
year = "2020",
address = {Dublin, Ireland (Virtual)},
publisher = {Association for Computational Linguistics},
}