- Python: 3.9+
- Dependencies: See pyproject.toml.
Create a virtual environment and install dependencies.

```shell
$ uv venv -p /path/to/python
$ uv sync
```

Log in to wandb.

```shell
$ wandb login
```
You can train and test a model with the following command:

```shell
# For training and evaluating MARC-ja
uv run python scripts/train.py -cn marc_ja devices=[0,1] max_batches_per_device=16
```
Here are commonly used options:
- `-cn`: Task name. Choose from `marc_ja`, `jcola`, `jsts`, `jnli`, `jsquad`, and `jcqa`.
- `devices`: GPUs to use.
- `max_batches_per_device`: Maximum number of batches to process per device (default: `4`).
- `compile`: JIT-compile the model with `torch.compile` for faster training (default: `false`).
- `model`: Pre-trained model name. See the YAML config files under `configs/model`.
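The `devices` and `max_batches_per_device` options together bound how many examples fit into one optimizer step; a common way to hit a larger target batch size is gradient accumulation. The helper below is a minimal sketch of that arithmetic, assuming the usual formula (it is an illustration, not code from this repository):

```python
import math


def accumulate_grad_batches(effective_batch_size: int, num_devices: int,
                            max_batches_per_device: int) -> int:
    """Hypothetical helper: gradient-accumulation steps needed so that
    num_devices * max_batches_per_device * steps >= effective_batch_size."""
    per_step = num_devices * max_batches_per_device
    return math.ceil(effective_batch_size / per_step)


# A target batch size of 32 on 2 GPUs with at most 16 batches per device
print(accumulate_grad_batches(32, 2, 16))  # -> 1
# Doubling the target batch size requires accumulating over 2 steps
print(accumulate_grad_batches(64, 2, 16))  # -> 2
```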
To evaluate on the out-of-domain split of the JCoLA dataset, specify `datamodule/valid=jcola_ood` (or `datamodule/valid=jcola_ood_annotated`).
For more options, see the YAML config files under `configs`.
To debug quickly, use the corresponding `.debug` config:

```shell
uv run python scripts/train.py -cn marc_ja.debug
```
You can specify `trainer=cpu.debug` to use the CPU.

```shell
uv run python scripts/train.py -cn marc_ja.debug trainer=cpu.debug
```
If you are on a machine with GPUs, you can specify the GPUs to use with the `devices` option.

```shell
uv run python scripts/train.py -cn marc_ja.debug devices=[0]
```
To tune hyperparameters, create a wandb sweep and run a sweep agent:

```shell
$ wandb sweep <(sed 's/MODEL_NAME/deberta_base/' sweeps/jcola.yaml)
wandb: Creating sweep from: /dev/fd/xx
wandb: Created sweep with ID: xxxxxxxx
wandb: View sweep at: https://wandb.ai/<wandb-user>/JGLUE-evaluation-scripts/sweeps/xxxxxxxx
wandb: Run sweep agent with: wandb agent <wandb-user>/JGLUE-evaluation-scripts/xxxxxxxx
$ DEVICES=0,1 MAX_BATCHES_PER_DEVICE=16 COMPILE=true wandb agent <wandb-user>/JGLUE-evaluation-scripts/xxxxxxxx
```
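The `DEVICES`, `MAX_BATCHES_PER_DEVICE`, and `COMPILE` environment variables passed to the sweep agent above could be translated into Hydra-style command-line overrides with logic along these lines. This is a sketch under assumed names; the repository's actual sweep wiring may differ:

```python
def env_overrides(env: dict[str, str]) -> list[str]:
    """Hypothetical helper: map sweep-agent environment variables
    to Hydra-style CLI overrides like devices=[0,1]."""
    overrides = []
    if "DEVICES" in env:
        overrides.append(f"devices=[{env['DEVICES']}]")
    if "MAX_BATCHES_PER_DEVICE" in env:
        overrides.append(f"max_batches_per_device={env['MAX_BATCHES_PER_DEVICE']}")
    if "COMPILE" in env:
        overrides.append(f"compile={env['COMPILE']}")
    return overrides


print(env_overrides({"DEVICES": "0,1", "MAX_BATCHES_PER_DEVICE": "16", "COMPILE": "true"}))
# -> ['devices=[0,1]', 'max_batches_per_device=16', 'compile=true']
```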
We fine-tuned the following models and evaluated them on the dev set of JGLUE. Following the JGLUE paper, we tuned the learning rate and the number of training epochs for each model and task.
Model | MARC-ja/acc | JCoLA/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
---|---|---|---|---|---|---|---|---|
Waseda RoBERTa base | 0.965 | 0.867 | 0.913 | 0.876 | 0.905 | 0.853 | 0.916 | 0.853 |
Waseda RoBERTa large (seq512) | 0.969 | 0.849 | 0.925 | 0.890 | 0.928 | 0.910 | 0.955 | 0.900 |
LUKE Japanese base* | 0.965 | - | 0.916 | 0.877 | 0.912 | - | - | 0.842 |
LUKE Japanese large* | 0.965 | - | 0.932 | 0.902 | 0.927 | - | - | 0.893 |
DeBERTaV2 base | 0.970 | 0.879 | 0.922 | 0.886 | 0.922 | 0.899 | 0.951 | 0.873 |
DeBERTaV2 large | 0.968 | 0.882 | 0.925 | 0.892 | 0.924 | 0.912 | 0.959 | 0.890 |
DeBERTaV3 base | 0.960 | 0.878 | 0.927 | 0.891 | 0.927 | 0.896 | 0.947 | 0.875 |
*The scores of LUKE are from the official repository.
- Learning rate: {2e-05, 3e-05, 5e-05}
Model | MARC-ja/acc | JCoLA/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
---|---|---|---|---|---|---|---|---|
Waseda RoBERTa base | 3e-05 | 3e-05 | 2e-05 | 2e-05 | 3e-05 | 3e-05 | 3e-05 | 5e-05 |
Waseda RoBERTa large (seq512) | 2e-05 | 2e-05 | 3e-05 | 3e-05 | 2e-05 | 2e-05 | 2e-05 | 3e-05 |
DeBERTaV2 base | 2e-05 | 3e-05 | 5e-05 | 5e-05 | 3e-05 | 2e-05 | 2e-05 | 5e-05 |
DeBERTaV2 large | 5e-05 | 2e-05 | 5e-05 | 5e-05 | 2e-05 | 2e-05 | 2e-05 | 3e-05 |
DeBERTaV3 base | 5e-05 | 2e-05 | 3e-05 | 3e-05 | 2e-05 | 5e-05 | 5e-05 | 2e-05 |
- Training epochs: {3, 4}
Model | MARC-ja/acc | JCoLA/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
---|---|---|---|---|---|---|---|---|
Waseda RoBERTa base | 4 | 3 | 4 | 4 | 3 | 4 | 4 | 3 |
Waseda RoBERTa large (seq512) | 4 | 4 | 4 | 4 | 3 | 3 | 3 | 3 |
DeBERTaV2 base | 3 | 4 | 3 | 3 | 3 | 4 | 4 | 4 |
DeBERTaV2 large | 3 | 3 | 4 | 4 | 3 | 4 | 4 | 3 |
DeBERTaV3 base | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
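Taken together, the search space above is the cross product of three learning rates and two epoch counts, i.e. six runs per model-task pair. A quick enumeration (illustrative only):

```python
from itertools import product

learning_rates = [2e-05, 3e-05, 5e-05]
epochs = [3, 4]

# Every (learning rate, epochs) combination tried for one model-task pair
grid = list(product(learning_rates, epochs))
print(len(grid))  # -> 6
for lr, n_epochs in grid:
    print(f"lr={lr}, epochs={n_epochs}")
```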
- Waseda RoBERTa base: `nlp-waseda/roberta-base-japanese`
- Waseda RoBERTa large (seq512): `nlp-waseda/roberta-large-japanese-seq512`
- LUKE Japanese base: `studio-ousia/luke-base-japanese`
- LUKE Japanese large: `studio-ousia/luke-large-japanese`
- DeBERTaV2 base: `ku-nlp/deberta-v2-base-japanese`
- DeBERTaV2 large: `ku-nlp/deberta-v2-large-japanese`
- DeBERTaV3 base: `ku-nlp/deberta-v3-base-japanese`
Nobuhiro Ueda (ueda at nlp.ist.i.kyoto-u.ac.jp)
- yahoojapan/JGLUE: JGLUE: Japanese General Language Understanding Evaluation
- JGLUE: Japanese General Language Understanding Evaluation (Kurihara et al., LREC 2022)
- Kentaro Kurihara, Daisuke Kawahara, and Tomohide Shibata. JGLUE: A Japanese Language Understanding Benchmark. Journal of Natural Language Processing, 2023, Vol. 30, No. 1, pp. 63-87, published 2023/03/15, Online ISSN 2185-8314, Print ISSN 1340-7619, https://doi.org/10.5715/jnlp.30.63, https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_article/-char/ja