Benchmarking of various tokenizers

This repository contains tools for benchmarking various tokenizers.

Overview

Benchmarking consists of two steps, preparation and measurement, run in that order.

Preparation

The following commands prepare resources (e.g. model data) and compile the source code.

% git submodule update --init
% ./download_resources.sh
% ./compile_all.sh

Measurement

To measure the speed of each tokenizer, run the following commands. If you stop ./run_all.sh partway through, ./stats.py calculates statistics from the results available so far.

% ./run_all.sh | tee ./results
% ./stats.py < ./results
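
The preparation and measurement steps above can be chained in a small wrapper. The following is a minimal sketch and is not part of the repository: the function name run_benchmark and its two arguments (the checkout directory and the results file) are hypothetical. It assumes a POSIX shell and that the scripts named above are executable in the checkout.

```shell
# Hypothetical helper (not shipped with the repository): runs the preparation
# and measurement commands from this README inside a given checkout.
# The body runs in a subshell, so the caller's working directory is unchanged.
run_benchmark() (
    cd "$1" || exit 1
    results=${2:-./results}   # same default file name as in the README

    # Preparation: stop at the first command that fails.
    git submodule update --init &&
        ./download_resources.sh &&
        ./compile_all.sh || exit 1

    # Measurement: in a POSIX shell the pipeline's status is that of tee,
    # so a failing run_all.sh still leaves its partial output in "$results"
    # for stats.py to summarize.
    ./run_all.sh | tee "$results"
    ./stats.py < "$results"
)
```

After defining the function, a call such as run_benchmark . ./results from the repository root runs the same commands as the manual steps. Stopping early during preparation keeps the measurement from running against a half-built tree.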

Disclaimer

This software is developed by LegalForce, Inc., but is not an officially supported LegalForce product.

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

For software under thirdparty, follow the license terms of each package.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
