Fast, consistent tokenization of natural language text
Journal of Open Source Software, 2018
Computational text analysis usually proceeds according to a series of well-defined steps. After importing texts, the usual next step is to turn the human-readable text into machine-readable tokens. Tokens are defined as segments of a text identified as meaningful units for the purpose of analyzing the text. They may consist of individual words or of larger or smaller segments, such as word sequences, word subsequences, paragraphs, sentences, or lines (Manning, Raghavan, and Schütze 2008, 22). Tokenization is the process of splitting the text into these smaller pieces, and it often involves preprocessing the text to remove punctuation and transform all tokens into lowercase (Welbers, Van Atteveldt, and Benoit 2017, 250–51). Decisions made during tokenization have a significant effect on subsequent analysis (Denny and Spirling 2018; D. Guthrie et al. 2006). Especially for large corpora, tokenization can be computationally expensive, and tokenization is highly language dependent. Efficiency and correctness are therefore paramount concerns for tokenization.
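As a minimal sketch of the steps just described (illustrative base R only, with a hypothetical helper name; it is not the package's implementation), word tokenization with lowercasing and punctuation removal might look like this:

# Sketch of word tokenization: lowercase, strip punctuation, split on whitespace.
tokenize_words_sketch <- function(x) {
  x <- tolower(x)                  # transform all tokens into lowercase
  x <- gsub("[[:punct:]]", "", x)  # remove punctuation
  strsplit(x, "\\s+")              # split into word tokens
}

tokenize_words_sketch("Tokens are meaningful units of a text.")
# [[1]]
# [1] "tokens" "are" "meaningful" "units" "of" "a" "text"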
The tokenizers package for R provides fast, consistent tokenization for natural language text (Mullen 2018; R Core Team 2017). (The package is available on GitHub and archived on Zenodo.) Each of the tokenizers expects a consistent input and returns a consistent output, so that the tokenizers can be used interchangeably with one another or relied on in other packages. To ensure the correctness of output, the package depends on the stringi package, which implements Unicode support for R (Gagolewski 2018). To ensure the speed of tokenization, key components such as the n-gram and skip n-gram tokenizers are written using the Rcpp package (Eddelbuettel 2013; Eddelbuettel and Balamuta 2017). The tokenizers package is part of the rOpenSci project.
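A brief usage sketch follows, assuming the package's exported tokenizers such as tokenize_words(), tokenize_ngrams(), and tokenize_skip_ngrams() as named in the package documentation; consult the reference manual for the exact arguments.

library(tokenizers)

docs <- c(doc1 = "Tokens are meaningful units of a text.",
          doc2 = "Tokenization splits text into smaller pieces.")

# Each tokenizer takes a character vector and returns a list of character
# vectors, one element per input document, so tokenizers can be swapped
# interchangeably in a pipeline.
tokenize_words(docs)
tokenize_ngrams(docs, n = 2)              # word bigrams
tokenize_skip_ngrams(docs, n = 2, k = 1)  # skip bigrams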