Fast, consistent tokenization of natural language text
Journal of Open Source Software, 2018
Computational text analysis usually proceeds according to a series of well-defined steps. After importing texts, the usual next step is to turn the human-readable text into machine-readable tokens. Tokens are defined as segments of a text identified as meaningful units for the purpose of analyzing the text. They may consist of individual words or of larger or smaller segments, such as word sequences, word subsequences, paragraphs, sentences, or lines (Manning, Raghavan, and Schütze 2008, 22). Tokenization is the process of splitting the text into these smaller pieces, and it often involves preprocessing the text to remove punctuation and transform all tokens into lowercase (Welbers, Van Atteveldt, and Benoit 2017, 250–51). Decisions made during tokenization have a significant effect on subsequent analysis (Denny and Spirling 2018; D. Guthrie et al. 2006). Especially for large corpora, tokenization can be computationally expensive, and tokenization is highly language dependent. Efficiency and correctness are therefore paramount concerns for tokenization.
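As a minimal sketch of the steps just described (illustrative base R only, with a hypothetical helper name; it is not the package's implementation), word tokenization with lowercasing and punctuation removal might look like this:

# Sketch of word tokenization: lowercase, strip punctuation, split on whitespace.
tokenize_words_sketch <- function(x) {
  x <- tolower(x)                  # transform all tokens into lowercase
  x <- gsub("[[:punct:]]", "", x)  # remove punctuation
  strsplit(x, "\\s+")              # split into word tokens
}

tokenize_words_sketch("Tokens are meaningful units of a text.")
# [[1]]
# [1] "tokens" "are" "meaningful" "units" "of" "a" "text"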
The tokenizers package for R provides fast, consistent tokenization for natural language text (Mullen 2018; R Core Team 2017). (The package is available on GitHub and archived on Zenodo.) Each of the tokenizers expects a consistent input and returns a consistent output, so that the tokenizers can be used interchangeably with one another or relied on in other packages. To ensure the correctness of output, the package depends on the stringi package, which implements Unicode support for R (Gagolewski 2018). To ensure the speed of tokenization, key components such as the n-gram and skip n-gram tokenizers are written using the Rcpp package (Eddelbuettel 2013; Eddelbuettel and Balamuta 2017). The tokenizers package is part of the rOpenSci project.
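A brief usage sketch follows, assuming the package's exported tokenizers such as tokenize_words(), tokenize_ngrams(), and tokenize_skip_ngrams() as named in the package documentation; consult the reference manual for the exact arguments.

library(tokenizers)

docs <- c(doc1 = "Tokens are meaningful units of a text.",
          doc2 = "Tokenization splits text into smaller pieces.")

# Each tokenizer takes a character vector and returns a list of character
# vectors, one element per input document, so tokenizers can be swapped
# interchangeably in a pipeline.
tokenize_words(docs)
tokenize_ngrams(docs, n = 2)              # word bigrams
tokenize_skip_ngrams(docs, n = 2, k = 1)  # skip bigrams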