All Projects → Prooffreader → word_list_tools

Prooffreader / word_list_tools

Licence: other
Python and pandas tools to perform various analyses on different types of word lists

Programming Languages

python
139335 projects - #7 most used programming language

word_list_tools

Python and pandas tools to perform various analyses on different types of word lists

Note: This repo was thoroughly resturctured on 2014/09/13, I tried to go through and make sure all the paths were valid, but it's possible some slipped by me.

Word corpora used:

  • COHA, the Corpus of Historical American English from Brigham Young University. 1-grams require a licence to use, so they are not included here; .gitignore has a rule to ignore coha_1*.*. Metadata/summary data is included here.
  • Brown corpus, part of python's NLTK
  • Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005 (http://statmt.org/europarl/) [files not included because they are ginormous]

Simple word lists used:

  • The Moby list of crossword puzzle words (113,809 words)

Run script first:

  • initial_data_munge.ipynb will make properly formatted python objects of the corpora/lists above

Tools:

  • letter_frequency: Simple single letter counts, and comparisons to a standard cryptography text
  • letter_proximity: conditional probabilities of each moiety of a bigram; or, the probability that one letter follows or precedes another.
  • letter_distributions: graphs of how often a letter is towards the beginning, middle or end of words
  • letter_distribution_europarl_comparison: letter_distribution or multiple languages in Europarl
  • google_ngrams: a work in progress
  • pattern_searches: various ways to search various word lists by various patterns
  • stringcmp_py: scripts to compare strings (e.g. Jaro-Winkler, Levenshtein, Metaphone). From the Australian National University, under a GNU open source license. (The web page I originally downloaded them from in 2013 is no longer online.)

See the readme inside the data_user_pickle_csv for their schemas.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].