This document provides a cheat sheet on natural language processing with Python and the nltk library. It covers topics like text handling, tokenization, part-of-speech tagging, parsing, named entity recognition, and using regular expressions with Pandas.


Natural Language Processing with Python & nltk Cheat Sheet

by RJ Murray (murenei) via cheatography.com/58736/cs/15485/

Handling Text

text='Some words'                  Assign string
list(text)                         Split text into character tokens
set(text)                          Unique tokens
len(text)                          Number of characters

Part of Speech (POS) Tagging

nltk.help.upenn_tagset('MD')       Lookup definition for a POS tag
nltk.pos_tag(words)                nltk in-built POS tagger
<use an alternative tagger to illustrate ambiguity>
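A minimal sketch of the text-handling calls above, using the sheet's own example string (pure Python, no corpora needed):

```python
# Basic text handling on a plain Python string
text = 'Some words'

chars = list(text)    # character tokens: ['S', 'o', 'm', 'e', ' ', ...]
unique = set(text)    # unique characters (note: ' ' and case both count)
n = len(text)         # number of characters, not words

print(n)              # 10
```

Note that `nltk.pos_tag` additionally requires the `averaged_perceptron_tagger` resource, fetched once with `nltk.download('averaged_perceptron_tagger')`.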

Accessing corpora and lexical resources

from nltk.corpus import brown            Import CorpusReader object
brown.words(text_id)                     Returns pretokenised document as list of words
brown.fileids()                          Lists docs in Brown corpus
brown.categories()                       Lists categories in Brown corpus

Tokenization

text.split(" ")                          Split by space
nltk.word_tokenize(text)                 nltk in-built word tokenizer
nltk.sent_tokenize(doc)                  nltk in-built sentence tokenizer

Sentence Parsing

g=nltk.data.load('grammar.cfg')          Load a grammar from a file
g=nltk.CFG.fromstring("""...""")         Manually define grammar
parser=nltk.ChartParser(g)               Create a parser out of the grammar
trees=parser.parse_all(text)
for tree in trees: print(tree)
from nltk.corpus import treebank
treebank.parsed_sents('wsj_0001.mrg')    Treebank parsed sentences
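The grammar-definition and parsing steps above can be sketched end to end with a toy grammar. The grammar and sentence here are illustrative, not from the sheet; only nltk itself is required, no downloaded corpora:

```python
import nltk

# Manually define a tiny context-free grammar (illustrative example)
g = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Mary' | 'Bob'
V -> 'saw'
""")

parser = nltk.ChartParser(g)        # create a parser out of the grammar
tokens = ['Mary', 'saw', 'Bob']     # pretokenised sentence
trees = parser.parse_all(tokens)    # list of parse trees

for tree in trees:
    print(tree)                     # (S (NP Mary) (VP (V saw) (NP Bob)))
```

`parse_all` returns a list, so an unambiguous sentence under this grammar yields exactly one tree.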

Lemmatization & Stemming

input="List listed lists listing listings"   Different suffixes
words=input.lower().split(' ')               Normalize (lowercase)
porter=nltk.PorterStemmer()                  Initialise stemmer
[porter.stem(t) for t in words]              Create list of stems
WNL=nltk.WordNetLemmatizer()                 Initialise WordNet lemmatizer
[WNL.lemmatize(t) for t in words]            Use the lemmatizer

Text Classification

from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer().fit(X_train)          Fit bag-of-words model to corpus
vect.get_feature_names()                     Get features (get_feature_names_out() in scikit-learn >= 1.0)
vect.transform(X_train)                      Convert to document-term matrix

By RJ Murray (murenei), cheatography.com/murenei/ · tutify.com.au
Published 28th May, 2018. Last updated 29th May, 2018.

Entity Recognition (Chunking/Chinking)

g="NP: {<DT>?<JJ>*<NN>}"                 Regex chunk grammar
cp=nltk.RegexpParser(g)                  Parse grammar
ch=cp.parse(pos_sent)                    Parse tagged sent. using grammar
print(ch)                                Show chunks
ch.draw()                                Show chunks in IOB tree
cp.evaluate(test_sents)                  Evaluate against test doc
sents=nltk.corpus.treebank.tagged_sents()
print(nltk.ne_chunk(sent))               Print chunk tree
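The chunking calls above, made concrete with a hand-tagged sentence. The tagged tuples are an illustrative stand-in for `pos_sent`; `RegexpParser` needs no downloaded corpora:

```python
import nltk

# Regex chunk grammar: optional determiner, any adjectives, then a noun
g = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(g)

# Hypothetical POS-tagged sentence (stand-in for pos_sent)
pos_sent = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

ch = cp.parse(pos_sent)            # Tree with NP chunks grouped as subtrees
np_chunks = [st for st in ch.subtrees() if st.label() == 'NP']
print(len(np_chunks))              # 2: "the little dog" and "the cat"
```

`nltk.ne_chunk` works on the same tagged-tuple input but requires the `maxent_ne_chunker` and `words` resources via `nltk.download`.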

RegEx with Pandas & Named Groups

df=pd.DataFrame(time_sents, columns=['text'])
df['text'].str.split().str.len()
df['text'].str.contains('word')
df['text'].str.count(r'\d')
df['text'].str.findall(r'\d')
df['text'].str.replace(r'\w+day\b', '???')
df['text'].str.replace(r'(\w)', lambda x: x.groups()[0][:3])
df['text'].str.extract(r'(\d?\d):(\d\d)')
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
df['text'].str.extractall(r'(?P<digits>\d)')
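A runnable sketch of the Pandas string methods above. `time_sents` here is a made-up one-row sample; note that in pandas >= 2.0, `str.replace` needs `regex=True` for patterns and callable replacements:

```python
import pandas as pd

# Hypothetical sample (stand-in for time_sents)
time_sents = ["Monday: the doctor's appointment is at 2:45pm."]
df = pd.DataFrame(time_sents, columns=['text'])

n_words = df['text'].str.split().str.len()          # words per row
has_word = df['text'].str.contains('appointment')   # boolean mask
n_digits = df['text'].str.count(r'\d')              # digit count per row
hours = df['text'].str.extract(r'(\d?\d):(\d\d)')   # first hh:mm match
days = df['text'].str.replace(r'\w+day\b', '???', regex=True)

print(int(n_digits[0]))      # 3
print(list(hours.iloc[0]))   # ['2', '45']
```

`extract` returns one column per capture group; `extractall` returns a row per match, with named groups (`(?P<digits>...)`) becoming column names.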

