This document provides a summary of key natural language processing (NLP) techniques including tokenization, bag-of-words modeling, TF-IDF, word embeddings like Word2Vec, stop words removal, stemming, lemmatization, part-of-speech tagging, and named entity recognition. It defines each technique and provides examples of how they are implemented using popular NLP libraries like NLTK, SpaCy, Keras, TensorFlow, and Gensim. The document is intended as a cheat sheet for common NLP concepts and processing steps.


NLP Cheat Sheet

by sree017 via cheatography.com/126402/cs/24446/

Tokenization

Tokenization breaks raw text into words or sentences, called tokens. These tokens help in understanding the context or developing the model for NLP. If the text is split into words using some separation technique it is called word tokenization, and the same separation done for sentences is called sentence tokenization.

# NLTK
import nltk
nltk.download('punkt')
paragraph = "write paragraph here to convert into tokens."
sentences = nltk.sent_tokenize(paragraph)
words = nltk.word_tokenize(paragraph)

# Spacy
from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')  # spaCy 2.x; in spaCy 3.x use nlp.add_pipe('sentencizer')
nlp.add_pipe(sbd)
doc = nlp(paragraph)
[sent for sent in doc.sents]
nlp = English()
doc = nlp(paragraph)
[word for word in doc]

# Keras
from keras.preprocessing.text import text_to_word_sequence
text_to_word_sequence(paragraph)

# Gensim (the gensim.summarization module was removed in Gensim 4.x)
from gensim.summarization.textcleaner import split_sentences
split_sentences(paragraph)
from gensim.utils import tokenize
list(tokenize(paragraph))
Bag Of Words & TF-IDF

The Bag of Words model is used to preprocess the text by converting it into a bag of words, which keeps a count of the total occurrences of the most frequently used words.

# counters = list of sentences after preprocessing (tokenization, stemming/lemmatization, stopword removal)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(counters).toarray()
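A minimal sketch on a hypothetical two-sentence corpus (not from the original) to show what the count matrix looks like:

# toy corpus, purely illustrative
from sklearn.feature_extraction.text import CountVectorizer
toy_corpus = ["I love reading blogs", "I love data science blogs"]
cv = CountVectorizer()
X = cv.fit_transform(toy_corpus).toarray()
print(cv.get_feature_names_out())  # learned vocabulary (use get_feature_names() on scikit-learn < 1.0)
print(X)  # one row per sentence, one column per vocabulary word, each cell = word count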
Term Frequency-Inverse Document Frequency (TF-IDF):
Term frequency-inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
TF = (number of repetitions of a word in a sentence) / (number of words in the sentence)
IDF = log((number of sentences) / (number of sentences containing the word))

from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(counters).toarray()
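A minimal sketch of the two formulas above on a hypothetical toy corpus (pure Python, no library), just to make the arithmetic concrete:

import math

sentences = [["good", "boy"], ["good", "girl"]]  # illustrative corpus of tokenized sentences

def tf(word, sentence):
    # repetitions of the word in the sentence / number of words in the sentence
    return sentence.count(word) / len(sentence)

def idf(word, sentences):
    # log(number of sentences / number of sentences containing the word)
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

print(tf("good", sentences[0]) * idf("good", sentences))  # 0.0, "good" appears in every sentence
print(tf("boy", sentences[0]) * idf("boy", sentences))    # > 0, "boy" is distinctive to sentence 0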
N-gram Language Model:
An N-gram is a sequence of N tokens (or words). For the sentence "I love reading blogs about data science on Analytics Vidhya", a 1-gram (or unigram) is a one-word sequence: the unigrams would simply be "I", "love", "reading", "blogs", "about", "data", "science", "on", "Analytics", "Vidhya". A 2-gram (or bigram) is a two-word sequence of words, like "I love", "love reading", or "Analytics Vidhya". And a 3-gram (or trigram) is a three-word sequence of words like "I love reading", "about data science" or "on Analytics Vidhya".
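A small sketch, using nltk.ngrams on the example sentence implied above, that generates these sequences:

from nltk import ngrams
from nltk.tokenize import word_tokenize

text = "I love reading blogs about data science on Analytics Vidhya"
tokens = word_tokenize(text)
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))   # ('I', 'love'), ('love', 'reading'), ...
trigrams = list(ngrams(tokens, 3))  # ('I', 'love', 'reading'), ('love', 'reading', 'blogs'), ...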
Stemming & Lemmatization

Stemming is the process of getting the root form of a word. We create the stem by removing the prefix or suffix of a word, so stemming a word may not result in an actual word.

paragraph = ""  # fill in your text here
# NLTK
from nltk.stem import PorterStemmer
from nltk import sent_tokenize
from nltk import word_tokenize
stem = PorterStemmer()
sentence = sent_tokenize(paragraph)[1]
words = word_tokenize(sentence)
[stem.stem(word) for word in words]


# Spacy
No stemming in SpaCy.
# Keras
No stemming in Keras.

Lemmatization:
Lemmatization does the same as stemming, but with the difference that lemmatization ensures the root word belongs to the language.

# NLTK
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemma = WordNetLemmatizer()
sentence = sent_tokenize(paragraph)[1]
words = word_tokenize(sentence)
[lemma.lemmatize(word) for word in words]

# Spacy
import spacy as spac
sp = spac.load('en_core_web_sm')
ch = sp(u'warning warned')
for x in ch:
    print(x.lemma_)

# Keras
No lemmatization or stemming in Keras.
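A tiny side-by-side sketch (the word list is illustrative) showing why lemmatization, unlike stemming, returns real dictionary words:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')
stem = PorterStemmer()
lemma = WordNetLemmatizer()
for w in ["studies", "histories", "caring"]:
    # e.g. "studies" -> stem "studi" (not a word) vs lemma "study"
    print(w, stem.stem(w), lemma.lemmatize(w))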
Word2Vec

In the BOW and TF-IDF approaches semantic information is not stored; TF-IDF gives importance to uncommon words, and there is definitely a chance of overfitting. In Word2Vec each word is basically represented as a vector of 32 or more dimensions instead of a single number, so the semantic information and the relations between words are also preserved.
Steps:
1. Tokenization of the sentences
2. Create histograms
3. Take the most frequent words
4. Create a matrix with all the unique words. It also represents the occurrence relations between the words.

from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count=1)
words = model.wv.vocab  # model.wv.key_to_index in Gensim 4.x
vector = model.wv['freedom']
similar = model.wv.most_similar('freedom')
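A minimal end-to-end sketch, assuming a hypothetical paragraph variable, showing that Word2Vec expects a list of tokenized sentences (a list of lists of words):

import nltk
from gensim.models import Word2Vec

paragraph = "Freedom of speech matters. Freedom of the press matters too."  # illustrative text
# each sentence must already be split into word tokens
sentences = [nltk.word_tokenize(s.lower()) for s in nltk.sent_tokenize(paragraph)]
model = Word2Vec(sentences, vector_size=100, min_count=1)  # 'size=' instead of 'vector_size=' in Gensim 3.x
print(model.wv['freedom'])               # the learned vector for a word
print(model.wv.most_similar('freedom'))  # its nearest neighbours in the vector space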
Stop Words

Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document.

# NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))
word_tokens = word_tokenize(paragraph)
[word for word in word_tokens if word not in stopwords]

# Spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
nlp = English()
my_doc = nlp(paragraph)
# Create list of word tokens
token_list = [token.text for token in my_doc]
# Create list of word tokens after removing stopwords
filtered_sentence = []
for word in token_list:
    lexeme = nlp.vocab[word]
    if lexeme.is_stop == False:
        filtered_sentence.append(word)

# Gensim
from gensim.parsing.preprocessing import remove_stopwords
remove_stopwords(paragraph)
Tokenization libraries: NLTK, SpaCy, Keras, TensorFlow.

Parts of Speech (POS) Tagging, Chunking & NER

POS (parts of speech) tags explain how a word is used in a sentence. In a sentence, a word can have different contexts and semantic meanings. Basic natural language processing (NLP) models like bag-of-words (BOW) fail to identify these relations between words, so we use POS tagging to mark a word with its POS tag based on its context in the data. POS is also used to extract relationships between the words.


# NLTK
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
word_tokens = word_tokenize('Are you afraid of something?')
pos_tag(word_tokens)

# Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
[token.pos_ for token in doc]
Chunking:
Chunking is the process of extracting phrases from unstructured text and giving them more structure. It is also called shallow parsing, and we can do it on top of POS tagging. It groups words into chunks, mainly for noun phrases; chunking is done using regular expressions.

# NLTK
grammar = "NP: {<DT>?<JJ>*<NN>}"  # 'grammar' is undefined in the original; this noun-phrase pattern is an illustrative example
word_tokens = word_tokenize(text)
word_pos = pos_tag(word_tokens)
chunkParser = nltk.RegexpParser(grammar)
tree = chunkParser.parse(word_pos)
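A short usage sketch (the sentence is illustrative) showing how to pull the noun-phrase chunks back out of the parse tree:

import nltk
from nltk import pos_tag, word_tokenize, RegexpParser

text = "The quick brown fox jumped over the lazy dog"  # illustrative sentence
tree = RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(pos_tag(word_tokenize(text)))
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    # join the words of each noun-phrase chunk
    print(" ".join(word for word, tag in subtree.leaves()))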
Named Entity Recognition:
It is used to extract information from unstructured text. It is used to classify the entities present in the text into categories like person, organization, event, place, etc. This gives you detailed knowledge about the text and the relationships between the different entities.

# Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
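The sheet only shows SpaCy for NER; for symmetry with the other sections, a minimal NLTK sketch (an addition, not part of the original) using ne_chunk:

# NLTK alternative (illustrative; not part of the original cheat sheet)
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "Delhi resident tests positive for coronavirus, total 31 people infected in India"
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
for subtree in tree.subtrees(filter=lambda t: t.label() != 'S'):
    # each named-entity subtree carries its type (GPE, PERSON, ORGANIZATION, ...)
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))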

By sree017 (cheatography.com/sree017/). Published 26th September, 2020. Last updated 26th September, 2020.
