This document provides a summary of key natural language processing (NLP) techniques including tokenization, bag-of-words modeling, TF-IDF, word embeddings like Word2Vec, stop words removal, stemming, lemmatization, part-of-speech tagging, and named entity recognition. It defines each technique and provides examples of how they are implemented using popular NLP libraries like NLTK, SpaCy, Keras, TensorFlow, and Gensim. The document is intended as a cheat sheet for common NLP concepts and processing steps.


NLP Cheat Sheet

by sree017 via cheatography.com/126402/cs/24446/

Tokenization

Tokenization breaks raw text into words or sentences, called tokens. These tokens help in understanding the context or developing the model for NLP. If the text is split into words using some separation technique it is called word tokenization, and the same separation done for sentences is called sentence tokenization.

# NLTK
import nltk
nltk.download('punkt')
paragraph = "write paragraph here to convert into tokens."
sentences = nltk.sent_tokenize(paragraph)
words = nltk.word_tokenize(paragraph)

# Spacy
from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')  # spaCy 2.x; in spaCy 3.x use nlp.add_pipe('sentencizer')
nlp.add_pipe(sbd)
doc = nlp(paragraph)
[sent for sent in doc.sents]
nlp = English()
doc = nlp(paragraph)
[word for word in doc]

# Keras
from keras.preprocessing.text import text_to_word_sequence
text_to_word_sequence(paragraph)

# Gensim (the gensim.summarization module was removed in Gensim 4.x)
from gensim.summarization.textcleaner import split_sentences
split_sentences(paragraph)
from gensim.utils import tokenize
list(tokenize(paragraph))
Bag Of Words & TF-IDF

The Bag of Words model is used to preprocess the text by converting it into a bag of words, which keeps a count of the total occurrences of the most frequently used words.

# counters = list of sentences after preprocessing (tokenization, stemming/lemmatization, stopword removal)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(counters).toarray()
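A minimal sketch on a hypothetical two-sentence corpus (not from the original) to show what the count matrix looks like:

# toy corpus, purely illustrative
from sklearn.feature_extraction.text import CountVectorizer
toy_corpus = ["I love reading blogs", "I love data science blogs"]
cv = CountVectorizer()
X = cv.fit_transform(toy_corpus).toarray()
print(cv.get_feature_names_out())  # learned vocabulary (use get_feature_names() on scikit-learn < 1.0)
print(X)  # one row per sentence, one column per vocabulary word, each cell = word count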
Term Frequency-Inverse Document Frequency (TF-IDF):
Term frequency-inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
TF = (number of repetitions of a word in a sentence) / (number of words in the sentence)
IDF = log((number of sentences) / (number of sentences containing the word))

from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(counters).toarray()
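A minimal sketch of the two formulas above on a hypothetical toy corpus (pure Python, no library), just to make the arithmetic concrete:

import math

sentences = [["good", "boy"], ["good", "girl"]]  # illustrative corpus of tokenized sentences

def tf(word, sentence):
    # repetitions of the word in the sentence / number of words in the sentence
    return sentence.count(word) / len(sentence)

def idf(word, sentences):
    # log(number of sentences / number of sentences containing the word)
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

print(tf("good", sentences[0]) * idf("good", sentences))  # 0.0, "good" appears in every sentence
print(tf("boy", sentences[0]) * idf("boy", sentences))    # > 0, "boy" is distinctive to sentence 0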
N-gram Language Model:
An N-gram is a sequence of N tokens (or words). For the sentence "I love reading blogs about data science on Analytics Vidhya", a 1-gram (or unigram) is a one-word sequence: the unigrams would simply be "I", "love", "reading", "blogs", "about", "data", "science", "on", "Analytics", "Vidhya". A 2-gram (or bigram) is a two-word sequence of words, like "I love", "love reading", or "Analytics Vidhya". And a 3-gram (or trigram) is a three-word sequence of words like "I love reading", "about data science" or "on Analytics Vidhya".
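A small sketch, using nltk.ngrams on the example sentence implied above, that generates these sequences:

from nltk import ngrams
from nltk.tokenize import word_tokenize

text = "I love reading blogs about data science on Analytics Vidhya"
tokens = word_tokenize(text)
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))   # ('I', 'love'), ('love', 'reading'), ...
trigrams = list(ngrams(tokens, 3))  # ('I', 'love', 'reading'), ('love', 'reading', 'blogs'), ...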
Stemming & Lemmatization

Stemming is the process of getting the root form of a word. We create the stem by removing the prefix or suffix of a word, so stemming a word may not result in an actual word.

paragraph = ""  # fill in your text here
# NLTK
from nltk.stem import PorterStemmer
from nltk import sent_tokenize
from nltk import word_tokenize
stem = PorterStemmer()
sentence = sent_tokenize(paragraph)[1]
words = word_tokenize(sentence)
[stem.stem(word) for word in words]


# Spacy
No stemming in SpaCy.
# Keras
No stemming in Keras.

Lemmatization:
Lemmatization does the same as stemming, but with the difference that lemmatization ensures the root word belongs to the language.

# NLTK
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemma = WordNetLemmatizer()
sentence = sent_tokenize(paragraph)[1]
words = word_tokenize(sentence)
[lemma.lemmatize(word) for word in words]

# Spacy
import spacy as spac
sp = spac.load('en_core_web_sm')
ch = sp(u'warning warned')
for x in ch:
    print(x.lemma_)

# Keras
No lemmatization or stemming in Keras.
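A tiny side-by-side sketch (the word list is illustrative) showing why lemmatization, unlike stemming, returns real dictionary words:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')
stem = PorterStemmer()
lemma = WordNetLemmatizer()
for w in ["studies", "histories", "caring"]:
    # e.g. "studies" -> stem "studi" (not a word) vs lemma "study"
    print(w, stem.stem(w), lemma.lemmatize(w))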
Word2Vec

In the BOW and TF-IDF approaches semantic information is not stored; TF-IDF gives importance to uncommon words, and there is definitely a chance of overfitting. In Word2Vec each word is basically represented as a vector of 32 or more dimensions instead of a single number, so the semantic information and the relations between words are also preserved.
Steps:
1. Tokenization of the sentences
2. Create histograms
3. Take the most frequent words
4. Create a matrix with all the unique words. It also represents the occurrence relations between the words.

from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count=1)
words = model.wv.vocab  # model.wv.key_to_index in Gensim 4.x
vector = model.wv['freedom']
similar = model.wv.most_similar('freedom')
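A minimal end-to-end sketch, assuming a hypothetical paragraph variable, showing that Word2Vec expects a list of tokenized sentences (a list of lists of words):

import nltk
from gensim.models import Word2Vec

paragraph = "Freedom of speech matters. Freedom of the press matters too."  # illustrative text
# each sentence must already be split into word tokens
sentences = [nltk.word_tokenize(s.lower()) for s in nltk.sent_tokenize(paragraph)]
model = Word2Vec(sentences, vector_size=100, min_count=1)  # 'size=' instead of 'vector_size=' in Gensim 3.x
print(model.wv['freedom'])               # the learned vector for a word
print(model.wv.most_similar('freedom'))  # its nearest neighbours in the vector space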
Stop Words

Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document.

# NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))
word_tokens = word_tokenize(paragraph)
[word for word in word_tokens if word not in stopwords]

# Spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
nlp = English()
my_doc = nlp(paragraph)
# Create list of word tokens
token_list = [token.text for token in my_doc]
# Create list of word tokens after removing stopwords
filtered_sentence = []
for word in token_list:
    lexeme = nlp.vocab[word]
    if lexeme.is_stop == False:
        filtered_sentence.append(word)

# Gensim
from gensim.parsing.preprocessing import remove_stopwords
remove_stopwords(paragraph)
Tokenization libraries: NLTK, SpaCy, Keras, TensorFlow.

Parts of Speech (POS) Tagging, Chunking & NER

POS (parts of speech) tags explain how a word is used in a sentence. In a sentence, a word can have different contexts and semantic meanings. Basic natural language processing (NLP) models like bag-of-words (BOW) fail to identify these relations between words, so we use POS tagging to mark a word with its POS tag based on its context in the data. POS is also used to extract relationships between the words.


# NLTK
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
word_tokens = word_tokenize('Are you afraid of something?')
pos_tag(word_tokens)

# Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
[token.pos_ for token in doc]
Chunking:
Chunking is the process of extracting phrases from unstructured text and giving them more structure. It is also called shallow parsing, and we can do it on top of POS tagging. It groups words into chunks, mainly for noun phrases; chunking is done using regular expressions.

# NLTK
grammar = "NP: {<DT>?<JJ>*<NN>}"  # 'grammar' is undefined in the original; this noun-phrase pattern is an illustrative example
word_tokens = word_tokenize(text)
word_pos = pos_tag(word_tokens)
chunkParser = nltk.RegexpParser(grammar)
tree = chunkParser.parse(word_pos)
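A short usage sketch (the sentence is illustrative) showing how to pull the noun-phrase chunks back out of the parse tree:

import nltk
from nltk import pos_tag, word_tokenize, RegexpParser

text = "The quick brown fox jumped over the lazy dog"  # illustrative sentence
tree = RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(pos_tag(word_tokenize(text)))
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    # join the words of each noun-phrase chunk
    print(" ".join(word for word, tag in subtree.leaves()))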
Named Entity Recognition:
It is used to extract information from unstructured text. It is used to classify the entities present in the text into categories like person, organization, event, place, etc. This gives you detailed knowledge about the text and the relationships between the different entities.

# Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
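The sheet only shows SpaCy for NER; for symmetry with the other sections, a minimal NLTK sketch (an addition, not part of the original) using ne_chunk:

# NLTK alternative (illustrative; not part of the original cheat sheet)
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "Delhi resident tests positive for coronavirus, total 31 people infected in India"
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
for subtree in tree.subtrees(filter=lambda t: t.label() != 'S'):
    # each named-entity subtree carries its type (GPE, PERSON, ORGANIZATION, ...)
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))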

By sree017 (cheatography.com/sree017/). Published 26th September, 2020. Last updated 26th September, 2020.
