Natural Language Processing
UNIT – 1 :
1. Introduction
1.1. Understanding NLP
1.2. Understanding Basic Applications
1.3. Advantages of togetherness of NLP and python
1.4. Environment setup for nltk
2. Practical understanding of corpus and database.
1. Introduction :
1. Understanding NLP :
Natural Language Processing (NLP) is a part of Artificial Intelligence (AI). Its goal is to teach
computers human language, with all its complexity, so that machines can understand and respond the way humans do.
2. Understanding Basic Applications :
Applications of NLP :
i. Chatbots
ii. Language Translators
iii. Sentiment Analysis
iv. Auto Correct / Auto Complete
v. Social media marketing
vi. Voice Assistants
vii. Grammar Checkers
viii. E-mail classification & Filtering
ix. Machine Translation
x. Speech Recognition
xi. Text Extraction
xii. Predictive Text
xiii. Targeted Advertisement
3. Advantages of togetherness of NLP and Python :
a. Developing prototypes for NLP-based expert systems using Python is very easy &
efficient.
b. A large variety of open-source NLP libraries are available for Python programmers.
c. Community support is very strong.
d. Easy to use & less complex for beginners.
e. Rapid development : testing & development are easy and less complex.
f. Optimization of NLP-based systems is less complex compared to other programming
languages.
g. Many of the new frameworks such as Apache Spark, Apache Flink, TensorFlow and so
on provide APIs for Python.
4. Environment setup for NLTK (Natural Language Tool Kit) :
1. In the Python shell, to know the version of Python we can use -V (or) --version
>python -V
Python 3.11.2
>python --version
Python 3.11.2
2. In Jupyter Notebook / Spyder
from platform import python_version
print(python_version())
Output: 3.9.7
3. In the Python shell
import nltk
(or) download NLTK (if the package is not available)
In other platforms
pip install -U nltk
import nltk  # it will import the package
For Windows, install the free Python distribution from
https://docs.python-guide.org/starting/install3/win
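After installing NLTK, the corpora and models used in the later units have to be downloaded once. A minimal sketch (the resource names below are standard NLTK downloads):
Code :
import nltk

# one-time downloads of the resources used later in these notes
nltk.download('punkt')       # sentence / word tokenizer models
nltk.download('stopwords')   # stop-word lists
nltk.download('wordnet')     # WordNet data used by the lemmatizer
nltk.download('gutenberg')   # sample corpus (includes burgess-busterbrown.txt)

print(nltk.__version__)      # verify the installed version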
2. Binary Attributes :
Binary data has only 2 values (or) states, e.g., yes/no, affected/unaffected, true/false.
Symmetric : Both values are equally important (e.g., Gender)
Asymmetric : Both values are not equally important (e.g., Result)
e.g.,
Attribute        Values
Gender           Male, Female
Result           Pass, Fail
3. Ordinal Attributes :
Ordinal attributes contain values that have a meaningful sequence (or) ranking (order)
between them, but the magnitude between values is not actually known; the order of values
shows what is important but does not indicate how important it is.
e.g.,
Attribute            Values
Grade                A, B, C, D, E, F
Basic Pay Scale      16, 17, 18
2. Quantitative Data Attributes (or) Numeric Data Attributes :
These are of 3 types
1. Numeric Attributes
2. Discrete Attributes
3. Continuous Attributes
1. Numeric Attributes :
A numeric attribute is quantitative because it is a measurable quantity, represented
in integer (or) real values. Numeric attributes are of 2 types.
i. Interval-Scaled
ii. Ratio-Scaled
i. Interval-Scaled :
These attributes have values whose differences are interpretable, but the attribute
does not have a true reference point (or) zero point.
Data on an interval scale can be added and subtracted, but cannot be multiplied (or)
divided.
e.g., Temperature in degrees Celsius (Centigrade).
If the temperature of one day is twice that of another day, we cannot
say that one day is twice as hot as the other.
ii. Ratio-Scaled :
It is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can
speak of a value as being a multiple (or) ratio of another value.
The values are ordered, and we can also compute the difference between values,
as well as the mean, median, mode and interquartile range.
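For example, for a ratio-scaled attribute such as height these statistics can be computed directly; a small sketch with made-up values:
Code :
import statistics

heights = [6.0, 5.9, 5.6, 5.9, 6.1]      # hypothetical ratio-scaled values

print(statistics.mean(heights))          # mean
print(statistics.median(heights))        # median
print(statistics.mode(heights))          # mode (most frequent value)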
2. Discrete Attributes :
Discrete data have finite values; they can be numerical and can also be in categorical form.
These attributes have a finite (or) countably infinite set of values.
e.g.,
Attribute        Values
Profession       Teacher, Businessman, Peon
Zip Code         501701, 120042
3. Continuous Attributes :
Continuous data have an infinite number of states. Continuous data are of float type.
There can be many values between 2 integers (say 1 & 2).
e.g.,
Attribute        Values
Height           6, 5.9, 5.6, …
Weight           40, 60, 45, …
[Diagram : Data Attributes are classified into Qualitative (Categorical) and Quantitative (Numeric) attributes.]
1. Suffix removal :
This step removes the pre-defined endings from words.
2. Recoding :
This step attaches pre-defined endings to the output of the first step.
e.g., ational → ate changes rotational → rotate
It is difficult to use stemming with morphologically rich languages.
Even in English, stemmers are not perfect.
Another problem with the Porter stemmer is that it reduces only suffixes.
A more efficient two-level morphological model, first proposed by Koskenniemi (1983), can be
used for such morphologically rich languages.
The two-level morphological model consists of
1. Lexical level
2. Surface level
Step 1 : Surface form → split the word into possible morphemes → intermediate form
Step 2 : Intermediate form → map morphemes to the stem and morphological features → lexical form
[Figure : top-down parse tree fragment — VP → Verb NP ; NP → the Noun ; Noun → door]
2. Bottom – Up parsing :
A bottom-up parser starts with the words in the input sentence & attempts to
construct a parse tree in an upward direction towards the root.
At each step the parser looks for rules in the grammar where the right-hand side
matches some of the portions in the parse tree constructed so far, and reduces it
using the left-hand side of the production.
e.g., Input : Paint the door
The given string has 2 possible parse trees, of which we have to consider
the correct parse tree based on the given grammar / production rules.
[Figure : bottom-up parsing of "paint the door" — Level 1 starts from the words (paint, the, door);
successive levels reduce them using the productions (e.g., Det Noun → NP, Verb NP → VP)
until the start symbol is reached.]
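Both strategies can be tried out with NLTK's built-in parsers. A minimal sketch (the toy grammar below is an assumption written for this example sentence, not part of the notes):
Code :
import nltk
from nltk import CFG

# small assumed grammar for "paint the door"
grammar = CFG.fromstring("""
S -> VP
VP -> Verb NP
NP -> Det Noun
Verb -> 'paint'
Det -> 'the'
Noun -> 'door'
""")

sentence = "paint the door".split()

# top-down strategy: recursive-descent parser
for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
    print(tree)

# bottom-up strategy: shift-reduce parser
for tree in nltk.ShiftReduceParser(grammar).parse(sentence):
    print(tree)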
Example :
Input : The girl wrote an essay
Grammar :
S → NP VP
VP → Verb NP
NP → Det Noun
Det → an | the
Verb → wrote
Noun → girl
Noun → essay
CYK chart (each cell holds the non-terminals covering a span of words; Ø = empty) :
Words :     The(1)   girl(2)   wrote(3)   an(4)   essay(5)
Length 1 :  Det      Noun      Verb       Det     Noun
Length 2 :  NP       Ø         Ø          NP
Length 3 :  Ø        Ø         VP
Length 4 :  Ø        Ø
Length 5 :  S
Rules :
A CFG is in CNF if all the rules are of only 2 forms
A → B C (or) A → wi
Each entry in the table is based on previous entries. The basic CYK algorithm is
also a chart-based algorithm.
A non-terminal A is stored in entry [i, j] of the table iff A ⇒* wi wi+1 … wi+k−1,
i.e., A derives the span of k words starting at position i.
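A minimal CYK recognizer sketch for the example grammar above, written in plain Python (the dictionary encoding of the CNF rules is an assumption made for illustration):
Code :
from itertools import product

# CNF rules: A -> B C (binary) and A -> terminal (lexical)
binary_rules = {
    ("NP", "VP"): "S",
    ("Det", "Noun"): "NP",
    ("Verb", "NP"): "VP",
}
lexical_rules = {
    "the": {"Det"}, "an": {"Det"},
    "girl": {"Noun"}, "essay": {"Noun"},
    "wrote": {"Verb"},
}

def cyk(words):
    n = len(words)
    # table[i][j] holds the non-terminals spanning words[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][0] = set(lexical_rules.get(w, ()))
    for span in range(2, n + 1):                # length of the span
        for i in range(n - span + 1):           # start of the span
            for split in range(1, span):        # split point inside the span
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for B, C in product(left, right):
                    if (B, C) in binary_rules:
                        table[i][span - 1].add(binary_rules[(B, C)])
    return "S" in table[0][n - 1]

print(cyk("the girl wrote an essay".split()))   # expected: True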
Discourse Integration :
Discourse Integration is closely related to pragmatics. It is considered as the larger context
for every smaller part of Natural Language structure.
Discourse analysis deals with how the immediately preceding sentence can affect the
meaning & interpretation of the next sentence.
Hence a sentence can be analyzed in a bigger context, such as at the paragraph level or the
document level.
(or)
The meaning of any sentence depends on the meaning of the sentence just before it. It also
brings about the meaning of immediately succeeding sentence.
e.g., Jai had an NLP text book. I want it.
Key aspects of context :
Situational context :
What people know about what they can see around them.
Background knowledge context :
What people know about each other & the world.
Co-textual context :
What people know about what they have been saying.
Pragmatic Analysis :
It deals with the overall communicative & social content & its effect on interpretation. In
this analysis, the main focus is always on what was said being reinterpreted as what was actually meant.
Pragmatic analysis helps users to discover this intended effect by applying a set of rules that
characterize cooperative dialogues.
e.g., Close the window (It should be interpreted as a request instead of an order)
I heart you
If you eat all of the food, it will make you bigger.
UNIT – 3 : PRE-PROCESSING
1. Handling corpus-raw
1.1. Handling raw-text
1.2. Sentence Tokenization
1.3. Lower-Case Conversion
1.4. Stemming
1.5. Lemmatization
1.6. Stop words removal
2. Handling corpus-raw sentences
2.1. Word Tokenizer
2.2. Word Lemmatization
3. Basic preprocessing
3.1. Basic level Regular Expression
3.2. Advanced Regular Expression
4. Practical and customized preprocessing
1) Handling the corpus-raw :
Get the raw data contained in a paragraph → load the data (.txt file) into the system → run the sentence tokenizer.
Sample raw text file (data.txt) : This is a sample text file which is used for reading the data
Sample local paragraph : one paragraph
Lower-Case Conversion:
Converting all data to lower case helps in pre-processing and in the later stages of the
NLP application, for example when we are doing parsing.
Code :
def wordlowercase():
    text = "I am student"
    print(text.lower())
wordlowercase()
Output:
i am student
Sentence Tokenization:
Sentence tokenization is the process of identifying the boundaries of sentences, i.e., their
starting & ending points.
The following open-source tools are available for performing sentence tokenization:
1. OpenNLP
2. Stanford CoreNLP
3. GATE
4. NLTK
Code:
from nltk import sent_tokenize as st
from nltk.corpus import gutenberg as cg

def fileread():
    file_contents = open('data.txt', 'r').read()
    return file_contents

def localtextvalue():
    text = """one paragraph"""
    return text

def readcorpus():
    raw_content_cd = cg.raw('burgess-busterbrown.txt')
    return raw_content_cd

if __name__ == "__main__":
    print(" ")
    print("----output from raw text file----")
    print(" ")
    filecontentdetails = fileread()
    print(filecontentdetails)
    print(" ")
    print("----sentence tokenization of raw text----")
    print(" ")
    st_list_rawfile = st(filecontentdetails)
    print(st_list_rawfile)
    print(len(st_list_rawfile))
    print(" ")
    print("----output from assigned variable----")
    print(" ")
    localvariabledata = localtextvalue()
    print(localvariabledata)
    print(" ")
    print("----sentence tokenization of assigned variable----")
    print(" ")
    st_list_local = st(localvariabledata)
    print(st_list_local)
    print(len(st_list_local))
    print(" ")
    print("----output corpus data----")
    print(" ")
    fromcorpusdata = readcorpus()
    print(fromcorpusdata)
    print(" ")
    print("----sentence tokenization of corpus data----")
    print(" ")
    st_list_corpus = st(fromcorpusdata)
    print(st_list_corpus)
    print(len(st_list_corpus))
Output :
This is a sample text file which is using for reading the data
['This is a sample text file which is using for reading the data']
1
one paragraph
['one paragraph']
1
# import and sample 'text' (recovered from the output below) added so the snippet runs
from nltk.stem import PorterStemmer

text = "Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics."

def stemmer_porter():
    port = PorterStemmer()
    print("Stemmer")
    return " ".join([port.stem(i) for i in text.split()])

if __name__ == "__main__":
    print(stemmer_porter())
Output:
Stemmer
natur languag process (nlp) is a branch of artifici intellig that help comp
ut understand, interpret and manipul human language. nlp draw from mani dis
ciplines, includ comput s
Lemmatization :
Lemmatization extracts a meaningful base form (lemma) from the given word as per the context.
Challenges of Lemmatization:
It works best for the English language; other languages have much lower accuracy
when compared with English.
Not all words of other languages are recognized accurately.
Code:
from nltk.stem import WordNetLemmatizer

# 'text' is assumed to be the same raw paragraph defined for the stemmer above
def lemmatizer():
    word_lemma = WordNetLemmatizer()
    print("raw-text")
    print()
    print(text)
    print("Verb lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="v") for i in text.split()]))
    print("Noun lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="n") for i in text.split()]))
    print("Adjective lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="a") for i in text.split()]))
    print("Satellite adjectives lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="s") for i in text.split()]))
    print("Adverb lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="r") for i in text.split()]))

if __name__ == "__main__":
    lemmatizer()
Output:
raw-text
Adjective lemma
Natural language processing (NLP) is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer
Adverb lemma
Natural language processing (NLP) is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer
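Stop words removal :
A minimal sketch of stopwordlist() and stopwordremove() as they can be implemented with NLTK (the function bodies below are assumptions written for illustration; 'text' is again the raw paragraph used above):
Code:
from nltk.corpus import stopwords

def stopwordlist():
    # print the built-in English stop-word list
    print(stopwords.words('english'))

def stopwordremove():
    # remove stop words from the assumed raw paragraph 'text'
    stop = set(stopwords.words('english'))
    print("--------raw text---------")
    print(text)
    print("--------stop words removed---------")
    print(" ".join(w for w in text.split() if w.lower() not in stop))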
if __name__ == "__main__":
    stopwordlist()
    stopwordremove()
Output:
--------raw text---------
Natural language processing (NLP) is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer science and computational linguistics, in its pursuit
to fill the gap between human communication and
b) Word Lemmatization :
It is the process of deleting/modifying the affixes of a word as per the context in which
the word appears.
e.g., They are going to picnic.
She really loves to buy cars.
We are running for taking a meal.
Challenges of Word Lemmatization:
It works best for the English language. Other languages (Telugu, Hindi, Urdu,
Arabic, Hebrew) have much lower accuracy when compared with English.
Code:
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

def wordtokenization():
    content = """She really wants to buy cars. She told me angrily. It is better for you. Man is walking. We are meeting tomorrow"""
    print(word_tokenize(content))

def wordlemmatization():
    wordlemma = WordNetLemmatizer()
    print(wordlemma.lemmatize('cars'))
    print(wordlemma.lemmatize('walking', pos='v'))
    print(wordlemma.lemmatize('meeting', pos='v'))
    print(wordlemma.lemmatize('meeting', pos='n'))
    print(wordlemma.lemmatize('better', pos='a'))

if __name__ == "__main__":
    print("---Word Tokenization---")
    wordtokenization()
    print("---Word Lemmatization---")
    wordlemmatization()
Output:
---Word Tokenization---
['She', 'really', 'wants', 'to', 'buy', 'cars', '.', 'She', 'told', 'me', 'angrily', '.', 'It', 'is', 'better', 'for', 'you', '.', 'Man', 'is', 'walking', '.', 'We', 'are', 'meeting', 'tomorrow']
---Word Lemmatization---
car
walk
meet
meeting
good
3. Basic Pre-processing :
Basic Level Regular Expressions :
Regular expressions are a powerful tool when we want to do customized pre-processing
(or) when we have noisy data.
Basic flags : The basic flags are I, L, M, S, U and X.
re.I (re.IGNORECASE) : This flag is used for ignoring case.
re.L (re.LOCALE) : This flag makes the match locale-dependent.
re.M (re.MULTILINE) : This flag is useful if we want to find patterns throughout multiple lines.
re.S (re.DOTALL) : This flag makes dot (.) match any character, including a newline.
re.U (re.UNICODE) : This flag is used to work with Unicode characters.
re.X (re.VERBOSE) : This flag is used for writing regex in a more readable format.
re.match() : This checks for a match of the pattern only at the beginning of the
string. If it finds the pattern at the beginning of the input string, it returns
the matched pattern; otherwise, it returns None.
re.search() : This checks for a match of the pattern anywhere in the string. It finds all
the occurrences of the pattern in the given input string (or) data.
Code:
import re

def searchmatch():
    line = "I love animals"
    matchobj = re.match(r'animals', line, re.M | re.I)
    if matchobj:
        print("match:", matchobj.group())
    else:
        print("No Match!")
    searchobj = re.search(r'animals', line, re.M | re.I)
    if searchobj:
        print("search:", searchobj.group())
    else:
        print("Nothing found")

if __name__ == "__main__":
    searchmatch()
Output:
No Match!
search: animals
import re

def advRegEx():
    # 'text' is assumed to be defined earlier in the notes: a sentence that
    # contains "play", "playground" and a standalone "ground"
    poslookaheadobjpattern = re.findall(r'play(?=ground)', text, re.M | re.I)
    print("Positive lookahead : " + str(poslookaheadobjpattern))
    poslookaheadobj = re.search(r'play(?=ground)', text, re.M | re.I)
    print("positive lookahead character Index : " + str(poslookaheadobj.span()))
    poslookbehindobjpattern = re.findall(r'(?<=play)ground', text, re.M | re.I)
    print("Positive LookBehind : " + str(poslookbehindobjpattern))
    poslookbehindobj = re.search(r'(?<=play)ground', text, re.M | re.I)
    print("Positive LookBehind Character Index : " + str(poslookbehindobj.span()))
    neglookaheadobjpattern = re.findall(r'play(?!ground)', text, re.M | re.I)
    print("Negative lookahead : " + str(neglookaheadobjpattern))
    neglookaheadobj = re.search(r'play(?!ground)', text, re.M | re.I)
    print("Negative lookahead character Index : " + str(neglookaheadobj.span()))
    neglookbehindobjpattern = re.findall(r'(?<!play)ground', text, re.M | re.I)
    print("Negative LookBehind : " + str(neglookbehindobjpattern))
    neglookbehindobj = re.search(r'(?<!play)ground', text, re.M | re.I)
    print("Negative Lookbehind Index : " + str(neglookbehindobj.span()))

if __name__ == "__main__":
    print("----Advanced Regular expression----")
    advRegEx()
Output:
----Advanced Regular expression----
Positive lookahead : ['play']
positive lookahead character Index : (11, 15)
Positive LookBehind : ['ground']
Positive LookBehind Character Index : (15, 21)
Negative lookahead : ['play']
Negative lookahead character Index : (2, 6)
Negative LookBehind : ['ground']
Negative Lookbehind Index : (38, 44)
(Decision : is pre-processing required ?)
Yes → pre-processing is required ; No → pre-processing is not required.
If pre-processing is required, the typical pipeline is :
Remove HTML tags & repeated text
↓
Sentence Tokenizer
↓
Word Tokenizer
↓
Word Lemmatization
↓
Sentence Lemmatization
↓
Lower-case Conversion
↓
Stop words removal
↓
Stemming
Understanding Case Studies of pre-processing :
1. Grammarly Correction System (e.g., Customer Reviews)
2. Sentiment Analysis ( positive, negative, neutral )
3. Machine Translation (speech based, text based)
4. Spelling Correction
Operations used in Pre-processing :
1. Insertion
2. Deletion
3. Substitution
1. Insertion :
If we have any incorrect string, after inserting 1 (or) more characters we will get the
correct string (or) expected string.
e.g., ‘aple’ on insertion of ‘p’ becomes ‘apple’
‘puzle’ on insertion of ‘z’ becomes ‘puzzle’
2. Deletion :
If we have an incorrect string, which can be converted into a correct string after
deleting 1 (or) more characters of the string.
e.g., ‘carroot’ after deleting ‘o’ becomes ‘carrot’
‘bannana’ after deleting ‘n’ becomes ‘banana’
3. Substitution :
If we get the correct string by substituting 1 (or) more characters, then it is
called a substitution.
e.g., ‘implemantation’ on substituting ‘a’ with ‘e’ becomes ‘implementation’
‘corroption’ on substituting ‘o’ with ‘u’ becomes ‘corruption’
Minimum Edit Distance Algorithm :
This algorithm converts one string ‘x’ into another string ‘y’; what we need to find is
the minimum edit cost for converting string ‘x’ to string ‘y’.
Algorithm :
Input : Two strings, P and Q (the source string ‘x’ and the target string ‘y’)
Output : The cheapest possible sequence of character edits for converting string ‘x’ to
‘y’, whose cost equals the minimum edit distance between ‘x’ and ‘y’.
i.e., cost of the cheapest edit sequence = minimum edit distance
Steps:
1. Set n to the length of P.
Set m to the length of Q.
2. If n = 0, return m and exit.
If m = 0, return n and exit.
3. Create a matrix containing 0…m rows & 0…n columns.
4. Initialize the first row to 0…n.
Initialize the first column to 0…m.
5. Iterate over each character of P ( i from 1 to n ).
Iterate over each character of Q ( j from 1 to m ).
6. If P[i] equals Q[j] then the cost is 0.
If P[i] does not equal Q[j] then the cost is 1.
Set the value of cell v[i, j] of the matrix to the minimum of the following three values:
7. The cell immediately above plus 1 : v[i-1, j] + 1.
8. The cell immediately to the left plus 1 : v[i, j-1] + 1.
9. The cell diagonally above and to the left plus the cost : v[i-1, j-1] + cost.
10. After the iterations of steps 5 to 9 have been completed, the distance is found in cell
v[n, m].
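A minimal sketch of this dynamic-programming algorithm in Python (insertion, deletion and substitution each cost 1; the test strings are taken from the examples above):
Code :
def min_edit_distance(P, Q):
    n, m = len(P), len(Q)
    if n == 0:
        return m
    if m == 0:
        return n
    # v[i][j] is the distance between the first i characters of P
    # and the first j characters of Q
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        v[i][0] = i
    for j in range(m + 1):
        v[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if P[i - 1] == Q[j - 1] else 1
            v[i][j] = min(v[i - 1][j] + 1,          # deletion
                          v[i][j - 1] + 1,          # insertion
                          v[i - 1][j - 1] + cost)   # substitution
    return v[n][m]

print(min_edit_distance("aple", "apple"))             # expected: 1 (insertion)
print(min_edit_distance("corroption", "corruption"))  # expected: 1 (substitution)
The spelling-correction code below builds on the same edit operations: it generates all one- and two-edit candidates and picks the most probable word seen in the corpus.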
Code :
import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('data.txt').read()))

def P(word, N=sum(WORDS.values())):     # probability of word
    return WORDS[word] / N

def correction(word):                   # most probable spelling correction of the word
    return max(candidates(word), key=P)

def candidates(word):                   # generate possible spelling corrections for word
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):                       # the subset of words that appear in the dictionary WORDS
    return set(w for w in words if w in WORDS)

def edits1(word):                       # all edits that are one edit away from word
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):                       # all edits that are two edits away from word
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

if __name__ == "__main__":
    print(correction('aple'))
    print(correction('correcton'))
    print(correction('statament'))
    print(correction('tutpore'))
Output :
apple
correction
statement
tutor
UNIT – 4 : FEATURE ENGINEERING & NLP ALGORITHMS
1. Understanding of feature Engineering
1.1. Introduction
1.2. Purpose
1.3. Challenges
2. Basic features of NLP
i. Understanding the basics of parsing
ii. Understanding the concepts of parsing
iii. Developing a parser from the scratch
iv. Types of Grammars
3. Basic statistical feature of NLP, Advantages of feature Engineering
4. Challenges of feature Engineering
1. Understanding of feature Engineering :
Introduction :
Feature Engineering :
It is the process of deriving (or) creating the features (or) attributes for the given data.
These features are useful to develop NLP applications.
Features are required for applying machine learning algorithms.
Purpose :
The main purpose of feature engineering is that, to apply machine learning algorithms (or)
techniques to NLP applications, we need to derive/provide features from the data.
Also, it is difficult to measure/calculate the performance/accuracy of a machine learning
algorithm without suitable features.
Challenges of feature Engineering :
After generating features, we need to decide which features should be selected and then
apply machine learning techniques on the selected features.
Selecting good features is difficult & sometimes complex.
During feature selection we need to eliminate some of the less important features, and
this elimination of features is also a critical part of feature engineering.
Manual feature engineering is time-consuming.
Feature engineering requires domain expertise (or) at least basic knowledge about the
domain.
2. Basic features of NLP :
Understanding the basics of parsing :
Parsing :
The task of using the rewrite rules of a grammar to either generate a particular
sequence of words or reconstruct its derivation is known as ‘parsing’. Constructing a
phrase structure (parse tree) from a sentence is also called parsing.
A syntactic parser is thus responsible for recognizing a sentence and assigning a
syntactic structure to it.
There exist 2 types of parsing techniques:
1. Top – Down parsing
2. Bottom – Up parsing
1. Top – Down parsing :
Top-down parsing starts its search from the root node ( S ) and works
downwards towards the leaves.
A successful parse corresponds to a tree which matches exactly the words in
the input sentence.
2. Bottom – Up parsing :
A bottom-up parser starts with the words in the input sentence & attempts to
construct a parse tree in an upward direction towards the root.
At each step the parser looks for rules in the grammar where the right-hand side
matches some of the portions in the parse tree constructed so far, and reduces it
using the left-hand side of the production.
We use 2 dynamic programming algorithms to implement parsing:
1. Earley Parsing
2. CYK
Types of Grammars :
There exist 2 types of grammars.
1. CFG (Context Free Grammar)
2. PCFG (Probabilistic Context Free Grammar)
1. CFG (Context Free Grammar) :
Context Free Grammar G = ( T, C, N, S, L, R )
where T → Terminals / lexical symbols
      C → Pre-terminals
      N → Non-terminals
      S → Start symbol (a non-terminal)
      L → Lexicon (lexical terminals)
      R → Rules / productions of the grammar
Grammar :                    Lexicon :
S  → NP VP                   N → people     V → people
VP → V NP                    N → fish       V → fish
VP → V NP PP                 N → tank       V → tank
NP → NP NP                   N → rods       P → with
NP → NP PP
NP → N
PP → P NP
e.g.,
1. people fish tank
Parse tree : [S [NP [N people]] [VP [V fish] [NP [N tank]]]]
2. people fish tank with rods
Parse tree (PP attached to the VP) :
[S [NP [N people]] [VP [V fish] [NP [N tank]] [PP [P with] [NP [N rods]]]]]
(or) Parse tree (PP attached to the NP) :
[S [NP [N people]] [VP [V fish] [NP [NP [N tank]] [PP [P with] [NP [N rods]]]]]]
2. PCFG (Probabilistic Context Free Grammar) :
Grammar :                      Lexicon :
S  → NP VP     (1.0)           N → people (0.5)    V → people (0.1)
VP → V NP      (0.6)           N → fish   (0.2)    V → fish   (0.6)
VP → V NP PP   (0.4)           N → tank   (0.2)    V → tank   (0.3)
NP → NP NP     (0.1)           N → rods   (0.1)    P → with   (1.0)
NP → NP PP     (0.2)
NP → N         (0.7)
PP → P NP      (1.0)
e.g., people fish tank with rods
Parse tree t1 (using VP → V NP PP) :
[S [NP [N people]] [VP [V fish] [NP [N tank]] [PP [P with] [NP [N rods]]]]]
P(t1) = 1.0 × 0.7 × 0.5 × 0.4 × 0.6 × 0.7 × 0.2 × 1.0 × 1.0 × 0.7 × 0.1
      = 0.0008232
Parse tree t2 (using VP → V NP and NP → NP PP) :
[S [NP [N people]] [VP [V fish] [NP [NP [N tank]] [PP [P with] [NP [N rods]]]]]]
P(t2) = 1.0 × 0.7 × 0.6 × 0.5 × 0.6 × 0.2 × 0.7 × 1.0 × 0.2 × 1.0 × 0.7 × 0.1
      = 0.00024696
∴ P(sentence) = P(t1) + P(t2)
              = 0.0008232 + 0.00024696
              = 0.00107016
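The same PCFG can be written in NLTK and the most probable parse recovered with the Viterbi parser; a minimal sketch (the grammar string is taken from the rules above):
Code :
import nltk
from nltk import PCFG

# probabilities for each left-hand side must sum to 1
grammar = PCFG.fromstring("""
S -> NP VP [1.0]
VP -> V NP [0.6] | V NP PP [0.4]
NP -> NP NP [0.1] | NP PP [0.2] | N [0.7]
PP -> P NP [1.0]
N -> 'people' [0.5] | 'fish' [0.2] | 'tank' [0.2] | 'rods' [0.1]
V -> 'people' [0.1] | 'fish' [0.6] | 'tank' [0.3]
P -> 'with' [1.0]
""")

# ViterbiParser returns the single most probable parse tree
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("people fish tank with rods".split()):
    print(tree)
    print("P(tree) =", tree.prob())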
3. Basic Statistical features of NLP, Advantages of feature Engineering :
Advantages of feature Engineering :
Better features carry a lot of meaning: even if we choose a less optimal machine learning
algorithm, we will get a good result. Good features provide the flexibility of choosing an
algorithm; even if we choose a less complex model, we will still get good accuracy.
If we choose good features, then even simple machine learning algorithms do well.
Better understanding of the features leads to better accuracy. We should spend more time on
feature engineering to generate the appropriate features for our dataset.
4. Challenges of feature Engineering :
An effective way of converting text data into a numerical format is quite challenging. For
this challenge, a trial & error method may help us.
In the NLP domain, we can easily derive features that are categorical features (or)
basic NLP features.
We then have to convert these features into a numerical format.
There are a couple of techniques we can use for this, such as TF-IDF (Term Frequency –
Inverse Document Frequency), encoding, ranking, co-occurrence matrices and word
embeddings (e.g., Word2Vec), to convert our text data into numerical data.
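As one concrete example, TF-IDF features can be produced with scikit-learn; a minimal sketch (the small corpus below is made up for illustration):
Code :
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "people fish tank",
    "people fish tank with rods",
    "the girl wrote an essay",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # vocabulary (feature names)
print(X.toarray())                          # TF-IDF weights per document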