Chapter-1 Introduction To NLP
Text pre-processing:
Text data derived from natural language is unstructured and noisy.
Text pre-processing involves transforming text into a clean and consistent
format that can then be fed into a model for further analysis and learning.
Text pre-processing techniques may be general so that they are applicable to
many types of applications, or they can be specialized for a specific task.
For example, the methods for processing scientific documents with equations
and other mathematical symbols can be quite different from those for dealing
with user comments on social media.
However, some steps, such as sentence segmentation, tokenization, spelling
corrections, and stemming, are common to both.
The following sections cover what you need to know about text pre-processing to
improve your natural language processing (NLP) applications.
Techniques and methods used in NLP:
Syntax and semantic analysis are two main techniques used with natural
language processing.
Syntax is the arrangement of words in a sentence to make grammatical
sense.
NLP uses syntax to assess meaning from a language based on grammatical
rules.
Syntax techniques include:
1. Parsing.
This is the grammatical analysis of a sentence.
2. Word segmentation.
This is the act of taking a string of text and deriving word forms from it.
Example: A person scans a handwritten document into a computer.
The algorithm would be able to analyze the page and recognize that the words are
divided by white spaces.
3. Sentence breaking.
This places sentence boundaries in large texts.
Example: A natural language processing algorithm is fed the text, "The
dog barked. I woke up."
The algorithm can recognize the period that splits up the sentences using
sentence breaking.
4. Morphological segmentation.
This divides words into smaller parts called morphemes.
Example: The word untestably would be broken into [[un[[test]able]]ly],
where the algorithm recognizes "un," "test," "able" and "ly" as
morphemes.
This is especially useful in machine translation and speech recognition.
5. Stemming.
This reduces inflected words to their root forms.
Example: In the sentence, "The dog barked," the algorithm would be able
to recognize the root of the word "barked" is "bark."
This would be useful if a user was analyzing a text for all instances of the
word bark, as well as all of its conjugations.
The algorithm can see that they are essentially the same word even
though the letters are different.
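To make this concrete, here is a minimal stemming sketch using NLTK's PorterStemmer; this assumes nltk is installed, and the word list is illustrative:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["bark", "barked", "barking", "barks"]:
    # All four inflected forms reduce to the same stem, "bark".
    print(word, "->", stemmer.stem(word))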
Semantics concerns the use of words and the meaning behind them.
Natural language processing applies algorithms to understand the
meaning and structure of sentences.
Semantics techniques include:
1. Word sense disambiguation.
This derives the meaning of a word based on context.
Example: Consider the sentence, "The pig is in the pen."
The word pen has different meanings.
An algorithm using this method can understand that the use of the
word pen here refers to a fenced-in area, not a writing implement.
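As an illustration, NLTK ships a simplified Lesk implementation for word sense disambiguation. This is a sketch assuming nltk and its 'wordnet' corpus are installed; note that simplified Lesk is only a baseline and may not always pick the enclosure sense for this sentence:

from nltk.wsd import lesk

# Disambiguate "pen" using the surrounding sentence as context.
context = "The pig is in the pen".split()
sense = lesk(context, "pen")   # returns a WordNet Synset, or None
if sense is not None:
    print(sense.name(), "-", sense.definition())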
2. Named entity recognition.
This determines words that can be categorized into groups.
Example: An algorithm using this method could analyze a news article and
identify all mentions of a certain company or product.
Using the semantics of the text, it would be able to differentiate between
entities that are visually the same.
For instance, in the sentence, "Daniel McDonald's son went to McDonald's
and ordered a Happy Meal," the algorithm could recognize the two instances
of "McDonald's" as two separate entities -- one a restaurant and one a
person.
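A minimal NER sketch using spaCy, assuming spacy and its small English model en_core_web_sm are installed (the exact entity labels depend on the model's predictions):

import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline with an NER component
doc = nlp("Daniel McDonald's son went to McDonald's and ordered a Happy Meal.")
for ent in doc.ents:
    # Each entity carries a predicted label such as PERSON or ORG.
    print(ent.text, "->", ent.label_)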
3. Natural language generation.
This uses a database to determine semantics behind words and generate
new text.
Example: An algorithm could automatically write a summary of findings from a
business intelligence platform, mapping certain words and phrases to
features of the data in the BI platform.
Another example would be automatically generating news articles or tweets
based on a certain body of text used for training.
Text pre-processing pipeline (part of text pre-processing):
1. Tokenization:
Tokenization splits the raw text into smaller units called tokens, such as words or sentences.
2. Lower casing:
Lower casing converts all characters in the text to lowercase so that, for example, "Bark" and "bark" are treated as the same token:
sentence = "The Dog Barked"
print(sentence.lower())   # the dog barked
3. Stemming:
Stemming and lemmatization are used to reduce words to their base form,
which can help reduce the vocabulary size and simplify the text.
Stemming involves stripping the suffixes from words to get their stem,
whereas lemmatization involves reducing words to their base form based on
their part of speech.
These steps are commonly used in various NLP tasks such as text
classification, information retrieval, and topic modelling; a short code
sketch of these pipeline steps follows the lemmatization step below.
4. Lemmatization:
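As referenced above, here is a minimal sketch of pipeline steps 1 through 4 using NLTK; this assumes nltk is installed along with its 'punkt' and 'wordnet' resources, and the example sentence is illustrative:

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = "The striped Bats are hanging on their Feet"
tokens = word_tokenize(sentence)                             # 1. Tokenization
tokens = [t.lower() for t in tokens]                         # 2. Lower casing
stems = [PorterStemmer().stem(t) for t in tokens]            # 3. Stemming
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # 4. Lemmatization
print(stems)   # e.g. ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet']
print(lemmas)  # e.g. ['the', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot']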
5. Regular Expressions:
A regular expression (RE) is a language for specifying text search strings.
RE helps us to match or find other strings or sets of strings, using a
specialized syntax held in a pattern.
Regular expressions are used to search texts in UNIX as well as in MS
WORD in an identical way.
We have various search engines using a number of RE features.
Properties of Regular Expressions
Following are some of the important properties of RE:
American Mathematician Stephen Cole Kleene formalized the Regular Expression
language.
RE is a formula in a special language, which can be used for specifying simple
classes of strings, a sequence of symbols.
In other words, we can say that RE is an algebraic notation for characterizing a set of
strings.
A regular expression requires two things: one is the pattern that we wish to
search for, and the other is a corpus of text in which we need to search.
Mathematically, a Regular Expression can be defined as follows:
- ε is a Regular Expression, which denotes the language containing only the empty string.
- φ is a Regular Expression, which denotes the empty language.
- If X and Y are Regular Expressions, then X.Y (the concatenation of X and Y), X+Y (the union of X and Y), and X*, Y* (the Kleene closure of X and Y) are also Regular Expressions.
- Any expression derived from the above rules is also a Regular Expression.
Some examples of Regular Expressions and the languages they describe:
(0 + 10*): {0, 1, 10, 100, 1000, 10000, …}
(a+b)*: the set of strings of a's and b's of any length, which also includes the null string, i.e. {ε, a, b, aa, ab, bb, ba, aaa, …}
(a+b)*abb: the set of strings of a's and b's ending with the string abb, i.e. {abb, aabb, babb, aaabb, ababb, …}
(11)*: the set consisting of an even number of 1's, which also includes the empty string, i.e. {ε, 11, 1111, 111111, …}
(aa)*(bb)*b: the set of strings consisting of an even number of a's followed by an odd number of b's, i.e. {b, aab, aabbb, aabbbbb, aaaab, aaaabbb, …}
(aa + ab + ba + bb)*: strings of a's and b's of even length that can be obtained by concatenating any combination of the strings aa, ab, ba and bb, including null, i.e. {ε, aa, ab, ba, bb, aaab, aaba, …}
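Such patterns can be checked programmatically with Python's re module. A small sketch; note that Python writes the union operator as '|' rather than '+', so (a+b)*abb becomes (a|b)*abb:

import re

# (a|b)*abb: strings of a's and b's ending with "abb".
pattern = re.compile(r"(a|b)*abb")
for s in ["abb", "aabb", "babb", "ababb", "ab"]:
    # fullmatch requires the entire string to match, not just a prefix.
    print(s, "->", bool(pattern.fullmatch(s)))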
Source: https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_word_level_analysis.htm
Minimum edit distance:
Many NLP tasks are concerned with measuring how similar two strings are.
Spell correction: the user typed "graffe". Which is closest: graf, grail, or giraffe?
The word giraffe, which differs by only one letter from graffe, seems intuitively more
similar than, say, grail or graf.
The minimum edit distance between two strings is defined as the minimum number of editing operations
(insertion, deletion, substitution) needed to transform one string into another.
The minimum edit distance between intention and execution can be visualized using their alignment.
Given two sequences, an alignment is a correspondence between substrings of the two sequences.
Given two strings str1 and str2, and the below operations that can be performed on str1, find the
minimum number of edits (operations) required to convert str1 into str2.
1. Insert
2. Remove
3. Replace
All of the above operations are of equal cost.
Input: str1 = “geek”, str2 = “gesek”
Output: 1
Explanation: We can convert str1 into str2 by inserting an 's'.
Input: str1 = “cat”, str2 = “cut”
Output: 1
Explanation: We can convert str1 into str2 by replacing ‘a’ with ‘u’.
Input: str1 = “sunday”, str2 = “saturday”
Output: 3
Explanation: The first character and the last three characters are the same. We basically need to
convert "un" to "atur". This can be done using the below three operations:
replace 'n' with 'r', insert 't', insert 'a'.
https://youtu.be/We3YDTzNXEk
In this method, we use a bottom-up approach to compute the edit distance between str1 and
str2.
We start by computing the edit distance for smaller subproblems and use the results of these
smaller subproblems to compute results for subsequent larger problems.
The results are stored in a two-dimensional array as shown below.
Each cell (m, n) of this array represents the distance between the first 'm' characters of str1
and the first 'n' characters of str2.
For example, when 'm' is 0, the distance between str1, which is of 0 length, and str2 of
length 'n' is 'n'.
This corresponds to the 0th row of the matrix.
The same holds for the values in the 0th column, where str2 is of 0 length.
Now in this matrix, consider the cell (m, n), which represents the distance between str1 of
length 'm' characters and str2 of length 'n' characters. If the 'm'th character of str1 and the
'n'th character of str2 are the same, then we simply fill cell (m, n) using the value of cell
(m-1, n-1), which represents the edit distance between the first 'm-1' characters of str1 and
the first 'n-1' characters of str2.
If the 'm'th character of str1 is not equal to the 'n'th character of str2, then we choose the
minimum value from the following three cases:
1. Delete the 'm'th character of str1 and compute the edit distance between 'm-1'
characters of str1 and 'n' characters of str2.
For this computation, we simply do (1 + array[m-1][n]),
where 1 is the cost of the delete operation and array[m-1][n] is the edit
distance between 'm-1' characters of str1 and 'n' characters of str2.
2. Similarly, for the second case of inserting the last character of str2 into str1, we
do (1 + array[m][n-1]).
3. And for the third case of substituting the last character of str1 with the last
character of str2, we use (1 + array[m-1][n-1]).
See the function find_distance(str1, str2) in the sketch below for implementation
details.
The time and space complexity of this method is O(mn) where 'm' is the length of str1 and
'n' is the length of str2.
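A minimal bottom-up sketch in Python of the computation described above; find_distance is our own illustrative name, not a library function:

def find_distance(str1, str2):
    m, n = len(str1), len(str2)
    # array[i][j] holds the edit distance between the first i characters of
    # str1 and the first j characters of str2.
    array = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        array[i][0] = i              # delete all i characters of str1
    for j in range(n + 1):
        array[0][j] = j              # insert all j characters of str2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                array[i][j] = array[i - 1][j - 1]           # characters match
            else:
                array[i][j] = 1 + min(array[i - 1][j],      # delete
                                      array[i][j - 1],      # insert
                                      array[i - 1][j - 1])  # substitute
    return array[m][n]

print(find_distance("sunday", "saturday"))  # 3
print(find_distance("cat", "cut"))          # 1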
Edit distance isn’t sufficient We often need to align each character of the two strings to each
other.
We do this by keeping a “backtrace”
Every time we enter a cell, remember where we came from
When we reach the end, – Trace back the path from the upper right corner to read off the
alignment
Time complexity: O(nm)
Space complexity: O(nm)
Back trace: O(n+m)
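A minimal sketch of reading off an alignment with a backtrace under the same unit-cost assumptions; here the predecessor of each cell is recomputed while walking back rather than stored explicitly:

def align(str1, str2):
    m, n = len(str1), len(str2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete
                          d[i][j - 1] + 1,          # insert
                          d[i - 1][j - 1] + cost)   # keep or substitute
    # Walk back from the final cell to (0, 0), recording one operation per step.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and str1[i - 1] == str2[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            ops.append("keep " + str1[i - 1] if cost == 0
                       else "substitute " + str1[i - 1] + "->" + str2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("delete " + str1[i - 1])
            i -= 1
        else:
            ops.append("insert " + str2[j - 1])
            j -= 1
    return list(reversed(ops))

print(align("intention", "execution"))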
Part-of-speech (POS) tagging:
POS tagging is the process of labelling each word in a text with its part of speech.
This can include nouns, verbs, adjectives, and other grammatical categories.
POS tagging is useful for a variety of NLP tasks, such as information extraction,
named entity recognition, and machine translation.
The algorithm learns to predict the correct POS tag for a given word based on
the context in which it appears.
There are various POS tagging schemes that have been developed, each with its
own set of tags and rules.
Some common POS tagging schemes include the Penn Treebank tag set and
the Universal Dependencies tag set.
Example: Consider the sentence "The cat sat on the mat." Its POS tags:
The: determiner
cat: noun
sat: verb
on: preposition
the: determiner
mat: noun
In this example, each word in the sentence has been labeled with its
corresponding part of speech.
The determiner “the” is used to identify specific nouns, while the noun “cat”
refers to a specific animal.
The verb “sat” describes an action, and the preposition “on” describes the
relationship between the cat and the mat.
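A minimal sketch of producing such tags with NLTK's pos_tag; this assumes nltk plus its 'punkt' and 'averaged_perceptron_tagger' resources are installed, and it outputs Penn Treebank tags such as DT, NN, VBD, and IN rather than the plain category names above:

import nltk

tokens = nltk.word_tokenize("The cat sat on the mat")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
#       ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]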
Identifying the part of speech of a word is not just a matter of mapping words to their
respective POS tags.
The same word might have a different part-of-speech tag depending on the context.
Thus it is not possible to have one common mapping for part-of-speech tags.
And when you have a huge corpus, manually finding a different part of speech for
each word is not a scalable solution.
But why are we tagging these words with their parts of speech?
There are several reasons why we might tag words with their parts of speech
(POS) in natural language processing (NLP):