
Chapter-1 Introduction to NLP

 Introduction to Natural Language Processing: -


 NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence.
 It is the technology that machines use to understand, analyse, manipulate, and interpret human languages.
 It helps developers to organize knowledge for performing tasks such as translation,
automatic summarization, Named Entity Recognition (NER), speech recognition,
relationship extraction, and topic segmentation.

 Text pre-processing: -
 Text data derived from natural language is unstructured and noisy.
 Text pre-processing involves transforming text into a clean and consistent
format that can then be fed into a model for further analysis and learning.
 Text pre-processing techniques may be general so that they are applicable to
many types of applications, or they can be specialized for a specific task.
 For example, the methods for processing scientific documents with equations
and other mathematical symbols can be quite different from those for dealing
with user comments on social media.
 However, some steps, such as sentence segmentation, tokenization, spelling
corrections, and stemming, are common to both.
 Here's what you need to know about text preprocessing to improve your
natural language processing (NLP).
 Techniques and methods used in NLP:-
 Syntax and semantic analysis are two main techniques used with natural
language processing.
 Syntax is the arrangement of words in a sentence to make grammatical
sense.
 NLP uses syntax to assess meaning from a language based on grammatical
rules.
 Syntax techniques include:

1. Parsing

 Parsing is the process of figuring out the grammatical structure of a sentence: determining which words belong together as phrases and which are the subject or object of a verb.
 This NLP technique offers additional context about a text in order to help with processing and analyzing it accurately.
 The sketch below shows how parsing might work on a short sentence.
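
 Below is a minimal parsing sketch using NLTK's rule-based chunker (not part of the original notes; it assumes the nltk package with its punkt and tagger resources installed, and the tiny NP grammar is an illustrative assumption):

import nltk

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # chunking needs POS tags first

# Toy grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
tree.pretty_print()  # shows which words group together as phrases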

2. Word segmentation.
 This is the act of taking a string of text and deriving word forms from it.
 Example: A person scans a handwritten document into a computer.
 The algorithm would be able to analyze the page and recognize that the words are
divided by white spaces.
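
 A minimal sketch of whitespace-based word segmentation in plain Python (the example string is an assumption, not from the scanned-document scenario above):

text = "the words are divided by white spaces"
words = text.split()  # splits on runs of whitespace
print(words)  # ['the', 'words', 'are', 'divided', 'by', 'white', 'spaces']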
3. Sentence breaking.
 This places sentence boundaries in large texts.
 Example: A natural language processing algorithm is fed the text, "The
dog barked. I woke up."
 The algorithm can recognize the period that splits up the sentences using
sentence breaking.
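
 A minimal sentence-breaking sketch using NLTK's sent_tokenize on the example above (assumes the nltk package and its punkt model, e.g. via nltk.download("punkt")):

from nltk.tokenize import sent_tokenize

text = "The dog barked. I woke up."
print(sent_tokenize(text))  # ['The dog barked.', 'I woke up.']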
4. Morphological segmentation.
 This divides words into smaller parts called morphemes.
 Example: The word untestably would be broken into [[un[[test]able]]ly],
where the algorithm recognizes "un," "test," "able" and "ly" as
morphemes.
 This is especially useful in machine translation and speech recognition.
5. Stemming.
 This reduces inflected words to their root forms.
 Example: In the sentence, "The dog barked," the algorithm would be able
to recognize the root of the word "barked" is "bark."
 This would be useful if a user was analyzing a text for all instances of the
word bark, as well as all of its conjugations.
 The algorithm can see that they are essentially the same word even
though the letters are different.
 Semantics involves the use of and meaning behind words.
 Natural language processing applies algorithms to understand the
meaning and structure of sentences.
 Semantics techniques include:
1. Word sense disambiguation.
 This derives the meaning of a word based on context.
 Example: Consider the sentence, "The pig is in the pen."
 The word pen has different meanings.
 An algorithm using this method can understand that the use of the
word pen here refers to a fenced-in area, not a writing implement.
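
 A minimal word sense disambiguation sketch using the classic Lesk algorithm from NLTK (one possible method, assuming nltk plus its wordnet data; Lesk is a heuristic and may not always pick the intended sense):

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "The pig is in the pen."
sense = lesk(word_tokenize(sentence), "pen")
print(sense, "-", sense.definition())  # one WordNet sense of "pen"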
2. Named entity recognition.
 This determines words that can be categorized into groups.
 Example: An algorithm using this method could analyze a news article and
identify all mentions of a certain company or product.
 Using the semantics of the text, it would be able to differentiate between
entities that are visually the same.
 For instance, in the sentence, "Daniel McDonald's son went to McDonald's
and ordered a Happy Meal," the algorithm could recognize the two instances
of "McDonald's" as two separate entities -- one a restaurant and one a
person.
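
 A minimal NER sketch using spaCy (one common toolkit, not prescribed by these notes; assumes the spacy package and its small English model, installed via: python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Daniel McDonald's son went to McDonald's and ordered a Happy Meal")
for ent in doc.ents:
    # entity labels (e.g. PERSON vs. ORG) depend on the model's predictions
    print(ent.text, ent.label_)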
3. Natural language generation.
 This uses a database to determine semantics behind words and generate
new text.
 Example: An algorithm could automatically write a summary of findings from a
business intelligence platform, mapping certain words and phrases to
features of the data in the BI platform.
 Another example would be automatically generating news articles or tweets
based on a certain body of text used for training.
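
 A minimal template-based sketch of the idea: the record below is hypothetical BI data, and real NLG systems use far richer grammars or learned models:

record = {"region": "North", "sales": 120000, "change": 8.5}  # hypothetical data
summary = (f"Sales in the {record['region']} region reached "
           f"{record['sales']:,}, up {record['change']}% on the previous quarter.")
print(summary)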
 Text pre-processing pipeline:-

1. Tokenization: -

 The tokenization stage involves converting a sentence into a stream of words, also called “tokens.”
 Tokens are the basic building blocks upon which analysis and other
methods are built.
 NLP toolkits allow users to input multiple criteria based on which
word boundaries are determined.
 For example, you can use whitespace or punctuation to determine
if one word has ended and the next one has started.
 Again, in some instances, these rules might fail.
 For example, don’t, it’s, etc. are words themselves that contain
punctuation marks and have to be dealt with separately.

OR

 Tokenization is the process of segmenting the text into a list of tokens.
 In the case of sentence tokenization, the tokens will be sentences, and in the case of word tokenization, they will be words.
 It is a good idea to first complete sentence tokenization and then word tokenization; the output will then be a list of lists.
 Tokenization is performed in each & every NLP pipeline.
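
 A minimal tokenization sketch with NLTK (assumes nltk and its punkt model): sentence tokenization first, then word tokenization, giving the list of lists described above:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "The dog barked. I woke up."
tokens = [word_tokenize(s) for s in sent_tokenize(text)]
print(tokens)  # [['The', 'dog', 'barked', '.'], ['I', 'woke', 'up', '.']]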

2. Lower casing: -

 This step is used to convert all the text to lowercase letters.


 This is useful in various NLP tasks such as text classification,
information retrieval, and sentiment analysis.
 Code:-

sentence = "Books are on the table"

sentence = sentence.lower()

print(sentence)  # books are on the table

3. Stop word removal:


 What are Stop words?
 Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”)
that a search engine has been programmed to ignore, both when indexing entries
for searching and when retrieving them as the result of a search query.
 We would not want these words to take up space in our database or take up valuable processing time.
 For this, we can remove them easily by storing a list of words that we consider to be stop words.
 NLTK (Natural Language Toolkit) in Python has lists of stop words stored in 16 different languages.
 Stop words are commonly occurring words in a language such as “the”, “and”, “a”,
etc.
 They are usually removed from the text during preprocessing because they do not
carry much meaning and can cause noise in the data.
 This step is used in various NLP tasks such as text classification, information
retrieval, and topic modelling.
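
 A minimal stop word removal sketch using NLTK's built-in English list (assumes nltk plus its stopwords corpus and punkt model, e.g. nltk.download("stopwords")):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is a sample sentence for stop word removal")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['sample', 'sentence', 'stop', 'word', 'removal']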
4. Stemming

 Stemming and lemmatization are used to reduce words to their base form,
which can help reduce the vocabulary size and simplify the text.
 Stemming involves stripping the suffixes from words to get their stem,
whereas lemmatization involves reducing words to their base form based on
their part of speech.
 This step is commonly used in various NLP tasks such as text
classification, information retrieval, and topic modelling.
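
 A minimal stemming sketch with NLTK's PorterStemmer (assumes the nltk package):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["barked", "barking", "barks"]:
    print(word, "->", stemmer.stem(word))  # all three reduce to "bark"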

5. Lemmatization:

 Unlike stemming, lemmatization reduces words to a word existing in the language.
 For lemmatization to resolve a word to its lemma, part of speech of
the word is required.
 This helps in transforming the word into a proper root form.
 Lemmatization is preferred over Stemming because lemmatization
does a morphological analysis of the words.
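
 A minimal lemmatization sketch with NLTK's WordNetLemmatizer (assumes nltk plus its wordnet data); note how the part-of-speech argument changes the result:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # good (as an adjective)
print(lemmatizer.lemmatize("barked", pos="v"))  # bark (as a verb)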

6. Regular Expressions
 A regular expression (RE) is a language for specifying text search strings.
 RE helps us to match or find other strings or sets of strings, using a
specialized syntax held in a pattern.
 Regular expressions are used to search texts in UNIX as well as in MS WORD in an identical way.
 We have various search engines using a number of RE features.
 Properties of Regular Expressions
 Following are some of the important properties of RE −
 American Mathematician Stephen Cole Kleene formalized the Regular Expression
language.
 RE is a formula in a special language, which can be used for specifying simple
classes of strings, a sequence of symbols.
 In other words, we can say that RE is an algebraic notation for characterizing a set of
strings.
 A regular expression search requires two things: the pattern that we wish to search for, and a corpus of text to search within.
Mathematically, a Regular Expression can be defined as follows −
 ε is a Regular Expression, which denotes the language containing only the empty string.
 φ is a Regular Expression, which denotes the empty language.
 If X and Y are Regular Expressions, then the following are also Regular Expressions:
o X.Y (concatenation of X and Y)
o X+Y (union of X and Y)
o X*, Y* (Kleene closure of X and Y)
 Anything derived from the above rules is also a regular expression.
Examples of regular expressions and the string sets they describe:

(0 + 10*)        {0, 1, 10, 100, 1000, 10000, …}
(0*10*)          {1, 01, 10, 010, 0010, …}
(0 + ε)(1 + ε)   {ε, 0, 1, 01}
(a+b)*           The set of strings of a's and b's of any length, including the null string: {ε, a, b, aa, ab, bb, ba, aaa, …}
(a+b)*abb        The set of strings of a's and b's ending with the string abb: {abb, aabb, babb, aaabb, ababb, …}
(11)*            The set consisting of an even number of 1's, including the empty string: {ε, 11, 1111, 111111, …}
(aa)*(bb)*b      The set of strings with an even number of a's followed by an odd number of b's: {b, aab, aabbb, aabbbbb, aaaab, aaaabbb, …}
(aa + ab + ba + bb)*  Strings of a's and b's of even length formed by concatenating any combination of aa, ab, ba, and bb, including the null string: {ε, aa, ab, ba, bb, aaab, aaba, …}
Source: https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_word_level_analysis.htm
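
 A minimal sketch of the pattern-plus-corpus idea using Python's re module (the pattern and corpus here are illustrative assumptions):

import re

corpus = "The dog barked. The dogs bark loudly."
pattern = r"dogs?"  # '?' makes the trailing 's' optional
print(re.findall(pattern, corpus))  # ['dog', 'dogs']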
 Minimum edit distance:-
 Many NLP tasks are concerned with measuring how similar two strings are.
 Spell correction: the user typed “graffe”. Which is closest: graf, grail, or giraffe?
 The word giraffe, which differs by only one letter from graffe, seems intuitively more similar than, say, grail or graf.
 The minimum edit distance between two strings is defined as the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into another.
 The minimum edit distance between intention and execution can be visualized using their alignment.
 Given two sequences, an alignment is a correspondence between substrings of the two sequences.
 Given two strings str1 and str2 and the below operations that can be performed on str1, find the minimum number of edits (operations) required to convert str1 into str2.

1. Insert
2. Remove
3. Replace
 All of the above operations are of equal cost.
 Input: str1 = “geek”, str2 = “gesek”
 Output: 1
 Explanation: We can convert str1 into str2 by inserting an ‘s’.
 Input: str1 = “cat”, str2 = “cut”
 Output: 1
 Explanation: We can convert str1 into str2 by replacing ‘a’ with ‘u’.
 Input: str1 = “sunday”, str2 = “saturday”
 Output: 3
 Explanation: The last three characters and the first character are the same. We basically need to convert “un” to “atur”.
 This can be done using below three operations.
 Replace ‘n’ with ‘r’, insert t, insert a.
 Reference: https://youtu.be/We3YDTzNXEk

 In this method, we use a bottom-up approach to compute the edit distance between str1 and str2.
 We start by computing edit distance for smaller subproblems and use the results of these
smaller subproblems to compute results for subsequent larger problems.
 The results are stored in a two-dimensional array.
 Each cell (m, n) of this array represents the distance between the first 'm' characters of str1 and the first 'n' characters of str2.
 For example, when 'm' is 0, the distance between str1 of 0 length and str2 of 'n' length is 'n'; this fills the 0th row of the matrix.
 The same is the case for the values in the 0th column, where str2 is of 0 length.
 Now, for cell (m, n), which represents the distance between str1 of length 'm' characters and str2 of length 'n' characters: if the 'm'th character of str1 and the 'n'th character of str2 are the same, then we simply fill cell (m, n) using the value of cell (m-1, n-1), which represents the edit distance between the first 'm-1' characters of str1 and the first 'n-1' characters of str2.
 If 'm'th character of str1 is not equal to 'n'th character of str2, then we choose minimum
value from following three cases-
1. Delete 'm'th character of str1 and compute edit distance between 'm-1'
characters of str1 and 'n' characters of str2.
 For this computation, we simply have to do - (1 + array[m-1][n])
where 1 is the cost of delete operation and array[m-1][n] is edit
distance between 'm-1' characters of str1 and 'n' characters of str2.
2. Similarly, for the second case of inserting last character of str2 into str1, we
have to do - (1 + array[m][n-1]).
3. And for the third case, substituting the last character of str1 by the last character of str2, we use - (1 + array[m-1][n-1]).
 See the findDistance(str1, str2) sketch below for implementation details.
 The time and space complexity of this method is O(mn) where 'm' is the length of str1 and
'n' is the length of str2.
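
 A minimal bottom-up sketch of the approach described above, with equal cost 1 for insert, delete, and replace (the findDistance name follows the text's reference; this Python version is a reconstruction, not the original snippet):

def findDistance(str1, str2):
    m, n = len(str1), len(str2)
    # dist[i][j] = edit distance between the first i characters of str1
    # and the first j characters of str2
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]  # characters match
            else:
                dist[i][j] = 1 + min(dist[i - 1][j],      # delete
                                     dist[i][j - 1],      # insert
                                     dist[i - 1][j - 1])  # replace
    return dist[m][n]

print(findDistance("geek", "gesek"))       # 1
print(findDistance("cat", "cut"))          # 1
print(findDistance("sunday", "saturday"))  # 3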
 Edit distance alone isn’t sufficient; we often need to align each character of the two strings to each other.
 We do this by keeping a “backtrace”.
 Every time we enter a cell, we remember where we came from.
 When we reach the end, we trace the path back from the final cell to read off the alignment.
 Time complexity: O(nm)
 Space complexity: O(nm)
 Backtrace: O(n+m)

 POS (Parts of speech) tagging:-

 Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each word in a text is labeled with its corresponding part of speech.

 This can include nouns, verbs, adjectives, and other grammatical categories.

 POS tagging is useful for a variety of NLP tasks, such as information extraction,
named entity recognition, and machine translation.

 It can also be used to identify the grammatical structure of a sentence and to disambiguate words that have multiple meanings.

 POS tagging is typically performed using machine learning algorithms, which are trained on a large annotated corpus of text.

 The algorithm learns to predict the correct POS tag for a given word based on
the context in which it appears.

 There are various POS tagging schemes that have been developed, each with its
own set of tags and rules.

 Some common POS tagging schemes include the Penn Treebank tag set and the Universal Dependencies tag set.

 Let’s take an example,

 Text: “The cat sat on the mat.”

 POS tags:
 The: determiner

 cat: noun

 sat: verb

 on: preposition

 the: determiner

 mat: noun

 In this example, each word in the sentence has been labeled with its
corresponding part of speech.

 The determiner “the” is used to identify specific nouns, while the noun “cat”
refers to a specific animal.

 The verb “sat” describes an action, and the preposition “on” describes the
relationship between the cat and the mat.

 POS tagging is a useful tool in natural language processing (NLP), as it allows algorithms to understand the grammatical structure of a sentence and to disambiguate words that have multiple meanings.

 It is typically performed using machine learning algorithms that are trained on a large annotated corpus of text.

 Identifying the part of speech of a word is not just a matter of mapping words to their respective POS tags.
 The same word might have a different part-of-speech tag depending on its context.
 Thus, it is not possible to have a fixed, common mapping from words to part-of-speech tags.
 When you have a huge corpus, manually finding the part of speech for each word is not a scalable solution, as tagging itself might take days.
 This is why we rely on tool-based POS tagging, as in the sketch below.
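
 A minimal tool-based tagging sketch with NLTK (assumes nltk plus its punkt and averaged_perceptron_tagger resources); the tags shown follow the Penn Treebank tag set:

import nltk

tokens = nltk.word_tokenize("The cat sat on the mat.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#  ('the', 'DT'), ('mat', 'NN'), ('.', '.')]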

 But why are we tagging these words with their parts of speech?

 Use of parts of Speech Tagging in NLP

 There are several reasons why we might tag words with their parts of speech
(POS) in natural language processing (NLP):

1. To understand the grammatical structure of a sentence:
 By labeling each word with its POS, we can better understand the syntax and structure of a sentence.
 This is useful for tasks such as machine translation and information extraction, where it is important to know how words relate to each other in the sentence.
2. To disambiguate words with multiple meanings:
 Some words, such as “bank,” can have multiple meanings depending
on the context in which they are used.
 By labeling each word with its POS, we can disambiguate these
words and better understand their intended meaning.
3. To improve the accuracy of NLP tasks:
 POS tagging can help improve the performance of various NLP tasks,
such as named entity recognition and text classification.
 By providing additional context and information about the words in
a text, we can build more accurate and sophisticated algorithms.
4. To facilitate research in linguistics:
 POS tagging can also be used to study the patterns and
characteristics of language use and to gain insights into the structure
and function of different parts of speech.

 Steps involved in POS tagging:-

 Here are the steps involved in a typical example of part-of-speech (POS) tagging in natural language processing (NLP); a condensed code sketch follows the list:
1. Collect a dataset of annotated text:
 This dataset will be used to train and test the POS tagger.
 The text should be annotated with the correct POS tags for each
word.
2. Preprocess the text:
 This may include tasks such as tokenization (splitting the text into
individual words), lowercasing, and removing punctuation.
3. Divide the dataset into training and testing sets:
 The training set will be used to train the POS tagger, and the testing
set will be used to evaluate its performance.
4. Train the POS tagger:
 This may involve building a statistical model, such as a hidden
Markov model (HMM), or defining a set of rules for a rule-based or
transformation-based tagger.
 The model or rules will be trained on the annotated text in the
training set.
5. Test the POS tagger:
 Use the trained model or rules to predict the POS tags of the words
in the testing set.
 Compare the predicted tags to the true tags and calculate metrics
such as precision and recall to evaluate the performance of the
tagger.
6. Fine-tune the POS tagger:
 If the performance of the tagger is not satisfactory, adjust the model
or rules and repeat the training and testing process until the desired
level of accuracy is achieved.
7. Use the POS tagger:
 Once the tagger is trained and tested, it can be used to perform POS
tagging on new, unseen text.
 This may involve preprocessing the text and inputting it into the
trained model or applying the rules to the text.
 The output will be the predicted POS tags for each word in the text.
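
 A condensed sketch of these steps using NLTK's HMM tagger and its annotated Treebank sample (assumes nltk plus nltk.download("treebank"); the 90/10 split and the use of an HMM are illustrative choices):

from nltk.corpus import treebank
from nltk.tag import hmm

tagged_sents = list(treebank.tagged_sents())  # step 1: annotated data
split = int(len(tagged_sents) * 0.9)          # step 3: train/test split
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

trainer = hmm.HiddenMarkovModelTrainer()      # step 4: train the tagger
tagger = trainer.train_supervised(train_sents)

print(tagger.accuracy(test_sents))            # step 5: evaluate (nltk >= 3.6)
print(tagger.tag(["The", "cat", "sat"]))      # step 7: tag new, unseen text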
 Application of POS Tagging
 There are several real-life applications of part-of-speech (POS) tagging in natural language
processing (NLP):
1. Information extraction:
 POS tagging can be used to identify specific types of information in a
text, such as names, locations, and organizations.
 This is useful for tasks such as extracting data from news articles or
building knowledge bases for artificial intelligence systems.
2. Named entity recognition:
 POS tagging can be used to identify and classify named entities in a
text, such as people, places, and organizations.
 This is useful for tasks such as building customer profiles or
identifying key figures in a news story.
3. Text classification:
 POS tagging can be used to help classify texts into different categories, for tasks such as spam filtering or sentiment analysis.
 By analysing the POS tags of the words in a text, algorithms can
better understand the content and tone of the text.
4. Machine translation:
 POS tagging can be used to help translate texts from one language
to another by identifying the grammatical structure and
relationships between words in the source language and mapping
them to the target language.
5. Natural language generation:
 POS tagging can be used to generate natural-sounding text by
selecting appropriate words and constructing grammatically correct
sentences.
 This is useful for tasks such as chatbots and virtual assistants.
