0% found this document useful (0 votes)

52 views59 pages

Unit 1 Introduction To NLP

Uploaded by

HARSHINI RAVVA

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

52 views59 pages

Unit 1 Introduction To NLP

Uploaded by

HARSHINI RAVVA

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 59

lOMoARcPSD|34540975

Unit 1 introduction to nlp

Natural Language Processing (Jawaharlal Nehru Technological University, Hyderabad)

Scan to open on Studocu

Studocu is not sponsored or endorsed by any college or university

Downloaded by Rudrarapu Sudeepthi ([email protected])
lOMoARcPSD|34540975

Unit-1
UNIT - I Finding the Structure of Words: Words and Their Components, Issues and
Challenges, Morphological Models Finding the Structure ofDocuments: Introduction,
Methods, Complexity of the Approaches, Performances of the Approaches

NLP INTRODUCTION:

Natural Language Processing (NLP) refers to AI method of communicating with an

intelligent systems using a natural language such as English.
Processing of Natural Language is required when you want an intelligent system like robot
to perform as per your instructions, when you want to hear decision from a dialogue based
etc.
The field of NLP involves making computers to perform useful tasks with the natural
languages humans use. The input and output of an NLP system can be −
 Speech
 Written Text
Components of NLP
There are two components of NLP as given −
Natural Language Understanding (NLU)
Understanding involves the following tasks −
 Mapping the given input in natural language into useful representations.
 Analyzing different aspects of the language.

Natural Language Generation (NLG)

It is the process of producing meaningful phrases and sentences in the form of natural
language from some internal representation.
It involves −
 Text planning − It includes retrieving the relevant content from knowledge base.
 Sentence planning − It includes choosing required words, forming meaningful
phrases, setting tone of the sentence.
 Text Realization − It is mapping sentence plan into sentence structure.

Words and Their components

The general objective of an Information Retrieval Systemis to minimize the overhead of a

user locating needed information.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Overhead can be expressed as the time a user spends in all of the steps leading to reading an
item containing the needed information (e.g., query generation,
queryexecution,scanningresultsofquerytoselectitemsto read,readingnon-relevantitems).

The two major measures commonly associated with

Precision and recall.

When a user decides to issuea search looking for information on a topic,the total

database is logically

Divided into four segments

Relevant items are those documents that contain information that helps the searcher in
answering his question.

Non-relevant items are those items that do not provide any directly useful information.

There are two possibilities with respect to each item: it can be retrieved or
notretrieved by the user'squery.

Where:

Number_Possible_Relevantarethe

Number of relevant items in the database.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Number Total Relevantis the total number of items retrieved from the query.

Number_Retrieved_Relevantis the number of items retrieved hat are relevant to the

user's search need.

Precision measures one aspect of information retrieval overhead for a user associated with a
particular search.

If a search has a 85 per cent precision,then 15 per cent of the user effort is overhead
reviewing non-relevant items.

Recall gauges how well a system processing a particular query is able to retrieve the
relevant items Functional Overview
A total Information Storage and Retrieval System is composed of four major functional
processes:

1) Item Normalization
2) Selective Dissemination of Information (i.e., “Mail”)
3) Archival Document Database Search, and an Index
4) Database Search along with the Automatic File Build process that
supportsIndex Files.
1) Item Normalization:

The first step in any integrated system is to normalize the incoming items to a standard
format. Item normalization provides logical restructuring of the item. Additional
operations during item normalization are needed to create a searchable data structure:
identification of processing tokens (e.g., words), characterization of the tokens, and
stemming (e.g., removing word endings) of the tokens.

The processing tokens and their characterization are used to define the searchable
text from the total received text. Figure 1.5 shows the normalization process.
Standardizing the input takes the different external formats of input data and performs
the translation to the formats acceptable to the system. A system may have a single
format for all items or allow multiple formats. One example of standardization could be
translation of foreign languages into Unicode. Every language has a different

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

internal binary encoding for the characters in the language. One standard encoding
that covers English, French, Spanish, etc. is ISO-Latin.

To assist the users in generating indexes, especially the professional indexers, the
system provides a process called Automatic File Build(AFB).

Multi-media adds an extra dimension to the normalization process. In addition to

normalizing the textual input, the multi-media input also needs to be standardized.
There are a lot of options to the standards being applied to the normalization. If the
input is video the likely digital standards will be either MPEG-2, MPEG-1, AVI or Real
Media. MPEG (Motion Picture Expert Group) standards are the most universal standards
for higher quality video where Real Media is the most common standard for lower quality
video being used on the Internet. Audio standards are typically WAV or Real Media (Real
Audio). Images vary from JPEG to BMP.

The next process is to parse the item into logical sub-divisions that have meaning
to the user. This process, called “Zoning,” is visible to the user and used to increase the
precision of a search and optimize the display. A typical item is sub- divided into zones,
which may overlap and can be hierarchical, such as Title, Author, Abstract, Main Text,
Conclusion, and References. The zoning information is passed to the processing token
identification operation to store the information, allowing searches to be restricted to a

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

specific zone. For example, if the user is interested in articles discussing “Einstein” then
the search should not include the Bibliography, which could include references to articles
written by “Einstein.”

Systems determine words by dividing input symbols into 3 classes:

1) Valid word symbols

2) Inter-word symbols
3) Special processing symbols.

A word is defined as a contiguous set of word symbols bounded by inter-word

symbols. In many systems inter-word symbols are non-searchable and should be carefully
selected. Examples of word symbols are alphabetic characters and numbers. Examples of
possible inter-word symbols are blanks, periods and semicolons. The exact definition of
an inter-word symbol is dependent upon the aspects of the language domain of the items
to be processed by the system. For example, an apostrophe may be of little importance if
only used for the possessive case in English,

but might be critical to represent foreign names in the database.

Next, a Stop List/Algorithm is applied to the list of potential processing tokens.

The objective of the Stop function is to save system resources by eliminating from the set
of searchable processing tokens those that have little value to the system. Given the
significant increase in available cheap memory, storage and processing power, the need
to apply the Stop function to processing tokens is decreasing.

Examples of Stop algorithms are: Stop all numbers greater than “999999” (this
was selected to allow dates to be searchable) Stop any processing token that has
numbers and characters intermixed

2) Selective Dissemination (Distribution, Spreading) of Information

The Selective Dissemination of Information (Mail) Process provides the capability to

dynamically compare newly received items in the information system against standing
statements of interest of users and deliver the item to those users whose statement of
interest matches the contents of the item. The Mail process is composed of the search

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

process, user statements of interest (Profiles) and user mail files. As each item is received,
it is processed against every user’s profile. A profile contains a typically broad search
statement along with a list of user mail files that will receive the document if the search
statement in the profile is satisfied. Selective Dissemination of Information has not yet
been applied to multimedia sources.

3) Document Database Search

The Document Database Search Process provides the capability for a query to search
against all items received by the system. The Document Database Search process is
composed of the search process, user entered queries (typically ad hoc queries) and the
document database which contains all items that have been received, processed and
stored by the system. Typically items in the Document Database do not change (i.e., are
not edited) once received.

Index Database Search

When an item is determined to be of interest, a user may want to save it for future
reference. This is in effect filing it. In an information system this is accomplished via

the index process. In this process the user can logically store an item in a file along with
additional index terms and descriptive text the user wants to associate with the item. The
Index Database Search Process (see Figure 1.4) provides the capability to create indexes
and search them.

There are 2 classes of index files:

1) Public Index files

2) Private Index files
Every user can have one or more Private Index files leading to a very large number of
files. Each Private Index file references only a small subset of the total number of items in
the Document Database. Public Index files are maintained by professional library services
personnel and typically index every item in the Document Database. There is a small
number of Public Index files. These files have access lists (i.e., lists of users and their
privileges) that allow anyone to search or retrieve data. Private Index files typically have
very limited access lists. To assist the users in generating indexes, especially the

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

professional indexers, the system provides a process called Automatic File Build shown in
Figure 1.4 (also called Information Extraction).

Multimedia Database Search

From a system perspective, the multi-media data is not logically its own data structure,
but an augmentation to the existing structures in the Information Retrieval System.

Relationship to Database Management Systems

From a practical standpoint, the integration of DBMS’s and Information Retrieval
Systems is very important. Commercial database companies have already integrated the
two types of systems. One of the first commercial databases to integrate the two systems
into a single view is the INQUIRE DBMS. This has been available for over fifteen years. A
more current example is the ORACLE DBMS that now offers an imbedded capability
called CONVECTIS, which is an informational retrieval system that uses a comprehensive
thesaurus which provides the basis to generate “themes” for a particular item. The
INFORMIX DBMS has the ability to link to RetrievalWare to provide integration of
structured data and information along with functions associated with Information
Retrieval Systems.

that the user is interested in seeing.

Digital Libraries and Data Warehouses (DataMarts)
As the Internet continued its exponential growth and project funding became available,
the topic of Digital Libraries has grown. By 1995 enough research and pilot efforts had
started to support the 1ST ACM International Conference on Digital Libraries (Fox-96).
Indexing is one of the critical disciplines in library science and significant effort has gone
into the establishment of indexing and cataloging standards. Migration of many of the
library products to a digital format introduces both opportunities and challenges.
Information Storage and Retrieval technology has addressed a small subset of the issues
associated with Digital Libraries.

Data warehouses are similar to information storage and retrieval systems in that
they both have a need for search and retrieval of information. But a data warehouse is
more focused on structured data and decision support technologies. In addition to the

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

normal search process, a complete system provides a flexible set of analytical tools to
“mine” the data. Data mining (originally called Knowledge Discovery in Databases -
KDD) is a search process that automatically analyzes data and extract relationships and
dependencies that were not part of the database design.

Information Retrieval System Capabilities

Search Capabilities
Browse Capabilities
Miscellaneous Capabilities
Standards

The search capabilities address both Boolean and Natural Language queries. The
algorithms used for searching are called Boolean, natural language processing and
probabilistic. Probabilistic algorithms use frequency of occurrence of processing tokens
(words) in determining similarities between queries and items and also in predictors on
the potential relevance of the found item to the searcher.

The newer systems such as TOPIC, RetrievalWare, and INQUERY all allow for natural
language queries.

Browse functions to assist the user in filtering the search results to find relevant
information are very important.

2.1 Search Capabilities

The objective of the search capability is to allow for a mapping between a user’s specified
need and the items in the information database that will answer that need. It can consist
of natural language text in composition style and/or query terms (referred to as terms in
this book) with Boolean logic indicators between them. One concept that has occasionally
been implemented in commercial systems (e.g., RetrievalWare), and holds significant
potential for assisting in the location and ranking of relevant items, is the “weighting” of
search terms. This would allow a user to indicate the importance of

search terms in either a Boolean or natural language interface. Given the following
natural language query statement where the importance of a particular search term is

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

indicated by a value in parenthesis between 0.0 and 1.0 with 1.0 being the most
important.

The search statement may apply to the complete item or contain additional
parameters limiting it to a logical division of the item (i.e., to a zone). Based upon the
algorithms used in a system many different functions are associated with the system’s
understanding the search statement. The functions define the relationships between the
terms in the search statement (e.g., Boolean, Natural Language, Proximity, Contiguous
Word Phrases, and Fuzzy Searches) and the interpretation of a particular word (e.g.,
Term Masking, Numeric and Date Range, Contiguous Word Phrases, and
Concept/Thesaurus expansion).

Boolean Logic
Boolean logic allows a user to logically relate multiple concepts together to define what
information is needed. Typically the Boolean functions apply to processing tokens
identified anywhere within an item. The typical Boolean operators are AND, OR, and
NOT. These operations are implemented using set intersection, set union and set
difference procedures. Asearch terms in either a Boolean or natural language interface.
Given the following natural language query statement where the importance of a
particular search term is indicated by a value in parenthesis between 0.0 and 1.0 with 1.0
being the most important.

the search statement may apply to the complete item or contain additional
paramesearch terms in either a Boolean or natural language interface. Given the
following natural language query statement where the importance of a particular

search term is indicated by a value in parenthesis between 0.0 and 1.0 with 1.0 being the
most important.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Word Phrases, and Fuzzy Searches) and the interpretation of a particular word (e.g.,
Term Masking, Numeric and Date Range, Contiguous Word Phrases, and
Concept/Thesaurus expansion).

limiting it to a logical division of the item (i.e., to a zone). Based upon the
algorithms used in a system many different functions are associated with the system’s
understanding the search statement. The functions define the relationships between the
terms in the search statement (e.g., Boolean, Natural Language, Proximity, Contiguous
Word Phrases, and Fuzzy Searches) and the interpretation of a particular word (e.g.,
Term Masking, Numeric and Date Range, Contiguous Word Phrases, and
Concept/Thesaurus expansion).

few systems introduced the concept of “exclusive or” but it is equivalent to a slightly
more complex query using the other operators and is not generally useful to users since
most users do not understand it.

A special type of Boolean search is called “M of N” logic. The user lists a set of
possible search terms and identifies, as acceptable, any item that contains a subset of the
terms. For example, “Find any item containing any two of the following terms: “AA,” “BB,”
“CC.” This can be expanded into a Boolean search that performs an AND between all
combinations of two terms and “OR”s the results together ((AA AND BB) or (AA AND
CC) or (BB AND CC)).

Proximity

Proximity is used to restrict the distance allowed within an item between two search
terms. The semantic concept is that the clossearch terms in either a Boolean or natural
language interface. Given the following natural language query statement where the
importance of a particular search term is indicated by a value in parenthesis between

0.0 and 1.0 with 1.0 being the most important.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

terms in the search statement (e.g., Boolean, Natural Language, Proximity, Contiguous
Word Phrases, and Fuzzy Searches) and the interpretation of a particular word (e.g.,
Term Masking, Numeric and Date Range, Contiguous Word Phrases, and
Concept/Thesaurus expansion).

two terms are found in a text the more likely they are related in the description of a
particular concept. Proximity is used to increase the precision of a search. If the terms
COMPUTER and DESIGN are found within a few words of each other then the item is
more likely to be discussing the design of computers than if the words are paragraphs
apart. The typical format for proximity is:

TERM1 within “m” “units” of TERM2

The distance operator “m” is an integer number and units are in Characters, Words,
Sentences, or Paragraphs.

A special case of the Proximity operator is the Adjacent (ADJ) operator that normally has
a distance operator of one and a forward only direction (i.e., in WAIS). Another special
case is where the distance is set to zero meaning within the same semantic unit.

contiguous Word Phrases

A Contiguous Word Phrase (CWP) is both a way of specifying a query term and a special
search operator. A Contiguous Word Phrase is two or more words that are treated as a
single semantic unit. An example of a CWP is “United States of America.” It is four words
that specify a search term representing a single specific semantic concept (a country)
that can be used with any of the operators discussed above. Thus a query could specify

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

“manufacturing” AND “United States of America” which returns any item that contains
the word “manufacturing” and the contiguous words “United States of America.”

A contiguous word phrase also acts like a special search operator that is similar
to the proximity (Adjacency) operator but allows for additional specificity. If two

terms are specified, the contiguous word phrase and the proximity operator using
directional one word parameters or the Adjacent operator are identical. For contiguous
word phrases of more than two terms the only way of creating an equivalent search
statement using proximity and Boolean operators is via nested Adjacencies which are not
found in most commercial systems. This is because Proximity and Boolean operators are
binary operators but contiguous word phrases are an “N”ary operator where “N” is the
number of words in the CWP.

Contiguous Word Phrases are called Literal Strings in WAIS and Exact Phrases in
RetrievalWare. In WAIS multiple Adjacency (ADJ) operators are used to define a Literal
String (e.g., “United” ADJ “States” ADJ “of” ADJ “America”).

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Fuzzy Searches
Fuzzy Searches provide the capability to locate spellings of words that are similar to the
entered search term. This function is primarily used to compensate for errors in spelling
of words. Fuzzy searching increases recall at the expense of decreasing precision (i.e., it
can erroneously identify terms as the search term). In the process of expanding a query
term fuzzy searching includes other terms that have similar spellings, giving more weight
(in systems that rank output) to words in the database that have similar word lengths
and position of the characters as the entered term. A Fuzzy Search on the term
“computer” would automatically include the following

words from the information database: “computer,” “compiter,” “conputer,” “computter,”

“compute.”

Term Masking
Term masking is the ability to expand a query term by masking a portion of the term and
accepting as valid any processing token that maps to the unmasked portion of the term.
The value of term masking is much higher in systems that do not perform stemming or
only provide a very simple stemming algorithm. There are two types of search term
masking: fixed length and variable length. Sometimes they are called fixed and variable
length “don’t care” functions.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Fixed length masking is a single position mask. It masks out any symbol in a particular position or the lack of that position in a word. Variable length “don’t cares”
allows masking of any number of characters within a processing token. The masking may be in the front, at the end, at both front and end, or imbedded. The first three of
these cases are called suffix search, prefix search and imbedded character string search, respectively. The use of an imbedded variable length don’t care is seldom used.
Figure 2.3 provides examples of the use of variable length term masking. If “*” represents a variable length don’t care then the following are examples of its use:
“*COMPUTER” Suffix Search

“COMPUTER*” Prefix Search

“COMPUTER” Imbedded String Search

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Numeric and Date Ranges

Term masking is useful when applied to words, but does not work for finding ranges of numbers or numeric dates. To find numbers larger than “125,” using a term “125*”
will not find any number except those that begin with the digits “125.”

Concept/Thesaurus Expansion
Associated with both Boolean and Natural Language Queries is the ability to expand the search terms via Thesaurus or Concept Class database reference tool. A Thesaurus
is typically a one-level or two-level expansion of a term to other terms that are similar in meaning. A Concept Class is a tree structure that expands each meaning of a word
into potential concepts that are related to the initial term (e.g., in the TOPIC system). Concept classes are sometimes implemented as a network structure that links word
stems (e.g., in the RetrievalWare system). An example of Thesaurus and Concept Class structures are shown in Figure 2.4 (Thesaurus-93) and Figure 2.5.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Thesauri are either semantic or based upon statistics. A semantic thesaurus is a listing of words and then other words that are semantically similar.

The problem with thesauri is that they are generic to a language and can introduce many search terms that are not found in the document database. An alternative uses
the database or a representative sample of it to create statistically related terms. It is conceptually a thesaurus in that words that are statistically related to other words by

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

their frequently occurring together in the same items. This type of thesaurus is very dependent upon the database being searched and may not be portable to other
databases.

Natural Language Queries

Natural language interfaces improve the recall of systems with a decrease in precision when negation is required.

Browse Capabilities
Once the search is complete, Browse capabilities provide the user with the capability to determine which items are of interest and select those to be displayed. There are
two ways of displaying a summary of the items that are associated with a query: line item status and data visualization. From these summary displays, the user can select
the specific items and zones within the items for display.

Ranking
Typically relevance scores are normalized to a value between 0.0 and 1.0. The highest value of 1.0 is interpreted that the system is sure that the item is relevant to the
search statement. In addition to ranking based upon the characteristics of the item and the database, in many circumstances collaborative filtering is providing an option
for selecting and ordering output.

Collaborative filtering has been very successful in sites such as AMAZON.COM MovieFinder.com, and CDNow.com in deciding what products to display to users based
upon their queries.

Rather than limiting the number of items that can be assessed by the number of lines on a screen, other graphical visualization techniques showing the relevance

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

relationships of the hit items can be used. For example, a two or three dimensional graph can be displayed where points on the graph represent items and the location of
the points represent their relative relationship between each other and the user’s query. In some cases color is also used in this representation. This technique allows a user
to see the clustering of items by topics and browse through a cluster or move to another topical cluster.

Zoning
Related to zoning for use in minimizing what an end user needs to review from a hit item is the idea of locality and passage based search and retrieval.

Highlighting
Most systems allow the display of an item to begin with the first highlight within the item and allow subsequent jumping to the next highlight. The DCARS system that acts
as a user frontend to the Retrieval Ware search system allows the user to browse an item in the order of the paragraphs or individual words that contributed most to the
rank value associated with the item. The highlighting may vary by introducing colors and intensities to indicate the relative importance of a particular word in the item in
the decision to retrieve the item.

Miscellaneous Capabilities 2.3.1Vocabulary Browse

Vocabulary Browse provides the capability to display in alphabetical sorted order words from the document database. Logically, all unique words (processing tokens) in
the database are kept in sorted order along with a count of the number of unique items in which the word is found. The user can enter a word or word fragment and the
system will begin to display the dictionary around the entered text.

It helps the user determine the impact of using a fixed or variable length mask on a search term and potential mis-spellings. The user can determine that entering
the search term “compul*” in effect is searching for “compulsion” or “compulsive” or “compulsory.” It also shows that someone probably entered the word “computen” when
they really meant “computer.”

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Iterative Search and Search History Log

Frequently a search returns a Hit file containing many more items than the user wants to review. Rather than typing in a complete new query, the results of the previous
search can be used as a constraining list to create a new query that is applied against it. This has the same effect as taking the original query and adding additional search
statement against it in an AND condition. This process of refining the results of a previous search to focus on relevant items is called iterative search. This also applies when
a user uses relevance feedback to enhance a previous search. The search history log is the capability to display all the previous searches that were executed during the
current session.

Canned Query
The capability to name a query and store it to be retrieved and executed during a later user session is called canned or stored queries. A canned query allows a user to
create and refine a search that focuses on the user’s general area of interest one time and then retrieve it to add additional search criteria to retrieve data that is currently
needed. Canned query features also allow for variables to be inserted into the query and bound to specific values at execution time.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Difficulties in NLP:

Issues and Challenges

NL has an extremely rich form and structure.

It is very ambiguous. There can be di昀昀erent levels of ambiguity −

Lexical ambiguity − It is at very primi琀椀ve level such as word-level.

For example, trea琀椀ng the word “board” as noun or verb?

Syntax Level ambiguity − A sentence can be parsed in di昀昀erent ways.

For example, “He li昀琀ed the beetle with red cap.” − Did he use cap to li昀琀 the beetle or he li昀琀ed a beetle that had red cap?

Referen琀椀al ambiguity − Referring to something using pronouns. For example, Rima went to Gauri. She said, “I am 琀椀red.” − Exactly who is 琀椀red?

One input can mean di昀昀erent meanings.

Many inputs can mean the same thing.

NLP Terminology

Phonology − It is study of organizing sound systema琀椀cally.

Morphology − It is a study of construc琀椀on of words from primi琀椀ve meaningful units.

Morpheme − It is primi琀椀ve unit of meaning in a language.

Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Seman琀椀cs − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.

Pragma琀椀cs − It deals with using and understanding sentences in di昀昀erent situa琀椀ons and how the interpreta琀椀on of the sentence is a昀昀ected.

Discourse − It deals with how the immediately preceding sentence can a昀昀ect the interpreta琀椀on of the next sentence.

World Knowledge − It includes the general knowledge about the world.

Natural Language Processing

l Humans communicate through some form of language either by text or speech.

l To make interactions between computers and humans, computers need to understand natural languages used by
humans.

l Natural language processing is all about making computers learn, understand, analyse, manipulate and interpret
natural(human) languages.

l NLP stands for Natural Language Processing, which is a part of Computer Science, Human language, and Artificial
Intelligence.
l Processing of Natural Language is required when you want an intelligent system like robot to perform as per your instructions,
when you want to hear decision from a dialogue based clinical expert system, etc.
l The ability of machines to interpret human language is now at the core of many applications that we use every day
- chatbots, Email classification and spam filters, search engines, grammar checkers, voice assistants, and social
language translators.
l The input and output of an NLP system can be Speech or Written Text

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Components of NLP
l There are two components of NLP, Natural Language Understanding (NLU)
and Natural Language Generation (NLG).
l Natural Language Understanding (NLU) which involves transforming human
language into a machine-readable format.
l It helps the machine to understand and analyse human language by extracting the text from large data such as keywords,
emotions, relations, and semantics.

l Natural Language Generation (NLG) acts as a translator that converts the computerized data into
natural language representation.

l It mainly involves Text planning, Sentence planning, and Text realization.

NLP Terminology
l Phonology − It is study of organizing sound systematically.
l Morphology: The study of the formation and internal structure of words.
l Morpheme − It is primitive unit of meaning in a language.
l Syntax: The study of the formation and internal structure of sentences.
l Semantics: The study of the meaning of sentences.
l Pragmatics − It deals with using and understanding sentences in different situations
and how the interpretation of the sentence is affected.
l Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
l World Knowledge − It includes the general knowledge about the world.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Steps in NLP

l There are general five steps :

1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Lexical Analysis –

l The first phase of NLP is the Lexical Analysis.

l This phase scans the source code as a stream of characters and converts it into meaningful
lexemes.

l It divides the whole text into paragraphs, sentences, and words.

Syntactic Analysis (Parsing) –

l Syntactic Analysis is used to check grammar, word arrangements, and shows the relationship
among the words.

l The sentence such as “The school goes to boy” is rejected by English syntactic analyzer.
Semantic Analysis –

l Semantic analysis is concerned with the meaning representation.

l It mainly focuses on the literal meaning of words, phrases, and sentences.
l The semantic analyzer disregards sentence such as “hot ice-cream”.
Discourse Integration –

l Discourse Integration depends upon the sentences that proceeds it and also invokes the
meaning of the sentences that follow it.

Pragmatic Analysis –

l During this, what was said is re-interpreted on what it actually meant.

l It involves deriving those aspects of language which require real world knowledge.
l Example: "Open the door" is interpreted as a request instead of an order.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Finding the Structure of Words

l Human language is a complicated thing.
l We use it to express our thoughts, and through language, we receive information and infer its
meaning.

l Trying to understand language all together is not a viable approach.

l The point of morphology, for instance, is to study the variable forms and functions of words,
l The syntax is concerned with the arrangement of words into phrases, clauses, and sentences.
l Word structure constraints due to pronunciation are described by phonology,
l The conventions for writing constitute the orthography of a language.
l The meaning of a linguistic expression is its semantics, and etymology and lexicology cover especially the evolution of words and
explain the semantic, morphological, and other links among them.
l Words are perhaps the most intuitive units of language, yet they are in general tricky to define.
l Knowing how to work with them allows, in particular, the development of syntactic and
semantic abstractions and simplifies other advanced views on language.

l Here, first we explore how to identify words of distinct types in human languages, and how the internal structure of words can be
modelled in connection with the grammatical properties and lexical concepts the words should represent.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l The discovery of word structure is morphological parsing.

l In many languages, words are delimited in the orthography by whitespace and
punctuation.
l But in many other languages, the writing system leaves it up to the reader to tell words
apart or determine their exact phonological forms.
Words and Their Components
l Words are defined in most languages as the smallest linguistic units that can form a
complete utterance by themselves.
l The minimal parts of words that deliver aspects of meaning to them are called
morphemes.
Tokens

l Suppose, for a moment, that words in English are delimited only by whitespace and punctuation (the marks, such as full
stop, comma, and brackets)

l Example: Will you read the newspaper? Will you read it? I won’t read it.
l If we confront our assumption with insights from syntax, we notice two here: words
newspaper and won’t.

l Being a compound word, newspaper has an interesting derivational structure.

l In writing, newspaper and the associated concept is distinguished from the
isolated news and paper.

l For reasons of generality, linguists prefer to analyze won’t as two syntactic words, or tokens,
each of which has its independent role and can be reverted to its normalized form.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l The structure of won’t could be parsed as will followed by not.

l In English, this kind of tokenization and normalization may apply to just a limited set of
cases, but in other languages, these phenomena have to be treated in a less trivial manner.

l Tokens behaving in this way can be found in various languages and are often called clitics.
Lexemes
l By the term word, we often denote not just the one linguistic form in the given context but also the concept behind the form and
the set of alternative forms that can express it.
l Such sets are called lexemes or lexical items, and they constitute the lexicon of a
language.
l Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns,
adjectives, conjunctions, particles, or other parts of speech.
l The citation form of a lexeme, by which it is commonly identified, is also called its
lemma.
l When we convert a word into its other forms, such as turning the singular mouse into the plural mice or mouses, we say we inflect
the lexeme.
l When we transform a lexeme into another one that is morphologically related, regardless of its lexical category, we say we
derive the lexeme: for instance, the nouns receiver and reception are derived from the verb to receive.
l Example: Did you see him? I didn’t see him. I didn’t see anyone.

 Example presents the problem of tokenization of didn’t and the investigation of the internal structure of anyone.

Morphemes
l Morphological theories differ on whether and how to associate the properties of word forms with their structural components.
l These components are usually called segments or morphs.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l The morphs that by themselves represent some aspect of the meaning of a word are called morphemes of some function.

 Human languages employ a variety of devices by which morphs and morphemes are
combined into word forms.
Morphology
l Morphology is the domain of linguistics that analyses the internal structure of

words.

l Morphological analysis – exploring the structure of words

l Words are built up of minimal meaningful elements called morphemes: played = play-ed

cats = cat-s

unfriendly = un-friend-ly
l Two types of morphemes: i Stems: play, cat,
friend ii Affixes: -ed, -s, un-, -ly

l Two main types of affixes:

i Prefixes precede the stem: un-

ii Suffixes follow the stem: -ed, -s, un-, -ly

l Stemming = find the stem by stripping off affixes

l play = play

replayed = re-play-ed

computerized = comput-er-ize-d

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Problems in morphological processing

l Inflectional morphology: inflected forms are constructed from base forms and inflectional

affixes.

l Inflection relates different forms of the same word

Lemma Singular Plural

cat cat cats

dog dog dogs

knife knife knives

sheep sheep sheep

mouse mouse mice

l Derivational morphology: words are constructed from roots (or stems) and derivational

affixes:

inter+national = international international+ize = internationalize

internationalize+ation = internationalization

Downloaded by Rudrarapu Sudeepthi ([email protected])

l The simplest morphological process concatenates morphs one by one, as in dis- agree-ment-s, where agree is a free lexical
lOMoARcPSD|34540975

morpheme and the other elements are bound grammatical morphemes contributing some partial meaning to the whole word.
l in a more complex scheme, morphs can interact with each other, and their forms may become subject to additional phonological
and orthographic changes denoted as morphophonemic.
l The alternative forms of a morpheme are termed allomorphs.

Typology
l Morphological typology divides languages into groups by characterizing the prevalent
morphological phenomena in those languages.
l It can consider various criteria, and during the history of linguistics, different classifications
have been proposed.
l Let us outline the typology that is based on quantitative relations between words, their morphemes, and their features:
l Isolating, or analytic, languages include no or relatively few words that would comprise more
than one morpheme
l Synthetic languages can combine more morphemes in one word and are further divided into agglutinative and fusional
languages.
l Agglutinative languages have morphemes associated with only a single function at a
time (as in Korean, Japanese, Finnish, and Tamil, etc.)
l Fusional languages are defined by their feature-per-morpheme ratio higher than one
(as in Arabic, Czech, Latin, Sanskrit, German, etc.).
l In accordance with the notions about word formation processes mentioned earlier, we
can also find out using concatenative and nonlinear:
l Concatenative languages linking morphs and morphemes one after another.
l Nonlinear languages allowing structural components to merge nonsequentially to
apply tonal morphemes or change the consonantal or vocalic templates of words.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Issues and Challenges

l Irregularity: word forms are not described by a prototypical linguistic model.
l Ambiguity: word forms be understood in multiple ways out of the context
l Productivity: is the inventory of words in a language finite, or is it unlimited?

l Morphological parsing tries to eliminate the variability of word forms to provide higher- level linguistic units whose lexical and
morphological properties are explicit and well defined.
l It attempts to remove unnecessary irregularity and give limits to ambiguity, both of which are present inherently in human
language.
l By irregularity, we mean existence of such forms and structures that are not described
appropriately by a prototypical linguistic model.
l Some irregularities can be understood by redesigning the model and improving its rules, but other lexically dependent
irregularities often cannot be generalized
l Ambiguity is indeterminacy (not being interpreted)in interpretation of expressions of
language.
l Morphological modelling also faces the problem of productivity and creativity in language, by which unconventional but perfectly meaningful new
words or new senses are coined.
Irregularity
l Morphological parsing is motivated by the quest for generalization and abstraction in the world of words.
l Immediate descriptions of given linguistic data may not be the ultimate ones, due to either
their inadequate accuracy or inappropriate complexity, and better formulations may be needed.
l The design principles of the morphological model are therefore very important.
l With the proper abstractions made, irregular morphology can be seen as merely enforcing

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

some extended rules, the nature of which is phonological, over the underlying or prototypical
regular word forms.
l Morphophonemic templates capture morphological processes by just organizing stem patterns and generic affixes without any context-dependent
variation of the affixes or ad hoc modification of the stems.
l The merge rules, indeed very neatly or effectively concise, then ensure that such structured
representations can be converted into exactly the surface forms, both orthographic and
phonological, used in the natural language.
l Applying the merge rules is independent of and irrespective of any grammatical parameters or information other than that contained in a
template.
l Most morphological irregularities are thus successfully removed.

Ambiguity
l Morphological ambiguity is the possibility that word forms be understood in multiple
ways out of the context of their discourse (communication in speech or writing).
l Words forms that look the same but have distinct functions or meaning are called
homonyms.
l Ambiguity is present in all aspects of morphological processing and language
processing at large.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Productivity
l Is the inventory of words in a language finite, or is it unlimited?
l This question leads directly to discerning two fundamental approaches to language, summarized in the distinction between langue and parole, or in

the competence versus performance

l In one view, language can be seen as simply a collection of utterances (parole) actually
pronounced or written (performance).

l This ideal data set can in practice be approximated by linguistic corpora, which are finite collections of linguistic data that are studied with
empirical(based on) methods and can be used for comparison when linguistic models are developed.
l Yet, if we consider language as a system (langue), we discover in it structural devices like recursion, iteration, or
compounding(make up; constitute)that allow to produce (competence) an infinite set of concrete linguistic utterances.

l This general potential holds for morphological processes as well and is called morphological productivity.

l We denote the set of word forms found in a corpus of a language as its vocabulary.
l The members of this set are word types, whereas every original instance of a word form is a word token.
l The distribution of words or other elements of language follows the “80/20 rule frame which is a protocol,” also known as the law of the vital few.
l It says that most of the word tokens in a given corpus(a collection of written texts)can be identified with just a couple of word types in its vocabulary,
and words from the
rest of the vocabulary occur much less commonly if not rarely in the corpus.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l Furthermore, new, unexpected words will always appear as the collection of linguistic data is
enlarged.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Morphological Models
l There are many possible approaches to designing and implementing morphological models.
l Over time, computational linguistics has witnessed the development of a number of formalisms and frameworks, in particular grammars of different
kinds and expressive power, with which to address whole classes of problems in processing natural as well as formal languages.
l Let us now look at the most prominent types of computational approaches to morphology.

Dictionary Lookup
l Morphological parsing is a process by which word forms of a language are associated with

corresponding linguistic descriptions.

l Morphological systems that specify these associations by merely enumerating(is the act or process of making or stating a list of things one

after another) them case by case do not offer any generalization means.
Most common data structure

Inverted file structures are composed of three files The document file

1. The inversion list (Posting List)

2. Dictionary
3. The inverted file : based on the methodology of storing an inversion of documents.

4. For each word a listof documents in which the word is found is stored(inversion of document
5. Each document is given a unique the numerical identifier that is stored in inversion list . Dictionary is used to located the inversion list for a particular
word.
Which is a sorted list( processing tokens) in the system and a pointer to the location of its inversion list.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Dictionary can also store other information used in query optimization such as length of inversion lists to increase the precision.

 Use zoning to improve

 precision and Restrict entries.
 Inversion list consists of document identifier for each document in which the word is found.
Ex: bit 1(10),1(12) 1(18) is in 10,12, 18 position of the word bit in the document #1.
 When a search is performed, the inversion lists for the terms in the query are locate and appropriate logic is applied between inversion lists.
 Weights can also be stored in the inversion list.
 Inversion list are used to store concept and their relationship.
 Words with special characteristics can be stored in their own dictionary. Ex: Date… which require date ranging and numbers.
 Systems that support ranking are re-organized in ranked order.
 B trees can also be used for inversion instead of dictionary.
 The inversion lists may be at the leaf level or referenced in higher level pointers.
 A B-tree of order m is defined as:
 A root node with between 2 and 2m keys
 All other internal nodes have between m and 2m keys

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

 All keys are kept in order from smaller to larger.

 All leaves are at the same level or differ by at most one level.

Finite-State Morphology
l By finite-state morphological models, we mean those in which the specifications written by human programmers are directly compiled into
finite-states

l The two most popular tools supporting this approach, XFST (Xerox Finite-State Tool),

LexTools.
l They consist of a finite set of nodes connected by directed edges labeled with pairs of input
and output symbols.
l In such a network or graph, nodes are also called states, while edges are called arcs.
l Traversing the network from the set of initial states to the set of final states along the arcs is equivalent to reading the sequences of encountered
input symbols and writing the sequences of corresponding output symbols.
l The set of possible sequences accepted by the defines the input language; the set of possible sequences emitted by the
defines the output language.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Input Input Morphological parsed output

Cats cat +N +PL

Cat cat +N +SG
Cities city +N +PL
Geese goose +N +PL
Goose goose +N +SG) or (goose +V)
Gooses goose +V +3SG

mergin merge +V +PRES-PART

g
Caught (caught +V +PAST-PART) or (catch +V +PAST)

l matching words in the infinite regular language defined by grandson, great-grandson, great-great-grandson.
l In finite-state computational morphology, it is common to refer to the input word forms as surface strings and to
the output descriptions as lexical strings, if the transducer is used for morphological analysis, or vice versa, if it is

used for morphological generation.

• In English, a finite-state transducer could analyze the surface string children into the lexical
string child [+plural], for instance, or generate women from woman [+plural].

l Relations on languages can also be viewed as functions. Let us have a relation R, and let us denote by [Σ] the set
of all sequences over some set of symbols Σ, so that the domain and the range of R are subsets of [Σ].

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l We can then consider R as a function mapping an input string into a set of output strings, formally denoted by this

type signature, where [Σ] equals String:

l A theoretical limitation of finite-state models of morphology is the problem of capturing reduplication of words or
their elements (e.g., to express plurality) found in several human languages.

Unification-Based Morphology

l The concepts and methods of these formalisms are often closely connected to those
of logic programming.
l In finite-state morphological models, both surface and lexical forms are by themselves unstructured strings of atomic symbols.
l In higher-level approaches, linguistic information is expressed by more appropriate
data structures that can include complex values or can be recursively nested if
needed.
l Morphological parsing P thus associates linear forms φ with alternatives of structured
content ψ, cf.

l morphological modelling, word forms are best captured by regular expressions, while the linguistic content is best described
through typed feature structures.
l Feature structures can be viewed as directed acyclic graphs.

l A node in a feature structure comprises a set of attributes whose values can be

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l Nodes are associated with types, and atomic values are attribute less nodes
distinguished by their type.
l Unification is the key operation by which feature structures can be merged into a more
informative feature structure.
l Unification of feature structures can also fail, which means that the information in them is mutually incompatible.
l Morphological models of this kind are typically formulated as logic programs, and
unification is used to solve the system of constraints imposed by the model.
l Advantages of this approach include better abstraction possibilities for developing a
morphological grammar as well as elimination of redundant information from it.
l Unification-based models have been implemented for Russian, Czech, Slovene,
Persian, Hebrew, Arabic, and other languages.
Functional Morphology
 Functional morphology defines its models using principles of functional programming
and type theory.
 It treats morphological operations and processes as pure mathematical functions and organizes the linguistic as well as abstract
elements of a model into distinct types of values and type classes.
 Though functional morphology is not limited to modelling particular types of
morphologies in human languages, it is especially useful for fusional morphologies.
 Functional morphology implementations are intended to be reused as programming
libraries capable of handling the complete morphology of a language and to be
incorporated into various kinds of applications.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

 Morphological parsing is just one usage of the system, the others being
morphological generation, lexicon browsing, and so on.
we can describe inflection I, derivation D, and lookup L as functions of these generic
type

 Many functional morphology implementations are embedded in a general-purpose programming language, which gives
programmers more freedom with advanced programming techniques and allows them to develop full-featured, real-world
applications for their models.
 It influenced the functional morphology framework in with which morphologies of Latin, Swedish, Spanish, Urdu, and other
languages have been implemented.
l The notation then constitutes a so-called domain-specific embedded language, which makes programming even
more fun.

l Even without the options provided by general-purpose programming languages, functional morphology models
achieve high levels of abstraction.

l Morphological grammars in Grammatical Framework can be extended with descriptions of the syntax and semantics of a
language.
l Grammatical Framework itself supports multilinguality, and models of more than a dozen languages are available in
it as open-source software.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

2.Finding structure of Documents

Introduction

l In human language, words and sentences do not appear randomly but have structure.

l For example, combinations of words from sentences- meaningful grammatical units, such as statements, requests, and commands.
l Automatic extraction of structure of documents helps subsequent NLP tasks: for example, parsing, machine translation, and semantic role labelling
use sentences as the basic processing unit.
l Sentence boundary annotation(labelling) is also important for aiding human readability of
automatic speech recognition (ASR) systems.

l Task of deciding where sentences start and end given a sequence of characters(made of words and typographical cues) sentences boundary
detection.
l Topic segmentation as the task of determining when a topic starts and ends in a sequence of
sentences.
 The statistical classification approaches that try to find the presence of sentence and topic boundaries given human-annotated training data,
for segmentation.
 These methods base their predictions on features of the input: local characteristics that give

evidence toward the presence or absence of a sentence, such as aperiod(.), a question

mark(?), an exclamation mark(!), or another type of punctuation.

 Features are the core of classification approaches and require careful design and selection in

order to be successful and prevent overfitting and noise problem.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

 Most statistical approaches described here are language independent, every language is a challenging in itself.

 For example, for processing of Chinese documents, the processor may need to first segment the character sequences into words, as the words

usually are not separated by a space.

 Similarly, for morphological rich languages, the word structure may need to be analyzed to extract additional features.

 Such processing is usually done in a pre-processing step, where a sequence of tokens is determined.

 Tokens can be word or sub-word units, depending on the task and language.

 These algorithms are then applied on tokens.

Sentence Boundary Detection

l Sentence boundary detection (Sentence segmentation) deals with automatically segmenting

a sequence of word tokens into sentence units.

l In written text in English and some other languages, the beginning of a sentence is usually marked with an uppercase letter, and the

end of a sentence is explicitly marked with a

period(.), a question mark(?), an exclamation mark(!), or another type of punctuation.

l In addition to their role as sentence boundary markers, capitalized initial letters are used
distinguish proper nouns, periods are used in abbreviations, and numbers and punctuation marks are used inside proper names.
l The period at the end of an abbreviation can mark a sentence boundary at the same time.

l Example: I spoke with Dr. Smith. and My house is on Mountain Dr.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l In the first sentence, the abbreviation Dr. does not end a sentence, and in the second it does.

l Especially quoted sentences are always problematic, as the speakers may have uttered multiple sentences, and

sentence boundaries inside the quotes are also marked with

punctuation marks.

l An automatic method that outputs word boundaries as ending sentences according to the

presence of such punctuation marks would result in cutting some sentences incorrectly.
l Ambiguous abbreviations and capitalizations are not only problem of sentence segmentation in written text.

l Spontaneously written texts, such as short message service (SMS) texts or instant
messaging(IM) texts, tend to be nongrammatical and have poorly used or missing punctuation, which makes sentence segmentation even more
challenging.

l Similarly, if the text input to be segmented into sentences comes from an automatic system,
such as optical character recognition (OCR) or ASR, that aims to translate images of handwritten, type written, or printed text or spoken utterances into
machine editable text, the finding of sentences boundaries must deal with the errors of those systems as well.

l On the other hand, for conversational speech or text or multiparty meetings with
ungrammatical sentences and disfluencies, in most cases it is not clear where the boundaries
are.
l Code switching -that is, the use of words, phrases, or sentences from multiple languages by multilingual speakers- is another problem that can
affect the characteristics of sentences.

l For example, when switching to a different language, the writer can either keep the
punctuation rules from the first language or resort to the code of the second language.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l Conventional rule-based sentence segmentation systems in well-formed texts rely on patterns

to identify potential ends of sentences and lists of abbreviations for disambiguating them.

l For example, if the word before the boundary is a known abbreviation, such as “Mr.” or “Gov.,” the text is not segmented at that position even though
some periods are exceptions.
l To improve on such a rule-based approach, sentence segmentation is stated as a classification problem.
l Given the training data where all sentence boundaries are marked, we can train a classifier to recognize them.

Topic Boundary Detection

l Segmentation(Discourse or text segmentation) is the task of automatically dividing a stream of text or speech into topically homogenous blocks.
l This is, given a sequence of(written or spoken) words, the aim of topic segmentation is to
find the boundaries where topics change.

l Topic segmentation is an important task for various language understanding applications, such as information extraction and retrieval and text
summarization.
l For example, in information retrieval, if a long documents can be segmented into shorter, topically coherent segments, then only the segment that is
about the user’s query could be retrieved.
l During the late1990s, the U.S defence advanced research project agency(DARPA) initiated the topic detection and tracking program to further
the state of the art in finding and following new topic in a stream of broadcast news stories.
l One of the tasks in the TDT effort was segmenting a news stream into individual stories.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Methods
l Sentence segmentation and topic segmentation have been considered as a boundary
classification problem.

l Given a boundary candidate( between two word tokens for sentence segmentation and between two sentences for topic segmentation), the goal is

to predict whether or not the candidate is an actual boundary (sentence or topic boundary).
l Formally, let xƐX be the vector of features (the observation) associated with a candidate and y
ƐY be the label predicted for that candidate.

. label y can be b for boundary and �ഥ for nonboundary.

l The

l Classification problem: given a set of training examples(x,y)train, find a function that will assign the most accurate possible label y of unseen
examples xunseen.

l Alternatively to the binary classification problem, it is possible to model boundary types using
finer-grained categories.

l For segmentation in text be framed as a three-class problem: sentence boundary ba, without an abbreviation and abbreviation not as a boundary
l Similarly spoken language, a three way classification can be made between non-boundaries

statements bs, and question boundaries bq .

 For sentence or topic segmentation, the problem is defined as finding the most probable sentence or topic boundaries.
 The natural unit of sentence segmentation is words and of topic segmentation is sentence, as

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

we can assume that topics typically do not change in the middle of a sentences.

l The words or sentences are then grouped into categories stretches belonging to one sentences or topic- that is word or sentence boundaries are
classified into sentences or topic boundaries and -non-boundaries.
l The classification can be done at each potential boundary i (local modelling); then, the aim is
to estimate the most probable boundary type yഥi for each candidate xi

ഥഥ = ഥഥഥഥ ഥഥ

ഥഥ ഥഥ ഥ
ഥഥ
ഥ

Here, the ^ is used to denote estimated categories, and a variable without a ^ is used to show possible categories.
l In this formulation, a category is assigned to each example in isolation; hence, decision is made locally.
l However, the consecutive types can be related to each other. For example, in broadcast news speech, two consecutive sentences boundaries that
form a single word sentence are very infrequent.
l In local modelling, features can be extracted from surrounding example context of the

candidate boundary to model such dependencies.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

 It is also possible to see the candidate boundaries as a sequence and search for the sequence of boundary types
that have the maximum probability given the candidate examples,

ഥഥ ഥഥഥഥഥഥ
= ഥ ��
�

�
l We categorize the methods into local and sequence classification.
�
l Another categorization of methods is done according to the type of the machine learning algorithm: generative versus
discrimina琀椀ve.

l Generative sequence models estimate the joint distribution of the observations P(X,Y) (words, punctuation) and the labels(sentence boundary,
topic boundary).
l Discriminative sequence models, however, focus on features that categorize the differences between the labelling of that
examples.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

enerative Sequence Classification Methods

l Most commonly used generative sequence classification method for topic and

sentence is the hidden Markov model (HMM) function is being used in which the model is proposed according to bayers rule

Hmm Means: A hidden Markov model (HMM) is a statistical model that can be used to describe the evolution of observable events that depend on
internal factors, which are not directly observable

l generative models ca n b e h and led b y HELMs(hidden event language model) which can handle multiple orders of magnitude larger

Training data sets

l The probability in equation 2.2 is rewritten as the following, using the Bayes rule:

ഥഥഥഥ ഥഥ
= (| ) ( 2.1)
ഥഥ
ഥഥ ഥഥഥഥ ഥഥ ഥഥഥഥഥഥ
= � = �
ഥ ഥ ��)Τ�
(� )
(� �
ഥഥഥ
� ( )( 2.2)
ഥഥഥ (
ഥ = �
|�

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Here ഥഥ = Predicted class(boundary) label

Y = (y1,y2,….yk)= Set of class(boundary) labels

X = (x1,x2,….xn)= set of feature vectors

P(Y|X) = the probability of given the X (feature vectors),
what is the probability of X
belongs to the class(boundary) label.

P(x) = Probability of word sequence

P(Y) = Probability of the class(boundary)

l P(X) in the denominator is dropped because it is fixed for different Y and hence does not change the argument of max.
l P(X|Y) and P(Y) can be estimated as

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

2.Discriminative Local Classification Methods

l Discriminative classifiers aim to model P(yi |xi) equation 2.1 directly.

l The most important distinction is that class densities P(x|y) are model assumptions
in generative approaches
l A number of discriminative classification approaches are used, such as support vector machines,

boosting, maximum entropy, and regression. Are based on different machine learning

algorithms which are used in discrimination process in classifying the sentence boundary.

l While discriminative approaches have been shown to outperform generative methods in

many speech and language processing tasks.

l For sentence segmentation, supervised learning methods – have primarily been applied to

newspaper ar琀椀cles.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Supervised learning methods are used where a machine is being trained by giving instruc琀椀on and accordingly it will perform .there are many supervised

learning algorithms for di昀昀erent purpose.

l Stamatatos, Fakotakis and Kokkinakis are authors who used transformation based learning (TBL) to infer rules for finding sentence

boundaries.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l Many supervised learning method classifiers have been tried for the sentence boundary task such as regression trees, neural networks,
classification , maximum entropy classifiers, support vector machines, and naïve Bayes classifiers.
l The most Text tiling method is used for topic segmentation which uses a lexical cohesion (binding of word to another) metric in a

word vector space as an indicator for topic similarity.

l Figure depicts a typical graph of similarity with respect to consecutive segmentation units.

l The document is chopped when the similarity is below some threshold.

l Originally, two methods for computing the similarity scores were proposed: block

comparison and vocabulary introduction.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l The first, block comparison, compares adjacent blocks of text to see how similar they are according to how many words the adjacent
blocks have in common.
l Given two blocks, b1 and b2, each having k tokens (sentences or paragraphs),
l the similarity (or topical cohesion) between two blocks score is computed by the formula:

l Where w is the weight assigned to term t in block b.

t,b
l The weights can be binary or may be computed using other information retrieval- metrics such as term frequency(calculation of weight ).
l The second method is, the vocabulary introduction method, assigns a score to a token-sequence gap
on the basis of how many new words are seen in the interval in which it is the midpoint.

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

l Similar to the block comparison formulation, given two consecutive blocks b1 and b2, of equal number of words w, the
l
l topical cohesion score is computed with the following formula:

l Where NumNewTerms(b) returns the number of terms in block b seen the first time in text.

3.Discrimina琀椀ve Sequence Classi昀椀ca琀椀on Methods

l In segmentation tasks, the sentence or topic decision for a given example(word, sentence, paragraph) highly depends on the decision for the
examples in its vicinity(the area near to the topic or surrounding a particular database ).
l Discriminative sequence classification methods are in general extensions of local discriminative models with additional decoding stages that find the
best assignment of labels by looking at neighbouring decisions to label.
l Machine learning algorithms are used to discriminative sequence example(word, sentence, paragraph) commonly used are
Conditional random fields(CRFs), SVM-support vector machine which are extenstion of HMM
.
l Contrary to local classifiers that predict sentences or topic boundaries independently, CRFs can oversee the whole sequence of boundary
hypotheses to make their decisions.

4.Hybrid Approaches

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

In this approaches we use segamwnt classification method s by applying Viteribe Algorithm which is implemented byHmm

The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate
of the most likely sequence of hidden states —called the Viterbi path—that results in a sequence of observed events,
especially in the context of hidden Markov models (HMM).

Complexity of the Approaches

l The above approaches described here have advantages and disadvantages.

l In a given context and under a set of observation features, one approach may be better than

other.

l These approaches can be rated in terms of complexity (time and memory) of their training

and predic琀椀on algorithms and in terms of their performance on real-world datasets.

l In terms of complexity, training of discriminative approaches is more complex than training

of generative ones because they require multiple passes over the training data to adjust for feature weights.
l However, generative models c an be han dl ed by HELMs(hidden event language model) which can handle multiple orders of magnitude larger

Training data sets

On the other hand the disadvantage is , they work with only a few features .

Downloaded by Rudrarapu Sudeepthi ([email protected])

lOMoARcPSD|34540975

Downloaded by Rudrarapu Sudeepthi ([email protected])

Most Common English Tenses by JForrest English
No ratings yet
Most Common English Tenses by JForrest English
10 pages
Advanced Grammar in Use Martin Hewings 3rd Edition 2013 296p
100% (12)
Advanced Grammar in Use Martin Hewings 3rd Edition 2013 296p
306 pages
NLP 1
No ratings yet
NLP 1
59 pages
NLP All Units
No ratings yet
NLP All Units
81 pages
Faculty Name: Dr. Humera Khanam Subject Name:NLP
No ratings yet
Faculty Name: Dr. Humera Khanam Subject Name:NLP
206 pages
NLP 1 - 5 Modules
No ratings yet
NLP 1 - 5 Modules
210 pages
NLP Notes
No ratings yet
NLP Notes
203 pages
Unit 1 Irs
No ratings yet
Unit 1 Irs
26 pages
IRS Unit-1
No ratings yet
IRS Unit-1
27 pages
IRS Unit-1
No ratings yet
IRS Unit-1
61 pages
Irs Unit-1
No ratings yet
Irs Unit-1
61 pages
UNIT 1 IRS WWWWW
No ratings yet
UNIT 1 IRS WWWWW
26 pages
Explain Item Normalization?
No ratings yet
Explain Item Normalization?
7 pages
Irs PDF
No ratings yet
Irs PDF
68 pages
PPT08-Natural Language Processing
100% (1)
PPT08-Natural Language Processing
44 pages
IRS Unit-1
50% (2)
IRS Unit-1
14 pages
Irs Unit-1 Modified
No ratings yet
Irs Unit-1 Modified
12 pages
Cmrit Isr Notes - Docx New
No ratings yet
Cmrit Isr Notes - Docx New
54 pages
Introduction To Information Retrieval
No ratings yet
Introduction To Information Retrieval
44 pages
Unit-I: Introduction To Information Retrieval Systems
100% (1)
Unit-I: Introduction To Information Retrieval Systems
14 pages
Pert23 - NLP
No ratings yet
Pert23 - NLP
30 pages
IRS Unit-1
100% (5)
IRS Unit-1
14 pages
IRSUnit 1
No ratings yet
IRSUnit 1
26 pages
IRS Study Material
100% (1)
IRS Study Material
87 pages
Irs Unit1
No ratings yet
Irs Unit1
15 pages
20200728204914D5872 - COMP6639 - Session 28 - Natural Language Processing
No ratings yet
20200728204914D5872 - COMP6639 - Session 28 - Natural Language Processing
29 pages
Module 1 - Introduction
No ratings yet
Module 1 - Introduction
61 pages
ISR Unit 1
No ratings yet
ISR Unit 1
23 pages
Unit 1introduction
No ratings yet
Unit 1introduction
44 pages
Unit-1 Chapter 1
No ratings yet
Unit-1 Chapter 1
44 pages
IR Lec03 Vocabulary Postings List
No ratings yet
IR Lec03 Vocabulary Postings List
28 pages
WINSEM2023-24 BCSE306L TH VL2023240500598 2024-04-30 Reference-Material-I
No ratings yet
WINSEM2023-24 BCSE306L TH VL2023240500598 2024-04-30 Reference-Material-I
44 pages
Wollo University Kombolcha Institute of Technology College of Informatics Department of Information Technology
100% (1)
Wollo University Kombolcha Institute of Technology College of Informatics Department of Information Technology
35 pages
Unit - 1
No ratings yet
Unit - 1
51 pages
ISR Chap..1
No ratings yet
ISR Chap..1
27 pages
Automatic Image Annotation: Fundamentals and Applications
From Everand
Automatic Image Annotation: Fundamentals and Applications
Fouad Sabry
No ratings yet
Ch1 IR
No ratings yet
Ch1 IR
39 pages
Unit I
No ratings yet
Unit I
23 pages
An Introduction To Information Retrieval Systems: Intelligent Systems March 18, 2004 Ramashis Das
No ratings yet
An Introduction To Information Retrieval Systems: Intelligent Systems March 18, 2004 Ramashis Das
25 pages
Unit 1 Irs Information Retrieval Systems Unit 1
No ratings yet
Unit 1 Irs Information Retrieval Systems Unit 1
27 pages
IR Problem: Introduction To Information Retrieval Outline
No ratings yet
IR Problem: Introduction To Information Retrieval Outline
11 pages
Chapter 1 Introduction To ISR
No ratings yet
Chapter 1 Introduction To ISR
39 pages
Unit-Ii Notes
No ratings yet
Unit-Ii Notes
17 pages
Informatin and Storage Retrieval Group - 5 Sec - 2 Assiment
No ratings yet
Informatin and Storage Retrieval Group - 5 Sec - 2 Assiment
14 pages
1 Introduction MIR
No ratings yet
1 Introduction MIR
35 pages
Irs Unit-Ii-Notes
No ratings yet
Irs Unit-Ii-Notes
18 pages
Sem Ser Ori
No ratings yet
Sem Ser Ori
69 pages
AI Unit V
No ratings yet
AI Unit V
64 pages
Modern Information Retrieval: Computer Engineering Department Fall 2005
No ratings yet
Modern Information Retrieval: Computer Engineering Department Fall 2005
19 pages
Python For Data Science
From Everand
Python For Data Science
Kevin Clark
No ratings yet
ch1 - Information Retrieval Systems
No ratings yet
ch1 - Information Retrieval Systems
52 pages
Image Retrieval: Fundamentals and Applications
From Everand
Image Retrieval: Fundamentals and Applications
Fouad Sabry
No ratings yet
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
From Everand
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
Fouad Sabry
No ratings yet
Chapter One IR
No ratings yet
Chapter One IR
18 pages
Unit V Notes Adbt Adbt
No ratings yet
Unit V Notes Adbt Adbt
7 pages
Unit 1 Irs Information Retrieval Systems Unit 1
No ratings yet
Unit 1 Irs Information Retrieval Systems Unit 1
27 pages
Unit I
No ratings yet
Unit I
65 pages
Lecture 1 - Introduction
No ratings yet
Lecture 1 - Introduction
57 pages
Irs Unit-2 Notes - 241015 - 102936
No ratings yet
Irs Unit-2 Notes - 241015 - 102936
27 pages
Irs Unit-1-1
No ratings yet
Irs Unit-1-1
113 pages
Unit - I - IR
No ratings yet
Unit - I - IR
39 pages
Mod 4
No ratings yet
Mod 4
35 pages
Resume R.nagaraju Me21b159
No ratings yet
Resume R.nagaraju Me21b159
2 pages
Organizational Behaviour Units III IV V
No ratings yet
Organizational Behaviour Units III IV V
4 pages
Pe Unit1 Notes
No ratings yet
Pe Unit1 Notes
41 pages
AI Spectrum U5
No ratings yet
AI Spectrum U5
30 pages
Fiot Unit 1
No ratings yet
Fiot Unit 1
94 pages
Q4M3 Expanded Definition
No ratings yet
Q4M3 Expanded Definition
6 pages
Ghalib
No ratings yet
Ghalib
16 pages
TQ in English 7-First Quarter 2022-2023
100% (1)
TQ in English 7-First Quarter 2022-2023
3 pages
Kowser CV
No ratings yet
Kowser CV
2 pages
Time Table (04 Nov) : UP SI (You Tube Ij)
No ratings yet
Time Table (04 Nov) : UP SI (You Tube Ij)
10 pages
My Mother at 66
No ratings yet
My Mother at 66
15 pages
ADP Urdu
No ratings yet
ADP Urdu
4 pages
Present Forms 1
No ratings yet
Present Forms 1
16 pages
English Test (4) : 1/. I Bought This Dress at The................... On Hang Bai Street
No ratings yet
English Test (4) : 1/. I Bought This Dress at The................... On Hang Bai Street
3 pages
Test de Evaluare Initială Anul Scolar 2012-2013 Limba Engleză Clasa A IX-a L2
No ratings yet
Test de Evaluare Initială Anul Scolar 2012-2013 Limba Engleză Clasa A IX-a L2
2 pages
CELTA - Pre-Course Quiz
100% (1)
CELTA - Pre-Course Quiz
3 pages
MP20XV MP20XV (A979) Parts Manual: Yale Europe Materials Handling Limited
No ratings yet
MP20XV MP20XV (A979) Parts Manual: Yale Europe Materials Handling Limited
112 pages
Automata - Unit 3-1
No ratings yet
Automata - Unit 3-1
26 pages
2021 Grammar Revision
No ratings yet
2021 Grammar Revision
3 pages
Describing Books With Examples
No ratings yet
Describing Books With Examples
4 pages
Structural-Analysis-And-Interpretation The Fox and The Crow
No ratings yet
Structural-Analysis-And-Interpretation The Fox and The Crow
27 pages
Affixes Exercise
No ratings yet
Affixes Exercise
3 pages
0547 Int Otg Sow
No ratings yet
0547 Int Otg Sow
40 pages
GA Level Diploma in Teaching English As A Foreign Language Hours Certificate
No ratings yet
GA Level Diploma in Teaching English As A Foreign Language Hours Certificate
1 page
Past Perfect
No ratings yet
Past Perfect
14 pages
7cs PPT With Excercises 1220938502799896 8
No ratings yet
7cs PPT With Excercises 1220938502799896 8
22 pages
Punctuation Marks
No ratings yet
Punctuation Marks
13 pages
Compose Clear: Sentences Using Appropriate Grammatical Structures
100% (1)
Compose Clear: Sentences Using Appropriate Grammatical Structures
16 pages
Full Download Persuasive Games in Political and Professional Dialogue 1st Edition Răzvan Săftoiu PDF
No ratings yet
Full Download Persuasive Games in Political and Professional Dialogue 1st Edition Răzvan Săftoiu PDF
41 pages
Irregular Verbs
No ratings yet
Irregular Verbs
4 pages
Future Tenses Power Point
100% (1)
Future Tenses Power Point
9 pages
Şcoala Gimnazială Porumbeşti Anul Şcolar 2020-2021
No ratings yet
Şcoala Gimnazială Porumbeşti Anul Şcolar 2020-2021
5 pages
Major 8 Macro Skills
No ratings yet
Major 8 Macro Skills
41 pages

Unit 1 Introduction To NLP

Uploaded by

Unit 1 Introduction To NLP

Uploaded by

lOMoARcPSD|34540975

Unit 1 introduction to nlp

Natural Language Processing (Jawaharlal Nehru Technological University, Hyderabad)

Scan to open on Studocu

Studocu is not sponsored or endorsed by any college or university

Natural Language Processing (NLP) refers to AI method of communicating with an

Natural Language Generation (NLG)

Words and Their components

The general objective of an Information Retrieval Systemis to minimize the overhead of a

Downloaded by Rudrarapu Sudeepthi ([email protected])

The two major measures commonly associated with

Precision and recall.

Divided into four segments

Number of relevant items in the database.

Downloaded by Rudrarapu Sudeepthi ([email protected])

Number_Retrieved_Relevantis the number of items retrieved hat are relevant to the

Downloaded by Rudrarapu Sudeepthi ([email protected])

Multi-media adds an extra dimension to the normalization process. In addition to

Downloaded by Rudrarapu Sudeepthi ([email protected])

Systems determine words by dividing input symbols into 3 classes:

1) Valid word symbols

A word is defined as a contiguous set of word symbols bounded by inter-word

but might be critical to represent foreign names in the database.

Next, a Stop List/Algorithm is applied to the list of potential processing tokens.

2) Selective Dissemination (Distribution, Spreading) of Information

The Selective Dissemination of Information (Mail) Process provides the capability to

Downloaded by Rudrarapu Sudeepthi ([email protected])

3) Document Database Search

Index Database Search

There are 2 classes of index files:

1) Public Index files

Downloaded by Rudrarapu Sudeepthi ([email protected])

Multimedia Database Search

Relationship to Database Management Systems

that the user is interested in seeing.

Downloaded by Rudrarapu Sudeepthi ([email protected])

Information Retrieval System Capabilities

2.1 Search Capabilities

Downloaded by Rudrarapu Sudeepthi ([email protected])

Downloaded by Rudrarapu Sudeepthi ([email protected])

0.0 and 1.0 with 1.0 being the most important.

Downloaded by Rudrarapu Sudeepthi ([email protected])

TERM1 within “m” “units” of TERM2

contiguous Word Phrases

Downloaded by Rudrarapu Sudeepthi ([email protected])

Downloaded by Rudrarapu Sudeepthi ([email protected])

words from the information database: “computer,” “compiter,” “conputer,” “computter,”

Downloaded by Rudrarapu Sudeepthi ([email protected])

“COMPUTER*” Prefix Search

“*COMPUTER*” Imbedded String Search

Downloaded by Rudrarapu Sudeepthi ([email protected])

Numeric and Date Ranges

Downloaded by Rudrarapu Sudeepthi ([email protected])

Downloaded by Rudrarapu Sudeepthi ([email protected])

Natural Language Queries

Downloaded by Rudrarapu Sudeepthi ([email protected])

Miscellaneous Capabilities 2.3.1Vocabulary Browse

Downloaded by Rudrarapu Sudeepthi ([email protected])

Iterative Search and Search History Log

Downloaded by Rudrarapu Sudeepthi ([email protected])

Issues and Challenges

NL has an extremely rich form and structure.

It is very ambiguous. There can be di昀昀erent levels of ambiguity −

Lexical ambiguity − It is at very primi琀椀ve level such as word-level.

For example, trea琀椀ng the word “board” as noun or verb?

Syntax Level ambiguity − A sentence can be parsed in di昀昀erent ways.

One input can mean di昀昀erent meanings.

Many inputs can mean the same thing.

Phonology − It is study of organizing sound systema琀椀cally.

Morphology − It is a study of construc琀椀on of words from primi琀椀ve meaningful units.

Morpheme − It is primi琀椀ve unit of meaning in a language.

Downloaded by Rudrarapu Sudeepthi ([email protected])

World Knowledge − It includes the general knowledge about the world.

Natural Language Processing

l Humans communicate through some form of language either by text or speech.

Downloaded by Rudrarapu Sudeepthi ([email protected])

“COMPUTER” Imbedded String Search