NLP All Units
CSE (AI & ML), Jawaharlal Nehru Technological University, Hyderabad

Unit-1

UNIT - I
Finding the Structure of Words: Words and Their Components, Issues and Challenges, Morphological Models.
Finding the Structure of Documents: Introduction, Methods, Complexity of the Approaches, Performances of the Approaches.

NLP INTRODUCTION:

Natural Language Processing (NLP) refers to the AI method of communicating with an intelligent system using a natural language such as English.

Processing of Natural Language is required when you want an intelligent system like a robot to perform as per your instructions, or when you want to hear a decision from a dialogue-based clinical expert system, etc.

The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be:

• Speech

• Written Text

Components of NLP
There are two components of NLP, as given below:

Natural Language Understanding (NLU)


Understanding involves the following tasks:

• Mapping the given input in natural language into useful representations.

• Analyzing different aspects of the language.

Natural Language Generation (NLG)


It is the process of producing meaningful phrases and sentences in the form of natural language from some
internal representation.

It involves:

• Text planning − It includes retrieving the relevant content from the knowledge base.

• Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence.

• Text realization − It is mapping the sentence plan into sentence structure.


Words and Their components

The general objective of an Information Retrieval System is to minimize the overhead of a user locating needed information.

Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information (e.g., query generation, query execution, scanning the results of the query to select items to read, reading non-relevant items).

The two major measures commonly associated with information retrieval systems are precision and recall.

When a user decides to issue a search looking for information on a topic, the total database is logically divided into four segments.

Relevant items are those documents that contain information that helps the searcher in answering his question.

Non-relevant items are those items that do not provide any directly useful information.

There are two possibilities with respect to each item: it can be retrieved or not retrieved by the user's query.

The two measures are defined as:

Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
Recall = Number_Retrieved_Relevant / Number_Possible_Relevant

where:

Number_Possible_Relevant is the number of relevant items in the database.

Number_Total_Retrieved is the total number of items retrieved by the query.

Number_Retrieved_Relevant is the number of retrieved items that are relevant to the user's search need.

Precision measures one aspect of information retrieval overhead for a user associated with a particular search. If a search has 85 percent precision, then 15 percent of the user effort is overhead spent reviewing non-relevant items.
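To make the two measures concrete, here is a minimal sketch in Python; the document itself gives no code, and the document identifiers below are invented purely for illustration:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of item identifiers."""
    retrieved_relevant = retrieved & relevant          # items that are both retrieved and relevant
    precision = len(retrieved_relevant) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 20 items retrieved, 17 of them relevant, 25 relevant items exist in total.
retrieved = {f"doc{i}" for i in range(1, 21)}
relevant = {f"doc{i}" for i in range(4, 29)}
print(precision_recall(retrieved, relevant))   # (0.85, 0.68)
```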

Recall gauges how well a system processing a particular query is able to retrieve the relevant items.

Functional Overview
A total Information Storage and Retrieval System is composed of four major functional processes:

1) Item Normalization

2) Selective Dissemination of Information (i.e., "Mail")

3) Archival Document Database Search, and


4) an Index Database Search along with the Automatic File Build process that supports Index Files.

1) Item Normalization:
The first step in any integrated system is to normalize the incoming items to a standard format. Item
normalization provides logical restructuring of the item. Additional operations during item normalization
are needed to create a searchable data structure: identification of processing tokens (e.g., words),
characterization of the tokens, and stemming (e.g., removing word endings) of the tokens.

The processing tokens and their characterization are used to define the searchable text from the total
received text. Figure 1.5 shows the normalization process. Standardizing the input takes the different
external formats of input data and performs the translation to the formats acceptable to the system. A system
may have a single format for all items or allow multiple formats. One example of standardization could be
translation of foreign languages into Unicode. Every language has a different

internal binary encoding for the characters in the language. One standard encoding that covers
English, French, Spanish, etc. is ISO-Latin.

To assist the users in generating indexes, especially the professional indexers, the system provides a process called Automatic File Build (AFB).

Multi-media adds an extra dimension to the normalization process. In addition to normalizing the textual input, the multi-media input also needs to be standardized. There are many options for the standards applied during normalization. If the input is video, the likely digital standards will be MPEG-2, MPEG-1, AVI or Real Media. MPEG (Motion Picture Expert Group) standards are the most universal standards for higher quality video, whereas Real Media is the most common standard for lower quality video used on the Internet. Audio standards are typically WAV or Real Media (Real Audio). Images vary from JPEG to BMP.

The next process is to parse the item into logical sub-divisions that have meaning to the user. This process, called "Zoning," is visible to the user and used to increase the precision of a search and optimize the display. A typical item is sub-divided into zones, which may overlap and can be hierarchical, such as Title, Author, Abstract, Main Text, Conclusion, and References. The zoning information is passed to the processing token identification operation to store the information, allowing searches to be restricted to a specific zone. For example, if the user is interested in articles discussing "Einstein" then the search should not include the Bibliography, which could include references to articles written by "Einstein."

Systems determine words by dividing input symbols into 3 classes:

1) Valid word symbols

2) Inter-word symbols

3) Special processing symbols.

A word is defined as a contiguous set of word symbols bounded by inter-word symbols. In many systems
inter-word symbols are non-searchable and should be carefully selected. Examples of word symbols are
alphabetic characters and numbers. Examples of possible inter-word symbols are blanks, periods and
semicolons. The exact definition of an inter-word symbol is dependent upon the aspects of the


language domain of the items to be processed by the system. For example, an apostrophe may be of
little importance if only used for the possessive case in English,

but might be critical to represent foreign names in the database.
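A minimal sketch of this word identification step in Python, assuming alphanumerics as valid word symbols and everything else as inter-word symbols (the symbol classes would be configurable in a real system; the apostrophe handling is an assumption, not a fixed rule):

```python
import re

# Word symbols: letters and digits; apostrophes are kept inside words so that
# possessives and names like O'Brien survive as single tokens.
WORD_SYMBOLS = r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*"

def identify_words(text):
    """Return the contiguous runs of word symbols bounded by inter-word symbols."""
    return re.findall(WORD_SYMBOLS, text)

print(identify_words("Einstein's paper, published in 1905; see refs."))
# ["Einstein's", 'paper', 'published', 'in', '1905', 'see', 'refs']
```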

Next, a Stop List/Algorithm is applied to the list of potential processing tokens. The objective of the Stop
function is to save system resources by eliminating from the set of searchable processing tokens those that
have little value to the system. Given the significant increase in available cheap memory, storage and
processing power, the need to apply the Stop function to processing tokens is decreasing.

Examples of Stop algorithms are: stop all numbers greater than "999999" (this limit was selected to allow dates to remain searchable); stop any processing token that has numbers and characters intermixed.
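A sketch of the two Stop rules just mentioned, applied to a token list (a simplified illustration, not a complete stop-word facility):

```python
def apply_stop_algorithms(tokens):
    """Drop tokens matching the example Stop rules: numbers > 999999,
    and tokens mixing digits with letters (e.g. 'A1B2')."""
    kept = []
    for tok in tokens:
        if tok.isdigit() and int(tok) > 999999:
            continue                      # stop large numbers, keep short date-like ones
        if any(c.isdigit() for c in tok) and any(c.isalpha() for c in tok):
            continue                      # stop intermixed numbers and characters
        kept.append(tok)
    return kept

print(apply_stop_algorithms(["220598", "1234567", "budget", "A1B2"]))
# ['220598', 'budget']
```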

2) Selective Dissemination (Distribution, Spreading) of Information

The Selective Dissemination of Information (Mail) Process provides the capability to dynamically compare newly received items in the information system against standing statements of interest of users and deliver the item to those users whose statement of interest matches the contents of the item. The Mail process is composed of the search process, user statements of interest (Profiles) and user mail files. As each item is received, it is processed against every user's profile. A profile contains a typically broad search statement along with a list of user mail files that will receive the document if the search statement in the profile is satisfied. Selective Dissemination of Information has not yet been applied to multimedia sources.

3) Document Database Search


The Document Database Search Process provides the capability for a query to search against all items
received by the system. The Document Database Search process is composed of the search process, user
entered queries (typically ad hoc queries) and the document database which contains all items that have been
received, processed and stored by the system. Typically items in the Document Database do not change (i.e.,
are not edited) once received.

Index Database Search


When an item is determined to be of interest, a user may want to save it for future reference. This is in
effect filing it. In an information system this is accomplished via

the index process. In this process the user can logically store an item in a file along with additional index
terms and descriptive text the user wants to associate with the item. The Index Database Search Process (see
Figure 1.4) provides the capability to create indexes and search them.

There are 2 classes of index files:

1) Public Index files

2) Private Index files


Every user can have one or more Private Index files leading to a very large number of files. Each Private
Index file references only a small subset of the total number of items in the Document Database. Public Index
files are maintained by professional library services personnel and typically index every item in the
Document Database. There is a small number of Public Index files. These files have access lists (i.e., lists of
users and their privileges) that allow anyone to search or retrieve data. Private Index files typically have very
limited access lists. To assist the users in generating indexes, especially the professional indexers, the system
provides a process called Automatic File Build shown in Figure 1.4 (also called Information Extraction).

Multimedia Database Search


From a system perspective, the multi-media data is not logically its own data structure, but an
augmentation to the existing structures in the Information Retrieval System.

Relationship to Database Management Systems


From a practical standpoint, the integration of DBMSs and Information Retrieval Systems is very important. Commercial database companies have already integrated the two types of systems. One of the first commercial databases to integrate the two systems into a single view is the INQUIRE DBMS. This has been available for over fifteen years. A more current example is the ORACLE DBMS that now offers an embedded capability called CONVECTIS, which is an information retrieval system that uses a comprehensive thesaurus to provide the basis for generating "themes" for a particular item. The INFORMIX DBMS has the ability to link to RetrievalWare to provide integration of structured data and information along with functions associated with Information Retrieval Systems.


Digital Libraries and Data Warehouses (DataMarts)


As the Internet continued its exponential growth and project funding became available, the topic of
Digital Libraries has grown. By 1995 enough research and pilot efforts had started to support the 1st ACM International Conference on Digital Libraries (Fox-96). Indexing is one of the critical disciplines in
library science and significant effort has gone into the establishment of indexing and cataloging standards.
Migration of many of the library products to a digital format introduces both opportunities and
challenges. Information Storage and Retrieval technology has addressed a small subset of the issues
associated with Digital Libraries.

Data warehouses are similar to information storage and retrieval systems in that they both have a need
for search and retrieval of information. But a data warehouse is more focused on structured data and
decision support technologies. In addition to the normal search process, a complete system provides a flexible
set of analytical tools to "mine" the data. Data mining (originally called Knowledge Discovery in Databases - KDD) is a search process that automatically analyzes data and extracts relationships and dependencies that were not part of the database design.

Information Retrieval System Capabilities


Search Capabilities

Browse Capabilities


Miscellaneous Capabilities

Standards

The search capabilities address both Boolean and Natural Language queries. The algorithms used for searching are called Boolean, natural language processing and probabilistic. Probabilistic algorithms use the frequency of occurrence of processing tokens (words) in determining similarities between queries and items and also as predictors of the potential relevance of the found item to the searcher.

The newer systems such as TOPIC, RetrievalWare, and INQUERY all allow for natural language queries.

Browse functions to assist the user in filtering the search results to find relevant information are very
important.

2.1 Search Capabilities


The objective of the search capability is to allow for a mapping between a user's specified need and the items in the information database that will answer that need. It can consist of natural language text in composition style and/or query terms (referred to as terms in this book) with Boolean logic indicators between them. One concept that has occasionally been implemented in commercial systems (e.g., RetrievalWare), and holds significant potential for assisting in the location and ranking of relevant items, is the "weighting" of search terms. This would allow a user to indicate the importance of search terms in either a Boolean or natural language interface, for example in a natural language query statement where the importance of a particular search term is indicated by a value in parentheses between 0.0 and 1.0, with 1.0 being the most important.

The search statement may apply to the complete item or contain additional parameters limiting it to a logical division of the item (i.e., to a zone). Based upon the algorithms used in a system, many different functions are associated with the system's understanding of the search statement. The functions define the relationships between the terms in the search statement (e.g., Boolean, Natural Language, Proximity, Contiguous Word Phrases, and Fuzzy Searches) and the interpretation of a particular word (e.g., Term Masking, Numeric and Date Range, Contiguous Word Phrases, and Concept/Thesaurus expansion).

Boolean Logic
Boolean logic allows a user to logically relate multiple concepts together to define what information is needed. Typically the Boolean functions apply to processing tokens identified anywhere within an item. The typical Boolean operators are AND, OR, and NOT. These operations are implemented using set intersection, set union and set difference procedures.

A few systems introduced the concept of "exclusive or," but it is equivalent to a slightly more complex query using the other operators and is not generally useful to users since most users do not understand it.


A special type of Boolean search is called "M of N" logic. The user lists a set of possible search terms and identifies, as acceptable, any item that contains a subset of the terms. For example, "Find any item containing any two of the following terms: AA, BB, CC." This can be expanded into a Boolean search that performs an AND between all combinations of two terms and ORs the results together ((AA AND BB) OR (AA AND CC) OR (BB AND CC)).
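Because the Boolean operators map onto set operations over the inversion lists, a tiny sketch with Python sets shows both the basic operators and the M-of-N expansion (the document identifiers are invented for illustration):

```python
from itertools import combinations

# Hypothetical inversion lists: term -> set of document identifiers containing it.
postings = {"AA": {1, 2, 5}, "BB": {2, 3, 5}, "CC": {1, 5, 7}}
all_docs = {1, 2, 3, 4, 5, 6, 7}

and_result = postings["AA"] & postings["BB"]          # AND -> set intersection
or_result = postings["AA"] | postings["BB"]           # OR  -> set union
not_result = all_docs - postings["CC"]                 # NOT -> set difference

# "M of N" logic: any document containing at least 2 of the 3 terms,
# expanded as the OR of the ANDs of every 2-term combination.
m_of_n = set().union(*(postings[a] & postings[b]
                       for a, b in combinations(["AA", "BB", "CC"], 2)))

print(and_result, or_result, not_result, m_of_n)
# {2, 5} {1, 2, 3, 5} {2, 3, 4, 6} {1, 2, 5}
```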

Proximity
Proximity is used to restrict the distance allowed within an item between two search terms. The semantic concept is that the closer two terms are found in a text, the more likely they are related in the description of a particular concept.
Proximity is used to increase the precision of a search. If the terms COMPUTER and DESIGN are found
within a few words of each other then the item is more likely to be discussing the design of computers than if
the words are paragraphs apart. The typical format for proximity is:

TERM1 within "m" "units" of TERM2

The distance operator "m" is an integer number and the units are Characters, Words, Sentences, or Paragraphs.

A special case of the Proximity operator is the Adjacent (ADJ) operator that normally has a distance operator
of one and a forward only direction (i.e., in WAIS). Another special case is where the distance is set to zero


meaning within the same semantic unit.
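A minimal sketch of a proximity check in Python, assuming the unit is Words and reusing the word-identification idea from item normalization (the function name and distance value are illustrative only):

```python
import re

def within(term1, term2, m, text):
    """True if some occurrence of term1 is within m words of some occurrence of term2."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
    pos1 = [i for i, w in enumerate(words) if w == term1.lower()]
    pos2 = [i for i, w in enumerate(words) if w == term2.lower()]
    return any(abs(i - j) <= m for i in pos1 for j in pos2)

text = "The design of modern computer systems is discussed in chapter two."
print(within("computer", "design", 3, text))   # True  -> likely about computer design
print(within("computer", "chapter", 3, text))  # False
```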

Contiguous Word Phrases


A Contiguous Word Phrase (CWP) is both a way of specifying a query term and a special search operator. A Contiguous Word Phrase is two or more words that are treated as a single semantic unit. An example of a CWP is "United States of America." It is four words that specify a search term representing a single specific semantic concept (a country) that can be used with any of the operators discussed above. Thus a query could specify "manufacturing" AND "United States of America", which returns any item that contains the word "manufacturing" and the contiguous words "United States of America."

A contiguous word phrase also acts like a special search operator that is similar to the proximity
(Adjacency) operator but allows for additional specificity. If two

terms are specified, the contiguous word phrase and the proximity operator using directional one-word parameters (or the Adjacent operator) are identical. For contiguous word phrases of more than two terms the only way of creating an equivalent search statement using proximity and Boolean operators is via nested Adjacencies, which are not found in most commercial systems. This is because Proximity and Boolean operators are binary operators but contiguous word phrases are an "N-ary" operator, where "N" is the number of words in the CWP.

Contiguous Word Phrases are called Literal Strings in WAIS and Exact Phrases in RetrievalWare. In WAIS multiple Adjacency (ADJ) operators are used to define a Literal String (e.g., "United" ADJ "States" ADJ "of" ADJ "America").

Fuzzy Searches
Fuzzy Searches provide the capability to locate spellings of words that are similar to the entered search term. This function is primarily used to compensate for errors in spelling of words. Fuzzy searching increases recall at the expense of decreasing precision (i.e., it can erroneously identify terms as the search term). In the process of expanding a query term, fuzzy searching includes other terms that have similar spellings, giving more weight (in systems that rank output) to words in the database that have similar word lengths and positions of the characters as the entered term. A Fuzzy Search on the term "computer" would automatically include the following words from the information database: "computer," "compiter," "conputer," "computter," "compute."
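The standard library's difflib can stand in for the spelling-similarity measure such a system would use; a sketch (the vocabulary below is simply the misspelling list from the text plus a couple of extra words):

```python
import difflib

vocabulary = ["computer", "compiter", "conputer", "computter", "compute", "commuter", "design"]

def fuzzy_search(term, vocabulary, cutoff=0.8):
    """Return vocabulary words whose spelling is similar to the query term,
    best matches first (difflib's ratio is one possible similarity measure)."""
    return difflib.get_close_matches(term, vocabulary, n=10, cutoff=cutoff)

print(fuzzy_search("computer", vocabulary))
# roughly: ['computer', 'computter', 'compute', 'compiter', 'conputer', 'commuter']
```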

Term Masking
Term masking is the ability to expand a query term by masking a portion of the term and accepting as valid any processing token that maps to the unmasked portion of the term. The value of term masking is much higher in systems that do not perform stemming or only provide a very simple stemming algorithm. There are two types of search term masking: fixed length and variable length. Sometimes they are called fixed and variable length "don't care" functions.

Fixed length masking is a single position mask. It masks out any symbol in a particular position or the lack of that position in a word. Variable length "don't cares" allow masking of any number of characters


within a processing token. The masking may be in the front, at the end, at both front and end, or imbedded. The first three of these cases are called suffix search, prefix search and imbedded character string search, respectively. The use of an imbedded variable length don't care is seldom used. Figure 2.3 provides examples of the use of variable length term masking. If "*" represents a variable length don't care then the following are examples of its use:

"*COMPUTER" Suffix Search

"COMPUTER*" Prefix Search

"*COMPUTER*" Imbedded String Search
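The standard library's fnmatch module implements exactly this kind of "*" wildcard matching, so a sketch of the three searches is short (the vocabulary is illustrative):

```python
import fnmatch

vocabulary = ["COMPUTER", "MINICOMPUTER", "COMPUTERIZED", "MINICOMPUTERS", "COMPUTE"]

def mask_search(pattern, vocabulary):
    """Return the processing tokens that match a term-masking pattern ('*' = don't care)."""
    return fnmatch.filter(vocabulary, pattern)

print(mask_search("*COMPUTER", vocabulary))    # ['COMPUTER', 'MINICOMPUTER']   suffix search
print(mask_search("COMPUTER*", vocabulary))    # ['COMPUTER', 'COMPUTERIZED']   prefix search
print(mask_search("*COMPUTER*", vocabulary))   # everything containing COMPUTER, imbedded search
```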

Numeric and Date Ranges


Term masking is useful when applied to words, but does not work for finding ranges of numbers or numeric dates. To find numbers larger than "125," using a term "125*" will not find any number except those that begin with the digits "125."

Concept/Thesaurus Expansion
Associated with both Boolean and Natural Language Queries is the ability to expand the search terms via
Thesaurus or Concept Class database reference tool. A Thesaurus is typically a one-level or two-level
expansion of a term to other terms that are similar in meaning. A Concept Class is a tree structure that
expands each meaning of a word into potential concepts that are related to the initial term (e.g., in the TOPIC
system). Concept classes are sometimes implemented as a network structure that links word stems (e.g., in
the RetrievalWare system). An example of Thesaurus and Concept Class structures are shown in Figure
2.4 (Thesaurus-93) and Figure 2.5.

Thesauri are either semantic or based upon statistics. A semantic thesaurus is a listing of words and
then other words that are semantically similar.

The problem with thesauri is that they are generic to a language and can introduce many search terms that are not found in the document database. An alternative uses the database, or a representative sample of it, to create statistically related terms. It is conceptually a thesaurus in that it groups words that are statistically related to other words by their frequently occurring together in the same items. This type of thesaurus is very dependent upon the database being searched and may not be portable to other databases.

Natural Language Queries


Natural language interfaces improve the recall of systems with a decrease in precision when negation is
required.

Browse Capabilities
Once the search is complete, Browse capabilities provide the user with the capability to determine which
items are of interest and select those to be displayed. There are two ways of displaying a summary of the
items that are associated with a query: line item status and data visualization. From these summary displays,
the user can select the specific items and zones within the items for display.


Ranking
Typically relevance scores are normalized to a value between 0.0 and 1.0. The highest value of 1.0 is interpreted as meaning that the system is sure that the item is relevant to the search statement. In addition to ranking based upon the characteristics of the item and the database, in many circumstances collaborative filtering provides an option for selecting and ordering output.

Collaborative filtering has been very successful in sites such as AMAZON.COM, MovieFinder.com, and CDNow.com in deciding what products to display to users based upon their queries.

Rather than limiting the number of items that can be assessed by the number of lines on a screen, other
graphical visualization techniques showing the relevance

relationships of the hit items can be used. For example, a two- or three-dimensional graph can be displayed where points on the graph represent items and the locations of the points represent their relative relationships to each other and to the user's query. In some cases color is also used in this representation. This technique allows a user to see the clustering of items by topics and browse through a cluster or move to another topical cluster.

Zoning
Related to zoning, for use in minimizing what an end user needs to review from a hit item, is the idea of locality and passage-based search and retrieval.

Highlighting
Most systems allow the display of an item to begin with the first highlight within the item and allow subsequent jumping to the next highlight. The DCARS system, which acts as a user front-end to the RetrievalWare search system, allows the user to browse an item in the order of the paragraphs or individual words that contributed most to the rank value associated with the item. The highlighting may vary by introducing colors and intensities to indicate the relative importance of a particular word in the item in the decision to retrieve the item.

Miscellaneous Capabilities

2.3.1 Vocabulary Browse


Vocabulary Browse provides the capability to display, in alphabetically sorted order, words from the document database. Logically, all unique words (processing tokens) in the database are kept in sorted order along with a count of the number of unique items in which the word is found. The user can enter a word or word fragment and the system will begin to display the dictionary around the entered text.

It helps the user determine the impact of using a fixed or variable length mask on a search term and potential mis-spellings. The user can determine that entering the search term "compul*" in effect is searching for "compulsion" or "compulsive" or "compulsory." It also shows that someone probably entered the word "computen" when they really meant "computer."

Iterative Search and Search History Log


Frequently a search returns a Hit file containing many more items than the user wants to review. Rather than typing in a completely new query, the results of the previous search can be used as a constraining list to create a new query that is applied against it. This has the same effect as taking the original query and adding an additional search statement to it in an AND condition. This process of refining the results of a previous search to focus on relevant items is called iterative search. This also applies when a user uses relevance feedback to enhance a previous search. The search history log is the capability to display all the previous searches that were executed during the current session.

Canned Query
The capability to name a query and store it to be retrieved and executed during a later user session is called canned or stored queries. A canned query allows a user to create and refine a search that focuses on the user's general area of interest one time and then retrieve it to add additional search criteria to retrieve data that is currently needed. Canned query features also allow for variables to be inserted into the query and bound to specific values at execution time.

Difficulties in NLP:

Issues and Challenges

NL has an extremely rich form and structure.

It is very ambiguous. There can be different levels of ambiguity:

Lexical ambiguity − It is at a very primitive level, such as the word level. For example, should the word "board" be treated as a noun or a verb?

Syntax-level ambiguity − A sentence can be parsed in different ways. For example, "He lifted the beetle with red cap." − Did he use the cap to lift the beetle, or did he lift a beetle that had a red cap?

Referential ambiguity − Referring to something using pronouns. For example: Rima went to Gauri. She said, "I am tired." − Exactly who is tired?

One input can have different meanings.

Many inputs can mean the same thing.

NLP Terminology

Phonology − It is the study of organizing sound systematically.

Morphology − It is the study of the construction of words from primitive meaningful units.

Morpheme − It is a primitive unit of meaning in a language.

Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.


Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.

Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.

Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.

World Knowledge − It includes the general knowledge about the world.

Natural Language Processing

• Humans communicate through some form of language, either by text or speech.

• To make interactions between computers and humans possible, computers need to understand the natural languages used by humans.

• Natural language processing is all about making computers learn, understand, analyse, manipulate and interpret natural (human) languages.

• NLP stands for Natural Language Processing, which is a part of Computer Science, Human language, and Artificial Intelligence.

• Processing of Natural Language is required when you want an intelligent system like a robot to perform as per your instructions, or when you want to hear a decision from a dialogue-based clinical expert system, etc.

• The ability of machines to interpret human language is now at the core of many applications that we use every day: chatbots, email classification and spam filters, search engines, grammar checkers, voice assistants, and social language translators.

• The input and output of an NLP system can be Speech or Written Text.

Components of NLP
• There are two components of NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG).

• Natural Language Understanding (NLU) involves transforming human language into a machine-readable format.


• It helps the machine to understand and analyse human language by extracting from large data the keywords, emotions, relations, and semantics of the text.

• Natural Language Generation (NLG) acts as a translator that converts the computerized data into a natural language representation.

• It mainly involves Text planning, Sentence planning, and Text realization.

NLP Terminology
• Phonology − It is the study of organizing sound systematically.

• Morphology − The study of the formation and internal structure of words.

• Morpheme − It is a primitive unit of meaning in a language.

• Syntax − The study of the formation and internal structure of sentences.

• Semantics − The study of the meaning of sentences.

• Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.

• Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.

• World Knowledge − It includes the general knowledge about the world.

Steps in NLP
• There are in general five steps:

1. Lexical Analysis

2. Syntactic Analysis (Parsing)

3. Semantic Analysis

4. Discourse Integration

5. Pragmatic Analysis


Lexical Analysis

• The first phase of NLP is Lexical Analysis.

• This phase scans the source text as a stream of characters and converts it into meaningful lexemes.

• It divides the whole text into paragraphs, sentences, and words.

Syntactic Analysis (Parsing)

• Syntactic Analysis is used to check grammar and word arrangement, and shows the relationships among the words.

• A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer.

Semantic Analysis

• Semantic analysis is concerned with the meaning representation.

• It mainly focuses on the literal meaning of words, phrases, and sentences.

• The semantic analyzer disregards sentences such as "hot ice-cream".

Discourse Integration

• Discourse Integration depends upon the sentences that precede it and also invokes the meaning of the sentences that follow it.

Pragmatic Analysis

• During this phase, what was said is re-interpreted as what it actually meant.

• It involves deriving those aspects of language which require real-world knowledge.

• Example: "Open the door" is interpreted as a request instead of an order.
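A minimal sketch of the first step, lexical analysis, using only the standard library: it divides raw text into paragraphs, sentences, and words as described above (the splitting rules are deliberately naive and only for illustration):

```python
import re

def lexical_analysis(text):
    """Naively divide text into paragraphs, then sentences, then words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    structured = []
    for para in paragraphs:
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", para) if s.strip()]
        structured.append([re.findall(r"[A-Za-z0-9']+", s) for s in sentences])
    return structured

sample = "Will you read the newspaper? Will you read it?\n\nI won't read it."
print(lexical_analysis(sample))
# [[['Will', 'you', 'read', 'the', 'newspaper'], ['Will', 'you', 'read', 'it']],
#  [['I', "won't", 'read', 'it']]]
```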

Finding the Structure of Words

• Human language is a complicated thing.

• We use it to express our thoughts, and through language, we receive information and infer its meaning.

• Trying to understand language all together is not a viable approach.

• The point of morphology, for instance, is to study the variable forms and functions of words.

• Syntax is concerned with the arrangement of words into phrases, clauses, and sentences.

• Word structure constraints due to pronunciation are described by phonology.

• The conventions for writing constitute the orthography of a language.

• The meaning of a linguistic expression is its semantics, and etymology and lexicology cover especially the evolution of words and explain the semantic, morphological, and other links among them.

• Words are perhaps the most intuitive units of language, yet they are in general tricky to define.

• Knowing how to work with them allows, in particular, the development of syntactic and semantic abstractions and simplifies other advanced views on language.

• Here, first we explore how to identify words of distinct types in human languages, and how the internal structure of words can be modelled in connection with the grammatical properties and lexical concepts the words should represent.

• The discovery of word structure is morphological parsing.

• In many languages, words are delimited in the orthography by whitespace and punctuation.

• But in many other languages, the writing system leaves it up to the reader to tell words apart or determine their exact phonological forms.

Words and Their Components


• Words are defined in most languages as the smallest linguistic units that can form a complete utterance by themselves.

• The minimal parts of words that deliver aspects of meaning to them are called morphemes.

Tokens

• Suppose, for a moment, that words in English are delimited only by whitespace and punctuation (the marks, such as full stop, comma, and brackets).

• Example: Will you read the newspaper? Will you read it? I won't read it.

• If we confront our assumption with insights from syntax, we notice two interesting cases here: the words


newspaper and won't.

• Being a compound word, newspaper has an interesting derivational structure.

• In writing, newspaper and the associated concept is distinguished from the isolated news and paper.

• For reasons of generality, linguists prefer to analyze won't as two syntactic words, or tokens, each of which has its independent role and can be reverted to its normalized form.

• The structure of won't could be parsed as will followed by not.

• In English, this kind of tokenization and normalization may apply to just a limited set of cases, but in other languages, these phenomena have to be treated in a less trivial manner.

• Tokens behaving in this way can be found in various languages and are often called clitics.
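A small sketch of this kind of tokenization and normalization for English contractions; the mapping table is only illustrative, not an exhaustive treatment of clitics:

```python
import re

# Hypothetical normalization table: each contracted form maps to its syntactic tokens.
CONTRACTIONS = {"won't": ["will", "not"], "didn't": ["did", "not"], "I'm": ["I", "am"]}

def tokenize_and_normalize(text):
    """Split on whitespace/punctuation, expanding contracted forms into separate tokens."""
    tokens = []
    for tok in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        tokens.extend(CONTRACTIONS.get(tok, [tok]))
    return tokens

print(tokenize_and_normalize("I won't read it."))
# ['I', 'will', 'not', 'read', 'it']
```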

Lexemes
• By the term word, we often denote not just the one linguistic form in the given context but also the concept behind the form and the set of alternative forms that can express it.

• Such sets are called lexemes or lexical items, and they constitute the lexicon of a language.

• Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.

• The citation form of a lexeme, by which it is commonly identified, is also called its lemma.

• When we convert a word into its other forms, such as turning the singular mouse into the plural mice or mouses, we say we inflect the lexeme.

• When we transform a lexeme into another one that is morphologically related, regardless of its lexical category, we say we derive the lexeme: for instance, the nouns receiver and reception are derived from the verb to receive.

• Example: Did you see him? I didn't see him. I didn't see anyone.

• This example presents the problem of tokenization of didn't and the investigation of the internal structure of anyone.


Morphemes
• Morphological theories differ on whether and how to associate the properties of word forms with their structural components.

• These components are usually called segments or morphs.

• The morphs that by themselves represent some aspect of the meaning of a word are called morphemes of some function.

• Human languages employ a variety of devices by which morphs and morphemes are combined into word forms.

Morphology
• Morphology is the domain of linguistics that analyses the internal structure of words.

• Morphological analysis − exploring the structure of words.

• Words are built up of minimal meaningful elements called morphemes:

played = play-ed
cats = cat-s
unfriendly = un-friend-ly

• Two types of morphemes:
  i. Stems: play, cat, friend
  ii. Affixes: -ed, -s, un-, -ly

• Two main types of affixes:
  i. Prefixes precede the stem: un-
  ii. Suffixes follow the stem: -ed, -s, -ly

• Stemming = find the stem by stripping off affixes:

play = play
replayed = re-play-ed
computerized = comput-er-ize-d
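A minimal affix-stripping sketch along these lines; the affix lists are tiny and purely illustrative, and a real stemmer (e.g. the Porter algorithm) handles many more cases:

```python
PREFIXES = ["re", "un"]
SUFFIXES = ["ized", "ize", "ed", "ly", "s"]

def strip_affixes(word):
    """Very naive stemming: strip one known prefix and one known suffix if present."""
    stem = word
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) - len(p) >= 3:
            stem = stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= 3:
            stem = stem[:-len(s)]
            break
    return stem

for w in ["played", "cats", "unfriendly", "replayed", "computerized"]:
    print(w, "->", strip_affixes(w))
# played -> play, cats -> cat, unfriendly -> friend, replayed -> play, computerized -> computer
```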

Problems in morphological processing


• Inflectional morphology: inflected forms are constructed from base forms and inflectional affixes.


• Inflection relates different forms of the same word:

Lemma   Singular   Plural
cat     cat        cats
dog     dog        dogs
knife   knife      knives
sheep   sheep      sheep
mouse   mouse      mice
• Derivational morphology: words are constructed from roots (or stems) and derivational affixes:

inter + national = international
international + ize = internationalize
internationalize + ation = internationalization

• The simplest morphological process concatenates morphs one by one, as in dis-agree-ment-s, where agree is a free lexical morpheme and the other elements are bound grammatical morphemes contributing some partial meaning to the whole word.

• In a more complex scheme, morphs can interact with each other, and their forms may become subject to additional phonological and orthographic changes denoted as morphophonemic.

• The alternative forms of a morpheme are termed allomorphs.

Typology
• Morphological typology divides languages into groups by characterizing the prevalent morphological phenomena in those languages.

• It can consider various criteria, and during the history of linguistics, different classifications have been proposed.

• Let us outline the typology that is based on quantitative relations between words, their morphemes, and their features:

• Isolating, or analytic, languages include no or relatively few words that would comprise more than one morpheme.

• Synthetic languages can combine more morphemes in one word and are further divided into agglutinative and fusional languages.

• Agglutinative languages have morphemes associated with only a single function at a time (as in Korean, Japanese, Finnish, Tamil, etc.).

• Fusional languages are defined by their feature-per-morpheme ratio higher than one


(as in Arabic, Czech, Latin, Sanskrit, German, etc.).

• In accordance with the notions about word formation processes mentioned earlier, we can also distinguish between concatenative and nonlinear languages:

• Concatenative languages link morphs and morphemes one after another.

• Nonlinear languages allow structural components to merge nonsequentially to apply tonal morphemes or change the consonantal or vocalic templates of words.

Issues and Challenges


• Irregularity: word forms are not described by a prototypical linguistic model.

• Ambiguity: word forms can be understood in multiple ways out of context.

• Productivity: is the inventory of words in a language finite, or is it unlimited?

• Morphological parsing tries to eliminate the variability of word forms to provide higher-level linguistic units whose lexical and morphological properties are explicit and well defined.

• It attempts to remove unnecessary irregularity and give limits to ambiguity, both of which are present inherently in human language.

• By irregularity, we mean the existence of such forms and structures that are not described appropriately by a prototypical linguistic model.

• Some irregularities can be understood by redesigning the model and improving its rules, but other lexically dependent irregularities often cannot be generalized.

• Ambiguity is indeterminacy (not being uniquely interpretable) in the interpretation of expressions of language.

• Morphological modelling also faces the problem of productivity and creativity in language, by which unconventional but perfectly meaningful new words or new senses are coined.

Irregularity
• Morphological parsing is motivated by the quest for generalization and abstraction in the world of words.

• Immediate descriptions of given linguistic data may not be the ultimate ones, due to either their inadequate accuracy or inappropriate complexity, and better formulations may be needed.


• The design principles of the morphological model are therefore very important.

• With the proper abstractions made, irregular morphology can be seen as merely enforcing some extended rules, the nature of which is phonological, over the underlying or prototypical regular word forms.

• Morphophonemic templates capture morphological processes by just organizing stem patterns and generic affixes without any context-dependent variation of the affixes or ad hoc modification of the stems.

• The merge rules, which are very neatly and effectively concise, then ensure that such structured representations can be converted into exactly the surface forms, both orthographic and phonological, used in the natural language.

• Applying the merge rules is independent of and irrespective of any grammatical parameters or information other than that contained in a template.

• Most morphological irregularities are thus successfully removed.

Ambiguity

• Morphological ambiguity is the possibility that word forms be understood in multiple ways out of the context of their discourse (communication in speech or writing).

• Word forms that look the same but have distinct functions or meanings are called homonyms.

• Ambiguity is present in all aspects of morphological processing and language processing at large.

Productivity
• Is the inventory of words in a language finite, or is it unlimited?

• This question leads directly to discerning two fundamental approaches to language, summarized in the distinction between langue and parole, or between competence and performance.

• In one view, language can be seen as simply a collection of utterances (parole) actually pronounced or written (performance).

• This ideal data set can in practice be approximated by linguistic corpora, which are finite collections of linguistic data that are studied with empirical methods and can be used for comparison when linguistic models are developed.


• Yet, if we consider language as a system (langue), we discover in it structural devices like recursion, iteration, or compounding that allow us to produce (competence) an infinite set of concrete linguistic utterances.

• This general potential holds for morphological processes as well and is called morphological productivity.

• We denote the set of word forms found in a corpus of a language as its vocabulary.

• The members of this set are word types, whereas every original instance of a word form is a word token.

• The distribution of words or other elements of language follows the "80/20 rule," also known as the law of the vital few.

• It says that most of the word tokens in a given corpus (a collection of written texts) can be identified with just a couple of word types in its vocabulary, and words from the rest of the vocabulary occur much less commonly if not rarely in the corpus.

• Furthermore, new, unexpected words will always appear as the collection of linguistic data is enlarged.
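A short sketch of the type/token distinction and the skewed distribution, using collections.Counter on a toy corpus (the text is only illustrative):

```python
from collections import Counter
import re

corpus = "the cat sat on the mat and the dog sat on the rug"
tokens = re.findall(r"[a-z]+", corpus.lower())   # every running word is a token
types = Counter(tokens)                           # each distinct word form is a type

print(len(tokens), "tokens,", len(types), "types")
print(types.most_common(3))
# 13 tokens, 8 types
# [('the', 4), ('sat', 2), ('on', 2)]  -> a few frequent types cover much of the text
```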

Morphological Models
• There are many possible approaches to designing and implementing morphological models.

• Over time, computational linguistics has witnessed the development of a number of formalisms and frameworks, in particular grammars of different kinds and expressive power, with which to address whole classes of problems in processing natural as well as formal languages.

• Let us now look at the most prominent types of computational approaches to morphology.

Dictionary Lookup

• Morphological parsing is a process by which word forms of a language are associated with corresponding linguistic descriptions.

• Morphological systems that specify these associations by merely enumerating (stating or listing) them case by case do not offer any means of generalization.

Most common data structure

Inverted file structures are composed of three files:

1. The document file

2. The inversion list (posting list)

3. The dictionary

• The inverted file is based on the methodology of storing an inversion of the documents: for each word, a list of the documents in which the word is found is stored (the inversion of the documents).

• Each document is given a unique numerical identifier that is stored in the inversion list. The dictionary is used to locate the inversion list for a particular word; it is a sorted list of the processing tokens in the system, each with a pointer to the location of its inversion list.

• The dictionary can also store other information used in query optimization, such as the lengths of the inversion lists, to increase precision.

• Zoning can be used to improve precision and restrict entries.

• An inversion list consists of the document identifier for each document in which the word is found. Ex: bit  1(10), 1(12), 1(18)  means the word bit occurs at positions 10, 12 and 18 of document #1.

• When a search is performed, the inversion lists for the terms in the query are located and the appropriate logic is applied between the inversion lists.

• Weights can also be stored in the inversion list.

• Inversion lists are used to store concepts and their relationships.

• Words with special characteristics can be stored in their own dictionary, e.g. dates, which require date ranging, and numbers.

• Systems that support ranking re-organize the inversion lists in ranked order.

• B-trees can also be used for the inversion instead of a dictionary. The inversion lists may be at the leaf level or referenced in higher-level pointers.

• A B-tree of order m is defined as:

• a root node with between 2 and 2m keys,

• all other internal nodes having between m and 2m keys,

• all keys kept in order from smaller to larger,

• all leaves at the same level, or differing by at most one level.
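A compact sketch of such an inverted file in Python: the document file is a dict of texts, the dictionary maps each processing token to its inversion list, and query logic is applied between the lists (all data is invented for illustration):

```python
import re
from collections import defaultdict

documents = {1: "the design of computer systems",
             2: "computer aided design tools",
             3: "database systems overview"}

# Build the dictionary: token -> inversion list of (doc_id, word_position) pairs.
inverted = defaultdict(list)
for doc_id, text in documents.items():
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        inverted[word].append((doc_id, pos))

def docs(term):
    """Set of document identifiers in the term's inversion list."""
    return {doc_id for doc_id, _ in inverted.get(term, [])}

print(sorted(inverted["design"]))          # [(1, 1), (2, 2)]
print(docs("computer") & docs("design"))   # {1, 2} -> Boolean AND over inversion lists
```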

Finite-State Morphology


• By finite-state morphological models, we mean those in which the specifications written by human programmers are directly compiled into finite-state transducers.

• The two most popular tools supporting this approach are XFST (Xerox Finite-State Tool) and LexTools.

• Finite-state transducers consist of a finite set of nodes connected by directed edges labeled with pairs of input and output symbols.

• In such a network or graph, nodes are also called states, while edges are called arcs.

• Traversing the network from the set of initial states to the set of final states along the arcs is equivalent to reading the sequences of encountered input symbols and writing the sequences of corresponding output symbols.

• The set of possible sequences accepted by the transducer defines the input language; the set of possible sequences emitted by the transducer defines the output language.

Input      Morphologically parsed output
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
goose      (goose +N +SG) or (goose +V)
gooses     goose +V +3SG
merging    merge +V +PRES-PART
caught     (caught +V +PAST-PART) or (catch +V +PAST)
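The input/output behaviour of such an analyzer can be sketched with a plain lookup table standing in for the compiled transducer (this is only a stand-in; real finite-state tools compile rules and lexicons into a transducer rather than enumerate forms):

```python
# Hypothetical analysis table reproducing the surface -> lexical-string pairs above.
ANALYSES = {
    "cats": ["cat +N +PL"],
    "cat": ["cat +N +SG"],
    "cities": ["city +N +PL"],
    "geese": ["goose +N +PL"],
    "goose": ["goose +N +SG", "goose +V"],
    "caught": ["caught +V +PAST-PART", "catch +V +PAST"],
}

def analyze(surface):
    """Return every lexical string for a surface form (ambiguous forms return several)."""
    return ANALYSES.get(surface, ["?"])

print(analyze("cats"))    # ['cat +N +PL']
print(analyze("caught"))  # ['caught +V +PAST-PART', 'catch +V +PAST']
```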

• A finite-state network can, for example, match words in the infinite regular language defined by grandson, great-grandson, great-great-grandson, and so on.

• In finite-state computational morphology, it is common to refer to the input word forms as surface strings and to the output descriptions as lexical strings, if the transducer is used for morphological analysis, or vice versa, if it is used for morphological generation.

• In English, a finite-state transducer could analyze the surface string children into the lexical string child [+plural], for instance, or generate women from woman [+plural].

• Relations on languages can also be viewed as functions. Let us have a relation R, and let us denote by [Σ] the set


of all sequences over some set of symbols Σ, so that the domain and the range of R are subsets of
[Σ].

• We can then consider R as a function mapping an input string into a set of output strings, formally denoted by a type signature in which [Σ] equals String.

• A theoretical limitation of finite-state models of morphology is the problem of capturing reduplication of words or their elements (e.g., to express plurality) found in several human languages.

Unification-Based Morphology
• The concepts and methods of these formalisms are often closely connected to those of logic programming.

• In finite-state morphological models, both surface and lexical forms are by themselves unstructured strings of atomic symbols.

• In higher-level approaches, linguistic information is expressed by more appropriate data structures that can include complex values or can be recursively nested if needed.

• Morphological parsing P thus associates linear forms φ with alternatives of structured content ψ.

• In this kind of morphological modelling, word forms are best captured by regular expressions, while the linguistic content is best described through typed feature structures.

• Feature structures can be viewed as directed acyclic graphs.

• A node in a feature structure comprises a set of attributes whose values can be atomic or feature structures themselves.

• Nodes are associated with types, and atomic values are attribute-less nodes distinguished by their type.

• Unification is the key operation by which feature structures can be merged into a more informative feature structure.

• Unification of feature structures can also fail, which means that the information in them is mutually incompatible (a small sketch of the operation follows at the end of this section).


• Morphological models of this kind are typically formulated as logic programs, and unification is used to solve the system of constraints imposed by the model.

• Advantages of this approach include better abstraction possibilities for developing a morphological grammar as well as elimination of redundant information from it.

• Unification-based models have been implemented for Russian, Czech, Slovene, Persian, Hebrew, Arabic, and other languages.
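A toy sketch of the unification operation described above, over feature structures represented as nested dicts; this is a simplification of the typed graphs in the text, and the feature names are made up for illustration:

```python
def unify(fs1, fs2):
    """Merge two feature structures; return None if they carry incompatible values."""
    if not isinstance(fs1, dict) or not isinstance(fs2, dict):
        return fs1 if fs1 == fs2 else None       # atomic values must match exactly
    result = dict(fs1)
    for attr, val in fs2.items():
        if attr in result:
            merged = unify(result[attr], val)
            if merged is None:
                return None                       # unification fails: incompatible information
            result[attr] = merged
        else:
            result[attr] = val
    return result

noun_stem = {"cat": "noun", "agr": {"num": "pl"}}
plural_suffix = {"agr": {"num": "pl", "per": "3"}}
print(unify(noun_stem, plural_suffix))
# {'cat': 'noun', 'agr': {'num': 'pl', 'per': '3'}}  -> merged into a more informative structure
print(unify({"agr": {"num": "sg"}}, plural_suffix))
# None -> mutually incompatible information
```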

Functional Morphology

ï · Functional morphology defines its models using principles of functional programming

and type theory.

ï · It treats morphological operations and processes as pure mathematical functions and organizes the
linguistic as well as abstract elements of a model into distinct types of values and type classes.

ï · Though functional morphology is not limited to modelling particular types of

morphologies in human languages, it is especially useful for fusional morphologies.

• Functional morphology implementations are intended to be reused as programming libraries capable of handling the complete morphology of a language and to be incorporated into various kinds of applications.

• Morphological parsing is just one usage of the system, the others being morphological generation, lexicon browsing, and so on. We can describe inflection I, derivation D, and lookup L as functions of a suitably generic type.
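A minimal Python sketch of this idea follows: inflection is treated as a pure function from a lemma and a feature bundle to a word form, defined by paradigm functions stored in a lexicon. The paradigm, lexicon entries, and feature names are illustrative assumptions, not part of any particular functional morphology library.

# Inflection as a pure function: lemma + features -> word form.
def weak_verb(stem, feats):
    # Toy paradigm for regular English verbs.
    if feats == ("pres", "3sg"):
        return stem + "s"
    if feats == ("past",):
        return stem + "ed"
    return stem

LEXICON = {"walk": weak_verb, "play": weak_verb}

def inflect(lemma, feats):
    # Inflection I: look up the lemma's paradigm and apply it.
    return LEXICON[lemma](lemma, feats)

print(inflect("walk", ("past",)))        # walked
print(inflect("play", ("pres", "3sg")))  # plays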

ï · Many functional morphology implementations are embedded in a general-purpose programming


language, which gives programmers more freedom with advanced programming techniques and allows
them to develop full-featured, real-world applications for their models.

• It influenced the functional morphology framework with which morphologies of Latin, Swedish, Spanish, Urdu, and other languages have been implemented.

ï · The notation then constitutes a so-called domain-specific embedded language, which makes
programming even

more fun.


ï · Even without the options provided by general-purpose programming languages, functional


morphology models

achieve high levels of abstraction.

ï · Morphological grammars in Grammatical Framework can be extended with descriptions of the


syntax and semantics of a language.

ï · Grammatical Framework itself supports multilinguality, and models of more than a dozen languages
are available in

it as open-source software.

2.Finding structure of Documents


Introduction

ï · In human language, words and sentences do not appear randomly but have structure.

• For example, combinations of words form sentences: meaningful grammatical units, such as statements, requests, and commands.

ï · Automatic extraction of structure of documents helps subsequent NLP tasks: for example, parsing,
machine translation, and semantic role labelling use sentences as the basic processing unit.

ï · Sentence boundary annotation(labelling) is also important for aiding human readability of

automatic speech recognition (ASR) systems.

• The task of deciding where sentences start and end, given a sequence of characters (made of words and typographical cues), is called sentence boundary detection.

• Topic segmentation is the task of determining when a topic starts and ends in a sequence of sentences.

• Statistical classification approaches try to find the presence of sentence and topic boundaries given human-annotated training data.

• These methods base their predictions on features of the input: local characteristics that give evidence toward the presence or absence of a sentence boundary, such as a period (.), a question mark (?), an exclamation mark (!), or another type of punctuation.

ï · Features are the core of classification approaches and require careful design and selection in

order to be successful and prevent overfitting and noise problem.


• Most statistical approaches described here are language independent, but every language presents its own challenges.

ï · For example, for processing of Chinese documents, the processor may need to first segment the
character sequences into words, as the words usually are not separated by a space.

ï · Similarly, for morphological rich languages, the word structure may need to be analyzed to
extract additional features.

ï · Such processing is usually done in a pre-processing step, where a sequence of tokens is


determined.

ï · Tokens can be word or sub-word units, depending on the task and language.

ï · These algorithms are then applied on tokens.

Sentence Boundary Detection

ï · Sentence boundary detection (Sentence segmentation) deals with automatically segmenting

a sequence of word tokens into sentence units.

ï · In written text in English and some other languages, the beginning of a sentence is usually
marked with an uppercase letter, and the end of a sentence is explicitly marked with a

period(.), a question mark(?), an exclamation mark(!), or another type of punctuation.

• In addition to their role as sentence boundary markers, capitalized initial letters are used to distinguish proper nouns, periods are used in abbreviations, and numbers and punctuation marks are used inside proper names.

ï · The period at the end of an abbreviation can mark a sentence boundary at the same time.

ï · Example: I spoke with Dr. Smith. and My house is on Mountain Dr.

ï · In the first sentence, the abbreviation Dr. does not end a sentence, and in the second it does.

ï · Especially quoted sentences are always problematic, as the speakers may have uttered multiple
sentences, and sentence boundaries inside the quotes are also marked with

punctuation marks.

ï · An automatic method that outputs word boundaries as ending sentences according to the

presence of such punctuation marks would result in cutting some sentences incorrectly.

• Ambiguous abbreviations and capitalizations are not the only problems for sentence segmentation in written text.

• Spontaneously written texts, such as short message service (SMS) texts or instant messaging (IM) texts, tend to be nongrammatical and have poorly used or missing punctuation, which makes sentence segmentation even more challenging.

• Similarly, if the text input to be segmented into sentences comes from an automatic system, such as optical character recognition (OCR) or ASR, that aims to translate images of handwritten, typewritten, or printed text or spoken utterances into machine-editable text, the finding of sentence boundaries must deal with the errors of those systems as well.

ï · On the other hand, for conversational speech or text or multiparty meetings with

ungrammatical sentences and disfluencies, in most cases it is not clear where the boundaries

are.

ï · Code switching -that is, the use of words, phrases, or sentences from multiple languages by
multilingual speakers- is another problem that can affect the characteristics of sentences.

ï · For example, when switching to a different language, the writer can either keep the

punctuation rules from the first language or resort to the code of the second language.

ï · Conventional rule-based sentence segmentation systems in well-formed texts rely on patterns

to identify potential ends of sentences and lists of abbreviations for disambiguating them.

• For example, if the word before the boundary is a known abbreviation, such as "Mr." or "Gov.", the text is not segmented at that position, even though some periods are exceptions.

ï · To improve on such a rule-based approach, sentence segmentation is stated as a classification


problem.

ï · Given the training data where all sentence boundaries are marked, we can train a classifier to
recognize them.
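The sketch below frames sentence boundary detection as such a classification problem using scikit-learn; the tiny training set and the handful of features are illustrative assumptions, whereas real systems use large annotated corpora and much richer feature sets.

# Sentence boundary detection as a local classification problem (sketch).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(prev_tok, next_tok):
    # Local features around a candidate boundary (a token ending in a period).
    return {
        "prev_is_abbrev": prev_tok in {"Dr.", "Mr.", "Gov.", "Prof."},
        "prev_length": len(prev_tok),
        "next_capitalized": next_tok[:1].isupper(),
    }

# (token before the period, token after it, is this a sentence boundary?)
train = [
    ("Smith.", "My", True), ("Dr.", "Smith", False),
    ("mat.", "The", True), ("Mr.", "Jones", False),
    ("Dr.", "My", False), ("house.", "It", True),
]
vec = DictVectorizer()
X = vec.fit_transform([features(p, n) for p, n, _ in train])
y = [label for _, _, label in train]
clf = LogisticRegression().fit(X, y)

for prev_tok, next_tok in [("Gov.", "Brown"), ("dog.", "She")]:
    x = vec.transform([features(prev_tok, next_tok)])
    print(prev_tok, next_tok, "->", clf.predict(x)[0])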

Topic Boundary Detection


ï · Segmentation (Discourse or text segmentation) is the task of automatically dividing a stream of
text or speech into topically homogenous blocks.

• That is, given a sequence of (written or spoken) words, the aim of topic segmentation is to find the boundaries where topics change.

ï · Topic segmentation is an important task for various language understanding applications, such as
information extraction and retrieval and text summarization.

• For example, in information retrieval, if a long document can be segmented into shorter, topically coherent segments, then only the segment that is about the user's query could be retrieved.


• During the late 1990s, the U.S. Defense Advanced Research Projects Agency (DARPA) initiated the topic detection and tracking (TDT) program to further the state of the art in finding and following new topics in a stream of broadcast news stories.

ï · One of the tasks in the TDT effort was segmenting a news stream into individual stories .

Methods
ï · Sentence segmentation and topic segmentation have been considered as a boundary

classification problem .
ï · Given a boundary candidate( between two word tokens for sentence segmentation and between
two sentences for topic segmentation), the goal is to predict whether or not the candidate is an actual
boundary (sentence or topic boundary).

• Formally, let x ∈ X be the vector of features (the observation) associated with a candidate and y ∈ Y be the label predicted for that candidate.

• The label y can be b for boundary and b̄ for non-boundary.

• Classification problem: given a set of training examples (x, y)_train, find a function that will assign the most accurate possible label y to unseen examples x_unseen.

ï · Alternatively to the binary classification problem, it is possible to model boundary types using

finer-grained categories.

• Segmentation in text can then be framed as a three-class problem: sentence boundary with an abbreviation, sentence boundary without an abbreviation, and abbreviation that is not a boundary.

• Similarly, for spoken language, a three-way classification can be made between non-boundaries, statement boundaries b_s, and question boundaries b_q.

• For sentence or topic segmentation, the problem is defined as finding the most probable sentence or topic boundaries.

• The natural unit of sentence segmentation is the word, and of topic segmentation the sentence, as we can assume that topics typically do not change in the middle of a sentence.

• The words or sentences are then grouped into contiguous stretches belonging to one sentence or topic; that is, word or sentence boundaries are classified into sentence or topic boundaries and non-boundaries.


• The classification can be done at each potential boundary i (local modelling); then, the aim is to estimate the most probable boundary type ŷ_i for each candidate x_i:

ŷ_i = argmax_{y_i} P(y_i | x_i)        (2.1)

Here, the ^ is used to denote estimated categories, and a variable without a ^ is used to show possible categories.

• In this formulation, a category is assigned to each example in isolation; hence, the decision is made locally.

• However, consecutive boundary types can be related to each other. For example, in broadcast news speech, two consecutive sentence boundaries that would form a single-word sentence are very infrequent.

ï · In local modelling, features can be extracted from surrounding example context of the

candidate boundary to model such dependencies.

• It is also possible to see the candidate boundaries as a sequence and search for the sequence of boundary types that has the maximum probability given the candidate examples:

Ŷ = argmax_Y P(Y | X)        (2.2)
ï · We categorize the methods into local and sequence classification.

ï · Another categorization of methods is done according to the type of the machine learning
algorithm: generative versus

discriminative .

ï · Generative sequence models estimate the joint distribution of the observations P(X,Y) (words,
punctuation) and the labels(sentence boundary, topic boundary).

• Discriminative sequence models, however, focus on features that characterize the differences between the possible labellings of the examples.

1.Generative Sequence Classification Methods


• The most commonly used generative sequence classification method for topic and sentence segmentation is the hidden Markov model (HMM), in which the model is formulated according to Bayes' rule.

HMM means: a hidden Markov model (HMM) is a statistical model that can be used to describe the evolution of observable events that depend on internal factors, which are not directly observable.

• Generative models can also be realized as hidden event language models (HELMs), which can handle training data sets that are multiple orders of magnitude larger.

• The probability in equation 2.2 is rewritten as the following, using Bayes' rule:

Ŷ = argmax_Y P(Y | X)                          (2.2)
  = argmax_Y P(X | Y) P(Y) / P(X)              (2.3)

Here:

Ŷ = the predicted class (boundary) label sequence

Y = (y_1, y_2, ..., y_k) = a sequence of class (boundary) labels

X = (x_1, x_2, ..., x_n) = the sequence of feature vectors

P(Y|X) = the probability of the class (boundary) labels Y given the feature vectors X

P(X) = the probability of the word sequence (the feature vectors)

P(Y) = the probability of the class (boundary) label sequence

• P(X) in the denominator is dropped because it is fixed for different Y and hence does not change the argument of the max.

• P(X|Y) and P(Y) can then be estimated from the annotated training data.

2.Discriminative Local Classification Methods


• Discriminative classifiers aim to model P(y_i | x_i) in equation 2.1 directly.

• The most important distinction is that class densities P(x|y) are model assumptions in generative approaches, whereas discriminative approaches make no such assumption.

• A number of discriminative classification approaches are used, such as support vector machines, boosting, maximum entropy, and regression; these are based on different machine learning algorithms used to discriminate between the classes when classifying sentence boundaries.

• Discriminative approaches have been shown to outperform generative methods in many speech and language processing tasks.

• For sentence segmentation, supervised learning methods have primarily been applied to newspaper articles.

Supervised learning methods train a machine from labelled examples so that it can make predictions on new inputs; there are many supervised learning algorithms for different purposes.

• Stamatatos, Fakotakis, and Kokkinakis used transformation-based learning (TBL) to infer rules for finding sentence boundaries.

• Many supervised classifiers have been tried for the sentence boundary task, such as regression trees, neural networks, maximum entropy classifiers, support vector machines, and naïve Bayes classifiers.

• The TextTiling method is commonly used for topic segmentation; it uses a lexical cohesion metric (the binding of words to one another) in a word vector space as an indicator of topic similarity.

• A typical graph of similarity with respect to consecutive segmentation units shows dips where the topic changes.

ï · The document is chopped when the similarity is below some threshold.

ï · Originally, two methods for computing the similarity scores were proposed: block

comparison and vocabulary introduction .


ï · The first, block comparison, compares adjacent blocks of text to see how similar they are
according to how many words the adjacent blocks have in common.

• Given two blocks, b1 and b2, each having k tokens (sentences or paragraphs), the similarity (topical cohesion) score between the two blocks is computed with the following formula:

score(b1, b2) = Σ_t w_{t,b1} · w_{t,b2} / sqrt( Σ_t w_{t,b1}² · Σ_t w_{t,b2}² )

• where w_{t,b} is the weight assigned to term t in block b.

• The weights can be binary or may be computed using other information retrieval metrics, such as term frequency.
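A minimal Python sketch of the block-comparison score follows, using raw term frequencies as the weights w_{t,b}; the example blocks are illustrative assumptions.

# Cosine-style block comparison between two adjacent blocks of text.
import math
from collections import Counter

def block_similarity(block1, block2):
    # Term-frequency weights for each block.
    w1, w2 = Counter(block1.lower().split()), Counter(block2.lower().split())
    shared = set(w1) & set(w2)
    numerator = sum(w1[t] * w2[t] for t in shared)
    denominator = math.sqrt(sum(c * c for c in w1.values()) *
                            sum(c * c for c in w2.values()))
    return numerator / denominator if denominator else 0.0

b1 = "the cat sat on the mat and the cat purred"
b2 = "the cat chased the mouse across the mat"
b3 = "stock prices fell sharply in early trading"
print(block_similarity(b1, b2))  # relatively high: same topic
print(block_similarity(b1, b3))  # near zero: likely topic boundary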

• The second method, the vocabulary introduction method, assigns a score to a token-sequence gap on the basis of how many new words are seen in the interval in which it is the midpoint.

• Similar to the block comparison formulation, given two consecutive blocks b1 and b2, of an equal number of words w, the topical cohesion score is computed with the following formula:

score(b1, b2) = ( NumNewTerms(b1) + NumNewTerms(b2) ) / (2w)

• where NumNewTerms(b) returns the number of terms in block b seen for the first time in the text.

3.Discriminative Sequence Classification Methods

• In segmentation tasks, the sentence or topic decision for a given example (word, sentence, paragraph) highly depends on the decisions for the examples in its vicinity (the examples surrounding the candidate).

ï · Discriminative sequence classification methods are in general extensions of local discriminative


models with additional decoding stages that find the best assignment of labels by looking at neighbouring
decisions to label.

• Machine learning algorithms commonly used for discriminative sequence classification of examples (words, sentences, paragraphs) include conditional random fields (CRFs), which can be seen as a discriminative counterpart of HMMs, and sequence extensions of support vector machines (SVMs).

ï · Contrary to local classifiers that predict sentences or topic boundaries independently, CRFs can
oversee the whole sequence of boundary hypotheses to make their decisions.

4.Hybrid Approaches
In this approach, segment classification is performed by applying the Viterbi algorithm, which is implemented by an HMM.


The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states (called the Viterbi path) that results in a sequence of observed events, especially in the context of hidden Markov models (HMMs).
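A minimal Python sketch of the Viterbi algorithm follows; the states, transition, emission, and start probabilities below are illustrative assumptions chosen for a toy boundary-detection HMM.

# Viterbi decoding for a toy HMM over boundary / non-boundary states.
def viterbi(observations, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of reaching state s at time t, best previous state)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for t in range(1, len(observations)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][observations[t]], p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrack from the best final state to recover the Viterbi path.
    state = max(states, key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = V[t][state][1]
        path.insert(0, state)
    return path

states = ("boundary", "no_boundary")
start_p = {"boundary": 0.2, "no_boundary": 0.8}
trans_p = {"boundary": {"boundary": 0.05, "no_boundary": 0.95},
           "no_boundary": {"boundary": 0.25, "no_boundary": 0.75}}
emit_p = {"boundary": {"period": 0.8, "word": 0.2},
          "no_boundary": {"period": 0.1, "word": 0.9}}
print(viterbi(["word", "period", "word"], states, start_p, trans_p, emit_p))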

Complexity of the Approaches


• The approaches described here each have advantages and disadvantages.

• In a given context and under a given set of observation features, one approach may be better than the other.
ï · These approaches can be rated in terms of complexity (time and memory) of their training

and prediction algorithms and in terms of their performance on real-world datasets .

ï · In terms of complexity, training of discriminative approaches is more complex than training

of generative ones because they require multiple passes over the training data to adjust for feature
weights.

• However, generative models such as hidden event language models (HELMs) can handle training data sets that are multiple orders of magnitude larger.

On the other hand, their disadvantage is that they work with only a few features.


Natural Language Processing

R18 B.Tech. CSE (AIML) III & IV Year JNTU Hyderabad

Prepared by

K SWAYAMPRABHA

Assistant Professor

UNIT - II

Syntax Analysis :

Parsing Natural Language

Treebanks : A Data-Driven Approach to Syntax

Representation of Syntactic Structure

1 Syntax Analysis Using Dependency Graphs

2 Syntax Analysis Using Phrase Structure Trees

Parsing Algorithms

1 Shift-Reduce Parsing

2 Hypergraphs and Chart Parsing

3 Minimum Spanning Trees and Dependency Parsing

Models for Ambiguity Resolution in Parsing

1 Probabilistic Context-Free Grammars

2 Generative Models for Parsing

3 Discriminative Models for Parsing

Multilingual Issues: What Is a Token?

1 Tokenization, Case, and Encoding

2 Word Segmentation

3 Morphology

Parsing natural language refers to the process of analyzing the structure of a sentence in order to determine its
meaning. This is typically done by breaking down the sentence into its constituent parts, such as nouns, verbs,
adjectives, and adverbs, and analyzing how these parts are related to each other.


There are several different methods for parsing natural language,

1. Rule-based systems,

2. Statistical models, and

3. Machine learning algorithms.

Rule-based systems rely on sets of predefined rules to analyze sentence structure and identify the
relationships between different parts of speech.

Statistical models use algorithms to analyze large datasets of annotated sentences in order to identify patterns
and relationships between different parts of speech.

Machine learning algorithms , such as neural networks, are becoming increasingly popular for parsing natural
language. These algorithms use large datasets of annotated sentences to train models that can accurately
predict the structure and meaning of new sentences.

Parsing natural language is an important task in many areas, including natural language processing, machine
translation, and speech recognition. It can also be used in applications such as chatbots, virtual assistants, and
search engines, where understanding the meaning of user queries is essential for providing accurate and
relevant responses.

TreeBank

It may be defined as a linguistically parsed text corpus that annotates syntactic or semantic sentence structure. Geoffrey Leech coined the term 'treebank', reflecting that the most common way of representing the grammatical analysis is by means of a tree structure.

Treebanks are a data-driven approach to syntax analysis in natural language processing. A treebank is a
collection of sentences that have been parsed and annotated with their syntactic structures, typically
represented as syntax trees.

In a treebank, each word in a sentence is labeled with its part of speech, such as noun, verb, or adjective. The
words are then connected together in a tree structure that represents the relationships between them.

For example , a simple sentence like "The cat chased the mouse" might be represented as a tree with "cat"
and "mouse" as noun phrases, "chased" as a verb, and "the" and "the" as determiners.

Treebanks are created by linguists and other experts who manually annotate the sentences with their syntactic
structures. The process of creating a treebank is time-consuming and requires a lot of expertise , but once a
treebank has been created, it can be used to train machine learning algorithms to automatically parse new
sentences.

Treebanks are an important resource in natural language processing because they provide a large corpus of
annotated sentences that can be used to train and evaluate syntactic parsers. They can also be used to study
patterns in syntax across different languages and to develop new theories of syntax.

Some examples of well-known treebanks include the Penn Treebank, which is a collection of over 4 million
words of English text, and Prague Dependency Treebank, which is a collection of annotated sentences in
Czech.


Representation of Syntactic Structure

The representation of syntactic structure in natural language processing refers to how the grammatical
structure of a sentence is captured and represented. There are several different ways to represent syntactic
structure

1. Syntax Analysis Using Dependency Graphs

Syntax analysis using dependency graphs is a common approach in natural language processing that
represents the grammatical structure of a sentence as a directed graph . In this approach, each word in the
sentence is represented as a node in the graph, and the relationships between the words are represented as
directed edges between the nodes.

The edges in a dependency graph represent the syntactic relationships between words in the sentence.

For example , a subject-verb relationship is represented by an edge from the subject word to the verb word,
while an object-verb relationship is represented by an edge from the object word to the verb word. Other
relationships that can be represented in dependency graphs include adverbial modifiers, conjunctions, and
prepositions.

Dependency graphs can be used for a variety of syntax analysis tasks, including dependency parsing, named
entity recognition, and sentiment analysis. In dependency parsing, the goal is to automatically generate a
dependency graph for a given sentence. This can be done using a variety of algorithms, including
transition-based and graph-based parsers. Once a dependency graph has been generated, it can be used to
identify the grammatical structure of the sentence, extract information about the relationships between words,
and perform other syntactic analysis tasks.

Named entity recognition is another task that can be performed using dependency graphs . In this task, the
goal is to identify and classify named entities in a sentence, such as people, places, and organizations. This
can be done by analyzing the dependency relationships between words in the sentence and looking for
patterns that indicate the presence of named entities.

Top of Form

Here is an example of a dependency graph for the sentence "The cat chased the mouse":

              chased (V)
             /          \
        cat (N)        mouse (N)
           |                |
       The (DET)        the (DET)

In this example, the words "cat" and "mouse" are connected to the verb "chased" by directed edges that indicate their syntactic relationships. Specifically, "cat" is the subject of the verb and "mouse" is its direct object, while "chased" is the head or root of the sentence. The determiners "the" preceding both "cat" and "mouse" are also included in the dependency graph as dependents of their respective nouns. This dependency graph captures the syntactic structure of the sentence and can be used to perform a variety of syntactic analysis tasks.
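A minimal sketch of producing such a dependency analysis with the spaCy library follows; it assumes spaCy and its small English model "en_core_web_sm" are installed, and the exact relation labels may vary with the model version.

# Dependency parsing of the example sentence with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse.")
for token in doc:
    # word, its dependency relation, and the head it attaches to,
    # e.g. "cat nsubj -> chased" and "mouse dobj -> chased"
    print(token.text, token.dep_, "->", token.head.text)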

2. Syntax Analysis Using Phrase Structure Trees with example

Syntax analysis using phrase structure trees is another approach in natural language processing that represents
the grammatical structure of a sentence as a tree structure. In this approach, a sentence is broken down into its
constituent phrases, and each phrase is represented as a node in the tree. The words in the sentence are then
assigned to the appropriate nodes based on their syntactic roles.

Here is an example of a phrase structure tree for the sentence "The cat chased the mouse":

                 S
          _______|_______
         |               |
         NP              VP
      ___|___        ____|____
     |       |      |         |
    DET      N      V         NP
     |       |      |      ___|___
     |       |      |     |       |
    The     cat  chased  DET      N
                          |       |
                         the    mouse

In this example, the sentence is represented as a tree structure with the root node S (representing the sentence), the second-level nodes NP and VP (representing the noun phrase and verb phrase), and lower-level nodes DET, N, and V (representing determiners, nouns, and verbs). The words in the sentence are assigned to the appropriate nodes based on their syntactic roles. For example, "cat" and "mouse" belong to noun phrases, while "chased" is assigned to the V node. The determiners "the" preceding both "cat" and "mouse" are also included in the tree structure as children of their respective noun phrases.

This phrase structure tree captures the syntactic structure of the sentence and can be used to perform a variety
of syntactic analysis tasks, including parsing, translation, and text-to-speech synthesis.
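A minimal sketch of building and displaying this phrase structure tree with the NLTK toolkit follows (it assumes the nltk package is installed).

# Construct the tree from a bracketed string and display it.
from nltk import Tree

tree = Tree.fromstring(
    "(S (NP (DET The) (N cat)) (VP (V chased) (NP (DET the) (N mouse))))")
tree.pretty_print()   # draws the tree as ASCII art
print(tree.leaves())  # ['The', 'cat', 'chased', 'the', 'mouse']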

Parsing Algorithms

Parsing algorithms are algorithms used in computer science to analyze the structure of a string of symbols in a
particular formal language, typically represented as a context-free grammar. The process of analyzing the
structure of a string is called parsing


1.Shift-reduce parsing is a type of bottom-up parsing technique used in computer science to analyze and
understand the structure of a string or a sequence of tokens based on a given grammar. The technique starts
with an empty stack and an input string, and repeatedly applies two operations: shift and reduce.

Shift moves the next input token onto the stack, while reduce applies a grammar rule to reduce the top of the
stack to a non-terminal symbol. The parser continues to apply these operations until it reaches the end of the
input string and the stack contains only the start symbol of the grammar.

The shift-reduce parsing algorithm can be implemented using

a finite state machine or a push-down automaton, which enables it to handle a wide range of context-free
grammars, including ambiguous grammars. However, it may fail to recognize some valid input strings, which
can be resolved by using look ahead techniques or by modifying the grammar to eliminate ambiguities.

Shift-reduce parsing is widely used in compiler design and natural language processing, as it provides an
efficient and effective method for parsing and analyzing large volumes of structured data. Some common
algorithms for shift-reduce parsing include LR parsers, SLR parsers, and LALR parsers, each of which has its
own strengths and weaknesses depending on the complexity and structure of the grammar.

here's an example of shift-reduce parsing:

Suppose we have the following grammar:

S -> E

E -> E + T | T

T -> T * F | F

F -> ( E ) | id

And we want to parse the input string "id * ( id + id )". We can use a shift-reduce parser to build a parse tree
for this string as follows:

1. Start with an empty stack and the input string "id * ( id + id )".

2. Shift the first token "id" onto the stack.

3. Reduce the top of the stack to "F" using the rule "F -> id".

4. Reduce the top of the stack to "T" using the rule "T -> F".

5. Shift the next token "*" onto the stack.

6. Shift the next token "(" onto the stack.

7. Shift the next token "id" onto the stack.

8. Reduce the top of the stack to "F" using the rule "F -> id".

9. Reduce the top of the stack to "T" using the rule "T -> F".

10. Reduce the top of the stack to "E" using the rule "E -> T".

11. Shift the next token "+" onto the stack.

12. Shift the next token "id" onto the stack.

13. Reduce the top of the stack to "F" using the rule "F -> id".

14. Reduce the top of the stack to "T" using the rule "T -> F".

15. Reduce the top of the stack to "E" using the rule "E -> E + T".

16. Shift the next token ")" onto the stack.

17. Reduce "( E )" on top of the stack to "F" using the rule "F -> ( E )".

18. Reduce "T * F" on top of the stack to "T" using the rule "T -> T * F".

19. Reduce the top of the stack to "E" using the rule "E -> T".

20. Reduce the top of the stack to "S" using the rule "S -> E". The stack now contains only the start symbol "S", indicating that the input string has been successfully parsed.

The resulting parse tree for this input string is:

            E
            |
            T
      ______|______
     |      |      |
     T      *      F
     |          ___|___
     F         |   |   |
     |         (   E   )
     id         ___|___
               |   |   |
               E   +   T
               |       |
               T       F
               |       |
               F       id
               |
               id

2. Hypergraphs and Chart Parsing

Hypergraphs and chart parsing are both techniques used in natural language processing and computational
linguistics to analyze the structure and meaning of natural language sentences .


A hypergraph is a graph-like data structure in which hyper edges connect multiple nodes instead of just two
nodes as in a conventional graph. In the context of natural language processing, hypergraphs can be used to
represent the syntactic and semantic structures of sentences. Each node in the hypergraph represents a word or
a phrase in the sentence, and each hyper edge represents a grammatical relationship between those words or
phrases.

Chart parsing is a type of parsing algorithm that uses a dynamic programming approach to build a chart or
table that represents the different possible syntactic and semantic structures of a sentence. The chart contains
cells that represent the different combinations of words and phrases in the sentence, and the algorithm uses a
set of grammar rules to fill in the cells with possible syntactic and semantic structures.

Chart parsing can be used with hypergraphs to build a more complex representation of sentence structure and
meaning. The hypergraph can be used to represent the full range of possible syntactic and semantic structures
for a sentence, while the chart can be used to efficiently explore and evaluate those structures.

One common type of chart parsing algorithm is the Earley parser, which uses a bottom-up approach to
construct the chart. Another common algorithm is the CYK parser, which uses a top-down approach and is
based on context-free grammars.

Chart parsing with hypergraphs is widely used in natural language processing applications such as machine
translation, text-to-speech synthesis, and information extraction. By representing the structure and meaning of
sentences in a formal and precise way, these techniques can help computers to better understand and generate
natural language text.
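A minimal sketch of chart parsing with NLTK's ChartParser follows; the toy grammar and lexicon are illustrative assumptions (it assumes the nltk package is installed).

# Chart parsing "the cat sat on the mat" with a small context-free grammar.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> DET N
VP -> V PP
PP -> P NP
DET -> 'the'
N  -> 'cat' | 'mat'
V  -> 'sat'
P  -> 'on'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree)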

Suppose we have the following sentence:

"The cat sat on the mat."

We can use a hypergraph to represent the different possible syntactic and semantic structures for this sentence.
Each node in the hypergraph represents a word or a phrase, and each hyperedge represents a grammatical
relationship between those words or phrases. Here's an example hypergraph for this sentence:

Nodes:       the   cat   sat   on   the   mat

Hyperedges:  groups of related nodes, for example the hyperedge connecting "cat" and "sat"

In this hypergraph, the words "the", "cat", "sat", "on", and "mat" are represented as nodes, and the hyperedges
represent the grammatical relationships between those words. For example, the hyperedge connecting "cat"
and "sat" represents the fact that "cat" is the subject of the verb "sat".


We can then use chart parsing to build a chart that represents the different possible syntactic and semantic
structures for this sentence. Each cell in the chart represents a combination of words or phrases, and the chart
is filled in with possible structures based on a set of grammar rules. Here's an example chart for this sentence:

            D           N           V           P
        +-----------+-----------+-----------+-----------+
   1    |   The     |           |           |           |
        +-----------+-----------+-----------+-----------+
   2    |           |   cat     |           |           |
        +-----------+-----------+-----------+-----------+
   3    |           |           |   sat     |           |
        +-----------+-----------+-----------+-----------+
   4    |           |           |           |   on      |
        +-----------+-----------+-----------+-----------+
   5    |   the     |           |           |           |
        +-----------+-----------+-----------+-----------+
   6    |           |   mat     |           |           |
        +-----------+-----------+-----------+-----------+

Further cells (S1, S2, ..., S6) combine these entries over larger spans into noun phrase, prepositional phrase, verb phrase, and sentence structures.

In this chart, the rows represent the start and end positions of phrases, and the columns represent the different
phrase types (D for determiner, N for noun, V for verb, and P for preposition). The cells are filled in with the
possible structures based on the grammar rules. For example, cell (2,2) represents the phrase "cat", cell (3,3)
represents the verb phrase "sat", and cell (4,4) represents the prepositional phrase "on the mat".

By combining hypergraphs and chart parsing, we can build a more complete representation of the possible structures and meanings of a sentence.

Minimum spanning trees and dependency parsing are techniques used in natural language processing to
analyze the grammatical structure of sentences.

3. A minimum spanning tree (MST) is a tree-like structure that connects all the nodes in a weighted graph
with the minimum possible total edge weight. In the context of natural language processing, an MST can be
used to represent the most likely grammatical structure for a sentence, with the nodes representing the words
in the sentence and the edges representing the grammatical relationships between those words.

Dependency parsing is a type of parsing algorithm that uses syntactic dependency relationships to analyze the
structure of a sentence. In a dependency tree, the nodes represent the words in the sentence and the edges
represent the grammatical relationships between those words, such as subject-verb or object-preposition.

Here's an example of how minimum spanning trees and dependency parsing can be used to analyze a
sentence:

Consider the sentence "John gave Mary a book". We can use dependency parsing to identify the syntactic
dependencies between the words in the sentence:

    gave
     |-- nsubj --> John
     |-- iobj  --> Mary
     '-- dobj  --> book
                    '-- det --> a

In this dependency tree, the nodes represent the words in the sentence, and the edges represent the syntactic
dependencies between those words. For example, the "nsubj" edge connects "John" to "gave" and represents
the fact that "John" is the subject of the verb "gave". The "dobj" edge connects "book" to "gave" and
represents the fact that "book" is the direct object of the verb "gave".


We can then use an MST algorithm to find the minimum spanning tree that connects all the nodes in the
dependency tree with the minimum possible total edge weight. The resulting MST represents the most likely
grammatical structure for the sentence.

    gave
     |-- nsubj --> John
     |-- iobj  --> Mary
     '-- dobj  --> book
                    '-- det --> a

In this MST, the nodes still represent the words in the sentence, but the edges represent the most likely
grammatical relationships between those words. For example, the "nsubj" and "dobj" edges are the same as in
the original dependency tree, but the "det" edge connecting "a" to "book" represents the fact that "a" is a
determiner for "book".

By analyzing the grammatical structure of sentences with minimum spanning trees and dependency parsing,
we can gain insights into the meaning and structure of natural language text. These techniques are widely used
in applications such as machine translation, sentiment analysis, and text classification.

Models for Ambiguity Resolution in Parsing

Parsing is the process of analyzing a sentence and determining its syntactic structure. However, natural
language sentences can often be ambiguous, and different parsing models may assign different syntactic
structures to the same sentence. In order to resolve ambiguity in parsing, various models have been proposed.
Here are some of the most common models for ambiguity resolution in parsing:


1. Probabilistic context-free grammars (PCFGs) are a type of context-free grammar where each production
rule is assigned a probability . These probabilities represent the likelihood of generating a particular string of
symbols using the given rule. PCFGs are often used in natural language processing tasks such as parsing,
where they can be used to assign probabilities to different parse trees for a given sentence.

Here is an example of a PCFG for generating simple arithmetic expressions:

S -> E [1.0]

E -> E + E [0.2]

E -> E - E [0.2]

E -> E * E [0.2]

E -> E / E [0.1]

E -> ( E ) [0.1]

E -> num [0.2]

In this grammar, S is the start symbol and E represents an arithmetic expression. The production rules for E indicate that an arithmetic expression can be generated by adding two expressions with probability 0.2, subtracting two expressions with probability 0.2, multiplying two expressions with probability 0.2, dividing two expressions with probability 0.1, enclosing an expression in parentheses with probability 0.1, or rewriting it as a number (num) with probability 0.2. Note that the probabilities of all rules with the same left-hand side sum to 1.

2 Generative Models for Parsing

Generative models for parsing are statistical models that aim to generate or simulate sentences that follow the
same distribution as a given training corpus. These models use probabilistic context-free grammars (PCFGs)
or other similar techniques to generate sentences based on their grammar and probability distributions.

Here is an example of a generative model for parsing that uses a PCFG:

Suppose we have a small training corpus of three sentences:

1. The cat sat on the mat.

2. The dog chased the cat.

3. The bird flew away.

We can use this corpus to learn a PCFG that represents the grammar of these sentences. Here is a sample
PCFG:

S -> NP VP [1.0]

NP -> Det N [0.67] | N [0.33]

VP -> V NP [0.67] | V [0.33]


Det -> the [1.0]

N -> cat [0.33] | dog [0.33] | bird [0.33]

V -> sat [0.33] | chased [0.33] | flew [0.33]

This grammar allows us to generate new sentences that are similar to the sentences in the training corpus. For
example, we can use the following parse tree to generate the sentence "The cat chased the bird":

             S
       ______|______
      |             |
      NP            VP
    __|__       ____|____
   |     |     |         |
  Det    N     V         NP
   |     |     |       __|__
  the   cat  chased   |     |
                     Det    N
                      |     |
                     the   bird

To generate this sentence, we start with the S symbol and apply the production rule S -> NP VP . We then
randomly choose between the two possible expansions of NP and VP based on their probabilities. In this case,
we choose NP -> Det N and VP -> V NP . We then randomly choose the expansions of Det, N, and V based
on their probabilities. Finally, we combine the resulting strings to get the sentence "The cat chased the bird".

We can generate other sentences in the same way, by randomly choosing expansion rules based on their
probabilities. Note that this approach allows us to generate sentences that may not have appeared in the
training corpus, but are still grammatically correct according to the PCFG.
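A minimal sketch of using such a PCFG with NLTK's probabilistic Viterbi parser follows (it assumes the nltk package is installed); the rule probabilities are simplified so that each nonterminal's rules sum exactly to 1, which NLTK requires.

# Parsing with a PCFG and returning the most probable tree.
import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [1.0]
VP -> V NP [1.0]
Det -> 'the' [1.0]
N -> 'cat' [0.4] | 'dog' [0.3] | 'bird' [0.3]
V -> 'sat' [0.4] | 'chased' [0.3] | 'flew' [0.3]
""")
parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)          # the most probable parse
    print(tree.prob())   # its probability under the grammar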

3. Discriminative Models for Parsing with example

Discriminative models for parsing are statistical models that aim to predict the correct parse tree for a given
input sentence. These models use features of the input sentence and their context to make this prediction.

Here is an example of a discriminative model for parsing using a linear support vector machine (SVM):

Suppose we have the following input sentence:

"The cat sat on the mat."

We can use a set of hand-crafted features to represent the input sentence and its context, such as:


ï · The current word "cat"

ï · The previous word "The"

ï · The next word "sat"

ï · The part of speech (POS) tag of "cat"

ï · The POS tag of "The"

ï · The POS tag of "sat"

ï · The dependency relation between "cat" and "sat"

ï · The head word of "cat"

ï · The head word of "sat"


ï · We can then use these features as input to a linear SVM, which learns to predict the correct parse tree
based on these features. The SVM is trained on a set of annotated sentences, where each sentence is
represented by its features and the correct parse tree.

ï · During testing, the SVM predicts the correct parse tree for a given input sentence by computing a weighted
sum of the features, and then applying a threshold to this sum to make a binary classification decision. The
predicted parse tree can then be converted into a more readable format, such as a bracketed string.

ï · Here is an example parse tree that could be predicted by the SVM for the input sentence "The cat sat on
the mat":

(S

(NP (DT The) (NN cat))

(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat))))

(. .))

Note that discriminative models can be trained on a variety of feature sets, including hand-crafted features as
shown in this example, or features learned automatically from the input data using techniques such as neural
networks. Discriminative models can also incorporate additional information, such as lexical semantic
knowledge or discourse context, to improve their accuracy.

Multilingual Issues: What Is a Token?

In natural language processing, a token refers to a sequence of characters that represent a single unit of
meaning. Typically, tokens correspond to words or punctuation marks in a sentence.

However, in multilingual settings, the definition of a token can become more complex. This is because different languages may use different writing systems, character encodings, or word segmentation conventions, which can affect how tokens are defined and processed.

For example, consider the following sentence in Chinese:

我爱北京天安门。

This sentence consists of seven characters, which could be considered tokens in a Chinese language processing pipeline. However, the sentence could also be segmented into four words, corresponding to the following tokens:

我 (I)   爱 (love)   北京 (Beijing)   天安门 (Tiananmen)

Similarly, in languages that use non-Latin scripts, such as Arabic or Hebrew, the definition of a token can be
more complex due to the presence of diacritics or ligatures, which may affect how words are represented and
processed.

In multilingual natural language processing, it is important to carefully define and standardize the tokenization
process in order to ensure that input text is processed consistently and accurately across different languages
and scripts. This may involve developing language-specific tokenization rules or using machine learning
techniques to automatically segment text into tokens.

Tokenization, Case, and Encoding

Tokenization, case, and encoding are important concepts in natural language processing that are often applied
to text data prior to modeling or analysis.

1. Tokenization refers to the process of breaking down a piece of text into individual units, called tokens. In
English, tokens typically correspond to words or punctuation marks, and can be extracted using simple rules
based on whitespace and punctuation. For example, the sentence "The quick brown fox jumps over the lazy
dog" can be tokenized into the following tokens:

["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

Case refers to the capitalization of words in a piece of text. In many cases, it is useful to convert all words to lowercase in order to reduce the number of distinct tokens and simplify subsequent analysis. However, there may be cases where preserving the original case of words is important for downstream tasks, such as named entity recognition. For example, the sentence "New York is a city in the United States" can be converted to lowercase as follows:

"new york is a city in the united states"

Encoding refers to the process of converting text into a numerical representation that can be processed by
machine learning algorithms. One common encoding scheme is one-hot encoding, where each token is
represented as a binary vector with a 1 in the position corresponding to the token's index in a fixed
vocabulary, and 0s elsewhere. For example, the sentence "The quick brown fox jumps over the lazy dog" can
be encoded as a matrix of shape (9, 9), where each row corresponds to a token and each column corresponds
to a position in the vocabulary:

[[1, 0, 0, 0, 0, 0, 0, 0, 0],

[0, 1, 0, 0, 0, 0, 0, 0, 0],

[0, 0, 1, 0, 0, 0, 0, 0, 0],


[0, 0, 0, 1, 0, 0, 0, 0, 0],

[0, 0, 0, 0, 1, 0, 0, 0, 0],

[0, 0, 0, 0, 0, 1, 0, 0, 0],

[0, 0, 0, 0, 0, 0, 1, 0, 0],

[0, 0, 0, 0, 0, 0, 0, 1, 0],

[0, 0, 0, 0, 0, 0, 0, 0, 1]]

Other encoding schemes include word embeddings, which map each token to a low-dimensional vector that
captures its semantic and syntactic properties, and character-level encodings, which represent each character
in a token as a separate feature.

Overall, tokenization, case, and encoding are critical preprocessing steps that help transform raw text data into
a format that can be effectively analyzed and modeled.
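A minimal Python sketch of these three preprocessing steps follows; it works on the example sentence above and uses plain Python only (after lowercasing, the duplicated "the" maps to a single vocabulary entry).

# Tokenization, case normalization, and one-hot encoding (sketch).
sentence = "The quick brown fox jumps over the lazy dog"

tokens = sentence.split()                 # tokenization on whitespace
lowered = [t.lower() for t in tokens]     # case normalization
vocab = sorted(set(lowered))              # fixed vocabulary

def one_hot(token):
    # Binary vector with a 1 at the token's index in the vocabulary.
    return [1 if token == v else 0 for v in vocab]

print(vocab)
print(one_hot("fox"))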

2. Word Segmentation

Word segmentation refers to the process of identifying individual words in a piece of text, especially in
languages where words are not explicitly separated by spaces or punctuation marks . Word segmentation is an
important task in natural language processing and can be challenging in languages such as Chinese, Japanese,
and Thai.

For example, consider the following sentence in Chinese:

我爱北京天安门。

This sentence consists of seven characters, which could be considered tokens in a Chinese language processing pipeline. However, the sentence could also be segmented into four words, corresponding to the following tokens:

我 (I)   爱 (love)   北京 (Beijing)   天安门 (Tiananmen)

The segmentation of Chinese text into words is typically performed using a combination of statistical and
rule-based methods. For example, one common approach is to use a dictionary or corpus of Chinese words as
a reference, and then apply statistical models or rule-based heuristics to identify likely word boundaries. Other
approaches use machine learning algorithms, such as conditional random fields or neural networks, to learn
the boundaries between words from annotated training data.

In languages such as Japanese and Thai, where words may be written without spaces, word segmentation can
be even more challenging. In these cases, additional linguistic and contextual information may be required to
disambiguate word boundaries. For example, in Japanese, the use of different writing systems (kanji,
hiragana, katakana) can provide cues for word segmentation, while in Thai, the tone and pronunciation of
individual characters can help identify word boundaries.

Overall, word segmentation is an important task in natural language processing that is essential for accurate
analysis and modeling of text data, particularly in languages where words are not explicitly separated by
spaces or punctuation marks.
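As a brief illustration, the widely used third-party jieba package segments Chinese text with a combination of a dictionary and statistical models; the sketch below assumes jieba is installed, and the exact segmentation may vary with the dictionary version.

# Chinese word segmentation with jieba.
import jieba

print(jieba.lcut("我爱北京天安门"))   # e.g. ['我', '爱', '北京', '天安门']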


3. Morphology

Morphology refers to the study of the structure of words and the rules that govern the formation of words
from smaller units known as morphemes. Morphemes are the smallest units of meaning in a language and can
be either free (can stand alone as words) or bound (must be attached to other morphemes to form words).

For example, consider the word "unhappily." This word consists of three morphemes:

1. "un-" is a prefix that means "not"

2. "happy" is the root or base word

3. "-ly" is a suffix that means "in a particular way"

Each of these morphemes has a specific meaning and function, and their combination in the word "unhappily"
changes the meaning and grammatical function of the root word "happy."

Morphology is important in natural language processing because it can help identify the meaning and
grammatical function of individual words, and can also provide insights into the structure and patterns of a
language. Some common applications of morphology in NLP include:

1. Stemming : the process of reducing a word to its root or stem form, which can help reduce the number of
unique words in a text corpus and improve efficiency in language modeling and information retrieval systems.

2. Morphological analysis : the process of breaking down words into their constituent morphemes, which can
help identify word meanings and relationships, as well as identify errors or inconsistencies in text data.

3. Morphological generation : the process of creating new words from existing morphemes, which can be
useful in natural language generation tasks such as machine translation or text summarization.
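A minimal sketch of stemming with NLTK's Porter stemmer follows (it assumes the nltk package is installed); stems are truncated root forms rather than dictionary words.

# Reducing words to their stems with the Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["unhappily", "running", "cats", "happiness"]:
    print(word, "->", stemmer.stem(word))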


Natural Language Processing

R18 B.Tech. CSE (AIML) III & IV Year JNTU Hyderabad

Prepared by

K SWAYAMPRABHA

Assistant Professor

UNIT - III

Semantic Parsing

1. Introduction

2. Semantic Interpretation

1 Structural Ambiguity

2 Word Sense

3 Entity and Event Resolution

4 Predicate-Argument Structure

5 Meaning Representation

3 System Paradigms

4 Word Sense

1 Resource

2.Systems

3 Software

Semantic parsing is a technique in natural language processing (NLP) that involves mapping natural language
sentences into structured representations, such as logical forms or executable code. The goal of semantic
parsing is to enable computers to understand the meaning of natural language sentences and to perform tasks
based on that understanding.

Here is an example of how semantic parsing works:

Input sentence: "What is the capital of France?"

Semantic representation:

{"type": "query",

"target": "capital",

"entity": {"type": "location", "name": "France"}}

In this example , the input sentence is a question that asks for the capital of France. The semantic
representation captures the meaning of the sentence by identifying the type of the sentence (a query) and the
target of the query (the capital) and the entity to which the query applies (France). The semantic
representation can be used by a computer program to generate an answer to the question or to perform other
tasks based on the query.

Another example:

Input sentence: "Find me the cheapest flight from New York to San Francisco next Monday."

Semantic representation:

{"type": "query",

"target": "flight",

"filters": [{"type": "location", "name": "New York"},

{"type": "location", "name": "San Francisco"},

{"type": "date", "value": "next Monday"},

{"type": "sort", "key": "price", "order": "asc"}]}

In this example, the input sentence is a request to find the cheapest flight from New York to San Francisco on
the next Monday. The semantic representation captures the meaning of the sentence by identifying the type of
the sentence (a query), the target of the query (a flight), and a set of filters that narrow down the search to
flights that depart from New York, arrive at San Francisco, depart on the next Monday, and are sorted by price
in ascending order. The semantic representation can be used by a computer program to search for flights that
match the criteria and to return the cheapest option.
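A minimal Python sketch of a rule-based semantic parser for the first example follows; the regular-expression pattern and the output schema are illustrative assumptions, not a general-purpose system.

# Map a simple question pattern to a structured semantic representation.
import re

def parse_capital_question(sentence):
    match = re.match(r"What is the capital of (\w+)\?", sentence)
    if match is None:
        return None  # the sentence does not fit this query pattern
    return {"type": "query",
            "target": "capital",
            "entity": {"type": "location", "name": match.group(1)}}

print(parse_capital_question("What is the capital of France?"))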

Semantic Interpretation

Semantic interpretation is the process of assigning meaning to a piece of language, such as a word, phrase,
sentence, or text. It is a fundamental task in natural language processing (NLP) and involves analyzing
language in its context to infer its intended meaning. The goal of semantic interpretation is to enable
computers to understand the meaning of natural language and to perform tasks based on that understanding.

Here are some examples of semantic interpretation:

1. Word Sense Disambiguation : In natural language, many words have multiple meanings depending on their
context. For example, the word "bank" can refer to a financial institution or the edge of a river. Semantic
interpretation involves determining the correct meaning of a word based on its context. This task is known as
word sense disambiguation.

2. Named Entity Recognition : Another task of semantic interpretation is named entity recognition, which
involves identifying and classifying named entities such as people, organizations, and locations in a piece of
text. For example, in the sentence "Bill Gates is the founder of Microsoft," semantic interpretation would
recognize "Bill Gates" as a person and "Microsoft" as an organization.


3. Sentiment Analysis : Semantic interpretation can also be used to perform sentiment analysis, which
involves identifying the sentiment or opinion expressed in a piece of text. For example, in the sentence "I love
this product," semantic interpretation would recognize a positive sentiment.

4. Question Answering : Semantic interpretation is also used in question answering, which involves answering
a question based on a given piece of text. Semantic interpretation helps to identify the relevant information in
the text that answers the question.

1. Structural ambiguity can have a significant impact on semantic interpretation because it can lead to
multiple possible interpretations of a sentence, each with a different meaning. Resolving structural ambiguity
is therefore an important step in semantic interpretation, as it helps to ensure that the correct meaning of a
sentence is understood.

example of how structural ambiguity can impact semantic interpretation:

Sentence: "The old man the boat."

This sentence is a classic garden-path sentence: it is structurally ambiguous because "man" can be read either as a noun or as a verb. Depending on how the sentence is parsed, it can be interpreted in two different ways:

1. "Man" as a noun: "the old man" is read as a noun phrase, but the sentence is then left without a main verb, so this parse fails.

2. "Man" as a verb: "the old" is read as a noun phrase referring to old people, and "man" is read as a verb meaning to operate or crew, so the sentence means that the old people operate the boat.

In this example, resolving the structural ambiguity is crucial for semantic interpretation because it determines
the identity of the subject of the sentence and therefore the meaning of the sentence as a whole.

2. Word sense is a crucial aspect of semantic interpretation as it determines the meaning of a word in a given
context. Words often have multiple meanings or senses, and understanding which sense of a word is being
used in a particular context is essential for accurate semantic interpretation.

Here is an example of word sense in semantic interpretation:

Sentence: "I saw a bat in the park."

The word "bat" has multiple senses, including a flying mammal or a piece of sports equipment. In this
context, the word "bat" likely refers to the animal sense, but without further context, it is not entirely clear. If
the sentence were "I hit the bat with a stick," the sense of "bat" would be more apparent.

Another example is the word "bank," which can have multiple meanings depending on the context. It can refer
to a financial institution or the edge of a river. Consider the following sentences:

ï · "I need to go to the bank to withdraw some money."

ï · "The children played by the bank of the river."

In the first sentence, "bank" refers to a financial institution, while in the second sentence, it refers to the edge
of a river. Understanding the correct sense of "bank" in each context is essential for accurate semantic
interpretation.


Entity and event resolution are important aspects of semantic interpretation that involve identifying and
extracting information about entities (such as people, places, and organizations) and events (such as actions,
states, and processes) from text.

Here is an example of entity resolution in semantic interpretation:

Sentence: "John Smith works at Google in New York City."

In this sentence, "John Smith" is an entity (a person), "Google" is an entity (an organization), and "New York
City" is an entity (a location). Entity resolution involves identifying and extracting these entities from the text
and linking them to their corresponding types (e.g., person, organization, or location).

Event resolution involves identifying and extracting information about events from text. For example:

Sentence: "The dog chased the cat up the tree."

In this sentence, the event is the action of the dog chasing the cat. Event resolution involves identifying and
extracting information about the action, including the actors (the dog and the cat) and the location (up the
tree).

Entity and event resolution can be challenging in natural language processing because entities and events can
be expressed in many different ways and can be ambiguous. For example, in the sentence "John Smith is the
CEO of XYZ," it may be unclear whether "XYZ" refers to an organization or a person. Resolving these
ambiguities requires a deep understanding of the context and the syntax of the sentence, as well as knowledge
about the world and common sense reasoning.

3. Predicate-argument structure is an important aspect of semantic interpretation that involves identifying the
relationships between the main verb (predicate) and its arguments (subjects, objects, and other complements).
Understanding the predicate-argument structure of a sentence is critical for accurate semantic interpretation
because it allows us to identify the roles and relationships of the various elements in the sentence.

Here is an example of predicate-argument structure in semantic interpretation:

Sentence: "The cat chased the mouse."

In this sentence, the predicate is "chased," and its arguments are "cat" (the subject) and "mouse" (the object).
The predicate-argument structure can be represented as follows:

Predicate: chased Subject: cat Object: mouse

Another example is:

Sentence: "John gave Mary the book."

In this sentence, the predicate is "gave," and its arguments are "John" (the subject), "Mary" (the indirect
object), and "the book" (the direct object). The predicate-argument structure can be represented as follows:

Predicate: gave Subject: John Indirect object: Mary Direct object: the book

Identifying the predicate-argument structure of a sentence can be challenging because it requires an understanding of the syntax and semantics of the sentence. In some cases, the arguments may be omitted or expressed implicitly, making it difficult to identify the relationships between the predicate and its arguments.
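A rough sketch of recovering a simple predicate-argument structure from a dependency parse, assuming spaCy with the en_core_web_sm model, could look like this (the dependency labels used are those of spaCy's English models):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John gave Mary the book.")

for token in doc:
    if token.pos_ == "VERB":
        # Collect the verb's children by dependency label to approximate its arguments.
        subj = [w.text for w in token.children if w.dep_ == "nsubj"]
        dobj = [w.text for w in token.children if w.dep_ == "dobj"]
        iobj = [w.text for w in token.children if w.dep_ in ("dative", "iobj")]
        print("Predicate:", token.lemma_,
              "| Subject:", subj, "| Indirect object:", iobj, "| Direct object:", dobj)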

Meaning representation is the process of representing the meaning of a sentence or a text in a structured and
formal way. The goal of meaning representation is to capture the underlying meaning of the text, independent
of the surface form of the language used to express it.

Here is an example of meaning representation in semantic interpretation:

Sentence: "John gave Mary the book."

One common way to represent the meaning of this sentence is using a logical form, which captures the
predicate-argument structure of the sentence and the relationships between the various elements. The logical
form for this sentence could be:

give(john, book, mary)

This logical form represents the meaning of the sentence by capturing the predicate "give" and its arguments,
"john," "book," and "mary," in a structured and formal way. This logical form can be used to perform various
natural language processing tasks, such as question answering and text generation.

Another way to represent the meaning of a sentence is using a semantic graph, which captures the
relationships between the various concepts in the sentence. The semantic graph for the sentence "John gave
Mary the book" could look like th

+-------+

| John |

+-------+

give

+------+------+

|||

book Mary agent

This semantic graph represents the meaning of the sentence by capturing the relationships between the
concepts "John," "Mary," "book," and "give" in a structured and visual way. This graph can be used to
perform various natural language processing tasks, such as information extraction and semantic search.

System Paradigms

In computer science, a system paradigm is a fundamental approach or model for designing and implementing
computer systems. Here are some of the commonly recognized system paradigms:


1. Imperative paradigm : The imperative paradigm is based on the notion of imperative programming, where a
program is a sequence of statements that change the program state .

The key constructs in imperative programming include variables, assignments, loops, and conditionals. Most
programming languages, such as C, C++, Java, and Python, are based on the imperative paradigm.

2. Functional paradigm : The functional paradigm is based on the notion of functional programming, where a
program is a set of functions that take inputs and produce outputs. In functional programming, functions are
treated as first-class citizens, which means they can be passed as arguments to other functions, returned as
results, or stored in data structures. Examples of functional programming languages include Haskell, Lisp, and
ML.

3. Object-oriented paradigm : The object-oriented paradigm is based on the notion of objects, which
encapsulate data and behavior. Object-oriented programming is characterized by the concepts of inheritance,
polymorphism, and encapsulation.

Examples of object-oriented programming languages include Java, C++, and Python.

4. Event-driven paradigm : The event-driven paradigm is based on the notion of events, which are generated
by the system or the user and are handled by event handlers. Event-driven programming is commonly used in
graphical user interfaces, where user interactions generate events that are handled by event handlers.

Examples of event-driven programming languages include JavaScript and Python.

5. Concurrent paradigm : The concurrent paradigm is based on the notion of concurrency, where multiple
tasks or processes are executed simultaneously. Concurrent programming is commonly used in systems that
require high performance and scalability, such as web servers and databases.

Examples of concurrent programming languages include Go and Erlang.

These system paradigms are not mutually exclusive and can be combined in various ways to design and
implement computer systems. For example, many modern programming languages, such as Java and Python,
support multiple paradigms, including imperative, object-oriented, and functional programming.

In semantic parsing, word sense disambiguation is a critical task, as it involves identifying the correct meaning of a word based on the context in which it is used.

Here are examples of how word sense disambiguation might be performed for the words "resource," "systems," and "software" in the context of semantic parsing:

1. Resource:

• In the context of natural resources, the word "resource" might be associated with concepts such as "sustainability," "conservation," and "environmental impact."

• In the context of computing, the word "resource" might be associated with concepts such as "memory," "processing power," and "network bandwidth."


• In the context of human resources, the word "resource" might be associated with concepts such as "talent acquisition," "employee retention," and "skills development."

2. Systems:

• In the context of computing, the word "systems" might be associated with concepts such as "operating systems," "database management systems," and "distributed systems."

• In the context of ecology, the word "systems" might be associated with concepts such as "ecosystems," "food webs," and "biogeochemical cycles."

• In the context of business, the word "systems" might be associated with concepts such as "supply chain systems," "customer relationship management systems," and "quality management systems."

3. Software :

• In the context of computing, the word "software" might be associated with concepts such as "operating systems," "application software," and "system software."

• In the context of music, the word "software" might be associated with concepts such as "digital audio workstations," "synthesizers," and "sequencers."

• In the context of law, the word "software" might be associated with concepts such as "patentable computer programs," "algorithmic processes," and "intellectual property laws."


Natural Language Processing

R18 B.Tech. CSE (AIML) III & IV Year JNTU Hyderabad

Prepared by

K SWAYAMPRABHA

Assistant Professor

UNIT IV

Predicate-Argument Structure, Meaning Representation Systems, Software.

Predicate-Argument Structure

Verbal predicates that demand just a subject argument (e.g. sleep, work, relax ) are intransitive, verbal
predicates that demand an object argument as well (e.g. like, fry, help) are transitive, and verbal predicates
that demand two object arguments are ditransitive (e.g. give, lend).

A thing that refers to the type of event or state we are dealing with is termed a predicate , while the things
that refer to the participants in the event/state  are called the arguments of the predicate.

Expressiveness -- the ability to express a wide range of subject matter. The ideal situation would be a single meaning representation language that could adequately represent the meaning of any sensible natural language utterance. Although this ideal may not be achievable, first order predicate calculus (FOPC) is expressive enough to handle a great deal. In fact, it is claimed that anything representable in the other three representation languages can also be represented in FOPC. We will concentrate on FOPC, but other representation languages are also used; for example, the Text Meaning Representation (TMR) used in the machine translation system of NMSU is a frame-based representation.

Predicate-Argument Structure

All natural languages have a form of predicate-argument arrangement at the core of their semantic structure. Specific relations hold among the constituent words and phrases of a sentence (the predicate and its arguments). Our meaning representation should support the predicate-argument structure induced by the language. In fact, there is a relation between syntactic frames and semantic frames, and we will try to find these relations between them.

Example: Want(somebody, something) -- Want is a predicate with two arguments.

Syntactic Structures:

I want Turkish food. (NP want NP)
I want to spend less than five dollars. (NP want InfVP)
I want it to be close by here. (NP want NP InfVP)

Verb subcategorization rules allow the linking of the arguments of syntactic structures with the semantic roles of those arguments in the semantic representation of the sentence. The study of the semantic roles associated with verbs is known as thematic role analysis. In syntactic structures there are restrictions on the categories of the arguments; similarly, there are semantic restrictions on the arguments of predicates. Selectional restrictions specify these semantic restrictions on the arguments of verbs.

Other objects (other than verbs) in natural languages may also have predicate-argument structure. For example, "a Turkish restaurant under fifteen dollars" can be represented as Under(TurkishRestaurant, $15); here the meaning representation is associated with the preposition under, which can be characterized by a two-argument predicate. Similarly, "Make a reservation for this evening for a table for two persons at 8" can be represented as Reservation(Hearer, Today, 8PM, 2), where the meaning representation is associated with the noun reservation (not with make).

Our meaning representation should support:

• variable arity predicate-argument structures
• the semantic labeling of arguments to predicates
• semantic constraints on the fillers of argument roles

First Order Predicate Calculus (FOPC)

First Order Predicate Calculus (FOPC) is a flexible, well-understood, and computationally tractable approach, and it satisfies most of what we expect from a meaning representation language. FOPC provides a sound computational basis for the verifiability, inference, and expressiveness requirements. Its most attractive feature is that it makes very few specific commitments about how things should be represented.

FOPC Examples:

"I only have five dollars and I don't have a lot of time."
Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)

"A restaurant that serves Turkish food near Bilkent."
∃x Restaurant(x) ∧ Serves(x, TurkishFood) ∧ Near(LocationOf(x), LocationOf(Bilkent))

"All vegetarian restaurants serve vegetarian food."
∀x VegetarianRestaurant(x) ⇒ Serves(x, VegetarianFood)

Semantics of FOPC: The truth value of an FOPC formula can be computed from the meanings of its elements: the truth tables for the logical connectives (∧, ∨, ¬, ⇒), the meanings of the quantifiers (∀, ∃), and the meanings assigned to the predicates, constants, and functions in an interpretation. The truth values of the example formulas above are evaluated in exactly this way.

An argument expressed with sentences in predicate logic is valid if and only if the conclusion is true in every interpretation in which all the premises are true.
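As an illustration, NLTK's first-order logic parser can be used to build and inspect expressions like the ones above; the lower-case predicate names are only a convention of this sketch:

from nltk.sem import Expression

read_expr = Expression.fromstring
f1 = read_expr('have(speaker, FiveDollars) & -have(speaker, LotOfTime)')
f2 = read_expr('exists x.(restaurant(x) & serves(x, TurkishFood))')
f3 = read_expr('all x.(vegetarian_restaurant(x) -> serves(x, VegetarianFood))')

print(f1)           # conjunction with a negated conjunct
print(f2.free())    # no free variables: x is bound by the existential quantifier
print(f3)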

Meaning Representation Systems

In NLP training (here, NLP in the sense of Neuro-Linguistic Programming), the senses through which we experience the world are referred to as representational systems.

The words used to describe these systems are called predicates.

Hence, you have several different systems working for you:

Visual predicates (for things you see),

auditory predicates (for things you hear),


kinaesthetic predicates (things you feel or tactile sensations), AD or self talk predicates (also called labelling
system),

olfactory predicates (things you smell), and gustatory predicates (things you taste).

The Neuro-Linguistic Programming community says that we all have a system that we prefer over the others and hence process most information through it. This is, of course, a generalisation, yet it will give you a clue as to what might be going on inside someone else's internal map of the world. This preferred system can be determined through physiology and the predicates used in a conversation. Let's look at an example:

-The way I look at this, is that it is still rather unclear. (Almost totally visual predicates).

-I really don't feel comfortable with this. (Tactile/Kinaesthetic predicates).

-This doesn't sound right to me. (Auditory predicates).

Knowing the preferred representational system (and therefore the predicates they prefer) of someone is useful
in virtually any context. Imagine you go to Spain and communicate in English. Chances are, the general gist
of what you want to say will get across (hopefully!), yet the finer distinctions of what you are saying are lost.
It is the same with representational systems. If you present information in somebody else's preferred system, this information will be virtually irresistible to them!

Below is a chart which outlines the types of words/predicates people will use more often than not in a
preferred system. Remember the preferred system can change depending on the context they are in.

Predicates Chart

Visual

Memorize by seeing pictures and are less distracted by noise. Often have trouble remembering and are bored by long verbal instructions because their mind may wander. They are interested in how the program looks.

Auditory

Typically are easily distracted by noise. They can repeat things back to you easily and learn by listening. They like music and like to talk on the phone. Tone of voice and the words used can be important.

Kinaesthetic

Often they talk slowly and breathily. They respond to physical rewards and touching. They memorize by doing or walking through something. They will be interested in a program that feels right or gives them a gut feeling.

AD/Labelling

They spend a fair amount of time talking to themselves. They memorize by steps, procedures, sequences.
They will want to know the program makes sense. They can also sometimes exhibit characteristics of other
rep systems.


ï · See ï · Hear
ï · Sense
ï · Look ï · Listen
ï · Feel ï · Experience
ï · View ï · Sound(s)
ï · Touch ï · Understand
ï · Appear ï · Make music
ï · Grasp ï · Think
ï · Show ï · Harmonize
ï · Get hold of ï · Learn
ï · Dawn ï · Tune in/out
ï · Slip through ï · Process
ï · Reveal ï · Be all ears
ï · Catch on ï · Decide
ï · Envision ï · Rings a bell
ï · Tap into ï · Motivate
ï · Illuminate ï · Silence
ï · Make contact ï · Consider
ï · Imagine ï · Be heard
ï · Throw out ï · Change
ï · Clear ï · Resonate
ï · Turn around ï · Perceive
ï · Foggy ï · Deaf
ï · Hard ï · Insensitive
ï · Focused ï · Mellifluous
ï · Unfeeling ï · Distinct
ï · Hazy ï · Dissonance
ï · Concrete ï · Conceive
ï · Crystal ï · Question
ï · Know
ï · picture ï · Unhearing

V: Visual

People who are visual often stand or sit with their heads and/or bodies erect, with their eyes up. They will
be breathing from the top of their lungs. They often sit forward in their chair and tend to be organized, neat,
well-groomed and orderly. They memorize by seeing pictures, and are less distracted by noise. They
often have trouble remembering verbal instructions because their minds tend to wander. A visual person
will be interested in how your program LOOKS. Appearances are important to them. They are often thin
and wiry.

A: Auditory

People who are auditory will move their eyes sideways (remember Richard Nixon?). They breathe from the middle of their chest. They typically talk to themselves, and are easily distracted by noise (some even move their lips when they talk to themselves). They can repeat things back to you easily, they learn by listening, and usually like music and talking on the phone. They memorize by steps, procedures, and sequences (sequentially). The auditory person likes to be TOLD how they're doing, and responds to a certain tone of voice or set of words. They will be interested in what you have to say about your program.


K: Kinaesthetic

People who are kinaesthetic will typically be breathing from the bottom of their lungs, so you'll see their stomach go in and out when they breathe. They often move and talk verrry slooowly. They respond to physical rewards, and touching. They also stand closer to people than a visual person. They memorize by doing or walking through something. They will be interested in your program if it "feels right".

Ad: Auditory Digital (self talk, or labelling system)

This person will spend a fair amount of time talking to themselves. They will want to know if your program "makes sense". The auditory digital person can exhibit characteristics of the other major representational systems.


Natural Language Processing

R18 B.Tech. CSE (AIML) III & IV Year JNTU Hyderabad

Prepared by

K SWAYAMPRABHA

Assistant Professor

UNIT - V

Discourse Processing: Cohesion, Reference Resolution, Discourse Cohesion and Structure

Discourse processing in Natural Language Processing (NLP) refers to the study of how meaning is conveyed
across larger units of text, such as sentences, paragraphs, and entire documents. It involves analyzing how
sentences are related to each other in a text and how the overall structure of the text contributes to its meaning.

Discourse processing includes a wide range of tasks, such as coreference resolution, discourse segmentation,
text coherence, and text classification. These tasks are essential for various applications in NLP, including
machine translation, sentiment analysis, and text summarization.

Coreference resolution involves identifying all the expressions in a text that refer to the same entity. For
example, in the sentence "John went to the store. He bought some bread," the word "he" refers to John.
Discourse segmentation involves identifying the boundaries between different discourse units, such as
sentences or paragraphs.

Text coherence is the degree to which a text is logically organized and easy to understand. It is often evaluated
based on how well the text maintains a coherent topic, how well its parts relate to each other, and how well it
uses discourse markers to signal shifts in topic or perspective.

Text classification involves categorizing texts based on their content. For example, a news article may be
classified as sports, politics, or entertainment. Text classification is often used in applications such as
sentiment analysis, spam filtering, and topic modeling.

Cohesion

Coherence and cohesion are two important concepts in discourse processing that are essential for
understanding the overall meaning of a text. While coherence refers to the overall clarity and logical
organization of a text, cohesion refers to the specific linguistic devices that writers use to connect the different
parts of a text.

Cohesion is the use of linguistic devices, such as conjunctions, reference words, and lexical repetition, to link
different parts of a text together. Cohesion creates a sense of unity in a text and helps the reader to follow the
writer's intended meaning. Examples of cohesive devices include pronouns (e.g., he, she, it), conjunctions
(e.g., and, but, or), adverbs (e.g., however, therefore), and lexical repetition (e.g., repeating the same word or
phrase multiple times).

There are several types of cohesive devices, including reference, substitution, ellipsis, conjunction, and lexical
cohesion.


• Reference: Referring back to something previously mentioned in the text, such as "John saw a dog. It was brown."

• Substitution: Replacing a word or phrase with a pronoun or other substitute, such as "John saw a dog. The animal was brown."

• Ellipsis: Leaving out words that are not needed because they can be inferred from the context, such as "John ate pizza for dinner and Mary spaghetti."

• Conjunction: Using words such as "and," "but," or "or" to connect phrases or sentences, such as "John went to the store, and he bought some bread."

• Lexical cohesion: Using repeated words or related words to link sentences together, such as "John drove his car. The vehicle was new."

Reference Resolution

Reference resolution is the process of identifying the objects or entities referred to by pronouns, nouns, or
other words in a text. It is a crucial task in natural language processing, as it helps to identify the relationships
between entities in a text and to understand the meaning of a sentence or a paragraph.

Reference resolution involves identifying the antecedent of a pronoun or a noun phrase in a text. An
antecedent is the word or phrase that the pronoun or noun phrase refers to. For example, in the sentence "John
saw a dog. It was brown," the pronoun "it" refers to the noun "dog."

There are several types of reference resolution, including anaphora resolution and cataphora resolution.

• Anaphora resolution: Resolving a referring expression whose antecedent appears earlier in the text, so the pronoun or noun phrase refers back to it. For example, in the sentence "John saw a dog. It was brown," "it" refers back to "dog," which comes earlier.

• Cataphora resolution: Resolving a referring expression whose antecedent appears later in the text, so the pronoun or noun phrase refers forward to it. For example, in the sentence "When he saw the dog, John ran away," "he" refers forward to "John," which comes later in the sentence.

Reference resolution can be a challenging task for computers, as it requires understanding the context and the
relationships between words and entities in a text. However, it is essential for many natural language
processing applications, such as machine translation, text summarization, and question answering.

Discourse Cohesion and Structure

Discourse cohesion and structure are two related concepts that play important roles in creating effective
communication and understanding in natural language.

Discourse cohesion refers to how different parts of a text are connected through the use of linguistic devices,
such as pronouns, conjunctions, lexical repetition, and other cohesive markers. Cohesion creates a sense of
unity and coherence in a text, helping readers to follow the writer's intended meaning and to understand the
relationships between different ideas.

Discourse structure, on the other hand, refers to the larger organization and arrangement of ideas within a text. It involves how ideas are presented and how they relate to each other, including the use of headings, subheadings, paragraphs, and other structural devices. Discourse structure helps readers to navigate a text and to understand its overall organization, which can also contribute to its coherence and clarity.

Effective discourse cohesion and structure are important for creating clear and coherent communication in
both written and spoken language. When a text is well-structured and cohesive, readers or listeners are more
likely to understand and remember the content. Discourse cohesion and structure are also important in many
natural language processing tasks, such as summarization, question-answering, and text classification, where
understanding the relationships between ideas and the overall organization of a text is essential.

n-Gram Models

n-gram models are statistical language models used in natural language processing and computational
linguistics. They are based on the idea of predicting the probability of a word given the preceding n-1 words
in a text.

An n-gram is a sequence of n words or characters that appear consecutively in a text. For example, a bigram
(2-gram) model would predict the probability of a word given the preceding word, while a trigram (3-gram)
model would predict the probability of a word given the two preceding words.

n-gram models are based on the assumption that the probability of a word depends only on the preceding n-1
words, which is known as the Markov assumption. They are trained on a large corpus of text data and estimate
the probability of a word given its context using maximum likelihood estimation or other statistical methods.

n-gram models are used in a wide range of natural language processing tasks, such as speech recognition,
machine translation, and text classification. They are often used as a baseline model for comparison with other
more complex language models.

One limitation of n-gram models is that they do not capture long-range dependencies between words in a text.
For example, a trigram model may not be able to accurately predict the next word in a sentence if the relevant
context extends beyond the previous two words. To address this limitation, more complex language models,
such as recurrent neural networks and transformer models, have been developed.
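A minimal bigram model estimated by maximum likelihood, over a purely illustrative two-sentence corpus, can be sketched as follows:

from collections import Counter, defaultdict

corpus = [
    ["<s>", "john", "read", "a", "book", "</s>"],
    ["<s>", "mary", "read", "a", "paper", "</s>"],
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    # P(word | prev) = count(prev word) / count(prev)
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("read", "a"))   # 1.0 in this toy corpus
print(bigram_prob("a", "book"))   # 0.5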

Language Model Evaluation

Language model evaluation is an important task in natural language processing (NLP) that involves measuring
the performance of a language model on a specific task or dataset. The goal of language model evaluation is to
determine how well the model can predict the next word in a sequence, generate coherent sentences, or
perform other language-related tasks.

There are several methods for evaluating language models, including:

1. Perplexity: Perplexity is a commonly used measure for evaluating language models. It measures how well a
model can predict the next word in a sequence of words. A lower perplexity score indicates a better language
model. Perplexity is often used to compare different language models on the same dataset.
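For instance, given the per-word probabilities assigned by some language model to a test sequence (the numbers below are made up), perplexity can be computed as the exponential of the average negative log-probability:

import math

word_probs = [0.2, 0.1, 0.05, 0.3]   # P(w_i | context) for each word of a test sequence

# perplexity = exp( -(1/N) * sum_i log P(w_i | context) )
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / len(word_probs))
print(perplexity)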

2. Human evaluation: Human evaluation involves having humans assess the quality of generated text or the
performance of a language model on a specific task. This method is often used to evaluate the fluency,
coherence, and relevance of generated text.

3. Task-specific evaluation: Task-specific evaluation involves evaluating a language model on a specific task, such as machine translation, text summarization, or sentiment analysis. The performance of the language model is measured based on how well it performs on the task, using metrics such as accuracy, precision, and recall.

4. Diversity and novelty evaluation: Diversity and novelty evaluation involves evaluating the diversity and
novelty of the text generated by a language model. This method is often used to evaluate the creativity and
originality of generated text.

Language model evaluation is an ongoing area of research in natural language processing, as new models and
methods are continually being developed to improve the performance of language models on various tasks. It
is important to choose appropriate evaluation methods that are suited to the specific task or application being
evaluated.

Parameter Estimation

1 Maximum-Likelihood Estimation and Smoothing

2 Bayesian Parameter Estimation

3 Large-Scale Language Models

Parameter estimation is the process of estimating the values of the parameters of a statistical model from the
data. In the context of natural language processing (NLP), parameter estimation is a crucial step in building
machine learning models such as language models, part-of-speech taggers, and named entity recognition
systems.

In NLP, models are typically trained on a large corpus of annotated data, and the objective is to estimate the
values of the model parameters that maximize the likelihood of the observed data. The most commonly used
method for parameter estimation is maximum likelihood estimation (MLE), which involves finding the set of
parameters that maximizes the probability of the observed data. Other methods for parameter estimation
include Bayesian estimation, which involves finding the posterior distribution of the parameters given the
data, and empirical Bayes, which involves using a hierarchical model to estimate the parameters.

Parameter estimation in NLP involves several steps, including preprocessing the data, selecting a model
architecture, defining the objective function, and selecting a suitable optimization algorithm. The objective
function typically involves a loss function that measures the discrepancy between the predicted output of the
model and the true output. The optimization algorithm is used to find the values of the parameters that
minimize the objective function.

The choice of optimization algorithm is important for efficient and effective parameter estimation.
Gradient-based optimization algorithms, such as stochastic gradient descent (SGD) and its variants, are
commonly used in NLP because they are computationally efficient and can handle large datasets. Other
optimization algorithms, such as quasi-Newton methods and conjugate gradient methods, may be more
effective for small datasets or for models with complex parameter spaces.

In summary, parameter estimation is a crucial step in building statistical models in NLP. The choice of model
architecture, objective function, and optimization algorithm can have a significant impact on the performance
of the model. Researchers and practitioners in NLP must carefully select appropriate methods for parameter
estimation based on the specific task and available data.

1 Maximum-Likelihood Estimation and Smoothing


Maximum-likelihood estimation (MLE) is a commonly used method for estimating the parameters of a
statistical model based on observed data. In natural language processing (NLP), MLE is used for tasks such as
language modeling, where the goal is to estimate the probability of a sequence of words given a context.

MLE involves finding the values of the model parameters that maximize the likelihood of the observed data.
The likelihood function measures the probability of observing the data given the model parameters, and the
goal of MLE is to find the parameter values that make this probability as high as possible. The
maximum-likelihood estimate is the set of parameter values that maximizes the likelihood function.

In practice, MLE can be difficult to apply directly to NLP tasks, as the likelihood function may be complex
and high-dimensional. One common approach is to use smoothing techniques to estimate the probabilities of
unseen events, which can improve the accuracy of the model and reduce overfitting.

One popular smoothing method is Laplace smoothing, also known as add-one smoothing. This method
involves adding a small constant value (usually 1) to the count of each event, which ensures that the
probability estimate is never zero. Another smoothing method is Kneser-Ney smoothing, which estimates the
probability of a word based on its frequency in the training corpus and the number of unique contexts in
which it appears.

Smoothing techniques are important for handling the problem of data sparsity, which occurs when the training
data contains few or no examples of certain events or combinations of events. By smoothing the probability
estimates, the model can make reasonable predictions for unseen events and reduce the impact of noisy or
incomplete data.
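A small sketch contrasting maximum-likelihood and Laplace (add-one) smoothed unigram estimates over a toy corpus and vocabulary (both invented for illustration):

from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
counts = Counter(tokens)
vocab = {"the", "cat", "sat", "on", "mat", "dog"}   # "dog" never occurs in the data
N = len(tokens)
V = len(vocab)

def mle(word):
    return counts[word] / N

def laplace(word):
    # add-one smoothing: (count + 1) / (N + V), so unseen words get a small non-zero mass
    return (counts[word] + 1) / (N + V)

print(mle("dog"), laplace("dog"))   # 0.0 versus a small positive probability
print(mle("the"), laplace("the"))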

2 Bayesian Parameter Estimation

Bayesian parameter estimation is an alternative approach to estimating the parameters of a statistical model in
natural language processing (NLP). Unlike maximum likelihood estimation (MLE), which seeks to find the
parameter values that maximize the likelihood of the observed data, Bayesian parameter estimation seeks to
find the posterior distribution of the parameters given the data.

Bayesian parameter estimation involves specifying a prior distribution over the parameters of the model and
using Bayes' rule to update this prior distribution based on the observed data. The resulting posterior
distribution represents the updated belief about the parameter values, taking into account both the prior
distribution and the observed data.

The choice of prior distribution can have a significant impact on the posterior distribution and the resulting
parameter estimates. A common approach is to use a conjugate prior, which has the same functional form as
the likelihood function and allows for convenient mathematical analysis. For example, if the likelihood
function is a Gaussian distribution, a conjugate prior would be another Gaussian distribution.

Bayesian parameter estimation offers several advantages over MLE. One advantage is that it allows for the
incorporation of prior knowledge or beliefs about the parameters, which can help reduce the impact of noisy
or incomplete data. Another advantage is that it provides a probabilistic framework for uncertainty
quantification, allowing for the calculation of confidence intervals and credible intervals for the parameter
estimates.

However, Bayesian parameter estimation can also be computationally expensive and require the specification
of a prior distribution, which may be subjective or difficult to choose. In addition, the resulting posterior
distribution may be complex and difficult to analyze, particularly in high-dimensional parameter spaces.


3 Large-Scale Language Models

Large-scale language models are a recent development in natural language processing (NLP) that use deep
learning techniques to learn from massive amounts of text data and generate human-like language. These
models, such as GPT-3, have achieved state-of-the-art performance on a variety of NLP tasks, including
language modeling, question-answering, and text generation.

Large-scale language models are typically trained using unsupervised learning techniques such as
self-supervised learning or semi-supervised learning. Self-supervised learning involves training the model to
predict missing words in a sentence or reconstruct a corrupted sentence, while semi-supervised learning
involves leveraging a small amount of labeled data in addition to the massive amounts of unlabeled data.

One of the key challenges in training large-scale language models is handling the sheer amount of data and
computational resources required. Training these models can require weeks or months of computing time on
powerful hardware, and the resulting models can have billions of parameters. As a result, large-scale language
models are typically trained on specialized hardware such as graphics processing units (GPUs) or tensor
processing units (TPUs).

Another challenge with large-scale language models is managing the biases and ethical implications of the
generated language. These models learn from the patterns in the data they are trained on, which can include
biases and stereotypes present in the training data. Additionally, the ability of these models to generate
convincing language raises concerns about the potential misuse of the technology, such as the spread of
misinformation or the creation of fake news.

Despite these challenges, large-scale language models have the potential to revolutionize NLP and have
already demonstrated impressive performance on a wide range of tasks. Ongoing research is focused on
improving the efficiency and scalability of these models, as well as addressing the ethical and societal
implications of their use.

Language Model Adaptation

Language model adaptation is the process of fine-tuning a pre-trained language model to a specific domain or
task with a smaller amount of task-specific data. This approach can improve the performance of the language
model on the target domain or task by allowing it to better capture the specific linguistic patterns and
vocabulary of that domain.

The most common approach to language model adaptation is called transfer learning, which involves
initializing the language model with pre-trained weights and fine-tuning it on the target domain or task using a
smaller amount of task-specific data. This process typically involves updating the final layers of the language
model, which are responsible for predicting the target output, while keeping the lower-level layers, which
capture more general language patterns, fixed.
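The freezing idea can be sketched in PyTorch as follows; the tiny model, its layer names, and the dimensions are placeholders standing in for a real pretrained language model:

import torch.nn as nn

class TinyPretrainedLM(nn.Module):
    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)     # lower layers: general patterns
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)         # final layer: task-specific output

model = TinyPretrainedLM()   # in practice the weights would be loaded from pretraining

# Freeze the lower layers; only the head is fine-tuned on the target-domain data.
for param in model.embed.parameters():
    param.requires_grad = False
for param in model.encoder.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # only the 'head' parameters remain trainable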

There are several advantages to using language model adaptation, including:

1. Improved performance on task-specific data: By fine-tuning a pre-trained language model on task-specific data, the model can better capture the specific linguistic patterns and vocabulary of that domain, leading to
improved performance on task-specific data.

2. Reduced training time and computational resources: By starting with a pre-trained language model, the
amount of training data and computational resources required to achieve good performance on the target task
is reduced, making it a more efficient approach.


3. Better handling of rare and out-of-vocabulary words: Pre-trained language models have learned to represent
a large vocabulary of words, which can be beneficial for handling rare and out-of-vocabulary words in the
target domain.

Language model adaptation has been applied successfully in a wide range of NLP tasks, including sentiment
analysis, text classification, named entity recognition, and machine translation. However, it does require a
small amount of task-specific data, which may not always be available or representative of the target domain.

Types of Language Models

1 Class-Based Language Models

2 Variable-Length Language Models

3 Discriminative Language Models

4 Syntax-Based Language Models

5 MaxEnt Language Models

6 Factored Language Models

7 Other Tree-Based Language Models

8 Bayesian Topic-Based Language Models

9 Neural Network Language Models

There are several types of language models in natural language processing (NLP), each with its own strengths
and weaknesses. Here are some of the most commonly used types of language models:

1. N-gram models: An n-gram model is a type of language model that predicts the next word in a sequence
based on the previous n-1 words. The most commonly used n-gram models are bigram and trigram models,
which use the previous word and the previous two words, respectively, to predict the next word.

2. Neural network models: Neural network models are a class of machine learning models that use deep
learning techniques to model the relationship between words in a sentence. These models can be trained on
large amounts of data to predict the likelihood of a sequence of words.

3. Transformer-based models: Transformer-based models, such as the GPT (Generative Pre-trained Transformer) series, are a type of neural network model that use a self-attention mechanism to capture the
dependencies between words in a sentence. These models have achieved state-of-the-art performance on a
range of NLP tasks.

4. Probabilistic graphical models: Probabilistic graphical models are a type of statistical model that represent
the dependencies between words in a sentence as a graph. These models can be used to predict the likelihood
of a sequence of words based on their dependencies.

5. Rule-based models: Rule-based models use a set of pre-defined rules to predict the likelihood of a sequence
of words. These models can be useful for specific domains where the language is highly structured and
predictable, but they may not be as effective for more general NLP tasks.


Each type of language model has its own strengths and weaknesses, and the choice of model will depend on
the specific task and domain being considered. N-gram models and neural network models are the most
widely used types of language models due to their simplicity and effectiveness, while transformer-based
models are rapidly gaining popularity due to their ability to capture complex dependencies between words.

1 Class-Based Language Models

Class-based language models are a type of probabilistic language model that groups words into classes based
on their distributional similarity. The goal of class-based models is to reduce the sparsity problem in language
modeling by grouping similar words together and estimating the probability of a word given its class rather
than estimating the probability of each individual word.

The process of building a class-based language model typically involves the following steps:

1. Word clustering: The first step is to cluster words based on their distributional similarity. This can be done
using unsupervised clustering algorithms such as k-means clustering or hierarchical clustering.

2. Class construction: After clustering, each cluster is assigned a class label. The number of classes can be
predefined or determined automatically based on the size of the training corpus and the desired level of
granularity.

3. Probability estimation: Once the classes are constructed, the probability of a word given its class is
estimated using a variety of techniques, such as maximum likelihood estimation or Bayesian estimation.

4. Language modeling: The final step is to use the estimated probabilities to build a language model that can
predict the probability of a sequence of words.
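The clustering step (step 1 above) can be sketched with k-means over word vectors; the two-dimensional "distributional" vectors below are invented for illustration, whereas real systems would derive them from co-occurrence statistics:

import numpy as np
from sklearn.cluster import KMeans

words = ["monday", "tuesday", "friday", "apple", "banana", "orange"]
vectors = np.array([[0.9, 0.1], [0.85, 0.15], [0.8, 0.2],
                    [0.1, 0.9], [0.15, 0.85], [0.2, 0.8]])

# Each k-means cluster plays the role of a word class in a class-based model.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for word, cls in zip(words, kmeans.labels_):
    print(word, "-> class", cls)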

Class-based language models have several advantages over traditional word-based models, including:

1. Reduced sparsity: By grouping similar words together, class-based models reduce the sparsity problem in
language modeling, which can improve the accuracy of the model.

2. Improved data efficiency: Since class-based models estimate the probability of a word given its class rather
than estimating the probability of each individual word, they require less training data and can be more
data-efficient.

3. Better handling of out-of-vocabulary words: Class-based models can handle out-of-vocabulary words better
than word-based models, since unseen words can often be assigned to an existing class based on their
distributional similarity.

However, class-based models also have some limitations, such as the need for a large training corpus to build
accurate word clusters and the potential loss of some information due to the grouping of words into classes.

Overall, class-based language models are a useful tool for reducing the sparsity problem in language modeling
and improving the accuracy of language models, particularly in cases where data is limited or
out-of-vocabulary words are common.

2 Variable-Length Language Models

Variable-length language models are a type of language model that can handle variable-length input
sequences, rather than fixed-length input sequences as used by n-gram models.


The main advantage of variable-length language models is that they can handle input sequences of any length,
which is particularly useful for tasks such as machine translation or summarization, where the length of the
input or output can vary greatly.

One approach to building variable-length language models is to use recurrent neural networks (RNNs), which
can model sequences of variable length. RNNs use a hidden state that is updated at each time step based on
the input at that time step and the previous hidden state. This allows the network to capture the dependencies
between words in a sentence, regardless of the sentence length.

Another approach is to use transformer-based models, which can also handle variable-length input sequences.
Transformer-based models use a self-attention mechanism to capture the dependencies between words in a
sentence, allowing them to model long-range dependencies without the need for recurrent connections.

Variable-length language models can be evaluated using a variety of metrics, such as perplexity or BLEU
score. Perplexity measures how well the model can predict the next word in a sequence, while BLEU score
measures how well the model can generate translations that match a reference translation.

3 Discriminative Language Models

Discriminative language models are a type of language model that focuses on modeling the conditional
probability of the output given the input, rather than modeling the joint probability of the input and output as
in generative language models.

The goal of discriminative models is to learn a mapping from the input to the output, given a training dataset.
Discriminative models can be used for a variety of tasks, such as text classification, sequence labeling, and
machine translation.

One popular approach to building discriminative models is to use conditional random fields (CRFs). CRFs are
a type of probabilistic graphical model that can be used for sequence labeling tasks, such as named entity
recognition or part-of-speech tagging. CRFs model the conditional probability of the output sequence given
the input sequence, using features that capture the dependencies between neighboring labels in the output
sequence.

Another approach to building discriminative models is to use neural networks, such as feedforward neural
networks, convolutional neural networks (CNNs), or recurrent neural networks (RNNs). Neural networks can
be used for a wide range of tasks, including text classification, sequence labeling, and machine translation.

Discriminative models can be evaluated using a variety of metrics, such as accuracy, F1 score, or area under
the receiver operating characteristic curve (AUC-ROC). The choice of evaluation metric depends on the
specific task and the nature of the data.

4 Syntax-Based Language Models

Syntax-based language models are a type of language model that incorporates syntactic information in
addition to the usual word-based information.

Traditional language models, such as n-gram models or neural language models, focus on modeling the
probabilities of word sequences. In contrast, syntax-based language models consider the structure of sentences
and model the probabilities of syntactic structures, such as noun phrases or verb phrases.


There are several approaches to building syntax-based language models. One approach is to use context-free
grammars (CFGs) to represent the syntactic structure of sentences. A language model based on CFGs
generates sentences by recursively applying production rules, and assigns probabilities to each rule based on
the training data.

Another approach is to use dependency trees to represent the syntactic structure of sentences. Dependency
trees represent the relationships between words in a sentence, such as subject-verb or object-verb
relationships. A language model based on dependency trees assigns probabilities to each tree based on the
training data, and uses the tree to generate sentences.

Syntax-based language models can be used for a variety of tasks, such as text generation, machine translation,
and question answering. They can also be evaluated using standard metrics, such as perplexity or BLEU
score, although the evaluation is often more complex due to the additional syntactic information.

5 MaxEnt Language Models

MaxEnt (Maximum Entropy) language models are a type of probabilistic language model that use the
principle of maximum entropy to estimate the conditional probability of a word given its context.

In a MaxEnt language model, the probability distribution of the words in a given context is modeled as a set
of constraints on the expected values of a set of features. The goal is to find the probability distribution that
maximizes the entropy subject to the constraints.

MaxEnt models can be used to model both local and global context, and can incorporate various types of
features, such as word identity, part-of-speech, and syntactic information. The model is trained on a corpus of
text by estimating the parameters of the model using an optimization algorithm, such as gradient descent.

MaxEnt language models have been used for a variety of NLP tasks, including part-of-speech tagging, named
entity recognition, and sentiment analysis. They have been shown to perform well on tasks that require the
modeling of complex interactions between different types of linguistic features.

MaxEnt models have some advantages over other types of language models, such as the ability to incorporate
diverse feature sets and the ability to handle sparse data. However, they can be computationally expensive and
require careful selection of features and regularization parameters to prevent overfitting.

Overall, MaxEnt language models are a useful tool for NLP tasks that require rich, diverse feature sets.

6 Factored Language Models

Factored language models are a type of language model that incorporates multiple sources of information, or
factors, to improve the modeling of language. The factors can be any type of linguistic information, such as
part-of-speech, word shape, syntactic information, or semantic information.

In a factored language model, each word in a sentence is represented as a vector of factors, and the probability
of a word sequence is modeled as a product of the probabilities of the individual factors. The model is trained
on a corpus of text by estimating the parameters of the model using an optimization algorithm, such as
maximum likelihood estimation.

Factored language models have several advantages over traditional language models. First, they can
incorporate a wide range of linguistic information, allowing them to better capture the complex nature of
language. Second, they can handle out-of-vocabulary words by using their factor information to estimate their probability. Finally, they can be used to model a variety of linguistic phenomena, such as code-switching,
dialectal variation, and language contact.

Factored language models have been used for a variety of NLP tasks, including machine translation, speech
recognition, and information retrieval. They have been shown to outperform traditional language models in
many cases, especially when dealing with complex or noisy linguistic data.

7 Other Tree-Based Language Models

Tree-based language models are a type of language model that use tree structures to represent the syntactic
and/or semantic relationships between words in a sentence. In addition to syntax-based language models, there
are several other types of tree-based language models, including:

1. Semantic Role Labeling (SRL) Language Models: SRL models are used to identify the semantic roles
played by each word in a sentence, such as the subject, object, and verb. These models use syntactic and
semantic information to create a tree structure that represents the relationship between words and their roles.

2. Discourse Parsing Language Models: Discourse parsing models are used to analyze the structure and
organization of a discourse, such as the relationships between sentences and paragraphs. These models use
tree structures to represent the discourse structure, and can be used for tasks such as summarization and
information extraction.

3. Dependency Parsing Language Models: Dependency parsing models are used to identify the grammatical
relationships between words in a sentence, such as subject-verb and object-verb relationships. These models
use a tree structure to represent the dependencies between words, and can be used for tasks such as machine
translation and sentiment analysis.

4. Constituent Parsing Language Models: Constituent parsing models are used to identify the constituent
structures of a sentence, such as phrases and clauses. These models use tree structures to represent the
hierarchical structure of a sentence, and can be used for tasks such as text generation and summarization.

8 Bayesian Topic-Based Language Models

Bayesian topic-based language models, also known as topic models, are a type of language model that are
used to uncover latent topics in a corpus of text. These models use Bayesian inference to estimate the
probability distribution of words in each topic, and the probability distribution of topics in each document.

The basic idea behind topic models is that a document is a mixture of several latent topics, and each word in
the document is generated by one of these topics. The model tries to learn the distribution of these topics from
the corpus, and uses this information to predict the probability distribution of words in each document.

One of the most popular Bayesian topic-based language models is Latent Dirichlet Allocation (LDA). LDA
assumes that the corpus is generated by a mixture of latent topics, and each topic is a probability distribution
over the words in the corpus. The model uses a Dirichlet prior over the topic distributions, which encourages
sparsity and prevents overfitting.

LDA has been used for a variety of NLP tasks, including text classification, information retrieval, and topic
modeling. It has been shown to be effective in uncovering hidden themes and patterns in large corpora of text,
and can be used to identify key topics and concepts in a document.
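
As a small hedged illustration, the sketch below fits LDA on a toy corpus using scikit-learn's CountVectorizer and LatentDirichletAllocation; the documents, number of topics, and printed top words are purely illustrative and not drawn from this text.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the stock market fell as investors sold shares",
    "the team won the match after a late goal",
    "shares rallied and the market recovered",
    "the coach praised the players after the game",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)             # per-document topic mixtures

# Show the top words in each learned topic.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]
    print("Topic", k, ":", [terms[i] for i in top])

Each row of doc_topics is the estimated topic mixture for one document, which corresponds directly to the "document as a mixture of latent topics" view described above.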

9 Neural Network Language Models

Neural network language models are a type of language model that use artificial neural networks to model the
probability distribution of words in a language. They are a type of machine learning model that can be trained
on large amounts of data, and have become increasingly popular in recent years due to their ability to achieve
state-of-the-art performance on a variety of NLP tasks.

The basic idea behind neural network language models is to learn a distributed representation of words, where
each word is represented as a vector in a high-dimensional space. These representations capture the semantic
and syntactic relationships between words, and can be used to predict the probability distribution of the next
word in a sequence.

One of the most popular types of neural network language models is the recurrent neural network (RNN)
language model, which uses a type of neural network that is designed to handle sequential data. RNNs have a
hidden state that captures the context of the previous words in the sequence, and this context is used to predict
the probability distribution of the next word.

Another popular type of neural network language model is the transformer model, which uses self-attention to
model the relationships between words in a sequence. Transformer models have become increasingly popular
in recent years, and have been used to achieve state-of-the-art performance on a variety of NLP tasks,
including language modeling, machine translation, and text classification.
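
The following is a minimal sketch of a recurrent language model in PyTorch, included only to make the idea concrete; the vocabulary size, embedding and hidden dimensions, and the random batch of token ids are placeholders rather than a tuned setup.

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # distributed word representations
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)            # scores over the vocabulary

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        x = self.embed(token_ids)
        h, _ = self.rnn(x)           # hidden states summarize the left context
        return self.out(h)           # logits for the next word at each position

vocab_size = 1000
model = RNNLanguageModel(vocab_size)
tokens = torch.randint(0, vocab_size, (2, 10))   # a fake batch of token ids

logits = model(tokens[:, :-1])                   # predict each following token
loss = nn.CrossEntropyLoss()(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
print(loss.item())

A transformer language model follows the same next-word training objective; it differs mainly in replacing the recurrent layer with self-attention over the whole context.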

Language-Specific Modeling Problems

1 Language Modeling for Morphologically Rich Languages

2 Selection of Subword Units

3 Modeling with Morphological Categories

4 Languages without Word Segmentation

5 Spoken versus Written Languages

Language-specific modeling problems refer to challenges that arise when building language models for
specific languages. These challenges can include issues related to data availability, morphology, syntax, and
semantics, among others.

One major challenge in building language models for specific languages is data availability. Many languages
do not have large corpora of text that are suitable for training language models, which can make it difficult to
build models that are accurate and robust. In addition, even when data is available, it may be difficult to obtain
high-quality annotations, such as part-of-speech tags or syntactic parses.

Another challenge is related to morphology, or the way words are formed in a language. Some languages have
complex morphological systems, which can make it difficult to model the relationships between words in a
sentence. For example, in languages like Arabic and Hebrew, words are typically formed from a consonantal
root combined with vowel patterns and affixes, which can result in a very large number of surface word forms.

Syntax is another important factor to consider when building language models. Different languages have
different sentence structures and word orders, which can affect the way that language models are designed and
trained. For example, languages like Japanese and Korean have very different sentence structures from
English, which can require different modeling approaches.

Finally, semantics, or the meaning of words and sentences, can also pose challenges for language modeling.
Different languages may have different ways of expressing the same concept, or may have words that have
multiple meanings depending on context. This can make it difficult to build models that accurately capture the
meaning of sentences and phrases.

1 Language Modeling for Morphologically Rich Languages

Morphologically rich languages pose a challenge for language modeling due to the high degree of inflection
and derivation that words can undergo. Inflection refers to the modification of a word to indicate grammatical
features such as tense, aspect, number, gender, and case, while derivation refers to the formation of new words
from existing ones through the addition of prefixes and suffixes.

One common approach to language modeling for morphologically rich languages is to use sub-word units,
such as character n-grams or morphemes, rather than full words. This can help to capture the underlying
morphological structure of words, and can also improve the coverage of rare or unseen words.

Another approach is to use morphological analysis and generation tools to preprocess the text before training
the language model. These tools can be used to segment words into their constituent morphemes, and to label
each morpheme with its grammatical features. This can help to reduce the sparsity of the data, and can also
improve the accuracy of the language model.

Language-specific approaches may also be needed to deal with specific morphological phenomena that are
unique to certain languages. For example, languages with agglutinative morphology, such as Turkish and
Finnish, may require specialized methods for handling long sequences of morphemes that form a single word.

Finally, it may be beneficial to use transfer learning techniques to adapt language models trained on other
languages to the target language. This can help to overcome the data scarcity problem, and can also help to
leverage the linguistic knowledge that is shared across languages.

2 Selection of Subword Units

In language modeling for morphologically rich languages, the selection of subword units is an important
consideration. Subword units are smaller units of language that are used to represent words, such as character
n-grams or morphemes. Here are some common approaches to selecting subword units:

1. Character n-grams: One common approach is to use character n-grams, which are sequences of n characters
within a word. For example, the word "language" could be represented as a set of character 3-grams: {"lan",
"ang", "ngu", "gua", "uag", "age"}. This approach can be effective for capturing the morphology of words, as
well as for handling out-of-vocabulary (OOV) words. A short code sketch after this list shows this extraction.

2. Morphemes: Another approach is to use morphemes, which are the smallest units of meaning within a
word. For example, the word "languages" can be broken down into the morphemes "language" and "-s",
indicating plural. This approach can be effective for capturing the morphology and semantics of words, but
can require more computational resources for segmentation and analysis.

3. Hybrid approaches: Some approaches combine character n-grams and morphemes to create hybrid subword
units. For example, the word "languages" could be represented as a set of hybrid subword units: {"lan", "ang",
"ngu", "gua", "uag", "age", "es"}, where the "-s" morpheme is represented separately. This approach can be
effective for capturing both morphology and OOV words.

4. Word pieces: Another approach is to use a learned vocabulary of "word pieces", which are variable-length
subword units that are learned during training. This approach, used by models such as BERT and GPT, can be
effective for capturing complex morphology and semantics, and can also handle OOV words.
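
As referenced in approach 1 above, here is a small helper that extracts overlapping character n-grams from a word; the optional boundary markers follow the convention used by subword models such as fastText, and the parameter choices are illustrative.

def char_ngrams(word, n=3, boundary=False):
    # Return the overlapping character n-grams of a word.
    if boundary:
        word = "<" + word + ">"   # optional word-boundary markers, fastText-style
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("language"))
# ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']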

3 Modeling with Morphological Categories

In language modeling for morphologically rich languages, one approach is to model the language using
morphological categories. Morphological categories are linguistic features that are used to describe the
grammatical and semantic properties of words. For example, in English, nouns can be categorized as singular
or plural, and verbs can be categorized by tense, such as past or present.

Here are some common approaches to modeling with morphological categories:

1. Feature-based models: One approach is to use feature-based models, which represent words as a set of
binary or categorical features that describe their morphological properties. For example, a word might be
represented as a set of features indicating its tense, number, case, or gender. Feature-based models can be
effective for capturing the morphological properties of words, but may require a large number of features and
may not capture more complex relationships between words. A toy sketch after this list illustrates such a
feature representation.

2. Conditional models: Another approach is to use conditional models, which predict the likelihood of a word
given its context and its morphological features. For example, a conditional model might predict the
likelihood of the word "running" in the context "I am __ to the store" based on its morphological features
indicating tense and aspect. Conditional models can be effective for capturing complex interactions between
words and their morphological properties, but may require large amounts of training data and computational
resources.

3. Hybrid approaches: Some approaches combine feature-based and conditional models to create hybrid
models that capture both the morphological properties of words and their context. For example, a hybrid
model might use a feature-based approach to represent the morphological properties of words and a
conditional approach to predict the likelihood of a word given its context and its morphological features.

4. Unsupervised models: Another approach is to use unsupervised models, which do not rely on explicit
morphological categories but instead learn to cluster words based on their shared morphological properties.
Unsupervised models can be effective for discovering new morphological categories and can handle unseen
words, but may not capture all the morphological properties of words and may require large amounts of
training data.
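
As a toy illustration of the feature-based representation mentioned in approach 1, the sketch below maps word forms to small dictionaries of morphological features and flattens them into "name=value" strings that a downstream model could one-hot encode; the feature names and values are invented for illustration.

word_features = {
    "runs": {"lemma": "run", "pos": "VERB", "tense": "present", "number": "singular"},
    "ran":  {"lemma": "run", "pos": "VERB", "tense": "past", "number": None},
    "dogs": {"lemma": "dog", "pos": "NOUN", "tense": None, "number": "plural"},
}

def to_feature_vector(features, schema):
    # Flatten a feature dictionary into "name=value" strings for later one-hot encoding.
    return [name + "=" + str(features.get(name)) for name in schema]

schema = ["lemma", "pos", "tense", "number"]
print(to_feature_vector(word_features["runs"], schema))
# ['lemma=run', 'pos=VERB', 'tense=present', 'number=singular']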

4 Languages without Word Segmentation

Some languages do not have a clear distinction between words in written text, which makes word segmentation
a challenging problem in natural language processing. Here are a few examples of such languages:

1. Chinese: In Chinese, there are no spaces between words, and written text consists of a sequence of
characters. This makes it difficult to determine where one word ends and the next one begins, especially since
a character may form a word on its own or combine with neighboring characters into a multi-character word,
depending on the context.

2. Japanese: Japanese has a writing system consisting of three scripts: kanji (Chinese characters), hiragana,
and katakana. A kanji character can have several readings and appear in several different words, hiragana is
used mainly for grammatical particles and inflections, and katakana is used mainly for loanwords. There are no
spaces between words, and the choice of kanji, hiragana, and katakana can vary depending on the context.

3. Thai: Thai is a tonal language that is written without spaces between words; spaces are generally used to
separate phrases or sentences rather than individual words. Word boundaries therefore have to be inferred from
context, which makes segmentation difficult.

4. Khmer: Khmer, the official language of Cambodia, is also written without spaces between words; spaces
mainly mark the ends of phrases or sentences, and the "khan" symbol functions as a full stop rather than a word
separator. Several words are often written as one continuous string, and usage can vary with context, so word
boundaries again have to be recovered by segmentation.

To process languages without clear word boundaries, natural language processing techniques such as
statistical models and machine learning algorithms can be used to identify possible word boundaries based on
context and statistical patterns. These techniques can help improve the accuracy of tasks such as text
segmentation, part-of-speech tagging, and machine translation for such languages.
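
A classical and much simpler baseline than the statistical methods just mentioned is dictionary-based greedy maximum matching, sketched below; the tiny lexicon and the example sentence (Chinese for "I love natural language processing") are toy inputs for illustration only.

def max_match(text, lexicon, max_len=6):
    # Greedy forward maximum matching: take the longest lexicon entry at each
    # position, falling back to a single character when nothing matches.
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"我", "爱", "自然", "语言", "处理", "自然语言", "自然语言处理"}
print(max_match("我爱自然语言处理", lexicon))
# ['我', '爱', '自然语言处理']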

5 Spoken versus Written Languages

Spoken and written languages have different characteristics and present different challenges for natural
language processing. Here are a few key differences between spoken and written languages:

1. Vocabulary: Spoken language tends to have a more limited vocabulary than written language. This is
because spoken language is often more informal and less precise, relying on context and gestures to convey
meaning. Written language, on the other hand, tends to be more formal and precise, with a wider range of
vocabulary.

2. Grammar: Spoken language is often less strict in terms of grammar and syntax, with more reliance on
intonation and gestures to convey meaning. Written language, on the other hand, tends to follow more rigid
grammatical rules and conventions.

3. Context: Spoken language is often dependent on context and situational cues, such as facial expressions and
body language, to convey meaning. Written language, on the other hand, is often self-contained and can be
read and understood without relying on external context.

4. Disfluencies: Spoken language often contains disfluencies, such as pauses, repetitions, and filler words like
"um" and "uh." These are less common in written language, which is typically more polished and edited.

5. Acoustic Characteristics: Spoken language has a unique set of acoustic characteristics, including pitch,
volume, and timing, that are not present in written language. These characteristics can be used to help identify
speakers and differentiate between different types of speech, such as questions, statements, and commands.

Multilingual and Crosslingual Language Modeling

1 Multilingual Language Modeling

2 Crosslingual Language Modeling

Multilingual and crosslingual language modeling are two related but distinct areas of natural language
processing that deal with modeling language data across multiple languages.

Multilingual language modeling refers to the task of training a language model on data from multiple
languages. The goal is to create a single model that can handle input in multiple languages. This can be useful
for applications such as machine translation, where the model needs to be able to process input in different
languages.

Crosslingual language modeling, on the other hand, refers to the task of training a language model on data
from one language and using it to process input in another language. The goal is to create a model that can
transfer knowledge from one language to another, even if the languages are unrelated. This can be useful for
tasks such as crosslingual document classification, where the model needs to be able to classify documents
written in different languages.

There are several challenges associated with multilingual and crosslingual language modeling, including:

1. Vocabulary size: Different languages have different vocabularies, which can make it challenging to train a
model that can handle input from multiple languages.

2. Grammatical structure: Different languages have different grammatical structures, which can make it
challenging to create a model that can handle input from multiple languages.

3. Data availability: It can be challenging to find enough training data for all the languages of interest.

To overcome these challenges, researchers have developed various approaches to multilingual and
crosslingual language modeling, including:

1. Shared embedding space: One approach is to train a model with a shared embedding space, where the
embeddings for words in different languages are learned jointly. This can help address the vocabulary size
challenge.

2. Language-specific layers: Another approach is to use language-specific layers in the model to handle the
differences in grammatical structure across languages.

3. Pretraining and transfer learning: Pretraining a model on large amounts of data in one language and then
fine-tuning it on smaller amounts of data in another language can help address the data availability challenge.

Multilingual and crosslingual language modeling are active areas of research, with many potential
applications in machine translation, crosslingual information retrieval, and other areas.

1 Multilingual Language Modeling

Multilingual language modeling is the task of training a single language model that can process input in
multiple languages. The goal is to create a model that can handle the vocabulary and grammatical structures of
multiple languages.

One approach to multilingual language modeling is to train the model on a mixture of data from multiple
languages. The model can then learn to share information across languages and generalize to new languages.
This approach can be challenging because of differences in vocabulary and grammar across languages.

Another approach is to use a shared embedding space for the different languages. In this approach, the
embeddings for words in different languages are learned jointly, allowing the model to transfer knowledge
across languages. This approach has been shown to be effective for low-resource languages.
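
One convenient way to experiment with a shared embedding space is to use an off-the-shelf multilingual encoder; the sketch below uses multilingual BERT through the Hugging Face transformers library, with the model name, sentence pair, and mean pooling chosen purely for illustration (running it requires downloading the pretrained model).

import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"   # an illustrative off-the-shelf multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentences = ["The cat sleeps on the mat.", "Le chat dort sur le tapis."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence in the shared space.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]))

Because the English and French sentences express the same meaning, their pooled vectors should be relatively close in the shared space, which is the property that multilingual and crosslingual applications rely on.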

Multilingual language models have many potential applications, including machine translation, language
identification, and cross-lingual information retrieval. They can also be used for tasks such as sentiment
analysis and named entity recognition across multiple languages. However, there are also challenges
associated with multilingual language modeling, including the need for large amounts of multilingual data and
the difficulty of balancing the modeling of multiple languages.

2 Crosslingual Language Modeling

Crosslingual language modeling is a type of multilingual language modeling that focuses specifically on the
problem of transferring knowledge between languages that are not necessarily closely related. The goal is to
create a language model that can understand multiple languages and can be used to perform tasks across
languages, even when there is limited data available for some of the languages.

One approach to crosslingual language modeling is to use a shared encoder for multiple languages, which can
be used to map input text into a common embedding space. This approach allows the model to transfer
knowledge across languages and to leverage shared structures and features across languages.

Another approach is to use parallel corpora, which are pairs of texts in two different languages that have been
aligned sentence-by-sentence. These parallel corpora can be used to train models that can map sentences in
one language to sentences in another language, which can be used for tasks like machine translation.

Crosslingual language modeling has many potential applications, including cross-lingual information
retrieval, machine translation, and cross-lingual classification. It is particularly useful for low-resource
languages where there may be limited labeled data available, as it allows knowledge from other languages to
be transferred to the low-resource language.

However, crosslingual language modeling also presents several challenges, including the need for large
amounts of parallel data, the difficulty of aligning sentence pairs across languages, and the potential for errors
to propagate across languages.
