
UNIT IV

SEMANTICS: Requirements for Representation, First-Order Logic, Description Logics, Syntax-Driven Semantic Analysis, Semantic Attachments, Word Senses, Relations between Senses, Thematic Roles.

The raw text corpus is preprocessed and transformed into one or more text representations that serve as input to a machine learning model. The data passes through a series of preprocessing tasks such as tokenization, stopword removal, punctuation removal, and stemming, which clean it of any noise. The cleaned data is then represented in various forms according to the purpose of the application and the input requirements of the machine learning model.
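
A minimal sketch of such a preprocessing pipeline, assuming Python with NLTK installed and the punkt and stopwords resources downloaded (the sample sentence is invented for illustration):

# Minimal preprocessing sketch with NLTK: tokenization, punctuation removal,
# stopword removal, and stemming. Assumes the 'punkt' and 'stopwords'
# resources have been fetched via nltk.download().
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "The cats are sitting on the mats!"

tokens = word_tokenize(text.lower())                          # tokenization
tokens = [t for t in tokens if t not in string.punctuation]   # punctuation removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]           # stopword removal

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])                      # ['cat', 'sit', 'mat']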

Common Terms Used While Representing Text in NLP


 Corpus (C): all the text data or records of the dataset, taken together, are known as the corpus.
 Vocabulary (V): all the unique words present in the corpus.
 Document (D): a single text record of the dataset.
 Word (W): a word present in the vocabulary.

Types of Text Representation in NLP


Text representation can be classified into two types:

 Discrete text representation

 Distributed text representation

In discrete text representation, the individual words in the corpus are treated as mutually exclusive and independent of one another (e.g., one-hot encoding). Distributed text representation relies on the co-dependence of words in the corpus: the information about a word is spread along the vector that represents it (e.g., word embeddings).

This section focuses on discrete text representations in NLP. Some of them are:

 One Hot Encoding


 Bag of words
 Count Vectorizer
 TF-IDF
 Ngrams
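
As a hedged sketch (the library choice and toy corpus are illustrative, not from the original notes), scikit-learn provides off-the-shelf implementations of all of these representations:

# Sketch of discrete text representations with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words / count vectorizer: each document becomes a vector of term counts.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())            # the vocabulary V

# One-hot style: binary=True records presence/absence instead of counts.
print(CountVectorizer(binary=True).fit_transform(corpus).toarray())

# TF-IDF: re-weights counts so terms shared by every document score low.
print(TfidfVectorizer().fit_transform(corpus).toarray())

# N-grams: ngram_range=(1, 2) adds bigrams such as "cat sat" to the vocabulary.
print(CountVectorizer(ngram_range=(1, 2)).fit(corpus).get_feature_names_out())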

First Order logic

In the topic of propositional logic, we have seen how to represent statements using propositional logic. Unfortunately, in propositional logic we can only represent facts, which are either true or false. PL is not sufficient to represent complex sentences or natural language statements; it has very limited expressive power. Consider the following sentences, which we cannot represent using PL:

o "Some humans are intelligent", or

o "Sachin likes cricket."

To represent such statements, PL is not sufficient, so we require a more powerful logic, such as first-order logic.

First-Order logic:
o First-order logic is another way of knowledge representation in artificial intelligence. It is an extension of propositional logic.
o FOL is sufficiently expressive to represent natural language statements in a concise way.
o First-order logic is also known as predicate logic or first-order predicate logic. It is a powerful language that expresses information about objects in a natural way and can also express the relationships between those objects.
o First-order logic (like natural language) does not only assume, as propositional logic does, that the world contains facts; it also assumes the following things in the world:
o Objects: A, B, people, numbers, colors, wars, theories, squares, pits, wumpus, ......
o Relations: these can be unary relations such as red, round, is adjacent, or n-ary relations such as the sister of, brother of, has color, comes between
o Functions: father of, best friend, third inning of, end of, ......
o Like a natural language, first-order logic also has two main parts:

o Syntax
o Semantics

Syntax of First-Order logic:

The syntax of FOL determines which collections of symbols form legal logical expressions in first-order logic. The basic syntactic elements of first-order logic are symbols. We write statements in short-hand notation in FOL.

Basic Elements of First-order logic:

Following are the basic elements of FOL syntax:

Constants: 1, 2, A, John, Mumbai, cat, ....
Variables: x, y, z, a, b, ....
Predicates: Brother, Father, >, ....
Functions: sqrt, LeftLegOf, ....
Connectives: ∧, ∨, ¬, ⇒, ⇔
Equality: =
Quantifiers: ∀, ∃

Atomic sentences:
o Atomic sentences are the most basic sentences of first-order logic. These sentences are formed from a predicate symbol followed by a parenthesized sequence of terms.
o We can represent atomic sentences as Predicate(term1, term2, ......, termN).

Example: Ravi and Ajay are brothers: => Brothers(Ravi, Ajay).

Chinky is a cat: => Cat(Chinky).

Complex Sentences:
o Complex sentences are made by combining atomic sentences using
connectives.
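
For example, combining atomic sentences with connectives and quantifiers:

o "Some humans are intelligent": ∃x Human(x) ∧ Intelligent(x)
o "All kings are persons": ∀x King(x) ⇒ Person(x)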

First-order logic statements can be divided into two parts:

o Subject: Subject is the main part of the statement.


o Predicate: A predicate can be defined as a relation, which binds two
atoms together in a statement.

Consider the statement "x is an integer." It consists of two parts: the first part, x, is the subject of the statement, and the second part, "is an integer," is the predicate.

Semantic Analysis

Syntax-Driven Semantic Analysis

Definition: syntax-driven semantic analysis — assigning meaning representations based solely on static knowledge from the lexicon and the grammar. [JM]

This provides a representation that is "both context independent and inference free" [JM], presumably referring to semantic context.

Definition: principle of compositionality — the meaning of a sentence is the sum of the meanings of its constituent elements.
NB: this includes the relationships among these elements (ordering, grouping, etc.); i.e., the meaning of a sentence is "partially based on its syntactic structure."

Pipeline Representation

Parser => Syntactic Representation => Semantic Analyzer => Meaning Representation

Notes:

 The syntactic representation could be a parse tree (real or temporal), feature structures, etc.
 Possible meaning representations have been discussed — frames, FOPC, semantic networks, etc.

Ambiguities arise from the syntax and lexicon and are thus woven into the output of the semantic analyzer. A major assumption in the handling of ambiguity in syntax-driven semantic analysis is that subsequent interpretation processes will be required to resolve these ambiguities. These processes will use domain knowledge and knowledge of context to select among competing representations. [JM]

Parser => Syntactic Representation => Semantic Analyzer => Ambiguous Meaning Representations => Additional Disambiguating Stages [domain-specific knowledge; contextual knowledge] => Selection Process

Example: Rudimentary analysis of:

15.1 AyCaramba serves meat.

[JM Fig. 15.2] shows a simplified parse tree, annotated with a meaning representation (lacking feature attachments).

Notes:

 The "first" subtree retrieved by the semantic analyzer is supposed to correspond to the verb "to serve".
 The semantic analyzer discerns that the VP requires
o a particular NP (viz., an actor) to precede it
o a particular NP (viz., a direct object) to complement the verb
 This also demonstrates a method of encoding semantics using FOPC.

In English, this says:

 There exists an element e which satisfies the predicate of being an event of serving [a serving event]:
∃e such that e is a Serving instance
 And e, together with AyCaramba, also satisfies the predicate Server, i.e., AyCaramba is the server in the event:
∧ Server(e, AyCaramba)
 And e, together with Meat, satisfies the predicate Served, i.e., meat is served during the event:
∧ Served(e, Meat)
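
Assembled into a single formula (using the ISA predicate for "e is a Serving instance", following JM's event-based notation), the complete meaning representation is:

∃e ISA(e, Serving) ∧ Server(e, AyCaramba) ∧ Served(e, Meat)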

Note the similarities to Logic Form's representation of meaning.

Problems to overcome when using the above schema:

The interpreter of the tree needs to know that the verb is to be used to define the template for interpretation, where the verb can be found in the tree, and where the NP arguments to the template occur in the tree.

Since there are potentially infinitely many trees generated by any reasonably sized grammar for NLP, this task needs some other processing aids.

To facilitate the processing of the parse tree by the semantic analyzer, grammar rules are augmented with semantic attachments.

We understand that words have different meanings depending on the context of their usage in a sentence. Human languages are ambiguous because many words can be interpreted in multiple ways depending upon the context of their occurrence.
Word sense disambiguation (WSD), in natural language processing, may be defined as the ability to determine which meaning of a word is activated by its use in a particular context. Lexical ambiguity, syntactic or semantic, is one of the very first problems that any NLP system faces. Part-of-speech (POS) taggers with a high level of accuracy can resolve a word's syntactic ambiguity. The problem of resolving semantic ambiguity, on the other hand, is called WSD, and it is harder than resolving syntactic ambiguity.
For example, consider two examples of the distinct senses that exist for the word "bass":
 I can hear bass sound.
 He likes to eat grilled bass.
The occurrences of the word bass clearly denote distinct meanings. In the first sentence it means frequency, and in the second it means fish. Hence, if the sentences were disambiguated by WSD, the correct meanings could be assigned as follows:
 I can hear bass/frequency sound.
 He likes to eat grilled bass/fish.

Evaluation of WSD
The evaluation of WSD requires the following two inputs:
A Dictionary
The very first input for evaluation of WSD is a dictionary, which is used to specify the senses to be disambiguated.
Test Corpus
The other input required by WSD is a sense-annotated test corpus that has the target or correct senses. Test corpora can be of two types:


 Lexical sample − this kind of corpus is used in systems that are required to disambiguate a small sample of words.
 All-words − this kind of corpus is used in systems that are expected to disambiguate all the words in a piece of running text.

Approaches and Methods to Word Sense Disambiguation (WSD)
Approaches and methods to WSD are classified according to the source of knowledge used in word disambiguation.
Let us now see the four conventional methods for WSD:
Dictionary-based or Knowledge-based Methods
As the name suggests, for disambiguation these methods primarily rely on dictionaries, thesauri, and lexical knowledge bases; they do not use corpus evidence for disambiguation. The Lesk method is the seminal dictionary-based method, introduced by Michael Lesk in 1986. The Lesk definition, on which the Lesk algorithm is based, is "measure overlap between sense definitions for all words in context". In 2000, however, Kilgarriff and Rosenzweig gave the simplified Lesk definition as "measure overlap between sense definitions of word and current context", which means identifying the correct sense for one word at a time. Here the current context is the set of words in the surrounding sentence or paragraph.
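
A minimal sketch of the simplified Lesk algorithm, assuming Python with NLTK installed and the WordNet corpus downloaded via nltk.download('wordnet'):

# Simplified Lesk via NLTK: picks the WordNet sense whose definition
# overlaps most with the context words.
from nltk.wsd import lesk

context = "He likes to eat grilled bass".split()
sense = lesk(context, "bass")        # returns a WordNet Synset (or None)
if sense is not None:
    print(sense.name(), "-", sense.definition())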
Supervised Methods
For disambiguation, machine learning methods make use of sense-annotated corpora for training. These methods assume that the context can provide enough evidence on its own to disambiguate the sense, so explicit knowledge and reasoning are deemed unnecessary. The context is represented as a set of "features" of the words, including information about the surrounding words. Support vector machines and memory-based learning are the most successful supervised learning approaches to WSD. These methods rely on a substantial amount of manually sense-tagged corpora, which is very expensive to create.
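
A hedged sketch of the supervised approach, framing WSD as text classification with scikit-learn; the tiny sense-tagged training set below is invented purely for illustration:

# Supervised WSD sketch: TF-IDF features over the context window, with a
# linear SVM predicting the sense label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

contexts = [
    "turn up the bass on the speakers",
    "the bass line drives the whole song",
    "caught a large bass in the lake",
    "grilled bass with lemon for dinner",
]
senses = ["bass/frequency", "bass/frequency", "bass/fish", "bass/fish"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(contexts, senses)
print(model.predict(["the bass sound from the speakers"]))  # likely: ['bass/frequency']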
Semi-supervised Methods
Due to the lack of training corpora, many word sense disambiguation algorithms use semi-supervised learning methods, which use both labelled and unlabelled data. These methods require a very small amount of annotated text and a large amount of plain unannotated text. The technique used by semi-supervised methods is bootstrapping from seed data.
Unsupervised Methods
These methods assume that similar senses occur in similar contexts, so senses can be induced from text by clustering word occurrences using some measure of contextual similarity. This task is called word sense induction or discrimination. Because they do not depend on manual effort, unsupervised methods have great potential to overcome the knowledge-acquisition bottleneck.
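
A rough sketch of sense induction by clustering with scikit-learn (toy contexts invented for illustration; real systems use much richer context representations):

# Word sense induction sketch: cluster the contexts of "bass" by TF-IDF
# similarity; each cluster approximates one induced sense.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

contexts = [
    "loud bass sound from the speakers",
    "the bass sound in this song is loud",
    "fresh bass fish caught in the river",
    "grilled bass fish from the river for dinner",
]
X = TfidfVectorizer(stop_words="english").fit_transform(contexts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # likely [0, 0, 1, 1] (cluster ids may be swapped)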


Applications of Word Sense Disambiguation (WSD)

Word sense disambiguation (WSD) is applied in almost every application of language technology.
Let us now see the scope of WSD:
Machine Translation
Machine translation (MT) is the most obvious application of WSD. In MT, lexical choice for words that have distinct translations for different senses is done by WSD. The senses in MT are represented as words in the target language. Most machine translation systems do not use an explicit WSD module.
Information Retrieval (IR)
Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return answers to questions. WSD is used to resolve the ambiguities of queries provided to an IR system. Like MT, current IR systems do not explicitly use a WSD module; they rely on the user typing enough context in the query to retrieve only relevant documents.
Text Mining and Information Extraction (IE)
In most applications, WSD is necessary for accurate analysis of text. For example, WSD helps an intelligence gathering system flag the correct words: a medical intelligence system might need to flag "illegal drugs" rather than "medical drugs".
Lexicography
WSD and lexicography can work together in a loop because modern lexicography is corpus-based. For lexicography, WSD provides rough empirical sense groupings as well as statistically significant contextual indicators of sense.

Difficulties in Word Sense Disambiguation (WSD)

Following are some difficulties faced by word sense disambiguation (WSD):
Differences between dictionaries
The major problem of WSD is deciding on the sense of a word, because different senses can be very closely related. Even different dictionaries and thesauruses can provide different divisions of words into senses.
Different algorithms for different applications
Another problem of WSD is that a completely different algorithm might be needed for different applications. For example, in machine translation it takes the form of target word selection, while in information retrieval a sense inventory is not required.


Inter-judge variance
Another problem is that WSD systems are generally tested by comparing their results on a task against those of human beings. This is called the problem of inter-judge variance.
Word-sense discreteness
Another difficulty in WSD is that words cannot easily be divided into discrete sub-meanings.

Thematic analysis is a systematic method of analyzing qualitative data. It enables you to find rich, useful insights quickly, and organizes your data so that you can easily see context.

Qualitative data is inherently unstructured, with people contributing ideas and feedback in the way that is natural to them. It's a conversation, rather than a list of points. To get valuable insights, companies need to structure their feedback data and filter out the noise and filler data.

In this overview, we’ll break down the jargon, and explain how
thematic analysis works. We'll briefly cover some manual
approaches, and take a deeper look at thematic analysis software.
We’ll also give some examples of how leading companies use
thematic analysis to better understand their feedback data.

Qualitative data and themes


As soon as you start looking at feedback or text analysis, you'll
see the term qualitative data. So what counts as qualitative data?
Simply put, it's information collected from text, audio and images
that can't be easily expressed using numbers.

If you have 100 one-star reviews of a product, you know something's going wrong. But ratings in isolation can't provide a fix. You need to look at the comments to understand context and identify the problem. The one-star rating is quantitative data, and the comments are qualitative data.

This is where thematic analysis comes in. Using this methodology, you can go through the comments and pull out the common themes to zero in on the answers you need.

What's a theme? In thematic analysis, a theme is assigned to a piece of data to summarize its meaning. When it comes to feedback, a theme is a collection of different ways people talk about the same thing.

The process of labeling and organizing your qualitative data to identify different themes and the relationships between them is called coding. Themes can be combinations of words, phrases or numbers.

What are the advantages of thematic analysis?

1. Simple to learn and apply
Thematic analysis is straightforward and easy to learn. Everyone can use either manual or automated approaches to generate useful themes from their datasets.
2. Goes beyond the surface
Thematic analysis allows you to dig deep, with themes referring to the thoughts, motivations and ideas behind your data.
3. Flexible
Thematic analysis is a flexible approach which can be applied to a wide variety of sample sizes, data types, and study questions.
4. Comprehensive bottom-up approach
With thematic analysis there is no need to set up categories in advance. Allowing themes to emerge from your data means you'll capture unknown unknowns. This bottom-up approach allows you to jump right into your data without doing a lot of preparatory work.

What are the challenges of using thematic analysis?

1. One-off cases
Topics that only occur once in your data might be overlooked, as
thematic analysis focuses on identifying patterns. This could be
significant if you have a smaller data set or an important issue
that has only been mentioned once.
2. Capturing the exact meaning
Since thematic analysis is phrase-based, sometimes the exact
meaning isn’t captured correctly. You can tackle this issue by
carefully evaluating and reviewing themes to make sure they
match up to the meaning behind your data. Combining AI and a
human eye is a powerful way to ensure your analysis results are
accurate.
3. Human error and bias
Manual approaches can introduce errors or bias into your results
and create problems with consistency. It’s natural to highlight
themes that support your existing beliefs. One fix is to choose
an automated solution. These use NLP (natural language
processing) to automatically identify themes. This can help
increase consistency across themes and reduce bias.

Tools for thematic analysis


Let’s walk through some of the tools you can use for thematic
analysis. These range from manual approaches to powerful AI
solutions which automate the entire analytical process.

Affinity mapping
If you’re planning to do your thematic analysis manually, affinity
mapping can help streamline the process. Affinity mapping groups
data into clusters based on their similarities.

One effective way to create a thematic map manually is by using sticky memo notes. Once you've grouped together certain phrases or sentences, it will be easy to see the overarching themes in your data.

CAQDAS solutions for coding


CAQDAS stands for Computer Assisted Qualitative Data Analysis
Software. It’s commonly used by researchers to help them with
qualitative data analysis.

Instead of creating codes with pen and paper, you upload your
data and create codes directly within the CAQDAS software. You
can easily filter your data by code, which makes it easier to
review your codes and evaluate your findings.

Examples of CAQDAS solutions that can be used for thematic analysis include ATLAS.ti, NVivo, MAXQDA, and QDA Miner.


In other words, thematic roles tell us what "role" the NP plays in the action described by the verb in a sentence. The concept will, no doubt, become clearer as we consider several examples of thematic roles.

Some of the different thematic roles that seem to hold in many, if not all, languages are shown in series 1) through 6). The NP that illustrates each thematic role is in italics. (Note that NP as used in this entry refers to "noun phrase".)

 In the concept of thematic roles, an agent, or actor, is the thing that causes, or instigates, the action in a sentence. The agent acts by volition, or free will. In sentences that are in the passive voice, the agent is often preceded by the preposition by, as is the case in 1b).

 1a) The goalie caught the ball.


 1b) The ball was caught by the goalie.

 The patient, or theme, is the thing affected by an action or the thing that undergoes a change, as in the series of examples in 2).

 2a) The basketball player bounced the ball.


 2b) The ball was bounced by the basketball player.
 2c) The ball bounced several times before rolling under the bench,
where it came to a stop.

 An instrument, or a means, is the thing, usually some tool, that is used to carry out an action. The instrument is usually, but not always, inanimate, and does not act but is acted upon, as seen in the series in 3).

 3a) He sliced the salami with a knife.


 3b) The scarf was knitted by hand.

 A source designates the place from which, or person from whom, an action originates, as the series in 4) shows.

 4a) I took the book from the library.


 4b) He got the book off the shelf.
 4c) Fire gives off heat.
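
Putting these roles together: in 3a), "He sliced the salami with a knife", He is the agent, the salami is the patient (theme), and a knife is the instrument; in 4a), "I took the book from the library", the library is the source.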
