Unit IV Notes
UNIT IV
The raw text corpus is preprocessed and transformed into a number of text representations
that are input to the machine learning model. The data is pushed through a series of
preprocessing tasks such as tokenization, stopword removal, punctuation removal, stemming,
and many more. We clean the data of any noise present. This cleaned data is represented in
various forms according to the purpose of the application and the input requirements of the
machine learning model.
In discrete text representation, the individual words in the corpus are mutually
exclusive and independent of one another. Eg: One Hot Encoding. Distributed text
representation thrives on the co-dependence of words in the corpus. The information
about a word is spread along the vector which represents it. Eg: Word Embeddings.
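The discrete case can be illustrated with a minimal one-hot encoding sketch in Python (the toy corpus below is an illustrative assumption, not drawn from any dataset):

```python
# Minimal sketch: one-hot encoding treats each word as mutually
# exclusive -- any two distinct word vectors are orthogonal, so no
# similarity between words is captured.

corpus = ["the cat sat", "the dog ran"]

# Build a sorted vocabulary of unique words.
vocab = sorted({w for sent in corpus for w in sent.split()})

def one_hot(word):
    """Return a one-hot vector for `word` over the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(vocab)           # ['cat', 'dog', 'ran', 'sat', 'the']
print(one_hot("cat"))  # [1, 0, 0, 0, 0]
```

Note that the dot product of the vectors for "cat" and "dog" is zero, which is exactly the mutual independence described above; distributed representations such as word embeddings are designed to avoid this.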
This article focuses on the discrete text representation in NLP. Some of them are:-
First-Order logic:
o First-order logic is another way of knowledge representation in artificial
intelligence. It is an extension to propositional logic.
o FOL is sufficiently expressive to represent the natural language
statements in a concise way.
o First-order logic is also known as Predicate logic or First-order
predicate logic. First-order logic is a powerful language that develops
information about the objects in a more easy way and can also express the
relationship between those objects.
o First-order logic (like natural language) does not only assume that the
world contains facts like propositional logic but also assumes the following
things in the world:
o Objects: A, B, people, numbers, colors, wars, theories, squares,
pits, wumpus, ......
o Relations: It can be a unary relation such as: red, round, is
adjacent, or an n-ary relation such as: the sister of, brother of, has
color, comes between
o Function: Father of, best friend, third inning of, end of, ......
o As a natural language, first-order logic also has two main parts:
o Syntax
o Semantics
Variables: x, y, z, a, b, ...
Connectives: ∧, ∨, ¬, ⇒, ⇔
Equality: =
Quantifiers: ∀, ∃
Atomic sentences:
o Atomic sentences are the most basic sentences of first-order logic. These
sentences are formed from a predicate symbol followed by a parenthesis
with a sequence of terms.
o We can represent atomic sentences as Predicate(term1, term2, ...,
termN).
Complex Sentences:
o Complex sentences are made by combining atomic sentences using
connectives.
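The distinction between atomic and complex sentences can be sketched with a small Python representation (the tuple-based encoding and the example predicates such as Brothers and King are illustrative assumptions, not a standard library):

```python
# Sketch: FOL sentences as nested tuples.
# Atomic sentence: ("Predicate", term1, ..., termN)
# Complex sentence: ("CONNECTIVE", sentence1, sentence2, ...)

brothers = ("Brothers", "John", "Smith")  # Brothers(John, Smith)
king = ("King", "Richard")                # King(Richard)

# Complex sentence built with a connective: King(Richard) ∨ King(John)
disjunction = ("OR", ("King", "Richard"), ("King", "John"))

def is_atomic(sentence):
    """Atomic sentences take only terms (no nested sentences) as arguments."""
    return not any(isinstance(arg, tuple) for arg in sentence[1:])

print(is_atomic(brothers))     # True
print(is_atomic(disjunction))  # False
```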
Semantic Analysis
Pipeline Representation
Notes:
Ambiguities arise from the syntax and lexicon and are thus woven into
the output of the semantic analyzer. A major assumption on the
handling of ambiguity in syntax-driven semantic analysis is
that subsequent interpretation processes will be required to resolve these
ambiguities. These processes will use domain
knowledge and knowledge of context to select among competing
representations [JM]
[JM Fig. 15.2] shows a simplified parse tree, annotated with meaning
representation (lacking feature attachments.)
Notes:
The interpreter of the tree needs to know that the Verb is to be used to
define the template for interpretation, where the Verb can be found in the
tree, and where the NP arguments to the template occur in the tree.
We understand that words have different meanings based on the context of their
usage in a sentence. Human languages are ambiguous in the same way: many
words can be interpreted in multiple ways depending upon the context of their
occurrence.
Word sense disambiguation, in natural language processing (NLP), may be defined
as the ability to determine which meaning of a word is activated by its use in a
particular context. Lexical ambiguity, syntactic or semantic, is one of the very first
problems that any NLP system faces. Part-of-speech (POS) taggers with a high level of
accuracy can resolve a word's syntactic ambiguity. The problem of resolving
semantic ambiguity, on the other hand, is called WSD (word sense disambiguation).
Resolving semantic ambiguity is harder than resolving syntactic ambiguity.
For example, consider two examples of the distinct senses that exist for the
word "bass" −
I can hear bass sound.
He likes to eat grilled bass.
The two occurrences of the word bass clearly denote distinct meanings: in the first
sentence it means frequency, and in the second it means the fish. Hence, if the
sentences were disambiguated by WSD, the correct meanings could be assigned
as follows −
I can hear bass/frequency sound.
He likes to eat grilled bass/fish.
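A minimal Lesk-style sketch shows how context can select between the two senses of "bass": pick the sense whose gloss shares the most words with the sentence. The two glosses below are illustrative assumptions, not real dictionary entries:

```python
# Simplified Lesk-style WSD: score each sense by the overlap between
# its gloss and the words of the sentence, then pick the best sense.

SENSES = {
    "frequency": "a low frequency sound or tone in music",
    "fish": "an edible freshwater fish often grilled or fried",
}

def disambiguate(sentence):
    """Return the sense whose gloss overlaps the sentence the most."""
    context = set(sentence.lower().split())
    def overlap(sense):
        return len(context & set(SENSES[sense].split()))
    return max(SENSES, key=overlap)

print(disambiguate("I can hear bass sound"))        # frequency
print(disambiguate("He likes to eat grilled bass")) # fish
```

Real systems (e.g. the Lesk implementation in NLTK) use full dictionary glosses such as WordNet's rather than hand-written ones, but the overlap idea is the same.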
Evaluation of WSD
The evaluation of WSD requires the following two inputs −
A Dictionary
The very first input for evaluation of WSD is dictionary, which is used to specify the
senses to be disambiguated.
Test Corpus
Another input required by WSD is a sense-annotated test corpus that has the target
or correct senses. The test corpora can be of two types.
Inter-judge variance
Another problem of WSD is that WSD systems are generally tested by having their
results on a task compared against the task of human beings. This is called the
problem of interjudge variance.
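Inter-judge variance is usually quantified as the agreement rate between human annotators on the same items. The sense tags below are fabricated purely for illustration:

```python
# Sketch: raw inter-judge agreement between two annotators tagging the
# sense of "bass" in five contexts (illustrative data, not a real study).

judge_a = ["fish", "freq", "fish", "fish", "freq"]
judge_b = ["fish", "freq", "freq", "fish", "freq"]

# Fraction of items on which the two judges agree.
agreement = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
print(agreement)  # 0.8
```

An agreement below 100% sets a practical upper bound on how well a WSD system can score against either judge's annotations.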
Word-sense discreteness
Another difficulty in WSD is that words cannot be easily divided into discrete
submeanings.
In this overview, we’ll break down the jargon, and explain how
thematic analysis works. We'll briefly cover some manual
approaches, and take a deeper look at thematic analysis software.
We’ll also give some examples of how leading companies use
thematic analysis to better understand their feedback data.
1. One-off cases
Topics that only occur once in your data might be overlooked, as
thematic analysis focuses on identifying patterns. This could be
significant if you have a smaller data set or an important issue
that has only been mentioned once.
2. Capturing the exact meaning
Since thematic analysis is phrase-based, sometimes the exact
meaning isn’t captured correctly. You can tackle this issue by
carefully evaluating and reviewing themes to make sure they
match up to the meaning behind your data. Combining AI and a
human eye is a powerful way to ensure your analysis results are
accurate.
3. Human error and bias
Manual approaches can introduce errors or bias into your results
and create problems with consistency. It’s natural to highlight
themes that support your existing beliefs. One fix is to choose
an automated solution. These use NLP (natural language
processing) to automatically identify themes. This can help
increase consistency across themes and reduce bias.
Let’s walk through some of the tools you can use for thematic
analysis. These range from manual approaches to powerful AI
solutions which automate the entire analytical process.
Affinity mapping
If you’re planning to do your thematic analysis manually, affinity
mapping can help streamline the process. Affinity mapping groups
data into clusters based on their similarities.
Instead of creating codes with pen and paper, you upload your
data and create codes directly within the CAQDAS software. You
can easily filter your data by code, which makes it easier to
review your codes and evaluate your findings.
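The grouping step behind affinity mapping can be sketched as a simple keyword-based clustering. The feedback snippets and the theme keywords below are illustrative assumptions, not output from any CAQDAS tool:

```python
# Sketch: cluster feedback snippets by shared theme keywords, the way
# affinity mapping groups similar items together.

feedback = [
    "checkout page keeps crashing",
    "love the new checkout flow",
    "support reply took three days",
    "support agent was very helpful",
]

keywords = ["checkout", "support"]

# One cluster per keyword: every snippet containing that keyword.
clusters = {k: [s for s in feedback if k in s] for k in keywords}

for theme, items in clusters.items():
    print(theme, len(items))  # checkout 2, then support 2
```

Automated tools replace the literal keyword match with NLP-derived similarity, but the cluster-by-affinity structure is the same.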
In other words, Thematic Roles tell us what "role" the NP plays in the action
described by the verb in a sentence. The concept will, no doubt, become clearer as
we consider several examples of thematic roles.
Some of the different thematic roles that seem to hold in many, if not all, languages
are shown in series 1) through 6). The NP that illustrates each thematic role is in
italics. (Note that NP as used in this entry refers to "noun phrase". See Noun
Phrase.)
The patient, or theme, is the thing affected by an action or the thing that
undergoes a change, as is the case in the series of example 2).
The source designates the place from which, or person from whom, an action
originated, as the series in 4) shows.