NLP Unit I Notes-1
Prepared by
K SWAYAMPRABHA
Assistant Professor
• Course Objectives:
• Introduce some of the problems and solutions of NLP and their
relation to linguistics and statistics.
• Course Outcomes:
• Show sensitivity to linguistic phenomena and an ability to model them
with formal grammars.
• Understand and carry out proper experimental methodology for
training and evaluating empirical NLP systems.
• Able to manipulate probabilities, construct statistical models over
strings and trees, and estimate parameters using supervised and
unsupervised training methods.
• Able to design, implement, and analyze NLP algorithms
• Able to design different language modeling Techniques.
TEXT BOOKS:
• 1. Multilingual Natural Language Processing Applications: From Theory
to Practice – Daniel M. Bikel and Imed Zitouni, Pearson Publication
• 2. Natural Language Processing and Information Retrieval – Tanveer
Siddiqui, U.S. Tiwary
• REFERENCE BOOK:
• 1. Speech and Language Processing – Daniel Jurafsky & James
H. Martin, Pearson Publications
3. Voice Assistants
These days voice assistants are all the rage! Whether it's Siri, Alexa, or
Google Assistant, almost everyone uses one of these to make calls,
place reminders, schedule meetings, set alarms, surf the internet, etc.
4. Language Translator
Want to translate a text from English to Hindi but don’t know Hindi?
Well, Google Translate is the tool for you! While it's not exactly 100%
accurate, it is still a great tool to convert text from one language to
another. Google Translate and other translation tools use
sequence-to-sequence modeling, a technique in Natural
Language Processing.
5. Sentiment Analysis
Almost all the world is on social media these days! And companies can
use sentiment analysis to understand how a particular type of user
feels about a particular topic, product, etc. They can use natural
language processing, computational linguistics, text analysis, etc. to
understand the general sentiment of the users for their products and
services and find out if the sentiment is good, bad, or neutral.
6. Grammar Checkers
Grammar checkers use NLP to detect spelling, grammatical, and stylistic
errors in text and to suggest corrections; Grammarly is a well-known example.
UNIT - I
• Finding the Structure of Words: Words and Their Components, Issues
and Challenges, Morphological Models
2. Lexemes: By the term word, we often denote not just the one
linguistic form in the given context but also the concept behind the form and
the set of alternative forms that can express it. Such sets are called lexemes
or lexical items, and they constitute the lexicon of a language.
• The lexeme “play,” for example, can take many forms, such
as playing, plays, played.
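The idea of a lexeme grouping its alternative forms can be sketched with a small lookup table. The word list below is a tiny hypothetical sample, not a real lexicon:

```python
# Minimal sketch: a lexicon mapping each lexeme to its inflected forms.
LEXICON = {
    "play": ["play", "plays", "played", "playing"],
    "go":   ["go", "goes", "went", "gone", "going"],
}

# Invert the lexicon so each surface form points back to its lexeme.
FORM_TO_LEXEME = {
    form: lexeme
    for lexeme, forms in LEXICON.items()
    for form in forms
}

def lexeme_of(word):
    """Return the lexeme for a given word form, or None if unknown."""
    return FORM_TO_LEXEME.get(word.lower())

print(lexeme_of("played"))  # play
print(lexeme_of("went"))    # go
```

Note that even the irregular form "went" maps to its lexeme "go": a lexeme groups forms by meaning, not by surface similarity.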
According to these criteria, the following are the important language
families:
Indo-European
Sino-Tibetan
Niger-Congo
Afroasiatic
Austronesian
Altaic
Japonic
Austroasiatic
Tai-Kadai
The most commonly spoken are languages in the Indo-European and Sino-
Tibetan language groups. These two groups are used by 67% of the global
population.
1. GENEALOGICAL CLASSIFICATION
2. TYPOLOGICAL CLASSIFICATION
Languages are grouped into language types on the basis of formal criteria,
according to their similarities in grammatical structure. There are several
types: flective (fusional, using internal morphological changes), agglutinative
(using affixes), and root-based (using the root of the word as the
morphological resource).
3. AREAL CLASSIFICATION
It involves geographic criteria, and covers languages that are spoken close
to one another and have developed similar structural characteristics.
Through intensive mutual contact, such languages form language unions,
such as the Balkan Language Union, encompassing Macedonian, Bulgarian,
Serbian, and Albanian, for example.
2 Ambiguity
3 Productivity
• Syntactic Ambiguity
This kind of ambiguity occurs when a sentence is parsed in different
ways. For example, the sentence “The man saw the girl with the
telescope”. It is ambiguous whether the man saw the girl carrying a
telescope or he saw her through his telescope.
• Semantic Ambiguity
This kind of ambiguity occurs when the meaning of the words
themselves can be misinterpreted. In other words, semantic ambiguity
happens when a sentence contains an ambiguous word or phrase.
For example, the sentence “The car hit the pole while it was moving” has
semantic ambiguity, because the interpretations can be “The car, while
moving, hit the pole” and “The car hit the pole while the pole was
moving”.
• Anaphoric Ambiguity
This kind of ambiguity arises due to the use of anaphora entities in
discourse. For example: the horse ran up the hill. It was very steep. It
soon got tired. Here, the anaphoric reference of “it” in the two situations
causes ambiguity.
• Pragmatic ambiguity
Such kind of ambiguity refers to the situation where the context of a
phrase gives it multiple interpretations. In simple words, we can say
that pragmatic ambiguity arises when the statement is not specific.
For example, the sentence “I like you too” can have multiple
interpretations, like I like you (just like you like me) or I like you (just
like someone else does).
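The syntactic ambiguity discussed above can be made concrete with a toy grammar: a chart parser returns one tree per reading. This is a minimal sketch assuming the NLTK library is installed (no corpus downloads are needed):

```python
import nltk

# Toy context-free grammar for the ambiguous telescope sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'man' | 'girl' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "the man saw the girl with the telescope".split()

# One parse tree per reading: PP attached to the verb phrase
# (the man used the telescope) or to the noun phrase (the girl had it).
parses = list(parser.parse(sentence))
print(len(parses))  # 2
for tree in parses:
    print(tree)
```

The two trees differ only in where the prepositional phrase "with the telescope" attaches, which is exactly the ambiguity described in the prose.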
3. Productivity
1. Set Objectives
The simplest way of implementing a form of NLP in your work
environment is to make sure everyone is working towards goals. By
setting objectives you are giving your team a direction and something
to work for. If employees feel they are expected to achieve these firm
objectives, they will naturally work harder to make sure they do. They
will also automatically take more responsibility for their role and
the work they do.
This increases productivity on an individual level. Incentives for
successfully achieving objectives can also be specified in order to
motivate staff to succeed and thrive in the work environment.
3. Better Communication
Internal communications and client relationships are vital for a
productive and efficient working environment. Making people aware of
how they come across when interacting with others is a key aspect of
using NLP to improve communication. NLP will help to identify adverse
behaviours such as body language. Body language such as avoiding
eye contact or slouching shoulders is generally a subconscious
behaviour.
Once the negative behaviour has been recognised, the individual can
work to change and improve. As the individual becomes more self-aware,
they also become more aware of other people. Effective
communication requires an understanding of others’ thought processes
as well as an awareness of yourself. See yourself in the way that you
would like others to see you.
5. Changing Behaviour
The main objective of NLP is to reverse negative behaviours and
habits. How an individual interprets their workplace has little to do
with the actual working environment and more to do with the
individual. Employees have completely different experiences at work,
even though the work environment is the same for everyone. NLP
makes employees aware that the problems they face at work are
usually internal, not external. Making employees self-aware of their
attitudes and behaviours is the first step towards a positive change and
towards success, no matter what their goals are.
Morphological Models
1 Dictionary Lookup
2 Finite-State Morphology
3 Unification-Based Morphology
4 Functional Morphology
5 Morphology Induction
Morphological Models
There are many possible approaches to designing and implementing
morphological models. Over time, computational linguistics has witnessed
the development of a number of formalisms and frameworks, in particular
grammars of different kinds and expressive power, with which to address
whole classes of problems in processing natural as well as formal languages.
1. Dictionary Lookup
Morphological parsing is a process by which word forms of a language are
associated with corresponding linguistic descriptions.
In this context, a dictionary is understood as a data structure that directly
enables obtaining some precomputed results, in our case word analyses. The
data structure can be optimized for efficient lookup, and the results can be
shared. Lookup operations are relatively simple and usually quick.
Dictionaries can be implemented, for instance, as lists, binary search trees,
tries, hash tables, and so on.
Because the set of associations between word forms and their desired
descriptions is declared by plain enumeration, the coverage of the model is
finite and the generative potential of the language is not exploited.
Developing as well as verifying the association list is tedious, liable to errors,
and likely inefficient and inaccurate unless the data are retrieved
automatically from large and reliable linguistic resources.
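A dictionary-lookup model can be sketched as a hash table from word forms to precomputed analyses. The entries below are a small hypothetical sample for illustration:

```python
# A dictionary (hash table) of precomputed morphological analyses.
# Each analysis is a (lemma, part-of-speech, features) triple.
ANALYSES = {
    "plays":   [("play", "V", "3sg.pres"), ("play", "N", "pl")],
    "played":  [("play", "V", "past")],
    "playing": [("play", "V", "pres.part")],
}

def analyze(word):
    """Look up the precomputed analyses for a word form (O(1) on average)."""
    return ANALYSES.get(word.lower(), [])

print(analyze("plays"))   # two analyses: verb and plural noun
print(analyze("unseen"))  # [] -- coverage is limited to the enumerated forms
```

The empty result for an unlisted form illustrates the limitation noted above: the coverage of a plain enumeration is finite, so the generative potential of the language is not exploited.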
2. Finite-State Morphology
By finite-state morphological models, we mean those in which the
specifications written by human programmers are directly compiled into
finite-state transducers.
The two most popular tools supporting this approach, which have been cited
in the literature and for which example implementations for multiple languages
are available online, are: i) XFST (Xerox Finite-State Tool) and ii) LexTools.
Finite-state transducers are computational devices extending the power of
finite-state automata. They consist of a finite set of nodes connected by
directed edges labeled with pairs of input and output symbols. In such a
network or graph, nodes are also called states, while edges are called arcs.
Traversing the network from the set of initial states to the set of final states
along the arcs is equivalent to reading the sequences of encountered input
symbols and writing the sequences of corresponding output symbols.
The set of possible sequences accepted by the transducer defines the input
language; the set of possible sequences emitted by the transducer defines
the output language. For example, a finite-state transducer could translate
the infinite regular language consisting of the
words vnuk, pravnuk, prapravnuk, ... to the matching words in the infinite
regular language defined by grandson, great-grandson, great-great-
grandson, ...
Let us have a relation R, and let us denote by [Σ] the set of all sequences
over some set of symbols Σ, so that the domain and the range of R are
subsets of [Σ]. We can then consider R as a function mapping an input string
into a set of output strings, formally denoted by this type signature, where
[Σ] equals String:

R :: [Σ] → {[Σ]}
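A finite-state transducer can be sketched directly in code: states connected by arcs labeled with input/output symbol pairs, traversed while reading input and writing output. The toy machine below handles the vnuk/grandson example from above, with morphemes rather than single characters as symbols for brevity:

```python
# Minimal finite-state transducer sketch.
# State 0 is initial; state 1 is the only final state.
# Each arc maps (state, input symbol) to (next state, output symbol).
ARCS = {
    (0, "pra"):  (0, "great-"),
    (0, "vnuk"): (1, "grandson"),
}
FINAL = {1}

def transduce(symbols):
    """Traverse the arcs over the input symbols; return the output string,
    or None if the input is not accepted."""
    state, out = 0, []
    for sym in symbols:
        if (state, sym) not in ARCS:
            return None
        state, emitted = ARCS[(state, sym)]
        out.append(emitted)
    return "".join(out) if state in FINAL else None

print(transduce(["vnuk"]))                # grandson
print(transduce(["pra", "pra", "vnuk"]))  # great-great-grandson
```

The self-loop on state 0 is what lets this finite machine cover the infinite regular language vnuk, pravnuk, prapravnuk, and so on.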
3. Unification-Based Morphology
Unification-based approaches to morphology have been inspired by advances
in various formal linguistic frameworks aiming at enabling complete
grammatical descriptions of human languages, especially head-driven phrase
structure grammar (HPSG), and by development of languages for lexical
knowledge representation, especially DATR. The concepts and methods of
these formalisms are often closely connected to those of logic programming.
In finite-state morphological models, both surface and lexical forms are by
themselves unstructured strings of atomic symbols. In higher-level
approaches, linguistic information is expressed by more appropriate data
structures that can include complex values or can be recursively nested if
needed. Morphological parsing P thus associates linear forms φ with
alternatives of structured content ψ, cf. (1.1):
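The structured content mentioned above can be represented as feature structures, and unification merges partial descriptions when they are compatible. A minimal sketch using NLTK's FeatStruct (assuming NLTK is installed; the feature names are illustrative):

```python
from nltk.featstruct import FeatStruct

# Two partial morphological descriptions as feature structures.
fs_agr  = FeatStruct(number="sg", person=3)
fs_verb = FeatStruct(cat="V", person=3)

# Unification merges compatible information into one structure.
merged = fs_agr.unify(fs_verb)
print(merged)  # a single structure with cat, number, and person

# Conflicting values make unification fail: the result is None.
fs_pl = FeatStruct(number="pl")
print(fs_agr.unify(fs_pl))  # None
```

Unlike string concatenation in finite-state models, these values can be recursively nested, which is what makes unification-based approaches suited to richer linguistic descriptions.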
4. Functional Morphology
This group of morphological models includes not only the ones following the
methodology of functional morphology , but even those related to it, such as
morphological resource grammars of Grammatical Framework . Functional
morphology defines its models using principles of functional programming
and type theory. It treats morphological operations and processes as pure
mathematical functions and organizes the linguistic as well as abstract
elements of a model into distinct types of values and type classes.
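Treating morphology as pure functions can be sketched as follows: a paradigm is a function from a stem and a feature bundle to a surface form. The rules below are a simplified regular-English sample, not a full model:

```python
# Functional-morphology sketch: inflection as a pure function.
def regular_verb(stem):
    """Return the inflection function (the paradigm) for a regular verb."""
    def inflect(features):
        return {
            "pres.3sg":  stem + "s",
            "past":      stem + "ed",
            "pres.part": stem + "ing",
        }.get(features, stem)
    return inflect

play = regular_verb("play")
print(play("past"))       # played
print(play("pres.part"))  # playing
```

Because paradigms are ordinary functions, irregular lexemes can simply be given different functions of the same type, which is the organizing idea behind type-classed functional morphology.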
5. Morphology Induction
We have focused on finding the structure of words in diverse languages
supposing we know what we are looking for. We have not considered the
problem of discovering and inducing word structure without the human
insight (i.e., in an unsupervised or semi-supervised manner). The motivation
for such approaches lies in the fact that for many languages, linguistic
expertise might be unavailable or limited, and implementations adequate to
a purpose may not exist at all. Automated acquisition of morphological and
lexical information, even if not perfect, can be reused for bootstrapping and
improving the classical morphological models, too.
There are several challenging issues about deducing word structure just
from the forms and their context. They are caused by ambiguity [76] and
irregularity [75] in morphology, as well as by orthographic and phonological
alternations [85] and nonlinear morphological processes [86, 87].
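One very simple unsupervised induction idea can be sketched as follows: propose suffix candidates by counting word-final substrings that recur across a word list. The word list below is a tiny illustrative sample, and real induction systems are far more sophisticated:

```python
from collections import Counter

# A small sample corpus of inflected forms.
words = ["played", "playing", "plays", "walked", "walking", "walks",
         "talked", "talking", "talks"]

# Count every word-final substring across the corpus.
suffix_counts = Counter()
for w in words:
    for i in range(1, len(w)):
        suffix_counts[w[i:]] += 1

# Frequent word-final substrings emerge as suffix candidates.
candidates = [s for s, c in suffix_counts.most_common() if c >= 3]
print(candidates)  # includes 'ed', 'ing', and 's', among others
```

Even this crude frequency count recovers the regular English suffixes, though it also proposes spurious candidates, which hints at the ambiguity and irregularity problems cited above.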
import nltk
from nltk.tokenize import sent_tokenize

def get_sentences(text):
    return sent_tokenize(text)

Step 5: Use the sent_tokenize() method for text that contains periods (.)
other than those found at the ends of sentences:

get_sentences("Mr. Donald John Trump is the current "
              "president of the USA. Before joining "
              "politics, he was a businessman.")
Methods
1. Rule-based
Rule-based approaches are the oldest approaches to NLP and have been
proven to work well. Rules applied to text can offer a lot of insight: think of
what you can learn about arbitrary text by finding what words are nouns, or
what verbs end in -ing, or whether a pattern recognizable as Python code can
be identified. Regular expressions and context-free grammars are textbook
examples of rule-based approaches to NLP.
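The -ing example above can be written as a one-line regular-expression rule (the sample sentence is illustrative):

```python
import re

text = "The dog is running and the children are playing outside."

# Rule: match any token ending in -ing (a rough proxy for present participles).
ing_words = re.findall(r"\b\w+ing\b", text)
print(ing_words)  # ['running', 'playing']
```

Such rules are transparent and need no training data, but they also show the limits of the approach: the pattern would happily match non-participles like "ring" or "king".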
Rule-based approaches:
3. Neural Networks
This is similar to "traditional" machine learning, but with a few differences: