NLP Report - Modified
Compilers
by
Jayesh K. Suryavanshi
(B150134323)
Professor
Department of Computer Engineering, K.K.W.I.E.E.R
Nashik-422001
TABLE OF CONTENTS
1. INTRODUCTION
INTRODUCTION
Part-of-Speech (POS) tagging is one of the first steps in the development of most NLP applications.
It is the task of assigning a POS label to each word in a given text. For this reason,
researchers treat it as a sequence labeling task, in which a sentence is a sequence of
words and each word must be labeled. Each word’s tag is identified in context using
the preceding word/tag combination. POS tagging is used in applications such as
parsing, where words and their tags are grouped into chunks that can be combined to
generate the complete parse of a text.
Taggers are used in Machine Translation (MT) when developing a transfer-based MT
engine. Here, the text in the source language must be POS tagged and then parsed;
the parse can then be transferred to the target side using a transfer grammar. Taggers are also
used in Named Entity Recognition (NER), where a word tagged as a noun (either proper or
common) is further classified as the name of a person, organization, location, time, date,
etc.
Tagging text is a complex task because many words take different tag
categories depending on the context in which they are used. This phenomenon is termed lexical ambiguity.
For example, consider the text in Table 1. The same word ‘सोना’ is given a different label in the
two sentences. In the first case it is tagged as a common noun, as it refers to an object (gold
ornament). In the second case it is tagged as a verb, as it refers to an experience (feelings)
of the speaker. This problem can be resolved by looking at the word/tag combinations of the
words surrounding the ambiguous word (the word that has multiple possible tags).
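The idea of resolving an ambiguous word from its neighbouring tags can be sketched in Python as follows. This is only a toy illustration: the candidate tag list, the tag name VM (assumed here to denote a main verb), the helper name pick_tag, and every score in the table are invented for the example and are not estimates from any corpus.

```python
# Toy illustration: choose the tag of an ambiguous word from the previous tag.
# All names and scores below are hypothetical, invented for this example.

# Candidate tags for the ambiguous word: common noun (NN) or main verb (VM).
CANDIDATE_TAGS = {"सोना": ["NN", "VM"]}

# Hypothetical scores for P(tag | previous tag); NOT measured from a corpus.
TRANSITION = {
    ("JJ", "NN"): 0.60,
    ("JJ", "VM"): 0.05,
    ("PSP", "NN"): 0.30,
    ("PSP", "VM"): 0.40,
}


def pick_tag(word, prev_tag):
    """Return the candidate tag that scores highest after prev_tag."""
    candidates = CANDIDATE_TAGS.get(word, ["NN"])  # fall back to NN if unknown
    return max(candidates, key=lambda tag: TRANSITION.get((prev_tag, tag), 0.0))


print(pick_tag("सोना", "JJ"))   # after an adjective
print(pick_tag("सोना", "PSP"))  # after a postposition
```

With these invented scores, the same word receives a different tag depending on the tag that precedes it, which is the behaviour described above. A full tagger would also take the word itself and the following context into account.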
Over the years, a lot of research has been done on POS tagging. Broadly, the efforts can
be categorized into three approaches: the rule-based approach, where a human annotator
is required to develop rules for tagging words; the statistical approach, where
mathematical formulations are used to tag words; and the hybrid approach, which is partially rule-based
and partially statistical. For European languages, POS taggers are generally
developed using machine learning approaches, but for Indian languages there is still no
clearly preferred approach. In this paper we discuss the development of a POS tagger for Hindi
using a Hidden Markov Model (HMM).
IMPLEMENTATION & IMPORTANT MODULES
In this project we perform POS tagging of Hindi sentences, and we have implemented it in Python.
To develop an HMM-based tagger, we first needed a corpus annotated according to a
tagset.
Modules
Downloading dataset
Our source code first downloads a Hindi dataset containing numerous sentences in
Hindi.
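A minimal sketch of this step is given below. The report does not name the dataset, so the sketch assumes, purely for illustration, the Hindi portion of NLTK's "indian" corpus (file "hindi.pos"); the function names load_hindi_tagged_sentences and corpus_summary are hypothetical and are not taken from the project's source code.

```python
def load_hindi_tagged_sentences():
    """Download (if needed) and return POS-tagged Hindi sentences.

    Assumption: the NLTK 'indian' corpus stands in for the project's dataset.
    NLTK is imported lazily so the helpers below work without it installed.
    """
    import nltk
    from nltk.corpus import indian

    nltk.download("indian", quiet=True)      # fetches the corpus on first use
    return indian.tagged_sents("hindi.pos")  # list of [(word, tag), ...]


def corpus_summary(tagged_sentences):
    """Count sentences and tokens, and collect the tagset, of a tagged corpus."""
    sentences = list(tagged_sentences)
    tagset = {tag for sentence in sentences for _, tag in sentence}
    n_tokens = sum(len(sentence) for sentence in sentences)
    return {"sentences": len(sentences), "tokens": n_tokens, "tags": sorted(tagset)}


# Example usage (requires NLTK and network access on the first run):
#   summary = corpus_summary(load_hindi_tagged_sentences())
#   print(summary["sentences"], summary["tokens"], summary["tags"])
```

Each sentence is returned as a list of (word, tag) pairs, which is a convenient input format for counting tag frequencies later on.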
Here $P(t_i \mid t_{i-1})$ is the probability of a current tag given the previous tag and
$P(t_{i+1} \mid t_i)$ is the probability of the future tag given the current tag. This captures the
transition between the tags.
Each tag transition probability is computed as the frequency count of the two tags
seen together in the corpus divided by the frequency count of the previous tag seen
independently in the corpus. This is done because some tags are more likely to
precede certain other tags. For example, an adjective (JJ) is more likely to be followed by a common
noun (NN) than by a postposition (PSP) or a pronoun (PRP). Figure 1 shows this
example.
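The counting procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the project's actual code: the function name transition_probabilities, the input format (a list of sentences, each a list of (word, tag) pairs), and the toy corpus with placeholder words are assumptions made for the example.

```python
from collections import Counter


def transition_probabilities(tagged_sentences):
    """Estimate P(tag | previous tag) from a tagged corpus.

    Follows the description in the text: count of the two tags seen together,
    divided by the count of the previous tag seen independently in the corpus.
    """
    pair_counts = Counter()  # how often (previous tag, tag) occur together
    tag_counts = Counter()   # how often each tag occurs anywhere in the corpus
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        tag_counts.update(tags)
        pair_counts.update(zip(tags, tags[1:]))
    return {
        (prev_tag, tag): count / tag_counts[prev_tag]
        for (prev_tag, tag), count in pair_counts.items()
    }


# Toy corpus with placeholder words (hypothetical data, for illustration only).
toy_corpus = [
    [("w1", "JJ"), ("w2", "NN"), ("w3", "PSP")],
    [("w4", "JJ"), ("w5", "NN")],
    [("w6", "PRP"), ("w7", "JJ"), ("w8", "NN")],
]
probs = transition_probabilities(toy_corpus)
print(probs[("JJ", "NN")])       # JJ is always followed by NN in the toy corpus
print(("JJ", "PSP") in probs)    # JJ is never followed by PSP in the toy corpus
```

Because the denominator counts every occurrence of the previous tag, including sentence-final ones, the probabilities for a given previous tag need not sum to one; counting only non-final occurrences is a common alternative.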
POS Tags for Hindi sentences