NLP Report - Modified

The document discusses the development of a part-of-speech tagger for Hindi language using Hidden Markov Model. It describes downloading a Hindi dataset and preprocessing it which includes stripping words from sentences. It also discusses training the POS tagger on the dataset and using it to tag new Hindi text.


K.K. Wagh Institute of Engineering Education and Research


Department of Computer Engineering

A Mini Project Report on

“Hindi POS-Machine Translation”

Submitted in partial fulfillment of the subject

Compilers
by

Jayesh K. Suryavanshi
(B150134323)

Under the guidance of

Prof. Smita T. Patil

Professor
Department of Computer Engineering, K.K.W.I.E.E.R
Nashik-422001
TABLE OF CONTENTS

1. INTRODUCTION

2. IMPLEMENTATION AND IMPORTANT MODULES

INTRODUCTION

Part-of-speech (POS) tagging is one of the first steps in building most NLP applications.
It assigns a POS label to each word of the input text. Researchers therefore treat it as a
sequence labeling task, in which a sentence is a sequence of words and each word must be
labeled. The tag of each word is identified in context, using the preceding word/tag
combination. POS tagging feeds applications such as parsing, where words and their tags
are grouped into chunks that are then combined to produce the complete parse of a text.

Taggers are used in Machine Translation (MT) when developing a transfer-based MT
engine. Here, the source-language text must be POS tagged and then parsed, so that it can
be transferred to the target side using a transfer grammar. Taggers are also used in
Named Entity Recognition (NER), where a word tagged as a noun (either proper or common)
is further classified as the name of a person, organization, location, time, date, etc.

Tagging text is a complex task because the same word often takes different tags in
different contexts. This phenomenon is termed lexical ambiguity. For example, consider the
text in Table 1. The same word ‘सोना’ is given a different label in the two sentences. In the
first sentence it is tagged as a common noun, as it refers to an object (gold ornament). In
the second it is tagged as a verb (‘to sleep’), as it refers to an experience (feelings) of
the speaker. This ambiguity can be resolved by looking at the word/tag combinations of the
words surrounding the ambiguous word (the word which has multiple possible tags).

Over the years, a lot of research has been done on POS tagging. Broadly, the efforts fall
into three categories: the rule-based approach, where a human annotator develops rules
for tagging words; the statistical approach, where mathematical formulations are used to
tag words; and the hybrid approach, which is partly rule based and partly statistical. For
European languages, POS taggers are generally developed using machine learning, but for
Indian languages there is still no clearly preferred approach. In this report we discuss
the development of a POS tagger for Hindi using a Hidden Markov Model (HMM).
IMPLEMENTATION & IMPORTANT MODULES

In this project we perform POS tagging of Hindi sentences, implemented in Python.
Developing an HMM-based tagger first requires a corpus annotated according to a tagset.

Modules

Downloading dataset
Using our source code, we first download a Hindi dataset containing a large number of
Hindi sentences.

Preprocessing the downloaded dataset


The next step is to preprocess the downloaded corpus so that operations can be performed
on it. This is done by selecting each individual sentence in the dataset that we want to
POS tag.

Stripping words in sentence


Following the previous step, we split each sentence into its words so that each word can be
processed separately. For example, we take a sentence, split it, and then process each
constituent word to tag it with the most likely POS tag.
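
The two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `split_sentences` and `tokenize`, the sample text, and the choice to split sentences on the danda (।) are all assumptions made for the example.

```python
import re

def split_sentences(text):
    """Split raw Hindi text into sentences on the danda, '?' or '!'."""
    parts = re.split(r"[।?!]", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    """Split a sentence on whitespace and drop surrounding punctuation."""
    tokens = [tok.strip(",;:\"'()") for tok in sentence.split()]
    return [tok for tok in tokens if tok]

# Hand-made sample text (an assumption, not taken from the project's dataset).
raw = "राम घर जाता है। सोना महंगा है। मुझे सोना है।"

sentences = split_sentences(raw)
tokens = [tokenize(s) for s in sentences]

for words in tokens:
    print(words)
```

Whitespace splitting is the simplest possible choice here; a real corpus may need extra handling for punctuation attached to words.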

Training POS tagger


The POS tagger is trained on the results obtained in this way for every sentence in the
corpus. Once trained, the tagger is ready to tag any Hindi sentence.
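
As a rough sketch of what training can involve, the snippet below estimates word-given-tag (emission) probabilities by relative frequency from a tiny tagged corpus. It is an assumption-laden illustration: the report does not show its training code, the three tagged sentences and their tags are hand-made, and `train_emissions` is a hypothetical name. It covers only the emission side of an HMM.

```python
from collections import Counter, defaultdict

# Hand-made toy corpus; the words and tags are illustrative assumptions.
tagged_sentences = [
    [("राम", "NNP"), ("घर", "NN"), ("जाता", "VM"), ("है", "VAUX")],
    [("सोना", "NN"), ("महंगा", "JJ"), ("है", "VM")],
    [("मुझे", "PRP"), ("सोना", "VM"), ("है", "VAUX")],
]

def train_emissions(sentences):
    """Estimate P(word | tag) as count(word, tag) / count(tag)."""
    tag_counts = Counter()
    word_counts_by_tag = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            tag_counts[tag] += 1
            word_counts_by_tag[tag][word] += 1
    return {
        tag: {word: n / tag_counts[tag] for word, n in words.items()}
        for tag, words in word_counts_by_tag.items()
    }

emission = train_emissions(tagged_sentences)

# The ambiguous word appears under two tags with different probabilities.
print(emission["NN"]["सोना"], emission["VM"]["सोना"])
```

In this toy corpus the ambiguous word ‘सोना’ receives probability mass under both NN and VM, which is why context is needed to choose between them.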

Tagging new line


Finally, once the POS tagger is trained, it can be used to tag new Hindi lines supplied by
the user.
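
Tagging a new line with an HMM is commonly done with Viterbi decoding, and the sketch below shows that technique on a toy model. The report does not state which decoding method it uses, so treat this as one plausible option rather than the project's method. The probability tables are invented for illustration, and the `floor` value used for unseen events is a crude smoothing assumption.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    def lp(table, key):
        # Log-probability, with a small floor for events missing from the table.
        return math.log(table.get(key, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at position i, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (words[0], t)), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p, (p, t)), p) for p in tags
            )
            column[t] = (score + lp(emit_p, (words[i], t)), prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

# Toy model: every number below is a made-up assumption.
tags = ["NN", "PRP", "VM", "VAUX", "JJ"]
start_p = {"NN": 0.5, "PRP": 0.5}
trans_p = {("NN", "JJ"): 0.5, ("NN", "VM"): 0.4, ("JJ", "VM"): 0.9,
           ("PRP", "VM"): 0.8, ("PRP", "NN"): 0.1, ("VM", "VAUX"): 0.9}
emit_p = {("सोना", "NN"): 0.5, ("सोना", "VM"): 0.3, ("महंगा", "JJ"): 1.0,
          ("है", "VM"): 0.3, ("है", "VAUX"): 1.0, ("मुझे", "PRP"): 1.0}

first = viterbi("सोना महंगा है".split(), tags, start_p, trans_p, emit_p)
second = viterbi("मुझे सोना है".split(), tags, start_p, trans_p, emit_p)
print(first)   # ['NN', 'JJ', 'VM']
print(second)  # ['PRP', 'VM', 'VAUX']
```

With these toy numbers the same word ‘सोना’ is tagged NN in the first sentence and VM in the second, because the surrounding tags make different paths more probable.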
Implementing Concepts
A POS tagger based on HMM assigns the best tag to a word by calculating the forward and
backward probabilities of tags along with the sequence provided as input. The following
equation explains this phenomenon.

Here $P(t_i \mid t_{i-1})$ is the probability of the current tag given the previous tag and
$P(t_{i+1} \mid t_i)$ is the probability of the future tag given the current tag. This
captures the transition between the tags.
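
As a hedged sketch only, one equation consistent with the two transition terms described above can be written as follows. The emission term $P(w_i \mid t_i)$, the symbol $w_i$ for the current word, and the proportionality sign are assumptions based on the usual HMM formulation; this section does not define them.

```latex
P(t_i \mid w_i) \;\propto\; P(t_i \mid t_{i-1}) \cdot P(t_{i+1} \mid t_i) \cdot P(w_i \mid t_i)
```

Under this reading, the first two factors score how well the tag $t_i$ fits between its neighbouring tags, and the third scores how well it fits the word itself.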

These probabilities are computed using equation 2.

$$P(t_i \mid t_{i-1}) = \frac{\mathrm{freq}(t_{i-1}, t_i)}{\mathrm{freq}(t_{i-1})} \qquad (2)$$

Each tag transition probability is computed as the frequency count of the two tags seen
together in the corpus divided by the frequency count of the previous tag seen
independently in the corpus. This is done because some tags are more likely than others
to precede a given tag. For example, an adjective (JJ) is more likely to be followed by a
common noun (NN) than by a postposition (PSP) or a pronoun (PRP). Figure 1 shows this
example.
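
The count-based estimate described above can be illustrated with a short sketch. The tag sequences are hand-made and `transition_probabilities` is a hypothetical name. Following the description, the denominator counts every occurrence of the previous tag, including sentence-final ones.

```python
from collections import Counter

# Hand-made toy tag sequences (assumptions, not taken from the project's corpus).
tag_sequences = [
    ["JJ", "NN", "VM"],
    ["PRP", "JJ", "NN", "PSP", "VM"],
    ["NN", "PSP", "JJ", "NN", "VM", "VAUX"],
]

def transition_probabilities(sequences):
    """P(cur | prev) = count(prev, cur) / count(prev)."""
    bigram_counts = Counter()
    tag_counts = Counter()
    for tags in sequences:
        tag_counts.update(tags)                    # every occurrence of each tag
        bigram_counts.update(zip(tags, tags[1:]))  # adjacent (prev, cur) pairs
    return {
        (prev, cur): n / tag_counts[prev]
        for (prev, cur), n in bigram_counts.items()
    }

trans = transition_probabilities(tag_sequences)

print(trans[("JJ", "NN")])            # JJ followed by NN
print(trans.get(("JJ", "PSP"), 0.0))  # JJ followed by PSP, never seen here
```

In this toy data every adjective is followed by a common noun, so the JJ to NN transition gets probability 1.0 while JJ to PSP gets 0.0. A real tagger would normally smooth such zero counts.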
POS Tags for Hindi sentences
