
Natural Language Processing

Assignment 2
Due Date: 20th Nov 2020

Hidden Markov Models: Part-of-speech Tagging

1. Introduction

This assignment will make use of the Natural Language Toolkit (NLTK) for Python. NLTK is a
platform for writing programs that process human language data; it provides both corpora and processing modules.
For more information on NLTK, please visit: http://www.nltk.org/.

1.1. Getting Started

This is an individual assignment, and submissions are accepted only through LMS. Late assignments will not be
entertained without a valid reason.

1.2. Submitting Your Assignment

When ready to submit, create a directory called nlp-assign2-<CMS>, where <CMS> is
your CMS ID number, e.g. 1234567. In this directory, put your modified template.py file and your test-results file
(Word or PDF), each renamed with your CMS ID number, e.g. 1234567.py.

Submit your assignment by creating a gzipped tar file from your nlp-assign2-<CMS> directory. You can do
this using the following command on a DICE machine:

tar -cvzf nlp-assign2-<CMS>.tar.gz nlp-assign2-<CMS>

You can check that this file stores the intended data with the following command, which lists all the files
one would get when extracting the original directory (and its files) from this file:

tar -tvf nlp-assign2-<CMS>.tar.gz

Upload the file on LMS. Before submitting your assignment:

• Ensure that your code works on DICE. Your modified template.py should fully execute using
python3.

• Ensure that you include comments in your code where appropriate. This makes it easier for the
markers to understand what you have done and makes it more likely that partial marks can be
awarded.

• Any character limits on open questions will be strictly enforced. Answers will be passed through
an automatic filter that keeps only the first N characters, where N is the character limit given in the
question.

• Important: Whenever you use corpus data in this assignment, you must convert the data to
lowercase, so that e.g. the original tokens “Freedom” and “freedom” are made equal. Do this
throughout the assignment, whether or not it is explicitly stated; a minimal sketch follows this list.
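
For example, a minimal lowercasing sketch (a name like tagged_sents is an illustrative assumption, not part of the assignment template):

# Lowercase every token while keeping its POS tag.
# tagged_sents is an assumed list of sentences of (word, tag) pairs.
lowered = [[(word.lower(), tag) for (word, tag) in sent] for sent in tagged_sents]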
Section A: Training a Hidden Markov Model
In this part of the assignment you will train a Hidden Markov Model (HMM) for part-of-speech
(POS) tagging. You will need to create and train two models: an Emission Model and a Transition
Model, as described in lectures.

Use labelled sentences from the ‘news’ part of the Brown corpus. You can download the dataset using the
instructions given on the NLTK website¹. These sentences are annotated with parts of speech, which you will
convert into the Universal POS tagset (NLTK uses the smaller version of this set defined by Petrov et al.²).
Having a smaller number of labels (states) will make Viterbi decoding faster.
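
As a hedged sketch (not the required template code), the corpus can be loaded, converted to the Universal tagset, and lowercased along these lines:

import nltk
from nltk.corpus import brown

nltk.download('brown')             # the labelled sentences
nltk.download('universal_tagset')  # mapping from Brown tags to Universal POS tags

# 'news' category, tags mapped to the Universal tagset, lowercased throughout.
tagged_sents = [[(w.lower(), t) for (w, t) in sent]
                for sent in brown.tagged_sents(categories='news', tagset='universal')]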

Use the last 500 sentences of the corpus as the test set and the rest for training. This split corresponds
roughly to a 90/10 division. Do not shuffle the data before splitting.
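
A minimal sketch of the split and of estimating the two models with NLTK's probability classes, building on tagged_sents from the sketch above (the Lidstone smoothing value 0.01 and the sentence-boundary markers are assumptions, not requirements):

import nltk
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, LidstoneProbDist

# Last 500 sentences form the test set; the data is not shuffled.
train_sents, test_sents = tagged_sents[:-500], tagged_sents[-500:]

# Emission model: P(word | tag), smoothed with an assumed Lidstone gamma of 0.01.
emission_model = ConditionalProbDist(
    ConditionalFreqDist((t, w) for sent in train_sents for (w, t) in sent),
    LidstoneProbDist, 0.01)

# Transition model: P(tag_i | tag_{i-1}) over tag bigrams, padded with
# assumed sentence-boundary markers <s> and </s>.
transition_model = ConditionalProbDist(
    ConditionalFreqDist(
        bigram
        for sent in train_sents
        for bigram in nltk.bigrams(['<s>'] + [t for (_, t) in sent] + ['</s>'])),
    LidstoneProbDist, 0.01)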

Report the actual and predicted POS tags for the last 500 sentences in a table of the following form:

Sr. No | Sentence | Actual Tags | Predicted Tags
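
A sketch of how such a table could be printed, where viterbi_tag is a hypothetical stand-in for your own Viterbi decoder (NLTK does not provide it under this name):

for i, sent in enumerate(test_sents, start=1):
    words = [w for (w, _) in sent]
    actual = [t for (_, t) in sent]
    predicted = viterbi_tag(words)  # hypothetical: returns one predicted tag per word
    print(i, ' '.join(words), ' '.join(actual), ' '.join(predicted), sep=' | ')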

Note: Submit both your code and the file with your results.

Failure to follow these instructions exactly will render most of your answers incorrect.

¹ https://www.nltk.org/book/ch02.html

² https://github.com/slavpetrov/universal-pos-tags
