NLP Assignment 2
Due Date: 20th Nov 2020
1. Introduction
This assignment will make use of the Natural Language Toolkit (NLTK) for Python. NLTK is a
platform for building programs that process human language data, and it provides both corpora and
processing modules. For more information on NLTK, please visit: http://www.nltk.org/.
This is an individual assignment, and submission is only through the LMS. Late assignments will not
be accepted without a valid reason.
When ready to submit, create a directory called nlp-assign2-<CMS>, where <CMS> is your CMS ID
number, e.g. 1234567. In this directory, put your template.py file renamed with your CMS ID number
(e.g. 1234567.py) together with your test results as a Word or PDF file.
Submit your assignment by creating a gzipped tar file from your nlp-assign2-<CMS> directory. You can
do this using the following command on a DICE machine:
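For example (the archive name here is an illustrative choice that matches the directory name):

    tar czf nlp-assign2-<CMS>.tar.gz nlp-assign2-<CMS>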
You can check that the archive stores the intended data with the following command, which lists all the
files one would get when extracting the original directory (and its files) from it:
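For example, for the archive created above:

    tar tzf nlp-assign2-<CMS>.tar.gz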
Ensure that your code works on DICE. Your modified template.py should fully execute using
python3.
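For example, from inside your submission directory:

    python3 template.py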
Ensure that you include comments in your code where appropriate. This makes it easier for the
markers to understand what you have done and makes it more likely that partial marks can be
awarded.
Any character limits to open questions will be strictly enforced. Answers will be passed through
an automatic filter that only keeps the first N characters, where N is the character limit given in a
question.
Important: Whenever you use corpus data in this assignment, you must convert the data to
lowercase, so that e.g. the original tokens “Freedom” and “freedom” are made equal. Do this
throughout the assignment, whether it’s explicitly stated or not.
Section A: Training a Hidden Markov Model
In this part of the assignment you will train a Hidden Markov Model (HMM) for part-of-speech
(POS) tagging. You will need to create and train two models: an Emission Model and a Transition
Model, as described in lectures.
Use the labelled sentences from the 'news' part of the Brown corpus. You can download the dataset
using the instructions given on the NLTK website [1]. These sentences are annotated with parts of
speech, which you will convert into the Universal POS tagset (NLTK uses the smaller version of this
tagset defined by Petrov et al. [2]). Having a smaller number of labels (states) will make Viterbi
decoding faster.
Use the last 500 sentences from the corpus as the test set and the rest for training. This split corresponds
roughly to a 90/10% division. Do not shuffle the data before splitting.
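To make the data handling concrete, here is a minimal sketch of one possible way to prepare the data
and build the two models with NLTK's probability classes. The Lidstone smoothing value and the
'<s>'/'</s>' sentence-boundary tags are illustrative assumptions, not requirements of the assignment:

    # Sketch only: load the Brown 'news' sentences with the Universal tagset,
    # lowercase every token, split off the last 500 sentences as the test set,
    # and build the Emission and Transition models from the training portion.
    import nltk
    from nltk.corpus import brown
    from nltk.probability import (ConditionalFreqDist, ConditionalProbDist,
                                  LidstoneProbDist)

    nltk.download('brown')             # corpus data, if not already installed
    nltk.download('universal_tagset')  # mapping used by tagset='universal'

    tagged = [
        [(word.lower(), tag) for word, tag in sent]
        for sent in brown.tagged_sents(categories='news', tagset='universal')
    ]

    # The last 500 sentences form the test set; the data is not shuffled.
    train_sents, test_sents = tagged[:-500], tagged[-500:]

    def lidstone(fd):
        # Add-0.01 smoothing with one extra bin for unseen events; both the
        # gamma and the bin count are illustrative choices.
        return LidstoneProbDist(fd, 0.01, fd.B() + 1)

    # Emission model: P(word | tag).
    emission_model = ConditionalProbDist(
        ConditionalFreqDist((tag, word)
                            for sent in train_sents
                            for word, tag in sent),
        lidstone)

    # Transition model: P(tag_i | tag_{i-1}), with assumed sentence-boundary
    # markers '<s>' and '</s>' padding each tag sequence.
    transition_model = ConditionalProbDist(
        ConditionalFreqDist(bigram
                            for sent in train_sents
                            for bigram in nltk.bigrams(
                                ['<s>'] + [tag for _, tag in sent] + ['</s>'])),
        lidstone)

    print(emission_model['NOUN'].prob('freedom'))  # e.g. P(freedom | NOUN)
    print(transition_model['<s>'].prob('DET'))     # e.g. P(DET | <s>)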
Give the results for the last 500 sentences in the form of a table, as shown below:
Failure to follow these instructions exactly will render most of your answers incorrect.
[1] https://www.nltk.org/book/ch02.html
[2] https://github.com/slavpetrov/universal-pos-tags