GloVe
In this assignment, you will implement:
Text processing methods for transforming raw text data into input vectors for your network (2)
The Stochastic Gradient Descent (SGD) algorithm with back-propagation to learn the weights of your neural network.
Your algorithm should:
Use (and minimise) the Categorical Cross-entropy loss function (2)
Perform a Forward pass to compute intermediate outputs (5)
Perform a Backward pass to compute gradients and update all sets of weights (12)
Implement and use Dropout after each hidden layer for regularisation (4)
Discuss how you chose the hyperparameters. You can tune the learning rate (hint: choose small values), the embedding
size {e.g. 50, 300, 500} and the dropout rate {e.g. 0.2, 0.5}. Please use tables or graphs to show training
and validation performance for each hyperparameter combination (5).
After training a model, plot the learning process (i.e. training and validation loss in each epoch) using a line plot and
report accuracy. Does your model overfit, underfit, or is it about right? (2).
Re-train your network using pre-trained embeddings (GloVe) trained on large corpora. Instead of randomly initialising
the embedding weights matrix, you should initialise it with the pre-trained weights. During training, you should not
update them (i.e. weight freezing) and backprop should stop before computing gradients for updating the embedding
weights. Report results by performing hyperparameter tuning and plotting the learning process. Do you get better
performance? (7).
Extend your feedforward network by adding more hidden layers (e.g. one or two more). How does this affect the
performance? Note: you need to repeat the hyperparameter tuning, but the number of combinations grows exponentially.
Therefore, you need to choose a subset of all possible combinations (8).
Provide well-documented and commented code describing all of your choices. In general, you are free to make decisions
about text processing (e.g. punctuation, numbers, vocabulary size) and hyperparameter values. Provide justifications and
discussion for all of your choices (5).
Data
The data you will use for the task is a subset of the AG News Corpus and you can find it in the ./data_topic folder in CSV
format:
data_topic/train.csv : contains 2,400 news articles, 800 for each class to be used for training.
data_topic/dev.csv : contains 150 news articles, 50 for each class to be used for hyperparameter selection and
monitoring the training process.
data_topic/test.csv : contains 900 news articles, 300 for each class to be used for testing.
Pre-trained Embeddings
You can download pre-trained GloVe embeddings trained on Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors,
2.03 GB download) from here. No need to unzip, the file is large.
Save Memory
To save RAM, when you finish each experiment you can delete the weights of your network using del W, followed by a
call to Python's garbage collector, gc.collect().
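For example, once a model has been evaluated:

# free the network weights between experiments and reclaim the memory
del W
gc.collect()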
Instructions
You are advised to follow the code structure given in this notebook by completing all given functions. You can also write any
auxiliary/helper functions (and arguments for the functions) that you might need, but note that you can provide a full
solution without any such functions. Similarly, you can just use the packages imported below, but you are free to use any
functionality from the Python Standard Library, NumPy, SciPy (excluding built-in softmax functions) and Pandas. You are not
allowed to use any third-party library such as Scikit-learn (apart from the metric functions already provided), NLTK, spaCy,
Keras, PyTorch etc. You should mention if you have used Windows to write and test your code, because we mostly use Unix-based
machines for marking (e.g. Ubuntu, MacOS).
There is no single correct answer on what your accuracy should be, but correct implementations usually achieve F1-scores
around 80% or higher. The quality of the analysis of the results is as important as the accuracy itself.
import pandas as pd
import numpy as np
from collections import Counter
import re
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random
from time import localtime, strftime
from scipy.stats import spearmanr,pearsonr
import zipfile
import gc
# the CSV files have no header row; the two columns are: label, text
dev = pd.read_csv("data_topic/dev.csv", names=['label', 'text'])
train = pd.read_csv("data_topic/train.csv", names=['label', 'text'])
test = pd.read_csv("data_topic/test.csv", names=['label', 'text'])
dev.head()
(output: the first rows of dev, with columns label and text)
tokenise all texts into a list of unigrams (tip: you can re-use the functions from Assignment 1)
remove stop words (using the one provided or one of your preference)
remove unigrams appearing in fewer than K documents
use the remaining to create a vocabulary of the top-N most frequent unigrams in the entire corpus.
stop_words = ['a','in','on','at','and','or',
'to', 'the', 'of', 'an', 'by',
'as', 'is', 'was', 'were', 'been', 'be',
'are','for', 'this', 'that', 'these', 'those', 'you', 'i', 'if',
'it', 'he', 'she', 'we', 'they', 'will', 'have', 'has',
'do', 'did', 'can', 'could', 'who', 'which', 'what',
'but', 'not', 'there', 'no', 'does', 'so', 've', 'their',
'his', 'her', 'them', 'from', 'with', 'its']
You can re-use an n-gram extraction function from Assignment 1 (called extract_ngrams here) that takes as input the raw text,
an n-gram range, a token pattern, a list of stop words and (optionally) a vocabulary, and returns a list of the extracted tokens:
def extract_ngrams(x_raw, ngram_range=(1,1), token_pattern=r'\b[A-Za-z][A-Za-z]+\b',
                   stop_words=[], vocab=set()):
    tokenRE = re.compile(token_pattern)
    # extract unigrams, lower-casing the text and removing stop words
    x_uni = [w for w in tokenRE.findall(str(x_raw).lower()) if w not in stop_words]
    x = list(x_uni) if ngram_range[0]==1 else []
    # build n-grams (n > 1) from consecutive unigrams
    ngrams = []
    for n in range(ngram_range[0], ngram_range[1]+1):
        # ignore unigrams, they are already in x
        if n==1: continue
        ngrams.append(list(zip(*[x_uni[i:] for i in range(n)])))
    for n in ngrams:
        for t in n:
            x.append(t)
    # keep only tokens that are in the vocabulary, if one is given
    if len(vocab)>0:
        x = [w for w in x if w in vocab]
    return x
Then the get_vocab function will be used to: (1) create a vocabulary of ngrams; (2) count the document frequencies of
ngrams; and (3) count their raw frequencies. It takes as input the raw texts and the text-processing arguments (n-gram range,
token pattern, stop words, the minimum number of documents K an ngram must appear in, and the top-N cut-off), and returns
the vocabulary vocab together with the document-frequency counter df and the raw-frequency counter ngram_counts:
tokenRE = re.compile(token_pattern)
df = Counter()
ngram_counts = Counter()
vocab = set()
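A minimal sketch of how this could be completed, assuming the n-gram extraction function above (the min_df and keep_topN argument names are illustrative):

def get_vocab(X_raw, ngram_range=(1,1), token_pattern=r'\b[A-Za-z][A-Za-z]+\b',
              min_df=0, keep_topN=0, stop_words=[]):
    df = Counter()
    ngram_counts = Counter()
    for x in X_raw:
        x_ngrams = extract_ngrams(x, ngram_range, token_pattern, stop_words)
        df.update(set(x_ngrams))       # document frequency: each ngram counted once per document
        ngram_counts.update(x_ngrams)  # raw frequency over the whole corpus
    # keep the top-N most frequent ngrams that appear in at least min_df documents
    candidates = [t for t, _ in ngram_counts.most_common() if df[t] >= min_df]
    vocab = set(candidates[:keep_topN] if keep_topN > 0 else candidates)
    return vocab, df, ngram_counts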
Now you should use get_vocab to create your vocabulary and get document and raw frequencies of unigrams:
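For example (the cut-off values here are only illustrative, and the argument names follow the sketch above):

vocab, df, ngram_counts = get_vocab(train['text'], ngram_range=(1,1),
                                    min_df=2, keep_topN=10000,
                                    stop_words=stop_words)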
print(type(vocab))
<class 'set'>
len(vocab)
9914
Then, you need to create vocabulary id -> word and word -> vocabulary id dictionaries for reference:
# id -> word and word -> id mappings (word2id is used later for the GloVe embeddings)
id2word = {i: w for i, w in enumerate(vocab)}
word2id = {w: i for i, w in enumerate(vocab)}
First, represent documents in train, dev and test sets as lists of words in the vocabulary:
raw_text = dev['text'].tolist()  # raw dev texts
len(raw_text)
150
token_pattern1 = r'\b[A-Za-z][A-Za-z]+\b'
tokenRE1 = re.compile(token_pattern1)
dev_words = []
for x in raw_text:
    x_t = [w for w in tokenRE1.findall(str(x).lower()) if w in vocab]
    dev_words.append(x_t)
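To feed documents into the network you will also need them as lists of vocabulary ids, shown here for the dev set (the train and test sets are processed the same way; below they are assumed to be named X_tr_ids and X_te_ids):

# map each in-vocabulary word to its id
X_dev_ids = [[word2id[w] for w in doc] for doc in dev_words]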
Put the labels Y for train, dev and test sets into arrays:
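A minimal sketch, assuming the label column holds integer class ids starting from 1:

# shift labels to 0-based class indices for the softmax output layer
Y_tr = train['label'].values - 1
Y_dev = dev['label'].values - 1
Y_te = test['label'].values - 1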
Network Architecture
Your network should pass each word index into its corresponding embedding by looking it up in the embedding matrix, and
then compute the first hidden layer $h_1$:

$$h_1 = \frac{1}{|x|} \sum_{i \in x} W^e_i$$

where $|x|$ is the number of words in the document and $W^e$ is a $|V| \times d$ embedding matrix, with $|V|$ the size of the vocabulary
and $d$ the embedding size.

$$a_1 = \mathrm{relu}(h_1)$$

$$y = \mathrm{softmax}(a_1 W)$$
During training, a1 should be multiplied with a dropout mask vector (elementwise) for regularisation before it is passed to the
output layer.
You can extend to a deeper architecture by passing a hidden layer to another one:
$$h_i = a_{i-1} W_i$$

$$a_i = \mathrm{relu}(h_i)$$
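For example, a sketch of the first hidden layer, assuming x holds the word ids of one document and W[0] is the embedding matrix:

h1 = np.mean(W[0][x], axis=0)   # average of the embedding rows of the words in x
a1 = np.maximum(h1, 0)          # relu; multiplied element-wise by a dropout mask during training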
Network Training
First, we need to define the parameters of our network by initialising the weight matrices. For that purpose, you should
implement the network_weights function that takes as input the vocabulary size, the embedding size, a list of hidden-layer
sizes and the number of classes, and returns:
W : a dictionary mapping from layer index (e.g. 0 for the embedding matrix) to the corresponding weight matrix,
initialised with small random numbers (hint: use numpy.random.uniform with values from -0.1 to 0.1)
Make sure that the dimensionality of each weight matrix is compatible with the previous and next weight matrix, otherwise
you won't be able to perform forward and backward passes. Consider also using np.float32 precision to save memory.
return W
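A minimal sketch of this initialisation, assuming the layer sizes are chained as vocabulary → embedding → hidden layer(s) → classes (the init_val argument is illustrative):

def network_weights(vocab_size, embedding_dim, hidden_dim=[], num_classes=3, init_val=0.1):
    W = {}
    dims = [vocab_size, embedding_dim] + list(hidden_dim) + [num_classes]
    for i in range(len(dims) - 1):
        # uniform initialisation in [-init_val, init_val], stored as float32 to save memory
        W[i] = np.random.uniform(-init_val, init_val,
                                 (dims[i], dims[i + 1])).astype(np.float32)
    return W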
W = network_weights(vocab_size=3,embedding_dim=4,hidden_dim=[2], num_classes=2)
Then you need to develop a softmax function (same as in Assignment 1) to be used in the output layer.
It takes as input z (array of real numbers) and returns sig (the softmax of z )
def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    sig = np.exp(z) / np.sum(np.exp(z))
    return sig
Now you need to implement the categorical cross entropy loss by slightly modifying the function from Assignment 1 to
depend only on the true label y and the class probabilities vector y_preds :
return l
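A minimal sketch (the function name is illustrative), assuming y is the index of the true class and y_preds is the probability vector returned by softmax:

def categorical_loss(y, y_preds):
    l = -np.log(y_preds[y])   # negative log-probability assigned to the true class
    return l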
Then, implement the relu function to introduce non-linearity after each hidden layer of your network (during the forward
pass):
$$\mathrm{relu}(z_i) = \max(z_i, 0)$$
and the relu_derivative function to compute its derivative (used in the backward pass):
def relu(z):
    a = np.maximum(z, 0)
    return a

def relu_derivative(z):
    dz = (z > 0).astype(np.float32)  # 1 where z > 0, else 0
    return dz
During training you should also apply a dropout mask element-wise after the activation function (i.e. vector of ones with a
random percentage set to zero). The dropout_mask function takes as input:
size : the size of the mask vector (the first argument in the example below)
dropout_rate : the percentage of elements that will be randomly set to zero
and returns:
dropout_vec : the mask vector (ones, with a dropout_rate fraction of entries set to zero)
return dropout_vec
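One possible implementation, consistent with the example output below:

def dropout_mask(size, dropout_rate):
    # vector of ones with a random dropout_rate fraction of entries set to zero
    dropout_vec = np.ones(size, dtype=np.float32)
    drop_ids = np.random.choice(size, int(size * dropout_rate), replace=False)
    dropout_vec[drop_ids] = 0.0
    return dropout_vec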
print(dropout_mask(10, 0.2))
print(dropout_mask(10, 0.2))
[1. 1. 0. 1. 1. 1. 1. 1. 0. 1.]
[1. 1. 1. 1. 0. 1. 1. 0. 1. 1.]
Now you need to implement the forward_pass function that passes the input x through the network up to the output
layer to compute the probability of each class, using the weight matrices in W and applying a dropout mask after each
hidden layer during training. The ReLU activation function should be applied on each hidden layer. It returns:
out_vals : a dictionary of output values from each layer: h (the vector before the activation function), a (the resulting
vector after passing h from the activation function), its dropout mask vector; and the prediction vector (probability for
each class) from the output layer.
out_vals = {}
h_vecs = []
a_vecs = []
dropout_vecs = []
return out_vals
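A minimal sketch of such a forward pass, assuming the architecture described above; x is a list of word ids, W[0] is the embedding matrix, and the dictionary keys used here are illustrative:

def forward_pass(x, W, dropout_rate=0.0):
    out_vals = {}
    h_vecs, a_vecs, dropout_vecs = [], [], []
    # first hidden layer: average of the embedding rows of the words in x
    h = np.mean(W[0][x], axis=0)
    a = relu(h)
    mask = dropout_mask(a.shape[0], dropout_rate)
    h_vecs.append(h); a_vecs.append(a * mask); dropout_vecs.append(mask)
    # any further hidden layers
    for i in range(1, len(W) - 1):
        h = a_vecs[-1].dot(W[i])
        a = relu(h)
        mask = dropout_mask(a.shape[0], dropout_rate)
        h_vecs.append(h); a_vecs.append(a * mask); dropout_vecs.append(mask)
    out_vals['h'] = h_vecs
    out_vals['a'] = a_vecs
    out_vals['dropout_vecs'] = dropout_vecs
    # output layer: class probabilities
    out_vals['y'] = softmax(a_vecs[-1].dot(W[len(W) - 1]))
    return out_vals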
The backward_pass function computes the gradients and updates the weights of each matrix in the network, from the
output back to the input. It takes as input the input document x, its true label y, the weights W, the out_vals dictionary
returned by forward_pass, the learning rate lr and a freeze_emb flag (for skipping the embedding updates when using
pre-trained embeddings), and returns the updated weights W.
Hint: the gradients at the output layer are similar to those in multiclass logistic regression.
def backward_pass(x, y, W, out_vals, lr=0.001, freeze_emb=False):
return W
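A minimal sketch, assuming out_vals has the structure of the forward_pass sketch above; as the hint says, the gradient at the output layer is the predicted probability vector with 1 subtracted at the true class:

def backward_pass(x, y, W, out_vals, lr=0.001, freeze_emb=False):
    # gradient w.r.t. the output-layer pre-activations
    dz = out_vals['y'].copy()
    dz[y] -= 1.0
    # walk back from the output layer down to the first hidden layer
    for i in range(len(W) - 1, 0, -1):
        a_prev = out_vals['a'][i - 1]          # (dropped-out) activation feeding layer i
        dW = np.outer(a_prev, dz)
        da = W[i].dot(dz)                      # gradient w.r.t. a_prev, using pre-update weights
        W[i] -= lr * dW
        dz = da * out_vals['dropout_vecs'][i - 1] * relu_derivative(out_vals['h'][i - 1])
    # embedding layer: h1 is the mean of the embedding rows of the words in x
    if not freeze_emb:
        for w_id in x:
            W[0][w_id] -= lr * dz / len(x)
    return W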
Finally, you need to modify SGD to support back-propagation by using the forward_pass and backward_pass functions.
It takes as input the training data (and the development data for monitoring), the initial weights W and the training
hyperparameters (learning rate, dropout rate, number of epochs, whether to freeze the embedding weights), and returns
the learned weights together with the training and validation loss for each epoch.
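A minimal sketch of such a training loop (function arguments and hyperparameter values are illustrative; X_tr_ids and X_dev_ids hold the documents as lists of word ids):

def SGD(X_tr, Y_tr, W, X_dev, Y_dev, lr=0.001, dropout=0.2, epochs=20, freeze_emb=False):
    train_loss_history, dev_loss_history = [], []
    for epoch in range(epochs):
        # visit the training examples in a random order each epoch
        for i in np.random.permutation(len(X_tr)):
            out_vals = forward_pass(X_tr[i], W, dropout_rate=dropout)
            W = backward_pass(X_tr[i], Y_tr[i], W, out_vals, lr=lr, freeze_emb=freeze_emb)
        # track average losses after each epoch (no dropout at evaluation time)
        train_loss_history.append(np.mean([categorical_loss(Y_tr[i], forward_pass(X_tr[i], W)['y'])
                                           for i in range(len(X_tr))]))
        dev_loss_history.append(np.mean([categorical_loss(Y_dev[i], forward_pass(X_dev[i], W)['y'])
                                         for i in range(len(X_dev))]))
    return W, train_loss_history, dev_loss_history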
Now you are ready to train and evaluate your neural net. First, you need to define your network using the
network_weights function followed by SGD with backprop:
W = network_weights(vocab_size=len(vocab),embedding_dim=300,
hidden_dim=[], num_classes=3)
for i in range(len(W)):
print('Shape W'+str(i), W[i].shape)
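To evaluate on the test set you can take the arg-max of the predicted class probabilities (a sketch, assuming X_te_ids holds the test documents as lists of word ids):

# predicted class for each test document
preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x in X_te_ids]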
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))
Use Pre-trained Embeddings
Now re-train the network using GloVe pre-trained embeddings. You need to modify the backward_pass function above to
stop computing gradients and updating weights of the embedding matrix.
Use the function below to obtain the embedding matrix for your vocabulary. Generally, that should work without any
problem. If you get errors, you can modify it.
def get_glove_embeddings(f_zip, f_txt, word2id, emb_size=300):
    # build a |V| x emb_size matrix holding the GloVe vector of each word in the vocabulary
    w_emb = np.zeros((len(word2id), emb_size), dtype=np.float32)
    with zipfile.ZipFile(f_zip) as z:
        with z.open(f_txt) as f:
            for line in f:
                line = line.decode('utf-8')
                word = line.split()[0]
                if word in word2id:
                    emb = np.array(line.strip('\n').split()[1:]).astype(np.float32)
                    w_emb[word2id[word]] += emb
    return w_emb
w_glove = get_glove_embeddings("glove.840B.300d.zip","glove.840B.300d.txt",word2id)
First, initialise the weights of your network using the network_weights function. Second, replace the weights of the
embedding matrix with w_glove . Finally, train the network by freezing the embedding weights:
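For example (a sketch; the hyperparameter values are illustrative):

W = network_weights(vocab_size=len(vocab), embedding_dim=300, hidden_dim=[], num_classes=3)
W[0] = w_glove.astype(np.float32)   # replace the embedding matrix with the GloVe weights
W, train_loss, dev_loss = SGD(X_tr_ids, Y_tr, W, X_dev_ids, Y_dev,
                              lr=0.001, dropout=0.2, epochs=20, freeze_emb=True)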
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))
Full Results
Add your final results here:
Model Precision Recall F1-Score Accuracy
Average Embedding
Please discuss why your best performing model is better than the rest.