Fix image display and clean up for NLP tutorial #114

Merged: 1 commit, Jan 4, 2022
Changes from all commits
105 changes: 53 additions & 52 deletions content/tutorial-nlp-from-scratch.md
@@ -6,7 +6,7 @@ jupyter:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.11.5
kernelspec:
display_name: Python 3 (ipykernel)
language: python
@@ -23,14 +23,14 @@ Your deep learning model (the LSTM) is a form of a Recurrent Neural Network and
Today, Deep Learning is being adopted in everyday life, and it is increasingly important to ensure that decisions made with the help of AI do not reflect discriminatory behavior towards certain groups of people. It is important to take fairness into consideration while consuming the output from AI. Throughout the tutorial we'll try to question all the steps in our pipeline from an ethics point of view.


## Prerequisites

You are expected to be familiar with the Python programming language and array manipulation with NumPy. In addition, some understanding of Linear Algebra and Calculus is recommended. You should also be familiar with how Neural Networks work. For reference, you can visit the [Python](https://docs.python.org/dev/tutorial/index.html), [Linear algebra on n-dimensional arrays](https://numpy.org/doc/stable/user/tutorial-svd.html) and [Calculus](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/multivariable-calculus.html) tutorials.

To get a refresher on Deep Learning basics, you should consider reading [the d2l.ai book](https://d2l.ai/chapter_recurrent-neural-networks/index.html), which is an interactive deep learning book with multi-framework code, math, and discussions. You can also go through the [Deep learning on MNIST from scratch tutorial](https://numpy.org/numpy-tutorials/content/tutorial-deep-learning-on-mnist.html) to understand how a basic neural network is implemented from scratch.

In addition to NumPy, you will be utilizing the following Python packages for data loading and processing:
- [`pandas`](https://pandas.pydata.org/docs/) for handling dataframes
- [`Matplotlib`](https://matplotlib.org/) for data visualization
- [`pooch`](https://www.fatiando.org/pooch/latest/) to download and cache datasets

@@ -39,13 +39,13 @@ This tutorial can be run locally in an isolated environment, such as [Virtualenv

## Table of contents

1. Data Collection

2. Preprocess the datasets

3. Build and train a LSTM network from scratch

4. Perform sentiment analysis on collected speeches

5. Next steps

@@ -105,24 +105,26 @@ We made sure to include different demographics in our data and included a range
>The GloVe word embeddings include sets that were trained on billions of tokens, some up to 840 billion tokens. These embeddings exhibit stereotypical biases, such as gender bias, which can be traced back to the original training data. For example, certain occupations appear to be more strongly associated with a particular gender, reinforcing problematic stereotypes. The closest solution to this problem is to apply a de-biasing algorithm, such as the one presented in https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/reports/6835575.pdf, to the embeddings of your choice in order to mitigate bias, if present.
<!-- #endregion -->

You'll start by importing the necessary packages to build our Deep Learning network.

```python
# Importing the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pooch
import string
import re
import zipfile
import os
```

Next, you'll define a set of text preprocessing helper functions.

```python
class TextPreprocess:
"""Text Preprocessing for a Natural Language Processing model."""

def txt_to_df(self, file):
"""Function to convert a txt file to pandas dataframe.

@@ -133,7 +135,7 @@ class TextPreprocess:

Returns
-------
Pandas dataframe
txt file converted to a dataframe.

"""
@@ -145,22 +147,22 @@ class TextPreprocess:
reviews[lines[1]] = float(lines[0])
df = pd.DataFrame(reviews.items(), columns=['review', 'sentiment'])
df = df.sample(frac=1).reset_index(drop=True)
return df

def unzipper(self, zipped, to_extract):
"""Function to extract a file from a zipped folder.

Parameters
----------
zipped : str
Path to the zipped folder.

to_extract: str
Path to the file to be extracted from the zipped folder

Returns
-------
str
Path to the extracted file.

"""
@@ -266,7 +268,7 @@ class TextPreprocess:

Returns
-------
list
sentences with punctuation removed.

"""
@@ -299,7 +301,7 @@ class TextPreprocess:

Returns
-------
Dict
mapping from word to corresponding word embedding.

"""
@@ -328,7 +330,7 @@ class TextPreprocess:

Returns
-------
list
paragraphs of specified length.

"""
@@ -350,14 +352,14 @@ class TextPreprocess:

```python
data = pooch.create(
# folder where the data will be stored in the
# default cache folder of your Operating System
path=pooch.os_cache("numpy-nlp-tutorial"),
# Base URL of the remote data store
base_url="",
# The cache file registry. A dictionary with all files managed by this pooch.
# The keys are the file names and values are their respective hash codes which
# ensure we download the same, uncorrupted file each time.
registry={
"imdb_train.txt": "6a38ea6ab5e1902cc03f6b9294ceea5e8ab985af991f35bcabd301a08ea5b3f0",
"imdb_test.txt": "7363ef08ad996bf4233b115008d6d7f9814b7cc0f4d13ab570b938701eadefeb",
@@ -444,12 +446,11 @@ Unlike an MLP, the RNN was designed to work with sequence prediction problems.RNN
The problem with an RNN, however, is that it cannot retain long-term memory because the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections. This shortcoming is referred to as the vanishing gradient problem. Long Short-Term Memory (LSTM) is an RNN architecture specifically designed to address the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem).
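
As a brief sketch of where this behavior comes from (written in generic RNN notation rather than this tutorial's symbols): the gradient of the loss at time step $t$ with respect to an earlier hidden state $H_{k}$ contains a product of Jacobians,

$$\frac{\partial \mathcal{L}_{t}}{\partial H_{k}} = \frac{\partial \mathcal{L}_{t}}{\partial H_{t}} \prod_{j=k+1}^{t} \frac{\partial H_{j}}{\partial H_{j-1}},$$

so when the norms of these Jacobians are consistently below one, the product shrinks exponentially in $t - k$ (vanishing gradients), and when they are consistently above one, it grows exponentially (exploding gradients).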


### Overview of the Model Architecture

![Overview of the model architecture, showing a series of animated boxes. There are five identical boxes labeled A and receiving as input one of the words in the phrase "life's a box of chocolates". Each box is highlighted in turn, representing the memory blocks of the LSTM network as information passes through them, ultimately reaching a "Positive" output value.](_static/lstm.gif)


In the above gif, the rectangles labeled $A$ are called `Cells` and they are the **Memory Blocks** of our LSTM network. They are responsible for choosing what to remember in a sequence and pass on that information to the next cell via two states called the `hidden state` $H_{t}$ and the `cell state` $C_{t}$ where $t$ indicates the time-step. Each `Cell` has dedicated gates which are responsible for storing, writing or reading the information passed to an LSTM. You will now look closely at the architecture of the network by implementing each mechanism happening inside of it.
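
In symbols, using the same parameter names that appear in the code below, the computation inside each memory block can be summarized by a standard formulation of the LSTM cell (written here for reference):

$$
\begin{aligned}
f_{t} &= \sigma\left(W_{f}\,[H_{t-1}, x_{t}] + b_{f}\right) && \text{(forget gate)} \\
i_{t} &= \sigma\left(W_{i}\,[H_{t-1}, x_{t}] + b_{i}\right) && \text{(input gate)} \\
\tilde{C}_{t} &= \tanh\left(W_{cm}\,[H_{t-1}, x_{t}] + b_{cm}\right) && \text{(candidate memory)} \\
C_{t} &= f_{t} \odot C_{t-1} + i_{t} \odot \tilde{C}_{t} && \text{(cell state update)} \\
o_{t} &= \sigma\left(W_{o}\,[H_{t-1}, x_{t}] + b_{o}\right) && \text{(output gate)} \\
H_{t} &= o_{t} \odot \tanh(C_{t}) && \text{(hidden state update)}
\end{aligned}
$$

where $[H_{t-1}, x_{t}]$ is the concatenation of the previous hidden state and the current word embedding, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.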


Let's start by writing a function to randomly initialize the parameters that will be learned while our model trains.
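
The initialization function itself falls outside the lines shown in this diff, so the following is only a minimal sketch of what such a function could look like. It assumes small random weights with zero biases and reuses the parameter names (`Wf`, `bf`, `Wi`, `bi`, `Wcm`, `bcm`, `Wo`, `bo`, `W2`, `b2`) expected by the gate functions below; it is not necessarily the tutorial's exact scheme.

```python
import numpy as np

def initialize_parameters(hidden_dim, input_dim, std=0.01, seed=0):
    """Sketch of a parameter initializer for the LSTM gates and the fully connected layer."""
    rng = np.random.default_rng(seed)
    # Each gate sees the previous hidden state concatenated with the current word embedding
    concat_dim = hidden_dim + input_dim
    parameters = {}
    for gate in ("f", "i", "cm", "o"):
        # Small random weights and zero biases per gate (an assumption, not the tutorial's exact recipe)
        parameters["W" + gate] = rng.standard_normal((hidden_dim, concat_dim)) * std
        parameters["b" + gate] = np.zeros((hidden_dim, 1))
    # Fully connected layer mapping the last hidden state to a single sentiment score
    parameters["W2"] = rng.standard_normal((1, hidden_dim)) * std
    parameters["b2"] = np.zeros((1, 1))
    return parameters
```
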
@@ -513,7 +514,7 @@ The **Forget Gate** takes the current word embedding and the previous hidden sta
def fp_forget_gate(concat, parameters):
ft = sigmoid(np.dot(parameters['Wf'], concat)
+ parameters['bf'])
return ft
```
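
The `sigmoid` helper used by these gate functions is defined outside the lines shown in this diff; a minimal equivalent definition would be:

```python
import numpy as np

def sigmoid(x):
    # Logistic function: squashes any real-valued input into the (0, 1) range
    return 1 / (1 + np.exp(-x))
```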

The **Input Gate** takes the current word embedding and the previous hidden state concatenated together as input, and governs how much of the new data we take into account via the **Candidate Memory Gate**, which utilizes the [Tanh](https://d2l.ai/chapter_multilayer-perceptrons/mlp.html?highlight=tanh#tanh-function) function to regulate the values flowing through the network.
@@ -524,7 +525,7 @@ def fp_input_gate(concat, parameters):
+ parameters['bi'])
cmt = np.tanh(np.dot(parameters['Wcm'], concat)
+ parameters['bcm'])
return it, cmt
```

Finally, we have the **Output Gate**, which takes information from the current word embedding, the previous hidden state and the cell state (which has by now been updated with information from the forget and input gates) to update the value of the hidden state.
@@ -540,18 +541,18 @@ def fp_output_gate(concat, next_cs, parameters):
The following image summarizes each gate mechanism in the memory block of an LSTM network:
>Image has been modified from [this](https://link.springer.com/chapter/10.1007%2F978-3-030-14524-8_11) source

![Diagram showing three sections of a memory block, labeled "Forget gate", "Input gate" and "Output gate". Each gate contains several subparts, representing the operations performed at that stage of the process.](_static/mem_block.png)

### But how do you obtain sentiment from the LSTM's output?

The hidden state you obtain from the output gate of the last memory block in a sequence is considered to be a representation of all the information contained in a sequence. To classify this information into various classes (2 in our case, positive and negative) we use a **Fully Connected layer** which firstly maps this information to a predefined output size (1 in our case). Then, an activation function such as the sigmoid converts this output to a value between 0 and 1. We'll consider values greater than 0.5 to be indicative of a positive sentiment.

```python
def fp_fc_layer(last_hs, parameters):
z2 = (np.dot(parameters['W2'], last_hs)
+ parameters['b2'])
a2 = sigmoid(z2)
return a2
```
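
As a small illustration of the 0.5 threshold described above, the probability returned by `fp_fc_layer` could be turned into a label like this (the helper name is hypothetical, not part of the tutorial):

```python
import numpy as np

def predict_sentiment(a2, threshold=0.5):
    # a2 is the sigmoid output of the fully connected layer: a single probability in (0, 1)
    return "positive" if np.asarray(a2).item() > threshold else "negative"
```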

Now you will put all these functions together to summarize the **Forward Propagation** step in our model architecture:
@@ -579,22 +580,22 @@ def forward_prop(X_vec, parameters, input_dim):

# Input to the gates is concatenated previous hidden state and current word embedding
concat = np.vstack((prev_hs, xt))

# Calculate output of the forget gate
ft = fp_forget_gate(concat, parameters)

# Calculate output of the input gate
it, cmt = fp_input_gate(concat, parameters)
io = it * cmt

# Update the cell state
next_cs = (ft * prev_cs) + io

# Calculate output of the output gate
ot, next_hs = fp_output_gate(concat, next_cs, parameters)

# store all the values used and calculated by
# the LSTM in a cache for backward propagation.
lstm_cache = {
"next_hs": next_hs,
"next_cs": next_cs,
@@ -612,12 +613,12 @@ def forward_prop(X_vec, parameters, input_dim):
prev_hs = next_hs
prev_cs = next_cs

# Pass the LSTM output through a fully connected layer to
# obtain probability of the sequence being positive
a2 = fp_fc_layer(next_hs, parameters)

# store all the values used and calculated by the
# fully connected layer in a cache for backward propagation.
fc_cache = {
"a2" : a2,
"W2" : parameters['W2']
@@ -642,7 +643,7 @@ def initialize_grads(parameters):
return grads
```

Now, for each gate and the fully connected layer, we define a function to calculate the gradient of the loss with respect to the input passed and the parameters used. To understand the mathematics behind how the derivatives were calculated, we suggest you follow this helpful [blog](https://christinakouridi.blog/2019/06/19/backpropagation-lstm/) by Christina Kouridi.
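
For example, if the model is trained with a binary cross-entropy loss (the usual pairing with a sigmoid output; the loss definition itself is outside the lines shown here), the derivative of the loss with respect to the fully connected layer's pre-activation $z_{2}$ collapses to a simple difference:

$$\frac{\partial \mathcal{L}}{\partial z_{2}} = \frac{\partial \mathcal{L}}{\partial a_{2}} \, \frac{\partial a_{2}}{\partial z_{2}} = \left(-\frac{y}{a_{2}} + \frac{1-y}{1-a_{2}}\right) a_{2}\,(1 - a_{2}) = a_{2} - y, \qquad \text{with } \mathcal{L} = -\big[y \log a_{2} + (1-y)\log(1-a_{2})\big],\; a_{2} = \sigma(z_{2}).$$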


Define a function to calculate the gradients in the **Forget Gate**:
@@ -659,7 +660,7 @@ def bp_forget_gate(hidden_dim, concat, dh_prev, dc_prev, cache, gradients, param
gradients['dbf'] += np.sum(dft, axis=1, keepdims=True)
# dh_f = dft * dft/dh_prev
dh_f = np.dot(parameters["Wf"][:, :hidden_dim].T, dft)
return dh_f, gradients
```

Define a function to calculate the gradients in the **Input Gate** and **Candidate Memory Gate**:
@@ -686,7 +687,7 @@ def bp_input_gate(hidden_dim, concat, dh_prev, dc_prev, cache, gradients, parame
dh_i = np.dot(parameters["Wi"][:, :hidden_dim].T, dit)
# dhcm = dcmt * dcmt/dh_prev
dh_cm = np.dot(parameters["Wcm"][:, :hidden_dim].T, dcmt)
return dh_i, dh_cm, gradients
```

Define a function to calculate the gradients for the **Output Gate**:
@@ -702,7 +703,7 @@ def bp_output_gate(hidden_dim, concat, dh_prev, dc_prev, cache, gradients, param
gradients['dbo'] += np.sum(dot, axis=1, keepdims=True)
# dho = dot * dot/dho
dh_o = np.dot(parameters["Wo"][:, :hidden_dim].T, dot)
return dh_o, gradients
```

Define a function to calculate the gradients for the **Fully Connected Layer**:
@@ -721,14 +722,14 @@ def bp_fc_layer (target, caches, gradients):
# dh_last = dZ2 * W2
W2 = caches['fc_values'][0]["W2"]
dh_last = np.dot(W2.T, dZ2)
return dh_last, gradients
```

Put all these functions together to summarize the **Backpropagation** step for our model:

```python
def backprop(y, caches, hidden_dim, input_dim, time_steps, parameters):

# Initialize gradients
gradients = initialize_grads(parameters)

@@ -742,7 +743,7 @@ def backprop(y, caches, hidden_dim, input_dim, time_steps, parameters):
# loop back over the whole sequence
for t in reversed(range(time_steps)):
cache = caches['lstm_values'][t]

# Input to the gates is concatenated previous hidden state and current word embedding
concat = np.concatenate((cache["prev_hs"], cache["xt"]), axis=0)

@@ -765,7 +766,7 @@ def backprop(y, caches, hidden_dim, input_dim, time_steps, parameters):
return gradients
```

### Updating the Parameters

We update the parameters through an optimization algorithm called [Adam](https://optimization.cbe.cornell.edu/index.php?title=Adam) which is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing. Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters `beta1` and `beta2` control the decay rates of these moving averages. Adam has shown increased convergence and robustness over other gradient descent algorithms and is often recommended as the default optimizer for training.
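
As a rough sketch of what one Adam update for a single parameter array looks like (the function and variable names here are illustrative; the tutorial's actual update code is outside the lines shown in this excerpt):

```python
import numpy as np

def adam_step(param, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of the gradient
    v = beta1 * v + (1 - beta1) * grad
    # Second moment: exponential moving average of the squared gradient
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero-initialized moving averages
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Step size is adapted per parameter by the second-moment estimate
    param = param - lr * v_hat / (np.sqrt(s_hat) + eps)
    return param, v, s
```

Here `t` is the 1-based count of updates performed so far; the bias-correction terms matter most during the first few steps.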
