Lecture 2 - Neural Networks

This lecture provides an overview of neural networks and deep learning for natural language processing. It discusses how neural networks can solve problems that are not linearly separable by learning richer features through multiple hidden layers. Each layer represents combinations of features from the previous layer, learning increasingly complex patterns in the data. The network maps inputs to outputs by applying linear classifiers and activation functions through a sequence of layers that learn hierarchical representations.


Course series: Deep Learning for NLP

Neural Networks

Lecture # 2

Hassan Sajjad and Fahim Dalvi


Qatar Computing Research Institute, HBKU
Overall Picture

(Diagram: Input data → Objective Function (with Parameters) → Loss Function → Optimization, which updates the Parameters.)
Linear Classifier Recap
(Plot: examples in the x0-x1 plane separated by a line.)

Linear Classifier?
(Plot: examples in the x0-x1 plane that a single line cannot separate.)
Linear Separability
Not all problems are linearly classifiable, i.e. if you plot the examples in space, you cannot draw a line/plane to separate them out.

Neural networks are one way to solve this problem.
Linear Classifier
(Diagram: a single linear classifier maps the input to two scores: the score for the car to be accident prone and the score for the car to be NOT accident prone.)

Neural Network
(Diagram: several linear classifiers in parallel feed into the same two output scores: accident prone and NOT accident prone.)
Neural Network
(Diagram: the linear classifiers in the middle attend to input features such as speed, color, acceleration, old cars, and feed the two output scores: accident prone and NOT accident prone.)
As a simplification, we can say that each classifier learns to look at a particular feature of the input (the car in this case).

Neural Network
(Diagram: the same network, with each linear classifier now called a neuron.)
As a simplification, we can say that each neuron learns to look at a particular feature of the input (the car in this case).
Neural Network
(Diagram: the input x to the network is a vector of features x0 through x6.)
Neural Network
The neurons in the layer can be thought of as representing richer features.
Think of these richer features as combinations of the input features we provided to the system.
(Diagram: Input → Layer 1.)
Neural Network
(Diagram: Input → Layer 1 → Layer 2 → … → Layer N → Output, with one score per class at the output.)
Neurons
A neuron can be thought of as a linear classifier plus an activation function.
(Diagram: input x → linear classifier → activation function → output.)
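A minimal numpy sketch of this idea (the weights, bias, and input values below are made up for illustration):

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied element-wise
    return np.maximum(0, z)

def neuron(x, w, b):
    # A neuron: a linear classifier (w·x + b) followed by an activation function
    return relu(np.dot(w, x) + b)

# Hypothetical neuron with three input features
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron(x, w, b))
```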
Activation Functions
• Intuitively, a neuron looks at a particular feature of the data.
• The activation after the linear classifier gives us an idea of how much the neuron "supports" the feature. As an example, the output of a neuron will be high if the feature it supports is contained in the input (like "low speed" in the current "car").
• Activations also help us map linear spaces into non-linear spaces.
(Figure: plots of common non-linear activation functions.)
Neural Network
• The entire network is nothing but a function.

Linear classifier:
s = W·x + b

Neural network with 3 hidden layers:
s = W4 · f(W3 · f(W2 · f(W1·x + b1) + b2) + b3) + b4

Here f is the activation function, each f(Wi·(…) + bi) is the output of a linear classifier passed through the activation (the "richer features"), and s holds the final scores.
Neural Network
• Everything else remains the same! The linear classifier is simply replaced by the neural network with 3 hidden layers.
Neural Network
Example architecture: Input (300 features) → Layer 1 (100 neurons) → Layer 2 (50 neurons) → Output (3 classes)

With a batch of 3 examples, the matrix shapes are:
Input: [3 x 300] ([examples x features])
Layer 1 weights: [300 x 100] (hidden layer 1 size: 100)
Layer 2 weights: [100 x 50] (hidden layer 2 size: 50)
Output weights: [50 x 3]
Output scores: [3 x 3] ([examples x classes])
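A minimal numpy sketch of the forward pass with these shapes (random weights, ReLU assumed as the activation):

```python
import numpy as np

examples, features = 3, 300
X = np.random.randn(examples, features)             # [3 x 300]

W1, b1 = np.random.randn(300, 100), np.zeros(100)   # hidden layer 1
W2, b2 = np.random.randn(100, 50), np.zeros(50)     # hidden layer 2
W3, b3 = np.random.randn(50, 3), np.zeros(3)        # output layer

relu = lambda z: np.maximum(0, z)

h1 = relu(X @ W1 + b1)    # [3 x 100]
h2 = relu(h1 @ W2 + b2)   # [3 x 50]
scores = h2 @ W3 + b3     # [3 x 3]: one score per class per example
print(scores.shape)       # (3, 3)
```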


Neural Network: Objective Function
(Diagram: Input → Layer 1 → Layer 2 → Output; computing the output scores from the input is the Forward Pass.)

Neural Network: Loss
The output scores are compared against the true labels using the Cross Entropy Loss.

Neural Network: Optimization
Compute the gradients of the loss with respect to the weights and biases using backpropagation.

Neural Network: Parameter Update
The gradients flow from the output back towards the input (the Backward Pass) and are used to update the parameters.
Overall Picture

(Diagram: Input data → Neural Network (with Parameters) → Loss Function → Optimization, which updates the Parameters. The neural network has taken the place of the objective function.)
Neural Network
Let's implement a simple two layer neural network model!

Recall the model definition for binary classification: we just need to add a hidden layer (mostly).

Model definition
• Hidden layer with 100 neurons, still only two input features.
• Activation function for the neurons in this layer (all neurons in a single layer conventionally have the same activation).
• Output layer.
• We are using the Adam optimizer here instead of SGD, since it works much better in the majority of cases.
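A minimal sketch of such a model definition in Keras, assuming a ReLU hidden layer and a sigmoid output for binary classification:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# Hidden layer: 100 neurons, two input features, ReLU activation
model.add(Dense(100, input_dim=2, activation='relu'))
# Output layer: a single sigmoid unit for binary classification
model.add(Dense(1, activation='sigmoid'))
# Adam optimizer instead of SGD
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```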
Neural Network
Model Learning Curve
As we have seen before, the fit function returns the history of losses. We can plot these values to debug and analyze how our model is learning.
As a general rule, your loss curve should go down with more epochs; we will learn more about this later.
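A sketch of training and plotting the loss curve, reusing the model from the sketch above and a toy stand-in for the data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy stand-in for the training data: [examples x 2] features, binary labels
X_train = np.random.randn(200, 2)
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)

history = model.fit(X_train, y_train, epochs=100, verbose=0)

# history.history['loss'] contains one loss value per epoch
plt.plot(history.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
```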
Neural Network
Let’s see it in action!

Lecture 2 - Neural Network with Spiral Data


Neural Network
Neural network learns the boundaries.
(Figure: decision boundaries learned on the spiral data.)

Neural Network
Is the learning because of the hidden layer, or because of the non-linearity added by the activation functions?

Exercise:
1) Remove ReLU but keep one hidden layer and report the score.
2) See the effect of the learning rate (Hint: modify the code so that you use an explicitly initialized Adam optimizer object). A sketch for this hint follows below.
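One way to do this (a sketch, reusing the Keras model above; the learning rate value is just an example):

```python
from keras.optimizers import Adam

# Explicitly initialized optimizer so the learning rate can be varied
# (older Keras versions use the argument name lr instead of learning_rate)
optimizer = Adam(learning_rate=0.001)  # try e.g. 0.1, 0.01, 0.0001

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
```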
Neural Network
Some terminology:
• Fully connected neural network
• Feed-forward neural network
Neural Network Language Model

Language Model
"You shall know a word by the company it keeps"
(Firth, J. R. 1957:11)
Language Model
Fill in the blank:
... a _______ ...
Candidates: car, cars, water, cat
"car" and "cat" both work.

John is driving a _______ ...
Only "car" works here.

Similarly, machines use the context to predict the next words.
Language Model
You chose “driving a car” because you’ve seen
that phrase more frequently

“driving a cat” is not a common phrase


Language Model
Fill in the blank:
This ______ is going at 100 km/hour
Candidates: car, bicycle
A car at 100 km/hour is more probable than a bicycle.
Language Model
A language model defines "how probable a sentence is".
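In symbols (a standard formulation, not spelled out on the slide), the probability of a sentence is built from next-word predictions:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$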
Language Model
Let's look at the example again.
How probable is "John is driving a car" vs. "John is driving a cat"?
In other words, what is the probability of predicting "car" or "cat" given the context "John is driving a"?
Language Model
Q: Can we see language modeling as a classification problem?
A: Yes! We are just predicting which word ("class") is coming next.

For the example sentence starting with "<s> Dan likes ham …":
• Predict "Dan" given <s> (what is the probability of "Dan" given <s>?)
• Predict "likes" given "Dan"
• Predict "ham" given "likes"

Language Model
Words represent classes that we want to predict!
Input to the classifier: the previous words, i.e. the context.
Output: a probability distribution over all possible words, i.e. our vocabulary.
Neural Network Language Model
(Diagram: Input (the previous word(s)) → Layer 1 → Layer 2 → Output.)
The output has one score per word in the vocabulary, e.g. scores for "cat", "dog", "car", "house", "door", "school", "laptop", "phone", …, "zebra": scores over the entire vocabulary.
Input Representation
Previously we've used a vector as input, where each element of the vector represented some "feature" of the input, e.g. for a car: age, color, maximum speed, …

Can we represent a word, such as "University", as a feature vector?
One Hot Vector Representation
• Every word can be represented as a one-hot vector.
• Suppose the total number of unique words in the corpus is 10,000.
• Assign each word a unique index:
  University: 1
  cat: 2
  house: 3
  car: 4
  …
  apple: 10,000
• Only the index that represents the input word will be one.
• The vector size will be the size of the vocabulary, i.e. 10,000 in this case.
(Diagram: dictionary entry → one-hot representation.)
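A minimal sketch of building a one-hot vector from a word index (the toy vocabulary and 0-based indices are illustrative):

```python
import numpy as np

vocab = {'University': 0, 'cat': 1, 'house': 2, 'car': 3}  # toy dictionary

def one_hot(word, vocab):
    # Vector length equals the vocabulary size; only the word's index is set to 1
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1
    return v

print(one_hot('car', vocab))  # [0. 0. 0. 1.]
```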
One Hot Vector Representation
(Diagram: a one-hot vector [1 x V] multiplied by the weight matrix [V x h] gives a [1 x h] vector.)
The one-hot vector will "turn on" one row of the weights.
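A short numpy check of this fact (toy sizes, random weights):

```python
import numpy as np

V, h = 4, 3                         # vocabulary size and hidden size (toy values)
W = np.random.randn(V, h)           # [V x h] weight matrix
x = np.zeros((1, V))
x[0, 3] = 1                         # [1 x V] one-hot vector for the word with index 3

# The product picks out ("turns on") exactly one row of W
print(np.allclose(x @ W, W[3]))     # True
```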


Higher ngram Vector Representation
• What about representing multiple words?

Bag of words approach
• Bigram: the indices of the two previous words are 1 in the vector.
• Trigram: the indices of the three previous words are 1 in the vector.

Context-aware approach
• In the bag of words approach, order information is lost!
• Solution: for N words, concatenate the one-hot vectors of the words in the correct order.
(Diagram: two [V x 1] one-hot vectors concatenated into a single [2V x 1] vector; the input vector length has increased.)

Advantages
• Order information is available for training.
Disadvantages
• Long vectors in case of a large context size.
• The number of parameters increases with the context size.
Higher ngram Vector Representation

• Bag of words vs. context-aware approach?


– Given the disadvantages of the context-aware
approach, Bag of words is more commonly used
– Works well in practice
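A sketch contrasting the two representations for the two-word context "John is" (toy vocabulary, illustrative):

```python
import numpy as np

vocab = {'John': 0, 'is': 1, 'driving': 2, 'a': 3, 'car': 4}
V = len(vocab)

def one_hot(word):
    v = np.zeros(V)
    v[vocab[word]] = 1
    return v

context = ['John', 'is']

# Bag of words: the indices of both previous words are 1; order is lost
bow = sum(one_hot(w) for w in context)                          # shape [V]

# Context-aware: concatenate the one-hot vectors in the correct order
context_aware = np.concatenate([one_hot(w) for w in context])   # shape [2V]

print(bow.shape, context_aware.shape)  # (5,) (10,)
```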
Input Representation
Generally, the size of the vocabulary is very large, which results in very large one-hot vectors!

Some tricks to reduce the vocabulary size (a sketch of trick 1 follows below):
1) Take the most frequent top words. For example, consider only the 10,000 most frequent words and map the rest to a unique token <UNK>.
2) Cluster words
   a) based on context
   b) based on linguistic properties
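A minimal sketch of the frequency cutoff with an <UNK> token (the toy corpus is illustrative):

```python
from collections import Counter

corpus = ["hello how are you", "how are you doing"]   # toy corpus
tokens = [w for sentence in corpus for w in sentence.split()]

TOP_K = 10000
counts = Counter(tokens)
# Keep only the TOP_K most frequent words; everything else maps to <UNK>
vocab = {w: i for i, (w, _) in enumerate(counts.most_common(TOP_K))}
vocab['<UNK>'] = len(vocab)

def word_to_index(word):
    return vocab.get(word, vocab['<UNK>'])

print(word_to_index('hello'), word_to_index('goodbye'))
```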
Neural Network Language Model
Let us look at a complete example:
Vocabulary: {"how", "you", "hello", "are"}
Network Architecture: 2 hidden layers of size 3 each

Shapes: one-hot vector [1 x 4] → weights [4 x 3] → [3 x 3] → [3 x 4] → output scores [1 x 4]

Input "hello" → output scores, max: "how"
Input "how" → output scores, max: "are"
Input "are" → output scores, max: "you"

Each one-hot vector turns on one row in the weight matrix and results in a [1 x 3] vector.
Can we say that the [1 x 3] vector represents the input word? Yes.
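A minimal numpy sketch of the forward pass with these shapes (random weights, ReLU assumed, so the prediction here is arbitrary):

```python
import numpy as np

vocab = ['how', 'you', 'hello', 'are']
V, h = len(vocab), 3

W1 = np.random.randn(V, h)    # [4 x 3]: each row is the representation of one word
W2 = np.random.randn(h, h)    # [3 x 3]
W3 = np.random.randn(h, V)    # [3 x 4]
relu = lambda z: np.maximum(0, z)

x = np.zeros((1, V))
x[0, vocab.index('hello')] = 1       # [1 x 4] one-hot vector for "hello"

h1 = relu(x @ W1)                    # [1 x 3]: the row of W1 turned on by "hello", after the activation
h2 = relu(h1 @ W2)                   # [1 x 3]
scores = h2 @ W3                     # [1 x 4]: one score per vocabulary word
print(vocab[int(np.argmax(scores))]) # the word with the maximum score
```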
Exercise
Create a 2D vector space representation of the following words:
dog, lion, cat, rabbit, horse, zebra, cheetah, parrot, sparrow, elephant, chicken, monkey

Possible dimensions: small vs. large, carnivore vs. herbivore, wild vs. domestic, fast vs. slow, mammal vs. bird.

(Figures: one possible placement of the animals in 2D, with regions marked for herbivores vs. carnivores, small vs. large animals, and pets.)
Word Embeddings
How did you decide which animals need to be
closer?
How did you handle conflicts between animals
that belong to multiple groups?
How does having this kind of vector space
representation help us?
Word Embeddings
• In the one-hot vector representation, a word is represented as one large sparse vector: only one element is 1 in the entire vector, and the vectors of different words do not give us any information about the potential relations between the words!
• Instead, word embeddings are dense vectors in some vector space: word vectors are continuous representations of words, and the vectors of different words give us information about the potential relations between the words. Words closer together in meaning have vectors closer to each other.
(Diagram: one-hot vector → word embedding.)
Word Embeddings
"Representation of words in continuous space"

Inherent benefits:
• Reduce dimensionality
• Semantic relatedness
• Increase expressiveness
  – one word is represented in the form of several features (numbers)

Word Embeddings
Play with some embeddings!
https://rare-technologies.com/word2vec-tutorial/#bonus_app
Try various relationships...
Word Embeddings
• Reduce dimensionality
(Diagram: an input layer of 10,000 words mapped to an embedding layer of 100 neurons.)
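A sketch of this dimensionality reduction as a Keras Embedding layer (layer sizes from the diagram; the output layer and loss are illustrative):

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size, embedding_dim = 10000, 100

model = Sequential()
# Maps each word index (0..9999) to a dense 100-dimensional vector
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=1))
model.add(Flatten())
# Example task: predict the next word over the whole vocabulary
model.add(Dense(vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```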
Word Embeddings
• Semantic relatedness
(Figure from Mikolov et al. 2013: the plot shows the relationship between vectors representing related concepts; the vectors from countries to capitals point roughly in the same direction.)
Word Embeddings
• Similarly, learning the gender relationship
Word Embeddings
Q: How can we learn these embeddings automatically?
A: Neural networks are a step ahead: embeddings are already learned as "richer" features.
(Diagram: in the neural network language model, the first hidden layer acts as the embedding layer, before the layers that produce the scores over the vocabulary.)
Word Embeddings
The overall training task defines the relationships which will be learned by the model. For example:
• In language modeling, the model uses the neighboring context, thus bringing words with similar contexts closer.
• In a POS tagging task, words with similar POS tags will come close to each other.
• If our network is doing machine translation, the embeddings will be tuned for translation.
Word Embeddings
• Generally, task-specific embeddings are better than generic embeddings.
• In the case of a small amount of training data, generic embeddings learned on a large amount of data work better.
• Generic embeddings can also be used as a starting point.
Word Embeddings
We can use pre-trained embeddings as well: just initialize the weights in the first layer with some learned embeddings.
(Diagram: the embedding layer of the language model replaced by a pre-trained embedding layer, followed by the scores over the vocabulary.)
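A sketch of initializing a Keras Embedding layer with pre-trained vectors; embedding_matrix here is a hypothetical [vocab_size x embedding_dim] array that would normally be filled from, e.g., GloVe or FastText files (see the Keras blog post linked below):

```python
import numpy as np
from keras.layers import Embedding

vocab_size, embedding_dim = 10000, 100
# Hypothetical pre-trained matrix; in practice, fill each row with the vector
# of the corresponding vocabulary word loaded from a pre-trained embedding file
embedding_matrix = np.random.randn(vocab_size, embedding_dim)

embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],  # initialize with learned embeddings
                            trainable=False)             # optionally freeze them
```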
Word Embedding Tools
Some tools to learn word embeddings:
• Word2Vec (from Google)
• FastText (from Facebook)
• GloVe (from Stanford)

• Contextualized word embeddings
  – CoVe
  – ELMo
  – BERT
Word Embeddings
A few pre-trained word embeddings:
• GloVe: Wikipedia plus Gigaword https://goo.gl/1XYZhc
• FastText: Wikipedia in 294 languages https://goo.gl/1v423g
• Dependency-based embeddings https://goo.gl/tpgw4R

Using pre-trained embeddings in Keras:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
Summary
• Neural networks
– Activation function
– Forward pass, loss, backward pass, update
– Implementation in Keras
• Neural network language model
– Input representation
– Output representation
• Word embeddings
Neural Network Implementation
Let's implement!
• Neural network
  – Spiral data
• Neural network language model
  – Sherlock Holmes data
• Neural network for multiclass classification
  – Sentiment analysis (14-way classification)
  – one-hot vector for every sentence