
DEEP LEARNING
And Deep Networks for Natural Language Processing
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
4. Brief History of Deep Learning
5. Implementation Details
1. RBMs and DBNs
2. Auto-Encoders
6. Deep Learning for NLP
i) Learning Neural Embeddings
ii) Recursive Auto-Encoders
Aims of Talk
 Provide a comprehensible introduction to Deep
Learning for the uninitiated
 Give an overview of how deep learning can be
applied to NLP
 Provide an understanding of the justification for
deep learning and the approaches used
 Illustrate the type of problems it can be used to solve
What I am Not
 An expert in Deep Learning

What this Talk is Not


 Deep exploration of the mathematics behind
some of the deep learning models (although
some basic-intermediate math is covered)
 An extensive explanation of neural networks -
some knowledge is assumed
However
 Some of this stuff can be confusing \ complex

So……
 Please feel free to ask sensible questions during the talk
for clarification if needed

And
 I have an accent, so let me know if you have trouble
understanding the Queen’s English
Overview of the Talk

1. Overview of Deep Learning


Deep Learning – WTF?

 Learning deep (many layered) neural networks


 The more layers in a neural network, the more
abstract the features it can represent
 E.g. classify a cat:
 Bottom Layers: edge detectors, curves, corners,
straight lines
 Middle Layers: fur patterns, eyes, ears
 Higher Layers: body, head, legs
 Top Layer: cat or dog
Deep Learning – WTF?

 Real-world information has a hierarchical
structure and cannot easily be modeled by a
neural network with only 3 layers
 The human brain is itself a deep neural network:
it has many layers of neurons which act as
feature detectors, detecting more and more
abstract features as you go up
Deep Learning – WTF?

 The traditional approach is to use back
propagation to train multiple layers
 However, back propagation does not work
well over multiple layers and does not scale
well
 Back propagation cannot leverage unlabelled
data
 Recent advances in deep learning attempt to
address these shortcomings
Deep-Learning is Typically –

 1. Layer-wise, bottom-up pre-training of
unsupervised neural networks (auto-encoders,
RBMs)
 2. Supervised training on labeled data, using
either:
 i) The features learned in step 1, fed into a
classifier, e.g. an SVM
 ii) An additional output layer placed on top to form a
feed-forward network, which is then trained using back
prop on labeled data
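As a rough illustration of option i) above, here is a minimal sketch (mine, not from the original slides) using scikit-learn: a single BernoulliRBM is pre-trained on the inputs and its learned features are fed into a logistic regression classifier. The data shapes and hyperparameters are placeholders.

```python
# Minimal sketch of approach i): unsupervised feature learning with an RBM,
# followed by a supervised classifier trained on those features.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.random.rand(1000, 784)        # illustrative inputs scaled to [0, 1]
y = np.random.randint(0, 10, 1000)   # labels for the supervised stage

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)   # the RBM learns features; logistic regression learns on them
```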
Huh?....

 Don’t worry, we’ll come back to that


shortly….
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
Why? –
Achieved State of the Art in
a Number of Different Areas
 Language Modeling (2012, Mikolov et al)
 Image Recognition (Krizhevsky won 2012 ImageNet
competition)
 Sentiment Classification (2011, Socher et al)
 Speech Recognition (2010, Dahl et al)
 MNIST hand-written digit recognition (Ciresan et al,
2010)
 Andrew Ng – Machine Learning Professor, Stanford:
 “I’ve worked all my life in Machine Learning, and I’ve never
seen one algorithm knock over benchmarks like Deep Learning”
Question: What do these Problems have in Common?
Application Areas

 Typically applied to image and speech
recognition, and NLP
 Each is a non-linear classification problem
where the inputs are highly hierarchical in
nature (language, images, etc.)
 The world has a hierarchical structure – Jeff
Hawkins, On Intelligence
 Problems that humans excel at and machines
do very poorly on
Deep vs Shallow Networks

 Given the same number of non-linear (neural
network) units, a deep architecture is more
expressive than a shallow one (Bishop 1995)
 Two-layer (plus input layer) neural networks
have been shown to be able to approximate
any function
 However, functions compactly represented in
k layers may require exponential size when
expressed in 2 layers
[Figure: a deep network vs. a shallow network]

Shallow (2-layer) networks need many more
hidden-layer nodes to compensate for their lack of
expressivity

In a deep network, higher levels can express combinations
of the features learned at lower levels
Traditional Supervised
Machine Learning Approach
 For each new problem:
 Gather as much LABELED data as you can get \
handle
 Throw a bunch of algorithms at it (after trying RF \
SVM .. insert favorite algo here)
 Pick the best
 Spend hours hand engineering some features \
doing feature selection \ dimensionality reduction
(PCA, SVD, etc)
 RINSE AND REPEAT…..
Biological Justification
 This is NOT how humans learn
 Humans learn facts and skills and apply them to different
problem areas
 -> Transfer Learning
 Humans first learn simple concepts, and then learn more
complex ideas by combining simpler concepts
 There is evidence that the cortex has a single learning algorithm:
 Inputs from the optic nerves of ferrets were rerouted into their auditory
cortex
 They were able to learn to see with their auditory cortex instead
 If we want a general learning algorithm, it needs to be able to:
 Work with any type of data
 Extract its own features
 Transfer what it has learned to new domains
 Perform multi-modal learning – simultaneously learn from multiple
types of input
Unsupervised Training
 Far more un-labeled data in the world (i.e. online)
than labeled data:
 Websites
 Books
 Videos
 Pictures
 Deep networks take advantage of unlabelled
data by learning good representations of the
data through unsupervised learning
 Humans learn initially from unlabelled examples
 Babies learn to talk without labeled data
Unsupervised Feature
Learning
 Learning features that represent the data
allows them to be used to train a supervised
classifier
 As the features are learned in an unsupervised
way from a different and larger dataset, there is
less risk of over-fitting
 No need for manual feature engineering
 (e.g. the Kaggle Salary Prediction contest)
 Latent features are learned that attempt to
explain the data
Unsupervised Learning -
Distributed Representations
 Approaches to unsupervised learning of
features fall into two categories:
 Local Representations (hard clustering)
 Distributed Representations (soft \ fuzzy
clustering)
 Hard clustering approaches (e.g. k-means,
DBSCAN) - learn to map a set of data points
to individual clusters
Distributed Representations

 Fuzzy clustering, dimensionality reduction


approaches (SVD, PCA), topic modeling (LDA)
and unsupervised feature learning with neural
networks learn distributed representations
 Assumes that the data can be explained by the
interaction of many different unobserved factors
 Unseen configurations of these factors can more
effectively explain unseen data
 Far fewer features are needed to describe the space,
as they can be combined in many different ways
[Figures: local vs. distributed representations]
Hierarchical Representations
 These factors are organized into multiple levels
 Each level creates new features from
combinations of features from the level below
 Each level is more abstract than the ones below
 Hierarchies of distributed representations
attempt to solve the “Curse of Dimensionality”
by learning the underlying latent variables that
cause the variability in the data
Discriminative Vs Generative
Models
 2 types of classification algorithms:
 1. Generative – model the joint distribution
 p(Class, Data)
 E.g. NB, HMM, RBM (see later), LDA
 2. Discriminative – model the conditional distribution
 p(Class | Data)
 E.g. Decision Trees, SVMs, neural networks, Linear
Regression, Logistic Regression
Discriminative Vs Generative
Models
 Discriminative models tend to give better classification
accuracy
 BUT are more prone to over-fitting (that again…)
 Generative models can be used to derive conditional models:

 p(A | B) = p(A, B) / p(B)

 Generative models can also generate samples of data
according to the distribution of the training data (hence the
name), i.e. they learn to model the data distribution, not
p(Class | Data)
Discriminative + Generative
Model –>
Semi-Supervised Learning
 In deep learning, a generative model (RBM, Auto-Encoder)
is learned from the data
 The generative model maximizes the likelihood of the data, p(Data)
 Then a discriminative classifier is trained using the features
learned from the generative model
 This maximizes the posterior, p(Class | Data)
 Popular discriminative classifiers used:
 NNet softmax layer
 SVM
 Logistic Regression
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
Neural Networks – Very Brief
Primer
1. Activation Function
2. Back Propagation
3. Gradient Descent
Activation Function

 For each neuron, sum the inputs multiplied by
their weights, and add the bias
 The result is passed through an activation
function, whose output feeds the next layer
 Non-linearity is needed to learn non-linear
functions
 Typically the sigmoid function is used (as in logistic
regression)
 The hyperbolic tangent is also popular; it is zero-centered
and has a steeper gradient around zero
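A minimal sketch (mine, assuming NumPy) of the activation computation just described: a weighted sum of the inputs plus a bias, passed through a sigmoid or tanh non-linearity.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_activation(x, w, b, fn=sigmoid):
    # Weighted sum of inputs plus bias, passed through the activation function
    return fn(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs (illustrative values)
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.05                         # bias
print(neuron_activation(x, w, b))           # sigmoid output in (0, 1)
print(neuron_activation(x, w, b, np.tanh))  # tanh output in (-1, 1)
```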
[Figures: the sigmoid function and other activation functions]
Back Propagation 101

 Target = y
 Learn y = f(x)
 For each neuron:
 Activation <- sum the inputs, add the bias, and apply a sigmoid
function (tanh, logistic, etc.) as the activation function
 Activations propagate forward through the layers
 Output layer: compute the error for each neuron:
 Error = y – f(x)
 Update the weights using the derivative of the error
 Backwards-propagate the error derivatives through the
hidden layers
[Figure: back propagation of errors through the network]
Gradient Descent
 Weights are updated using the partial derivative of
the error with respect to each weight
 The derivative pushes learning down the direction of
steepest descent on the error surface
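To make the last two slides concrete, here is a minimal sketch (mine, not from the talk) of back propagation with gradient descent on a tiny one-hidden-layer network with sigmoid units and squared error; the data and hyperparameters are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))                         # 100 examples, 3 inputs
y = (X.sum(axis=1, keepdims=True) > 1.5) * 1.0   # toy binary target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(0, 0.1, (3, 5)), np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(0, 0.1, (5, 1)), np.zeros(1)   # hidden -> output
lr = 0.5

for epoch in range(1000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Output-layer error and its derivative through the sigmoid
    err = y - out
    d_out = err * out * (1 - out)
    # Backwards-propagate the error derivatives to the hidden layer
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent weight updates (steepest descent on the squared error)
    W2 += lr * h.T @ d_out / len(X); b2 += lr * d_out.mean(axis=0)
    W1 += lr * X.T @ d_h / len(X);   b1 += lr * d_h.mean(axis=0)

print(float(np.mean((out > 0.5) == y)))   # training accuracy on the toy data
```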
Drawbacks - Backpropagation

 Needs labeled data (most data is not labeled)
 Scalability – does not scale well over multiple layers
 Very slow to converge
 “Vanishing gradients” problem: errors shrink
exponentially with the number of layers
 This makes poor use of many layers
 This is the reason most feed-forward neural networks
have only 3 layers
 For more, see “Understanding the Difficulty of Training
Deep Feed Forward Neural Networks”:
http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_GlorotB10.pdf
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
4. Brief History of Deep Learning
Brief History of Deep
Learning
 See: http://www.ipam.ucla.edu/publications/gss2012/gss2012_10596.pdf

 1960s – Perceptron invented (a single neuron)
 1960s – Papert and Minsky prove that perceptrons can only learn
to model linearly separable functions. Interest in perceptrons
rapidly declines.
 1970s-1980s – Back propagation (BP) invented for training
multiple layers of non-linear features. Leads to a resurgence of
interest in neural networks
 BP takes errors from the output layer and propagates them back through
the hidden layer(s)
 1990s – Many researchers gave up on BP as it could not make
effective use of multiple hidden layers
 1990s-present: Simpler, faster models, such as SVMs, came to
dominate the field
Brief History of Deep
Learning (cont…)
 Mid 2000s – Geoffrey Hinton makes a
breakthrough, training deep belief networks by:
 Stacking RBMs on top of one another – a deep belief
network
 Training layer by layer on unlabeled data
 Using back prop to fine-tune the weights on labeled data
 Bengio et al, 2006 – examined deep auto-
encoders as an alternative to deep Boltzmann
machines
 Easier to train
Enabling Factors
 Training of deep networks was made computationally
feasible by:
 Faster CPUs
 The move to parallel CPU architectures
 The advent of GPU computing

 Neural networks are often represented as matrices of
weight vectors
 GPUs are optimized for very fast matrix multiplication
 2008 – Nvidia’s CUDA library for GPU computing is
released
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
4. Brief History of Deep Learning
5. Implementation Details:
1. RBMs and DBNs
2. Auto-Encoders
Implementation

 Most current architectures consist of learning
layers of RBMs or Auto-Encoders
 Both are 2-layer neural networks that learn to
model their inputs
 Key difference:
 RBMs model their inputs as a probability
distribution
 Auto-Encoders learn to reproduce their inputs as their
outputs
Restricted Boltzmann
Machines (RBM’s)
 A two-layer, undirected (bi-directional) neural network:
 Visible layer
 Hidden layer
 Connections run visible to hidden
 No connections within each layer
 Trained to maximize the expected log probability of the
data
 For the physicists \ chemists: ‘Boltzmann’ because they minimize
the energy of the data (which equates to maximizing the
probability)
 Inputs are binary vectors (as it learns a Bernoulli distribution
over each input)
RBM Structure – Bipartite Graph
Activation Function

 The activation function is computed the same way
as in a regular neural network
 The logistic function is usually used (0-1)
 However, the output is treated as a probability,
and each neuron is activated only if its activation
exceeds a uniform random value in (0, 1)
 Hidden-layer neurons take the visible units as inputs
 Visible neurons take binary input vectors as initial
input, then the hidden-layer probabilities (during
Gibbs sampling – next slide)
Training Procedure –
Contrastive Divergence
 Remarkably simple
 Performs Gibbs sampling (an MCMC technique)
 Equates to estimating the model’s probability
distribution using a Markov Chain Monte
Carlo approach
Contrastive Divergence
 PASS 1: From inputs v, compute the hidden-layer
probabilities h
 PASS 2: Pass those values back down to the visible layer,
and back up to the hidden layer, to get v’ and h’
 Update the weights using the difference between the outer
products of the hidden and visible activations from the
first and second passes (multiplied by some learning rate)
 Note: implementations usually compute this as a matrix (dot)
product over a mini-batch, which is equivalent to summing the
per-example outer products
 To reach the optimal model, an infinite number of
passes would be needed, so this approach performs approximate
inference, but it works well in practice
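A minimal sketch (mine) of a single CD-1 update for an RBM with binary units, following the two passes described above; the layer sizes and mini-batch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(0, 0.01, (n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(probs):
    # A unit turns on if its probability exceeds a uniform random draw
    return (probs > rng.random(probs.shape)).astype(float)

v = rng.integers(0, 2, (10, n_visible)).astype(float)  # mini-batch of binary inputs

# PASS 1: visible -> hidden
h_prob = sigmoid(v @ W + b_h)
h = sample(h_prob)
# PASS 2: back down to the visible layer, then back up to the hidden layer
v_prob = sigmoid(h @ W.T + b_v)
h_prob2 = sigmoid(v_prob @ W + b_h)

# Weight update: difference of the (summed) outer products from the two passes
W += lr * (v.T @ h_prob - v_prob.T @ h_prob2) / len(v)
b_v += lr * (v - v_prob).mean(axis=0)
b_h += lr * (h_prob - h_prob2).mean(axis=0)
```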
Feature Representation

 Once trained, the hidden layer activations of


an RBM can be used as learned features
Auto Encoders

 An auto-encoder is a 3-layer neural network, which is
trained to reconstruct its inputs by using them as the outputs
 It needs to learn features that capture the variance in the
data so that the input can be reproduced
 If only linear activation functions are used, it can be shown
to be equivalent to PCA, and can be used for dimensionality
reduction
 Once trained, the hidden-layer activations are used as the
learned features, and the top layer can be discarded
 However, the auto-encoder will learn the identity function
unless some strategy is used to force it to learn features
from the data
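A minimal sketch (mine) of the auto-encoder just described: a 3-layer network trained to reconstruct its input, with a bottleneck hidden layer so it cannot simply copy the data (one of the strategies covered on the next slide). Data and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 20))                 # illustrative data: 200 examples, 20 features
n_in, n_hid, lr = 20, 8, 0.1

W1, b1 = rng.normal(0, 0.1, (n_in, n_hid)), np.zeros(n_hid)   # encoder
W2, b2 = rng.normal(0, 0.1, (n_hid, n_in)), np.zeros(n_in)    # decoder

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):
    h = sigmoid(X @ W1 + b1)              # hidden code (the learned features)
    X_hat = sigmoid(h @ W2 + b2)          # reconstruction of the input
    err = X - X_hat                       # reconstruction error
    d_out = err * X_hat * (1 - X_hat)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 += lr * h.T @ d_out / len(X); b2 += lr * d_out.mean(axis=0)
    W1 += lr * X.T @ d_h / len(X);   b1 += lr * d_h.mean(axis=0)

features = sigmoid(X @ W1 + b1)           # after training, keep only the encoder
print(np.mean((X - X_hat) ** 2))          # mean reconstruction error
```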
Training Strategies
1. De-noising Auto-Encoders
 Some random noise is added to the input
 The encoder is required to reproduce the original, uncorrupted input
 Hinton’s group recently showed that randomly deactivating inputs (dropout)
during training also improves the generalization performance of regular neural
networks
2. Contractive Auto-Encoders
 Setting the number of nodes in the hidden layer to be much lower than the
number of input nodes forces the network to perform dimensionality reduction
 This prevents it from learning the identity function, as the hidden layer has
insufficient nodes to simply store the input
3. Sparse Auto-Encoders
 A sparsity penalty is applied to the weight update function
 It penalizes the total size of the connection weights, causing most
weights to have small values
Building Deep Networks

 RBMs or Auto-Encoders can be trained layer
by layer
 The features learned from one layer are fed
into the next layer
 The top-layer activations can be treated as
features and fed into any suitable classifier
(RF, SVM, etc.)
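A minimal sketch (mine) of this greedy stacking using scikit-learn: each BernoulliRBM is trained on the outputs of the layer below, and the top-layer features are fed into an SVM. Data and layer sizes are illustrative.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

X = np.random.rand(500, 100)          # inputs scaled to [0, 1]
y = np.random.randint(0, 2, 500)      # labels for the final supervised stage

deep_net = Pipeline([
    ("layer1", BernoulliRBM(n_components=64, n_iter=10)),  # trained on raw inputs
    ("layer2", BernoulliRBM(n_components=32, n_iter=10)),  # trained on layer1's features
    ("svm", SVC(kernel="rbf")),                            # classifier on top-layer features
])
deep_net.fit(X, y)
print(deep_net.score(X, y))
```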
Building Deep Networks

 Alternatively, an additional output layer can
be placed on top, and the network fine-tuned
with back propagation
 Back propagation only works well in deep
networks if the weights are initialized
close to a good solution
 The layer-wise pre-training ensures this
 Many other approaches exist for fine-tuning
deep networks (e.g. dropout, maxout)
Training a Deep Auto-Encoder
from Stacked RBMs – Hinton ’06
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
4. Brief History of Deep Learning
5. Implementation Details
1. RBMs and DBNs
2. Auto-Encoders
6. Deep Learning for NLP
i) Learning Neural Embeddings
ii) Recursive Auto-Encoders
Deep Learning for NLP

 This section will focus primarily on the


ground-breaking work of Richard Socher at
Stanford:
 “Semi-Supervised Recursive Autoencoders for
Predicting Sentiment Distributions” (2011)
 His work builds on top of the neural word
embeddings work performed by Collobert
and Weston (2008)
Word Vectors

 To do NLP with neural networks, words need to be
represented as vectors
 Traditional approach – the “one-hot vector”
 A binary vector
 Length = |vocab|
 1 in the position of the word id, the rest are 0
 However, this does not represent word meaning
 Similar words such as English and French, or cat and dog,
should have similar vector representations
 However, the similarity between any two “one-hot vectors”
is the same
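A minimal sketch (mine) of one-hot word vectors, showing why they carry no notion of similarity: every pair of distinct words has the same (zero) dot product. The tiny vocabulary is illustrative.

```python
import numpy as np

vocab = ["cat", "dog", "english", "french", "the"]
word_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_id[word]] = 1.0      # 1 in the word's position, 0 elsewhere
    return v

print(one_hot("cat"))                          # [1. 0. 0. 0. 0.]
print(one_hot("cat") @ one_hot("dog"))         # 0.0
print(one_hot("english") @ one_hot("french"))  # 0.0 – no more similar than any other pair
```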
Solution:
Distributional Word Vectors
 A word is represented as a distribution over k latent variables
 The distribution is chosen so that similar words have similar
distributions
 Traditional approaches have used various vector space
models:
 Words form the rows
 Columns represent the context (other words occurring within x
words, whole documents, etc.)
 Cells represent co-occurrence (binary values), frequency, tf-idf or
relative distance from the context word
 Dimensionality reduction (PCA, SVD, etc.) is used to reduce the
vector size
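A minimal sketch (mine) of this vector-space approach: build a word-by-context co-occurrence matrix from a toy corpus, then apply a truncated SVD to obtain low-dimensional word vectors.

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug", "a cat and a dog"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window
C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k singular vectors as dense word vectors
U, S, Vt = np.linalg.svd(C)
k = 3
word_vectors = U[:, :k] * S[:k]
print(word_vectors[idx["cat"]])
print(word_vectors[idx["dog"]])   # words in similar contexts should get similar vectors
```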
Neural Word Embeddings

 Various researchers (Bengio, Collobert and
Weston, Hinton) have used neural language
models to develop “word embeddings”
 A language model is a statistical model that
assigns a probability to a word given the
preceding words
 These embeddings have similar properties to
distributional word vectors, but are claimed to be
better representations
Neural Word Embeddings

 Collobert and Weston, 2008 – “A Unified Architecture for
Natural Language Processing”
 They extracted all 11-word n-grams from the whole of Wikipedia
 The middle (6th) word is the target word
 Negative examples are created by replacing the middle word with a
different word chosen at random
 For each word, they randomly initialized a 50-element vector
 The n-grams are then translated into input vectors by
concatenating the corresponding vector for each word
 These are fed into a neural network that is trained to maximize the
difference between the scores it assigns to a valid versus an
invalid word sequence
 Errors are propagated back into the word embeddings
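A minimal sketch (mine) of the training signal just described: score a valid 11-word window and a corrupted copy (middle word replaced at random), and penalize the model unless the valid window outscores the corrupted one by a margin. The linear scorer here is a toy stand-in, not Collobert and Weston's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, window = 1000, 50, 11
embeddings = rng.normal(0, 0.1, (vocab_size, dim))   # one 50-d vector per word
w_score = rng.normal(0, 0.1, window * dim)           # toy linear scorer

def score(word_ids):
    # Concatenate the words' vectors and score the whole window
    x = embeddings[word_ids].reshape(-1)
    return float(w_score @ x)

valid = rng.integers(0, vocab_size, window)          # a window from the corpus
corrupt = valid.copy()
corrupt[window // 2] = rng.integers(0, vocab_size)   # replace the middle word

# Margin ranking loss: zero once the valid window outscores the corrupt one by 1
loss = max(0.0, 1.0 - score(valid) + score(corrupt))
print(loss)   # errors from this loss would be backpropagated into `embeddings`
```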
Results

 Example words with their 10 nearest


neighbors according to the embeddings:
A Unified Architecture for
NLP
 Using a very complex, deep architecture, Collobert and
Weston were able to train a single deep model to do:
 NER (Named Entity Recognition)
 POS tagging
 Chunking (shallow parsing)
 Parsing
 SRL (Semantic Role Labeling)
 The model is too complex to cover here
 No hand-engineered features were used
 It achieved at or near the SOTA in each of the
above domains
Recursive Auto-Encoders

 Using the Neural Language Model technique


to learn word vectors, Richard Socher
developed a deep architecture for NLP
 His architecture was applied to sentiment
analysis, but can be used for nearly any text
classification problem
Recursive Auto-Encoders

 Each sentence is reduced to a single 50-element
vector as follows:
 Each sentence of length n is mapped into n 50-
element word vectors using neural word embeddings
 For each bi-gram in the sentence, concatenate the
two word vectors and feed them into a contractive auto-
encoder – 100 inputs, 50 outputs
 Take the bi-gram with the lowest reconstruction
error, and replace it with the output of the auto-
encoder
 Repeat until you have one 50-element vector
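A minimal sketch (mine) of this greedy merging loop. The encoder and decoder weights here are random toy stand-ins; in Socher's model they are trained to minimize reconstruction (and classification) error.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
W_enc = rng.normal(0, 0.1, (dim, 2 * dim))   # 100 inputs -> 50 outputs
W_dec = rng.normal(0, 0.1, (2 * dim, dim))   # 50 -> 100, for reconstruction

def encode(pair):
    return np.tanh(W_enc @ pair)

def reconstruction_error(pair):
    return float(np.sum((W_dec @ encode(pair) - pair) ** 2))

# A sentence as a list of (toy) 50-element word vectors
sentence = [rng.normal(0, 1, dim) for _ in range(6)]

while len(sentence) > 1:
    # Score every adjacent bi-gram by its reconstruction error
    pairs = [np.concatenate([sentence[i], sentence[i + 1]]) for i in range(len(sentence) - 1)]
    errors = [reconstruction_error(p) for p in pairs]
    best = int(np.argmin(errors))
    # Replace the best bi-gram with the auto-encoder's 50-element output
    sentence[best:best + 2] = [encode(pairs[best])]

print(sentence[0].shape)   # a single 50-element vector for the whole sentence
```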
The Recursive Auto-Encoder
Semi-Supervised Training

 A greedy algorithm
 Can be viewed as constructing a binary parse
tree with the lowest reconstruction error
 The auto-encoder is trained with two objective
functions:
 1. Minimize the reconstruction error
 2. Minimize the classification error of a softmax layer
 The output at each level of the tree is fed into a
softmax neural network layer, trained on
labeled data
Semi-Supervised Training

 The cost function minimizes both the reconstruction
error of the input vectors and the classification error
of the softmax classifier on labeled data
 The sentence is then classified by feeding the top-
level auto-encoder output into the softmax classifier
 Can use either:
 1. Static Collobert and Weston neural word embeddings
 2. Its own embeddings, learned using back propagation
through structure to propagate errors back into the word
embeddings matrix
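A hedged sketch (mine) of what such a joint objective can look like: a weighted sum of the reconstruction error and the softmax cross-entropy on labeled data. The particular alpha weighting is an illustrative assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_loss(pair, reconstruction, class_scores, label, alpha=0.2):
    # Reconstruction term: how well the auto-encoder reproduces its input
    rec_err = np.sum((reconstruction - pair) ** 2)
    # Supervised term: cross-entropy of the softmax layer on the true label
    probs = softmax(class_scores)
    ce_err = -np.log(probs[label])
    # Weighted combination of the unsupervised and supervised objectives
    return alpha * rec_err + (1.0 - alpha) * ce_err

# Toy values just to show the call
pair = np.ones(100); reconstruction = 0.9 * np.ones(100)
print(joint_loss(pair, reconstruction, np.array([1.0, -0.5]), label=1))
```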
Results

 SOTA results on standard sentiment analysis
datasets
 In our current research on automated essay
annotation, this algorithm out-performed other
approaches considerably:
 Logistic Regression using bag-of-words (binary vectors):
 F1 of 0.62
 RAE, using default parameters:
 F1 of 0.71
 My current best non-deep-learning approach:
 F1 of 0.66
 Also uses a (much simpler) word vector composition model
Some Criticisms of RAE

 It is considered a deep learning approach
because the auto-encoder forms a deep
network with itself when parsing a sentence
 It only uses one auto-encoder, and thus fails to
exploit the hierarchical composition of features
present in other deep networks
 50 hidden neurons × (100 inputs + bias)
 Thus only 5,050 parameters (weights)
 Probably insufficient to model the English
language!
Disadvantages of Deep
Learning
 Very slow to train
 Availability of algorithms – lots of Python
implementations, pretty rare in other languages (e.g. R)
 Models are very complex, with lots of parameters to
optimize:
 Initialization of weights
 Layer-wise training algorithm (RBM, AE, several others)
 Neural architecture
 Number of layers
 Size of layers
 Type – regular, pooling, max pooling, softmax
 Fine-tuning using back prop, or feeding the outputs into a different
classifier
Disadvantages of Deep
Learning
 Steep learning curve
 Some problems are more amenable to deep learning
than others
 Simpler models may be sufficient for certain
problem domains
 Regression models?
 Unless you are working with images, the models
are very hard to explain (compared with a
decision tree)
 What does neuron 524 do?
Useful Deep Learning Links
 Deeplearning.net:
 Code, tutorials, papers
 http://deeplearning.net/
 Theano (CUDA + Python also):
 Comprehensive tutorials
 Symbolic programming (like SymPy) can be a little confusing
 http://deeplearning.net/software/theano/
 Toronto group’s code (CUDA + Python):
 Easier to understand than Theano
 https://github.com/nitishsrivastava/deepnet
 www.socher.org
 All of Richard Socher’s research papers and code (mainly Matlab, some Java)
 Links to his tutorials on YouTube on Deep Learning and NLP
 The SENNA system developed by Collobert and Weston
 http://ronan.collobert.com/senna/
 A pretty complete NLP system (for download) that uses Deep Learning to perform
NER, POS tagging, parsing, chunking and SRL
 Contains the word embeddings file, so you can use their word embeddings in your
own applications