
DEEP LEARNING
And Deep Networks for Natural Language Processing
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
4. Brief History of Deep Learning
5. Implementation Details
1. RBMs and DBNs
2. Auto-Encoders
6. Deep Learning for NLP
i) Learning Neural Embeddings
ii) Recursive Auto-Encoders
Aims of Talk
 Provide a comprehensible introduction to Deep
Learning for the uninitiated
 Give an overview of how deep learning can be
applied to NLP
 Provide an understanding of the justification for
deep learning and the approaches used
 Illustrate the type of problems it can be used to solve
What I am Not
 An expert in Deep Learning

What this Talk is Not


 Deep exploration of the mathematics behind
some of the deep learning models (although
some basic-intermediate math is covered)
 An extensive explanation of neural networks -
some knowledge is assumed
However
 Some of this stuff can be confusing \ complex

So……
 Please feel free to ask sensible questions during the talk
for clarification if needed

And
 I have an accent, so let me know if you have trouble
understanding the Queen’s English
Overview of the Talk

1. Overview of Deep Learning


Deep Learning – WTF?

 Learning deep (many layered) neural networks


 The more layers in a neural network, the more
abstract the features it can represent
 E.g. classify a cat:
 Bottom Layers: edge detectors, curves, corners,
straight lines
 Middle Layers: fur patterns, eyes, ears
 Higher Layers: body, head, legs
 Top Layer: cat or dog
Deep Learning – WTF?

 Real-world information has a hierarchical
structure and cannot easily be modeled by a
neural network with only 3 layers
 The human brain is itself a deep neural network:
it has many layers of neurons which act as
feature detectors, detecting more and more
abstract features as you go up
Deep Learning – WTF?

 The traditional approach is to use back
propagation to train multiple layers
 However, back propagation does not work
well over multiple layers and does not scale
well
 Back propagation cannot leverage unlabelled
data
 Recent advances in deep learning attempt to
address these shortcomings
Deep-Learning is Typically –

 1. Layer-wise, bottom-up pre-training of
unsupervised neural networks (auto-encoders,
RBMs)
 2. Supervised training on labeled data, using
either:
 i) The features learned in step 1, fed into a
classifier, e.g. an SVM
 ii) An additional output layer placed on top to form a
feed-forward network, which is then trained using back
prop on labeled data
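As a rough illustration of option i) above, here is a minimal sketch (mine, not from the original slides) using scikit-learn: a single BernoulliRBM is pre-trained on the inputs and its learned features are fed into a logistic regression classifier. The data shapes and hyperparameters are placeholders.

```python
# Minimal sketch of approach i): unsupervised feature learning with an RBM,
# followed by a supervised classifier trained on those features.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.random.rand(1000, 784)        # illustrative inputs scaled to [0, 1]
y = np.random.randint(0, 10, 1000)   # labels for the supervised stage

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)   # the RBM learns features; logistic regression learns on them
```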
Huh?....

 Don’t worry, we’ll come back to that


shortly….
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
Why? –
Achieved State of the Art in
a Number of Different Areas
 Language Modeling (2012, Mikolov et al)
 Image Recognition (Krizhevsky won 2012 ImageNet
competition)
 Sentiment Classification (2011, Socher et al)
 Speech Recognition (2010, Dahl et al)
 MNIST hand-written digit recognition (Ciresan et al,
2010)
 Andrew Ng – Machine Learning Professor, Stanford:
 “I’ve worked all my life in Machine Learning, and I’ve never
seen one algorithm knock over benchmarks like Deep Learning”
Question: What do these Problems have in Common?
Application Areas

 Typically applied to image and speech
recognition, and NLP
 Each is a non-linear classification problem
where the inputs are highly hierarchical in
nature (language, images, etc.)
 The world has a hierarchical structure – Jeff
Hawkins, On Intelligence
 Problems that humans excel at and machines
do very poorly on
Deep vs Shallow Networks

 Given the same number of non-linear (neural
network) units, a deep architecture is more
expressive than a shallow one (Bishop 1995)
 Two-layer (plus input layer) neural networks
have been shown to be able to approximate
any function
 However, functions compactly represented in
k layers may require exponential size when
expressed in 2 layers
[Figure: a deep network vs. a shallow network]

Shallow (2-layer) networks need many more
hidden-layer nodes to compensate for their lack of
expressivity

In a deep network, higher levels can express combinations
of the features learned at lower levels
Traditional Supervised
Machine Learning Approach
 For each new problem:
 Gather as much LABELED data as you can get \
handle
 Throw a bunch of algorithms at it (after trying RF \
SVM .. insert favorite algo here)
 Pick the best
 Spend hours hand engineering some features \
doing feature selection \ dimensionality reduction
(PCA, SVD, etc)
 RINSE AND REPEAT…..
Biological Justification
 This is NOT how humans learn
 Humans learn facts and skills and apply them to different
problem areas
 -> Transfer Learning
 Humans first learn simple concepts, and then learn more
complex ideas by combining simpler concepts
 There is evidence that the cortex has a single learning algorithm:
 Inputs from the optic nerves of ferrets were rerouted into their auditory
cortex
 They were able to learn to see with their auditory cortex instead
 If we want a general learning algorithm, it needs to be able to:
 Work with any type of data
 Extract its own features
 Transfer what it has learned to new domains
 Perform multi-modal learning – simultaneously learn from multiple
types of input
Unsupervised Training
 Far more un-labeled data in the world (i.e. online)
than labeled data:
 Websites
 Books
 Videos
 Pictures
 Deep networks take advantage of unlabelled
data by learning good representations of the
data through unsupervised learning
 Humans learn initially from unlabelled examples
 Babies learn to talk without labeled data
Unsupervised Feature
Learning
 Learning features that represent the data
allows them to be used to train a supervised
classifier
 As the features are learned in an unsupervised
way from a different and larger dataset, there is
less risk of over-fitting
 No need for manual feature engineering
 (e.g. the Kaggle Salary Prediction contest)
 Latent features are learned that attempt to
explain the data
Unsupervised Learning -
Distributed Representations
 Approaches to unsupervised learning of
features fall into two categories:
 Local Representations (hard clustering)
 Distributed Representations (soft \ fuzzy
clustering)
 Hard clustering approaches (e.g. k-means,
DBSCAN) - learn to map a set of data points
to individual clusters
Distributed Representations

 Fuzzy clustering, dimensionality reduction


approaches (SVD, PCA), topic modeling (LDA)
and unsupervised feature learning with neural
networks learn distributed representations
 Assumes that the data can be explained by the
interaction of many different unobserved factors
 Unseen configurations of these factors can more
effectively explain unseen data
 Far fewer features are needed to describe the space,
as they can be combined in many different ways
[Figures: local vs. distributed representations]
Hierarchical Representations
 These factors are organized into multiple levels
 Each level creates new features from
combinations of features from the level below
 Each level is more abstract than the ones below
 Hierarchies of distributed representations
attempt to solve the “Curse of Dimensionality”
by learning the underlying latent variables that
cause the variability in the data
Discriminative Vs Generative
Models
 2 types of classification algorithms:
 1. Generative – model the joint distribution
 p(Class, Data)
 E.g. NB, HMM, RBM (see later), LDA
 2. Discriminative – model the conditional distribution
 p(Class | Data)
 E.g. Decision Trees, SVMs, neural networks, Linear
Regression, Logistic Regression
Discriminative Vs Generative
Models
 Discriminative models tend to give better classification
accuracy
 BUT are more prone to over-fitting (that again…)
 Generative models can be used to derive conditional models:

 p(A | B) = p(A, B) / p(B)

 Generative models can also generate samples of data
according to the distribution of the training data (hence the
name), i.e. they learn to model the data distribution, not
p(Class | Data)
Discriminative + Generative
Model –>
Semi-Supervised Learning
 In deep learning, a generative model (RBM, Auto-Encoder)
is learned from the data
 The generative model maximizes the likelihood of the data, p(Data)
 Then a discriminative classifier is trained using the features
learned from the generative model
 This maximizes the posterior, p(Class | Data)
 Popular discriminative classifiers used:
 NNet softmax layer
 SVM
 Logistic Regression
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
Neural Networks – Very Brief
Primer
1. Activation Function
2. Back Propagation
3. Gradient Descent
Activation Function

 For each neuron, sum the inputs multiplied by
their weights, and add the bias
 The result is passed through an activation
function, whose output feeds the next layer
 Non-linearity is needed to learn non-linear
functions
 Typically the sigmoid function is used (as in logistic
regression)
 The hyperbolic tangent is also popular; it is zero-centered
and has a steeper gradient around zero
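A minimal sketch (mine, assuming NumPy) of the activation computation just described: a weighted sum of the inputs plus a bias, passed through a sigmoid or tanh non-linearity.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_activation(x, w, b, fn=sigmoid):
    # Weighted sum of inputs plus bias, passed through the activation function
    return fn(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs (illustrative values)
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.05                         # bias
print(neuron_activation(x, w, b))           # sigmoid output in (0, 1)
print(neuron_activation(x, w, b, np.tanh))  # tanh output in (-1, 1)
```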
[Figures: the sigmoid function and other activation functions]
Back Propagation 101

 Target = y
 Learn y = f(x)
 For each neuron:
 Activation <- sum the inputs, add the bias, and apply a sigmoid
function (tanh, logistic, etc.) as the activation function
 Activations propagate forward through the layers
 Output layer: compute the error for each neuron:
 Error = y – f(x)
 Update the weights using the derivative of the error
 Backwards-propagate the error derivatives through the
hidden layers
[Figure: back propagation of errors through the network]
Gradient Descent
 Weights are updated using the partial derivative of
the error with respect to each weight
 The derivative pushes learning down the direction of
steepest descent on the error surface
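To make the last two slides concrete, here is a minimal sketch (mine, not from the talk) of back propagation with gradient descent on a tiny one-hidden-layer network with sigmoid units and squared error; the data and hyperparameters are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))                         # 100 examples, 3 inputs
y = (X.sum(axis=1, keepdims=True) > 1.5) * 1.0   # toy binary target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(0, 0.1, (3, 5)), np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(0, 0.1, (5, 1)), np.zeros(1)   # hidden -> output
lr = 0.5

for epoch in range(1000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Output-layer error and its derivative through the sigmoid
    err = y - out
    d_out = err * out * (1 - out)
    # Backwards-propagate the error derivatives to the hidden layer
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent weight updates (steepest descent on the squared error)
    W2 += lr * h.T @ d_out / len(X); b2 += lr * d_out.mean(axis=0)
    W1 += lr * X.T @ d_h / len(X);   b1 += lr * d_h.mean(axis=0)

print(float(np.mean((out > 0.5) == y)))   # training accuracy on the toy data
```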
Drawbacks - Backpropagation

 Needs labeled data (most data is not labeled)
 Scalability – does not scale well over multiple layers
 Very slow to converge
 “Vanishing gradients” problem: errors shrink
exponentially with the number of layers
 This makes poor use of many layers
 This is the reason most feed-forward neural networks
have only 3 layers
 For more, see “Understanding the Difficulty of Training
Deep Feed Forward Neural Networks”:
http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_GlorotB10.pdf
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
4. Brief History of Deep Learning
Brief History of Deep
Learning
 See: http://www.ipam.ucla.edu/publications/gss2012/gss2012_10596.pdf

 1960s – Perceptron invented (a single neuron)
 1960s – Papert and Minsky prove that perceptrons can only learn
to model linearly separable functions. Interest in perceptrons
rapidly declines.
 1970s-1980s – Back propagation (BP) invented for training
multiple layers of non-linear features. Leads to a resurgence of
interest in neural networks
 BP takes errors from the output layer and propagates them back through
the hidden layer(s)
 1990s – Many researchers gave up on BP as it could not make
effective use of multiple hidden layers
 1990s-present: Simpler, faster models, such as SVMs, came to
dominate the field
Brief History of Deep
Learning (cont…)
 Mid 2000s – Geoffrey Hinton makes a
breakthrough, training deep belief networks by:
 Stacking RBMs on top of one another – a deep belief
network
 Training layer by layer on unlabeled data
 Using back prop to fine-tune the weights on labeled data
 Bengio et al, 2006 – examined deep auto-
encoders as an alternative to deep Boltzmann
machines
 Easier to train
Enabling Factors
 Training of deep networks was made computationally
feasible by:
 Faster CPUs
 The move to parallel CPU architectures
 The advent of GPU computing

 Neural networks are often represented as matrices of
weight vectors
 GPUs are optimized for very fast matrix multiplication
 2008 – Nvidia’s CUDA library for GPU computing is
released
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
4. Brief History of Deep Learning
5. Implementation Details:
1. RBMs and DBNs
2. Auto-Encoders
Implementation

 Most current architectures consist of learning
layers of RBMs or Auto-Encoders
 Both are 2-layer neural networks that learn to
model their inputs
 Key difference:
 RBMs model their inputs as a probability
distribution
 Auto-Encoders learn to reproduce their inputs as their
outputs
Restricted Boltzmann
Machines (RBM’s)
 A two-layer, undirected (bi-directional) neural network:
 Visible layer
 Hidden layer
 Connections run visible to hidden
 No connections within each layer
 Trained to maximize the expected log probability of the
data
 For the physicists \ chemists: ‘Boltzmann’ because they minimize
the energy of the data (which equates to maximizing the
probability)
 Inputs are binary vectors (as it learns a Bernoulli distribution
over each input)
RBM Structure – Bipartite Graph
Activation Function

 The activation function is computed the same way
as in a regular neural network
 The logistic function is usually used (0-1)
 However, the output is treated as a probability,
and each neuron is activated only if its activation
exceeds a uniform random value in (0, 1)
 Hidden-layer neurons take the visible units as inputs
 Visible neurons take binary input vectors as initial
input, then the hidden-layer probabilities (during
Gibbs sampling – next slide)
Training Procedure –
Contrastive Divergence
 Remarkably simple
 Performs Gibbs sampling (an MCMC technique)
 Equates to estimating the model’s probability
distribution using a Markov Chain Monte
Carlo approach
Contrastive Divergence
 PASS 1: From inputs v, compute the hidden-layer
probabilities h
 PASS 2: Pass those values back down to the visible layer,
and back up to the hidden layer, to get v’ and h’
 Update the weights using the difference between the outer
products of the hidden and visible activations from the
first and second passes (multiplied by some learning rate)
 Note: implementations usually compute this as a matrix (dot)
product over a mini-batch, which is equivalent to summing the
per-example outer products
 To reach the optimal model, an infinite number of
passes would be needed, so this approach performs approximate
inference, but it works well in practice
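A minimal sketch (mine) of a single CD-1 update for an RBM with binary units, following the two passes described above; the layer sizes and mini-batch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(0, 0.01, (n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(probs):
    # A unit turns on if its probability exceeds a uniform random draw
    return (probs > rng.random(probs.shape)).astype(float)

v = rng.integers(0, 2, (10, n_visible)).astype(float)  # mini-batch of binary inputs

# PASS 1: visible -> hidden
h_prob = sigmoid(v @ W + b_h)
h = sample(h_prob)
# PASS 2: back down to the visible layer, then back up to the hidden layer
v_prob = sigmoid(h @ W.T + b_v)
h_prob2 = sigmoid(v_prob @ W + b_h)

# Weight update: difference of the (summed) outer products from the two passes
W += lr * (v.T @ h_prob - v_prob.T @ h_prob2) / len(v)
b_v += lr * (v - v_prob).mean(axis=0)
b_h += lr * (h_prob - h_prob2).mean(axis=0)
```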
Feature Representation

 Once trained, the hidden layer activations of


an RBM can be used as learned features
Auto Encoders

 An auto-encoder is a 3-layer neural network, which is
trained to reconstruct its inputs by using them as the outputs
 It needs to learn features that capture the variance in the
data so that the input can be reproduced
 If only linear activation functions are used, it can be shown
to be equivalent to PCA, and can be used for dimensionality
reduction
 Once trained, the hidden-layer activations are used as the
learned features, and the top layer can be discarded
 However, the auto-encoder will learn the identity function
unless some strategy is used to force it to learn features
from the data
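A minimal sketch (mine) of the auto-encoder just described: a 3-layer network trained to reconstruct its input, with a bottleneck hidden layer so it cannot simply copy the data (one of the strategies covered on the next slide). Data and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 20))                 # illustrative data: 200 examples, 20 features
n_in, n_hid, lr = 20, 8, 0.1

W1, b1 = rng.normal(0, 0.1, (n_in, n_hid)), np.zeros(n_hid)   # encoder
W2, b2 = rng.normal(0, 0.1, (n_hid, n_in)), np.zeros(n_in)    # decoder

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):
    h = sigmoid(X @ W1 + b1)              # hidden code (the learned features)
    X_hat = sigmoid(h @ W2 + b2)          # reconstruction of the input
    err = X - X_hat                       # reconstruction error
    d_out = err * X_hat * (1 - X_hat)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 += lr * h.T @ d_out / len(X); b2 += lr * d_out.mean(axis=0)
    W1 += lr * X.T @ d_h / len(X);   b1 += lr * d_h.mean(axis=0)

features = sigmoid(X @ W1 + b1)           # after training, keep only the encoder
print(np.mean((X - X_hat) ** 2))          # mean reconstruction error
```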
Training Strategies
1. De-noising Auto-Encoders
 Some random noise is added to the input
 The encoder is required to reproduce the original, uncorrupted input
 Hinton’s group recently showed that randomly deactivating inputs (dropout)
during training also improves the generalization performance of regular neural
networks
2. Contractive Auto-Encoders
 Setting the number of nodes in the hidden layer to be much lower than the
number of input nodes forces the network to perform dimensionality reduction
 This prevents it from learning the identity function, as the hidden layer has
insufficient nodes to simply store the input
3. Sparse Auto-Encoders
 A sparsity penalty is applied to the weight update function
 It penalizes the total size of the connection weights, causing most
weights to have small values
Building Deep Networks

 RBMs or Auto-Encoders can be trained layer
by layer
 The features learned from one layer are fed
into the next layer
 The top-layer activations can be treated as
features and fed into any suitable classifier
(RF, SVM, etc.)
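A minimal sketch (mine) of this greedy stacking using scikit-learn: each BernoulliRBM is trained on the outputs of the layer below, and the top-layer features are fed into an SVM. Data and layer sizes are illustrative.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

X = np.random.rand(500, 100)          # inputs scaled to [0, 1]
y = np.random.randint(0, 2, 500)      # labels for the final supervised stage

deep_net = Pipeline([
    ("layer1", BernoulliRBM(n_components=64, n_iter=10)),  # trained on raw inputs
    ("layer2", BernoulliRBM(n_components=32, n_iter=10)),  # trained on layer1's features
    ("svm", SVC(kernel="rbf")),                            # classifier on top-layer features
])
deep_net.fit(X, y)
print(deep_net.score(X, y))
```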
Building Deep Networks

 Alternatively, an additional output layer can
be placed on top, and the network fine-tuned
with back propagation
 Back propagation only works well in deep
networks if the weights are initialized
close to a good solution
 The layer-wise pre-training ensures this
 Many other approaches exist for fine-tuning
deep networks (e.g. dropout, maxout)
Training a Deep Auto-Encoder
from Stacked RBMs – Hinton ’06
Overview of the Talk

1. Overview of Deep Learning


2. Justification \ Properties of Deep Learning
3. Neural Networks 101
4. Brief History of Deep Learning
5. Implementation Details
1. RBMs and DBNs
2. Auto-Encoders
6. Deep Learning for NLP
i) Learning Neural Embeddings
ii) Recursive Auto-Encoders
Deep Learning for NLP

 This section will focus primarily on the


ground-breaking work of Richard Socher at
Stanford:
 “Semi-Supervised Recursive Autoencoders for
Predicting Sentiment Distributions” (2011)
 His work builds on top of the neural word
embeddings work performed by Collobert
and Weston (2008)
Word Vectors

 To do NLP with neural networks, words need to be
represented as vectors
 Traditional approach – the “one-hot vector”
 A binary vector
 Length = |vocab|
 1 in the position of the word id, the rest are 0
 However, this does not represent word meaning
 Similar words such as English and French, or cat and dog,
should have similar vector representations
 However, the similarity between any two “one-hot vectors”
is the same
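A minimal sketch (mine) of one-hot word vectors, showing why they carry no notion of similarity: every pair of distinct words has the same (zero) dot product. The tiny vocabulary is illustrative.

```python
import numpy as np

vocab = ["cat", "dog", "english", "french", "the"]
word_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_id[word]] = 1.0      # 1 in the word's position, 0 elsewhere
    return v

print(one_hot("cat"))                          # [1. 0. 0. 0. 0.]
print(one_hot("cat") @ one_hot("dog"))         # 0.0
print(one_hot("english") @ one_hot("french"))  # 0.0 – no more similar than any other pair
```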
Solution:
Distributional Word Vectors
 A word is represented as a distribution over k latent variables
 The distribution is chosen so that similar words have similar
distributions
 Traditional approaches have used various vector space
models:
 Words form the rows
 Columns represent the context (other words occurring within x
words, whole documents, etc.)
 Cells represent co-occurrence (binary values), frequency, tf-idf or
relative distance from the context word
 Dimensionality reduction (PCA, SVD, etc.) is used to reduce the
vector size
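A minimal sketch (mine) of this vector-space approach: build a word-by-context co-occurrence matrix from a toy corpus, then apply a truncated SVD to obtain low-dimensional word vectors.

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug", "a cat and a dog"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window
C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k singular vectors as dense word vectors
U, S, Vt = np.linalg.svd(C)
k = 3
word_vectors = U[:, :k] * S[:k]
print(word_vectors[idx["cat"]])
print(word_vectors[idx["dog"]])   # words in similar contexts should get similar vectors
```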
Neural Word Embeddings

 Various researchers (Bengio, Collobert and
Weston, Hinton) have used neural language
models to develop “word embeddings”
 A language model is a statistical model that
assigns a probability to a word given the
preceding words
 These embeddings have similar properties to
distributional word vectors, but are claimed to be
better representations
Neural Word Embeddings

 Collobert and Weston, 2008 – “A Unified Architecture for
Natural Language Processing”
 They extracted all 11-word n-grams from the whole of Wikipedia
 The middle (6th) word is the target word
 Negative examples are created by replacing the middle word with a
different word chosen at random
 For each word, they randomly initialized a 50-element vector
 The n-grams are then translated into input vectors by
concatenating the corresponding vector for each word
 These are fed into a neural network that is trained to maximize the
difference between the scores it assigns to a valid versus an
invalid word sequence
 Errors are propagated back into the word embeddings
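A minimal sketch (mine) of the training signal just described: score a valid 11-word window and a corrupted copy (middle word replaced at random), and penalize the model unless the valid window outscores the corrupted one by a margin. The linear scorer here is a toy stand-in, not Collobert and Weston's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, window = 1000, 50, 11
embeddings = rng.normal(0, 0.1, (vocab_size, dim))   # one 50-d vector per word
w_score = rng.normal(0, 0.1, window * dim)           # toy linear scorer

def score(word_ids):
    # Concatenate the words' vectors and score the whole window
    x = embeddings[word_ids].reshape(-1)
    return float(w_score @ x)

valid = rng.integers(0, vocab_size, window)          # a window from the corpus
corrupt = valid.copy()
corrupt[window // 2] = rng.integers(0, vocab_size)   # replace the middle word

# Margin ranking loss: zero once the valid window outscores the corrupt one by 1
loss = max(0.0, 1.0 - score(valid) + score(corrupt))
print(loss)   # errors from this loss would be backpropagated into `embeddings`
```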
Results

 Example words with their 10 nearest


neighbors according to the embeddings:
A Unified Architecture for
NLP
 Using a very complex, deep architecture, Collobert and
Weston were able to train a single deep model to do:
 NER (Named Entity Recognition)
 POS tagging
 Chunking (shallow parsing)
 Parsing
 SRL (Semantic Role Labeling)
 The model is too complex to cover here
 No hand-engineered features were used
 It achieved at or near the SOTA in each of the
above domains
Recursive Auto-Encoders

 Using the Neural Language Model technique


to learn word vectors, Richard Socher
developed a deep architecture for NLP
 His architecture was applied to sentiment
analysis, but can be used for nearly any text
classification problem
Recursive Auto-Encoders

 Each sentence is reduced to a single 50-element
vector as follows:
 Each sentence of length n is mapped into n 50-
element word vectors using neural word embeddings
 For each bi-gram in the sentence, concatenate the
two word vectors and feed them into a contractive auto-
encoder – 100 inputs, 50 outputs
 Take the bi-gram with the lowest reconstruction
error, and replace it with the output of the auto-
encoder
 Repeat until you have one 50-element vector
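A minimal sketch (mine) of this greedy merging loop. The encoder and decoder weights here are random toy stand-ins; in Socher's model they are trained to minimize reconstruction (and classification) error.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
W_enc = rng.normal(0, 0.1, (dim, 2 * dim))   # 100 inputs -> 50 outputs
W_dec = rng.normal(0, 0.1, (2 * dim, dim))   # 50 -> 100, for reconstruction

def encode(pair):
    return np.tanh(W_enc @ pair)

def reconstruction_error(pair):
    return float(np.sum((W_dec @ encode(pair) - pair) ** 2))

# A sentence as a list of (toy) 50-element word vectors
sentence = [rng.normal(0, 1, dim) for _ in range(6)]

while len(sentence) > 1:
    # Score every adjacent bi-gram by its reconstruction error
    pairs = [np.concatenate([sentence[i], sentence[i + 1]]) for i in range(len(sentence) - 1)]
    errors = [reconstruction_error(p) for p in pairs]
    best = int(np.argmin(errors))
    # Replace the best bi-gram with the auto-encoder's 50-element output
    sentence[best:best + 2] = [encode(pairs[best])]

print(sentence[0].shape)   # a single 50-element vector for the whole sentence
```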
The Recursive Auto-Encoder
Semi-Supervised Training

 A greedy algorithm
 Can be viewed as constructing a binary parse
tree with the lowest reconstruction error
 The auto-encoder is trained with two objective
functions:
 1. Minimize the reconstruction error
 2. Minimize the classification error of a softmax layer
 The output at each level of the tree is fed into a
softmax neural network layer, trained on
labeled data
Semi-Supervised Training

 The cost function minimizes both the reconstruction
error of the input vectors and the classification error
of the softmax classifier on labeled data
 The sentence is then classified by feeding the top-
level auto-encoder output into the softmax classifier
 Can use either:
 1. Static Collobert and Weston neural word embeddings
 2. Its own embeddings, learned using back propagation
through structure to propagate errors back into the word
embeddings matrix
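A hedged sketch (mine) of what such a joint objective can look like: a weighted sum of the reconstruction error and the softmax cross-entropy on labeled data. The particular alpha weighting is an illustrative assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_loss(pair, reconstruction, class_scores, label, alpha=0.2):
    # Reconstruction term: how well the auto-encoder reproduces its input
    rec_err = np.sum((reconstruction - pair) ** 2)
    # Supervised term: cross-entropy of the softmax layer on the true label
    probs = softmax(class_scores)
    ce_err = -np.log(probs[label])
    # Weighted combination of the unsupervised and supervised objectives
    return alpha * rec_err + (1.0 - alpha) * ce_err

# Toy values just to show the call
pair = np.ones(100); reconstruction = 0.9 * np.ones(100)
print(joint_loss(pair, reconstruction, np.array([1.0, -0.5]), label=1))
```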
Results

 SOTA results on standard sentiment analysis
datasets
 In our current research on automated essay
annotation, this algorithm out-performed other
approaches considerably:
 Logistic Regression using bag-of-words (binary vectors):
 F1 of 0.62
 RAE, using default parameters:
 F1 of 0.71
 My current best non-deep-learning approach:
 F1 of 0.66
 Also uses a (much simpler) word vector composition model
Some Criticisms of RAE

 It is considered a deep learning approach
because the auto-encoder forms a deep
network with itself when parsing a sentence
 It only uses one auto-encoder, and thus fails to
exploit the hierarchical composition of features
present in other deep networks
 50 hidden neurons × (100 inputs + bias)
 Thus only 5,050 parameters (weights)
 Probably insufficient to model the English
language!
Disadvantages of Deep
Learning
 Very slow to train
 Availability of algorithms – lots of Python
implementations, pretty rare in other languages (e.g. R)
 Models are very complex, with lots of parameters to
optimize:
 Initialization of weights
 Layer-wise training algorithm (RBM, AE, several others)
 Neural architecture
 Number of layers
 Size of layers
 Type – regular, pooling, max pooling, softmax
 Fine-tuning using back prop, or feeding the outputs into a different
classifier
Disadvantages of Deep
Learning
 Steep learning curve
 Some problems are more amenable to deep learning
than others
 Simpler models may be sufficient for certain
problem domains
 Regression models?
 Unless you are working with images, the models
are very hard to explain (compared with a
decision tree)
 What does neuron 524 do?
Useful Deep Learning Links
 Deeplearning.net:
 Code, tutorials, papers
 http://deeplearning.net/
 Theano (CUDA + Python also):
 Comprehensive tutorials
 Symbolic programming (like SymPy) can be a little confusing
 http://deeplearning.net/software/theano/
 Toronto group’s code (CUDA + Python):
 Easier to understand than Theano
 https://github.com/nitishsrivastava/deepnet
 www.socher.org
 All of Richard Socher’s research papers and code (mainly Matlab, some Java)
 Links to his tutorials on YouTube on Deep Learning and NLP
 The SENNA system developed by Collobert and Weston
 http://ronan.collobert.com/senna/
 A pretty complete NLP system (for download) that uses Deep Learning to perform
NER, POS tagging, parsing, chunking and SRL
 Contains the word embeddings file, so you can use their word embeddings in your
own applications