
CCS369 TEXT AND SPEECH ANALYSIS

By
C.Jerin Mahibha
Assoc.Prof / CSE
UNIT II TEXT CLASSIFICATION
Vector Semantics and Embeddings -Word Embeddings - Word2Vec model – Glove model –
FastText model – Overview of Deep Learning models – RNN – Transformers – Overview of
Text summarization and Topic Models

COURSE OBJECTIVES:
Apply classification algorithms to text documents
COURSE OUTCOME:
CO2: Apply deep learning techniques for NLP tasks, language modelling and machine
translation

Text Book: "Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition" by Daniel Jurafsky and James
H. Martin - Chapter 6
Vector Semantics and Embeddings
Vector Semantics
• Is the standard way to represent word meaning in NLP
• Helps to model many of the aspects of word meaning
• Representations of the meaning of words- embeddings
• Computed directly from the word distributions in texts
• Used in every natural language processing application
• The roots of the model – convergence of two big ideas
 use a point in three-dimensional space to represent the connotation of a word (Osgood)
 to define the meaning of a word by its distribution in language use - neighboring words or grammatical environments (Joos, Harris,
and Firth)
• idea was that two words that occur in very similar distributions have similar meanings (whose neighboring words are similar)

Contexts in which ongchoi occurs:
Ongchoi is delicious sauteed with garlic... / Ongchoi is superb over rice... / Ongchoi leaves with salty sauces...
Contexts in which other words occur:
...spinach sauteed with garlic over rice... / ...chard stems and leaves are delicious... / ...collard greens and other salty leafy greens...
ongchoi occurs with words like rice, garlic, delicious and salty, as do words like spinach, chard, and collard
• spinach, chard, and collard are leafy greens
• so ongchoi is probably also a leafy green
computationally implemented - by counting words in the context of ongchoi
• vector semantics - represent a word as a point in a multidimensional
semantic space that is derived from the distributions of word neighbors
• Vectors for representing words - embeddings
• “embedding” - mapping from one space or structure to another

Visualization of embeddings learned for sentiment analysis:
• shows the location of selected words projected down from 60-dimensional space into a two-dimensional space
• distinct regions contain positive words, negative words, and neutral function words
• Word similarity in vector semantics - offers enormous power to NLP applications
• Applications like sentiment classifiers - otherwise depend on the same words appearing in the
training and test sets
• representing words as embeddings - lets classifiers assign sentiment based on words with
similar meanings
• can be learned automatically from text without supervision
• two most commonly used models
tf-idf model
• an important baseline
• the meaning of a word - function of the counts of nearby words
• results in very long vectors that are sparse, i.e. mostly zero
word2vec model
• construct short, dense vectors that have useful semantic properties
• Cosine
• standard way to use embeddings to compute semantic similarity, between two words, two
sentences, or two documents,
• an important tool in practical applications like question answering, summarization, or automatic
essay grading.
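As an illustration of the cosine measure described above, here is a minimal numpy sketch; the vectors are made-up toy values, not real embeddings:

import numpy as np

def cosine(v, w):
    # cosine similarity: dot product of v and w, normalized by their vector lengths
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# toy 4-dimensional "embeddings" (illustrative values only)
apricot = np.array([0.2, 0.9, 0.1, 0.4])
jam     = np.array([0.3, 0.8, 0.0, 0.5])
rock    = np.array([0.9, 0.1, 0.7, 0.0])

print(cosine(apricot, jam))   # relatively high - similar vectors
print(cosine(apricot, rock))  # lower - dissimilar vectors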
Word Embeddings
• more powerful word representation
• Embeddings - short, dense vectors
• Short - the number of dimensions ranges from about 50 to 1000
• dense - few zeroes; values are real-valued numbers and can be negative
• work better in every NLP task than sparse vectors
• Fewer dimensions - the classifier needs to learn fewer weights
Eg: 300-dimensional dense vectors require the classifier to learn far fewer weights
than words represented as 50,000-dimensional sparse vectors
• smaller parameter space - helps with generalization and avoiding overfitting
• capture synonyms better
Eg: sparse vector representation
• dimensions for synonyms like car and automobile - are distinct and unrelated
• fail to capture the similarity between a word with car as a neighbor and a word
with automobile as a neighbor
Method for computing embeddings
Word2vec
Software package
Implemented using two algorithms
skip-gram with negative sampling - SGNS
Continuous Bag of Words
the skip-gram algorithm is often referred to simply as word2vec
• Fast
• Efficient to train
• Easily available online with code and pretrained embeddings
Static embeddings
• Method learns one fixed embedding for each word in the vocabulary
Dynamic embeddings
• Contextual
• The vector for each word is different in different contexts - BERT or ELMO representations
Word2vec embeddings - static embeddings
chapter 6.8 page 112
Word2Vec model
Train a classifier on a binary prediction task
“Is word w1 likely to occur near another word w2?”

• “Is word w likely to show up near apricot?”


• instead of counting how often each word w occurs near apricot (which gives a sparse count vector, not an embedding)
• train a classifier on a binary prediction task
• we do not actually care about the prediction task itself
• instead, we take the learned classifier weights as the word embeddings
• use running text as implicitly supervised training data for such a classifier
• a word c that occurs near the target word apricot
• acts as gold ‘correct answer’ to the question “Is word c likely to show up near apricot?”
• often called as self-supervision - avoids the need for hand-labeled supervision
word2vec – can be compared to Neural Language Model
• Neural Language Model
• Is a neural network that learned to predict the next word from prior words
• use the next word in running text as its supervision signal
• used to learn an embedding representation for each word as part of doing this prediction task
• word2vec - much simpler model than the neural network language model, in two ways.
1. word2vec simplifies the task
• making it binary classification
• Neural Language Model - word prediction
2. word2vec simplifies the architecture
• training a logistic regression classifier
• Neural Language Model - multi-layer neural network with hidden layers - demand more sophisticated
training algorithms
The intuition of skip-gram is:
1. Treat the target word and a neighboring context word as positive examples
2. Randomly sample other words in the lexicon to get negative samples
3. Use logistic regression to train a classifier to distinguish those two cases
4. Use the learned weights as the embeddings
Skip gram model
• Generate embeddings
• Compute similarity – dot product
• Convert dot product to Probability –
• Sigmoid function – Convert dot product to be in the range 0 -1
The classifier

• Classification task
• target word - apricot
• window - ±2 context words
• Goal - train a classifier such that
(apricot, jam) - True
(apricot, aardvark) - False
• Given a tuple (w, c)
w-target word
c- candidate context word
• Return the probability that c is a real context word
• The probability that word c is not a real context word for w
How does the classifier compute the probability P?
• Skipgram model - base - probability on embedding similarity
• A word is likely to occur near the target
if its embedding is similar to the target embedding
• Compute similarity between dense embeddings
two vectors are similar if they have a high dot product
cosine is a normalized dot product; here we simply use Similarity(w, c) ≈ c · w
The dot product c · w
is not a probability
it’s a number ranging from −∞ to ∞
can be negative
Turn the dot product into a probability
use the logistic or sigmoid function σ(x) - the fundamental core of logistic regression:
σ(x) = 1 / (1 + exp(−x))

probability that word c is a real context word for target word w:
P(+ | w, c) = σ(c · w) = 1 / (1 + exp(−c · w))

The sigmoid function returns a number between 0 and 1


• To compute the actual probability
• Compute the total probability of the two possible events, which must sum to 1
• c is a context word
• c isn't a context word
• Estimate of the probability that word c is not a real context word for w:
P(− | w, c) = 1 − P(+ | w, c) = σ(−c · w)
• The above equations give the probability for one context word
- but there are many context words in the window
- Skip-gram assumption - all context words are independent, so multiply their probabilities:
P(+ | w, c1:L) = ∏i=1..L σ(ci · w)
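A minimal numpy sketch of how the classifier turns dot products into these probabilities; the embeddings below are random placeholders, assuming a window of L = 4 context words:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 50                              # embedding dimensionality
w = rng.normal(size=d)              # target-word embedding (placeholder values)
contexts = rng.normal(size=(4, d))  # embeddings of the L = 4 context words

p_pos = sigmoid(contexts @ w)       # P(+ | w, ci) for each context word
p_window = np.prod(p_pos)           # independence assumption: multiply them

c = rng.normal(size=d)              # some candidate word
p_neg = sigmoid(-(c @ w))           # P(- | w, c) = sigma(-c . w)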
Intuition of the parameters :
• Skip-gram - stores 2 embeddings
for each word
1) word as target
2) word as context
• Parameters –
• two matrices W and C
• each containing an embedding for
every one of the |V| words in the
vocabulary V.
Learning skip-gram embeddings
• Skip-gram learns embeddings
• start with random embedding
vectors
• iteratively shift the embedding of
each word w
• more like the embeddings of words
that occur nearby
• less like the embeddings of words
that don’t occur nearby.
Eg: target word w = apricot, with 4 context words taken from a ±2 window (L = 4)
Skipgram with negative sampling (SGNS)
• uses more negative examples than positive examples - ratio between them set
by a parameter k.
• for each (w, cpos) training instance - create k negative samples - a ‘noise word’
cneg
• A noise word is a random word from the lexicon - not the target word w.
• k = 2  2 negative examples for each positive example
• The noise words are chosen according to their weighted unigram frequency Pα(w), where α is a weight:
Pα(w) = count(w)^α / Σw′ count(w′)^α
• rather than the unweighted frequency P(w) - the weighting gives rare words slightly more chance of being sampled
• Common to set α = 0.75
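A small sketch of this weighted sampling, assuming toy unigram counts for a five-word lexicon:

import numpy as np

def noise_distribution(counts, alpha=0.75):
    # P_alpha(w) = count(w)**alpha / sum over w' of count(w')**alpha
    weighted = np.array(counts, dtype=float) ** alpha
    return weighted / weighted.sum()

vocab  = ["the", "rice", "garlic", "apricot", "aardvark"]  # toy lexicon
counts = [1000, 200, 150, 20, 2]                           # toy unigram counts

p_alpha = noise_distribution(counts)
rng = np.random.default_rng(0)
negatives = rng.choice(vocab, size=2, p=p_alpha)   # k = 2 noise words per positive pair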
• Learning objective:
maximize the dot product of the word with the actual context words, and
minimize the dot products of the word with the k negative sampled non-neighbor words
• Loss for one positive pair and its k negative samples:
LCE = −[ log σ(cpos · w) + Σi=1..k log σ(−cnegi · w) ]
• minimize this loss function using stochastic gradient descent
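In practice, SGNS embeddings are usually trained with an off-the-shelf library rather than from scratch; a sketch using gensim (assuming gensim 4.x is installed, with a toy tokenized corpus):

from gensim.models import Word2Vec

# toy tokenized corpus (illustrative only)
sentences = [
    ["ongchoi", "is", "delicious", "sauteed", "with", "garlic"],
    ["spinach", "sauteed", "with", "garlic", "over", "rice"],
    ["chard", "stems", "and", "leaves", "are", "delicious"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimensionality
    window=2,          # +/- 2 context words
    sg=1,              # 1 = skip-gram (0 would be CBOW)
    negative=2,        # k = 2 negative samples per positive example
    ns_exponent=0.75,  # alpha = 0.75 weighting of the unigram distribution
    min_count=1,
)

vec = model.wv["garlic"]                 # the learned embedding for one word
print(model.wv.most_similar("garlic"))   # nearest neighbours by cosine similarity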
Glove model
• GloVe, short for Global Vectors
• Widely used static embedding model
• Based on capturing global corpus statistics
• Uses ratios of probabilities from the word-word co-occurrence matrix
• Combines the intuitions of count-based models and word2vec
FastText model
• uses subword models
• deals with unknown words and sparsity in languages with rich
morphology
• Each word - represented as itself plus a bag of constituent n-grams, with
special boundary symbols < and > added to each word.
Eg: n = 3, word 'where' → the whole-word token <where> plus the n-grams <wh, whe, her, ere, re>
• a skipgram embedding is learned for each constituent n-gram
• a word is represented by the sum of the embeddings of its constituent n-grams (plus the whole-word embedding)
• fasttext - open-source library with pretrained embeddings for 157 languages
• https://fasttext.cc
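A sketch of the constituent n-gram extraction described above (real fastText additionally hashes the n-grams into buckets, which is omitted here):

def char_ngrams(word, n=3):
    # constituent character n-grams of a word, with boundary symbols < and >
    token = f"<{word}>"
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    return [token] + grams   # the whole word is kept as a unit as well

print(char_ngrams("where"))
# ['<where>', '<wh', 'whe', 'her', 'ere', 're>']
# the word's embedding is the sum of the embeddings learned for these units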
Overview of Deep Learning models Chapter : 9
Deep learning
• uses artificial neural networks
• perform sophisticated computations on large amounts of data
• type of machine learning - works based on the structure and function of the human brain
Deep learning algorithms
• train machines by learning from examples
• health care, eCommerce, entertainment, and advertising - use deep learning
Neural Networks
• is structured like the human brain - consists of artificial neurons-nodes
• nodes are stacked next to each other in three layers:
• The input layer
• The hidden layer(s)
• The output layer
Deep learning models
• are trained using multi-layer neural network architectures and large sets of labeled data
• sometimes exceed human-level performance
• learn features directly from the data, without the need for manual feature extraction
Types of Deep Learning Algorithms
1. Convolutional Neural Networks (CNNs)
2. Long Short Term Memory Networks (LSTMs)
3. Recurrent Neural Networks (RNNs)
4. Generative Adversarial Networks (GANs)
5. Radial Basis Function Networks (RBFNs)
6. Multilayer Perceptrons (MLPs)
7. Self Organizing Maps (SOMs)
8. Deep Belief Networks (DBNs)
9. Restricted Boltzmann Machines (RBMs)
10. Autoencoders
RNN - Recurrent Neural Network
• network that contains a cycle within its network connections
• value of a unit is directly, or indirectly, dependent on its own earlier outputs
• difficult to reason about and to train
• proven to be effective when applied to spoken and written language
• class of recurrent networks -Elman Networks - simple recurrent networks
• serve as the basis for more complex approaches like the Long Short-Term
Memory (LSTM) networks
Structure of an RNN
 input vector represent the current input - Xt
 multiply by a weight matrix
 pass through a non-linear activation function - compute
the values for a layer of hidden units.
 Use hidden layer - to calculate a corresponding output Yt
• key difference from a feedforward network - recurrent link -
dashed line
• link augments the input to the computation at the hidden layer
with the value of the hidden layer from the preceding point in
time.
• The hidden layer from the previous time step provides a form
of memory, or context, that encodes earlier processing and
informs the decisions to be made at later points in time.
• does not impose a fixed-length limit on this prior context;
context embodied in the previous hidden layer - includes
information extending back to the beginning of the sequence.
• Temporal dimension - makes RNNs appear to be more complex
• Perform the standard feedforward calculation
• Change - new set of weights, U, that connect the hidden layer from the
previous time step to the current hidden layer
• Weights - determine how the network makes use of past context in
calculating the output for the current input
• Like other weights in the network, these connections are trained via
backpropagation.
Inference in RNN
• Forward inference (mapping a sequence of inputs to a sequence of outputs) in an RNN is nearly identical to that in feedforward networks
• To compute an output yt for an input xt, we need the activation value for the hidden layer ht
• To calculate this, we multiply the input xt by the weight matrix W, and the hidden layer from the previous time step ht−1 by the weight matrix U
• We add these values together and pass them through a suitable activation function, g, to arrive at the activation value for the current hidden layer:
ht = g(U ht−1 + W xt)
• Once we have the values for the hidden layer, we proceed with the usual computation to generate the output vector:
yt = f(V ht), where f is typically a softmax
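A minimal numpy sketch of one step of this forward inference, with tanh as the activation g and softmax as the output function f; the weights, sizes and inputs are random placeholders:

import numpy as np

def rnn_step(x_t, h_prev, W, U, V):
    # one step of forward inference: h_t = g(U h_{t-1} + W x_t), y_t = f(V h_t)
    h_t = np.tanh(U @ h_prev + W @ x_t)                     # g = tanh here
    z = V @ h_t
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # f = softmax
    return h_t, y_t

# toy dimensions (illustrative): input size 4, hidden size 3, output size 5
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
U = rng.normal(size=(3, 3))
V = rng.normal(size=(5, 3))

h = np.zeros(3)                       # h_0
for x_t in rng.normal(size=(6, 4)):   # a sequence of 6 input vectors
    h, y = rnn_step(x_t, h, W, U, V)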
Training
• To obtain the gradients needed to adjust the weights
 Training Set
 Loss Function – cross entropy - difference between the predicted and correct distribution
 Back Propagation
• 3 sets of weights to update
 W- weights from the input layer to the hidden layer
 U - weights from the previous hidden layer to the current hidden layer
 V - weights from the hidden layer to the output layer
• two considerations
 to compute the loss function for the output at time t, we use the hidden layer from time t − 1
 the hidden layer at time t influences both
 the output at time t
 the hidden layer at time t + 1
• two-pass algorithm for training the weights in RNNs
• first pass - perform forward inference
• compute ht , yt
• accumulate the loss at each step in time
• save the value of the hidden layer at each step for use at the next time step
• second phase – Back Propagation
• process the sequence in reverse
• compute the required gradients
• compute and save the error term for use in the hidden layer for each step backward in time.
RNNs as Language Models
• Process sequences - a word at a time - predict the next word in a sequence – using
current word and the previous hidden state as inputs
• Limited context constraint inherent in N-gram models is avoided
• Hidden state - has information about all preceding words – from the beginning
• Forward inference in a recurrent language model – same process
• The input sequence x - consists of words represented as one-hot vectors of size |V| × 1
• output predictions, y - represented as vectors representing a probability distribution over the
vocabulary.
• At each step
• uses the word embedding matrix E - to retrieve the embedding for the current word
• combine it with the hidden layer from the previous step to compute a new hidden layer
• hidden layer is then used to generate an output layer which is passed through a softmax layer to
generate a probability distribution over the entire vocabulary
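A toy numpy sketch of one time step of such a recurrent language model, with placeholder weights, a made-up vocabulary of 10 words, and tanh/softmax as the activation functions:

import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# toy sizes (illustrative): vocabulary of 10 words, embeddings of size 8, hidden size 6
rng = np.random.default_rng(0)
V_size, d, h_size = 10, 8, 6
E = rng.normal(size=(d, V_size))       # embedding matrix
W = rng.normal(size=(h_size, d))
U = rng.normal(size=(h_size, h_size))
V = rng.normal(size=(V_size, h_size))  # hidden-to-output weights

h = np.zeros(h_size)
for word_index in [3, 7, 1]:           # indices of the words seen so far
    e_t = E[:, word_index]             # retrieve the current word's embedding
    h = np.tanh(U @ h + W @ e_t)       # new hidden layer
    y = softmax(V @ h)                 # probability distribution over the vocabulary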
chapter 9.4 Page No:190
Transformers
RNN
• leads to a loss of relevant information - difficulties in training
• inhibits the use of parallel computational resources - sequential nature of recurrent
networks.
Transformers
• an approach to sequential processing - eliminates recurrent connections
• map sequences of input vectors (x1,..., xn) to sequences of output vectors (y1,..., yn)
of the same length
• made up of stacks of network layers
• simple linear layers
• feedforward networks
• custom connections
• use of self-attention layers - key innovation
Self-attention
• allows a network to directly extract and use information from arbitrarily large
contexts without the need to pass it through intermediate recurrent
connections
• application of self-attention here - the problems of language modeling and autoregressive generation, where only past context is used
• access to all of the inputs up to, and including, the one under consideration
• no access to information about inputs beyond the current one - access only to past information
• the computation performed for each item is independent of all the other computations - so parallel computing can be used
Attention-based approach
• Compare an item to a collection of other items - reveals relevance in the
current context – dot product – Scores (other possible comparisons)
computation of y3
set of comparisons between the input x3 and its preceding elements x1, x2 and x3
compute three scores: x3 · x1, x3 · x2 and x3 · x3
• Normalize - softmax - create a vector of weights, αi j - indicates the
proportional relevance of each input to the input element i- probability
• Generate an output value yi by taking the sum of the inputs weighted by
their respective α value.
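A numpy sketch of this simple attention mechanism for the output y3, using three made-up input vectors:

import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# three toy input vectors x1, x2, x3 (illustrative values only)
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 1.0]])

x3 = X[2]
scores = np.array([x3 @ X[0], x3 @ X[1], x3 @ X[2]])  # x3·x1, x3·x2, x3·x3
alpha = softmax(scores)                 # proportional relevance of each input to x3
y3 = (alpha[:, None] * X).sum(axis=0)   # output: sum of the inputs weighted by alpha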
• simple mechanism
 provides no opportunity for learning
 everything is directly based on the original input values x
 no opportunity to learn - contribution of words to represent longer inputs
• Transformers – handles above issues
 include additional parameters - set of weight matrices that operate over the input embeddings
• roles that each input embedding plays during the attention process
 query - current focus of attention
 key - preceding input
 Value - output for the current focus of attention

• introduce three sets of weights - WQ, WK, and WV - used to project each input xi into its query, key, and value representations:
qi = WQ xi,  ki = WK xi,  vi = WV xi
• Given input embeddings of size dm, the dimensionalities of these matrices are dq×dm, dk×dm and dv×dm
• score between xi and xj - the dot product between its query vector qi and the preceding element's key vector kj:
score(xi, xj) = qi · kj
• softmax calculation gives the weights: α i,j = softmax(score(xi, xj))
• Result of the dot product - can be an arbitrarily large (positive or negative) value
• Exponentiating large values
can lead to numerical issues
effective loss of gradients during training
• To avoid this, the dot products are scaled - the scaled dot-product approach
• divide the result of the dot product by a factor related to the size of the embeddings:
score(xi, xj) = (qi · kj) / √dk
 Each output yi is computed independently - the process can be parallelized as matrix multiplications over Q, K and V (the matrices whose rows are the query, key and value vectors):
SelfAttention(Q, K, V) = softmax( Q Kᵀ / √dk ) V
 The matrix Q Kᵀ also includes scores for elements that follow the query
 Elements in the upper-triangular portion of the comparison matrix are masked out (set to −∞ before the softmax, so their weights become zero) - this eliminates any knowledge of words that follow in the sequence
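A numpy sketch of scaled dot-product self-attention with a causal mask, along the lines described above; the weight matrices, inputs and dimensions are random/toy placeholders:

import numpy as np

def causal_self_attention(X, WQ, WK, WV):
    # softmax(Q K^T / sqrt(d_k)) V, with the upper-triangular scores set to -inf
    Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T       # rows of Q/K/V are q_i, k_i, v_i
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scaled dot products
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                       # hide words that follow each position
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)
    return alpha @ V                             # one output vector per input position

# toy example: 4 inputs of dimensionality 6, projected to d_q = d_k = d_v = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
WQ, WK, WV = (rng.normal(size=(3, 6)) for _ in range(3))
Y = causal_self_attention(X, WQ, WK, WV)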
Transformer Blocks

• self-attention layer
• feedforward layers
• residual connections
• normalizing layers
• blocks can then be stacked
Multihead Attention
• different words in a sentence - can
relate to each other in many different
ways
• difficult for a single transformer block -
to capture all of the different relations
• Transformers address this issue -
multihead self-attention layers
• sets of self-attention layers - heads
o parallel layers at the same depth in a
model
o each has its own set of parameters
WK i , W Q i and WV i
o each head can learn different aspects
of the relationships that exist among
inputs at the same level of
abstraction
o combined –
 concatenate the outputs from each head
 reduce to the original output dimension -
another linear projection
o rest of the Transformer block
remain the same - feedforward
layer, residual connections, and
layer norms
Positional Embeddings

• self-attention on its own is insensitive to word order - without extra information, the order of the inputs does not affect the output of the Transformer
• Transformer - combines each word embedding with a positional embedding
• positional embeddings - are learned along with the other parameters during training
• Eg: input representation for the word 'class' at position 3 = embedding for the word 'class' + embedding for position 3
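A small numpy sketch of combining learned word embeddings with learned positional embeddings; the tables and token ids below are random/illustrative placeholders:

import numpy as np

# toy sizes (illustrative): vocabulary of 10 words, max length 8, embeddings of size 4
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(10, 4))   # learned word-embedding table
pos_emb = rng.normal(size=(8, 4))     # learned positional-embedding table

token_ids = np.array([5, 2, 9, 1])    # indices of the input words
positions = np.arange(len(token_ids))

# input to the first transformer block: word embedding + positional embedding
X = word_emb[token_ids] + pos_emb[positions]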
Transformers as Autoregressive Language Models
• train a model to predict the next word in a sequence - teacher forcing
• calculate the cross-entropy loss for each item in the sequence
• each training item can be processed in parallel
Contextual Generation and Summarization
• simple variation on autoregressive generation
Overview of Text summarization and Topic
Models

Text Analytics with Python – chapter 5 - Page 216


Text summarization
• important concept in text analytics
• practical application of context-based autoregressive generation
• task - take a full-length article and produce an effective summary
• used by businesses and analytical firms
• shorten and summarize huge documents of text such that they still retain their key essence or theme
• present this summarized information to consumers and clients
• To train a transformer - corpus with full-length articles + corresponding summaries
• Append a summary to each full-length article in a corpus, with a unique marker
separating the two
• each article-summary pair (x1,..., xm), (y1,..., yn) is converted to (x1,..., xm, δ, y1,..., yn), of length m + n + 1
• these are treated as long sentences and used to train an autoregressive language model with teacher forcing
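A tiny sketch of how such a training sequence could be assembled; the separator token name below is a made-up placeholder for δ:

# hypothetical separator token marking the article/summary boundary
SEP = "<delta>"

def make_training_sequence(article_tokens, summary_tokens):
    # (x1..xm), (y1..yn) -> (x1..xm, delta, y1..yn), length m + n + 1
    return article_tokens + [SEP] + summary_tokens

article = ["the", "match", "ended", "in", "a", "draw"]   # toy tokens
summary = ["match", "drawn"]
sequence = make_training_sequence(article, summary)
# the autoregressive LM is then trained on such sequences with teacher forcing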
Text summarization

• Extract the key influential phrases from the documents


• Extract various diverse concepts or topics present in the documents
• Summarize the documents to provide a gist that retains the important
parts of the whole corpus
• popular techniques are
keyphrase extraction
topic modeling
automated document summarization
Topic modeling
• Extract main topics, themes, or concepts from a corpus of documents
• Involves statistical and mathematical modeling techniques
• Works best on a large, diverse set of documents
- the more diverse the corpus, the more topics or concepts are generated
- a single document about a single concept may not yield many topics
• Also known as probabilistic statistical models
• Used extensively in text analytics and even bioinformatics
• Use mathematical and statistical techniques - to discover hidden and latent semantic
structures in a corpus
• Extract features from document terms
 using mathematical structures and frameworks like matrix factorization and SVD (Singular Value Decomposition)
 generate clusters or groups of terms that are distinguishable from each other
 these cluster of words form topics or concepts
• Used to
 interpret the main themes of a corpus
 make semantic connections among words that co-occur together frequently in various documents
Frameworks and algorithms to build topic models
1. Latent semantic indexing - popular
2. Latent Dirichlet allocation - popular
3. Non-negative matrix factorization
- a more recent technique
- extremely effective and gives excellent results
1.Latent Semantic Indexing (LSI)
• used for text summarization, information retrieval, search
• uses the very popular SVD technique (Singular Value Decomposition)
• main principle - similar terms tend to be used in the same context and hence
tend to co-occur more
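A sketch of LSI-style topic extraction using scikit-learn's TruncatedSVD over a tf-idf matrix (toy corpus, assuming scikit-learn is installed):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (illustrative only)
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about the markets",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)            # document-term tf-idf matrix

lsi = TruncatedSVD(n_components=2, random_state=0)   # 2 latent topics via SVD
doc_topics = lsi.fit_transform(X)        # document-topic weights

terms = tfidf.get_feature_names_out()
for i, comp in enumerate(lsi.components_):
    top = comp.argsort()[::-1][:3]       # top terms per latent dimension
    print(f"topic {i}:", [terms[j] for j in top])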
2.Latent Dirichlet Allocation (LDA)
• is a generative probabilistic model
• each document is assumed to have a
combination of topics similar to a
probabilistic latent semantic indexing
model
• latent topics contain a Dirichlet prior
over them.

End-to-end LDA framework


LDA plate notation
Steps Involved
1. Initialize the necessary parameters.
2. For each document, randomly assign each word to one of the K topics.
3. Start an iterative process as follows and repeat it several times.
4. For each document D:
a. For each word W in the document:
• For each topic T:
i. Compute P(T|D) - the proportion of words in D currently assigned to topic T.
ii. Compute P(W|T) - the proportion of assignments to topic T, over all documents, that come from the word W.
• Reassign word W to topic T with probability P(T|D) * P(W|T), considering all other words and their current topic assignments.
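A sketch of topic extraction with LDA using scikit-learn (which fits the model with variational inference rather than the sampling loop sketched above); the corpus is a toy example:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors worry about falling markets",
]

cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(docs)                       # LDA works on raw term counts

lda = LatentDirichletAllocation(n_components=2,  # K = 2 topics
                                max_iter=20,
                                random_state=0)
doc_topic = lda.fit_transform(X)                 # per-document topic mixture

terms = cv.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = comp.argsort()[::-1][:3]               # top terms per topic
    print(f"topic {i}:", [terms[j] for j in top])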
3.Non-negative Matrix Factorization(NNMF)
• matrix decomposition technique similar to SVD
• NNMF operates on nonnegative matrices and works well for multivariate data
• NNMF can be formally defined as follows:
• Given a non-negative matrix V
• objective - find two non-negative matrix factors W and H such that, when they are multiplied, they approximately reconstruct V
• Mathematically this is represented by
V ≈ W H   (all entries of V, W and H are non-negative)
• To get to this approximation
• use a cost function such as the Euclidean distance or L2 norm between the two matrices, or the Frobenius norm, which is a slight modification of the L2 norm
• represented in simplified form as
L(W, H) = ||V − W H||²   (minimized subject to W ≥ 0, H ≥ 0)

• often works well even with small corpora containing few documents
• though this depends on the type of data being dealt with
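A sketch of NNMF-based topic extraction with scikit-learn, factorizing a tf-idf matrix V into W and H over a toy corpus:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors worry about falling markets",
]

tfidf = TfidfVectorizer(stop_words="english")
V = tfidf.fit_transform(docs)               # non-negative matrix V (documents x terms)

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(V)                    # document-topic weights
H = nmf.components_                         # topic-term weights, so V is approximately W H

terms = tfidf.get_feature_names_out()
for i, comp in enumerate(H):
    top = comp.argsort()[::-1][:3]          # top terms per topic
    print(f"topic {i}:", [terms[j] for j in top])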
Potential Harms from Language Models
• can generate toxic language - hate speech and abuse, negative
attitudes toward minority identities such as being Black or gay.
• can amplify demographic and other biases in training data
• can also be a tool for generating text for misinformation, phishing,
radicalization, and other socially harmful activities
• privacy issues- can leak information about their training data.
• Extra pre-training on non-toxic subcorpora seems to reduce tendency
to generate toxic language
• analyze the data used to pretrain - understand toxicity and bias in
generation, as well as privacy
THANK YOU
