CCS369 - TSS-Unit 2
By
C.Jerin Mahibha
Assoc.Prof / CSE
UNIT II TEXT CLASSIFICATION
Vector Semantics and Embeddings - Word Embeddings - Word2Vec model – GloVe model –
FastText model – Overview of Deep Learning models – RNN – Transformers – Overview of
Text summarization and Topic Models
COURSE OBJECTIVES:
Apply classification algorithms to text documents
COURSE OUTCOME:
CO2: Apply deep learning techniques for NLP tasks, language modelling and machine
translation
Ongchoi is delicious sauteed with garlic.
Ongchoi is superb over rice.
... ongchoi leaves with salty sauces ...
... spinach sauteed with garlic over rice ...
... chard stems and leaves are delicious ...
... collard greens and other salty leafy greens ...
• ongchoi - occurs with words like rice, garlic, delicious, and salty, as do spinach, chard, and collard
• spinach, chard, and collard – leafy greens
• ongchoi - likely a leafy green similar to spinach, chard, and collard
• computationally implemented - by counting words in the context of ongchoi (a counting sketch follows this list)
• vector semantics - represent a word as a point in a multidimensional
semantic space that is derived from the distributions of word neighbors
• Vectors for representing words - embeddings
• “embedding” - mapping from one space or structure to another
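A minimal Python sketch of this counting idea. The toy corpus below is only illustrative (an assumption echoing the ongchoi sentences, not part of the slides); it builds simple count-based context vectors and shows that ongchoi and spinach share context words such as garlic and rice.

```python
from collections import Counter, defaultdict

# Illustrative toy corpus (assumed, echoing the ongchoi example above).
corpus = [
    "ongchoi is delicious sauteed with garlic over rice",
    "spinach sauteed with garlic over rice",
    "ongchoi leaves with salty sauces",
    "collard greens and other salty leafy greens",
]

# For every word, count the other words that occur in the same sentence.
# These counts form a simple distributional (count-based) word vector.
context_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        context_counts[target].update(t for j, t in enumerate(tokens) if j != i)

# ongchoi and spinach end up sharing context dimensions such as garlic,
# rice, and sauteed, which places them near each other in the derived
# semantic space.
print(context_counts["ongchoi"].most_common(5))
print(context_counts["spinach"].most_common(5))
```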
• Given an input embedding of size dm, the dimensions of the query, key, and value projection matrices are dq×dm, dk×dm and dv×dm
• score between xi and xj - the dot product between xi's query vector qi and the preceding element's key vector kj: score(xi, xj) = qi · kj
• the scores are normalized with a softmax to give the attention weights αij
• the result of a dot product can be an arbitrarily large (positive or negative) value
• exponentiating large values can lead to numerical issues and to an effective loss of gradients during training
• to avoid this, the dot products are scaled - the scaled dot-product approach
• divides the result of the dot product by a factor related to the size of the embeddings, √dk: score(xi, xj) = (qi · kj) / √dk (see the sketch below)
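A minimal NumPy sketch of causal scaled dot-product self-attention following these definitions. The variable names (X, W_q, W_k, W_v) and the random toy inputs are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v):
    """X: (n, d_m) input embeddings; W_q/W_k: (d_m, d_k); W_v: (d_m, d_v)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # query, key, value vectors
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scaled dot-product scores
    # Mask positions after i so that x_i only attends to preceding elements.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    alpha = softmax(scores)                      # the alpha_ij attention weights
    return alpha @ V                             # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_m, d_k, d_v = 5, 8, 4, 4                    # toy sizes (assumed)
X = rng.normal(size=(n, d_m))
out = causal_self_attention(X, rng.normal(size=(d_m, d_k)),
                            rng.normal(size=(d_m, d_k)),
                            rng.normal(size=(d_m, d_v)))
print(out.shape)                                 # (5, 4)
```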
• self-attention layer
• feedforward layers
• residual connections
• normalizing layers
• e.g., input to the block = embedding for the word ‘class’ + embedding for position 3 (a sketch of one block follows)
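A minimal PyTorch sketch of one such block (self-attention, feedforward layers, residual connections, normalizing layers), fed with token + position embeddings. The layer sizes and the post-norm arrangement are assumptions for illustration, not a definitive implementation.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: self-attention + feedforward, each wrapped with a
    residual connection and layer normalization (post-norm variant)."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + a)            # residual connection + layer norm
        x = self.norm2(x + self.ff(x))   # residual connection + layer norm
        return x

# Input to the block: token embedding + position embedding
# (e.g., embedding for the word 'class' + embedding for position 3).
vocab_size, max_len, d_model = 1000, 32, 64      # toy sizes (assumed)
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)
ids = torch.randint(0, vocab_size, (1, 8))           # a sequence of 8 token ids
positions = torch.arange(ids.size(1)).unsqueeze(0)   # positions 0..7
x = tok_emb(ids) + pos_emb(positions)
print(TransformerBlock()(x).shape)                   # torch.Size([1, 8, 64])
```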
Transformers as Autoregressive Language Models
• train a model to predict the next word in a sequence - teacher forcing
• calculate the cross-entropy loss for each item in the sequence
• each training item can be processed in parallel (see the sketch after this list)
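A minimal PyTorch sketch of this training setup. The stand-in model (just an embedding plus a linear layer) and the toy sizes are assumptions; the point is the shift-by-one targets (teacher forcing) and the per-position cross-entropy loss computed in a single parallel forward pass.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 9     # toy sizes (assumed)

# Stand-in language model: embedding -> linear projection to vocabulary logits.
# (A real model would run transformer blocks with a causal mask in between.)
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (1, seq_len))   # one training sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]       # teacher forcing: shift by one

logits = model(inputs)                                # (1, seq_len-1, vocab_size)
# Cross-entropy loss for each item in the sequence, averaged over positions;
# all positions are scored in parallel from the single forward pass.
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()
print(float(loss))
```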
Contextual Generation and Summarization
• a simple variation on autoregressive generation: condition on a context (e.g., the text to be summarized) and continue generating from it (sketch below)
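A minimal sketch of greedy autoregressive generation from a context. The `model` callable is assumed to map token ids to per-position vocabulary logits (e.g., the stand-in model in the previous sketch); greedy decoding is an illustrative simplification.

```python
import torch

def greedy_generate(model, context_ids, max_new_tokens=30, eos_id=None):
    """Autoregressive generation: condition on the context (e.g., an article
    to be summarized) and repeatedly append the most probable next token."""
    ids = context_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids)                                  # (1, len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)               # append the new token
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids[:, context_ids.size(1):]                      # only the generated part

# Usage with the stand-in model from the previous sketch (assumed):
# summary_ids = greedy_generate(model, tokens[:, :5])
```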
Overview of Text Summarization and Topic Models
• works well even with small corpora with few documents
• the choice of model depends on the type of data being dealt with (a simple topic-model sketch follows)
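A minimal scikit-learn sketch of a topic model (LDA) on a tiny illustrative corpus. The documents and the choice of two topics are assumptions, meant only to show that a topic model can be fitted even to a handful of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus (assumed).
docs = [
    "rice garlic spinach leafy greens sauteed",
    "ongchoi chard collard salty leaves",
    "neural network training loss gradient",
    "transformer attention embedding language model",
]

# Bag-of-words counts, then a 2-topic LDA model over this small corpus.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```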
Potential Harms from Language Models
• can generate toxic language - hate speech and abuse, negative
attitudes toward minority identities such as being Black or gay.
• can amplify demographic and other biases in training data
• can also be a tool for generating text for misinformation, phishing,
radicalization, and other socially harmful activities
• privacy issues - can leak information about their training data.
• Extra pre-training on non-toxic subcorpora seems to reduce the tendency to generate toxic language
• analyzing the data used to pretrain models helps in understanding toxicity and bias in generation, as well as privacy risks
THANK YOU