Unit 2 TB
Unit-2
TEXT REPRESENTATION
Feature representation for text is often much more complex than for other formats of data.
● The way an image is stored in a computer is in the form of a matrix of pixels where each cell[i,j]
in the matrix represents pixel i,j of the image. The real value stored at cell[i,j] represents the
intensity of the corresponding pixel in the image. This matrix representation accurately represents
the complete image.
● Video is similar: a video is just a collection of frames where each frame is an image. Hence, any
video can be represented as a sequential collection of matrices, one per frame, in the same order.
● Speech—it’s transmitted as a wave. To represent it mathematically, we sample the wave and record
its amplitude.
Text representation approaches are classified into four categories:
• Basic vectorization approaches
• Distributed representations
• Universal language representation
• Handcrafted features
Representing text units (characters, phonemes, words, phrases, sentences, paragraphs, and documents)
with vectors of numbers is known as the vector space model (VSM).
The common way to calculate similarity between two text blobs is cosine similarity: the cosine of the
angle between their corresponding vectors. The cosine of 0° is 1 and the cosine of 90° is 0; the closer the
cosine is to 1, the smaller the angle between the vectors and the more similar the vectors are.
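As a small illustration (not from the original text), cosine similarity can be computed directly from the dot product and the vector norms. The two six-dimensional vectors below are just made-up examples; a minimal sketch assuming NumPy is installed:

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1, 1, 1, 0, 0, 0])   # illustrative vector for one text blob
d2 = np.array([0, 0, 1, 0, 1, 1])   # illustrative vector for another
print(cosine_similarity(d1, d2))    # ~0.33; a value closer to 1 means more similar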
Let’s take a toy corpus with only four documents as an example: D1: “Dog bites man”, D2: “Man bites dog”,
D3: “Dog eats meat”, D4: “Man eats food”.
The vocabulary of this corpus comprises six words: [dog, bites, man, eats, meat, food].
1. One-Hot Encoding
● In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID wid that is
between 1 and |V|, where V is the set of the corpus vocabulary.
● Each word is then represented by a |V|-dimensional binary vector of 0s and 1s: the vector is all 0s
except at the index equal to wid, where we simply put a 1.
● The representations of the individual words are then combined to form a sentence representation.
● Map each of the six words to a unique ID: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats =
6. Let’s consider the document D1: “dog bites man”. As per the scheme, each word is a six-
dimensional vector. Dog is represented as [1 0 0 0 0 0], as the word “dog” is mapped to ID 1. Bites
is represented as [0 1 0 0 0 0], and so on and so forth. Thus, D1 is represented as [ [1 0 0 0 0 0] [0
1 0 0 0 0] [0 0 1 0 0 0]]. D4 (“man eats food”) is represented as [ [0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]]. Other
documents in the corpus can be represented similarly.
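A minimal sketch (in plain Python) of how this one-hot scheme could be implemented by hand for the toy corpus; the dictionary of word IDs mirrors the mapping given above:

vocab = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot(word):
    vec = [0] * len(vocab)          # six-dimensional vector of zeros
    vec[vocab[word] - 1] = 1        # IDs start at 1, list indices at 0
    return vec

d1 = [one_hot(w) for w in "dog bites man".split()]
print(d1)   # [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]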
• The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora
have large vocabularies. This results in a sparse representation where most of the entries in the vectors are
zeroes, making it computationally inefficient to store, compute with, and learn from (sparsity leads to
overfitting).
• This representation does not give a fixed-length representation for text, i.e., if a text has 10 words, you
get a longer representation for it as compared to a text with 5 words. For most learning algorithms, we need
the feature vectors to be of the same length.
• It treats words as atomic units and has no notion of (dis)similarity between words. For example, consider
three words: run, ran, and apple. Run and ran have similar meanings as opposed to run and apple. But if we
take their respective vectors and compute Euclidean distance between them, they’re all equally apart (√2).
Thus, semantically, they’re very poor at capturing the meaning of the word in relation to other words.
• Say we train a model using our toy corpus. At runtime, we get a sentence: “man eats fruits.” The training
data didn’t include “fruits” and there’s no way to represent it in our model. This is known as the out of
vocabulary (OOV) problem.
2. Bag of Words
● Similar to one-hot encoding, BoW maps words to unique integer IDs between 1 and |V|.
● Each document in the corpus is then converted into a vector of |V| dimensions, where the ith
component of the vector (i = wid) is simply the number of times the word w occurs in the document,
i.e., we simply score each word in V by its occurrence count in the document.
● Thus, for our toy corpus, where the word IDs are dog = 1, bites = 2, man = 3, meat = 4 , food = 5,
eats = 6, D1 becomes [1 1 1 0 0 0]. This is because the first three words in the vocabulary appeared
exactly once in D1, and the last three did not appear at all. D4 becomes [0 0 1 0 1 1].
• With this representation, documents having the same words will have vector representations that are
closer to each other in Euclidean space than those of documents with completely different words.
The distance between D1 and D2 is 0, compared to the distance between D1 and D4, which is 2.
Thus, the vector space resulting from the BoW scheme captures the semantic similarity of documents.
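A minimal sketch of BoW using scikit-learn's CountVectorizer on the toy corpus (assuming scikit-learn is installed). Note that CountVectorizer builds its own vocabulary in alphabetical order, so the column order may differ from the word IDs used above:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]   # D1-D4
count_vect = CountVectorizer()
bow = count_vect.fit_transform(docs)
print(count_vect.vocabulary_)   # word -> column index (alphabetical, not the IDs above)
print(bow.toarray())            # one row of counts per document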
Disadvantages:
• The size of the vector increases with the size of the vocabulary. Thus, sparsity continues to be a
problem.
• It does not capture the similarity between different words that mean the same thing. Say we have three
documents: “I run”, “I ran”, and “I ate”. BoW vectors of all three documents will be equally apart.
• This representation does not have any way to handle out of vocabulary words.
• As the name indicates, it is a “bag” of words—word order information is lost in this representation.
Both D1 and D2 will have the same representation in this scheme.
3. Bag of N-Grams
● The bag-of-n-grams (BoN) approach breaks text into chunks of n contiguous words, called n-grams;
each document is then represented as a vector of n-gram counts over the corpus’s n-gram vocabulary.
● This captures some context and word-order information, but the dimensionality (and sparsity) of the
representation increases rapidly as n grows, and the OOV problem remains.
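A minimal sketch of a bag-of-n-grams representation, again assuming scikit-learn; ngram_range=(1, 2) keeps both unigrams and bigrams as features:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
ngram_vect = CountVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
bon = ngram_vect.fit_transform(docs)
print(ngram_vect.vocabulary_)   # includes features such as "dog bites" and "bites man"
print(bon.toarray())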
4. TF-IDF
● In all the previous approaches, all the words in the text are treated as equally important; there is
no notion of some words being more important than others.
● TF-IDF aims to quantify the importance of a given word relative to other words in the document and
in the corpus.
● TF (term frequency) measures how often a term or word occurs in a given document.
● IDF (inverse document frequency) measures the importance of the term across the corpus. In
computing TF, all terms are given equal importance (weightage); IDF weighs down terms that are
common across the corpus and weighs up the rare ones.
● Library: TfidfVectorizer (from scikit-learn)
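A commonly used formulation of these quantities (library implementations such as scikit-learn use smoothed and normalized variants, so their numbers differ slightly):

TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in d)
IDF(t) = log(N / number of documents containing t), where N is the total number of documents
TF-IDF(t, d) = TF(t, d) × IDF(t)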
Example 2:
Document 1: It is going to rain today.
Document 2: Today I am not going outside.
Document 3: I am going to watch the season premiere.
Find TF
Find IDF
Find IDF for the documents (IDF is computed for the feature names only, i.e., the vocabulary words remaining after stop-word removal)
TF-IDF = TF × IDF
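A minimal sketch of the same computation with scikit-learn's TfidfVectorizer on the three documents above (assuming scikit-learn is installed; its smoothed IDF and L2 normalization mean the exact numbers will differ from the hand-computed ones):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["It is going to rain today.",
        "Today I am not going outside.",
        "I am going to watch the season premiere."]
tfidf = TfidfVectorizer(stop_words="english")   # drop stop words, as in the worked example
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())            # vocabulary after stop-word removal
print(X.toarray())                              # TF-IDF weight of each vocabulary word per document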
Distributed Representations
They use neural network architectures to create dense, low-dimensional representations of
words and texts.
Some key terms:
● Distributional similarity
Meaning is defined by the context in which a word is used.
E.g., in “NLP rocks”, the literal sense of “rocks” (stones) does not apply; from the context we
understand that “rocks” here means “is great”.
● Distributional hypothesis
Words that occur in similar contexts have similar meanings.
E.g., “dog” and “cat” occur in similar contexts, so they have similar meanings.
● Distributional representation
These vectors are obtained from a co-occurrence matrix that captures co-occurrence of word
and context. The dimension of this matrix is equal to the size of the vocabulary of the corpus.
The previous four schemes—one-hot, bag of words, bag of n-grams, and TF-IDF—come under
distributional representation.
● Distributed representation
In distributional representations, the vectors are very high dimensional and sparse. So, they are
computationally inefficient and hamper learning. To overcome this problem, distributed
representation schemes compress the dimensionality. This results in vectors that are compact
(i.e., low dimensional) and dense (i.e., hardly any zeros).
● Embedding
An embedding is a mapping from the high-dimensional, sparse distributional vector space to a
compact, low-dimensional, dense distributed vector space; the resulting word vectors are called
embeddings.
CBOW:
● In CBOW, a language model is built that correctly predicts the center word given the context
words in which the center word appears.
● If we take the word “jumps” as the center word with a context size of 2, then, for the example
sentence “The quick brown fox jumps over the lazy dog”, the context is given by brown, fox, over,
the. CBOW uses the context words to predict the target word: jumps.
● A shallow net (single hidden layer) is built. We assume we want to learn D-dimensional word
embeddings. Further, let V be the vocabulary of the text corpus.
● The vectors fetched are then added to get a single D-dimensional vector, and this is passed to the next
layer.
● The next layer simply takes this D-dimensional vector and multiplies it with another matrix E’ of
dimension D × |V|.
● This gives a 1 × |V| vector, which is fed to a softmax function to get a probability distribution
over the vocabulary space.
● This distribution is compared with the label, and backpropagation is used to update both the
matrices E and E’ accordingly.
● At the end of the training, E is the embedding matrix we wanted to learn.
● Multiple 1 × D vectors → sum to get a single 1 × D vector → multiply with E’ (D × |V|) → 1 × |V| vector → softmax
Dense layers perform a matrix-vector multiplication, so the length of the (row) output vector from the previous
layer must equal the input dimension (number of rows) of the dense layer's weight matrix.
import tensorflow as tf

vocab_size = 6          # |V|; example value for the toy corpus
context_window = 2      # number of context words taken on each side of the center word
embedding_dim = 100     # D, the embedding dimension
# CBOW-style model: embed the context words, average them, predict the center word
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                              input_length=context_window * 2),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
Word embeddings can also be trained with gensim's Word2Vec (a minimal sketch on a toy tokenized corpus):

import gensim
from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"], ["man", "eats", "food"]]   # toy tokenized corpus (illustrative)
model1 = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)   # sg=0 selects CBOW
print(model1)
word_vectors = model1.wv
print(word_vectors["dog"])   # learned 10-dimensional vector for "dog"
SkipGram:
● SkipGram is very similar to CBOW, with the prediction reversed: given the center word, the model
predicts the context words appearing around it.
● Single 1 × D vector → multiply with E’ (D × |V|) → 1 × |V| vector → softmax over this vector → error
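For completeness, a minimal gensim sketch of a skip-gram model on a toy tokenized corpus (the corpus and parameter values are illustrative); setting sg=1 selects skip-gram instead of CBOW:

from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"], ["man", "eats", "food"]]   # toy tokenized corpus
skipgram_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)   # sg=1 -> skip-gram
print(skipgram_model.wv["dog"])   # learned 10-dimensional vector for "dog"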
Universal Language Representation
Universal text representation refers to methods for encoding text into a standardized format that captures
its meaning and context, making it suitable for machine learning or natural language processing tasks. The
goal is to create a general-purpose representation of text that can be applied across various downstream
tasks, such as classification, translation, or information retrieval.
A universal representation ideally encodes the semantic content of text, taking into account syntactic,
lexical, and contextual information, making it transferable and reusable.
Key Characteristics
1. Semantic Richness: Captures the meaning of text rather than just syntactic form.
2. Contextuality: Adapts based on the surrounding context of words or sentences.
3. Transferability: Works effectively across multiple NLP tasks without retraining for each task.
4. Scalability: Handles large-scale text data efficiently.
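As an illustration (not from the original text), a pretrained sentence encoder can produce such a general-purpose representation. The sketch below assumes the sentence-transformers package is installed and uses "all-MiniLM-L6-v2" only as an example model name:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example pretrained sentence encoder
embeddings = model.encode(["Dog bites man.", "Man bites dog."])
print(embeddings.shape)   # one fixed-length dense vector per sentence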
1. Bag-of-Words (BoW)
● TF-IDF extends BoW by weighing words based on their frequency in a document and across documents.
● Example: Common words like "the" have low weights, while rare, meaningful words get higher
weights.
● Contextual embeddings generate representations that account for the context in which words appear.
● Example: The word "bank" in "river bank" is different from "financial bank," and the embeddings
will reflect this.
6. Character-Level Representations
● Focus on character sequences to handle typos, rare words, or morphologically rich languages.
● Example: Text is represented as embeddings derived from characters rather than words.
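A minimal sketch of character-level features using scikit-learn's character n-gram analyzer (an illustration using count features; many systems instead learn character embeddings):

from sklearn.feature_extraction.text import CountVectorizer

char_vect = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))   # character 2- and 3-grams within word boundaries
X = char_vect.fit_transform(["running", "runing"])                    # the misspelling still shares most character n-grams
print(char_vect.get_feature_names_out())
print(X.toarray())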
Challenges
A search engine is a complex system that involves several components working together to provide
users with relevant information in response to their queries. The main components of a search engine
include:
● Web Crawling:
Web Crawlers or Spiders: These are automated programs that systematically browse the web to
discover and collect information from web pages. Crawlers start by visiting a set of seed URLs and
follow links to other pages, recursively expanding the crawl.
● Indexing:
Index: The collected data is processed and organized into an index, a data structure that allows for
quick and efficient retrieval of relevant documents. The index contains information about the words
on a page, their frequency, and the URLs where they are found.
● Document Processing:
Document Parser: This component extracts text and metadata from various document types, such as
HTML, PDF, and other formats. It normalizes and tokenizes the text for further processing.
● Text Processing:
Text Analyzer: This component processes the text data from documents, performing tasks such as
stemming (reducing words to their root form), stop-word removal, and other natural language
processing (NLP) techniques to enhance the accuracy of search results.
● Inverted Index:
Inverted Index Generator: The inverted index is a critical data structure that maps terms to the
documents or web pages in which they appear. It enables the search engine to quickly identify
relevant documents containing a given query term.
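A minimal sketch of an inverted index as a plain Python dictionary mapping each term to the set of document IDs that contain it (the two toy documents are illustrative):

from collections import defaultdict

docs = {1: "dog bites man", 2: "man eats food"}   # toy document collection
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)
print(inverted_index["man"])   # {1, 2}: documents containing the query term "man"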
● Ranking Algorithm:
Ranking Engine: Search engines use sophisticated algorithms to rank search results based on
relevance to a user's query. Common ranking algorithms include PageRank (link-based) and more
modern approaches that consider user behavior and other factors.
● Query Processing:
Query Processor: This component interprets and processes user queries, transforming them into a
form that can be matched against the index. It may involve tasks such as query expansion, spelling
correction, and handling synonyms to improve search accuracy.
● User Interface:
User Interface (UI): The search engine's front-end presents the search results to users in a user-
friendly manner. It includes features like the search bar, filters, and options for refining search
queries.
Caching System: To improve search speed and reduce server load, search engines often use caching
to store frequently accessed search results or parts of the index. This allows for faster retrieval of
information.
● Relevance Feedback:
Relevance Feedback System: Some search engines incorporate user feedback to continuously
improve the relevance of search results. Feedback mechanisms may include user clicks, dwell time,
and other engagement metrics.
● Web Server:
Web Server: The web server handles user requests, interacts with the back-end components, and
delivers search results to users via the user interface. It also manages communication with other
components, such as the indexing system and database.
Database: Search engines often use databases to store and manage the index, as well as other
metadata and information about web pages. A database management system (DBMS) is crucial for
efficient data retrieval and storage.
These components work together to enable the search engine to crawl, index, and retrieve relevant
information efficiently, providing users with accurate and timely search results. The effectiveness
of a search engine depends on the integration and optimization of these components.
A language model (LM) is a statistical or machine learning model designed to understand, generate, or
predict text in a natural language. It works by learning patterns, structures, and meanings from large
amounts of text data, allowing it to perform various tasks such as predicting the next word in a sentence or
generating coherent paragraphs of text.
1. Probability Estimation: Assign a probability to a sequence of words, indicating how likely the
sequence is in the language.
2. Text Understanding: Capture the context and semantics of words and sentences for better
comprehension.
3. Text Generation: Generate grammatically and semantically meaningful text.
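As a small illustration of probability estimation (not from the original text), a maximum-likelihood bigram model estimates P(w2 | w1) = count(w1, w2) / count(w1). The toy corpus below is illustrative; real language models use much larger corpora plus smoothing:

from collections import Counter

tokens = "dog bites man . man eats food . dog eats meat .".split()   # toy corpus
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("dog", "bites"))   # 0.5: "dog" is followed by "bites" in 1 of its 2 occurrences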
A pre-trained language model (PLM) is a language model trained on a large corpus of text data in an
unsupervised manner before being fine-tuned for specific tasks. These models are designed to learn general
patterns and linguistic knowledge that can be transferred to various downstream tasks.
1. Pretraining: The model is trained on a large dataset (e.g., Wikipedia, books) to predict missing
words (masked language modeling) or the next word in a sequence (causal language modeling).
2. Fine-tuning: The pretrained model is further trained on a smaller, task-specific dataset to
optimize performance for tasks like sentiment analysis, translation, or summarization.
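As an illustration of using a pretrained (and already fine-tuned) model, the Hugging Face transformers pipeline below loads a default sentiment-analysis model; this assumes the transformers package is installed and shows inference only, not the fine-tuning step itself:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default pretrained model on first use
print(classifier("Pretrained language models make NLP development much faster."))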
1. Transfer Learning
Pretrained models act as a foundation, transferring general linguistic knowledge to a variety of NLP tasks,
reducing the need to train models from scratch.
2. Faster Development
Pretrained models significantly reduce the computational resources and time required to build models, as
the heavy lifting (general language understanding) has already been done.
3. Improved Accuracy
By leveraging large datasets during pretraining, these models often achieve state-of-the-art performance on
multiple NLP benchmarks.
4. Contextual Understanding
Modern pretrained models like BERT and GPT capture context and semantics, improving their ability to
handle tasks like sentiment analysis and question answering.
5. Multi-Task Capability
A single pretrained model can be fine-tuned for different tasks, making them versatile.
5. XLNet: Combines advantages of autoregressive and autoencoding methods for improved text
understanding.
Limitations