
NLP—Unit-2 (TB)

Unit-2

TEXT REPRESENTATION

● How do we go about doing feature engineering for text data?


● How do we transform a given text into numerical form so that it can be fed into NLP and ML
algorithms?
● This conversion of raw text into a suitable numerical form is called text representation.

Feature representation for text is often much more complex than for other formats of data.
● The way an image is stored in a computer is in the form of a matrix of pixels where each cell[i,j]
in the matrix represents pixel i,j of the image. The real value stored at cell[i,j] represents the
intensity of the corresponding pixel in the image. This matrix representation accurately represents
the complete image.
● Video is similar: a video is just a collection of frames where each frame is an image. Hence, any
video can be represented as a sequential collection of matrices, one per frame, in the same order.
● Speech—it’s transmitted as a wave. To represent it mathematically, we sample the wave and record
its amplitude.
Text representation approaches are classified into four categories:
• Basic vectorization approaches
• Distributed representations
• Universal language representation
• Handcrafted features

Vector Space Models

Representation of text units (characters, phonemes, words, phrases, sentences, paragraphs, and documents)
with vectors of numbers is known as the vector space model (VSM).

The common way to calculate similarity between two text blobs is cosine similarity: the cosine of the
angle between their corresponding vectors. The cosine of 0° is 1 and the cosine of 90° is 0; the closer the
cosine is to 1, the more similar the two vectors (and hence the two texts).
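As an illustration, cosine similarity can be computed directly from its definition. The following is a minimal sketch in Python/NumPy; the two example vectors are the Bag of Words vectors for documents D1 and D4 that appear later in this unit.

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # e.g., a vector for "Dog bites man"
v2 = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])   # e.g., a vector for "Man eats food"
print(cosine_similarity(v1, v2))                 # ~0.33: the two texts share one word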


Basic Vectorization Approaches

Let’s take a toy corpus with only four documents—D1, D2, D3, D4 —as an example.

D1: Dog bites man.

D2: Man bites dog.

D3: Dog eats meat.

D4: Man eats food.

This corpus comprises six words: [dog, bites, man, eats, meat, food].

1. One-Hot Encoding

● In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID wid that is
between 1 and |V|, where V is the set of the corpus vocabulary.
● Each word is then represented by a |V|-dimensional binary vector of 0s and 1s.
● The vector is filled with all 0s barring the index, where index = wid. At this index, we
simply put a 1.
● The representations for individual words are then combined to form a sentence representation.
● Map each of the six words to unique IDs: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats =
6. Let’s consider the document D1: “dog bites man”. As per the scheme, each word is a six-
dimensional vector. Dog is represented as [1 0 0 0 0 0], as the word “dog” is mapped to ID 1. Bites
is represented as [0 1 0 0 0 0], and so on and so forth. Thus, D1 is represented as [[1 0 0 0 0 0] [0
1 0 0 0 0] [0 0 1 0 0 0]]. D4 (“man eats food”) is represented as [[0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]]. Other
documents in the corpus can be represented similarly.
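A minimal sketch of this scheme in plain Python, using the word-to-ID mapping above (the helper function name is ours, for illustration only):

word_to_id = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot_encode(document, word_to_id):
    vectors = []
    for word in document.lower().split():
        vec = [0] * len(word_to_id)      # |V|-dimensional vector of zeros
        vec[word_to_id[word] - 1] = 1    # put a 1 at the position given by wid
        vectors.append(vec)
    return vectors

print(one_hot_encode("dog bites man", word_to_id))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]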

It suffers from a few shortcomings:

• The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora
have large vocabularies. This results in a sparse representation where most of the entries in the vectors are
zeroes, making it computationally inefficient to store, compute with, and learn from (sparsity leads to
overfitting).

• This representation does not give a fixed-length representation for text, i.e., if a text has 10 words, you
get a longer representation for it as compared to a text with 5 words. For most learning algorithms, we need
the feature vectors to be of the same length.


• It treats words as atomic units and has no notion of (dis)similarity between words. For example, consider
three words: run, ran, and apple. Run and ran have similar meanings, as opposed to run and apple. But if we
take their respective one-hot vectors and compute the Euclidean distance between any pair of them, they are all
equally far apart (√2). Thus, semantically, these vectors are very poor at capturing the meaning of a word in
relation to other words.

• Say we train a model using our toy corpus. At runtime, we get a sentence: “man eats fruits.” The training
data didn’t include “fruits” and there’s no way to represent that word in our model. This is known as the
out-of-vocabulary (OOV) problem.

2. Bag of Words

● Similar to one-hot encoding, BoW maps words to unique integer IDs between 1 and |V|.
● Each document in the corpus is then converted into a vector of |V| dimensions, where the ith
component of the vector (i = wid) is simply the number of times the word w occurs in the document,
i.e., we score each word in V by its occurrence count in the document.
● Thus, for our toy corpus, where the word IDs are dog = 1, bites = 2, man = 3, meat = 4 , food = 5,
eats = 6, D1 becomes [1 1 1 0 0 0]. This is because the first three words in the vocabulary appeared
exactly once in D1, and the last three did not appear at all. D4 becomes [0 0 1 0 1 1].
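A minimal sketch of BoW using scikit-learn’s CountVectorizer on the toy corpus (assuming scikit-learn ≥ 1.0; note that CountVectorizer builds its vocabulary in alphabetical order, so the column order differs from the word IDs above):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(bow.toarray())                        # one count vector per document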

Advantages of this encoding:

• Like one-hot encoding, BoW is simple to understand and implement.

• With this representation, documents having the same words will have their vector representations
closer to each other in Euclidean space as compared to documents with completely different words.
The distance between D1 and D2 is 0 as compared to the distance between D1 and D4, which is 2.
Thus, the vector space resulting from the BoW scheme captures the semantic similarity of documents.

Disadvantages:

• The size of the vector increases with the size of the vocabulary. Thus, sparsity continues to be a
problem.

• It does not capture the similarity between different words that mean the same thing. Say we have three
documents: “I run”, “I ran”, and “I ate”. BoW vectors of all three documents will be equally apart.

• This representation does not have any way to handle out-of-vocabulary words.

• As the name indicates, it is a “bag” of words—word order information is lost in this representation.
Both D1 and D2 will have the same representation in this scheme.

3. Bag of N-Grams

● The text is broken into chunks of n contiguous words (or tokens); each such chunk is called an n-gram.


● The corpus vocabulary, V, is then nothing but a collection of all unique n-grams across the text
corpus.


● Then, each document in the corpus is represented by a vector of length |V|.


● This vector simply contains the frequency counts of n-grams present in the document and zero
for the n-grams that are not present.
● Let’s construct a 2-gram model for it. The set of all bigrams in the corpus is as follows: {dog
bites, bites man, man bites, bites dog, dog eats, eats meat, man eats, eats food}. Then, BoN
representation consists of an eight-dimensional vector for each document. The bigram
representation for the first two documents is as follows: D1 : [1,1,0,0,0,0,0,0], D2 :
[0,0,1,1,0,0,0,0].
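A minimal sketch of the same idea with scikit-learn’s CountVectorizer, where ngram_range=(2, 2) keeps only bigrams ((1, 2) would keep unigrams and bigrams together):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bon = bigram_vectorizer.fit_transform(corpus)

print(bigram_vectorizer.get_feature_names_out())   # the eight bigrams of the corpus
print(bon.toarray()[0])                            # bigram count vector for D1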

Here are the main pros and cons of BoN:

• It captures some context and word-order information in the form of n-grams.

• Thus, the resulting vector space is able to capture some semantic similarity.

• As n increases, dimensionality (and therefore sparsity) increases rapidly.

• It still provides no way to address the OOV problem.

4. TF-IDF

● In all the previous approaches, all the words in the text are treated as equally important—there is
no notion of some words being more important than others.
● TF-IDF aims to quantify the importance of a given word relative to other words in the document and
in the corpus.
● TF (term frequency) measures how often a term or word occurs in a given document.

● IDF (inverse document frequency) measures the importance of a term across the corpus. Since TF
gives all terms equal importance (weightage), IDF is used to weigh down terms that are common
across the corpus and to weigh up rarer, more informative ones.

● Library: scikit-learn provides TfidfVectorizer for computing TF-IDF representations.
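A minimal usage sketch of TfidfVectorizer on the toy corpus (note that scikit-learn uses a smoothed IDF and L2 normalization by default, so its numbers differ slightly from the plain formula used in the worked example below):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))   # TF-IDF weight of each vocabulary word per document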


All the vectorization approaches described so far have three fundamental drawbacks:

• They are discrete representations—i.e., they treat language units (words, n-grams, etc.) as
atomic units. This discreteness hampers their ability to capture relationships between words.
• The dimensionality increases with the size of the vocabulary, with most values being zero for
any vector. This makes them computationally inefficient.
• They cannot handle OOV words.

Example 2:

Document 1: It is going to rain today.
Document 2: Today I am not going outside.
Document 3: I am going to watch the season premiere.

Find TF

Document 1—“It is going to rain today.”

For each word, its TF = (Number of repetitions of the word in the document) / (Total number of words in the document)

Continue in the same way for the rest of the documents.

Find IDF

Compute IDF for each vocabulary word (we do this for feature names only, i.e., vocabulary words with stop words removed):

IDF = log[(Number of documents) / (Number of documents containing the word)]

Build the model, i.e., stack the TF-IDF scores of all the vocabulary words next to each other for each document:


TF-IDF = TF × IDF
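A minimal sketch of this calculation for the three example documents, following the TF and IDF formulas above exactly (log base 10, no smoothing, and stop-word removal omitted for brevity):

import math

docs = ["it is going to rain today",
        "today i am not going outside",
        "i am going to watch the season premiere"]
tokenized = [d.split() for d in docs]

def tf(word, doc):
    return doc.count(word) / len(doc)                 # repetitions / total words in the document

def idf(word):
    n_containing = sum(1 for doc in tokenized if word in doc)
    return math.log10(len(tokenized) / n_containing)  # log(N / number of docs with the word)

for word in ["going", "rain", "today"]:
    print(word, [round(tf(word, doc) * idf(word), 3) for doc in tokenized])
# "going" occurs in every document, so its IDF (and hence TF-IDF) is 0 everywhere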

Distributed Representations
They use neural network architectures to create dense, low-dimensional representations of
words and texts.
Some key terms:
● Distributional similarity
The meaning of a word is defined by the context in which it occurs.
Eg. in “NLP rocks”, the literal sense of “rocks” (stones) does not apply; from the context, “rocks” here means “is great”.
● Distributional hypothesis
Words that occur in similar contexts have similar meanings.
Eg. “dog” and “cat”
● Distributional representation
These vectors are obtained from a co-occurrence matrix that captures co-occurrence of word
and context. The dimension of this matrix is equal to the size of the vocabulary of the corpus.
The previous four schemes—one-hot, bag of words, bag of n-grams, and TF-IDF—come under
distributional representation.
● Distributed representation
In distributional representations, the vectors are very high dimensional and sparse, which makes
them computationally inefficient and hampers learning. To overcome this problem, distributed
representation schemes compress the dimensionality. This results in vectors that are compact
(i.e., low dimensional) and dense (i.e., hardly any zeros).
● Embedding
An embedding is a mapping from the vector space of a distributional representation to the
vector space of a distributed representation.

Pre-trained word embeddings


● They are trained on a large corpus, such as Wikipedia, news articles, or even the entire web,
and the resulting words and their corresponding vectors are made available on the web.
● These embeddings can be downloaded and used to get the vectors for the words.
● Such embeddings can be thought of as a large collection of key-value pairs, where keys are the
words in the vocabulary and values are their corresponding word vectors.
● Some of the most popular pre-trained embeddings are Word2vec by Google, GloVe by
Stanford, and fastText embeddings by Facebook.
● They’re available for various dimensions like d = 25, 50, 100, 200, 300, 600.
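A minimal sketch of downloading and querying such a pre-trained embedding via the gensim downloader API (assumptions: gensim is installed, an internet connection is available on first use, and "glove-wiki-gigaword-50" is used as an example model name from the downloader’s catalogue):

import gensim.downloader as api

# Downloads the embedding file on first use and caches it locally
word_vectors = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors

print(word_vectors["king"][:10])          # first 10 components of the vector for "king"
print(word_vectors.most_similar("king"))  # nearest neighbours in the embedding space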
Training our own embeddings
Two architectural variants of Word2vec approach:
• Continuous bag of words (CBOW)
• SkipGram
Language model-
● It is a statistical model that tries to give a probability distribution over sequences of words.
● Given a sentence of, say, n words, it assigns a probability Pr(w1, w2, …, wn) to the whole
sentence.
● The objective of a language model is to assign probabilities in such a way that it gives high
probability to “good” sentences and low probabilities to “bad” sentences.

CBOW:
● In CBOW, a language model is built that correctly predicts the center word given the context
words in which the center word appears.
● For example, take the sentence “The quick brown fox jumps over the lazy dog”. If we take the
word “jumps” as the center word with a context size of 2, then the context is given by brown,
fox, over, the. CBOW uses these context words to predict the target word—jumps.

We run a sliding window of size 2k+1 over the text corpus.


Dataset for CBOW
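A minimal sketch of building such a dataset by sliding a window of size 2k+1 over a tokenized sentence (k = 2, using the example sentence from above):

def cbow_pairs(tokens, k=2):
    pairs = []
    for i in range(k, len(tokens) - k):
        context = tokens[i - k:i] + tokens[i + 1:i + k + 1]   # 2k context words
        target = tokens[i]                                    # center word
        pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# e.g. ['brown', 'fox', 'over', 'the'] -> jumps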


● A shallow net (single hidden layer) is built. We assume we want to learn d-dimensional word
embeddings. Further, let V be the vocabulary of the text corpus.

The objective is to learn an embedding matrix E of dimension |V| × d.

● The matrix is initialized randomly; |V| is the size of the corpus vocabulary and d is the dimension of
the embedding.
● In the input layer, the indices of the words in the context are used to fetch the corresponding rows of
the embedding matrix E.


● The vectors fetched are then added to get a single d-dimensional vector, and this is passed to the next
layer.
● The next layer simply takes this d-dimensional vector and multiplies it with another matrix E′ of
dimension d × |V|.
● This gives a 1 × |V| vector, which is fed to a softmax function to get a probability distribution
over the vocabulary space.
● This distribution is compared with the label, and backpropagation is used to update both
matrices E and E′ accordingly.
● At the end of training, E is the embedding matrix we wanted to learn.
● Multiple 1 × d vectors → sum to get a single 1 × d vector → multiply with E′ (d × |V|) → 1 × |V| vector →
softmax over this vector → error calculation → backpropagation


Dense layers: Matrix-vector multiplication

A dense layer performs a matrix-vector multiplication: the output (row) vector from the previous layer
is multiplied by the layer's weight matrix, so the length of that output vector must equal the number of
rows of the weight matrix.

import tensorflow as tf

vocab_size = 6          # |V|: size of the corpus vocabulary (example value)
context_window = 2      # k: number of context words on each side (example value)
embedding_dim = 100     # d: dimension of the word embeddings

# CBOW-style model: embed the 2k context words, average them, and predict the center word
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embedding_dim,
                              input_length=context_window * 2),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

Example for getting embeddings of a word using CBOW in gensim:

import gensim
from gensim.models import Word2Vec

# Corpus: a list of tokenized sentences (here the toy corpus from above;
# in practice, read any file from the corpus as a list of token lists)
data = [["dog", "bites", "man"], ["man", "bites", "dog"],
        ["dog", "eats", "meat"], ["man", "eats", "food"]]

# Create CBOW model (CBOW is the default architecture, sg=0)
model1 = gensim.models.Word2Vec(data, min_count=1, window=2)
print(model1)

word_vectors = model1.wv
print(word_vectors["dog"])   # embedding vector for the word "dog"

SkipGram:

● SkipGram is very similar to CBOW, with some minor changes.


● In SkipGram, the task is to predict the context words from the center word.


● Single 1 × d vector → multiply with E′ (d × |V|) → 1 × |V| vector → softmax over this vector →
error calculation for context words → backpropagation
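For comparison with the gensim CBOW example above, passing sg=1 to the same Word2Vec call trains a skip-gram model instead (sg=0, the default, is CBOW); this is a minimal sketch on the toy corpus:

from gensim.models import Word2Vec

data = [["dog", "bites", "man"], ["man", "bites", "dog"],
        ["dog", "eats", "meat"], ["man", "eats", "food"]]

# sg=1 selects the skip-gram architecture
skipgram_model = Word2Vec(data, min_count=1, window=2, sg=1)
print(skipgram_model.wv.most_similar("dog"))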

Differences between CBOW and skip-gram:


Feature                              CBOW                                        Skip-gram
Predicts                             Target word given context words             Context words given target word
Efficiency                           More efficient                              Less efficient
Fine-grained word representations    Less fine-grained                           More fine-grained
Task                                 Text classification, natural                Natural language generation,
                                     language understanding                      machine translation


Universal Text Representations


● All text representations are inherently biased based on what they saw in training data.
Eg., an embedding model trained heavily on technology news or articles is likely to identify
Apple as being closer to, say, Microsoft or Facebook than to an orange or pear.
● Unlike the basic vectorization approaches, pre-trained embeddings are generally large-sized
files (several gigabytes), which may cause problems in certain deployment scenarios.
One good practice is to use in-memory databases like Redis with a cache on top of them to
address scaling and latency issues.
● Modeling language for a real-world application is more than capturing the information via word
and sentence embeddings.
● Consider practical issues such as return on investment from the effort, business needs, and
infrastructural constraints before trying to use them in production-grade applications.

Universal text representation refers to methods for encoding text into a standardized format that captures
its meaning and context, making it suitable for machine learning or natural language processing tasks. The
goal is to create a general-purpose representation of text that can be applied across various downstream
tasks, such as classification, translation, or information retrieval.

A universal representation ideally encodes the semantic content of text, taking into account syntactic,
lexical, and contextual information, making it transferable and reusable.

Key Characteristics

1. Semantic Richness: Captures the meaning of text rather than just syntactic form.
2. Contextuality: Adapts based on the surrounding context of words or sentences.
3. Transferability: Works effectively across multiple NLP tasks without retraining for each task.
4. Scalability: Handles large-scale text data efficiently.

Types of Universal Text Representations

1. Bag-of-Words (BoW)

● Text is represented as a vector of word frequencies or occurrences.


● Ignores word order and context.
● Example: "I love NLP" → [1, 1, 1, 0] (if the vocabulary is {I, love, NLP, hate}).

2. Term Frequency-Inverse Document Frequency (TF-IDF)

● Extends BoW by weighing words based on their frequency in a document and across documents.


● Example: Common words like "the" have low weights, while rare, meaningful words get higher
weights.

3. Word Embeddings (e.g., Word2Vec, GloVe)

● Represent words as dense vectors in a continuous space, capturing semantic relationships.


● Example: The word "king" might be represented as a 300-dimensional vector where similar
words like "queen" are nearby.

4. Contextual Embeddings (e.g., BERT, GPT)

● Generate representations that account for the context in which words appear.
● Example: The word "bank" in "river bank" is different from "financial bank," and the embeddings
will reflect this.
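A minimal sketch of obtaining contextual embeddings with the Hugging Face transformers library (assumptions: transformers and PyTorch are installed, and "bert-base-uncased" is used as the pre-trained checkpoint):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sentence in ["He sat by the river bank.", "She deposited cash at the bank."]:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One contextual vector per token; the vector for "bank" differs between the two sentences
    print(sentence, outputs.last_hidden_state.shape)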

5. Sentence Embeddings (e.g., Universal Sentence Encoder, Sentence-BERT)

● Represent entire sentences or paragraphs as single, fixed-size vectors.


● Useful for tasks like semantic similarity, question answering, and document clustering.
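A minimal sketch using the sentence-transformers library (assumptions: the library is installed and "all-MiniLM-L6-v2" is used as an example model):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["Dog bites man.", "A man was bitten by a dog.", "The weather is sunny."]

embeddings = model.encode(sentences)                # one fixed-size vector per sentence
print(util.cos_sim(embeddings[0], embeddings[1]))   # high: same meaning, different wording
print(util.cos_sim(embeddings[0], embeddings[2]))   # low: unrelated sentences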

6. Character-Level Representations

● Focus on character sequences to handle typos, rare words, or morphologically rich languages.
● Example: Text is represented as embeddings derived from characters rather than words.

Applications of Universal Text Representations

1. Text Classification: Sentiment analysis, spam detection, and topic modeling.


2. Semantic Search: Finding relevant documents or passages based on semantic similarity.
3. Question Answering: Generating or retrieving answers to natural language questions.
4. Translation: Improving the quality of machine translation systems.
5. Summarization: Capturing key information from lengthy documents.

Challenges

1. Scalability: Creating representations for massive datasets can be computationally expensive.


2. Out-of-Vocabulary Words: Traditional methods struggle with words not seen during training.
3. Context Sensitivity: Non-contextual models fail to capture polysemy and nuanced meanings.
4. Task Adaptability: Balancing universality with specificity for certain tasks can be difficult.

Modern Universal Text Representation Models

1. BERT (Bidirectional Encoder Representations from Transformers): Context-aware


embeddings for words and phrases.


2. GPT (Generative Pretrained Transformer): Pretrained transformer-based embeddings for


generation tasks.
3. Universal Sentence Encoder (USE): Pretrained embeddings for entire sentences or paragraphs.
4. FastText: Embedding words based on character n-grams, improving robustness to rare words.

Search and Information Retrieval


● A search engine is an important component of everyone’s online activity.
● We search for information to decide on the best items to purchase, nice places to eat out, and
businesses to frequent, just to name a few examples.
● We also rely heavily on search to sift through our emails, documents, and financial transactions.
● A lot of these search interactions happen through text (or speech converted to text in voice
input).
● Google search query uses:
1. Spelling correction
2. Related queries
3. Snippet extraction
4. Biographical information extraction
5. Search results classification and so on.

Two types of search engines:


• Generic search engines, such as Google and Bing, that crawl the web and aim to cover as
much as possible by constantly looking for new webpages
• Enterprise search engines, where our search space is restricted to a smaller set of already
existing documents within an organization.

Components of a Search Engine

A search engine is a complex system that involves several components working together to provide
users with relevant information in response to their queries. The main components of a search engine
include:

● Web Crawling:

Web Crawlers or Spiders: These are automated programs that systematically browse the web to
discover and collect information from web pages. Crawlers start by visiting a set of seed URLs and
follow links to other pages, recursively expanding the crawl.

● Indexing:

Index: The collected data is processed and organized into an index, a data structure that allows for
quick and efficient retrieval of relevant documents. The index contains information about the words
on a page, their frequency, and the URLs where they are found.


● Document Processing:

Document Parser: This component extracts text and metadata from various document types, such as
HTML, PDF, and other formats. It normalizes and tokenizes the text for further processing.

● Text Processing:

Text Analyzer: This component processes the text data from documents, performing tasks such as
stemming (reducing words to their root form), stop-word removal, and other natural language
processing (NLP) techniques to enhance the accuracy of search results.

● Inverted Index:

Inverted Index Generator: The inverted index is a critical data structure that maps terms to the
documents or web pages in which they appear. It enables the search engine to quickly identify
relevant documents containing a given query term.
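A toy sketch of an inverted index in plain Python, to illustrate the idea (real engines use far more elaborate, compressed structures):

from collections import defaultdict

documents = {
    1: "dog bites man",
    2: "man bites dog",
    3: "dog eats meat",
}

# Map each term to the set of document IDs that contain it
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

print(inverted_index["dog"])                            # {1, 2, 3}
print(inverted_index["bites"] & inverted_index["man"])  # documents matching both query terms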

● Ranking Algorithm:

Ranking Engine: Search engines use sophisticated algorithms to rank search results based on
relevance to a user's query. Common ranking algorithms include PageRank (link-based) and more
modern approaches that consider user behavior and other factors.

● Query Processing:

Query Processor: This component interprets and processes user queries, transforming them into a
form that can be matched against the index. It may involve tasks such as query expansion, spelling
correction, and handling synonyms to improve search accuracy.

● User Interface:

User Interface (UI): The search engine's front-end presents the search results to users in a user-
friendly manner. It includes features like the search bar, filters, and options for refining search
queries.

● Caching and Storage:

Caching System: To improve search speed and reduce server load, search engines often use caching
to store frequently accessed search results or parts of the index. This allows for faster retrieval of
information.

● Relevance Feedback:


Relevance Feedback System: Some search engines incorporate user feedback to continuously
improve the relevance of search results. Feedback mechanisms may include user clicks, dwell time,
and other engagement metrics.

● Web Server:

Web Server: The web server handles user requests, interacts with the back-end components, and
delivers search results to users via the user interface. It also manages communication with other
components, such as the indexing system and database.

● Database Management System (DBMS):

Database: Search engines often use databases to store and manage the index, as well as other
metadata and information about web pages. A database management system (DBMS) is crucial for
efficient data retrieval and storage.

These components work together to enable the search engine to crawl, index, and retrieve relevant
information efficiently, providing users with accurate and timely search results. The effectiveness
of a search engine depends on the integration and optimization of these components.

A language model (LM) is a statistical or machine learning model designed to understand, generate, or
predict text in a natural language. It works by learning patterns, structures, and meanings from large
amounts of text data, allowing it to perform various tasks such as predicting the next word in a sentence or
generating coherent paragraphs of text.

Key Goals of Language Models:

1. Probability Estimation: Assign a probability to a sequence of words, indicating how likely the
sequence is in the language.
2. Text Understanding: Capture the context and semantics of words and sentences for better
comprehension.
3. Text Generation: Generate grammatically and semantically meaningful text.

Types of Language Models:

1. Statistical Language Models:


○ n-gram Models: Use fixed-length sequences of words to estimate probabilities.
○ Markov Models: Predict the next word based on the current state (e.g., the previous
word or words).
2. Neural Language Models:
○ RNNs (Recurrent Neural Networks): Use sequential dependencies to model text.
○ LSTMs and GRUs: Handle longer dependencies by addressing the vanishing gradient
problem.
○ Transformers (e.g., BERT, GPT): Model dependencies across long ranges using self-
attention mechanisms.
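To illustrate the n-gram models listed above, here is a minimal sketch of a bigram (n = 2) language model estimated by simple counting on the toy corpus from earlier in this unit:

from collections import Counter, defaultdict

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Count how often each word follows each previous word
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev, *)
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(round(bigram_prob("dog", "bites"), 3))   # 0.333: "dog" is followed by "bites" in 1 of 3 cases
print(round(bigram_prob("man", "eats"), 3))    # 0.333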


Pre-Trained Language Models:

A pre-trained language model (PLM) is a language model trained on a large corpus of text data in an
unsupervised manner before being fine-tuned for specific tasks. These models are designed to learn general
patterns and linguistic knowledge that can be transferred to various downstream tasks.

Working of Pre-Trained Language Models:

1. Pretraining: The model is trained on a large dataset (e.g., Wikipedia, books) to predict missing
words (masked language modeling) or the next word in a sequence (causal language modeling).
2. Fine-tuning: The pretrained model is further trained on a smaller, task-specific dataset to
optimize performance for tasks like sentiment analysis, translation, or summarization.
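A minimal sketch of the masked language modeling objective in action, using the transformers fill-mask pipeline with a pre-trained BERT checkpoint (assumptions: transformers and a backend such as PyTorch are installed):

from transformers import pipeline

# BERT was pretrained with masked language modeling: predict the [MASK] token from context
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The dog [MASK] the man."):
    print(prediction["token_str"], round(prediction["score"], 3))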

Purpose of Pre-Trained Language Models

1. Transfer Learning

Pretrained models act as a foundation, transferring general linguistic knowledge to a variety of NLP tasks,
reducing the need to train models from scratch.

2. Faster Development

Pretrained models significantly reduce the computational resources and time required to build models, as
the heavy lifting (general language understanding) has already been done.

3. Improved Accuracy

By leveraging large datasets during pretraining, these models often achieve state-of-the-art performance on
multiple NLP benchmarks.

4. Contextual Understanding

Modern pretrained models like BERT and GPT capture context and semantics, improving their ability to
handle tasks like sentiment analysis and question answering.

5. Multi-Task Capability

A single pretrained model can be fine-tuned for different tasks, making them versatile.

Examples of Pre-Trained Language Models

1. BERT (Bidirectional Encoder Representations from Transformers): Designed for


understanding the context from both directions in a sentence.
2. GPT (Generative Pretrained Transformer): Specialized in generating coherent and context-
aware text.
3. RoBERTa: An improved version of BERT, optimized for higher performance.
4. T5 (Text-to-Text Transfer Transformer): Converts all NLP problems into text-to-text format.


5. XLNet: Combines advantages of autoregressive and autoencoding methods for improved text
understanding.

Advantages of Pre-Trained Language Models

1. Efficiency: Eliminates the need for training from scratch.


2. Generalization: Works well across tasks due to exposure to diverse datasets during pretraining.
3. Scalability: Models like GPT-4 can handle large and complex NLP tasks.

Limitations

1. Resource Intensity: Pretraining requires massive computational resources and datasets.


2. Biases: Models may inherit biases present in the training data.
3. Overfitting on Fine-Tuning: Careful fine-tuning is needed to prevent overfitting on small
datasets.
