Ngram Experiment 3


Department of Computer Science & Engineering (AI&ML)

BE SEM: VII    AY: 2024-25

Subject: Natural Language Processing Lab

Aim: Implementation of: (i) Display BoW of an input text (ii) Display N-Gram of an input text

Theory:

Machine learning algorithms cannot work with raw text directly; the text must first be converted into
vectors of numbers. This process is called feature extraction.

The bag-of-words model is a popular and simple feature extraction technique used when we
work with text. It describes the occurrence of each word within a document.

To use this model, we need to:

• Design a vocabulary of known words (also called tokens)

• Choose a measure of the presence of known words

Any information about the order or structure of words is discarded; that is why it is called a
bag of words. The model only captures whether a known word occurs in a document, not where
in the document it occurs.
The intuition is that similar documents have similar content, and that from the content alone we
can learn something about the meaning of a document.
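
As a minimal sketch of this idea (pure Python; the two sample sentences are only illustrative), a bag-of-words representation of an input text can be displayed by tokenizing each document and counting word occurrences against a shared vocabulary:

from collections import Counter

# Illustrative corpus; any input text works the same way.
documents = [
    "It was the best of times",
    "it was the worst of times",
]

# Build the vocabulary of known words (case is ignored; word order is discarded).
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
print("Vocabulary:", vocabulary)

for doc in documents:
    counts = Counter(doc.lower().split())
    # The BoW vector records, for each vocabulary word, how often it occurs.
    print(doc, "->", [counts[word] for word in vocabulary])

Note that the two vectors differ only at "best" and "worst": similar documents produce similar vectors, which is exactly the intuition above.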

Because the vocabulary can grow very large, there is pressure to decrease its size when using a
bag-of-words model. There are simple text cleaning techniques that can be used as a first step
(a short cleaning sketch follows the list), such as:

• Ignoring case

• Ignoring punctuation



• Ignoring frequent words that don’t contain much information, called stop words, like “a,”
“of,” etc.
• Fixing misspelled words.

• Reducing words to their stem (e.g. “play” from “playing”) using stemming algorithms.
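
A minimal cleaning sketch covering these steps (assuming NLTK is installed for its PorterStemmer; the inline stop list and the sample sentence are only illustrative):

import string
from nltk.stem import PorterStemmer  # assumed available; any stemmer would do

# Tiny illustrative stop list; a real pipeline would use a fuller one,
# e.g. nltk.corpus.stopwords (which requires a one-time download).
STOP_WORDS = {"a", "an", "the", "of", "is", "was", "it"}

stemmer = PorterStemmer()

def clean_tokens(text):
    text = text.lower()                                               # ignore case
    text = text.translate(str.maketrans("", "", string.punctuation))  # ignore punctuation
    # Drop stop words and reduce the remaining words to their stems.
    return [stemmer.stem(tok) for tok in text.split() if tok not in STOP_WORDS]

print(clean_tokens("The boys were playing, and the dogs barked!"))
# ['boy', 'were', 'play', 'and', 'dog', 'bark']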

A more sophisticated approach is to create a vocabulary of grouped words. This both changes
the scope of the vocabulary and allows the bag-of-words model to capture a little more meaning
from the document.

In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word
pairs is, in turn, called a bigram model. Again, only the bigrams that appear in the corpus are
modeled, not all possible bigrams.

An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a
two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram
(more commonly called a trigram) is a three-word sequence of words like “please turn your” or
“turn your homework”.
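
Displaying the N-grams of an input text reduces to sliding a window of size N over the token list. A short sketch (pure Python, reusing the example phrase above):

def ngrams(tokens, n):
    # Every contiguous run of n tokens is one n-gram.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()
print("Bigrams: ", ngrams(tokens, 2))
# [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
print("Trigrams:", ngrams(tokens, 3))
# [('please', 'turn', 'your'), ('turn', 'your', 'homework')]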

Once a vocabulary has been chosen, the occurrence of words in example documents needs to
be scored.

Some simple scoring methods include:

• Counts. Count the number of times each word appears in a document.

• Frequencies. Calculate the frequency with which each word appears in a document out of all
the words in the document (see the scoring sketch below).
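
A sketch of both scoring methods (assuming scikit-learn is installed; the two sample documents are only illustrative):

from sklearn.feature_extraction.text import CountVectorizer  # assumed installed

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(documents)  # raw counts per document
vocab = vectorizer.get_feature_names_out()

for doc, row in zip(documents, count_matrix.toarray()):
    print(doc)
    print("  counts:     ", dict(zip(vocab, row)))
    # Frequencies: each count divided by the total words scored in the document.
    print("  frequencies:", {w: round(c / row.sum(), 2) for w, c in zip(vocab, row)})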

Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of
flexibility for customization on your specific text data.

It has been used with great success on prediction problems like language modeling and
document classification.

Nevertheless, it suffers from some shortcomings, such as:



Vocabulary: The vocabulary requires careful design, specifically to manage its size, which
impacts the sparsity of the document representations.

Sparsity: Sparse representations are harder to model both for computational reasons (space and
time complexity) and also for information reasons, where the challenge is for the models to
harness so little information in such a large representational space.

Examples of Use Cases

1. Autocomplete and Spell Checkers: By predicting the next word in a sequence or suggesting
corrections.
2. Speech Recognition: To predict the most likely sequence of words from a sequence of
sounds.
3. Machine Translation: To find the most probable sequence of words in the target
language given a sequence of words in the source language.
4. Text Generation: In chatbots or content generation systems to produce coherent text.

Overall, N-gram models provide a balance between simplicity and effectiveness, making them a
valuable tool in the NLP toolkit, especially for tasks involving local context and manageable data
sizes.

Conclusion:
The bag-of-words model is a popular and simple feature extraction technique used when we
work with text; it describes the occurrence of each word within a document. Statistical language
models, in essence, are models that assign probabilities to sequences of words. In this practical,
we implemented the simplest such model, the n-gram, which assigns probabilities to sentences
and sequences of words. Often a simple bigram approach works better than a unigram (1-gram)
bag-of-words model for tasks like document classification.

