Ngram Experiment 3
Aim: Implementation of: (i) displaying the Bag-of-Words (BoW) of an input text (ii) displaying the N-grams of an input text
Theory:
Machine learning algorithms cannot work with raw text directly; we need to convert the text into
vectors of numbers. This conversion is called feature extraction.
The bag-of-words model is a popular and simple feature extraction technique used when we
work with text. It describes the occurrence of each word within a document.
Any information about the order or structure of words is discarded, which is why it is called a
bag of words. The model captures whether a known word occurs in a document, but not where
in the document that word appears.
The intuition is that similar documents have similar contents, and that from the content alone
we can learn something about the meaning of a document.
Because every word in the vocabulary adds a dimension to the document vectors, there is
pressure to decrease the size of the vocabulary when using a bag-of-words model. There are
simple text cleaning techniques that can be used as a first step, such as:
• Ignoring case
• Ignoring punctuation
• Reducing words to their stem (e.g. “play” from “playing”) using stemming algorithms.
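A minimal sketch in Python, using only the standard library, that applies the first two cleaning steps and builds a bag-of-words count (the sample sentence is purely illustrative):

import re
from collections import Counter

def clean_tokens(text):
    # Ignore case and punctuation before tokenizing.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

def bag_of_words(text):
    # Count occurrences of each word; word order is discarded.
    return Counter(clean_tokens(text))

print(bag_of_words("The dog chased the cat, and the cat ran."))
# e.g. Counter({'the': 3, 'cat': 2, 'dog': 1, 'chased': 1, 'and': 1, 'ran': 1})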
A more sophisticated approach is to create a vocabulary of grouped words. This both changes
the scope of the vocabulary and allows the bag-of-words to capture a little bit more meaning
from the document.
In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word
pairs is, in turn, called a bigram model. Again, only the bigrams that appear in the corpus are
modeled, not all possible bigrams.
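For instance, a simple sliding-window function (a sketch, again with an illustrative sentence) can extract exactly the bigrams that appear in a text:

def ngrams(tokens, n=2):
    # Slide a window of size n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the dog chased the cat".split(), 2))
# ['the dog', 'dog chased', 'chased the', 'the cat']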
Once a vocabulary has been chosen, the occurrence of words in example documents needs to
be scored. One simple scoring method is frequencies: calculate the frequency with which each
word appears in a document, as a fraction of all the words in that document.
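A short sketch of this frequency scoring (the sentence is illustrative):

from collections import Counter

def term_frequencies(tokens):
    # Score each word as its count divided by the total number of words.
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

print(term_frequencies("the dog chased the cat".split()))
# {'the': 0.4, 'dog': 0.2, 'chased': 0.2, 'cat': 0.2}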
Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of
flexibility for customization on your specific text data.
It has been used with great success on prediction problems like language modeling and
document classification. Nevertheless, it has some notable limitations:
Sparsity: Sparse representations are harder to model, both for computational reasons (space
and time complexity) and for information reasons, where the challenge is for the models to
harness so little information in such a large representational space.
Overall, N-gram models provide a balance between simplicity and effectiveness, making them a
valuable tool in the NLP toolkit, especially for tasks involving local context and manageable data
sizes.
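As a concrete illustration of this trade-off, here is a sketch using scikit-learn's CountVectorizer (assuming a recent version of scikit-learn is installed; the two documents are illustrative) that builds a combined unigram-and-bigram bag-of-words:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the cat", "the cat ran away"]

# ngram_range=(1, 2) keeps both unigrams and bigrams in the vocabulary;
# lowercasing and punctuation stripping happen by default.
vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(matrix.toarray())

Note that the resulting document-term matrix is stored in a sparse format, reflecting the sparsity limitation discussed above.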
Conclusion:
The bag-of-words model is a popular and simple feature extraction technique used when we
work with text. It describes the occurrence of each word within a document. Statistical language
models, in essence, are models that assign probabilities to sequences of words. In this practical,
we studied the simplest model that assigns probabilities to sentences and sequences of words,
the n-gram. Often a simple bigram approach works better than a 1-gram bag-of-words model
for tasks like document classification.