Bag of Words

Essential Ingredients (6)

Weeks 9-12

Written by Abhishek Kaushik (Abhi)

~ This material is for educational purposes for a specific audience. Distribution in any form is not allowed.
Table of Contents
Weeks 5-9 (Python practical, hands-on)

1. Natural Language Processing
2. Theory of Bag of Words
3. Types of Techniques
4. Numerical example to calculate TF-IDF
5. Applications of Natural Language Processing

T stands for Theory, P stands for Pythonic, and D stands for Discussion.
Please take notes during the discussions.
Note: All images are sourced from the Internet under open licenses.
Text Data is Superficial
But Language is Complex...
What is NLP research?
Some Early NLP History
Bag-of-Words Model
Introduction
The bag-of-words model is a way of representing text data
when modeling text with machine learning algorithms.

The bag-of-words model is simple to understand and implement, and has seen great success in problems such as language modeling and document classification.

In this tutorial, you will discover the bag-of-words model for feature extraction in natural language processing.
Introduction (2)
After completing this tutorial, you will know:

● What the bag-of-words model is and why it is needed to represent text.
● How to develop a bag-of-words model for a collection of documents.
● How to use different techniques to prepare a vocabulary and score words.
Problem with Text
A problem with modeling text is that it is messy, and techniques like machine learning algorithms prefer well-defined, fixed-length inputs and outputs.

Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers, specifically vectors of numbers.

This is called feature extraction or feature encoding.

A popular and simple method of feature extraction with text data is called the bag-of-words model of text.
Bag-of-Words
● A bag-of-words model, or BoW for short, is a way of
extracting features from text for use in modeling, such
as with machine learning algorithms.
● The approach is very simple and flexible, and can be used
in a myriad of ways for extracting features from
documents.
Bag-of-Words (1)
A bag-of-words is a representation of text that describes
the occurrence of words within a document. It involves two
things:

1. A vocabulary of known words.

2. A measure of the presence of known words.

It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
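
A minimal Python sketch of these two ingredients, using invented example sentences:

# A vocabulary of known words plus a binary measure of their presence
# in each document (the example sentences are made up for illustration).
docs = [
    "it was the best of times",
    "it was the worst of times",
]

# 1. The vocabulary of known words, collected from all documents.
vocab = sorted({word for doc in docs for word in doc.split()})
# ['best', 'it', 'of', 'the', 'times', 'was', 'worst']

# 2. The measure of presence: 1 if the known word occurs, 0 otherwise.
def encode(doc):
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocab]

for doc in docs:
    print(encode(doc))
# [1, 1, 1, 1, 1, 1, 0]
# [0, 1, 1, 1, 1, 1, 1]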
Bag-of-Words (2)
The intuition is that documents are similar if they have
similar content. Further, that from the content alone we can
learn something about the meaning of the document.

The bag-of-words can be as simple or complex as you like.


The complexity comes both in deciding how to design the
vocabulary of known words (or tokens) and how to score the
presence of known words.

We will take a closer look at both of these concerns.


Example of the Bag-of-Words Model
● New documents that overlap with the vocabulary of known words, but may contain words outside of the vocabulary, can still be encoded: only the occurrence of known words is scored and unknown words are ignored (see the sketch below).
● You can see how this might naturally scale to large
vocabularies and larger documents.
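A short sketch of this behaviour, assuming scikit-learn is available (the documents below are illustrative, not from the original slides):

# Encode a new document against a fixed vocabulary; words outside the
# vocabulary are simply ignored (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the quick brown fox", "the lazy dog"]   # made-up documents
vectorizer = CountVectorizer()
vectorizer.fit(train_docs)                             # learn the vocabulary

# The new document overlaps with the vocabulary ("the", "dog") but also
# contains unknown words ("purple", "cat"), which are not scored at all.
new_doc = ["the purple cat chased the dog"]
print(vectorizer.get_feature_names_out())
# ['brown' 'dog' 'fox' 'lazy' 'quick' 'the']
print(vectorizer.transform(new_doc).toarray())
# [[0 1 0 0 0 2]]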
Managing Vocabulary
● As the vocabulary size increases, so does the vector
representation of documents.
● In the previous example, the length of the document
vector is equal to the number of known words.
● You can imagine that for a very large corpus, such as thousands of books, the length of the vector might be thousands or millions of positions. Further, each document may contain very few of the known words in the vocabulary.
● This results in a vector with lots of zero scores, called
a sparse vector or sparse representation.
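
A small sketch of how sparse such a vector can be, assuming SciPy is available (the vocabulary size and word positions are made up):

# A 100,000-word vocabulary, but the document contains only three known
# words, so almost every position is zero (illustrative sizes only).
import numpy as np
from scipy import sparse

vocab_size = 100_000
dense = np.zeros(vocab_size)
dense[[3, 250, 40_123]] = 1.0        # positions of the three known words

sparse_vec = sparse.csr_matrix(dense)
print(dense.nbytes)                  # 800000 bytes for the dense vector
print(sparse_vec.data.nbytes)        # 24 bytes of non-zero values stored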
Managing Vocabulary (2)
● Sparse vectors require more memory and computational
resources when modeling and the vast number of positions
or dimensions can make the modeling process very
challenging for traditional algorithms.
● As such, there is pressure to decrease the size of the
vocabulary when using a bag-of-words model.
● There are simple text cleaning techniques that can be used as a first step (see the sketch below), such as:
● Ignoring case and punctuation, ignoring frequent words that don't carry much information (called stop words, such as "a" and "of"), and fixing misspelled words.
● Reducing words to their stem (e.g. "play" from "playing") using stemming algorithms.
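
A sketch of these cleaning steps, assuming NLTK is available for the Porter stemmer (the stop-word list here is a small illustrative subset, not a standard one):

# Lowercasing, stripping punctuation, dropping stop words, and stemming.
import string
from nltk.stem import PorterStemmer

stop_words = {"a", "an", "the", "of", "is", "and"}   # illustrative subset
stemmer = PorterStemmer()

def clean(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in stop_words]
    return [stemmer.stem(t) for t in tokens]

print(clean("The boy is playing a game of football!"))
# ['boy', 'play', 'game', 'footbal']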
Scoring Words
Once a vocabulary has been chosen, the occurrence of words in
example documents needs to be scored.

In the worked example, we have already seen one very simple approach
to scoring: a binary scoring of the presence or absence of words.

Some additional simple scoring methods include:

● Counts. Count the number of times each word appears in a document.
● Frequencies. Calculate the frequency with which each word appears in a document, out of all the words in the document (see the sketch below).
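
A sketch comparing the three scoring schemes for one document against a fixed vocabulary (the vocabulary and document are invented for illustration):

from collections import Counter

vocab = ["the", "dog", "cat", "sat", "mat"]
doc = "the cat sat on the mat".split()
counts = Counter(doc)

binary = [1 if w in counts else 0 for w in vocab]     # presence/absence
count_scores = [counts[w] for w in vocab]             # raw counts
freq_scores = [counts[w] / len(doc) for w in vocab]   # share of document length

print(binary)        # [1, 0, 1, 1, 1]
print(count_scores)  # [2, 0, 1, 1, 1]
print(freq_scores)   # [0.33, 0.0, 0.17, 0.17, 0.17] (rounded)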
Word Hashing
● You may remember from computer science that a hash function is a bit of math that maps data to a fixed-size set of numbers.

● For example, we use them in hash tables when programming, where perhaps names are converted to numbers for fast lookup.

● We can use a hash representation of known words in our vocabulary. This addresses the problem of having a very large vocabulary for a large text corpus, because we can choose the size of the hash space, which is in turn the size of the vector representation of the document.
Word Hashing (1)
● Words are hashed deterministically to the same integer
index in the target hash space. A binary score or count
can then be used to score the word.
● This is called the “hash trick” or “feature hashing“.
● The challenge is to choose a hash space that accommodates the chosen vocabulary, minimizing the probability of collisions while trading off sparsity (see the sketch below).
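
A sketch of the hashing trick with a deliberately small hash space (the hash_space_size and example sentence are illustrative; scikit-learn's HashingVectorizer offers the same idea as a ready-made transformer):

# Each word is hashed to one of a fixed number of buckets, so the vector
# length is chosen up front instead of growing with the vocabulary.
import hashlib

hash_space_size = 20   # toy value; real systems often use 2**18 or more

def bucket(word):
    # A stable hash (Python's built-in hash() changes between runs).
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_space_size

def encode(doc):
    vector = [0] * hash_space_size
    for word in doc.split():
        vector[bucket(word)] += 1   # count scoring; collisions share a bucket
    return vector

print(encode("the quick brown fox jumps over the lazy dog"))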
TF-IDF
● A problem with scoring word frequency is that highly frequent words start to dominate in the document (i.e. they get larger scores), but may not carry as much "informational content" for the model as rarer but perhaps domain-specific words.
● One approach is to rescale the frequency of words by how
often they appear in all documents, so that the scores
for frequent words like “the” that are also frequent
across all documents are penalized.
TF-IDF (1)
This approach to scoring is called Term Frequency – Inverse Document
Frequency, or TF-IDF for short, where:

● Term Frequency: a scoring of the frequency of the word in the current document.
● Inverse Document Frequency: a scoring of how rare the word is across documents.

The scores are a weighting where not all words are equally important or interesting.

The scores have the effect of highlighting words that are distinct
(contain useful information) in a given document.
TF-IDF Calculation
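A small worked example of one common TF-IDF formulation (the documents below are made up for illustration; other variants, such as scikit-learn's smoothed IDF, give slightly different numbers):

# tf(t, d)     = count of t in d / number of words in d
# idf(t)       = log( N / number of documents containing t )
# tf-idf(t, d) = tf(t, d) * idf(t)
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "the" occurs in 2 of 3 documents, so it is weighted down;
# "cat" occurs in only 1, so it stands out in the first document.
print(round(tf_idf("the", docs[0]), 3))   # 2/6 * ln(3/2) ≈ 0.135
print(round(tf_idf("cat", docs[0]), 3))   # 1/6 * ln(3/1) ≈ 0.183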
Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement
and offers a lot of flexibility for customization on your specific
text data.

It has been used with great success on prediction problems like language modeling and document classification.

Nevertheless, it suffers from some shortcomings, such as:

● Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
Limitations of Bag-of-Words (2)
● Sparsity: Sparse representations are harder to model both for
computational reasons (space and time complexity) and also for
information reasons, where the challenge is for the models to
harness so little information in such a large representational
space.

● Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). Context and meaning can offer a lot to the model: if modeled, they could tell the difference between the same words differently arranged ("this is interesting" vs "is this interesting"), synonyms ("old bike" vs "used bike"), and much more.
References
● Jason Brownlee, Machine Learning Algorithms in Python, Machine Learning Mastery, available from https://machinelearningmastery.com/machine-learning-with-python/, accessed April 15th, 2018.

● http://www.cs.cmu.edu/~arielpro/15381f16/slides/NLP_guest_lecture.pdf
Thank you
