Course Name: Advanced Information Retrieval
January, 2021
Definition of Term-document matrix
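A document-term matrix describes how often each term occurs in each document of a
collection: rows correspond to documents and columns correspond to terms.
Document-term matrices are often stored as sparse matrix objects. These objects can be
treated as though they were matrices (for example, accessing particular rows and columns),
but are stored in a more efficient format.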
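To illustrate why a sparse format can be more efficient, the sketch below keeps only the non-zero cells of a small count matrix in a dictionary keyed by (row, column). The helper names `to_sparse` and `get_cell` and the sample matrix are hypothetical, chosen only for this example; real sparse matrix objects use more compact layouts.

```python
def to_sparse(dense):
    """Keep only the non-zero cells as {(row, col): value}."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }


def get_cell(sparse, i, j):
    """Read a cell as if the matrix were dense; missing cells are zero."""
    return sparse.get((i, j), 0)


# Hypothetical 3 x 5 count matrix in which most cells are zero.
dense = [
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 1],
    [2, 0, 0, 0, 0],
]
sparse = to_sparse(dense)

print(len(sparse), "cells stored instead of", len(dense) * len(dense[0]))
print(get_cell(sparse, 0, 2), get_cell(sparse, 1, 1))
```

Rows and columns can still be addressed by index, but storage grows with the number of non-zero cells rather than with the full matrix size.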
For example, given the two short documents
D1 = "I like databases"
D2 = "I dislike databases"
the document-term matrix would be:
     I   like   dislike   databases
D1   1   1      0         1
D2   1   0      1         1
This matrix shows which documents contain which terms and how many times each term
appears. Note that, unlike representing a document as just a token-count list, the
document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which
is why there are zero counts for terms that occur in the corpus but not in a specific
document.
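The construction of a document-term matrix for D1 and D2 can be sketched in plain Python as follows. This is a minimal illustration, assuming whitespace tokenization with case kept as written; the variable names are hypothetical, and the column order follows first appearance in the corpus, so it may differ from the order in a hand-written table.

```python
from collections import Counter

docs = {"D1": "I like databases", "D2": "I dislike databases"}

# Assumption: tokens are whitespace-separated and case is kept as written.
token_counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is every distinct term in the corpus, in first-seen order.
vocabulary = []
for text in docs.values():
    for token in text.split():
        if token not in vocabulary:
            vocabulary.append(token)

# One row per document, one column per vocabulary term.
# A Counter returns 0 for a term the document does not contain.
matrix = {
    name: [counts[term] for term in vocabulary]
    for name, counts in token_counts.items()
}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row has one entry for every term in the vocabulary, which is where the zero counts come from.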
As a result of the power-law distribution of tokens in nearly every corpus (see Zipf's law), it is
common to weight the counts. This can be as simple as dividing counts by the total number of
tokens in a document (called relative frequency or proportions), dividing by the maximum
frequency in each document (called prop max), or taking the log of frequencies (called log
count). If one desires to weight the words most unique to an individual document as compared
to the corpus as a whole, it is common to use tf-idf, which multiplies the term frequency by the
inverse of the term's document frequency.
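The weighting schemes described above can be sketched as small Python functions. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, log count is assumed to be log(1 + count) so that zero counts stay at zero, and tf-idf is assumed to use idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants of these formulas are in common use.

```python
import math


def relative_frequency(row):
    """Divide each count by the total number of tokens in the document."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]


def prop_max(row):
    """Divide each count by the largest count in the document."""
    peak = max(row, default=0)
    return [c / peak if peak else 0.0 for c in row]


def log_count(row):
    """Assumed variant: log(1 + count), which maps a zero count to zero."""
    return [math.log(1 + c) for c in row]


def tf_idf(matrix):
    """Assumed variant: raw count multiplied by log(N / df)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# Counts for the terms I, like, dislike, databases.
counts = [
    [1, 1, 0, 1],  # D1
    [1, 0, 1, 1],  # D2
]
weighted = tf_idf(counts)

print(relative_frequency(counts[0]))
print(weighted)
```

Under this idf variant, a term that occurs in every document (here "I" and "databases") receives weight zero, so only "like" and "dislike" distinguish the two documents.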
Each row of the matrix is a document vector, with one column for every term in the entire
corpus.
Naturally, some documents may not contain a given term, so this matrix is sparse. The
value in each cell of the matrix is the term frequency. (This value is often a weighted
term frequency, typically using tf-idf, i.e. term frequency-inverse document frequency.)
Disadvantages
We will move towards richer representations, beginning with the inverted index.
3. Implementation using any programming language (Python, Java, ...). I recommend the
Python programming language.
An open-source Python framework for vector space modelling. It contains memory-efficient
algorithms for constructing term-document matrices from text, plus common transformations.
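import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ["I like databases", "I dislike databases"]  # assumed input: D1 and D2 from the example
vec = CountVectorizer()  # note: the default tokenizer drops one-letter tokens such as "I"
c = vec.fit_transform(docs)  # sparse document-term matrix of raw counts
df = pd.DataFrame(c.toarray(), columns=vec.get_feature_names_out(), index=["D1", "D2"])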
The result is shown as: