JIMMA UNIVERSITY

JIMMA INSTITUTE OF TECHNOLOGY


FACULTY OF COMPUTING AND INFORMATICS
MSC. IN INFORMATION SCIENCE
(ELECTRONIC AND DIGITAL RESOURCE MANAGEMENT)

Course Name: Advanced Information Retrieval


Assignment 1: Term-document matrix

Prepared by: Ruth Wondu

Submitted to: Dr Getachew Mamo

January, 2021
1. Definition of Term-document matrix

A document-term matrix is a mathematical matrix that describes the frequency of terms that
occur in a collection of documents. In a document-term matrix, rows correspond to documents in
the collection and columns correspond to terms. This matrix is a specific instance of a
document-feature matrix, where "features" may refer to other properties of a document besides
terms. It is also common to encounter the transpose, the term-document matrix, where documents
are the columns and terms are the rows. Both are useful in natural language processing and
computational text analysis. While the value of each cell is commonly the raw count of a given
term, there are various schemes for weighting the raw counts, such as relative frequency
(proportions) and tf-idf. Terms are commonly single tokens separated by whitespace or
punctuation on either side (unigrams). In that case the matrix is also called a "bag of words"
representation, because the counts of individual words are retained but not the order of the
words in the document.
In summary, a document-term matrix is a matrix where

• each row represents one document
• each column represents one term (word)
• each value (typically) contains the number of appearances of that term in that document

Document-term matrices are often stored as a sparse matrix object. These objects can be
treated as though they were matrices (for example, accessing particular rows and columns),
but are stored in a more efficient format.
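
The idea of sparse storage can be illustrated with a minimal sketch that keeps only the nonzero cells in a dictionary keyed by (row, column). This is an assumption-laden illustration using only the Python standard library, not the format of any particular sparse-matrix package; the helper names `to_sparse` and `get_row` and the sample documents are hypothetical.

```python
from collections import Counter

def to_sparse(docs):
    """Return (vocab, cells): vocab maps term -> column index,
    cells maps (row, column) -> count, for nonzero counts only."""
    vocab = {}
    cells = {}
    for i, text in enumerate(docs):
        # Simplifying assumption: lowercase and split on whitespace.
        for term, n in Counter(text.lower().split()).items():
            j = vocab.setdefault(term, len(vocab))
            cells[(i, j)] = n
    return vocab, cells

def get_row(cells, i, n_terms):
    """Rebuild the dense row i, so the sparse store can be read like a matrix."""
    row = [0] * n_terms
    for (r, c), n in cells.items():
        if r == i:
            row[c] = n
    return row

# Hypothetical sample documents, used only for illustration.
sample_docs = ["a b a", "b c"]
vocab, cells = to_sparse(sample_docs)
print(len(cells), "stored cells out of", len(sample_docs) * len(vocab))
print(get_row(cells, 0, len(vocab)))
```

Only the cells with a nonzero count take up memory, while `get_row` still lets a row be read as if the full matrix existed.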

When creating a data set of terms that appear in a corpus of documents, the document-term
matrix contains rows corresponding to the documents and columns corresponding to the
terms. Each cell (i, j), then, is the number of times word j occurs in document i. As such, each
row is a vector of term counts that represents the content of the document corresponding to
that row. For instance, given the following two (short) documents:

• D1 = "I like databases"
• D2 = "I dislike databases"

the document-term matrix would be:

      I   like   dislike   databases
D1    1   1      0         1
D2    1   0      1         1
This shows which documents contain which terms and how many times each term appears. Note
that, unlike representing a document as just a token-count list, the document-term matrix
includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero
counts for terms that occur in the corpus but not in a specific document.
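
The construction of the matrix for D1 and D2 can be sketched with the Python standard library. This is a minimal illustration, assuming whitespace tokenization and a vocabulary ordered by first appearance, so the columns come out in a different order than in the table above. The function name `build_dtm` is hypothetical.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocab, matrix): one row per document, one column per term."""
    # Vocabulary in order of first appearance across the corpus.
    vocab = []
    for text in docs:
        for token in text.split():
            if token not in vocab:
                vocab.append(token)
    # Counter returns 0 for terms absent from a document,
    # which yields the zero counts described above.
    counts = [Counter(text.split()) for text in docs]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

docs = ["I like databases", "I dislike databases"]
vocab, matrix = build_dtm(docs)
print(vocab)
for name, row in zip(["D1", "D2"], matrix):
    print(name, row)
```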

Because tokens in nearly every corpus follow a power-law distribution (see Zipf's law), it is
common to weight the counts. This can be as simple as dividing counts by the total number of
tokens in a document (called relative frequency or proportions), dividing by the maximum
frequency in each document (called prop max), or taking the log of frequencies (called log
count). To weight the words most unique to an individual document as compared to the corpus
as a whole, it is common to use tf-idf, which multiplies the term frequency by the inverse of
the term's document frequency (usually log-scaled).
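
These weighting schemes can be sketched as follows, applied to the D1/D2 count matrix. This is one possible formulation rather than a standard: the log count is assumed to be log(1 + count) so that zero counts stay at zero, and the idf is assumed to be log(N / df) with no smoothing. Other variants are in common use, and all function names here are hypothetical.

```python
import math

def relative_frequency(row):
    """Divide each count by the total number of tokens in the document."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]

def prop_max(row):
    """Divide each count by the largest count in the document."""
    peak = max(row, default=0)
    return [c / peak if peak else 0.0 for c in row]

def log_count(row):
    """Assumed variant: log(1 + count), so a zero count maps to zero."""
    return [math.log(1 + c) for c in row]

def tf_idf(matrix):
    """Assumed variant: raw term frequency times log(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]

# Columns: I, like, dislike, databases (the D1/D2 example).
counts = [[1, 1, 0, 1],
          [1, 0, 1, 1]]
print(relative_frequency(counts[0]))
print(tf_idf(counts))
```

With this idf, the terms "I" and "databases", which occur in both documents, get weight 0, while "like" and "dislike" keep a positive weight in the one document that contains them.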

2. Advantages and Disadvantages


Advantages

• A term-document matrix is an important representation for text analytics.
• Each row of the matrix is a document vector, with one column for every term in the entire
corpus.
• Naturally, some documents may not contain a given term, so this matrix is sparse. The
value in each cell of the matrix is the term frequency. (This value is often a weighted
term frequency, typically using tf-idf, i.e. term frequency-inverse document frequency.)

Disadvantages

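• The term-document matrix is very sparse: almost all of its cells are 0.
• Even a very large matrix contains few nonzero entries. In the usual textbook illustration
(an assumption about what this figure refers to), a collection of 1 million documents of about
1,000 words each, with 500,000 distinct terms, gives a matrix of half a trillion cells that
contains no more than one billion 1s.
• It lacks support for more complex query operators (e.g., proximity search).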

We will move towards richer representations, beginning with the inverted index.
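
As a rough sketch of that direction, an inverted index stores, for each term, only the list of documents that contain it, so the zero cells of the matrix are never materialized. The code below is a minimal illustration that assumes lowercased whitespace tokens and document ids starting at 1. The function names are hypothetical, and it does not yet store the term positions that a proximity search would need.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        # set() so that each document appears once per term.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

def docs_with_all(index, terms):
    """Boolean AND query: ids of documents that contain every given term."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["I like databases", "I dislike databases"]
index = build_inverted_index(docs)
print(index["databases"])
print(docs_with_all(index, ["i", "dislike"]))
```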
3. Implementation using any programming language (Python/Java ...); Python is recommended

Open-source Python frameworks for vector space modelling provide memory-efficient algorithms
for constructing term-document matrices from text, plus common transformations. The following
script uses scikit-learn's CountVectorizer with pandas:

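import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Three short example documents (the corpus)
docs = ['Computer program used to retrieve digital information',
        'Software is necessary for users to access digital information',
        'Digital ICT is communication through computer-based systems']

# Learn the vocabulary and count each term per document (returns a sparse matrix)
vec = CountVectorizer()
c = vec.fit_transform(docs)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tdm = pd.DataFrame(c.toarray(), columns=vec.get_feature_names_out())

print(tdm)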

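Running the script prints the matrix, with one row per document and one column per lowercased
term; each cell holds the number of times that term occurs in the document.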
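
The counts the script is expected to produce can be approximated with the standard library alone. This sketch assumes CountVectorizer's default settings: lowercasing, tokens of two or more word characters, and columns in alphabetical order. It is a re-implementation for illustration, not scikit-learn's own code, and the names `tokenize` and `count_matrix` are hypothetical.

```python
import re
from collections import Counter

docs = ['Computer program used to retrieve digital information',
        'Software is necessary for users to access digital information',
        'Digital ICT is communication through computer-based systems']

# Assumed default token pattern: runs of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract its tokens."""
    return TOKEN.findall(text.lower())

def count_matrix(docs):
    """Return (vocab, matrix) with alphabetically sorted term columns."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]

vocab, matrix = count_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Note that under these assumptions "computer-based" is split into the two terms "computer" and "based".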
