Course Name: Advanced Information Retrieval

The document discusses term-document matrices. A document-term matrix describes the frequency of terms in documents, with rows representing documents and columns representing terms; the term-document matrix is its transpose. It is an important representation for text analytics, but it is very sparse. In Python, sklearn's CountVectorizer can transform documents into a matrix with documents as rows and terms as columns, where each value is a term frequency.

Uploaded by

jewar


JIMMA UNIVERSITY

JIMMA INSTITUTE OF TECHNOLOGY

FACULTY OF COMPUTING AND
INFORMATICS
MSC. IN INFORMATION SCIENCE
(ELECTRONIC AND DIGITAL RESOURCE
MANAGEMENT)

Course Name: Advanced Information Retrieval

Assignment 1: Term-document matrix

Prepared by: Ruth Wondu

Submitted to: Dr Getachew Mamo

January, 2021
Definition of Term-document matrix

1. A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix, where "features" may refer to properties of a document other than terms. It is also common to encounter the transpose, the term-document matrix, where documents are the columns and terms are the rows. Both are useful in natural language processing and computational text analysis. While the value of each cell is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as relative frequency (proportions) and tf-idf. Terms are commonly single tokens separated by whitespace or punctuation on either side, also called unigrams. In that case, the matrix is also referred to as a "bag of words" representation, because the counts of individual words are retained but not the order of the words in the document.
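The "bag of words" idea can be illustrated with a minimal sketch. The helper name `bag_of_words` and the simple lowercase-and-split tokenization are assumptions made for this illustration, not part of the assignment: two texts with the same words in a different order produce the same counts.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Same words, different order
a = bag_of_words("databases like I")
b = bag_of_words("I like databases")

# The counts are identical, so word order has been discarded
print(a == b)
```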
A document-term matrix (or its transpose, the term-document matrix) describes the frequency of terms that occur in a collection of documents. In the document-term form:

• each row represents one document
• each column represents one term (word)
• each value (typically) contains the number of appearances of that term in that document

Document-term matrices are often stored as a sparse matrix object. These objects can be
treated as though they were matrices (for example, accessing particular rows and columns),
but are stored in a more efficient format.
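As a rough illustration of sparse storage, the sketch below keeps only the non-zero cells of a small matrix in a dictionary keyed by (row, column). The matrix values and the helper names `get` and `get_row` are hypothetical; real sparse matrix objects use more compact layouts, but the principle is the same.

```python
# Dense form: every cell is stored, including the zeros (hypothetical counts)
dense = [
    [2, 0, 0, 1, 0],
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
]

# Sparse form: keep only the non-zero cells, keyed by (row, column)
sparse = {
    (i, j): count
    for i, row in enumerate(dense)
    for j, count in enumerate(row)
    if count != 0
}

def get(sparse_matrix, i, j):
    """Read a cell as if the matrix were dense; missing cells are 0."""
    return sparse_matrix.get((i, j), 0)

def get_row(sparse_matrix, i, n_cols):
    """Rebuild one full row from the sparse representation."""
    return [get(sparse_matrix, i, j) for j in range(n_cols)]

print(len(sparse), "cells stored instead of", len(dense) * len(dense[0]))
print(get_row(sparse, 0, 5))
```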

When creating a data-set of terms that appear in a corpus of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each ij cell, then, is the number of times word j occurs in document i. As such, each row is a vector of term counts that represents the content of the document corresponding to that row. For instance, if one has the following two (short) documents:

• D1 = "I like databases"
• D2 = "I dislike databases"

then the document-term matrix would be:

      I    like    dislike    databases
D1    1    1       0          1
D2    1    0       1          1
The matrix shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero counts for terms in the corpus that do not occur in a specific document.
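The construction described above can be sketched in plain Python for the two example documents. This is only an illustration under simple assumptions (whitespace tokenization, case preserved, vocabulary ordered by first appearance), so the column order differs from the table above, which lists "dislike" before "databases".

```python
docs = {"D1": "I like databases", "D2": "I dislike databases"}

# Split each document into tokens on whitespace
tokenized = {name: text.split() for name, text in docs.items()}

# Corpus vocabulary: every distinct term, in order of first appearance
vocabulary = []
for tokens in tokenized.values():
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per document, one column per vocabulary term; cell = term count
matrix = {
    name: [tokens.count(term) for term in vocabulary]
    for name, tokens in tokenized.items()
}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Terms that a document does not contain still get a column, which is where the zero counts come from.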

As a result of the power-law distribution of tokens in nearly every corpus (see Zipf's law), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). If one desires to weight the words most unique to an individual document as compared to the corpus as a whole, it is common to use tf-idf, which multiplies the term frequency by the inverse of the term's document frequency.
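The weighting schemes can be sketched as follows, using the counts from the two example documents. The function names are chosen for this illustration, and two details are assumptions rather than something the text specifies: the log count is taken as log(1 + count) so that zero counts stay at zero, and the inverse document frequency is taken as log(N / df), one common variant among several.

```python
import math

# Raw counts for D1 and D2 (columns: I, like, dislike, databases)
counts = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
]

def relative_frequency(row):
    """Divide each count by the total number of tokens in the document."""
    total = sum(row)
    return [c / total for c in row]

def prop_max(row):
    """Divide each count by the largest count in the document."""
    largest = max(row)
    return [c / largest for c in row]

def log_count(row):
    """Assumed variant: log(1 + count), so a zero count maps to zero."""
    return [math.log(1 + c) for c in row]

def tf_idf(matrix):
    """Term frequency times log(N / df); assumes every term occurs somewhere."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]

weights = tf_idf(counts)
print(relative_frequency(counts[0]))
print(weights)
```

With this variant, terms that occur in both documents ("I", "databases") receive a weight of zero, while "like" and "dislike" keep a positive weight because each is specific to one document.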

2. Advantages and Disadvantages

Advantages

• A term-document matrix is an important representation for text analytics.
• Each row of the matrix is a document vector, with one column for every term in the entire corpus.
• The value in each cell of the matrix is the term frequency. (This value is often a weighted term frequency, typically using tf-idf, term frequency-inverse document frequency.) Naturally, some documents may not contain a given term, so this matrix is sparse.

Disadvantages

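• The term-document matrix is very sparse.
• For a large collection (the usual textbook illustration is one million documents of about 1,000 words each), the matrix contains no more than one billion 1s; nearly all other cells are 0.
• It lacks support for more complex query operators (e.g., proximity search).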

We will move towards richer representations, beginning with the inverted index.
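The text only names the inverted index, so the following is a minimal sketch of its basic (non-positional) form under simple assumptions: lowercase whitespace tokenization and document identifiers taken from the earlier D1/D2 example. Instead of storing a cell for every document-term pair, each term maps to the list of documents that contain it.

```python
from collections import defaultdict

docs = {"D1": "I like databases", "D2": "I dislike databases"}

# Map each term to the documents in which it occurs
inverted_index = defaultdict(list)
for doc_id, text in docs.items():
    # set() so that a document is listed once per term, however often it occurs
    for term in sorted(set(text.lower().split())):
        inverted_index[term].append(doc_id)

for term, postings in inverted_index.items():
    print(term, postings)
```

Only the term-document pairs that actually occur are stored, which avoids the zeros of the sparse matrix.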
3. Implementation using any programming language (Python, Java, ...); the Python programming language is recommended
Open-source Python frameworks for vector space modelling contain memory-efficient algorithms for constructing term-document matrices from text, plus common transformations. The example below uses sklearn's CountVectorizer.

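import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Three short example documents (the corpus)
docs = ['Computer program used to retrieve digital information',
        'Software is necessary for users to access digital information',
        'Digital ICT is communication through computer-based systems']

# Learn the vocabulary and count term occurrences (returns a sparse matrix)
vec = CountVectorizer()
c = vec.fit_transform(docs)

# Rows are documents, columns are terms.
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
tdm = pd.DataFrame(c.toarray(), columns=vec.get_feature_names_out())
print(tdm)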

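The result is printed as a table with one row per document and one column per term, where each cell holds the term frequency.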

