
Vector Space Model:

We consider the vector space model based on the bag-of-words representation. Documents and queries are represented as vectors.

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero.
Several different ways of computing these values, also known as (term) weights, have been developed. One of the best-known
schemes is tf-idf weighting.
The definition of a term depends on the application. Typically, terms are single words, keywords, or longer phrases. If words
are chosen to be the terms, the dimensionality of the vector is the size of the vocabulary (the number of distinct words
occurring in the corpus).
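
As a minimal sketch of this representation, assuming lower-cased whitespace tokenization and raw term counts as weights (real systems typically add stemming, stop-word removal, and tf-idf):

# Build raw term-count (bag-of-words) vectors over a tiny made-up corpus.
corpus = ["Cat runs behind rat", "Dog runs behind cat"]

# The vocabulary is the set of distinct words in the corpus;
# its size is the dimensionality of every vector.
vocabulary = sorted({word for doc in corpus for word in doc.lower().split()})

def to_vector(text):
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

print(vocabulary)                      # ['behind', 'cat', 'dog', 'rat', 'runs']
print([to_vector(d) for d in corpus])  # [[1, 1, 0, 1, 1], [1, 1, 1, 0, 1]]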

Vector operations can be used to compare documents with queries.

The Vector Space Model represents documents and terms as vectors in a multi-dimensional space. Each dimension
corresponds to a unique term in the entire corpus of documents, and documents and queries are represented as vectors
within that space.
1. Vector Representation: We represent documents and queries as vectors using techniques like TF-IDF. Each
document in the corpus and the query are converted into vectors in the same high-dimensional space.
2. Cosine Similarity Calculation: To determine the relevance of a document to a query, we calculate the cosine
similarity between the query vector and the vectors representing each document in the corpus.
3. Ranking: Documents with higher cosine similarity scores to the query are considered more relevant and are ranked
higher; those with lower scores are ranked lower (a runnable sketch of all three steps follows this list).
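
A minimal sketch of this three-step pipeline, assuming scikit-learn is available (its TfidfVectorizer and cosine_similarity utilities); the corpus and query are made-up examples:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["Cat runs behind rat", "Dog runs behind cat"]
query = "rat"

# Step 1: documents and the query become tf-idf vectors in the same space,
# because the query is transformed with the vocabulary fitted on the corpus.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Step 2: cosine similarity between the query and every document.
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Step 3: rank documents by descending similarity.
for doc_id, score in sorted(enumerate(scores), key=lambda p: p[1], reverse=True):
    print(f"Document {doc_id + 1}: {score:.3f}")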

The key idea behind cosine similarity is to calculate the cosine of the angle between two vectors. If the vectors are
very similar, their angle will be small and the cosine value will be close to 1. Conversely, if the vectors are dissimilar,
the angle will be large and the cosine value will approach 0.

How is Cosine Similarity Calculated?

The formula for calculating cosine similarity between two vectors A and B is as follows:

cos(A, B) = (A ⋅ B) / (∥A∥ ∥B∥)

Where:

A⋅B represents the dot product of vectors A and B.
∥A∥ and ∥B∥ represent the Euclidean norms (magnitudes) of vectors A and B, respectively.

The cosine similarity value ranges from -1 (completely dissimilar) to 1 (completely similar); for vectors with
non-negative term weights, such as tf-idf vectors, it lies between 0 and 1. A higher cosine similarity
score indicates greater similarity between the two vectors.
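
The formula maps directly onto code; a minimal pure-Python version using only the standard math module:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))     # A . B
    norm_a = math.sqrt(sum(x * x for x in a))  # ||A||
    norm_b = math.sqrt(sum(x * x for x in b))  # ||B||
    return dot / (norm_a * norm_b)

# Two five-dimensional term-count vectors sharing three terms:
print(cosine_similarity([1, 1, 0, 1, 1], [1, 1, 1, 0, 1]))  # 0.75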

Why Cosine Similarity?

Cosine similarity has several advantages when applied to text data:


1. Scale Invariance: Cosine similarity is scale-invariant, meaning it’s not affected by the magnitude of the vectors. This
makes it suitable for documents of different lengths.
2. Angle Measure: It focuses on the direction of vectors rather than their absolute values, which is crucial for text
similarity, where document length can vary.
3. Efficiency: Calculating cosine similarity is computationally efficient, making it suitable for large-scale text datasets.
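
Scale invariance (point 1) is easy to verify numerically; a self-contained sketch using the same formula as above:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a = [1, 2, 3]
print(cosine(a, [2, 4, 6]))    # ~1.0: doubling every component leaves the score unchanged
print(cosine([1, 0], [0, 1]))  # 0.0: orthogonal vectors point in unrelated directions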

The vector space model has the following advantages over the Standard Boolean model:

1. Allows ranking documents according to their possible relevance.
2. Allows retrieving documents with partial matching.

The vector space model has the following limitations:

1. Query terms are assumed to be independent, so phrases might not be represented well in the ranking.
2. Semantic sensitivity: documents with a similar context but different term vocabulary won't be associated. [2]
Example:

Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat

1. Document vector representation:

Document 1: (cat, runs, behind, rat)
Document 2: (dog, runs, behind, cat)
Query: (rat)

2. Term-Document Matrix (binary occurrence; vocabulary: behind, cat, dog, rat, runs):

              behind  cat  dog  rat  runs
Document 1      1      1    0    1    1
Document 2      1      1    1    0    1
Query           0      0    0    1    0

3. Tf-idf weighting: the tf-idf weight is calculated as tf × idf, so terms that occur in only one document
("rat", "dog") receive higher weights than terms shared by both.
4. Ranking: the query vector is non-zero only in the "rat" dimension, and "rat" occurs only in Document 1,
so cosine similarity ranks Document 1 above Document 2 (whose score against the query is 0).
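
The example can be reproduced end to end with a short script; raw term counts stand in for tf-idf weights here, which does not change the ranking for this single-term query:

import math

corpus = {"Document 1": "cat runs behind rat", "Document 2": "dog runs behind cat"}
vocabulary = ["behind", "cat", "dog", "rat", "runs"]

def to_vector(text):
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query_vec = to_vector("rat")
for name, text in corpus.items():
    print(name, round(cosine(query_vec, to_vector(text)), 3))
# Document 1 0.5  (shares the term "rat" with the query)
# Document 2 0.0  (no overlap with the query)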
