Vector Space Model
We consider the vector space model, which is based on the bag-of-words representation. Documents and queries are represented as vectors.
Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero.
Several different ways of computing these values, also known as (term) weights, have been developed. One of the best-known schemes is tf-idf weighting.
The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words
are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of
distinct words occurring in the corpus).
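As a concrete illustration of the point above, the following sketch builds the vocabulary of a tiny invented corpus and turns each document into a term-frequency vector whose dimensionality equals the number of distinct words:

```python
# A minimal bag-of-words sketch; the two-document corpus is invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary: all distinct words in the corpus, one vector dimension per word.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_vector(text):
    """Raw term-frequency vector over the shared vocabulary."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

vectors = [to_vector(doc) for doc in corpus]
print(vocabulary)        # each entry is one dimension
print(len(vocabulary))   # dimensionality of every document vector
print(vectors[0])        # non-zero entries mark terms occurring in the document
```

Here the vocabulary has 7 distinct words, so every document (and query) lives in a 7-dimensional space.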
The Vector Space Model represents documents and queries as vectors in a multi-dimensional space, where each dimension corresponds to a unique term in the entire corpus of documents.
1. Vector Representation: We represent documents and queries as vectors using techniques like TF-IDF. Each
document in the corpus and the query are converted into vectors in the same high-dimensional space.
2. Cosine Similarity Calculation: To determine the relevance of a document to a query, we calculate the cosine
similarity between the query vector and the vectors representing each document in the corpus.
3. Ranking: Documents with higher cosine similarity scores to the query are considered more relevant and are ranked
higher. Those with lower scores are ranked lower.
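The three steps above can be sketched end to end with the standard library only. The corpus, query, and the particular smoothed idf variant are assumptions chosen for illustration, not a definitive implementation:

```python
import math

# Step 1: represent documents and the query as tf-idf vectors (invented toy corpus).
corpus = [
    "information retrieval with the vector space model",
    "boolean retrieval model",
    "cooking recipes for pasta",
]
query = "vector space retrieval"

vocab = sorted({w for doc in corpus for w in doc.split()} | set(query.split()))

def tf_idf_vector(text):
    words = text.split()
    n_docs = len(corpus)
    vec = []
    for term in vocab:
        tf = words.count(term)
        df = sum(1 for d in corpus if term in d.split())
        idf = math.log(n_docs / (1 + df)) + 1  # one common smoothed idf variant
        vec.append(tf * idf)
    return vec

# Step 2: cosine similarity between the query vector and each document vector.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Step 3: rank documents by descending similarity score.
q_vec = tf_idf_vector(query)
scores = sorted(
    ((cosine(q_vec, tf_idf_vector(doc)), doc) for doc in corpus),
    reverse=True,
)
for score, doc in scores:
    print(f"{score:.3f}  {doc}")
```

The document sharing the most query terms ranks first, while the document with no overlapping terms scores zero.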
The key idea behind cosine similarity is to calculate the cosine of the angle between two vectors. If the vectors are
very similar, their angle will be small, and the cosine value will be close to 1. Conversely, if the vectors are dissimilar,
the angle will be large, and the cosine value will approach 0.
The formula for calculating cosine similarity between two vectors A and B is as follows:

cos(θ) = (A⋅B) / (∥A∥ ∥B∥)

Where:
A⋅B represents the dot product of vectors A and B.
∥A∥ and ∥B∥ represent the Euclidean norms (magnitudes) of vectors A and B, respectively.
The cosine similarity value ranges from -1 (completely dissimilar) to 1 (completely similar); with non-negative tf-idf weights it stays between 0 and 1. A higher cosine similarity score indicates a more relevant document.
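A quick numeric check of the formula on two small invented vectors:

```python
import math

# Worked example of cos(θ) = (A⋅B) / (∥A∥ ∥B∥) on two invented vectors.
A = [1, 2, 0]
B = [2, 1, 1]

dot = sum(a * b for a, b in zip(A, B))     # A⋅B = 1*2 + 2*1 + 0*1 = 4
norm_a = math.sqrt(sum(a * a for a in A))  # ∥A∥ = sqrt(5)
norm_b = math.sqrt(sum(b * b for b in B))  # ∥B∥ = sqrt(6)
cos_sim = dot / (norm_a * norm_b)          # 4 / sqrt(30) ≈ 0.730
print(round(cos_sim, 4))
```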
The vector space model has the following limitations compared to the Standard Boolean model:
1. Query terms are assumed to be independent, so phrases might not be represented well in the ranking.
2. Semantic sensitivity: documents with similar context but different term vocabulary won't be associated. [2]
Example:
The tf-idf weight of a term is calculated as tf × idf (term frequency multiplied by inverse document frequency).
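The tf × idf computation can be sketched as follows; the three-document corpus and the particular tf/idf variants (length-normalised tf, unsmoothed idf) are assumptions for illustration:

```python
import math

# Sketch of tf-idf = tf × idf on a tiny invented corpus.
corpus = [
    "the cat sat",
    "the cat ran",
    "the dog ran",
]

def tf(term, doc):
    """Term frequency, normalised by document length."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term):
    """Inverse document frequency: log of (corpus size / document frequency)."""
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log(len(corpus) / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("cat", corpus[0]))  # moderately rare term: positive weight
print(tf_idf("the", corpus[0]))  # appears in every document, so idf = 0
```

A term like "the" that occurs in every document gets idf = log(1) = 0, so its tf-idf weight vanishes regardless of how often it appears.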