Word Embedding
Word Embedding
GCR : t37g47w
Problem With Text
A problem with modeling text is that it is messy, and techniques like
machine learning algorithms prefer well defined fixed-length inputs and
outputs.
Machine learning algorithms cannot work with raw text directly; the
text must be converted into numbers. Specifically, vectors of numbers.
We look at the histogram of the words within the text, i.e. considering each word
count as a feature.
Bag Of Words
The intuition is that documents are similar if they have similar
content. Further, that from the content alone we can learn
something about the meaning of the document.
One approach is to rescale the frequency of words by how often they appear in
all documents, so that the scores for frequent words like “the” that are also
frequent across all documents are penalized.