Module 2 Feature Engineering and Text Representation
Text Representation
NLP Process
Example:
One Hot Encoding in Scikit-Learn
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative DataFrame (the original slides assume df already exists)
df = pd.DataFrame({'name': ['alice', 'bob', 'alice'], 'kind': ['cat', 'dog', 'dog']})
oh_enc = OneHotEncoder()
oh_enc.fit(df[['name', 'kind']])
oh_enc.transform(df[['name', 'kind']]).todense()
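To check which output column corresponds to which category, the fitted encoder can list its generated feature names (a brief follow-up sketch; get_feature_names_out is available in recent scikit-learn releases):

# Column labels of the encoded matrix, e.g. 'name_alice', 'kind_cat'
print(oh_enc.get_feature_names_out())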
Example:
Bag of Words Model using Scikit-Learn
from sklearn.feature_extraction.text import CountVectorizer

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']
vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(sample_text)
print("Words:", list(enumerate(vectorizer.get_feature_names_out())))
N-gram Encoding
Extracts features from text while capturing local word order by counting
sliding windows of n consecutive words
Example (n = 2):
N-gram Encoding using Scikit-Learn
from sklearn.feature_extraction.text import CountVectorizer

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']
bigram = CountVectorizer(ngram_range=(2, 2))  # count windows of two consecutive words
bigram.fit(sample_text)
print("Words:", list(enumerate(bigram.get_feature_names_out())))
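In practice, unigrams and bigrams are often combined by widening ngram_range; a minimal sketch, reusing sample_text from above:

# ngram_range=(1, 2) keeps single words and adds pairs of consecutive words
uni_bi = CountVectorizer(ngram_range=(1, 2))
uni_bi.fit(sample_text)
print(len(uni_bi.get_feature_names_out()), "features (unigrams + bigrams)")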
TFIDF Vectorizer
Converts a collection of raw documents to a matrix of TFIDF features
TFIDF stands for term frequency-inverse document frequency; it represents
text data by weighting each word by its importance relative to the other words in
the corpus
2 Parts:
TFIDF = TF * IDF
TF (term frequency) measures how often the word occurs within a single document
IDF (inverse document frequency) measures how rare the word is across the corpus;
classically IDF(t) = log(N / DF(t)), where N is the number of documents and DF(t)
is the number of documents containing t, so rare words receive higher weight
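To make the formula concrete, here is a hand-rolled sketch of the classical TF * IDF computation on the sample corpus used earlier (illustrative only; the variable names are ours, and scikit-learn's implementation differs slightly, as noted below):

import math

# Tokenize the sample corpus from the Bag of Words example
docs = [doc.lower().replace('.', '').split() for doc in sample_text]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)         # term frequency within one document
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return tf * math.log(N / df)            # classical TF * IDF

print(tfidf('second', docs[1]))    # rare word: positive weight
print(tfidf('document', docs[1]))  # appears in every document: log(3/3) = 0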
TFIDF Vectorizer Encoding using Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses sample_text from the Bag of Words example above
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sample_text)
print(vectorizer.get_feature_names_out())
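Note that scikit-learn does not use the classical formula verbatim: by default TfidfVectorizer smooths the IDF (idf(t) = ln((1 + N) / (1 + DF(t))) + 1) and L2-normalizes each row, so its weights differ slightly from the hand computation above.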