NLP 1 Week Tutorial NLTK
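import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')  # WordNet data used by the lemmatizer
stemmer = PorterStemmer()
print(stemmer.stem("playing"))  # play
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("playing", pos='v'))  # play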
POS Tagging & Named Entity Recognition
• POS Tagging: Labels each word with its part of speech (noun, verb, etc.).
• NER: Detects named entities such as people, organizations, and places.
import nltk
nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('maxent_ne_chunker')  # named entity chunker model
nltk.download('words')  # word list required by the chunker
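The downloads above only fetch the models. A minimal sketch of how they might be used follows. The helper names `extract_entities` and `tag_and_chunk` are hypothetical, chosen for this example. Tokenizing with `nltk.word_tokenize` also needs the 'punkt' tokenizer data, and recent NLTK releases may ask for differently named resources; the lookup error names the one to download.

```python
import nltk
from nltk.tree import Tree


def extract_entities(chunk_tree):
    """Collect (entity_label, entity_text) pairs from an ne_chunk result."""
    entities = []
    for node in chunk_tree:
        # Named entities are subtrees; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


def tag_and_chunk(sentence):
    """Tokenize, POS-tag, and NE-chunk one sentence (needs the NLTK data)."""
    tokens = nltk.word_tokenize(sentence)  # also needs the 'punkt' data
    tagged = nltk.pos_tag(tokens)          # list of (word, tag) pairs
    return tagged, nltk.ne_chunk(tagged)   # chunk tree with entity subtrees
```

Once the data is installed, `tagged, tree = tag_and_chunk("some sentence")` returns the tagged tokens and the chunk tree, and `extract_entities(tree)` lists the entities the chunker found.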
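from nltk.classify import NaiveBayesClassifier

# Naive Bayes text classification with NLTK.
def format_sentence(sent):
    # Feature dict: the whole lowercased sentence is a single 'text' feature.
    return {'text': sent.lower()}

# `train` must be a list of (feature_dict, label) pairs.
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(format_sentence("love product")))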
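Because `format_sentence` stores the whole sentence as one feature value, the classifier can only recognize sentences it saw verbatim during training. A common alternative is one feature per word. The sketch below assumes that variant; `word_features` is a hypothetical helper and the four training sentences are invented for illustration, not taken from the tutorial.

```python
from nltk.classify import NaiveBayesClassifier


def word_features(sent):
    """Bag-of-words features: one boolean feature per lowercased token."""
    return {word: True for word in sent.lower().split()}


# Hypothetical toy training data, invented for illustration only.
labeled_sentences = [
    ("love this product", "pos"),
    ("great product", "pos"),
    ("hate this product", "neg"),
    ("terrible service", "neg"),
]
train = [(word_features(text), label) for text, label in labeled_sentences]

classifier = NaiveBayesClassifier.train(train)
prediction = classifier.classify(word_features("love product"))
print(prediction)
```

With word-level features, "love product" can be scored even though that exact sentence never appears in the training data, because "love" and "product" were each seen with a label.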
TF-IDF (Term Frequency-Inverse Document Frequency)
• Words like “the”, “is”, and “and” appear in almost every document and carry little
meaning on their own.
• TF-IDF downweights words that are common across the corpus and upweights words
that are frequent in one document but rare in the others.
• Example:
• If the word “excellent” appears 3 times in a review but rarely in other
reviews, it will get a high TF-IDF score, showing it's significant for that
specific document.
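from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["I love NLP", "NLP is fun and useful", "I love machine learning"]
vectorizer = TfidfVectorizer()  # lowercases text; default tokenizer skips one-letter tokens such as "I"
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document
print(tfidf_matrix.toarray())  # dense TF-IDF weights
print(vectorizer.get_feature_names_out())  # vocabulary, in column order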
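To see where such scores come from, the “excellent” example can be worked by hand. This is a minimal sketch using the textbook formula tf × ln(N / df); the three reviews are hypothetical, and `TfidfVectorizer` will give different numbers because by default it smooths the IDF term and L2-normalizes each row.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of reviews, invented for illustration.
reviews = [
    "the food was excellent excellent excellent",
    "the service was slow",
    "the food was fine",
]
tokenized = [review.split() for review in reviews]

# Document frequency: in how many reviews does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Textbook TF-IDF: (count / doc length) * ln(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


scores = [tf_idf(tokens) for tokens in tokenized]

# Print the first review's words from highest to lowest score.
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

In the first review, “excellent” ranks highest because it is frequent there and absent elsewhere, while “the” scores zero because it occurs in every review.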
Word2Vec
# Example corpus
sentences = [
    ["i", "love", "nlp"],
    ["nlp", "is", "fun"],
    ["i", "enjoy", "machine", "learning"],
]