NLP 1-Week Tutorial: NLTK

The document provides an overview of Natural Language Processing (NLP) and its applications, highlighting the use of Python's NLTK library for various NLP tasks such as text preprocessing, tokenization, stemming, lemmatization, and text classification. It also covers advanced concepts like TF-IDF, Word2Vec, and sentiment analysis, explaining their significance and implementation. Additionally, it lists popular NLP libraries and tools, emphasizing the importance of understanding human language data for computational tasks.


NLP with Python using NLTK

What is NLP (Natural Language Processing)?


• NLP (Natural Language Processing) is a field at the intersection of
computer science, artificial intelligence (AI), and linguistics. It enables
computers to understand, interpret, and generate human language.
• You’ve used NLP if you've:
– Spoken to Alexa, Siri, or Google Assistant
– Typed something and used autocorrect or autocomplete
– Seen spam filters in your email
– Used chatbots or language translation tools
• Popular NLP libraries:
– NLTK (Natural Language Toolkit) – beginner-friendly
– spaCy – fast and industrial-strength
– TextBlob – simple, useful for sentiment analysis
– transformers (by HuggingFace) – for deep learning-based NLP (e.g., BERT, GPT)
What is NLTK?
• NLTK (Natural Language Toolkit) is a powerful Python library
used for working with human language data (text). It provides
easy-to-use tools and resources to process, analyze, and
understand natural language.
Text Preprocessing
• Install NLTK: pip install nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

• Tokenization: Breaking text into words or sentences.
• Stopwords: Common words (like "the", "is") that are removed before analysis.
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "NLP is fun and powerful!"


tokens = word_tokenize(text)
filtered = [w for w in tokens if w.lower() not in stopwords.words('english')]
print(filtered)
This removes unimportant words so that your analysis focuses on meaningful
content.
Tokenization & Stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "NLTK is a powerful Python library for NLP."


tokens = word_tokenize(text)
filtered = [word for word in tokens if word.lower() not in stopwords.words('english')]
print(filtered)
Stemming & Lemmatization
• Stemming: Strips suffixes ("playing" → "play").
• Lemmatization: Reduces to dictionary form ("better" → "good").
• Both are used to normalize text. Lemmatization is more accurate but
slower.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
print(stemmer.stem("playing"))  # play

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("playing", pos='v'))  # play
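The "better" → "good" example from the bullet above also works, provided the part of speech is given as an adjective (a minimal sketch reusing the same WordNetLemmatizer):

# "better" is an irregular adjective; WordNet maps it to its lemma "good"
print(lemmatizer.lemmatize("better", pos='a'))  # good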
POS Tagging & Named Entity Recognition
• POS Tagging: Labels each word (noun, verb, etc.)
• NER: Detects entities like names and places.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Steve Jobs founded Apple in California."

tokens = word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
ner_tree = nltk.ne_chunk(tags)
print(ner_tree)
• This helps in identifying structure and important entities in a
sentence.
Text Classification (Naive Bayes)

• Text Classification: Predicts labels for input text (e.g., sentiment).
• Naive Bayes: A simple probabilistic classifier.
from nltk.classify import NaiveBayesClassifier

def format_sentence(sent):
    # Bag-of-words features: each lowercased word becomes a boolean feature,
    # so the classifier can match individual words from new sentences
    return {word: True for word in sent.lower().split()}

train = [(format_sentence("I love this movie"), 'pos'),
         (format_sentence("I hate this product"), 'neg')]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(format_sentence("I love this film")))  # pos ('love' appears only in the positive example)
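To see which words the classifier actually relies on, NLTK's NaiveBayesClassifier can print its strongest features (the output depends entirely on the tiny two-sentence training set above):

# Print the features with the highest pos/neg likelihood ratios
classifier.show_most_informative_features(5)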
TF-IDF (Term Frequency-Inverse Document Frequency)

• TF-IDF is a statistical measure used to evaluate how important a word is in a document relative to a collection of documents (called a corpus).
• Formula:
– TF-IDF(t, d) = TF(t, d) × IDF(t)
• TF (Term Frequency): How often term t appears in document d
– TF(t, d) = (number of times t appears in d) / (total number of terms in d)
• IDF (Inverse Document Frequency): How rare the term is across all documents
– IDF(t) = log(total number of documents / number of documents containing t)
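As a quick sanity check of these formulas, here is a small hand computation in plain Python (an illustration only; libraries such as scikit-learn use a smoothed IDF variant, so their numbers will differ slightly):

import math

docs = [
    "nlp is fun",
    "nlp is useful",
    "machine learning is fun",
]

term = "nlp"
doc = docs[0].split()

# TF: how often the term appears in this document
tf = doc.count(term) / len(doc)                       # 1/3

# IDF: how rare the term is across the corpus
n_docs_with_term = sum(term in d.split() for d in docs)
idf = math.log(len(docs) / n_docs_with_term)          # log(3/2)

print(tf * idf)  # TF-IDF score of "nlp" in the first document, ~0.135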
Why Use TF-IDF?

• Words like “the”, “is”, “and” appear in all documents and carry little
meaning.
• TF-IDF downweights common words and upweights rare, important
ones.
• Example:
• If the word “excellent” appears 3 times in a review but rarely in other
reviews, it will get a high TF-IDF score, showing it's significant for that
specific document.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love NLP", "NLP is fun and useful", "I love machine learning"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(tfidf_matrix.toarray())
print(vectorizer.get_feature_names_out())
Word2Vec

• Word2Vec is a technique to convert words into vectors (numbers) so that a machine can understand their meaning based on context. It's used in NLP for tasks like similarity detection, text classification, and more.
• Word2Vec trains a shallow neural network to learn word
embeddings using one of two models:
– CBOW (Continuous Bag of Words) – Predicts a word from its
surrounding context.
– Skip-Gram – Predicts context from the target word (works better with
small data).
– Words with similar meanings end up having similar vectors.
Install Required Library

• pip install gensim


from gensim.models import Word2Vec

# Example corpus
sentences = [
["i", "love", "nlp"],
["nlp", "is", "fun"],
["i", "enjoy", "machine", "learning"]
]

# Train Word2Vec model


model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1) # sg=1 uses skip-gram

# Get vector for a word


print("Vector for 'nlp':")
print(model.wv['nlp'])

# Find similar words


print("\nWords similar to 'nlp':")
print(model.wv.most_similar('nlp'))
Explanation of Parameters

• vector_size: Dimension of word embeddings (usually 50–300)
• window: Context window size (how many words to the left/right to consider)
• min_count: Ignores words that appear less than this number
• sg: 1 for skip-gram, 0 for CBOW
• With embeddings trained on a large corpus, model.wv['king'] - model.wv['man'] + model.wv['woman'] gives a vector close to 'queen' (see the sketch after this list).
• Why Use Word2Vec?
– Captures semantic relationships between words.
– Great for text classification, sentiment analysis,
chatbot development, etc.
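A minimal sketch of the king/queen analogy. It assumes the pretrained 'glove-wiki-gigaword-50' vectors from the gensim-data package (downloaded on first use); the toy three-sentence model trained above is far too small to show this effect:

import gensim.downloader

# Load small pretrained GloVe vectors (assumption: network access and the
# 'glove-wiki-gigaword-50' gensim-data model)
wv = gensim.downloader.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))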
Sentiment Analysis

• Sentiment Analysis is the process of identifying and classifying emotions or opinions in text, typically as:
– Positive
– Negative
– Neutral
• It's widely used in:
– Product reviews
– Social media monitoring
– Customer feedback analysis
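NLTK ships a rule-based sentiment analyzer (VADER) that works well on short, informal text. A minimal sketch:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("I absolutely love this product!")
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' scores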
