
Natural Language Processing (NLP) Essentials with Examples

1. Tokenization
Tokenization is the process of splitting text into smaller units, such as words or sentences.

Example of word tokenization in Python:

```python
import nltk
nltk.download('punkt')  # One-time download of the tokenizer models
                        # (newer NLTK versions may require 'punkt_tab')

from nltk.tokenize import word_tokenize

sentence = "Tokenization is an important NLP task."
tokens = word_tokenize(sentence)
print(tokens)  # Output: ['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.']
```
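NLTK also supports the sentence-level tokenization mentioned above. A minimal sketch using `sent_tokenize` (the example text is an illustrative assumption):

```python
from nltk.tokenize import sent_tokenize

text = "Tokenization is an important NLP task. It splits text into units."
print(sent_tokenize(text))
# Output: ['Tokenization is an important NLP task.', 'It splits text into units.']
```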

2. Stop Words
Stop words are very common words (such as "the", "is", and "a") that are often removed from text because they carry little meaning on their own.

Example of removing stop words using NLTK:

```python
import nltk
nltk.download('stopwords')  # One-time download of the stop word lists
nltk.download('punkt')      # Tokenizer models for word_tokenize

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
words = word_tokenize("This is a simple NLP example.")
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)  # Output: ['simple', 'NLP', 'example', '.']
```

3. Stemming and Lemmatization

- Stemming reduces words to their root form by cutting off prefixes or suffixes, e.g., "running" becomes "run".
- Lemmatization reduces words to their base form using language rules, e.g., "better" to "good".

Example using NLTK:

```python
import nltk
nltk.download('wordnet')  # One-time download of the WordNet data for the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                  # Output: 'run'
print(lemmatizer.lemmatize("better", pos="a"))  # Output: 'good' (pos="a" treats the word as an adjective)
```

4. Word Embeddings
Word embeddings are dense vector representations of words that capture semantic relationships: words used in similar contexts end up with similar vectors.

Example of generating word embeddings using Gensim's Word2Vec:

```python
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens
sentences = [["hello", "world"], ["machine", "learning"], ["hello", "machine"]]

model = Word2Vec(sentences, vector_size=5, min_count=1)
vector = model.wv['hello']
print(vector)  # Output: a 5-dimensional dense vector representing 'hello'
```
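Because embeddings place related words near each other, a trained model can be queried for nearest neighbors. A minimal sketch continuing from the toy corpus above (with a corpus this small the ranking is essentially arbitrary; it becomes meaningful on real text):

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning"], ["hello", "machine"]]
model = Word2Vec(sentences, vector_size=5, min_count=1)

# Nearest neighbors of 'hello' by cosine similarity in the embedding space
print(model.wv.most_similar('hello', topn=2))
# Output: a list of (word, similarity) pairs
```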

5. Bag of Words, TF-IDF, and N-grams

- **Bag of Words (BoW)**: Represents a text as a vector of word counts, ignoring word order.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weights word counts by inverse document frequency, so words that appear in many documents contribute less.
- **N-grams**: Sequences of N consecutive words, which preserve some local word order (see the n-gram sketch after the example below).

Example using Scikit-learn for BoW and TF-IDF:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text = ["Natural Language Processing", "Text Processing"]

bow = CountVectorizer().fit_transform(text)
tfidf = TfidfVectorizer().fit_transform(text)

print(bow.toarray())    # BoW: raw word counts per document
print(tfidf.toarray())  # TF-IDF: counts reweighted by inverse document frequency
```
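N-grams can be extracted with the same `CountVectorizer` by setting its `ngram_range` parameter. A minimal sketch extracting bigrams only (the `(2, 2)` range is an illustrative choice; `(1, 2)` would keep unigrams as well):

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Natural Language Processing", "Text Processing"]

# ngram_range=(2, 2) extracts bigrams only
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigrams = bigram_vectorizer.fit_transform(text)

# get_feature_names_out requires scikit-learn >= 1.0
print(bigram_vectorizer.get_feature_names_out())
# Output: ['language processing' 'natural language' 'text processing']
print(bigrams.toarray())
```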

6. Vector Databases
Vector databases store embeddings (word or document vectors) and index them for efficient similarity search.

Example usage in a search system (sketched in code below):

1. Convert the query and the documents to vectors.
2. Use cosine similarity to retrieve the vectors closest to the query.

Applications: semantic search, recommendation systems.
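A minimal in-memory sketch of this retrieval loop, using TF-IDF vectors and scikit-learn's cosine similarity in place of a real vector database (a production system would use a dedicated store with approximate nearest-neighbor indexes; the documents and query here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny illustrative document collection
documents = [
    "Word embeddings capture semantic relationships.",
    "Stop words are removed during preprocessing.",
    "Vector databases enable efficient similarity search.",
]
query = "semantic similarity search"

# Step 1: convert the query and documents to vectors
# (TF-IDF here; a real system would typically use learned embeddings)
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Step 2: rank documents by cosine similarity to the query
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()
print(documents[best])  # The document closest to the query
```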
