
Extra-Feature-NLP

March 21, 2024

[ ]: # One Hot Encoding

[1]: import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
categories = ['teacher', 'nurse', 'police', 'doctor']

# Convert the categorical data into a DataFrame
data = pd.DataFrame({'Category': categories})

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, dtype=int)

# Fit and transform the categorical data
encoded_data = encoder.fit_transform(data)

# Convert the encoded data to a DataFrame, taking the column labels from the
# encoder itself: it orders categories alphabetically, so reusing the original
# (unsorted) list would mislabel the columns
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out())

# Print the encoded DataFrame
encoded_df.head()

[1]:    Category_doctor  Category_nurse  Category_police  Category_teacher
0                     0               0                0                 1
1                     0               1                0                 0
2                     0               0                1                 0
3                     1               0                0                 0

[2]: # Count Vectorization

[3]: # Bag of Words (BOW):

[4]: # It creates a vocabulary of unique words from the corpus and represents
# each document as a vector of word frequencies.

[5]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
bow_vectors = vectorizer.fit_transform(data['Text'])

# Convert the BOW vectors to a DataFrame
bow_df = pd.DataFrame(bow_vectors.toarray(),
                      columns=vectorizer.get_feature_names_out())

# Print the BOW DataFrame
bow_df.head()

[5]:    and  document  first  is  one  second  the  third  this
0         0         1      1   1    0       0    1      0     1
1         0         2      0   1    0       1    1      0     1
2         1         0      0   1    1       0    1      1     1
3         0         1      1   1    0       0    1      0     1
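
[ ]: # As an added sanity check, the fitted vectorizer's vocabulary_ attribute
# maps each unique word to its column index; this is where the column order in
# the table above comes from.
sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])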

[6]: # N-gram features: an n-gram is a contiguous sequence of n words, so
# counting n-grams instead of single words preserves some local word order.

[7]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer with the desired n-gram range
# (here: all bigrams and trigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))

# Fit and transform the text data
ngram_vectors = ngram_vectorizer.fit_transform(data['Text'])

# Convert the n-gram vectors to a DataFrame
ngram_df = pd.DataFrame(ngram_vectors.toarray(),
                        columns=ngram_vectorizer.get_feature_names_out())

# Print the n-gram DataFrame
ngram_df.head()

[7]:    and this  and this is  document is  document is the  first document  \
0              0            0            0                0               1
1              0            0            1                1               0
2              1            1            0                0               0
3              0            0            0                0               1

        is the  is the first  is the second  is the third  is this  ...  \
0            1             1              0             0        0  ...
1            1             0              1             0        0  ...
2            1             0              0             1        0  ...
3            0             0              0             0        1  ...

        the second document  the third  the third one  third one  this document  \
0                         0          0              0          0              0
1                         1          0              0          0              1
2                         0          1              1          1              0
3                         0          0              0          0              0

        this document is  this is  this is the  this the  this the first
0                      0        1            1         0               0
1                      1        0            0         0               0
2                      0        1            1         0               0
3                      0        0            0         1               1

[4 rows x 25 columns]
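
[ ]: # A minimal added sketch: ngram_range=(1, 1) reproduces the plain bag of
# words, while a mixed range such as (1, 2) keeps unigrams alongside bigrams
# in a single feature space.
uni_bi = CountVectorizer(ngram_range=(1, 2))
uni_bi.fit(data['Text'])
print(len(uni_bi.get_feature_names_out()))  # unigram + bigram vocabulary size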

[8]: # TF-IDF Vectorizer:

[9]: # TF (Term Frequency) represents the frequency of a term in a document,
# i.e. the number of times the term occurs in that document.

[10]: # IDF (Inverse Document Frequency) measures how informative a term is
# across the corpus by down-weighting terms that appear in many documents.
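
[ ]: # A minimal hand computation, assuming scikit-learn's defaults (smoothed
# IDF, L2-normalised rows), to show where the numbers in the output below
# come from.
import numpy as np

n_docs = 4   # documents in the corpus below
df_term = 3  # "document" appears in 3 of the 4 documents
idf = np.log((1 + n_docs) / (1 + df_term)) + 1  # smoothed IDF, ~1.2231

tf = 2       # "document" occurs twice in the second document
print(tf * idf)  # ~2.4463; L2-normalising the row then gives 0.687624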

[11]: import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_vectors = vectorizer.fit_transform(data['Text'])

# Convert the TF-IDF vectors to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(),
                        columns=vectorizer.get_feature_names_out())

# Print the TF-IDF DataFrame
tfidf_df.head()

[11]:       and  document     first        is       one    second       the  \
0      0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085
1      0.000000  0.687624  0.000000  0.281089  0.000000  0.538648  0.281089
2      0.511849  0.000000  0.000000  0.267104  0.511849  0.000000  0.267104
3      0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085

          third      this
0      0.000000  0.384085
1      0.000000  0.281089
2      0.511849  0.267104
3      0.000000  0.384085

[ ]: # Word Embedding

[12]: # FastText

[15]: # It learns word embeddings with the Skip-gram or Continuous Bag-of-Words
# (CBOW) architecture (gensim's sg parameter selects between them), making it
# effective for a range of natural language processing tasks.

[ ]: # Because FastText represents each word as a bag of character n-grams, it
# can build vectors for out-of-vocabulary words and capture morphological and
# semantic similarities, even for rare or unseen words (see the sketch after
# the output below).

[16]: import pandas as pd
from gensim.models import FastText

# Training data
sentences = [["I", "like", "apples"],
             ["I", "enjoy", "eating", "fruits"]]

# Train the FastText model
model_fasttext = FastText(sentences, min_count=1, window=5, vector_size=100)

# Access the word vectors
word_vectors = model_fasttext.wv

# Create a DataFrame of word vectors (one row per vocabulary word)
word_vectors_df = pd.DataFrame(word_vectors.vectors,
                               index=word_vectors.index_to_key)

# Display the word vectors DataFrame
word_vectors_df.head(10)

[16]:           0         1         2         3         4         5         6  \
I       -0.003053  0.001144 -0.001130  0.004910 -0.003084 -0.007648  0.007188
fruits  -0.001457  0.001947  0.001137 -0.001536 -0.001588 -0.001997 -0.002027
eating   0.000412  0.001230 -0.002208  0.000289  0.001082  0.000401  0.001171
enjoy   -0.001593  0.000200  0.000983 -0.001493 -0.000503  0.001380  0.001440
apples  -0.000257 -0.000776 -0.000108 -0.001688  0.002155 -0.001124  0.002533
like     0.001024 -0.003016  0.001939 -0.001192 -0.003485 -0.001892  0.001637

                7         8         9  ...        90        91        92  \
I        0.007860 -0.001688 -0.002615  ...  0.005416  0.001654  0.002986
fruits   0.002295  0.002176 -0.001157  ...  0.000342  0.000272 -0.001761
eating  -0.000369 -0.000706  0.002063  ... -0.002273  0.001385  0.001710
enjoy   -0.002292 -0.000112 -0.001617  ... -0.003175 -0.001866  0.000952
apples   0.000522  0.000874 -0.000778  ...  0.001021  0.000565 -0.001394
like    -0.000633 -0.001284  0.001069  ... -0.000179  0.002047 -0.000875

               93        94        95        96        97        98        99
I        0.002967  0.007579 -0.002151 -0.003800  0.001423  0.001112 -0.000259
fruits  -0.001308 -0.000937 -0.000236 -0.000219 -0.000568 -0.003610 -0.001075
eating  -0.000360 -0.000841  0.002985  0.000116 -0.000775 -0.000186  0.001993
enjoy   -0.002678  0.002496 -0.000418 -0.002535 -0.002113 -0.001011  0.000997
apples  -0.000912  0.001105 -0.000151  0.001271  0.001879  0.001152 -0.000260
like    -0.000740  0.002278  0.000509  0.001111 -0.001301  0.000404  0.001636

[6 rows x 100 columns]
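
[ ]: # Sketch of the out-of-vocabulary behaviour noted above: FastText composes
# a vector from character n-grams, so it can embed a word it never saw in
# training ("apple" here is an illustrative unseen query).
print("apple" in word_vectors.key_to_index)        # False: not in the vocabulary
print(word_vectors["apple"].shape)                 # (100,): a vector is still built
print(word_vectors.similarity("apple", "apples"))  # typically high: shared n-grams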
