
Extra-Feature-NLP

March 21, 2024

[ ]: # One Hot Encoding

[1]: import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
categories = ['teacher', 'nurse', 'police', 'doctor']

# Convert the categorical data into a DataFrame
data = pd.DataFrame({'Category': categories})

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, dtype=int)

# Fit and transform the categorical data
encoded_data = encoder.fit_transform(data)

# Convert the encoded data to a DataFrame, taking the column labels from the
# encoder itself: it orders categories alphabetically, so reusing the original
# (unsorted) list would mislabel the columns
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out())

# Print the encoded DataFrame
encoded_df.head()

[1]:    Category_doctor  Category_nurse  Category_police  Category_teacher
0                     0               0                0                 1
1                     0               1                0                 0
2                     0               0                1                 0
3                     1               0                0                 0

[2]: # Count Vectorization

[3]: # Bag of Words (BOW):

[4]: # It creates a vocabulary of unique words from the corpus and represents
# each document as a vector of word frequencies.

[5]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
bow_vectors = vectorizer.fit_transform(data['Text'])

# Convert the BOW vectors to a DataFrame
bow_df = pd.DataFrame(bow_vectors.toarray(),
                      columns=vectorizer.get_feature_names_out())

# Print the BOW DataFrame
bow_df.head()

[5]:    and  document  first  is  one  second  the  third  this
0         0         1      1   1    0       0    1      0     1
1         0         2      0   1    0       1    1      0     1
2         1         0      0   1    1       0    1      1     1
3         0         1      1   1    0       0    1      0     1
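
[ ]: # As an added sanity check, the fitted vectorizer's vocabulary_ attribute
# maps each unique word to its column index; this is where the column order in
# the table above comes from.
sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])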

[6]: # N-gram features: an n-gram is a contiguous sequence of n words, so
# counting n-grams instead of single words preserves some local word order.

[7]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer with the desired n-gram range
# (here: all bigrams and trigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))

# Fit and transform the text data
ngram_vectors = ngram_vectorizer.fit_transform(data['Text'])

# Convert the n-gram vectors to a DataFrame
ngram_df = pd.DataFrame(ngram_vectors.toarray(),
                        columns=ngram_vectorizer.get_feature_names_out())

# Print the n-gram DataFrame
ngram_df.head()

[7]:    and this  and this is  document is  document is the  first document  \
0              0            0            0                0               1
1              0            0            1                1               0
2              1            1            0                0               0
3              0            0            0                0               1

        is the  is the first  is the second  is the third  is this  ...  \
0            1             1              0             0        0  ...
1            1             0              1             0        0  ...
2            1             0              0             1        0  ...
3            0             0              0             0        1  ...

        the second document  the third  the third one  third one  this document  \
0                         0          0              0          0              0
1                         1          0              0          0              1
2                         0          1              1          1              0
3                         0          0              0          0              0

        this document is  this is  this is the  this the  this the first
0                      0        1            1         0               0
1                      1        0            0         0               0
2                      0        1            1         0               0
3                      0        0            0         1               1

[4 rows x 25 columns]
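
[ ]: # A minimal added sketch: ngram_range=(1, 1) reproduces the plain bag of
# words, while a mixed range such as (1, 2) keeps unigrams alongside bigrams
# in a single feature space.
uni_bi = CountVectorizer(ngram_range=(1, 2))
uni_bi.fit(data['Text'])
print(len(uni_bi.get_feature_names_out()))  # unigram + bigram vocabulary size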

[8]: # TF-IDF Vectorizer:

[9]: # TF (Term Frequency) represents the frequency of a term in a document,
# i.e. the number of times the term occurs in that document.

[10]: # IDF (Inverse Document Frequency) measures how informative a term is
# across the corpus by down-weighting terms that appear in many documents.
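
[ ]: # A minimal hand computation, assuming scikit-learn's defaults (smoothed
# IDF, L2-normalised rows), to show where the numbers in the output below
# come from.
import numpy as np

n_docs = 4   # documents in the corpus below
df_term = 3  # "document" appears in 3 of the 4 documents
idf = np.log((1 + n_docs) / (1 + df_term)) + 1  # smoothed IDF, ~1.2231

tf = 2       # "document" occurs twice in the second document
print(tf * idf)  # ~2.4463; L2-normalising the row then gives 0.687624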

[11]: import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_vectors = vectorizer.fit_transform(data['Text'])

# Convert the TF-IDF vectors to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(),
                        columns=vectorizer.get_feature_names_out())

# Print the TF-IDF DataFrame
tfidf_df.head()

[11]:       and  document     first        is       one    second       the  \
0      0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085
1      0.000000  0.687624  0.000000  0.281089  0.000000  0.538648  0.281089
2      0.511849  0.000000  0.000000  0.267104  0.511849  0.000000  0.267104
3      0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085

          third      this
0      0.000000  0.384085
1      0.000000  0.281089
2      0.511849  0.267104
3      0.000000  0.384085

[ ]: # Word Embedding

[12]: # FastText

[15]: # It learns word embeddings with the Skip-gram or Continuous Bag-of-Words
# (CBOW) architecture (gensim's sg parameter selects between them), making it
# effective for a range of natural language processing tasks.

[ ]: # Because FastText represents each word as a bag of character n-grams, it
# can build vectors for out-of-vocabulary words and capture morphological and
# semantic similarities, even for rare or unseen words (see the sketch after
# the output below).

[16]: import pandas as pd
from gensim.models import FastText

# Training data
sentences = [["I", "like", "apples"],
             ["I", "enjoy", "eating", "fruits"]]

# Train the FastText model
model_fasttext = FastText(sentences, min_count=1, window=5, vector_size=100)

# Access the word vectors
word_vectors = model_fasttext.wv

# Create a DataFrame of word vectors (one row per vocabulary word)
word_vectors_df = pd.DataFrame(word_vectors.vectors,
                               index=word_vectors.index_to_key)

# Display the word vectors DataFrame
word_vectors_df.head(10)

[16]:           0         1         2         3         4         5         6  \
I       -0.003053  0.001144 -0.001130  0.004910 -0.003084 -0.007648  0.007188
fruits  -0.001457  0.001947  0.001137 -0.001536 -0.001588 -0.001997 -0.002027
eating   0.000412  0.001230 -0.002208  0.000289  0.001082  0.000401  0.001171
enjoy   -0.001593  0.000200  0.000983 -0.001493 -0.000503  0.001380  0.001440
apples  -0.000257 -0.000776 -0.000108 -0.001688  0.002155 -0.001124  0.002533
like     0.001024 -0.003016  0.001939 -0.001192 -0.003485 -0.001892  0.001637

                7         8         9  ...        90        91        92  \
I        0.007860 -0.001688 -0.002615  ...  0.005416  0.001654  0.002986
fruits   0.002295  0.002176 -0.001157  ...  0.000342  0.000272 -0.001761
eating  -0.000369 -0.000706  0.002063  ... -0.002273  0.001385  0.001710
enjoy   -0.002292 -0.000112 -0.001617  ... -0.003175 -0.001866  0.000952
apples   0.000522  0.000874 -0.000778  ...  0.001021  0.000565 -0.001394
like    -0.000633 -0.001284  0.001069  ... -0.000179  0.002047 -0.000875

               93        94        95        96        97        98        99
I        0.002967  0.007579 -0.002151 -0.003800  0.001423  0.001112 -0.000259
fruits  -0.001308 -0.000937 -0.000236 -0.000219 -0.000568 -0.003610 -0.001075
eating  -0.000360 -0.000841  0.002985  0.000116 -0.000775 -0.000186  0.001993
enjoy   -0.002678  0.002496 -0.000418 -0.002535 -0.002113 -0.001011  0.000997
apples  -0.000912  0.001105 -0.000151  0.001271  0.001879  0.001152 -0.000260
like    -0.000740  0.002278  0.000509  0.001111 -0.001301  0.000404  0.001636

[6 rows x 100 columns]
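
[ ]: # Sketch of the out-of-vocabulary behaviour noted above: FastText composes
# a vector from character n-grams, so it can embed a word it never saw in
# training ("apple" here is an illustrative unseen query).
print("apple" in word_vectors.key_to_index)        # False: not in the vocabulary
print(word_vectors["apple"].shape)                 # (100,): a vector is still built
print(word_vectors.similarity("apple", "apples"))  # typically high: shared n-grams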
