A11 Merged

The document collects practical exercises in text processing with NLTK, scikit-learn and Gensim: tokenization (whitespace, punctuation-based, Treebank, tweet and multi-word-expression tokenizers), stemming and lemmatization, followed by bag-of-words, TF-IDF and Word2Vec feature extraction on a car dataset and a labelled news dataset. A news report about the PLA Rocket Force recruiting 13 civilian technicians is used as the sample text for the tokenization exercises.

practical_1

February 7, 2024

[7]: import nltk


from nltk.tokenize import (TreebankWordTokenizer,
                           word_tokenize,
                           wordpunct_tokenize,
                           TweetTokenizer,
                           MWETokenizer)

[35]: nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/bread/nltk_data…


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/bread/nltk_data…

[35]: True

[9]: import requests

[10]: text = """


The think tank of China’s People’s Liberation Army Rocket Force recently␣
↪recruited 13 Chinese technicians

from private companies, PLA Daily reported on Saturday.


Zhang Hao and 12 other science and technology experts received letters of␣
↪appointment at the founding ceremony of

the PLA Rocket Force national defense science and technology experts panel,␣
↪according to a report published by the

PLA Daily on Saturday.


Honored as “rocket force science and technology experts,” Zhang and his fellow␣
↪experts from private companies will

serve as members of the PLA Rocket Force think tank, which will conduct␣
↪research into fields like overall design of

the missiles, missile launching and network system technology for five years.
The experts will enjoy the same treatment as their counterparts from␣
↪State-owned firms, the report said.

The PLA Daily said that this marks a new development in deepening␣
↪military-civilian integration in China, which

1
could make science and technology innovation better contribute to the␣
↪enhancement of the force’s combat capabilities.

"""

1 Whitespace tokenization
[11]: whitespace_token = text.split()

[14]: print(whitespace_token)

['The', 'think', 'tank', 'of', 'China’s', 'People’s', 'Liberation', 'Army',


'Rocket', 'Force', 'recently', 'recruited', '13', 'Chinese', 'technicians',
'from', 'private', 'companies,', 'PLA', 'Daily', 'reported', 'on', 'Saturday.',
'Zhang', 'Hao', 'and', '12', 'other', 'science', 'and', 'technology', 'experts',
'received', 'letters', 'of', 'appointment', 'at', 'the', 'founding', 'ceremony',
'of', 'the', 'PLA', 'Rocket', 'Force', 'national', 'defense', 'science', 'and',
'technology', 'experts', 'panel,', 'according', 'to', 'a', 'report',
'published', 'by', 'the', 'PLA', 'Daily', 'on', 'Saturday.', 'Honored', 'as',
'“rocket', 'force', 'science', 'and', 'technology', 'experts,”', 'Zhang', 'and',
'his', 'fellow', 'experts', 'from', 'private', 'companies', 'will', 'serve',
'as', 'members', 'of', 'the', 'PLA', 'Rocket', 'Force', 'think', 'tank,',
'which', 'will', 'conduct', 'research', 'into', 'fields', 'like', 'overall',
'design', 'of', 'the', 'missiles,', 'missile', 'launching', 'and', 'network',
'system', 'technology', 'for', 'five', 'years.', 'The', 'experts', 'will',
'enjoy', 'the', 'same', 'treatment', 'as', 'their', 'counterparts', 'from',
'State-owned', 'firms,', 'the', 'report', 'said.', 'The', 'PLA', 'Daily',
'said', 'that', 'this', 'marks', 'a', 'new', 'development', 'in', 'deepening',
'military-civilian', 'integration', 'in', 'China,', 'which', 'could', 'make',
'science', 'and', 'technology', 'innovation', 'better', 'contribute', 'to',
'the', 'enhancement', 'of', 'the', 'force’s', 'combat', 'capabilities.']

2 Punctuation tokenization
[15]: punc_token = wordpunct_tokenize(text)

[16]: print(punc_token)

['The', 'think', 'tank', 'of', 'China', '’', 's', 'People', '’', 's',
'Liberation', 'Army', 'Rocket', 'Force', 'recently', 'recruited', '13',
'Chinese', 'technicians', 'from', 'private', 'companies', ',', 'PLA', 'Daily',
'reported', 'on', 'Saturday', '.', 'Zhang', 'Hao', 'and', '12', 'other',
'science', 'and', 'technology', 'experts', 'received', 'letters', 'of',
'appointment', 'at', 'the', 'founding', 'ceremony', 'of', 'the', 'PLA',
'Rocket', 'Force', 'national', 'defense', 'science', 'and', 'technology',
'experts', 'panel', ',', 'according', 'to', 'a', 'report', 'published', 'by',
'the', 'PLA', 'Daily', 'on', 'Saturday', '.', 'Honored', 'as', '“', 'rocket',
'force', 'science', 'and', 'technology', 'experts', ',”', 'Zhang', 'and', 'his',

'fellow', 'experts', 'from', 'private', 'companies', 'will', 'serve', 'as',
'members', 'of', 'the', 'PLA', 'Rocket', 'Force', 'think', 'tank', ',', 'which',
'will', 'conduct', 'research', 'into', 'fields', 'like', 'overall', 'design',
'of', 'the', 'missiles', ',', 'missile', 'launching', 'and', 'network',
'system', 'technology', 'for', 'five', 'years', '.', 'The', 'experts', 'will',
'enjoy', 'the', 'same', 'treatment', 'as', 'their', 'counterparts', 'from',
'State', '-', 'owned', 'firms', ',', 'the', 'report', 'said', '.', 'The', 'PLA',
'Daily', 'said', 'that', 'this', 'marks', 'a', 'new', 'development', 'in',
'deepening', 'military', '-', 'civilian', 'integration', 'in', 'China', ',',
'which', 'could', 'make', 'science', 'and', 'technology', 'innovation',
'better', 'contribute', 'to', 'the', 'enhancement', 'of', 'the', 'force', '’',
's', 'combat', 'capabilities', '.']

3 Treebank tokenization
[18]: tokenizer = TreebankWordTokenizer()

[19]: tbank_token = tokenizer.tokenize(text)

[20]: print(tbank_token)

['The', 'think', 'tank', 'of', 'China’s', 'People’s', 'Liberation', 'Army',


'Rocket', 'Force', 'recently', 'recruited', '13', 'Chinese', 'technicians',
'from', 'private', 'companies', ',', 'PLA', 'Daily', 'reported', 'on',
'Saturday.', 'Zhang', 'Hao', 'and', '12', 'other', 'science', 'and',
'technology', 'experts', 'received', 'letters', 'of', 'appointment', 'at',
'the', 'founding', 'ceremony', 'of', 'the', 'PLA', 'Rocket', 'Force',
'national', 'defense', 'science', 'and', 'technology', 'experts', 'panel', ',',
'according', 'to', 'a', 'report', 'published', 'by', 'the', 'PLA', 'Daily',
'on', 'Saturday.', 'Honored', 'as', '“rocket', 'force', 'science', 'and',
'technology', 'experts', ',', '”', 'Zhang', 'and', 'his', 'fellow', 'experts',
'from', 'private', 'companies', 'will', 'serve', 'as', 'members', 'of', 'the',
'PLA', 'Rocket', 'Force', 'think', 'tank', ',', 'which', 'will', 'conduct',
'research', 'into', 'fields', 'like', 'overall', 'design', 'of', 'the',
'missiles', ',', 'missile', 'launching', 'and', 'network', 'system',
'technology', 'for', 'five', 'years.', 'The', 'experts', 'will', 'enjoy', 'the',
'same', 'treatment', 'as', 'their', 'counterparts', 'from', 'State-owned',
'firms', ',', 'the', 'report', 'said.', 'The', 'PLA', 'Daily', 'said', 'that',
'this', 'marks', 'a', 'new', 'development', 'in', 'deepening',
'military-civilian', 'integration', 'in', 'China', ',', 'which', 'could', 'make',
'science', 'and', 'technology', 'innovation', 'better', 'contribute', 'to',
'the', 'enhancement', 'of', 'the', 'force’s', 'combat', 'capabilities', '.']

4 TweetTokenizer
[23]: tokenizer = TweetTokenizer()
tweet_token = tokenizer.tokenize(text)
print(tweet_token)

['The', 'think', 'tank', 'of', 'China', '’', 's', 'People', '’', 's',
'Liberation', 'Army', 'Rocket', 'Force', 'recently', 'recruited', '13',
'Chinese', 'technicians', 'from', 'private', 'companies', ',', 'PLA', 'Daily',
'reported', 'on', 'Saturday', '.', 'Zhang', 'Hao', 'and', '12', 'other',
'science', 'and', 'technology', 'experts', 'received', 'letters', 'of',
'appointment', 'at', 'the', 'founding', 'ceremony', 'of', 'the', 'PLA',
'Rocket', 'Force', 'national', 'defense', 'science', 'and', 'technology',
'experts', 'panel', ',', 'according', 'to', 'a', 'report', 'published', 'by',
'the', 'PLA', 'Daily', 'on', 'Saturday', '.', 'Honored', 'as', '“', 'rocket',
'force', 'science', 'and', 'technology', 'experts', ',', '”', 'Zhang', 'and',
'his', 'fellow', 'experts', 'from', 'private', 'companies', 'will', 'serve',
'as', 'members', 'of', 'the', 'PLA', 'Rocket', 'Force', 'think', 'tank', ',',
'which', 'will', 'conduct', 'research', 'into', 'fields', 'like', 'overall',
'design', 'of', 'the', 'missiles', ',', 'missile', 'launching', 'and',
'network', 'system', 'technology', 'for', 'five', 'years', '.', 'The',
'experts', 'will', 'enjoy', 'the', 'same', 'treatment', 'as', 'their',
'counterparts', 'from', 'State-owned', 'firms', ',', 'the', 'report', 'said',
'.', 'The', 'PLA', 'Daily', 'said', 'that', 'this', 'marks', 'a', 'new',
'development', 'in', 'deepening', 'military-civilian', 'integration', 'in',
'China', ',', 'which', 'could', 'make', 'science', 'and', 'technology',
'innovation', 'better', 'contribute', 'to', 'the', 'enhancement', 'of', 'the',
'force', '’', 's', 'combat', 'capabilities', '.']

5 MWE (multi-word expression) tokenization
[24]: tokenizer = MWETokenizer()

[25]: mwe = tokenizer.tokenize(word_tokenize(text))

[26]: print(mwe)

['The', 'think', 'tank', 'of', 'China', '’', 's', 'People', '’', 's',
'Liberation', 'Army', 'Rocket', 'Force', 'recently', 'recruited', '13',
'Chinese', 'technicians', 'from', 'private', 'companies', ',', 'PLA', 'Daily',
'reported', 'on', 'Saturday', '.', 'Zhang', 'Hao', 'and', '12', 'other',
'science', 'and', 'technology', 'experts', 'received', 'letters', 'of',
'appointment', 'at', 'the', 'founding', 'ceremony', 'of', 'the', 'PLA',
'Rocket', 'Force', 'national', 'defense', 'science', 'and', 'technology',
'experts', 'panel', ',', 'according', 'to', 'a', 'report', 'published', 'by',
'the', 'PLA', 'Daily', 'on', 'Saturday', '.', 'Honored', 'as', '“', 'rocket',
'force', 'science', 'and', 'technology', 'experts', ',', '”', 'Zhang', 'and',
'his', 'fellow', 'experts', 'from', 'private', 'companies', 'will', 'serve',

'as', 'members', 'of', 'the', 'PLA', 'Rocket', 'Force', 'think', 'tank', ',',
'which', 'will', 'conduct', 'research', 'into', 'fields', 'like', 'overall',
'design', 'of', 'the', 'missiles', ',', 'missile', 'launching', 'and',
'network', 'system', 'technology', 'for', 'five', 'years', '.', 'The',
'experts', 'will', 'enjoy', 'the', 'same', 'treatment', 'as', 'their',
'counterparts', 'from', 'State-owned', 'firms', ',', 'the', 'report', 'said',
'.', 'The', 'PLA', 'Daily', 'said', 'that', 'this', 'marks', 'a', 'new',
'development', 'in', 'deepening', 'military-civilian', 'integration', 'in',
'China', ',', 'which', 'could', 'make', 'science', 'and', 'technology',
'innovation', 'better', 'contribute', 'to', 'the', 'enhancement', 'of', 'the',
'force', '’', 's', 'combat', 'capabilities', '.']
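With no multi-word expressions registered, MWETokenizer behaves exactly like the plain word_tokenize call it wraps, which is why the output above adds nothing new. A small illustrative sketch (the registered phrase and sentence are our own choice, not part of the original exercise) shows how a registered expression gets merged into a single underscore-joined token:

mwe_tokenizer = MWETokenizer([('PLA', 'Rocket', 'Force')])
print(mwe_tokenizer.tokenize(word_tokenize("The PLA Rocket Force think tank recruited 13 experts.")))
# ['The', 'PLA_Rocket_Force', 'think', 'tank', 'recruited', '13', 'experts', '.']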

6 Porter stemmer
[27]: from nltk.stem.porter import *
p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run


runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli

7 Snowball stemmer
[28]: from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run


runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair

8 Lemmatization
[36]: from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+lemmatizer.lemmatize(word))

run --> run


runner --> runner
running --> running
ran --> ran
runs --> run
easily --> easily
fairly --> fairly
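WordNetLemmatizer treats every word as a noun unless told otherwise, which is why 'running' and 'ran' come back unchanged above. A short follow-up check (not part of the original cell) passes a verb part-of-speech tag instead:

for word in ['running', 'ran', 'runs']:
    print(word + ' --> ' + lemmatizer.lemmatize(word, pos='v'))

# running --> run
# ran --> run
# runs --> run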

[ ]:

Untitled

February 7, 2024

[44]: import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from tqdm.auto import tqdm

[43]:

Requirement already satisfied: tqdm in


d:\ayush\college\4thyear\sem2\nlp\venv\lib\site-packages (4.66.1)
Requirement already satisfied: colorama in
d:\ayush\college\4thyear\sem2\nlp\venv\lib\site-packages (from tqdm) (0.4.6)

[38]: nltk.download('punkt')

[nltk_data] Downloading package punkt to


[nltk_data] C:\Users\ayush\AppData\Roaming\nltk_data…
[nltk_data] Unzipping tokenizers\punkt.zip.

[38]: True

[2]: data = pd.read_csv("data.csv")

[5]: data = data[["Make","Model","Engine Fuel Type", "Transmission Type",␣


↪"Driven_Wheels", "Market Category","Vehicle Size","Vehicle Style"]]

[8]: data = data.fillna("x")

[10]: data["text"] = data.apply(lambda row: " - ".join(map(str,row)),axis=1)

[27]: bow_doc_count = 100

[28]: count_vec = CountVectorizer()

[29]: bow_matrix = count_vec.fit_transform(data["text"][:bow_doc_count])

[30]: bow_df = pd.DataFrame(bow_matrix.toarray(),
                            columns=count_vec.get_feature_names_out())

[31]: bow_df

[31]: 100 124 190 200 200sx 240sx all audi automatic benz … \
0 0 0 0 0 0 0 0 0 0 0 …
1 0 0 0 0 0 0 0 0 0 0 …
2 0 0 0 0 0 0 0 0 0 0 …
3 0 0 0 0 0 0 0 0 0 0 …
4 0 0 0 0 0 0 0 0 0 0 …
.. … … … … … … … … … … …
95 0 0 0 0 1 0 0 0 0 0 …
96 0 0 0 0 0 1 0 0 0 0 …
97 0 0 0 0 0 1 0 0 0 0 …
98 0 0 0 0 0 1 0 0 0 0 …
99 0 0 0 0 0 1 0 0 0 0 …

recommended regular required sedan series spider tuner unleaded \


0 0 0 1 0 1 0 1 1
1 0 0 1 0 1 0 0 1
2 0 0 1 0 1 0 0 1
3 0 0 1 0 1 0 0 1
4 0 0 1 0 1 0 0 1
.. … … … … … … … …
95 0 1 0 0 0 0 0 1
96 0 1 0 0 0 0 0 1
97 0 1 0 0 0 0 0 1
98 0 1 0 0 0 0 0 1
99 0 1 0 0 0 0 0 1

wagon wheel
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
.. … …
95 0 1
96 0 1
97 0 1
98 0 1
99 0 1

[100 rows x 42 columns]
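Each row of bow_df is a raw token count over the 42-term vocabulary learned from the first 100 car descriptions. A small check (illustrative only; the example string below is made up) inspects the learned vocabulary and vectorizes a new description with the already-fitted vectorizer:

print(sorted(count_vec.vocabulary_.items(), key=lambda kv: kv[1])[:10])
new_counts = count_vec.transform(["BMW - 1 Series - premium unleaded (required) - MANUAL"])
print(new_counts.toarray())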

[19]: # TF IDF

[32]: tfidf_vec = TfidfVectorizer()

t_matrix = tfidf_vec.fit_transform(data["text"][:100])
tf_df = pd.DataFrame(t_matrix.toarray(),
                     columns=tfidf_vec.get_feature_names_out())

[33]: tf_df

[33]: 100 124 190 200 200sx 240sx all audi automatic benz … \
0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
1 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
2 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
3 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
4 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
.. … … … … … … … … … … …
95 0.0 0.0 0.0 0.0 0.504103 0.000000 0.0 0.0 0.0 0.0 …
96 0.0 0.0 0.0 0.0 0.000000 0.579066 0.0 0.0 0.0 0.0 …
97 0.0 0.0 0.0 0.0 0.000000 0.579066 0.0 0.0 0.0 0.0 …
98 0.0 0.0 0.0 0.0 0.000000 0.579066 0.0 0.0 0.0 0.0 …
99 0.0 0.0 0.0 0.0 0.000000 0.579066 0.0 0.0 0.0 0.0 …

recommended regular required sedan series spider tuner \


0 0.0 0.000000 0.277765 0.0 0.246685 0.0 0.39672
1 0.0 0.000000 0.359746 0.0 0.319493 0.0 0.00000
2 0.0 0.000000 0.335552 0.0 0.298006 0.0 0.00000
3 0.0 0.000000 0.369856 0.0 0.328472 0.0 0.00000
4 0.0 0.000000 0.373677 0.0 0.331865 0.0 0.00000
.. … … … … … … …
95 0.0 0.289378 0.000000 0.0 0.000000 0.0 0.00000
96 0.0 0.274890 0.000000 0.0 0.000000 0.0 0.00000
97 0.0 0.274890 0.000000 0.0 0.000000 0.0 0.00000
98 0.0 0.274890 0.000000 0.0 0.000000 0.0 0.00000
99 0.0 0.274890 0.000000 0.0 0.000000 0.0 0.00000

unleaded wagon wheel


0 0.119763 0.0 0.119763
1 0.155111 0.0 0.155111
2 0.144679 0.0 0.144679
3 0.159470 0.0 0.159470
4 0.161117 0.0 0.161117
.. … … …
95 0.152180 0.0 0.152180
96 0.144561 0.0 0.144561
97 0.144561 0.0 0.144561
98 0.144561 0.0 0.144561
99 0.144561 0.0 0.144561

[100 rows x 42 columns]
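Unlike the raw counts, the TF-IDF rows are weighted and length-normalised, so they can be compared directly. A minimal sketch (not in the original notebook, assuming scikit-learn's cosine_similarity) measures how alike the first few car descriptions are:

from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(t_matrix[:5])
print(sims.round(3))  # 5x5 pairwise similarity of the first five descriptions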

[34]: # Word2Vec

[39]: m = pd.DataFrame()

tokenized_text = [word_tokenize(text) for text in data["text"]]

[40]: w2vec = Word2Vec(sentences=tokenized_text, vector_size=100, window=5,
                       min_count=1, workers=4)

[41]: w2vec_data = pd.DataFrame()

[45]: for i in tqdm(range(10)):
    column_name = f"w2vec_embedding_{i+1}"
    w2vec_data[column_name] = data["text"].apply(
        lambda text: w2vec.wv[word_tokenize(text)[i]])

0%| | 0/10 [00:00<?, ?it/s]

[46]: w2vec_data

[46]: w2vec_embedding_1 \
0 [1.0144914, 0.09865342, -1.038593, -0.25373653…
1 [1.0144914, 0.09865342, -1.038593, -0.25373653…
2 [1.0144914, 0.09865342, -1.038593, -0.25373653…
3 [1.0144914, 0.09865342, -1.038593, -0.25373653…
4 [1.0144914, 0.09865342, -1.038593, -0.25373653…
… …
11909 [0.712496, 0.2578129, -0.4413481, -0.0887651, …
11910 [0.712496, 0.2578129, -0.4413481, -0.0887651, …
11911 [0.712496, 0.2578129, -0.4413481, -0.0887651, …
11912 [0.712496, 0.2578129, -0.4413481, -0.0887651, …
11913 [0.079442665, 0.18170413, -0.12031996, 0.13398…

w2vec_embedding_2 \
0 [0.102402635, 0.32382426, 0.42588675, 0.177568…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_3 \
0 [0.19682735, -0.02487174, -0.18721059, -0.0303…

1 [0.19682735, -0.02487174, -0.18721059, -0.0303…
2 [0.19682735, -0.02487174, -0.18721059, -0.0303…
3 [0.19682735, -0.02487174, -0.18721059, -0.0303…
4 [0.19682735, -0.02487174, -0.18721059, -0.0303…
… …
11909 [0.070702516, -0.0059914645, -0.05066969, 0.05…
11910 [0.070702516, -0.0059914645, -0.05066969, 0.05…
11911 [0.070702516, -0.0059914645, -0.05066969, 0.05…
11912 [0.070702516, -0.0059914645, -0.05066969, 0.05…
11913 [-0.003097159, -0.0026490043, -0.0028470934, -…

w2vec_embedding_4 \
0 [0.50986725, -0.010175074, -0.5086816, -0.2202…
1 [0.50986725, -0.010175074, -0.5086816, -0.2202…
2 [0.50986725, -0.010175074, -0.5086816, -0.2202…
3 [0.50986725, -0.010175074, -0.5086816, -0.2202…
4 [0.50986725, -0.010175074, -0.5086816, -0.2202…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_5 \
0 [0.13098931, -0.012665014, -0.1817777, 0.04231…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
11910 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
11911 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
11912 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
11913 [-0.44647923, -0.71473974, 0.43583885, 0.07716…

w2vec_embedding_6 \
0 [0.102402635, 0.32382426, 0.42588675, 0.177568…
1 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
2 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
3 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
4 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
… …
11909 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
11910 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
11911 [-0.44654804, -0.0011222507, 1.6705241, -0.013…

11912 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
11913 [-0.44654804, -0.0011222507, 1.6705241, -0.013…

w2vec_embedding_7 \
0 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
1 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
2 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
3 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
4 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
… …
11909 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
11910 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
11911 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
11912 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_8 \
0 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
1 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
2 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
3 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
4 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
… …
11909 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
11910 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
11911 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
11912 [1.3811533, 0.13443133, 0.67313755, 0.33346632…
11913 [-0.18816459, -0.45196122, -0.21790302, 0.9887…

w2vec_embedding_9 \
0 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
1 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
2 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
3 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
4 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
… …
11909 [0.93188065, -0.14711751, 1.1007845, -0.315147…
11910 [0.93188065, -0.14711751, 1.1007845, -0.315147…
11911 [0.93188065, -0.14711751, 1.1007845, -0.315147…
11912 [0.93188065, -0.14711751, 1.1007845, -0.315147…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_10 \
0 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
1 [0.93188065, -0.14711751, 1.1007845, -0.315147…
2 [0.93188065, -0.14711751, 1.1007845, -0.315147…
3 [0.93188065, -0.14711751, 1.1007845, -0.315147…

4 [0.93188065, -0.14711751, 1.1007845, -0.315147…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [0.20108287, 0.6029379, 0.2330216, -0.3212626,…

w2vec_embedding_11 \
0 [0.93188065, -0.14711751, 1.1007845, -0.315147…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [-0.18816459, -0.45196122, -0.21790302, 0.9887…
11910 [-0.18816459, -0.45196122, -0.21790302, 0.9887…
11911 [-0.18816459, -0.45196122, -0.21790302, 0.9887…
11912 [-0.18816459, -0.45196122, -0.21790302, 0.9887…
11913 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…

w2vec_embedding_12 \
0 [0.102402635, 0.32382426, 0.42588675, 0.177568…
1 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
2 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
3 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
4 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [-0.14079317, 0.5140112, 0.56032175, -0.192161…

w2vec_embedding_13 \
0 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [0.23303074, 0.79225045, 0.5197343, 0.6270899,…
11910 [0.23303074, 0.79225045, 0.5197343, 0.6270899,…
11911 [0.23303074, 0.79225045, 0.5197343, 0.6270899,…
11912 [0.23303074, 0.79225045, 0.5197343, 0.6270899,…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_14 \
0 [0.102402635, 0.32382426, 0.42588675, 0.177568…
1 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
2 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
3 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
4 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
… …
11909 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
11910 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
11911 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
11912 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
11913 [-0.6609878, 0.4256675, 1.4135256, 0.87129855,…

w2vec_embedding_15 \
0 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
1 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
2 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
3 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
4 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
… …
11909 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
11910 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
11911 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
11912 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_16 \
0 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
1 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
2 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
3 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
4 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [0.18458737, 0.88140917, 1.2577158, 0.6808571,…

w2vec_embedding_17
0 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [-0.62329394, -0.27842516, 1.6441574, 0.278253…

11910 [-0.62329394, -0.27842516, 1.6441574, 0.278253…
11911 [-0.62329394, -0.27842516, 1.6441574, 0.278253…
11912 [-0.62329394, -0.27842516, 1.6441574, 0.278253…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

[11914 rows x 17 columns]
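The loop above stores one embedding per token position, which largely repeats the per-word vectors across rows. A more common summary (a sketch of our own, not part of the original notebook) averages all word vectors of a row into a single fixed-size vector; the trained model can also be queried directly for nearest neighbours:

import numpy as np

def sentence_vector(text):
    # mean of the Word2Vec vectors of every token in the row
    tokens = word_tokenize(text)
    return np.mean([w2vec.wv[t] for t in tokens if t in w2vec.wv], axis=0)

doc_vectors = data["text"].apply(sentence_vector)

# neighbours of the first token of the first row (guaranteed to be in the
# vocabulary because the model was trained with min_count=1)
print(w2vec.wv.most_similar(word_tokenize(data["text"][0])[0], topn=5))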

[ ]:

Untitled

February 7, 2024

[25]: import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re
import nltk

[26]: nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/bread/nltk_data…


[nltk_data] Unzipping corpora/stopwords.zip.

[26]: True

[13]: data = pd.read_pickle("News_dataset.pickle")

[17]: data = data.drop(["File_Name", "Complete_Filename", "id", "News_length"],axis=1)

[19]: data.head()

[19]: Content Category


0 Ad sales boost Time Warner profit\r\n\r\nQuart… business
1 Dollar gains on Greenspan speech\r\n\r\nThe do… business
2 Yukos unit buyer faces loan claim\r\n\r\nThe o… business
3 High fuel prices hit BA's profits\r\n\r\nBriti… business
4 Pernod takeover talk lifts Domecq\r\n\r\nShare… business

[20]: def preprocess(text: str):
    text = re.sub(r'[^a-zA-Z]', ' ', text.lower())

    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    cleaned_text = ' '.join(tokens)
    return cleaned_text
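A quick sanity check of the helper (illustrative only; it assumes the 'punkt' and 'wordnet' NLTK data downloaded in the earlier practical are still available): the function lowercases, strips non-letters, drops English stop words and lemmatizes what remains.

sample = "The dollars were gained after Greenspan's speeches in London."
print(preprocess(sample))
# roughly: 'dollar gained greenspan speech london'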

[28]: clean = pd.DataFrame()

text_columns = [col for col in data.columns]

for col in text_columns:
    clean[f'cleaned_{col}'] = data[col].apply(preprocess)

label_encoder = LabelEncoder()
clean["encoded_label"] = label_encoder.fit_transform(data["Category"])

[29]: clean

[29]: cleaned_Content cleaned_Category \


0 ad sale boost time warner profit quarterly pro… business
1 dollar gain greenspan speech dollar hit highes… business
2 yukos unit buyer face loan claim owner embattl… business
3 high fuel price hit ba profit british airway b… business
4 pernod takeover talk lift domecq share uk drin… business
… … …
2220 bt program beat dialler scam bt introducing tw… tech
2221 spam e mail tempt net shopper computer user ac… tech
2222 careful code new european directive could put … tech
2223 u cyber security chief resigns man making sure… tech
2224 losing online gaming online role playing game … tech

encoded_label
0 0
1 0
2 0
3 0
4 0
… …
2220 4
2221 4
2222 4
2223 4
2224 4

[2225 rows x 3 columns]

[30]: combined_text = pd.DataFrame()


combined_text["text"] = clean.apply(lambda row: ' - '.join(map(str,row)),axis=1)

[31]: combined_text

[31]: text
0 ad sale boost time warner profit quarterly pro…
1 dollar gain greenspan speech dollar hit highes…
2 yukos unit buyer face loan claim owner embattl…
3 high fuel price hit ba profit british airway b…
4 pernod takeover talk lift domecq share uk drin…
… …
2220 bt program beat dialler scam bt introducing tw…
2221 spam e mail tempt net shopper computer user ac…
2222 careful code new european directive could put …
2223 u cyber security chief resigns man making sure…
2224 losing online gaming online role playing game …

[2225 rows x 1 columns]

[32]: # unclean

[33]: data.Content[1]

[33]: 'Dollar gains on Greenspan speech\r\n\r\nThe dollar has hit its highest level
against the euro in almost three months after the Federal Reserve head said the
US trade deficit is set to stabilise.\r\n\r\nAnd Alan Greenspan highlighted the
US government\'s willingness to curb spending and rising household savings as
factors which may help to reduce it. In late trading in New York, the dollar
reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns
about the deficit has hit the greenback in recent months. On Friday, Federal
Reserve chairman Mr Greenspan\'s speech in London ahead of the meeting of G7
finance ministers sent the dollar higher after it had earlier tumbled on the
back of worse-than-expected US jobs data. "I think the chairman\'s taking a much
more sanguine view on the current account deficit than he\'s taken for some
time," said Robert Sinche, head of currency strategy at Bank of America in New
York. "He\'s taking a longer-term view, laying out a set of conditions under
which the current account deficit can improve this year and
next."\r\n\r\nWorries about the deficit concerns about China do, however,
remain. China\'s currency remains pegged to the dollar and the US currency\'s
sharp falls in recent months have therefore made Chinese export prices highly
competitive. But calls for a shift in Beijing\'s policy have fallen on deaf
ears, despite recent comments in a major Chinese newspaper that the "time is
ripe" for a loosening of the peg. The G7 meeting is thought unlikely to produce
any meaningful movement in Chinese policy. In the meantime, the US Federal
Reserve\'s decision on 2 February to boost interest rates by a quarter of a
point - the sixth such move in as many months - has opened up a differential
with European rates. The half-point window, some believe, could be enough to
keep US assets looking more attractive, and could help prop up the dollar. The
recent falls have partly been the result of big budget deficits, as well as the
US\'s yawning current account gap, both of which need to be funded by the buying
of US bonds and assets by foreign firms and governments. The White House will
announce its budget on Monday, and many commentators believe the deficit will
remain at close to half a trillion dollars.'

[34]: combined_text.text[1]

[34]: 'dollar gain greenspan speech dollar hit highest level euro almost three month
federal reserve head said u trade deficit set stabilise alan greenspan
highlighted u government willingness curb spending rising household saving
factor may help reduce late trading new york dollar reached euro thursday market
concern deficit hit greenback recent month friday federal reserve chairman mr
greenspan speech london ahead meeting g finance minister sent dollar higher
earlier tumbled back worse expected u job data think chairman taking much
sanguine view current account deficit taken time said robert sinche head
currency strategy bank america new york taking longer term view laying set
condition current account deficit improve year next worry deficit concern china
however remain china currency remains pegged dollar u currency sharp fall recent
month therefore made chinese export price highly competitive call shift beijing
policy fallen deaf ear despite recent comment major chinese newspaper time ripe
loosening peg g meeting thought unlikely produce meaningful movement chinese
policy meantime u federal reserve decision february boost interest rate quarter
point sixth move many month opened differential european rate half point window
believe could enough keep u asset looking attractive could help prop dollar
recent fall partly result big budget deficit well u yawning current account gap
need funded buying u bond asset foreign firm government white house announce
budget monday many commentator believe deficit remain close half trillion dollar
- business - 0'

[ ]:

[35]: # TF - IDF

[39]: tfidf_vec = TfidfVectorizer()

tf_matrix = tfidf_vec.fit_transform(combined_text["text"][:10])

[40]: tf_df = pd.DataFrame(tf_matrix.toarray(),
                           columns=tfidf_vec.get_feature_names_out())

[41]: tf_df

[41]: abandon absorbing according account accumulated accused accusing \


0 0.000000 0.000000 0.000000 0.071491 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.125417 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.000000 0.000000 0.064076 0.000000 0.000000 0.000000 0.000000

7 0.057341 0.057341 0.000000 0.085293 0.000000 0.000000 0.000000
8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000000 0.000000 0.000000 0.000000 0.071168 0.071168 0.071168

acquisition action activity … would wsj yawning \


0 0.000000 0.00000 0.00000 … 0.000000 0.000000 0.000000
1 0.000000 0.00000 0.00000 … 0.000000 0.000000 0.056211
2 0.000000 0.05877 0.00000 … 0.038860 0.000000 0.000000
3 0.000000 0.00000 0.00000 … 0.029284 0.000000 0.000000
4 0.057551 0.00000 0.00000 … 0.000000 0.057551 0.000000
5 0.000000 0.00000 0.00000 … 0.000000 0.000000 0.000000
6 0.000000 0.00000 0.00000 … 0.042369 0.000000 0.000000
7 0.000000 0.00000 0.00000 … 0.037916 0.000000 0.000000
8 0.000000 0.00000 0.06025 … 0.000000 0.000000 0.000000
9 0.000000 0.00000 0.00000 … 0.000000 0.000000 0.000000

year yen yet yield york yugansk yukos


0 0.085342 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.024953 0.000000 0.000000 0.000000 0.112422 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.352618 0.352618
3 0.117960 0.000000 0.032938 0.044288 0.000000 0.000000 0.000000
4 0.025547 0.000000 0.042802 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.090703 0.067459 0.000000 0.000000 0.000000 0.000000
6 0.056888 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7 0.050909 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.053492 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.031593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

[10 rows x 898 columns]
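The matrix above has one column per vocabulary term, so the most characteristic words of a document can be read off its row. A small follow-up sketch (not in the original notebook) lists the ten highest-weighted TF-IDF terms of the first cleaned article:

import numpy as np

row = tf_matrix.toarray()[0]
terms = tfidf_vec.get_feature_names_out()
top_idx = np.argsort(row)[::-1][:10]
print([(terms[i], round(row[i], 3)) for i in top_idx])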

[ ]:
