A11 Merged

The document collects practical exercises in text processing with NLTK, scikit-learn and Gensim: tokenization (whitespace, punctuation-based, Treebank, tweet and multi-word-expression tokenizers), stemming and lemmatization, followed by bag-of-words, TF-IDF and Word2Vec feature extraction on a car dataset and a labelled news dataset. A news report about the PLA Rocket Force recruiting 13 civilian technicians is used as the sample text for the tokenization exercises.

practical_1

February 7, 2024

[7]: import nltk


from nltk.tokenize import (TreebankWordTokenizer,
                           word_tokenize,
                           wordpunct_tokenize,
                           TweetTokenizer,
                           MWETokenizer)

[35]: nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/bread/nltk_data…


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/bread/nltk_data…

[35]: True

[9]: import requests

[10]: text = """


The think tank of China’s People’s Liberation Army Rocket Force recently␣
↪recruited 13 Chinese technicians

from private companies, PLA Daily reported on Saturday.


Zhang Hao and 12 other science and technology experts received letters of␣
↪appointment at the founding ceremony of

the PLA Rocket Force national defense science and technology experts panel,␣
↪according to a report published by the

PLA Daily on Saturday.


Honored as “rocket force science and technology experts,” Zhang and his fellow␣
↪experts from private companies will

serve as members of the PLA Rocket Force think tank, which will conduct␣
↪research into fields like overall design of

the missiles, missile launching and network system technology for five years.
The experts will enjoy the same treatment as their counterparts from␣
↪State-owned firms, the report said.

The PLA Daily said that this marks a new development in deepening␣
↪military-civilian integration in China, which

1
could make science and technology innovation better contribute to the␣
↪enhancement of the force’s combat capabilities.

"""

1 Whitespace tokenization
[11]: whitespace_token = text.split()

[14]: print(whitespace_token)

['The', 'think', 'tank', 'of', 'China’s', 'People’s', 'Liberation', 'Army',


'Rocket', 'Force', 'recently', 'recruited', '13', 'Chinese', 'technicians',
'from', 'private', 'companies,', 'PLA', 'Daily', 'reported', 'on', 'Saturday.',
'Zhang', 'Hao', 'and', '12', 'other', 'science', 'and', 'technology', 'experts',
'received', 'letters', 'of', 'appointment', 'at', 'the', 'founding', 'ceremony',
'of', 'the', 'PLA', 'Rocket', 'Force', 'national', 'defense', 'science', 'and',
'technology', 'experts', 'panel,', 'according', 'to', 'a', 'report',
'published', 'by', 'the', 'PLA', 'Daily', 'on', 'Saturday.', 'Honored', 'as',
'“rocket', 'force', 'science', 'and', 'technology', 'experts,”', 'Zhang', 'and',
'his', 'fellow', 'experts', 'from', 'private', 'companies', 'will', 'serve',
'as', 'members', 'of', 'the', 'PLA', 'Rocket', 'Force', 'think', 'tank,',
'which', 'will', 'conduct', 'research', 'into', 'fields', 'like', 'overall',
'design', 'of', 'the', 'missiles,', 'missile', 'launching', 'and', 'network',
'system', 'technology', 'for', 'five', 'years.', 'The', 'experts', 'will',
'enjoy', 'the', 'same', 'treatment', 'as', 'their', 'counterparts', 'from',
'State-owned', 'firms,', 'the', 'report', 'said.', 'The', 'PLA', 'Daily',
'said', 'that', 'this', 'marks', 'a', 'new', 'development', 'in', 'deepening',
'military-civilian', 'integration', 'in', 'China,', 'which', 'could', 'make',
'science', 'and', 'technology', 'innovation', 'better', 'contribute', 'to',
'the', 'enhancement', 'of', 'the', 'force’s', 'combat', 'capabilities.']

2 Punctuation tokenization
[15]: punc_token = wordpunct_tokenize(text)

[16]: print(punc_token)

['The', 'think', 'tank', 'of', 'China', '’', 's', 'People', '’', 's',
'Liberation', 'Army', 'Rocket', 'Force', 'recently', 'recruited', '13',
'Chinese', 'technicians', 'from', 'private', 'companies', ',', 'PLA', 'Daily',
'reported', 'on', 'Saturday', '.', 'Zhang', 'Hao', 'and', '12', 'other',
'science', 'and', 'technology', 'experts', 'received', 'letters', 'of',
'appointment', 'at', 'the', 'founding', 'ceremony', 'of', 'the', 'PLA',
'Rocket', 'Force', 'national', 'defense', 'science', 'and', 'technology',
'experts', 'panel', ',', 'according', 'to', 'a', 'report', 'published', 'by',
'the', 'PLA', 'Daily', 'on', 'Saturday', '.', 'Honored', 'as', '“', 'rocket',
'force', 'science', 'and', 'technology', 'experts', ',”', 'Zhang', 'and', 'his',

'fellow', 'experts', 'from', 'private', 'companies', 'will', 'serve', 'as',
'members', 'of', 'the', 'PLA', 'Rocket', 'Force', 'think', 'tank', ',', 'which',
'will', 'conduct', 'research', 'into', 'fields', 'like', 'overall', 'design',
'of', 'the', 'missiles', ',', 'missile', 'launching', 'and', 'network',
'system', 'technology', 'for', 'five', 'years', '.', 'The', 'experts', 'will',
'enjoy', 'the', 'same', 'treatment', 'as', 'their', 'counterparts', 'from',
'State', '-', 'owned', 'firms', ',', 'the', 'report', 'said', '.', 'The', 'PLA',
'Daily', 'said', 'that', 'this', 'marks', 'a', 'new', 'development', 'in',
'deepening', 'military', '-', 'civilian', 'integration', 'in', 'China', ',',
'which', 'could', 'make', 'science', 'and', 'technology', 'innovation',
'better', 'contribute', 'to', 'the', 'enhancement', 'of', 'the', 'force', '’',
's', 'combat', 'capabilities', '.']

3 Treebank tokenization
[18]: tokenizer = TreebankWordTokenizer()

[19]: tbank_token = tokenizer.tokenize(text)

[20]: print(tbank_token)

['The', 'think', 'tank', 'of', 'China’s', 'People’s', 'Liberation', 'Army',


'Rocket', 'Force', 'recently', 'recruited', '13', 'Chinese', 'technicians',
'from', 'private', 'companies', ',', 'PLA', 'Daily', 'reported', 'on',
'Saturday.', 'Zhang', 'Hao', 'and', '12', 'other', 'science', 'and',
'technology', 'experts', 'received', 'letters', 'of', 'appointment', 'at',
'the', 'founding', 'ceremony', 'of', 'the', 'PLA', 'Rocket', 'Force',
'national', 'defense', 'science', 'and', 'technology', 'experts', 'panel', ',',
'according', 'to', 'a', 'report', 'published', 'by', 'the', 'PLA', 'Daily',
'on', 'Saturday.', 'Honored', 'as', '“rocket', 'force', 'science', 'and',
'technology', 'experts', ',', '”', 'Zhang', 'and', 'his', 'fellow', 'experts',
'from', 'private', 'companies', 'will', 'serve', 'as', 'members', 'of', 'the',
'PLA', 'Rocket', 'Force', 'think', 'tank', ',', 'which', 'will', 'conduct',
'research', 'into', 'fields', 'like', 'overall', 'design', 'of', 'the',
'missiles', ',', 'missile', 'launching', 'and', 'network', 'system',
'technology', 'for', 'five', 'years.', 'The', 'experts', 'will', 'enjoy', 'the',
'same', 'treatment', 'as', 'their', 'counterparts', 'from', 'State-owned',
'firms', ',', 'the', 'report', 'said.', 'The', 'PLA', 'Daily', 'said', 'that',
'this', 'marks', 'a', 'new', 'development', 'in', 'deepening',
'military-civilian', 'integration', 'in', 'China', ',', 'which', 'could', 'make',
'science', 'and', 'technology', 'innovation', 'better', 'contribute', 'to',
'the', 'enhancement', 'of', 'the', 'force’s', 'combat', 'capabilities', '.']

4 TweetTokenizer
[23]: tokenizer = TweetTokenizer()
tweet_token = tokenizer.tokenize(text)
print(tweet_token)

['The', 'think', 'tank', 'of', 'China', '’', 's', 'People', '’', 's',
'Liberation', 'Army', 'Rocket', 'Force', 'recently', 'recruited', '13',
'Chinese', 'technicians', 'from', 'private', 'companies', ',', 'PLA', 'Daily',
'reported', 'on', 'Saturday', '.', 'Zhang', 'Hao', 'and', '12', 'other',
'science', 'and', 'technology', 'experts', 'received', 'letters', 'of',
'appointment', 'at', 'the', 'founding', 'ceremony', 'of', 'the', 'PLA',
'Rocket', 'Force', 'national', 'defense', 'science', 'and', 'technology',
'experts', 'panel', ',', 'according', 'to', 'a', 'report', 'published', 'by',
'the', 'PLA', 'Daily', 'on', 'Saturday', '.', 'Honored', 'as', '“', 'rocket',
'force', 'science', 'and', 'technology', 'experts', ',', '”', 'Zhang', 'and',
'his', 'fellow', 'experts', 'from', 'private', 'companies', 'will', 'serve',
'as', 'members', 'of', 'the', 'PLA', 'Rocket', 'Force', 'think', 'tank', ',',
'which', 'will', 'conduct', 'research', 'into', 'fields', 'like', 'overall',
'design', 'of', 'the', 'missiles', ',', 'missile', 'launching', 'and',
'network', 'system', 'technology', 'for', 'five', 'years', '.', 'The',
'experts', 'will', 'enjoy', 'the', 'same', 'treatment', 'as', 'their',
'counterparts', 'from', 'State-owned', 'firms', ',', 'the', 'report', 'said',
'.', 'The', 'PLA', 'Daily', 'said', 'that', 'this', 'marks', 'a', 'new',
'development', 'in', 'deepening', 'military-civilian', 'integration', 'in',
'China', ',', 'which', 'could', 'make', 'science', 'and', 'technology',
'innovation', 'better', 'contribute', 'to', 'the', 'enhancement', 'of', 'the',
'force', '’', 's', 'combat', 'capabilities', '.']

5 MWE (multi-word expression) tokenization
[24]: tokenizer = MWETokenizer()

[25]: mwe = tokenizer.tokenize(word_tokenize(text))

[26]: print(mwe)

['The', 'think', 'tank', 'of', 'China', '’', 's', 'People', '’', 's',
'Liberation', 'Army', 'Rocket', 'Force', 'recently', 'recruited', '13',
'Chinese', 'technicians', 'from', 'private', 'companies', ',', 'PLA', 'Daily',
'reported', 'on', 'Saturday', '.', 'Zhang', 'Hao', 'and', '12', 'other',
'science', 'and', 'technology', 'experts', 'received', 'letters', 'of',
'appointment', 'at', 'the', 'founding', 'ceremony', 'of', 'the', 'PLA',
'Rocket', 'Force', 'national', 'defense', 'science', 'and', 'technology',
'experts', 'panel', ',', 'according', 'to', 'a', 'report', 'published', 'by',
'the', 'PLA', 'Daily', 'on', 'Saturday', '.', 'Honored', 'as', '“', 'rocket',
'force', 'science', 'and', 'technology', 'experts', ',', '”', 'Zhang', 'and',
'his', 'fellow', 'experts', 'from', 'private', 'companies', 'will', 'serve',

'as', 'members', 'of', 'the', 'PLA', 'Rocket', 'Force', 'think', 'tank', ',',
'which', 'will', 'conduct', 'research', 'into', 'fields', 'like', 'overall',
'design', 'of', 'the', 'missiles', ',', 'missile', 'launching', 'and',
'network', 'system', 'technology', 'for', 'five', 'years', '.', 'The',
'experts', 'will', 'enjoy', 'the', 'same', 'treatment', 'as', 'their',
'counterparts', 'from', 'State-owned', 'firms', ',', 'the', 'report', 'said',
'.', 'The', 'PLA', 'Daily', 'said', 'that', 'this', 'marks', 'a', 'new',
'development', 'in', 'deepening', 'military-civilian', 'integration', 'in',
'China', ',', 'which', 'could', 'make', 'science', 'and', 'technology',
'innovation', 'better', 'contribute', 'to', 'the', 'enhancement', 'of', 'the',
'force', '’', 's', 'combat', 'capabilities', '.']
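With no multi-word expressions registered, MWETokenizer behaves exactly like the plain word_tokenize call it wraps, which is why the output above adds nothing new. A small illustrative sketch (the registered phrase and sentence are our own choice, not part of the original exercise) shows how a registered expression gets merged into a single underscore-joined token:

mwe_tokenizer = MWETokenizer([('PLA', 'Rocket', 'Force')])
print(mwe_tokenizer.tokenize(word_tokenize("The PLA Rocket Force think tank recruited 13 experts.")))
# ['The', 'PLA_Rocket_Force', 'think', 'tank', 'recruited', '13', 'experts', '.']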

6 Porter stemmer
[27]: from nltk.stem.porter import *
p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run


runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli

7 Snowball stemmer
[28]: from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run


runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair

8 Lemmatization
[36]: from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+lemmatizer.lemmatize(word))

run --> run


runner --> runner
running --> running
ran --> ran
runs --> run
easily --> easily
fairly --> fairly
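WordNetLemmatizer treats every word as a noun unless told otherwise, which is why 'running' and 'ran' come back unchanged above. A short follow-up check (not part of the original cell) passes a verb part-of-speech tag instead:

for word in ['running', 'ran', 'runs']:
    print(word + ' --> ' + lemmatizer.lemmatize(word, pos='v'))

# running --> run
# ran --> run
# runs --> run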

[ ]:

Untitled

February 7, 2024

[44]: import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from tqdm.auto import tqdm

[43]:

Requirement already satisfied: tqdm in


d:\ayush\college\4thyear\sem2\nlp\venv\lib\site-packages (4.66.1)
Requirement already satisfied: colorama in
d:\ayush\college\4thyear\sem2\nlp\venv\lib\site-packages (from tqdm) (0.4.6)

[38]: nltk.download('punkt')

[nltk_data] Downloading package punkt to


[nltk_data] C:\Users\ayush\AppData\Roaming\nltk_data…
[nltk_data] Unzipping tokenizers\punkt.zip.

[38]: True

[2]: data = pd.read_csv("data.csv")

[5]: data = data[["Make","Model","Engine Fuel Type", "Transmission Type",␣


↪"Driven_Wheels", "Market Category","Vehicle Size","Vehicle Style"]]

[8]: data = data.fillna("x")

[10]: data["text"] = data.apply(lambda row: " - ".join(map(str,row)),axis=1)

[27]: bow_doc_count = 100

[28]: count_vec = CountVectorizer()

[29]: bow_matrix = count_vec.fit_transform(data["text"][:bow_doc_count])

[30]: bow_df = pd.DataFrame(bow_matrix.toarray(),
                            columns=count_vec.get_feature_names_out())

[31]: bow_df

[31]: 100 124 190 200 200sx 240sx all audi automatic benz … \
0 0 0 0 0 0 0 0 0 0 0 …
1 0 0 0 0 0 0 0 0 0 0 …
2 0 0 0 0 0 0 0 0 0 0 …
3 0 0 0 0 0 0 0 0 0 0 …
4 0 0 0 0 0 0 0 0 0 0 …
.. … … … … … … … … … … …
95 0 0 0 0 1 0 0 0 0 0 …
96 0 0 0 0 0 1 0 0 0 0 …
97 0 0 0 0 0 1 0 0 0 0 …
98 0 0 0 0 0 1 0 0 0 0 …
99 0 0 0 0 0 1 0 0 0 0 …

recommended regular required sedan series spider tuner unleaded \


0 0 0 1 0 1 0 1 1
1 0 0 1 0 1 0 0 1
2 0 0 1 0 1 0 0 1
3 0 0 1 0 1 0 0 1
4 0 0 1 0 1 0 0 1
.. … … … … … … … …
95 0 1 0 0 0 0 0 1
96 0 1 0 0 0 0 0 1
97 0 1 0 0 0 0 0 1
98 0 1 0 0 0 0 0 1
99 0 1 0 0 0 0 0 1

wagon wheel
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
.. … …
95 0 1
96 0 1
97 0 1
98 0 1
99 0 1

[100 rows x 42 columns]
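Each row of bow_df is a raw token count over the 42-term vocabulary learned from the first 100 car descriptions. A small check (illustrative only; the example string below is made up) inspects the learned vocabulary and vectorizes a new description with the already-fitted vectorizer:

print(sorted(count_vec.vocabulary_.items(), key=lambda kv: kv[1])[:10])
new_counts = count_vec.transform(["BMW - 1 Series - premium unleaded (required) - MANUAL"])
print(new_counts.toarray())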

[19]: # TF IDF

[32]: tfidf_vec = TfidfVectorizer()

t_matrix = tfidf_vec.fit_transform(data["text"][:100])
tf_df = pd.DataFrame(t_matrix.toarray(),
                     columns=tfidf_vec.get_feature_names_out())

[33]: tf_df

[33]: 100 124 190 200 200sx 240sx all audi automatic benz … \
0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
1 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
2 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
3 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
4 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 …
.. … … … … … … … … … … …
95 0.0 0.0 0.0 0.0 0.504103 0.000000 0.0 0.0 0.0 0.0 …
96 0.0 0.0 0.0 0.0 0.000000 0.579066 0.0 0.0 0.0 0.0 …
97 0.0 0.0 0.0 0.0 0.000000 0.579066 0.0 0.0 0.0 0.0 …
98 0.0 0.0 0.0 0.0 0.000000 0.579066 0.0 0.0 0.0 0.0 …
99 0.0 0.0 0.0 0.0 0.000000 0.579066 0.0 0.0 0.0 0.0 …

recommended regular required sedan series spider tuner \


0 0.0 0.000000 0.277765 0.0 0.246685 0.0 0.39672
1 0.0 0.000000 0.359746 0.0 0.319493 0.0 0.00000
2 0.0 0.000000 0.335552 0.0 0.298006 0.0 0.00000
3 0.0 0.000000 0.369856 0.0 0.328472 0.0 0.00000
4 0.0 0.000000 0.373677 0.0 0.331865 0.0 0.00000
.. … … … … … … …
95 0.0 0.289378 0.000000 0.0 0.000000 0.0 0.00000
96 0.0 0.274890 0.000000 0.0 0.000000 0.0 0.00000
97 0.0 0.274890 0.000000 0.0 0.000000 0.0 0.00000
98 0.0 0.274890 0.000000 0.0 0.000000 0.0 0.00000
99 0.0 0.274890 0.000000 0.0 0.000000 0.0 0.00000

unleaded wagon wheel


0 0.119763 0.0 0.119763
1 0.155111 0.0 0.155111
2 0.144679 0.0 0.144679
3 0.159470 0.0 0.159470
4 0.161117 0.0 0.161117
.. … … …
95 0.152180 0.0 0.152180
96 0.144561 0.0 0.144561
97 0.144561 0.0 0.144561
98 0.144561 0.0 0.144561
99 0.144561 0.0 0.144561

[100 rows x 42 columns]
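Unlike the raw counts, the TF-IDF rows are weighted and length-normalised, so they can be compared directly. A minimal sketch (not in the original notebook, assuming scikit-learn's cosine_similarity) measures how alike the first few car descriptions are:

from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(t_matrix[:5])
print(sims.round(3))  # 5x5 pairwise similarity of the first five descriptions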

[34]: # Word2Vec

[39]: m = pd.DataFrame()

tokenized_text = [word_tokenize(text) for text in data["text"]]

[40]: w2vec = Word2Vec(sentences=tokenized_text, vector_size=100, window=5,
                       min_count=1, workers=4)

[41]: w2vec_data = pd.DataFrame()

[45]: for i in tqdm(range(10)):
    column_name = f"w2vec_embedding_{i+1}"
    w2vec_data[column_name] = data["text"].apply(
        lambda text: w2vec.wv[word_tokenize(text)[i]])

0%| | 0/10 [00:00<?, ?it/s]

[46]: w2vec_data

[46]: w2vec_embedding_1 \
0 [1.0144914, 0.09865342, -1.038593, -0.25373653…
1 [1.0144914, 0.09865342, -1.038593, -0.25373653…
2 [1.0144914, 0.09865342, -1.038593, -0.25373653…
3 [1.0144914, 0.09865342, -1.038593, -0.25373653…
4 [1.0144914, 0.09865342, -1.038593, -0.25373653…
… …
11909 [0.712496, 0.2578129, -0.4413481, -0.0887651, …
11910 [0.712496, 0.2578129, -0.4413481, -0.0887651, …
11911 [0.712496, 0.2578129, -0.4413481, -0.0887651, …
11912 [0.712496, 0.2578129, -0.4413481, -0.0887651, …
11913 [0.079442665, 0.18170413, -0.12031996, 0.13398…

w2vec_embedding_2 \
0 [0.102402635, 0.32382426, 0.42588675, 0.177568…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_3 \
0 [0.19682735, -0.02487174, -0.18721059, -0.0303…

1 [0.19682735, -0.02487174, -0.18721059, -0.0303…
2 [0.19682735, -0.02487174, -0.18721059, -0.0303…
3 [0.19682735, -0.02487174, -0.18721059, -0.0303…
4 [0.19682735, -0.02487174, -0.18721059, -0.0303…
… …
11909 [0.070702516, -0.0059914645, -0.05066969, 0.05…
11910 [0.070702516, -0.0059914645, -0.05066969, 0.05…
11911 [0.070702516, -0.0059914645, -0.05066969, 0.05…
11912 [0.070702516, -0.0059914645, -0.05066969, 0.05…
11913 [-0.003097159, -0.0026490043, -0.0028470934, -…

w2vec_embedding_4 \
0 [0.50986725, -0.010175074, -0.5086816, -0.2202…
1 [0.50986725, -0.010175074, -0.5086816, -0.2202…
2 [0.50986725, -0.010175074, -0.5086816, -0.2202…
3 [0.50986725, -0.010175074, -0.5086816, -0.2202…
4 [0.50986725, -0.010175074, -0.5086816, -0.2202…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_5 \
0 [0.13098931, -0.012665014, -0.1817777, 0.04231…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
11910 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
11911 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
11912 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
11913 [-0.44647923, -0.71473974, 0.43583885, 0.07716…

w2vec_embedding_6 \
0 [0.102402635, 0.32382426, 0.42588675, 0.177568…
1 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
2 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
3 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
4 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
… …
11909 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
11910 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
11911 [-0.44654804, -0.0011222507, 1.6705241, -0.013…

11912 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
11913 [-0.44654804, -0.0011222507, 1.6705241, -0.013…

w2vec_embedding_7 \
0 [-0.61892426, -0.26941118, 0.4358393, -0.12083…
1 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
2 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
3 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
4 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
… …
11909 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
11910 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
11911 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
11912 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_8 \
0 [-0.44654804, -0.0011222507, 1.6705241, -0.013…
1 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
2 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
3 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
4 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
… …
11909 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
11910 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
11911 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
11912 [1.3811533, 0.13443133, 0.67313755, 0.33346632…
11913 [-0.18816459, -0.45196122, -0.21790302, 0.9887…

w2vec_embedding_9 \
0 [-0.23426266, -0.8697558, -0.09931712, 1.08638…
1 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
2 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
3 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
4 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
… …
11909 [0.93188065, -0.14711751, 1.1007845, -0.315147…
11910 [0.93188065, -0.14711751, 1.1007845, -0.315147…
11911 [0.93188065, -0.14711751, 1.1007845, -0.315147…
11912 [0.93188065, -0.14711751, 1.1007845, -0.315147…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_10 \
0 [0.74533176, 0.5301534, 0.63357615, 0.6397639,…
1 [0.93188065, -0.14711751, 1.1007845, -0.315147…
2 [0.93188065, -0.14711751, 1.1007845, -0.315147…
3 [0.93188065, -0.14711751, 1.1007845, -0.315147…

4 [0.93188065, -0.14711751, 1.1007845, -0.315147…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [0.20108287, 0.6029379, 0.2330216, -0.3212626,…

w2vec_embedding_11 \
0 [0.93188065, -0.14711751, 1.1007845, -0.315147…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [-0.18816459, -0.45196122, -0.21790302, 0.9887…
11910 [-0.18816459, -0.45196122, -0.21790302, 0.9887…
11911 [-0.18816459, -0.45196122, -0.21790302, 0.9887…
11912 [-0.18816459, -0.45196122, -0.21790302, 0.9887…
11913 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…

w2vec_embedding_12 \
0 [0.102402635, 0.32382426, 0.42588675, 0.177568…
1 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
2 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
3 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
4 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [-0.14079317, 0.5140112, 0.56032175, -0.192161…

w2vec_embedding_13 \
0 [-0.22948255, -0.4156468, -0.03284373, 0.07473…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [0.23303074, 0.79225045, 0.5197343, 0.6270899,…
11910 [0.23303074, 0.79225045, 0.5197343, 0.6270899,…
11911 [0.23303074, 0.79225045, 0.5197343, 0.6270899,…
11912 [0.23303074, 0.79225045, 0.5197343, 0.6270899,…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_14 \
0 [0.102402635, 0.32382426, 0.42588675, 0.177568…
1 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
2 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
3 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
4 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
… …
11909 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
11910 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
11911 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
11912 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
11913 [-0.6609878, 0.4256675, 1.4135256, 0.87129855,…

w2vec_embedding_15 \
0 [0.07896054, 0.3125886, 0.21870708, -0.3991353…
1 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
2 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
3 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
4 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
… …
11909 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
11910 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
11911 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
11912 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

w2vec_embedding_16 \
0 [-0.30040318, 0.5620006, 0.7801722, 0.9519743,…
1 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
2 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
3 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
4 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
… …
11909 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11910 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11911 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11912 [0.102402635, 0.32382426, 0.42588675, 0.177568…
11913 [0.18458737, 0.88140917, 1.2577158, 0.6808571,…

w2vec_embedding_17
0 [-0.14079317, 0.5140112, 0.56032175, -0.192161…
1 [0.102402635, 0.32382426, 0.42588675, 0.177568…
2 [0.102402635, 0.32382426, 0.42588675, 0.177568…
3 [0.102402635, 0.32382426, 0.42588675, 0.177568…
4 [0.102402635, 0.32382426, 0.42588675, 0.177568…
… …
11909 [-0.62329394, -0.27842516, 1.6441574, 0.278253…

11910 [-0.62329394, -0.27842516, 1.6441574, 0.278253…
11911 [-0.62329394, -0.27842516, 1.6441574, 0.278253…
11912 [-0.62329394, -0.27842516, 1.6441574, 0.278253…
11913 [0.102402635, 0.32382426, 0.42588675, 0.177568…

[11914 rows x 17 columns]
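The loop above stores one embedding per token position, which largely repeats the per-word vectors across rows. A more common summary (a sketch of our own, not part of the original notebook) averages all word vectors of a row into a single fixed-size vector; the trained model can also be queried directly for nearest neighbours:

import numpy as np

def sentence_vector(text):
    # mean of the Word2Vec vectors of every token in the row
    tokens = word_tokenize(text)
    return np.mean([w2vec.wv[t] for t in tokens if t in w2vec.wv], axis=0)

doc_vectors = data["text"].apply(sentence_vector)

# neighbours of the first token of the first row (guaranteed to be in the
# vocabulary because the model was trained with min_count=1)
print(w2vec.wv.most_similar(word_tokenize(data["text"][0])[0], topn=5))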

[ ]:

Untitled

February 7, 2024

[25]: import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re
import nltk

[26]: nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/bread/nltk_data…


[nltk_data] Unzipping corpora/stopwords.zip.

[26]: True

[13]: data = pd.read_pickle("News_dataset.pickle")

[17]: data = data.drop(["File_Name", "Complete_Filename", "id", "News_length"],axis=1)

[19]: data.head()

[19]: Content Category


0 Ad sales boost Time Warner profit\r\n\r\nQuart… business
1 Dollar gains on Greenspan speech\r\n\r\nThe do… business
2 Yukos unit buyer faces loan claim\r\n\r\nThe o… business
3 High fuel prices hit BA's profits\r\n\r\nBriti… business
4 Pernod takeover talk lifts Domecq\r\n\r\nShare… business

[20]: def preprocess(text: str):
    text = re.sub(r'[^a-zA-Z]', ' ', text.lower())

    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    cleaned_text = ' '.join(tokens)
    return cleaned_text
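A quick sanity check of the helper (illustrative only; it assumes the 'punkt' and 'wordnet' NLTK data downloaded in the earlier practical are still available): the function lowercases, strips non-letters, drops English stop words and lemmatizes what remains.

sample = "The dollars were gained after Greenspan's speeches in London."
print(preprocess(sample))
# roughly: 'dollar gained greenspan speech london'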

[28]: clean = pd.DataFrame()

text_columns = [col for col in data.columns]

for col in text_columns:
    clean[f'cleaned_{col}'] = data[col].apply(preprocess)

label_encoder = LabelEncoder()
clean["encoded_label"] = label_encoder.fit_transform(data["Category"])

[29]: clean

[29]: cleaned_Content cleaned_Category \


0 ad sale boost time warner profit quarterly pro… business
1 dollar gain greenspan speech dollar hit highes… business
2 yukos unit buyer face loan claim owner embattl… business
3 high fuel price hit ba profit british airway b… business
4 pernod takeover talk lift domecq share uk drin… business
… … …
2220 bt program beat dialler scam bt introducing tw… tech
2221 spam e mail tempt net shopper computer user ac… tech
2222 careful code new european directive could put … tech
2223 u cyber security chief resigns man making sure… tech
2224 losing online gaming online role playing game … tech

encoded_label
0 0
1 0
2 0
3 0
4 0
… …
2220 4
2221 4
2222 4
2223 4
2224 4

[2225 rows x 3 columns]

[30]: combined_text = pd.DataFrame()


combined_text["text"] = clean.apply(lambda row: ' - '.join(map(str,row)),axis=1)

[31]: combined_text

[31]: text
0 ad sale boost time warner profit quarterly pro…
1 dollar gain greenspan speech dollar hit highes…
2 yukos unit buyer face loan claim owner embattl…
3 high fuel price hit ba profit british airway b…
4 pernod takeover talk lift domecq share uk drin…
… …
2220 bt program beat dialler scam bt introducing tw…
2221 spam e mail tempt net shopper computer user ac…
2222 careful code new european directive could put …
2223 u cyber security chief resigns man making sure…
2224 losing online gaming online role playing game …

[2225 rows x 1 columns]

[32]: # unclean

[33]: data.Content[1]

[33]: 'Dollar gains on Greenspan speech\r\n\r\nThe dollar has hit its highest level
against the euro in almost three months after the Federal Reserve head said the
US trade deficit is set to stabilise.\r\n\r\nAnd Alan Greenspan highlighted the
US government\'s willingness to curb spending and rising household savings as
factors which may help to reduce it. In late trading in New York, the dollar
reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns
about the deficit has hit the greenback in recent months. On Friday, Federal
Reserve chairman Mr Greenspan\'s speech in London ahead of the meeting of G7
finance ministers sent the dollar higher after it had earlier tumbled on the
back of worse-than-expected US jobs data. "I think the chairman\'s taking a much
more sanguine view on the current account deficit than he\'s taken for some
time," said Robert Sinche, head of currency strategy at Bank of America in New
York. "He\'s taking a longer-term view, laying out a set of conditions under
which the current account deficit can improve this year and
next."\r\n\r\nWorries about the deficit concerns about China do, however,
remain. China\'s currency remains pegged to the dollar and the US currency\'s
sharp falls in recent months have therefore made Chinese export prices highly
competitive. But calls for a shift in Beijing\'s policy have fallen on deaf
ears, despite recent comments in a major Chinese newspaper that the "time is
ripe" for a loosening of the peg. The G7 meeting is thought unlikely to produce
any meaningful movement in Chinese policy. In the meantime, the US Federal
Reserve\'s decision on 2 February to boost interest rates by a quarter of a
point - the sixth such move in as many months - has opened up a differential
with European rates. The half-point window, some believe, could be enough to
keep US assets looking more attractive, and could help prop up the dollar. The
recent falls have partly been the result of big budget deficits, as well as the
US\'s yawning current account gap, both of which need to be funded by the buying
of US bonds and assets by foreign firms and governments. The White House will
announce its budget on Monday, and many commentators believe the deficit will
remain at close to half a trillion dollars.'

[34]: combined_text.text[1]

[34]: 'dollar gain greenspan speech dollar hit highest level euro almost three month
federal reserve head said u trade deficit set stabilise alan greenspan
highlighted u government willingness curb spending rising household saving
factor may help reduce late trading new york dollar reached euro thursday market
concern deficit hit greenback recent month friday federal reserve chairman mr
greenspan speech london ahead meeting g finance minister sent dollar higher
earlier tumbled back worse expected u job data think chairman taking much
sanguine view current account deficit taken time said robert sinche head
currency strategy bank america new york taking longer term view laying set
condition current account deficit improve year next worry deficit concern china
however remain china currency remains pegged dollar u currency sharp fall recent
month therefore made chinese export price highly competitive call shift beijing
policy fallen deaf ear despite recent comment major chinese newspaper time ripe
loosening peg g meeting thought unlikely produce meaningful movement chinese
policy meantime u federal reserve decision february boost interest rate quarter
point sixth move many month opened differential european rate half point window
believe could enough keep u asset looking attractive could help prop dollar
recent fall partly result big budget deficit well u yawning current account gap
need funded buying u bond asset foreign firm government white house announce
budget monday many commentator believe deficit remain close half trillion dollar
- business - 0'

[ ]:

[35]: # TF - IDF

[39]: tfidf_vec = TfidfVectorizer()

tf_matrix = tfidf_vec.fit_transform(combined_text["text"][:10])

[40]: tf_df = pd.DataFrame(tf_matrix.toarray(),
                           columns=tfidf_vec.get_feature_names_out())

[41]: tf_df

[41]: abandon absorbing according account accumulated accused accusing \


0 0.000000 0.000000 0.000000 0.071491 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.125417 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.000000 0.000000 0.064076 0.000000 0.000000 0.000000 0.000000

7 0.057341 0.057341 0.000000 0.085293 0.000000 0.000000 0.000000
8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000000 0.000000 0.000000 0.000000 0.071168 0.071168 0.071168

acquisition action activity … would wsj yawning \


0 0.000000 0.00000 0.00000 … 0.000000 0.000000 0.000000
1 0.000000 0.00000 0.00000 … 0.000000 0.000000 0.056211
2 0.000000 0.05877 0.00000 … 0.038860 0.000000 0.000000
3 0.000000 0.00000 0.00000 … 0.029284 0.000000 0.000000
4 0.057551 0.00000 0.00000 … 0.000000 0.057551 0.000000
5 0.000000 0.00000 0.00000 … 0.000000 0.000000 0.000000
6 0.000000 0.00000 0.00000 … 0.042369 0.000000 0.000000
7 0.000000 0.00000 0.00000 … 0.037916 0.000000 0.000000
8 0.000000 0.00000 0.06025 … 0.000000 0.000000 0.000000
9 0.000000 0.00000 0.00000 … 0.000000 0.000000 0.000000

year yen yet yield york yugansk yukos


0 0.085342 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.024953 0.000000 0.000000 0.000000 0.112422 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.352618 0.352618
3 0.117960 0.000000 0.032938 0.044288 0.000000 0.000000 0.000000
4 0.025547 0.000000 0.042802 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.090703 0.067459 0.000000 0.000000 0.000000 0.000000
6 0.056888 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7 0.050909 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.053492 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.031593 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

[10 rows x 898 columns]
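The matrix above has one column per vocabulary term, so the most characteristic words of a document can be read off its row. A small follow-up sketch (not in the original notebook) lists the ten highest-weighted TF-IDF terms of the first cleaned article:

import numpy as np

row = tf_matrix.toarray()[0]
terms = tfidf_vec.get_feature_names_out()
top_idx = np.argsort(row)[::-1][:10]
print([(terms[i], round(row[i], 3)) for i in top_idx])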

[ ]:
