
Experiment : 1

Aim : Determine noise removal for any textual data and remove regular expression
patterns such as hashtags from the textual data.

Description : Text cleaning, also called text preprocessing or data preprocessing. A text is
unstructured data and may be full of inconsistencies and ambiguity. Text preprocessing is a
method in NLP that involves cleaning text data and making it ready for model building.
A raw text (text corpus), i.e. variable text data collected from one or many sources such
as websites, spoken language and voice recognition systems, may contain various words
with wrong spellings, short words, special symbols, emojis, etc.

Program :

import string

import re

input_text="\t #sample <HTML><H1>#Greetings our college is offering computer


courses in {B.Tech} with 3 specializations & [M.Tech] with 1 specialization #computer
Science->total courses:?https://www.aec.edu.in is our college website!!! <H1>...\t"

input_text=input_text.lower()

print(input_text)

input_text=re.sub(r'\d+','',input_text)

print(input_text)

input_text=input_text.strip()

print(input_text)
html_pattern=re.compile('<.*?>')

input_text=html_pattern.sub(r'',input_text)

print(input_text)

# Intended to remove URLs; as written, this pattern does not match the sample URL
# (a working pattern would be r'https?://\S+|www\.\S+'), so the URL survives until
# punctuation removal below, as the output shows.
url_pattern=re.compile(r'https?://s+\www\s+')

input_text=url_pattern.sub(r'',input_text)

print(input_text)

hash_pattern=re.compile(r'#[a-z]+')

input_text=hash_pattern.sub(r'',input_text)

print(input_text)

for punc in string.punctuation:
    if punc in input_text:
        input_text = input_text.replace(punc, '')

print(input_text)

# r'is our\s+\s+' requires "is our" followed by at least two whitespace characters;
# there is no such match here, so the text is unchanged.
x = re.findall(r"is our\s+\s+", input_text)

input_text = re.sub(r"is our\s+\s+", "", input_text)

print(input_text)

tweet="""If you hold an empty #gaterode #battle up to your ear@@you can hear the
sports oo%%"""

x=re.sub('[a-z A-Z 0-9]+','',tweet)

print(x)
Output :

#sample <html><h1>#greetings our college is offering computer courses in {b.tech} with


3 specializations & [m.tech] with 1 specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! <h1>...

#sample <html><h>#greetings our college is offering computer courses in


{b.tech} with specializations & [m.tech] with specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! <h>...

#sample <html><h>#greetings our college is offering computer courses in {b.tech} with


specializations & [m.tech] with specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! <h>...

#sample #greetings our college is offering computer courses in {b.tech} with


specializations & [m.tech] with specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! ...

#sample #greetings our college is offering computer courses in {b.tech} with


specializations & [m.tech] with specialization #computer science->total
courses:?https://www.aec.edu.in is our college website!!! ...

our college is offering computer courses in {b.tech} with specializations & [m.tech] with
specialization science->total courses:?https://www.aec.edu.in is our college website!!! ...

our college is offering computer courses in btech with specializations mtech with
specialization sciencetotal courseshttpswwwaeceduin is our college website
our college is offering computer courses in btech with specializations mtech with
specialization sciencetotal courseshttpswwwaeceduin is our college website
##@@%%
Experiment : 2

Aim : Perform stemming and lemmatization using the Python library NLTK.

Description : Stemming and Lemmatization in Python NLTK are text normalization


techniques for Natural Language Processing. These techniques are widely used for text
preprocessing. The difference between stemming and lemmatization is that stemming is
faster as it cuts words without knowing the context, while lemmatization is slower as it
knows the context of words before processing.

Stemming is a method of normalization of words in Natural Language Processing. It is a


technique in which a set of words in a sentence are converted into a sequence to shorten
its lookup. In this method, the words having the same meaning but have some variations
according to the context or sentence are normalized.
In other words, there is one root word, but there are many variations of the same
word. For example, the root word is “eat” and its variations are “eats, eating, eaten,
and so on”. In the same way, with the help of stemming in Python, we can find the
root word of any of its variations.
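The aim also calls for lemmatization, which the recorded program below does not cover. A
minimal sketch using NLTK's WordNetLemmatizer (assuming the wordnet corpus can be
downloaded; some NLTK versions also need the omw-1.4 resource):

from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')   # lemmatizer data (one-time download)
lemmatizer = WordNetLemmatizer()
# pos='v' treats the word as a verb; the default part of speech is noun
print(lemmatizer.lemmatize("eating", pos='v'))   # -> eat
print(lemmatizer.lemmatize("eaten", pos='v'))    # -> eat
print(lemmatizer.lemmatize("feet"))              # -> foot

Unlike stemming, the lemmatizer returns dictionary words, which is why "feet" becomes "foot" rather than a truncated stem.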

Program :

from nltk.stem import PorterStemmer

from nltk.stem import SnowballStemmer

import nltk

words=["walking","swimming","computer","computing","language","natural","educatio
n","easy","irrational","relation"]
Stemmed_ps=PorterStemmer()

Stemmed_words_ps=[Stemmed_ps.stem(word)for word in words]

print("Porter stemmed words : ",Stemmed_words_ps)

Stemmer_ss=SnowballStemmer("english")

Stemmed_words_ss=[Stemmer_ss.stem(word)for word in words]

print("Snowball Stemmed words : ",Stemmed_words_ss)

Sentence="I was wonder whwn I walk in Indian roads because everybody using
computers to understand the language so they forget their mother language it is netural
because people are edicted to computer it is irriting me."

nltk.download('punkt')   # tokenizer models required by word_tokenize (newer NLTK versions may use 'punkt_tab')
token_words = nltk.word_tokenize(Sentence)

Stem_Sentence=[]

for word in token_words:
    Stem_Sentence.append(Stemmed_ps.stem(word))

print("The porter stemmed sentence is : ", Stem_Sentence)


Output :

Porter stemmed words : ['walk', 'swim', 'comput', 'comput', 'languag', 'natur', 'educ', 'easi',
'irrat', 'relat']

Snowball Stemmed words : ['walk', 'swim', 'comput', 'comput', 'languag', 'natur', 'educ',
'easi', 'irrat', 'relat']

The porter stemmed sentence is : ['I', 'wa', 'wonder', 'when', 'i', 'walk', 'in', 'indian', 'road',
'because', 'everybodi', 'use', 'compute', 'to', 'understand', 'the', 'language', 'so', 'they',
'forgot', 'their', 'mother', 'language', 'it', 'is', 'nature', 'became', 'peopl', 'are', 'edict', 'to',
'comput', 'it', 'is', 'irrit', 'me']
Experiment : 3

Aim : Demonstrate object standardization, such as replacing social media slang in a text.

Description : NLP is used for chat bots, summaries of articles or texts, language
translation, and verbal view description. NLP includes steps such as pre-processing,
entity extraction, and word frequency measurement. With noise reduction, operations are
performed on connector words such as “and, or, but”.
Object Standardization : Text data often contains words or phrases which are not
present in any standard lexical dictionary. These pieces are not recognized by search
engines and models. Examples are acronyms, hashtags with attached words, and
colloquial slang. With the help of regular expressions and manually prepared data
dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup
method to replace social media slang in a text. Other types of text preprocessing
include encoding/decoding noise removal, grammar checking, and spelling correction.
The dictionary can also cover the normalization of words from the same root, such
as normalizing “I do, I do, I will do”. Object standardization is a pre-processing
technique that can be applied to abbreviations such as “rt → retweet, dm → direct
message”.
After preprocessing, entity extraction and entity selection are performed. At this
stage, the relevant topics are extracted from the text. One of the techniques used is Latent
Dirichlet Allocation (LDA) for topic modeling.
Program :

lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', "awsm": "awesome", "luv": "love", "...": " "}

def lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

print("Message : RT this is a retweeted dm message tweet by Shivam Bansal")

print("Converted message : ",lookup_words("RT this is a retweeted dm message tweet by


Shivam Bansal"))

print(lookup_dict.keys())

print(lookup_dict.values())

Output :
Message : RT this is a retweeted dm message tweet by Shivam Bansal

Converted message : Retweet this is a retweeted direct message message tweet by Shivam Bansal

dict_keys(['rt', 'dm', 'awsm', 'luv', '...'])

dict_values(['Retweet', 'direct message', 'awesome', 'love', ' '])


Experiment : 4

Aim : Perform part of speech tagging on any textual data.

Description : One of the more powerful aspects of the NLTK module is the Part of
Speech tagging that it can do for you. This means labeling words in a sentence as nouns,
adjectives, verbs...etc. Even more impressive, it also labels by tense, and more. Here's a
list of the tags, what they mean, and some examples:
Part-of-speech (POS) tagging is just what it sounds like: the process goes through the
words in your corpus and tags them with metadata, indicating whether those words are
nouns, verbs, adjectives, etc.

Alphabetical list of part-of-speech tags used in the Penn Treebank Project (Number, Tag, Description):
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
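The full Penn Treebank tag set, with descriptions and examples, can also be inspected from
NLTK itself. A minimal sketch (assuming the 'tagsets' help resource can be downloaded; on
some NLTK versions the resource is named 'tagsets_json'):

import nltk

nltk.download('tagsets')          # descriptions of the Penn Treebank tags
nltk.help.upenn_tagset('JJ')      # prints the definition and examples for the JJ tag
nltk.help.upenn_tagset('NN.*')    # a regular expression selects a group of related tags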

The sentence tokenizer PunktSentenceTokenizer is capable of unsupervised machine
learning, so it can be trained on any body of text that you use.
Create training and testing data. The data set used is i) the State of the Union address from
2005 and ii) the State of the Union address from 2006 of President George W. Bush.
Train the Punkt tokenizer on the 2005 address.
Finish the part-of-speech tagging script by creating a function that will run through and
tag all of the parts of speech, sentence by sentence.

Program :
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

nltk.download('state_union')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()

Output :
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S",
'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'),
('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'),
('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'),
('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT',
'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney',
'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members',
'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'),
('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','),
('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'),
('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',',
','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'),
('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried',
'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'),
('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'),
('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'),
('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'),
('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'),
('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'),
('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'),
('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'),
(',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]
Experiment : 5

Aim : Implement topic modeling using Latent Dirichlet Allocation (LDA ) in python.

Description : Latent Dirichlet allocation (LDA) is a topic model that generates topics
based on word frequency from a set of documents. LDA is particularly useful for finding
reasonably accurate mixtures of topics within a given document set. It is a generative
probabilistic model that assumes each topic is a mixture over an underlying set of words,
and each document is a mixture over a set of topic probabilities.
Process of LDA:

Input : M number of documents, N number of words, K number of topics.


The model trains to output:
psi: the distribution of words for each topic k
phi: the distribution of topics for each document i
Required Python packages:
i. NLTK (Natural Language Toolkit)

ii. stop_words, a Python package containing stop words

iii. gensim, a topic modeling package containing our LDA model.

Steps involved
1) Loading data
2) Data cleaning

3) Exploratory analysis

4) Preparing data for LDA analysis

5) LDA model training

6) Analyzing LDA model results

Data Cleaning methods :


Tokenizing: converting a document to its atomic elements.
Stopping: removing meaningless words.
Stemming: merging words that are equivalent in meaning.
Constructing a document-term matrix : The result of the cleaning stage is texts, a tokenized,
stopped and stemmed list of words from a single document. We loop through all our
documents and append each one to texts, so texts becomes a list of lists, one list for each of
our original documents. To generate an LDA model, we need to understand how frequently
each term occurs within each document.
Construct a document-term matrix with a package called gensim : The Dictionary() function
traverses texts, assigning a unique integer id to each unique token while also collecting
word counts and relevant statistics. The dictionary must then be converted into a bag-of-words representation.
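As an illustration, a minimal sketch of this step with gensim (the toy texts below are an assumed example, not part of the recorded program):

from gensim import corpora

texts = [['brocolli', 'good', 'eat'], ['mother', 'drive', 'brother']]   # toy tokenized documents
dictionary = corpora.Dictionary(texts)     # maps each unique token to an integer id
print(dictionary.token2id)                 # e.g. {'brocolli': 0, 'eat': 1, 'good': 2, ...}
bow = dictionary.doc2bow(['good', 'eat', 'good'])
print(bow)                                 # list of (token id, count) pairs for that document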
Applying the LDA model : corpus is a document-term matrix and now we’re ready to
generate an LDA model: The LdaModel class is described in detail in the gensim
documentation. Parameters used in our example:
Parameters:
num_topics: required. An LDA model requires the user to determine how many topics
should be generated. Our document set is small, so we’re only asking for three topics.
id2word: required. The LdaModel class requires our previous dictionary to map ids to
strings.
passes: optional. The number of passes the model will take through the corpus. The greater the
number of passes, the more accurate the model will be, but many passes can be slow on a
very large corpus.
Examining the results : the LDA model is now stored as ldamodel, and its topics can be
inspected with the print_topic and print_topics methods.

Program :
!pip install stop_words
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood
pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to
drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
print(ldamodel.num_terms)
print(ldamodel.num_topics)
print(ldamodel.get_topics())
ldamodel.print_topics()
print(ldamodel.print_topics(num_topics=3, num_words=3))
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)
print(ldamodel.print_topics(num_topics=4, num_words=8))

Output :

32
2
[[0.08554609 0.03569528 0.06107529 0.08554756 0.03660925 0.03570221
0.01230862 0.01231141 0.03600929 0.01231128 0.01230981 0.01231116
0.0123111 0.0366259 0.03662846 0.03662638 0.08556051 0.03662615
0.03662581 0.03646108 0.0366278 0.03662746 0.01228877 0.01228839
0.01228723 0.01228927 0.01228977 0.01228833 0.01228785 0.01228687
0.03661788 0.03661777]
[ 0.01359389 0.06842649 0.01358789 0.01359227 0.01357666 0.06841886
0.04030583 0.04030274 0.0680811 0.0403029 0.04030451 0.04030303
0.04030309 0.01355834 0.01355553 0.01355782 0.01357803 0.01355808
0.01355845 0.04066184 0.01355626 0.01355663 0.04032766 0.04032807
0.04032935 0.0403271 0.04032655 0.04032813 0.04032866 0.04032975
0.01356716 0.01356728]]
[(0, '0.086*"health" + 0.086*"good" + 0.086*"brocolli"'), (1, '0.068*"brother" +
0.068*"mother" + 0.068*"drive"')]

[(0, '0.135*"health" + 0.052*"expert" + 0.052*"may" + 0.052*"suggest" + 0.052*"caus"


+ 0.052*"tension" + 0.052*"increas" + 0.052*"blood"'), (1, '0.063*"drive" +
0.063*"pressur" + 0.062*"never" + 0.062*"seem" + 0.062*"often" + 0.062*"well" +
0.062*"feel" + 0.062*"better"'), (2, '0.031*"drive" + 0.031*"brocolli" + 0.031*"good" +
0.031*"brother" + 0.031*"mother" + 0.031*"profession" + 0.031*"say" +
0.031*"pressur"'), (3, '0.087*"brocolli" + 0.087*"good" + 0.087*"mother" +
0.087*"brother" + 0.087*"eat" + 0.048*"like" + 0.048*"around" + 0.048*"basebal"')]
Experiment : 6

Aim : Demonstrate Term Frequency – Inverse Document Frequency (TF – IDF) using
python

Description : TF-IDF is a widely used statistical method in natural language processing and
information retrieval. It measures how important a term is within a document relative to a
collection of documents. Words within a text document are transformed into importance numbers
by a text vectorization process. TF-IDF vectorizes/scores a word by multiplying the word's Term
Frequency (TF) with the Inverse Document Frequency (IDF).
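In symbols, using the same base-10 logarithm as the program below:

tfidf(w, d) = tf(w, d) * idf(w)
tf(w, d) = (number of times w appears in document d) / (total number of words in d)
idf(w) = log10(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w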

TF-IDF is useful in many natural language processing applications:


1. Search engines use it to rank the relevance of a document for a query.
2. Text classification, text summarization, and topic modelling.
Example
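For instance (an illustrative, assumed calculation): if a term appears 3 times in a 100-word
document and occurs in 1 of the 3 documents in the corpus, then tf = 3/100 = 0.03,
idf = log10(3/1) ≈ 0.477, and tf-idf ≈ 0.03 × 0.477 ≈ 0.014.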

Program :
import pandas as pd
import numpy as np
corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))
print('Number of words in the corpus:',len(words_set))
print('The words in the corpus: \n', words_set)
columns_set = tuple(words_set)
n_docs = len(corpus)
n_words_set = len(words_set)
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=columns_set)

for i in range(n_docs):
    words = corpus[i].split(' ')
    for w in words:
        df_tf[w][i] = df_tf[w][i] + (1 / len(words))
print("IDF of: ")
idf = {}
for w in words_set:
    k = 0    # number of documents in the corpus that contain the word w
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
    idf[w] = np.log10(n_docs / k)
    print(f'{w:>15}: {idf[w]:>10}')
df_tf_idf = df_tf.copy()
for w in words_set:
    for i in range(n_docs):
        df_tf_idf[w][i] = df_tf[w][i] * idf[w]
df_tf_idf

Output :
Experiment : 7

Aim : Demonstrate word embeddings using word2vec

Description : Word embedding is one of the most important techniques in natural


language processing(NLP), where words are mapped to vectors of real numbers. Word
embedding is capable of capturing the meaning of a word in a document, semantic and
syntactic similarity, relation with other words. It also has been widely used for
recommender systems and text classification. Word2vec is one of the most popular
techniques for learning word embeddings using a two-layer neural network. Its input is a text
corpus and its output is a set of vectors. Word embedding via word2vec can make natural
language computer-readable, then further implementation of mathematical operations on
words can be used to detect their similarities. A well-trained set of word vectors will place
similar words close to each other in that space. For instance, the words women, men, and
human might cluster in one corner, while yellow, red and blue cluster together in another.

gensim is an open source Python library for natural language processing; it was
developed and is maintained by the Czech natural language processing researcher Radim Řehůřek.
The gensim library enables us to develop word embeddings by training our own word2vec
models on a custom corpus with either the CBOW or skip-gram algorithm.

Train the gensim word2vec model:


model = Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)

(In gensim 4.x and later, the size parameter is named vector_size.)

The hyperparameters of this model.


size: The number of dimensions of the embeddings and the default is 100.
window: The maximum distance between a target word and words around the target word.
The default window is 5.
min_count: The minimum count of words to consider when training the model; words
with fewer occurrences than this count will be ignored. The default for min_count is 5.
workers: The number of worker threads used during training; the default is 3.
sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training
algorithm is CBOW.
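A minimal sketch of the same call against the gensim 4.x API (the small corpus sent below is an assumed example):

from gensim.models import Word2Vec

# sent is a list of tokenized sentences
sent = [['hello', 'world'], ['word', 'embeddings', 'with', 'word2vec']]
model = Word2Vec(sent, min_count=1, vector_size=50, workers=3, window=3, sg=1)
print(model.wv['hello'])                 # the 50-dimensional vector learned for 'hello'
print(model.wv.most_similar('hello'))    # words closest to 'hello' in the embedding space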

After training the word2vec model, obtain the word embedding from the training model.
Finally print the model.
Program :
from gensim.models import Word2Vec
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
model = Word2Vec(sentences, min_count=1)
print(model)
words = list(model.wv.key_to_index)
print(words)
print(model.wv['is'])
model.save('model.bin')
new_model = Word2Vec.load('model.bin')
print(new_model)

Output :
Experiment : 8

Aim : Implement text classification using a Naive Bayes classifier and the TextBlob library.

Description : Text classifiers are systems that classify your texts and divide them into
different classes.

TextBlob is a Python library for processing textual data. It provides a consistent API for
diving into common natural language processing (NLP) tasks such as part-of-speech
tagging, noun phrase extraction, sentiment analysis, and more.

Step-1 Install textblob.


Step-2 Download the data files that textblob uses for its functionality and for nltk (a sketch of these two steps follows this list).
Step-3 Train the classifier based on the Naive Bayes Classifier.
Step-4 Test the data using the classifier to get your text classified.
Step-5 Calculate the accuracy of the classifier.
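A minimal sketch of Step-1 and Step-2 using TextBlob's documented corpora downloader
(notebook-style commands, matching the !pip usage in the program below):

!pip install textblob
!python -m textblob.download_corpora   # downloads the NLTK corpora that TextBlob relies on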

Program :
!pip install textblob
!pip3 install textblob
import nltk
nltk.download('punkt')
train = [
    ('What an amazing weather.', 'pos'), ('this is an amazing idea!', 'pos'),
    ('I feel very good about these ideas.', 'pos'), ('this is my best performance.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this place', 'neg'), ('I am tired of this stuff.', 'neg'),
    ("I can't deal with all this tension", 'neg'), ('he is my sworn enemy!', 'neg'),
    ('my friends is horrible.', 'neg')
]
test = [
    ('the food was great.', 'pos'),
    ('I do not want to live anymore', 'neg'), ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Ramesh is a friend of mine.', 'pos'), ("I can't believe I'm doing this.", 'neg')
]
from textblob.classifiers import NaiveBayesClassifier
cl=NaiveBayesClassifier(train)
print(cl.classify("Thisis an amazing library!"))
print(cl.accuracy(test))
print(cl.classify("my friends is tension"))
print(cl.accuracy(test))
cl.show_informative_features(4)

Output :
Experiment : 9

Aim : Apply support vector machine for text classification.

Description : Support Vector Machine (SVM) is a supervised machine learning algorithm


that can be used for both classification and regression challenges. However, it is mostly used in
classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional
space (where n is the number of features you have), with the value of each feature being the value
of a particular coordinate. Then, we perform classification by finding the hyper-plane that
differentiates the two classes very well. Support Vectors are the coordinates of individual
observations. The SVM classifier is a frontier that best segregates the two classes (hyper-plane/
line).

Program :
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
svc = svm.SVC(kernel='linear', C=1,gamma='auto').fit(X, y)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1


y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min)/100   # mesh step size (note: (x_max - x_min)/100 is the more usual choice)
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()

svc = svm.SVC(kernel='rbf', C=1,gamma='auto').fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)


plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with RBF kernel')
plt.show()

Output :
Experiment : 10

Aim : Convert text to vectors (using term frequency) and apply cosine similarity to
provide closeness among two texts.

Description : Cosine similarity is a measure of similarity between two non-zero vectors of an


inner product space that “measures the cosine of the angle between them”. Cosine similarity tends
to determine how similar two words or sentences are. It can be used for sentiment analysis and text
comparison, and it is used by many popular packages out there, such as word2vec.

The dot product is also called the scalar product because the dot product of two vectors gives a scalar result.
For example,

Vector(A) = [5,0,2] Vector(B) = [2,5,0]

The dot product Vector(A) DOT Vector(B) = 5*2 + 0*5 + 2*0 = 10 + 0 + 0 = 10. The smaller the angle
between two documents, the more similar they are, and the cosine of the angle increases as the angle
decreases, since cos 0 = 1 and cos 90 = 0.
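Completing the example (a minimal sketch, assuming the two vectors above):

import numpy as np

A = np.array([5, 0, 2])
B = np.array([2, 5, 0])
# cosine similarity = dot(A, B) / (|A| * |B|)
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)   # 10 / (sqrt(29) * sqrt(29)) = 10/29, approximately 0.345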

The first step is to calculate the cosine similarity between the documents. Convert the
documents/sentences/words into feature vectors first. Useful methods for feature
extraction are i) Bag of Words and ii) TF-IDF.

Bag of Words counts the unique words in the documents and the frequency of each word. Scikit-learn's
CountVectorizer extracts the Bag of Words features.

The TF-IDF score of a word ranks its importance in a document: the tfidf score of a word w =
tf(w)*idf(w), where tf(w) = number of times the word appears in a document / total number of words in
the document, and idf(w) = number of documents / number of documents that contain the word w. Use the
scikit-learn cosine_similarity function to compare the first document, i.e. Document 0, with the other
documents in the corpus.

The cosine similarities of Document 0 compared with the other documents in the corpus: the first
element in the array is 1, which means Document 0 is compared with Document 0 itself, and the remaining
elements 0.08619387, 0, 0 are the similarities of Document 0 with Documents 1, 2 and 3.
Program :
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
count_vect = CountVectorizer()
Document1= "Aditya Engineering College situated at Surampalem"
Document2= "Engineering Colleges offer computer science courses in MCA AIML CSE
IT departments"
Document3= "Computer science students have opprtunities in IT sector"
Document4= "IT sector hire students with skills in computer science"
corpus = [Document1,Document2,Document3,Document4]
X_train_counts = count_vect.fit_transform(corpus)
df1 = pd.DataFrame(X_train_counts.toarray(), columns=count_vect.get_feature_names_out(), index=['Document 0', 'Document 1', 'Document 2', 'Document 3'])
vectorizer = TfidfVectorizer()
trsfm=vectorizer.fit_transform(corpus)
df2 = pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names_out(), index=['Document 0', 'Document 1', 'Document 2', 'Document 3'])
print(df1)
print(df2)
print(cosine_similarity(trsfm[0:1], trsfm))   # Document 0 compared with every document in the corpus
Output :
