NLP Record
List of Experiments
1. Demonstrate noise removal for any textual data and remove regular expression patterns such as hashtags from textual data.
Noise Removal: Any piece of text which is not relevant to the context of the data and the end output can be considered noise.
For example: language stopwords (commonly used words of a language such as is, am, the, of, in, etc.), URLs or links, social media entities (mentions, hashtags), punctuation and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate the text object
by tokens (or by words), eliminating those tokens which are present in the noise dictionary.
# Sample code to remove noisy words from a text
noise_list = ["is", "a", "this", "..."]

def _remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

print(_remove_noise("this is a book"))
Output: book
Another approach is to use regular expressions when dealing with special patterns of noise.
The following Python code removes a regex pattern from the input text:
import re

def remove_regex_pattern(text, pattern):
    """Removes a regex pattern from the input text.

    Args:
        text: The input text.
        pattern: The regex pattern to remove.

    Returns:
        The input text with the regex pattern removed.
    """
    return re.sub(pattern, "", text)
# Example usage:
text = "welcome to (.*) CIST"
pattern = r"\(.*\)"
# Remove the regex pattern from the text.
text_without_regex_pattern = remove_regex_pattern(text, pattern)
# Print the text without the regex pattern.
print(text_without_regex_pattern)
Output (note the extra space left where the matched pattern was removed):
welcome to  CIST
The following is an example of how to remove a hashtag pattern (such as #tag) from the input text in Python:
import re

def remove_hashtag(text):
    """Removes hashtags from the input text.

    Args:
        text: The input text.

    Returns:
        The text with the hashtags removed.
    """
    pattern = re.compile(r'#\w+')
    return pattern.sub('', text)
# Example usage:
text = 'This is a tweet with #hashtag.'
print(remove_hashtag(text))
Output:
This is a tweet with .
2. Demonstrate stemming and lemmatization of text using the NLTK library.
Stemming: Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

def perform_lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(lemmatized_words)

def perform_stemming(text):
    stemmer = PorterStemmer()
    tokens = word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed_words)
# Example text
sample_text = "The quick brown foxes are jumping over the lazy dogs"
# Lemmatization
lemmatized_text = perform_lemmatization(sample_text)
print("Lemmatized Text:", lemmatized_text)
# Stemming
stemmed_text = perform_stemming(sample_text)
print("Stemmed Text:", stemmed_text)
Output:
Lemmatized Text: The quick brown fox are jumping over the lazy dog
Stemmed Text: the quick brown fox are jump over the lazi dog
Output (for the input "They are playing well"):
Lemmatized Text: They are playing well
Stemmed Text: they are play well
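With its default settings, WordNetLemmatizer treats every token as a noun, which is why words such as "are", "jumping" and "playing" are left unchanged above. The following is a minimal sketch of passing an explicit part-of-speech argument to the same lemmatizer; the expected outputs in the comments assume the standard WordNet data:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# pos='v' tells WordNet to look the word up as a verb instead of a noun.
print(lemmatizer.lemmatize("are", pos="v"))      # be
print(lemmatizer.lemmatize("jumping", pos="v"))  # jump
print(lemmatizer.lemmatize("playing", pos="v"))  # play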
3. Demonstrate object standardization, such as replacing social media slang in a text.
Text data often contains words or phrases which are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.
Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup method to replace social media slang in a text.
Program:
import nltk

# Keys are lower-cased so that the case-insensitive lookup below works.
lookup_dict = {'ec': 'European Commission', 'eu': 'European Union',
               'ecsc': 'European Coal and Steel Community',
               'eec': 'European Economic Community'}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text
_lookup_words("The High Authority was the supranational administrative executive of the new "
              "European Coal and Steel Community ECSC. It took office first on 10 August 1952 in "
              "Luxembourg. In 1958, the Treaties of Rome had established two new communities "
              "alongside the ECSC: the eec and the European Atomic Energy "
              "Community (Euratom). However their executives were called Commissions rather than High Authorities")
Output: the input text is printed with "eec" expanded to "European Economic Community"; tokens with attached punctuation (such as "ECSC." and "ECSC:") are not matched by this simple whitespace-split lookup.
import nltk

lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome', 'luv': 'love', 'pls': 'please'}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text
_lookup_words("RT this is a retweeted tweet by Devi")
_lookup_words(" She is a awsm student")
_lookup_words("pls send the material")
Output:
Retweet this is a retweeted tweet by Devi
She is a awesome student
please send the material
4. Demonstrate part of speech (POS) tagging and its applications.
Part of speech tags are useful in several ways:
A. Word sense disambiguation: Some words have multiple meanings according to their usage, for example "book" in "book my flight" versus "I will read this book". "Book" is used in different contexts, and the part of speech tag for the two cases is different: in the first phrase "book" is used as a verb, while in the second it is used as a noun. (The Lesk algorithm is also used for similar purposes.)
B. Improving word-based features: A learning model could learn different contexts of a word when words are used as features; however, if the part of speech tag is linked with them, the context is preserved, thus making stronger features (a short POS-tagging sketch follows this list). For example:
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1),
(“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
C. Normalization and Lemmatization: POS tags are the basis of lemmatization process for converting a
word to its base form (lemma).
D. Efficient stopword removal: POS tags are also useful in the efficient removal of stopwords.
For example, there are some tags which always define the low-frequency / less important words of a language, such as (IN – "within", "upon", "except"), (CD – "one", "two", "hundred"), (MD – "may", "must", etc.).
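The following is a minimal sketch of POS tagging with NLTK that builds the word_TAG style features shown in item B (the resource names in the download calls may differ slightly between NLTK releases, and the exact tags depend on the pre-trained tagger):
import nltk
from nltk.tokenize import word_tokenize

# One-time downloads; on newer NLTK releases the tagger resource may be named
# 'averaged_perceptron_tagger_eng' and the tokenizer data 'punkt_tab'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "book my flight, I will read this book"
tokens = word_tokenize(text)

# nltk.pos_tag returns (token, Penn Treebank tag) pairs, e.g. ('flight', 'NN')
tagged = nltk.pos_tag(tokens)

# Attach the tag to each word so the verb and noun uses of "book" stay distinct.
pos_features = [f"{word}_{tag}" for word, tag in tagged if word.isalpha()]
print(pos_features)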
5. Implement topic modeling using Latent Dirichlet Allocation (LDA) in python.
Introduction
Imagine walking into a bookstore to buy a book on world economics and not being able to figure out the section of the store that has this book, assuming the bookstore has simply stacked all types of books together. You then realize how important it is to divide the bookstore into different sections based on the type of book.
Topic Modelling is similar to dividing a bookstore based on the content of the books, as it refers to the process of discovering themes in a text corpus and annotating the documents based on the identified topics.
When you need to segment, understand, and summarize a large collection of documents, topic modelling can be useful.
Reference: https://thinkinfi.com/latent-dirichlet-allocation-for-beginners-a-high-level-overview/
Latent Dirichlet Allocation (LDA) is one of the ways to implement Topic Modelling. It is a generative probabilistic model in which each document is assumed to be a mixture of a small number of topics.
How does the LDA algorithm work?
The following steps are carried out in LDA to assign topics to each of the documents:
1) For each document, randomly initialize each word to one of the K topics, where K is the number of topics chosen beforehand.
2) For each document d and each word w, compute:
P(topic t | document d): the proportion of words in document d that are assigned to topic t
P(word w | topic t): the proportion of assignments to topic t, across all documents, that come from word w
3) Reassign topic t' to word w with probability p(t'|d)*p(w|t'), considering all other words and their topic assignments.
The last step is repeated multiple times till we reach a steady state where the topic assignments do not
change further. The proportion of topics for each document is then determined from these topic
assignments.
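For instance, in step 3, if 30% of the words in document d are currently assigned to topic t' (p(t'|d) = 0.3) and word w accounts for 10% of all assignments to topic t' (p(w|t') = 0.1), then w is reassigned to t' with probability proportional to 0.3 × 0.1 = 0.03; the topic with the highest such product is the most likely new assignment for w.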
Let us say that we have 4 documents as the corpus and we wish to carry out topic modelling on them.
LDA modelling helps us in discovering topics in such a corpus and assigning topic mixtures for each of the documents. As an example, the model might indicate that Documents 1 and 2 belong 100% to Topic 1, while Document 3 belongs 100% to Topic 2.
This assignment of topics to documents is carried out by LDA modelling using the steps that we discussed in the previous section. Let us now apply LDA to some text data and analyze the actual outputs in Python.
We have taken the ‘Amazon Fine Food Reviews’ data from Kaggle.
We start by importing the Pandas library to read the CSV and save it in a data frame.
Python Code:
import pandas as pd
import nltk
rev = pd.read_csv(r"Reviews.csv")
print(rev.head())
We are interested in carrying out topic modelling for the ‘Text’ column in this dataset.
Additionally, we would also remove the stop-words before carrying out the LDA. To carry out topic
modelling, we need to convert our text column into a vectorized form and therefore we import the
TfidfVectorizer.
import nltk
from nltk.corpus import stopwords                            # stopword list
from nltk.stem import WordNetLemmatizer                      # lemmatizer
from nltk.tokenize import word_tokenize                      # tokenizer used in clean_text below
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
We will apply lemmatization to the words so that the root words of all derived words are used.
Furthermore, the stop-words are removed and words with lengths greater than 3 are used.
def clean_text(headline):
    le = WordNetLemmatizer()
    word_tokens = word_tokenize(headline)
    tokens = [le.lemmatize(w) for w in word_tokens if w not in stop_words and len(w) > 3]
    cleaned_text = " ".join(tokens)
    return cleaned_text

rev['cleaned_text'] = rev['Text'].apply(clean_text)
Carrying out a TF-IDF vectorization on the text column gives us a document-term matrix on which we can carry out the topic modelling. TF-IDF stands for Term Frequency – Inverse Document Frequency: this vectorization compares the number of times a word appears in a document with the number of documents that contain the word.
vect = TfidfVectorizer(stop_words=stop_words, max_features=1000)
vect_text = vect.fit_transform(rev['cleaned_text'])
The parameters that we have given to the LDA model, as shown below, include the number of topics, the learning method (which is the way the algorithm updates the assignment of topics to the documents), the maximum number of iterations to be carried out, and the random state.
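A minimal sketch of fitting the LDA model, assuming scikit-learn's LatentDirichletAllocation and the vect_text matrix produced above (the parameter values themselves are illustrative):
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative choices: 10 topics, online variational Bayes updates,
# a fixed number of iterations and a fixed random state for reproducibility.
lda_model = LatentDirichletAllocation(n_components=10,
                                      learning_method='online',
                                      max_iter=10,
                                      random_state=42)

# Each row of lda_top gives the topic proportions for one document.
lda_top = lda_model.fit_transform(vect_text)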
We can check the proportion of topics that have been assigned to the first document using the lines of code below:
print("Document 0: ")
for i, topic in enumerate(lda_top[0]):
    print("Topic ", i, ": ", topic*100, "%")
The ten most important words for each topic can then be printed as follows:
vocab = vect.get_feature_names_out()   # on older scikit-learn versions: vect.get_feature_names()
for i, comp in enumerate(lda_model.components_):
    vocab_comp = zip(vocab, comp)
    sorted_words = sorted(vocab_comp, key=lambda x: x[1], reverse=True)[:10]
    print("Topic " + str(i) + ": ")
    for t in sorted_words:
        print(t[0], end=" ")
    print("\n")
In addition to LDA, other algorithms can be leveraged to carry out topic modelling. Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF) are some of the other algorithms one could try (a short NMF sketch is given below). All these algorithms, like LDA, involve feature extraction from document-term matrices and generating a group of terms that are differentiating from each other, which eventually leads to the creation of topics. These topics can help in assessing the main themes of a corpus of documents.
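A minimal sketch of the NMF alternative, assuming scikit-learn's NMF and reusing the vect_text matrix and vocab list from the LDA example above:
from sklearn.decomposition import NMF

# Illustrative: factorize the TF-IDF matrix into 10 topic components.
nmf_model = NMF(n_components=10, random_state=42)
nmf_top = nmf_model.fit_transform(vect_text)

# Top 10 words per NMF topic, analogous to the LDA loop above.
for i, comp in enumerate(nmf_model.components_):
    top_terms = sorted(zip(vocab, comp), key=lambda x: x[1], reverse=True)[:10]
    print("Topic " + str(i) + ": " + " ".join(t[0] for t in top_terms))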
6. Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using python.
7. Demonstrate word embeddings using word2vec.
8. Implement text classification using the Naïve Bayes classifier and the TextBlob library.
9. Apply support vector machine for text classification.
10. Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness between two texts.
11. Case study 1: Identify the sentiment of tweets.
In this problem, you are provided with tweet data and have to predict netizens' sentiment towards electronic products.
12. Case study 2: Detect hate speech in tweets.
The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to separate racist or sexist tweets from other tweets.