IR Manual
Assignment 1
Problem Statement:
Write a program for pre-processing of a text document such as stop word removal, stemming.
Objective:
To understand the concepts of information retrieval and web mining
Theory:
Text data derived from natural language is unstructured and noisy. Text preprocessing involves
transforming text into a clean and consistent format that can then be fed into a model for further analysis
and learning.
Text preprocessing techniques may be general so that they are applicable to many types of applications,
or they can be specialized for a specific task. For example, the methods for processing scientific
documents with equations and other mathematical symbols can be quite different from those for dealing
with user comments on social media.
However, some steps, such as sentence segmentation, tokenization, spelling corrections, and stemming,
are common to both.
Here's what you need to know about text preprocessing to improve your natural language processing
(NLP).
A natural language processing system for textual data reads, processes, analyzes, and interprets text. As
a first step, the system preprocesses the text into a more structured format using several different stages.
The output from one stage becomes an input for the next—hence the name “preprocessing pipeline.”
An NLP pipeline for document classification might include steps such as sentence segmentation, word
tokenization, lowercasing, stemming or lemmatization, stop word removal, and spelling
correction. Some or all of these commonly used text preprocessing stages are used in typical NLP
systems, although the order can vary depending on the application.
Segmentation
Segmentation involves breaking up text into corresponding sentences. While this may seem like a trivial
task, it has a few challenges. For example, in the English language, a period normally indicates the end
of a sentence, but many abbreviations, including “Inc.,” “Calif.,” “Mr.,” and “Ms.,” and all fractional
numbers contain periods and introduce uncertainty unless the end-of-sentence rules accommodate those
exceptions.
Tokenization
The tokenization stage involves converting a sentence into a stream of words, also called “tokens.”
Tokens are the basic building blocks upon which analysis and other methods are built.
Many NLP toolkits allow users to input multiple criteria based on which word boundaries are
determined. For example, you can use a whitespace or punctuation to determine if one word has ended
and the next one has started. Again, in some instances, these rules might fail. For example, don’t, it’s,
etc. are words themselves that contain punctuation marks and have to be dealt with separately.
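As a minimal sketch of tokenization, the following uses NLTK's word_tokenize (assuming the nltk package is installed and the 'punkt' tokenizer data has been downloaded):
import nltk
nltk.download('punkt')                    # tokenizer models (one-time download)
from nltk.tokenize import word_tokenize

sentence = "Don't hesitate, it's easy to tokenize text."
tokens = word_tokenize(sentence)
print(tokens)
# e.g. ['Do', "n't", 'hesitate', ',', 'it', "'s", 'easy', 'to', 'tokenize', 'text', '.']
Notice how the tokenizer handles contractions such as don't and it's instead of splitting on whitespace alone.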
Change Case
Changing the case involves converting all text to lowercase or uppercase so that all word strings follow
a consistent format. Lowercasing is the more frequent choice in NLP software.
Spell Correction
Many NLP applications include a step to correct the spelling of all words in the text.
Stop-Words Removal
"Stop words" are frequently occurring words used to construct sentences. In the English language, stop
words include is, the, are, of, in, and and. For some NLP applications, such as document categorization,
sentiment analysis, and spam filtering, these words are redundant, and so are removed at the
preprocessing stage.
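A minimal stop-word removal sketch using NLTK's built-in English stop word list (the example sentence is illustrative):
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is an example of stop word removal in the English language."
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text.lower())
filtered = [w for w in tokens if w not in stop_words]
print(filtered)
# e.g. ['example', 'stop', 'word', 'removal', 'english', 'language', '.']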
Stemming
The term word stem is borrowed from linguistics and used to refer to the base or root form of a word.
For example, learn is a base word for its variants such as learn, learns, learning, and learned.
Stemming is the process of converting all words to their base form, or stem. Normally, a lookup table is
used to find the word and its corresponding stem. Many search engines apply stemming for retrieving
documents that match user queries. Stemming is also used at the preprocessing stage for applications
such as emotion identification and text classification.
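A minimal stemming sketch using NLTK's PorterStemmer (a rule-based stemmer) on the variants mentioned above:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['learn', 'learns', 'learning', 'learned']:
    print(word, '->', stemmer.stem(word))
# All four variants reduce to the stem 'learn'.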
Lemmatization
Lemmatization is a more advanced form of stemming and involves converting all words to their
corresponding root form, called “lemma.” While stemming reduces all words to their stem via a lookup
table, it does not employ any knowledge of the parts of speech or the context of the word. This means
stemming can’t distinguish which meaning of the word right is intended in the sentences “Please turn
right at the next light” and “She is always right.”
The stemmer would stem right to right in both sentences; the lemmatizer would treat right differently
based upon its usage in the two phrases.
A lemmatizer also converts different word forms or inflections to a standard form. For example, it
would convert less to little, wrote to write, slept to sleep, etc.
A lemmatizer works with more rules of the language and contextual information than does a stemmer. It
also relies on a dictionary to look up matching words. Because of that, it requires more processing
power and time than a stemmer to generate output. For these reasons, some NLP applications only use a
stemmer and not a lemmatizer.
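A minimal lemmatization sketch using NLTK's WordNetLemmatizer; note that, unlike a stemmer, the lemmatizer needs a part-of-speech hint (the pos argument) to resolve the correct lemma, and it assumes the WordNet data has been downloaded:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('wrote', pos='v'))    # -> 'write'
print(lemmatizer.lemmatize('slept', pos='v'))    # -> 'sleep'
print(lemmatizer.lemmatize('better', pos='a'))   # -> 'good'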
Text Normalization
Text normalization is the preprocessing stage that converts text to a canonical representation. A
common application is the processing of social media posts, where input text is shortened or words are
spelled in different ways. For example, hello might be written as hellooo or something might appear as
smth, and different people might choose to write real time, real-time, or realtime. Text normalization
cleans the text and ideally replaces all words with their corresponding canonical representation. In the
last example, all three forms would be converted to realtime. Many text normalization stages also
replace emojis in text with a corresponding word. For example, :-) is replaced by happy face.
One of the more advanced text preprocessing techniques is parts of speech (POS) tagging. This step
augments the input text with additional information about the sentence’s grammatical structure. Each
word is, therefore, inserted into one of the predefined categories such as a noun, verb, adjective, etc.
This step is also sometimes referred to as grammatical tagging.
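A minimal POS tagging sketch using NLTK's averaged perceptron tagger (the exact tags can vary slightly between tagger versions):
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag

tokens = word_tokenize("Please turn right at the next light.")
print(pos_tag(tokens))
# prints a list of (word, tag) pairs, e.g. ('turn', 'VB'), ('right', 'RB'), ('light', 'NN')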
Conclusion:
By using the above steps, we have successfully performed pre-processing of a text document, including stop word removal and stemming.
Oral Questions
1. What are the different NLTK modules used for text preprocessing?
2. How do you remove stop words from a file?
3. What is meant by stemming?
4. What is meant by lemmatization?
Assignment 2
Problem Statement:
Implement a program for retrieval of documents using inverted files.
Objective:
1. To evaluate and analyse retrieved information.
2. To study indexing, inverted files, and searching with the help of an inverted file.
Theory:
An inverted index is an index data structure that stores a mapping from content, such as words or
numbers, to its locations in a document or a set of documents. In simple words, it is a HashMap-like
data structure that directs you from a word to a document or a web page.
We will create a word-level inverted index, that is, it will return the list of lines in which a word is
present. We will create a dictionary whose keys are the words present in the file and whose values are
lists of the line numbers in which those words appear. To create a file in a Jupyter notebook, use the
magic function:
%%writefile file.txt
This is the first word.
This is the second text, Hello! How are you?
This is the third, this is it now.
This will create a file named file.txt with the content shown above.
To read the file:
Python3
# open the file and read its contents
file = open('file.txt', encoding='utf8')
read = file.read()
file.seek(0)
read

# obtain the number of lines in the file by counting newline characters
line = 1
for word in read:
    if word == '\n':
        line += 1

# create a list in which each line of the file is an element
array = []
for i in range(line):
    array.append(file.readline())
array
Functions used:
open(): used to open the file.
read(): used to read the content of the file.
seek(0): returns the cursor to the beginning of the file.
Remove punctuation and convert the text to lowercase:
Python3
import string

# remove punctuation characters, replacing them with spaces
for ele in read:
    if ele in string.punctuation:
        read = read.replace(ele, ' ')

# convert to lowercase to maintain uniformity
read = read.lower()
read
Apply linguistic preprocessing by converting each word in the sentences into tokens. Tokenizing the
sentences helps with creating the terms for the upcoming indexing operation.
Python3
def tokenize_words(file_contents):
    """
    Tokenize each line of the file into words.

    Parameters
    ----------
    file_contents : list
        List of lines read from the file.

    Returns
    -------
    list
        A list of token lists, one per line.
    """
    result = []
    for i in range(len(file_contents)):
        tokenized = file_contents[i].split()
        result.append(tokenized)
    return result
Stop words are those words that have no emotions associated with them and can safely be ignored
without sacrificing the meaning of the sentence.
Python3
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text_tokens = word_tokenize(read)
tokens_without_sw = [
    word for word in text_tokens if word not in stopwords.words('english')]
print(tokens_without_sw)
Now build the inverted index by mapping each remaining token to the line numbers in which it appears:
Python3
# dictionary mapping each token to the list of line numbers containing it
inverted_index = {}
for i in range(line):
    check = array[i].lower()
    for item in tokens_without_sw:
        if item in check:
            if item not in inverted_index:
                inverted_index[item] = []
            inverted_index[item].append(i + 1)
inverted_index
Output:
{'first': [1],
'word': [1],
'second': [2],
'text': [2],
'hello': [2],
'third': [3]}
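Once the index is built, retrieval is a simple dictionary lookup. A short usage sketch with the inverted_index dictionary constructed above:
Python3
# posting list for a single term
print(inverted_index.get('hello', []))           # -> [2]

# a two-term AND query: intersect the posting lists
hits = set(inverted_index.get('first', [])) & set(inverted_index.get('word', []))
print(sorted(hits))                              # -> [1]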
Conclusion:
In this way, we can perform retrieval of documents using inverted files.
Oral Questions:
1. What is meant by an inverted index?
2. What are the steps for creating an inverted index?
3. What are the built-in functions used for index creation?
Assignment 3
Problem Statement:
Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set. (You can use
Java/Python ML library classes/API.)
Objective:
1. To evaluate and analyse retrieved information.
2. To study the Bayesian network model.
Theory:
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
A Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions.
The goal is to calculate the posterior conditional probability distribution of each of the possible
unobserved causes given the observed evidence, i.e., P [Cause | Evidence].
Data Set:
The Cleveland database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In
particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “Heartdisease”
field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303
Attribute Information:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
1. Value 1: typical angina
2. Value 2: atypical angina
3. Value 3: non-anginal pain
4. Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholestoral in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
1. Value 0: normal
2. Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or
depression of > 0.05 mV)
3. Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
1. Value 1: upsloping
2. Value 2: flat
3. Value 3: downsloping
12. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
13. Heartdisease: It is integer valued from 0 (no presence) to 4.
The columns of the Heart Disease dataset are: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang,
oldpeak, slope, ca, thal, heartdisease.
Python Program to Implement and Demonstrate a Bayesian Network using the pgmpy Machine
Learning Library
import numpy as np
import pandas as pd
from pgmpy.models import BayesianNetwork          # called BayesianModel in older pgmpy versions
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?', np.nan)
print(heartDisease.head())
print(heartDisease.dtypes)

# Define the network structure (an illustrative set of edges; adjust as required)
model = BayesianNetwork([('age', 'heartdisease'), ('sex', 'heartdisease'),
                         ('exang', 'heartdisease'), ('cp', 'heartdisease'),
                         ('heartdisease', 'restecg'), ('heartdisease', 'chol')])

# Learn the conditional probability distributions from the data
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

# Perform inference by variable elimination
HeartDiseasetest_infer = VariableElimination(model)
q1 = HeartDiseasetest_infer.query(variables=['heartdisease'], evidence={'restecg': 1})
print(q1)
q2 = HeartDiseasetest_infer.query(variables=['heartdisease'], evidence={'cp': 2})
print(q2)
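As a quick sanity check (a sketch, not part of the original program), pgmpy also lets you inspect the learned conditional probability tables and validate the model:
print(model.get_cpds('heartdisease'))   # CPT learned for the heartdisease node
print(model.check_model())              # True if the CPDs are consistent with the graph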
Conclusion:
In this way, we have successfully constructed a Bayesian network from medical data and used the
model to demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set.
Oral Questions:
Assignment 4
Problem Statement:
Implement e-mail spam filtering using text classification algorithm with appropriate dataset.
Objective:
1. To understand the basic concepts of spam filtering.
2. To study the KNN algorithm.
Theory:
In the new era of technical advancement, electronic mail (email) has gathered a significant user base
for professional, commercial, and personal communications. In 2019, on average, every person
received 130 emails each day, and overall, 296 billion emails were sent that year.
Because of the high demand and huge user base, there is an upsurge in unwanted emails,
also known as spam emails. At times, more than 50% of the total emails were spam.
Even today, people lose millions of dollars to fraud every day.
However, the quantity of such emails has decreased significantly after 2016 because of the evolution
of software that can detect spam emails and filter them out.
Later in this assignment, we also look at the use-cases of Gmail, Outlook, and Yahoo: how do these
companies classify emails?
Several techniques are available in the market to detect spam emails. Broadly, algorithms decide
whether a mail is spam based on approaches such as the following:
Content-based filtering: Algorithms analyze words, the occurrence of words, and the distribution of
words and phrases inside the content of emails and segregate them into spam and non-spam categories.
Rule-based (heuristic) filtering: Algorithms use pre-defined rules, such as regular expressions, to give
a score to the messages in the emails. They segregate emails into spam and non-spam categories based
on the scores generated.
Previous-likeness-based filtering: Algorithms extract the incoming mails' features, create a
multi-dimensional space vector, and draw points for every new instance. Based on the KNN algorithm,
these new points get assigned to the closest class, spam or non-spam.
Adaptive filtering: Algorithms classify the incoming emails into various groups and, based on the
comparison scores of every group with the defined set of groups, spam and non-spam emails get
segregated.
This assignment gives an idea of how to implement content-based filtering using one of the most
widely used spam detection algorithms, K-Nearest Neighbour (KNN).
K-NN based algorithms are widely used for classification tasks. Let's quickly look at the entire
architecture of this implementation first and then explore every step. Executing these steps, one after
the other, will help us implement our spam classifier smoothly.
The dataset contained in a corpus plays a crucial role in assessing the performance of any
spam filter. Many open-source datasets are freely available in the public domain. The two
datasets below are widely popular as they contain many emails.
1. Enron corpus dataset (created in 2006 and having 55% spam emails)
2. TREC 2007 dataset (created in 2007 and having 67% spam emails)
Train/Test Split: Split the dataset into train and test datasets but make sure that both sets
balance the numbers of ham and spam emails (ham is a fancy name for non-spam emails).
As the dataset is in text format, we need to pre-process the text data. At this step, we mainly perform
tokenization of the mail. Tokenization is a process where we break the content of an email into words
and transform big messages into a sequence of representative symbols termed tokens. These tokens
are extracted from the email body, header, subject, and image.
Extracting words from images (For a simple implementation, this can be ignored)
These days, senders have the option to attach inline images to a mail. Such emails can be categorized
as spam not based on their text content but on the content of the images. This was not an easy task
until Google came up with the open-source library Tesseract. This library extracts words from images
automatically with reasonable accuracy, although certain fonts and CAPTCHA-style words remain
challenging to read automatically.
After pre-processing, we can have a large number of words. Here we can maintain a feature table that
records the frequency of the different words, with one word represented in each column. Note that the
greater the number of attributes, the greater the time complexity of the model.
The number of attributes can be tremendous, so techniques like stemming, noise removal, and
stop-word removal are used to reduce it. One of the most famous stemming algorithms is the Porter
Stemmer algorithm. Typical steps at this stage are reducing each word to its stem and removing stop
words such as "that", "these", "those", "am", "is", "are", "was", "were", "be", and "an".
Similar to the Nearest Neighbour algorithm, the K-Nearest Neighbour algorithm serves the purpose of
classification. Instead of looking at just the single nearest instance, it looks at the closest K instances
to the new incoming instance and classifies it based on the majority class among those K instances.
The value of K is a hyperparameter that needs tuning. To tune it, one can take a simple trial-and-error
approach, trying several K values and checking the model's performance for each. To find the nearest
instances, one can use the Euclidean distance. The scikit-learn library can be used to implement the
K-NN algorithm for this task.
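The following is a minimal end-to-end sketch of this approach using scikit-learn. The file name spam.csv and its column names label and text are assumptions; adapt them to whichever corpus (Enron, TREC 2007, etc.) you use.
# A minimal k-NN spam classifier sketch using scikit-learn.
# Assumes a CSV file 'spam.csv' with columns 'label' ('ham'/'spam') and 'text'.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

data = pd.read_csv('spam.csv')                       # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.2,
    stratify=data['label'], random_state=42)         # keep the ham/spam balance in both sets

# Tokenization, lowercasing and stop-word removal are handled by the vectorizer.
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# K is a hyperparameter; try a few values and compare the results (trial-and-error tuning).
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_vec, y_train)

print(classification_report(y_test, knn.predict(X_test_vec)))
classification_report prints precision, recall, and F1-score per class, which ties in with the confusion-matrix-based evaluation discussed below.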
Once our algorithm is ready, we must check the model's performance. Even a single missed important
message may cause a user to reconsider the value of spam filtering, so we must be sure that our
algorithm is as close to 100% accurate as possible. However, some researchers feel that accuracy alone
is not a sufficient evaluation metric for spam classification. According to the confusion matrix, we
must evaluate our spam classification model on four different counts: true positives, true negatives,
false positives, and false negatives.
More advanced algorithms are available for this classification task, but one can easily achieve more
than 90% accuracy using a k-NN-based implementation.
Gmail, Yahoo, and Outlook Case Study
Gmail
Google's data centers use thousands of rules to filter spam emails. They assign weightage to different
parameters and, based on that, filter the emails. Google's spam classifier is said to be a state-of-the-art
system that uses techniques like optical character recognition, linear regression, and a combination of
multiple neural networks.
Yahoo
Yahoo Mail is the world's first free webmail service provider, with more than 320 million active users.
They have their own filtering techniques to categorize emails. Yahoo's basic methods are URL
filtering, email content analysis, and user spam complaints. Unlike Gmail, Yahoo filters email
messages by domain name rather than IP address. Yahoo also provides users with custom-filtering
options to send mail directly to junk folders.
Outlook
Conclusion:
Given the number of spam emails sent daily and the amount of money people lose every day to these
spam scams, spam filtering has become a primary need for all email-providing companies. This
assignment discussed the complete process of spam email filtering using machine learning techniques.
We also covered one possible way of implementing a spam classifier using one of the most famous
algorithms, k-NN, and discussed the case studies of well-known companies like Gmail, Outlook, and
Yahoo to review how they use ML and AI techniques to filter out such spammers.
Assignment 5
Problem Statement:
Implement Agglomerative hierarchical clustering algorithm using appropriate dataset.
Objective:
To study and implement the Agglomerative hierarchical clustering algorithm.
To evaluate and analyse the clusters obtained from an appropriate dataset.
Theory:
Prerequisite: Agglomerative Clustering.
Agglomerative clustering is one of the most common hierarchical clustering techniques.
Dataset – Credit Card Dataset.
Assumption: The clustering technique assumes that each data point is similar enough to the other data
points that, at the start, all the data can be assumed to belong to one cluster.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc

# cd C:\Users\Dev\Desktop\Kaggle\Credit_Card   (Jupyter magic; adjust the path to your dataset)
X = pd.read_csv('CC_GENERAL.csv')
X = X.drop('CUST_ID', axis = 1)

# Handling the missing values
X.fillna(method = 'ffill', inplace = True)

# Scaling and normalizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_normalized = normalize(X_scaled)
X_normalized = pd.DataFrame(X_normalized)

# Reducing the data to two principal components
pca = PCA(n_components = 2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']
Dendrograms are used to visualize how a given cluster can be split into several smaller clusters.
Visualizing the working of the dendrogram:
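The dendrogram itself can be drawn with SciPy's hierarchy module (imported above as shc); a short sketch using Ward linkage on the reduced data:
Python3
plt.figure(figsize = (8, 8))
plt.title('Visualising the data')
dendrogram = shc.dendrogram(shc.linkage(X_principal, method = 'ward'))
plt.show()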
Building and visualizing the clustering models for different values of k:
a) k = 2
Python3
ac2 = AgglomerativeClustering(n_clusters = 2)
plt.scatter(X_principal['P1'], X_principal['P2'],
            c = ac2.fit_predict(X_principal), cmap = 'rainbow')
plt.show()
b) k = 3
Python3
ac3 = AgglomerativeClustering(n_clusters = 3)
plt.scatter(X_principal['P1'], X_principal['P2'],
            c = ac3.fit_predict(X_principal), cmap = 'rainbow')
plt.show()
c) k = 4
Python3
ac4 = AgglomerativeClustering(n_clusters = 4)
plt.scatter(X_principal['P1'], X_principal['P2'],
            c = ac4.fit_predict(X_principal), cmap = 'rainbow')
plt.show()
d) k = 5
Python3
ac5 = AgglomerativeClustering(n_clusters = 5)
plt.scatter(X_principal['P1'], X_principal['P2'],
            c = ac5.fit_predict(X_principal), cmap = 'rainbow')
plt.show()
e) k = 6
Python3
ac6 = AgglomerativeClustering(n_clusters = 6)
plt.scatter(X_principal['P1'], X_principal['P2'],
            c = ac6.fit_predict(X_principal), cmap = 'rainbow')
plt.show()
We now determine the optimal number of clusters using a mathematical technique. Here we will use
the silhouette scores for this purpose.
Python3
k = [2, 3, 4, 5, 6]

# silhouette score for each clustering model built above
silhouette_scores = []
silhouette_scores.append(
    silhouette_score(X_principal, ac2.fit_predict(X_principal)))
silhouette_scores.append(
    silhouette_score(X_principal, ac3.fit_predict(X_principal)))
silhouette_scores.append(
    silhouette_score(X_principal, ac4.fit_predict(X_principal)))
silhouette_scores.append(
    silhouette_score(X_principal, ac5.fit_predict(X_principal)))
silhouette_scores.append(
    silhouette_score(X_principal, ac6.fit_predict(X_principal)))

# plot a bar graph to compare the scores
plt.bar(k, silhouette_scores)
plt.show()
Thus, with the help of the silhouette scores, it is concluded that the optimal number of clusters for the
given data and clustering technique is 2.
Conclusion:
In this way, we have successfully implemented the Agglomerative hierarchical clustering algorithm
using an appropriate dataset.
Oral Questions:
1. What is meant by the Agglomerative hierarchical clustering algorithm?
2. Which ready-made function is available to build an Agglomerative hierarchical clustering model?
3. Describe the steps to implement the Agglomerative hierarchical clustering algorithm.
4. What are the applications of the Agglomerative hierarchical clustering algorithm?
Assignment 6
Problem Statement:
Implement the PageRank algorithm. (Use Python or Beautiful Soup for implementation.)
Objective:
To evaluate and analyze retrieved information using the PageRank algorithm.
To study the Random Walk method.
Theory:
Prerequisite: Page Rank Algorithm and Implementation, Random Walk.
In social networks, PageRank is a very important topic. Basically, PageRank describes how web pages are
ranked according to their importance and relevance to a search. All major search engines use some form of
page ranking; Google is the best example, computing PageRank over the web graph.
Random Walk
The web can be represented as a directed graph where nodes represent the web pages and edges the links
between them. Typically, if a node (web page) i is linked to a node j, it means that i refers to j.
We have to define what the importance of a web page is. As a first approach, we could say that it is the total
number of web pages that refer to it. If we stop at this criterion, the importance of the web pages that refer to it
is not taken into account; in other words, an important web page and a less important one have the same weight.
Another approach is to assume that a web page spreads its importance equally to all web pages it links to. By
doing that, we can define the score of a node j as
r_j = Σ_{i→j} r_i / d_i,
where the sum runs over all nodes i that link to j and d_i is the out-degree (number of outlinks) of node i.
Since a Markov chain is defined by an initial distribution and a transition matrix, the web graph can be seen as
a Markov chain with a transition matrix P in which the entry P_ji equals 1/d_i if page i links to page j, and 0
otherwise. We notice that the transpose of P is row stochastic (equivalently, P is column stochastic), which is a
condition for applying Markov chain theorems.
For the initial distribution, let's consider that it is uniform:
π_0 = (1/n, 1/n, …, 1/n),
where n is the total number of nodes. This means that the random walker will choose the initial node uniformly
at random, from where it can reach all other nodes.
At every step, the random walker jumps to another node according to the transition matrix. The probability
distribution is then computed for every step; this distribution tells us where the random walker is likely to be
after a certain number of steps. The probability distribution is computed using the equation
π_{t+1} = P π_t.
A stationary distribution of a Markov chain is a probability distribution π with π = Pπ. This means that the
distribution will not change after one step. It is important to note that not all Markov chains admit a stationary
distribution.
If a Markov chain is strongly connected, which means that any node can be reached from any other node, then it
admits a stationary distribution. This is the case in our problem. So, after an infinitely long walk, we know that
the probability distribution will converge to a stationary distribution π.
We notice that π is an eigenvector of the matrix P with eigenvalue 1. Instead of computing all the eigenvectors
of P and selecting the one which corresponds to the eigenvalue 1, we use the Frobenius-Perron theorem.
According to the Frobenius-Perron theorem, if a matrix A is square and positive (all its entries are positive),
then it has a positive eigenvalue r such that |λ| < r for every other eigenvalue λ of A. The eigenvector v of A
associated with the eigenvalue r is positive and is the unique positive eigenvector.
In our case, the matrix P is positive and square. The stationary distribution π is necessarily positive because it is a
probability distribution. We conclude that π is the dominant eigenvector of P with the dominant eigenvalue 1.
To compute π, we use the power iteration method, an iterative method for computing the dominant eigenvector
of a given matrix A. From an initial approximation of the dominant eigenvector b, which can be initialized
randomly, the algorithm updates the estimate until convergence using
b_{k+1} = A b_k / ||A b_k||.
In the web graph, for example, we can find a web page i which refers only to web page j while j refers only to i.
This is what we call the spider trap problem. We can also find a web page which has no outlink; this is
commonly named a dead end.
In the case of a spider trap, when the random walker reaches node 1 of such a pair, it can only jump to node 2,
and from node 2 it can only reach node 1, and so on. The importance of all other nodes is absorbed by nodes 1
and 2; for a four-node example the probability distribution converges to π = (0, 0.5, 0.5, 0), which is not the
desired result.
In the case of a dead end, when the walker arrives at a node with no outlink, it cannot reach any other node, and
the algorithm cannot converge.
Teleportation consists of connecting each node of the graph to all other nodes, so the graph becomes complete.
The idea is that with a certain probability β, the random walker jumps to another node according to the transition
matrix P, and with probability (1 − β) it jumps uniformly at random to any node in the graph (i.e., with
probability (1 − β)/n to each node). We then get the new transition matrix R:
R = β P + ((1 − β)/n) J,
where J is the n × n matrix of all ones. The matrix R has the same properties as P, which means that it admits a
stationary distribution, so we can use all the theorems we saw previously.
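To make this concrete, here is a short sketch that builds the teleportation matrix R for a small illustrative graph and runs the power iteration until convergence (the toy graph and β = 0.85 are assumed values, not taken from the text above):
import numpy as np

# Toy directed graph as an adjacency list: node -> list of outlinks (illustrative only).
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(graph)
beta = 0.85                                   # teleportation/damping factor (assumed value)

# Column-stochastic transition matrix P: P[j, i] = 1/d_i if i links to j.
P = np.zeros((n, n))
for i, outlinks in graph.items():
    for j in outlinks:
        P[j, i] = 1.0 / len(outlinks)

# Teleportation matrix R = beta*P + (1 - beta)/n * J, where J is the all-ones matrix.
R = beta * P + (1 - beta) / n * np.ones((n, n))

# Power iteration: start from the uniform distribution and iterate pi <- R pi.
pi = np.full(n, 1.0 / n)
for _ in range(100):
    new_pi = R @ pi
    if np.linalg.norm(new_pi - pi, 1) < 1e-9:   # stop when the change is negligible
        break
    pi = new_pi

print("PageRank vector:", pi)
print("Nodes ranked by importance:", np.argsort(-pi))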
That’s it for the PageRank algorithm. I hope you understood the intuition and the theory behind the PageRank
algorithm. Please, do not hesitate to leave comments or share my work.
Random Walk Method – In the random walk method, we choose one node from the graph uniformly at random.
After choosing the node, we look at its neighbours, choose one neighbour uniformly at random, and continue
these iterations until convergence is reached. After N iterations, a point comes after which there is no change in
the points of any node; this situation is called convergence.
Algorithm: Below are the steps for implementing the Random Walk method.
1. Create a directed graph with N nodes.
2. Perform a random walk over the graph, adding one point to each node every time it is visited.
3. Continue the walk until the points distribution stops changing (convergence).
4. Rank the nodes according to the points they have collected.
Below is the python code for the implementation of the points distribution algorithm.
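The following sketch uses networkx; the random graph, number of nodes, and walk length are illustrative assumptions, so the exact ranking printed will differ from the sample output shown below.
import random
import numpy as np
import networkx as nx

# Step 1: create a directed graph with N nodes (a random graph is used for illustration).
N = 15
G = nx.gnp_random_graph(N, 0.5, directed=True)

# Step 2: random walk - start from a random node and repeatedly move to a random
# out-neighbour, giving one point to every node that is visited.
points = [0] * N
node = random.randint(0, N - 1)
walk_length = 100000                       # large enough for the distribution to settle (assumed)
for _ in range(walk_length):
    points[node] += 1
    neighbours = list(G.successors(node))
    if neighbours:
        node = random.choice(neighbours)
    else:
        node = random.randint(0, N - 1)    # dead end: teleport to a random node

# Step 3: rank the nodes by the points they collected during the walk.
print("PageRank using Random Walk Method")
print(np.argsort(-np.array(points)))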
Output:
PageRank using Random Walk Method
[ 9 10 4 6 3 8 13 14 0 7 1 2 5 12 11]
Conclusion:
In this way, we have successfully implemented the PageRank algorithm.
Oral Questions:
1. What is meant by a random walk?
2. How have you implemented the random walk method?
3. What is meant by the PageRank algorithm?