7-Text Classification-13-11-2024

Natural Language Processing

Dr K Ganesan
Professor – HAG
SCORE, VIT
[email protected]
Phone : 6382203768
• In NLP, info is often expressed using unstructured text.
• The following steps are often involved in extracting
structured information from unstructured texts:
• 1. Initial processing. 2. Proper names identification.
• 3. Parsing. 4. Extraction of events and relations.
• 5. Anaphora resolution. 6. Output result generation.
• 1. Initial processing
• The first step is to break down a text into
fragments such as zones, phrases, segments, and
tokens.
• This function can be performed by tokenizers.
• Tokenizers in NLP break down text into smaller units (words, subwords, or characters) called tokens.
• Tokenization is a crucial step in text classification, sentiment analysis, and machine translation.
• sentence = "Tokenization is an important step in NLP tasks."
• ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', 'tasks', '.']
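• A minimal sketch of this tokenization step, using spaCy (introduced later in these slides); it assumes the small English model en_core_web_sm has been downloaded:

```python
# A minimal sketch of word tokenization with spaCy; assumes the "en_core_web_sm"
# model has been installed via `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tokenization is an important step in NLP tasks.")

print([token.text for token in doc])
# ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', 'tasks', '.']
```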
• Part-of-speech (POS) tagging means the words in a text are categorized into their parts of speech, such as nouns, verbs, and adjectives.
• It helps determine the grammatical structure of sentences and is used in tasks like sentiment analysis, named entity recognition, and other machine learning pipelines.
• Input Text: "The quick brown fox jumps over the lazy dog."
• POS Tagged Output: "The" - Determiner
• "quick" - Adjective
• "brown" - Adjective
• "fox" - Noun
• "jumps" - Verb
• "over" - Preposition
• "the" - Determiner
• "lazy" - Adjective
• "dog" - Noun
• Here each word is tagged with its corresponding part of speech.
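• A minimal sketch of POS tagging with spaCy, again assuming the en_core_web_sm model is available; token.pos_ gives the coarse universal tag:

```python
# A minimal sketch of POS tagging with spaCy; assumes "en_core_web_sm" is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    # token.pos_ is the coarse universal POS tag (DET, ADJ, NOUN, VERB, ADP, ...)
    print(f"{token.text:<6} - {token.pos_}")
```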
• In natural language processing, phrasal unit identification means
identifying phrases or units of words that have a specific syntactic
or semantic meaning.
• Simple noun phrase: "the red car"
• Complex verb phrase : "will be going to the store."
• Example: "The quick brown fox jumps over the lazy dog."
• Noun Phrase (NP): "The quick brown fox" - This is a noun phrase
consisting of determiner "the," adjectives "quick" and "brown," and
the noun "fox."
• Verb Phrase (VP): "Jumps over the lazy dog" - This is a verb phrase
consisting of the verb "jumps," preposition "over," and the noun
phrase "the lazy dog."
• Prepositional Phrase (PP): "Over the lazy dog" - This is a
prepositional phrase consisting of the preposition "over" and the
noun phrase "the lazy dog."
• Identifying these phrasal units helps to understand the structure
and meaning of sentences.
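• A minimal sketch of phrasal-unit (noun chunk) identification with spaCy, assuming en_core_web_sm; the chunks it returns are base noun phrases such as "The quick brown fox" and "the lazy dog":

```python
# A minimal sketch of noun-phrase (chunk) identification with spaCy;
# assumes "en_core_web_sm" is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# doc.noun_chunks yields base noun phrases; chunk.root is the head noun of each phrase
for chunk in doc.noun_chunks:
    print(chunk.text, "->", chunk.root.text)
```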
2. Proper names identification
• Also known as named entity recognition (NER) – identifies and
classifies named entities in the text into predefined categories
such as person names, organization names, locations, dates,
etc.
• "John Smith works at Google headquarters in NewYork City"
• We can identify the following named entities:
• Person Name: "John Smith"
• Organization Name: "Google"
• Location: "New York City"
• NER uses ML models, such as conditional random fields (CRFs) or deep learning models like recurrent neural networks (RNNs) and transformers, to recognize and classify named entities in text.
• These models are trained on annotated datasets that provide labeled examples of named entities.
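• A minimal sketch of NER with spaCy, assuming en_core_web_sm; the exact labels (PERSON, ORG, GPE) depend on the model used:

```python
# A minimal sketch of named entity recognition with spaCy; assumes "en_core_web_sm" is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Smith works at Google headquarters in New York City.")

for ent in doc.ents:
    # ent.label_ is the predicted entity type, e.g. PERSON, ORG, GPE (location)
    print(ent.text, "-", ent.label_)
```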
• 3. Parsing
• Parsing refers to the process of analyzing the grammatical
structure of sentences to determine their syntactic structure
• Here we break sentences into smaller components such as phrases and identify the relationships between them, represented in the form of a parse tree.
• Here's an example to illustrate parsing:
• Example: "The quick brown fox jumps over the lazy dog."
• Tokenization: The sentence is tokenized into individual words:
"The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog".
• Part-of-Speech (POS) Tagging: Each word is tagged with its
grammatical category (e.g., noun, verb, adjective):
• "The_DET", "quick_ADJ", "brown_ADJ", "fox_NOUN",
"jumps_VERB", "over_ADP", "the_DET", "lazy_ADJ",
"dog_NOUN".
• Parsing: The parsed tree structure represents syntactic
relationships between the words.

• In this parse tree:


• "jumps" is the root verb.
• "fox" is the subject of the verb "jumps."
• "The quick brown" is an adjective phrase describing "fox."
• "over" is a preposition indicating the relationship between "jumps"
and "dog."
• "the lazy dog" is a noun phrase governed by the preposition "over."
• This parsing process is fundamental for tasks like syntax analysis,
semantic analysis, and information extraction.
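• A minimal sketch of dependency parsing with spaCy, assuming en_core_web_sm; each token is attached to a head word with a dependency relation, and the root is the main verb:

```python
# A minimal sketch of dependency parsing with spaCy; assumes "en_core_web_sm" is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    # token.dep_ is the dependency relation, token.head is the word it attaches to
    print(f"{token.text:<6} {token.dep_:<6} head={token.head.text}")

# The root of the tree is the main verb "jumps"
print([token.text for token in doc if token.dep_ == "ROOT"])
```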
• 4. Extraction of events and relations
• Extracting events and relations means identifying and
understanding relationships between entities in the text.
• It is crucial for tasks such as information extraction,
knowledge graph construction, and sentiment analysis.
• Extraction of Events:
• Events in NLP refer to actions or occurrences described in
text.
• They are extracted using techniques like rule-based
approaches, ML models, etc.
• Example : "John bought a new car."
• In this sentence:
• Event: "bought"
• Subject Entity: "John"
• Extraction of Relations:
• It refers to connections / associations between entities.
• This can include relationships like ownership, membership, location, etc. These connections are identified based on contextual information.
• For example, in the sentence "John lives in New York," the
relation is "location" between "John" and "New York."
• Example: "The company announced a new partnership with
XYZ Corporation."
• Entities: "The company," "a new partnership," "XYZ Corporation" and Relation: "partnership with"
• Here, the relation "partnership with" connects "The company"
and "XYZ Corporation," indicating a business relationship.
• Relation extraction involves dependency parsing, named entity
recognition (NER), and domain-specific knowledge to
accurately identify and classify relationships between entities.
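• A rough, simplified sketch of pulling a (subject, event, object) triple out of a sentence using spaCy dependency labels; real relation and event extraction systems are considerably more elaborate:

```python
# A rough, simplified sketch of extracting a (subject, event, object) triple using
# spaCy dependency labels; assumes "en_core_web_sm" is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John bought a new car.")

for token in doc:
    if token.pos_ == "VERB":                                            # candidate event, e.g. "bought"
        subjects = [w.text for w in token.children if w.dep_ == "nsubj"]     # e.g. ["John"]
        objects = [w for w in token.children if w.dep_ == "dobj"]            # head noun, e.g. "car"
        object_phrases = [" ".join(t.text for t in o.subtree) for o in objects]  # expands to "a new car"
        print("Event:", token.text, "| Subject:", subjects, "| Object:", object_phrases)
```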
• 5. Co-reference or Anaphora resolution
• Anaphora resolution is a process in NLP where
pronouns or other referring expressions are
connected to their corresponding entities or concepts
in the text.
• Goal is to understand which word or phrase a
pronoun or anaphor refers to, ensuring clarity in
text.
• John went to the store. He bought some groceries.
• Here, "He" is an anaphor that refers back to "John."
• Anaphora resolution involves identifying that "He"
refers to "John" in this context.
• This process is crucial for understanding the meaning of text and maintaining coherence across sentences.
• 6. Output results generation
• This stage converts the structures collected during the preceding processes into output templates that follow the format defined by the user. It uses various normalization processes.
• Spacy: Python library for advanced NLP
• It is designed for production use and helps to process and
understand large amounts of text.
• It is used to build information extraction or natural language understanding systems and to preprocess text for deep learning.
• A regular expression (abbreviated regex or regexp, also known
as a rational expression) is a string of characters that defines a
search pattern.
• A regular expression is a pattern that characterizes a collection of strings.
• REs are built similarly to arithmetic expressions, by combining smaller expressions with operators.
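• A minimal sketch of regular expressions with Python's re module; the pattern and text below are illustrative only:

```python
# A minimal sketch of regular expressions with Python's re module; the text and
# patterns here are illustrative only.
import re

text = "Order 123 shipped on 13-11-2024, order 456 is pending."

# A pattern is built by combining smaller expressions: \d matches a digit,
# {2} and {4} are repetition counts, and '-' is a literal character.
date_pattern = r"\d{2}-\d{2}-\d{4}"

print(re.findall(date_pattern, text))   # ['13-11-2024']
print(re.findall(r"\d+", text))         # all digit runs: ['123', '13', '11', '2024', '456']
```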
• Introduction to Text Classification
• Text classification is one of the most common tasks in NLP.
• It is the process of assigning a label or category to a given piece of text.
• For example, we can classify emails as spam or not spam, tweets as positive or negative, and articles as relevant or not relevant.
• Benefits of text classification include:
• Categorizing documents:
• It is used to automatically categorize documents and used for
organizing info and making it easier for searching.
• Predicting outcomes:
• It can be used to predict future events. For e.g, if we have a
collection of news articles, we could use text classification to
predict stock market's reaction to a particular event.
• Finding patterns: It can be used to find hidden patterns in data.
• For example, we could use text classification to identify which
customers are likely to respond positively to a marketing campaign.
• There are 2 types of text classification:
• Supervised and Unsupervised.
• Supervised text classification involves training a model
on a dataset where the labels are already known.
• Unsupervised text classification does not need labels;
Model is trained on the data itself and learns to group
documents into categories based on similarities.
• Supervised text classification is more accurate, but it is
time-consuming & expensive, as it needs labeled data.
• Unsupervised text classification is less accurate but can
be used when labels are not available.
• For example, if accuracy is paramount, a supervised
approach is preferable. But, if time or resources are
limited, an unsupervised approach is better.
• Spam Detection - We can train a machine learning model on a
dataset of labeled emails (spam or not spam) and use it to predict
whether new emails are spam or not.
• Here, we first load the dataset of emails into a Pandas DataFrame.
• We then split the data into training and testing sets using the
train_test_split function from scikit-learn.
• Here, x and y are our features (the email text) and labels (spam or not spam).
• The train_test_split function shuffles the dataset and then splits it.
• The test_size parameter determines the proportion of the original dataset to include in the test split.
• Here, we've set it to 0.2, meaning 20% of the data will be used for the test set and the remaining 80% for the training set.
• We create a CountVectorizer object to convert the text of each email into numerical features and use it to transform the training and testing sets, as sketched below.
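• A minimal sketch of the spam-detection pipeline described above; the file name emails.csv and its text/label columns are assumptions for illustration, not names given in these slides:

```python
# A minimal sketch of the spam-detection pipeline described above. The file name
# "emails.csv" and its "text"/"label" columns are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("emails.csv")          # hypothetical dataset of labeled emails
x, y = df["text"], df["label"]          # features (email text) and labels (spam / not spam)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

vectorizer = CountVectorizer()
x_train_vec = vectorizer.fit_transform(x_train)   # learn the vocabulary on the training set
x_test_vec = vectorizer.transform(x_test)         # reuse the same vocabulary for the test set

model = MultinomialNB()
model.fit(x_train_vec, y_train)
print("Accuracy:", model.score(x_test_vec, y_test))   # mean accuracy on the test set
```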
• CountVectorizer is a tool provided by the scikit-learn library.
• It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.
• This is helpful when we have multiple such texts and we wish to convert each text into a vector of word counts.
• Consider sample texts from a document (given as list element):
• document = [ “One Geek helps Two Geeks”, “Two Geeks help Four
Geeks”, “Each Geek helps many other Geeks at GeeksforGeeks.”]
• CountVectorizer creates a matrix in which each unique word is
represented by a column of the matrix, and each text sample from
the document is a row in the matrix. Value of each cell is the count
of the word in that particular text sample.
• There are 12 unique words in the document, represented as the columns of the table. There are 3 text samples in the document, each represented as a row of the table. Every cell contains a number that represents the count of that word in that particular text. The words are converted to lowercase.
• The words in the columns are arranged alphabetically.
          at  each  four  geek  geeks  geeksforgeeks  help  helps  many  one  other  two
doc[0]     0     0     0     1      1              0     0      1     0    1      0    1
doc[1]     0     0     1     0      2              0     1      0     0    0      0    1
doc[2]     1     1     0     1      1              1     0      1     1    0      1    0
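• A minimal sketch that reproduces the count matrix above with scikit-learn's CountVectorizer:

```python
# A minimal sketch reproducing the count matrix above with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks."]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(document)

print(vectorizer.get_feature_names_out())  # the 12 unique lowercased words (columns)
print(matrix.toarray())                    # one row of counts per text sample
```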
• Here we call fit_transform() method on our training data and
transform() method on our test data.
• Common way to scale the features is using scikit-learn’s
StandardScaler class.
• Data standardization is the process of rescaling the attributes so
that they have mean as 0 and variance as 1.
• Goal of standardization is to bring down all features to a
common scale without distorting differences in the range of
values.
• In sklearn.preprocessing.StandardScaler(), centering and scaling
happens independently on each feature.

• Here, the model built by us will learn the mean and variance of the features of the training set.
• The parameters learned from the training data are then used to transform the test data.
• The fit method calculates the mean and variance of each of the features present in our data.
• The transform method transforms all the features using the respective mean and variance.
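• A minimal, generic sketch of StandardScaler's fit/transform behaviour on numeric features (the toy values are illustrative only):

```python
# A minimal, generic sketch of StandardScaler's fit/transform behaviour on numeric
# features; the toy values here are illustrative only.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[30.0], [25.0], [40.0], [35.0], [28.0]])
X_test = np.array([[32.0]])

scaler = StandardScaler()
scaler.fit(X_train)                    # learns the mean and variance of each feature
print(scaler.mean_, scaler.var_)       # parameters learned from the training data

X_train_scaled = scaler.transform(X_train)   # (x - mean) / std, feature-wise
X_test_scaled = scaler.transform(X_test)     # the test data uses the training statistics
```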
• We then train a Multinomial Naïve Bayes model on the
training set and evaluate its accuracy on the testing set.
• Multimodal Naive Bayes is an extension of the Naive Bayes
algorithm that can handle multiple types of features (or
modalities) in the input data.
• Each modality refers to a different type of attribute or feature, such as text, numerical values, or categorical variables.
• It assumes that features are conditionally independent given the class label, as in the standard Naive Bayes algorithm.
• Consider a classification problem where we want to predict
whether a customer will buy a product based on two types of
features: age (numerical) and profession (categorical).
• Step 1: Data Preparation - We have following training data:
Age Profession Bought
30 Engineer Yes
25 Doctor Yes
40 Teacher No
35 Engineer No
28 Doctor Yes

• Step 2: Feature Encoding - Encode the categorical variable (Profession) using one-hot encoding:
Age Engineer Doctor Teacher Bought
30 1 0 0 Yes
25 0 1 0 Yes
40 0 0 1 No
35 1 0 0 No
28 0 1 0 Yes

One Hot Encoding converts categorical variables into a binary format. It creates new binary
columns (0s and 1s) for each variable. Each category in the original column is represented as a
separate column, where a value of 1 indicates presence of that category, 0 indicates its absence.
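• A minimal sketch of one-hot encoding the Profession column with pandas (pd.get_dummies):

```python
# A minimal sketch of one-hot encoding the Profession column with pandas.
import pandas as pd

df = pd.DataFrame({
    "Age": [30, 25, 40, 35, 28],
    "Profession": ["Engineer", "Doctor", "Teacher", "Engineer", "Doctor"],
    "Bought": ["Yes", "Yes", "No", "No", "Yes"],
})

# Each category becomes its own 0/1 column (Engineer, Doctor, Teacher)
encoded = pd.get_dummies(df, columns=["Profession"], dtype=int)
print(encoded)
```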
• Step 3: Calculate Class Probabilities
• Calculate prior probabilities P(Bought=Yes) and P(Bought=No):
• P(Bought=Yes) = (Number of 'Yes' instances) / (Total number of
instances) = 3/5
• P(Bought=No) = (Number of 'No' instances) / (Total number of
instances) = 2/5
• Step 4: Calculate Feature Probabilities
• For numerical features (like Age), use probability density functions
(PDFs) such as Gaussian distribution for simplicity.
• Let's assume Gaussian distributions for each class:
• For Bought=Yes: Mean = (30+25+28)/3 ≈ 27.7
Age: Mean=27.7, Variance=15.67 (calculated from 'Yes' instances)
• For Bought=No: Mean = (40+35)/2 = 37.5
Age: Mean=37.5, Variance=12.5 (calculated from 'No' instances)
• Step 5: Predictions
• Given a new instance with Age=32 and Profession=Engineer, we
calculate the conditional probabilities for each class:
• For Bought=Yes:
P(Age=32 | Bought=Yes) = Gaussian(32, 27.7, 15.67) ≈ 0.071 (approx.)
P(Engineer | Bought=Yes) = 2/3 (since 2 out of 3 'Yes' instances are Engineers)
P(Bought=Yes) = 3/5
Posterior Probability = P(Age=32 | Bought=Yes) * P(Engineer | Bought=Yes) * P(Bought=Yes) = 0.071 * (2/3) * (3/5) ≈ 0.0284 (approx.)
• For Bought=No:
P(Age=32 | Bought=No) = Gaussian(32, 37.5, 12.5) ≈ 0.055 (approx.)
P(Engineer | Bought=No) = 1/2 (as 1 out of 2 'No' instances is an Engineer)
P(Bought=No) = 2/5
Posterior Probability = P(Age=32 | Bought=No) * P(Engineer | Bought=No) * P(Bought=No) = 0.055 * (1/2) * (2/5) ≈ 0.011 (approx.)
• As posterior probability for Bought=Yes is higher, we predict that
the customer will buy the product.
• To calculate Gaussian function values, we need to know the formula
for the Gaussian distribution, which is also known as the normal
distribution.
• The Gaussian distribution is given by the formula:
• f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
• where x is the input value (e.g., the age in our example), μ is the mean of the distribution, and σ is the standard deviation of the distribution.
• Suppose we have a Gaussian distribution with mean 𝜇=37.5 and
standard deviation 𝜎=12.5. Calculate 𝑓(32), which is Gaussian
function value for age =32 in this distribution. We get 0.055
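• A minimal sketch of this density function; note that whether 12.5 is used as the variance or as the standard deviation changes the numeric result:

```python
# A minimal sketch of the Gaussian (normal) density used above.
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma, evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Density at age = 32 for the 'No' class; the slides quote 12.5 as the spread parameter.
# The result depends on whether 12.5 is treated as the standard deviation or as the variance.
print(gaussian_pdf(32, 37.5, 12.5))             # 12.5 read as the standard deviation
print(gaussian_pdf(32, 37.5, math.sqrt(12.5)))  # 12.5 read as the variance
```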
• For classifiers such as Multinomial Naive Bayes, sklearn's model.score(X_test, y_test) returns the mean accuracy on the test set.
• For regressors, model.score is based on the coefficient of determination, i.e. R^2 (its value typically lies between 0 and 1):
• u = ((y_test - y_predicted) ** 2).sum()
• v = ((y_test - y_test.mean()) ** 2).sum()
• R2 = 1 - (u/v); [If R2 = 0, the dependent variable cannot be predicted from the independent variable. If R2 = 1, the dependent variable can be predicted from the independent variable without any error.]
News Classification- Another example of text classification is news
classification, where we can train a model to predict the category of
a news article based on its content.
• Load dataset of news articles into a Pandas DataFrame.
• Then split the data into training and testing sets using the
train_test_split function from scikit-learn.
• TfidfVectorizer is a feature extraction technique used in
NLP and text mining.
• It stands for Term Frequency-Inverse Document Frequency
Vectorizer.
• This technique is used to convert a collection of raw
documents into a matrix of TF-IDF features.
• TF-IDF evaluates importance of a word in a document
relative to a collection of documents.
• Suppose we have three documents:
• "The quick brown fox jumps over the lazy dog."
• "The dog chased the cat."
• "The cat climbed the tree."
• Step 1: Tokenization
• Text is tokenized into individual words. Tokenization results are:
• Document 1: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the',
'lazy', 'dog']
• Document 2: ['the', 'dog', 'chased', 'the', 'cat']
• Document 3: ['the', 'cat', 'climbed', 'the', 'tree']
• Step 2: TF (Term Frequency) Calculation
• TfidfVectorizer calculates TF for each word in each document.
• TF is count of how many times a word appears in a document,
normalized by the total number of words in the document.
• For example, let's calculate TF for the word 'the' in Document 1:
• Total words in Document 1 = 9
• Count of 'the' in Document 1 = 2
• TF('the', Document 1) = 2/9 ≈ 0.2222
• Similarly, TF values are calculated for all words in all documents.
• Step 3: IDF (Inverse Document Frequency) Calculation
• Now, TfidfVectorizer computes the inverse document frequency (IDF) for each word.
• IDF measures how important a word is across all documents.
• Words that appear frequently in many documents are penalized, while rare words are given higher weight.
• IDF is calculated using: IDF(w) = log(N / df(w))
• where N is the total number of documents and df(w) is the number of documents containing the word w.
• For example, let's calculate IDF for the word 'the':
• N = 3 (total number of documents), df('the') = 3 (all documents contain 'the')
• IDF('the') = log(3/3) = log(1) = 0
• Similarly, IDF values are calculated for all words.
• Step 4: TF-IDF Calculation
• Finally, TfidfVectorizer computes the TF-IDF values for each word in each document using the formula:
• TF-IDF(w, d) = TF(w, d) × IDF(w)
• where TF(w, d) is the term frequency of word w in document d, and IDF(w) is the inverse document frequency of word w.
• For e.g, let's calculate TF-IDF for word 'the' in Document 1:
• TF('the', Document 1) = 2/9 ≈ 0.2222
• IDF('the') = 0
• TF-IDF('the', Document 1) = 0.2222 * 0 = 0
• TF-IDF values are calculated for all words in all documents.
• After this process, we’ll have a TF-IDF matrix where each row
represents a document, and each column represents a word, with
the corresponding TF-IDF value in each cell.
• This matrix can be used as input for machine learning models in
tasks like text classification or clustering
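• A minimal sketch of TfidfVectorizer on the three example documents; note that scikit-learn uses a smoothed IDF (log((1+N)/(1+df)) + 1) and L2-normalises each row, so its numbers differ slightly from the hand calculation above:

```python
# A minimal sketch of TfidfVectorizer on the three example documents. scikit-learn uses a
# smoothed IDF and L2 row normalisation, so its values differ slightly from the hand calculation.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The quick brown fox jumps over the lazy dog.",
        "The dog chased the cat.",
        "The cat climbed the tree."]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary (columns)
print(tfidf_matrix.toarray().round(3))     # one row of TF-IDF weights per document
```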
• We then train a logistic regression model on the training
set and evaluate its accuracy on testing set.
• Logistic regression is a statistical algorithm that models the relationship between input features and a categorical outcome.
• The goal is to predict the probability that an instance belongs to a given class or not.
• By training machine learning models on labeled datasets, we can automatically categorize text based on its content and make predictions about new text data.
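• A minimal sketch of the news-classification setup described above; the file name news.csv and its text/category columns are assumptions for illustration:

```python
# A minimal sketch of the news-classification setup described above. The file name
# "news.csv" and its "text"/"category" columns are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("news.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)   # fit on the training set only
X_test_tfidf = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)
print("Accuracy:", clf.score(X_test_tfidf, y_test))
```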
• Topic Categorization
• Topic categorization or topic modeling, is the process of
automatically identifying main themes or topics present in
a collection of documents or text data.
• It is used in content recommendation, search engines, and
text summarization.
• Goal here is to group similar documents or text
snippets together based on the topics they cover.
• This is done by extracting and analyzing patterns and
relationships between the words and phrases in text.
• The resulting topics can then be used to organize and label the data, or to train machine learning models for various applications.
• There are several methods and algorithms for topic
categorization, including Latent Dirichlet Allocation
(LDA), Non-negative Matrix Factorisation (NMF), and
Hierarchical Dirichlet Process (HDP).
• These methods use techniques such as dimensionality
reduction, clustering, and probabilistic modeling to
identify the most relevant topics in the data.
• The dataset of news articles is in CSV file named news.csv.
• We use TF-IDF vectorization to convert the text into
numerical features, and then apply Non-Negative Matrix
Factorization (NMF) to extract 5 topics from the text data.
• NMF is a technique used in ML for dimensionality
reduction and feature extraction.
• It factors a given non-negative matrix into two lower-
dimensional non-negative matrices, which can be
interpreted as representing latent features in original
data.
• Finally, we print the top 10 keywords for each topic to
understand what the topics are about.
• Additionally, quality and accuracy of the categorization
will depend on the size and quality of the dataset, as well
as the preprocessing and feature selection methods used.
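• A minimal sketch of the topic-extraction steps described above (TF-IDF plus NMF with 5 topics, top 10 keywords per topic); the "text" column name in news.csv is an assumption:

```python
# A minimal sketch of the topic-extraction steps described above; the "text" column
# name in news.csv is an assumption for illustration.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

df = pd.read_csv("news.csv")

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(df["text"])

nmf = NMF(n_components=5, random_state=42)   # factorise into 5 latent topics
nmf.fit(tfidf)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(nmf.components_):
    top_terms = [terms[i] for i in weights.argsort()[-10:][::-1]]  # 10 highest-weight words
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```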
• Information retrieval is defined as the process of accessing
and retrieving the most appropriate information from text
based on a particular query given by the user, with the
help of context-based indexing or metadata.
• Google Search is a famous example of information retrieval.
• It tells the user about the existence and location of documents that might contain the required information.
• The documents that satisfy the user’s requirement are
called relevant documents.
• A user who needs information has to formulate a request in the form of a query in natural language.
• The IR system then returns the relevant output, in the form of documents containing the required information.
• The step-by-step procedure of these systems is as follows:
• Indexing the collection of documents.
• Transforming query in the same way as document content is
represented.
• Comparing description of each document with that of query.
• Listing the results in order of relevancy.
• Retrieval Systems consist of 2 processes: Indexing and
Matching
• Indexing is the process of selecting terms to represent a text.
• Indexing involves:
• Tokenization of string
• Removing frequent words
• Stemming
• 2 Indexing Techniques: Boolean Model, Vector space model
• Matching - It is the process of finding a measure of similarity
between two text representations.
• Relevance of a document is computed based on following
parameters:
• 1. TF: It stands for Term Frequency which is simply the number of
times a given term appears in that document.
• TF(i, j) = (count of ith term in jth document)/(total terms in
jth document)
• 2. IDF: It stands for Inverse Document Frequency which is
a measure of the general importance of the term.
• IDF(i) = (total no. of documents)/(no. of documents containing
ith term)
• 3. TF-IDF Score (i, j) = TF * IDF
• In ad-hoc retrieval, the user has to enter a query in natural language that describes the required information.
• The IR system will then return the required documents as output.
• For Example, suppose we are searching for
something on the Internet and it gives some exact
pages that are relevant as per our requirement but
there can be some non-relevant pages too.
• This is due to the ad-hoc retrieval problem.
• A retrieval model consists of the following components:
• D: Representation for documents.
• Q: Representation for queries.
• F: The modeling framework for D and Q, along with the relationship between them.
• R(q, di): A ranking or similarity function that orders the documents with respect to the query.
• Information Retrieval (IR) models are classified into 3 types:
• Classical IR Models
• These are the simplest and easy-to-implement IR models.
• These are based on mathematical knowledge that was easily
recognized and understood.
• Following are the examples of classical IR models:
• Boolean models, Vector models, Probabilistic models.
• Non-Classical IR Models
• These are based on principles other than similarity, probability,
Boolean operations.
• Following are the examples of Non-classical IR models:
• Information logic models, Situation theory models, Interaction
models.
• Alternative IR Models
• Following are the examples of Alternative IR models:
• Cluster models, Fuzzy models, Latent Semantic Indexing (LSI) models
• Boolean Model
• Boolean Model is the oldest model for Information Retrieval (IR).
• These models are based on set theory and Boolean algebra, where
• Documents: Sets of terms
• Queries: Boolean expressions on terms
• As a response to the query, the set of documents that satisfied the
Boolean expression are retrieved.
• The Boolean model can be defined as:
• D: It represents a set of words, i.e, the indexing terms present in a
document. Each term is either present (1) or absent (0) in the
document.
• Q: It represents a Boolean expression, where terms are the index terms
and operators are:
• logical products - AND, Logical sum − OR, Logical difference − NOT.
• F: It represents a Boolean algebra over sets of terms as well as over
sets of documents.
• If we talk about relevance, then in the Boolean IR model:
• R: A document is predicted as relevant to a query expression if and only if it satisfies that query expression, for example:
• (text ∨ information) ∧ retrieval ∧ ¬theory
• We can explain this model by treating a query term as an unambiguous definition of a set of documents.
• For Example, suppose we have the query
term “analytics”, which defines the set of documents that are
indexed with the term “analytics”.
• Now, what is the result when we combine terms with the Boolean ‘AND’ operator?
• After doing the ‘AND’ operation, it will define a document set
that is smaller than or equal to the document sets of any of the
single terms.
• For Example, now we have the query with
terms “Vidhya” and “analytics” that will produce the set of
documents that are indexed with both the terms.
• In simple words, document set with intersection of
both the sets described here.
• Now, what is the result when we combine terms with the Boolean ‘OR’ operator?
• After doing the ‘OR’ operation, it will define a
document set that is bigger than or equal to the
document sets of any of the single terms.
• For Example, now we have the query with
terms “Vidhya” or “analytics” that will produce the
set of documents that are indexed with either the
term “Vidhya” or “analytics”.
• In simple words, the document set with the union
of both sets described here.
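• A toy sketch of the Boolean model, treating documents as sets of index terms and queries as set operations (AND = intersection, OR = union, NOT = difference); the toy index below is illustrative only:

```python
# A toy sketch of the Boolean model: documents as sets of index terms, queries as
# set operations. The tiny index below is illustrative only.
docs = {
    "d1": {"vidhya", "analytics", "data"},
    "d2": {"analytics", "retrieval", "text"},
    "d3": {"vidhya", "text", "information"},
}

def postings(term):
    """Set of document ids indexed with the given term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

print(postings("vidhya") & postings("analytics"))   # AND: intersection -> {'d1'}
print(postings("vidhya") | postings("analytics"))   # OR: union -> {'d1', 'd2', 'd3'}
print(postings("analytics") - postings("text"))     # NOT: difference -> {'d1'}
```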
• To understand the Vector Space Model, let us discuss the following:
• 1. Here, the index representations (documents) and the queries are represented by vectors in a T-dimensional Euclidean space.
• 2. T represents the number of distinct terms used in the documents.
• 3. Each axis corresponds to one term.
• 4. The result is a ranked list of documents ordered by similarity to the query, where the similarity between a query and a document is computed using a metric on the respective vectors.
• 5. The similarity measure of a document vector to a query vector is usually the cosine of the angle between them.
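• A minimal sketch of ranking documents by cosine similarity between TF-IDF vectors; the query and documents are illustrative only:

```python
# A minimal sketch of ranking documents by cosine similarity between TF-IDF vectors;
# the query and documents here are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["information retrieval with vector space models",
        "text classification with naive bayes",
        "ranking documents by cosine similarity"]
query = "vector space retrieval"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # T-dimensional document vectors
query_vector = vectorizer.transform([query])   # query mapped into the same space

scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = sorted(zip(scores, docs), reverse=True)   # ranked list, most similar first
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```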
• Evaluation of IR Systems: The 2 common measures for evaluating the effectiveness of IR systems are as follows:
• Precision: the proportion of retrieved documents that are relevant, i.e. Precision = |relevant ∩ retrieved| / |retrieved|.
• Recall: the proportion of relevant documents that are retrieved, i.e. Recall = |relevant ∩ retrieved| / |relevant|.
• Ideally both precision and recall should be 1. In practice, they are inversely related.
• Page Rank Algorithm
• The page rank algorithm is applicable to web pages.
• The page rank algorithm is used by Google Search to
rank many websites in their search engine results.
• We can say that the page rank algorithm is a way of
measuring the importance of website pages.
• The web can be viewed as a directed graph having two components, namely nodes and connections (edges).
• The pages are the nodes and the hyperlinks are the connections.
• Let us see how to solve Page Rank Algorithm.
• Compute page rank at every node at the end of the
second iteration (use teleportation factor = 0.8)
• The teleportation factor refers to the probability that a random web surfer will jump from the current page to another random page on the web.
• It is a way of introducing randomness into the model, simulating how a person might randomly click on links while browsing the web.
• In the PageRank algorithm, the teleportation factor is denoted by β and is set to a value between 0 and 1.
• When a surfer reaches a page, there are two possibilities:
• With probability 1−β, the surfer follows a link on the current page.
• With probability β, the surfer jumps to a completely random page on the web, regardless of links.
• This teleportation factor helps to prevent issues like dead ends (pages with no outgoing links) and spider traps (pages with infinite loops of links), which can affect the convergence of the PageRank algorithm.
• It ensures that pages that are not directly linked still have a chance to get some PageRank score through the random teleportation process.
The formula is:
PR(u) = (1 − β) + β × Σ [ PR(v) / L(v) ] over all pages v that link to u,
where L(v) is the number of outgoing links from page v.
Given, β is the teleportation factor, i.e. 0.8
NOTE: We need to solve up to 2 iterations.
Let us create a table of the 0th, 1st, and 2nd iterations and see how to arrive at the results given below in detail:
NODES   ITERATION 0   ITERATION 1   ITERATION 2
A       1/6 = 0.16    0.3           0.392
B       1/6 = 0.16    0.32          0.3568
C       1/6 = 0.16    0.32          0.3568
D       1/6 = 0.16    0.264         0.2714
E       1/6 = 0.16    0.264         0.2714
F       1/6 = 0.16    0.392         0.4141
Iteration 0:
For iteration 0 assume that each page is having page rank =
1 / Total number of nodes
PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 = 0.16
Iteration 1:
By using the above-mentioned formula
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2) = (1-0.8) + 0.8 * (0.16/4 + 0.16/2) ≈ 0.3
• So, what we have done for node A is to look at how many incoming links it has.
• Here we have PR(B) and PR(C).
• For each of the incoming links, we count the outgoing links of that source node, i.e. B has 4 outgoing links and C has 2 outgoing links.
• The same procedure is applicable for the remaining nodes and iterations.
• USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
• PR(B) = (1-0.8) + 0.8 * PR(A)/2 = (1-0.8) + 0.8 * 0.3/2 = 0.32
• PR(C) = (1-0.8) + 0.8 * PR(A)/2 = (1-0.8) + 0.8 * 0.3/2 = 0.32
• PR(D) = (1-0.8) + 0.8 * PR(B)/4 = (1-0.8) + 0.8 * 0.32/4 = 0.264
• PR(E) = (1-0.8) + 0.8 * PR(B)/4 = (1-0.8) + 0.8 * 0.32/4 = 0.264
• PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2) = (1-0.8) + 0.8 * ((0.32/4) + (0.32/2)) = 0.392
• This is for iteration 1, now let us calculate for iteration 2.
• Iteration 2:
• By using the above-mentioned formula
• PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2) = (1-0.8) + 0.8 * ((0.32/4) + (0.32/2)) = 0.392
USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
• PR(B) = (1-0.8) + 0.8 * PR(A)/2 = (1-0.8) + 0.8 * 0.392/2 =
0.3568
• PR(C) = (1-0.8) + 0.8 * PR(A)/2 = (1-0.8) + 0.8 * 0.392/2 =
0.3568
• PR(D) = (1-0.8) + 0.8 * PR(B)/4 = (1-0.8) + 0.8 * 0.3568/4 =
0.2714
• PR(E) = (1-0.8) + 0.8 * PR(B)/4 = (1-0.8) + 0.8 * 0.3568/4 =
0.2714
• PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2) = (1-0.8) + 0.8 * ((0.3568/4) + (0.3568/2)) = 0.4141
NODES ITERATION 0 ITERATION 1 ITERATION 2
A 1/6 = 0.16 0.3 0.392
B 1/6 = 0.16 0.32 0.3568
C 1/6 = 0.16 0.32 0.3568
D 1/6 = 0.16 0.264 0.2714
E 1/6 = 0.16 0.264 0.2714
F 1/6 = 0.16 0.392 0.4141
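A minimal sketch that reproduces the table above. The incoming links and out-degrees are inferred from the worked calculations (A receives from B and C; B and C receive from A; D, E, F receive from B; F also receives from C; B has 4 outgoing links, A and C have 2), and updated ranks are used immediately within an iteration, as the slides instruct:

```python
# A minimal sketch reproducing the PageRank table above. The link structure is inferred
# from the worked calculations in the slides, and updated ranks are used immediately
# within each iteration ("USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS").
beta = 0.8   # factor applied to the link-following term, as in the formula above

in_links = {"A": ["B", "C"], "B": ["A"], "C": ["A"],
            "D": ["B"], "E": ["B"], "F": ["B", "C"]}
out_degree = {"A": 2, "B": 4, "C": 2}       # only the nodes that appear as sources above

pr = {node: 1 / 6 for node in in_links}     # iteration 0: 1 / (number of nodes)

for iteration in (1, 2):
    for node in ["A", "B", "C", "D", "E", "F"]:
        pr[node] = (1 - beta) + beta * sum(pr[src] / out_degree[src]
                                           for src in in_links[node])
    print(f"Iteration {iteration}:", {n: round(v, 4) for n, v in pr.items()})
```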

The Weighted PageRank algorithm is an extension of the PageRank algorithm that takes into account the weights of the links between pages.
These weights reflect the relative importance or authority of the pages linked to each other.
Consider a small network of web pages:
Page A links to Page B and Page C.
Page B links to Page C and Page D.
Page C links to Page D.
Page D links to Page A.
• Let's assign weights to these links based on their importance.
• We'll use values between 0 and 1, where higher values
indicate stronger connections:
• Link from A to B: Weight = 0.8
• Link from A to C: Weight = 0.5
• Link from B to C: Weight = 0.6
• Link from B to D: Weight = 0.9
• Link from C to D: Weight = 0.7
• Link from D to A: Weight = 0.4
• We start with an initial PageRank value for each page.
• Let's assume equal initial PageRank values for simplicity:
• Page A: PageRank = 1
• Page B: PageRank = 1
• Page C: PageRank = 1
• Page D: PageRank = 1
• Now, we iterate through Weighted PageRank algorithm to calculate the final
PageRank values. Initialization: Start with initial PageRank values.
• Iteration 1:
Page A's new PageRank = (0.4 * PageRank D) + (0.8 * PageRank B) + (0.5 *
PageRank C) = 0.4 + 0.8 + 0.5 = 1.7
Page B's new PageRank =(0.6 * PageRank C) + (0.9 * PageRank D) = 0.6 + 0.9 =
1.5
Page C's new PageRank = 0.7 * PageRank D = 0.7
Page D's new PageRank = 0.4 * PageRank A = 0.4
• Normalization: Normalize the PageRank values so they sum up to 1.
Page A's normalized PageRank = 1.7 / (1.7 + 1.5 + 0.7 + 0.4) ≈ 0.38
Page B's normalized PageRank = 1.5 / (1.7 + 1.5 + 0.7 + 0.4) ≈ 0.34
Page C's normalized PageRank = 0.7 / (1.7 + 1.5 + 0.7 + 0.4) ≈ 0.15
Page D's normalized PageRank = 0.4 / (1.7 + 1.5 + 0.7 + 0.4) ≈ 0.13
• After normalization, the final PageRank values are approximately:
• Page A: PageRank ≈ 0.38; Page B: PageRank ≈ 0.34; Page C: PageRank ≈ 0.15;
• Page D: PageRank ≈ 0.13
• These values represent the relative importance or authority of each page within
the network, taking into account the weighted links between them.
• Information Extraction
• Large volume of data creates challenges for info extraction
techniques as unstructured data is growing fast.
• This large volume of unstructured big data was too much
for traditional info extraction algorithms to manage.
• Due to the high volume and variety of data, these
Information Extraction systems needed to be upgraded.
• Information extraction (IE)
• Task of automatically extracting structured information
from unstructured and/or semi-structured machine-
readable documents and other electronically represented
sources is known as information extraction (IE).
• When there are a significant number of customers, manually
assessing Customer Feedback, for example, can be tedious, error-
prone, and time-consuming.
• There's a good chance we'll overlook a dissatisfied consumer.
• Fortunately, sentiment analysis can aid in the improvement of
customer support interactions' speed and efficacy.
• By doing sentiment analysis on all the incoming tickets and
prioritizing them, one can quickly identify most dissatisfied
customers or the most important issues.
• One might also allocate tickets to the appropriate individual or team
to handle them.
• As a result, Consumer satisfaction will improve dramatically.
