Web Mining 1-10
Basic Formula:
The PageRank PR of a page Pi is computed as:

PR(Pi) = (1 - d)/N + d * Σ [ PR(Pj) / L(Pj) ]  over all pages Pj that link to Pi

where d is the damping factor (commonly 0.85), N is the total number of pages, and L(Pj) is the number of outbound links on page Pj.
Implementation Steps:
1. Initialize PageRank Values: Assign an initial value to each page, often set as 1/N,
where N is the total number of pages.
2. Iterative Computation: Update the PageRank values using the formula iteratively until the
values converge (i.e., the difference between consecutive iterations is below a certain
threshold).
3. Normalization: Ensure the PageRank values add up to 1.
Code:
import numpy as np

def page_rank(graph, d=0.85, max_iterations=100, tol=1e-6):
    n = graph.shape[0]
    ranks = np.full(n, 1.0 / n)  # Step 1: initialize each page to 1/N
    # Each page splits its rank equally among its outgoing links
    M = graph / np.maximum(graph.sum(axis=1), 1)[:, None]
    for _ in range(max_iterations):  # Step 2: iterate until convergence
        new_ranks = (1 - d) / n + d * M.T.dot(ranks)
        if np.abs(new_ranks - ranks).sum() <= tol:
            break
        ranks = new_ranks
    return ranks / ranks.sum()  # Step 3: normalize so the ranks sum to 1

# Execute PageRank on a small sample graph (adjacency matrix)
graph = np.array([[0, 1, 1], [1, 0, 0], [0, 1, 0]])
ranks = page_rank(graph)
print(ranks)
Output:
Experiment 2
Aim: Analyze the link structure of the web using the PageRank algorithm.
Theory:
Concept of Web as a Graph:
In the context of PageRank, the web is viewed as a directed graph. Web pages are nodes,
and hyperlinks between them are edges. A link from page A to page B is an edge
directed from A to B.
The basic idea is that a page that is linked to by other important pages should be ranked
higher.
Random Surfer Model:
PageRank is based on a random surfer model, where a user randomly clicks on links. Each
click takes the user from one page to another.
The surfer has two options at each step:
1. Follow one of the links on the current page (with probability d, the damping factor).
2. Randomly jump to any other page with the remaining small probability 1 - d (a
random-surfer simulation sketch follows this list).
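A minimal simulation of the random surfer, to make the model concrete; the toy graph, step count, and page names are illustrative assumptions:

import random

# Hypothetical toy web graph: page -> pages it links to
links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
d = 0.85          # damping factor: probability of following a link
steps = 100000    # number of surfer steps (illustrative)

visits = {page: 0 for page in links}
current = 'A'
for _ in range(steps):
    if random.random() < d and links[current]:
        current = random.choice(links[current])   # follow an out-link
    else:
        current = random.choice(list(links))      # random jump to any page
    visits[current] += 1

# Relative visit frequencies approximate the PageRank scores
for page, count in visits.items():
    print(page, count / steps)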
The Algorithm:
The importance of page i (denoted as PR(i)) is determined by the PageRank
scores of the pages that link to it.
If a page j links to page i, part of page j's importance is transferred to page i. The more
links page j has, the less importance is passed to each individual linked page.
The PageRank of a page i can be defined as:

PR(i) = (1 - d)/n + d * Σ [ PR(j) / L(j) ]  over all pages j that link to i

where L(j) is the number of outgoing links on page j.
Iterative Process:
The PageRank values are calculated iteratively. Initially, each page is assigned an equal rank
(e.g., 1/n for n pages).
The algorithm recalculates PageRank values using the formula until the ranks converge (i.e.,
they stop changing significantly from one iteration to the next).
Code:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

def page_rank(graph, d=0.85, max_iterations=100, tol=1e-6):
    n = graph.shape[0]
    # Each page splits its rank equally among its outgoing links
    out_degree = graph.sum(axis=1)
    stochastic_matrix = graph / np.maximum(out_degree, 1)[:, None]
    damping_vector = np.full(n, (1 - d) / n)
    ranks = np.full(n, 1.0 / n)
    # Iterative calculation
    for _ in range(max_iterations):
        new_ranks = damping_vector + d * stochastic_matrix.T.dot(ranks)
        if np.linalg.norm(new_ranks - ranks, ord=1) <= tol:
            break
        ranks = new_ranks
    return ranks
# Sample Web Graph (Adjacency Matrix)
# Example: Web graph with 5 pages (0 to 4)
graph = np.array([[0, 1, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 0, 0, 1, 0],
                  [0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 0]])
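With the function above, the sample graph can be executed and, since networkx and matplotlib are already imported, visualized; a minimal usage sketch (layout and styling choices are illustrative):

# Run PageRank and show the resulting scores
ranks = page_rank(graph)
for page, rank in enumerate(ranks):
    print(f"Page {page}: {rank:.4f}")

# Visualize the link structure as a directed graph
G = nx.from_numpy_array(graph, create_using=nx.DiGraph)
nx.draw(G, with_labels=True, node_color='lightblue', arrows=True)
plt.show()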
Output:
Experiment 3
Code:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models required by word_tokenize

# Sample text
text = """Text preprocessing is an essential task in NLP! We remove stopwords, punctuation,
numbers, and apply lemmatization to simplify the text."""

# Step 3: Tokenization
tokens = word_tokenize(text)
print(tokens)
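The sample text names stopword removal, punctuation and number removal, and lemmatization; a minimal sketch of those remaining steps, assuming the NLTK stopword list and WordNet lemmatizer (resource downloads included):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Keep alphabetic tokens only (drops punctuation and numbers),
# remove stopwords, and lemmatize what remains
clean_tokens = [lemmatizer.lemmatize(t.lower())
                for t in tokens
                if t.isalpha() and t.lower() not in stop_words]
print(clean_tokens)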
Output:
Experiment 4
Output:
Experiment 5
Output:
Experiment 6
2. Feature Extraction:
o Words or n-grams (combinations of adjacent words) are often used as features.
o Some approaches use the presence or frequency of specific words or phrases to
infer sentiment.
3. Sentiment Scoring:
o Each word in the text is assigned a polarity score based on a pre-existing lexicon or
a trained machine learning model. The overall sentiment score is a weighted sum or
average of these scores (a minimal lexicon-based sketch follows this list).
o Polarity typically ranges from -1 (very negative) to 1 (very positive). A score near 0
indicates a neutral sentiment.
4. Classification:
o The sentiment score is then used to classify the input as positive, negative, or
neutral.
o More advanced models can provide multi-class classification (e.g., happy, sad,
angry, etc.) or regression-based sentiment scores for more nuanced understanding.
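To make the scoring step (point 3) concrete, here is a minimal lexicon-based sketch; the tiny word list and its scores are illustrative assumptions, not a real sentiment lexicon:

# Hypothetical mini-lexicon mapping words to polarity scores in [-1, 1]
lexicon = {'good': 0.7, 'great': 0.9, 'bad': -0.7, 'terrible': -0.9, 'okay': 0.1}

def lexicon_score(text):
    words = text.lower().split()
    scores = [lexicon[w] for w in words if w in lexicon]
    # Average the per-word scores; no known words means neutral (0.0)
    return sum(scores) / len(scores) if scores else 0.0

print(lexicon_score("The movie was great but the ending was bad"))  # 0.1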
Code:
from textblob import TextBlob

def analyze_sentiment(text):
    # TextBlob's polarity is a float in [-1, 1]
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        sentiment = "Positive"
    elif polarity < 0:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    return sentiment, polarity

# Example usage
text = input("Enter a sentence for sentiment analysis: ")
sentiment, polarity = analyze_sentiment(text)
print(f"Sentiment: {sentiment}")
print(f"Polarity Score: {polarity}")
Output:
Experiment 7
Code:
from cryptography.fernet import Fernet

# Function to generate and save an encryption key (use once to generate a key)
def generate_key():
    key = Fernet.generate_key()
    with open('secret.key', 'wb') as key_file:
        key_file.write(key)
    print("Key generated and saved to 'secret.key'")

# Decrypt a message with the saved key; the wrapper is reconstructed
# around the original error handling
def decrypt_message(encrypted_message):
    try:
        with open('secret.key', 'rb') as key_file:
            fernet = Fernet(key_file.read())
        print("Decrypted message:", fernet.decrypt(encrypted_message).decode())
    except Exception as e:
        print("Decryption failed. Error:", str(e))
Output:
Experiment 8
Aim: Perform web usage mining on user interaction log data.
Theory:
Web Usage Mining (WUM) is a branch of data mining that focuses on discovering meaningful
patterns and insights from web data, specifically user interaction logs. It helps in understanding user
behavior, improving user experience, and optimizing web content.
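Because web usage mining starts from raw server logs, a minimal parsing sketch may help; it assumes the Common Log Format, and the sample line is made up for illustration:

import re

# Hypothetical log entry in Common Log Format
log_line = '192.168.1.5 - - [10/Oct/2024:13:55:36 +0000] "GET /products HTTP/1.1" 200 2326'

# Groups: client IP, timestamp, request line, status code, response size
pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(.*?)" (\d{3}) (\d+)'
match = re.match(pattern, log_line)
if match:
    ip, timestamp, request, status, size = match.groups()
    print(ip, timestamp, request, status, size)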
Code:
import pandas as pd
import matplotlib.pyplot as plt

# Sample usage data: visit counts per page (illustrative values)
data = {'page': ['/home', '/products', '/cart', '/checkout'],
        'visits': [120, 80, 35, 20]}

# Create a DataFrame
df = pd.DataFrame(data)

# Plot how often each page was visited
df.plot(kind='bar', x='page', y='visits', legend=False)
plt.ylabel('Number of visits')
plt.show()
Output:
Experiment 9
Aim: Implement a recommender system.
Theory:
Recommender systems are a type of information filtering system that predicts and suggests items
that a user might like. They are widely used in various applications such as e-commerce, streaming
services, and social media to improve user experience by personalizing content.
2. Collaborative Filtering:
o User-Based Collaborative Filtering: Recommends items liked by similar users.
o Item-Based Collaborative Filtering: Recommends items that are similar to items
the user has previously interacted with (a minimal sketch follows this list).
o Works based on user-item interactions without needing detailed item features.
3. Hybrid Systems:
o Combines content-based and collaborative filtering for more accurate
recommendations.
o Mitigates the limitations of each method individually.
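To make item-based collaborative filtering concrete, here is a minimal sketch on a made-up user-item ratings matrix; cosine similarity between item columns drives the recommendation:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings matrix: rows = users, columns = items, 0 = unrated
ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [0, 0, 5, 4]])

# Similarity between items, computed over their rating columns
item_similarity = cosine_similarity(ratings.T)

# Score items for user 0 as a similarity-weighted sum of their ratings
user = ratings[0]
scores = item_similarity.dot(user)
scores[user > 0] = -np.inf   # exclude items the user has already rated
print("Recommend item:", int(np.argmax(scores)))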
Applications:
E-commerce: Product recommendations (e.g., Amazon).
Streaming Services: Movie and show recommendations (e.g., Netflix, YouTube).
Music Platforms: Playlist and song suggestions (e.g., Spotify).
Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Sample movie data (illustrative titles and descriptions)
data = {'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
        'description': ['space adventure sci-fi',
                        'romantic comedy drama',
                        'sci-fi robots space',
                        'comedy romance love']}
# Create a DataFrame
df = pd.DataFrame(data)

# TF-IDF vectors of the descriptions and their pairwise cosine similarities
tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(df['description'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

def recommend(title):
    idx = df.index[df['title'] == title][0]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the indices of the top 3 most similar movies (excluding itself)
    sim_scores = sim_scores[1:4]  # Skip the first one as it is the movie itself
    return df['title'].iloc[[i for i, _ in sim_scores]]
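Usage with the sample data above (the recommend wrapper and the movie titles are illustrative):

# Print the three movies most similar to a given title
print(recommend('Movie A'))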
Output:
Experiment 10
Aim: Perform web structure mining by analyzing the link structure of web pages.
Theory:
Web Structure Mining is a branch of web mining that focuses on analyzing the structure of web
pages and the relationships between them. It aims to discover and model the link structure of the web
to understand how web pages are connected. This type of mining is essential for search engines, web
crawlers, and algorithms like Google's PageRank, which use link structures to rank the importance of
web pages.
2. Link Analysis:
o In-degree: The number of incoming links to a page.
o Out-degree: The number of outgoing links from a page.
o Adjacency Matrix: Represents the link structure where rows and columns
correspond to web pages, and values indicate the presence of a link (a minimal
sketch follows this list).
3. Graph Representation:
o Web structure mining often uses graphs to model the web. Pages are nodes, and
hyperlinks are edges between nodes.
o Directed Graphs are used since hyperlinks typically have a direction (from one
page to another).
4. Applications:
o Search Engine Ranking: Algorithms like PageRank use the web structure to
determine the importance of pages.
o Web Navigation: Understanding how users traverse websites.
o Community Detection: Identifying clusters of pages with strong
interconnections.
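A minimal sketch of the link-analysis measures on an adjacency matrix; the matrix values are illustrative:

import numpy as np

# Illustrative adjacency matrix: entry [i, j] = 1 if page i links to page j
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])

out_degree = A.sum(axis=1)   # links leaving each page (row sums)
in_degree = A.sum(axis=0)    # links pointing at each page (column sums)
print("Out-degree:", out_degree)
print("In-degree:", in_degree)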
Code:
import networkx as nx
import matplotlib.pyplot as plt

# Create a directed graph to represent web structure
web_graph = nx.DiGraph()
# Sample hyperlinks between pages (illustrative edges)
web_graph.add_edges_from([('A', 'B'), ('A', 'C'), ('B', 'C'),
                          ('C', 'A'), ('D', 'C')])

# Visualize the link structure
nx.draw(web_graph, with_labels=True, node_color='lightblue', arrows=True)
plt.show()

# Calculate PageRank
page_ranks = nx.pagerank(web_graph)
print("\nPageRank of each page:")
for page, rank in page_ranks.items():
    print(f"{page}: {rank:.4f}")
Output: