Web Mining 1-10

Experiment 1

Aim: Implement the PageRank Algorithm in Web Mining


Theory:
The PageRank Algorithm is a fundamental technique used in web mining to rank web pages in
terms of their importance or relevance based on the hyperlink structure of the web. It was originally
developed by Larry Page and Sergey Brin, the founders of Google, to rank web pages based on their
inbound links.
Here is an overview of how to implement the PageRank Algorithm:
PageRank Algorithm Overview:
The PageRank value for a web page depends on the following:
1. Inbound Links: A page is considered important if many pages link to it.
2. Outgoing Links: The importance of a page is distributed across its outgoing links.
3. Damping Factor (d): This represents the probability (often set around 0.85) that a user will
continue clicking on links. The remaining probability (1 - d) represents the probability that a
user randomly jumps to any other page.

Basic Formula:
The PageRank PR(Pi) of a page Pi is computed as:

PR(Pi) = (1 - d)/N + d * Σ [ PR(Pj) / L(Pj) ] over all pages Pj that link to Pi

where N is the total number of pages, d is the damping factor, and L(Pj) is the number of outbound links on page Pj.

Implementation Steps:
1. Initialize PageRank Values: Assign an initial value to each page, often set as 1/N,
where N is the total number of pages.
2. Iterative Computation: Update the PageRank values using the formula iteratively until the
values converge (i.e., the difference between consecutive iterations is below a certain
threshold).
3. Normalization: Ensure the PageRank values add up to 1.
Code:
import numpy as np

def page_rank(graph, d=0.85, max_iterations=100, tol=1.0e-6):
    # Number of pages (nodes)
    N = len(graph)

    # Initialize the PageRank of each page to 1/N
    ranks = np.ones(N) / N

    # Create the transition probability matrix
    out_links = np.array([np.sum(graph[i]) for i in range(N)])

    # Avoid division by zero for pages with no out-links
    for i in range(N):
        if out_links[i] == 0:
            graph[i] = np.ones(N)
            out_links[i] = N

    # Normalize the graph to form the stochastic matrix
    stochastic_matrix = graph / out_links[:, None]

    # Damping factor vector
    damping_vector = np.ones(N) * (1 - d) / N

    # Iteratively compute PageRank
    for iteration in range(max_iterations):
        new_ranks = damping_vector + d * stochastic_matrix.T.dot(ranks)

        # Check for convergence (difference between iterations)
        if np.linalg.norm(new_ranks - ranks, ord=1) <= tol:
            break

        ranks = new_ranks

    return ranks

# Example graph as an adjacency matrix
graph = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 1],
                  [0, 0, 0, 1]])

# Execute PageRank
ranks = page_rank(graph)

print("PageRank values:", ranks)

Output:
Experiment 2

Aim: Analyze the link structure of the web using the PageRank algorithm.
Theory:
Concept of Web as a Graph:
 In the context of PageRank, the web is viewed as a directed graph. Web pages are nodes,
and hyperlinks between them are edges. A link from page A to page B is an edge directed
from A to B.
 The basic idea is that a page that is linked to by other important pages should be ranked
higher.
Random Surfer Model:
 PageRank is based on a random surfer model, where a user randomly clicks on links. Each
click takes the user from one page to another.
 The surfer has two options at each step:
1. Follow one of the links on the current page.
2. Randomly jump to any other page with a small probability (governed by the damping
factor d).
The Algorithm:
 The importance of page i (denoted as PR(i)) is determined by the PageRank scores of the
pages that link to it.
 If a page j links to page i, part of page j's importance is transferred to page i. The more
links page j has, the less importance is passed to each individual linked page.
 The PageRank of a page i can be defined as:
PR(i) = (1 - d)/n + d * Σ [ PR(j) / L(j) ] over all pages j that link to i,
where L(j) is the number of outgoing links on page j.
Iterative Process:
 The PageRank values are calculated iteratively. Initially, each page is assigned an equal rank
(e.g., 1/n for n pages).
 The algorithm recalculates PageRank values using the formula until the ranks converge (i.e.,
they stop changing significantly from one iteration to the next).
Code:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# PageRank Algorithm Implementation
def page_rank(graph, d=0.85, max_iterations=100, tol=1.0e-6):
    N = len(graph)
    ranks = np.ones(N) / N  # Initialize PageRank values for each node

    out_links = np.array([np.sum(graph[i]) for i in range(N)])

    # Handle pages with no outgoing links (dead ends)
    for i in range(N):
        if out_links[i] == 0:
            graph[i] = np.ones(N)
            out_links[i] = N

    stochastic_matrix = graph / out_links[:, None]

    damping_vector = np.ones(N) * (1 - d) / N

    # Iterative calculation
    for _ in range(max_iterations):
        new_ranks = damping_vector + d * stochastic_matrix.T.dot(ranks)
        if np.linalg.norm(new_ranks - ranks, ord=1) <= tol:
            break
        ranks = new_ranks

    return ranks

# Sample Web Graph (Adjacency Matrix)
# Example: Web graph with 5 pages (0 to 4)
graph = np.array([[0, 1, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 0, 0, 1, 0],
                  [0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 0]])

# Perform PageRank analysis
ranks = page_rank(graph)

# Display PageRank results
print("PageRank scores for each page:")
for i, rank in enumerate(ranks):
    print(f"Page {i} has a PageRank of {rank:.4f}")

# Visualize the web link structure and PageRank scores
def visualize_graph(graph, ranks):
    G = nx.DiGraph()  # Create a directed graph

    # Add edges to the graph based on the adjacency matrix
    for i in range(len(graph)):
        for j in range(len(graph)):
            if graph[i][j] == 1:
                G.add_edge(i, j)

    # Plotting the graph
    pos = nx.spring_layout(G)  # Layout for visualization
    plt.figure(figsize=(8, 6))

    # Node size is proportional to the PageRank score
    nx.draw(G, pos, with_labels=True, node_size=[5000 * rank for rank in ranks],
            node_color="skyblue", font_size=15, font_color="black", arrows=True)

    # Annotating PageRank scores
    labels = {i: f"{i}\nPR: {rank:.2f}" for i, rank in enumerate(ranks)}
    nx.draw_networkx_labels(G, pos, labels, font_size=12)

    plt.title("Web Graph with PageRank Scores")
    plt.show()

# Visualize the graph with PageRank scores
visualize_graph(graph, ranks)

Output:
Experiment 3

Aim: Text and webpage pre‐processing


Theory:
Text and webpage pre-processing is a crucial step in natural language processing (NLP),
information retrieval (IR), and data mining tasks. Raw textual data from webpages or documents
often contains noise, formatting issues, and other inconsistencies that need to be cleaned and
standardized before analysis. Pre-processing ensures that the data is in a structured, manageable,
and machine-readable form, improving the efficiency and accuracy of downstream algorithms such
as machine learning models, text mining, or search engines.
1. Text Pre-processing Steps:
The general process for preparing raw text involves multiple steps, which can vary based on the
task (e.g., sentiment analysis, document classification, or information retrieval). These steps include
tokenization, normalization, removal of irrelevant content, and feature extraction.

Step-by-Step Text Pre-processing:


1. Lowercasing:
o Convert all characters in the text to lowercase.
o Rationale: Many models (unless case-sensitive) don’t differentiate between upper
and lowercase forms of the same word (e.g., "Apple" and "apple").
2. Tokenization:
o Breaking text into smaller components, typically words or sentences, is called
tokenization.
o Techniques:
 Word Tokenization: Splitting a document into individual words (tokens).
E.g., "Hello, world!" → ["Hello", ",", "world", "!"].
 Sentence Tokenization: Splitting text into sentences. E.g., "I like cats. You
like dogs." → ["I like cats.", "You like dogs."].
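As a small illustrative sketch (separate from the main code below), NLTK's sent_tokenize and word_tokenize can be compared directly; this assumes the 'punkt' tokenizer data has been downloaded.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

sample = "I like cats. You like dogs."
print(sent_tokenize(sample))  # ['I like cats.', 'You like dogs.']
print(word_tokenize(sample))  # ['I', 'like', 'cats', '.', 'You', 'like', 'dogs', '.']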
Code:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text
text = """Text preprocessing is an essential task in NLP! We remove stopwords, punctuation,
numbers, and apply lemmatization to simplify the text."""

# Step 1: Lowercase the text
text = text.lower()

# Step 2: Remove punctuation and numbers using regular expressions
text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
text = re.sub(r'\d+', '', text)      # Remove numbers

# Step 3: Tokenization
tokens = word_tokenize(text)

# Step 4: Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

# Step 5: Lemmatization (reducing words to their base form)
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Final cleaned text
cleaned_text = ' '.join(tokens)
print("Cleaned Text:", cleaned_text)

Output:
Experiment 4

Aim: Social network analysis


Theory:
Social Network Analysis (SNA) is a method used to study the structure of relationships and
interactions among social entities (people, organizations, groups, etc.). It leverages both graph
theory and sociology to examine social structures through networks and graph-based visualization
techniques. The underlying theory emphasizes how relationships, rather than individual attributes,
shape social behavior and outcomes. Here's an overview of key concepts:
1. Networks and Graphs
In SNA, a social network is represented as a graph, where:
 Nodes (or vertices) represent individual actors (e.g., people, organizations).
 Edges (or links) represent relationships or connections between actors (e.g., friendships,
professional ties).
Types of Networks:
 Undirected Networks: Relationships where both nodes equally interact (e.g., mutual
friendships).
 Directed Networks: Asymmetrical relationships (e.g., follower-followee on social media).
 Weighted Networks: Relationships where ties have varying strengths (e.g., frequency of
interaction).
2. Key Theoretical Concepts
 Centrality: Measures the importance of a node within the network.
o Degree Centrality: Number of direct connections a node has.
o Betweenness Centrality: Measures the extent to which a node lies on paths between
other nodes, highlighting its role as a broker or bridge.
o Closeness Centrality: Measures how close a node is to all other nodes, emphasizing
the speed at which information can be disseminated from that node.
 Homophily: The tendency of individuals to associate with others who are similar to them in
some way (e.g., shared interests, socio-demographic traits). "Birds of a feather flock
together."
 Transitivity: A relationship structure where if A is connected to B, and B is connected to C,
then A is also likely to be connected to C (forming a triangle). This can explain the
formation of close-knit communities.
 Clustering: The tendency of nodes to form tightly-knit groups based on the density of
connections within a network, indicating cohesive subgroups.
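Transitivity and clustering are not computed in the main code below, so here is a minimal sketch (using a small hypothetical graph) of how NetworkX exposes them.

import networkx as nx

# Small hypothetical undirected network
G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

print("Transitivity:", nx.transitivity(G))             # fraction of closed triads in the graph
print("Clustering per node:", nx.clustering(G))        # local clustering coefficient of each node
print("Average clustering:", nx.average_clustering(G))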
Code:
import networkx as nx
import matplotlib.pyplot as plt

# Create a social network graph
G = nx.Graph()

# Add nodes (people) to the graph
G.add_nodes_from(["Wolf", "Dolphin", "Elephant", "Lion", "Deer", "Tiger"])

# Add edges (relationships) between the nodes
G.add_edges_from([("Wolf", "Dolphin"),
                  ("Wolf", "Elephant"),
                  ("Dolphin", "Elephant"),
                  ("Dolphin", "Lion"),
                  ("Elephant", "Deer"),
                  ("Lion", "Deer"),
                  ("Deer", "Tiger")])

# Compute basic social network metrics
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G)

# Print the computed metrics
print("Degree Centrality:", degree_centrality)
print("Closeness Centrality:", closeness_centrality)
print("Betweenness Centrality:", betweenness_centrality)
print("Eigenvector Centrality:", eigenvector_centrality)

# Draw the social network
pos = nx.spring_layout(G)
plt.figure(figsize=(8, 6))
nx.draw(G, pos, with_labels=True, node_size=2000, node_color="lightblue", font_size=15,
        font_color="black", font_weight="bold", edge_color="gray")
plt.title("Social Network Graph", size=20)
plt.show()

Output:
Experiment 5

Aim: Opinion mining


Theory:
Opinion mining, also known as sentiment analysis, is a natural language processing (NLP)
technique used to determine the emotional tone or attitude expressed in a piece of text. It aims to
analyze and extract subjective information from sources like social media posts, reviews, blogs, or
customer feedback. Here's a deeper look at the theory behind it:
Key Concepts:
1. Subjectivity vs. Objectivity:
o Subjective Text: This refers to statements that express opinions, emotions, or
personal viewpoints (e.g., "I love this product").
o Objective Text: These are factual statements that do not involve emotions (e.g.,
"The product weighs 500g").
Opinion mining focuses primarily on subjective data.
2. Polarity: The main task of opinion mining is to determine the polarity of a text. This can be:
o Positive: Text expresses positive emotions or satisfaction (e.g., "The service was
fantastic").
o Negative: Text reflects dissatisfaction or negative feelings (e.g., "The product is
terrible").
o Neutral: The text does not express any strong emotion or is factual (e.g., "The
weather is mild").
3. Granularity:
o Document Level: The system tries to classify the overall sentiment of an entire
document or review.
o Sentence Level: The system assesses the sentiment of individual sentences.
o Aspect Level: Sentiment analysis at this level looks for specific aspects or entities
(e.g., product features) and determines the sentiment associated with each aspect
(e.g., “The camera quality is great, but the battery life is poor”).
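The main code below scores each opinion as a whole (document level); as a minimal sketch of sentence-level granularity, TextBlob can also score each sentence separately (the example review here is hypothetical).

from textblob import TextBlob

# Sentence-level sentiment: each sentence gets its own polarity score
review = "The camera quality is great. The battery life is poor."
for sentence in TextBlob(review).sentences:
    print(sentence, "->", sentence.sentiment.polarity)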
Code:
from textblob import TextBlob

# List of example opinions (texts)
opinions = [
    "I love the new phone! The camera quality is amazing.",
    "The restaurant was terrible. The service was slow and the food was bad.",
    "This movie was fantastic! I enjoyed every bit of it.",
    "The product broke after just one use. I'm very disappointed.",
    "The customer support was helpful and resolved my issue quickly."
]

# Perform sentiment analysis for each opinion
for opinion in opinions:
    blob = TextBlob(opinion)
    sentiment = blob.sentiment
    print(f"Opinion: {opinion}")
    print(f"Sentiment: Polarity={sentiment.polarity}, Subjectivity={sentiment.subjectivity}")
    print("-" * 50)

Output:
Experiment 6

Aim: Sentiment analysis


Theory:
Sentiment analysis is a natural language processing (NLP) technique used to determine whether a
given piece of text expresses a positive, negative, or neutral sentiment. This type of analysis is
valuable in a variety of applications, such as analysing customer feedback, monitoring social
media, understanding market trends, and improving product reviews.

How Sentiment Analysis Works


1. Text Preprocessing:
o Tokenization: Splits the input text into words or sentences.
o Normalization: Converts text to lowercase, removes punctuation, and reduces words
to their base form (e.g., stemming or lemmatization).
o Noise Removal: Removes unnecessary elements like stopwords, numbers, or
HTML tags.

2. Feature Extraction:
o Words or n-grams (combinations of adjacent words) are often used as features.
o Some approaches use the presence or frequency of specific words or phrases to
infer sentiment.

3. Sentiment Scoring:
o Each word in the text is assigned a polarity score based on a pre-existing lexicon or
a trained machine learning model. The overall sentiment score is a weighted sum or
average of these scores.
o Polarity ranges typically from -1 (very negative) to 1 (very positive). A score near 0
indicates a neutral sentiment.

4. Classification:
o The sentiment score is then used to classify the input as positive, negative, or
neutral.
o More advanced models can provide multi-class classification (e.g., happy, sad,
angry, etc.) or regression-based sentiment scores for more nuanced understanding.
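To make the Sentiment Scoring step concrete, here is a minimal lexicon-based sketch; the word scores are hypothetical and far smaller than any real lexicon, and the TextBlob code below hides this kind of lookup behind its own trained lexicon.

# Hypothetical mini-lexicon of word polarities (illustrative only)
lexicon = {"good": 0.7, "great": 0.9, "bad": -0.7, "terrible": -0.9}

def lexicon_score(text):
    words = text.lower().split()
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0  # average polarity in roughly [-1, 1]

print(lexicon_score("The food was great but the service was bad"))  # ≈ 0.1 (mildly positive)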
Code:
from textblob import TextBlob

# Function to analyze sentiment
def analyze_sentiment(text):
    analysis = TextBlob(text)
    # Polarity ranges from -1 (negative) to 1 (positive)
    polarity = analysis.sentiment.polarity

    if polarity > 0:
        sentiment = "Positive"
    elif polarity < 0:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"

    return sentiment, polarity

# Example usage
text = input("Enter a sentence for sentiment analysis: ")
sentiment, polarity = analyze_sentiment(text)

print(f"Sentiment: {sentiment}")
print(f"Polarity Score: {polarity}")

Output:
Experiment 7

Aim: Privatization of web content


Theory:
Privatization of web content refers to the techniques and strategies used to protect and control the
access and visibility of digital information on the web. This is critical for safeguarding personal data,
ensuring data integrity, and maintaining user privacy. Methods to achieve this include:
1. Data Encryption: Encrypts data to make it accessible only to authorized users.
Common encryption standards include AES (Advanced Encryption Standard) and RSA
(Rivest–Shamir–Adleman).
2. Access Control: Limits who can access content using authentication and
authorization mechanisms, such as OAuth or JWT tokens.
3. Web Scraping Protection: Implements measures to prevent unauthorized extraction of
data from web pages.
4. Content Watermarking: Embeds invisible markers within content to identify
unauthorized duplication or sharing.
5. Data Masking: Modifies sensitive data in such a way that the original information is
protected while still being usable for testing or analytics.
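As a minimal sketch of data masking (item 5 above), a hypothetical helper can hide all but the last few characters of a sensitive value; real systems would use more sophisticated, format-preserving approaches.

# Hypothetical masking helper: keep only the last 4 characters visible
def mask(value, visible=4):
    return "*" * max(len(value) - visible, 0) + value[-visible:]

print(mask("4111111111111111"))  # ************1111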

Techniques for Implementing Web Content Privatization


 Secure HTTPS Protocol: Ensures secure communication between web servers and
clients.
 CAPTCHAs: Protects against bots by requiring human interaction.
 Robots.txt and Metadata: Controls search engine behavior and limits the indexing of
sensitive content.
 Rate Limiting: Prevents DDoS attacks and unauthorized scraping by limiting the
number of requests from a single IP address.
 Authentication and Authorization: Uses techniques like Multi-Factor
Authentication (MFA) to secure access.
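Rate limiting can likewise be sketched in a few lines; the sliding-window limit below (an assumed 5 requests per 10 seconds per IP) is a hypothetical example, not a production implementation.

import time
from collections import defaultdict, deque

WINDOW_SECONDS, LIMIT = 10.0, 5       # hypothetical limit: 5 requests per 10 seconds
recent_requests = defaultdict(deque)  # per-IP timestamps of recent requests

def allow(ip):
    now = time.time()
    q = recent_requests[ip]
    while q and now - q[0] > WINDOW_SECONDS:  # discard requests outside the window
        q.popleft()
    if len(q) < LIMIT:
        q.append(now)
        return True
    return False

print([allow("192.168.1.1") for _ in range(7)])  # the last two calls are rejected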

Code:
from cryptography.fernet import Fernet

# Function to generate and save an encryption key (use once to generate a key)
def generate_key():
    key = Fernet.generate_key()
    with open('secret.key', 'wb') as key_file:
        key_file.write(key)
    print("Key generated and saved to 'secret.key'")

# Function to load the encryption key
def load_key():
    return open('secret.key', 'rb').read()

# Function to encrypt a message
def encrypt_message(message):
    key = load_key()
    cipher = Fernet(key)
    encrypted_message = cipher.encrypt(message.encode())
    print("Encrypted message:", encrypted_message.decode())
    return encrypted_message

# Function to decrypt a message
def decrypt_message(encrypted_message):
    key = load_key()
    cipher = Fernet(key)
    try:
        decrypted_message = cipher.decrypt(encrypted_message).decode()
        print("Decrypted message:", decrypted_message)
        return decrypted_message
    except Exception as e:
        print("Decryption failed. Error:", str(e))

if __name__ == "__main__":
    print("Web Content Privatization Program")
    print("Options:")
    print("1. Generate Encryption Key")
    print("2. Encrypt a Message")
    print("3. Decrypt a Message")
    choice = input("Select an option (1/2/3): ")

    if choice == '1':
        generate_key()
    elif choice == '2':
        message = input("Enter the content to encrypt: ")
        encrypt_message(message)
    elif choice == '3':
        encrypted_message = input("Enter the encrypted content: ").encode()
        decrypt_message(encrypted_message)
    else:
        print("Invalid choice. Please select 1, 2, or 3.")

Output:

Experiment 8
Aim: Web usage mining
Theory:
Web Usage Mining (WUM) is a branch of data mining that focuses on discovering meaningful
patterns and insights from web data, specifically user interaction logs. It helps in understanding user
behavior, improving user experience, and optimizing web content.

Key Aspects of Web Usage Mining:


1. Data Sources:
o Web Server Logs: Records of user activity stored by web servers (e.g., IP
addresses, timestamps, HTTP methods).
o Application Server Logs: Detailed interactions at the application level.
o Client-Side Data: Information gathered from user devices through cookies,
browser plugins, etc.

2. Process of Web Usage Mining:


o Data Collection: Gathering raw web log data from servers or user sessions.
o Data Preprocessing: Cleaning and structuring the data (e.g., removing
irrelevant data, handling missing values).
o Pattern Discovery: Using data mining techniques like clustering, association rule
mining, and sequential pattern mining to find interesting user behavior patterns.
o Pattern Analysis: Interpreting the discovered patterns for insights.
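As a minimal sketch of the preprocessing step (separate from the main code further below), raw log entries can be grouped into sessions using an assumed 30-minute inactivity timeout.

import pandas as pd

# Hypothetical log entries for one visitor
log = pd.DataFrame({
    "IP": ["10.0.0.1", "10.0.0.1", "10.0.0.1"],
    "Timestamp": pd.to_datetime(["2024-11-05 10:00", "2024-11-05 10:10", "2024-11-05 11:30"]),
})
log = log.sort_values(["IP", "Timestamp"])

# A new session starts when the gap between consecutive requests exceeds 30 minutes
gap = log.groupby("IP")["Timestamp"].diff()
log["Session"] = (gap > pd.Timedelta(minutes=30)).cumsum()
print(log)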

Applications of Web Usage Mining:


 Personalization: Recommending products or content based on user behavior.
 Web Optimization: Improving site navigation and layout.
 Fraud Detection: Identifying unusual or potentially harmful web activities.
 User Behavior Analysis: Understanding how users interact with different parts of a
website.

Code:
import pandas as pd
import matplotlib.pyplot as plt

# Sample web log data (as a CSV for demonstration)
data = {
    'IP': ['192.168.1.1', '192.168.1.2', '192.168.1.1', '192.168.1.3', '192.168.1.2'],
    'Timestamp': ['2024-11-05 10:00:00', '2024-11-05 10:05:00', '2024-11-05 10:10:00',
                  '2024-11-05 10:15:00', '2024-11-05 10:20:00'],
    'Page': ['/home', '/about', '/products', '/contact', '/home'],
    'User_Agent': ['Mozilla/5.0', 'Chrome/90.0', 'Mozilla/5.0', 'Edge/91.0', 'Chrome/90.0']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Convert Timestamp to datetime format
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# Basic Preprocessing: Sorting data by timestamp
df = df.sort_values(by='Timestamp')

# Analyze page visits
page_visits = df['Page'].value_counts()

# Visualize the number of page visits
plt.figure(figsize=(10, 5))
page_visits.plot(kind='bar', color='skyblue')
plt.title('Page Visit Frequency')
plt.xlabel('Page')
plt.ylabel('Number of Visits')
plt.show()

# Grouping by IP to find unique visits and user paths
user_paths = df.groupby('IP')['Page'].apply(lambda pages: ' -> '.join(pages)).reset_index()
print("User Navigation Paths:")
print(user_paths)

# Finding the most common user agents
user_agents = df['User_Agent'].value_counts()
print("\nMost Common User Agents:")
print(user_agents)

Output:

Experiment 9
Aim: Recommender System
Theory:
Recommender systems are a type of information filtering system that predicts and suggests items
that a user might like. They are widely used in applications such as e-commerce, streaming
services, and social media to improve user experience by personalizing content.

Types of Recommender Systems:


1. Content-Based Filtering:
o Recommends items similar to those the user has liked in the past.
o Uses item features and user preferences to build a profile and match items.

2. Collaborative Filtering:
o User-Based Collaborative Filtering: Recommends items liked by similar users.
o Item-Based Collaborative Filtering: Recommends items that are similar to items
the user has previously interacted with.
o Works based on user-item interactions without needing detailed item features.

3. Hybrid Systems:
o Combines content-based and collaborative filtering for more accurate
recommendations.
o Mitigates the limitations of each method individually.
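The main code below implements content-based filtering; as a contrasting minimal sketch, item-based collaborative filtering can be illustrated with a small hypothetical user-item rating matrix and cosine similarity.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item rating matrix (0 = not rated)
ratings = pd.DataFrame(
    {"MovieA": [5, 4, 0], "MovieB": [4, 5, 1], "MovieC": [0, 1, 5]},
    index=["User1", "User2", "User3"])

# Item-item similarity computed purely from user interactions
item_sim = cosine_similarity(ratings.T)
print(pd.DataFrame(item_sim, index=ratings.columns, columns=ratings.columns))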

Applications:
 E-commerce: Product recommendations (e.g., Amazon).
 Streaming Services: Movie and show recommendations (e.g., Netflix, YouTube).
 Music Platforms: Playlist and song suggestions (e.g., Spotify).

Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Sample dataset of movies with descriptions
data = {
    'Title': ['The Matrix', 'Inception', 'Interstellar', 'The Social Network', 'The Godfather'],
    'Description': [
        'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
        'A thief who steals corporate secrets through dream-sharing technology is given the inverse task of planting an idea.',
        'A team of explorers travels through a wormhole in space in an attempt to ensure humanity’s survival.',
        'The story of how Mark Zuckerberg created Facebook and faced personal and legal challenges.',
        'The aging patriarch of an organized crime dynasty transfers control of his empire to his reluctant son.'
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Step 1: Convert the descriptions into TF-IDF feature vectors
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Description'])

# Step 2: Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to recommend movies based on the title
def recommend(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = df.index[df['Title'] == title].tolist()[0]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the indices of the top 3 most similar movies (excluding itself)
    sim_scores = sim_scores[1:4]  # Skip the first one as it is the movie itself

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 3 most similar movies
    return df['Title'].iloc[movie_indices]

# Example usage
movie_title = 'Inception'
print(f"Movies recommended for '{movie_title}':")
print(recommend(movie_title))

Output:

Experiment 10
Aim: Web structure mining
Theory:
Web Structure Mining is a branch of web mining that focuses on analyzing the structure of web
pages and the relationships between them. It aims to discover and model the link structure of the web
to understand how web pages are connected. This type of mining is essential for search engines, web
crawlers, and algorithms like Google's PageRank, which use link structures to rank the importance of
web pages.

Key Concepts in Web Structure Mining:


1. Nodes and Edges:
o Nodes represent web pages or web elements.
o Edges represent hyperlinks connecting these pages.

2. Link Analysis:
o In-degree: The number of incoming links to a page.
o Out-degree: The number of outgoing links from a page.
o Adjacency Matrix: Represents the link structure where rows and columns
correspond to web pages, and values indicate the presence of a link.

3. Graph Representation:
o Web structure mining often uses graphs to model the web. Pages are nodes, and
hyperlinks are edges between nodes.
o Directed Graphs are used since hyperlinks typically have a direction (from one
page to another).

4. Applications:
o Search Engine Ranking: Algorithms like PageRank use the web structure to
determine the importance of pages.
o Web Navigation: Understanding how users traverse websites.
o Community Detection: Identifying clusters of pages with strong
interconnections.
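As a minimal sketch of the adjacency-matrix idea above (separate from the main code below), NetworkX can export the matrix of a small directed graph directly.

import networkx as nx

# Tiny hypothetical web graph: rows = source pages, columns = target pages
G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A")])
print(nx.to_numpy_array(G, nodelist=["A", "B", "C"]))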

Code:
import networkx as nx
import matplotlib.pyplot as plt

# Create a directed graph to represent web structure
web_graph = nx.DiGraph()

# Add nodes representing web pages
pages = ['Page A', 'Page B', 'Page C', 'Page D']
web_graph.add_nodes_from(pages)

# Add edges representing hyperlinks between the pages
web_graph.add_edges_from([
    ('Page A', 'Page B'),
    ('Page B', 'Page C'),
    ('Page C', 'Page A'),
    ('Page A', 'Page D'),
    ('Page D', 'Page B')
])

# Draw the graph
plt.figure(figsize=(10, 6))
nx.draw(web_graph, with_labels=True, node_color='lightblue', edge_color='gray',
        node_size=3000, font_size=15)
plt.title("Web Structure Graph")
plt.show()

# Calculate the in-degree and out-degree for each node
in_degrees = dict(web_graph.in_degree())
out_degrees = dict(web_graph.out_degree())

# Print in-degree and out-degree of each page
print("In-degree of each page:")
for page, in_degree in in_degrees.items():
    print(f"{page}: {in_degree}")
print("\nOut-degree of each page:")
for page, out_degree in out_degrees.items():
    print(f"{page}: {out_degree}")

# Calculate PageRank
page_ranks = nx.pagerank(web_graph)
print("\nPageRank of each page:")
for page, rank in page_ranks.items():
    print(f"{page}: {rank:.4f}")

Output:
