Web Mining 1-10

Experiment 1

Aim: Implement the PageRank Algorithm in Web Mining


Theory:
The PageRank Algorithm is a fundamental technique used in web mining to rank web pages in
terms of their importance or relevance based on the hyperlink structure of the web. It was originally
developed by Larry Page and Sergey Brin, the founders of Google, to rank web pages based on their
inbound links.
Here is an overview of how to implement the PageRank Algorithm:
PageRank Algorithm Overview:
The PageRank value for a web page depends on the following:
1. Inbound Links: A page is considered important if many pages link to it.
2. Outgoing Links: The importance of a page is distributed across its outgoing links.
3. Damping Factor (d): This represents the probability (often set around 0.85) that a user will
continue clicking on links. The remaining probability (1 - d) represents the probability that a
user randomly jumps to any other page.

Basic Formula:
The PageRank PR(Pi) of a page Pi is computed as:

PR(Pi) = (1 - d)/N + d * Σ [ PR(Pj) / L(Pj) ] over all pages Pj that link to Pi

where N is the total number of pages, d is the damping factor, and L(Pj) is the number of outbound links on page Pj.

Implementation Steps:
1. Initialize PageRank Values: Assign an initial value to each page, often set as 1/N,
where N is the total number of pages.
2. Iterative Computation: Update the PageRank values using the formula iteratively until the
values converge (i.e., the difference between consecutive iterations is below a certain
threshold).
3. Normalization: Ensure the PageRank values add up to 1.
Code:
import numpy as np

def page_rank(graph, d=0.85, max_iterations=100, tol=1.0e-6):
    # Number of pages (nodes)
    N = len(graph)

    # Initialize the PageRank of each page to 1/N
    ranks = np.ones(N) / N

    # Create the transition probability matrix
    out_links = np.array([np.sum(graph[i]) for i in range(N)])

    # Avoid division by zero for pages with no out-links
    for i in range(N):
        if out_links[i] == 0:
            graph[i] = np.ones(N)
            out_links[i] = N

    # Normalize the graph to form the stochastic matrix
    stochastic_matrix = graph / out_links[:, None]

    # Damping factor vector
    damping_vector = np.ones(N) * (1 - d) / N

    # Iteratively compute PageRank
    for iteration in range(max_iterations):
        new_ranks = damping_vector + d * stochastic_matrix.T.dot(ranks)

        # Check for convergence (difference between iterations)
        if np.linalg.norm(new_ranks - ranks, ord=1) <= tol:
            break

        ranks = new_ranks

    return ranks

# Example graph as an adjacency matrix
graph = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 1],
                  [0, 0, 0, 1]])

# Execute PageRank
ranks = page_rank(graph)

print("PageRank values:", ranks)

Output:
Experiment 2

Aim: Analyze the link structure of the web using the PageRank algorithm.
Theory:
Concept of Web as a Graph:
 In the context of PageRank, the web is viewed as a directed graph. Web pages are nodes,
and hyperlinks between them are edges. A link from page A to page B is an edge directed
from A to B.
 The basic idea is that a page that is linked to by other important pages should be ranked
higher.
Random Surfer Model:
 PageRank is based on a random surfer model, where a user randomly clicks on links. Each
click takes the user from one page to another.
 The surfer has two options at each step:
1. Follow one of the links on the current page.
2. Randomly jump to any other page with a small probability (governed by the damping
factor d).
The Algorithm:
 The importance of page i (denoted as PR(i)) is determined by the PageRank scores of the
pages that link to it.
 If a page j links to page i, part of page j's importance is transferred to page i. The more
links page j has, the less importance is passed to each individual linked page.
 The PageRank of a page i can be defined as:
PR(i) = (1 - d)/n + d * Σ [ PR(j) / L(j) ] over all pages j that link to i,
where L(j) is the number of outgoing links on page j.
Iterative Process:
 The PageRank values are calculated iteratively. Initially, each page is assigned an equal rank
(e.g., 1/n for n pages).
 The algorithm recalculates PageRank values using the formula until the ranks converge (i.e.,
they stop changing significantly from one iteration to the next).
Code:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# PageRank Algorithm Implementation
def page_rank(graph, d=0.85, max_iterations=100, tol=1.0e-6):
    N = len(graph)
    ranks = np.ones(N) / N  # Initialize PageRank values for each node

    out_links = np.array([np.sum(graph[i]) for i in range(N)])

    # Handle pages with no outgoing links (dead ends)
    for i in range(N):
        if out_links[i] == 0:
            graph[i] = np.ones(N)
            out_links[i] = N

    stochastic_matrix = graph / out_links[:, None]

    damping_vector = np.ones(N) * (1 - d) / N

    # Iterative calculation
    for _ in range(max_iterations):
        new_ranks = damping_vector + d * stochastic_matrix.T.dot(ranks)
        if np.linalg.norm(new_ranks - ranks, ord=1) <= tol:
            break
        ranks = new_ranks

    return ranks

# Sample Web Graph (Adjacency Matrix)
# Example: Web graph with 5 pages (0 to 4)
graph = np.array([[0, 1, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 0, 0, 1, 0],
                  [0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 0]])

# Perform PageRank analysis
ranks = page_rank(graph)

# Display PageRank results
print("PageRank scores for each page:")
for i, rank in enumerate(ranks):
    print(f"Page {i} has a PageRank of {rank:.4f}")

# Visualize the web link structure and PageRank scores
def visualize_graph(graph, ranks):
    G = nx.DiGraph()  # Create a directed graph

    # Add edges to the graph based on the adjacency matrix
    for i in range(len(graph)):
        for j in range(len(graph)):
            if graph[i][j] == 1:
                G.add_edge(i, j)

    # Plotting the graph
    pos = nx.spring_layout(G)  # Layout for visualization
    plt.figure(figsize=(8, 6))

    # Node size is proportional to the PageRank score
    nx.draw(G, pos, with_labels=True, node_size=[5000 * rank for rank in ranks],
            node_color="skyblue", font_size=15, font_color="black", arrows=True)

    # Annotating PageRank scores
    labels = {i: f"{i}\nPR: {rank:.2f}" for i, rank in enumerate(ranks)}
    nx.draw_networkx_labels(G, pos, labels, font_size=12)

    plt.title("Web Graph with PageRank Scores")
    plt.show()

# Visualize the graph with PageRank scores
visualize_graph(graph, ranks)

Output:
Experiment 3

Aim: Text and webpage pre‐processing


Theory:
Text and webpage pre-processing is a crucial step in natural language processing (NLP),
information retrieval (IR), and data mining tasks. Raw textual data from webpages or documents
often contains noise, formatting issues, and other inconsistencies that need to be cleaned and
standardized before analysis. Pre-processing ensures that the data is in a structured, manageable,
and machine-readable form, improving the efficiency and accuracy of downstream algorithms such
as machine learning models, text mining, or search engines.
1. Text Pre-processing Steps:
The general process for preparing raw text involves multiple steps, which can vary based on the
task (e.g., sentiment analysis, document classification, or information retrieval). These steps include
tokenization, normalization, removal of irrelevant content, and feature extraction.

Step-by-Step Text Pre-processing:


1. Lowercasing:
o Convert all characters in the text to lowercase.
o Rationale: Many models (unless case-sensitive) don’t differentiate between upper
and lowercase forms of the same word (e.g., "Apple" and "apple").
2. Tokenization:
o Breaking text into smaller components, typically words or sentences, is called
tokenization.
o Techniques:
 Word Tokenization: Splitting a document into individual words (tokens).
E.g., "Hello, world!" → ["Hello", ",", "world", "!"].
 Sentence Tokenization: Splitting text into sentences. E.g., "I like cats. You
like dogs." → ["I like cats.", "You like dogs."].
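As a small illustrative sketch (separate from the main code below), NLTK's sent_tokenize and word_tokenize can be compared directly; this assumes the 'punkt' tokenizer data has been downloaded.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

sample = "I like cats. You like dogs."
print(sent_tokenize(sample))  # ['I like cats.', 'You like dogs.']
print(word_tokenize(sample))  # ['I', 'like', 'cats', '.', 'You', 'like', 'dogs', '.']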
Code:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text
text = """Text preprocessing is an essential task in NLP! We remove stopwords, punctuation,
numbers, and apply lemmatization to simplify the text."""

# Step 1: Lowercase the text
text = text.lower()

# Step 2: Remove punctuation and numbers using regular expressions
text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
text = re.sub(r'\d+', '', text)      # Remove numbers

# Step 3: Tokenization
tokens = word_tokenize(text)

# Step 4: Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

# Step 5: Lemmatization (reducing words to their base form)
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Final cleaned text
cleaned_text = ' '.join(tokens)
print("Cleaned Text:", cleaned_text)

Output:
Experiment 4

Aim: Social network analysis


Theory:
Social Network Analysis (SNA) is a method used to study the structure of relationships and
interactions among social entities (people, organizations, groups, etc.). It leverages both graph
theory and sociology to examine social structures through networks and graph-based visualization
techniques. The underlying theory emphasizes how relationships, rather than individual attributes,
shape social behavior and outcomes. Here's an overview of key concepts:
1. Networks and Graphs
In SNA, a social network is represented as a graph, where:
 Nodes (or vertices) represent individual actors (e.g., people, organizations).
 Edges (or links) represent relationships or connections between actors (e.g., friendships,
professional ties).
Types of Networks:
 Undirected Networks: Relationships where both nodes equally interact (e.g., mutual
friendships).
 Directed Networks: Asymmetrical relationships (e.g., follower-followee on social media).
 Weighted Networks: Relationships where ties have varying strengths (e.g., frequency of
interaction).
2. Key Theoretical Concepts
 Centrality: Measures the importance of a node within the network.
o Degree Centrality: Number of direct connections a node has.
o Betweenness Centrality: Measures the extent to which a node lies on paths between
other nodes, highlighting its role as a broker or bridge.
o Closeness Centrality: Measures how close a node is to all other nodes, emphasizing
the speed at which information can be disseminated from that node.
 Homophily: The tendency of individuals to associate with others who are similar to them in
some way (e.g., shared interests, socio-demographic traits). "Birds of a feather flock
together."
 Transitivity: A relationship structure where if A is connected to B, and B is connected to C,
then A is also likely to be connected to C (forming a triangle). This can explain the
formation of close-knit communities.
 Clustering: The tendency of nodes to form tightly-knit groups based on the density of
connections within a network, indicating cohesive subgroups.
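Transitivity and clustering are not computed in the main code below, so here is a minimal sketch (using a small hypothetical graph) of how NetworkX exposes them.

import networkx as nx

# Small hypothetical undirected network
G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

print("Transitivity:", nx.transitivity(G))             # fraction of closed triads in the graph
print("Clustering per node:", nx.clustering(G))        # local clustering coefficient of each node
print("Average clustering:", nx.average_clustering(G))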
Code:
import networkx as nx
import matplotlib.pyplot as plt

# Create a social network graph
G = nx.Graph()

# Add nodes (people) to the graph
G.add_nodes_from(["Wolf", "Dolphin", "Elephant", "Lion", "Deer", "Tiger"])

# Add edges (relationships) between the nodes
G.add_edges_from([("Wolf", "Dolphin"),
                  ("Wolf", "Elephant"),
                  ("Dolphin", "Elephant"),
                  ("Dolphin", "Lion"),
                  ("Elephant", "Deer"),
                  ("Lion", "Deer"),
                  ("Deer", "Tiger")])

# Compute basic social network metrics
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G)

# Print the computed metrics
print("Degree Centrality:", degree_centrality)
print("Closeness Centrality:", closeness_centrality)
print("Betweenness Centrality:", betweenness_centrality)
print("Eigenvector Centrality:", eigenvector_centrality)

# Draw the social network
pos = nx.spring_layout(G)
plt.figure(figsize=(8, 6))
nx.draw(G, pos, with_labels=True, node_size=2000, node_color="lightblue", font_size=15,
        font_color="black", font_weight="bold", edge_color="gray")
plt.title("Social Network Graph", size=20)
plt.show()

Output:
Experiment 5

Aim: Opinion mining


Theory:
Opinion mining, also known as sentiment analysis, is a natural language processing (NLP)
technique used to determine the emotional tone or attitude expressed in a piece of text. It aims to
analyze and extract subjective information from sources like social media posts, reviews, blogs, or
customer feedback. Here's a deeper look at the theory behind it:
Key Concepts:
1. Subjectivity vs. Objectivity:
o Subjective Text: This refers to statements that express opinions, emotions, or
personal viewpoints (e.g., "I love this product").
o Objective Text: These are factual statements that do not involve emotions (e.g.,
"The product weighs 500g").
Opinion mining focuses primarily on subjective data.
2. Polarity: The main task of opinion mining is to determine the polarity of a text. This can be:
o Positive: Text expresses positive emotions or satisfaction (e.g., "The service was
fantastic").
o Negative: Text reflects dissatisfaction or negative feelings (e.g., "The product is
terrible").
o Neutral: The text does not express any strong emotion or is factual (e.g., "The
weather is mild").
3. Granularity:
o Document Level: The system tries to classify the overall sentiment of an entire
document or review.
o Sentence Level: The system assesses the sentiment of individual sentences.
o Aspect Level: Sentiment analysis at this level looks for specific aspects or entities
(e.g., product features) and determines the sentiment associated with each aspect
(e.g., “The camera quality is great, but the battery life is poor”).
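The main code below scores each opinion as a whole (document level); as a minimal sketch of sentence-level granularity, TextBlob can also score each sentence separately (the example review here is hypothetical).

from textblob import TextBlob

# Sentence-level sentiment: each sentence gets its own polarity score
review = "The camera quality is great. The battery life is poor."
for sentence in TextBlob(review).sentences:
    print(sentence, "->", sentence.sentiment.polarity)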
Code:
from textblob import TextBlob

# List of example opinions (texts)
opinions = [
    "I love the new phone! The camera quality is amazing.",
    "The restaurant was terrible. The service was slow and the food was bad.",
    "This movie was fantastic! I enjoyed every bit of it.",
    "The product broke after just one use. I'm very disappointed.",
    "The customer support was helpful and resolved my issue quickly."
]

# Perform sentiment analysis for each opinion
for opinion in opinions:
    blob = TextBlob(opinion)
    sentiment = blob.sentiment
    print(f"Opinion: {opinion}")
    print(f"Sentiment: Polarity={sentiment.polarity}, Subjectivity={sentiment.subjectivity}")
    print("-" * 50)

Output:
Experiment 6

Aim: Sentiment analysis


Theory:
Sentiment analysis is a natural language processing (NLP) technique used to determine whether a
given piece of text expresses a positive, negative, or neutral sentiment. This type of analysis is
valuable in a variety of applications, such as analysing customer feedback, monitoring social
media, understanding market trends, and improving product reviews.

How Sentiment Analysis Works


1. Text Preprocessing:
o Tokenization: Splits the input text into words or sentences.
o Normalization: Converts text to lowercase, removes punctuation, and reduces words
to their base form (e.g., stemming or lemmatization).
o Noise Removal: Removes unnecessary elements like stopwords, numbers, or
HTML tags.

2. Feature Extraction:
o Words or n-grams (combinations of adjacent words) are often used as features.
o Some approaches use the presence or frequency of specific words or phrases to
infer sentiment.

3. Sentiment Scoring:
o Each word in the text is assigned a polarity score based on a pre-existing lexicon or
a trained machine learning model. The overall sentiment score is a weighted sum or
average of these scores.
o Polarity ranges typically from -1 (very negative) to 1 (very positive). A score near 0
indicates a neutral sentiment.

4. Classification:
o The sentiment score is then used to classify the input as positive, negative, or
neutral.
o More advanced models can provide multi-class classification (e.g., happy, sad,
angry, etc.) or regression-based sentiment scores for more nuanced understanding.
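To make the Sentiment Scoring step concrete, here is a minimal lexicon-based sketch; the word scores are hypothetical and far smaller than any real lexicon, and the TextBlob code below hides this kind of lookup behind its own trained lexicon.

# Hypothetical mini-lexicon of word polarities (illustrative only)
lexicon = {"good": 0.7, "great": 0.9, "bad": -0.7, "terrible": -0.9}

def lexicon_score(text):
    words = text.lower().split()
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0  # average polarity in roughly [-1, 1]

print(lexicon_score("The food was great but the service was bad"))  # ≈ 0.1 (mildly positive)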
Code:
from textblob import TextBlob

# Function to analyze sentiment
def analyze_sentiment(text):
    analysis = TextBlob(text)
    # Polarity ranges from -1 (negative) to 1 (positive)
    polarity = analysis.sentiment.polarity

    if polarity > 0:
        sentiment = "Positive"
    elif polarity < 0:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"

    return sentiment, polarity

# Example usage
text = input("Enter a sentence for sentiment analysis: ")
sentiment, polarity = analyze_sentiment(text)

print(f"Sentiment: {sentiment}")
print(f"Polarity Score: {polarity}")

Output:
Experiment 7

Aim: Privatization of web content


Theory:
Privatization of web content refers to the techniques and strategies used to protect and control the
access and visibility of digital information on the web. This is critical for safeguarding personal data,
ensuring data integrity, and maintaining user privacy. Methods to achieve this include:
1. Data Encryption: Encrypts data to make it accessible only to authorized users.
Common encryption standards include AES (Advanced Encryption Standard) and RSA
(Rivest–Shamir–Adleman).
2. Access Control: Limits who can access content using authentication and
authorization mechanisms, such as OAuth or JWT tokens.
3. Web Scraping Protection: Implements measures to prevent unauthorized extraction of
data from web pages.
4. Content Watermarking: Embeds invisible markers within content to identify
unauthorized duplication or sharing.
5. Data Masking: Modifies sensitive data in such a way that the original information is
protected while still being usable for testing or analytics.
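As a minimal sketch of data masking (item 5 above), a hypothetical helper can hide all but the last few characters of a sensitive value; real systems would use more sophisticated, format-preserving approaches.

# Hypothetical masking helper: keep only the last 4 characters visible
def mask(value, visible=4):
    return "*" * max(len(value) - visible, 0) + value[-visible:]

print(mask("4111111111111111"))  # ************1111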

Techniques for Implementing Web Content Privatization


 Secure HTTPS Protocol: Ensures secure communication between web servers and
clients.
 CAPTCHAs: Protects against bots by requiring human interaction.
 Robots.txt and Metadata: Controls search engine behavior and limits the indexing of
sensitive content.
 Rate Limiting: Prevents DDoS attacks and unauthorized scraping by limiting the
number of requests from a single IP address.
 Authentication and Authorization: Uses techniques like Multi-Factor
Authentication (MFA) to secure access.
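Rate limiting can likewise be sketched in a few lines; the sliding-window limit below (an assumed 5 requests per 10 seconds per IP) is a hypothetical example, not a production implementation.

import time
from collections import defaultdict, deque

WINDOW_SECONDS, LIMIT = 10.0, 5       # hypothetical limit: 5 requests per 10 seconds
recent_requests = defaultdict(deque)  # per-IP timestamps of recent requests

def allow(ip):
    now = time.time()
    q = recent_requests[ip]
    while q and now - q[0] > WINDOW_SECONDS:  # discard requests outside the window
        q.popleft()
    if len(q) < LIMIT:
        q.append(now)
        return True
    return False

print([allow("192.168.1.1") for _ in range(7)])  # the last two calls are rejected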

Code:
from cryptography.fernet import Fernet

# Function to generate and save an encryption key (use once to generate a key)
def generate_key():
    key = Fernet.generate_key()
    with open('secret.key', 'wb') as key_file:
        key_file.write(key)
    print("Key generated and saved to 'secret.key'")

# Function to load the encryption key
def load_key():
    return open('secret.key', 'rb').read()

# Function to encrypt a message
def encrypt_message(message):
    key = load_key()
    cipher = Fernet(key)
    encrypted_message = cipher.encrypt(message.encode())
    print("Encrypted message:", encrypted_message.decode())
    return encrypted_message

# Function to decrypt a message
def decrypt_message(encrypted_message):
    key = load_key()
    cipher = Fernet(key)
    try:
        decrypted_message = cipher.decrypt(encrypted_message).decode()
        print("Decrypted message:", decrypted_message)
        return decrypted_message
    except Exception as e:
        print("Decryption failed. Error:", str(e))

if __name__ == "__main__":
    print("Web Content Privatization Program")
    print("Options:")
    print("1. Generate Encryption Key")
    print("2. Encrypt a Message")
    print("3. Decrypt a Message")
    choice = input("Select an option (1/2/3): ")

    if choice == '1':
        generate_key()
    elif choice == '2':
        message = input("Enter the content to encrypt: ")
        encrypt_message(message)
    elif choice == '3':
        encrypted_message = input("Enter the encrypted content: ").encode()
        decrypt_message(encrypted_message)
    else:
        print("Invalid choice. Please select 1, 2, or 3.")

Output:

Experiment 8
Aim: Web usage mining
Theory:
Web Usage Mining (WUM) is a branch of data mining that focuses on discovering meaningful
patterns and insights from web data, specifically user interaction logs. It helps in understanding user
behavior, improving user experience, and optimizing web content.

Key Aspects of Web Usage Mining:


1. Data Sources:
o Web Server Logs: Records of user activity stored by web servers (e.g., IP
addresses, timestamps, HTTP methods).
o Application Server Logs: Detailed interactions at the application level.
o Client-Side Data: Information gathered from user devices through cookies,
browser plugins, etc.

2. Process of Web Usage Mining:


o Data Collection: Gathering raw web log data from servers or user sessions.
o Data Preprocessing: Cleaning and structuring the data (e.g., removing
irrelevant data, handling missing values).
o Pattern Discovery: Using data mining techniques like clustering, association rule
mining, and sequential pattern mining to find interesting user behavior patterns.
o Pattern Analysis: Interpreting the discovered patterns for insights.
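As a minimal sketch of the preprocessing step (separate from the main code further below), raw log entries can be grouped into sessions using an assumed 30-minute inactivity timeout.

import pandas as pd

# Hypothetical log entries for one visitor
log = pd.DataFrame({
    "IP": ["10.0.0.1", "10.0.0.1", "10.0.0.1"],
    "Timestamp": pd.to_datetime(["2024-11-05 10:00", "2024-11-05 10:10", "2024-11-05 11:30"]),
})
log = log.sort_values(["IP", "Timestamp"])

# A new session starts when the gap between consecutive requests exceeds 30 minutes
gap = log.groupby("IP")["Timestamp"].diff()
log["Session"] = (gap > pd.Timedelta(minutes=30)).cumsum()
print(log)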

Applications of Web Usage Mining:


 Personalization: Recommending products or content based on user behavior.
 Web Optimization: Improving site navigation and layout.
 Fraud Detection: Identifying unusual or potentially harmful web activities.
 User Behavior Analysis: Understanding how users interact with different parts of a
website.

Code:
import pandas as pd
import matplotlib.pyplot as plt

# Sample web log data (as a CSV for demonstration)
data = {
    'IP': ['192.168.1.1', '192.168.1.2', '192.168.1.1', '192.168.1.3', '192.168.1.2'],
    'Timestamp': ['2024-11-05 10:00:00', '2024-11-05 10:05:00', '2024-11-05 10:10:00',
                  '2024-11-05 10:15:00', '2024-11-05 10:20:00'],
    'Page': ['/home', '/about', '/products', '/contact', '/home'],
    'User_Agent': ['Mozilla/5.0', 'Chrome/90.0', 'Mozilla/5.0', 'Edge/91.0', 'Chrome/90.0']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Convert Timestamp to datetime format
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# Basic Preprocessing: Sorting data by timestamp
df = df.sort_values(by='Timestamp')

# Analyze page visits
page_visits = df['Page'].value_counts()

# Visualize the number of page visits
plt.figure(figsize=(10, 5))
page_visits.plot(kind='bar', color='skyblue')
plt.title('Page Visit Frequency')
plt.xlabel('Page')
plt.ylabel('Number of Visits')
plt.show()

# Grouping by IP to find unique visits and user paths
user_paths = df.groupby('IP')['Page'].apply(lambda pages: ' -> '.join(pages)).reset_index()
print("User Navigation Paths:")
print(user_paths)

# Finding the most common user agents
user_agents = df['User_Agent'].value_counts()
print("\nMost Common User Agents:")
print(user_agents)

Output:

Experiment 9
Aim: Recommender System
Theory:
Recommender systems are a type of information filtering system that predicts and suggests items
that a user might like. They are widely used in applications such as e-commerce, streaming
services, and social media to improve user experience by personalizing content.

Types of Recommender Systems:


1. Content-Based Filtering:
o Recommends items similar to those the user has liked in the past.
o Uses item features and user preferences to build a profile and match items.

2. Collaborative Filtering:
o User-Based Collaborative Filtering: Recommends items liked by similar users.
o Item-Based Collaborative Filtering: Recommends items that are similar to items
the user has previously interacted with.
o Works based on user-item interactions without needing detailed item features.

3. Hybrid Systems:
o Combines content-based and collaborative filtering for more accurate
recommendations.
o Mitigates the limitations of each method individually.
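The main code below implements content-based filtering; as a contrasting minimal sketch, item-based collaborative filtering can be illustrated with a small hypothetical user-item rating matrix and cosine similarity.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item rating matrix (0 = not rated)
ratings = pd.DataFrame(
    {"MovieA": [5, 4, 0], "MovieB": [4, 5, 1], "MovieC": [0, 1, 5]},
    index=["User1", "User2", "User3"])

# Item-item similarity computed purely from user interactions
item_sim = cosine_similarity(ratings.T)
print(pd.DataFrame(item_sim, index=ratings.columns, columns=ratings.columns))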

Applications:
 E-commerce: Product recommendations (e.g., Amazon).
 Streaming Services: Movie and show recommendations (e.g., Netflix, YouTube).
 Music Platforms: Playlist and song suggestions (e.g., Spotify).

Code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Sample dataset of movies with descriptions
data = {
    'Title': ['The Matrix', 'Inception', 'Interstellar', 'The Social Network', 'The Godfather'],
    'Description': [
        'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
        'A thief who steals corporate secrets through dream-sharing technology is given the inverse task of planting an idea.',
        'A team of explorers travels through a wormhole in space in an attempt to ensure humanity’s survival.',
        'The story of how Mark Zuckerberg created Facebook and faced personal and legal challenges.',
        'The aging patriarch of an organized crime dynasty transfers control of his empire to his reluctant son.'
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Step 1: Convert the descriptions into TF-IDF feature vectors
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Description'])

# Step 2: Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to recommend movies based on the title
def recommend(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = df.index[df['Title'] == title].tolist()[0]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the indices of the top 3 most similar movies (excluding itself)
    sim_scores = sim_scores[1:4]  # Skip the first one as it is the movie itself

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 3 most similar movies
    return df['Title'].iloc[movie_indices]

# Example usage
movie_title = 'Inception'
print(f"Movies recommended for '{movie_title}':")
print(recommend(movie_title))

Output:

Experiment 10
Aim: Web structure mining
Theory:
Web Structure Mining is a branch of web mining that focuses on analyzing the structure of web
pages and the relationships between them. It aims to discover and model the link structure of the web
to understand how web pages are connected. This type of mining is essential for search engines, web
crawlers, and algorithms like Google's PageRank, which use link structures to rank the importance of
web pages.

Key Concepts in Web Structure Mining:


1. Nodes and Edges:
o Nodes represent web pages or web elements.
o Edges represent hyperlinks connecting these pages.

2. Link Analysis:
o In-degree: The number of incoming links to a page.
o Out-degree: The number of outgoing links from a page.
o Adjacency Matrix: Represents the link structure where rows and columns
correspond to web pages, and values indicate the presence of a link.

3. Graph Representation:
o Web structure mining often uses graphs to model the web. Pages are nodes, and
hyperlinks are edges between nodes.
o Directed Graphs are used since hyperlinks typically have a direction (from one
page to another).

4. Applications:
o Search Engine Ranking: Algorithms like PageRank use the web structure to
determine the importance of pages.
o Web Navigation: Understanding how users traverse websites.
o Community Detection: Identifying clusters of pages with strong
interconnections.
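As a minimal sketch of the adjacency-matrix idea above (separate from the main code below), NetworkX can export the matrix of a small directed graph directly.

import networkx as nx

# Tiny hypothetical web graph: rows = source pages, columns = target pages
G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A")])
print(nx.to_numpy_array(G, nodelist=["A", "B", "C"]))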

Code:
import networkx as nx
import matplotlib.pyplot as plt

# Create a directed graph to represent web structure
web_graph = nx.DiGraph()

# Add nodes representing web pages
pages = ['Page A', 'Page B', 'Page C', 'Page D']
web_graph.add_nodes_from(pages)

# Add edges representing hyperlinks between the pages
web_graph.add_edges_from([
    ('Page A', 'Page B'),
    ('Page B', 'Page C'),
    ('Page C', 'Page A'),
    ('Page A', 'Page D'),
    ('Page D', 'Page B')
])

# Draw the graph
plt.figure(figsize=(10, 6))
nx.draw(web_graph, with_labels=True, node_color='lightblue', edge_color='gray',
        node_size=3000, font_size=15)
plt.title("Web Structure Graph")
plt.show()

# Calculate the in-degree and out-degree for each node
in_degrees = dict(web_graph.in_degree())
out_degrees = dict(web_graph.out_degree())

# Print in-degree and out-degree of each page
print("In-degree of each page:")
for page, in_degree in in_degrees.items():
    print(f"{page}: {in_degree}")
print("\nOut-degree of each page:")
for page, out_degree in out_degrees.items():
    print(f"{page}: {out_degree}")

# Calculate PageRank
page_ranks = nx.pagerank(web_graph)
print("\nPageRank of each page:")
for page, rank in page_ranks.items():
    print(f"{page}: {rank:.4f}")

Output:
