Web Mining Unit-1
UNIT-I
World Wide Web
The World Wide Web (WWW), often called the Web, is a system of interconnected web pages and
information that can be accessed using the Internet. It was created to help people share and find
information easily, using links that connect different pages together. The Web allows us to browse
websites, watch videos, shop online, and connect with others around the world through our computers
and phones. All public websites and web pages that people may access on their local computers and
other devices through the Internet are collectively known as the World Wide Web, or W3. Users can
reach further information by navigating the links that interconnect these pages and documents. This data
may be presented in text, image, audio, or video formats on the Internet.
What is WWW?
WWW stands for World Wide Web and is commonly known as the Web. The WWW was proposed by
Tim Berners-Lee at CERN in 1989. The WWW is defined as the collection of different websites around
the world, containing different information shared via local servers (or computers). Web pages are linked
together using hyperlinks, which are HTML-formatted and also referred to as hypertext; these pages are
the fundamental units of the Web and are accessed through the Hypertext Transfer Protocol (HTTP).
System Architecture
From the user’s point of view, the Web consists of a vast, worldwide collection of documents or web
pages. Each page may contain links to other pages anywhere in the world. The pages can be retrieved
and viewed using browsers, of which Internet Explorer, Netscape Navigator, and Google Chrome are
among the most popular. The browser fetches the requested page, interprets the text and formatting
commands on it, and displays the page, properly formatted, on the screen.
The basic model of how the Web works is shown in the figure below. Here the browser is displaying
a web page on the client machine. When the user clicks on a line of text that is linked to a page on the
abd.com server, the browser follows the hyperlink by sending a message to the abd.com server asking
it for the page.
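At the protocol level, following a hyperlink amounts to sending an HTTP GET request for the target page and reading back the HTML. Below is a minimal Python sketch of that exchange; example.com is just a placeholder standing in for a server such as abd.com:

```python
# Minimal sketch of what a browser does when it follows a hyperlink:
# send an HTTP GET request for the page and read the HTML response.
from urllib.request import urlopen

# example.com stands in for a server such as abd.com in the text above
with urlopen("http://example.com/") as response:
    html = response.read().decode("utf-8")

print(html[:100])  # the browser would parse and render this HTML
```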
Working of WWW
A web browser is used to access web pages. Web browsers can be defined as programs that display
text, data, pictures, animation, and video from the Internet. Hyperlinked resources on the World Wide
Web can be accessed using the software interfaces provided by web browsers. Initially, web browsers
were used only for surfing the Web, but they have since become more universal.
The diagram below indicates how the Web operates on the client-server architecture of the Internet.
When a user requests a web page or other information, the browser on the user's system sends a
request to the web server; the web server returns the requested resource to the browser, and the
browser finally presents it to the user who made the request.
Web browsers can be used for several tasks including conducting searches, mailing, transferring files,
and much more. Some of the commonly used browsers are Internet Explorer, Opera Mini, and Google
Chrome.
WWW vs Internet
• Origin: The WWW originated in 1989; the Internet originated in the 1960s.
• Definition: The WWW is an interconnected network of websites and documents that can be
accessed via the Internet; the Internet is used to connect a computer with other computers.
• Protocols: The WWW uses protocols such as HTTP; the Internet uses protocols such as TCP/IP.
• Basis: The WWW is based on software; the Internet is based on hardware.
• Relationship: The WWW is a service contained inside an infrastructure; the Internet is that
entire underlying infrastructure.
Data Mining Vs Web Mining
Data mining is the process of extracting knowledge or insights from large amounts of data
using various statistical and computational techniques. The data can be structured, semi-structured,
or unstructured, and can be stored in various forms such as databases, data warehouses, and data
lakes.
The primary goal of data mining is to discover hidden patterns and relationships in the data
that can be used to make informed decisions or predictions. This involves exploring the data
using various techniques such as clustering, classification, regression analysis, association
rule mining, and anomaly detection.
Data mining has a wide range of applications across various industries, including marketing,
finance, healthcare, and telecommunications. For example, in marketing, data mining can be
used to identify customer segments and target marketing campaigns, while in healthcare, it
can be used to identify risk factors for diseases and develop personalized treatment plans.
Data mining architecture refers to the overall design and structure of a data mining system. A
data mining architecture typically includes several key components, which work together to
perform data mining tasks and extract useful insights and information from data. Some of the
key components of a typical data mining architecture include:
• Data Sources: Data sources are the sources of data that are used in data mining. These
can include structured and unstructured data from databases, files, sensors, and other
sources. Data sources provide the raw data that is used in data mining and can be
processed, cleaned, and transformed to create a usable data set for analysis.
• Data Preprocessing: Data pre-processing is the process of preparing data for analysis.
This typically involves cleaning and transforming the data to remove errors,
inconsistencies, and irrelevant information, and to make it suitable for analysis. Data
preprocessing is an important step in data mining, as it ensures that the data is of high
quality and is ready for analysis.
• Data Mining Algorithms: Data mining algorithms are the algorithms and models that
are used to perform data mining. These algorithms can include supervised and
unsupervised learning algorithms, such as regression, classification, and clustering, as
well as more specialized algorithms for specific tasks, such as association rule mining
and anomaly detection. Data mining algorithms are applied to the data to extract useful
insights and information from it.
• Data Visualization: Data visualization is the process of presenting data and insights in a
clear and effective manner, typically using charts, graphs, and other visualizations. Data
visualization is an important part of data mining, as it allows data miners to
communicate their findings and insights to others in a way that is easy to understand
and interpret.
There are many different types of data mining, but they can generally be grouped into three
broad categories: descriptive, predictive, and prescriptive.
• Descriptive data mining involves summarizing and characterizing the data that is already
present, for example through clustering, association analysis, and summary statistics, in
order to describe what has happened.
• Predictive data mining involves using data to build models that can make predictions
or forecasts about future events or outcomes. This type of data mining is often used to
identify and model relationships between different variables, and to make predictions
about future events or outcomes based on those relationships (a minimal sketch follows
this list).
• Prescriptive data mining involves using data and models to make recommendations or
suggestions about actions or decisions. This type of data mining is often used to
optimize processes, allocate resources, or make other decisions that can help
organizations achieve their goals.
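As an illustration of predictive data mining, here is a minimal sketch, assuming scikit-learn is installed; the customer features, labels, and query values are all hypothetical:

```python
# Predictive mining sketch: train a decision tree on toy customer data
# and predict whether a new customer will respond to a campaign.
from sklearn.tree import DecisionTreeClassifier

# Features: [age, monthly_spend]; label: 1 = responded to a past campaign
X = [[25, 120], [40, 300], [35, 80], [50, 450], [23, 60], [45, 380]]
y = [0, 1, 0, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(model.predict([[30, 350]]))  # predicted response for a new customer
```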
Data warehousing and mining software is a type of software that is used to store, manage, and
analyze large data sets. This software is commonly used in the field of data warehousing and
data mining, and it typically includes tools and features for pre-processing, storing, querying,
and analyzing data.
Some of the most common types of data warehousing and mining software include:
• Data mining tools – Data mining tools are software tools that are used to extract
information and insights from large data sets. These tools typically include algorithms
and methods for exploring, modeling, and analyzing data, and they are commonly used
in the field of data mining.
• Data visualization tools – Data visualization tools are software tools that are used to
visualize and display data in graphical or pictorial form. These tools are commonly
used in data mining to explore and understand the data, and to communicate the results
of the analysis.
• Data warehousing platforms – Data warehousing platforms are software systems that
are designed to support the creation and management of data warehouses. These
platforms typically include tools and features for loading, transforming, and managing
data, as well as tools for querying and analyzing the data.
The working of a data mining system can be summarized as follows:
1. It all starts when the user puts up certain data mining requests; these requests are then
sent to the data mining engine for pattern evaluation.
2. These applications try to find the solution to the query using the already present
database.
3. The metadata then extracted is sent for proper analysis to the data mining engine which
sometimes interacts with pattern evaluation modules to determine the result.
4. This result is then sent to the front end in an easily understandable manner using a
suitable interface.
The key components of this architecture are:
1. Data Sources: Databases, the World Wide Web (WWW), and data warehouses are parts of
data sources. The data in these sources may be in the form of plain text, spreadsheets, or
other forms of media like photos or videos. The WWW is one of the biggest sources of data.
2. Database Server: The database server contains the actual data ready to be processed. It
performs the task of handling data retrieval as per the request of the user.
3. Data Mining Engine: It is one of the core components of the data mining architecture
that performs all kinds of data mining techniques like association, classification,
characterization, clustering, prediction, etc.
4. Pattern Evaluation Modules: They are responsible for finding interesting patterns in
the data and sometimes they also interact with the database servers for producing the
result of the user requests.
5. Graphic User Interface: Since the user cannot fully understand the complexity of the
data mining process so graphical user interface helps the user to communicate
effectively with the data mining system.
6. Knowledge Base: Knowledge Base is an important part of the data mining engine that is
quite beneficial in guiding the search for the result patterns. Data mining engines may
also sometimes get inputs from the knowledge base. This knowledge base may contain
data from user experiences. The objective of the knowledge base is to make the result
more accurate and reliable.
Data mining architectures can be classified by how tightly the mining system is coupled
with a database or data warehouse system:
1. No Coupling: The no-coupling architecture retrieves data directly from particular data
sources rather than from a database, even though using the database would be a more
efficient and accurate way to do the same. The no-coupling architecture is poor and is
only used for performing very simple data mining processes.
2. Loose Coupling: In the loose-coupling architecture, the data mining system retrieves data
from the database and stores the results back in those systems. This architecture is suited
to memory-based data mining.
3. Semi-Tight Coupling: It tends to use various advantageous features of the data
warehouse systems. It includes sorting, indexing, and aggregation. In this architecture,
an intermediate result can be stored in the database for better performance.
4. Tight Coupling: In this architecture, a data warehouse is considered one of the most
important components, and its features are employed for performing data mining tasks.
This architecture provides scalability, performance, and integrated information.
Association rules
Association rule mining finds interesting associations and relationships among large sets of data
items. An association rule shows how frequently an itemset occurs in a transaction. Association rule
learning is a type of unsupervised learning technique that checks for the dependency of one
data item on another data item and maps them accordingly so that the result can be more profitable.
It tries to find interesting relations or associations among the variables of a dataset, based on
rules that discover the interesting relations between variables in the database.
Market basket analysis is a technique used by various big retailers to discover the
associations between items. We can understand it by taking the example of a supermarket, where
all products that are frequently purchased together are placed together.
For example, if a customer buys bread, he will most likely also buy butter, eggs, or milk, so these
products are stored on the same shelf or nearby. Consider the diagram below.
Here, the "if" element is called the antecedent, and the "then" statement is called the consequent.
Relationships in which we can find an association between exactly two items are known as single
cardinality. Association rule mining is all about creating rules, and as the number of items increases,
the cardinality increases accordingly. So, to measure the associations between thousands of data
items, several metrics are used. These metrics are given below:
• Support
• Confidence
• Lift
Support
Support is the frequency of A, or how frequently an itemset appears in the dataset. It is defined as
the fraction of the transactions T that contain the itemset X:
Support(X) = (Number of transactions containing X) / (Total number of transactions T)
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often items X and
Y occur together in the dataset given that X has already occurred. It is the ratio of the
transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Lift
Lift is the ratio of the observed support to the expected support if X and Y were independent
of each other:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible ranges of values:
• Lift = 1: The occurrences of the antecedent and the consequent are independent of
each other.
• Lift > 1: The two itemsets are positively dependent on each other; the higher the lift,
the stronger the dependence.
• Lift < 1: One item is a substitute for the other, which means one item has
a negative effect on the occurrence of the other.
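These three metrics can be computed directly from a transaction list. Below is a minimal Python sketch over a toy, hypothetical set of transactions, for the rule {bread} → {butter}:

```python
# Compute Support, Confidence and Lift for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"butter"}
support_xy = support(X | Y)                    # Support(X U Y)
confidence = support_xy / support(X)           # Support(X U Y) / Support(X)
lift = support_xy / (support(X) * support(Y))  # observed / expected support

print(f"Support={support_xy:.2f}, Confidence={confidence:.2f}, Lift={lift:.2f}")
```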
Association rule learning can be implemented with the following well-known algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on
databases that contain transactions. The algorithm uses a breadth-first search and a hash
tree to count itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions for patients.
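A minimal sketch of Apriori's level-wise (breadth-first) search in Python, using hypothetical transactions and a made-up support threshold; real implementations add the hash-tree optimization mentioned above:

```python
# Level-wise frequent-itemset generation in the style of Apriori.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "butter", "eggs"},
]
MIN_SUPPORT = 0.4  # itemset must appear in at least 40% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

frequent = {}
k = 1
candidates = sorted({frozenset([i]) for t in transactions for i in t},
                    key=sorted)
while candidates:
    # Prune: keep only candidates that meet the minimum support
    survivors = [c for c in candidates if support(c) >= MIN_SUPPORT]
    frequent.update({c: support(c) for c in survivors})
    # Join: build (k+1)-itemsets from pairs of surviving k-itemsets
    k += 1
    candidates = sorted({a | b for a in survivors for b in survivors
                         if len(a | b) == k}, key=sorted)

for itemset, sup in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(sup, 2))
```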
Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first
search technique to find frequent itemsets in a transaction database. It executes faster than the
Apriori algorithm.
F-P Growth Algorithm
The F-P Growth algorithm stands for Frequent Pattern growth, and it is an improved version of the
Apriori algorithm. It represents the database in the form of a tree structure known as a
frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent
patterns.
Association rule learning has various applications in machine learning and data mining, including
the market basket analysis and healthcare uses described above.
Sequential Pattern Mining
For retail data, sequential patterns are useful for shelf placement and promotions. This industry,
along with telecommunications and other businesses, can also use sequential patterns for targeted
marketing, user retention, and several other tasks. The key terms are defined below:
• Sequence: A sequence is formally defined as an ordered set of items {s1, s2, s3, …, sn}.
As the name suggests, it is a sequence of items occurring together; it can be thought of
as a transaction, i.e., the items purchased together in one basket.
• Subsequence: The subset of the sequence is called a subsequence. Suppose {a, b, g, q, y,
e, c} is a sequence. The subsequence of this can be {a, b, c} or {y, e}. Observe that the
subsequence is not necessarily consecutive items of the sequence. From the sequences
of databases, subsequences are found from which the generalized sequence patterns are
found at the end.
• Sequence pattern: A subsequence is called a sequence pattern when it is found in multiple
sequences. The goal of the GSP algorithm is to mine the sequence patterns from a
large database of sequences. A subsequence qualifies as a sequence pattern when its
frequency is equal to or greater than the "support" threshold. For example, the pattern
<a, b> is a sequence pattern mined from the sequences {a, b, q} and {a, u, b}.
Sequential pattern mining, commonly performed with the GSP (Generalized Sequential Pattern)
algorithm, is a technique used to identify patterns in sequential data. The goal of GSP mining is
to discover patterns in data that occur over time, such as customer buying habits, website
navigation patterns, or sensor data.
Applications of GSP mining include:
1. Market basket analysis: GSP mining can be used to analyze customer buying habits and
identify products that are frequently purchased together. This can help businesses to
optimize their product placement and marketing strategies.
2. Fraud detection: GSP mining can be used to identify patterns of behavior that are
indicative of fraud, such as unusual patterns of transactions or access to sensitive data.
3. Website navigation: GSP mining can be used to analyze website navigation patterns,
such as the sequence of pages visited by users, and identify areas of the website that are
frequently accessed or ignored.
4. Sensor data analysis: GSP mining can be used to analyze sensor data, such as data from
IoT devices, and identify patterns in the data that are indicative of certain conditions or
states.
5. Social media analysis: GSP mining can be used to analyze social media data, such as
posts and comments, and identify patterns in the data that indicate trends, sentiment, or
other insights.
6. Medical data analysis: GSP mining can be used to analyze medical data, such as patient
records, and identify patterns in the data that are indicative of certain health conditions
or trends.
The main algorithmic approaches to sequential pattern mining are:
• Apriori-based Approaches
o GSP
o SPADE
• Pattern-Growth-based Approaches
o FreeSpan
o PrefixSpan
An example sequence database (SID : sequence):
200 : <(ad)c(bcd)(abe)>
300 : <(ef)(ab)(def)cb>
400 : <eg(adf)cbc>
Transaction: The sequence consists of many elements which are called transactions.
k-length Sequence:
The number of items involved in a sequence is denoted by k. A sequence of 2 items is called a
2-length sequence. This term comes into use while finding the 2-length candidate sequences.
Examples of 2-length sequences are: {ab}, {(ab)}, {bc}, and {(bc)}.
• {bc} denotes a 2-length sequence where b and c are two different transactions. This can
also be written as {(b)(c)}
• {(bc)} denotes a 2-length sequence where b and c are the items belonging to the same
transaction, therefore enclosed in the same parenthesis. This can also be written as
{(cb)}, because the order of items in the same transaction does not matter.
Support:
Support means frequency: the number of occurrences of a given k-length sequence in the
sequence database. While computing support, the order of items is taken into account.
Illustration:
s1: <a(bc)b(cd)>
s2: <b(ab)abc(de)>
We need to find the support of {ab} and {(bc)}.
For {ab} = {(a)(b)}, item a must appear in one element and b in a later element:
s1: <a(bc)b(cd)> contains a followed by a later b, so it counts.
s2: <b(ab)abc(de)> contains a (in the element (ab)) followed by a later b, so it also counts.
Hence, the support of {ab} is 2.
For {(bc)}, b and c must be present in the same element; their order then does not matter.
s1: <a(bc)b(cd)> contains the element (bc), the first occurrence.
s2: <b(ab)abc(de)> seems to match, but does not: b and c are present in different elements
here, so we do not count it.
Hence, the support of {(bc)} is 1.
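The same support-counting logic can be expressed in a short Python sketch; sequences are represented as lists of item-sets, and all names are hypothetical:

```python
# Count the support of a sequential pattern in a tiny sequence database.
def contains(sequence, pattern):
    """True if the pattern (a list of item-sets) occurs in the sequence:
    elements must match in order, and all items of one pattern element
    must fall inside a single transaction of the sequence."""
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later transaction
    return True

s1 = [{"a"}, {"b", "c"}, {"b"}, {"c", "d"}]                # <a(bc)b(cd)>
s2 = [{"b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}, {"d", "e"}]  # <b(ab)abc(de)>
database = [s1, s2]

for pattern in ([{"a"}, {"b"}], [{"b", "c"}]):             # {ab} and {(bc)}
    print(pattern, "support =", sum(contains(s, pattern) for s in database))
```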
Pruning Phase:
While building Ck (candidate set of k-length), we delete a candidate sequence that has a
contiguous (k-1) subsequence whose support count is less than the minimum support
(threshold). Also, delete a candidate sequence that has any subsequence without minimum
support.
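A minimal sketch of this pruning step, reusing the list-of-item-sets representation from the previous sketch (the helper names and example sequences are made up):

```python
# Apriori-style pruning for GSP: drop a candidate k-sequence if any of
# its (k-1)-subsequences is not in the frequent (k-1)-sequence set.
def subsequences(pattern):
    """All (k-1)-subsequences obtained by deleting one item."""
    subs = []
    for i, element in enumerate(pattern):
        for item in element:
            shrunk = element - {item}
            sub = pattern[:i] + ([shrunk] if shrunk else []) + pattern[i + 1:]
            subs.append(sub)
    return subs

def prune(candidates, frequent_prev):
    """Keep only candidates whose every (k-1)-subsequence is frequent."""
    return [c for c in candidates
            if all(s in frequent_prev for s in subsequences(c))]

frequent_2 = [[{"a"}, {"b"}], [{"b"}, {"c"}], [{"a"}, {"c"}]]
candidates_3 = [[{"a"}, {"b"}, {"c"}],  # kept: <ab>, <bc>, <ac> all frequent
                [{"a"}, {"c"}, {"b"}]]  # pruned: <cb> is not frequent
print(prune(candidates_3, frequent_2))
```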
Here are some scenarios where machine learning can help in tackling the challenges of
data mining.
1. The quality of the output of data mining tools depends on the quality of the input data, and
the tools themselves may not address data quality issues. This leads to wrong results, as the
tool analyzes faulty data, so it is important to clean the data before processing it.
In such situations, machine learning algorithms are recommended as they can be incorporated
with data mining tools to automate the data entry process and get quality data. This
combination can easily identify any duplicate data and eliminate it. After this, a random forest
algorithm can be used to classify the data.
2. Data mining tools can be used to identify process-related issues, but they cannot find the root
cause of the issues. Machine learning algorithms, on the contrary, can help in solving the
problem. We can also introduce software with root cause analysis and data mining tools that
can tackle these kinds of issues.
3. Real-time data can be structured or unstructured. Some traditional data mining tools can
process only structured data and, therefore, are not applicable to unstructured data. This can be
solved by using these two machine learning techniques: Optical Character Recognition (OCR)
and Natural Language Processing (NLP).
4. Sometimes, data mining tools provide less clarity when processing a large number of
variables. Additional data increases the complexity of the data mining outputs, which becomes
hard for humans to understand. Data mining tools integrated with machine learning algorithms
and computer vision help to overcome this, so the processed data can be captured and the
relevant output generated.
5. Data mining tools analyze the past performance of the process rather than analyzing the
ongoing process. They cannot guarantee predicting performance in the future. Using machine
learning applications with data mining can predict the final results and future events. They also
send an alert message to users if there are any shortcomings and if any improvements are
required.
Web Mining
Web mining is the application of data mining techniques to automatically discover and extract
information from web documents and services. The main purpose of web mining is to discover
useful information from the World Wide Web and its usage patterns.
Web mining is the process of discovering patterns, structures, and relationships in web data. It
involves using data mining techniques to analyze web data and extract valuable insights. The
applications of web mining are wide-ranging and include:
• Search engine optimization: Web mining can be used to analyze search engine queries
and search engine results pages (SERPs). This information can be used to improve the
visibility of websites in search engine results and increase traffic to the website.
• Fraud detection: Web mining can be used to detect fraudulent activity on websites.
This information can be used to prevent financial fraud, identity theft, and other types of
online fraud.
• Sentiment analysis: Web mining can be used to analyze social media data and extract
sentiment from posts, comments, and reviews. This information can be used to
understand customer sentiment towards products and services and make informed
business decisions.
• Web content analysis: Web mining can be used to analyze web content and extract
valuable information such as keywords, topics, and themes. This information can be
used to improve the relevance of web content and optimize search engine rankings.
• Customer service: Web mining can be used to analyze customer service interactions on
websites and social media platforms. This information can be used to improve the
quality of customer service and identify areas for improvement.
• Healthcare: Web mining can be used to analyze health-related websites and extract
valuable information about diseases, treatments, and medications. This information can
be used to improve the quality of healthcare and inform medical research.
Web mining can be broadly divided into three types of techniques: Web Content Mining, Web
Structure Mining, and Web Usage Mining. These are explained below.
• Web Content Mining: Web content mining is the application of extracting useful
information from the content of web documents. Web content consists of several
types of data: text, images, audio, video, etc. Content data are the facts that a web
page was designed to convey, and they can provide effective and interesting patterns
about user needs. Text documents relate to text mining, machine learning, and natural
language processing, so this mining is also known as text mining. This type of mining
performs scanning and mining of the text, images, and groups of web pages according
to the content of the input.
• Web Structure Mining: Web structure mining is the application of discovering structural
information from the Web, i.e., the hyperlink structure connecting web pages and the
internal (document) structure of individual pages. It is discussed in detail later in this unit.
• Web Usage Mining: Web usage mining is the application of identifying or discovering
interesting usage patterns from large data sets, patterns that help us understand user
behavior. In web usage mining, users' access data on the web is collected in the form of
logs, so web usage mining is also called log mining.
Challenges of Web Mining
• Dynamic data sources on the Internet: The required online data is updated in real time;
for instance, news, weather, fashion, finance, and sports cannot be indexed properly.
• Data relevancy: A particular person is typically concerned with only a small portion
of the internet, with the remaining portion containing data that is unfamiliar to the
user and may produce unexpected outcomes for the actual requirement.
• The sheer size of the web: The web is getting bigger very quickly, and it seems to be
too big for data mining and data warehousing as required.
Data Mining vs Web Mining, by parameter:
• Definition: Data mining attempts to discover patterns and hidden knowledge in large data
sets in any system; web mining applies data mining techniques to automatically discover
and extract information from web documents.
• Application: Data mining is very useful for web page analysis; web mining is very useful
for a particular website and e-services.
• Target Users: Data mining targets data scientists and data engineers; web mining targets
data scientists along with data analysts.
• Problem Type: Data mining covers clustering, classification, regression, prediction,
optimization, and control; web mining covers web content mining and web structure mining.
• Tools: Data mining includes tools such as machine learning algorithms; special tools for
web mining are Scrapy, PageRank, and Apache logs.
Depending upon the type of web structural data, Web Structure Mining can be categorized
into two types:
1. Extracting patterns from hyperlinks on the Web: The Web works through a
system of hyperlinks using the Hypertext Transfer Protocol (HTTP). A hyperlink is a structural
component that connects web pages in different locations. Any page can create a
hyperlink to any other page, and that page can in turn be linked to some other page. The
intertwined, self-referential nature of the web lends itself to some unique network-analytical
algorithms. The structure of web pages can also be analyzed to examine the pattern of
hyperlinks among pages.
2. Mining the document structure: This is the analysis of the tree-like structure of a web page
to describe HTML or XML tag usage. There are several terms associated with
Web Structure Mining, described below.
PageRank (PR) is an algorithm used by Google Search to rank websites in its search engine
results. PageRank was named after Larry Page, one of the founders of Google, and is a
way of measuring the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough
estimate of how important the website is. The underlying assumption is that more important
websites are likely to receive more links from other websites.
Thus, the rank of a page depends on the number of pages and the quality of the links pointing to the
target node.
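The idea can be captured in a few lines with the power-iteration method; the link graph below is a toy, hypothetical example, and 0.85 is the damping factor commonly used with PageRank:

```python
# PageRank by power iteration on a tiny hypothetical link graph.
graph = {            # page -> list of pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
DAMPING = 0.85
pages = list(graph)
rank = {p: 1 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks stabilize
    rank = {
        p: (1 - DAMPING) / len(pages)
           + DAMPING * sum(rank[q] / len(graph[q])
                           for q in pages if p in graph[q])
        for p in pages
    }

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))  # pages with more/better in-links rank higher
```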
So, we can say that Web Structure Mining can be performed either at the document level
(intra-page) or at the hyperlink level (inter-page). The research done at the hyperlink level is
called Hyperlink Analysis, and the hyperlink structure can be used to retrieve useful information
on the Web.
Web Structure Mining has two main approaches, or two basic strategic models for successful
websites:
• PageRank
• Hubs and Authorities
• Hubs: These are pages with a large number of interesting links. They serve as a hub, or
a gathering point, that people visit to access a variety of information. More focused
sites can aspire to become hubs for newly emerging areas. The pages on such a website
can themselves be analyzed for the quality of the content that attracts the most users.
• Authorities: People usually gravitate towards pages that provide the most complete
and authentic information on a particular subject. This could be factual information,
news, advice, etc. These websites have the largest number of inbound links from
other websites. A minimal sketch of how hub and authority scores are computed follows
this list.
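Below is a minimal sketch of the hubs-and-authorities (HITS-style) score updates on a hypothetical link graph; each round, authority scores flow in from in-links and hub scores from out-links, followed by normalization:

```python
# Hubs and authorities (HITS-style) scoring on a toy link graph.
graph = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["B", "C"],
}
pages = list(graph)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # A good authority is linked to by good hubs
    auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
    # A good hub links to good authorities
    hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
    # Normalize so the scores stay bounded
    a = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    h = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / a for p, v in auth.items()}
    hub = {p: v / h for p, v in hub.items()}

print("authorities:", {p: round(v, 2) for p, v in auth.items()})
print("hubs:", {p: round(v, 2) for p, v in hub.items()})
```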
Web data are generally semi-structured and/or unstructured, while data mining is primarily
concerned with structured data. Web content mining performs scanning and mining of the text,
images, and groups of web pages according to the content of the input, displaying the resulting
list in search engines.
For Example: if the user is searching for a particular song then the search engine will display
or provide suggestions relevant to it.
Web content mining deals with different kinds of data such as text, audio, video, image, etc.
1. Agent-Based Approaches:
• Intelligent Search: This type of search refers to a particular goal of the user and returns
results based on conclusions drawn about that goal.
• Information Filtering / Categorization: This type of search deals with the filtering of
data, i.e., the removal of unwanted or redundant information using certain AI-based
methods, such as HyPursuit and BO (Bookmark Organizer).
• Growth of sophisticated AI systems that replace users in an automated or semi-automated
manner. One such technique is deep learning, wherein the system is trained by feeding it
certain kinds of data.
2. Database Approaches:
Used for transforming unstructured data into a more structured and high-level collection of
resources, such as in relational databases, and using standard database querying mechanisms
and data mining techniques to access and analyze this information.
• Multilevel Databases:
o Lowest Level – semi-structured information is kept.
o High Level- generalization from lower levels organized into relations and
objects.
• Web Query Systems:
o Web query systems are developed using query languages such as SQL together
with Natural Language Processing for extracting data.
The web content mining process typically involves the following steps (a small sketch of the
clustering step follows this list):
1. Pre-processing
2. Clustering
3. Classifying
4. Identifying the associations
5. Topic identification, tracking, and drift analysis
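To make the clustering step concrete, here is a minimal sketch assuming scikit-learn is installed; the page texts are invented for illustration:

```python
# Cluster a few hypothetical web-page texts by TF-IDF similarity.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

pages = [
    "cheap flights and hotel deals for your next holiday",
    "book flights online with discounted hotel packages",
    "python tutorial covering lists, loops and functions",
    "learn python programming with simple code examples",
]

X = TfidfVectorizer(stop_words="english").fit_transform(pages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # pages about the same topic share a cluster id
```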
Web usage mining, a subset of data mining, is basically the extraction of various types of
interesting data that is readily available and accessible in the ocean of huge web pages on the
Internet, formally known as the World Wide Web (WWW). As one of the applications of data
mining, it helps analyze user activities on different web pages and track them over a
period of time. Web usage mining draws on the following major categories of web data.
1. Web Content Data: The common forms of web content data are HTML web pages, images,
audio, video, etc., the main one being the HTML format. Though its rendering may differ from
browser to browser, the common basic layout/structure is the same everywhere, which makes it
the most popular form of web content data. XML and dynamic server pages such as JSP and
PHP are also forms of web content data.
2. Web Structure Data: On a web page, there is content arranged according to HTML tags
(which are known as intrapage structure information). The web pages usually have hyperlinks
that connect the main webpage to the sub-web pages. This is called Inter-page structure
information. So basically relationship/links describing the connection between webpages is
web structure data.
3. Web Usage Data: The main sources of data here are the web server and the application
server. Web usage data involves the log data collected by the two sources mentioned above.
Log files are created when a user/customer interacts with a web page. This data can be mainly
categorized into three types based on the source it comes from:
• Server-side
• Client-side
• Proxy side.
There are other additional data sources also which include cookies, demographics, etc.
1. Web Server Data: Web server data generally includes the IP address, browser logs,
proxy server logs, user profiles, etc. User logs are collected by the web server.
2. Application Server Data: An added feature on the commercial application servers is to build
applications on it. Tracking various business events and logging them into application server
logs is mainly what application server data consists of.
3. Application-level data: There are various new kinds of events that can be there in an
application. The logging feature enabled in them helps us get the past record of the events.
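A minimal sketch of the first step of web usage mining: parsing web server log lines (here in the Common Log Format, with invented sample entries) and grouping page requests by client IP:

```python
# Extract page-visit records from Common Log Format server logs.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

sample_logs = [
    '192.168.1.5 - - [10/Mar/2024:10:12:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '192.168.1.5 - - [10/Mar/2024:10:12:45 +0000] "GET /products.html HTTP/1.1" 200 1024',
    '10.0.0.7 - - [10/Mar/2024:10:13:02 +0000] "GET /missing.html HTTP/1.1" 404 128',
]

visits = {}
for line in sample_logs:
    m = LOG_PATTERN.match(line)
    if m and m.group("status") == "200":
        # Group successful page requests by client IP (a crude session key)
        visits.setdefault(m.group("ip"), []).append(m.group("url"))

for ip, pages in visits.items():
    print(ip, "->", pages)
```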
Issues in web usage mining include:
• Privacy stands out as a major issue. Analyzing data for the benefit of customers is good,
but using the same data for something else can be dangerous: using it without the
individual's knowledge can pose a big threat to the company.
• If a data mining company does not hold itself to high ethical standards, two or more
attributes can be combined to derive personal information about a user, which again is
not respectable.
1. Personalization of Web Content: The World Wide Web holds a lot of information and
is expanding very rapidly day by day. The big problem is that the specific needs of
people increase on an everyday basis, and they quite often do not get the query results
they want. A solution to this is web personalization. Web personalization may be defined
as catering to the user's needs based on tracking their navigational behavior and
interests. Web personalization includes recommender systems, check-box
customization, etc. Recommender systems are popular and are used by many
companies.