DWM - Notes Unit 1 To Unit 5
Unit No: 01 to 05
Name of Faculty: Mr. Abhishek Kundu
Session 2023-24
Data mining also provides valuable insights for risk management and
fraud detection. By analyzing historical data, businesses can identify
patterns or anomalies that may indicate potential risks or fraudulent
activities. This can include detecting unusual transactions, identifying
suspicious behaviors, or detecting patterns of fraud. By leveraging data
mining techniques, businesses can proactively detect and prevent
potential risks or fraudulent activities, saving both time and money.
Data mining systems are powerful tools that allow organizations to discover
patterns, correlations, and relationships within their data. These systems use a
combination of statistical algorithms, machine learning techniques, and artificial
intelligence to analyze large amounts of data and uncover hidden patterns and
trends. By doing so, organizations can gain valuable insights that can be used for
various purposes such as business intelligence, marketing strategies, risk
assessment, and fraud detection, among others.
There are several types of data mining systems, each with its own unique
characteristics and applications. Let's explore some of the most common types:
Clustering: Clustering data mining systems group similar data points together based on their characteristics or attributes. This helps identify similarities or patterns within the data that might not be initially apparent. Clustering is commonly used in customer segmentation, where organizations group customers with similar characteristics, such as purchasing behaviour or demographics.
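As a rough illustration of clustering-based customer segmentation, the sketch below groups a few made-up customers with k-means from scikit-learn; the attribute names, the values, and the choice of three segments are assumptions for demonstration only.
# Illustrative only: column meanings and k=3 are assumptions, not from the notes.
import numpy as np
from sklearn.cluster import KMeans
# Hypothetical customer attributes: [annual_spend, visits_per_month]
customers = np.array([
    [200,  2], [250,  3], [220,  2],   # low-spend, infrequent
    [900, 10], [950, 12], [880, 11],   # high-spend, frequent
    [500,  6], [520,  5], [480,  6],   # mid-range
])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(customers)
print("Segment of each customer:", labels)
print("Segment centers:\n", kmeans.cluster_centers_)
Each resulting segment can then be described by its center, e.g. "high-spend, frequent visitors", and targeted accordingly.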
Regression: Regression data mining systems are used to analyze the relationship
between a dependent variable and one or more independent variables. They help
organizations understand how changes in one variable affect the others.
Regression analysis is frequently used in sales forecasting, where historical sales
data is analyzed to predict future sales based on various factors such as price,
promotion, and seasonality.
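A minimal sketch of such a sales-forecasting regression follows; the feature names (price, promotion spend, seasonality index) and all the numbers are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
# Columns: [price, promotion_spend, seasonality_index]
X = np.array([
    [9.99, 100, 0.8],
    [8.99, 200, 1.0],
    [9.49, 150, 1.2],
    [7.99, 300, 1.1],
])
y = np.array([120, 180, 160, 240])  # historical units sold
model = LinearRegression().fit(X, y)
next_month = np.array([[8.49, 250, 1.3]])
print("Forecasted sales:", model.predict(next_month)[0])
The fitted coefficients indicate how a change in each independent variable (for example, a price cut or a larger promotion) is expected to affect the dependent variable, sales.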
It's worth noting that these types of data mining systems are not mutually
exclusive, and often multiple techniques are used in combination to extract the
most valuable insights from the data. Additionally, the choice of data mining
system depends on the specific requirements and objectives of an organization.
In simple terms, data preprocessing refers to the techniques and methods used to
prepare data for analysis. It involves cleaning the data, handling missing values,
dealing with outliers, normalizing the data, and reducing dimensionality. By
performing these preprocessing steps, data mining algorithms can more
effectively extract meaningful patterns and insights from the data.
One of the primary tasks in data preprocessing is data cleaning. Real-world data
is often incomplete, noisy, and inconsistent. The cleaning process involves
removing or correcting inaccurate data, dealing with missing values, and
handling duplicate or inconsistent data. For example, if a dataset contains
missing values, analysts can choose to either remove those records or impute the
missing values using techniques such as mean imputation or regression
imputation.
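The following pandas sketch shows both options described above, deletion and mean imputation; the column names and values are hypothetical.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan],
    "income": [30000, 42000, np.nan, 51000, 38000],
})
# Option 1: drop records that contain missing values
dropped = df.dropna()
# Option 2: mean imputation - replace each missing value with the column mean
imputed = df.fillna(df.mean(numeric_only=True))
print(imputed)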
Another important step is outlier detection and handling. Outliers are data points
that significantly deviate from the normal pattern and can have a significant
impact on the analysis. Identifying and dealing with outliers is essential to
ensure that they do not skew the results. Outliers can be identified using
statistical techniques such as Z-score or using domain knowledge.
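A small sketch of Z-score outlier detection follows; the transaction amounts are made up, and the cut-off of 3 standard deviations is a common rule of thumb rather than a fixed rule.
import numpy as np
# Daily transaction amounts; 950 is an injected outlier for illustration.
values = np.array([100, 102, 98, 101, 99, 103, 97, 100, 102, 98,
                   101, 99, 100, 103, 97, 102, 98, 100, 101, 950])
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]
print("Outliers:", outliers)   # -> [950]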
Data preprocessing also includes other tasks such as data integration, where
multiple datasets are combined, and data transformation, where the data is
converted into a suitable format for analysis. These steps ensure that the data is
in a standardized and consistent format, making it easier for data mining
algorithms to extract meaningful patterns and insights.
Data mining is the process of extracting valuable insights and patterns from
large datasets. However, before this can be done, it is crucial to ensure that the
data being analyzed is of high quality and free from errors. This is where data
cleaning comes into play.
Data cleaning, also known as data cleansing or data scrubbing, is the process of
identifying and correcting or removing errors, inconsistencies, and inaccuracies
within a dataset. It involves detecting and resolving issues such as missing
values, duplicate records, outliers, and formatting errors. The goal is to
transform messy and raw data into clean, reliable, and consistent data that can be
effectively analyzed.
Data cleaning is an essential step in the data mining process. Here are a few
reasons why it is important:
Accurate Analysis: Clean data ensures accurate analysis and reliable results. If
the dataset is riddled with errors and inconsistencies, any insights or patterns
derived from it may be misleading or inaccurate. By cleaning the data, we can
ensure that the analysis is based on reliable information.
Data cleaning can involve a variety of techniques and methods depending on the
specific issues present in the dataset. Here are some common techniques used in
data cleaning:
Removing Duplicates: Duplicate records can skew the analysis and produce
inaccurate results. Identifying and removing duplicate entries helps to eliminate
redundancy and ensures that each data point is unique.
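A short pandas sketch of duplicate removal; the records and columns are invented for illustration.
import pandas as pd
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "purchase":    ["book", "pen", "pen", "lamp"],
})
deduplicated = df.drop_duplicates()                              # drop fully identical rows
by_customer  = df.drop_duplicates(subset="customer_id", keep="first")
print(deduplicated)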
Handling Missing Values: Missing values are a common issue in datasets and
can arise due to various reasons such as data entry errors or incomplete data
collection. Techniques such as imputation (replacing missing values with
estimated values) or deletion can be used to handle missing values appropriately.
Outlier Detection: Outliers are data points that deviate significantly from the
majority of the data. They can have a significant impact on the analysis, leading
to biased results. Identifying and handling outliers helps to ensure that the
analysis is not influenced by extreme values.
Conclusion:
Data cleaning is a critical step in the data mining process. It helps to ensure that
the data being analyzed is accurate, reliable, and consistent. By detecting and
resolving errors, inconsistencies, and inaccuracies, data cleaning allows analysts
to derive meaningful insights and patterns from the dataset. Effective data
cleaning techniques such as removing duplicates, handling missing values,
outlier detection, and standardization help to enhance the quality of the data and
improve the accuracy of analysis. Ultimately, data cleaning plays a vital role in
enabling data-driven decision-making and extracting valuable information from
large datasets.
Write about Data Cube technology.
Ans: Data mining is a powerful technique that involves extracting valuable insights and patterns from large datasets. With the ever-increasing amount of data being generated, traditional data mining methods are struggling to keep up with the complexity and scale of these datasets. This is where data cube technology comes into play.
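To make the idea concrete, here is a minimal sketch of a cube-style aggregation built with pandas; the sales records, column names, and dimensions (region, product, quarter) are assumptions chosen only to illustrate roll-up and drill-down.
import pandas as pd
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["TV",   "Phone","TV",   "Phone","TV",   "TV"],
    "quarter": ["Q1",   "Q1",   "Q1",   "Q2",   "Q2",   "Q2"],
    "amount":  [200,    150,    180,    210,    220,    190],
})
# One "slice" of a sales cube: total amount by region x product for each quarter
cube = sales.pivot_table(index=["region", "product"], columns="quarter",
                         values="amount", aggfunc="sum", fill_value=0)
print(cube)
# Roll up to a coarser level of granularity (region only)
print(sales.groupby("region")["amount"].sum())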
Data cube technology offers several advantages over traditional data mining methods. Firstly, it allows for faster query performance, as data cubes are pre-computed and stored in a specialized data structure optimized for fast access. This eliminates the need to perform complex calculations on the fly, resulting in significant time savings.
Secondly, data cube technology provides a comprehensive view of data by enabling users to analyze it from multiple dimensions simultaneously. This multidimensional analysis allows for a deeper understanding of the data and uncovers hidden patterns and relationships that may not be apparent using traditional data mining techniques.
Furthermore, data cubes support the exploration of data at different levels of granularity. Users can drill down into the data to view detailed information or roll up to higher levels of aggregation to gain a broader perspective. This flexibility allows for in-depth analysis at various levels, providing valuable insights for decision-making.
Data cube technology is widely used in various industries, including finance, retail, healthcare, and telecommunications. For example, in retail, data cubes can be used to analyze sales performance by product, region, and time, identifying trends, popular products, and areas for improvement. In finance, data cubes can help analyze investment portfolios based on asset class, risk level, and return, enabling better portfolio management decisions.
However, data cube technology also has its limitations. The biggest challenge lies in the storage requirements, as data cubes can become extremely large, especially when dealing with massive datasets. Managing and storing these cubes can be a resource-intensive task, requiring significant computing power and storage capabilities.
A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical capabilities allow organizations to derive valuable business insights from their data to improve decision-making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts. Because of these capabilities, a data warehouse can be considered an organization's "single source of truth."
A typical data warehouse often includes the following elements:
A relational database to store and manage data
An extraction, loading, and transformation (ELT) solution for preparing the data for analysis
Statistical analysis, reporting, and data mining capabilities
Client analysis tools for visualizing and presenting data to business users
Other, more sophisticated analytical applications that generate actionable information by applying data science and artificial intelligence (AI) algorithms, or graph and spatial features that enable more kinds of analysis of data at scale
Organizations can also select a solution combining transaction processing, real-time analytics across data warehouses and data lakes, and machine learning in one MySQL Database service—without the complexity, latency, cost, and risk of extract, transform, and load (ETL) duplication.
Benefits of a Data Warehouse
Data warehouses offer the overarching and unique benefit of allowing organizations to analyze large amounts of variant data and extract significant value from it, as well as to keep a historical record.
Data Warehouse Architecture
Simple. All data warehouses share a basic design in which metadata, summary
data, and raw data are stored within the central repository of the warehouse. The
repository is fed by data sources on one end and accessed by end users for
analysis, reporting, and mining on the other end.
Simple with a staging area. Operational data must be cleaned and processed
before being put in the warehouse. Although this can be done programmatically,
many data warehouses add a staging area for data before it enters the warehouse,
to simplify data preparation.
Hub and spoke. Adding data marts between the central repository and end users
allows an organization to customize its data warehouse to serve various lines of
business. When the data is ready for use, it is moved to the appropriate data
mart.
Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly and informally explore new datasets or ways of analyzing data without having to conform to or comply with the formal rules and protocols of the data warehouse.
The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents a class.
Decision Tree
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or a splitting subset.
Output:
A Decision Tree
Method
create a node N;
if tuples in D are all of the same class C then
return N as a leaf node labeled with the class C;
if attribute_list is empty then
return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits are allowed then
attribute_list = attribute_list - splitting_attribute;
for each outcome j of splitting_criterion
let Dj be the set of data tuples in D satisfying outcome j;
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by Generate
decision tree(Dj, attribute list) to node N;
end for
return N;
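For a concrete feel of how such a tree is induced from training tuples, here is a hedged scikit-learn sketch for a buy_computer-style concept; the numeric encoding of the attributes and the example tuples are assumptions, not the dataset from the notes.
from sklearn.tree import DecisionTreeClassifier, export_text
# Attributes: age (0=youth, 1=middle_aged, 2=senior), student (0/1), credit_rating (0=fair, 1=excellent)
X = [
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0], [2, 1, 0],
    [2, 1, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0], [2, 1, 0],
]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["age", "student", "credit_rating"]))
print(tree.predict([[0, 1, 0]]))   # e.g., classify a young student with fair credit
Each internal node of the printed tree is a test on an attribute and each leaf is a class, exactly as described above.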
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due
to noise or outliers. The pruned trees are smaller and less complex.
Cost Complexity
The cost complexity is measured by the following two parameters −
Number of leaves in the tree, and
Error rate of the tree.
An IF-THEN rule has the form IF condition THEN conclusion. The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed; the consequent (the conclusion) holds the class prediction.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
One rule is created for each path from the root to a leaf node.
Each splitting criterion along a given path is logically ANDed to form the rule antecedent (the IF part).
The leaf node holds the class prediction, forming the rule consequent (the THEN part).
For example, one path in the buy_computer tree might yield the rule: IF age = youth AND student = yes THEN buys_computer = yes.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by that rule are removed, and the process continues for the remaining tuples.
Note − Decision tree induction can be considered as learning a set of rules simultaneously, because the path from the root to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci and no tuple from any other class.
Input:
D, a data set of class-labeled tuples,
Att_vals, the set of all attributes and their possible values.
Output:
A set of IF-THEN rules.
Method:
Rule_set = { }; // initial set of rules learned is empty
for each class c do
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
until termination condition;
Rule_set = Rule_set + Rule; // add the new rule to the rule set
end for
return Rule_set;
The assessment of quality is made on the original set of training data. The rule may perform well on training data but less well on subsequent data. That is why rule pruning is required.
The rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set.
Hence, if the FOIL_Prune value is higher for the pruned version of R, then we
prune R.
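A tiny helper computing the FOIL_Prune measure given above; pos and neg are the counts of positive and negative tuples covered by rule R on the pruning set, and the sample counts are invented.
def foil_prune(pos: int, neg: int) -> float:
    # (pos - neg) / (pos + neg): higher means the rule is more accurate on the pruning set
    return (pos - neg) / (pos + neg)
# A rule covering 40 positive and 5 negative tuples scores higher than one covering 40 and 20:
print(foil_prune(40, 5))    # ~0.78
print(foil_prune(40, 20))   # ~0.33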
Ans: There exist a large number of clustering algorithms in the literature. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose. In general, major clustering methods can be classified into the following categories.
Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in some fuzzy partitioning techniques. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different. There are various other criteria for judging the quality of partitions. To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of two popular heuristic methods: (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are studied in depth later.
Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster and, in each successive iteration, splits a cluster into smaller clusters, until eventually each object is in its own cluster, or until a termination condition holds. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not worrying about a combinatorial number of different choices. However, a major problem of such techniques is that they cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object linkages at each hierarchical partitioning, such as in CURE and Chameleon, or (2) integrate hierarchical agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm and then refining the result using iterative relocation, as in BIRCH.
Density-based methods: Most partitioning methods cluster objects based on the distance between objects; such methods can find only spherical-shaped clusters and have difficulty discovering clusters of arbitrary shape. Density-based methods instead continue growing a given cluster as long as the density (number of objects) in its neighborhood exceeds some threshold, which makes it possible to discover clusters of arbitrary shape and to filter out noise. DBSCAN and OPTICS are typical density-based methods.
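To contrast a partitioning method with an agglomerative hierarchical method in practice, here is a minimal scikit-learn sketch on the same toy 2-D points; the data and the choice of three clusters are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
points = np.array([[1, 1], [1, 2], [2, 1],      # group A
                   [8, 8], [8, 9], [9, 8],      # group B
                   [5, 15], [6, 15], [5, 16]])  # group C
partitioning = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
hierarchical = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(points)
print("k-means labels:      ", partitioning)
print("agglomerative labels:", hierarchical)
On such well-separated, roughly spherical groups, both methods typically agree; they differ mainly on clusters with complex shapes or when the merge/split history matters.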
Ans: Spatial data mining is a specialized subfield of data mining that deals with
extracting knowledge from spatial data. Spatial data refers to data that is
associated with a particular location or geography. Examples of spatial data
include maps, satellite images, GPS data, and other geospatial information.
Spatial data mining involves analyzing and discovering patterns, relationships,
and trends in this data to gain insights and make informed decisions.
The use of spatial data mining has become increasingly important in various
fields, such as logistics, environmental science, urban planning, transportation,
and public health. By analyzing spatial data, researchers and data mining
professionals can identify correlations, predict future events, and make
informed decisions that can have a significant impact. For instance, a
transportation company can optimize its delivery routes for faster and more
efficient deliveries using spatial data mining techniques. They can analyze their
delivery data along with other spatial data, such as traffic flow, road network,
and weather patterns, to identify the most efficient routes for each delivery.
In the following sections, we'll answer questions about spatial data mining.
Different types of spatial data are used in spatial data mining. These include point data, line data, and polygon data.
Point Data: Point data represents discrete locations, each described by a single pair of coordinates, such as the position of a store, a customer, or a GPS reading.
Line Data: Line data represents linear features made up of connected points, such as roads, rivers, railways, or delivery routes.
Polygon Data: Polygon data represents areas enclosed by a boundary, such as city limits, land parcels, lakes, or sales territories.
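A small sketch of these three spatial data types using the shapely library (the notes do not name a library, so this choice and the coordinates are assumptions for illustration).
from shapely.geometry import Point, LineString, Polygon
store     = Point(2.0, 3.0)                            # point data (a location)
road      = LineString([(0, 0), (2, 2), (5, 3)])       # line data (a road)
city_zone = Polygon([(0, 0), (6, 0), (6, 6), (0, 6)])  # polygon data (an area)
print(city_zone.contains(store))   # is the store inside the zone?
print(store.distance(road))        # distance from the store to the road
print(city_zone.area)              # area of the zone
Spatial queries like these (containment, proximity, area) are the building blocks on which spatial data mining tasks such as route optimization are layered.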
Ans: Text mining (also known as text analysis) is the process of transforming unstructured text into structured data for easy analysis. Text mining uses natural language processing (NLP), allowing machines to understand human language and process it automatically.
For businesses, the large amount of data generated every day represents both an opportunity and a
challenge. On the one side, data helps companies get smart insights on people’s opinions about a
product or service. Think about all the potential ideas that you could get from analyzing emails, product
reviews, social media posts, customer feedback, support tickets, etc. On the other side, there’s the
dilemma of how to process all this data. And that’s where text mining plays a major role.
Like most things related to Natural Language Processing (NLP), text mining may sound like a hard-to-
grasp concept. But the truth is, it doesn’t need to be. This guide will go through the basics of text mining,
explain its different methods and techniques, and make it simple to understand how it works. You will
also learn about the main applications of text mining and how companies can use it to automate many
of their processes:
Text mining is an automatic process that uses natural language processing to extract valuable insights
from unstructured text. By transforming data into information that machines can understand, text
mining automates the process of classifying texts by sentiment, topic, and intent.
Thanks to text mining, businesses are able to analyze complex and large sets of data in a simple, fast and effective way. At the same time, companies are taking advantage of this powerful tool to reduce some of their manual and repetitive tasks, saving their teams precious time and allowing customer support agents to focus on what they do best.
Let’s say you need to examine tons of reviews in G2 Crowd to understand what customers are praising
or criticizing about your SaaS. A text mining algorithm could help you identify the most popular topics
that arise in customer comments, and the way that people feel about them: are the comments positive,
negative or neutral? You could also find out the main keywords mentioned by customers regarding a
given topic.
In a nutshell, text mining helps companies make the most of their data, which leads to better data-
driven business decisions.
At this point you may already be wondering, how does text mining accomplish all of this? The answer
takes us directly to the concept of machine learning.
When text mining and machine learning are combined, automated text analysis becomes possible.
Going back to our previous example of SaaS reviews, let’s say you want to classify those reviews into
different topics like UI/UX, Bugs, Pricing or Customer Support. The first thing you’d do is train a topic
classifier model, by uploading a set of examples and tagging them manually. After being fed several
examples, the model will learn to differentiate topics and start making associations as well as its own
predictions. To obtain good levels of accuracy, you should feed your models a large number of examples
that are representative of the problem you’re trying to solve.
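A hedged sketch of this topic-classifier idea, using a TF-IDF bag of words and logistic regression from scikit-learn; the review snippets and the topic labels (UI/UX, Bugs, Pricing, Support) are invented examples, and a real model would need far more training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
reviews = [
    "The dashboard layout is confusing and hard to navigate",
    "Love the clean interface and the new dark mode",
    "The app crashes every time I export a report",
    "Constant errors when syncing my account",
    "Too expensive compared to similar tools",
    "The monthly plan is not worth the price",
    "Support answered my ticket within an hour",
    "The help team was friendly and solved my issue",
]
topics = ["UI/UX", "UI/UX", "Bugs", "Bugs", "Pricing", "Pricing", "Support", "Support"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(reviews, topics)   # "upload examples and tag them manually"
print(model.predict(["the pricing page is misleading",
                     "it keeps crashing on startup"]))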
Now that you’ve learned what text mining is, we’ll see how it differentiates from other usual terms, like
text analysis and text analytics.
Ans: Over the last few years, the World Wide Web has become a significant source of information and simultaneously a popular platform for business. Web mining can be defined as the method of applying data mining techniques and algorithms to extract useful information directly from the web, such as Web documents and services, hyperlinks, Web content, and server logs. The World Wide Web contains a large amount of data that provides a rich source for data mining. The objective of Web mining is to look for patterns in Web data by collecting and examining data in order to gain insights.
What is Web Mining?
Web mining can widely be seen as the application of adapted data mining techniques to the web, whereas data mining is defined as the application of algorithms to discover patterns on mostly structured data embedded into a knowledge discovery process. Web mining has the distinctive property of providing a variety of data types. The web has multiple aspects that yield different approaches to the mining process: web pages consist of text, web pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead to the differentiation between three areas: web content mining, web structure mining, and web usage mining.
Web content mining can be used to extract useful data, information, and knowledge from web page content. In web content mining, each web page is considered as an individual document, and the mining process can take advantage of the semi-structured nature of web pages, since HTML tags carry additional information about the content.
Web structure mining can be used to discover the link structure of hyperlinks, i.e., how data and pages are connected through direct links and link networks. In web structure mining, the web is considered as a directed graph, with the web pages being the vertices that are connected by hyperlinks. The most important application in this regard is the Google search engine, which estimates the ranking of its results primarily with the PageRank algorithm. It characterizes a page as exceptionally relevant when it is frequently linked to by other highly relevant pages. Structure and content mining methodologies are usually combined. For example, web structure mining can help organizations examine the link network between two commercial sites.
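A simplified PageRank sketch using power iteration over a tiny hand-made link graph; the graph itself is invented, and the damping factor 0.85 is the commonly cited default, so this is an illustration of the idea rather than Google's actual implementation.
def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the set of pages it links to
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # each page q shares its rank equally among the pages it links to
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank
    return rank
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}
print(pagerank(links))  # C, being linked from A, B and D, accumulates the highest rank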
Web usage mining is used to extract useful data, information, and knowledge from weblog records, and assists in recognizing user access patterns for web pages. When mining the usage of web resources, one considers the records of requests made by visitors to a website, which are often collected as web server logs. While the content and structure of the collection of web pages reflect the intentions of the authors of the pages, the individual requests demonstrate how the consumers use these pages. Web usage mining may therefore reveal relationships that were not intended by the creator of the pages.