
DWDM Unit 3

Data mining is the process of discovering patterns and useful information from large datasets using techniques like machine learning and statistical analysis, essential for decision-making in various fields. It automates the analysis of massive data volumes, revealing hidden patterns and predicting trends, thereby enhancing business intelligence and customer relationship management. The Knowledge Discovery in Databases (KDD) process involves several steps, including data selection, preprocessing, transformation, mining, and pattern evaluation to ensure the extraction of accurate and relevant knowledge.

Uploaded by

Siddharth
Copyright
© All Rights Reserved

Data Warehousing and Data Mining

Unit 3: Introduction to Data Mining

Introduction and Need for Data Mining


Introduction to Data Mining
Data mining is the process of discovering patterns, correlations, and useful
information from large datasets using techniques such as machine learning,
statistical analysis, and database systems. It is an essential component of
knowledge discovery in databases (KDD), where raw data is transformed
into meaningful insights that help in decision-making.
With the rapid growth of data in various domains such as healthcare,
finance, business, and social media, traditional data analysis techniques are
no longer sufficient to extract valuable information. Data mining automates
the process of analyzing large datasets, identifying hidden patterns, and
predicting future trends, enabling organizations to make data-driven
decisions.

Definition of Data Mining


Data mining can be defined as:
"The process of extracting useful, valid, and previously unknown patterns
or knowledge from large amounts of data stored in databases, data
warehouses, or other information repositories."

Characteristics of Data Mining:


1. Automatic Processing: Data mining tools use machine learning
algorithms to process data automatically.

2. Pattern Discovery: It finds hidden patterns and relationships in large
datasets.

3. Prediction and Classification: It helps in predicting future outcomes
based on historical data.

4. Large-Scale Data Handling: It efficiently processes massive amounts of
structured and unstructured data.

5. Decision Support: The insights gained from data mining help in making
strategic business decisions.

Need for Data Mining


The increasing volume and complexity of data generated in various fields
have made data mining an essential tool for extracting useful knowledge.
Below are the key reasons why data mining is needed:

1. Handling Large Volumes of Data


With the rise of big data, IoT, and digital transformation, organizations
generate vast amounts of data every second.

Traditional data analysis methods cannot efficiently process such large
datasets.

Data mining provides automated techniques to extract useful patterns
from massive data repositories.

Example:
E-commerce companies like Amazon analyze millions of customer
transactions daily to identify purchasing patterns and recommend products.

2. Extracting Hidden Patterns and Relationships


Raw data often contains hidden correlations that are not immediately
visible through traditional analysis.

Data mining helps discover non-obvious relationships between
different attributes.

Example:
In the healthcare industry, data mining is used to identify risk factors for
diseases by analyzing patient history, genetic data, and environmental
conditions.

3. Enhancing Decision-Making and Business Intelligence


Organizations use data mining to make data-driven decisions rather
than relying on intuition.

It provides valuable insights that help businesses improve strategies,
optimize operations, and enhance customer experience.

Example:
Banks use data mining to analyze customer transactions and detect
fraudulent activities in real time.

4. Improving Customer Relationship Management (CRM)


Companies use data mining to understand customer preferences,
buying behavior, and feedback.

It enables businesses to personalize services, improve customer
satisfaction, and enhance loyalty programs.

Example:
Netflix uses data mining to analyze user viewing patterns and recommend
personalized content to its subscribers.

5. Fraud Detection and Security


Financial institutions use data mining techniques to detect credit card
fraud, money laundering, and cyber threats.

Anomalies in transaction data can indicate suspicious activities,
allowing quick intervention.

Example:
Banks use machine learning models to flag unusual transactions,
preventing unauthorized access to accounts.

6. Predictive Analysis for Future Trends


Data mining helps organizations forecast trends and market behavior
based on historical data.

It enables companies to anticipate customer needs, manage inventory,
and plan for future demand.

Example:

Retailers use data mining to predict seasonal sales trends and adjust stock
levels accordingly.

7. Cost Reduction and Efficiency Optimization


Businesses use data mining to identify inefficiencies, optimize supply
chains, and reduce operational costs.

It helps in resource allocation, workforce management, and minimizing
wastage.

Example:
Manufacturing companies use data mining to analyze machine
performance and predict equipment failures, reducing maintenance costs.

8. Competitive Advantage in the Market


Companies that effectively utilize data mining gain a competitive edge
over others.

It enables organizations to stay ahead of market trends and make better
strategic decisions.

Example:

Social media platforms like Facebook and Instagram use data mining to
analyze user behavior and optimize ad placements, generating more
revenue.

9. Personalized Marketing and Recommendation Systems


Data mining helps in targeted advertising by analyzing customer
preferences and online behavior.

Businesses can segment customers based on demographics, interests,
and purchase history.

Example:
Google Ads uses data mining to display personalized advertisements
based on users' search history and browsing behavior.

10. Enhancing Scientific Research and Healthcare


Scientists use data mining to analyze large datasets in genetics, climate
change, and medical research.

Healthcare providers use it for disease diagnosis, treatment
recommendations, and drug discovery.

Example:
Pharmaceutical companies use data mining to analyze clinical trial results,
speeding up drug development.

Knowledge Discovery in Databases (KDD) Process

Introduction to KDD
Knowledge Discovery in Databases (KDD) is a systematic process of
extracting useful, valid, and previously unknown patterns or knowledge from
large datasets. It involves multiple stages, starting from raw data collection
to meaningful insight generation, enabling informed decision-making in
various fields such as business, healthcare, finance, and scientific research.

KDD is not just about data mining; it is a broader process that includes data
preprocessing, transformation, and interpretation of results.

Steps in the KDD Process


The KDD process consists of the following major steps:

1. Data Selection

2. Data Preprocessing (Cleaning and Integration)

3. Data Transformation

4. Data Mining

5. Pattern Evaluation and Knowledge Representation

Each of these steps plays a crucial role in ensuring that the final extracted
knowledge is accurate, relevant, and useful.

1. Data Selection
This is the first step, where relevant data is chosen from various sources
such as databases, data warehouses, web data, or sensor logs.

Objectives:
Identify the most relevant attributes (features) required for analysis.

Remove unnecessary or redundant data.

Extract data from different sources such as transactional databases,
logs, spreadsheets, or cloud storage.

Example:
In a retail business, sales records from the last five years might be selected
from a database for customer purchasing behavior analysis.
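
The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (customer_id, year, amount, cashier), and the select_data helper are hypothetical and do not come from any particular system. The sketch keeps rows from the last five years and drops an attribute that is not needed for the analysis.

```python
# Hypothetical sales records; the field names are illustrative assumptions.
sales = [
    {"customer_id": 1, "year": 2017, "amount": 120.0, "cashier": "A"},
    {"customer_id": 2, "year": 2021, "amount": 75.5, "cashier": "B"},
    {"customer_id": 1, "year": 2023, "amount": 310.0, "cashier": "A"},
    {"customer_id": 3, "year": 2024, "amount": 42.0, "cashier": "C"},
]


def select_data(records, current_year, years_back, attributes):
    """Keep rows from the last `years_back` years and only the listed attributes."""
    first_year = current_year - years_back + 1
    return [
        {name: row[name] for name in attributes}
        for row in records
        if first_year <= row["year"] <= current_year
    ]


# Select five years of data (2020 to 2024) and drop the irrelevant "cashier" field.
selected = select_data(
    sales,
    current_year=2024,
    years_back=5,
    attributes=["customer_id", "year", "amount"],
)
```

In a real setting the same selection would more likely be written as a database query; the list-based version is used here only to keep the example self-contained.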

2. Data Preprocessing (Cleaning and Integration)


Raw data is often incomplete, noisy, or inconsistent. This step ensures data
quality by handling missing values, removing errors, and integrating multiple
datasets.

Key Tasks:
Data Cleaning: Handling missing values, removing duplicate records,
and correcting errors.

Data Integration: Combining data from multiple sources to form a single,
consistent dataset.

Example:
Filling missing age values in a customer database using the average of
available values.

Merging customer transaction data from multiple branches into a central
database.
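
Both examples above can be sketched together in Python. The sketch below is a minimal illustration, assuming two hypothetical branch datasets keyed by a cust_id field; the helper names integrate and fill_missing_age are invented for this example. It merges the branches, drops duplicate customers, and fills missing ages with the average of the available values.

```python
from statistics import mean

# Hypothetical branch data; the field names are illustrative assumptions.
branch_a = [
    {"cust_id": 1, "age": 30},
    {"cust_id": 2, "age": None},
]
branch_b = [
    {"cust_id": 2, "age": None},  # duplicate of a record in branch_a
    {"cust_id": 3, "age": 50},
]


def integrate(*sources):
    """Merge several sources, keeping the first record seen for each cust_id."""
    merged = {}
    for source in sources:
        for row in source:
            merged.setdefault(row["cust_id"], dict(row))
    return list(merged.values())


def fill_missing_age(rows):
    """Replace missing ages with the mean of the available ages."""
    known = [row["age"] for row in rows if row["age"] is not None]
    average = mean(known)
    return [
        {**row, "age": average if row["age"] is None else row["age"]}
        for row in rows
    ]


customers = fill_missing_age(integrate(branch_a, branch_b))
```

Mean imputation is used because it is the technique named in the example; other strategies (median, most frequent value, or dropping the record) may suit skewed data better.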

3. Data Transformation
Once the data is cleaned and integrated, it is transformed into a suitable
format for analysis. This step involves normalization, aggregation, and
feature selection.

Techniques Used:
Normalization: Scaling numerical values to a common range (e.g.,
between 0 and 1).

Aggregation: Summarizing data at different levels (e.g., daily sales →
monthly sales).

Feature Selection: Choosing only the most relevant attributes for
analysis.

Example:
Converting salary figures into a common unit or scale (e.g., converting
rupees into dollars).

Aggregating daily product sales data into monthly sales reports.
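
A rough Python sketch of two of the techniques above, normalization and aggregation, is given here. The function names and the sample values are assumptions made for illustration. Min-max normalization maps the smallest value to 0 and the largest to 1, and the aggregation groups daily amounts by the year and month part of an ISO date string.

```python
from collections import defaultdict


def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        # All values are equal, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def monthly_totals(daily_sales):
    """Aggregate (ISO date string, amount) pairs into totals per 'YYYY-MM'."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount
    return dict(totals)


# Hypothetical salary figures and daily sales used only for illustration.
salaries = [20000, 35000, 50000]
scaled = min_max_normalize(salaries)

daily = [("2024-01-03", 100.0), ("2024-01-17", 50.0), ("2024-02-01", 80.0)]
by_month = monthly_totals(daily)
```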

4. Data Mining
This is the core step of the KDD process, where data mining algorithms are
applied to extract patterns, trends, and insights.

Common Data Mining Techniques:


Classification: Assigning labels to data (e.g., spam vs. non-spam
emails).

Clustering: Grouping similar data points (e.g., customer segmentation).

Association Rule Mining: Finding relationships between variables (e.g.,
"Customers who buy bread often buy butter").

Anomaly Detection: Identifying unusual patterns (e.g., fraud detection
in credit card transactions).

Example:
A bank uses classification to predict whether a loan applicant is likely to
default.

A supermarket uses association rules to discover that customers
buying milk also tend to buy cereal.

5. Pattern Evaluation and Knowledge Representation


In this final step, the discovered patterns are evaluated for usefulness and
interpreted into meaningful knowledge. Only significant and valid patterns
are retained for decision-making.

Key Aspects:
Filtering out patterns that are statistically insignificant or irrelevant.

Visualizing results using graphs, charts, or dashboards.

Converting patterns into business strategies or actionable insights.

Example:
A healthcare provider analyzes mined data to identify key factors
leading to heart disease and takes preventive actions.

An e-commerce website personalizes product recommendations based
on customer behavior analysis.

Data Mining Architecture


Basic Working:

1. The user submits a data mining request, which is sent to the data
mining engine for pattern evaluation.

2. The engine tries to answer the query using the data already present in
the database.

3. The extracted metadata is sent to the data mining engine for analysis;
the engine sometimes interacts with the pattern evaluation modules to
determine the result.

4. The result is then sent to the front end and presented in an easily
understandable manner through a suitable interface.

The parts of the data mining architecture are described below:

1. Data Sources: Databases, the World Wide Web (WWW), and data
warehouses are all data sources. The data in these sources may be
in the form of plain text, spreadsheets, or other forms of media like
photos or videos. The WWW is one of the biggest sources of data.

2. Database Server: The database server contains the actual data ready to
be processed. It performs the task of handling data retrieval as per the
request of the user.

3. Data Mining Engine: It is one of the core components of the data mining
architecture that performs all kinds of data mining techniques like
association, classification, characterization, clustering, prediction, etc.

4. Pattern Evaluation Modules: They are responsible for finding interesting
patterns in the data, and sometimes they also interact with the database
servers to produce the result of the user requests.

5. Graphical User Interface: Since the user cannot fully understand the
complexity of the data mining process, the graphical user interface helps
the user to communicate effectively with the data mining system.

6. Knowledge Base: Knowledge Base is an important part of the data
mining engine that is quite beneficial in guiding the search for the result
patterns. Data mining engines may also sometimes get inputs from the
knowledge base. This knowledge base may contain data from user
experiences. The objective of the knowledge base is to make the result
more accurate and reliable.

Types of Data Mining architecture:

1. No Coupling: The no-coupling architecture retrieves data directly from
particular data sources. It does not use the database to retrieve the
data, even though the database is an efficient and accurate way to do
so. This architecture is considered poor and is used only for very
simple data mining processes.

2. Loose Coupling: In the loose coupling architecture, the data mining
system retrieves data from the database and stores data back in those
systems. This approach is meant for memory-based data mining.

3. Semi-Tight Coupling: It tends to use various advantageous features of
the data warehouse systems. It includes sorting, indexing, and
aggregation. In this architecture, an intermediate result can be stored in
the database for better performance.

4. Tight coupling: In this architecture, a data warehouse is considered one
of its most important components whose features are employed for
performing data mining tasks. This architecture provides scalability,
performance, and integrated information.

Advantages of Data Mining:

Assists in preventing future adversities by accurately predicting future
trends.

Contributes to the making of important decisions.

Compresses data into valuable information.

Provides new trends and unexpected patterns.

Helps to analyze huge data sets.

Helps companies to find, attract, and retain customers.

Helps the company to improve its relationship with the customers.

Helps companies to optimize their production according to the popularity
of a certain product, thus saving costs.

Disadvantages of Data Mining:

Excessive work intensity requires high-performance teams and staff
training.

The need for large investments can also be a problem, as data
collection sometimes consumes many resources and therefore carries a
high cost.

Lack of security could also put the data at huge risk, as the data may
contain private customer details.

Inaccurate data may lead to the wrong output.

Huge databases are quite difficult to manage.

Data Mining Functionalities


Data mining functionalities are used to represent the type of patterns that
have to be discovered in data mining tasks. In general, data mining tasks
can be classified into two types: descriptive and predictive.
Descriptive mining tasks define the common features of the data in the
database, and predictive mining tasks perform inference on the current
information to develop predictions.
There are various data mining functionalities, which are as follows −

Data characterization − It is a summarization of the general
characteristics of an object class of data. The data corresponding to the
user-specified class is generally collected by a database query. The
output of data characterization can be presented in multiple forms.

Data discrimination − It is a comparison of the general characteristics
of target class data objects with the general characteristics of objects
from one or a set of contrasting classes. The target and contrasting
classes can be specified by the user, and the corresponding data objects
fetched through database queries.

Association Analysis − It analyses the set of items that generally occur
together in a transactional dataset. There are two parameters that are
used for determining the association rules −

Support, which identifies the common item set in the database.

Confidence, which is the conditional probability that an item occurs in a
transaction when another item occurs.

Classification − Classification is the procedure of discovering a model
that represents and distinguishes data classes or concepts, for the
objective of being able to use the model to predict the class of objects
whose class label is unknown. The derived model is established on
the analysis of a set of training data (i.e., data objects whose class label
is known).

Prediction − It is used to predict unavailable data values or upcoming
trends. An object can be anticipated based on the attribute values of the
object and attribute values of the classes. It can be a prediction of
missing numerical values or increase/decrease trends in time-related
information.

Clustering − It is similar to classification but the classes are not
predefined. The classes are represented by data attributes. It is
unsupervised learning. The objects are clustered or grouped based
on the principle of maximizing the intraclass similarity and minimizing
the interclass similarity.

Outlier analysis − Outliers are data elements that cannot be grouped in
a given class or cluster. These are the data objects whose behaviour
differs from the general behaviour of other data objects. The analysis
of this type of data can be essential to mine the knowledge.

Evolution analysis − It describes the trends of objects whose behaviour
changes over time.
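
Two of the functionalities listed above, clustering and outlier analysis, can be illustrated with a short Python sketch. It makes several assumptions: the data is one-dimensional, clustering is done with a basic k-means loop (Lloyd's algorithm) from hand-picked starting centroids, and outliers are flagged with a simple z-score rule. These are common choices rather than the only ones, and the sample values are invented.

```python
from statistics import mean, pstdev


def k_means_1d(values, centroids, iterations=10):
    """Cluster numbers around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical customer spending: two natural groups, low and high spenders.
spend = [10, 12, 11, 50, 52, 51]
centroids, clusters = k_means_1d(spend, centroids=[0, 60])

# Hypothetical transaction amounts with one unusually large value.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
outliers = z_score_outliers(amounts)
```

Note that no class labels are supplied to k_means_1d, which is what makes clustering unsupervised. The z-score rule is sensitive to sample size and to the outliers themselves, so the threshold of 2.0 is only a starting point.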

Data Mining Task Primitives


A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery to
direct the mining process or examine the findings from different angles or
depths. The data mining primitives specify the following,

1. Set of task-relevant data to be mined.

2. Kind of knowledge to be mined.

3. Background knowledge to be used in the discovery process.

4. Interestingness measures and thresholds for pattern evaluation.

5. Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these
primitives, allowing users to interact with data mining systems flexibly.
Having a data mining query language provides a foundation on which user-
friendly graphical interfaces can be built.
Designing a comprehensive data mining language is challenging because
data mining covers a wide spectrum of tasks, from data characterization to
evolution analysis. Each task has different requirements. The design of an
effective data mining query language requires a deep understanding of the
power, limitation, and underlying mechanisms of the various kinds of data
mining tasks. This facilitates a data mining system's communication with
other information systems and integrates with the overall information
processing environment.

List of Data Mining Task Primitives
A data mining query is defined in terms of the following primitives, such as:
1. The set of task-relevant data to be mined
This specifies the portions of the database or the set of data in which the
user is interested. This includes the database attributes or data warehouse
dimensions of interest (the relevant attributes or dimensions).
In a relational database, the set of task-relevant data can be collected via a
relational query involving operations like selection, projection, join, and
aggregation.
The data collection process results in a new data relation called the initial
data relation. The initial data relation can be ordered or grouped according
to the conditions specified in the query. This data retrieval can be thought of
as a subtask of the data mining task.
This initial relation may or may not correspond to a physical relation in the
database. Since virtual relations are called views in the field of databases,
the set of task-relevant data for data mining is called a minable view.
2. The kind of knowledge to be mined
This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
3. The background knowledge to be used in the discovery process
This knowledge about the domain to be mined is useful for guiding the
knowledge discovery process and evaluating the patterns found. Concept
hierarchies are a popular form of background knowledge, which allows data
to be mined at multiple levels of abstraction.
Concept hierarchy defines a sequence of mappings from low-level concepts
to higher-level, more general concepts.

Rolling Up - Generalization of data: Allows data to be viewed at more
meaningful and explicit abstractions, which makes it easier to understand.
It compresses the data, so fewer input/output operations are required.

Drilling Down - Specialization of data: Concept values are replaced by
lower-level concepts. Based on different user viewpoints, there may be
more than one concept hierarchy for a given attribute or dimension.

An example of a concept hierarchy for the attribute (or dimension) age is
shown below. User beliefs regarding relationships in the data are another
form of background knowledge.
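
Rolling up along a concept hierarchy can be sketched in Python as follows. The age groups and their boundaries (youth, middle_aged, senior) are assumptions chosen purely for illustration and are not taken from the original figure; the function names are likewise hypothetical.

```python
from collections import Counter


def age_group(age):
    """Map a raw age (low-level concept) to a more general age group."""
    # Group names and boundaries are hypothetical.
    if age < 20:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"


def roll_up(ages):
    """Generalize raw ages to age groups and count the customers in each group."""
    return Counter(age_group(a) for a in ages)


ages = [18, 25, 37, 42, 61, 70]
summary = roll_up(ages)
```

Six raw values are compressed into three group counts, which shows why generalized data is smaller and easier to read. Drilling down would go the other way, from a group back to the individual ages it covers.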
4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interestingness measures.
They may be used to guide the mining process or, after discovery, to
evaluate the discovered patterns. For example, interestingness measures for
association rules include support and confidence. Rules whose support and
confidence values are below user-specified thresholds are considered
uninteresting.

Simplicity: A factor contributing to the interestingness of a pattern is the
pattern's overall simplicity for human comprehension. For example, the
more complex the structure of a rule is, the more difficult it is to
interpret, and hence, the less interesting it is likely to be. Objective
measures of pattern simplicity can be viewed as functions of the pattern
structure, defined in terms of the pattern size in bits or the number of
attributes or operators appearing in the pattern.

Certainty (Confidence): Each discovered pattern should have a
measure of certainty associated with it that assesses the validity or
"trustworthiness" of the pattern. A certainty measure for association
rules of the form "A => B", where A and B are sets of items, is confidence.
Given a set of task-relevant data tuples, the confidence of "A => B" is
defined as:

Confidence (A => B) = (# tuples containing both A and B) / (# tuples containing A)

Utility (Support): The potential usefulness of a pattern is a factor
defining its interestingness. It can be estimated by a utility function,
such as support. The support of an association pattern refers to the
percentage of task-relevant data tuples (or transactions) for which the
pattern is true:

Support (A => B) = (# tuples containing both A and B) / (total # of tuples)

Novelty: Novel patterns are those that contribute new information or
increased performance to the given pattern set. For example, a data
exception is a novel pattern. Another strategy for detecting novelty is
to remove redundant patterns.
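
The support and confidence measures described above can be computed directly from their definitions. The Python sketch below is illustrative only: the shopping baskets and the thresholds (0.4 for support, 0.6 for confidence) are made-up values, and the function names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing A that also contain B."""
    both = support(transactions, set(antecedent) | set(consequent))
    only_a = support(transactions, antecedent)
    return both / only_a if only_a else 0.0


# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk", "cereal"},
]

# Rule: bread => butter
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})

# A rule is kept only if it meets both user-specified thresholds (assumed values).
interesting = rule_support >= 0.4 and rule_confidence >= 0.6
```

Here bread and butter appear together in 2 of 4 baskets, and bread appears in 3 of 4, so the rule has support 0.5 and confidence 2/3.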

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed,
which may include rules, tables, cross tabs, charts, graphs, decision trees,
cubes, or other visual representations.
Users must be able to specify the forms of presentation to be used for
displaying the discovered patterns. Some representation forms may be
better suited than others for particular kinds of knowledge.
For example, generalized relations and their corresponding cross tabs or
pie/bar charts are good for presenting characteristic descriptions, whereas
decision trees are common for classification.

Example of Data Mining Task Primitives


Suppose, as a marketing manager of AllElectronics, you would like to
classify customers based on their buying patterns. You are especially
interested in those customers whose salary is no less than $40,000 and
who have bought more than $1,000 worth of items, each of which is priced
at no less than $100.
In particular, you are interested in the customer's age, income, the types of
items purchased, the purchase location, and where the items were made.
You would like to view the resulting classification in the form of rules. This
data mining query is expressed in DMQL as follows, where each line of the
query has been enumerated to aid in our discussion.

1. use database AllElectronics_db

2. use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age

3. mine classification as promising_customers

4. in relevance to C.age, C.income, I.type, I.place_made, T.branch

5. from customer C, item I, transaction T

6. where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID and C.income ≥
40,000 and I.price ≥ 100

7. group by T.cust_ID
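
The task-relevant data for this query (lines 5 to 7) can be approximated with an ordinary relational query. The sketch below uses Python's built-in sqlite3 module with an in-memory database. Everything in it is an assumption for illustration: the table layouts, the sample rows, the table name transactions (used because TRANSACTION is a reserved word in SQLite), and the HAVING clause, which is added here to reflect the "more than $1,000 worth of items" condition stated in the prose. The mining step itself (classification) is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_ID INTEGER, age INTEGER, income REAL);
    CREATE TABLE item (item_ID INTEGER, type TEXT, place_made TEXT, price REAL);
    CREATE TABLE transactions (cust_ID INTEGER, item_ID INTEGER, branch TEXT);

    INSERT INTO customer VALUES (1, 34, 55000), (2, 45, 30000), (3, 29, 80000);
    INSERT INTO item VALUES (10, 'laptop', 'Japan', 900),
                            (11, 'camera', 'Germany', 400),
                            (12, 'cable', 'China', 15);
    INSERT INTO transactions VALUES (1, 10, 'Delhi'), (1, 11, 'Delhi'),
                                    (2, 10, 'Mumbai'),
                                    (3, 11, 'Pune'), (3, 12, 'Pune');
""")

# Selection, join, and grouping that mirror the where / group by lines of the query.
rows = conn.execute("""
    SELECT T.cust_ID, SUM(I.price) AS total_spent
    FROM customer C
    JOIN transactions T ON C.cust_ID = T.cust_ID
    JOIN item I ON I.item_ID = T.item_ID
    WHERE C.income >= 40000 AND I.price >= 100
    GROUP BY T.cust_ID
    HAVING SUM(I.price) > 1000
""").fetchall()
conn.close()
```

With this sample data only customer 1 qualifies: customer 2 fails the income condition, and customer 3 has only one item priced at $100 or more, which totals $400.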

What is the integration of a data mining system with a database system?

The data mining system is integrated with a database or data warehouse
system so that it can do its tasks effectively. A data mining system
operates in an environment that requires it to communicate with other
data systems, such as a database system. The possible schemes for
integrating these systems are as follows −

No coupling − No coupling means that a data mining system will not use
any function of a database or data warehouse system. It can retrieve data
from a specific source (including a file system), process data using some
data mining algorithms, and then save the mining results in a different
file.
Such a system, though simple, suffers from various limitations. First, a
database system offers a great deal of flexibility and adaptability in storing,
organizing, accessing, and processing data. Without using a database/data
warehouse system, a data mining system may spend a large amount of
time finding, collecting, cleaning, and transforming data.

Loose Coupling − In this scheme, the data mining system uses some services
of a database or data warehouse system. The data is fetched from a data
repository handled by these systems. Data mining approaches are used to
process the data, and then the processed data is saved either in a file or in a
designated area in a database or data warehouse. Loose coupling is better
than no coupling as it can fetch any portion of the data stored in databases
by using query processing or various system facilities.

Semitight Coupling − In this scheme, efficient implementations of a few
essential data mining primitives can be provided in the database/data
warehouse system. These primitives can include sorting, indexing,
aggregation, histogram analysis, multi-way join, and pre-computation of some
important statistical measures, including sum, count, max, min, standard
deviation, etc.

Tight coupling − Tight coupling means that a data mining system is
smoothly integrated into the database/data warehouse system. The data
mining subsystem is considered as one functional element of an information
system.
Data mining queries and functions are optimized based on mining
query analysis, data structures, indexing schemes, and query processing
methods of database/data warehouse systems. It is highly desirable
because it supports the effective implementation of data mining functions,
high system performance, and an integrated data processing environment.
