
UNIT 1

Introduction : Data Mining: Definitions, KDD v/s Data Mining, DBMS v/s Data Mining , DM
techniques, Mining problems, Issues and Challenges in DM, DM Application areas. Association
Rules & Clustering Techniques: Introduction, Various association algorithms like A Priori, Partition,
Pincer search etc., Generalized association rules.

Data Mining:
Data mining refers to extracting or mining knowledge from large amounts of data. The term
is actually a misnomer (a name that is wrong or not appropriate): data mining would more
appropriately have been named knowledge mining, which emphasizes mining knowledge from
large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at
the intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
The key properties of data mining are
 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large datasets and databases

Tasks of Data Mining


Data mining involves six common classes of tasks:
Anomaly detection (Outlier/change/deviation detection) – The identification of
unusual data records that might be interesting, or of data errors that require further
investigation.
Association rule learning (Dependency modelling) – Searches for relationships
between variables. For example, a supermarket might gather data on customer purchasing
habits. Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
Regression – attempts to find a function which models the data with the least error.
Summarization – providing a more compact representation of the data set, including
visualization and report generation.

Architecture of Data Mining


A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).

2. Data Mining Engine:


This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:


This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into
the mining process so as to confine the search to only the interesting patterns.

4. User interface:
This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on
the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.

Data Mining Process:


Data Mining is a process of discovering various models, summaries, and derived values from
a given collection of data.
The general experimental procedure adapted to data-mining problems involves the following
steps:
1. State the problem and formulate the hypothesis
Most data-based modelling studies are performed in a particular application domain.
Hence, domain-specific knowledge and experience are usually necessary in order to
come up with a meaningful problem statement. Unfortunately, many application
studies tend to focus on the data-mining technique at the expense of a clear problem
statement. In this step, a modeler usually specifies a set of variables for the unknown
dependency and, if possible, a general form of this dependency as an initial
hypothesis. There may be several hypotheses formulated for a single problem at this
stage. The first step requires combined expertise in the application domain and in data
mining. In practice, it usually means a close interaction between the data-
mining expert and the application expert. In successful data-mining applications, this
cooperation does not stop in the initial phase; it continues during the entire data-
mining process.
2. Collect the data
This step is concerned with how the data are generated and collected. In general, there
are two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process:
this is known as the observational approach. An observational setting, namely,
random data generation, is assumed in most data-mining applications. Typically, the
sampling distribution is completely unknown after data are collected, or it is partially
and implicitly given in the data-collection procedure. It is very important, however, to
understand how data collection affects its theoretical distribution, since such a priori
knowledge can be very useful for modeling and, later, for the final interpretation of
results. Also, it is important to make sure that the data used for estimating a model
and the data used later for testing and applying a model come from the same,
unknown, sampling distribution. If this is not the case, the estimated model cannot be
successfully used in a final application of the results.

3. Pre-processing the data


In the observational setting, data are usually "collected" from the existing databases,
data warehouses, and data marts. Data pre-processing usually includes at least two
common tasks:
1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the pre-processing phase, or
b. Develop robust modelling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data pre-processing includes several


steps such as variable scaling and different types of encoding. For example, one
feature with the range [0, 1] and the other with the range [−100, 1000] will not have
the same weights in the applied technique; they will also influence the final data-
mining results differently. Therefore, it is recommended to scale them and bring both
features to the same weight for further analysis. Also, application-specific encoding
methods usually achieve dimensionality reduction by providing a smaller number of
informative features for subsequent data modelling.
These two classes of pre-processing tasks are only illustrative examples of a large
spectrum of pre-processing activities in a data-mining process.
Data pre-processing steps should not be considered completely independent from
other data-mining phases. In every iteration of the data-mining process, all activities,
together, could define new and improved data sets for subsequent iterations.
Generally, a good pre-processing method provides an optimal representation for a
data-mining technique by incorporating a priori knowledge in the form of application-
specific scaling and encoding.
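As an illustration of the scaling step above, here is a minimal sketch in Python, assuming NumPy and scikit-learn are available (these libraries are only one possible toolset and are not prescribed by the text; the feature ranges are the illustrative [0, 1] and [-100, 1000] from the example):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# two features on very different scales, as in the example above
X = np.array([[0.2, -100.0],
              [0.5,  400.0],
              [0.9, 1000.0]])

scaler = MinMaxScaler()              # rescales each column to the range [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)                      # both features now carry comparable weight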

4. Estimate the model


The selection and implementation of the appropriate data-mining technique is the
main task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task.

5. Interpret the model and draw conclusions


In most cases, data-mining models should help in decision making. Hence, such
models need to be interpretable in order to be useful because humans are not likely to
base their decisions on complex "black-box" models. Note that the goals of accuracy
of the model and accuracy of its interpretation are somewhat contradictory. Usually,
simple models are more interpretable, but they are also less accurate. Modern data-
mining methods are expected to yield highly accurate results using high dimensional
models. The problem of interpreting these models, also very important, is considered
a separate task, with specific techniques to validate the results. A user does not want
hundreds of pages of numeric results: he cannot understand, summarize, interpret, or use
them for successful decision making.
Classification of Data mining Systems:
The data mining system can be classified according to the following criteria:
 Database Technology
 Statistics
 Machine Learning
 Information Science
 Visualization
 Other Disciplines

Some Other Classification Criteria:


 Classification according to kind of databases mined
 Classification according to kind of knowledge mined
 Classification according to kinds of techniques utilized
 Classification according to applications adapted

Classification according to kind of databases mined


We can classify the data mining system according to kind of databases mined. Database
system can be classified according to different criteria such as data models, types of data etc.
And the data mining system can be classified accordingly. For example, if we classify the
database according to the data model, then we may have a relational, transactional, object-
relational, or data warehouse mining system.

Classification according to kind of knowledge mined


We can classify the data mining system according to the kind of knowledge mined. This means
data mining systems are classified on the basis of functionalities such as:
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis

Classification according to kinds of techniques utilized


We can classify the data mining system according to the kind of techniques used. These
techniques can be described according to the degree of user interaction involved or the
methods of data analysis employed.

Classification according to applications adapted


We can classify the data mining system according to the applications adapted. These
applications are as follows:
 Finance
 Telecommunications
 DNA
 Stock Markets
 E-mail

Major Issues In Data Mining:


Mining different kinds of knowledge in databases. - Different users have different needs
and may be interested in different kinds of knowledge. Therefore, it is necessary for data
mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining


process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.

Incorporation of background knowledge. - To guide discovery process and to express the


discovered patterns, background knowledge can be used. Background knowledge may be
used to express the discovered patterns not only in concise terms but at multiple levels of
abstraction.

Data mining query languages and ad hoc data mining. - Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once the patterns are discovered,
they need to be expressed in high-level languages and visual representations. These
representations should be easily understandable by the users.

Handling noisy or incomplete data. - Data cleaning methods are required that can
handle noise and incomplete objects while mining the data regularities. If data cleaning
methods are not used, the accuracy of the discovered patterns will be poor.

Pattern evaluation. - It refers to the interestingness of the discovered patterns. Patterns may
be uninteresting if they merely represent common knowledge or lack novelty.

Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be
efficient and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size
of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel, and the results
from the partitions are then merged. Incremental algorithms update the mined knowledge
without having to mine the data again from scratch.

Knowledge Discovery in Databases(KDD)


KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets. The KDD
process is iterative and may require multiple passes over the following steps to extract
accurate knowledge from the data. The KDD process includes the following steps:

Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It includes:
 Cleaning in the case of missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with data discrepancy detection and data transformation tools.

Data Integration
Data integration is defined as the combination of heterogeneous data from multiple sources into a
common source (data warehouse). Data integration is performed using data migration tools, data
synchronization tools, and the ETL (Extract, Transform, Load) process.

Data Selection
Data selection is defined as the process where data relevant to the analysis are decided upon and
retrieved from the data collection. For this we can use neural networks, decision trees,
Naive Bayes, clustering, and regression methods.

Data Transformation
Data transformation is defined as the process of transforming data into the form appropriate
for the mining procedure. Data transformation is a two-step process:

 Data Mapping: Assigning elements from the source base to the destination to capture
transformations.
 Code generation: Creation of the actual transformation program.

Data Mining
Data mining is defined as the application of techniques to extract potentially useful patterns. It
transforms task-relevant data into patterns, and decides the purpose of the model using
classification or characterization.

Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns representing
knowledge, based on given interestingness measures. It finds an interestingness score for each
pattern, and uses summarization and visualization to make the data understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make
decisions.

KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, new data can be integrated and transformed in order to get different and more
appropriate results. Preprocessing of databases consists of Data cleaning and Data
Integration.

OR
Some people treat data mining as a synonym for knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process:
Data Cleaning - In this step, noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step, data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, knowledge is represented.

(Figure: the steps of the knowledge discovery process.)


KDD v/s Data Mining:
Parameter | KDD | Data Mining
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data Mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques Used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as algorithms are designed to identify patterns without relying on prior knowledge.

Comparison of KDD and Data Mining

Aspect | KDD (Knowledge Discovery in Databases) | Data Mining
Definition | A complete process of finding useful knowledge from data. | A step within KDD focused on extracting patterns or models.
Scope | Broader: includes data preparation, selection, cleaning, and interpretation. | Narrower: focused only on finding patterns in the data.
Process Involvement | Involves multiple stages such as selection, preprocessing, transformation, mining, and interpretation. | Focused on the "mining" phase of the KDD process.
Objective | Discover actionable and meaningful knowledge from data. | Extract patterns or models from the dataset.
Techniques Used | Includes all techniques used in data preprocessing, mining, and evaluation. | Uses statistical, machine learning, and algorithmic techniques.
Output | Knowledge, actionable insights, or patterns with context. | Patterns, rules, or predictions from data.
Focus | Focuses on the entire pipeline from raw data to actionable knowledge. | Focuses on building models and identifying relationships.
Example | Detecting customer churn and interpreting it for business improvement. | Identifying frequent purchase patterns (e.g., with the Apriori algorithm).
End Goal | Knowledge discovery and decision-making. | Generation of data patterns or insights to be further analyzed.
Dependencies | Relies on data cleaning, transformation, and selection as prerequisites. | Relies on algorithms and models to extract patterns from processed data.
DBMS v/s Data Mining:

Comparison of DBMS and Data Mining

Aspect | DBMS (Database Management System) | Data Mining
Definition | A software system designed to store, manage, and retrieve data efficiently. | The process of discovering patterns and knowledge from large datasets.
Primary Purpose | To organize and manage structured data for easy access and manipulation. | To analyze data and extract meaningful patterns, trends, or rules.
Focus | Focuses on managing data efficiently and ensuring data integrity. | Focuses on identifying hidden relationships and useful knowledge in data.
Operations | CRUD operations: Create, Read, Update, Delete. | Pattern discovery, classification, clustering, prediction.
Techniques Used | Relational algebra, SQL queries, indexing, and normalization. | Statistical analysis, machine learning, and algorithmic techniques.
Type of Data | Structured data, typically in relational tables. | Structured, semi-structured, and unstructured data.
Output | Specific query results or reports based on stored data. | Patterns, rules, trends, or predictive models.
Example Use Case | Storing customer data in a relational database and retrieving it via queries. | Identifying frequent purchase patterns from customer transaction data.
Complexity | Simpler; focuses on data storage and retrieval mechanisms. | More complex; involves advanced analytics and computations.
User Interaction | Requires users to define and manage data schemas and write queries. | Automates the discovery of insights with minimal user input.
Processing Approach | Deterministic: returns exact data based on a query. | Non-deterministic: finds potential patterns and insights.
Use Cases | Transaction processing, data storage, and management. | Market basket analysis, fraud detection, customer segmentation.
Key Tools | MySQL, PostgreSQL, Oracle, SQL Server. | Tools like WEKA, RapidMiner, TensorFlow, or custom algorithms.
End Goal | Efficient data storage, retrieval, and manipulation. | Extracting actionable insights and knowledge from data.
DM techniques:

1. Association
Association analysis is the finding of association rules showing attribute-value conditions that
occur frequently together in a given set of data. Association analysis is widely used for a
market basket or transaction data analysis. Association rule mining is a significant and
exceptionally dynamic area of data mining research. One method of association-based
classification, called associative classification, consists of two steps. In the first step,
association rules are generated using a modified version of the standard association
rule mining algorithm known as Apriori. The second step constructs a classifier based on the
association rules discovered.

2. Classification
Classification is the process of finding a set of models (or functions) that describe and
distinguish data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the
analysis of a set of training data (i.e., data objects whose class label is
known). The derived model may be represented in various forms, such as classification (if –
then) rules, decision trees, and neural networks. Data mining has different types of classifiers:

 Decision Tree
 SVM(Support Vector Machine)
 Generalized Linear Models
 Bayesian classification:
 Classification by Backpropagation
 K-NN Classifier
 Rule-Based Classification
 Frequent-Pattern Based Classification
 Rough set theory
 Fuzzy Logic
Decision Trees: A decision tree is a flow-chart-like tree structure, where each node
represents a test on an attribute value, each branch denotes an outcome of a test, and tree
leaves represent classes or class distributions. Decision trees can be easily transformed into
classification rules. Decision tree induction is a nonparametric approach for building
classification models. In other words, it does not require any prior assumptions regarding the
type of probability distribution satisfied by the class and other attributes. Decision trees,
especially smaller trees, are relatively easy to interpret. The accuracy of decision trees is
also comparable to that of other classification techniques for many simple data sets. They
provide an expressive representation for learning discrete-valued functions. However, they do
not generalize well to certain types of Boolean problems.

(Figure: a decision tree generated on the IRIS data set from the UCI Machine Learning
Repository, which has three class labels: Setosa, Versicolor, and Virginica.)
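Since the figure refers to a tree built on the IRIS data, a minimal sketch of decision tree induction is given below, assuming scikit-learn (an illustrative library choice, not prescribed by the text); export_text prints the learned if-then structure:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()                                  # 3 classes: Setosa, Versicolor, Virginica
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)                    # build the tree from labeled training data

# render the tree as readable if-then classification rules
print(export_text(tree, feature_names=list(iris.feature_names)))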

Support Vector Machine (SVM) Classifier Method: The Support Vector Machine is a
supervised learning strategy used for classification and additionally for regression. When the
output of the support vector machine is a continuous value, the learning methodology is said
to perform regression; when the learning methodology predicts a category label of the input
object, it is known as classification. The independent variables may or may not be
quantitative. Kernel equations are functions that transform linearly non-separable data in one
domain into another domain where the instances become linearly separable. Kernel equations
may be linear, quadratic, Gaussian, or anything else that achieves this specific purpose. A
linear classification technique is a classifier that bases its decision on a linear function of its
inputs. Applying the kernel equations arranges the data instances within the multi-dimensional
space in such a way that there is a hyper-plane that separates data instances of one kind from
those of another. The advantage of Support Vector Machines is that they can make use of
certain kernels to transform the problem, so that we are able to apply linear classification
techniques to nonlinear data. Once we manage to divide the data into two different classes,
our aim is to find the best hyper-plane that separates the two kinds of instances.

Generalized Linear Models: Generalized Linear Models (GLM) are a statistical technique for
linear modeling. GLM provides extensive coefficient statistics and model statistics, as well as
row diagnostics. It also supports confidence bounds.
Bayesian Classification: A Bayesian classifier is a statistical classifier. Bayesian classifiers can
predict class membership probabilities, for instance, the probability that a given sample belongs
to a particular class. Bayesian classification is based on Bayes' theorem. Studies comparing
the classification algorithms have found a simple Bayesian classifier known as the naive
Bayesian classifier to be comparable in performance with decision tree and neural network
classifiers. Bayesian classifiers have also displayed high accuracy and speed when applied to
large databases. Naive Bayesian classifiers assume that the effect of an attribute value on a
given class is independent of the values of the other attributes. This assumption is termed class
conditional independence. It is made to simplify the calculations involved, and is in this sense
considered "naive". Bayesian belief networks are graphical models which, unlike naive Bayesian
classifiers, allow the representation of dependencies among subsets of attributes. Bayesian
belief networks can also be used for classification.

Classification By Backpropagation: A backpropagation network learns by iteratively processing
a set of training samples, comparing the network's estimate for each sample with the actual
known class label. For each training sample, the weights are modified so as to minimize the
mean squared error between the network's prediction and the actual class. These changes are
made in the "backward" direction, i.e., from the output layer, through each hidden layer, down
to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in
general the weights will eventually converge, and the learning process stops.

K-Nearest Neighbor (K-NN) Classifier Method: The k-nearest neighbor (K-NN) classifier
is an example-based classifier, which means that the training documents
are used for comparison rather than an explicit class representation, such as the class profiles
used by other classifiers. As such, there is no real training phase. When a new document has to
be classified, the k most similar documents (neighbors) are found, and if a large enough
proportion of them are assigned to a particular class, the new document is also assigned to this
class; otherwise it is not. Additionally, finding the closest neighbors can be sped up using
traditional indexing strategies.
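A minimal K-NN sketch follows, assuming scikit-learn; the data set and the choice of k = 5 are illustrative:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # classify by the 5 most similar training samples
knn.fit(X_train, y_train)                   # "training" essentially just stores the examples
print(knn.score(X_test, y_test))            # fraction of test samples classified correctly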

Rule-Based Classification: Rule-based classification represents the knowledge in the form of
if-then rules. A rule is assessed according to its accuracy and coverage. If more than one rule
is triggered, then we need conflict resolution in rule-based classification. Conflict resolution
can be performed on three different parameters: size ordering, class-based ordering, and
rule-based ordering. There are some advantages of rule-based classifiers:

 Rules are easier to understand than a large tree.
 Rules are mutually exclusive and exhaustive.
 Each attribute-value pair along a path forms a conjunction; each leaf holds the class
prediction.

Frequent-Pattern Based Classification: Frequent pattern discovery (or FP discovery, FP


mining, or Frequent itemset mining) is part of data mining. It describes the task of finding the
most frequent and relevant patterns in large datasets. The idea was first presented for mining
transaction databases. Frequent patterns are defined as subsets (item sets, subsequences, or
substructures) that appear in a data set with a frequency no less than a user-specified or auto-
determined threshold.
Rough Set Theory: Rough set theory can be used for classification to discover structural
relationships within imprecise or noisy data. It applies to discrete-valued features;
continuous-valued attributes must therefore be discretized prior to their use. Rough set theory
is based on the establishment of equivalence classes within the given training data. All the data
samples forming an equivalence class are indiscernible, that is, the samples are identical with
respect to the attributes describing the data. Rough sets can also be used for feature reduction
(where attributes that do not contribute towards the classification of the given training data
can be identified and removed), and relevance analysis (where the contribution or
significance of each attribute is assessed with respect to the classification task). The problem
of finding the minimal subsets (reducts) of attributes that can describe all the concepts in the
given data set is NP-hard. However, algorithms to decrease the computation intensity have
been proposed. In one method, for example, a discernibility matrix is used which stores the
differences between attribute values for each pair of data samples. Rather than searching the
entire training set, the matrix is instead searched to detect redundant attributes.

Fuzzy-Logic: Rule-based systems for classification have the disadvantage that they involve
sharp cut-offs for continuous attributes. Fuzzy Logic is valuable for data mining frameworks
performing grouping/classification. It provides the benefit of working at a high level of
abstraction. In general, the usage of fuzzy logic in rule-based systems involves the following:

Attribute values are changed to fuzzy values.


For a given new data set/example, more than one fuzzy rule may apply. Every applicable
rule contributes a vote for membership in the categories. Typically, the truth values for each
projected category are summed.

3. Prediction
Data prediction is a two-step process, similar to that of data classification. However, for
prediction, we do not use the phrase "class label attribute" because the attribute for
which values are being predicted is continuous-valued (ordered) rather than categorical
(discrete-valued and unordered). The attribute can be referred to simply as the predicted
attribute. Prediction can be viewed as the construction and use of a model to assess the class
of an unlabeled object, or to assess the value or value ranges of an attribute that a given
object is likely to have.

4. Clustering
Unlike classification and prediction, which analyze class-labeled data objects or attributes,
clustering analyzes data objects without consulting an identified class label. In general, the
class labels do not exist in the training data simply because they are not known to begin with.
Clustering can be used to generate these labels. The objects are clustered based on the
principle of maximizing the intra-class similarity and minimizing the interclass similarity.
That is, clusters of objects are created so that objects inside a cluster have high similarity to
each other, but are very dissimilar to objects in other clusters. Each cluster that is
generated can be seen as a class of objects, from which rules can be inferred. Clustering can
also facilitate taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together.
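A minimal clustering sketch follows, assuming scikit-learn; the synthetic data and the choice of 3 clusters are illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled objects

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # objects grouped by similarity, no class labels used

# each discovered cluster can now be treated as a class of similar objects
print(labels[:10])
print(kmeans.cluster_centers_)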

5. Regression
Regression can be defined as a statistical modeling method in which previously obtained data
are used to predict a continuous quantity for new observations. This classifier is also known
as the Continuous Value Classifier. There are two types of regression models: Linear
regression and multiple linear regression models.

6. Artificial Neural network (ANN) Classifier Method


An artificial neural network (ANN), also referred to simply as a "Neural Network" (NN),
is a computational model inspired by biological neural networks. It consists of an
interconnected collection of artificial neurons. A neural network is a set of connected
input/output units where each connection has a weight associated with it. During the
learning phase, the network learns by adjusting the weights so as to be able to predict the
correct class label of the input samples. Neural network learning is also denoted as
connectionist learning due to the connections between units. Neural networks involve long
training times and are therefore more appropriate for applications where this is feasible. They
require a number of parameters that are typically best determined empirically, such as the
network topology or "structure". Neural networks have been criticized for their poor
interpretability, since it is difficult for humans to interpret the symbolic meaning behind the
learned weights. These features initially made neural networks less desirable for data mining.

The advantages of neural networks, however, include their high tolerance to noisy data as
well as their ability to classify patterns on which they have not been trained. In addition,
several algorithms have recently been developed for the extraction of rules from trained
neural networks. These factors contribute to the usefulness of neural networks for
classification in data mining.

An artificial neural network is an adaptive system that changes its structure based on
information that flows through the network during a learning phase. The ANN
relies on the principle of learning by example. There are two classical types of neural
networks, the perceptron and the multilayer perceptron.

7. Outlier Detection
A database may contain data objects that do not comply with the general behavior or model
of the data. These data objects are outliers. The investigation of outlier data is known as
outlier mining. An outlier may be detected using statistical tests which assume a
distribution or probability model for the data, or using distance measures where objects
having only a small fraction of "close" neighbors in space are considered outliers. Rather than
using statistical or distance measures, deviation-based techniques identify
outliers by inspecting differences in the principal characteristics of objects in a group.

8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of
evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and
genetics. They are an intelligent exploitation of random search, provided with historical data to
direct the search into the region of better performance in the solution space. They are commonly
used to generate high-quality solutions for optimization problems and search problems.
Genetic algorithms simulate the process of natural selection, which means those species that
can adapt to changes in their environment are able to survive, reproduce, and go on to the
next generation. In simple words, they simulate "survival of the fittest" among individuals of
consecutive generations for solving a problem. Each generation consists of a population of
individuals, and each individual represents a point in the search space and a possible solution.
Each individual is represented as a string of characters/integers/floats/bits. This string is
analogous to a chromosome.
Mining Problems in Data Mining
Data Mining involves extracting useful information from large datasets. However, the
process is not without challenges. Here are common mining problems encountered in Data
Mining:

1. Data Quality Issues

 Problem:
o Incomplete, noisy, or inconsistent data can lead to inaccurate results.
 Causes:
o Missing values in datasets.
o Errors in data entry or collection.
o Redundant or duplicate data.
 Solution:
o Data cleaning and preprocessing techniques.
o Use of imputation methods for missing data.

2. Scalability

 Problem:
o Difficulty in processing and analyzing large-scale datasets efficiently.
 Causes:
o Limited computational resources.
o Increasing size of modern datasets (big data).
 Solution:
o Parallel and distributed computing.
o Use of scalable algorithms like MapReduce.

3. High Dimensionality

 Problem:
o Data with a large number of attributes (features) increases complexity.
 Causes:
o Modern applications generate high-dimensional data, such as images or genetic
data.
 Solution:
o Dimensionality reduction techniques like PCA (Principal Component Analysis); see the sketch after this list.
o Feature selection methods.
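The PCA-based reduction mentioned above can be sketched as follows, assuming scikit-learn; the 64-feature digits data and the choice of 10 components are illustrative:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64 features (dimensions) per sample
pca = PCA(n_components=10)                 # keep only the 10 strongest components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)            # (1797, 64) -> (1797, 10)
print(pca.explained_variance_ratio_.sum())       # variance retained by the projection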

4. Privacy and Security


 Problem:
o Handling sensitive data while ensuring user privacy and security.
 Causes:
o Data breaches or misuse of sensitive information.
o Legal and ethical concerns.
 Solution:
o Implementation of data anonymization techniques.
o Compliance with regulations like GDPR.

5. Integration of Heterogeneous Data

 Problem:
o Combining data from various sources with different formats.
 Causes:
o Diverse data sources, including structured, semi-structured, and unstructured data.
 Solution:
o Use of ETL (Extract, Transform, Load) tools.
o Data integration frameworks.

6. Real-Time Processing

 Problem:
o Difficulty in analyzing data streams in real time.
 Causes:
o High-speed data generation from IoT devices or social media.
 Solution:
o Real-time processing frameworks like Apache Kafka and Apache Flink.

7. Interpretability of Results

 Problem:
o Results from data mining models may be difficult to interpret or explain.
 Causes:
o Use of complex models like neural networks.
o Lack of domain expertise.
 Solution:
o Use interpretable models like decision trees.
o Implement explainable AI techniques.

8. Imbalanced Data

 Problem:
o Unequal distribution of classes in a dataset can affect model performance.
 Causes:
o Rare events or minority classes in datasets.
 Solution:
o Use of oversampling (e.g., SMOTE) or undersampling techniques (see the sketch after this list).
o Employ algorithms designed for imbalanced data.
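The SMOTE oversampling mentioned above can be sketched as follows, assuming the third-party imbalanced-learn (imblearn) package is installed alongside scikit-learn; the 95/5 class split is illustrative:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))                        # the minority class is rare

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))                     # synthetic minority samples balance the classes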

9. Overfitting

 Problem:
o Models perform well on training data but fail on unseen data.
 Causes:
o Excessively complex models.
o Lack of sufficient training data.
 Solution:
o Cross-validation techniques (see the sketch after this list).
o Regularization methods.
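The cross-validation check mentioned above can be sketched as follows, assuming scikit-learn; the model and the 5-fold split are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5)   # accuracy on 5 different held-out folds
print(scores.mean(), scores.std())            # a large spread across folds hints at overfitting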

10. Underfitting

 Problem:
o Models fail to capture the underlying patterns in data.
 Causes:
o Oversimplified models.
o Insufficient features or data.
 Solution:
o Use more complex algorithms.
o Incorporate additional relevant features.

11. Dynamic and Evolving Data

 Problem:
o Data characteristics change over time, affecting model performance.
 Causes:
o Real-world data is often dynamic (e.g., stock market data).
 Solution:
o Use adaptive learning techniques.
o Regularly update models with new data.

12. Noise and Outliers

 Problem:
o Presence of irrelevant or extreme values in data.
 Causes:
o Measurement errors or anomalies.
 Solution:
o Outlier detection techniques (e.g., Isolation Forest); see the sketch after this list.
o Robust algorithms resistant to noise.
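The Isolation Forest approach mentioned above can be sketched as follows, assuming scikit-learn; the data and the 5% contamination rate are illustrative assumptions:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # normal observations
               rng.uniform(-8, 8, (10, 2))])    # a few extreme values mixed in

iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X)                     # -1 marks outliers, 1 marks inliers
print("outliers detected:", (labels == -1).sum())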

13. Selecting the Right Technique

 Problem:
o Choosing the most appropriate data mining technique for a problem.
 Causes:
o Diverse types of problems and datasets.
 Solution:
o Understand the nature of the problem.
o Experiment with multiple techniques.

14. Cost of Data Mining

 Problem:
o High costs associated with computational resources, software, and expertise.
 Causes:
o Complexity of data mining projects.
 Solution:
o Open-source tools like Python, R, and Weka.
o Cloud-based data mining platforms.

15. Evaluation of Results

 Problem:
o Assessing the accuracy and reliability of data mining outputs.
 Causes:
o Lack of ground truth for unsupervised learning tasks.
 Solution:
o Use performance metrics like precision, recall, and F1-score (see the sketch after this list).
o Conduct thorough validation and testing.
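A minimal evaluation sketch using these metrics, assuming scikit-learn; the two label lists are a small made-up example:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # labels predicted by some model

print("precision:", precision_score(y_true, y_pred))   # 0.8: 4 of 5 predicted positives are correct
print("recall:   ", recall_score(y_true, y_pred))      # 0.8: 4 of 5 actual positives are found
print("F1-score: ", f1_score(y_true, y_pred))          # harmonic mean of precision and recall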

Challenges in Data Mining (DM):
Data Mining is a powerful process for extracting meaningful patterns and insights from large
datasets. However, it comes with its own set of challenges. Below are the primary challenges
in Data Mining, along with explanations and potential solutions:

1. Data Quality Issues

 Challenge:
o Data often contains missing, incomplete, noisy, or inconsistent information.
 Impact:
o Poor-quality data leads to inaccurate and unreliable results.
 Solution:
o Implement data cleaning and preprocessing techniques.
o Use imputation methods for missing values and noise reduction.

2. Scalability

 Challenge:
o Handling and analyzing massive datasets efficiently.
 Impact:
o Slower performance and higher computational costs for large-scale data.
 Solution:
o Use distributed computing frameworks like Hadoop or Spark.
o Employ scalable algorithms and parallel processing.

3. High Dimensionality

 Challenge:
o Datasets with many attributes (features) make the analysis more complex and
computationally expensive.
 Impact:
o Difficulty in identifying relevant patterns due to the "curse of dimensionality."
 Solution:
o Apply dimensionality reduction techniques such as PCA (Principal Component
Analysis).
o Use feature selection methods to focus on the most important attributes.

4. Privacy and Security

 Challenge:
o Protecting sensitive data and ensuring user privacy during analysis.
 Impact:
o Risk of data breaches, misuse of personal information, and legal penalties.
 Solution:
o Use data anonymization and encryption techniques.
o Follow privacy regulations like GDPR or HIPAA.

5. Dynamic Data

 Challenge:
o Real-world data evolves over time, requiring frequent updates to models and
algorithms.
 Impact:
o Outdated models may fail to capture current trends or patterns.
 Solution:
o Use incremental learning techniques to adapt to new data.
o Regularly update and retrain models with fresh data.

6. Integration of Heterogeneous Data

 Challenge:
o Combining data from multiple sources with different formats and structures.
 Impact:
o Inconsistent and incomplete integration leads to incorrect conclusions.
 Solution:
o Use data integration tools like ETL (Extract, Transform, Load).
o Standardize and normalize data during preprocessing.

7. Interpretability of Results

 Challenge:
o Complex models like neural networks can be difficult to interpret and explain.
 Impact:
o Stakeholders may struggle to understand how decisions are made.
 Solution:
o Use interpretable models like decision trees or linear regression.
o Apply explainable AI (XAI) techniques for black-box models.

8. Noisy and Uncertain Data

 Challenge:
o Real-world data often contains noise, errors, or uncertainty.
 Impact:
o Reduced model accuracy and reliability.
 Solution:
o Implement outlier detection and noise reduction techniques.
o Use robust algorithms designed to handle uncertainty.

9. Overfitting and Underfitting

 Challenge:
o Overfitting occurs when models are too complex and fit the training data too closely.
o Underfitting occurs when models are too simple and fail to capture the data's
complexity.
 Impact:
o Poor performance on unseen data.
 Solution:
o Use cross-validation techniques.
o Apply regularization methods to balance model complexity.

10. Cost of Data Mining

 Challenge:
o High costs associated with data storage, processing, and skilled expertise.
 Impact:
o Organizations may find it challenging to allocate sufficient resources.
 Solution:
o Use open-source tools like Python, R, or Weka.
o Leverage cloud-based solutions to reduce infrastructure costs.

11. Imbalanced Data

 Challenge:
o Unequal representation of classes in datasets (e.g., fraud detection where fraud
cases are rare).
 Impact:
o Models may be biased toward the majority class.
 Solution:
o Use resampling techniques like oversampling (e.g., SMOTE) or undersampling.
o Apply algorithms specifically designed for imbalanced data.

12. Selection of Appropriate Algorithms

 Challenge:
o Choosing the right algorithm for a specific problem can be challenging due to the
variety of available methods.
 Impact:
o Inappropriate algorithm selection leads to suboptimal results.
 Solution:
o Understand the problem and dataset characteristics.
o Experiment with multiple algorithms and evaluate their performance.

13. Real-Time Analysis

 Challenge:
o Processing and analyzing data streams in real time.
 Impact:
o Difficulties in providing instant insights for dynamic applications.
 Solution:
o Use real-time data mining tools like Apache Kafka or Flink.
o Design algorithms optimized for streaming data.

14. Handling Unstructured Data

 Challenge:
o Data mining often involves unstructured data like text, images, or videos.
 Impact:
o Difficulty in processing and extracting meaningful patterns.
 Solution:
o Use Natural Language Processing (NLP) for text data.
o Apply image processing techniques for visual data.

15. Evaluation of Results

 Challenge:
o Measuring the accuracy and reliability of data mining outcomes.
 Impact:
o Inconsistent evaluation may lead to incorrect interpretations.
 Solution:
o Use performance metrics like precision, recall, and F1-score.
o Compare results across multiple datasets and validation methods.

16. Ethical Concerns

 Challenge:
o Misuse of data mining results for biased or unethical purposes.
 Impact:
o Loss of trust and potential legal consequences.
 Solution:
o Establish ethical guidelines for data mining practices.
o Conduct regular audits of data mining projects.

DM Application areas
Many measurable benefits have been achieved in different application areas through data
mining. So, let's discuss different applications of Data Mining:

Scientific Analysis: Scientific simulations are generating huge volumes of data every day. This
includes data collected from nuclear laboratories, data about human psychology, etc. Data
mining techniques are capable of the analysis of these data. Now we can capture and store
more new data faster than we can analyze the old data already accumulated. Example of
scientific analysis:

 Sequence analysis in bioinformatics


 Classification of astronomical objects
 Medical decision support.

Intrusion Detection: A network intrusion refers to any unauthorized activity on a


digital network. Network intrusions often involve stealing valuable network resources. Data
mining techniques play a vital role in detecting intrusions, network attacks, and
anomalies. These techniques help in selecting and refining useful and relevant information
from large data sets. Data mining techniques help classify relevant data for an Intrusion
Detection System, which generates alarms for the network traffic when foreign invasions
appear in the system. For example:

 Detect security violations


 Misuse Detection
 Anomaly Detection
Business Transactions: Every transaction in the business industry is recorded, often for
perpetuity. Such transactions are usually time-related and can be inter-business deals or
intra-business operations. The effective and timely use of this data for competitive
decision-making is definitely one of the most important problems to solve for businesses
that struggle to survive in a highly competitive world. Data mining helps to analyze these
business transactions and identify marketing approaches and support decision-making. Example:

 Direct mail targeting


 Stock trading
 Customer segmentation
 Churn prediction (Churn prediction is one of the most popular Big Data use cases in
business)

Market Basket Analysis: Market basket analysis is a technique that carefully studies the
purchases made by a customer in a supermarket. It identifies patterns of items frequently
purchased together by customers. This analysis can help companies plan deals, offers, and
sales, and data mining techniques help to achieve this analysis task. Example:

 Data mining concepts are in use for Sales and marketing to provide better customer
service, to improve cross-selling opportunities, to increase direct mail response rates.
 Customer Retention in the form of pattern identification and prediction of likely
defections is possible by Data mining.
 The Risk Assessment and Fraud areas also use data-mining concepts for identifying
inappropriate or unusual behavior.

Education: For analyzing the education sector, data mining uses Educational Data Mining
(EDM) method. This method generates patterns that can be used both by learners and
educators. By using EDM we can perform some educational tasks, such as:

 Predicting students' admission to higher education
 Profiling students
 Predicting student performance
 Evaluating teachers' teaching performance
 Curriculum development
 Predicting student placement opportunities

Research: Data mining techniques can perform prediction, classification, clustering,
association, and grouping of data with precision in the research area. The rules generated by
data mining are useful for finding results. In most technical research in data mining, we
create a training model and a testing model. The train/test approach is a strategy to measure
the precision of the proposed model. It is called train/test because we split the data set into
two sets: a training data set and a testing data set. The training data set is used to build the
training model, whereas the testing data set is used in the testing model. Example:

 Classification of uncertain data.


 Information-based clustering.
 Decision support system
 Web Mining
 Domain-driven data mining
 IoT (Internet of Things)and Cybersecurity
 Smart farming IoT(Internet of Things)

Healthcare and Insurance: The pharmaceutical sector can examine its recent sales force
activity and its outcomes to improve the targeting of high-value physicians and figure out
which marketing activities will have the best effect in the upcoming months, whereas in
the insurance sector, data mining can help to predict which customers will buy new
policies, identify behavior patterns of risky customers, and identify fraudulent behavior of
customers.

 Claims analysis, i.e., which medical procedures are claimed together.
 Identifying successful medical therapies for different illnesses.
 Characterizing patient behavior to predict office visits.

Transportation: A diversified transportation company with a large direct sales force can
apply data mining to identify the best prospects for its services. A large consumer
goods organization can apply data mining to improve its sales process to
retailers.

 Determine the distribution schedules among outlets.


 Analyze loading patterns.

Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product.

 Credit card fraud detection.


 Identify ‘Loyal’ customers.
 Extraction of information related to customers.
 Determine credit card spending by customer groups.
Association Rules
Market basket analysis:
This process analyzes customer buying habits by finding associations between
the different items that customers place in their shopping baskets. The discovery
of such associations can help retailers develop marketing strategies by gaining
insight into which items are frequently purchased together by customers. For
instance, if customers are buying milk, how likely are they to also buy bread
(and what kind of bread) on the same trip to the supermarket? Such information
can lead to increased sales by helping retailers do selective marketing and plan
their shelf space.
Example: If customers who purchase computers also tend to buy antivirus
software at the same time, then placing the hardware display close to the
software display may help increase the sales of both items. In an alternative
strategy, placing hardware and software at opposite ends of the store may entice
customers who purchase such items to pick up other items along the way. For
instance, after deciding on an expensive computer, a customer may observe
security systems for sale while heading toward the software display to purchase
antivirus software and may decide to purchase a home security system as well.
Market basket analysis can also help retailers plan which items to put on sale at
reduced prices. If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers as well as
computers.
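To make the milk-and-bread example concrete, the sketch below computes the support and confidence of the rule milk => bread on a small made-up set of baskets (plain Python; the data are illustrative):

baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

n = len(baskets)
milk = sum(1 for b in baskets if "milk" in b)
milk_and_bread = sum(1 for b in baskets if {"milk", "bread"} <= b)

support = milk_and_bread / n         # fraction of all baskets containing milk AND bread
confidence = milk_and_bread / milk   # fraction of milk baskets that also contain bread
print(support, confidence)           # 0.6 and 0.75 for this toy data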

Frequent Pattern Mining:

Frequent pattern mining can be classified in various ways, based on the


following criteria:
The Apriori Algorithm
 Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in
1994 for mining frequent itemsets for Boolean association rules.
 The name of the algorithm is based on the fact that the algorithm uses
prior knowledge of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise search,
where k-itemsets are used to explore (k+1)-itemsets.
 First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item, and collecting those items that satisfy
minimum support. The resulting set is denoted L1. Next, L1 is used to find
L2, the set of frequent 2-itemsets, which is used to find L3, and so on,
until no more frequent k-itemsets can be found.
 The finding of each Lk requires one full scan of the database.
 A two-step process, consisting of join and prune actions, is followed in
Apriori.
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of
candidate 1-itemsets, C1. The algorithm simply scans all of the transactions
in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2.
The set of frequent 1-itemsets, L1, can then be determined. It consists of the
candidate 1-itemsets satisfying minimum support. In our example, all of the
candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join
L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. No candidates are
removed from C2 during the prune step because each subset of the candidates
is also frequent.
4. Next, the transactions in D are scanned and the support count of each
candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those
candidate 2-itemsets in C2 having minimum support.
6. The set of candidate 3-itemsets, C3, is then generated. From the join step,
we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a
frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be frequent.
7. The transactions in D are scanned in order to determine L3, consisting of
those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
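A minimal level-wise Apriori sketch in plain Python is given below. The nine-transaction database D and min_sup = 2 are illustrative (the original transaction table is not reproduced in these notes), but they are consistent with the candidate itemsets listed in the steps above:

from itertools import combinations

D = [                                   # an illustrative transaction database D
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

def support_count(itemset):
    # one scan of D to count the transactions containing the candidate
    return sum(1 for t in D if itemset <= t)

# C1 -> L1: frequent 1-itemsets
items = sorted({i for t in D for i in t})
L = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_sup]

k = 2
while L:
    print(f"L{k-1}:", sorted(sorted(s) for s in L))
    prev = set(L)
    # join step: combine frequent (k-1)-itemsets into candidate k-itemsets
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # prune step (Apriori property): drop candidates with an infrequent (k-1)-subset
    candidates = [c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))]
    # one full scan of D per level keeps only candidates meeting minimum support
    L = [c for c in candidates if support_count(c) >= min_sup]
    k += 1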
Generating Association Rules from Frequent
Itemsets:
Once the frequent itemsets from the transactions in a database D have been found, it
is straightforward to generate strong association rules from them (rules that satisfy
both minimum support and minimum confidence).
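Continuing the illustrative database from the Apriori sketch above, the following sketch generates the strong rules derivable from one frequent itemset by checking the confidence of every nonempty antecedent (min_conf = 0.7 is an illustrative threshold):

from itertools import combinations

D = [                                   # same illustrative transaction database as above
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset):
    return sum(1 for t in D if itemset <= t)

min_conf = 0.7
l = frozenset({"I1", "I2", "I5"})       # a frequent itemset found by Apriori

for r in range(1, len(l)):
    for antecedent in map(frozenset, combinations(l, r)):
        consequent = l - antecedent
        conf = support_count(l) / support_count(antecedent)   # conf = sup(l) / sup(antecedent)
        if conf >= min_conf:
            print(sorted(antecedent), "=>", sorted(consequent), f"(confidence {conf:.2f})")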

Mining Multilevel Association Rules:

 For many applications, it is difficult to find strong associations among


data items at low or primitive levels of abstraction due to the sparsity of
data at those levels.
 Strong associations discovered at high levels of abstraction may represent
common sense knowledge.
 Therefore, data mining systems should provide capabilities for mining
association rules at multiple levels of abstraction, with sufficient
flexibility for easy traversal among different abstraction spaces.
 Association rules generated from mining data at multiple levels of
abstraction are called multiple-level or multilevel association rules.
 Multilevel association rules can be mined efficiently using concept
hierarchies under a support-confidence framework.
 In general, a top-down strategy is employed, where counts are
accumulated for the calculation of frequent itemsets at each concept level,
starting at the concept level 1 and working downward in the hierarchy
toward the more specific concept levels, until no more frequent itemsets
can be found.
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher level, more general concepts. Data can be generalized by
replacing low-level concepts within the data by their higher-level concepts,
or ancestors, from a concept hierarchy.

The concept hierarchy has five levels, respectively referred to as levels 0 to 4,
starting with level 0 at the root node for all.

 Here, Level 1 includes computer, software, printer & camera, and
computer accessory.
 Level 2 includes laptop computer, desktop computer, office software,
antivirus software
 Level 3 includes IBM desktop computer, . . . , Microsoft office
software, and so on.
 Level 4 is the most specific abstraction level of this hierarchy.
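A minimal sketch of how such a concept hierarchy can be used before mining at a higher level is shown below (plain Python); the hierarchy mapping and the transactions are made-up, hypothetical examples:

# maps a specific (low-level) item to its level-1 ancestor in the concept hierarchy
hierarchy = {
    "IBM desktop computer": "computer",
    "Dell laptop computer": "computer",
    "Microsoft office software": "software",
    "Norton antivirus software": "software",
    "Canon inkjet printer": "printer & camera",
}

transactions = [
    {"IBM desktop computer", "Norton antivirus software"},
    {"Dell laptop computer", "Microsoft office software"},
    {"Canon inkjet printer", "IBM desktop computer"},
]

# replace each item by its level-1 ancestor, then count support at that level as usual
generalized = [{hierarchy.get(item, item) for item in t} for t in transactions]
print(generalized)

Itemsets that are too sparse at the most specific level (individual brands) can become frequent at level 1 (computer, software), which is the idea behind mining multilevel rules top-down.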

Approaches For Mining Multilevel Association Rules:


1. Uniform Minimum Support:
 The same minimum support threshold is used when mining at each
level of abstraction.
 When a uniform minimum support threshold is used, the search
procedure is simplified.
 The method is also simple in that users are required to specify only
one minimum support threshold.
 The uniform support approach, however, has some difficulties. It is
unlikely that items at lower levels of abstraction will occur as
frequently as those at higher levels of abstraction.
 If the minimum support threshold is set too high, it could miss
some meaningful associations occurring at low abstraction levels.
If the threshold is set too low, it may generate many uninteresting
associations occurring at high abstraction levels.

2. Reduced Minimum Support:


 Each level of abstraction has its own minimum support threshold. The
deeper the level of abstraction, the smaller the corresponding threshold is.
 For example, the minimum support thresholds for levels 1 and 2 are 5%
and 3%, respectively. In this way, "computer," "laptop computer," and
"desktop computer" are all considered frequent.
3. Group-Based Minimum Support:
 Because users or experts often have insight as to which groups are
more important than others, it is sometimes more desirable to set up
user-specific, item, or group based minimal support thresholds when
mining multilevel rules.
 For example, a user could set up the minimum support thresholds
based on product price, or on items of interest, such as by setting
particularly low support thresholds for laptop computers and flash
drives in order to pay particular attention to the association patterns
containing items in these categories.
