Unit 1 DMW
Introduction : Data Mining: Definitions, KDD v/s Data Mining, DBMS v/s Data Mining , DM
techniques, Mining problems, Issues and Challenges in DM, DM Application areas. Association
Rules & Clustering Techniques: Introduction, Various association algorithms like A Priori, Partition,
Pincer search etc., Generalized association rules.
Data Mining:
Data mining refers to extracting or mining knowledge from large amounts of data. The term
is actually a misnomer (a name that is wrong or not appropriate). Thus, data mining would
more appropriately have been named knowledge mining, which emphasizes mining knowledge
from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at
the intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
The key properties of data mining are:
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases
4. User interface:
This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on
the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
Issues in Data Mining:
Data mining query languages and ad hoc data mining. - A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once the patterns are discovered,
they need to be expressed in high-level languages and visual representations. These
representations should be easily understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle
noise and incomplete objects while mining data regularities. Without such cleaning methods,
the accuracy of the discovered patterns will be poor.
Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be
efficient and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size
of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel; the results
from the partitions are then merged. Incremental algorithms update the mined knowledge
when the database is updated, without mining the data again from scratch.
Data Cleaning
Data cleaning is defined as removal of noisy and irrelevant data from collection.
Data Integration
Data integration is defined as heterogeneous data from multiple sources combined in a
common source (Data Warehouse). Data integration is performed using data migration tools,
data synchronization tools, and the ETL (Extract-Transform-Load) process.
Data Selection
Data selection is defined as the process where data relevant to the analysis is decided upon
and retrieved from the data collection. For this we can use neural networks, decision trees,
Naive Bayes, clustering, and regression methods.
Data Transformation
Data Transformation is defined as the process of transforming data into the appropriate form
required by the mining procedure, for example by performing summary or aggregation operations.
Data Mining
Data mining is defined as the application of intelligent techniques to extract potentially
useful patterns. It transforms task-relevant data into patterns and decides the purpose of the
model, such as classification or characterization.
Pattern Evaluation
Pattern Evaluation is defined as identifying interesting patterns representing knowledge
based on given measures. It finds the interestingness score of each pattern and uses
summarization and visualization to make the data understandable by the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make
decisions.
KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, new data can be integrated and transformed in order to get different and more
appropriate results. Preprocessing of databases consists of Data cleaning and Data
Integration.
OR
Some people treat data mining as a synonym for knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process:
Data Cleaning - In this step the noise and inconsistent data is removed.
Data Integration - In this step multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, knowledge is represented.
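The following is a minimal Python sketch of these KDD steps on a tiny hypothetical customer
table; the column names and the use of pandas and scikit-learn are illustrative assumptions,
not part of the original notes.

# A minimal sketch of the KDD steps on a toy dataset (hypothetical column
# names; pandas and scikit-learn assumed to be available).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Raw data with noise (a missing value) and a duplicate row.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [23, 45, 45, None, 31],
    "spend":       [120.0, 560.0, 560.0, 300.0, 80.0],
})

# 1. Data cleaning: drop duplicates, impute the missing value.
clean = raw.drop_duplicates()
clean = clean.fillna({"age": clean["age"].mean()})

# 2./3. Data integration and selection: here we simply select the
#       attributes relevant to the analysis task.
selected = clean[["age", "spend"]]

# 4. Data transformation: scale attributes into a comparable range.
X = StandardScaler().fit_transform(selected)

# 5. Data mining: apply an intelligent method (clustering) to find patterns.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# 6./7. Pattern evaluation and knowledge presentation.
clean["cluster"] = model.labels_
print(clean)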
DBMS v/s Data Mining:
Primary Purpose - DBMS: to organize and manage structured data for easy access and manipulation. Data Mining: to analyze data and extract meaningful patterns, trends, or rules.
Techniques Used - DBMS: relational algebra, SQL queries, indexing, and normalization. Data Mining: statistical analysis, machine learning, and algorithmic techniques.
Example Use Case - DBMS: storing customer data in a relational database and retrieving it via queries. Data Mining: identifying frequent purchase patterns from customer transaction data.
User Interaction - DBMS: requires users to define and manage the data schema and write queries. Data Mining: automates the discovery of insights with minimal user input.
Use Cases - DBMS: used for transaction processing, data storage, and management. Data Mining: used for market basket analysis, fraud detection, customer segmentation.
1. Association
Association analysis is the finding of association rules showing attribute-value conditions that
occur frequently together in a given set of data. Association analysis is widely used for a
market basket or transaction data analysis. Association rule mining is a significant and
exceptionally dynamic area of data mining research. One method of association-based
classification, called associative classification, consists of two steps. In the first step,
association rules are generated using a modified version of the standard association
rule mining algorithm known as Apriori. The second step constructs a classifier based on the
association rules discovered.
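As an illustration of association rule mining, here is a minimal, self-contained sketch of the
level-wise Apriori idea (frequent itemset generation only) on a hypothetical set of market
basket transactions; it is not an optimized implementation.

# A minimal, self-contained Apriori sketch for market basket data
# (hypothetical transactions; not an optimized implementation).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 0.6  # minimum fraction of transactions

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Level-wise generation: join frequent (k-1)-itemsets to build k-item candidates.
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent:
    for itemset in sorted(level, key=len):
        print(set(itemset), round(support(itemset), 2))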
2. Classification
Classification is the process of finding a set of models (or functions) that describe and
distinguish data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The determined model depends on the
investigation of a set of training data information (i.e. data objects whose class label is
known). The derived model may be represented in various forms, such as classification (if –
then) rules, decision trees, and neural networks. Data mining has different types of classifiers:
Decision Tree
SVM(Support Vector Machine)
Generalized Linear Models
Bayesian classification:
Classification by Backpropagation
K-NN Classifier
Rule-Based Classification
Frequent-Pattern Based Classification
Rough set theory
Fuzzy Logic
Decision Trees: A decision tree is a flow-chart-like tree structure, where each node
represents a test on an attribute value, each branch denotes an outcome of a test, and tree
leaves represent classes or class distributions. Decision trees can be easily transformed into
classification rules. Decision tree induction is a nonparametric methodology for building
classification models. In other words, it does not require any prior assumptions regarding the
type of probability distribution satisfied by the class and other attributes. Decision trees,
especially smaller trees, are relatively easy to interpret. The accuracy of decision trees is
also comparable to that of other classification techniques for many simple data sets. Decision
trees provide an expressive representation for learning discrete-valued functions. However,
they do not generalize well to certain types of Boolean problems.
As an example, a decision tree can be built on the IRIS data set from the UCI Machine Learning
Repository, which has three class labels: Setosa, Versicolor, and Virginica.
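A minimal sketch of such a decision tree classifier on the IRIS data set, assuming scikit-learn
is available; the printed rules illustrate how a tree can be read as if-then classification rules.

# Sketch: a decision tree classifier trained on the IRIS data set
# (scikit-learn assumed; mirrors the example described above).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

# The tree can be read as if-then classification rules.
print(export_text(tree, feature_names=list(iris.feature_names)))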
K-Nearest Neighbor (K-NN) Classifier Method: The k-nearest neighbor (K-NN) classifier
is taken into account as an example-based classifier, which means that the training documents
are used for comparison instead of an exact class illustration, like the class profiles utilized by
other classifiers. As such, there is no real training phase. Once a new document has to be
classified, the k most similar documents (neighbors) are found, and if a large enough
proportion of them are assigned to a particular class, the new document is also assigned to
that class; otherwise it is not. Additionally, finding the nearest neighbors can be sped up
using traditional indexing strategies.
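A minimal K-NN sketch, assuming scikit-learn; note that fit() essentially just stores the
training examples, matching the description above.

# Sketch: a k-nearest-neighbor classifier; no real training phase, new
# points are labeled by a vote of the k most similar stored examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 neighbors
knn.fit(X_train, y_train)                   # only stores the training data
print("test accuracy:", knn.score(X_test, y_test))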
Fuzzy-Logic: Rule-based systems for classification have the disadvantage that they involve
sharp cut-offs for continuous attributes. Fuzzy Logic is valuable for data mining frameworks
performing grouping /classification. It provides the benefit of working at a high level of
abstraction. In general, the usage of fuzzy logic in rule-based systems involves replacing
sharp cut-offs on continuous attributes with degrees of (fuzzy) membership.
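A minimal sketch of a fuzzy membership function for a continuous attribute; the attribute
name and thresholds are hypothetical.

# Sketch: a fuzzy membership function replacing a sharp cut-off on a
# continuous attribute (e.g., "income is high"); thresholds are hypothetical.
def high_income_membership(income, low=30_000, high=60_000):
    """Return a degree of membership in [0, 1] instead of a yes/no rule."""
    if income <= low:
        return 0.0
    if income >= high:
        return 1.0
    return (income - low) / (high - low)   # linear ramp between the bounds

for value in (25_000, 45_000, 75_000):
    print(value, "->", round(high_income_membership(value), 2))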
3. Prediction
Data Prediction is a two-step process, similar to that of data classification. However, for
prediction, we do not use the term “class label attribute” because the attribute for which
values are being predicted is continuous-valued (ordered) rather than categorical
(discrete-valued and unordered). The attribute can be referred to simply as the predicted
attribute. Prediction can be viewed as the construction and use of a model to assess the class
of an unlabeled object, or to assess the value or value ranges of an attribute that a given
object is likely to have.
4. Clustering
Unlike classification and prediction, which analyze class-labeled data objects or attributes,
clustering analyzes data objects without consulting an identified class label. In general, the
class labels do not exist in the training data simply because they are not known to begin with.
Clustering can be used to generate these labels. The objects are clustered based on the
principle of maximizing the intra-class similarity and minimizing the interclass similarity.
That is, clusters of objects are created so that objects inside a cluster have high similarity
in contrast with each other, but are very dissimilar to objects in other clusters. Each cluster
that is generated can be seen as a class of objects, from which rules can be inferred.
Clustering can also facilitate taxonomy formation, that is, the organization of observations
into a hierarchy of classes that group similar events together.
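A minimal clustering sketch, assuming scikit-learn: k-means groups unlabeled objects, and the
silhouette score summarizes how compact and well separated the resulting clusters are.

# Sketch: clustering unlabeled objects so that intra-cluster similarity is
# high and inter-cluster similarity is low (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)        # class labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Silhouette score measures how compact and well separated the clusters are.
print("silhouette:", round(silhouette_score(X, kmeans.labels_), 3))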
5. Regression
Regression can be defined as a statistical modeling method in which previously obtained data
is used to predict a continuous quantity for new observations. This classifier is also known
as the Continuous Value Classifier. There are two types of regression models: Linear
regression and multiple linear regression models.
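A minimal linear regression sketch on synthetic data (the data-generating relation is an
assumption made purely for illustration).

# Sketch: linear regression predicting a continuous value
# (synthetic data used only for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # single predictor
y = 3.5 * X.ravel() + 2.0 + rng.normal(0, 1, 100)   # noisy linear relation

model = LinearRegression().fit(X, y)
print("slope:", round(model.coef_[0], 2), "intercept:", round(model.intercept_, 2))
print("prediction at x=4:", round(model.predict([[4.0]])[0], 2))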
6. Neural Networks
The advantages of neural networks include their high tolerance to noisy data as well as their
ability to classify patterns on which they have not been trained. In addition, several
algorithms have recently been developed for the extraction of rules from trained neural
networks. These factors contribute to the usefulness of neural networks for classification in
data mining.
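A minimal neural network classification sketch using scikit-learn's MLPClassifier; the data set
and network size are illustrative choices, not part of the original notes.

# Sketch: a small feed-forward neural network classifier
# (scikit-learn's MLPClassifier assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))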
7. Outlier Detection
A database may contain data objects that do not comply with the general behavior or model
of the data. These data objects are Outliers. The investigation of OUTLIER data is known as
OUTLIER MINING. An outlier may be detected using statistical tests which assume a
distribution or probability model for the data, or using distance measures where objects
having a small fraction of “close” neighbors in space are considered outliers. Rather than
utilizing statistical or distance measures, deviation-based techniques identify outliers by
examining differences in the principal attributes of objects in a group.
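A minimal outlier mining sketch on synthetic data, contrasting a statistical test (z-score
against an assumed normal distribution) with a model-based method (Isolation Forest);
scikit-learn is assumed.

# Sketch: detecting outliers two ways, with a statistical test (z-score)
# and with a model-based method (Isolation Forest); synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 98), [120.0, -30.0]])  # two planted outliers

# Statistical approach: points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print("z-score outliers:", data[np.abs(z) > 3])

# Model-based approach: Isolation Forest flags outliers with label -1.
labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(
    data.reshape(-1, 1))
print("isolation forest outliers:", data[labels == -1])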
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of
evolutionary algorithms. They are based on the ideas of natural selection and genetics, and use
an intelligent exploitation of random search, guided by historical data, to direct the search
into regions of better performance in the solution space. They are commonly
used to generate high-quality solutions for optimization problems and search problems.
Genetic algorithms simulate the process of natural selection which means those species who
can adapt to changes in their environment are able to survive and reproduce and go to the
next generation. In simple words, they simulate “survival of the fittest” among individuals of
consecutive generations for solving a problem. Each generation consists of a population of
individuals, and each individual represents a point in the search space and a possible solution.
Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to
the Chromosome.
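A minimal genetic algorithm sketch on the classic "one-max" toy problem (maximize the number of
1s in a bit string); the population size, mutation rate, and fitness function are illustrative
choices, not part of the original notes.

# Sketch: a tiny genetic algorithm maximizing the number of 1s in a bit
# string ("one-max"), illustrating selection, crossover and mutation.
import random

random.seed(0)
LENGTH, POP, GENERATIONS = 20, 30, 40

def fitness(chrom):            # more 1s = fitter individual
    return sum(chrom)

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]

for _ in range(GENERATIONS):
    def select():
        # Tournament selection: the fitter of two random individuals survives.
        a, b = random.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    next_gen = []
    while len(next_gen) < POP:
        p1, p2 = select(), select()
        cut = random.randrange(1, LENGTH)                # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [1 - g if random.random() < 0.01 else g for g in child]  # mutation
        next_gen.append(child)
    population = next_gen

best = max(population, key=fitness)
print("best fitness:", fitness(best), "out of", LENGTH)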
Mining Problems in Data Mining
Data Mining involves extracting useful information from large datasets. However, the
process is not without challenges. Here are common mining problems encountered in Data
Mining:
1. Data Quality
Problem:
o Incomplete, noisy, or inconsistent data can lead to inaccurate results.
Causes:
o Missing values in datasets.
o Errors in data entry or collection.
o Redundant or duplicate data.
Solution:
o Data cleaning and preprocessing techniques.
o Use of imputation methods for missing data.
2. Scalability
Problem:
o Difficulty in processing and analyzing large-scale datasets efficiently.
Causes:
o Limited computational resources.
o Increasing size of modern datasets (big data).
Solution:
o Parallel and distributed computing.
o Use of scalable algorithms like MapReduce.
3. High Dimensionality
Problem:
o Data with a large number of attributes (features) increases complexity.
Causes:
o Modern applications generate high-dimensional data, such as images or genetic
data.
Solution:
o Dimensionality reduction techniques like PCA (Principal Component Analysis).
o Feature selection methods.
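A minimal PCA sketch, assuming scikit-learn: the 64-dimensional digits data set is reduced to
10 components while retaining most of the variance.

# Sketch: reducing high-dimensional data with PCA (scikit-learn assumed).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64 features per image
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print("original shape:", X.shape, "reduced shape:", X_reduced.shape)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))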
5. Data Integration
Problem:
o Combining data from various sources with different formats.
Causes:
o Diverse data sources, including structured, semi-structured, and unstructured data.
Solution:
o Use of ETL (Extract, Transform, Load) tools.
o Data integration frameworks.
6. Real-Time Processing
Problem:
o Difficulty in analyzing data streams in real time.
Causes:
o High-speed data generation from IoT devices or social media.
Solution:
o Real-time processing frameworks like Apache Kafka and Apache Flink.
7. Interpretability of Results
Problem:
o Results from data mining models may be difficult to interpret or explain.
Causes:
o Use of complex models like neural networks.
o Lack of domain expertise.
Solution:
o Use interpretable models like decision trees.
o Implement explainable AI techniques.
8. Imbalanced Data
Problem:
o Unequal distribution of classes in a dataset can affect model performance.
Causes:
o Rare events or minority classes in datasets.
Solution:
o Use of oversampling (e.g., SMOTE) or undersampling techniques.
o Employ algorithms designed for imbalanced data.
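A minimal sketch of simple random oversampling of the minority class using scikit-learn's
resample; SMOTE (from the separate imbalanced-learn package) is a more refined alternative.

# Sketch: simple random oversampling of the minority class
# (SMOTE from the imbalanced-learn package is a more refined alternative).
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 17 + [1] * 3)           # heavily imbalanced labels

minority_X, minority_y = X[y == 1], y[y == 1]
extra_X, extra_y = resample(minority_X, minority_y,
                            n_samples=14, random_state=0)  # replicate minority samples

X_balanced = np.vstack([X, extra_X])
y_balanced = np.concatenate([y, extra_y])
print("class counts after oversampling:", np.bincount(y_balanced))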
9. Overfitting
Problem:
o Models perform well on training data but fail on unseen data.
Causes:
o Excessively complex models.
o Lack of sufficient training data.
Solution:
o Cross-validation techniques.
o Regularization methods.
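A minimal sketch of guarding against overfitting with cross-validation and a regularized
(ridge) model, assuming scikit-learn; the data set choice is illustrative.

# Sketch: guarding against overfitting with cross-validation and a
# regularized model (ridge regression); scikit-learn assumed.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0)                    # alpha controls regularization strength
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2 per fold:", scores.round(3))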
10. Underfitting
Problem:
o Models fail to capture the underlying patterns in data.
Causes:
o Oversimplified models.
o Insufficient features or data.
Solution:
o Use more complex algorithms.
o Incorporate additional relevant features.
11. Dynamic Data
Problem:
o Data characteristics change over time, affecting model performance.
Causes:
o Real-world data is often dynamic (e.g., stock market data).
Solution:
o Use adaptive learning techniques.
o Regularly update models with new data.
12. Noisy Data and Outliers
Problem:
o Presence of irrelevant or extreme values in data.
Causes:
o Measurement errors or anomalies.
Solution:
o Outlier detection techniques (e.g., Isolation Forest).
o Robust algorithms resistant to noise.
13. Selection of Mining Technique
Problem:
o Choosing the most appropriate data mining technique for a problem.
Causes:
o Diverse types of problems and datasets.
Solution:
o Understand the nature of the problem.
o Experiment with multiple techniques.
14. Cost of Data Mining
Problem:
o High costs associated with computational resources, software, and expertise.
Causes:
o Complexity of data mining projects.
Solution:
o Open-source tools like Python, R, and Weka.
o Cloud-based data mining platforms.
15. Evaluation of Results
Problem:
o Assessing the accuracy and reliability of data mining outputs.
Causes:
o Lack of ground truth for unsupervised learning tasks.
Solution:
o Use performance metrics like precision, recall, and F1-score.
o Conduct thorough validation and testing.
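A minimal sketch of evaluating mining results with precision, recall, and F1-score, assuming
scikit-learn; the true and predicted labels are hypothetical.

# Sketch: evaluating mining results with precision, recall and F1-score
# (hypothetical true and predicted labels).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1-score: ", round(f1_score(y_true, y_pred), 3))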
Challenges in DM:
Data Mining is a powerful process for extracting meaningful patterns and insights from large
datasets. However, it comes with its own set of challenges. Below are the primary challenges
in Data Mining, along with explanations and potential solutions:
1. Data Quality
Challenge:
o Data often contains missing, incomplete, noisy, or inconsistent information.
Impact:
o Poor-quality data leads to inaccurate and unreliable results.
Solution:
o Implement data cleaning and preprocessing techniques.
o Use imputation methods for missing values and noise reduction.
2. Scalability
Challenge:
o Handling and analyzing massive datasets efficiently.
Impact:
o Slower performance and higher computational costs for large-scale data.
Solution:
o Use distributed computing frameworks like Hadoop or Spark.
o Employ scalable algorithms and parallel processing.
3. High Dimensionality
Challenge:
o Datasets with many attributes (features) make the analysis more complex and
computationally expensive.
Impact:
o Difficulty in identifying relevant patterns due to the "curse of dimensionality."
Solution:
o Apply dimensionality reduction techniques such as PCA (Principal Component
Analysis).
o Use feature selection methods to focus on the most important attributes.
4. Data Privacy and Security
Challenge:
o Protecting sensitive data and ensuring user privacy during analysis.
Impact:
o Risk of data breaches, misuse of personal information, and legal penalties.
Solution:
o Use data anonymization and encryption techniques.
o Follow privacy regulations like GDPR or HIPAA.
5. Dynamic Data
Challenge:
o Real-world data evolves over time, requiring frequent updates to models and
algorithms.
Impact:
o Outdated models may fail to capture current trends or patterns.
Solution:
o Use incremental learning techniques to adapt to new data.
o Regularly update and retrain models with fresh data.
6. Data Integration
Challenge:
o Combining data from multiple sources with different formats and structures.
Impact:
o Inconsistent and incomplete integration leads to incorrect conclusions.
Solution:
o Use data integration tools like ETL (Extract, Transform, Load).
o Standardize and normalize data during preprocessing.
7. Interpretability of Results
Challenge:
o Complex models like neural networks can be difficult to interpret and explain.
Impact:
o Stakeholders may struggle to understand how decisions are made.
Solution:
o Use interpretable models like decision trees or linear regression.
o Apply explainable AI (XAI) techniques for black-box models.
8. Noisy and Uncertain Data
Challenge:
o Real-world data often contains noise, errors, or uncertainty.
Impact:
o Reduced model accuracy and reliability.
Solution:
o Implement outlier detection and noise reduction techniques.
o Use robust algorithms designed to handle uncertainty.
9. Overfitting and Underfitting
Challenge:
o Overfitting occurs when models are too complex and fit the training data too closely.
o Underfitting occurs when models are too simple and fail to capture the data's
complexity.
Impact:
o Poor performance on unseen data.
Solution:
o Use cross-validation techniques.
o Apply regularization methods to balance model complexity.
10. Cost and Resource Constraints
Challenge:
o High costs associated with data storage, processing, and skilled expertise.
Impact:
o Organizations may find it challenging to allocate sufficient resources.
Solution:
o Use open-source tools like Python, R, or Weka.
o Leverage cloud-based solutions to reduce infrastructure costs.
11. Imbalanced Data
Challenge:
o Unequal representation of classes in datasets (e.g., fraud detection where fraud
cases are rare).
Impact:
o Models may be biased toward the majority class.
Solution:
o Use resampling techniques like oversampling (e.g., SMOTE) or undersampling.
o Apply algorithms specifically designed for imbalanced data.
12. Algorithm Selection
Challenge:
o Choosing the right algorithm for a specific problem can be challenging due to the
variety of available methods.
Impact:
o Inappropriate algorithm selection leads to suboptimal results.
Solution:
o Understand the problem and dataset characteristics.
o Experiment with multiple algorithms and evaluate their performance.
13. Real-Time Data Mining
Challenge:
o Processing and analyzing data streams in real time.
Impact:
o Difficulties in providing instant insights for dynamic applications.
Solution:
o Use real-time data mining tools like Apache Kafka or Flink.
o Design algorithms optimized for streaming data.
14. Handling Unstructured Data
Challenge:
o Data mining often involves unstructured data like text, images, or videos.
Impact:
o Difficulty in processing and extracting meaningful patterns.
Solution:
o Use Natural Language Processing (NLP) for text data.
o Apply image processing techniques for visual data.
15. Evaluation of Results
Challenge:
o Measuring the accuracy and reliability of data mining outcomes.
Impact:
o Inconsistent evaluation may lead to incorrect interpretations.
Solution:
o Use performance metrics like precision, recall, and F1-score.
o Compare results across multiple datasets and validation methods.
16. Ethical Issues
Challenge:
o Misuse of data mining results for biased or unethical purposes.
Impact:
o Loss of trust and potential legal consequences.
Solution:
o Establish ethical guidelines for data mining practices.
o Conduct regular audits of data mining projects.
DM Application areas
Many measurable benefits have been achieved in different application areas through data
mining. So, let’s discuss different applications of Data Mining:
Scientific Analysis: Scientific simulations generate bulks of data every day. This includes
data collected from nuclear laboratories, data about human psychology, etc. Data mining
techniques are capable of analyzing this data. We can now capture and store new data faster
than we can analyze the old data already accumulated. Example of scientific analysis:
Market Basket Analysis: Market Basket Analysis is a technique based on a careful study of the
purchases made by a customer in a supermarket. This concept identifies the pattern of
frequently purchased items by customers. This analysis can help companies promote deals,
offers, and sales, and data mining techniques help to achieve this analysis task. Example:
Data mining concepts are in use for Sales and marketing to provide better customer
service, to improve cross-selling opportunities, to increase direct mail response rates.
Customer Retention in the form of pattern identification and prediction of likely
defections is possible by Data mining.
Risk assessment and fraud detection also use data mining concepts for identifying
inappropriate or unusual behavior.
Education: For analyzing the education sector, data mining uses the Educational Data Mining
(EDM) method. This method generates patterns that can be used both by learners and
educators. By using EDM we can perform various educational tasks.
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force
activity and its outcomes to improve the targeting of high-value physicians and figure out
which marketing activities will have the best effect in the upcoming months. In the insurance
sector, data mining can help to predict which customers will buy new policies, identify
behavior patterns of risky customers, and identify fraudulent behavior of customers.
Transportation: A diversified transportation company with a large direct sales force can
apply data mining to identify the best prospects for its services. A large consumer goods
organization can apply data mining to improve its sales process to retailers.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product.
In generalized association rules, items are grouped using a concept hierarchy. Such a concept
hierarchy may, for example, have five levels, referred to as levels 0 to 4, starting with
level 0 at the root node “all”.