Data Mining
Data mining is the process of extracting useful information from large sets of
data. It involves using various techniques from statistics, machine learning, and
database systems to identify patterns, relationships, and trends in the data. This
information can then be used to make data-driven decisions, solve business
problems, and uncover hidden insights. Applications of data mining include
customer profiling and segmentation, market basket analysis, anomaly
detection, and predictive modeling. Data mining tools and technologies are
widely used in various industries, including finance, healthcare, retail, and
telecommunications.
In general terms, “mining” is the process of extracting some valuable material from the earth, e.g., coal mining, diamond mining. In the context of computer science, “data mining” can be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. It is essentially the process of extracting useful information from a bulk of data or from data warehouses. One can see that the term itself is a little confusing: in the case of coal or diamond mining, the result of the extraction process is coal or diamond, but in the case of data mining, the result of the extraction process is not data. Instead, the results are the patterns and knowledge we gain at the end of the extraction process. In that sense, we can think of data mining as one step in the larger process of Knowledge Discovery or Knowledge Extraction.
Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in
Databases” in 1989. However, the term ‘data mining’ became more popular
in the business and press communities. Currently, Data Mining and Knowledge
Discovery are used interchangeably.
Nowadays, data mining is used almost everywhere large amounts of data are stored and processed. For example, banks typically use data mining to identify prospective customers who might be interested in credit cards, personal loans, or insurance. Since banks hold transaction details and detailed profiles of their customers, they analyze this data to find patterns that help predict which customers are likely to be interested in such products.
Main Purpose of Data Mining
Data mining integrates techniques from many other domains, such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, and visualization, to gather more information about the data, uncover hidden patterns, predict future trends and behaviors, and help businesses make decisions.
Technically, data mining is the computational process of analyzing data from different perspectives and dimensions and categorizing or summarizing it into meaningful information.
Data mining can be applied to any type of data, e.g., data warehouses, transactional databases, relational databases, multimedia databases, spatial databases, time-series databases, and the World Wide Web.
Data Mining as a Whole Process
The whole process of Data Mining consists of three main phases:
1. Data Pre-processing – data cleaning, integration, selection, and transformation take place.
2. Data Extraction – the actual mining step, in which patterns are extracted from the prepared data.
3. Data Evaluation and Presentation – the results are analyzed, evaluated, and presented.
Data mining offers several benefits to organizations, including:
1. Improved customer service: Data mining can help organizations better understand their customers and tailor their products and services to meet their needs.
2. Fraud detection: Data mining can be used to identify fraudulent activities by detecting unusual patterns and anomalies in data.
3. Predictive modeling: Data mining can be used to build models that predict future events and trends, which can be used to make proactive decisions.
4. New product development: Data mining can be used to identify new product opportunities by analyzing customer purchase patterns and preferences.
5. Risk management: Data mining can be used to identify potential risks by analyzing data on customer behavior, market conditions, and other factors.
Data Attributes
Data attributes are the properties or characteristics that describe data objects. Attribute values may be numerical measurements (e.g., age, height), categorical labels (e.g., color, type), textual descriptions (e.g., name, description), or any other measurable or qualitative aspect of the data objects.
Types of attributes:
Categorizing attributes into different types is an initial step in data preprocessing and serves as a foundation for subsequent processing steps. Attributes can be broadly classified into two main types:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes:
Nominal attributes (“relating to names”) refer to categorical data whose values represent different categories or labels without any inherent order or ranking. These attributes are often used to represent names or labels associated with objects, entities, or concepts.
Example: hair color with values {black, brown, blond, gray}, or marital status with values {single, married, divorced}.
2. Binary Attributes: Binary attributes are a type of qualitative attribute where
the data can take on only two distinct values or states. These attributes are often
used to represent yes/no, presence/absence, or true/false conditions within a
dataset. They are particularly useful for representing categorical data where
there are only two possible outcomes. For instance, in a medical study, a binary
attribute could represent whether a patient is affected or unaffected by a
particular condition.
Symmetric: In a symmetric attribute, both values or states are
considered equally important or interchangeable. For example, in the
attribute “Gender” with values “Male” and “Female,” neither value holds
precedence over the other, and they are considered equally significant for
analysis purposes.
Asymmetric: In an asymmetric attribute, the two values are not equally important. For example, for a medical test with outcomes “positive” and “negative,” the positive (rarer) outcome is typically treated as the more significant one.
3. Ordinal Attributes: Ordinal attributes are qualitative attributes whose values have a meaningful order or ranking, although the magnitude of the difference between successive values is not known. Examples include grades {A, B, C} and size {small, medium, large}.
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented by integer or real values. Numeric attributes are of two types: interval-scaled and ratio-scaled.
An interval-scaled attribute has values whose differences are interpretable, but it lacks a true reference point, or zero point. Interval-scaled data can be added and subtracted but cannot meaningfully be multiplied or divided. Consider temperature in degrees Centigrade: if one day's temperature is numerically twice that of another day, we cannot say that the first day is twice as hot.
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of one value as being a multiple (or ratio) of another. The values are ordered, differences between values can be computed, and the mean, median, mode, quantile range, and five-number summary can be given.
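To see why ratios are meaningless on an interval scale, here is a minimal Python sketch (the temperatures are illustrative) contrasting Centigrade with Kelvin, a ratio scale with a true zero:

```python
# A minimal sketch of the interval vs. ratio distinction using temperature.
day1_c, day2_c = 10.0, 20.0  # degrees Centigrade (interval scale)

# The naive ratio on an interval scale is misleading:
print(day2_c / day1_c)  # 2.0, but day 2 is NOT "twice as hot"

# Converting to Kelvin (ratio scale, true zero at absolute zero)
# gives the physically meaningful ratio:
day1_k = day1_c + 273.15
day2_k = day2_c + 273.15
print(day2_k / day1_k)  # ~1.035 -- only about 3.5% hotter in absolute terms
```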
2. Discrete: Discrete data can take on only specific, separate values rather than a continuous range. These values are distinct from one another, and they can be either numerical or categorical in nature.
Example: the number of students in a class, or the number of items in a transaction.
3. Continuous: Continuous data can take on any value within a given range and are typically represented as real numbers. Examples include height, weight, and temperature.
Data Preprocessing in Data Mining
Data preprocessing is an important step in the data mining process. It refers to
the cleaning, transforming, and integrating of data in order to make it ready for
analysis. The goal of data preprocessing is to improve the quality of the data and
to make it more suitable for the specific data mining task.
Steps of Data Preprocessing
Data preprocessing is an important step in the data mining process that involves
cleaning and transforming raw data to make it suitable for analysis. Some
common steps in data preprocessing include:
1. Data Cleaning (to remove noise and inconsistent data): This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
2. Data Integration (where multiple data sources may be
combined): This involves combining data from multiple sources to create
a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics.
Techniques such as record linkage and data fusion can be used for data
integration.
3. Data Transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations): This involves converting the
data into a suitable format for analysis. Common techniques used in data
transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while
standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete
categories.
4. Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction.
Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
5. Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and
machine learning algorithms that require categorical data. Discretization
can be achieved through techniques such as equal width binning, equal
frequency binning, and clustering.
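As a brief illustration of the binning techniques just mentioned, here is a hedged sketch using pandas; the ages series and bin labels are hypothetical:

```python
# A minimal discretization sketch, assuming pandas is available.
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 44, 47, 52, 61, 65, 70])

# Equal-width binning: each bin spans an equal range of the value domain.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning: each bin holds roughly the same number of points.
equal_freq = pd.qcut(ages, q=3, labels=["young", "middle", "senior"])

print(equal_width.value_counts())
print(equal_freq.value_counts())
```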
Data preprocessing plays a crucial role in ensuring the quality of data and the
accuracy of the analysis results. The specific steps involved in data
preprocessing may vary depending on the nature of the data and the analysis
goals.
By performing these steps, the data mining process becomes more efficient and
the results become more accurate.
Preprocessing in Data Mining
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
Steps Involved in Data Preprocessing
1. Data Cleaning: The data can have many irrelevant and missing parts. Data cleaning is done to handle these issues, which involves dealing with missing data, noisy data, etc.
Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
o Ignore the tuples: This approach is suitable only when the dataset
we have is quite large and multiple values are missing within a
tuple.
o Fill the Missing Values: There are various ways to do this task. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value (a brief sketch follows).
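A minimal sketch of the two strategies above, assuming pandas is available; the DataFrame and its "income" column are hypothetical:

```python
# Handling missing values: dropping tuples vs. filling with the attribute mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, np.nan, 61000, 58000, np.nan]})

# Option 1: ignore (drop) tuples with missing values -- suitable when the
# dataset is large and only a few tuples are affected.
dropped = df.dropna()

# Option 2: fill missing values with the attribute mean.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```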
Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
o Binning Method: This method works on sorted data in order to smooth it. The data is divided into segments of equal size, and each segment is handled separately: one can replace all data in a segment by its mean, or boundary values can be used instead (a sketch appears after this list).
o Regression: Here, data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
o Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they fall outside the clusters.
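As referenced above, here is a minimal sketch of smoothing by bin means with NumPy; the data values and bin size are illustrative:

```python
# Smoothing noisy data by bin means: sort, split into equal-size bins,
# and replace each value by the mean of its bin.
import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

bins = data.reshape(3, 3)           # three equal-size bins of three values
bin_means = bins.mean(axis=1)       # mean of each bin: [ 9. 22. 29.]
smoothed = np.repeat(bin_means, 3)  # replace each value by its bin mean
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```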
2. Data Transformation: This step transforms the data into forms appropriate for the mining process. It involves the following:
Normalization: Done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a sketch appears after this list).
Attribute Selection: In this strategy, new attributes are constructed
from the given set of attributes to help the mining process.
Discretization: This is done to replace the raw values of a numeric attribute with interval labels or conceptual labels.
Concept Hierarchy Generation: Here, attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be generalized to “country.”
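As referenced above, a minimal sketch of min-max normalization to the range 0.0 to 1.0; the values are illustrative:

```python
# Min-max normalization to [0.0, 1.0], a minimal NumPy sketch.
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)
print(normalized)  # [0.    0.125 0.25  0.5   1.   ]
```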
3. Data Reduction: Data reduction is a crucial step in the data mining process
that involves reducing the size of the dataset while preserving the important
information. This is done to improve the efficiency of data analysis and to avoid
overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features
from the dataset. Feature selection is often performed to remove
irrelevant or redundant features from the dataset. It can be done using
various techniques such as correlation analysis, mutual information, and
principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-
dimensional space while preserving the important information. Feature
extraction is often used when the original features are high-dimensional
and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF); a PCA sketch appears after this list.
Sampling: This involves selecting a subset of data points from the
dataset. Sampling is often used to reduce the size of the dataset while
preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be done
using techniques such as k-means, hierarchical clustering, and density-
based clustering.
Compression: This involves compressing the dataset while preserving
the important information. Compression is often used to reduce the size of
the dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and GIF compression.
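As referenced in the Feature Extraction item above, here is a hedged PCA sketch assuming scikit-learn; the data is synthetic and the choice of three components is arbitrary:

```python
# Dimensionality reduction with PCA, a minimal scikit-learn sketch.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))        # 100 samples, 10 original features

pca = PCA(n_components=3)             # project onto a 3-dimensional subspace
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance retained by each component
```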
Class/Concept Description: Characterization and Discrimination
Data entries can be associated with classes or concepts. Such class/concept descriptions can be derived by (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or (2) data discrimination, by comparison of the target class
with one or a set of comparative classes (often called the contrasting classes), or
(3) both data characterization and discrimination. Data characterization is a
summarization of the general characteristics or features of a target class of data.
The data corresponding to the user-specified class are typically collected by a
query. For example, to study the characteristics of software products with sales
that increased by 10% in the previous year, the data related to such products
can be collected by executing an SQL query on the sales database.
Data characterization. A customer relationship manager at AllElectronics may
order the following data mining task: Summarize the characteristics of customers
who spend more than $5000 a year at AllElectronics. The result is a general
profile of these customers, such as that they are 40 to 50 years old, employed,
and have excellent credit ratings. The data mining system should allow the
customer relationship manager to drill down on any dimension, such as occupation, to view these customers according to their type of employment.
Data discrimination is a comparison of the general features of the target class
data objects against the general features of objects from one or multiple
contrasting classes. The target and contrasting classes can be specified by a
user, and the corresponding data objects can be retrieved through database
queries. For example, a user may want to compare the general features of
software products with sales that increased by 10% last year against those with
sales that decreased by at least 30% during the same period. The methods used
for data discrimination are similar to those used for data characterization. “How
are discrimination descriptions output?” The forms of output presentation are
similar to those for characteristic descriptions, although discrimination
descriptions should include comparative measures that help to distinguish
between the target and contrasting classes. Discrimination descriptions
expressed in the form of rules are referred to as discriminant rules.
Example 1.6
Data discrimination. A customer relationship manager at AllElectronics may
want to compare two groups of customers—those who shop for computer
products regularly (e.g., more than twice a month) and those who rarely shop for
such products (e.g., less than three times a year). The resulting description
provides a general comparative profile of these customers, such as that 80% of
the customers who frequently purchase computer products are between 20 and
40 years old and have a university education, whereas 60% of the customers
who infrequently buy such products are either seniors or youths, and have no
university degree. Drilling down on a dimension like occupation, or adding a new
dimension like income level, may help to find even more discriminative features
between the two classes.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are
many kinds of frequent patterns, including frequent itemsets, frequent
subsequences (also known as sequential patterns), and frequent substructures. A
frequent itemset typically refers to a set of items that often appear together in a
transactional data set—for example, milk and bread, which are frequently bought
together in grocery stores by many customers. A frequently occurring
subsequence, such as the pattern that customers tend to purchase first a
laptop, followed by a digital camera, and then a memory card, is a (frequent)
sequential pattern. A substructure can refer to different structural forms (e.g.,
graphs, trees, or lattices) that may be combined with itemsets or subsequences.
If a substructure occurs frequently, it is called a (frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
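To make the idea of a frequent itemset concrete, here is a minimal sketch that counts the support of item pairs in a toy transactional data set; the transactions and support threshold are illustrative:

```python
# Counting the support of 2-itemsets in a toy transaction set.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
min_support = 2  # a pair is "frequent" if it occurs in >= 2 transactions

# Count every 2-item combination that occurs within a transaction.
counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counts[pair] += 1

frequent = {pair: c for pair, c in counts.items() if c >= min_support}
print(frequent)  # e.g. ('bread', 'milk') appears together in 3 transactions
```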
Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels. In many cases, class-labeled data may simply not exist at the beginning. Clustering can be used to generate class labels for a group of data.
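As a minimal illustration, the sketch below clusters unlabeled 2-D points with k-means (assuming scikit-learn); the generated cluster assignments can then serve as class labels:

```python
# Clustering without class labels: a minimal k-means sketch.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabeled blobs of points in 2-D.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])        # cluster assignments act as generated labels
print(kmeans.cluster_centers_)   # one centroid per discovered cluster
```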