
UNIT-I

What Is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer: a more appropriate name would be knowledge mining, which emphasizes
mining knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems. The
overall goal of the data mining process is to extract information from a data set and transform it
into an understandable structure for further use.
The key properties of data mining are:

Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-
on analysis can now be answered directly from the data — quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.

Tasks of Data Mining


Data mining involves six common classes of tasks:

Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.

Association rule learning (Dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

Clustering – The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.

Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – Attempts to find a function which models the data with the least error.

Summarization – Providing a more compact representation of the data set, including visualization and report generation.

Types of Data Mining

Each of the following data mining techniques addresses a different kind of business problem and
provides a different insight into it. Understanding the type of business problem you need to solve
will help in knowing which technique to use and which will yield the best results. The data mining
types can be divided into two basic parts, as follows:

1. Predictive Data Mining Analysis
2. Descriptive Data Mining Analysis

1. Predictive Data Mining

As the name signifies, Predictive Data Mining analysis works on data to help anticipate what may
happen later (in the future) in business. Predictive Data Mining can be further divided into four
types, listed below:

o Classification Analysis
o Regression Analysis
o Time Series Analysis
o Prediction Analysis

2. Descriptive Data Mining

The main goal of the Descriptive Data Mining tasks is to summarize or turn given data into relevant
information. The Descriptive Data-Mining Tasks can also be further divided into four types that are
as follows:

o Clustering Analysis
o Summarization Analysis
o Association Rules Analysis
o Sequence Discovery Analysis

Here, we will discuss each of these data mining types in detail. Below are the different data
mining techniques that can help you find optimal outcomes.

1. Classification Analysis

This data mining technique is generally used for fetching or retrieving important and relevant
information about data and metadata. It is also used to categorize different types of data into
different classes. As you will see in this article, classification and clustering are similar data mining
types, since clustering also organizes data segments into groups (classes). However, unlike
clustering, in classification the data analyst knows the different classes or clusters in advance.
Therefore, in classification analysis you apply algorithms to decide how new data should be
categorized. A classic example of classification analysis is Outlook email, which uses certain
algorithms to characterize an email as legitimate or spam.

This technique is very helpful for retailers, who can use it to study the buying habits of their
different customers. Retailers can also study past sales data and then look for products that
customers usually buy together. They can then place those products near each other in their retail
stores, helping customers save time and increasing sales.

2. Regression Analysis

In statistical terms, regression analysis is a process used to identify and analyze the relationship
among variables: one variable is dependent on another, but not vice versa. It is generally used for
prediction and forecasting purposes. It can also help you understand how the characteristic value of
the dependent variable changes when any of the independent variables is varied.
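As a brief illustration, the following Python sketch fits a one-variable linear regression by ordinary least squares; the data points are invented for illustration.

# Simple linear regression sketch: fit y = a + b*x by ordinary least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]           # independent variable
ys = [2.1, 4.3, 6.2, 8.1, 9.9]           # dependent variable

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = covariance(x, y) / variance(x); intercept a = mean_y - b*mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"y = {a:.2f} + {b:.2f}*x")         # how y changes as x is varied
print("prediction at x = 6:", a + b * 6)  # forecast for an unseen x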

3. Time Series Analysis

A time series is a sequence of data points recorded at specific time intervals, most often regular
ones (seconds, hours, days, months, etc.). Almost every organization generates a high volume of
data every day, such as sales figures, revenue, traffic, or operating cost. Time series data mining
can help generate valuable information for long-term business decisions, yet it is underutilized in
most organizations.
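As a simple illustration of working with a time series, the sketch below smooths an invented series of daily sales with a 3-day moving average to expose the underlying trend.

# Time-series sketch: smooth daily sales (invented numbers) with a
# 3-day moving average.
daily_sales = [100, 120, 90, 130, 150, 140, 160]

window = 3
moving_avg = [
    sum(daily_sales[i:i + window]) / window
    for i in range(len(daily_sales) - window + 1)
]
print(moving_avg)  # approximately [103.3, 113.3, 123.3, 140.0, 150.0]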
4. Prediction Analysis

This technique is generally used to predict the relationship between the independent and dependent
variables, as well as among the independent variables alone. It can also be used to predict the profit
that can be achieved in the future depending on the sale. Let us imagine that profit and sale are the
dependent and independent variables, respectively. Then, on the basis of what the past sales data
says, we can make a profit prediction for the future using a regression curve.

5. Clustering Analysis

In data mining, this technique is used to create meaningful clusters of objects that share the same
characteristics. Most people confuse it with classification, but they won't have any issues once they
properly understand how both techniques actually work. Unlike classification, which collects objects
into predefined classes, clustering places objects in classes that it defines itself. To understand this
in more detail, consider the following example:

Example

Suppose you are in a library that is full of books on different topics. The real challenge is to
organize those books so that readers don't face any problem finding books on a particular topic. So
here, we can use clustering to keep books with similarities on one particular shelf and then give
each shelf a meaningful name or class. A reader looking for books on a particular topic can then go
straight to that shelf and won't be required to roam the entire library to find the desired book.
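The sketch below illustrates the same idea in Python with k-means clustering, using scikit-learn's KMeans (assumed available). The 2-D feature vectors are invented; note that, unlike classification, no class labels are supplied, and the groups emerge from the data.

# Clustering sketch: group records with similar features, with no
# predefined classes.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 0.1], [8.8, 0.3]]

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(points)  # a cluster label for each point
print(labels)                    # e.g. [0 0 1 1 2 2]; meaningful names are assigned afterwards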

6. Summarization Analysis

Summarization analysis is used to store a group (or set) of data in a more compact and
easier-to-understand form. We can easily understand it with the help of an example:

Example

You might have used Summarization to create graphs or calculate averages from a given set (or
group) of data. This is one of the most familiar and accessible forms of data mining.
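For instance, the following Python sketch summarizes an invented set of monthly sales figures into a few descriptive statistics, a compact representation of the original data.

# Summarization sketch: reduce raw values (invented) to descriptive statistics.
import statistics

monthly_sales = [1200, 1350, 980, 1420, 1100, 1500]

summary = {
    "count": len(monthly_sales),
    "mean": statistics.mean(monthly_sales),
    "min": min(monthly_sales),
    "max": max(monthly_sales),
}
print(summary)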

7. Association Rule Learning

In general, this can be considered a method that helps us identify interesting relations (dependency
modelling) between different variables in large databases. The technique can also help us unpack
hidden patterns in the data, which can be used to identify the variables within the data, and it detects
co-occurrences of variables that appear very frequently in the dataset. Association rules are generally
used for examining and forecasting the behavior of the customer, and the technique is highly
recommended in retail industry analysis. It is used in shopping basket data analysis, catalogue
design, product clustering, and store layout. In IT, programmers also use association rules to build
programs capable of machine learning. In short, this data mining technique helps find the association
between two or more items and discovers hidden patterns in the data set.

8. Sequence Discovery Analysis


The primary goal of sequence discovery analysis is to discover interesting patterns in data on the
basis of some subjective or objective measure of how interesting they are. Usually, this task involves
discovering frequent sequential patterns with respect to a frequency (support) measure. People often
confuse it with time series analysis, as both deal with adjacent observations that are order dependent.
However, looking at both in a little more depth avoids the confusion: time series analysis deals with
numerical data, whereas sequence discovery analysis deals with discrete values or data.
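A minimal sketch of sequence discovery in Python follows: it counts how often one purchase is followed by another across invented customer purchase histories, and keeps the pairs that meet a minimum support threshold.

# Sequence-discovery sketch: count ordered purchase pairs across customers'
# transaction histories (invented data), then keep the frequent ones.
from collections import Counter

histories = [
    ["phone", "cover", "charger"],
    ["phone", "cover"],
    ["laptop", "mouse"],
    ["phone", "charger", "cover"],
]

pair_counts = Counter()
for seq in histories:
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            pair_counts[(seq[i], seq[j])] += 1  # seq[i] occurs before seq[j]

min_support = 2
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)  # {('phone', 'cover'): 3, ('phone', 'charger'): 2}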

Architecture of Data Mining

A typical data mining system may have the following major components.

1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible
into the mining process so as to confine the search to only the interesting patterns.

4. User interface:
This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on
the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
Functionalities of Data Mining

Data mining functionalities are used to represent the types of patterns that are to be discovered in
data mining tasks. Data mining tasks can be classified into two types: descriptive and predictive.
Descriptive mining tasks characterize the general features of the data in the database, while
predictive mining tasks perform inference on the current data in order to make predictions.

Data mining is extensively used in many areas or sectors. It is used to predict and characterize data.
The ultimate objective of data mining functionalities is to observe the various trends in the data.
There are several data mining functionalities that organized and scientific methods offer,
such as:
1. Class/Concept Descriptions

A class or concept implies there is a data set or set of features that define the class or a concept. A
class can be a category of items on a shop floor, and a concept could be the abstract idea on which
data may be categorized like products to be put on clearance sale and non-sale products. There are
two concepts here, one that helps with grouping and the other that helps in differentiating.

o Data Characterization: This refers to the summary of general characteristics or features of
the class, resulting in specific rules that define a target class. A data analysis technique called
Attribute-Oriented Induction is employed on the data set for achieving characterization.
o Data Discrimination: Discrimination is used to separate distinct data sets based on the
disparity in attribute values. It compares features of a class with features of one or more
contrasting classes. The results are often presented as bar charts, curves, and pie charts.

2. Mining Frequent Patterns

One of the functions of data mining is finding data patterns. Frequent patterns are things that are
discovered to be most common in data. Various types of frequency can be found in the dataset.

o Frequent item set: This term refers to a group of items that are commonly found together,
such as milk and sugar.
o Frequent substructure: It refers to the various types of data structures that can be combined
with an item set or subsequences, such as trees and graphs.
o Frequent subsequence: A regular pattern series, such as buying a phone followed by a
cover.

3. Association Analysis

It analyses the set of items that generally occur together in a transactional dataset. It is also known as
Market Basket Analysis for its wide use in retail sales. Two parameters are used for determining the
association rules:

o Support, which identifies the common item sets in the database.
o Confidence, which is the conditional probability that an item occurs in a transaction when
another item occurs.
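The following Python sketch computes both parameters for a rule A => B over a small set of invented market-basket transactions, using the definitions above.

# Association-rule sketch: support and confidence of A => B (invented data).
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "sugar", "butter"},
]

A, B = {"milk"}, {"sugar"}

n_total = len(transactions)
n_a = sum(1 for t in transactions if A <= t)         # transactions containing A
n_ab = sum(1 for t in transactions if (A | B) <= t)  # containing both A and B

support = n_ab / n_total  # fraction of all transactions containing A and B
confidence = n_ab / n_a   # P(B in transaction | A in transaction)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.75, 1.00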

4. Classification

Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like if-then, decision trees or neural networks to predict a class
or essentially classify a collection of items. A training set containing items whose properties are
known is used to train the system to predict the category of items from an unknown collection of
items.

5. Prediction

Prediction is used to find values for unavailable data or to anticipate spending trends. An object can
be anticipated based on the attribute values of the object and the attribute values of the classes. It can
be a prediction of missing numerical values or of increase or decrease trends in time-related
information. There are primarily two types of predictions in data mining: numeric and class predictions.

o Numeric predictions are made by creating a linear regression model that is based on
historical data. Prediction of numeric values helps businesses ramp up for a future event that
might impact the business positively or negatively.
o Class predictions are used to fill in missing class information for products using a training
data set where the class for products is known.

6. Cluster Analysis

In image processing, pattern recognition and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined. Data attributes
represent the classes. Similar data are grouped together, with the difference being that a class label is
not known. Clustering algorithms group data based on similar features and dissimilarities.

7. Outlier Analysis

Outlier analysis is important for understanding the quality of data. If there are too many outliers, you
cannot trust the data or draw patterns from it. Outlier analysis determines whether there is something
out of turn in the data and whether it indicates a situation that a business needs to consider and take
measures to mitigate. Data that cannot be grouped into any class by the algorithms is pulled up for
outlier analysis.

8. Evolution and Deviation Analysis

Evolution Analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data helping to characterize, classify, cluster or
discriminate time-related data.

9. Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly two attributes are
related to one another. It determines how well two numerically measured continuous variables are
linked. Researchers can use this type of analysis to see if there are any possible correlations between
variables in their study.

Data Mining Process:

Data Mining is a process of discovering various models, summaries, and derived values from a
given collection of data.
The general experimental procedure adapted to data-mining problems involves the following
steps:
1. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain.
Hence, domain-specific knowledge and experience are usually necessary in order to come
up with a meaningful problem statement. Unfortunately, many application studies tend to
focus on the data-mining technique at the expense of a clear problem statement. In this
step, a modeler usually specifies a set of variables for the unknown dependency and, if
possible, a general form of this dependency as an initial hypothesis. There may be several
hypotheses formulated for a single problem at this stage. The first step requires the
combined expertise of an application domain expert and a data-mining expert. In practice, it
usually means a close interaction between the data-mining expert and the application
expert. In successful data-mining applications, this cooperation does not stop in the initial
phase; it continues during the entire data-mining process.

2. Collect the data

This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and implicitly
given in the data-collection procedure. It is very important, however, to understand how
data collection affects its theoretical distribution, since such a priori knowledge can be
very useful for modeling and, later, for the final interpretation of results. Also, it is
important to make sure that the data used for estimating a model and the data used later
for testing and applying a model come from the same, unknown, sampling distribution. If
this is not the case, the estimated model cannot be successfully used in a final application
of the results.

3. Preprocessing the data

In the observational setting, data are usually "collected" from the existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data preprocessing includes several steps
such as variable scaling and different types of encoding. For example, one feature with
the range [0, 1] and the other with the range [−100, 1000] will not have the same weights
in the applied technique; they will also influence the final data-mining results differently.
Therefore, it is recommended to scale them and bring both features to the same weight
for further analysis. Also, application-specific encoding methods usually achieve
dimensionality reduction by providing a smaller number of informative features for subsequent
data modeling.
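As a sketch of the scaling step just described, the following Python function applies min-max scaling so that a wide-range feature and a narrow-range feature (invented values) end up on the same [0, 1] scale and carry comparable weight.

# Preprocessing sketch: min-max scaling of two features with very
# different ranges, so neither dominates a distance-based technique.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

feature_a = [-100.0, 450.0, 1000.0]  # wide range, roughly [-100, 1000]
feature_b = [0.1, 0.5, 0.9]          # narrow range, within [0, 1]

print(min_max_scale(feature_a))  # [0.0, 0.5, 1.0]
print(min_max_scale(feature_b))  # [0.0, 0.5, 1.0]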
These two classes of preprocessing tasks are only illustrative examples of a large
spectrum of preprocessing activities in a data-mining process.
Data-preprocessing steps should not be considered completely independent from other
data-mining phases. In every iteration of the data-mining process, all activities, together,
could define new and improved data sets for subsequent iterations. Generally, a good
preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and
encoding.

4. Estimate the model

The selection and implementation of the appropriate data-mining technique is the main
task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task. The basic principles of learning and discovery from data are given in Chapter 4 of
this book. Later, Chapters 5 through 13 explain and analyze specific techniques that are
applied to perform a successful learning process from data and to develop an appropriate
model.

5. Interpret the model and draw conclusions

In most cases, data-mining models should help in decision making. Hence, such models
need to be interpretable in order to be useful because humans are not likely to base their
decisions on complex "black-box" models. Note that the goals of accuracy of the model
and accuracy of its interpretation are somewhat contradictory. Usually, simple models are
more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific

techniques to validate the results. A user does not want hundreds of pages of numeric
results that cannot be summarized, interpreted, and used for successful
decision making.

The Data mining Process

Classification of Data mining Systems:


To understand the system and meet the desired requirements, data mining systems can be classified
according to the disciplines they draw on:

Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines

Some Other Classification Criteria:

Classification according to kind of databases mined
Classification according to kind of knowledge mined
Classification according to kinds of techniques utilized
Classification according to applications adapted

Classification according to kind of databases mined

We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models, types of data, etc., and the
data mining system can be classified accordingly. For example, if we classify the database
according to the data model, then we may have a relational, transactional, object-relational, or data
warehouse mining system.

Classification according to kind of knowledge mined


We can classify a data mining system according to the kind of knowledge mined. This means
data mining systems are classified on the basis of functionalities such as:

Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis

Classification according to kinds of techniques utilized

We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.

Classification according to applications adapted

We can classify a data mining system according to the applications adapted. These applications are
as follows:

Finance
Telecommunications
DNA (deoxyribonucleic acid)
Stock Markets
E-mail

Examples of Classification Task

Following are some of the main examples of classification tasks:

o Classification helps in determining tumor cells as benign or malignant.
o Classification of credit card transactions as fraudulent or legitimate.
o Classification of secondary structures of protein as alpha-helix, beta-sheet, or random coil.
o Classification of news stories into distinct categories such as finance, weather, entertainment,
sports, etc.

Knowledge Discovery in Databases (KDD)


Some people treat data mining the same as knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps involved
in the knowledge discovery process:

Data Cleaning - In this step, the noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step, data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, knowledge is represented.
The following diagram shows the knowledge discovery process:

Architecture of KDD
Data Warehouse:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold the most recent address of a customer, whereas a data warehouse can
hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.

Data Warehouse Design Process:

A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.

The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that must
be solved are clear and well understood.

The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.

In the combined approach, an organization can exploit the planned and strategic nature of
the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.

The warehouse design process consists of the following steps:


Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger. If the business process is organizational
and involves multiple complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the analysis of one kind of
business process, a data mart model should be chosen.

Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.

Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.

OLAP(Online analytical Processing):

OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.

OLAP is part of the broader category of business intelligence, which also
encompasses relational databases, report writing and data mining.

OLAP tools enable users to analyze multidimensional data interactively from
multiple perspectives.

OLAP consists of three basic analytical operations:

 Consolidation (Roll-Up)
 Drill-Down
 Slicing And Dicing

Consolidation involves the aggregation of data that can be accumulated and computed in
one or more dimensions. For example, all sales offices are rolled up to the sales
department or sales division to anticipate sales trends.

The drill-down is a technique that allows users to navigate through the details.
For instance, users can view the sales by individual products that make up a region’s
sales.

Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of
the OLAP cube and view (dicing) the slices from different viewpoints.
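A rough illustration of roll-up and slicing in Python follows, using the pandas library (assumed available) on an invented sales table: offices are rolled up to regions, and one quarter is sliced out of the data.

# OLAP-style sketch: consolidation (roll-up) and slicing on invented data.
import pandas as pd

df = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "office":  ["E1", "E2", "W1", "W2"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 150, 80, 120],
})

rollup = df.groupby("region")["sales"].sum()  # roll offices up to regions
print(rollup)                                 # East 250, West 200

q1_slice = df[df["quarter"] == "Q1"]          # slice: fix one dimension's value
print(q1_slice)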
Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during discovery
to direct the mining process or examine the findings from different angles or depths. The data mining
primitives specify the following,

1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these primitives, allowing users to
interact with data mining systems flexibly. Having a data mining query language provides a
foundation on which user-friendly graphical interfaces can be built.

Designing a comprehensive data mining language is challenging because data mining covers a wide
spectrum of tasks, from data characterization to evolution analysis. Each task has different
requirements. The design of an effective data mining query language requires a deep understanding
of the power, limitation, and underlying mechanisms of the various kinds of data mining tasks. This
facilitates a data mining system's communication with other information systems and integrates with
the overall information processing environment.

List of Data Mining Task Primitives

A data mining query is defined in terms of the following primitives, such as:

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (the relevant attributes or
dimensions).

In a relational database, the set of task-relevant data can be collected via a relational query involving
operations like selection, projection, join, and aggregation.
The data collection process results in a new data relation called the initial data relation. The initial
data relation can be ordered or grouped according to the conditions specified in the query. This data
retrieval can be thought of as a subtask of the data mining task.

2. The kind of knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution
analysis.

3. The background knowledge to be used in the discovery process

This knowledge about the domain to be mined is useful for guiding the knowledge discovery process
and evaluating the patterns found. Concept hierarchies are a popular form of background knowledge,
which allows data to be mined at multiple levels of abstraction.

A concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more
general concepts.

o Rolling Up - Generalization of data: Allows viewing data at more meaningful and explicit
abstractions and makes it easier to understand. It compresses the data, requiring
fewer input/output operations.
o Drilling Down - Specialization of data: Concept values replaced by lower-level concepts.
Based on different user viewpoints, there may be more than one concept hierarchy for a given
attribute or dimension.

An example of a concept hierarchy for the attribute (or dimension) age would map raw age values
to higher-level concepts such as youth, middle-aged, and senior. User beliefs
regarding relationships in the data are another form of background knowledge.

4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interestingness measures. They may be used to guide
the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness
measures for association rules include support and confidence. Rules whose support and confidence
values are below user-specified thresholds are considered uninteresting.

o Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall
simplicity for human comprehension. For example, the more complex the structure of a rule
is, the more difficult it is to interpret, and hence, the less interesting it is likely to be.
Objective measures of pattern simplicity can be viewed as functions of the pattern structure,
defined in terms of the pattern size in bits or the number of attributes or operators appearing
in the pattern.
o Certainty (Confidence): Each discovered pattern should have a measure of certainty
associated with it that assesses the validity or "trustworthiness" of the pattern. A certainty
measure for association rules of the form "A =>B" where A and B are sets of items is
confidence. Confidence is a certainty measure. Given a set of task-relevant data tuples, the
confidence of "A => B" is defined as
Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
o Utility (Support): The potential usefulness of a pattern is a factor defining its interestingness.
It can be estimated by a utility function, such as support. The support of an association pattern
refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is
true.
Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
o Novelty: Novel patterns are those that contribute new information or increased performance
to the given pattern set. For example -> A data exception. Another strategy for detecting
novelty is to remove redundant patterns.

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed, which may include rules,
tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.

Users must be able to specify the forms of presentation to be used for displaying the discovered
patterns. Some representation forms may be better suited than others for particular kinds of
knowledge.

For example, generalized relations and their corresponding cross tabs or pie/bar charts are good for
presenting characteristic descriptions, whereas decision trees are common for classification.

Example of Data Mining Task Primitives

Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on
their buying patterns. You are especially interested in those customers whose salary is no less than
$40,000 and who have bought more than $1,000 worth of items, each of which is priced at no less
than $100.

In particular, you are interested in the customer's age, income, the types of items purchased, the
purchase location, and where the items were made. You would like to view the resulting
classification in the form of rules. This data mining query is expressed in DMQL as follows, where
each line of the query has been enumerated to aid in our discussion.

1. use database AllElectronics_db
2. use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
3. mine classification as promising_customers
4. in relevance to C.age, C.income, I.type, I.place_made, T.branch
5. from customer C, item I, transaction T
6. where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID and C.income ≥ 40,000 and I.price
≥ 100
7. group by T.cust_ID
Integration Of A Data Mining System With A Database Or Data Warehouse System

Data Integration

Data integration has been an integral part of data operations because data can be obtained from
several sources. It is a strategy that integrates data from several sources to make it available to users
in a single uniform view. The sources between systems can include multiple databases, data cubes,
or flat files. Data fusion merges data from various diverse sources to produce meaningful results;
the consolidated findings must exclude inconsistencies, contradictions, redundancies, and inequities.

Data integration is important because it gives a uniform view of scattered data while also maintaining
data accuracy. It assists the data mining program in mining meaningful information, which in turn
helps executives and managers make strategic decisions for the enterprise's benefit.

The data integration methods are formally characterized as a triple (G, S, M), where;

G represents the global schema,

S represents the heterogeneous source of schema,

M represents the mapping between source and global schema queries.

There are mainly four types of approaches or schemes for data integration. These are as follows:

1. No coupling: No coupling means that a DM system will not utilize any function of a DB or
DW system. It may fetch data from a particular source (such as a file system), process data using
some data mining algorithms, and then store the mining results in another file.
2. Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or
DW system, fetching data from a data repository managed by these systems, performing data
mining, and then storing the mining results either in a file or in a designated place in a database or
data Warehouse. Loose coupling is better than no coupling because it can fetch any portion of data
stored in databases or data warehouses by using query processing, indexing, and other system
facilities.
However, many loosely coupled mining systems are main memory-based. Because mining does not
explore data structures and query optimization methods provided by DB or DW systems, it is
difficult for loose coupling to achieve high scalability and good performance with large data sets.
3. Semitight coupling: Semitight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data mining primitives (identified by
the analysis of frequently encountered data mining functions) can be provided in the DB/DW
system. These primitives can include sorting, indexing, aggregation, histogram analysis, multiway
join, and precomputation of some essential statistical measures, such as sum, count, max, min, and
standard deviation.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into the
DB/DW system. The data mining subsystem is treated as one functional component of the information
system. Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a DB or DW system.
Major Issues in Data Mining
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available at one place. It needs to be integrated from various heterogeneous data sources. These
factors also create some issues. Here in this tutorial, we will discuss the major issues regarding −

 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
The following diagram describes the major issues.

Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be interested in
different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad
range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − To guide discovery process and to express the
discovered patterns, the background knowledge can be used. Background knowledge may be
used to express the discovered patterns not only in concise terms but at multiple levels of
abstraction.
 Data mining query languages and ad hoc data mining − Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they
need to be expressed in high-level languages and visual representations. These
representations should be easily understandable.
 Handling noisy or incomplete data − The data cleaning methods are required to handle the
noise and incomplete objects while mining the data regularities. If the data cleaning methods
are not there then the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered should be interesting; patterns that merely
represent common knowledge or lack novelty are of little value.
Performance Issues
There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be efficient
and scalable.
 Parallel, distributed, and incremental mining algorithms − The factors such as huge size
of databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the
data into partitions, which are processed in parallel. The results from the
partitions are then merged. Incremental algorithms update databases without mining the data
again from scratch.
Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data etc. It is not possible for
one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore, mining the knowledge from them
adds challenges to data mining.
Data Preprocessing:
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data into a useful
and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It
involves handling missing data, noisy data, etc.
 (a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by
attribute mean or the most probable value.
 (b). Noisy Data:
Noisy data is meaningless data that can't be interpreted by machines. It can be generated due
to faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size, and then various methods are performed to complete the task. Each
segment is handled separately: one can replace all data in a segment by its mean, or
boundary values can be used to complete the task (a worked sketch follows after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may
be linear (having one independent variable) or multiple (having multiple independent
variables).
3. Clustering:
This approach groups similar data in a cluster. Outliers may go undetected, or they
will fall outside the clusters.
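Here is the binning sketch referenced above: it sorts invented noisy values, splits them into equal-size bins, and smooths each bin by replacing its members with the bin mean.

# Smoothing-by-bin-means sketch on invented data.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])

bin_size = 3
smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]          # one equal-size segment
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))  # replace members by bin mean

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]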
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process.
This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual
levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute “city” can be converted to “country”.
3. Data Reduction:
Since data mining handles huge amounts of data, analysis becomes harder when working with
such volumes. In order to get rid of this, we use data reduction techniques, which aim to increase
storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For performing attribute
selection, one can use the level of significance and p-value of the attribute. Attributes having a
p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If the original
data can be retrieved after reconstruction from the compressed data, the reduction is called
lossless; otherwise, it is called lossy. Two effective methods of dimensionality reduction are
wavelet transforms and PCA (Principal Component Analysis); a minimal PCA sketch follows.
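The PCA sketch mentioned above, in Python with numpy (assumed available): it centers invented 3-D records and projects them onto the top principal component obtained from the singular value decomposition.

# Dimensionality-reduction sketch: project 3-D records onto their top
# principal component (invented data).
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6]])

Xc = X - X.mean(axis=0)                            # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                                              # keep the strongest component
reduced = Xc @ Vt[:k].T                            # shape (4, 1): one value per record
print(reduced)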

Advantages and Disadvantages of Data Mining


Advantages of Data Mining:
o Marketing/Retailing:

Direct marketers can benefit from data mining by obtaining precise and helpful trends regarding their
target audience's purchase habits. These trends enable marketers to target their market more
precisely with their marketing efforts. For example, a software company may promote its new
product to consumers with a long history of purchasing software.

Additionally, data mining can aid marketers in making predictions about the goods their target
customers may be interested in buying. Marketers can surprise consumers and enhance the shopping
experience by making this forecast. Data mining techniques can also be advantageous for retail
establishments. For instance, store management can arrange shelves with particular stock items
together, or set prices, based on patterns uncovered by data mining that will draw customers.

o Banking/Crediting:
Financial companies can benefit from data mining in areas like credit documentation and loan
records. A bank, for instance, can determine the degree of risk associated with each specific loan by
assessing prior consumers who share comparable features. Data mining can also assist credit card
issuers in alerting customers to possibly fraudulent credit card transactions. Even though data mining
technology does not predict fraudulent charges with 100% accuracy, it can help credit card issuers
cut their losses.

o Manufacturing:

Manufacturers can spot defective equipment and establish the best control parameters by using data
mining on operational engineering data. For instance, semiconductor manufacturers face a dilemma:
even when the manufacturing environments of different wafer production facilities are nominally the
same, wafer quality varies, and some wafers have faults for unexplained reasons. Data mining has
been used to identify the control parameter ranges that result in the fabrication of the "golden wafer".
The desired grade of wafers is then produced using those ideal control settings.

o Customer Identification:

Every consumer in the market is unique in their ways. Their fundamental behavior and traits differ.
As a result, it is easier to comprehend their preferences with the right methodology. Businesses may
better identify their clients with data mining, increasing the likelihood that they will buy their
products.

o Detecting criminal activity:

Governments and other institutions can use market analysis data to identify criminals. For instance,
the data can be structured to make it easier to analyze a customer's prior transactions. As a result, it
might quickly reveal any fraudulent activity.

o Business Administration:

New business prospects are made possible by the data mining process. Data mining can be used with
all products to adopt the proper company strategy. As an illustration, delivering the appropriate
product to the customer helps ensure product sales. In addition, the data mining information will
enable organizations to use various marketing strategies.

o Marketing Techniques:

Businesses can build data models using data mining approaches. They could quickly determine
which people would be interested in their products using these models. As a result, the firms may be
sure that the products they introduce will be profitable. Therefore, whatever new products are
presented will help the company's profits expand.

o Criminal Justice:

By discovering patterns in location, crime type, habit, and other behavior patterns, data mining can
help law enforcement locate and apprehend criminal offenders.

Disadvantages of Data Mining:


o Privacy Issues:

Businesses gather data about their customers in various ways to understand the trends in their buying
habits. Particularly now that the internet is booming with social networks, e-commerce, forums, and
blogs, concerns about personal privacy have been growing significantly. People worry that their
personal information will be collected and used unethically, which could get them into a lot of
trouble. Moreover, businesses don't last forever; on occasion, they might be bought out by another
company or go out of business entirely, and at such times the personal information they possess
may be sold or leaked.

o Safety concerns:

A major concern is security. Businesses hold Social Security numbers, birthdays, salary information,
and other details about customers and employees, but how well this information is protected remains
uncertain. Hackers have accessed and stolen large amounts of consumer data from large corporations
such as Ford Motor Credit Company and Sony Pictures. With so much financial and personal
information available, stolen credit card numbers and identity theft have become major issues.

o Information that has been misused or is erroneous:

Data mining techniques can be used improperly to gather information for unethical objectives. Using
this information to their advantage, unethical individuals or organizations could discriminate against
a certain group of people or take advantage of the weak. A further drawback of data mining is its
imperfect accuracy. Inaccurate information will have major repercussions if used to make decisions.

o Expensive:

A particularly expensive procedure is data mining. For instance, businesses need to hire more staff
and technical experts to ensure that data mining is done properly. Advanced data mining software is
necessary for many firms but may be expensive. For most small enterprises, the insights yielded may
not justify the expense, so data mining can cost more than it saves.

o Technical Knowledge:

Depending on how they should be used, various mining tools are available. They each have a
distinctive algorithm and design. Selecting the appropriate tool will only be possible with the
required technical knowledge. Therefore, it is necessary to bring in a competent specialist to handle
the tool selection.

o Accuracy:

Even though data mining has created a framework for simple data collection with its techniques, its
accuracy is still constrained. Making decisions can be complicated by erroneous information that has
been acquired.

o Large databases are needed for data mining:

Although data mining is one of the most effective tools in a marketer's arsenal, it has its challenges.
One such disadvantage is that huge datasets are necessary for data mining to be effective. For
instance, if an email list contains just 100 subscribers, the data from those emails will not be
enough for data mining. On the other hand, if the list has 100,000 people, far more information
will be available and data mining will be more successful.

o Data mining methods are not perfect:

Data mining does not always produce accurate information. There are numerous methods
for analyzing data, some of which are more precise than others. Predictive models, for instance, rely
on the expectation that particular data patterns will be discovered; when only some of the facts back
a forecast, its accuracy can be overestimated. Another problem arises when a database contains
missing data that must be accounted for to produce an accurate analysis.
