
Data Mining Models and Tasks

Data mining involves analyzing large datasets to discover unknown patterns and relationships. It uses techniques from statistics, machine learning, and pattern recognition. This document discusses several basic data mining models and tasks including classification, regression, clustering, prediction, and association rule mining. The goal is to either build a predictive or descriptive model of the data that can provide insights or make predictions about unknown data. Data mining has many applications in industries like banking, insurance, retail, and healthcare to reduce costs, enhance research, and increase sales.

Uploaded by

navaneethangceb
Copyright
© Attribution Non-Commercial (BY-NC)

DATA MINING

Dr. P. Vasantha & Mrs. P. Manonmani


Associate Prof. of Statistics
Sri Sarada College for Women, Salem, Tamil Nadu.

ABSTRACT

Data Mining is the analysis of (generally large) observational data sets to discover previously unknown, valid patterns and relationships and to summarize the data in novel ways that are both understandable and useful. This paper gives an overview of data mining, basic data mining models and tasks, uses, limitations and a few important implementation issues.

Keywords: Classification, clustering, sequence discovery, mission creep, interoperability

INTRODUCTION

Data mining is a component of a wider process called Knowledge Discovery from Databases. It involves experts from a set of disciplines, including mathematicians, computer scientists and statisticians, as well as those working in fields such as machine learning, artificial intelligence, information retrieval and pattern recognition. It uses sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction. Data mining can be performed on data represented in quantitative, textual, or multimedia forms. The essence of data mining is an attempt to discover the unexpected - and the unexpected, by its very nature, can arise in unexpected ways.

Data mining applications can use a variety of parameters to examine the data. They include association (patterns where one event is connected to another event, such as purchasing a pen and purchasing paper), sequence or path analysis (patterns where one event leads to another event, such as the birth of a child and purchasing diapers), classification (identification of new patterns, such as coincidences between duct tape purchases and plastic sheeting purchases), clustering (finding and visually documenting groups of previously unknown facts, such as geographic location and brand preferences) and forecasting (discovering patterns from which one can make reasonable predictions regarding future activities, such as the prediction that people who join an athletic club may take exercise classes).

BASIC DATA MINING MODELS AND TASKS

Data mining involves many different algorithms to accomplish different tasks. All of these algorithms attempt to fit a model to the data. The algorithms examine the data and determine a model that is closest to the characteristics of the data being examined.

[Figure: Data mining models and tasks]
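
The idea that an algorithm fits a model to the data, choosing the one closest to the data's characteristics, can be illustrated with a small sketch. It assumes the simplest possible case, a straight line fitted by least squares. The data values and the function names fit_line and sum_squared_error are hypothetical and are not taken from the paper.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_error(xs, ys, intercept, slope):
    """Measure how far a candidate line is from the observed data."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
fitted_error = sum_squared_error(xs, ys, intercept, slope)
# Any other line, such as y = 2x, is at least as far from the data.
guess_error = sum_squared_error(xs, ys, 0.0, 2.0)

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
print(f"squared error of fitted line: {fitted_error:.3f}")
print(f"squared error of the guess y = 2x: {guess_error:.3f}")
```

Here "closest" means the smallest sum of squared differences between the observed and the modelled values. Other data mining algorithms use other model families and other measures of closeness, but the principle is the same.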
A predictive model makes a prediction about values of data using known results found from different data. Predictive modeling may be made based on the use of other historical data. For example, a credit card transaction might be refused not because of the user’s own credit history, but because the current purchase is similar to earlier purchases that were subsequently found to be made with stolen cards. Predictive model data mining tasks include classification, regression, time series analysis and prediction.

A descriptive model identifies patterns or relationships in data. Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties. Clustering, summarization, association rules, and sequence discovery are usually viewed as descriptive in nature.

a) Classification finds a rule or a formula for organising data into classes. For example, a bank may wish to classify clients requesting loans into categories based on the likelihood of repayment. A rule or formula for making the classification is developed from the data in the training set. (Examples of classification methods are linear discriminant analysis, decision trees and neural networks.) The reliability of the rule or formula is then evaluated using the test set of data. This gives an indication of how well the procedure will work on the remaining bulk of the data.

b) Regression uses values of one or more explanatory variables to explain or predict an outcome variable. For example, insurance risk analysts use regression when they have to estimate the average value of a claim (an outcome variable) as a function of variables such as the age and gender of policy-holders (explanatory variables). These explanatory variables are often called rating factors, so-called because they are used by insurance companies when setting the rates of premiums.

c) With time series analysis, the value of an attribute is examined as it varies over time. The values are usually obtained at evenly spaced time points (daily, weekly, hourly, etc.). A time series graph is used to visualize the time series. There are three basic functions performed in time series analysis. In one case, distance measures are used to determine the similarity between different time series. In the second case, the structure of the line is examined to determine (and perhaps classify) its behaviour. A third application would be to use the historical time series graph to predict future values.

d) Prediction can be viewed as a type of classification. Here we are referring to a type of application rather than to a type of data mining modeling approach. Many real-world data mining applications can be seen as predicting future data states based on past and current data. Prediction applications include flood forecasting, speech recognition, machine learning and pattern recognition.

e) Clustering breaks a large database into different subgroups or clusters. It differs from classification because there are no predefined classes - the clusters are put together on the basis of similarity to each other, but it is up to the data miners to determine whether the clusters offer any useful insight.

f) Summarization maps data into subsets with associated simple descriptions. Summarization is also called characterization or generalization. It extracts or derives representative information about the database. This may be accomplished by actually retrieving portions of the data.
Alternatively, summary type information (such as the mean of some numeric attribute) can be derived from the data. Market basket analysis can be used to determine which things go together. It is a form of clustering; for example, a market basket analysis of supermarket sales records might reveal that shopping trolleys containing cheese are also likely to contain pickled onions. The retailer could use this information in arranging its shelves or for targeting an advertising campaign.

g) An association rule is a model that identifies specific types of data associations. These associations are often used in the retail sales community to identify items that are frequently purchased together. Associations are also used in many other applications such as predicting the failure of telecommunication switches.

h) Sequence discovery is used to determine sequential patterns in data. These patterns are based on a time sequence of actions. They are similar to associations in that data (or events) are found to be related, but the relationship is based on time. Unlike market basket analysis, which requires the items to be purchased at the same time, sequence discovery considers items that are purchased over time in some order. For example, most people who purchase CD players may be found to purchase CDs within one week. Temporal association rules fall into this category.

In all these cases, the basic objective is to find something unusual, something that we might not expect just by using common sense.

USES OF DATA MINING

Data mining can be used for a variety of purposes in both the private and public sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research and increase sales. For example, the insurance and banking industries can use data mining applications to detect fraud and assist in risk assessment (e.g. credit scoring). Using customer data collected over several years, companies can develop models that predict whether a customer is a good credit risk or whether an accident claim may be fraudulent and should be investigated more closely. The medical community sometimes uses data mining to help predict the effectiveness of a procedure or medicine. Pharmaceutical firms use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases. Retailers can use information collected through affinity programs (e.g., shoppers’ club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions, coupon offers, and which products are often purchased together. Companies such as telephone service providers and music clubs can use data mining to create a churn analysis to assess which customers are likely to remain as subscribers and which ones are likely to switch to a competitor. Intelligence agencies such as the FBI and CIA can use data mining to identify threats of terrorism. The Aviation Administration can use data mining to review plane crash data to recognize common defects and recommend precautionary measures.

LIMITATIONS OF DATA MINING

To be successful, data mining requires skilled technical and analytical specialists who can structure the analysis and interpret the output that is created. Consequently, the limitations of data mining are primarily data-related or personnel-related rather than technology-related. Although data mining can help reveal patterns and relationships, it does not tell the user the value or significance of these patterns. These types of determinations must be made by the user. Similarly, the validity of the patterns discovered is dependent on how they compare to real world circumstances. Data mining does not necessarily identify a causal relationship between behaviours and/or variables.

DATA MINING ISSUES

A few important issues associated with data mining are:

(a) Data Quality

Data quality is a multifaceted issue that represents one of the biggest challenges for data mining. Data quality refers to the accuracy and completeness of the data. Data quality can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates and human error can significantly impact the effectiveness of the more complex data mining techniques, which are sensitive to subtle differences that may exist in the data.

To improve data quality, it is sometimes necessary to clean the data, which can involve removing duplicate records, normalizing the values used to represent information in the database (e.g., ensuring that ‘no’ is represented as a 0 throughout the database and not sometimes as a 0, sometimes as an N, etc.), accounting for missing data points, removing unneeded data fields, identifying anomalous data points (e.g., an individual whose age is shown as 135 years) and standardizing data formats (e.g., changing dates so that they all follow the form MM/DD/YYYY).

(b) Interoperability

Interoperability refers to the ability of a computer system and/or data to work with other systems or data using common standards or processes. It is a critical part of the larger efforts to improve interagency collaboration and information sharing through e-government and homeland security initiatives. For data mining, interoperability of databases and software is important to enable the search and analysis of multiple databases simultaneously and to help ensure the compatibility of data mining activities of different agencies. Data mining projects that are trying to take advantage of existing legacy databases or that are initiating first-time collaborative efforts with other agencies or levels of government (e.g., police departments in different states) may experience interoperability problems. Similarly, as agencies move forward with the creation of new databases and information sharing efforts, they will need to address interoperability issues during their planning stages to better ensure the effectiveness of their data mining projects.

(c) Mission Creep

Mission creep refers to the use of data for purposes other than that for which the data was originally collected. This can occur regardless of whether the data was provided voluntarily by the individual or was collected through other means. All data collection efforts suffer accuracy concerns to some degree. Ensuring the accuracy of information can require costly protocols that may not be cost effective if the data is not of inherently high economic value.

In well-managed data mining projects, the original data collecting organization is likely to be aware of the data’s limitations and account for these limitations accordingly. However, such awareness may not be communicated or heeded when data is used for other purposes. For example, the accuracy of information collected through a shopper’s club card may suffer for a variety of reasons, including the lack of identity authentication when a card is issued, cashiers using their own cards for customers who do not have one, and/or customers who use multiple cards.

(d) Privacy

Concerns about privacy focus both on the actual projects proposed and on the potential for data mining applications to be expanded beyond their original purposes. Some observers contend that tradeoffs may need to be made regarding privacy to ensure security. Another set of observers suggests that existing laws and regulations regarding privacy protections are adequate and that these initiatives do not pose any threats to privacy. Still other observers argue that not enough is known about how data mining projects will be carried out and that greater oversight is needed. There is also some disagreement over how privacy concerns should be addressed. Some observers suggest that technical solutions are adequate. In contrast, some privacy advocates argue in favour of creating clearer policies and exercising stronger oversight.

CONCLUSION

Recent years have witnessed an exponential growth in terms of data generation and manipulation. A number of advances in technology and business processes have intensified the interest in data mining in both the public and the private sectors for decision-making and prediction. One who uses data mining should take into account the implementation issues, choose the model which best fits the data and apply a suitable technique to derive potentially useful information.
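
As an illustration of the data cleaning steps listed under (a) Data Quality, the following minimal sketch removes duplicate records, normalizes a yes/no field to 1/0, rewrites dates in the form MM/DD/YYYY and sets aside records with a missing or implausible age. The record layout, the field names (id, age, smoker, joined), the accepted date formats and the age limit of 120 are all assumptions made for the example; none of them comes from the paper.

```python
from datetime import datetime

# Assumed spellings of "no" and "yes" that may occur in the raw data.
NO_VALUES = {"0", "n", "no"}
YES_VALUES = {"1", "y", "yes"}
# Assumed input date formats, tried in this order.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y")
MAX_PLAUSIBLE_AGE = 120  # assumption: larger ages are treated as anomalies


def normalize_flag(value):
    """Map the different spellings of no/yes to 0/1; unknown values become None."""
    text = str(value).strip().lower()
    if text in NO_VALUES:
        return 0
    if text in YES_VALUES:
        return 1
    return None


def standardize_date(text):
    """Rewrite a date as MM/DD/YYYY; return None if it is missing or unparseable."""
    if not text:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None


def clean(records):
    """Return (cleaned, flagged): usable rows and rows needing manual review."""
    seen = set()
    cleaned, flagged = [], []
    for rec in records:
        row = {
            "id": rec["id"],
            "age": rec.get("age"),
            "smoker": normalize_flag(rec.get("smoker")),
            "joined": standardize_date(rec.get("joined")),
        }
        key = tuple(row.items())
        if key in seen:  # duplicate once the values are normalized
            continue
        seen.add(key)
        age = row["age"]
        if age is None or not 0 <= age <= MAX_PLAUSIBLE_AGE:
            flagged.append(row)  # missing or anomalous age
        else:
            cleaned.append(row)
    return cleaned, flagged


# Hypothetical raw records, invented for illustration.
records = [
    {"id": 1, "age": 34, "smoker": "N", "joined": "2009-03-15"},
    {"id": 1, "age": 34, "smoker": "no", "joined": "03/15/2009"},
    {"id": 2, "age": 135, "smoker": "0", "joined": "01/02/2008"},
    {"id": 3, "age": 51, "smoker": "Y", "joined": "20.07.2010"},
]

cleaned, flagged = clean(records)
for row in cleaned:
    print("clean:  ", row)
for row in flagged:
    print("flagged:", row)
```

Duplicates are detected only after normalization, so that two records differing merely in notation (N versus no, or two date formats) are recognized as the same record. Flagged rows are kept separate rather than discarded, since deciding what to do with them is a judgment for the analyst.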