Data Mining Models and Tasks
networks). The reliability of the rule or formula is then evaluated using the test set of data. This gives an indication of how well the procedure will work on the remaining bulk of the data.

b) Regression uses values of one or more explanatory variables to explain or predict an outcome variable. For example, insurance risk analysts use regression when they have to estimate the average value of a claim (an outcome variable) as a function of variables such as the age and gender of policy-holders (explanatory variables). These explanatory variables are often called rating

different subgroups or clusters. It differs from classification because there are no predefined classes: the clusters are put together on the basis of similarity to each other, but it is up to the data miners to determine whether the clusters offer any useful insight.

f) Summarization maps data into subsets with associated simple descriptions. Summarization is also called characterization or generalization. It extracts or derives representative information about the database. This may be accomplished by actually retrieving portions of the data.
Alternatively, summary-type information (such as the mean of some numeric attribute) can be derived from the data. Market basket analysis can be used to determine which things go together. It is a form of clustering; for example, a market basket analysis of supermarket sales records might reveal that shopping trolleys containing cheese are also likely to contain pickled onions. The retailer could use this information in arranging its shelves or for targeting an advertising campaign.

g) An association rule is a model that identifies specific types of data associations. These associations are often used in the retail sales community to identify items that are frequently purchased together. Associations are also used in many other applications, such as predicting the failure of telecommunication switches.

h) Sequence discovery is used to determine sequential patterns in data. These patterns are based on a time sequence of actions. They are similar to associations in that data (or events) are found to be related, but the relationship is based on time. Unlike a market basket analysis, which requires the items to be purchased at the same time, in sequence discovery the items are purchased over time in some order. For example, most people who purchase CD players may be found to purchase CDs within one week. As we will see, temporal association rules really fall into this category.

In all these cases, the basic objective is to find something unusual, something that we might not expect just by using common sense.

USES OF DATA MINING
Data mining can be used for a variety of purposes in both the private and public sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research and increase sales. For example, the insurance and banking industries can use data mining applications to detect fraud and assist in risk assessment (e.g., credit scoring). Using customer data collected over several years, companies can develop models that predict whether a customer is a good credit risk or whether an accident claim may be fraudulent and should be investigated more closely. The medical community sometimes uses data mining to help predict the effectiveness of a procedure or medicine. Pharmaceutical firms use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases. Retailers can use information collected through affinity programs (e.g., shoppers’ club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions, coupon offers, and which products are often purchased together. Companies such as telephone service providers and music clubs can use data mining to create a churn analysis to assess which customers are likely to remain as subscribers and which ones are likely to switch to a competitor. Intelligence agencies such as the FBI and the CIA can use data mining to identify threats of terrorism. The Aviation Administration can use data mining to review plane crash data to recognize common defects and recommend precautionary measures.

LIMITATIONS OF DATA MINING
To be successful, data mining requires skilled technical and analytical specialists who can structure the analysis and interpret the output that is created. Consequently, the limitations of data mining are primarily data- or personnel-related rather than technology-related. Although data mining can help reveal patterns and relationships, it does not tell the user the value or significance of these patterns. These types of determinations must
be made by the user. Similarly, the validity of the patterns discovered is dependent on how they compare to real-world circumstances. Data mining does not necessarily identify a causal relationship between behaviours and/or variables.

DATA MINING ISSUES
A few important issues associated with data mining are:

(a) Data Quality
Data quality is a multifaceted issue that represents one of the biggest challenges for data mining. Data quality refers to the accuracy and completeness of the data. Data quality can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates and human error can significantly impact the effectiveness of the more complex data mining techniques, which are sensitive to subtle differences that may exist in the data.
To improve data quality, it is sometimes necessary to clean the data, which can involve removing duplicate records, normalizing the values used to represent information in the database (e.g., ensuring that ‘no’ is represented as a 0 throughout the database and not sometimes as a 0, sometimes as an N, etc.), accounting for missing data points, removing unneeded data fields, identifying anomalous data points (e.g., an individual whose age is shown as 135 years) and standardizing data formats (e.g., converting dates to the form MM/DD/YYYY).

(b) Interoperability
Interoperability refers to the ability of a computer system and/or data to work with other systems or data using common standards or processes. It is a critical part of the larger efforts to improve interagency collaboration and information sharing through e-government and homeland security initiatives. For data mining, interoperability of databases and software is important to enable the search and analysis of multiple databases simultaneously and to help ensure the compatibility of data mining activities of different agencies. Data mining projects that are trying to take advantage of existing legacy databases or that are initiating first-time collaborative efforts with other agencies or levels of government (e.g., police departments in different states) may experience interoperability problems. Similarly, as agencies move forward with the creation of new databases and information sharing efforts, they will need to address interoperability issues during their planning stages to better ensure the effectiveness of their data mining projects.

(c) Mission Creep
Mission creep refers to the use of data for purposes other than that for which the data was originally collected. This can occur regardless of whether the data was provided voluntarily by the individual or was collected through other means.
All data collection efforts suffer accuracy concerns to some degree. Ensuring the accuracy of information can require costly protocols that may not be cost effective if the data is not of inherently high economic value.
In well-managed data mining projects, the original data collecting organization is likely to be aware of the data’s limitations and account for these limitations accordingly. However, such awareness may not be communicated or heeded when data is used for other purposes. For example, the accuracy of information collected through a shopper’s club card may suffer for a variety of reasons, including the lack of identity authentication when a card is issued, cashiers
using their own cards for customers who do not have one, and/or customers who use multiple cards.

(d) Privacy
Privacy concerns focus both on the actual projects proposed and on the potential for data mining applications to be expanded beyond their original purposes. Some observers contend that tradeoffs may need to be made regarding privacy to ensure security. Another set of observers suggests that existing laws and regulations regarding privacy protections are adequate and that these initiatives do not pose any threats to privacy. Still other observers argue that not enough is known about how data mining projects will be carried out and that greater oversight is needed. There is also some disagreement over how privacy concerns should be addressed. Some observers suggest that technical solutions are adequate. In contrast, some privacy advocates argue in favour of creating clearer policies and exercising stronger oversight.

CONCLUSION
Recent years have witnessed an exponential growth in terms of data generation and manipulation. A number of advances in technology and business processes have intensified the interest in data mining in both the public and the private sectors for decision-making and prediction. One who uses data mining should take into account the implementation issues, choose the model which best fits the data and apply a suitable technique to derive potentially useful information.
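
As a concrete illustration of the cleaning steps listed under (a) Data Quality, the following is a minimal Python sketch rather than a definitive procedure. The record layout, the field names (id, age, smoker, joined), the sample values, the accepted date formats and the age threshold of 120 are all assumptions made for the example; none of them comes from a real database. The sketch drops exact duplicate records, maps the different spellings of 'no' onto 0, converts dates to the form MM/DD/YYYY, and flags missing or implausible ages for review instead of silently altering them.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are invented for illustration.
RAW_RECORDS = [
    {"id": 1, "age": 34, "smoker": "N", "joined": "2009-03-15"},
    {"id": 1, "age": 34, "smoker": "N", "joined": "2009-03-15"},    # exact duplicate
    {"id": 2, "age": 135, "smoker": "no", "joined": "15/04/2008"},  # implausible age
    {"id": 3, "age": None, "smoker": "0", "joined": "07/21/2010"},  # missing age
]

NO_VALUES = {"0", "n", "no"}
YES_VALUES = {"1", "y", "yes"}
# Formats are tried in this order, so an ambiguous date such as 03/04/2010
# is read as MM/DD/YYYY. This ordering is an assumption of the sketch.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y")
MAX_PLAUSIBLE_AGE = 120  # assumed threshold for an anomalous age


def normalize_flag(value):
    """Map the various spellings of yes/no onto 1/0; None if unrecognized."""
    text = str(value).strip().lower()
    if text in NO_VALUES:
        return 0
    if text in YES_VALUES:
        return 1
    return None


def standardize_date(text):
    """Return the date as MM/DD/YYYY, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None


def clean(records):
    """Return (cleaned records, list of (id, problem) pairs flagged for review)."""
    seen = set()
    cleaned, flagged = [], []
    for rec in records:
        key = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if key in seen:  # remove exact duplicate records
            continue
        seen.add(key)
        row = dict(rec)
        row["smoker"] = normalize_flag(rec["smoker"])
        row["joined"] = standardize_date(rec["joined"])
        if rec["age"] is None:
            flagged.append((rec["id"], "missing age"))
        elif not 0 <= rec["age"] <= MAX_PLAUSIBLE_AGE:
            flagged.append((rec["id"], "implausible age"))
        cleaned.append(row)
    return cleaned, flagged


cleaned_records, flagged_records = clean(RAW_RECORDS)
for row in cleaned_records:
    print(row)
print("Flagged for review:", flagged_records)
```

Flagging anomalous or missing values, rather than deleting or imputing them, leaves the decision to the analyst, in keeping with the point made under LIMITATIONS OF DATA MINING that the significance of what is found must be judged by the user.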