
Data Mining

Introduction
• Data mining is a process of discovering patterns, relationships, and valuable
insights from large datasets using various techniques and methods.
• It involves extracting meaningful information and knowledge from data, often
hidden within the vast volume of information available.
• Data mining is a subset of the broader field of data analysis and plays a crucial
role in discovering actionable insights for decision-making, prediction, and
optimization.
• Data mining can be used by businesses in many ways. Three examples are:
• Customer profiling, identifying those subsets of customers most profitable to the business;
• Targeting, determining the characteristics of profitable customers who have been captured
by competitors;
• Market-basket analysis, determining product purchases by consumer, which can be used for
product positioning and for cross-selling.
Cont.
• Data mining has been called exploratory data analysis, among other
things. Masses of data generated from cash registers, from scanning, and
from topic-specific databases throughout the company are explored,
analyzed, reduced, and reused.
• Searches are performed across different models proposed for predicting
sales, marketing response, and profit. Classical statistical approaches are
fundamental to data mining.
• Automated AI methods are also used. However, systematic exploration
through classical statistical methods is still the basis of data mining.
• Some of the tools developed by the field of statistical analysis are
applied automatically (with some key human guidance) when dealing
with data.
What is Needed to Do Data Mining?
• Data mining requires identification of a problem, along with
collection of data that can lead to better understanding, and
computer models to provide statistical or other means of analysis.
• This may be supported by visualization tools that display data, or
through fundamental statistical analysis such as correlation analysis.
• Data mining tools need to be versatile, scalable, capable of
accurately predicting the relationship between actions and results, and
capable of automatic implementation.
• Versatile refers to the ability of the tool to apply a wide variety of
models. Scalable means that if the tool works on a small data
set, it should also work on larger data sets.
Cont.
• Automation is useful, but its application is relative. Some analytic
functions are often automated, but human setup prior to
implementing procedures is required.
• In fact, analyst judgment is critical to successful implementation of
data mining.
• Proper selection of data to include in searches is critical. Data
transformation also is often required. Too many variables produce
too much output, while too few can overlook key relationships in the
data.
• Fundamental understanding of statistical concepts is mandatory for
successful data mining.
DATA MINING PROCESS
• In order to systematically conduct data mining analysis, a general
process is usually followed.
• There are some standard processes, two of which are described in
this chapter. One (CRISP) is an industry standard process consisting of
a sequence of steps that are usually involved in a data mining study.
• The other (SEMMA) is specific to SAS. While not every step of either
approach is needed in every analysis, these processes give good
coverage of the steps involved: data exploration, data
collection, data processing, analysis, drawing inferences, and
implementation.
CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING (CRISP-DM)
• This model consists of six phases intended as a cyclical process.
• Business Understanding: Business understanding includes
determining business objectives, assessing the current situation,
establishing data mining goals, and developing a project plan.
• Data Understanding: Once business objectives and the project plan
are established, data understanding considers data requirements.
• This step can include initial data collection, data description, data
exploration, and the verification of data quality.
• Data exploration such as viewing summary statistics (which includes the
visual display of categorical variables) can occur at the end of this phase.
• Models such as cluster analysis can also be applied during this phase, with
the intent of identifying patterns in the data.
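• As a simple illustration, here is a minimal Python sketch of this kind of initial exploration using pandas; the file name and column names are hypothetical.

import pandas as pd

df = pd.read_csv("customers.csv")     # initial data collection (hypothetical file)
print(df.describe())                  # summary statistics for numeric variables
print(df["region"].value_counts())    # distribution of a categorical variable
print(df.isna().sum())                # data quality check: missing values per column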
CRISP-DM Cont.
• Data Preparation: Once the data resources available are identified,
they need to be selected, cleaned, built into the form desired, and
formatted.
• Data cleaning and data transformation in preparation of data
modeling needs to occur in this phase.
• Data exploration at a greater depth can be applied during this phase,
and additional models utilized, again providing the opportunity to see
patterns based on business understanding.
CRISP-DM Cont.
• Modeling: Data mining software tools such as visualization (plotting data
and establishing relationships) and cluster analysis (to identify which
variables go well together) are useful for initial analysis.
• Tools such as generalized rule induction can develop initial association
rules.
• Once greater data understanding is gained (often through pattern
recognition triggered by viewing model output), more detailed models
appropriate to the data type can be applied.
• One can use off-the-shelf modeling software or develop custom Machine
learning models.
• The division of data into training and test sets is also needed for modeling.
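• As an illustration, a minimal scikit-learn sketch of the train/test division; the feature matrix and labels here are synthetic placeholders.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)              # 1,000 records with 5 features (synthetic)
y = np.random.randint(0, 2, size=1000)   # binary target

# Hold out 30% of the records for testing the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)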
CRISP-DM Cont.
• Evaluation: Model results should be evaluated in the context of the
business objectives established in the first phase (business
understanding).
• This will lead to the identification of other needs (often through
pattern recognition), frequently reverting to prior phases of CRISP-
DM.
• Gaining business understanding is an iterative procedure in data
mining, where the results of various visualization, statistical, and
artificial intelligence tools show the user new relationships that
provide a deeper understanding of organizational operations.
CRISP-DM Cont.
• Deployment: Data mining can be used both to verify previously held
hypotheses and for knowledge discovery (identification of unexpected and
useful relationships).
• Through the knowledge discovered in the earlier phases of the CRISP-DM
process, sound models can be obtained that may then be applied to
business operations for many purposes, including prediction or
identification of key situations.
• These models need to be monitored for changes in operating conditions,
because what might be true today may not be true a year from now. If
significant changes do occur, the model should be redone. It’s also wise to
record the results of data mining projects so documented evidence is
available for future studies.
Note
• This six-phase process is not a rigid, by-the-numbers procedure.
There’s usually a great deal of backtracking.
• Additionally, experienced analysts may not need to apply each phase
for every study. But CRISP-DM provides a useful framework for data
mining.
SEMMA (Sample, Explore, Modify, Model, Assess)
• It was developed by the SAS Institute.
• Beginning with a statistically representative sample of your data,
SEMMA intends to make it easy to apply exploratory statistical and
visualization techniques, select and transform the most significant
predictive variables, model the variables to predict outcomes, and
finally confirm a model’s accuracy.
SEMMA
• Step 1 (Sample): This is where a portion of a large data set (big enough to
contain the significant information yet small enough to manipulate quickly)
is extracted.
• For optimal cost and computational performance, some (including the SAS
Institute) advocate a sampling strategy that applies a reliable,
statistically representative sample of the full detail data.
• In the case of very large datasets, mining a representative sample instead
of the whole volume may drastically reduce the processing time required
to get crucial business information.
• If general patterns appear in the data as a whole, these will be traceable in a
representative sample.
• If a niche (a rare pattern) is so tiny that it is not represented in a sample and yet so
important that it influences the big picture, it should be discovered using exploratory
data description methods.
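• A minimal pandas sketch of drawing such a sample; the file and the "segment" column used for stratification are hypothetical.

import pandas as pd

full = pd.read_csv("transactions.csv")   # the full detail data (hypothetical file)

# Simple random sample: 10% of the rows.
sample = full.sample(frac=0.10, random_state=1)

# Stratified alternative: keep each segment's share of rows intact.
strat = full.groupby("segment").sample(frac=0.10, random_state=1)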
Cont.
• Step 2 (Explore): This is where the user searches for unanticipated trends
and anomalies in order to gain a better understanding of the data set.
• After sampling your data, the next step is to explore them visually or
numerically for inherent trends or groupings. Exploration helps refine and
redirect the discovery process.
• If visual exploration does not reveal clear trends, one can explore the data
through statistical techniques including factor analysis, correspondence
analysis, and clustering.
• For example, in data mining for a direct mail campaign, clustering might reveal
groups of customers with distinct ordering patterns.
• Limiting the discovery process to each of these distinct groups individually may
increase the likelihood of exploring richer patterns that may not be strong enough to
be detected if the whole dataset is to be processed together.
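• A minimal pandas sketch of numeric exploration on a sampled customer table; the column names are hypothetical.

import pandas as pd

df = pd.read_csv("customer_sample.csv")   # hypothetical sampled table

# Pairwise correlations among numeric variables can expose inherent trends.
print(df.corr(numeric_only=True))

# Group-level averages may reveal distinct ordering patterns.
print(df.groupby("order_channel")["order_value"].mean())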
SEMMA Cont.
• Step 3 (Modify): This is where the user creates, selects, and
transforms the variables upon which to focus the model construction
process.
• Based on the discoveries in the exploration phase, one may need to
manipulate data to include information such as the grouping of
customers and significant subgroups, or to introduce new variables.
• It may also be necessary to look for outliers and reduce the number of
variables, to narrow them down to the most significant ones. One
may also need to modify data when the “mined” data change.
• Because data mining is a dynamic, iterative process, you can update
data mining methods or models when new information is available.
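• A minimal pandas sketch of the Modify step: deriving a new variable and screening for outliers. The table and its columns are hypothetical.

import pandas as pd

df = pd.read_csv("customer_sample.csv")   # hypothetical sampled table

# Introduce a new variable from existing ones.
df["spend_per_order"] = df["total_spend"] / df["num_orders"]

# Screen for outliers: keep rows within 3 standard deviations of the mean.
m, s = df["spend_per_order"].mean(), df["spend_per_order"].std()
df = df[(df["spend_per_order"] - m).abs() <= 3 * s]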
SEMMA Cont.
• Step 4 (Model): This is where the user searches for a variable combination
that reliably predicts a desired outcome. Once you prepare your data, you
are ready to construct models that explain patterns in the data.
• Modeling techniques in data mining include artificial neural networks,
decision trees, rough set analysis, support vector machines, logistic models,
and other statistical models – such as time series analysis, memory- based
reasoning, and principal component analysis.
• Each type of model has particular strengths, and is appropriate within
specific data mining situations depending on the data.
• For example, artificial neural networks are very good at fitting highly complex
nonlinear relationships, while rough set analysis is known to produce reliable
results in uncertain and imprecise problem situations.
SEMMA Cont.
• Step 5 (Assess): This is where the user evaluates the usefulness and the
reliability of findings from the data mining process. In this final step of the
data mining process, the user assesses the models to estimate how well they
perform.
• A common means of assessing a model is to apply it to a portion of the
data set put aside during the sampling stage (and not used during model
building).
• If the model is valid, it should work for this reserved sample as well as for
the sample used to construct the model. Similarly, you can test the model
against known data.
• For example, if you know which customers in a file had high retention rates and your
model predicts retention, you can check to see whether the model selects these
customers accurately.
• In addition, practical applications of the model, such as partial mailings in a
direct mail campaign, help prove its validity.
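• A minimal scikit-learn sketch of this assessment: the model is fit on one portion of the data and scored on the reserved portion. The data and model here are illustrative placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 4)               # synthetic features
y = np.random.randint(0, 2, size=500)    # synthetic binary labels
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# A valid model should perform comparably on the reserved sample.
print("training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("holdout accuracy: ", accuracy_score(y_hold, model.predict(X_hold)))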
Note
• The SEMMA approach is completely compatible with the CRISP
approach.
• Both aid the knowledge discovery process. Once models are obtained
and tested, they can then be deployed to gain value with respect to
business or research application.
COMPARISON OF CRISP-DM AND SEMMA APPROACHES
• Process Phases: CRISP-DM consists of six phases: Business
Understanding, Data Understanding, Data Preparation, Modeling,
Evaluation, and Deployment. In contrast, SEMMA consists of five
phases: Sample, Explore, Modify, Model, and Assess.
• Focus: CRISP-DM focuses on iterative processes, with the aim of
delivering results in a timely and cost-effective manner. It emphasizes
the importance of understanding the business problem and ensuring
that the data mining effort aligns with business goals.
• SEMMA, on the other hand, is more focused on the modeling process
itself, with less emphasis on understanding the business problem and
the data.
Cont.
• Data Preparation: In CRISP-DM, Data Preparation is a separate phase
that involves cleaning, transforming, and integrating data to prepare
it for modeling. In contrast, SEMMA includes data preparation as part
of the Modify phase, which involves data cleansing, variable creation,
and other data preparation tasks.
• Modeling: In CRISP-DM, the Modeling phase involves selecting a
modeling technique and building a model. In SEMMA, the Modeling
phase involves developing and testing models, and selecting the best
model for the problem at hand.
Cont.
• Deployment: Both CRISP-DM and SEMMA have a Deployment phase that
involves putting the model into practice. However, CRISP-DM emphasizes
the importance of monitoring the model's performance over time, while
SEMMA places less emphasis on ongoing monitoring.
• CRISP-DM is a more comprehensive and flexible data mining process model
that emphasizes understanding the business problem and iterative
processes.
• SEMMA is a more focused data mining process model that emphasizes the
modeling process itself. The choice of which process model to use depends
on the specific needs of the project and the organization.
DATA MINING TASKS
Introduction
• Data mining is a process of discovering patterns and insights in large
datasets by using various algorithms and statistical models.
• These algorithms are designed to perform different tasks, such as
classification, clustering, association rule mining, and regression,
among others. Consider the following examples:
1) Classification problem:
• This task involves assigning data points to predefined classes or
categories. Algorithms learn from labeled training data to make predictions
on new, unlabeled data. Examples include email spam detection, credit risk
assessment, and disease diagnosis.
• Case Scenario: Suppose we have a dataset that contains information about
credit card transactions, including the amount spent, the time of day, and
the location of the transaction.
• The goal is to predict whether a transaction is fraudulent or not. We could
use several classification Machine learning algorithms, including decision
trees, logistic regression, support vector machines (SVM), and neural
networks, among others.
• The quality of the model would depend on the algorithm used, the quality
of the data, and the appropriate selection of parameters.
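• A minimal sketch of such a classifier using logistic regression in scikit-learn; the transaction features and labels are synthetic stand-ins for real data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.random((2000, 3))        # amount, hour of day, location code (all scaled)
y = (X[:, 0] + rng.normal(0, 0.1, 2000) > 0.9).astype(int)  # 1 = fraudulent

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)
clf = LogisticRegression().fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))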
2) Clustering problem:
• Clustering algorithms group data points with similar characteristics
into clusters. The goal is to discover natural groupings within the data
without predefined categories.
• Case Scenario: Suppose we have a dataset that contains information
about customer demographics and buying behavior, and we want to
segment customers based on their preferences. We could use several
clustering algorithms, including k-means, hierarchical clustering, and
density-based clustering. However, the choice of algorithm would
depend on the quality of the data, the desired number of clusters,
and the underlying distribution of the data.
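• A minimal k-means sketch in scikit-learn; the two features (age, annual spend) and the choice of three clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((300, 2)) * [60, 5000] + [18, 100]   # synthetic age and spend

X_scaled = StandardScaler().fit_transform(X)   # put features on a common scale
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_[:10])                         # cluster assignment per customer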
3) Regression
• Regression is used to map a data item to a real-valued prediction
variable. In fact, regression involves learning the function that
does this mapping.
• Regression assumes that the target data fit into some known type of
function (e.g., linear, logistic, etc.) and then determines the best
function of this type that models the given data.
• Some type of error analysis is used to determine which function is
"best". Standard linear regression, as illustrated in the example below, is a
simple example of regression.
Cont.
• Example
• A college professor wishes to reach a certain level of savings before
her retirement. Periodically, she predicts what her retirement savings
will be based on its current value and several past values. She uses a
simple linear regression formula to predict this value by fitting past
behavior to a linear function and then using this function to predict
the values at points in the future. Based on these values, she then
alters her investment portfolio.
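• A minimal numpy sketch of the professor's projection; the past savings values are invented for illustration.

import numpy as np

years = np.array([0, 1, 2, 3, 4])                    # past time points
savings = np.array([10.0, 12.1, 14.3, 16.2, 18.4])   # savings, in $1000s (invented)

slope, intercept = np.polyfit(years, savings, 1)     # fit a linear function
print("projected savings at year 10:", slope * 10 + intercept)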
4) Time Series Analysis
• With time series analysis, the value of an attribute is examined as it varies
over time. The values usually are obtained as evenly spaced time points
(daily, weekly, hourly, etc.).
• A time series plot is used to visualize the time series. In the figure on
the next page you can easily see that the plots for Y and Z have similar
behavior, while X appears to have less volatility.
• There are three basic functions performed in time series analysis. In the
first case, distance measures are used to determine the similarity between
different time series.
• In the second case, the structure of the line is examined to determine (and
perhaps classify) its behavior. A third application is to use the
historical time series plot to predict future values. Two of these
functions are sketched below.
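• A minimal sketch of a Euclidean distance measure between two series, and a simple moving-average forecast. The series values are invented.

import numpy as np

y = np.array([10, 11, 13, 12, 14, 15, 17, 16], dtype=float)
z = np.array([9, 11, 12, 12, 15, 14, 16, 17], dtype=float)

# Similarity between two series via Euclidean distance (smaller = more alike).
print("distance(Y, Z):", np.linalg.norm(y - z))

# Forecast the next value of Y as the mean of its last three observations.
print("next-value forecast for Y:", y[-3:].mean())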
Cont.
• Example 1.4
• M. Smith is trying to determine whether to purchase stock from Companies
X, Y, or Z. For a period of one month he charts the daily stock price for each
company. Figure 1.3 shows the time series plot that M. Smith has
generated. Using this and similar information available from his
stockbroker, M. Smith decides to purchase stock X because it is less volatile
while overall showing a slightly larger relative amount of growth than
either of the other stocks. As a matter of fact, the stocks for Y and Z have a
similar behavior. The behavior of Y between days 6 and 20 is identical to
that for Z between days 13 and 27.
• Watch this video on stock price prediction:
• https://youtu.be/0YNLfyot0V8
5) Prediction
• Many real-world data mining applications can be seen as predicting future
data states based on past and current data. Prediction can be viewed as a
type of classification.
• (Note: prediction here names a data mining task, which is distinct from
the prediction model itself, although the prediction task makes use of
predictive models.) The difference is that prediction anticipates a future
state rather than a current state.
• Here we are referring to a type of application rather than to a type of data
mining modeling approach, as discussed earlier. Prediction applications
include flooding, speech recognition, machine learning, and pattern
recognition.
• Although future values may be predicted using time series analysis or
regression techniques, other approaches may be used as well.
Cont.
• Example:
• Predicting flooding is a difficult problem. One approach uses monitors
placed at various points in the river. These monitors collect data
relevant to flood prediction: water level, rain amount, time, humidity,
and so on.
• Then the water level at a potential flooding point in the
river can be predicted based on the data collected by the sensors
upriver from this point. The prediction must be made with respect to
the time the data were collected.
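• A minimal sketch of this idea: fit the downstream level as a function of an upriver reading taken earlier, then predict from the latest reading. All values and the two-step lag are invented.

import numpy as np

rng = np.random.default_rng(3)
upriver = rng.random(120) * 5                              # upriver sensor readings
downstream = 0.8 * upriver[:-2] + rng.normal(0, 0.2, 118)  # observed 2 steps later

# Fit the downstream level as a function of the earlier upriver reading.
slope, intercept = np.polyfit(upriver[:-2], downstream, 1)

# Predict the upcoming downstream level from the latest upriver reading.
print("predicted downstream level:", slope * upriver[-1] + intercept)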
6) Summarization
• Summarization maps data into subsets with associated simple descriptions.
Summarization is also called characterization or generalization.
• It extracts or derives representative information about the database. This
may be accomplished by actually retrieving portions of the data.
• Alternatively, summary type information (such as the mean of some
numeric attribute) can be derived from the data. The summarization
succinctly characterizes the contents of the database.
• Example: Company ABC might use summarization techniques to create a
financial statement that provides a concise summary of its financial
performance over a given period. This statement might include information
such as revenue, expenses, profits, and losses, presented in a format that is
easy to understand and interpret.
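• A minimal pandas sketch of such a summary; the transaction table and its columns are hypothetical.

import pandas as pd

tx = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "revenue": [120.0, 80.0, 150.0, 95.0],
    "expense": [70.0, 60.0, 90.0, 65.0],
})

summary = tx.groupby("quarter").sum()
summary["profit"] = summary["revenue"] - summary["expense"]
print(summary)    # a concise characterization of the underlying records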
7) Association Rules
• Link analysis, alternatively referred to as affinity analysis or association,
refers to the data mining task of uncovering relationships among data.
• The best example of this type of application is to determine association
rules. An association rule is a model that identifies specific types of data
associations.
• These associations are often used in the retail sales community to identify
items that are frequently purchased together. The example below illustrates the
use of association rules in market basket analysis.
• Here the data analyzed consist of information about what items a customer
purchases. Associations are also used in many other applications such as
predicting the failure of telecommunication switches.
Example
• A grocery store retailer is trying to decide whether to put bread on
sale. To help determine the impact of this decision, the retailer
generates association rules that show what other products are
frequently purchased with bread. He finds that 60% of the time that
bread is sold so are pretzels and that 70% of the time jelly is also sold.
Based on these facts, he tries to capitalize on the association between
bread, pretzels, and jelly by placing some pretzels and jelly at the end
of the aisle where the bread is placed. In addition, he decides not to
place either of these items on sale at the same time.
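• A minimal sketch of the computation behind such rules: support and confidence for bread -> jelly over a handful of invented baskets (the resulting percentages differ from the example above).

baskets = [
    {"bread", "jelly", "pretzels"},
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"milk", "jelly"},
    {"bread", "jelly", "milk"},
]

with_bread = [b for b in baskets if "bread" in b]
support = sum({"bread", "jelly"} <= b for b in baskets) / len(baskets)
confidence = sum("jelly" in b for b in with_bread) / len(with_bread)
print(f"support={support:.2f}, confidence={confidence:.2f}")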
8) Sequence Discovery
• Sequential analysis, or sequence discovery, is used to determine
sequential patterns in data. These patterns are based on a time
sequence of actions.
• These patterns are similar to associations in that data (or events) are
found to be related, but the relationship is based on time. Unlike a
market basket analysis, which requires the items to be purchased at
the same time, in sequence discovery the items are purchased over
time in some order.
• The example below illustrates the discovery of some simple patterns. A
similar type of discovery can be seen in the sequence within which
items are purchased.
Example
• The Webmaster at the XYZ Corp. periodically analyzes the Web log
data to determine how users of the XYZ's Web pages access them. He
is interested in determining what sequences of pages are frequently
accessed. He determines that 70 percent of the users of page A follow
one of the following patterns of behavior: (A, B, C) or (A, D, B, C) or (A,
E, B, C). He then determines to add a link directly from page A to page
C.
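• A minimal sketch of how such frequent sequences might be counted from Web-log sessions; the sessions here are invented.

from collections import Counter

sessions = [
    ("A", "B", "C"), ("A", "D", "B", "C"), ("A", "E", "B", "C"),
    ("A", "B", "C"), ("F", "A", "C"), ("A", "B"),
]

# Restrict attention to sessions that begin at page A, then tally patterns.
starts_at_a = [s for s in sessions if s[0] == "A"]
for pattern, n in Counter(starts_at_a).most_common():
    print(pattern, f"- {n / len(starts_at_a):.0%} of A-sessions")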
The End

Q & A
