Lectures 1 and 2 - Data Analysis in Management - MBM
Data mining
Application of selected statistical multivariate
methods and data mining techniques
What is statistical analysis and why is it increasingly
important?
The information available for decision making has exploded
in recent years, and will continue to do so in the future,
probably even faster.
Multivariate analysis in statistical terms
Multivariate analysis techniques are popular because
they enable organizations to create knowledge and thereby
improve their decision making.
Nonmetric data
Metric data
Description
Researchers and analysts are simply trying to find ways to
describe patterns and trends lying within the data.
Descriptions of patterns and trends often suggest possible
explanations for such patterns and trends, as well as possible
recommendations for policy changes.
This description task can be accomplished capably with
exploratory data analysis (EDA), as we saw in earlier courses. The
description task may also be performed using descriptive
statistics.
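As a minimal sketch of the description task using descriptive statistics, the Python snippet below summarizes a small numeric variable. The variable name and the sales figures are hypothetical, chosen only for illustration.

```python
import statistics

# Hypothetical monthly sales figures (illustrative values, not real data).
monthly_sales = [120, 135, 130, 150, 145, 160]

# A few common descriptive statistics that summarize the variable.
summary = {
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "stdev": statistics.stdev(monthly_sales),  # sample standard deviation
    "min": min(monthly_sales),
    "max": max(monthly_sales),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```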
Data mining models should be as transparent as possible, that is,
the results of the data mining model should describe clear
patterns that are amenable to intuitive interpretation and
explanation.
WHAT TASKS CAN DATA MINING ACCOMPLISH?
Description
Some data mining methods are more suited to transparent
interpretation than others. For example, decision trees provide
an intuitive and human-friendly explanation of their results. On
the other hand, neural networks are comparatively opaque to
non-specialists, due to the nonlinearity and complexity of the
model.
Estimation
In estimation, we approximate the value of a numeric target
variable using a set of numeric and/or categorical predictor
variables.
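One simple way to carry out estimation is least-squares linear regression. The sketch below assumes a single numeric predictor, which is a simplification of the general case described above. The function name fit_line and the experience/salary data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience and salary (in thousands).
experience = [1, 2, 3, 4, 5]
salary = [32, 34, 36, 38, 40]

slope, intercept = fit_line(experience, salary)

# Estimate the numeric target for a new record with 6 years of experience.
estimate = intercept + slope * 6
print(f"Estimated salary: {estimate:.1f}")
```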
Prediction
Prediction is similar to classification and estimation, except that
for prediction, the results lie in the future. Examples of prediction
tasks in business and research include:
Predicting the price of a stock 3 months into the future.
Predicting the percentage increase in traffic deaths next year if the
speed limit is increased.
Classification
Classification is similar to estimation, except that the target
variable is categorical rather than numeric.
In classification, there is a target categorical variable, such as
income bracket, which, for example, could be partitioned into
three classes or categories: high income, middle income, and low
income.
The data mining model examines a large set of records, each
record containing information on the target variable as well as a
set of input or predictor variables.
Suppose the researcher would like to be able to classify the
income bracket of new individuals, not currently in the above
database, based on the other characteristics associated with that
individual, such as age, gender, and occupation.
This task is a classification task, very nicely suited to data mining
methods and techniques.
Examples of classification tasks in business and research include:
Determining whether a particular credit card transaction is fraudulent;
Placing a new student into a particular track with regard to special
needs;
Assessing whether a mortgage application is a good or bad credit risk;
Diagnosing whether a particular disease is present.
Graphs and plots are helpful for understanding two- and three-
dimensional relationships in data.
Common data mining methods used for classification are the
k-nearest neighbour algorithm, classification and regression trees,
and neural networks.
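To make the k-nearest neighbour idea concrete, here is a minimal sketch in Python. It assumes Euclidean distance and a simple majority vote. The function name knn_classify, the two predictors, and the training records are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training records nearest to query."""
    neighbours = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, years of education) -> income bracket.
train = [
    ((25, 10), "low"),
    ((27, 11), "low"),
    ((30, 12), "low"),
    ((45, 16), "high"),
    ((50, 18), "high"),
    ((48, 17), "high"),
]

# Classify two new individuals who are not in the training data.
print(knn_classify(train, (47, 16)))
print(knn_classify(train, (26, 10)))
```

In practice the predictors would usually be rescaled first, so that a variable with a large numeric range does not dominate the distance.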
Clustering
Clustering refers to the grouping of records, observations, or
cases into classes of similar objects.
A cluster is a collection of records that are similar to one another,
and dissimilar to records in other clusters.
Clustering differs from classification in that there is no target
variable for clustering.
The clustering task does not try to classify, estimate, or predict
the value of a target variable.
Instead, clustering algorithms seek to segment the whole data set
into relatively homogeneous subgroups or clusters, where the
similarity of the records within the cluster is maximized, and the
similarity to records outside of this cluster is minimized.
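The slides do not name a specific clustering algorithm. As one possible illustration, the sketch below uses k-means, which alternates between assigning each record to its nearest centroid and recomputing the centroids. The customer data and the starting centroids are hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid-update steps."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (store visits per month, average basket size).
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
initial_centroids = [(1, 1), (8, 8)]  # assumed starting points

centroids, clusters = k_means(customers, initial_centroids)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Note that no target variable appears anywhere in the code: the groups emerge only from the similarity of the records.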
Examples of clustering tasks in business and research include:
Target marketing of a niche product for a small-cap business which does
not have a large marketing budget,
Segmenting financial behavior into benign and suspicious categories
for accounting audit purposes,
As a dimension-reduction tool when the data set has hundreds of
attributes,
For gene expression clustering, where very large quantities of genes may
exhibit similar behavior.
Clustering is often performed as a preliminary step in a data
mining process, with the resulting clusters being used as further
inputs into a different technique downstream, such as neural
networks.
Association
The association task for data mining is the job of finding which
attributes “go together.”
Most prevalent in the business world, where it is known as
affinity analysis or market basket analysis, the task of association
seeks to uncover rules for quantifying the relationship between
two or more attributes.
Examples of association tasks in business and research include:
Investigating the proportion of subscribers to your company’s cell phone
plan that respond positively to an offer of a service upgrade.
Examining the proportion of children whose parents read to them who are
themselves good readers.
Predicting degradation in telecommunications networks.
Finding out which items in a supermarket are purchased together, and which
items are never purchased together.
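Association rules are commonly quantified with two measures, support and confidence. These terms are not introduced in the slides and are added here as standard terminology. The sketch below computes both for a hypothetical set of supermarket baskets.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets (illustrative only).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

For the rule "if bread, then milk", support measures how often the two items appear together across all baskets, while confidence measures how often milk appears in the baskets that contain bread.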
A classification of multivariate techniques
Types of multivariate methods including data mining
techniques
5. Logistic regression.
6. Canonical correlation analysis.
8. Conjoint analysis.
9. Multidimensional scaling.
14. Others.
Types of multivariate and data mining techniques
Correspondence Analysis
Correspondence analysis is a recently developed
interdependence technique that facilitates the perceptual
mapping of objects (e.g., products, persons) on a set of
nonmetric attributes.
Researchers are constantly faced with the need to “quantify
the qualitative data” found in nominal variables.
Correspondence analysis differs from the other
interdependence techniques in its ability to accommodate
both nonmetric data and nonlinear relationships.
In its most basic form, correspondence analysis
employs a contingency table, which is the cross-
tabulation of two categorical variables.
It then transforms the nonmetric data to a metric level and
performs dimensional reduction and perceptual mapping.
Correspondence analysis provides a multivariate
representation of interdependence for nonmetric data that is
not possible with other methods.
As an example, respondents’ brand preferences can be cross-
tabulated on demographic variables (e.g., gender, income
categories, occupation) by indicating how many people
preferring each brand fall into each category of the
demographic variables.
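The cross-tabulation that correspondence analysis starts from can be sketched as follows. This covers only the contingency table and its row profiles, not the full method with dimensional reduction and perceptual mapping. The brands, income categories, and responses are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses: (preferred brand, income category).
responses = [
    ("A", "low"), ("A", "low"), ("A", "high"),
    ("B", "high"), ("B", "high"), ("B", "high"), ("B", "low"),
]

counts = Counter(responses)
brands = sorted({brand for brand, _ in responses})
categories = sorted({category for _, category in responses})

# Contingency table: rows are brands, columns are income categories.
table = [[counts[(b, c)] for c in categories] for b in brands]

# Row profiles: each row expressed as proportions of its row total.
row_profiles = [[cell / sum(row) for cell in row] for row in table]

print("columns:", categories)
for brand, row, profile in zip(brands, table, row_profiles):
    print(brand, row, [round(p, 2) for p in profile])
```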
Cluster Analysis
Discriminant analysis
Discriminant analysis is a multivariate statistical
technique used for classifying a set of observations
into predefined groups.
Discriminant analysis is a set of methods and tools
used to distinguish between groups of populations
and to determine how to allocate new observations
into groups.
Discriminant analysis finds a set of prediction
equations based on independent variables that are
then used to classify individuals into groups.
NOTE: The point C on the discriminant scale is the so-called cutting score.
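A minimal sketch of how a cutting score can be used for classification is given below. It assumes two groups of equal size, so the cutting score is taken as the midpoint between the two group mean discriminant scores. The weights, intercept, and observations are hypothetical and are not estimated from real data.

```python
def discriminant_score(weights, intercept, features):
    """Linear discriminant score Z = intercept + sum(weight * feature)."""
    return intercept + sum(w * x for w, x in zip(weights, features))

# Hypothetical discriminant function (assumed, not fitted here).
weights = [0.5, 1.0]
intercept = -4.0

# Hypothetical observations with known group membership (equal group sizes).
group_a = [(1, 1), (2, 1), (1, 2)]
group_b = [(4, 4), (5, 4), (4, 5)]

def mean_score(group):
    scores = [discriminant_score(weights, intercept, obs) for obs in group]
    return sum(scores) / len(scores)

# Cutting score C: midpoint of the two group mean scores (equal-size assumption).
cutting_score = (mean_score(group_a) + mean_score(group_b)) / 2

def classify(features):
    """Allocate a new observation to group A or B by comparing Z with C."""
    z = discriminant_score(weights, intercept, features)
    return "A" if z < cutting_score else "B"

print(cutting_score, classify((2, 2)), classify((4, 3)))
```

When the groups differ in size, a weighted cutting score is normally used instead of the simple midpoint.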