
Lectures 1 and 2 - Data Analysis in Management - MBM

The document provides an overview of data analysis and data mining techniques. It discusses statistical analysis, multivariate analysis, data mining tasks including description, estimation, prediction and classification. Measurement scales and their impact on analysis are also covered.

Data Analysis in Management

Programme: Modern Business Management


2022/2023

Roman Huptas, Department of Statistics,
Cracow University of Economics
Lectures 1 and 2

Introduction to multivariate methods of data
analysis and data mining
Outline of the lecture

 Short overview of multivariate methods

 Data mining
 Application of selected statistical multivariate
methods and data mining techniques

What is statistical analysis and why is it more
important?

 Today businesses must
 be more profitable,
 react more quickly,
 offer higher-quality products and services,
 and do it all with fewer people and at lower cost.

 An essential requirement in this process is effective
knowledge creation and management.

What is statistical analysis and why is it more
important?
 The information available for decision making has exploded
in recent years, and will continue to do so in the future,
probably even faster.

 Until recently, much of that information was either not
collected or discarded. Today this information is being
collected and stored in data warehouses, and it is
available to be “mined” for improved decision making.

 Some of that information can be analyzed and understood
with simple statistics, but much of it requires more
complex, multivariate statistical techniques to convert
these data into knowledge.
What is statistical analysis and why is it more
important?
 A number of technological advances help us to apply
multivariate techniques. Among the most important are the
developments in computer hardware and software.

 User-friendly software packages brought data analysis
into the point-and-click era, and we can quickly analyze
mountains of complex data with relative ease.

 Indeed, industry, government, and university-related research
centers throughout the world are making widespread
use of these techniques.
What is statistical analysis and why is it more
important?

 With the cheap and easy access afforded by personal
computers, modern data analysis has shifted to a different
paradigm.
 Rather than setting up a complete data analysis all at once,
the process has become highly interactive, with the output
from each stage serving as the input for the next stage.

 An example of a typical analysis is shown in the figure below:

What is statistical analysis and why is it more
important?

 Steps in a typical data analysis:

Multivariate analysis in statistical terms
 Multivariate analysis techniques are popular because
they enable organizations to create knowledge and thereby
improve their decision making.

Multivariate analysis refers to all statistical techniques
that simultaneously analyze multiple measurements
on individuals or objects under investigation.

Thus, any simultaneous analysis of two or more variables can
be loosely considered as multivariate analysis.
Multivariate analysis in statistical terms
Example:
 Suppose that a company markets two related products, say,
toothbrushes and toothpaste.
 The company’s marketing director may be interested in
analyzing consumers’ preferences for the two products. The
exact type of analysis may vary depending on what the
company needs to know.
 What distinguishes the analysis is that it should consider
people’s perceptions of both products jointly. Why?
 If the two products are related, it is likely that consumers’
perceptions of the two products will be correlated.
 Incorporating knowledge of such correlations in our analysis
makes the analysis more accurate and more meaningful.
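To make the idea of correlated perceptions concrete, here is a minimal sketch in Python. The ratings are hypothetical (invented for illustration) and the helper name pearson is our own; it computes the ordinary Pearson correlation coefficient between the two sets of ratings.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings (scale 1-10) given by the same six consumers.
toothbrush = [7, 8, 5, 9, 4, 6]
toothpaste = [6, 9, 5, 8, 3, 7]

r = pearson(toothbrush, toothpaste)
print(f"correlation between the two sets of ratings: {r:.2f}")
```

A value of r close to 1 would suggest that consumers who rate one product highly tend to rate the other highly as well, which is why the two products should be analyzed jointly.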
Multivariate analysis in statistical terms
Note!

 Recall that regression analysis and correlation analysis are
methods involving several variables.

 In a sense, they are multivariate methods even though,
strictly speaking, in regression analysis we make the
assumption that the independent variable or variables are
not random but are fixed quantities.
Some basic concepts of multivariate analysis
Measurement scales!

 Data analysis involves the identification and measurement of
variation in a set of variables, either among themselves or
between a dependent variable and one or more independent
variables.

 The researcher cannot identify variation unless it can be
measured.

 Measurement is important in accurately representing the
research concepts being studied and is instrumental in the
selection of the appropriate multivariate method of analysis.
Some basic concepts of multivariate analysis
Measurement scales!

 Data can be classified into one of two categories, nonmetric
(qualitative) and metric (quantitative), based on the type of
attributes or characteristics they represent.

 The researcher must define the measurement type for each
variable. To the computer, the values are only numbers.

 Whether data are metric or nonmetric substantially affects:
 what the data can represent,
 how they can be analyzed,
 and the appropriate multivariate techniques to use.
Some basic concepts of multivariate analysis
The impact of choice of measurement scale

Understanding the different types of measurement scales is
important for two reasons:

 The researcher must identify the measurement scale of each
variable used, so that nonmetric data are not incorrectly
used as metric data, and vice versa.

 The measurement scale is also critical in determining
which multivariate techniques are the most
applicable to the data. The metric or nonmetric
properties of independent and dependent variables are the
determining factors in selecting the appropriate technique.
Some basic concepts of multivariate analysis
NONMETRIC MEASUREMENT SCALES

Nonmetric data

 also called qualitative data, nominal data or ordinal data,

 these are attributes, characteristics, or categorical properties
that identify or describe a subject or object,

 describe differences in type or kind by indicating the presence
or absence of a characteristic or property.
Some basic concepts of multivariate analysis
METRIC MEASUREMENT SCALES

Metric data

 also called quantitative data, interval data, or ratio data,

 these measurements identify or describe subjects (or objects)
not only on the possession of an attribute but also by the
amount or degree to which the subject may be characterized
by the attribute.
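The point that "to the computer, the values are only numbers" can be illustrated with a short Python sketch. The three columns and their codings are hypothetical, and the mapping from scale to summary statistic (mode for nominal, median for ordinal, mean for metric) is a common convention rather than something prescribed in these slides.

```python
from statistics import mean, median_low, mode

# Hypothetical survey columns, all stored as plain numbers.
occupation_code = [1, 3, 3, 2, 3, 1]   # nominal: 1=clerk, 2=manager, 3=engineer
satisfaction = [2, 4, 5, 3, 4, 4]      # ordinal: 1=very low ... 5=very high
monthly_spend = [120.0, 80.5, 200.0, 95.0, 150.0, 110.5]  # metric (ratio scale)

# A summary statistic that is meaningful for each measurement scale.
SUMMARY_BY_SCALE = {
    "nominal": mode,        # most frequent category
    "ordinal": median_low,  # middle category, always an observed value
    "metric": mean,         # arithmetic mean
}

def summarize(values, scale):
    """Summarize a column using the statistic suited to its declared scale."""
    return SUMMARY_BY_SCALE[scale](values)

print("most common occupation code:", summarize(occupation_code, "nominal"))
print("median satisfaction level:", summarize(satisfaction, "ordinal"))
print("mean monthly spend:", summarize(monthly_spend, "metric"))

# The computer will also average the nominal codes without complaint,
# but the result does not describe any occupation.
print("mean occupation code (not meaningful):", mean(occupation_code))
```

The researcher, not the software, has to declare which scale each variable is measured on.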
Data mining
What is data mining?

Data mining is the process of discovering useful
patterns and trends in large data sets.

Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
using pattern recognition technologies as well as
statistical and mathematical techniques.
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 The following list shows the most common data
mining tasks:
 Description
 Estimation
 Prediction
 Classification
 Clustering
 Association
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Description
 Researchers and analysts are simply trying to find ways to
describe patterns and trends lying within the data.
 Descriptions of patterns and trends often suggest possible
explanations for such patterns and trends, as well as possible
recommendations for policy changes.
 This description task can be accomplished capably with
exploratory data analysis (EDA), as we saw in earlier courses. The
description task may also be performed using descriptive
statistics.
 Data mining models should be as transparent as possible, that is,
the results of the data mining model should describe clear
patterns that are amenable to intuitive interpretation and
explanation.
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Description
 Some data mining methods are more suited to transparent
interpretation than others. For example, decision trees provide
an intuitive and human-friendly explanation of their results. On
the other hand, neural networks are comparatively opaque to
non-specialists, due to the nonlinearity and complexity of the
model.
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Estimation
 In estimation, we approximate the value of a numeric target
variable using a set of numeric and/or categorical predictor
variables.

 The field of mathematical statistics supplies several venerable and
widely used estimation methods.

 These include point estimation and confidence interval
estimation, simple linear regression and correlation, and multiple
regression.
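As an illustration of the estimation task, the following Python sketch fits a simple linear regression by ordinary least squares and uses it to estimate a numeric target. The data (advertising spend and sales) and the function name fit_line are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend (predictor) and sales (numeric target).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(ad_spend, sales)

# Estimate the target for a new value of the predictor.
estimated_sales = intercept + slope * 6.0
print(f"estimated sales at spend 6.0: {estimated_sales:.2f}")
```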
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Prediction
 Prediction is similar to classification and estimation, except that
for prediction, the results lie in the future. Examples of prediction
tasks in business and research include:
 Predicting the price of a stock 3 months into the future.
 Predicting the percentage increase in traffic deaths next year if the
speed limit is increased.

 Any of the methods and techniques used for classification and
estimation may also be used, under appropriate circumstances,
for prediction.

 These include the traditional statistical methods of simple linear
regression and multiple regression, as well as data mining and
knowledge discovery methods like k-nearest neighbor methods,
decision trees, and neural networks.
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Classification
 Classification is similar to estimation, except that the target
variable is categorical rather than numeric.
 In classification, there is a target categorical variable, such as
income bracket, which, for example, could be partitioned into
three classes or categories: high income, middle income, and low
income.
 The data mining model examines a large set of records, each
record containing information on the target variable as well as a
set of input or predictor variables.
 Suppose the researcher would like to be able to classify the
income bracket of new individuals, not currently in the above
database, based on the other characteristics associated with that
individual, such as age, gender, and occupation.
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Classification
 This task is a classification task, very nicely suited to data mining
methods and techniques.
 Examples of classification tasks in business and research include:
 Determining whether a particular credit card transaction is fraudulent;
 Placing a new student into a particular track with regard to special
needs;
 Assessing whether a mortgage application is a good or bad credit risk;
 Diagnosing whether a particular disease is present;
 Graphs and plots are helpful for understanding two- and three-
dimensional relationships in data.
 Common data mining methods used for classification are the
k-nearest neighbour algorithm, classification and regression trees,
and neural networks.
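The k-nearest neighbour idea can be sketched in a few lines of Python: a new record receives the class that is most common among the k training records closest to it. The training records (age, years of employment, credit risk class) are hypothetical, and Euclidean distance on unscaled features is a simplifying assumption; in practice the features would usually be standardized first.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda record: dist(record[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (age, years of employment) -> credit risk class.
train = [
    ((25, 1), "bad"), ((30, 2), "bad"), ((28, 1), "bad"),
    ((45, 15), "good"), ((50, 20), "good"), ((48, 18), "good"),
]

print(knn_classify(train, (47, 17)))  # new applicant resembling the second group
print(knn_classify(train, (27, 2)))   # new applicant resembling the first group
```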
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Clustering
 Clustering refers to the grouping of records, observations, or
cases into classes of similar objects.
 A cluster is a collection of records that are similar to one another,
and dissimilar to records in other clusters.
 Clustering differs from classification in that there is no target
variable for clustering.
 The clustering task does not try to classify, estimate, or predict
the value of a target variable.
 Instead, clustering algorithms seek to segment the whole data set
into relatively homogeneous subgroups or clusters, where the
similarity of the records within the cluster is maximized, and the
similarity to records outside of this cluster is minimized.
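A minimal sketch of one clustering algorithm, k-means, is given below in Python. The slides do not name a specific algorithm, so the choice of k-means is ours; the customer data are hypothetical, and starting from the first k records is a naive, deterministic initialization chosen only to keep the example short.

```python
from math import dist

def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, k, max_iter=50):
    """Basic k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(p) for p in points[:k]]  # naive start (assumption)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]  # keep the old centroid if a cluster is empty
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical customers: (price sensitivity, importance of food quality).
customers = [(1, 1), (9, 8), (1.5, 2), (8, 9), (2, 1.5), (9, 9)]

centroids, clusters = k_means(customers, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Note that no target variable appears anywhere in the code: the groups emerge only from the similarity of the records.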
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Clustering
 Examples of clustering tasks in business and research include:
 Target marketing of a niche product for a small-cap business which does
not have a large marketing budget,
 For accounting auditing purposes, to segment financial behavior into
benign and suspicious categories,
 As a dimension-reduction tool when the data set has hundreds of
attributes,
 For gene expression clustering, where very large quantities of genes may
exhibit similar behavior.
 Clustering is often performed as a preliminary step in a data
mining process, with the resulting clusters being used as further
inputs into a different technique downstream, such as neural
networks.
WHAT TASKS CAN DATA MINING ACCOMPLISH??

 Association
 The association task for data mining is the job of finding which
attributes “go together.”
 Most prevalent in the business world, where it is known as
affinity analysis or market basket analysis, the task of association
seeks to uncover rules for quantifying the relationship between
two or more attributes.
 Examples of association tasks in business and research include:
 Investigating the proportion of subscribers to your company’s cell phone
plan that respond positively to an offer of a service upgrade.
 Examining the proportion of children whose parents read to them who are
themselves good readers.
 Predicting degradation in telecommunications networks.
 Finding out which items in a supermarket are purchased together, and which
items are never purchased together.
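Association rules are usually quantified with two measures, support and confidence. The Python sketch below computes both for a hypothetical set of supermarket baskets; the function names and the data are illustrative only.

```python
def support(baskets, items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

print(f"support(bread, milk) = {support(baskets, ['bread', 'milk']):.2f}")
print(f"confidence(bread -> milk) = {confidence(baskets, ['bread'], ['milk']):.2f}")
```

A support of zero for a pair of items would correspond to items that are never purchased together.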
A classification of multivariate techniques

This classification is based on three judgments the researcher
must make about the research objective and nature of the data:
1. Can the variables be divided into independent and
dependent classifications based on some theory?
2. If they can, how many variables are treated as dependent
in a single analysis?
3. How are the variables, both dependent and independent,
measured?
Types of multivariate methods including data mining
techniques
Multivariate analysis is a set of techniques for data analysis that
encompasses a wide range of possible research situations.

The more established as well as emerging techniques include
but are not limited to the following:

1. Multiple regression and multiple correlation.

2. Multivariate analysis of variance and covariance.

3. Contingency table analysis.

4. Linear discriminant analysis.

5. Logistic regression.
Types of multivariate methods including data mining
techniques
6. Canonical correlation analysis.

7. Principal components and common factor analysis.

8. Conjoint analysis.

9. Multidimensional scaling.

10. Correspondence analysis.

11. Cluster analysis.

12. Classification and regression trees.

13. Neural networks.

14. Others.
Types of multivariate and data mining techniques

Correspondence Analysis
 Correspondence analysis is a recently developed
interdependence technique that facilitates the perceptual
mapping of objects (e.g., products, persons) on a set of
nonmetric attributes.
 Researchers are constantly faced with the need to “quantify
the qualitative data” found in nominal variables.
 Correspondence analysis differs from the other
interdependence techniques in its ability to accommodate
both nonmetric data and nonlinear relationships.
Types of multivariate and data mining techniques

Correspondence Analysis
 In its most basic form, correspondence analysis
employs a contingency table, which is the cross-
tabulation of two categorical variables.
 It then transforms the nonmetric data to a metric level,
performs dimensional reduction and perceptual mapping.
 Correspondence analysis provides a multivariate
representation of interdependence for nonmetric data that is
not possible with other methods.
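The starting point of correspondence analysis, the contingency table, can be built as in the Python sketch below. The responses are hypothetical and the sketch covers only the cross-tabulation and the row profiles (row-wise proportions); the subsequent dimensional reduction and perceptual mapping are not shown.

```python
from collections import Counter

# Hypothetical respondents: (preferred brand, income category).
responses = [
    ("A", "low"), ("A", "low"), ("A", "middle"),
    ("B", "middle"), ("B", "high"), ("B", "high"),
    ("C", "low"), ("C", "high"),
]

def cross_tabulate(pairs):
    """Contingency table of two categorical variables as (rows, cols, counts)."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

def row_profiles(table):
    """Convert each row of counts into proportions that sum to 1."""
    return [[cell / sum(row) for cell in row] for row in table]

brands, income_levels, table = cross_tabulate(responses)
print("columns:", income_levels)
for brand, counts, profile in zip(brands, table, row_profiles(table)):
    print(brand, counts, [round(p, 2) for p in profile])
```

Brands with similar row profiles would end up close to one another in the resulting map.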
Types of multivariate and data mining techniques

Correspondence Analysis
 As an example, respondents’ brand preferences can be cross-
tabulated on demographic variables (e.g., gender, income
categories, occupation) by indicating how many people
preferring each brand fall into each category of the
demographic variables.

 Through correspondence analysis, the association, or
“correspondence,” of brands and the distinguishing
characteristics of those preferring each brand are then shown
in a two- or three-dimensional map of both brands and
respondent characteristics.
Types of multivariate and data mining techniques

Correspondence Analysis

 Brands perceived as similar are located close to one another.

 Likewise, the most distinguishing characteristics of
respondents preferring each brand are also determined by the
proximity of the demographic variable categories to the
brand’s position.
Types of multivariate and data mining techniques

Cluster Analysis

 Cluster analysis is an analytical technique for developing
meaningful subgroups of individuals or objects.

 Specifically, the objective is to classify a sample of entities
(individuals or objects) into a small number of mutually
exclusive groups based on the similarities among the entities.

 In cluster analysis (unlike discriminant analysis) the
groups are not predefined!

 Instead, the technique is used to identify the groups.
Types of multivariate and data mining techniques

Cluster Analysis

 Cluster analysis usually involves at least three steps.


 The first is the measurement of some form of similarity or
association among the entities to determine how many groups
really exist in the sample.

 The second step is the actual clustering process, whereby
entities are partitioned into groups (clusters).
 The final step is to profile the persons or variables to
determine their composition.
Types of multivariate and data mining techniques
Cluster Analysis
As an example of cluster analysis:
 Let’s assume a restaurant owner wants to know whether
customers are patronizing the restaurant for different
reasons.
 Data could be collected on perceptions of pricing, food
quality, and so forth.
 Cluster analysis could be used to determine whether some
subgroups (clusters) are highly motivated by low prices versus
those who are much less motivated to come to the
restaurant based on price considerations.
Types of multivariate and data mining techniques
Classification and regression trees
What is a decision tree?
 A decision tree is a collection of decision nodes, connected
by branches, extending downward from the root node until
terminating in leaf nodes.
 Beginning at the root node, which by convention is placed at
the top of the decision tree diagram, attributes are tested at
the decision nodes, with each possible outcome resulting in a
branch.
 Each branch then leads either to another decision node or to
a terminating leaf node.
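The structure just described (root node, decision nodes, branches, leaf nodes) can be represented directly as a nested data structure. The Python sketch below uses a hypothetical credit-risk tree with made-up attributes and thresholds; it shows only how a record travels from the root to a leaf, not how the tree is learned.

```python
# Hypothetical tree: each decision node tests one attribute against a threshold;
# each leaf node carries the final class.
tree = {
    "attribute": "income", "threshold": 30000,            # root node
    "left": {"leaf": "bad risk"},                          # income <= 30000
    "right": {                                             # income > 30000
        "attribute": "savings", "threshold": 5000,         # decision node
        "left": {"leaf": "bad risk"},                      # savings <= 5000
        "right": {"leaf": "good risk"},                    # savings > 5000
    },
}

def classify(node, record):
    """Follow the branches from the root until a leaf node is reached."""
    while "leaf" not in node:
        goes_left = record[node["attribute"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["leaf"]

print(classify(tree, {"income": 45000, "savings": 12000}))
print(classify(tree, {"income": 20000, "savings": 50000}))
```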
Types of multivariate and data mining techniques
Classification and regression trees

 Classification and regression trees are machine learning
methods for constructing prediction models from data.
 The models are obtained by recursively partitioning the data
space and fitting a simple prediction model within each
partition.
 As a result, the partitioning can be represented graphically as
a decision tree.
Types of multivariate and data mining techniques
Classification and regression trees

 The decision trees produced by classification and regression
trees (CARTs) are strictly binary, containing exactly two
branches for each decision node.

 CARTs recursively partition the records in the training data
set into subsets of records with similar values for the target
attribute.
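One step of such a binary partition can be sketched in Python as a search for the threshold that makes the two resulting subsets as homogeneous as possible. The Gini index is used here as the impurity measure, which is a common choice for classification trees but an assumption on our part, since the slides do not name the splitting criterion. The data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure subset, larger for mixed subsets."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Find the threshold t (split: value <= t vs. value > t) with the
    lowest size-weighted Gini impurity. Returns (threshold, impurity)."""
    best = None
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Hypothetical training records: age and income bracket (target attribute).
ages = [22, 25, 30, 35, 40, 50]
brackets = ["low", "low", "low", "high", "high", "high"]

threshold, impurity = best_binary_split(ages, brackets)
print(f"split at age <= {threshold}, weighted Gini = {impurity:.2f}")
```

A full tree would apply the same search recursively to each of the two subsets.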
Types of multivariate and data mining techniques
Classification and regression trees
An example of a simple classification tree:
Types of multivariate and data mining techniques
Logistic Regression
 Logistic regression models, often referred to as logit analysis,
are a combination of multiple regression and multiple
discriminant analysis.
 This technique is similar to multiple regression analysis in that
one or more independent variables are used to predict a
single dependent variable.
 The dependent variable is nonmetric, as in discriminant
analysis.
 The nonmetric scale of the dependent variable requires
differences in the estimation method and assumptions about
the type of underlying distribution.
Types of multivariate and data mining techniques
Logistic Regression
 Logistic regression models are distinguished from discriminant
analysis primarily in that they accommodate all types of
independent variables (metric and nonmetric) and do not
require the assumption of multivariate normality.
---------------------------------------------------------------------------------------
 Assume financial advisors were trying to develop a means of
selecting emerging firms for start-up investment.
 They reviewed past records and placed firms into one of two
classes: successful over a five-year period, and unsuccessful
after five years.
 Use a logistic regression to identify those financial and
managerial data that best differentiated between the
successful and unsuccessful firms.
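A minimal Python sketch of this kind of model is given below, with a single hypothetical predictor (a financial ratio) and a binary outcome (1 = successful, 0 = unsuccessful). The coefficients are estimated by plain gradient descent on the log-likelihood, which is chosen here for brevity; statistical packages typically use other numerical methods. The data, learning rate and number of iterations are all illustrative assumptions.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(x, y, learning_rate=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for xi, yi in zip(x, y):
            error = sigmoid(b0 + b1 * xi) - yi  # predicted probability minus outcome
            grad0 += error
            grad1 += error * xi
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1

# Hypothetical firms: a financial ratio and whether the firm was successful.
ratio = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
success = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(ratio, success)
for value in (1.0, 3.5):
    print(f"ratio {value}: estimated P(success) = {sigmoid(b0 + b1 * value):.2f}")
```

The dependent variable is nonmetric (two classes), yet the model output is a probability, which can then be compared with a cut-off such as 0.5 to classify a new firm.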
Types of multivariate and data mining techniques

Discriminant analysis
 Discriminant analysis is a multivariate statistical
technique used for classifying a set of observations
into predefined groups.
 Discriminant analysis is a set of methods and tools
used to distinguish between groups of populations
and to determine how to allocate new observations
into groups.
Types of multivariate and data mining techniques

Discriminant analysis
 Discriminant analysis finds a set of prediction
equations based on independent variables that are
then used to classify individuals into groups.
Types of multivariate and data mining techniques

Discriminant analysis – visualization

 The figure below shows a discriminant function for separation between two groups.

NOTE!
The point C on the
discriminant scale is the
so-called cutting score.
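The role of the cutting score can be illustrated with a small Python sketch of Fisher's linear discriminant for two groups measured on two variables. The observations are hypothetical, and the cutting score is taken as the midpoint between the two group mean scores, which is a simplification appropriate for groups of equal size.

```python
def mean_vector(rows):
    """Mean of each of the two variables in a group."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_scatter(group_a, group_b):
    """Within-group sums of squares and cross-products, pooled over both groups."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s

def fisher_discriminant(group_a, group_b):
    """Return the discriminant weights w and the midpoint cutting score."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    s = pooled_scatter(group_a, group_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    score_a = w[0] * ma[0] + w[1] * ma[1]
    score_b = w[0] * mb[0] + w[1] * mb[1]
    return w, (score_a + score_b) / 2  # equal group sizes assumed

def classify(x, w, cutting_score):
    """Allocate a new observation by comparing its score with the cutting score."""
    score = w[0] * x[0] + w[1] * x[1]
    return "A" if score > cutting_score else "B"

# Hypothetical observations on two variables for two predefined groups.
group_a = [(4, 2), (5, 3), (6, 3), (5, 4)]
group_b = [(1, 6), (2, 7), (3, 6), (2, 8)]

w, cutting_score = fisher_discriminant(group_a, group_b)
print("weights:", w, "cutting score:", cutting_score)
print(classify((5, 2), w, cutting_score), classify((1, 7), w, cutting_score))
```

Observations whose discriminant score lies above the cutting score are allocated to group A, the others to group B.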
