Lectures 1 and 2 - Data Analysis in Management - MBM

The document provides an overview of data analysis and data mining techniques. It discusses statistical analysis, multivariate analysis, data mining tasks including description, estimation, prediction and classification. Measurement scales and their impact on analysis are also covered.

Data Analysis in Management

Programme: Modern Business Management

2022/2023

Roman Huptas, Department of Statistics, Cracow University of Economics

Lectures 1 and 2

Introduction to multivariate methods of data analysis and data mining

Outline of the lecture

 Short overview of multivariate methods
 Data mining
 Application of selected statistical multivariate
methods and data mining techniques

What is statistical analysis and why is it more important?

 Today businesses must:
 be more profitable,
 react quicker,
 offer higher-quality products and services,
 and do it all with fewer people and at lower cost.

 An essential requirement in this process is effective
knowledge creation and management.

 The information available for decision making has exploded
in recent years, and will continue to do so in the future,
probably even faster.

 Until recently, much of that information was either not
collected or discarded. Today this information is being
collected and stored in data warehouses, and it is
available to be “mined” for improved decision making.

 Some of that information can be analyzed and understood
with simple statistics, but much of it requires more
complex, multivariate statistical techniques to convert
these data into knowledge.

 A number of technological advances help us to apply
multivariate techniques. Among the most important are the
developments in computer hardware and software.

 User-friendly software packages have brought data analysis
into the point-and-click era, and we can quickly analyze
mountains of complex data with relative ease.

 Indeed, industry, government, and university-related research
centers throughout the world are making widespread
use of these techniques.

 With the cheap and easy access afforded by personal
computers, modern data analysis has shifted to a different
paradigm.
 Rather than being set up completely all at once, the
analysis has become highly interactive, with the output
from each stage serving as the input for the next stage.

 An example of a typical analysis is shown in the figure below:

[Figure: Steps in a typical data analysis]

Multivariate analysis in statistical terms
 Multivariate analysis techniques are popular because
they enable organizations to create knowledge and thereby
improve their decision making.

Multivariate analysis refers to all statistical techniques
that simultaneously analyze multiple measurements
on individuals or objects under investigation.

Thus, any simultaneous analysis of two or more variables can
be loosely considered as multivariate analysis.

Example:
 Suppose that a company markets two related products, say,
toothbrushes and toothpaste.
 The company’s marketing director may be interested in
analyzing consumers’ preferences for the two products. The
exact type of analysis may vary depending on what the
company needs to know.
 What distinguishes the analysis is that it should consider
people’s perceptions of both products jointly. Why?
 If the two products are related, it is likely that consumers’
perceptions of the two products will be correlated.
 Incorporating knowledge of such correlations in our analysis
makes the analysis more accurate and more meaningful.
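
The idea of correlated perceptions can be sketched in a few lines. The ratings below are hypothetical (invented for illustration, not taken from the lecture), and `pearson_r` is just a helper name chosen here; the sketch only shows how the strength of the joint movement of two sets of ratings could be quantified.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings (1-10) given by the same six consumers.
toothbrush = [7, 5, 8, 3, 6, 9]
toothpaste = [8, 4, 7, 2, 6, 9]

r = pearson_r(toothbrush, toothpaste)
print(f"correlation between the two ratings: {r:.2f}")
```

A coefficient close to 1 for such data would suggest that analyzing each product separately ignores information carried by the other.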
Note!

 Recall that regression analysis and correlation analysis are
methods involving several variables.

 In a sense, they are multivariate methods even though,
strictly speaking, in regression analysis we make the
assumption that the independent variable or variables are
not random but are fixed quantities.
Some basic concepts of multivariate analysis
Measurement scales!

 Data analysis involves the identification and measurement of
variation in a set of variables, either among themselves or
between a dependent variable and one or more independent
variables.

 The researcher cannot identify variation unless it can be
measured.

 Measurement is important in accurately representing the
research concepts being studied and is instrumental in the
selection of the appropriate multivariate method of analysis.
 Data can be classified into one of two categories, nonmetric
(qualitative) and metric (quantitative), based on the type of
attributes or characteristics they represent.

 The researcher must define the measurement type for each
variable. To the computer, the values are only numbers.

 Whether data are metric or nonmetric substantially affects:
 what the data can represent,
 how they can be analyzed,
 and the appropriate multivariate techniques to use.
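
A small sketch of why the measurement type has to be declared by the researcher. The variables and their coding are hypothetical: `region` is a nominal variable that merely happens to be stored as numbers, while `monthly_spend` is metric.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey extract: 'region' is nonmetric (nominal) and merely
# coded as numbers; 'monthly_spend' is metric (ratio scale).
region_labels = {1: "North", 2: "South", 3: "West"}
region = [1, 3, 3, 2, 3, 1]
monthly_spend = [120.0, 80.5, 95.0, 60.0, 110.5, 74.0]

# The computer will happily average the codes, but the result has no meaning.
meaningless = mean(region)

# Appropriate summaries: frequencies (mode) for nominal data, mean for metric data.
region_counts = Counter(region_labels[code] for code in region)
most_common_region = region_counts.most_common(1)[0][0]
average_spend = mean(monthly_spend)

print(f"mean of region codes (not interpretable): {meaningless:.2f}")
print(f"most frequent region: {most_common_region}")
print(f"average monthly spend: {average_spend:.2f}")
```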
The impact of choice of measurement scale

Understanding the different types of measurement scales is
important for two reasons:

 The researcher must identify the measurement scale of each
variable used, so that nonmetric data are not incorrectly
used as metric data, and vice versa.

 The measurement scale is also critical in determining
which multivariate techniques are the most
applicable to the data. The metric or nonmetric
properties of independent and dependent variables are the
determining factors in selecting the appropriate technique.
NONMETRIC MEASUREMENT SCALES

Nonmetric data

 also called qualitative data; measured on nominal or ordinal scales,

 these are attributes, characteristics, or categorical properties
that identify or describe a subject or object,

 they describe differences in type or kind by indicating the presence
or absence of a characteristic or property.
METRIC MEASUREMENT SCALES

Metric data

 also called quantitative data; measured on interval or ratio scales,

 these measurements identify or describe subjects (or objects)
not only on the possession of an attribute but also by the
amount or degree to which the subject may be characterized
by the attribute.
Data mining
What is data mining?

Data mining is the process of discovering useful
patterns and trends in large data sets.

Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
using pattern recognition technologies as well as
statistical and mathematical techniques.
WHAT TASKS CAN DATA MINING ACCOMPLISH?

 The following list shows the most common data
mining tasks:
 Description
 Estimation
 Prediction
 Classification
 Clustering
 Association
 Description
 Researchers and analysts are simply trying to find ways to
describe patterns and trends lying within the data.
 Descriptions of patterns and trends often suggest possible
explanations for such patterns and trends, as well as possible
recommendations for policy changes.
 This description task can be accomplished capably with
exploratory data analysis (EDA), as we saw in earlier courses. The
description task may also be performed using descriptive
statistics.
 Data mining models should be as transparent as possible, that is,
the results of the data mining model should describe clear
patterns that are amenable to intuitive interpretation and
explanation.
 Some data mining methods are more suited to transparent
interpretation than others. For example, decision trees provide
an intuitive and human-friendly explanation of their results. On
the other hand, neural networks are comparatively opaque to
non-specialists, due to the nonlinearity and complexity of the
model.
 Estimation
 In estimation, we approximate the value of a numeric target
variable using a set of numeric and/or categorical predictor
variables.
 The field of mathematical statistics supplies several venerable and
widely used estimation methods.
 These include point estimation and confidence interval
estimation, simple linear regression and correlation, and multiple
regression.
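
As an illustration of estimation with simple linear regression, the following sketch fits a least-squares line and uses it to estimate a numeric target. The data and the names `advertising`, `sales` and `fit_simple_regression` are assumptions made for this example only.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend (thousand EUR) and sales (thousand units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_simple_regression(advertising, sales)
estimate = intercept + slope * 6.0  # estimated sales at a spend of 6.0
print(f"sales = {intercept:.2f} + {slope:.2f} * advertising")
print(f"estimated sales at advertising = 6.0: {estimate:.2f}")
```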
 Prediction
 Prediction is similar to classification and estimation, except that
for prediction, the results lie in the future. Examples of prediction
tasks in business and research include:
 Predicting the price of a stock 3 months into the future.
 Predicting the percentage increase in traffic deaths next year if the
speed limit is increased.
 Any of the methods and techniques used for classification and
estimation may also be used, under appropriate circumstances, for
prediction.
 These include the traditional statistical methods of simple linear
regression and multiple regression, as well as data mining and
knowledge discovery methods like k-nearest neighbor methods,
decision trees, and neural networks.
 Classification
 Classification is similar to estimation, except that the target
variable is categorical rather than numeric.
 In classification, there is a target categorical variable, such as
income bracket, which, for example, could be partitioned into
three classes or categories: high income, middle income, and low
income.
 The data mining model examines a large set of records, each
record containing information on the target variable as well as a
set of input or predictor variables.
 Suppose the researcher would like to be able to classify the
income bracket of new individuals, not currently in the above
database, based on the other characteristics associated with that
individual, such as age, gender, and occupation.
 This task is a classification task, very nicely suited to data mining
methods and techniques.
 Examples of classification tasks in business and research include:
 Determining whether a particular credit card transaction is fraudulent,
 Placing a new student into a particular track with regard to special
needs,
 Assessing whether a mortgage application is a good or bad credit risk,
 Diagnosing whether a particular disease is present.
 Graphs and plots are helpful for understanding two- and
three-dimensional relationships in data.
 Common data mining methods used for classification are the
k-nearest neighbour algorithm, classification and regression trees,
and neural networks.
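
A minimal sketch of the k-nearest neighbour idea applied to the income bracket example. The training records, the two predictors (age, years of education) and the function name `knn_classify` are hypothetical; in practice the predictors would usually be standardized before distances are computed.

```python
from collections import Counter
from math import dist

def knn_classify(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training records.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda rec: dist(rec[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, years of education) -> income bracket.
train = [
    ((25, 12), "low"), ((30, 12), "low"), ((28, 14), "low"),
    ((40, 16), "middle"), ((45, 16), "middle"), ((38, 17), "middle"),
    ((50, 20), "high"), ((55, 21), "high"), ((52, 19), "high"),
]

print(knn_classify(train, (42, 16)))
print(knn_classify(train, (27, 13)))
```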
 Clustering
 Clustering refers to the grouping of records, observations, or
cases into classes of similar objects.
 A cluster is a collection of records that are similar to one another,
and dissimilar to records in other clusters.
 Clustering differs from classification in that there is no target
variable for clustering.
 The clustering task does not try to classify, estimate, or predict
the value of a target variable.
 Instead, clustering algorithms seek to segment the whole data set
into relatively homogeneous subgroups or clusters, where the
similarity of the records within the cluster is maximized, and the
similarity to records outside of this cluster is minimized.
 Examples of clustering tasks in business and research include:
 Target marketing of a niche product for a small-cap business that does
not have a large marketing budget,
 Segmenting financial behavior into benign and suspicious categories
for accounting audit purposes,
 Reducing dimensionality when the data set has hundreds of
attributes,
 Clustering gene expression data, where very large quantities of genes
may exhibit similar behavior.
 Clustering is often performed as a preliminary step in a data
mining process, with the resulting clusters being used as further
inputs into a different technique downstream, such as neural
networks.
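
The segmentation into homogeneous subgroups can be sketched with k-means, one common clustering algorithm (the lecture does not prescribe a specific one). The customer data, the starting centroids and the function name `k_means` are assumptions for illustration.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Basic k-means with given starting centroids (iterations >= 1)."""
    for _ in range(iterations):
        # Assignment step: attach each record to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical customers: (visits per month, average spend in tens of EUR).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])

for cluster, centre in zip(clusters, centroids):
    print(f"centroid {centre}: {cluster}")
```

Note that no target variable appears anywhere in the code: the groups emerge only from the distances between records.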
 Association
 The association task for data mining is the job of finding which
attributes “go together.”
 Most prevalent in the business world, where it is known as
affinity analysis or market basket analysis, the task of association
seeks to uncover rules for quantifying the relationship between
two or more attributes.
 Examples of association tasks in business and research include:
 Investigating the proportion of subscribers to your company’s cell phone
plan that respond positively to an offer of a service upgrade.
 Examining the proportion of children whose parents read to them who are
themselves good readers.
 Predicting degradation in telecommunications networks.
 Finding out which items in a supermarket are purchased together, and which
items are never purchased together.
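
Rules that quantify which items "go together" are usually described by their support and confidence. The sketch below computes both for a hypothetical set of supermarket baskets; the basket contents and the function names are invented for illustration.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(f"support(bread, butter) = {rule_support:.2f}")
print(f"confidence(bread -> butter) = {rule_confidence:.2f}")
```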
A classification of multivariate techniques

This classification is based on three judgments the researcher
must make about the research objective and nature of the data:
1. Can the variables be divided into independent and
dependent classifications based on some theory?
2. If they can, how many variables are treated as dependent
in a single analysis?
3. How are the variables, both dependent and independent,
measured?
Types of multivariate methods including data mining techniques

Multivariate analysis is a set of techniques for data analysis that
encompasses a wide range of possible research situations.

The more established as well as emerging techniques include
but are not limited to the following:

1. Multiple regression and multiple correlation.
2. Multivariate analysis of variance and covariance.
3. Contingency table analysis.
4. Linear discriminant analysis.
5. Logistic regression.
6. Canonical correlation analysis.
7. Principal components and common factor analysis.
8. Conjoint analysis.
9. Multidimensional scaling.
10. Correspondence analysis.
11. Cluster analysis.
12. Classification and regression trees.
13. Neural networks.
14. Others.

Types of multivariate and data mining techniques

Correspondence Analysis
 Correspondence analysis is a recently developed
interdependence technique that facilitates the perceptual
mapping of objects (e.g., products, persons) on a set of
nonmetric attributes.
 Researchers are constantly faced with the need to “quantify
the qualitative data” found in nominal variables.
 Correspondence analysis differs from the other
interdependence techniques in its ability to accommodate
both nonmetric data and nonlinear relationships.
 In its most basic form, correspondence analysis
employs a contingency table, which is the
cross-tabulation of two categorical variables.
 It then transforms the nonmetric data to a metric level,
performs dimensional reduction and perceptual mapping.
 Correspondence analysis provides a multivariate
representation of interdependence for nonmetric data that is
not possible with other methods.
 As an example, respondents’ brand preferences can be
cross-tabulated on demographic variables (e.g., gender, income
categories, occupation) by indicating how many people
preferring each brand fall into each category of the
demographic variables.

 Through correspondence analysis, the association, or
“correspondence,” of brands and the distinguishing
characteristics of those preferring each brand are then shown
in a two- or three-dimensional map of both brands and
respondent characteristics.
 Brands perceived as similar are located close to one another.

 Likewise, the most distinguishing characteristics of
respondents preferring each brand are also determined by the
proximity of the demographic variable categories to the
brand’s position.
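
The starting point of correspondence analysis, the contingency table, can be sketched as follows. The survey responses are hypothetical, and the code stops at the table and its row profiles; the dimensional reduction and mapping steps of a full correspondence analysis are not shown.

```python
from collections import Counter

# Hypothetical survey responses: (preferred brand, income category).
responses = [
    ("Brand A", "low"), ("Brand A", "low"), ("Brand A", "middle"),
    ("Brand B", "middle"), ("Brand B", "high"), ("Brand B", "high"),
    ("Brand C", "low"), ("Brand C", "high"),
]

counts = Counter(responses)
brands = sorted({brand for brand, _ in responses})
incomes = ["low", "middle", "high"]

# Contingency table: rows are brands, columns are income categories.
table = {b: [counts[(b, inc)] for inc in incomes] for b in brands}

# Row profiles: each row divided by its total, so brands can be compared
# regardless of how many respondents prefer them.
row_profiles = {b: [c / sum(row) for c in row] for b, row in table.items()}

for b in brands:
    print(b, table[b], [round(p, 2) for p in row_profiles[b]])
```

Brands with similar row profiles would be expected to lie close to one another on the resulting map.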
Cluster Analysis

 Cluster analysis is an analytical technique for developing
meaningful subgroups of individuals or objects.

 Specifically, the objective is to classify a sample of entities
(individuals or objects) into a small number of mutually
exclusive groups based on the similarities among the entities.

 In cluster analysis (unlike discriminant analysis) the
groups are not predefined!

 Instead, the technique is used to identify the groups.

 Cluster analysis usually involves at least three steps.

 The first is the measurement of some form of similarity or
association among the entities to determine how many groups
really exist in the sample.

 The second step is the actual clustering process, whereby
entities are partitioned into groups (clusters).

 The final step is to profile the persons or variables to
determine their composition.
As an example of cluster analysis:
 Let’s assume a restaurant owner wants to know whether
customers are patronizing the restaurant for different
reasons.
 Data could be collected on perceptions of pricing, food
quality, and so forth.
 Cluster analysis could be used to determine whether some
subgroups (clusters) are highly motivated by low prices versus
those who are much less motivated to come to the
restaurant based on price considerations.
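
The first step, measuring similarity among the entities, might look like this for the restaurant example. The customer names and their 1-7 ratings are hypothetical, and Euclidean distance is only one of several possible (dis)similarity measures.

```python
from math import dist

# Hypothetical 1-7 ratings per customer:
# (importance of low price, importance of food quality).
customers = {"Anna": (7, 3), "Ben": (6, 2), "Carla": (2, 6), "Dan": (1, 6)}

# Pairwise Euclidean distances: a small distance means similar motivations.
names = list(customers)
distances = {
    (a, b): dist(customers[a], customers[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

closest_pair = min(distances, key=distances.get)
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 2))
```

Customers separated by small distances are candidates for the same cluster in the second step.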
Classification and regression trees
What is a decision tree?
 A decision tree is a collection of decision nodes, connected
by branches, extending downward from the root node until
terminating in leaf nodes.
 Beginning at the root node, which by convention is placed at
the top of the decision tree diagram, attributes are tested at
the decision nodes, with each possible outcome resulting in a
branch.
 Each branch then leads either to another decision node or to
a terminating leaf node.
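
The structure described above can be sketched as nested dictionaries. The tree below (attributes `savings` and `assets`, classes "good risk" and "bad risk") is a made-up example, not a tree from the lecture.

```python
# A hypothetical decision tree: decision nodes test an attribute and hold
# one branch per outcome; leaf nodes hold a class label.
tree = {
    "attribute": "savings",
    "branches": {
        "low": {
            "attribute": "assets",
            "branches": {"low": {"leaf": "bad risk"}, "high": {"leaf": "good risk"}},
        },
        "high": {"leaf": "good risk"},
    },
}

def classify(node, record):
    """Walk from the root down to a leaf, following the branch matching each test."""
    while "leaf" not in node:
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node["leaf"]

print(classify(tree, {"savings": "low", "assets": "high"}))
print(classify(tree, {"savings": "low", "assets": "low"}))
```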
 Classification and regression trees are machine learning
methods for constructing prediction models from data.
 The models are obtained by recursively partitioning the data
space and fitting a simple prediction model within each
partition.
 As a result, the partitioning can be represented graphically as
a decision tree.
 The decision trees produced by classification and regression
trees (CARTs) are strictly binary, containing exactly two
branches for each decision node.

 CARTs recursively partition the records in the training data
set into subsets of records with similar values for the target
attribute.
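
One binary partitioning step can be sketched as a search for the threshold that makes the two resulting subsets as pure as possible. The Gini impurity used here is an assumption (it is a criterion commonly associated with CART, but the slides do not name one), and the data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Threshold on one numeric attribute with the lowest weighted Gini impurity."""
    best = None
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical training records: income (thousand EUR) and credit risk.
income = [20, 25, 30, 45, 50, 60]
risk = ["bad", "bad", "bad", "good", "good", "good"]

threshold, impurity = best_binary_split(income, risk)
print(f"split at income <= {threshold}, weighted impurity {impurity:.2f}")
```

A full tree would apply the same search recursively to each of the two subsets.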
[Figure: An example of a simple classification tree]
Logistic Regression
 Logistic regression models, often referred to as logit analysis,
are a combination of multiple regression and multiple
discriminant analysis.
 This technique is similar to multiple regression analysis in that
one or more independent variables are used to predict a
single dependent variable.
 The dependent variable is nonmetric, as in discriminant
analysis.
 The nonmetric scale of the dependent variable requires
differences in the estimation method and assumptions about
the type of underlying distribution.
 Logistic regression models are distinguished from discriminant
analysis primarily in that they accommodate all types of
independent variables (metric and nonmetric) and do not
require the assumption of multivariate normality.

Example:
 Assume financial advisors are trying to develop a means of
selecting emerging firms for start-up investment.
 They review past records and place firms into one of two
classes: successful over a five-year period, and unsuccessful
after five years.
 Logistic regression can then be used to identify the financial and
managerial data that best differentiate between the
successful and unsuccessful firms.
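
How a fitted logistic regression turns firm characteristics into a class can be sketched as follows. The two predictors and all coefficient values are hypothetical (they are not estimated from any data), and the 0.5 cut-off is a common default rather than a rule given in the lecture.

```python
from math import exp

def success_probability(features, coefficients, intercept):
    """Logistic model: P(success) = 1 / (1 + exp(-(b0 + b1*x1 + ...)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + exp(-z))

# Hypothetical coefficients for (liquidity ratio, founders' years of experience).
coefficients = [1.2, 0.3]
intercept = -4.0

firm = [2.0, 5.0]
p = success_probability(firm, coefficients, intercept)
predicted_class = "successful" if p >= 0.5 else "unsuccessful"
print(f"estimated probability of success: {p:.3f} -> {predicted_class}")
```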

Discriminant analysis
 Discriminant analysis is a multivariate statistical
technique used for classifying a set of observations
into predefined groups.
 Discriminant analysis is a set of methods and tools
used to distinguish between groups of populations
and to determine how to allocate new observations
into groups.
 Discriminant analysis finds a set of prediction
equations based on independent variables that are
then used to classify individuals into groups.

Discriminant analysis – visualization

[Figure: A discriminant function for separation between two groups]

NOTE! The point C on the discriminant scale is the so-called cutting score.
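
A sketch of how a cutting score could be used to allocate new observations. It assumes two groups of equal size, for which the cutting score is commonly taken as the midpoint between the group centroids on the discriminant scale; the centroid values and function names are hypothetical.

```python
def cutting_score(centroid_a, centroid_b):
    """Midpoint between two group centroids (equal group sizes assumed)."""
    return (centroid_a + centroid_b) / 2

def allocate(z, centroid_a, centroid_b):
    """Assign a discriminant score z to the group on the same side of C."""
    c = cutting_score(centroid_a, centroid_b)
    if centroid_a < centroid_b:
        return "A" if z < c else "B"
    return "A" if z > c else "B"

# Hypothetical group centroids on the discriminant scale.
centroid_a, centroid_b = -1.2, 0.8

print(f"cutting score C = {cutting_score(centroid_a, centroid_b):.2f}")
print(allocate(0.1, centroid_a, centroid_b))
print(allocate(-0.5, centroid_a, centroid_b))
```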
