Chapter 4 Data Mining
Data Mining
Over the past few decades, technological advances have led to a dramatic increase in the
amount of recorded data. The increase in the use of data-mining techniques in business has
been caused largely by three events:
1. the explosion in the amount of data being produced and electronically tracked,
2. the ability to electronically warehouse these data, and
3. the affordability of computer power to analyze the data.
Observation – the set of recorded values of variables associated with a single entity.
– An observation is often displayed as a row of values in a spreadsheet or database in
which the columns correspond to the variables.
Example: In direct marketing data, an observation may correspond to a customer and
contain information regarding her response to an e-mail advertisement and her
demographic characteristics.
DATA SAMPLING
Sample – a sample is representative if the analyst can draw the same conclusions from it as
from the entire population of data.
• The sample of data must be large enough to contain significant information, yet small
enough to be manipulated quickly.
• Use enough data to eliminate any doubt about whether the sample size is sufficient.
• Do not carelessly discard variables from consideration. It is generally best to include
as many variables as possible in the sample.
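The idea of a representative sample can be illustrated with a small sketch. The population below is made-up data (hypothetical purchase amounts), and the sample size of 500 and the random seed are arbitrary choices for illustration only. The sketch draws a simple random sample and compares its mean with the population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 customer purchase amounts (made-up data).
population = [random.gauss(50, 12) for _ in range(10_000)]

# Draw a simple random sample without replacement.
sample = random.sample(population, 500)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

If the two means are close, the sample supports roughly the same conclusion about average purchase amount as the full population would. A real analysis would compare more than one statistic.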
Business Analytics 2nd Semester 2021-2022
DATA PREPARATION
The data in a data set are often said to be “dirty” and “raw” before they have been
preprocessed to put them into a form that is best suited for a data-mining algorithm. Data
preparation makes heavy use of descriptive statistics and data visualization methods to
gain an understanding of the data.
• A conservative approach is to create two data sets, one with and one without outliers, and
then construct a model on both data sets.
• If a model’s implications depend on the inclusion or exclusion of outliers, then one should
spend additional time to track down the cause of the outliers.
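The conservative approach above can be sketched as follows. The data are made up, the model is a simple least-squares line, and the 1.5 × IQR rule for flagging outliers is an assumption chosen for illustration; the notes do not prescribe a particular detection rule.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def iqr_bounds(values, k=1.5):
    """Fences at k * IQR beyond the quartiles (an assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up data: y is roughly 2 * x, with one suspicious entry at the end.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2, 17.9, 60.0]

# Data set 2: the same observations with flagged outliers removed.
low, high = iqr_bounds(ys)
kept = [(x, y) for x, y in zip(xs, ys) if low <= y <= high]

# Fit the same model on both data sets and compare.
slope_all, _ = fit_line(xs, ys)
slope_kept, _ = fit_line([x for x, _ in kept], [y for _, y in kept])

print(f"slope with outliers:    {slope_all:.2f}")
print(f"slope without outliers: {slope_kept:.2f}")
```

When the two slopes differ noticeably, as they do here, the model's implications depend on the outlier, which is the signal to investigate its cause.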
Variable Representation
Dimension reduction – the process of removing variables from the analysis without losing
any crucial information.
• Determining how to represent the measurements of the variables and which variables to
consider is a critical part of data mining. The treatment of categorical variables is
particularly important. Typically, it is best to encode categorical variables with 0–1 dummy
variables.
Example:
Consider a data set that contains a variable Language to track the language preference of
callers to a call center. The variable Language with the possible values of English, German,
and Spanish would be replaced with three binary variables called English, German, and
Spanish.
An entry of German would be captured using a 0 for the English dummy variable, a 1 for the
German dummy variable and a 0 for the Spanish dummy variable.
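The Language example can be written as a minimal sketch. The helper name dummy_encode and the five caller records are hypothetical and exist only for illustration.

```python
def dummy_encode(values, categories):
    """Return one 0-1 dummy variable (a list) per category."""
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Language preference of five hypothetical callers.
language = ["German", "English", "Spanish", "German", "English"]

dummies = dummy_encode(language, ["English", "German", "Spanish"])

for name, column in dummies.items():
    print(name, column)
# English [0, 1, 0, 0, 1]
# German [1, 0, 0, 1, 0]
# Spanish [0, 0, 1, 0, 0]
```

The first caller (German) gets 0 for English, 1 for German, and 0 for Spanish, matching the entry described above.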
• Using 0–1 dummy variables to encode categorical variables with many different categories
results in a large number of variables. In these cases, the use of PivotTables is helpful in
identifying categories that are similar and can possibly be combined to reduce the number
of 0–1 dummy variables.
Example:
Some categorical variables (zip code, product model number) may have many possible
categories such that, for the purpose of model building, there is no substantive difference
between multiple categories, and therefore the number of categories may be reduced by
combining categories.
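A PivotTable-style summary can be approximated in a few lines. This is only a sketch: the product model numbers, the response values, and the decision to merge models by product line (first letter) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (product model, 1 if the customer responded else 0).
records = [
    ("A100", 1), ("A100", 0), ("A100", 1), ("A100", 0),
    ("A200", 1), ("A200", 0),
    ("B100", 0), ("B100", 0), ("B100", 0), ("B100", 1),
    ("B200", 0), ("B200", 0), ("B200", 1), ("B200", 0),
]

# PivotTable-style summary: count and mean response per category.
totals = defaultdict(lambda: [0, 0])  # model -> [count, responses]
for model, responded in records:
    totals[model][0] += 1
    totals[model][1] += responded

summary = {m: (n, r / n) for m, (n, r) in totals.items()}
for model, (n, rate) in sorted(summary.items()):
    print(f"{model}: n={n}, response rate={rate:.2f}")

# Assumed analyst decision after reading the summary: models in the same
# product line respond alike, so merge them into one category per line.
combined = [(model[0], responded) for model, responded in records]
n_categories_before = len(summary)
n_categories_after = len({line for line, _ in combined})
print(n_categories_before, "categories reduced to", n_categories_after)
```

Merging four categories into two halves the number of 0–1 dummy variables needed for this variable.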
• Often data sets contain variables that, considered separately, are not particularly insightful
but that, when combined as ratios, may represent important relationships.
Example:
Financial data supplying information on stock price and company earnings may not be as
useful as the derived variable representing the price/earnings (PE) ratio.
A variable tabulating the dollars spent by a household on groceries may not be interesting
because this value may depend on the size of the household. Instead, considering the
proportion of total household spending on groceries may be more informative.
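Both derived variables can be computed directly. The sketch below uses made-up companies and households; the field names (price, eps, groceries, total) are assumptions for illustration.

```python
# Hypothetical companies: stock price and earnings per share, in dollars.
companies = [
    {"name": "Firm A", "price": 120.0, "eps": 6.0},
    {"name": "Firm B", "price": 45.0, "eps": 1.5},
]
for c in companies:
    c["pe_ratio"] = c["price"] / c["eps"]  # price/earnings (PE) ratio

# Hypothetical households: grocery spending and total spending, in dollars.
households = [
    {"size": 1, "groceries": 300.0, "total": 2000.0},
    {"size": 5, "groceries": 900.0, "total": 6000.0},
]
for h in households:
    h["grocery_share"] = h["groceries"] / h["total"]  # proportion of spending

for c in companies:
    print(c["name"], "PE ratio:", c["pe_ratio"])
for h in households:
    print("household of", h["size"], "grocery share:", round(h["grocery_share"], 2))
```

The two households spend very different dollar amounts on groceries, yet the proportion of total spending is the same, which is the kind of relationship the raw variable hides.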
Two Categories of Data-Mining Approaches
1. Supervised learning – the goal is to predict an outcome based on a set of variables
(features).
– The outcome variable “supervises” or guides the process of learning how to predict
future outcomes.
– In other words, a model is trained on observations for which the outcome is already
known.
2. Unsupervised learning – these methods do not attempt to predict an output value but are
rather used to detect patterns and relationships in the data.
UNSUPERVISED LEARNING
– there is no outcome variable to predict; rather, the goal is to use the variable
values to identify relationships between observations
Cluster Analysis
Clustering – segments observations into similar groups based on the observed variables.
– can be employed during the data preparation step to identify variables or observations
that can be aggregated or removed from consideration.
– commonly used in marketing to divide consumers into different homogeneous groups,
a process known as market segmentation.
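One way to see clustering in action is a small k-means sketch. The notes do not name a specific clustering algorithm, so k-means is an assumption here, as are the made-up customer data (age, annual spend in thousands of dollars) and the choice of two starting centroids.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (age, annual spend in thousands of dollars).
customers = [(22, 1.5), (25, 2.0), (24, 1.8), (58, 7.5), (61, 8.0), (60, 7.2)]

# Assumed starting centroids: one younger and one older customer.
centroids, segments = k_means(customers, [customers[0], customers[3]])

for number, segment in enumerate(segments, start=1):
    print(f"segment {number}: {segment}")
```

Each resulting segment is a homogeneous group of customers, which is the market segmentation use described above. In practice the variables would usually be rescaled first so that one variable (here, age) does not dominate the distance.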
Association Rules
Association Rules – convey the likelihood of certain items being purchased together.
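The likelihood of items being purchased together is commonly measured with support and confidence. These two measures are standard but are not defined in these notes, so treat the sketch below as an illustration on made-up transactions.

```python
# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule under study: if bread is purchased, then butter is purchased.
rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})

print(f"support:    {rule_support:.2f}")
print(f"confidence: {rule_confidence:.2f}")
```

Support says how often bread and butter appear together across all transactions; confidence says how often butter appears among the transactions that contain bread.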
SUPERVISED LEARNING
The goal of a supervised learning technique is to develop a model that predicts a value
for a continuous outcome or classifies a categorical outcome.
Overview of Supervised Learning Methods
Method                      Strengths                          Weaknesses
k-NN                        • Simple                           • Requires large amounts of data
                                                                 relative to number of variables
Classification and          • Provides easy-to-interpret       • May miss interactions between
regression trees              business rules                     variables because splits occur
                            • Can handle data sets with          one at a time
                              missing data                     • Sensitive to changes in data
                                                                 entries