Chapter 4 Data Mining

The document discusses data mining techniques used in business analytics. It covers the increase in available data due to technology and outlines the typical steps in data mining: data sampling, data preparation (including missing-data treatment and outlier identification), model construction, and model assessment. Data preparation makes raw data suitable for modeling and involves transforming variables. Both supervised and unsupervised learning are used: supervised learning aims to predict outcomes, while unsupervised learning identifies patterns. Common supervised techniques are k-nearest neighbors, classification and regression trees, and logistic regression.

Business Analytics 2nd Semester 2021-2022

Data Mining
Over the past few decades, technological advances have led to a dramatic increase in the
amount of recorded data. The increase in the use of data-mining techniques in business has
been caused largely by three events:
1. the explosion in the amount of data being produced and electronically tracked,
2. the ability to electronically warehouse these data, and
3. the affordability of computer power to analyze the data.

Observation – the set of recorded values of variables associated with a single entity
– is often displayed as a row of values in a spreadsheet or database in which the
columns correspond to the variables.
Example: in direct marketing data, an observation may correspond to a customer and
contain information regarding her response to an e-mail advertisement and her
demographic characteristics.

Steps in the data-mining process


1. Data Sampling – extract a sample of data that is relevant to the business problem under
consideration
2. Data Preparation – manipulate the data to put it in a form suitable for formal modeling
3. Model Construction – Apply the appropriate data-mining technique to accomplish the
desired data-mining task
4. Model Assessment – Evaluate models by comparing performance on appropriate data
sets

DATA SAMPLING
Sample – a sample is representative if the analyst can reach the same conclusions from it
as from the entire population of data
• The sample of data must be large enough to contain significant information, yet small
enough to be manipulated quickly
• Use enough data to eliminate any doubt about whether the sample size is sufficient
• Do not carelessly discard variables from consideration. It is generally best to include
as many variables as possible in the sample.

DATA PREPARATION
The data in a data set are often said to be “dirty” and “raw” before they have been
preprocessed to put them into a form that is best suited for a data-mining algorithm. Data
preparation makes heavy use of the descriptive statistics and data visualization methods to
gain an understanding of the data.

Common tasks include the following:


a. Treatment of Missing Data
b. Identification of Outliers and Erroneous Data
c. Variable Representation

Treatment of Missing Data


The primary options for addressing missing data are:
1. to discard observations with any missing values,
2. to discard any variable with missing values,
3. to fill in missing entries with estimated values, or
4. to apply a data-mining algorithm (such as classification and regression trees) that can
handle missing values
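Options 1 and 3 above can be sketched in a few lines of Python. The customer records and variable names below are invented purely for illustration:

```python
# Invented customer records; None marks a missing Income value.
rows = [
    {"Age": 34, "Income": 52000},
    {"Age": 41, "Income": None},
    {"Age": 29, "Income": 47000},
    {"Age": 55, "Income": 61000},
]

# Option 1: discard observations (rows) with any missing values.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Option 3: fill in missing entries with an estimated value -- here,
# the mean of the observed Income values.
observed = [r["Income"] for r in rows if r["Income"] is not None]
mean_income = sum(observed) / len(observed)
imputed_rows = [
    {**r, "Income": r["Income"] if r["Income"] is not None else mean_income}
    for r in rows
]
```

Option 2 (discarding a variable) is simply dropping the Income key from every record; option 4 delegates the problem to the algorithm itself.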

Identification of Outliers and Erroneous Data


• Examining the variables in the data set by means of summary statistics, histograms,
PivotTables, scatter plots, and other tools can uncover data quality issues and outliers.
For example, negative values for sales may result from a data entry error or may actually
denote a missing value.
• Closer examination of outliers may reveal an error or a need for further investigation to
determine whether the observation is relevant to the current analysis.

• A conservative approach is to create two data sets, one with and one without outliers, and
then construct a model on both data sets.
• If a model’s implications depend on the inclusion or exclusion of outliers, then one should
spend additional time to track down the cause of the outliers.
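One standard rule of thumb for flagging outliers (not named in the text, but widely used alongside the summary statistics mentioned above) is the interquartile-range rule: values more than 1.5 × IQR below the first quartile or above the third quartile are set aside for closer examination. A minimal sketch, with invented sales figures:

```python
def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    xs = sorted(values)

    def quartile(p):
        # Linear-interpolation quantile on the sorted data.
        k = p * (len(xs) - 1)
        lo = int(k)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (k - lo) * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Invented daily sales figures; -40 is likely a data entry error, and
# 980 deserves a closer look before modeling.
sales = [120, 135, 128, 131, 126, -40, 133, 980]
print(iqr_outliers(sales))  # [-40, 980]
```

Note that the rule only flags candidates; whether -40 is an error and whether 980 belongs in the analysis still requires human judgment.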

Variable Representation

Dimension reduction – the process of removing variables from the analysis without losing
any crucial information.

• Determining how to represent the measurements of the variables and which variables to
consider is a critical part of data mining. The treatment of categorical variables is
particularly important. Typically, it is best to encode categorical variables with 0–1 dummy
variables.
Example:
Consider a data set that contains a variable Language to track the language preference of
callers to a call center. The variable Language with the possible values of English, German,
and Spanish would be replaced with three binary variables called English, German, and
Spanish.
An entry of German would be captured using a 0 for the English dummy variable, a 1 for the
German dummy variable and a 0 for the Spanish dummy variable.
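The Language example can be sketched as a small encoding function (pure Python, illustrative only):

```python
LEVELS = ["English", "German", "Spanish"]

def encode_language(value):
    """Replace the categorical Language value with three 0-1 dummy variables."""
    return {level: int(value == level) for level in LEVELS}

print(encode_language("German"))  # {'English': 0, 'German': 1, 'Spanish': 0}
```

In regression models, one of the dummies is often dropped, since its value is implied by the others; keeping all three introduces perfect multicollinearity.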

• Using 0–1 dummy variables to encode categorical variables with many different categories
results in a large number of variables. In these cases, the use of PivotTables is helpful in
identifying categories that are similar and can possibly be combined to reduce the number
of 0–1 dummy variables.
Example:
Some categorical variables (zip code, product model number) may have many possible
categories such that, for the purpose of model building, there is no substantive difference
between multiple categories, and therefore the number of categories may be reduced by
combining categories.

• Often data sets contain variables that, considered separately, are not particularly insightful
but that, when combined as ratios, may represent important relationships.

Example:
Financial data supplying information on stock price and company earnings may not be as
useful as the derived variable representing the price/earnings (PE) ratio.
A variable tabulating the dollars spent by a household on groceries may not be interesting
because this value may depend on the size of the household. Instead, considering the
proportion of total household spending on groceries may be more informative.
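Both derived variables described above amount to simple divisions; a sketch with invented figures:

```python
# Invented raw measurements for one company and one household.
stock_price, earnings_per_share = 120.0, 8.0
grocery_spend, total_spend = 450.0, 3000.0

# Derived ratio variables often carry more signal than the raw values.
pe_ratio = stock_price / earnings_per_share   # price/earnings (PE) ratio
grocery_share = grocery_spend / total_spend   # share of spending on groceries
```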

Two Categories of Data-Mining Approaches
1. Supervised learning – the goal is to predict an outcome based on a set of variables
(features)
– the outcome variable “supervises” or guides the process of learning how to predict
future outcomes
– the model is trained on data in which the value of the outcome is already known
2. Unsupervised learning – does not attempt to predict an output value but is instead used
to detect patterns and relationships in the data.

UNSUPERVISED LEARNING
– there is no outcome variable to predict; rather, the goal is to use the variable
values to identify relationships between observations
Cluster Analysis
Clustering – segments observations into similar groups based on the observed variables.
– can be employed during the data preparation step to identify variables or observations
that can be aggregated or removed from consideration.
– commonly used in marketing to divide consumers into different homogeneous groups,
a process known as market segmentation.
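A bare-bones k-means sketch (one of many clustering algorithms; the household spending figures and starting centroids below are invented) shows how observations get segmented into homogeneous groups:

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 1-D data: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep it in place if its cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Invented annual grocery spend (in $1000s) for eight households,
# with deliberately spread-out starting centroids.
spend = [1.0, 1.2, 0.9, 5.1, 5.3, 4.8, 9.9, 10.2]
centroids, clusters = k_means(spend, centroids=[0.0, 5.0, 10.0])
```

The three resulting clusters correspond to low-, mid-, and high-spending households, i.e., a simple market segmentation.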

Association Rules
Association Rules – convey the likelihood of certain items being purchased together.
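The two quantities behind an association rule, support and confidence, reduce to counting baskets. A sketch over invented market-basket transactions:

```python
# Invented market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Likelihood of the consequent appearing, given the antecedent did."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}: milk appears in 2 of the 3 baskets with bread.
print(confidence({"bread"}, {"milk"}))
```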

SUPERVISED LEARNING
The goal of a supervised learning technique is to develop a model that predicts a value
for a continuous outcome or classifies a categorical outcome.

Three Commonly Used Supervised Learning Methods


1. k-Nearest Neighbors – can be used either to classify an outcome category or predict a
continuous outcome
2. Classification and Regression Trees (CART) – successively partition a data set of
observations into increasingly smaller and more homogeneous subsets
3. Logistic Regression – classifies a categorical outcome by modeling the log odds of the
outcome as a linear function of the explanatory variables
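A minimal k-nearest-neighbors classifier, using Euclidean distance and a majority vote (the most common choices); the training customers are invented:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.
    train is a list of ((x1, x2), label) pairs; distance is Euclidean."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Invented customers: (age, income in $1000s) -> responded to the ad?
train = [
    ((25, 40), "yes"), ((30, 45), "yes"), ((28, 42), "yes"),
    ((55, 90), "no"), ((60, 95), "no"), ((58, 88), "no"),
]
print(knn_classify(train, (27, 43)))  # yes
```

Averaging the k nearest outcome values instead of voting turns the same idea into a predictor for a continuous outcome.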

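Logistic regression turns a linear function of the explanatory variables into a probability via the logistic (sigmoid) function. The coefficients below are invented for illustration, not estimated from data:

```python
import math

def predict_prob(x, b0, b1):
    """Logistic model: P(y = 1) = 1 / (1 + e^-(b0 + b1*x))."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Invented (not estimated) coefficients; x is, say, past purchase count.
p = predict_prob(50, b0=-4.0, b1=0.1)
label = "respond" if p >= 0.5 else "not respond"
print(round(p, 3), label)  # 0.731 respond
```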
Overview of Supervised Learning Methods

k-Nearest Neighbors (k-NN)
• Strengths: simple
• Weaknesses: requires large amounts of data relative to the number of variables

Classification and Regression Trees
• Strengths: provide easy-to-interpret business rules; can handle data sets with missing data
• Weaknesses: may miss interactions between variables because splits occur one at a time; sensitive to changes in data entries

Multiple Linear Regression
• Strengths: provides an easy-to-interpret relationship between the dependent and independent variables
• Weaknesses: assumes a linear relationship between the independent variables and a continuous dependent variable

Logistic Regression
• Strengths: classification analog of the familiar multiple regression modeling procedure
• Weaknesses: coefficients not easily interpretable in terms of their effect on the likelihood of the outcome event

Discriminant Analysis
• Strengths: allows classification based on interaction effects between variables
• Weaknesses: assumes variables are normally distributed with equal variance; performance often dominated by other classification methods

Naïve Bayes
• Strengths: simple and effective at classifying
• Weaknesses: requires a large amount of data; restricted to categorical variables

Neural Networks
• Strengths: flexible and often effective
• Weaknesses: many difficult decisions to make when building the model; results cannot be easily explained (black box)
