Chapter 4 Data Mining

The document discusses data mining techniques used in business analytics. It covers the increase in available data due to technology, and outlines the typical steps in data mining: data sampling, data preparation including missing data treatment and outlier identification, model construction, and model assessment. Data preparation makes raw data suitable for modeling and involves transforming variables. Both supervised and unsupervised learning are used, with supervised aiming to predict outcomes and unsupervised identifying patterns. Common supervised techniques are k-nearest neighbors, classification and regression trees, and logistic regression.

Uploaded by

Che Manoguid Candelaria


Business Analytics 2nd Semester 2021-2022

Data Mining
Over the past few decades, technological advances have led to a dramatic increase in the
amount of recorded data. The increase in the use of data-mining techniques in business has
been caused largely by three events:
1. the explosion in the amount of data being produced and electronically tracked,
2. the ability to electronically warehouse these data, and
3. the affordability of computer power to analyze the data.

Observation – the set of recorded values of variables associated with a single entity.
An observation is often displayed as a row of values in a spreadsheet or database in
which the columns correspond to the variables.
Example: in direct marketing data, an observation may correspond to a customer and
contain information on the customer's response to an e-mail advertisement and the
customer's demographic characteristics.

Steps in the data-mining process

1. Data Sampling – extract a sample of data that is relevant to the business problem under
consideration
2. Data Preparation – manipulate the data to put it in a form suitable for formal modeling
3. Model Construction – Apply the appropriate data-mining technique to accomplish the
desired data-mining task
4. Model Assessment – Evaluate models by comparing performance on appropriate data
sets

DATA SAMPLING
Sample – a sample is representative if the analyst can draw the same conclusions from it as
from the entire population of data.
• The sample of data must be large enough to contain significant information, yet small
enough to be manipulated quickly
• Use enough data to eliminate any doubt about whether the sample size is sufficient
• Do not carelessly discard variables from consideration. It is generally best to include
as many variables as possible in the sample.
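
A minimal sketch of the sampling step, assuming a simple random sample without replacement is appropriate for the problem. The population here is synthetic and the helper name draw_sample is hypothetical; comparing the sample mean with the population mean is only one rough check of representativeness.

```python
import random
import statistics

# Synthetic population of 10,000 order amounts (illustration only, not real data).
rng = random.Random(42)
population = [rng.gauss(100.0, 20.0) for _ in range(10_000)]


def draw_sample(data, n, seed=0):
    """Return a simple random sample of n observations, without replacement."""
    return random.Random(seed).sample(data, n)


sample = draw_sample(population, 1_000)

# Rough representativeness check: the two means should be close.
pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean = {pop_mean:.2f}, sample mean = {sample_mean:.2f}")
```

The sample keeps every variable of each selected observation; only the number of rows is reduced.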

DATA PREPARATION
The data in a data set are often said to be “dirty” and “raw” before they have been
preprocessed into a form that is best suited for a data-mining algorithm. Data
preparation makes heavy use of descriptive statistics and data visualization methods to
gain an understanding of the data.

Common tasks include the following:

a. Treatment of Missing Data
b. Identification of Outliers and Erroneous Data
c. Variable Representation

Treatment of Missing Data

The primary options for addressing missing data are:
1. to discard observations with any missing values,
2. to discard any variable with missing values,
3. to fill in missing entries with estimated values, or
4. to apply a data-mining algorithm (such as classification and regression trees) that can
handle missing values
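
The first three options can be sketched as follows. The records, variable names, and helper functions are hypothetical, and filling with the mean is only one possible way to estimate a missing entry.

```python
import statistics

# Hypothetical observations; None marks a missing entry.
rows = [
    {"customer_id": 1, "age": 34, "income": 52000},
    {"customer_id": 2, "age": None, "income": 61000},
    {"customer_id": 3, "age": 45, "income": None},
    {"customer_id": 4, "age": 29, "income": 48000},
]


def drop_incomplete_observations(rows):
    """Option 1: discard observations with any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_incomplete_variables(rows):
    """Option 2: discard every variable that has at least one missing value."""
    keep = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in keep} for r in rows]


def impute_mean(rows):
    """Option 3: fill missing entries with the mean of the observed values."""
    means = {
        k: statistics.mean(r[k] for r in rows if r[k] is not None)
        for k in rows[0]
    }
    return [{k: (means[k] if r[k] is None else r[k]) for k in r} for r in rows]


print(len(drop_incomplete_observations(rows)))  # observations that survive option 1
print(drop_incomplete_variables(rows)[0])       # variables that survive option 2
print(impute_mean(rows)[1])                     # observation 2 with age filled in
```

In this small example, option 1 loses half of the observations and option 2 loses both age and income, which illustrates why estimating the missing entries is often considered.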

Identification of Outliers and Erroneous Data

• Examining the variables in the data set by means of summary statistics, histograms,
PivotTables, scatter plots, and other tools can uncover data quality issues and outliers.
For example, negative values for sales may result from a data entry error or may actually
denote a missing value.
• Closer examination of outliers may reveal an error or a need for further investigation to
determine whether the observation is relevant to the current analysis.

• A conservative approach is to create two data sets, one with and one without outliers, and
then construct a model on both data sets.
• If a model’s implications depend on the inclusion or exclusion of outliers, then one should
spend additional time to track down the cause of the outliers.
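
A minimal sketch of the two-data-set approach. The notes do not prescribe a rule for flagging outliers; the 1.5 × IQR rule used here is an assumption, the sales figures are made up, and the mean stands in for a real model.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences based on the interquartile range (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical daily sales; -50 may be a data entry error or a missing-value code.
sales = [120, 135, 128, 142, 131, -50, 125, 138, 900, 129]

low, high = iqr_bounds(sales)
with_outliers = list(sales)
without_outliers = [s for s in sales if low <= s <= high]

# "Model" both data sets (here just the mean) and compare the implications.
print("flagged:", sorted(set(sales) - set(without_outliers)))
print("mean with outliers:   ", statistics.mean(with_outliers))
print("mean without outliers:", statistics.mean(without_outliers))
```

If the two results differ materially, as they do for these made-up figures, the flagged values deserve a closer look before either data set is trusted.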

Variable Representation

Dimension reduction – the process of removing variables from the analysis without losing
any crucial information.

• Determining how to represent the measurements of the variables and which variables to
consider is a critical part of data mining. The treatment of categorical variables is
particularly important. Typically, it is best to encode categorical variables with 0–1 dummy
variables.
Example:
Consider a data set that contains a variable Language to track the language preference of
callers to a call center. The variable Language with the possible values of English, German,
and Spanish would be replaced with three binary variables called English, German, and
Spanish.
An entry of German would be captured using a 0 for the English dummy variable, a 1 for the
German dummy variable and a 0 for the Spanish dummy variable.
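
The Language example can be sketched as follows; the function name dummy_encode and the list of calls are hypothetical.

```python
def dummy_encode(values, categories):
    """Replace each categorical value with one 0-1 dummy variable per category."""
    return [{c: int(v == c) for c in categories} for v in values]


categories = ["English", "German", "Spanish"]
calls = ["German", "English", "Spanish", "German"]  # hypothetical caller preferences

encoded = dummy_encode(calls, categories)
print(encoded[0])  # {'English': 0, 'German': 1, 'Spanish': 0}
```

Each observation has exactly one dummy variable equal to 1. Some regression procedures drop one of the dummy variables because it is implied by the others.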

• Using 0–1 dummy variables to encode categorical variables with many different categories
results in a large number of variables. In these cases, the use of PivotTables is helpful in
identifying categories that are similar and can possibly be combined to reduce the number
of 0–1 dummy variables.
Example:
Some categorical variables (zip code, product model number) may have many possible
categories such that, for the purpose of model building, there is no substantive difference
between multiple categories, and therefore the number of categories may be reduced by
combining categories.

• Often data sets contain variables that, considered separately, are not particularly insightful
but that, when combined as ratios, may represent important relationships.

Example:
Financial data supplying information on stock price and company earnings may not be as
useful as the derived variable representing the price/earnings (PE) ratio.
A variable tabulating the dollars spent by a household on groceries may not be interesting
because this value may depend on the size of the household. Instead, considering the
proportion of total household spending on groceries may be more informative.
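
Both derived variables can be sketched in a few lines. The household figures, field names, and function names are hypothetical.

```python
def pe_ratio(price, earnings_per_share):
    """Price/earnings ratio; undefined (None) when earnings are zero."""
    if earnings_per_share == 0:
        return None
    return price / earnings_per_share


def grocery_share(household):
    """Proportion of total household spending that goes to groceries."""
    return household["groceries"] / household["total_spending"]


# Hypothetical households: raw grocery dollars differ mostly because of size.
households = [
    {"size": 1, "groceries": 3000, "total_spending": 30000},
    {"size": 4, "groceries": 9000, "total_spending": 60000},
]

for h in households:
    print(h["size"], h["groceries"], round(grocery_share(h), 2))

print(pe_ratio(50.0, 2.5))
```

The larger household spends three times as many dollars on groceries, but the shares of total spending are much closer, which is the comparison the ratio makes possible.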

Two Categories of Data-Mining Approaches
1. Supervised learning – the goal is to predict an outcome based on a set of variables
(features)
– the outcome variable “supervises” or guides the process of learning how to predict
future outcomes
– the model is trained on observations for which the outcome is already known
2. Unsupervised learning – methods that do not attempt to predict an output value but are
instead used to detect patterns and relationships in the data.

UNSUPERVISED LEARNING
– there is no outcome variable to predict; rather, the goal is to use the variable
values to identify relationships between observations
Cluster Analysis
Clustering – segments observations into similar groups based on the observed variables.
– can be employed during the data preparation step to identify variables or observations
that can be aggregated or removed from consideration.
– commonly used in marketing to divide consumers into different homogeneous groups,
a process known as market segmentation.
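
A minimal sketch of segmenting customers into groups. The notes do not name a specific clustering algorithm; k-means with Euclidean distance is used here as an assumption, and the customer data and starting centroids are made up.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (store visits per year, average spend per visit).
customers = [(2, 20), (3, 25), (2, 22), (20, 200), (22, 210), (21, 190)]

centroids, clusters = kmeans(customers, centroids=[(2, 20), (20, 200)])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Each resulting group is a candidate market segment. In practice the variables are usually put on comparable scales first so that one variable does not dominate the distance.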

Association Rules
Association Rules – convey the likelihood of certain items being purchased together.
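
One common way to quantify this likelihood is with support and confidence. These two measures are not defined in the notes and are introduced here as an assumption; the transactions are made up.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also
    contain the consequent (rule: antecedent -> consequent)."""
    both = count_containing(transactions, antecedent | consequent)
    return both / count_containing(transactions, antecedent)


print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

Here the rule "bread -> milk" is read as: of the baskets that contain bread, what fraction also contain milk.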

SUPERVISED LEARNING
The goal of a supervised learning technique is to develop a model that predicts a value
for a continuous outcome or classifies a categorical outcome

Three Commonly Used Supervised Learning Methods

1. k-Nearest Neighbors – can be used either to classify an outcome category or predict a
continuous outcome
2. Classification and Regression Trees (CART) – successively partition a data set of
observations into increasingly smaller and more homogeneous subsets
3. Logistic Regression – classifies a categorical outcome by modeling its log odds as a linear function of
explanatory variables
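
A minimal sketch of k-nearest neighbors for both uses named above: a majority vote for classification and an average for a continuous outcome. Euclidean distance, k = 3, and all data and function names are assumptions made for illustration.

```python
import math
from collections import Counter


def nearest_neighbors(train, query, k):
    """Return the k training rows (features, outcome) closest to the query."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]


def knn_classify(train, query, k=3):
    """Classify by majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in nearest_neighbors(train, query, k))
    return votes.most_common(1)[0][0]


def knn_predict(train, query, k=3):
    """Predict a continuous outcome as the mean over the k nearest neighbors."""
    neighbors = nearest_neighbors(train, query, k)
    return sum(y for _, y in neighbors) / len(neighbors)


# Hypothetical labeled observations: (features, outcome).
responders = [
    ((1, 1), "no"), ((1, 2), "no"), ((2, 1), "no"),
    ((8, 8), "yes"), ((8, 9), "yes"), ((9, 8), "yes"),
]
spending = [((1,), 10.0), ((2,), 12.0), ((3,), 14.0), ((10,), 50.0)]

print(knn_classify(responders, (2, 2)))
print(knn_predict(spending, (2,)))
```

The outcome of each training observation is what "supervises" the prediction: a new observation simply borrows the outcomes of its most similar neighbors.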

Overview of Supervised Learning Methods

k-NN
Strengths:
• Simple
Weaknesses:
• Requires large amounts of data relative to number of variables

Classification and regression trees
Strengths:
• Provides easy-to-interpret business rules
• Can handle data sets with missing data
Weaknesses:
• May miss interactions between variables because splits occur one at a time
• Sensitive to changes in data entries

Multiple linear regression
Strengths:
• Provides easy-to-interpret relationship between dependent and independent variables
Weaknesses:
• Assumes linear relationship between independent variables and a continuous dependent
variable

Logistic regression
Strengths:
• Classification analog of the familiar multiple regression modeling procedure
Weaknesses:
• Coefficients not easily interpretable in terms of effect on likelihood of outcome event

Discriminant analysis
Strengths:
• Allows classification based on interaction effects between variables
Weaknesses:
• Assumes variables are normally distributed with equal variance
• Performance often dominated by other classification methods

Naïve Bayes
Strengths:
• Simple and effective at classifying
Weaknesses:
• Requires a large amount of data
• Restricted to categorical variables

Neural networks
Strengths:
• Flexible and often effective
Weaknesses:
• Many difficult decisions to make when building the model
• Results cannot be easily explained (black box)
