Ch1 2

The document outlines methodologies for data analysis, focusing on two main approaches: Top-Down hypothesis testing and Bottom-Up knowledge discovery. It details the steps involved in hypothesis testing, including theory generation, data preparation, model building, and evaluation, as well as the process of knowledge discovery through machine learning. The ultimate goal is to identify actionable insights from data that can drive management decisions and improve business outcomes.

Uploaded by

yjcho

Methodology

Analyzing data
Assumes management has goals and can translate
knowledge into action
Basic Styles
• Top-Down: HYPOTHESIS TESTING
– SUPERVISED
– have a theory, experiment to prove or disprove
– SCIENCE
• Bottom-Up: KNOWLEDGE DISCOVERY
– UNSUPERVISED
– start with data, see new patterns
– CREATIVITY
Hypothesis Testing
• Generate theory
• Determine data needed
• Get data
• Prepare data
• Build computer model
• Evaluate model results
– confirm or reject hypotheses
Generate Theory
• Study
• Systematically tie different input sources
together (MENTAL MODEL)
– what causes sales volume?
• Sales rep performance
• economy, seasonality
• product quality, price, promotion, location
Generate Theory
• Brainstorm:
– diverse representatives for broad coverage of
perspectives (electronic)
– keep under control (keep positive)
– generate testable hypotheses
Define Data Needed
• Determine data needed to test hypothesis
– If lucky: query an existing database
– More often: gather the data
• pull together from diverse databases, run surveys, or buy it
Locate Data
• Usually scattered or unavailable
• Sources:
– warranty claims
– point-of-sale data (cash register records)
– medical insurance claims
– telephone call detail records
– direct mail response records
– demographic data, economic data
• PROFILE: counts, summary statistics, cross-tabs,
cleanup
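A minimal sketch of the PROFILE step, assuming a small hypothetical set of point-of-sale records. The field names `region`, `product`, and `amount` and all values are illustrative and not taken from the source.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical point-of-sale records; field names and values are assumptions.
records = [
    {"region": "East", "product": "A", "amount": 10.0},
    {"region": "East", "product": "B", "amount": 20.0},
    {"region": "West", "product": "A", "amount": 30.0},
    {"region": "West", "product": "A", "amount": None},  # missing value
]

def profile(rows):
    """Return counts, summary statistics, and a cross-tab for the rows."""
    amounts = [r["amount"] for r in rows if r["amount"] is not None]
    return {
        "n_rows": len(rows),
        "n_missing_amount": len(rows) - len(amounts),
        "mean_amount": mean(amounts),
        "median_amount": median(amounts),
        # Cross-tab: number of records per (region, product) cell.
        "crosstab": Counter((r["region"], r["product"]) for r in rows),
    }

report = profile(records)
```

The missing-value count is the part that feeds cleanup: it shows which fields need attention before any model is built.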
Prepare Data for Analysis
• Summarize: too much summarization - no discriminating information left
too little - swamped with useless detail
• Process for computer: convert character encodings (EBCDIC, ASCII)
• Data encoding: how data is recorded can vary
may have been collected with specific purpose (CAL
omitting LA)
• Textual data: avoid if possible (may need to code)
• Missing values: e.g., a missing salary - substitute the mean?
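One way to handle the missing-salary case is mean imputation. This is a sketch under the assumption that missing values are stored as `None`; the function name `impute_mean` and the salary figures are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute: no known values")
    fill = mean(known)
    return [fill if v is None else v for v in values]

# Hypothetical salaries; None marks a missing value.
salaries = [40000, None, 60000, 50000]
filled = impute_mean(salaries)
```

Filling with the mean leaves the average unchanged but understates the spread of the variable, which is one reason the slide poses it as a question rather than a rule.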
Build Computer Model
• Convert mental model into quantitative
– roamers less sensitive to price than others
• threshold defining roamer
• average price per call, or number of calls above
price level
– families with children in high school most
likely to respond to home-equity loan offer
• identify families with and without high-school-age children
• past data: who responded and who didn't
Evaluate Model
• Determine if hypotheses supported
– statistical practice
– test rule-based systems for accuracy
• Requires both business and analytic
knowledge
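As an illustration of "statistical practice", the home-equity loan hypothesis could be checked with a two-proportion z-test on past response data. This is a sketch of one common test, not necessarily the one the source has in mind. The function name and the response counts are hypothetical.

```python
from math import erf, sqrt

def two_proportion_z(resp_a, n_a, resp_b, n_b):
    """Two-sided z-test for a difference in response rates between two groups.

    Returns (z, p_value), using the pooled-proportion standard error and the
    normal approximation, which assumes reasonably large groups.
    """
    pooled = (resp_a + resp_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("response rate is 0 or 1 in both groups; test undefined")
    z = (resp_a / n_a - resp_b / n_b) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical counts: responders among families with / without
# high-school-age children.
z, p = two_proportion_z(60, 1000, 30, 1000)
```

A small p-value supports the hypothesis that the two groups respond at different rates. Whether the difference is large enough to act on is the business judgment the slide refers to.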
SUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
• Health care fraud
– Use statistics to identify
indicators of fraud or abuse
– Can rapidly sort through
large databases
• Identify patterns different
from norm
– Moderately successful
• But only effective on
schemes already detected
• To benefit firm, need to
identify fraud prior to
paying claim
Knowledge Discovery
• Machine learning?
– Usually need intelligent analyst
• Directed: explain value of some variable
• Undirected: no dependent variable selected
– identify patterns
• use undirected to recognize relationships,
• directed to explain once found
Directed
• Goal-oriented
• Examples:
– if we discount apples, what is the impact on other products?
– who is likely to purchase credit insurance?
– what is the predicted profitability of a new customer?
– what should be bundled with a particular package?
• Identify sources of preclassified data
• Prepare data for analysis
• Build & train computer model
• Evaluate
Identify Data Sources
• Best - existing corporate data warehouse
– data clean, verified, consistent, aggregated
• usually need to generate
– most data in form most efficient for designed
purpose
– historical sales data often purged for dormant
customers (but you need that information)
Prepare Data
• Put in needed format for computer
• Make consistent in meaning
• Recognize what data is missing
– derive fields where needed, e.g., change in balance = new - old
– add missing but known-to-be-important data
• divide data into training, test, evaluation
• decide how to treat outliers
– statistically biasing, but may be most important
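The three-way division can be sketched as a shuffled split. The 60/20/20 proportions, the fixed seed, and the function name are assumptions for illustration. The source does not specify them.

```python
import random

def split_three_ways(rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle rows and split them into training, test, and evaluation sets.

    Whatever remains after the training and test portions becomes the
    evaluation set, so every row lands in exactly one set.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    n_train = round(len(rows) * train_frac)
    n_test = round(len(rows) * test_frac)
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    evaluation = rows[n_train + n_test:]
    return train, test, evaluation

train, test, evaluation = split_three_ways(range(100))
```

Shuffling before splitting guards against any ordering in the source data (by date or customer ID, for example) leaking into one of the sets.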
Build & Train Model
• Regression - human builds (selects independent variables, IVs)
• Automatic systems train
– give it data, let it hammer
• OVERFITTING:
– model fits the training data too closely, including its noise
– TEST SET: a means to evaluate the model against
data not used in training
• tune weights on the test set before using the evaluation set
Evaluate Model
• ERROR RATE: proportion of
classifications in evaluation set that were
wrong
• too little training: poor fit on training data
and poor error rate
• optimal training: good fit on both
• too much training: great fit on training data
and poor error rate
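The error rate and the three training regimes above can be sketched as follows. The `acceptable` threshold of 0.10 and the function names are hypothetical. In practice the cutoff depends on the problem.

```python
def error_rate(predicted, actual):
    """Proportion of classifications that were wrong."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

def diagnose(train_error, eval_error, acceptable=0.10):
    """Map training and evaluation error rates to the three regimes.

    `acceptable` is an assumed cutoff separating a good fit from a poor one.
    """
    if train_error > acceptable:
        return "too little training"
    if eval_error > acceptable:
        return "too much training (overfitting)"
    return "optimal training"

# Hypothetical predictions against actual classes in an evaluation set.
rate = error_rate([1, 0, 1, 1], [1, 1, 1, 0])
```

The telling signal is the gap: a model that looks excellent on training data but poor on the evaluation set has memorized the training data rather than learned a general pattern.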
Undirected Discovery
• What items sell together? Strawberries & cream
– Directed: what items sell with tofu? tabasco
• Long distance caller market segmentation
– uniform usage-weekday & weekend, spikes on
holidays
– after segmentation:
• high & uniform except for several months of nothing
• high creditworthiness & profitability: college students
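The "what sells together" question can be sketched by counting item pairs across baskets, with a directed variant that fixes one item. The baskets and function names are hypothetical. Real market-basket analysis usually adds measures such as support and confidence, which are omitted here.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Undirected: count how often each pair of items appears together."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def sells_with(item, baskets):
    """Directed: count the items that appear in baskets containing `item`."""
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        if item in items:
            counts.update(items - {item})
    return counts

# Hypothetical transactions.
baskets = [
    ["strawberries", "cream", "tofu"],
    ["strawberries", "cream"],
    ["tofu", "tabasco"],
    ["cream", "milk"],
]
top_pair = pair_counts(baskets).most_common(1)[0]
with_tofu = sells_with("tofu", baskets)
```

The undirected version scans every pair without a target in mind. The directed version starts from a chosen item, mirroring the tofu question above.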
UNSUPERVISED
• Health care fraud
– Look at historical
claim submissions
• Build ad hoc model to
compare with current
claims
– Assign similarity score
to fraudulent claims
– Predict fraud potential
Undirected Process
• Identify data sources
• Prepare data
• Build & train computer model
• Evaluate model
• Apply model to new data
• Identify potential targets for undirected
• Generate new hypotheses to test
Identify potential targets
• Why
• Who
• When
Generate hypotheses
• Any commonalities in data?
• Are they useful?
– Many adults watch children’s movies
• chaperones an important market segment
• they probably make final decision
• when hypothesis is generated, that
determines data needed
Summary
• Knowledge Discovery
– New paradigm of data analysis
– Discover unexpected patterns
• ACTIONABLE – can make money from this
knowledge
