Ch1 2
Analyzing data
Given management goals, and given that
management can translate knowledge into action
Basic Styles
• Top-Down: HYPOTHESIS TESTING
– SUPERVISED
– have a theory, experiment to prove or disprove
– SCIENCE
• Bottom-Up: KNOWLEDGE DISCOVERY
– UNSUPERVISED
– start with data, see new patterns
– CREATIVITY
Hypothesis Testing
• Generate theory
• Determine data needed
• Get data
• Prepare data
• Build computer model
• Evaluate model results
– confirm or reject hypotheses
Generate Theory
• Study
• Systematically tie different input sources
together (MENTAL MODEL)
– what causes sales volume?
• Sales rep performance
• economy, seasonality
• product quality, price, promotion, location
Generate Theory
• Brainstorm:
– diverse representatives for broad coverage of
perspectives (electronic)
– keep under control (keep positive)
– generate testable hypotheses
Define Data Needed
• Determine data needed to test hypothesis
– Lucky - query existing database
– More often - gather
• pull together from diverse databases, survey, buy
Locate Data
• Usually scattered or unavailable
• Sources:
– warranty claims
– point-of-sale data (cash register records)
– medical insurance claims
– telephone call detail records
– direct mail response records
– demographic data, economic data
• PROFILE: counts, summary statistics, cross-tabs,
cleanup
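The profiling step above (counts, summary statistics, cross-tabs) can be sketched in plain Python; the records and field names here are hypothetical:

```python
from collections import Counter
from statistics import mean

# Hypothetical claim records; fields are illustrative only.
claims = [
    {"state": "CA", "type": "dental", "amount": 120.0},
    {"state": "CA", "type": "vision", "amount": 80.0},
    {"state": "NY", "type": "dental", "amount": 150.0},
    {"state": "NY", "type": "dental", "amount": 95.0},
]

# Counts per field value.
state_counts = Counter(c["state"] for c in claims)
print(state_counts)                      # Counter({'CA': 2, 'NY': 2})

# Summary statistics on a numeric field.
amounts = [c["amount"] for c in claims]
print(min(amounts), mean(amounts), max(amounts))   # 80.0 111.25 150.0

# Cross-tab: state x claim type.
crosstab = Counter((c["state"], c["type"]) for c in claims)
print(crosstab[("NY", "dental")])        # 2
```

Profiling like this is also where cleanup problems (impossible values, inconsistent codes) first become visible.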
Prepare Data for Analysis
• Summarize at the right level:
– too much summarization: no discriminant information left
– too little: swamped with useless detail
• Process for computer: EBCDIC, ASCII
• Data encoding: how data is recorded can vary;
may have been collected for a specific purpose (CAL
omitting LA)
• Textual data: avoid if possible (may need to code)
• Missing values: missing salary- use mean?
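The mean-imputation idea for a missing salary can be sketched as follows (values are made up):

```python
from statistics import mean

# Hypothetical survey records; None marks a missing salary.
salaries = [52000, None, 61000, None, 48000]

known = [s for s in salaries if s is not None]
fill = mean(known)                                   # mean imputation
imputed = [s if s is not None else fill for s in salaries]
print(imputed)
```

Note the trade-off: filling with the mean keeps every record usable but shrinks the variance and can mask real patterns in who failed to report.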
Build Computer Model
• Convert mental model into quantitative
– roamers less sensitive to price than others
• threshold defining roamer
• average price per call, or number of calls above
price level
– families with children in high school most
likely to respond to home-equity loan offer
• identify families with, without high school age
• past data - responded or didn’t
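Converting a mental model into a quantitative one, as in the roamer example above, might look like this minimal sketch; the call records and the 25% threshold are assumptions for illustration:

```python
# Hypothetical call records for one subscriber: (minutes, roaming?, price)
calls = [
    (5, True, 0.90), (12, False, 0.25), (3, True, 1.10), (8, False, 0.25),
]

# Assumed quantitative definition: >25% of calls roaming => "roamer".
ROAMER_THRESHOLD = 0.25

roam_share = sum(1 for _, roaming, _ in calls if roaming) / len(calls)
is_roamer = roam_share > ROAMER_THRESHOLD

# Candidate price-sensitivity feature: average price per call.
avg_price = sum(p for _, _, p in calls) / len(calls)
print(is_roamer, avg_price)   # True 0.625
```

The point is that the vague notion "roamer" only becomes testable once a threshold and a measurable feature are pinned down.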
Evaluate Model
• Determine if hypotheses supported
– statistical practice
– test rule-based systems for accuracy
• Requires both business and analytic
knowledge
SUPERVISED
Dorn, National Underwriter, Oct. 18, 2004, pp. 34, 39
• Health care fraud
– Use statistics to identify indicators of fraud or abuse
– Can rapidly sort through large databases
• Identify patterns different from norm
– Moderately successful
• But only effective on schemes already detected
• To benefit firm, need to identify fraud prior to paying claim
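The statistical-indicator approach above can be sketched as a simple deviation-from-norm flag; the provider names, counts, and 1.5-sigma cutoff are all hypothetical:

```python
from statistics import mean, stdev

# Hypothetical monthly claim counts per provider; P5 is the planted anomaly.
claims_per_provider = {"P1": 40, "P2": 38, "P3": 45, "P4": 41, "P5": 120}

counts = list(claims_per_provider.values())
mu, sigma = mean(counts), stdev(counts)

# Flag providers more than 1.5 standard deviations above the norm.
flagged = [p for p, n in claims_per_provider.items() if (n - mu) / sigma > 1.5]
print(flagged)   # ['P5']
```

A flag like this only surfaces patterns that differ from the norm already present in the data, which is exactly the limitation the slide notes: it cannot see schemes that look normal.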
Knowledge Discovery
• Machine learning?
– Usually need intelligent analyst
• Directed: explain value of some variable
• Undirected: no dependent variable selected
– identify patterns
• use undirected to recognize relationships,
• directed to explain once found
Directed
• Goal-oriented
• Examples:
– if apples are discounted, what is the impact on other products?
– who is likely to purchase credit insurance?
– predicted profitability of a new customer
– what to bundle with a particular package
• Identify sources of preclassified data
• Prepare data for analysis
• Build & train computer model
• Evaluate
Identify Data Sources
• Best - existing corporate data warehouse
– data clean, verified, consistent, aggregated
• usually need to generate
– most data in form most efficient for designed
purpose
– historical sales data often purged for dormant
customers (but you need that information)
Prepare Data
• Put in needed format for computer
• Make consistent in meaning
• Need to recognize what data is missing
– derived fields: change in balance = new - old
– add missing but known-to-be-important data
• Divide data into training, test, evaluation sets
• Decide how to treat outliers
– statistically biasing, but may be the most important cases
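The training/test/evaluation split, together with the derived change-in-balance field mentioned above, might be sketched like this (all data is synthetic, and the 60/20/20 proportions are an assumption):

```python
import random

random.seed(42)

# Synthetic customer records: (old_balance, new_balance, responded?)
records = [(random.uniform(0, 5000), random.uniform(0, 5000),
            random.random() < 0.3) for _ in range(100)]

# Derived feature: change in balance = new - old.
rows = [(new - old, responded) for old, new, responded in records]

# Shuffle, then split 60/20/20 into training, test, and evaluation sets.
random.shuffle(rows)
n = len(rows)
train = rows[: int(0.6 * n)]
test = rows[int(0.6 * n): int(0.8 * n)]
evaluation = rows[int(0.8 * n):]
print(len(train), len(test), len(evaluation))   # 60 20 20
```

Shuffling before splitting matters: records are often stored in an order (by date, by region) that would otherwise make the three sets systematically different.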
Build & Train Model
• Regression - human builds (selects IVs)
• Automatic systems train
– give it data, let it hammer
• OVERFITTING:
– model fits the training data too closely, memorizing its noise
– TEST SET a means to evaluate model against
data not used in training
• tune weights before using to evaluate
Evaluate Model
• ERROR RATE: proportion of
classifications in evaluation set that were
wrong
• too little training: poor fit on training data
and poor error rate
• optimal training: good fit on both
• too much training: great fit on training data
and poor error rate
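The overfitting and error-rate ideas on these slides can be illustrated with a toy sketch: a model that simply memorizes the training data achieves a perfect fit there but a poor error rate on held-out data, while a simpler rule generalizes. The data and both models are illustrative assumptions:

```python
import random

random.seed(0)

# Toy data: x in [0,1), label = 1 if x > 0.5, with 10% label noise.
def make_data(n):
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < 0.10:       # noise
            y = 1 - y
        data.append((x, y))
    return data

train, test = make_data(200), make_data(200)

# Overfit model: memorize every training example exactly.
memory = {x: y for x, y in train}
def memorizer(x):
    return memory.get(x, 0)              # unseen points: guess class 0

# Simple model: a single threshold rule.
def threshold_rule(x):
    return 1 if x > 0.5 else 0

# ERROR RATE: proportion of classifications that were wrong.
def error_rate(model, data):
    return sum(1 for x, y in data if model(x) != y) / len(data)

print(error_rate(memorizer, train))      # 0.0  -- perfect fit to training
print(error_rate(memorizer, test))       # ~0.5 -- no generalization
print(error_rate(threshold_rule, test))  # ~0.1 -- near the noise floor
```

This is the "great fit on training data and poor error rate" case in miniature; the test set exposes it precisely because its points were never seen in training.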
Undirected Discovery
• What items sell together? Strawberries & cream
– Directed: what items sell with tofu? tabasco
• Long distance caller market segmentation
– before segmentation: uniform usage, weekday & weekend, spikes
on holidays
– after segmentation, one segment stood out: high & uniform usage
except for several months of nothing - college students, with
high credit worthiness & profitability
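The segmentation example above can be sketched with synthetic 12-month usage profiles; the usage levels, the "several months of nothing," and the spread cutoff of 30 are all assumptions:

```python
import random
from statistics import mean, pstdev

random.seed(1)

# Synthetic profiles: uniform callers vs. a "college student" pattern
# (heavy usage except for empty summer months).
def uniform_user():
    return [random.gauss(100, 5) for _ in range(12)]

def student_user():
    return [0 if m in (5, 6, 7) else random.gauss(160, 10) for m in range(12)]

users = [uniform_user() for _ in range(10)] + [student_user() for _ in range(10)]

# Derived features per user: average usage and month-to-month spread.
features = [(mean(u), pstdev(u)) for u in users]

# Crude segmentation: a large spread marks the intermittent pattern.
segments = ["intermittent" if sd > 30 else "uniform" for _, sd in features]
print(segments.count("uniform"), segments.count("intermittent"))   # 10 10
```

A real analysis would use a proper clustering method rather than a hand-set cutoff, but the idea is the same: aggregate usage looks uniform until derived features separate the segments.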
UNSUPERVISED
• Health care fraud
– Look at historical claim submissions
• Build ad hoc model to compare with current claims
– Assign similarity score to fraudulent claims
– Predict fraud potential
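One way to sketch the scoring idea above is distance from the historical claim profile; this simplification scores deviation from the historical norm rather than similarity to known-fraud claims, and the features and values are hypothetical:

```python
from math import sqrt
from statistics import mean

# Hypothetical claim features: (amount, procedures_billed)
historical = [(120, 2), (95, 1), (150, 3), (110, 2), (130, 2)]
current = [(115, 2), (900, 9)]

# Center of the historical claims.
center = (mean(a for a, _ in historical), mean(p for _, p in historical))

def fraud_score(claim):
    # Distance from the historical center: larger = more anomalous.
    return sqrt((claim[0] - center[0]) ** 2 + (claim[1] - center[1]) ** 2)

scores = [fraud_score(c) for c in current]
print(scores[1] > scores[0])   # True: the second claim looks anomalous
```

Scoring claims this way, before payment, is what lets the model address the "identify fraud prior to paying the claim" requirement from the earlier slide.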
Undirected Process
• Identify data sources
• Prepare data
• Build & train computer model
• Evaluate model
• Apply model to new data
• Identify potential targets for directed follow-up
• Generate new hypotheses to test
Identify potential targets
• Why
• Who
• When
Generate hypotheses
• Any commonalities in data?
• Are they useful?
– Many adults watch children’s movies
• chaperones an important market segment
• they probably make final decision
• when a hypothesis is generated, it
determines the data needed
Summary
• Knowledge Discovery
– New paradigm of data analysis
– Discover unexpected patterns
• ACTIONABLE – can make money from this
knowledge