SAS SEMMA

A SAS Institute Best Practices Paper
Table of Contents
ABSTRACT
REFERENCES
CREDITS
Figures
Figure 1: The Data Mining Process and the Business Intelligence Cycle
Figure 2: Steps in the SEMMA Methodology
Figure 3: How Sampling Size Affects Validity
Figure 4: How Samples Reveal the Distribution of Data
Figure 5: Example Surface Plots for Fitted Models: Regression, Decision Tree, and Neural Network
Figure 6: Churn Analysis Steps, Actions, and Nodes
Figure 7: Process Flow Diagram for the Customer Churn Project
Figure 8: Input Data - Interval Variables
Figure 9: Input Data - Class Variables
Figure 10: Percentage of Churn and No Churn
Figure 11: General Dialog Page
Figure 12: Stratification Variables Dialog Page
Figure 13: Stratification Criteria Dialog Page
Figure 14: Sampling Results Browser
Figure 15: Data Partition Dialog Page
Figure 16: Regression Results
Figure 17: Diagnostic Chart for Validation Data
Figure 18: Diagnostic Chart for Test Data
Figure 19: Incremental Sample Size and the Correct Classification Rates
Figure 20: Comparison of Sample Sizes
Abstract
Industry analysts expect the use of data mining to sustain double-digit growth into the 21st century. One recent study, for example, predicts that the worldwide statistical and data mining software market will grow at a compound annual growth rate of 16.1 percent over the next five years, reaching $1.13 billion in the year 2002 (International Data Corporation 1998, #15932).
Many large- to mid-sized organizations in the mainstream of business, industry, and the public
sector already rely heavily on the use of data mining as a way to search for relationships that
would otherwise be hidden in their transaction data. However, even with powerful data
mining techniques, it is possible for relationships in data to remain hidden due to the presence
of one or more of the following conditions:
the data are not aggregated and organized by subject for analysis
the data contain errors, missing values, or other problems that leave them unprepared for analysis
relationships in the data are too complex to be seen readily via human observation
All of these conditions are complex problems that present their own unique challenges.
For example, organizing data by subject into data warehouses or data marts can solve
problems associated with aggregation.1 Data that contain errors, missing values, or other
problems can be cleaned in preparation for analysis.2 Relationships that are counter-intuitive
or highly complex can be revealed by applying predictive modeling techniques such as neural
networks, regression analysis, and decision trees as well as exploratory techniques like
clustering, associations, and sequencing. However, processing large databases en masse is another story, one that carries its own unique set of problems.
This paper discusses the use of sampling as a statistically valid practice for processing large databases by exploring the following topics:
processing the entire database versus processing a sample, and the advantages and disadvantages of each
the statistical validity of sampling and common myths about sampling
determining the sample size
a case study that applies sampling to a customer churn analysis
For those who want to study further the topics of data mining and the use of sampling
to process large amounts of data, this paper also provides references and a list of
recommended reading material.
1Accessing, aggregating, and transforming data are primary functions of data warehousing. For more information on data warehousing,
see the Recommended Reading section in this paper.
2"Unscrubbed data" and similar terms refer to data that are not prepared for analysis. Unscrubbed data should be cleaned (scrubbed, transformed) to correct errors such as missing values, inconsistent variable names, and inconsequential outliers before being analyzed.
Times have changed. As disk storage has become increasingly affordable, businesses have
realized that their data can, in fact, be used as a corporate asset for competitive advantage. For
example, customers' previous buying patterns often are good predictors of their future buying
patterns. As a result, many businesses now search their data to reveal those historical patterns.
To benefit from the assets bound up in their data, organizations have invested numerous
resources to develop data warehouses and data marts. The result has been substantial
returns on these kinds of investments. However, now that affordable systems exist for
storing and organizing large amounts of data, businesses face new challenges. For example,
how can hardware and software systems sift through vast warehouses of data efficiently?
What process leads from data, to information, to competitive advantage?
While data storage has become cheaper, CPU speed, throughput, memory management, and network bandwidth continue to be constraints when it comes to processing large quantities of data. Many IT managers and business analysts are so overwhelmed with the sheer volume of data that they do not know where to start. Given these massive amounts of data, many ask, "How can we even begin to move from data to information?" The answer lies in a data mining process that relies on sampling, visual representations for data exploration, statistical analysis and modeling, and assessment of the results.
Figure 1: The Data Mining Process and the Business Intelligence Cycle
3According to the META Group, "The SAS Data Mining approach provides an end-to-end solution, in both the sense of integrating data mining into the SAS Data Warehouse, and in supporting the data mining process. Here, SAS is the leader" (META Group 1997, file #594).
Sample the data by creating one or more data tables.4 The samples should be big
enough to contain the significant information, yet small enough to process quickly.
Explore the data by searching for anticipated relationships, unanticipated trends, and
anomalies in order to gain understanding and ideas.
Modify the data by creating, selecting, and transforming the variables to focus the model
selection process.
Model the data by allowing the software to search automatically for a combination of
data that reliably predicts a desired outcome.
Assess the data by evaluating the usefulness and reliability of the findings from the data
mining process.
SEMMA is itself a cycle; the internal steps can be performed iteratively as needed. Figure 2
illustrates the tasks of a data mining project and maps those tasks to the five stages of the
SEMMA methodology. Projects that follow SEMMA can sift through millions of records5
and reveal patterns that enable businesses to meet their data mining objectives.
5Record refers to an entire row of data in a data table. Synonyms for the
term record include observation, case, and event. Row refers to the way data
are arranged horizontally in a data table structure.
So what can be done when the volume of data grows to such massive proportions? The answer is deceptively simple: either process the entire database or process a sample of it.
Processing the Entire Database

Advantages
Although data mining often presupposes the need to process very large databases, some
data mining projects can be performed successfully when the databases are small. For example,
all of the data could be processed when there are more variables6 than there are records. In
such a situation, there are statistical techniques that can help ensure valid results. In that case, an advantage of processing the entire database is that enough richness can be maintained in the limited existing data to ensure a more precise fit.
In other cases, the underlying process that generates the data may be rapidly changing, and
records are comparable only over a relatively short time period. As the records age, they
might lose value. Older data can become essentially worthless. For example, the value of
sales transaction data associated with clothing fads is often short lived. Data generated by
such rapidly changing processes must be analyzed often to produce even short-term forecasts.
Processing the entire database also can be advantageous in sophisticated exception reporting systems that find anomalies in the database or highlight values above or below some threshold level that meet the selected criteria.
6Variable refers to a characteristic that defines records in a data table, such as a variable B_DATE, which would contain customers' birth dates. Column refers to the way data are arranged vertically within a data table structure.
If the solution to the business problem is tied to one record or a few records, then to find
that subset, it may be optimal to process the complete database. For example, suppose a chain
of retail paint stores discovers that too many customers are returning paint. Paint pigments
used to mix the paints are obtained from several outside suppliers. Where is the problem?
With retailers? With customers? With suppliers? What actions should be taken to correct the
problem? The company maintains many databases consisting of various kinds of information
about customers, retailers, suppliers, and products. Routine anomaly detection (processing
that is designed to detect whether a summary performance measure is beyond an acceptable
range) might find that a specific store has a high percentage of returned paint. A subsequent
investigation discovers that employees at that store mix pigments improperly. Clearer
instructions could eliminate the problem. In a case like this one, the results are definitive
and tied to a single record. If the data had been sampled for analysis, then that single,
important record might not have been included in the sample.
Disadvantages
Processing the entire database affects various aspects of the data mining process including
the following:
Inference/Generalization
The goal of inference and predictive modeling is to successfully apply findings from a data mining system to new records. Data mining systems that exhaustively search the
databases often leave no data from which to develop inferences. Processing all of the
data also leaves no holdout data with which to test the model for explanatory power on
new events. In addition, using all of the data leaves no way to validate findings on data
unseen by the model. In other words, there is no room left to accomplish the goal of
inference.
Instead, holdout samples must be available to ensure confidence in data mining results.
According to Elder and Pregibon, "the true goal of most empirical modeling activities is to employ simplifying constraints alongside accuracy measures during model formulation in order to best generalize to new cases" (1996, p. 95).
Occasionally, concerns arise about the way sampling might affect inference. This concern is often expressed as a belief that a sample might miss some subtle but important niches, those hidden nuggets in the database. However, if a niche is so tiny that it is not represented in a sample and yet so important as to influence the big picture, the niche can be discovered either by automated anomaly detection or by using appropriate sampling methods. If there are pockets of important information hidden in the database, applying the appropriate sampling technique will reveal them and will do so much faster than processing the whole database.
Within a database, there can be huge variations across individual records. A few data values far from the main cluster can overly influence the analysis and result in larger forecast errors and higher misclassification rates. These data values may be miscoded values, they may be old data, or they may be outlying records. Had the sample been taken from the main cluster, these outlying records would not have been overly influential.
Alternatively, little variation might exist in the data for many of the variables; the records
are very similar in many ways. Performing computationally intensive processing on the
entire database might provide no additional information beyond what can be obtained
from processing a small, well-chosen sample. Moreover, when the entire database is
processed, the benefits that might have been obtained from a pilot study are lost.
For some business problems, the analysis involves the destruction of an item. For example, to test the quality of a new automobile, it is torn apart or run until parts fail. Many products are tested in this way. Typically, only a sample of a batch is analyzed using this approach. If the analysis involves the destruction of an item, then processing the entire database is rarely viable.
Processing a Sample
Corporations that have achieved significant return on investment (ROI) in data mining have
done so by performing predictive data mining. Predictive data mining requires the development
of accurate predictive models that typically rely on sampling in one or more forms. ROI is the
final justification for data mining, and most often, the return begins with a relatively small sample.7
Advantages
Exploring a representative sample is easier, more efficient, and can be as accurate as exploring the entire database. After the initial sample is explored, some preliminary models can be fitted and assessed. If the preliminary models perform well, then perhaps the data mining project can continue to the next phase. However, it is likely that the initial modeling generates additional, more specific questions, and more data exploration is required.
7Sampling also is effective when using exploratory or descriptive data mining techniques; however, the goals and benefits (and hence
the ROI) of using these techniques are less well defined.
In most cases, a database is logically a subset or a sample of some larger population.8 For example, a database that contains sales records must be delimited in some way, such as by the month in which the items were sold. Thus, next month's records will represent a different sample from this month's. The same logic would apply to longer time frames such as years, decades, and so on. Additionally, databases can at best hold only a fraction of the information required to fully describe customers, suppliers, and distributors. In the extreme, the largest possible database would be all transactions of all types over the longest possible time frame that fully describes the enterprise.
The speed and efficiency of a process can be measured in various ways. Throughput is
a common measure; however, when business intelligence is the goal, business-oriented
measurements are more useful. In the context of business intelligence, it makes more
sense to ask big picture questions about the speed and efficiency of the entire business
intelligence cycle than it does to dwell on smaller measurements that merely contribute
to the whole such as query/response times or CPU cycles.
Visualization
Data visualization and exploration facilitate understanding of the data.9 To better
understand a variable, univariate plots of the distribution of values are useful. To examine
relationships among variables, bar charts and scatter plots (2-dimensional and 3-dimensional)
are helpful. To understand the relationships among large numbers of variables, correlation
tables are useful. However, huge quantities of data require more resources and more
time to plot and manipulate (as in rotating data cubes). Even with the many recent
developments in data visualization, one cannot effectively view huge quantities of data
in a meaningful way. A representative sample gives visual order to the data and allows
the analyst to gain insights that speed the modeling process.
Generalization
Samples obtained by appropriate sampling methods are representative of the entire database, and therefore little (if any) information is lost. Sampling is statistically dependable. It is a mathematical science based on the demonstrable laws of probability, upon which a large part of statistics is built.10
8Population refers to the entire collection of data from which samples are taken such as the entire database or data warehouse.
9Using data visualization techniques for exploration is Step 2 in the SEMMA process for data mining. For more information, see the
sources listed for data mining in the Recommended Reading section of this paper.
10For more information on the statistical bases of sampling, see the section The Statistical Validity of Sampling in this paper.
Economy
Data cleansing (detecting, investigating, and correcting errors, outliers, missing values,
and so on) can be very time-consuming. To cleanse the entire database might be a very
difficult and frustrating task. To the extent that a well-designed data warehouse or mart
is in place, much of this cleansing has already been addressed. However, even with
clean data in a warehouse or mart, additional pre-processing may be useful for the
data mining project. There may still be missing values or other fields that need to be
modified to address specific business problems. Business problems may require certain
data assumptions, which indicate the need for additional data preparation. For example, a missing value for DEPENDENTS actually could be missing, or the value could be zero. Rather than leave these values as missing, for analytical purposes you may want to impute a value based on the available information.
If a well-designed and well-prepared data warehouse or mart is in place, less data pre-processing is necessary. The remaining data pre-processing is performed as needed for each specific business problem, and it is much more efficiently done on a sample.
Disadvantages
Not all samples are created equal. To be representative, a sample should reflect the
characteristics of the data. Business analysts must know the data well enough to preserve
the important characteristics of the database. In addition, the technology must be robust
enough to perform various sampling techniques, because different sampling techniques are
appropriate in different situations.
11SAS Institute Inc. has agreements with a number of data providers including Claritas Inc., Geographic Data Technologies, and
Acxiom Corporation.
Using statistical sampling methods can reveal valuable information and complex relationships in large amounts of data, relationships that might otherwise be hidden in a company's data warehouse.
For example, Figure 3 graphs customer income levels. Graph A represents the entire database
(100 percent of the records), and reveals that most of the customers are of middle income.
As you go toward the extremes, either very low or very high income levels, the number
of customer records declines. Graph B, which is a sample of 30 percent of the database,
reveals the same overall pattern or shape of the database. Statisticians refer to this pattern
as the distribution of the data.
If we were to increase the size of the sample, it would continue to take on the distribution of the database. Taken to a logical extreme, if we read the entire database, we have the largest, most comprehensive sample available: the database itself. The distribution would, of course, be the same, and any inferences we make about that sample would be true for the entire database.12
12For more information about the size of samples, see the section Sampling as a Best Practice in Data Mining: Determining the
Sample Size in this paper.
But how is it possible to construct unbiased samples? How can we ensure that the samples
used in statistical analyses such as data mining projects are representative of the entire
database? The answer lies in the procedures used to select records from the database.
The sampling selection procedures determine the likelihood that any given record will
be included in the sample. To construct an unbiased sample, the procedure must be based
on a proven, quantifiable selection method, one from which reliable statistics can be
obtained. In the least sophisticated form of sampling, each record has the same probability
of being selected.
Random in this context does not mean haphazard or capricious. Instead, random refers to a lack of bias in the selection method. In a random sample, each record in the database has an equal chance of being selected for inclusion in the sample. Random sampling, in a sense, levels the playing field for all records in a database, giving each record the same chance of being chosen and thereby ensuring that a sample of sufficient size will represent the overall pattern of the database. For example, to create a simple random sample from a customer database of one million records, you could use a selection process that assigns numbers to all of the records, one through one million, and then randomly selects numbers (Hays 1973, pp. 20-22).
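To make the selection process concrete, the following sketch shows one way such a simple random sample could be drawn in SAS code. It assumes a hypothetical WORK.CUSTOMERS data table; the variable names and sizes are illustrative only.

/* Sketch: draw a simple random sample in which every record in the    */
/* hypothetical WORK.CUSTOMERS table has the same chance of selection. */
proc surveyselect data=work.customers   /* the one-million-record database */
                  out=work.srs_sample   /* the resulting sample            */
                  method=srs            /* simple random sampling          */
                  sampsize=10000        /* number of records to select     */
                  seed=12345;           /* seed, for reproducibility       */
run;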
Not all data are distributed normally. Graphical representations of databases often reveal
that the data are skewed in one direction or the other. For example, if in Figure 3, most of
the customers had relatively lower incomes, then the distribution would be skewed to the
right. If these data were randomly sampled, then as the sample size gets larger, the sample's distribution would better reflect the distribution of the entire database.
For example, assume that we have a customer database of one million records from a
national chain of clothing stores. The chain specializes in athletic wear with team logos.
As a result of that specialization, 2 out of 3 of the records in the database are for customers
between ages 8 and 28, who prefer that style of clothing.
As you might expect, when you randomly select records from the database, you are more likely to obtain records for customers age 8 to 28 than you are to obtain other records, simply because, proportionately, there are more 8- to 28-year-old customers. Perhaps you might not obtain a record for that age group the first time, or even the second or third time, but as more records are selected, your sample would eventually reflect the 2-out-of-3 ratio.13
If we extend the logic behind random samples to a more complex data mining scenario, we
can see how random samples are the foundation for the exploration, analysis, and other
steps in the SEMMA methodology. For example, if
23 percent of them bought at least one jacket and a copy of a sports-related magazine as
a part of the same purchase, and
you could expect a random sample of sufficient size to reveal those tendencies as well.
13The laws of probability are based in part on the notion of randomly sampling an infinite number of times. Hays provides a good explanation of this basis in the section titled "In the Long Run" (1981, pp. 22-25).
Stated simply, standard deviation is the average distance of the data from the mean. Confidence
level refers to the percentage of records that are within a specified number of standard
deviations from the mean value. For a normal (symmetric, bell-shaped) distribution, approx-
imately 68 percent of all records in a database will fall within a range that is 1 standard
deviation above and below the mean. Approximately 95 percent of all records will fall
within a range that is 2 standard deviations above and below the mean (Hays 1981, p. 209).
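As a rough check of these percentages on real data, the following sketch computes the proportion of records within 1 and within 2 standard deviations of the mean. It assumes a hypothetical WORK.CUSTOMERS table with an INCOME variable; for data that are approximately normal, the two proportions should be near 68 percent and 95 percent.

/* Compute the mean and standard deviation of INCOME. */
proc means data=work.customers noprint;
   var income;
   output out=work.stats mean=mu std=sigma;
run;

/* Flag each record as within 1 and within 2 standard deviations. */
data work.check;
   if _n_ = 1 then set work.stats;              /* MU and SIGMA are retained */
   set work.customers;
   within1 = (abs(income - mu) <= sigma);
   within2 = (abs(income - mu) <= 2*sigma);
run;

/* The means of the flags are the proportions within 1 and 2 standard deviations. */
proc means data=work.check mean;
   var within1 within2;
run;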
In summary, when we have a database and use an unbiased selection process to obtain records, then at some point, as more and more records are selected, the sample will reveal the distribution of the database itself. In addition, by applying mathematical formulas based on the laws of probability, we can determine other characteristics of the entire database. These sampling techniques and their underlying principles hold true for all populations regardless of the particular application. It does not matter whether we are rolling dice, blindly pulling red and white marbles from a barrel, or randomly selecting records from a year's worth of customer transaction data in order to construct a sample data table for use in a data mining project.14
Except for certain basic information required for every person for constitutional
or legal reasons, the whole census was shifted to a sample basis. This change,
accompanied by greatly increased mechanization, resulted in much earlier
publication and substantial savings.
Today, the U.S. government publishes numerous reports based on sample data obtained
during censuses. For example, sampling conducted during the 1990 census was used as the
basis for a variety of reports that trace statistics relating to social, labor, income, housing,
and poverty (U.S. Bureau of the Census 1992).
Sampling is also used by local governments and by commercial firms. Television and radio
broadcasters constantly monitor audience sizes. Marketing firms strive to know customer
reactions to new products and new packaging. Manufacturing firms make decisions to
accept or reject whole batches of a product based on the sample results. Public opinion
and election polls have used sampling techniques for decades.
Sampling of an untreated group can provide a baseline for comparison with the treated group, and hence for assessing the effectiveness of the treatment. Researchers in various fields
routinely conduct studies that examine wide-ranging topics including human behavior and
health; sampling is often an integral part of these analyses.
14The techniques used in sampling (and the mathematical formulas that statisticians use to express those techniques) grew out of the study of probability. The laws of probability grew out of the study of gambling, in particular the work of the 17th-century mathematicians Blaise Pascal and Pierre de Fermat, who formulated the mathematics of probability by observing the odds associated with rolling dice. Their work is the basis of the theory of probability in its modern form (Ross 1998, p. 89).
When exact dollar and cents accounting figures are required. For example, in systems
that track asset/liability holdings, each individual account and transaction would need to
be processed.
When the entire population must be used. For example, the U.S. Constitution states that an actual enumeration of the U.S. population is to be performed (Article 1, Section 2, Paragraph 3).15
When the process requires continuous monitoring. For example, for seriously ill medical patients, in precision manufacturing processes, and for severe weather over airports.
When performing auditing in which every record must be examined such as in an audit
of insurance claims to uncover anomalies and produce exception reports.
The problem was two-fold. First, in 1948 the science of using sampling in political polls was
still in its infancy. Instead of using random sampling techniques, which would have given
each voter equal opportunity to be polled, the pollsters at that time constructed their sample
by trying to match voters with what they assumed was Americas demographic makeup. This
technique led the pollsters to choose people on the basis of age, race, and gender, thereby
skewing the results (Gladstone 1998).
The second error was one of timing that affected the quality of the sample data. Specifically,
the surveys upon which the Dewey/Truman prediction was made ended at least two weeks
before Election Day. Given the volatile political landscape of the late 1940s, those two
weeks were more than enough time for a major shift in votes from Dewey to Truman.
(McCullough 1992, p. 714).
Despite the now well-documented problems of the 1948 poll, myths about the validity of sampling persist. When evaluating the applicability of sampling to business intelligence technology such as data mining, the skepticism is often expressed in one of the following ideas:
15The Constitution reads, "The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years, in such Manner as they shall by Law direct." As the year 2000 approached, the legality and political ramifications of using statistical sampling for the Decennial Census were debated publicly and in court. For example, in its report to the President of the United States, the American Statistical Association argued in favor of using sampling to mitigate the inevitable undercount of the population (American Statistical Association 1996). However, an August 24, 1998 ruling by a U.S. federal court upheld the constitutional requirement for enumeration (U.S. House of Representatives v. U.S. Department of Commerce, et al. 1998). The federal court ruling has been appealed to the U.S. Supreme Court (Greenhouse 1998).
Sampling is difficult.
If the extraordinary customer's record is included, the model might be overly optimistic about predicting response to the mailing campaign. On the other hand, if the record is not included, the model may yield more realistic predictions for most customers, but the score for the extraordinary customer may not reflect that customer's true importance.
In fact, by not sampling, important information can be missed, because some of the most interesting application areas in data mining require sampling to build predictive models. Rare-event models require enriched or weighted sampling to develop models that can distinguish between the event and the non-event.17 For example, in trying to predict fraudulent credit card transactions, the occurrence of fraud may be as low as 2 percent; however, that percentage may represent millions of dollars in write-offs. Therefore, a good return on the investment of time and resources comes from developing a predictive model that effectively characterizes fraudulent transactions and helps the firm avoid some of those high-dollar write-offs.
There are many sophisticated modeling strategies from which to choose in developing the model. A critical issue is deciding which data to use: a sample, an enriched sample, or the entire database. If a simple random sample of the database is used, then it
is likely that very few of the rare events (fraudulent transactions) will be included in the
sample. If there are no fraudulent transactions in the sample that is used to develop the
model, then the resulting model will not be able to characterize which transactions are
fraudulent and which are not.
If the entire database is used to train18 a model and the event of interest is extremely rare,
then the resulting model may be unable to distinguish between the event and the non-events.
The model may correctly classify 99.95 percent of the cases, but the 0.05 percent that represent
the rare events are incorrectly classified.
Also, if the entire database is used to train the model and the event of interest is not rare, then the model may appear to be trained very well; in fact, it may be overtrained (or overfitted). An overtrained model is fitted not only to the underlying trends in the data but also, unfortunately, to the specific variations of this particular database. The model may predict this particular database very well, but it may be unable to correctly classify new
16Scoring is the process of applying a model to new data to compute outputs. For example, in a data mining project that seeks to predict
the results of the catalog mailing campaign, scoring the database might predict which recipients of the catalog will purchase which
goods and in what amounts.
17Also referred to as case-control sampling in biometrics and choice-based sampling in econometrics. For more information on the use of sampling in biometrics and econometrics, see Breslow and Day (1980) and Manski and McFadden (1981), respectively.
18Training (also known as fitting) a model is the mathematical process of calculating the optimal parameter values. For example, a
straight line is determined by two parameters: a slope parameter and an intercept parameter. A linear model is trained as these
parameter values are calculated.
transactions (those not currently in the database). A serious problem of using the entire
database to develop the model is that no records remain with which to test or refine the
model's predictive capabilities.
By contrast, if an enriched sample is used to develop the model, then a larger percentage of
the rare fraudulent transactions are included in the sample, while a smaller percentage of
the non-fraudulent transactions are included. The resulting sample has a larger percentage
of fraudulent cases than the entire database. The resulting model may be more sensitive to
the fraudulent cases, and hence very good at characterizing fraudulent transactions. Moreover,
the model can also be tested and refined on the remaining records in the database (those not
used to develop the model). After the model is fully developed, then it can be field-tested
on new transaction data.
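A minimal sketch of one way to construct such an enriched sample follows. It assumes a hypothetical WORK.TRANSACTIONS table with a binary FRAUD indicator; the per-stratum rates are arbitrary and would be chosen to give the rare cases the desired representation in the sample.

/* Enriched (oversampled) sample: keep a much larger fraction of the   */
/* rare fraudulent records than of the non-fraudulent ones.            */
proc sort data=work.transactions out=work.trans_sorted;
   by fraud;                                /* strata must be grouped        */
run;

proc surveyselect data=work.trans_sorted out=work.enriched
                  method=srs seed=12345
                  samprate=(0.02 0.50);     /* 2% of FRAUD=0, 50% of FRAUD=1 */
   strata fraud;
run;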
Another concern about sampling is that a technique may be inappropriately applied. The problem is that if an inappropriate sampling technique is applied to a data mining project, then important information needed to solve the business problem may be overlooked and thereby excluded from the sample. To ensure that all relevant information is included, business analysts must be familiar with the data as well as the business problem to be solved. To illustrate, again consider a proposed mailing campaign. If a simple random sample were selected, then important information like geographic region may be left out. There may be a strong interaction between when the catalog is to be mailed and where it is to be mailed. If the proposed catalog is to contain cold-weather goods and winter apparel, then a simple random sample of the national database may be inappropriate. It may be much better to stratify the random sampling on region, or even to exclude the extreme south from the mailing.
Without the use of modern software technologies, this argument has merit. However, software
that is designed to enable modern statistical sampling techniques can assist business analysts
in understanding massive amounts of data by enabling them to apply the most effective
sampling techniques at the optimal time in the data mining process. For example, easy-to-use but powerful graphical user interfaces (GUIs) provide enhanced graphical capabilities
for data visualization and exploration. GUIs built on top of well-established yet complex
statistical routines enable practitioners to apply sampling and analytical techniques rapidly
and make assessments and adjustments as needed.
network bandwidth
throughput
memory management
From the perspective of the resources required, data mining is the process of selecting
data (network bandwidth and throughput), exploring data (memory), and modeling data
(memory management and CPU) to uncover previously unknown information for a
competitive advantage.
Another sampling myth related to hardware resources is the idea that parallel processing
is a requirement for data mining. In fact, in its worst incarnation, this myth states that parallel
processing is a panacea for the problems of mining massive amounts of data. Although parallel
processing can increase the speed with which some data processing tasks are performed, simply
applying parallel processing to a data mining project ignores the fact that data mining is not
merely a technical challenge. Instead, along with hardware and software challenges, the
massive data processing tasks known collectively as data mining have as their impetus logical,
business-oriented challenges. From the business perspective, data mining is a legitimate investment, one that is expected to provide a healthy ROI because of the way it supports the business goals of the organization.
To support business goals, data mining must itself be understood and practiced within a
logical process such as the SEMMA methodology. Simply addressing a portion of the technical
challenge by adding parallel processors ignores the fact that many of the constraints on a
data mining project can recur throughout the process. For example,
Parallelizing the CPU operations also should raise some other concerns. In particular, more
threads can create conflicting demands for critical system resources such as physical memory.
Beyond the physical memory problem is the problem of matching the workloads for the
various threads in the parallel environment. If the workload per thread is not properly
matched, then the parallel algorithm simply uses more CPU time and does not reduce
the elapsed time to deliver the results.
What you are doing with an extraction is taking a representative sample of the
data set. This is similar to the way in which statistical sampling traditionally has
In the comparatively new discipline of data mining, the use of statistical sampling poses
some new questions. This section addresses several of the more common concerns about
how best to use sampling in data mining projects. In particular, this section addresses the
following questions:
What are the common types of sampling that apply to data mining?
What are some general sampling strategies that can be used in data mining?
Should multiple samples be taken for special purposes such as validation and testing?
For example, if the data mining problem is to profile customers, then all of the data for a
single customer should be contained in a single record. If you have data that describe a customer in multiple records, then you could use the data warehouse to rearrange the data prior to sampling.
First N Sampling
The first n records are included in the sample. If the database records are in random
order, then this type of sampling produces a random sample. If the database records
are in some structured order, then the sample may capture only a portion of that structure.
Cluster Sampling
Each cluster of database records has the same chance of being included in the sample.
Each cluster consists of records that are similar in some way. For example, a cluster could be all of the records associated with the same customer, representing different purchases at various times.
Stratified Random Sampling
Within each stratum, all records have the same chance of being included in the sample.
Across the strata, records generally do not have the same probability of being included
in the sample.
Stratified random sampling is performed to preserve the strata proportions of the
population within the sample. In general, categorical variables are used to define the
strata. For instance, gender and marital status are categorical variables that could be
used to define strata. However, avoid stratifying on a variable with too many levels (or
too few records per level) such as postal codes, which can have many thousands of levels.
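As an illustration, a proportional stratified random sample like the one just described might be drawn as in the following sketch, which assumes a hypothetical WORK.CUSTOMERS table with GENDER and MARITAL_STATUS variables. Using a single rate for all strata preserves the population proportions in the sample.

/* Stratified random sample: the same 10 percent rate in every stratum */
/* preserves the strata proportions of the population.                 */
proc sort data=work.customers out=work.cust_sorted;
   by gender marital_status;                /* strata must be grouped   */
run;

proc surveyselect data=work.cust_sorted out=work.strat_sample
                  method=srs samprate=0.10 seed=12345;
   strata gender marital_status;
run;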
What is the functional form of the model (linear, linear with interaction terms, nonlinear,
and so on)?
If the answers to these questions are known, then sampling theory may be able to provide a reasonably good answer for the required sample size. The less confidence you have in the answers to these questions, the more you will rely on exploring the data and iterating through the SEMMA process.
The specifics of the statistical formulas used to determine optimal sample sizes can be complex
and difficult for the layperson to follow, but it is possible to generalize about the factors one
needs to consider. Those factors are
19For sources that include formulas for determining sample sizes, see Cochran (1977, pp. 72 ff) and Snedecor and Cochran (1989, pp.
52-53, and 438-440).
Complexity of Data
The first step to understanding the complexity of the data is to determine what variable or
variables are to be modeled. Very different models can be developed if the target variable is
a continuous variable (such as amount of purchase), rather than a two-level variable (such
as one that indicates purchase or no purchase). In a well-defined analysis, the researcher
will likely have some prior expectations and some confidence in those expectations.
Depending on the question or questions being asked, the outliers may be the most informative records or the least informative records. Even in the worst case, you should have some idea of what the target variable is when doing predictive data mining. If the target consists of a rare event and a commonly occurring event, then you can stratify on it. If there are multiple records per customer, you can perform cluster sampling, and so on. You should explore the data well enough to identify the stratification variables and outlying records. It may be important to stratify the sample using some category such as customer gender or geographic region. For some business problems, such as fraud detection, the outlying records are the most informative and should be included in the sample. However, for other problems, such as general trend analysis, the outlying records are the least informative ones.
While some level of complexity exists in most data tables, many business problems can be
effectively addressed using relatively simple models. Some solutions necessitate limiting the
model complexity due to outside constraints. For example, regulators might require that the model be no more complex than necessary so that its parameters can be easily understood.20
Figure 5: Example Surface Plots for Fitted Models: Regression, Decision Tree, and Neural Network
20Modeling can benefit from the application of Ockham's Razor, a precept developed by the English logician and philosopher William of Ockham (circa 1285 to 1349), which states that "entities ought not to be multiplied except of necessity" (Gribbon 1996, p. 299).
21A simple regression model has one input variable linearly related to an output variable: Y = a + bX. More complex regression models include more variables, interaction terms, and polynomials of the input variables. A simple neural network having no hidden layers is equivalent to a class of regression models. More complex neural networks include nonlinear transformations of the input variables, hidden layers of transformations, and complex objective functions. A simple decision tree has only a few branches (splits) of the data before reaching the terminal nodes. More complex decision trees have many splits that branch through many layers before reaching the terminal nodes.
Decision tree analysis and clustering of records are iterative processes that sift the data to
form collections of records that are similar in some way.
Neural networks are still more complex, with nonlinear relationships and many more parameters to estimate. In general, the more parameters in the model, the more records are required.22
Some modeling questions are easier to answer after exploring a sample of the data. For
example, are the input variables linearly related to the target variable? As knowledge about
the data is discovered, it may be useful to repeat some steps of the SEMMA process. An
initial sample may be quite useful for data exploration. Then, when the data are better understood, a more representative sample can be used for modeling.
The structure of the records may also influence the sampling strategy. For example, if the
data structure is wide (data containing more variables for each record than individual records),
then more variables have to be considered for stratification, for inclusion in the model, for
inclusion in interaction terms, and so on. A large sample may be needed. Fortunately, some
data mining algorithms (CHAID and stepwise regression, for example) automatically assist
with variable reduction and selection.
By contrast, if the data structure is deep (data containing a few variables and many individual
records), then as the sample size grows, more patterns and details may appear. Consider
sales data with patterns across the calendar year. If only a small number of records are
selected, then perhaps, only quarterly patterns appear; for example, there is a winter sales
peak and a summer sales slump. If more records are included in the sample, perhaps
monthly patterns begin to appear followed by weekly, daily, and possibly even intra-day
sales patterns. As a general sampling strategy, it may be very useful to first take a relatively
wide sample to search for important variables (inputs), and then to take a relatively deep
sample to model the relationships.
A common strategy is to partition the sample into three smaller data tables:
training
validation
testing
The training data table is used to train models, that is, to estimate the parameters of the model. The validation data table is used to fine-tune and/or select the best model. In other words, based on some criterion, the model with the best criterion value is selected; for example, the smallest mean square forecast error is an often-used criterion. The test data table is used to test the performance of the selected model. After the best model is selected and tested, it can be used to score the entire database (Ripley 1996, p. 354).
Each record in the sample can appear in only one of the three smaller data tables. When you partition the sample data table, you may want to use the simple random sampling technique again, or stratified random sampling may be appropriate. First, you might randomly select a small fraction of the records from a 5-terabyte database; in general, simpler models require smaller sample sizes and more complex models require larger sample sizes. Then, you might use random sampling again to partition the sample into 40 percent training, 30 percent validation, and 30 percent test data tables.
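A minimal sketch of such a partition is shown below, assuming the sample resides in a hypothetical WORK.SAMPLE table. (In the case study later in this paper, the Data Partition node performs this step through its dialog pages; the code is illustrative only.)

/* Random 40/30/30 partition of the sample into training, validation,  */
/* and test data tables.                                                */
data work.train work.validate work.test;
   set work.sample;
   u = ranuni(67890);                           /* uniform random number in (0,1) */
   if u < 0.40 then output work.train;          /* about 40 percent               */
   else if u < 0.70 then output work.validate;  /* about 30 percent               */
   else output work.test;                       /* about 30 percent               */
   drop u;
run;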
Finally, when working with small data tables, bootstrapping may be appropriate. Bootstrapping
is the practice of repeatedly analyzing sub-samples of the data. Each sub-sample is a random
sample with replacement from the full sample.
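A common way to draw bootstrap sub-samples is to sample with replacement repeatedly, as in the sketch below; it assumes the full sample is in a hypothetical WORK.SAMPLE table, and each replicate can then be analyzed separately.

/* 200 bootstrap samples, each the size of the original sample,        */
/* drawn with replacement.                                              */
proc surveyselect data=work.sample out=work.boot
                  method=urs             /* unrestricted (with replacement) sampling */
                  samprate=1             /* each replicate as large as the input     */
                  reps=200               /* number of bootstrap replicates           */
                  outhits                /* one output row per selection             */
                  seed=12345;
run;
/* The REPLICATE variable in WORK.BOOT identifies each bootstrap sample. */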
The case study used SAS Enterprise Miner software running on Windows NT Server™ with Windows 95™ clients, and processed approximately 11 million records (approximately 2 gigabytes).23 We sampled incrementally, starting at 100,000 records (a 0.03 percent sample), then 1 percent and 10 percent samples, concluding with 100 percent, or all 11 million records. We also sampled repeatedly (100 runs for each sample size, with 1 run at 100 percent) to show that sampling can yield accurate results. Plotting the classification rate for the sample sizes shows the resulting accuracy. (See Figure 19, Incremental Sample Size and the Correct Classification Rates.)
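The repeated sampling can be expressed compactly; for example, the sketch below draws 100 replicate 1 percent simple random samples in a single pass. The input table WORK.CHURN is hypothetical, and the code is illustrative rather than the exact procedure used in the case study.

proc surveyselect data=work.churn out=work.samples_1pct
                  method=srs             /* simple random sampling           */
                  samprate=0.01          /* 1 percent of the records         */
                  reps=100               /* 100 independent samples          */
                  seed=12345;
run;
/* The REPLICATE variable distinguishes the 100 samples for separate model runs. */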
Enterprise Miner provides multiple sampling techniques and multiple ways to model churn.
Figure 6 outlines the steps in the customer churn project and the nodes used to perform
those steps.
We begin by using the graphical user interface (GUI) of Enterprise Miner software. By
dragging and dropping icons onto the Enterprise Miner workspace, we build a process flow
diagram (PFD) that fits the requirements of our data mining project. The icons that we drag
and drop on the workspace represent nodes, which open to tabbed dialogs where a variety
of statistical tasks can be performed. Using the mouse, we connect the nodes in the workspace
in the order in which they are needed in the process flow. Information flows from node to
node along this path. In addition to running the PFD as a part of the churn analysis, we can
save the diagram for later use, and export it to share with others who have interest in the project.
Figure 7 shows the process flow diagram for the customer churn analysis project.
23The Windows NT Server™ 4.0 system with Service Pack 3 was configured as follows: 4-way Pentium Pro™, 200 MHz, 512K L2 cache, 4 GB physical RAM, Ultra SCSI controllers, internal and external RAID 5 arrays, 100 Mbps Ethernet™.
Interval Variables
Each row in the table represents an interval variable. The variables are listed in the column
labeled Name, and the remaining columns show (for each variable) the minimum, maximum, mean,
standard deviation, number of missing values, skewness coefficient, and kurtosis coefficient.
Class Variables
For class variables, we see the number of levels, order, and hierarchical dependencies that
may be in the data as shown in Figure 9.
Each row represents a class variable. The column Values indicates the number of levels for
each class variable. The column Order indicates how the levels are ordered. For example,
the variable CHILD has two levels, which are listed in ascending order.
24The meta-sample can be set to any size and is used only to get information about the data. All calculations are done on either the entire input data source or the training data table, not the meta-sample. The automatically generated metadata can be changed if you choose to override the default best guess.
The variable P represents whether a customer has churned (switched firms). P has two levels:
0 indicating churn and 1 indicating no churn. (Enterprise Miner software enables you to
model either the event or the non-event regardless of whether the event is indicated by 0, 1,
or some other means.)
Figure 10 shows the percentage for each level of the variable P. In this case, there is a low
occurrence of churn (represented by the 0 column). Therefore, we should stratify any subsets
of data we create to ensure that sufficient records are represented. By stratifying, we maintain
in the sample the same percentage of records in each level that are present in the entire
database. Otherwise, we might obtain a sample with very few or even no customers who had switched firms.
Figure 10: Percentage of Churn and No Churn (vertical axis: percentage; horizontal axis: the variable P, where 0 indicates churn and 1 indicates no churn)
For enriched sampling, prior probabilities can be used to weight the data records for proper assessment of the prediction results. One weighting scheme is to use the proportion in each level to assign weights to the records in the assessment data table. For example, given a variable with two levels, event and non-event, that has prior probabilities of .04 and .96, respectively, and whose proportions in the assessment data table are .45 and .55, respectively, the sampling weight for the event level equals .04/.45 and the sampling weight for the non-event level equals .96/.55.
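As a small worked example of this weighting scheme (using the hypothetical prior probabilities and proportions given above), the following DATA step computes the two sampling weights.

/* Sampling weight = prior probability / proportion in the assessment data table. */
data work.weights;
   length level $ 9;
   level = 'event';     prior = 0.04; proportion = 0.45;
   weight = prior / proportion;       /* 0.04/0.45, approximately 0.089 */
   output;
   level = 'non-event'; prior = 0.96; proportion = 0.55;
   weight = prior / proportion;       /* 0.96/0.55, approximately 1.745 */
   output;
run;

proc print data=work.weights noobs;
run;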
Sampling Methods
As Figure 11 shows, the Sampling node supports the following types of sampling:
Simple Random
By default, every record in the data table has the same probability of being selected for
the sample.
Sampling Every Nth Record
Sampling every nth record is also known as systematic sampling.25 This setting computes the percentage of the population that is required for the sample, or uses the percentage specified in the General tab. It then divides 100 percent by this percentage to obtain a sampling interval, n, and selects every record whose position in the data table is a multiple of n. (See the code sketch that follows this list of sampling types.)
Stratified Sampling
In stratified sampling, one or more categorical variables are specified from the input data
table to form strata (or subsets) of the total population. Within each stratum, all records
have an equal probability of being selected for the sample. Across all strata, however,
the records in the input data table generally do not have equal probabilities of being
selected for the sample. We perform stratified sampling to preserve the strata proportions
of the population within the sample. This may improve the classification precision of
fitted models, which is why we have chosen this method for our case study.
Sampling the First N Records
This type of sampling selects the first n records from the input data table for the
sample.26 We can specify either a percentage or an absolute number of records to
sample in the General tab.
Cluster Sampling
This method builds the sample from a cluster of records that are similar in some way.
For example, we might want to get all of the records for each customer in a random sample of customers.
25Sampling every nth record can produce a sample that is not representative of the population, particularly if the input data table is not
in random order. If there is a structure to the input data table, then the nth-record sample may reflect only a part of that structure.
26Sampling the first n records can produce a sample that is not representative of the population, particularly if the input data table is
not in random order.
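For readers who prefer to see these selection schemes outside the GUI, the sketch below shows roughly equivalent code for systematic, first-n, and cluster sampling. The WORK.CUSTOMERS table and the CUSTOMER_ID cluster identifier are hypothetical, and the Sampling node performs the same operations through its dialog pages.

/* Systematic sampling: a 10 percent rate selects roughly every 10th record. */
proc surveyselect data=work.customers out=work.sys_sample
                  method=sys samprate=0.10 seed=12345;
run;

/* First-n sampling: simply read the first 10,000 records. */
data work.firstn_sample;
   set work.customers(obs=10000);
run;

/* Cluster sampling: select customers at random, then keep every record */
/* (for example, every purchase) that belongs to a selected customer.   */
proc sort data=work.customers(keep=customer_id) out=work.ids nodupkey;
   by customer_id;
run;

proc surveyselect data=work.ids out=work.sample_ids
                  method=srs samprate=0.10 seed=12345;
run;

proc sql;
   create table work.cluster_sample as
   select c.*
   from work.customers as c, work.sample_ids as s
   where c.customer_id = s.customer_id;
quit;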
Sample Size
We can specify sample size as a percentage of the total population, or as an absolute number
of records to be sampled. The default percentage is 10 percent and the default number of
records is 10 percent of the total records. The actual sample size is an approximate percentage
or number. For example, a 10 percent random sample of 100 records may contain 9, 10, or
11 records. For the case study, we chose to specify both percentages and exact numbers.
For the smaller samples, 100 thousand records were used (less than 1 percent of the data
table); for larger samples, 1 percent and 10 percent were used.
Random Seed
The Sampling node displays the seed value used in the random number function for each
sample. The default seed value is set to 12345. In our case study, we change the seed repeat-
edly to look at hundreds of different samples of different sizes. The Sampling node saves
the seed value used for each sample so those samples can be replicated exactly. (If we set the
seed to 0, then the computer clock at run time is used to initialize the seed stream. Each
time we run the node, a new sample will be created.)
Stratification
Stratified random sampling is appropriate in this example because the levels of the categori-
cal data could be easily under- or over-represented if simple random sampling was used. We
need to stratify on the variable P, the variable that indicates whether or not the customer has
churned.
In general, to perform stratified random sampling, we first select the variable(s) on which we
want to stratify. Next we choose the option settings on how the stratification is to be performed
to achieve a representative sample.
Variables
The Variables sub-page (Figure 12) contains a data table that lists the variables that are appropriate for use as stratification variables. Stratification variables must be categorical (binary, ordinal, or nominal); ours is binary: churn (P) is either 0 or 1.
Options
For our example, we are using the default settings for options. However, the Options
sub-page (Figure 13) allows us to specify various details about the stratification.
The available stratification criteria are:
proportional sampling
equal size27
optimal allocation28
All stratified sampling methods in the Sampling node use computations that may require
rounding the number of records within strata. To take a stratified sample on churn, we use
proportional stratified sampling, whereby the proportion of records in each stratum is the
same in the sample as it is in the population. We can review the results of the sample in
Figure 14, the Sampling Results Browser:
27The equal size criterion specifies that the Sampling node sample the same number of records from each stratum.
28With optimal allocation, both the proportion of records within strata and the relative standard deviation of a specified variable within
strata are the same in the sample as in the population.
The training data table is used for preliminary model fitting. The validation data table is
used to monitor and tune the model weights during estimation and is also used for model
assessment. The test data table is an additional holdout data table that we can use for
further model assessment. In the same way that we used stratified sampling to pull the
original sample, we again stratify on churn when partitioning the data into training,
validation, and test data tables, which ensures that each of them contains sufficient cases of churn.
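A minimal sketch of such a stratified three-way partition, using scikit-learn for illustration, appears below. The 40/30/30 proportions, table size, and churn rate are assumptions for the example and are not taken from the case study.

```python
# Illustrative sketch: stratified partition into training, validation,
# and test tables so that the churn rate is preserved in each.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)
sample = pd.DataFrame({
    "customer_id": np.arange(10_000),
    "churn": (rng.random(10_000) < 0.05).astype(int),   # hypothetical 5% churn
})

# First carve off the training table, then split the remainder in half.
train, rest = train_test_split(
    sample, train_size=0.40, stratify=sample["churn"], random_state=12345)
validate, test = train_test_split(
    rest, train_size=0.50, stratify=rest["churn"], random_state=12345)

for name, part in (("train", train), ("validate", validate), ("test", test)):
    print(name, len(part), round(part["churn"].mean(), 4))   # churn rate preserved
```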
The Regression node automatically checks the data, and when a binary target variable is
detected, it uses a logistic regression model. If a continuous target variable is detected, then
a linear regression model is used. The Neural Network node and the Decision Tree node
have similar data checking capabilities.
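The kind of check described above can be sketched in a few lines; the Python below is illustrative only and is not the Regression node's internal logic.

```python
# Illustrative sketch: choose a logistic model for a binary target and a
# linear model for a continuous target.
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

def choose_regression(target: pd.Series):
    if target.dropna().nunique() == 2:       # binary target detected
        return LogisticRegression(max_iter=1000)
    return LinearRegression()                # otherwise treat as continuous

print(choose_regression(pd.Series([0, 1, 1, 0])))      # logistic regression
print(choose_regression(pd.Series([3.2, 5.1, 4.8])))   # linear regression
```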
Figure 16 shows the Regression node Results Browser, which provides part of the estimation
results from fitting the logistic regression model.
The bar chart shows the variables that were most important in explaining the likelihood of
a customer churning. The variables are sorted by their t-score, a statistic that indicates the
importance of each variable in the model. For this model, DWELLING was the most
important variable.
The variable DWELLING has two levels, M for multi-family and S for single-family dwellings
(see Figure 9). Further analysis of DWELLING indicates that customers in multi-family
dwellings are more likely to churn than those in single-family dwellings.
The other variables lend themselves to similar interpretations. We can also develop decision
tree models and neural networks that further identify the characteristics of customers who
are likely to switch. For the purposes of this case study, we proceed using the fitted logistic
regression model.
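The ranking of inputs by t-score described above can be sketched as follows. The Python fragment is illustrative only: the data are simulated, the second input TENURE_MONTHS is hypothetical, and DWELLING is simply recoded as a 0/1 indicator for multi-family dwellings.

```python
# Illustrative sketch: fit a logistic regression for churn and rank the
# inputs by the magnitude of their t-scores (Wald statistics).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(12345)
n = 5_000
data = pd.DataFrame({
    "dwelling_multi": rng.integers(0, 2, n),     # 1 = multi-family, 0 = single-family
    "tenure_months":  rng.integers(1, 120, n),   # hypothetical second input
})
# Simulate churn so that multi-family dwellings are more likely to churn.
xb = 0.8 * data["dwelling_multi"] - 0.01 * data["tenure_months"] - 1.0
data["churn"] = (rng.random(n) < 1.0 / (1.0 + np.exp(-xb))).astype(int)

X = sm.add_constant(data[["dwelling_multi", "tenure_months"]])
fit = sm.Logit(data["churn"], X).fit(disp=0)

# Largest absolute t-score first; dwelling_multi should top the list here.
print(fit.tvalues.drop("const").abs().sort_values(ascending=False))
```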
In Figures 17 and 18, the actual values are listed on the left (1 for NO CHURN, 0 for
CHURN), and the model's predictions are listed at the bottom (1 for NO CHURN, 0 for CHURN).
The vertical axis shows the number (frequency) of records in each category. The model
predicted the correct classification for about 79 percent of the customers.
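The computation behind such a diagnostic chart is a simple cross-tabulation of actual against predicted values. The sketch below uses synthetic 0/1 flags whose agreement is forced to roughly 79 percent so that it mirrors the case-study figure; it is illustrative only.

```python
# Illustrative sketch: cross-tabulate actual versus predicted churn and
# report the correct classification rate.
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
actual = rng.integers(0, 2, 1_000)                                   # synthetic 0/1 flags
predicted = np.where(rng.random(1_000) < 0.79, actual, 1 - actual)   # ~79% agreement

print(pd.crosstab(pd.Series(actual, name="actual"),
                  pd.Series(predicted, name="predicted")))
print(f"correct classification rate: {(actual == predicted).mean():.1%}")
```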
Next, we created samples of varying sizes. There were 100 different samples for each of the
following percentages of the entire data table: 0.03 percent, 1 percent, 10 percent, and 20
percent. Note that each 0.03 percent sample has 100,000 records or 5 megabytes, and that
the entire data table has 11,000,000 records or 2 gigabytes.
Each of these 400 samples was partitioned, and then separate logistic regressions were run
for each of the training data tables. The logistic regression models were used to predict the
behavior of customers in the respective validation and test data tables. This resulted in a
distribution of correct classification rates for each group.
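The experiment can be sketched compactly in Python. The fragment below is illustrative only: the population is simulated and far smaller than the 11-million-record table, the sampling fractions and replicate count are reduced, and scikit-learn's logistic regression stands in for the Regression node.

```python
# Illustrative sketch: for each sampling fraction, draw repeated samples,
# fit a logistic regression on a training split, and score a holdout split,
# giving a distribution of correct classification rates per fraction.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)
n = 100_000
pop = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
p = 1.0 / (1.0 + np.exp(-(0.9 * pop["x1"] - 0.5 * pop["x2"] - 1.2)))
pop["churn"] = (rng.random(n) < p).astype(int)

rates = {}
for frac in (0.01, 0.05, 0.10):          # stand-ins for the study's fractions
    scores = []
    for rep in range(20):                # the case study used 100 samples per size
        sample = pop.sample(frac=frac, random_state=rep)
        train, holdout = train_test_split(
            sample, train_size=0.5, stratify=sample["churn"], random_state=rep)
        model = LogisticRegression().fit(train[["x1", "x2"]], train["churn"])
        scores.append(model.score(holdout[["x1", "x2"]], holdout["churn"]))
    rates[frac] = np.array(scores)
    print(frac, round(rates[frac].mean(), 4))
```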
To assess the accuracy of the classifications, the correct classification rates for each group were
averaged, and upper and lower 95 percent confidence limits were calculated. The average
(or mean) value could be compared with the actual classifications, which are obtained from
the entire data table. Did the samples preserve the characteristics of the entire data table?
Figure 19 plots a comparison of the correct classification rates among the sample sizes.
[Figure 19: mean correct classification rate (vertical axis, approximately 0.790 to 0.800) plotted against sample SIZE (horizontal axis: 0.03%, 1%, 10%, 20%), with the 95 percent confidence limits marked by plus signs.]
The mean value of the correct classification rates is shown on the vertical axis. The sample
size is shown on the horizontal axis. The stars plot the mean value of each group, and the
plus signs (+) plot the upper and lower 95 percent confidence limits. Note the consistency in
the mean values and in the confidence limits. On average, the different-sized samples provide
almost exactly the same accuracy in the classification rates.
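The summary behind Figure 19 amounts to a mean and a normal-approximation 95 percent confidence interval for each group of rates. The sketch below is illustrative; the classification rates it summarizes are simulated stand-ins, not the case-study results.

```python
# Illustrative sketch: mean and 95 percent confidence limits of the correct
# classification rates within each sample-size group.
import numpy as np

rates = {                                  # simulated stand-ins, 100 rates per group
    "0.03%": np.random.default_rng(1).normal(0.795, 0.002, 100),
    "1%":    np.random.default_rng(2).normal(0.795, 0.002, 100),
    "10%":   np.random.default_rng(3).normal(0.795, 0.002, 100),
    "20%":   np.random.default_rng(4).normal(0.795, 0.002, 100),
}

for size, r in rates.items():
    mean = r.mean()
    half = 1.96 * r.std(ddof=1) / np.sqrt(r.size)     # normal-approximation half-width
    print(f"{size:>6}: mean = {mean:.4f}, 95% CI = ({mean - half:.4f}, {mean + half:.4f})")
```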
In Figure 19, the plot of the correct classification rates looks impressive, but are the results
really that good? What about the differences in the classification rates between the sample
sizes? To answer these questions, we performed a statistical test on the difference of the means.
Figure 20 compares each sample size to the other sample sizes. The first set of three rows
uses the 0.03 percent sample size as the base, and then subtracts the mean of the 1 percent, 10
percent, and 20 percent samples. The second set of three rows uses the 1 percent sample as
the base, and so on. The column labeled Difference Between Means shows that the difference
in ability to correctly classify is in the fourth decimal place, or in hundredths of a percent.
Sample Size        Simultaneous Lower    Difference         Simultaneous Upper
Comparison         Confidence Limit      Between Means      Confidence Limit
0.03%-1%           -0.0008941            -0.00039            0.00011
0.03%-10%          -0.0007792            -0.00028            0.000225
0.03%-20%          -0.0016483            -0.00016            0.001337
There are two additional columns in the table that show the 95 percent upper and lower
confidence limits on the differences between the means. These values are small, and the
confidence limits overlap; that is, the mean for the 0.03 percent sample size falls
within the confidence limits for the 20 percent sample.
If a statistically significant difference did exist, then the difference in mean values would be
outside of the confidence interval (upper and lower 95 percent confidence limits). In every
case, the differences fall within the confidence limits. We conclude that there are no
statistically significant differences between the mean values of the classification rates. Models
fit using each group of samples have almost exactly the same accuracy in predicting whether
customers are likely to churn.
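One way to reproduce this kind of comparison is a Tukey test of all pairwise differences between the group means, which yields simultaneous 95 percent confidence limits like those in Figure 20. The sketch below, using statsmodels, is illustrative; the rates are simulated stand-ins for the 100 observed rates per sample size.

```python
# Illustrative sketch: pairwise differences between mean classification rates
# with simultaneous 95 percent confidence limits (Tukey's HSD).
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(12345)
sizes = ["0.03%", "1%", "10%", "20%"]
rates = np.concatenate([rng.normal(0.795, 0.002, 100) for _ in sizes])
groups = np.repeat(sizes, 100)

result = pairwise_tukeyhsd(endog=rates, groups=groups, alpha=0.05)
print(result.summary())   # each row: pair, mean difference, lower/upper limits, reject flag
```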
For data mining, the Institute has developed SAS Enterprise Miner software, which
enables business leaders to make better business decisions by controlling the process that
moves from data, to information, to business intelligence. However, moving from data to
business intelligence requires taking logical steps that will ensure the results of your data
mining efforts are reliable, cost efficient, and timely.
In data mining, reliability means that the results are based on modern analytical practices.
Thus, the inferences you make are quantifiable; that is, you can say with a specific degree of
confidence (for example, a 95 percent confidence level) that the output of your data mining
project lies between the values x and y.
Cost efficiency refers to the balance between the reliability of the results and the human
and computing resources that are expended to obtain them.
Timeliness refers to the need to obtain reliable, cost-effective results when you need them;
that is, in time to act proactively in a fast-moving, highly competitive global economy.
References
American Statistical Association, Blue Ribbon Panel on the Census, (1996 September). Report of the President's Blue Ribbon
Panel on the Census. Retrieved September 8, 1998 from the World Wide Web: http://www.amstat.org/census.html
Breslow, N.E. and N.E. Day, (1980), Statistical Methods in Cancer Research. Volume 1 - The Analysis of Case-Control Studies,
Lyon: IARC Scientific Publication No. 32.
Cochran, William G., (1977), Sampling Techniques, New York: John Wiley & Sons, Inc.
Elder, John F., IV and Daryl Pregibon, (1996), Advances in Knowledge Discovery & Data Mining, A Statistical Perspective
on KDD.
Gladstone, Brooke, (1998), Error Naming Campaign Winner, NPR Online, Morning Edition, November 3, 1998,
Retrieved November 6, 1998 from the World Wide Web:
http://www.npr.org/programs/morning/archives/1998/981103.me.html
Greenhouse, Linda, (1998), High Court to Hear Appeal of Census Ruling, The New York Times on the Web, (September 11),
Retrieved via site search on census September 11, 1998 from the World Wide Web: http://www.nytimes.com
Gribbon, John, (1996), Companion to the Cosmos, New York: Little, Brown and Company.
Hays, William L., (1981), Statistics, New York: CBS College Publishing.
International Data Corporation, (1998 June), Information Access Tools: 1998 Worldwide Markets And Trends, IDC
#15932, Volume: 1.
Manski, C. F. and Daniel McFadden, (1981), Structural Analysis of Discrete Data with Applications,
Cambridge, Mass: MIT Press.
META Group, (1997 August 19), Data Mining the SAS Data Warehouse, Application Delivery Strategies, File #594.
Phillips, John L., (1996), How to Think About Statistics, Fifth Edition, New York: W.H. Freeman and Company.
Potts, William J. E., (1997), Data Mining Using SAS Enterprise Miner Software. Cary, North Carolina: SAS Institute Inc.
Ripley, B.D., (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Ross, Sheldon, (1998), A First Course in Probability, Fifth Edition, Upper Saddle River, New Jersey: Prentice-Hall, Inc.
Saerndal, Carl-Erik, Bengt Swensson, and Jan Wretman, (1992), Model Assisted Survey Sampling. New York: Springer-Verlag.
Sarle, W.S., ed. (1997), Neural Network FAQ, periodic posting to the Usenet newsgroup comp.ai.neural-nets, URL:
ftp://ftp.sas.com/pub/neural/FAQ.html
Snedecor, George W. and William G. Cochran, (1989), Statistical Methods, Eighth Edition, Ames, Iowa:
The Iowa State University Press.
U.S. Bureau of the Census, (1992), Statistical Abstract of the United States, 112th edition, Washington, DC.
U.S. House of Representatives v. U.S. Department of Commerce, et al., (1998), Civil Action No. 98-0456 (Three Judge
Court) (RCL, DHG, RMU), Opinion filed August 24, 1998 by Circuit Judge Douglas H. Ginsburg, and District Court
Judges Royce C. Lamberth and Ricardo M. Urbina.
Westphal, Christopher and Teresa Blaxton, (1998), Data Mining Solutions: Methods and Tools for Solving
Real-World Problems, John Wiley and Sons.
Recommended Reading
Data Mining
Berry, Michael J. A. and Gordon Linoff, (1997), Data Mining Techniques, New York: John Wiley & Sons, Inc.
SAS Institute Inc., (1997), SAS Institute White Paper, Business Intelligence Systems and Data Mining,
Cary, NC: SAS Institute Inc.
SAS Institute Inc., (1998), SAS Institute White Paper, Finding the Solution to Data Mining: A Map of the Features and
Components of SAS Enterprise Miner Software, Cary, NC: SAS Institute Inc.
SAS Institute Inc., (1998), SAS Institute White Paper, From Data to Business Advantage: Data Mining, The SEMMA
Methodology and the SAS System, Cary, NC: SAS Institute Inc.
Weiss, Sholom M. and Nitin Indurkhya, (1998), Predictive Data Mining: A Practical Guide, San Francisco, California:
Morgan Kaufmann Publishers, Inc.
Data Warehousing
Inmon, W. H., (1993), Building the Data Warehouse, New York: John Wiley & Sons, Inc.
SAS Institute Inc., (1995), SAS Institute White Paper, Building a SAS Data Warehouse, Cary, NC: SAS Institute Inc.
SAS Institute Inc., (1998), SAS Institute White Paper, SAS Institute's Rapid Warehousing Methodology,
Cary, NC: SAS Institute Inc.
Singh, Harry, (1998), Data Warehousing Concepts, Technologies, Implementations, and Management, Upper Saddle
River, New Jersey: Prentice-Hall, Inc.
Statistics
Hays, William L. (1981), Statistics. New York: Holt, Rinehart and Winston.
Hildebrand, David K. and R. Lyman Ott, (1996), Basic Statistical Ideas for Managers, New York: Duxbury Press.
Mendenhall, William and Richard L. Scheaffer, (1973), Mathematical Statistics with Applications. North Scituate,
Massachusetts.
Phillips, John L., (1996), How to Think About Statistics, Fifth Edition, New York: W.H. Freeman and Company.
Credits
Data Mining and the Case for Sampling was a collaborative work. Contributors to the
development and production of this paper included the following persons:
Consultants
SAS Institute Inc.
Writers
SAS Institute Inc.
Anne H. Milley
James D. Seabolt, Ph.D.
John S. Williams
Technical Reviewers
SAS Institute Inc.
Copyeditor
SAS Institute Inc.
Rebecca Autore
Copyright 1998 by SAS Institute Inc., Cary, NC. All rights reserved. Credit must be given to the publisher. Otherwise,
no part of this publication may be reproduced without prior permission of the publisher. 19963US.0399 REV