Data Mining
Data comes in many forms, in many formats, and from multiple systems, as shown in Figure 2.2.
Identifying the right data sources and bringing them together are critical success factors.
Every data mining project has data issues:
inconsistent systems,
table keys that don’t match across databases,
records overwritten every few months, and so on.
Profiles are often based on demographic variables, such as geographic location, gender, and age.
Profiling uses data from the past to describe what happened in the past.
Prediction goes one step further.
Prediction uses data from the past to predict what is likely to happen in the future.
This is a more powerful use of data.
Building a predictive model requires separation in time between the model inputs or
predictors and the model output, the thing to be predicted.
If this separation is not maintained, the model will not work.
DIRECTED AND UNDIRECTED DATA MINING
The first three tasks, classification, estimation, and prediction, are examples of
directed data mining. Affinity grouping and clustering are examples of undirected
data mining. Profiling may be either directed or undirected. In directed data mining
there is always a target variable—something to be classified, estimated, or
predicted. The process of building a classifier starts with a predefined set of classes
and examples of records that have already been correctly classified. Similarly, the
process of building an estimator starts with historical data where the values of the
target variable are already known. The modeling task is to find rules that explain the
known values of the target variable.
In undirected data mining, there is no target variable.
The data mining task is to find overall patterns that are not tied to any one variable.
The most common form of undirected data mining is clustering, which finds groups of
similar records without any instructions about which variables should be considered as
most important.
Undirected data mining is descriptive by nature, so undirected data mining techniques
are often used for profiling, but directed techniques such as decision trees are also
very useful for building profiles.
In the machine learning literature, directed data mining is called supervised learning
and undirected data mining is called unsupervised learning.
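The contrast between the two styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `renewed` target, and the helper names `best_threshold` and `two_means` are invented for the example, and a one-variable threshold rule stands in for a real classifier. The directed function consults the known target; the undirected one never sees it.

```python
# Toy records: (monthly_usage, tenure_months). Values are made up for illustration.
records = [(5, 2), (7, 3), (6, 1), (40, 30), (42, 28), (38, 33)]
# Hypothetical target, used only in the directed case: 1 = renewed, 0 = did not.
renewed = [0, 0, 0, 1, 1, 1]


def best_threshold(values, target):
    """Directed: pick the cut on one input that best reproduces the known target."""
    best = None
    for cut in sorted(set(values)):
        predicted = [1 if v >= cut else 0 for v in values]
        correct = sum(p == t for p, t in zip(predicted, target))
        if best is None or correct > best[1]:
            best = (cut, correct)
    return best


def two_means(points, iterations=10):
    """Undirected: group similar records into two clusters; no target is consulted."""
    centers = [points[0], points[-1]]  # naive initialisation, fine for a sketch
    labels = []
    for _ in range(iterations):
        # Assign each point to the nearest center (squared Euclidean distance).
        labels = [
            min(range(2), key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Move each center to the mean of its current members.
        for k in range(2):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


usage = [r[0] for r in records]
cut, correct = best_threshold(usage, renewed)
clusters = two_means(records)
```

With this toy data both approaches happen to separate the same two groups, but only the directed rule can be scored against known answers.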
DATA MINING METHODOLOGY
Translate the business problem into a data mining problem.
Select appropriate data.
Get to know the data.
Create a model set.
Fix problems with the data.
Transform data to bring information to the surface.
Build models.
Assess models.
Deploy models.
Assess results.
Begin again.
STEP ONE: TRANSLATE THE BUSINESS PROBLEM INTO A DATA MINING PROBLEM
General goals should be broken down into more specific ones to make it easier to
monitor progress in achieving them.
Gaining insight into customer behaviour might turn into concrete goals:
■ Identify customers who are unlikely to renew their subscriptions.
STEP TWO: SELECT APPROPRIATE DATA
What Is Available?
The first place to look for data is in the corporate data warehouse. Data in the warehouse has
already been cleaned and verified and brought together from multiple sources. A single data
model hopefully ensures that similarly named fields have the same meaning and compatible
data types throughout the database.
Unfortunately, there is no simple answer to the question of how much data is enough. The answer depends on the particular
algorithms employed, the complexity of the data, and the relative frequency of possible
outcomes. Statisticians have spent years developing tests for determining the smallest model
set that can be used to produce a model.
Machine learning researchers have spent much time and energy devising ways to let parts of
the training set be reused for validation and test.
Data mining is most useful when the sheer volume of data obscures patterns that might be
detectable in smaller databases.
How Much History Is Required?
How far in the past should the data come from?
This is another simple question without a simple answer. The first thing to consider is seasonality.
At the same time, data from too far in the past may not be useful for mining because of changing market
conditions. This is especially true when some external event, such as a change in the regulatory
regime, has intervened.
How Many Variables?
Variables that had previously been ignored can turn out to have predictive value when used in
combination with other variables. For example, one credit card issuer that had never included
data on cash advances in its customer profitability models discovered through data mining
that this data had predictive value.
What Must the Data Contain?
At a minimum, the data must contain examples of all possible outcomes of interest. In
directed data mining, where the goal is to predict the value of a particular target
variable, it is crucial to have a model set composed of preclassified data.
STEP THREE: GET TO KNOW THE DATA
Examine Distributions
A good first step is to examine a histogram of each variable in the dataset and think
about what it is telling you.
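As a rough sketch of this first look at a variable, the snippet below bins values and prints a text histogram. The `ages` values and the helper name `text_histogram` are assumptions made for illustration; the 0 and 999 entries mimic placeholder values of the kind a histogram tends to expose.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Count integer values per fixed-width bin and render one '#' bar per bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(bins):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:<10}{'#' * bins[start]}")
    return bins, lines


# Made-up ages; 0 and 999 stand in for suspicious placeholder values.
ages = [23, 27, 31, 35, 36, 38, 41, 44, 0, 0, 0, 999]
bins, lines = text_histogram(ages, bin_width=10)
print("\n".join(lines))
```

A spike at zero or an isolated bar far from the rest is a prompt to ask what the value means, not something to delete automatically.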
Compare Values with Descriptions
Look at the values of each variable and compare them with the description given for
that variable in available documentation. This exercise often reveals that the
descriptions are inaccurate or incomplete.
Validate Assumptions
Using simple cross-tabulation and visualization tools such as scatter plots, bar graphs,
and maps, validate assumptions about the data.
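A simple cross-tabulation can be sketched with the standard library alone. The records, the field names `plan` and `status`, and the assumption being checked ("every active customer has a plan") are all hypothetical, chosen only to show the idea.

```python
from collections import Counter


def crosstab(rows, row_field, col_field):
    """Count records for every (row value, column value) combination."""
    return Counter((r[row_field], r[col_field]) for r in rows)


# Hypothetical records; field names are illustrative only.
customers = [
    {"plan": "basic", "status": "active"},
    {"plan": "basic", "status": "cancelled"},
    {"plan": "premium", "status": "active"},
    {"plan": "premium", "status": "active"},
    {"plan": "none", "status": "active"},  # contradicts the assumed rule
]

table = crosstab(customers, "plan", "status")
# Assumption under test: every active customer has a plan.
violations = table[("none", "active")]
```

A non-zero count in a cell that should be empty means the assumption does not hold for this data and needs to be revisited before modeling.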
STEP FOUR: CREATE A MODEL SET
The model set contains all the data that is used in the modeling process.
Some of the data in the model set is used to find patterns.
Some of the data in the model set is used to verify that the model is stable.
Creating a model set requires assembling data from multiple sources to form customer signatures and then
preparing the data for analysis.
Including Multiple Timeframes
The primary goal of the methodology is creating stable models. Among other things, that means models that
will work at any time of year and well into the future. This is more likely to happen if the data in the model set
does not all come from one time of year.
Creating a Model Set for Prediction
Although the model set should contain multiple timeframes, any one customer signature should have a gap in
time between the predictor variables and the target variable. Time can always be divided into three periods:
the past, present, and future. When making a prediction, a model uses data from the past to make predictions
about the future.
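One way to picture the gap between predictors and target is a function that builds a single customer signature from dated events. This is a minimal sketch, not a prescribed implementation: the event layout, the dates, and the names `build_signature`, `cutoff`, `target_start`, and `target_end` are assumptions introduced for the example.

```python
from datetime import date


def build_signature(events, cutoff, target_start, target_end):
    """Predictors use only events before `cutoff`; the target uses only the later window."""
    if not (cutoff <= target_start <= target_end):
        raise ValueError("the target window must not start before the predictor cutoff")
    past = [e for e in events if e["day"] < cutoff]
    future = [e for e in events if target_start <= e["day"] <= target_end]
    return {
        "purchases_before_cutoff": len(past),
        "spend_before_cutoff": sum(e["amount"] for e in past),
        "purchased_in_target_window": int(bool(future)),
    }


# Hypothetical purchase history for one customer.
events = [
    {"day": date(2023, 1, 15), "amount": 20.0},
    {"day": date(2023, 2, 10), "amount": 35.0},
    {"day": date(2023, 3, 20), "amount": 50.0},  # falls in the gap, so it is ignored
    {"day": date(2023, 4, 5), "amount": 15.0},
]

signature = build_signature(
    events,
    cutoff=date(2023, 3, 1),
    target_start=date(2023, 4, 1),
    target_end=date(2023, 4, 30),
)
```

Events that fall between the cutoff and the target window are deliberately left out of both sides, which keeps information from the period being predicted from leaking into the predictors.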
Partitioning the Model Set
Once the preclassified data has been obtained from the appropriate timeframes, the
methodology calls for dividing it into three parts.
The first part, the training set, is used to build the initial model.
The second part, the validation set, is used to adjust the initial model to make it more
general and less tied to the idiosyncrasies of the training set.
The third part, the test set, is used to gauge the likely effectiveness of the model when
applied to unseen data.
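The three-way partition described above might be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name `partition` are assumptions for illustration; the text does not prescribe particular sizes.

```python
import random


def partition(records, train=0.6, validation=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # whatever remains becomes the test set
    )


train_set, validation_set, test_set = partition(range(100))
```

Every record lands in exactly one of the three sets, so the test set stays unseen while the model is built and adjusted.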
STEP FIVE: FIX PROBLEMS WITH THE DATA
STEP SEVEN: BUILD MODELS
In directed data mining, the model explains the target variable in terms of the other
fields in the database. This explanation may take the form of a neural network, a
decision tree, a linkage graph, or some other representation of the relationship
between the target and the other fields.
In undirected data mining, there is no target variable. The model finds relationships
between records and expresses them as association rules or by assigning them to
common clusters.
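To make the idea of an association rule concrete, the sketch below scores a rule of the form "if A then B" over a handful of baskets. Support and confidence are standard measures for such rules but are not named in the text above; the baskets and the function name `rule_stats` are made up for the example.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule: antecedent -> consequent."""
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    has_antecedent = sum(1 for b in baskets if antecedent in b)
    support = both / len(baskets)  # share of all baskets containing both items
    confidence = both / has_antecedent if has_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

support, confidence = rule_stats(baskets, "bread", "milk")
```

Here the rule "bread implies milk" holds in two of the four baskets, and in two of the three baskets that contain bread.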
STEP EIGHT: ASSESS MODELS
The ability of decision makers to collect, filter, and interpret data, messages, and signals has a critical bearing on
their strategy (Makadok & Barney, 2001).
Social media analytics aim to monitor, filter, and analyse the discussions that take place on social media platforms,
providing a comprehensive picture of consumer opinions regarding products and services.
Sentiment analysis can suggest whether a product, service, or customer support is better or worse than industry
average.
Correlating sentiment to recent changes in product design, for example, could provide essential feedback.
While some social media data are structured, most are unstructured or semi-structured,
leading to high diversity, ambiguity, and textual disorder.
These unstructured and semi-structured data require pre-processing and scrubbing before data analytics to
eliminate missing data, incorrect data, and inconsistent data.
The process of data cleaning may involve spell checking, removal of typographical errors or duplicates, validating
and correcting values against a known list of entities, and tagging data with metadata.
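A few of these cleaning steps can be sketched in Python. Everything specific here is an assumption for illustration: the post layout, the `KNOWN_BRANDS` lookup, and the `valid_brand` metadata tag. Spell checking is left out because it needs a dictionary or an external tool.

```python
import re

# Hypothetical list of known entities, mapping spelling variants to a canonical name.
KNOWN_BRANDS = {"acme": "Acme", "acme corp": "Acme", "globex": "Globex"}


def clean_posts(posts):
    """Drop empty and duplicate posts, normalise brands, and tag each record."""
    seen = set()
    cleaned = []
    for post in posts:
        # Collapse runs of whitespace; treat a missing text as empty.
        text = re.sub(r"\s+", " ", post.get("text") or "").strip()
        if not text:
            continue  # missing data
        key = text.lower()
        if key in seen:
            continue  # duplicate, ignoring case and spacing
        seen.add(key)
        # Validate and correct the brand against the known list.
        brand = KNOWN_BRANDS.get((post.get("brand") or "").strip().lower())
        cleaned.append({"text": text, "brand": brand, "valid_brand": brand is not None})
    return cleaned


raw = [
    {"text": "Love the  new design!", "brand": "ACME"},
    {"text": "love the new design!", "brand": "Acme Corp"},  # duplicate of the first
    {"text": None, "brand": "Globex"},                       # missing text
    {"text": "Support was slow", "brand": "Initech"},        # brand not in the list
]

cleaned = clean_posts(raw)
```

Records whose brand cannot be matched are kept but flagged, so an analyst can decide later whether to correct or discard them.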
Social media intelligence
Social media analytics describes what has already happened.
Social media intelligence enables managers to prescribe what should be done with the results of social
media analytics. Social media intelligence is achieved by combining knowledge generated from
traditional intelligence activities and knowledge gained from social media analytics, and helps
managers develop better actionable market decisions that align with the company’s objectives.