Intro 2
Lecture #2
Data mining is a process of discovering various models, summaries, and derived values from a given
collection of data.
The word "process" is very important here. Even in some professional environments, there is a belief that
data mining simply consists of picking a computer-based tool to match the presented problem and
automatically obtaining a solution. This is a misconception based on an artificial idealization of the world. There are
several reasons why it is incorrect. First, data mining is not merely a collection of tools; second, the very
notion of matching a problem to a technique is flawed, because it rarely happens that a problem neatly matches a single
technique. In fact, data mining is an iterative process: we examine the problem, decide to apply some tools and
techniques, modify them if they do not fit, and sometimes must return to the beginning and restart the process.
Data mining is not a random application of statistical, machine-learning, or other tools, and it is not a
random walk through the space of analytic techniques. It is a carefully planned and considered process of deciding
what will be most useful, promising, and revealing.
Any general experimental procedure adapted to a data-mining problem involves the following steps:
1. State the problem and formulate the hypothesis.
2. Collect the data.
3. Preprocess the data.
4. Estimate the model.
5. Interpret the model and draw conclusions.
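The five steps above can be sketched end to end on synthetic data. This is a minimal, hypothetical illustration: the "hypothesis" is that y grows linearly with x, the model is an ordinary least-squares line, and interpretation is reading off the fitted slope.

```python
import random
import statistics

# 1. State the problem and formulate the hypothesis: does y grow linearly with x?
# 2. Collect data (a synthetic stand-in for a real data source).
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 0.5)) for x in range(50)]

# 3. Preprocess: drop invalid records (none here; shown only for the shape of the step).
data = [(x, y) for x, y in data if y is not None]

# 4. Estimate the model: least-squares slope and intercept.
xs = [x for x, _ in data]
ys = [y for _, y in data]
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# 5. Interpret the model: a fitted slope near 2 supports the linear hypothesis.
print(round(slope, 2), round(intercept, 2))
```

In a real project each step would loop back to earlier ones, as the text emphasizes; the straight-line flow here is only the skeleton.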
3. Data Preprocessing:
In the observational setting, data is usually “collected” from prevailing databases, data warehouses, and
data marts. Data preprocessing usually includes a minimum of two common tasks :
a. (i) Outlier detection (and removal): Outliers are unusual data values that are not consistent with
most observations. Commonly, outliers result from measurement errors and coding and recording
errors, and, sometimes, they are natural, abnormal values. Such non-representative samples can
seriously affect the model produced later. There are two strategies for handling outliers:
1. Detect and eventually remove the outliers.
2. Develop robust modeling methods that are insensitive to outliers.
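Strategy 1 can be sketched with a robust median-based rule. The data and threshold below are illustrative; the median absolute deviation (MAD) is used instead of the plain standard deviation because a large outlier inflates the standard deviation and can mask itself.

```python
import statistics

values = [4.1, 3.9, 4.0, 4.2, 3.8, 4.1, 98.0, 4.0]  # 98.0 stands in for a recording error

median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)
# Modified z-score; 0.6745 rescales the MAD to match sigma under normality.
cleaned = [v for v in values if 0.6745 * abs(v - median) / mad <= 3.5]

print(cleaned)  # the 98.0 reading is flagged and removed; the rest survive
```

A plain three-sigma rule on the same data would actually keep 98.0 (its z-score is about 2.5, precisely because it inflates the standard deviation), which is one reason robust methods, strategy 2 above, are attractive.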
b. (ii) Scaling, encoding, and selecting features: Data preprocessing includes several steps, such as
variable scaling and different types of encoding. For instance, one feature with range [0, 1] and
another with range [100, 1000] will not have equal weight in the applied technique, and they
will also influence the final data-mining results differently. Therefore, it is
recommended to scale them and bring both features to the same weight for further
analysis. Also, application-specific encoding methods usually achieve dimensionality reduction
by providing a smaller number of informative features for subsequent data modeling.
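The scaling point can be shown with min-max scaling, one common choice: both features are mapped to [0, 1] so that neither dominates a distance-based technique. The feature values here are made up for illustration.

```python
def min_max_scale(column):
    """Map a list of numbers linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

feature_a = [0.1, 0.5, 0.9]      # already in [0, 1]
feature_b = [100, 550, 1000]     # in [100, 1000]; would dominate unscaled distances

scaled_a = min_max_scale(feature_a)
scaled_b = min_max_scale(feature_b)
print(scaled_a, scaled_b)  # both now span [0, 1]
```

After scaling, a raw difference of 450 in feature_b contributes the same as a difference of 0.4 in feature_a, which is exactly the "equal weight" the text recommends.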
4. Estimate the model:
The selection and implementation of the appropriate data-mining technique is the main task in this
phase. This process is not straightforward. Usually, in practice, the implementation is based on several
models, and selecting the best one is an additional task.
In most cases, data-mining models should help in decision making. Hence, such models need to be interpretable in
order to be useful, because humans are not likely to base their decisions on complex "black-box" models.
Note that the goals of accuracy of the model and accuracy of its interpretation are somewhat contradictory.
Usually, simple models are more interpretable, but they are also less accurate. Modern data-mining
methods are expected to yield highly accurate results using high-dimensional models. The problem of
interpreting these models, also important, is considered a separate task, with specific techniques to
validate the results.
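The "several models, select the best" idea can be sketched as follows. This is a hypothetical, minimal example: two candidate models (a constant predictor and a least-squares line) are fit on a training split, and whichever has the lower squared error on a held-out split is kept.

```python
import random
import statistics

# Synthetic data with a clear linear trend plus noise.
random.seed(1)
points = [(x, 3.0 * x + 1.0 + random.gauss(0, 1.0)) for x in range(40)]
random.shuffle(points)
train, held_out = points[:30], points[30:]

def sse(predict, data):
    """Sum of squared errors of a prediction function on a data split."""
    return sum((y - predict(x)) ** 2 for x, y in data)

# Candidate 1: predict the training mean everywhere.
mean_y = statistics.mean(y for _, y in train)
def constant(x):
    return mean_y

# Candidate 2: least-squares line fit on the training split.
xs = [x for x, _ in train]
mean_x = statistics.mean(xs)
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
def linear(x):
    return slope * x + intercept

candidates = [("constant", constant), ("linear", linear)]
best_name, best_model = min(candidates, key=lambda m: sse(m[1], held_out))
print(best_name)
```

Note the held-out split: comparing models on the data they were fit on would favor the more flexible model even when it merely memorizes noise, which is the accuracy-versus-interpretability tension the text describes.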
In 1999, several large companies, including automaker Daimler-Benz, insurance provider OHRA,
hardware and software manufacturer NCR Corp., and statistical software maker SPSS Inc., formalized and standardized
an approach to the data mining process. The result of this work was CRISP-DM (Cross-Industry Standard Process for
Data Mining).
As a methodology, it includes descriptions of the typical phases of a project, the tasks involved in
each phase, and an explanation of the relationships between tasks.
1. Business understanding:
In this step, the goals of the business are set and the important factors that will help in
achieving those goals are discovered.
2. Data understanding:
This step collects all the data and loads it into the tool. The data is
listed with its source, its location, how it was acquired, and any issues encountered. The data is visualized and
curated to check its completeness.
3. Data preparation:
This step involves selecting appropriate data, cleaning it, constructing attributes from the data,
and integrating data from multiple databases.
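The data-preparation tasks can be illustrated with a small sketch: two sources are integrated on a shared key, and a new attribute is constructed from the merged record. All field names and values here are hypothetical, not from any particular database.

```python
# Two sources keyed by a shared customer id (integration happens on this key).
customers = {101: {"income": 52000}, 102: {"income": 48000}}
orders = {101: {"spent": 2600}, 102: {"spent": 9600}}

prepared = []
for cid in customers.keys() & orders.keys():  # keep only ids present in both sources
    row = {"id": cid, **customers[cid], **orders[cid]}   # integrate the two records
    row["spend_ratio"] = row["spent"] / row["income"]    # constructed attribute
    prepared.append(row)

print(sorted(prepared, key=lambda r: r["id"]))
```

The intersection of keys is a deliberate cleaning choice: records that exist in only one source are dropped rather than carried forward with missing fields.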
4. Modeling:
This step covers selecting the data mining technique (such as a decision tree), generating a test design for
evaluating the selected model, building models from the data set, and assessing the built model with an expert
to discuss the results.
5. Evaluation:
This step determines the degree to which the resulting model meets the business
requirements. Evaluation can be done by testing the model on real applications. The model is reviewed for
any mistakes, and if necessary, earlier steps are repeated.
6. Deployment:
In this step, a deployment plan is made, along with a strategy to monitor and maintain the data mining
model and to check its results for usefulness. Final reports are produced, and a review of the whole process
is done to check for mistakes and to see whether any step should be repeated.