Data Mining - Classification & Prediction
Two forms of data analysis can be used to extract models that describe important classes
or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels, whereas prediction models predict
continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures in dollars of potential
customers on computer equipment given their income and occupation.
What is classification?
The following are examples of cases where the data analysis task is classification −
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager at a company needs to determine whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing
data.
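As a rough illustration of the loan example, the sketch below labels a new application with the class of its closest training example (a 1-nearest-neighbour rule). The technique, the attribute pair (income, debt), and all the numbers are assumptions chosen for illustration; the text above does not prescribe a particular classifier.

```python
import math

# Hypothetical labelled applications:
# (annual income, existing debt), both in thousands of dollars.
TRAINING_DATA = [
    ((60, 5), "safe"),
    ((80, 10), "safe"),
    ((25, 30), "risky"),
    ((30, 40), "risky"),
]

def classify(applicant):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(TRAINING_DATA, key=lambda row: math.dist(row[0], applicant))
    return label

print(classify((70, 8)))   # safe
print(classify((28, 35)))  # risky
```

The output is always one of the categorical labels seen in training, which is what distinguishes classification from numeric prediction.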
What is prediction?
The following is an example of a case where the data analysis task is prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at the company. In this example we need to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model or
predictor is constructed that predicts a continuous-valued function, or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
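A minimal sketch of numeric prediction with simple linear regression follows. It fits a straight line by ordinary least squares and uses it to estimate spending from income. The function names and the income/spending figures are hypothetical and serve only to show the idea.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: annual income (thousands of dollars)
# and amount spent during a sale (dollars).
incomes = [30, 40, 50, 60, 70]
spending = [150, 200, 250, 300, 350]

model = fit_line(incomes, spending)
predicted = predict(model, 55)
print(predicted)  # 275.0
```

Unlike a classifier, the predictor returns a continuous value, so it can produce outputs that never appeared in the training data.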
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as
-1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Need of Normalization –
Normalization is generally required when attributes are measured on different scales.
Otherwise, an equally important attribute on a smaller scale may lose its influence because
another attribute has values on a larger scale.
In simple terms, when several attributes have values on different scales, data mining
operations may produce poor models. The attributes are therefore normalized to bring them
all onto the same scale.
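The effect of mismatched scales can be seen in a distance computation. The sketch below compares three hypothetical customers by Euclidean distance, first on raw values and then after each attribute is rescaled to the range 0 to 1. The customers and the attribute ranges are assumptions made up for this illustration.

```python
import math

def rescale(value, low, high):
    """Map value from [low, high] onto [0, 1]."""
    return (value - low) / (high - low)

# Hypothetical customers described by (annual income in dollars, age in years).
a = (30000, 25)
b = (30500, 60)  # similar income, very different age
c = (32000, 26)  # slightly different income, similar age

raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Assumed attribute ranges, chosen only for illustration.
INCOME_RANGE = (20000, 100000)
AGE_RANGE = (20, 70)

def scaled(customer):
    income, age = customer
    return (rescale(income, *INCOME_RANGE), rescale(age, *AGE_RANGE))

scaled_ab = math.dist(scaled(a), scaled(b))
scaled_ac = math.dist(scaled(a), scaled(c))

print(raw_ab < raw_ac)        # True: on raw values, b looks closer to a
print(scaled_ab > scaled_ac)  # True: after rescaling, c is the closer customer
```

On raw values the income attribute dominates the distance and the 35-year age gap barely registers; once both attributes share a scale, age contributes as much as income.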
Decimal Scaling Normalization –
This technique normalizes by moving the decimal point of the data values. To normalize the
data, we divide each value by 10^j, where j is the smallest integer such that the largest
absolute normalized value is less than 1. A data value v_i is normalized to v_i' by using the
formula below –
$$ v_i' = \frac{v_i}{10^{j}} $$
EXAMPLE
Let the input data be: -10, 201, 301, -401, 501, 601, 701
To normalize the above data,
Step 1: Find the maximum absolute value in the given data (m): 701
Step 2: Divide the given data by 1000 (i.e., j = 3, because 1000 is the smallest power of 10 greater than 701)
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
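The steps of this example can be sketched in a few lines of Python. The technique is commonly called decimal scaling; the function name `decimal_scaling` is hypothetical, and the sketch assumes a non-empty list of numbers.

```python
def decimal_scaling(values):
    """Divide every value by 10**j, where j is the smallest non-negative
    integer that brings the largest absolute value below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

data = [-10, 201, 301, -401, 501, 601, 701]
normalized, j = decimal_scaling(data)
print(j)  # 3
print(normalized)
```

For this input the loop stops at j = 3, so every value is divided by 1000, as in Step 2 of the example.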
Min-Max Normalization –
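In this technique of data normalization, a linear transformation is performed on the original
data. The minimum and maximum values of the data are found, and each value is replaced
according to the following formula.
$$ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A $$
Here $v$ and $v'$ are the old and new values of each entry, $\min_A$ and $\max_A$ are the
minimum and maximum values of the data A, and $\mathrm{new\_min}_A$ and $\mathrm{new\_max}_A$
are the boundaries of the target range, such as 0.0 and 1.0.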
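A minimal sketch of min-max normalization, assuming the usual linear mapping of the observed range [min, max] onto a target range [new_min, new_max] that defaults to 0.0 to 1.0. The function and parameter names are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are equal, so the range is zero")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

data = [-10, 201, 301, -401, 501, 601, 701]
scaled = min_max_normalize(data)
print(min(scaled), max(scaled))  # 0.0 1.0
```

The smallest input always maps to new_min and the largest to new_max, so a single outlier can compress all other values into a narrow band.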
Z-score normalization –
In this technique, values are normalized based on the mean and standard deviation of the data A.
The formula used is:
$$ v' = \frac{v - \bar{A}}{\sigma_A} $$
Here $v'$ and $v$ are the new and old values of each entry in the data, respectively, and
$\sigma_A$ and $\bar{A}$ are the standard deviation and mean of A, respectively.
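A minimal sketch of z-score normalization using Python's standard `statistics` module. It assumes the population standard deviation (`statistics.pstdev`); the text does not say whether the population or sample form is intended, and the sample data is made up for illustration.

```python
import statistics

def z_score_normalize(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("standard deviation is zero, so z-scores are undefined")
    return [(v - mean) / std for v in values]

# Hypothetical data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = z_score_normalize(data)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After the transformation the data has mean 0 and standard deviation 1, and each value states how many standard deviations the original entry lies from the mean.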