Script
For BI systems and mathematical models, we can only obtain high accuracy and effective results with a
well-cleaned, reliable set of data.
However, the raw data collected from primary sources usually contain several anomalies that need to
be identified and corrected. And that’s exactly what we do in data preparation.
As a first step of preparing our data, we need to validate it, which means that we need to identify anomalies
and implement corrective actions when they appear.
We usually encounter 2 types of problems in raw data.
- Incomplete Data
- Data affected by noise (all the attributes are there, but their values are noisy)
Obviously, incomplete data is a set of data with missing attribute values, as we can see here, and to fix that,
there are several solutions:
- Elimination
- Inspection
- Identification
- Substitution
Elimination:
As a first solution, we can discard all the records with missing attributes, but this method may
cause the loss of large amounts of data when the percentage of missing values is high or when the distribution of
missing values varies in an irregular way across the attributes.
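Here is a minimal sketch of this elimination approach, assuming a small, made-up pandas DataFrame with some missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with some missing attribute values
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],
    "income": [30000, 42000, np.nan, 51000],
    "region": ["A", "B", "B", np.nan],
})

# Elimination: discard every record that has at least one missing attribute
df_complete = df.dropna()
print(df_complete)  # only the fully observed records remain
```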
Inspection:
Alternatively, we can opt for inspecting each missing value individually, but this approach is time-consuming
and really difficult for large amounts of data.
Identification:
As a third option, we can replace all the missing values with a conventionally chosen value to identify
those values, making it unnecessary to remove entire records from our dataset.
For example, for a continuous attribute that only takes positive values, it is possible to assign the value -1 to all
missing data. For a categorical attribute, we can do the same with a default token.
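As a quick illustration, a hedged sketch of this identification approach on a made-up DataFrame (the column names and the -1 / "MISSING" markers are just conventions chosen here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "spending": [12.5, np.nan, 30.0],          # continuous, positive-valued attribute
    "segment": ["retail", None, "corporate"],  # categorical attribute
})

# Identification: mark missing values with a conventional value instead of
# removing the whole record
df["spending"] = df["spending"].fillna(-1)       # -1 cannot occur among real values
df["segment"] = df["segment"].fillna("MISSING")  # default token for categories
print(df)
```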
Substitution:
As a last option, missing values of an attribute may be replaced with the mean of the attribute calculated over
the remaining observations; this technique can only be used with numerical attributes. It is also possible to
replace missing values with the mean calculated over the records that have the same target class.
Finally, the maximum likelihood value, usually estimated using regression models, can be used to
replace missing values.
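A small sketch of the first two substitution strategies, again on made-up data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30000, np.nan, 45000, np.nan, 52000, 48000],
    "churn":  ["yes", "yes", "no", "no", "no", "yes"],   # target class
})

# Substitution with the mean of the attribute over the remaining observations
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Substitution with the mean computed over records of the same target class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("churn")["income"].transform("mean")
)
print(df)
```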
What is noise in physics? It’s a random perturbation in sound waves, right? So just like in physics, noise
in data is a random perturbation within the values of a numerical attribute, usually resulting in noticeable
anomalies.
First of all, we need to identify those outliers, so that we can either correct, regularize or eliminate them.
1- The easiest way to identify those unusual values is the statistical concept of dispersion. The mean and
the variance of the sample are calculated, and if the attribute follows a distribution not far from normal, the
values falling outside a chosen interval centered around the mean are identified as outliers.
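A minimal sketch of this dispersion-based idea, assuming a roughly normal attribute and a ±2 standard deviation interval (both the data and the threshold are arbitrary choices made here):

```python
import numpy as np

# Hypothetical sample of a numerical attribute with one suspicious value
x = np.array([10.2, 9.8, 10.5, 9.9, 35.0, 10.1, 10.3])

mu, sigma = x.mean(), x.std()

# Values falling outside the chosen interval centered around the mean
# are identified as outliers
z = (x - mu) / sigma
outliers = x[np.abs(z) > 2]
print(outliers)  # the value 35.0 stands out from the rest of the sample
```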
2- An alternative way is based on the distance between observations. Once we identify the clusters, we can
assume that records that do not belong to any of the clusters are outliers.
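A possible sketch of this distance-based idea using DBSCAN from scikit-learn, which labels points that fall in no cluster with -1 (the data and parameters are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D observations: two dense groups plus one isolated point
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],
    [12.0, 0.5],   # far away from both groups
])

# Records that are not placed in any cluster receive the label -1
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(X[labels == -1])  # the isolated point is reported as an outlier
```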
3- Unlike the other methods, which identify and correct each single anomaly, there are also techniques that
automatically correct anomalous data. For example, simple or multiple regression models predict the
value of an attribute a_j. Once the regression model is developed and the corresponding confidence
interval calculated, it is possible to substitute the value computed along the prediction curve for the
values of the attribute a_j that fall outside the interval.
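A rough sketch of this regression-based correction, using a plain least-squares fit and a simple band of two residual standard deviations in place of a formal confidence interval (the data, band width and variable names are all assumptions made here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: attribute a_j is roughly linear in another attribute x,
# with two anomalous values injected on purpose
x = np.arange(30, dtype=float).reshape(-1, 1)
a_j = 2.0 * x.ravel() + rng.normal(0, 1, 30)
a_j[[5, 20]] += 25

model = LinearRegression().fit(x, a_j)
pred = model.predict(x)
band = 2 * (a_j - pred).std()

# Values of a_j falling outside the band around the prediction curve are
# replaced by the value computed along the curve itself
outside = np.abs(a_j - pred) > band
a_j[outside] = pred[outside]
```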
After this step, now that we have a complete set of data with no missing values or anomalies, we aim to
improve the accuracy of our learning models. To do so, there are multiple types of transformations
that we can apply to our data.
Slide 11
The most used transformation is standardization. And most popular standardization techniques include
decimal scaling, min-max scaling and z-score standardization.
Decimal scaling is based on the transformation below, where h is the scaling intensity that is, in general,
fixed at the smallest value such that the transformed values fall in the range [-1, 1].
Min-max scaling is achieved through the following transformation,
where x_min,j is the minimum value of x_ij over all records i, for a fixed attribute j, and x_max,j is the corresponding maximum value.
Z-index or z-score standardization has the formula below,
where μ_j is the sample mean and σ_j is the sample standard deviation of a given attribute a_j; this transformation
usually gives transformed values in the range [-3, 3].
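To make the three transformations concrete, here is a small numpy sketch on a made-up attribute (h is chosen as the smallest power of ten that brings every value inside [-1, 1]):

```python
import numpy as np

x = np.array([120.0, 250.0, 75.0, 310.0, 190.0])   # one numerical attribute

# Decimal scaling: divide by 10^h so that the transformed values lie in [-1, 1]
h = int(np.ceil(np.log10(np.abs(x).max())))
x_decimal = x / 10**h

# Min-max scaling: subtract the minimum and divide by the range of the attribute
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: subtract the sample mean and divide by the sample standard deviation;
# the transformed values usually fall within [-3, 3]
x_zscore = (x - x.mean()) / x.std()
```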
Slide 12
When dealing with small amounts of data, the transformations described earlier are sufficient to prepare
input data for a data mining analysis. However, with large datasets it is preferable to reduce the size of the data in
order to make learning algorithms more efficient without decreasing the quality of the results.
There are 3 main criteria to determine if a data reduction technique should be used:
- Efficiency
- Accuracy
- Simplicity
Efficiency
A dataset smaller than the original one means shorter computation time. Therefore, a reduction in processing
time allows the analyses to be carried out more quickly.
Accuracy
Accuracy is a critical success factor in most models. As a consequence, data reduction should not
compromise the accuracy of the model.
Simplicity
Some data mining applications are concerned more with interpretation than with prediction. In those cases, it is more
important that the models can be easily translated into simple rules that experts in
the application domain can understand. Some decision makers may accept a slight decrease in accuracy as a trade-off for simpler
rules.
Slide 14
Feature selection, also called feature reduction, is the elimination of a subset of attributes judged irrelevant for the
purpose of the data mining activities. The choice of the combination of predictive variables is one of the most
critical aspects in a learning process.
Feature reduction means fewer columns, which implies quicker execution time. The models generated
after the elimination of irrelevant attributes are often more accurate and easier to understand.
There are 3 main categories of feature selection models:
- Filter methods
- Wrapper methods
- Embedded methods
Filter methods select the relevant attributes before moving to the learning phase, and are therefore
independent of the algorithm being used.
The simplest filter method to apply for supervised learning is the selection of each single attribute based
on its level of correlation with the target. As a result, we only keep the attributes that are highly correlated with the target.
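A minimal sketch of such a correlation-based filter, on synthetic data where only one of three candidate attributes really drives the target (the names and the 0.5 threshold are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic dataset: three candidate attributes and a numerical target
df = pd.DataFrame({
    "a1": rng.normal(size=100),
    "a2": rng.normal(size=100),
    "a3": rng.normal(size=100),
})
df["target"] = 3 * df["a1"] + 0.1 * rng.normal(size=100)

# Filter method: keep only the attributes whose correlation with the target
# exceeds a chosen threshold, before any learning algorithm is involved
corr = df.drop(columns="target").corrwith(df["target"]).abs()
selected = corr[corr > 0.5].index.tolist()
print(selected)  # only "a1" should be strongly correlated with the target
```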
In the wrapper methods the selection of predictive variables is based not only on the level of relevance
of each single variable, but also on the learning algorithm utilized, which makes these methods burdensome from a
computational standpoint, since they take into account every possible combination of variables, and that means huge
computation time.
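As an illustration of why wrapper methods are so expensive, here is a brute-force sketch that cross-validates a logistic regression on every possible subset of the Iris attributes (the choice of model and dataset is an assumption made for the example):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

# Wrapper method: every candidate subset is evaluated by actually training
# and validating the learning algorithm on it
best_score, best_subset = -np.inf, None
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        score = cross_val_score(
            LogisticRegression(max_iter=1000), X[:, subset], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)  # already 15 subsets for only 4 attributes
```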
For the embedded methods, the attribute selection process lies inside the learning algorithm, so that the selection
of the optimal subset of attributes is made during the phase of model generation. Decision trees are the perfect example
of embedded methods, because at each node they use a function that estimates the predictive value of each attribute. In this
way, the relevant attributes are automatically selected, and they determine the rule for splitting the records into the corresponding
nodes.
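A short sketch of this embedded behaviour with a scikit-learn decision tree (the dataset is just a convenient example, not part of the slides):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Embedded method: at each node the tree picks the most informative attribute,
# so the selection happens while the model itself is being generated
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.feature_importances_)  # attributes never used for a split get weight 0
```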
Principal component analysis is the best-known technique of attribute reduction by means of
projection. Generally speaking, the purpose of this method is to obtain a new, smaller set of attributes,
obtained as linear combinations of the original ones, while losing as little information as possible.
Before applying this method, it is a must to standardize the data, so that all the attributes are on a comparable
scale. In addition to that, this transformation makes the mean of each attribute equal to 0.
Now, we need to find the principal components that are going to form our new set of attributes.
To do that, we choose as the first principal component the linear combination of the attributes with the highest variance, and
we iterate this operation, each new component being uncorrelated with the previous ones, until we have found all the principal components.
Let’s give an example!
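Rather than working the algebra out by hand, here is a hedged scikit-learn sketch on synthetic data (the dataset, its size and the number of components kept are all assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic dataset: 5 correlated attributes built from 2 hidden factors
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))

# Standardize first, so that every attribute has mean 0 and comparable scale
X_std = StandardScaler().fit_transform(X)

# Keep the principal components that capture most of the variance
pca = PCA(n_components=2).fit(X_std)
X_reduced = pca.transform(X_std)

print(pca.explained_variance_ratio_)  # share of variance kept by each component
print(X_reduced.shape)                # 5 original attributes reduced to 2
```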
Finally, we’re going to talk about data discretization. The purpose of this kind of data reduction is to decrease the
number of distinct values assumed by one or more attributes. Data discretization is the primary reduction
method of this kind: it reduces continuous attributes to categorical ones characterized by a limited number of distinct values.
For instance, the weekly spending of a mobile phone customer is a continuous numerical value, which
might be discretized into, say, five classes: low, [0, 10) euros; medium low, [10, 20) euros; medium, [20,
30) euros; medium high, [30, 40) euros; and high, over 40 euros.
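This five-class subdivision can be reproduced directly; a sketch with some made-up spending values:

```python
import pandas as pd

# Hypothetical weekly spending values, in euros
spending = pd.Series([4.2, 13.5, 22.0, 37.8, 55.0, 28.3])

# Discretize the continuous attribute into the five classes described above
classes = pd.cut(
    spending,
    bins=[0, 10, 20, 30, 40, float("inf")],
    labels=["low", "medium low", "medium", "medium high", "high"],
    right=False,  # intervals are closed on the left, e.g. [10, 20)
)
print(classes.tolist())
```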
The discretization process has brought about a reduction in the number of distinct values assumed by
each attribute. The models that can be generated on the reduced dataset are likely to be more intuitive and less
arbitrary.
For example, this:
if spending is in the medium low range, and if a customer resides in region A, then the probability of
churning is higher than 0.85.
Is easier to read than this:
if spending is in the [12.21, 14.79] euro range, and if a customer resides in province B, then the
probability of churning is higher than 0.85
Among the most popular discretization techniques are subjective subdivision, subdivision into classes
and hierarchical discretization.
Subjective subdivision is the most popular and intuitive method. Classes are defined based on the
experience and judgment of experts in the application domain.
Subdivision into categorical classes may be achieved in an automated way using the techniques
described below. In particular, the subdivision can be based on classes of equal size or equal width.
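A small pandas sketch of the difference between the two automated subdivisions, on made-up skewed data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=20, size=1000))  # hypothetical spending data

# Subdivision into classes of equal width: every interval has the same length
equal_width = pd.cut(x, bins=5)

# Subdivision into classes of equal size: every class holds roughly the same
# number of observations (quantile-based)
equal_size = pd.qcut(x, q=5)

print(equal_width.value_counts().sort_index())
print(equal_size.value_counts().sort_index())
```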
The third type of discretization is hierarchical discretization, which is based on hierarchical
relationships between concepts and may be applied to categorical attributes, as for the hierarchical
relationship between provinces and regions.
In general, given a hierarchical relationship of the one-to-many kind, it is possible to replace each value
of an attribute with the corresponding value found at a higher level in the hierarchy of concepts.
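For example, a tiny sketch of this replacement, using an invented province-to-region mapping:

```python
import pandas as pd

# Hypothetical one-to-many hierarchy: each province belongs to exactly one region
province_to_region = {
    "Milan": "Lombardy", "Bergamo": "Lombardy",
    "Turin": "Piedmont", "Asti": "Piedmont",
}

provinces = pd.Series(["Milan", "Turin", "Bergamo", "Asti", "Milan"])

# Replace each value with the corresponding higher-level concept
regions = provinces.map(province_to_region)
print(regions.tolist())  # fewer distinct values than the original attribute
```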
Slide 20
Here are some examples.