Data Preprocessing
Abstract—Data mining is about obtaining new knowledge from existing datasets. However, the data in the existing datasets can be scattered, noisy, and even incomplete. Although a lot of effort is spent on developing or fine-tuning data mining models to make them more robust to noise in the input data, their quality still strongly depends on the quality of that data. The article starts with an overview of the data mining pipeline, where the procedures in a data mining task are briefly introduced. Then an overview of the data preprocessing techniques, categorized as data cleaning, data transformation, and data reduction, is given. Detailed preprocessing methods, as well as their influence on data mining models, are covered in this article.

Index Terms—Data Mining, Data Preprocessing, Data Mining Pipeline

I. INTRODUCTION

... and making the distribution of data more suitable for applying optimization algorithms in the model training step.

The input for data mining models can be huge: it may have too many dimensions or come in massive amounts, which would make it difficult for the data mining model to train or cause trouble while transferring and storing the data. Data reduction techniques mitigate the problem by applying reduction to the dimensions (known as dimensional reduction) or to the amount of data (known as instance selection and sampling).

To apply preprocessing to data, Python and R are among the most popular tools. With packages such as scikit-learn [1] and PreProcess [2], most of the preprocessing algorithms covered in this paper can be used without worrying about their implementation details.
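As a small, hedged illustration of how little code such tooling requires, the following sketch rescales two attributes with scikit-learn; the toy array is invented for the example and does not come from the paper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two attributes on very different scales (e.g., age and income).
X = np.array([[25.0, 48_000.0],
              [32.0, 52_000.0],
              [47.0, 61_000.0]])

# Standardization rescales each attribute to zero mean and unit variance,
# which typically makes the data friendlier to gradient-based optimizers.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```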
In the following section, the data mining pipeline and ...
A. Missing Values Handling

Missing values are a typical kind of incompleteness in a dataset. Most data mining models do not tolerate missing values in their input data: these values cannot be used for comparison, are not available for categorizing, and cannot be operated on with arithmetic algorithms. Thus, it is necessary to handle the missing values before pushing the dataset to data mining models. The easiest way to deal with missing values is to drop the entire sample. This method is effective if the proportion of missing values in a dataset is not significant. However, if the number of missing values is too large to ignore, or the percentage of missing values differs between attributes [3], dropping the samples with missing values would reduce the size of the dataset dramatically, and the information contained in the dropped samples is not made use of. Another way to deal with missing values is to fill them, and there are various methods for finding a suitable value to fill in; some of them are listed as follows.

1) Use a special value to represent missing: Sometimes the missing value itself has some meaning. For instance, in a patient's medical report, a missing value for uric acid means the patient did not go through the renal function test. Thus, using a certain value such as -1 makes sense, for it can be operated on like a normal value while having a special meaning in the dataset.

2) Use attribute statistics to fill: Statistics such as the mean, median, or mode can be obtained from the non-missing values in the missing value's corresponding attribute. It is said that for a skewed dataset, the median would be a better choice [3]. However, this technique does not take the sample's other non-missing attributes into account.

3) Predict the value with known attributes: If we assume there exists a correlation between attributes, filling the missing value can be modeled as a prediction problem: predicting the value from the non-missing attributes, with the other samples as training data. The prediction methods include regression algorithms, decision trees [3], and K-Means [9].

4) Assign all possible values: For categorical attributes, given an example E with m possible values for its missing attribute, E can be replaced with m new examples E1, E2, ..., Em. This filling technique assumes the missing attribute does not matter for the example, so the value can be any one in its domain [10].
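A minimal, hedged sketch of filling strategies 1 to 3 with scikit-learn follows; the toy array is invented, and KNNImputer stands in for the regression, decision-tree, or K-Means predictors mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing entries marked as np.nan; values are invented.
X = np.array([[1.0, 7.0],
              [2.0, np.nan],
              [np.nan, 9.0],
              [4.0, 11.0]])

# Strategy 1: a special constant (e.g., -1) that keeps "missing" distinguishable.
X_const = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)

# Strategy 2: a per-attribute statistic; the median is often preferred on skewed data.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Strategy 3: predict the missing entry from the other attributes.
# KNNImputer is a stand-in here for the prediction-based methods in the text.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_const, X_median, X_knn, sep="\n\n")
```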
B. Outlier Detection

An outlier is a data sample that has a large distance to most of the other samples. Although a rare case is not necessarily wrong (e.g., age = 150), most outliers are caused by measurement errors or wrong recording, so ignoring a rarely appearing case usually does not do much harm. Although some models are robust against outliers, outlier detection is still recommended as part of the data preprocessing work.

Statistics-based outlier detection algorithms are among the most commonly used. They assume an underlying distribution of the data [12] and regard the data examples whose probability density is lower than a certain threshold as outliers. As the underlying distribution is unknown in most cases, the normal distribution is a good substitute, and its parameters can be estimated from the mean value and standard deviation of the data. The Mahalanobis distance [12], as in (1), is a scale-invariant distance between two data samples. Outliers can be identified by comparing the Mahalanobis distance between each sample and the mean value of all samples. The box plot, another kind of statistics-based outlier detection technique, gives a graphical representation of outliers by plotting the lower and upper quartiles along with the median [13].

$D_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$   (1)

Without making any assumption about the distribution of the data, distance-based outlier detection algorithms detect outliers by analyzing the distance between every two samples. Simple distance-based outlier detection algorithms are not suitable for large datasets, since for n samples with m dimensions their complexity is usually O(n^2 m) [12], and each computation requires scanning all the samples. However, an extended cell-based outlier detection algorithm is developed in [14], which guarantees linear complexity over the dataset volume and no more than three dataset scans. The experiments show this algorithm works best for datasets with fewer than four dimensions.

Sometimes, with consideration of temporal and spatial locality, an outlier may not be a separate point but a small cluster. Cluster-based outlier detection algorithms consider clusters of small size as outlier clusters and clean the dataset by removing the whole cluster [15], [16].
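As a hedged sketch of the statistics-based approach above, the following code computes the Mahalanobis distance of Eq. (1) between every sample and the sample mean and flags points beyond a threshold; the synthetic data and the threshold value are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 20 well-behaved samples plus one obvious outlier.
X = rng.normal(loc=[2.0, 200.0], scale=[0.5, 10.0], size=(20, 2))
X = np.vstack([X, [2.2, 5000.0]])          # e.g., a recording error

# Mahalanobis distance to the sample mean, D_M(x, mu) as in Eq. (1).
mu = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))  # pinv guards against a singular covariance
diff = X - mu
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

threshold = 3.0                             # illustrative cut-off; tune per dataset
print("flagged outliers:", np.flatnonzero(d > threshold))
```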
The data may have lots of attributes, and there can be unrelated features as well as interdependence between features [22]. Data reduction is a family of techniques that reduces the amount of data, its dimensionality, or both, thus making the model's learning process more efficient and helping the model obtain better performance, including preventing the overfitting problem and fixing a skewed data distribution.

A. Dimensional reduction

Dimensionality reduction is about reducing the number of attributes of the data samples, thus reducing the total size of the data. As the number of attributes of a sample is reduced, less information is contained in it. A good dimensional reduction algorithm keeps the more general information, which makes it harder for models to become overfitted. Some dimensional reduction techniques apply a transformation to the dataset, generating new data samples with fewer attributes than before. The transformations follow different criteria. Principal component analysis, known as PCA, reduces the dimension of the data while keeping the maximum variance of the data [23]. This is achieved by multiplying the matrix A = (a1, a2, ..., ap)^T with the dataset X and keeping the top k dimensions (ai stands for the normalized eigenvector corresponding to the ith greatest eigenvalue).

... Usually, the weights learned over different attributes act as a form of feature selection. A typical example is the regularization term of the loss function in linear regression, known as Lasso regression (for L1 regularization) and ridge regression (for L2 regularization).
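A hedged sketch of both ideas with scikit-learn follows: PCA keeps the top k directions of maximum variance, and an L1-regularized (Lasso) regression pushes the weights of uninformative attributes toward zero; the synthetic data, the value of k, and the regularization strength alpha are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 5 attributes, only the first two matter for y.
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Dimensional reduction: project onto the k = 2 principal components,
# i.e., the eigenvectors with the largest eigenvalues of the covariance matrix.
X_reduced = PCA(n_components=2).fit_transform(X)
print("reduced shape:", X_reduced.shape)

# Embedded feature selection: the L1 penalty of Lasso shrinks the weights of
# irrelevant attributes to (almost) zero, acting as a feature selector.
lasso = Lasso(alpha=0.1).fit(X, y)
print("learned weights:", lasso.coef_.round(2))
```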
B. Instance selection and sampling

Both instance selection and sampling achieve data reduction by reducing the amount of data, seeking to train the model with minimum performance loss [20], while relying on different criteria for selecting (or dropping) instances.

Most instance selection algorithms are based on fine-tuning classification models. To help the model make better decisions, condensation algorithms and edition algorithms have been proposed [20]. Condensation removes the samples that lie in the relatively central area of a class, assuming they do not contribute much to classification. Condensed nearest neighbor [27], for instance, selects instances by adding all the samples that cause a mistake for a K-Nearest Neighbor classifier. Edition algorithms remove the samples close to the boundary, hoping to give the classifier a smoother decision boundary. Related algorithms include a clustering-based algorithm that selects the centers of clusters [28].
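Below is a rough, hedged sketch of Hart's condensed nearest neighbor rule [27]: starting from one sample, every sample that the current subset misclassifies with a 1-NN rule is added, so mostly boundary samples are kept. The function and variable names are invented for the example, and the loop is written for clarity rather than speed.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

def condensed_nearest_neighbor(X, y, seed=0):
    """Return indices of a condensed subset that a 1-NN classifier
    can use to classify every sample in (X, y) correctly."""
    order = np.random.default_rng(seed).permutation(len(X))
    keep = [order[0]]                          # start from one arbitrary sample
    changed = True
    while changed:                             # repeat until nothing new is added
        changed = False
        for i in order:
            knn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
            if knn.predict(X[i:i + 1])[0] != y[i]:
                keep.append(i)                 # keep samples the subset gets wrong
                changed = True
    return np.array(keep)

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
kept = condensed_nearest_neighbor(X, y)
print(f"kept {len(kept)} of {len(X)} samples")
```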
Fig. 3. A comparison of the reduction result with PCA and LDA.

Compared with instance selection methods, sampling is a faster and easier way to reduce the number of instances, since almost no complex selection algorithm is required for sampling methods: they only focus on reducing the number of data samples. The easiest sampling technique is random sampling, which collects a certain amount or portion of samples from the dataset at random. For skewed datasets, stratified sampling [22] is better suited, since it takes the appearance frequency of the labels of the different classes into account and assigns a different probability of being chosen to data with different labels, thus making the sampled dataset more balanced.
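A minimal sketch of that stratified idea follows: the same number of samples is drawn from each class, so rarer labels get a proportionally higher chance of being selected. The toy data and sample sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed toy dataset: roughly 90% of the labels are 0, 10% are 1.
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.1).astype(int)

# Draw the same number of samples per class (capped by the rarest class size).
counts = np.bincount(y)
per_class = min(50, counts.min())
idx = np.concatenate([
    rng.choice(np.flatnonzero(y == label), size=per_class, replace=False)
    for label in np.unique(y)
])
X_sample, y_sample = X[idx], y[idx]

print("original class counts:", counts)
print("sampled class counts: ", np.bincount(y_sample))
```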
VI. SUMMARY AND OUTLOOK

Data mining, as a technique to discover additional information in a dataset, can be organized as a pipeline, in which obtaining, scrubbing, exploring, modeling, and interpreting are the key steps. The purpose of the data mining pipeline is to tackle realistic problems, including reviewing the past and predicting the future. The specific technique used in each step should be selected with care to give the pipeline the best performance.

The success of a data mining model depends on proper data preprocessing work. Unpreprocessed data can be in a format unsuitable for model input, cause instability for the optimization algorithm of the model, have a great impact on the model's performance because of noise and outliers, and cause performance problems in the model's training process. With a careful selection of preprocessing steps, these problems can be reduced or avoided.

Data type transformation techniques as well as missing value handling techniques make it possible for models to process different types of data. By applying normalization, the unit systems of different attributes become more unified, reducing the probability that an optimization algorithm misses the global minimum. For simpler models, numerical transformation can provide richer features to the model, thus enhancing the model's ability to discover more underlying relationships between features and labels. For the overfitting problem, dimensional reduction techniques help the model find the more general information about samples instead of overly detailed features by reducing the feature dimension, thus removing some unimportant information. And for the performance of model training, both dimensional reduction techniques and instance selection techniques improve training performance by reducing the total amount of data.

REFERENCES

[1] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[2] K. R. Coombes, PreProcess: Basic Functions for Pre-Processing Microarrays, 2019, R package version 3.1.7. [Online]. Available: https://CRAN.R-project.org/package=PreProcess
[3] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.
[4] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data streams: a review," ACM SIGMOD Record, vol. 34, no. 2, pp. 18–26, 2005.
[5] H. Wickham, ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. [Online]. Available: http://ggplot2.org
[6] M. Bostock, V. Ogievetsky, and J. Heer, "D3 data-driven documents," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2301–2309, 2011.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[8] T. Velmurugan and T. Santhanam, "Computational complexity between k-means and k-medoids clustering algorithms for normal and uniform distributions of data points," Journal of Computer Science, vol. 6, no. 3, p. 363, 2010.
[9] B. M. Patil, R. C. Joshi, and D. Toshniwal, "Missing value imputation based on k-mean clustering with weighted distance," in International Conference on Contemporary Computing. Springer, 2010, pp. 600–609.
[10] J. W. Grzymala-Busse, "On the unknown attribute values in learning from examples," in International Symposium on Methodologies for Intelligent Systems. Springer, 1991, pp. 368–377.
[11] J. W. Grzymala-Busse and M. Hu, "A comparison of several approaches to missing attribute values in data mining," in International Conference on Rough Sets and Current Trends in Computing. Springer, 2000, pp. 378–385.
[12] M. Kantardzic, Data mining: concepts, models, methods, and algorithms. John Wiley & Sons, 2011.
[13] S. Walfish, "A review of statistical outlier methods," Pharmaceutical Technology, vol. 30, no. 11, p. 82, 2006.
[14] E. M. Knorr, R. T. Ng, and V. Tucakov, "Distance-based outliers: algorithms and applications," The VLDB Journal: The International Journal on Very Large Data Bases, vol. 8, no. 3-4, pp. 237–253, 2000.
[15] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook. Springer, 2005, pp. 131–146.
[16] L. Duan, L. Xu, Y. Liu, and J. Lee, "Cluster-based outlier detection," Annals of Operations Research, vol. 168, no. 1, pp. 151–168, 2009.
[17] S. Angelidis and M. Lapata, "Multiple instance learning networks for fine-grained sentiment analysis," Transactions of the Association for Computational Linguistics, vol. 6, pp. 17–31, 2018.
[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[19] I. Rish et al., "An empirical study of the naive Bayes classifier," in IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, no. 22, 2001, pp. 41–46.
[20] S. García, J. Luengo, and F. Herrera, Data preprocessing in data mining. Springer, 2015.
[21] T. Y. Lin, "Attribute transformations for data mining I: Theoretical explorations," International Journal of Intelligent Systems, vol. 17, no. 2, pp. 213–222, 2002.
[22] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, "Data preprocessing for supervised leaning," International Journal of Computer Science, vol. 1, no. 2, pp. 111–117, 2006.
[23] I. Jolliffe, Principal Component Analysis. Springer, 2011.
[24] A. M. Martínez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, 2001.
[25] I. K. Fodor, "A survey of dimension reduction techniques," Lawrence Livermore National Lab., CA (US), Tech. Rep., 2002.
[26] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, no. Mar, pp. 1157–1182, 2003.
[27] P. Hart, "The condensed nearest neighbor rule (corresp.)," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 515–516, 1968.
[28] A. Lumini and L. Nanni, "A clustering method for automatic biometric template selection," Pattern Recognition, vol. 39, no. 3, pp. 495–497, 2006.