

Preprocessing Methods and Pipelines of Data Mining: An Overview
Li Canchen
Department of Informatics
Technical University of Munich
[email protected]

Abstract—Data mining is about obtaining new knowledge from existing datasets. However, the data in the existing datasets can be scattered, noisy, and even incomplete. Although much effort is spent on developing or fine-tuning data mining models to make them more robust to the noise of the input data, their quality still strongly depends on the quality of the data. The article starts with an overview of the data mining pipeline, where the procedures in a data mining task are briefly introduced. Then an overview of the data preprocessing techniques, which are categorized as data cleaning, data transformation and data reduction, is given. Detailed preprocessing methods, as well as their influence on the data mining models, are covered in this article.

Index Terms—Data Mining, Data Preprocessing, Data Mining Pipeline

I. INTRODUCTION

Data mining is a knowledge obtaining process: it gets data from various data sources and finally transforms the data into knowledge, thus providing insight into its application field. The data mining pipeline is a typical example of an end-to-end data mining system: it is an integration of all data mining procedures and delivers the knowledge directly from the data source to humans.

The purpose of data preprocessing is to make the data easier for data mining models to tackle. The quality of the data can have a significant influence on data mining models. It is considered that the data and features have already set the upper bound of the knowledge that can be obtained, and the data mining models are just about approximating that upper bound. Various preprocessing techniques have been invented to make the data meet the input requirements of the model, improve the relevance to the prediction target, and make the optimization step of the model easier.

It is common that raw data obtained from the natural world is badly shaped. The problems include the appearance of missing values (e.g., a patient did not go through all the tests), duplications (e.g., annual income and monthly income), outlier values (e.g., age is -1) as well as contradictions (e.g., gender is male and is pregnant) in the dataset. Although the existing preprocessing techniques cannot guarantee to solve all these problems, they can at least correct some of them and improve the performance of the models.

The data type and distribution of data are usually transformed before being sent to data mining models. The purpose of data transformation includes making the data meet the input requirements of the models, removing the noise of the data, and making the distribution of the data more suitable for applying optimization algorithms in the model training step.

The input for data mining models can be huge: it may have too many dimensions or come in a massive amount, which would make it difficult for the data mining model to train or cause trouble while transferring and storing the data. Data reduction techniques can reduce the problem by applying reduction to the dimensions (known as dimensional reduction) or to the amount of data (known as instance selection and sampling).

To implement preprocessing on data, Python and R are among the most popular tools. With packages such as scikit-learn [1] and PreProcess [2], most of the preprocessing algorithms covered in this paper can be implemented even without consideration of their details.
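As a brief, hypothetical illustration (not taken from the paper), several of the preprocessing steps discussed later can be chained in a single scikit-learn pipeline; the dataset and column names here are invented:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with a missing value, an outlier, and a categorical column
df = pd.DataFrame({
    "age": [25, np.nan, 42, 150],
    "income": [30000, 52000, np.nan, 61000],
    "gender": ["m", "f", "f", np.nan],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values with the median
    ("scale", StandardScaler()),                   # z-score normalization
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["gender"]),
])
X = preprocess.fit_transform(df)
print(X.shape)
```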
In the following section, the data mining pipeline and the primary procedures in it will be introduced. From Section 3 on, we will focus on the steps of the data preprocessing work: Section 3 will introduce the techniques used in data cleaning, while Section 4 will cover the data transformation techniques. In the last section, data reduction techniques will be discussed.

II. DATA MINING PIPELINE

The data mining pipeline is an integration of all procedures in a data mining task. While most of the data already exists in a database, a data warehouse, or other types of data source [3], various steps should be taken in order to make it easier for a human to understand. An illustration of the data mining pipeline is given in Fig. 1. Generally speaking, the key procedures include obtaining, scrubbing, exploring, modelling, and interpreting. These procedures are known as "OSEMN".¹ However, note that the pipeline is not a linear process in the real world, but a successive and long-lasting task. Methods in the scrubbing and modeling procedures have to be tested and refined, the obtaining procedure may have to be adapted for different kinds of data sources, and the visualization and interpretation of the data may have to be adjusted for their audience, thus meeting the audience's demand. In the rest of this section, details of these procedures will be discussed.

Fig. 1. An illustration of the data mining pipeline.

¹ http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

A. Obtaining

Obtaining the data is the most fundamental step in data mining, since it is the data itself that decides what knowledge it may contain. Databases and data warehouses are among the primary sources of data, where structured data can be fetched with query languages, usually SQL. The data warehouse is specially designed for organizing, understanding, and making use of the data [3]: it is usually a separate system from the operational database, has a time-variant structure as well as structures that make the subsequent analysis work easy, and, most importantly, is nonvolatile.

The obtained data can be archived as files and directly used for the subsequent procedures. It may also be reformatted and stored in a database or a data warehouse, prepared for future data mining tasks.

In the past, as well as in most common cases now, we regard the data obtaining step as the process of obtaining a dataset, regardless of how it was obtained. However, nowadays, new data is generated at an extreme speed: there are tons of data being created every second of every day. Some services, such as public opinion monitoring and recommendation systems, do need the newly generated data: they have a strong demand for timeliness. In these circumstances, the concept of a "stream" is comparatively more important than a dataset. A stream is a real-time representation of data. Under this concept, models and algorithms that can run online have been developed [4]. For stream mining tasks, the goal of data obtaining is no longer obtaining a dataset, but a real-time input source.

B. Scrubbing

Scrubbing is about the cleaning and preprocessing of the data, aiming to make the data have a unified format and be easy to model. As for the detailed concepts and techniques in data scrubbing, the reader can refer to the following sections of this paper, since most of them are covered in the overview of data preprocessing.

C. Exploring

Before modeling the data, people may want to get to know the underlying distribution of the data, the correlation between variables, and their correlation with the labels. Assumptions can be made in this step. For instance, people may assume smoking is highly correlated with lung cancer. These assumptions are important because they provide indications to other procedures in a data mining task, including helping to choose a suitable model and helping to justify the work when interpreting the data.

Tools for exploring the data and verifying the assumptions are usually statistical analysis and data visualization. Statistical analysis gives us the theoretical probability, known as the significance level, of our assumption being incorrect, while data visualization tools, such as ggplot [5] and D3 [6], give us an impression of the distribution of the data and help people verify their assumptions conceptually. Also, new patterns that were ignored in the assumption step might be found in the visualization step.

D. Modeling

With underlying patterns existing in the data source, modeling makes it possible to represent the patterns explicitly with data mining models. For a data mining task, modeling usually splits the data into a training set and a test set, so that the accuracy of the model can be scored on a relatively "new" dataset. If the model contains hyperparameters, such as the parameter k in a K-Nearest Neighbor (KNN) model, a cross-validation set will be created for obtaining the best set of hyperparameters.

For most data mining models, loss functions are defined. Generally, a loss function has a lower value if the model performs well. Besides, it usually has special features such as convexity, which makes gradient-based optimization algorithms perform better. With trainable parameters, a model's training step is about adjusting its parameters so that it gains a lower loss on its training data. The specific definition of the loss function depends on the model itself and the task. Mean squared error (\sum_{i=1}^{n} (\hat{y}_i - y_i)^2) for regression tasks and cross entropy (-\sum_{i=1}^{n} [y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)]) for classification tasks are frequently accepted loss functions.
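As a small illustration (not part of the paper), both loss functions can be written directly in NumPy; the labels and predictions below are hypothetical:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 1.0])   # hypothetical binary labels
y_pred = np.array([0.9, 0.2, 0.7, 0.6])   # hypothetical predicted probabilities

# Mean squared error: sum_i (y_hat_i - y_i)^2
mse = np.sum((y_pred - y_true) ** 2)

# Cross entropy: -sum_i [y_i log y_hat_i + (1 - y_i) log(1 - y_hat_i)]
eps = 1e-12  # guard against log(0)
cross_entropy = -np.sum(
    y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps)
)
print(mse, cross_entropy)
```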
There are many kinds of data mining models; their tasks include clustering, classification, and regression. The complexity of the models also varies: simple models such as linear regression only have a few parameters, and a small amount of data will make the training step converge, while complex models such as AlexNet have millions of parameters [7], and their training also requires a huge dataset. However, more complex does not mean better: the model should be decided according to the prediction target of the task, the dataset size, the data type, etc. Sometimes it is necessary to run different models over one dataset to find the most suitable data mining model.

E. Interpreting

While the previous steps are generally pure science, the interpreting step is more humanistic. Knowledge can be extracted from the input data, but it takes extra effort to convince people to accept this knowledge. Although complicated statistics and models make the work look more professional, for laymen, graphs, tables and well-explained accuracy make it easier to understand and accept. Besides, social skills such as storytelling and emotional quotient are also important: this procedure is about the human, not the data.

III. DATA CLEANING

The data obtained from the natural world is usually badly shaped. Some of the problems, such as outliers, may affect the data mining model and produce a biased result. For instance, an outlier may affect the K-Means clustering algorithm by substantially distorting the distribution of the data [8]. Other problems, if not handled, will make it impossible for the data to be analyzed by models, such as Not a Number (NaN) values in a data vector. Data cleaning techniques, including missing value handling and outlier detection, were introduced to tackle these problems. They make the gathered data suitable as input for the model.

A. Missing Values Handling

Missing values are a typical kind of incompleteness of a dataset. Most data mining models cannot tolerate missing values in their input data: these values can not be used for comparison, are not available for categorizing, and can not be operated on with arithmetic algorithms. Thus, it is necessary to handle the missing values before pushing the dataset to data mining models. The easiest way to deal with missing values is to drop the entire sample. This method is effective if the proportion of missing values in a dataset is not significant; however, if the number of missing values is not suitable for ignoring, or the percentage of missing values for each attribute is different [3], dropping the samples with missing values would reduce the size of the dataset dramatically, and the information contained in the dropped samples is not made use of. Another way to deal with missing values is by filling them, and there are various methods for finding a suitable value to fill a missing value; some of them are listed as follows.

1) Use a special value to represent missing: Sometimes the missing value itself has some meaning. For instance, in a patient's medical report, a missing value for uric acid means the patient did not go through the renal function test. Thus, using a certain value such as -1 makes sense, for it can be operated on like normal values while having a special meaning in the dataset.

2) Use attribute statistics to fill: Statistics such as the mean, median, or mode can be obtained from the non-missing values in the missing value's corresponding attribute. It is said that for a skewed dataset, the median would be a better choice [3]. However, this technique does not take the sample's other non-missing attributes into account.

3) Predicting the value with known attributes: If we assume there exists a correlation between attributes, filling the missing value can be modeled as a prediction problem: predicting the value from the non-missing attributes, with the other samples as training data. The prediction methods include regression algorithms, decision trees [3], and K-Means [9].

4) Assigning all possible values: For categorical attributes, given an example E with m possible values for its missing value, E can be replaced with m new examples E1, E2, ..., Em. This missing value filling technique assumes the missing attribute does not matter for the example, thus the value can be any one in its domain [10].

A comparison of different missing value filling techniques is done in [11]. In that work, different missing value filling techniques are tested on ten datasets by running simple and extended classification methods. The conclusion shows that the C4.5 decision tree method performs the best, that ignoring the samples with missing values and assigning all possible values also perform well, and that filling the values with the mode performs the worst. However, the performance of missing value filling techniques may differ because of the features of the dataset. As a result, most of the techniques are worth trying for a data mining task.
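As a brief, hypothetical sketch (not from the paper), several of these filling strategies map directly onto scikit-learn imputers; the array below is invented:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[25.0, 30000.0],
              [np.nan, 52000.0],
              [42.0, np.nan],
              [39.0, 61000.0]])

# 1) Use a special value (e.g., -1) to explicitly mark "missing"
X_flagged = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)

# 2) Fill with an attribute statistic (here the median of each column)
X_median = SimpleImputer(strategy="median").fit_transform(X)

# 3) Predict the value from the non-missing attributes; a nearest-neighbor
#    imputer is one simple stand-in for the prediction-based methods above
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```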
B. Outlier Detection

An outlier is a data sample that has a large distance to most of the other samples. Although a rare case does not necessarily mean a wrong one (e.g., age = 150), most outliers are caused by measurement errors or wrong recording, thus ignoring a rarely appearing case would not harm a lot. Although some models are robust against outliers, outlier detection is still recommended in data preprocessing work.

Statistics-based outlier detection algorithms are among the most commonly used algorithms; they assume an underlying distribution of the data [12] and regard the data examples whose corresponding probability density is lower than a certain threshold as the outliers. As the underlying distribution is unknown in most cases, the normal distribution is a good substitute, and its parameters can be estimated by the mean value and standard deviation of the data. The Mahalanobis distance [12], as in (1), is a scale-irrelevant distance between two data samples. Outliers can be decided by comparing the Mahalanobis distance between each sample and the mean value of all samples. The box-plot, as another kind of statistics-based outlier detection technique, can give a graphical representation of outliers by plotting the lower quartile and upper quartile along with the median [13].

D_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}    (1)

Without making any assumption about the distribution of the data, distance-based outlier detection algorithms can detect outliers by analyzing the distance between every two samples. Simple distance-based outlier detection algorithms are not suitable for a large dataset, since for n samples with m dimensions, their complexity is usually O(n^2 m) [12], and each computation requires scanning all the samples. However, an extended cell-based outlier detection algorithm is developed in [14], which guarantees linear complexity over the dataset volume and no more than three dataset scans. The experiment shows this algorithm is the best for datasets with dimension less than 4.

Sometimes, with consideration of temporal and spatial locality, an outlier may not be a separate point, but a small cluster. Cluster-based outlier detection algorithms consider clusters with small size as outlier clusters and clean the dataset by removing the whole cluster [15], [16].
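A minimal sketch (not from the paper) of statistics-based screening with the Mahalanobis distance of each sample to the sample mean, using a hypothetical threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # hypothetical data
X[0] = [8.0, -7.0, 9.0]              # plant one obvious outlier

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Mahalanobis distance of every sample to the mean, as in (1)
diff = X - mean
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

threshold = 3.0                      # hypothetical cutoff; tune per dataset
outliers = np.flatnonzero(d > threshold)
print(outliers)
```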

IV. DATA TRANSFORMATION

The representation of data in different attributes varies: some are categorical, while some are numerical. Categorical values can be nominal, binary or ordinal [3], and numerical data can also have different statistical features, including mean values and standard deviations. However, not all kinds of data meet the requirements of data mining models. Also, the differences among data attributes may bring trouble for the subsequent optimization work of data mining models. Data transformation is about modifying the representation of data so that it is qualified to be the input for data mining models, as well as making it easier for the optimization algorithm of the data mining model to take effect.

A. Numeralization

Categorical values widely exist in the natural world, and some operations, such as calculating the entropy between groups, can be done directly over categorical data. However, most operations are not applicable to categorical data. Thus, categorical data is supposed to be encoded into numerical data, making it meet the requirements of the models. The following encoding techniques are adopted for numeralization.

• One-Hot encoding: Regard each possible value of the categorical data as a single dimension, and use 1 for the dimension of the category which the sample belongs to, otherwise 0.
• Sequential encoding: For each possible value of the categorical data, assign it a unique numerical index. This is implemented as a kind of word encoding, as in [17].
• Customized encoding: Customized encoding is based on rules designed for a certain task. For instance, word2vec [18] is an encoding that can turn a word into a 300-dimensional vector, with consideration of the word's meaning.

Generally, one-hot encoding is suitable for categorical data with few possible values; if there are too many possible values, such as English words, the encoded dataset would be huge and sparse. Sequential encoding would not produce huge output, but the encoded data is not as easy to separate as one-hot encoded data. Customized encoding, if carefully designed, usually performs well for a certain kind of task, but for other tasks the encoding has to be redesigned, and its design can take a lot of effort.
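As a small hypothetical sketch (not from the paper), one-hot and sequential encoding correspond to scikit-learn's OneHotEncoder and OrdinalEncoder:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])  # hypothetical categorical column

# One-hot encoding: one binary dimension per possible value
# (use sparse=False instead of sparse_output=False on older scikit-learn versions)
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors)
# e.g. "green" -> [0., 1., 0.]

# Sequential (ordinal) encoding: one unique numerical index per possible value
ordinal = OrdinalEncoder().fit_transform(colors)
# e.g. "green" -> [1.]
```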
B. Discretization

Discretization of data is sometimes applied to meet the input requirements of models, such as Naive Bayes, which requires its input to be countable [19]. Also, it can smooth the noise. Discretization does not necessarily make the data categorical, but makes the continuous values countable. The discretization of data can be achieved with unsupervised learning methods, such as putting data into equal-width or equal-frequency slots (known as binning) or clustering. Some supervised learning methods such as decision trees can also be used for the discretization of data [3].
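For instance (a hypothetical sketch, not from the paper), equal-width and equal-frequency binning are available through scikit-learn's KBinsDiscretizer:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[3.0], [19.0], [25.0], [31.0], [44.0], [67.0], [80.0]])  # hypothetical attribute

# Equal-width binning: the value range is split into bins of equal width
equal_width = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
# Equal-frequency binning: each bin receives roughly the same number of samples
equal_freq = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")

print(equal_width.fit_transform(ages).ravel())
print(equal_freq.fit_transform(ages).ravel())
```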
C. Normalization

Since different attributes usually adopt different unit systems, their mean values and standard deviations are usually not identical. However, the numerical difference would make some of the attributes look more "important" than others [3]. This impression can cause trouble for some models; a typical one is KNN: larger values strongly affect the distance comparison, making the model mainly consider the attributes that tend to have larger numerical values. Besides, for neural network models, the different unit systems also have a negative influence on gradient descent optimization methods, forcing them to adopt a smaller learning rate. To tackle the problems mentioned above, various normalization methods have been issued; some of them are listed as follows.

1) Min-max normalization: Min-max normalization is used for mapping the attribute from its range [lb, ub] to another range [lb_new, ub_new]; the target range is usually [0, 1] or [-1, 1] [20]. For a sample with value v, the normalized value v' is given as in (2).

v' = \frac{v - lb}{ub - lb} (ub_{new} - lb_{new}) + lb_{new}    (2)

2) Z-score normalization: If the underlying range of an attribute is unknown or outliers exist, min-max normalization is not feasible or could be strongly affected [20]. Another normalization approach is to transform the data so that it has 0 as mean and 1 as standard deviation. Given the mean \mu and standard deviation \sigma of the attribute, the transformation is represented as in (3).

v' = \frac{v - \mu}{\sigma}    (3)

Note that if \mu and \sigma are unknown, they can be substituted with the sample mean and standard deviation.

3) Decimal scaling normalization: An easier way to implement the normalization is to shift the decimal point of the data so that each value in an attribute has an absolute value less than 1; the transformation is given as in (4), where j is the smallest integer for which this holds.

v' = \frac{v}{10^j}    (4)

In some cases, different attributes have an identical or similar unit system, such as in the preprocessing of RGB-colored images. In these cases, normalization is not necessary. However, if this is not guaranteed, normalization is still recommended for all data mining tasks.
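A brief hypothetical sketch of (2) and (3) using scikit-learn's scalers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical attributes on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

# Min-max normalization, as in (2), mapping each attribute to [0, 1]
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Z-score normalization, as in (3): zero mean, unit standard deviation per attribute
X_zscore = StandardScaler().fit_transform(X)
print(X_minmax)
print(X_zscore)
```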
D. Numerical Transformations

Transformations over a dataset can help to obtain additional attributes. The features obtained by transformation may be unimportant for some data mining models, such as neural networks, which have superior fitting potential. However, for relatively simpler models with fewer parameters, linear regression for example, the transformed features do help the model to achieve better performance (as in Fig. 2), for they can provide an additional indication of the relationship between attributes. The transformation can also be essential for scientific discoveries and machine controls [21].

Fig. 2. The effect of the Box-Cox transformation in linear regression. The feature and label are quadratically related.

Generally, given the attribute set {a_1, a_2, ..., a_p}, a numerical transformation can be represented as in (5). Theoretically, f can be any function; however, since the input data is finite, f can take polynomial forms [21].

x' = f(a_1, a_2, \ldots, a_p)    (5)

Commonly used representations of f include polynomial-based transformations, approximation-based transformations, the rank transformation and the Box-Cox transformation [20]. The parameters in the transformation formula can be obtained by subjective definition (for situations where people know the relationship between attributes and labels well), by brute search [20] or by applying the maximum likelihood method.
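As a hypothetical sketch (not from the paper), polynomial and Box-Cox style transformations are available in scikit-learn as PolynomialFeatures and PowerTransformer:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer

x = np.array([[1.0], [2.0], [3.0], [4.0]])   # hypothetical single attribute

# Polynomial transformation: adds x^2 (and higher orders) as extra attributes,
# which lets a linear model capture a quadratic feature-label relationship
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

# Box-Cox transformation: a power transform fitted by maximum likelihood
# (requires strictly positive input values)
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)
print(x_poly)
print(x_boxcox.ravel())
```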
V. DATA REDUCTION

The amount of data in a data warehouse or a dataset can be huge, causing difficulties for data storage and processing when working on a data mining task, while not every model needs a huge amount of data to train. On the other hand, although the data may have lots of attributes, there can be unrelated features as well as interdependence between features [22]. Data reduction is the technique that helps to reduce the amount or the dimension, or both, of a dataset, thus making the model's learning process more efficient as well as helping the model to obtain better performance, including preventing the overfitting problem and fixing a skewed data distribution.

A. Dimensional reduction

Dimensional reduction techniques are about reducing the dimensionality of data samples, thus reducing the total size of the data. As the number of attributes of a sample is reduced, there is less information contained in it. A good dimensional reduction algorithm will keep the more general information: this can make it more difficult for models to become overfitted. Some dimensional reduction techniques pose a dimensional reduction transformation over a dataset, generating new data samples which have fewer attributes than before. The transformations have different criteria. Principal component analysis, known as PCA, can reduce the dimension of the data while keeping the maximum variance of the data [23]. This is achieved by multiplying the matrix A = (a_1, a_2, ..., a_p)^T with the dataset X and keeping the top k dimensions (a_i stands for the normalized eigenvector corresponding to the i-th greatest eigenvalue of the covariance matrix of the dataset). By contrast, linear discriminant analysis (LDA) is meant to maximize the component axes for class separation. The implementation of LDA is similar to PCA; the only difference is that it replaces the covariance matrix with the scatter matrix of the samples. A graphical illustration of the difference between PCA and LDA is given in Fig. 3. In comparatively more situations, LDA outperforms PCA. PCA may outperform LDA when the amount of data is small, or the data is nonuniformly sampled [24].

Fig. 3. A comparison of the reduction results of PCA and LDA.

Other dimensional reduction algorithms include factor analysis (assuming a lower-dimensional underlying distribution), projection pursuit (measuring the aspect of non-Gaussianity) [25], and the wavelet transform [3].
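A brief hypothetical sketch comparing the two projections on a labeled toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, keeps the directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, keeps the directions that best separate the classes
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)   # both reduced from 4 to 2 dimensions
```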
Feature selection is another dimensional reduction technique: it is about removing irrelevant or correlated attributes from the dataset while keeping the other, relatively independent attributes untouched. Feature selection is more than simply selecting the features that have the greatest relevance to the variable to predict; the relationships between attributes are also supposed to be taken into consideration: the goal is to find a sufficiently good subset of features to predict with [22]. Feature selection methods can be divided into three types, as follows [26]; a short sketch follows the list.

• filter: Directly select the features based on attribute-level criteria, including information gain, a correlation score, or a chi-square test. The filter method does not take the data mining model into consideration.
• wrapper: Use search techniques to explore the potential subsets, according to their performance on the data mining model. Greedy strategies, including forward selection and backward elimination [26], are issued in order to reduce the time consumption.
• embedded: Embed the feature selection into the data mining model. Usually, the weights over different attributes act as feature selection. A typical example is the regularization term of the loss function in linear regression, known as Lasso regression (for L1 regularization) and ridge regression (for L2 regularization).
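A small hypothetical sketch of the filter and embedded styles (the wrapper style would instead search over feature subsets with a model in the loop):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Filter: rank features by a model-free criterion (here a univariate F-test score)
X_filtered = SelectKBest(score_func=f_regression, k=3).fit_transform(X, y)

# Embedded: L1-regularized (Lasso) regression drives the weights of
# irrelevant attributes to zero, implicitly selecting the rest
lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)
print(X_filtered.shape, selected)
```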
B. Instance selection and sampling

Both instance selection and sampling are about achieving the reduction of data by reducing the amount of data, seeking the chance to train the model with minimum performance loss [20], while being based on different criteria for selecting (or dropping) instances.

Most instance selection algorithms are based on fine-tuning classification models. To help the model make better decisions, condensation algorithms and edition algorithms have been issued [20]. Condensation removes the samples that lie in the relative center area of the class, assuming they do not contribute much to classification. Condensed nearest neighbor [27], for instance, selects instances by adding all the samples that cause a mistake to a K-Nearest Neighbor classifier. Edition algorithms remove the samples close to the boundary, hoping to give the classifier a smoother decision boundary. Related algorithms include a clustering-based algorithm that selects the centers of clusters [28].

Compared with instance selection methods, sampling is a faster and easier way to reduce the number of instances, since almost no complex selection algorithm is required for sampling methods: they only focus on reducing the amount of the data samples.
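A minimal, illustrative sketch of the condensed-nearest-neighbor idea described above (a single pass; the original algorithm in [27] repeats until no more samples are added):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, random_state=0)

# Start the retained ("condensed") set with one sample per class
keep = [int(np.flatnonzero(y == c)[0]) for c in np.unique(y)]

# Add every sample that a 1-NN classifier built on the retained set misclassifies
for i in range(len(X)):
    knn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
    if knn.predict(X[i:i + 1])[0] != y[i]:
        keep.append(i)

print(len(keep), "of", len(X), "samples retained")
```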

The easiest sampling technique is random sampling, which collects a certain amount or portion of samples from the dataset randomly. For skewed datasets, stratified sampling [22] is better adapted, since it takes the appearance frequency of the labels from different classes into account and assigns a different probability of being chosen to data with different labels, thus making the sampled dataset more balanced.
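A small hypothetical sketch contrasting the two, using scikit-learn's train_test_split to draw a stratified subsample:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)   # hypothetical skewed labels (~10% positives)

# Random sampling: every sample has the same probability of being chosen
idx = rng.choice(len(X), size=100, replace=False)
X_rand, y_rand = X[idx], y[idx]

# Stratified sampling: keep the class proportions in the reduced dataset
X_strat, _, y_strat, _ = train_test_split(X, y, train_size=100, stratify=y, random_state=0)

print(y_rand.mean(), y_strat.mean())   # the stratified fraction stays close to 0.1
```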
VI. SUMMARY AND OUTLOOK

Data mining, as a technique to discover additional information from a dataset, can be integrated as a pipeline, in which obtaining, scrubbing, exploring, modeling, and interpreting are the key steps. The purpose of the data mining pipeline is to tackle realistic problems, including reviewing the past and predicting the future. The specific technique used in each step should be selected with care to give the best performance to the pipeline.

The success of a data mining model depends on proper data preprocessing work. Unpreprocessed data can be of an unsuitable format for the model input, can cause instability for the optimization algorithm of the model, can have a great impact on the model's performance because of its noise and outliers, and can cause performance problems in the model's training process. With a careful selection of preprocessing steps, these problems can be reduced or avoided.

Data type transformation techniques as well as missing value handling techniques make it possible for models to process different types of data. By applying normalization, the unit systems of different attributes become more unified, reducing the probability that an optimization algorithm misses the global minimum. For simpler models, numerical transformations can provide richer features to the model, thus enhancing the model's ability to discover more underlying relationships between features and labels. For the overfitting problems of the model, dimensional reduction techniques help the model find the more general information about samples instead of overly detailed features by reducing the dimension of the features, thus removing some unimportant information. And for the performance of model training, both dimensional reduction techniques and instance selection techniques improve the training performance by reducing the total amount of data.

REFERENCES

[1] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[2] K. R. Coombes, PreProcess: Basic Functions for Pre-Processing Microarrays, 2019, R package version 3.1.7. [Online]. Available: https://CRAN.R-project.org/package=PreProcess
[3] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Elsevier, 2011.
[4] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data streams: a review," ACM SIGMOD Record, vol. 34, no. 2, pp. 18–26, 2005.
[5] H. Wickham, ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. [Online]. Available: http://ggplot2.org
[6] M. Bostock, V. Ogievetsky, and J. Heer, "D3: Data-driven documents," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2301–2309, 2011.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[8] T. Velmurugan and T. Santhanam, "Computational complexity between k-means and k-medoids clustering algorithms for normal and uniform distributions of data points," Journal of Computer Science, vol. 6, no. 3, p. 363, 2010.
[9] B. M. Patil, R. C. Joshi, and D. Toshniwal, "Missing value imputation based on k-mean clustering with weighted distance," in International Conference on Contemporary Computing. Springer, 2010, pp. 600–609.
[10] J. W. Grzymala-Busse, "On the unknown attribute values in learning from examples," in International Symposium on Methodologies for Intelligent Systems. Springer, 1991, pp. 368–377.
[11] J. W. Grzymala-Busse and M. Hu, "A comparison of several approaches to missing attribute values in data mining," in International Conference on Rough Sets and Current Trends in Computing. Springer, 2000, pp. 378–385.
[12] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons, 2011.
[13] S. Walfish, "A review of statistical outlier methods," Pharmaceutical Technology, vol. 30, no. 11, p. 82, 2006.
[14] E. M. Knorr, R. T. Ng, and V. Tucakov, "Distance-based outliers: algorithms and applications," The VLDB Journal: The International Journal on Very Large Data Bases, vol. 8, no. 3-4, pp. 237–253, 2000.
[15] I. Ben-Gal, "Outlier detection," in Data Mining and Knowledge Discovery Handbook. Springer, 2005, pp. 131–146.
[16] L. Duan, L. Xu, Y. Liu, and J. Lee, "Cluster-based outlier detection," Annals of Operations Research, vol. 168, no. 1, pp. 151–168, 2009.
[17] S. Angelidis and M. Lapata, "Multiple instance learning networks for fine-grained sentiment analysis," Transactions of the Association for Computational Linguistics, vol. 6, pp. 17–31, 2018.
[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[19] I. Rish et al., "An empirical study of the naive Bayes classifier," in IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, no. 22, 2001, pp. 41–46.
[20] S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining. Springer, 2015.
[21] T. Y. Lin, "Attribute transformations for data mining I: Theoretical explorations," International Journal of Intelligent Systems, vol. 17, no. 2, pp. 213–222, 2002.
[22] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, "Data preprocessing for supervised leaning," International Journal of Computer Science, vol. 1, no. 2, pp. 111–117, 2006.
[23] I. Jolliffe, Principal Component Analysis. Springer, 2011.
[24] A. M. Martínez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, 2001.
[25] I. K. Fodor, "A survey of dimension reduction techniques," Lawrence Livermore National Lab., CA (US), Tech. Rep., 2002.
[26] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, no. Mar, pp. 1157–1182, 2003.
[27] P. Hart, "The condensed nearest neighbor rule (corresp.)," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 515–516, 1968.
[28] A. Lumini and L. Nanni, "A clustering method for automatic biometric template selection," Pattern Recognition, vol. 39, no. 3, pp. 495–497, 2006.
