(林維昭) Data Preprocessing for Effective Data Classification

The document discusses data preprocessing techniques for effective data classification, including instance selection and feature selection. Instance selection techniques like clustering can be used to reduce imbalanced datasets by selecting representative samples. Feature selection identifies the most important features to reduce data dimensionality. The order of instance selection and missing value imputation can impact classification results, with instance selection first generally performing better.


Data Preprocessing for Effective Data Classification

Wei-Chao (Vic) Lin 林維昭


Associate Professor
Department of Information Management
Chang Gung University, Taiwan
[email protected]
Data classification

[Diagram: in the baseline flow, the training data D train a classifier that labels the testing data; with preprocessing, D is first reduced to training data D1 before the classifier is trained.]

Instance selection + Feature selection

[Diagram: a data matrix with instances D1…DN as rows and features A1…AX as columns. Instance selection keeps a representative subset of the rows; feature selection keeps a subset of the columns. Both yield a reduced dataset.]
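As a minimal sketch of the two reduction axes (hypothetical data; random row sampling and a simple variance filter stand in for the actual selectors discussed in the following slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # N=1000 instances (rows), 20 features (columns)

# Instance selection: keep a subset of the rows.
row_idx = rng.choice(len(X), size=300, replace=False)
X_fewer_rows = X[row_idx]          # shape (300, 20)

# Feature selection: keep a subset of the columns.
col_idx = np.argsort(X.var(axis=0))[-8:]
X_fewer_cols = X[:, col_idx]       # shape (1000, 8)
```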
Research

 in Data Mining (general), 2015~2018: 13 SCI/SSCI papers
  2017/10 Information Sciences (MOST 106)
  2015/12 Applied Soft Computing
  2015/08 Journal of Systems and Software

 in Image Retrieval, 2015~2018: 5 SCI papers

 in Medical Applications, 2015~2018: 5 SCI papers
  2015/09 Technology and Health Care
  Biomedical Journal (under review)

 Bibliometrics, 2015~2016: 3 SCI papers

Department of Information Management, Chang Gung University \ Associate Professor \ Wei-Chao Lin (林維昭)


Instance selection

 Imbalanced datasets
 Machine Learning Based Data Sampling for Class-Imbalanced Datasets (MOST 106)
 Majority vs. minority classes
 The main aim is to reduce the majority-class data samples using clustering techniques, such as affinity propagation and k-means

[Image: majority class vs. minority class]
Instance selection

 Lin, W.-C., Tsai, C.-F.*, Hu, Y.-H., and Jhang, J.-S. (2017/10) Clustering-Based Undersampling for Class-imbalanced Data. Information Sciences, 409-410, pp. 17-26. (SCI; IF=4.832; 7/146 Q1 (2016 JCR) in COMPUTER SCIENCE, INFORMATION SYSTEMS)
 This paper proposes applying clustering techniques to data sampling in order to solve the class imbalance problem.
 The main idea is to select representative samples to replace the original data, balancing the number of samples between classes while reducing the chance of an uneven data distribution during sampling.
 The study uses 44 small datasets and 2 large datasets with five classifiers (C4.5, SVM, MLP, k-NN, NB) combined with ensemble learning algorithms, comparing different cluster-based sampling strategies, classifiers, numbers of clusters, and so on against traditional methods and ensemble learning methods from the literature. The results show that, among all combinations, preprocessing with the nearest neighbors of the cluster centers paired with the MLP algorithm is the best choice; for both small and large datasets, the AdaBoost C4.5 ensemble performs best and most stably.
Class-imbalanced

 Lin, W.-C., Tsai, C.-F.*, Hu, Y.-H., and Jhang, J.-S. (2017/10) Clustering-Based Undersampling for Class-imbalanced Data. Information Sciences, 409-410, pp. 17-26.

[Diagram: the imbalanced dataset is split into a training set and a testing set; instance selection is applied to the majority class of the training set to produce a balanced training set, which trains the classifier (C4.5 / SVM / MLP / k-NN / NB); the classifier is then evaluated on the testing set.]
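A minimal sketch of clustering-based undersampling in the spirit of this paper, assuming k-means on the majority class and keeping the real sample nearest each cluster center (one of the strategies the paper compares); the function name and data are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_keep, random_state=0):
    """Reduce the majority class to n_keep samples: run k-means with
    n_keep clusters and keep the real sample nearest each center."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=random_state).fit(X_maj)
    kept = [np.argmin(np.linalg.norm(X_maj - c, axis=1)) for c in km.cluster_centers_]
    return X_maj[np.array(kept)]

# Usage: shrink the majority class to the minority-class size, then train
# any classifier (C4.5, SVM, MLP, k-NN, NB) on the balanced training set.
rng = np.random.default_rng(0)
X_majority = rng.normal(0, 1, size=(500, 4))
X_minority = rng.normal(2, 1, size=(50, 4))
X_maj_reduced = cluster_undersample(X_majority, n_keep=len(X_minority))
```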
Instance selection

 Lin, W.-C., Tsai, C.-F., Ke, S.-W., and You, M.-L. (2015/12) On Learning Dual Classifiers for Better Data Classification. Applied Soft Computing, Special Issue on Soft Computing for Big Data, 37, pp. 296-302. (SCI; IF=2.857; 21/130 Q1 (2015 JCR) in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
 This study proposes a new data preprocessing pipeline (DuC for short) to improve data classification performance.
 The main flow first applies instance selection to the training set; the subsets judged noisy and non-noisy by the instance selection algorithm are then used to train two independent classification models. At testing time, each test instance is compared by k-NN similarity: test data more similar to the noisy set are classified by the model trained on the noisy data, and likewise test data more similar to the non-noisy set are classified by the model trained on the non-noisy data.
 Experimental results show that this method achieves higher classification accuracy than the traditional approach of training a single classification model.
Dual Classification (DuC)

[Diagram: instance selection (IB3 [1], DROP3 [2], or GA [3]) splits the training data into a 'good' subset and a 'noisy' subset, each of which trains its own SVM (Classifier_good and Classifier_noisy); a k-NN similarity measure then routes each test instance to the classifier trained on the more similar subset.]

[1] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Mach. Learn. 6 (1) (1991) 37-66.
[2] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning algorithms, Mach. Learn. 38 (2000) 257-286.
[3] J.R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction: an experimental study, IEEE Trans. Evolut. Comput. 7 (6) (2003) 561-575.
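A hedged sketch of the DuC routing step, assuming the 'good' and 'noisy' subsets have already been produced by instance selection (IB3/GA/DROP3); the routing rule shown (mean distance to the k nearest samples of each subset) is one plausible reading of the k-NN similarity measure, and all names are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def duc_predict(X_good, y_good, X_noisy, y_noisy, X_test, k=3):
    # Train one classifier per subset (SVMs, as in the DuC diagram).
    clf_good = SVC().fit(X_good, y_good)
    clf_noisy = SVC().fit(X_noisy, y_noisy)
    # k-NN similarity: route each test instance to the subset whose
    # k nearest training samples are closer on average.
    d_good = NearestNeighbors(n_neighbors=k).fit(X_good).kneighbors(X_test)[0].mean(axis=1)
    d_noisy = NearestNeighbors(n_neighbors=k).fit(X_noisy).kneighbors(X_test)[0].mean(axis=1)
    return np.where(d_good <= d_noisy,
                    clf_good.predict(X_test),
                    clf_noisy.predict(X_test))
```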
Instance selection

 Lin, W.-C., Tsai, C.-F., Ke, S.-W., Hung, C.-W., and Eberle, W. (2015/08) Learning to Detect Representative Data for Large Scale Instance Selection. Journal of Systems and Software, 106, pp. 1-8. (SCI; IF=1.424; 24/106 Q1 (2015 JCR) in COMPUTER SCIENCE, SOFTWARE ENGINEERING)
 This paper proposes a method that automatically detects representative training data (ReDD for short), mainly to improve the time cost of current instance selection methods on large-scale datasets.
 The idea is to first run instance selection quickly on a small portion of the data, then train a classifier to learn the characteristics of the representative data identified by instance selection, so that it can detect the representative data contained in the full original dataset.
 Experiments on several large datasets show that this method outperforms traditional instance selection methods in both processing time and final classifier accuracy.
Representative Data Detection (ReDD)

[Diagram: 50% of the training data (D1) goes through instance selection (IB3 / GA / DROP3), which labels each sample good or noisy; a good/noisy detector (k-NN similarity measure) is trained on these labels and applied to the remaining 50% of the training data (D2), keeping only the samples detected as good.]
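A minimal sketch of the ReDD idea per the diagram, assuming the caller supplies an instance selection routine select_fn that returns a boolean keep/discard mask (standing in for IB3/GA/DROP3) and that the detector is a plain k-NN classifier; all names here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def redd_filter(X, y, select_fn, k=3, random_state=0):
    """ReDD: run (slow) instance selection on half the data, then train a
    k-NN detector on its good/noisy labels to filter the other half fast."""
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=random_state)
    keep1 = select_fn(X1, y1)                    # instance selection on D1 only
    detector = KNeighborsClassifier(n_neighbors=k).fit(X1, keep1)
    keep2 = detector.predict(X2).astype(bool)    # detect representative data in D2
    return (np.vstack([X1[keep1], X2[keep2]]),
            np.concatenate([y1[keep1], y2[keep2]]))
```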
Research

 in Medical Applications
 Cancer Prediction
1. 2017/01 PLOS ONE
 Large Scale Medical Data Mining
2. 2015/03 Technology and Health Care
 Incomplete Medical Data
3. 2018/02 Journal of Healthcare Engineering
4. 2016/10 Expert Systems
5. 2015/09 Technology and Health Care
Imputation

 Imputation: filling in the missing values in the data
  Using the mean / median
  Using clustering
  Using clustering of the observed data
  Using machine learning techniques
 In the literature, imputation based on the observed data mostly predicts the missing values from the complete training data
 In other words, the imputed values are biased toward the complete training data; however, the information it provides may contain some noise, which introduces error into the output
 Moreover, the training data are not necessarily complete, either
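For concreteness, a small sketch of two standard imputation routes using scikit-learn: mean imputation, and a k-NN imputer that estimates each missing entry from similar observed rows (a simple form of kNNI). The data here are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [7.0, 8.0, 12.0]])

# Mean imputation: replace each missing entry with its column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# k-NN imputation: estimate each missing entry from the k most similar
# rows, with similarity measured on the observed features.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```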
Instance selection & missing value imputation

 Chen, C.-W., Lin, W.-C., Ke, S.-W., Tsai, C.-F., and Hu, Y.-H. (2015/09) On Mining Incomplete Medical Datasets: Ordering Imputation and Classification. Technology and Health Care, 23(5), pp. 619-625. (SCI; IF=0.678; 68/76 Q4 (2015 JCR) in ENGINEERING, BIOMEDICAL)
 This study predicts the missing values from the observed data.
 Performing instance selection first and imputation afterwards yields better classification performance;
 however, excessive instance selection leads to worse classification results.
Instance selection & missing value imputation

 D1: imputation + instance selection (kNNI, then DROP3)
 D2: instance selection + imputation

[Diagram: for D1, the training data D (complete part D_complete plus incomplete part D_incomplete) are first imputed into D', then reduced by instance selection into D1. For D2, instance selection is applied first (yielding D_complete' and D_incomplete'), and imputation then produces the reduced data D2. Either reduced set trains the classifier, which is tested on the testing data.]
Instance selection & missing value imputation

 D3: D1 + instance selection
 D4: D2 + instance selection

[Diagram: the same two pipelines, each followed by a second instance selection step: D1 is further reduced to D3, and D2 is further reduced to D4, before classifier training and testing.]
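A hedged sketch of the two orderings (D1: impute then select; D2: select then impute), reusing the KNNImputer shown earlier. The paper pairs kNNI with DROP3; here random subsampling stands in for DROP3, and running D2's selection on the complete cases only is a simplification of the paper's flow:

```python
import numpy as np
from sklearn.impute import KNNImputer

def select_instances(X, y, frac=0.7, random_state=0):
    # Stand-in for DROP3: keep a random fraction of the instances.
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    return X[idx], y[idx]

def pipeline_d1(X, y):
    # D1: impute first (kNNI), then instance selection.
    X_imp = KNNImputer(n_neighbors=3).fit_transform(X)
    return select_instances(X_imp, y)

def pipeline_d2(X, y):
    # D2: instance selection first, then imputation. Selection runs on the
    # complete cases; the incomplete rows are appended and then imputed.
    complete = ~np.isnan(X).any(axis=1)
    Xc, yc = select_instances(X[complete], y[complete])
    X_all = np.vstack([Xc, X[~complete]])
    y_all = np.concatenate([yc, y[~complete]])
    return KNNImputer(n_neighbors=3).fit_transform(X_all), y_all
```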
Feature Selection

 Ensemble Feature Selection in Medical Datasets: Combining Filter, Wrapper, and Embedded Feature Selection Results, Biomedical Journal (under review)

Wrapper
 Aim: decide whether to add a variable through an objective function (AUC/MSE)
 Principle: black-box, iterative concept; ignores time cost and produces a feature subset through optimization
 Advantages: higher accuracy
 Disadvantages: the most time-consuming; the predictor depends on the data
 Common algorithms: GA, SA

Filter
 Aim: pick the variables with the highest feature relevance
 Principle: remove irrelevant attributes
 Advantages: lower data dimensionality, less prone to overfitting, takes little time
 Disadvantages: lower accuracy; no interaction with the classifier
 Common algorithms: IG, entropy, PCA

Embedded
 Aim: the model selects the attributes automatically
 Principle: let the algorithm decide which attributes are selected
 Advantages: requires less data
 Disadvantages: the predictor depends on the data
 Common algorithms: decision trees


[Diagram: experiment design.
Medical datasets: UCI medical datasets, the KDD Cup 2008 breast cancer dataset, and a Chang Gung Memorial Hospital diabetes patient questionnaire.
Feature selection: Wrapper (GA), Filter (IG), Embedded (C4.5), plus a no-selection baseline.
Feature selection fusion: pairwise intersections and unions of the GA, IG, and C4.5 feature sets (GA∩IG, GA∩C4.5, IG∩C4.5; GA∪IG, GA∪C4.5, IG∪C4.5) and their multi-intersection.
Classification: SVMs / k-NN on the training and testing sets, evaluated by accuracy.]
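A small sketch of the fusion step, assuming each selector has already returned a set of feature indices. Mutual information stands in for IG, a decision tree's importances stand in for C4.5, and the GA wrapper's output is taken as a given set; all names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
k = 8  # number of features each selector keeps

# Filter: top-k features by mutual information (an IG-style criterion).
ig_set = set(np.argsort(mutual_info_classif(X, y, random_state=0))[-k:])

# Embedded: top-k features by decision-tree importance (C4.5-style).
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
c45_set = set(np.argsort(tree.feature_importances_)[-k:])

# Wrapper: a GA-selected set, assumed given here.
ga_set = {0, 2, 3, 5, 7, 11, 13, 17}

# Fusion strategies from the slide:
union_feats = ga_set | ig_set | c45_set    # Union
pair_inter = ga_set & ig_set               # one pairwise Intersection
multi_inter = ga_set & ig_set & c45_set    # Multi-Intersection
X_fused = X[:, sorted(union_feats)]        # reduced dataset for classification
```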
Future works

[Diagram: the same pipeline as before (training data D → preprocessing → training data D1 → classifier → classification of the testing data), with the preprocessing step replaced by a fusion of instance selection and feature selection (MOST 108).]
