CHAPTER 2 – DATA PREPROCESSING
WHY DO WE NEED TO PREPROCESS THE DATA?
MUCH OF THE RAW DATA CONTAINED IN DATABASES IS
UNPREPROCESSED, INCOMPLETE AND NOISY
THE DATABASES MAY CONTAIN:
• FIELDS THAT ARE REDUNDANT
• MISSING VALUES
• OUTLIERS
• DATA IN A FORM NOT SUITABLE FOR DATA
MINING MODELS
• VALUES NOT CONSISTENT WITH POLICY OR
COMMON SENSE
TWO PRINCIPAL METHODS:
• DATA CLEANING
• DATA TRANSFORMATION
DATA CLEANING
HANDLING MISSING DATA
INSIGHTFUL MINER OFFERS A CHOICE OF
REPLACEMENT VALUES FOR MISSING DATA:
1. REPLACE THE MISSING VALUE WITH SOME
CONSTANT, SPECIFIED BY THE ANALYST.
2. REPLACE THE MISSING VALUE WITH THE FIELD
MEAN (FOR NUMERICAL VARIABLES) OR THE
MODE (FOR CATEGORICAL VARIABLES).
3. REPLACE THE MISSING VALUE WITH A VALUE
GENERATED AT RANDOM FROM THE OBSERVED
DISTRIBUTION OF THE VARIABLE.
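
THE THREE REPLACEMENT STRATEGIES ABOVE CAN BE SKETCHED IN PLAIN PYTHON. THIS IS A MINIMAL ILLUSTRATION, NOT INSIGHTFUL MINER'S OWN IMPLEMENTATION. THE FUNCTION NAMES (impute_constant, impute_mean_or_mode, impute_random) ARE HYPOTHETICAL, MISSING VALUES ARE ASSUMED TO BE STORED AS None, AND THE "RANDOM" OPTION IS ASSUMED TO MEAN SAMPLING FROM THE VALUES ACTUALLY OBSERVED IN THE FIELD.

```python
import random
import statistics

# Assumption: a missing value is represented as None.
MISSING = None


def impute_constant(values, constant):
    """Option 1: replace each missing value with an analyst-chosen constant."""
    return [constant if v is MISSING else v for v in values]


def impute_mean_or_mode(values, categorical=False):
    """Option 2: use the field mean (numerical) or mode (categorical)."""
    observed = [v for v in values if v is not MISSING]
    # statistics.mean / statistics.mode raise StatisticsError if nothing is observed.
    fill = statistics.mode(observed) if categorical else statistics.mean(observed)
    return [fill if v is MISSING else v for v in values]


def impute_random(values, rng=None):
    """Option 3: draw each replacement at random from the observed values.

    Assumption: sampling the observed values with replacement stands in for
    "the variable distribution observed".
    """
    rng = rng or random.Random()
    observed = [v for v in values if v is not MISSING]
    return [rng.choice(observed) if v is MISSING else v for v in values]


if __name__ == "__main__":
    ages = [25, None, 35, 30, None]
    colors = ["red", None, "blue", "red"]

    print(impute_constant(ages, 0))
    print(impute_mean_or_mode(ages))
    print(impute_mean_or_mode(colors, categorical=True))
    print(impute_random(ages, rng=random.Random(0)))
```

NOTE THAT MEAN OR MODE REPLACEMENT MAKES THE FIELD LOOK LESS VARIABLE THAN IT REALLY IS, WHILE RANDOM REPLACEMENT PRESERVES THE SPREAD OF THE FIELD BUT MAY PRODUCE A VALUE THAT IS IMPLAUSIBLE FOR THAT PARTICULAR RECORD.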
IDENTIFYING MISCLASSIFICATIONS