Data pre-processing is essential for improving the quality of datasets used in data mining, as it addresses issues such as inconsistencies, missing values, and noise. Techniques such as data cleaning, smoothing, normalization, and data reduction are employed to enhance data quality and ensure effective model building. The success of data mining projects relies heavily on the quality of the prepared data, making careful data preparation a prerequisite for any modeling effort.
03 Data Preparation
DATA PRE-PROCESSING
▪ In real-world applications, data can be inconsistent, incomplete and/or noisy.
▪ Errors can happen, and the prediction rate will then be lower.
▪ Good data is a prerequisite for producing effective models of any type, so careful preparation produces better models, faster.
▪ Analyzing data that has not been carefully screened for such problems can produce highly misleading results.
▪ The success of data mining projects therefore depends heavily on the quality of the prepared data.
▪ Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling.
▪ Start with an initial dataset to get familiar with the data, discover first insights into it, and gain a good understanding of any possible data quality issues.

DATA CLEANING
Data cleaning attempts to:
▪ Fill in missing values
▪ Smooth out noisy data
▪ Correct inconsistencies

Ways to handle missing values:
▪ Ignore the tuple with missing values;
▪ Fill in the missing values manually;
▪ Use a global constant to fill in missing values (NULL, unknown, etc.);
▪ Use the attribute mean to fill in missing values of that attribute;
▪ Use the attribute mean of all samples belonging to the same class to fill in the missing values;
▪ Infer the most probable value to fill in the missing value.

DATA SMOOTHING
▪ The purpose of data smoothing is to eliminate noise. This can be done by:
✓ Binning
✓ Clustering
✓ Regression

Binning
▪ Binning smooths the data by consulting each value's neighborhood; it aims to remove noise from the data set.
[1] Smoothing the data by equal-frequency bins
[2] Smoothing by bin means
[3] Smoothing by bin boundaries
(A short Python sketch of these three methods is given further below.)

Example. Unsorted data for price in dollars: 8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21
STEP 1: Sort the data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34

Smoothing by equal-frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34

Smoothing by bin means (each value is replaced by its bin's mean):
Bin 1: (8 + 9 + 15 + 16)/4 = 12 → 12, 12, 12, 12
Bin 2: (21 + 21 + 24 + 26)/4 = 23 → 23, 23, 23, 23
Bin 3: (27 + 30 + 30 + 34)/4 = 30.25 → 30.25, 30.25, 30.25, 30.25

Smoothing by bin boundaries:
▪ Pick the MIN and MAX value of each bin.
▪ Put the MIN on the left side and the MAX on the right side.
▪ Each middle value moves to the boundary (MIN or MAX) that is closest to it.
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34

Clustering
▪ Data is organized into groups of “similar” values.
▪ Rare values that fall outside these groups are considered outliers and are discarded.

Regression
▪ Data regression consists of fitting the data to a function.
▪ A linear regression, for instance, finds the line that fits two variables so that one variable can predict the other.
▪ More variables can be involved in a multiple linear regression.

DATA INTEGRATION
▪ Data analysis may require combining data from multiple sources into a coherent data store. There are many challenges:
▪ Schema integration: CID » C_number » Cust-id » cust#
▪ Semantic heterogeneity
▪ Data value conflicts (different representations or scales, etc.)
▪ Redundant records
▪ Redundant attributes (an attribute is redundant if it can be derived from other attributes)
▪ Correlation analysis: P(A∧B)/(P(A)P(B)) equals 1 for independence, is > 1 for positive correlation, and < 1 for negative correlation.
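Returning to the binning example above, here is a minimal Python sketch of the three smoothing variants. The helper names (equal_frequency_bins, smooth_by_means, smooth_by_boundaries) are invented for this illustration and are not part of any library.

```python
# Minimal sketch of smoothing by binning; helper names are illustrative, not a library API.
prices = [8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21]

def equal_frequency_bins(values, n_bins):
    """STEP 1: sort, then split into bins holding the same number of values."""
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by whichever bin boundary (min or max) is closer."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(smooth_by_means(bins))       # [[12.0, 12.0, 12.0, 12.0], [23.0, ...], [30.25, ...]]
print(smooth_by_boundaries(bins))  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]
```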
DATA TRANSFORMATION
▪ Data is sometimes in a form not appropriate for mining.
▪ Either the algorithm at hand cannot handle it, the form of the data is not regular, or the data itself is not specific enough.
Typical transformations include:
▪ Normalization (to compare carrots with carrots)
▪ Smoothing
▪ Aggregation (a summary operation applied to the data)
▪ Generalization (low-level data is replaced with higher-level data through a concept hierarchy)

Min-max normalization: a linear transformation from v to v'
▪ v' = ((v - min)/(max - min)) × (new_max - new_min) + new_min
▪ Example: transform $30,000 in the range [10,000..45,000] into [0..1] → (30,000 - 10,000)/35,000 × (1 - 0) + 0 ≈ 0.571

Z-score normalization:
▪ normalizes v into v' based on the attribute's mean and standard deviation
▪ v' = (v - mean)/standard_deviation

Normalization by decimal scaling:
▪ moves the decimal point of v by j positions, where j is the minimum number of positions needed for the maximum absolute value to fall within [0..1]: v' = v/10^j
▪ Example: if v ranges between -56 and 9,976, then j = 4 → v' ranges between -0.0056 and 0.9976
(A short Python sketch of these three normalization methods is given further below.)

DATA REDUCTION
▪ The data is often too large.
▪ Reducing the data can improve performance.
▪ Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) results.
Data reduction includes:
▪ Data cube aggregation
▪ Dimension reduction
▪ Data compression
▪ Discretization
▪ Numerosity reduction

Data cube aggregation
▪ Reduce the data to the concept level needed in the analysis.
▪ Queries regarding aggregated information should be answered using the data cube when possible.

Dimension reduction
▪ Feature selection (i.e., attribute subset selection)
▪ Use heuristics: select the local ‘best’ (or most pertinent) attribute
▪ Decision tree induction
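Returning to the three normalization formulas above, here is a small Python sketch; the function names, and the mean and standard deviation used in the z-score call, are assumptions made only for this illustration.

```python
# Sketch of the three normalization formulas above; names and sample statistics are illustrative.
import math

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """v' = ((v - min)/(max - min)) * (new_max - new_min) + new_min"""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """v' = (v - mean) / standard_deviation"""
    return (v - mean) / std

def decimal_scaling(values):
    """v' = v / 10^j, where j is the smallest integer that brings every |v'| into [0..1]."""
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / 10 ** j for v in values]

print(min_max(30_000, 10_000, 45_000))          # ~0.571
print(z_score(30_000, mean=27_500, std=8_000))  # 0.3125 (assumed mean and std)
print(decimal_scaling([-56, 9_976]))            # [-0.0056, 0.9976]
```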
Data compression
▪ Data compression reduces the size of the data.
▪ It saves storage space and communication time.
▪ It is beneficial if data mining algorithms can manipulate the compressed data directly, without uncompressing it.

NUMEROSITY REDUCTION
Parametric:
▪ Regression (a model or function that estimates the distribution is stored instead of the data)
Non-parametric:
▪ Histograms
▪ Clustering
▪ Sampling

Histograms
▪ A popular data reduction technique: divide the data into buckets and store a representation of each bucket (sum, count, etc.)
▪ Common bucketing rules: Equi-width, Equi-depth, V-Optimal, MaxDiff

Clustering
▪ Partition the data into clusters based on “closeness” in space.
▪ Retain representatives of the clusters (centroids) and the outliers.
▪ Effectiveness depends upon the distribution of the data.
▪ Hierarchical clustering is possible (multi-resolution).

Sampling
▪ Allows a large data set to be represented by a much smaller random sample (subset) of the data. (A short sketch of these sampling schemes follows below.)
▪ Simple random sample without replacement (SRSWOR)
▪ Simple random sample with replacement (SRSWR)
▪ Cluster sample (SRSWOR or SRSWR from clusters)
▪ Stratified sample

Discretization
▪ Discretization is used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals.
▪ Discretization can reduce the data set, and can also be used to generate concept hierarchies automatically.
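As a minimal sketch of the sampling schemes listed above, assuming pandas is available; the toy DataFrame and its column names are invented for illustration.

```python
# Sketch of SRSWOR, SRSWR and stratified sampling with pandas; the toy data is illustrative.
import pandas as pd

df = pd.DataFrame({
    "price": [8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34],
    "category": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
})

# Simple random sample without replacement (SRSWOR): each row appears at most once.
srswor = df.sample(n=4, replace=False, random_state=0)

# Simple random sample with replacement (SRSWR): the same row may be drawn several times.
srswr = df.sample(n=4, replace=True, random_state=0)

# Stratified sample: draw the same fraction from every stratum (here, every category).
stratified = df.groupby("category", group_keys=False).sample(frac=0.5, random_state=0)

print(srswor, srswr, stratified, sep="\n\n")
```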