Data Warehousing and Data Mining Reference Note
Unit-3
Data Preprocessing
Introduction
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format.
Raw data (real-world data) is often incomplete, inconsistent, and/or noisy, which increases the
chances of error and misinterpretation.
+ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
  only aggregate data. E.g., occupation = " "
+ Noisy: containing errors or outliers. E.g., Salary = "-10"
+ Inconsistent: containing discrepancies in codes or names. E.g., Age = "42",
  Birthday = "03/07/1997"
Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares
raw data for further processing.
Why is data dirty?
Raw data collected from the environment can be dirty. Data is said to be dirty if it is
incomplete, noisy, or inconsistent.
+ Incomplete data may come from
- “Not applicable” data value when collected
- Different considerations between the time when the data was collected and when it is
analyzed
- Human/hardware/software problems
+ Noisy data (incorrect values) may come from
- Faulty data collection instruments
- Human or computer error at data entry
- Errors in data transmission
+ Inconsistent data may come from
- Different data sources
- Functional dependency violation (e.g., modify some linked data)
+ Duplicate records also need data cleaning
Why do we need to preprocess data?
By preprocessing data, we:
+ Make our database more accurate: We eliminate incorrect or missing values that are
  there as a result of human error or bugs.
+ Boost Consistency: when there are inconsistencies in data or duplicates, it affects the
accuracy of the results.
+ Make the database more complete: We can fill in the attributes that are missing, if needed.
+ Smooth the data: This way we make it easier to use and interpret.
Major Tasks in Data Preprocessing
1. Data Cleaning
   It can be applied to remove noise and correct inconsistencies in the data.
2. Data Integration
   It combines data from multiple sources into a coherent data store, e.g., a data warehouse.
3. Data Transformation
   It consolidates data into forms appropriate for mining, so that the mining process may be
   more efficient and its results easier to apply.
4. Data Reduction
   It obtains a reduced representation of the data that is much smaller in volume but produces
   the same or similar analytical results.
5. Data Discretization
   Part of data reduction, of particular importance for numerical data.
Fig: Forms of data preprocessing (data cleaning, data integration, data transformation, data reduction)

Data Cleaning
Real-world data can have many irrelevant and missing parts. Data cleaning is done to handle
these issues. It involves handling of missing data, noisy data, etc.
a) Missing Data: This situation arises when some values are missing in the data. It can be
   handled in various ways. Some of them are:
   * Ignore the tuples: This approach is suitable only when the dataset we have is quite
     large and multiple values are missing within a tuple.
   * Fill in the missing value manually.
   * Use a global constant to fill in the missing value, e.g., "unknown" or a new class.
   * Use the attribute mean to fill in the missing value.
   * Use the attribute mean for all samples belonging to the same class as the given tuple.
   * Use the most probable value to fill in the missing value.
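The strategies above can be sketched in a few lines of code. The following is a minimal illustration using pandas; the column names and values are made-up examples for this note, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical data with missing values (columns and values are made up).
df = pd.DataFrame({
    "Name":       ["Anu", "Bikash", "Chandra", "Dipa"],
    "Class":      ["A", "A", "B", "B"],
    "Occupation": ["teacher", None, "engineer", None],
    "Salary":     [20000.0, None, 35000.0, None],
})

# 1. Ignore the tuples that contain missing values.
dropped = df.dropna()
print(dropped)

# 2. Fill a categorical attribute with a global constant such as "unknown".
df["Occupation"] = df["Occupation"].fillna("unknown")

# 3. Fill a numeric attribute with the attribute mean over all tuples.
df["SalaryMeanFilled"] = df["Salary"].fillna(df["Salary"].mean())

# 4. Fill with the attribute mean of tuples belonging to the same class.
df["SalaryClassMeanFilled"] = df.groupby("Class")["Salary"] \
                                .transform(lambda s: s.fillna(s.mean()))

print(df)
```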
b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It
   can be generated due to faulty data collection instruments, data entry errors, etc. E.g.,
   Salary = "-10". It can be handled in the following ways:
Binning method: This method is used to smooth or handle noisy data. First, the data is
sorted, and then the sorted values are distributed into segments of equal size and stored
in the form of bins. There are three methods for smoothing data in a bin:
- Smoothing by bin mean: In this method, the values in the bin are replaced by the mean
  value of the bin.
- Smoothing by bin median: In this method, the values in the bin are replaced by the
  median value of the bin.
- Smoothing by bin boundary: In this method, the minimum and maximum values of the
  bin are taken as bin boundaries and each value in the bin is replaced by the closest
  boundary value.
Example:
Unsorted data for price in dollars: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Partitioning into (equal-frequency) bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34

Smoothing by bin means   | Smoothing by bin boundaries
Bin 1: 12, 12, 12, 12    | Bin 1: 8, 8, 16, 16
Bin 2: 23, 23, 23, 23    | Bin 2: 21, 21, 26, 26
Bin 3: 30, 30, 30, 30    | Bin 3: 27, 27, 27, 34
(A small code sketch of this binning example is given after the noise-handling methods below.)
Regression: Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple (having
multiple independent variables).
Clustering: This approach groups similar data values into clusters. Values that fall outside
the clusters may be treated as outliers.
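As referenced in the binning example above, here is a minimal Python sketch of equal-frequency binning and smoothing. The function names are my own, and it assumes the data divides evenly into the chosen number of bins.

```python
# Equal-frequency binning and smoothing, reproducing the price example above.
prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]

def make_bins(values, n_bins):
    """Sort the values and split them into equal-frequency bins."""
    values = sorted(values)
    size = len(values) // n_bins          # assumes len(values) is divisible by n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_mean(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closest bin boundary (min or max)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = make_bins(prices, 3)
print(bins)                        # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(smooth_by_mean(bins))        # [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]
print(smooth_by_boundaries(bins))  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]
```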
Data Integration
Data integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the
data. These sources may include multiple data cubes, databases, or flat files.
Fig: Data Integration
There are mainly 2 major approaches for data integration:
+ Tight Coupling: In tight coupling, data is combined from different sources into a single
  physical location through the process of ETL (Extraction, Transformation and Loading).
+ Loose Coupling: In loose coupling, data remains only in the actual source databases. In
  this approach, an interface is provided that takes a query from the user, transforms it into
  a form the source databases can understand, and then sends the query directly to the
  source databases to obtain the result.
Issues in Data Integration
+ Entity Identification Problem: Since the data is unified from heterogeneous sources, how
  can we match the real-world entities across the data? For example, suppose we have
  customer data from two different data sources. An entity from one data source has
  customer_id and the entity from the other data source has customer_number. How would
  the data analyst or the system know that these two attributes refer to the same entity?
+ Redundancy: An attribute may be redundant if it can be derived or obtained from another
  attribute or set of attributes. Inconsistencies in attribute naming can also cause
  redundancies in the resulting data set. Some redundancies can be detected by correlation
  analysis (a small sketch is given after this list).
+ Data Conflict Detection and Resolution: Data conflict means that the data merged from
  different sources does not match. For example, attribute values may differ in different data
  sets because they are represented differently; the price of a hotel room may be represented
  in different currencies in different cities. Such issues are detected and resolved during
  data integration.
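As noted under redundancy above, correlation analysis can reveal when one numeric attribute is largely derivable from another. The following is a minimal sketch using the Pearson correlation coefficient; the attribute names and values are hypothetical. A coefficient close to +1 or -1 suggests the two attributes are redundant.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attributes from two integrated sources:
# annual_salary is (roughly) 12 * monthly_salary, so the two are redundant.
monthly_salary = [20, 25, 30, 40, 55]
annual_salary  = [240, 310, 360, 470, 660]

r = pearson_r(monthly_salary, annual_salary)
print(round(r, 3))   # close to 1.0, so one of the attributes can be dropped
```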
Data Transformation
Data transformation is the process of transforming data into the form that is appropriate for
mining.
Some Data Transformation Strategies
1. Smoothing: It is used to remove the noise from data. Such techniques include binning,
clustering, and regression.
2. Aggregation: Here summary or aggregation operations are applied to the data. For
   example, the daily sales data may be aggregated so as to compute monthly and annual total
   amounts. This step is typically used in constructing a data cube for analysis of the data at
   multiple granularities.
3. Generalization: Here low-level data are replaced by higher-level concepts through the use
   of concept hierarchies. For example, categorical attributes, like street, can be generalized
   to higher-level concepts, like city or country.
4. Attribute construction: Here new attributes are constructed and added from the given set
   of attributes to help the mining process.
5. Normalization: Here the attribute data are scaled so as to fall within a small specified range,
such as -1 to +1, or 0 to 1. Techniques that are used for normalization are:
Collegenote Prepared By: Jayanta PoudelWE Data Warehousing and Data Mining Reference Note
- Min-Max Normalization: It performs a linear transformation on the original data.
  Suppose that min and max are the minimum and maximum values of an attribute A.
  Min-max normalization maps a value v of A to nv in the range [new_min, new_max]
  using the following formula:

      nv = ((v - min) / (max - min)) * (new_max - new_min) + new_min
- Z-score Normalization: In z-score normalization (or zero-mean normalization), the
  values for an attribute A are normalized based on the mean and standard deviation of A.
  It is also called standard normalization. The value v of A is normalized to nv using the
  following formula:

      nv = (v - μ) / σ

  where μ is the mean and σ is the standard deviation of attribute A, i.e.
  σ = sqrt(Σ(vᵢ - μ)² / n), where n is the number of data points.
Example
Marks: 8, 10, 15, 20

Min-Max Normalization:
min = 8, max = 20, new_min = 0, new_max = 1
For marks = 8:  nv = ((8 - 8) / (20 - 8)) * (1 - 0) + 0 = 0
For marks = 10: nv = ((10 - 8) / (20 - 8)) * (1 - 0) + 0 = 0.16
For marks = 15: nv = ((15 - 8) / (20 - 8)) * (1 - 0) + 0 = 0.58
For marks = 20: nv = ((20 - 8) / (20 - 8)) * (1 - 0) + 0 = 1
Marks after min-max normalization: 0, 0.16, 0.58, 1

Z-score Normalization:
μ = (8 + 10 + 15 + 20) / 4 = 13.25
σ = sqrt(((8 - 13.25)² + (10 - 13.25)² + (15 - 13.25)² + (20 - 13.25)²) / 4) ≈ 4.66
Now,
For marks = 8:  nv = (8 - 13.25) / 4.66 ≈ -1.13
For marks = 10: nv = (10 - 13.25) / 4.66 ≈ -0.70
For marks = 15: nv = (15 - 13.25) / 4.66 ≈ 0.38
For marks = 20: nv = (20 - 13.25) / 4.66 ≈ 1.45
Marks after z-score normalization: -1.13, -0.70, 0.38, 1.45
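The worked example above can be checked with a few lines of Python. This is a small sketch (the function names are my own); the z-score version uses the population standard deviation (dividing by n), matching the formula above.

```python
import math

marks = [8, 10, 15, 20]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

print([round(v, 2) for v in min_max_normalize(marks)])   # [0.0, 0.17, 0.58, 1.0]
print([round(v, 2) for v in z_score_normalize(marks)])   # [-1.13, -0.7, 0.38, 1.45]
```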
Data Reduction
A database or data warehouse may store terabytes of data, so it may take a very long time to
perform data analysis and mining on such huge amounts of data. Data reduction obtains a
reduced representation of the data set that is much smaller in volume but produces the same
(or almost the same) analytical results.
Data Reduction Techniques
1. Dimensionality Reduction: Dimensionality reduction is the process of reducing the
   number of random variables or attributes under consideration. Dimensionality reduction
   methods include wavelet transforms and principal components analysis (PCA), which
   transform or project the original data onto a smaller space (a small PCA sketch is given at
   the end of this section). Attribute subset selection is a method of dimensionality reduction
   in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected
   and removed. For example,
Name    | Mobile No.  | Mobile Network
Jayanta | 9843xxxxxx  | NTC
Sagar   | 9801xxxxxx  | NCELL
Fig: Before Dimension Reduction
If we know the Mobile Number, then we can determine the Mobile Network, so we can
remove that dimension.

Name    | Mobile No.
Jayanta | 9843xxxxxx
Sagar   | 9801xxxxxx
Fig: After Dimension Reduction
2. Numerosity Reduction: Numerosity reduction techniques replace the original data volume
by alternative, smaller forms of data representation. These techniques may be parametric
or nonparametric. For parametric methods, a model is used to estimate the data, so that
typically only the data parameters need to be stored, instead of the actual data (outliers
may also be stored). Regression and log-linear models are examples. Nonparametric
methods for storing reduced representations of the data include histograms, clustering,
sampling, and data cube aggregation.
3. Data Compression: In data compression, transformations are applied so as to obtain a
reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any information loss, the data reduction is
called lossless. If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy.
Fig: Data compression (original data → compressed data; lossless compression reconstructs
the original data, lossy compression reconstructs only an approximation)
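As a small illustration of the dimensionality reduction mentioned in point 1 above, the following sketch projects data onto its principal components using scikit-learn. The data is made up, and any PCA implementation would serve equally well.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 records described by 3 correlated numeric attributes.
X = np.array([
    [2.5, 2.4, 4.9],
    [0.5, 0.7, 1.2],
    [2.2, 2.9, 5.1],
    [1.9, 2.2, 4.1],
    [3.1, 3.0, 6.1],
    [2.3, 2.7, 5.0],
])

# Project onto the 2 principal components that capture most of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (6, 2): same records, fewer dimensions
print(pca.explained_variance_ratio_)   # fraction of variance kept per component
```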
Data Discretization and Concept Hierarchy Generation
Data Discretization
Discretization reduces the number of values for a given continuous attribute by dividing the
range of the attribute into intervals. Interval labels can then be used to replace actual data
values.
Example:
Suppose we have an attribute Age with the given values:

Age: 1, 4, 5, 7, 9, 11, 14, 17, 18, 19, 21, 31, 33, 36, 42, 44, 46, 70, 74, 77, 78

Attribute             | Age           | Age                    | Age                    | Age
Before Discretization | 1, 4, 5, 7, 9 | 11, 14, 17, 18, 19, 21 | 31, 33, 36, 42, 44, 46 | 70, 74, 77, 78
After Discretization  | Child         | Young                  | Mature                 | Old
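A tiny sketch of the discretization above, replacing each Age value by an interval label; the interval boundaries used here are the ones implied by the example and are otherwise an arbitrary choice.

```python
ages = [1, 4, 5, 7, 9, 11, 14, 17, 18, 19, 21,
        31, 33, 36, 42, 44, 46, 70, 74, 77, 78]

def discretize_age(age):
    """Replace a continuous Age value by an interval label."""
    if age <= 9:
        return "Child"
    elif age <= 21:
        return "Young"
    elif age <= 46:
        return "Mature"
    else:
        return "Old"

labels = [discretize_age(a) for a in ages]
print(labels)   # ['Child', 'Child', ..., 'Old']
```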
Discretization can be categorized into the following two types:
+ Top-down discretization: If we first consider one or a couple of points (so-called
  breakpoints or split-points) to divide the whole range of attribute values, and then repeat
  this recursively on the resulting intervals, the process is known as top-down discretization,
  also known as splitting.
+ Bottom-up discretization: If we first consider all of the continuous values as potential
  split-points and then discard some by merging neighborhood values to form intervals, the
  process is called bottom-up discretization, also known as merging.
Concept Hierarchies
Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as
city) with higher-level concepts (such as province or country).
Fig: Concept hierarchy for the attribute location (country > province or state (e.g., British
Columbia, New York) > city (e.g., Vancouver))
Techniques of Discretization and Concept Hierarchy Generation for Numerical Data
+ Binning: Attribute values can be discretized by distributing the values into bins and
  replacing each bin value by the bin mean or bin median. These techniques can be applied
  recursively to the resulting partitions in order to generate concept hierarchies.
+ Histogram Analysis: Histograms can also be used for discretization. Partitioning rules can
  be applied to define ranges of values (a small equal-width partitioning sketch is given after
  this list). The histogram analysis algorithm can be applied recursively to each partition in
  order to automatically generate a multilevel concept hierarchy, with the procedure
  terminating once a pre-specified number of concept levels has been reached.
+ Clustering: A clustering algorithm can be applied to partition data into clusters or groups.
  Each cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual
  level. Each cluster may be further decomposed into sub-clusters, forming a lower level in
  the hierarchy. Clusters may also be grouped together to form a higher-level concept
  hierarchy.
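As referenced under histogram analysis above, equal-width partitioning is one simple histogram-style rule. The sketch below applies it to the Age values from the earlier example; the function name and the number of bins are my own arbitrary choices.

```python
ages = [1, 4, 5, 7, 9, 11, 14, 17, 18, 19, 21,
        31, 33, 36, 42, 44, 46, 70, 74, 77, 78]

def equal_width_bins(values, n_bins):
    """Partition the value range into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    # Assign each value the index of the interval it falls into.
    labels = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return edges, labels

edges, labels = equal_width_bins(ages, 4)
print(edges)    # interval boundaries: [1.0, 20.25, 39.5, 58.75, 78.0]
print(labels)   # interval index (0..3) replacing each original Age value
```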
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query. This query is input to
the system.
A data mining query is defined in terms of data mining task primitives. These primitives allow
us to communicate in an interactive manner with the data mining system.
The data mining task primitives are:
1. Task-relevant data: This specifies the portions of the database or the set of data in which
the user is interested. This includes the database attributes or data warehouse dimensions
of interest (referred to as the relevant attributes or dimensions).
2. The kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
3. Background Knowledge: Background knowledge is information about the domain to be
mined that can be useful in the discovery process and for evaluating the patterns found.
Concept hierarchies are a popular form of background knowledge, which allow data to be
mined at multiple levels of abstraction.
4. Interestingness measures: These functions are used to separate uninteresting patterns
   from knowledge. They may be used to guide the mining process, or after discovery, to
   evaluate the discovered patterns. Different kinds of knowledge may have different
   interestingness measures.
5. Presentation and visualization of discovered patterns: This refers to the form in
   which discovered patterns are to be displayed, which may include rules, tables, charts,
   graphs, decision trees, and cubes.
Please let me know if I missed anything or
anything is incorrect.
[email protected]
Collegenote Prepared By: Jayanta Poudel