
Chapter 2
Data Preparation

Outline
• Data cleansing
• Data integration
• Data reduction
• Data transformation
Data Collection for Mining
• Data mining requires collecting a great amount of data (available in
data warehouses or databases) to achieve the intended objective.
– Data mining starts by understanding the business or problem domain
in order to gain business knowledge.
– Business knowledge guides the process towards useful results, and
enables the recognition of those results that are useful.
– Based on the business knowledge, data related to the business problem
are identified from the database/data warehouse for mining.
• Before feeding data to a DM algorithm, we have to ensure the quality
of the data.
9/25/2020
Data Mining
Data Quality Measures
Well-accepted multidimensional data quality measures include the
following:
Accuracy (free from errors and outliers)
Completeness (no missing attributes and values)
Consistency (no inconsistent values and attributes)
Timeliness (appropriateness of the data for the purpose it is
required)
Believability (acceptability)
Interpretability (easy to understand)
Most real-world data is of poor quality; that is: incomplete,
inconsistent, noisy, invalid, redundant, …
Data is often of low quality
▪ Collecting the required data is challenging
▫ In addition to being heterogeneous and distributed, real-world
data is low in quality.

▪ Why?
▫ You didn’t collect it yourself
▫ It probably was created for some other use, and then you came
along wanting to integrate it
▫ People make mistakes (typos)
▫ People are busy (“this is good enough”) and do not systematically
organize data using structured formats

Types of problems with data

• Some data have problems on their own that need to be cleaned:


– Outliers: misleading data values that do not fit most of the data/facts
– Missing data: attribute values might be absent and need to be
replaced with estimates
– Irrelevant data: attributes in the database that might not be of
interest to the DM task at hand
– Noisy data: attribute values that might be invalid or incorrect, e.g.
typographical errors
– Inconsistent data, duplicate data, etc.
• Other data are problematic only when we integrate them
– Everyone had their own way of structuring and formatting data,
based on what was convenient for them
– How do we integrate data organized in different formats, following
different conventions?

Case study: Government Agency Data
▪ What we want:

ID  Name                     City     State
1   Transportation ministry  Gambela  Gambela
2   Ministry of finance      Gambela  Gambela
3   Health office            Gambela  Gambela

• How do we prepare enough complete, good-quality data for mining?
 Coming up with good-quality data requires passing through
different data preprocessing tasks

Major Tasks in Data Preprocessing
• Data cleansing: to get rid of bad data
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of data from multiple sources, such as databases, data
warehouses, or files
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume, yet produces almost the same results.
 Dimensionality reduction
 Numerosity/size reduction
 Data compression
• Data transformation
– Normalization
– Discretization and/or Concept hierarchy generation
Data Cleaning: Redundancy
• Duplicate or redundant data is a data problem that requires
data cleaning
▪ What’s wrong here?

How to clean it: manually or automatically?
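Automatic cleaning of exact duplicates is straightforward; a minimal sketch, using made-up agency records (not from the slides):

```python
# Minimal sketch: detect and drop exact duplicate records.
# The records below are hypothetical illustrations.
records = [
    (1, "Transportation ministry", "Gambela", "Gambela"),
    (2, "Ministry of finance", "Gambela", "Gambela"),
    (1, "Transportation ministry", "Gambela", "Gambela"),  # exact duplicate
]

# dict.fromkeys keeps the first occurrence of each record, in order
deduped = list(dict.fromkeys(records))
print(len(deduped))  # 2
```

Near-duplicates (same entity, slightly different spelling) are harder and usually still need manual review or fuzzy matching.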


Data Cleaning: Incomplete Data

• The dataset may lack certain attributes of interest


– Is it enough to have patients’ demographic profiles and region
addresses to predict the vulnerability (or exposure) of a given
region to a Malaria outbreak?

• The dataset may contain only aggregate data. E.g.: a traffic police
car accident report
– “this many accidents occurred on this day in this sub-city”

No. of Accidents  Date      Address
3                 05/05/13  Finfinne, Oromia
2                 02/05/13  Sumale

Data Cleaning: Missing Data
• Data is not always available; attribute values may be lacking. E.g.,
Occupation=“ ”
 many tuples have no recorded value for several attributes

What’s wrong here? A missing required field
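A quick scan for missing required fields can be sketched as follows; the rows are hypothetical and both empty strings and absent values are treated as missing:

```python
# Minimal sketch: count records with a missing required field.
# The rows are hypothetical illustrations.
rows = [
    {"name": "Abebe", "occupation": "teacher"},
    {"name": "Sara",  "occupation": ""},     # empty string
    {"name": "Chala", "occupation": None},   # no value recorded
]

# Treat both None and "" as missing
missing = sum(1 for r in rows if not r.get("occupation"))
print(missing)  # 2
```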


Data Cleaning: Missing Data

• Missing data may be due to


– values inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding, or not considered
important at the time of entry
– history or changes of the data not registered

• How to handle missing data? Missing data may need to be inferred
– Ignore the missing value: not effective when the percentage of
missing values per attribute varies considerably
– Fill in the missing value manually: tedious + infeasible?
– Fill in automatically
• calculate the most probable value, say, using the
Expectation-Maximization (EM) algorithm
Predict missing value using EM
▪ Solves estimation with incomplete data.
▫ Obtain initial estimates for the parameters using the mean value.
▫ Use the estimates to calculate values for the missing data, &
▫ The process continues iteratively until convergence ((μi - μi+1) ≤ Ө).
▪ E.g.: of six data items, the known values are {1, 5, 10, 4}; estimate
the two missing data items.
▫ Let the EM converge if two estimates differ by at most 0.05, & let
our initial guess of the two missing values = 3.
• The algorithm stops since the last two estimates are only 0.05 apart.
• Thus, our estimate for the two items is 4.97.
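The iteration above can be reproduced in a few lines; the known values {1, 5, 10, 4}, the initial guess 3, and the 0.05 threshold come from the example:

```python
def em_estimate(known, n_missing, guess, tol):
    """Iteratively re-estimate the missing values as the mean of the
    completed data set until two successive estimates differ by <= tol."""
    while True:
        mean = (sum(known) + n_missing * guess) / (len(known) + n_missing)
        if abs(mean - guess) <= tol:
            return mean
        guess = mean

estimate = em_estimate([1, 5, 10, 4], n_missing=2, guess=3.0, tol=0.05)
print(round(estimate, 2))  # 4.98 (the slide's 4.97, up to rounding)
```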
Data Cleaning: Noisy Data
• Noisy: containing noise, errors, or outliers
– e.g., Salary=“−10” (an error)
▪ Typographical errors are errors that corrupt data
▫ Let say ‘green’ is written as ‘rgeen’
• Incorrect attribute values may be due to
– faulty data collection instruments (e.g.: OCR)
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention

Data Cleaning: How to catch Noisy Data
▪ Manually check all data: tedious + infeasible?
▪ Sort data by frequency
▫ ‘green’ is more frequent than ‘rgeen’
▫ Works well for categorical data
▪ Use, say, numerical constraints to catch corrupt data
▫ Weight can’t be negative
▫ People can’t have more than 2 parents
▫ Salary can’t be less than Birr 300
▪ Use statistical techniques to Catch Corrupt Data
▫ Check for outliers (the case of the 8-meter man)
▫ Check for correlated outliers using n-gram (“pregnant male”)
▫ People can be male
▫ People can be pregnant
▫ People can’t be male AND pregnant
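The frequency and constraint checks above can be sketched as follows; the sample values are made up for illustration:

```python
from collections import Counter

# Frequency sort: a value seen only once among many repeats is a likely typo
colors = ["green", "green", "green", "blue", "blue", "rgeen"]
counts = Counter(colors)
suspects = [v for v, c in counts.items() if c == 1]
print(suspects)  # ['rgeen']

# Numeric constraints: weight can't be negative, salary can't be < Birr 300
weights = [70, -5, 80, 65]
salaries = [800, 250, 1500]
bad_weights = [w for w in weights if w < 0]
bad_salaries = [s for s in salaries if s < 300]
print(bad_weights, bad_salaries)  # [-5] [250]
```

Frequency sorting works well for categorical data; constraint checks need domain knowledge to choose sensible bounds.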
Data Integration
▪ Data integration combines data from multiple sources (database,
data warehouse, files & sometimes from non-electronic sources)
into a coherent store
▪ Because of the use of different sources, data that is fine on its own
may become problematic when we want to integrate it.
▪ Some of the issues are:
▫ Different formats and structures
▫ Conflicting and redundant data
▫ Data at different levels

Data Integration: Formats
Not everyone uses the same format. Do you agree?
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources

Dates are especially problematic:


12/19/97
19/12/97
19/12/1997
19-12-97
Dec 19, 1997
19 December 1997
19th Dec. 1997

Are you frequently writing money as:


Birr 200, Br. 200, 200 Birr, …
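During integration, the date spellings listed above must be normalized to one format. A minimal sketch, assuming day-first dates; the ambiguous month-first form "12/19/97" and ordinal forms like "19th Dec. 1997" would need extra handling:

```python
from datetime import datetime

# Formats tried in order; two-digit-year patterns come first so that
# "97" is not accepted as the literal year 97 by %Y.
FORMATS = ["%d/%m/%y", "%d/%m/%Y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def normalize_date(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text}")

for raw in ["19/12/97", "19/12/1997", "19-12-97", "Dec 19, 1997",
            "19 December 1997"]:
    print(raw, "->", normalize_date(raw))  # all -> 1997-12-19
```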
Data Integration: Inconsistent
▪ Inconsistent data: containing discrepancies in codes or names, often
due to a lack of standardization or naming conventions. e.g.,
▫ Age=“26” vs. Birthday=“03/07/1986”
▫ Some use “1,2,3” for rating; others “A, B, C”
▪ Discrepancy between duplicate records
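The Age-vs-Birthday discrepancy above can be checked mechanically; a sketch assuming a day/month/year birthday format and using the deck's own date as the reference day:

```python
from datetime import date

def age_on(birthday, today):
    """Age implied by a day/month/year birthday string on a given date."""
    day, month, year = (int(p) for p in birthday.split("/"))
    born = date(year, month, day)
    return today.year - born.year - ((today.month, today.day)
                                     < (born.month, born.day))

today = date(2020, 9, 25)
print(age_on("03/07/1986", today))  # 34, so a recorded Age="26" conflicts
```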

Data Reduction Strategies
▪ Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or almost
the same) analytical results
▪ Why data reduction? A database/data warehouse may store terabytes
of data. Complex data analysis may take a very long time to run on
the complete data set.

▪ Data reduction strategies:


▫ Dimensionality reduction
– Select the best attributes or remove unimportant attributes
▫ Numerosity reduction
– Reduce data volume by choosing alternative, smaller forms of
data representation
▫ Data compression
– A technology that reduces the size of large files so that the
smaller files take less memory space and are faster to transfer
over a network or the Internet
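One toy form of dimensionality reduction is dropping uninformative attributes: a column whose values never vary carries no information for mining. The attribute names and values below are made up:

```python
def variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

data = {
    "age":          [20, 35, 50, 41],
    "country_code": [1, 1, 1, 1],          # constant: no information
    "income":       [300, 800, 1500, 3000],
}

# Keep only attributes whose values actually vary
reduced = {name: col for name, col in data.items() if variance(col) > 0}
print(sorted(reduced))  # ['age', 'income']
```

Real attribute selection usually ranks features by relevance to the mining task rather than variance alone; this only illustrates the idea.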
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be
identified with one of the new values
• Methods for data transformation
– Normalization: Scaled to fall within a smaller, specified
range of values
• min-max normalization
• z-score normalization

– Discretization: Reduce data size by dividing the range of a


continuous attribute into intervals. Interval labels can then be
used to replace actual data values
▫ Discretization can be performed recursively on an attribute
using methods such as:
– Binning: divide values into intervals
– Concept hierarchy climbing: organizes concepts (i.e.,
attribute values) hierarchically
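Min-max normalization, z-score normalization, and equal-width binning can be sketched as follows; the salary values are made up:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Z-score normalization: zero mean, unit (population) std dev."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def equal_width_bins(values, k):
    """Binning: map each value to one of k equal-width intervals 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

salaries = [300, 800, 1500, 3000]
print(min_max(salaries))              # scaled into [0, 1]
print(equal_width_bins(salaries, 3))  # interval labels replace raw values
```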
Data sets preparation for learning
A standard machine learning technique is to divide the dataset into a
training set and a test set.
Training dataset is used for learning the parameters of the model
in order to produce hypotheses.
A training set is a set of problem instances (described as a set of
properties and their values), together with a classification of
the instance.

Test dataset, which is never seen during the hypothesis forming


stage, is used to get a final, unbiased estimate of how well the
model works.
Test set evaluates the accuracy of the model/hypothesis in
predicting the categorization of unseen examples.
A set of instances and their classifications used to test the
accuracy of a learned hypothesis.
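The split described above can be sketched without any library; the 70/30 ratio and seed are arbitrary choices for illustration:

```python
import random

def train_test_split(instances, test_ratio=0.3, seed=42):
    """Shuffle (to avoid order bias) and hold out a test set that is
    never seen during training."""
    instances = list(instances)
    random.Random(seed).shuffle(instances)
    cut = int(len(instances) * (1 - test_ratio))
    return instances[:cut], instances[cut:]

data = list(range(10))  # ten labeled problem instances (hypothetical)
train, test = train_test_split(data, test_ratio=0.3)
print(len(train), len(test))  # 7 3
```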
Classification: Train, Validation, Test split

Data Mining Main Tasks

DM Task: Predictive Modeling
A predictive model makes a prediction/forecast about values of data
using known results found from different historical data
Prediction Methods use existing variables to predict unknown or
future values of other variables.
Predict one variable Y given a set of other variables X. Here X could
be an n-dimensional vector
In effect this is function approximation through learning the
relationship between Y and X
Many, many algorithms for predictive modeling in statistics and
machine learning, including
Classification, regression, etc.

Often the emphasis is on predictive accuracy, less emphasis on


understanding the model

Prediction Problems: Classification vs.
Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data

Numeric Prediction
models continuous-valued functions, i.e., predicts unknown or
missing values
Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior

Models and Patterns
▪ Model = abstract representation of given training data
e.g., very simple linear model structure
Y=aX+b
▫ a & b are parameters determined from the data
▫ Y = aX + b is the model structure
▫ Y = 0.9X + 0.3 is a particular model
▪ Pattern represents “local structure” in a dataset
▫E.g., if X>x then Y >y with probability p
▪ Example: Given a finite sample of <x, f(x)> pairs, can we create a
model that holds for future values?
 To guess the true function f, find some pattern (called a
hypothesis) in the training examples, and assume that the
pattern will hold for future examples too.
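The particular model Y = 0.9X + 0.3 above can be recovered from data by least squares; the sample points below are generated from that very line, so the fit returns a ≈ 0.9, b ≈ 0.3:

```python
def fit_line(xs, ys):
    """Least-squares estimates of the parameters a and b in Y = aX + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [0, 1, 2, 3]
ys = [0.3, 1.2, 2.1, 3.0]   # points on Y = 0.9X + 0.3
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 0.9 0.3
```

With noisy real data the fitted parameters only approximate the true relationship, which is the "function approximation" view of prediction.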
Predictive Modeling: Customer Scoring

▪ Example: a bank has a database of 1 million past customers, 10% of


whom took out mortgages

▪ Use machine learning to rank new customers as a function of


p(mortgage | customer data)

▪ Customer data
▫ History of transactions with the bank
▫ Other credit data (obtained from Experian, etc)
▫ Demographic data on the customer or where they live

▪ Techniques
▫ Binary classification: logistic regression, decision trees, etc
▫ Many, many applications of this nature

DM Task: Descriptive Modeling
• Goal is to build a “descriptive” model that models the underlying
observations
– e.g., a model that could simulate the data if needed

[Figure: scatter plot of Red Blood Cell Volume vs. Red Blood Cell
Hemoglobin Concentration, EM iteration 25]

▪ A descriptive model identifies patterns or relationships in data
▫ Unlike the predictive model, a descriptive model serves as a way
to explore the properties of the data examined, not to predict
new properties
• Description methods find human-interpretable patterns that describe
and find natural groupings of the data.
• Methods used in descriptive modeling are: clustering,
summarization, association rule discovery, etc.

Example of Descriptive Modeling

▪ goal: learn directed relationships among p variables


▪ techniques: directed (causal) graphs
▪ challenge: distinguishing between correlation and causation
▫ Example: Do yellow fingers cause lung cancer?

Pattern (Association Rule) Discovery

▪ Goal is to discover interesting “local” patterns (sequential patterns)


in the data rather than to characterize the data globally
▫ Also called link analysis (uncovers relationships among data)

▪ Given market basket data we might discover that


▫ If customers buy wine and bread then they buy cheese with
probability 0.9

▪ Methods used in pattern discovery include:


▫ Association rules, Sequence discovery, etc.
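The "probability 0.9" in the wine-and-bread example is the rule's confidence: among baskets containing the antecedent, the fraction that also contain the consequent. A sketch on a made-up market-basket data set:

```python
# Confidence of the rule {wine, bread} -> {cheese} on hypothetical baskets
transactions = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

antecedent, consequent = {"wine", "bread"}, {"cheese"}
covered = [t for t in transactions if antecedent <= t]
confidence = sum(1 for t in covered if consequent <= t) / len(covered)
print(round(confidence, 2))  # 0.67: 2 of 3 wine-and-bread baskets add cheese
```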

Example of Pattern Discovery
▪ Example in retail: Customer transactions to consumer behavior:
▫ People who bought “Da Vinci Code” also bought “The Five People
You Meet in Heaven” (www.amazon.com)
▪ Example: football player behavior
▫ If player A is in the game, player B’s scoring rate increases from
25% chance per game to 95% chance per game
▪ What about the following?
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBB
DBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBBCCAC
DABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBA
ACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBAD
CBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDD
DADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADA
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDAD
ABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCB
AAADCADDADAABBACCBB
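One way to look for local structure in a symbol sequence like the one above is to count n-gram frequencies and see which short patterns occur unusually often; a sketch using just the first line of the sequence:

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Count every length-n substring (n-gram) in the sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

# First line of the sequence shown above
seq = "ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBB"
bigrams = ngram_counts(seq, 2)
print(bigrams.most_common(3))  # the most frequent pairs hint at local patterns
```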
Basic Data Mining algorithms

▪ Classification: also called supervised learning, maps data into
predefined groups or classes to enhance the prediction process

▪ Clustering: also called unsupervised learning, groups similar data
together into clusters.
▫ is used to find appropriate groupings of elements for a set of data.
▫ Unlike classification, clustering is a kind of undirected knowledge
discovery or unsupervised learning; i.e., there is no target field & the
relationship among the data is identified by a bottom-up approach.
▪ Association Rule: is also known as market-basket analysis
▫ It discovers interesting associations between attributes contained in a
database.
▫ Based on the frequency of co-occurrence of items in events, an
association rule tells us: if item X is part of an event, what is the
likelihood that item Y is also part of the event?
Supervised vs. Unsupervised Learning

▪ Supervised learning (classification)


▫ Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
▫ New data is classified based on the training set

▪ Unsupervised learning (clustering)


▫ The class labels of the training data are unknown
▫ Given a set of measurements, observations, etc. with the aim
of establishing the existence of classes or clusters in the data

Questions?
