
Unit-4

Data Pre-processing
Outline
 Why preprocess data?
 Mean, median, mode & range
 Attribute types
 Data preprocessing tasks
• Data cleaning
• Data integration
• Data transformation
• Data reduction
 Data mining task primitives
Why preprocess data?
 Real-world data are generally “dirty”:
• Incomplete: Missing attribute values, lack of certain attributes of interest,
or containing only aggregate data.
o E.g. Occupation=“ ”
• Noisy: Containing errors or outliers.
o E.g. Salary=“abcxy”
• Inconsistent: Containing discrepancies in codes or names.
o E.g. “Gujarat” & “Gujrat” (common spelling mistakes)
Why is data preprocessing important?
“No quality data, no quality results”
 This is the principle of Garbage In, Garbage Out (GIGO).
 Quality decisions must be based on quality data.
 Duplicate or missing data may cause incorrect or even misleading statistics.
 Data preparation, cleaning and transformation make up the majority of the work in data mining (often as much as 90% of the effort).
 Data preprocessing prepares raw data for further processing.
Mean
x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
 The mean is the average of a dataset.
 To find the mean, calculate the sum of all the data values and then divide by the number of values.
 Example
✔ Find the mean of 12, 15, 11, 11, 7, 13
First, find the sum of the data:
12 + 15 + 11 + 11 + 7 + 13 = 69
Then divide by the number of values:
69 / 6 = 11.5 → Mean
Median
 The median is the middle number in a dataset when the data are arranged in numerical (sorted) order.
• If the count is odd, the middle number is the median.
• If the count is even, the median is the average of the two middle numbers.
Median - Odd (Cont..)
 Example
 Find the median of 12, 15, 11, 11, 7, 13, 15
In this example, the count of data is 7 (odd).
First, arrange the data in ascending order:
7, 11, 11, 12, 13, 15, 15
Partition the data into two equal halves around the middle value:
7, 11, 11, [12], 13, 15, 15
12 → Median
Median - Even (Cont..)
 Example
 Find the median of 12, 15, 11, 11, 7, 13
In this example, the count of data is 6 (even).
First, arrange the data in ascending order:
7, 11, 11, 12, 13, 15
Calculate the average of the two numbers in the middle:
7, 11, [11, 12], 13, 15
(11 + 12) / 2 = 11.5 → Median
Mode
 The mode is the number that occurs most often within a set of
numbers.
 Example
✔ Find the mode of 12, 15, 11, 11, 7, 13
11 → Mode (unimodal)
✔ Find the mode of 12, 15, 11, 11, 7, 12, 13
11, 12 → Mode (bimodal)
Mode (Cont..)
 Example
✔ Find the mode of 12, 12, 15, 11, 11, 7, 13, 7
7, 11, 12 → Mode (trimodal)
✔ Find the mode of 12, 15, 11, 10, 7, 14, 13
No mode (no value repeats)
Range
 The range of a set of data is the difference between the largest
and the smallest number in the set.
 Example
 Find the range for the data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
First, arrange the data in ascending order:
26, 30, 34, 40, 40, 42, 43, 47, 48, 50, 50, 55
 The largest number is 55 and the smallest number is 26, so subtract:
55 – 26 = 29 → Range
Standard deviation
 The standard deviation is a measure of how spread out the data are.
 Its symbol is σ (the Greek letter sigma).
 The standard deviation is the square root of the variance: σ = √(variance)


Standard deviation (Cont..)
 The variance is defined as the average of the squared differences from the mean.
 To calculate the variance, follow these steps:
1. Calculate the mean, x̄.
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences and add up this column.
4. Divide by n − 1, where n is the number of items in the sample; this gives the sample variance. (For the population variance, divide by n.)
5. To get the standard deviation, take the square root of the variance.
Standard deviation - example
 The owner of an Indian restaurant is interested in how much people spend at the restaurant.
 He examines 10 randomly selected receipts for parties and writes down the following data:
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
1. Find the mean (1st step):
 Mean = 492 / 10 = 49.2
2. Write a table that subtracts the mean from each observed value (2nd step).
Standard deviation – example (Cont..)
Step 3: square each difference and add up the column.

X      X − Mean    (X − Mean)²
44     −5.2          27.04
50      0.8           0.64
38    −11.2         125.44
96     46.8        2190.24
42     −7.2          51.84
47     −2.2           4.84
40     −9.2          84.64
39    −10.2         104.04
46     −3.2          10.24
50      0.8           0.64
Total              2599.60

Step 4: divide the total by n − 1 = 9 → S² = 2599.6 / 9 ≈ 288.8 ≈ 289
Step 5: take the square root → S = √289 = 17
Standard deviation – example (Cont..)
 The standard deviation measures how far the data values lie from the mean: take the mean and move one standard deviation in either direction.
 The mean for this example is 49.2 and the standard deviation is 17.
 Now, 49.2 − 17 = 32.2 and 49.2 + 17 = 66.2.
 This means that most customers probably spend between 32.2 and 66.2.
 If all data values are the same, the variance and standard deviation are 0 (zero).
Example (Try it)
 Calculate Mean, Median, Mode, Range, Variance &
Standard deviation .
13, 18, 13, 14, 13, 16, 14, 21, 13
 Mean is 15.
 Median is 14.
 Mode is 13 (unimodal).
 Range is 8.
 Variance is 8 (dividing by n − 1; ≈ 7.1 if dividing by n).
 Standard deviation is ≈ 2.83 (≈ 2.67 if dividing by n).
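The summary statistics above can be checked with a short sketch using only Python's standard library (a minimal illustration; the variable names are ours, and statistics.multimode requires Python 3.8+):

```python
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(data)            # 15
median = statistics.median(data)        # 14
modes = statistics.multimode(data)      # [13]
value_range = max(data) - min(data)     # 21 - 13 = 8
variance = statistics.variance(data)    # sample variance (n - 1) = 8
std_dev = statistics.stdev(data)        # sqrt(8) ≈ 2.83

print(mean, median, modes, value_range, variance, std_dev)
```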
Attribute Types
 An attribute is a property of the object.
 It also represents different features of the object.
 E.g. Person  Name, Age, Qualification etc.
 Attribute types can be divided into four categories.
1. Nominal
2. Ordinal
3. Interval
4. Ratio
1) Nominal Attribute
 Nominal attributes are named attributes which can be separated into discrete (individual) categories which do not overlap.
 Nominal attribute values are also called distinct values.
 Example: eye colour (black, brown, blue), marital status (single, married), occupation
2) Ordinal Attribute
 For an ordinal attribute, the order of the values is important and significant, but the differences between the values are not really known.
 Example
 Rankings → 1st, 2nd, 3rd
 Ratings → star ratings (e.g. 1 to 5 stars)
 We know that a 5-star rating is better than a 2-star or 3-star rating, but we cannot quantify how much better it is.
3) Interval Attribute
 Interval attribute comes in the form of a numerical value where the difference between points is meaningful.
 Example
 Temperature  10°-20°, 30°-50°, 35°-45°
 Calendar Dates  15th – 22nd, 10th – 30th
 Interval attributes do not have a true (absolute) zero value.
4) Ratio Attribute
 A ratio attribute looks like an interval attribute, but it must have a true (absolute) zero value.
 It tells us about the order and the exact value between units or
data.
 Example
 Age Group  10-20, 30-50, 35-45 (In years)
 Mass  20-30 kg, 10-15 kg
 Because it has a true (absolute) zero, it is possible to compute ratios.
Data Preprocessing
 Data have quality if they satisfy the requirements of the intended
use.
 There are many factors comprising data quality, including
accuracy, completeness, consistency, timeliness, believability, and
interpretability.
 The data you wish to analyze by data mining techniques are
✔ incomplete (lacking attribute values or certain attributes of interest, or
containing only aggregate data);
✔ inaccurate or noisy (containing errors, or values that deviate from the
expected); and
✔ inconsistent (e.g., containing discrepancies in the department codes used
to categorize items)
Data Preprocessing
 The elements defining data quality:
✔ accuracy,
✔ completeness,
✔ consistency.
 Inaccurate, incomplete, and inconsistent data are commonplace
properties of large real-world databases and data warehouses.
 Reasons for inaccurate data (i.e., having incorrect attribute
values):
✔ The data collection instruments used may be faulty.
✔ There may have been human or computer errors occurring at data entry.
✔ Users may purposely submit incorrect data values for mandatory fields
when they do not wish to submit personal information (e.g., by choosing
the default value “January 1” displayed for birthday). This is known as
disguised missing data.
✔ Errors in data transmission can also occur.
✔ There may be technology limitations such as limited buffer size for
coordinating synchronized data transfer and consumption.
✔ Incorrect data may also result from inconsistencies in naming conventions
or data codes, or inconsistent formats for input fields (e.g., date).
✔ Duplicate tuples also require data cleaning.
 Incomplete data can occur for a number of reasons.
✔ Attributes of interest may not always be available
✔ Data may not be included simply because they were not considered
important at the time of entry.
✔ Relevant data may not be recorded due to a misunderstanding or because
of equipment malfunctions.
Data Preprocessing Tasks
 The major data preprocessing tasks are:
• Data cleaning
• Data integration
• Data transformation
• Data reduction
1) Data Cleaning
 Real-world data tend to be incomplete, noisy, and inconsistent.
 Data cleaning (or data cleansing) routines attempt to fill in
missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.
1) Fill missing values
 Ignore the tuple (record/row):
• Usually done when class label is missing.
• This method is not very effective, unless the tuple contains several
attributes with missing values.
• It is especially poor when the percentage of missing values per attribute
varies considerably
 Fill missing value manually:
• This approach is time consuming and may not be feasible given a large
data set with many missing values.
 Use a global constant to fill in the missing value:
• Replace all missing attribute values by the same constant such as a label like
“Unknown” or -∞.
1) Fill missing values (Cont..)
 Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value:
• Use the mean or median of the attribute to fill in its missing values.
 Use the attribute mean or median for all samples belonging to
the same class as the given tuple
 Use the most probable value to fill in the missing value:
• This may be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction.
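As a rough illustration of these strategies (a sketch assuming pandas is available; the DataFrame, column names and values are hypothetical, not from the slides):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", "engineer"],
    "salary": [52000.0, 48000.0, np.nan, 61000.0],
})

# 1. Ignore the tuple: drop rows that contain missing values
dropped = df.dropna()

# 2. Use a global constant to fill in the missing value
filled_constant = df.fillna({"occupation": "Unknown"})

# 3. Use a measure of central tendency (mean or median) of the attribute
filled_mean = df.assign(salary=df["salary"].fillna(df["salary"].mean()))
filled_median = df.assign(salary=df["salary"].fillna(df["salary"].median()))

print(filled_mean)
```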
2) Noisy Data
 Noise is a random error or variance in a measured variable.
1. Binning method
2. Regression
3. Outlier analysis
1) Binning method
 Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
 The sorted values are distributed into a number of “buckets,” or
bins.
 Because binning methods consult the neighborhood of values,
they perform local smoothing
Binning methods for data smoothing include smoothing by bin means, smoothing by bin medians, and smoothing by bin boundaries.
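A minimal sketch of equal-frequency binning with smoothing by bin means (plain Python; the data values and bin size are illustrative assumptions, not taken from the slides):

```python
# Equal-frequency binning with smoothing by bin means (illustrative sketch)
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3  # three values per bin

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    # every value in the bin is replaced by the mean of its bin
    smoothed.extend([bin_mean] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```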
2) Regression:
 Data smoothing can also be done by regression, a technique that
conforms data values to a function.
 Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to
predict the other.
 Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit
to a multidimensional surface.
3) Outlier analysis:
 Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.”
 Values that fall outside of the set of clusters may be considered
outliers.
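As a hedged sketch of this idea (scikit-learn's KMeans is assumed to be installed; the data, number of clusters, and "small cluster" rule are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 1-D data in which 96 lies far from the other values
values = np.array([44, 50, 38, 96, 42, 47, 40, 39, 46, 50], dtype=float).reshape(-1, 1)

# Group similar values into clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
labels = kmeans.labels_

# Values that end up in very small clusters fall "outside" the main clusters
cluster_sizes = np.bincount(labels)
outliers = values.ravel()[cluster_sizes[labels] <= 1]
print(outliers)  # expected: [96.]
```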
Data Integration
 Data mining often requires data integration—the merging of data
from multiple data stores.
 Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
 This can help improve the accuracy and speed of the subsequent
data mining process.
Entity Identification Problem
 A data analysis task will often involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing.
 These sources may include multiple databases, data cubes, or flat
files.
 There are a number of issues to consider during data integration.
 Schema integration and object matching can be tricky.
 How can equivalent real-world entities from multiple sources be
matched up? This is referred to as the entity identification
problem.
Redundancy and Correlation Analysis
 Redundancy is another important issue in data integration.
 An attribute (such as annual revenue, for instance) may be redundant
if it can be “derived” from another attribute or set of attributes.
 Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
 Some redundancies can be detected by correlation analysis.
 Given two attributes, such analysis can measure how strongly one
attribute implies the other, based on the available data.
 For nominal data, we use the Χ2 (chi-square) test.
 For numeric attributes, we use the correlation coefficient and covariance, both of which assess how one attribute's values vary from those of another.
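For numeric attributes, a quick sketch with NumPy (the two attribute vectors are illustrative):

```python
import numpy as np

# Two illustrative numeric attributes measured on the same tuples
a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.5, 3.9, 6.1, 7.8, 10.2])

correlation = np.corrcoef(a, b)[0, 1]  # Pearson correlation coefficient (near +1 here)
covariance = np.cov(a, b)[0, 1]        # sample covariance

print(correlation, covariance)
```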
Χ2 Correlation Test for Nominal Data
 For nominal data, a correlation relationship between two
attributes, A and B, can be discovered by a X2 (chi-square) test.
 Suppose A has c distinct values, namely a1,a2,….,ac. B has r
distinct values, namely b1,b2,…,br.
 The data tuples described by A and B can be shown as a
contingency table, with the c values of A making up the columns
and the r values of B making up the rows.
 Let (Ai ,Bj) denote the joint event that attribute A takes on value ai
and attribute B takes on value bj.
 The χ² value is computed as
χ² = Σᵢ₌₁ᶜ Σⱼ₌₁ʳ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ
 where oᵢⱼ is the observed frequency (actual count) of the joint event (Aᵢ, Bⱼ), eᵢⱼ is the expected frequency eᵢⱼ = (count(A = aᵢ) × count(B = bⱼ)) / n, and n is the number of data tuples.
 Suppose that a group of 1500 people was surveyed.
 The gender of each person was noted. Each person was polled as
to whether his or her preferred type of reading material was
fiction or nonfiction.
 Thus, we have two attributes, gender and preferred reading. The
observed frequency (or count) of each possible joint event is
summarized in the contingency table where the numbers in
parentheses are the expected frequencies.
 For example, the expected frequency for the cell (male, fiction) is count(male) × count(fiction) / 1500.
 For this 2×2 table, the degrees of freedom are (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1.
 For 1 degree of freedom, the X2 value needed to reject the
hypothesis at the 0.001 significance level is 10.828.
 Since our computed value is above this, we can reject the
hypothesis that gender and preferred reading are independent
and conclude that the two attributes are (strongly) correlated for
the given group of people.
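A sketch of the same kind of test with SciPy (the observed 2×2 counts below are assumed for illustration, since the slides' contingency table is not reproduced here):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed observed counts for a gender vs. preferred-reading survey of 1500 people
# rows: fiction, non-fiction; columns: male, female (illustrative numbers)
observed = np.array([
    [250,  200],
    [ 50, 1000],
])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p_value)
# If chi2 exceeds the critical value (10.828 at the 0.001 level for 1 degree of
# freedom), we reject the hypothesis that the two attributes are independent.
```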
(Critical values such as 10.828 come from a chi-square distribution table.)
Tuple Duplication
 To detect redundancies between attributes, duplication should
also be detected at the tuple level (e.g., where there are two or
more identical tuples for a given unique data entry case)
 For example, if a purchase order database contains attributes for
the purchaser’s name and address instead of a key to this
information in a purchaser database, discrepancies can occur, such
as the same purchaser’s name appearing with different addresses
within the purchase order database.
Data Value Conflict Detection and Resolution
 Data integration also involves the detection and resolution of data
value conflicts.
 For example, for the same real-world entity, attribute values from
different sources may differ.
 This may be due to differences in representation, scaling, or
encoding.
Data Reduction
 Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data.
 That is, mining on the reduced data set should be more efficient
yet produce the same (or almost the same) analytical results.
Overview of Data Reduction Strategies
 Data reduction strategies include
✔ Dimensionality reduction,
✔ Numerosity reduction, and
✔ Data compression.
Dimensionality reduction
 Dimensionality reduction is the process of reducing the number of
random variables or attributes under consideration.
 Dimensionality reduction methods include wavelet transforms and
principal components analysis, which transform or project the
original data onto a smaller space.
 Attribute subset selection is a method of dimensionality reduction
in which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed.
Numerosity reduction
 Numerosity reduction techniques replace the original data
volume by alternative, smaller forms of data representation.
 These techniques may be parametric or nonparametric.
 For parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored, instead
of the actual data.
 Regression and log-linear models are examples.
 Nonparametric methods for storing reduced representations of
the data include histograms, clustering, sampling, and data cube
aggregation.
Data Compression
 In data compression, transformations are applied so as to obtain a
reduced or “compressed” representation of the original data.
 If the original data can be reconstructed from the compressed
data without any information loss, the data reduction is called
lossless.
 If, instead, we can reconstruct only an approximation of the
original data, then the data reduction is called lossy.
Data Transformation
 Data transformation is the process of converting data from one
form to another form.
 Data often reside in different locations and in different formats.
 Data transformation is necessary to ensure that data from one application or database are understandable to other applications and databases.
Data Transformation (Cont..)
 Data transformation strategies includes the following:
1. Smoothing
2. Attribute construction
3. Aggregation
4. Normalization
5. Discretization
6. Concept hierarchy generation for nominal data
Data Transformation (Cont..)
1. Smoothing
• It works to remove noise from the data.
• It is a form of data cleaning where users specify transformations to correct
data inconsistencies.
• Such techniques include binning, regression and clustering.
2. Attribute construction
• New attributes are constructed and added from the given set of attributes to help the mining process.
3. Aggregation
• In this, summary or aggregation operations are applied to the data.
• E.g. daily sales data are aggregated so that the sales manager can compute monthly and annual totals.
Data Transformation (Cont..)
4. Normalization
• Normalization is scaling technique or a mapping technique.
• With normalization, we can find new range from an existing range.
• There are three common techniques for normalization: min-max normalization, z-score normalization, and decimal scaling. Two of them are covered below.
1. Min-Max Normalization
o A simple normalization technique in which we fit the given data into a pre-defined boundary, typically the interval [0, 1].
2. Decimal scaling
o In this technique we move the decimal point of the attribute's values.
1) Min-max normalization
 Min-max is a technique that normalizes the data.
 It rescales the data to a new range, typically between 0 and 1.
 Example
Age
16
20
30
40
1) Min-max normalization (Cont..)
 Min : Minimum value = 16
 Max : Maximum value = 40
 V = Respective value of attributes. In our example V1= 16, V2=20,
V3=30 & V4=40.
 NewMax = 1
 NewMin = 0
Formula: v′ = ((v − min) / (max − min)) × (newMax − newMin) + newMin
1) Min-max normalization (Cont..)
Formula: v′ = ((v − min) / (max − min)) × (newMax − newMin) + newMin

For Age 16:
v′ = (16 − 16)/(40 − 16) × (1 − 0) + 0 = 0/24 × 1 = 0

For Age 20:
v′ = (20 − 16)/(40 − 16) × (1 − 0) + 0 = 4/24 × 1 ≈ 0.17
1) Min-max normalization (Cont..)
For Age 30:
v′ = (30 − 16)/(40 − 16) × (1 − 0) + 0 = 14/24 × 1 ≈ 0.58

For Age 40:
v′ = (40 − 16)/(40 − 16) × (1 − 0) + 0 = 24/24 × 1 = 1

Age    After min-max normalization
16     0
20     0.17
30     0.58
40     1
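A small sketch of min-max normalization applied to the Age example above (plain Python):

```python
# Min-max normalization of the Age values from the example above
ages = [16, 20, 30, 40]
new_min, new_max = 0.0, 1.0

old_min, old_max = min(ages), max(ages)
normalized = [
    (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
    for v in ages
]
print([round(v, 2) for v in normalized])  # [0.0, 0.17, 0.58, 1.0]
```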
2) Decimal scaling
 In this technique we move the decimal point of values of the
attribute.
 This movement of decimal points totally depends on the
maximum value among all values in the attribute.
 A value vᵢ of attribute A can be normalized by the following formula:
v′ᵢ = vᵢ / 10ʲ, where j is the smallest integer such that max(|v′ᵢ|) < 1.
Decimal scaling - Example
CGPA    Formula    After decimal scaling
2       2 / 10     0.2
3       3 / 10     0.3

 Check the maximum value of the attribute CGPA: it is 3.
 Why divide by 10? Count the number of digits in the maximum value and write 1 followed by that many zeros.
 Here the maximum value 3 has only one digit, so we divide by 10 (1 followed by one zero).
Decimal scaling (Try it!)
Bonus Formula After Decimal Scaling
400 400/1000 0.4
310 310/1000 0.31

Salary Formula After Decimal Scaling


40,000 40000/100000 0.4
31,000 31000/100000 0.31
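A minimal sketch of decimal scaling covering the examples above (plain Python; the helper function name is ours):

```python
def decimal_scale(values):
    """Divide every value by 10^j, where j is the number of digits
    in the largest absolute value, so all results fall below 1."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(decimal_scale([2, 3]))           # [0.2, 0.3]
print(decimal_scale([400, 310]))       # [0.4, 0.31]
print(decimal_scale([40000, 31000]))   # [0.4, 0.31]
```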
Data Transformation (Cont..)
5. Discretization
• Discretization techniques can be categorized based on how the separation
is performed, such as whether it uses class information or which direction
it proceeds (top-down or bottom-up).
• The raw values of a numeric attribute (e.g. age) are replaced by interval labels (e.g. 0-10, 11-20, etc.) or conceptual labels (e.g. youth, adult, senior); a small sketch follows this list.
6. Concept hierarchy generation for nominal data
• In this, attributes such as address can be generalized to higher-level
concepts, like street or city or state or country.
• Many hierarchies for nominal attributes are implicit within the database
schema.
• E.g. city, country or state table in RDBMS.
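As referenced under discretization above, a brief sketch with pandas (the ages, bin edges, and labels are illustrative assumptions):

```python
import pandas as pd

# Illustrative ages to be discretized
ages = pd.Series([6, 15, 23, 37, 48, 66, 81])

# Replace raw values with interval labels
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])

# Replace raw values with conceptual labels
concepts = pd.cut(ages, bins=[0, 20, 60, 100], labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```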
Data mining task primitives
 The data mining task primitives include the following:
• Task-relevant data
• Kind of knowledge to be mined
• Background knowledge
• Interestingness measurement
• Presentation for visualizing the discovered patterns