The Data Explosion: Modern Computer Systems Are Accumulating Data at An Almost Unimaginable Rate and From A
Ch1:
Some examples
- The current NASA Earth observation satellites generate a terabyte of data every
day.
- Many companies maintain large Data Warehouses of customer transactions.
A fairly small data warehouse might contain more than a hundred million
transactions.
4/25/2014
2. Knowledge Discovery
Data mining creates models to find hidden patterns in large, complex
collections of data. These patterns sometimes elude traditional
statistical approaches to analysis because of:
• the large number of attributes,
• the complexity of the patterns,
• the difficulty of performing the analysis.
Samples are also referred to as observations, cases, records, or examples.
Here the aim is simply to extract the most information we can from
the data available.
predictive descriptive
For the purposes of data mining we will assume that the data takes a
particular standard form, which is described in the next slides. We will
also look at some of the practical problems of data preparation.
1. Standard Formulation
For any data mining application we have a universe of objects that are of interest.
Each object (e.g. a student) is described by a number of variables that
correspond to its properties. In data mining, variables are often called attributes.
The set of variable values of the object is called a record or an instance.
The complete set of data available to us for an application is called a dataset.
Variables (attributes). Record (instance). Complete set of data: Dataset.
This dataset is an example of labelled data, where one attribute is given
special significance and the aim is to predict its value.
This attribute is given the standard name ‘class’.
When there is no such significant attribute we call the data unlabelled.
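The terms above can be illustrated with a minimal sketch. The student records, the attribute names, and the helper `split_class` are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
# Hypothetical student dataset: each dict is one record (instance),
# each key is an attribute, and "class" is the attribute to be predicted.
labelled = [
    {"age": 19, "hours_studied": 12, "class": "pass"},
    {"age": 22, "hours_studied": 3, "class": "fail"},
    {"age": 20, "hours_studied": 8, "class": "pass"},
]


def split_class(dataset, class_attribute="class"):
    """Separate the ordinary attribute values from the class values."""
    features = [
        {name: value for name, value in record.items() if name != class_attribute}
        for record in dataset
    ]
    classes = [record[class_attribute] for record in dataset]
    return features, classes


features, classes = split_class(labelled)
# Taken on its own, `features` is an unlabelled dataset:
# no attribute in it is singled out for prediction.
print(classes)
```

Storing each record as a dictionary keeps the attribute names visible; a table with one row per record and one column per attribute would carry the same information.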
2. Types of Variable
• Nominal Variables
– Used to put objects into categories, simply labels.
– Is an order-less scale, uses different symbols to represent the different
states (values) of the variable being measured.
For example, customer-type: 1,2,3,4,… OR A,B,C,D,…
– Do not have metric characteristics.
– Have no particular order and no necessary relation to one another.
– No mathematical interpretation.
• Binary Variables
Special case of a nominal variable that takes only two possible values:
true or false, 1 or 0 etc.
• Ordinal Variables
– Similar to nominal variables, except that the values can be arranged in a
meaningful order, e.g. small, medium, large.
An order relation is defined but not a distance relation, e.g. rank of a student
in a class.
• Integer Variables
– Unlike nominal variables that are numerical in form, arithmetic with integer
variables is meaningful (1 child + 2 children = 3 children etc.).
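The difference between nominal, ordinal, and integer variables shows up in which operations make sense. The sketch below uses made-up values; the names `customer_type`, `SIZE_ORDER`, and `children` are assumptions for illustration only.

```python
from collections import Counter

# Nominal: the codes are labels only, so counting is meaningful
# but averaging or ordering them is not.
customer_type = [1, 3, 3, 2, 3]
most_common_type = Counter(customer_type).most_common(1)[0][0]

# Ordinal: an explicit order lets us compare and sort values,
# but the gaps between positions carry no distance information.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}
sizes = ["large", "small", "medium"]
sorted_sizes = sorted(sizes, key=SIZE_ORDER.__getitem__)

# Integer: arithmetic on the values is meaningful.
children = [1, 2]
total_children = sum(children)

print(most_common_type, sorted_sizes, total_children)
```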
• Interval-scaled Variables
– Variables that take numerical values which are measured at equal intervals
from a zero point or origin.
e.g. The Fahrenheit and Celsius temperature scales.
The zero value has been selected arbitrarily and does not imply ‘absence of
temperature’.
°C    0    10    20    30
°F   32    50    68    86
• Ratio-scaled Variables
– Similar to interval-scaled variables except that the zero point does reflect
the absence of the measured characteristic.
– A ratio scale has an absolute zero point and, consequently, the ratio
relation holds true for variables measured using this scale.
Quantities such as height, length, and salary use this type of scale.
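A short sketch of why ratios are meaningful on a ratio scale but not on an interval scale. The conversion formula is the standard Celsius to Fahrenheit one; the height values are made up for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion between the two interval scales."""
    return celsius * 9 / 5 + 32


# Interval scale: the ratio of two readings depends on the arbitrary zero point,
# so "20 degrees is twice as hot as 10 degrees" is not a meaningful statement.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Differences are still comparable: equal steps in C map to equal steps in F.
step_a = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
step_b = celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)

# Ratio scale: changing the unit keeps the ratio, because zero means "none".
height_a_m, height_b_m = 1.0, 2.0
ratio_metres = height_b_m / height_a_m
ratio_centimetres = (height_b_m * 100) / (height_a_m * 100)

print(ratio_celsius, ratio_fahrenheit, ratio_metres, ratio_centimetres)
```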
• Data cannot be assumed to be error free, even when it is in the
standard form.
• A noisy value is one that is valid for the dataset but has been
incorrectly recorded, e.g. 69.72 ---> 6.972, brown ---> blue.
• An invalid value (as opposed to a noisy value) can easily be detected
and either corrected or rejected, e.g. 69.7X for 6.972 or bbrown for
brown.
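The distinction can be sketched with two simple validity checks. The set `ALLOWED_COLOURS` and both function names are hypothetical; a real application would take the permitted values from its own data description.

```python
# Hypothetical set of permitted values for a categorical attribute.
ALLOWED_COLOURS = {"brown", "blue", "green"}


def is_valid_number(text):
    """Return True if the text can be read as a number at all."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def is_valid_colour(text):
    """Return True if the text is one of the permitted category values."""
    return text in ALLOWED_COLOURS


# Invalid values fail the checks and can be corrected or rejected.
print(is_valid_number("69.7X"), is_valid_colour("bbrown"))

# Noisy values pass the same checks: 6.972 recorded in place of 69.72,
# or blue in place of brown, still looks like legitimate data.
print(is_valid_number("6.972"), is_valid_colour("blue"))
```

Checks of this kind catch invalid values only. Detecting noisy values needs extra knowledge about the object being described.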
– Values of a continuous attribute that lie outside the normal range of
the variable should be investigated.
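One simple way to find such values is a range check, sketched below. The bounds of 100 to 220 cm and the height values are assumptions made for this example; a sensible range depends on the attribute.

```python
def flag_out_of_range(values, low, high):
    """Return the values that fall outside the expected range [low, high]."""
    return [value for value in values if not (low <= value <= high)]


# Made-up heights in centimetres, with two suspicious entries.
heights_cm = [172.0, 165.5, 17.2, 180.1, 1810.0]
suspects = flag_out_of_range(heights_cm, low=100.0, high=220.0)
print(suspects)
```

The flagged values are candidates for investigation, not automatic deletions: an unusual value may still be correct.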
4. Missing Values
In many real-world datasets data values are not recorded for all attributes.
Two of the most commonly used strategies for dealing with missing values are:
• Discard Instances
Delete all instances where there is at least one missing value and use the
remainder.
Its disadvantage is that discarding data may damage the reliability of the
results derived from the data.
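The discard strategy can be sketched as follows, assuming missing values are represented by `None`. The records and the function name `discard_incomplete` are hypothetical.

```python
def discard_incomplete(dataset):
    """Keep only the instances in which every attribute has a recorded value."""
    return [
        record
        for record in dataset
        if all(value is not None for value in record.values())
    ]


# Made-up records; None marks a value that was not recorded.
records = [
    {"age": 19, "height_cm": 172.0, "eye_colour": "brown"},
    {"age": None, "height_cm": 165.5, "eye_colour": "blue"},
    {"age": 21, "height_cm": 180.1, "eye_colour": None},
]
complete = discard_incomplete(records)
print(len(records), len(complete))
```

Here two of the three instances are lost, which shows how quickly this strategy can shrink a dataset.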
There are several ways in which the number of attributes (or ‘features’)
can be reduced before a dataset is processed. The term feature reduction
or dimension reduction is generally used for this process. We will return
to this topic in Chapter 9.