

Ch1:

1. The Data Explosion


Modern computer systems are accumulating data at an almost unimaginable rate and from a very wide variety of sources.

Some examples:
- The current NASA Earth observation satellites generate a terabyte of data every day.
- Many companies maintain large Data Warehouses of customer transactions. A fairly small data warehouse might contain more than a hundred million transactions.

Ch1:

1. The Data Explosion

Large databases often become data tombs: data archives that are seldom visited.

We are data rich, but information poor.

Buried within such data is knowledge that can be critical to a company’s growth or decline.


Ch1:

1. The Data Explosion

Knowledge that could:

• lead to important discoveries in science,
• enable us to predict the weather and natural disasters accurately,
• enable us to identify the causes of, and possible cures for, lethal illnesses,
• literally mean the difference between life and death.

Ch1:

2. Knowledge Discovery

Defined as the ‘non-trivial extraction of implicit, previously unknown and potentially useful information from data’.

It is a process of which data mining forms just one part, albeit a central one.


Ch1:

Data mining creates models to find hidden patterns in large, complex collections of data: patterns that sometimes elude traditional statistical approaches to analysis because of:
• the large number of attributes,
• the complexity of the patterns,
• the difficulty in performing the analysis.

Data Mining as a Part of the Knowledge Discovery Process

The knowledge discovery process consists of an iterative sequence of the following steps:

Ch1:

1. Data Cleaning (to remove noise and inconsistent data).
2. Data Integration (where multiple data sources may be combined).
3. Data Selection (where data relevant to the analysis task are retrieved from the database).
4. Data Transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations).
5. Data Mining (an essential process where intelligent methods are applied in order to extract patterns).
6. Pattern Evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures).
7. Knowledge Presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).
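To make the sequence concrete, here is a minimal sketch of steps 1-5 in Python using pandas and scikit-learn. The file names, column names, and the choice of k-means as the mining method are illustrative assumptions, not part of the original material.

    import pandas as pd
    from sklearn.cluster import KMeans

    # 1-2. Data cleaning and integration: load two sources, drop bad rows.
    sales = pd.read_csv("transactions.csv")          # hypothetical file
    customers = pd.read_csv("customers.csv")         # hypothetical file
    data = sales.merge(customers, on="customer_id")  # integration
    data = data.dropna().drop_duplicates()           # cleaning

    # 3. Data selection: keep only attributes relevant to the analysis task.
    selected = data[["customer_id", "amount"]]

    # 4. Data transformation: aggregate to one summary row per customer.
    summary = selected.groupby("customer_id").agg(
        total_spent=("amount", "sum"),
        n_purchases=("amount", "count"),
    )

    # 5. Data mining: apply an intelligent method (here, k-means clustering).
    model = KMeans(n_clusters=3, n_init=10, random_state=0)
    summary["cluster"] = model.fit_predict(summary)

    # 6-7. Pattern evaluation and presentation would inspect and visualize
    # the clusters, e.g. summary.groupby("cluster").mean().
    print(summary.head())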


Ch1:

3. Applications of Data Mining

Applications can be divided into four main types: classification, numerical prediction, association and clustering.

Each of these is explained later. However, first we need to distinguish between two types of data: labelled and unlabelled data.

Ch1:

4. Labelled and Unlabelled Data


Roughly 70%-80% of data mining operation time is spent on preparing the data tables (obtained from different sources) to be suitable for data mining modeling.

[Figure: the ideal structure of data for data mining is a table whose rows are SAMPLES (observations, cases, records, examples), whose columns are FEATURES (variables, attributes, fields), and whose cells hold the feature value for the given sample.]

In general we have a dataset of examples (called instances), each of which comprises the values of a number of variables, which in data mining are often called attributes. There are two types of data, which are treated in radically different ways.


Ch1:

4. Labelled and Unlabelled Data


For the first type of data there is a specially designated attribute, and the aim is to use the data given to predict the value of that attribute for instances that have not yet been seen.

Data of this kind is called labelled. Data mining using labelled data is known as supervised learning.

If the designated attribute is categorical, i.e. it must take one of a number of distinct values such as ‘very good’, ‘good’ or ‘poor’, or (in an object recognition application) ‘car’, ‘bicycle’, ‘person’, ‘bus’ or ‘taxi’, the task is called classification.

If the designated attribute is numerical, e.g. the expected sale price of a house or the opening price of a share on tomorrow’s stock market, the task is called regression.
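A minimal sketch of the distinction using scikit-learn decision trees; the toy data, attribute meanings and model choices are illustrative assumptions, not from the slides.

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    # Four instances, each described by two attributes (e.g. area, rooms).
    X = [[50.0, 2], [120.0, 4], [80.0, 3], [200.0, 5]]

    # Classification: the designated attribute is categorical.
    y_class = ["poor", "very good", "good", "very good"]
    clf = DecisionTreeClassifier().fit(X, y_class)
    print(clf.predict([[100.0, 3]]))   # predicts a category label

    # Regression: the designated attribute is numerical (e.g. sale price).
    y_price = [90_000, 250_000, 150_000, 400_000]
    reg = DecisionTreeRegressor().fit(X, y_price)
    print(reg.predict([[100.0, 3]]))   # predicts a number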

Ch1:

4. Labelled and Unlabelled Data

For the second type of data: data that does not have any specially designated attribute is called unlabelled.

Data mining of unlabelled data is known as unsupervised learning. Here the aim is simply to extract the most information we can from the data available.


Ch1:

Data Mining Functions (Methods)

Supervised (directed) data mining is predictive; unsupervised (undirected) data mining is descriptive.

• Predictive data mining functions are classification and regression.
• Descriptive data mining functions are clustering, association models, and feature extraction.

Ch1:

Different algorithms serve different purposes; each algorithm has advantages and disadvantages.

A given algorithm can be used to solve different kinds of problems. For example:

• k-Means clustering is unsupervised data mining; however, if you use k-Means clustering to assign new records to a cluster, it performs predictive data mining.

• Similarly, decision tree classification is supervised data mining; however, the decision tree rules can be used for descriptive purposes.
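A minimal sketch of this dual use with scikit-learn’s KMeans; the data is an illustrative assumption.

    from sklearn.cluster import KMeans

    # Unsupervised: discover clusters in unlabelled data.
    X = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.1], [8.8, 9.3]]
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)                 # descriptive: cluster of each training record

    # Predictive: assign new, unseen records to the learned clusters.
    new_records = [[0.9, 1.1], [9.2, 8.9]]
    print(km.predict(new_records))    # predictive use of the same model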


Ch1:

Data for data mining comes in many forms:

• from computer files typed in by human operators,
• business information in SQL or some other standard database format,
• information recorded automatically by equipment such as fault logging devices,
• to streams of binary data transmitted from satellites.

For the purposes of data mining we will assume that the data takes a particular standard form, which is described in the next slides, and we will look at some of the practical problems of data preparation.

Ch1:

1. Standard Formulation
For any data mining application we have a universe of objects that are of interest. Each object (e.g. a student) is described by a number of variables that correspond to its properties; in data mining these are called attributes. The set of variable values for an object is called a record or an instance. The complete set of data available to us for an application is called a dataset.


Ch1:

1. Standard Formulation
This dataset is an example of labelled data, where one attribute is given special significance and the aim is to predict its value. This attribute has the standard name ‘class’. When there is no such significant attribute we call the data unlabelled.
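For instance, a labelled dataset might look like this in Python; the attribute names and values are illustrative assumptions.

    import pandas as pd

    # Each row is an instance; each column is an attribute.
    df = pd.DataFrame({
        "attendance": [0.9, 0.4, 0.75],
        "coursework": [85, 40, 60],
        "class":      ["very good", "poor", "good"],  # designated attribute
    })

    X = df.drop(columns="class")   # ordinary attributes
    y = df["class"]                # the 'class' attribute to be predicted

    # Dropping the 'class' column would leave an unlabelled dataset.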

Ch1:

2. Types of Variable

It is important to classify the types of variable (feature) in order to choose a suitable mining algorithm.

At least six main types of variable can be distinguished.


Ch1:

2. Types of Variable

• Nominal Variables
– Used to put objects into categories; the values are simply labels.
– Form an order-less scale that uses different symbols to represent the different states (values) of the variable being measured.
Example: customer-type coded as 1, 2, 3, 4, … or A, B, C, D, …
– Do not have metric characteristics.
– Have no particular order and no necessary relation to one another.
– Have no mathematical interpretation.

• Binary Variables
– A special case of a nominal variable that takes only two possible values: true or false, 1 or 0, etc.

Ch1:

2. Types of Variable

• Ordinal Variables
– Similar to nominal variables, except that the values can be arranged in a meaningful order, e.g. small, medium, large.
– An order relation is defined but not a distance relation, e.g. the rank of a student in a class.

• Integer Variables
– Unlike nominal variables that are numerical in form, arithmetic with integer variables is meaningful (1 child + 2 children = 3 children, etc.).

Ch1:

2. Types of Variable

• Interval-scaled Variables
– Variables that take numerical values which are measured at equal intervals from a zero point or origin, e.g. the Fahrenheit and Celsius temperature scales. The zero value has been selected arbitrarily and does not imply ‘absence of temperature’.

C:   0   10   20   30
F:  32   50   68   86
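The two rows of the table are related by the standard conversion F = C × 9/5 + 32, so equal 10-degree steps on the Celsius scale correspond to equal 18-degree steps on the Fahrenheit scale. Note that ratios are not preserved: 20 °C is not ‘twice as hot’ as 10 °C, since the corresponding 68 °F is not twice 50 °F.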


• Ratio-scaled Variables
– Similar to interval-scaled variables except that the zero point does reflect
the absence of the measured characteristic.
– A ratio scale has an absolute zero point and, consequently, the ratio
relation holds true for variables measured using this scale.
Quantities such as height, length, and salary use this type of scale.
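A minimal sketch of how these six variable types might be represented in Python with pandas; all column names and values are illustrative assumptions.

    import pandas as pd

    df = pd.DataFrame({
        "customer_type": pd.Categorical(["A", "B", "A", "C"]),      # nominal
        "is_active":     [True, False, True, True],                 # binary
        "size":          pd.Categorical(
                             ["small", "large", "medium", "small"],
                             categories=["small", "medium", "large"],
                             ordered=True),                         # ordinal
        "children":      [1, 2, 0, 3],                              # integer
        "temp_celsius":  [0.0, 10.0, 20.0, 30.0],                   # interval-scaled
        "salary":        [30_000.0, 45_000.0, 38_000.0, 52_000.0],  # ratio-scaled
    })

    # The ordinal column supports meaningful order comparisons:
    print(df["size"].min(), df["size"].max())   # small large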


Ch1:

2.1 Categorical and Continuous Attributes (Variables)

Data mining systems divide attributes into just two types:
– categorical: corresponding to nominal, binary and ordinal variables;
– continuous: corresponding to integer, interval-scaled and ratio-scaled variables.

For many applications it is helpful to have a third category of attribute, the ‘ignore’ attribute, corresponding to variables that are of no significance for the application.
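As a sketch, this mapping could be recorded explicitly when preparing a table; the attribute names are illustrative assumptions.

    # Map each attribute of a hypothetical table to its data mining type.
    attribute_types = {
        "customer_type": "categorical",   # nominal
        "is_active":     "categorical",   # binary
        "size":          "categorical",   # ordinal
        "children":      "continuous",    # integer
        "temp_celsius":  "continuous",    # interval-scaled
        "salary":        "continuous",    # ratio-scaled
        "row_id":        "ignore",        # of no significance for the task
    }

    categorical = [a for a, t in attribute_types.items() if t == "categorical"]
    continuous  = [a for a, t in attribute_types.items() if t == "continuous"]
    # 'ignore' attributes are simply excluded from both lists before mining.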

Ch1:

2.1 Categorical and Continuous Attributes (Variables)

It is important to choose methods that are appropriate to the types of variable stored for a particular application.

There are other types of variable to which these methods would not be applicable without modification, for example any variable that is measured on a logarithmic scale.


Ch1:

3. Data Preparation - Data Cleaning

• Data cannot be assumed to be error free (even when it is in the standard form).

• Erroneous values can be divided into those which are possible values of the attribute and those which are not.

• A noisy value means one that is valid for the dataset, but is incorrectly recorded, e.g. 69.72 recorded as 6.972, or brown recorded as blue.

• An invalid value (not a noisy value) can easily be detected and either corrected or rejected, e.g. 69.7X for 6.972 or bbrown for brown.
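A minimal sketch of detecting invalid values in Python; the column names and the set of valid values are illustrative assumptions.

    import pandas as pd

    df = pd.DataFrame({"height": ["69.72", "6.972", "69.7X"],
                       "eyes":   ["brown", "blue", "bbrown"]})

    # Invalid values are not possible values of the attribute, so they can
    # be detected mechanically: non-numeric heights, unknown eye colours.
    height = pd.to_numeric(df["height"], errors="coerce")  # '69.7X' -> NaN
    valid_eyes = {"brown", "blue", "green", "grey"}
    bad_eyes = ~df["eyes"].isin(valid_eyes)

    print(df[height.isna() | bad_eyes])   # rows to correct or reject

    # Noisy values such as 6.972 (recorded for 69.72) are valid for the
    # dataset, so they need range or distribution checks instead.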

Ch1:

3. Data Preparation - Data Cleaning

In attempting to ‘clean up’ data it is helpful to have a range of software tools available. However, in the absence of special software, even some very basic analysis of the values of variables may be helpful. For example:


– A numerical variable takes only six different values: it may be best to treat it as a categorical variable rather than a continuous one.

– All the values of a variable are identical: the variable should be treated as an ‘ignore’ attribute.

– All the values of a variable except one are identical: it is then necessary to decide whether the one different value is an error or not. If not, the variable should be treated as a categorical attribute with just two values.


– Some values lie outside the normal range of the variable (for a continuous attribute): the values should be investigated.

– Some values occur an abnormally large number of times: the values should be investigated.

Anomalous values may simply be errors, or they may be outliers, i.e. genuine values that are significantly different from the others, so we need to be careful before simply discarding them or adjusting them back to ‘normal’ values.
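A minimal sketch of these basic checks in pandas; df is assumed to be a DataFrame holding the table being cleaned, and the 50% cutoff is an arbitrary illustrative choice.

    # Assumes df is a pandas DataFrame holding the data being cleaned.
    for col in df.columns:
        counts = df[col].value_counts()   # sorted, most frequent first

        if counts.size <= 6:
            print(col, "has few distinct values: consider categorical")
        if counts.size == 1:
            print(col, "is constant: treat as an 'ignore' attribute")
        if counts.size == 2 and counts.min() == 1:
            print(col, "has one different value: error or two-valued?")
        if counts.iloc[0] > 0.5 * len(df):
            print(col, "has an abnormally frequent value:", counts.index[0])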


Ch1:

4. Missing Values
In many real-world datasets data values are not recorded for all attributes. Two of the most common strategies for dealing with missing values are:

• Discard Instances
Delete all instances where there is at least one missing value and use the remainder. Its disadvantage is that discarding data may damage the reliability of the results derived from the data.

• Replace by Most Frequent/Average Value
Replacing a missing value by an estimate of its true value may of course introduce noise into the data, which can lead to invalid results.
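A minimal sketch of both strategies in pandas; df is assumed to be a DataFrame containing missing entries (NaN).

    # Strategy 1: discard instances with at least one missing value.
    df_discard = df.dropna()

    # Strategy 2: replace missing values by an estimate of the true value:
    # the most frequent value for categorical columns, the mean otherwise.
    df_filled = df.copy()
    for col in df_filled.columns:
        if df_filled[col].dtype == object:
            df_filled[col] = df_filled[col].fillna(df_filled[col].mode()[0])
        else:
            df_filled[col] = df_filled[col].fillna(df_filled[col].mean())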

Ch1:

5. Reducing the Number of Attributes


Suppose we have 10,000 pieces of information about each supermarket customer and want to predict which customers will buy a new brand of dog food. The number of attributes of any relevance to this is probably very small.

At best the many irrelevant attributes will place an unnecessary computational overhead on any data mining algorithm. At worst, they may cause the algorithm to give poor results.

There are several ways in which the number of attributes (or ‘features’) can be reduced before a dataset is processed. The term feature reduction or dimension reduction is generally used for this process. We will return to this topic in Chapter 9.
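One simple form of feature reduction is to drop constant attributes before mining. A minimal sketch with scikit-learn’s VarianceThreshold; the data is an illustrative assumption.

    from sklearn.feature_selection import VarianceThreshold

    # Four numeric attributes per customer; the third is constant.
    X = [[1, 20, 0, 300],
         [0, 35, 0, 180],
         [1, 28, 0, 250],
         [0, 41, 0, 120]]

    selector = VarianceThreshold()        # drops zero-variance features
    X_reduced = selector.fit_transform(X)

    print(selector.get_support())   # [ True  True False  True ]
    print(X_reduced)                # dataset with the constant attribute removed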
