

Ch1:

1. The Data Explosion


Modern computer systems are accumulating data at an almost unimaginable rate and from a very wide variety of sources.

Some examples:
- The current NASA Earth observation satellites generate a terabyte of data every day.
- Many companies maintain large Data Warehouses of customer transactions. A fairly small data warehouse might contain more than a hundred million transactions.

Ch1:

1. The Data Explosion

Large databases often become data tombs: data archives that are seldom visited.

We are data rich, but information poor.

Buried within such data is knowledge that can be critical to a company’s growth or decline.


Ch1:

1. The Data Explosion

Knowledge that could:

• lead to important discoveries in science,
• enable us to predict the weather and natural disasters accurately,
• enable us to identify the causes of, and possible cures for, lethal illnesses,
• literally mean the difference between life and death.

Ch1:

2. Knowledge Discovery

Defined as the ‘non-trivial extraction of implicit, previously unknown and potentially useful information from data’.

It is a process of which data mining forms just one part, albeit a central one.


Ch1:

Data mining creates models to find hidden patterns in large, complex collections of data: patterns that sometimes elude traditional statistical approaches to analysis because of:
• the large number of attributes,
• the complexity of the patterns,
• the difficulty in performing the analysis.

Data Mining as a Part of the Knowledge Discovery Process

The knowledge discovery process consists of an iterative sequence of the following steps:

Ch1:

1. Data Cleaning (to remove noise and inconsistent data).
2. Data Integration (where multiple data sources may be combined).
3. Data Selection (where data relevant to the analysis task are retrieved from the database).
4. Data Transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations).
5. Data Mining (an essential process where intelligent methods are applied in order to extract patterns).
6. Pattern Evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures).
7. Knowledge Presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).
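To make the sequence concrete, here is a minimal sketch of steps 1-5 in Python using pandas and scikit-learn. The file names, column names, and the choice of k-means as the mining method are illustrative assumptions, not part of the original material.

    import pandas as pd
    from sklearn.cluster import KMeans

    # 1-2. Data cleaning and integration: load two sources, drop bad rows.
    sales = pd.read_csv("transactions.csv")          # hypothetical file
    customers = pd.read_csv("customers.csv")         # hypothetical file
    data = sales.merge(customers, on="customer_id")  # integration
    data = data.dropna().drop_duplicates()           # cleaning

    # 3. Data selection: keep only attributes relevant to the analysis task.
    selected = data[["customer_id", "amount"]]

    # 4. Data transformation: aggregate to one summary row per customer.
    summary = selected.groupby("customer_id").agg(
        total_spent=("amount", "sum"),
        n_purchases=("amount", "count"),
    )

    # 5. Data mining: apply an intelligent method (here, k-means clustering).
    model = KMeans(n_clusters=3, n_init=10, random_state=0)
    summary["cluster"] = model.fit_predict(summary)

    # 6-7. Pattern evaluation and presentation would inspect and visualize
    # the clusters, e.g. summary.groupby("cluster").mean().
    print(summary.head())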


Ch1:

3. Applications of Data Mining

Applications can be divided into four main types: classification, numerical prediction, association and clustering.

Each of these is explained later. However, first we need to distinguish between two types of data: labelled and unlabelled data.

Ch1:

4. Labelled and Unlabelled Data


Roughly 70%-80% of data mining operation time is spent on preparing the data tables (obtained from different sources) to be suitable for data mining modeling.

[Figure: the ideal structure of data for data mining is a table whose rows are SAMPLES (observations, cases, records, examples), whose columns are FEATURES (variables, attributes, fields), and whose cells hold the feature value for the given sample.]

In general we have a dataset of examples (called instances), each of which comprises the values of a number of variables, which in data mining are often called attributes. There are two types of data, which are treated in radically different ways.


Ch1:

4. Labelled and Unlabelled Data


For the first type of data there is a specially designated attribute, and the aim is to use the data given to predict the value of that attribute for instances that have not yet been seen.

Data of this kind is called labelled. Data mining using labelled data is known as supervised learning.

If the designated attribute is categorical, i.e. it must take one of a number of distinct values such as ‘very good’, ‘good’ or ‘poor’, or (in an object recognition application) ‘car’, ‘bicycle’, ‘person’, ‘bus’ or ‘taxi’, the task is called classification.

If the designated attribute is numerical, e.g. the expected sale price of a house or the opening price of a share on tomorrow’s stock market, the task is called regression.
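A minimal sketch of the distinction using scikit-learn decision trees; the toy data, attribute meanings and model choices are illustrative assumptions, not from the slides.

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    # Four instances, each described by two attributes (e.g. area, rooms).
    X = [[50.0, 2], [120.0, 4], [80.0, 3], [200.0, 5]]

    # Classification: the designated attribute is categorical.
    y_class = ["poor", "very good", "good", "very good"]
    clf = DecisionTreeClassifier().fit(X, y_class)
    print(clf.predict([[100.0, 3]]))   # predicts a category label

    # Regression: the designated attribute is numerical (e.g. sale price).
    y_price = [90_000, 250_000, 150_000, 400_000]
    reg = DecisionTreeRegressor().fit(X, y_price)
    print(reg.predict([[100.0, 3]]))   # predicts a number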

Ch1:

4. Labelled and Unlabelled Data

For the second type of data: data that does not have any specially designated attribute is called unlabelled.

Data mining of unlabelled data is known as unsupervised learning. Here the aim is simply to extract the most information we can from the data available.


Ch1:

Data Mining Functions (Methods)

Supervised (directed) data mining is predictive; unsupervised (undirected) data mining is descriptive.

• Predictive data mining functions are classification and regression.
• Descriptive data mining functions are clustering, association models, and feature extraction.

Ch1:

Different algorithms serve different purposes; each algorithm has advantages and disadvantages.

A given algorithm can be used to solve different kinds of problems. For example:

• k-Means clustering is unsupervised data mining; however, if you use k-Means clustering to assign new records to a cluster, it performs predictive data mining.

• Similarly, decision tree classification is supervised data mining; however, the decision tree rules can be used for descriptive purposes.
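A minimal sketch of this dual use with scikit-learn’s KMeans; the data is an illustrative assumption.

    from sklearn.cluster import KMeans

    # Unsupervised: discover clusters in unlabelled data.
    X = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.1], [8.8, 9.3]]
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)                 # descriptive: cluster of each training record

    # Predictive: assign new, unseen records to the learned clusters.
    new_records = [[0.9, 1.1], [9.2, 8.9]]
    print(km.predict(new_records))    # predictive use of the same model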


Ch1:

Data for data mining comes in many forms:

• from computer files typed in by human operators,
• business information in SQL or some other standard database format,
• information recorded automatically by equipment such as fault logging devices,
• to streams of binary data transmitted from satellites.

For the purposes of data mining we will assume that the data takes a particular standard form, which is described in the next slides, and we will look at some of the practical problems of data preparation.

Ch1:

1. Standard Formulation
For any data mining application we have a universe of objects that are of interest. Each object (e.g. a student) is described by a number of variables that correspond to its properties; in data mining these are called attributes. The set of variable values for an object is called a record or an instance. The complete set of data available to us for an application is called a dataset.


Ch1:

1. Standard Formulation
This dataset is an example of labelled data, where one attribute is given special significance and the aim is to predict its value. This attribute has the standard name ‘class’. When there is no such significant attribute we call the data unlabelled.
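For instance, a labelled dataset might look like this in Python; the attribute names and values are illustrative assumptions.

    import pandas as pd

    # Each row is an instance; each column is an attribute.
    df = pd.DataFrame({
        "attendance": [0.9, 0.4, 0.75],
        "coursework": [85, 40, 60],
        "class":      ["very good", "poor", "good"],  # designated attribute
    })

    X = df.drop(columns="class")   # ordinary attributes
    y = df["class"]                # the 'class' attribute to be predicted

    # Dropping the 'class' column would leave an unlabelled dataset.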

Ch1:

2. Types of Variable

It is important to classify the types of variable (feature) in order to choose a suitable mining algorithm.

At least six main types of variable can be distinguished.


Ch1:

2. Types of Variable

• Nominal Variables
– Used to put objects into categories; the values are simply labels.
– Form an order-less scale that uses different symbols to represent the different states (values) of the variable being measured.
Example: customer-type coded as 1, 2, 3, 4, … or A, B, C, D, …
– Do not have metric characteristics.
– Have no particular order and no necessary relation to one another.
– Have no mathematical interpretation.

• Binary Variables
– A special case of a nominal variable that takes only two possible values: true or false, 1 or 0, etc.

Ch1:

2. Types of Variable

• Ordinal Variables
– Similar to nominal variables, except that the values can be arranged in a meaningful order, e.g. small, medium, large.
– An order relation is defined but not a distance relation, e.g. the rank of a student in a class.

• Integer Variables
– Unlike nominal variables that are numerical in form, arithmetic with integer variables is meaningful (1 child + 2 children = 3 children, etc.).

Ch1:

2. Types of Variable

• Interval-scaled Variables
– Variables that take numerical values which are measured at equal intervals from a zero point or origin, e.g. the Fahrenheit and Celsius temperature scales. The zero value has been selected arbitrarily and does not imply ‘absence of temperature’.

C:   0   10   20   30
F:  32   50   68   86
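The two rows of the table are related by the standard conversion F = C × 9/5 + 32, so equal 10-degree steps on the Celsius scale correspond to equal 18-degree steps on the Fahrenheit scale. Note that ratios are not preserved: 20 °C is not ‘twice as hot’ as 10 °C, since the corresponding 68 °F is not twice 50 °F.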


• Ratio-scaled Variables
– Similar to interval-scaled variables except that the zero point does reflect
the absence of the measured characteristic.
– A ratio scale has an absolute zero point and, consequently, the ratio
relation holds true for variables measured using this scale.
Quantities such as height, length, and salary use this type of scale.
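A minimal sketch of how these six variable types might be represented in Python with pandas; all column names and values are illustrative assumptions.

    import pandas as pd

    df = pd.DataFrame({
        "customer_type": pd.Categorical(["A", "B", "A", "C"]),      # nominal
        "is_active":     [True, False, True, True],                 # binary
        "size":          pd.Categorical(
                             ["small", "large", "medium", "small"],
                             categories=["small", "medium", "large"],
                             ordered=True),                         # ordinal
        "children":      [1, 2, 0, 3],                              # integer
        "temp_celsius":  [0.0, 10.0, 20.0, 30.0],                   # interval-scaled
        "salary":        [30_000.0, 45_000.0, 38_000.0, 52_000.0],  # ratio-scaled
    })

    # The ordinal column supports meaningful order comparisons:
    print(df["size"].min(), df["size"].max())   # small large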


Ch1:

2.1 Categorical and Continuous Attributes (Variables)

Data mining systems divide attributes into just two types:
– categorical: corresponding to nominal, binary and ordinal variables;
– continuous: corresponding to integer, interval-scaled and ratio-scaled variables.

For many applications it is helpful to have a third category of attribute, the ‘ignore’ attribute, corresponding to variables that are of no significance for the application.
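As a sketch, this mapping could be recorded explicitly when preparing a table; the attribute names are illustrative assumptions.

    # Map each attribute of a hypothetical table to its data mining type.
    attribute_types = {
        "customer_type": "categorical",   # nominal
        "is_active":     "categorical",   # binary
        "size":          "categorical",   # ordinal
        "children":      "continuous",    # integer
        "temp_celsius":  "continuous",    # interval-scaled
        "salary":        "continuous",    # ratio-scaled
        "row_id":        "ignore",        # of no significance for the task
    }

    categorical = [a for a, t in attribute_types.items() if t == "categorical"]
    continuous  = [a for a, t in attribute_types.items() if t == "continuous"]
    # 'ignore' attributes are simply excluded from both lists before mining.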

Ch1:

2.1 Categorical and Continuous Attributes (Variables)

It is important to choose methods that are appropriate to the types of variable stored for a particular application.

There are other types of variable to which these methods would not be applicable without modification, for example any variable that is measured on a logarithmic scale.


Ch1:

3. Data Preparation - Data Cleaning

• Data cannot be assumed to be error free (even when it is in the standard form).

• Erroneous values can be divided into those which are possible values of the attribute and those which are not.

• A noisy value means one that is valid for the dataset, but is incorrectly recorded, e.g. 69.72 recorded as 6.972, or brown recorded as blue.

• An invalid value (not a noisy value) can easily be detected and either corrected or rejected, e.g. 69.7X for 6.972 or bbrown for brown.
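A minimal sketch of detecting invalid values in Python; the column names and the set of valid values are illustrative assumptions.

    import pandas as pd

    df = pd.DataFrame({"height": ["69.72", "6.972", "69.7X"],
                       "eyes":   ["brown", "blue", "bbrown"]})

    # Invalid values are not possible values of the attribute, so they can
    # be detected mechanically: non-numeric heights, unknown eye colours.
    height = pd.to_numeric(df["height"], errors="coerce")  # '69.7X' -> NaN
    valid_eyes = {"brown", "blue", "green", "grey"}
    bad_eyes = ~df["eyes"].isin(valid_eyes)

    print(df[height.isna() | bad_eyes])   # rows to correct or reject

    # Noisy values such as 6.972 (recorded for 69.72) are valid for the
    # dataset, so they need range or distribution checks instead.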

Ch1:

3. Data Preparation - Data Cleaning

In attempting to ‘clean up’ data it is helpful to have a range of software tools available. However, in the absence of special software, even some very basic analysis of the values of variables may be helpful. For example:


– A numerical variable takes only six different values: it may be best to treat it as a categorical variable rather than a continuous one.

– All the values of a variable are identical: the variable should be treated as an ‘ignore’ attribute.

– All the values of a variable except one are identical: it is then necessary to decide whether the one different value is an error or not. If not, the variable should be treated as a categorical attribute with just two values.


– Some values lie outside the normal range of the variable (for a continuous attribute): the values should be investigated.

– Some values occur an abnormally large number of times: the values should be investigated.

Anomalous values may simply be errors, or they may be outliers, i.e. genuine values that are significantly different from the others, so we need to be careful before simply discarding them or adjusting them back to ‘normal’ values.
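A minimal sketch of these basic checks in pandas; df is assumed to be a DataFrame holding the table being cleaned, and the 50% cutoff is an arbitrary illustrative choice.

    # Assumes df is a pandas DataFrame holding the data being cleaned.
    for col in df.columns:
        counts = df[col].value_counts()   # sorted, most frequent first

        if counts.size <= 6:
            print(col, "has few distinct values: consider categorical")
        if counts.size == 1:
            print(col, "is constant: treat as an 'ignore' attribute")
        if counts.size == 2 and counts.min() == 1:
            print(col, "has one different value: error or two-valued?")
        if counts.iloc[0] > 0.5 * len(df):
            print(col, "has an abnormally frequent value:", counts.index[0])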


Ch1:

4. Missing Values
In many real-world datasets data values are not recorded for all attributes. Two of the most common strategies for dealing with missing values are:

• Discard Instances
Delete all instances where there is at least one missing value and use the remainder. Its disadvantage is that discarding data may damage the reliability of the results derived from the data.

• Replace by Most Frequent/Average Value
Replacing a missing value by an estimate of its true value may of course introduce noise into the data, which can lead to invalid results.
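A minimal sketch of both strategies in pandas; df is assumed to be a DataFrame containing missing entries (NaN).

    # Strategy 1: discard instances with at least one missing value.
    df_discard = df.dropna()

    # Strategy 2: replace missing values by an estimate of the true value:
    # the most frequent value for categorical columns, the mean otherwise.
    df_filled = df.copy()
    for col in df_filled.columns:
        if df_filled[col].dtype == object:
            df_filled[col] = df_filled[col].fillna(df_filled[col].mode()[0])
        else:
            df_filled[col] = df_filled[col].fillna(df_filled[col].mean())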

Ch1:

5. Reducing the Number of Attributes


Suppose we have 10,000 pieces of information about each supermarket customer and want to predict which customers will buy a new brand of dog food. The number of attributes of any relevance to this is probably very small.

At best the many irrelevant attributes will place an unnecessary computational overhead on any data mining algorithm. At worst, they may cause the algorithm to give poor results.

There are several ways in which the number of attributes (or ‘features’) can be reduced before a dataset is processed. The term feature reduction or dimension reduction is generally used for this process. We will return to this topic in Chapter 9.
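One simple form of feature reduction is to drop constant attributes before mining. A minimal sketch with scikit-learn’s VarianceThreshold; the data is an illustrative assumption.

    from sklearn.feature_selection import VarianceThreshold

    # Four numeric attributes per customer; the third is constant.
    X = [[1, 20, 0, 300],
         [0, 35, 0, 180],
         [1, 28, 0, 250],
         [0, 41, 0, 120]]

    selector = VarianceThreshold()        # drops zero-variance features
    X_reduced = selector.fit_transform(X)

    print(selector.get_support())   # [ True  True False  True ]
    print(X_reduced)                # dataset with the constant attribute removed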
