Class 3 Introduction
Yashvardhan Sharma
30-Jan-24 CS F415 1
Today’s Outline
• Introduction
• Origins of Data Mining
• Challenges in Data Mining
• Data Mining
• vs Statistical Analysis
• vs Machine Learning
• vs Data Warehousing
• Data Preprocessing
What is Data Mining?
• Many Definitions
• Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
What is NOT Data Mining?
• Originally a statisticians' term
• Overuse of data to draw invalid inferences
• Bonferroni's theorem warns us that if there are too many
possible conclusions to draw, some will appear true for purely
statistical reasons, with no physical validity.
• Famous example: J. B. Rhine, a "parapsychologist" at Duke in
the 1950s, tested students for extrasensory perception (ESP) by
asking them to guess the color (red or black) of 10 cards. He found
that about 1 in 1,000 guessed all 10 correctly and, instead of
realizing that this is what random guessing predicts (the chance is
(1/2)^10 = 1/1024), declared them to have ESP. When he retested
them, they did no better than average.
• His conclusion: telling people they have ESP causes them to lose it!
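The arithmetic behind the ESP anecdote can be checked directly. This is a minimal sketch, assuming every guess is an independent fair coin flip; the cohort size of 1,000 students is a hypothetical figure chosen for illustration, not a number from the slide.

```python
from fractions import Fraction

def p_all_correct(n_cards: int) -> Fraction:
    """Probability of guessing every card's color by pure chance."""
    return Fraction(1, 2) ** n_cards

def p_at_least_one(n_students: int, n_cards: int) -> float:
    """Chance that at least one student gets a perfect score by luck alone."""
    p = float(p_all_correct(n_cards))
    return 1.0 - (1.0 - p) ** n_students

p = p_all_correct(10)
n_students = 1000  # hypothetical cohort size

print("P(perfect score by chance):", p)
print("Expected lucky students:", n_students * float(p))
print("P(at least one lucky student):", p_at_least_one(n_students, 10))
```

With enough students tested, a few "perfect" guessers are close to inevitable, which is why the retest showed no effect.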
Data Mining vs. Statistical Analysis
Statistical Analysis:
• Ill-suited for Nominal and Structured Data Types
• Completely data driven - incorporation of domain knowledge not possible
• Interpretation of results is difficult and daunting
• Requires expert user guidance
Data Mining:
• Large Data sets
• Efficiency of Algorithms is important
• Scalability of Algorithms is important
• Real World Data
• Lots of Missing Values
• Pre-existing data - not user generated
• Data not static - prone to updates
• Efficient methods for data retrieval available for use
What is (not) Data Mining?
[Two-column comparison: "What is not Data Mining?" | "What is Data Mining?"]
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition,
statistics, database systems, parallel computing, and
distributed computing
• Traditional techniques may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics, Machine Learning/AI/Pattern Recognition, and Database Systems]
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
Data Mining Tasks
• Prediction Methods
• Use some variables to predict unknown or future
values of other variables.
• Description Methods
• Find human-interpretable patterns that describe
the data.
Data Mining and Data Warehousing
• Data Warehouse: a centralized data repository which can be queried
for business benefit.
• Data Warehousing makes it possible to
• extract archived operational data
• overcome inconsistencies between different legacy data formats
• integrate data throughout an enterprise, regardless of location, format, or
communication requirements
• incorporate additional or expert information
• OLAP: On-line Analytical Processing
• Multi-Dimensional Data Model (Data Cube)
• Operations:
• Roll-up
• Drill-down
• Slice and dice
• Rotate
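The OLAP operations listed above can be sketched on a tiny in-memory cube. This is only an illustration: the fact table, its dimensions (year, quarter, region, product), and the helper names `aggregate`, `slice_`, and `dice` are all hypothetical, and a real OLAP engine would precompute and index these aggregates.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, product) -> units sold.
DIMS = ("year", "quarter", "region", "product")
FACTS = {
    (2023, "Q1", "North", "pen"): 10,
    (2023, "Q1", "South", "pen"): 4,
    (2023, "Q2", "North", "ink"): 7,
    (2023, "Q2", "South", "pen"): 5,
    (2024, "Q1", "North", "ink"): 3,
    (2024, "Q1", "South", "ink"): 6,
}

def aggregate(facts, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    cube = defaultdict(int)
    for key, value in facts.items():
        cube[tuple(key[i] for i in idx)] += value
    return dict(cube)

def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in facts.items() if k[i] == value}

def dice(facts, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {
        k: v for k, v in facts.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

# Roll-up: summarize away quarter and product.
rolled_up = aggregate(FACTS, keep=("year", "region"))
# Drill-down: bring the quarter level back for more detail.
drilled_down = aggregate(FACTS, keep=("year", "quarter", "region"))
north_only = slice_(FACTS, "region", "North")
diced = dice(FACTS, year={2023}, product={"pen"})
# Rotate (pivot): same cells, axes in a different order.
rotated = aggregate(FACTS, keep=("region", "year"))

print(rolled_up)
print(rotated)
```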
An OLAM Architecture
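[Figure: layered OLAM architecture]
• Layer 4, User Interface: user GUI API; takes mining queries and returns mining results
• Layer 3, OLAP/OLAM: OLAM engine and OLAP engine
• Layer 2, MDDB: multidimensional database (MDDB) and metadata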
Example of DBMS, OLAP and Data Mining: Weather Data
DBMS:
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
OLAP:
• Using OLAP we can create a Multidimensional Model of our data (Data
Cube).
• For example using the dimensions: time, outlook and play we can create
the following model.
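One way to build such a cube from the weather table is to count days per cell. This is a minimal sketch: the grouping of days into weeks (days 1-7 as week 1, days 8-14 as week 2) is an assumed time hierarchy, and `week_of` is a hypothetical helper.

```python
from collections import Counter

# (day, outlook, play) taken from the weather table on the DBMS slide.
ROWS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"),
]

def week_of(day):
    """Assumed time hierarchy: days 1-7 are week 1, days 8-14 are week 2."""
    return (day - 1) // 7 + 1

# Each cell of the cube counts the days with a given (week, outlook, play).
cube = Counter((week_of(day), outlook, play) for day, outlook, play in ROWS)

# Rolling up the time dimension leaves an outlook x play table of totals.
by_outlook_play = Counter()
for (_, outlook, play), n in cube.items():
    by_outlook_play[(outlook, play)] += n

for cell, n in sorted(by_outlook_play.items()):
    print(cell, n)
```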
Data Mining:
• Using the ID3 algorithm we can produce the following
decision tree:
• outlook = sunny
• humidity = high: no
• humidity = normal: yes
• outlook = overcast: yes
• outlook = rainy
• windy = true: no
• windy = false: yes
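The choice of outlook as the root can be reproduced with the information-gain criterion that ID3 uses. This is a sketch under stated assumptions: the numeric humidity readings are discretized with an assumed cut-off (above 80 is "high"), temperature is left out, and only the root split is computed rather than the full recursive algorithm. The `predict` function simply transcribes the tree from the slide.

```python
from collections import Counter
from math import log2

# (outlook, humidity, windy, play) from the weather table.
ROWS = [
    ("sunny", 85, False, "no"), ("sunny", 90, True, "no"),
    ("overcast", 86, False, "yes"), ("rainy", 96, False, "yes"),
    ("rainy", 80, False, "yes"), ("rainy", 70, True, "no"),
    ("overcast", 65, True, "yes"), ("sunny", 95, False, "no"),
    ("sunny", 70, False, "yes"), ("rainy", 80, False, "yes"),
    ("sunny", 70, True, "yes"), ("overcast", 90, True, "yes"),
    ("overcast", 75, False, "yes"),
]

def humidity_level(h):
    """Assumed discretization: readings above 80 count as 'high'."""
    return "high" if h > 80 else "normal"

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute):
    """Entropy of play minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(attribute(row), []).append(row[3])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[3] for r in rows]) - remainder

ATTRIBUTES = {
    "outlook": lambda r: r[0],
    "humidity": lambda r: humidity_level(r[1]),
    "windy": lambda r: r[2],
}
gains = {name: information_gain(ROWS, f) for name, f in ATTRIBUTES.items()}
root = max(gains, key=gains.get)

def predict(outlook, humidity, windy):
    """The decision tree from the slide, written as nested conditions."""
    if outlook == "sunny":
        return "no" if humidity_level(humidity) == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    return "no" if windy else "yes"  # outlook == "rainy"

errors = [r for r in ROWS if predict(r[0], r[1], r[2]) != r[3]]
print("root attribute:", root)
print("misclassified rows:", len(errors))
```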
Major Issues in Data Warehousing and Mining
• Mining methodology and user interaction
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad-hoc data mining
• Expression and visualization of data mining results
• Handling noise and incomplete data
• Pattern evaluation: the interestingness problem
• Performance and scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining methods
• Issues relating to the diversity of data types
• Handling relational and complex types of data
• Mining information from heterogeneous databases and global information
systems (WWW)
• Issues related to applications and social impacts
• Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
• Integration of the discovered knowledge with existing knowledge: A knowledge
fusion problem
• Protection of data security, integrity, and privacy
Data Preprocessing
Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information for
sales transaction data)
• Data were not considered important at the time of transactions, so they
were not recorded!
• Data not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data
Why can Data be Noisy/Inconsistent?
• Faulty instruments for data collection
• Human or computer errors
• Errors in data transmission
• Technology limitations (e.g., sensor data come at a faster rate
than they can be processed)
• Inconsistencies in naming conventions or data codes (e.g.,
2/5/2018 could be 2 May 2018 or 5 Feb 2018)
• Duplicate tuples (e.g., records received twice), which should
be removed
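Two of these problems are easy to demonstrate. The sketch below parses the ambiguous date from the slide under both conventions and removes exact duplicate tuples; the sensor readings and the `drop_duplicates` helper are hypothetical.

```python
from datetime import datetime

raw = "2/5/2018"
# The same string yields two different dates depending on the convention.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()
month_first = datetime.strptime(raw, "%m/%d/%Y").date()
print(day_first.isoformat(), "vs", month_first.isoformat())

# Hypothetical sensor readings; one tuple was received twice.
readings = [
    ("s1", "2018-05-02", 21.5),
    ("s2", "2018-05-02", 19.0),
    ("s2", "2018-05-02", 19.0),
    ("s1", "2018-05-03", 22.1),
]

def drop_duplicates(tuples):
    """Keep the first occurrence of each tuple, preserving input order."""
    return list(dict.fromkeys(tuples))

clean = drop_duplicates(readings)
print(len(readings), "->", len(clean))
```

Storing dates in an unambiguous form such as ISO 8601 (year-month-day) avoids the first problem at the source.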
What is Data?
• Collection of data objects and their attributes
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute
Measurement of Length
• The way you measure an attribute may not match the
attribute's properties.
[Figure: the same line segments measured on two scales, A and B; one scale reads 5, 7, 8, 10, 15 and the other reads 1, 2, 3, 4, 5]
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties it possesses:
• Distinctness: = ≠
• Order: < >
• Addition: + -
• Multiplication: * /
Types of Attributes
• There are different types of attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• Examples: temperature in Kelvin, length, time, counts
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
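The Operations column can be illustrated with one summary statistic per attribute type, using Python's standard `statistics` module. All sample values below are hypothetical, and the integer coding of the ordinal grades (poor = 1 < ok = 2 < good = 3) is an assumption made for the example.

```python
from statistics import geometric_mean, mean, median, mode

eye_color = ["brown", "blue", "brown", "green"]  # nominal
grades = [1, 3, 2, 3, 2]                         # ordinal: poor=1 < ok=2 < good=3
temps_c = [10.0, 20.0, 30.0]                     # interval (degrees Celsius)
lengths_m = [1.0, 4.0, 16.0]                     # ratio (metres)

# Nominal: only equality is meaningful, so counting matches (mode) is valid.
print("mode:", mode(eye_color))
# Ordinal: order is meaningful, so the middle value (median) is valid.
print("median:", median(grades))
# Interval: differences are meaningful, so the arithmetic mean is valid.
print("mean:", mean(temps_c))
# Ratio: ratios are meaningful, so the geometric mean is valid.
print("geometric mean:", geometric_mean(lengths_m))
```

Each statistic remains valid for the types below it in the table, but not the other way round: a mean of zip codes, for example, has no meaning.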
Attribute Level | Transformation | Comments
Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
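A short check shows what each transformation preserves. This sketch uses the standard Celsius-to-Fahrenheit conversion as the affine map; the sample temperatures are hypothetical, and the two ordinal codings are the ones from the table.

```python
def c_to_f(c):
    """Affine map between interval scales: new_value = a * old_value + b."""
    return 9 / 5 * c + 32

c1, c2, c3 = 10.0, 20.0, 40.0  # hypothetical readings in Celsius
f1, f2, f3 = c_to_f(c1), c_to_f(c2), c_to_f(c3)

# Ratios of differences survive the change of scale ...
diff_ratio_c = (c3 - c2) / (c2 - c1)
diff_ratio_f = (f3 - f2) / (f2 - f1)
# ... but ratios of the values themselves do not.
value_ratio_c = c2 / c1
value_ratio_f = f2 / f1

# Ordinal attributes only need an order-preserving recoding.
coding_a = {"good": 1, "better": 2, "best": 3}
coding_b = {"good": 0.5, "better": 1, "best": 10}
same_order = sorted(coding_a, key=coding_a.get) == sorted(coding_b, key=coding_b.get)

print(diff_ratio_c, diff_ratio_f)
print(value_ratio_c, value_ratio_f)
print(same_order)
```

This is why "20 degrees is twice as hot as 10 degrees" is not a meaningful statement on an interval scale, while "the second rise was twice the first" is.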
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite
number of digits.
• Continuous attributes are typically represented as floating-point variables.
Important Characteristics of Structured Data
• Dimensionality
• Curse of Dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
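Sparsity can be made concrete with a small example. This is a sketch on a hypothetical document-term count matrix; `to_sparse` is an illustrative helper that keeps only the non-zero entries, which is the sense in which only presence counts.

```python
# Hypothetical document-term counts; most entries are zero.
VOCAB = ["data", "mining", "cube", "olap", "tree", "noise"]
dense = [
    [3, 1, 0, 0, 0, 0],
    [0, 0, 2, 4, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

def to_sparse(row):
    """Store only the non-zero entries as {column index: value}."""
    return {j: v for j, v in enumerate(row) if v != 0}

sparse = [to_sparse(r) for r in dense]
stored = sum(len(r) for r in sparse)
total = len(dense) * len(VOCAB)
sparsity = 1 - stored / total

print(sparse)
print(f"{stored} of {total} cells stored, sparsity = {sparsity:.2f}")
```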