
Class 3 Introduction

This document provides an overview of the CS F415 Data Mining course. It introduces key concepts like the origins of data mining, challenges in data mining, differences between data mining and statistical analysis/machine learning/data warehousing, and common data mining tasks. The document also discusses data preprocessing and provides examples to illustrate concepts like data mining versus database management systems and online analytical processing.


CS F415: Data Mining

Yashvardhan Sharma

30-Jan-24 CS F415 1
Today’s Outline

• Introduction
• Origins of Data Mining
• Challenges in Data Mining
• Data Mining
• vs Statistical Analysis
• vs Machine Learning
• vs Data Warehousing
• Data Preprocessing

What is Data Mining?
• Many Definitions
• Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
• Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns

What is NOT Data Mining?
• Originally a “statistician” term
• Overuse of data to draw invalid inferences
• Bonferroni's theorem warns us that if there are too many possible conclusions to draw, some will be true for purely statistical reasons, with no physical validity.
• Famous example: Joseph Rhine, a “parapsychologist” at Duke in the 1950s, tested students for Extra Sensory Perception (ESP) by asking them to guess 10 cards, red or black. He found that about 1/1000 of them guessed all 10 and, instead of realizing that this is what you'd expect from random guessing, declared them to have ESP. When he retested them, he found they did no better than average.

His conclusion: telling people they have ESP causes them to lose it!
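
The arithmetic behind this example can be checked directly. The sketch below assumes each red/black guess is an independent fair coin flip; the group size of 1,000 subjects is a hypothetical figure chosen for illustration, not a number reported for the study.

```python
from fractions import Fraction

# Probability of guessing a single red/black card correctly by pure chance.
p_single = Fraction(1, 2)

# Probability of guessing all 10 cards correctly (guesses assumed independent).
p_all_ten = p_single ** 10

# Hypothetical number of tested subjects (an assumption for illustration).
n_subjects = 1000

# Expected number of "perfect" guessers if nobody has ESP.
expected_perfect = n_subjects * p_all_ten

print(f"P(all 10 correct) = {p_all_ten} (about {float(p_all_ten):.5f})")
print(f"Expected perfect guessers among {n_subjects}: {float(expected_perfect):.2f}")
```

Under these assumptions roughly one perfect scorer per thousand subjects is expected from chance alone, and that person has the same 1/1024 chance of repeating the feat on a retest.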
Data Mining vs. Statistical Analysis
Statistical Analysis:
• Ill-suited for Nominal and Structured Data Types
• Completely data driven - incorporation of domain knowledge not possible
• Interpretation of results is difficult and daunting
• Requires expert user guidance
Data Mining:
• Large Data sets
• Efficiency of Algorithms is important
• Scalability of Algorithms is important
• Real World Data
• Lots of Missing Values
• Pre-existing data - not user generated
• Data not static - prone to updates
• Efficient methods for data retrieval available for use
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
– Discover that customers who buy diapers are more likely to buy beer

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, database systems, parallel computing and distributed computing
• Traditional techniques may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous, distributed nature of data
[Figure: Data Mining shown at the overlap of Statistics, Machine Learning/AI/Pattern Recognition, and Database systems]

Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data

Data Mining Tasks
• Prediction Methods
• Use some variables to predict unknown or future
values of other variables.

• Description Methods
• Find human-interpretable patterns that describe
the data.



Data Mining vs. DBMS
• Example DBMS Reports
• Last month's sales for each service type
• Sales per service grouped by customer sex or age bracket
• List of customers who lapsed their policy

• Questions answered using Data Mining


• What characteristics do customers that lapse their policy have in
common and how do they differ from customers who renew their
policy?
• Which motor insurance policy holders would be potential customers
for my House Content Insurance policy?

Data Mining and Data Warehousing
• Data Warehouse: a centralized data repository which can be queried
for business benefit.
• Data Warehousing makes it possible to
• extract archived operational data
• overcome inconsistencies between different legacy data formats
• integrate data throughout an enterprise, regardless of location, format, or
communication requirements
• incorporate additional or expert information
• OLAP: On-line Analytical Processing
• Multi-Dimensional Data Model (Data Cube)
• Operations:
• Roll-up
• Drill-down
• Slice and dice
• Rotate
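An OLAM Architecture
[Figure: four-layer OLAM architecture. Layer 1 (Data Repository): databases and data warehouse, with data cleaning and data integration. Layer 2 (MDDB): multidimensional database with metadata, connected below through the Database API (filtering and integration). Layer 3 (OLAP/OLAM): OLAM engine and OLAP engine, connected below through the Data Cube API. Layer 4 (User Interface): User GUI API, taking mining queries and returning mining results.]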
DBMS, OLAP, and Data Mining

Task
• DBMS: Extraction of detailed and summary data
• OLAP: Summaries, trends and forecasts
• Data Mining: Knowledge discovery of hidden patterns and insights

Type of result
• DBMS: Information
• OLAP: Analysis
• Data Mining: Insight and Prediction

Method
• DBMS: Deduction (Ask the question, verify with data)
• OLAP: Multidimensional data modeling, Aggregation, Statistics
• Data Mining: Induction (Build the model, apply it to new data, get the result)

Example question
• DBMS: Who purchased mutual funds in the last 3 years?
• OLAP: What is the average income of mutual fund buyers by region by year?
• Data Mining: Who will buy a mutual fund in the next 6 months and why?

Example of DBMS, OLAP and Data Mining: Weather Data
DBMS:
Day outlook temperature humidity windy play

1 sunny 85 85 false no
2 sunny 80 90 true no
3 overcast 83 86 false yes
4 rainy 70 96 false yes
5 rainy 68 80 false yes
6 rainy 65 70 true no
7 overcast 64 65 true yes
8 sunny 72 95 false no
9 sunny 69 70 false yes
10 rainy 75 80 false yes
11 sunny 75 70 true yes
12 overcast 72 90 true yes
13 overcast 81 75 false yes
14 rainy 71 91 true no


Example of DBMS, OLAP and Data Mining: Weather Data
• By querying a DBMS containing the above table we may answer questions like:
• What was the temperature on the sunny days?
{85, 80, 72, 69, 75}
• On which days was the humidity less than 75?
{6, 7, 9, 11}
• On which days was the temperature greater than 70?
{1, 2, 3, 8, 10, 11, 12, 13, 14}
• On which days was the temperature greater than 70 and the humidity less than 75?
The intersection of the above two: {11}
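
The four questions above can be written as ordinary SQL. The following is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the table name `weather`, the column names, and the helper `column` are choices made for this illustration, and `windy` is stored as 0/1.

```python
import sqlite3

# Weather table from the slide: day, outlook, temperature, humidity, windy (0/1), play
ROWS = [
    (1, "sunny", 85, 85, 0, "no"),
    (2, "sunny", 80, 90, 1, "no"),
    (3, "overcast", 83, 86, 0, "yes"),
    (4, "rainy", 70, 96, 0, "yes"),
    (5, "rainy", 68, 80, 0, "yes"),
    (6, "rainy", 65, 70, 1, "no"),
    (7, "overcast", 64, 65, 1, "yes"),
    (8, "sunny", 72, 95, 0, "no"),
    (9, "sunny", 69, 70, 0, "yes"),
    (10, "rainy", 75, 80, 0, "yes"),
    (11, "sunny", 75, 70, 1, "yes"),
    (12, "overcast", 72, 90, 1, "yes"),
    (13, "overcast", 81, 75, 0, "yes"),
    (14, "rainy", 71, 91, 1, "no"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (day INTEGER, outlook TEXT, temperature INTEGER, "
    "humidity INTEGER, windy INTEGER, play TEXT)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)", ROWS)


def column(sql):
    """Run a single-column query and return its values as a list."""
    return [row[0] for row in conn.execute(sql)]


sunny_temps = column(
    "SELECT temperature FROM weather WHERE outlook = 'sunny' ORDER BY day"
)
low_humidity_days = column("SELECT day FROM weather WHERE humidity < 75 ORDER BY day")
warm_days = column("SELECT day FROM weather WHERE temperature > 70 ORDER BY day")
warm_and_dry_days = column(
    "SELECT day FROM weather WHERE temperature > 70 AND humidity < 75 ORDER BY day"
)

print(sunny_temps, low_humidity_days, warm_days, warm_and_dry_days)
```

Each query only retrieves facts that are already stored in the table, which is the sense in which the slides call the DBMS method deduction.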

Example of DBMS, OLAP and Data Mining: Weather Data
OLAP:
• Using OLAP we can create a Multidimensional Model of our data (Data
Cube).
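• For example, using the dimensions time, outlook and play we can create the following model. Each cell gives the number of days with play = yes / play = no; the top-left cell (9/5) is the total over all 14 days.

9/5      sunny  rainy  overcast
Week 1   0/2    2/1    2/0
Week 2   2/1    1/1    2/0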
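
The cube above can be rebuilt from the weather table by counting. This is a minimal sketch, assuming that Week 1 covers days 1 to 7 and Week 2 covers days 8 to 14 (an assumption that matches the cell values); the names `week_of` and `roll_up` are invented for this illustration and are not part of any OLAP product.

```python
from collections import defaultdict

# (day, outlook, play) taken from the weather table
DAYS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]


def week_of(day):
    """Map a day number to its week (assumes 7-day weeks starting at day 1)."""
    return (day - 1) // 7 + 1


# Base cuboid: (week, outlook) -> [count of play = yes, count of play = no]
cube = defaultdict(lambda: [0, 0])
for day, outlook, play in DAYS:
    cube[(week_of(day), outlook)][0 if play == "yes" else 1] += 1


def roll_up(cells, keep):
    """Aggregate away every dimension whose position is not in `keep`
    (position 0 = week, position 1 = outlook)."""
    result = defaultdict(lambda: [0, 0])
    for key, (yes, no) in cells.items():
        reduced = tuple(key[i] for i in keep)
        result[reduced][0] += yes
        result[reduced][1] += no
    return dict(result)


by_outlook = roll_up(cube, keep=(1,))  # roll up over time
total = roll_up(cube, keep=())[()]     # roll up over everything

print(dict(cube))
print(by_outlook)
print(total)
```

Rolling up over time gives the per-outlook totals, and rolling up over both dimensions gives the 9/5 corner cell; drill-down is the reverse step, from `by_outlook` back to the per-week cells.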

Example of DBMS, OLAP and Data Mining: Weather Data

Data Mining:
• Using the ID3 algorithm we can produce the following
decision tree:

• outlook = sunny
• humidity = high: no
• humidity = normal: yes
• outlook = overcast: yes
• outlook = rainy
• windy = true: no
• windy = false: yes
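
The sketch below is not a full ID3 implementation. It computes the information gain that ID3 uses to choose the root attribute, then hand-encodes the tree shown above and checks it against the 14 rows. Two assumptions are made here and are not stated on the slide: humidity above 75 counts as "high", and temperature is left out because ID3 works on nominal attributes.

```python
from collections import Counter
from math import log2

# (outlook, humidity, windy, play) taken from the weather table
ROWS = [
    ("sunny", 85, False, "no"), ("sunny", 90, True, "no"),
    ("overcast", 86, False, "yes"), ("rainy", 96, False, "yes"),
    ("rainy", 80, False, "yes"), ("rainy", 70, True, "no"),
    ("overcast", 65, True, "yes"), ("sunny", 95, False, "no"),
    ("sunny", 70, False, "yes"), ("rainy", 80, False, "yes"),
    ("sunny", 70, True, "yes"), ("overcast", 90, True, "yes"),
    ("overcast", 75, False, "yes"), ("rainy", 91, True, "no"),
]

HUMIDITY_THRESHOLD = 75  # assumption: humidity > 75 is treated as "high"


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute):
    """Entropy reduction from splitting `rows` on `attribute`,
    a function that maps a row to a nominal value."""
    groups = {}
    for row in rows:
        groups.setdefault(attribute(row), []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {
    "outlook": information_gain(ROWS, lambda r: r[0]),
    "humidity": information_gain(
        ROWS, lambda r: "high" if r[1] > HUMIDITY_THRESHOLD else "normal"
    ),
    "windy": information_gain(ROWS, lambda r: r[2]),
}


def classify(outlook, humidity, windy):
    """The decision tree from the slide, written as nested conditions."""
    if outlook == "sunny":
        return "no" if humidity > HUMIDITY_THRESHOLD else "yes"
    if outlook == "overcast":
        return "yes"
    return "no" if windy else "yes"  # outlook == "rainy"


root = max(gains, key=gains.get)
correct = sum(classify(o, h, w) == play for o, h, w, play in ROWS)
print(gains, root, correct)
```

With these assumptions, outlook has the largest gain, which is consistent with it being the root of the tree, and the encoded tree reproduces the play label for every row in the table.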

Major Issues in Data Warehousing and Mining
• Mining methodology and user interaction
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad-hoc data mining
• Expression and visualization of data mining results
• Handling noise and incomplete data
• Pattern evaluation: the interestingness problem
• Performance and scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining methods
Major Issues in Data Warehousing and Mining
• Issues relating to the diversity of data types
• Handling relational and complex types of data
• Mining information from heterogeneous databases and global information
systems (WWW)
• Issues related to applications and social impacts
• Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
• Integration of the discovered knowledge with existing knowledge: A knowledge
fusion problem
• Protection of data security, integrity, and privacy

Data Preprocessing

• Why preprocess the data?


• Types of Data Sets
• Data Quality
• Steps in Data Preprocessing



Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
• Required for both OLAP and Data Mining!

Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information for
sales transaction data)
• Data were not considered important at the time of transactions, so they
were not recorded!
• Data not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data

Why can Data be Noisy/Inconsistent?
• Faulty instruments for data collection
• Human or computer errors
• Errors in data transmission
• Technology limitations (e.g., sensor data come at a faster rate
than they can be processed)
• Inconsistencies in naming conventions or data codes (e.g.,
2/5/2018 could be 2 May 2018 or 5 Feb 2018)
• Duplicate tuples (e.g., records that were received twice), which should also be removed
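
Two of the points above, the ambiguous date and the duplicate tuples, can be illustrated with the standard library. This is a minimal sketch; the `records` list is hypothetical data invented for the example.

```python
from datetime import datetime

raw = "2/5/2018"

# The same string parsed under two different naming conventions.
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # day/month/year
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # month/day/year

print(as_day_first.isoformat())
print(as_month_first.isoformat())

# Hypothetical transaction tuples in which one record was received twice.
records = [
    (101, "2018-05-02", 250),
    (102, "2018-05-02", 120),
    (101, "2018-05-02", 250),  # exact duplicate of the first tuple
]

# dict.fromkeys keeps the first occurrence of each tuple and preserves order.
deduplicated = list(dict.fromkeys(records))
print(deduplicated)
```

Nothing in the string itself says which convention is right, so the format has to come from knowledge about the source system. The duplicate check here only catches exact copies; near-duplicates need a separate matching rule.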

What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute

• Distinction between attributes and attribute values


• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters

• Different attributes can be mapped to the same set of values


• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
• ID has no limit but age has a maximum and minimum value

Measurement of Length
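• The way you measure an attribute may not match the attribute's properties.
[Figure: line segments of different lengths measured on two scales, labeled A and B; one scale assigns the values 5, 7, 8, 10, 15 and the other assigns the values 1, 2, 3, 4, 5.]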

Properties of Attribute Values
• The type of an attribute depends on which of the following
properties it possesses:
• Distinctness: = ≠
• Order: < >
• Addition: + -
• Multiplication: * /

• Nominal attribute: distinctness


• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties

Types of Attributes
• There are different types of attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• Examples: temperature in Kelvin, length, time, counts

Attribute Type: Description, Examples, Operations

Nominal
• Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
• Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
• Operations: mode, entropy, contingency correlation, χ² test

Ordinal
• Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
• Examples: hardness of minerals, {good, better, best}, grades, street numbers
• Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
• Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
• Examples: calendar dates, temperature in Celsius or Fahrenheit
• Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
• Description: For ratio variables, both differences and ratios are meaningful. (*, /)
• Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
• Operations: geometric mean, harmonic mean, percent variation
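
As a rough illustration of the Operations entries, the sketch below applies one summary statistic per attribute type using Python's `statistics` module. All sample values are hypothetical, and the `RANK` mapping is just one possible encoding of the order good < better < best.

```python
import statistics

# Hypothetical sample values, invented for illustration only.
eye_color = ["brown", "blue", "brown", "green", "brown"]  # nominal
quality = ["good", "best", "better", "good", "better"]    # ordinal
temperature_c = [18.0, 21.0, 24.0]                        # interval
length_m = [1.0, 4.0, 16.0]                               # ratio

# Nominal: only equality is meaningful, so the mode is a valid summary.
nominal_mode = statistics.mode(eye_color)

# Ordinal: order is meaningful, so the median is valid.
# median_low always returns a value that occurs in the data.
RANK = {"good": 1, "better": 2, "best": 3}
LABEL = {rank: label for label, rank in RANK.items()}
ordinal_median = LABEL[statistics.median_low([RANK[q] for q in quality])]

# Interval: differences are meaningful, so the arithmetic mean is valid.
interval_mean = statistics.mean(temperature_c)

# Ratio: ratios are meaningful, so the geometric mean is valid.
ratio_geometric_mean = statistics.geometric_mean(length_m)

print(nominal_mode, ordinal_median, interval_mean, ratio_geometric_mean)
```

Each level also permits the operations of the levels before it: a mode or a median can be computed for `length_m`, but a geometric mean of eye colors has no meaning.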

Attribute Level: Transformation, Comments

Nominal
• Transformation: Any permutation of values
• Comments: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
• Transformation: An order preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
• Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
• Transformation: new_value = a * old_value + b, where a and b are constants
• Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
• Transformation: new_value = a * old_value
• Comments: Length can be measured in meters or feet.
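
The permitted transformations can be checked numerically. The sketch below uses the {1, 2, 3} to {0.5, 1, 10} mapping from the table, the standard Celsius-to-Fahrenheit formula, and the approximate factor 3.28084 feet per meter; the sample values themselves are hypothetical.

```python
from math import isclose

# Ordinal: a monotonic map preserves order (good < better < best).
old_codes = [1, 2, 3]
new_codes = [{1: 0.5, 2: 1, 3: 10}[v] for v in old_codes]
order_preserved = sorted(new_codes) == new_codes


# Interval: new_value = a * old_value + b (Celsius to Fahrenheit).
def to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


c = [10.0, 20.0, 40.0]
f = [to_fahrenheit(x) for x in c]

# Ratios of differences survive the transformation ...
diff_ratio_c = (c[2] - c[1]) / (c[1] - c[0])
diff_ratio_f = (f[2] - f[1]) / (f[1] - f[0])
# ... but ratios of the values themselves do not.
value_ratio_c = c[1] / c[0]
value_ratio_f = f[1] / f[0]

# Ratio: new_value = a * old_value (meters to feet, approximate factor).
FEET_PER_METER = 3.28084
meters = [2.0, 6.0]
feet = [m * FEET_PER_METER for m in meters]
length_ratio_m = meters[1] / meters[0]
length_ratio_ft = feet[1] / feet[0]

print(order_preserved)
print(diff_ratio_c, diff_ratio_f)
print(value_ratio_c, value_ratio_f)
print(length_ratio_m, length_ratio_ft)
```

Here 20 °C is not "twice as warm" as 10 °C, since the same two temperatures are 68 °F and 50 °F, whereas a 6 m length is three times a 2 m length in any unit.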
Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite
number of digits.
• Continuous attributes are typically represented as floating-point variables.

Important Characteristics of Structured Data

• Dimensionality
• Curse of Dimensionality

• Sparsity
• Only presence counts

• Resolution
• Patterns depend on the scale

