
UNIT -1

DATA MINING

1
Why Data Mining?

 The Explosive Growth of Data: from terabytes to petabytes


– Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
2
Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science
– Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
 1950s-1990s, computational science
– Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
– Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
 1990-now, data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally
accessible
– Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!

3
Evolution of Database
Technology
 1960s:
– Data collection, database creation, IMS and network DBMS
 1970s:
– Relational data model, relational DBMS implementation
 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
– Data mining, data warehousing, multimedia databases, and Web
databases
 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems 4
Large-scale Data is
Everywhere!
 There has been enormous data growth in both commercial and
scientific databases due to advances in data generation and
collection technologies
 New mantra
 Gather whatever data you can whenever and wherever
possible.
 Expectations
 Gathered data will have value either for the purpose
collected or for a purpose not envisioned.

(Figures: E-Commerce, Cyber Security, Traffic Patterns, Social
Networking: Twitter, Sensor Networks, Computational Simulation)
5
Why Data Mining? Commercial
Viewpoint

 Lots of data is being collected


and warehoused
– Web data
 Yahoo has petabytes of web data
 Facebook has billions of active users
– purchases at department/
grocery stores, e-commerce
 Amazon handles millions of visits/day
– Bank/Credit Card transactions
 Computers have become cheaper and more powerful
 Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)

6
Why Data Mining? Scientific Viewpoint

 Data collected and stored at enormous speeds
– remote sensors on a satellite
 NASA EOSDIS archives over petabytes of earth science data / year
– telescopes scanning the skies
 Sky survey data
– High-throughput biological data
– scientific simulations
 terabytes of data generated in a few hours
 Data mining helps scientists
– in automated analysis of massive datasets
– in hypothesis formation

(Figures: Sky Survey Data, fMRI Data from Brain, Gene Expression
Data, Surface Temperature of Earth)
7
Great opportunities to improve productivity in all walks
of life

8
Great Opportunities to Solve Society’s Major
Problems

 Improving health care and reducing costs
 Predicting the impact of climate change
 Reducing hunger and poverty by increasing agriculture production
 Finding alternative/green energy sources
9
What is Data Mining?
 Many Definitions
– Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns

10
What Is Data Mining?

 Data mining (knowledge discovery from data)


– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
– Data mining: a misnomer?
 Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
11
Knowledge Discovery (KDD) Process

 This is a view from typical
database systems and data
warehousing communities
 Data mining plays an essential
role in the knowledge discovery
process

(Figure: Databases → Data Integration → Data Cleaning → Data
Warehouse → Selection → Task-relevant Data → Data Mining →
Pattern Evaluation)
12
Example: A Web Mining
Framework

 Web mining usually involves


– Data cleaning
– Data integration from multiple sources
– Warehousing the data
– Data cube construction
– Data selection for data mining
– Data mining
– Presentation of the mining results
– Patterns and knowledge to be used or stored
into knowledge-base 13
What is (not) Data Mining?

 What is not Data Mining?
– Look up phone number in a phone directory
– Query a Web search engine for information about “Amazon”

 What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by a search engine
according to their context (e.g., Amazon rainforest, Amazon.com)
14
Data Mining in Business Intelligence

Increasing potential to support business decisions (from bottom to top):

 Data Sources: Paper, Files, Web documents, Scientific experiments,
Database Systems
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Exploration: Statistical Summary, Querying, and Reporting
(Data Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Presentation: Visualization Techniques (Business Analyst)
 Decision Making (End User)
15
KDD Process: A Typical View from ML
and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

 Data Pre-Processing: data integration, normalization, feature
selection, dimension reduction, …
 Data Mining: pattern discovery, association & correlation,
classification, clustering, outlier analysis, …
 Post-Processing: pattern evaluation, pattern selection, pattern
interpretation, pattern visualization

 This is a view from typical machine learning and statistics communities


16
Multi-Dimensional View of Data
Mining
 Data to be mined
– Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional data,
stream, spatiotemporal, time-series, sequence, text and web, multi-
media, graphs & social and information networks
 Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
 Techniques utilized
– Data-intensive, data warehouse (OLAP), machine learning,
statistics, pattern recognition, visualization, high-performance, etc.
 Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc. 17
Data Mining: On What Kinds of
Data?
 Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
18
Origins of Data Mining
 Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems

 Traditional techniques may be unsuitable due to data that is


– Large-scale
– High dimensional
– Heterogeneous
– Complex
– Distributed

 A key component of the emerging field of data science and data-


driven discovery

19
Data Mining: Confluence of Multiple
Disciplines

Data mining lies at the confluence of: Machine Learning, Pattern
Recognition, Statistics, Visualization, Applications, Algorithms,
Database Technology, and High-Performance Computing
20
Why Confluence of Multiple
Disciplines?
 Tremendous amount of data
– Algorithms must be highly scalable to handle terabytes of data
 High-dimensionality of data
– Microarray data may have tens of thousands of dimensions
 High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
 New and sophisticated applications
21
Data Mining Tasks

 Prediction Methods
– Use some variables to predict unknown or
future values of other variables.

 Description Methods
– Find human-interpretable patterns that
describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

22
Data Mining Tasks...

 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]

23
Data Mining Models and Tasks

24
Data Mining: Classification Schemes

 General functionality

– Descriptive data mining


– Predictive data mining
 Different views lead to different classifications
– Data view: Kinds of data to be mined
– Knowledge view: Kinds of knowledge to be
discovered
– Method view: Kinds of techniques utilized
– Application view: Kinds of applications
adapted 25
Data Mining Tasks …

(Figure: the data mining tasks, Clustering, Predictive Modeling,
Association Rules, and Anomaly Detection, illustrated on the sample
data table below)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single     125K   No
2    No      Married    100K   No
3    No      Single     70K    No
4    Yes     Married    120K   No
5    No      Divorced   95K    Yes
6    No      Married    60K    No
7    Yes     Divorced   220K   No
8    No      Single     85K    Yes
9    No      Married    75K    No
10   No      Single     90K    Yes
11   No      Married    60K    No
12   Yes     Divorced   220K   No
13   No      Single     85K    Yes
14   No      Married    75K    No
15   No      Single     90K    Yes
26
Predictive Modeling: Classification
 Find a model for the class attribute as a function of
the values of other attributes

Training data:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes   Graduate      5    Yes
2    Yes   High School   2    No
3    No    Undergrad     1    No
4    Yes   High School   10   Yes
…    …     …             …    …

Model for predicting credit worthiness (decision tree):
Employed?
  No  → No
  Yes → Level of Education?
          Graduate → # years at present address: > 3 yr → Yes, < 3 yr → No
          { High school, Undergrad } → # years: > 7 yrs → Yes, < 7 yrs → No
27
Classification Example

(Attribute types: categorical, categorical, quantitative, class)

Training Set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes   Graduate      5    Yes
2    Yes   High School   2    No
3    No    Undergrad     1    No
4    Yes   High School   10   Yes
…    …     …             …    …

Test Set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes   Undergrad     7    ?
2    No    Graduate      3    ?
3    Yes   High School   2    ?
…    …     …             …    …

Training Set → Learn Classifier → Model (then applied to the Test Set)
28
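As an illustration of the workflow above (not from the slides), here is a minimal sketch of learning such a classifier in Python, assuming pandas and scikit-learn; the column names and toy records mirror the training and test tables.

# Hypothetical sketch: learn a decision tree for credit worthiness.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Employed":  ["Yes", "Yes", "No", "Yes"],
    "Education": ["Graduate", "High School", "Undergrad", "High School"],
    "YearsAtAddress": [5, 2, 1, 10],
    "CreditWorthy": ["Yes", "No", "No", "Yes"],
})
test = pd.DataFrame({
    "Employed":  ["Yes", "No", "Yes"],
    "Education": ["Undergrad", "Graduate", "High School"],
    "YearsAtAddress": [7, 3, 2],
})

# One-hot encode the categorical attributes so the tree can split on them.
X_train = pd.get_dummies(train.drop(columns="CreditWorthy"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier().fit(X_train, train["CreditWorthy"])
print(model.predict(X_test))   # predicted class label for each test record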
Examples of Classification Task

 Classifying credit card transactions


as legitimate or fraudulent

 Classifying land covers (water bodies, urban areas,


forests, etc.) using satellite data

 Categorizing news stories as finance,


weather, entertainment, sports, etc

 Identifying intruders in the cyberspace

 Predicting tumor cells as benign or malignant

 Classifying secondary structures of protein


as alpha-helix, beta-sheet, or random coil

29
Classification: Application 1

 Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
 Use credit card transactions and the information
on its account-holder as attributes.
– When does a customer buy, what does he buy, how
often he pays on time, etc
 Label past transactions as fraud or fair
transactions. This forms the class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit
card transactions on an account.
30
Classification: Application 2

 Churn prediction for telephone customers


– Goal: To predict whether a customer is likely
to be lost to a competitor.
– Approach:
 Use detailed record of transactions with each of the
past and present customers, to find attributes.
– How often the customer calls, where he calls, what time-
of-the day he calls most, his financial status, marital
status, etc.
 Label the customers as loyal or disloyal.
 Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

31
Classification: Application 3
 Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
 Segment the image.
 Measure image attributes (features) - 40 of them per

object.
 Model the class based on these features.

 Success Story: Could find 16 new high red-shift
quasars, some of the farthest objects that are difficult
to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
32
Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
• Stages of Formation: Early, Intermediate, Late

Attributes:
• Image features
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB

33
Regression

 Predict a value of a given continuous valued variable


based on the values of other variables, assuming a
linear or nonlinear model of dependency.
 Extensively studied in statistics, neural network fields.
 Examples:
– Predicting sales amounts of a new product based on
advertising expenditure.
– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.

34
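A small illustrative sketch of the first example, assuming NumPy and made-up figures: fitting a linear model of sales as a function of advertising expenditure.

# Hypothetical sketch: least-squares fit of sales vs. advertising spend.
import numpy as np

ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g., in $100k
sales    = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # e.g., in $M

# Fit sales ≈ slope * ad_spend + intercept (degree-1 polynomial).
slope, intercept = np.polyfit(ad_spend, sales, deg=1)
predicted = slope * 6.0 + intercept               # predict for a new budget
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")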
Clustering

 Finding groups of objects such that the objects in a


group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
(Figure: intra-cluster distances are minimized; inter-cluster
distances are maximized)

35
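A minimal sketch of clustering in this sense, assuming scikit-learn's KMeans and synthetic 2-D points: objects in the same cluster end up close together, objects in different clusters far apart.

# Hypothetical sketch: group 2-D points into 3 clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three blobs of points around different centers.
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                    for c in [(0, 0), (3, 3), (0, 3)]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)   # one center per discovered group
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points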
Applications of Cluster Analysis
 Understanding
– Custom profiling for targeted
marketing
– Group related documents for
browsing
– Group genes and proteins that
have similar functionality
– Group stocks with similar price
fluctuations
 Summarization
– Reduce the size of large data
sets
Courtesy: Michael Eisen

(Figure: Clusters for Raw SST and Raw NPP, plotted by latitude and
longitude. Use of K-means to partition Sea Surface Temperature (SST)
and Net Primary Production (NPP) into clusters (Land Cluster 1/2,
Sea Cluster 1/2, Ice or No NPP) that reflect the Northern and
Southern Hemispheres.)
36
Clustering: Application 1

 Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
 Collect different attributes of customers based on
their geographical and lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those
from different clusters.

37
Clustering: Application 2

 Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.

– Approach: To identify frequently occurring terms in


each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.

Enron email dataset

38
Association Rule Discovery:
Definition

 Given a set of records each of which contain


some number of items from a given collection
– Produce dependency rules which will predict
occurrence of an item based on occurrences of other
items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

39
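As a sketch of how the discovered rules can be scored, here is plain Python (illustrative only, using the transaction table above) computing support and confidence.

# Hypothetical sketch: support and confidence for association rules.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # ~0.67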
Association Analysis: Applications

 Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management

 Telecommunication alarm diagnosis


– Rules are used to find combination of alarms that
occur together frequently in the same time period

 Medical Informatics
– Rules are used to find combination of patient
symptoms and test results associated with certain
diseases
40
Association Analysis: Applications

 An Example Subspace Differential Coexpression Pattern from a lung
cancer dataset
– Three lung cancer datasets [Bhattacharjee et al. 2001],
[Stearman et al. 2005], [Su et al. 2007]
– Enriched with the TNF/NFκB signaling pathway, which is well
known to be related to lung cancer
– P-value: 1.4 × 10^-5 (6/10 overlap with the pathway)

[Fang et al., PSB 2010]
41
Deviation/Anomaly/Change Detection
 Detect significant deviations from
normal behavior
 Applications:
– Credit Card Fraud Detection
– Network Intrusion
Detection
– Identify anomalous behavior from
sensor networks for monitoring and
surveillance.
– Detecting changes in the global forest
cover.

42
Primitives that Define a Data Mining
Task

 Task-relevant data
– Database or data warehouse name
– Database tables or data warehouse cubes
– Condition for data selection
– Relevant attributes or dimensions
– Data grouping criteria
 Type of knowledge to be mined
– Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
 Background knowledge
 Pattern interestingness measurements
 Visualization/presentation of discovered patterns
43
Primitive #: Background Knowledge

 A typical kind of background knowledge: Concept hierarchies


 Schema hierarchy
– E.g., street < city < province_or_state < country
 Set-grouping hierarchy
– E.g., {20-39} = young, {40-59} = middle_aged
 Operation-derived hierarchy
– email address: [email protected]
login-name < department < university < country
 Rule-based hierarchy
– low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2)
< $50

44
Primitive #: Pattern Interestingness
Measure

 Simplicity
e.g., (association) rule length, (decision) tree size
 Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification
reliability or accuracy, certainty factor, rule strength, rule quality,
discriminating weight, etc.
 Utility
potential usefulness, e.g., support (association), noise threshold
(description)
 Novelty
not previously known, surprising (used to remove redundant
rules, e.g., Illinois vs. Champaign rule implication support ratio)

45
Primitive #: Presentation of Discovered
Patterns

 Different backgrounds/usages may require different forms of


representation
– E.g., rules, tables, crosstabs, pie/bar chart, etc.
 Concept hierarchy is also important
– Discovered knowledge might be more understandable when
represented at high level of abstraction
– Interactive drill up/down, pivoting, slicing and dicing provide
different perspectives to data
 Different kinds of knowledge require different representation:
association, classification, clustering, etc.

46
Motivating Challenges

 Scalability

 High Dimensionality

 Heterogeneous and Complex Data

 Data Ownership and Distribution

 Non-traditional Analysis

47
Major Issues in Data Mining

 Mining methodology and User interaction


– Mining different kinds of knowledge
 DM should cover a wide spectrum of data analysis and knowledge discovery tasks
 Enable users to use the database in different ways
 Require the development of numerous data mining techniques
– Interactive mining of knowledge at multiple levels of abstraction
 Difficult to know exactly what will be discovered
 Allow users to focus the search, refine data mining requests
– Incorporation of background knowledge
 Guide the discovery process
 Allow discovered patterns to be expressed in concise terms and different levels of
abstraction
– Data mining query languages and ad hoc data mining
 High-level query languages need to be developed
 Should be integrated with a DB/DW query language
48
Major Issues in Data Mining

– Presentation and visualization of results


 Knowledge should be easily understood and directly usable
 High level languages, visual representations or other expressive forms
 Require the DM system to adopt the above techniques
– Handling noisy or incomplete data
 Require data cleaning methods and data analysis methods that can handle noise
– Pattern evaluation – the interestingness problem
 How to develop techniques to assess the interestingness of discovered patterns,
especially with subjective measures based on user beliefs or expectations

49
Major Issues in Data Mining
 Performance Issues
– Efficiency and scalability
 Huge amount of data
 Running time must be predictable and acceptable
– Parallel, distributed and incremental mining algorithms
 Divide the data into partitions and process them in parallel
 Incorporate database updates without having to mine the entire data again from
scratch
 Diversity of Database Types
– Other databases that contain complex data objects, multimedia data,
spatial data, etc.
– Expect to have different DM systems for different kinds of data
– Heterogeneous databases and global information systems
 Web mining becomes a very challenging and fast-evolving field in data mining
50
Summary of Points

 Data mining: Discovering interesting patterns and knowledge from


massive amount of data
 A natural evolution of database technology, in great demand, with
wide applications
 A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
 Mining can be performed on a variety of data
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
 Major issues in data mining

51
KDD Process: Several Key
Steps
 Learning the application domain
– relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant
representation
 Choosing functions of data mining
– summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
52
Are All the “Discovered” Patterns
Interesting?
 Data mining may generate thousands of patterns: Not all of them are
interesting
– Suggested approach: Human-centered, query-based, focused mining
 Interestingness measures
– A pattern is interesting if it is easily understood by humans, valid on new
or test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
 Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
– Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
53
Find All and Only Interesting
Patterns?

 Find all the interesting patterns: Completeness


– Can a data mining system find all the interesting patterns? Do we
need to find all of the interesting patterns?
– Heuristic vs. exhaustive search
– Association vs. classification vs. clustering
 Search for only interesting patterns: An optimization problem
– Can a data mining system find only the interesting patterns?
– Approaches
 First generate all the patterns and then filter out the uninteresting
ones
 Generate only the interesting patterns—mining query optimization

54
Other Pattern Mining Issues

 Precise patterns vs. approximate patterns


– Association and correlation mining: possibly find sets of precise
patterns
 But approximate patterns can be more compact and sufficient
 How to find high quality approximate patterns??
– Gene sequence mining: approximate patterns are inherent
 How to derive efficient approximate pattern mining algorithms??
 Constrained vs. non-constrained patterns
– Why constraint-based mining?
– What are the possible kinds of constraints? How to push
constraints into the mining process?

55
A typical DM System Architecture

 Database, data warehouse, WWW or other information


repository (store data)
 Database or data warehouse server (fetch and
combine data)
 Knowledge base (turn data into meaningful groups
according to domain knowledge)
 Data mining engine (perform mining tasks)
 Pattern evaluation module (find interesting patterns)
 User interface (interact with the user)

56
Architecture: Typical Data Mining
System

(Figure, from top to bottom:)
Graphical User Interface
Pattern Evaluation (with Knowledge-Base)
Data Mining Engine (with Knowledge-Base)
Database or Data Warehouse Server
  (data cleaning, integration, and selection)
Database, Data Warehouse, World-Wide Web, Other Info Repositories
57
KDD Process: Several Key Steps
 Learning the application domain
– relevant prior knowledge and goals of application
 Identifying a target data set: data selection
 Data processing
– Data cleaning (remove noise and inconsistent data)
– Data integration (multiple data sources may be combined)
– Data selection (data relevant to the analysis task are retrieved from database)
– Data transformation (data transformed or consolidated into forms appropriate for
mining)
(Done with data preprocessing)
– Data mining (an essential process where intelligent methods are applied to extract
data patterns)
– Pattern evaluation (identify the truly interesting patterns)
– Knowledge presentation (mined knowledge is presented to the user with
visualization or representation techniques)
 Use of discovered knowledge 58
What is Data?

 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic,
dimension, or feature
 A collection of attributes describe an object
– Object is also known as record, point, case, sample,
entity, or instance

Example (objects are rows, attributes are columns):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single     125K   No
2    No      Married    100K   No
3    No      Single     70K    No
4    Yes     Married    120K   No
5    No      Divorced   95K    Yes
6    No      Married    60K    No
7    Yes     Divorced   220K   No
8    No      Single     85K    Yes
9    No      Married    75K    No
10   No      Single     90K    Yes
59
A More Complete View of Data

 Data may have parts

 The different parts of the data may have


relationships

 More generally, data may have structure

 Data can be incomplete

 We will discuss this in more detail later


60
Attribute Values

 Attribute values are numbers or symbols


assigned to an attribute for a particular object

 Distinction between attributes and attribute values


– Same attribute can be mapped to different
attribute values
 Example: height can be measured in feet or meters

– Different attributes can be mapped to the same


set of values
 Example: Attribute values for ID and age are integers
 But properties of attribute values can be different

61
Measurement of Length
 The way you measure an attribute may not match the
attribute's properties.
(Figure: five objects A-E measured on two scales. One scale maps
them to 5, 7, 8, 10, 15 and preserves only the ordering property of
length; the other maps them to 1, 2, 3, 4, 5 and preserves both the
ordering and additivity properties of length.)
62
Types of Attributes

 There are different types of attributes


– Nominal
 Examples: ID numbers, eye color, zip codes
– Ordinal
 Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height {tall, medium, short}
– Interval
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
 Examples: temperature in Kelvin, length, time, counts
63
Properties of Attribute Values

 The type of an attribute depends on which of the following


properties/operations it possesses:
– Distinctness: =, ≠
– Order: <, >
– Differences are meaningful: +, -
– Ratios are meaningful: *, /

– Nominal attribute: distinctness


– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful
differences
– Ratio attribute: all 4 properties/operations
64
Difference Between Ratio and
Interval

 Is it physically meaningful to say that a


temperature of 10 ° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale?
– the Kelvin scale?
 Consider measuring the height above average

– If Bill’s height is three inches above average


and Bob’s height is six inches above average,
then would we say that Bob is twice as tall as
Bill?
– Is this situation analogous to that of
temperature? 65
Attribute types, with descriptions, examples, and meaningful operations:

 Nominal (categorical/qualitative): attribute values only distinguish
(=, ≠). Examples: zip codes, employee ID numbers, eye color, sex:
{male, female}. Operations: mode, entropy, contingency correlation,
chi-squared test.
 Ordinal (categorical/qualitative): attribute values also order
objects (<, >). Examples: hardness of minerals, {good, better, best},
grades, street numbers. Operations: median, percentiles, rank
correlation, run tests, sign tests.
 Interval (numeric/quantitative): differences between values are
meaningful (+, -). Examples: calendar dates, temperature in Celsius
or Fahrenheit. Operations: mean, standard deviation, Pearson's
correlation, t and F tests.
 Ratio (numeric/quantitative): both differences and ratios are
meaningful (*, /). Examples: temperature in Kelvin, monetary
quantities, counts, age, mass, length, current. Operations: geometric
mean, harmonic mean, percent variation.

This categorization of attributes is due to S. S. Stevens 66


Permissible transformations for each attribute type:

 Nominal: any permutation of values. If all employee ID numbers were
reassigned, would it make any difference?
 Ordinal: an order-preserving change of values, i.e.,
new_value = f(old_value) where f is a monotonic function. An attribute
encompassing the notion of good, better, best can be represented
equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
 Interval: new_value = a * old_value + b, where a and b are constants.
Thus, the Fahrenheit and Celsius temperature scales differ in terms of
where their zero value is and the size of a unit (degree).
 Ratio: new_value = a * old_value. Length can be measured in meters
or feet.

This categorization of attributes is due to S. S. Stevens 67


Discrete and Continuous
Attributes

 Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
68
Asymmetric Attributes
 Only presence (a non-zero attribute value) is regarded as important
 Words present in documents
 Items present in customer transactions

 If we met a friend in the grocery store would we ever say the


following?
“I see our purchases are very similar since we didn’t buy most of the same
things.”

 We need two asymmetric binary attributes to represent one


ordinary binary attribute
– Association analysis uses asymmetric attributes

 Asymmetric attributes typically arise from objects that are sets

69
Why Data Preprocessing?

 Data in the real world is dirty


– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
 No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
– Required for both OLAP and Data Mining!

70
Why can Data be
Incomplete?

 Attributes of interest are not available (e.g., customer


information for sales transaction data)
 Data were not considered important at the time of
transactions, so they were not recorded!
 Data not recorded because of misunderstanding or
malfunctions
 Data may have been recorded and later deleted!
 Missing/unknown values for some data

71
Data Cleaning

 Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data

72
Data Quality

 Poor data quality negatively affects many data processing


efforts
“The most important point is that poor data quality is an unfolding
disaster.
– Poor data quality costs the typical company at least
ten percent (10%) of revenue; twenty percent
(20%) is probably a better estimate.”
Thomas C. Redman, DM Review, August 2004
 Data mining example: a classification model for detecting
people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
73
Data Quality …

 What kinds of data quality problems?


 How can we detect problems with the data?
 What can we do about these problems?

 Examples of data quality problems:


– Noise and outliers
– Missing values
– Duplicate data
– Wrong data

74
Noise
 For objects, noise is an extraneous object
 For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen

(Figures: Two Sine Waves; Two Sine Waves + Noise)


75
Outliers

 Outliers are data objects with characteristics that


are considerably different than most of the other
data objects in the data set
– Case 1: Outliers are
noise that interferes
with data analysis
– Case 2: Outliers are
the goal of our analysis
 Credit card fraud
 Intrusion detection

 Causes? 76
Missing Values
 Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and
weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children)
 Handling missing values
– Eliminate data objects or variables
– Estimate missing values
 Example: time series of temperature
 Example: census results

– Ignore the missing value during analysis 77
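A brief sketch of the handling options listed above, assuming pandas and a made-up table: eliminate objects with missing values, or estimate the missing values.

# Hypothetical sketch: dropping vs. estimating missing values with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":    [23, 45, np.nan, 31],
    "Income": [40_000, np.nan, 52_000, 61_000],
})

dropped = df.dropna()                            # eliminate objects with missing values
filled  = df.fillna(df.mean(numeric_only=True))  # estimate with the attribute mean
print(dropped)
print(filled)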


Missing Values …
 Missing completely at random (MCAR)
– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
 Missing at Random (MAR)
– Missingness is related to other variables
– Fill in values based on other values
– Almost always produces a bias in the analysis
 Missing Not at Random (MNAR)
– Missingness is related to unobserved measurements
– Informative or non-ignorable missingness
 Not possible to know the situation from the data
78
Duplicate Data

 Data set may include data objects that are duplicates,


or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

 Examples:
– Same person with multiple email addresses

 Data cleaning
– Process of dealing with duplicate data issues

 When should duplicate data not be removed?


79
Forms of data preprocessing

80
Data Preprocessing

 Aggregation
 Sampling
 Dimensionality Reduction
 Feature subset selection
 Feature creation
 Discretization and Binarization
 Attribute Transformation

81
Aggregation

 Combining two or more attributes (or objects) into


a single attribute (or object)

 Purpose
– Data reduction
 Reduce the number of attributes or objects
– Change of scale
 Cities aggregated into regions, states, countries, etc.
 Days aggregated into weeks, months, or years
– More “stable” data
 Aggregated data tends to have less variability
82
Example: Precipitation in Australia

 This example is based on precipitation in


Australia from the period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of
average monthly precipitation for 3,030 0.5◦
by 0.5◦ grid cells in Australia, and
– A histogram for the standard deviation of the
average yearly precipitation for the same
locations.
 The average yearly precipitation has less
variability than the average monthly precipitation.
 All precipitation measurements (and their standard deviations) are
in centimeters. 83
Example: Precipitation in Australia

Variation of Precipitation in Australia

Standard Deviation of Average Standard Deviation of


Monthly Precipitation Average Yearly Precipitation
84
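A small sketch of the effect illustrated by this example, assuming pandas and synthetic monthly precipitation values: aggregating months into years reduces variability.

# Hypothetical sketch: monthly vs. yearly aggregation of precipitation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("1982-01", "1993-12", freq="MS")   # monthly index
monthly = pd.Series(rng.gamma(shape=2.0, scale=30.0, size=len(idx)), index=idx)

yearly_mean = monthly.resample("YS").mean()             # aggregate months into years

print(monthly.std())       # variability of monthly precipitation
print(yearly_mean.std())   # smaller: averaging smooths month-to-month variation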
Sampling
 Sampling is the main technique employed for data
reduction.
– It is often used for both the preliminary
investigation of the data and the final data analysis.

 Statisticians often sample because obtaining the entire


set of data of interest is too expensive or time
consuming.

 Sampling is typically used in data mining because


processing the entire set of data of interest is too
expensive or time consuming.

85
Sampling …

 The key principle for effective sampling is the


following:

– Using a sample will work almost as well as


using the entire data set, if the sample is
representative

– A sample is representative if it has


approximately the same properties (of
interest) as the original set of data

86
Sample Size

8000 points 2000 Points 500 Points

87
Types of Sampling
 Simple Random Sampling
– There is an equal probability of selecting any
particular item
– Sampling without replacement
 As each item is selected, it is removed from the population
– Sampling with replacement
 Objects are not removed from the population as they are
selected for the sample.
 In sampling with replacement, the same object can be
picked up more than once
 Stratified sampling
– Split the data into several partitions; then draw
random samples from each partition
88
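A short sketch of the three schemes above, assuming pandas (the toy data frame and group column are illustrative only).

# Hypothetical sketch: simple random and stratified sampling.
import pandas as pd

df = pd.DataFrame({"group": list("AAAABBBBCCCC"), "value": range(12)})

without_repl = df.sample(n=4, replace=False, random_state=0)  # each row at most once
with_repl    = df.sample(n=4, replace=True,  random_state=0)  # rows may repeat

# Stratified: draw a random sample from each partition (group) separately.
stratified = df.groupby("group", group_keys=False).sample(n=2, random_state=0)
print(stratified)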
Sample Size
 What sample size is necessary to get at least one
object from each of 10 equal-sized groups?

89
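A brief sketch of one way to answer this question by simulation, assuming NumPy and equal-sized groups: estimate the probability that a sample of a given size contains at least one object from every group.

# Hypothetical sketch: P(sample of size s covers all 10 equal-sized groups).
import numpy as np

rng = np.random.default_rng(0)

def coverage_probability(sample_size, n_groups=10, trials=10_000):
    hits = 0
    for _ in range(trials):
        groups = rng.integers(0, n_groups, size=sample_size)   # group of each sampled object
        hits += len(np.unique(groups)) == n_groups              # did we see every group?
    return hits / trials

for s in (10, 20, 40, 60):
    print(s, coverage_probability(s))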
Curse of Dimensionality

 When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies

 Definitions of density and


distance between points,
which are critical for
clustering and outlier
detection, become less
meaningful • Randomly generate 500 points
• Compute difference between max and
min distance between any pair of points
90
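A sketch of the experiment described in the figure caption, assuming NumPy/SciPy and points drawn uniformly from the unit hypercube: the relative gap between the largest and smallest pairwise distances shrinks as dimensionality grows.

# Hypothetical sketch: relative spread of pairwise distances vs. dimensionality.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 50, 200):
    points = rng.random((500, dim))            # 500 random points in [0, 1]^dim
    d = pdist(points)                          # all pairwise Euclidean distances
    print(dim, (d.max() - d.min()) / d.min())  # shrinks as dim increases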
Dimensionality Reduction

 Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required
by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or
reduce noise

 Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
91
Dimensionality Reduction: PCA

 Goal is to find a projection that captures the


largest amount of variation in data
(Figure: data points in the x1-x2 plane with the direction of
largest variation indicated)
92
Dimensionality Reduction: PCA

93
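A minimal PCA sketch, assuming NumPy and synthetic correlated data; computing the projection from the SVD of the centered data matrix is one common way to do it.

# Hypothetical sketch: PCA via SVD, keeping the top principal component.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated 2-D data

Xc = X - X.mean(axis=0)                # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                            # direction capturing the most variation
projected = Xc @ pc1                   # 1-D representation of every object
print(pc1, projected[:5])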
Feature Subset Selection

 Another way to reduce dimensionality of data


 Redundant features
– Duplicate much or all of the information
contained in one or more other attributes
– Example: purchase price of a product and the
amount of sales tax paid
 Irrelevant features
– Contain no information that is useful for the
data mining task at hand
– Example: students' ID is often irrelevant to the
task of predicting students' GPA
 Many techniques developed, especially for
classification 94
Feature Creation

 Create new attributes that can capture the


important information in a data set much more
efficiently than the original attributes

 Three general methodologies:


– Feature extraction
 Example: extracting edges from images
– Feature construction
 Example: dividing mass by volume to get density
– Mapping data to new space
 Example: Fourier and wavelet analysis
95
Mapping Data to a New Space

 Fourier and wavelet transform

(Figures: Two Sine Waves + Noise in the time domain, and its
frequency-domain representation)
96
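A short sketch of the mapping, assuming NumPy and a made-up signal: a noisy sum of two sine waves becomes two clear peaks in the frequency domain.

# Hypothetical sketch: mapping a noisy time series to the frequency domain.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024, endpoint=False)
signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)   # two sine waves
noisy = signal + 0.5 * rng.normal(size=t.size)                     # plus noise

spectrum = np.abs(np.fft.rfft(noisy))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
print(freqs[np.argsort(spectrum)[-2:]])   # the two dominant frequencies (~7 and ~17 Hz)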
Discretization

 Discretization is the process of converting a


continuous attribute into an ordinal attribute
– A potentially infinite number of values are
mapped into a small number of categories
– Discretization is commonly used in
classification
– Many classification algorithms work best if both
the independent and dependent variables have
only a few values
– We give an illustration of the usefulness of
discretization using the Iris data set
97
Iris Sample Data Set

 Iris Plant data set.


– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Douglas Fisher
– Three flower types (classes):
 Setosa
 Versicolour
 Virginica
– Four (non-class) attributes
 Sepal width and length
 Petal width and length

(Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995.
Northeast wetland flora: Field office guide to plant species.
Northeast National Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.)
98
Discretization: Iris Example

Petal width low or petal length low implies Setosa.


Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
99
Discretization: Iris Example …

 How can we tell what the best discretization is?


– Unsupervised discretization: find breaks in the
data values
 Example:

(Figure: histogram of counts vs. petal length)
– Supervised discretization: Use class labels to


find breaks
100
Discretization Without Using Class
Labels

Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.

101
Discretization Without Using Class
Labels

Equal interval width approach used to obtain 4 values.

102
Discretization Without Using Class
Labels

Equal frequency approach used to obtain 4 values.

103
Discretization Without Using Class
Labels

K-means approach to obtain 4 values.

104
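A sketch of the two unsupervised approaches shown above, assuming NumPy/pandas and synthetic one-dimensional data: equal interval width and equal frequency discretization into 4 values.

# Hypothetical sketch: equal-width vs. equal-frequency discretization (4 bins).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Four groups of points plus two outliers, as in the figures above.
x = np.concatenate([rng.normal(c, 0.5, 100) for c in (0, 4, 8, 12)] + [[20, 22]])

equal_width = pd.cut(x, bins=4, labels=False)   # bins of equal interval width
equal_freq  = pd.qcut(x, q=4, labels=False)     # bins with (roughly) equal counts

print(np.bincount(equal_width))   # outliers push most points into a few width bins
print(np.bincount(equal_freq))    # each frequency bin holds about a quarter of the data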
Binarization

 Binarization maps a continuous or categorical


attribute into one or more binary variables

 Typically used for association analysis

 Often convert a continuous attribute to a


categorical attribute and then convert a
categorical attribute to a set of binary attributes
– Association analysis needs asymmetric binary
attributes
– Examples: eye color and height measured as
{low, medium, high}
105
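A short sketch of the conversion described above, assuming pandas: a categorical attribute mapped to a set of binary attributes.

# Hypothetical sketch: binarizing a categorical attribute.
import pandas as pd

df = pd.DataFrame({"height": ["low", "medium", "high", "medium"]})

# One binary (0/1) attribute per category value.
binary = pd.get_dummies(df["height"], prefix="height", dtype=int)
print(binary)   # columns height_high, height_low, height_medium with 0/1 values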
Attribute Transformation
 An attribute transform is a function that maps the
entire set of values of a given attribute to a new set of
replacement values such that each old value can be
identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Normalization
 Refers to various techniques to adjust to differences
among attributes in terms of frequency of occurrence,
mean, variance, range
 Take out unwanted, common signal, e.g., seasonality

– In statistics, standardization refers to subtracting


off the means and dividing by the standard
deviation
106
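A brief sketch of standardization as mentioned above, assuming NumPy and made-up values: subtract the mean and divide by the standard deviation.

# Hypothetical sketch: z-score standardization of an attribute.
import numpy as np

x = np.array([12.0, 15.0, 9.0, 22.0, 18.0])
z = (x - x.mean()) / x.std()      # zero mean, unit standard deviation
print(z.mean(), z.std())          # ~0.0 and 1.0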
Example: Sample Time Series of Plant
Growth
Minneapolis

Net Primary
Production (NPP)
is a measure of
plant growth used
by ecosystem
scientists.

Correlations between time series:
Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.7591 -0.7581
Atlanta 0.7591 1.0000 -0.5739
Sao Paolo -0.7581 -0.5739 1.0000
107
Seasonality Accounts for Much
Correlation
Minneapolis
Normalized using
monthly Z Score:
Subtract off monthly
mean and divide by
monthly standard
deviation

Correlations between time series:
Minneapolis Atlanta Sao Paolo
Minneapolis 1.0000 0.0492 0.0906
Atlanta 0.0492 1.0000 -0.0154
Sao Paolo 0.0906 -0.0154 1.0000
108
Similarity and Dissimilarity
Measures
 Similarity measure
– Numerical measure of how alike two data
objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
 Dissimilarity measure

– Numerical measure of how different two data


objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
 Proximity refers to a similarity or dissimilarity
109
Similarity/Dissimilarity for Simple
Attributes

The following table shows the similarity and dissimilarity


between two objects, x and y, with respect to a single, simple
attribute.

110
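The table itself is an image in the slides; as an illustration only (standard textbook-style definitions, assumed rather than copied from the slide), proximity for a single simple attribute can be sketched as follows.

# Hypothetical sketch: proximity between two values of one simple attribute.
def nominal_sim(x, y):
    return 1.0 if x == y else 0.0             # dissimilarity is 1 - similarity

def ordinal_dissim(x, y, n_levels):
    # Map levels to 0..n-1 and normalize the difference to [0, 1].
    return abs(x - y) / (n_levels - 1)

def interval_dissim(x, y):
    return abs(x - y)                          # e.g., s = 1 / (1 + d) as a similarity

print(nominal_sim("red", "blue"))              # 0.0
print(ordinal_dissim(0, 2, n_levels=3))        # 1.0 for {good=0, better=1, best=2}
print(interval_dissim(10.0, 7.5))              # 2.5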
