5
Why Data Mining?
❑ The Explosive Growth of Data: from terabytes to petabytes
❑ Data collection and data availability
❑ Automated data collection tools, database systems, Web, computerized
society
❑ Major sources of abundant data
❑ Business: Web, e-commerce, transactions, stocks, …
❑ Science: Remote sensing, bioinformatics, scientific simulation, …
❑ Society and everyone: news, digital cameras, YouTube
❑ We are drowning in data, but starving for knowledge!
❑ “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
6
What Is Data Mining?
❑ Data mining (knowledge discovery from data)
❑ Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
❑ Data mining: a misnomer?
❑ Alternative names
❑ Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
❑ Watch out: Is everything “data mining”?
❑ Simple search and query processing
❑ (Deductive) expert systems
7
Knowledge Discovery (KDD) Process
❑ This is a view from typical database systems and data warehousing communities
[Figure: the KDD process pipeline (Databases → Data Cleaning → Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation)]
8
Example: A Web Mining Framework
❑ Web mining usually involves
❑ Data cleaning
❑ Data integration from multiple sources
❑ Warehousing the data
❑ Data cube construction
❑ Data selection for data mining
❑ Data mining
❑ Presentation of the mining results
❑ Patterns and knowledge to be used or stored in a knowledge base
9
Data Mining in Business Intelligence
[Figure: BI pyramid showing increasing potential to support business decisions, from data exploration (statistical summary, querying, and reporting) up to decision making by the end user]
11
Data Mining vs. Data Exploration
❑ Which view do you prefer?
❑ KDD vs. ML/Stat. vs. Business Intelligence
❑ Depending on the data, applications, and your focus
12
Multi-Dimensional View of Data Mining
❑ Data to be mined
❑ Database data (extended-relational, object-oriented, heterogeneous), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs & social and information networks
❑ Knowledge to be mined (or: Data mining functions)
❑ Characterization, discrimination, association, classification, clustering, trend/deviation,
outlier analysis, …
❑ Descriptive vs. predictive data mining
❑ Multiple/integrated functions and mining at multiple levels
❑ Techniques utilized
❑ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
❑ Applications adapted
❑ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
13
Data Mining: On What Kinds of Data?
❑ Database-oriented data sets and applications
❑ Relational database, data warehouse, transactional database
❑ Object-relational databases, Heterogeneous databases and legacy databases
❑ Advanced data sets and advanced applications
❑ Data streams and sensor data
❑ Time-series data, temporal data, sequence data (incl. bio-sequences)
❑ Structured data, graphs, social networks, and information networks
❑ Spatial data and spatiotemporal data
❑ Multimedia database
❑ Text databases
❑ The World-Wide Web
14
Data Mining Functions: (1) Generalization
❑ Information integration and data warehouse construction
❑ Data cleaning, transformation, integration, and
multidimensional data model
❑ Data cube technology
❑ Scalable methods for computing (i.e., materializing)
multidimensional aggregates
❑ OLAP (online analytical processing)
❑ Multidimensional concept description: Characterization
and discrimination
❑ Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
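To make “materializing multidimensional aggregates” concrete, here is a minimal sketch (not from the slides) that computes every group-by of a tiny, made-up sales table using only the Python standard library, i.e., a full data cube with 2^3 cuboids; real OLAP engines do this at scale with far smarter algorithms.

```python
from collections import defaultdict
from itertools import combinations

# Toy fact table (made-up data): (region, year, product, sales amount)
rows = [
    ("East", 2023, "TV", 100), ("East", 2024, "PC", 150),
    ("West", 2023, "TV", 200), ("West", 2024, "TV", 120),
]
dims = ("region", "year", "product")

# Materialize every group-by (cuboid); "*" stands for "all values" (ALL)
cube = defaultdict(float)
for *vals, measure in rows:
    for k in range(len(dims) + 1):
        for kept in combinations(range(len(dims)), k):
            key = tuple(vals[i] if i in kept else "*" for i in range(len(dims)))
            cube[key] += measure

print(cube[("East", "*", "*")])  # total sales in the East: 250.0
print(cube[("*", "*", "*")])     # grand total: 570.0
```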
15
Data Mining Functions: (2) Pattern Discovery
❑ Frequent patterns (or frequent itemsets)
❑ What items are frequently purchased together in your Walmart?
❑ Association and Correlation Analysis
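As a minimal illustration (not from the slides) of what “frequent itemset” means, the sketch below brute-force counts the support of all 1- and 2-itemsets over the small transaction table shown later in these slides; the min_support threshold of 3 is an arbitrary choice, and real miners such as Apriori or FP-growth prune this search instead of enumerating everything.

```python
from collections import Counter
from itertools import combinations

# The five market baskets from the transaction-data example later in the deck
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
min_support = 3  # an itemset must appear in at least 3 baskets

# Brute force: count every itemset of size 1 and 2 (enough for a demo)
counts = Counter()
for basket in transactions:
    for size in (1, 2):
        counts.update(combinations(sorted(basket), size))

frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
print(frequent)  # e.g., ('Diaper', 'Milk') is bought together in 3 baskets
```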
18
Data Mining Functions: (5) Outlier Analysis
❑ Outlier analysis
❑ Outlier: A data object that does not comply with the
general behavior of the data
❑ Noise or exception?―One person’s garbage could be
another person’s treasure
❑ Methods: by-product of clustering or regression analysis, …
❑ Useful in fraud detection, rare events analysis
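As one minimal sketch (not from the slides) of outlier analysis, the snippet below flags values that lie far from the mean of a made-up measurement series; the 2.5-standard-deviation cutoff is arbitrary, and the clustering- or regression-based methods mentioned above follow the same idea of flagging points that deviate from the general behavior of the data.

```python
# Flag values far from the mean of a made-up measurement series
values = [98, 102, 99, 101, 97, 103, 100, 250]  # 250 looks suspicious

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

# Simple z-score rule; with tiny samples a single outlier inflates std,
# so the cutoff here is deliberately modest (2.5)
outliers = [v for v in values if abs(v - mean) / std > 2.5]
print(outliers)  # [250]
```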
19
Data Mining Functions: (6) Time and Ordering:
Sequential Pattern, Trend and Evolution Analysis
❑ Sequence, trend and evolution analysis
❑ Trend, time-series, and deviation analysis
❑ e.g., regression and value prediction
❑ Sequential pattern mining
❑ e.g., buy digital camera, then buy large memory cards
❑ Periodicity analysis
❑ Motifs and biological sequence analysis
❑ Approximate and consecutive motifs
❑ Similarity-based analysis
❑ Mining data streams
❑ Ordered, time-varying, potentially infinite, data streams
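A tiny sketch (not from the slides) of the sequential-pattern example above: it counts how many made-up customer purchase sequences contain “digital camera” followed later by “memory card”; sequential pattern miners generalize this to discovering such patterns automatically.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in that order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)

# Made-up purchase histories, oldest purchase first
sequences = [
    ["phone", "digital camera", "tripod", "memory card"],
    ["digital camera", "memory card"],
    ["memory card", "digital camera"],   # wrong order: does not match
    ["laptop", "mouse"],
]
pattern = ["digital camera", "memory card"]

support = sum(contains_in_order(s, pattern) for s in sequences)
print(f"{support} of {len(sequences)} customers bought a camera, then a memory card")
```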
20
Data Mining Functions: (7) Structure and
Network Analysis
❑ Graph mining
❑ Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
❑ Information network analysis
❑ Social networks: actors (objects, nodes) and relationships (edges)
❑ e.g., author networks in CS, terrorist networks
❑ Multiple heterogeneous networks
❑ A person could be in multiple information networks: friends, family, classmates, …
❑ Links carry a lot of semantic information: Link mining
❑ Web mining
❑ Web is a big information network: from PageRank to Google
❑ Analysis of Web information networks
❑ Web community discovery, opinion mining, usage mining, …
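Since the slide mentions “from PageRank to Google”, here is a minimal power-iteration sketch (not from the slides) of PageRank on a tiny made-up link graph; the damping factor 0.85 is the conventional choice, and dangling pages and convergence checks are ignored to keep it short.

```python
# Tiny made-up web graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d = 0.85                                   # damping factor
rank = {p: 1 / len(pages) for p in pages}  # start from a uniform distribution

for _ in range(50):                        # a fixed number of power iterations
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outlinks in links.items():
        for q in outlinks:
            new[q] += d * rank[p] / len(outlinks)
    rank = new

print({p: round(rank[p], 3) for p in sorted(rank)})  # C ranks highest
```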
21
Evaluation of Knowledge
❑ Is all mined knowledge interesting?
❑ One can mine a tremendous number of “patterns”
❑ Some may fit only certain dimension space (time, location, …)
❑ Some may not be representative, may be transient, …
❑ Evaluation of mined knowledge → directly mine only interesting knowledge?
❑ Descriptive vs. predictive
❑ Coverage
❑ Typicality vs. novelty
❑ Accuracy
❑ Timeliness
❑ …
22
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of machine learning, pattern recognition, statistics, database technology, algorithms, and high-performance computing]
23
Why Confluence of Multiple Disciplines?
❑ Tremendous amount of data
❑ Algorithms must be scalable to handle big data
❑ High-dimensionality of data
❑ Microarray data may have tens of thousands of dimensions
❑ High complexity of data
❑ Data streams and sensor data
❑ Time-series data, temporal data, sequence data
❑ Structured data, graphs, social and information networks
❑ Spatial, spatiotemporal, multimedia, text and Web data
❑ Software programs, scientific simulations
❑ New and sophisticated applications
24
Applications of Data Mining
❑ Web page analysis: classification, clustering, ranking
❑ Collaborative analysis & recommender systems
❑ Basket data analysis to targeted marketing
❑ Biological and medical data analysis
❑ Data mining and software engineering
❑ Data mining and text analysis
❑ Data mining and social and information network analysis
❑ Built-in (invisible data mining) functions in Google, MS, Yahoo!, LinkedIn, Facebook, …
❑ Major dedicated data mining systems/tools
❑ SAS, MS SQL-Server Analysis Manager, Oracle Data Mining tools
25
Major Issues in Data Mining (1)
❑ Mining Methodology
❑ Mining various and new kinds of knowledge
❑ Mining knowledge in multi-dimensional space
❑ Data mining: An interdisciplinary effort
❑ Boosting the power of discovery in a networked environment
❑ Handling noise, uncertainty, and incompleteness of data
❑ Pattern evaluation and pattern- or constraint-guided mining
❑ User Interaction
❑ Interactive mining
❑ Incorporation of background knowledge
❑ Presentation and visualization of data mining results
26
Major Issues in Data Mining (2)
❑ Efficiency and Scalability
❑ Efficiency and scalability of data mining algorithms
❑ Parallel, distributed, stream, and incremental mining methods
❑ Diversity of data types
❑ Handling complex types of data
❑ Mining dynamic, networked, and global data repositories
❑ Data mining and society
❑ Social impacts of data mining
❑ Privacy-preserving data mining
❑ Invisible data mining
27
Types of Data Sets: (1) Record Data
❑ Relational records
❑ Relational tables, highly structured
❑ Data matrix, e.g., numerical matrix, crosstabs
    Document-term matrix example (term counts per document):
                  team  coach  play  ball  score  game  win  lost  timeout  season
    Document 1      3      0     5     0      2     6    0     2        0       2
    Document 2      0      7     0     2      1     0    0     3        0       0
    Document 3      0      1     0     0      1     2    2     0        3       0
❑ Transaction data
    TID   Items
    1     Bread, Coke, Milk
    2     Beer, Bread
    3     Beer, Coke, Diaper, Milk
    4     Beer, Bread, Diaper, Milk
    5     Coke, Diaper, Milk
❑ Molecular Structures
❑ Image data
❑ Video data
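As a small sketch (not part of the slides), a document-term matrix like the one above can be built directly from raw text by counting terms per document; the two sentences below are made up, and no stop-word removal or stemming is done.

```python
from collections import Counter

# Two made-up documents
docs = [
    "the team won the game after a timeout",
    "the coach lost the ball early in the game",
]

# One term-frequency Counter per document (a sparse row of the matrix)
sparse_rows = [Counter(doc.split()) for doc in docs]
vocab = sorted(set().union(*sparse_rows))  # the matrix columns

# Dense view: one row of counts per document, in vocabulary order
matrix = [[row.get(term, 0) for term in vocab] for row in sparse_rows]
print(vocab)
print(matrix)
```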
31
Important Characteristics of Structured Data
❑ Dimensionality
❑ Curse of dimensionality
❑ Sparsity
❑ Only presence counts
❑ Resolution
❑ Patterns depend on the scale
❑ Distribution
❑ Centrality and dispersion
32
Data Objects
❑ Data sets are made up of data objects
❑ A data object represents an entity
❑ Examples:
❑ sales database: customers, store items, sales
❑ medical database: patients, treatments
❑ university database: students, professors, courses
❑ Also called samples, examples, instances, data points, objects, tuples
❑ Data objects are described by attributes
❑ Database rows → data objects; columns → attributes
33
Attributes
❑ Attribute (or dimensions, features, variables)
❑ A data field, representing a characteristic or feature of a data object.
❑ E.g., customer_ID, name, address
❑ Types:
❑ Nominal (e.g., red, blue)
❑ Binary (e.g., {true, false})
❑ Ordinal (e.g., {freshman, sophomore, junior, senior})
❑ Numeric: quantitative
❑ Interval-scaled: 100°C is interval-scaled
❑ Ratio-scaled: 100°K is ratio-scaled since it is twice as high as 50°K
❑ Q1: Is student ID a nominal, ordinal, or interval-scaled attribute?
❑ Q2: What about eye color? Or color in the color spectrum of physics?
34
Attribute Types
❑ Nominal: categories, states, or “names of things”
❑ Hair_color = {auburn, black, blond, brown, grey, red, white}
❑ marital status, occupation, ID numbers, zip codes
❑ Binary
❑ Nominal attribute with only 2 states (0 and 1)
❑ Symmetric binary: both outcomes equally important
❑ e.g., gender
❑ Asymmetric binary: outcomes not equally important.
❑ e.g., medical test (positive vs. negative)
❑ Convention: assign 1 to most important outcome (e.g., HIV positive)
❑ Ordinal
❑ Values have a meaningful order (ranking) but magnitude between successive
values is not known
❑ Size = {small, medium, large}, grades, army rankings
35
Numeric Attribute Types
❑ Quantity (integer or real-valued)
❑ Interval-scaled
❑ Values have order and equal-sized units, but no inherent zero-point (e.g., temperature in °C or °F, calendar dates)
❑ Ratio-scaled
❑ Inherent zero-point
❑ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚)
❑ e.g., temperature in Kelvin, length, counts, monetary quantities
36
Discrete vs. Continuous Attributes
❑ Discrete Attribute
❑ Has only a finite or countably infinite set of values
❑ E.g., zip codes, profession, or the set of words in a collection of documents
❑ Sometimes, represented as integer variables
❑ Note: Binary attributes are a special case of discrete attributes
❑ Continuous Attribute
❑ Has real numbers as attribute values
❑ E.g., temperature, height, or weight
❑ Practically, real values can only be measured and represented using a finite
number of digits
❑ Continuous attributes are typically represented as floating-point variables
37
Visualizing Complex Data and Relations: Social Networks
❑ Visualizing non-numerical data: social and information networks
[Figures: a social network; organizing information networks]
38
What is Data Preprocessing? — Major Tasks
❑ Data cleaning
❑ Handle missing data, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
❑ Data integration
❑ Integration of multiple databases, data cubes, or files
❑ Data reduction
❑ Dimensionality reduction
❑ Numerosity reduction
❑ Data compression
❑ Data transformation and data discretization
❑ Normalization
❑ Concept hierarchy generation
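As a quick preview of the transformation step listed above, here is a minimal sketch (not from the slides) of min-max normalization of a numeric attribute to [0, 1]; the income values are made up, and other schemes (z-score, decimal scaling) follow the same pattern.

```python
# Min-max normalization of a numeric attribute to the range [0, 1]
incomes = [12_000, 35_000, 54_000, 73_000, 98_000]  # made-up values

lo, hi = min(incomes), max(incomes)
normalized = [(v - lo) / (hi - lo) for v in incomes]
print([round(v, 3) for v in normalized])  # 0.0 ... 1.0
```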
39
Why Preprocess the Data? — Data Quality Issues
❑ Measures for data quality: A multidimensional view
❑ Accuracy: correct or wrong, accurate or not
❑ Completeness: not recorded, unavailable, …
❑ Consistency: some modified but some not, dangling, …
❑ Timeliness: timely update?
❑ Believability: how much are the data trusted to be correct?
❑ Interpretability: how easily the data can be understood?
40
Data Cleaning
❑ Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty
instruments, human or computer error, and transmission errors
❑ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
❑ e.g., Occupation = “ ” (missing data)
❑ Noisy: containing noise, errors, or outliers
❑ e.g., Salary = “−10” (an error)
❑ Inconsistent: containing discrepancies in codes or names, e.g.,
❑ Age = “42”, Birthday = “03/07/2010”
❑ Was rating “1, 2, 3”, now rating “A, B, C”
❑ discrepancy between duplicate records
❑ Intentional (e.g., disguised missing data)
❑ Jan. 1 as everyone’s birthday?
41
Incomplete (Missing) Data
❑ Data is not always available
❑ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
❑ Missing data may be due to
❑ Equipment malfunction
❑ Inconsistent with other recorded data and thus deleted
❑ Data were not entered due to misunderstanding
❑ Certain data may not be considered important at the time of entry
❑ Did not register history or changes of the data
❑ Missing data may need to be inferred
42
How to Handle Missing Data?
❑ Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
❑ Fill in the missing value manually: tedious + infeasible?
❑ Fill it in automatically with
❑ a global constant: e.g., “unknown”, a new class?!
❑ the attribute mean
❑ the attribute mean for all samples belonging to the same class: smarter
❑ the most probable value: inference-based such as Bayesian formula or decision
tree
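A minimal sketch (not from the slides) of the automatic fill-in options above, comparing the overall attribute mean with the “smarter” class-conditional mean; the records and class labels are made up.

```python
# Records of (income, class label); None marks a missing income (made-up data)
records = [(30_000, "low"), (None, "low"), (50_000, "high"),
           (90_000, "high"), (None, "high")]

known = [x for x, _ in records if x is not None]
overall_mean = sum(known) / len(known)

def class_mean(label):
    vals = [x for x, c in records if c == label and x is not None]
    return sum(vals) / len(vals)

# Option 1: fill with the attribute mean for all samples
filled_overall = [(x if x is not None else overall_mean, c) for x, c in records]
# Option 2 (smarter): fill with the mean of samples in the same class
filled_by_class = [(x if x is not None else class_mean(c), c) for x, c in records]

print(filled_overall)
print(filled_by_class)
```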
43
Noisy Data
❑ Noise: random error or variance in a measured variable
❑ Incorrect attribute values may be due to
❑ Faulty data collection instruments
❑ Data entry problems
❑ Data transmission problems
❑ Technology limitation
❑ Inconsistency in naming conventions
❑ Other data problems
❑ Duplicate records
❑ Incomplete data
❑ Inconsistent data
44
How to Handle Noisy Data?
❑ Binning
❑ First sort data and partition into (equal-frequency) bins
❑ Then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
❑ Regression
❑ Smooth by fitting the data into regression functions
❑ Clustering
❑ Detect and remove outliers
❑ Semi-supervised: Combined computer and human inspection
❑ Detect suspicious values and check by human (e.g., deal with possible outliers)
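A minimal sketch (not from the slides) of the binning approach above: sort the values, split them into equal-frequency bins, then smooth each bin by its mean; the prices are made up, and smoothing by medians or bin boundaries, as noted above, works the same way.

```python
# Equal-frequency binning, then smoothing by bin means (made-up prices)
prices = sorted([8, 24, 4, 21, 28, 15, 21, 34, 25])
n_bins = 3
size = len(prices) // n_bins                 # 3 values per bin

bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]
smoothed = []
for b in bins:
    mean = sum(b) / len(b)
    smoothed.extend([round(mean, 1)] * len(b))

print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```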
45
Data Cleaning as a Process
❑ Data discrepancy detection
❑ Use metadata (e.g., domain, range, dependency, distribution)
❑ Check field overloading
❑ Check uniqueness rule, consecutive rule and null rule
❑ Use commercial tools
❑ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
❑ Data auditing: by analyzing data to discover rules and relationship to detect violators
(e.g., correlation and clustering to find outliers)
❑ Data migration and integration
❑ Data migration tools: allow transformations to be specified
❑ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations
through a graphical user interface
❑ Integration of the two processes
❑ Iterative and interactive (e.g., Potter's Wheel)
46
END OF UNIT - I
47