Chap1 Introduction
What Is Data Mining?
We live in a world where vast amounts of data are generated constantly and rapidly
Data mining is the process of discovering interesting patterns, models and other
kinds of knowledge in large data sets
“Data mining”: a misnomer? It should be “knowledge mining from data”
Other terms: Knowledge mining from data, KDD (Knowledge Discovery from Data),
pattern discovery, knowledge extraction, data analytics, information harvesting
Data mining is a young, dynamic, and promising field
Example: Data mining turns a large collection of data into knowledge
Google Flu Trends found a close relationship between the number of people who
search for flu-related information and the number of people who have flu symptoms
It can estimate flu activity up to two weeks faster than traditional systems
Data Mining: An Essential Step in Knowledge Discovery
Cluster Analysis
Deep Learning
Outlier Analysis
Multidimensional Data Summarization
Predictive Analysis
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
Ex. 1. Classify countries based on (climate)
Ex. 2. Classify cars based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification, logistic regression,
…
Typical applications:
Credit card fraud detection, direct marketing, classifying stars, diseases, web
pages, …
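The idea of constructing a model from training examples and then predicting unknown class labels can be sketched with a one-level decision tree (a "decision stump"), in the spirit of Ex. 2. This is only an illustrative sketch: the gas mileage values, the "low"/"high" labels, and the function names `fit_stump` and `predict` are assumptions made for the example, and a real decision tree learner would use more features and a proper splitting criterion.

```python
from typing import List, Tuple

Sample = Tuple[float, str]          # (gas mileage in mpg, class label)
Stump = Tuple[float, str, str]      # (threshold, label at or below, label above)


def fit_stump(samples: List[Sample]) -> Stump:
    """Pick the threshold and label pair that misclassify the fewest examples.

    Assumes at least two distinct feature values and two distinct labels.
    """
    values = sorted({x for x, _ in samples})
    labels = sorted({y for _, y in samples})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below in labels:
            for above in labels:
                if below == above:
                    continue
                errors = sum(
                    (below if x <= threshold else above) != y
                    for x, y in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above


def predict(model: Stump, x: float) -> str:
    """Predict the class label of an unseen example."""
    threshold, below, above = model
    return below if x <= threshold else above


# Hypothetical training examples: cars labeled by gas mileage class.
training = [
    (15.0, "low"), (18.0, "low"), (22.0, "low"),
    (30.0, "high"), (35.0, "high"), (41.0, "high"),
]
model = fit_stump(training)
print(model)                  # learned (threshold, below, above)
print(predict(model, 33.0))   # label predicted for an unseen car
```

The training step corresponds to "construct models", and `predict` corresponds to "predict some unknown class labels".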
Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find
distribution patterns
Principle: Maximizing intra-class similarity & minimizing inter-class similarity
Many methods and applications
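As one illustration of grouping unlabeled data, the sketch below runs a minimal k-means loop on 2-D points. K-means is chosen here only as a familiar example of a clustering method; the slide does not name a specific algorithm. The house coordinates, the choice of initial centroids, and the function names are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: Sequence[Point], centroids: Sequence[Point],
            max_iterations: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    assignment: List[int] = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the loop has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment, centroids


# Hypothetical house locations forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(houses, centroids=[houses[0], houses[3]])
print(labels)
print(centers)
```

Points within a cluster end up close to their own centroid and far from the other one, which is the similarity principle stated above.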
Deep Learning
Deep learning: A fast expanding dynamic frontier in machine learning
Deep learning has developed various neural network architectures
Feed-forward neural networks
Convolutional neural networks
Recurrent neural networks
Graph neural networks
Transformer
Deep learning has broad applications in computer vision, natural language
processing, machine translation, social network analysis, and so on
Deep learning has been reshaping a variety of data mining tasks
Ex. classification, clustering, outlier detection, and reinforcement learning
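To make the first architecture in the list concrete, the following sketch shows the forward pass of a tiny feed-forward network with one hidden layer. The weights are hand-picked for illustration, not learned, and the names `W1`, `B1`, `W2`, `B2`, `dense`, and `forward` are assumptions of this example rather than part of any library.

```python
import math
from typing import List


def dense(inputs: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hand-picked (untrained) parameters: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
B1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]
B2 = [-0.5]


def forward(x: List[float]) -> float:
    """Input -> hidden layer (ReLU) -> output layer (sigmoid)."""
    hidden = relu(dense(x, W1, B1))
    output = dense(hidden, W2, B2)
    return sigmoid(output[0])


print(forward([1.0, 0.0]))  # inputs differ
print(forward([1.0, 1.0]))  # inputs agree
```

With these particular weights the hidden layer computes the absolute difference of the two inputs, so the output exceeds 0.5 only when the inputs differ. In practice the parameters would be learned from data by gradient-based training.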
Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general behavior of the data
Noise or exception? One person’s garbage could be another person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
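A simple way to see what "does not comply with the general behavior" means is a z-score test, sketched below. This is a basic statistical stand-in, not the clustering-based or regression-based approach named above. The transaction amounts, the threshold of 2.0, and the function name are assumptions made for the example.

```python
import statistics
from typing import List


def z_score_outliers(values: List[float], threshold: float = 2.0) -> List[float]:
    """Return the values lying more than `threshold` standard deviations
    from the mean (population standard deviation)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing can stand out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts with one suspicious entry.
amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 18.0, 500.0]
print(z_score_outliers(amounts))
```

Because an extreme value inflates both the mean and the standard deviation, this test can miss outliers in small samples; robust alternatives based on the median are often preferred.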
Other Data Mining Functions: Time and Ordering: Sequential Pattern, Trend and
Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis
e.g., regression and value prediction
Sequential pattern mining
e.g., buy digital camera, then buy large memory cards
Periodicity analysis
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite, data streams
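The sequential pattern example above ("buy digital camera, then buy large memory cards") can be sketched as counting how many customer sequences contain one item followed later by another. This covers only ordered pairs, not a full sequential pattern mining algorithm; the purchase histories and the function name are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def ordered_pair_support(sequences: List[List[str]]) -> Counter:
    """Count, for each ordered pair (a, b), the number of sequences in which
    a occurs somewhere before b. Each sequence counts at most once per pair."""
    support: Counter = Counter()
    for seq in sequences:
        pairs = set()
        # combinations() yields index pairs with i < j, i.e. "a then b".
        for i, j in combinations(range(len(seq)), 2):
            if seq[i] != seq[j]:
                pairs.add((seq[i], seq[j]))
        support.update(pairs)
    return support


# Hypothetical purchase histories, one time-ordered list per customer.
purchases = [
    ["digital camera", "memory card", "tripod"],
    ["digital camera", "tripod", "memory card"],
    ["memory card", "digital camera"],
    ["digital camera", "memory card"],
]
support = ordered_pair_support(purchases)
print(support[("digital camera", "memory card")])
print(support[("memory card", "digital camera")])
```

Unlike plain basket analysis, the order matters here: the pair (camera, card) and the pair (card, camera) receive separate counts.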
Other Data Mining Functions: Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends, family, classmates, …
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, …
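Since PageRank is mentioned as an example of analyzing the Web as an information network, here is a minimal power-iteration sketch of it. The four-page link graph, the damping factor of 0.85, the fixed iteration count, and the handling of pages without out-links are all assumptions made for the example; this is not how a production search engine is implemented.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iteratively redistribute rank along out-links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no out-links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Page C collects links from three pages and ends up ranked highest, while D, which nothing links to, keeps only the baseline share.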
Evaluation of Knowledge
Is all mined knowledge interesting?
One can mine a tremendous amount of “patterns”
Some may fit only certain dimension space (time, location, …)
Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mining only interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be scalable to handle big data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social and information networks
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Data Mining and Applications
Web page analysis: classification, clustering, ranking
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis
Data mining and software engineering
Data mining and text analysis
Data mining and social and information network analysis
Built-in (invisible data mining) functions in Google, Microsoft, LinkedIn, Meta, …
Major dedicated data mining systems/tools
(e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools)
Data Mining and Society
Data mining technology may benefit society
Ex.: Help scientific discovery, business management, economic recovery, and
security protection (e.g., the real-time discovery of intruders and cyberattacks)
Need to guard against the misuse of data mining
Data mining also poses the risk of unintentionally disclosing some confidential
business or government information and disclosing an individual’s personal
information
Studies on data security in data mining and privacy-preserving data publishing and
data mining are important, ongoing research themes
The philosophy is to observe data sensitivity and preserve data security and
people’s privacy while performing successful data mining
These and other related issues will be discussed throughout the book
Summary
Data mining: Discovering interesting patterns and knowledge from massive amounts
of data
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
Different data mining methods apply to a wide variety of data
Data mining functionalities: summarization, pattern discovery, classification,
clustering, deep learning, outlier analysis, trend and evolution analysis, …
Data mining is a confluence of multiple disciplines
Data mining has broad applications
Promote secure data mining to benefit society