1.1 Introduction To Data Mining
Introduction to Data Mining
Into the Digital Era
● Identify different views of data mining;
understand the key issues in data mining
● People’s daily lives
○ > 4 billion internet users
○ Social media, smart devices, …
● Scientific discovery
○ Rubin Observatory: 20 TB/night
● Many application domains…
Why Data Mining?
● Explosive data growth
○ KB, MB, GB, TB, PB, EB, ZB
○ Data creation, transmission, storage,
sharing, processing
● Drowning in data & starving for
knowledge
● Need automated analysis of massive data
What is Data Mining?
● Knowledge discovery from data
○ Extraction of interesting patterns or
knowledge from huge amounts of data
○ Interesting: valid, previously unknown,
potentially useful, ultimately
understandable by humans
○ Huge amounts of data: scalability,
efficiency
The Four Views of Data Mining
● The data mining process encompasses four key views:
○ Data View: Understanding the dataset and its attributes.
○ Technique View: Methods for discovering patterns and relationships.
○ Knowledge View: Interpreting and evaluating discovered knowledge.
○ Application View: Applying the mined knowledge to achieve business goals.
● These views collectively guide us through the data mining pipeline.
Data View
● The 3Vs, 4Vs, 5Vs of big data: volume, velocity, variety, veracity, value
Data may be of a single type or a mixture of multiple data types:
● Relational, transactional data
○ E.g., student records, bank accounts, store purchases
● Sequential, temporal, streaming data
○ E.g., gene sequences, stock prices, sensor readings
● Spatial, spatial-temporal data
○ E.g., land use, bird migration, traffic conditions
● Text, multimedia, Web data
○ E.g., news articles, audio/video/image, hypertext
● Graph, network data
○ E.g., social networks, power grids, co-authorship
Application View
Selected domains (and many, many more…):
● Market analysis, target advertisement
○ E.g., customer profiling, product recommendation
● Healthcare, medical research
○ E.g., disease diagnosis, patient care, drug discovery
● Science and engineering
○ E.g., air pollution, marine life, electric vehicles
● Security
○ E.g., surveillance, intrusion/crime, fraud, cyberattack
● Government, nonprofit
○ E.g., urban planning, traffic control, education
Knowledge View
Knowledge may be descriptive, predictive, or prescriptive:
● Frequent pattern, association, correlation
○ E.g., songs listened to together or in a certain sequence
○ E.g., A is (more/less) likely to happen given B
● Categorization
○ E.g., similarity among users with certain purchases
○ E.g., differences between two patient groups
● Anomaly, outliers
○ E.g., sensor errors, fraudulent activities, extreme events
● Changes over time
○ E.g., emerging new patterns, shifts in user interest
Technique View
● Frequent pattern analysis
● Classification, prediction
● Clustering
● Anomaly detection
● Trend and evolution analysis
Frequent Pattern Analysis
● Frequent itemset
● Frequent sequence
● Frequent structure
● Association rules
● Correlation analysis
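A minimal sketch of frequent itemsets and one association rule, assuming a toy set of shopping transactions that is not from the slides. It counts itemsets by brute-force enumeration (not Apriori or FP-growth), so it only suits tiny inputs; the names `frequent_itemsets` and `confidence` are hypothetical helpers.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (assumed for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets with support >= min_support."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        # Enumerate every itemset of size 1..max_size contained in t.
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return itemsets[both] / itemsets[frozenset(antecedent)]


itemsets = frequent_itemsets(transactions, min_support=0.4)
rule_conf = confidence(itemsets, {"bread"}, {"milk"})
print(f"support(bread, milk) = {itemsets[frozenset({'bread', 'milk'})]:.2f}")
print(f"confidence(bread -> milk) = {rule_conf:.2f}")
```

Support measures how often an itemset occurs; confidence measures how often the consequent appears in transactions that contain the antecedent.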
Classification
● Pre-defined classes
● Needs labeled training data
● Builds a model to distinguish classes
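The three points above can be sketched with a nearest-centroid classifier, chosen here only because it is short; the slides do not name a specific algorithm. The training data and the helpers `fit_centroids` and `predict` are assumptions for illustration.

```python
from collections import defaultdict
from math import dist

# Hypothetical labeled training data: (features, pre-defined class).
training = [
    ((1.0, 1.2), "A"),
    ((0.8, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((4.0, 4.2), "B"),
    ((4.2, 3.8), "B"),
    ((3.8, 4.0), "B"),
]


def fit_centroids(samples):
    """Build the model: one mean point (centroid) per class."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(pts) for col in zip(*pts))
        for label, pts in groups.items()
    }


def predict(centroids, features):
    """Assign the class whose centroid is closest to the new point."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(training)
label_near_a = predict(centroids, (1.1, 0.9))
label_near_b = predict(centroids, (3.9, 4.1))
print(label_near_a, label_near_b)
```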
Prediction
● Numerical prediction
(continuous value)
○ E.g., weather
○ E.g., stock price
○ E.g., traffic
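As one simple instance of numerical prediction, the sketch below fits a straight line by ordinary least squares and extrapolates one step ahead. The hourly temperature values and the helper `fit_line` are assumed for illustration; real weather, stock, or traffic data would rarely be this linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: temperature (continuous value) per hour.
hours = [0, 1, 2, 3, 4]
temps = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(hours, temps)
forecast = slope * 5 + intercept  # predicted temperature at hour 5
print(f"slope={slope:.1f}, intercept={intercept:.1f}, forecast={forecast:.1f}")
```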
Clustering
● No predefined classes
● Intra-cluster similarity
● Inter-cluster dissimilarity
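A minimal k-means sketch, assuming 2-D points and fixed starting centroids so the run is deterministic. K-means is only one of many clustering methods and is used here as an example; the data and the `kmeans` helper are hypothetical.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

No class labels are given; the groups emerge because points within a cluster are close to each other and far from the other cluster.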
Anomaly Detection
● Anomaly/outlier
○ Differ from the “norm”
○ E.g., error, noise
○ E.g., fraud
○ E.g., extreme events
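One simple way to make "differ from the norm" concrete is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. The sensor readings, the threshold of 2.0, and the helper `zscore_outliers` are assumptions; the z-score is itself distorted by large outliers, so this is a sketch rather than a robust detector.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one faulty spike.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 35.0, 20.2]
outliers = zscore_outliers(readings)
print(outliers)
```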
Trend and Evolution Analysis
● Changes over time
○ Overall trend
○ Periodic patterns
○ Anomalies
○ …
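To illustrate the "overall trend" item, the sketch below smooths a short series with a moving average so that the underlying direction is easier to see. The series values, the window of 3, and the helper `moving_average` are assumed for illustration.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical noisy measurements taken at regular time steps.
series = [10, 12, 11, 13, 15, 14, 16, 18]
smoothed = moving_average(series, window=3)

# The raw series dips at times; the smoothed series rises steadily.
is_upward = all(a < b for a, b in zip(smoothed, smoothed[1:]))
print(smoothed, is_upward)
```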
Introduction to Data Mining Pipeline
Key Components of the Data Mining Pipeline
● The data mining pipeline consists of essential components:
○ Data Collection: Gathering relevant data from various sources.
○ Data Preprocessing: Cleaning, integrating, transforming, and reducing data.
○ Data Mining: Applying algorithms to discover patterns and knowledge.
○ Pattern Evaluation: Assessing the quality and relevance of discovered
patterns.
○ Knowledge Presentation: Communicating insights and findings to
stakeholders.
● A well-structured pipeline ensures effective data analysis and decision-making.
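The five components above can be sketched as a chain of small functions. Everything here is a toy assumption: the raw records, the function names, and the use of simple item counting as the "mining" step stand in for real sources and algorithms.

```python
from collections import Counter


def collect():
    """Data collection: hypothetical raw records with noise."""
    return [" Milk", "bread", "milk ", None, "BREAD", "eggs", "milk"]


def preprocess(records):
    """Data preprocessing: drop missing values, normalize text."""
    return [r.strip().lower() for r in records if r is not None]


def mine(items):
    """Data mining: count how often each item occurs."""
    return Counter(items)


def evaluate(counts, min_count=2):
    """Pattern evaluation: keep only items seen at least min_count times."""
    return {item: c for item, c in counts.items() if c >= min_count}


def present(patterns):
    """Knowledge presentation: readable lines, most frequent first."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{item}: {count}" for item, count in ordered]


report = present(evaluate(mine(preprocess(collect()))))
print("\n".join(report))
```

Keeping each stage as a separate function makes it easy to swap in a different source, cleaning rule, or mining algorithm without touching the rest.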
[Figure: Data Mining Pipeline]
Summary
● Data mining: Discovering insights from large datasets.
● Significance: Informed decision-making, problem-solving.
● Key views: Data, Technique, Knowledge, Application.
● Pipeline stages: collection, preprocessing, mining, pattern evaluation, knowledge presentation.