
CPSC 4830 2025summer Lecture 1

The document is an introduction to a course on Data Mining for Data Analytics, covering the importance of data mining, its definitions, types of data and patterns that can be mined, and the technologies used. It discusses the evolution of data science, the challenges faced in data mining, and various applications across different fields. Key concepts include knowledge discovery, data mining methodologies, and the need for effective evaluation of mined knowledge.

Uploaded by

Jerd

CPSC 4830

Data Mining for Data Analytics


Lecture 1
Introduction
Introduction
Why Data Mining?
What is Data Mining?
What kinds of Data can be Mined?
What kinds of Patterns can be Mined?
What kinds of Technologies are used?
What kinds of Applications are targeted?
Major issues in Data Mining
A brief history of Data Mining and Data Mining Society

… Why are there so many questions here?




In Data Science, it is an art to “ask the correct question”.
Why Data Mining?
The Explosive Growth of Data: GB → TB → PB → EB → ZB → YB
Data Collection and Data Availability
Web, e-commerce, transactions, stocks,…
Walmart handles 1 million transactions every hour
Remote sensing, Bioinformatics, Scientific simulation, ….
Each of us carries about 3 billion base pairs of DNA
News, Photos and Videos on everyone’s mobile phone, YouTube, TikTok, …
1 trillion web pages; 1 hour of video is uploaded to YouTube every second, i.e. about 10 years of video per day
We are drowning in data, but starving for knowledge
Data Mining is the automated analysis of massive data sets
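The YouTube figure can be checked with quick arithmetic. The sketch below takes the quoted upload rate (1 hour of video per second) at face value and assumes a 365-day year:

```python
# Upload rate quoted in the slide; treated here as an assumption, not a measured value.
HOURS_UPLOADED_PER_SECOND = 1
SECONDS_PER_DAY = 24 * 60 * 60

# Hours of video uploaded in one day
hours_per_day = HOURS_UPLOADED_PER_SECOND * SECONDS_PER_DAY

# Convert hours of video to years of continuous viewing (365-day year)
years_per_day = hours_per_day / (24 * 365)

print(f"{hours_per_day} hours of video per day, about {years_per_day:.1f} years")
```

So one day of uploads would take close to a decade to watch end to end.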
Why Data Mining?
Before 1600: Empirical Science
1600-1950s: Theoretical Science
Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s: Computational Science
Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990s-now: Data Science
Flood of data from new scientific instruments and simulations
Ability to economically store and manage PB of data online
The Internet and computing grids make all of this data universally shareable
Data Mining is a major new challenge
What is the difference between traditional methods and the big data era?
How did Newton discover F=ma?
Before 1600: Empirical Science
1600-1950s: Theoretical Science
Now imagine you are Sir Isaac Newton, and an apple falls on your head.
What’s next?
You try to repeat the same experiment (drop an apple from the same height, or run a trolley down a gentle slope)
And you collect all the data.
And…
Wait…
You are already using your genius mind to decide which data are relevant: displacement, velocity, force, etc.
Fit candidate formulae to the data, such as F=mx, F=mv or F=ma (Note: early scientists guessed F=mv)
Finally you find that F=ma fits all the data, and you announce that you have discovered Newton’s Second Law.
Note: this is broadly the process Newton followed, working from Kepler’s data.
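The "fit candidate formulae" step can be sketched in code. This is only an illustration under stated assumptions: the trials are synthetic (not Kepler's data) and are generated from a = F/m, so the outcome is built in. The function names `simulate_trials` and `fit` are hypothetical. Each candidate F = k·m·q (q being x, v or a) gets a least-squares constant k, and the fitting errors are compared.

```python
import random

def simulate_trials(n=50, seed=0):
    """Synthetic trolley trials: a constant force acts from rest, measured at a random time."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        mass = rng.uniform(0.5, 5.0)
        force = rng.uniform(1.0, 20.0)
        t = rng.uniform(0.5, 3.0)
        a = force / mass  # ground truth used to generate the data (assumption)
        trials.append({"m": mass, "F": force, "a": a, "v": a * t, "x": 0.5 * a * t * t})
    return trials

def fit(trials, quantity):
    """Least-squares constant k for F ≈ k * m * quantity; returns (k, RMSE)."""
    z = [tr["m"] * tr[quantity] for tr in trials]
    f = [tr["F"] for tr in trials]
    k = sum(zi * fi for zi, fi in zip(z, f)) / sum(zi * zi for zi in z)
    rmse = (sum((fi - k * zi) ** 2 for zi, fi in zip(z, f)) / len(f)) ** 0.5
    return k, rmse

trials = simulate_trials()
results = {q: fit(trials, q) for q in ("x", "v", "a")}
for q, (k, rmse) in results.items():
    print(f"F = {k:.3f} * m * {q}: RMSE = {rmse:.4f}")
```

Only the F=ma candidate fits with essentially zero error. Note that someone still had to decide that mass, displacement, velocity and acceleration were the quantities worth recording.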
How did Newton discover F=ma?
What if Kepler’s data were given to our DANA students instead of Newton?
Can we find out the relationship?
In order to get the relationship, what do we need to do?
What is Data Mining?
Data Mining is knowledge discovery from data, without requiring much domain knowledge.
Extraction of interesting (implicit, potentially useful) patterns or knowledge from huge amounts of data
Different names for Data Mining: Knowledge Discovery in Databases (KDD), Knowledge Extraction, Data/Pattern Analysis, Data Archeology, Data Dredging, Information Harvesting, Business Intelligence, etc.
Different Frameworks for Data Mining
What is Data Mining?
A typical view from the database systems and data warehousing communities
Data Mining plays an essential role in the knowledge discovery process
KDD Framework

https://barnraisersllc.com/2018/10/01/data-mining-process-essential-steps/
What is Data Mining?
Machine Learning KDD Framework

https://www.geeksforgeeks.org/data-mining-process/
What is Data Mining?
Business Intelligence Framework

Data Mining SpringerLink


What is Data Mining?
KDD vs ML vs BI Framework: the choice depends on the data, the application and the focus
Data Mining is NOT Data Exploration
BI view: Warehouse, data cube reporting but not much mining
Business objects vs Data Mining tools
Supply chain example: Mining vs OLAP vs Tableau
In short:
Automatically detect patterns in data
Use the uncovered patterns to predict future data
Perform other kinds of decision making under uncertainty
Multi-Dimensional View of Data Mining
Data to be Mined
Database data, data warehouse, transactional data, stream, spatiotemporal, time series, sequence, text and web, multimedia, graphs and social information networks, etc.
Knowledge to be Mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Descriptive vs Predictive Data Mining
Multiple integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse, machine learning, statistics, pattern recognition, visualization, high-performance computing, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bioinformatics, stock market analysis, text mining, web mining, etc.
What can Data Mining do?
What kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database, etc.
Advanced data sets and advanced applications
Data streams and sensor data
Time series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Object relational database
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
WWW …..
What kinds of Patterns?
Now suppose you have a lot of data, and you are going to “ask a correct question”…
That is: what can I do with this data? What can I find out? What is useful to the business?
How can I help the team or company shorten working hours or make more profit?

What kinds of Patterns? Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, multidimensional data model
Data cube technology
Scalable methods for computing multidimensional aggregates
OLAP
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g. dry vs wet regions
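To make "multidimensional aggregates" concrete, here is a minimal sketch of a data cube: the measure is summed for every subset of the dimensions. The sales records, the dimension names and the `data_cube` helper are all hypothetical, and this brute-force version is not one of the scalable methods the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales records; dimension names and values are made up for illustration.
DIMENSIONS = ("region", "product", "quarter")
records = [
    {"region": "dry", "product": "umbrella", "quarter": "Q1", "amount": 5},
    {"region": "dry", "product": "sunscreen", "quarter": "Q1", "amount": 40},
    {"region": "wet", "product": "umbrella", "quarter": "Q1", "amount": 60},
    {"region": "wet", "product": "umbrella", "quarter": "Q2", "amount": 45},
    {"region": "wet", "product": "sunscreen", "quarter": "Q2", "amount": 10},
]

def data_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (all 2^n group-bys of the cube)."""
    cube = {}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube

cube = data_cube(records, DIMENSIONS, "amount")
print(cube[()])                     # grand total over all records
print(cube[("region",)])            # roll-up to the region level
print(cube[("region", "product")])  # drill-down to region x product
```

Moving from `cube[("region", "product")]` to `cube[("region",)]` is a roll-up, and the reverse is a drill-down: the basic OLAP operations.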
What kinds of Patterns? Association and Correlation Analysis
Frequent patterns
What items are frequently purchased together in the supermarket?
Association, correlation vs causality
A typical association rule ( Diaper -> Beer )
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other applications?
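A rule such as Diaper -> Beer is usually scored with support, confidence and lift. The sketch below computes all three on a handful of made-up baskets; the transactions and the helper names `support` and `rule_metrics` are assumptions for illustration only.

```python
# Hypothetical market baskets (each basket is the set of items in one transaction).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(set(antecedent) | set(consequent), baskets)
    s_antecedent = support(antecedent, baskets)
    s_consequent = support(consequent, baskets)
    confidence = s_both / s_antecedent
    lift = confidence / s_consequent
    return s_both, confidence, lift

sup, conf, lift = rule_metrics({"diaper"}, {"beer"}, transactions)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

High confidence alone shows association. A lift above 1 means the items co-occur more often than independence would predict, which speaks to correlation. Neither measure establishes causality.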
What kinds of Patterns? Classification
Classification and label prediction
Construct models based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g. classify countries based on climate, or classify cars based on mileage
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, etc.
Typical applications:
Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, etc.
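As a minimal sketch of "construct a model from training examples, then predict unknown labels", the code below trains a decision stump (a one-level decision tree) that classifies cars by mileage. The training data, the class names and the functions `train_stump` and `predict` are hypothetical, and the stump handles only two classes.

```python
# Hypothetical training examples: (mileage in miles per gallon, class label).
training = [(12, "truck"), (15, "truck"), (18, "truck"), (22, "truck"),
            (28, "compact"), (31, "compact"), (35, "compact"), (40, "compact")]

def train_stump(examples):
    """Pick the threshold and label order that misclassify the fewest training examples."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive sorted values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    labels = sorted({y for _, y in examples})  # assumes exactly two classes
    best = None
    for threshold in candidates:
        for low, high in (labels, labels[::-1]):
            errors = sum((low if x <= threshold else high) != y for x, y in examples)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    return best[1:]

def predict(model, mileage):
    """Apply the learned rule to a new, unlabeled example."""
    threshold, low, high = model
    return low if mileage <= threshold else high

model = train_stump(training)
print(model)
print(predict(model, 14), predict(model, 33))
```

The learned rule is the "description" of the classes, and `predict` is the label prediction step for unseen cars.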
What kinds of Patterns? Cluster Analysis
Unsupervised learning
Group data to form new categories, e.g. cluster houses to find distribution patterns
Principle: maximize intra-class similarity and minimize inter-class similarity
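One common way to apply this principle is k-means, which is used here as an illustrative choice and is not named in the slide. The sketch clusters made-up house prices in one dimension. The data, the fixed initial centers and the function name `kmeans_1d` are assumptions; real k-means usually starts from random centers.

```python
def kmeans_1d(values, centers, iterations=20):
    """Plain k-means on 1-D data: assign each value to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # New center = mean of the cluster; an empty cluster keeps its old center
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged: assignments can no longer change
            break
        centers = new_centers
    return centers, clusters

# Hypothetical house prices in thousands of dollars
prices = [210, 220, 230, 480, 500, 520, 990, 1010]
centers, clusters = kmeans_1d(prices, centers=[200, 500, 1000])
print(centers)
print(clusters)
```

No labels are supplied: the three price groups emerge only from the distances between the values.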
What kinds of Patterns? Outlier Analysis
A data object that does not comply with the general behavior of the data
Noise or exception? One person’s garbage could be another person’s treasure
Methods: by-product of clustering or regression analysis
Useful in fraud detection, rare events analysis
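Outliers as a by-product of regression can be sketched as follows: fit a straight line, then flag the points whose residual is unusually large. The data, the injected anomaly and the cutoff of 2 times the RMS residual are all assumptions chosen for illustration.

```python
# Hypothetical data on the line y = 2x + 1 with one injected anomaly at index 5.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_outliers(xs, ys, k=2.0):
    """Indices of points whose residual exceeds k times the RMS residual (k is an arbitrary cutoff)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    rms = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return [i for i, r in enumerate(residuals) if abs(r) > k * rms]

outliers = residual_outliers(xs, ys)
print(outliers)
```

Whether the flagged point is noise to discard or a rare event worth investigating is a domain decision, not something the residual can answer.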
What kinds of Patterns? Sequential Pattern, Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time series, and deviation analysis e.g. regression and value prediction
Sequential pattern mining
E.g. buy digital camera, then buy SD memory cards
Periodicity analysis
Motifs and Biological sequence analysis
Approximate and consecutive motifs
Similarity based analysis
Mining data streams
Ordered, time varying, potentially infinite, data streams
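The camera-then-memory-card example can be sketched by counting ordered pairs: in what fraction of customer histories does item A appear before item B? The purchase sequences and the function name `ordered_pair_support` are hypothetical, and the sketch only handles patterns of length two.

```python
from collections import Counter

# Hypothetical purchase histories, one ordered list per customer.
sequences = [
    ["camera", "sd_card", "tripod"],
    ["camera", "tripod", "sd_card"],
    ["sd_card", "camera"],
    ["camera", "sd_card"],
    ["laptop", "mouse"],
]

def ordered_pair_support(seqs):
    """Fraction of sequences in which `first` occurs somewhere before `later`."""
    counts = Counter()
    for seq in seqs:
        seen_pairs = set()  # count each pair at most once per sequence
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    seen_pairs.add((first, later))
        counts.update(seen_pairs)
    return {pair: c / len(seqs) for pair, c in counts.items()}

pair_support = ordered_pair_support(sequences)
print(pair_support[("camera", "sd_card")])
print(pair_support[("sd_card", "camera")])
```

Unlike the market-basket rule, order matters here: camera followed by SD card is far more frequent than the reverse.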
What kinds of Patterns? Structure and Network Analysis
Graph Mining
Finding frequent subgraphs, trees, substructures
Information network analysis
Social networks: actors and relationships
E.g. author networks, terrorist networks
Multiple heterogeneous networks
A person can belong to multiple information networks: friends, family, classmates, etc.
Links carry a lot of semantic information: Link mining
Web Mining
Web is a big information network: PageRank
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, etc.
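PageRank treats the Web as a graph and scores each page by the links pointing to it. Below is a minimal power-iteration sketch on a four-page toy graph. The graph, the damping factor of 0.85 and the iteration count are assumptions, and production implementations differ in many details.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank; `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump")
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy web: every link target must also appear as a key.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Page C ranks highest because three pages link to it, while D, which nobody links to, keeps only the baseline share.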
What kinds of Patterns? Summarize
Main types:
Supervised learning
Unsupervised learning
Reinforcement learning
Deep learning can be applied to all the above types
Variants:
Semi-supervised learning
Active learning
Ensemble learning
Transfer learning
….
What kinds of Patterns? Summarize
Supervised Learning
What kinds of Patterns? Summarize
Supervised Learning: Classification
What kinds of Patterns? Summarize
Supervised Learning: Probabilistic Classification
What kinds of Patterns? Summarize
Supervised Learning: Real world Classification
Object recognition and image classification
Character recognition
Document classification
Spam detection and filtering
Intrusion detection
Medical diagnosis
What kinds of Patterns? Summarize
Supervised Learning: Regression
What kinds of Patterns? Summarize
Supervised Learning: Real world Regression
Predict tomorrow’s stock market price given current market conditions and other information
Predict the age of a viewer watching a given video on YouTube
Predict the location in 3D space of a robot arm end effector, given control signals sent to its various motors
Predict the amount of prostate specific antigen in the body as a function of a number of different clinical measurements
Predict the temperature at any location inside a building using weather data, time, door sensors
What kinds of Patterns? Summarize
Unsupervised Learning:
Unlabeled Data
The goal of unsupervised learning is to discover “interesting structures/patterns” in the data
Examples: Clustering, Dimensionality reduction, Structure discovery, etc.
What kinds of Patterns? Summarize
Unsupervised Learning: Clustering
What kinds of Patterns? Summarize
Unsupervised Learning: Real world Clustering
Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers and potential customers. The results feed into market segmentation, product positioning, new product development and the selection of test markets
In the study of social networks, clustering may be used to recognize communities within large groups of
people
In human genetic clustering, the similarity of genetic data is used in clustering to infer population structures
Recommender systems are designed to recommend new items based on a user’s tastes. Use clustering
algorithms to predict a user’s preferences based on the preferences of other users in the user’s cluster.
What kinds of Patterns? Summarize
Unsupervised Learning: Dimensionality Reduction
When dealing with high-dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower-dimensional subspace
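Principal component analysis (PCA) is one standard way to choose the subspace. It is used here as an illustrative technique and is not named in the slide. The sketch projects made-up 2-D points onto their direction of largest variance, using the closed-form angle for a 2x2 covariance matrix. The function name `principal_axis_2d` is hypothetical.

```python
import math

def principal_axis_2d(points):
    """Project 2-D points onto their first principal axis; returns (direction, projected, retained)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Angle of the major axis of a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # 1-D coordinates of the centered points along that axis
    projected = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1] for x, y in points]
    # Share of the total variance kept after the projection
    retained = (sum(p * p for p in projected) / n) / (sxx + syy)
    return direction, projected, retained

# Hypothetical points scattered closely around the line y = 2x
points = [(0, 0.1), (1, 1.9), (2, 4.2), (3, 5.8), (4, 8.1)]
direction, projected, retained = principal_axis_2d(points)
print(direction, round(retained, 4))
```

Two numbers per point become one, yet almost all of the variance is retained because the points lie nearly on a line.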
What kinds of Patterns? Summarize
Unsupervised Learning: Structure Discovery
Discover a graph structure showing how a set of variables, including latent variables, are related
What kinds of Patterns? Summarize
Reinforcement Learning
An agent learns how to act or behave from occasional reward or punishment signals
This is how researchers train laboratory mice or dogs
The most famous example is AlphaGo
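Learning from reward signals can be sketched with a two-armed bandit and an epsilon-greedy agent. This is a deliberately tiny setting, far simpler than AlphaGo, and the reward probabilities, epsilon, step count and function name `run_bandit` are all assumptions.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # running estimate of each action's average reward
    pulls = [0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore: random action
        else:
            action = max(range(len(estimates)), key=estimates.__getitem__)  # exploit
        # The environment hands back a reward of 1 or 0; the agent never sees reward_probs
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        pulls[action] += 1
        # Incremental update of the running average
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return estimates, pulls

# Hypothetical environment: action 1 pays off far more often than action 0
estimates, pulls = run_bandit([0.2, 0.8])
print(estimates, pulls)
```

Nobody tells the agent which action is correct. It works that out from the rewards alone, which is what separates this setting from supervised learning.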
What kinds of Patterns? Summarize
Deep Learning
Use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation
Each successive layer uses the output from the previous layer as input
Learn in supervised or unsupervised manners
Learn multiple levels of representations that correspond to different levels of abstraction; the levels form a
hierarchy of concepts
The most famous example is ChatGPT
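The "cascade of layers" idea can be shown with a forward pass through two small layers, where each layer consumes the previous layer's output. The weights below are fixed, hypothetical values: no training happens here, and real networks learn their weights from data.

```python
def relu(values):
    """Nonlinear processing unit: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus bias for each output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

# Hypothetical fixed weights: layer 1 maps 2 inputs to 3 units, layer 2 maps 3 units to 1 output.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 1.0]], [0.5]),
]

def forward(inputs, layers):
    """Feed the input through the cascade, keeping every intermediate representation."""
    activations = [inputs]
    for weights, biases in layers:
        activations.append(relu(dense(activations[-1], weights, biases)))
    return activations

activations = forward([1.0, -1.0], layers)
for depth, values in enumerate(activations):
    print(f"level {depth}: {values}")
```

Each entry of `activations` is one level of representation. In a trained network the deeper levels correspond to more abstract features.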
What kinds of Patterns? Summarize
Deep Learning
Deep learning has a unique advantage, i.e. automatic feature extraction
Automatically grasps the relevant features required for the solution of the problem
Reduces the burden on the programmer to select the features explicitly
What kinds of Patterns? Summarize
“No Free Lunch” theorem
There is no one algorithm that works best for every problem
Assumptions of a great algorithm for one problem may not hold for another problem
So we have to try multiple algorithms and find the one that works best for a specified problem
What kinds of Patterns? Summarize
Evaluation of Knowledge
Is all mined knowledge interesting?
One can mine tremendous amounts of “patterns” and “knowledge”
Some may fit only certain dimension spaces
Some may not be representative, may be transient, etc.
Evaluation of mined knowledge: mine only interesting knowledge
Descriptive vs predictive
Coverage
Typicality vs novelty
Accuracy
Timeliness

Confluence of Multiple Disciplines
Tremendous amount of data
Algorithms must be highly scalable to handle terabytes of data
High dimensionality of data
A microarray may have tens of thousands of dimensions
High complexity of data
Data Streams and sensor data
Time series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text, Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Web page analysis: web page classification, clustering, PageRank and HITS algorithms
Collaborative analysis and recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis, biological sequence analysis, biological network analysis
Data mining and software engineering
What difficulties does Data Mining face?
Major Issues in Data Mining
Mining methodology
Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Data mining: interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty and incompleteness of data
Pattern evaluation and pattern-guided or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
Take home messages
• Data mining: Discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of science and information technology with wide applications
• KDD process includes data cleaning, data integration, data selection, transformation,
data mining, pattern evaluation, and knowledge presentation
• Mining can be performed on a variety of data
• Data Mining functionalities: Characterization, Discrimination, Association,
Classification, Clustering, Trend and Outlier Analysis, etc.
• Major issues in Data Mining
