Week 4 - Introduction to Data Mining and Data Mining Techniques (3)

Uploaded by

Sujal Shrestha

Introduction to Data Mining

and Data Mining Techniques


CMP4294: Introduction to Artificial Intelligence

Dr Mariam Adedoyin-Olowe
Recap
• Week 1: Introduction to AI

• Week 2:
• AI Early Concepts
• What triggers the real artificial intelligence (The 3 key ingredients)
• How did AI evolve over time (Modern Historical Perspective)
• Rise of Machine Learning (2000s - 2010s)
• Current Landscape (2020s)
• Future Directions
Recap
• Week 1: Introduction to AI
– Definition
– Why AI?
– Principles of AI
– Types of AI
– AI Life Cycle
– Benefits of AI
– Challenges and concerns of AI
• Week 2: Evolution of AI & AI Systems
– AI Early Concepts
– What triggers the real artificial intelligence (The 3 key ingredients)
– How did AI evolve over time (Modern Historical Perspective)
– Rise of Machine Learning (2000s - 2010s)
– Current Landscape (2020s)
– Future Directions
• Week 3: Impact and Ethics of AI
Today’s Outline
• What is KDD
• What is Data Mining
• Why is Data Mining essential
• Data Mining process
• Examples of Data Mining techniques

Learning Objectives
• Learn the origin and significance of KDD and Data Mining
• Examine the different Data Mining techniques
What is KDD

• Knowledge Discovery in Databases (KDD) is the all-inclusive process of finding knowledge in data, and highlights the complex application of specific data mining methods
Data Mining and KDD

Data Mining is principally a part of the KDD process

KDD process
Raw Data → [Selection] → Target Data → [Preprocessing] → Preprocessed Data → [Transformation] → Transformed Data → [Data Mining] → Patterns → [Interpretation] → Knowledge
What is data mining?


It is the iterative and interactive process of discovering valid, novel, useful, and understandable knowledge (patterns, models, rules, etc.) in huge databases

It is knowledge discovery from data
Data Mining
• It draws on multiple disciplines
– Database Systems
– Statistics
– Machine Learning
– Data Visualization
– Information Retrieval
– Pattern Recognition
Data mining process…

• Valid: accurate and authentic for immediate and future use


• Novel: reveals new insight not already known
• Useful: can be used for important decision making
• Understandable: leading to insight
• Iterative: takes multiple passes
• Interactive: human in the loop
Related fields
• Artificial Intelligence
• Machine learning
• Statistics
• Databases and data warehousing
• High performance computing
• Visualization
• Data Intelligence
• Business Intelligence
Why Data Mining?
• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM), market
basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control,
competitive analysis
– Fraud detection and detection of unusual patterns (outliers)

• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– DNA and bio-data analysis
Need for data mining tools
• Human analysis breaks down as volume and dimensionality grow
– How quickly can a human assimilate 2 million records, each with 200 fields?
– High rate of growth, changing sources
The Challenge

51020188905212001539458199000000001419881
22944882199608162100000010100010000000110
00031111100000000010031302000000000000002
02001000000000000000000000000000043438888
88884242434243330122020222000010100100000
00441000000001100000000000000000100000100
00000000000000000000000000000000000000000
00000019981027510201896060120021269409680
00000159019980903379811998091731001000001
00010000000110000320002000000100000001239
90000000000002002222003131003120000000000
00000042438888888888424342423321212122220
00000101100000024410000000001002000000000
00000000000100000000000000000000000000000
00000000000000000000199812305102018970203
20001862692920000004709199802135697119980
22731000001001000100000000011011000000200
00100000000021011000100000000000100000000
00001000110000000111003388882222331132334
33300000011000001110100110010200010000000
01000000001000000000000000000000000000000
00000000000000000000000000000000001998122
15102018990930200520089867300000194101999
01127598119990126310010001010001000000000
The Challenge

• Data volume too large for traditional analysis
– Number of records too large (millions if not billions)
– High-dimensional data (hundreds of thousands of attributes/features/fields)

• Increased opportunity for access
– Web navigation, online collections
Data Mining Goals
Improved business efficiency
Improve marketing (advertise to the most likely buyers)
Inventory reduction (stock only needed quantities)
Information source: Historical business data
Example: Supermarket sales records

Date   Time   Register  Fish  Turkey  Cranberries  Wine  ...
12/6   13:15  2         N     Y       Y            N     ...
12/6   13:16  3         Y     N       N            Y     ...

Sample question – what products are generally purchased together?
The answers are in the data!
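
One simple way to start answering the sample question is to count how often each pair of products appears in the same transaction. The sketch below is illustrative only: the first two baskets mirror the two rows shown above, and the remaining baskets are made up so that the counts are not trivial.

```python
from collections import Counter
from itertools import combinations

# Each basket lists the products bought in one transaction.
# The first two baskets mirror the two rows on the slide; the others are
# hypothetical rows added only for illustration.
baskets = [
    {"Turkey", "Cranberries"},          # 12/6 13:15, register 2
    {"Fish", "Wine"},                   # 12/6 13:16, register 3
    {"Turkey", "Cranberries", "Wine"},  # hypothetical
    {"Fish", "Wine"},                   # hypothetical
    {"Turkey", "Cranberries"},          # hypothetical
]


def count_pairs(baskets):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(baskets)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Pairs with high counts are candidates for "generally purchased together"; on real sales records the same idea would be applied to many more baskets and products.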
Data mining process
• Understand application domain
– Prior knowledge, user goals
• Create target dataset
– Select data, focus on subsets
• Data cleaning and transformation
– Handle noise, outliers, missing values
– Select features, reduce dimensions
Data mining process

• Apply data mining algorithm
– Associations, sequences, classification, clustering, etc.
• Interpret, evaluate and visualise patterns
– What interesting findings pop out?
– Iterate if needed
• Manage discovered knowledge
– Close the loop
Data Mining process
Data mining process - more detail

Original DB        | Build/select target database | Select sample
Normalize values   | Eradicate noisy data         | Supply missing values
Transform values   | Build derived attributes     | Significant attributes
Extract knowledge  | Select DM technique(s)       | Select DM task(s)
Test knowledge     | Improve knowledge            | Present/visualise knowledge/findings
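
Two of the preparation steps named above, supplying missing values and normalizing values, can be sketched in a few lines. This is a minimal illustration, assuming mean imputation and min-max scaling; the slide does not say which methods are intended, and the `ages` column is hypothetical.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, None, 40, 60]  # hypothetical attribute with one missing value
filled = fill_missing(ages)
normalized = min_max_normalize(filled)
print(filled)
print(normalized)
```

Other choices (median imputation, z-score scaling, dropping incomplete records) are equally common; which one fits depends on the data and the mining task.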
What Can Data Mining Do?
• Cluster
• Classify
– Categorical, Regression
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
Data Mining Techniques
Predictive Techniques

• Classification sorts data into predefined groups or classes
– Supervised learning: you train the machine with “labelled” data, so it learns to produce an output from prior experience
– Unsupervised learning: a machine learning technique in which the model is not supervised; it works with unlabelled data and finds structure on its own
– Pattern recognition: the automatic recognition of patterns and regularities in data. Pattern recognition is closely related to artificial intelligence and machine learning.

• Prediction: used for forecasting the future. For example, you might use the previous sales of a particular smartphone model to predict demand for its new release

https://www.datasciencecentral.com/profiles/blogs/the-7-most-important-data-mining-techniques
• Classification
– Assign a new data record to one of several predefined groups or classes
– We know X and Y belong together; find other things in the same group

• Decision tree: a decision support tool that uses a tree-like model of decisions and their possible consequences/outcomes, e.g., should we play football today?
• E.g., classify countries based on climate, or classify cars based on gas mileage
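
To make the "should we play football today?" example concrete, here is a small hand-written decision tree and a function that walks it. It is only a sketch: the attributes (`outlook`, `humidity`, `windy`) and the branch outcomes are assumptions chosen for illustration, and in practice the tree would be learned from labelled data instead of written by hand.

```python
# A leaf is a string answer; an internal node is a tuple of
# (attribute name, {attribute value: subtree}).
# The attributes and outcomes below are hypothetical.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {True: "no", False: "yes"}),
})


def decide(node, record):
    """Walk from the root to a leaf, following the record's attribute values."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


today = {"outlook": "sunny", "humidity": "normal", "windy": False}
print(decide(tree, today))
```

Each path from the root to a leaf reads as a rule, for example "if the outlook is rainy and it is windy, do not play", which is why decision trees are often considered easy to interpret.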
Ex. of Classification Task
• Let’s assume you’re assessing data on individual customers’ financial backgrounds and purchase history

• You could classify them as “low,” “medium,” or “high” credit risks

• You could then use these classifications to learn even more about those customers and decide which of them can be given credit facilities without endangering the prospects of the business.
…More Examples of Classification Tasks

Task                         | Attribute set                                             | Class label
Categorising email messages  | Features extracted from email message header and content | Spam or non-spam
Categorising exam grades     | Scores extracted from exam results                        | Fail or pass
Time series analysis

• Time series analysis includes approaches for analysing time series data in order to extract significant statistics and other features of the data.

• Time series forecasting is the use of a model to predict future values based on previously observed values.

https://magoosh.com/statistics/time-series-analysis-and-forecasting-definition-and-examples/
Ex. Time series analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over
time
• Classify behaviour
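
As a minimal sketch of "predict future values", the function below forecasts the next value of a series as the average of its most recent observations. A moving average is just one very simple forecasting model, chosen here as an assumption for illustration; the prices are made up.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("need at least `window` observations")
    recent = series[-window:]
    return sum(recent) / window


prices = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical daily closing prices
forecast = moving_average_forecast(prices, window=3)
print(forecast)
```

A larger `window` smooths out short-term noise but reacts more slowly to a genuine change in trend.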
Regression
• Regression is used to estimate the likely value of a certain variable, given the values of other variables.

– For example, you could use it to project the stock market price, based on other factors like consumer demand, competition and availability.

– To be precise, regression focuses on quantifying the relationship between two (or more) variables in a given data set.
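
The relationship between two variables can be sketched with simple linear regression, fitted by ordinary least squares. This is one common form of regression, shown under the assumption of a single input variable; the `demand` and `price` figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


demand = [1.0, 2.0, 3.0, 4.0]  # hypothetical demand index
price = [3.0, 5.0, 7.0, 9.0]   # hypothetical price, here exactly 2 * demand + 1
slope, intercept = fit_line(demand, price)
print(slope, intercept)

# Use the fitted line to project the price at a new demand level.
print(slope * 5.0 + intercept)
```

With more than one input variable (for example demand, competition and availability together) the same idea generalises to multiple linear regression.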
Descriptive techniques
Clustering
• Clustering partitions a dataset into subsets or groups such that the members of a group share a common set of properties
– Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
– Maximizing intra-class similarity & minimizing inter-class similarity
Clustering
• Clustering groups similar data together into clusters (it is similar to classification, but the groups are not predefined).

• For example, you might decide to cluster different demographics of Twitter users into different clusters based on their location.

• Clustering groups similar objects together such that the similarity of members within the same group is maximised, and the similarity among those that belong to different groups is minimised.
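
A small sketch of this idea is k-means clustering, shown here on one-dimensional points to keep it short. k-means is only one of many clustering algorithms and is used as an assumption for illustration; the points and starting centroids are made up.

```python
def k_means_1d(points, centroids, iterations=10):
    """Run k-means (Lloyd's algorithm) on 1-D points from given start centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # hypothetical 1-D positions
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No class labels are supplied: the two groups emerge from the distances between the points alone, which is what separates clustering from classification.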
…more descriptive techniques
• Association Rule Mining (discover sets of features that frequently co-occur, and rules among them)

– Diaper → Beer [0.5%, 75%] (support, confidence)

• Sequence Mining (discover sequences of occurrences that frequently take place together)
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Noise or exception? No! Useful in fraud detection, rare events analysis
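
One simple way to flag objects that do not comply with the general behaviour of the data is a z-score test: measure how many standard deviations each value lies from the mean. The sketch below assumes that approach and a threshold of 2.0; both are illustrative choices, and the transaction amounts are hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


amounts = [20, 22, 19, 21, 20, 23, 18, 500]  # hypothetical card transactions
outliers = find_outliers(amounts)
print(outliers)
```

In a fraud-detection setting the flagged value is the interesting result, not something to be discarded as noise. Note that a single extreme value also inflates the mean and standard deviation, so more robust measures (for example based on the median) are often preferred.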
Association Rule (Market Basket Analysis)
• Association rule mining finds frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases

• It’s used to recognise customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”
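
Association rules are commonly scored by support (the fraction of transactions containing all items in the rule) and confidence (of the transactions containing the left-hand side, the fraction that also contain the right-hand side). The sketch below computes both for a rule of the form diaper → beer; the transactions are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical shopping baskets.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
print(rule_support, rule_confidence)
```

Here two of the four baskets contain both items, and two of the three baskets with diapers also contain beer. Rules are usually kept only if both measures exceed thresholds chosen by the analyst.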
Summarization
• Summarization maps data into subsets
with associated simple descriptions
• It compresses data into an informative
representation
– Characterization
– Generalization
• Simple summarization methods such as tabulating means and standard deviations are often applied for data analysis, data visualization and automated report generation
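
Tabulating the mean and standard deviation per column can be sketched with Python's standard `statistics` module. The `sales` table below is hypothetical, and the sample standard deviation is used as an assumption.

```python
from statistics import mean, stdev


def summarize(table):
    """Return {column: (mean, sample standard deviation)} for each column."""
    return {name: (mean(values), stdev(values)) for name, values in table.items()}


# Hypothetical daily unit sales per product.
sales = {
    "fish": [2, 4, 6],
    "wine": [10, 10, 10],
}

summary = summarize(sales)
for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg}, stdev={sd}")
```

Each column of raw numbers is compressed into two descriptive figures, which is the kind of compact representation a generated report would show.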
Sequential Analysis
• Sequential Analysis determines sequential patterns from sequential data

• More precisely, it consists of discovering interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit.

• Sequential pattern mining has many real-life applications since data is naturally encoded as sequences of symbols in many fields such as e-learning, market basket analysis, texts, and webpage click-stream analysis.
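
Using occurrence frequency as the interestingness criterion, a very reduced form of sequential pattern mining can be sketched as follows. This is a simplification: it only considers contiguous subsequences of a fixed length, whereas full sequential pattern mining algorithms also allow gaps and variable lengths. The click-stream sessions are hypothetical.

```python
from collections import Counter


def frequent_subsequences(sequences, length=2, min_support=2):
    """Find contiguous subsequences of `length` items that occur in at
    least `min_support` of the input sequences."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each pattern is counted once per sequence.
        seen = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        counts.update(seen)
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


# Hypothetical webpage click-stream sessions.
clicks = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product"],
    ["home", "product", "cart"],
]

patterns = frequent_subsequences(clicks)
for pattern, n in patterns.items():
    print(" -> ".join(pattern), n)
```

Raising `min_support` keeps only the patterns shared by more sessions, while increasing `length` looks for longer navigation paths.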
Uses of Data Mining in Real Life
• Healthcare
• Education
• Banking – Fraud Detection
• Policing – Crime Detection
• Travel
• Retail
• Energy
Find out the KPIs for each of these sectors
Summary
• Data Mining is part of the KDD process

• Data Mining is used to find understandable knowledge from huge databases

• Descriptive, predictive and prescriptive analytics are Data Mining goals
