1. Introduction to Data Mining & Classification
Data Mining
Sandro Radovanović
Contact me: Microsoft Teams / [email protected]
1. INTRODUCTION to Data Mining
Why Data Mining
▪ Credit ratings:
▫ Given hundreds of clients, which are the least likely to default?
▪ Customer relationship management:
▫ Which clients are likely to be loyal, and which are most likely to leave for a competitor?
▪ Marketing:
▫ What groups of clients do we have? Can we discover them automatically?
Data Mining
▪ Data Mining (DM) is a process of semi-automatically analyzing large databases to find patterns that are:
▫ Valid;
▫ Novel;
▫ Useful; and
▫ Understandable.
What is Data Mining?
▪ Finding the most frequent surname among our clients?
What is Data Mining?
▪ Which products should we recommend to a client?
What is Data Mining?
▪ Will a client open a temporary savings account in the next two months?
What is Data Mining?
▪ Acquiring additional information about a client from the Internet.
What is Data Mining?
▪ Identifying groups of clients by using a domain expert.
What is Data Mining?
▪ Identifying groups of clients by their digital usage of bank services.
What is Data Mining?
▪ How much money will a client have in their bank account at the beginning of each month next year?
Data Mining Tasks
▪ Reduction – selecting data (rows and columns) of interest.
▪ Regression – estimation of a numerical value based on the behaviour of the clients.
▪ Estimation – the same as regression, with the addition of time-dependent data.
Note: Data Mining Tasks
▪ Supervised vs. Unsupervised Learning (and Reinforcement Learning)
▫ Supervised machine learning requires labelled input and output data during the training phase, while unsupervised learning does not (see the sketch below).
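A minimal Python sketch of this distinction, using scikit-learn (an illustration only; the course exercises use RapidMiner, and the data here is made up):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[10, 20, 7], [30, 15, 10], [5, 24, 16], [12, 18, 9]]  # input data
y = [0, 1, 0, 1]          # labels, needed only for supervised learning

supervised = LogisticRegression().fit(X, y)            # requires X and y
unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)  # requires only X

print(supervised.predict([[8, 21, 10]]))  # predicted label for a new client
print(unsupervised.labels_)               # discovered cluster per client
```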
Note: Data Mining Tasks
▪ Generative vs. Discriminative Machine Learning Models
▫ Generative models are those that center on the distribution of the classes within the dataset.
▫ Discriminative models learn about the boundary between classes within a dataset.
Note: Data Mining Tasks
▪ Generative models:
▫ aim to capture the actual distribution of the classes in the dataset;
▫ predict the joint probability distribution – p(x, y);
▫ are computationally expensive compared to discriminative models.
▪ Discriminative models:
▫ model the decision boundary for the dataset classes;
▫ learn the conditional probability – p(y|x);
▫ are computationally cheap compared to generative models.
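A minimal sketch of the contrast, assuming scikit-learn and synthetic data, with Gaussian Naive Bayes standing in for a generative model and logistic regression for a discriminative one:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Generative: models p(x|y) and p(y), i.e. the joint distribution p(x, y).
generative = GaussianNB().fit(X, y)
# Discriminative: models the conditional probability p(y|x) directly.
discriminative = LogisticRegression().fit(X, y)

print(generative.predict_proba(X[:2]))      # p(y|x) derived via Bayes' rule
print(discriminative.predict_proba(X[:2]))  # p(y|x) modelled directly
```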
Regression and Estimation
ID A B C OUTPUT
1 10 20 7 25
2 30 15 10 28
3 5 24 16 16
Classification
ID A B C OUTPUT
1 10 20 7 Default
2 30 15 10 Not Default
3 5 24 16 Default
Clustering
ID A B C
1 10 20 7
2 30 15 10
3 5 24 16
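A minimal sketch, assuming scikit-learn: clustering the three rows of the table above with k-means, with two clusters chosen purely for illustration:

```python
from sklearn.cluster import KMeans

X = [[10, 20, 7], [30, 15, 10], [5, 24, 16]]  # rows of the table, no OUTPUT
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster assignment per client, e.g. [0, 1, 0]
```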
Association Rules
ID A B C
CRISP-DM
Methodology for Data Mining projects
CRISP-DM
1. Business Understanding – understanding the goals and tasks of the project: what should be done and what the limitations are.
2. Data Understanding – acquiring data, exploratory analysis, data quality, etc.
3. Data Preparation – preparing the dataset for the data mining task: selection of rows and attributes, transformation of data, data cleansing.
4. Modelling – selection of a model, its parameters, etc.
5. Evaluation – performance of the model! Do we solve the problem defined in step 1?
6. Deployment – putting the model into production.
2. CLASSIFICATION
Let’s start with classification
Classification: Definition
▪ Given a collection of records, find a model for the class attribute as a function of the values of the other attributes.
Classification: Definition
▪ Given a collection of records,
▪ find a model for the class attribute,
▪ as a function of the values of the other attributes.
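A minimal sketch of this definition in code, assuming scikit-learn rather than RapidMiner, and reusing the toy table from the Classification slide:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[10, 20, 7], [30, 15, 10], [5, 24, 16]]   # attributes A, B, C
y = ["Default", "Not Default", "Default"]       # class attribute

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[12, 18, 9]]))             # class for a new record
```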
RapidMiner
▪ Introduction and Classification
Classification concepts to remember (1)
▪ Model
▪ Apply model
▪ Performance
Classification concepts to remember (2)
▪ Model vs. Algorithm
OVERFITTING
CONCEPT TO REMEMBER (3)
What happened?
Overfitting
▪ Overfitting is a problem that occurs when the prediction model is too complex for the data at hand.
▪ Good fit for the data at hand – poor fit for new data.
Underfitting
▪ Underfitting occurs when a model is too simple.
▪ Poor fit to the data at hand – poor fit to new data.
Underfitting and Overfitting
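A minimal sketch of both phenomena, assuming scikit-learn and synthetic data: decision trees of increasing depth, scored on training and test data (the exact numbers will vary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 4, None):  # too simple, moderate, unrestricted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# depth=1 tends to underfit (both scores low); depth=None tends to overfit
# (training accuracy near 1.0, test accuracy clearly lower).
```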
EVALUATION
CONCEPT TO REMEMBER (4)
Evaluation of the classification model
▪ Confusion matrix
▪ Accuracy
▪ Precision
▪ Recall
▪ AUC (area under the ROC curve)
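A minimal sketch computing the listed measures with scikit-learn, on hypothetical predictions (y_true, y_pred, and y_score here are placeholders, not real results):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual classes
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                   # predicted classes
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted P(y = 1)

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))    # AUC needs scores, not hard labels
```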
Errors in Classification problems
▪ Types of errors – false positives (FP) and false negatives (FN)
▪ Cost of errors?
VALIDATION
CONCEPT TO REMEMBER (5)
Validation
▪ Divide the dataset into a training part and a test part.
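A minimal sketch of this holdout split, assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)   # 70% training, 30% testing

model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))         # accuracy on unseen data
```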
Validation
▪ Can we get more testing out of the same amount of data?
Cross-Validation
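A minimal sketch of 10-fold cross-validation with scikit-learn: every record is used for testing exactly once, while each model is still trained on 90% of the data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=10)  # 10 folds
print(scores.mean(), scores.std())
```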
OVERFITTING
How to deal with it?
CONCEPT TO REMEMBER (6)
How to deal with overfitting?
▪ Change the parameters of the learning algorithm.
▪ Remove features.
▪ Obtain additional data (if available).
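The first two remedies, sketched with scikit-learn; the parameter values and k=5 are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 1. Constrain the parameters of the learning algorithm.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                              random_state=0)

# 2. Remove features before training.
X_reduced = SelectKBest(f_classif, k=5).fit_transform(X, y)

tree.fit(X_reduced, y)
```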
TESTING MORE
How to test multiple algorithms at the same time…
How to test multiple algorithms at the same time?
▪ Loop
▪ Log
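Loop and Log presumably refer to the RapidMiner operators; a rough Python equivalent of the same idea, assuming scikit-learn, a made-up model list, and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

log = []  # the "Log": one entry per algorithm
for name, model in [("LogisticRegression", LogisticRegression()),
                    ("DecisionTree", DecisionTreeClassifier(random_state=0)),
                    ("NaiveBayes", GaussianNB())]:  # the "Loop"
    scores = cross_val_score(model, X, y, cv=10)
    log.append((name, scores.mean()))

print(log)
```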
Don’t do this
Dataset for Project Task(s)
Project Task
▪ Create RapidMiner process(es) for the given task – student dropout prediction
▪ Open Learning Analytics | OU Analyse | Knowledge Media Institute | The Open University
Project Task
▪ courses.csv
▫ code_module – code name of the module, which serves as the identifier.
▫ code_presentation – code name of the presentation. It consists of the year and “B” for the presentation starting in February and “J” for the presentation starting in October.
Project Task
▪ assessments.csv
▫ code_module
▫ code_presentation
▫ id_assessment – identification number of the assessment.
▫ assessment_type – type of assessment. Three types of assessments exist: Tutor Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam (Exam).
▫ date – information about the final submission date of the assessment, calculated as the number of days since the start of the module-presentation.
Project Task
▪ vle.csv
▫ id_site – an identification number of the material.
▫ code_module
▫ code_presentation
▫ activity_type – the role associated with the module material.
▫ week_from – the week from which the material is planned to be used.
▫ week_to – the week until which the material is planned to be used.
Project Task
▪ studentInfo.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ gender
▫ region – identifies the geographic region where the student lived while taking the module-presentation.
▫ highest_education – highest student education level on entry to the module presentation.
Project Task
▪ studentInfo.csv
▫ imd_band – specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.
▫ age_band
▫ num_of_prev_attempts
▫ studied_credits
▫ disability
▫ final_result – student’s final result in the module-presentation.
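Since the project target is dropout, a possible first step is deriving a binary label from final_result; a sketch with pandas, where the labelling rule is an assumption and not something prescribed by the slides:

```python
import pandas as pd

students = pd.read_csv("studentInfo.csv")
# Assumption: "Withdrawn" marks a dropout; whether "Fail" should also
# count is a modelling decision left to the project.
students["dropout"] = (students["final_result"] == "Withdrawn").astype(int)
print(students["dropout"].value_counts())
```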
Project Task
▪ studentRegistration.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ date_registration
▫ date_unregistration
Project Task
▪ studentAssessment.csv
▫ id_assessment
▫ id_student
▫ is_banked – a flag indicating that the result was transferred from a previous presentation.
▫ score – the student’s score in this assessment. The range is from 0 to 100; a score lower than 40 is interpreted as Fail.
Project Task
▪ studentVle.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ id_site
▫ date
▫ sum_click – the number of times the student interacted with the material on that day.
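One way to turn this clickstream into a per-student feature, sketched with pandas; the choice of feature and the join keys are illustrative assumptions:

```python
import pandas as pd

clicks = pd.read_csv("studentVle.csv")
students = pd.read_csv("studentInfo.csv")

# Total clicks per student per module-presentation.
total_clicks = (clicks
                .groupby(["code_module", "code_presentation", "id_student"])
                ["sum_click"].sum()
                .rename("total_clicks")
                .reset_index())

# Join the activity feature onto the student table.
dataset = students.merge(
    total_clicks,
    on=["code_module", "code_presentation", "id_student"],
    how="left")
print(dataset.head())
```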
THANKS!
Any questions?