1. Introduction to Data Mining & Classification
Data Mining
Sandro Radovanović
Contact me: Microsoft Teams / [email protected]
1. INTRODUCTION to Data Mining
Why Data Mining
▪ Credit ratings:
▫ Given hundreds of clients, which are the least likely to default?
▪ Customer relationship management:
▫ Which clients are likely to be loyal, and which are most likely to leave for a competitor?
▪ Marketing:
▫ What groups of clients do we have? Can we discover them automatically?
Data Mining
▪ Data Mining (DM) is a process of semi-automatically analyzing large databases to find patterns that are:
▫ Valid;
▫ Novel;
▫ Useful; and
▫ Understandable.
What is Data Mining?
▪ Finding the most frequent surname among our clients?
What is Data Mining?
▪ Which products should we recommend to a client?
What is Data Mining?
▪ Will a client open a temporary savings account in the next two months?
What is Data Mining?
▪ Acquiring additional information about a client from the Internet.
What is Data Mining?
▪ Identifying groups of clients by using a domain expert.
What is Data Mining?
▪ Identifying groups of clients by their digital usage of bank services.
What is Data Mining?
▪ How much money will a client have in their bank account at the beginning of each month next year?
Data Mining Tasks
▪ Reduction – selecting data (rows and columns) of interest.
▪ Regression – estimation of a numerical value based on the behaviour of the clients.
▪ Estimation – the same as regression, with the addition of time-dependent data.
Note: Data Mining Tasks
▪ Supervised vs. Unsupervised Learning (and Reinforcement Learning)
▫ Supervised machine learning requires labelled input and output data during the training phase, while unsupervised learning does not (see the sketch below).
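A minimal Python sketch of this distinction, using scikit-learn (an illustration only; the course exercises use RapidMiner, and the data here is made up):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[10, 20, 7], [30, 15, 10], [5, 24, 16], [12, 18, 9]]  # input data
y = [0, 1, 0, 1]          # labels, needed only for supervised learning

supervised = LogisticRegression().fit(X, y)            # requires X and y
unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)  # requires only X

print(supervised.predict([[8, 21, 10]]))  # predicted label for a new client
print(unsupervised.labels_)               # discovered cluster per client
```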
Note: Data Mining Tasks
▪ Generative vs. Discriminative Machine Learning Models
▫ Generative models are those that center on the distribution of the classes within the dataset.
▫ Discriminative models learn about the boundary between classes within a dataset.
Note: Data Mining Tasks
▪ Generative models:
▫ aim to capture the actual distribution of the classes in the dataset;
▫ predict the joint probability distribution – p(x, y);
▫ are computationally expensive compared to discriminative models.
▪ Discriminative models:
▫ model the decision boundary for the dataset classes;
▫ learn the conditional probability – p(y|x);
▫ are computationally cheap compared to generative models.
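A minimal sketch of the contrast, assuming scikit-learn and synthetic data, with Gaussian Naive Bayes standing in for a generative model and logistic regression for a discriminative one:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Generative: models p(x|y) and p(y), i.e. the joint distribution p(x, y).
generative = GaussianNB().fit(X, y)
# Discriminative: models the conditional probability p(y|x) directly.
discriminative = LogisticRegression().fit(X, y)

print(generative.predict_proba(X[:2]))      # p(y|x) derived via Bayes' rule
print(discriminative.predict_proba(X[:2]))  # p(y|x) modelled directly
```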
Regression and Estimation
ID A B C OUTPUT
1 10 20 7 25
2 30 15 10 28
3 5 24 16 16
Classification
ID A B C OUTPUT
1 10 20 7 Default
2 30 15 10 Not Default
3 5 24 16 Default
Clustering
ID A B C
1 10 20 7
2 30 15 10
3 5 24 16
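A minimal sketch, assuming scikit-learn: clustering the three rows of the table above with k-means, with two clusters chosen purely for illustration:

```python
from sklearn.cluster import KMeans

X = [[10, 20, 7], [30, 15, 10], [5, 24, 16]]  # rows of the table, no OUTPUT
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster assignment per client, e.g. [0, 1, 0]
```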
Association Rules
ID A B C
CRISP-DM
Methodology for Data Mining projects
CRISP-DM
1. Business Understanding – understanding the goals and tasks of the project: what should be done and what the limitations are.
2. Data Understanding – acquiring data, exploratory analysis, data quality, etc.
3. Data Preparation – preparing the dataset for the data mining task: selection of rows and attributes, transformation of data, data cleansing.
4. Modelling – selection of a model, its parameters, etc.
5. Evaluation – performance of the model! Do we solve the problem defined in step 1?
6. Deployment – putting the model into production.
2. CLASSIFICATION
Let’s start with classification
Classification: Definition
▪ Given a collection of records, find a model for the class attribute as a function of the values of the other attributes.
Classification: Definition
▪ Given a collection of records,
▪ find a model for the class attribute,
▪ as a function of the values of the other attributes.
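A minimal sketch of this definition in code, assuming scikit-learn rather than RapidMiner, and reusing the toy table from the Classification slide:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[10, 20, 7], [30, 15, 10], [5, 24, 16]]   # attributes A, B, C
y = ["Default", "Not Default", "Default"]       # class attribute

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[12, 18, 9]]))             # class for a new record
```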
RapidMiner
▪ Introduction and Classification
Classification concepts to remember (1)
▪ Model
▪ Apply model
▪ Performance
Classification concepts to remember (2)
▪ Model vs. Algorithm
OVERFITTING
CONCEPT TO REMEMBER (3)
What happened?
Overfitting
▪ Overfitting is a problem that occurs when the prediction model is too complex for the data at hand.
▪ Good fit for the data at hand – poor fit for new data.
Underfitting
▪ Underfitting occurs when a model is too simple.
▪ Poor fit to the data at hand – poor fit to new data.
Underfitting and Overfitting
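A minimal sketch of both phenomena, assuming scikit-learn and synthetic data: decision trees of increasing depth, scored on training and test data (the exact numbers will vary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 4, None):  # too simple, moderate, unrestricted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# depth=1 tends to underfit (both scores low); depth=None tends to overfit
# (training accuracy near 1.0, test accuracy clearly lower).
```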
EVALUATION
CONCEPT TO REMEMBER (4)
Evaluation of the classification model
▪ Confusion matrix
▪ Accuracy
▪ Precision
▪ Recall
▪ AUC (area under the ROC curve)
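A minimal sketch computing the listed measures with scikit-learn, on hypothetical predictions (y_true, y_pred, and y_score here are placeholders, not real results):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual classes
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                   # predicted classes
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted P(y = 1)

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))    # AUC needs scores, not hard labels
```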
Errors in Classification problems
▪ Types of errors – false positives (FP) and false negatives (FN)
▪ Cost of errors?
VALIDATION
CONCEPT TO REMEMBER (5)
Validation
▪ Divide the dataset into a training part and a test part.
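A minimal sketch of this holdout split, assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)   # 70% training, 30% testing

model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))         # accuracy on unseen data
```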
Validation
▪ Can we get more testing out of the same amount of data?
Cross-Validation
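A minimal sketch of 10-fold cross-validation with scikit-learn: every record is used for testing exactly once, while each model is still trained on 90% of the data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=10)  # 10 folds
print(scores.mean(), scores.std())
```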
OVERFITTING
How to deal with it?
CONCEPT TO REMEMBER (6)
How to deal with overfitting?
▪ Change the parameters of the learning algorithm.
▪ Remove features.
▪ Obtain additional data (if available).
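The first two remedies, sketched with scikit-learn; the parameter values and k=5 are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 1. Constrain the parameters of the learning algorithm.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                              random_state=0)

# 2. Remove features before training.
X_reduced = SelectKBest(f_classif, k=5).fit_transform(X, y)

tree.fit(X_reduced, y)
```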
TESTING MORE
How to test multiple algorithms at the same time…
How to test multiple algorithms at the same time?
▪ Loop
▪ Log
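Loop and Log presumably refer to the RapidMiner operators; a rough Python equivalent of the same idea, assuming scikit-learn, a made-up model list, and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

log = []  # the "Log": one entry per algorithm
for name, model in [("LogisticRegression", LogisticRegression()),
                    ("DecisionTree", DecisionTreeClassifier(random_state=0)),
                    ("NaiveBayes", GaussianNB())]:  # the "Loop"
    scores = cross_val_score(model, X, y, cv=10)
    log.append((name, scores.mean()))

print(log)
```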
Don’t do this
Dataset for Project Task(s)
Project Task
▪ Create RapidMiner process(es) for the given task – student dropout prediction
▪ Open Learning Analytics | OU Analyse | Knowledge Media Institute | The Open University
Project Task
▪ courses.csv
▫ code_module – code name of the module, which serves as the identifier.
▫ code_presentation – code name of the presentation. It consists of the year and “B” for the presentation starting in February and “J” for the presentation starting in October.
Project Task
▪ assessments.csv
▫ code_module
▫ code_presentation
▫ id_assessment – identification number of the assessment.
▫ assessment_type – type of assessment. Three types of assessments exist: Tutor Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam (Exam).
▫ date – information about the final submission date of the assessment, calculated as the number of days since the start of the module-presentation.
Project Task
▪ vle.csv
▫ id_site – an identification number of the material.
▫ code_module
▫ code_presentation
▫ activity_type – the role associated with the module material.
▫ week_from – the week from which the material is planned to be used.
▫ week_to – the week until which the material is planned to be used.
Project Task
▪ studentInfo.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ gender
▫ region – identifies the geographic region where the student lived while taking the module-presentation.
▫ highest_education – highest student education level on entry to the module presentation.
Project Task
▪ studentInfo.csv
▫ imd_band – specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.
▫ age_band
▫ num_of_prev_attempts
▫ studied_credits
▫ disability
▫ final_result – student’s final result in the module-presentation.
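Since the project target is dropout, a possible first step is deriving a binary label from final_result; a sketch with pandas, where the labelling rule is an assumption and not something prescribed by the slides:

```python
import pandas as pd

students = pd.read_csv("studentInfo.csv")
# Assumption: "Withdrawn" marks a dropout; whether "Fail" should also
# count is a modelling decision left to the project.
students["dropout"] = (students["final_result"] == "Withdrawn").astype(int)
print(students["dropout"].value_counts())
```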
Project Task
▪ studentRegistration.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ date_registration
▫ date_unregistration
Project Task
▪ studentAssessment.csv
▫ id_assessment
▫ id_student
▫ is_banked – a flag indicating that the result was transferred from a previous presentation.
▫ score – the student’s score in this assessment. The range is from 0 to 100; a score lower than 40 is interpreted as Fail.
Project Task
▪ studentVle.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ id_site
▫ date
▫ sum_click – the number of times the student interacted with the material on that day.
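One way to turn this clickstream into a per-student feature, sketched with pandas; the choice of feature and the join keys are illustrative assumptions:

```python
import pandas as pd

clicks = pd.read_csv("studentVle.csv")
students = pd.read_csv("studentInfo.csv")

# Total clicks per student per module-presentation.
total_clicks = (clicks
                .groupby(["code_module", "code_presentation", "id_student"])
                ["sum_click"].sum()
                .rename("total_clicks")
                .reset_index())

# Join the activity feature onto the student table.
dataset = students.merge(
    total_clicks,
    on=["code_module", "code_presentation", "id_student"],
    how="left")
print(dataset.head())
```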
THANKS!
Any questions?