
Introduction to

Data Mining

Sandro Radovanović
Contact me:
Microsoft
Teams/[email protected]
1.
INTRODUCTION
to Data Mining
Why Data Mining
▪ Credit ratings:
▫ Given hundreds of clients, which are least likely to default?
▪ Customer relationship management:
▫ Which clients are likely to be loyal, and which are most likely to leave for a competitor?
▪ Marketing:
▫ What groups of clients do we have? Can we discover them automatically?
3
Data Mining
▪ Data Mining (DM) is the process of semi-automatically analyzing large databases to find patterns that are:
▫ Valid;
▫ Novel;
▫ Useful; and
▫ Understandable.
4
What is Data Mining?
▪ Finding the most frequent surname of our clients?

5
What is Data Mining?
▪ What products to recommend to a
client?

6
What is Data Mining?
▪ Will a client open a temporary savings account in the next two months?

7
What is Data Mining?
▪ Acquiring additional information about a client from the Internet.

8
What is Data Mining?
▪ Identifying groups of clients by using a domain expert.

9
What is Data Mining?
▪ Identifying groups of clients by their digital usage of bank services.

10
What is Data Mining?
▪ How much money will a client have in their bank account at the beginning of each month next year?

11
Data Mining Tasks
▪ Reduction – selecting data (rows and columns) of interest.
▪ Regression – estimation of a numerical value based on the behaviour of the clients.
▪ Estimation – same as regression, with the addition of time-dependent data.
▪ Classification – predicting the class or category to which a client should be assigned.
▪ Clustering – grouping of clients based on their similarity.
▪ Association Rules – discovering associations (relationships) between products.
12
Note: Data Mining Tasks
▪ Supervised vs. Unsupervised learning (and Reinforcement Learning)
▫ Supervised machine learning requires labelled input and output data during the training phase, while unsupervised learning does not.
▫ In supervised learning, the goal is to predict outcomes for new data. You know up front the type of results to expect.
▫ With an unsupervised learning algorithm, the goal is to get insights from the data. The algorithm itself determines what is different or interesting in the dataset.

13
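As an aside to the slide above (not part of the original deck): a minimal Python sketch contrasting the two settings on the same toy client data. It assumes scikit-learn is available; the course itself uses RapidMiner, and the feature values are invented.

```python
# Supervised vs. unsupervised learning on the same toy client data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy client features: [age, monthly income in thousands] (invented values)
X = np.array([[25, 2.0], [47, 5.5], [31, 3.1], [52, 7.0], [23, 1.8], [44, 6.2]])
y = np.array([1, 0, 1, 0, 1, 0])  # label: 1 = left for a competitor (needed only when supervised)

# Supervised: labelled outcomes are required; the goal is to predict them for new data.
clf = LogisticRegression().fit(X, y)
print("Predicted churn for a new client:", clf.predict([[30, 2.5]]))

# Unsupervised: no labels; the algorithm itself looks for structure (groups) in the data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Discovered client groups:", km.labels_)
```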
Note: Data Mining Tasks
▪ Generative vs. Discriminative Machine Learning Models
▫ Generative models are those that center on the distribution of the classes within the dataset.
▫ Discriminative models learn about the boundary between classes within a dataset.
14
Note: Data Mining Tasks
Generative models:
▪ aim to capture the actual distribution of the classes in the dataset;
▪ predict the joint probability distribution – p(x, y);
▪ are computationally expensive compared to discriminative models.
Discriminative models:
▪ model the decision boundary for the dataset classes;
▪ learn the conditional probability – p(y|x);
▪ are computationally cheap compared to generative models.
15
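A minimal sketch of the comparison above, assuming scikit-learn and synthetic data: Gaussian Naive Bayes as a generative model (it models p(x|y) and p(y), hence the joint p(x, y)) versus logistic regression as a discriminative model (it models p(y|x) directly).

```python
# Generative (GaussianNB) vs. discriminative (LogisticRegression) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_train, y_train)
discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Generative (GaussianNB) test accuracy:    ", generative.score(X_test, y_test))
print("Discriminative (LogReg) test accuracy:    ", discriminative.score(X_test, y_test))
```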
Regression and Estimation
ID A B C OUTPUT

1 10 20 7 25

2 30 15 10 28

3 5 24 16 16

16
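For illustration only (not in the original deck): a Python sketch of the regression task on the tiny table above, assuming scikit-learn; a linear model estimates the numeric OUTPUT from attributes A, B and C.

```python
# Regression: estimate a numeric OUTPUT from attributes A, B, C.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10, 20, 7], [30, 15, 10], [5, 24, 16]])  # columns A, B, C
y = np.array([25, 28, 16])                               # OUTPUT

model = LinearRegression().fit(X, y)
print("Estimated OUTPUT for a new client:", model.predict([[12, 18, 9]]))
```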
Classification
ID A B C OUTPUT

1 10 20 7 Default

2 30 15 10 Not Default

3 5 24 16 Default

17
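Again for illustration only, assuming scikit-learn: the classification task on the table above, predicting the categorical OUTPUT (Default / Not Default) from attributes A, B and C.

```python
# Classification: predict a categorical OUTPUT from attributes A, B, C.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[10, 20, 7], [30, 15, 10], [5, 24, 16]])      # columns A, B, C
y = np.array(["Default", "Not Default", "Default"])          # OUTPUT (class labels)

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Predicted class for a new client:", model.predict([[12, 18, 9]]))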
Clustering
ID A B C

1 10 20 7

2 30 15 10

3 5 24 16

18
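A sketch of the clustering task on the table above, assuming scikit-learn: there is no OUTPUT column, so the algorithm groups the rows by similarity alone.

```python
# Clustering: group rows by similarity, with no labels at all.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[10, 20, 7], [30, 15, 10], [5, 24, 16]])  # columns A, B, C

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assigned to each client:", km.labels_)
```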
Association Rules
ID A B C

1 TRUE FALSE FALSE

2 TRUE TRUE FALSE

3 FALSE TRUE TRUE

19
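A sketch of the association rules task on the TRUE/FALSE basket table above, assuming only pandas: it computes support and confidence for simple one-item rules X -> Y directly, without a dedicated rule-mining library.

```python
# Association rules: support and confidence of one-item rules X -> Y.
import itertools
import pandas as pd

baskets = pd.DataFrame(
    {"A": [True, True, False], "B": [False, True, True], "C": [False, False, True]}
)

n = len(baskets)
for x, y in itertools.permutations(baskets.columns, 2):
    if baskets[x].sum() == 0:
        continue
    both = (baskets[x] & baskets[y]).sum()
    support = both / n                      # P(X and Y)
    confidence = both / baskets[x].sum()    # P(Y | X)
    if support >= 0.3 and confidence >= 0.5:
        print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```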
CRISP-DM
Methodology for Data Mining
projects

20
CRISP-DM

21
CRISP-DM
1. Business Understanding – understanding the goals and tasks of the project: what should be done and what are the limitations.
2. Data Understanding – acquiring data, exploratory analysis, data quality, etc.
3. Data Preparation – preparing the dataset for the data mining task: selection of rows and attributes, transformation of data, data cleansing.
4. Modelling – selection of a model, parameters, etc.
5. Evaluation – performance of the model! Do we solve the problem defined in phase 1?
6. Deployment – putting the model into production.
22
2.
CLASSIFICATION
Let’s start with classification
Classification: Definition
▪ Given a collection of records, find a model for the class attribute as a function of the values of the other attributes.
▪ Goal: assign clients to classes as accurately as possible.

24
Classification: Definition
▪ Given a collection of records
▪ find a model for the class attribute
▪ as a function of the values of the other attributes

25
RapidMiner
▪ Introduction and Classification

26
Classification concepts to
remember (1)
▪ Model
▪ Apply model
▪ Performance

27
Classification concepts to
remember (2)
▪ Model vs. Algorithm

28
OVERFITTING
CONCEPT TO REMEMBER (3)

29
What happened?

30
Overfitting
▪ Overfitting occurs when the prediction model is too complex for the data at hand.
▪ Good fit to the data at hand – poor fit to new data.

31
Underfitting
▪ Underfitting occurs when a model is too simple.
▪ Poor fit to the data at hand – poor fit to new data.

32
Underfitting and Overfitting

33
Underfitting and Overfitting

34
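A minimal sketch of the under-/overfitting slides above, assuming scikit-learn and synthetic data: model complexity is controlled by the depth of a decision tree, and training versus test accuracy show the typical gap.

```python
# Under- and overfitting as a function of decision tree depth.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):  # too simple, reasonable, unlimited (prone to overfit)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")
```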
EVALUATION
CONCEPT TO REMEMBER (4)

35
Evaluation of the
classification model
▪ Confusion matrix
▪ Accuracy
▪ Precision
▪ Recall
▪ AUC

36
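For reference outside RapidMiner: a sketch computing the evaluation measures listed above with scikit-learn, from invented true labels, hard predictions and predicted probabilities.

```python
# Confusion matrix, accuracy, precision, recall and AUC for a toy example.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                     # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]     # predicted probabilities for class 1

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```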
Errors in Classification
problems
▪ Types of error – false positives (FP) and false negatives (FN)
▪ Cost of errors?

37
VALIDATION
CONCEPT TO REMEMBER (5)

38
Validation
▪ Divide the dataset into a training part and a test part.

39
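A minimal sketch of this hold-out validation idea, assuming scikit-learn and synthetic data: the model is learned on the training part only and scored on the unseen test part.

```python
# Hold-out validation: train on one part, evaluate on the other.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # fit on the training part only
print("Accuracy on the unseen test part:", model.score(X_test, y_test))
```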
Validation
▪ Can we do more testing with the same amount of data?

40
Cross-Validation

41
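A sketch of k-fold cross-validation, assuming scikit-learn: every record is used for training and, exactly once, for testing, which gives more test estimates from the same data.

```python
# 10-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print("Per-fold accuracy:", scores.round(2))
print("Mean accuracy    :", scores.mean().round(2))
```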
OVERFITTING
How to deal with it?
CONCEPT TO REMEMBER (6)

42
How to deal with
overfitting?
▪ Change parameters of the learning
algorithm.
▪ Remove features.
▪ Obtain additional data (if available).

43
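An illustrative sketch of two of the remedies above, assuming scikit-learn and synthetic data: constraining the learning algorithm's parameters and removing features (obtaining more data is not shown here).

```python
# Two remedies for overfitting: constrain the model and remove features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           flip_y=0.2, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0)
constrained = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
X_reduced = SelectKBest(f_classif, k=5).fit_transform(X, y)  # keep only the 5 best features

print("Unconstrained tree, all features:", cross_val_score(unconstrained, X, y, cv=5).mean().round(2))
print("Constrained tree, all features  :", cross_val_score(constrained, X, y, cv=5).mean().round(2))
print("Constrained tree, 5 features    :", cross_val_score(constrained, X_reduced, y, cv=5).mean().round(2))
```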
TESTING
MORE
How to test multiple algorithms at
the same time…

44
How to test multiple
algorithms at the same
time?
▪ Loop
▪ Log

45
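The slide above refers to RapidMiner's Loop and Log operators; as a hedged Python analogue (assuming scikit-learn), a single loop can evaluate several algorithms and log their cross-validated scores.

```python
# Loop over several algorithms and log their cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

log = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
                    ("Naive Bayes", GaussianNB())]:
    log[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in sorted(log.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```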
Don’t do this

46
Dataset for Project
Task(s)

47
Project Task
▪ Create RapidMiner process(es) for the given task – student dropout prediction
▪ Open Learning Analytics | OU Analyse | Knowledge Media Institute | The Open University
▪ Try going through each step of the CRISP-DM process.

48
Project Task

49
Project Task
▪ courses.csv
▫ code_module – code name of the
module, which serves as the identifier.
▫ code_presentation – code name of the
presentation. It consists of the year
and “B” for the presentation starting in
February and “J” for the presentation
starting in October.
50
Project Task
▪ assessments.csv
▫ code_module
▫ code_presentation
▫ id_assessment – identification number of the
assessment
▫ assessment_type – type of assessment. Three types of
assessments exist: Tutor Marked Assessment (TMA),
Computer Marked Assessment (CMA) and Final Exam
(Exam).
▫ date – information about the final submission date of
the assessment calculated as the number of days since
the start of the module-presentation.
51
Project Task
▪ vle.csv
▫ id_site – an identification number of the material.
▫ code_module
▫ code_presentation
▫ activity_type – the role associated with the
module material.
▫ week_from – the week from which the material is
planned to be used.
▫ week_to – week until which the material is
planned to be used.
52
Project Task
▪ studentInfo.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ gender
▫ region – identifies the geographic region, where
the student lived while taking the module-
presentation.
▫ highest_education – highest student education
level on entry to the module presentation.
53
Project Task
▪ studentInfo.csv
▫ imd_band – specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.
▫ age_band
▫ num_of_prev_attempts
▫ studied_credits
▫ disability
▫ final_result – student’s final result in the
module-presentation.
54
Project Task
▪ studentRegistration.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ date_registration
▫ date_unregistered

55
Project Task
▪ studentAssessment.csv
▫ id_assessment
▫ id_student
▫ is_banked – transferred from the previous presentation
▫ score – the student’s score in this assessment. The range is from 0 to 100. A score lower than 40 is interpreted as Fail.
56
Project Task
▪ studentVle.csv
▫ code_module
▫ code_presentation
▫ id_student
▫ id_site
▫ date
▫ sum_click – the number of times the student interacts with the material on that day.

57
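As a hypothetical starting point outside RapidMiner (the project expects RapidMiner processes): a pandas sketch that loads two of the files described above, joins them, and derives a binary dropout label. Treating final_result equal to "Withdrawn" as dropout is an assumption; adjust it to your own task definition.

```python
# Load and join two OULAD files and derive a binary dropout label.
import pandas as pd

student_info = pd.read_csv("studentInfo.csv")
registrations = pd.read_csv("studentRegistration.csv")

data = student_info.merge(registrations,
                          on=["code_module", "code_presentation", "id_student"])

# Assumption: "Withdrawn" counts as dropout; Pass, Fail and Distinction do not.
data["dropout"] = (data["final_result"] == "Withdrawn").astype(int)
print(data["dropout"].value_counts())
```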
THANKS!
Any questions?

58
