

Introduction to Data Mining
Working notes for the hands-on course with Orange Data Mining

Welcome to the course on Introduction to Data Mining! You will see how common data mining tasks can be accomplished without programming. We will use Orange to construct visual data mining workflows. Many similar data mining environments exist, but the authors of these notes prefer Orange for one simple reason—they are its authors. For the courses offered by Orange developers, please visit https://orange.biolab.si/training.

If you haven't already installed Orange, please download the installation package from http://orange.biolab.si.

These notes include the Orange workflows and visualizations we will construct during the course. The working notes were prepared by Blaž Zupan and Janez Demšar with help from the members of the Bioinformatics Lab in Ljubljana that develops and maintains Orange.

Attribution-NonCommercial-NoDerivs
CC BY-NC-ND

Lesson 1: Workflows in Orange
Orange workflows consist of components that read,
process and visualize data. We call them “widgets.” We
place the widgets on a drawing board (the “canvas”).
Widgets communicate by sending information along a communication channel. An output from one widget is used as input to another.

We construct workflows by dragging widgets onto the canvas and connecting them by drawing a line from the transmitting widget to the receiving widget. The widget's outputs are on the right and the inputs on the left. In the workflow above, the File widget sends data to the Data Table widget.

The screenshot above shows a simple workflow with two connected widgets and one widget without connections. The outputs of a widget appear on the right, while the inputs appear on the left.

Start by constructing a workflow that consists of a File widget, two Scatter Plot widgets, and two Data Table widgets:

Workflow with a File widget that reads data from disk and sends it to the Scatter Plot and Data Table widgets. The Data Table renders the data in a spreadsheet, while the Scatter Plot visualizes it. Selected data points from the Scatter Plot are sent to two other widgets: Data Table (1) and Scatter Plot (1).

The File widget reads data from your local disk. Open the File widget by double-clicking its icon. Orange comes with several preloaded data sets. From these ("Browse documentation data sets…"), choose brown-selected.tab, a yeast gene expression data set.

Orange workflows often start with a File widget. The brown-selected data set comprises 186 rows (genes) and 81 columns. Out of the 81 columns, 79 contain gene expressions of baker's yeast under various conditions, one column (marked as a "meta attribute") provides gene names, and one column contains the "class" value, or gene function.
After you load the data, open the other widgets. In the Scatter Plot widget, select a few data points and watch as they appear in the Data Table (1) widget. Use a combination of two Scatter Plot widgets, where the second scatter plot shows a detail from a smaller region selected in the first scatter plot.
Following is more of a side note, but it won’t hurt.
Namely, the scatter plot for a pair of random features does
not provide much information on gene function. Does this
change with a different choice of feature pairs in the
visualization? Rank projections (the button on the top left
of the Scatter Plot widget) can help you find a good
feature pair. How do you think this works? Could the
suggested pairs of features be useful to a biologist?
We can connect the output of the Data Table widget to
the Scatter Plot widget to highlight the chosen data
instances (rows) in the scatter plot.

In this workflow, we have turned on the option "Show channel names between widgets" in File→Preferences.

How does Orange distinguish between the primary data source and the data selection? It uses the first connected signal as the entire data set and the second one as its subset. To make changes or to check what is happening under the hood, double click on the line connecting the two widgets.

The rows in the data set we are exploring in this lesson are gene profiles. We can use the Gene Info widget from the Bioinformatics add-on to get more information on the genes we selected in any of the Orange widgets.

Orange comes with a basic set of widgets for data input, preprocessing, visualization and modeling. For other tasks, like text mining, network analysis, and bioinformatics, there are add-ons. Check them out by selecting "Add-ons…" from the options menu.
Lesson 2: Basic Data Exploration
Let us consider another problem, this time from clinical
medicine. We will dig for something interesting in the
data and explore it a bit with various widgets. You will
get to know Orange better and also learn about several
interesting visualizations.

We will start with an empty canvas; to clear it of the widgets from our previous lesson, use either File→New or select all the widgets and remove them (use the backspace/delete key, or Cmd-backspace if you are on a Mac).

Now again, add the File widget and open another documentation data set: heart_disease. How does the data look?

Let us check whether standard visualizations tell us anything interesting. (Hint: look for gender differences. These are always interesting and occasionally even real.)

The two Distributions widgets get different data: the upper gets the selected rows, and the lower gets the others. Double-click the connection between the widgets to access the setup dialog, as you've learned in the previous lesson.

Data can also be split by the value of features — in this case, gender — and analyzed separately.
In the Select Rows widget, we choose the female patients.
You can also add other conditions. Selection of data
instances works well with visualization of data
distribution. Try having at least two widgets open at the
same time and explore the data.
There are two lesser-known — but great — visualizations for observing interactions between features.

You can play with the widget by trying different combinations of 1–4 features.

The mosaic display shows a rectangle split into columns with widths reflecting the prevalence of different types of chest pain. Each column is then further split vertically according to gender distributions within the column. The resulting rectangles are divided again horizontally according to age group sizes. Within the resulting bars, the red and blue areas represent the outcome distribution for each group, and the tiny strip to the left of each shows the overall distribution.

What can you read from this diagram?
Another visualization, Sieve diagram, also splits a
rectangle horizontally and vertically, but with independent
cuts, so the areas correspond to the expected number of
data instances assuming the observed variables are
independent. For example, 1/4 of patients are older than
60, and 1/3 of patients are female, so the area of the
bottom right rectangle is 1/12 of the total area. With
roughly 300 patients, we would expect 1/12 × 300 = 25
older women in our data. There are 34. The Sieve diagram shows the difference between the expected and the observed frequencies through the grid density and the color of the field.
See the Score Combinations button? Guess what it does. And how does it score the combinations? (Hint: there are some Greek letters at the bottom of the widget.)
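The Sieve diagram computes these expected counts for you, but the arithmetic is easy to replicate. Here is a minimal Python sketch — not part of the course workflows, with numbers borrowed from the example above — contrasting expected and observed counts under the independence assumption:

    # Expected vs. observed counts under independence, as in the Sieve diagram.
    n = 300            # total number of patients (approximate, from the text)
    p_old = 1 / 4      # proportion of patients older than 60
    p_female = 1 / 3   # proportion of female patients

    expected = p_old * p_female * n   # 1/12 * 300 = 25
    observed = 34                     # the count reported above

    print(f"expected: {expected:.0f}, observed: {observed}")
    # The sieve encodes this difference with grid density and color.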
Lesson 3: Saving Your Work
If you followed the instructions so far — except for those
about removing widgets — your workflow might look like
this.

You can save it (FileSave) and share it with your


colleagues. Just don't forget to put the data files in the
same directory as the file with the workflow.
Widgets also have a Report button, which you can use to keep a log of your analysis. When you find something interesting, like an unexpected Sieve Diagram, just click Report to add the graph to your log. You can also add reports from the widgets on the path to this one, to make sure you don't forget anything relevant.

One more trick: pressing Ctrl-C (or ⌘-C on a Mac) copies a visualization to the clipboard, so you can paste it into another application.

Clicking on a part of the report also allows you to add a comment.

You can save the report as HTML or PDF, or to a file that includes all workflows related to the report items, which you can later open in Orange. In this way, you and your colleagues can reproduce your analysis results.
Lesson 4: Loading Your Own Data Set
The data sets we have worked with in previous lessons come with the Orange installation. Orange can read data from spreadsheet file formats, including tab-separated, comma-separated, and Excel files. Let us prepare a data set (with school subjects and grades) in Excel and save it on a local disk.

In Orange, we can use the File widget to load this data.

Looks ok. Orange has correctly guessed that student names are character strings and that this column in the data set is special, meant to provide additional information and not to be used for modeling (more about this in the coming lectures). All other columns are numeric features.
It is always good to check if Orange read the data correctly. We can connect our File widget with the Data Table widget and double-click on the Data Table to see the data in spreadsheet format.

Nice, everything is here.

We can also use Google Sheets, a free online spreadsheet alternative. Then, instead of finding the file on the local disk, we would enter its URL address into the File widget's URL entry box.

There is more to input data formatting and loading. We can define the type and kind of a data column, specify that the column is a web address of an image, and more. But enough for the first day. If you would like to dive deeper, check out the documentation page on Loading your Data, or a video on this subject.
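As a small taste of what the documentation covers, here is a sketch of Orange's native tab-separated format: the first row names the columns, an optional second row declares their types (continuous, discrete, string), and an optional third row flags special roles (class, meta). The file below is a hypothetical example, not one of the course data sets; columns are separated by tabs.

    student    english      algebra
    string     continuous   continuous
    meta                    class
    Ana        85           92
    Bor        74           68
    Cene       91           88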
Lesson 5: Classification
In one of the previous lessons, we explored the heart
disease data. We wanted to predict which persons have
clogged arteries — but we did not make any
predictions. Let's try it now.

This won't do: the widget Predictions shows the data, but makes no predictions. It can't. For this, it needs a model. Like this.

The data is fed into the Tree widget, which uses it to infer a predictive model. The Predictions widget now gets the data from the File widget and also a predictive model from the Tree widget. This is something new: in our past workflows, widgets passed only data to each other, but here we have a channel that carries a model.

The Predictions widget uses the model to make predictions about the data and shows them in the table.
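The course does all of this visually, but the same two-step dance — infer a model from data, then apply it — is easy to see in code. Here is a minimal scikit-learn sketch (our choice of library, with the bundled iris data as a stand-in for heart_disease):

    # Infer a tree model from data, then use it for predictions:
    # the code equivalent of the File -> Tree -> Predictions workflow.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    data = load_iris()                       # stand-in for heart_disease
    model = DecisionTreeClassifier().fit(data.data, data.target)  # the "Tree" step
    predictions = model.predict(data.data)   # the "Predictions" step
    print(predictions[:10])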

How correct are these predictions? Do we have a good model? How can we tell?

But (and even before answering these critical questions), what is a tree? What does it look like? How does Orange create one? Is this algorithm something we should use? So many questions to answer today!
Lesson 6: Classification Trees

A classification tree is one of the oldest, but still popular, machine learning methods. We like it since the method is easy to explain and gives rise to random forests, one of the most accurate machine learning techniques (more on this later). So, what kind of model is a classification tree?

Let us load a data set from http://file.biolab.si/datasets/sailing.tab that records the conditions under which a friend skipper went sailing, build a tree, and visualize it in the Tree Viewer.

The data set we will use is stored on a server. Copy the web address and paste it into the URL entry box in the File widget. An alternative way to access this data is to use the Data Sets widget that is currently available in the Prototypes add-on.

Here’s a warning: this sailing


data is small. Therefore, any
relations inferred from the
classification tree on this page
are unreliable. What should the
size of the data set be to acquire
stronger conclusions?

We read the tree from top to bottom. It looks like this skipper is a social person; as soon as there's company, the probability of her sailing increases. When joined by a smaller group of individuals, there is no sailing if there is rain. (Thunderstorms? Too dangerous?) When she has a smaller company, but the boat at her disposal is big, there is no sailing either.
Trees place the most useful feature at the root. What would be the most useful feature? It is the feature that splits the data into the purest possible subsets. These are then split further, again by the most informative features. This process of breaking data subsets into smaller ones repeats until we reach subsets where all the data belongs to the same class. These subsets are represented by leaf nodes in strong blue or red. The process of data splitting can also terminate when it runs out of data instances or out of useful features (the two leaf nodes in white).

The Rank widget can also be used on its own. Say, to figure out which genes are the best predictors of the phenotype in some gene expression data set. Or which experimental conditions to consider to profile the genes and assign their function. Oh, but we have already worked with a data set of this kind. What does Rank tell us about it?

We still have not been very explicit about what we mean by "the most useful" feature. There are many ways to measure this. We can compute some such scores in Orange using the Rank widget, which estimates the quality of data features and ranks them according to how much information they carry. We can compute the scores from the whole data set or from the data corresponding to some node of the classification tree in the Tree Viewer.

In this class, we will not dive into definitions. If you are interested, there's a good explanation of information gain on stackoverflow.com.
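For the curious, the most common of these scores, information gain, compares the class entropy before and after a split. Here is a small Python sketch of the idea (our own illustration with made-up toy data; this is not the Orange API):

    # Information gain = entropy(before split) - weighted entropy(after split).
    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    # Toy data: class labels, and a binary feature that splits them.
    labels  = ["sail", "sail", "sail", "no", "no", "no"]
    feature = ["big",  "big",  "small", "small", "small", "small"]

    before = entropy(labels)
    after = 0.0
    for v in set(feature):
        subset = [l for l, f in zip(labels, feature) if f == v]
        after += len(subset) / len(labels) * entropy(subset)
    print(f"information gain: {before - after:.3f}")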
Lesson 7: Model Inspection
Here’s another interesting combination of widgets: the
classification tree viewer and the scatterplot. This time,
consider the famous Iris data set (comes with Orange).
In the Scatter Plot, find the best visualization of this
data set, that is, the one that best separates the instances
from different classes. Then connect the Tree Viewer to
the Scatterplot. Selecting any node of the tree will
output the corresponding data subset, which will be
shown in the scatter plot.

Wherever possible, visualizations in Orange are designed to support selection and passing of the data that applies to it. Finding interesting data subsets and analyzing their commonalities is a central part of exploratory data analysis, a data analysis approach favored by the data visualization guru Edward Tufte.

Just for fun, we have included a few other widgets in this workflow. In a way, the Tree Viewer widget behaves like the Select Rows widget, except that the rules used to filter the data are inferred from the data itself and optimized to obtain purer data subsets.

Lesson 8: Classification Accuracy

Now that we know what classification trees are, the next question is what is the quality of their predictions. To begin, we need to define what we mean by quality. In classification, the simplest measure of quality is classification accuracy, expressed as the proportion of data instances for which the classifier correctly guessed the value of the class. Let's see if we can estimate, or at least get a feeling for, classification accuracy with the widgets we already know.

Measuring accuracy is such an important concept that it would deserve its own widget. But wait a while, there's educational value in reusing the widgets we already know.

Let us try this schema with the brown-selected data set. The Predictions widget outputs a data table augmented with a column that includes the predictions. In the Data Table widget, we can sort the data by either of these two columns and manually select the data instances where the values of these two features differ (this would not work on big data). Roughly estimating the accuracy of the predictions visually is straightforward in the Distributions widget, if we set the features in view appropriately.
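In code, this visual estimate collapses into one line: accuracy is the share of instances where the prediction matches the true class. A sketch (scikit-learn as a stand-in for the Orange workflow, iris in place of brown-selected):

    # Classification accuracy: the proportion of correctly predicted instances.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier().fit(X, y)
    accuracy = (tree.predict(X) == y).mean()   # compare predictions to true classes
    print(f"accuracy on the data the tree was fit on: {accuracy:.2f}")
    # Beware: this is accuracy on training data — the next lessons show why
    # that number is deceptively high.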


Lesson 9: How to Cheat

This lesson has a strange title, and it is not obvious why it was chosen. Maybe you, the reader, should tell us what this lesson has to do with cheating.

At this stage, the classification tree looks very good. There's only one data point where it makes a mistake. Can we mess up the data set so badly that the trees will ultimately fail? Like, remove any existing correlation between gene expression profiles and class? We can! There's the Randomize widget that can shuffle the class column. Check out the chaos it creates in the Scatter Plot visualization, where there were nice clusters before randomization!
The Randomize widget shuffles a column in the data table. It can shuffle the class column, columns with data features, or columns with meta information. Shuffling the class column breaks any relation between the features and the class, keeping the data points (gene profiles) intact.

Why is the background in this scatter plot so green, and only green? Why have the other colors disappeared after the class randomization?

Fine. There can be no classifier that can model this mess, right? Let us test this. We will build a classification tree and check its performance on the messed-up data set.


And the result? Here is a screenshot of the Distributions:

At this stage, it may be worthwhile checking how the trees look. Try comparing the trees inferred from the original and the shuffled data!

Most unusual. Almost no mistakes. How is this possible? On a class-randomized data set?

To find the answer to this riddle, open the Tree Viewer and check out the tree. How many nodes does it have? Are there many data instances in the leaf nodes?

It looks like the tree just memorized every data instance from the data set. No wonder the predictions were right. The tree makes no sense, and it is complex because it simply remembered everything.

Ha, if this is so — that is, if a classifier remembers everything from a data set without discovering any general patterns — it should perform miserably on any new data set. Let us check this out. We will split our data set into two sets, training and testing, train the classification tree on the training data set, and then estimate its accuracy on the test data set.

The signals from the Data Sampler widget have not been named in our workflow to save space. The Data Sampler splits the data into a sample and the out-of-sample (so-called remaining) data. The sample was given to the Tree widget, while the remaining data was handed to the Predictions widget. Set the Data Sampler so that the sizes of these two data sets are about equal.
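The whole effect fits in a few lines of code. A sketch (scikit-learn and synthetic random labels as our stand-ins for the Orange workflow) in which a tree scores perfectly on the data it memorized and at chance level on data it has never seen:

    # A tree memorizes randomized labels: perfect on seen data, chance on unseen.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(186, 79))       # shapes echo brown-selected
    y = rng.integers(0, 3, size=186)     # class shuffled away: pure noise

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
    tree = DecisionTreeClassifier().fit(X_train, y_train)
    print("train accuracy:", tree.score(X_train, y_train))   # ~1.0: memorized
    print("test accuracy:",  tree.score(X_test, y_test))     # ~0.33: guessing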


Let’s check how the Distributions widget looks after


testing the classifier on the test data.
It turns out that for every class value, the majority of data instances has been predicted to be in the ribosomal class (green). Why? Green again (like the green from the Scatter Plot of the messed-up data)? Here is a hint: use the Box Plot widget to answer this question.

The first two classes are a complete fail. The predictions for ribosomal genes are a bit better, but still with lots of mistakes. On the class-randomized training data, our classifier fails miserably.
Finally, this is just as we would expect.
To test the performance (accuracy) of a classification technique, we have just learned that we need to train the classifier on the training set and then test it on a separate test set. With this test, we can distinguish between the classifiers that just memorize the training data and those that learn a useful model.
Learning is not only remembering. Rather, it is discovering patterns that govern the data and apply to new data as well. To estimate the accuracy of a classifier, we therefore need a separate test set. This assessment should not depend on just one division of the input data set into training and test sets (here's a place for cheating as well). Instead, we need to repeat the process of estimation several times, each time on a different train/test split, and report the average score.

We needed to class-randomize only the training data set for the predictions to fail. Try changing the workflow so that the classes are randomized only there, and not in the test set.

Lesson 10: Cross-Validation
Estimating the accuracy may depend on a particular split
of the data set. To increase robustness, we can repeat the
measurement several times, each time choosing a
different subset of the data for training. One such method
is cross-validation. It is available in Orange through the
Test & Score widget.

Note that in each iteration, Test & Score will pick part of
the data for training, learn the predictive model on this
data using some machine learning method, and then test
the accuracy of the resulting model on the remaining, test
data set. For this, the widget will need on its input a data
set from which it will sample data for training and testing,
and a learning method which it will use on the training
data set to construct a predictive model. In Orange, the
learning method is simply called a learner. Hence, Test &
Score needs a learner on its input. A typical workflow
with this widget is as follows.
This is another way to use the Tree widget. In the workflows from the previous lessons we have used another of its outputs, called Model; its construction required the data. This time, no data is needed for Tree, because all that we need from it is a learner.

For geeks: a learner is an object that, given the data, outputs a classifier. Just what Test & Score needs.

Here is how the Test & Score widget looks. CA stands for classification accuracy, and this is what we really care about for now. We will talk about other measures, like AUC, later.

Cross-validation splits the data set into, say, 10 non-overlapping subsets we call folds. In each iteration, one fold will be used for testing, while the data from all other folds will be used for training. In this way, each data instance will be used for testing exactly once.
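The same procedure in code, with scikit-learn's cross_val_score standing in for Test & Score (our choice of library; the course itself stays in Orange):

    # 10-fold cross-validation: each instance is used for testing exactly once.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    learner = DecisionTreeClassifier()              # a learner, not yet a model
    scores = cross_val_score(learner, X, y, cv=10)  # fit and score on 10 folds
    print(f"CA: {scores.mean():.2f} (+/- {scores.std():.2f})")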
Lesson 11: A Few More Classifiers
We have ended the previous lesson with cross-validation
and classification trees. There are many other, much more
accurate classifiers. A particularly interesting one is
Random Forest, which averages across predictions of
hundreds of classification trees. It uses two tricks to
construct different classification trees. First, it infers each
tree from a sample of the training data set (with
replacement). Second, instead of always choosing the most informative feature for each split, it selects the best feature from a random subset of features. In this way, it
randomizes the tree inference process. Think of each tree
shedding light on the data from a different perspective.
Just like in the wisdom of the crowd, an ensemble of trees
(called a forest) usually performs better than a single tree.

Let us see if this is really so. We give two learners to the Test & Score widget and check if the cross-validated classification accuracy is indeed higher for random forest. Choose different classification data sets for this comparison, starting with those we already know (heart disease, iris, brown-selected).
It may be interesting to compare where different
classification methods make mistakes. We can use
Confusion Matrix for this purpose, and then pass the
signal from this widget to the Scatter Plot.
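A code sketch of the same comparison (again with scikit-learn as our stand-in), cross-validating a tree and a random forest and printing the forest's confusion matrix:

    # Compare a single tree with a random forest, then inspect the mistakes.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    for learner in (DecisionTreeClassifier(), RandomForestClassifier()):
        ca = cross_val_score(learner, X, y, cv=10).mean()
        print(type(learner).__name__, f"CA: {ca:.2f}")

    # Cross-validated predictions feed a confusion matrix, as in Orange.
    predicted = cross_val_predict(RandomForestClassifier(), X, y, cv=10)
    print(confusion_matrix(y, predicted))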

What kind of object is sent from the Test & Score widget to the Confusion Matrix widget? So far, we have used widgets that send data, or even learners. But what could the Test & Score widget communicate to other widgets?

There are other classifiers we can try. We will briefly mention a few more, but instead of diving into what they do (we could spend a semester on this!), we'll pass on to other important topics in data mining. At this point, just add them to the workflow above and see how they perform.

It would be nice if we could, at least on an intuitive level, understand the differences between all these methods and their variants (every method has some parameters). Remember, the classification tree finds hyperplanes orthogonal to the axes; those hyperplanes split the data space into regions with different class probabilities. The tree's decision boundaries are flat. Nearest neighbors classifies a data instance according to the few neighboring data instances in the training set. Decision boundaries with this approach can be very complex. Logistic regression infers just one hyperplane (decision boundary) in an arbitrary direction. This is similar to support vector machines with a linear kernel, but then again, the kernels with SVM can be changed, resulting in more complex decision boundaries.
Ok, we have to admit: the above paragraph reads almost
like gibberish. We would need a workflow where we
could actually see the decision boundaries. And perhaps
invent the data sets to test the classifiers. Best in 2D.
Maybe, for a start, we could just paint the data. Time to
stop writing this long passage of text, end the suspense,
and construct a workflow that does this all.

Be creative when painting the data! Also, instead of SVM, use different classifiers. Also, try changing the parameters of the classifiers. Like, limit the depth of the decision tree to 2, or 3, or 4. Or switch from SVM with a linear kernel to the radial basis function. Appropriately set up the scatter plot to observe the changes.
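If you would rather paint with code, here is a sketch (scikit-learn and matplotlib, our assumed libraries) that evaluates a classifier on a grid of points and draws its decision regions — the programmatic cousin of painting data and watching the boundaries change:

    # Visualize a classifier's decision regions on a 2D data set.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # "painted" data
    model = SVC(kernel="rbf").fit(X, y)     # try a linear kernel, too

    # Predict the class on a dense grid and color the regions.
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
        np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.contourf(xx, yy, zz, alpha=0.3)                # decision regions
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")  # the painted points
    plt.show()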
Lesson 12: A Sneaky Way to Cheat

Consider a typical gene expression data set, where we have samples in rows and gene expressions in columns. These data sets are usually fat: there are many more genes than samples. Fat data sets are almost typical for systems biology. When samples are labeled with a phenotype and our task is phenotype classification, many features (genes) will be irrelevant, and most often only a few will be highly correlated with the class. So why not simply first select a set of the most informative features, and then do the whole analysis? At least cross-validation will then work much faster, as the model inference algorithms will deal with much smaller data tables. Cool. What a nice trick! Let's try it out in the following workflow.

The gene expression data set we will use was borrowed from the Gene Expression Omnibus. There is a special widget in the Orange bioinformatics add-on that we could use to fetch this and similar data sets. Instead, we will here rely on a GEO data set that is preloaded in Orange: geo-gds360. Use File and then "Browse documentation data sets".

The workflow above uses the data preprocessing widget, which we have configured to select the 5 most informative features.

Observe the classification accuracy obtained on the original data set, and on the data set with the five best selected features. What is happening? Why?
Lesson 13: Cheating Works Even on Randomized Data
We can push the example from our previous lesson to the
extreme. We will randomize the classification data. That
is, we will take the column with the class values and
randomly permute it. We will use the Randomize widget to
do this.

Later, we will do classification on this data set. We expect really low classification accuracy on the randomized data set. Then, we will select the five features that are most associated with the class. Even though we have randomly permuted the classes, there have to be some features that are weakly correlated with the class — simply because we have tens of thousands of features and only a few samples. There are enough features that some of them correlate with the class purely by chance. Finally, we will score a random forest on the randomized data set with the selected features.

Compare the scores reported by cross-validation on the different data sets in this pipeline. Why is the accuracy in the final one rather high? Would adding more "most informative features" improve or degrade the cross-validated performance on a randomized data set?

Instead of selecting the five most informative features, you can reduce this number even further. Say, to the two most informative features. What happens? Why does the accuracy rise after this change?
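The cheat is easy to reproduce in code. In the sketch below (scikit-learn, with synthetic random data as our stand-in for the GEO set), features are selected on the whole data set before cross-validation — and accuracy on pure noise comes out far above chance:

    # Feature selection BEFORE cross-validation leaks information.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10000))    # fat data: few samples, many features
    y = rng.integers(0, 2, size=30)     # randomized class: no real signal

    X_best = SelectKBest(f_classif, k=5).fit_transform(X, y)  # the sneaky step
    score = cross_val_score(RandomForestClassifier(), X_best, y, cv=10).mean()
    print(f"accuracy on pure noise: {score:.2f}")  # way above the 0.5 chance level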
Lesson 14: How to Correctly Perform Test and Score

To put it simply: never, in any way, transform the data prior to cross-validation. Any transformation should happen within the cross-validation loop, first on the training set, and then, if required, on the test set. In a relaxed form: it's ok to transform the data, but the transformation should be done independently on the train and the test set, and the transformation of the test set should in no way use information about the class value. Data imputation could be an example of such an operation, but again it should be carried out separately for the train and test sets and should not consider the classes.

The writing above looks straightforward. But actually one needs to be extremely careful not to succumb to overfitting when reporting the results of cross-validation tests. The literature on systems biology is polluted with reports of overly optimistic results, and high impact factors provide no guarantee that studies were carried out correctly (in fact, due to a lack of reviewers from the field of machine learning, mistakes likely stay overlooked). Simon et al. (2003) provides a great read on this topic: many of the early papers in gene expression analysis reported high accuracy simply due to overfitting.

But how do we then correctly apply preprocessing in Orange? The idea of reducing the number of features prior to inferring a predictive model may still be appealing, now that we know we can use it on the training data sets (leaving the test set alone). Following are two workflows that do this correctly.

In this first workflow, we gave the Test & Score widget a preprocessor (feature selection was used in this example). The Test & Score widget uses it correctly, only on the training sets. This type of workflow is preferred if we would like to test the effect of preprocessing on a number of different learning algorithms.
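In scikit-learn terms (again our stand-in), the equivalent of handing Test & Score a preprocessor is a Pipeline: the selector is then re-fit on each training fold only, and the inflated accuracy from the previous lesson evaporates.

    # Feature selection INSIDE cross-validation: fit on training folds only.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10000))    # the same noise data as before
    y = rng.integers(0, 2, size=30)

    learner = make_pipeline(SelectKBest(f_classif, k=5), RandomForestClassifier())
    score = cross_val_score(learner, X, y, cv=10).mean()
    print(f"accuracy on pure noise: {score:.2f}")  # back to ~0.5, as it should be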
Alternatively, we can include a preprocessor in the learning method. The preprocessor is then called on the training data set just before the learner performs the inference of the predictive model.

The Preprocess widget does not necessarily require a data set on its input. An alternative use of this widget is to output a method for data preprocessing, which we can then pass to either a learning method or to a widget for cross-validation.

This is not the first time we have used a widget that passes forward a computation method instead of data. All the learners, like Random Forest, do so. A learner can get data on its input and pass a classifier to its output, or simply pass an instance of itself — that is, pass a learning algorithm — to whichever widget can use it. For instance, to the Test & Score widget.

Can you extend this workflow to such an extent that you can test both a learner with preprocessing by feature subset selection and the same learner without this preprocessing? How does the number of selected features affect the cross-validated accuracies? Does the success of this particular combination of machine learning techniques depend on the input data set? Does it work better for some machine learning algorithms? Try its performance with the k-nearest neighbors learner (warning: use small data sets, this classifier could be very slow).

Somehow, in a shy way, we have also introduced a technique for feature selection, and pointed to its possible utility for classification problems. Feature subset selection, or FSS for short, was and still is, to some extent, an important topic in machine learning. Modern classification algorithms, though, perform it implicitly, and can deal with a large number of features without the help of external procedures for their advance selection. Random forest is one such technique.
Lesson 15: Model Scoring
In multiple choice exams, you are graded according to
the number of correct answers. The same goes for
classifiers: the more correct predictions they make, the
better they are. Nothing could make more sense. Right?

Maybe not. Dr. Smith is a specialist of a kind, and his diagnosis is correct in 98% of the cases. Would you consider visiting him if you had some symptoms related to his specialty?

Not necessarily. His specialty, in fact, is rare diseases (2 out of 100 of his patients have them) and, being lazy, he always dismisses everybody as healthy. His predictions are worthless — although extremely accurate. Classification accuracy is not an absolute measure that can be judged out of context. At the very least, it has to be compared with the frequency of the majority class, which is, in the case of rare diseases, quite … major.

For instance, on the GEO data set GDS 4182, the classification tree achieves 78% accuracy in cross-validation, which may seem reasonably good. Let us compare this with the Constant model, which implements Dr. Smith's strategy by always predicting the majority. It gets 83%. Classification trees are not so good after all, are they?

On the other hand, their accuracy on GDS 3713 is 57%, which seems rather good in comparison with the 50% achieved by predicting the majority.

What do the other columns represent? Keep reading!

The problem with classification accuracy goes deeper, though.

Classes versus probabilities estimated by logistic regression. Can you replicate this image?

Classifiers usually make predictions based on probabilities they compute. If a data instance belongs to class A with a probability of 80% and to B with a probability of 20%, it is classified as A. This makes sense, right?

Maybe not, again. Say you fall down the stairs and your leg hurts. You open Orange, enter some data into your favorite model and compute a 20% chance of having your leg broken. So you assume your leg is not broken and you take an aspirin. Or perhaps not?

What if the chance of a broken leg was just 10%? 5%? 0.1%?

Say we decide that any leg with a 1% chance of being broken will be classified as broken. What will this do to our classification accuracy? It is going to decrease badly — but we apparently do not care. What do we care about then? What kind of "accuracy" is important?

Not all mistakes are equal. We can summarize them in the Confusion Matrix. Here is one for logistic regression on the heart disease data.
The numbers in the Confusion Matrix have names. An instance can be classified as positive or negative; imagine this as being positive or negative when tested for some medical condition. This classification can be true or false. So there are four options: true positive (TP), false positive (FP), true negative (TN) and false negative (FN).

Identify them in the table!

Use the output from the Confusion Matrix as a subset for the Scatter Plot to explore the data instances that were misclassified.

Logistic regression correctly classifies 147 healthy persons and 110 of the sick — the numbers on the diagonal. Classification accuracy is then 257 out of 303, which is 85%.

17 healthy people were unnecessarily scared. The opposite error is worse: the heart problems of 29 persons went undetected. We need to distinguish between these two kinds of mistakes.

We are interested in the probability that a person who has some problem will be correctly diagnosed. There were 139 such cases, and 110 were discovered. The proportion is 110 / 139 = 0.79. This measure is called sensitivity, or recall, or the true positive rate (TPR).

If you were interested only in sensitivity, though, here’s Dr.


Smith’s associate partner — wanting to be on the safe side,
she considers everybody ill, so she has a perfect sensitivity of
1.0.

To counterbalance the sensitivity, we compute the opposite: what is the proportion of correctly classified negative instances? 147 out of 164, that is, 90%. This is called specificity, or the true negative rate.

So, if you're classified as OK, you have a 90% chance of actually being OK? No, it's the other way around: 90% is the chance of being classified as OK, if you are OK. (Think about it, it's not as complicated as it sounds.) If you're interested in your chance of being OK if the classifier tells you so, you look for the negative predictive value. Then there's also precision, the probability of being positive if you're classified as such. And the fall-out and the negative likelihood ratio and … a whole list of other indistinguishable fancy names, each useful for some purpose.

If you are interested in a complete list, see the Wikipedia page on Receiver operating characteristic, https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
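All these scores derive from the same four confusion-matrix cells. A small sketch that reproduces the numbers of this lesson (147, 17, 29 and 110, taken from the text above):

    # Scores derived from the confusion matrix of the heart disease example.
    TN, FP = 147, 17    # healthy: correctly cleared / unnecessarily scared
    FN, TP = 29, 110    # sick: missed / correctly detected

    accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 257/303, about 0.85
    sensitivity = TP / (TP + FN)                    # recall, TPR: 110/139, ~0.79
    specificity = TN / (TN + FP)                    # TNR: 147/164, ~0.90
    precision   = TP / (TP + FP)                    # 110/127, ~0.87

    print(accuracy, sensitivity, specificity, precision)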

Lesson 16: Choosing the Decision Threshold
The common property of scores from the previous
lesson is that they depend on the threshold we choose
for classifying an instance as positive. By adjusting it,
we can balance between them and find, say, the
threshold that gives us the required sensitivity at an
acceptable specificity. We can even assign costs
(monetary or not) to different kinds of mistakes and
find the threshold with the minimal expected cost.

A useful tool for this is the Receiver Operating Characteristic curve. Don't mind the meaning of the name, just call it the ROC curve.

Here are the curves for logistic regression, SVM with a linear kernel, and the naive Bayesian classifier on the same ROC plot.

The curves show how the sensitivity (y-axis) and specificity (x-axis, but from right to left) change with different thresholds.
There exists, for instance, a threshold for logistic regression (the green curve) that gives us 0.65 sensitivity at 0.95 specificity (the curve shows 1 - specificity). Or 0.9 sensitivity with a specificity of 0.8. Or a sensitivity of (almost) 1 with a specificity of somewhere around 0.3.

Sounds complicated? If it helps: perhaps you remember the term parametric curve from some of your math classes. ROC is a parametric curve where x and y (1 - specificity and the sensitivity) are a function of the same parameter, the decision threshold.
The optimal point would be at the top left. The diagonal represents the behavior of a random guessing classifier. Which of the three classifiers is the best now? It depends on the specificity and sensitivity we want; at some points we prefer logistic regression and at some points the naive Bayesian classifier. SVM doesn't cut it anywhere.

There is a popular score derived from the ROC curve, called the Area Under Curve, AUC. It measures, well, the area under this curve. If the curve goes straight up and then right, the area is 1; such an optimal AUC is not reached in practice. If the classifier guesses at random, the curve follows the diagonal and the AUC is 0.5. Anything below that is equivalent to guessing + bad luck.

ROC curves and AUC are fascinating tools. To learn more, read T. Fawcett: ROC Graphs: Notes and Practical Considerations for Researchers.

AUC has a kind of absolute scale. As a rule of thumb: 0.6 is bad, 0.7 is bearable, 0.8 is publishable and 0.9 is suspicious.

AUC also has a nice probabilistic interpretation. Say that we are given two data instances and we are told that one is positive and the other is negative. We use the classifier to estimate the probabilities of being positive for each instance, and decide that the one with the higher probability is positive. It turns out that the probability that such a decision is correct equals the AUC of this classifier. Hence, AUC measures how well the classifier discriminates between the positive and negative instances.

From another perspective: if we use a classifier to rank data instances, then an AUC of 1 signifies a perfect ranking, an AUC of 0.5 a random ranking, and an AUC of 0 a perfectly reversed ranking.
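Here is a sketch (scikit-learn, our stand-in, with the bundled breast cancer data in place of heart disease) that computes a ROC curve and its AUC from cross-validated probability estimates — the code analogue of what the ROC plot in this lesson shows:

    # ROC curve and AUC from cross-validated class probabilities.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import cross_val_predict

    X, y = load_breast_cancer(return_X_y=True)
    probs = cross_val_predict(LogisticRegression(max_iter=5000), X, y,
                              cv=10, method="predict_proba")[:, 1]

    fpr, tpr, thresholds = roc_curve(y, probs)   # one curve point per threshold
    print(f"AUC: {roc_auc_score(y, probs):.2f}")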
Lesson 17: Linear Regression
For a start, let us construct a very simple data set. It will contain just one continuous input feature (let's call it x) and a continuous class (let's call it y). We will use Paint Data, and then reassign one of the features to be a class by using Select Columns and moving the feature y from the list of "Features" to the target variable field. It is always good to check the results, so we are including a Data Table and a Scatter Plot in the workflow at this stage. We will be modest this time, paint only about 10 points, and use Put instead of the Brush tool.

In the Paint Data widget, remove the Class-2 label from the list. If you have accidentally left it there while painting, don't despair. The class variable will appear in the Select Columns widget, but you can "remove" it by dragging it into the Available Variables list.

We would like to build a model that predicts the value of the class y from the feature x. Say that we would like our model to be linear, expressed mathematically as h(x) = θ0 + θ1x. Oh, this is the equation of a line. So we would like to draw a line through our data points. The θ0 is then the intercept, and θ1 is the slope. But there are many different lines we could draw. Which one is the best one? Which one fits our data the most?
The question above requires us to define what a good fit is. Say, this could be the error the fitted model (the line) makes when it predicts the value of y for a given data point (value of x). The prediction is h(x), so the error is h(x) - y. We should treat the negative and positive errors equally; plus, let us agree, we would prefer punishing larger errors more severely than smaller ones. Therefore, it is perfectly ok if we square the errors for each data point and then sum them up. We got our objective function! It turns out that there is only one line that minimizes this function. The procedure that finds it is called linear regression. For cases where we have only one input feature, Orange has a special widget in the Educational add-on called Polynomial Regression.

Do not worry about the strange name of the widget Polynomial Regression; we will get there in a moment.

Looks ok. Except that these data points do not appear exactly on the line. We could say that the linear model is perhaps too simple for our data set. Here is a trick: besides the column x, the Polynomial Regression widget can add columns x², x³, …, xⁿ to our data set. The number n is the degree of the polynomial expansion the widget performs. Try setting this number to higher values, say to 2, and then 3, and then, say, to 9. With a degree of 3, we are then fitting the data to the function h(x) = θ0 + θ1x + θ2x² + θ3x³.
The trick we have just performed (adding the higher order
features to the data table and then performing linear
regression) is called Polynomial Regression. Hence the
name of the widget. We get something reasonable with
polynomials of degree 2 or 3, but then the results get
really wild. With higher degree polynomials, we totally
overfit our data.
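The same trick in code (numpy and scikit-learn as our stand-in for the Polynomial Regression widget): expand x into its powers, then run ordinary linear regression on the expanded table.

    # Polynomial regression = feature expansion + ordinary linear regression.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=(10, 1))              # ten "painted" points
    y = 2 * x.ravel() + rng.normal(0, 0.1, size=10)

    for degree in (1, 3, 9):
        X = PolynomialFeatures(degree).fit_transform(x)   # columns x, x^2, ..., x^n
        model = LinearRegression().fit(X, y)
        print(degree, f"training error: {((model.predict(X) - y) ** 2).sum():.4f}")
    # The degree-9 model drives the training error toward zero — by overfitting.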

It is quite surprising to see that a linear regression model can fit non-linear (univariate) functions. That is, functions with curves: the model is actually a hyperplane (a flat surface) in the space of many features (columns) that are powers of x. So for degree 2, h(x) = θ0 + θ1x + θ2x² is a (flat) hyperplane. The visualization gets curvy only once we plot h(x) as a function of x.

Overfitting is related to the complexity of the model. In polynomial regression, the models are defined through the parameters θ. The more parameters, the more complex the model. Obviously, the simplest model has just one parameter (an intercept), ordinary linear regression has two (an intercept and a slope), and polynomial regression models have as many parameters as the degree of the polynomial. It is easier to overfit with a more complex model, as this can adjust to the data better. But is the overfitted model really discovering the true data patterns?

Which of the two models depicted in the figures above would you trust more?
Lesson 18: Regularization

There has to be some cure for overfitting. Something that helps us control it. To find it, let's check what the values of the parameters θ under different degrees of polynomials actually are.

With lower degree polynomials, the values of θ stay small, but as the degree goes up, the numbers get really large.

More complex models can fit the training data better. The fitted curve can wiggle sharply. The derivatives of such functions are high, and so need to be the coefficients θ. If only we could force the linear regression to infer models with small values of the coefficients. Oh, but we can. Remember, we started with the optimization function that linear regression minimizes — the sum of squared errors. We could simply add to this the sum of all θ squared, and ask the linear regression to minimize both terms. Perhaps we should weigh the part with θ squared with some coefficient λ, just to control the level of regularization.

Which inference of the linear model would overfit more, the one with a high λ or the one with a low λ? What should the value of λ be to cancel regularization? What if the value of λ is really high, say 1000?
Here we go: we just reinvented regularization, a procedure that helps machine learning models not to overfit the training data. To observe the effects of regularization, we can give Polynomial Regression our own learner, one which supports this kind of setting.

Internally, if no learner is present on its input, the Polynomial Regression widget uses just ordinary, non-regularized linear regression.

The Linear Regression widget provides two types of regularization. Ridge regression is the one we have talked about: it minimizes the sum of squared coefficients θ. Lasso regression minimizes the sum of the absolute values of the coefficients. Although the difference may seem negligible, the consequence is that lasso regression may set a large proportion of the coefficients θ to exactly zero, in this way performing feature subset selection.

Now for the test. Increase the degree of the polynomial to the max. Use ridge regression. Does the inferred model overfit the data? How does the degree of overfitting depend on the regularization strength?
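In code (scikit-learn as our stand-in), regularization is one argument away: alpha plays the role of λ. The sketch below fits a degree-9 polynomial with and without ridge regularization and prints the coefficient magnitudes this lesson talks about.

    # Ridge regularization keeps the polynomial coefficients small.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=(10, 1))
    y = 2 * x.ravel() + rng.normal(0, 0.1, size=10)
    X = PolynomialFeatures(degree=9).fit_transform(x)

    plain = LinearRegression().fit(X, y)     # no regularization
    ridge = Ridge(alpha=1.0).fit(X, y)       # alpha plays the role of lambda
    print("max |theta|, plain:", np.abs(plain.coef_).max())
    print("max |theta|, ridge:", np.abs(ridge.coef_).max())
    # Try Lasso(alpha=...) as well — many coefficients drop to exactly zero.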
Lesson 19: Regularization and Accuracy on the Test Set

Overfitting hurts. Overfit models fit the training data well, but can perform miserably on new data. Let us observe this effect in regression. We will use a hand-painted data set, split it into a training (50%) and a test (50%) data set, polynomially expand the training data set to enable overfitting, build a model on it, and test the model on both the (seen) training data and the (unseen) held-out data.

Paint about 20 to 30 data instances. Use the attribute y as the target variable in Select Columns. Split the data 50:50 in Data Sampler. Cycle between testing on the train or the test data in Test & Score. Use ridge regression to build the linear regression model.

Now we can vary the regularization strength in Linear Regression and observe the accuracy in Test & Score. For accuracy scoring, we will use RMSE, the root mean squared error, which is computed by observing the error for each data point, squaring it, averaging this across all the data instances, and taking the square root.

The core of this lesson is to compare the error on the training and test set while varying the level of regularization. Remember, regularization controls overfitting — the more we regularize, the less tightly we fit the model to the training data. So for the training set, we expect the error to drop with less regularization and more overfitting, and to increase with more regularization and less fitting. No surprises expected there. But how does this play out on the test set? Which side minimizes the test-set error? Or is the optimal level of regularization somewhere in between? How do we estimate this level of regularization from the training data alone?

Orange is currently not equipped with parameter fitting, so we need to find the optimal level of regularization manually. At this stage, it suffices to say that the parameters must be found on the training data set, without touching the test data.
Lesson 20: Prediction of Tissue Age from Level of Methylation

Enough painting. Now for the real data. We will use a data set that includes human tissues from subjects of different ages. The tissues were profiled by measurements of DNA methylation, a mechanism by which cells regulate gene expression. Methylation of DNA is scarce when we are young and gets more abundant as we age. We have prepared a data set where the degree of methylation is expressed per gene. Let us test if we can predict age from the methylation profile — and if we can do this better than by just predicting the average age of the subjects in the training set.

Download the methylation data set from http://file.biolab.si/files/methylation.tab. Predictions of age from methylation profiles were investigated by Horvath (2013) Genome Biology 14:R115.

This workflow looks familiar and is similar to those for classification problems. The Test & Score widget reports statistics we have not seen before. MAE, for one, is the mean absolute error. Just like for classification, we have used cross-validation, so MAE was computed only on the test data instances and averaged across the 10 runs of cross-validation. The results indicate that our modeling technique misses the age by about 5 years, which is a much better result than predicting by the mean age in the training set.

Using other learners, like random forests, takes a while on this data set. But you may try to sample the features, obtain a smaller data set, and try various regression learners.
Lesson 21: Evaluating Regression

The last lessons quickly introduced scoring for regression, and important measures such as RMSE and MAE. In classification, a nice addition for finding misclassified data instances was the confusion matrix. But the confusion matrix can only be applied to discrete classes. Until Orange gets something similar for regression, one way to find badly predicted data instances is through the scatter plot!

This workflow visualizes the predictions that were performed on the training data. How would you change the workflow to use a separate test set? Hint: the Data Sampler widget can help.

We can play around with this workflow by painting the data such that the regression performs well on the blue data points and fails on the red outliers. In the scatter plot, we can check if the difference between the predicted and true class was indeed what we expected.
A similar workflow would work for any data set. Take, for instance, the housing data set (from the Orange distribution). Say, just like above, we would like to plot the relation between the true and predicted continuous class, but would also like to add information on the absolute error the predictor makes. Where does the error come from? We need a new column. The Feature Constructor widget (albeit a bit geekish) comes to the rescue.
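In code (pandas and scikit-learn as our stand-ins, with California housing data in place of the Orange housing set), the same new column is one assignment:

    # Add an "absolute error" column to inspect where a regressor goes wrong.
    import pandas as pd
    from sklearn.datasets import fetch_california_housing  # stand-in data set
    from sklearn.linear_model import LinearRegression

    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    model = LinearRegression().fit(X, y)

    table = pd.DataFrame({"true": y, "predicted": model.predict(X)})
    table["abs_error"] = (table["true"] - table["predicted"]).abs()  # new column
    print(table.sort_values("abs_error", ascending=False).head())   # worst cases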

In the Scatter Plot widget, we can now select the data where the predictor erred substantially and explore the results further.

We could, in principle, also mine the errors to see if we can identify data instances for which the error was high. But then, if this were so, we could have improved the predictions in such regions. Like, construct predictors that predict the error. This is weird. Could we then also construct a predictor that predicts the error of the predictor that predicts the error? Strangely enough, such ideas have recently led to something called Gradient Boosted Trees, which are nowadays among the best regressors (and are coming to Orange soon).

Lesson 22: Feature Scoring and Selection

Linear regression infers a model that estimates the class, a real-valued feature, as a sum of products of the input features and their weights. Consider the data on prices of imported cars in 1985.

For this lesson, load the data from imports-85.tab using the File widget and "Browse documentation data sets".

Inspecting this data set in a Data Table shows that some features, like fuel-system, engine-type and many others, are discrete. Linear regression only works with numbers. In Orange, linear regression will automatically convert all discrete values to numbers, most often using several features to represent a single discrete feature. We can also do this conversion manually, using the Continuize widget.

Before we continue, you should check what Continuize actually does and how it converts the nominal features into real-valued features. The table below should provide sufficient illustration.
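In case the table did not survive the trip into these notes, here is the idea in code (pandas as our stand-in): the most common continuization replaces an n-valued nominal feature with n 0/1 indicator columns.

    # Continuization: a nominal feature becomes several 0/1 indicator columns.
    import pandas as pd

    cars = pd.DataFrame({"fuel_system": ["mpfi", "2bbl", "mpfi", "idi"]})
    print(pd.get_dummies(cars, columns=["fuel_system"], dtype=int))
    #    fuel_system_2bbl  fuel_system_idi  fuel_system_mpfi
    # 0                 0                0                 1
    # 1                 1                0                 0
    # ...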


Now to the core of this lesson. Our workflow reads the data, continuizes it such that we also normalize all the features to bring them to an equal scale, then loads the data into the Linear Regression widget and checks out the feature coefficients in the Data Table.

In Linear Regression, we will use L1 regularization. Compared to L2 regularization, which aims to minimize the sum of squared weights, L1 regularization is rougher and minimizes the sum of the absolute values of the weights. The result of this "roughness" is that many of the features will get zero weights.
But this may be exactly what we want. We want to select only the most important features, and we want to see how a model that uses only a smaller subset of features actually behaves. Also, this smaller set of features is ranked. Engine size is a huge factor in the pricing of our cars, and so is the make, where Porsche, Mercedes and BMW cost more than other cars (ok, no news here).

We should notice that the number of features with non-zero weights varies with the regularization strength. Stronger regularization results in fewer features with non-zero weights.

Lesson 23: Hierarchical Clustering

Say that we are interested in finding clusters in the data. That is, we would like to identify groups of data instances that are close together, similar to each other. Consider a simple, two-featured data set (see the side note) and plot it in the Scatter Plot. How many clusters do we have? What defines a cluster? Which data instances belong to the same cluster? What would a procedure for discovering the clusters look like?

In the class, we will introduce clustering using a simple data set on students and their grades in English and Algebra. Load the data set from http://file.biolab.si/files/grade2.tab.

How do we measure the similarity between clusters if we only know the similarities between points? By default, Orange computes the average distance between all pairs of their data points; this is called average linkage. Instead, we could take the distance between the two closest points in each pair of clusters (single linkage), or between the two most distant points (complete linkage).

We need to start with a definition of "similar". A simple measure of similarity for such data is the Euclidean distance: square the differences across every dimension, sum them, and take the square root, just as in the Pythagorean theorem. We would thus like to group data instances with small Euclidean distances.

Now we need a clustering algorithm. We will start with each data instance being a cluster of its own. Next, we merge the clusters that are closest together, like the two closest points, into one cluster. And repeat. And repeat. And repeat, until we end up with a single cluster that contains all the points.

This procedure builds a hierarchy of clusters, which explains why we call it hierarchical clustering. Once it finishes, we can
observe the entire hierarchy and decide which would be a good point to stop. With this, we decide the actual number of clusters.

One possible way to observe the clustering results on our small data set with grades is through the following workflow:

Let us see how this works. Load the data, compute the distances and cluster the data. In the Hierarchical Clustering widget, cut the hierarchy at a certain distance and observe the corresponding clusters in the Scatter Plot.

You can also observe the properties of the clusters, that is, the average grades in Algebra and English, in the Box Plot widget.
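In code, the whole procedure takes a few lines, assuming SciPy and a small made-up grades matrix (not the actual file contents): compute pairwise Euclidean distances, build the hierarchy with average linkage, and cut it into a chosen number of clusters.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    X = np.array([[91, 89], [89, 91], [51, 100], [49, 92],
                  [9, 61], [11, 58], [20, 71]])    # made-up [English, Algebra]
    d = pdist(X, metric="euclidean")       # pairwise Euclidean distances
    Z = linkage(d, method="average")       # average linkage, Orange's default
    print(fcluster(Z, t=3, criterion="maxclust"))  # cut into three clusters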

Lesson 24: Animal Kingdom

Your lecturers spent a substantial part of their youth admiring a particular Croatian chocolate called Animal Kingdom. Each chocolate bar came with a card, a drawing of some (random) animal, and the associated album made us eat a lot of chocolate. Then our kids came, and the story repeated. Some things stay forever. The funny thing was that we never understood the order in which the cards were laid out in the album. We later learned about taxonomy, but being more inclined to engineering we never mastered learning it in our biology classes. Luckily, there's data mining and the idea that taxonomy simply stems from measuring the distance between species.

Here we use the zoo data (from the documentation data sets) with attributes that report on various features of animals (has hair, has feathers, lays eggs). We measure the distances and compute the clustering. Animals in this data set are annotated with their type (mammal, insect, bird, and so on). It would be cool to know if the clustering re-discovered these groups of animals. We can check this by marking the clusters in the Hierarchical Clustering widget, and then observing the results in the Sieve Diagram.
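As a cross-check in code, here is a sketch assuming pandas, SciPy, and a hypothetical CSV export of the zoo data with 0/1 feature columns plus name and type columns; the cross-tabulation plays the role of the Sieve Diagram.

    import pandas as pd
    from scipy.cluster.hierarchy import linkage, fcluster

    zoo = pd.read_csv("zoo.csv")                  # hypothetical export of the data
    features = zoo.drop(columns=["name", "type"]).to_numpy()
    Z = linkage(features, method="average")       # Euclidean distances by default
    zoo["cluster"] = fcluster(Z, t=7, criterion="maxclust")
    print(pd.crosstab(zoo["cluster"], zoo["type"]))  # a text-mode Sieve Diagram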

Looks great. Birds, say, are in cluster C6. Cluster C4 consists of amphibians and some reptiles. And so forth.

Checking this in the Box Plot is even cooler. We can get a distribution of animal types in each cluster:

Or we can turn it around and see how different types of animals are spread across clusters.

What is wrong with those mammals? Why can't they be in one single cluster? Two reasons. First, they represent 40% of the data instances. Second, they include some weirdos. Click on the clusters in the box plot and discover who they are.

Lesson 25: Discovering Clusters

Can we replicate this on some real data? Can clustering indeed be useful for defining meaningful subgroups?

Take brown-selected (from the documentation data sets) and connect the hierarchical clustering so that you can see a cluster as a subset in the scatter plot.

So far, we used the dendrogram to set a cut-off point. Now we will click on a branch in a dendrogram to select a subset of the data instances. By combining it with the Scatter Plot widget, we get a great tool for exploring the clusters. Try it with an appropriate pair of features to visualize (use Rank projections).

By using a scatter plot or other widgets, an expert can determine whether the clusters are meaningful.

For this data set, though, we can do something even better. The data already contains some predefined groups. Let us check how well the clusters match the classes, which we know, but clustering did not.

We will use the dendrogram to set a suitable threshold that splits the data into some three to five clusters. We can plot this data in a new scatter plot; we find a reasonable pair of attributes and then set the color of the points to represent the cluster they belong to. Do the clusters match the actual classes? The result is rather impressive if you keep two things in mind. First, the clustering algorithm did not actually know about the classes; it discovered them by itself. Second, it did not operate on the picture you see in the scatter plot, in which the clusters are quite pronounced, but in a 79-dimensional data space with possibly plenty of redundant features. Yet it identified the three groups of genes almost without mistakes.
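If you want a single number for this cluster-class match, one option (an addition of ours, not something the lesson's workflow computes) is the adjusted Rand index from scikit-learn: 1.0 means perfect agreement, and values around 0 mean chance-level agreement.

    from sklearn.metrics import adjusted_rand_score

    classes  = ["Prot", "Prot", "Resp", "Ribo", "Resp", "Ribo"]  # made-up labels
    clusters = [1, 1, 2, 3, 2, 3]
    print(adjusted_rand_score(classes, clusters))    # 1.0: a perfect match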

This lesson is not a recipe for what you should be doing in practice. If your data already contains group labels, say gene group annotations, there is no need to discover them (again) by using clustering. In this case you should be interested in predictive models from previous lessons. If you do not have such a grouping but you suspect that the data contains distinct subgroups, run clustering. The sole purpose of this lesson was to demonstrate that clustering can indeed find meaningful subgroups in the data; we pretended we did not know the groups, used clustering to discover them, and checked how well they correspond to the actual groups.
Lesson 26: k-Means Clustering

Hierarchical clustering is not suitable for larger data sets due to the prohibitive size of the distance matrix: with 30 thousand objects, the distance matrix already has almost one billion elements. An alternative approach that avoids using the distance matrix is k-means clustering.

K-means clustering randomly selects k centers (with k specified in advance). Then it alternates between two steps. In one step, it assigns each point to its closest center, thus forming k clusters. In the other, it recomputes the centers of the clusters. Repeating these two steps typically converges quite fast; even for big data sets with millions of data points it usually takes just a few tens or hundreds of iterations.

Orange's add-on Educational provides a widget Interactive k-means, which illustrates the algorithm.

Use the Paint widget to paint some data - maybe five groups of points. Feed it to Interactive k-means and set the number of centroids to 5. You may get something like this.

Try rerunning the clustering from new random positions and observe how the centers conquer the territory. Exciting, isn't it?

Keep pressing Recompute Centroids and Reassign Membership until nothing changes anymore. With this simple, two-dimensional data it will take just a few iterations; with more points and features, it can take longer, but the principle is the same.
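The two alternating steps fit into a few lines of NumPy; this is a bare-bones sketch (real implementations add smarter initialization and convergence checks):

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        rng = np.random.RandomState(seed)
        centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
        for _ in range(n_iter):
            # step 1: assign each point to its closest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # step 2: recompute each center as the mean of its points
            centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        return labels, centers

    X = np.random.RandomState(1).normal(size=(300, 2))
    labels, centers = k_means(X, k=5)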
How do we set the initial number of clusters? That's
simple: we choose the number that gives the optimal
clustering.

Well then, how do we define the optimal clustering? This one is a bit harder. We want small distances between points in the same cluster and large distances between points from different clusters. Pick one point, and let A be its average distance to the data points in the same cluster and let B represent the average distance to the points from the closest other cluster. (The closest cluster? Just compute B for all other clusters and take the lowest value.) The value (B - A) / max(A, B) is called the silhouette; the higher the silhouette, the better the point fits into its cluster. The average silhouette across all points is the silhouette of the clustering. The higher the silhouette, the better the clustering.

Now that we can assess the quality of clustering, we can run k-means with different values of the parameter k (number of clusters) and select the k which gives the largest silhouette.

For this, we abandon our educational toy and connect Paint to the widget k-Means. We tell it to find the optimal number of clusters between 2 and 8, as scored by the Silhouette.
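The same scan is easy to sketch with scikit-learn (make_blobs stands in for the painted data): run k-means for each candidate k and keep the k with the highest silhouette.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # painted-like data
    for k in range(2, 9):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))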

Works like a charm.

Except that it often doesn't. First, the result of k-means clustering depends on the initial selection of centers. With an unfortunate selection, it may get stuck in a local optimum. We solve this by re-running the clustering multiple times from random positions and using the best result. Second, the silhouette sometimes fails to correctly evaluate the clustering. Nobody's perfect.

Time to experiment. Connect the Scatter Plot to k-Means. Change the number of clusters. See if the clusters make sense. Could you paint the data where k-Means fails? Or where it really works well?
Lesson 27: Finding Clusters When There Are None

We saw how clustering can discover the subgroups in the data. The flip side of this is that algorithms like k-means will always find them, even when they do not actually exist.

Playing with Paint Data and k-Means can be quite fun. Try painting the data where there are clusters, but k-means does not find them. Or, actually, finds the wrong ones. What kind of clusters are easy to find for k-means? Are these the kind of clusters we would actually find in real data sets?
It is difficult to verify whether the clusters we found are "real". Data mining methods like clustering can serve only as hints that can help form new hypotheses, which must make biological sense and be verified on new, independent data. We cannot make conclusions based only on "discovering" clusters.
Lesson 28: Silhouettes

Don't get confused: we paint data and/or visualize it with scatter plots, which show only two features. This is just for illustration! Most data sets contain many features, and methods like k-Means clustering take into account all features, not just two.

Consider a two-feature data set which we have painted in the Paint Data widget. We send it to the k-means clustering, tell it to find three clusters, and display the clustering in the scatter plot.

The data points in the green cluster are well separated from those in the other two. Not so for the blue and red points, where several points are on the border between the clusters. We would like to quantify how well a data point belongs to the cluster to which it is assigned.

We will invent a scoring measure for this and we will call it a silhouette (because this is how it's called). Our goal: a silhouette of 1 (one) will mean that the data instance is well rooted in the cluster, while the score of 0 (zero) will be assigned to data instances on the border between two clusters.

For a given data point (say the blue point in the image on
the left), we can measure the distance to all the other
points in its cluster and compute the average. Let us
denote this average distance with
A. The smaller A, the better.
On the other hand, we would like a data point to be far away from the points in the closest neighboring cluster. The closest cluster to our blue data point is the red cluster. We can measure the distances between the blue data point and all the points in the red cluster, and again compute the average. Let us denote this average distance as B. The larger B, the better.

The point is well rooted within its own cluster if the distance to the points from the neighboring cluster (B) is much larger than the distance to the points from its own cluster (A), hence we compute B - A. We normalize it by dividing by the larger of the two numbers, S = (B - A) / max{A, B}. Voilà, S is our silhouette score.
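Here is the same computation as a from-scratch NumPy sketch for a single data point, following the definitions of A, B and S above (the tiny data set is made up):

    import numpy as np

    def silhouette_of(i, X, labels):
        own = labels[i]
        def avg_dist(mask):
            return np.linalg.norm(X[mask] - X[i], axis=1).mean()
        same = (labels == own) & (np.arange(len(X)) != i)
        A = avg_dist(same)                     # average distance within own cluster
        B = min(avg_dist(labels == c)          # ... to the closest other cluster
                for c in set(labels) - {own})
        return (B - A) / max(A, B)

    X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
    labels = np.array([0, 0, 0, 1, 1, 1])
    print(silhouette_of(0, X, labels))         # close to 1: well rooted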

C3 is the green cluster, and all its points have large silhouettes. Not so for the other two.

Orange has a Silhouette Plot widget that displays the values of the silhouette score for each data instance. We can also choose a particular data instance in the silhouette plot and check out its position in the scatter plot.

Below we selected three data instances with the worst silhouette scores. Can you guess where they lie in the scatter plot?

This of course looks great for data sets with two features, where the scatter plot reveals all the information. In higher-dimensional data, the scatter plot shows just two features at a time, so two points that seem close in the scatter plot may actually be far apart when all features - perhaps thousands of gene expressions - are taken into account.

The total quality of clustering - the silhouette of the clustering - is the average silhouette across all points.

When the k-Means widget
searches for the optimal number of clusters, it tries different numbers of clusters and displays the corresponding silhouette scores.

Ah, one more thing: the Silhouette Plot can be used on any data, not just on data sets that are the output of clustering. We could use it with the iris data set and figure out which class is well separated from the other two and, conversely, which data instances from one class are similar to those from another.

We don't have to group the instances by the class. For instance, the silhouette on the left would suggest that the patients from the heart disease data with typical anginal pain are similar to each other (with respect to the distance/similarity computed from all features), while those with other types of pain, especially non-anginal pain, are not clustered together at all.

Lesson 29: Mapping the Data

Imagine a foreign visitor to the US who knows nothing about US geography. He doesn’t even have a map; the only data he has is a list of distances between the cities. Oh, yes, and he attended the Introduction to Data Mining.

If we know the distances between the cities, we can cluster them.

For this example we retrieved data from http://www.mapcrow.info/united_states.html, removed the city names from the first line and replaced it with “31 labelled”.

The file is available at http://file.biolab.si/files/us-cities.dst.zip. To load it, unzip the file and use the File Distance widget from the Prototypes add-on.

How much sense does it make? Austin and San Antonio are closer to each other than to Houston; the tree is then joined by Dallas. On the other hand, New Orleans is much closer to Houston than to Miami. And, well, good luck hitchhiking from Anchorage to Honolulu.

As for Anchorage and Honolulu, they are left-overs; when there were only three clusters left (Honolulu, Anchorage and the big cluster with everything else), Honolulu and Anchorage were closer to each other than to the rest. But not close — the corresponding lines in the dendrogram are really long.
The real problem is New Orleans and San Antonio: New
Orleans is close to Atlanta and Memphis, Miami is close
to Jacksonville and Tampa. And these two clusters are
suddenly more similar to each other than to some distant
cities in Texas.

We can’t run k-means clustering on this data, since we only have distances, and k-means runs on real (tabular) data. Yet, k-means would have the same problem as hierarchical clustering.

In general, two points from different clusters may be more similar to each other than to some points from their corresponding clusters.

To get a better impression about the physical layout of cities, people have invented a better tool: a map! Can we reconstruct a map from a matrix of distances? Sure. Take any pair of cities and put them on paper with a distance corresponding to some scale. Add the third city and put it at the corresponding distance from the two. Continue until done. Excluding, for the sake of scale, Anchorage, we get the following map.

We have not constructed this map manually, of course. We used a widget called MDS, which stands for Multidimensional Scaling.

It is actually a rather exact map of the US from the Australian perspective. You cannot get the orientation from a map of distances.
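A sketch of the same reconstruction, assuming scikit-learn and a tiny made-up distance matrix in place of the real city distances:

    import numpy as np
    from sklearn.manifold import MDS

    D = np.array([[0, 3, 4],
                  [3, 0, 5],
                  [4, 5, 0]], dtype=float)     # pairwise "city" distances
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    print(mds.fit_transform(D))                # 2-D points that preserve D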
Remember the clustering of animals? Can we draw a map of animals?

Does the map make any sense? Are similar animals together? Color the points by the types of animals and you should see.

The map of the US was accurate: one can put the points in a plane so that the distances correspond to actual distances between cities. For most data, this is usually impossible. What we get is a projection (a non-linear projection, if you care about mathematical detail) of the data. You lose something, but you get a picture.

The MDS algorithm does not always find the optimal map. You may want to restart the MDS from random positions. Use the slider "Show similar pairs" to see whether the points that are placed together (or apart) actually belong together. In the above case, the honeybee belongs closer to the wasp, but could not fly there as in the process of optimization it bumped into the hostile region of flamingos and swans.
Lesson 30: Principal Component Analysis

Which of the following three scatter plots (showing x vs. y, x vs. z and y vs. z) for the same three-dimensional data gives us the best picture of the actual layout of the data in space?

Spoiler figures successfully removed :-).

Yes, the first scatter plot looks very useful: it tells us that x and y are highly correlated and that we have three clusters of somewhat irregular shape. But remember: this data is three-dimensional. What if we saw it from another, perhaps better perspective? (No spoilers here, but we'll add the figure after the lecture!)

Let's make another experiment. Go to https://in-the-sky.org/ngc3d.php, disable Auto-rotate and Show labels, and select Zoom to show Local Milky Way. Now let's rotate the picture of the galaxy to find the layout of the stars.

Think about what we've done. What are the properties of the best projection?

We want the data to be as spread out as possible. If we look from the direction parallel to the galactic plane, we see just a line. We lose one dimension, essentially keeping just a single coordinate for each star. (This is unfortunately exactly the perspective we see in the night sky: most stars are in the bright band we call the Milky Way, and we only see the outliers.) Among all possible projections, we attempt to find the one with the highest spread across the scatter plot. This projection may not be (and usually isn't) orthogonal to any axis; it may be a projection onto an arbitrary plane.

We again talk about two-dimensional projections only for the sake of illustration. Imagine that we have ten-thousand-dimensional data and would like, for some reason, to keep just ten features. Yes, we can rank the features and keep the most informative, but what if these are correlated and tell us the same thing? Or what if our data does not have any target variable: with what should the "good features" be correlated? And what if the optimal projection is not aligned with the axes at all, so the "good" features are combinations of the original ones?

We can do the same reasoning as above: we want to find a 10-dimensional (for the sake of example) projection in which the data points are as spread out as possible.

How do we do this? Let's go back to our everyday three-dimensional world and think about how to find a two-dimensional projection.

Imagine you are observing a swarm of flies; your data are their exact coordinates in the room, so the position of each fly is described by three numbers. Then you discover that your flies actually fly in a formation: they are (almost) on the same line. You could then describe the position of each fly with a single number that represents the fly's position along the line. Plus, you need to know where in the space the line lies. We call this line the first principal component. By using it, we reduce the three-dimensional space into a single dimension.

After some careful observation, you notice the flies are a bit spread in one other direction, so they do not fly along a line but along a band. Therefore, we need two numbers, one along the first and one along the — you guessed it — second principal component.

It turns out the flies are actually also spread in the third
direction. Thus you need three numbers after all.

Or do you? It all depends on how spread they are in the second and in the third direction. If the spread along the second is relatively small in comparison with the first, you are fine with a single dimension. If not, you need two, but perhaps still not three.

Let's step back a bit: why would one who carefully measured the expressions of ten thousand genes want to throw most of the data away and reduce it to a dozen dimensions? The data, in general, does not have as many intrinsic dimensions as there are features. Say you have an experiment in which you spill different amounts of two chemicals over colonies of amoebas and then measure the expressions of 10,000 genes. Instead of flies in a three-dimensional space, you now profile colonies in a 10,000-dimensional space, the coordinates corresponding to gene expressions. Yet if the expressions of the genes depend only on the concentrations of these two chemicals, you can compute all 10,000 numbers from just two. Your data is then just two-dimensional.
A technique that does this is called Principal Component Analysis, or PCA. The corresponding widget is simple: it receives the data and outputs the transformed data.

The widget allows you to select the number of components and helps you by showing how much information (technically: explained variance) you retain with respect to the number of components (brownish line) and the amount of information (explained variance) in each component.

The PCA on the left shows the scree diagram for the brown-selected data. Set like this, the widget replaces the 80 features with just seven, while still keeping 82.7% of the information. (Note: disable the "Normalize data" checkbox to get the same picture.) Let us see a scatter plot for the first two components.
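In code, the scree computation is a one-liner on top of scikit-learn's PCA; loading the data through Orange's own Python API, as below, is an assumption about your setup.

    import numpy as np
    import Orange
    from sklearn.decomposition import PCA

    data = Orange.data.Table("brown-selected")      # gene expression data
    pca = PCA().fit(data.X)                         # no normalization, as in the text
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    print("variance kept by 7 components:", round(cumulative[6], 3))
    projection = PCA(n_components=2).fit_transform(data.X)  # for the scatter plot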

The axes, PC1 and PC2, do not correspond to particular features in the original data, but to their linear combinations. What we are looking at is a projection onto the plane defined by the first two components. When you consider only two components, you can imagine that PCA puts a hyperplane into the multidimensional space and projects all data onto it.

Note that this is an unsupervised method: it does not care about the class. The classes in the projection may be well separated or not. Let's add some colors to the points and see how lucky we are this time.

The data separated so well that these two dimensions alone may suffice for building a good classifier. No, wait, it gets even better. The data classes are separated well even along the first component. So we should be able to build a classifier from a single feature!

In the above schema we use the ordinary Test & Score widget, but rename it to "Test on original data" for a better understanding of the workflow.

On the original data, Logistic Regression gets 98% AUC and classification accuracy. If we select just a single component in PCA, we already get 93%, and if we take two, we get the same result as on the original data.
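A sketch of this comparison with scikit-learn (an assumption on our part; the numbers will not match Orange exactly): a pipeline keeps PCA inside each cross-validation fold, so the components are computed on training folds only.

    import Orange
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    data = Orange.data.Table("brown-selected")
    for n in (1, 2):
        model = make_pipeline(PCA(n_components=n),
                              LogisticRegression(max_iter=1000))
        scores = cross_val_score(model, data.X, data.Y, cv=10)
        print(n, "component(s): accuracy", round(scores.mean(), 3))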

PCA is thus useful for multiple purposes. It can simplify our data by combining the existing features into a much smaller number of features without losing much information. The directions of these features may tell us something about the data. Finally, it can find us good two-dimensional projections that we can observe in scatter plots.
Lesson 31: Image Embedding

This depiction of a deep learning network was borrowed from http://www.amax.com/blog/?

Every data set so far came in the matrix (tabular) form: objects (say, tissue samples, students, flowers) were described by row vectors representing a number of features. Not all data is like this; think about collections of text articles, nucleotide sequences, voice recordings or images. It would be great if we could represent them in the same matrix format we have used so far. We would turn collections of, say, images, into matrices and explore them with the familiar prediction or clustering techniques.

Until very recently, finding a useful representation of complex objects such as images was a real pain. Now, a technology called deep learning is used to develop models that transform complex objects into vectors of numbers. Consider images. When we, humans, see an image, our neural networks go from pixels, to spots, to patches, to some higher-order representations like squares, triangles and frames, all the way to representations of complex objects. Artificial neural networks used for deep learning emulate these through layers of computational units (essentially, logistic regression models and some other stuff we will ignore here). If we feed an image to the input of such a network and collect the outputs from the higher levels, we get vectors containing an abstraction of the image. This is called embedding.

Deep learning requires a lot of data (thousands, possibly millions of data instances) and processing power to prepare the network. We will use one which is already prepared. Even so, embedding takes time, so Orange doesn't do it locally but uses a server invoked through the ImageNet Embedding widget.
Image embedding describes the images with a set of 2048 features appended to the table with the meta features of images.

We have no idea what these features are, except that they represent some higher-abstraction concepts in the deep neural network (ok, this is not very helpful in terms of interpretation). Yet, we have just described the images with vectors whose similarities and distances we can measure. Distances? Right, we could do clustering. Let's cluster the images of animals and see what happens.

To recap: in the workflow above we have loaded the images from the local disk, turned them into numbers, computed the distance matrix containing distances between all pairs of images, used the distances for hierarchical clustering, and displayed the images that correspond to the selected branch of the dendrogram in the Image Viewer. We used cosine similarity to assess the distances (simply because the dendrogram looked better than with the Euclidean distance).
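The non-widget part of this pipeline looks roughly like this, assuming SciPy and a random matrix standing in for the real 2048-dimensional embeddings:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    embeddings = np.random.RandomState(0).rand(20, 2048)  # stand-in image vectors
    d = pdist(embeddings, metric="cosine")    # cosine distances between all pairs
    Z = linkage(d, method="average")
    print(fcluster(Z, t=4, criterion="maxclust"))         # four clusters of images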

Even the lecturers of this course were surprised at
the result. Beautiful!
Lesson 32: Images and Classification

In this lesson, we are using images of yeast protein localization (http://file.biolab.si/files/yeast-localization-small.zip) in the classification setup. But this same data set could be explored in clustering as well; the workflow would be the same as the one from the previous lesson. Try it out! Do Italian cities cluster next to American or are…

We can use image data for classification. For that, we need to associate every image with a class label. The easiest way to do this is by storing images of different classes in different folders. Take, for instance, the images of yeast protein localization. A screenshot of the file names shows how we have stored them on the disk. The localization sites (cytoplasm, endosome, endoplasmic reticulum) will now become class labels for the images. We are just a step away from testing if logistic regression can classify images to their corresponding protein localization sites. The data set is small: you may use leave-one-out for evaluation in the Test & Score widget instead of cross validation.
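The evaluation step, sketched with scikit-learn under the assumption that the embeddings and labels have already been extracted into arrays (random stand-ins below):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.RandomState(0)
    X = rng.normal(size=(60, 2048))           # stand-in image embeddings
    y = rng.choice(["cytoplasm", "endosome", "endoplasmic reticulum"], size=60)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=LeaveOneOut()).mean()
    print("leave-one-out accuracy:", round(acc, 3))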

At about 0.9, the AUC score is quite high, and we can check where the mistakes are made and visualize these in an Image Viewer.
For the End

The course on Visual Approaches to Data Mining ends here. We covered quite some mileage, and we hope we have taught you some essential procedures that should be in the stack of every data scientist. The goal was to get you familiar with basic techniques, tools, and concepts of data science and to teach you how to visually explore data and models. Data science is a vast field, and it takes years of study and practice to master it. You may never become a data scientist, but as an expert in some other field, it should now be easier to talk and collaborate with statisticians and computer scientists. And for those who want to go ahead with data science, well, you now know where to start.
