2018 05 Intro To Datamining Notes
Introduction to Data Mining
Working notes for the hands-on
course with Orange Data Mining
Attribution-NonCommercial-NoDerivs
CC BY-NC-ND
University of Ljubljana 2
Lesson 1: Workflows in Orange
Orange workflows consist of components that read,
process and visualize data. We call them “widgets.” We
place the widgets on a drawing board (the “canvas”).
Widgets communicate by sending information along
a communication channel. An output from one widget is
used as input to another.
The File widget reads data from your local disk. Open
the File widget by double-clicking its icon. Orange
comes with several preloaded data sets. From these
(“Browse documentation data sets…”), choose
brown-selected.tab, a yeast gene expression data set.
Double-click the connection between the widgets to access the setup dialog, as you learned in the previous lesson.
Lesson 8: Classification Accuracy
Now that we know what classification trees are, the
next question is how good their predictions are.
To begin with, we need to define what we mean by
quality. In classification, the simplest measure of
quality is classification accuracy, expressed as the
proportion of data instances for which the classifier
correctly guessed the value of the class. Let’s see if we
can estimate, or at least get a feeling for, classification
accuracy with the widgets we already know.

Side note: Measuring accuracy is such an important concept
that it would require its own widget. But wait a while,
there’s educational value in reusing the widgets we
already know.
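
The definition of classification accuracy given above can be written down in a few lines. This is only a minimal sketch in plain Python, not what Orange does internally; the function name and the example labels are made up for illustration.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Proportion of instances whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one instance")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical class values, invented for this example.
true_labels = ["a", "b", "c", "c", "b"]
predicted = ["a", "c", "c", "c", "b"]

# Four of the five predictions match the true class.
ca = classification_accuracy(true_labels, predicted)
print(ca)
```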
Zupan, Demsar: Introduction to Data Mining, May 2018
Lesson 10: Cross-Validation
The estimated accuracy may depend on the particular split
of the data set. To increase robustness, we can repeat the
measurement several times, each time choosing a
different subset of the data for training. One such method
is cross-validation. It is available in Orange through the
Test & Score widget.
Note that in each iteration, Test & Score will pick part of
the data for training, learn the predictive model on this
data using some machine learning method, and then test
the accuracy of the resulting model on the remaining, test
data set. For this, the widget will need on its input a data
set from which it will sample data for training and testing,
and a learning method which it will use on the training
data set to construct a predictive model. In Orange, the
learning method is simply called a learner. Hence, Test &
Score needs a learner on its input. A typical workflow
with this widget is as follows.
This is another way to use the Tree widget. In the
workflows from the previous lessons we have used
another of its outputs, called Model: its construction
required the data. This time, no data is needed for Tree,
because all that we need from it is a learner.

For geeks: a learner is an object that, given the data,
outputs a classifier. Just what Test & Score needs.
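
To make the distinction between a learner and a classifier concrete, here is a minimal sketch in plain Python. It does not use Orange's own classes; `majority_learner` is a hypothetical, deliberately trivial learner that always predicts the most frequent class seen in the training data.

```python
from collections import Counter


def majority_learner(data, labels):
    """A learner: takes training data and returns a classifier (a function)."""
    most_common_class = Counter(labels).most_common(1)[0][0]

    def classifier(instance):
        # This trivial model ignores the instance and
        # always predicts the majority class of the training data.
        return most_common_class

    return classifier


# Made-up training data: one numeric feature per instance.
train_data = [[1.0], [2.0], [3.0]]
train_labels = ["yes", "no", "yes"]

model = majority_learner(train_data, train_labels)  # learner + data -> classifier
prediction = model([10.0])                          # classifier + instance -> class
print(prediction)
```

The learner itself needs no data until someone calls it, which is why a procedure like cross-validation can be handed a learner and apply it to a different training subset in every iteration.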
Here we show what the Test & Score widget looks like. CA stands
for classification accuracy, and this is what we really care
about for now. We will talk about other measures, like AUC,
later.

Side note: Cross-validation splits the data set into, say, 10
different non-overlapping subsets we call folds. In each
iteration, one fold will be used for testing, while the data
from all other folds will be used for training. In this way,
each data instance will be used for testing exactly once.
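
The fold bookkeeping described above can be sketched in plain Python as follows. This is an illustrative sketch, not the code behind Test & Score; the function names, the shuffling with a fixed seed, and the choice of 25 instances and 5 folds are assumptions made for the example.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Shuffle instance indices and deal them into k non-overlapping folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices), one pair per fold."""
    folds = k_fold_indices(n_instances, k, seed)
    for i, test in enumerate(folds):
        # Training data is everything that is not in the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(cross_validation_splits(25, k=5))

# Count how many times each instance lands in a test fold.
times_tested = [0] * 25
for train, test in splits:
    for idx in test:
        times_tested[idx] += 1

print(times_tested)
```

Counting the test appearances is a quick way to check the property stated in the text: every instance is tested exactly once, and never appears in the training data of the iteration that tests it.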
Lesson 11: A Few More Classifiers
We have ended the previous lesson with cross-validation
and classification trees. There are many other, much more
accurate classifiers. A particularly interesting one is
Random Forest, which averages across predictions of
hundreds of classification trees. It uses two tricks to
construct different classification trees. First, it infers each
tree from a sample of the training data set (with
replacement). Second, instead of choosing the most
informative feature for each split, it randomly selects
from a subset of most informative features. In this way, it
randomizes the tree inference process. Think of each tree
shedding light on the data from a different perspective.
Just like in the wisdom of the crowd, an ensemble of trees
(called a forest) usually performs better than a single tree.
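
Two ingredients of the ensemble idea above, sampling the training data with replacement and combining the predictions of many trees, can be sketched in plain Python. This is not a Random Forest implementation: tree inference and the randomized choice of features are left out, and a simple majority vote stands in for the averaging of predictions. All names and data below are made up for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(data, labels, rng):
    """Draw len(data) instances with replacement (some repeat, some are left out)."""
    n = len(data)
    picks = [rng.randrange(n) for _ in range(n)]
    return [data[i] for i in picks], [labels[i] for i in picks]


def majority_vote(predictions):
    """Combine the predictions of individual trees into one answer."""
    return Counter(predictions).most_common(1)[0][0]


# Made-up training data: two numeric features per instance.
data = [(1, 2), (2, 1), (8, 9), (9, 8), (5, 5)]
labels = ["low", "low", "high", "high", "low"]

rng = random.Random(0)
sample_data, sample_labels = bootstrap_sample(data, labels, rng)

# Pretend three trees, each inferred from its own sample, predicted these classes.
forest_prediction = majority_vote(["high", "low", "high"])
print(forest_prediction)
```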
What if the chance of a broken leg was just 10%? 5%? 0.1%?
between these two kinds of mistakes.
Lesson 16: Choosing the Decision Threshold
The common property of scores from the previous
lesson is that they depend on the threshold we choose
for classifying an instance as positive. By adjusting it,
we can balance between them and find, say, the
threshold that gives us the required sensitivity at an
acceptable specificity. We can even assign costs
(monetary or not) to different kinds of mistakes and
find the threshold with the minimal expected cost.
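
As a rough illustration of how the threshold trades one kind of mistake for another, the sketch below counts mistakes at a few candidate thresholds and picks the one with the lowest total cost on the sample. The probabilities, labels, thresholds and costs are invented for the example, and the function names are hypothetical.

```python
def confusion_counts(probabilities, labels, threshold):
    """Return (TP, FP, TN, FN) when predicting positive for p >= threshold."""
    tp = fp = tn = fn = 0
    for p, is_positive in zip(probabilities, labels):
        predicted_positive = p >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(probabilities, labels, threshold):
    """Assumes the data contain at least one positive and one negative instance."""
    tp, fp, tn, fn = confusion_counts(probabilities, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)


def cheapest_threshold(probabilities, labels, thresholds, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total cost of mistakes."""
    def total_cost(threshold):
        _, fp, _, fn = confusion_counts(probabilities, labels, threshold)
        return fp * cost_fp + fn * cost_fn
    return min(thresholds, key=total_cost)


# Made-up predicted probabilities of the positive class and the true classes.
probabilities = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [False, False, True, False, True, True]
thresholds = [0.2, 0.5, 0.7]

# Missing a positive is expensive: a low threshold wins.
careful = cheapest_threshold(probabilities, labels, thresholds, cost_fp=1, cost_fn=5)
# False alarms are expensive: a high threshold wins.
strict = cheapest_threshold(probabilities, labels, thresholds, cost_fp=5, cost_fn=1)
print(careful, strict)
```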
It is quite surprising to see that a linear regression model can result in fitting non-linear (univariate) functions. That is, functions wit
model is actually a hyperplane (a
Side note: Before we continue, you should check what Continuize
actually does and how it converts the nominal features into
real-valued features. The table below should provide
sufficient illustration.
Lesson 23: Hierarchical Clustering
Say that we are interested in finding clusters in the
data. That is, we would like to identify groups of data
instances that are close together, similar to each other.
Consider a simple data set with two features (see the side
note) and plot it in the Scatter Plot. How many clusters do
we have? What defines a cluster? Which data instances belong
to the same cluster? What does the procedure for discovering
clusters look like?

Side note: In class, we will introduce clustering using a
simple data set on students and their grades in English and
Algebra. Load the data set from
http://file.biolab.si/files/grade 2.tab

You can also observe the properties of the clusters, that
is, the average grades in Algebra and English, in the
Box Plot.
Lesson 25: Discovering Clusters
Can we replicate this on some real data? Can clustering
indeed be useful for defining meaningful subgroups?
For a given data point (say the blue point in the image on
the left), we can measure the distance to all the other
points in its cluster and compute the average. Let us
denote this average distance with A. The smaller A, the better.
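
The quantity A for a single data point can be computed as in the following sketch. It assumes Euclidean distance and represents the clustering as a list of cluster indices, one per point; both are choices made for this example, and the points are invented.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_distance_within_cluster(point_index, points, cluster_of):
    """A: average distance from one point to all other points in its cluster."""
    own_cluster = cluster_of[point_index]
    others = [
        points[i]
        for i in range(len(points))
        if i != point_index and cluster_of[i] == own_cluster
    ]
    if not others:
        raise ValueError("the point is alone in its cluster; A is undefined")
    return sum(euclidean(points[point_index], q) for q in others) / len(others)


# Made-up points; the first three share a cluster, the fourth is elsewhere.
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 2.0), (10.0, 10.0)]
cluster_of = [0, 0, 0, 1]

# Distances from the first point to its cluster mates are 5 and 2.
A = average_distance_within_cluster(0, points, cluster_of)
print(A)
```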
Lesson 29: Mapping the Data
Imagine a foreign visitor to the US who knows nothing
about US geography. He doesn’t even have a map; the
only data he has is a list of distances between the cities.
Oh, yes, and he attended the Introduction to Data Mining course.
We have not constructed this map manually, of course. We used a widget called MDS, which stands for Multidimensional sc
It is actually a rather exact map of the US from the Australian perspective. You cannot get the orientation from a map of dist
Remember the clustering of animals? Can we draw
a map of animals?
The map of the US was accurate: one can put the points in
a plane so that the distances correspond to actual distances
between cities. For most data, this is impossible.
What we get is a projection (a non-linear projection, if you
care about mathematical finesse) of the data. You lose
something, but you get a picture.
It turns out the flies are actually also spread in the third
direction. Thus you need three numbers after all.
In the above schema we use the ordinary Test & Score
widget, but renamed it to “Test on original data” for better
understanding of the workflow.
Even the lecturers of this course were surprised at
the result. Beautiful!
Lesson 32: Images and Classification
Side note: In this lesson, we are using images of yeast protein
localization
(http://file.biolab.si/files/yeast-localization-small.zip)
in the classification setup. But this same data set could be
explored in clustering as well. The workflow would be the
same as the one from the previous lesson. Try it out! Do Italian
cities cluster next to American or are

We can use image data for classification. For that, we
need to associate every image with a class label. The
easiest way to do this is by storing images of different
classes in different folders. Take, for instance, images
of yeast protein localization. The screenshot of the file
names shows how we have stored them on the disk.
Localization sites (cytoplasm, endosome, endoplasmic
reticulum) will now become class labels for the images.
We are just a step away from testing if logistic regression
can classify images to their corresponding protein
localization sites. The data set is small: you may use
leave-one-out for evaluation in the Test & Score widget
instead of cross-validation.
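
Leave-one-out can be sketched in plain Python as below: each instance is held out once, a model is built from all the others, and the held-out instance is used for testing. This is an illustration only. A simple nearest-neighbour learner stands in for logistic regression, and the two-number vectors are made-up stand-ins for whatever numeric description of the images is used; none of this is Orange's own code.

```python
def leave_one_out_accuracy(data, labels, learner):
    """Hold out each instance once, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(data)):
        train_data = data[:i] + data[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        classifier = learner(train_data, train_labels)
        if classifier(data[i]) == labels[i]:
            correct += 1
    return correct / len(data)


def nearest_neighbour_learner(train_data, train_labels):
    """Hypothetical stand-in learner: predict the class of the closest training instance."""
    def classifier(instance):
        squared_distances = [
            sum((a - b) ** 2 for a, b in zip(instance, row)) for row in train_data
        ]
        return train_labels[squared_distances.index(min(squared_distances))]
    return classifier


# Made-up feature vectors for six images from two localization sites.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["cytoplasm"] * 3 + ["endosome"] * 3

accuracy = leave_one_out_accuracy(data, labels, nearest_neighbour_learner)
print(accuracy)
```

With n instances this builds n models, which is affordable only because the data set is small.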