Lab description file (4)
Week 1: Introduction
• Distance Metrics
1 Introduction
Welcome to the first practical session of the Data Mining course. In this session we will
review some of the fundamental mathematical concepts used in this module
and do some programming exercises with these concepts. This will solidify the foundation that you
will need for the lab sessions, for the coursework assignments and for the future.
We are going to use Python, version 3. Python is a general-purpose programming language, and
it is widely used in data science, scientific computing and machine learning. Python is completely
free and open source. Additionally, it is very popular for three main reasons: (i) it is easy to learn,
being recommended as a “first” programming language; (ii) it allows easy and quick development
of applications; (iii) it has a great variety of useful and open libraries.
If you have never used Python before, then please take a look at a Python tutorial before
proceeding, so that you can follow the Data Mining labs smoothly. For example, the official
Python tutorial at https://docs.python.org/3/tutorial/index.html is a good starting point.
The TAs are going to demonstrate how to start Python, so in this lab you can focus on Chapters
3, 4 and 5. That is, from “An Informal Introduction to Python” until (and including) “Data
Structures”. However, if you already have experience with Python, you can start the lab from the
next section.
• https://repl.it/
Click on “new repl”; there is no need to sign up
• https://www.onlinegdb.com/online_python_compiler
However, perhaps the best option is to install Spyder (short for Scientific PYthon
Development EnviRonment) from Anaconda Navigator (which is free to install). Anaconda can also
be found on AppsAnywhere. It will be installed in the labs on campus, but if you would like to
install it at home, follow these steps:
• Open Anaconda Navigator and, in Environments, find Python, mark a specific version and
choose Python 3.7.7 (the default version may be 3.8; please see the next item for the sequence
needed to downgrade it to 3.7.7).
• Again in Environments in Anaconda Navigator, check that PyTorch is installed and check
its version. If it is version 1.6, downgrade it to version 1.4. After that you will also be able to
downgrade Python from the default version 3.8 to version 3.7.7, which
we need (for Labs 8, 9 and 10 you will need PyTorch v1.4).
• Install Torchvision (which you will need for Lab 9) from the Anaconda command prompt with the
command:
pip install torchvision
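Once the steps above are done, the installed versions can be checked from Python itself. The following is only a minimal sketch: it assumes nothing about your setup and simply reports "not installed" for any package it cannot import. The variable name python_version is chosen here for illustration.

```python
import importlib
import sys

# Python version as "major.minor.micro", for example "3.7.7"
python_version = ".".join(str(part) for part in sys.version_info[:3])
print("Python:", python_version)

# PyTorch and Torchvision may not be installed yet: report them only if they import.
for name in ("torch", "torchvision"):
    try:
        module = importlib.import_module(name)
        print(name + ":", module.__version__)
    except ImportError:
        print(name + ": not installed")
```

If the reported versions differ from the ones requested above, repeat the downgrade steps in Anaconda Navigator.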
3 Linear Algebra Revision
Data is usually organised in tables, where each row represents an item, and each column a feature
of the item. That is, each item is normally characterised by a list of values. For example, a fruit
could be characterised by its width, height, weight, colour, type, price, etc. In this lab we will
assume that all features are numbers. Later in class we will see how to handle different kinds of
features.
Therefore, mathematically we can see a data set as a matrix, and each item as one point (or a
vector) in a multi-dimensional space. This mathematical point of view is the fundamental basis of
many of the machine learning techniques. Hence, it is important to understand the basic concepts
of linear algebra.
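As a minimal sketch of this view, a data set can be stored in Python as a list of lists, with one inner list per item. The fruit measurements and the variable names below are made up purely for illustration.

```python
# Hypothetical data set: each row is one fruit, each column one numeric feature
# (width in cm, height in cm, weight in g). The values are invented for illustration.
dataset = [
    [7.0, 7.5, 150.0],
    [3.5, 12.0, 120.0],
    [6.0, 6.2, 110.0],
]

n_items = len(dataset)        # number of rows (items)
n_features = len(dataset[0])  # number of columns (features)

second_item = dataset[1]               # one item: a point (vector) in 3-dimensional space
weights = [row[2] for row in dataset]  # one feature across all items: a column of the matrix

print(n_items, n_features)
print(second_item)
print(weights)
```

Indexing a row gives one item, while collecting the same position from every row gives one feature.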
\[
V = \begin{pmatrix} 0.2 & 3 & 4.5 \\ 2.5 & 4.1 & 3.7 \\ 0.7 & 1.5 & 2.5 \end{pmatrix}
\]
• range
The range function returns an iterable range object that yields a sequence of evenly spaced
integers. For example,
In: list(range(9))
results in:
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]
A start, a stop and a step can also be given. For instance, a call with a step of 2, such as
In: list(range(0, 9, 2))
resulting in:
Out: [0, 2, 4, 6, 8]
or going backwards:
In: list(range(12, 0, -2))
resulting in:
Out: [12, 10, 8, 6, 4, 2]
• zip
The zip function “pairs” up the elements of a number of lists, tuples or other sequences,
yielding tuples that can be collected into a list. For example,
In: a = [1, 2, 3, 4]
In: b = [10, 20, 30, 40]
In: zipped_result = zip(a, b)
In: list(zipped_result)
resulting in:
Out: [(1, 10), (2, 20), (3, 30), (4, 40)]
• append() method
append() is a method that appends an element to the end of a list. For example,
In: a = [1, 2, 3, 4]
In: a.append(999)
In: list(a)
resulting in:
Out: [1, 2, 3, 4, 999]
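In practice range, zip and append() are often combined inside loops. The sketch below shows one possible way of doing so; the lists widths and heights and their values are hypothetical.

```python
widths = [7.0, 3.5, 6.0]    # hypothetical feature values
heights = [7.5, 12.0, 6.2]

# range + append: collect the index of every item wider than 5.
wide_indices = []
for i in range(len(widths)):
    if widths[i] > 5:
        wide_indices.append(i)

# zip + append: walk both lists in step and build one [width, height] row per item.
rows = []
for w, h in zip(widths, heights):
    rows.append([w, h])

# zip stops as soon as the shortest input is exhausted.
short = list(zip([1, 2, 3], ["a", "b"]))

print(wide_indices)
print(rows)
print(short)
```

Note that range(len(widths)) yields the indices 0, 1 and 2, so the stop value itself is never included.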
3.2 Vector Operations
There are several useful operations that we can perform with vectors and matrices, for example:
• We can sum two vectors a and b. Let $a = \{a_1, a_2, \ldots, a_n\}$ and $b = \{b_1, b_2, \ldots, b_n\}$. The sum is
$a + b = \{a_1 + b_1, a_2 + b_2, \ldots, a_n + b_n\}$.
• We can compute the dot product between two vectors. Let $a = \{a_1, a_2, \ldots, a_n\}$ and
$b = \{b_1, b_2, \ldots, b_n\}$. The dot product is $a \cdot b = a_1 \times b_1 + a_2 \times b_2 + \ldots + a_n \times b_n$.
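As a small worked example of both operations, take two three-dimensional vectors. The numbers are chosen here only for illustration.

```latex
\[
\begin{aligned}
a &= \{1, 2, 3\}, \qquad b = \{4, 5, 6\} \\
a + b &= \{1 + 4,\; 2 + 5,\; 3 + 6\} = \{5, 7, 9\} \\
a \cdot b &= 1 \times 4 + 2 \times 5 + 3 \times 6 = 4 + 10 + 18 = 32
\end{aligned}
\]
```

Note that the sum is again a vector of the same dimension, while the dot product is a single number.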
Exercise 1:
Write the following three methods in Python:
• Sum(a, b): Receives two vectors a and b represented as lists, and returns their sum.
• DotProduct(a, b): Receives two vectors a and b and returns the dot product a· b.
• Write in Python a method transpose(A), which returns the transpose of a vector or a matrix
A.
The inverse of a matrix is a different concept than the transpose. However, before introducing
the inverse, we have to define the identity matrix I. I is a square matrix (i.e., of size n × n),
whose diagonal elements are all 1s, and all other elements are 0s. For instance, the matrix
\[
I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\]
is a 3 × 3 identity matrix.
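With matrices represented as lists of lists, an identity matrix of any size can be built with a nested list comprehension. This is only one possible sketch, and the helper name identity is hypothetical.

```python
def identity(n):
    """Return the n x n identity matrix as a list of lists (hypothetical helper)."""
    # Entry (i, j) is 1 on the diagonal (i == j) and 0 everywhere else.
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


I3 = identity(3)
for row in I3:
    print(row)
```

Each inner list is one row of the matrix, so I3[i][j] is the element in row i and column j.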
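We can now introduce the inverse. We indicate the inverse of a matrix A by $A^{-1}$, and it is
defined as the matrix $A^{-1}$ such that $AA^{-1} = I$. That is, the multiplication of a matrix by its
inverse has as a result the identity matrix.
For example, given the matrix $A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, its inverse is
$A^{-1} = \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}$. It can be verified that
\[
\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\]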
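Writing out the product entry by entry makes the verification explicit. Each entry of the result is the dot product of a row of the first matrix with a column of the second.

```latex
\[
\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}
=
\begin{pmatrix}
2 \times 1 + 1 \times (-1) & 2 \times (-1) + 1 \times 2 \\
1 \times 1 + 1 \times (-1) & 1 \times (-1) + 1 \times 2
\end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\]
```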
Exercise 4:
Write in Python a method isInverse(A, B), which returns True if B is the inverse of A, and False
otherwise. Again, assume that the matrices are represented as lists of lists.
Hint: You are just being asked to verify if B is the inverse, not to actually calculate the inverse.
You can make use of the multiplication method mult(A, B) from Exercise 2,
• Given two points a and b, create a method dist(a, b), which returns their Euclidean distance.
• Given a matrix A, create a method lowDist(A) that returns the pair of rows in A that has
the lowest Euclidean distance.
Again, for these exercises please assume that vectors are represented as Python lists and
matrices as lists of lists.
Another common metric is the cosine similarity. When calculating the cosine similarity, we
now see each item as a vector (instead of a point). We then check the angle (θ) between two
vectors (items) a and b. The lower θ, the more a and b can be considered “similar”. However,
instead of using θ directly, we use cos(θ). Hence, a cosine similarity of 1 means that the two
items point in the same direction and are considered equivalent, while a cosine similarity of −1
indicates that they are complete opposites.
The cosine similarity can be directly calculated using the dot product, as:
\[
\cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}
\]
We use $\|a\|$ to indicate the norm of a vector, which is the same as
$\sqrt{a \cdot a} = \sqrt{\sum_{i=1}^{n} a_i^2}$, where
n is the dimension of the vector. Therefore, the above equation leads to:
\[
\cos(\theta) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}.
\]
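As a small worked example, with vectors chosen here only for illustration, the formula can be evaluated step by step:

```latex
\[
\begin{aligned}
a &= \{1, 2, 2\}, \qquad b = \{2, 1, 2\} \\
a \cdot b &= 1 \times 2 + 2 \times 1 + 2 \times 2 = 8 \\
\|a\| &= \sqrt{1^2 + 2^2 + 2^2} = 3, \qquad \|b\| = \sqrt{2^2 + 1^2 + 2^2} = 3 \\
\cos(\theta) &= \frac{8}{3 \times 3} = \frac{8}{9} \approx 0.889
\end{aligned}
\]
```

The value is close to 1, so the two vectors point in similar directions. Replacing b by $-a = \{-1, -2, -2\}$ gives $a \cdot b = -9$ and therefore $\cos(\theta) = -1$.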
Exercise 6:
Given two vectors a and b, create a Python method cosSimilarity(a,b), which returns their
cosine similarity.