Lab description file (4)

The document outlines the first week of a Data Mining course, focusing on the installation of Python and an introduction to linear algebra concepts such as vectors, matrices, and distance metrics. It provides instructions for setting up Python IDEs, including Anaconda and online compilers, and emphasizes the importance of understanding linear algebra for machine learning. Additionally, it includes exercises for implementing basic vector and matrix operations in Python.

Uploaded by

Ren Keting

SCC403 – Data Mining

Week 1: Introduction

Aim of the session:


• Installations and Introduction to Python

• Introduction to Linear Algebra using Python, including:

    • Vectors and Matrices

    • Common Vector and Matrix Operations

    • Distance Metrics

1 Introduction
Welcome to the first practical session of the Data Mining course. In this session we are going to
review some of the fundamental mathematical concepts that we are going to use in this module,
and do some programming exercises with these concepts. That will solidify the base that you are
going to need for the lab sessions, for the coursework assignments and for the future.
We are going to use Python, version 3. Python is a general purpose programming language, and
it is widely used in data science, scientific computing and machine learning. Python is completely
free and open source. Additionally, it is very popular for three main reasons: (i) it is easy to learn,
being recommended as a “first” programming language; (ii) it allows easy and quick development
of applications; (iii) it has a great variety of useful and open libraries.
If you have never used Python before, then please take a look at a Python tutorial before
proceeding, so that you can follow the Data Mining labs smoothly. For example, the official
Python tutorial at https://docs.python.org/3/tutorial/index.html is a good starting point.
The TAs are going to demonstrate how to start Python, so in this lab you can focus on Chapters
3, 4 and 5, that is, from “An Informal Introduction to Python” up to and including “Data
Structures”. However, if you already have experience with Python, you can start the lab from the
next section.

2 Online IDEs for Python


First things first. Before we start, we need to set up our Integrated Development Environment, or
IDE.
There are several options.
Perhaps the easiest is to use one of the simplest online IDEs, Repl.it. A more sophisticated option is
to use the OnlineGDB compiler.

• https://repl.it/
Click on “new repl”; there is no need to sign up

• https://www.onlinegdb.com/online_python_compiler

However, perhaps the best option is to install Spyder (an abbreviation of Scientific PYthon
Development EnviRonment) from Anaconda Navigator (which is free to install). It is also possible
to find Anaconda on AppsAnywhere. In the labs on campus it will already be installed, but if you
would like to install it at home, follow these steps:

• Download and install Anaconda from https://www.anaconda.com/

• Open Anaconda Navigator and, in Environments, find Python, mark it for a specific version, and
choose Python 3.7.7 (the default version may be 3.8; please see the next item for the sequence
needed to downgrade it to 3.7.7)

• Again in the Environments of Anaconda Navigator, check that PyTorch is installed and check
its version. If it is version 1.6, downgrade it to version 1.4. After that you will also be able to
downgrade Python from the default version 3.8 to version 3.7.7, which
we need (for Labs 8, 9 and 10 you will need PyTorch v1.4)

• Install Torchvision (which you will need for Lab 9) from the Anaconda command prompt with the
command:
    pip install torchvision
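
Once the installation is finished, a quick check from within Python can confirm which interpreter version is running and whether the packages mentioned above can be found. The sketch below is only an illustration: the helper name `describe_environment` is hypothetical, and it reports whether `torch` and `torchvision` are discoverable without importing them.

```python
import sys
import importlib.util


def describe_environment():
    """Return the running Python version and whether the lab packages can be found."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    packages = {
        name: importlib.util.find_spec(name) is not None
        for name in ("torch", "torchvision")
    }
    return version, packages


version, packages = describe_environment()
print("Python version:", version)
for name, found in packages.items():
    print(name, "found" if found else "not found")
```

If a package is reported as not found, it is probably installed in a different environment from the one that Spyder is using.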

3 Linear Algebra Revision
Data is usually organised in tables, where each row represents an item, and each column a feature
of the item. That is, each item is normally characterised by a list of values. For example, a fruit
could be characterised by its width, height, weight, colour, type, price, etc. In this lab we will
assume that all features are numbers. Later in class we will see how to handle different kinds of
features.
Therefore, mathematically we can see a data-set as a matrix, and each item as one point (or a
vector) in a multi-dimensional space. This mathematical point of view is the fundamental basis of
many of the machine learning techniques. Hence, it is important to understand the basic concepts
of linear algebra.

3.1 Vectors, Spaces, and Matrices


We represent a list of values as a vector, which is usually written in bold face. For example,
v = (1, 5, 9, 10.5) is a vector that holds 4 values (features). Another common representation is to
write an arrow on top of the variable, e.g. $\vec{v} = (1, 5, 9, 10.5)$. Usually we will use vectors in
the real space ($\mathbb{R}$), which means that a feature can be any real number (instead of being, for
example, only integers).
The dimension of the space is the number of features that we have in our data-set. Hence, in
the examples above we have vectors with four dimensions, and we say that the vectors are in $\mathbb{R}^4$
space. In general, if we have n features, we have vectors in $\mathbb{R}^n$ space. It is also common to see a
data item with n features as a point in the $\mathbb{R}^n$ space.
Only one-, two- and three-dimensional spaces can be directly visualised. We can still do calculations
and reason in more than 3 dimensions; they just cannot be visualised easily.
When we have multiple items, we represent them as a table, where each row is an item and
each column is a feature. We represent that as a matrix, such as the one below:


$$V = \begin{pmatrix} 0.2 & 3 & 4.5 \\ 2.5 & 4.1 & 3.7 \\ 0.7 & 1.5 & 2.5 \\ 2.75 & 3.5 & 2.47 \end{pmatrix}$$

This matrix represents four items, each containing three features. If we get one row of the
matrix, we would have one vector/point, for example: $V_0 = (0.2, 3, 4.5)$.

3.1.1 Some useful Python functions


Before we go into details of how to represent vectors in Python we will look closer at some specific
Python functions.

• range
The range function returns a range object, a lazily evaluated sequence of evenly spaced integers;
passing it to list shows its elements. For example,
    In : list(range(9))

results in:
    Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]

You can also try
    In : list(range(0, 10, 2))

resulting in:
    Out: [0, 2, 4, 6, 8]

or going backwards:
    In : list(range(12, 0, -2))

resulting in:
    Out: [12, 10, 8, 6, 4, 2]

• zip
The zip function “pairs” up the elements of a number of lists, tuples or other sequences,
yielding tuples that list can collect into a list of tuples. For example,
    In : a = [1, 2, 3, 4]
    In : b = [10, 20, 30, 40]
    In : zipped_result = zip(a, b)
    In : list(zipped_result)

resulting in:
    Out: [(1, 10), (2, 20), (3, 30), (4, 40)]

• append() method
append() is a method that appends an element to the end of a list. For example,
    In : a = [1, 2, 3, 4]
    In : a.append(999)
    In : list(a)

resulting in:
    Out: [1, 2, 3, 4, 999]

Note that instead of
    In : list(a)

one may also use
    In : print(a)
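
The three tools above are often used together. As a minimal sketch (the list names and values are made up purely for illustration), the following builds the same result list twice, once by indexing with range and once by pairing elements with zip, using append in both cases:

```python
# Two small lists of equal length (values chosen only for illustration).
widths = [7.0, 6.5, 8.2]
heights = [7.5, 6.0, 9.1]

# range(len(...)) yields the valid indices 0, 1, 2.
areas_by_index = []
for i in range(len(widths)):
    areas_by_index.append(widths[i] * heights[i])

# zip walks both lists in parallel, so no index is needed.
areas_by_zip = []
for w, h in zip(widths, heights):
    areas_by_zip.append(w * h)

print(areas_by_index == areas_by_zip)  # both loops build the same list
```

The zip version is usually easier to read, because it avoids manual index handling.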

3.1.2 Representation in Python


There are two ways to represent vectors and matrices in Python: as lists or as NumPy arrays. In
this lab we are going to first use the list representation, so that you can practise how the data
manipulation operations can be implemented. In the next lab we will introduce NumPy arrays,
which facilitate vector and matrix manipulation.
With lists, representing a vector is straightforward. For instance, our example vector $\vec{v} = (1, 5, 9, 10.5)$
would be represented as:
    v = [1, 5, 9, 10.5]
For matrices, we can use a list of lists. That is, we see the matrix as a list of rows, where each
row is a list of column values. Therefore, our example matrix

$$V = \begin{pmatrix} 0.2 & 3 & 4.5 \\ 2.5 & 4.1 & 3.7 \\ 0.7 & 1.5 & 2.5 \\ 2.75 & 3.5 & 2.47 \end{pmatrix}$$

would be represented as:
    V = [[0.2, 3, 4.5], [2.5, 4.1, 3.7], [0.7, 1.5, 2.5], [2.75, 3.5, 2.47]]
Therefore, to access the first element of a vector v, you would use in Python v[0]. To get the
second row of a matrix V, you would use V[1]. And to access the third element of the second row,
it would be V[1][2].
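
The indexing rules above can be tried out directly on the example matrix. The following is a small sketch; the variable names (`n_items`, `third_column` and so on) are illustrative choices and not part of the lab material. It also shows that a column has to be collected row by row, since the list-of-lists representation only stores rows.

```python
V = [[0.2, 3, 4.5],
     [2.5, 4.1, 3.7],
     [0.7, 1.5, 2.5],
     [2.75, 3.5, 2.47]]

n_items = len(V)        # number of rows (items)
n_features = len(V[0])  # number of columns (features)

second_row = V[1]       # [2.5, 4.1, 3.7]
element = V[1][2]       # 3.7

# A column is not stored directly, so it is collected row by row.
third_column = [row[2] for row in V]  # [4.5, 3.7, 2.5, 2.47]

print(n_items, n_features)
print(second_row, element, third_column)
```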

3.2 Vector Operations
There are several useful operations that we can perform with vectors and matrices, for example:

• We can sum two vectors a and b. Let $a = (a_1, a_2, \ldots, a_n)$ and $b = (b_1, b_2, \ldots, b_n)$. The sum is
$a + b = (a_1 + b_1, a_2 + b_2, \ldots, a_n + b_n)$.

• We can multiply a vector a by a scalar (i.e., a number) λ. Let $a = (a_1, a_2, \ldots, a_n)$ and λ be
a scalar. The multiplication is $\lambda a = (\lambda a_1, \lambda a_2, \ldots, \lambda a_n)$.

• We can compute the dot product between two vectors. Let $a = (a_1, a_2, \ldots, a_n)$ and
$b = (b_1, b_2, \ldots, b_n)$. The dot product is $a \cdot b = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n$.
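
To make the three definitions concrete, here is one possible sketch using list comprehensions and zip. The function names `vector_sum`, `scalar_mult` and `dot_product` are hypothetical, and the scalar parameter is called `lam` because `lambda` is a reserved word in Python. Treat it as an illustration of the formulas under the assumption that both vectors are plain lists of equal length, not as the only way to write them.

```python
def vector_sum(a, b):
    """Element-wise sum of two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def scalar_mult(a, lam):
    """Multiply every element of a by the scalar lam."""
    return [lam * x for x in a]


def dot_product(a, b):
    """Sum of the products of matching elements."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


a = [1, 2, 3]
b = [4, 5, 6]
print(vector_sum(a, b))   # [5, 7, 9]
print(scalar_mult(a, 2))  # [2, 4, 6]
print(dot_product(a, b))  # 1*4 + 2*5 + 3*6 = 32
```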

Exercise 1:
Write the following three methods in Python:

• Sum(a, b): Receives two vectors a and b represented as lists, and returns their sum.

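• Mult(a, lam): Receives a vector a and a scalar λ (named lam here, because lambda is a
reserved word in Python and cannot be used as a parameter name) and returns λa.

• DotProduct(a, b): Receives two vectors a and b and returns the dot product $a \cdot b$.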

3.3 Matrix Operations


Similarly, we can apply several operations on matrices. For instance, we can calculate a matrix
C = AB, multiplying matrix A with matrix B. Let $c_{i,j}$ be an element in the new matrix C in row
i and column j (and correspondingly $a_{i,j}$ and $b_{i,j}$ for matrices A and B). Each element $c_{i,j}$ in the
new matrix C will be given by the equation $c_{i,j} = \sum_{k=1}^{n} a_{i,k} b_{k,j}$.
That is, the element in row i and column j ($c_{i,j}$) in the new matrix C will be given by going
through row i in the matrix A and column j in the matrix B. We then multiply each pair of
numbers, and sum up all the results. For example:

$$\begin{pmatrix} 2 & 1 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 4 & 6 \\ 5 & 7 \end{pmatrix} = \begin{pmatrix} 2 \cdot 4 + 1 \cdot 5 & 2 \cdot 6 + 1 \cdot 7 \\ 3 \cdot 4 + 4 \cdot 5 & 3 \cdot 6 + 4 \cdot 7 \end{pmatrix}$$

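
The row-times-column rule can be sketched with three nested loops. The function name `mat_mult` is a hypothetical choice, and the sketch assumes well-formed lists of lists whose inner dimensions agree. Note that Python indices start at 0, whereas the equation counts from 1.

```python
def mat_mult(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p), both given as lists of rows."""
    n, m, p = len(A), len(B), len(B[0])
    if any(len(row) != m for row in A):
        raise ValueError("columns of A must match rows of B")
    C = []
    for i in range(n):
        row = []
        for j in range(p):
            # c[i][j] combines row i of A with column j of B.
            row.append(sum(A[i][k] * B[k][j] for k in range(m)))
        C.append(row)
    return C


A = [[2, 1], [3, 4]]
B = [[4, 6], [5, 7]]
print(mat_mult(A, B))  # [[13, 19], [32, 46]]
```

The printed result matches the worked example: 2·4 + 1·5 = 13, 2·6 + 1·7 = 19, 3·4 + 4·5 = 32 and 3·6 + 4·7 = 46.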
Exercise 2:
Write in Python a method mult(A, B), which receives two n × n matrices A and B, and returns
a new matrix C = AB. Assume that the matrices are represented as lists of lists.

3.4 Transpose and Inverse


Another common operation on vectors and matrices is to calculate the transpose. We indicate the
transpose of a matrix A by $A^T$, where each row of A becomes a column of $A^T$. That is, $a_{i,j} = a^T_{j,i}$.
For example, if

$$A = \begin{pmatrix} 2 & 1 & 3 \\ 4 & 6 & 9 \\ 7 & 8 & 2 \end{pmatrix}, \quad \text{then} \quad A^T = \begin{pmatrix} 2 & 4 & 7 \\ 1 & 6 & 8 \\ 3 & 9 & 2 \end{pmatrix}.$$

The same applies to vectors. That is, if $a = (3, 5, 8)$, then

$$a^T = \begin{pmatrix} 3 \\ 5 \\ 8 \end{pmatrix}.$$
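
As a hedged illustration of the matrix case, the sketch below relies on the zip function introduced earlier: unpacking the rows with `*` makes zip group the first elements of all rows, then the second elements, and so on, which are exactly the columns. The function name `transpose_matrix` is hypothetical, and the sketch assumes a non-empty list of equally long rows.

```python
def transpose_matrix(A):
    """Return the transpose of a matrix given as a list of rows."""
    # zip(*A) groups the first elements of every row, then the second ones, and so on.
    return [list(column) for column in zip(*A)]


A = [[2, 1, 3],
     [4, 6, 9],
     [7, 8, 2]]
print(transpose_matrix(A))  # [[2, 4, 7], [1, 6, 8], [3, 9, 2]]

# A vector stored as a flat list can be turned into a column,
# i.e. a list of one-element rows.
a = [3, 5, 8]
print([[x] for x in a])     # [[3], [5], [8]]
```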
Exercise 3:

• What is the meaning of $ab^T$?

• Write in Python a method transpose(A), which returns the transpose of a vector or a matrix
A.

The inverse of a matrix is a different concept from the transpose. However, before introducing
the inverse, we have to define the identity matrix I. I is a square matrix (i.e., of size n × n),
whose diagonal elements are all 1s, and all other elements are 0s. For instance, the matrix

$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

is a 3 × 3 identity matrix.
We can now introduce the inverse. We indicate the inverse of a matrix A by $A^{-1}$, and it is
defined as the matrix $A^{-1}$ such that $AA^{-1} = I$. That is, the multiplication of a matrix by its
inverse has as a result the identity matrix.
For example, given the matrix $A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, its inverse is $A^{-1} = \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}$. It can be
verified that

$$\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

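
The verification for this particular example can be reproduced in a few lines. This is a sketch for the 2 × 2 case above only: the variable names are illustrative, and the tolerance passed to `math.isclose` is an assumed value that guards against rounding when the entries are floats.

```python
import math

A = [[2, 1], [1, 1]]
A_inv = [[1, -1], [-1, 2]]
n = len(A)

# Product A * A_inv, computed element by element.
product = [[sum(A[i][k] * A_inv[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]

# Identity matrix of the same size: 1 on the diagonal, 0 elsewhere.
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

# Compare entry by entry, allowing a small absolute tolerance for float rounding.
matches = all(math.isclose(product[i][j], identity[i][j], abs_tol=1e-9)
              for i in range(n) for j in range(n))
print(product, matches)  # [[1, 0], [0, 1]] True
```

Comparing with a tolerance rather than with `==` matters as soon as the inverse contains fractions such as 1/3, which cannot be stored exactly as floats.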
Exercise 4:
Write in Python a method isInverse(A,B), which returns True if B is the inverse of A; or False
otherwise. Again, assume that the matrices are represented as lists of lists.
Hint: You are just being asked to verify whether B is the inverse, not to actually calculate the inverse.
You can make use of the multiplication method mult(A, B) from Exercise 2.

3.5 Eigenvalues and Eigenvectors


The eigenvalues and eigenvectors of matrices are also very important concepts, which are used in
several data mining methods. Given a square matrix A, the (non-zero) eigenvectors v and eigenvalues λ are such
that Av = λv. Each eigenvector v has a corresponding eigenvalue λ, which is the one that makes
the previous equation hold true. How to calculate eigenvalues and eigenvectors is beyond the
scope of this lab, but in the next labs you will find that there are libraries available for doing this
calculation.
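
Although computing eigenvalues is out of scope here, checking a given pair against the definition Av = λv only needs a matrix-vector product. The matrix, vector and scalar below were chosen by hand for illustration and do not come from the lab material; `lam` stands for λ.

```python
# Example matrix and candidate eigenpair, chosen by hand for illustration.
A = [[2, 1],
     [1, 2]]
v = [1, 1]
lam = 3

# Left-hand side: the matrix-vector product A v.
Av = [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

# Right-hand side: the vector v scaled by lam.
lam_v = [lam * x for x in v]

print(Av, lam_v, Av == lam_v)  # [3, 3] [3, 3] True
```

Multiplying v by A only stretches it by a factor of 3 without changing its direction, which is exactly what the definition expresses.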

3.6 Distance Metrics


Given two points in space, it is very useful to calculate the distance between them. As you may
remember, we represent an item in our dataset as a point. The distance allows us to calculate
how (dis-)similar two items are.
There are many distance metrics. The most common one is the Euclidean distance, which is
defined by the following equation:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2},$$

where n is the dimension of the vectors. For example, if the points have only two dimensions, then
the equation would be: $d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$.
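
As a minimal sketch of the formula, the function below sums the squared coordinate differences and takes the square root. The name `euclidean` is a hypothetical choice, and points are assumed to be lists of numbers with the same length.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of the same dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean([0, 0], [3, 4]))        # 5.0
print(euclidean([1, 5, 9], [1, 5, 9]))  # 0.0
```

The first call is the familiar 3-4-5 right triangle: the square root of 9 + 16 is 5. The second shows that a point is at distance 0 from itself.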
Exercise 5:

• Given two points a and b, create a method dist(a, b), which returns their Euclidean distance.

• Given a matrix A, create a method lowDist(A), which returns the pair of rows in A that has
the lowest Euclidean distance.

Again, for these exercises please assume that vectors are represented as Python lists and matrices
as lists of lists.

Another common metric is the cosine similarity. When calculating the cosine similarity we
will see each item now as a vector (instead of a point). We then check the angle (θ) between two
vectors (items) a and b. The smaller θ is, the more a and b can be considered “similar”. However,
instead of directly using θ, we use cos(θ). Hence, a cosine similarity of 1 would mean that two
items point in exactly the same direction (and are, in that sense, equivalent), while a cosine
similarity of −1 indicates that they are complete opposites.
The cosine similarity can be directly calculated using the dot product, as:

$$\cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}$$

We use $\|a\|$ to indicate the norm of a vector, which is the same as $\sqrt{a \cdot a} = \sqrt{\sum_{i=1}^{n} a_i^2}$, where
n is the dimension of the vector. Therefore, the above equation leads to:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}.$$
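
The formula translates almost term by term into code. The sketch below uses the hypothetical names `norm` and `cosine`, and it assumes that neither vector is the zero vector, since the norm in the denominator would then be 0 and the similarity is undefined.

```python
import math


def norm(a):
    """Length of a vector: the square root of its dot product with itself."""
    return math.sqrt(sum(x * x for x in a))


def cosine(a, b):
    """Cosine of the angle between a and b; undefined for zero vectors."""
    denominator = norm(a) * norm(b)
    if denominator == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / denominator


print(cosine([1, 0], [0, 1]))   # perpendicular vectors: 0.0
print(cosine([3, 4], [3, 4]))   # same direction: 1.0
print(cosine([1, 0], [-1, 0]))  # opposite directions: -1.0
```

With other inputs, floating-point rounding can produce values such as 0.9999999999999999 for vectors that point in the same direction, so comparisons are safer with a tolerance (for example via `math.isclose`).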

Exercise 6:
Given two vectors a and b, create a Python method cosSimilarity(a,b), which returns their
cosine similarity.
