Lab description file (4)

The document provides an introduction to basic Python libraries used in data mining, specifically NumPy, Matplotlib, and SciPy. It covers how to import these libraries, perform basic operations with NumPy arrays, create plots with Matplotlib, and utilize SciPy for mathematical algorithms and statistical functions. Exercises are included to reinforce learning and encourage practical application of the concepts discussed.

Uploaded by

Ren Keting
Copyright
© All Rights Reserved

SCC403 – Data Mining

Week 2: Introduction to the basic Python libraries

Aim of the session:

• NumPy
• Matplotlib
• SciPy

1 Introduction

In computer programming, we can use libraries to help us develop our
applications. Fundamentally, a library is like a “toolbox” that makes
common tasks easier. It consists of code (subroutines) developed by others,
which we can use in our own applications by calling specific functions.
To use a library, we write: import [library name] as [short name]. Here
[library name] is the library that you want to load, and [short name] is
how you will refer to it in your program. It is also possible to write:
from [library name] import [package], when we only want to use a specific
package from a library. We will see some examples later. In this lab we
look at three very common libraries for scientific computing in Python:
NumPy, Matplotlib and SciPy, though there are others that we will refer to
later in the course.
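As a minimal sketch of the two import forms described above, the snippet below reaches the same NumPy function in both ways. The vector v and the variable names are hypothetical, chosen only for illustration.

```python
# Form 1: import the whole library under a short name.
import numpy as np

# Form 2: import one specific package from a library.
from numpy import linalg

v = np.array([3.0, 4.0])

# Both names lead to the same function; only the way we reach it differs.
norm_via_short_name = np.linalg.norm(v)
norm_via_package = linalg.norm(v)

print(norm_via_short_name, norm_via_package)
```

Both calls compute the length of the vector, so they should print the same number.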

2 NumPy

NumPy is the main library for using and manipulating matrices in Python,
including linear algebra and statistics capabilities. It is therefore a
fundamental library for scientific computing and data science. NumPy
matrices are similar to the data structures used in other scientific
computing frameworks, such as Matlab. In fact, if you are a Matlab user,
you can see the main similarities and differences at
https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html.
As mentioned previously, we first need to import the library. We do that
by typing:
    import numpy as np
Now we can use np to refer to functions implemented in the NumPy library.

2.1 Basic Operations with arrays in NumPy

The main object in NumPy is the array, which can be seen as a table of
elements (usually numbers), all of the same type, indexed by a tuple of
non-negative integers. Note that indexing starts from 0, as in the C
language. In NumPy, dimensions are called axes. For example, the array
holding the coordinates of a point in 3D space, [3, 4, 2], has one axis.
That axis has 3 elements in it, so we say it has a length of 3. Arrays can
also have more axes; for example, the following array has 2 axes:
    [[1., 0., 0.],
     [0., 1., 2.]]
The first axis has a length of 2, the second axis has a length of 3.
NumPy’s array class is called ndarray.
You can check the dimensions of an ndarray through its shape attribute,
ndarray.shape. For a matrix with n rows and m columns, the shape is (n, m).
Arithmetic operators apply to arrays element-wise.
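The short sketch below illustrates axes, shape and element-wise arithmetic. The arrays x and y are hypothetical values chosen only for illustration.

```python
import numpy as np

point = np.array([3, 4, 2])               # one axis of length 3
table = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 2.0]])       # two axes: 2 rows, 3 columns

print(point.shape)   # (3,)
print(table.shape)   # (2, 3)
print(table.ndim)    # 2

# Arithmetic operators act element by element.
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])
print(x + y)         # [11. 22. 33.]
print(x * y)         # [10. 40. 90.]
print(2 * x)         # [2. 4. 6.]
```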
Arrays can be initialised with zeros or with sequences of numbers. For
example:
    np.zeros((2, 1))
will produce
    array([[0.],
           [0.]])
To create sequences of numbers, NumPy provides the arange function, which
is analogous to the Python built-in range but returns an array.
    c = np.arange(3, 21, 2)
will produce
    array([ 3,  5,  7,  9, 11, 13, 15, 17, 19])
We defined a sequence that starts at 3 and increases in steps of 2 up to
21. Note that the end value (21) is not included.
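To see the excluded end value in practice, here is a small sketch that compares arange with the built-in range and then moves the end value just past 21. The name c_with_21 is hypothetical.

```python
import numpy as np

c = np.arange(3, 21, 2)

# arange mirrors the built-in range: same start, end (excluded) and step.
same_as_range = list(c) == list(range(3, 21, 2))
print(same_as_range)             # True

# Moving the end value just past 21 brings 21 into the sequence.
c_with_21 = np.arange(3, 22, 2)

print(c[-1])                     # 19
print(c_with_21[-1])             # 21
print(len(c), len(c_with_21))    # 9 10
```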
Exercise 1:
For the scalar λ = 0.1 and the two vectors a = [3, 45, 7, 2] and b =
[2, 54, 13, 15], define the vectors as ndarrays and then calculate:

• the sum of a and b;
• the product of λ and a;
• the element-wise product of a and b.

2.2 Indexing arrays

One-dimensional arrays can be indexed, sliced and iterated over, much like
lists and other Python sequences.
Let us take the sequence c we defined earlier and index some of its
elements (remember that, as in the C language, indexing starts at 0). We
get:
    c[0]    # 3
    c[2]    # 7
    c[8]    # 19
Some examples of slicing the same array follow:
    c[1:4]
    array([5, 7, 9])

    c[7:3:-1]
    array([17, 15, 13, 11])

    c[1:7:2] = 100
    c
    array([  3, 100,   7, 100,  11, 100,  15,  17,  19])
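The sketch below spells out which indices a slice of the form start:stop:step selects, and shows iteration over an array. It rebuilds c from scratch, so it does not depend on the assignment above; the names picked and total are hypothetical.

```python
import numpy as np

c = np.arange(3, 21, 2)    # [ 3  5  7  9 11 13 15 17 19]

# A slice start:stop:step picks indices start, start+step, ... below stop.
picked = c[1:7:2]          # indices 1, 3 and 5
print(picked)              # [ 5  9 13]

# Negative indices count from the end, as with Python lists.
print(c[-1])               # 19

# Iterating visits the elements in order.
total = 0
for value in c:
    total += value
print(total)               # 99
```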

2.3 Linear Algebra

Let us consider the following array:
    d = np.array([[1.0, 2.0], [3.0, 4.0]])
Its transpose can be found as follows:
    d.transpose()
The result is:
    array([[1., 3.],
           [2., 4.]])
Computing the inverse of a square matrix is also straightforward with
NumPy:
    np.linalg.inv(d)
The result is:
    array([[-2. ,  1. ],
           [ 1.5, -0.5]])
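One way to convince ourselves that the inverse is right is to multiply it back by d. The sketch below does this with a tolerance-based comparison, since floating-point results are rarely exact; the names d_inv and product are hypothetical.

```python
import numpy as np

d = np.array([[1.0, 2.0], [3.0, 4.0]])
d_inv = np.linalg.inv(d)

# d.T is shorthand for d.transpose().
print(np.array_equal(d.T, d.transpose()))   # True

# A matrix times its inverse should give the identity matrix, up to
# floating-point rounding, so we compare within a tolerance.
product = d @ d_inv
print(np.allclose(product, np.eye(2)))      # True
```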
You can find more details in the official NumPy tutorial, available at
https://docs.scipy.org/doc/numpy/user/quickstart.html. The linear algebra
routines, in particular, are implemented in linalg.py in the NumPy folder.

3 Matplotlib

Matplotlib is a library for plotting graphs in Python. It has an
object-oriented interface and a simpler interface called PyPlot, which is
similar to Matlab. We will use the PyPlot interface in our labs.
Each pyplot function makes some change to a figure: e.g., creates a
figure, creates a plotting area in a figure, plots some lines in a plotting
area, decorates the plot with labels, etc.
Let us start with a very simple example:
    import matplotlib.pyplot as plt

    plt.plot([1, 4, 3, 2])
    plt.ylabel('some numbers')
    plt.show()
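For comparison, here is a rough sketch of the same plot written with the object-oriented interface mentioned above. Selecting the "Agg" backend is an assumption made so that the sketch runs without opening a window; in the lab you would normally call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen, no window needed
import matplotlib.pyplot as plt

# plt.subplots() returns a Figure and an Axes object to work on directly.
fig, ax = plt.subplots()
ax.plot([1, 4, 3, 2])
ax.set_ylabel("some numbers")

# When only one list is given, it is used as y, and x defaults to 0, 1, 2, ...
line = ax.lines[0]
x_values = list(line.get_xdata())
y_values = list(line.get_ydata())
print(x_values)
print(y_values)

plt.close(fig)
```

The methods on ax play the role of the pyplot functions: ax.plot corresponds to plt.plot, and ax.set_ylabel to plt.ylabel.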

Formatting the style of the plot


For every x, y pair of arguments, there is an optional third argument: the
format string, which sets the color and line type of the plot. The letters
and symbols of the format string are the same as in MATLAB. The default
format string is 'b-', which is a solid blue line.
The example below plots several lines with different format styles in one
function call, using arrays.
    import numpy as np
    import matplotlib.pyplot as plt

    # evenly sampled time at 200 ms intervals
    t = np.arange(0., 5., 0.2)

    # red dashes, blue squares and green triangles
    plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    plt.show()

You can create multiple figures by calling figure several times with an
increasing figure number. Each figure can contain as many axes and subplots
as necessary:
    import matplotlib.pyplot as plt

    plt.figure(1)         # the first figure
    plt.subplot(211)      # the first subplot in the first figure
    plt.plot([1, 2, 3])
    plt.subplot(212)      # the second subplot in the first figure
    plt.plot([6, 5, 4])
    plt.show()
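The three-digit argument to subplot can look cryptic at first. The sketch below writes it out in its longer form, under the assumption of an off-screen "Agg" backend so that no window is opened; the names top, bottom and n_axes are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen, no window needed
import matplotlib.pyplot as plt

fig = plt.figure(1)

# subplot(211) is shorthand for subplot(2, 1, 1):
# a grid with 2 rows and 1 column, selecting position 1 (the top cell).
top = plt.subplot(2, 1, 1)
plt.plot([1, 2, 3])

# Position 2 is the bottom cell of the same 2 x 1 grid.
bottom = plt.subplot(2, 1, 2)
plt.plot([6, 5, 4])

n_axes = len(fig.axes)
print(n_axes)   # 2

plt.close(fig)
```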

You can find more details in the official PyPlot tutorial.
Exercise 2:
Please follow the tutorial at
https://matplotlib.org/tutorials/introductory/pyplot.html, and try the
examples on your own computer.

4 SciPy

SciPy is a collection of mathematical algorithms and convenience functions
built on the NumPy extension of Python. It adds significant power to the
interactive Python session by providing the user with high-level commands
and classes for manipulating and visualizing data. In particular, it
includes the following packages:
• cluster: Clustering algorithms
• constants: Physical and mathematical constants
• fftpack: Fast Fourier Transform routines
• integrate: Integration and ordinary differential equation solvers
• interpolate: Interpolation and smoothing splines
• io: Input and Output
• linalg: Linear algebra
• ndimage: N-dimensional image processing
• odr: Orthogonal distance regression
• optimize: Optimization and root-finding routines
• signal: Signal processing
• sparse: Sparse matrices and associated routines
• spatial: Spatial data structures and algorithms
• special: Special functions
• stats: Statistical distributions and functions
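To give a feel for how these packages are used, the sketch below touches two of them, constants and integrate. The integrand x**2 is a hypothetical example chosen because its exact integral from 0 to 1 is 1/3.

```python
import numpy as np
from scipy import constants, integrate

# constants: physical and mathematical constants.
print(constants.pi == np.pi)          # True

# integrate: numerically integrate x**2 from 0 to 1.
# quad returns the estimate together with an estimate of the absolute error.
value, abs_error = integrate.quad(lambda x: x**2, 0.0, 1.0)
print(np.isclose(value, 1.0 / 3.0))   # True
```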

4.1 Statistics

The stats package contains various probability distributions and
statistical functions. For instance, let’s see what we can do with the
Normal distribution. First we load the statistics package:
    from scipy import stats
We can now easily draw samples from the Normal distribution using random
variates sampling, rvs:
    a = stats.norm.rvs(size=3)
The size parameter allows you to specify how many random numbers you want
to generate. Try it on your computer, and you will see that a different
array is returned each time. You can also specify the mean and the standard
deviation of the Normal distribution, by using the loc and scale
parameters, respectively. For example, check the output of the following
two commands:
    a = stats.norm.rvs(loc=6, scale=5, size=5)
    b = stats.norm.rvs(loc=4, scale=0.5, size=5)
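A quick way to see what loc and scale do is to draw a large sample and look at its mean and standard deviation. In the sketch below, the random_state argument and the sample size of 10000 are additions for this illustration (they make the run repeatable and the estimates stable); the commands above leave the seed unset.

```python
from scipy import stats

# random_state fixes the seed so that the run is repeatable.
sample = stats.norm.rvs(loc=6, scale=5, size=10000, random_state=0)

# With many samples, the sample mean and standard deviation should land
# close to loc and scale respectively.
sample_mean = sample.mean()
sample_std = sample.std()
print(sample_mean)
print(sample_std)
```

The printed values should be close to 6 and 5, though not exactly equal to them.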
Besides simulating distributions, several statistical functions are
available. For instance, we can run a t-test to compare two samples, which
is a very useful approach for analysing experimental data. Let’s see the
example below:
    rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
    rvs2 = stats.norm.rvs(loc=5, scale=10, size=500)
    Result = stats.ttest_ind(rvs1, rvs2)
    print(Result)
Here we generate two samples from the same distribution (rvs1, rvs2), and
then perform a t-test in the third line. The output of the t-test will vary
each time you run it, given that rvs1 and rvs2 are re-generated, but you
will see something like:

Ttest_indResult(statistic=-0.5489036175088705, pvalue=0.5831943748663959

As you can see, the p-value was around 0.58. Since 0.58 > 0.01, the t-test
does not reject the hypothesis that the two samples have the same mean,
which is consistent with both samples coming from the same distribution.
Now if you try:
    rvs3 = stats.norm.rvs(loc=8, scale=10, size=500)
    Result_1_3 = stats.ttest_ind(rvs1, rvs3)
    print(Result_1_3)
The output will be similar to:

Ttest_indResult(statistic=-4.533414290175026, pvalue=6.507128186389019e-

As we can see, this time we compared rvs1 against a sample from a
different distribution (rvs3), since rvs3 has a different mean. The t-test
returned a very low p-value (6.50 × 10^-6), correctly indicating that the
underlying distributions that generated the samples are different.
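To make the numbers returned by ttest_ind less of a black box, the sketch below recomputes them by hand for two samples like rvs1 and rvs3. It assumes the default behaviour of ttest_ind, which treats the two samples as having equal variances; the fixed seeds and the names pooled_var, t_manual and p_manual are additions for this illustration.

```python
import numpy as np
from scipy import stats

# Fixed seeds so that the run is repeatable (the lab leaves them unset).
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=1)
rvs3 = stats.norm.rvs(loc=8, scale=10, size=500, random_state=3)

n1, n3 = len(rvs1), len(rvs3)
dof = n1 + n3 - 2   # degrees of freedom

# Pooled variance: a weighted average of the two sample variances.
pooled_var = ((n1 - 1) * rvs1.var(ddof=1) + (n3 - 1) * rvs3.var(ddof=1)) / dof

# t = difference of the sample means divided by its standard error.
t_manual = (rvs1.mean() - rvs3.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n3))

# Two-sided p-value from the Student t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)

result = stats.ttest_ind(rvs1, rvs3)
print(np.isclose(t_manual, result.statistic))
print(np.isclose(p_manual, result.pvalue))
```

Both comparisons should print True, which shows that the p-value is just the probability of seeing a t statistic at least this extreme if the two means were equal.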

4.2 Linear Algebra

SciPy provides several functions for linear algebra in the linalg package.
It has more linear algebra functions than NumPy, and they may run faster,
depending on how NumPy was installed. Hence, we show here some examples of
linear algebra using SciPy.
For instance, let’s see how to find the inverse of a matrix. As you may
recall, the inverse of a matrix A is a matrix B such that AB = I, where I
is the identity matrix. That is, for 3 × 3 matrices,

$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

In SciPy, we can obtain the inverse of a matrix A by using linalg.inv(A).
Additionally, we can multiply a matrix A by a matrix B using A.dot(B).
Hence, the following example calculates the inverse and checks the result:
    import numpy as np
    from scipy import linalg

    A = np.array([[1, 3, 5], [2, 5, 1], [2, 3, 8]])
    B = linalg.inv(A)
    print(B)
    print(A.dot(B))
You will find the matrix

$$B = \begin{pmatrix} -1.48 & 0.36 & 0.88 \\ 0.56 & 0.08 & -0.36 \\ 0.16 & -0.12 & 0.04 \end{pmatrix},$$

which when multiplied by A leads to:

$$AB = \begin{pmatrix} 1.00000000 & -1.11022302 \times 10^{-16} & -5.55111512 \times 10^{-17} \\ 3.05311332 \times 10^{-16} & 1.00000000 & 1.87350135 \times 10^{-16} \\ 2.22044605 \times 10^{-16} & -1.11022302 \times 10^{-16} & 1.00000000 \end{pmatrix}.$$

As we can see, this is very close to the “ideal” identity matrix.
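Rather than judging "very close" by eye, we can let NumPy compare the product with the identity matrix within a tolerance. This is a minimal sketch of that check; np.allclose and np.round are used here as one reasonable choice, not as the only way to do it.

```python
import numpy as np
from scipy import linalg

A = np.array([[1, 3, 5], [2, 5, 1], [2, 3, 8]])
B = linalg.inv(A)

# Entries such as 1e-16 are rounding noise, so an exact == comparison with
# the identity matrix would fail; allclose compares within a tolerance.
print(np.allclose(A.dot(B), np.eye(3)))   # True

# Rounding B to two decimals shows its entries more clearly.
print(np.round(B, 2))
```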
Another common linear algebra operation is finding the eigenvalues and
eigenvectors of a matrix A. For instance, we will need these operations
when implementing PCA in the next lab session. Fundamentally, the
eigenvalues and eigenvectors are the scalars λ and corresponding non-zero
vectors v such that Av = λv. These can be found in SciPy using the function
linalg.eig(A), which returns the eigenvalues followed by the eigenvectors.
For instance, consider the following example:
    import numpy as np
    from scipy import linalg

    A = np.array([[1, 2], [3, 4]])
    la, v = linalg.eig(A)
    l1, l2 = la

    print(l1, l2)      # eigenvalues
    print(v[:, 0])     # first eigenvector
    print(v[:, 1])     # second eigenvector
    print(A.dot(v[:, 0]) - l1 * v[:, 0])   # check the computation
    print(A.dot(v[:, 1]) - l2 * v[:, 1])
By running this example, you will find that for the matrix

$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},$$

the eigenvalues are -0.3722 and 5.3722. The respective eigenvectors are
[−0.8245, 0.5657] and [−0.41597356, −0.90937671]. We check the computation
by evaluating Av − λv, which should return a vector of zeros. Indeed, our
code outputs:
    [0.00000000e+00+0.j 5.55111512e-17+0.j]
    [-4.4408921e-16+0.j 0.0000000e+00+0.j]
which, as we can see, is very close to 0 for all values.
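The same check can be written more compactly for all eigenvectors at once, and the eigenvalues can be cross-checked against the closed-form solution for a 2 × 2 matrix. The sketch below is one way to do this; the name expected is hypothetical.

```python
import numpy as np
from scipy import linalg

A = np.array([[1, 2], [3, 4]])
la, v = linalg.eig(A)

# eig returns complex eigenvalues in general; for this matrix the
# imaginary parts are zero.
print(np.allclose(la.imag, 0))                   # True

# Check A v = lambda v for every eigenvector at once: v * la scales
# column j of v by la[j].
print(np.allclose(A.dot(v), v * la))             # True

# For a 2 x 2 matrix, the eigenvalues solve l^2 - trace*l + det = 0.
# Here trace = 5 and det = -2, which gives (5 +/- sqrt(33)) / 2.
expected = np.array([(5 - np.sqrt(33)) / 2, (5 + np.sqrt(33)) / 2])
print(np.allclose(np.sort(la.real), expected))   # True
```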
Exercise 3:
Try additional examples in the SciPy tutorials
(https://docs.scipy.org/doc/scipy/reference/). While doing so, plot graphs
using the PyPlot library.
