Tutorial 1: KNN
Contents

What this Tutorial Covers
NumPy
    Boolean Indexing and Logical Operators
    Index Arrays
    pandas
Wrapping Up
What this Tutorial Covers
As there are plenty of guides and documentation for learning to do Python data science and machine learning
(ML) online, I won’t be trying to teach you this from scratch. What this tutorial will provide you with is
links to useful resources, and some tools and techniques that are especially useful or relevant for the first
assignment, and which might not be emphasized as much in standard guides.
These tutorials won’t be a regular thing. I will still always be available by e-mail, and help will be available
on Slack if you need guidance. But you will be expected to pick up the rest of the tools you need by
reading the documentation and other resources online.
However, starting out, there can be a lot to learn, especially if you are just getting started with Python. So
we are covering some Python data science basics here to help you get started. This tutorial assumes you
have learned the basics of Python syntax, and understand fundamental Python classes like list, str, bool,
int, and float, and that when you see something like ClassName.method mentioned, you understand how you
would use this with an actual instantiation of the class.
In terms of the official Python tutorial, I am assuming you have gone through and mostly understand the
sections:
• 1. Whetting Your Appetite: all
• 2. Using the Python Interpreter: all
• 3. An Informal Introduction to Python: all
• 4. More Control Flow Tools: only sections below
– 4.1. if Statements
– 4.2. for Statements
– 4.3. The range() Function
– 4.5. pass Statements
– 4.6. Defining Functions
– 4.7. More on Defining Functions
∗ 4.7.1. Default Argument Values
∗ 4.7.2. Keyword Arguments
• 5. Data Structures: only sections below
– 5.1. More on Lists
∗ 5.1.3. List Comprehensions
– 5.3. Tuples and Sequences
– 5.6. Looping Techniques
– 5.7. More on Conditions
– 5.8. Comparing Sequences and Other Types
• 8. Errors and Exceptions: only sections below
– 8.1. Syntax Errors
– 8.2. Exceptions
– 8.3. Handling Exceptions
– 8.4. Raising Exceptions
• 9. Classes: only sections below
– 9.1. A Word About Names and Objects
– 9.2. Python Scopes and Namespaces
∗ 9.2.1. Scopes and Namespaces Example
– 9.3. A First Look at Classes
∗ 9.3.1. Class Definition Syntax
∗ 9.3.2. Class Objects
∗ 9.3.3. Instance Objects
∗ 9.3.4. Method Objects
∗ 9.3.5. Class and Instance Variables
Some General Python Learning Tips
Effectively Using your REPL, IPython/Notebooks/Jupyter
For those of you coming from compiled languages without a REPL (e.g. C, C++, Java), the usual code,
compile, run, debug, repeat workflow will still work just fine for Python (this is what I do most of the
time). But since Python is interpreted, it can often be faster to use some kind of REPL-like setup.
It is easy to open up a terminal, start a session with the Python REPL by running the command python or
python3, and then play around with the various functions available to you. If you’re using a competent OS
(i.e. not Windows, which might require extra steps), tab-completion can help you discover all the methods
available to you on objects, and you can see different results very quickly. If you have gone through the first
couple or so sections of the official Python tutorial, you will already have a pretty good idea of how to use
the REPL. The Python REPL is a great scratch space, so don’t forget to use it.
When you want more power available than the plain Python REPL, there are Jupyter Notebooks. If you
don't want to set up notebooks, you can use Google Colab, which sets up the whole notebook environment
for you in your browser without you having to install anything.
Because I think Notebooks encourage bad code and bad practices, and because they aren't allowed as project
or assignment submissions, I won't cover them further. But you are still free to use them to develop your code initially,
and I recommend giving them a try if you haven’t before. Notebooks are good for quickly playing around
with data and learning the basics of some new library.
Read and Search the Official Docs First
However, if you understand the basics of what you need to do, your next step should be to head to the official
documentation. For this course, there are probably only six or seven official resources you'll need:
• NumPy
• pandas
• scikit-learn
• matplotlib and possibly seaborn
• TensorFlow and Keras
• PyTorch and PyTorch Lightning
Learn to use these sites, and browse through API docs and official tutorials before searching Google for how
to do things. The only docs that are sometimes not so great are those for matplotlib and pandas. If you
want to know how to do some plotting thing, or want to know how to manipulate a DataFrame in some way,
it can actually be faster and more helpful to go to Stack Exchange first.
Paths

Another subtle and common source of bugs is file paths. Suppose my current working directory is
/home/derek/Desktop/CSCI444/, and I have a script there that creates a folder using a relative path:

# This file: /home/derek/Desktop/CSCI444/tutorial.py
import os
os.makedirs("test")

Then running the command python tutorial.py will create a folder /home/derek/Desktop/CSCI444/test/ [2].
However, if my current working directory was then changed to be /home/derek/Desktop/, and I ran the
command python CSCI444/tutorial.py, then this would create the folder /home/derek/Desktop/test/. And
if I had to refactor and move the tutorial.py file, test would get created in the wrong place again. Relative
paths can thus be a subtle and annoying cause of bugs.
One alternative is to always use absolute paths. If instead the script were:

import os
os.makedirs("/home/derek/Desktop/CSCI444/test")

then it no longer matters from where I run this script (i.e. we don't use any information about the current
working directory). The problem is, now no one else can use the code without changing the derek part, or,
if they are on Windows, without completely changing everything. That's not good.
One simple solution is the magic __file__ variable. Names with double underscores are special variables
or methods in Python (also called "magic" or "dunder" names) that serve special purposes at runtime.
The __file__ magic variable holds the path of the file it is written in, relative to how the script was invoked
(on newer Python versions it may already be absolute). E.g.:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
print(__file__) # prints: tutorial.py
[2] In general, you should always run your Python scripts from a consistent location (e.g. for simple scripts,
the containing directory).
# working dir: /home/derek/Desktop/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
print(__file__) # prints: CSCI444/tutorial.py
In a REPL (where there is no file), attempting to access __file__ will raise a NameError.
The __file__ magic variable is the first tool that will allow us to easily and reliably identify paths across
operating systems. The next tool is the standard library's pathlib module, which can convert this string to
an absolute path:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
from pathlib import Path
path = Path(__file__).resolve()
print(path) # prints: /home/derek/Desktop/CSCI444/tutorial.py
Now we have stored in the path variable an absolute path that works across operating systems, no matter
what the current working directory. It will also generally only break if we move the tutorial.py file.
Pathlib also makes it easy to reliably navigate directories and get what you need:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
from pathlib import Path
path = Path(__file__).resolve()
print(path) # /home/derek/Desktop/CSCI444/tutorial.py
print(path.parent) # /home/derek/Desktop/CSCI444
print(path.parent.parent) # /home/derek/Desktop
print(path.parent / "newfolder") # /home/derek/Desktop/CSCI444/newfolder
There's a whole bunch more you can do with pathlib. Check out the docs, and if you see some site
recommending old school functions like os.path.join and other verbose and tedious stuff, consider replacing
those calls with pathlib equivalents.
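Putting this together, here is a minimal sketch (the "test" folder name is just illustrative) of creating a
folder next to the running script, no matter what the current working directory is:

from pathlib import Path

# directory containing this script, as an absolute path
script_dir = Path(__file__).resolve().parent
outdir = script_dir / "test"
outdir.mkdir(exist_ok=True)  # no error if the folder already exists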
Plotting
For the most part, you should be fine with matplotlib. However, the default matplotlib plots are somewhat
ugly, and are missing some key types of plots. If you don’t mind working with pandas DataFrames, then
seaborn is a useful way to get some nicer plots. In particular pairplot can be an extremely powerful way
to visualize your entire dataset, if it is not too large, and heatmap can be used for visualizing 2D arrays and
DataFrames.
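For instance, a quick sketch of those two seaborn plots, using seaborn's built-in "iris" example dataset
(load_dataset fetches it for you, so this assumes an internet connection):

import matplotlib.pyplot as plt
import seaborn as sbn

df = sbn.load_dataset("iris")  # a small example DataFrame
sbn.pairplot(df, hue="species")  # plot every feature against every other
plt.show()

sbn.heatmap(df.drop(columns=["species"]).corr(), annot=True)  # visualize a 2D array (here, correlations)
plt.show()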
Arrays and DataFrames
Data science means working with arrays. Arrays are multidimensional grids of values. In deep learning libraries,
arrays are often called tensors (essentially unrelated to tensors in mathematics). Two-dimensional arrays are
often just called matrices, and one-dimensional arrays are often referred to as vectors.
Although not strictly part of the definition of array, the array concept in programming usually strongly
implies that all elements of the array are of the same data type. It is extremely unusual to see mixed-type
arrays. Most of the time, the data type will be some kind of float, integer, string, bytes, or possibly a
date/time type. That is, arrays tend to hold only one simple, primitive type of data. Arrays also do not
store much additional extra information (like variable names).
Arrays for data science are more accurately called multidimensional arrays or N-dimensional arrays. Arrays are
implemented in Python in the NumPy library via the ndarray class. Other languages have similar constructs
(e.g. Julia, Rust).
For a lot of data science and machine learning, arrays can be somewhat inconvenient, as you cannot access
rows or columns by name, and you can't store complicated data of multiple types conveniently. In this case,
you want a data structure that more resembles a table. In Python (and also R) the structure of choice is the
dataframe, and in Python, the library for manipulating dataframes is pandas. As we are focused on Python,
I will refer only to DataFrames from here on, to make it clear I am talking about pandas-style dataframes.
Before getting into arrays and DataFrames, it is important to clarify some key terms.
Terminology
To work with and think clearly about your data, you need to make strong distinctions between your data
shape or dimensions and your data sizes.
The number of dimensions in your data is the number of axes in the array. E.g. a timeseries of rainfall
in one location is 1D, and rainfall over time at multiple locations is 2D. A black-and-white (BW) image is 2D, a
colour image is 3D (you have RGB and maybe A—alpha transparency—data at each pixel), a BW brain
scan is 3D, and a colour brain scan is 4D. If you have an array that holds a collection of samples (which we
will encounter when dealing with batches later in deep learning), then the dimension of the batch will be the
dimensions of the samples plus one. E.g. an array holding 100 BW images is a 3D array. If arr is a NumPy
array, the dimension is returned by arr.ndim (equivalently, len(arr.shape)).
The size or length of a dimension is the number of points or values in that dimension. So a 1080p image
would have a size of 1920 in the horizontal dimension, and 1080 in the vertical dimension. The shape of an
array is a tuple of the sizes of all dimensions. If arr is a NumPy array, then arr.shape returns the shape
information. So if we have an RGB 1080p image in the NumPy array img, then img.shape would return the
tuple (1080, 1920, 3) (image libraries store rows first, so the vertical size comes before the horizontal).
The total size of an array is the total number of values contained in the array. That is, it is
np.prod(arr.shape), the product of the sizes of all dimensions (also available directly as arr.size).
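For example:

import numpy as np

imgs = np.zeros([100, 8, 8])  # imagine 100 8x8 BW images
print(imgs.ndim)   # 3
print(imgs.shape)  # (100, 8, 8)
print(imgs.size)   # 6400, i.e. np.prod(imgs.shape)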
Reshaping is an operation that does not change the total size of your array, but may change the shape in
any number of ways. Resizing is any operation that may affect the total size of an array. For example,
transposition is a kind of reshaping that flips an array on its diagonal, so that an array of shape (m, n)
gets the shape (n, m) (and you can get the transpose of an array arr conveniently with the shortcut arr.T).
Something that comes up a lot in deep learning and working with images (like in your first assignment) is the
channel dimension. This is the dimension that contains multiple copies / layers / channels of the “signal”
(e.g. image, timeseries). In our RGB 1080p example, the channel dimension has size (and position) 3. This
is called "channels last" format. Sometimes, images will be stored in a "channels first" format, in which case
the 1080p image shape would be (3, 1080, 1920) [3]. If you had, say, daily rainfall data for Antigonish and
[3] Also, colour images can be 4-channel (but still 3D) if RGBA and not just RGB.
Toronto for 365 days, the shape would be (365, 2) or (2, 365), and we could call this size-2 dimension the
channel dimension [4].
NumPy
To start, I recommend you learn how to do the basics of NumPy. Their beginner tutorial covers most of
what you will need to get started for the assignment. The other thing you will need for the first assignment
is boolean indexing and possibly index arrays.
Note: NumPy is almost always imported as import numpy as np, and you should follow this convention for
your code. In this tutorial, if you see np.<something>, then you can assume I mean NumPy.
Boolean Indexing and Logical Operators

Comparing an array to a value with ==, <, >, etc. returns a boolean array of the same shape, which can
then be used to select elements. For the examples below, A is a random 4x4 array of digits:

np.random.seed(3)
A = np.random.randint(0, 10, [4, 4])

idx_9 = (A == 9)
idx_9
# array([[False, True, False, False],
# [False, False, False, False],
# [ True, True, False, False],
# [False, False, False, False]])
idx_8_9 = (A == 8) | (A == 9) # don't use "or" or "and" for boolean arrays, use "|" or "&"
idx_8_9
# array([[ True, True, False, True],
# [ True, False, False, False],
# [ True, True, False, False],
# [False, False, False, False]])
[4] These latter points can be confusing. That is, flat colour images are stored as 3D arrays. But when we get to Tensorflow /
Torch, we'll see functions like Conv2D, MaxPool2D, etc. to process flat colour images. This is because when we refer
to the dimension of data, we often don't count the channel dimension. In any case, don't overthink this, just keep in mind
channel dimensions are a thing.
Now we can use these boolean arrays to select elements (the matching values are returned in a flat 1D
array, in row-major order):

print(A[idx_8_9])  # [8 9 8 8 9 9]
idx_small = (A < 4)
print(A[idx_small])  # [3 0 3 0]
This can be more useful when you have multiple objects, and need to select based on values in one of them.
Let’s imagine we have values X, and target y:
X = np.array([[1, 8],
[2, 3],
[3, 4],
[4, 7],
[5, 6]])
y = np.array([[8],
[8],
[9],
[9],
[8]])
and we only want data where the target is equal to 8. You might first try to get this with:

eights = X[y == 8, :]  # IndexError: too many indices for array
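The problem is that y has shape (5, 1), so y == 8 is a 2D boolean array, and combining it with the extra ":"
index is too many indices for a 2D array. One fix (a minimal sketch; there are other ways to flatten) is to
ravel the mask to 1D first:

eights = X[(y == 8).ravel(), :]
eights
# array([[1, 8],
#        [2, 3],
#        [5, 6]])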
Note that boolean arrays used for indexing really have to be boolean. That is:

A = np.array([1, 2, 3, 4])
idx = np.array([0, 0, 1, 1])  # surely this is the same!
A[idx]  # array([1, 1, 2, 2])  # AAGH, nope: integers are treated as positions, not True/False
A[idx.astype(bool)]  # array([3, 4]) - an actual boolean mask behaves as expected
Index Arrays
Sometimes it is more convenient to just select the elements at exactly certain indices. In fact, that's what
happened in our last boolean example. You will rarely need to create index arrays on your own, but they
are returned by a lot of data-splitting functions, so it is important to understand the basic idea and be
aware of the technique.
Here is a basic example:
np.random.seed(3)
A = np.linspace(0, 10, 10).round(1)
idx = np.random.choice(5, size=5, replace=False)
A # array([ 0. , 1.1, 2.2, 3.3, 4.4, 5.6, 6.7, 7.8, 8.9, 10. ])
idx # array([3, 4, 1, 0, 2])
A[idx] # array([3.3, 4.4, 1.1, 0. , 2.2])
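Index arrays like this are exactly what many data-splitting utilities return. A sketch using scikit-learn's
KFold (one of the splitters you are likely to meet):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten samples, two features
kf = KFold(n_splits=5, shuffle=True, random_state=3)
for train_idx, test_idx in kf.split(X):
    # train_idx and test_idx are index arrays into the rows of X
    X_tr, X_test = X[train_idx], X[test_idx]
    print(test_idx)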
pandas
With any luck [5] you won't have to deal with pandas much on this first assignment. So feel free to skip this
section unless you need pandas.
However, pandas is often the easiest way to load data with the various pandas.read_ functions. These
functions are extremely powerful and will likely help you quickly load most of the data you find online. If
you understand the NumPy basics, you can generally use NumPy methods on a pandas DataFrame and get
similar behavior. Alternately, you can just convert the DataFrame to a NumPy array (e.g. with
DataFrame.to_numpy()) and be done with pandas for now.
Typically you import pandas as pd. When using pandas, you also use the DataFrame class so much that you
likely want to import it itself as from pandas import DataFrame. In the following examples, you can assume
those imports are implicit.
There is an official 10-minute guide that covers the bare minimum of pandas, and there are also a bunch of
things that you need to understand which are covered in the basics tutorial.
If you like pandas, by all means use it. However, with the amount of early material to learn, you might just
want to learn enough to use it like NumPy, and to load data. pandas is quite complicated and takes a long
time to learn well, so I would personally recommend avoiding it for now.
You can also index into a DataFrame most of the time as if it is a NumPy array by using the .iloc accessor:
np.random.seed(3)
df = DataFrame(data=np.random.randint(0, 10, [3,3]), columns=["A", "B", "C"])
print(df)
# A B C
# 0 8 9 3
# 1 8 8 0
# 2 5 3 9
print(df.iloc[:, 0])
# 0 8
# 1 8
# 2 5
# Name: A, dtype: int64
print(df.iloc[1, :])
[5] pandas is very poorly designed and frustrating to use, mostly because of the annoying and counter-intuitive Series objects,
and endlessly subtle indexing options, none of which work particularly sensibly or predictably. It is also extremely slow, even
for Python. I hate pandas, and try to use as few features from it as possible, but you really can't avoid it for a huge amount of
Python data science.
# A 8
# B 8
# C 0
# Name: 1, dtype: int64
You can also access elements with boolean indexing and index arrays, and the usual tricks like df == 8 work
to generate “boolean indexing DataFrames” too. You can also do NumPy style indexing, but with column
names, via .loc.
np.random.seed(3)
df = DataFrame(data=np.random.randint(0, 10, [3,3]), columns=["A", "B", "C"])
print(df)
# A B C
# 0 8 9 3
# 1 8 8 0
# 2 5 3 9
print(df.loc[:, "A"])
# 0 8
# 1 8
# 2 5
# Name: A, dtype: int64
print(df.loc[1, :])
# A 8
# B 8
# C 0
# Name: 1, dtype: int64
idx = (df == 8)
print(df[idx])
# A B C
# 0 8.0 NaN NaN
# 1 8.0 8.0 NaN
# 2 NaN NaN NaN
If you want to save yourself a huge amount of confusion and want to avoid some annoying
bugs, almost always access and assign data using the .iloc and .loc indexers, and nothing else.
# AVOID ALL OF THESE WHEN REASONABLE!
df.A         # attribute access silently breaks for e.g. column names with spaces
df["A"]      # returns a Series, which has its own confusing behaviors
df["A", :]   # raises an error - a DataFrame is not indexed like a NumPy array
df["A"][1]   # chained indexing: assigning to this may silently fail
One exception which can be extraordinarily useful for certain tasks is the DataFrame.filter method with
the regex argument. I won’t cover it, just be aware of it.
Formatting Your Data for Analysis
Often, real-world data is messy. It might have different sizes, be in the wrong shape, have missing values,
or have errors in data entry / corruptions.
Reshaping
Often, you will have to reshape your data before you can use it in standard algorithms. For example, most
ML algorithms expect a shape of either (n_samples, n_features), or the transpose of this. This means some
data (like the 2D image data in your first assignment) cannot be used directly, and must be flattened (or
embedded if we are dealing with timeseries or sequences).
In NumPy, your main methods for this are ndarray.reshape and ndarray.transpose.
The .transpose function is fully general, and allows you to permute the axes in different orders. E.g.
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_style("ticks")  # remove some ugly default gridlines

imgs = load_digits()["images"]
print(imgs.shape)  # (1797, 8, 8) - 1797 8x8 BW images
img = imgs[3, :, :]
plt.imshow(img, cmap="Greys")
plt.show()  # shows a "3"

Figure 1: A 3.
imgs_flipped = imgs.transpose([0, 2, 1]) # swap last two dimensions
print(imgs_flipped.shape) # (1797, 8, 8) - no change in shape!
img = imgs_flipped[3, :, :]
plt.imshow(img, cmap="Greys")
plt.show() # the "3" has been flipped on the diagonal!
Figure 2: A transposed 3.
imgs_restored = imgs_flipped.transpose([0, 2, 1])  # transposing again undoes the flip
plt.imshow(imgs_restored[3, :, :], cmap="Greys")
plt.show()  # back to the original "3"

Figure 3: A 3 again.
The .reshape function does… reshaping, as defined above. BEWARE! Naive reshaping can scramble your
data!
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
# DON'T DO THIS!
imgs = load_digits()["images"]
print(imgs.shape) # (1797, 8, 8)
imgs = imgs.reshape([8, 8, 1797]) # easy-easy, no need for transpose
img = imgs[:, :, 3]
print(img.shape) # (8, 8), yup, perfect, who needs transpose
plt.imshow(img, cmap="Greys")
plt.show() # WTF
Figure 4: Modern art.
Let’s reshape properly to an array that is (n_sample, n_features). In this case, we have 1797 samples, and
8*8 == 64 features (each pixel is a feature):
imgs = load_digits()["images"]
print(imgs.shape) # (1797, 8, 8)
n_samples = imgs.shape[0]
n_features = np.prod(imgs.shape[1:]) # Don't hard-code magic numbers!
imgs = imgs.reshape([n_samples, n_features])
img_flat = imgs[3, :]
print(img_flat.shape) # (64,) - Good!
plt.imshow(img_flat, cmap="Greys") # TypeError: Invalid shape (64,) for image data
Here, although we reshaped correctly, we see plt.imshow wants a different size. To plot img_flat, we need
another tool.
imgs = load_digits()["images"]
print(imgs.shape)
n_samples = imgs.shape[0]
n_features = np.prod(imgs.shape[1:])
imgs = imgs.reshape([n_samples, n_features])
img_flat = imgs[3, :]
print(img_flat.shape) # (64,)
img_expanded = np.expand_dims(img_flat, 1)
plt.imshow(img_expanded, cmap="Greys")
plt.show()
Figure 5: A flattened 3
You can delete annoying size-1 dimensions with np.squeeze(), or, if you know you want the 1D representation,
then np.ravel() is often the fastest way to do this. For example:
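import numpy as np

A = np.zeros([1, 64, 1])
print(np.squeeze(A).shape)  # (64,) - all size-1 dimensions removed
B = np.zeros([8, 8])
print(np.ravel(B).shape)  # (64,) - flattened to 1D, whatever the original shape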
The Evil of Size-1 Dimensions When you start with array programming, these phoney size-1 dimensions
can be the cause of a lot of errors and conceptual confusion. If you are mathematically minded, you may
also realize that these arrays are wonky as mathematical and programming objects. For example:
np.random.seed(42)
A = np.random.randint(0, 10, [10]) # 10 random digits
E = np.expand_dims(A, 1)
print(A.shape) # (10,)
print(E.shape) # (10, 1)
# but wait a minute, adding a dimension doesn't add data. What if...
EEEEEE = np.copy(A)
for i in range(31):  # biggest allowed number of dimensions is 32
    EEEEEE = np.expand_dims(EEEEEE, i + 1)
print(EEEEEE.shape)  # (10, 1, 1, 1, ..., 1) - still just the same 10 values
The point is that size-1 dimensions are fundamentally empty, and are only there for programming reasons
to allow functions to expect common shapes. The problem is sometimes functions silently throw away size-1
dimensions when you need them, or need size-1 dimensions when you thought a raveled array would be fine.
In fact, a lot of NumPy functions include a keepdims argument for this purpose.
So just be aware that size-1 dimensions can cause all sorts of subtle bugs. I recommend having
them around for only exactly as long as you need them, and otherwise keeping arrays in raveled or squeezed
form (i.e. no size-1 dimensions). If you get strange, shape-related errors, read the docs, look at the expected
input shapes and what shapes are returned, and throw in np.expand_dims and np.squeeze or np.ravel calls as
needed to resolve the problem.
Cropping and Padding

Cropping and padding are resizing operations: cropping removes values from, and padding adds values to,
an array's ends/boundaries.
Cropping can be implemented trivially by array slicing, so there aren’t NumPy functions for this. But if you
want to crop more symmetrically, like to the center of some array, or include random crops, then there are
tools available in PyTorch and Tensorflow.
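A center crop, for example, needs nothing more than slicing:

import numpy as np

img = np.arange(64).reshape(8, 8)
crop = img[2:6, 2:6]  # keep only the central 4x4 region
print(crop.shape)  # (4, 4)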
Padding is actually quite a bit more complicated, as the choice of padding value turns out to matter more
than one might hope. There are multiple ways to pad beyond just zero-padding (the default in most
applications). Sometimes the padding choice is so important that naive padding can invalidate a model,
e.g. by failing to use causal padding for timeseries data.
The most general padding function in Python is np.pad. Then, there are options for PyTorch and Tensorflow.
Until we get to deep learning, I would just stick with NumPy. Although there are problems with naive
zero-padding, you can still just use this if you want [6].
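For example, zero-padding two values onto every side of a small image:

import numpy as np

img = np.ones([4, 4])
padded = np.pad(img, pad_width=2)  # pads with the constant 0 by default
print(padded.shape)  # (8, 8)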
[6] Potential bonus marks could perhaps be awarded for comparing the effects of different padding choices on algorithm
performance - providing this doesn't contaminate data, as it can with timeseries data.
The General Data Analysis Procedure
Data science is more of an art than a science, so there isn't really anything remotely like a general data
analysis procedure. However, most Python libraries (and in fact other statistical libraries, like those in R) write
their functions to use similar abstractions. That is, for most data analysis, you follow steps like these
(sketched in code after the list):
1. Acquire data.
2. Inspect data (visualization, summary stats)
3. Clean data (reshape, resize, deal with missing values, outliers)
4. Split your data into predictors, and, if doing supervised learning, target, usually called (respectively)
• X, y
5. Define training and test sets, usually called
• X_tr, X_test, y_tr, y_test
6. Instantiate or build some model
• model = Model()
7. Fit / train that model to / on the training data
• model = model.fit(X_tr, y_tr)
8. Use the fitted model to get predictions on the test set
• y_pred = model.predict(X_test)
9. Evaluate the predictions by comparing them to the true test targets
• compare(y_pred, y_test)
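As a sketch of what steps 4-9 look like in scikit-learn (using a KNN classifier here, since this tutorial
accompanies a KNN assignment; the dataset is just the built-in digits data):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1.-4. acquire data, split into predictors X and target y
digits = load_digits()
X, y = digits["data"], digits["target"]  # "data" is already (n_samples, n_features)

# 5. define training and test sets
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

# 6.-7. instantiate and fit a model
model = KNeighborsClassifier(n_neighbors=3)
model = model.fit(X_tr, y_tr)

# 8. get predictions on the test set
y_pred = model.predict(X_test)

# 9. evaluate the predictions
print(np.mean(y_pred == y_test))  # accuracy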
In the simplest cases, that will really be mostly the extent of the code. In extremely simple cases, you might
even have X_tr == X_test == X and y_tr == y_test == y. But in practice, building and training the
model can be complicated and time consuming, and evaluation can also be quite computationally expensive
(as it may require many repeated fitting and prediction steps).
For the first problem set, the above procedure is mostly all you have to do. If you’ve managed to learn
all the Python, NumPy, pandas, and other concepts, this part will probably seem pretty easy (at least to
implement).
Nevertheless, there are a few tips you might find helpful.
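One such tip: the AUC for comparing two groups of observed values x and y can be computed directly from
the Mann-Whitney U statistic. A minimal sketch, assuming scipy is available (the data here is illustrative):

import numpy as np
from scipy.stats import mannwhitneyu

x = np.random.normal(0.0, 1.0, 100)  # observed values for group 0
y = np.random.normal(0.5, 1.0, 100)  # observed values for group 1
U = mannwhitneyu(x, y).statistic
auc = U / (len(x) * len(y))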
The order of x and y above matters, in that it changes which side of 0.5 the resulting AUC value will be.
Alternately, you can use sklearn.metrics.roc_auc_score. As this is set up under the assumption that you have
some scores / predictions, you will need to have your data formatted differently, and will need to generate a
vector of group labels. You can do this like so:
# assume x, y are the observed (raveled) vectors of values for each group
from sklearn.metrics import roc_auc_score

data = np.concatenate([x, y])
labels = np.array([0 for _ in x] + [1 for _ in y])
auc = roc_auc_score(y_true=labels, y_score=data)
Switching the order of the 0 and 1 above will also change which side of 0.5 the resulting AUC value will be.
Both methods above give identical values.
Wrapping Up
This tutorial should provide you with all the tools and resources needed for the first assignment. However,
remember, if you run into problems, I am available by e-mail or on Slack.
Good luck!