
Tutorial #1: KNN and Data Manipulation with Python

Contents

What this Tutorial Covers

Some General Python Learning Tips
    Effectively Using your REPL, IPython/Notebooks/Jupyter
    Learn to Use Your Debugger
    Choosing Appropriate Learning Resources
    Paths and Pathlib
    Plotting

Arrays and DataFrames
    Terminology

NumPy
    Boolean Indexing and Logical Operators
    Index Arrays
    pandas

Formatting Your Data for Analysis
    Dealing with Missing Values
    Reshaping
    Resizing and Reshaping

The General Data Analysis Procedure
    Computing the AUC
    Speeding Up scikit-learn Models
    Speeding Up Prototyping with Subsets
    Avoid Silly Mistakes by Regularly Visualizing and Inspecting Shapes

Wrapping Up

What this Tutorial Covers
As there are plenty of guides and documentation for learning to do Python data science and machine learning
(ML) online, I won’t be trying to teach you this from scratch. What this tutorial will provide you with is
links to useful resources, and some tools and techniques that are especially useful or relevant for the first
assignment, and which might not be emphasized as much in standard guides.
These tutorials won’t be a regular thing. I will still always be available by e-mail, and help will be available
on Slack if you need guidance. But you will be expected to pick up the rest of the tools you need by
reading the documentation and other resources online.
However, starting out, there can be a lot to learn, especially if you are just getting started with Python. So
we are covering some Python data science basics here to help you get started. This tutorial assumes you
have learned the basics of Python syntax, and understand fundamental Python classes like list, str, bool,
int, and float, and that when you see something like ClassName.method mentioned, you understand how you
would use this with an actual instantiation of the class.
In terms of the official Python tutorial, I am assuming you have gone through and mostly understand the
sections:
• 1. Whetting Your Appetite: all
• 2. Using the Python Interpreter: all
• 3. An Informal Introduction to Python: all
• 4. More Control Flow Tools: only sections below
– 4.1. if Statements
– 4.2. for Statements
– 4.3. The range() Function
– 4.5. pass Statements
– 4.6. Defining Functions
– 4.7. More on Defining Functions
∗ 4.7.1. Default Argument Values
∗ 4.7.2. Keyword Arguments
• 5. Data Structures: only sections below
– 5.1. More on Lists
∗ 5.1.3. List Comprehensions
– 5.3. Tuples and Sequences
– 5.6. Looping Techniques
– 5.7. More on Conditions
– 5.8. Comparing Sequences and Other Types
• 8. Errors and Exceptions: only sections below
– 8.1. Syntax Errors
– 8.2. Exceptions
– 8.3. Handling Exceptions
– 8.4. Raising Exceptions
• 9. Classes: only sections below
– 9.1. A Word About Names and Objects
– 9.2. Python Scopes and Namespaces
∗ 9.2.1. Scopes and Namespaces Example
– 9.3. A First Look at Classes
∗ 9.3.1. Class Definition Syntax
∗ 9.3.2. Class Objects
∗ 9.3.3. Instance Objects
∗ 9.3.4. Method Objects
∗ 9.3.5. Class and Instance Variables

Some General Python Learning Tips
Effectively Using your REPL, IPython/Notebooks/Jupyter
For those of you coming from compiled languages without a REPL (e.g. C, C++, Java), the usual code,
compile, run, debug, and repeat workflow will still work just fine for Python (this is what I do most of the
time). But since Python is interpreted, it can often be faster to use some kind of REPL-like setup.
It is easy to open up a terminal, start a session with the Python REPL by running the command python or
python3, and then play around with the various functions available to you. If you’re using a competent OS
(i.e. not Windows, which might require extra steps), tab-completion can help you discover all the methods
available to you on objects, and you can see different results very quickly. If you have gone through the first
couple or so sections of the official Python tutorial, you will already have a pretty good idea of how to use
the REPL. The Python REPL is a great scratch space, so don’t forget to use it.
When you want more power available than the plain Python REPL, there are Jupyter Notebooks. If you
don't want to set up notebooks, you can use Google Colab, which sets up the whole notebook environment
for you in your browser without you having to install anything.
Because I think Notebooks encourage bad code and bad practices, and aren’t allowed as project or assignment
submissions, I won’t cover them further. But you are still free to use them to develop your code initially,
and I recommend giving them a try if you haven’t before. Notebooks are good for quickly playing around
with data and learning the basics of some new library.

Learn to Use Your Debugger


As code grows larger and more complex, random print statements and tweaks are often not enough. This
is when you’ll want to reach for a debugger. A debugger can help you inspect variables and find certain
kinds of bugs much more quickly. I recommend you learn how to do integrated debugging in either VS Code
or PyCharm. This is usually very easy to learn, and amounts to little more than clicking the right GUI
components.
A debugger can also be a surprisingly useful development tool in Python. Just place a breakpoint in your
script, run to that point, and now in the debug console you have access to all the variables that are both
defined and in scope. This can help you prototype new code. Once you’ve learned the basics of your debugger,
I think you will see it is actually a far more powerful and flexible tool than most other tools available to you
(e.g. notebooks).
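If you prefer to stay in the terminal, Python's built-in breakpoint() function drops you into the same kind of debug console. A minimal sketch (the function here is just a toy example):

def normalize(values):
    total = sum(values)
    breakpoint()  # execution pauses here; inspect values and total in the pdb console, then type c to continue
    return [v / total for v in values]

if __name__ == "__main__":
    print(normalize([1.0, 2.0, 3.0]))

In VS Code or PyCharm, you would instead click next to a line to set the breakpoint and run the script with the debugger attached.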

Choosing Appropriate Learning Resources


Avoid Plain / Naive Googling
Due to aggressive search-engine optimization, plain Google results (i.e., those that don’t use any Google-
fu) are increasingly almost entirely junk. When trying to learn how to do some data science, you get
overwhelmed with results from content-mills like towardsdatascience, machinelearningmastery, and various
other extremely low-quality medium posts or listicles. While these can occasionally be useful, usually they
are extremely poorly written, contain numerous basic conceptual errors, and are written by people who have
only just learned how to do what they are writing about. I strongly recommend you avoid using these kinds of
resources, when possible.
If you don’t understand a concept or technique at all, and the recommended course texts were not helpful,
there is no shame in using Wikipedia. Wikipedia is generally good for mathematical and statistical content,
and much more trustworthy than the average content-mill material. It also tends to link to decent resources.
After you understand basic concepts and ideas, you can try searching stackexchange for more specific practical
and conceptual questions. In fact, generally just adding “site:stackexchange.com” to the end of your Google
searches will get you far better results.

Read and Search the Official Docs First
However, if you understand the basics of what you need to do, your next step should be to head to the official
documentation. For this course, there are probably only really six or seven official resources you’ll need.
• NumPy
• pandas
• scikit-learn
• matplotlib and possibly seaborn
• Tensorflow and Keras
• PyTorch and PyTorch Lightning
Learn to use these sites, and browse through API docs and official tutorials before searching Google for how
to do things. The only docs that are sometimes not so great are those for matplotlib and pandas. If you
want to know how to do some plotting thing, or want to know how to manipulate a DataFrame in some way,
it can actually be faster and more helpful to go to stackexchange first.[1]

Paths and Pathlib


One of the most important things to remember early on is that the shell where you execute your python
or python3 commands always has a current working directory (which can be viewed in Windows with
the command cd, and in Linux/MacOS with pwd). When you define relative paths in your Python
scripts, they will always be relative to the working directory. That is, if my current working directory is
/home/derek/Desktop/CSCI444/, and my script file is /home/derek/Desktop/CSCI444/tutorial.py, which has the
code:
import os

os.makedirs("test") # make a folder / directory called "test"

then running the command python tutorial.py will create a folder /home/derek/Desktop/CSCI444/test/.[2]

However, if my current working directory was then changed to be /home/derek/Desktop/, and I ran the
command python CSCI444/tutorial.py, then this would create the folder /home/derek/Desktop/test/. And
if I had to refactor and move the tutorial.py file, test would get created in the wrong place again. Relative
paths can thus be a subtle and annoying cause of bugs.
One alternative is to always use absolute paths. If instead the script were:
import os

os.makedirs("/home/derek/Desktop/CSCI444/test") # make a folder / directory called "test"

then it no longer matters from where I run this script (e.g. we don’t use any information about the current
working directory). The problem is, now no one else can use the code without changing the derek part, or,
if they are on Windows, without completely changing everything. That’s not good.
One simple solution is the magic __file__ variable. Names with double underscores are special variables
or methods in Python (also called "magic" or "dunder" names) that perform special functions at runtime.
The __file__ magic variable holds the path to the file it is written in (on older Python versions this is the
path as it was passed on the command line, so possibly relative, as in the examples below; on Python 3.9+
it is typically already absolute). E.g.:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
print(__file__) # prints: tutorial.py

but if our directory is /home/derek/Desktop/, then


[1] Eventually, this will cost you, as you will not build up a good understanding of the libraries this way. But for this course,
you likely won't advance far enough into the APIs to hit this problem.
[2] Imports resolve in a similar way. So if you get strange errors related to imports, you probably changed your current working
directory somehow. In general, you should always run your Python scripts from a consistent location (e.g. for simple scripts,
the containing directory).

# working dir: /home/derek/Desktop/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
print(__file__) # prints: CSCI444/tutorial.py

In a REPL (where there is no file) attempting to access __file__ will raise an exception.
The __file__ magic variable is the first tool that will allow us to easily and reliably identify paths across
operating systems. The next tool is the pathlib module, which can convert this string to an absolute path:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
from pathlib import Path

path = Path(__file__).resolve()
print(path) # prints: /home/derek/Desktop/CSCI444/tutorial.py

Now we have stored in the path variable an absolute path that works across operating systems, no matter
what the current working directory is. It will also generally only break if we move the tutorial.py file.
Pathlib also makes it easy to reliably navigate directories and get what you need:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
from pathlib import Path

path = Path(__file__).resolve()
print(path) # /home/derek/Desktop/CSCI444/tutorial.py
print(path.parent) # /home/derek/Desktop/CSCI444
print(path.parent.parent) # /home/derek/Desktop
print(path.parent / "newfolder") # /home/derek/Desktop/CSCI444/newfolder

There’s a whole bunch more you can do with pathlib. Check out the docs, and if you see some site
recommending old-school functions like os.path.join and other verbose and tedious stuff, consider replacing
those calls with pathlib equivalents.
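Putting these pieces together, here is a minimal sketch of creating a folder next to the script no matter where it is run from (the folder name "outputs" is just an example):

from pathlib import Path

SCRIPT_DIR = Path(__file__).resolve().parent  # directory containing this script, as an absolute path
outdir = SCRIPT_DIR / "outputs"
outdir.mkdir(exist_ok=True)  # does nothing if the folder already exists
print(outdir)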

Plotting
For the most part, you should be fine with matplotlib. However, the default matplotlib plots are somewhat
ugly, and are missing some key types of plots. If you don’t mind working with pandas DataFrames, then
seaborn is a useful way to get some nicer plots. In particular pairplot can be an extremely powerful way
to visualize your entire dataset, if it is not too large, and heatmap can be used for visualizing 2D arrays and
DataFrames.
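As a rough sketch of both (the DataFrame and its "label" column here are made up purely for illustration):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sbn

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])  # toy data
df["label"] = rng.integers(0, 2, size=100)

sbn.pairplot(df, hue="label")  # scatterplots of every pair of features, coloured by class
plt.show()

sbn.heatmap(df.corr())  # visualize a 2D array or DataFrame (here, the correlation matrix)
plt.show()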

Arrays and DataFrames
Data science means working with arrays: multidimensional, grid-like collections of values. In deep learning
libraries, arrays are often called tensors (essentially unrelated to tensors in mathematics). Two-dimensional
arrays are often just called matrices, and one-dimensional arrays are usually called vectors.
Although not strictly part of the definition of array, the array concept in programming usually strongly
implies that all elements of the array are of the same data type. It is extremely unusual to see mixed-type
arrays. Most of the time, the data type will be some kind of float, integer, string, bytes, or possibly a
data/time type. That is, arrays tend to hold only one simple, primitive type of data. Arrays also do not
store much additional extra information (like variable names).
Arrays for data science are more accurately called multidimensional arrays or N-dimensional arrays. Arrays are
implemented in Python in the NumPy library via the ndarray class. Other languages have similar constructs
(e.g. Julia, Rust).
For a lot of data science and machine learning, arrays can be somewhat inconvenient, as you cannot access
rows or columns by name, and you can’t store complicated data of multiple types conveniently. In this case,
you want a data structure that more resembles a table. In Python (and also R) the structure of choice is the
dataframe, and in Python, the library for manipulating dataframes is pandas. As we are focused on Python,
I will refer only to DataFrames from here on in, to make it clear I am talking about pandas style dataframes.
Before getting into arrays and DataFrames, it is important to clarify some key terms.

Terminology
To work with and think clearly about your data, you need to make strong distinctions between your data
shape or dimensions and your data sizes.
The number of dimensions in your data is the number of axes in the array. E.g. a timeseries of rainfall
in one location is 1D, while rainfall timeseries for multiple locations form a 2D array. A black-and-white (BW) image is 2D, a
colour image is 3D (you have RGB and maybe A—alpha transparency—data at each pixel), a BW brain
scan is 3D, and a colour brain scan is 4D. If you have an array that holds a collection of samples (which we
will encounter when dealing with batches later in deep learning), then the dimension of the batch will be the
dimensions of the samples plus one. E.g. an array holding 100 BW images is a 3D array. If arr is a NumPy
array, the dimension is returned by len(arr.shape) (or, equivalently, by arr.ndim).
The size or length of a dimension is the number of points or values in that dimension. So a 1080p image
would have a size of 1920 in the horizontal dimension, and 1080 in the vertical dimension. The shape of an
array is a list of all the sizes for all dimensions. If arr is a NumPy array, then arr.shape returns the shape
information. So if we have an RGB 1080p image in the NumPy array img, then img.shape would return the
tuple (1920, 1080, 3).
The total size of an array is the total number of values contained in the array. That is, it is
np.prod(arr.shape), the product of all the sizes for each dimension (NumPy also exposes this directly as arr.size).
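A quick sketch tying these terms together (the shapes here are just illustrative):

import numpy as np

arr = np.zeros([100, 8, 8])  # pretend this is a batch of 100 black-and-white 8x8 images

print(arr.ndim)            # 3 - number of dimensions, same as len(arr.shape)
print(arr.shape)           # (100, 8, 8) - the size of each dimension
print(arr.size)            # 6400 - the total size, same as np.prod(arr.shape)
print(np.prod(arr.shape))  # 6400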

Reshaping is an operation that does not change the total size of your array, but may change the shape in
any number of ways. Resizing is any operation that may affect the total size of an array. For example,
transposition is a kind of reshaping that flips an array on its diagonal, so that an array of shape (m, n)
gets the shape (n, m) (and you can get the transpose of an array arr conveniently with the shortcut arr.T).
Something that comes up a lot in deep learning and working with images (like in your first assignment) is the
channel dimension. This is the dimension that contains multiple copies / layers / channels of the “signal”
(e.g. image, timeseries). In our RGB 1080p example, the channel dimension has size (and position) 3. This
is called “channels last” format. Sometimes, images will be stored in a “channels first” format, in which case
the 1080p image shape would be (3, 1920, 1080).[3] If you had, say, daily rainfall data for Antigonish and
Toronto for 365 days, the shape would be (365, 2) or (2, 365), and we could call this size-2 dimension the
channel dimension.[4]

[3] Also, colour images can be 4-channel (but still 3D) if RGBA and not just RGB.

NumPy
To start, I recommend you learn how to do the basics of NumPy. Their beginner tutorial covers most of
what you will need to get started for the assignment. The other thing you will need for the first assignment
is boolean indexing and possibly index arrays.
Note: NumPy is almost always imported as import numpy as np, and you should follow this convention for
your code. In this tutorial, if you see np.<something>, then you can assume I mean NumPy.

Boolean Indexing and Logical Operators


Often, you want to select only some elements of an array. There are some extremely convenient shorthands
to do this. Let’s build an array:
np.random.seed(3)
A = np.random.randint(0, 10, [4,4])
A
# array([[8, 9, 3, 8],
# [8, 0, 5, 3],
# [9, 9, 5, 7],
# [6, 0, 4, 7]])

and use “==” to generate some boolean arrays:


idx_8 = (A == 8) # get logical indicators of values that equal 8
idx_8
# array([[ True, False, False, True],
# [ True, False, False, False],
# [False, False, False, False],
# [False, False, False, False]])

idx_9 = (A == 9)
idx_9
# array([[False, True, False, False],
# [False, False, False, False],
# [ True, True, False, False],
# [False, False, False, False]])

idx_8_9 = (A == 8) | (A == 9) # don't use "or" or "and" for boolean arrays, use "|" or "&"
idx_8_9
# array([[ True, True, False, True],
# [ True, False, False, False],
# [ True, True, False, False],
# [False, False, False, False]])

idx_small = (A <= 3) # We can use other logical operators too


idx_small
# array([[False, False, True, False],
# [False, True, False, True],
# [False, False, False, False],
# [False, True, False, False]])

[4] These latter points can be confusing. That is, flat colour images are stored as 3D arrays. But when we get to Tensorflow /
Torch, we'll see functions like Conv2D, MaxPool2D, etc. that process flat colour images. This is because when we refer
to the dimension of data, we often don't count the channel dimension. In any case, don't overthink this, just keep in mind
channel dimensions are a thing.

Now we can use these boolean arrays to select elements.
print(A[idx_8_9]) # [8 9 8 8 9 9]
print(A[idx_small]) # [3 0 3 0]

This can be more useful when you have multiple objects, and need to select based on values in one of them.
Let’s imagine we have values X, and target y:
X = np.array([[1, 8],
[2, 3],
[3, 4],
[4, 7],
[5, 6]])
y = np.array([[8],
[8],
[9],
[9],
[8]])

and we only want data where the target is equal to 8. Then we can get this with:
eights = X[y == 8, :] # IndexError: too many indices for array

Wait, what happened? Let’s look at y.shape:


print(y.shape) # (5, 1)

This is a common problem you’ll bump into. The solution is:


print(y.squeeze().shape) # (5,)
idx = (y.squeeze() == 8) # array([ True, True, False, False, True])
eights = X[idx, :]
eights
# array([[1, 8],
# [2, 3],
# [5, 6]])

We can also select things not matching some criterion with ~:


not_eights = X[~idx, :]
not_eights
# array([[3, 4],
# [4, 7]])

Note that boolean arrays used for indexing really have to be boolean. That is:
A = np.array([1, 2, 3, 4])
idx = np.array([0, 0, 1, 1]) # surely this is the same!
A[idx] # array([1, 1, 2, 2]) # AAGH, nope

# We accidentally made an index array (see below)


idx.dtype # dtype('int64')
idx = np.array(idx, dtype=bool) # convert to boolean array
A[idx] # array([3, 4])

Index Arrays
Sometimes it is more convenient to just select the elements at exactly certain indices. In fact, that’s what
happened in our last boolean example. You won't often need to create index arrays yourself, but they
are returned by a lot of data-splitting functions, so it is important to understand the basic idea and be
aware of the technique.
Here is a basic example:

np.random.seed(3)
A = np.linspace(0, 10, 10).round(1)
idx = np.random.choice(5, size=5, replace=False)
A # array([ 0. , 1.1, 2.2, 3.3, 4.4, 5.6, 6.7, 7.8, 8.9, 10. ])
idx # array([3, 4, 1, 0, 2])
A[idx] # array([3.3, 4.4, 1.1, 0. , 2.2])
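For instance, here is a rough sketch of how index arrays show up when splitting data with scikit-learn (KFold is just one of many such splitters):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # train_idx and test_idx are index arrays, e.g. array([0, 1, 3, ...]) and array([2, 7])
    X_tr, X_test = X[train_idx], X[test_idx]
    y_tr, y_test = y[train_idx], y[test_idx]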

pandas
With any luck[5] you won’t have to deal with pandas much on this first assignment. So feel free to skip this
section unless you need pandas.
However, pandas is often the easiest way to load data with the various pandas.read_ functions. These
functions are extremely powerful and will likely help you quickly load most of the data you find online. If
you understand the NumPy basics, you can generally use NumPy methods on a pandas DataFrame and get
similar behavior. Alternatively, you can just convert the DataFrame and be done with pandas for now.
Typically you import pandas as pd. When using pandas, you also use the DataFrame class so much that you
will likely want to import it directly with from pandas import DataFrame. In the following examples, you can
assume those imports are implicit.
There is an official 10-minute guide that covers the bare minimum of pandas, and there are also a bunch of
things that you need to understand which are covered in the basics tutorial.
If you like pandas, by all means use it. However, with the amount of early material to learn, you might just
want to learn enough to use it like NumPy, and to load data. pandas is quite complicated and takes a long
time to learn well, so I would personally recommend avoiding it for now.

Tips to Make pandas Less Frustrating


If you want to avoid pandas, you can always convert a DataFrame to an ndarray by using the
DataFrame.to_numpy() method. You can convert it back like so:
COLUMNS = ["A", "B", "C"] # save your column names somewhere
df = DataFrame(data=np.random.randint(0, 10, [3,3]), columns=COLUMNS)
arr = df.to_numpy() # convert
# ... do some manipulations on your array, e.g. arr -= 1
df_new = DataFrame(data=arr, columns=COLUMNS)

You can also index into a DataFrame most of the time as if it is a NumPy array by using the .iloc accessor:
np.random.seed(3)
df = DataFrame(data=np.random.randint(0, 10, [3,3]), columns=["A", "B", "C"])
print(df)
# A B C
# 0 8 9 3
# 1 8 8 0
# 2 5 3 9

print(df.iloc[:, 0])
# 0 8
# 1 8
# 2 5
# Name: A, dtype: int64

print(df.iloc[1, :])
# A 8
# B 8
# C 0
# Name: 1, dtype: int64

[5] pandas is very poorly designed and frustrating to use, mostly because of the annoying and counter-intuitive Series objects,
and endlessly subtle indexing options, none of which work particularly sensibly or predictably. It is also extremely slow, even
for Python. I hate pandas, and try to use as few features from it as possible, but you really can't avoid it for a huge amount of
Python data science.

You can also access elements with boolean indexing and index arrays, and the usual tricks like df == 8 work
to generate “boolean indexing DataFrames” too. You can also do NumPy style indexing, but with column
names, via .loc.
np.random.seed(3)
df = DataFrame(data=np.random.randint(0, 10, [3,3]), columns=["A", "B", "C"])
print(df)
# A B C
# 0 8 9 3
# 1 8 8 0
# 2 5 3 9

print(df.loc[:, "A"])
# 0 8
# 1 8
# 2 5
# Name: A, dtype: int64

print(df.loc[1, :])
# A 8
# B 8
# C 0
# Name: 1, dtype: int64

idx = (df == 8)
print(df[idx])
# A B C
# 0 8.0 NaN NaN
# 1 8.0 8.0 NaN
# 2 NaN NaN NaN

If you want to save yourself a huge amount of confusion and want to avoid some annoying
bugs, almost always access and assign data using the .iloc and .loc accessors, and nothing else.
# AVOID ALL OF THESE WHEN REASONABLE!
df.A
df["A"]
df["A", :]
df["A"][1]

One exception which can be extraordinarily useful for certain tasks is the DataFrame.filter method with
the regex argument. I won’t cover it, just be aware of it.

Formatting Your Data for Analysis
Often, real-world data is messy. It might have different sizes, be in the wrong shape, have missing values,
or have errors in data entry / corruptions.

Dealing with Missing Values


This is probably one of the first things you’ll want to look into. Missing values break all but the most
sophisticated contemporary ML algorithms. For NumPy arrays and pandas DataFrames, missing values are
encoded as np.nan, and can be found with either np.isnan for NumPy, or pandas.isnull or DataFrame.isna
for pandas. Missing values are also often referred to as NaN, nan, or other such variants, which stand for “Not
a number”. The NaN value is usually still a float. E.g. isinstance(np.nan, float) returns True.
If you have a lot of NaNs in your data, you might not be able to just drop instances (rows, samples) with
functions like DataFrame.dropna or tricks like x = x[~np.isnan(x)]. This unfortunately means you’re in the
wild and wacky world of missing data. There are bookshelves on this subject, and you really don’t want to
get into them for this course. If you have a huge amount of missing data or NaNs in your data,
you might just want to find a different dataset.
However, if you have just a few missing values, and you really like your data, you might still want to work
around these issues. In this case, you probably want to reach for a simple imputation method. You can find
some approaches for this in sklearn.impute, and general guides here.
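As a hedged sketch of simple (mean) imputation with sklearn.impute.SimpleImputer (the tiny array here is made up):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])
print(np.isnan(X).sum())  # 2 missing values

imputer = SimpleImputer(strategy="mean")  # replace each NaN with its column mean
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).sum())  # 0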

Reshaping
Often, you will have to reshape your data before you can use it in standard algorithms. For example, most
ML algorithms expect a shape of either (n_samples, n_features), or the transpose of this. This means some
data (like the 2D image data in your first assignment) cannot be used directly, and must be flattened (or
embedded if we are dealing with timeseries or sequences).
In NumPy, your main methods for this are ndarray.reshape and ndarray.transpose.

The .transpose function is fully general, and allows you to permute the axes in different orders. E.g.
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_style("ticks") # remove some ugly default gridlines

imgs = load_digits()["images"] # this is a small dataset available in sklearn


print(imgs.shape) # (1797, 8, 8) - first dimension indexes the images
img = imgs[3, :, :]
print(img.shape) # (8, 8)
plt.imshow(img, cmap="Greys") # show black-and-white image
plt.show() # it's a "3"

Figure 1: A 3.

imgs_flipped = imgs.transpose([0, 2, 1]) # swap last two dimensions
print(imgs_flipped.shape) # (1797, 8, 8) - no change in shape!
img = imgs_flipped[3, :, :]
plt.imshow(img, cmap="Greys")
plt.show() # the "3" has been flipped on the diagonal!

Figure 2: A transposed 3.

imgs_reindexed = imgs.transpose([1, 2, 0])


print(imgs_reindexed.shape) # (8, 8, 1797) - last dimension now indexes images
img = imgs_reindexed[:, :, 3]
print(img.shape) # (8, 8)
plt.imshow(img, cmap="Greys")
plt.show() # it's a normal "3"

Figure 3: A 3 again.

The .reshape function does… reshaping, as defined above. BEWARE! Naive reshaping can scramble your
data!
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# DON'T DO THIS!
imgs = load_digits()["images"]
print(imgs.shape) # (1797, 8, 8)
imgs = imgs.reshape([8, 8, 1797]) # easy-easy, no need for transpose
img = imgs[:, :, 3]
print(img.shape) # (8, 8), yup, perfect, who needs transpose
plt.imshow(img, cmap="Greys")
plt.show() # WTF

Figure 4: Modern art.

Let’s reshape properly to an array that is (n_samples, n_features). In this case, we have 1797 samples, and
8*8 == 64 features (each pixel is a feature):
imgs = load_digits()["images"]
print(imgs.shape) # (1797, 8, 8)
n_samples = imgs.shape[0]
n_features = np.prod(imgs.shape[1:]) # Don't hard-code magic numbers!
imgs = imgs.reshape([n_samples, n_features])
img_flat = imgs[3, :]
print(img_flat.shape) # (64,) - Good!
plt.imshow(img_flat, cmap="Greys") # TypeError: Invalid shape (64,) for image data

Here, although we reshaped correctly, we see plt.imshow wants a different size. To plot img_flat, we need
another tool.

Adding / Removing Dimensions


Because most images these days are colour, it is common for most image data to be 3- or 4-channel (i.e. 2-
dimensional images are represented by 3D arrays). To make programming easier, it is thus often the default
to assume that a flat image should always be represented as having three dimensions: two relating to pixels,
and one relating to the channels. Only black-and-white images are sometimes encoded as 1-channel, truly
2D images, and only for some programming libraries. Likewise, most ML algorithms expect 2D input in the
form (n_samples, n_features). But of course, it is extremely common that you have just one feature.
This often results in tiny annoyances. You have data that is shape (n_samples,), but the function you are
using wants this with the shape (n_samples, 1). Here, the magic tool is np.expand_dims(). If we use this,
we can now plot our flattened image:
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_style("ticks")

imgs = load_digits()["images"]
print(imgs.shape)
n_samples = imgs.shape[0]
n_features = np.prod(imgs.shape[1:])
imgs = imgs.reshape([n_samples, n_features])
img_flat = imgs[3, :]
print(img_flat.shape) # (64,)
img_expanded = np.expand_dims(img_flat, 1)
plt.imshow(img_expanded, cmap="Greys")
plt.show()

Figure 5: A flattened 3

You can delete annoying size-1 dimensions with np.squeeze(), or, if we know we want the 1D representation,
then np.ravel() is often the fastest way to do this.

The Evil of Size-1 Dimensions When you start with array programming, these phoney size-1 dimensions
might be the cause of a lot of errors and conceptual confusion early on. If you are mathematically minded,
you will also realize at some level that these arrays are wonky as mathematical and programming objects. For example:
np.random.seed(42)
A = np.random.randint(0, 10, [10]) # 10 random digits
E = np.expand_dims(A, 1)
print(A.shape) # (10,)
print(E.shape) # (10, 1)

# "Empty" dimensions have to have size 1, for some reason


np.reshape(A, [A.shape[0], 0]) # ValueError: cannot reshape array of size 10 into shape (10,0)
np.reshape(A, [A.shape[0], 1]) # ok, works
np.reshape(A, [A.shape[0], 2]) # ValueError: cannot reshape array of size 10 into shape (10,2)

# comparison behaves sanely


np.alltrue(A == A.T) # True
np.alltrue(A.T == A) # True
np.alltrue(A == A.ravel()) # True
np.alltrue(A.ravel() == A) # True

np.alltrue(E == E.T) # False


np.alltrue(E.T == E) # False
np.alltrue(E == E.ravel()) # False
np.alltrue(E.ravel() == E) # False
np.alltrue(A == E) # False

np.alltrue(A.ravel() == E.ravel()) # True - arrays still only have identical data

# Some slicing makes sense...


print(A[0]) # 6
print(A[0, :]) # IndexError: too many indices for array (GOOD)
print(A[0, 0]) # IndexError: too many indices for array (GOOD)
print(A[0, -1]) # IndexError: too many indices for array (GOOD)
print(E[0, :]) # [6] - note square brackets: returned slice is itself an ndarray
print(E[0]) # [6] - actually just a shorthand for above

# and I guess this makes sense:


print(E[0, 0]) # 6 - return is an int, not an ndarray
print(E[0, -1]) # 6 - same

# but wait a minute, adding a dimension doesn't add data. What if...

EEEEEE = np.copy(A)
for i in range(31): # biggest allowed dimension is 32
EEEEEE = np.expand_dims(EEEEEE, i+1)

print(EEEEEE.shape) # (10, 1, 1, ..., 1)


print(EEEEEE[0, :]) # [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[6]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
print(EEEEEE[0, 0, :]) # [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[6]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
print(EEEEEE[0, 0, -1]) # [[[[[[[[[[[[[[[[[[[[[[[[[[[[[6]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
# and etc.

The point is that size-1 dimensions are fundamentally empty, and are only there for programming reasons
to allow functions to expect common shapes. The problem is sometimes functions silently throw away size-1
dimensions when you need them, or need size-1 dimensions when you thought a raveled array would be fine.
In fact, a lot of NumPy functions include a keepdims argument for this purpose.
So just be aware that size-1 dimensions can cause all sorts of subtle bugs. I recommend having
them around for only exactly as long as you need them, and otherwise keeping arrays in raveled or squeezed
form (e.g. no size-1 dimensions). If you get strange, shape-related errors, read the docs, look at the expected
input shapes and what shapes are returned, and throw in np.expand_dims and .squeeze or .ravel calls as
needed to resolve the problem.

Resizing and Reshaping


A common problem in machine learning is having data of different sizes. While some more advanced
techniques (fully convolutional CNNs, seq2seq RNNs, transformers) can handle different sizes naturally,
most basic techniques cannot. In these cases, you need to modify your data. Below are the most common
ways to deal with this problem.
Note: You have to be careful when you do this, since you could inadvertently contaminate your data. For
example, imagine you are implementing classification between two groups, but in one group, each sample is
smaller (e.g. shorter timeseries, smaller image). If you resize all samples to the same size, it should be clear
that one group’s resized samples will be more distorted / blurry than the other’s. That is, you’ve accidentally
introduced a variable that easily allows perfect identification of each group, but which has nothing to do
with the actual classification task. There could be a similar issue with padding or cropping. In general,
manipulating your data must be done carefully (although most online tutorials treat this as if it is some
simple fire-and-forget process that doesn't much matter).

Interpolation / Resampling and Downsampling / Upsampling


Interpolation and resampling are very general concepts, and technically include downsampling and upsampling.
However, in practice, to interpolate usually implies filling in missing data, so that an interpolated
sample has more data points than the original sample. In general, the idea is to fit some function to your
data, and then evaluate that function at more points for interpolation, or fewer points for downsampling.
The smoother and more local the function (e.g. a spline), generally the better (but slower) the resampling.
Simpler functions (linear, nearest-neighbor) are faster, but may lose more information.
There are some extremely general interpolation functions and libraries available in Python:
• scipy.interpolate: 1D and 2D interpolation functions
• scipy.ndimage.zoom: interpolate arbitrary arrays
• torch.nn.functional.interpolate (Torch general-purpose array interpolation)
• tf.image.resize: Tensorflow image resizer
• tfg.math.interpolation: Tensorflow Graphics interpolation (advanced)
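As a rough sketch using scipy.ndimage.zoom from the list above (the tiny array is a stand-in for a real image):

import numpy as np
from scipy.ndimage import zoom

a = np.arange(16, dtype=float).reshape(4, 4)  # a small 4x4 "image"
upsampled = zoom(a, 2)      # spline interpolation up to twice the size
downsampled = zoom(a, 0.5)  # downsample to half the size
print(upsampled.shape)      # (8, 8)
print(downsampled.shape)    # (2, 2)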

Cropping and Padding


Often, these are the easiest and most desirable ways to resolve shape discrepancies. Large samples can
be cropped (data points removed), and/or small samples can be padded (values are added at sample
ends/boundaries).
Cropping can be implemented trivially by array slicing, so there aren’t NumPy functions for this. But if you
want to crop more symmetrically, like to the center of some array, or include random crops, then there are
tools available in PyTorch and Tensorflow.
Padding is actually quite a bit more complicated, as the choice of padding value turns out to matter more
than some have hoped. There are multiple ways to pad beyond just zero-padding (the default in most
applications). Sometimes, the padding choice is so important that naive padding can invalidate a model, such
as when failing to use causal padding for timeseries data.
The most general padding function in Python is np.pad. Then, there are options for PyTorch and Tensorflow.
Until we get to deep learning, I would just stick with NumPy. Although there are problems with naive zero-
padding, you can still just use this if you want.[6]
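A minimal sketch of zero-padding a short 1D sample out to a common length with np.pad (the target length of 10 is arbitrary):

import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])  # a sample that is too short
target_len = 10
pad_total = target_len - len(x)
x_padded = np.pad(x, (0, pad_total), mode="constant")  # pad on the right with zeros (the default constant)
print(x_padded)        # [3. 1. 4. 1. 5. 0. 0. 0. 0. 0.]
print(x_padded.shape)  # (10,)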

Patches / Slices / Windows


Sometimes, data can be too large to handle computationally in its entirety. And sometimes, there is a huge
amount of variability in the sizes of individual samples. And sometimes, you also have strong reasons to
believe that patterns arise only locally (e.g. close pixels in an image, close timepoints in a timeseries, close
spacetime locations for spatial data).
In this situation, training on entire instances might miss the key patterns, and be computationally expensive.
One option to convert a bunch of samples into the same sizes is via patches or windows of the same size. You
can either let your patches overlap, partly overlap, or not overlap at all. With NumPy array slicing, these
are all fairly straightforward to implement on your own.
However, since this is a common technique, there are tools for this. In Tensorflow, this can be done with
tf.image.extract_patches for images, or with a TimeseriesGenerator. It would require manual code with
PyTorch, or learning some TorchIO. You can also do it very cleverly with raw NumPy, but it takes some
work.[7]
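If you do want a NumPy-only approach, here is a hedged sketch using np.lib.stride_tricks.sliding_window_view (available in NumPy 1.20+) on a toy 1D series:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10)                    # a toy 1D timeseries
windows = sliding_window_view(series, 4)  # all overlapping windows of length 4
print(windows.shape)                      # (7, 4)

non_overlapping = windows[::4]            # keep every 4th window for non-overlapping patches
print(non_overlapping.shape)              # (2, 4)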

[6] Potential bonus marks could perhaps be awarded for comparing the effects of different padding choices on algorithm
performance, provided this doesn't contaminate data, as it can with timeseries data.
[7] Note this is an example of a towardsdatascience post that is good.

The General Data Analysis Procedure
Data science is more of an art than a science, so there isn’t really anything remotely like a general data
analysis procedure. However, most Python libraries (and in fact other statistical libraries like in R) write
their functions to use similar abstractions. That is, for most data analysis, you:
1. Acquire data.
2. Inspect data (visualization, summary stats)
3. Clean data (reshape, resize, deal with missing values, outliers)
4. Split your data into predictors and, if doing supervised learning, targets, usually called (respectively)
• X, y
5. Define training and test sets, usually called
• X_tr, X_test, y_tr, y_test
6. Instantiate or build some model
• model = Model()
7. Fit / train that model to / on the training data
• model = model.fit(X_tr, y_tr)
8. Use the fitted model to get predictions on the test set
• y_pred = model.predict(X_test)
9. Evaluate the predictions by comparing them to the true test targets
• compare(y_pred, y_test)
In the simplest cases, that will really be mostly the extent of the code. In extremely simple cases, you might
even have X_tr == X_test == X and y_tr == y_test == y. But in practice, building and training the
model can be complicated and time consuming, and evaluation can also be quite computationally expensive
(as it may require many repetitive fitting and prediction steps).
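As a hedged sketch of steps 4-9 using scikit-learn (the digits dataset and the KNN classifier here are used purely as examples):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits["data"], digits["target"]  # (n_samples, n_features) predictors and (n_samples,) targets

X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_tr, y_tr)                  # fit / train on the training data only
y_pred = model.predict(X_test)         # predictions on the held-out test set
print(accuracy_score(y_test, y_pred))  # evaluate predictions against the true test targets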
For the first problem set, the above procedure is mostly all you have to do. If you’ve managed to learn
all the Python, NumPy, pandas, and other concepts, this part will probably seem pretty easy (at least to
implement).
Nevertheless, there are a few tips you might find helpful.

Computing the AUC


The first assignment will ask you to compute the area under the ROC (receiver operating characteristic)
curve (AUC) as a way to quantify the separation of your two groups. If you read up on the AUC and ROCs,
you might find this confusing, as the AUC is usually only defined with respect to the performance of some
classifier. That is, you don’t normally talk about the AUC of raw, unclassified groups.
However, given some overlap in the two groups, even the best classifier will not be able to perfectly separate
the samples. That is, if you are given two samples that fall in the overlapping region, which way you classify
those two samples depends entirely on where you set your separation threshold / cutpoint. So when there is
overlap, even a perfect classifier will result in tradeoffs between false and true positives and negatives.
Thus, whenever you have two subgroups, you can calculate the AUC that corresponds to a perfect classifier,
and this tells you something about how fundamentally separable the two groups are.
In Python, there are two ways to calculate this AUC. You can use the mathematical relationship to the
Mann-Whitney U statistic along with scipy.stats.mannwhitneyu to get the AUC. In that case, you will have:
from scipy.stats import mannwhitneyu

# assume x, y are the observed (raveled) vectors of values for each group
auc = mannwhitneyu(x, y).statistic / (len(x) * len(y))

The order of x and y above matters, in that it changes which side of 0.5 the resulting AUC value will fall on.
Alternatively, you can use sklearn.metrics.roc_auc_score. As this is set up under the assumption that you have
some scores / predictions, you will need to have your data formatted differently, and will need to generate a
vector of perfect predictions. You can do this like so:

from sklearn.metrics import roc_auc_score

# assume x, y are the observed (raveled) vectors of values for each group
data = np.concatenate([x, y])
labels = np.array([0 for _ in x] + [1 for _ in y])
auc = roc_auc_score(y_true=labels, y_score=data)

Switching the order of the 0 and 1 above will also change which side of 0.5 the resulting AUC value will fall on.
Both methods above give identical values.

Speeding Up scikit-learn Models


The sklearn.neighbors.KNeighborsClassifier can be slow for large data, or on old machines. If this is a
problem, you should first try using the n_jobs=-1 parameter when instantiating the model. Depending on
your number of cores, this could result in an almost 10 times speedup. Many sklearn models have this option
available.
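E.g. (a sketch; the number of neighbours is up to you):

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)  # n_jobs=-1 uses all available CPU cores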

Speeding Up Prototyping with Subsets


When you first try to get some code working, often analysing your full dataset can take a long time. There is
usually no need to do a full analysis each time, so instead, to speed up development, you can either generate
random data, or just use small random subsets of your data.
To select a subset, you could choose a random subset with np.random.choice. But if you have imbalanced
data, this might occasionally choose a subset that has only data from one class, or from one narrow region of
your data. This can make some models impossible to fit. If you want to ensure that your subset has enough
data from all classes to train, you could use StratifiedShuffleSplit.
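A hedged sketch of grabbing a small, class-balanced subset this way (the 10% subset size is arbitrary, and the digits data is just an example):

from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedShuffleSplit

digits = load_digits()
X, y = digits["data"], digits["target"]

sss = StratifiedShuffleSplit(n_splits=1, train_size=0.1, random_state=0)
subset_idx, _ = next(sss.split(X, y))  # an index array preserving the class proportions
X_small, y_small = X[subset_idx], y[subset_idx]
print(X_small.shape)  # (179, 64) - about 10% of the 1797 samples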

Avoid Silly Mistakes by Regularly Visualizing and Inspecting Shapes


When you are first learning arrays, it helps to regularly print or otherwise inspect the shapes of your arrays
before and after manipulations. When you get results, if it is possible to display them as an image, or
otherwise plot them, you should generally do this. This might help you find subtle errors, like the example
earlier where bad reshaping scrambled the image data.
In addition, printing slices of your data objects can also help you find strange errors. DataFrames print
very cleanly, as do NumPy arrays. You can usually easily print just portions of your data. In pandas,
you can do this with DataFrame.head(), and in NumPy with slicing, e.g. arr[:10, :] will give you the
first 10 rows. You can also summarize DataFrames quickly with DataFrame.describe() or NumPy arrays
with DataFrame(arr).describe() (e.g. use an intermediate pandas DataFrame just to get the nice summary
functions).
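E.g. a quick sketch of that DataFrame trick (arr stands in for whatever array you are working with):

import numpy as np
from pandas import DataFrame

arr = np.random.default_rng(0).normal(size=(100, 3))  # stand-in data

print(arr.shape)                  # (100, 3) - check shapes early and often
print(arr[:10, :])                # peek at the first 10 rows
print(DataFrame(arr).describe())  # quick summary stats via a throwaway DataFrame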

Wrapping Up
This tutorial should provide you with all the tools and resources needed for the first assignment. However,
remember, if you run into problems, I am available by e-mail or on Slack.
Good luck!

