
Tutorial #1: KNN and Data Manipulation with Python

Contents

What this Tutorial Covers

Some General Python Learning Tips
    Effectively Using your REPL, IPython/Notebooks/Jupyter
    Learn to Use Your Debugger
    Choosing Appropriate Learning Resources
    Paths and Pathlib
    Plotting

Arrays and DataFrames
    Terminology

NumPy
    Boolean Indexing and Logical Operators
    Index Arrays
    pandas

Formatting Your Data for Analysis
    Dealing with Missing Values
    Reshaping
    Resizing and Reshaping

The General Data Analysis Procedure
    Computing the AUC
    Speeding Up scikit-learn Models
    Speeding Up Prototyping with Subsets
    Avoid Silly Mistakes by Regularly Visualizing and Inspecting Shapes

Wrapping Up

What this Tutorial Covers
As there are plenty of guides and documentation for learning to do Python data science and machine learning
(ML) online, I won’t be trying to teach you this from scratch. What this tutorial will provide you with is
links to useful resources, and some tools and techniques that are especially useful or relevant for the first
assignment, and which might not be emphasized as much in standard guides.
These tutorials won’t be a regular thing. I will still always be available by e-mail, and help will be available
on Slack if you need guidance. But you will be expected to pick up the rest of the tools you need by
reading the documentation and other resources online.
However, starting out, there can be a lot to learn, especially if you are just getting started with Python. So
we are covering some Python data science basics here to help you get started. This tutorial assumes you
have learned the basics of Python syntax, and understand fundamental Python classes like list, str, bool,
int, and float, and that when you see something like ClassName.method mentioned, you understand how you
would use this with an actual instantiation of the class.
In terms of the official Python tutorial, I am assuming you have gone through and mostly understand the
sections:
• 1. Whetting Your Appetite: all
• 2. Using the Python Interpreter: all
• 3. An Informal Introduction to Python: all
• 4. More Control Flow Tools: only sections below
– 4.1. if Statements
– 4.2. for Statements
– 4.3. The range() Function
– 4.5. pass Statements
– 4.6. Defining Functions
– 4.7. More on Defining Functions
∗ 4.7.1. Default Argument Values
∗ 4.7.2. Keyword Arguments
• 5. Data Structures: only sections below
– 5.1. More on Lists
∗ 5.1.3. List Comprehensions
– 5.3. Tuples and Sequences
– 5.6. Looping Techniques
– 5.7. More on Conditions
– 5.8. Comparing Sequences and Other Types
• 8. Errors and Exceptions: only sections below
– 8.1. Syntax Errors
– 8.2. Exceptions
– 8.3. Handling Exceptions
– 8.4. Raising Exceptions
• 9. Classes: only sections below
– 9.1. A Word About Names and Objects
– 9.2. Python Scopes and Namespaces
∗ 9.2.1. Scopes and Namespaces Example
– 9.3. A First Look at Classes
∗ 9.3.1. Class Definition Syntax
∗ 9.3.2. Class Objects
∗ 9.3.3. Instance Objects
∗ 9.3.4. Method Objects
∗ 9.3.5. Class and Instance Variables

Some General Python Learning Tips
Effectively Using your REPL, IPython/Notebooks/Jupyter
For those of you coming from compiled languages without a REPL (e.g. C, C++, Java), the usual code,
compile, run, debug, and repeat workflow will still work just fine for Python (this is what I do most of the
time). But since Python is interpreted, it can often be faster to use some kind of REPL-like setup.
It is easy to open up a terminal, start a session with the Python REPL by running the command python or
python3, and then play around with the various functions available to you. If you’re using a competent OS
(i.e. not Windows, which might require extra steps), tab-completion can help you discover all the methods
available to you on objects, and you can see different results very quickly. If you have gone through the first
couple or so sections of the official Python tutorial, you will already have a pretty good idea of how to use
the REPL. The Python REPL is a great scratch space, so don’t forget to use it.
When you want more power available than the plain Python REPL, there are Jupyter Notebooks. If you
don't want to set up notebooks, you can use Google Colab, which sets up the whole notebook environment
for you in your browser without you having to install anything.
Because I think Notebooks encourage bad code and bad practices, and aren’t allowed as project or assignment
submissions, I won’t cover them further. But you are still free to use them to develop your code initially,
and I recommend giving them a try if you haven’t before. Notebooks are good for quickly playing around
with data and learning the basics of some new library.

Learn to Use Your Debugger


As code grows larger and more complex, random print statements and tweaks are often not enough. This
is when you’ll want to reach for a debugger. A debugger can help you inspect variables and find certain
kinds of bugs much more quickly. I recommend you learn how to do integrated debugging in either VS Code
or PyCharm. This is usually very easy to learn, and amounts to little more than clicking the right GUI
components.
A debugger can also be a surprisingly useful development tool in Python. Just place a breakpoint in your
script, run to that point, and now in the debug console you have access to all the variables that are both
defined and in scope. This can help you prototype new code. Once you’ve learned the basics of your debugger,
I think you will see it is actually a far more powerful and flexible tool than most other tools available to you
(e.g. notebooks).
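If you prefer to stay in the terminal, Python's built-in breakpoint() function drops you into the same kind of debug console. A minimal sketch (the function here is just a toy example):

def normalize(values):
    total = sum(values)
    breakpoint()  # execution pauses here; inspect values and total in the pdb console, then type c to continue
    return [v / total for v in values]

if __name__ == "__main__":
    print(normalize([1.0, 2.0, 3.0]))

In VS Code or PyCharm, you would instead click next to a line to set the breakpoint and run the script with the debugger attached.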

Choosing Appropriate Learning Resources


Avoid Plain / Naive Googling
Due to aggressive search-engine optimization, plain Google results (i.e., those that don’t use any Google-
fu) are increasingly almost entirely junk. When trying to learn how to do some data science, you get
overwhelmed with results from content-mills like towardsdatascience, machinelearningmastery, and various
other extremely low-quality medium posts or listicles. While these can occasionally be useful, usually they
are extremely poorly written, contain numerous basic conceptual errors, and are written by people who have
only just learned how to do what they are writing about. I strongly recommend you avoid using these kinds of
resources, when possible.
If you don’t understand a concept or technique at all, and the recommended course texts were not helpful,
there is no shame in using Wikipedia. Wikipedia is generally good for mathematical and statistical content,
and much more trustworthy than the average content-mill material. It also tends to link to decent resources.
After you understand basic concepts and ideas, you can try searching stackexchange for more specific practical
and conceptual questions. In fact, generally just adding “site:stackexchange.com” to the end of your Google
searches will get you far better results.

Read and Search the Official Docs First
However, if you understand the basics of what you need to do, your next step should be to head to the official
documentation. For this course, there are probably only really six or seven official resources you’ll need.
• NumPy
• pandas
• scikit-learn
• matplotlib and possibly seaborn
• Tensorflow and Keras
• PyTorch and PyTorch Lightning
Learn to use these sites, and browse through API docs and official tutorials before searching Google for how
to do things. The only docs that are sometimes not so great are those for matplotlib and pandas. If you
want to know how to do some plotting thing, or want to know how to manipulate a DataFrame in some way,
it can actually be faster and more helpful to go to stackexchange first.[1]

Paths and Pathlib


One of the most important things to remember early on is that the shell where you execute your python
or python3 commands always has a current working directory (which can be viewed in Windows with
the command cd, and in Linux/MacOS with pwd). When you define relative paths in your Python
scripts, they will always be relative to the working directory. That is, if my current working directory is
/home/derek/Desktop/CSCI444/, and my script file is /home/derek/Desktop/CSCI444/tutorial.py, which has the
code:
import os

os.makedirs("test") # make a folder / directory called "test"

then running the command python tutorial.py will create a folder /home/derek/Desktop/CSCI444/test/.[2]

However, if my current working directory was then changed to be /home/derek/Desktop/, and I ran the
command python CSCI444/tutorial.py, then this would create the folder /home/derek/Desktop/test/. And
if I had to refactor and move the tutorial.py file, test would get created in the wrong place again. Relative
paths can thus be a subtle and annoying cause of bugs.
One alternative is to always use absolute paths. If instead the script were:
import os

os.makedirs("/home/derek/Desktop/CSCI444/test") # make a folder / directory called "test"

then it no longer matters from where I run this script (e.g. we don’t use any information about the current
working directory). The problem is, now no one else can use the code without changing the derek part, or,
if they are on Windows, without completely changing everything. That’s not good.
One simple solution is the magic __file__ variable. Names with double underscores are special variables
or methods in Python (also called "magic" or "dunder" names) that perform special functions at runtime.
The __file__ magic variable holds the path to the file it is written in (on older Python versions this is the
path as it was passed on the command line, so possibly relative, as in the examples below; on Python 3.9+
it is typically already absolute). E.g.:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
print(__file__) # prints: tutorial.py

but if our directory is /home/derek/Desktop/, then


[1] Eventually, this will cost you, as you will not build up a good understanding of the libraries this way. But for this course,
you likely won't advance far enough into the APIs to hit this problem.
[2] Imports resolve in a similar way. So if you get strange errors related to imports, you probably changed your current working
directory somehow. In general, you should always run your Python scripts from a consistent location (e.g. for simple scripts,
the containing directory).

# working dir: /home/derek/Desktop/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
print(__file__) # prints: CSCI444/tutorial.py

In a REPL (where there is no file) attempting to access __file__ will raise an exception.
The __file__ magic variable is the first tool that will allow us to easily and reliably identify paths across
operating systems. The next tool is the pathlib module, which can convert this string to an absolute path:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
from pathlib import Path

path = Path(__file__).resolve()
print(path) # prints: /home/derek/Desktop/CSCI444/tutorial.py

Now we have stored in the path variable an absolute path that works across operating systems, no matter
what the current working directory is. It will also generally only break if we move the tutorial.py file.
Pathlib also makes it easy to reliably navigate directories and get what you need:
# working dir: /home/derek/Desktop/CSCI444/
# This file: /home/derek/Desktop/CSCI444/tutorial.py
from pathlib import Path

path = Path(__file__).resolve()
print(path) # /home/derek/Desktop/CSCI444/tutorial.py
print(path.parent) # /home/derek/Desktop/CSCI444
print(path.parent.parent) # /home/derek/Desktop
print(path.parent / "newfolder") # /home/derek/Desktop/CSCI444/newfolder

There’s a whole bunch more you can do with pathlib. Check out the docs, and if you see some site
recommending old-school functions like os.path.join and other verbose and tedious stuff, consider replacing
those calls with pathlib equivalents.
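Putting these pieces together, here is a minimal sketch of creating a folder next to the script no matter where it is run from (the folder name "outputs" is just an example):

from pathlib import Path

SCRIPT_DIR = Path(__file__).resolve().parent  # directory containing this script, as an absolute path
outdir = SCRIPT_DIR / "outputs"
outdir.mkdir(exist_ok=True)  # does nothing if the folder already exists
print(outdir)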

Plotting
For the most part, you should be fine with matplotlib. However, the default matplotlib plots are somewhat
ugly, and are missing some key types of plots. If you don’t mind working with pandas DataFrames, then
seaborn is a useful way to get some nicer plots. In particular pairplot can be an extremely powerful way
to visualize your entire dataset, if it is not too large, and heatmap can be used for visualizing 2D arrays and
DataFrames.
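As a rough sketch of both (the DataFrame and its "label" column here are made up purely for illustration):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sbn

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])  # toy data
df["label"] = rng.integers(0, 2, size=100)

sbn.pairplot(df, hue="label")  # scatterplots of every pair of features, coloured by class
plt.show()

sbn.heatmap(df.corr())  # visualize a 2D array or DataFrame (here, the correlation matrix)
plt.show()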

Arrays and DataFrames
Data science means working with arrays: multidimensional, grid-like collections of values. In deep learning
libraries, arrays are often called tensors (essentially unrelated to tensors in mathematics). Two-dimensional
arrays are often just called matrices, and one-dimensional arrays are usually called vectors.
Although not strictly part of the definition of array, the array concept in programming usually strongly
implies that all elements of the array are of the same data type. It is extremely unusual to see mixed-type
arrays. Most of the time, the data type will be some kind of float, integer, string, bytes, or possibly a
data/time type. That is, arrays tend to hold only one simple, primitive type of data. Arrays also do not
store much additional extra information (like variable names).
Arrays for data science are more accurately called multidimensional arrays or N-dimensional arrays. Arrays are
implemented in Python in the NumPy library via the ndarray class. Other languages have similar constructs
(e.g. Julia, Rust).
For a lot of data science and machine learning, arrays can be somewhat inconvenient, as you cannot access
rows or columns by name, and you can’t store complicated data of multiple types conveniently. In this case,
you want a data structure that more resembles a table. In Python (and also R) the structure of choice is the
dataframe, and in Python, the library for manipulating dataframes is pandas. As we are focused on Python,
I will refer only to DataFrames from here on in, to make it clear I am talking about pandas style dataframes.
Before getting into arrays and DataFrames, it is important to clarify some key terms.

Terminology
To work with and think clearly about your data, you need to make strong distinctions between your data
shape or dimensions and your data sizes.
The number of dimensions in your data is the number of axes in the array. E.g. a timeseries of rainfall
in one location is 1D, while rainfall timeseries for multiple locations form a 2D array. A black-and-white (BW) image is 2D, a
colour image is 3D (you have RGB and maybe A—alpha transparency—data at each pixel), a BW brain
scan is 3D, and a colour brain scan is 4D. If you have an array that holds a collection of samples (which we
will encounter when dealing with batches later in deep learning), then the dimension of the batch will be the
dimensions of the samples plus one. E.g. an array holding 100 BW images is a 3D array. If arr is a NumPy
array, the dimension is returned by len(arr.shape) (or, equivalently, by arr.ndim).
The size or length of a dimension is the number of points or values in that dimension. So a 1080p image
would have a size of 1920 in the horizontal dimension, and 1080 in the vertical dimension. The shape of an
array is a list of all the sizes for all dimensions. If arr is a NumPy array, then arr.shape returns the shape
information. So if we have an RGB 1080p image in the NumPy array img, then img.shape would return the
tuple (1920, 1080, 3).
The total size of an array is the total number of values contained in the array. That is, it is
np.prod(arr.shape), the product of all the sizes for each dimension (NumPy also exposes this directly as arr.size).
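A quick sketch tying these terms together (the shapes here are just illustrative):

import numpy as np

arr = np.zeros([100, 8, 8])  # pretend this is a batch of 100 black-and-white 8x8 images

print(arr.ndim)            # 3 - number of dimensions, same as len(arr.shape)
print(arr.shape)           # (100, 8, 8) - the size of each dimension
print(arr.size)            # 6400 - the total size, same as np.prod(arr.shape)
print(np.prod(arr.shape))  # 6400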

Reshaping is an operation that does not change the total size of your array, but may change the shape in
any number of ways. Resizing is any operation that may affect the total size of an array. For example,
transposition is a kind of reshaping that flips an array on its diagonal, so that an array of shape (m, n)
gets the shape (n, m) (and you can get the transpose of an array arr conveniently with the shortcut arr.T).
Something that comes up a lot in deep learning and working with images (like in your first assignment) is the
channel dimension. This is the dimension that contains multiple copies / layers / channels of the “signal”
(e.g. image, timeseries). In our RGB 1080p example, the channel dimension has size (and position) 3. This
is called “channels last” format. Sometimes, images will be stored in a “channels first” format, in which case
the 1080p image shape would be (3, 1920, 1080).[3] If you had, say, daily rainfall data for Antigonish and
Toronto for 365 days, the shape would be (365, 2) or (2, 365), and we could call this size-2 dimension the
channel dimension.[4]

[3] Also, colour images can be 4-channel (but still 3D) if RGBA and not just RGB.

NumPy
To start, I recommend you learn how to do the basics of NumPy. Their beginner tutorial covers most of
what you will need to get started for the assignment. The other thing you will need for the first assignment
is boolean indexing and possibly index arrays.
Note: NumPy is almost always imported as import numpy as np, and you should follow this convention for
your code. In this tutorial, if you see np.<something>, then you can assume I mean NumPy.

Boolean Indexing and Logical Operators


Often, you want to select only some elements of an array. There are some extremely convenient shorthands
to do this. Let’s build an array:
np.random.seed(3)
A = np.random.randint(0, 10, [4,4])
A
# array([[8, 9, 3, 8],
# [8, 0, 5, 3],
# [9, 9, 5, 7],
# [6, 0, 4, 7]])

and use “==” to generate some boolean arrays:


idx_8 = (A == 8) # get logical indicators of values that equal 8
idx_8
# array([[ True, False, False, True],
# [ True, False, False, False],
# [False, False, False, False],
# [False, False, False, False]])

idx_9 = (A == 9)
idx_9
# array([[False, True, False, False],
# [False, False, False, False],
# [ True, True, False, False],
# [False, False, False, False]])

idx_8_9 = (A == 8) | (A == 9) # don't use "or" or "and" for boolean arrays, use "|" or "&"
idx_8_9
# array([[ True, True, False, True],
# [ True, False, False, False],
# [ True, True, False, False],
# [False, False, False, False]])

idx_small = (A <= 3) # We can use other logical operators too


idx_small
# array([[False, False, True, False],
# [False, True, False, True],
# [False, False, False, False],
# [False, True, False, False]])

[4] These latter points can be confusing. That is, flat colour images are stored as 3D arrays. But when we get to Tensorflow /
Torch, we'll see functions like Conv2D, MaxPool2D, etc. that process flat colour images. This is because when we refer
to the dimension of data, we often don't count the channel dimension. In any case, don't overthink this, just keep in mind
channel dimensions are a thing.

Now we can use these boolean arrays to select elements.
print(A[idx_8_9]) # [8 9 8 8 9 9]
print(A[idx_small]) # [3 0 3 0]

This can be more useful when you have multiple objects, and need to select based on values in one of them.
Let’s imagine we have values X, and target y:
X = np.array([[1, 8],
[2, 3],
[3, 4],
[4, 7],
[5, 6]])
y = np.array([[8],
[8],
[9],
[9],
[8]])

and we only want data where the target is equal to 8. Then we can get this with:
eights = X[y == 8, :] # IndexError: too many indices for array

Wait, what happened? Let’s look at y.shape:


print(y.shape) # (5, 1)

This is a common problem you’ll bump into. The solution is:


print(y.squeeze().shape) # (5,)
idx = (y.squeeze() == 8) # array([ True, True, False, False, True])
eights = X[idx, :]
eights
# array([[1, 8],
# [2, 3],
# [5, 6]])

We can also select things not matching some criterion with ~:


not_eights = X[~idx, :]
not_eights
# array([[3, 4],
# [4, 7]])

Note that boolean arrays used for indexing really have to be boolean. That is:
A = np.array([1, 2, 3, 4])
idx = np.array([0, 0, 1, 1]) # surely this is the same!
A[idx] # array([1, 1, 2, 2]) # AAGH, nope

# We accidentally made an index array (see below)


idx.dtype # dtype('int64')
idx = np.array(idx, dtype=bool) # convert to boolean array
A[idx] # array([3, 4])

Index Arrays
Sometimes it is more convenient to just select the elements at exactly certain indices. In fact, that’s what
happened in our last boolean example. You won't often need to create index arrays yourself, but they
are returned by a lot of data-splitting functions, so it is important to understand the basic idea and be
aware of the technique.
Here is a basic example:

np.random.seed(3)
A = np.linspace(0, 10, 10).round(1)
idx = np.random.choice(5, size=5, replace=False)
A # array([ 0. , 1.1, 2.2, 3.3, 4.4, 5.6, 6.7, 7.8, 8.9, 10. ])
idx # array([3, 4, 1, 0, 2])
A[idx] # array([3.3, 4.4, 1.1, 0. , 2.2])
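For instance, here is a rough sketch of how index arrays show up when splitting data with scikit-learn (KFold is just one of many such splitters):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # train_idx and test_idx are index arrays, e.g. array([0, 1, 3, ...]) and array([2, 7])
    X_tr, X_test = X[train_idx], X[test_idx]
    y_tr, y_test = y[train_idx], y[test_idx]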

pandas
With any luck[5] you won’t have to deal with pandas much on this first assignment. So feel free to skip this
section unless you need pandas.
However, pandas is often the easiest way to load data with the various pandas.read_ functions. These
functions are extremely powerful and will likely help you quickly load most of the data you find online. If
you understand the NumPy basics, you can generally use NumPy methods on a pandas DataFrame and get
similar behavior. Alternatively, you can just convert the DataFrame and be done with pandas for now.
Typically you import pandas as pd. When using pandas, you also use the DataFrame class so much that you
will likely want to import it directly with from pandas import DataFrame. In the following examples, you can
assume those imports are implicit.
There is an official 10-minute guide that covers the bare minimum of pandas, and there are also a bunch of
things that you need to understand which are covered in the basics tutorial.
If you like pandas, by all means use it. However, with the amount of early material to learn, you might just
want to learn enough to use it like NumPy, and to load data. pandas is quite complicated and takes a long
time to learn well, so I would personally recommend avoiding it for now.

Tips to Make pandas Less Frustrating


If you want to avoid pandas, you can always convert a DataFrame to an ndarray by using the
DataFrame.to_numpy() method. You can convert it back like so:
COLUMNS = ["A", "B", "C"] # save your column names somewhere
df = DataFrame(data=np.random.randint(0, 10, [3,3]), columns=COLUMNS)
arr = df.to_numpy() # convert
# ... do some manipulations on your array, e.g. arr -= 1
df_new = DataFrame(data=arr, columns=COLUMNS)

You can also index into a DataFrame most of the time as if it is a NumPy array by using the .iloc accessor:
np.random.seed(3)
df = DataFrame(data=np.random.randint(0, 10, [3,3]), columns=["A", "B", "C"])
print(df)
# A B C
# 0 8 9 3
# 1 8 8 0
# 2 5 3 9

print(df.iloc[:, 0])
# 0 8
# 1 8
# 2 5
# Name: A, dtype: int64

print(df.iloc[1, :])
# A 8
# B 8
# C 0
# Name: 1, dtype: int64

[5] pandas is very poorly designed and frustrating to use, mostly because of the annoying and counter-intuitive Series objects,
and endlessly subtle indexing options, none of which work particularly sensibly or predictably. It is also extremely slow, even
for Python. I hate pandas, and try to use as few features from it as possible, but you really can't avoid it for a huge amount of
Python data science.

You can also access elements with boolean indexing and index arrays, and the usual tricks like df == 8 work
to generate “boolean indexing DataFrames” too. You can also do NumPy style indexing, but with column
names, via .loc.
np.random.seed(3)
df = DataFrame(data=np.random.randint(0, 10, [3,3]), columns=["A", "B", "C"])
print(df)
# A B C
# 0 8 9 3
# 1 8 8 0
# 2 5 3 9

print(df.loc[:, "A"])
# 0 8
# 1 8
# 2 5
# Name: A, dtype: int64

print(df.loc[1, :])
# A 8
# B 8
# C 0
# Name: 1, dtype: int64

idx = (df == 8)
print(df[idx])
# A B C
# 0 8.0 NaN NaN
# 1 8.0 8.0 NaN
# 2 NaN NaN NaN

If you want to save yourself a huge amount of confusion and want to avoid some annoying
bugs, almost always access and assign data using the .iloc and .loc accessors, and nothing else.
# AVOID ALL OF THESE WHEN REASONABLE!
df.A
df["A"]
df["A", :]
df["A"][1]

One exception which can be extraordinarily useful for certain tasks is the DataFrame.filter method with
the regex argument. I won’t cover it, just be aware of it.

Formatting Your Data for Analysis
Often, real-world data is messy. It might have different sizes, be in the wrong shape, have missing values,
or have errors in data entry / corruptions.

Dealing with Missing Values


This is probably one of the first things you’ll want to look into. Missing values break all but the most
sophisticated contemporary ML algorithms. For NumPy arrays and pandas DataFrames, missing values are
encoded as np.nan, and can be found with either np.isnan for NumPy, or pandas.isnull or DataFrame.isna
for pandas. Missing values are also often referred to as NaN, nan, or other such variants, which stand for “Not
a number”. The NaN value is usually still a float. E.g. isinstance(np.nan, float) returns True.
If you have a lot of NaNs in your data, you might not be able to just drop instances (rows, samples) with
functions like DataFrame.dropna or tricks like x = x[~np.isnan(x)]. This unfortunately means you’re in the
wild and wacky world of missing data. There are bookshelves on this subject, and you really don’t want to
get into them for this course. If you have a huge amount of missing data or NaNs in your data,
you might just want to find a different dataset.
However, if you have just a few missing values, and you really like your data, you might still want to work
around these issues. In this case, you probably want to reach for a simple imputation method. You can find
some approaches for this in sklearn.impute, and general guides here.
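As a hedged sketch of simple (mean) imputation with sklearn.impute.SimpleImputer (the tiny array here is made up):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])
print(np.isnan(X).sum())  # 2 missing values

imputer = SimpleImputer(strategy="mean")  # replace each NaN with its column mean
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).sum())  # 0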

Reshaping
Often, you will have to reshape your data before you can use it in standard algorithms. For example, most
ML algorithms expect a shape of either (n_samples, n_features), or the transpose of this. This means some
data (like the 2D image data in your first assignment) cannot be used directly, and must be flattened (or
embedded if we are dealing with timeseries or sequences).
In NumPy, your main methods for this are ndarray.reshape and ndarray.transpose.

The .transpose function is fully general, and allows you to permute the axes in different orders. E.g.
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_style("ticks") # remove some ugly default gridlines

imgs = load_digits()["images"] # this is a small dataset available in sklearn


print(imgs.shape) # (1797, 8, 8) - first dimension indexes the images
img = imgs[3, :, :]
print(img.shape) # (8, 8)
plt.imshow(img, cmap="Greys") # show black-and-white image
plt.show() # it's a "3"

Figure 1: A 3.

imgs_flipped = imgs.transpose([0, 2, 1]) # swap last two dimensions
print(imgs_flipped.shape) # (1797, 8, 8) - no change in shape!
img = imgs_flipped[3, :, :]
plt.imshow(img, cmap="Greys")
plt.show() # the "3" has been flipped on the diagonal!

Figure 2: A transposed 3.

imgs_reindexed = imgs.transpose([1, 2, 0])


print(imgs_reindexed.shape) # (8, 8, 1797) - last dimension now indexes images
img = imgs_reindexed[:, :, 3]
print(img.shape) # (8, 8)
plt.imshow(img, cmap="Greys")
plt.show() # it's a normal "3"

Figure 3: A 3 again.

The .reshape function does… reshaping, as defined above. BEWARE! Naive reshaping can scramble your
data!
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# DON'T DO THIS!
imgs = load_digits()["images"]
print(imgs.shape) # (1797, 8, 8)
imgs = imgs.reshape([8, 8, 1797]) # easy-easy, no need for transpose
img = imgs[:, :, 3]
print(img.shape) # (8, 8), yup, perfect, who needs transpose
plt.imshow(img, cmap="Greys")
plt.show() # WTF

Figure 4: Modern art.

Let’s reshape properly to an array that is (n_samples, n_features). In this case, we have 1797 samples, and
8*8 == 64 features (each pixel is a feature):
imgs = load_digits()["images"]
print(imgs.shape) # (1797, 8, 8)
n_samples = imgs.shape[0]
n_features = np.prod(imgs.shape[1:]) # Don't hard-code magic numbers!
imgs = imgs.reshape([n_samples, n_features])
img_flat = imgs[3, :]
print(img_flat.shape) # (64,) - Good!
plt.imshow(img_flat, cmap="Greys") # TypeError: Invalid shape (64,) for image data

Here, although we reshaped correctly, we see plt.imshow wants a different size. To plot img_flat, we need
another tool.

Adding / Removing Dimensions


Because most images these days are colour, it is common for most image data to be 3- or 4-channel (i.e. 2-
dimensional images are represented by 3D arrays). To make programming easier, it is thus often the default
to assume that a flat image should always be represented as having three dimensions: two relating to pixels,
and one relating to the channels. Only black-and-white images are sometimes encoded as 1-channel, truly
2D images, and only for some programming libraries. Likewise, most ML algorithms expect 2D input in the
form (n_samples, n_features). But of course, it is extremely common that you have just one feature.
This often results in tiny annoyances. You have data that is shape (n_samples,), but the function you are
using wants this with the shape (n_samples, 1). Here, the magic tool is np.expand_dims(). If we use this,
we can now plot our flattened image:
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_style("ticks")

imgs = load_digits()["images"]
print(imgs.shape)
n_samples = imgs.shape[0]
n_features = np.prod(imgs.shape[1:])
imgs = imgs.reshape([n_samples, n_features])
img_flat = imgs[3, :]
print(img_flat.shape) # (64,)
img_expanded = np.expand_dims(img_flat, 1)
plt.imshow(img_expanded, cmap="Greys")
plt.show()

Figure 5: A flattened 3

You can delete annoying size-1 dimensions with np.squeeze(), or, if we know we want the 1D representation,
then np.ravel() is often the fastest way to do this.

The Evil of Size-1 Dimensions When you start with array programming, these phoney size-1 dimensions
might be the cause of a lot of errors and conceptual confusion early on. If you are mathematically minded,
you will also realize at some level that these arrays are wonky as mathematical and programming objects. For example:
np.random.seed(42)
A = np.random.randint(0, 10, [10]) # 10 random digits
E = np.expand_dims(A, 1)
print(A.shape) # (10,)
print(E.shape) # (10, 1)

# "Empty" dimensions have to have size 1, for some reason


np.reshape(A, [A.shape[0], 0]) # ValueError: cannot reshape array of size 10 into shape (10,0)
np.reshape(A, [A.shape[0], 1]) # ok, works
np.reshape(A, [A.shape[0], 2]) # ValueError: cannot reshape array of size 10 into shape (10,2)

# comparison behaves sanely


np.alltrue(A == A.T) # True
np.alltrue(A.T == A) # True
np.alltrue(A == A.ravel()) # True
np.alltrue(A.ravel() == A) # True

np.alltrue(E == E.T) # False


np.alltrue(E.T == E) # False
np.alltrue(E == E.ravel()) # False
np.alltrue(E.ravel() == E) # False
np.alltrue(A == E) # False

np.alltrue(A.ravel() == E.ravel()) # True - arrays still only have identical data

# Some slicing makes sense...


print(A[0]) # 6
print(A[0, :]) # IndexError: too many indices for array (GOOD)
print(A[0, 0]) # IndexError: too many indices for array (GOOD)
print(A[0, -1]) # IndexError: too many indices for array (GOOD)
print(E[0, :]) # [6] - note square brackets: returned slice is itself an ndarray
print(E[0]) # [6] - actually just a shorthand for above

# and I guess this makes sense:


print(E[0, 0]) # 6 - return is an int, not an ndarray
print(E[0, -1]) # 6 - same

# but wait a minute, adding a dimension doesn't add data. What if...

EEEEEE = np.copy(A)
for i in range(31): # biggest allowed dimension is 32
EEEEEE = np.expand_dims(EEEEEE, i+1)

print(EEEEEE.shape) # (10, 1, 1, ..., 1)


print(EEEEEE[0, :]) # [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[6]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
print(EEEEEE[0, 0, :]) # [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[6]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
print(EEEEEE[0, 0, -1]) # [[[[[[[[[[[[[[[[[[[[[[[[[[[[[6]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
# and etc.

The point is that size-1 dimensions are fundamentally empty, and are only there for programming reasons
to allow functions to expect common shapes. The problem is sometimes functions silently throw away size-1
dimensions when you need them, or need size-1 dimensions when you thought a raveled array would be fine.
In fact, a lot of NumPy functions include a keepdims argument for this purpose.
So just be aware that size-1 dimensions can cause all sorts of subtle bugs. I recommend having
them around for only exactly as long as you need them, and otherwise keeping arrays in raveled or squeezed
form (e.g. no size-1 dimensions). If you get strange, shape-related errors, read the docs, look at the expected
input shapes and what shapes are returned, and throw in np.expand_dims and .squeeze or .ravel calls as
needed to resolve the problem.

Resizing and Reshaping


A common problem in machine learning is having data of different sizes. While some more advanced
techniques (fully convolutional CNNs, seq2seq RNNs, transformers) can handle different sizes naturally,
most basic techniques cannot. In these cases, you need to modify your data. Below are the most common
ways to deal with this problem.
Note: You have to be careful when you do this, since you could inadvertently contaminate your data. For
example, imagine you are implementing classification between two groups, but in one group, each sample is
smaller (e.g. shorter timeseries, smaller image). If you resize all samples to the same size, it should be clear
that one group’s resized samples will be more distorted / blurry than the other’s. That is, you’ve accidentally
introduced a variable that easily allows perfect identification of each group, but which has nothing to do
with the actual classification task. There could be a similar issue with padding or cropping. In general,
manipulating your data must be done carefully (although most online tutorials treat this as if it is some
simple fire-and-forget process that doesn't much matter).

Interpolation / Resampling and Downsampling / Upsampling


Interpolation and resampling are very general concepts, and technically include downsampling and upsampling.
However, in practice, to interpolate usually implies filling in missing data, so that an interpolated
sample has more data points than the original sample. In general, the idea is to fit some function to your
data, and then evaluate that function at more points for interpolation, or fewer points for downsampling.
The smoother and more local the function (e.g. a spline), generally the better (but slower) the resampling.
Simpler functions (linear, nearest-neighbor) are faster, but may lose more information.
There are some extremely general interpolation functions and libraries available in Python:
• scipy.interpolate: 1D and 2D interpolation functions
• scipy.ndimage.zoom: interpolate arbitrary arrays
• torch.nn.functional.interpolate (Torch general-purpose array interpolation)
• tf.image.resize: Tensorflow image resizer
• tfg.math.interpolation: Tensorflow Graphics interpolation (advanced)
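As a rough sketch using scipy.ndimage.zoom from the list above (the tiny array is a stand-in for a real image):

import numpy as np
from scipy.ndimage import zoom

a = np.arange(16, dtype=float).reshape(4, 4)  # a small 4x4 "image"
upsampled = zoom(a, 2)      # spline interpolation up to twice the size
downsampled = zoom(a, 0.5)  # downsample to half the size
print(upsampled.shape)      # (8, 8)
print(downsampled.shape)    # (2, 2)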

Cropping and Padding


Often, these are the easiest and most desirable ways to resolve shape discrepancies. Large samples can
be cropped (data points removed), and/or small samples can be padded (values are added at sample
ends/boundaries).
Cropping can be implemented trivially by array slicing, so there aren’t NumPy functions for this. But if you
want to crop more symmetrically, like to the center of some array, or include random crops, then there are
tools available in PyTorch and Tensorflow.
Padding is actually quite a bit more complicated, as the choice of padding value turns out to matter more
than some have hoped. There are multiple ways to pad beyond just zero-padding (the default in most
applications). Sometimes, the padding choice is so important that naive padding can invalidate a model, such
as when failing to use causal padding for timeseries data.
The most general padding function in Python is np.pad. Then, there are options for PyTorch and Tensorflow.
Until we get to deep learning, I would just stick with NumPy. Although there are problems with naive zero-
padding, you can still just use this if you want.[6]
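A minimal sketch of zero-padding a short 1D sample out to a common length with np.pad (the target length of 10 is arbitrary):

import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])  # a sample that is too short
target_len = 10
pad_total = target_len - len(x)
x_padded = np.pad(x, (0, pad_total), mode="constant")  # pad on the right with zeros (the default constant)
print(x_padded)        # [3. 1. 4. 1. 5. 0. 0. 0. 0. 0.]
print(x_padded.shape)  # (10,)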

Patches / Slices / Windows


Sometimes, data can be too large to handle computationally in its entirety. And sometimes, there is a huge
amount of variability in the sizes of individual samples. And sometimes, you also have strong reasons to
believe that patterns arise only locally (e.g. close pixels in an image, close timepoints in a timeseries, close
spacetime locations for spatial data).
In this situation, training on entire instances might miss the key patterns, and be computationally expensive.
One option to convert a bunch of samples into the same sizes is via patches or windows of the same size. You
can either let your patches overlap, partly overlap, or not overlap at all. With NumPy array slicing, these
are all fairly straightforward to implement on your own.
However, since this is a common technique, there are tools for this. In Tensorflow, this can be done with
tf.image.extract_patches for images, or with a TimeseriesGenerator. It would require manual code with
PyTorch, or learning some TorchIO. You can also do it very cleverly with raw NumPy, but it takes some
work.[7]
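If you do want a NumPy-only approach, here is a hedged sketch using np.lib.stride_tricks.sliding_window_view (available in NumPy 1.20+) on a toy 1D series:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10)                    # a toy 1D timeseries
windows = sliding_window_view(series, 4)  # all overlapping windows of length 4
print(windows.shape)                      # (7, 4)

non_overlapping = windows[::4]            # keep every 4th window for non-overlapping patches
print(non_overlapping.shape)              # (2, 4)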

[6] Potential bonus marks could perhaps be awarded for comparing the effects of different padding choices on algorithm
performance, provided this doesn't contaminate data, as it can with timeseries data.
[7] Note this is an example of a towardsdatascience post that is good.

The General Data Analysis Procedure
Data science is more of an art than a science, so there isn’t really anything remotely like a general data
analysis procedure. However, most Python libraries (and in fact other statistical libraries like in R) write
their functions to use similar abstractions. That is, for most data analysis, you:
1. Acquire data.
2. Inspect data (visualization, summary stats)
3. Clean data (reshape, resize, deal with missing values, outliers)
4. Split your data into predictors and, if doing supervised learning, targets, usually called (respectively)
• X, y
5. Define training and test sets, usually called
• X_tr, X_test, y_tr, y_test
6. Instantiate or build some model
• model = Model()
7. Fit / train that model to / on the training data
• model = model.fit(X_tr, y_tr)
8. Use the fitted model to get predictions on the test set
• y_pred = model.predict(X_test)
9. Evaluate the predictions by comparing them to the true test targets
• compare(y_pred, y_test)
In the simplest cases, that will really be mostly the extent of the code. In extremely simple cases, you might
even have X_tr == X_test == X and y_tr == y_test == y. But in practice, building and training the
model can be complicated and time consuming, and evaluation can also be quite computationally expensive
(as it may require many repetitive fitting and prediction steps).
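As a hedged sketch of steps 4-9 using scikit-learn (the digits dataset and the KNN classifier here are used purely as examples):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits["data"], digits["target"]  # (n_samples, n_features) predictors and (n_samples,) targets

X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_tr, y_tr)                  # fit / train on the training data only
y_pred = model.predict(X_test)         # predictions on the held-out test set
print(accuracy_score(y_test, y_pred))  # evaluate predictions against the true test targets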
For the first problem set, the above procedure is mostly all you have to do. If you’ve managed to learn
all the Python, NumPy, pandas, and other concepts, this part will probably seem pretty easy (at least to
implement).
Nevertheless, there are a few tips you might find helpful.

Computing the AUC


The first assignment will ask you to compute the area under the ROC (receiver operating characteristic)
curve (AUC) as a way to quantify the separation of your two groups. If you read up on the AUC and ROCs,
you might find this confusing, as the AUC is usually only defined with respect to the performance of some
classifier. That is, you don’t normally talk about the AUC of raw, unclassified groups.
However, given some overlap in the two groups, even the best classifier will not be able to perfectly separate
the samples. That is, if you are given two samples that fall in the overlapping region, which way you classify
those two samples depends entirely on where you set your separation threshold / cutpoint. So when there is
overlap, even a perfect classifier will result in tradeoffs between false and true positives and negatives.
Thus, whenever you have two subgroups, you can calculate the AUC that corresponds to a perfect classifier,
and this tells you something about how fundamentally separable the two groups are.
In Python, there are two ways to calculate this AUC. You can use the mathematical relationship to the
Mann-Whitney U statistic along with scipy.stats.mannwhitneyu to get the AUC. In that case, you will have:
from scipy.stats import mannwhitneyu

# assume x, y are the observed (raveled) vectors of values for each group
auc = mannwhitneyu(x, y).statistic / (len(x) * len(y))

The order of x and y above matters, in that it changes which side of 0.5 the resulting AUC value will fall on.
Alternatively, you can use sklearn.metrics.roc_auc_score. As this is set up under the assumption that you have
some scores / predictions, you will need to have your data formatted differently, and will need to generate a
vector of perfect predictions. You can do this like so:

from sklearn.metrics import roc_auc_score

# assume x, y are the observed (raveled) vectors of values for each group
data = np.concatenate([x, y])
labels = np.array([0 for _ in x] + [1 for _ in y])
auc = roc_auc_score(y_true=labels, y_score=data)

Switching the order of the 0 and 1 above will also change which side of 0.5 the resulting AUC value will fall on.
Both methods above give identical values.

Speeding Up scikit-learn Models


The sklearn.neighbors.KNeighborsClassifier can be slow for large data, or on old machines. If this is a
problem, you should first try using the n_jobs=-1 parameter when instantiating the model. Depending on
your number of cores, this could result in an almost 10 times speedup. Many sklearn models have this option
available.
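E.g. (a sketch; the number of neighbours is up to you):

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)  # n_jobs=-1 uses all available CPU cores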

Speeding Up Prototyping with Subsets


When you first try to get some code working, often analysing your full dataset can take a long time. There is
usually no need to do a full analysis each time, so instead, to speed up development, you can either generate
random data, or just use small random subsets of your data.
To select a subset, you could choose a random subset with np.random.choice. But if you have imbalanced
data, this might occasionally choose a subset that has only data from one class, or from one narrow region of
your data. This can make some models impossible to fit. If you want to ensure that your subset has enough
data from all classes to train, you could use StratifiedShuffleSplit.
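A hedged sketch of grabbing a small, class-balanced subset this way (the 10% subset size is arbitrary, and the digits data is just an example):

from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedShuffleSplit

digits = load_digits()
X, y = digits["data"], digits["target"]

sss = StratifiedShuffleSplit(n_splits=1, train_size=0.1, random_state=0)
subset_idx, _ = next(sss.split(X, y))  # an index array preserving the class proportions
X_small, y_small = X[subset_idx], y[subset_idx]
print(X_small.shape)  # (179, 64) - about 10% of the 1797 samples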

Avoid Silly Mistakes by Regularly Visualizing and Inspecting Shapes


When you are first learning arrays, it helps to regularly print or otherwise inspect the shapes of your arrays
before and after manipulations. When you get results, if it is possible to display them as an image, or
otherwise plot them, you should generally do this. This might help you find subtle errors, like the example
earlier where bad reshaping scrambled the image data.
In addition, printing slices of your data objects can also help you find strange errors. DataFrames print
very cleanly, as do NumPy arrays. You can usually easily print just portions of your data. In pandas,
you can do this with DataFrame.head(), and in NumPy with slicing, e.g. arr[:10, :] will give you the
first 10 rows. You can also summarize DataFrames quickly with DataFrame.describe() or NumPy arrays
with DataFrame(arr).describe() (e.g. use an intermediate pandas DataFrame just to get the nice summary
functions).
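E.g. a quick sketch of that DataFrame trick (arr stands in for whatever array you are working with):

import numpy as np
from pandas import DataFrame

arr = np.random.default_rng(0).normal(size=(100, 3))  # stand-in data

print(arr.shape)                  # (100, 3) - check shapes early and often
print(arr[:10, :])                # peek at the first 10 rows
print(DataFrame(arr).describe())  # quick summary stats via a throwaway DataFrame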

Wrapping Up
This tutorial should provide you with all the tools and resources needed for the first assignment. However,
remember, if you run into problems, I am available by e-mail or on Slack.
Good luck!

