Python Refresher Notes
to Machine
Learning
Python Refresher
Agenda
• Language Introduction, data types
– Numerical types
– Boolean
– Strings
– Lists (Indexing and Slicing)
– Assignment
– Mutable vs Immutable
– Tuple
– Dictionaries
– Sets
Agenda
• Language Introduction, control statements
– If statements
– While Loops
– For Loops
– List Comprehension
– Looping
Agenda
• Language Introduction, organising code
– Functions
– Calling functions
– Generators
– Modules
• Handling files
– Reading from files
– Writing to files
Agenda
• Data Analytics Ecosystem
• Numpy
– Matplotlib basics
– Introducing Numpy Arrays
– Multi-Dimensional Arrays
– Slicing/Indexing Arrays
– Creating Arrays
– Array Creation Functions
– Array Calculation Methods
– The array data structure
– Advanced Numpy overview
Agenda
• Pandas
– Reading Data
– Series: One Dimensional Data Structure
– DataFrame: Two Dimensional Data Structure
– Visualisation
– Dealing With Missing Data
– Dealing With Dates and Times
– Computations and Statistics
– Group-Based Operations: Split-Apply-Combine
Why Python?
Python is an interpreted programming language
that is:
• easy to learn,
• easy to use,
• and comprehensive in terms of Data Science
tools
Why Python?
The tools can do the following among other
things:
• Data wrangling,
• Statistical analysis,
• machine learning,
• Natural Language Processing (NLP),
Why Python?
Python is also a fully fledged programming
language with programming styles:
• from exploratory analysis
• to repeatable science
• to software engineering for production
deployment.
Why Python?
• Python Highlights
– Compiles to byte codes and interprets them
– Automatic garbage collection
– Dynamic typing
– Object-oriented (everything is an object)
– Free and open
– Portable / Cross-platform
– Easy to learn and use
– Comes with a standard library e.g. webserver, reading
files
Why Python for Data Science?
• High-level language, allows rapid prototyping
to explore multiple approaches
• Libraries and tools exist to support you during
all phases of your workflow
• Easy, Matlab-like visualisation tools
• Active, growing scientific community
• It is a real programming language
– General purpose language
Who is using Python for Data Science?
• Wall Street
– Some of the largest investment banks and hedge
funds rely on Python for their core trading, risk
management, and fraud detection systems.
• Travel Industry
– Travel companies use Python for data mining and
predictive analytics
• Travel pricing insights
• Recommendation systems
• Predicting travel delays
Petroleum Industry
• Geophysics and exploration
– ConocoPhillips
– Shell
• AstraZeneca
– AstraZeneca consolidated some of their disparate
drug discovery tools into a suite called PyDrone
with great success
Social Media
• Many people and companies use Python to
analyse social media data from Google,
LinkedIn, Facebook, Twitter, etc. (for
analysis, customer segmentation, prediction).
• Many More
– Gov: National Labs, SEC, …
– PayPal
– Uber
– …
Language Introduction
Data Types
Outline
• Data types:
– Numerical types: int, float, complex numbers
– Booleans
– Strings
– Lists and tuples
– Dictionaries and sets
– Things to know about efficiency
Interactive Calculator
#adding two numbers
>>> 2 + 3
5
#setting a variable
>>> a = 2
>>> a
2
>>> type(a)
int
Interactive Calculator
# an arbitrarily large integer
>>> a = 12345678901234567890
# remove 'a' from the 'namespace'
>>> del a
>>> a
NameError: name 'a' is not defined
#integer literals in other bases
>>> 0xFF, 0o77, 0b11
(255, 63, 3)
Interactive Calculator
#real numbers
>>> b = 1.4 + 2.3
>>> b
3.6999999999999997
>>> type(b)
float
#complex numbers
>>> c = 2 + 1.5j
>>> c
(2+1.5j)
Interactive Calculation
#arithmetic operations
>>> 1+2-(3*4//6)**5+(7%5)
-27
#simple math functions
>>> abs(-3)
3
>>> max(0, min(10, 0, -1, 3))
0
>>> round(2.718281828, 0)
3.0
Interactive Calculation
#Overwriting function(!)
#don’t do this
>>> max = 100
#some time later …
>>> x = max(4, 5)
TypeError: 'int' object is not callable
Built-in functions are just like variables which
can be overwritten
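A minimal sketch of how to recover after shadowing a built-in, assuming only the standard `builtins` module: the original function is still reachable there, so the name can be rebound to it.

```python
import builtins

max = 100                        # shadows the built-in: max is now an int
shadowed = not callable(max)     # True, calling max(4, 5) would raise TypeError

# The original function still lives in the builtins module.
result = builtins.max(4, 5)

# Rebind the name to the built-in to undo the shadowing.
max = builtins.max
restored = max(4, 5)
```

In an interactive session, `del max` at the top level has the same effect, because name lookup then falls back to the built-in again.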
Interactive Calculation
Type conversion
>>> int(2.718281828)
2
>>> float(2)
2.0
>>> 1 + 2.0
3.0
Interactive Calculation
In-place operation
>>> b = 2.5
>>> b += 0.5 # b = b + 0.5
>>> b
3.0
# Also -=, *=, /=, etc.
Give it a try!
In a jupyter notebook, evaluate
(-b + √(b**2 - 4*a*c)) / (2*a)
for a = -2.0, b = 3.0, c = 5.0
Logical Expressions
Comparison operations
# <, >, <=, >=, !=
>>> 1 >= 2
False
>>> 2**3 != 3**2
True
Logical Expressions
# Chained comparisons
>>> 1 < 10 < 100
True
# bool DATA TYPE
>>> q = 1 > 0
>>> q
True
>>> type(q)
bool
Logical Expressions
# and OPERATOR
>>> 1 > 0 and 5 == 5
True
# If first operand is false,
# the second is not evaluated
>>> 1 < 0 and max(0,1,2) > 1
False
Logical Expressions
# or OPERATOR
>>> a = 50
>>> a < 10 or a > 90
False
# If first operand is true,
# the second is not evaluated
>>> a = 0
>>> a < 10 or a > 90
True
# not OPERATOR
>>> not 10 <= a <= 90
True
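Short-circuit evaluation can be observed directly. The sketch below uses a hypothetical helper, `check`, that records which operands were actually evaluated.

```python
calls = []

def check(label, value):
    """Record that this operand was evaluated, then return its value."""
    calls.append(label)
    return value

# 'and' stops at the first false operand, so 'right' is never evaluated.
and_result = check('left', False) and check('right', True)
calls_after_and = list(calls)

calls.clear()

# 'or' stops at the first true operand, so 'right' is never evaluated.
or_result = check('left', True) or check('right', False)
calls_after_or = list(calls)
```

In both cases only the left operand is recorded, which is why the right operand can safely contain an expensive call.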
Strings
# Creating Strings
# using double quotes
>>> s = "hello world"
>>> print(s)
hello world
# single quotes also work
>>> s = 'hello world'
>>> print(s)
hello world
Strings
# String Operations
# concatenating two strings
>>> "hello " + "world"
'hello world'
# repeating a string
>>> "hello " * 3
'hello hello hello '
# String Length
>>> s = "12345"
>>> len(s)
5
Strings
# Split/Join Strings
# split space-delimited words
>>> s = "hello world"
>>> wrd_lst = s.split()
>>> print(wrd_lst)
['hello', 'world']
# join words back together
# with a space in between
>>> space = ' '
>>> space.join(wrd_lst)
'hello world'
Multi-line Strings
# Triple Quotes
# Strings in triple quotes retain line breaks
>>> a = """hello
world"""
>>> print(a)
hello
world
Multi-line Strings
# New line character
# Including a newline character
>>> a = "hello\nworld"
>>> print(a)
hello
world
A Few String Methods and Functions
REPLACEMENT
>>> a = "hello world"
>>> a.replace('world', 'Mars')
'hello Mars'
CONVERT UPPERCASE
>>> a.upper()
'HELLO WORLD'
REMOVE WHITESPACE
>>> s = "\t hello world \n"
>>> s.strip()
'hello world'
A Few String Methods and Functions
NUMBERS TO STRINGS
>>> repr(1.1 + 2.2)
'3.3000000000000003'
>>> str(1.1)
'1.1'
STRINGS TO NUMBERS
>>> int('23')
23
>>> int('FF', 16)
255
>>> float('23')
23.0
String Formatting
The format() method replaces any replacement fields in
the string with the values given as arguments.
Replacement fields format: {<name>:<format_spec>}
# If 'name' is an integer, it refers to the argument
# position
>>> '{0} is greater than {1}'.format(100, 50)
'100 is greater than 50'
# If 'name' is text, it refers to a keyword argument.
>>> '{last}, {first}'.format(first='Ellen', last='Ripley')
'Ripley, Ellen'
String Formatting Format Spec
The optional format specification is used to control
how the values are displayed.
# Fixed point format (and a named keyword argument).
>>> print('[{x:5.0f}] [{x:5.2f}] [{x:5.2f}]'.format(x=12.3456))
[   12] [12.35] [12.35]
# Alignment (and using a numbered positional argument).
>>> template = '[{0:<10s}] [{0:>10s}] [{0:*>10s}] [{0:*^10s}]'
>>> print(template.format('PYTHON'))
[PYTHON    ] [    PYTHON] [****PYTHON] [**PYTHON**]
List Objects
LIST CREATION WITH BRACKETS
>>> a = [10, 11, 12, 13, 14]
>>> print(a)
CONCATENATING LISTS
# simply use the + operator
>>> [10, 11] + [12, 13]
REPEATING ELEMENTS IN LISTS
#the multiply operator does the trick
>>> [10, 11] * 3
List Objects
range(start, stop, step)
#the range function is helpful for creating a sequence
>>> list(range(5))
Output: [0, 1, 2, 3, 4]
>>> list(range(2, 7))
Output: [2, 3, 4, 5, 6]
>>> list(range(2,7,2))
Output: [2, 4, 6]
Indexing
RETRIEVING AN ELEMENT
# list
# indices: 0 1 2 3 4
>>> a = [10 , 11, 12, 13, 14]
>>> a[0]
SETTING AN ELEMENT
>>> a[1] = 21
>>> print(a)
OUT OF BOUNDS
>>> a[10]
Traceback (innermost last):
File "<interactive input>", line 1, in ?
IndexError: list index out of range
Indexing
NEGATIVE INDICES
# negative indices count
#back from the end of the list
# indices: -5 -4 -3 -2 -1
>>> a = [ 10 , 11, 12, 13, 14]
>>> a[-1]
>>> a[-2]
-5 -4 -3 -2 -1
10 11 12 13 14
0 1 2 3 4
The first element in an array has index=0 as in C.
Take note Matlab and Fortran programmers!
More on List Objects
LIST CONTAINING MULTIPLE TYPES
# list containing integer, string, and another list
>>> a = [10, 'eleven', [12, 13]]
>>> a[0]
>>> a[2]
# use multiple indices to retrieve elements
# from nested lists
>>> a[2][0]
More on List Objects
LENGTH OF A LIST
>>> len(a)
DELETING OBJECT FROM LIST
#use the del keyword
>>> del a[2]
>>> a
DOES THE LIST CONTAIN x?
# use in or not in
>>> a = [10, 11, 12, 13, 14]
>>> 13 in a
>>> 13 not in a
Slicing
var[lower : upper : step]
Extracts a portion of a sequence by specifying a
lower and upper bound.
The lower bound element is included, but the
upper-bound element is not included.
Mathematically, [lower, upper).
The step value specifies the stride between
elements.
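The half-open convention and the step value can be checked with a few slices. This is a small illustrative sketch; the variable names are arbitrary.

```python
a = [10, 11, 12, 13, 14]

middle = a[1:4]          # indices 1, 2 and 3; index 4 is excluded
every_other = a[0:5:2]   # stride of 2 between selected elements
backwards = a[::-1]      # a negative step walks the list in reverse

# With step 1, the length of a slice is upper - lower.
middle_length = len(middle)

# Bounds that run past the end are clipped rather than raising an error.
clipped = a[3:100]
```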
Slicing
SLICING LISTS
# indices:
# -5 -4 -3 -2 -1
# 0 1 2 3 4
>>> a = [10, 11, 12, 13, 14]
>>> a[1:3]
# negative indices work also
>>> a[1: -2]
>>> a[-4:3]
Slicing
OMITTING INDICES
# omitted boundaries are assumed to be the
# beginning (or end) of the list
# grab first three elements
>>> a[:3]
# grab last two elements
>>> a[-2:]
# every other element
>>> a[::2]
Lists in Action
>>> a = [10, 21, 23, 11, 24]
# add an element to the list
>>> a.append(11)
>>> print(a)
# how many 11s are there?
>>> a.count(11)
# extend with another list
>>> a.extend([5, 4])
# where does 11 first occur?
>>> a.index(11)
# insert 100 at index 2
>>> a.insert(2, 100)
>>> print(a)
Lists in Action
# pop the item at index = 3
>>> a.pop(3)
# remove the first 11
>>> a.remove(11)
>>> print(a)
# sort the list (in-place). Note: use sorted(a) to
# return a new list
>>> a.sort()
>>> print(a)
# reverse the list
>>> a.reverse()
>>> print(a)
Give it a try!
In a jupyter notebook
Given the list:
b = [9.5, 9.25, 9.75, 9.50]
1. Add the value 9.00 to the end of the list
2. Find the maximum value (Hint: look at the
max built-in)
3. Find the index of the maximum value
4. Remove the maximum value from the list
Assignment of the “Simple” Object
>>> x = 0
>>> x = [0, 1, 2]
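Assignment binds a name to an object; it does not copy the object. The sketch below contrasts a mutable list with an immutable int to show what that means in practice.

```python
x = [0, 1, 2]
y = x                 # y is a second name for the same list object
y.append(3)           # mutating through y is visible through x
same_list = x is y

n = 0
m = n                 # both names refer to the same int object
m += 1                # ints are immutable: m is rebound to a new object

z = x.copy()          # a shallow copy is a separate list object
z.append(4)           # x is not affected
```

After this runs, `x` is `[0, 1, 2, 3]`, `n` is still `0`, and only `z` contains `4`.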
Control Statements
Outline
• If statements
• While loops
• For loops
– List comprehension
– Looping patterns
If Statements
if/elif/else provide conditional execution of code
blocks
IF STATEMENT FORMAT
if <condition>:
    <statement 1>
    <statement 2>
elif <condition>:
    <statements>
else:
    <statements>
If Statements
IF EXAMPLE
# a simple if statement
>>> x = 10
>>> if x > 0:
        print('Hey!')
        print('x > 0')
    elif x == 0:
        print('x is 0')
    else:
        print('x is negative')
While Loop
While loops iterate for as long as a condition holds
WHILE LOOPS
while <condition>:
    <statements>
While Loop
>>> tasks = ['A', 'B', 'C']
>>> while tasks:
        curr = tasks.pop()
        template = 'Doing {} ; ' \
                   'To do {}'
        print(template.format(curr, tasks))
Output:
Doing C ; To do ['A', 'B']
Doing B ; To do ['A']
Doing A ; To do []
While Loop
BREAKING OUT OF A LOOP
# Breaking from an infinite loop with "break"
>>> from builtins import input
>>> while True:
        cmd = input('-> ')
        if cmd == 'quit':
            break
        print('Executing {}'.format(cmd))
For Loops
For loops iterate over a collection of objects
for <loop_var> in <collection>:
    <statements>
TYPICAL SCENARIO
>>> for integer in range(5):
        print(integer)
# Use a mutable container like a list to collect results
>>> output = []
>>> for integer in range(5):
        output.append(integer)
>>> print(output)
For Loops
LOOPING OVER A STRING
>>> for char in 'abcde':
        print(char)
LOOPING OVER A LIST
>>> animals = ['dogs', 'cats', 'bears']
>>> accum = ''
>>> for animal in animals:
        accum += animal + ' '
>>> print(accum)
Give it a try!
In a jupyter notebook
Given the list:
values = [-4, 4, -1, -2, 10, 3]
Write a while loop that creates two lists from this one:
a list of positive values and a list of negative values.
Hint: you will need to start with two empty lists
positives = [] and negatives = []
LIST COMPREHENSION
LIST TRANSFORM WITH LOOP
# element by element transform of a list by applying an
# expression to each element
>>> a = [10, 21, 23, 11, 24]
>>> results = []
>>> for val in a:
        results.append(val + 1)
>>> results
LIST COMPREHENSION
LIST COMPREHENSION
# list comprehensions provide a concise syntax for this
# sort of element by element transformation
>>> a = [10, 21, 23, 11, 24]
>>> [val+1 for val in a]
LIST COMPREHENSION
FILTER-TRANSFORM WITH LOOP
# transform only elements that meet a criterion
>>> a = [10, 21, 23, 11, 24]
>>> results = []
>>> for val in a:
        if val > 15:
            results.append(val + 1)
>>> results
LIST COMPREHENSION
LIST COMPREHENSION WITH FILTER
>>> a = [10, 21, 23, 11, 24]
>>> [val+1 for val in a if val > 15]
Consider using a list comprehension whenever you
need to transform one sequence to another
Looping Patterns
MULTIPLE LOOP VARIABLES
# Looping through a sequence of tuples allows multiple
# variables to be assigned
>>> pairs = [(0, 'a'), (1, 'b'), (2, 'c')]
>>> for index, value in pairs:
        print('{} {}'.format(index, value))
Looping Patterns
ENUMERATE
# enumerate -> index, item
>>> y = ['a', 'b', 'c']
>>> for index, value in enumerate(y):
        print('{} {}'.format(index, value))
Looping Patterns
ZIP
# zip 2 or more sequences into an iterator of tuples
>>> x = [0, 1, 2]
>>> y = ['a', 'b', 'c']
>>> list(zip(x, y))
>>> for index, value in zip(x, y):
        print('{} {}'.format(index, value))
REVERSED
>>> z = [(0, 'a'), (1, 'b'), (2, 'c')]
>>> for index, value in reversed(z):
        print('{} {}'.format(index, value))
Looping Over a Dictionary
>>> d = {'a': 1, 'b': 2, 'c': 3}
DEFAULT LOOPING (KEYS)
>>> for key in d:
        print(key)
LOOPING OVER KEYS (EXPLICIT)
>>> for key in d.keys():
        print(key)
LOOPING OVER VALUES
>>> for val in d.values():
        print(val)
LOOPING OVER ITEMS
>>> for key, val in d.items():
        print(d[key] is val)
Give it a try!
In a jupyter notebook
Given the dictionary:
values = {'A': -4, 'B': 10, 'C': -5, 'D': 3}
Use a loop to build a dictionary containing only the
keys and values which are positive.
Language Introduction
Organising Code
Functions
Functions are reusable snippets of code
• Definition
• Positional and keyword arguments
Anatomy of a Function
• The keyword def indicates the start of a function.
• Function arguments are listed, separated by commas. They are
passed by assignment.
• Indentation is used to indicate the contents of the function.
It is not optional, but a part of the syntax.
• An optional docstring documents the function in a standard
way for tools like ipython.
• An optional return statement specifies the value returned
from the function. If return is omitted, the function returns
the special value None.
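The parts of a function described above can be seen together in a short sketch. The second function, `log_sum`, is a hypothetical helper added only to show what happens when `return` is omitted.

```python
def add(x, y):
    """Return the sum of x and y."""
    a = x + y
    return a


def log_sum(x, y):
    """Print the sum of x and y; there is no return statement."""
    print(x + y)


total = add(2, 3)           # the value given to return
doc = add.__doc__           # the docstring is stored on the function
nothing = log_sum(2, 3)     # prints 5, and the call evaluates to None
```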
Our New Function in Action
>>> def add(x, y):
        a = x + y
        return a
# Test it out with numbers
>>> val_1 = 2
>>> val_2 = 3
>>> add(val_1, val_2)
Our New Function in Action
# How about strings
>>> val_1 = 'foo'
>>> val_2 = 'bar'
>>> add(val_1, val_2)
Our New Function in Action
# Names can be assigned to functions
>>> func = add
>>> func(val_1, val_2)
# How about numbers and strings?
>>> add('abc', 1)
Traceback (innermost last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 2, in add
TypeError…
Give it a try!
In a jupyter notebook
Create a function called count_letter that takes
as input a string txt and a character char, and
returns the number of times char appears in txt,
ignoring case.
For example:
>>> count_letter("Php, Perl, or Python?", "p")
Function Calling Conventions
POSITIONAL ARGUMENTS
# The “standard” calling convention we know
>>> def add(x, y):
        return x + y
>>> add(2,3)
Function Calling Conventions
KEYWORD ARGUMENTS
# specify argument names
>>> add(x=2, y=3)
# or even a mixture if you are careful with order
>>> add(2, y=3)
Function Calling Conventions
DEFAULT VALUES
# Arguments can be assigned default values
>>> def quad(x, a=1, b=1, c=0):
        return a*x**2 + b*x + c
# use defaults for a, b and c
>>> quad(2.0)
# Set b=3. Defaults for a and c
>>> quad(2.0, b=3)
# Keyword arguments can be passed in out of
# order
>>> quad(2.0, c=1, a=3, b=2)
Modules and Packages
Modules and packages are Python’s “libraries”
i.e. a collection of constants, functions and
classes
Importing a Module
MODULES ARE .py FILES
Modules are just .py files
# my_tools.py
def greetings():
    return "Hello all"
>>> import my_tools
>>> my_tools.greetings()
Importing a Module
BASIC IMPORTS
# The most basic import
>>> import numpy
>>> numpy.pi
# use an alias
>>> import numpy as np
>>> np.pi
Importing a Module
IMPORTING SPECIFIC SYMBOLS
# Select specific names to bring into the local
# namespace
>>> from numpy import pi
>>> pi
Importing a Module
IMPORTING FROM A SUBMODULE
# Some modules have submodules with their
# own objects
>>> from numpy.linalg import LinAlgError
>>> raise LinAlgError(
        "You ate all the pi.")
LinAlgError: You ate all the pi.
Importing a Module
IMPORTING EVERYTHING
# Pulls every public numpy name into the local namespace,
# overwriting built-in names such as sum, any and all
>>> from numpy import *
Packages
PACKAGES
Often a library will contain several modules.
These are organised as a hierarchical directory
structure, and imported using “dotted module
names”.
The first and the intermediate names (if any) are
called “packages”
Packages
PACKAGES
Example
>>> from email.utils import parseaddr
>>> from email import utils
>>> utils.parseaddr(‘Nhamo Mtetwa <[email protected]>’)
Packages
PACKAGES ARE DIRECTORIES
email/
    __init__.py
    charset.py (defines add_codec)
    header.py (defines make_header)
    utils.py (defines parseaddr)
The file __init__.py indicates that email is a
package.
It is often an empty file
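To see the directory layout at work, the sketch below builds a throwaway package in a temporary directory and imports a module from it by its dotted name. The package name `demo_pkg` and module `tools` are hypothetical, chosen only for illustration.

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, 'demo_pkg')        # hypothetical package name
os.mkdir(pkg_dir)

# An empty __init__.py marks the directory as a package.
with open(os.path.join(pkg_dir, '__init__.py'), 'w') as f:
    f.write('')

# A module inside the package.
with open(os.path.join(pkg_dir, 'tools.py'), 'w') as f:
    f.write('def greetings():\n    return "Hello all"\n')

sys.path.insert(0, root)                        # make the parent directory importable
importlib.invalidate_caches()

tools = importlib.import_module('demo_pkg.tools')   # dotted module name
message = tools.greetings()
```

The call is equivalent to writing `from demo_pkg import tools` once the parent directory is on `sys.path`.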
Give it a try!
In a jupyter notebook
1. Import functions join and expanduser from
module os.path
join(expanduser('~'), 'myfile.txt')
2. Import module pandas with alias pd
3. Import division from module
__future__
Language Introduction
Reading Data
Reading Text Files
AS A LIST OF STRINGS
# Read file as list of strings
>>> from io import open
>>> with open('rcs.txt', encoding='ascii') as f:
        lines = f.readlines()
>>> lines
Reading Text Files
ONE LINE AT A TIME
# Read one line at a time
>>> with open('rcs.txt',
              encoding='utf-8') as f:
        header = f.readline()
        for line in f:
            print(line)
EXAMPLE FILE: RCS.TXT
#freq (MHz) vv (dB) hh (dB)
100 -20.3 -31.2
200 -22.7 -33.6
Reading Text Files
WRITING AND APPENDING
# Mode 'w': create new file (or overwrite an existing one)
>>> with open('a.txt', 'w', encoding='ascii') as f:
        f.write(u'Wow!')
# Mode 'a': append to file
>>> with open('a.txt', 'a', encoding='ascii') as f:
        f.write(u'Boo. \nYay!')
# Read the whole file
>>> with open('a.txt', 'r', encoding='ascii') as f:
        whole_file = f.read()
>>> whole_file
Reading Text Files
WRITE AND READ
>>> with open('a.txt', 'w+',
              encoding='ascii') as f:
        f.write(u'ab cd ef')
        f.seek(3)
        print(f.read(2))
Use binary mode, ‘rb’, or ‘wb’, to prevent
Python from corrupting binary data formats like
JPEG and PDF
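A minimal sketch of a binary round trip, assuming an arbitrary byte sequence rather than a real JPEG or PDF. The bytes deliberately include `\r\n`, which text mode may translate on some platforms.

```python
import os
import tempfile

# Arbitrary binary payload containing a carriage return and a newline.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
path = os.path.join(tempfile.mkdtemp(), 'blob.bin')   # hypothetical file name

# 'wb' writes bytes exactly as given: no encoding, no newline translation.
with open(path, 'wb') as f:
    f.write(payload)

# 'rb' reads them back as a bytes object.
with open(path, 'rb') as f:
    data = f.read()

round_trip_ok = data == payload
```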
Give it a try!
In a Jupyter notebook
Open a file course_data.txt for writing, and write
out the values from this list, one value to a line:
b = [9.50, 9.25, 9.75, 9.50]
Then open the file and read the values back in as a
list.
Open the file again to append another line with
value 9
NumPy
array
array([1, 2, 3])
Matplotlib Basics
Matplotlib
Matplotlib behaves like a state machine.
Any command is applied to the current plotting area
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> t = np.linspace(0, 2*np.pi, 50)
>>> x = np.sin(t)
>>> y = np.cos(t)
Matplotlib’s “State Machine”
# Now create a figure
>>> plt.figure()
# and plot x inside it
>>> plt.plot(x)
Matplotlib’s “State Machine”
# Now create a new figure
>>> plt.figure()
# and plot y inside it …
>>> plt.plot(y)
# and add a title
>>> plt.title("Cos")
Line Plots
>>> x = np.linspace(0,2*np.pi, 50)
>>> y1 = np.sin(x)
>>> y2 = np.sin(2*x)
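>>> plt.figure()      # create figure
>>> plt.plot(y1)      # plot y1 against its index
>>> plt.clf()         # clear the figure (plt.hold was removed in Matplotlib 3.0)
>>> plt.plot(x, y1)   # plot y1 against x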
Line Plots
# red line with circle markers
>>> plt.plot(x, y1, 'r-o')
# red circle markers only
>>> plt.plot(x, y1, 'ro')
>>> plt.plot(x, y1, 'g-o', x, y2, 'b-+')
>>> plt.legend(['sin(x)', 'sin(2x)'])
Symbol  Colour     Symbol  Marker
b       Blue       .       Point
g       Green      o       Circle
r       Red        <>^v    Triangle
c       Cyan       8       Octagon
m       Magenta    s       Square
y       Yellow     *       Star
k       Black      +       Plus
w       White
Scatter Plots
>>> N = 50 # no. of points
>>> x = np.linspace(0, 10, N)
>>> from numpy.random import rand
>>> err = rand(N)*5.0 # noise
>>> y1 = x + err
>>> areas = rand(N)*300
Scatter Plots
>>> plt.scatter(x, y1, s=areas)
# plt.hold was removed in Matplotlib 3.0; clear the figure instead
>>> plt.clf()
>>> colors = rand(N)
>>> plt.scatter(x, y1, s=areas, c=colors)
>>> plt.colorbar()
>>> plt.title("Random scatter")
Image “Plots”
>>> #create some data
>>> e1 = rand(100)
>>> e2 = rand(100)*2
>>> e3 = rand(100)*10
>>> e4 = rand(100)*100
Image “Plots”
>>> corrmatrix = np.corrcoef([e1, e2, e3, e4])
# Plot corr matrix as image
>>> plt.imshow(corrmatrix, interpolation='none', cmap='GnBu')
>>> plt.colorbar()
Multiple Plots Using subplot
>>> t = np.linspace(0, 2*np.pi)
>>> x = np.sin(t)
>>> y = np.cos(t)
Multiple Plots Using subplot
# To divide the plotting area
>>> plt.subplot(2, 1, 1)
>>> plt.plot(x)
# Now activate a new plot area
>>> plt.subplot(2, 1, 2)
>>> plt.plot(y)
Histogram Plots
# Create array of data
>>> from numpy.random import randint
>>> data = randint(10000, size=(10,1000))
# Approx. normal distribution
>>> x = np.sum(data, axis=0)
>>> plt.subplot(2, 1, 1)
>>> plt.hist(x, color='r')
# plot cumulative dist
>>> plt.subplot(2, 1, 2)
>>> plt.hist(x, cumulative=True)
# For multiple histograms use
# plt.hist([d1, d2, …])
Legend, Titles and Axis Labels
# Add labels in plot command
>>> plt.plot(np.sin(t), label='sin')
>>> plt.plot(np.cos(t), label='cos')
>>> plt.legend()
Titles and Axis Labels
>>> plt.plot(t, np.sin(t))
>>> plt.xlabel('radians')
# Keywords set text properties
>>> plt.ylabel('amplitude', fontsize='large')
>>> plt.title('sin(x)')
Plotting from Scripts
# In IPython, plots show up
# as soon as a plot command is issued
>>> plt.figure()
>>> plt.plot(np.sin(t))
>>> plt.figure()
>>> plt.plot(np.cos(t))
Non Interactive Mode
# In a script, you must call the show() command
# to display plots. Call it at the end of all your
#plot commands for best performance
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> t = np.linspace(0, 2*np.pi, 50)
>>> plt.figure()
>>> plt.plot(np.sin(t))
>>> plt.figure()
>>> plt.plot(np.cos(t))
# plots will not appear until this command is run
>>> plt.show()
MPL Exercise: Desired Output
Introducing NumPy Arrays
# Simple array creation
>>> import numpy as np
>>> a = np.array([0, 1, 2, 3])
>>> a
array([0, 1, 2, 3])
# Checking the type
>>> type(a)
numpy.ndarray
Introducing NumPy Arrays
# Numeric ‘type’ of elements
>>> a.dtype
dtype('int32')
# Number of dimensions
>>> a.ndim
1
Introducing NumPy Arrays
Array shape
# Shape returns a tuple listing the length of the
# array along each dimension
>>> a.shape
(4, )
Introducing NumPy Arrays
Bytes per element
>>> a.itemsize
4
Bytes of memory used
# Return the number of bytes used by the data
# portion of the array
>>> a.nbytes
16
Array Operations
#Simple Array Math
>>> a = np.array([1, 2, 3, 4])
>>> b = np.array([2, 3, 4, 5])
>>> a + b
array([3, 5, 7, 9])
>>> a ** b
array([ 1, 8, 81, 1024])
Maths Functions
# create array from 0. to 10.
>>> x = np.arange(11.)
#multiply entire array by scalar value
>>> c = (2*np.pi)/10
>>> c
0.62831853071795862
>>> c*x
# in-place operations
>>> x *=c
>>> x
# apply functions to array
>>> y = np.sin(x)
Setting Array Elements
# Array indexing
>>> a[0]
0
>>> a[0] = 10
>>> a
array([10, 1, 2, 3])
Setting Array Elements
# assigning a float into an int32 array truncates
# the decimal part
>>> a[0] = 10.6
>>> a
array([10, 1, 2, 3])
# fill has the same behavior
>>> a.fill(-4.8)
>>> a
array([-4, -4, -4, -4])
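If the fractional part matters, one option is to convert the array to a floating point dtype before assigning. The sketch below assumes NumPy is installed and uses `astype`, which returns a converted copy.

```python
import numpy as np

a = np.array([0, 1, 2, 3])      # integer dtype inferred from the values
a[0] = 10.6                     # stored as 10: the fraction is dropped

b = a.astype(float)             # a float64 copy of the data
b[0] = 10.6                     # now the fraction is kept

a_first = int(a[0])
b_first = float(b[0])
```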
Multi-Dimensional Arrays
# Multi-dimensional arrays
>>> a = np.array([[ 0, 1, 2, 3],[10, 11, 12, 13]])
>>> a
# shape = (ROWS, COLUMNS)
>>> a.shape
(2, 4)
# element count
>>> a.size
8
#Number of dimensions
>>> a.ndim
2
Get/Set Elements
>>> a[1, 3]
13
>>> a[1, 3] = -1
>>> a
array([[ 0,  1,  2,  3],
       [10, 11, 12, -1]])
# Address second (oneth) row using single index
>>> a[1]
array([10, 11, 12, -1])
Slicing
var[lower:upper:step]
Extracts a portion of a sequence by specifying a lower and upper
bound.
The lower bound element is included, but the upper-bound
element is not included.
Mathematically: [lower,upper).
The step value specifies the stride between elements
Slicing
#Slicing arrays
# indices: 0 1 2 3 4
>>> a = np.array([10, 11, 12, 13, 14])
>>> a[1:3]
array([11, 12])
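One difference from list slicing is worth a sketch: a basic slice of a NumPy array is a view onto the same data, not a copy, so writing through the slice changes the original. This assumes NumPy is installed.

```python
import numpy as np

a = np.array([10, 11, 12, 13, 14])

s = a[1:3]              # a view: shares memory with a
s[0] = 99               # also changes a[1]

c = a[1:3].copy()       # an explicit copy is independent
c[0] = -1               # a is not affected

a_values = a.tolist()
```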
Omitting Indices
# omitted boundaries are assumed to be the
# beginning (or end) of the array
# grab first three elements
>>> a[:3]
array([10, 11, 12])
# grab last two elements
>>> a[-2:]
array([13, 14])
# every other element
>>> a[::2]
array([10, 12, 14])
Give it a try
Create the array below with the following command
a = np.arange(25).reshape(5,5)
and extract the slices indicated
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
Creating arrays
Array Constructor Examples
Floating point arrays
#Default to double precision
>>> a = np.array([0, 1.0, 2, 3])
>>> a.dtype
dtype('float64')
>>> a.nbytes
32
Array Constructor Examples
Reducing precision
>>> a = np.array([0, 1., 2, 3], dtype='float32')
>>> a.dtype
dtype('float32')
>>> a.nbytes
16
Array Constructor Examples
Unsigned Integer Byte
>>> a = np.array([0, 1, 2, 3], dtype='uint8')
>>> a.dtype
dtype('uint8')
>>> a.nbytes
4
Array Creation Functions
ARANGE
arange([start,] stop[, step], dtype=None)
Nearly identical to Python's range().
Creates an array of values in the range [start, stop) with
specified step value.
Allows non-integer values for start, stop, and step.
Default dtype is derived from the start, stop, and step values.
>>> np.arange(4)
array([0, 1, 2, 3])
>>> np.arange(0, 2*np.pi, np.pi/4)
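A short sketch of the half-open range, assuming NumPy is installed: `arange` stops before the upper bound, while `linspace` (covered later) includes it. The step 0.25 is chosen because it is exactly representable in binary floating point.

```python
import numpy as np

# arange: the stop value 1.0 is excluded.
stepped = np.arange(0, 1, 0.25)

# linspace: both end points are included; the count is given instead of the step.
spaced = np.linspace(0, 1, 5)

# The dtype follows the arguments: all ints give an integer array.
int_dtype_kind = np.arange(4).dtype.kind
float_dtype_kind = stepped.dtype.kind
```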
Ones, Zeros
ones(shape, dtype=float64)
zeros(shape, dtype=float64)
Shape is a number or sequence specifying the
dimensions of the array.
If dtype is not specified, it defaults to float64.
>>> np.ones((2, 3), dtype='float32')
array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float32)
>>> np.zeros(3)
array([0., 0., 0.])
Array Creation (cont.)
IDENTITY
# Generate an n by n identity array. The default
# dtype is float64
>>> a = np.identity(4)
>>> a
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
Array Creation (cont.)
IDENTITY
>>> a.dtype
dtype('float64')
>>> np.identity(4, dtype=int)
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])
Empty and Fill
# empty(shape, dtype=float64, order='C')
>>> a = np.empty(2)
>>> a
# fill array with 5.0
>>> a.fill(5.0)
>>> a
array([5., 5.])
# alternative approach (slightly slower)
>>> a[:] = 4.0
>>> a
array([4., 4.])
Array Creation Functions (cont.)
LINSPACE
# Generate N evenly spaced elements between
# (and including) start and stop values
>>> np.linspace(0, 1, 5)
array([0.  , 0.25, 0.5 , 0.75, 1.  ])
Array Creation Functions (cont.)
LOGSPACE
# Generate N evenly spaced elements on a log
# scale between base**start and base**stop
# (default base=10)
>>> np.logspace(0, 1, 5)
array([ 1.  ,  1.78,  3.16,  5.62, 10.  ])
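One way to read `logspace` is as `linspace` applied to the exponents. The sketch below, assuming NumPy is installed, checks that relationship for the default base of 10.

```python
import numpy as np

exponents = np.linspace(0, 1, 5)        # 0, 0.25, 0.5, 0.75, 1
by_hand = 10.0 ** exponents             # base ** exponent, element by element
built_in = np.logspace(0, 1, 5)

same = bool(np.allclose(by_hand, built_in))

# The ratio between neighbouring elements is constant on a log scale.
ratios = built_in[1:] / built_in[:-1]
constant_ratio = bool(np.allclose(ratios, ratios[0]))
```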
Array Creation Functions (cont.)
ARRAYS FROM/TO TXT FILES
Data.txt
-- BEGINNING OF THE FILE
% Day, Month, Year, Skip, Avg Power
01, 01, 2000, x876, 13 % crazy day!
% we don’t have Jan 03rd
04, 01, 2000, xfed, 55
Array Creation Functions (cont.)
# loadtxt() automatically generates an array from
# the txt file
arr = np.loadtxt('Data.txt', skiprows=1, dtype=int,
    delimiter=",", usecols=(0, 1, 2, 4), comments="%")
# Save an array into a txt file
np.savetxt('filename', arr)
Array calculation methods
Computations with arrays
Rule 1: Operations between multiple array
objects are first checked for proper shape
match.
Rule 2: Mathematical operators (+, -, *, /, exp,
log, ...) apply element by element, on
the values.
Rule 3: Reduction operations (mean, std, skew,
kurt, sum, prod, …) apply to the whole
array, unless an axis is specified
Rule 4: Missing values propagate unless
explicitly ignored (nanmean, nansum, …)
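The four rules can be sketched in a few lines, assuming NumPy is installed. The arrays are illustrative only; the shape check in Rule 1 also allows compatible shapes such as (2, 3) with (3,), which NumPy calls broadcasting.

```python
import numpy as np

a = np.array([[1., 2., 3.], [4., 5., 6.]])
b = np.array([10., 20., 30.])

# Rule 1: shapes are checked first; (2, 3) with (2,) does not match.
try:
    a + np.array([1., 2.])
    shape_error = False
except ValueError:
    shape_error = True

# Rule 2: operators work element by element.
elementwise = a * b

# Rule 3: reductions use the whole array unless an axis is given.
total = a.sum()
per_column = a.sum(axis=0)

# Rule 4: NaN propagates unless a nan-aware function is used.
c = np.array([1., np.nan, 3.])
propagated = bool(np.isnan(c.mean()))
ignored = float(np.nanmean(c))
```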
Array Calculation Methods
SUM Function
>>> a = np.array([[1, 2, 3], [4, 5, 6]])
# np.sum() defaults to adding up all the values in an
# array
>>> np.sum(a)
21
Array Calculation Methods
SUM Function
# supply the keyword axis to sum along the 0th
# axis
>>> np.sum(a, axis=0)
array([5, 7, 9])
# supply the keyword axis to sum along the last
# axis
>>> np.sum(a, axis=-1)
array([ 6, 15])
Array Calculation Methods
SUM ARRAY METHOD
# a.sum() defaults to adding up all values in the
# array
>>> a.sum()
21
# supply an axis argument to sum along a
# specific axis
>>> a.sum(axis=0)
array([5, 7, 9])
Array Calculation Methods
PRODUCT
# product along columns
>>> a.prod(axis=0)
array([ 4, 10, 18])
# as a function
>>> np.prod(a, axis=0)
array([ 4, 10, 18])
Min/Max
MIN
>>> a = np.array([2., 3., 0., 1.])
>>> a.min(axis=0)
0.0
# Use NumPy's min() instead of Python's
# built-in min() for speedy operations on
# multi-dimensional arrays
>>> np.min(a, axis=0)
0.0
Min/Max
ARGMIN
#Find index of minimum value.
>>> a.argmin(axis=0)
2
# as a function
>>> np.argmin(a, axis=0)
2
Min/Max
MAX
>>> a = np.array([2., 3., 0., 1.])
>>> a.max(axis=0)
3.0
# as a function
>>> np.max(a, axis=0)
3.0
Min/Max
ARGMAX
# Find index of maximum value
>>> a.argmax(axis=0)
1
# as a function
>>> np.argmax(a, axis=0)
1
Statistics Array Methods
MEAN
>>> a = np.array([[1, 2, 3], [4, 5, 6]])
# mean values of each column
>>> a.mean(axis=0)
array([2.5, 3.5, 4.5])
>>> np.mean(a, axis=0)
array([2.5, 3.5, 4.5])
>>> np.average(a, axis=0)
array([2.5, 3.5, 4.5])
# average can also calculate a weighted average
>>> np.average(a, weights=[1, 2], axis=0)
array([3., 4., 5.])
Statistics Array Methods
STANDARD DEV/VARIANCE
# Standard Deviation
>>> a.std(axis=0)
array([1.5, 1.5, 1.5])
# variance
>>> a.var(axis=0)
array([2.25, 2.25, 2.25])
>>> np.var(a, axis=0)
array([2.25, 2.25, 2.25])
Give it a try
Create the array below with the following command
a = np.arange(-15,15).reshape(5,6)**2
and compute:
• The maximum of each row (one max per row)
• The mean of each row (one mean per row)
225 196 169 144 121 100
81 64 49 36 25 16
9 4 1 0 1 4
9 16 25 36 49 64
81 100 121 144 169 196
The array data structure
Operations on the array structure
Operations that only affect the array structure,
not the data, can be executed without copying
memory.
Transpose
TRANSPOSE
>>> a = np.array([[0, 1, 2], [3, 4, 5]])
>>> a.shape
(2, 3)
#Transpose swaps the order of axes
>>> a.T
array([[0, 3],
[1, 4],
[2, 5]])
>>> a.T.shape
(3, 2)
Transpose
TRANSPOSE
# Transpose does not move values around in
# memory. It only changes
# the order of "strides" in the array
>>> a.strides
(12, 4)
>>> a.T.strides
(4, 12)
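The claim that transposing copies nothing can be checked with `np.shares_memory`. The sketch below assumes NumPy is installed and fixes the dtype to int32 so that the strides match the values shown above.

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.int32)
t = a.T

original_strides = a.strides        # bytes to step along each axis of a
transposed_strides = t.strides      # the same numbers in reverse order

shares = bool(np.shares_memory(a, t))

# Writing through the transpose changes the original array.
t[0, 1] = 99                        # this is element a[1, 0]
changed = int(a[1, 0])
```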
Reshaping Arrays
RESHAPE
>>> a = np.array([[0, 1, 2], [3, 4, 5]])
# Return a new array with a different shape
# (a view where possible)
>>> a.reshape(3, 2)
array([[0, 1], [2, 3], [4, 5]])
Reshaping Arrays
RESHAPE
# Reshape cannot change the number of
# elements in an array
>>> a.reshape(4, 2)
ValueError: total size of new array must be
unchanged
Reshaping Arrays
SHAPE
>>> a = np.arange(6)
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.shape
(6, )
# Reshape array in-place to 2 x 3
>>> a.shape = (2, 3)
>>> a
array([[0, 1, 2], [3, 4, 5]])
Flattening Arrays
FLATTEN (SAFE)
a.flatten() converts a multi-dimensional array
into a 1-D array. The new array is a copy of the
original data
# Create a 2D array
>>> a = np.array([[0, 1], [2, 3]])
#Flatten out elements to 1D
>>> b = a.flatten()
>>> b
array([0, 1, 2, 3])
Flattening Arrays
FLATTEN (SAFE)
# Changing b does not change a
>>> b[0] = 10
>>> b
array([10, 1, 2, 3])
>>> a
array([[0, 1], [2, 3]])
Flattening Arrays
RAVEL (EFFICIENT)
a.ravel() is the same as a.flatten(), but returns a
reference (or view) of the array if possible (i.e.,
the memory is contiguous).
Otherwise the new array copies the data.
#Flatten out elements to 1-D
>>> b = a.ravel()
>>> b
array([0, 1, 2, 3])
Flattening Arrays
RAVEL (EFFICIENT)
# Changing b does change a
>>> b[0] = 10
>>> b
array([10, 1, 2, 3])
>>> a
array([[10, 1], [2, 3]])
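The "if possible" condition can be seen by ravelling a transposed array, whose elements are no longer laid out in row-major order. A minimal sketch, assuming NumPy is installed:

```python
import numpy as np

a = np.array([[0, 1], [2, 3]])

view = a.ravel()            # a is contiguous, so this is a view
copied = a.T.ravel()        # the transpose is not row-major contiguous: a copy

view_shares = bool(np.shares_memory(a, view))
copy_shares = bool(np.shares_memory(a, copied))

copied[0] = 10              # does not touch a, because copied owns its data
a_unchanged = a.tolist()
```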
Advanced NumPy overview
NumPy is the low-level core of most Python
Data Science libraries.
There are a couple of advanced NumPy topics
that it’s worth being aware of:
• memmap’ed arrays
• structured arrays
Pandas
pd.read_table Example
>>> pd.read_table('historical_data.csv', sep=',',
        header=1, index_col=0, parse_dates=True,
        na_values=['-'])
Reading Large Files in Chunks
Pandas supports reading potentially very large files in chunks,
e.g. :
>>> chunks = []
>>> reader = pd.read_csv('contributions_2012.csv',
        chunksize=100000)
>>> for table in reader:
        new_yorkers = table['contbr_city'] == 'NEW YORK'
        chunks.append(table[new_yorkers])
>>> new_york_contributions = pd.concat(chunks)
>>> print(len(new_york_contributions))
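The same pattern can be tried without a large file. The sketch below assumes pandas is installed and reads a few made-up rows from an in-memory buffer; the `amount` column and its values are invented for illustration.

```python
import io

import pandas as pd

# A tiny stand-in for a large CSV file (made-up data).
csv_text = (
    "contbr_city,amount\n"
    "NEW YORK,100\n"
    "BOSTON,50\n"
    "NEW YORK,25\n"
    "CHICAGO,75\n"
    "NEW YORK,10\n"
)

chunks = []
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)   # two rows at a time
for table in reader:
    new_yorkers = table['contbr_city'] == 'NEW YORK'
    chunks.append(table[new_yorkers])                       # keep only matching rows

new_york_contributions = pd.concat(chunks)
row_count = len(new_york_contributions)
total_amount = int(new_york_contributions['amount'].sum())
```

Only the filtered rows of each chunk are kept, so peak memory depends on the chunk size rather than on the size of the file.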
Pandas IO Summary
READING
Format     Method, Function, Class
txt, csv   read_table, read_csv
pickle     read_pickle
HDF5       read_hdf, HDFStore
SQL        read_sql_table
Excel      read_excel
R (exp)    rpy.common.load_data
Pandas IO Summary
WRITING
Format Method, Function, Class