
Python Refresher Notes

This document provides an agenda for an introduction to machine learning using Python. It covers Python data types like numerical, boolean, string, list, dictionary and set. It also discusses control statements, organizing code using functions and modules, handling files, and key data analytics tools like Numpy and Pandas. The document explains why Python is well suited for data science and machine learning tasks and provides examples of companies using Python for data analysis.

Uploaded by

tawarush

Introduction to Machine Learning

Python Refresher
Agenda
•  Language Introduction, data types
–  Numerical types
–  Boolean
–  Strings
–  Lists (Indexing and Slicing)
–  Assignment
–  Mutable vs Immutable
–  Tuple
–  Dictionaries
–  Sets
Agenda
•  Language Introduction, control statements
–  If statements
–  While Loops
–  For Loops
–  List Comprehension
–  Looping
Agenda
•  Language Introduction, organising code
–  Functions
–  Calling functions
–  Generators
–  Modules
•  Handling files
–  Reading from files
–  Writing to files
Agenda
•  Data Analytics Ecosystem
•  Numpy
–  Matplotlib basics
–  Introducing Numpy Arrays
–  Multi-Dimensional Arrays
–  Slicing/Indexing Arrays
–  Creating Arrays
–  Array Creation Functions
–  Array Calculation Methods
–  The array data structure
–  Advanced Numpy overview
Agenda
•  Pandas
–  Reading Data
–  Series: One Dimensional Data Structure
–  DataFrame: Two Dimensional Data Structure
–  Visualisation
–  Dealing With Missing Data
–  Dealing With Dates and Times
–  Computations and Statistics
–  Group-Based Operations: Split-Apply-Combine
Why Python?
Python is an interpreted programming language
that is:
•  easy to learn,
•  easy to use,
•  and comprehensive in terms of Data Science
tools
Why Python?
The tools can do the following among other
things:
•  Data wrangling,
•  Statistical analysis,
•  machine learning,
•  Natural Language Processing (NLP),
Why Python?
Python is also a fully fledged programming
language with programming styles:

•  from exploratory analysis
•  to repeatable science
•  to software engineering for production
deployment.
Why Python?
•  Python Highlights
–  Compiles to byte codes and interprets them
–  Automatic garbage collection
–  Dynamic typing
–  Object-oriented (everything is an object)
–  Free and open
–  Portable / Cross-platform
–  Easy to learn and use
–  Comes with a standard library e.g. webserver, reading
files
Why Python for Data Science?
•  High-level language, allows rapid prototyping
to explore multiple approaches
•  Libraries and tools exist to support you during
all phases of your workflow
•  Easy, Matlab-like visualisation tools
•  Active, growing scientific community
•  It is a real programming language
–  General purpose language
Who is using Python for Data Science?
•  Wall Street
–  Some of the largest investment banks and hedge
funds rely on Python for their core trading and risk
management, fraud detection systems.
•  Travel Industry
–  Travel companies use Python for data mining and
predictive analytics
•  Travel pricing insights
•  Recommendation systems
•  Predicting travel delays
Petroleum Industry
•  Geophysics and exploration
–  ConocoPhillips
–  Shell
•  AstraZeneca
–  AstraZeneca consolidated some of their disparate
drug discovery tools into a suite called PyDrone
with great success
Social Media
•  Many people and companies use Python to
analyse social media data from Google,
LinkedIn, Facebook, Twitter, etc. (for
analysis, customer segmentation, prediction).
•  Many More
–  Gov: National Labs, SEC, …
–  PayPal
–  Uber
–  …
Language Introduction

Data Types
Outline
•  Data types:
–  Numerical types: int, float, complex numbers
–  Booleans
–  Strings
–  Lists and tuples
–  Dictionaries and sets
–  Things to know about efficiency
Interactive Calculator
#adding two numbers
>>> 2 + 3
5
#setting a variable
>>> a = 2
>>> a
2
>>> type(a)
int
Interactive Calculator
# an arbitrarily large integer
>>> a = 12345678901234567890
# remove 'a' from the 'namespace'
>>> del a
>>> a
NameError: name 'a' is not defined
#integer literals in other bases
>>> 0xFF, 0o77, 0b11
(255, 63, 3)

Interactive Calculator
#real numbers
>>> b = 1.4 + 2.3
>>> b
3.6999999999999997
>>> type(b)
float
#complex numbers
>>> c = 2 + 1.5j
>>> c
(2+1.5j)
Interactive Calculation
#arithmetic operations
>>> 1+2-(3*4//6)**5+(7%5)
-27
#simple math functions
>>> abs(-3)
3
>>> max(0, min(10, 0, -1, 3))
0
>>> round(2.718281828, 0)
3.0
Interactive Calculation
#Overwriting function(!)
#don't do this
>>> max = 100
#some time later …
>>> x = max(4, 5)
TypeError: 'int' object is not callable
Built-in functions are just like variables, which
can be overwritten
Interactive Calculation
Type conversion
>>> int(2.718281828)
2
>>> float(2)
2.0
>>> 1 + 2.0
3.0
Interactive Calculation
In-place operation
>>> b = 2.5
>>> b += 0.5 # b = b + 0.5
>>> b
3.0
# Also -=, *=, /=, etc
Give it a try!
In a jupyter notebook, evaluate

$x = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$

for a = -2.0, b = 3.0, c = 5.0
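A minimal sketch of this exercise, assuming the expression is the "+" root of the quadratic formula and that the slide's value for a is negative 2.0. The names `discriminant` and `root` are illustrative, not from the slides.

```python
import math

# Coefficients from the exercise (a is assumed to be negative 2.0)
a = -2.0
b = 3.0
c = 5.0

# Discriminant: b**2 - 4ac
discriminant = b**2 - 4 * a * c

# The '+' root of the quadratic formula
root = (-b + math.sqrt(discriminant)) / (2 * a)

print(discriminant)  # 49.0
print(root)          # -1.0
```

Using `math.sqrt` keeps the sketch in the standard library; it raises `ValueError` for a negative discriminant, so other coefficients may need `cmath.sqrt` instead.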
Logical Expressions
Comparison operations
# <, >, <=, >=, !=
>>> 1 >= 2
False
>>> 2**3 != 3**2
True
Logical Expressions
# Chained comparisons
>>> 1 < 10 < 100
True
# bool DATA TYPE
>>> q = 1 > 0
>>> q
True
>>> type(q)
bool
Logical Expressions
# and OPERATOR
>>> 1 > 0 and 5 == 5
True
# If first operand is false,
# the second is not evaluated
>>> 1 < 0 and max(0,1,2) > 1
False
Logical Expressions
# or OPERATOR
>>> a = 50
>>> a < 10 or a > 90
False
# If first operand is true,
# the second is not evaluated
>>> a = 0
>>> a < 10 or a > 90
True
# not OPERATOR
>>> not 10 <= a <= 90
True
Strings
# Creating Strings
# using double quotes
>>> s = "hello world"
>>> print(s)
hello world
# single quotes also work
>>> s = 'hello world'
>>> print(s)
hello world


Strings
# String Operations
# concatenating two strings
>>> "hello " + "world"
'hello world'
# repeating a string
>>> "hello " * 3
'hello hello hello '
# String Length
>>> s = "12345"
>>> len(s)
5

Strings
# Split/Join Strings
# split space-delimited words
>>> s = "hello world"
>>> wrd_lst = s.split()
>>> print(wrd_lst)
['hello', 'world']
# join words back together
# with a space in between
>>> space = ' '
>>> space.join(wrd_lst)
'hello world'

Multi-line Strings
# Triple quotes
# Strings in triple quotes retain line breaks
>>> a = """hello
... world"""
>>> print(a)
hello
world

Multi-line Strings
# New line character
# Including a newline character
>>> a = "hello\nworld"
>>> print(a)
hello
world

A Few String Methods and Functions
REPLACEMENT
>>> a = "hello world"
>>> a.replace('world', 'Mars')
'hello Mars'
CONVERT TO UPPERCASE
>>> a.upper()
'HELLO WORLD'
REMOVE WHITESPACE
>>> s = "\t hello world \n"
>>> s.strip()
'hello world'
A Few String Methods and Functions
NUMBERS TO STRINGS
>>> repr(1.1 + 2.2)
'3.3000000000000003'
>>> str(1.1)
'1.1'

STRINGS TO NUMBERS
>>> int('23')
23
>>> int('FF', 16)
255
>>> float('23')
23.0
String Formatting
The format() method replaces any replacement fields in
the string with the values given as arguments.
Replacement field format: {<name>:<format_spec>}
# If 'name' is an integer, it refers to the argument
# position
>>> '{0} is greater than {1}'.format(100, 50)
'100 is greater than 50'
# If 'name' is text, it refers to a keyword argument.
>>> '{last}, {first}'.format(first='Ellen', last='Ripley')
'Ripley, Ellen'
String Formatting Format Spec
The optional format specification is used to control
how the values are displayed.
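As a small illustration of what a format specification controls, the sketch below uses made-up values (3.14159, 'abc', 42) that are not from the slides. It shows a fixed-point width and precision, a fill character with centring, and a forced sign.

```python
# Width 8, three digits after the decimal point
fixed = '[{:8.3f}]'.format(3.14159)

# Centre the text in a field of width 9, padding with '-'
centred = '[{:-^9s}]'.format('abc')

# Always show the sign of an integer
signed = '[{:+d}]'.format(42)

print(fixed)    # [   3.142]
print(centred)  # [---abc---]
print(signed)   # [+42]
```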
# Fixed point format (and a named keyword argument).
>>> print('[{x:5.0f}] [{x:5.1f}] [{x:5.2f}]'.format(x=12.3456))

# Alignment (and using a numbered positional argument).
>>> template = '[{0:<10s}] [{0:>10s}] [{0:*>10s}] [{0:*^10s}]'
>>> print(template.format('PYTHON'))
List Objects
LIST CREATION WITH BRACKETS
>>> a = [10, 11, 12, 13, 14]
>>> print(a)
CONCATENATING LISTS
# simply use the + operator
>>> [10, 11] + [12, 13]
REPEATING ELEMENTS IN LISTS
#the multiply operator does the trick
>>> [10, 11] * 3
List Objects
range(start, stop, step)
#the range function is helpful for creating a sequence
>>> list(range(5))
Output: [0, 1, 2, 3, 4]

>>> list(range(2, 7))
Output: [2, 3, 4, 5, 6]

>>> list(range(2,7,2))
Output: [2, 4, 6]
Indexing
RETRIEVING AN ELEMENT
# list
# indices: 0 1 2 3 4
>>> a = [10 , 11, 12, 13, 14]
>>> a[0]
SETTING AN ELEMENT
>>> a[1] = 21
>>> print(a)
OUT OF BOUNDS
>>> a[10]
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
IndexError: list index out of range

Indexing
NEGATIVE INDICES
# negative indices count
# back from the end of the list
# indices: -5 -4 -3 -2 -1
>>> a = [10, 11, 12, 13, 14]
>>> a[-1]
14
>>> a[-2]
13

negative index:  -5  -4  -3  -2  -1
element:         10  11  12  13  14
index:            0   1   2   3   4

The first element in a list has index 0, as in C.
Take note, Matlab and Fortran programmers!


More on List Objects
LIST CONTAINING MULTIPLE TYPES
# list containing integer, string, and another list
>>> a = [10, 'eleven', [12, 13]]
>>> a[0]
>>> a[2]
# use multiple indices to retrieve elements
# from nested lists
>>> a[2][0]
More on List Objects
LENGTH OF A LIST
>>> len(a)
DELETING OBJECT FROM LIST
#use the del keyword
>>> del a[2]
>>> a
DOES THE LIST CONTAIN x?
# use in or not in
>>> a = [10, 11, 12, 13, 14]
>>> 13 in a
>>> 13 not in a

Slicing
var[lower : upper : step]

Extracts a portion of a sequence by specifying a
lower and upper bound.

The lower bound element is included, but the
upper-bound element is not included.

Mathematically, [lower, upper).

The step value specifies the stride between
elements.
Slicing
SLICING LISTS
# indices:
# -5 -4 -3 -2 -1
# 0 1 2 3 4
>>> a = [10, 11, 12, 13, 14]
>>> a[1:3]
# negative indices work also
>>> a[1: -2]
>>> a[-4:3]
Slicing
OMITTING INDICES
# omitted boundaries are assumed to be the
# beginning (or end) of the list
# grab first three elements
>>> a[:3]
# grab last two elements
>>> a[-2:]
# every other element
>>> a[::2]
Lists in Action
>>> a = [10, 21, 23, 11, 24]
# add an element to the list
>>> a.append(11)
>>> print(a)
# how many 11s are there?
>>> a.count(11)
# extend with another list
>>> a.extend([5, 4])
# where does 11 first occur?
>>> a.index(11)
# insert 100 at index 2
>>> a.insert(2, 100)
>>> print(a)
Lists in Action
# pop the item at index = 3
>>> a.pop(3)
# remove the first 11
>>> a.remove(11)
>>> print(a)

# sort the list (in-place). Note: use sorted(a) to
# return a new list
>>> a.sort()
>>> print(a)
# reverse the list
>>> a.reverse()
>>> print(a)
Give it a try!
In a jupyter notebook
Given the list:
b = [9.5, 9.25, 9.75, 9.50]
1.  Add the value 9.00 to the end of the list
2.  Find the maximum value (Hint: look at the
max built-in)
3.  Find the index of the maximum value
4.  Remove the maximum value from the list
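One possible way to work through these four steps is sketched below. The names `largest` and `position` are illustrative, and other approaches (for example `b.pop(position)`) would work equally well.

```python
b = [9.5, 9.25, 9.75, 9.50]

# 1. Add 9.00 to the end of the list
b.append(9.00)

# 2. Find the maximum value with the max built-in
largest = max(b)

# 3. Find the index of the (first) maximum value
position = b.index(largest)

# 4. Remove the maximum value from the list
b.remove(largest)

print(largest)   # 9.75
print(position)  # 2
print(b)         # [9.5, 9.25, 9.5, 9.0]
```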
Assignment of the “Simple” Object

Assignment creates object references

>>> x = 0
# This causes x and y to point to the same value
>>> y = x
# Re-assigning y to a new value decouples the two variables
>>> y = "foo"
>>> print(x)
0
Assignment of Container Object

Assignment creates object references

>>> x = [0, 1, 2]
# This causes x and y to point to the same value
>>> y = x
# A change to y also changes x.
>>> y[1] = 6
>>> print(x)
[0, 6, 2]
# Re-assigning y to a new list decouples the two variables
>>> y = [3, 4]
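To make the difference between sharing a reference and copying concrete, here is a short sketch. The names `alias` and `duplicate` are illustrative, and `list(x)` is just one of several ways to make a shallow copy.

```python
x = [0, 1, 2]

# Plain assignment: alias refers to the very same list object as x
alias = x

# list() builds a new list holding the same elements (a shallow copy)
duplicate = list(x)

alias[1] = 6        # visible through x as well
duplicate[2] = 99   # x is unaffected

print(x)               # [0, 6, 2]
print(alias is x)      # True
print(duplicate is x)  # False
```

The `is` operator tests object identity, which is why it distinguishes the alias from the copy even when their contents happen to be equal.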
Mutable vs. Immutable
MUTABLE OBJECTS
# Mutable objects, such as lists, can be changed
# in place

# Insert new values into list
>>> a = [10, 11, 12, 13, 14]
>>> a[1:3] = [5,6]
>>> print(a)
Mutable vs. Immutable
IMMUTABLE OBJECTS
# Immutable objects, such as integers and strings,
# cannot be changed in place.

# Try inserting values into a string
>>> s = 'abcde'
>>> s[1:3] = 'xy'
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: 'str' object does not support item assignment

# here is how to do it
>>> s = s[:1] + 'xy' + s[3:]
>>> print(s)
axyde
Tuple – Immutable Sequence
TUPLE CREATION
>>> a = (10, 11, 12, 13, 14)
>>> print(a)

PARENTHESES ARE OPTIONAL
>>> a = 10, 11, 12, 13, 14
>>> print(a)
Tuple – Immutable Sequence
LENGTH-1 TUPLE
>>> (10,) # is a tuple
>>> (10) # is not a tuple, but an integer with parentheses

TUPLES ARE IMMUTABLE
# create a list
>>> a = list(range(10, 15))
# cast the list to a tuple
>>> b = tuple(a)
>>> print(b)
(10, 11, 12, 13, 14)
# try inserting a value
>>> b[3] = 23
TypeError: 'tuple' object does not support item
assignment
Give it a try!
In a jupyter notebook
Without executing the code, what values will a
and x hold?

a= [3, 4]
x= (1, 2, a)
x[-1].append(7)
Run the code and check your answer.
Sequence (Un)packing
(UN)PACKING SEQUENCES
# Creating a tuple without ()
>>> d = 1, 2, 3
>>> d
# Multiple assignments from a tuple
>>> a, b, c = d
>>> print(b)
# Multiple assignment from a list
>>> a, b, c = [1, 2, 3]
>>> print(b)
Sequence (Un)packing


WHY IS IT USEFUL?
We will soon see that this feature is very
common in Python code, e.g.
>>> def f(x):
...     return 1, x, x**2
>>> a0, a1, a2 = f(3)
Sequence (Un)packing


WHY IS IT USEFUL?
Another example is in for loops over multiple
elements
# Swapping variables
>>> a, b = 1, 2
>>> b, a = a, b
>>> print(a)
Dictionaries

Dictionaries store key/value pairs. Indexing a
dictionary by a key returns the value associated
with it.

The key must be immutable (more precisely, hashable)

They map a set of objects (keys) to another set
of objects (values)
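A short sketch of the key restriction, using a hypothetical `grid` dictionary that is not part of the slides: a tuple works as a key because it is immutable, while a list is rejected with a TypeError.

```python
# A tuple is immutable (and hashable), so it can be used as a key
grid = {(0, 0): 'origin'}
grid[(1, 2)] = 'somewhere else'

# A list is mutable, so using it as a key raises TypeError
try:
    grid[[3, 4]] = 'not allowed'
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(grid[(0, 0)])      # origin
print(list_key_allowed)  # False
```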
Dictionaries

DICTIONARY EXAMPLES
# Create an empty dictionary using curly brackets
>>> record = {}
# Each indexed assignment creates a new key/value pair
>>> record['first'] = 'James'
>>> record['second'] = 'Maxwell'
>>> record['born'] = '1893'
>>> print(record)

Dictionaries
DICTIONARY EXAMPLES

# Create another dictionary with initial entries
>>> new_record = {'first': 'James', 'middle': 'Clerk'}

# Now update the first dictionary with values from new one
>>> record.update(new_record)
>>> print(record)


Accessing and Deleting Keys and Values
ACCESS USING INDEX NOTATION
>>> record['first']
'James'

ACCESSING WITH get(key, default)
The get() method returns the value associated with a key.
The optional second argument is the return value if the key
is not in the dictionary (None if it is omitted).
>>> record.get('born', 0)
'1893'
>>> record.get('home', 'TBD')
'TBD'
# index notation raises an error for a missing key
>>> record['home']
KeyError …


Accessing and Deleting Keys and Values
REMOVE AN ENTRY WITH DEL
>>> del record['middle']
>>> record

REMOVE WITH pop(key, default)
pop() removes the key from the dictionary and returns the
value. An optional second argument is the return value if
the key is not in the dictionary.
>>> record.pop('born')
>>> record
>>> record.pop('born', 0)


Dictionaries in Action
# dict of animals:count pairs
>>> barn = {'cows': 1, 'dogs': 5, 'cats': 3}
# test for chickens
>>> 'chickens' in barn
False
# get list of all keys
>>> list(barn.keys())
# get a list of all values
>>> list(barn.values())
Dictionaries in Action
# return key/value tuples
>>> list(barn.items())
# How many cats?
>>> barn['cats']

# Change the number of cats
>>> barn['cats'] = 10
>>> barn['cats']

# Add some sheep
>>> barn['sheep'] = 5
>>> barn['sheep']
Set Objects
DEFINITION
A set is an unordered collection of unique, immutable
objects

CONSTRUCTION
# an empty set
>>> s = set()
# convert a sequence to set
>>> t = set([1, 2, 3, 1])
# note removal of duplicates
>>> t
Set Objects
ADD ELEMENTS

>>> t.add(5)
>>> t.update([5, 6, 7])
>>> t
REMOVE ELEMENTS
>>> t.remove(1)

Set Objects
SET OPERATIONS

>>> a = set([1, 2, 3, 4])
>>> b = set([3, 4, 5, 6])
>>> a.union(b)
>>> a.intersection(b)
>>> a.difference(b)

Selecting a Data Type
Selecting the appropriate data type is important

        Insert    Remove    Find      Ordered
list    linear    linear    linear    yes
set     constant  constant  constant  no
dict    constant  constant  constant  no

Typical usage for each data type:


lists: represent ordered collections of items, stacks, and
queues
sets: represent collections of unique, unordered items
dicts: represent registers, caches, mappings in general
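The typical usages above can be sketched with a made-up list of visitor names; the data and the names `visits`, `unique_visitors` and `counts` are illustrative only.

```python
visits = ['ann', 'bob', 'ann', 'cid', 'bob', 'ann']

# list: keeps order and duplicates
first_visitor = visits[0]

# set: unique items; a membership test does not scan every element
unique_visitors = set(visits)
has_bob = 'bob' in unique_visitors

# dict: maps each key to a value, here a running count per name
counts = {}
for name in visits:
    counts[name] = counts.get(name, 0) + 1

print(first_visitor)            # ann
print(sorted(unique_visitors))  # ['ann', 'bob', 'cid']
print(has_bob)                  # True
print(counts)                   # {'ann': 3, 'bob': 2, 'cid': 1}
```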

Language Introduction

Control Statements
Outline
•  If statements
•  While loops
•  For loops
–  List comprehension
–  Looping patterns
If Statements
If/elif/else provides conditional execution of code
blocks
IF STATEMENT FORMAT
if <condition>:
    <statement 1>
    <statement 2>
elif <condition>:
    <statements>
else:
    <statements>
If Statements
IF EXAMPLE
# a simple if statement
>>> x = 10
>>> if x > 0:
...     print('Hey!')
...     print('x > 0')
... elif x == 0:
...     print('x is 0')
... else:
...     print('x is negative')
While Loop
While loops iterate as long as a condition remains true
WHILE LOOPS
while <condition>:
    <statements>

While Loop
>>> tasks = ['A', 'B', 'C']
>>> while tasks:
...     curr = tasks.pop()
...     template = 'Doing {} ; ' \
...                'To do {}'
...     print(template.format(curr, tasks))
Output:
Doing C ; To do ['A', 'B']
Doing B ; To do ['A']
Doing A ; To do []
While Loop
BREAKING OUT OF A LOOP
# Breaking out of an infinite loop with "break"
>>> from builtins import input
>>> while True:
...     cmd = input('-> ')
...     if cmd == 'quit':
...         break
...     print('Executing {} '.format(cmd))
For Loops
For loops iterate over a collection of objects
for <loop_var> in <collection>:
    <statements>

TYPICAL SCENARIO
>>> for integer in range(5):
...     print(integer)
# Use a mutable container like a list to collect results
>>> output = []
>>> for integer in range(5):
...     output.append(integer)
>>> print(output)
For Loops
LOOPING OVER A STRING
>>> for char in 'abcde':
...     print(char)

LOOPING OVER A LIST
>>> animals = ['dogs', 'cats', 'bears']
>>> accum = ''
>>> for animal in animals:
...     accum += animal + ' '
>>> print(accum)
Give it a try!
In a jupyter notebook


Given the list:
values = [-4, 4, -1, -2, 10, 3]
Write a while loop that creates two lists from this one:
a list of positive values and a list of negative values.

Hint: you will need to start with two empty lists
positives = [] and negatives = []
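A possible solution sketch for this exercise, assuming an explicit `index` counter drives the while loop (a hypothetical name; popping items off the list would be another option) and that zero belongs in neither list.

```python
values = [-4, 4, -1, -2, 10, 3]
positives = []
negatives = []

index = 0
while index < len(values):
    value = values[index]
    if value > 0:
        positives.append(value)
    elif value < 0:
        negatives.append(value)
    index += 1   # move on, otherwise the loop never ends

print(positives)  # [4, 10, 3]
print(negatives)  # [-4, -1, -2]
```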
LIST COMPREHENSION

LIST TRANSFORM WITH LOOP
# element by element transform of a list by applying an
# expression to each element
>>> a = [10, 21, 23, 11, 24]
>>> results = []
>>> for val in a:
...     results.append(val + 1)
>>> results
LIST COMPREHENSION
LIST COMPREHENSION

# list comprehensions provide a concise syntax for this
# sort of element by element transformation

>>> a = [10, 21, 23, 11, 24]
>>> [val+1 for val in a]
LIST COMPREHENSION
FILTER-TRANSFORM WITH LOOP

# transform only elements that meet a criterion
>>> a = [10, 21, 23, 11, 24]
>>> results = []
>>> for val in a:
...     if val > 15:
...         results.append(val + 1)
>>> results
LIST COMPREHENSION
LIST COMPREHENSION WITH FILTER
>>> a = [10, 21, 23, 11, 24]
>>> [val+1 for val in a if val > 15]

Consider using a list comprehension whenever you
need to transform one sequence to another
Looping Patterns
MULTIPLE LOOP VARIABLES

# Looping through a sequence of tuples allows multiple
# variables to be assigned
>>> pairs = [(0, 'a'), (1, 'b'), (2, 'c')]
>>> for index, value in pairs:
...     print('{} {}'.format(index, value))
Looping Patterns
ENUMERATE
# enumerate -> index, item
>>> y = ['a', 'b', 'c']
>>> for index, value in enumerate(y):
...     print('{} {}'.format(index, value))
Looping Patterns
ZIP
# zip 2 or more sequences into an iterator of tuples
>>> x = [0, 1, 2]
>>> y = ['a', 'b', 'c']
>>> list(zip(x, y))
[(0, 'a'), (1, 'b'), (2, 'c')]
>>> for index, value in zip(x, y):
...     print('{} {}'.format(index, value))
REVERSED
>>> z = [(0, 'a'), (1, 'b'), (2, 'c')]
>>> for index, value in reversed(z):
...     print('{} {}'.format(index, value))
Looping Over a Dictionary
>>> d = {'a': 1, 'b': 2, 'c': 3}
DEFAULT LOOPING (KEYS)
>>> for key in d:
...     print(key)
LOOPING OVER KEYS (EXPLICIT)
>>> for key in d.keys():
...     print(key)
LOOPING OVER VALUES
>>> for val in d.values():
...     print(val)
LOOPING OVER ITEMS
>>> for key, val in d.items():
...     print(d[key] is val)


Give it a try!
In a jupyter notebook
Given the dictionary:
values = {'A': -4, 'B': 10, 'C': -5, 'D': 3}

Use a loop to build a dictionary containing only the
keys and values which are positive.
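One way to approach this exercise is sketched below; `positive_values` is an illustrative name, and a dictionary comprehension would be an equally valid alternative to the explicit loop.

```python
values = {'A': -4, 'B': 10, 'C': -5, 'D': 3}

positive_values = {}
for key, val in values.items():
    # keep only the entries whose value is greater than zero
    if val > 0:
        positive_values[key] = val

print(positive_values)  # {'B': 10, 'D': 3}
```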
Language Introduction

Organising Code
Functions
Functions are reusable snippets of code

•  Definition

•  Positional and keyword arguments
Anatomy of a Function

def add(arg0, arg1):
    """Add two numbers"""
    a = arg0 + arg1
    return a

•  The keyword def indicates the start of a function.
•  Function arguments are listed, separated by commas. They are passed by assignment.
•  A colon (:) terminates the function signature.
•  An optional docstring documents the function in a standard way for tools like ipython.
•  Indentation is used to indicate the contents of the function. It is not optional, but a part of the syntax.
•  An optional return statement specifies the value returned from the function. If return is omitted, the function returns the special value None.
Our New Function in Action
>>> def add(x, y):
...     a = x + y
...     return a
# Test it out with numbers
>>> val_1 = 2
>>> val_2 = 3
>>> add(val_1, val_2)

Our New Function in Action

# How about strings?

>>> val_1 = 'foo'
>>> val_2 = 'bar'
>>> add(val_1, val_2)

Our New Function in Action
# Names can be assigned to functions
>>> func = add
>>> func(val_1, val_2)

# How about numbers and strings?
>>> add('abc', 1)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "<interactive input>", line 2, in add
TypeError…

Give it a try!
In a jupyter notebook
Create a function called count_letter that takes
as input a string txt and a character char, and
returns the number of times char appears in txt,
ignoring case.

For example:
>>> count_letter("Php, Perl, or Python?", "p")
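A minimal sketch of one possible `count_letter`, assuming that "ignoring case" means both the text and the character are lower-cased before counting.

```python
def count_letter(txt, char):
    """Return how many times char appears in txt, ignoring case."""
    return txt.lower().count(char.lower())


result = count_letter("Php, Perl, or Python?", "p")
print(result)  # 4
```

Lower-casing both arguments means a call with "P" gives the same answer as a call with "p".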
Function Calling Conventions

POSITIONAL ARGUMENTS
# The "standard" calling convention we know
>>> def add(x, y):
...     return x + y
>>> add(2, 3)

Function Calling Conventions

KEYWORD ARGUMENTS
# specify argument names
>>> add(x=2, y=3)
# or even a mixture if you are careful with order
>>> add(2, y=3)

Function Calling Conventions
DEFAULT VALUES
# Arguments can be assigned default values
>>> def quad(x, a=1, b=1, c=0):
...     return a*x**2 + b*x + c
# use defaults for a, b and c
>>> quad(2.0)
# Set b=3. Defaults for a and c
>>> quad(2.0, b=3)
# Keyword arguments can be passed in out of
# order
>>> quad(2.0, c=1, a=3, b=2)

Modules and Packages



Modules and packages are Python’s “libraries”
i.e. a collection of constants, functions and
classes


Importing a Module
MODULES ARE .py FILES
Modules are just .py files
# my_tools.py
def greetings():
    return "Hello all"

>>> import my_tools
>>> my_tools.greetings()


Importing a Module
BASIC IMPORTS
# The most basic import
>>> import numpy
>>> numpy.pi
# use an alias
>>> import numpy as np
>>> np.pi


Importing a Module
IMPORTING SPECIFIC SYMBOLS
# Select specific names to bring into the local
# namespace
>>> from numpy import pi
>>> pi


Importing a Module
IMPORTING FROM A SUBMODULE
# Some modules have submodules with their
# own objects
>>> from numpy.linalg import LinAlgError
>>> raise LinAlgError(
...     "You ate all the pi.")
LinAlgError: You ate all the pi.



Importing a Module
IMPORTING EVERYTHING
# Imports every public name from the numpy namespace;
# this can overwrite built-in names, so use with care
>>> from numpy import *


Packages
PACKAGES
Often a library will contain several modules.
These are organised as a hierarchical directory
structure, and imported using “dotted module
names”.
The first and the intermediate names (if any) are
called “packages”


Packages
PACKAGES

Example
>>> from email.utils import parseaddr
>>> from email import utils
>>> utils.parseaddr('Nhamo Mtetwa <[email protected]>')


Packages
PACKAGES ARE DIRECTORIES
email/
    __init__.py
    charset.py (defines add_codec)
    header.py (defines make_header)
    utils.py (defines parseaddr)

The file __init__.py indicates that email is a
package.
It is often an empty file.


Give it a try!
In a jupyter notebook
1. Import functions join and expanduser from
module os.path
join(expanduser('~'), 'myfile.txt')
2. Import module pandas with alias pd
3. Import division from module
__future__

Language Introduction

Reading Data
Reading Text Files
AS A LIST OF STRINGS
# Read file as list of strings
>>> from io import open
>>> with open('rcs.txt', encoding='ascii') as f:
...     lines = f.readlines()
>>> lines
Reading Text Files
ONE LINE AT A TIME
# Read one line at a time
>>> with open('rcs.txt',
...           encoding='utf-8') as f:
...     header = f.readline()
...     for line in f:
...         print(line)
EXAMPLE FILE: RCS.TXT
#freq (MHz) vv (dB) hh (dB)
100 -20.3 -31.2
200 -22.7 -33.6
Reading Text Files
WRITING AND APPENDING
# Mode 'w': create new file
>>> with open('a.txt', 'w', encoding='ascii') as f:
...     f.write(u'Wow!')
# Mode 'a': append to file
>>> with open('a.txt', 'a', encoding='ascii') as f:
...     f.write(u'Boo. \nYay!')
# Read the whole file
>>> with open('a.txt', 'r', encoding='ascii') as f:
...     whole_file = f.read()
>>> whole_file
Reading Text Files
WRITE AND READ
>>> with open('a.txt', 'w+',
...           encoding='ascii') as f:
...     f.write(u'ab cd ef')
...     f.seek(3)
...     print(f.read(2))
Use binary mode, 'rb' or 'wb', to prevent
Python from corrupting binary data formats like
JPEG and PDF
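A small sketch of a binary round trip, assuming a throwaway file in a temporary directory. The file name `blob.bin` and the byte values are made up for illustration.

```python
import os
import tempfile

# Arbitrary bytes, including a carriage return / newline pair
payload = b'\x00\x01\r\n\xff\xfe'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'blob.bin')  # hypothetical file name

    # 'wb': write bytes exactly as given, with no newline translation
    with open(path, 'wb') as f:
        f.write(payload)

    # 'rb': read the raw bytes back
    with open(path, 'rb') as f:
        restored = f.read()

print(restored == payload)  # True
```

In binary mode `read()` returns a `bytes` object rather than a `str`, so no encoding argument is used.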
Give it a try!
In a Jupyter notebook

Open a file course_data.txt for writing, and write
out the values from this list, one value to a line:
b = [9.50, 9.25, 9.75, 9.50]

Then open the file and read the values back in as a
list.

Open the file again to append another line with
value 9
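One possible way through this exercise is sketched below. It assumes the file lives in a temporary directory so the example cleans up after itself; in a notebook you would more likely open 'course_data.txt' directly.

```python
import os
import tempfile

b = [9.50, 9.25, 9.75, 9.50]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'course_data.txt')

    # Write one value per line
    with open(path, 'w', encoding='ascii') as f:
        for value in b:
            f.write('{}\n'.format(value))

    # Read the values back in as a list of floats
    with open(path, encoding='ascii') as f:
        values = [float(line) for line in f]

    # Append another line with the value 9
    with open(path, 'a', encoding='ascii') as f:
        f.write('9\n')

    # Count the lines to confirm the append worked
    with open(path, encoding='ascii') as f:
        line_count = len(f.readlines())

print(values)      # [9.5, 9.25, 9.75, 9.5]
print(line_count)  # 5
```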
NumPy

The standard numerical library for Python
NumPy
•  Defining arrays
•  Indexing and slicing
•  Creating arrays
•  Array calculations
•  The array data structure
•  Advanced NumPy, overview
Get Started
IMPORT NUMPY
>>> from numpy import *
>>> __version__
or
>>> from numpy import array

Often at the command line, it is handy to import everything from
NumPy into the command line. However, if you are writing scripts,
it is easier for others to read and debug in the future if you use
explicit imports.

USING IPYTHON --PYLAB
C:\> ipython --pylab

IPython has a 'pylab' mode where it imports all of NumPy and
Matplotlib into the namespace for you as a convenience. It also
enables threading for showing plots.

While IPython is used for all the demos, '>>>' is used on future
slides instead of 'In [1]:' to save space.
NumPy Arrays
NumPy arrays
•  Defining
•  Indexing and slicing
•  Creating arrays
•  Array calculations
•  The array data structure
•  Advanced NumPy, overview
Getting started
IMPORT NUMPY
>>> from numpy import *
>>> __version__
OR
>>> from numpy import array
>>> array([1, 2, 3])

Often at the command line, it is handy to import everything from
numpy into the command shell. However, if you are writing scripts,
it is easier for others to read and debug in the future if you use
explicit imports.
Matplotlib Basics
Matplotlib
Matplotlib behaves like a state machine.
Any command is applied to the current plotting area
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> t = np.linspace(0, 2*np.pi, 50)
>>> x = np.sin(t)
>>> y = np.cos(t)

Matplotlib’s “State Machine”
# Now create a figure
>>> plt.figure()

# and plot x inside it
>>>plt.plot(x)

Matplotlib’s “State Machine”

# Now create a new figure
>>> plt.figure()

# and plot y inside it …
>>> plt.plot(y)

# and add a title
>>> plt.title("Cos")
Line Plots
>>> x = np.linspace(0, 2*np.pi, 50)
>>> y1 = np.sin(x)
>>> y2 = np.sin(2*x)
>>> plt.figure() # Create figure
# plt.hold() was removed in Matplotlib 3.0; call plt.clf()
# if you need to clear the current figure
>>> plt.plot(y1)
>>> plt.plot(x, y1)

Line Plots
# red line with circle markers
>>> plt.plot(x, y1, 'r-o')
# red circle markers only
>>> plt.plot(x, y1, 'ro')
>>> plt.plot(x, y1, 'g-o', x, y2, 'b-+')
>>> plt.legend(['sin(x)', 'sin(2x)'])

Symbol  Colour     Symbol  Marker
b       Blue       .       Point
g       Green      o       Circle
r       Red        <>^v    Triangle
c       Cyan       8       Octagon
m       Magenta    s       Square
y       Yellow     *       Star
k       Black      +       Plus
w       White
Scatter Plots
>>> N = 50 # no. of points
>>> x = np.linspace(0, 10, N)
>>> from numpy.random import rand
>>> err = rand(N) * 5.0 # noise
>>> y1 = x + err
>>> areas = rand(N) * 300

Scatter Plots
>>> plt.scatter(x, y1, s=areas)
>>> plt.clf() # clear the figure (plt.hold was removed in Matplotlib 3.0)
>>> colors = rand(N)
>>> plt.scatter(x, y1, s=areas, c=colors)
>>> plt.colorbar()
>>> plt.title("Random scatter")
Image “Plots”

>>> #create some data
>>> e1 = rand(100)
>>> e2 = rand(100)*2
>>> e3 = rand(100)*10
>>> e4 = rand(100)*100
Image “Plots”

>>> corrmatrix = np.corrcoef([e1, e2, e3, e4])
# Plot corr matrix as image
>>> plt.imshow(corrmatrix, interpolation='none', cmap='GnBu')
>>> plt.colorbar()
Multiple Plots Using subplot
>>> t = np.linspace(0, 2*np.pi)
>>> x = np.sin(t)
>>> y = np.cos(t)

Multiple Plots Using subplot

# To divide the plotting area

>>> plt.subplot(2, 1, 1)
>>> plt.plot(x)
#Now activate a new plot area
>>>plt.subplot(2, 1, 2)
>>> plt.plot(y)
Histogram Plots
# Create array of data
>>> from numpy.random import randint
>>> data = randint(10000, size=(10, 1000))
# Approx. normal distribution
>>> x = np.sum(data, axis=0)
>>> plt.subplot(2, 1, 1)
>>> plt.hist(x, color='r')
# plot cumulative dist
>>> plt.subplot(2, 1, 2)
>>> plt.hist(x, cumulative=True)
# For multiple histograms use
# plt.hist([d1, d2, …])
Legend, Titles and Axis Labels
# Add labels in plot command
>>> plt.plot(np.sin(t), label='sin')
>>> plt.plot(np.cos(t), label='cos')
>>> plt.legend()

Titles and Axis Labels
>>> plt.plot(t, np.sin(t))
>>> plt.xlabel('radians')
# Keywords set text properties
>>> plt.ylabel('amplitude', fontsize='large')
>>> plt.title('sin(x)')

Plotting from Scripts
# In IPython, plots show up
# as soon as a plot command is issued
>>> plt.figure()
>>> plt.plot(np.sin(t))
>>>plt.figure()
>>>plt.plot(np.cos(t))
Non Interactive Mode
# In a script, you must call the show() command
# to display plots. Call it at the end of all your
# plot commands for best performance
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> t = np.linspace(0, 2*np.pi, 50)
>>> plt.figure()
>>> plt.plot(np.sin(t))
>>> plt.figure()
>>> plt.plot(np.cos(t))
# plots will not appear until this command is run
>>> plt.show()
MPL Exercise: Desired Output
Introducing NumPy Arrays
# Simple array creation
>>> import numpy as np
>>> a = np.array([0, 1, 2, 3])
>>> a
array([0, 1, 2, 3])
# Checking the type
>>> type(a)
numpy.ndarray
Introducing NumPy Arrays
# Numeric 'type' of elements
# (int32 here; the default integer is int64 on many platforms)
>>> a.dtype
dtype('int32')

# Number of dimensions
>>> a.ndim
1
Introducing NumPy Arrays
Array shape
# Shape returns a tuple listing the length of the
# array along each dimension
>>> a.shape
(4, )

Introducing NumPy Arrays
Bytes per element
>>> a.itemsize
4

Bytes of memory used
# Return the number of bytes used by the data
# portion of the array
>>> a.nbytes
16

Array Operations
#Simple Array Math
>>> a = np.array([1, 2, 3, 4])
>>> b = np.array([2, 3, 4, 5])
>>> a + b
array([3, 5, 7, 9])
>>> a ** b
array([ 1, 8, 81, 1024])
Maths Functions
# create array from 0. to 10.
>>> x = np.arange(11.)
# multiply entire array by scalar value
>>> c = (2*np.pi)/10
>>> c
0.6283185307179586
>>> c*x
# in-place operations
>>> x *=c
>>> x
# apply functions to array
>>> y = np.sin(x)
Setting Array Elements
# Array indexing
>>> a = np.array([0, 1, 2, 3])
>>> a[0]
0
>>> a[0] = 10
>>> a
array([10, 1, 2, 3])
Setting Array Elements
# assigning a float into an int32 array truncates
# the decimal part
>>> a[0] = 10.6
>>> a
array([10, 1, 2, 3])
# fill has the same behavior
>>> a.fill(-4.8)
>>> a
array([-4, -4, -4, -4])

Multi-Dimensional Arrays
# Multi-dimensional arrays
>>> a = np.array([[ 0, 1, 2, 3],[10, 11, 12, 13]])
>>> a
# shape = (ROWS, COLUMNS)
>>>a.shape
(2, 4)
# element count
>>> a.size
8
#Number of dimensions
>>> a.ndim
2

Get/Set Elements
>>> a[1, 3]
13
>>> a[1, 3] = -1
>>> a
array([[ 0,  1,  2,  3],
       [10, 11, 12, -1]])
# Address second (oneth) row using single index
>>> a[1]
array([10, 11, 12, -1])
Slicing
var[lower:upper:step]
Extracts a portion of a sequence by specifying a lower and upper
bound.

The lower bound element is included, but the upper-bound
element is not included.

Mathematically: [lower,upper).

The step value specifies the stride between elements
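The slicing rule above can be checked with a short sketch. The array and the variable names are illustrative only and are not part of the original slides.

```python
import numpy as np

a = np.array([10, 11, 12, 13, 14])

# lower bound is included, upper bound is excluded: positions 1 and 2
middle = a[1:3]

# a step of 2 takes every other element between the bounds
strided = a[0:5:2]

# a negative step walks backwards through the array
reversed_a = a[::-1]

print(middle)      # [11 12]
print(strided)     # [10 12 14]
print(reversed_a)  # [14 13 12 11 10]
```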
Slicing
# Slicing arrays
# indices: 0 1 2 3 4
>>> a = np.array([10, 11, 12, 13, 14])
>>> a[1:3]
array([11, 12])
Omitting Indices
# omitted boundaries are assumed to be the
# beginning (or end) of the array
# grab first three elements
>>> a[:3]
array([10, 11, 12])
# grab last two elements
>>> a[-2:]
array([13, 14])
# every other element
>>> a[::2]
array([10, 12, 14])
Give it a try
Create the array below with the following command
a = np.arange(25).reshape(5,5)
and extract the slices indicated

0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
Creating arrays
Array Constructor Examples

Floating point arrays
#Default to double precision
>>> a = np.array([0, 1.0, 2, 3])
>>> a.dtype
dtype('float64')
>>> a.nbytes
32
Array Constructor Examples
Reducing precision
>>> a = np.array([0, 1., 2, 3], dtype='float32')
>>> a.dtype
dtype('float32')
>>> a.nbytes
16
Array Constructor Examples
Unsigned Integer Byte
>>> a = np.array([0, 1, 2, 3], dtype='uint8')
>>> a.dtype
dtype('uint8')
>>> a.nbytes
4
Array Creation Functions
ARANGE
arange([start,] stop[, step], dtype=None)
Nearly identical to Python's range().
Creates an array of values in the half-open range [start, stop) with
the specified step value.
Allows non-integer values for start, stop, and step.
Default dtype is derived from the start, stop, and step values.

>>> np.arange(4)
array([0, 1, 2, 3])
>>> np.arange(0, 2*np.pi, np.pi/4)
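As a small sketch of the half-open range, the example below compares arange with linspace, which includes the stop value. The variable names are illustrative only.

```python
import numpy as np

# arange stops before the stop value: 1.0 is not included
half_open = np.arange(0, 1, 0.25)

# linspace takes a number of points and includes the stop value
closed = np.linspace(0, 1, 5)

print(half_open)  # [0.   0.25 0.5  0.75]
print(closed)     # [0.   0.25 0.5  0.75 1.  ]
```

With non-integer steps, rounding can make the length of an arange result hard to predict, so linspace is often the safer choice when the endpoint matters.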

Ones, Zeros
ones(shape, dtype=float64)
zeros(shape, dtype=float64)
shape is a number or sequence specifying the
dimensions of the array.
If dtype is not specified, it defaults to float64.

>>> np.ones((2, 3), dtype='float32')
array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float32)
>>> np.zeros(3)
array([0., 0., 0.])
Array Creation (cont.)
IDENTITY
# Generate an n by n identity array. The default
# dtype is float64
>>> a = np.identity(4)
>>> a
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])


Array Creation (cont.)
IDENTITY
>>> a.dtype
dtype('float64')
>>> np.identity(4, dtype=int)
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])


Empty and Fill
# empty(shape, dtype=float64, order='C')
>>> a = np.empty(2)
>>> a    # contents are uninitialized, so the values are arbitrary
# fill array with 5.0
>>> a.fill(5.0)
>>> a
array([5., 5.])
# alternative approach (slightly slower)
>>> a[:] = 4.0
>>> a
array([4., 4.])
Array Creation Functions (cont.)

LINSPACE
# Generate N evenly spaced elements between
# (and including) start and stop values
>>> np.linspace(0, 1, 5)
array([0., 0.25, 0.5, 0.75, 1.0])
Array Creation Functions (cont.)

LOGSPACE
# Generate N evenly spaced elements on a log
# scale between base**start and base**stop
# (default base=10)
>>> np.logspace(0, 1, 5)
array([1., 1.78, 3.16, 5.62, 10.])
Array Creation Functions (cont.)
ARRAYS FROM/TO TXT FILES
Data.txt
-- BEGINNING OF THE FILE
% Day, Month, Year, Skip, Avg Power
01, 01, 2000, x876, 13 % crazy day!
% we don’t have Jan 03rd
04, 01, 2000, %fed, 55


Array Creation Functions (cont.)

# loadtxt() automatically generates an array from
# the txt file
arr = np.loadtxt('Data.txt', skiprows=1, dtype=int,
                 delimiter=",", usecols=(0, 1, 2, 4), comments="%")
# Save an array into a txt file
np.savetxt('filename', arr)

Array calculation methods
Computations with arrays
Rule 1: Operations between multiple array
objects are first checked for proper shape
match.
Rule 2: Mathematical operators (+-*/exp,
log, ..) apply element by element, on
the values.
Rule 3: Reduction operations (mean, std, skew,
kurt, sum, prod, …) apply to the whole
array, unless an axis is specified
Rule 4: Missing values propagate unless
explicitly ignored (nanmean, nansum, …)
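The four rules above can be illustrated with a minimal sketch. The arrays and variable names are made up for this example.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([[10.0, 20.0, 30.0],
              [40.0, 50.0, 60.0]])

# Rule 1: shapes must be compatible; (2, 3) and (2, 2) are not
try:
    a + np.ones((2, 2))
    shape_error = False
except ValueError:
    shape_error = True

# Rule 2: operators and functions work element by element
elementwise = a * b

# Rule 3: reductions cover the whole array unless an axis is given
total = a.sum()
per_column = a.sum(axis=0)

# Rule 4: NaN propagates unless a nan-aware function is used
c = np.array([1.0, np.nan, 3.0])
plain_mean = np.mean(c)
safe_mean = np.nanmean(c)

print(shape_error, total, per_column, plain_mean, safe_mean)
```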

Array Calculation Methods
SUM Function
>>> a = np.array([[1, 2, 3], [4, 5, 6]])
# np.sum() defaults to adding up all the values in an
# array (Python's built-in sum() would only sum along the first axis)
>>> np.sum(a)
21
Array Calculation Methods
SUM Function
# supply the keyword axis to sum along the 0th
# axis
>>> np.sum(a, axis=0)
array([5, 7, 9])
# supply the keyword axis to sum along the last
# axis
>>> np.sum(a, axis=-1)
array([ 6, 15])
Axis
Array Calculation Methods
SUM ARRAY METHOD
# a.sum() defaults to adding up all values in the
# array
>>> a.sum()
21
# supply an axis argument to sum along a
# specific axis
>>> a.sum(axis=0)
array([5, 7, 9])
Array Calculation Methods
PRODUCT
# product along columns
>>> a.prod(axis=0)
array([ 4, 10, 18])

# as a function
>>> np.prod(a, axis=0)
array([ 4, 10, 18])
Min/Max
MIN
>>> a = np.array([2., 3., 0., 1.])
>>> a.min(axis=0)
0.0
# Use NumPy's min() instead of Python's
# built-in min() for speedy operations on
# multi-dimensional arrays
>>> np.min(a, axis=0)
0.0

Min/Max

ARGMIN
#Find index of minimum value.
>>> a.argmin(axis=0)
2
# as a function
>>> np.argmin(a, axis=0)
2
Min/Max
MAX
>>> a = np.array([2., 3., 0., 1.])
>>> a.max(axis=0)
3.0
# as a function
>>> np.max(a, axis=0)
3.0
Min/Max
ARGMAX
# Find index of maximum value
>>> a.argmax(axis=0)
1
# as a function
>>> np.argmax(a, axis=0)
1
Statistics Array Methods
MEAN
>>> a = np.array([[1, 2, 3], [4, 5, 6]])
# mean values of each column
>>> a.mean(axis=0)
array([2.5, 3.5, 4.5])
>>> np.mean(a, axis=0)
array([2.5, 3.5, 4.5])
>>> np.average(a, axis=0)
array([2.5, 3.5, 4.5])
# average can also calculate a weighted average
>>> np.average(a, weights=[1, 2], axis=0)
array([3., 4., 5.])

Statistics Array Methods
STANDARD DEV/VARIANCE
# Standard Deviation
>>> a.std(axis=0)
array([1.5, 1.5, 1.5])
# variance
>>> a.var(axis=0)
array([2.25, 2.25, 2.25])
>>> np.var(a, axis=0)
array([2.25, 2.25, 2.25])

Give it a try
Create the array below with the following command
a = np.arange(-15,15).reshape(5,6)**2
and compute:
•  The maximum of each row (one max per row)
•  The mean of each row (one mean per row)


225 196 169 144 121 100
81 64 49 36 25 16
9 4 1 0 1 4
9 16 25 36 49 64
81 100 121 144 169 196
The array data structure
Operations on the array structure


Operations that only affect the array structure,
not the data, can be executed without copying
memory.
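A minimal sketch of what "without copying memory" means in practice: the transposed and reshaped arrays below are views onto the same buffer, which `np.shares_memory` confirms. The variable names are illustrative only.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

t = a.T               # transpose: same buffer, swapped strides
r = a.reshape(3, 2)   # reshape of a contiguous array: also a view

shares_t = np.shares_memory(a, t)
shares_r = np.shares_memory(a, r)

# Writing through one view changes the original and the other view
t[0, 1] = 99

print(shares_t, shares_r)  # True True
print(a[1, 0], r[1, 1])    # 99 99
```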
Transpose
TRANSPOSE
>>> a = np.array([[0, 1, 2], [3, 4, 5]])
>>> a.shape
(2, 3)
#Transpose swaps the order of axes
>>> a.T
array([[0, 3],
[1, 4],
[2, 5]])
>>> a.T.shape
(3, 2)
Transpose
TRANSPOSE
# Transpose does not move values around in
# memory. It only changes
# the order of "strides" in the array
>>> a.strides
(12, 4)
>>> a.T.strides
(4, 12)
Reshaping Arrays
RESHAPE
>>> a = np.array([[0, 1, 2], [3, 4, 5]])
# Return a new array with a different shape
# (a view where possible)
>>> a.reshape(3, 2)
array([[0, 1], [2, 3], [4, 5]])
Reshaping Arrays
RESHAPE
# Reshape cannot change the number of
# elements in an array
>>> a.reshape(4, 2)
ValueError: total size of new array must be
unchanged

Reshaping Arrays
SHAPE
>>> a = np.arange(6)
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.shape
(6, )
# Reshape array in-place to 2 x 3
>>> a.shape = (2, 3)
>>> a
array([[0, 1, 2], [3, 4, 5]])

Flattening Arrays
FLATTEN (SAFE)
a.flatten() converts a multi-dimensional array
into a 1-D array. The new array is a copy of the
original data
# Create a 2D array
>>> a = np.array([[0, 1], [2, 3]])
#Flatten out elements to 1D
>>> b = a.flatten()
>>> b
array([0, 1, 2, 3])

Flattening Arrays
FLATTEN (SAFE)
# Changing b does not change a
>>> b[0] = 10
>>> b
array([10, 1, 2, 3])
>>> a
array([[0, 1], [2, 3]])

Flattening Arrays
RAVEL (EFFICIENT)
a.ravel() is the same as a.flatten(), but returns a
reference (or view) of the array if possible (i.e.,
the memory is contiguous).
Otherwise the new array copies the data.
#Flatten out elements to 1-D
>>> b = a.ravel()
>>> b
array([0, 1, 2, 3])
Flattening Arrays
RAVEL (EFFICIENT)
# Changing b does change a
>>> b[0] = 10
>>> b
array([10, 1, 2, 3])
>>> a
array([[10, 1], [2, 3]])
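The "if possible" condition for ravel can be seen in a short sketch: a contiguous array gives a view, while its transpose forces a copy. The variable names are illustrative only.

```python
import numpy as np

a = np.array([[0, 1], [2, 3]])

flat_view = a.ravel()    # a is C-contiguous, so this is a view
flat_copy = a.T.ravel()  # the transpose is not, so NumPy copies

view_shares = np.shares_memory(a, flat_view)
copy_shares = np.shares_memory(a, flat_copy)

print(view_shares)  # True
print(copy_shares)  # False
print(flat_copy)    # [0 2 1 3]
```

When the result will be modified, checking `np.shares_memory` (or calling flatten() to always get a copy) avoids surprises.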
Advanced NumPy overview
NumPy is the low-level core of most Python
Data Science libraries.

There are a couple of advanced NumPy topics
that it’s worth being aware of:

•  memmap’ed arrays
•  structured arrays
Pandas
pd.read_table Example
>>> pd.read_table('historical_data.csv', sep=',',
                  header=1, index_col=0, parse_dates=True,
                  na_values=['-'])
Reading Large Files in Chunks
Pandas supports reading potentially very large files in chunks,
e.g. :
>>> chunks = []
>>> reader = pd.read_csv('contributions_2012.csv',
                         chunksize=100000)
>>> for table in reader:
        new_yorkers = table['contbr_city'] == 'NEW YORK'
        chunks.append(table[new_yorkers])
>>> new_york_contributions = pd.concat(chunks)
>>> print(len(new_york_contributions))
Pandas IO Summary
READING
Format      Method, Function, Class
txt, csv    read_table, read_csv
pickle      read_pickle
HDF5        read_hdf, HDFStore
SQL         read_sql_table
Excel       read_excel
R (exp)     rpy.common.load_data
Pandas IO Summary
WRITING
Format      Method, Function, Class
txt, csv    to_string, to_csv
html        to_html
pickle      to_pickle
HDF5        to_hdf, HDFStore
SQL         to_sql
Excel       to_excel
R (exp)     rpy.common.convert_to_r_dataframe
Examples
# Excel
>>> pd.read_excel("out.xlsx")
# Scrape tables from HTML webpages
>>> url = ("http://en.wikipedia.org/"
           "wiki/World_population")
>>> pd.read_html(url)
Examples

# HDF5 is a set of technologies that supports
# management of very large and complex data
# collections
>>> with pd.HDFStore('foo.h5') as stor:
        s.to_hdf(stor, 'ser1')
        s2 = pd.read_hdf(stor, 'ser1')
>>> s3 = pd.read_hdf('foo.h5', 'ser1')
Writing Pandas Objects




TO TEXT FORMATS
>>> df.to_csv('my_cav.csv')
# Can be pasted in forms, Excel, etc.
>>> df.to_clipboard()

Writing Pandas Objects

TO EXCEL
# Write single object
>>> df.to_excel('spreadsheet.xlsx', sheet_name='Sheet Name')

# Write multiple objects
>>> writer = pd.ExcelWriter('out.xlsx')
>>> df1.to_excel(writer, sheet_name='Sheet1')
>>> df2.to_excel(writer, sheet_name='Sheet2')
# writer.save() in older pandas versions
>>> writer.close()
Writing Pandas Objects


TO HDF5
# Context manager (with) makes sure the HDF5 file is closed after
# writing
>>> with pd.HDFStore('foo.h5') as stor:
        s.to_hdf(stor, 'ser1')
        df1.to_hdf(stor, '/dfs/df1')

Writing Pandas Objects

TO DATABASE


# Connect with SQLAlchemy engine
>>> from sqlalchemy import create_engine
>>> db_str = 'sqlite:///foo.sqlite'
>>> engine = create_engine(db_str)
# to_sql manages the connection
>>> df.to_sql('table_name', engine)
Definition
SERIES

A Pandas Series is a one-dimensional array that can hold any
data type (integer, float, string, Python object, etc.).

The elements in the Series are labeled, and the labels are
collectively called the index.

A Series also behaves like a dictionary: each label maps to a value.
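The dictionary-like side of a Series can be sketched as follows. The data reuses the names from these slides, and the label 'n9' is a made-up missing key.

```python
import pandas as pd

s = pd.Series({'n1': 'Cary', 'n2': 'Lynn', 'n3': 'Sam'})

has_n2 = 'n2' in s             # membership tests the index labels
value = s['n2']                # lookup by label
fallback = s.get('n9', 'n/a')  # dict-style get with a default
labels = list(s.keys())        # keys() returns the index

print(has_n2, value, fallback, labels)
```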



Definition

CONSTRUCTOR
# Basic method to create a Series
>>> s = pd.Series(data, index=index)

# Example
>>> s = pd.Series(['Cary', 'Lynn', 'Sam'], index=['n1', 'n2', 'n3'])
>>> s
Output:
n1 Cary
n2 Lynn
n3 Sam
dtype: object


Creating Methods
CREATING FROM SEQUENCES
# From an ndarray
>>> from numpy.random import randn
>>> s = pd.Series(randn(5), index=['a', 'b', 'c', 'd',
                                   'e'])
# From a dictionary
>>> d = {'n1': 'Cary',
         'n2': 'Lynn',
         'n3': 'Sam'}
>>> s = pd.Series(d, name='People')
Creating Methods
CREATING FROM A SCALAR
>>> idx = ['n1', 'n2', 'n3']
>>> s = pd.Series(7, index=idx)
Output:
n1 7
n2 7
n3 7
dtype: int64
When creating from a scalar, the index is required.
The number of elements will match the length of the
index.
Important Attributes and Methods
KEY ATTRIBUTES & METHODS

# Series values
>>> s.values
array(['Cary', 'Lynn', 'Sam'], dtype=object)

# Series values are in a NumPy ndarray
>>> type(s.values)
numpy.ndarray

Important Attributes and Methods
KEY ATTRIBUTES & METHODS

# Series index
>>> s.index
Index(['n1', 'n2', 'n3'], dtype=object)

# Series name
>>> s.name
'People'
>>> s.name = 'Important People'
Important Attributes and Methods
KEY ATTRIBUTES & METHODS
# Data type
>>> s.dtype
dtype('O')

# Shape
>>> s.shape
(3,)

# Length
>>> len(s)
3
# Unique values
>>> s.unique()
array(['Cary', 'Lynn', 'Sam'], dtype=object)
Indexing and Slicing
INDEXING
# Index elements like a dict: Series[label] -> value
>>> s['n2']
'Lynn'

SLICING
# Slice elements
>>> s[1:2]
n2 Lynn
dtype: object

Indexing and Slicing

# Slice elements by index label
# Note: Inclusive of upper bound.
>>> s['n2':'n3']
n2 Lynn
n3 Sam
dtype: object
Specifying Label or Position Indexing

[] NOTATION IS AMBIGUOUS
Series[integer]: Row position or index label?

>>> s = pd.Series([1, 2, 3], index=[3, 0, 2])
>>> s[0]
2
>>> s[1]
KeyError
Specifying Label or Position Indexing
Use special attributes to select rows:
.loc attribute is purely index label-based
.iloc attribute is purely integer position based

>>> s.iloc[0]
1
>>> s.loc[0]
2
>>> s.iloc[1]
2
>>> s.loc[1]
KeyError: 'the label [1] is not in the [index]'
Slicing Series
.ILOC IS LIKE SLICING AN ARRAY
# s.iloc[row_lower:row_upper:step]
>>> index = ['No', 'Blofeld', 'Chiffre']
>>> s = pd.Series([1, 5, 21], index=index)

# Access an element based on position. Returns
# only the value
>>> s.iloc[1]
5
# Slicing returns a Series
>>> s.iloc[:2]
No         1
Blofeld    5
Slicing Series
.LOC INCLUDES UPPER BOUND
>>> s.loc['No':'Blofeld']
No         1
Blofeld    5

# Including non-existing labels raises a KeyError
>>> s.loc['Blofeld':'Orlov']
KeyError: 'Orlov'

Slicing Series

# Exception: with a sorted index (e.g. alphabetically
# ordered letters), a missing bound is allowed
>>> index = ['a', 'b', 'c']
>>> s = pd.Series(range(3), index=index)
>>> s.loc['a':'e']
a    0
b    1
c    2
Give it a try!
1.  Create a Series with values [1, 2, 3, 4, 5], the
index [1, 1, 2, 3, 5] and name it 'fib'.
2.  Rename the Series to 'fibonacci'
3.  Select the first two elements, explicitly by
position
4.  Select the first two elements, explicitly by
label.
5.  Take a slice consisting of every other element
starting with first one, by position
Definition
WHY DATAFRAMES?
It’s useful to think of a Pandas DataFrame in
terms of its essential components
•  Values: Think of the DataFrame’s data as a
collection of columns, where each column
represents one variable.
–  A column can contain only one type of data, but
columns can be of different data types.
Definition
WHY DATAFRAMES?
It's useful to think of a Pandas DataFrame in
terms of its essential components (cont'd)

•  Columns: Columns have labels, which makes it
easier to work with the data

•  Index: Rows also have labels; collectively, row
labels are called index
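The three components (values, columns, index) can be inspected directly, as in this small sketch. It reuses the Age/Gender data that appears later in these slides.

```python
import pandas as pd

df = pd.DataFrame({'Age': [32, 18, 26], 'Gender': ['M', 'F', 'M']},
                  index=['Cary', 'Lynn', 'Sam'])

column_labels = list(df.columns)  # one label per column
row_labels = list(df.index)       # one label per row
age_kind = df['Age'].dtype.kind   # 'i': the Age column holds integers

print(column_labels)  # ['Age', 'Gender']
print(row_labels)     # ['Cary', 'Lynn', 'Sam']
print(df.dtypes)      # each column reports its own data type
```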
How to Create a DataFrame
BASIC CREATION SYNTAX
# df = pd.DataFrame(data, index=labels,
#                   columns=column_names)
# With data only
>>> data = [[32, 'M'],
            [18, 'F'],
            [26, 'M']]
>>> df = pd.DataFrame(data)

How to Create a DataFrame

SPECIFYING INDEX AND COLUMNS
>>> index = ['Cary', 'Lynn', 'Sam']
>>> columns = ['Age', 'Gender']
>>> df = pd.DataFrame(data, index=index,
                      columns=columns)
How to Create a DataFrame
DICTIONARY OF COLUMNS
>>> d = {'Age': [32, 18, 26],
         'Gender': ['M', 'F', 'M']}
>>> df = pd.DataFrame(d,
                      index=['Cary', 'Lynn', 'Sam'],
                      columns=['Gender', 'Age'])
How to Create a DataFrame
FROM NUMPY ARRAY
>>> from numpy import eye

>>> df = pd.DataFrame(eye(3),
                      index=['a', 'b', 'c'],
                      columns=['A', 'B', 'C'])
Important DataFrame Attributes

IMPORTANT ATTRIBUTES
# DataFrame values
>>> df.values
array([[32, 'M'], [18, 'F'], [26, 'M']], dtype=object)
# DataFrame index
>>> df.index
Index([u'Cary', u'Lynn', u'Sam'], dtype='object')
Important DataFrame Attributes

IMPORTANT ATTRIBUTES

# DataFrame columns
>>> df.columns
Index([u'Age', u'Gender'], dtype='object')

# Length is number of rows
>>> len(df)
3
Important DataFrame Attributes
IMPORTANT ATTRIBUTES (cont’d)
# Data type of the column
>>> df.dtypes
Age int64
Gender object
dtype: object

# DataFrame shape
>>> df.shape
(3, 2)
Indexing by Labels
INDEX COLUMNS LIKE A DICT

DataFrame[label] -> column (a Series)
>>> s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])
>>> df = pd.DataFrame({'A': s, 'B': -s})


Indexing by Labels
INDEX COLUMNS LIKE A DICT

# Pulling out a column is like accessing a
# dictionary element
>>> df['A']
a 0
b 1
c 2

Indexing by Labels
INDEX COLUMNS LIKE A DICT

# Select multiple columns using a list of column
# names
>>> df[['B', 'A']]
B A
a 0 0
b -1 1
c -2 2
Indexing by Labels
ADD COLUMNS LIKE A DICT
# Adding a column is like adding a dictionary
# entry
>>> df['C'] = [4, 5, 6]
>>> df
A B C
a 0 0 4
b 1 -1 5
c 2 -2 6

Indexing by Labels
ADD COLUMNS LIKE A DICT

# Access column as attributes if label is a valid
# Python variable name
>>> df.A
a 0
b 1
c 2
Indexing by Labels
ADD COLUMNS LIKE A DICT


# Invalid variable names require using []
>>> df['New Column'] = -5
>>> df.New Column
SyntaxError: invalid syntax
Indexing by Label or Integer Position

[] NOTATION IS AMBIGUOUS
DataFrame[integer]: Column position or name?
>>> df = pd.DataFrame(
        {2: ['a', 'b', 'c'],
         0: ['d', 'e', 'f'],
         3: ['g', 'h', 'i']},
        index=['i1', 'i2', 'i3'])

Indexing by Label or Integer Position
[] NOTATION IS AMBIGUOUS
>>> df[0]
i1 d
i2 e
i3 f

>>> df[1]
KeyError

Indexing by Label or Integer Position
SOLUTION: .loc AND .iloc
Use special attributes to select subsets:
•  .loc attribute is purely index label-based
•  .iloc attribute is purely integer position-based
>>> df.iloc[:, 0] # Column pos. 0
i1 a
i2 b
i3 c
Indexing by Label or Integer Position
SOLUTION: .loc AND .iloc
>>> df.loc[:, 0] # Label 0 = pos. 1
i1 d
i2 e
i3 f
>>> df.loc[:, 1] # Error
KeyError: 'the label [1] is not in the [columns]'
Slicing a DataFrame
SIMILAR TO SLICING 2D ARRAY
# Can specify columns as well. The syntax is:
# df.iloc[row_low:row_high:step,
#         col_low:col_high:step]
>>> df.iloc[:1, :2]
   Name  Weight
b1   No      75

>>> df.loc[:'b1', :'Weight']
   Name  Weight
b1   No      75

Slicing a DataFrame
MIXING LABEL & INTEGER SLICES
# .loc and .iloc do not accept mixed selectors
>>> df.loc[:2, 'Name':'Weight']
TypeError: cannot do slice indexing
# Older pandas offered .ix (primarily label-based, with
# integer fallback) for this, but it was removed in pandas 1.0.
# Chain .iloc and .loc instead.
>>> df.iloc[:2].loc[:, 'Name':'Weight']
      Name  Weight
b1      No      75
b5 Blofeld     140
Give it a try!
Create a simple DataFrame:
import pandas as pd
import numpy as np
data = np.arange(12).reshape(4, 3)
df = pd.DataFrame(data, index=['one', 'two',
                               'three', 'four'],
                  columns=['x', 'y', 'z'])
1.  Get column ‘y’
2.  Get row ‘three’ (by name)
3.  Get second and fourth rows (by index)
4.  Get the column ‘y’ and ‘z’ of rows ‘two’ and
‘three’
Pandas Plotting Basics

Matplotlib’s Object Model


Setup
IN A JUPYTER NOTEBOOK
# Place in first cell
# Show static figures inline
>>> %matplotlib inline

# Show interactive figures inline
>>> %matplotlib notebook

AT THE IPYTHON PROMPT
#Show interactive figures in new window
>>> %matplotlib
Setup
# script.py In a script, you must call the
# plt.show() command to display plots. Call it at
# the end of all your plot commands for best
# performance
import matplotlib.pyplot as plt
import pandas as pd
df1.plot()
df2.plot()
# Plots will not appear until this command is run
plt.show()
Pandas Plotting: Single Plot
CREATING A PLOT
>>> import matplotlib.pyplot as plt
# Create a DataFrame with dummy data
>>> df = pd.DataFrame({'a': [1, 2, 3],
                       'b': [5, 3, 9]},
                      index=[7, 8, 9])
# Plot both columns together
# You will get an Axes object back
>>> ax = df.plot()
>>> type(ax)
AxesSubplot
Pandas Plotting: Single Plot
CREATING A PLOT
# Edit the plot attributes.
# (Has to be in same cell as plot()!)
>>> ax.set_title('Title')
>>> ax.set_xlabel('Horizontal Axis')
>>> ax.set_ylabel('Vertical Axis')
# Override legend labels and change location
>>> ax.legend(('y1', 'y2'), loc='best')
>>> plt.savefig('newfigure.pdf')
Pandas Plotting: Subplots
ONE SUBPLOT PER COLUMN
# Create a DataFrame with dummy data
>>> df = pd.DataFrame({'y1': [1, 2, 3],
                       'y2': [5, 3, 9]},
                      index=[7, 8, 9])
# Plot each column separately. You will get
# back an array of Axes objects (one per subplot)
>>> axes = df.plot(subplots=True)
>>> type(axes)
numpy.ndarray
Pandas Plotting: Subplots
ONE SUBPLOT PER COLUMN
# Edit the subplot attributes. (Has to be in the
# same cell as plot()!)
>>> axes[0].set_xlabel('cost')
>>> axes[1].set_ylabel('profit')
>>> axes[1].set_xlabel('year')
# Use different layout
>>> rows, columns = 1, 2
>>> df.plot(subplots=True, layout=(rows,
                                   columns))
Pandas Plotting
OTHER TYPES OF PLOTS
# Create a DataFrame
>>> import numpy as np
>>> df2 = pd.DataFrame({
        'A': np.random.random(100),
        'B': np.random.random(100)})
Pandas Plotting
OTHER TYPES OF PLOTS
# Scatter plot
>>> df2.plot(kind='scatter', x='A', y='B')
# Histogram
>>> df2.hist()
# Find more here, in particular lag_plot,
# autocorrelation_plot, and andrews_curves
# (pandas.tools.plotting in older pandas versions)
>>> pandas.plotting
Give it a try!
Create a DataFrame:
import numpy as np
import pandas as pd
a = np.sin(np.linspace(-np.pi, np.pi))
df = pd.DataFrame({'A': a,
                   'B': a + 0.1*np.random.randn(50)})
1.  Plot A and B in the same plot
2.  Plot A and B in two separate subplots
    1.  Set the label A to "sin"
    2.  Set the label B to "noisy sin"
3.  Plot a "scatter matrix" of columns A and B
    (see pd.plotting.scatter_matrix; pd.scatter_matrix in older pandas versions)
Dealing with Missing Data
Dealing with Missing Data
PANDAS PHILOSOPHY
•  To signal a missing value, Pandas stores a NaN
(Not a Number) value defined in Numpy
(np.nan)
•  Unlike other packages (like NumPy), most
operators in Pandas will ignore NaN values in
a Pandas data structure.
Dealing with Missing Data

>>> import numpy as np
>>> import pandas as pd
>>> a = np.array([1, 2, 3, np.nan])
>>> a.sum()
nan
>>> s = pd.Series(a)
>>> s.sum()
6.0
Dealing with Missing Data
FINDING MISSING VALUES
>>> df
s1 s2
a 1 NaN
b NaN NaN
c 3 3.5
d 4 4.5



Dealing with Missing Data
FINDING MISSING VALUES
# Boolean mask for all null values: np.nan and None.
# Use the notnull method for the inverse
>>> df.isnull()
s1 s2
a False True
b True True
c False False
d False False


Dealing with Missing Data
FINDING MISSING VALUES

REMOVE / REPLACE NaN
>>> df.fillna(value=0)
s1 s2
a 1.0 0.0
b 0.0 0.0
c 3.0 3.5
d 4.0 4.5


Dealing with Missing Data
REMOVE / REPLACE NaN (cont’d)
# Fill NaN from the previous value
# (fillna(method='ffill') in older pandas versions)
>>> df.ffill()
s1 s2
a 1 NaN
b 1 NaN
c 3 3.5
d 4 4.5



Dealing with Missing Data
REMOVE / REPLACE NaN (cont’d)
# Interpolate NaNs away
>>> df.interpolate()
s1 s2
a 1 NaN
b 2 NaN
c 3 3.5
d 4 4.5



Dealing with Missing Data
REMOVE / REPLACE NaN (cont’d)
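# Remove rows where all values are missing
# (shown on the forward-filled df, where no row is entirely NaN)
>>> df.dropna(how='all')
s1 s2
a 1 NaN
b 1 NaN
c 3 3.5
d 4 4.5
# Remove rows with any missing value
>>> df.dropna(how='any')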
s1 s2
c 3 3.5
d 4 4.5


Give it a try!
Create a DataFrame
import numpy as np
import pandas as pd
t = np.linspace(0, 2*np.pi, 25)
df = pd.DataFrame({'X': np.sin(t),
                   'X2': np.sin(t),
                   'Y': 0.5 + np.random.randn(25)},
                  index=t)
df.iloc[5:12, 0] = np.nan
df.loc[np.pi] = np.nan
df.iloc[::2, -1] = np.nan
1.  Drop rows where all the values are missing.
2.  Interpolate missing values in X using the quadratic method.
3.  Replace missing values in Y with the mean of Y.
4.  Plot the resulting DataFrame
Dates and Times
Dealing with Date & Time
CREATING DATE/TIME INDEXES
# The index can be a list of date/time locations
>>> pd.date_range('2000-01-01', periods=4)

# Specify frequency: us, ms, S, T, H, D, B,
# W, M, 3min, 2h20min, 2W, etc., or use an offset
# object from pd.offsets (pd.datetools in older pandas versions)
>>> r = pd.date_range('2000-01-01', periods=72,
                      freq='H')
>>> _ = pd.date_range('2000-01-01', periods=3,
                      freq=pd.offsets.Easter())
>>> i = pd.date_range('2000-01-01', periods=3,
                      freq='3min')
>>> ts = pd.Series(range(3), index=i)

Dealing with Date & Time
UP-/DOWN-SAMPLING
>>> ts.resample('T').mean()

# Group hourly data into daily

>>> ts2 = pd.Series(np.random.randn(72),
                    index=r)
>>> ts2.resample('D', closed='left',
                 label='left').mean()
Dealing with Date & Time II
TIME ALIGNMENT
# Data alignment based on time is one of Pandas'
# most celebrated features
>>> yearly = pd.date_range('2000', freq='AS',
                           periods=3)
>>> from numpy.random import rand
>>> df = pd.DataFrame(rand(3), index=yearly,
                      columns=['A'])
>>> df
>>> monthly = pd.date_range('2000', freq='MS', periods=3)
>>> df2 = pd.DataFrame(rand(3), index=monthly,
                       columns=['B'])
>>> df2
Dealing with Date & Time II
TIME ALIGNMENT (cont’d)
>>> df3 = pd.concat([df, df2], axis=1)
>>> df3

INDEXING AND SLICING
# Index with datetime object
>>> df3.loc[pd.to_datetime('2000')]

# Partial string matching is slicing
>>> df3.loc['2000']
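A minimal sketch of time alignment and partial string indexing follows. The frames and values are made up for illustration, and the indexes are built with `pd.to_datetime` so the example does not depend on frequency aliases.

```python
import pandas as pd

yearly = pd.DataFrame({'A': [1.0, 2.0]},
                      index=pd.to_datetime(['2000-01-01', '2001-01-01']))
monthly = pd.DataFrame({'B': [10.0, 20.0, 30.0]},
                       index=pd.to_datetime(['2000-01-01', '2000-02-01',
                                             '2000-03-01']))

# concat along columns aligns on the union of both indexes;
# dates missing from one frame are filled with NaN
combined = pd.concat([yearly, monthly], axis=1).sort_index()

# Partial string indexing selects every row in the year 2000
year_2000 = combined.loc['2000']

print(combined)
print(year_2000)
```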
Give it a try!
Download the Apple stock prices since 2010 (backup
data in pandas/finger_exercises/
Apple_stock_prices_local.ipynb)
import pandas_datareader.data as web
aapl = web.get_data_yahoo('AAPL', '1/1/2010')
1.  Plot the Close price for 2014
2.  Print the data from the last 4 weeks (see the .last
method)
3.  Extract the adjusted close column (“AdjClose”),
resample the full data to a monthly period and plot.
Do this 3 times, using the min, max, and mean of
the resampling window.
4.  Create a date range 4 steps, starting January 7, 2010
at 15:00, with a frequency of 1 h and 20 minutes
Computations and Statistics
Computations with DataFrames
Rule 1: Operations between multiple Pandas
objects implement auto alignment based
on index first.
Rule 2: Mathematical operators (+-*/exp, log, …)
apply element by element, on the values.
Rule 3: Reduction operations (mean, std, skew,
kurt, sum, prod, …) are applied column by
column or row by row.
Rule 4: Missing values propagate through binary
operations (+-*), but not reducing
operations (mean, sum)
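Rules 1 and 4 are the ones that most often surprise newcomers, so here is a minimal sketch with two made-up Series whose indexes only partly overlap.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Rule 1: values are matched by index label, not by position
total = s1 + s2

# Rule 4: labels present on one side only become NaN ...
missing_labels = list(total[total.isna()].index)

# ... but reductions skip NaN by default
total_sum = total.sum()

print(total)
print(missing_labels)  # ['a', 'd']
print(total_sum)       # 35.0
```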

Computations with Series
USE BUILT IN METHODS
>>> s
a    4
b   -5
c    6

# Methods apply to entire Series
>>> s.abs()
a    4
b    5
c    6
>>> s.sum()
5

Computations with Series
APPLY CUSTOM FUNCTIONS
# NumPy ufunc applies to entire Series
>>> s.apply(np.exp)

# Or a custom function to each value
>>> def str_len(x):
        return len(str(x))
>>> s.apply(str_len)

# Could be done with:
# >>> s.astype('str').str.len()
Computations with DataFrames
# Computations are applied column-by-column
>>> df
>>> df.sum()

# Adding a Series or rescaling aligns on column
#names
>>> row = df.iloc[1]
>>> df - row
Computations with DataFrames
DATAFRAME TRANSFORMATION
# applymap is similar but receives a value and
# returns a value
>>> def str_len(x):
        return len(str(x))
>>> df.applymap(str_len)
Computations with DataFrames
DATAFRAME REDUCTION
# 'apply' a custom function to columns. The
# function receives a column (Series) and returns
# a value
>>> def peak_to_peak(x):
        return x.max() - x.min()
>>> df.apply(peak_to_peak, axis=0)
Statistical Analysis
DESCRIPTIVE STATS
>>> df
# Descriptive stats available: count, sum, mean,
# median, min, max, abs, prod, std, var, skew,kurt,
# quantile, cumsum, cumprod, cummax,
# Stats on DF are column by column
>>> df.mean()
>>> df.mean(axis=1)
# min/max location (Series only)
>>> df[‘B’].argmin()

>>> df.describe()
Statistical Analysis
WINDOWED STATS
# The same descriptive stats are available as
# "rolling stats"
# For example,
>>> t = s.rolling(window=20).mean()
# Custom function on ndarray supported with
# apply (raw=True passes each window as an ndarray):
>>> def short_win_mean(x):
        return x[1:-1].mean()
>>> s.rolling(20).apply(short_win_mean, raw=True)
Correlations
CORRELATIONS
# Correlations between Series
>>> ts.corr(ts2)

# Pair-wise correlations of the columns. Optional
# argument: method=, one of
# (‘pearson’, ‘kendall’, ‘spearman’)
>>> corr_matrix = df.corr()
# Pair-wise covariance of the columns
>>> cov_matrix = df.cov()
Give it a try!
Create this DataFrame:
import numpy as np
import pandas as pd
t = np.linspace(0, 10, 200)
df = pd.DataFrame({
    'X': np.sin(t) + 0.1*np.random.randn(200),
    'Y': np.sin(2*t) + 0.1*np.random.randn(200),
})
1.  Get the absolute value of df.
2.  Calculate the mean value of each column.
3.  Subtract the mean from each column using apply.
4.  Calculate the 10-sample rolling mean of df and plot it.
Split-Apply-Combine
Group-Based-Operations
RATIONALE
It is often necessary to apply different operations
on different subgroups
•  Traditionally handled by SQL-based systems
•  Pandas provides in-memory, SQL-like set of
operations

Group-Based-Operations
RATIONALE
General 'framework': split, apply, combine (Hadley
Wickham, R programmer):
•  Splitting the data into groups (based on some
criterion, e.g. column value)
•  Applying a function to each group independently
•  Combining the results back into a data structure
(e.g. DataFrame)
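The three steps can be spelled out by hand and compared with the one-line groupby form. This is a sketch with a made-up DataFrame; the column name 'Flag' follows the slides that come next.

```python
import pandas as pd

df = pd.DataFrame({'Flag': ['x', 'y', 'x', 'y'],
                   'A': [1, 2, 3, 4]})

# Split: one sub-DataFrame per distinct value of 'Flag'
pieces = {key: sub for key, sub in df.groupby('Flag')}

# Apply: reduce each piece independently
sums = {key: sub['A'].sum() for key, sub in pieces.items()}

# Combine: gather the per-group results into one Series
by_hand = pd.Series(sums, name='A')

# groupby performs all three steps in one expression
built_in = df.groupby('Flag')['A'].sum()

print(by_hand)
print(built_in)
```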

Data Aggregation: Split
SPLIT WITH groupby()
>>> df
# Group data by one column's value
>>> gb = df.groupby('Flag')
# gb is a groupby object
>>> gb.groups
# gb = iterator of tuples with group name and
# sub part of df
>>> for groupname, subdf in gb:
        print(groupname)
        print(subdf)

Data Aggregation: Split
SPLIT WITH groupby()
# Displays a subplot per group
>>> gb.boxplot(column=["A", "B"])
groupby() ON THE INDEX
>>> df2 = df.reset_index()
>>> def even(x):
        return x % 2 == 0
>>> gb2 = df2.groupby(even)
>>> gb2.groups

Data Aggregation: Apply-Combine
Three ways to apply: aggregate (or equivalently agg,
and built-in methods like sum) if each Series in each
group is turned into one value, transform if each
series in each group is modified but retains its
index, or apply in the most general case.
REDUCE WITH aggregate() or agg()
>>> gb.sum()
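The difference between the three ways to apply shows up in the shape of the result. The sketch below uses a made-up DataFrame and illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({'Flag': ['x', 'y', 'x', 'y'],
                   'A': [1.0, 2.0, 3.0, 4.0]})
gb = df.groupby('Flag')['A']

# aggregate: one value per group, indexed by group name
group_means = gb.agg('mean')

# transform: same length and index as the original column
demeaned = gb.transform(lambda x: x - x.mean())

# apply: the general case; here the function returns one value per group
spread = gb.apply(lambda x: x.max() - x.min())

print(group_means)
print(demeaned)
print(spread)
```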

Data Aggregation: Apply-Combine
# More flexible but slower
>>> summed = gb.aggregate(np.sum)
# Given a list or dict
>>> gb.agg([np.mean, np.std])
>>> gb.agg({'A': 'sum', 'B': 'std'})
>>> def demean(x):
        return x - x.mean()
>>> gb.transform(demean)
Give it a try!
Create this time series of random values:
df = pd.DataFrame(1 + 3*np.random.randn(18, 2),
                  index=np.random.randn(18),
                  columns=['A', 'B'])
labels = np.array(list('xyz')*6)
np.random.shuffle(labels)
df['C'] = labels
1.  Group by column “C” and print the mean and
standard deviation (std)
2.  Bonus: Do the same with a single command.
3.  Loop over the data grouped by column “C” and
print the groups’ name and data
