Appendix D
PYTHON PRIMER
Python has become the programming language of choice for many researchers and
practitioners in data science and machine learning. This appendix gives a brief intro-
duction to the language. As the language is under constant development and each year
many new packages are being released, we do not pretend to be exhaustive in this in-
troduction. Instead, we hope to provide enough information for novices to get started
with this beautiful and carefully thought-out language.
https://www.anaconda.com/.
The Anaconda installer automatically installs the most important packages and also
provides a convenient interactive development environment (IDE), called Spyder.
Use the Anaconda Navigator to launch Spyder, Jupyter notebook, install and update
packages, or open a command-line terminal.
To get started [1], try out the Python statements in the input boxes that follow. You can
either type these statements at the IPython command prompt or run them as (very short)
Python programs. The output for these two modes of input can differ slightly. For example,
typing a variable name in the console causes its contents to be automatically printed,
whereas in a Python program this must be done explicitly by calling the print function.
Selecting (highlighting) several program lines in Spyder and then pressing function key
F9 [2] is equivalent to executing these lines one by one in the console.
[1] We assume that you have installed all the necessary files and have launched Spyder.
In Python, data is represented as an object or relation between objects (see also Section D.2).
Basic data types are numeric types (including integers, booleans, and floats),
sequence types (including strings, tuples, and lists), sets, and mappings (currently, diction-
aries are the only built-in mapping type).
Strings are sequences of characters, enclosed by single or double quotes. We can print
strings via the print function.
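As a small illustration (the strings are our own choice, not taken from the text), both quoting styles create the same kind of object:

```python
s1 = 'Hello'        # single quotes
s2 = "World"        # double quotes
print(s1, s2)       # several arguments are printed separated by a space
print(len(s1))      # a string is a sequence, so it has a length
```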
For pretty-printing output, Python strings can be formatted using the format method. The
bracket syntax {i} provides a placeholder for the i-th variable to be printed, with 0 being
the first index. Individual variables can be formatted separately and as desired; formatting
syntax is discussed in more detail in Section D.9.
print("Name:{1} (height {2} m, age {0})".format(111,"Bilbo",0.84))
Name:Bilbo (height 0.84 m, age 111)
Lists can contain different types of objects, and are created using square brackets as in the
following example:
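A minimal sketch of list creation; the particular elements are illustrative assumptions, chosen only to show that types can be mixed:

```python
x = [1, 'two', 3.0, [4, 5]]   # integer, string, float, and a nested list
print(x)
print(type(x))
print(len(x))                 # the nested list counts as one element
```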
Elements in lists are indexed starting from 0, and are mutable (can be changed):
x = [1,2]
x[0] = 2 # Note that the first index is 0
x
[2,2]
In contrast, tuples (with round brackets) are immutable (cannot be changed). Strings are
immutable as well.
x = (1,2)
x[0] = 2
TypeError: 'tuple' object does not support item assignment
Lists can be accessed via the slice notation [start:end]. It is important to note that end
is the index of the first element that will not be selected, and that the first element has index
0. To gain familiarity with the slice notation, execute each of the following lines.
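The following lines are a hedged sketch of typical slicing operations; the list a is a hypothetical example and not necessarily the one used in the original text.

```python
a = [2, 3, 5, 7, 11, 13, 17, 19, 23]
print(a[1:4])    # elements with index 1, 2, 3 (index 4 is excluded)
print(a[:4])     # the first four elements
print(a[3:])     # from index 3 to the end
print(a[-2:])    # the last two elements
print(a[::2])    # every second element
```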
[2] This may depend on the keyboard and operating system.
An operator is a programming language construct that performs an action on one or more
operands. The action of an operator in Python depends on the type of the operand(s). For
example, operators such as +, ∗, −, and % that are arithmetic operators when the operands
are of a numeric type, can have different meanings for objects of non-numeric type (such
as strings).
15 % 4 # Remainder of 15/4
3
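To see how the same operator symbol acts differently depending on the operand type, consider this small sketch (the operands are arbitrary illustrative values):

```python
print(2 + 3)               # numeric addition
print('hello' + 'world')   # string concatenation
print(3 * 'ab')            # string repetition
print([1, 2] + [3])        # list concatenation
print('%d apples' % 5)     # for strings, % performs (old-style) formatting
```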
A class (see Section D.8) can be thought of as a template for creating a custom type of
object.
s = "hello"
d = dir(s)
print(d,flush=True) # Print the list in "flushed" format
['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
... (many left out) ... 'replace', 'rfind',
'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split',
'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']
Any attribute attr of an object obj can be accessed via the dot notation: obj.attr. To
find more information about any object use the help function.
s = "hello"
help(s.replace)
replace(...) method of builtins.str instance
    S.replace(old, new[, count]) -> str
This shows that the attribute replace is in fact a function. An attribute that is a function is
called a method. We can use the replace method to create a new string from the old one
by changing certain characters.
s = 'hello'
s1 = s.replace('e','a')
print(s1)
hallo
In many Python editors, pressing the TAB key, as in objectname.<TAB>, will bring
up a list of possible attributes via the editor’s autocompletion feature.
The assignment operator, =, assigns an object to a variable; e.g., x = 12. An expression
is a combination of values, operators, and variables that yields another value or variable.
Variable names are case sensitive and can only contain letters, numbers, and under-
scores. They must start with either a letter or underscore. Note that reserved words
such as True and False are case sensitive as well.
Python is a dynamically typed language, and the type of a variable at a particular point
during program execution is determined by its most recent object assignment. That is, the
type of a variable does not need to be explicitly declared from the outset (as is the case in
C or Java), but instead the type of the variable is determined by the object that is currently
assigned to it.
It is important to understand that a variable in Python is a reference to an object;
think of it as a label on a shoe box. Even though the label is a simple entity, the contents
of the shoe box (the object to which the variable refers) can be arbitrarily complex. Instead
of moving the contents of one shoe box to another, it is much simpler to merely move the
label.
x = [1,2]
y = x # y refers to the same object as x
print(id(x) == id(y)) # check that the object id's are the same
y[0] = 100 # change the contents of the list that y refers to
print(x)
True
[100,2]
x = [1,2]
y = x # y refers to the same object as x
y = [100,2] # now y refers to a different object
print(id(x) == id(y))
print(x)
False
[1,2]
Table D.1 shows a selection of Python operators for numerical and logical variables.
A function takes a list of input variables that are references to objects. Inside the func-
tion, a number of statements are executed which may modify the objects, but not the ref-
erence itself. In addition, the function may return an output object (or will return the value
None if not explicitly instructed to return output). Think again of the shoe box analogy. The
input variables of a function are labels of shoe boxes, and the objects to which they refer
are the contents of the shoe boxes. The following program highlights some of the subtleties
of variables and objects in Python.
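A minimal sketch of this behavior, using two hypothetical functions (change_list and rebind_list are names of our own): the first modifies the object that its argument refers to, whereas the second only moves a local label and leaves the caller's object untouched.

```python
def change_list(v):
    v[0] = 100         # modifies the object that v refers to

def rebind_list(v):
    v = [7, 8, 9]      # the local label v now refers to a new object

x = [1, 2, 3]
change_list(x)
print(x)               # the list has been changed in place
rebind_list(x)
print(x)               # unchanged: only the local label was moved
print(change_list(x))  # no return statement, so the function returns None
```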
Note that the statements within a function must be indented. This is Python’s way to
define where a function begins and ends.
x = [1,2,3]
Variables that are defined inside a function only have local scope; that is, they are
recognized only within that function. This allows the same variable name to be used in
different functions without creating a conflict. If any variable is used within a function,
Python first checks if the variable has local scope. If this is not the case (the variable has
not been defined inside the function), then Python searches for that variable outside the
function (the global scope). The following program illustrates several important points.
from numpy import array, square, sqrt
x = array([1.2, 2.3, 4.5]) # values inferred from the printed output


def stat(x):
    n = len(x) #the length of x
    meanx = sum(x)/n
    stdx = sqrt(sum(square(x - meanx))/n)
    return [meanx,stdx]

print(stat(x))
[2.6666666666666665, 1.3719410418171119]
1. Basic math functions such as sqrt are unknown to the standard Python interpreter
and need to be imported. More on this in Section D.5 below.
2. As was already mentioned, indentation is crucial. It shows where the function begins
and ends.
3. No semicolons [3] are needed to end lines, but the first line of the function definition
(here line 5) must end with a colon (:).
4. Lists are not arrays (vectors of numbers), and vector operations cannot be performed
on lists. However, the numpy module is designed specifically with efficient vec-
tor/matrix operations in mind. On the second code line, we define x as a vector
(ndarray) object. Functions such as square, sum, and sqrt are then applied to
such arrays. Note that we used the default Python functions len and sum. More on
numpy in Section D.10.
5. Running the program with stat(x) instead of print(stat(x)) in line 11 will not
show any output in the console.
To display the complete list of built-in functions, type (using double underscores)
dir(__builtins__).
D.5 Modules
A Python module is a programming construct that is useful for organizing code into
manageable parts. To each module with name module_name is associated a Python file
module_name.py containing any number of definitions, e.g., of functions, classes, and
variables, as well as executable statements. Modules can be imported into other programs
using the syntax: import <module_name> as <alias_name>, where <alias_name>
is a shorthand name for the module.
[3] Semicolons can be used to put multiple commands on a single line.
When imported into another Python file, the module name is treated as a namespace,
providing a naming system where each object has its unique name. For example, different
modules mod1 and mod2 can have different sum functions, but they can be distinguished by
prefixing the function name with the module name via the dot notation, as in mod1.sum and
mod2.sum. For example, the following code uses the sqrt function of the numpy module.
import numpy as np
np.sqrt(2)
1.4142135623730951
The numpy package contains various subpackages, such as random, linalg, and fft.
More details are given in Section D.10.
When using Spyder, press Ctrl+I in front of any object, to display its help file in a
separate window.
As we have already seen, it is also possible to import only specific functions from a
module using the syntax: from <module_name> import <fnc1, fnc2, ...>.
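A minimal sketch of such a selective import. As an assumption, it uses the standard math module; the original program may have imported the same function names from a different module.

```python
from math import sqrt, cos   # import only the two functions that are needed
print(sqrt(2))               # no module prefix is required
print(cos(1))
```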
1.4142135623730951
0.54030230586813965
This avoids the tedious prefixing of functions via the (alias) of the module name. However,
for large programs it is good practice to always use the prefix/alias name construction, to
be able to clearly ascertain precisely which module a function being used belongs to.
if <condition1>:
    <statements>
elif <condition2>:
    <statements>
else:
    <statements>
Here, <condition1> and <condition2> are logical conditions that are either True or
False; logical conditions often involve comparison operators (such as ==, >, <=, !=).
In the example above, there is one elif part, which allows for an “else if” conditional
statement. In general, there can be more than one elif part, or it can be omitted. The else
part can also be omitted. The colons are essential, as are the indentations.
The while and for loops have the following syntax.
while <condition>:
    <statements>
for <variable> in <collection>:
    <statements>
Above, <collection> is an iterable object (see Section D.7 below). For further con-
trol in for and while loops, one can use a break statement to exit the current loop, and
the continue statement to continue with the next iteration of the loop, while abandoning
any remaining statements in the current iteration. Here is an example.
import numpy as np
ans = 'y'
while ans != 'n':
    outcome = np.random.randint(1,6+1)
    if outcome == 6:
        print("Hooray a 6!")
        break
    else:
        print("Bad luck, a", outcome)
    ans = input("Again? (y/n) ")
D.7 Iteration
Iterating over a sequence of objects, such as used in a for loop, is a common operation.
To better understand how iteration works, we consider the following code.
s = "Hello"
for c in s:
    print(c,'*', end=' ')
H * e * l * l * o *
A string is an example of a Python object that can be iterated. One of the methods of a
string object is __iter__. Any object that has such a method is called an iterable. Calling
this method creates an iterator, that is, an object that returns the next element in the sequence
to be iterated. This is done via the method __next__.
s = "Hello"
t = s.__iter__() # t is now an iterator. Same as iter(s)
print(t.__next__()) # same as next(t)
print(t.__next__())
print(t.__next__())
H
e
l
The inbuilt functions next and iter simply call these corresponding double-
underscore functions of an object. When executing a for loop, the sequence/collection
over which to iterate must be an iterable. During the execution of the for loop, an iterator
is created and the next function is executed until there is no next element. An iterator is
also an iterable, so can be used in a for loop as well. Lists, tuples, and strings are so-called
sequence objects and are iterables, where the elements are iterated by their index.
The most commonly used iterable in Python is the range object, which allows iteration over
a range of indices. Note that range returns a range object, not a list.
Similar to Python’s slice operator [i : j], range(i, j) ranges from i to j,
not including the index j.
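The behavior of range can be checked with a short sketch such as the following (the particular bounds are arbitrary illustrative choices):

```python
r = range(2, 6)
print(r)                        # a range object, not a list
print(list(r))                  # converting to a list shows the indices 2, 3, 4, 5
for i in range(3):              # same as range(0, 3)
    print(i)
print(list(range(10, 0, -3)))   # an optional third argument gives the step size
```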
Two other common iterables are sets and dictionaries. Python sets are, as in mathematics,
unordered collections of unique objects. Sets are defined with curly brackets { }, as
opposed to round brackets ( ) for tuples, and square brackets [ ] for lists. Unlike lists, sets do
not have duplicate elements. Many of the usual set operations are implemented in Python,
including the union A | B and intersection A & B.
A = {3, 2, 2, 4}
B = {4, 3, 1}
C = A & B
for i in A:
    print(i)
print(C)
2
3
4
{3, 4}
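A useful way to construct lists is by list comprehension; that is, by expressions of the
form
[<expression> for <element> in <iterable> if <condition>]
where the if part is optional. Using curly brackets instead of square brackets gives a set.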
setA = {3, 2, 4, 2}
setB = {x**2 for x in setA}
print(setB)
listA = [3, 2, 4, 2]
listB = [x**2 for x in listA]
print(listB)
{16, 9, 4}
[9, 4, 16, 4]
A dictionary is a set-like data structure, containing one or more key:value pairs enclosed
in curly brackets. The keys are often of the same type, but do not have to be; the
same holds for the values. Here is a simple example, storing the ages of Lord of the Rings
characters in a dictionary.
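A hedged sketch of such a dictionary follows. Only Bilbo's age (111) appears elsewhere in this appendix; the other names and ages are illustrative assumptions.

```python
ages = {'Bilbo': 111, 'Frodo': 51, 'Gimli': 140}   # illustrative values
ages['Sam'] = 39                # add a new key:value pair
print(ages['Bilbo'])            # look up a value by its key
for name in ages:               # iterating over a dictionary yields its keys
    print(name, ages[name])
print(list(ages.items()))       # the (key, value) pairs
```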
D.8 Classes
Recall that objects are of fundamental importance in Python; indeed, data types and
functions are all objects. A class is an object type, and writing a class definition can be
thought of as creating a template for a new type of object. Each class contains a number
of attributes, including a number of inbuilt methods. The basic syntax for the creation of a
class is:
class <class_name>:
    def __init__(self):
        <statements>
    <statements>
The main inbuilt method is __init__, which creates an instance of a class object.
For example, str is a class object (string class), but s = str('Hello') or simply
s = 'Hello', creates an instance, s, of the str class. Instance attributes are created dur-
ing initialization and their values may be different for different instances. In contrast, the
values of class attributes are the same for every instance. The variable self in the initializ-
ation method refers to the current instance that is being created. Here is a simple example,
explaining how attributes are assigned.
class shire_person:
    def __init__(self,name): # initialization method
        self.name = name # instance attribute
        self.age = 0 # instance attribute
    address = 'The Shire' # class attribute
It is good practice to create all the attributes of the class object in the __init__ method,
but, as seen in the example above, attributes can be created and assigned everywhere, even
outside the class definition. More generally, attributes can be added to any object that has
a __dict__.
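The following sketch illustrates attributes being added outside the class definition. The attribute names pipe and race, and their values, are hypothetical and serve only as an illustration.

```python
class shire_person:
    def __init__(self, name):
        self.name = name            # instance attribute
        self.age = 0                # instance attribute
    address = 'The Shire'           # class attribute

p1 = shire_person('Sam')
p2 = shire_person('Frodo')
p1.pipe = 'briar'                   # new attribute, for the instance p1 only
shire_person.race = 'hobbit'        # new class attribute, shared by all instances
print(p1.__dict__)                  # instance attributes of p1
print(hasattr(p2, 'pipe'), p2.race, p2.address)
```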
Python classes can be derived from a parent class by inheritance, via the following
syntax.
class <class_name>(<parent_class_name>):
    <statements>
The derived class (initially) inherits all of the attributes of the parent class.
As an example, the class shire_person below inherits the attributes name, age, and
address from its parent class person. This is done using the super function, used here
to refer to the parent class person without naming it explicitly. When creating a new
object of type shire_person, the __init__ method of the parent class is invoked, and
an additional instance attribute Shire_address is created. The dir function confirms that
Shire_address is an attribute only of shire_person instances.
class person:
    def __init__(self,name):
        self.name = name
        self.age = 0
        self.address = ' '

class shire_person(person):
    def __init__(self,name):
        super().__init__(name)
        self.Shire_address = 'Bag End' # illustrative value (assumed)

p1 = shire_person("Frodo")
p2 = person("Gandalf")
print(dir(p1)[:1], dir(p1)[-3:])
print(dir(p2)[:1], dir(p2)[-3:])
['Shire_address'] ['address', 'age', 'name']
['__class__'] ['address', 'age', 'name']
D.9 Files
To write to or read from a file, a file first needs to be opened. The open function in Python
creates a file object that is iterable, and thus can be processed in a sequential manner in a
for or while loop. Here is a simple example.
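A sketch of such a program, written to be consistent with the description in the surrounding text (a file output.txt with five lines, written by a call to fout.write on the fourth code line); the exact original listing is an assumption.

```python
fout = open('output.txt', 'w')              # open the file for writing
for i in range(0, 41):
    if i % 10 == 0:
        fout.write('{:3d}\n'.format(i))     # write expects a string, not an integer
fout.close()
```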
The first argument of open is the name of the file. The second argument specifies
if the file is opened for reading ('r'), writing ('w'), appending ('a'), and so on. See
help(open). Files are written in text mode by default, but it is also possible to write in
binary mode. The above program creates a file output.txt with 5 lines, containing the
strings 0, 10, . . . , 40. Note that if we had written fout.write(i) in the fourth line of the
code above, an error message would be produced, as the variable i is an integer, and not a
string. Recall that the expression string.format() is Python’s way to specify the format
of the output string.
The formatting syntax {:3d} indicates that the output should be an integer in decimal
notation, padded to a width of at least three characters. As mentioned in the
introduction, bracket syntax {i} provides a placeholder for the i-th variable to be printed,
with 0 being the first index. The format for the output is further specified by {i:format},
where format is typically [4] of the form:
[width][.precision][type]
In this specification:
• width specifies the minimum width of the output;
• precision specifies the number of digits to be displayed after the decimal point for
a floating point value of type f, or the number of digits before and after the decimal
point for a floating point value of type g;
• type specifies the type of output. The most common types are s for strings, d for
integers, b for binary numbers, f for floating point numbers (floats) in fixed-point
notation, g for floats in general notation, e for floats in scientific notation.
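The specification above can be tried out with a few format strings; the numbers below are arbitrary illustrative values, and the trailing | merely makes the padding visible.

```python
print('{:3d}|'.format(7))              # integer, width 3
print('{:8.3f}|'.format(3.14159265))   # fixed-point, width 8, 3 digits after the point
print('{:.3g}'.format(3.14159265))     # general notation, 3 significant digits
print('{:.2e}'.format(12345.678))      # scientific notation
print('{:b}'.format(5))                # binary
print('{1:s} is {0:d}'.format(111, 'Bilbo'))   # indexed placeholders with types
```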
The following code reads the text file output.txt line by line, and prints the output
on the screen. To remove the newline \n character, we have used the strip method for
strings, which removes any whitespace from the start and end of a string.
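A minimal sketch of reading a file line by line. So that it runs on its own, it first recreates output.txt; the list named lines is our own addition, used only to keep the stripped strings.

```python
# (Re)create output.txt so that this sketch is self-contained.
fout = open('output.txt', 'w')
for i in range(0, 41, 10):
    fout.write('{:3d}\n'.format(i))
fout.close()

lines = []
fin = open('output.txt', 'r')       # open the file for reading
for line in fin:                    # a file object is iterable, line by line
    line = line.strip()             # remove the newline and surrounding whitespace
    lines.append(line)
    print(line)
fin.close()
```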
[4] More formatting options are possible.
When dealing with file input and output it is important to always close files. Files that
remain open, e.g., when a program finishes unexpectedly due to a programming error, can
cause considerable system problems. For this reason it is recommended to open files via
context management. The syntax is as follows.
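Context management uses Python's with statement, whose general form is with open(<name>, <mode>) as <variable>: followed by an indented block. The sketch below assumes the file name output.txt used earlier in this section.

```python
with open('output.txt', 'w') as fout:
    for i in range(0, 41, 10):
        fout.write('{:3d}\n'.format(i))
# On leaving the indented block the file is closed automatically,
# even if an error occurred inside the block.
print(fout.closed)

with open('output.txt', 'r') as fin:
    content = [line.strip() for line in fin]
print(content)
```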
Context management ensures that a file is correctly closed even when the program is
terminated prematurely. An example is given in the next program, which outputs the most
frequent words in Dickens’s A Tale of Two Cities, which can be downloaded from the book’s
GitHub site as ataleof2cities.txt.
Note that in the next program, the file ataleof2cities.txt must be placed in the cur-
rent working directory. The current working directory can be determined via import os
followed by cwd = os.getcwd().
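numline = 0
DICT = {}
with open('ataleof2cities.txt', encoding="utf8") as fin:
    for line in fin:
        words = line.split()
        for w in words:
            if w not in DICT:
                DICT[w] = 1
            else:
                DICT[w] +=1
        numline += 1
sd = sorted(DICT, key=DICT.get, reverse=True) # words sorted by decreasing count
print("{:8} {}".format("word", "count"))
print(15*'-')
for w in sd[:10]:
    print("{:8} {}".format(w, DICT[w]))
word     count
---------------
the      7348
and      4679
of       3949
to       3387
a        2768
in       2390
his      1911
was      1672
that     1650
I        1444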
D.10 NumPy
The package NumPy (module name numpy) provides the building blocks for scientific
computing in Python. It contains all the standard mathematical functions, such as sin,
cos, tan, etc., as well as efficient functions for random number generation, linear algebra,
and statistical computation.
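As a hedged illustration of how an ndarray can be created (the variable name E is our own), the following sketch builds a three-dimensional array of zeros and prints its contents, shape, and type.

```python
import numpy as np
E = np.zeros([2, 3, 2])   # three-dimensional array filled with zeros
print(E)
print(E.shape)            # the shape is a tuple
print(type(E))
```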
[[ 0. 0.]
[ 0. 0.]
[ 0. 0.]]]
(2, 3, 2)
<class 'numpy.ndarray'>
We will be mostly working with 2D arrays; that is, ndarrays that represent ordinary
matrices. We can also use the range method and lists to create ndarrays via the array
method. Note that arange is numpy’s version of range, with the difference that arange
returns an ndarray object.
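A brief sketch of these ways of creating arrays; the variable names a, b, and c are assumptions for illustration.

```python
import numpy as np
a = np.array(range(4))                  # from a range object
b = np.array([[1, 2, 3], [3, 2, 1]])    # from a list of lists
c = np.arange(4)                        # arange returns an ndarray directly
print(a, type(c))
print(b)
```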
[[1 2 3]
[3 2 1]]
The dimension of an ndarray can be obtained via its shape attribute, which returns a
tuple. Arrays can be reshaped via the reshape method. This does not change the current
ndarray object. To make the change permanent, a new instance needs to be created.
One shape dimension for reshape can be specified as −1. The dimension is then
inferred from the other dimension(s).
The 'T' attribute of an ndarray gives its transpose. Note that the transpose of a “vector”
with shape (n, ) is the same vector. To distinguish between column and row vectors, reshape
such a vector to an n × 1 and 1 × n array, respectively.
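The reshaping and transposition rules described above can be verified with a short sketch (array contents are arbitrary):

```python
import numpy as np
a = np.arange(6)
b = a.reshape(2, 3)         # returns a new array object; a keeps its shape
print(a.shape, b.shape)
c = a.reshape(3, -1)        # the -1 dimension is inferred from the other one
print(c.shape)
print(b.T.shape)            # transpose of a 2 x 3 array
print(a.T.shape)            # transposing a "vector" of shape (6,) changes nothing
col = a.reshape(-1, 1)      # column vector: 6 x 1 array
row = a.reshape(1, -1)      # row vector: 1 x 6 array
print(col.shape, row.shape)
```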
Two useful methods of joining arrays are hstack and vstack, where the arrays are
joined horizontally and vertically, respectively.
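A minimal sketch of both joining functions, assuming small arrays of ones and zeros as inputs:

```python
import numpy as np
A = np.ones((3, 3))
B = np.zeros((3, 2))
C = np.hstack((A, B))                   # join side by side: shape (3, 5)
D = np.vstack((A, np.zeros((2, 3))))    # join on top of each other: shape (5, 3)
print(C)
print(D.shape)
```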
[[ 1. 1. 1. 0. 0.]
[ 1. 1. 1. 0. 0.]
[ 1. 1. 1. 0. 0.]]
D.10.2 Slicing
Arrays can be sliced similarly to Python lists. If an array has several dimensions, a slice for
each dimension needs to be specified. Recall that Python indexing starts at '0' and ends
at 'len(obj)-1'. The following program illustrates various slicing operations.
Note that ndarrays are mutable objects, so that elements can be modified directly, without
having to create a new object.
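The following sketch shows a few slicing operations on a two-dimensional array, followed by an in-place modification; the array A is a hypothetical example.

```python
import numpy as np
A = np.arange(12).reshape(3, 4)
print(A[0])           # first row
print(A[:, 1])        # second column
print(A[0:2, 1:3])    # rows 0 and 1, columns 1 and 2
print(A[-1, -1])      # last element of the last row
A[0, 0] = 100         # ndarrays are mutable: modify an element in place
print(A[0])
```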
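import numpy as np
x = np.array([[2,4],[6,8]]) # values inferred from the printed results
y = np.array([[1,1],[2,2]]) # values inferred from the printed results
print(np.sqrt(x))
[[1.41421356 2.        ]
 [2.44948974 2.82842712]]
print(np.dot(x,y))
[[10 10]
 [22 22]]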
Since version 3.5 of Python, it is possible to multiply two ndarrays using the @
operator (which implements the np.matmul function). For matrices, this is similar to using
the dot method. For higher-dimensional arrays the two methods behave differently.
print(x @ y)
[[10 10]
[22 22]]
• m_i = n_i, or
• min{m_i, n_i} = 1, or
For example, shapes (1, 2, 3) and (4, 2, 1) are aligned, as are (2, , ) and (1, 2, 3). However,
(2, 2, 2) and (1, 2, 3) are not aligned. NumPy “duplicates” the array elements across the
smaller dimension to match the larger dimension. This process is called broadcasting and
is carried out without actually making copies, thus providing efficient memory use. Below
are some examples.
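import numpy as np
A = np.arange(4).reshape(2,2) # (2,2) array
x1 = np.array([40,500]) # (2,) array; values inferred from the output
x2 = x1.reshape(2,1) # (2,1) array
print(A + x1)
print(A * x2)
[[ 40 501]
 [ 42 503]]
[[   0   40]
 [1000 1500]]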
Note that above x1 is duplicated row-wise and x2 column-wise. Broadcasting also applies
to the matrix-wise operator @, as illustrated below. Here, the matrix b is duplicated across
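the third dimension resulting in the two matrix multiplications
$$\begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 4 & 5 \\ 6 & 7 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}.$$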
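A sketch that reproduces these two matrix products; the array names B and b and their construction via arange are assumptions consistent with the matrices shown.

```python
import numpy as np
B = np.arange(8).reshape(2, 2, 2)   # a stack of two 2 x 2 matrices
b = np.arange(4).reshape(2, 2)      # a single 2 x 2 matrix
C = B @ b                           # b is reused for each matrix in the stack
print(C)
```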
[[10 19]
[14 27]]]
Functions such as sum, mean, and std can also be executed as methods of an ndarray
instance. The argument axis can be passed to specify along which dimension the function
is applied. By default axis=None.
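The effect of the axis argument can be seen in a small sketch (the array a is an arbitrary example):

```python
import numpy as np
a = np.array([[1., 2., 3.], [4., 5., 6.]])
print(a.sum())           # axis=None: sum over all elements
print(a.sum(axis=0))     # sum along the first dimension (column sums)
print(a.mean(axis=1))    # mean along the second dimension (row means)
print(a.std(axis=0))     # standard deviation of each column
```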
import numpy as np
np.random.seed(123) # set the seed for the random number generator
x = np.random.random() # uniform (0,1)
y = np.random.randint(5,9) # discrete uniform 5,...,8
z = np.random.randn(4) # array of four standard normals
print(x,y,'\n',z)
0.6964691855978616 7
 [ 1.77399501 -0.66475792 -0.07351368 1.81403277]
https://docs.scipy.org/doc/numpy/reference/random/index.html.
D.11 Matplotlib
The main Python graphics library for 2D and 3D plotting is matplotlib, and its subpack-
age pyplot contains a collection of functions that make plotting in Python similar to that
in MATLAB.
sqrtplot.py
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(0, 10, 0.1)
u = np.arange(0,10)
y = np.sqrt(x)
v = u/3
plt.figure(figsize = [4,2]) # size of plot in inches
plt.plot(x,y, 'g--') # plot green dashed line
plt.plot(u,v,'r.') # plot red dots
plt.xlabel('x')
plt.ylabel('y')
plt.tight_layout()
plt.savefig('sqrtplot.pdf',format='pdf') # saving as pdf
plt.show() # both plots will now be drawn
[Figure: plot with horizontal axis x (0 to 10) and vertical axis y (0 to 3).]
Figure D.1: A simple plot created using pyplot.
The library matplotlib also allows the creation of subplots. The scatterplot and histogram
in Figure D.2 have been produced using the code below. When creating a histogram there
are several optional arguments that affect the layout of the graph. The number of bins is
determined by the parameter bins (the default is 10). Scatterplots also take a number of
parameters, such as a string c which determines the color of the dots, and alpha which
affects the transparency of the dots.
histscat.py
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(1000)
u = np.random.randn(100)
v = np.random.randn(100)
plt.subplot(121) # first subplot
plt.hist(x,bins=25, facecolor='b')
plt.xlabel('X Variable')
plt.ylabel('Counts')
plt.subplot(122) # second subplot
plt.scatter(u,v,c='b', alpha=0.5)
plt.show()
[Figure D.2: a histogram (horizontal axis: X Variable; vertical axis: Counts) and a scatterplot.]
surf3dscat.py
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
def npdf(x,y):
    return np.exp(-0.5*(pow(x,2)+pow(y,2)))/np.sqrt(2*np.pi)
plt.show()
[Figure: two three-dimensional plots of f(x, y) over the (x, y) plane.]
D.12 Pandas
The Python package Pandas (module name pandas) provides various tools and data struc-
tures for data analytics, including the fundamental DataFrame class.
For the code in this section we assume that pandas has been imported via
import pandas as pd.
Here, <data> is some 1-dimensional data structure, such as a 1-dimensional ndarray, a list,
or a dictionary, and index is a list of names of the same length as <data>. When <data>
is a dictionary, the index is created from the keys of the dictionary. When <data> is an
ndarray and index is omitted, the default index will be [0, ..., len(data)-1].
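A hedged sketch of these cases, using the pd.Series constructor; the data values and names are illustrative assumptions.

```python
import pandas as pd
s1 = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])   # list with explicit index
s2 = pd.Series({'Bilbo': 111, 'Frodo': 51})              # index taken from the keys
s3 = pd.Series([10, 20, 30])                             # default index 0, 1, 2
print(s1['b'], list(s2.index), list(s3.index))
```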
If the index is not specified, the default index is [0, ..., len(data)-1]. Data can
also be read directly from a CSV or Excel file, as is done in Section 1.1. If a dictionary is
used to create the data frame (as below), the dictionary keys are used as the column names.
ages = [6,3,5,6,5,8,0,3]
d = {'Gender':['M', 'F']*4, 'Age': ages}
df1 = pd.DataFrame(d)
df1.at[0,'Age'] = 60 # change an element
df1.at[1,'Gender'] = 'Female' # change another element
df2 = df1.drop('Age', axis=1) # drop a column
df3 = df2.copy() # create a separate copy of df2
df3['Age'] = ages # add the original column
dfcomb = pd.concat([df1,df2,df3],axis=1) # combine the three dfs
print(dfcomb)
   Gender  Age  Gender  Gender  Age
0       M   60       M       M    6
1  Female    3  Female  Female    3
2       M    5       M       M    5
3       F    6       F       F    6
4       M    5       M       M    5
5       F    8       F       F    8
6       M    0       M       M    0
7       F    3       F       F    3
Note that the above DataFrame object has two Age columns. The expression
dfcomb['Age'] will return a DataFrame with both these columns.
Gender object
Age int64
dtype: object
Gender category
Age int64
dtype: object
import numpy as np
import pandas as pd
ages = [6,3,5,6,5,8,0,3]
np.random.seed(123)
df = pd.DataFrame(np.random.randn(3,4), index = list('abc'),
                  columns = list('ABCD'))
print(df)
df1 = df.loc["b":"c","B":"C"] # create a partial data frame
print(df1)
meanA = df['A'].mean() # mean of 'A' column
print('mean of column A = {}'.format(meanA))
expA = df['A'].apply(np.exp) # exp of all elements in 'A' column
print(expA)
          A         B         C         D
a -1.085631  0.997345  0.282978 -1.506295
b -0.578600  1.651437 -2.426679 -0.428913
c  1.265936 -0.866740 -0.678886 -0.094709
          B         C
b  1.651437 -2.426679
c -0.866740 -0.678886
mean of column A = -0.13276486552118785
a    0.337689
b    0.560683
c    3.546412
Name: A, dtype: float64
The groupby method of a DataFrame object is useful for summarizing and displaying
the data in manipulated ways. It groups data according to one or more specified columns,
such that methods such as count and mean can be applied to the grouped data.
To allow for multiple functions to be calculated at once, the agg method can be used.
It can take a list, dictionary, or string of functions.
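A minimal sketch of groupby and agg, assuming a small data frame with the Gender and Age columns used earlier in this section:

```python
import pandas as pd
df = pd.DataFrame({'Gender': ['M', 'F'] * 4,
                   'Age': [6, 3, 5, 6, 5, 8, 0, 3]})
g = df.groupby('Gender')            # group the rows by the values in 'Gender'
print(g.count())                    # number of rows per group
print(g.mean())                     # mean age per group
stats = g['Age'].agg(['min', 'max', 'mean'])   # several functions at once
print(stats)
```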
D.12.4 Plotting
The plot method of a DataFrame makes plots of a DataFrame using Matplotlib. Different
types of plot can be accessed via the kind = 'str' construction, where str is one of
line (default), bar, hist, box, kde, and several more. Finer control, such as modifying
the font, is obtained by using matplotlib directly. The following code produces the line
and box plots in Figure D.4.
import numpy as np
import pandas as pd
import matplotlib.pyplot # also makes the name matplotlib available
df = pd.DataFrame({'Normal':np.random.randn(100),
                   'Uniform':np.random.uniform(0,1,100)})
font = {'family' : 'serif', 'size' : 14} #set font
matplotlib.rc('font', **font) # change font
df.plot() # line plot (default)
df.plot(kind = 'box') # box plot
matplotlib.pyplot.show() # render plots
[Figure: line plot (index 0 to 100) and box plot of the Normal and Uniform columns.]
Figure D.4: A line and box plot using the plot method of DataFrame.
D.13 Scikit-learn
Scikit-learn is an open-source machine learning and data science library for Python. The
library includes a range of algorithms relating to the chapters in this book. It is widely
used due to its simplicity and its breadth. The module name is sklearn. Below is a brief
introduction to modeling data with sklearn. The full documentation can be found
at
https://scikit-learn.org/.
As an example, the following code generates a synthetic data set and splits it into
equally-sized training and test sets.
syndat.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
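The splitting step can be sketched as follows. This is not the syndat.py program itself: the data-generating rule, the seed, and the variable names are illustrative assumptions; only train_test_split and its test_size and random_state parameters are taken from sklearn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))            # synthetic (x, y) coordinates
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # hypothetical class labels 0 or 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)     # equally sized training and test sets
print(X_train.shape, X_test.shape)
```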
D.13.2 Standardization
In some instances it may be necessary to standardize the data. This may be done in
sklearn with scaling methods such as MinMaxScaler or StandardScaler. Scaling may
improve the convergence of gradient-based estimators and is useful when visualizing data
on vastly different scales. For example, suppose that X is our explanatory data (e.g., stored
as a numpy array), and we wish to standardize such that each value lies between 0 and 1.
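A minimal sketch of such a rescaling with MinMaxScaler, assuming a small hypothetical data matrix X whose two columns are on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., 200.], [2., 400.], [3., 1000.]])   # hypothetical data
scaler = MinMaxScaler()                 # by default rescales each column to [0, 1]
X_scaled = scaler.fit_transform(X)      # learn column minima/maxima, then transform
print(X_scaled)
```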
Figure D.5: Example training (circles) and test (squares) set for two class classification.
Explanatory variables are the (x, y) coordinates, classes are zero (green) or one (blue).
Specific classifiers for logistic regression, naïve Bayes, linear and quadratic discrimin-
ant analysis, K-nearest neighbors, and support vector machines are given in Section 7.8.
The package numba can significantly speed up calculations via smart compilation. First
run the following code.
jitex.py
import timeit
import numpy as np
from numba import jit
n = 10**8
#@jit
def myfun(s,n):
    for i in range(1,n):
        s = s+ 1/i
    return s
Now remove the # character before the @ character in the code above, in order to
activate the “just in time” compiler. This gives a 15-fold speedup:
Euler's constant is approximately 0.57721566
elapsed time: 0.39 seconds
Further Reading
To learn Python, we recommend [82] and [110]. However, as Python is constantly evolving,
the most up-to-date references will be available from the Internet.