
APPENDIX D

PYTHON PRIMER

Python has become the programming language of choice for many researchers and
practitioners in data science and machine learning. This appendix gives a brief
introduction to the language. As the language is under constant development and each year
many new packages are being released, we do not pretend to be exhaustive in this
introduction. Instead, we hope to provide enough information for novices to get started
with this beautiful and carefully thought-out language.

D.1 Getting Started


The main website for Python is

https://www.python.org/,

where you will find documentation, a tutorial, beginners’ guides, software examples, and
so on. It is important to note that there are two incompatible “branches” of Python, called
Python 3 and Python 2. Further development of the language will involve only Python 3,
and in this appendix (and indeed the rest of the book) we only consider Python 3. As there
are many interdependent packages that are frequently used with a Python installation, it
is convenient to install a distribution, for instance the Anaconda Python distribution,
available from

https://www.anaconda.com/.

The Anaconda installer automatically installs the most important packages and also
provides a convenient interactive development environment (IDE), called Spyder.

Use the Anaconda Navigator to launch Spyder or Jupyter Notebook, to install and update
packages, or to open a command-line terminal.

To get started [1], try out the Python statements in the input boxes that follow. You can
either type these statements at the IPython command prompt or run them as (very short)
Python programs. The output for these two modes of input can differ slightly. For
example, typing a variable name in the console causes its contents to be automatically printed,
whereas in a Python program this must be done explicitly by calling the print function.
Selecting (highlighting) several program lines in Spyder and then pressing function key
F9 [2] is equivalent to executing these lines one by one in the console.

[1] We assume that you have installed all the necessary files and have launched Spyder.

In Python, data is represented as an object or relation between objects (see also
Section D.2). Basic data types are numeric types (including integers, booleans, and floats),
sequence types (including strings, tuples, and lists), sets, and mappings (currently,
dictionaries are the only built-in mapping type).
Strings are sequences of characters, enclosed by single or double quotes. We can print
strings via the print function.

print("Hello World!")


Hello World!

For pretty-printing output, Python strings can be formatted using the format method. The
bracket syntax {i} provides a placeholder for the i-th variable to be printed, with 0 being
the first index. Individual variables can be formatted separately and as desired; formatting
syntax is discussed in more detail in Section D.9.

print("Name:{1} (height {2} m, age {0})".format(111, "Bilbo", 0.84))
Name:Bilbo (height 0.84 m, age 111)

Lists can contain different types of objects, and are created using square brackets as in the
following example:

x = [1, 'string', "another string"]  # Quote type is not important
x
[1, 'string', 'another string']

Elements in lists are indexed starting from 0, and are mutable (can be changed):

x = [1, 2]
x[0] = 2  # Note that the first index is 0
x
[2, 2]

In contrast, tuples (with round brackets) are immutable (cannot be changed). Strings are
immutable as well.

x = (1, 2)
x[0] = 2
TypeError: 'tuple' object does not support item assignment

Lists can be accessed via the slice notation [start:end]. It is important to note that end
is the index of the first element that will not be selected, and that the first element has index
0. To gain familiarity with the slice notation, execute each of the following lines.
[2] This may depend on the keyboard and operating system.

a = [2, 3, 5, 7, 11, 13, 17, 19, 23]
a[1:4]   # Elements with index from 1 to 3
a[:4]    # All elements with index less than 4
a[3:]    # All elements with index 3 or more
a[-2:]   # The last two elements
[3, 5, 7]
[2, 3, 5, 7]
[7, 11, 13, 17, 19, 23]
[19, 23]
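
The slice notation also accepts an optional third component, [start:end:step], which is not used in the lines above. The following is a minimal sketch of a few such slices on the same list; the variable names are chosen here purely for illustration.

```python
a = [2, 3, 5, 7, 11, 13, 17, 19, 23]

every_second = a[::2]    # from start to end in steps of 2
reversed_a = a[::-1]     # a negative step walks through the list backwards
middle = a[2:-2]         # drop the first two and the last two elements
copy_of_a = a[:]         # a full slice builds a new list with the same elements

print(every_second)
print(reversed_a)
print(middle)
print(copy_of_a == a, copy_of_a is a)
```

The last line illustrates that a slice is a new list object: it compares equal to a, but it is not the same object.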

An operator is a programming language construct that performs an action on one or more
operands. The action of an operator in Python depends on the type of the operand(s). For
example, operators such as +, *, -, and %, which are arithmetic operators when the operands
are of a numeric type, can have different meanings for objects of non-numeric type (such
as strings).

'hello' + 'world'  # String concatenation
'helloworld'

'hello' * 2  # String repetition
'hellohello'

[1, 2] * 2  # List repetition
[1, 2, 1, 2]

15 % 4  # Remainder of 15/4
3

Some common Python operators are given in Table D.1.

D.2 Python Objects


As mentioned in the previous section, data in Python is represented by objects or relations
between objects. We recall that basic data types include strings and numeric types (such
as integers, booleans, and floats).
As Python is an object-oriented programming language, functions are objects too
(everything is an object!). Each object has an identity (unique to each object and immutable
— that is, cannot be changed — once created), a type (which determines which operations
can be applied to the object, and is considered immutable), and a value (which is either
mutable or immutable). The unique identity assigned to an object obj can be found by
calling id, as in id(obj).
Each object has a list of attributes, and each attribute is a reference to another object.
The function dir applied to an object returns the list of attributes. For example, a string
object has many useful attributes, as we shall shortly see. Functions are objects with the
__call__ attribute.

A class (see Section D.8) can be thought of as a template for creating a custom type of
object.

s = "hello"
d = dir(s)
print(d, flush=True)  # flush=True forces the output to be written immediately
['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
... (many left out) ... 'replace', 'rfind',
'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split',
'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']

Any attribute attr of an object obj can be accessed via the dot notation: obj.attr. To
find more information about any object, use the help function.

s = "hello"
help(s.replace)
replace(...) method of builtins.str instance
    S.replace(old, new[, count]) -> str

    Return a copy of S with all occurrences of substring
    old replaced by new. If the optional argument count is
    given, only the first count occurrences are replaced.

This shows that the attribute replace is in fact a function. An attribute that is a function is
called a method. We can use the replace method to create a new string from the old one
by changing certain characters.

s = 'hello'
s1 = s.replace('e', 'a')
print(s1)
hallo
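
Because strings are immutable, replace cannot alter s itself; it returns a new string object. The short sketch below checks this and also shows the error raised when a character is assigned directly. The try/except construct is not introduced in this appendix and is used here only to keep the example running.

```python
s = 'hello'
s1 = s.replace('e', 'a')   # builds a new string object

print(s)                   # the original string is untouched
print(s1)
print(s is s1)             # False: two distinct objects

try:
    s[1] = 'a'             # strings do not support item assignment
except TypeError as err:
    print('TypeError:', err)
```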

In many Python editors, pressing the TAB key, as in objectname.<TAB>, will bring
up a list of possible attributes via the editor’s autocompletion feature.

D.3 Types and Operators


Each object has a type. Three basic data types in Python are str (for string), int (for
integers), and float (for floating point numbers). The function type returns the type of
an object.

t1 = type([1, 2, 3])
t2 = type((1, 2, 3))
t3 = type({1, 2, 3})
print(t1, t2, t3)
<class 'list'> <class 'tuple'> <class 'set'>

The assignment operator, =, assigns an object to a variable; e.g., x = 12. An expression
is a combination of values, operators, and variables that yields another value or variable.

Variable names are case sensitive and can only contain letters, digits, and
underscores. They must start with either a letter or an underscore. Note that reserved words
such as True and False are case sensitive as well.

Python is a dynamically typed language, and the type of a variable at a particular point
during program execution is determined by its most recent object assignment. That is, the
type of a variable does not need to be explicitly declared from the outset (as is the case in
C or Java), but instead the type of the variable is determined by the object that is currently
assigned to it.
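
As a small illustration of dynamic typing, the sketch below rebinds the same name to objects of three different types; the values are arbitrary.

```python
x = 12                 # x refers to an int object
print(type(x))
x = 'twelve'           # the same name now refers to a str object
print(type(x))
x = [1, 2.0, 'three']  # ... and now to a list holding mixed types
print(type(x))
```

No declaration is needed at any point; the type reported by type(x) simply follows the most recent assignment.
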
It is important to understand that a variable in Python is a reference to an object;
think of it as a label on a shoe box. Even though the label is a simple entity, the contents
of the shoe box (the object to which the variable refers) can be arbitrarily complex. Instead
of moving the contents of one shoe box to another, it is much simpler to merely move the
label.

x = [1, 2]
y = x                  # y refers to the same object as x
print(id(x) == id(y))  # check that the object ids are the same
y[0] = 100             # change the contents of the list that y refers to
print(x)
True
[100, 2]

x = [1, 2]
y = x                  # y refers to the same object as x
y = [100, 2]           # now y refers to a different object
print(id(x) == id(y))
print(x)
False
[1, 2]
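
When an independent list is wanted rather than a second label on the same shoe box, the list can be copied explicitly. A minimal sketch using the list method copy follows; note that this is a shallow copy, so objects nested inside the list would still be shared.

```python
x = [1, 2]
y = x.copy()           # a new list with the same elements (list(x) or x[:] also work)
y[0] = 100             # only the copy changes

print(x is y)          # False: two different objects
print(x)
print(y)
```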

Table D.1 shows a selection of Python operators for numerical and logical variables.

Table D.1: Common numerical (left) and logical (right) operators.

+    addition            ~    binary NOT
-    subtraction         &    binary AND
*    multiplication      ^    binary XOR
**   power               |    binary OR
/    division            ==   equal to
//   integer division    !=   not equal to
%    modulus


Several of the numerical operators can be combined with an assignment operator, as in
x += 1 to mean x = x + 1. Operators such as + and * can be defined for other data types
as well, where they take on a different meaning. This is called operator overloading, an
example of which is the use of <List> * <Integer> for list repetition as we saw earlier.
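
Operator overloading is also available for user-defined types. The sketch below uses a hypothetical class Vec2 (not part of Python or of this book) that defines the special methods __add__ and __mul__, which Python calls for + and *, respectively. Classes are discussed in Section D.8.

```python
class Vec2:
    """Hypothetical 2D vector type, used only to illustrate overloading."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):      # called for  self + other
        return Vec2(self.x + other.x, self.y + other.y)

    def __mul__(self, c):          # called for  self * c  (c a number)
        return Vec2(c * self.x, c * self.y)

    def __repr__(self):            # text shown by print
        return 'Vec2({}, {})'.format(self.x, self.y)

u = Vec2(1, 2)
v = Vec2(3, 4)
print(u + v)
print(u * 2)
```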

D.4 Functions and Methods


Functions make it easier to divide a complex program into simpler parts. To create a
function, use the following syntax:

def <function name>(<parameter_list>):
    <statements>

A function takes a list of input variables that are references to objects. Inside the
function, a number of statements are executed which may modify the objects, but not the
reference itself. In addition, the function may return an output object (or will return the value
None if not explicitly instructed to return output). Think again of the shoe box analogy. The
input variables of a function are labels of shoe boxes, and the objects to which they refer
are the contents of the shoe boxes. The following program highlights some of the subtleties
of variables and objects in Python.

Note that the statements within a function must be indented. This is Python’s way to
define where a function begins and ends.

x = [1, 2, 3]

def change_list(y):
    y.append(100)  # Append an element to the list referenced by y
    y[0] = 0       # Modify the first element of the same list
    y = [2, 3, 4]  # The local y now refers to a different list
                   # The list to which y first referred does not change
    return sum(y)

print(change_list(x))
print(x)
9
[0, 2, 3, 100]

Variables that are defined inside a function only have local scope; that is, they are
recognized only within that function. This allows the same variable name to be used in
different functions without creating a conflict. If any variable is used within a function,
Python first checks if the variable has local scope. If this is not the case (the variable has
not been defined inside the function), then Python searches for that variable outside the
function (the global scope). The following program illustrates several important points.

from numpy import array, square, sqrt

x = array([1.2, 2.3, 4.5])

def stat(x):
    n = len(x)  # the length of x
    meanx = sum(x)/n
    stdx = sqrt(sum(square(x - meanx))/n)
    return [meanx, stdx]

print(stat(x))
[2.6666666666666665, 1.3719410418171119]

1. Basic math functions such as sqrt are unknown to the standard Python interpreter
and need to be imported. More on this in Section D.5 below.

2. As was already mentioned, indentation is crucial. It shows where the function begins
and ends.

3. No semicolons [3] are needed to end lines, but the first line of the function definition
(here line 5) must end with a colon (:).

4. Lists are not arrays (vectors of numbers), and vector operations cannot be performed
on lists. However, the numpy module is designed specifically with efficient
vector/matrix operations in mind. On the second code line, we define x as a vector
(ndarray) object. Functions such as square, sum, and sqrt are then applied to
such arrays. Note that we used the default Python functions len and sum. More on
numpy in Section D.10.

5. Running the program with stat(x) instead of print(stat(x)) in line 11 will not
show any output in the console.
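
To make point 4 concrete, the following sketch (assuming numpy is installed) applies the same operators to a list and to an ndarray built from it. The names lst and arr are illustrative only.

```python
import numpy as np

lst = [1, 2, 3]
arr = np.array(lst)    # ndarray built from the list

print(lst * 2)         # list repetition
print(arr * 2)         # element-wise multiplication
print(lst + lst)       # list concatenation
print(arr + arr)       # element-wise addition
```

For the list, * and + repeat and concatenate; for the array, they act as vector operations.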

To display the complete list of built-in functions, type (using double underscores)
dir(__builtins__).

D.5 Modules
A Python module is a programming construct that is useful for organizing code into
manageable parts. To each module with name module_name is associated a Python file
module_name.py containing any number of definitions, e.g., of functions, classes, and
variables, as well as executable statements. Modules can be imported into other programs
using the syntax: import <module_name> as <alias_name>, where <alias_name>
is a shorthand name for the module.

[3] Semicolons can be used to put multiple commands on a single line.


When imported into another Python file, the module name is treated as a namespace,
providing a naming system where each object has its unique name. For example, different
modules mod1 and mod2 can have different sum functions, but they can be distinguished by
prefixing the function name with the module name via the dot notation, as in mod1.sum and
mod2.sum. For example, the following code uses the sqrt function of the numpy module.

import numpy as np
np.sqrt(2)
1.4142135623730951

A Python package is simply a directory of Python modules; that is, a collection of
modules with additional startup information (some of which may be found in its __path__
attribute). Python’s built-in module is called __builtins__. Of the great many useful
Python modules, Table D.2 gives a few.

Table D.2: A few useful Python modules/packages.

datetime      Module for manipulating dates and times.
matplotlib    MATLAB™-type plotting package.
numpy         Fundamental package for scientific computing, including random
              number generation and linear algebra tools. Defines the ubiquitous
              ndarray class.
os            Python interface to the operating system.
pandas        Fundamental module for data analysis. Defines the powerful
              DataFrame class.
pytorch       Machine learning library that supports GPU computation.
scipy         Ecosystem for mathematics, science, and engineering, containing
              many tools for numerical computing, including those for integration,
              solving differential equations, and optimization.
requests      Library for performing HTTP requests and interfacing with the web.
seaborn       Package for statistical data visualization.
sklearn       Easy to use machine learning library.
statsmodels   Package for the analysis of statistical models.

The numpy package contains various subpackages, such as random, linalg, and fft.
More details are given in Section D.10.

When using Spyder, press Ctrl+I in front of any object, to display its help file in a
separate window.

As we have already seen, it is also possible to import only specific functions from a
module using the syntax: from <module_name> import <fnc1, fnc2, ...>.

from numpy import sqrt, cos
sqrt(2)
cos(1)
1.4142135623730951
0.54030230586813965

This avoids the tedious prefixing of functions with the module name (or its alias). However,
for large programs it is good practice to always use the prefix/alias name construction, so
that it is clear which module each function belongs to.
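
The value of keeping the module prefix can be seen with two standard-library modules that both define a function called sqrt. This is a small sketch; math and cmath are not otherwise used in this appendix, and the alias cm is an arbitrary choice.

```python
import math            # real-valued mathematical functions
import cmath as cm     # complex-valued counterparts, imported under an alias

print(math.sqrt(4))    # sqrt from the math namespace
print(cm.sqrt(-4))     # sqrt from the cmath namespace accepts negative input
```

Had both functions been imported with from ... import sqrt, the second import would have silently replaced the first.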

D.6 Flow Control


Flow control in Python is similar to that of many programming languages, with conditional
statements as well as while and for loops. The syntax for if-then-else flow control is
as follows.

if <condition1>:
    <statements>
elif <condition2>:
    <statements>
else:
    <statements>

Here, <condition1> and <condition2> are logical conditions that are either True or
False; logical conditions often involve comparison operators (such as ==, >, <=, !=).
In the example above, there is one elif part, which allows for an “else if” conditional
statement. In general, there can be more than one elif part, or it can be omitted. The else
part can also be omitted. The colons are essential, as are the indentations.
The while and for loops have the following syntax.

while <condition>:
    <statements>

for <variable> in <collection>:
    <statements>

Above, <collection> is an iterable object (see Section D.7 below). For further
control in for and while loops, one can use a break statement to exit the current loop, and
the continue statement to continue with the next iteration of the loop, while abandoning
any remaining statements in the current iteration. Here is an example.

import numpy as np
ans = 'y'
while ans != 'n':
    outcome = np.random.randint(1, 6+1)
    if outcome == 6:
        print("Hooray a 6!")
        break
    else:
        print("Bad luck, a", outcome)
    ans = input("Again? (y/n) ")
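
The dice program uses break but not continue. As a complementary, deterministic sketch, the loop below combines both statements in a for loop; the numbers are arbitrary.

```python
total = 0
for k in range(1, 11):
    if k % 2 == 0:
        continue       # skip even numbers and move on to the next k
    if k > 7:
        break          # leave the loop entirely once an odd k exceeds 7
    total += k         # reached only for k = 1, 3, 5, 7
print(total)
```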

D.7 Iteration
Iterating over a sequence of objects, such as used in a for loop, is a common operation.
To better understand how iteration works, we consider the following code.

s = "Hello"
for c in s:
    print(c, '*', end=' ')
H * e * l * l * o *

A string is an example of a Python object that can be iterated. One of the methods of a
string object is __iter__. Any object that has such a method is called an iterable. Calling
this method creates an iterator, that is, an object that returns the next element in the sequence
to be iterated. This is done via the method __next__.

s = "Hello"
t = s.__iter__()     # t is now an iterator. Same as iter(s)
print(t.__next__())  # same as next(t)
print(t.__next__())
print(t.__next__())
H
e
l

The inbuilt functions next and iter simply call these corresponding
double-underscore functions of an object. When executing a for loop, the sequence/collection
over which to iterate must be an iterable. During the execution of the for loop, an iterator
is created and the next function is executed until there is no next element. An iterator is
also an iterable, so can be used in a for loop as well. Lists, tuples, and strings are so-called
sequence objects and are iterables, where the elements are iterated by their index.
One of the most common iterables in Python is the range object, which allows iteration over
a range of indices. Note that range returns a range object, not a list.

for i in range(4, 20):
    print(i, end=' ')
print(range(4, 20))
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
range(4, 20)

Similar to Python’s slice operator [i : j], the iterator range(i, j) ranges from i to j,
not including the index j.
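
The description of how a for loop uses an iterator can be sketched by hand. The code below is only an illustration of the protocol, not how one would normally write a loop; it relies on the fact that next raises a StopIteration exception when no element is left, and uses try/except (not covered in this appendix) to detect it.

```python
s = "Hello"

collected = []
it = iter(s)                 # same as s.__iter__()
while True:
    try:
        c = next(it)         # same as it.__next__()
    except StopIteration:    # raised when there is no next element
        break
    collected.append(c)

print(collected)
print(collected == list(s))  # the same elements a for loop would visit
```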

Two other common iterables are sets and dictionaries. Python sets are, as in
mathematics, unordered collections of unique objects. Sets are defined with curly brackets { }, as
opposed to round brackets ( ) for tuples, and square brackets [ ] for lists. Unlike lists, sets do
not have duplicate elements. Many of the usual set operations are implemented in Python,
including the union A | B and intersection A & B.

A = {3, 2, 2, 4}
B = {4, 3, 1}
C = A & B
for i in A:
    print(i)
print(C)
2
3
4
{3, 4}
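
A few further set operations on the same two sets are sketched below. The operators - and ^ (difference and symmetric difference) and the in membership test are standard Python, although they are not discussed in the text above.

```python
A = {3, 2, 2, 4}
B = {4, 3, 1}

print((A | B) == {1, 2, 3, 4})   # union
print((A - B) == {2})            # difference: in A but not in B
print((A ^ B) == {1, 2})         # symmetric difference: in exactly one of the sets
print(2 in A, 2 in B)            # membership tests
print(len(A))                    # the duplicate 2 was stored only once
```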

A useful way to construct lists is by list comprehension; that is, by expressions of the
form

[<expression> for <element> in <list> if <condition>]

For sets a similar construction holds. In this way, lists and sets can be defined using very
similar syntax as in mathematics. Compare, for example, the mathematical definition of
the sets $A := \{3, 2, 4, 2\} = \{2, 3, 4\}$ (no order and no duplication of elements) and
$B := \{x^2 : x \in A\}$ with the Python code below.

setA = {3, 2, 4, 2}
setB = {x**2 for x in setA}
print(setB)
listA = [3, 2, 4, 2]
listB = [x**2 for x in listA]
print(listB)
{16, 9, 4}
[9, 4, 16, 4]

A dictionary is a set-like data structure, containing one or more key:value pairs
enclosed in curly brackets. The keys are often of the same type, but do not have to be; the
same holds for the values. Here is a simple example, storing the ages of Lord of the Rings
characters in a dictionary.

DICT = {'Gimly': 140, 'Frodo': 51, 'Aragorn': 88}
for key in DICT:
    print(key, DICT[key])
Gimly 140
Frodo 51
Aragorn 88
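
Some other common dictionary operations are sketched below on the same data. The added entry for 'Sam' and its value are made up for illustration; items and get are standard dictionary methods.

```python
ages = {'Gimly': 140, 'Frodo': 51, 'Aragorn': 88}   # same data as above

for name, age in ages.items():       # iterate over (key, value) pairs
    print(name, age)

ages['Sam'] = 38                     # add a new key:value pair (made-up value)
print('Sam' in ages)                 # membership is tested on the keys
print(ages.get('Bilbo', 'unknown'))  # default value instead of a KeyError
oldest = max(ages, key=ages.get)     # key with the largest value
print(oldest)
```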

D.8 Classes
Recall that objects are of fundamental importance in Python; indeed, data types and
functions are all objects. A class is an object type, and writing a class definition can be
thought of as creating a template for a new type of object. Each class contains a number
of attributes, including a number of inbuilt methods. The basic syntax for the creation of a
class is:

class <class_name>:
    def __init__(self):
        <statements>
    <statements>
The main inbuilt method is __init__, which is called whenever an instance of a class object
is created. For example, str is a class object (string class), but s = str('Hello') or simply
s = 'Hello', creates an instance, s, of the str class. Instance attributes are created
during initialization and their values may be different for different instances. In contrast, the
values of class attributes are the same for every instance. The variable self in the
initialization method refers to the current instance that is being created. Here is a simple example,
explaining how attributes are assigned.

class shire_person:
    def __init__(self, name):   # initialization method
        self.name = name        # instance attribute
        self.age = 0            # instance attribute
    address = 'The Shire'       # class attribute

print(dir(shire_person)[1:5], '...', dir(shire_person)[-2:])
                                # list of class attributes

p1 = shire_person('Sam')        # create an instance
p2 = shire_person('Frodo')      # create another instance
print(p1.__dict__)              # list of instance attributes

p2.race = 'Hobbit'              # add another attribute to instance p2
p2.age = 33                     # change instance attribute
print(p2.__dict__)

print(getattr(p1, 'address'))   # content of p1's class attribute
['__delattr__', '__dict__', '__dir__', '__doc__'] ...
['__weakref__', 'address']
{'name': 'Sam', 'age': 0}
{'name': 'Frodo', 'age': 33, 'race': 'Hobbit'}
The Shire

It is good practice to create all the attributes of the class object in the __init__ method,
but, as seen in the example above, attributes can be created and assigned anywhere, even
outside the class definition. More generally, attributes can be added to any object that has
a __dict__.
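
The difference between class and instance attributes becomes visible when one of them is reassigned. The sketch below uses a reduced version of the shire_person class; the new address values are made up for illustration.

```python
class shire_person:
    address = 'The Shire'          # class attribute, shared by all instances

    def __init__(self, name):
        self.name = name           # instance attribute

p1 = shire_person('Sam')
p2 = shire_person('Frodo')

p2.address = 'Bag End'             # creates an instance attribute on p2 only
print(p1.address, '|', p2.address)

shire_person.address = 'Hobbiton'  # rebinding the class attribute ...
print(p1.address, '|', p2.address) # ... affects p1, but p2 keeps its own value
print('address' in p1.__dict__, 'address' in p2.__dict__)
```

An attribute lookup on an instance first consults the instance's own __dict__ and only then the class, which is why p2 no longer sees the class value.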

An “empty” class can be created via

class <class_name>:
    pass

Python classes can be derived from a parent class by inheritance, via the following
syntax.

class <class_name>(<parent_class_name>):
    <statements>


The derived class (initially) inherits all of the attributes of the parent class.
As an example, the class shire_person below inherits the attributes name, age, and
address from its parent class person. This is done using the super function, used here
to refer to the parent class person without naming it explicitly. When creating a new
object of type shire_person, the __init__ method of the parent class is invoked, and
an additional instance attribute Shire_address is created. The dir function confirms that
Shire_address is an attribute only of shire_person instances.

class person:
    def __init__(self, name):
        self.name = name
        self.age = 0
        self.address = ' '

class shire_person(person):
    def __init__(self, name):
        super().__init__(name)
        self.Shire_address = 'Bag End'

p1 = shire_person("Frodo")
p2 = person("Gandalf")
print(dir(p1)[:1], dir(p1)[-3:])
print(dir(p2)[:1], dir(p2)[-3:])
['Shire_address'] ['address', 'age', 'name']
['__class__'] ['address', 'age', 'name']
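
Inheritance applies to methods as well as to attributes. In the sketch below, the method describe is hypothetical and added only for illustration: the derived class overrides it and reuses the parent's version via super. The inbuilt functions isinstance and issubclass confirm the relationship between the two classes.

```python
class person:
    def __init__(self, name):
        self.name = name

    def describe(self):                    # hypothetical method for illustration
        return self.name

class shire_person(person):
    def describe(self):                    # overrides the parent's method
        return super().describe() + ' of the Shire'

p1 = shire_person('Frodo')                 # __init__ is inherited from person
p2 = person('Gandalf')

print(p1.describe())
print(p2.describe())
print(isinstance(p1, person), isinstance(p2, shire_person))
print(issubclass(shire_person, person))
```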

D.9 Files
To write to or read from a file, a file first needs to be opened. The open function in Python
creates a file object that is iterable, and thus can be processed in a sequential manner in a
for or while loop. Here is a simple example.

fout = open('output.txt', 'w')
for i in range(0, 41):
    if i%10 == 0:
        fout.write('{:3d}\n'.format(i))
fout.close()

The first argument of open is the name of the file. The second argument specifies
if the file is opened for reading ('r'), writing ('w'), appending ('a'), and so on. See
help(open). Files are written in text mode by default, but it is also possible to write in
binary mode. The above program creates a file output.txt with 5 lines, containing the
strings 0, 10, . . . , 40. Note that if we had written fout.write(i) in the fourth line of the
code above, an error message would be produced, as the variable i is an integer, and not a
string. Recall that the expression string.format() is Python’s way to specify the format
of the output string.
The formatting syntax {:3d} indicates that the output should be formatted as a decimal
integer with a minimum width of three characters. As mentioned in the
introduction, bracket syntax {i} provides a placeholder for the i-th variable to be printed,
with 0 being the first index. The format for the output is further specified by {i:format},
where format is typically [4] of the form:
[width][.precision][type]
In this specification:

• width specifies the minimum width of output;

• precision specifies the number of digits to be displayed after the decimal point for
a floating point value of type f, or the total number of significant digits (before and
after the decimal point) for a floating point value of type g;

• type specifies the type of output. The most common types are s for strings, d for
integers, b for binary numbers, f for floating point numbers (floats) in fixed-point
notation, g for floats in general notation, and e for floats in scientific notation.

The following illustrates some behavior of formatting on numbers.

'{:5d}'.format(123)
'{:.4e}'.format(1234567890)
'{:.2f}'.format(1234567890)
'{:.2f}'.format(2.718281828)
'{:.3f}'.format(2.718281828)
'{:.3g}'.format(2.718281828)
'{:.3e}'.format(2.718281828)
'{0:3.3f}; {1:.4e};'.format(123.456789, 0.00123456789)
'  123'
'1.2346e+09'
'1234567890.00'
'2.72'
'2.718'
'2.72'
'2.718e+00'
'123.457; 1.2346e-03;'
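
The same format specifications can also be used in formatted string literals (f-strings), which place the expression directly inside the brackets. This is a sketch assuming Python 3.6 or later; f-strings are not otherwise used in this appendix, and the trailing | is added only to make the field width visible.

```python
name, age, height = 'Bilbo', 111, 0.84     # values from the example in Section D.1
e = 2.718281828

s1 = 'Name:{1} (height {2} m, age {0})'.format(age, name, height)
s2 = f'Name:{name} (height {height} m, age {age})'   # same text via an f-string
print(s1 == s2)

print(f'{123:5d}|')        # minimum width 5, right-aligned
print(f'{e:.3f}|')         # 3 digits after the decimal point
print(f'{e:.3g}|')         # 3 significant digits
print(f'{e:10.3e}|')       # width 10, scientific notation
```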

The following code reads the text file output.txt line by line, and prints the output
on the screen. To remove the newline \n character, we have used the strip method for
strings, which removes any whitespace from the start and end of a string.

fin = open('output.txt', 'r')
for line in fin:
    line = line.strip()  # strips a newline character
    print(line)
fin.close()
0
10
20
30
40

[4] More formatting options are possible.


When dealing with file input and output it is important to always close files. Files that
remain open, e.g., when a program finishes unexpectedly due to a programming error, can
cause considerable system problems. For this reason it is recommended to open files via
context management. The syntax is as follows.

with open('output.txt', 'w') as f:
    f.write('Hi there!')

Context management ensures that a file is correctly closed even when the program is
terminated prematurely. An example is given in the next program, which outputs the
most frequent words in Dickens’s A Tale of Two Cities, which can be downloaded from the book’s
GitHub site as ataleof2cities.txt.
Note that in the next program, the file ataleof2cities.txt must be placed in the
current working directory. The current working directory can be determined via import os
followed by cwd = os.getcwd().

numline = 0
DICT = {}
with open('ataleof2cities.txt', encoding="utf8") as fin:
    for line in fin:
        words = line.split()
        for w in words:
            if w not in DICT:
                DICT[w] = 1
            else:
                DICT[w] += 1
        numline += 1

sd = sorted(DICT, key=DICT.get, reverse=True)  # list of words, sorted by count

print("Number of unique words: {}\n".format(len(DICT)))
print("Ten most frequent words:\n")
print("{:8} {}".format("word", "count"))
print(15*'-')
for i in range(0, 10):
    print("{:8} {}".format(sd[i], DICT[sd[i]]))
Number of unique words: 19091

Ten most frequent words:

word     count
---------------
the      7348
and      4679
of       3949
to       3387
a        2768
in       2390
his      1911
was      1672
that     1650
I        1444
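
The counting loop above can be written more compactly with the Counter class from the standard-library collections module. The sketch below is an alternative, not the program used in this book; to stay self-contained it counts the words of a short made-up sentence rather than reading ataleof2cities.txt.

```python
from collections import Counter

# A short made-up sentence stands in for the text file used above.
text = "it was the best of times it was the worst of times"

counts = Counter(text.split())          # maps each word to its frequency
for word, count in counts.most_common(3):
    print("{:8} {}".format(word, count))
```

To apply this to a file, the words of each line could be passed to counts.update inside the same with open(...) block as before.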

D.10 NumPy
The package NumPy (module name numpy) provides the building blocks for scientific
computing in Python. It contains all the standard mathematical functions, such as sin,
cos, tan, etc., as well as efficient functions for random number generation, linear algebra,
and statistical computation.

import numpy as np  # import the package
x = np.cos(1)
data = [1, 2, 3, 4, 5]
y = np.mean(data)
z = np.std(data)
print('cos(1) = {0:1.8f} mean = {1} std = {2}'.format(x, y, z))
cos(1) = 0.54030231 mean = 3.0 std = 1.4142135623730951

D.10.1 Creating and Shaping Arrays


The fundamental data type in numpy is the ndarray. This data type allows for fast matrix
operations via highly optimized numerical libraries such as LAPACK and BLAS, in
contrast to (nested) lists. As such, numpy is often essential when dealing with large amounts
of quantitative data.
ndarray objects can be created in various ways. The following code creates a 2 × 3 × 2
array of zeros. Think of it as a 3-dimensional matrix or two stacked 3 × 2 matrices.

A = np.zeros([2, 3, 2])  # 2 by 3 by 2 array of zeros
print(A)
print(A.shape)           # size of the array along each dimension
print(type(A))           # A is an ndarray
[[[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]]
(2, 3, 2)
<class 'numpy.ndarray'>

We will be mostly working with 2D arrays; that is, ndarrays that represent ordinary
matrices. We can also use the range function and lists to create ndarrays via the array
function. Note that arange is numpy’s version of range, with the difference that arange
returns an ndarray object.

a = np.array(range(4))  # equivalent to np.arange(4)
b = np.array([0, 1, 2, 3])
C = np.array([[1, 2, 3], [3, 2, 1]])
print(a, '\n', b, '\n', C)
[0 1 2 3]
[0 1 2 3]
[[1 2 3]
 [3 2 1]]

The dimensions of an ndarray can be obtained via its shape attribute, which is a
tuple. Arrays can be reshaped via the reshape method. This does not change the current
ndarray object. To keep the reshaped array, the result must be assigned to a variable.

a = np.array(range(9)) # a is an ndarray of shape (9,)
print(a.shape)
A = a.reshape(3,3) # A is an ndarray of shape (3,3)
print(a)
print(A)
(9,)
[0 1 2 3 4 5 6 7 8]
[[0 1 2]
 [3 4 5]
 [6 7 8]]

One shape dimension for reshape can be specified as −1. The dimension is then
inferred from the other dimension(s).
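
As a small illustration of the inferred dimension, consider the following sketch. The array and the names B and C are chosen only for this example.

```python
import numpy as np

a = np.arange(12)            # 1D array of shape (12,)
B = a.reshape(3, -1)         # second dimension inferred as 12/3 = 4
C = a.reshape(-1, 2, 3)      # first dimension inferred as 12/(2*3) = 2
print(B.shape)               # (3, 4)
print(C.shape)               # (2, 2, 3)
```

Only one dimension can be left as -1, since otherwise the shape would not be determined uniquely.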

The 'T' attribute of an ndarray gives its transpose. Note that the transpose of a “vector”
with shape (n, ) is the same vector. To distinguish between column and row vectors, reshape
such a vector to an n × 1 and 1 × n array, respectively.

a = np. arange (3) #1D array ( vector ) of shape (3,)


print(a)
print(a.shape)
b = a. reshape (-1,1) # 3x1 array ( matrix ) of shape (3 ,1)
print(b)
print(b.T)
A = np. arange (9). reshape (3 ,3)
print(A.T)
[0 1 2]
(3,)
[[0]
[1]
[2]]
[[0 1 2]]
[[0 3 6]
[1 4 7]
[2 5 8]]

Two useful methods of joining arrays are hstack and vstack, where the arrays are
joined horizontally and vertically, respectively.

A = np.ones ((3 ,3))


B = np.zeros ((3 ,2))
C = np. hstack ((A,B))
print(C)
[[ 1. 1. 1. 0. 0.]
[ 1. 1. 1. 0. 0.]
[ 1. 1. 1. 0. 0.]]
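
For completeness, a corresponding sketch for vstack is given here. The shapes are chosen for illustration only; for vertical stacking the arrays must have the same number of columns.

```python
import numpy as np

A = np.ones((2, 3))          # 2x3 array of ones
B = np.zeros((1, 3))         # 1x3 array of zeros
D = np.vstack((A, B))        # stack vertically: column counts must agree
print(D.shape)               # (3, 3)
print(D)
```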

D.10.2 Slicing
Arrays can be sliced similarly to Python lists. If an array has several dimensions, a slice for
each dimension needs to be specified. Recall that Python indexing starts at '0' and ends
at 'len(obj)-1'. The following program illustrates various slicing operations.

A = np.array(range (9)). reshape (3 ,3)


print(A)
print(A[0]) # first row
print(A[: ,1]) # second column
print(A[0 ,1]) # element in first row and second column
print(A[0:1 ,1:2]) # (1 ,1) ndarray containing A[0 ,1] = 1
print(A[1: , -1]) # elements in 2nd and 3rd rows , and last column
[[0 1 2]
[3 4 5]
[6 7 8]]
[0 1 2]
[1 4 7]
1
[[1]]
[5 8]

Note that ndarrays are mutable objects, so that elements can be modified directly, without
having to create a new object.

A[1: ,1] = [0 ,0] # change two elements in the matrix A above


print(A)
[[0 1 2]
 [3 0 5]
 [6 0 8]]

D.10.3 Array Operations


Basic mathematical operators and functions act element-wise on ndarray objects.

x = np.array ([[2 ,4] ,[6 ,8]])


y = np.array ([[1 ,1] ,[2 ,2]])
print(x+y)
[[ 3  5]
 [ 8 10]]

print(np. divide (x,y)) # same as x/y


[[ 2. 4.]
[ 3. 4.]]

print(np.sqrt(x))
[[1.41421356 2. ]
[2.44948974 2.82842712]]

To compute matrix products and inner products of vectors, numpy’s dot function can
be used, either as a method of an ndarray instance or as a function of the np module.

print(np.dot(x,y))
[[10 10]
 [22 22]]

print(x.dot(x)) # same as np.dot(x,x)


[[28 40]
 [60 88]]

Since version 3.5 of Python, it is possible to multiply two ndarrays using the @
operator (which implements the np.matmul function). For matrices, this is similar to using
the dot method. For higher-dimensional arrays the two methods behave differently.

print(x @ y)
[[10 10]
[22 22]]
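
The difference for higher-dimensional arrays can be seen in the following sketch, where the arrays P and Q are hypothetical stacks of two 2 × 2 matrices. The @ operator multiplies matching pairs of matrices, whereas dot combines every matrix in the first stack with every matrix in the second.

```python
import numpy as np

P = np.arange(8).reshape(2, 2, 2)    # stack of two 2x2 matrices
Q = np.arange(8).reshape(2, 2, 2)

M = P @ Q          # matmul: M[i] is the matrix product P[i] @ Q[i]
D = np.dot(P, Q)   # dot: D[i, :, k, :] is the matrix product P[i] @ Q[k]

print(M.shape)     # (2, 2, 2)
print(D.shape)     # (2, 2, 2, 2)
print(np.array_equal(M[1], D[1, :, 1, :]))   # True
```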

NumPy allows arithmetic operations on arrays of different shapes (dimensions). Specifically,
suppose two arrays have shapes $(m_1, m_2, \ldots, m_p)$ and $(n_1, n_2, \ldots, n_p)$, respectively,
where the shape with fewer dimensions is padded on the left with missing entries so that
both have length $p$. The arrays or shapes are said to be aligned if for all $i = 1, \ldots, p$ it
holds that

• $m_i = n_i$, or

• $\min\{m_i, n_i\} = 1$, or

• either $m_i$ or $n_i$, or both, are missing.

For example, shapes (1, 2, 3) and (4, 2, 1) are aligned, as are (3,) and (1, 2, 3). However,
(2, 2, 2) and (1, 2, 3) are not aligned. NumPy “duplicates” the array elements across the
smaller dimension to match the larger dimension. This process is called broadcasting and
is carried out without actually making copies, thus providing efficient memory use. Below
are some examples.

import numpy as np
A= np. arange (4). reshape (2 ,2) # (2 ,2) array

x1 = np.array ([40 ,500]) # (2,) array


x2 = x1. reshape (2 ,1) # (2 ,1) array

print(A + x1) # shapes (2 ,2) and (2,)


print(A * x2) # shapes (2 ,2) and (2 ,1)
[[ 40 501]
[ 42 503]]
[[ 0 40]
[1000 1500]]
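
Whether two shapes are aligned can also be checked directly. The sketch below assumes NumPy version 1.20 or later, where the function np.broadcast_shapes is available; it returns the shape of the broadcast result and raises a ValueError for shapes that are not aligned.

```python
import numpy as np

print(np.broadcast_shapes((1, 2, 3), (4, 2, 1)))   # (4, 2, 3)
print(np.broadcast_shapes((3,), (1, 2, 3)))        # (1, 2, 3)

try:
    np.broadcast_shapes((2, 2, 2), (1, 2, 3))      # last dimensions 2 and 3 clash
    aligned = True
except ValueError:
    aligned = False
print(aligned)                                      # False
```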

Note that above x1 is duplicated row-wise and x2 column-wise. Broadcasting also applies
to the matrix-wise operator @, as illustrated below. Here, the matrix b is duplicated across
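the third dimension resulting in the two matrix multiplications
$$\begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 4 & 5 \\ 6 & 7 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}.$$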

B = np. arange (8). reshape (2 ,2 ,2)


b = np. arange (4). reshape (2 ,2)
print(B@b)
[[[ 2 3]
[ 6 11]]

[[10 19]
[14 27]]]

Functions such as sum, mean, and std can also be executed as methods of an ndarray
instance. The argument axis can be passed to specify along which dimension the function
is applied. By default axis=None.

a = np.array(range (4)). reshape (2 ,2)


print(a.sum(axis =0)) # summing over rows gives column totals
[2 4]
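
The effect of the axis argument can be compared across its possible values, as in this short sketch for a 2 × 2 array.

```python
import numpy as np

a = np.arange(4).reshape(2, 2)   # [[0 1], [2 3]]
total = a.sum()                  # axis=None: sum of all elements
rows = a.sum(axis=1)             # summing over columns gives row totals
colmeans = a.mean(axis=0)        # averaging over rows gives column means
print(total)                     # 6
print(rows)                      # [1 5]
print(colmeans)                  # [1. 2.]
```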

D.10.4 Random Numbers


One of the sub-modules in numpy is random. It contains many functions for random vari-
able generation.

import numpy as np
np. random .seed (123) # set the seed for the random number generator
x = np. random . random () # uniform (0 ,1)
y = np. random . randint (5 ,9) # discrete uniform 5 ,... ,8
z = np. random .randn (4) # array of four standard normals
print(x,y,'\n',z)
0.6964691855978616 7
[ 1.77399501 -0.66475792 -0.07351368 1.81403277]

For more information on random variable generation in numpy, see

https://docs.scipy.org/doc/numpy/reference/random/index.html.
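
Recent versions of numpy (1.17 onward) also provide a generator-based interface to the same functionality. The following sketch repeats the example above with np.random.default_rng; the generated numbers will differ from those shown earlier, because a different underlying generator is used.

```python
import numpy as np

rng = np.random.default_rng(123)      # generator object with its own seed
x = rng.random()                      # uniform on [0, 1)
y = rng.integers(5, 9)                # discrete uniform on 5,...,8
z = rng.standard_normal(4)            # array of four standard normals
print(x, y, '\n', z)
```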

D.11 Matplotlib
The main Python graphics library for 2D and 3D plotting is matplotlib, and its subpack-
age pyplot contains a collection of functions that make plotting in Python similar to that
in MATLAB.

D.11.1 Creating a Basic Plot


The code below illustrates various possibilities for creating plots. The style and color of
lines and markers can be changed, as well as the font size of the labels. Figure D.1 shows
the result.

sqrtplot.py
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(0, 10, 0.1)
u = np.arange(0,10)
y = np.sqrt(x)
v = u/3
plt.figure(figsize = [4,2]) # size of plot in inches
plt.plot(x,y, 'g--') # plot green dashed line
plt.plot(u,v,'r.') # plot red dots
plt.xlabel('x')
plt.ylabel('y')
plt.tight_layout()
plt.savefig('sqrtplot.pdf',format='pdf') # saving as pdf
plt.show() # both plots will now be drawn

[Plot with horizontal axis x (0 to 10) and vertical axis y (0 to 3).]

Figure D.1: A simple plot created using pyplot.

The library matplotlib also allows the creation of subplots. The scatterplot and histogram
in Figure D.2 have been produced using the code below. When creating a histogram there
are several optional arguments that affect the layout of the graph. The number of bins is
determined by the parameter bins (the default is 10). Scatterplots also take a number of
parameters, such as a string c which determines the color of the dots, and alpha which
affects the transparency of the dots.

histscat.py
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(1000)
u = np.random.randn(100)
v = np.random.randn(100)
plt.subplot(121) # first subplot
plt.hist(x,bins=25, facecolor='b')
plt.xlabel('X Variable')
plt.ylabel('Counts')
plt.subplot(122) # second subplot
plt.scatter(u,v,c='b', alpha=0.5)
plt.show()

[Left panel: histogram with horizontal axis "X Variable" and vertical axis "Counts". Right panel: scatterplot.]

Figure D.2: A histogram and scatterplot.

One can also create three-dimensional plots as illustrated below.

surf3dscat.py
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D

def npdf(x,y):
    return np.exp(-0.5*(pow(x,2)+pow(y,2)))/np.sqrt(2*np.pi)

x, y = np.random.randn(100), np.random.randn(100)
z = npdf(x,y)

xgrid, ygrid = np.linspace(-3,3,100), np.linspace(-3,3,100)
Xarray, Yarray = np.meshgrid(xgrid,ygrid)
Zarray = npdf(Xarray,Yarray)

fig = plt.figure(figsize=plt.figaspect(0.4))
ax1 = fig.add_subplot(121, projection='3d')
ax1.scatter(x,y,z, c='g')
ax1.set_xlabel('$x$')
ax1.set_ylabel('$y$')
ax1.set_zlabel('$f(x,y)$')

ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(Xarray,Yarray,Zarray,cmap='viridis',
                 edgecolor='none')
ax2.set_xlabel('$x$')
ax2.set_ylabel('$y$')
ax2.set_zlabel('$f(x,y)$')

plt.show()

[Left panel: three-dimensional scatterplot. Right panel: three-dimensional surface plot. Axes in both panels: x, y, and f(x, y).]

Figure D.3: Three-dimensional scatter- and surface plots.

D.12 Pandas
The Python package Pandas (module name pandas) provides various tools and data struc-
tures for data analytics, including the fundamental DataFrame class.

For the code in this section we assume that pandas has been imported via
import pandas as pd.

D.12.1 Series and DataFrame


The two main data structures in pandas are Series and DataFrame. A Series object can
be thought of as a combination of a dictionary and a one-dimensional ndarray. The syntax
for creating a Series object is

series = pd.Series(<data>, index=['index'])

Here, <data> is some one-dimensional data structure, such as a one-dimensional ndarray,
a list, or a dictionary, and index is a list of names of the same length as <data>. When
<data> is a dictionary and index is omitted, the index is created from the keys of the
dictionary. When <data> is an ndarray or a list and index is omitted, the default index
will be [0, ..., len(data)-1].

DICT = {'one':1, 'two':2, 'three':3, 'four':4}
print(pd.Series(DICT))
one      1
two      2
three    3
four     4
dtype: int64

years = ['2000','2001','2002']
cost = [2.34, 2.89, 3.01]
print(pd.Series(cost, index = years, name = 'MySeries')) #name it
2000    2.34
2001    2.89
2002    3.01
Name: MySeries, dtype: float64
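
The description of a Series as a combination of a dictionary and a one-dimensional ndarray can be made concrete with a short sketch. The variable name s is chosen for this example only.

```python
import pandas as pd

s = pd.Series([2.34, 2.89, 3.01], index=['2000', '2001', '2002'])
print(s['2001'])          # dictionary-style lookup by label
print(s.iloc[0])          # array-style lookup by position
print('2002' in s)        # membership test on the index, as for dictionary keys
print((2 * s).tolist())   # element-wise arithmetic, as for an ndarray
```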

The most commonly used data structure in pandas is the two-dimensional DataFrame,
which can be thought of as pandas’ implementation of a spreadsheet or as a dictionary
in which each “key” of the dictionary corresponds to a column name and the dictionary
“value” is the data in that column. To create a DataFrame one can use the pandas
DataFrame constructor, which has three main arguments: data, index (row labels),
and columns (column labels).

DataFrame(<data>, index=['<row_name>'], columns=['<column_name>'])

If the index is not specified, the default index is [0, ..., len(data)-1]. Data can
also be read directly from a CSV or Excel file, as is done in Section 1.1. If a dictionary is
used to create the data frame (as below), the dictionary keys are used as the column names.

DICT = {'numbers':[1,2,3,4], 'squared':[1,4,9,16]}
df = pd.DataFrame(DICT, index = list('abcd'))
print(df)
   numbers  squared
a        1        1
b        2        4
c        3        9
d        4       16

D.12.2 Manipulating Data Frames


Often data encoded in DataFrame or Series objects need to be extracted, altered, or com-
bined. Getting, setting, and deleting columns works in a similar manner as for dictionaries.
The following code illustrates various operations.

ages = [6,3,5,6,5,8,0,3]
d = {'Gender':['M', 'F']*4, 'Age': ages}
df1 = pd.DataFrame(d)
df1.at[0,'Age'] = 60 # change an element
df1.at[1,'Gender'] = 'Female' # change another element
df2 = df1.drop(columns='Age') # drop a column
df3 = df2.copy() # create a separate copy of df2
df3['Age'] = ages # add the original column
dfcomb = pd.concat([df1,df2,df3],axis=1) # combine the three dfs
print(dfcomb)
   Gender  Age  Gender  Gender  Age
0       M   60       M       M    6
1  Female    3  Female  Female    3
2       M    5       M       M    5
3       F    6       F       F    6
4       M    5       M       M    5
5       F    8       F       F    8
6       M    0       M       M    0
7       F    3       F       F    3

Note that the above DataFrame object has two Age columns. The expression
dfcomb['Age'] will return a DataFrame with both these columns.
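
The behavior for duplicate column names can be verified with a smaller, hypothetical example. Selecting by position via iloc is one way to pick out a single one of the duplicated columns.

```python
import pandas as pd

left = pd.DataFrame({'Gender': ['M', 'F'], 'Age': [60, 3]})
right = pd.DataFrame({'Gender': ['M', 'F'], 'Age': [6, 3]})
both = pd.concat([left, right], axis=1)   # columns: Gender, Age, Gender, Age

ages = both['Age']                 # the label matches two columns
print(type(ages).__name__)         # DataFrame
print(ages.shape)                  # (2, 2)
print(both.iloc[:, 1].tolist())    # first Age column, selected by position: [60, 3]
```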

Table D.3: Useful pandas methods for data manipulation.

agg           Aggregate the data using one or more functions.
apply         Apply a function to a column or row.
astype        Change the data type of a variable.
concat        Concatenate data objects.
replace       Find and replace values.
read_csv      Read a CSV file into a DataFrame.
sort_values   Sort by values along rows or columns.
stack         Stack a DataFrame.
to_excel      Write a DataFrame to an Excel file.

It is important to correctly specify the data type of a variable before embarking on


data summarization and visualization tasks, as Python may treat different types of objects
in dissimilar ways. Common data types for entries in a DataFrame object are float,
category, datetime, bool, and int. A generic object type is object.

d = {'Gender':['M', 'F', 'F']*4, 'Age': [6,3,5,6,5,8,0,3,6,6,7,7]}
df = pd.DataFrame(d)
print(df.dtypes)
df['Gender'] = df['Gender'].astype('category') # change the type
print(df.dtypes)
Gender    object
Age        int64
dtype: object
Gender    category
Age          int64
dtype: object

D.12.3 Extracting Information


Extracting statistical information from a DataFrame object is facilitated by a large
collection of methods (functions) in pandas. Table D.4 gives a selection of data inspection
methods. See Chapter 1 for their practical use. The code below provides several examples
of useful methods. The apply method allows one to apply general functions to columns
or rows of a DataFrame. These operations do not change the data. The loc indexer allows
for accessing elements (or ranges) in a data frame by label and acts similarly to the slicing
operation for lists and arrays, with the difference that the “stop” value is included, as
illustrated in the code below.

import numpy as np
import pandas as pd
ages = [6,3,5,6,5,8,0,3]
np.random.seed(123)
df = pd.DataFrame(np.random.randn(3,4), index = list('abc'),
                  columns = list('ABCD'))
print(df)
df1 = df.loc["b":"c","B":"C"] # create a partial data frame
print(df1)
meanA = df['A'].mean() # mean of 'A' column
print('mean of column A = {}'.format(meanA))
expA = df['A'].apply(np.exp) # exp of all elements in 'A' column
print(expA)
          A         B         C         D
a -1.085631  0.997345  0.282978 -1.506295
b -0.578600  1.651437 -2.426679 -0.428913
c  1.265936 -0.866740 -0.678886 -0.094709
          B         C
b  1.651437 -2.426679
c -0.866740 -0.678886
mean of column A = -0.13276486552118785
a    0.337689
b    0.560683
c    3.546412
Name: A, dtype: float64
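
The inclusion of the “stop” value for label-based slicing can be contrasted with position-based slicing via iloc, which follows the usual Python convention. The data frame below is a hypothetical one, used only for this comparison.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]}, index=list('abcd'))
by_label = df.loc['b':'c']      # label slice: rows 'b' and 'c' (stop included)
by_position = df.iloc[1:2]      # positional slice: row 1 only (stop excluded)
print(by_label.index.tolist())     # ['b', 'c']
print(by_position.index.tolist())  # ['b']
```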

The groupby method of a DataFrame object is useful for summarizing data by group.
It groups data according to one or more specified columns, such that methods such as
count and mean can be applied to the grouped data.

Table D.4: Useful pandas methods for data inspection.

columns        Column names.
count          Counts number of non-NA cells.
crosstab       Cross-tabulate two or more categories.
describe       Summary statistics.
dtypes         Data types for each column.
head           Display the top rows of a DataFrame.
groupby        Group data by column(s).
info           Display information about the DataFrame.
loc            Access a group of rows or columns.
mean           Column/row mean.
plot           Plot of columns.
std            Column/row standard deviation.
sum            Returns column/row sum.
tail           Display the bottom rows of a DataFrame.
value_counts   Counts of different non-null values.
var            Variance.

df = pd.DataFrame({'W':['a','a','b','a','a','b'],
                   'X':np.random.rand(6),
                   'Y':['c','d','d','d','c','c'], 'Z':np.random.rand(6)})
print(df)
   W         X  Y         Z
0  a  0.993329  c  0.641084
1  a  0.925746  d  0.428412
2  b  0.266772  d  0.460665
3  a  0.201974  d  0.261879
4  a  0.529505  c  0.503112
5  b  0.006231  c  0.849683

print(df.groupby('W').mean(numeric_only=True)) # skip the non-numeric column Y
          X         Z
W
a  0.662639  0.458622
b  0.136502  0.655174

print(df.groupby(['W', 'Y']).mean())
            X         Z
W Y
a c  0.761417  0.572098
  d  0.563860  0.345145
b c  0.006231  0.849683
  d  0.266772  0.460665

To allow for multiple functions to be calculated at once, the agg method can be used.
It can take a list, dictionary, or string of functions.

print(df.groupby('W')[['X','Z']].agg(['sum','mean'])) # numeric columns only


X Z
sum mean sum mean
W
a 2.650555 0.662639 1.834487 0.458622
b 0.273003 0.136502 1.310348 0.655174
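
When a dictionary is passed to agg, each key names a column and each value gives the function or functions to apply to that column. The sketch below uses a small hypothetical data frame, so that the results are easy to check by hand.

```python
import pandas as pd

df = pd.DataFrame({'W': ['a', 'a', 'b'],
                   'X': [1.0, 3.0, 5.0],
                   'Z': [2.0, 4.0, 6.0]})
# dictionary: column name -> function name(s) to apply to that column
summary = df.groupby('W').agg({'X': 'mean', 'Z': ['min', 'max']})
print(summary)
```

The resulting columns are labeled by pairs such as ('X', 'mean') and ('Z', 'max').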

D.12.4 Plotting
The plot method of a DataFrame makes plots of a DataFrame using Matplotlib. Different
types of plot can be accessed via the kind = 'str' construction, where str is one of
line (default), bar, hist, box, kde, and several more. Finer control, such as modifying
the font, is obtained by using matplotlib directly. The following code produces the line
and box plots in Figure D.4.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = pd.DataFrame({'Normal':np.random.randn(100),
                   'Uniform':np.random.uniform(0,1,100)})
font = {'family' : 'serif', 'size' : 14} #set font
matplotlib.rc('font', **font) # change font
df.plot() # line plot (default)
df.plot(kind = 'box') # box plot
plt.show() # render plots

[Left panel: line plot of the columns Normal and Uniform against the row index (0 to 100). Right panel: box plots of Normal and Uniform.]

Figure D.4: A line and box plot using the plot method of DataFrame.

D.13 Scikit-learn
Scikit-learn is an open-source machine learning and data science library for Python. The
library includes a range of algorithms relating to the chapters in this book. It is widely
used due to its simplicity and its breadth. The module name is sklearn. Below is a brief
introduction to modeling data with sklearn. The full documentation can be found
at
https://scikit-learn.org/.

D.13.1 Partitioning the Data


Randomly partitioning the data in order to test the model may be achieved easily with
sklearn’s function train_test_split. For example, suppose that the training data is
described by the matrix X of explanatory variables and the vector y of responses. Then the
following code splits the data set into training and testing sets, with the testing set being
half of the total set.

from sklearn . model_selection import train_test_split


X_train , X_test , y_train , y_test = train_test_split (X, y,
test_size = 0.5)

As an example, the following code generates a synthetic data set and splits it into
equally-sized training and test sets.

syndat.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

np.random.seed(1234)

X = np.pi*(2*np.random.random(size=(400,2))-1)
y = (np.cos(X[:,0])*np.sin(X[:,1])>=0)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.5)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(X_train[y_train==0,0], X_train[y_train==0,1], c='g',
           marker='o',alpha=0.5)
ax.scatter(X_train[y_train==1,0], X_train[y_train==1,1], c='b',
           marker='o',alpha=0.5)
ax.scatter(X_test[y_test==0,0], X_test[y_test==0,1], c='g',
           marker='s',alpha=0.5)
ax.scatter(X_test[y_test==1,0], X_test[y_test==1,1], c='b',
           marker='s',alpha=0.5)

plt.savefig('sklearntraintest.pdf',format='pdf')
plt.show()

D.13.2 Standardization
In some instances it may be necessary to standardize the data. This may be done in
sklearn with scaling methods such as MinMaxScaler or StandardScaler. Scaling may
improve the convergence of gradient-based estimators and is useful when visualizing data
on vastly different scales. For example, suppose that X is our explanatory data (e.g., stored
as a numpy array), and we wish to standardize such that each value lies between 0 and 1.

Figure D.5: Example training (circles) and test (squares) set for two class classification.
Explanatory variables are the (x, y) coordinates, classes are zero (green) or one (blue).

from sklearn import preprocessing


min_max_scaler = preprocessing . MinMaxScaler ( feature_range =(0 , 1))
x_scaled = min_max_scaler . fit_transform (X)
# equivalent to:
x_scaled = (X - X.min(axis =0)) / (X.max(axis =0) - X.min(axis =0))
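
The two kinds of scaling can be written out with numpy alone, as in the sketch below. The data matrix is hypothetical. The second computation (zero mean and unit standard deviation per column, with the standard deviation computed with divisor n) is what StandardScaler is understood to do; this is stated here as an assumption rather than taken from the text above.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])           # hypothetical data: 3 samples, 2 features

# min-max scaling: every column is mapped onto [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# standardization: every column gets mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```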

D.13.3 Fitting and Prediction


Once the data has been partitioned and standardized if necessary, the data may be fitted to
a statistical model, e.g., a classification or regression model. For example, continuing with
our data from above, the following fits a model to the data and predicts the responses for
the test set.
from sklearn . someSubpackage import someClassifier
clf = someClassifier () # choose appropriate classifier
clf.fit(X_train , y_train ) # fit the data
y_prediction = clf. predict ( X_test ) # predict

Specific classifiers for logistic regression, naïve Bayes, linear and quadratic discrimin-
ant analysis, K-nearest neighbors, and support vector machines are given in Section 7.8.

D.13.4 Testing the Model


Once the model has made its prediction we may test its effectiveness, using relevant met-
rics. For example, for classification we may wish to produce the confusion matrix for the
test data. The following code does this for the data shown in Figure D.5, using a support
vector machine classifier.

from sklearn import svm


clf = svm.SVC( kernel = 'rbf ')
clf.fit( X_train , y_train )
y_prediction = clf. predict ( X_test )

from sklearn . metrics import confusion_matrix


print( confusion_matrix ( y_test , y_prediction ))
[[102 12]
[ 1 85]]
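
Summary metrics can be read off the confusion matrix. In sklearn's convention the rows correspond to the true classes and the columns to the predicted classes, so the diagonal holds the correctly classified points. The following sketch computes the overall accuracy and the per-class fraction of correct predictions from the matrix printed above.

```python
import numpy as np

cm = np.array([[102, 12],
               [  1, 85]])          # confusion matrix printed above

accuracy = np.trace(cm) / cm.sum()                # correctly classified / total
recall_per_class = np.diag(cm) / cm.sum(axis=1)   # fraction of each true class recovered
print(accuracy)                # 0.935
print(recall_per_class)
```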

D.14 System Calls, URL Access, and Speed-Up


Operating system commands (whether in Windows, MacOS, or Linux) for creating dir-
ectories, copying or removing files, or executing programs from the system shell can be
issued from within Python by using the package os. Another useful package is requests
which enables direct downloads of files and webpages from URLs. The following Python
script uses both. It also illustrates a simple example of exception handling in Python.
misc.py
import os
import requests
for c in "123456":
    try: # if it does not yet exist
        os.mkdir("MyDir"+ c) # make a directory
    except FileExistsError: # otherwise
        pass # do nothing

uname = ("https://github.com/DSML-book/Programs/tree/master/"
         "Appendices/Python Primer/")
fname = "ataleof2cities.txt"
r = requests.get(uname + fname)
print(r.text)
with open('MyDir1/ato2c.txt', 'wb') as f: # bytes mode is important here
    f.write(r.content) # write to a file
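
The exception handling used for the directories can be tried in isolation. The sketch below works in a temporary scratch directory (an assumption made so that the example leaves the working directory untouched) and also shows os.makedirs with exist_ok=True as an alternative that needs no try block.

```python
import os
import tempfile

base = tempfile.mkdtemp()                 # scratch directory, so nothing else is touched
target = os.path.join(base, "MyDir1")

os.mkdir(target)                          # first call creates the directory
try:
    os.mkdir(target)                      # second call fails: it already exists
    outcome = "created"
except FileExistsError:                   # catch only the expected exception
    outcome = "already exists"
print(outcome)                            # already exists

os.makedirs(target, exist_ok=True)        # alternative that needs no try/except
```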

The package numba can significantly speed up calculations via smart compilation. First
run the following code.
jitex.py
import time
import numpy as np
from numba import jit
n = 10**8

#@jit
def myfun(s,n):
    for i in range(1,n):
        s = s+ 1/i
    return s

start = time.perf_counter() # time.clock was removed in Python 3.8
print("Euler's constant is approximately {:9.8f}".format(
      myfun(0,n) - np.log(n)))
end = time.perf_counter()
print("elapsed time: {:3.2f} seconds".format(end-start))
Euler's constant is approximately 0.57721566
elapsed time: 5.72 seconds

Now remove the # character before the @ character in the code above, in order to
activate the “just in time” compiler. This gives a 15-fold speedup:
Euler's constant is approximately 0.57721566
elapsed time: 0.39 seconds

Further Reading
To learn Python, we recommend [82] and [110]. However, as Python is constantly evolving,
the most up-to-date references will be available from the Internet.
