Python Notes
Luckily, for all the major operating systems, namely Windows, Mac, and Linux, there are targeted installers for NumPy, SciPy, and Matplotlib. If you are unsure about the installation process, you might want to install the Enthought Python Distribution (https://www.enthought.com/products/epd_free.php) or Python(x,y) (http://code.google.com/p/pythonxy/wiki/Downloads), which come with all the aforementioned packages included.
Before we can talk about concrete machine learning algorithms, we have to talk about how best to store the data we will chew through. This is important, as even the most advanced learning algorithm will not be of any help to us if it never finishes. This may be simply because accessing the data is too slow, or maybe because its representation forces the operating system to swap all day. Add to this that Python is an interpreted language (a highly optimized one, though) that is slow for many numerically heavy algorithms compared to C or Fortran. So we might ask why on earth so many scientists and companies are betting their fortune on Python, even in highly computation-intensive areas.
The answer is that in Python it is very easy to offload number-crunching tasks to the lower layer in the form of a C or Fortran extension. That is exactly what NumPy and SciPy do (http://scipy.org/install.html). In this tandem, NumPy provides support for highly optimized multidimensional arrays, which are the basic data structure of most state-of-the-art algorithms. SciPy uses those arrays to provide a set of fast numerical recipes. Finally, Matplotlib (http://matplotlib.org/) is probably the most convenient and feature-rich library to plot high-quality graphs using Python.
Chewing data efficiently with NumPy and intelligently with SciPy
Let us quickly walk through some basic NumPy examples and then take a look at what SciPy provides on top of it. On the way, we will get our feet wet with plotting using the marvelous Matplotlib package.
You will find more interesting examples of what NumPy can offer at http://www.scipy.org/Tentative_NumPy_Tutorial.
You will also find the book NumPy Beginner's Guide - Second Edition, Ivan Idris, Packt Publishing, very valuable. Additional tutorial-style guides are available at http://scipy-lectures.github.com; you may also visit the official SciPy tutorial at http://docs.scipy.org/doc/scipy/reference/tutorial.
In this book, we will use NumPy version 1.6.2 and SciPy version 0.11.0.
Learning NumPy
So let us import NumPy and play a bit with it. For that, we need to start the Python interactive shell.

>>> import numpy
>>> numpy.version.full_version
1.6.2
As we do not want to pollute our namespace, we certainly should not do the following:

>>> from numpy import *

The numpy.array array would potentially shadow the array package that is included in standard Python. Instead, we will use the following convenient shortcut:
>>> import numpy as np
>>> a = np.array([0,1,2,3,4,5])
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.ndim
1
>>> a.shape
(6,)
We just created an array in a similar way to how we would create a list in Python. However, NumPy arrays have additional information about the shape. In this case, it is a one-dimensional array of six elements. No surprises so far.
We can now transform this array into a 2D matrix.
>>> b = a.reshape((3,2))
>>> b
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> b.ndim
2
>>> b.shape
(3, 2)
The funny thing starts when we realize just how much the NumPy package is optimized. For example, it avoids copies wherever possible.
>>> b[1][0] = 77
>>> b
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> a
array([ 0,  1, 77,  3,  4,  5])
In this case, we have modified the value 2 to 77 in b, and we can immediately see the same change reflected in a as well. Keep that in mind whenever you need a true copy.
>>> c = a.reshape((3,2)).copy()
>>> c
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> c[0][0] = -99
>>> a
array([ 0,  1, 77,  3,  4,  5])
>>> c
array([[-99,   1],
       [ 77,   3],
       [  4,   5]])
Here, c and a are totally independent copies.
Another big advantage of NumPy arrays is that the operations are propagated to the individual elements.
>>> a*2
array([  0,   2, 154,   6,   8,  10])
>>> a**2
array([   0,    1, 5929,    9,   16,   25])
Contrast that to ordinary Python lists:

>>> [1,2,3,4,5]*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> [1,2,3,4,5]**2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
Of course, by using NumPy arrays we sacrifice the agility Python lists offer. Simple operations like adding or removing elements are a bit complex for NumPy arrays. Luckily, we have both at our disposal, and we will use the right one for the task at hand.
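To illustrate that lack of agility, here is a short sketch (our own example, not from the chapter) of what adding and removing elements looks like with NumPy: unlike a list's append(), the array functions return new arrays rather than modifying anything in place.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5])

# np.append() and np.delete() cannot grow or shrink the array in place;
# they allocate and return a fresh array each time.
b = np.append(a, 6)   # new array with 6 appended
c = np.delete(a, 0)   # new array without the element at index 0

print(b)  # [0 1 2 3 4 5 6]
print(c)  # [1 2 3 4 5]
print(a)  # the original is untouched: [0 1 2 3 4 5]
```

For comparison, a Python list handles the same operations cheaply and in place with append() and del, which is exactly why we keep both structures at our disposal.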
Indexing
Part of the power of NumPy comes from the versatile ways in which its arrays can be accessed.
In addition to normal list indexing, it allows us to use arrays themselves as indices.
>>> a[np.array([2,3,4])]
array([77,  3,  4])
In addition to the fact that conditions are now propagated to the individual elements, we gain a very convenient way to access our data.
>>> a>4
array([False, False,  True, False, False,  True], dtype=bool)
>>> a[a>4]
array([77,  5])
This can also be used to trim outliers.
>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])
As this is a frequent use case, there is a special clip function for it, clipping the values at both ends of an interval with one function call as follows:

>>> a.clip(0, 4)
array([0, 1, 4, 3, 4, 4])
Handling non-existing values
The power of NumPy's indexing capabilities comes in handy when preprocessing data that we have just read in from a text file. It will most likely contain invalid values, which we will mark as not being a real number using numpy.NAN as follows:
>>> c = np.array([1, 2, np.NAN, 3, 4])  # let's pretend we have read this from a text file
>>> c
array([  1.,   2.,  nan,   3.,   4.])
>>> np.isnan(c)
array([False, False,  True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1.,  2.,  3.,  4.])
>>> np.mean(c[~np.isnan(c)])
2.5
Comparing runtime behaviors
Let us compare the runtime behavior of NumPy with normal Python lists. In the following code, we will calculate the sum of all squared numbers from 1 to 1000 and see how much time the calculation will take. We do it 10,000 times and report the total time so that our measurement is accurate enough.
import timeit
normal_py_sec = timeit.timeit('sum(x*x for x in xrange(1000))',
                              number=10000)
naive_np_sec = timeit.timeit('sum(na*na)',
                             setup="import numpy as np; na=np.arange(1000)",
                             number=10000)
good_np_sec = timeit.timeit('na.dot(na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)

print("Normal Python: %f sec" % normal_py_sec)
print("Naive NumPy: %f sec" % naive_np_sec)
print("Good NumPy: %f sec" % good_np_sec)

Normal Python: 1.157467 sec
Naive NumPy: 4.061293 sec
Good NumPy: 0.033419 sec
We make two interesting observations. First, just using NumPy as data storage (Naive NumPy) takes 3.5 times longer, which is surprising since we believe it must be much faster as it is written as a C extension. One reason for this is that the access of individual elements from Python itself is rather costly. Only when we are able to apply algorithms inside the optimized extension code do we get speed improvements, and tremendous ones at that: using the dot() function of NumPy, we are more than 25 times faster. In summary, in every algorithm we are about to implement, we should always look at how we can move loops over individual elements from Python to some of the highly optimized NumPy or SciPy extension functions.
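As a small illustration of that principle (our own example, not from the chapter), consider filtering a noisy vector: the explicit Python loop and the vectorized NumPy version compute the same result, but the latter keeps the loop inside the extension code.

```python
import numpy as np

y = np.array([1.0, 2.0, np.nan, 4.0, 250.0])

# Loop version: every element access goes through the Python interpreter.
cleaned_loop = []
for v in y:
    if not np.isnan(v) and v <= 100:
        cleaned_loop.append(v)

# Vectorized version: both conditions become a single boolean mask,
# and the loop over elements runs inside NumPy's C code.
cleaned_np = y[~np.isnan(y) & (y <= 100)]

print(cleaned_np)  # [1. 2. 4.]
```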
However, the speed comes at a price. Using NumPy arrays, we no longer have the incredible flexibility of Python lists, which can hold basically anything. NumPy arrays always have only one data type.
>>> a = np.array([1,2,3])
>>> a.dtype
dtype('int64')
If we try to use elements of different types, NumPy will do its best to coerce them to the most reasonable common data type:

>>> np.array([1, "stringy"])
array(['1', 'stringy'], dtype='|S8')
>>> np.array([1, "stringy", set([1,2,3])])
array([1, stringy, set([1, 2, 3])], dtype=object)
Learning SciPy
On top of the efficient data structures of NumPy, SciPy offers a multitude of algorithms working on those arrays. Whatever numerically heavy algorithm you take from current books on numerical recipes, you will most likely find support for it in SciPy in one way or another. Whether it is matrix manipulation, linear algebra, optimization, clustering, spatial operations, or even fast Fourier transforms, the toolbox is readily filled. Therefore, it is a good habit to always inspect the scipy module before you start implementing a numerical algorithm.
For convenience, the complete namespace of NumPy is also accessible via SciPy. So, from now on, we will use NumPy's machinery via the SciPy namespace. You can check this easily by comparing the function references of any base function; for example:
>>> import scipy, numpy
>>> scipy.version.full_version
0.11.0
>>> scipy.dot is numpy.dot
True
The diverse algorithms are grouped into the following toolboxes:
SciPy package    Functionality
cluster          Hierarchical clustering (cluster.hierarchy)
                 Vector quantization / k-means (cluster.vq)
constants        Physical and mathematical constants
                 Conversion methods
fftpack          Discrete Fourier transform algorithms
integrate        Integration routines
interpolate      Interpolation (linear, cubic, and so on)
io               Data input and output
linalg           Linear algebra routines using the optimized
                 BLAS and LAPACK libraries
maxentropy       Functions for fitting maximum entropy models
ndimage          n-dimensional image package
odr              Orthogonal distance regression
optimize         Optimization (finding minima and roots)
signal           Signal processing
sparse           Sparse matrices
spatial          Spatial data structures and algorithms
special          Special mathematical functions such as Bessel
                 or Jacobian
stats            Statistics toolkit
The toolboxes most interesting to our endeavor are scipy.stats, scipy.interpolate, scipy.cluster, and scipy.signal. For the sake of brevity, we will briefly explore some features of the stats package and leave the others to be explained when they show up in the chapters.
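As a first taste of what scipy.stats offers (our own minimal example, not from the chapter), here is the standard normal distribution as a "frozen" distribution object:

```python
from scipy import stats

# A frozen standard normal distribution: the parameters are fixed once,
# then pdf/cdf/ppf and random sampling are available as methods.
norm = stats.norm(loc=0.0, scale=1.0)

print(norm.cdf(0.0))  # 0.5: half of the probability mass lies below the mean
print(norm.ppf(0.5))  # 0.0: ppf() is the inverse of cdf()
```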
Our first (tiny) machine learning application
Let us get our hands dirty and have a look at our hypothetical web start-up, MLAAS, which sells the service of providing machine learning algorithms via HTTP. With the increasing success of our company, the demand for better infrastructure also increases to serve all incoming web requests successfully. We don't want to allocate too many resources, as that would be too costly. On the other hand, we will lose money if we have not reserved enough resources for serving all incoming requests. The question now is, when will we hit the limit of our current infrastructure, which we estimated to be 100,000 requests per hour? We would like to know in advance when we have to request additional servers in the cloud to serve all the incoming requests successfully without paying for unused ones.
Reading in the data
We have collected the web stats for the last month and aggregated them in ch01/data/web_traffic.tsv (tsv because it contains tab-separated values). They are stored as the number of hits per hour. Each line contains the hour (numbered consecutively) and the number of web hits in that hour.
Using SciPy's genfromtxt(), we can easily read in the data.

import scipy as sp
data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")

We have to specify tab as the delimiter so that the columns are correctly determined. A quick check shows that we have correctly read in the data.
>>> print(data[:10])
[[  1.00000000e+00   2.27200000e+03]
 [  2.00000000e+00              nan]
 [  3.00000000e+00   1.38600000e+03]
 [  4.00000000e+00   1.36500000e+03]
 [  5.00000000e+00   1.48800000e+03]
 [  6.00000000e+00   1.33700000e+03]
 [  7.00000000e+00   1.88300000e+03]
 [  8.00000000e+00   2.28300000e+03]
 [  9.00000000e+00   1.33500000e+03]
 [  1.00000000e+01   1.02500000e+03]]
>>> print(data.shape)
(743, 2)
We have 743 data points with two dimensions.
Preprocessing and cleaning the data
It is more convenient for SciPy to separate the dimensions into two vectors, each of size 743. The first vector, x, will contain the hours, and the other, y, will contain the web hits in that particular hour. This splitting is done using the special index notation of SciPy, using which we can choose the columns individually.
x=data[:,0]
y=data[:,1]
One caveat is that we still have some invalid values, nan, in y. The question is, what can we do with them?
Let us check how many hours contain invalid data.

>>> sp.sum(sp.isnan(y))
8
We are missing only 8 out of 743 entries, so we can afford to remove them. Remember that we can index a SciPy array with another array. sp.isnan(y) returns an array of Booleans indicating whether an entry is not a number. Using ~, we logically negate that array so that we choose only those elements from x and y where y does contain valid numbers.
x = x[~sp.isnan(y)]
y = y[~sp.isnan(y)]
To get a first impression of our data, let us plot the data in a scatter plot using Matplotlib. Matplotlib contains the pyplot package, which tries to mimic Matlab's interface, a very convenient and easy-to-use one (you will find more tutorials on plotting at http://matplotlib.org/users/pyplot_tutorial.html).
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
           ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
plt.grid()
plt.show()
In the resulting chart, we can see that while in the first weeks the traffic stayed more or less the same, the last week shows a steep increase.
Choosing the right model and learning algorithm
Now that we have a first impression of the data, we return to the initial question: how long will our server handle the incoming web traffic? To answer this we have to:

• Find the real model behind the noisy data points
• Use the model to extrapolate into the future to find the point in time where our infrastructure has to be extended
Before building our first model
When we talk about models, you can think of them as simplified theoretical approximations of the complex reality. As such, there is always some inferiority involved, also called the approximation error. This error will guide us in choosing the right model among the myriad of choices we have. The error is calculated as the squared distance of the model's prediction to the real data. That is, for a learned model function, f, the error is calculated as follows:
def error(f, x, y):
    return sp.sum((f(x)-y)**2)
The vectors x and y contain the web stats data that we have extracted before. It is the beauty of SciPy's vectorized functions that we exploit here with f(x). The trained model is assumed to take a vector and return the results again as a vector of the same size so that we can use it to calculate the difference to y.
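A quick sanity check of this error function on made-up data (ours, not the chapter's web stats): a model that matches the generating line exactly has error zero, and a constant offset of 1 over three points gives an error of 3. We call NumPy directly here; via the SciPy namespace it behaves the same.

```python
import numpy as np

def error(f, x, y):
    return np.sum((f(x) - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = 2 * x + 1                      # data generated by a known line

f_perfect = np.poly1d([2.0, 1.0])  # exactly the generating line
f_off = np.poly1d([2.0, 2.0])      # intercept off by one everywhere

print(error(f_perfect, x, y))  # 0.0
print(error(f_off, x, y))      # 3.0 (three points, each off by one)
```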
Starting with a simple straight line
Let us assume for a second that the underlying model is a straight line. The challenge then is how to best put that line into the chart so that it results in the smallest approximation error. SciPy's polyfit() function does exactly that. Given data x and y and the desired order of the polynomial (a straight line has order 1), it finds the model function that minimizes the error function defined earlier.
fp1, residuals, rank, sv, rcond = sp.polyfit(x, y, 1, full=True)
The polyfit() function returns the parameters of the fitted model function, fp1; by setting full to True, we also get additional background information on the fitting process. Of it, only residuals are of interest, which is exactly the error of the approximation.

>>> print("Model parameters: %s" % fp1)
Model parameters: [   2.59619213  989.02487106]
>>> print(residuals)
[  3.17389767e+08]
This means that the best straight line fit is the following function:
f(x) = 2.59619213 * x + 989.02487106
We then use poly1d() to create a model function from the model parameters.

>>> f1 = sp.poly1d(fp1)
>>> print(error(f1, x, y))
317389767.34
We have used full=True to retrieve more details on the fitting process. Normally, we would not need it, in which case only the model parameters would be returned.
In fact, what we do here is simple curve fitting. You can find out more about it on Wikipedia at http://en.wikipedia.org/wiki/Curve_fitting.
We can now use f1() to plot our first trained model. In addition to the earlier plotting instructions, we simply add the following:

fx = sp.linspace(0, x[-1], 1000)  # generate X-values for plotting
plt.plot(fx, f1(fx), linewidth=4)
plt.legend(["d=%i" % f1.order], loc="upper left")
The following graph shows our first trained model:
It seems like the first four weeks are not that far off, although we clearly see that there is something wrong with our initial assumption that the underlying model is a straight line. Plus, how good or bad actually is the error of 317,389,767.34?
The absolute value of the error is seldom of use in isolation. However, when comparing two competing models, we can use their errors to judge which one of them is better. Although our first model clearly is not the one we would use, it serves a very important purpose in the workflow: we will use it as our baseline until we find a better one. Whatever model we come up with in the future, we will compare it against the current baseline.
The more complex we allow the model to get, the better the curves capture the data and the better the fit. The errors seem to tell the same story.
Error d=1:   317,389,767.339778
Error d=2:   179,983,507.878179
Error d=3:   139,350,144.031725
Error d=10:  121,942,326.363461
Error d=100: 109,318,004.475556
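The chapter only shows the order-1 fit explicitly; errors like the ones above could have been produced with a loop such as the following sketch. Since we do not have web_traffic.tsv at hand here, the data is a synthetic stand-in (any noisy one-dimensional series will do), so the error values will differ from the table above.

```python
import numpy as np

# Synthetic stand-in for the hourly web-traffic data.
rng = np.random.RandomState(3)
x = np.arange(1.0, 744.0)
y = 1000 + 2.5 * x + 50 * np.sin(x / 40.0) + rng.randn(x.size) * 100

def error(f, x, y):
    return np.sum((f(x) - y) ** 2)

# One polynomial per degree; a higher degree can only lower the
# training error, since each model class contains the simpler ones.
errors = {}
for degree in [1, 2, 3, 10, 100]:
    f = np.poly1d(np.polyfit(x, y, degree))
    errors[degree] = error(f, x, y)
    print("Error d=%i: %f" % (degree, errors[degree]))
```

(np.polyfit may emit a RankWarning for degree 100; the fit is badly conditioned, which is part of the story the chapter tells.)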
However, taking a closer look at the fitted curves, we start to wonder whether they also capture the true process that generated this data. Framed differently, do our models correctly represent the underlying mass behavior of customers visiting our website? Looking at the polynomials of degree 10 and 100, we see wildly oscillating behavior. It seems the models are fitted too much to the data, so much that they now capture not only the underlying process but also the noise. This is called overfitting.
At this point, we have the following choices:

• Selecting one of the fitted polynomial models
• Switching to another, more complex model class; splines?
• Thinking differently about the data and starting again
Of the five fitted models, the first-order model clearly is too simple, and the models of order 10 and 100 are clearly overfitting. Only the second- and third-order models seem to somehow match the data. However, if we extrapolate them at both borders, we see them going berserk.

Switching to a more complex class also seems to be the wrong way to go about it. What arguments would back which class?

At this point, we realize that we probably have not completely understood our data.
fa = sp.poly1d(sp.polyfit(xa, ya, 1))
fb = sp.poly1d(sp.polyfit(xb, yb, 1))

fa_error = error(fa, xa, ya)
fb_error = error(fb, xb, yb)
print("Error inflection=%f" % (fa_error + fb_error))
Error inflection=156,639,407.701523
Plotting the two models for the two data ranges gives the following chart:
Clearly, the combination of these two lines seems to be a much better fit to the data than anything we have modeled before. But still, the combined error is higher than the higher-order polynomials. Can we trust the error at the end?

Asked differently, why do we trust the straight line fitted only on the last week of our data more than any of the more complex models?
It is because we assume that it will capture future data better. If we plot the models into the future, we see how right we are (d=1 is again our initial straight line).
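Plotting into the future simply amounts to letting the x range of the plot run past the last observed hour. A minimal sketch (our own, with stand-in data and the non-interactive Agg backend so it runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")        # render off-screen; no display required
import matplotlib.pyplot as plt

# Stand-in data and two fitted models of degree 1 and 2.
x = np.arange(1.0, 744.0)
y = 1000 + 2.5 * x + 0.01 * x ** 2
f1 = np.poly1d(np.polyfit(x, y, 1))
f2 = np.poly1d(np.polyfit(x, y, 2))

# Key step: evaluate the models six weeks beyond the last data point.
fx = np.linspace(0, x[-1] + 6 * 7 * 24, 1000)
for f in [f1, f2]:
    plt.plot(fx, f(fx), linewidth=2)
plt.legend(["d=%i" % f.order for f in [f1, f2]], loc="upper left")
plt.savefig("extrapolation.png")
```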
The models of degree 10 and 100 don't seem to expect a bright future for our start-up. They tried so hard to model the given data correctly that they are clearly useless for extrapolating further. This is called overfitting. On the other hand, the lower-degree models do not seem to be capable of capturing the data properly. This is called underfitting.
So let us play fair to the models of degree 2 and above and try out how they behave if we fit them only to the data of the last week. After all, we believe that the last week says more about the future than the data before. The result can be seen in the following psychedelic chart, which shows even more clearly how bad the problem of overfitting is.
Still, judging from the errors of the models when trained only on the data from week 3.5 and after, we should still choose the most complex one.

Error d=1:   22,143,941.107618
Error d=2:   19,768,846.989176
Error d=3:   19,766,452.361027
Error d=10:  18,949,339.348539
Error d=100: 16,915,159.603877
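The variable fbt2 in the next snippet is not defined anywhere in this excerpt; it is presumably the degree-2 polynomial fitted only on the data after the inflection point, the model the chapter settles on. A sketch of how it could be built, again with synthetic stand-in data, exploiting that a poly1d minus a scalar is again a polynomial whose root fsolve can find:

```python
import numpy as np
from scipy.optimize import fsolve

# Synthetic stand-in for the traffic after week 3.5 (hours 588 to 743).
x = np.arange(588.0, 744.0)
y = 0.1 * x ** 2 - 50.0 * x + 10000.0   # quadratic growth stand-in

# Assumed definition: fbt2 is the order-2 polynomial fitted on that range.
fbt2 = np.poly1d(np.polyfit(x, y, 2))

# fbt2 - 100000 is itself a poly1d; fsolve finds where it crosses zero,
# that is, the hour at which the model predicts 100,000 hits/hour.
reached_max = fsolve(fbt2 - 100000, x0=800) / (7 * 24)
print("100,000 hits/hour expected at week %f" % reached_max[0])
```

With the real data, the same two lines yield the week 9.8 figure quoted below.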
>>> from scipy.optimize import fsolve
>>> reached_max = fsolve(fbt2-100000, 800)/(7*24)
>>> print("100,000 hits/hour expected at week %f" % reached_max[0])
100,000 hits/hour expected at week 9.827613
Our model tells us that, given the current user behavior and traction of our start-up, it will take another month until we reach our threshold capacity.
Of course, there is a certain uncertainty involved with our prediction. To get the real picture, you can draw in more sophisticated statistics to find out about the variance that we have to expect when looking farther and farther into the future. And then there are the user and underlying user behavior dynamics that we cannot model accurately. However, at this point we are fine with the current predictions. After all, we can prepare all the time-consuming actions now. If we then monitor our web traffic closely, we will see in time when we have to allocate new resources.