Python Notes

This document provides an introduction to machine learning and Python. It discusses that machine learning teaches machines how to carry out tasks by providing examples, and the complexity arises from the details. It then summarizes that NumPy provides optimized multi-dimensional arrays as the basic data structure for machine learning algorithms, SciPy uses these arrays to provide numerical methods, and Matplotlib is useful for plotting graphs. The document also provides some basic examples of using NumPy arrays and indexing capabilities.

UNIT I

MACHINE LEARNING AND PYTHON:

Machine learning (ML) teaches machines how to carry out tasks by themselves. It is that simple. The complexity comes with the details, and that is most likely the reason you are reading this book.

Maybe you have too much data and too little insight, and you hoped that using machine learning algorithms will help you solve this challenge. So you started to dig into random algorithms. But after some time you were puzzled: which of the myriad of algorithms should you actually choose?
Or maybe you are broadly interested in machine learning and have been reading a few blogs and articles about it for some time. Everything seemed to be magic and cool, so you started your exploration and fed some toy data into a decision tree or a support vector machine. But after you successfully applied it to some other data, you wondered, was the whole setting right? Did you get the optimal results? And how do you know there are no better algorithms? Or whether your data was "the right one"?
Welcome to the club! We, the authors, were at those stages once upon a time, looking for information that tells the real story behind the theoretical textbooks on machine learning. It turned out that much of that information was "black art", not usually taught in standard textbooks. So, in a sense, we wrote this book to our younger selves; a book that not only gives a quick introduction to machine learning, but also teaches you lessons that we have learned along the way. We hope that it will also give you, the reader, a smoother entry into one of the most exciting fields in Computer Science.
Machine learning and Python – the dream team
The goal of machine learning is to teach machines (software) to carry out tasks by providing them with a couple of examples (how to do or not do a task). Let us assume that each morning when you turn on your computer, you perform the same task of moving e-mails around so that only those e-mails belonging to a particular topic end up in the same folder. After some time, you feel bored and think of automating this chore. One way would be to start analyzing your brain and writing down all the rules your brain processes while you are shuffling your e-mails. However, this will be quite cumbersome and always imperfect. While you will miss some rules, you will over-specify others. A better and more future-proof way would be to automate this process by choosing a set of e-mail meta information and body/folder name pairs and let an algorithm come up with the best rule set. The pairs would be your training data, and the resulting rule set (also called model) could then be applied to future e-mails that we have not yet seen. This is machine learning in its simplest form.
Of course, machine learning (often also referred to as data mining or predictive analysis) is not a brand new field in itself. Quite the contrary, its success over recent years can be attributed to the pragmatic way of using rock-solid techniques and insights from other successful fields; for example, statistics. There, the purpose is for us humans to get insights into the data by learning more about the underlying patterns and relationships. As you read more and more about successful applications of machine learning (you have checked out kaggle.com already, haven't you?), you will see that applied statistics is a common field among machine learning experts.
As you will see later, the process of coming up with a decent ML approach is never a waterfall-like process. Instead, you will see yourself going back and forth in your analysis, trying out different versions of your input data on diverse sets of ML algorithms. It is this explorative nature that lends itself perfectly to Python. Being an interpreted high-level programming language, it may seem that Python was designed specifically for the process of trying out different things. What is more, it does this very fast. Sure enough, it is slower than C or similar statically-typed programming languages; nevertheless, with a myriad of easy-to-use libraries that are often written in C, you don't have to sacrifice speed for agility.

INTRODUCTION TO NumPy, SciPy, AND Matplotlib

Luckily, for all the major operating systems, namely Windows, Mac, and Linux, there are targeted installers for NumPy, SciPy, and Matplotlib. If you are unsure about the installation process, you might want to install Enthought Python Distribution (https://www.enthought.com/products/epd_free.php) or Python(x,y) (http://code.google.com/p/pythonxy/wiki/Downloads), which come with all the earlier mentioned packages included.
Before we can talk about concrete machine learning algorithms, we have to talk about how best to store the data we will chew through. This is important as the most advanced learning algorithm will not be of any help to us if it will never finish. This may be simply because accessing the data is too slow. Or maybe its representation forces the operating system to swap all day. Add to this that Python is an interpreted language (a highly optimized one, though) that is slow for many numerically heavy algorithms compared to C or Fortran. So we might ask why on earth so many scientists and companies are betting their fortune on Python even in the highly computation-intensive areas?
The answer is that in Python, it is very easy to offload number-crunching tasks to the lower layer in the form of a C or Fortran extension. That is exactly what NumPy and SciPy do (http://scipy.org/install.html). In this tandem, NumPy provides the support of highly optimized multidimensional arrays, which are the basic data structure of most state-of-the-art algorithms. SciPy uses those arrays to provide a set of fast numerical recipes. Finally, Matplotlib (http://matplotlib.org/) is probably the most convenient and feature-rich library to plot high-quality graphs using Python.

Installing Python

Chewing data efficiently with NumPy and intelligently with SciPy
Let us quickly walk through some basic NumPy examples and then take a look at what SciPy provides on top of it. On the way, we will get our feet wet with plotting using the marvelous Matplotlib package.
You will find more interesting examples of what NumPy can offer at http://www.scipy.org/Tentative_NumPy_Tutorial.
You will also find the book NumPy Beginner's Guide - Second Edition, Ivan Idris, Packt Publishing very valuable. Additional tutorial-style guides are at http://scipy-lectures.github.com; you may also visit the official SciPy tutorial at http://docs.scipy.org/doc/scipy/reference/tutorial.
In this book, we will use NumPy Version 1.6.2 and SciPy Version 0.11.0.

Learning NumPy
So let us import NumPy and play a bit with it. For that, we need to start the Python interactive shell.
>>> import numpy
>>> numpy.version.full_version
1.6.2
As we do not want to pollute our namespace, we certainly should not do the following:
>>> from numpy import *
The numpy.array array will potentially shadow the array package that is included in standard Python. Instead, we will use the following convenient shortcut:
>>> import numpy as np
>>> a = np.array([0,1,2,3,4,5])
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.ndim
1
>>> a.shape
(6,)
We just created an array in a similar way to how we would create a list in Python. However, NumPy arrays have additional information about the shape. In this case, it is a one-dimensional array of six elements. No surprises so far.
We can now transform this array into a 2D matrix.
>>> b = a.reshape((3,2))
>>> b
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> b.ndim
2
>>> b.shape
(3, 2)
The funny thing starts when we realize just how much the NumPy package is optimized. For example, it avoids copies wherever possible.
>>> b[1][0] = 77
>>> b
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> a
array([ 0,  1, 77,  3,  4,  5])
In this case, we have modified the value 2 to 77 in b, and we can immediately see the same change reflected in a as well. Keep that in mind whenever you need a true copy.
>>> c = a.reshape((3,2)).copy()
>>> c
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> c[0][0] = -99
>>> a
array([ 0,  1, 77,  3,  4,  5])
>>> c
array([[-99,   1],
       [ 77,   3],
       [  4,   5]])
Here, c and a are totally independent copies.
Another big advantage of NumPy arrays is that the operations are propagated to the individual elements.
>>> d = np.array([1,2,3,4,5])
>>> d*2
array([ 2,  4,  6,  8, 10])
>>> d**2
array([ 1,  4,  9, 16, 25])
Contrast that to ordinary Python lists:
>>> [1,2,3,4,5]*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> [1,2,3,4,5]**2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
Of course, by using NumPy arrays we sacrifice the agility Python lists offer. Simple operations like adding or removing elements are a bit complex for NumPy arrays. Luckily, we have both at our disposal, and we will use the right one for the task at hand.
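To see this trade-off concretely, here is a small sketch (the arrays and values are illustrative, not from the book's examples): NumPy's append() and delete() each build and return a fresh copy, which gets costly inside a loop, while a Python list grows in place.

```python
import numpy as np

a = np.array([0, 1, 2])

# np.append() and np.delete() do not modify the array in place;
# each call allocates and returns a new array.
b = np.append(a, 3)    # new array; a is untouched
c = np.delete(a, 0)    # new array without the element at index 0

# A plain Python list, in contrast, grows in place cheaply:
lst = [0, 1, 2]
lst.append(3)

print(b)    # [0 1 2 3]
print(c)    # [1 2]
print(lst)  # [0, 1, 2, 3]
```

This is why, when an array has to grow element by element, it is often better to collect values in a list first and convert to an array once at the end.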

Indexing
Part of the power of NumPy comes from the versatile ways in which its arrays can be accessed.
In addition to normal list indexing, it allows us to use arrays themselves as indices.
>>> a[np.array([2,3,4])]
array([77,  3,  4])
In addition to the fact that conditions are now propagated to the individual elements, we gain a very convenient way to access our data.
>>> a > 4
array([False, False,  True, False, False,  True], dtype=bool)
>>> a[a > 4]
array([77,  5])
This can also be used to trim outliers.
>>> a[a > 4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])
As this is a frequent use case, there is a special clip function for it, clipping the values at both ends of an interval with one function call as follows:
>>> a.clip(0, 4)
array([0, 1, 4, 3, 4, 4])

Handling non-existing values
The power of NumPy's indexing capabilities comes in handy when preprocessing data that we have just read in from a text file. It will most likely contain invalid values, which we will mark as not being a real number using numpy.NAN as follows:
>>> c = np.array([1, 2, np.NAN, 3, 4])  # let's pretend we have read this from a text file
>>> c
array([  1.,   2.,  nan,   3.,   4.])
>>> np.isnan(c)
array([False, False,  True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1.,  2.,  3.,  4.])
>>> np.mean(c[~np.isnan(c)])
2.5
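As a side note: later NumPy releases (1.8 and up, so not the 1.6.2 used in this book) wrap this mask-then-aggregate pattern into dedicated functions such as nanmean(). A small sketch:

```python
import numpy as np

c = np.array([1, 2, np.nan, 3, 4])

# Masking by hand, as shown above:
print(np.mean(c[~np.isnan(c)]))   # 2.5

# Newer NumPy bundles the same pattern as nanmean(), which
# simply ignores nan entries when averaging:
print(np.nanmean(c))              # 2.5
```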

Comparing runtime behaviors
Let us compare the runtime behavior of NumPy with normal Python lists. In the following code, we will calculate the sum of all squared numbers from 1 to 1000 and see how much time the calculation will take. We do it 10,000 times and report the total time so that our measurement is accurate enough.
import timeit
normal_py_sec = timeit.timeit('sum(x*x for x in xrange(1000))',
                              number=10000)
naive_np_sec = timeit.timeit('sum(na*na)',
                             setup="import numpy as np; na=np.arange(1000)",
                             number=10000)
good_np_sec = timeit.timeit('na.dot(na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)

print("Normal Python: %f sec" % normal_py_sec)
print("Naive NumPy: %f sec" % naive_np_sec)
print("Good NumPy: %f sec" % good_np_sec)

Normal Python: 1.157467 sec
Naive NumPy: 4.061293 sec
Good NumPy: 0.033419 sec
We make two interesting observations. First, just using NumPy as data storage (Naive NumPy) takes 3.5 times longer, which is surprising since we believe it must be much faster as it is written as a C extension. One reason for this is that the access of individual elements from Python itself is rather costly. Only when we are able to apply algorithms inside the optimized extension code do we get speed improvements, and tremendous ones at that: using the dot() function of NumPy, we are more than 25 times faster. In summary, in every algorithm we are about to implement, we should always look at how we can move loops over individual elements from Python to some of the highly optimized NumPy or SciPy extension functions.
However, the speed comes at a price. Using NumPy arrays, we no longer have the incredible flexibility of Python lists, which can hold basically anything. NumPy arrays always have only one data type.
>>> a = np.array([1,2,3])
>>> a.dtype
dtype('int64')
If we try to use elements of different types, NumPy will do its best to coerce them to the most reasonable common data type:
>>> np.array([1, "stringy"])
array(['1', 'stringy'], dtype='|S8')
>>> np.array([1, "stringy", set([1,2,3])])
array([1, stringy, set([1, 2, 3])], dtype=object)
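The flip side of coercion is that we can also pin the data type ourselves when creating an array; a short sketch (the example values are our own, not from the book). Elements are then converted on creation, and values that cannot be converted raise a ValueError:

```python
import numpy as np

# Request a data type explicitly instead of letting NumPy guess:
a = np.array([1, 2, 3], dtype=float)
print(a.dtype)   # float64

# Strings that look like numbers are parsed on the way in:
b = np.array(["1", "2"], dtype=int)
print(b)         # [1 2]
```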

Learning SciPy
On top of the efficient data structures of NumPy, SciPy offers a magnitude of algorithms working on those arrays. Whatever numerical-heavy algorithm you take from current books on numerical recipes, you will most likely find support for it in SciPy in one way or another. Whether it is matrix manipulation, linear algebra, optimization, clustering, spatial operations, or even Fast Fourier transformation, the toolbox is readily filled. Therefore, it is a good habit to always inspect the scipy module before you start implementing a numerical algorithm.
For convenience, the complete namespace of NumPy is also accessible via SciPy. So, from now on, we will use NumPy's machinery via the SciPy namespace. You can check this easily by comparing the function references of any base function; for example:
>>> import scipy, numpy
>>> scipy.version.full_version
0.11.0
>>> scipy.dot is numpy.dot
True
The diverse algorithms are grouped into the following toolboxes:

SciPy package   Functionality

cluster         Hierarchical clustering (cluster.hierarchy);
                vector quantization / k-means (cluster.vq)
constants       Physical and mathematical constants;
                conversion methods
fftpack         Discrete Fourier transform algorithms
integrate       Integration routines
interpolate     Interpolation (linear, cubic, and so on)
io              Data input and output
linalg          Linear algebra routines using the optimized
                BLAS and LAPACK libraries
maxentropy      Functions for fitting maximum entropy models
ndimage         n-dimensional image package
odr             Orthogonal distance regression
optimize        Optimization (finding minima and roots)
signal          Signal processing
sparse          Sparse matrices
spatial         Spatial data structures and algorithms
special         Special mathematical functions such as Bessel
                or Jacobian
stats           Statistics toolkit

The toolboxes most interesting to our endeavor are scipy.stats, scipy.interpolate, scipy.cluster, and scipy.signal. For the sake of brevity, we will briefly explore some features of the stats package and leave the others to be explained when they show up in the chapters.
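As a tiny foretaste of the stats package (this example is our own, not from the text): distribution objects such as scipy.stats.norm expose pdf() and cdf() methods for the density and the cumulative distribution.

```python
import scipy.stats

# The standard normal distribution as a ready-made object:
print(scipy.stats.norm.pdf(0))   # density at 0, about 0.3989
print(scipy.stats.norm.cdf(0))   # 0.5 by symmetry
```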
Our first (tiny) machine learning application
Let us get our hands dirty and have a look at our hypothetical web start-up, MLAAS, which sells the service of providing machine learning algorithms via HTTP. With the increasing success of our company, the demand for better infrastructure also increases to serve all incoming web requests successfully. We don't want to allocate too many resources as that would be too costly. On the other hand, we will lose money if we have not reserved enough resources for serving all incoming requests. The question now is, when will we hit the limit of our current infrastructure, which we estimated being 100,000 requests per hour. We would like to know in advance when we have to request additional servers in the cloud to serve all the incoming requests successfully without paying for unused ones.

Reading in the data
We have collected the web stats for the last month and aggregated them in ch01/data/web_traffic.tsv (tsv because it contains tab-separated values). They are stored as the number of hits per hour. Each line contains consecutive hours and the number of web hits in that hour.
Using SciPy's genfromtxt(), we can easily read in the data.
import scipy as sp
data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")
We have to specify tab as the delimiter so that the columns are correctly determined. A quick check shows that we have correctly read in the data.

>>> print(data[:10])
[[  1.00000000e+00   2.27200000e+03]
 [  2.00000000e+00              nan]
 [  3.00000000e+00   1.38600000e+03]
 [  4.00000000e+00   1.36500000e+03]
 [  5.00000000e+00   1.48800000e+03]
 [  6.00000000e+00   1.33700000e+03]
 [  7.00000000e+00   1.88300000e+03]
 [  8.00000000e+00   2.28300000e+03]
 [  9.00000000e+00   1.33500000e+03]
 [  1.00000000e+01   1.02500000e+03]]
>>> print(data.shape)
(743, 2)
We have 743 data points with two dimensions.

Preprocessing and cleaning the data
It is more convenient for SciPy to separate the dimensions into two vectors, each of size 743. The first vector, x, will contain the hours and the other, y, will contain the web hits in that particular hour. This splitting is done using the special index notation of SciPy, using which we can choose the columns individually.
x = data[:,0]
y = data[:,1]

There is much more to the way data can be selected from a SciPy array. Check out http://www.scipy.org/Tentative_NumPy_Tutorial for more details on indexing, slicing, and iterating.

One caveat is that we still have some values in y that contain invalid values, nan. The question is, what can we do with them?
Let us check how many hours contain invalid data.
>>> sp.sum(sp.isnan(y))
8
We are missing only 8 out of 743 entries, so we can afford to remove them. Remember that we can index a SciPy array with another array. sp.isnan(y) returns an array of Booleans indicating whether an entry is not a number. Using ~, we logically negate that array so that we choose only those elements from x and y where y does contain valid numbers.
x = x[~sp.isnan(y)]
y = y[~sp.isnan(y)]
To get a first impression of our data, let us plot the data in a scatter plot using Matplotlib. Matplotlib contains the pyplot package, which tries to mimic Matlab's interface, a very convenient and easy-to-use one (you will find more tutorials on plotting at http://matplotlib.org/users/pyplot_tutorial.html).
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
           ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
plt.grid()
plt.show()
In the resulting chart, we can see that while in the first weeks the traffic stayed more or less the same, the last week shows a steep increase.
Choosing the right model and learning algorithm
Now that we have a first impression of the data, we return to the initial question: how long will our server handle the incoming web traffic? To answer this we have to:
• Find the real model behind the noisy data points
• Use the model to extrapolate into the future to find the point in time where our infrastructure has to be extended

Before building our first model
When we talk about models, you can think of them as simplified theoretical approximations of the complex reality. As such there is always some inferiority involved, also called the approximation error. This error will guide us in choosing the right model among the myriad of choices we have. This error will be calculated as the squared distance of the model's prediction to the real data. That is, for a learned model function, f, the error is calculated as follows:
def error(f, x, y):
    return sp.sum((f(x)-y)**2)
The vectors x and y contain the web stats data that we have extracted before. It is the beauty of SciPy's vectorized functions that we exploit here with f(x). The trained model is assumed to take a vector and return the results again as a vector of the same size so that we can use it to calculate the difference to y.

Starting with a simple straight line
Let us assume for a second that the underlying model is a straight line. The challenge then is how to best put that line into the chart so that it results in the smallest approximation error. SciPy's polyfit() function does exactly that. Given data x and y and the desired order of the polynomial (a straight line has order 1), it finds the model function that minimizes the error function defined earlier.
fp1, residuals, rank, sv, rcond = sp.polyfit(x, y, 1, full=True)
The polyfit() function returns the parameters of the fitted model function, fp1; and by setting full to True, we also get additional background information on the fitting process. Of it, only residuals are of interest, which is exactly the error of the approximation.
>>> print("Model parameters: %s" % fp1)
Model parameters: [   2.59619213  989.02487106]
>>> print(residuals)
[  3.17389767e+08]
This means that the best straight line fit is the following function:
f(x) = 2.59619213 * x + 989.02487106
We then use poly1d() to create a model function from the model parameters.
>>> f1 = sp.poly1d(fp1)
>>> print(error(f1, x, y))
317389767.34
We have used full=True to retrieve more details on the fitting process. Normally, we would not need it, in which case only the model parameters would be returned.

In fact, what we do here is simple curve fitting. You can find out more about it on Wikipedia by going to http://en.wikipedia.org/wiki/Curve_fitting.

We can now use f1() to plot our first trained model. In addition to the earlier plotting instructions, we simply add the following:
fx = sp.linspace(0, x[-1], 1000)  # generate X-values for plotting
plt.plot(fx, f1(fx), linewidth=4)
plt.legend(["d=%i" % f1.order], loc="upper left")
The following graph shows our first trained model:
It seems like the first four weeks are not that far off, although we clearly see that there is something wrong with our initial assumption that the underlying model is a straight line. Plus, how good or bad actually is the error of 317,389,767.34?
The absolute value of the error is seldom of use in isolation. However, when comparing two competing models, we can use their errors to judge which one of them is better. Although our first model clearly is not the one we would use, it serves a very important purpose in the workflow: we will use it as our baseline until we find a better one. Whatever model we will come up with in the future, we will compare it against the current baseline.

Towards some advanced stuff
Let us now fit a more complex model, a polynomial of degree 2, to see whether it better "understands" our data:
>>> f2p = sp.polyfit(x, y, 2)
>>> print(f2p)
array([  1.05322215e-02,  -5.26545650e+00,   1.97476082e+03])
>>> f2 = sp.poly1d(f2p)
>>> print(error(f2, x, y))
179983507.878
The following chart shows the model we trained before (straight line of one degree) with our newly trained, more complex model with two degrees (dashed):
The error is 179,983,507.878, which is almost half the error of the straight-line model. This is good; however, it comes with a price. We now have a more complex function, meaning that we have one more parameter to tune inside polyfit(). The fitted polynomial is as follows:
f(x) = 0.0105322215 * x**2 - 5.26545650 * x + 1974.76082
So, if more complexity gives better results, why not increase the complexity even more? Let's try it for degree 3, 10, and 100.
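The higher-degree fits can be produced with the same polyfit()/poly1d()/error() machinery as before. Since the real web_traffic.tsv data is not at hand here, the following sketch fabricates stand-in data of a similar shape just to show the loop:

```python
import numpy as np

# Stand-in data (hours vs. hits); the real values come from the file.
x = np.arange(1, 744, dtype=float)
y = 1000 + 3 * x + 50 * np.sin(x / 20)

def error(f, x, y):
    return np.sum((f(x) - y) ** 2)

for degree in [1, 2, 3, 10, 100]:
    # polyfit warns about poor conditioning for very high degrees,
    # which already hints at the overfitting discussed next.
    f = np.poly1d(np.polyfit(x, y, degree))
    print("Error d=%i: %f" % (degree, error(f, x, y)))
```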

The more complex our models get, the better the curves capture the data and make it fit. The errors seem to tell the same story.
Error d=1:   317,389,767.339778
Error d=2:   179,983,507.878179
Error d=3:   139,350,144.031725
Error d=10:  121,942,326.363461
Error d=100: 109,318,004.475556
However, taking a closer look at the fitted curves, we start to wonder whether they also capture the true process that generated this data. Framed differently, do our models correctly represent the underlying mass behavior of customers visiting our website? Looking at the polynomials of degree 10 and 100, we see wildly oscillating behavior. It seems that the models are fitted too much to the data. So much that they are now capturing not only the underlying process but also the noise. This is called overfitting.
At this point, we have the following choices:
• Selecting one of the fitted polynomial models
• Switching to another, more complex model class; splines?
• Thinking differently about the data and starting again
Of the five fitted models, the first-order model clearly is too simple, and the models of order 10 and 100 are clearly overfitting. Only the second- and third-order models seem to somehow match the data. However, if we extrapolate them at both borders, we see them going berserk.
Switching to a more complex class also seems to be the wrong way to go about it. What arguments would back which class?
At this point, we realize that we probably have not completely understood our data.

Stepping back to go forward – another look at our data
So, we step back and take another look at the data. It seems that there is an inflection point between weeks 3 and 4. So let us separate the data and train two lines using week 3.5 as a separation point. We train the first line with the data up to week 3.5, and the second line with the remaining data.
inflection = int(3.5*7*24)  # calculate the inflection point in hours
xa = x[:inflection]  # data before the inflection point
ya = y[:inflection]
xb = x[inflection:]  # data after
yb = y[inflection:]

fa = sp.poly1d(sp.polyfit(xa, ya, 1))
fb = sp.poly1d(sp.polyfit(xb, yb, 1))

fa_error = error(fa, xa, ya)
fb_error = error(fb, xb, yb)
print("Error inflection=%f" % (fa_error + fb_error))
Error inflection=156639407.701523
Plotting the two models for the two data ranges gives the following chart:
Clearly, the combination of these two lines seems to be a much better fit to the data than anything we have modeled before. But still, the combined error is higher than the higher-order polynomials. Can we trust the error at the end?
Asked differently, why do we trust the straight line fitted only on the last week of our data more than any of the more complex models?
It is because we assume that it will capture future data better. If we plot the models into the future, we see how right we are (d=1 is again our initial straight line).
The models of degree 10 and 100 don't seem to expect a bright future for our start-up. They tried so hard to model the given data correctly that they are clearly useless to extrapolate further. This is called overfitting. On the other hand, the lower-degree models do not seem to be capable of capturing the data properly. This is called underfitting.
So let us play fair to the models of degree 2 and above and try out how they behave if we fit them only to the data of the last week. After all, we believe that the last week says more about the future than the data before. The result can be seen in the following psychedelic chart, which shows even more clearly how bad the problem of overfitting is:

Still, judging from the errors of the models when trained only on the data from week 3.5 and after, we should still choose the most complex one.
Error d=1:   22,143,941.107618
Error d=2:   19,768,846.989176
Error d=3:   19,766,452.361027
Error d=10:  18,949,339.348539
Error d=100: 16,915,159.603877

Training and testing
If only we had some data from the future that we could use to measure our models against, we should be able to judge our model choice only on the resulting approximation error.
Although we cannot look into the future, we can and should simulate a similar effect by holding out a part of our data. Let us remove, for instance, a certain percentage of the data and train on the remaining one. Then we use the hold-out data to calculate the error. As the model has been trained not knowing the hold-out data, we should get a more realistic picture of how the model will behave in the future.
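A minimal sketch of this hold-out idea (with fabricated stand-in data, since the real traffic file is not at hand; the 30% split fraction is our own choice):

```python
import numpy as np

# Stand-in data (hours vs. hits) instead of web_traffic.tsv:
np.random.seed(3)
x = np.arange(1, 744, dtype=float)
y = 1000 + 3 * x + np.random.randn(len(x)) * 10

# Shuffle the indices and hold out 30% of the points for testing.
frac = 0.3
split_idx = int(frac * len(x))
shuffled = np.random.permutation(len(x))
test = np.sort(shuffled[:split_idx])    # held-out indices
train = np.sort(shuffled[split_idx:])   # training indices

# Train only on the training portion, evaluate on the held-out one.
f1 = np.poly1d(np.polyfit(x[train], y[train], 1))
test_error = np.sum((f1(x[test]) - y[test]) ** 2)
print("Test error d=1: %f" % test_error)
```

The same split can then be reused to compare the competing polynomial degrees on equal footing.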
The test errors for the models trained only on the time after the inflection point now show a completely different picture.
Error d=1:   7,917,335.831122
Error d=2:   6,993,880.348870
Error d=3:   7,137,471.177363
Error d=10:  8,805,551.189738
Error d=100: 10,877,646.621984
The result can be seen in the following chart:

It seems we finally have a clear winner. The model with degree 2 has the lowest test error, which is the error when measured using data that the model did not see during training. And this is what lets us trust that we won't get bad surprises when future data arrives.
Answering our initial question
Finally, we have arrived at a model that we think represents the underlying process best; it is now a simple task of finding out when our infrastructure will reach 100,000 requests per hour. We have to calculate when our model function reaches the value 100,000.
Having a polynomial of degree 2, we could simply compute the inverse of the function and calculate its value at 100,000. Of course, we would like to have an approach that is easily applicable to any model function.
This can be done by subtracting 100,000 from the polynomial, which results in another polynomial, and finding the root of it. SciPy's optimize module has the fsolve function to achieve this when provided an initial starting position. Let fbt2 be the winning polynomial of degree 2:
>>> print(fbt2)
         2
0.08844 x - 97.31 x + 2.853e+04
>>> print(fbt2 - 100000)
         2
0.08844 x - 97.31 x - 7.147e+04

>>> from scipy.optimize import fsolve
>>> reached_max = fsolve(fbt2 - 100000, 800) / (7 * 24)
>>> print("100,000 hits/hour expected at week %f" % reached_max[0])
100,000 hits/hour expected at week 9.827613
Our model tells us that given the current user behavior and traction of our start-up, it will take another month until we have reached our threshold capacity.
Of course, there is a certain uncertainty involved with our prediction. To get the real picture, you can draw in more sophisticated statistics to find out about the variance that we have to expect when looking farther and farther into the future. And then there are the user and underlying user behavior dynamics that we cannot model accurately. However, at this point we are fine with the current predictions. After all, we can prepare all the time-consuming actions now. If we then monitor our web traffic closely, we will see in time when we have to allocate new resources.
