Python Notes
Luckily, for all the major operating systems, namely Windows, Mac, and Linux, there are targeted installers for NumPy, SciPy, and Matplotlib. If you are unsure about the installation process, you might want to install the Enthought Python Distribution (https://www.enthought.com/products/epd_free.php) or Python(x,y) (http://code.google.com/p/pythonxy/wiki/Downloads), which come with all the aforementioned packages included.
Before we can talk about concrete machine learning algorithms, we have to talk about how best to store the data we will chew through. This is important, as even the most advanced learning algorithm will not be of any help to us if it never finishes. This may be simply because accessing the data is too slow, or maybe because its representation forces the operating system to swap all day. Add to this that Python is an interpreted language (a highly optimized one, though) that is slow for many numerically heavy algorithms compared to C or Fortran. So we might ask why on earth so many scientists and companies are betting their fortune on Python, even in highly computation-intensive areas.
The answer is that in Python it is very easy to offload number-crunching tasks to the lower layer in the form of a C or Fortran extension. That is exactly what NumPy and SciPy do (http://scipy.org/install.html). In this tandem, NumPy provides support for highly optimized multidimensional arrays, which are the basic data structure of most state-of-the-art algorithms. SciPy uses those arrays to provide a set of fast numerical recipes. Finally, Matplotlib (http://matplotlib.org/) is probably the most convenient and feature-rich library to plot high-quality graphs using Python.
Chewing data efficiently with NumPy and intelligently with SciPy
Let us quickly walk through some basic NumPy examples and then take a look at what SciPy provides on top of it. On the way, we will get our feet wet with plotting using the marvelous Matplotlib package.
You will find more interesting examples of what NumPy can offer at http://www.scipy.org/Tentative_NumPy_Tutorial.
You will also find the book NumPy Beginner's Guide - Second Edition, Ivan Idris, Packt Publishing, very valuable. Additional tutorial-style guides are available at http://scipy-lectures.github.com; you may also visit the official SciPy tutorial at http://docs.scipy.org/doc/scipy/reference/tutorial.
In this book, we will use NumPy version 1.6.2 and SciPy version 0.11.0.
Learning NumPy
So let us import NumPy and play a bit with it. For that, we need to start the Python interactive shell.

>>> import numpy
>>> numpy.version.full_version
1.6.2
As we do not want to pollute our namespace, we certainly should not do the following:

>>> from numpy import *

The numpy.array array would potentially shadow the array package that is included in standard Python. Instead, we will use the following convenient shortcut:
>>> import numpy as np
>>> a = np.array([0,1,2,3,4,5])
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.ndim
1
>>> a.shape
(6,)
We just created an array in a similar way to how we would create a list in Python. However, NumPy arrays have additional information about the shape. In this case, it is a one-dimensional array of six elements. No surprises so far.
We can now transform this array into a 2D matrix.
>>> b = a.reshape((3,2))
>>> b
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> b.ndim
2
>>> b.shape
(3, 2)
The funny thing starts when we realize just how much the NumPy package is optimized. For example, it avoids copies wherever possible.
>>> b[1][0] = 77
>>> b
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> a
array([ 0,  1, 77,  3,  4,  5])
In this case, we have modified the value 2 to 77 in b, and we can immediately see the same change reflected in a as well. Keep that in mind whenever you need a true copy.
>>> c = a.reshape((3,2)).copy()
>>> c
array([[ 0,  1],
       [77,  3],
       [ 4,  5]])
>>> c[0][0] = -99
>>> a
array([ 0,  1, 77,  3,  4,  5])
>>> c
array([[-99,   1],
       [ 77,   3],
       [  4,   5]])
Here, c and a are totally independent copies.
Another big advantage of NumPy arrays is that the operations are propagated to the individual elements.
>>> a*2
array([  0,   2, 154,   6,   8,  10])
>>> a**2
array([   0,    1, 5929,    9,   16,   25])
Contrast that to ordinary Python lists:

>>> [1,2,3,4,5]*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> [1,2,3,4,5]**2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
Of course, by using NumPy arrays we sacrifice the agility Python lists offer. Simple operations like adding or removing elements are a bit complex for NumPy arrays. Luckily, we have both at our disposal, and we will use the right one for the task at hand.
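To illustrate that lack of agility, here is a short sketch (our own example, not from the chapter) of what adding and removing elements looks like with NumPy: unlike a list's append(), the array functions return new arrays rather than modifying anything in place.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5])

# np.append() and np.delete() cannot grow or shrink the array in place;
# they allocate and return a fresh array each time.
b = np.append(a, 6)   # new array with 6 appended
c = np.delete(a, 0)   # new array without the element at index 0

print(b)  # [0 1 2 3 4 5 6]
print(c)  # [1 2 3 4 5]
print(a)  # the original is untouched: [0 1 2 3 4 5]
```

For comparison, a Python list handles the same operations cheaply and in place with append() and del, which is exactly why we keep both structures at our disposal.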
Indexing
Part of the power of NumPy comes from the versatile ways in which its arrays can be accessed.
In addition to normal list indexing, it allows us to use arrays themselves as indices.
>>> a[np.array([2,3,4])]
array([77,  3,  4])
In addition to the fact that conditions are now propagated to the individual elements, we gain a very convenient way to access our data.
>>> a>4
array([False, False,  True, False, False,  True], dtype=bool)
>>> a[a>4]
array([77,  5])
This can also be used to trim outliers.
>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])
As this is a frequent use case, there is a special clip function for it, clipping the values at both ends of an interval with one function call as follows:

>>> a.clip(0, 4)
array([0, 1, 4, 3, 4, 4])
Handling non-existing values
The power of NumPy's indexing capabilities comes in handy when preprocessing data that we have just read in from a text file. It will most likely contain invalid values, which we will mark as not being a real number using numpy.NAN as follows:
>>> c = np.array([1, 2, np.NAN, 3, 4])  # let's pretend we have read this from a text file
>>> c
array([  1.,   2.,  nan,   3.,   4.])
>>> np.isnan(c)
array([False, False,  True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1.,  2.,  3.,  4.])
>>> np.mean(c[~np.isnan(c)])
2.5
Comparing runtime behaviors
Let us compare the runtime behavior of NumPy with normal Python lists. In the following code, we will calculate the sum of all squared numbers from 1 to 1000 and see how much time the calculation will take. We do it 10,000 times and report the total time so that our measurement is accurate enough.
import timeit
normal_py_sec = timeit.timeit('sum(x*x for x in xrange(1000))',
                              number=10000)
naive_np_sec = timeit.timeit('sum(na*na)',
                             setup="import numpy as np; na=np.arange(1000)",
                             number=10000)
good_np_sec = timeit.timeit('na.dot(na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)

print("Normal Python: %f sec" % normal_py_sec)
print("Naive NumPy: %f sec" % naive_np_sec)
print("Good NumPy: %f sec" % good_np_sec)

Normal Python: 1.157467 sec
Naive NumPy: 4.061293 sec
Good NumPy: 0.033419 sec
We make two interesting observations. First, just using NumPy as data storage (Naive NumPy) takes 3.5 times longer, which is surprising since we believe it must be much faster as it is written as a C extension. One reason for this is that the access of individual elements from Python itself is rather costly. Only when we are able to apply algorithms inside the optimized extension code do we get speed improvements, and tremendous ones at that: using the dot() function of NumPy, we are more than 25 times faster. In summary, in every algorithm we are about to implement, we should always look at how we can move loops over individual elements from Python to some of the highly optimized NumPy or SciPy extension functions.
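As a small illustration of that principle (our own example, not from the chapter), consider filtering a noisy vector: the explicit Python loop and the vectorized NumPy version compute the same result, but the latter keeps the loop inside the extension code.

```python
import numpy as np

y = np.array([1.0, 2.0, np.nan, 4.0, 250.0])

# Loop version: every element access goes through the Python interpreter.
cleaned_loop = []
for v in y:
    if not np.isnan(v) and v <= 100:
        cleaned_loop.append(v)

# Vectorized version: both conditions become a single boolean mask,
# and the loop over elements runs inside NumPy's C code.
cleaned_np = y[~np.isnan(y) & (y <= 100)]

print(cleaned_np)  # [1. 2. 4.]
```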
However, the speed comes at a price. Using NumPy arrays, we no longer have the incredible flexibility of Python lists, which can hold basically anything. NumPy arrays always have only one data type.
>>> a = np.array([1,2,3])
>>> a.dtype
dtype('int64')
If we try to use elements of different types, NumPy will do its best to coerce them to the most reasonable common data type:

>>> np.array([1, "stringy"])
array(['1', 'stringy'], dtype='|S8')
>>> np.array([1, "stringy", set([1,2,3])])
array([1, stringy, set([1, 2, 3])], dtype=object)
Learning SciPy
On top of the efficient data structures of NumPy, SciPy offers a multitude of algorithms working on those arrays. Whatever numerically heavy algorithm you take from current books on numerical recipes, you will most likely find support for it in SciPy in one way or another. Whether it is matrix manipulation, linear algebra, optimization, clustering, spatial operations, or even fast Fourier transforms, the toolbox is readily filled. Therefore, it is a good habit to always inspect the scipy module before you start implementing a numerical algorithm.
For convenience, the complete namespace of NumPy is also accessible via SciPy. So, from now on, we will use NumPy's machinery via the SciPy namespace. You can check this easily by comparing the function references of any base function; for example:
>>> import scipy, numpy
>>> scipy.version.full_version
0.11.0
>>> scipy.dot is numpy.dot
True
The diverse algorithms are grouped into the following toolboxes:
SciPy package    Functionality
cluster          Hierarchical clustering (cluster.hierarchy)
                 Vector quantization / k-means (cluster.vq)
constants        Physical and mathematical constants
                 Conversion methods
fftpack          Discrete Fourier transform algorithms
integrate        Integration routines
interpolate      Interpolation (linear, cubic, and so on)
io               Data input and output
linalg           Linear algebra routines using the optimized
                 BLAS and LAPACK libraries
maxentropy       Functions for fitting maximum entropy models
ndimage          n-dimensional image package
odr              Orthogonal distance regression
optimize         Optimization (finding minima and roots)
signal           Signal processing
sparse           Sparse matrices
spatial          Spatial data structures and algorithms
special          Special mathematical functions such as Bessel
                 or Jacobian
stats            Statistics toolkit
The toolboxes most interesting to our endeavor are scipy.stats, scipy.interpolate, scipy.cluster, and scipy.signal. For the sake of brevity, we will briefly explore some features of the stats package and leave the others to be explained when they show up in the chapters.
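As a first taste of what scipy.stats offers (our own minimal example, not from the chapter), here is the standard normal distribution as a "frozen" distribution object:

```python
from scipy import stats

# A frozen standard normal distribution: the parameters are fixed once,
# then pdf/cdf/ppf and random sampling are available as methods.
norm = stats.norm(loc=0.0, scale=1.0)

print(norm.cdf(0.0))  # 0.5: half of the probability mass lies below the mean
print(norm.ppf(0.5))  # 0.0: ppf() is the inverse of cdf()
```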
Our first (tiny) machine learning application
Let us get our hands dirty and have a look at our hypothetical web start-up, MLAAS, which sells the service of providing machine learning algorithms via HTTP. With the increasing success of our company, the demand for better infrastructure also increases to serve all incoming web requests successfully. We don't want to allocate too many resources, as that would be too costly. On the other hand, we will lose money if we have not reserved enough resources for serving all incoming requests. The question now is, when will we hit the limit of our current infrastructure, which we estimated to be 100,000 requests per hour? We would like to know in advance when we have to request additional servers in the cloud to serve all the incoming requests successfully without paying for unused ones.
Reading in the data
We have collected the web stats for the last month and aggregated them in ch01/data/web_traffic.tsv (tsv because it contains tab-separated values). They are stored as the number of hits per hour. Each line contains the hour (numbered consecutively) and the number of web hits in that hour.
Using SciPy's genfromtxt(), we can easily read in the data.

import scipy as sp
data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")

We have to specify tab as the delimiter so that the columns are correctly determined. A quick check shows that we have correctly read in the data.
>>> print(data[:10])
[[  1.00000000e+00   2.27200000e+03]
 [  2.00000000e+00              nan]
 [  3.00000000e+00   1.38600000e+03]
 [  4.00000000e+00   1.36500000e+03]
 [  5.00000000e+00   1.48800000e+03]
 [  6.00000000e+00   1.33700000e+03]
 [  7.00000000e+00   1.88300000e+03]
 [  8.00000000e+00   2.28300000e+03]
 [  9.00000000e+00   1.33500000e+03]
 [  1.00000000e+01   1.02500000e+03]]
>>> print(data.shape)
(743, 2)
We have 743 data points with two dimensions.
Preprocessing and cleaning the data
It is more convenient for SciPy to separate the dimensions into two vectors, each of size 743. The first vector, x, will contain the hours, and the other, y, will contain the web hits in that particular hour. This splitting is done using the special index notation of SciPy, using which we can choose the columns individually.
x=data[:,0]
y=data[:,1]
One caveat is that we still have some invalid values, nan, in y. The question is, what can we do with them?
Let us check how many hours contain invalid data.

>>> sp.sum(sp.isnan(y))
8
We are missing only 8 out of 743 entries, so we can afford to remove them. Remember that we can index a SciPy array with another array. sp.isnan(y) returns an array of Booleans indicating whether an entry is not a number. Using ~, we logically negate that array so that we choose only those elements from x and y where y does contain valid numbers.
x = x[~sp.isnan(y)]
y = y[~sp.isnan(y)]
To get a first impression of our data, let us plot the data in a scatter plot using Matplotlib. Matplotlib contains the pyplot package, which tries to mimic Matlab's interface, a very convenient and easy-to-use one (you will find more tutorials on plotting at http://matplotlib.org/users/pyplot_tutorial.html).
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
           ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
plt.grid()
plt.show()
In the resulting chart, we can see that while in the first weeks the traffic stayed more or less the same, the last week shows a steep increase.
Choosing the right model and learning algorithm
Now that we have a first impression of the data, we return to the initial question: how long will our server handle the incoming web traffic? To answer this we have to:

• Find the real model behind the noisy data points
• Use the model to extrapolate into the future to find the point in time where our infrastructure has to be extended
Before building our first model
When we talk about models, you can think of them as simplified theoretical approximations of the complex reality. As such, there is always some inferiority involved, also called the approximation error. This error will guide us in choosing the right model among the myriad of choices we have. The error is calculated as the squared distance of the model's prediction to the real data. That is, for a learned model function, f, the error is calculated as follows:
def error(f, x, y):
    return sp.sum((f(x)-y)**2)
The vectors x and y contain the web stats data that we have extracted before. It is the beauty of SciPy's vectorized functions that we exploit here with f(x). The trained model is assumed to take a vector and return the results again as a vector of the same size so that we can use it to calculate the difference to y.
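A quick sanity check of this error function on made-up data (ours, not the chapter's web stats): a model that matches the generating line exactly has error zero, and a constant offset of 1 over three points gives an error of 3. We call NumPy directly here; via the SciPy namespace it behaves the same.

```python
import numpy as np

def error(f, x, y):
    return np.sum((f(x) - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = 2 * x + 1                      # data generated by a known line

f_perfect = np.poly1d([2.0, 1.0])  # exactly the generating line
f_off = np.poly1d([2.0, 2.0])      # intercept off by one everywhere

print(error(f_perfect, x, y))  # 0.0
print(error(f_off, x, y))      # 3.0 (three points, each off by one)
```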
Starting with a simple straight line
Let us assume for a second that the underlying model is a straight line. The challenge then is how to best put that line into the chart so that it results in the smallest approximation error. SciPy's polyfit() function does exactly that. Given data x and y and the desired order of the polynomial (a straight line has order 1), it finds the model function that minimizes the error function defined earlier.
fp1, residuals, rank, sv, rcond = sp.polyfit(x, y, 1, full=True)
The polyfit() function returns the parameters of the fitted model function, fp1; by setting full to True, we also get additional background information on the fitting process. Of it, only residuals are of interest, which is exactly the error of the approximation.

>>> print("Model parameters: %s" % fp1)
Model parameters: [   2.59619213  989.02487106]
>>> print(residuals)
[  3.17389767e+08]
This means that the best straight line fit is the following function:
f(x) = 2.59619213 * x + 989.02487106
We then use poly1d() to create a model function from the model parameters.

>>> f1 = sp.poly1d(fp1)
>>> print(error(f1, x, y))
317389767.34
We have used full=True to retrieve more details on the fitting process. Normally, we would not need it, in which case only the model parameters would be returned.
In fact, what we do here is simple curve fitting. You can find out more about it on Wikipedia at http://en.wikipedia.org/wiki/Curve_fitting.
We can now use f1() to plot our first trained model. In addition to the earlier plotting instructions, we simply add the following:

fx = sp.linspace(0, x[-1], 1000)  # generate X-values for plotting
plt.plot(fx, f1(fx), linewidth=4)
plt.legend(["d=%i" % f1.order], loc="upper left")
The following graph shows our first trained model:
It seems like the first four weeks are not that far off, although we clearly see that there is something wrong with our initial assumption that the underlying model is a straight line. Plus, how good or bad actually is the error of 317,389,767.34?
The absolute value of the error is seldom of use in isolation. However, when comparing two competing models, we can use their errors to judge which one of them is better. Although our first model clearly is not the one we would use, it serves a very important purpose in the workflow: we will use it as our baseline until we find a better one. Whatever model we come up with in the future, we will compare it against the current baseline.
The more complex we allow the model to get, the better the curves capture the data and the better the fit. The errors seem to tell the same story.
Error d=1:   317,389,767.339778
Error d=2:   179,983,507.878179
Error d=3:   139,350,144.031725
Error d=10:  121,942,326.363461
Error d=100: 109,318,004.475556
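The chapter only shows the order-1 fit explicitly; errors like the ones above could have been produced with a loop such as the following sketch. Since we do not have web_traffic.tsv at hand here, the data is a synthetic stand-in (any noisy one-dimensional series will do), so the error values will differ from the table above.

```python
import numpy as np

# Synthetic stand-in for the hourly web-traffic data.
rng = np.random.RandomState(3)
x = np.arange(1.0, 744.0)
y = 1000 + 2.5 * x + 50 * np.sin(x / 40.0) + rng.randn(x.size) * 100

def error(f, x, y):
    return np.sum((f(x) - y) ** 2)

# One polynomial per degree; a higher degree can only lower the
# training error, since each model class contains the simpler ones.
errors = {}
for degree in [1, 2, 3, 10, 100]:
    f = np.poly1d(np.polyfit(x, y, degree))
    errors[degree] = error(f, x, y)
    print("Error d=%i: %f" % (degree, errors[degree]))
```

(np.polyfit may emit a RankWarning for degree 100; the fit is badly conditioned, which is part of the story the chapter tells.)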
However, taking a closer look at the fitted curves, we start to wonder whether they also capture the true process that generated this data. Framed differently, do our models correctly represent the underlying mass behavior of customers visiting our website? Looking at the polynomials of degree 10 and 100, we see wildly oscillating behavior. It seems the models are fitted too much to the data, so much that they now capture not only the underlying process but also the noise. This is called overfitting.
At this point, we have the following choices:

• Selecting one of the fitted polynomial models
• Switching to another, more complex model class; splines?
• Thinking differently about the data and starting again
Of the five fitted models, the first-order model clearly is too simple, and the models of order 10 and 100 are clearly overfitting. Only the second- and third-order models seem to somehow match the data. However, if we extrapolate them at both borders, we see them going berserk.

Switching to a more complex class also seems to be the wrong way to go about it. What arguments would back which class?

At this point, we realize that we probably have not completely understood our data.
fa = sp.poly1d(sp.polyfit(xa, ya, 1))
fb = sp.poly1d(sp.polyfit(xb, yb, 1))

fa_error = error(fa, xa, ya)
fb_error = error(fb, xb, yb)
print("Error inflection=%f" % (fa_error + fb_error))
Error inflection=156,639,407.701523
Plotting the two models for the two data ranges gives the following chart:
Clearly, the combination of these two lines seems to be a much better fit to the data than anything we have modeled before. But still, the combined error is higher than the higher-order polynomials. Can we trust the error at the end?

Asked differently, why do we trust the straight line fitted only on the last week of our data more than any of the more complex models?
It is because we assume that it will capture future data better. If we plot the models into the future, we see how right we are (d=1 is again our initial straight line).
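Plotting into the future simply amounts to letting the x range of the plot run past the last observed hour. A minimal sketch (our own, with stand-in data and the non-interactive Agg backend so it runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")        # render off-screen; no display required
import matplotlib.pyplot as plt

# Stand-in data and two fitted models of degree 1 and 2.
x = np.arange(1.0, 744.0)
y = 1000 + 2.5 * x + 0.01 * x ** 2
f1 = np.poly1d(np.polyfit(x, y, 1))
f2 = np.poly1d(np.polyfit(x, y, 2))

# Key step: evaluate the models six weeks beyond the last data point.
fx = np.linspace(0, x[-1] + 6 * 7 * 24, 1000)
for f in [f1, f2]:
    plt.plot(fx, f(fx), linewidth=2)
plt.legend(["d=%i" % f.order for f in [f1, f2]], loc="upper left")
plt.savefig("extrapolation.png")
```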
The models of degree 10 and 100 don't seem to expect a bright future for our start-up. They tried so hard to model the given data correctly that they are clearly useless for extrapolating further. This is called overfitting. On the other hand, the lower-degree models do not seem to be capable of capturing the data properly. This is called underfitting.
So let us play fair to the models of degree 2 and above and try out how they behave if we fit them only to the data of the last week. After all, we believe that the last week says more about the future than the data before. The result can be seen in the following psychedelic chart, which shows even more clearly how bad the problem of overfitting is.
Still, judging from the errors of the models when trained only on the data from week 3.5 and after, we should still choose the most complex one.

Error d=1:   22,143,941.107618
Error d=2:   19,768,846.989176
Error d=3:   19,766,452.361027
Error d=10:  18,949,339.348539
Error d=100: 16,915,159.603877
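The variable fbt2 in the next snippet is not defined anywhere in this excerpt; it is presumably the degree-2 polynomial fitted only on the data after the inflection point, the model the chapter settles on. A sketch of how it could be built, again with synthetic stand-in data, exploiting that a poly1d minus a scalar is again a polynomial whose root fsolve can find:

```python
import numpy as np
from scipy.optimize import fsolve

# Synthetic stand-in for the traffic after week 3.5 (hours 588 to 743).
x = np.arange(588.0, 744.0)
y = 0.1 * x ** 2 - 50.0 * x + 10000.0   # quadratic growth stand-in

# Assumed definition: fbt2 is the order-2 polynomial fitted on that range.
fbt2 = np.poly1d(np.polyfit(x, y, 2))

# fbt2 - 100000 is itself a poly1d; fsolve finds where it crosses zero,
# that is, the hour at which the model predicts 100,000 hits/hour.
reached_max = fsolve(fbt2 - 100000, x0=800) / (7 * 24)
print("100,000 hits/hour expected at week %f" % reached_max[0])
```

With the real data, the same two lines yield the week 9.8 figure quoted below.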
>>> from scipy.optimize import fsolve
>>> reached_max = fsolve(fbt2-100000, 800)/(7*24)
>>> print("100,000 hits/hour expected at week %f" % reached_max[0])
100,000 hits/hour expected at week 9.827613
Our model tells us that, given the current user behavior and traction of our start-up, it will take another month until we reach our threshold capacity.
Of course, there is a certain uncertainty involved with our prediction. To get the real picture, you can draw in more sophisticated statistics to find out about the variance that we have to expect when looking farther and farther into the future. And then there are the user and underlying user behavior dynamics that we cannot model accurately. However, at this point we are fine with the current predictions. After all, we can prepare all the time-consuming actions now. If we then monitor our web traffic closely, we will see in time when we have to allocate new resources.