3/6/2016
A Complete Tutorial to Learn Data Science with Python from Scratch
Introduction
It happened a few years back. After working on SAS for more than 5 years, I decided to move out of [Link], my hunt for other useful tools was ON! Fortunately, it didn't take me long to decide: Python was my appetizer.
[Link] out, coding was so easy!
[Link], since then, I've not only explored this language in depth, but have also helped many others to learn it. Python was originally a general [Link], over the years, with strong community support, this language got dedicated libraries for data analysis and predictive modeling.
Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized information about how to use Python for Data Analysis, chew it till we are comfortable and practice it at our own end.
Table of Contents
[Link]
Why learn Python for data analysis?
Python 2.7 v/s 3.4
How to install Python?
Running a few simple programs in Python
[Link]
Python Data Structures
Python Iteration and Conditional Constructs
Python Libraries
[Link]
Introduction to series and dataframes
Analytics Vidhya dataset: Loan Prediction Problem
[Link]
[Link]
Logistic Regression
Decision Tree
Random Forest
Let's get started!
[Link]
Why learn Python for data analysis?
Python has gathered a lot of interest recently as a choice of language for data analysis. I had compared it against SAS & R some time back. Here are some reasons which go in favour of learning Python:
Open Source, free to install
Awesome online community
Very easy to learn
Can become a common language for data science and production of web-based analytics products.
Needless to say, it still has a few drawbacks too:
It is an interpreted language rather than a compiled language, hence it might take up more CPU time. However, given the savings in programmer time (due to ease of learning), it might still be a good choice.
Python 2.7 v/s 3.4
[Link], especially if
[Link]/[Link]
[Link].
Why Python 2.7?
[Link]! This is something you'd need in your early days. Python 2 was released in late 2000 and has been in use for more than 15 years.
[Link]! [Link] of modules work only on 2.x versions. If you plan to use Python for specific applications like web development with high reliance on external modules, you might be better off with 2.7.
3. Some of the features of 3.x versions have backward compatibility and can work with the 2.7 version.
Why Python 3.4?
[Link]! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.
[Link] is the future! 2.7 is the last release for the 2.x family and eventually everyone has to shift to 3.x versions. Python 3 has released stable versions for the past 5 years and will continue to do so.
There is no clear winner, but I suppose the bottom line is that you should focus on learning Python as a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated [Link]!
How to install Python?
There are 2 approaches to install Python:
You can download Python directly from its project site and install the individual components and libraries you want
Alternately, you can download and install a package, which comes with pre-installed libraries. I would [Link] .
The second method provides a hassle-free installation and hence I'll recommend that to [Link], [Link], until and unless you are doing cutting-edge statistical research.
Choosing a development environment
Once you have installed Python, [Link] 3 most common options:
Terminal / Shell based
IDLE (default environment)
iPython notebook, similar to markdown in R
IDLE editor for Python
While the right environment depends on your need, I personally prefer iPython Notebooks a lot. It provides a lot of good features for documenting while writing the code itself and you can choose to run the code in blocks (rather than line-by-line execution).
We will use the iPython environment for this complete tutorial.
Warming up: Running your first Python program
You can use Python as a simple calculator to start with:
Few things to note
You can start iPython notebook by writing "ipython notebook" on your terminal / cmd, depending on the OS you are working on
You can name an iPython notebook by simply clicking on the name UntitledO in the above screenshot
The interface shows In [*] for inputs and Out[*] for output.
You can execute a code by pressing Shift + Enter, or ALT + Enter if you want to insert an additional row after.
Before we deep dive into problem solving, let's take a step back and understand the basics of [Link] [Link], these include lists, strings, tuples, dictionaries, for-loop, while-loop, if-else, etc. Let's take a look at some of these.
[Link]
Python Data Structures
Following are some data structures, [Link] order to use them as appropriate.
Lists: Lists are one of the most versatile data structures in Python. A list can simply be defined by [Link], [Link] can be changed.
Here is a quick example to define a list and then access it:
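The screenshot that originally followed is not available here, so the sketch below stands in for it. The list contents and variable names are illustrative assumptions, not the author's exact example.

```python
# Define a list with square brackets (values chosen only for illustration)
squares_list = [0, 1, 4, 9, 16, 25]

first = squares_list[0]       # indexing starts at 0
middle = squares_list[2:4]    # a slice: elements at positions 2 and 3
last = squares_list[-1]       # negative indices count from the end

# Lists are mutable, so an individual element can be changed in place
squares_list[0] = 100

print(first, middle, last, squares_list)
```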
Strings: Strings can simply be defined by use of single ('), double (") or triple (''') inverted commas. Strings enclosed in triple quotes (''') can span over multiple lines and are used frequently in docstrings (Python's way of documenting functions). \ is used as an escape character. Please note that Python strings are immutable, so you cannot change part of a string.
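A minimal sketch of the points above: the three quoting styles, the escape character, and immutability. All string values and names are made up for illustration.

```python
greeting = 'Hello'                 # single quotes
name = "World"                     # double quotes
multi_line = """This string
spans two lines."""                # triple quotes can span lines

escaped = 'It\'s escaped'          # backslash escapes the quote character
combined = greeting + ", " + name  # concatenation builds a new string

# Strings are immutable: assigning to a position raises TypeError
try:
    greeting[0] = 'J'
    changed = True
except TypeError:
    changed = False

# To "change" a string, build a new one instead
new_greeting = 'J' + greeting[1:]
print(combined, changed, new_greeting)
```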
Tuples: A tuple is represented by a number of values separated by [Link] are immutable [Link], even though tuples are immutable, they can hold mutable data if needed.
Since tuples are immutable and cannot change, they are faster in processing as compared to [Link], if your list is unlikely to change, you should use tuples instead of lists.
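As a rough illustration of tuple immutability, and of a tuple holding mutable data, consider the sketch below. The values are arbitrary and the names are hypothetical.

```python
point = (3, 4, [5, 6])   # a tuple whose last element is a (mutable) list
x, y, extras = point      # tuples can be unpacked into separate names

# Rebinding a position of the tuple is not allowed
try:
    point[0] = 10
    reassigned = True
except TypeError:
    reassigned = False

# The list stored inside the tuple can still be modified
point[2].append(7)
print(point, reassigned)
```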
Dictionary: A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}.
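The following sketch shows key: value pairs and key uniqueness in practice. The names and numbers are invented for the example.

```python
empty = {}                                   # a pair of braces: empty dictionary
extensions = {'alice': 9073, 'bob': 9128}    # hypothetical key: value pairs

extensions['carol'] = 9223    # a new key adds a pair
extensions['alice'] = 1111    # an existing key is overwritten (keys are unique)

keys = sorted(extensions.keys())
has_bob = 'bob' in extensions  # membership test looks at keys
print(keys, extensions['alice'], has_bob)
```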
Python Iteration and Conditional Constructs
Like most languages, Python also has a FOR-loop, which is the most widely used method for [Link]:
for i in [Python Iterable]:
    expression(i)
Here, Python Iterable can be a list, tuple or other advanced data structures which we will explore in [Link], determining the factorial of a number.
fact = 1
for i in range(1, N+1):
    fact *= i
Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:
if [condition]:
    __execution if true__
else:
    __execution if false__
For instance, if we want to print whether the number N is even or odd:
if N % 2 == 0:
    print('Even')
else:
    print('Odd')
Now that you are familiar with Python fundamentals, let's take a step further. What if you have to perform the following tasks:
1. Multiply 2 matrices
[Link]
[Link]
[Link]
[Link]
If you try to write code from scratch, it's going to be a nightmare and you won't stay on Python for more than 2 days! [Link], there are many libraries with predefined functions which we can directly import into our code and make our life easy.
For example, [Link]:
[Link](N)
[Link].
Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful [Link] ways of doing so in Python:
import math as m
from math import *
In the first manner, [Link] from math library ([Link]) [Link]().
In the second manner, you have imported the entire name space in math, i.e. you can directly use factorial() without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know where the functions have come from.
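To make the difference concrete, here is a small sketch contrasting the aliased import with importing a name directly. It uses `from math import factorial` rather than `from math import *`, so that only one name is pulled into the namespace; that substitution is a choice made for this example.

```python
# Style 1: import the module under an alias and qualify every call
import math as m
value_with_alias = m.factorial(5)

# Style 2 (narrowed): import a single name and call it unqualified
from math import factorial
value_direct = factorial(5)

print(value_with_alias, value_direct)
```

With the first style, the `m.` prefix shows at the call site where the function comes from, which is the point of the tip above.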
Following is a list of libraries you will need for any scientific computations and data analysis:
[Link] library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transform, Linear Algebra, Optimization and Sparse matrices.
Matplotlib for plotting a vast variety of graphs, starting from histograms to line plots to heat plots. You can use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting features [Link], then pylab converts the ipython environment to an environment very [Link].
[Link] preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data scientist community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
[Link], estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative [Link] of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web browsers. It [Link], it has the capability of high-performance interactivity over very large or streaming datasets.
[Link] used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, [Link], Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
[Link] capability to start at a website home url and then dig through web pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, [Link]
formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similar to the standard python library urllib2 but is much easier to [Link] will find subtle differences with urllib2 but for beginners, Requests might be more convenient.
Additional libraries you might need:
os for Operating system and file operations
networkx and igraph for graph-based data manipulations
regular expressions for finding patterns in text data
[Link] webpage in a run.
Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:
[Link]
[Link]
[Link]
[Link]
In order to explore our data further, let me introduce you to another animal (as if Python was not enough!): Pandas
Image Source: Wikipedia
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
Before loading the data, let's understand the 2 key data structures in Pandas: Series and DataFrames
Introduction to Series and Dataframes
Series can be understood as a 1-dimensional labelled / indexed array. You can access individual elements of this series through these labels.
A dataframe is similar to an Excel workbook: you have column names referring to columns and you have rows, which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as column and row index, in the case of dataframes.
[Link] into these dataframes and then various operations ([Link], aggregation etc.) can be applied very easily to its columns.
More: 10 Minutes to Pandas
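A minimal sketch of the two structures, assuming pandas is installed. The numbers below are made up for illustration and are not taken from the loan data set.

```python
import pandas as pd

# A Series: a 1-dimensional array whose elements carry labels (the index)
incomes = pd.Series([5849, 4583, 3000], index=['a', 'b', 'c'])
income_b = incomes['b']                 # access an element through its label

# A DataFrame: named columns plus a row index (0, 1, 2 by default)
frame = pd.DataFrame({
    'ApplicantIncome': [5849, 4583, 3000],
    'LoanAmount': [128.0, 66.0, 120.0],
})
first_row = frame.loc[0]                # a row, looked up by its index label
mean_loan = frame['LoanAmount'].mean()  # an aggregation applied to one column

print(income_b, mean_loan)
```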
Practice data set: Loan Prediction Problem
[Link]:
VARIABLE DESCRIPTIONS:
Variable            Description
Loan_ID             Unique Loan ID
Gender              Male/Female
Married             Applicant married (Y/N)
Dependents          Number of dependents
Education           Applicant Education (Graduate/Under Graduate)
Self_Employed       Self employed (Y/N)
ApplicantIncome     Applicant income
CoapplicantIncome   Coapplicant income
LoanAmount          Loan amount in thousands
Loan_Amount_Term    Term of loan in months
Credit_History      credit history meets guidelines
Property_Area       Urban/Semi Urban/Rural
Loan_Status         Loan approved (Y/N)
Let's begin with exploration
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / windows command prompt:
ipython notebook --pylab=inline
This opens up iPython notebook in pylab environment, which has a few useful libraries already [Link], you will be able to plot your data inline, which makes this a really good environment [Link], by typing the following command (and getting the output as seen in the figure below):
plot(arange(5))
I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/[Link]
Importing libraries and the data set:
Following are the libraries we will use during this tutorial:
numpy
matplotlib
pandas
Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
After importing the library, you read the dataset using the function read_csv(). This is how the code looks till this stage:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/[Link]")  # Reading the dataset into a dataframe using Pandas
Quick Data Exploration
Once you have read the dataset, you can have a look at a few top rows by using the function head()
[Link](10)
[Link], you can also look at more rows by printing the dataset.
Next, you can look at a summary of numerical fields by using the describe() function
[Link]()
The describe() function would provide count, mean, standard deviation (std), min, quartiles and max in its output (Read this article to refresh basic statistics to understand population distribution)
Here are a few inferences you can draw by looking at the output of the describe() function:
[Link] (614 - 592) 22 missing values.
2. Loan_Amount_Term has (614 - 600) 14 missing values.
3. Credit_History has (614 - 564) 50 missing values.
4. We can also see that about 84% of applicants have a credit_history. How? The mean of the Credit_History field is 0.84 (Remember, Credit_History has value 1 for those who have a credit history and 0 otherwise)
[Link]
Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.
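The inferences above can be reproduced on a toy frame. This is only a sketch: the five rows below are invented, so the numbers differ from the real data set, but the reasoning (row count minus the count figure gives missing values, the mean of a 0/1 field gives a share, mean versus 50% hints at skew) is the same.

```python
import numpy as np
import pandas as pd

# Invented data standing in for the loan data set
toy = pd.DataFrame({
    'LoanAmount': [100.0, 120.0, np.nan, 110.0, 700.0],
    'Credit_History': [1.0, 1.0, 0.0, np.nan, 1.0],
})

summary = toy.describe()   # count, mean, std, min, 25%, 50%, 75%, max

# count ignores NaN, so total rows minus count is the number of missing values
n_missing_loan = len(toy) - int(summary.loc['count', 'LoanAmount'])

# The mean of a 0/1 field is the share of rows with value 1
share_with_history = summary.loc['mean', 'Credit_History']

# A mean well above the median (the 50% row) suggests a right skew
mean_loan = summary.loc['mean', 'LoanAmount']
median_loan = summary.loc['50%', 'LoanAmount']
right_skewed = mean_loan > median_loan

print(n_missing_loan, share_with_history, mean_loan, median_loan)
```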
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at frequency distribution to understand whether they make sense or [Link] frequency table can be printed by the following command:
df['Property_Area'].value_counts()
Similarly, [Link][column_name] is a [Link] [Link], refer to the 10 Minutes to Pandas resource shared above.
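As a small sketch of column indexing and frequency tables, assuming pandas is installed: the rows are invented, and the `Married` column is included only to show selecting several columns at once.

```python
import pandas as pd

# Invented rows; category labels follow the variable description table
toy = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Semi Urban', 'Urban', 'Urban', 'Rural'],
    'Married': ['Y', 'N', 'Y', 'Y', 'N', 'N'],
})

area_column = toy['Property_Area']           # a single column comes back as a Series
subset = toy[['Property_Area', 'Married']]   # a list of names gives a DataFrame

counts = area_column.value_counts()          # frequencies, most common first
print(counts)
```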
Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with numeric variables, namely ApplicantIncome and LoanAmount
Let's start by plotting the histogram of ApplicantIncome using the following commands:
df['ApplicantIncome'].hist(bins=50)
Here we observe that there are few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, [Link]:
[Link](column='ApplicantIncome')
This confirms the presence of a lot of outliers / [Link] [Link] [Link]:
[Link](column='ApplicantIncome', by='Education')
We can see that there is no substantial difference between the mean income of graduates and non [Link], which are appearing
to be the outliers.
Now, let's look at the histogram and boxplot of LoanAmount using the following command:
df['LoanAmount'].hist(bins=50)
[Link](column='LoanAmount')
Again, [Link], both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, [Link] up in coming sections.
Categorical variable analysis
Now that we understand distributions for ApplicantIncome and LoanAmount, let us understand categorical variables in more detail. We will use Excel-style pivot tables and cross-tabulation. For instance, [Link] MS Excel using a pivot table as:
Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting a loan.
[Link] this article for getting a hang of the different data manipulation techniques in Pandas.
temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'], aggfunc=lambda x: [Link]p({'Y': 1, 'N': 0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)
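For readers who want to try the idea without the data set, here is a self-contained sketch on invented rows. It differs from the code above in one respect, stated as an assumption: Loan_Status is first mapped to 1/0 in a hypothetical helper column `Approved`, and the pivot then uses the built-in 'mean' aggregation instead of a lambda.

```python
import pandas as pd

# Invented rows standing in for the loan data set
toy = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    'Loan_Status':    ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# Frequency of each credit history class, smallest first
freq = toy['Credit_History'].value_counts(ascending=True)

# Code Y/N as 1/0 so that the mean is the share of approved loans
toy['Approved'] = toy['Loan_Status'].map({'Y': 1, 'N': 0})
prob = toy.pivot_table(values='Approved', index='Credit_History', aggfunc='mean')

print(freq)
print(prob)
```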
Now we can observe that we get a pivot_table similar to the MS Excel one. This can be plotted as a bar chart using the matplotlib library with the following code:
[Link]
fig = [Link](figsize=(8, 4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
[Link](kind='bar')
ax2 = fig.add_subplot(122)
[Link](kind='bar')
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")
This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:
temp3 = [Link](df['Credit_History'], df['Loan_Status'])
[Link](kind='bar', stacked=True, color=['red', 'blue'], grid=False)
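The table behind such a stacked chart is a cross-tabulation of the two columns. The sketch below builds one with `pd.crosstab` on invented rows and leaves the plotting out; the `normalize='index'` variant, which turns counts into row shares, is an addition not used in the tutorial code.

```python
import pandas as pd

# Invented rows standing in for the loan data set
toy = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    'Loan_Status':    ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# Rows: credit history class; columns: loan status; cells: applicant counts
table = pd.crosstab(toy['Credit_History'], toy['Loan_Status'])

# Same table as shares within each credit history class
shares = pd.crosstab(toy['Credit_History'], toy['Loan_Status'], normalize='index')

print(table)
print(shares)
```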
You can also add gender into the mix (similar to the pivot table in Excel):
If you have not realized already, we have just created two basic classification algorithms here, one based on credit history, while the other on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.
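One way to read "classification algorithm based on credit history" is a single rule: predict approval when Credit_History is 1. The sketch below is an interpretation under that assumption, run on invented rows, and is not the author's submission code.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the loan data set
toy = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    'Loan_Status':    ['Y', 'Y', 'N', 'Y', 'N', 'Y'],
})

# Rule: approve ('Y') when the applicant has a credit history, else reject ('N')
predicted = np.where(toy['Credit_History'] == 1, 'Y', 'N')

# Share of rows where the rule agrees with the recorded outcome
accuracy = (toy['Loan_Status'] == predicted).mean()
print(list(predicted), accuracy)
```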
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for
pandas (the animal) would have increased by now, given the amount of help the library can provide you in analyzing datasets.
Next, let's explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.
[Link]: Using Pandas
For those who have been following, here are your must-wear shoes to start running.
Data munging: recap of the need
During our exploration of the data, we found a few problems in the data set, which need to be solved [Link] are the problems we are already aware of:
[Link] amount of missing values and the expected importance of variables.
[Link] looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.
In addition to these problems with numerical fields, we should also look at the non-numerical fields, i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful information.
If you are new to Pandas, I would recommend reading [Link] useful techniques of data manipulation.
Check missing values in the dataset
Let us look at missing values in all the variables because most of the models don't work with missing
data and even if they do, [Link], let us check the number of nulls / NaNs in the dataset
[Link](lambda x: sum([Link]()), axis=0)
This command should tell us the number of missing values in each column, as isnull() returns True (counted as 1 when summed) if the value is null.
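A self-contained sketch of this count on an invented frame follows. It shows the apply-with-lambda form next to the shorter `isnull().sum()`, which gives the same per-column result; the column values are made up.

```python
import numpy as np
import pandas as pd

# Invented rows with a few gaps
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Female', 'Male'],
    'LoanAmount': [128.0, np.nan, np.nan, 66.0],
    'ApplicantIncome': [5849, 4583, 3000, 2583],
})

# Column by column: isnull() marks missing cells as True, sum() counts them
missing_apply = toy.apply(lambda col: sum(col.isnull()), axis=0)

# Equivalent shorthand on the whole frame
missing_builtin = toy.isnull().sum()

print(missing_apply)
```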
Though the missing values are not very high in number, many variables have them and each one of these should be estimated and added in the data. Get a detailed view on different imputation techniques through this article.
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your [Link].
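If you decide that a zero term really means "missing", one possible treatment is to convert those zeros to NaN so that they are counted and imputed with the other gaps. The sketch below assumes that decision and uses invented values.

```python
import numpy as np
import pandas as pd

# Invented loan terms: two zeros and one genuine NaN
toy = pd.DataFrame({'Loan_Amount_Term': [360.0, 0.0, 180.0, np.nan, 0.0]})

n_nan_before = toy['Loan_Amount_Term'].isnull().sum()
n_zero = (toy['Loan_Amount_Term'] == 0).sum()

# Treat 0 as a disguised missing value (an assumption, not a rule)
cleaned = toy['Loan_Amount_Term'].replace(0, np.nan)
n_nan_after = cleaned.isnull().sum()

print(n_nan_before, n_zero, n_nan_after)
```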
How to fill missing values in LoanAmount?