A Complete Tutorial To Learn Data Science With Python From Scratch
Introduction
https://www.analyticsvidhya.com/blog/2016/01/completetutoriallearndatasciencepythonscratch2/
20/11/2016
It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my
comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn't take
me long to decide: Python was my appetizer.
I always had an inclination towards coding. This was the time to do what I really loved. Code. Turned
out, coding was so easy!
I learned the basics of Python within a week. Since then, I've not only explored this language in
depth, but have also helped many others learn it. Python was originally a general
purpose language. But, over the years, with strong community support, it has gained dedicated
libraries for data analysis and predictive modeling.
Due to the lack of resources on Python for data science, I decided to create this tutorial to help many
others learn Python faster. In this tutorial, we will take bite-sized information about how to use
Python for data analysis, chew it till we are comfortable and practice it at our own end.
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/python.jpg)
Table of Contents
1. Basics of Python for Data Analysis
Why learn Python for data analysis?
Python 2.7 v/s 3.4
It is an interpreted language rather than a compiled language, hence it might take up more CPU time.
However, given the savings in programmer time (due to ease of learning), it might still be a good
choice.
There is no clear winner, but I suppose the bottom line is that you should focus on learning Python as
a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated
article on Python 2.x vs 3.x in the near future!
want
Alternately, you can download and install a package which comes with pre-installed libraries. I would
recommend downloading Anaconda (https://www.continuum.io/downloads). Another option could
be Enthought Canopy Express (https://www.enthought.com/downloads/).
The second method provides a hassle-free installation and hence I'll recommend that to beginners.
The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you
are interested in the latest version of a single library. It should not matter unless you are doing
cutting-edge statistical research.
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/jupyter1.png)
Before we deep dive into problem solving, let's take a step back and understand the basics of Python.
As we know, data structures and iteration and conditional constructs form the crux of any
language. In Python, these include lists, strings, tuples, dictionaries, for-loop, while-loop, if-else, etc.
Let's take a look at some of these.
(https://www.analyticsvidhya.com/blog/wp-content/uploads/2014/07/python_lists.png)
Strings: Strings can simply be defined by use of single ( ' ), double ( " ) or triple ( ''' ) inverted commas.
Strings enclosed in triple quotes ( ''' ) can span over multiple lines and are used frequently in docstrings
(Python's way of documenting functions). \ is used as an escape character. Please note that Python
strings are immutable, so you cannot change part of a string.
(https://www.analyticsvidhya.com/blog/wp-content/uploads/2014/07/python_strings.png)
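The string rules above can be tried out with a small sketch. The variable names are illustrative only and are not taken from the original figure:

```python
# Three ways to quote a string
single = 'data'
double = "science"
multi = """line one
line two"""                      # triple quotes may span several lines

# The backslash escapes the quote character inside the string
escaped = 'It\'s Python'

# Strings are immutable: item assignment raises TypeError
try:
    single[0] = 'D'
    changed = True
except TypeError:
    changed = False

# To "modify" a string, build a new one instead
capitalized = 'D' + single[1:]
```

Because the original string is never altered, `single` still holds `'data'` after the last line; only `capitalized` holds the new value.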
Tuples: A tuple is represented by a number of values separated by commas. Tuples are immutable
and the output is surrounded by parentheses so that nested tuples are processed correctly.
Additionally, even though tuples are immutable, they can hold mutable data if needed.
Since tuples are immutable and cannot change, they are generally faster to process than
lists. Hence, if your list is unlikely to change, you should use a tuple instead of a list.
(https://www.analyticsvidhya.com/blog/wp-content/uploads/2014/07/Python_tuples.png)
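As a rough illustration of the tuple behaviour described above (the names `point`, `nested` and `record` are made up for this sketch):

```python
# Commas create the tuple; the parentheses are optional on input
point = 1, 2, 3
nested = (point, ('a', 'b'))     # tuples can be nested

# Tuples are immutable: item assignment raises TypeError
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# A tuple can still hold a mutable object, and that object can change
record = ('scores', [90, 85])
record[1].append(70)
```

The tuple `record` itself still refers to the same two objects; only the list inside it has grown.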
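Dictionary: A dictionary is a set of key: value pairs, with the requirement that the keys are
unique (within one dictionary). Dictionaries were unordered in the Python versions discussed here;
since Python 3.7 they preserve insertion order. A pair of braces creates an empty dictionary: {}.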
(https://www.analyticsvidhya.com/blog/wp-content/uploads/2014/07/Python_dictionary.png)
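A minimal sketch of typical dictionary operations follows. The keys and values are invented for illustration:

```python
extensions = {}                  # a pair of braces creates an empty dictionary
extensions['alice'] = 9073       # add key: value pairs
extensions['bob'] = 9128
extensions['alice'] = 9999       # keys are unique, so this overwrites the old value

has_carol = 'carol' in extensions    # membership test on keys
names = sorted(extensions.keys())    # keys, sorted alphabetically
```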
for i in [Python Iterable]:
    expression(i)
Here "Python Iterable" can be a list, tuple or other advanced data structures which we will explore in
later sections. Let's take a look at a simple example, determining the factorial of a number N.
fact = 1
for i in range(1, N+1):
    fact *= i
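The loop above assumes that N has already been defined. One way to make it runnable as-is is to wrap it in a small function; the function name `factorial` is chosen here for illustration:

```python
def factorial(n):
    """Return n! for a non-negative integer n, using a for-loop."""
    fact = 1
    for i in range(1, n + 1):    # i takes the values 1, 2, ..., n
        fact *= i
    return fact

result = factorial(5)            # 1 * 2 * 3 * 4 * 5
```

For n = 0 the loop body never runs, so the function returns 1, which matches the usual definition of 0!.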
Coming to conditional statements, these are used to execute code fragments based on a condition.
The most commonly used construct is if-else, with the following syntax:
if [condition]:
    __execution if true__
else:
    __execution if false__
For instance, to print whether a number N is even or odd:
if N % 2 == 0:
    print('Even')
else:
    print('Odd')
Now that you are familiar with Python fundamentals, let's take a step further. What if you have to
perform the following tasks:
1. Multiply 2 matrices
2. Find the root of a quadratic equation
3. Plot bar charts and histograms
4. Make statistical models
5. Access web-pages
If you try to write code from scratch, it's going to be a nightmare and you won't stay on Python for
more than 2 days! But let's not worry about that. Thankfully, there are many libraries with predefined
functions which we can directly import into our code and make our life easy.
For example, consider the factorial example we just saw. We can do that in a single step as:
math.factorial(N)
Of course, we need to import the math library for that. Let's explore the various libraries next.
Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful
libraries. The first step is obviously to learn to import them into our environment. There are several
ways of doing so in Python:
import math as m
from math import *
In the first manner, we have defined an alias m for the library math. We can now use various functions
from the math library (e.g. factorial) by referencing it using the alias: m.factorial().
In the second manner, you have imported the entire namespace in math, i.e. you can directly use
factorial() without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know where the
functions have come from.
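The two import styles can be compared side by side in a short sketch. It also shows a third variant, importing one name explicitly, which is not mentioned in the text above but avoids pulling in the whole namespace:

```python
import math as m               # first style: alias for the library
from math import factorial     # explicit import of a single name (an addition to the text above)

a = m.factorial(5)             # the prefix shows where the function comes from
b = factorial(5)               # shorter, but the origin is no longer visible at the call site
```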
Following is a list of libraries you will need for any scientific computations and data analysis:
NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This
library also contains basic linear algebra functions, Fourier transforms, advanced random number
capabilities and tools for integration with other low level languages like Fortran, C and C++.
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a
variety of high level science and engineering modules like discrete Fourier transform, linear algebra,
optimization and sparse matrices.
Matplotlib for plotting a vast variety of graphs, starting from histograms to line plots to heat plots. You
can use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting
features inline. If you ignore the inline option, then pylab converts the ipython environment to an
environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and
preparation. Pandas was added relatively recently to Python and has been instrumental in boosting
Python's usage in the data scientist community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of
efficient tools for machine learning and statistical modeling including classification, regression,
clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore
data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics,
statistical tests, plotting functions, and result statistics are available for different types of data and each
estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative
statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central
part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web-browsers. It
empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the
capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of Numpy and Pandas to distributed and streaming datasets. It can
be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache
Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective
visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the
capability to start at a website home url and then dig through web-pages within the website to gather
information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to
calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability
of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similar to the standard python library urllib2 but is much
easier to code. You will find subtle differences with urllib2 but for beginners, Requests might be more
convenient.
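To give a feel for what these libraries save you, here is a small sketch that uses NumPy for two of the tasks listed earlier: multiplying two matrices and finding the roots of a quadratic equation. The matrices and the equation x^2 - 3x + 2 = 0 are made-up examples, and the `@` operator assumes Python 3.5 or later:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
product = A @ B                  # matrix product, equivalent to np.dot(A, B)

# Roots of x^2 - 3x + 2 = 0; coefficients are given from the highest power down
roots = np.sort(np.roots([1, -3, 2]))
```

Writing either of these from scratch would take nested loops or the quadratic formula; with NumPy each is a single call.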
Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into
problem solving through Python. Yes, I mean making a predictive model! In the process, we use some
powerful libraries and also come across the next level of data structures. We will take you through
the 3 key phases:
1. Data Exploration: finding out more about the data we have
2. Data Munging: cleaning the data and playing with it to make it better suit statistical modeling
3. Predictive Modeling: running the actual algorithms and having fun
VARIABLE DESCRIPTIONS:
Variable            Description
Loan_ID             Unique Loan ID
Gender              Male/Female
Married             Applicant married (Y/N)
Dependents          Number of dependents
Education           Applicant Education (Graduate/Under Graduate)
Self_Employed       Self employed (Y/N)
ApplicantIncome     Applicant income
CoapplicantIncome   Coapplicant income
LoanAmount          Loan amount in thousands
Loan_Amount_Term    Term of loan in months
Credit_History      Credit history meets guidelines
Property_Area       Urban/Semi Urban/Rural
Loan_Status         Loan approved (Y/N)
ipython notebook --pylab=inline
This opens up iPython notebook in the pylab environment, which has a few useful libraries already
imported. Also, you will be able to plot your data inline, which makes this a really good environment
for interactive data analysis. You can check whether the environment has loaded correctly by typing
the following command (and getting the output as seen in the figure below):
plot(arange(5))
(https://www.analyticsvidhya.com/blog/wp-content/uploads/2014/08/ipython_pylab_check.png)
I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv
Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I
have still kept them in the code, in case you use the code in a different environment.
After importing the library, you read the dataset using the function read_csv(). This is how the code
looks till this stage:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting interface lives in matplotlib.pyplot

df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")  # Reading the dataset in a dataframe using Pandas
df.head(10)
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/1.-head.png)
This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of the numerical fields by using the describe() function:
df.describe()
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/2.-describe.png)
The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its
output. (Read this article (https://www.analyticsvidhya.com/blog/2014/07/statistics/) to refresh basic
statistics to understand population distribution.)
Here are a few inferences you can draw by looking at the output of the describe() function:
1. LoanAmount has (614 - 592) 22 missing values.
2. Loan_Amount_Term has (614 - 600) 14 missing values.
3. Credit_History has (614 - 564) 50 missing values.
4. We can also see that about 84% of applicants have a credit history. How? The mean of the Credit_History
field is 0.84 (remember, Credit_History has value 1 for those who have a credit history and 0 otherwise).
5. The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome.
Please note that we can get an idea of a possible skew in the data by comparing the mean to the
median, i.e. the 50% figure.
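The same reasoning can be reproduced on a tiny made-up dataframe. The five rows below are invented for illustration and are not the loan data; the variable names are likewise assumptions of this sketch:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'LoanAmount': [100.0, 120.0, np.nan, 150.0, 700.0],
    'Credit_History': [1.0, 1.0, 0.0, np.nan, 1.0],
})

summary = toy.describe()

# count only includes non-missing values, so rows minus count = missing values
missing = len(toy) - summary.loc['count']

# The mean of a 0/1 column is the share of rows with value 1
share_with_history = summary.loc['mean', 'Credit_History']

# A mean well above the median (the 50% row) hints at a right-skewed column
mean_loan = summary.loc['mean', 'LoanAmount']
median_loan = summary.loc['50%', 'LoanAmount']
```

Here the single large value of 700 pulls the mean of LoanAmount far above its median, which is the kind of skew the comparison is meant to reveal.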
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at the frequency
distribution to understand whether they make sense or not. The frequency table can be printed by
the following command:
df['Property_Area'].value_counts()
Similarly, we can look at the unique values of credit history. Note that dfname['column_name'] is a
basic indexing technique to access a particular column of the dataframe. It can be a list of columns as
well. For more information, refer to the 10 Minutes to Pandas resource shared above.
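A short sketch of both ideas, frequency tables and column indexing, on an invented dataframe (the values are not from the loan data):

```python
import pandas as pd

toy = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Semiurban', 'Urban'],
    'Credit_History': [1, 0, 1, 1, 0],
})

# Frequency table of a categorical column
area_counts = toy['Property_Area'].value_counts()

# A single column name returns a Series
one_column = toy['Property_Area']

# A list of column names returns a DataFrame
two_columns = toy[['Property_Area', 'Credit_History']]
```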
Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables.
Let us start with the numeric variables, namely ApplicantIncome and LoanAmount.
Let's start by plotting the histogram of ApplicantIncome using the following command:
df['ApplicantIncome'].hist(bins=50)
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/output_6_1.png)
Here we observe that there are a few extreme values. This is also the reason why 50 bins are required
to depict the distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:
df.boxplot(column='ApplicantIncome')
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/output_7_1.png)
This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income
disparity in society. Part of this can be driven by the fact that we are looking at people with
different education levels. Let us segregate them by Education:
df.boxplot(column='ApplicantIncome', by='Education')
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/output_9_1.png)
We can see that there is no substantial difference between the median income of graduates and
non-graduates. But there are a higher number of graduates with very high incomes, which appear
to be the outliers.
Now, let's look at the histogram and boxplot of LoanAmount using the following commands:
df['LoanAmount'].hist(bins=50)
(https://www.analyticsvidhya.com/wpcontent/uploads/2016/01/output_13_1.png)
df.boxplot(column='LoanAmount')
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/output_14_1.png)
Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some
amount of data munging. LoanAmount has missing as well as extreme values, while
ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this
up in the coming sections.
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/10.-pivot_table3.png)
Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the
probability of getting a loan.
Now we will look at the steps required to generate a similar insight using Python. Please refer to this
article (https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/)
for getting a hang of the different data manipulation techniques in Pandas.
temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/11.-pivot_python.png)
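The idea that the mean of a 0/1-coded status is the approval probability can be checked on a few invented rows. This sketch uses groupby instead of pivot_table, which is a swapped-in but equivalent way to get the per-group mean:

```python
import pandas as pd

toy = pd.DataFrame({
    'Credit_History': [1, 1, 1, 1, 0, 0],
    'Loan_Status':    ['Y', 'Y', 'Y', 'N', 'N', 'Y'],
})

# Code the status as 1 for Yes and 0 for No
approved = toy['Loan_Status'].map({'Y': 1, 'N': 0})

# The mean of the coded status within each group is the share of approvals
prob_by_history = approved.groupby(toy['Credit_History']).mean()
```

In this toy data three of the four applicants with a credit history are approved, against one of the two without.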
Now we can observe that we get a pivot table similar to the MS Excel one. This can be plotted as a
bar chart using the matplotlib library with the following code:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
temp1.plot(kind='bar', ax=ax1)  # draw explicitly on the left axes
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2)  # draw explicitly on the right axes
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/output_16_1.png)
This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history.
You can plot similar graphs by Married, Self_Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/output_17_1.png)
You can also add gender into the mix (similar to the pivot table in Excel):
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/output_18_1.png)
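The code for the gender variant is not shown above. One possible way to build such a table, assuming it follows the same crosstab approach, is to pass a list of columns as the row key. The rows below are invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'Credit_History': [1, 1, 0, 0, 1, 1],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Loan_Status': ['Y', 'Y', 'N', 'Y', 'N', 'Y'],
})

# Counts of Loan_Status for each Credit_History value
by_history = pd.crosstab(toy['Credit_History'], toy['Loan_Status'])

# A list of columns gives one row per (Credit_History, Gender) combination
by_history_gender = pd.crosstab([toy['Credit_History'], toy['Gender']],
                                toy['Loan_Status'])
```

Either table can then be drawn with `.plot(kind='bar', stacked=True)` in the same way as temp3.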
If you have not realized already, we have just created two basic classification algorithms here, one
based on credit history, while the other is based on 2 categorical variables (including gender). You can
quickly code this to create your first submission on AV Datahacks.
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas
(the animal) would have increased by now, given the amount of help the library can provide you in
analyzing datasets.
Next let's explore ApplicantIncome and LoanStatus variables further, perform data munging
(https://www.analyticsvidhya.com/blog/2014/09/data-munging-python-using-pandas-baby-stepspython/) and create a dataset for applying various modeling techniques. I would strongly urge that
you take another dataset and problem and go through an independent example before reading
further.
In addition to these problems with numerical fields, we should also look at the non-numerical fields,
i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful
information.
If you are new to Pandas, I would recommend reading this article.
df.apply(lambda x: sum(x.isnull()), axis=0)
This command should tell us the number of missing values in each column, as isnull() returns True
(counted as 1) if the value is null.
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/4.-missing.png)
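The same count can also be obtained with the shorter idiom `isnull().sum()`, shown here on an invented dataframe as an alternative to the apply-based command:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'LoanAmount': [100.0, np.nan, 150.0, np.nan],
    'Self_Employed': ['No', None, 'Yes', 'No'],
    'Loan_Amount_Term': [360, 360, 240, 180],
})

# isnull() marks each missing cell as True; summing counts the True values per column
missing_per_column = toy.isnull().sum()
```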
Though the missing values are not very high in number, many variables have them and each one
of these should be estimated and added to the data. Get a detailed view on different imputation
techniques through this article (https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/).
Note: Remember that missing values may not always be NaNs. For instance, if the
Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your
answer is missing and you're right. So we should check for values which are impractical.
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
The other extreme could be to build a supervised learning model to predict loan amount on the basis
of other variables and then use the predicted loan amount along with other variables to predict the loan status.
Since the purpose now is to bring out the steps in data munging, I'll rather take an approach which
lies somewhere in between these 2 extremes. A key hypothesis is that whether a person is
educated or self-employed can combine to give a good estimate of loan amount.
First, let's look at the boxplot to see if a trend exists:
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/5.-loan-amount-boxplot.png)
Thus we see some variations in the median of loan amount for each group and this can be used to
impute the values. But first, we have to ensure that neither the Self_Employed nor the Education
variable has missing values.
As we saw earlier, Self_Employed has some missing values. Let's look at the frequency table:
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/6.-self-emp.png)
Since ~86% of the values are "No", it is safe to impute the missing values as "No", as there is a high
probability of being right. This can be done using the following code:
df['Self_Employed'] = df['Self_Employed'].fillna('No')
Now, we will create a pivot table, which provides us the median values for all the groups of unique values
of the Self_Employed and Education features. Next, we define a function which returns the values of
these cells and apply it to fill the missing values of loan amount:
table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc='median')
# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]
# Replace missing values (assignment avoids relying on inplace=True on a column selection)
df['LoanAmount'] = df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1))
This should provide you a good way to impute missing values of loan amount.
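As an alternative sketch of the same idea, the group medians can be computed with groupby and transform, which avoids the helper function. This is a swapped-in technique, not the approach used above, and the rows are invented:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Self_Employed': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'Education': ['Graduate'] * 6,
    'LoanAmount': [100.0, 120.0, np.nan, 200.0, np.nan, 300.0],
})

# Median LoanAmount of each (Self_Employed, Education) group, aligned to every row
group_median = (toy.groupby(['Self_Employed', 'Education'])['LoanAmount']
                   .transform('median'))

# Fill only the missing entries with the median of their own group
toy['LoanAmount'] = toy['LoanAmount'].fillna(group_median)
```

The missing value in the "No" group becomes 110 (the median of 100 and 120) and the one in the "Yes" group becomes 250.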
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/7.-loan-log.png)
Now the distribution looks much closer to normal and the effect of extreme values has subsided
significantly.
Coming to ApplicantIncome: one intuition can be that some applicants have a lower income but strong
supporting co-applicants. So it might be a good idea to combine both incomes as total income and take
a log transformation of the same.
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/8.-total-income-log.png)
Now we see that the distribution is much better than before. I will leave it up to you to impute the
missing values for Gender, Married, Dependents, Loan_Amount_Term and Credit_History. Also, I
encourage you to think about possible additional information which can be derived from the data. For
example, creating a column for LoanAmount/TotalIncome might make sense, as it gives an idea of
how well the applicant is suited to pay back the loan.
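The derived columns discussed here can be sketched on a few invented rows. The column name Loan_to_Income is hypothetical and chosen only for this example:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ApplicantIncome': [2000, 5000, 80000],
    'CoapplicantIncome': [1000.0, 0.0, 0.0],
    'LoanAmount': [100.0, 150.0, 600.0],
})

toy['TotalIncome'] = toy['ApplicantIncome'] + toy['CoapplicantIncome']
toy['TotalIncome_log'] = np.log(toy['TotalIncome'])

# Hypothetical ratio feature: loan amount relative to total income
toy['Loan_to_Income'] = toy['LoanAmount'] / toy['TotalIncome']

# The log turns a large ratio between extremes into a modest difference
raw_spread = toy['TotalIncome'].max() / toy['TotalIncome'].min()
log_spread = toy['TotalIncome_log'].max() - toy['TotalIncome_log'].min()
```

This is why the log transform tames extreme values: a ratio between the largest and smallest income becomes a difference equal to the log of that ratio.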
Next, we will look at making predictive models.
(https://www.analyticsvidhya.com/blog/2015/01/scikit-learn-python-machine-learning-tool/).
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into
numeric by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes
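To see what the encoding does, here is a small sketch on invented data. Unlike the loop above, it keeps one encoder per column in a dictionary (the name `encoders` is an assumption of this sketch) so that the codes can be mapped back to labels later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                    'Loan_Status': ['Y', 'N', 'Y']})

encoders = {}
for col in ['Gender', 'Loan_Status']:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])   # classes are numbered in sorted order
    encoders[col] = le                      # keep the fitted encoder for decoding

gender_classes = list(encoders['Gender'].classes_)
decoded = encoders['Loan_Status'].inverse_transform([1])[0]
```

Since classes are sorted, 'Female' becomes 0 and 'Male' becomes 1, while 'N' becomes 0 and 'Y' becomes 1.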
Next, we will import the required modules. Then we will define a generic classification function, which
takes a model as input and determines the Accuracy and Cross-Validation scores. Since this is an
introductory article, I will not go into the details of coding. Please refer to this article
(https://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/) for
getting details of the algorithms with R and Python codes. Also, it'll be good to get a refresher
on cross-validation through this article (https://www.analyticsvidhya.com/blog/2015/11/improvemodel-performance-cross-validation-in-python-r/), as it is a very important measure of model
performance.
# Import models from the scikit-learn module:
import numpy as np
from sklearn.linear_model import LogisticRegression
# KFold now lives in sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

# Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    # Fit the model:
    model.fit(data[predictors], data[outcome])

    # Make predictions on the training set:
    predictions = model.predict(data[predictors])

    # Print accuracy on the training set
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : {0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    scores = []
    for train, test in kf.split(data):
        # Filter training data
        train_predictors = data[predictors].iloc[train, :]

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]

        # Train the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        # Record the accuracy on the held-out fold of each cross-validation run
        scores.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))

    print("Cross-Validation Score : {0:.3%}".format(np.mean(scores)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])
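The manual loop inside classification_model can also be written with scikit-learn's cross_val_score helper. The sketch below is an alternative, not the method used in this article, and it runs on synthetic data (X and y are made up) so that it stands on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data for illustration only: 100 rows, 2 features,
# and a label that depends on the sign of the first feature.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Five folds, scored by accuracy on each held-out fold.
cv = KFold(n_splits=5)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
print("Cross-Validation Score : {0:.3%}".format(scores.mean()))
```

The helper clones the model for every fold, so the estimator passed in is left unfitted. That is why classification_model refits the model at the end when it wants to use it afterwards.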
Logistic Regression
Let's make our first Logistic Regression model. One way would be to take all the variables into the
model, but this might result in overfitting (don't worry if you're unaware of this terminology yet). In
simple words, taking all variables might result in the model learning complex relations specific
to the data, so that it will not generalize well. Read more about Logistic Regression
(https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-regression-in-r/).
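If overfitting is a new term, the following sketch may help. It is an illustration under stated assumptions: the data is pure noise generated on the spot, and a decision tree is used instead of logistic regression because it memorizes training rows very visibly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the features are noise and the labels are random,
# so there is no real pattern to learn.
rng = np.random.RandomState(42)
X = rng.normal(size=(400, 10))
y = rng.randint(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# An unconstrained tree keeps splitting until it has memorized the training rows.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print("Train accuracy: {0:.3f}".format(train_acc))
print("Test accuracy : {0:.3f}".format(test_acc))
```

A large gap between the training score and the score on unseen rows is the usual sign of overfitting, and it is the reason the function above reports a cross-validation score next to the training accuracy.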
We can easily make some intuitive hypotheses to set the ball rolling. The chances of getting a loan
will be higher for:
1. Applicants having a credit history (remember we observed this in exploration?)
2. Applicants with higher applicant and co-applicant incomes
3. Applicants with a higher education level
4. Properties in urban areas with high growth prospects
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df, predictor_var, outcome_var)
# We can try different combinations of variables:
predictor_var = ['Credit_History', 'Education', 'Married', 'Self_Employed', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)
Decision Tree
A decision tree is another method for making a predictive model. It often provides higher accuracy
than a logistic regression model, although this is not guaranteed. Read more about Decision Trees
(https://www.analyticsvidhya.com/blog/2015/01/decision-tree-simplified/).
model = DecisionTreeClassifier()
predictor_var = ['Credit_History', 'Gender', 'Married', 'Education']
classification_model(model, df, predictor_var, outcome_var)
# We can try different combinations of variables:
predictor_var = ['Credit_History', 'Loan_Amount_Term', 'LoanAmount_log']
classification_model(model, df, predictor_var, outcome_var)
Random Forest
Random forest is another algorithm for solving the classification problem. Read more about Random
Forest
(https://www.analyticsvidhya.com/blog/2015/09/random-forest-algorithm-multiple-challenges/).
An advantage of Random Forest is that we can make it work with all the features, and it returns
feature importances which can be used to select features.
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
                 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
                 'LoanAmount_log', 'TotalIncome_log']
classification_model(model, df, predictor_var, outcome_var)
Let's try both reducing the number of predictors and tuning the model parameters. First we look at
the feature importances, from which we'll take the most important features.
# Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
(https://www.analyticsvidhya.com/wp-content/uploads/2016/01/9.-rf-feat-imp.png)
Let's use the top 5 variables for creating a model. Also, we will modify the parameters of the random
forest model a little bit:
model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log', 'LoanAmount_log', 'Credit_History', 'Dependents', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)
Notice that although the accuracy went down, the cross-validation score improved, showing that the
model is generalizing well. Remember that random forest models are not exactly repeatable.
Different runs will result in slight variations because of randomization, but the output should stay in
the ballpark.
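If you need runs to match exactly, for example while debugging, you can fix the seed with the random_state parameter. A minimal sketch on synthetic data (X, y and the seed value 7 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Two forests built with the same seed on the same data come out identical.
rf_a = RandomForestClassifier(n_estimators=25, random_state=7).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=25, random_state=7).fit(X, y)
same = np.allclose(rf_a.feature_importances_, rf_b.feature_importances_)
print(same)
```

Fixing the seed makes results reproducible, but it does not make them more correct. It is still worth checking that conclusions hold across a few different seeds.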
You would have noticed that even after some basic parameter tuning on random forest, we have
reached a cross-validation accuracy only slightly better than the original logistic regression model.
This exercise gives us some very interesting lessons:
1. Using a more sophisticated model does not guarantee better results.
2. Avoid using complex modeling techniques as a black box without understanding the underlying
concepts. Doing so increases the tendency to overfit and makes your models less
interpretable.
3. Feature Engineering (https://www.analyticsvidhya.com/blog/2015/03/feature-engineering-variable-transformation-creation/) is the key to success. Anyone can use an XGBoost model, but the real art
and creativity lies in enhancing your features to better suit the model.
So are you ready to take on the challenge? Start your data science journey with the Loan Prediction
Problem (http://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction).
End Notes
I hope this tutorial will help you maximize your efficiency when starting with data science in Python. I
am sure this not only gave you an idea about basic data analysis methods but also showed you how
to implement some of the more sophisticated techniques available today.
Python is really a great tool, and is becoming an increasingly popular language among data
scientists. The reason is that it's easy to learn and integrates well with other databases and tools like
Spark and Hadoop. Most importantly, it has strong computational capabilities and powerful data analytics
libraries.
So, learn Python to perform the full life cycle of any data science project. It includes reading,
analyzing, visualizing and finally making predictions.
If you come across any difficulty while practicing Python, or you have any thoughts / suggestions /
feedback on the post, please feel free to post them through comments below.
If you like what you just read & want to continue your analytics
learning, subscribe to our emails
(http://feedburner.google.com/fb/a/mailverify?uri=analyticsvidhya), follow us
on twitter (http://twitter.com/analyticsvidhya) or like our facebook page
(http://facebook.com/analyticsvidhya).
(https://www.analyticsvidhya.com/blog/author/kunalj/)
Author
53 COMMENTS
Moumita Mitra says:
January 15, 2016 at 4:41 AM
Paritosh Gupta says:
January 15, 2016 at 6:15 AM
There is a very good book on Python for data analysis: O'Reilly's "Python for Data Analysis".
Kunal Jain (http://www.analyticsvidhya.com) says:
January 18, 2016 at 7:06 PM
Moumita,
The book mentioned by Paritosh is a good place to start. You can also refer to some of the books
mentioned here:
http://www.analyticsvidhya.com/blog/2014/06/books-data-scientists-or-aspiring-ones/
Hope this helps.
Kunal
Deepak says:
January 31, 2016 at 5:49 PM
Hey Kunal
I'm trying to follow your lesson, however I am stuck at reading the CSV file. I'm using IPython and
trying to read it. I am following the syntax that you have provided but it still doesn't work.
Can you please help me? If it's possible, I would really appreciate it.
Thanks
Deepak
Pranesh says:
January 15, 2016 at 5:00 AM
Hi Kunal,
When are you planning to schedule the next data science meetup in Bangalore? I missed the
previous session due to a conflict.
Kunal Jain (http://www.analyticsvidhya.com) says:
January 18, 2016 at 7:08 PM
Pranesh,
We will have a meetup some time in early March. We will announce the dates on DataHack
platform and our meetup group page.
Hope to see you around this time.
Regards,
Kunal
Gianfranco says:
January 15, 2016 at 4:16 PM
Kunal Jain (http://www.analyticsvidhya.com) says:
Dheeraj Patta says:
January 15, 2016 at 5:19 PM
Thank you so much Kunal, this is indeed a great start for any Python beginner.
Really appreciate your team's effort in bringing Data Science to a wider audience.
I strongly suggest "A Byte of Python" by Swaroop CH. It may be a bit old now but it helped me in
getting a good start in Python.
HighSpirits says:
January 15, 2016 at 5:31 PM
Awesome!!! This is one area where I was looking for help and AV has provided it!!! Thanks a lot for
the quick guide Kunal, very helpful.
Kunal Jain (http://www.analyticsvidhya.com) says:
January 18, 2016 at 7:13 PM
Kami888 says:
January 16, 2016 at 3:33 AM
Kunal Jain (http://www.analyticsvidhya.com) says:
January 18, 2016 at 7:14 PM
Dr.D.K.Samuel says:
January 16, 2016 at 3:59 AM
Really well written, will be nice if it is made available as a pdf for download (with all supporting
references). I will print and refer till I learn in full. Thanks
smrutiranjan tripathy says:
January 17, 2016 at 10:04 AM
Hi Kunal ji,
Can you please guide a newbie who doesn't have any software background on how to acquire
big data knowledge? Is it necessary to learn SQL or Java? Before stepping into big data
practically, how can I warm myself up without getting in touch with the bias? Can you please
suggest a good blog on big data for newbies?
Kunal Jain (http://www.analyticsvidhya.com) says:
January 18, 2016 at 7:15 PM
Smrutiranjan,
Kindly post this question on our discussion portal: http://discuss.analyticsvidhya.com
This is not relevant to the article above.
Regards,
Kunal
Falkor says:
January 18, 2016 at 9:41 PM
This was good until the fact hit me: I am using IDLE and don't have the libraries installed. Now,
how do I get Pandas, Numpy etc. installed for IDLE on Windows!?
It's been a long, complicated browsing session. The only solution I seem to get is to ditch IDLE and
move to Spyder, or move to Python 3.5 altogether.
Any solutions will be helpful, thank you.
Digvijay says:
March 7, 2016 at 3:02 PM
I suggest installing Anaconda. It's better to start with, as it contains most of the commonly used
libraries for data analysis. Once Anaconda is up and working, you can use any IDE of your choice.
Falkor says:
January 19, 2016 at 1:26 AM
Got Pandas to finally install and work; here is how I did it, just in case it helps somebody else.
Download pip-installer from here: https://bootstrap.pypa.io/get-pip.py
Put it on the desktop or some known path
Open Command Prompt and point to the path or open the path
Erik (http://www.marsja.se) says:
January 19, 2016 at 9:20 AM
Thank you for a really comprehensive post. Personally, I am mainly using Python for creating
Psychology experiments but I would like to start doing some analysis with Python (right now I
mainly use R). Some of the libraries (e.g., Seaborn) were new to me.
gt_67 says:
January 19, 2016 at 12:15 PM
Hello!
I can't get this piece of code to work:
table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education',
aggfunc=np.median)
# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]
# Replace missing values
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
I've this error:
ValueError: invalid fill value with a
I checked the null values of the columns LoanAmount, Self_Employed and Education and
nothing wrong shows up. 614 values, as in the other full columns.
Has someone else had the same error?
Mohamed says:
January 20, 2016 at 1:02 AM
Mr. gt_67,
I have the same error. Do you have any idea what that could be?
If Kunal can help us understand and fix this piece of code, that will be great.
dignity says:
February 29, 2016 at 6:14 AM
Missing values were already replaced by the mean with this line of code (the first way)
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
before.
This part is the second way of replacing missing values, so
if you skip the above line the code should work.
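A minimal sketch of the group-median imputation discussed in this thread, on a toy DataFrame (the rows and the name df_toy are made up for illustration). It fills only the rows that are actually missing and skips the step when nothing is missing, which is one way to stay clear of the situation dignity describes, where an earlier mean fill has already removed every missing value:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the values are made up.
df_toy = pd.DataFrame({
    "Self_Employed": ["No", "No", "Yes", "Yes", "No"],
    "Education": ["Graduate"] * 5,
    "LoanAmount": [100.0, np.nan, 150.0, 170.0, 120.0],
})

# Median LoanAmount for each Self_Employed / Education combination.
table = df_toy.pivot_table(values="LoanAmount", index="Self_Employed",
                           columns="Education", aggfunc="median")

def fage(row):
    # Look up the group median that matches this row.
    return table.loc[row["Self_Employed"], row["Education"]]

# Fill only the missing rows, and only if there are any.
missing = df_toy["LoanAmount"].isnull()
if missing.any():
    df_toy.loc[missing, "LoanAmount"] = df_toy[missing].apply(fage, axis=1)
print(df_toy)
```

The guard matters because applying a row-wise function to an empty selection does not return a Series of fill values, which is a likely cause of the error reported above.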
Mohamed says:
January 19, 2016 at 12:40 PM
This is a great, great resource. Thanks Kunal. But let me ask you, out of curiosity: is this how data
scientists work? I mean, it is like using a command line to get insight from the data. Isn't there a
GUI for Python so you can be more productive?
Keep up the good work.
Kishore says:
January 19, 2016 at 12:51 PM
Hi Kunal,
Thanks for the excellent tutorial using python. It would be great if you could do a similar tutorial
using R.
Regards,
Kishore
Erik Marsja (http://www.marsja.se) says:
January 20, 2016 at 5:33 PM
Thank you Kunal for a really comprehensive tutorial on doing data science in Python! I really
appreciated the list of libraries. Really useful. I have, myself, started to look more and more at
doing data analysis with Python. I have tested pandas some, and your exploratory analysis with pandas part was also helpful.
Venu says:
January 24, 2016 at 5:39 AM
Good One
Hemanth says:
January 24, 2016 at 3:00 PM
Is there a Python library for performing OCR on PDF files, or for converting a raw scanned PDF to a
searchable PDF, in order to perform text analytics?
Abhi says:
January 24, 2016 at 4:38 PM
Hey, great article. I find myself getting hiccups the moment probability and statistics start
appearing. Can you suggest a book that takes me through these as easily as this tutorial does? Both
of these seem to be the lifeline of ML.
Deepak says:
January 31, 2016 at 5:51 PM
Hey Kunal
I'm trying to follow your lesson, however I am stuck at reading the CSV file. I'm using IPython and
trying to read it. I am following the syntax that you have provided but it still doesn't work.
Can you please help me? If it's possible, I would really appreciate it.
Thanks
Deepak
Deepak says:
February 2, 2016 at 2:21 AM
Hello Kunal, I have started your tutorial but I am having difficulty importing pandas and opening
the csv file.
Do you mind assisting me?
Thanks
Kunal Jain (http://www.analyticsvidhya.com) says:
February 3, 2016 at 11:35 AM
Deepak,
What is the problem you are facing? Can you attach a screenshot?
Also, tell me which OS are you working on and which Python installation are you working on?
Regards,
Kunal
Deepak says:
February 3, 2016 at 4:14 PM
Hey, thanks for replying. No, I do not think I can attach a screenshot on this blog wall. I would love
to email it to you but do not have your email address.
But the problem I am having is trying to open the .csv file (train). I have opened pylab inline.
My code is like this:
Line 1: %pylab inline
Populating the interactive namespace from numpy and matplotlib
Line 2: import pandas as pd
df = pd.read_csv("/Desktop/Studying_Tools/AV/train.csv")
When I click run in IPython Notebook, it gives me an error like this:
Jaini says:
February 7, 2016 at 12:35 AM
Hi Kunal
Sincere apologies for a very basic question. I have installed Python per the above instructions.
Unfortunately I am unable to launch IPython notebook. I have spent hours but I guess I am missing
something. Could you please kindly guide me?
Thank you
Jaini
Kunal Jain (http://www.analyticsvidhya.com) says:
February 7, 2016 at 10:46 AM
Jaini,
What is the error you are getting? Which OS you are on? And what happens when you type
ipython notebook in shell / terminal / cmd ?
Regards,
Kunal
Emanuel Woiski says:
February 7, 2016 at 8:18 PM
Nice article!
A few remarks:
1- pylab=inline is not recommended any more. Use %matplotlib inline for each notebook.
2- You can start a jupyter server using jupyter notebook instead of ipython notebook. For me,
notebooks open faster that way.
3- For plotting, use import matplotlib.pyplot as plt.
Regards
woiski
Jaini says:
February 8, 2016 at 1:11 AM
Thank you. I sincerely appreciate your instant response. I just reinstalled and went through
command prompt and it worked.
ngnikhilgoyal
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/01/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-PYTHON-SCRATCH-2/?REPLYTOCOM=105981#RESPOND)
FEBRUARY 20, 2016 AT 9:45 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/01/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCEPYTHON-SCRATCH-2/#COMMENT-105981)
It would be good if you explained the code as you went along the exercise. For someone
unfamiliar with some of the methods and functions, it is difficult to understand why you are doing
certain things. For example, while creating the pivot table, you introduced aggfunc=lambda x:
x.map({'Y':1,'N':0}).mean() without explaining it. Intuitively I know you are coding Y as 1 and N as 0
and taking the mean of each, but you still need to explain what lambda x: x.map . . . is.
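To unpack the expression raised in this comment, here is a minimal sketch in pandas. The toy data and the Credit_History / Loan_Status column names are assumptions made for illustration, not the article's actual dataset or code. It shows the three pieces separately: Series.map with a dictionary, the mean of the resulting 0/1 values, and the lambda that pivot_table calls once per group.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumed to follow the loan dataset.
df = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N"],
})

# Step 1: Series.map with a dict replaces each value by its dictionary entry.
as_numbers = df["Loan_Status"].map({"Y": 1, "N": 0})

# Step 2: the mean of a 0/1 column is the share of 1s, i.e. the approval rate.
overall_rate = as_numbers.mean()

# Step 3: a lambda is an unnamed function. pivot_table calls it once per group,
# passing that group's Loan_Status values as the Series x.
rates = df.pivot_table(
    values="Loan_Status",
    index=["Credit_History"],
    aggfunc=lambda x: x.map({"Y": 1, "N": 0}).mean(),
)
print(rates)
```

Each row of rates is therefore the fraction of approved loans within one Credit_History group.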
Olga says:
FEBRUARY 26, 2016 AT 1:06 PM
There seems to be a bit of confusion when you plot the histogram. A histogram, by definition, is a plot
of the occurrence frequency of some variable. So, when you manipulate ApplicantIncome,
transforming it to a TotalIncome by adding CoapplicantIncome, the outcome does not affect the
histogram of LoanAmount, because this manipulation does not change the
occurrence frequency or the values of LoanAmount. If you compare both of your plots, they will
look exactly the same for the reason mentioned above. So it would probably be better to correct this
part of the article.
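The point made in this comment can be checked with a small sketch. The records and the histogram helper below are hypothetical and use only the standard library. Binning LoanAmount before and after deriving TotalIncome gives identical counts, and only a histogram of TotalIncome itself reflects the new column.

```python
from collections import Counter

# Hypothetical toy records; field names are assumed to match the loan dataset.
loans = [
    {"ApplicantIncome": 5000, "CoapplicantIncome": 0, "LoanAmount": 120},
    {"ApplicantIncome": 3000, "CoapplicantIncome": 1500, "LoanAmount": 100},
    {"ApplicantIncome": 2500, "CoapplicantIncome": 2000, "LoanAmount": 130},
    {"ApplicantIncome": 6000, "CoapplicantIncome": 0, "LoanAmount": 260},
]

def histogram(values, bin_width):
    """Count how many values fall into each bin; keys are the lower bin edges."""
    return Counter((v // bin_width) * bin_width for v in values)

before = histogram([r["LoanAmount"] for r in loans], bin_width=50)

# Derive a new column; LoanAmount itself is not touched.
for r in loans:
    r["TotalIncome"] = r["ApplicantIncome"] + r["CoapplicantIncome"]

after = histogram([r["LoanAmount"] for r in loans], bin_width=50)
income_hist = histogram([r["TotalIncome"] for r in loans], bin_width=1000)

print(before == after)  # the LoanAmount histogram is unchanged
print(income_hist)      # the derived column has its own distribution
```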
Adull KKU says:
FEBRUARY 29, 2016 AT 3:40 AM
Thanks
Vlad says:
FEBRUARY 29, 2016 AT 11:13 PM
Hi Kunal, first off, thanks for this informative tutorial. Great stuff. Unfortunately I'm unable to
download the dataset: I need to be signed up on AV, and I get an invalid request on signup.
Thank you again for this material.
Vlad says:
MARCH 1, 2016 AT 3:37 AM
Dorinel says:
MARCH 10, 2016 AT 1:28 PM
Hi Kunal,
Don't you give us access to the data set any more? I am reading your tutorial and want to repeat
your steps for data analysis!
Thanks,
Dorinel
Sam says:
MARCH 11, 2016 AT 10:43 AM
Marc says:
MARCH 12, 2016 AT 12:59 AM
Thanks for this. Is there a way to get access to the dataset that was used for this? It seems like it
became unavailable from March 7!
Pavan Kumar says:
MAY 29, 2016 AT 6:29 AM
Really great, and I will start following. I am a new entrant to the data analysis stream.
Harneet says:
JULY 28, 2016 AT 4:20 AM
Hi Kunal,
I have been trying to get some validation measures in Python for logistic regression like those available in SAS, such as Area
Under Curve, Concordant, Discordant and Tied pairs, Gini value, etc. But I am unable to find them
through Google; whatever I was able to find was very confusing.
Can you please help me with this?
Regards,
Harneet.
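As a rough illustration of the measures named in this comment, the sketch below counts concordant, discordant and tied pairs by hand and derives AUC and Gini from them. The function name pair_statistics and the toy labels and scores are hypothetical, and this is a plain-Python sketch under the usual definitions rather than SAS's or any library's implementation. For AUC alone, scikit-learn offers sklearn.metrics.roc_auc_score(y_true, y_score).

```python
from itertools import product

def pair_statistics(y_true, scores):
    """Compare every (positive, negative) pair of predicted scores.

    A pair is concordant when the positive case scores higher,
    discordant when it scores lower, and tied when the scores are equal.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    concordant = discordant = tied = 0
    for p, n in product(pos, neg):
        if p > n:
            concordant += 1
        elif p < n:
            discordant += 1
        else:
            tied += 1
    total = len(pos) * len(neg)
    auc = (concordant + 0.5 * tied) / total  # ties count as half
    return {
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "auc": auc,
        "gini": 2 * auc - 1,
    }

# Hypothetical labels (1 = event) and predicted probabilities.
stats = pair_statistics([1, 1, 1, 0, 0], [0.9, 0.6, 0.4, 0.6, 0.2])
print(stats)
```

The pairwise loop is quadratic in the number of rows, so it is meant for understanding the definitions rather than for large datasets.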
ASIF AMEER says:
AUGUST 19, 2016 AT 5:06 PM
Peter Frech says:
AUGUST 23, 2016 AT 9:08 AM
Hello,
very good article. I just stumbled upon one piece of code where I am not quite sure if I just don't
interpret the arguments well, or whether there is truly a mistake in your code. It is the following:
metrics.accuracy_score(predictions,data[outcome])
Isn't predictions the predicted values, which should be passed as the argument y_pred of the
accuracy_score method, and data[outcome] the real values, which should be passed as
the argument y_true?
If that is so, then I think the order of the arguments is wrong, because the method is
defined as follows (according to the docs): confusion_matrix(y_true, y_pred[, labels]), which means
y_true comes as the 1st argument. You have it the other way around.
Or doesn't it make a difference at all? Anyway.
Best regards,
Peter
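On the question of whether the argument order matters, a small hand-rolled sketch may help. The two functions below are illustrative stand-ins, not scikit-learn's code, and the labels are made up. Plain accuracy only counts matching positions, so swapping the arguments gives the same number, whereas a confusion matrix with rows for true labels and columns for predicted labels comes out transposed when the arguments are swapped.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the two label sequences agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical actual outcomes and model predictions.
actual = ["Y", "Y", "Y", "N", "N", "Y"]
predicted = ["Y", "N", "N", "N", "Y", "Y"]

# Accuracy is symmetric in its two arguments.
acc_correct_order = accuracy(actual, predicted)
acc_swapped_order = accuracy(predicted, actual)

# The confusion matrix is not: swapping the arguments transposes it.
cm_correct_order = confusion_counts(actual, predicted, labels=["N", "Y"])
cm_swapped_order = confusion_counts(predicted, actual, labels=["N", "Y"])

print(acc_correct_order, acc_swapped_order)
print(cm_correct_order, cm_swapped_order)
```

So the accuracy value is unaffected by the order, but passing the true values first keeps the call consistent with order-sensitive metrics.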
Nicola says:
SEPTEMBER 4, 2016 AT 8:36 AM
Hi!
And thank you very much for your tutorial.
Unfortunately there is no way to find the .csv file for the loan prediction problem in
https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
wayne says:
SEPTEMBER 18, 2016 AT 1:32 AM
Hello,
Thank you for the tutorial.
But as already mentioned by Nicola, there is no way to download the DataSet.
Could you please check it?
Thanks
gopalankailash says:
OCTOBER 28, 2016 AT 3:49 PM
The amount of effort you guys put into these articles is a true inspiration for folks like me to learn!
Thanks for all this!
Jack Ma says:
NOVEMBER 8, 2016 AT 10:52 PM