Python
Introduction:
Drawback: Python is an interpreted language rather than a compiled one, so it may take
more CPU time.
Cleaner and faster! Python developers have fixed some inherent glitches and minor
drawbacks in order to set a stronger foundation for the future. These might not be very
relevant initially, but they will matter eventually.
It is the future! 2.7 is the last release of the 2.x family, and eventually everyone has to shift
to 3.x versions. Stable Python 3 versions have been released for the past 5 years, and this
will continue.
The second method provides a hassle-free installation. Its limitation is that you have to
wait for the entire package to be upgraded, even if you are only interested in the latest version of a
single library. This should not matter unless you are doing cutting-edge statistical research.
IDE:
iPython:
A few things to note:
You can start the iPython notebook by typing “ipython notebook” in your terminal / cmd,
depending on the OS you are working on.
You can name an iPython notebook by simply clicking on the name (Untitled0 in the above
screenshot).
The interface shows In [*] for inputs and Out[*] for output.
You can execute code by pressing “Shift + Enter”, or “ALT + Enter” if you want to insert an
additional row after it.
2. Python libraries and Data Structures
The following are some data structures used in Python. You should be familiar with them
in order to use them appropriately.
1. Lists – Lists are one of the most versatile data structures in Python. A list can simply be
defined by writing comma-separated values in square brackets. Lists might
contain items of different types, but usually the items all have the same type. Python
lists are mutable, so individual elements of a list can be changed.
A quick example:
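The list behaviour described above can be sketched roughly as follows. The variable names (`squares`, `mixed`) are illustrative only and do not come from the original tutorial, and `print()` is called as a function so that the snippet also runs under Python 3.

```python
# A list is written as comma-separated values inside square brackets.
squares = [0, 1, 4, 9, 16, 25]

# Individual elements are accessed by zero-based index.
first = squares[0]
last = squares[-1]        # negative indices count from the end

# Slicing returns a new list holding the selected range.
middle = squares[2:4]

# Lists are mutable: an element can be replaced in place.
squares[2] = 40

# Items of different types are allowed, though uncommon in practice.
mixed = [1, "two", 3.0]

print(first, last, middle, squares, mixed)
```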
2. Strings – Strings can be defined using single ( ' ), double ( " ) or triple ( ''' )
quotes. Strings enclosed in triple quotes ( ''' ) can span multiple lines and
are used frequently in docstrings (Python’s way of documenting functions).
\ is used as an escape character. Please note that Python strings are immutable, so you
cannot change part of a string.
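A minimal sketch of the string rules above, assuming nothing beyond the built-in `str` type. The variable names are made up for illustration; the `try`/`except` block is only there to show that in-place modification fails.

```python
greeting = 'Hello'                 # single quotes
name = "World"                     # double quotes
multiline = '''This string
spans two lines.'''                # triple quotes may contain line breaks

escaped = 'It\'s escaped'          # \ escapes the quote that follows it

# Strings are immutable, so item assignment raises a TypeError.
try:
    greeting[0] = 'J'
    mutation_failed = False
except TypeError:
    mutation_failed = True

# To "change" a string, build a new one instead.
changed = 'J' + greeting[1:]

print(greeting, name, escaped, changed, mutation_failed)
```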
Like most languages, Python has a for loop, which is the most widely used method for
iteration. It has a simple syntax, shown here computing the factorial of a number N:
fact = 1
for i in range(1, N+1):
    fact *= i
Conditional statements are used to execute code fragments based on a
condition. The most commonly used construct is if-else, with the following syntax:
if [condition]:
    __execution if true__
else:
    __execution if false__
For instance, to print whether a number N is even or odd:
if N % 2 == 0:
    print('Even')
else:
    print('Odd')
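As a rough sketch, the loop and the conditional can be combined in one small function. The name `parity_labels` is hypothetical and chosen only for this illustration.

```python
def parity_labels(n):
    """Return 'Even' or 'Odd' for every integer from 1 to n."""
    labels = []
    for i in range(1, n + 1):
        if i % 2 == 0:
            labels.append('Even')
        else:
            labels.append('Odd')
    return labels

print(parity_labels(5))
```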
Let’s take a step further. What if you have to perform the following tasks?
1. Multiply 2 matrices
2. Find the root of a quadratic equation
3. Plot bar charts and histograms
4. Make statistical models
5. Access web-pages
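Writing these from scratch would be tedious; thankfully, libraries provide predefined
functions for such tasks. For example, consider the factorial example we just saw. After
importing the math library, we can do that in a single step:
import math
math.factorial(N)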
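To check that the library call and the hand-written loop agree, the two can be placed side by side. This is only a sketch; the wrapper name `factorial_loop` and the sample value of `N` are assumptions made for the example.

```python
import math

def factorial_loop(n):
    """Factorial computed with the for loop shown earlier."""
    fact = 1
    for i in range(1, n + 1):
        fact *= i
    return fact

N = 5
print(factorial_loop(N), math.factorial(N))
```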
Python Libraries
Let’s take one step ahead in our journey to learn Python by getting acquainted with some useful
libraries. The first step is obviously to learn to import them into our environment. There are
several ways of doing so in Python:
import math as m
from math import *
In the first manner, we have defined an alias m for the math library. We can now use various
functions from the math library (e.g. factorial) by referencing them through the alias, as in m.factorial().
In the second manner, you have imported the entire namespace of math, i.e. you can use
factorial() directly without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know
where the functions have come from.
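The difference between the two styles can be made visible with a small experiment. In the sketch below, the star-import is run inside a scratch dictionary (the name `scratch` is an assumption of this example) purely so that the imported names can be listed; it is not a recommended way to import a library.

```python
import math as m

# Style 1: every call is qualified by the alias, so its origin is visible.
print(m.factorial(5))

# Style 2, executed inside a scratch namespace (a plain dict) so we can
# inspect exactly which names the star-import brings in.
scratch = {}
exec("from math import *", scratch)
imported = sorted(name for name in scratch if not name.startswith("__"))

# factorial is now available without any prefix inside that namespace.
print(scratch["factorial"](5))
print(len(imported), "names imported, for example:", imported[:5])
```

The number of names pulled in by the second style is one reason the first style is easier to trace in larger programs.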
The following is a list of libraries you will need for any scientific computation and
data analysis:
NumPy stands for Numerical Python. The most powerful feature of NumPy is the
n-dimensional array. This library also contains basic linear algebra functions, Fourier
transforms, advanced random number capabilities and tools for integration with other
low-level languages like Fortran, C and C++.
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful
libraries for a variety of high-level science and engineering modules such as discrete Fourier
transform, linear algebra, optimization and sparse matrices.
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to
heat plots. You can use the Pylab feature in the ipython notebook (ipython notebook --pylab=inline)
to use these plotting features inline. If you omit the inline option, pylab
converts the ipython environment into one very similar to Matlab. You can also
use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data
munging and preparation. Pandas was added relatively recently to Python and has
been instrumental in boosting Python’s usage in the data science community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library
contains a lot of efficient tools for machine learning and statistical modeling, including
classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users
to explore data, estimate statistical models, and perform statistical tests. An extensive
list of descriptive statistics, statistical tests, plotting functions, and result statistics is
available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and
informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to
make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web
browsers. It empowers the user to generate elegant and concise graphics in the style of
D3.js. Moreover, it has the capability of high-performance interactivity over very large
or streaming datasets.
Blaze for extending the capability of NumPy and Pandas to distributed and streaming
datasets. It can be used to access data from a multitude of sources including Bcolz,
MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can
act as a very powerful tool for creating effective visualizations and dashboards on huge
chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of
data. It has the capability to start at a website’s home URL and then dig through the web pages
within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic
arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another
useful feature is the capability of formatting the result of the computations as LaTeX
code.
Requests for accessing the web. It works similarly to the standard Python library
urllib2 but is much easier to code. You will find subtle differences with urllib2, but for
beginners, Requests might be more convenient.
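Before moving on, it can help to see which of the libraries listed above are actually installed. The following is a minimal sketch using only the standard library; it assumes Python 3.4 or later (for `importlib.util.find_spec`), and the helper name `check_installed` is hypothetical. Note that import names can differ from display names, e.g. Scikit Learn is imported as `sklearn`.

```python
import importlib.util

# Import names for the libraries described above (an assumption of this
# sketch; the display name and the import name are not always the same).
LIBRARIES = [
    "numpy", "scipy", "matplotlib", "pandas", "sklearn", "statsmodels",
    "seaborn", "bokeh", "blaze", "scrapy", "sympy", "requests",
]

def check_installed(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, found in check_installed(LIBRARIES).items():
    print("%-12s %s" % (name, "installed" if found else "missing"))
```

The check only locates each module without importing it, so it stays quick even for large libraries.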
Now that we are familiar with Python fundamentals and additional libraries, let’s take a deep
dive into problem solving through Python. Yes, that means making a predictive model! In the process, we
will use some powerful libraries and also come across the next level of data structures. We will take
you through the 3 key phases:
1. Data Exploration – finding out more about the data we have.
2. Data Munging – cleaning the data and playing with it so that it better suits statistical
modeling.
3. Predictive Modeling – running the actual algorithms and having fun.
3. Exploratory analysis in Python using Pandas