Python


Introduction:

Python was originally designed as a general-purpose language, but it has become a popular choice for data science thanks to:

- Strong community support
- Dedicated libraries for data analysis and predictive modeling

Table of Contents

1. Basics of Python for Data Analysis

2. Python libraries and data structures

3. Exploratory analysis in Python using Pandas

4. Data Munging in Python using Pandas

5. Building a Predictive Model in Python


1. Basics of Python for Data Analysis
- Open source – free to install
- Awesome online community
- Very easy to learn
- Can become a common language for data science and for producing web-based analytics products

Drawback: it is an interpreted language rather than a compiled one, so it might take up more CPU time.

Python 2.7 vs 3.6/3.7


- This is one of the most debated topics in Python.
- You will invariably cross paths with it, especially if you are a beginner. There is no right or wrong choice here; it depends entirely on the situation and your needs.

Why Python 2.7?


- Community support
- Python 2 was released in late 2000 and has been in use for more than 17 years.
- Plethora of third-party libraries! Though many libraries now provide 3.x support, a large number of modules still work only on 2.x versions. If you plan to use Python for specific applications like web development with a high reliance on external modules, you might be better off with 2.7.
- Some features of the 3.x versions have been back-ported and work with the 2.7 version.

Why Python 3.6/3.7?

- Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.
- It is the future! 2.7 is the last release of the 2.x family, and eventually everyone will have to shift to the 3.x versions. Python 3 has released stable versions for the past 5 years and will continue to do so.
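As a minimal sketch of the kind of differences involved, here are two changes that most often trip up beginners (the examples are illustrative; run them under Python 3):

```python
# Two beginner-facing differences between Python 2 and Python 3.

# 1. print: a statement in Python 2 (print "Hello"), a function in Python 3.
print("Hello")

# 2. Division: in Python 2, 7 / 2 gives 3 (integer division);
#    in Python 3, / is true division and // is floor division.
print(7 / 2)   # 3.5 under Python 3
print(7 // 2)  # 3 under both versions
```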

How to install Python?

There are 2 approaches to install Python:


You can download Python directly from its project site
(https://www.python.org/download/releases/2.7/) and install the individual components and libraries you want. Alternatively, you can download and install a package which comes with pre-installed libraries.
Recommended: Anaconda (https://www.continuum.io/downloads).

Another option is Enthought Canopy Express (https://www.enthought.com/downloads/).

The second method provides a hassle-free installation. Its limitation is that you have to wait for the entire package to be upgraded, even if you are only interested in the latest version of a single library. This should not matter unless you are doing cutting-edge statistical research.

Choosing a development environment

The 3 most common options:

- Terminal / Shell based
- IDLE (the default environment)
- iPython notebook

iPython:

A few things to note:

- You can start the iPython notebook by typing "ipython notebook" in your terminal / cmd, depending on the OS you are working on.
- You can rename an iPython notebook by simply clicking on its name (e.g. "Untitled0") at the top of the notebook.
- The interface shows In [*] for inputs and Out[*] for outputs.
- You can execute code by pressing "Shift + Enter", or "ALT + Enter" if you want to insert an additional cell after it.
2. Python libraries and Data Structures

Python Data Structures

Following are some data structures used in Python. You should be familiar with them in order to use them appropriately.

1. Lists – Lists are one of the most versatile data structures in Python. A list can simply be defined by writing comma-separated values in square brackets. Lists may contain items of different types, but usually the items all have the same type. Python lists are mutable: individual elements of a list can be changed.

2. Strings – Strings can simply be defined by the use of single ('), double (") or triple (''') quotes. Strings enclosed in triple quotes can span multiple lines and are used frequently in docstrings (Python's way of documenting functions). \ is used as an escape character. Please note that Python strings are immutable, so you cannot change part of a string.

3. Tuples – A tuple is represented by a number of values separated by commas. Tuples are immutable, and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed. Since tuples are immutable, they are faster to process than lists. Hence, if your list is unlikely to change, you should use a tuple instead.
4. Dictionary – A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}.
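The four structures above can be sketched in a few lines (all names and values here are illustrative):

```python
# Lists: mutable, defined with comma-separated values in square brackets
squares = [1, 4, 9, 16]
squares[0] = 0            # individual elements can be changed
squares.append(25)

# Strings: immutable; triple quotes can span multiple lines
greeting = 'Hello'
doc = """A multi-line
string, as used in docstrings."""

# Tuples: immutable, written with parentheses
point = (3, 4)
# point[0] = 5            # would raise a TypeError

# Dictionaries: key: value pairs with unique keys; {} is an empty dictionary
ages = {'Alice': 30, 'Bob': 25}
ages['Carol'] = 35        # adding a new key: value pair

print(squares, greeting, point, ages)
```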

Python Iteration and Conditional Constructs

Like most languages, Python also has a for loop, which is the most widely used method for iteration. It has a simple syntax:

for i in [Python Iterable]:
    expression(i)
Here “Python Iterable” can be a list, tuple or other advanced data structures which we will
explore in later sections. Let’s take a look at a simple example, determining the factorial of a
number.

N = 5  # the number whose factorial we want
fact = 1
for i in range(1, N+1):
    fact *= i

Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:

if [condition]:
    __execution if true__
else:
    __execution if false__

For instance, if we want to print whether the number N is even or odd:

if N % 2 == 0:
    print('Even')
else:
    print('Odd')

Let’s take a step further. What if you have to perform the following tasks?

1. Multiply 2 matrices
2. Find the root of a quadratic equation
3. Plot bar charts and histograms
4. Make statistical models
5. Access web-pages

There are many libraries for these purposes.

For example, consider the factorial example we just saw. We can do that in a single step as:

import math
math.factorial(N)
Python Libraries

Let's take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:

import math as m

from math import *

In the first manner, we have defined an alias m for the math library. We can now use the various functions from the math library (e.g. factorial) by referencing them with the alias, as in m.factorial().
In the second manner, you have imported the entire namespace of math, i.e. you can directly use factorial() without referring to math.

Tip: Google recommends the first style of importing libraries, as you will always know where each function has come from.
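The two styles side by side (factorial(5) is just an illustrative call):

```python
# Style 1: alias import – the origin of each function stays visible
# at the call site.
import math as m
print(m.factorial(5))    # 120

# Style 2: wildcard import – shorter calls, but it is no longer obvious
# where factorial() came from, and names can silently collide.
from math import *
print(factorial(5))      # 120
```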

Following is a list of libraries you will need for any scientific computations and data analysis:

- NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random-number capabilities and tools for integration with other low-level languages like Fortran, C and C++.

- SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transform, linear algebra, optimization and sparse matrices.

- Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in the ipython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, pylab converts the ipython environment to an environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.

- Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data science community.

- Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.

- Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.

- Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.

- Bokeh for creating interactive plots, dashboards and data applications in modern web browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.

- Blaze for extending the capabilities of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.

- Scrapy for web crawling. It is a very useful framework for extracting specific patterns of data. It has the capability to start at a website's home URL and then dig through the web pages within the website to gather information.

- SymPy for symbolic computation. It has wide-ranging capabilities, from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability to format the results of computations as LaTeX code.

- Requests for accessing the web. It works similarly to the standard Python library urllib2, but is much easier to code. You will find subtle differences from urllib2, but for beginners, Requests might be more convenient.
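As a small first taste of the two libraries at the top of this list, here is a minimal sketch (the array and table values are arbitrary):

```python
import numpy as np
import pandas as pd

# NumPy: n-dimensional arrays with fast element-wise operations
a = np.array([[1, 2], [3, 4]])
print(a.shape)           # (2, 2)
print(a.mean())          # 2.5 – mean over all elements

# Pandas: labeled, table-like data structures (DataFrame)
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})
print(df['age'].max())   # 30
```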

Additional libraries you might need:

- os for operating-system and file operations
- networkx and igraph for graph-based data manipulations
- regular expressions (re) for finding patterns in text data
- BeautifulSoup for scraping the web. It is less powerful than Scrapy, as it will extract information from just a single webpage in a run.
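As an example of regular expressions via the standard re module, a minimal pattern search (the text and pattern are illustrative):

```python
import re

# Find all 4-digit numbers (e.g. years) in a piece of text
text = 'Python 2 was released in 2000, Python 3 in 2008.'
years = re.findall(r'\b\d{4}\b', text)
print(years)    # ['2000', '2008']
```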

Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:

1. Data Exploration – finding out more about the data we have.
2. Data Munging – cleaning the data and playing with it to make it better suit statistical modeling.
3. Predictive Modeling – running the actual algorithms and having fun.
3. Exploratory analysis in Python using Pandas
