Python For Data Analysis
Chapter 8
Prof. Priya Mathurkar
INTRODUCTION
• Data is the new oil. This statement captures how every modern IT system is driven by capturing, storing and analysing data for various needs.
• Be it making decisions for a business, forecasting the weather, studying protein structures in biology or designing a marketing campaign, all of these scenarios involve a multidisciplinary approach that combines mathematical models, statistics, graphs, databases and, of course, the business or scientific logic behind the data analysis.
DATA SCIENCE
• Data science is the process of deriving knowledge and insights from a huge and diverse set of data by organizing, processing and analyzing it.
• It involves many different disciplines, such as mathematical and statistical modelling, extracting data from its source and applying data visualization techniques.
• It often also involves handling big data technologies to gather both structured and unstructured data.
• Below are some example scenarios where data science is used:
• Recommendation systems
• Financial Risk management
• Improvement in Health Care services
THE ROLE OF A DATA ANALYST
• A data analyst uses programming tools to mine large amounts of complex data and find relevant information in it.
• In short, an analyst is someone who derives meaning from messy data. To be useful in the workplace, a data analyst needs skills in the following areas:
• Domain Expertise — In order to mine data and come up with insights that are relevant to their workplace, an
analyst needs to have domain expertise.
• Programming Skills — As a data analyst, you will need to know the right libraries to use in order to clean data, mine it, and gain insights from it.
• Statistics — An analyst might need to use some statistical tools to derive meaning from data.
• Visualization Skills — A data analyst needs to have great data visualization skills, in order to summarize and
present data to a third party.
• Storytelling — Finally, an analyst needs to communicate their findings to a stakeholder or client. This means that
they will need to create a data story, and have the ability to narrate it.
WHY LEARN PYTHON FOR DATA ANALYSIS?
• Here are some reasons in favour of learning Python:
• Open Source – free to install
• Awesome online community
• Very easy to learn
• Can become a common language for data science and production of web based
analytics products.
• A simple and easy-to-learn language which achieves results in fewer lines of code than similar languages such as R. Its simplicity also makes it robust enough to handle complex scenarios with minimal code and much less confusion about the general flow of the program.
• It is cross-platform, so the same code works in multiple environments without needing any change. That makes it perfect for multi-environment setups.
• It generally executes faster than other languages commonly used for data analysis, such as R and MATLAB.
• Its excellent memory management, especially garbage collection, makes it versatile at gracefully handling very large volumes of data during transformation, slicing, dicing and visualization.
• Most importantly, Python has a very large collection of libraries which serve as special-purpose analysis tools. For example, the NumPy package deals with scientific computing, and its array needs much less memory than a conventional Python list for managing numeric data. The number of such packages is continuously growing.
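• A rough illustration of that memory claim (a minimal sketch; exact numbers vary by platform and Python version):

import sys
import numpy as np

data = list(range(1000))
arr = np.array(data)

# Approximate memory of the list: the list object plus its separate int objects.
list_bytes = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)
# NumPy stores all the values in one contiguous buffer.
arr_bytes = arr.nbytes

print(list_bytes, arr_bytes)  # the NumPy buffer is typically several times smaller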
• Python has packages which can directly use code from other languages like Java or C. This helps in optimizing performance by reusing existing code from other languages whenever it gives a better result.
INSTALLING THE SCIPY STACK
• The best way to enable the required packages is to use an installable binary package specific to your operating system. These binaries contain the full SciPy stack (NumPy, SciPy, Matplotlib, IPython, SymPy and nose, along with core Python).
• Windows
• Anaconda (from www.continuum.io) is a free Python distribution for SciPy stack. It is also available for
Linux and Mac.
• Python (x,y): a free Python distribution with the SciPy stack and the Spyder IDE for Windows. (Downloadable from www.python-xy.github.io/)
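• If a working Python with pip is already available, the individual packages can also be installed directly (a sketch; these are the package names on PyPI):

pip install numpy scipy matplotlib ipython sympy pandas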
LIBRARIES FOR SCIENTIFIC COMPUTATIONS AND DATA ANALYSIS
• Following is a list of libraries you will need for any scientific computations and data analysis:
• NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
• SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transforms, linear algebra, optimization and sparse matrices.
• Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat maps.
• Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data science community.
• Scikit-learn for machine learning. Built on NumPy, SciPy and Matplotlib, this library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
• Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
• Seaborn for statistical data visualization. Seaborn is a library for making
attractive and informative statistical graphics in Python. It is based on
matplotlib. Seaborn aims to make visualization a central part of exploring and
understanding data.
• Bokeh for creating interactive plots, dashboards and data applications on
modern web-browsers. It empowers the user to generate elegant and concise
graphics in the style of D3.js. Moreover, it has the capability of high-
performance interactivity over very large or streaming datasets.
• Blaze for extending the capability of Numpy and Pandas to distributed and
streaming datasets. It can be used to access data from a multitude of
sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables,
etc. Together with Bokeh, Blaze can act as a very powerful tool for creating
effective visualizations and dashboards on huge chunks of data.
• Scrapy for web crawling. It is a very useful framework for extracting specific patterns of data. It can start at a website's home URL and then dig through the web pages within the website to gather information.
• SymPy for symbolic computation. It has wide-ranging capabilities from basic
symbolic arithmetic to calculus, algebra, discrete mathematics and quantum
physics. Another useful feature is the capability of formatting the result of
the computations as LaTeX code.
• Requests for accessing the web. It works similarly to the standard Python library urllib2 but is much easier to code. You will find subtle differences from urllib2, but for beginners Requests is more convenient (see the sketch below).
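• A minimal Requests sketch (assumes network access; the URL is just an example):

import requests

response = requests.get("https://example.com")
print(response.status_code)   # HTTP status code, e.g. 200
print(response.text[:100])    # first 100 characters of the response body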
PYTHON - PANDAS
• Below are some of the important features of Pandas that are used specifically for data processing and data analysis work:
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations (see the sketch after this list).
• High performance merging and joining of data.
• Time Series functionality.
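• A quick sketch of the group-by feature (the data here is invented for illustration):

import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B"], "score": [10, 20, 30]})
print(df.groupby("team")["score"].mean())   # mean score per team: A -> 15.0, B -> 30.0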
• If you have Python and PIP already installed on a system, then installation of
Pandas is very easy.
• Install it using this command:
• C:\Users\Your Name>pip install pandas
DIMENSION & DESCRIPTION
• The best way to think of these data structures is that the higher dimensional
data structure is a container of its lower dimensional data structure. For
example, DataFrame is a container of Series, Panel is a container of
DataFrame.
Data Structure | Dimensions | Description
Series         | 1          | 1D labeled homogeneous array, size-immutable.
DataFrame      | 2          | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
• Example:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
LABELS
• If nothing else is specified, the values are labeled with their index number. First value has index 0,
second value has index 1 etc.
• This label can be used to access a specified value.
print(myvar[0])
Create Labels
• With the index argument, you can name your own labels.
• Example
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
print(myvar["y"])
KEY/VALUE OBJECTS AS SERIES
• You can also use a key/value object, like a dictionary, when creating a Series.
import pandas as pd

# The keys of the dictionary become the labels.
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

# To select only some of the items in the dictionary, list the wanted keys in the index argument.
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
DATAFRAME
The table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column
represents an attribute and each row represents a person.
DATA TYPE OF COLUMNS
Column | Type
Name   | String
Age    | Integer
Gender | String
Rating | Float
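• A minimal sketch of such a table as a DataFrame (the names and values here are invented for illustration):

import pandas as pd

sales = pd.DataFrame({
    "Name": ["Tom", "Lee"],         # String
    "Age": [25, 32],                # Integer
    "Gender": ["Male", "Female"],   # String
    "Rating": [4.23, 3.98]          # Float
})
print(sales.dtypes)   # object, int64, object, float64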
Example:2
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data) #Create a DataFrame from Lists
print(df)
Example:3
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age']) #Create a DataFrame from Lists
print(df)
Example:4
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)
Example:5
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
• Locate Row
• Pandas uses the loc attribute to return one or more specified rows.
• print(df.loc[0]) # returns row 0, when the index is the default integer index
• print(df.loc[[0, 1]]) # returns rows 0 and 1 as a DataFrame
• Named Indexes
• With the index argument, you can name your own indexes.
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df.loc["day2"]) # refer to the row by its named index
PANDAS READ CSV
• If your data sets are stored in a file, Pandas can load them into a DataFrame.
• Example
• Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
print(df.to_string()) # use to_string() to print the entire DataFrame
PANDAS - MISSING DATA
• Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas.
• Checking for missing values using isnull() and notnull():
• In order to check for missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series to find null values in a series.
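• A short sketch of these checks (using the 'dirtydata.csv' file from the cleaning examples below):

import pandas as pd

df = pd.read_csv('dirtydata.csv')
print(df["Calories"].isnull())    # True where the value is NaN
print(df["Calories"].notnull())   # True where a value is present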
PANDAS - CLEANING EMPTY CELLS
Remove Rows
• One way to deal with empty cells is to remove rows that contain empty cells.
• This is usually OK, since data sets can be very big, and removing a few rows
will not have a big impact on the result.
• df.dropna()
• If you want to change the original DataFrame, use the inplace = True
argument:
• df.dropna(inplace = True)
Note: In our cleaning examples we will be using a CSV file called 'dirtydata.csv'.
Replace Empty Values:
• Another way of dealing with empty cells is to insert a new value instead.
• This way you do not have to delete entire rows just because of some empty
cells.
• The fillna() method allows us to replace empty cells with a value:
• df.fillna(130, inplace = True)
• Calculate the MEDIAN, and replace any empty values with it:
• x = df["Calories"].median()
df["Calories"].fillna(x, inplace = True)
• Calculate the MODE, and replace any empty values with it:
• x = df["Calories"].mode()[0]
df["Calories"].fillna(x, inplace = True)
• Discovering Duplicates
• Duplicate rows are rows that have been registered more than one time.
• To discover duplicates, we can use the duplicated() method.
• The duplicated() method returns a Boolean value for each row:
• print(df.duplicated())
• df.drop_duplicates(inplace = True)
PANDAS - FIXING WRONG DATA
Wrong Data
• "Wrong data" does not have to be "empty cells" or "wrong format", it can just be
wrong, like if someone registered "199" instead of "1.99".
• Sometimes you can spot wrong data by looking at the data set, because you have an
expectation of what it should be.
• If you take a look at our data set, you can see that in row 7, the duration is 450, but
for all the other rows the duration is between 30 and 60.
• It doesn't have to be wrong, but taking into consideration that this is the data set of someone's workout sessions, we can conclude that this person did not work out for 450 minutes.
Replacing Values
• One way to fix wrong values is to replace them with something else.
• In our example, it is most likely a typo, and the value should be "45" instead
of "450", and we could just insert "45" in row 7:
• df.loc[7, 'Duration'] = 45
• For small data sets you might be able to replace the wrong data one by one,
but not for big data sets.
• To replace wrong data for larger data sets you can create some rules, e.g. set
some boundaries for legal values, and replace any values that are outside of
the boundaries.
• Example
• for x in df.index:
if df.loc[x, "Duration"] > 120:
df.loc[x, "Duration"] = 120
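• For this capping rule, a vectorized alternative (a sketch) avoids the explicit loop:

df["Duration"] = df["Duration"].clip(upper=120) # cap every value at 120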
• Removing Rows
• Another way of handling wrong data is to remove the rows that contain wrong data.
• This way you do not have to find out what to replace them with, and there is a
good chance you do not need them to do your analyses.
• Delete rows where "Duration" is higher than 120:
• for x in df.index:
if df.loc[x, "Duration"] > 120:
df.drop(x, inplace = True)
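• The same result with boolean filtering (a sketch), which is usually faster on large data sets:

df = df[df["Duration"] <= 120] # keep only the rows with a legal Duration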
PYTHON - MATPLOTLIB
• NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (a plotting library). This combination is widely used as a replacement for MATLAB, a popular platform for technical computing. However, this Python-based alternative to MATLAB is now seen as a more modern and complete programming language.
• Matplotlib is a Python library used to create 2D graphs and plots from Python scripts.
• It has a module named pyplot which makes plotting easy by providing features to control line styles, font properties, axis formatting, etc.
• It supports a very wide variety of graphs and plots, namely histograms, bar charts, power spectra, error charts, etc.
• It is used along with NumPy to provide an environment that is an effective open source alternative to MATLAB.
• It can also be used with graphics toolkits like PyQt and wxPython.
• Conventionally, the package is imported into the Python script by adding the following
statement −
• from matplotlib import pyplot as plt
MATPLOTLIB EXAMPLE
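• A minimal sketch of a pyplot line plot (the values are invented for illustration):

from matplotlib import pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y)                    # draw a line through the points
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.title("Matplotlib Example")
plt.show()                        # display the figure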