
PYTHON FOR DATA ANALYSIS

Chapter: 8
Prof. Priya Mathurkar
INTRODUCTION

• Data is the new Oil. This statement shows how every modern IT system is
driven by capturing, storing and analysing data for various needs.
• Be it making decisions for business, forecasting weather, studying protein
structures in biology or designing a marketing campaign, all of these
scenarios involve a multidisciplinary approach of using mathematical models,
statistics, graphs, databases and, of course, the business or scientific logic
behind the data analysis.
DATA SCIENCE

• Data science is the process of deriving knowledge and insights from a huge and diverse
set of data through organizing, processing and analyzing the data.
• It involves many different disciplines like mathematical and statistical modelling,
extracting data from its source and applying data visualization techniques.
• Often it also involves handling big data technologies to gather both structured and
unstructured data.
• Below are some example scenarios where data science is used:
• Recommendation systems
• Financial risk management
• Improvement in health care services
THE ROLE OF A DATA ANALYST

• A data analyst uses programming tools to mine large amounts of complex data and find relevant information
in it.
• In short, an analyst is someone who derives meaning from messy data. To be effective in the workplace, a
data analyst needs skills in the following areas:
• Domain Expertise — In order to mine data and come up with insights that are relevant to their workplace, an
analyst needs to have domain expertise.
• Programming Skills — As a data analyst, you will need to know the right libraries to use in order to clean
data, mine it, and gain insights from it.
• Statistics — An analyst might need to use some statistical tools to derive meaning from data.
• Visualization Skills — A data analyst needs to have great data visualization skills, in order to summarize and
present data to a third party.
• Storytelling — Finally, an analyst needs to communicate their findings to a stakeholder or client. This means that
they will need to create a data story, and have the ability to narrate it.
WHY LEARN PYTHON FOR DATA
ANALYSIS?
• Here are some reasons which go in favour of learning Python:
• Open source – free to install
• Awesome online community
• Very easy to learn
• Can become a common language for data science and the production of web-based
analytics products.
• A simple and easy-to-learn language which achieves results in fewer lines of code than other
similar languages like R. Its simplicity also makes it robust for handling complex scenarios with
minimal code and much less confusion about the general flow of the program.
• It is cross-platform, so the same code works in multiple environments without needing any
change. That makes it perfect for use in a multi-environment setup.
• It executes faster than other similar languages used for data analysis, like R and MATLAB.
• Its excellent memory management capability, especially garbage collection, makes it versatile
in gracefully managing very large volumes of data transformation, slicing, dicing and
visualization.
• Most importantly, Python has a very large collection of libraries which serve as special-
purpose analysis tools. For example, the NumPy package deals with scientific computing, and
its array needs much less memory than the conventional Python list for managing numeric
data (see the sketch after this list). And the number of such packages is continuously growing.
• Python has packages which can directly use code from other languages like Java or C.
This helps in optimizing performance by reusing existing code from other languages
whenever it gives a better result.
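A minimal sketch of the memory comparison above, assuming NumPy is installed; exact byte counts vary by platform and Python version:

import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n)

# A Python list stores pointers to boxed int objects; the ndarray stores
# raw machine integers in one contiguous buffer.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = np_array.nbytes

print("list   :", list_bytes, "bytes")   # typically ~36 KB for 1000 ints
print("ndarray:", array_bytes, "bytes")  # 8000 bytes at 8 bytes per int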
INSTALLING THE SCIPY STACK

• The best way to enable the required packages is to use an installable binary package specific to your
operating system. These binaries contain the full SciPy stack (inclusive of NumPy, SciPy, matplotlib, IPython,
SymPy and nose packages along with core Python).
• Windows
• Anaconda (from www.continuum.io) is a free Python distribution for the SciPy stack. It is also available for
Linux and Mac.
• Canopy (www.enthought.com/products/canopy/) is available as a free as well as commercial distribution with the
full SciPy stack for Windows, Linux and Mac.
• Python (x,y): a free Python distribution with the SciPy stack and Spyder IDE for Windows OS.
(Downloadable from www.python-xy.github.io/)
LIBRARIES FOR SCIENTIFIC
COMPUTATIONS AND DATA ANALYSIS:
• Following is a list of libraries you will need for any scientific computations
and data analysis:
• NumPy stands for Numerical Python. The most powerful feature of NumPy is the
n-dimensional array. This library also contains basic linear algebra functions,
Fourier transforms, advanced random number capabilities and tools for
integration with other low-level languages like Fortran, C and C++.
• SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the
most useful libraries for a variety of high-level science and engineering modules
like discrete Fourier transforms, linear algebra, optimization and sparse
matrices.
• Matplotlib for plotting a vast variety of graphs, from histograms to
line plots to heat plots.
• Pandas for structured data operations and manipulations. It is extensively
used for data munging and preparation. Pandas was added relatively
recently to Python and has been instrumental in boosting Python's usage in the
data scientist community.
• Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this
library contains a lot of efficient tools for machine learning and statistical
modeling including classification, regression, clustering and dimensionality
reduction.
• Statsmodels for statistical modeling. Statsmodels is a Python module that
allows users to explore data, estimate statistical models, and perform
statistical tests. An extensive list of descriptive statistics, statistical tests,
plotting functions, and result statistics are available for different types of
data and each estimator.
• Seaborn for statistical data visualization. Seaborn is a library for making
attractive and informative statistical graphics in Python. It is based on
matplotlib. Seaborn aims to make visualization a central part of exploring and
understanding data.
• Bokeh for creating interactive plots, dashboards and data applications in
modern web browsers. It empowers the user to generate elegant and concise
graphics in the style of D3.js. Moreover, it has the capability of
high-performance interactivity over very large or streaming datasets.
• Blaze for extending the capability of Numpy and Pandas to distributed and
streaming datasets. It can be used to access data from a multitude of
sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables,
etc. Together with Bokeh, Blaze can act as a very powerful tool for creating
effective visualizations and dashboards on huge chunks of data.
• Scrapy for web crawling. It is a very useful framework for getting specific
patterns of data. It has the capability to start at a website's home URL and then
dig through web pages within the website to gather information.
• SymPy for symbolic computation. It has wide-ranging capabilities from basic
symbolic arithmetic to calculus, algebra, discrete mathematics and quantum
physics. Another useful feature is the capability of formatting the result of
the computations as LaTeX code.
• Requests for accessing the web. It works similarly to the standard Python
library urllib2, but is much easier to code. You will find subtle differences with
urllib2, but for beginners, Requests might be more convenient.
PYTHON - PANDAS

• Pandas is an open-source Python library used for high-performance data
manipulation and data analysis using its powerful data structures.
• Python with pandas is in use in a variety of academic and commercial domains,
including finance, economics, statistics, advertising, web analytics, and more.
• Using Pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of the data — load, organize, manipulate,
model, and analyse the data.
• The name "Pandas" is a reference to both "Panel Data" and "Python Data
Analysis"; the library was created by Wes McKinney in 2008.
KEY FEATURES OF PANDAS

• Below are some of the important features of Pandas used specifically for data processing
and data analysis work:
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High-performance merging and joining of data.
• Time series functionality.

• Pandas deals with the following data structures −
• Series
• DataFrame
These data structures are built on top of the NumPy array, making them fast and efficient.
INSTALLATION OF PANDAS

• If you have Python and PIP already installed on a system, then installation of
Pandas is very easy.
• Install it using this command:
• C:\Users\Your Name>pip install pandas
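A quick check that the installation worked (the printed version number will vary with whatever pip installed):

import pandas as pd
print(pd.__version__)  # e.g. '2.1.0'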
DIMENSION & DESCRIPTION

• The best way to think of these data structures is that the higher-dimensional
data structure is a container of its lower-dimensional data structure. For
example, DataFrame is a container of Series, and Panel is a container of
DataFrame.

Data Structure | Dimensions | Description
Series         | 1          | 1D labeled homogeneous array, size-immutable.
DataFrame      | 2          | General 2D labeled, size-mutable tabular structure
                              with potentially heterogeneously typed columns.

• DataFrame is widely used and is the most important data structure.
SERIES

• Series is a one-dimensional array-like structure with homogeneous data. For
example, the following series is a collection of integers 10, 23, 56, …

• Key Points of Series
• Homogeneous data
• Size immutable
• Values of data mutable

• A pandas Series can be created using the following constructor −
pandas.Series( data, index, dtype, copy)

• A Series can be created using various inputs like −
• Array
• Dict
• Scalar value or constant

• Example:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
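The example prints the values with a default integer index. A Series can likewise be built from an ndarray or a scalar, as sketched below (the ndarray variant assumes NumPy is installed):

# Output of the example above:
# 0    1
# 1    7
# 2    2
# dtype: int64

import numpy as np
import pandas as pd

s_arr = pd.Series(np.array([10, 23, 56]))  # from an ndarray
s_scalar = pd.Series(5, index=[0, 1, 2])   # a scalar is repeated for every index label
print(s_arr)
print(s_scalar)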
LABELS
• If nothing else is specified, the values are labeled with their index number. The first value has index 0,
the second value has index 1, and so on.
• This label can be used to access a specified value:
print(myvar[0])

Create Labels
• With the index argument, you can name your own labels.
• Example

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

print(myvar["y"])
KEY/VALUE OBJECTS AS SERIES

• You can also use a key/value object, like a dictionary, when creating a Series.
import pandas as pd

# The keys of the dictionary become the labels.
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

# To select only some of the items in the dictionary, pass those keys as the index.
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
DATAFRAME

• DataFrame is a two-dimensional array with heterogeneous data. For example,

Name  | Age | Gender | Rating
Steve | 32  | Male   | 3.45
Lia   | 28  | Female | 4.6
Vin   | 45  | Male   | 3.9
Katie | 38  | Female | 2.78

The table represents the data of a sales team of an organization with their overall
performance ratings. The data is represented in rows and columns. Each column
represents an attribute and each row represents a person.
DATA TYPE OF COLUMNS

• The data types of the four columns are as follows −

Column Type
Name String
Age Integer
Gender String
Rating Float

• Key Points of DataFrame
• Heterogeneous data
• Size mutable
• Data mutable
CREATE DATAFRAME

• A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)

• A pandas DataFrame can be created using various inputs like −
• Lists
• Dict
• Series
• NumPy ndarrays
• Another DataFrame
• Parameters
• Data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
• Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict,
column order follows insertion-order. If a dict contains Series which have an index
defined, it is aligned by its index.
• index: Index or array-like
• Index to use for resulting frame. Will default to RangeIndex if no indexing information
part of input data and no index provided.
• columns: Index or array-like
• Column labels to use for resulting frame when data does not have them, defaulting to
RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection
instead.
• dtype: dtype, default None
• Data type to force. Only a single dtype is allowed. If None, infer.
• copy: bool or None, default None
• Copy data from inputs. For dict data, the default of None behaves like copy=True. For
DataFrame or 2d ndarray input, the default of None behaves like copy=False.
Example 1:
import pandas as pd
df = pd.DataFrame()  # empty DataFrame
print(df)

Example 2:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)  # create a DataFrame from a list
print(df)

Example 3:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])  # create a DataFrame from a list of lists
print(df)

Example 4:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
# dtype=float is meant to store the Age column as floats; note that recent
# pandas versions may raise an error when forcing a float dtype on the string
# 'Name' column, so casting the 'Age' column alone with astype is safer.
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)

Example 5:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
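For reference, Example 5 should print something like the following (default RangeIndex on the left):

#     Name  Age
# 0    Tom   28
# 1   Jack   34
# 2  Steve   29
# 3  Ricky   42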
• Locate Row
• Pandas uses the loc attribute to return one or more specified row(s):
• print(df.loc[0])  # works when the index labels are integers
• print(df.loc[[0, 1]])

• Named Indexes
• With the index argument, you can name your own indexes.
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

• Locate Named Indexes
• Use the named index in the loc attribute to return the specified row(s):
• #refer to the named index:
print(df.loc["day2"])  # use iloc when rows must be selected by position rather than label

• Attributes and methods in DataFrame:
• Please refer to this website:
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
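As a small taster of that reference, a few commonly used attributes and methods (df is the calories/duration frame from above):

print(df.shape)       # (rows, columns) -> (3, 2)
print(df.columns)     # Index(['calories', 'duration'], dtype='object')
print(df.dtypes)      # dtype of each column
print(df.head(2))     # first two rows
print(df.describe())  # summary statistics for numeric columns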
LOAD FILES INTO A DATAFRAME

• If your data sets are stored in a file, Pandas can load them into a DataFrame.
• Example
• Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
print(df.to_string())  # use to_string() to print the entire DataFrame
PANDAS READ CSV

• Read CSV Files
• A simple way to store big data sets is to use CSV files (comma-separated values).
• CSV files contain plain text in a well-known format that can be read by
everyone, including Pandas.
• In our examples we will be using a CSV file called 'iris.csv'.
• max_rows
• The number of rows returned is defined in the Pandas option settings.
• You can check your system's maximum rows with the
pd.options.display.max_rows statement.
• On my system the number is 60, which means that if the DataFrame contains more
than 60 rows, the print(df) statement will return only the headers and the first and
last 5 rows.
• You can change the maximum number of rows with the same statement:
• pd.options.display.max_rows = 9999
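Putting the two statements together (the default of 60 is typical, but may differ across pandas versions):

import pandas as pd

print(pd.options.display.max_rows)  # often 60 by default
pd.options.display.max_rows = 9999  # print up to 9999 rows before truncating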
WORKING WITH MISSING DATA

• Missing data can occur when no information is provided for one or more
items or for a whole unit. Missing data is a very big problem in real-life
scenarios. Missing data is also referred to as NA (Not Available) values in
pandas.
• Checking for missing values using isnull() and notnull():
• In order to check missing values in a Pandas DataFrame, we use the functions
isnull() and notnull(). Both functions help in checking whether a value is NaN
or not. These functions can also be used on a Pandas Series in order to find null
values in a series.
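A minimal sketch of both checks on a hand-made Series and DataFrame containing missing values:

import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3])
print(s.isnull())   # True where the value is NaN
print(s.notnull())  # the inverse mask

df = pd.DataFrame({"A": [1, None], "B": [3, 4]})
print(df.isnull())  # element-wise Boolean DataFrame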
PANDAS - CLEANING EMPTY CELLS

Remove Rows
• One way to deal with empty cells is to remove rows that contain empty cells.
• This is usually OK, since data sets can be very big, and removing a few rows
will not have a big impact on the result.
• new_df = df.dropna()  # by default, dropna() returns a new DataFrame and leaves the original unchanged

• If you want to change the original DataFrame, use the inplace = True
argument:
• df.dropna(inplace = True)

Note: In our cleaning examples we will be using a CSV file called 'dirtydata.csv'.
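A complete, runnable version of this approach, assuming a 'dirtydata.csv' file with some empty cells, as in these examples:

import pandas as pd

df = pd.read_csv('dirtydata.csv')  # assumed sample file with some empty cells
new_df = df.dropna()               # drop every row that contains at least one empty cell
print(new_df.to_string())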
Replace Empty Values:
• Another way of dealing with empty cells is to insert a new value instead.
• This way you do not have to delete entire rows just because of some empty
cells.
• The fillna() method allows us to replace empty cells with a value:
• df.fillna(130, inplace = True)

Replace Only For Specified Columns
• The example above replaces all empty cells in the whole DataFrame.
• To only replace empty values for one column, specify the column name for
the DataFrame:
• df["Calories"].fillna(130, inplace = True)
• Note: on recent pandas versions, inplace filling on a single column can trigger a
chained-assignment warning; assigning the result back, as in
df["Calories"] = df["Calories"].fillna(130), is the safer form.
Replace Using Mean, Median, or Mode
• A common way to replace empty cells is to use the mean, median or
mode value of the column.
• Pandas provides the mean(), median() and mode() methods to calculate the
respective values for a specified column:
• Calculate the MEAN, and replace any empty values with it:
• x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)

• Calculate the MEDIAN, and replace any empty values with it:
• x = df["Calories"].median()
df["Calories"].fillna(x, inplace = True)

• Calculate the MODE, and replace any empty values with it:
• x = df["Calories"].mode()[0]
df["Calories"].fillna(x, inplace = True)
• Discovering Duplicates
• Duplicate rows are rows that have been registered more than one time.
• To discover duplicates, we can use the duplicated() method.
• The duplicated() method returns a Boolean value for each row:
• print(df.duplicated())
• df.drop_duplicates(inplace = True)  # remove all duplicate rows
PANDAS - FIXING WRONG DATA

Wrong Data
• "Wrong data" does not have to be "empty cells" or "wrong format", it can just be
wrong, like if someone registered "199" instead of "1.99".
• Sometimes you can spot wrong data by looking at the data set, because you have an
expectation of what it should be.
• If you take a look at our data set, you can see that in row 7, the duration is 450, but
for all the other rows the duration is between 30 and 60.
• It doesn't have to be wrong, but taking into consideration that this is the data set of
someone's workout sessions, we can conclude that this person did not work
out for 450 minutes.
Replacing Values
• One way to fix wrong values is to replace them with something else.
• In our example, it is most likely a typo, and the value should be "45" instead
of "450", and we could just insert "45" in row 7:
• df.loc[7, 'Duration'] = 45

• For small data sets you might be able to replace the wrong data one by one,
but not for big data sets.
• To replace wrong data for larger data sets you can create some rules, e.g. set
some boundaries for legal values, and replace any values that are outside of
the boundaries.
• Example: cap every "Duration" value above 120 at 120.
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
• Removing Rows
• Another way of handling wrong data is to remove the rows that contain wrong
data.
• This way you do not have to figure out what to replace them with, and there is a
good chance you do not need them for your analyses.
• Delete rows where "Duration" is higher than 120:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
CLEANING DATA OF WRONG FORMAT

Data of Wrong Format


• Cells with data of wrong format can make it difficult, or even impossible, to
analyze data.
• To fix it, you have two options: remove the rows, or convert all cells in the
columns into the same format.
Convert Into a Correct Format
• In our DataFrame, we have two cells with the wrong format. Check out the rows
at index 22 and 26; the 'Date' column should be a string that represents a
date.
• Let's try to convert all cells in the 'Date' column into dates.
• Pandas has a to_datetime() method for this:
• Example
• Convert to date:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
Removing Rows
• The conversion in the example above gave us a NaT value for the empty cell,
which can be handled as a NULL value, and we can remove the row by using
the dropna() method.
• Remove rows with a NULL value in the "Date" column:
• df.dropna(subset=['Date'], inplace = True)
PYTHON - NUMPY

• NumPy is a Python package which stands for 'Numerical Python'. It is a
library consisting of multidimensional array objects and a collection of
routines for processing arrays.
• Operations using NumPy
• Using NumPy, a developer can perform the following operations (see the sketch after this list) −
• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra. NumPy has built-in functions for linear
algebra and random number generation.
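A compact sketch touching each category of operation listed above:

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

print(a + 10)            # element-wise arithmetic
print(a > 2)             # element-wise logical comparison
print(a.reshape(1, 4))   # shape manipulation
print(np.fft.fft([1.0, 0.0, -1.0, 0.0]))  # discrete Fourier transform
print(np.linalg.det(a))  # linear algebra: determinant -> -2.0
print(np.random.rand(2)) # random number generation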
NUMPY – A REPLACEMENT FOR MATLAB

• NumPy is often used along with packages like SciPy (Scientific Python) and
Matplotlib (plotting library). This combination is widely used as a
replacement for MATLAB, a popular platform for technical computing. The
Python alternative to MATLAB is now seen as a more modern and
complete programming language.

• It is open source, which is an added advantage of NumPy.

• https://numpy.org/doc/stable/user/quickstart.html
NDARRAY OBJECT

• The most important object defined in NumPy is an N-dimensional array type
called ndarray.
• It describes a collection of items of the same type.
• Items in the collection can be accessed using a zero-based index.
• Every item in an ndarray takes the same size of block in memory.
• Each element in an ndarray is described by a data-type object (called dtype).
• Any item extracted from an ndarray object (by slicing) is represented by a
Python object of one of the array scalar types. See the sketch below.
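A short sketch of those properties (the item size follows the chosen dtype):

import numpy as np

arr = np.array([10, 23, 56], dtype=np.int32)

print(arr[0])        # zero-based indexing -> 10
print(arr.dtype)     # int32 -- every item has the same type
print(arr.itemsize)  # 4 -- every item occupies the same block of memory
print(type(arr[1]))  # <class 'numpy.int32'> -- an array scalar type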
PYTHON - MATPLOTLIB

• Matplotlib is a Python library used to create 2D graphs and plots from Python scripts.
• It has a module named pyplot which makes plotting easy by providing features to
control line styles, font properties, axis formatting, etc.
• It supports a very wide variety of graphs and plots, namely histograms, bar charts, power
spectra, error charts, etc.
• It is used along with NumPy to provide an environment that is an effective open-source
alternative to MATLAB.
• It can also be used with graphics toolkits like PyQt and wxPython.
• Conventionally, the package is imported into a Python script by adding the following
statement −
• from matplotlib import pyplot as plt
MATPLOTLIB EXAMPLE

• The following script produces a sine wave plot using matplotlib. Its output is a
sine curve (the rendered plot image is not reproduced here).
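A standard sine wave script of the kind the slide describes, in the style of the tutorials linked below:

import numpy as np
from matplotlib import pyplot as plt

x = np.arange(0, 3 * np.pi, 0.1)  # x values covering one and a half periods
y = np.sin(x)                     # sine of each x value

plt.title("sine wave form")
plt.plot(x, y)
plt.show()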
• https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
• https://www.tutorialspoint.com/python_data_science/python_data_operations.htm
