Python For Data Analysis Jan 28

The document outlines a live session on Data Analytics with Python, focusing on various Python libraries essential for data science, including NumPy, SciPy, Pandas, SciKit-Learn, matplotlib, and Seaborn. It covers topics such as data manipulation, visualization, descriptive and inferential statistics, and handling missing values. The tutorial includes hands-on exercises and practical examples to illustrate the use of these libraries in data analysis.

Uploaded by

ABHISHEK SINGH

NPTEL LIVE SESSIONS

Data Analytics with Python

(noc25_cs17)

Presented by
Ambati Rami Reddy
Ph.D. Scholar (PMRF)
Department of CSE

07/29/2025
Tutorial Content

Overview of Python Libraries for Data Scientists

Reading Data; Selecting and Filtering the Data; Data manipulation, sorting, grouping, rearranging

Plotting the data

Descriptive statistics

Inferential statistics

Python Libraries for Data Science
Many popular Python toolboxes/libraries:
• NumPy
• SciPy
• Pandas
• SciKit-Learn

All these libraries are installed on the SCC.

Visualization libraries
• matplotlib
• Seaborn

and many more …

Python Libraries for Data Science
NumPy:
• introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects

• provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance

• many other Python libraries are built on NumPy

Link: http://www.numpy.org/

Python Libraries for Data Science
SciPy:
• collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more

• part of the SciPy Stack

• built on NumPy

Link: https://www.scipy.org/scipylib/

Python Libraries for Data Science
Pandas:
• adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)

• provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.

• allows handling missing data

Link: http://pandas.pydata.org/

Python Libraries for Data Science
SciKit-Learn:
• provides machine learning algorithms: classification, regression, clustering, model validation etc.

• built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/

Python Libraries for Data Science
matplotlib:
• Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats

• a set of functionalities similar to those of MATLAB

• line plots, scatter plots, bar charts, histograms, pie charts etc.

• relatively low-level; some effort needed to create advanced visualizations

Link: https://matplotlib.org/

Python Libraries for Data Science
Seaborn:
• based on matplotlib

• provides a high-level interface for drawing attractive statistical graphics

• similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/

Loading Python Libraries
In [ ]: #Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

Press Shift+Enter to execute the cell

Data Frame data types
Pandas Type                 Native Python Type    Description
object                      string                The most general dtype. Assigned to a column that holds strings or mixed types (numbers and strings).
int64                       int                   Integer values. 64 refers to the number of bits used to store each value.
float64                     float                 Numbers with decimals. If a column contains numbers and NaNs (see below), pandas defaults to float64, because NaN is itself a floating-point value.
datetime64, timedelta[ns]   N/A (but see the datetime module in Python's standard library)    Values meant to hold time data. Look into these for time series experiments.

Data Frames attributes
Python objects have attributes and methods.

df.attribute    description
dtypes          list the types of the columns
columns         list the column names
axes            list the row labels and column names
ndim            number of dimensions
size            number of elements
shape           return a tuple representing the dimensionality
values          numpy representation of the data

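As a minimal sketch of the attributes listed above, the snippet below builds a small data frame and reads a few of them. The column names echo those used later in the slides (rank, sex, service, salary), but the values are made up for illustration.

```python
import pandas as pd

# Toy data frame; the values are made up for illustration only.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
    "sex": ["Male", "Female", "Female", "Male"],
    "service": [20, 3, 25, 8],
    "salary": [120000.0, 80000.0, 135000.0, 95000.0],
})

print(df.dtypes)          # type of each column
print(list(df.columns))   # column names
print(df.ndim)            # number of dimensions (2 for a data frame)
print(df.shape)           # (number of rows, number of columns)
print(df.size)            # total number of elements (rows * columns)
```

Note that attributes are read without parentheses.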
Hands-on exercises

• Find how many records this data frame has.

• How many elements are there?

• What are the column names?

• What types of columns do we have in this data frame?

Data Frames methods
Unlike attributes, methods are called with parentheses.
All attributes and methods can be listed with the dir() function: dir(df)

df.method()             description
head([n]), tail([n])    first/last n rows
describe()              generate descriptive statistics (for numeric columns only, by default)
max(), min()            return max/min values for all numeric columns
mean(), median()        return mean/median values for all numeric columns
std()                   standard deviation
sample([n])             return a random sample of the data frame
dropna()                drop all the records with missing values

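The methods in the table above can be tried on a small data frame. This is only a sketch: the data are invented, and one salary is deliberately left missing so that dropna() has something to drop.

```python
import pandas as pd

# Toy data; one salary is missing on purpose (None becomes NaN).
df = pd.DataFrame({
    "service": [20, 3, 25, 8],
    "salary": [120000.0, None, 135000.0, 95000.0],
})

first_two = df.head(2)        # first 2 rows
summary = df.describe()       # count, mean, std, min, quartiles, max
max_service = df["service"].max()
complete = df.dropna()        # rows without any missing value

print(summary)
print(len(complete))
```

describe() counts only non-missing values, so the count for salary is lower than the count for service here.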
Hands-on exercises

• Give the summary for the numeric columns in the dataset.

• Calculate the standard deviation for all numeric columns.

• What are the mean values of the first 50 records in the dataset? Hint: use the head() method to subset the first 50 records and then calculate the mean.

Selecting a column in a Data Frame
Method 1: Subset the data frame using column name:
df['sex']

Method 2: Use the column name as an attribute:

df.sex

Note: pandas data frames already have a method named rank, so df.rank refers to that method rather than to a column. To select a column named "rank" we should use method 1: df['rank'].

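A short sketch of both selection styles, and of why a column called "rank" needs the bracket form. The two-row data frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"rank": ["Prof", "AsstProf"], "sex": ["Male", "Female"]})

sex_a = df["sex"]     # Method 1: column name in square brackets
sex_b = df.sex        # Method 2: column name as an attribute
rank_col = df["rank"] # works for any column name

# df.rank is the built-in DataFrame.rank method, not the "rank" column
print(callable(df.rank))
```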
Data Frames groupby method
Using the "group by" method we can:
• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) to each group
• Similar to the group_by() function in R's dplyr package
In [ ]: #Group data using rank
df_rank = df.groupby(['rank'])

In [ ]: #Calculate mean value for each numeric column per each group
#(numeric_only=True skips text columns, which recent pandas versions no longer drop silently)
df_rank.mean(numeric_only=True)

Data Frames groupby method

Once a groupby object is created we can calculate various statistics for each group:
In [ ]: #Calculate mean salary for each professor rank:
df.groupby('rank')[['salary']].mean()

Note: If single brackets are used to specify the column (e.g. salary), then the output is a Pandas Series object.
When double brackets are used the output is a Data Frame.
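The difference between single and double brackets can be checked directly. A minimal sketch with made-up salaries:

```python
import pandas as pd

df = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf"],
    "salary": [120000.0, 130000.0, 80000.0],
})

as_series = df.groupby("rank")["salary"].mean()    # single brackets -> Series
as_frame = df.groupby("rank")[["salary"]].mean()   # double brackets -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```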
Data Frames: method iloc (summary)
df.iloc[0] # First row of a data frame
df.iloc[i] #(i+1)th row
df.iloc[-1] # Last row

df.iloc[:, 0] # First column

df.iloc[:, -1] # Last column

df.iloc[0:7] #First 7 rows

df.iloc[:, 0:2] #First 2 columns
df.iloc[1:3, 0:2] #Second through third rows and first 2 columns
df.iloc[[0,5], [1,3]] #1st and 6th rows and 2nd and 4th columns
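To see what these position-based selections return, here is a small sketch on a hypothetical 10-row, 4-column data frame of integers:

```python
import pandas as pd

# Hypothetical data: column "a" holds 0..9, "b" holds 10..19, and so on.
df = pd.DataFrame({
    "a": range(10),
    "b": range(10, 20),
    "c": range(20, 30),
    "d": range(30, 40),
})

first_row = df.iloc[0]             # first row, returned as a Series
last_col = df.iloc[:, -1]          # last column
block = df.iloc[1:3, 0:2]          # rows 2-3, first 2 columns (end is exclusive)
picked = df.iloc[[0, 5], [1, 3]]   # 1st and 6th rows, 2nd and 4th columns

print(block)
print(picked)
```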

Data Frames: Sorting

We can sort the data by the values in a column. By default the sorting is in ascending order and a new data frame is returned.

In [ ]: # Create a new data frame from the original sorted by the column service
df_sorted = df.sort_values(by='service')
df_sorted.head()

Data Frames: Sorting

We can sort the data using 2 or more columns:

In [ ]: df_sorted = df.sort_values(by=['service', 'salary'], ascending=[True, False])
df_sorted.head(10)

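A minimal sketch of a two-column sort on made-up numbers: rows are ordered by service (ascending) and, within equal service, by salary (descending). The original data frame is left unchanged.

```python
import pandas as pd

df = pd.DataFrame({"service": [5, 2, 5, 2], "salary": [90, 70, 100, 60]})

df_sorted = df.sort_values(by=["service", "salary"], ascending=[True, False])

print(df_sorted)
```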
Missing Values
Missing values are marked as NaN.
In [ ]: # Read a dataset with missing values
flights = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/flights.csv")

In [ ]: # Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()

Missing Values
There are a number of methods to deal with missing values in the data frame:
df.method()                  description
dropna()                     Drop observations (rows) that contain any missing value
dropna(how='all')            Drop observations where all cells are NA
dropna(axis=1, how='all')    Drop a column if all of its values are missing
dropna(thresh=5)             Drop rows that contain fewer than 5 non-missing values
fillna(0)                    Replace missing values with zeros
isnull()                     Return True if the value is missing
notnull()                    Return True for non-missing values

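The behaviour of each dropna() variant is easiest to see on a tiny data frame. This sketch uses invented values, with one row and one column that are entirely missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],   # column with no data at all
})

any_missing = df.dropna()                  # every row has a NaN in "c"
all_missing = df.dropna(how="all")         # only the last row is all NaN
no_empty_cols = df.dropna(axis=1, how="all")
at_least_two = df.dropna(thresh=2)         # keep rows with >= 2 non-missing values
filled = df.fillna(0)

print(len(any_missing), len(all_missing), list(no_empty_cols.columns), len(at_least_two))
```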
Missing Values
• When summing the data, missing values are treated as zero
• If all values are missing, the sum is 0 in pandas 0.22 and later; pass min_count=1 to get NaN instead
• cumsum() and cumprod() methods ignore missing values but preserve them in the resulting arrays
• Missing values in GroupBy method are excluded (just like in R)
• Many descriptive statistics methods have a skipna option to control whether missing data should be excluded. This value is set to True by default (unlike R)

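These rules can be verified on a short Series. The sketch assumes pandas 0.22 or later, where the sum of an all-missing Series is 0 unless min_count is given.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
empty = pd.Series([np.nan, np.nan])

total = s.sum()                      # NaN is skipped
strict_total = s.sum(skipna=False)   # NaN propagates
running = s.cumsum()                 # NaN is skipped but kept in place
all_missing = empty.sum()            # 0.0 by default
all_missing_nan = empty.sum(min_count=1)   # NaN: fewer than 1 valid value

print(total, strict_total, all_missing, all_missing_nan)
print(running.tolist())
```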
Aggregation Functions in Pandas
Aggregation: computing a summary statistic about each group, i.e.
• compute group sums or means
• compute group sizes/counts

Common aggregation functions:

min, max
count, sum, prod
mean, median, mode, mad (mad was removed in pandas 2.0)
std, var

Aggregation Functions in Pandas
The agg() method is useful when multiple statistics are computed per column:
In [ ]: flights[['dep_delay','arr_delay']].agg(['min','mean','max'])

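A self-contained sketch of agg() that does not need the flights file: the column names match the slide, but the three delay values per column are made up.

```python
import pandas as pd

# Stand-in for the flights data; delays are invented.
flights = pd.DataFrame({
    "dep_delay": [5.0, -2.0, 30.0],
    "arr_delay": [10.0, -5.0, 25.0],
})

stats = flights[["dep_delay", "arr_delay"]].agg(["min", "mean", "max"])

print(stats)   # one row per statistic, one column per variable
```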
Basic Descriptive Statistics
df.method()           description
describe              Basic statistics (count, mean, std, min, quantiles, max)
min, max              Minimum and maximum values
mean, median, mode    Arithmetic average, median and mode
var, std              Variance and standard deviation
sem                   Standard error of the mean
skew                  Sample skewness
kurt                  Kurtosis

Graphics to explore the data
The Seaborn package is built on matplotlib but provides a high-level interface for drawing attractive statistical graphics, similar to the ggplot2 library in R. It specifically targets statistical data visualization.

To show graphs within a Python notebook, include the inline directive:

In [ ]: %matplotlib inline

Graphics
function      description
distplot      histogram (deprecated since seaborn 0.11; use histplot or displot)
barplot       estimate of central tendency for a numeric variable
violinplot    similar to boxplot, also shows the probability density of the data
jointplot     scatterplot
regplot       regression plot
pairplot      pairplot
boxplot       boxplot
swarmplot     categorical scatterplot
factorplot    general categorical plot (renamed catplot in seaborn 0.9)

Basic statistical Analysis
statsmodels and scikit-learn both have a number of functions for statistical analysis.

The first one is mostly used for regular analysis using R-style formulas, while scikit-learn is more tailored for Machine Learning.

statsmodels:
• linear regressions
• ANOVA tests
• hypothesis tests
• many more ...

scikit-learn:
• kmeans
• support vector machines
• random forests
• many more ...

See examples in the Tutorial Notebook

Numerical Python (NumPy)
• NumPy is the most foundational package for numerical computing in Python.
• If you are going to work on data analysis or machine learning projects, then having a solid understanding of NumPy is nearly mandatory.
• Indeed, many other libraries, such as pandas and scikit-learn, use NumPy's array objects as the lingua franca for data exchange.
• One of the reasons why NumPy is so important for numerical computations is that it is designed for efficiency with large arrays of data. The reasons for this include:
 - It stores data internally in a contiguous block of memory, independent of other built-in Python objects.
 - It performs complex computations on entire arrays without the need for for loops.
What you’ll find in NumPy
• ndarray: an efficient multidimensional array providing fast array-orientated
arithmetic operations and flexible broadcasting capabilities.
• Mathematical functions for fast operations on entire arrays of data without
having to write loops.
• Tools for reading/writing array data to disk and working with memory-
mapped files.
• Linear algebra, random number generation, and Fourier transform
capabilities.
• A C API for connecting NumPy with libraries written in C, C++, and
FORTRAN. This is why Python is the language of choice for wrapping legacy
codebases.
The NumPy ndarray: A multi-dimensional array object
• The NumPy ndarray object is a fast and flexible container for large data sets in Python.
• NumPy arrays are a bit like Python lists, but are still a very different beast at the same time.
• Arrays enable you to store multiple items of the same data type. It is the facilities around the array object that make NumPy so convenient for performing math and data manipulations.
Ndarray vs. lists
• By now, you are familiar with Python lists and how incredibly useful
they are.
• So, you may be asking yourself:

“I can store numbers and other objects in a Python list and do all sorts
of computations and manipulations through list comprehensions, for-
loops etc. What do I need a NumPy array for?”

• There are very significant advantages to using NumPy arrays over lists.
Creating a NumPy array
• To understand these advantages, let's create an array.
• One of the most common of the many ways to create a NumPy array is to create one from a list by passing it to the np.array() function.
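A minimal sketch of creating an array from a list; the variable names are illustrative.

```python
import numpy as np

list1 = [0, 1, 2, 3, 4]
arr1d = np.array(list1)   # build a 1-D ndarray from a Python list

print(type(arr1d))
print(arr1d)
```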
Differences between lists and ndarrays
• The key difference between an array and a list is that arrays are designed to handle vectorised operations, while Python lists are not.
• That means that, if you apply a function, it is performed on every item in the array, rather than on the whole array object.
• Let's suppose you want to add the number 2 to every item in a list. The intuitive way to do this is to write list + 2, but this raises a TypeError.
• That is not possible with a list, but you can do it with an array: array + 2 adds 2 to every element.
• It should be noted here that, once a NumPy array is created, you cannot increase its size in place.
• To do so, you will have to create a new array.
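The points above can be sketched as follows. Adding a number to a list fails, while the same expression on an array is applied element by element; np.append is used here as one way to get a larger array, and it returns a new array rather than growing the old one.

```python
import numpy as np

values = [0, 1, 2, 3, 4]

# A list does not support element-wise arithmetic.
try:
    values + 2
except TypeError as err:
    print("list + 2 failed:", err)

plus_two_list = [v + 2 for v in values]   # the list way: an explicit loop

arr = np.array(values)
plus_two_arr = arr + 2                    # vectorised: applied to every element

bigger = np.append(arr, 5)                # new array; arr itself is unchanged
print(plus_two_arr, arr.size, bigger.size)
```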
Create a 2d array from a list of lists
• You can pass a list of lists to create a matrix-like 2d array.
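For example, a 3x3 array can be built from three inner lists (the numbers are arbitrary):

```python
import numpy as np

list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr2d = np.array(list2)   # each inner list becomes one row

print(arr2d)
print(arr2d.shape)
```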
The dtype argument
• You can specify the data type by setting the dtype argument.
• Some of the most commonly used NumPy dtypes are: float, int, bool, str, and object.
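A short sketch of passing dtype when the array is created; integer inputs are stored as floats.

```python
import numpy as np

arr_float = np.array([[0, 1, 2], [3, 4, 5]], dtype="float")

print(arr_float)
print(arr_float.dtype)
```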
The astype method
• You can also convert an array to a different data type using the astype method.

• Remember that, unlike lists, all items in an array have to be of the same type.
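A minimal sketch of astype: it returns a converted copy and leaves the original array untouched. Converting floats to integers drops the fractional part.

```python
import numpy as np

arr_float = np.array([1.7, 2.2, 3.9])

arr_int = arr_float.astype("int")   # fractional part is dropped, not rounded
arr_str = arr_int.astype("str")     # every element becomes a string

print(arr_int, arr_str, arr_float.dtype)
```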
dtype=‘object’
• However, if you are uncertain about what data type your array will
hold, or if you want to hold characters and numbers in the same array,
you can set the dtype as 'object'.

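As a sketch, an object array keeps each element as its original Python object, so numbers and strings can sit side by side (at the cost of the speed benefits of a numeric dtype):

```python
import numpy as np

arr_obj = np.array([1, "a", 2.5], dtype="object")

print(arr_obj)
print([type(item).__name__ for item in arr_obj])
```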
The tolist() method
• You can always convert an array into a list using the tolist() method.
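For instance, a 2-D array becomes a list of lists:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
as_list = arr.tolist()   # nested Python lists holding plain Python ints

print(as_list)
```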
Inspecting a NumPy array
• There are a range of attributes built into NumPy arrays that allow you to inspect different aspects of an array:
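The slide's own code is not recoverable, so the following is a sketch of the most common inspection attributes on an arbitrary 2x3 array:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype="float")

print(arr.shape)   # (rows, columns)
print(arr.ndim)    # number of dimensions
print(arr.size)    # total number of elements
print(arr.dtype)   # data type of the elements
```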
Extracting specific items from an array
• You can extract portions of the array using indices, much like when you're working with lists.
• Unlike lists, however, arrays can optionally accept as many indices in the square brackets as there are dimensions.
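A brief sketch of multi-dimensional indexing on a 3x4 array of the numbers 0 to 11:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

item = arr[0, 1]         # row 0, column 1
corner = arr[:2, :2]     # first 2 rows and first 2 columns
last_row = arr[-1]       # a single index selects a whole row

print(item, corner.tolist(), last_row.tolist())
```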
Boolean indexing
• A boolean index array is of the same shape as the array to be filtered, but it only contains True and False values.
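A minimal sketch: a comparison produces the boolean array, and indexing with it keeps only the elements where the mask is True.

```python
import numpy as np

arr = np.array([[1, 5, 3], [8, 2, 7]])

mask = arr > 4          # same shape as arr, True where the condition holds
selected = arr[mask]    # 1-D array of the matching elements

print(mask)
print(selected)
```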
Pandas
• Pandas, like NumPy, is one of the most popular Python libraries for data analysis.
• It is a high-level abstraction over low-level NumPy, which is largely written in C.
• Pandas provides high-performance, easy-to-use data structures and data analysis tools.
• There are two main structures used by pandas: data frames and series.
Indices in a pandas series
• A pandas series is similar to a list, but differs in the fact that a series associates a label
with each element. This makes it look like a dictionary.
• If an index is not explicitly provided by the user, pandas creates a RangeIndex ranging
from 0 to N-1.
• Each series object also has a data type.

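A sketch of a series created without an explicit index; the three values are arbitrary.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

print(s.index)    # default RangeIndex from 0 to N-1
print(s.dtype)    # data type of the series
print(s.values)   # underlying values as a NumPy array
print(s[1])       # single element looked up by its label
```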
• As you may suspect by this point, a series has ways to extract all of the values in the series, as well as individual elements by index.

• You can also provide an index manually.
• It is easy to retrieve several elements of a series by their indices or make group assignments.
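A minimal sketch of a manually indexed series, multi-element retrieval, and a group assignment. The labels and values are illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

one = s["b"]                 # single element by label
picked = s[["a", "c"]]       # several elements at once

s[["a", "c"]] = 0            # group assignment to the same labels

print(one, picked.tolist(), s.tolist())
```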
Filtering and maths operations
• Filtering and maths operations are easy with Pandas as well.

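As a sketch, a boolean condition filters a series and arithmetic is applied element by element:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

big = s[s > 20]     # keep only the elements greater than 20
doubled = s * 2     # multiply every element, labels are preserved

print(big)
print(doubled)
```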
Pandas data frame
• Simplistically, a data frame is a table, with rows and columns.
• Each column in a data frame is a series object.
• Rows consist of elements inside series.

Case ID Variable one Variable two Variable 3

1 123 ABC 10
2 456 DEF 20
3 789 XYZ 30
Creating a Pandas data frame
• Pandas data frames can be constructed using Python dictionaries.
• You can also create a data frame from a list.
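Both constructions can be sketched as follows. The column names and values are placeholders, not the data from the original slide.

```python
import pandas as pd

# From a dictionary: keys become column names, values become columns.
df_from_dict = pd.DataFrame({"name": ["A", "B", "C"], "score": [1, 2, 3]})

# From a list of rows: column names are supplied separately.
rows = [["A", 1], ["B", 2], ["C", 3]]
df_from_list = pd.DataFrame(rows, columns=["name", "score"])

print(df_from_dict)
print(df_from_list.equals(df_from_dict))
```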
• You can ascertain the type of a column with the type() function.

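A short sketch: type() shows that a selected column is a Series, while the dtype attribute reports the type of the values it holds.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "score": [1, 2, 3]})

column_type = type(df["score"])   # the column object itself
value_type = df["score"].dtype    # the type of the stored values

print(column_type, value_type)
```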
• A Pandas data frame object has two indices: a column index and a row index.
• Again, if you do not provide one, Pandas will create a RangeIndex from 0 to N-1.
• There are numerous ways to provide row indices explicitly.
• For example, you could provide an index when creating a data frame,
• or do it during runtime.
• Here, I also named the index 'country code'.
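The following sketch shows the default index, an index supplied at creation, and an index assigned afterwards. The labels "x", "y", "z" are hypothetical; only the index name 'country code' comes from the slide.

```python
import pandas as pd

data = {"name": ["A", "B", "C"], "score": [1, 2, 3]}

df = pd.DataFrame(data)
default_index = df.index                             # RangeIndex 0..N-1

df_labeled = pd.DataFrame(data, index=["x", "y", "z"])   # index at creation

df.index = ["x", "y", "z"]                           # index assigned at runtime
df.index.name = "country code"                       # the index can be named too

print(default_index)
print(df)
```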
• Row access using the index can be performed in several ways.
• First, you could use .loc[] and provide an index label.

• Second, you could use .iloc[] and provide an index number.
• A selection of particular rows and columns can be made this way.

• You can give .loc[] two arguments, an index list and a column list; slicing is supported as well.
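A sketch of label-based and position-based access on a small, made-up data frame. Note that both indexers use square brackets, and that label slices with .loc include both endpoints.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["A", "B", "C"], "score": [1, 2, 3], "group": ["g1", "g2", "g1"]},
    index=["x", "y", "z"],
)

row_by_label = df.loc["y"]      # row with index label "y"
row_by_pos = df.iloc[1]         # second row by position (same row here)

subset = df.loc[["x", "z"], ["name", "score"]]   # index list and column list
sliced = df.loc["x":"y", "name":"score"]         # label slices are inclusive

print(subset)
print(sliced)
```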
Filtering
• Filtering is performed using so-called Boolean arrays.
Deleting columns
• You can delete a column using the drop() method.
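Both operations can be sketched on a toy data frame. drop() returns a new data frame by default, so the original keeps its column.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "score": [1, 2, 3], "group": ["g1", "g2", "g1"]})

mask = df["score"] > 1                    # Boolean array, one value per row
high = df[mask]                           # rows where the mask is True

without_group = df.drop(columns=["group"])   # new data frame without "group"

print(high)
print(list(without_group.columns), list(df.columns))
```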
Reading from and writing to a file
• Pandas supports many popular file formats including CSV, XML, HTML,
Excel, SQL, JSON, etc.
• Out of all of these, CSV is the file format that you will work with the most.
• You can read in the data from a CSV file using the read_csv() function.

• Similarly, you can write a data frame to a csv file with the to_csv()
function.
• Pandas has the capacity to do much more than what we have
covered here, such as grouping data and even data visualisation.
• However, as with NumPy, we don’t have enough time to cover every
aspect of pandas here.
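As a self-contained sketch of read_csv() and to_csv(), the example below uses in-memory text buffers instead of files on disk; in practice you would pass a file path. The CSV content is made up.

```python
import io
import pandas as pd

csv_text = "id,name,score\n1,A,10\n2,B,20\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# to_csv writes to a path or a file-like object; index=False omits the row index
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the written text back gives the same data frame
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```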
Exploratory data analysis (EDA)
Exploring your data is a crucial step in data analysis. It involves:
• Organising the data set
• Plotting aspects of the data set
• Maybe producing some numerical summaries; central tendency and
spread, etc.

“Exploratory data analysis can never be the whole story, but nothing
else can serve as the foundation stone.”
- John Tukey.
Download the data
• Download the Pokemon dataset from:

https://github.com/LewBrace/da_and_vis_python

• Unzip the folder, and save the data file in a location you’ll remember.
Reading in the data
• First we import the Python packages we are going to use.
• Then we use Pandas to load in the dataset as a data frame.

NOTE: The index_col argument states that we'll treat the first column of the dataset as the ID column.
NOTE: The encoding argument allows us to bypass an input error created by special characters in the data set.
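The original loading code is not recoverable, so this is only a sketch of how the two arguments behave. It reads a few invented rows from memory; the column names and the "latin-1" encoding are assumptions chosen for illustration, not necessarily those of the Pokemon file.

```python
import io
import pandas as pd

# Invented rows, encoded as latin-1 bytes to mimic a file with special characters.
raw = "#,Name,Attack\n1,Pokémon A,49\n2,Pokémon B,62\n".encode("latin-1")

# index_col=0 uses the first column as the row index;
# encoding tells pandas how to decode the bytes.
df = pd.read_csv(io.BytesIO(raw), index_col=0, encoding="latin-1")

print(df)
```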
Examine the data set
• We could spend time staring at these numbers, but that is unlikely to offer us any form of insight.
• We could begin by conducting all of our statistical tests.
• However, a good field commander never goes into battle without first doing a reconnaissance of the terrain…
• This is exactly what EDA is for…
Plotting a histogram in Python
Bins
• You may have noticed the two histograms we’ve seen so far look different,
despite using the exact same data.
• This is because they have different bin values.
• The left graph used the default bins generated by plt.hist(), while the one on
the right used bins that I specified.
• There are a couple of ways to manipulate bins in matplotlib.
• Here, I specified where the edges of the bars of the histogram are; the
bin edges.
• You could also specify the number of bins, and Matplotlib will
automatically generate a number of evenly spaced bins.
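The two ways of controlling bins can be sketched as below. The data are random numbers rather than the Pokemon stats, and the bin edges are arbitrary; the Agg backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=70, scale=20, size=200)   # synthetic values

fig, (ax_left, ax_mid, ax_right) = plt.subplots(1, 3, figsize=(10, 3))

# Default bins chosen by matplotlib
counts_a, edges_a, _ = ax_left.hist(data)

# Explicit bin edges
counts_b, edges_b, _ = ax_mid.hist(data, bins=[0, 25, 50, 75, 100, 125, 150])

# A number of evenly spaced bins
counts_c, edges_c, _ = ax_right.hist(data, bins=5)

fig.savefig("histograms.png")
```

The returned edge arrays always have one more entry than the number of bars.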
Seaborn
• Matplotlib is a powerful, but sometimes unwieldy, Python library.
• Seaborn provides a high-level interface to Matplotlib and makes it
easier to produce graphs like the one on the right.
• Some IDEs incorporate elements of this “under the hood” nowadays.
Benefits of Seaborn
• Seaborn offers:
- Using default themes that are aesthetically pleasing.
- Setting custom colour palettes.
- Making attractive statistical plots.
- Easily and flexibly displaying distributions.
- Visualising information from matrices and DataFrames.
• The last three points have led to Seaborn becoming the exploratory
data analysis tool of choice for many Python users.
Plotting with Seaborn
• One of Seaborn's greatest strengths is its diversity of plotting
functions.
• Most plots can be created with one line of code.
• For example….
Histograms
• Allow you to plot the distributions of numeric variables.
Other types of graphs: Creating a scatter plot
Annotations on the code figure:
• Seaborn "linear model plot" function for creating a scatter graph
• Name of variable we want on the x-axis
• Name of variable we want on the y-axis
• Name of our dataframe fed to the "data=" argument
• Older versions of Seaborn did not have a dedicated scatter plot function (scatterplot() was added in version 0.9).
• We used Seaborn's function for fitting and plotting a regression line; hence lmplot().
• However, Seaborn makes it easy to alter plots.
• To remove the regression line, we use the fit_reg=False argument.
The hue argument
• Another useful argument in Seaborn is hue, which enables us to use a variable to colour code our data points.
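A sketch of the call described above, assuming seaborn is installed. The column names Attack, Defense and Stage follow the slides, but the six rows are made up and stand in for the Pokemon data frame.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, no window needed
import pandas as pd
import seaborn as sns

# Made-up stand-in for the Pokemon data frame
df = pd.DataFrame({
    "Attack": [49, 62, 82, 52, 64, 84],
    "Defense": [49, 63, 83, 43, 58, 78],
    "Stage": [1, 2, 3, 1, 2, 3],
})

grid = sns.lmplot(
    x="Attack",      # variable on the x-axis
    y="Defense",     # variable on the y-axis
    data=df,         # the data frame holding both columns
    fit_reg=False,   # no regression line, scatter points only
    hue="Stage",     # colour the points by this variable
)
grid.savefig("scatter.png")
```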
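Factor plots
• Make it easy to separate plots by categorical classes. (In seaborn 0.9 and later, factorplot() is named catplot().)

Annotations on the code figure:
• Colour by stage.
• Separate by stage.
• Generate using a swarmplot.
• Rotate the x-tick labels by 45 degrees.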
A box plot
• The total, stage, and legendary entries are not combat stats, so we should remove them.
• Pandas makes this easy to do: we use the .drop() method to create a new dataframe that doesn't include the variables we don't want.
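A minimal sketch of this step. The exact column labels (Total, Stage, Legendary) are assumed from the wording above, and the two rows are invented.

```python
import pandas as pd

# Invented rows; column labels are assumptions based on the slide text.
df = pd.DataFrame({
    "Name": ["A", "B"],
    "Total": [318, 405],
    "HP": [45, 60],
    "Attack": [49, 62],
    "Stage": [1, 2],
    "Legendary": [False, False],
})

# drop() returns a new data frame; df itself is left unchanged
stats_df = df.drop(["Total", "Stage", "Legendary"], axis=1)

print(list(stats_df.columns))
```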
Seaborn’s theme
• Seaborn has a number of themes you can use to alter the appearance
of plots.
• For example, we can use “whitegrid” to add grid lines to our boxplot.
Violin plots
• Violin plots are useful alternatives to box plots.
• They show the distribution of a variable through the thickness of the violin.
• Here, we visualise the distribution of attack by Pokémon's primary type:
• Dragon types tend to have higher Attack stats than Ghost types, but they also have greater
variance. But there is something not right here….
• The colours!
Seaborn’s colour palettes
• Seaborn allows us to easily set custom colour palettes by providing it
with an ordered list of colour hex values.
• We first create our colours list.
• Then we just use the palette= argument and feed in our colours list.
• Because of the limited number of observations, we could also use a
swarm plot.
• Here, each data point is an observation, but data points are grouped
together by the variable listed on the x-axis.
Overlapping plots
• Both of these show similar information, so it might be useful to
overlap them.
Set size of print canvas.

Remove bars from inside the violins

Make bars black and slightly transparent

Give the graph a title

Data wrangling with Pandas
• What if we wanted to create such a plot that included all of the other
stats as well?
• In our current dataframe, all of the variables are in different columns:
• If we want to visualise all stats, then we’ll have to “melt” the
dataframe.
Annotations on the code figure:
• We use the .drop() function again to re-create the dataframe without these three variables.
• The dataframe we want to melt.
• The variables to keep; all others will be melted.
• A name for the new, melted, variable.
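The melting step can be sketched with pd.melt. The six stat names come from the slides; the identifier columns "Name" and "Type 1" and the two rows of numbers are assumptions made for illustration.

```python
import pandas as pd

# Two invented Pokemon with the six combat stats in separate columns
stats_df = pd.DataFrame({
    "Name": ["A", "B"],
    "Type 1": ["Grass", "Fire"],
    "HP": [45, 39],
    "Attack": [49, 52],
    "Defense": [49, 43],
    "Sp. Attack": [65, 60],
    "Sp. Defense": [65, 50],
    "Speed": [45, 65],
})

melted_df = pd.melt(
    stats_df,                     # the dataframe we want to melt
    id_vars=["Name", "Type 1"],   # variables to keep; all others are melted
    var_name="Stat",              # name for the new, melted, variable
)

print(stats_df.shape, melted_df.shape)
print(melted_df.head())
```

Each original row turns into six rows, one per stat, with the stat's number in the default "value" column.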

• All 6 of the stat columns have been "melted" into one, and the new Stat column indicates the original stat (HP, Attack, Defense, Sp. Attack, Sp. Defense, or Speed).
• It's hard to see here, but each Pokemon now has 6 rows of data; hence melted_df has 6 times as many rows as the original.
• This graph could be made to look nicer with a few tweaks.

Enlarge the plot.

Separate points by hue.

Use our special Pokemon colour palette.
Adjust the y-axis.
Move the legend box outside of the graph and place it to the right.
Plotting all data: Empirical cumulative distribution functions (ECDFs)
• An alternative way of visualising the distribution of a variable in a large dataset is to use an ECDF.
• Here we have an ECDF that shows the percentages of different attack strengths of Pokemon.
• The x-value of an ECDF is the quantity you are measuring, i.e. attack strength.
• The y-value is the fraction of data points that have a value less than or equal to the corresponding x-value. For example:
 - 75% of Pokemon have an attack level of 90 or less.
 - 20% of Pokemon have an attack level of 50 or less.
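The slides do not show how the ECDF was computed, so the helper below is one common way to do it with NumPy, offered as a sketch. The function name ecdf and the five attack values are illustrative.

```python
import numpy as np

def ecdf(values):
    """Return sorted x-values and the fraction of points <= each x-value."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

attack = np.array([50, 90, 30, 70, 110])   # made-up attack levels
x, y = ecdf(attack)

# Fraction of points with an attack level of 70 or less
frac_le_70 = np.mean(attack <= 70)

print(x, y, frac_le_70)
```

Plotting y against x (for example with plt.plot(x, y, marker=".", linestyle="none")) gives the ECDF curve.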
Plotting an ECDF
• You can also plot multiple ECDFs on the same plot.
• As an example, here we have an ECDF for Pokemon attack, speed, and defence levels.
• We can see here that defence levels tend to be a little lower than the other two.
The usefulness of ECDFs
• It is often quite useful to plot the ECDF first as part of your workflow.
• It shows all the data and gives a complete picture as to how the data
are distributed.
Heatmaps
• Useful for visualising matrix-like data.
• Here, we’ll plot the correlation of the stats_df variables
Bar plot
• Visualises the distributions of categorical variables.

Rotates the x-ticks 45 degrees

Joint Distribution Plot
• Joint distribution plots combine information from scatter plots and
histograms to give you detailed information for bi-variate distributions.
Any questions?

CO3 - 1 - Pandas Series and Data Frame
No ratings yet
CO3 - 1 - Pandas Series and Data Frame
37 pages
Tutorial Data Visualization Pandas Matplotlib Seaborn
No ratings yet
Tutorial Data Visualization Pandas Matplotlib Seaborn
32 pages
TIA 942 C DC Infrastructure Stadard - TIA White Paper
No ratings yet
TIA 942 C DC Infrastructure Stadard - TIA White Paper
10 pages
Python For Data Analysis
No ratings yet
Python For Data Analysis
37 pages
Chapter 4 - Python For Data Analysis
No ratings yet
Chapter 4 - Python For Data Analysis
47 pages
CSE445 NSU Week - 3
No ratings yet
CSE445 NSU Week - 3
48 pages
Unit IV
No ratings yet
Unit IV
49 pages
Python For Data Analysis
No ratings yet
Python For Data Analysis
47 pages
Module 6
No ratings yet
Module 6
48 pages
Python For Data Analysis Edgar
No ratings yet
Python For Data Analysis Edgar
49 pages
20ca2204 Data Science QB With Answers
No ratings yet
20ca2204 Data Science QB With Answers
48 pages
More On Pandas
No ratings yet
More On Pandas
51 pages
Python For Data Analysis: Dr. Kishore Kunal
100% (1)
Python For Data Analysis: Dr. Kishore Kunal
43 pages
Pandas 1702216043
No ratings yet
Pandas 1702216043
86 pages
Utf-8''libraries Data Management
No ratings yet
Utf-8''libraries Data Management
9 pages
Python For ML
No ratings yet
Python For ML
41 pages
Starting Out With Pandas - Ext
No ratings yet
Starting Out With Pandas - Ext
18 pages
Python For Data Science
No ratings yet
Python For Data Science
45 pages
Data Analysis - 5th Unit
No ratings yet
Data Analysis - 5th Unit
14 pages
Pandas
No ratings yet
Pandas
25 pages
Python Libraries 2
No ratings yet
Python Libraries 2
80 pages
Unit - 4 - Part 2
No ratings yet
Unit - 4 - Part 2
36 pages
Murali Internship
No ratings yet
Murali Internship
34 pages
04-Data Manipulation With Pandas
No ratings yet
04-Data Manipulation With Pandas
28 pages
01 Python For Data Analysis (Ziad)
No ratings yet
01 Python For Data Analysis (Ziad)
53 pages
Python For Statistics
No ratings yet
Python For Statistics
40 pages
Data Frame
No ratings yet
Data Frame
95 pages
Pandas PDF
No ratings yet
Pandas PDF
25 pages
DAP 3 Module
No ratings yet
DAP 3 Module
62 pages
Usage of NumPy For Numerical Data in Detail
No ratings yet
Usage of NumPy For Numerical Data in Detail
52 pages
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
100% (1)
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
12 pages
Pandas
No ratings yet
Pandas
25 pages
Pandas
No ratings yet
Pandas
13 pages
Ip Study
No ratings yet
Ip Study
18 pages
Dot Cards Introduction Procedure
100% (1)
Dot Cards Introduction Procedure
4 pages
Pandas Complete + Visualisation Summary of IBM Visualization
No ratings yet
Pandas Complete + Visualisation Summary of IBM Visualization
21 pages
Pandas Basics
No ratings yet
Pandas Basics
84 pages
Pandas
No ratings yet
Pandas
12 pages
Wa0005.
No ratings yet
Wa0005.
29 pages
Python-for-Data-Analysis (Pandas
No ratings yet
Python-for-Data-Analysis (Pandas
31 pages
Andrew Spencer, Robert Henley - Maths and English For Electrical - Functional Skills-Cengage Learning EMEA (2013)
100% (1)
Andrew Spencer, Robert Henley - Maths and English For Electrical - Functional Skills-Cengage Learning EMEA (2013)
97 pages
NumPy and Pandas
No ratings yet
NumPy and Pandas
12 pages
Pandas For Machine Learning: Acadview
No ratings yet
Pandas For Machine Learning: Acadview
18 pages
SPG Action Plan 2015-2016
83% (6)
SPG Action Plan 2015-2016
10 pages
Unit 3 (FODS)
No ratings yet
Unit 3 (FODS)
34 pages
Course - Introduction To Data Science (SD211105)
No ratings yet
Course - Introduction To Data Science (SD211105)
10 pages
Pandas
No ratings yet
Pandas
41 pages
Python For DA
100% (2)
Python For DA
47 pages
Summary: Introduction To Data Visualization Tools
No ratings yet
Summary: Introduction To Data Visualization Tools
13 pages
What Is Pandas
No ratings yet
What Is Pandas
9 pages
Python For Data Analysis
No ratings yet
Python For Data Analysis
96 pages
Lab-3 Pandas Library
No ratings yet
Lab-3 Pandas Library
14 pages
Loki Temp PPT Pandas 2
No ratings yet
Loki Temp PPT Pandas 2
31 pages


 part of SciPy Stack

 provides tools for data manipulation: reshaping, merging, sorting, slicing,

 allows handling missing data

 built on NumPy, SciPy and matplotlib

 a set of functionalities similar to those of MATLAB

 line plots, scatter plots, bar charts, histograms, pie charts etc.

 relatively low-level; some effort needed to create advanced visualizations

 provides a high-level interface for drawing attractive statistical graphics

 Similar (in style) to the popular ggplot2 library in R

Press Shift+Enter to execute the cell

int64 int Numeric characters. 64 refers to

float64 float Numeric characters with decimals.

size number of elements

 Find how many records this data frame has;

 How many elements are there?

 What are the column names?

 What types of columns do we have in this data frame?
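The questions above can be answered with a few data frame attributes. The following is a minimal sketch on a small made-up data frame; the column names and values are assumptions for illustration, not the dataset used in the session.

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "rank": ["Prof", "AssocProf", "Prof", "AsstProf"],
    "service": [18, 16, 30, 3],
    "salary": [139750.0, 97000.0, 175000.0, 79750.0],
})

n_records = len(df)              # number of rows (records)
n_elements = df.size             # rows * columns
column_names = list(df.columns)  # column labels
column_types = df.dtypes         # one dtype per column

print(n_records, n_elements)
print(column_names)
print(column_types)
```

Note that these are attributes, not methods, so they are written without parentheses.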

describe() generate descriptive statistics (for numeric columns only)

max(), min() return max/min values for all numeric columns

mean(), median() return mean/median values for all numeric columns

std() standard deviation

sample([n]) returns a random sample of the data frame

dropna() drop all the records with missing values
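As a rough illustration of the methods listed above, the sketch below applies them to a small made-up data frame. The data, the column names, and the use of `numeric_only=True` and `random_state=0` are choices made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one text column and one missing value.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [10, 20, 30, 40, 50],
})

summary = df.describe()                     # count, mean, std, min, quartiles, max
col_max = df.max(numeric_only=True)         # maximum of each numeric column
col_mean = df.mean(numeric_only=True)       # mean of each numeric column
col_median = df.median(numeric_only=True)   # median of each numeric column
col_std = df.std(numeric_only=True)         # sample standard deviation
two_rows = df.sample(2, random_state=0)     # random sample of 2 rows
complete = df.dropna()                      # rows without missing values

print(summary)
```

Missing values are skipped by default when these statistics are computed, so the mean of `x` is taken over its four non-missing entries.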

 Give the summary for the numeric columns in the dataset

 Calculate standard deviation for all numeric columns;

Method 2: Use the column name as an attribute:
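A minimal sketch of attribute-style column access, compared with bracket access, might look as follows. The data frame and its column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; "count" is chosen on purpose because it clashes
# with the DataFrame.count() method.
df = pd.DataFrame({"salary": [100, 200, 300], "count": [1, 2, 3]})

by_bracket = df["salary"]   # bracket access always refers to the column
by_attr = df.salary         # attribute access, same column

# Attribute access fails silently when the name clashes with a method:
clash = df.count            # this is the method, not the column
count_column = df["count"]  # bracket access still returns the column

print(by_attr.equals(by_bracket))
```

Attribute access is convenient for reading, but it does not work for column names that contain spaces or that match an existing data frame method or attribute.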

df.iloc[:, 0] # First column

df.iloc[0:7] # First 7 rows
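The position-based selections above can be tried on any data frame. Here is a small sketch on made-up data; the column names `a` and `b` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data frame with 10 rows and 2 columns.
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

first_col = df.iloc[:, 0]   # first column, all rows (a Series)
first_seven = df.iloc[0:7]  # rows at positions 0..6 (end is exclusive)
last_row = df.iloc[-1]      # last row (a Series indexed by column name)

print(first_seven)
```

`iloc` uses square brackets and integer positions, so the end of a slice is excluded, as in ordinary Python slicing.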

We can sort the data using 2 or more columns:
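A minimal sketch of sorting by two columns with `sort_values` is shown below. The data and the column names `dept` and `salary` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "dept": ["B", "A", "B", "A"],
    "salary": [50, 70, 90, 60],
})

# Sort by dept ascending, then by salary descending within each dept.
sorted_df = df.sort_values(by=["dept", "salary"], ascending=[True, False])

print(sorted_df)
```

`sort_values` returns a new data frame and keeps the original row labels, so the index of the result is no longer in increasing order.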

In [ ]: # Select the rows that have at least one missing value

dropna(how='all') Drop observations where all cells are NA

dropna(axis=1, how='all') Drop column if all the values are missing

dropna(thresh=5) Drop rows that contain fewer than 5 non-missing values

fillna(0) Replace missing values with zeros

isnull() Returns True if the value is missing

notnull() Returns True for non-missing values
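The missing-value methods above can be sketched on a small made-up data frame. The data is an assumption for illustration, and `thresh=2` is used instead of 5 only because the example has three columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, column "c" is entirely missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

rows_with_missing = df[df.isnull().any(axis=1)]  # rows with at least one NA
no_empty_rows = df.dropna(how="all")             # drop rows where every cell is NA
no_empty_cols = df.dropna(axis=1, how="all")     # drop columns where every cell is NA
at_least_two = df.dropna(thresh=2)               # keep rows with >= 2 non-missing values
filled = df.fillna(0)                            # replace NA with 0
present = df.notnull()                           # True where a value is present

print(at_least_two)
```

None of these calls modify `df`; each returns a new object.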

Common aggregation functions:

min, max Minimum and maximum values

mean, median, mode Arithmetic average, median and mode

var, std Variance and standard deviation

sem Standard error of the mean

skew Sample skewness
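As a sketch of how these aggregation functions might be applied, the example below passes their names to `agg`, once on a whole column and once per group. The data and the column names `group` and `value` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 6.0],
})

# Several aggregations of one column at once; the result is a Series
# indexed by the function names.
stats = df["value"].agg(
    ["min", "max", "mean", "median", "var", "std", "sem", "skew"]
)

# The same idea per group.
by_group = df.groupby("group")["value"].agg(["min", "max", "mean"])

# mode() is called separately because it can return more than one value.
modes = df["value"].mode()

print(stats)
print(by_group)
```

`var`, `std` and `sem` use the sample (n - 1) definition by default.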

To show graphs within a Python notebook, include the inline directive:

%matplotlib inline

See examples in the Tutorial Notebook

• There are very significant advantages of using NumPy arrays over

• You can also provide an index manually.

Case ID Variable one Variable two Variable 3

• Second, you could use .iloc[] and provide an index number

Seaborn “linear Name of variable we

Remove bars from inside the violins

Make bars black and slightly transparent

Give the graph a title

The variables to keep, all others will be

Enlarge the plot.

Separate points by hue.

20% of Pokemon have an attack

Rotates the x-ticks 45 degrees
