
Python UnitIV

Pandas is an open-source library in Python for high-performance data manipulation, developed by Wes McKinney in 2008. It provides two primary data structures, Series (one-dimensional) and DataFrame (two-dimensional), along with various features for data analysis, such as data alignment, reshaping, and handling missing data. The document outlines how to create and manipulate Series and DataFrames, including operations like selection, addition, and deletion of rows and columns.

Uploaded by

nimodbd
Copyright
© © All Rights Reserved

Unit – IV

Pandas
Pandas is an open-source library that provides high-performance data manipulation in Python.
The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.
It is used for data analysis in Python and was developed by Wes McKinney in 2008.

Data analysis requires a lot of processing, such as restructuring, cleaning, or merging data.

Key Features of Pandas

o Fast and efficient DataFrame object with default and customized indexing.
o Tools for loading data into in-memory data objects from different file formats.
o Data alignment and integrated handling of missing data.
o Reshaping and pivoting of data sets.
o Label-based slicing, indexing and subsetting of large data sets.
o Columns from a data structure can be deleted or inserted.
o Group by data for aggregation and transformations.
o High performance merging and joining of data.
o Time Series functionality.

Pandas provides two main data structures for manipulating data:
o Series
o DataFrame

Series
A Series is a one-dimensional array-like structure with homogeneous data. The row labels of a Series are called
the index. A list, tuple, or dictionary can easily be converted into a Series using the Series( ) constructor. A Series
cannot contain multiple columns.
For example, the following Series is a collection of the integers 10, 23, 56, …

10 23 56 17 52 61 73 90 26 72

A pandas Series can be created using the following constructor −


pandas.Series(data, index, dtype, copy)
The parameters of the constructor are as follows −

Sr.No   Parameter   Description
1       data        The data, in forms such as an ndarray, list, dictionary, or constant (scalar).
2       index       The index labels. They must be hashable and have the same length as data; duplicate labels are allowed. Defaults to np.arange(n), i.e. 0 to n-1, if no index is passed.
3       dtype       The data type. If None, the data type is inferred.
4       copy        Whether to copy the input data. Default False.

Create an Empty Series

The most basic Series that can be created is an empty Series. For example:

import pandas as pd
s = pd.Series()
print(s)

Its output is as follows (pandas 2.x; older versions report dtype float64) −

Series([], dtype: object)

Create a Series from ndarray

If data is an ndarray, then the index passed must be of the same length. If no index is passed, then by default
the index will be range(n), where n is the array length, i.e., [0, 1, 2, 3, ..., len(array)-1]. For example:
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Its output is as follows −
0 a
1 b
2 c
3 d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1, i.e., 0 to 3.
Create a Series from a list:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

Labels
If nothing else is specified, the values are labeled with their index number. First value has index 0, second
value has index 1 etc.

Create Labels

With the index argument, we can name our own labels.

Example1:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
Its output is as follows −
x 1
y 7
z 2

Example2:

import pandas as pd
import numpy as np
data=np.array(['a','b','c','d'])
s =pd.Series(data,index=[100,101,102,103])
print(s)
Its output is as follows −
100 a
101 b
102 c
103 d

Create a Series from dictionary

A dictionary can be passed as input. If no index is specified, the dictionary keys are used to construct the
index, in the order in which they appear in the dictionary (older pandas versions sorted the keys). If an index
is passed, the values in data corresponding to the labels in the index will be pulled out.
Example1:
import pandas as pd
import numpy as np
data = {'a':0.,'b':1.,'c':2.}
s = pd.Series(data)
print(s)
Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64

Example2:

import pandas as pd
import numpy as np
data={'a':0.,'b':1.,'c':2.}
s =pd.Series(data,index=['b','c','d','a'])
print(s)
Its output is as follows −
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Note − Index order is persisted and the missing element is filled with NaN (Not a Number).
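The NaN entry produced above can be located and replaced. The following is a minimal sketch; isna( ) and fillna( ) are Series methods that are not covered elsewhere in these notes, and the replacement value -1.0 is only an illustrative assumption.

```python
import pandas as pd

data = {'a': 0.0, 'b': 1.0, 'c': 2.0}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

# isna() marks the labels that had no matching key in the dictionary.
missing_labels = list(s[s.isna()].index)
print(missing_labels)

# fillna() returns a new Series with the gaps replaced; s itself is unchanged.
filled = s.fillna(-1.0)
print(filled)
```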

Create a Series from Scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
import pandas as pd
import numpy as np
s =pd.Series(5, index=[0,1,2,3])
print(s)
Its output is as follows −
0 5
1 5
2 5
3 5
dtype: int64

Accessing Data from Series with Position

Data in a Series can be accessed much like data in an ndarray. The first element can be retrieved by its
integer position. The first element is stored at position zero, and so on.
Example1:
import pandas as pd
s = pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])
# retrieve the first element by position
# (s[0] is deprecated in recent pandas when the index holds labels rather than integers)
print(s.iloc[0])
Its output is as follows −
1
Example2:
Retrieve the first three elements in the Series.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])

# retrieve the first three elements
print(s[:3])
Its output is as follows −
a 1
b 2
c 3

Example3:
Retrieve the last three elements.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])

# retrieve the last three elements
print(s[-3:])
Its output is as follows −
c 3
d 4
e 5

Retrieve Data Using Label (Index)

A Series is like a fixed-size dictionary in that we can get and set values by index label.
Example1:
Retrieve a single element using index label value.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])

# retrieve a single element
print(s['a'])
Its output is as follows −
1

Example2:
Retrieve multiple elements using a list of index label values.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])

# retrieve multiple elements
print(s[['a','c','d']])
Its output is as follows −
a 1
c 3
d 4

Example3:
If a label is not contained, an exception is raised.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])

# look up a label that is not in the index
print(s['f'])
Its output is as follows −

KeyError: 'f'
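The examples above only read values. Since a Series also allows setting values by label, the following sketch shows an assignment together with two ways of avoiding the KeyError: a membership test and the get( ) method. The default value 0 passed to get( ) is an illustrative assumption.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Set an existing value by its label.
s['a'] = 10

# Membership tests look at the index labels, like dictionary keys.
has_f = 'f' in s

# get() returns a default instead of raising KeyError for a missing label.
value_f = s.get('f', 0)
print(s['a'], has_f, value_f)
```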

Python Pandas DataFrame

Pandas DataFrame is a widely used two-dimensional data structure with labeled axes (rows and columns).
A DataFrame is a standard way to store data that has two different indexes, i.e., a row index and a
column index. It has the following properties:

o The columns can be of heterogeneous types such as int, bool, and so on.
o It can be seen as a dictionary of Series in which both the rows and the columns are indexed. The
column labels are referred to as "columns" and the row labels as "index".

A pandas DataFrame can be created using the following constructor −

pandas.DataFrame(data, index, columns, dtype, copy)

Parameter & Description:

data: The data, in forms such as an ndarray, Series, dict, constants, lists, or another DataFrame.

index: The row labels. The default np.arange(n) index is used if no index is passed.

columns: The column labels. The default is np.arange(n) if no column labels are passed.

dtype: The data type to use for the columns. If None, it is inferred.

copy: Whether to copy the data from the input. The default was False in older pandas versions; newer
versions choose the default based on the type of input.
Create DataFrame

A pandas DataFrame can be created using various inputs like −

o Lists
o dict
o Series
o Numpy ndarrays
o Another DataFrame

Create an empty DataFrame

The below code shows how to create an empty DataFrame in Pandas:

import pandas as pd
df = pd.DataFrame( )
print (df)
Output
Empty DataFrame
Columns: [ ]
Index: [ ]

Create a DataFrame using List:


import pandas as pd
x = ['Python', 'Pandas']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)
Output
0
0 Python
1 Pandas

Create a DataFrame from Dict of ndarrays/ Lists:


import pandas as pd
info = {'ID' :[101, 102, 103], 'Department' :['B.Sc','B.Tech','M.Tech',]}
df = pd.DataFrame(info)
print (df)
Output
ID Department
0 101 B.Sc
1 102 B.Tech
2 103 M.Tech
Create a DataFrame from Dict of Series:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print (d1)
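No output is shown for the example above. A shorter sketch with the same structure (the reduced data is an assumption made for readability) illustrates how the two Series are aligned on their labels, with NaN filling the positions where one Series has no value.

```python
import pandas as pd

info = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
d1 = pd.DataFrame(info)
print(d1)

# Row 'd' exists only in 'two', so 'one' holds NaN there and is stored as float64.
print(d1.dtypes)
```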

Operations on Rows and Columns in DataFrame

Column Selection

Any column from the DataFrame can be selected through the following code:

import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print (d1 ['one'])

Column Addition

A new column can be added to an existing DataFrame through the following code:

import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)

print ("Add new column by passing series")


df['three'] = pd.Series([20,40,60],index=['a','b','c'])
print (df)

print ("Add new column using existing DataFrame columns")


df['four'] = df['one']+df['three']
print (df)
Column Deletion:

A del statement or pop( ) function is used to delete any column from the existing DataFrame.

import pandas as pd
info = {'one' : pd.Series([1, 2], index= ['a', 'b']),
'two' : pd.Series([1, 2, 3], index=['a', 'b', 'c'])}
df = pd.DataFrame(info)
print ("The DataFrame:")
print (df)

# using the del statement
print ("Delete the first column:")
del df['one']
print (df)

# using the pop( ) method
print ("Delete another column:")
df.pop('two')
print (df)
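One difference between the two approaches is that pop( ) hands back the removed column. The sketch below also uses drop( ), a DataFrame method not discussed in this section, as a non-destructive alternative; the column names and values are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# pop() removes the column in place and returns it as a Series.
removed = df.pop('two')
print(removed)

# drop() leaves df untouched and returns a new DataFrame without the column.
smaller = df.drop(columns=['three'])
print(smaller)
```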

Row Selection:

(a) Selection by Label:

The loc[ ] indexer is used to select rows of a DataFrame by label. A row can be selected by passing its
row label to loc in square brackets.

Syntax

dataframe.loc[label]

Example

import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df.loc['b'])

Output

one 2.0
two 2.0
Name: b, dtype: float64
(b) Selection by integer location:

Rows can also be selected by passing the integer location to the iloc[ ] indexer in square brackets.

Syntax

dataframe.iloc[location number]

Example
import pandas as pd

info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),


'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df.iloc[3])
Output
one 4.0
two 4.0
Name: d, dtype: float64

(c) Slice Rows

Multiple rows can also be selected using the slice operator ':'.

Example
import pandas as pd

info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),


'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df[2:5])
Output
one two
c 3.0 3
d 4.0 4
e 5.0 5
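Slices behave slightly differently depending on whether they use positions or labels. The following sketch, using a smaller single-column frame as an assumption, compares iloc (stop position excluded) with loc (stop label included).

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4, 5]}, index=['a', 'b', 'c', 'd', 'e'])

by_position = df.iloc[2:5]   # positions 2, 3, 4; the stop position is excluded
by_label = df.loc['c':'e']   # labels 'c' through 'e'; the stop label is included

print(by_position)
print(by_label)
```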

Addition of rows:

We can add new rows to a DataFrame by concatenating it with another DataFrame using pd.concat( ). The new
rows are placed at the end. (The older DataFrame.append( ) method was removed in pandas 2.0.)

import pandas as pd
d = pd.DataFrame([[7, 8], [9, 10]], columns = ['x','y'])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns = ['x','y'])
d = pd.concat([d, d2])
print (d)
Output
x y
0 7 8
1 9 10
0 11 12
1 13 14
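The output above repeats the row labels 0 and 1 because each frame keeps its own index. One way to get fresh labels, sketched here with pandas.concat and its ignore_index parameter, is to renumber the rows while combining.

```python
import pandas as pd

d = pd.DataFrame([[7, 8], [9, 10]], columns=['x', 'y'])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns=['x', 'y'])

# ignore_index=True discards the original labels and numbers the rows 0..n-1.
combined = pd.concat([d, d2], ignore_index=True)
print(combined)
```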

Deletion of rows:

We can delete or drop rows from a DataFrame using the index label. If the label is duplicated, all rows
with that label are deleted.

import pandas as pd
a_info = pd.DataFrame([[4, 5], [6, 7]], columns = ['x','y'])
b_info = pd.DataFrame([[8, 9], [10, 11]], columns = ['x','y'])
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
a_info = pd.concat([a_info, b_info])

# Drop rows with label 0
a_info = a_info.drop(0)
print (a_info)
Output
x y
1 6 7
1 10 11

CSV Files
CSV stands for "comma-separated values", a simple file format that uses a specific structure to arrange
tabular data. It stores tabular data, such as a spreadsheet or database table, in plain text and is a
common format for data interchange. A csv file can be opened in a spreadsheet program such as Excel, where
each line appears as a row and each comma-separated value as a column.

Reading csv files with Pandas

Reading a csv file into a pandas DataFrame is quick and straightforward. We don't need to write many
lines of code to open, parse, and read the csv file; pandas stores the data in a DataFrame.

The read_csv function of the pandas library is used to read the content of a CSV file into the Python
environment as a pandas DataFrame.
Syntax
pandas.read_csv(csv file name)
Example
import pandas as pd
df = pd.read_csv('hrdata.csv')
print(df)
Reading Specific Rows

After a file has been read with read_csv, specific rows of a given column can be selected from the
resulting DataFrame by slicing.

Example showing the first 5 rows of the column named salary.

import pandas as pd
df = pd.read_csv('hrdata.csv')
print(df[0:5]['salary'])

Reading Specific Columns

After a file has been read with read_csv, specific columns can be selected as well. We use the
multi-axes indexer .loc[ ] for this purpose.
Example showing the columns salary and name for all the rows.
import pandas as pd
df = pd.read_csv('hrdata.csv')
print(df.loc[ : , ['salary', 'name']])

Reading Specific Columns and Rows

Specific columns and specific rows can also be selected together. We again use the multi-axes
indexer .loc[ ] for this purpose.
Example showing the columns salary and name for some of the rows.
import pandas as pd
df = pd.read_csv('hrdata.csv')
print(df.loc[[ 1, 3, 5 ] , ['salary', 'name']])
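The examples in this part read the whole file and select from it afterwards. As an alternative sketch, read_csv can restrict what it loads through its usecols and nrows parameters. The CSV content below is hypothetical stand-in data (the real contents of hrdata.csv are not given), supplied through io.StringIO so the example runs without a file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as 'hrdata.csv'.
csv_text = """name,salary,dept
Asha,52000,HR
Ravi,61000,IT
Meena,58000,IT
"""

# read_csv accepts a file path or a file-like object.
# usecols limits the columns that are loaded; nrows limits the number of data rows.
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'salary'], nrows=2)
print(df)
```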

Functions
(1) head( ): This method returns the top n rows (5 by default) of a DataFrame or Series.

Syntax

dataframe.head(n)

Example 1
import pandas as pd
info = pd.DataFrame({'language':['C', 'C++', 'Python', 'Java','PHP']})
info.head(3)
Example 2
import pandas as pd
data = pd.read_csv("aa.csv")
data_top = data.head(2)
data_top
(2) tail( ): This method returns the last n rows (5 by default) of a DataFrame or Series.

Syntax

dataframe.tail(n)

Example 1
import pandas as pd
info = pd.DataFrame({'language':['C', 'C++', 'Python', 'Java','PHP']})
info.tail(3)
Example 2
import pandas as pd
data = pd.read_csv("aa.csv")
data_top = data.tail(2)
data_top

(3) info( ): It is an important and widely used method of pandas. This method prints a summary of the
DataFrame, including the index type, column dtypes, column names, non-null counts, and memory
usage. It gives a quick overview of the dataset.

Syntax

dataframe.info(verbose, buf, max_cols, memory_usage, show_counts)

Parameters -
o verbose - Whether to print the full summary of the dataset.
o buf - A writable buffer that receives the output; defaults to sys.stdout.
o max_cols - The number of columns above which the shortened summary is printed instead of the full one.
o memory_usage - Specifies whether the total memory usage of the DataFrame elements
(including index) should be displayed.
o show_counts - Whether to show the non-null counts.

Example

import pandas as pd
data = pd.read_csv("aa.csv")
data.info( )    # info( ) prints the summary itself and returns None
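Because info( ) writes its summary to a buffer rather than returning it, the buf parameter can be used to capture the text. A minimal sketch, with a small hypothetical DataFrame in place of aa.csv:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', None], 'score': [1.5, None, 3.0]})

buffer = io.StringIO()
result = df.info(buf=buffer)   # the summary goes into the buffer, not to the screen
summary = buffer.getvalue()

print(result)    # info() itself returns None
print(summary)
```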

(4) shape: The shape property returns a tuple containing the shape of the DataFrame. The shape is the
number of rows and columns of the DataFrame.

Syntax

dataframe.shape
Example
import pandas as pd
df=pd.DataFrame({'col1':[1,2],'col2':[3,4]})
print(df.shape)

Output
(2, 2)

(5) columns: The columns property returns the label of each column in the DataFrame.

Syntax

dataframe.columns

Example
import pandas as pd
df = pd.read_csv('data.csv')
print (df.columns)

(6) isnull( ): The isnull() method returns a DataFrame object where all the values are replaced with
a Boolean value True for NULL values, and otherwise False.

Syntax

dataframe.isnull()

Example

import pandas as pd
df = pd.read_csv('data.csv')
newdf = df.isnull( )
print(newdf.to_string( ))
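A common follow-up is to count the missing values per column by summing the Boolean result, since True counts as 1. The sketch below uses a small hypothetical DataFrame instead of data.csv.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': ['x', None, None]})

mask = df.isnull()               # True where a value is missing
missing_per_column = mask.sum()  # number of True values in each column

print(mask)
print(missing_per_column)
```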

(7) dropna( ): The dropna( ) method removes the rows that contain NULL values. This method
returns a new DataFrame object unless the inplace parameter is set to True, in which
case dropna( ) removes the rows from the original DataFrame instead.

Syntax

dataframe.dropna(axis, how, thresh, subset, inplace)

Parameter   Value                       Description
axis        0, 1, 'index', 'columns'    Optional, default 0. 0 or 'index' removes ROWS that contain NULL values; 1 or 'columns' removes COLUMNS that contain NULL values.
how         'all', 'any'                Optional, default 'any'. Specifies whether to remove the row or column when ALL values are NULL, or if ANY value is NULL.
thresh      Number                      Optional. The minimum number of non-NULL values a row or column must have in order to be kept.
subset      List                        Optional. Specifies where (in which labels) to look for NULL values.
inplace     True, False                 Optional, default False. If True, the removal is done on the current DataFrame. If False, a copy with the removal applied is returned.

Example

import pandas as pd
df = pd.read_csv('data.csv')
newdf = df.dropna( )
print(newdf.to_string( ))
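The effect of the how, thresh, and subset parameters is easier to see side by side. The following sketch uses a small hypothetical DataFrame; the comments state which row labels survive in each case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, np.nan, 4.0],
                   'B': [5.0, 6.0, np.nan, np.nan],
                   'C': [9.0, np.nan, np.nan, 12.0]})

any_rows = df.dropna()                # keeps row 0 only (no NaN at all)
all_rows = df.dropna(how='all')       # drops row 2 only (every value is NaN)
thresh_rows = df.dropna(thresh=2)     # keeps rows with at least 2 non-NaN values: 0 and 3
subset_rows = df.dropna(subset=['B']) # looks at column B only: keeps rows 0 and 1

print(any_rows, all_rows, thresh_rows, subset_rows, sep='\n\n')
```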

(8) mean( ): The mean( ) method returns a Series with the mean value of each column.

Syntax

dataframe.mean(axis, skipna, level, numeric_only)

Parameter      Value                       Description
axis           0, 1, 'index', 'columns'    Optional. Which axis to compute along, default 0.
skipna         True, False                 Optional, default True. Set to False if the result should NOT skip NULL values.
level          Number, level name          Optional, default None. Specifies which level (in a hierarchical multi index) to compute along. Available only in pandas versions before 2.0.
numeric_only   None, True, False           Optional. Specifies whether to include only numeric values. Default None in older versions, False since pandas 2.0.

Example 1

import pandas as pd
info = pd.DataFrame({"A": [8, 2, 7, 12, 6], "B": [26, 19, 7, 5, 9],
"C": [10, 11, 15, 4, 3], "D": [16, 24, 14, 22, 1]})
info.mean(axis = 0)

Example 2

import pandas as pd
info = pd.DataFrame({"A": [5, 2, 6, 4, None], "B": [12, 19, None, 8, 21],
"C": [15, 26, 11, None, 3], "D": [14, 17, 29, 16, 23]})
info.mean(axis = 1, skipna = True)
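The skipna parameter changes the result whenever a column contains NULL values. A minimal sketch with a smaller hypothetical DataFrame:

```python
import pandas as pd

info = pd.DataFrame({"A": [5, 2, None], "B": [12, 19, 8]})

with_skip = info.mean()                  # NaN in A is ignored: (5 + 2) / 2
without_skip = info.mean(skipna=False)   # NaN in A makes the mean of A NaN
row_means = info.mean(axis=1)            # one mean per row, NaN skipped

print(with_skip)
print(without_skip)
print(row_means)
```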

(9) sum( ): The sum( ) method adds all values in each column and returns the sum for each column.

Syntax

dataframe.sum(axis, skipna, level, numeric_only, min_count)

o The parameters axis, skipna, level and numeric_only behave the same as described for mean( ).

Parameter   Value    Description
min_count   Number   Optional, default 0. The minimum number of non-NULL values that must be present to perform the operation; with fewer values the result is NaN.

Example

import pandas as pd
info = pd.DataFrame({"A": [8, 2, 7, 12, 6], "B": [26, 19, 7, 5, 9],
"C": [10, 11, 15, 4, 3], "D": [16, 24, 14, 22, 1]})
info.sum(axis = 1)
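The min_count parameter matters mainly for columns that contain only NULL values. A short sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [np.nan, np.nan]})

default_sum = df.sum()             # min_count=0: an all-NaN column sums to 0.0
strict_sum = df.sum(min_count=1)   # at least 1 valid value required, so B is NaN

print(default_sum)
print(strict_sum)
```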
(10) describe( ): The describe( ) method is used to view basic statistical details such as percentiles,
mean, and std of a DataFrame or a Series of numeric values.

Syntax

dataframe.describe(percentiles, include, exclude)

Parameter     Value                      Description
percentiles   numbers between 0 and 1    Optional, a list of percentiles to include in the result; default is [.25, .50, .75].
include       None, 'all', datatypes     Optional, a list of the data types to allow in the result.
exclude       None, datatypes            Optional, a list of the data types to disallow in the result.

Example

import pandas as pd
data = [[10, 18, 11], [13, 15, 8], [9, 20, 3]]
df = pd.DataFrame(data)
print(df.describe( ))

0 1 2
count 3.000000 3.000000 3.000000
mean 10.666667 17.666667 7.333333
std 2.081666 2.516611 4.041452
min 9.000000 15.000000 3.000000
25% 9.500000 16.500000 5.500000
50% 10.000000 18.000000 8.000000
75% 11.500000 19.000000 9.500000
max 13.000000 20.000000 11.000000
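The percentiles and include parameters can be tried on a frame that mixes numeric and text columns. The data below is a hypothetical example; the sketch only illustrates the two parameters and is not a complete tour of describe( ).

```python
import pandas as pd

df = pd.DataFrame({'score': [10, 20, 30, 40, 50],
                   'grade': ['a', 'b', 'a', 'c', 'a']})

# By default only numeric columns are summarised; ask for the 10% and 90% percentiles.
numeric = df.describe(percentiles=[0.1, 0.9])

# include='all' adds non-numeric columns, reported with unique, top and freq.
everything = df.describe(include='all')

print(numeric)
print(everything)
```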

(11) corr( ): The DataFrame.corr( ) method finds the pairwise correlation of all the columns in the
DataFrame. Null values are excluded automatically. Older pandas versions also ignore
non-numeric columns; since pandas 2.0, pass numeric_only=True to exclude them.
Syntax

DataFrame.corr(method='pearson', min_periods=1)

Parameters

method :
pearson: standard correlation coefficient
kendall: Kendall Tau correlation coefficient
spearman: Spearman rank correlation

min_periods : Minimum number of observations required per pair of columns to have a valid result.
Currently only available for pearson and spearman correlation.

Example

import pandas as pd
df = {"Array_1": [30, 70, 100], "Array_2": [65.1, 49.50, 30.7] }
data = pd.DataFrame(df)
print(data.corr( ))

Output
Array_1 Array_2
Array_1 1.000000 -0.990773
Array_2 -0.990773 1.000000
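To make the Pearson coefficient in the output less of a black box, the sketch below recomputes it by hand from the deviations of each column around its mean and compares it with the value returned by corr( ). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Array_1": [30, 70, 100], "Array_2": [65.1, 49.50, 30.7]})

# Deviations of each column from its own mean.
x = data["Array_1"] - data["Array_1"].mean()
y = data["Array_2"] - data["Array_2"].mean()

# Pearson r = sum(x*y) / sqrt(sum(x^2) * sum(y^2))
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = data.corr().loc["Array_1", "Array_2"]

print(r_manual, r_pandas)
```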

(12) value_counts( ): The value_counts( ) function returns a Series containing counts of unique values.
The resulting object is in descending order, so that the first element is the most
frequently occurring element. NA values are excluded by default.

Syntax

series.value_counts(normalize=False, sort=True, ascending=False, dropna=True)

Parameters:

Name        Description                                                                            Type / Default Value      Required / Optional
normalize   If True, the returned object contains the relative frequencies of the unique values.  boolean, default False    Optional
sort        Sort by frequencies.                                                                   boolean, default True     Optional
ascending   Sort in ascending order.                                                               boolean, default False    Optional
dropna      Don't include counts of NaN.                                                           boolean, default True     Optional

Example
import numpy as np
import pandas as pd
index = pd.Index([2, 2, 5, 3, 4, np.nan])
index.value_counts( )

Output
2.0 2
4.0 1
3.0 1
5.0 1
dtype: int64
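The normalize and dropna parameters can be compared on a small Series. The data below is a hypothetical example.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a', np.nan])

counts = s.value_counts()                  # NaN is left out by default
shares = s.value_counts(normalize=True)    # relative frequencies among non-NaN values
with_nan = s.value_counts(dropna=False)    # NaN gets its own count

print(counts)
print(shares)
print(with_nan)
```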

(13) apply( ): The apply( ) method applies a function along one axis of the DataFrame. The default is
axis 0 (the index axis), which means the function is called once for each column.

Syntax

dataframe.apply(func, axis, raw, result_type)

Name          Description                                                                            Value                                    Required / Optional
func          A function to apply to the DataFrame.                                                                                           Required
axis          Which axis to apply the function to, default 0.                                        0, 1, 'index', 'columns'                 Optional
raw           Default False. Set to True if the row/column should be passed as an ndarray object.   True, False                              Optional
result_type   Default None. Specifies how the result will be returned.                               'expand', 'reduce', 'broadcast', None    Optional

Example 1 Returns the sum of each column (the default axis is 0)

import pandas as pd

def calc_sum(x):
    a = x.sum( )
    return a

data = { "x": [50, 40, 30], "y": [300, 1112, 42] }


df = pd.DataFrame(data)
x = df.apply(calc_sum)
print(x)
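The same data can be summed per row by passing axis=1. The sketch below places both directions next to each other; the lambda functions are just a compact stand-in for a named function like calc_sum.

```python
import pandas as pd

df = pd.DataFrame({"x": [50, 40, 30], "y": [300, 1112, 42]})

col_sums = df.apply(lambda col: col.sum())           # axis=0: one result per column
row_sums = df.apply(lambda row: row.sum(), axis=1)   # axis=1: one result per row

print(col_sums)
print(row_sums)
```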

Example 2
The following example passes a function to apply( ), which checks the value of each element in a Series and
returns "Low", "Normal" or "High" accordingly.
import pandas as pd
# read the csv; squeeze("columns") turns a single-column DataFrame into a Series
# (the squeeze=True argument of read_csv was removed in pandas 2.0)
s = pd.read_csv("stock.csv").squeeze("columns")
# define a function to classify a price
def fun(num):
    if num < 200:
        return "Low"
    elif num >= 200 and num < 400:
        return "Normal"
    else:
        return "High"
# pass the function to apply( ) and store the returned Series in new
new = s.apply(fun)
# print the first 3 elements
print(new.head(3))
# print elements somewhere near the middle of the Series
print(new[1400], new[1500], new[1600])
# print the last 3 elements
print(new.tail(3))
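Since stock.csv is not included with these notes, the same idea can be tried on a small in-memory Series. The prices and the function name label_price below are hypothetical, chosen to hit each branch of the classification.

```python
import pandas as pd

def label_price(num):
    """Classify a price as Low (< 200), Normal (200 to 399) or High (400 and above)."""
    if num < 200:
        return "Low"
    elif num < 400:
        return "Normal"
    else:
        return "High"

prices = pd.Series([150, 200, 399, 400, 875])

# Series.apply calls the function once per element and returns a new Series.
labels = prices.apply(label_price)
print(labels)
```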

(14) loc( ) and iloc( ) both functions already explained in row selection part of DataFrame.
