Python UnitIV
Pandas
Pandas is an open-source library that provides high-performance data manipulation in Python.
The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.
It is used for data analysis in Python and was developed by Wes McKinney in 2008.
Data analysis requires a lot of processing, such as restructuring, cleaning, or merging.
Pandas provides two main data structures for manipulating data:
Series
DataFrame
Series
Series is a one-dimensional array-like structure with homogeneous data. The row labels of a Series are called
the index. We can easily convert a list, tuple, or dictionary into a Series using the Series( ) constructor.
A Series cannot contain multiple columns.
For example, the following series is a collection of integers 10, 23, 56, …
10 23 56 17 52 61 73 90 26 72
A Series is created with the constructor pandas.Series(data, index, dtype, copy). Its parameters are:
1 data
data takes various forms like ndarray, list, constants
2 index
Index values must be hashable and the same length as data; duplicate labels are allowed.
Default np.arange(n) if no index is passed.
3 dtype
dtype is for data type. If None, the data type will be inferred.
4 copy
Copy data. Default False.
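The four parameters above can be tried out together. The following is only a small sketch; the values and labels are chosen for illustration and are not taken from any particular data set.

```python
import numpy as np
import pandas as pd

values = np.array([10, 23, 56])

# data + index + dtype: the labels replace the default 0..n-1 index,
# and dtype forces a conversion instead of letting pandas infer int64.
s = pd.Series(values, index=["x", "y", "z"], dtype="float64")
print(s)

# copy=True gives the Series its own copy of the data, so a later change
# to the source array is not visible through the Series.
s_copy = pd.Series(values, copy=True)
values[0] = 99
print(s_copy.iloc[0])
```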
If data is an ndarray, then the index passed must be of the same length. If no index is passed, then by default
the index will be range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array)-1]. For example:
import pandas as pd
import numpy as np
data=np.array(['a','b','c','d'])
s =pd.Series(data)
print(s)
Its output is as follows −
0 a
1 b
2 c
3 d
dtype: object
We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1, i.e., 0 to 3.
Create a Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Labels
If nothing else is specified, the values are labeled with their index number. First value has index 0, second
value has index 1 etc.
Create Labels
Example1:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
Its output is as follows −
x 1
y 7
z 2
Example2:
import pandas as pd
import numpy as np
data=np.array(['a','b','c','d'])
s =pd.Series(data,index=[100,101,102,103])
print(s)
Its output is as follows −
100 a
101 b
102 c
103 d
A dictionary can be passed as input. If no index is specified, the dictionary keys are used to construct the
index (in insertion order in current pandas versions; older versions sorted the keys). If an index is passed,
the values in data corresponding to the labels in the index will be pulled out.
Example1:
import pandas as pd
import numpy as np
data={'a':0.,'b':1.,'c':2.}
s =pd.Series(data)
print(s)
Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64
Example2:
import pandas as pd
import numpy as np
data={'a':0.,'b':1.,'c':2.}
s =pd.Series(data,index=['b','c','d','a'])
print(s)
Its output is as follows −
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Note − Index order is persisted and the missing element is filled with NaN (Not a Number).
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
import pandas as pd
import numpy as np
s =pd.Series(5, index=[0,1,2,3])
print(s)
Its output is as follows −
0 5
1 5
2 5
3 5
dtype: int64
Data in a Series can be accessed in a similar way to data in an ndarray. The first element can be retrieved
with its integer position; positions start at zero.
Example1:
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])
#retrieve the first element by position (plain s[0] is deprecated for a label-indexed Series)
print(s.iloc[0])
Its output is as follows −
1
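Example2:
Retrieve the first three elements in the Series.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])
#retrieve the first three elements by position
print(s.iloc[:3])
Example3:
Retrieve the last three elements.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])
#retrieve the last three elements by position
print(s.iloc[-3:])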
A Series is like a fixed-size dictionary in that we can get and set values by index label.
Example1:
Retrieve a single element using index label value.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])
#retrieve a single element by label
print(s['a'])
Example2:
Retrieve multiple elements using a list of index label values.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])
#retrieve multiple elements by passing a list of labels
print(s[['a','c','d']])
Example3:
If a label is not contained, an exception is raised.
import pandas as pd
s =pd.Series([1,2,3,4,5],index =['a','b','c','d','e'])
#'f' is not in the index, so this line raises a KeyError
print(s['f'])
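The dictionary-like behaviour described above covers setting values and testing for labels as well as reading them. A minimal sketch, reusing the same Series as the examples (the label 'z' is just an example of a missing label):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Get and set by label, as with a dictionary.
s["b"] = 20
print(s["b"])

# Membership tests look at the index labels, not at the values.
print("c" in s, "z" in s)

# .get() returns a default instead of raising KeyError for a missing label.
missing = s.get("z", default=None)
print(missing)

# Plain bracket access raises KeyError for a missing label.
try:
    s["z"]
except KeyError as exc:
    print("KeyError:", exc)
```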
DataFrame
Pandas DataFrame is a widely used two-dimensional data structure with labeled axes (rows and columns).
It is a standard way to store data that has two different indexes, i.e., a row index and a column index.
It has the following properties:
o The columns can be of heterogeneous types like int, bool, and so on.
o It can be seen as a dictionary of Series structures where both the rows and columns are indexed. The
column labels are denoted as "columns" and the row labels as "index".
The DataFrame constructor takes the following parameters:
data: It takes different forms like ndarray, Series, map, constants, lists, array.
index: The row labels. The default np.arange(n) index is used if no index is passed.
columns: The column labels. The default np.arange(n) is used if no column labels are passed.
copy: Specifies whether the input data is copied. The default is False.
Create DataFrame
A DataFrame can be created from the following inputs:
o Lists
o dict
o Series
o Numpy ndarrays
o Another DataFrame
import pandas as pd
df = pd.DataFrame( )
print (df)
Output
Empty DataFrame
Columns: []
Index: []
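Beyond the empty case, each of the listed inputs can be passed to the constructor. The sketch below shows one possible call per input type; all names and values (from_lists, "Alex", and so on) are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd

# From a list of lists, with explicit column labels.
from_lists = pd.DataFrame([["Alex", 10], ["Bob", 12]], columns=["name", "age"])

# From a dict of equal-length lists; the keys become the column labels.
from_dict = pd.DataFrame({"name": ["Alex", "Bob"], "age": [10, 12]})

# From a dict of Series; rows are aligned on the union of the indexes.
from_series = pd.DataFrame({
    "one": pd.Series([1, 2], index=["a", "b"]),
    "two": pd.Series([1, 2, 3], index=["a", "b", "c"]),
})

# From a 2-D NumPy array, with row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          index=["r0", "r1", "r2"], columns=["x", "y"])

# From another DataFrame (copy=True makes the new frame independent).
from_frame = pd.DataFrame(from_dict, copy=True)

for frame in (from_lists, from_dict, from_series, from_array, from_frame):
    print(frame, end="\n\n")
```

Note that the Series-based frame gets a NaN in column 'one' for row 'c', because that label exists only in the second Series.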
Column Selection
Any column from the DataFrame can be selected through the following code:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print (d1 ['one'])
Column Addition
A new column can be added to an existing DataFrame through the following code:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
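df = pd.DataFrame(info)
# Add a new column by assigning a Series (the values are chosen for illustration)
df['three'] = pd.Series([20, 40, 60], index=['a', 'b', 'c'])
print (df)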
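Column Deletion
A del statement or the pop( ) function is used to delete any column from an existing DataFrame.
import pandas as pd
info = {'one' : pd.Series([1, 2], index= ['a', 'b']),
'two' : pd.Series([1, 2, 3], index=['a', 'b', 'c'])}
df = pd.DataFrame(info)
print ("The DataFrame:")
print (df)
# Delete the column 'one' with del
del df['one']
# Delete the column 'two' with pop( ), which also returns the removed column
df.pop('two')
print (df)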
Row Selection:
(a) Selection by label:
The loc[ ] indexer is used to select rows in a DataFrame. A row can be selected by passing the
row label to loc. Note that loc uses square brackets, not parentheses.
Syntax
dataframe.loc[label name]
Example
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df.loc['b'])
Output
one 2.0
two 2.0
Name: b, dtype: float64
(b) Selection by integer location:
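The rows can also be selected by passing the integer location to the iloc[ ] indexer.
Syntax
dataframe.iloc[location number]
Example
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
# Select the row at integer position 3 (the row labeled 'd')
print (df.iloc[3])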
Addition of rows:
We can add new rows to a DataFrame with the pd.concat function, which places the new rows at the end.
(The older DataFrame.append method was deprecated in pandas 1.4 and removed in pandas 2.0.)
import pandas as pd
d = pd.DataFrame([[7, 8], [9, 10]], columns = ['x','y'])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns = ['x','y'])
d = pd.concat([d, d2])
print (d)
Output
x y
0 7 8
1 9 10
0 11 12
1 13 14
Deletion of rows:
We can delete or drop any row from a DataFrame using its index label. If the label is duplicated,
all rows with that label are deleted.
import pandas as pd
a_info = pd.DataFrame([[4, 5], [6, 7]], columns = ['x','y'])
b_info = pd.DataFrame([[8, 9], [10, 11]], columns = ['x','y'])
a_info = pd.concat([a_info, b_info])
# Drop the rows with label 0 (two rows, because the label is duplicated)
a_info = a_info.drop(0)
print (a_info)
CSV Files
CSV stands for "comma separated values", a simple file format that uses a specific structure to arrange
tabular data. It stores tabular data, such as a spreadsheet or database table, in plain text and is a common
format for data interchange. A csv file can be opened in a spreadsheet program such as Excel, where each
line becomes a row and each comma-separated value a column.
Reading a csv file into a pandas DataFrame is quick and straightforward. We don't need to write many
lines of code to open, parse, and read the csv file; pandas does this in one call and stores the data in a DataFrame.
The read_csv function of the pandas library is used to read the content of a CSV file into the Python
environment as a pandas DataFrame.
Syntax
pandas.read_csv(csv_file_name)
Example
import pandas as pd
df = pd.read_csv('hrdata.csv')
print(df)
Reading Specific Rows
After the file has been loaded with read_csv, specific rows of a given column can be selected from the
resulting DataFrame by slicing.
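A sketch of row slicing follows. Since hrdata.csv is not available here, the example reads a small in-memory stand-in; the rows and the dept column are made up for illustration, while the salary and name columns follow the names used in these notes.

```python
import io
import pandas as pd

# Stand-in for hrdata.csv; the rows below are invented for illustration.
csv_text = io.StringIO(
    "name,salary,dept\n"
    "Asha,52000,IT\n"
    "Ravi,48000,HR\n"
    "Meena,61000,IT\n"
    "John,45000,Ops\n"
)
df = pd.read_csv(csv_text)

# Rows at positions 0 and 1 of the 'salary' column.
first_two = df[0:2]["salary"]
print(first_two)

# The same selection written with .iloc for the rows.
same = df.iloc[0:2]["salary"]
print(same)
```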
Reading Specific Columns
After the file has been loaded with read_csv, specific columns can be selected. We use the
multi-axes indexing method called .loc[ ] for this purpose.
Example showing the columns salary and name for all the rows.
import pandas as pd
df = pd.read_csv('hrdata.csv')
print(df.loc[ : , ['salary', 'name']])
Reading Specific Columns and Rows
After the file has been loaded with read_csv, specific columns and specific rows can be selected
together. We use the multi-axes indexing method called .loc[ ] for this purpose.
Example showing the columns salary and name for some of the rows.
import pandas as pd
df = pd.read_csv('hrdata.csv')
print(df.loc[[ 1, 3, 5 ] , ['salary', 'name']])
Functions
(1) head( ): This method returns the top n rows (5 by default) of a DataFrame or Series.
Syntax
dataframe.head(n)
Example 1
import pandas as pd
info = pd.DataFrame({'language':['C', 'C++', 'Python', 'Java','PHP']})
info.head(3)
Example 2
import pandas as pd
data = pd.read_csv("aa.csv")
data_top = data.head(2)
data_top
(2) tail( ): This method returns the last n rows (5 by default) of a DataFrame or Series.
Syntax
dataframe.tail(n)
Example 1
import pandas as pd
info = pd.DataFrame({'language':['C', 'C++', 'Python', 'Java','PHP']})
info.tail(3)
Example 2
import pandas as pd
data = pd.read_csv("aa.csv")
data_top = data.tail(2)
data_top
(3) info( ): It is an important and widely used method of pandas. This method prints a summary of the
DataFrame, including the index type, the column names and dtypes, the non-null counts, and the
memory usage. It gives a quick overview of the dataset.
Syntax
dataframe.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)
Parameters -
o verbose - Whether to print the full summary of the dataset.
o buf - A writable buffer to send the output to; defaults to sys.stdout.
o max_cols - The number of columns above which the truncated summary is printed instead of the
full summary.
o memory_usage - Specifies whether the total memory usage of the DataFrame elements
(including index) should be displayed.
o show_counts - Whether to show the non-null counts.
Example
import pandas as pd
data = pd.read_csv("aa.csv")
# info( ) prints the summary itself and returns None, so print( ) is not needed
data.info( )
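The buf parameter can be illustrated without a csv file. In this sketch the DataFrame is built inline with made-up values, and the summary is captured in a string buffer instead of being printed directly.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", None], "score": [1.5, None, 3.0]})

# Default behaviour: the summary is written to sys.stdout.
df.info()

# Passing a writable buffer captures the summary as text instead.
buffer = io.StringIO()
df.info(buf=buffer, memory_usage=False)
summary = buffer.getvalue()
print(summary)
```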
(4) shape: The shape property returns a tuple containing the shape of the DataFrame. The shape is the
number of rows and columns of the DataFrame.
Syntax
dataframe.shape
Example
import pandas as pd
df=pd.DataFrame({'col1':[1,2],'col2':[3,4]})
print(df.shape)
Output
(2, 2)
(5) columns: The columns property returns the label of each column in the DataFrame.
Syntax
dataframe.columns
Example
import pandas as pd
df = pd.read_csv('data.csv')
print (df.columns)
(6) isnull( ): The isnull() method returns a DataFrame object where all the values are replaced with
a Boolean value True for NULL values, and otherwise False.
Syntax
dataframe.isnull()
Example
import pandas as pd
df = pd.read_csv('data.csv')
newdf = df.isnull( )
print(newdf.to_string( ))
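A common follow-up is to count the missing values per column by summing the Boolean result. The sketch below uses a small inline DataFrame with made-up values rather than data.csv.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

mask = df.isnull()        # same shape as df, True where a value is missing
per_column = mask.sum()   # True counts as 1, so this counts missing values
print(mask)
print(per_column)
```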
(7) dropna( ): The dropna( ) method removes the rows that contain NULL values. This method
returns a new DataFrame object unless the inplace parameter is set to True; in that
case the dropna( ) method removes the rows from the original DataFrame instead.
Syntax
dataframe.dropna(axis, how, thresh, subset, inplace)
Example
import pandas as pd
df = pd.read_csv('data.csv')
newdf = df.dropna( )
print(newdf.to_string( ))
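The difference between the default behaviour and inplace=True can be seen in a short sketch. The DataFrame is built inline with illustrative values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Default: a new DataFrame is returned and df is left unchanged.
cleaned = df.dropna()
print(len(df), len(cleaned))

# inplace=True modifies df itself and returns None.
result = df.dropna(inplace=True)
print(result, len(df))
```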
(8) mean( ): The mean( ) method returns a Series with the mean value of each column.
Syntax
dataframe.mean(axis, skipna, numeric_only)
Example 1
import pandas as pd
info = pd.DataFrame({"A": [8, 2, 7, 12, 6], "B": [26, 19, 7, 5, 9],
"C": [10, 11, 15, 4, 3], "D": [16, 24, 14, 22, 1]})
info.mean(axis = 0)
Example 2
import pandas as pd
info = pd.DataFrame({"A": [5, 2, 6, 4, None], "B": [12, 19, None, 8, 21],
"C": [15, 26, 11, None, 3], "D": [14, 17, 29, 16, 23]})
info.mean(axis = 1, skipna = True)
(9) sum( ): The sum( ) method adds all values in each column and returns the sum for each column.
Syntax
dataframe.sum(axis, skipna, numeric_only, min_count)
The parameters axis, skipna and numeric_only behave the same as in mean( ). (The level parameter of
older versions was removed in pandas 2.0.)
min_count: Optional. Specifies the minimum number of values that need to be present to perform the
action. Default 0.
Example
import pandas as pd
info = pd.DataFrame({"A": [8, 2, 7, 12, 6], "B": [26, 19, 7, 5, 9],
"C": [10, 11, 15, 4, 3], "D": [16, 24, 14, 22, 1]})
info.sum(axis = 1)
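The effect of min_count only shows up when values are missing. A small sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": [2.0, 3.0, np.nan]})

# Default min_count=0: a row with no valid values sums to 0.
row_sums = df.sum(axis=1)

# min_count=2: a row needs at least two non-NA values, otherwise the result is NaN.
strict_sums = df.sum(axis=1, min_count=2)

print(row_sums)
print(strict_sums)
```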
(10) describe( ): Pandas describe( ) is used to view basic statistical details such as percentiles, mean,
std, etc. of a DataFrame or a Series of numeric values.
Syntax
dataframe.describe(percentiles, include, exclude)
Example
import pandas as pd
data = [[10, 18, 11], [13, 15, 8], [9, 20, 3]]
df = pd.DataFrame(data)
print(df.describe( ))
Output
0 1 2
count 3.000000 3.000000 3.000000
mean 10.666667 17.666667 7.333333
std 2.081666 2.516611 4.041452
min 9.000000 15.000000 3.000000
25% 9.500000 16.500000 5.500000
50% 10.000000 18.000000 8.000000
75% 11.500000 19.000000 9.500000
max 13.000000 20.000000 11.000000
(11) corr( ): The DataFrame.corr( ) method finds the pairwise correlation of all the columns in the
DataFrame. Null values are automatically excluded. Versions of pandas before 2.0 also
ignored non-numeric columns; from 2.0 onwards, pass numeric_only=True to exclude them.
Syntax
dataframe.corr(method, min_periods, numeric_only)
Parameters
method :
o pearson: standard correlation coefficient (the default)
o kendall: Kendall Tau correlation coefficient
o spearman: Spearman rank correlation
Example
import pandas as pd
df = {"Array_1": [30, 70, 100], "Array_2": [65.1, 49.50, 30.7] }
data = pd.DataFrame(df)
print(data.corr( ))
Output
Array_1 Array_2
Array_1 1.000000 -0.990773
Array_2 -0.990773 1.000000
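The method parameter can be compared on the same data. This sketch contrasts the default Pearson coefficient with the rank-based Spearman coefficient; because Array_2 falls every time Array_1 rises, the rank correlation is exactly -1.

```python
import pandas as pd

data = pd.DataFrame({"Array_1": [30, 70, 100], "Array_2": [65.1, 49.50, 30.7]})

pearson = data.corr(method="pearson")     # linear correlation (default)
spearman = data.corr(method="spearman")   # correlation of the ranks

print(pearson)
print(spearman)
```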
(12) value_counts( ): Pandas value_counts( ) function returns a Series containing counts of unique values.
The resulting object will be in descending order so that the first element is the most
frequently-occurring element. Excludes NA values by default.
Syntax
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
Parameters:
Name        Description                     Type / Default Value      Required / Optional
sort        Sort by frequencies.            boolean, default True     Optional
ascending   Sort in ascending order.        boolean, default False    Optional
dropna      Don't include counts of NaN.    boolean, default True     Optional
Example
import numpy as np
import pandas as pd
index = pd.Index([2, 2, 5, 3, 4, np.nan])
index.value_counts( )
Output
2.0 2
4.0 1
3.0 1
5.0 1
dtype: int64
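The parameters in the table can be tried on a small Series. The values below are chosen for illustration so that no two values share the same count.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 2, 5, 3, 5, 5, np.nan])

default_counts = s.value_counts()                   # most frequent first, NaN dropped
ascending_counts = s.value_counts(ascending=True)   # least frequent first
with_nan = s.value_counts(dropna=False)             # NaN gets its own row

print(default_counts)
print(ascending_counts)
print(with_nan)
```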
(13) apply( ): The apply( ) method lets you apply a function along one of the axes of the DataFrame.
With the default axis=0 (the index axis) the function is applied to each column; with
axis=1 it is applied to each row.
Syntax
dataframe.apply(func, axis, raw, result_type, args)
Name   Description                                       Value                       Required / Optional
axis   Which axis to apply the function to. Default 0.   0, 1, 'index', 'columns'    Optional
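Example 1
import pandas as pd
def calc_sum(x):
    return x.sum( )
# Sample data (values chosen for illustration)
df = pd.DataFrame({"x": [50, 40, 30], "y": [300, 1112, 42]})
# Apply calc_sum to each column
print(df.apply(calc_sum))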
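The effect of the axis parameter can be sketched by applying the same function both ways. The DataFrame values here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"x": [50, 40, 30], "y": [300, 1112, 42]})

def calc_sum(values):
    # values is one column (axis=0) or one row (axis=1), passed as a Series
    return values.sum()

per_column = df.apply(calc_sum)            # axis=0: called once per column
per_row = df.apply(calc_sum, axis=1)       # axis=1: called once per row
same_rows = df.apply(calc_sum, axis="columns")  # 'columns' is an alias for 1

print(per_column)
print(per_row)
```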
Example 2
The following example passes a function that checks the value of each element in a Series and
returns Low, Normal or High accordingly.
import pandas as pd
#reading csv; squeeze("columns") turns a one-column DataFrame into a Series
#(the squeeze=True argument of read_csv was removed in pandas 2.0)
s = pd.read_csv("stock.csv").squeeze("columns")
#defining function to check price
def fun(num):
    if num < 200:
        return "Low"
    elif num >= 200 and num < 400:
        return "Normal"
    else:
        return "High"
#passing function to apply and storing returned series in new
new = s.apply(fun)
(14) loc[ ] and iloc[ ]: Both indexers have already been explained in the row selection part of DataFrame.