UNIT - 3 Pandas
Pandas
What is Pandas?
Pandas is a Python library used for working with data sets.
Pandas is used for data analysis in Python and was developed by Wes McKinney in 2008.
Pandas is an open-source library that provides high-performance tools for analyzing, cleaning,
exploring, and manipulating data, and for preparing data for machine learning tasks in Python.
The name Pandas is derived from "panel data", an econometrics term for multidimensional
data sets.
Pandas allows us to analyze big data and draw conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Easily handles missing data
It uses Series for one-dimensional data structure and DataFrame for multi-dimensional data
structure.
It provides an efficient way to slice the data
It provides a flexible way to merge, concatenate or reshape the data
Pandas as pd
Pandas is usually imported under the pd alias.
alias: In Python, an alias is an alternate name for referring to the same thing.
Create an alias with the as keyword while importing:
import pandas as pd
Now the Pandas package can be referred to as pd instead of pandas.
Example
import pandas as pd
print(pd.__version__)
Python Pandas Data Structure
Pandas provides two data structures for processing data, Series and DataFrame, which
are discussed below:
1) Pandas Series
A Pandas Series is like a column in a table.
It is defined as a one-dimensional array that is capable of storing various data types.
The row labels of series are called the index.
We can easily convert a list, tuple, or dictionary into a Series using the pd.Series()
constructor. Its main parameter is the data to store; the index, dtype, and name parameters are optional.
A Series cannot contain multiple columns.
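The conversions mentioned above can be sketched as follows. This is a minimal illustration; the variable names and values are made up for the example.

```python
import pandas as pd

# A list and a tuple both get the default integer index 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])
from_tuple = pd.Series((10, 20, 30))

# The keys of a dictionary become the index labels.
from_dict = pd.Series({"x": 10, "y": 20, "z": 30})

print(from_list)
print(from_tuple)
print(from_dict)
```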
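Syntax:
pandas.Series(data, index, dtype, copy)
Example (creating an empty Series):
import pandas as pd
s = pd.Series()
Output (on older pandas versions; recent versions create an empty object-dtype Series without a warning):
DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future
version. Specify a dtype explicitly to silence this warning.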
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output:
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
Example 2:
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info, index = [100, 101, 102, 103, 104, 105])
print(a)
Output:
100 P
101 a
102 n
103 d
104 a
105 s
dtype: object
Example:
Output:
0 5
1 5
2 5
3 5
dtype: int64
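The code for the example above is not shown in the text. A minimal sketch that produces a Series like this one, assuming it was built from a single scalar value repeated over four index labels:

```python
import pandas as pd

# Assumption: the scalar 5 is broadcast to every label in the index.
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
```

When the data is a scalar, an index has to be supplied so that Pandas knows how many times to repeat the value.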
Example 1:
Retrieve the first element. Counting starts from zero, so the first element is stored at
position zero, and so on.
import pandas as pd
s = pd.Series([1,2,3,4,5])
# retrieve the first element
print(s[0])
Output:
1
Example 2 :
Retrieve the first three elements in the Series. If a : is placed before an index, all items up to
(but not including) that index are extracted; if it is placed after an index, all items from that index
onwards are extracted. If two indexes (with : between them) are used, the items between the two
indexes are extracted, excluding the stop index.
import pandas as pd
s = pd.Series([1,2,3,4,5])
# retrieve the first three elements
print(s[:3])
Output:
0 1
1 2
2 3
dtype: int64
Example 3:
Retrieve the last three elements.
import pandas as pd
s = pd.Series([1,2,3,4,5])
# retrieve the last three elements
print(s[-3:])
Output:
2 3
3 4
4 5
dtype: int64
A Series is like a fixed-size dict in that you can get and set values by index label.
Example 1:
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
# retrieve a single element by its index label
print(s['a'])
Output:
1
Example 2
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
# retrieve multiple elements by a list of index labels
print(s[['a', 'b', 'c']])
Output:
a 1
b 2
c 3
dtype: int64
Example 3
If a label is not contained, an exception is raised.
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
# try to retrieve a label that does not exist
print(s['f'])
Output:
KeyError: 'f'
2) Pandas DataFrame:
Pandas DataFrame is a widely used two-dimensional data structure with labeled axes
(rows and columns).
DataFrame is defined as a standard way to store data that has two different indexes, i.e., row
index and column index.
It consists of the following properties:
o The columns can be of heterogeneous types like int, bool, and so on.
o It can be seen as a dictionary of Series in which both the rows and the columns are
indexed. The column labels are called "columns" and the row labels are called "index".
Syntax:
pandas.DataFrame(data, index, columns, dtype, copy)
data: It can take different forms like ndarray, Series, map (dict), constants, lists, array.
index: The default np.arange(n) index is used for the row labels if no index is passed.
columns: The default np.arange(n) is used for the column labels if no column labels
are passed.
Create a DataFrame
A DataFrame can be created from:
o dict
o Lists
o NumPy ndarrays
o Series
Output:
Empty DataFrame
Columns: []
Index: []
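The code that produces this output is not shown in the text. A minimal sketch, assuming the DataFrame constructor was called without any data:

```python
import pandas as pd

# Calling the constructor with no arguments gives a DataFrame with no rows and no columns.
df = pd.DataFrame()
print(df)
```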
Example 1:
Output:
0 CIVIL
1 EEE
2 MECH
3 ECE
4 CSE
5 AIDS
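The code for this example is missing. A minimal sketch that gives a single-column DataFrame like the one above, assuming the data was a plain Python list of names:

```python
import pandas as pd

# Assumption: the input is a flat list; each item becomes one row.
names = ["CIVIL", "EEE", "MECH", "ECE", "CSE", "AIDS"]
df = pd.DataFrame(names)
print(df)
```

Because no column names are given, the single column is labeled 0.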
Example 2:
Output:
CODE NAME
0 101 CIVIL
1 201 EEE
2 301 MECH
3 401 ECE
4 501 CSE
5 3001 AIDS
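The code for this example is missing. A minimal sketch, assuming the data was a list of [code, name] pairs and the column names were passed through the columns parameter:

```python
import pandas as pd

# Assumption: each inner list is one row; columns supplies the column labels.
data = [[101, "CIVIL"], [201, "EEE"], [301, "MECH"],
        [401, "ECE"], [501, "CSE"], [3001, "AIDS"]]
df = pd.DataFrame(data, columns=["CODE", "NAME"])
print(df)
```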
Example 3:
Output:
CODE NAME
0 101.0 CIVIL
1 201.0 EEE
2 301.0 MECH
3 401.0 ECE
4 501.0 CSE
5 3001.0 AIDS
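The code for this example is missing as well. The output suggests the CODE column was converted to floating point. Passing dtype=float to the constructor may raise an error on recent pandas versions when another column holds strings, so this sketch converts only the CODE column with astype(); that choice is an assumption, not necessarily the original code.

```python
import pandas as pd

data = [[101, "CIVIL"], [201, "EEE"], [301, "MECH"],
        [401, "ECE"], [501, "CSE"], [3001, "AIDS"]]
df = pd.DataFrame(data, columns=["CODE", "NAME"])

# Convert only the numeric column to float; NAME keeps its string values.
df = df.astype({"CODE": float})
print(df)
```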
All the ndarrays must be of the same length. If an index is passed, the length of the index
must equal the length of the arrays.
If no index is passed, the index defaults to range(n), where n is the array length.
Example 1:
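import pandas as pd
# dictionary of equal-length lists; the keys become the column names
x = {'CODE': [101, 201, 301, 401, 501, 3001],
     'NAME': ['CIVIL', 'EEE', 'MECH', 'ECE', 'CSE', 'AIDS']}
df = pd.DataFrame(x)
print(df)
Output:
CODE NAME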
0 101 CIVIL
1 201 EEE
2 301 MECH
3 401 ECE
4 501 CSE
5 3001 AIDS
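A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are
taken as column names by default, and keys missing from a dictionary are filled with NaN.
Example 1:
import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['row1', 'row2'])
print(df)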
Output:
a b c
row1 1 2 NaN
row2 5 10 20.0
Column Selection:
We can select any column from the DataFrame. Here is the code that demonstrates how to select a
column from the DataFrame.
Example:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print(d1['one'])
Output:
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
f 6.0
g NaN
h NaN
Name: one, dtype: float64
Column Addition
We can add a new column to an existing DataFrame by assigning to a new column name. The
example demonstrates this:
Example:
Output:
a 1.0 1 20.0
b 2.0 2 40.0
c 3.0 3 60.0
d 4.0 4 NaN
e 5.0 5 NaN
f NaN 6 NaN
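The code for this example is not shown. A minimal sketch that yields a table like the one above, assuming the columns were named one, two, and three (the names are hypothetical, since the output header is missing):

```python
import pandas as pd

info = {
    "one": pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"]),
    "two": pd.Series([1, 2, 3, 4, 5, 6], index=["a", "b", "c", "d", "e", "f"]),
}
df = pd.DataFrame(info)

# Assigning a Series to a new column name adds the column.
# Row labels that the new Series does not contain are filled with NaN.
df["three"] = pd.Series([20, 40, 60], index=["a", "b", "c"])
print(df)
```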
Column Deletion:
We can delete any column from an existing DataFrame. The example demonstrates how a column
can be deleted:
Example:
Output:
The DataFrame:
one two
a 1.0 1
b 2.0 2
c NaN 3
Delete the first column:
two
a 1
b 2
c 3
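The code for this example is not shown. A minimal sketch that reproduces the two printed stages, assuming the column was removed with the del statement:

```python
import pandas as pd

info = {
    "one": pd.Series([1, 2], index=["a", "b"]),
    "two": pd.Series([1, 2, 3], index=["a", "b", "c"]),
}
df = pd.DataFrame(info)
print("The DataFrame:")
print(df)

# del removes the column in place.
del df["one"]
print("Delete the first column:")
print(df)
```

df.pop("one") and df.drop(columns="one") are alternatives; pop returns the removed column, and drop returns a new DataFrame by default.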
We can select, add, or delete any row at any time. We look at row selection first.
A row can be selected in several ways, as follows:
Selection by Label:
We can select any row by passing the row label to the loc indexer.
Example:
one 2.0
two 2.0
Name: b, dtype: float64
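The code for this example is not shown. A minimal sketch that selects the row labeled b, assuming the same two-column data that the other row examples in this section use:

```python
import pandas as pd

info = {
    "one": pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"]),
    "two": pd.Series([1, 2, 3, 4, 5, 6], index=["a", "b", "c", "d", "e", "f"]),
}
df = pd.DataFrame(info)

# loc uses square brackets and looks the row up by its label.
row_b = df.loc["b"]
print(row_b)
```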
Selection by integer location:
The rows can also be selected by passing the integer location to the iloc indexer.
Example:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df.iloc[3])
Output:
one 4.0
two 4.0
Name: d, dtype: float64
Slice Rows
Example:
one two
c 3.0 3
d 4.0 4
e 5.0 5
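The code for the slicing example is not shown. A minimal sketch that selects rows c through e, assuming the same data as the iloc example and a positional slice:

```python
import pandas as pd

info = {
    "one": pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"]),
    "two": pd.Series([1, 2, 3, 4, 5, 6], index=["a", "b", "c", "d", "e", "f"]),
}
df = pd.DataFrame(info)

# Positions 2, 3 and 4 are selected; the stop position 5 is excluded.
sliced = df.iloc[2:5]
print(sliced)
```

The shorter form df[2:5] also slices rows by position; iloc is used here because it states the intent explicitly.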
Addition of rows:
We can add new rows to a DataFrame with pd.concat(), which places the new rows at the end.
The older DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0.
Example:
# importing the pandas library
import pandas as pd
d = pd.DataFrame([[7, 8], [9, 10]], columns = ['x','y'])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns = ['x','y'])
# concatenate the two DataFrames row-wise
d = pd.concat([d, d2])
print (d)
Output:
x y
0 7 8
1 9 10
0 11 12
1 13 14
Deletion of rows:
We can delete or drop rows from a DataFrame using the index label. If the label is
duplicated, multiple rows will be deleted.
Example:
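The code for this example is not shown in the text. A minimal sketch, assuming rows are removed with drop() from a DataFrame whose index contains the label 0 twice:

```python
import pandas as pd

d = pd.DataFrame([[7, 8], [9, 10]], columns=["x", "y"])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns=["x", "y"])

# Stacking the two frames keeps their original labels, so the index is 0, 1, 0, 1.
combined = pd.concat([d, d2])

# Dropping label 0 removes both rows that carry it.
result = combined.drop(0)
print(result)
```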
DataFrame Functions
Commonly used DataFrame functions are as follows:
Function                       Description
DataFrame.append()             Add the rows of another DataFrame to the end of the given DataFrame. Removed in pandas 2.0; use pandas.concat() instead.
DataFrame.apply()              Apply a function along an axis of the DataFrame (to each column or each row).
DataFrame.assign()             Add new columns to a DataFrame, returning a new DataFrame.
DataFrame.astype()             Cast the Pandas object to a specified dtype.
pandas.concat()                Concatenate Pandas objects along an axis. This is a top-level function, not a DataFrame method.
DataFrame.count()              Count the number of non-NA cells for each column or row.
DataFrame.describe()           Calculate statistics such as percentiles, mean, and std of the numerical values of the Series or DataFrame.
DataFrame.drop_duplicates()    Remove duplicate rows from the DataFrame.
DataFrame.groupby()            Split the data into groups.
DataFrame.head()               Return the first n rows of the object based on position.
DataFrame.hist()               Plot a histogram of each numerical column, dividing the values into "bins".
DataFrame.iterrows()           Iterate over the rows as (index, Series) pairs.
DataFrame.mean()               Return the mean of the values for the requested axis.
DataFrame.melt()               Unpivot the DataFrame from a wide format to a long format.
DataFrame.merge()              Merge two datasets together into one.
DataFrame.pivot_table()        Aggregate data with calculations such as sum, count, average, max, and min.
DataFrame.query()              Filter the DataFrame with a Boolean expression.
DataFrame.sample()             Select rows or columns from the DataFrame randomly.
DataFrame.shift()              Shift the values by a number of periods, for example to compare each row with the previous row.
DataFrame.sort_values()        Sort the DataFrame by the values of one or more columns. The old DataFrame.sort() method no longer exists.
DataFrame.sum()                Return the sum of the values for the requested axis.
DataFrame.to_excel()           Export the DataFrame to an Excel file.
DataFrame.transpose()          Transpose the index and columns of the DataFrame.
DataFrame.where()              Keep the values where a condition is True and replace the others (with NaN by default).
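A few of the functions listed above can be tried on a small table. This is an illustrative sketch only; the Dept and Score columns and their values are made up for the example.

```python
import pandas as pd

marks = pd.DataFrame({
    "Student": ["Kamal", "Arun", "David"],
    "Dept": ["CSE", "ECE", "CSE"],
    "Score": [10, 8, 6],
})

first_two = marks.head(2)                        # first two rows
mean_score = marks["Score"].mean()               # average of one column
by_dept = marks.groupby("Dept")["Score"].sum()   # one total per department
ranked = marks.sort_values("Score")              # ascending order of Score
high = marks.query("Score >= 8")                 # rows that satisfy the expression

print(first_two)
print(mean_score)
print(by_dept)
print(ranked)
print(high)
```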
Example:
import pandas as pd
mid_term_marks = {"Student": ["Kamal", "Arun", "David", "Thomas", "Steven"],
"Economics": [10, 8, 6, 5, 8],
"Fine Arts": [7, 8, 5, 9, 6],
"Mathematics": [7, 3, 5, 8, 5]}
mid_term_marks_df = pd.DataFrame(mid_term_marks)
print(mid_term_marks_df)
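# raw strings keep the backslash in the Windows path from being read as an escape
mid_term_marks_df.to_csv(r"D:\midterm.csv", index=False)
print(pd.read_csv(r"D:\midterm.csv"))
Output:
Student Economics Fine Arts Mathematics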
0 Kamal 10 7 7
1 Arun 8 8 3
2 David 6 5 5
3 Thomas 5 9 8
4 Steven 8 6 5
CSV stands for comma-separated values. A CSV file is a delimited text file that uses a
comma to separate values.
The CSV file format is quite popular and is supported by many software applications such as
Notepad, Microsoft Excel, and Google Sheets.
1. Using Notepad: We can create a CSV file using Notepad. Open a new file, separate
the values with commas, and save the file with the .csv extension.
2. Using Excel: We can also create a CSV file using Excel. Open a new file, enter
each value in a different cell, and save it with the file type CSV.
To read data row-wise from a CSV file in Python, we can use the reader object of the built-in csv
module. In Pandas, the read_csv() function loads a CSV file into a DataFrame.
Syntax
pandas.read_csv(filepath_or_buffer, sep=',', names=None, index_col=None,
skipinitialspace=False)
Example:
import pandas
result = pandas.read_csv(r'D:\data.csv')
print(result)
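Both ways of reading a CSV file can be compared in one self-contained sketch. It assumes a small made-up table and writes it to a temporary directory instead of a fixed drive path, so that it runs on any machine.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.csv")

df = pd.DataFrame({"Student": ["Kamal", "Arun"], "Economics": [10, 8]})
df.to_csv(path, index=False)  # index=False keeps the row labels out of the file

# Row-wise reading with the csv module: every row is a list of strings.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)

# Reading the whole file into a DataFrame; numeric columns are parsed as numbers.
result = pd.read_csv(path)
print(result)
```

The csv module returns every field as text, while read_csv infers a type for each column.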
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. Bad data usually falls into one of these groups:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
Example
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Note: By default, the dropna() method returns a new DataFrame, and will not change the original.
If you want to change the original DataFrame, use the inplace = True argument:
Example
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
Note: Now, dropna(inplace = True) does NOT return a new DataFrame; it removes all rows
containing NULL values from the original DataFrame.
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
Example
#Convert to date:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
As you can see from the result, the date in row 26 was fixed, but the empty date in row 22 got a NaT
(Not a Time) value, in other words an empty value. One way to deal with empty values is simply
removing the entire row.
Removing Rows
The conversion in the example above gave us a NaT value, which can be handled as a
NULL value, and we can remove the row by using the dropna() method.
Example
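The code for this example is not shown in the text. A minimal sketch, assuming a small made-up table in place of data.csv and using the subset parameter of dropna() so that only missing dates cause a row to be removed:

```python
import pandas as pd

# Hypothetical stand-in for data.csv: the second row has no date.
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Date": ["2020/12/01", None, "2020/12/03"],
})

# The missing date is converted to NaT.
df["Date"] = pd.to_datetime(df["Date"])

# Remove only the rows whose Date is missing.
df.dropna(subset=["Date"], inplace=True)
print(df.to_string())
```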
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like
if someone registered "199" instead of "1.99".
Sometimes you can spot wrong data by looking at the data set.
If you take a look at our data set, you can see that in row 7, the duration is 450, but for all the
other rows the duration is between 30 and 60.
Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", and we
could just insert "45" in row 7:
Example
df.loc[7, 'Duration'] = 45
For small data sets you might be able to replace the wrong data one by one, but not for big
data sets.
To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries
for legal values, and replace any values that are outside of the boundaries.
Example
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data.
This way you do not have to find out what to replace them with, and there is a good chance
you do not need them to do your analyses.
Example
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
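The two loops above can also be written without an explicit loop by using a Boolean mask, which is usually shorter and faster on large data sets. A minimal sketch on a small made-up Duration column:

```python
import pandas as pd

# Hypothetical data: two values are above the limit of 120.
df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Option 1: replace every value above 120 with 120.
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Option 2: keep only the rows whose value is within the limit.
filtered = df[df["Duration"] <= 120]

print(capped)
print(filtered)
```

Both options leave the original df unchanged, which makes it easy to compare the result with the raw data.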
4) Removing Duplicates
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
Duplicates can sometimes be spotted by taking a look at the test data set.
To discover duplicates programmatically, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row:
Example
print(df.duplicated())
Removing Duplicates
To remove duplicates, use the drop_duplicates() method.
Example
df.drop_duplicates(inplace = True)
The (inplace = True) will make sure that the method does NOT return a new DataFrame, but it will
remove all duplicates from the original DataFrame.
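The two methods can be seen together on a small made-up table. The column names and values are hypothetical, and this sketch uses the non-inplace form so that the original table stays available for comparison.

```python
import pandas as pd

# Hypothetical data: the third row repeats the second one.
df = pd.DataFrame({
    "Duration": [60, 45, 45, 30],
    "Pulse": [110, 117, 117, 103],
})

# True marks a row that is identical to an earlier row.
flags = df.duplicated()
print(flags)

# Without inplace=True, drop_duplicates() returns a new DataFrame.
deduped = df.drop_duplicates()
print(deduped)
```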