Python Data Frame
PREPARED BY
R.AKILA.AP(SG)/CSE
BSACIST
Pandas
At the very basic level, Pandas objects can be thought
of as enhanced versions of NumPy structured arrays.
The rows and columns are identified with labels
rather than simple integer indices.
Pandas provides a host of useful tools, methods, and
functionality on top of these data structures.
Three fundamental Pandas data structures:
Series, DataFrame, and Index.
Pandas Series
A Pandas Series is a one-dimensional array of
indexed data. It can be created from a list or array.
A Series has both a sequence of values and a
sequence of indices, which we can access with
the values and index attributes. The values are
simply a familiar NumPy array.
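A minimal sketch of creating a Series from a list and reading its values and index attributes. The sample numbers are illustrative and not taken from the slides.

```python
import numpy as np
import pandas as pd

# Build a Series from a plain Python list (illustrative sample values)
data = pd.Series([0.25, 0.5, 0.75, 1.0])

# The values attribute exposes the underlying NumPy array
values = data.values

# The index attribute holds the labels; by default this is an integer range
index = data.index

print(values)
print(index)

# With the default integer index, items are retrieved much like array elements
second = data[1]
print(second)
```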
The essential difference is the presence of the index:
while a NumPy array has an implicitly
defined integer index used to access the values, a
Pandas Series has an explicitly defined index
associated with the values.
This explicit index definition gives the Series object
additional capabilities. The index need
not be an integer, but can consist of values of any
desired type. For example, if we wish, we can use
strings as an index.
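A short sketch of a Series with a string index. The labels and values are illustrative assumptions.

```python
import pandas as pd

# Pass an explicit index of strings instead of relying on the default integers
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Items are now accessed by their string label
item = data['b']

print(data)
print(item)
```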
Series as Specialized Dictionary
A dictionary is a structure which maps arbitrary keys
to a set of arbitrary values, and a Series is a structure
which maps typed keys to a set
of typed values.
This typing is important: just as the type-specific
compiled code behind a NumPy array makes it more
efficient than a Python list for certain operations, the
type information of a Pandas Series makes it much
more efficient than Python dictionaries for certain
operations.
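The dictionary analogy can be sketched by building a Series directly from a Python dict. The keys and counts below are made-up sample data.

```python
import pandas as pd

# A plain dictionary with illustrative keys and values
stock_dict = {'apples': 3, 'bananas': 5, 'cherries': 7}

# The dict keys become the index, the dict values become the Series values
stock = pd.Series(stock_dict)

# Dictionary-style lookup by key
bananas = stock['bananas']

# Unlike a dict, a Series also supports slicing by label (both ends included)
subset = stock['apples':'bananas']

print(stock)
print(bananas)
print(subset)
```

All values share one dtype, which is the typing that the text above refers to.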
Pandas DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable,
potentially heterogeneous tabular data
structure with labeled axes (rows and columns).
In other words, the data is aligned in a tabular
fashion in rows and columns.
A Pandas DataFrame consists of three principal
components: the data, the rows, and the columns.
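The three components can be inspected through the values, index, and columns attributes. This is a minimal sketch on a small illustrative DataFrame.

```python
import pandas as pd

# A small illustrative DataFrame
df = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Height': [5.1, 6.2]})

print(df.values)   # the data, as a NumPy array
print(df.index)    # the row labels
print(df.columns)  # the column labels
```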
Basic operation on Pandas DataFrame
Creating a DataFrame
Dealing with Rows and Columns
Indexing and Selecting Data
Working with Missing Data
Iterating over rows and columns
In practice, a Pandas DataFrame is usually
created by loading a dataset from existing storage,
such as a SQL database, a CSV file, or an Excel file.
A Pandas DataFrame can also be created from lists,
dictionaries, a list of dictionaries, and so on.
Creating a DataFrame from a list
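A minimal sketch of building a DataFrame from a list. The list contents and column names are illustrative assumptions.

```python
import pandas as pd

# A flat list becomes a single-column DataFrame
words = ['Pandas', 'Series', 'DataFrame', 'Index']
df = pd.DataFrame(words, columns=['Term'])
print(df)

# A list of lists is read row by row
rows = [['Jai', 5.1], ['Princi', 6.2]]
df2 = pd.DataFrame(rows, columns=['Name', 'Height'])
print(df2)
```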
Creating a DataFrame from a dict of ndarrays/lists
To create a DataFrame from a dict of ndarrays/lists, all the
ndarrays must be of the same length.
If an index is passed, its length must be
equal to the length of the arrays.
If no index is passed, the index defaults to
range(n), where n is the array length.
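The rules above can be sketched as follows. The row labels r1 to r4 are hypothetical, chosen only for illustration.

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2]}

# No index passed: the index defaults to range(4)
df_default = pd.DataFrame(data)
print(df_default)

# Index passed: its length must match the length of the lists
df_labeled = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4'])
print(df_labeled)

# Lists of unequal length are rejected with a ValueError
mismatch_raised = False
try:
    pd.DataFrame({'Name': ['Jai', 'Princi'], 'Height': [5.1]})
except ValueError as err:
    mismatch_raised = True
    print('Length mismatch:', err)
```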
Dealing with Rows and Columns
We can perform basic operations on rows/columns
like selecting, deleting, adding, and renaming.
Column Selection: To select a column in a
Pandas DataFrame, we access it
by its column name.
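Column selection by name can be sketched as below, using the same sample data as the slides. Single brackets return a Series and a list of names returns a DataFrame.

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
df = pd.DataFrame(data)

# Selecting one column by name gives a Series
heights = df['Height']
print(heights)

# Selecting several columns with a list of names gives a DataFrame
subset = df[['Name', 'Qualification']]
print(subset)
```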
import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Height': [5.1, 6.2, 5.1, 5.2],
'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address
# Observe the result
print(df)
DataFrame after adding the new column
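Deleting and renaming, the other basic operations listed earlier, can be sketched in the same way. The new column name Degree is a hypothetical choice for illustration.

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
df = pd.DataFrame(data)

# Delete a column; drop() returns a new DataFrame and leaves df unchanged
without_height = df.drop(columns=['Height'])

# Delete a row by its index label
without_first = df.drop(index=[0])

# Rename a column using an old-name to new-name mapping
renamed = df.rename(columns={'Qualification': 'Degree'})

print(without_height)
print(without_first)
print(renamed)
```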
Row Selection: Pandas provides dedicated indexers
to retrieve rows from a DataFrame.
The DataFrame.loc[] indexer retrieves rows
from a Pandas DataFrame by label.
Rows can also be selected by passing an integer location
to the iloc[] indexer.
Dealing with rows
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
Selecting a single row
Indexing a DataFrame using .iloc[]:
This indexer allows us to retrieve rows and columns
by position.
To do that, we specify the positions
of the rows that we want, and the positions of the
columns that we want as well.
The df.iloc indexer is very similar to df.loc but
uses only integer locations to make its selections.
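A minimal sketch of position-based selection with iloc, with loc shown for comparison. It reuses the small sample data from the slides rather than the nba.csv file, so that it runs on its own.

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
df = pd.DataFrame(data).set_index('Name')

# First row, selected by position
first_row = df.iloc[0]

# Rows at positions 0 and 1, column at position 0 (end of slice excluded)
block = df.iloc[0:2, 0:1]

# A single cell: row position 3, column position 1
cell = df.iloc[3, 1]

# The same cell selected by label with loc
same_cell = df.loc['Anuj', 'Qualification']

print(first_row)
print(block)
print(cell, same_cell)
```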
Working with Missing Data
Missing data occurs when no information is provided for
one or more items, or for a whole unit.
Missing data is a common and serious problem in real-world datasets.
Missing values are also referred to as NA (Not Available) values in
pandas.
Checking for missing values
using isnull() and notnull():
To check for missing values in a Pandas DataFrame, we use
the functions isnull() and notnull().
Both functions help in checking whether a value is NaN or not.
These functions can also be used on a Pandas Series to
find null values in the Series.
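A short sketch of both checks on a small DataFrame. The column names and scores are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing entries
scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan]})

# True where a value is missing
missing_mask = scores.isnull()

# True where a value is present
present_mask = scores.notnull()

# Summing the boolean mask counts the missing values per column
missing_per_column = scores.isnull().sum()

print(missing_mask)
print(present_mask)
print(missing_per_column)

# The same functions work on a single Series
print(scores['First Score'].isnull())
```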
Filling missing values
using fillna(), replace() and interpolate():
To fill null values in a dataset, we
use the fillna(), replace() and interpolate() functions.
fillna() and replace() substitute NaN values with a value that we supply.
All of these functions help in filling null values in
a DataFrame.
The interpolate() function also fills NA values in
the DataFrame, but it uses an interpolation technique to
estimate the missing values rather than a hard-coded value.
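The three approaches can be sketched side by side. The data and the fill values 0 and -1 are illustrative assumptions, and linear interpolation is only one of the available methods.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing entry per column
scores = pd.DataFrame({'First Score': [100.0, 90.0, np.nan, 95.0],
                       'Second Score': [30.0, np.nan, 56.0, 60.0]})

# fillna(): replace every NaN with a fixed value
filled = scores.fillna(0)

# replace(): substitute NaN with another chosen value
replaced = scores.replace(to_replace=np.nan, value=-1)

# interpolate(): estimate each NaN from its neighbours (linear by default)
interpolated = scores.interpolate(method='linear')

print(filled)
print(replaced)
print(interpolated)
```

With linear interpolation, each gap here is filled with the midpoint of the values directly above and below it.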
Dropping missing values
using dropna():
To drop null values from a DataFrame, we
use the dropna() function. This function drops
rows or columns that contain null values, in
different ways depending on its arguments.
For example, we can drop every row with at least one NaN
(null) value.
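A minimal sketch of dropna() in a few of its modes. The column names A, B and C and their values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: columns A and B each contain one NaN, column C has none
scores = pd.DataFrame({'A': [100, 90, np.nan, 95],
                       'B': [30, np.nan, 56, 60],
                       'C': [1, 2, 3, 4]})

# Default: drop every row that has at least one NaN
rows_dropped = scores.dropna()

# axis=1: drop every column that has at least one NaN
cols_dropped = scores.dropna(axis=1)

# how='all': drop only rows in which every value is NaN
all_nan_rows_dropped = scores.dropna(how='all')

print(rows_dropped)
print(cols_dropped)
print(all_nan_rows_dropped)
```

dropna() returns a new DataFrame, so the original scores object is left unchanged.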