A DataFrame is a two-dimensional data structure,
i.e., data is aligned in a tabular fashion in rows and
columns.
Features of DataFrame
Columns can be of different types
Size is mutable
Labeled axes (rows and columns)
Arithmetic operations can be performed on rows and
columns
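The features listed above can be seen in a small sketch. The sample data and the column name age_next_year are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data: one text column and two numeric columns
df = pd.DataFrame(
    {"name": ["Tom", "Jack"], "age": [28, 34], "height": [5.8, 6.1]},
    index=["r1", "r2"],
)

# Columns can hold different types
print(df.dtypes)

# Size is mutable: a column can be added after creation,
# here computed with arithmetic on an existing column
df["age_next_year"] = df["age"] + 1

# Both axes carry labels
print(df.index.tolist())
print(df.columns.tolist())
```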
A pandas DataFrame can be created using the
following constructor:
pandas.DataFrame(data, index, columns, dtype, copy)
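As a rough illustration of the constructor parameters, the sketch below passes all five of them at once. The array contents and the labels row1, row2, a, and b are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical input data: a 2x2 integer array
values = np.array([[1, 2], [3, 4]])

df = pd.DataFrame(
    data=values,             # the values to store
    index=["row1", "row2"],  # row labels
    columns=["a", "b"],      # column labels
    dtype=float,             # convert every column to float
    copy=True,               # copy the input instead of sharing memory
)
print(df)

# Changing the source array afterwards does not change df
values[0, 0] = 99
print(df.loc["row1", "a"])
```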
A pandas DataFrame can be created using various
inputs such as:
Lists
Dictionaries
Series
NumPy ndarrays
Another DataFrame
In the subsequent slides of this lecture, we will see
how to create a DataFrame using these inputs.
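Two of the listed inputs, a single Series and another DataFrame, can be sketched briefly as follows. The Series name score and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical named Series; the name becomes the column label
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")
df_from_series = pd.DataFrame(s)
print(df_from_series)

# Build a new DataFrame from an existing one; copy=True makes it independent
df_copy = pd.DataFrame(df_from_series, copy=True)
df_copy.loc["a", "score"] = 0

# The original is unchanged
print(df_from_series.loc["a", "score"])
```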
Create an Empty DataFrame
The most basic DataFrame that can be created is an
empty DataFrame.
# import the pandas library, aliased as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Create a DataFrame from Lists
The DataFrame can be created using a single list or
a list of lists.
Example 1
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)
Example 3
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
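df = pd.DataFrame(data, columns=['Name','Age'])
# Convert only the numeric column to float
df['Age'] = df['Age'].astype(float)
print(df)
Note: astype(float) changes the type of the Age column to
floating point. Passing dtype=float to the constructor
applies to every column and can raise an error in recent
pandas versions, because the Name strings cannot be converted.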
Create DataFrame from Dictionary using default
Constructor
The DataFrame constructor accepts a data object that
can be an ndarray, a dictionary, etc.
If a dictionary is passed as data, its values should be
list-like objects such as Series, arrays, or lists, e.g.:
# Dictionary with list object in values
studentData = {
'name' : ['jack', 'Riti', 'Aadi'],
'age' : [34, 30, 16],
'city' : ['Sydney', 'Delhi', 'New york']
}
When a DataFrame is initialised with this kind of
dictionary, each item (key/value pair) in the
dictionary is converted to one column: the key
becomes the column name and the list in the value
field becomes the column data.
# Dictionary with list object in values
import pandas as pd
studentData = {
'name' : ['jack', 'Riti', 'Aadi'],
'age' : [34, 30, 16],
'city' : ['Sydney', 'Delhi', 'New york']
}
dfObj = pd.DataFrame(studentData)
print(dfObj)
All the ndarrays must be of the same length. If an index is
passed, its length should equal the length of the arrays.
If no index is passed, the index defaults to
range(n), where n is the array length.
import pandas as pd
data = {'Name':['Tom','Jack','Steve','Ricky'],
        'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
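The index rule stated above can be checked with a short sketch. The labels rank1 to rank4 and the flag mismatch_raised are assumptions for illustration.

```python
import pandas as pd

data = {"Name": ["Tom", "Jack", "Steve", "Ricky"], "Age": [28, 34, 29, 42]}

# An index with the same length as the lists is accepted
df = pd.DataFrame(data, index=["rank1", "rank2", "rank3", "rank4"])
print(df)

# An index of a different length is rejected with a ValueError
mismatch_raised = False
try:
    pd.DataFrame(data, index=["rank1", "rank2"])
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```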
A list of dictionaries can be passed as input data to
create a DataFrame. The dictionary keys are by default
taken as column names.
The following example shows how to create a
DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
Example 2
The following example shows how to create a
DataFrame by passing a list of dictionaries and the row
indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print (df)
The following example shows how to create a DataFrame
with a list of dictionaries, row indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
# With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'],
                   columns=['a', 'b'])
# With two column indices, one of which ('b1') is not a dictionary key,
# so that column is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'],
                   columns=['a', 'b1'])
print (df1)
print (df2)
A dictionary of Series can be passed to form a DataFrame.
The resultant index is the union of all the Series indexes
passed.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df)
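The union behaviour can be inspected directly. The sketch below reuses the same two Series and shows that the label missing from 'one' is filled with NaN, which also turns that column into a floating-point column.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# The index is the union of both Series indexes
print(df.index.tolist())

# Label 'd' does not exist in Series 'one', so that cell is NaN
print(df['one'].isna())

# A column holding NaN is stored as float; 'two' stays integer
print(df.dtypes)
```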
To work with columns, we perform basic
operations such as selecting, adding, deleting,
and renaming.
Column Selection:
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
Column Addition:
To add a column to a pandas DataFrame, we can declare a new list and
assign it to the existing DataFrame under a new column name.
# Define a dictionary containing Students data
import pandas as pd
dic = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Height': [5.1, 6.2, 5.1,
5.2], 'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data=dic)
# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
# Using 'Address' as the column name
# and assigning the list to it
df['Address'] = address
# Observe the result
print(df)
Column Addition:
import pandas as pd
dic = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(dic)
# Adding a new column to an existing DataFrame
# by assigning a new Series to a column label
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
df['four']=df['one']+df['three']
print(df)
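Renaming is listed among the basic column operations but is not demonstrated in these slides. A minimal sketch using DataFrame.rename follows; the new names first and second are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})

# rename returns a new DataFrame; the mapping is old name -> new name
renamed = df.rename(columns={'one': 'first', 'two': 'second'})
print(renamed)

# The original DataFrame keeps its column names
print(df.columns.tolist())
```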
The key data structure in pandas is called?
A. Keyframe
B. DataFrame
C. Statistics
D. Econometrics
Which of the following inputs can be accepted by
DataFrame?
a) Structured ndarray
b) Series
c) DataFrame
d) All of the mentioned
Identify the correct statement:
A. The standard marker for missing data in Pandas
is NaN
B. Series act in a way similar to that of an array
C. Both of the above
D. None of the above
If data is an ndarray, index must be the same
length as data.
a) True
b) False
Column Deletion
Columns can be deleted, popped or dropped.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print(df)
# using the del statement
del df['one']
print(df)
# using the pop method
df.pop('two')
print(df)
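The example above covers del and pop; dropping can be sketched with DataFrame.drop, which returns a new DataFrame and leaves the original untouched. The data values here are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# drop returns a new DataFrame without the named column
smaller = df.drop(columns=['two'])
print(smaller.columns.tolist())

# pop removes the column in place and returns it as a Series
popped = df.pop('three')
print(popped.tolist())
print(df.columns.tolist())
```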
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select all rows and the second to fourth columns
print(df[df.columns[1:4]])
Selection by Label
Rows can be selected by passing a row label to loc.
loc is label-based, which means that you have to specify rows and columns
by their row and column labels.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
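Since loc takes both row and column labels, a short sketch of a few common forms may help. It reuses the same DataFrame as the example above.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# A single cell: row label 'b', column label 'two'
print(df.loc['b', 'two'])

# Several rows and one column, given as lists of labels
print(df.loc[['a', 'c'], ['one']])

# A label slice includes both endpoints: rows 'a', 'b' and 'c'
print(df.loc['a':'c'])
```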
Selection by integer location
Rows can be selected by passing an integer location to iloc.
iloc is integer-position based, so you have to specify
rows and columns by their integer positions.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c',
'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
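As with loc, iloc accepts both a row and a column position. The sketch below reuses the same DataFrame and shows that integer slices exclude the end position.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# A single cell: first row, second column
print(df.iloc[0, 1])

# Rows at positions 1 and 2 (the end position 3 is excluded)
print(df.iloc[1:3])

# All rows of the first column
print(df.iloc[:, 0])
```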
Select Multiple Rows
Multiple rows can be selected using the ':' slice operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c',
'd'])}
df = pd.DataFrame(d)
print(df[2:4])
Addition of Rows
Add new rows to a DataFrame using pd.concat, which appends the rows
at the end. (The older DataFrame.append method was removed in pandas 2.0.)
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
print(df)
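The combined DataFrame keeps each input's original row labels, so labels 0 and 1 appear twice. If fresh labels are preferred, one option is pd.concat with ignore_index=True, sketched here with the same data.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old row labels and numbers the rows 0..n-1
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```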
Deletion of Rows
Use an index label to delete or drop rows from a DataFrame. If the label is
duplicated, multiple rows will be dropped.
In the example below, the label 0 is duplicated, so two rows are dropped.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
# Drop rows with label 0
df = df.drop(0)
print(df)
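Before dropping by label it can be useful to check whether labels repeat. A small sketch using df.index.is_unique follows; the variable name remaining is an assumption for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = pd.concat([df1, df2])

# Labels 0 and 1 each appear twice, so the index is not unique
print(df.index.is_unique)

# Dropping label 0 therefore removes two rows
remaining = df.drop(0)
print(remaining)
```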
A DataFrame is two-dimensional, and df.shape
gives its shape as (rows, columns).
This is a tuple, so the number of rows and
columns can be stored in separate variables
by unpacking it.
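Storing the two values of df.shape in variables can be sketched as follows. The data and the variable names n_rows and n_cols are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                   'Age': [27, 24, 22, 32]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```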
Pandas head() method is used to return top n
(5 by default) rows of a data frame or series.
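A minimal sketch of head() on a ten-row DataFrame, with the column name n chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})

# Without an argument, head returns the first 5 rows
first_five = df.head()
print(first_five)

# With an argument, head returns that many rows
first_three = df.head(3)
print(first_three)
```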
We can get summary statistics for the data in the
DataFrame, such as its max, min, and mean, with
just one command: df.describe()
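The result of describe() is itself a DataFrame, so individual statistics can be read from it with loc. In this sketch the data is assumed for illustration, and only the numeric column is summarised by default.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                   'Age': [27, 24, 22, 32]})

# By default describe summarises the numeric columns only
stats = df.describe()
print(stats)

# Individual statistics are looked up by row label
print(stats.loc['mean', 'Age'])
print(stats.loc['max', 'Age'])
```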
The function to see the first few observations in a data
frame is
A. dataframe_object.head()
B. dataframe_object.start()
C. head()
D. All
What is the syntax to remove a column from a
dataframe
A. del dataframe_object(Column_name)
B. del Column_name
C. del dataframe_object()
D. None of the above
The syntax to check uniqueness of labels
A. df.index.is_unique
B. df.is_unique
C. index.is_unique
D. None of the above
What is the method for generating multiple
statistics
A. df.explain()
B. df.stat()
C. df.describe()
D. All
What is the syntax for reading a csv file into
dataframe in pandas
A. df = pd.read_csv(file_name.csv)
B. df = pd.read_csv()
C. df = read_csv(file_name.csv)
D. All
What function is used to fill missing data
A. df.fillna(value)
B. fillna(value)
C. df.fillna()
D. fillna()
The operator used for concatenation of
strings is
A. :
B. +
C. *
D. All
The index of last character in the string is
A. 0
B. 1
C. N
D. N -1