09_Pandas slides
What is Pandas?
Basic Data Structures: Series and Data Frame
Basic Functions
Input/Output Tools
1. What is Pandas
pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labeled” data
both easy and intuitive. It aims to be the fundamental high-level building
block for doing practical, real world data analysis in Python
(http://pandas.pydata.org/)
For all data structures, labels/indices can be defined per row and
column.
Data alignment is intrinsic, i.e. the link between labels and data will not
be broken.
Series:
Homogeneous data
Size Immutable
Values of Data Mutable
Data Frames:
Heterogeneous data
Size Mutable
Data Mutable
2.1. Series
Series is a one-dimensional labeled array capable of holding any data
type (integers, strings, floating point numbers, Python objects, etc.). The
axis labels are collectively referred to as the index. The basic method to
create a Series is to call:
Series(data, index=index)
Array
Dict
Scalar value or constant
0 a
1 b
2 c
3 d
dtype: object
In [54]:
import numpy as np
import pandas as pd
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])
print(s)
100 a
101 b
102 c
103 d
dtype: object
a 0.0
b 1.0
c 2.0
dtype: float64
In [56]:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Index order is persisted and the missing element is filled with NaN (Not
a Number).
0 5
1 5
2 5
3 5
dtype: int64
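The three kinds of input listed for Series(data, index=index), namely an array, a dict, and a scalar, can be sketched side by side. This is a minimal illustration, assuming pandas and NumPy are installed; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# From an ndarray: without an explicit index, labels default to 0..n-1.
from_array = pd.Series(np.array(['a', 'b', 'c', 'd']))

# From a dict: the keys become the index labels.
from_dict = pd.Series({'a': 0.0, 'b': 1.0, 'c': 2.0})

# From a dict with an explicit index: values are looked up by label,
# and a label missing from the dict ('d') is filled with NaN.
reordered = pd.Series({'a': 0.0, 'b': 1.0, 'c': 2.0},
                      index=['b', 'c', 'd', 'a'])

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5, index=[0, 1, 2, 3])

print(reordered)
print(from_scalar)
```

With a dict, the index argument acts as a selector and ordering rather than as a relabeling, which is why 'd' ends up as NaN instead of raising an error.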
Out[10]:
array([5, 5, 5, 5])
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
Out[12]:
In [13]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
Out[13]:
a 1
c 3
d 4
dtype: int64
In [10]:
ser.at['age']  # Series.get_value() was removed in pandas 1.0; .at is the label-based scalar accessor
Out[10]:
2.2. DataFrame
A DataFrame is a two-dimensional labeled data structure. It can be thought of as:
a spreadsheet
a relational database table
a dictionary of Series
Creating DataFrames
Lists
Dict
Series
Numpy ndarrays
Another DataFrame
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)
0
0 1
1 2
2 3
3 4
4 5
In [15]:
data = [['Ramesh',10],['Himesh',12],['Kamesh',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)
Name Age
0 Ramesh 10
1 Himesh 12
2 Kamesh 13
In [16]:
data = [['Ramesh',10],['Himesh',12],['Kamesh',13]]
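# dtype=float would apply to every column, and 'Name' cannot be cast; convert 'Age' only
df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})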
print (df)
Name Age
0 Ramesh 10.0
1 Himesh 12.0
2 Kamesh 13.0
All the ndarrays must be of the same length. If an index is passed, its
length must equal the length of the arrays.
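A minimal sketch of this rule, reusing the names and ages that appear in this section's output. The dict keys, variable names, and the choice to catch the error are assumptions made for the illustration.

```python
import pandas as pd

# Every list has the same length (4).
data = {'Name': ['Ramesh', 'Rajesh', 'Nitesh', 'Nilesh'],
        'Age': [28, 34, 29, 42]}

# Default integer index 0..3.
df = pd.DataFrame(data)

# Custom index: its length matches the length of the lists.
df_ranked = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df_ranked)

# An index of the wrong length is rejected with a ValueError.
try:
    pd.DataFrame(data, index=['rank1', 'rank2'])
except ValueError as err:
    print("ValueError:", err)
```

Recent pandas versions keep the columns in the order of the dict keys, so the column order may differ from output produced by older versions that sorted the keys.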
In [19]:
Age Name
0 28 Ramesh
1 34 Rajesh
2 29 Nitesh
3 42 Nilesh
In [20]:
Age Name
rank1 28 Ramesh
rank2 34 Rajesh
rank3 29 Nitesh
rank4 42 Nilesh
In [21]:
a b c
0 1 2 NaN
1 5 10 20.0
In [23]:
a b c
first 1 2 NaN
second 5 10 20.0
In [25]:
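data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
# Two column labels that both exist as keys in the dicts
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
# Two column labels, one of which ('b1') matches no key and is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)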
a b
first 1 2
second 5 10
a b1
first 1 NaN
second 5 NaN
In [58]:
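d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)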
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
df = pd.DataFrame(d)
print (df ['one'])
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
In [60]:
df = pd.DataFrame(d)
print()
print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
print (df)
df = pd.DataFrame(d)
print ("Our dataframe is:")
print (df)
print()
# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print (df)
c 3.0 30.0 3
d NaN NaN 4
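Column addition and deletion can be sketched end to end as follows. The dict d is an assumption: its 'one' and 'two' columns mirror the labels used in this section, and the values of 'three' are chosen for illustration. The pop() call is an extra shown for comparison with del.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)

# New column computed from existing ones; row 'd' has no data in
# either operand, so the result there is NaN.
df['four'] = df['one'] + df['three']

# del removes a column in place.
del df['one']

# pop also removes a column in place, and returns it as a Series.
popped = df.pop('two')

print(df)
print(popped)
```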
Selection by Label
In [61]:
df = pd.DataFrame(d)
print (df.loc['b'])
one 2.0
two 2.0
Name: b, dtype: float64
In [63]:
df = pd.DataFrame(d)
print (df.iloc[2])
one 3.0
two 3.0
Name: c, dtype: float64
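The difference between the two accessors is that loc selects by index label and iloc selects by integer position. A small sketch, assuming a frame d built like the one printed in this section (the variable names are illustrative):

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

row_by_label = df.loc['b']       # row whose index label is 'b'
row_by_position = df.iloc[2]     # third row (position 2), labelled 'c'

# Both accessors also accept a column selector.
cell_by_label = df.loc['b', 'two']
cell_by_position = df.iloc[2, 0]

# Label slices include the end label; positional slices exclude the end.
label_slice = df.loc['a':'c']    # rows a, b, c
position_slice = df.iloc[0:2]    # rows a, b
```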
Slice Rows
df = pd.DataFrame(d)
print (df[2:4])
one two
c 3.0 3
d NaN 4
Addition of Rows
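Add new rows to a DataFrame by concatenating it with another one using
pd.concat(); the new rows are placed at the end. (The older
DataFrame.append() method was removed in pandas 2.0.)
In [62]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = pd.concat([df, df2])
print(df)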
a b
0 1 2
1 3 4
0 5 6
1 7 8
Deletion of Rows
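import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = pd.concat([df, df2])
df = df.drop(0)  # drop every row whose index label is 0
print(df)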
a b
1 3 4
1 7 8
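Row addition and deletion together, as a sketch that runs on current pandas. It uses pd.concat, since DataFrame.append is not available in pandas 2.0 and later; the values are taken from the output shown above, and the ignore_index variant is an extra added for comparison.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# Stack df2 below df; original labels are kept, so 0 and 1 appear twice.
combined = pd.concat([df, df2])

# drop(0) removes every row labelled 0, one from each original frame.
dropped = combined.drop(0)

# ignore_index=True renumbers the rows 0..n-1 and avoids duplicate labels.
renumbered = pd.concat([df, df2], ignore_index=True)

print(combined)
print(dropped)
print(renumbered)
```

Because drop works on labels, duplicate labels explain why two rows disappear in the output above even though only one label was passed.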
3. Basic Functionality
In [64]:
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data series is:")
print (df)
T (Transpose)
Returns the transpose of the DataFrame. The rows and columns will
interchange.
In [65]:
# Create a DataFrame
df = pd.DataFrame(d)
print ("The transpose of the data series is:")
print (df.T)
axes
Returns the list of row axis labels and column axis labels.
In [66]:
#Create a DataFrame
df = pd.DataFrame(d)
print ("Row axis labels and column axis labels are:")
print (df.axes)
dtypes
Returns the data type of each column.
In [43]:
#Create a DataFrame
df = pd.DataFrame(d)
print ("The data types of each column are:")
print (df.dtypes)
ndim
Returns the number of dimensions of the object. A DataFrame is always 2-dimensional.
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The dimension of the object is:")
print (df.ndim)
shape
Returns a tuple (rows, columns) giving the dimensions of the DataFrame.
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The shape of the object is:")
print (df.shape)
size
Returns the total number of elements in the DataFrame.
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print ("The total number of elements in our object is:")
print (df.size)
values
Returns the actual data in the DataFrame as a NumPy ndarray.
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print (df)
print()
print ("The actual data in our data frame is:")
print (df.values)
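The attributes in this section can be tried on one small frame. The sample data below is hypothetical and only serves to make the sketch self-contained.

```python
import pandas as pd

# Hypothetical sample data: 4 rows, 3 columns.
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin']),
     'Age': pd.Series([25, 26, 25, 23]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56])}
df = pd.DataFrame(d)

print(df.T.shape)   # transpose swaps rows and columns
print(df.axes)      # [row index, column index]
print(df.dtypes)    # one dtype per column
print(df.ndim)      # number of axes
print(df.shape)     # (rows, columns)
print(df.size)      # rows * columns
print(df.values)    # underlying data as a NumPy array
```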
To view a small sample of a DataFrame object, use the head() and tail()
methods. head() returns the first n rows (observe the index values). The
default number of elements to display is five, but you may pass a
custom number.
In [52]:
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print (df)
print()
print ("The first two rows of the data frame are:")
print (df.head(2))
tail() returns the last n rows (observe the index values). The default
number of elements to display is five, but you may pass a custom
number.
In [53]:
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print (df)
print()
print ("The last two rows of the data frame are:")
print (df.tail(2))
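A short sketch of the default and custom row counts. The seven-row frame is hypothetical, chosen so that the default of five rows is visibly fewer than the whole frame.

```python
import pandas as pd

# Hypothetical frame with seven rows, indexed 0..6.
df = pd.DataFrame({'value': range(10, 17)})

first_default = df.head()    # first 5 rows (index 0..4)
first_two = df.head(2)       # first 2 rows (index 0, 1)
last_two = df.tail(2)        # last 2 rows (index 5, 6)

print(first_two)
print(last_two)
```

Both methods return a new DataFrame that keeps the original index labels, which is why tail(2) shows 5 and 6 rather than 0 and 1.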
4. Descriptive Statistics
Descriptive statistics summarize the underlying distribution of data
values through measures such as the mean and the variance.
Basic Functions
Function    Description
count       Number of non-null observations
sum         Sum of values
mean        Mean of values
mad         Mean absolute deviation (removed in pandas 2.0)
median      Arithmetic median of values
min         Minimum
max         Maximum
mode        Mode
abs         Absolute value
prod        Product of values
std         Bessel-corrected sample standard deviation
var         Unbiased variance
skew        Unbiased skewness (3rd moment)
kurt        Unbiased kurtosis (4th moment)
quantile    Sample quantile (value at %)
cumsum      Cumulative sum
cumprod     Cumulative product
cummax      Cumulative maximum
cummin      Cumulative minimum
4.1 sum()
Returns the sum of the values for the requested axis. By default, axis is
index (axis=0).
In [10]:
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
In [11]:
print (df.sum())
Age                                                     382
Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating                                                44.92
dtype: object
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64
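The effect of the axis argument can be sketched on a three-row frame. The values are an assumption, chosen so that the row totals agree with the first three totals printed above; numeric_only=True is used so that the text column is skipped on recent pandas versions.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                   'Age': [25, 26, 25],
                   'Rating': [4.23, 3.24, 3.98]})

# axis=0 (the default): one total per column.
col_totals = df.sum(numeric_only=True)

# axis=1: one total per row, adding Age and Rating across each row.
row_totals = df.sum(axis=1, numeric_only=True)

print(col_totals)
print(row_totals)
```

Without numeric_only=True, summing down a text column concatenates the strings, which is what produces the long Name entry in the column totals.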
4.2 mean()
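Returns the mean of each numeric column.
In [13]:
# numeric_only=True skips the text column 'Name', which recent pandas no longer drops silently
print (df.mean(numeric_only=True))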
Age 31.833333
Rating 3.743333
dtype: float64
4.3 std()
Returns the Bessel-corrected (sample) standard deviation of the numerical columns.
In [14]:
# numeric_only=True skips the text column 'Name'
print (df.std(numeric_only=True))
Age 9.232682
Rating 0.661628
dtype: float64
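Bessel's correction means dividing the sum of squared deviations by n - 1 rather than n. A small sketch with made-up numbers shows the difference, and also that NumPy's default differs from the pandas default:

```python
import numpy as np
import pandas as pd

# Made-up values: mean 5, sum of squared deviations 32, n = 8.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = s.std()               # pandas default ddof=1: sqrt(32 / 7)
population_std = s.std(ddof=0)     # divide by n instead: sqrt(32 / 8)
numpy_std = np.std(s.to_numpy())   # NumPy's default is ddof=0

print(sample_std, population_std, numpy_std)
```

The ddof parameter ("delta degrees of freedom") is the value subtracted from n in the divisor, so the two libraries agree once it is set explicitly.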
In [15]:
print (df.describe())
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
describe() reports the count, mean, standard deviation, minimum,
quartiles (25%, 50%, 75%) and maximum. By default it summarizes only the
numeric columns and leaves out character columns. The 'include' argument
selects which column types are summarized; it takes a list of dtypes,
for example 'number' (the default behavior) or 'object', or the string
'all' for every column.
print (df.describe(include=['object']))
Name
count 12
unique 12
top Steve
freq 1
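The column selection done by describe() can be compared across three calls. The frame below is hypothetical, and exclude=['number'] is used here as an assumption-free way to pick the non-numeric columns regardless of how a given pandas version stores strings.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                   'Age': [25, 26, 25],
                   'Rating': [4.23, 3.24, 3.98]})

# Default: numeric columns only.
numeric_summary = df.describe()

# Non-numeric columns only: count, unique, top, freq.
text_summary = df.describe(exclude=['number'])

# Every column; statistics that do not apply to a column are NaN.
full_summary = df.describe(include='all')

print(numeric_summary)
print(text_summary)
print(full_summary)
```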
5. Input/Output Tools
The pandas I/O API is a set of top-level reader functions, accessed like
pd.read_csv(), that generally return a pandas object. The corresponding
writer functions are object methods, accessed like df.to_csv().
read_csv
read_excel
read_hdf
read_sql
read_json
read_msgpack (experimental)
read_html
read_gbq (experimental)
read_stata
read_clipboard
read_pickle
to_excel
to_hdf
to_sql
to_json
to_msgpack (experimental)
to_html
to_gbq (experimental)
to_stata
to_clipboard
to_pickle
In [22]:
weather_data.csv
In [24]:
df = pd.read_csv("data/weather_data.csv")
print (df)
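A reader and its matching writer can be exercised without any file on disk. This sketch makes up a small weather table (the column names and values are assumptions, not the contents of weather_data.csv) and round-trips it through CSV text held in memory.

```python
import io
import pandas as pd

# Hypothetical weather records.
df = pd.DataFrame({'day': ['2017-01-01', '2017-01-02'],
                   'temperature': [32, 35],
                   'event': ['Rain', 'Sunny']})

# to_csv with no path returns the CSV text; index=False omits row labels.
csv_text = df.to_csv(index=False)

# read_csv accepts a path or any file-like object, here an in-memory buffer.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing a path such as "data/weather_data.csv" to both calls would work the same way, provided the file and directory exist.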