Unit 04 Pandas

Introduction to pandas

• pandas contains data structures and data manipulation tools designed to make data
cleaning and analysis fast and easy in Python.

• pandas is often used in tandem with numerical computing tools like NumPy and SciPy,
analytical libraries like statsmodels and scikit-learn, and data visualization libraries like
matplotlib.

• pandas adopts significant parts of NumPy’s idiomatic style of array-based computing,
especially array-based functions and a preference for data processing without for loops.

• Since becoming an open source project in 2010, pandas has matured into a quite large
library that’s applicable in a broad set of real-world use cases.
Introduction to pandas

• While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data.

• NumPy, by contrast, is best suited for working with homogeneous numerical array data.

• We use the following import convention for pandas:

• In [1]: import pandas as pd

• Thus, whenever you see pd. in code, it’s referring to pandas.

• You may also find it easier to import Series and DataFrame into the local namespace since
they are so frequently used:

• In [2]: from pandas import Series, DataFrame


Introduction to pandas Data Structures
• To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. They
provide a solid, easy-to-use basis for most applications.

• Series
• A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated
array of data labels, called its index. The simplest Series is formed from only an array of data:

• In [11]: obj = pd.Series([4, 7, -5, 3])

• In [12]: obj

• Out[12]:

• 0    4
• 1    7
• 2   -5
• 3    3
• dtype: int64

• The string representation of a Series displayed interactively shows the index on the left and the values on the right.

• Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created.
Series
• You can get the array representation and index object of the Series via its values and index attributes,
respectively:

• In [13]: obj.values

• Out[13]: array([ 4, 7, -5, 3])

• In [14]: obj.index # like range(4)

• Out[14]: RangeIndex(start=0, stop=4, step=1)


Series

• Often it will be desirable to create a Series with an index identifying each data point with a label:

• In [15]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

• In [16]: obj2

• Out[16]:

• d 4

• b 7

• a -5

• c 3

• dtype: int64

• In [17]: obj2.index

• Out[17]: Index(['d', 'b', 'a', 'c'], dtype='object')


Series

• Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

• In [18]: obj2['a']

• Out[18]: -5

• In [19]: obj2['d'] = 6

• In [20]: obj2[['c', 'a', 'd']]

• Out[20]:

• c 3

• a -5

• d 6

• dtype: int64
Series
• Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, will preserve
the index-value link:

• In [21]: obj2[obj2 > 0]

• Out[21]:

• d    6
• b    7
• c    3
• dtype: int64

• In [22]: obj2 * 2

• Out[22]:

• d    12
• b    14
• a   -10
• c     6
• dtype: int64
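NumPy ufuncs behave the same way; a brief sketch (np.exp chosen here as an illustrative math function):

```python
import numpy as np
import pandas as pd

# obj2 as it stands at this point in the examples (after obj2['d'] = 6)
obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Applying a NumPy ufunc returns a new Series with the index preserved
result = np.exp(obj2)
print(result.index.tolist())  # ['d', 'b', 'a', 'c']
```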
Series
• If you have data contained in a Python dict, you can create a Series from it by passing the dict:

• In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

• In [27]: obj3 = pd.Series(sdata)

• In [28]: obj3

• Out[28]:

• Ohio      35000
• Oregon    16000
• Texas     71000
• Utah       5000
• dtype: int64

• When you are only passing a dict, the index in the resulting Series will have the dict’s keys in sorted order (in recent pandas versions, the dict’s insertion order is preserved instead).

• You can override this by passing the dict keys in the order you want them to appear in the resulting Series:

• In [29]: states = ['California', 'Ohio', 'Oregon', 'Texas']

• In [30]: obj4 = pd.Series(sdata, index=states)

• In [31]: obj4

• Out[31]:

• California        NaN
• Ohio          35000.0
• Oregon        16000.0
• Texas         71000.0
• dtype: float64
Series
• Here, three values found in sdata were placed in the appropriate locations, but since no value for
'California' was found, it appears as NaN (not a number), which is considered in pandas to mark
missing or NA values.

• Since 'Utah' was not included in states, it is excluded from the resulting object.

• I will use the terms “missing” or “NA” interchangeably to refer to missing data.
• The isnull and notnull functions in pandas should be used to detect missing data:

• In [32]: pd.isnull(obj4)

• Out[32]:

• California     True
• Ohio          False
• Oregon        False
• Texas         False
• dtype: bool

• In [33]: pd.notnull(obj4)

• Out[33]:

• California    False
• Ohio           True
• Oregon         True
• Texas          True
• dtype: bool
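These checks also exist as Series instance methods; a small sketch using obj4 from above:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# isnull/notnull are available on the Series itself, not only as pd functions
print(int(obj4.isnull().sum()))      # 1 -- only California is missing
print(bool(obj4.notnull()['Ohio']))  # True
```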
A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations:

• In [35]: obj3
• Out[35]:
• Ohio      35000
• Oregon    16000
• Texas     71000
• Utah       5000
• dtype: int64

• In [36]: obj4
• Out[36]:
• California        NaN
• Ohio          35000.0
• Oregon        16000.0
• Texas         71000.0
• dtype: float64

• In [37]: obj3 + obj4
• Out[37]:
• California         NaN
• Ohio           70000.0
• Oregon         32000.0
• Texas         142000.0
• Utah               NaN
• dtype: float64
Both the Series object itself and its index have a name attribute, which integrates with other key areas
of pandas functionality:
• In [38]: obj4.name = 'population'
• In [39]: obj4.index.name = 'state'
• In [40]: obj4
• Out[40]:
• state
• California NaN
• Ohio 35000.0
• Oregon 16000.0
• Texas 71000.0
• Name: population, dtype: float64
DataFrame
• A DataFrame represents a rectangular table of data and contains an ordered collection of
columns, each of which can be a different value type (numeric, string, boolean, etc.).

• The DataFrame has both a row and column index; it can be thought of as a dict of Series all
sharing the same index.

• Internally, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some
other collection of one-dimensional arrays.
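The "dict of Series" view can be made concrete with a small sketch (the names here are illustrative, not from the slide):

```python
import pandas as pd

# Two Series sharing one index act like the columns of a DataFrame
pop = pd.Series([1.5, 1.7, 3.6], index=['a', 'b', 'c'])
year = pd.Series([2000, 2001, 2002], index=['a', 'b', 'c'])
df = pd.DataFrame({'pop': pop, 'year': year})

# Selecting a column hands back a Series carrying the shared row index
col = df['pop']
print(col.index.tolist())  # ['a', 'b', 'c']
```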
DataFrame

• There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

• data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
•         'year': [2000, 2001, 2002, 2001, 2002, 2003],
•         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
• frame = pd.DataFrame(data)

• The resulting DataFrame will have its index assigned automatically as with Series, and the columns are placed in sorted order (in recent pandas versions, the dict’s insertion order is preserved instead):

• In [45]: frame
• Out[45]:
•    pop   state  year
• 0  1.5    Ohio  2000
• 1  1.7    Ohio  2001
• 2  3.6    Ohio  2002
• 3  2.4  Nevada  2001
• 4  2.9  Nevada  2002
• 5  3.2  Nevada  2003
DataFrame
• For large DataFrames, the head method selects only the first five rows:

• In [46]: frame.head()
• Out[46]:
•    pop   state  year
• 0  1.5    Ohio  2000
• 1  1.7    Ohio  2001
• 2  3.6    Ohio  2002
• 3  2.4  Nevada  2001
• 4  2.9  Nevada  2002

• If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

• In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
• Out[47]:
•    year   state  pop
• 0  2000    Ohio  1.5
• 1  2001    Ohio  1.7
• 2  2002    Ohio  3.6
• 3  2001  Nevada  2.4
• 4  2002  Nevada  2.9
• 5  2003  Nevada  3.2

DataFrame
• If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:
• In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
• ....: index=['one', 'two', 'three', 'four',
• ....: 'five', 'six'])
• In [49]: frame2
• Out[49]:
• year state pop debt
• one 2000 Ohio 1.5 NaN
• two 2001 Ohio 1.7 NaN
• three 2002 Ohio 3.6 NaN
• four 2001 Nevada 2.4 NaN
• five 2002 Nevada 2.9 NaN
• six 2003 Nevada 3.2 NaN
• In [50]: frame2.columns
• Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
DataFrame
• A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
• In [51]: frame2['state']
• Out[51]:
• one Ohio
• two Ohio
• three Ohio
• four Nevada
• five Nevada
• six Nevada
• Name: state, dtype: object
• In [52]: frame2.year
• Out[52]:
• one 2000
• two 2001
• three 2002
• four 2001
• five 2002
• six 2003
• Name: year, dtype: int64
DataFrame
• Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

• In [54]: frame2['debt'] = 16.5
• In [55]: frame2
• Out[55]:
•        year   state  pop  debt
• one    2000    Ohio  1.5  16.5
• two    2001    Ohio  1.7  16.5
• three  2002    Ohio  3.6  16.5
• four   2001  Nevada  2.4  16.5
• five   2002  Nevada  2.9  16.5
• six    2003  Nevada  3.2  16.5

• In [56]: frame2['debt'] = np.arange(6.)
• In [57]: frame2
• Out[57]:
•        year   state  pop  debt
• one    2000    Ohio  1.5   0.0
• two    2001    Ohio  1.7   1.0
• three  2002    Ohio  3.6   2.0
• four   2001  Nevada  2.4   3.0
• five   2002  Nevada  2.9   4.0
• six    2003  Nevada  3.2   5.0
DataFrame
• If a DataFrame’s index and columns have their name attributes set, these will also be displayed:

• In [72]: frame3.index.name = 'year'; frame3.columns.name = 'state'
• In [73]: frame3
• Out[73]:
• state  Nevada  Ohio
• year
• 2000      NaN   1.5
• 2001      2.4   1.7
• 2002      2.9   3.6

• As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

• In [74]: frame3.values
• Out[74]:
• array([[ nan, 1.5],
•        [ 2.4, 1.7],
•        [ 2.9, 3.6]])
Index Objects
• pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis
name or names).

• Any array or other sequence of labels you use when constructing a Series or DataFrame is
internally converted to an Index:

• In [76]: obj = pd.Series(range(3), index=['a', 'b', 'c'])

• In [77]: index = obj.index

• In [78]: index

• Out[78]: Index(['a', 'b', 'c'], dtype='object')

• In [79]: index[1:]

• Out[79]: Index(['b', 'c'], dtype='object')


Index Objects
• Index objects are immutable and thus can’t be modified by the user:
• index[1] = 'd' # TypeError
• Immutability makes it safer to share Index objects among data structures:
• In [80]: labels = pd.Index(np.arange(3))
• In [81]: labels
• Out[81]: Int64Index([0, 1, 2], dtype='int64')
• In [82]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
• In [83]: obj2
• Out[83]:
• 0 1.5
• 1 -2.5
• 2 0.0
• dtype: float64
• In [84]: obj2.index is labels
• Out[84]: True
Index Objects
• In addition to being array-like, an Index also behaves like a fixed-size set:
• In [85]: frame3
• Out[85]:
• state Nevada Ohio
• year
• 2000 NaN 1.5
• 2001 2.4 1.7
• 2002 2.9 3.6
• In [86]: frame3.columns
• Out[86]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
• In [87]: 'Ohio' in frame3.columns
• Out[87]: True
• In [88]: 2003 in frame3.index
• Out[88]: False
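Beyond membership tests, an Index supports set-like methods such as union and difference. A small sketch (these return new Index objects, since an Index is immutable):

```python
import pandas as pd

cols = pd.Index(['Nevada', 'Ohio'])

# Set-like operations produce new Index objects rather than mutating cols
merged = cols.union(pd.Index(['Ohio', 'Utah']))
print(merged.tolist())                               # ['Nevada', 'Ohio', 'Utah']
print(cols.difference(pd.Index(['Ohio'])).tolist())  # ['Nevada']
```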
Essential Functionality
• Reindexing:

• An important method on pandas objects is reindex, which means to create a new object with the data
conformed to a new index. Consider an example:

• In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

• In [92]: obj

• Out[92]:

• d 4.5

• b 7.2

• a -5.3

• c 3.6
Reindexing:
• Calling reindex on this Series rearranges the data according to the new index, introducing missing values if
any index values were not already present:

• In [93]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

• In [94]: obj2

• Out[94]:

• a -5.3

• b 7.2

• c 3.6

• d 4.5

• e NaN

• dtype: float64
Reindexing:
• For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing.
• The method option allows us to do this, using a method such as ffill, which forward-fills the values:

• In [95]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
• In [96]: obj3
• Out[96]:
• 0      blue
• 2    purple
• 4    yellow
• dtype: object

• In [97]: obj3.reindex(range(6), method='ffill')
• Out[97]:
• 0      blue
• 1      blue
• 2    purple
• 3    purple
• 4    yellow
• 5    yellow
• dtype: object
Reindexing:
• With DataFrame, reindex can alter either the (row) index, columns, or both.
• When passed only a sequence, it reindexes the rows in the result:

• In [98]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
•    ....: index=['a', 'c', 'd'],
•    ....: columns=['Ohio', 'Texas', 'California'])
• In [99]: frame
• Out[99]:
•    Ohio  Texas  California
• a     0      1           2
• c     3      4           5
• d     6      7           8

• In [100]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
• In [101]: frame2
• Out[101]:
•    Ohio  Texas  California
• a   0.0    1.0         2.0
• b   NaN    NaN         NaN
• c   3.0    4.0         5.0
• d   6.0    7.0         8.0
Dropping Entries from an Axis
• Dropping one or more entries from an axis is easy if you already have an index array or list without those entries.
• As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

• In [105]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
• In [106]: obj
• Out[106]:
• a    0.0
• b    1.0
• c    2.0
• d    3.0
• e    4.0
• dtype: float64

• In [107]: new_obj = obj.drop('c')
• In [108]: new_obj
• Out[108]:
• a    0.0
• b    1.0
• d    3.0
• e    4.0
• dtype: float64
Dropping Entries from an Axis
• With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

• In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
•    .....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
•    .....: columns=['one', 'two', 'three', 'four'])
• In [111]: data
• Out[111]:
•           one  two  three  four
• Ohio        0    1      2     3
• Colorado    4    5      6     7
• Utah        8    9     10    11
• New York   12   13     14    15

• Calling drop with a sequence of labels will drop values from the row labels (axis 0):

• In [112]: data.drop(['Colorado', 'Ohio'])
• Out[112]:
•           one  two  three  four
• Utah        8    9     10    11
• New York   12   13     14    15
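To drop from the other axis, pass axis=1 or axis='columns'; a brief sketch using the same example data:

```python
import numpy as np
import pandas as pd

# Same example frame as above
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# axis=1 (or axis='columns') drops column labels; drop returns a new object
result = data.drop('two', axis=1)
print(result.columns.tolist())  # ['one', 'three', 'four']
print('two' in data.columns)    # True -- the original is unchanged
```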
Indexing, Selection, and Filtering
• Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.
• Here are some examples of this:

• In [117]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
• In [118]: obj
• Out[118]:
• a    0.0
• b    1.0
• c    2.0
• d    3.0
• dtype: float64

• In [119]: obj['b']
• Out[119]: 1.0

• In [121]: obj[2:4]
• Out[121]:
• c    2.0
• d    3.0
• dtype: float64

• In [122]: obj[['b', 'a', 'd']]
• Out[122]:
• b    1.0
• a    0.0
• d    3.0
• dtype: float64

• In [124]: obj[obj < 2]
• Out[124]:
• a    0.0
• b    1.0
• dtype: float64
Indexing, Selection, and Filtering
• Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:
• In [125]: obj['b':'c']
• Out[125]:
• b 1.0
• c 2.0
• dtype: float64
• Setting using these methods modifies the corresponding section of the Series:
• In [126]: obj['b':'c'] = 5
• In [127]: obj
• Out[127]:
• a 0.0
• b 5.0
• c 5.0
• d 3.0
• dtype: float64
Sorting and Ranking
• Sorting a dataset by some criterion is another important built-in operation.

• To sort lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object:
• In [201]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
• In [202]: obj.sort_index()
• Out[202]:
• a 1
• b 2
• c 3
• d 0
• dtype: int64
Sorting and Ranking
• With a DataFrame, you can sort by index on either axis:
• In [203]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
• .....: index=['three', 'one'],
• .....: columns=['d', 'a', 'b', 'c'])
• In [204]: frame.sort_index()
• Out[204]:
• d a b c
• one 4 5 6 7
• three 0 1 2 3
• In [205]: frame.sort_index(axis=1)
• Out[205]:
•        a  b  c  d
• three  1  2  3  0
• one    5  6  7  4
Sorting and Ranking

• The data is sorted in ascending order by default, but can be sorted in descending order, too:
• In [206]: frame.sort_index(axis=1, ascending=False)
• Out[206]:
• d c b a
• three 0 3 2 1
• one 4 7 6 5
• To sort a Series by its values, use its sort_values method:
• In [207]: obj = pd.Series([4, 7, -3, 2])
• In [208]: obj.sort_values()
• Out[208]:
• 2 -3
• 3 2
• 0 4
• 1 7
• dtype: int64
Sorting and Ranking
• When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:

• In [211]: frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
• In [212]: frame
• Out[212]:
•    a   b
• 0  0   4
• 1  1   7
• 2  0  -3
• 3  1   2

• To sort by multiple columns, pass a list of names:

• In [214]: frame.sort_values(by=['a', 'b'])
• Out[214]:
•    a   b
• 2  0  -3
• 0  0   4
• 3  1   2
• 1  1   7
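One detail worth noting when sorting by values (a small sketch, not shown on the slide): missing values are sorted to the end of the Series by default:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# NaN entries (positions 1 and 3) land at the end after sorting
print(obj.sort_values().index.tolist())  # [4, 5, 0, 2, 1, 3]
```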
• Ranking assigns ranks from one through the number of valid data points in an array.

• The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean
rank:

• In [215]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

• In [216]: obj.rank()

• Out[216]:

• 0 6.5

• 1 1.0

• 2 6.5

• 3 4.5

• 4 3.0

• 5 2.0

• 6 4.5
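The tie-breaking rule can be changed via the method option; a brief sketch with the same Series, using method='first' (ties ranked by order of appearance rather than the mean):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# method='first' assigns ranks in the order the tied values appear,
# so the first 7 gets rank 6 and the second 7 gets rank 7
print(obj.rank(method='first').tolist())  # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
```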
Axis Indexes with Duplicate Labels
Up until now all of the examples we’ve looked at have had unique axis labels (index
values). While many pandas functions (like reindex) require that the labels be
unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:
In [222]: obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
In [223]: obj
Out[223]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
The index’s is_unique property can tell you whether its labels are unique or not:
In [224]: obj.index.is_unique
Out[224]: False
Data selection is one of the main things that behaves differently with duplicates.
Indexing a label with multiple entries returns a Series, while single entries return a
scalar value:
In [225]: obj['a']
Out[225]:
a    0
a    1
dtype: int64
In [226]: obj['c']
Out[226]: 4
This can make your code more complicated, as the output type from indexing can
vary based on whether a label is repeated or not.
The same logic extends to indexing rows in a DataFrame:
In [227]: df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
In [228]: df
Out[228]:
          0         1         2
a  0.274992  0.228913  1.352917
a  0.886429 -2.001637 -0.371843
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
In [229]: df.loc['b']
Out[229]:
          0         1         2
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values
from the rows or columns of a DataFrame. Compared with the similar methods
found on NumPy arrays, they have built-in handling for missing data. Consider a
small DataFrame:
In [230]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
.....: [np.nan, np.nan], [0.75, -1.3]],
.....: index=['a', 'b', 'c', 'd'],
.....: columns=['one', 'two'])
In [231]: df
Out[231]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
Calling DataFrame’s sum method returns a Series containing column sums:
In [232]: df.sum()
Out[232]:
one 9.25
two -5.80
dtype: float64
Passing axis='columns' or axis=1 sums across the columns
instead:
In [233]: df.sum(axis='columns')
Out[233]:
a 1.40
b 2.60
c NaN
d -0.55
dtype: float64
NA values are excluded unless the entire slice (row or column in
this case) is NA.
This can be disabled with the skipna option:
In [234]: df.mean(axis='columns', skipna=False)
Out[234]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
See Table 5-7 for a list of common options for each reduction
method.
Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained:
In [235]: df.idxmax()
Out[235]:
one    b
two    d
dtype: object
Other methods are accumulations:
In [236]: df.cumsum()
Out[236]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
Another type of method is neither a reduction nor an accumulation. describe is one
such example, producing multiple summary statistics in one shot:
In [237]: df.describe()
Out[237]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
On non-numeric data, describe produces alternative summary statistics:
In [238]: obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
In [239]: obj.describe()
Out[239]:
count     16
unique     3
top        a
freq       8
dtype: object
See Table 5-8 for a list of descriptive and summary statistics.
Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of
arguments. Let’s consider some DataFrames of stock prices and volumes obtained
from Yahoo! Finance using the add-on pandas-datareader package. If you don’t
have it installed already, it can be obtained via conda or pip:
conda install pandas-datareader
I use the pandas_datareader module to download some data for a few stock tickers:

import pandas_datareader.data as web


all_data = {ticker: web.get_data_yahoo(ticker)
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker: data['Adj Close']
for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
for ticker, data in all_data.items()})
I now compute percent changes of the prices, a time series operation which will be
explored further in Chapter 11:
In [242]: returns = price.pct_change()
In [243]: returns.tail()
Out[243]:
AAPL GOOG IBM MSFT
Date
2016-10-17 -0.000680 0.001837 0.002072 -0.003483
2016-10-18 -0.000681 0.019616 -0.026168 0.007690
2016-10-19 -0.002979 0.007846 0.003583 -0.002255
2016-10-20 -0.000512 -0.005652 0.001719 -0.004867
2016-10-21 -0.003930 0.003011 -0.012474 0.042096
The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:
In [244]: returns['MSFT'].corr(returns['IBM'])
Out[244]: 0.49976361144151144
In [245]: returns['MSFT'].cov(returns['IBM'])
Out[245]: 8.8706554797035462e-05
Since MSFT is a valid Python attribute, we can also select these columns using more
concise syntax:
In [246]: returns.MSFT.corr(returns.IBM)
Out[246]: 0.49976361144151144
DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:
In [247]: returns.corr()
Out[247]:
AAPL GOOG IBM MSFT
AAPL 1.000000 0.407919 0.386817 0.389695
GOOG 0.407919 1.000000 0.405099 0.465919
IBM 0.386817 0.405099 1.000000 0.499764
MSFT 0.389695 0.465919 0.499764 1.000000
In [248]: returns.cov()
Out[248]:
AAPL GOOG IBM MSFT
AAPL 0.000277 0.000107 0.000078 0.000095
GOOG 0.000107 0.000251 0.000078 0.000108
IBM 0.000078 0.000078 0.000146 0.000089
MSFT 0.000095 0.000108 0.000089 0.000215
Using DataFrame’s corrwith method, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:
In [249]: returns.corrwith(returns.IBM)
Out[249]:
AAPL 0.386817
GOOG 0.405099
IBM 1.000000
MSFT 0.499764
dtype: float64
Passing a DataFrame computes the correlations of matching column names. Here I
compute correlations of percent changes with volume:
In [250]: returns.corrwith(volume)
Out[250]:
AAPL -0.075565
GOOG -0.007067
IBM -0.204849
MSFT -0.092950
dtype: float64
Passing axis='columns' does things row-by-row instead. In all cases, the data points
are aligned by label before the correlation is computed.
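Since the stock download above requires network access, a self-contained sketch with synthetic data (the column names are illustrative stand-ins for the tickers) shows the corrwith alignment behavior:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the returns frame (no pandas-datareader needed)
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.standard_normal((100, 3)),
                       columns=['AAPL', 'IBM', 'MSFT'])

# Passing a Series yields one correlation per column, aligned by index
result = returns.corrwith(returns['IBM'])
print(round(result['IBM'], 6))  # 1.0 -- a column correlates perfectly with itself
```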