Unit 04 Pandas

Introduction to pandas

• pandas contains data structures and data manipulation tools designed to make data
cleaning and analysis fast and easy in Python.

• pandas is often used in tandem with numerical computing tools like NumPy and SciPy,
analytical libraries like statsmodels and scikit-learn, and data visualization libraries like
matplotlib.

• pandas adopts significant parts of NumPy’s idiomatic style of array-based computing,
especially array-based functions and a preference for data processing without for loops.

• Since becoming an open source project in 2010, pandas has matured into a quite large
library that’s applicable in a broad set of real-world use cases.
Introduction to pandas

• While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data.

• NumPy, by contrast, is best suited for working with homogeneous numerical array data.

• We use the following import convention for pandas:

• In [1]: import pandas as pd

• Thus, whenever you see pd. in code, it’s referring to pandas.

• You may also find it easier to import Series and DataFrame into the local namespace since
they are so frequently used:

• In [2]: from pandas import Series, DataFrame


Introduction to pandas Data Structures
• To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. They
provide a solid, easy-to-use basis for most applications.

• Series
• A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated
array of data labels, called its index. The simplest Series is formed from only an array of data:

• In [11]: obj = pd.Series([4, 7, -5, 3])

• In [12]: obj

• Out[12]:

• 0    4
• 1    7
• 2   -5
• 3    3
• dtype: int64

• The string representation of a Series displayed interactively shows the index on the left and the values on the right.

• Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created.
Series
• You can get the array representation and index object of the Series via its values and index attributes,
respectively:

• In [13]: obj.values

• Out[13]: array([ 4, 7, -5, 3])

• In [14]: obj.index # like range(4)

• Out[14]: RangeIndex(start=0, stop=4, step=1)


Series

• Often it will be desirable to create a Series with an index identifying each data point with a label:

• In [15]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

• In [16]: obj2

• Out[16]:

• d 4

• b 7

• a -5

• c 3

• dtype: int64

• In [17]: obj2.index

• Out[17]: Index(['d', 'b', 'a', 'c'], dtype='object')


Series

• Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

• In [18]: obj2['a']

• Out[18]: -5

• In [19]: obj2['d'] = 6

• In [20]: obj2[['c', 'a', 'd']]

• Out[20]:

• c 3

• a -5

• d 6

• dtype: int64
Series
• Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, will preserve
the index-value link:

• In [21]: obj2[obj2 > 0]

• Out[21]:

• d    6
• b    7
• c    3
• dtype: int64

• In [22]: obj2 * 2

• Out[22]:

• d    12
• b    14
• a   -10
• c     6
• dtype: int64
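NumPy ufuncs behave the same way; a brief sketch (np.exp chosen here as an illustrative math function):

```python
import numpy as np
import pandas as pd

# obj2 as it stands at this point in the examples (after obj2['d'] = 6)
obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Applying a NumPy ufunc returns a new Series with the index preserved
result = np.exp(obj2)
print(result.index.tolist())  # ['d', 'b', 'a', 'c']
```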
Series
• If you have data contained in a Python dict, you can create a Series from it by passing the dict:

• In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

• In [27]: obj3 = pd.Series(sdata)

• In [28]: obj3

• Out[28]:

• Ohio      35000
• Oregon    16000
• Texas     71000
• Utah       5000
• dtype: int64

• When you are only passing a dict, the index in the resulting Series will have the dict’s keys in sorted order (in recent pandas versions, the dict’s insertion order is preserved instead).

• You can override this by passing the dict keys in the order you want them to appear in the resulting Series:

• In [29]: states = ['California', 'Ohio', 'Oregon', 'Texas']

• In [30]: obj4 = pd.Series(sdata, index=states)

• In [31]: obj4

• Out[31]:

• California        NaN
• Ohio          35000.0
• Oregon        16000.0
• Texas         71000.0
• dtype: float64
Series
• Here, three values found in sdata were placed in the appropriate locations, but since no value for
'California' was found, it appears as NaN (not a number), which is considered in pandas to mark
missing or NA values.

• Since 'Utah' was not included in states, it is excluded from the resulting object.

• I will use the terms “missing” or “NA” interchangeably to refer to missing data.
• The isnull and notnull functions in pandas should be used to detect missing data:

• In [32]: pd.isnull(obj4)

• Out[32]:

• California     True
• Ohio          False
• Oregon        False
• Texas         False
• dtype: bool

• In [33]: pd.notnull(obj4)

• Out[33]:

• California    False
• Ohio           True
• Oregon         True
• Texas          True
• dtype: bool
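These checks also exist as Series instance methods; a small sketch using obj4 from above:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# isnull/notnull are available on the Series itself, not only as pd functions
print(int(obj4.isnull().sum()))      # 1 -- only California is missing
print(bool(obj4.notnull()['Ohio']))  # True
```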
A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations:

• In [35]: obj3
• Out[35]:
• Ohio      35000
• Oregon    16000
• Texas     71000
• Utah       5000
• dtype: int64

• In [36]: obj4
• Out[36]:
• California        NaN
• Ohio          35000.0
• Oregon        16000.0
• Texas         71000.0
• dtype: float64

• In [37]: obj3 + obj4
• Out[37]:
• California         NaN
• Ohio           70000.0
• Oregon         32000.0
• Texas         142000.0
• Utah               NaN
• dtype: float64
Both the Series object itself and its index have a name attribute, which integrates with other key areas
of pandas functionality:
• In [38]: obj4.name = 'population'
• In [39]: obj4.index.name = 'state'
• In [40]: obj4
• Out[40]:
• state
• California NaN
• Ohio 35000.0
• Oregon 16000.0
• Texas 71000.0
• Name: population, dtype: float64
DataFrame
• A DataFrame represents a rectangular table of data and contains an ordered collection of
columns, each of which can be a different value type (numeric, string, boolean, etc.).

• The DataFrame has both a row and column index; it can be thought of as a dict of Series all
sharing the same index.

• Internally, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some
other collection of one-dimensional arrays.
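The "dict of Series" view can be made concrete with a small sketch (the names here are illustrative, not from the slide):

```python
import pandas as pd

# Two Series sharing one index act like the columns of a DataFrame
pop = pd.Series([1.5, 1.7, 3.6], index=['a', 'b', 'c'])
year = pd.Series([2000, 2001, 2002], index=['a', 'b', 'c'])
df = pd.DataFrame({'pop': pop, 'year': year})

# Selecting a column hands back a Series carrying the shared row index
col = df['pop']
print(col.index.tolist())  # ['a', 'b', 'c']
```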
DataFrame

• There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

• data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
•         'year': [2000, 2001, 2002, 2001, 2002, 2003],
•         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
• frame = pd.DataFrame(data)

• The resulting DataFrame will have its index assigned automatically as with Series, and the columns are placed in sorted order (in recent pandas versions, the dict’s insertion order is preserved instead):

• In [45]: frame
• Out[45]:
•    pop   state  year
• 0  1.5    Ohio  2000
• 1  1.7    Ohio  2001
• 2  3.6    Ohio  2002
• 3  2.4  Nevada  2001
• 4  2.9  Nevada  2002
• 5  3.2  Nevada  2003
DataFrame
• For large DataFrames, the head method selects only the first five rows:

• In [46]: frame.head()
• Out[46]:
•    pop   state  year
• 0  1.5    Ohio  2000
• 1  1.7    Ohio  2001
• 2  3.6    Ohio  2002
• 3  2.4  Nevada  2001
• 4  2.9  Nevada  2002

• If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

• In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
• Out[47]:
•    year   state  pop
• 0  2000    Ohio  1.5
• 1  2001    Ohio  1.7
• 2  2002    Ohio  3.6
• 3  2001  Nevada  2.4
• 4  2002  Nevada  2.9
• 5  2003  Nevada  3.2

DataFrame
• If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:
• In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
• ....: index=['one', 'two', 'three', 'four',
• ....: 'five', 'six'])
• In [49]: frame2
• Out[49]:
• year state pop debt
• one 2000 Ohio 1.5 NaN
• two 2001 Ohio 1.7 NaN
• three 2002 Ohio 3.6 NaN
• four 2001 Nevada 2.4 NaN
• five 2002 Nevada 2.9 NaN
• six 2003 Nevada 3.2 NaN
• In [50]: frame2.columns
• Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
DataFrame
• A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
• In [51]: frame2['state']
• Out[51]:
• one Ohio
• two Ohio
• three Ohio
• four Nevada
• five Nevada
• six Nevada
• Name: state, dtype: object
• In [52]: frame2.year
• Out[52]:
• one 2000
• two 2001
• three 2002
• four 2001
• five 2002
• six 2003
• Name: year, dtype: int64
DataFrame
• Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

• In [54]: frame2['debt'] = 16.5
• In [55]: frame2
• Out[55]:
•        year   state  pop  debt
• one    2000    Ohio  1.5  16.5
• two    2001    Ohio  1.7  16.5
• three  2002    Ohio  3.6  16.5
• four   2001  Nevada  2.4  16.5
• five   2002  Nevada  2.9  16.5
• six    2003  Nevada  3.2  16.5

• In [56]: frame2['debt'] = np.arange(6.)
• In [57]: frame2
• Out[57]:
•        year   state  pop  debt
• one    2000    Ohio  1.5   0.0
• two    2001    Ohio  1.7   1.0
• three  2002    Ohio  3.6   2.0
• four   2001  Nevada  2.4   3.0
• five   2002  Nevada  2.9   4.0
• six    2003  Nevada  3.2   5.0
DataFrame
• If a DataFrame’s index and columns have their name attributes set, these will also be displayed:

• In [72]: frame3.index.name = 'year'; frame3.columns.name = 'state'
• In [73]: frame3
• Out[73]:
• state  Nevada  Ohio
• year
• 2000      NaN   1.5
• 2001      2.4   1.7
• 2002      2.9   3.6

• As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

• In [74]: frame3.values
• Out[74]:
• array([[ nan, 1.5],
•        [ 2.4, 1.7],
•        [ 2.9, 3.6]])
Index Objects
• pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis
name or names).

• Any array or other sequence of labels you use when constructing a Series or DataFrame is
internally converted to an Index:

• In [76]: obj = pd.Series(range(3), index=['a', 'b', 'c'])

• In [77]: index = obj.index

• In [78]: index

• Out[78]: Index(['a', 'b', 'c'], dtype='object')

• In [79]: index[1:]

• Out[79]: Index(['b', 'c'], dtype='object')


Index Objects
• Index objects are immutable and thus can’t be modified by the user:
• index[1] = 'd' # TypeError
• Immutability makes it safer to share Index objects among data structures:
• In [80]: labels = pd.Index(np.arange(3))
• In [81]: labels
• Out[81]: Int64Index([0, 1, 2], dtype='int64')
• In [82]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
• In [83]: obj2
• Out[83]:
• 0 1.5
• 1 -2.5
• 2 0.0
• dtype: float64
• In [84]: obj2.index is labels
• Out[84]: True
Index Objects
• In addition to being array-like, an Index also behaves like a fixed-size set:
• In [85]: frame3
• Out[85]:
• state Nevada Ohio
• year
• 2000 NaN 1.5
• 2001 2.4 1.7
• 2002 2.9 3.6
• In [86]: frame3.columns
• Out[86]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
• In [87]: 'Ohio' in frame3.columns
• Out[87]: True
• In [88]: 2003 in frame3.index
• Out[88]: False
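Beyond membership tests, an Index supports set-like methods such as union and difference. A small sketch (these return new Index objects, since an Index is immutable):

```python
import pandas as pd

cols = pd.Index(['Nevada', 'Ohio'])

# Set-like operations produce new Index objects rather than mutating cols
merged = cols.union(pd.Index(['Ohio', 'Utah']))
print(merged.tolist())                               # ['Nevada', 'Ohio', 'Utah']
print(cols.difference(pd.Index(['Ohio'])).tolist())  # ['Nevada']
```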
Essential Functionality
• Reindexing:

• An important method on pandas objects is reindex, which means to create a new object with the data
conformed to a new index. Consider an example:

• In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

• In [92]: obj

• Out[92]:

• d 4.5

• b 7.2

• a -5.3

• c 3.6
Reindexing:
• Calling reindex on this Series rearranges the data according to the new index, introducing missing values if
any index values were not already present:

• In [93]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

• In [94]: obj2

• Out[94]:

• a -5.3

• b 7.2

• c 3.6

• d 4.5

• e NaN

• dtype: float64
Reindexing:
• For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing.
• The method option allows us to do this, using a method such as ffill, which forward-fills the values:

• In [95]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
• In [96]: obj3
• Out[96]:
• 0      blue
• 2    purple
• 4    yellow
• dtype: object

• In [97]: obj3.reindex(range(6), method='ffill')
• Out[97]:
• 0      blue
• 1      blue
• 2    purple
• 3    purple
• 4    yellow
• 5    yellow
• dtype: object
Reindexing:
• With DataFrame, reindex can alter either the (row) index, columns, or both.
• When passed only a sequence, it reindexes the rows in the result:

• In [98]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
•    ....: index=['a', 'c', 'd'],
•    ....: columns=['Ohio', 'Texas', 'California'])
• In [99]: frame
• Out[99]:
•    Ohio  Texas  California
• a     0      1           2
• c     3      4           5
• d     6      7           8

• In [100]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
• In [101]: frame2
• Out[101]:
•    Ohio  Texas  California
• a   0.0    1.0         2.0
• b   NaN    NaN         NaN
• c   3.0    4.0         5.0
• d   6.0    7.0         8.0
Dropping Entries from an Axis
• Dropping one or more entries from an axis is easy if you already have an index array or list without those entries.
• As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

• In [105]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
• In [106]: obj
• Out[106]:
• a    0.0
• b    1.0
• c    2.0
• d    3.0
• e    4.0
• dtype: float64

• In [107]: new_obj = obj.drop('c')
• In [108]: new_obj
• Out[108]:
• a    0.0
• b    1.0
• d    3.0
• e    4.0
• dtype: float64
Dropping Entries from an Axis
• With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

• In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
•    .....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
•    .....: columns=['one', 'two', 'three', 'four'])
• In [111]: data
• Out[111]:
•           one  two  three  four
• Ohio        0    1      2     3
• Colorado    4    5      6     7
• Utah        8    9     10    11
• New York   12   13     14    15

• Calling drop with a sequence of labels will drop values from the row labels (axis 0):

• In [112]: data.drop(['Colorado', 'Ohio'])
• Out[112]:
•           one  two  three  four
• Utah        8    9     10    11
• New York   12   13     14    15
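To drop from the other axis, pass axis=1 or axis='columns'; a brief sketch using the same example data:

```python
import numpy as np
import pandas as pd

# Same example frame as above
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# axis=1 (or axis='columns') drops column labels; drop returns a new object
result = data.drop('two', axis=1)
print(result.columns.tolist())  # ['one', 'three', 'four']
print('two' in data.columns)    # True -- the original is unchanged
```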
Indexing, Selection, and Filtering
• Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.
• Here are some examples of this:

• In [117]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
• In [118]: obj
• Out[118]:
• a    0.0
• b    1.0
• c    2.0
• d    3.0
• dtype: float64

• In [119]: obj['b']
• Out[119]: 1.0

• In [121]: obj[2:4]
• Out[121]:
• c    2.0
• d    3.0
• dtype: float64

• In [122]: obj[['b', 'a', 'd']]
• Out[122]:
• b    1.0
• a    0.0
• d    3.0
• dtype: float64

• In [124]: obj[obj < 2]
• Out[124]:
• a    0.0
• b    1.0
• dtype: float64
Indexing, Selection, and Filtering
• Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:
• In [125]: obj['b':'c']
• Out[125]:
• b 1.0
• c 2.0
• dtype: float64
• Setting using these methods modifies the corresponding section of the Series:
• In [126]: obj['b':'c'] = 5
• In [127]: obj
• Out[127]:
• a 0.0
• b 5.0
• c 5.0
• d 3.0
• dtype: float64
Sorting and Ranking
• Sorting a dataset by some criterion is another important built-in operation.

• To sort lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object:
• In [201]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
• In [202]: obj.sort_index()
• Out[202]:
• a 1
• b 2
• c 3
• d 0
• dtype: int64
Sorting and Ranking
• With a DataFrame, you can sort by index on either axis:
• In [203]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
• .....: index=['three', 'one'],
• .....: columns=['d', 'a', 'b', 'c'])
• In [204]: frame.sort_index()
• Out[204]:
• d a b c
• one 4 5 6 7
• three 0 1 2 3
• In [205]: frame.sort_index(axis=1)
• Out[205]:
•        a  b  c  d
• three  1  2  3  0
• one    5  6  7  4
Sorting and Ranking

• The data is sorted in ascending order by default, but can be sorted in descending order, too:
• In [206]: frame.sort_index(axis=1, ascending=False)
• Out[206]:
• d c b a
• three 0 3 2 1
• one 4 7 6 5
• To sort a Series by its values, use its sort_values method:
• In [207]: obj = pd.Series([4, 7, -3, 2])
• In [208]: obj.sort_values()
• Out[208]:
• 2 -3
• 3 2
• 0 4
• 1 7
• dtype: int64
Sorting and Ranking
• When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:

• In [211]: frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
• In [212]: frame
• Out[212]:
•    a   b
• 0  0   4
• 1  1   7
• 2  0  -3
• 3  1   2

• To sort by multiple columns, pass a list of names:

• In [214]: frame.sort_values(by=['a', 'b'])
• Out[214]:
•    a   b
• 2  0  -3
• 0  0   4
• 3  1   2
• 1  1   7
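One detail worth noting when sorting by values (a small sketch, not shown on the slide): missing values are sorted to the end of the Series by default:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# NaN entries (positions 1 and 3) land at the end after sorting
print(obj.sort_values().index.tolist())  # [4, 5, 0, 2, 1, 3]
```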
• Ranking assigns ranks from one through the number of valid data points in an array.

• The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean
rank:

• In [215]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

• In [216]: obj.rank()

• Out[216]:

• 0 6.5

• 1 1.0

• 2 6.5

• 3 4.5

• 4 3.0

• 5 2.0

• 6 4.5
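The tie-breaking rule can be changed via the method option; a brief sketch with the same Series, using method='first' (ties ranked by order of appearance rather than the mean):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# method='first' assigns ranks in the order the tied values appear,
# so the first 7 gets rank 6 and the second 7 gets rank 7
print(obj.rank(method='first').tolist())  # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
```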
Axis Indexes with Duplicate Labels
Up until now all of the examples we’ve looked at have had unique axis labels (index
values). While many pandas functions (like reindex) require that the labels be
unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:
In [222]: obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
In [223]: obj
Out[223]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
The index’s is_unique property can tell you whether its labels are unique or not:
In [224]: obj.index.is_unique
Out[224]: False
Data selection is one of the main things that behaves differently with duplicates.
Indexing a label with multiple entries returns a Series, while single entries return a
scalar value:
In [225]: obj['a']
Out[225]:
a    0
a    1
dtype: int64
In [226]: obj['c']
Out[226]: 4
This can make your code more complicated, as the output type from indexing can
vary based on whether a label is repeated or not.
The same logic extends to indexing rows in a DataFrame:
In [227]: df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
In [228]: df
Out[228]:
          0         1         2
a  0.274992  0.228913  1.352917
a  0.886429 -2.001637 -0.371843
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
In [229]: df.loc['b']
Out[229]:
          0         1         2
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values
from the rows or columns of a DataFrame. Compared with the similar methods
found on NumPy arrays, they have built-in handling for missing data. Consider a
small DataFrame:
In [230]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
.....: [np.nan, np.nan], [0.75, -1.3]],
.....: index=['a', 'b', 'c', 'd'],
.....: columns=['one', 'two'])
In [231]: df
Out[231]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
Calling DataFrame’s sum method returns a Series containing column sums:
In [232]: df.sum()
Out[232]:
one 9.25
two -5.80
dtype: float64
Passing axis='columns' or axis=1 sums across the columns
instead:
In [233]: df.sum(axis='columns')
Out[233]:
a 1.40
b 2.60
c NaN
d -0.55
dtype: float64
NA values are excluded unless the entire slice (row or column in
this case) is NA.
This can be disabled with the skipna option:
In [234]: df.mean(axis='columns', skipna=False)
Out[234]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
See Table 5-7 for a list of common options for each reduction
method.
Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained:
In [235]: df.idxmax()
Out[235]:
one    b
two    d
dtype: object
Other methods are accumulations:
In [236]: df.cumsum()
Out[236]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
Another type of method is neither a reduction nor an accumulation. describe is one
such example, producing multiple summary statistics in one shot:
In [237]: df.describe()
Out[237]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
On non-numeric data, describe produces alternative summary statistics:
In [238]: obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
In [239]: obj.describe()
Out[239]:
count     16
unique     3
top        a
freq       8
dtype: object
See Table 5-8 for a list of descriptive and summary statistics.
Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of
arguments. Let’s consider some DataFrames of stock prices and volumes obtained
from Yahoo! Finance using the add-on pandas-datareader package. If you don’t
have it installed already, it can be obtained via conda or pip:
conda install pandas-datareader
I use the pandas_datareader module to download some data for a few stock tickers:

import pandas_datareader.data as web


all_data = {ticker: web.get_data_yahoo(ticker)
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker: data['Adj Close']
for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
for ticker, data in all_data.items()})
I now compute percent changes of the prices, a time series operation which will be
explored further in Chapter 11:
In [242]: returns = price.pct_change()
In [243]: returns.tail()
Out[243]:
AAPL GOOG IBM MSFT
Date
2016-10-17 -0.000680 0.001837 0.002072 -0.003483
2016-10-18 -0.000681 0.019616 -0.026168 0.007690
2016-10-19 -0.002979 0.007846 0.003583 -0.002255
2016-10-20 -0.000512 -0.005652 0.001719 -0.004867
2016-10-21 -0.003930 0.003011 -0.012474 0.042096
The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:
In [244]: returns['MSFT'].corr(returns['IBM'])
Out[244]: 0.49976361144151144
In [245]: returns['MSFT'].cov(returns['IBM'])
Out[245]: 8.8706554797035462e-05
Since MSFT is a valid Python attribute, we can also select these columns using more
concise syntax:
In [246]: returns.MSFT.corr(returns.IBM)
Out[246]: 0.49976361144151144
DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:
In [247]: returns.corr()
Out[247]:
AAPL GOOG IBM MSFT
AAPL 1.000000 0.407919 0.386817 0.389695
GOOG 0.407919 1.000000 0.405099 0.465919
IBM 0.386817 0.405099 1.000000 0.499764
MSFT 0.389695 0.465919 0.499764 1.000000
In [248]: returns.cov()
Out[248]:
AAPL GOOG IBM MSFT
AAPL 0.000277 0.000107 0.000078 0.000095
GOOG 0.000107 0.000251 0.000078 0.000108
IBM 0.000078 0.000078 0.000146 0.000089
MSFT 0.000095 0.000108 0.000089 0.000215
Using DataFrame’s corrwith method, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:
In [249]: returns.corrwith(returns.IBM)
Out[249]:
AAPL 0.386817
GOOG 0.405099
IBM 1.000000
MSFT 0.499764
dtype: float64
Passing a DataFrame computes the correlations of matching column names. Here I
compute correlations of percent changes with volume:
In [250]: returns.corrwith(volume)
Out[250]:
AAPL -0.075565
GOOG -0.007067
IBM -0.204849
MSFT -0.092950
dtype: float64
Passing axis='columns' does things row-by-row instead. In all cases, the data points
are aligned by label before the correlation is computed.
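Since the stock download above requires network access, a self-contained sketch with synthetic data (the column names are illustrative stand-ins for the tickers) shows the corrwith alignment behavior:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the returns frame (no pandas-datareader needed)
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.standard_normal((100, 3)),
                       columns=['AAPL', 'IBM', 'MSFT'])

# Passing a Series yields one correlation per column, aligned by index
result = returns.corrwith(returns['IBM'])
print(round(result['IBM'], 6))  # 1.0 -- a column correlates perfectly with itself
```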