Multivariate - Data - Selection: 0.1 How To Select Dataframe Subsets From Multivariate Data
In [3]: df.head()
0.1.1 Keep only body measures columns, so only columns with “BMX” in the name
In [4]: # get columns names
col_names = df.columns
col_names
In [5]: # One way to get the column names we want to keep is simply by copying from the above o
keep = ['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
'BMXWAIST']
In [6]: # Another way to get only column names that include 'BMX' is with list comprehension
# [keep x for x in list if condition met]
[column for column in col_names if 'BMX' in column]
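The selected names still have to be applied to the dataframe to produce df_BMX. A minimal sketch of that step follows, assuming a tiny stand-in frame (the non-BMX column names and all values are illustrative, not the real data set); it also shows DataFrame.filter as an alternative to the list comprehension.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; only the column names matter here.
df = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "RIDAGEYR": [62, 53, 78],
    "BMXWT": [94.8, 90.4, 83.4],
    "BMXHT": [184.5, 171.4, 170.1],
})

# Option 1: list comprehension over the column names, then select with [].
keep = [column for column in df.columns if "BMX" in column]
df_BMX = df[keep]

# Option 2: DataFrame.filter keeps the columns whose name contains the substring.
df_BMX_filtered = df.filter(like="BMX")

print(df_BMX.columns.tolist())
```

Both options return the same columns in their original order.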
In [9]: df_BMX.head()
There are two methods for selecting by row and column:
• df.loc[row labels or bool, col labels or bool]
• df.iloc[row int or bool, col int or bool]
• .loc is primarily label based, but may also be used with a boolean array.
• .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used
with a boolean array.
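The three kinds of selector described above can be compared side by side. This is a small sketch on an illustrative frame; the row labels "r0" to "r2" are an assumption made for the demo, chosen so that labels and positions are clearly different things.

```python
import pandas as pd

# Illustrative frame with string row labels (hypothetical, for the demo only).
demo = pd.DataFrame(
    {
        "BMXWT": [94.8, 90.4, 83.4],
        "BMXHT": [184.5, 171.4, 170.1],
        "BMXBMI": [27.8, 30.8, 28.8],
    },
    index=["r0", "r1", "r2"],
)

# .loc takes labels on both axes.
by_label = demo.loc[["r0", "r1"], ["BMXWT", "BMXBMI"]]

# .iloc takes integer positions on both axes.
by_position = demo.iloc[[0, 1], [0, 2]]

# Both accept booleans; here a boolean list per axis is passed to .loc.
by_bool = demo.loc[[True, True, False], [True, False, True]]

print(by_label)
```

All three expressions pick the same two rows and two columns.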
Out[10]: BMXWT BMXHT BMXBMI BMXLEG BMXARML BMXARMC BMXWAIST
0 94.8 184.5 27.8 43.3 43.6 35.9 101.1
1 90.4 171.4 30.8 38.0 40.0 33.2 107.9
2 83.4 170.1 28.8 35.6 37.0 31.0 116.5
3 109.8 160.9 42.4 38.5 37.7 38.3 110.1
4 55.2 164.9 20.3 37.4 36.0 27.2 80.4
In [12]: index_bool
Out[12]: array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, True, True, True, True, True, True, True,
False])
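A boolean array like index_bool has one entry per column and can be passed as the column selector. The cell that built it is not shown here, so the following is only one possible construction, using Index.isin on a small hypothetical frame.

```python
import pandas as pd

# Hypothetical stand-in frame; the real data set has many more columns.
df = pd.DataFrame({
    "SEQN": [1, 2],
    "BMXWT": [94.8, 90.4],
    "BMXHT": [184.5, 171.4],
    "RIAGENDR": [1, 1],
})
keep = ["BMXWT", "BMXHT"]

# One boolean per column: True where the column name appears in `keep`.
index_bool = df.columns.isin(keep)

# The same boolean array works as the column selector for .iloc and .loc.
df_BMX = df.iloc[:, index_bool]
df_BMX_loc = df.loc[:, index_bool]

print(index_bool)
```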
In [15]: waist_median
Out[15]: 98.3
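The cells that compute waist_median and apply it are not shown above. A minimal sketch of how such a value could be computed and used as a row condition, assuming five illustrative rows (so the median here differs from the 98.3 reported for the full data):

```python
import pandas as pd

# Five illustrative rows; the full data set gives a different median.
df_BMX = pd.DataFrame({
    "BMXLEG": [43.3, 38.0, 35.6, 38.5, 37.4],
    "BMXWAIST": [101.1, 107.9, 116.5, 110.1, 80.4],
})

# Series.median() skips missing values by default.
waist_median = df_BMX["BMXWAIST"].median()

# The comparison yields one boolean per row, used here to filter rows.
above_median = df_BMX[df_BMX["BMXWAIST"] > waist_median]

print(waist_median)
print(above_median)
```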
In [17]: # Let's add another condition, that 'BMXLEG' must be less than 32
condition1 = df_BMX['BMXWAIST'] > waist_median
condition2 = df_BMX['BMXLEG'] < 32
df_BMX[condition1 & condition2].head() # Using [] method
# Note: use '&' (element-wise) here; Python's 'and' does not work on boolean Series
Out[17]: BMXWT BMXHT BMXBMI BMXLEG BMXARML BMXARMC BMXWAIST
15 80.5 150.8 35.4 31.6 32.7 33.7 113.5
27 75.6 145.2 35.9 31.0 33.1 36.0 108.0
39 63.7 147.9 29.1 26.0 34.0 31.5 110.0
52 105.9 157.7 42.6 29.2 35.0 40.7 129.1
55 77.5 148.3 35.2 30.5 34.0 34.4 107.6
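The note about 'and' versus '&' can be checked directly. The sketch below uses four illustrative rows and the median value reported above; it shows that '&' combines the conditions row by row, while 'and' asks for a single truth value of a whole Series and raises an error.

```python
import pandas as pd

# Four illustrative rows (not the full data set).
df_BMX = pd.DataFrame({
    "BMXLEG": [31.6, 43.3, 26.0, 38.0],
    "BMXWAIST": [113.5, 101.1, 110.0, 80.4],
})
waist_median = 98.3  # value reported in this section for the full data

condition1 = df_BMX["BMXWAIST"] > waist_median
condition2 = df_BMX["BMXLEG"] < 32

# '&' works element-wise and returns one boolean per row.
both = df_BMX[condition1 & condition2]

# 'and' needs one overall True/False for each Series, which is ambiguous.
try:
    condition1 and condition2
    and_failed = False
except ValueError:
    and_failed = True

print(both)
print(and_failed)
```

When the comparisons are written inline, each one needs its own parentheses, for example (df_BMX["BMXWAIST"] > waist_median) & (df_BMX["BMXLEG"] < 32), because '&' binds more tightly than '>' and '<'.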
In [19]: # Let's make a small dataframe and give it a new index so we can more clearly see the diff
tmp = df_BMX.loc[condition1 & condition2, :].head()
tmp.index = ['a', 'b', 'c', 'd', 'e'] # If you use different years than 2015-2016, thi
tmp
Out[20]: a 31.6
b 31.0
Name: BMXLEG, dtype: float64
Out[21]: a 31.6
b 31.0
Name: BMXLEG, dtype: float64
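The two identical outputs above are what one would expect from selecting the same cells once by label and once by position. The input cells are not shown, so the following is only a sketch of one way to get that result; the frame has two columns here, so the position of BMXLEG differs from the full df_BMX.

```python
import pandas as pd

# Values taken from the filtered rows shown earlier; only two columns are kept.
tmp = pd.DataFrame(
    {
        "BMXBMI": [35.4, 35.9, 29.1, 42.6, 35.2],
        "BMXLEG": [31.6, 31.0, 26.0, 29.2, 30.5],
    },
    index=["a", "b", "c", "d", "e"],
)

# Label slice: the end label 'b' is included.
by_label = tmp.loc["a":"b", "BMXLEG"]

# Integer slice: the end position 2 is excluded, so rows 0 and 1 are returned.
by_position = tmp.iloc[0:2, 1]

print(by_label)
```

The difference in slice endpoints (inclusive for labels, exclusive for positions) is the main thing to watch when switching between .loc and .iloc.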
---------------------------------------------------------------------------
<ipython-input-22-83067c5cae7c> in <module>()
----> 1 tmp[:, 'BMXBMI']
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
0.1.5 Problem
The above gives: TypeError: unhashable type: ‘slice’
The [ ] method looks up columns by hashing the key it is given. Here the key is the pair
(:, ‘BMXBMI’), and a ‘slice’ (the ‘:’) has no hash in this Python version, so the lookup fails with
this TypeError. The [ ] method cannot select rows and columns at the same time; use .loc or
.iloc for that.
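Two working alternatives to the failing call are sketched below on a small illustrative frame. The exact exception raised by the failing form depends on the pandas and Python versions, so the sketch catches a generic Exception rather than assuming TypeError.

```python
import pandas as pd

tmp = pd.DataFrame(
    {"BMXBMI": [35.4, 35.9], "BMXLEG": [31.6, 31.0]},
    index=["a", "b"],
)

# [] accepts a single column key (or a list of keys), not a (rows, columns) pair.
try:
    tmp[:, "BMXBMI"]
    failed = False
except Exception:  # exception type varies across versions
    failed = True

# Either of these selects the whole column.
col_brackets = tmp["BMXBMI"]
col_loc = tmp.loc[:, "BMXBMI"]

print(failed)
print(col_loc)
```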
Out[23]: a 35.4
b 35.9
c 29.1
d 42.6
e 35.2
Name: BMXBMI, dtype: float64
In [25]: tmp.iloc[:, 'BMXBMI']
---------------------------------------------------------------------------
/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _has_valid_tuple(self
222 try:
--> 223 self._validate_key(k, i)
224 except ValueError:
/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_key(self, k
2083 raise ValueError("Can only index by location with "
-> 2084 "a [{types}]".format(types=self._valid_types))
2085
ValueError: Can only index by location with a [integer, integer slice (START point is I
<ipython-input-25-9fa39d4097e1> in <module>()
----> 1 tmp.iloc[:, 'BMXBMI']
/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_tuple(self,
2141 def _getitem_tuple(self, tup):
2142
-> 2143 self._has_valid_tuple(tup)
2144 try:
2145 return self._getitem_lowerdim(tup)
/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _has_valid_tuple(self
225 raise ValueError("Location based indexing can only have "
226 "[{types}] types"
--> 227 .format(types=self._valid_types))
228
229 def _is_nested_tuple_indexer(self, tup):
ValueError: Location based indexing can only have [integer, integer slice (START point
0.1.6 Problem
The above gives: ValueError: Location based indexing can only have [integer, integer slice (START
point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
‘BMXBMI’ is a column label (a string). It is not an integer between 0 and the number of columns - 1,
an integer slice, a list of integers, or a boolean array, so it is the wrong type of key for .iloc.
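If only the column name is known, it can be translated to an integer position before calling .iloc. A short sketch using Index.get_loc on an illustrative frame:

```python
import pandas as pd

tmp = pd.DataFrame(
    {
        "BMXWT": [80.5, 75.6],
        "BMXHT": [150.8, 145.2],
        "BMXBMI": [35.4, 35.9],
    },
    index=["a", "b"],
)

# Index.get_loc returns the integer position of a label (for unique labels).
position = tmp.columns.get_loc("BMXBMI")

# The position is a valid key for .iloc.
col = tmp.iloc[:, position]

print(position)
print(col)
```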
In [ ]: tmp.iloc[:, 2]
In [ ]: tmp.loc[:, 2]
0.1.7 Problem
The above code gives: TypeError: cannot do label indexing on <class
'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'int'>
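The error comes from tmp.loc[:, 2]: .loc expects labels, and 2 is not one of the labels (i.e. column
names) in the dataframe. tmp.iloc[:, 2] works because 2 is a valid integer position.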
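The reverse translation, from an integer position to a label that .loc accepts, is sketched below. Which exception the failing call raises depends on the pandas version (newer releases may raise KeyError instead of TypeError), so both are caught.

```python
import pandas as pd

tmp = pd.DataFrame(
    {
        "BMXWT": [80.5, 75.6],
        "BMXHT": [150.8, 145.2],
        "BMXBMI": [35.4, 35.9],
    },
    index=["a", "b"],
)

# .loc with an integer fails because no column is labelled 2.
try:
    tmp.loc[:, 2]
    failed = False
except (KeyError, TypeError):
    failed = True

# Look up the label stored at position 2, then use it with .loc.
label = tmp.columns[2]
col = tmp.loc[:, label]

print(failed, label)
```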
In [29]: # Here is another example of using a boolean list for indexing columns
tmp.loc[:, [False, False, True] + [False] * 4]
Out[29]: BMXBMI
a 0
b 1
c 2
d 3
e 4
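Writing the boolean list by hand is error-prone when there are many columns. As a sketch, the same mask can be built by comparing the column names to the wanted name; the frame below is illustrative. Note that a boolean mask returns a DataFrame, while a single integer position returns a Series, which matches the different shapes of the two outputs shown here.

```python
import pandas as pd

tmp = pd.DataFrame(
    {
        "BMXWT": [80.5, 75.6],
        "BMXHT": [150.8, 145.2],
        "BMXBMI": [35.4, 35.9],
    },
    index=["a", "b"],
)

# Comparing the column Index to a name gives one boolean per column.
mask = tmp.columns == "BMXBMI"

as_frame = tmp.loc[:, mask]   # boolean mask -> DataFrame with one column
as_series = tmp.iloc[:, 2]    # single position -> Series

print(type(as_frame).__name__, type(as_series).__name__)
```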
In [30]: tmp.iloc[:, 2]
Out[30]: a 0
b 1
c 2
d 3
e 4
Name: BMXBMI, dtype: int64
In [31]: # We can use the .loc and .iloc methods to change values within the dataframe
tmp.iloc[0:3,2] = [0]*3
tmp.iloc[:,2]
Out[31]: a 0
b 0
c 0
d 3
e 4
Name: BMXBMI, dtype: int64
Out[32]: a 1
b 1
c 1
d 3
e 4
Name: BMXBMI, dtype: int64
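The second output above is consistent with a label-based assignment through .loc, although that input cell is not shown. The sketch below reproduces both steps on a one-column stand-in for tmp; the .loc line is an assumption about what was run.

```python
import pandas as pd

# One-column stand-in for tmp, starting from the values 0..4.
tmp = pd.DataFrame({"BMXBMI": [0, 1, 2, 3, 4]}, index=["a", "b", "c", "d", "e"])

# Position-based assignment: rows 0, 1, 2 of column 0.
tmp.iloc[0:3, 0] = [0] * 3
after_iloc = tmp["BMXBMI"].tolist()

# Label-based assignment (assumed): rows 'a' through 'c', inclusive.
tmp.loc["a":"c", "BMXBMI"] = [1] * 3
after_loc = tmp["BMXBMI"].tolist()

print(after_iloc)
print(after_loc)
```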
In [33]: # We can use the [] method when changing all the values of a column
tmp['BMXBMI'] = range(0, 5)
tmp
In [34]: # We will get a warning when using the [] method with conditions to set new values in
tmp[tmp.BMXBMI > 2]['BMXBMI'] = [10]*2 # Setting new values to a copy of tmp, but not
tmp
# You can see that the above code did not change our dataframe 'tmp'. This
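The chained form above first builds a filtered copy and then assigns into that copy, which is why tmp stays unchanged. A minimal sketch of the usual alternative, a single .loc call that takes the row condition and the column label together, on a one-column stand-in for tmp:

```python
import pandas as pd

# One-column stand-in for tmp.
tmp = pd.DataFrame({"BMXBMI": [0, 1, 2, 3, 4]}, index=["a", "b", "c", "d", "e"])

# One indexing step: the row condition and the column label go in the same .loc call,
# so the assignment is applied to tmp itself.
tmp.loc[tmp["BMXBMI"] > 2, "BMXBMI"] = [10] * 2

print(tmp)
```

The list on the right-hand side must have as many entries as rows matched by the condition (two here); a single scalar such as 10 would also work.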