Introduction
Pandas has a dual selection capability: you can select a subset of data by index position or by index label. In this post, I will show you how to "Select a Subset of Data Using Lexicographical Slicing".
Plenty of datasets are available online. This post uses the movies dataset from Kaggle; search for "movies dataset" on kaggle.com to find it.
How to do it
Import the movies dataset with only the columns required for this example.
import pandas as pd
import numpy as np

movies = pd.read_csv(
    "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",
    index_col="title",
    usecols=["title", "budget", "vote_average", "vote_count"],
)
movies.sample(n=5)
title | budget | vote_average | vote_count |
---|---|---|---|
Little Voice | 0 | 6.6 | 61 |
Grown Ups 2 | 80000000 | 5.8 | 1155 |
The Best Years of Our Lives | 2100000 | 7.6 | 143 |
Tusk | 2800000 | 5.1 | 366 |
Operation Chromite | 0 | 5.8 | 29 |
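If you want to see what `index_col` and `usecols` do without downloading anything, here is a minimal sketch on an in-memory CSV. The rows reuse values from the sample above, while the `notes` column and its values are made up purely to show that `usecols` drops columns you do not list.

```python
import io
import pandas as pd

# Hypothetical CSV text: "notes" is an invented extra column, not part of the real dataset.
csv_text = """title,budget,vote_average,vote_count,notes
Tusk,2800000,5.1,366,a
Grown Ups 2,80000000,5.8,1155,b
Little Voice,0,6.6,61,c
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col="title",  # use the title column as the row labels
    usecols=["title", "budget", "vote_average", "vote_count"],  # "notes" is skipped
)
print(sample)
```

The `title` column becomes the index, so it no longer shows up among the regular columns.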
I always recommend sorting the index, especially if the index is made up of strings. With a sorted index, you will notice the difference when you are dealing with a huge dataset.
What if I don't sort the index?
No problem, your code is just going to run forever. Just kidding. If the index labels are unsorted, pandas may have to traverse the labels one by one to match your query. Just imagine an Oxford dictionary without an index page; what are you going to do? With the index sorted, you can jump quickly to the label you want to extract, and the same is true for pandas.
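To make the dictionary analogy concrete, here is a small sketch on a toy index built from a few titles that appear in this post. It assumes `Index.searchsorted`, which relies on the labels being in order to find an insertion point by binary search; it illustrates the idea and is not a description of what pandas does internally for every lookup.

```python
import pandas as pd

# Toy index using a handful of titles from this post (not the full dataset).
unsorted_index = pd.Index(["Tusk", "Abandon", "Little Voice", "Battleship", "Grown Ups 2"])
sorted_index = unsorted_index.sort_values()

print(unsorted_index.is_monotonic_increasing)
print(sorted_index.is_monotonic_increasing)

# On the sorted index, binary search tells us where the label "B" would be inserted,
# so everything before that position sorts ahead of "B".
pos = sorted_index.searchsorted("B")
print(pos, list(sorted_index[:pos]))
```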
Let us check first if our index is sorted or not.
# check whether the index is sorted
# (older pandas versions also offered the alias movies.index.is_monotonic)
movies.index.is_monotonic_increasing
False
Clearly, the index is unsorted. We will now try to select the movies whose titles start with A. This is roughly like writing
select * from movies where title like 'A%'
movies.loc["Aa":"Bb"]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4844             try:
-> 4845                 return self._searchsorted_monotonic(label, side)
   4846             except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
   4805
-> 4806         raise ValueError("index must be monotonic increasing or decreasing")
   4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1910         if isinstance(key, slice):
   1911             self._validate_key(key, axis)
-> 1912             return self._get_slice_axis(key, axis=axis)
   1913         elif com.is_bool_indexer(key):
   1914             return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1794
   1795         labels = obj._get_axis(axis)
-> 1796         indexer = labels.slice_indexer(
   1797             slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
   1798         )

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   4711         slice(1, 3)
   4712         """
-> 4713         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714
   4715         # return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   4924         start_slice = None
   4925         if start is not None:
-> 4926             start_slice = self.get_slice_bound(start, "left", kind)
   4927         if start_slice is None:
   4928             start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4846             except ValueError:
   4847                 # raise the original KeyError
-> 4848                 raise err
   4849
   4850         if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4840         # we need to look up the label
   4841         try:
-> 4842             slc = self.get_loc(label)
   4843         except KeyError as err:
   4844             try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

KeyError: 'Aa'
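You can reproduce this failure without the full dataset. The sketch below builds a tiny unsorted frame from titles and budgets shown in this post, then slices with bounds that are not themselves labels in the index. It assumes the same behaviour as in the traceback: pandas cannot locate "Aa" and, because the index is not monotonic, cannot fall back to a sorted search either.

```python
import pandas as pd

# Toy frame with an unsorted string index (values taken from tables in this post).
toy = pd.DataFrame(
    {"budget": [2800000, 25000000, 209000000, 35000000]},
    index=pd.Index(["Tusk", "Abandon", "Battleship", "Abduction"], name="title"),
)

try:
    toy.loc["Aa":"Bb"]  # "Aa" and "Bb" are not labels in the index
    outcome = "no error"
except KeyError as err:
    outcome = f"KeyError: {err}"

print(outcome)
```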
Sort the index in ascending order, check again that it is sorted, and then try the same command to take advantage of sorting for lexicographical slicing.
True
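The cell that sorts the index is not shown here, so the following is a minimal sketch of one way to do it, assuming `DataFrame.sort_index` and a toy frame in place of the real `movies` data.

```python
import pandas as pd

# Toy stand-in for the movies frame (titles and budgets taken from this post).
toy = pd.DataFrame(
    {"budget": [2800000, 25000000, 209000000, 35000000]},
    index=pd.Index(["Tusk", "Abandon", "Battleship", "Abduction"], name="title"),
)

# sort_index returns a new frame sorted by its labels, ascending by default.
toy_sorted = toy.sort_index()

print(toy.index.is_monotonic_increasing)
print(toy_sorted.index.is_monotonic_increasing)
print(list(toy_sorted.index))
```

On the real data, the equivalent would be something like `movies = movies.sort_index()` followed by the same sortedness check.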
Now our data is ready for lexicographical slicing. Let us select all the movie titles that fall between "Aa" and "Bb", that is, from the titles starting with the letter A up to part of the letter B.
title | budget | vote_average | vote_count |
---|---|---|---|
Abandon | 25000000 | 4.6 | 45 |
Abandoned | 0 | 5.8 | 27 |
Abduction | 35000000 | 5.6 | 961 |
Aberdeen | 0 | 7.0 | 6 |
About Last Night | 12500000 | 6.0 | 210 |
... | ... | ... | ... |
Battle for the Planet of the Apes | 1700000 | 5.5 | 215 |
Battle of the Year | 20000000 | 5.9 | 88 |
Battle: Los Angeles | 70000000 | 5.5 | 1448 |
Battlefield Earth | 44000000 | 3.0 | 255 |
Battleship | 209000000 | 5.5 | 2114 |
292 rows × 3 columns
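Here is a small sketch of the same selection on a toy frame, so you can see exactly which labels fall inside the slice. The titles and budgets come from tables in this post; the frame itself is only a stand-in for the sorted `movies` data.

```python
import pandas as pd

# Toy stand-in for the movies frame, sorted by title.
toy = pd.DataFrame(
    {"budget": [2800000, 25000000, 209000000, 35000000, 80000000, 0]},
    index=pd.Index(
        ["Tusk", "Abandon", "Battleship", "Abduction", "Grown Ups 2", "Aberdeen"],
        name="title",
    ),
).sort_index()

# Label slicing compares strings character by character, so every title
# that sorts between "Aa" and "Bb" is returned, even though neither bound
# is an actual label in the index.
result = toy.loc["Aa":"Bb"]
print(result)
```

Note that the comparison is case-sensitive and works on character codes, so a title such as "Battleship" falls inside the slice while one starting with "Bo" would not.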
True
title | budget | vote_average | vote_count |
---|---|---|---|
Æon Flux | 62000000 | 5.4 | 703 |
xXx: State of the Union | 60000000 | 4.7 | 549 |
xXx | 70000000 | 5.8 | 1424 |
eXistenZ | 15000000 | 6.7 | 475 |
[REC]² | 5600000 | 6.4 | 489 |
title | budget | vote_average | vote_count |
---|---|---|---|
The empty DataFrame is no surprise: the index is now sorted in reverse order, so "Aa" comes after "Bb" and nothing lies between them in that direction. Let us reverse the letters and run this again.
title | budget | vote_average | vote_count |
---|---|---|---|
B-Girl | 0 | 5.5 | 7 |
Ayurveda: Art of Being | 300000 | 5.5 | 3 |
Away We Go | 17000000 | 6.7 | 189 |
Awake | 86000000 | 6.3 | 395 |
Avengers: Age of Ultron | 280000000 | 7.3 | 6767 |
... | ... | ... | ... |
About Last Night | 12500000 | 6.0 | 210 |
Aberdeen | 0 | 7.0 | 6 |
Abduction | 35000000 | 5.6 | 961 |
Abandoned | 0 | 5.8 | 27 |
Abandon | 25000000 | 4.6 | 45 |
228 rows × 3 columns
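The reverse-order behaviour can be sketched on a toy frame as well. This assumes the index was sorted with `sort_index(ascending=False)`; the titles and budgets are taken from tables in this post.

```python
import pandas as pd

# Toy stand-in for the movies frame, sorted by title in descending order.
toy_desc = pd.DataFrame(
    {"budget": [2800000, 25000000, 209000000, 35000000, 80000000, 0]},
    index=pd.Index(
        ["Tusk", "Abandon", "Battleship", "Abduction", "Grown Ups 2", "Aberdeen"],
        name="title",
    ),
).sort_index(ascending=False)

print(toy_desc.index.is_monotonic_decreasing)

# With a descending index, "Aa" sits after "Bb", so this slice is empty.
empty = toy_desc.loc["Aa":"Bb"]
print(len(empty))

# Swapping the bounds follows the direction of the index and returns rows.
reversed_slice = toy_desc.loc["Bb":"Aa"]
print(list(reversed_slice.index))
```

The slice bounds have to follow the sort direction of the index: ascending bounds for an ascending index, descending bounds for a descending one.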