How to select a Subset Of Data Using lexicographical slicing in Python Pandas?

Introduction

Pandas have a dual selection capability to select the subset of data using the Index position or by using the Index labels. Inthis post, I will show you how to "Select a Subset Of Data Using lexicographical slicing".

Google is full of datasets. Search for movies dataset in kaggle.com. This post uses the movies data set from kaggle.

How to do it

Import the movies dataset with only the columns required for this example.

import pandas as pd
import numpy as np
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",index_col="title",
usecols=["title","budget","vote_average","vote_count"])
movies.sample(n=5)

	budget	vote_average	vote_count
titile
Little Voice	0	6.6	61
Grown Ups 2	80000000	5.8	1155
The Best Years of Our Lives	2100000	7.6	143
Tusk	2800000	5.1	366
Operation Chromite	0	5.8	29

I always recommend sorting the index, especially if the index is made up of strings. You will notice the difference if you aredealing with a huge dataset when your index is sorted.

What if I don't sort the index?

No problem your code is going to run forever. Just kidding, well if the index labels are unsorted then pandas have to traversethrough all the labels one by one to match your query. Just imagine an Oxford dictionary without an index page, what you aregoing to do? With the index sorted you can jump around quickly to a label you want to extract, so is the case with Pandastoo.

Let us check first if our index is sorted or not.

# check if the index is sorted or not ?
movies.index.is_monotonic

False

Clearly, the index is un sorted. We will try to select the movies starting with A%. This is like writing

select * from movies where title like'A%'

movies.loc["Aa":"Bb"]

select * from movies where title like 'A%'



---------------------------------------------------------------------------
ValueErrorTraceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind)
4844try:
-> 4845return self._searchsorted_monotonic(label, side) 4846except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(se lf, label, side)
4805
-> 4806raise ValueError("index must be monotonic increasing or decreasing")
4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyErrorTraceback (most recent call last)
in
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in getitem (self, key)
1766
1767maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768return self._getitem_axis(maybe_callable, axis=axis) 1769
1770def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1910if isinstance(key, slice):
1911self._validate_key(key, axis)
-> 1912return self._get_slice_axis(key, axis=axis) 1913elif com.is_bool_indexer(key):
1914return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_ob j, axis)
1794
1795labels = obj._get_axis(axis)
-> 1796indexer = labels.slice_indexer(
1797slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name 1798)

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
4711slice(1, 3)
4712"""
-> 4713start_slice, end_slice = self.slice_locs(start, end, step=step, kind=ki nd)
4714
4715# return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, en d, step, kind)
4924start_slice = None
4925if start is not None:
-> 4926start_slice = self.get_slice_bound(start, "left", kind) 4927if start_slice is None:
4928start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind)
4846except ValueError:
4847# raise the original KeyError
-> 4848raise err
4849
4850if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind)
4840# we need to look up the label
4841try:
-> 4842slc = self.get_loc(label) 4843except KeyError as err:
4844try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method,

tolerance)
2646return self._engine.get_loc(key)
2647except KeyError:
-> 2648return self._engine.get_loc(self._maybe_cast_indexer(key))
2649indexer = self.get_indexer([key], method=method, tolerance=tolerance) 2650if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer() KeyError: 'Aa'

Sort the index in ascending order and try the same command to take the advantage of sorting for lexicographical slicing.

True

Now our data is set and ready for lexicographical slicing. Let us now select all the movie titles starting with letter A till letter B.

	budget	vote_average	vote_count
title
Abandon	25000000	4.6	45
Abandoned	0	5.8	27
Abduction	35000000	5.6	961
Aberdeen	0	7.0	6
About Last Night	12500000	6.0	210
...	...	...	...
Battle for the Planet of the Apes	1700000	5.5	215
Battle of the Year	20000000	5.9	88
Battle: Los Angeles	70000000	5.5	1448
Battlefield Earth	44000000	3.0	255
Battleship	209000000	5.5	2114

292 rows × 3 columns

True

title	budget	vote_average	vote_count
Æon Flux	62000000	5.4	703
xXx: State of the Union	60000000	4.7	549
xXx	70000000	5.8	1424
eXistenZ	15000000	6.7	475
[REC]²	5600000	6.4	489

budget vote_average vote_count title

This is a no brainer to see the empty DataFrame as the data is sorted in reverse order. Let us reverse the letters and run this again.

title	budget	vote_average	vote_count
B-Girl	0	5.5	7
Ayurveda: Art of Being	300000	5.5	3
Away We Go	17000000	6.7	189
Awake	86000000	6.3	395
Avengers: Age of Ultron	280000000	7.3	6767
...	...	...	...
About Last Night	12500000	6.0	210
Aberdeen	0	7.0	6
Abduction	35000000	5.6	961
Abandoned	0	5.8	27
Abandon	25000000	4.6	45

228 rows × 3 columns