yahoo_ltrc

  • Description:

The Yahoo Learning to Rank Challenge dataset (also called "C14") is a Learning-to-Rank dataset released by Yahoo. The dataset consists of query-document pairs represented as feature vectors and corresponding relevance judgment labels.

The dataset comes in two versions:

  • set1: contains 709,877 query-document pairs.
  • set2: contains 172,870 query-document pairs.

You can specify whether to use the set1 or set2 version of the dataset as follows:

import tensorflow_datasets as tfds

ds = tfds.load("yahoo_ltrc/set1")  # larger version
ds = tfds.load("yahoo_ltrc/set2")  # smaller version

If only yahoo_ltrc is specified, the yahoo_ltrc/set1 option is selected by default:

# This is the same as `tfds.load("yahoo_ltrc/set1")`
ds = tfds.load("yahoo_ltrc")
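
The naming rule above can be sketched as a small helper. This is only an illustration: `qualified_name` and `load_split` are hypothetical names that are not part of TFDS, and the only library call used is `tfds.load` with its `split` argument. The import is deferred so that the name logic can be tried without TFDS installed.

```python
DEFAULT_CONFIG = "set1"          # a bare "yahoo_ltrc" resolves to set1
KNOWN_CONFIGS = ("set1", "set2")


def qualified_name(config=None):
    """Build the name string for tfds.load (hypothetical helper)."""
    config = DEFAULT_CONFIG if config is None else config
    if config not in KNOWN_CONFIGS:
        raise ValueError(f"unknown yahoo_ltrc config: {config!r}")
    return f"yahoo_ltrc/{config}"


def load_split(config=None, split="train"):
    """Load one split of the chosen config (hypothetical helper)."""
    # Imported lazily so qualified_name() works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(qualified_name(config), split=split)
```

For example, `load_split("set2", split="vali")` would request the validation split of the smaller version. Note that calling it downloads and prepares the dataset on first use.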

  • Citation:

@inproceedings{chapelle2011yahoo,
  title={Yahoo! learning to rank challenge overview},
  author={Chapelle, Olivier and Chang, Yi},
  booktitle={Proceedings of the learning to rank challenge},
  pages={1--24},
  year={2011},
  organization={PMLR}
}

yahoo_ltrc/set1 (default config)

  • Dataset size: 795.39 MiB

  • Auto-cached (documentation): No

  • Splits:

Split     Examples
'test'    6,983
'train'   19,944
'vali'    2,994

  • Feature structure:
FeaturesDict({
    'doc_id': Tensor(shape=(None,), dtype=int64),
    'float_features': Tensor(shape=(None, 699), dtype=float64),
    'label': Tensor(shape=(None,), dtype=float64),
    'query_id': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature         Class         Shape        Dtype    Description
                FeaturesDict
doc_id          Tensor        (None,)      int64
float_features  Tensor        (None, 699)  float64
label           Tensor        (None,)      float64
query_id        Text                       string
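
The feature structure above suggests that each example holds one query: `query_id` is a scalar, while `doc_id`, `float_features` and `label` share a variable-length leading dimension with one entry per document. A minimal sketch of a consistency check under that reading follows. The `check_example` helper and the example values are made up for illustration, and plain Python lists stand in for the arrays TFDS would return.

```python
NUM_FEATURES = 699  # width of 'float_features' in the structure above


def check_example(example, num_features=NUM_FEATURES):
    """Return the document count if the per-query lists line up."""
    n_docs = len(example["doc_id"])
    if len(example["label"]) != n_docs:
        raise ValueError("label length does not match doc_id length")
    if len(example["float_features"]) != n_docs:
        raise ValueError("float_features rows do not match doc_id length")
    for row in example["float_features"]:
        if len(row) != num_features:
            raise ValueError("unexpected feature vector width")
    return n_docs


# Synthetic example with three documents; all values are invented.
example = {
    "query_id": "q1",
    "doc_id": [0, 1, 2],
    "float_features": [[0.0] * NUM_FEATURES for _ in range(3)],
    "label": [2.0, 0.0, 1.0],
}

n_docs = check_example(example)
```

Because the helper relies only on `len()` and iteration, it should also accept NumPy arrays of the same shapes.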

yahoo_ltrc/set2

  • Dataset size: 194.92 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split     Examples
'test'    3,798
'train'   1,266
'vali'    1,266

  • Feature structure:
FeaturesDict({
    'doc_id': Tensor(shape=(None,), dtype=int64),
    'float_features': Tensor(shape=(None, 700), dtype=float64),
    'label': Tensor(shape=(None,), dtype=float64),
    'query_id': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature         Class         Shape        Dtype    Description
                FeaturesDict
doc_id          Tensor        (None,)      int64
float_features  Tensor        (None, 700)  float64
label           Tensor        (None,)      float64
query_id        Text                       string
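
The pair counts in the description and the split sizes listed for each config can be combined into a rough per-query statistic. The sketch below assumes that each example in a split is one query and that the stated pair counts cover all three splits together. Neither assumption is stated outright in this document, so treat the averages as estimates. The `summarize` helper is hypothetical.

```python
# Counts copied from the description and the split tables in this document.
CONFIGS = {
    "set1": {
        "pairs": 709_877,
        "splits": {"train": 19_944, "vali": 2_994, "test": 6_983},
    },
    "set2": {
        "pairs": 172_870,
        "splits": {"train": 1_266, "vali": 1_266, "test": 3_798},
    },
}


def summarize(config):
    """Return (total queries, average documents per query) for a config."""
    info = CONFIGS[config]
    n_queries = sum(info["splits"].values())
    return n_queries, info["pairs"] / n_queries


for name in CONFIGS:
    n_queries, avg_docs = summarize(name)
    print(f"{name}: {n_queries} queries, about {avg_docs:.1f} documents per query")
```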