ClueWeb09 gaps.
The ClueWeb09 Gap Data set consists of posting lists extracted from the
category B HTML files of the ClueWeb09 collection,
provided by Carnegie Mellon University.
Category B contains 50 million English pages.
Each posting list is encoded as a sequence of gaps, i.e.,
differences between adjacent document numbers (IDs).
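
As an illustration of gap encoding, the following minimal Python sketch converts a sorted posting list into gaps and back. The function names to_gaps and from_gaps are hypothetical, and the convention that the first gap is measured from 0 (so the first document ID is stored as-is) is an assumption; the data set itself may use a different convention.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Turn a strictly increasing list of document IDs into gaps.

    Assumption: the first gap is measured from 0, so the first
    document ID is stored unchanged.
    """
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)  # difference to the previous ID
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Recover document IDs as the running sum of the gaps."""
    return list(accumulate(gaps))


postings = [3, 7, 8, 15]
gaps = to_gaps(postings)      # [3, 4, 1, 7]
restored = from_gaps(gaps)    # [3, 7, 8, 15]
```

Gaps are typically much smaller than the document IDs themselves, which is why posting lists stored this way are a common input for integer compression schemes.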