Some useful datasets:

  1. Datasets used in my experimental papers on approximate dictionary search methods. They include the most frequent words from ClueWeb09 (category B), synthetic English and Russian words, and DNA sequences extracted from the human genome.

  2. ClueWeb09 gaps. The ClueWeb09 Gap Data set contains posting lists extracted from the category B HTML files of the ClueWeb09 collection, which is provided by Carnegie Mellon University. Category B contains 50 million English pages. Each posting list is encoded as a sequence of gaps, i.e., differences between adjacent document numbers (IDs).
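
To illustrate the gap encoding mentioned above, here is a minimal Python sketch. The function names `to_gaps` and `from_gaps` are hypothetical, and the sketch assumes that document IDs are strictly increasing and that the first gap is the first ID itself (i.e., measured from 0). The actual on-disk format of the data set may differ.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Convert a strictly increasing list of document IDs into gaps.

    Assumption: the first gap is measured from 0, so it equals the first ID.
    """
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Recover document IDs from gaps by taking running (prefix) sums."""
    return list(accumulate(gaps))


posting_list = [3, 7, 8, 15]
gaps = to_gaps(posting_list)   # [3, 4, 1, 7]
restored = from_gaps(gaps)     # [3, 7, 8, 15]
```

Gaps are typically much smaller than the raw IDs, which is why posting lists are usually stored this way before an integer compression scheme is applied.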