Data


Fact-Finding

Population and biography datasets extracted from Wikipedia.  They consist of claims that users have made about the populations of cities ("city X had population Y in year Z") and about biographical details of people (birth date, death date, parents, children, etc.), together with the labeled data used for evaluation, derived from census information and authoritative biographical websites, respectively.  There are 44,761 population claims with 308 non-trivial labels (non-trivial in that the population was disputed).  The biography data contains 129,847 claimed birth dates, 34,201 death dates, 10,418 parent-child pairs, and 9,792 spouses, with 2,685 birth/death date labels.
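
As a rough illustration of what a disputed population means here, the sketch below groups claims by city and year and reports the pairs that have more than one distinct claimed value.  The PopulationClaim type, its field names, and the sample values are assumptions made up for this example; they do not reflect the actual file layout of the download.

```python
from collections import defaultdict
from typing import NamedTuple


class PopulationClaim(NamedTuple):
    """One claim of the form "city X had population Y in year Z" (hypothetical layout)."""
    city: str
    year: int
    population: int


def disputed_keys(claims):
    """Return the sorted (city, year) pairs with more than one distinct claimed population."""
    values = defaultdict(set)
    for claim in claims:
        values[(claim.city, claim.year)].add(claim.population)
    return sorted(key for key, pops in values.items() if len(pops) > 1)


# Made-up claims for illustration only; not taken from the dataset.
claims = [
    PopulationClaim("Springfield", 2000, 111454),
    PopulationClaim("Springfield", 2000, 116250),
    PopulationClaim("Shelbyville", 2000, 20000),
]
disputed = disputed_keys(claims)
```

In this toy input only Springfield in 2000 has conflicting values, so it is the only pair that would need a label to decide which claim is correct.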

Download: http://took.cs.uiuc.edu/data/


Wikipedia Extracts

Categories, redirects, page titles and links in easy-to-parse text files, taken from the March 2010 English version of Wikipedia.

1. The categories data encodes the full Wikipedia category hierarchy.
2. Redirect information provides title synonyms.
3. The page titles data is a list of all pages in Wikipedia.
4. Links are triples of [source article] [target article] [link text].
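
The link triples above could be read with something like the following sketch.  It assumes one triple per line with tab-separated fields, which is a guess about the file layout and should be checked against the downloaded files; the Link type, the function names, and the sample lines are likewise invented for this example.

```python
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple


class Link(NamedTuple):
    source: str  # source article title
    target: str  # target article title
    text: str    # link (anchor) text


def parse_links(lines: Iterable[str], sep: str = "\t") -> Iterator[Link]:
    """Yield one Link per line with three fields; skip malformed lines.

    The tab separator is an assumption, not a documented property of the files.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep, 2)
        if len(fields) == 3:
            yield Link(*fields)


def anchor_texts(links: Iterable[Link]) -> dict:
    """Map each target article to the set of link texts that point to it."""
    index = defaultdict(set)
    for link in links:
        index[link.target].add(link.text)
    return dict(index)


# Made-up sample lines, not taken from the actual files.
sample = [
    "Chicago\tIllinois\tthe state of Illinois\n",
    "Urbana, Illinois\tIllinois\tIllinois\n",
    "malformed line\n",
]
index = anchor_texts(parse_links(sample))
```

Grouping link text by target in this way gives, for each article, the phrases other pages use to refer to it, which complements the title synonyms from the redirect data.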

Download: http://took.cs.uiuc.edu/data/


Maximum Subsequence Segmentation

This is the training and evaluation data used for the MSS article text extraction algorithm.  The set includes more than the 24,000 training examples used in the paper, since we trained on only the first 2,000 automatically generated examples from each of twelve major news websites.  The evaluation data consists of 450 pages with hand-tagged extractions taken from 45 different websites.

Download: http://cogcomp.cs.illinois.edu/Data/MSS/


Transliteration

Training example pairs from Wikipedia that were used to train our transliteration model: 2,862 English-Russian pairs, 1,166 English-Hebrew pairs, and 384 English-Chinese pairs.

Download: /media/720/examples.zip (41KB)