Fact-Finding
Population and Biography datasets extracted from
Wikipedia. These consist of claims that users have made about
the population of cities ("city X had population Y in year Z") and
biographical details of persons (such as birth date, death date,
parents, children, etc.) as well as the labeled data used for
evaluation, derived from census information and authoritative
biographical websites, respectively. There are 44,761
population claims with 308 non-trivial labels (non-trivial in that
the population was disputed), and, in the biography data, 129,847
claimed birth dates, 34,201 death dates, 10,418 parent-child pairs,
and 9,792 spouses, with 2,685 birth/death date labels.
Download:
http://took.cs.uiuc.edu/data/
Wikipedia Extracts
Categories, redirects, page titles and links in easy-to-parse
text files, taken from the March 2010 English version of
Wikipedia.
1. The categories data encodes the full Wikipedia category
hierarchy.
2. Redirect information provides title synonyms.
3. The page titles data is a list of all pages in Wikipedia.
4. Links are triples of [source article] [target article] [link
text].
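As a rough illustration of working with the link triples, the sketch below parses them and groups link texts by target article. The exact file layout is not described here, so the tab delimiter, the one-triple-per-line layout, and the names `Link`, `parse_links`, and `anchor_texts_by_target` are all assumptions made for illustration; adjust them to match the downloaded files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, NamedTuple, Set


class Link(NamedTuple):
    """One link triple: [source article] [target article] [link text]."""
    source: str
    target: str
    text: str


def parse_links(lines: Iterable[str], sep: str = "\t") -> Iterator[Link]:
    """Yield Link triples from text lines.

    ASSUMPTION: one triple per line, fields separated by `sep` (tab here).
    The real delimiter is not documented above and may differ.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split at most twice so the link text may itself contain `sep`.
        parts = line.split(sep, 2)
        if len(parts) != 3:
            continue  # skip lines that do not form a full triple
        yield Link(*parts)


def anchor_texts_by_target(links: Iterable[Link]) -> Dict[str, Set[str]]:
    """Collect the distinct link texts that point at each target article."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for link in links:
        index[link.target].add(link.text)
    return index


# Made-up sample lines, not taken from the actual data.
sample = [
    "Chicago\tIllinois\tthe state of Illinois\n",
    "Urbana\tIllinois\tIllinois\n",
    "bad line\n",
]
index = anchor_texts_by_target(parse_links(sample))
```

For a real file, the same functions could be driven by `open(path, encoding="utf-8")` in place of the `sample` list, which streams the data line by line rather than loading it all at once.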
Download:
http://took.cs.uiuc.edu/data/
Maximum Subsequence Segmentation
This is the training and evaluation data used for the MSS
article text extraction algorithm. The download includes more than
the 24,000 training examples used in the paper, because we trained on
only the first 2,000 automatically generated examples from each of
twelve major news websites. The evaluation data consists of
450 pages with hand-tagged extractions taken from 45 different
websites.
Download:
http://cogcomp.cs.illinois.edu/Data/MSS/
Transliteration
Transliteration training example pairs from Wikipedia that were
used to train our transliteration model. The set contains 2,862
English-Russian pairs, 1,166 English-Hebrew pairs, and 384
English-Chinese pairs.
Download:
/media/720/examples.zip
(41KB)