| Abstract | | Much of the information on the Web is found in articles from
online news outlets, magazines, encyclopedias, review collections,
and other sources. However, extracting this content from the
original HTML document is complicated by the large amount of less
informative and typically unrelated material such as navigation
menus, forms, user comments, and ads. Existing approaches either
are brittle and demand significant expert knowledge and time
(manual or tool-assisted generation of rules or code), necessitate
labeled examples for every different page structure to be processed
(wrapper induction), require relatively uniform layout (template
detection), or, as with Visual Page Segmentation (VIPS), are
computationally expensive. We introduce maximum subsequence
segmentation, a method of global optimization over token-level
local classifiers, and apply it to the domain of news websites.
Training examples are easy to obtain, both learning and prediction
are linear time, and results are excellent (our semi-supervised
algorithm yields an overall F1-score of 97.947%), surpassing even
those produced by VIPS with a hypothetical perfect block-selection
heuristic. We also evaluate on the recent CleanEval shared task,
with surprisingly good cross-task performance in cleaning general
web pages: our "text-only" score (based on Levenshtein distance)
is 87.8%, exceeding the top score of 84.1%. |
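
To make the idea of global optimization over token-level local classifiers concrete, the following is a minimal sketch rather than the implementation evaluated here. It assumes that each token already carries a local classifier probability of belonging to the article, and that a token's score is that probability minus 0.5; the abstract does not specify the scoring. The names `max_subsequence`, `extract_article`, and `threshold` are hypothetical. The contiguous run of tokens with the largest total score is found in a single linear-time pass.

```python
from typing import List, Tuple


def max_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the largest sum.

    `end` is exclusive. If no run has a positive sum, the empty run
    (0, 0, 0.0) is returned. Runs in O(n) time and O(1) extra space.
    """
    best_start, best_end, best_sum = 0, 0, 0.0
    cur_start, cur_sum = 0, 0.0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix can only hurt, so restart the run here.
            cur_start, cur_sum = i, score
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum


def extract_article(tokens: List[str], probs: List[float],
                    threshold: float = 0.5) -> List[str]:
    """Select the token span that maximizes the summed local scores.

    Assumption (not from the abstract): score = probability - threshold,
    so likely-article tokens are positive and likely-boilerplate tokens
    are negative.
    """
    scores = [p - threshold for p in probs]
    start, end, _ = max_subsequence(scores)
    return tokens[start:end]


if __name__ == "__main__":
    tokens = ["Home", "Login", "The", "quick", "fox", "ad", "jumps",
              "Share", "Copyright"]
    probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.9, 0.2, 0.1]
    print(extract_article(tokens, probs))
```

In this toy input the token "ad" has a low local probability, yet it is kept because the surrounding high-scoring tokens outweigh it. This is the sense in which the optimization is global: a single misclassified token inside the article does not split the extracted span.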