Searching Semi-Structured Data Using Landmarks


This paper introduces landmark search operators for extracting data from poorly formatted Web pages, plain text files, and XML/SGML documents that lack grammars. The emphasis is on ease of use and on a fast, simple implementation that can be readily ported to a wide variety of host languages.

There are two main operators: one uses unique textual landmarks to divide a text region into smaller regions suitable for further search; the other searches for XML/SGML tag pairs and returns the matches as regions. An iterator class allows a search to be carried out repeatedly.
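
The two operators can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `Region`, `split_at_landmark`, and `iter_tag_pairs` are hypothetical, regions are assumed to be half-open character offsets into the text, tag matching is assumed to ignore nesting, and a generator stands in for the iterator class.

```python
import re
from typing import Iterator, Optional, Tuple

# Assumption: a region is a half-open (start, end) pair of offsets into the text.
Region = Tuple[int, int]


def split_at_landmark(text: str, region: Region,
                      landmark: str) -> Optional[Tuple[Region, Region]]:
    """Split a region around a landmark that occurs exactly once inside it.

    Returns the regions before and after the landmark, or None when the
    landmark is missing or not unique within the region.
    """
    start, end = region
    pos = text.find(landmark, start, end)
    if pos == -1 or text.find(landmark, pos + 1, end) != -1:
        return None
    return (start, pos), (pos + len(landmark), end)


def iter_tag_pairs(text: str, region: Region, tag: str) -> Iterator[Region]:
    """Yield the region between each <tag ...> and the next </tag>.

    Simplification (an assumption of this sketch): nested tags of the
    same name are not handled.
    """
    start, end = region
    name = re.escape(tag)
    pattern = re.compile(r"<%s\b[^>]*>(.*?)</%s\s*>" % (name, name),
                         re.IGNORECASE | re.DOTALL)
    for match in pattern.finditer(text, start, end):
        yield match.span(1)


# Usage: divide the page at a unique landmark, then search each part for tags.
page = "Name: <b>Ann</b> <b>Bob</b> | Price: <b>42</b>"
before, after = split_at_landmark(page, (0, len(page)), "|")
names = [page[s:e] for s, e in iter_tag_pairs(page, before, "b")]   # ['Ann', 'Bob']
prices = [page[s:e] for s, e in iter_tag_pairs(page, after, "b")]   # ['42']
```

Because both functions take and return regions, the output of one search can serve as the input to the next, which is how a large page is narrowed step by step.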



Dr. Andrew Davison
E-mail: ad@fivedots.coe.psu.ac.th