< Back to previous page

Publication

Inference of Concise Regular Expressions and DTDs

Journal Contribution - Journal Article

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions-the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)-that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however ( for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs ( which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.
Journal: ACM TRANSACTIONS ON DATABASE SYSTEMS
ISSN: 0362-5915
Issue: 2
Volume: 35
Publication year:2010
Keywords:Algorithms, Languages, Theory, Regular expressions, schema inference, XML
BOF-keylabel:yes
IOF-keylabel:yes
BOF-publication weight:1
CSS-citation score:1
Authors:International
Authors from:Higher Education
Accessibility:Closed