The Lyons Family Website | |||||||||||||||||||||||||
The Undecidability of the Schematron Conformance Problemby Bob Lyons
|
aabb |
ccab |
Can one or more of these rectangles be arranged in a row (left to right), such that the following requirements are met?
So, we can reformulate our first example as the following set of rectangles:
a |
aa |
ab |
b |
The strings in the top halves of the rectangles are the strings in list L, and the strings in the bottom halves of the rectangles are the strings in list M. The first string in list L (i.e., "a") is the string in the top half of the first rectangle, and the second string in list L (i.e., "ab") is the string in the top half of the second rectangle. Likewise, the first string in list M (i.e., "aa") is the string in the bottom half of the first rectangle, and the second string in list L (i.e., "b") is the string in the bottom half of the second rectangle.
This instance of the PCP can be solved with the following row, since the concatenation of the strings in the top half of the row (i.e., "aab") is equal to the concatenation of the strings in the bottom half of the row:
a |
ab |
aa |
b |
We can reformulate our second example as the following set of rectangles:
a |
aaa |
abaaa |
ab |
ab |
b |
This instance of the PCP can be solved with the following row, since the concatenation of the strings in the top half of the row is "abaaaaaab" and the concatenation of the strings in the bottom half of the row is also "abaaaaaab":
abaaa |
a |
a |
ab |
ab |
aaa |
aaa |
b |
Now that we understand the PCP, we can get back to our proof that the SCP is undecidable.
Recall that we assumed that the SCP is solvable (i.e., we assumed that there is an algorithm for solving the SCP). Now we extend this algorithm into a new algorithm that solves the PCP.
The input to the new algorithm is an instance of the PCP (i.e., two lists of nonempty strings over an alphabet containing at least two letters, such that the two lists have the same number of strings).
The following is the new algorithm that solves the PCP:
So, we now have an algorithm that solves the PCP. However, the PCP is unsolvable. So, our initial assumption (that there is an algorithm for solving the SCP) must be false. Therefore, the SCP is undecidable.
Now we'll show how a PCP instance can be transformed into a Schematron schema, such that an XML document conforms to the schema only if the XML document contains a solution to the PCP instance.
First, we describe the format of the XML documents that contain solutions to PCP instances. The following XML document contains a solution for the PCP instance of example #2:
<?xml version="1.0"?> <pcp-instance-solution> <list-1> <string id="2">abaaa</string> <string id="1">a</string> <string id="1">a</string> <string id="3">ab</string> </list-1> <list-2> <string id="2">ab</string> <string id="1">aaa</string> <string id="1">aaa</string> <string id="3">b</string> </list-2> </pcp-instance-solution>
Recall that a solution to this PCP instance
is the following sequence of integers:
( 2, 1, 1, 3 ). This sequence of integers is
specified by the id
attributes of the
/pcp-instance-solution/list-1/string
elements;
the solution is specified again by the
id
attributes of the
/pcp-instance-solution/list-2/string
elements.
The /pcp-instance-solution/list-1
element
contains the 2nd string of list L, followed by
the 1st string of list L, followed by the 1st
string of list L, followed by the 3rd string
of list L. The /pcp-instance-solution/list-2
element
contains the 2nd string of list M, followed by
the 1st string of list M, followed by the 1st
string of list M, followed by the 3rd string
of list M.
Notice that the concatenation of the values
of the /pcp-instance-solution/list-1/string
elements
is equal to the concatenation of the values
of the /pcp-instance-solution/list-2/string
elements.
(In other words,
"abaaaaaab"
is equal to
"abaaaaaab".)
You may have noticed that this XML document contains
much more information than the sequence of integers
that solves the problem. The extra information makes
it easier for the reader to verify the solution.
Also, if the document merely specified a list of integers that
solved the PCP instance, then I would not have been
able to write a Schematron schema that verifies
that the solution is correct (unless the schema
used the document
function to invoke
a CGI script, which would be cheating!).
The following is a Schematron schema for example #2. An XML document will match this schema only if it provides a solution for the PCP instance of example #2.
<?xml version="1.0"?> <schema xmlns="http://www.ascc.net/xml/schematron"> <!-- This Schematron schema is used to verify that an XML document contains a solution to the following instance of the Post Correspondence Problem (PCP): Alphabet A = { a, b } List L = ( a, abaaa, ab ) List M = ( aaa, ab, b ) An XML document conforms to this Schematron schema if it contains a solution to this PCP instance. If an XML document conforms to this Schematron schema, then the schema will not issue any messages when validating the document. If an XML document does not conform to this Schematron schema, then the schema will issue one or more error messages when validating the document. --> <pattern name="check-root"> <rule context="/"> <assert test="pcp-instance-solution"> The /pcp-instance-solution element was not found. </assert> </rule> </pattern> <pattern name="check-pcp-instance-solution"> <rule context="/pcp-instance-solution"> <assert test="list-1"> The /pcp-instance-solution/list-1 element is missing. </assert> <assert test="list-2"> The /pcp-instance-solution/list-2 element is missing. </assert> <assert test="count(*) = 2"> The /pcp-instance-solution element has an incorrect number of subelements. </assert> <assert test="translate( normalize-space(list-1), ' ', '' ) = translate( normalize-space(list-2), ' ', '' )"> The concatenation of the /pcp-instance-solution/list-1/string/text() nodes is not equal to the concatenation of the /pcp-instance-solution/list-2/string/text() nodes. </assert> </rule> </pattern> <pattern name="check-list-1"> <rule context= "/pcp-instance-solution/list-1/string[@id='1']"> <assert test=". = 'a'"> The value of the /pcp-instance-solution/list-1/string[@id='1'] element is not equal to 'a'. </assert> </rule> <rule context= "/pcp-instance-solution/list-1/string[@id='2']"> <assert test=". = 'abaaa'"> The value of the /pcp-instance-solution/list-1/string[@id='2'] element is not equal to 'abaaa'. </assert> </rule> <rule context= "/pcp-instance-solution/list-1/string[@id='3']"> <assert test=". = 'ab'"> The value of the /pcp-instance-solution/list-1/string[@id='3'] element is not equal to 'ab'. </assert> </rule> <rule context= "/pcp-instance-solution/list-1/string"> <assert test="@id = '1' or @id = '2' or @id = '3'"> The value of a /pcp-instance-solution/list-1/string/@id attribute is not equal to '1', '2' or '3'. </assert> </rule> <rule context= "/pcp-instance-solution/list-1/text()"> <assert test="normalize-space( . ) = ''"> Found a non-whitespace text node as child of the /pcp-instance-solution/list-1 element. </assert> </rule> <rule context= "/pcp-instance-solution/list-1/*"> <assert test="name( . ) = 'string'"> The /pcp-instance-solution/list-1 element has a subelement other than a string element. </assert> <assert test="@id"> A subelement of the /pcp-instance-solution/list-1 element has no id attribute. </assert> </rule> <rule context="/pcp-instance-solution/list-1"> <assert test="string"> The /pcp-instance-solution/list-1 element does not contain any string subelements. </assert> </rule> <rule context="/pcp-instance-solution/list-1"> <assert test="count( string ) = count( ../list-2/string )"> The /pcp-instance-solution/list-1 element does not have the same number of string elements as the /pcp-instance-solution/list-2 element. </assert> </rule> </pattern> <pattern name="check-list-2"> <rule context= "/pcp-instance-solution/list-2/string[@id='1']"> <assert test=". = 'aaa'"> The value of the /pcp-instance-solution/list-2/string[@id='1'] element is not equal to 'aaa'. </assert> </rule> <rule context= "/pcp-instance-solution/list-2/string[@id='2']"> <assert test=". = 'ab'"> The value of the /pcp-instance-solution/list-2/string[@id='2'] element is not equal to 'ab'. </assert> </rule> <rule context= "/pcp-instance-solution/list-2/string[@id='3']"> <assert test=". = 'b'"> The value of the /pcp-instance-solution/list-2/string[@id='3'] element is not equal to 'b'. </assert> </rule> <rule context= "/pcp-instance-solution/list-2/string"> <assert test="@id = '1' or @id = '2' or @id = '3'"> The value of the /pcp-instance-solution/list-2/string/@id attribute is not equal to '1', '2' or '3'. </assert> </rule> <rule context= "/pcp-instance-solution/list-2/text()"> <assert test="normalize-space( . ) = ''"> Found a non-whitespace text node as the child of the /pcp-instance-solution/list-2 element. </assert> </rule> <rule context= "/pcp-instance-solution/list-2/*"> <assert test="name( . ) = 'string'"> The /pcp-instance-solution/list-2 element has a subelement other than a string element. </assert> <assert test="@id"> A subelement of the /pcp-instance-solution/list-2 element has no id attribute. </assert> </rule> <rule context= "/pcp-instance-solution/list-2"> <assert test="string"> The /pcp-instance-solution/list-2 element does not contain any string subelements. </assert> </rule> </pattern> <pattern name="check-correspondence"> <rule context= "/pcp-instance-solution/list-1/string"> <assert test= "@id = ../../list-2/string[ count( current()/preceding-sibling::* ) + 1 ]/@id"> The value of an id attribute in a /pcp-instance-solution/list-1/string element is not equal to the value of the id attribute of the corresponding /pcp-instance-solution/list-2/string element. </assert> </rule> </pattern> </schema>
The validations performed by this schema include the following, which verify that the solution solves the PCP instance:
string
elements in the
list-1
element must be equal to the
concatenation of the values of the
string
elements in the
list-2
element.
list-1/string
element
must be equal to the i
th string in list L,
where i
is the value of the id
attribute of the list-1/string
element.
list-2/string
element
must be equal to the i
th string in list M,
where i
is the value of the id
attribute of the list-2/string
element.
@id
attribute in
the n
th string
element of the list-1
element must
be equal to the
value of the @id
attribute in
the n
th string
element of the list-2
element.
list-1
element must contain at least one
string
subelement.
list-2
element must contain at least one
string
subelement.
list-1
element must have the same number
of string
subelements as the list-2
element.
The schema also performs the following validations of the structure of the document:
pcp-instance-solution
.
pcp-instance-solution
must be the list-1
and
list-2
elements.
list-1
and list-2
elements must be
string
elements.
string
element must contain an
id
attribute.
string
element must not have any
subelements and must not contain any attributes other
than the id
attribute.
A corollary of the undecidability of the Schematron Conformance Problem (SCP) is that the Schematron Intersection Problem (SIP) is undecidable. The SIP is defined as follows:
Given two Schematron schemas, is there an XML document that conforms to both schemas?
To prove the corollary, we show, by way of contradiction, that if the SIP is decidable, then the SCP is decidable. If the SIP is decidable, then we can devise an algorithm that accepts any two Schematron schemas as input and responds with one of the following:
Now we extend this algorithm into a new algorithm that solves the SCP.
The input to the new algorithm is a Schematron schema, which we'll call S. We pass S and another Schematron schema, which we'll call T, as inputs to the algorithm that solves the SIP. T is the following Schematron schema, which matches all XML documents:
<?xml version="1.0"?> <schema xmlns="http://www.ascc.net/xml/schematron"> <!-- This schematron schema matches any XML document. --> <pattern name="check-root"> <rule context="/"> <assert test="count(*) = 1"> Error: The number of document elements is not equal to 1. </assert> </rule> </pattern> </schema>
Our new algorithm responds with one of the following:
Thus, the new algorithm solves the SCP (i.e., the SCP is decidable). However, since we showed that the SCP is undecidable, our initial assumption (that the SIP is decidable) must be false. Therefore, the SIP is undecidable.
Is the schema conformance problem undecidable for other XML schema languages, such as XML Schema, DTD and RELAX NG? In their paper entitled What's Hard about XML Schema Constraints?, Marcelo Arenas, Wenfei Fan and Leonid Libkin show that the XML Schema Conformance Problem is undecidable.
The DTD Conformance Problem looks solvable. It seems easy to construct an XML document that conforms to a DTD, as long as the DTD contains at least one element type declaration and contains an element type declaration for each child element type that is mentioned in the element type declarations.
It is possible to write a RELAX NG schema, such that there are no XML documents that conform to the schema. The trick is to use a data type that includes two contradictory constraints (e.g., a string must consist of an even number of 'a' characters, but must be 3 characters in length).
For example, there are no XML documents that conform to the following RELAX NG schema:
<?xml version="1.0" encoding="UTF-8"?> <element name="conflicted" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary= "http://www.w3.org/2001/XMLSchema-datatypes"> <data type="string"> <param name="pattern">(aa)+</param> <param name="length">3</param> </data> </element>
So, the RELAX NG Conformance Problem is an interesting problem, but it doesn't seem to be unsolvable.
In this paper, we showed that the
Schematron Conformance Problem
(SCP) is undecidable.
This is true even if we restrict ourselves to
Schematron schemas that do not use the
document
and key
functions of XPath.
To prove the undecidability of the SCP, we showed that
if the SCP is decidable, then the Post Correspondence
Problem is decidable; however, the Post Correspondence
Problem is undecidable. Therefore, the SCP must be undecidable.
The fact that the SCP is undecidable means that it's impossible to build a Schematron schema editor that evaluates the user's schema and always lets him/her know whether or not there are any XML documents that conform to the schema.
Thanks to John Cowan for his comments on DTD conformance.
The following resources provide information about XML schema languages:
The following resources provide information about the Post Correspondence Problem, the Halting Problem and undecidability: