3. XMLmind XML Editor-friendly content models

Validating a document against a RELAX NG schema is similar to matching some text against a regular expression. If the document "matches" the schema, the document is valid, no matter which sub-expressions were used during the match.

Example: string "b" matches regular expression "(a?,b)|(b,c?)" and we don't care whether it matches sub-expression "(a?,b)" or sub-expression "(b,c?)". The situation is exactly the same with RELAX NG schemas: simply replace the characters and character classes used in a regular expression with RELAX NG patterns.
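
The analogy can be checked with a few lines of Python. This is only an illustration: it uses Python's re module, writes the sequence operator "," as plain juxtaposition (as Python's regex syntax requires), and the helper name matching_branches is made up for this sketch.

```python
import re

# "(a?,b)|(b,c?)" from the text, written in Python regex syntax.
WHOLE = re.compile(r"(a?b)|(bc?)")
LEFT = re.compile(r"a?b")    # sub-expression "(a?,b)"
RIGHT = re.compile(r"bc?")   # sub-expression "(b,c?)"


def matching_branches(text):
    """Return the sub-expressions that match the whole text on their own."""
    branches = []
    if LEFT.fullmatch(text):
        branches.append("(a?,b)")
    if RIGHT.fullmatch(text):
        branches.append("(b,c?)")
    return branches


print(WHOLE.fullmatch("b") is not None)  # True: "b" is "valid"
print(matching_branches("b"))            # ['(a?,b)', '(b,c?)']
print(matching_branches("ab"))           # ['(a?,b)']
```

A validator only needs the first answer. A tool that must know which branch applies needs the second one, and for "b" that answer is not unique.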

The job of a RELAX NG schema is to validate a document as a whole, and that's it. For XXE, the problem to solve is different. One of the main jobs of XXE is to guide users as they edit an XML document. That is, XXE must identify the content model of the element being edited in order to suggest the right attributes and the right child elements for it.

To do that, XXE needs to know precisely which "sub-expressions were used during the match". Unfortunately, this is sometimes impossible.

All examples used in this section are found in XXE_install_dir/doc/rngsupport/samples/. Note that they are all valid schemas and valid documents.

Example 1. Ambiguous elements

RELAX NG schema, target.rnc:

start = build-element

build-element = element build {
    target-element*
}
target-element = element target {
    attribute name { xsd:ID },
    element list { ref-element* }?,
    element list { action-element* }?
}
ref-element = element ref { 
    attribute name { xsd:IDREF }
}
action-element = element action { text }

Document conforming to the above schema, target_bad.xml:

<build>
  <target name="all">
    <list>
    </list>
  </target>

  <target name="compile"/>
  <target name="link"/>
</build>

If you open target_bad.xml in XXE and select the list element, XXE is lost: is it the list element which contains refs, or is it the list element which contains actions? Both list content models fit an empty list element!

Now, if you open target_good.xml in XXE, there is no problem at all:

<build>
  <target name="all">
    <list>
      <ref name="compile"/>
      <ref name="link"/>
    </list>
    <list>
      <action>cc -c *.c</action>
      <action>cc *.o</action>
    </list>
  </target>

  <target name="compile"/>
  <target name="link"/>
</build>

The previous examples show that:

Important

XXE cannot distinguish between two child elements which have the same name but different content models, unless these two child elements themselves have distinct attributes and/or distinct child elements.
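
The ambiguity of the empty list element can be reproduced with a toy model. The sketch below is not XXE's algorithm: it reduces each of the two list definitions of target.rnc to the single child element name it allows, ignores the order of the two lists, and the names CANDIDATES and candidates_for are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the two `list` definitions of target.rnc: each candidate
# is reduced to the one child element name it allows, zero or more times.
CANDIDATES = {"list of ref": "ref", "list of action": "action"}


def candidates_for(list_element):
    """Names of the candidate content models that the children satisfy."""
    child_names = [child.tag for child in list_element]
    return [name for name, allowed in CANDIDATES.items()
            if all(tag == allowed for tag in child_names)]


bad = ET.fromstring('<target name="all"><list>\n</list></target>')
good = ET.fromstring(
    '<target name="all">'
    '<list><ref name="compile"/><ref name="link"/></list>'
    '<list><action>cc -c *.c</action><action>cc *.o</action></list>'
    '</target>')

print([candidates_for(l) for l in bad.findall("list")])
# [['list of ref', 'list of action']]
print([candidates_for(l) for l in good.findall("list")])
# [['list of ref'], ['list of action']]
```

An empty list satisfies both candidates, so nothing in the element itself tells them apart. As soon as a ref or an action child is present, exactly one candidate remains.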

RELAX NG schema, sect.rnc:

start = doc-element

doc-element = element doc {
    (simple-sect|
     recursive-sect)+
}
simple-sect = element sect {
    attribute class {"simple"}, paragraph-element*
}
recursive-sect = element sect {
    attribute class {"recursive"}, (recursive-sect|simple-sect)*
}
paragraph-element = element paragraph { text }

Document conforming to the above schema, sect.xml:

<doc>
  <sect class="recursive">
    <sect class="recursive"></sect>

    <sect class="simple">
      <paragraph>Paragraph 2.</paragraph>
    </sect>
  </sect>

  <sect class="simple"></sect>
</doc>

XXE has no problem at all with empty <sect class="recursive"> and empty <sect class="simple"> because these elements have the same required attribute class but with different fixed values. However, it is easy to defeat XXE by slightly modifying the schema.

RELAX NG schema, sect2.rnc:

start = doc-element

doc-element = element doc {
    (simple-sect|
     recursive-sect)+
}

simple-sect = element sect {
    attribute class {"simple"}, paragraph-element*
}

recursive-sect = element sect {
    attribute class {"recursive"}?, (recursive-sect|simple-sect)*
}

paragraph-element = element paragraph { text }

Document conforming to the above schema, sect2.xml:

<doc>
  <sect>
    <sect></sect>

    <sect class="simple">
      <paragraph>Paragraph 2.</paragraph>
    </sect>
  </sect>

  <sect class="simple"></sect>
</doc>

3.1. The non-validating, lenient, editing mode

When XXE is "lost", it automatically enters a lenient editing mode. In this mode, XXE can no longer guide the user through the editing of the problematic element.

The node path bar is used to signal elements which are in this non-validating, lenient editing mode:

  • An element underlined in red means that this element is in non-validating mode 2. In this mode, XMLmind XML Editor is not able to suggest the right attributes and the right child elements to the user. The user may add and remove any attributes and child elements, at any place and in any number.

    Figure 1. XXE is completely lost when empty list is selected

  • An element underlined in orange means that this element is in non-validating mode 1. In this mode, XMLmind XML Editor still suggests the right attributes and child elements to the user. But these are only suggestions: the user may add and remove any attributes and child elements, at any place and in any number.

    Figure 2. XXE has problems when target containing empty list is selected

Note that the lenient editing mode is local to an element and its descendants. It is not used for the whole document, but only for the element with which XXE has trouble.

Figure 3. XXE has no problem at all when target named compile is selected

Also note that if modifying a problematic element solves the problems XXE had with it, XXE automatically switches back to its normal, strict, validating mode.

3.2. Problems with attributes

Important

XXE cannot distinguish between two attributes (within the same element) which have the same name but different content models, unless both attributes have fixed values.

Example 2. Same attribute name, different content models

RELAX NG schema, person.rnc:

start = persons-element

persons-element = element persons {
    person-element+
}
person-element = element person {
    (attribute age { xsd:int } |
     (attribute age { "seeBirthDate" },
      attribute birthDate { xsd:date })),
    element firstName { text },
    element lastName { text }
}

Document conforming to the above schema, person.xml:

<persons>
  <person age="33">
    <firstName>John</firstName>
    <lastName>Doe</lastName>
  </person>

  <person age="seeBirthDate" birthDate="1980-04-15">
    <firstName>Erica</firstName>
    <lastName>Kyle</lastName>
  </person>
</persons>

XXE has problems with attribute age. It cannot distinguish between the attribute age which contains an integer and the attribute age which contains fixed value seeBirthDate, even though this seems very easy to do. For performance reasons, XXE does not attempt to be very smart about what it considers to be rare cases.

Note that if you replace attributes age and birthDate by similar child elements age and birthDate, XXE will behave exactly the same. See person2.rnc and person2.xml.

Important

XXE cannot distinguish between two child elements which have the same name but different data-only content models, unless both child elements have fixed values.

Example 3. Same attribute name, different fixed values

RELAX NG schema, div.rnc:

start = doc-element

doc-element = element doc {
    div-element+
}
div-element = element div {
    (attribute class {"section"}, div-element+) |
    (attribute class {"paragraphs"}, paragraph-element+)
}
paragraph-element = element paragraph { text }

Document conforming to the above schema, div.xml:

<doc>
  <div class="section">
    <div class="section">
      <div class="paragraphs">
        <paragraph>Paragraph 1.</paragraph>
      </div>
    </div>

    <div class="paragraphs">
      <paragraph>Paragraph 2.</paragraph>
    </div>
  </div>

  <div class="paragraphs">
    <paragraph>Paragraph 3.</paragraph>
  </div>
</doc>

XXE has no problem at all with attribute class because, even though there are two attributes named class within element div, they have different fixed values.

Note that if you replace attribute class by similar child element class, XXE will behave exactly the same. See div2.rnc and div2.xml.
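
The rule stated in this section can be written down as a small predicate. This is a sketch of the rule as documented here, not of XXE's implementation; AttributeDef and distinguishable are hypothetical names, and an attribute is reduced to its name plus an optional fixed value.

```python
from typing import NamedTuple, Optional


class AttributeDef(NamedTuple):
    """One attribute pattern of an element (hypothetical, simplified)."""
    name: str
    fixed_value: Optional[str] = None  # None: a datatype such as xsd:int


def distinguishable(a: AttributeDef, b: AttributeDef) -> bool:
    """Apply the documented rule to two attribute patterns of one element."""
    if a.name != b.name:
        return True  # different names never collide
    both_fixed = a.fixed_value is not None and b.fixed_value is not None
    return both_fixed and a.fixed_value != b.fixed_value


# person.rnc: age is either an xsd:int or the fixed value "seeBirthDate".
print(distinguishable(AttributeDef("age"),
                      AttributeDef("age", "seeBirthDate")))      # False
# div.rnc: class is either "section" or "paragraphs".
print(distinguishable(AttributeDef("class", "section"),
                      AttributeDef("class", "paragraphs")))      # True
```

The first call corresponds to Example 2, where only one of the two age patterns has a fixed value. The second corresponds to Example 3, where both class patterns are fixed and differ.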

3.3. Help provided by the "Show Content Model" window

The node path bar is not the only tool in XXE which can help the user recognize attributes and elements which pose problems to the XML editor. The window opened by command Help|Show Content Model also displays very useful information.

Figure 4. Person.xml example when element person is selected

Figure 5. Div.xml example when element div is selected

3.4. Other content models which are not XXE-friendly

Example 4. Not specific to RELAX NG

RELAX NG schema, name.rnc:

start = names-element

names-element = element names {
    name-element+
}
name-element = element name {
    element fullName { text } |
    (element firstName { text } & element lastName { text })
}

Document conforming to the above schema, name.xml:

<names>
  <name><fullName>John Smith</fullName></name>

  <name><firstName>John</firstName><lastName>Smith</lastName></name>

  <name><lastName>Smith</lastName><firstName>John</firstName></name>
</names>

XXE lets you replace the firstName, lastName pair by a fullName: simply select both child elements and use command Edit|Replace. But it is impossible to replace a fullName by a firstName, lastName pair.

The only way to do this is to select the fullName to be replaced and then use command Edit|Force Deletion. This forces XXE to enter the lenient editing mode. Remember that in this mode, the user is allowed to add any child elements, including a firstName, lastName pair[1].

Note that the above example is not specific to RELAX NG. It is possible to model this kind of content with a DTD or a W3C XML Schema.

The example below is very similar but can only be expressed using a RELAX NG schema. This is because, unlike a DTD or a W3C XML Schema, a RELAX NG schema can specify the places within an element where text nodes may occur.

Example 5. Specific to RELAX NG

RELAX NG schema, name2.rnc:

start = names-element

names-element = element names {
    name-element+
}
name-element = element name {
    text |
    (element firstName { text } & element lastName { text })
}

Document conforming to the above schema, name2.xml:

<names>
  <name>John Smith</name>

  <name><firstName>John</firstName><lastName>Smith</lastName></name>

  <name><lastName>Smith</lastName><firstName>John</firstName></name>
</names>

The situation is worse with the name2.rnc example than with the name.rnc example. Deleting a text node is always allowed, and this includes the text node containing "John Smith". That is, there is no way to force XXE to enter its lenient mode in order to replace text node "John Smith" by a firstName, lastName pair.

In such a case, using named element templates is the only way to cope with such content models. Simply specify two named element templates for element name, one containing a text node with a placeholder string and the other containing a firstName, lastName pair.
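
The difference between name.rnc and name2.rnc shows up in what remains once the first alternative has been deleted. The sketch below models only schema validity, not XXE's behavior. It looks at child element names alone and ignores whitespace and mixed content, and both function names are hypothetical. It relies on the fact that the RELAX NG text pattern also matches an absence of text.

```python
import xml.etree.ElementTree as ET

PAIR = {"firstName", "lastName"}


def child_names(xml_text):
    """Names of the child elements of the root element, in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]


def valid_for_name_rnc(xml_text):
    """name.rnc: fullName | (firstName & lastName)."""
    names = child_names(xml_text)
    return names == ["fullName"] or (len(names) == 2 and set(names) == PAIR)


def valid_for_name2_rnc(xml_text):
    """name2.rnc: text | (firstName & lastName); text may be empty."""
    names = child_names(xml_text)
    return names == [] or (len(names) == 2 and set(names) == PAIR)


# What is left after deleting fullName (name.rnc) or the text (name2.rnc):
print(valid_for_name_rnc("<name/>"))   # False
print(valid_for_name2_rnc("<name/>"))  # True
# The interleave accepts the pair in either order.
print(valid_for_name_rnc(
    "<name><lastName>Smith</lastName><firstName>John</firstName></name>"))  # True
```

With name.rnc, removing fullName leaves an element that no longer conforms to the schema, which is why a forced deletion is needed. With name2.rnc, the emptied element still conforms, so the deletion is an ordinary edit.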



[1] The right approach here is to define two named element templates for element name, one containing a fullName child element and the other containing a firstName, lastName pair.