Acronyms!

API

Application Programming Interface

CSS

Cascading Style Sheets

DOM

Document Object Model

DTD

Document Type Definition

XML

Extensible Markup Language

HTML

Hypertext Markup Language

SAX

Simple API for XML

SOAP

Simple Object Access Protocol

W3C

World Wide Web Consortium

XLink

XML Linking Language

XPointer

XML Pointer Language

XSD

XML Schema Definition

XSL

Extensible Stylesheet Language

XSLT

Extensible Stylesheet Language Transformations

DTD - Document Type Definition

<!ENTITY % addrElements "street | city | state | zip">
<!ELEMENT address (#PCDATA | %addrElements; )*>

<!ELEMENT location (city?, state?, zip?)>

<!ELEMENT street (#PCDATA)>
<!ATTLIST street id ID #IMPLIED>

<!ELEMENT city (#PCDATA)>
<!ATTLIST city id ID #IMPLIED>

<!ELEMENT state (#PCDATA)>
<!ATTLIST state id ID #IMPLIED>

<!ELEMENT zip (#PCDATA)>
<!ATTLIST zip id ID #IMPLIED>
*

Zero or more of these elements.

?

Zero or one of these elements, i.e. the element is optional. (One or more is written with + instead.)

#PCDATA

Parsed Character Data.

#IMPLIED

Means the attribute is optional. Alternative to #REQUIRED.

For a good example of a DTD, just look at the one that defines XML Schema (which isn’t necessarily defined using XML Schema).

Escaping What XML Is Sensitive To

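This can go in a Bash script exactly as written, because newlines are fine between the semicolon-separated sed commands. The ampersand has to be replaced first, otherwise the ampersands introduced by the other replacements would get escaped again. The \x27 is the single quote written in hex (a GNU sed escape) so that it doesn’t end the shell’s quoted string.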

sed 's/\&/\&amp;/g;
s/"/\&quot;/g;
s/</\&lt;/g;
s/>/\&gt;/g;
s/\x27/\&apos;/g'
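For anyone doing this outside the shell, here is a minimal sketch of the same five replacements in Python. Python is not used anywhere else in these notes, and escape_xml is a made-up name for illustration; the only library call is the standard library's xml.sax.saxutils.escape.

```python
from xml.sax.saxutils import escape


def escape_xml(text: str) -> str:
    """Escape the five characters XML is sensitive to (hypothetical helper)."""
    # escape() replaces & first, then > and <, so the entities it creates
    # are not escaped a second time. The extra mapping adds the two quote
    # characters, which escape() leaves alone by default.
    return escape(text, {'"': "&quot;", "'": "&apos;"})


print(escape_xml("""Tom & "Jerry" <cat's>"""))
# prints: Tom &amp; &quot;Jerry&quot; &lt;cat&apos;s&gt;
```

Like the sed script, this escapes blindly: text that is already escaped gets escaped again.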

Validating XML

Validating happens on two levels. First, the XML must be well formed, meaning it isn’t messed up in some fundamental way like mismatched tags or bad nesting. Second, it can be checked against the rules for how your particular flavor of XML should be organized, as specified in the DTD. Here’s how I’ve been checking it:

xmllint --noout --dtdvalid hardware.dtd resources.xml

If you just need to validate RSS, check out the official RSS validator.
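The first half of validation, checking that the XML is well formed, can be sketched in Python with the standard library alone. This is only an illustration and is_well_formed is a hypothetical name; note that xml.etree.ElementTree does not validate against a DTD, so this does not replace the xmllint check.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML (well-formedness only)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Mismatched tags, bad nesting, stray characters and so on.
        return False


print(is_well_formed("<a><b>ok</b></a>"))    # prints: True
print(is_well_formed("<a><b>bad</a></b>"))   # prints: False
```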

XML Schema

My XML books are from the early 2000s and I’m not sure what the state of XML is now. It seems that people didn’t like DTDs because they are not written in XML themselves. So a different plan was formed to fix that. XML Schema is a way to define XML schemas using XML itself.

Looks like xmllint can handle this too:

xmllint --noout --schema hardware.xsd resources.xml

I found this to be quite fussy, so the first thing to check is whether the simplest possible thing will validate. Here’s a simple test setup.

simplest.xsd
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="name" type="xs:string"/>
</xs:schema>
simplest.xml
<?xml version="1.0" encoding="UTF-8" ?>
<name>x</name>

And then check it:

xmllint --noout --schema simplest.xsd simplest.xml

It should say:

simplest.xml validates

A slightly more complicated example:

dictionary.xsd
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

<xs:element name="dictionary" type="DictType"/>

<xs:complexType name="DictType">
  <xs:sequence>
    <xs:element name="entry" type="EntryType" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="EntryType">
  <xs:sequence>
    <xs:element name="word" type="xs:string" maxOccurs="1" />
    <xs:element name="def" type="xs:string" maxOccurs="unbounded" />
  </xs:sequence>
</xs:complexType>

</xs:schema>
dictionary.xml
<?xml version="1.0" encoding="UTF-8" ?>
<dictionary>
  <entry>
    <word>quill</word>
    <def>The large strong feather of a goose or other large fowl.</def>
    <def>Formerly a common instrument of writing.</def>
    <def>The spine of a porcupine.</def>
  </entry>
  <entry>
    <word>attribute</word>
    <def>That which is attributed.</def>
    <def>A quality which is considered as belonging to, or inherent in, a person or thing.</def>
    <def>An essential or necessary property or characteristic.</def>
  </entry>
</dictionary>

This shows a type definition used to create a container supporting multiple child tags.
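To make the rules in that schema concrete, here is a rough Python sketch that checks the same structure by hand: a dictionary root, at least one entry, and in each entry exactly one word followed by one or more def. This is not XML Schema validation (the standard library has none), and check_dictionary, GOOD and BAD are names invented for this illustration.

```python
import xml.etree.ElementTree as ET


def check_dictionary(xml_text: str) -> list:
    """Return a list of problems; an empty list means the layout looks right."""
    root = ET.fromstring(xml_text)
    if root.tag != "dictionary":
        return [f"root is <{root.tag}>, expected <dictionary>"]
    problems = []
    entries = list(root)
    if not entries:
        problems.append("dictionary has no <entry>")
    for i, entry in enumerate(entries, start=1):
        if entry.tag != "entry":
            problems.append(f"child {i} is <{entry.tag}>, expected <entry>")
            continue
        tags = [child.tag for child in entry]
        # Mirrors the xs:sequence: exactly one word, then one or more def.
        if len(tags) < 2 or tags[0] != "word" or any(t != "def" for t in tags[1:]):
            problems.append(f"entry {i} should be one <word> followed by at least one <def>")
    return problems


GOOD = "<dictionary><entry><word>quill</word><def>The spine of a porcupine.</def></entry></dictionary>"
BAD = "<dictionary><entry><def>No word here.</def></entry></dictionary>"

print(check_dictionary(GOOD))  # prints: []
print(check_dictionary(BAD))
```

A real validator checks far more (types, namespaces, attributes), so treat this as a reading aid for the schema and keep using xmllint for the actual check.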

Checking Schema Documents

Previously I demonstrated how to check your XML document against a schema. Since "XML Schema" files are XML, it should be possible to validate these too. The question is, against what? What are the official rules for defining an XML Schema file? Here is how it is done, using the DTD that defines those rules:

xmllint --noout --dtdvalid http://www.w3.org/2001/XMLSchema.dtd myown.xsd

If that’s not weird enough for you, you can also validate schema definition files using an "XML Schema schema document":

wget http://www.w3.org/2001/XMLSchema.xsd
xmllint --noout --schema XMLSchema.xsd myown.xsd

This seems to take a lot longer and I wasn’t able to get xmllint to pull it in off the Internet.

Of course if you have a schema designed to specify how schemas are composed you should be able to see if that schema itself follows its own rules by checking it against itself.

xmllint --noout --schema XMLSchema.xsd XMLSchema.xsd

If this doesn’t work, you should ask for your money back.

XSL - Extensible Stylesheet Language

Note that XSLT is Extensible Stylesheet Language Transformations. This is one part of the more general XSL. The input is data in an XML structure. The output is that data transformed into XML, HTML, or plain text, presumably to better accommodate some output format or purpose.

Building on the "simplest" examples above, here is the simplest stylesheet for this kind of document:

simplest.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="name">
  <html>
    <body>
        <xsl:apply-templates/>
    </body>
  </html>
</xsl:template>
</xsl:stylesheet>

When this is applied with:

xsltproc simplest.xsl simplest.xml > simplest.html
Note
Seems like only weird people like me use command lines. With XSL, it seems that browsers can do this natively. Check out: https://www.w3schools.com/xml/xsl_client.asp

The resulting HTML file looks like this:

<html><body>x</body></html>

Here is a more complicated example using the dictionary example from above.

dictionary.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Whitespace-only text between the entry elements would otherwise be
     copied to the output as blank lines. This strips it. -->
<xsl:strip-space elements="dictionary"/>

<xsl:template match="dictionary">
  <html><body>
    <h3>Dictionary</h3>
    <xsl:apply-templates>
      <xsl:sort select="word"/>
    </xsl:apply-templates>
  </body></html>
</xsl:template>

<xsl:template match="entry">
  <p> <b><xsl:value-of select="word"/></b>:
  <xsl:for-each select="def">
    <xsl:number value="position()"/>
    <xsl:text>. </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text> &#xa;</xsl:text>
  </xsl:for-each>
  </p>
</xsl:template>

</xsl:stylesheet>

Note that this line just emits a space and a newline in the output (&#xa; is the line feed character).

<xsl:text> &#xa;</xsl:text>

This is very helpful when degunking a big XML database into a sensible text file (and not necessarily making HTML or other such messes).

When applied to the dictionary.xml above this style sheet produces:

dictionary.html
<html><body>
<h3>Dictionary</h3>
<p><b>attribute</b>:
  1. That which is attributed.
2. A quality which is considered as belonging to, or inherent in, a person or thing.
3. An essential or necessary property or characteristic.
</p>
<p><b>quill</b>:
  1. The large strong feather of a goose or other large fowl.
2. Formerly a common instrument of writing.
3. The spine of a porcupine.
</p>
</body></html>

In a browser it renders like this:

Dictionary

attribute: 1. That which is attributed. 2. A quality which is considered as belonging to, or inherent in, a person or thing. 3. An essential or necessary property or characteristic.

quill: 1. The large strong feather of a goose or other large fowl. 2. Formerly a common instrument of writing. 3. The spine of a porcupine.
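For the plain text file use case, the same sort-and-number logic can also be sketched without XSLT. The following is a hand-written Python rough equivalent, not a way of running the stylesheet (the standard library has no XSLT processor); dictionary_to_text and SAMPLE are names made up for this illustration, and the output layout is my own choice.

```python
import xml.etree.ElementTree as ET


def dictionary_to_text(xml_text: str) -> str:
    """Sort entries by word and number each definition, as plain text."""
    root = ET.fromstring(xml_text)
    # Roughly what <xsl:sort select="word"/> does, ignoring collation details.
    entries = sorted(root.findall("entry"),
                     key=lambda e: e.findtext("word", default=""))
    lines = []
    for entry in entries:
        lines.append(entry.findtext("word", default="") + ":")
        # enumerate() plays the role of <xsl:number value="position()"/>.
        for n, definition in enumerate(entry.findall("def"), start=1):
            lines.append(f"  {n}. {definition.text or ''}")
    return "\n".join(lines)


SAMPLE = """<dictionary>
  <entry><word>quill</word><def>The spine of a porcupine.</def></entry>
  <entry><word>attribute</word><def>That which is attributed.</def></entry>
</dictionary>"""

print(dictionary_to_text(SAMPLE))
```

With the two-entry sample, attribute comes out before quill, each followed by its numbered definitions.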