XML stuff


[Random collection of XML-related links, so I can keep track of what I'm learning. -- rgr, 7-Oct-02.]

Table of Contents

  1. XML stuff
    1. Table of Contents
    2. General XML information
      1. Standards and draft standards
      2. Books
      3. Implementations
    3. Documents frequently cited by XML Recommendations
    4. SOAP, etc.
    5. Scalable Vector Graphics
    6. Bioinformatics and XML
    7. Glossary

General XML information

From the introduction on the Extensible Markup Language (XML) page:
Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.
The first sentence is only half true; nothing derived from SGML can truly be called "simple." XML does indeed simplify or eliminate the hairier SGML features, and is lexically simpler even than HTML. The basic concepts are easy, especially if you already know HTML; indeed, the most important differences between traditional HTML and XHTML, the new XML-ized version, are the things that are valid in HTML but that you can't do in XHTML. However, [DTD vs. Schema].
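To make the HTML/XHTML difference concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and both markup strings are made up for illustration; the point is only that an XML parser rejects habits that HTML tolerates.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML habits: an unclosed <br>, an unquoted attribute value, and a
# closing tag whose case does not match the opening tag.
html_style = "<p>one<br>two <a href=foo.html>link</a></P>"

# The same content with every element closed, the attribute quoted,
# and tag names in consistent lower case.
xhtml_style = '<p>one<br/>two <a href="foo.html">link</a></p>'

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```

Note that this checks well-formedness only; it says nothing about whether the document is valid XHTML.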

Another reason for complexity is that XML is truly extensible; anyone can define and publish their own document format (called an "XML application"), and standard tools will be able to parse and operate on such documents to some extent. For this to work, XML needs [metainformation]. So, although it is easy to write XML, and even to invent your own XML document formats, it is harder to write the "metadocuments" (such as a DTD or schema) that other tools will probably need to make sense of the XML.
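As a rough illustration of "to some extent," the sketch below feeds an invented document format to a generic parser (Python's xml.etree.ElementTree). The recipes format, its attribute names, and the outline helper are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# A made-up "XML application"; no DTD or schema exists for it.
doc = """<recipes>
  <recipe name="toast" minutes="3">
    <ingredient>bread</ingredient>
  </recipe>
  <recipe name="tea" minutes="5">
    <ingredient>water</ingredient>
    <ingredient>tea leaves</ingredient>
  </recipe>
</recipes>"""

root = ET.fromstring(doc)

def outline(element, depth=0):
    """Return the element tree as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

# A generic tool can recover the structure without knowing the format...
print("\n".join(outline(root)))

# ...but not the meaning: without metainformation, "minutes" is just a string.
print(repr(root[0].get("minutes")))  # '3'
```

The parser can walk the tree, but it has no way to know that minutes should be a number or that a recipe needs at least one ingredient; that is the job of the "metadocuments."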

Unfortunately, all of this information, like XML itself, is fairly decentralized. There are about a dozen W3C standards that form what could be considered the "core" set of XML, with numerous internal cross-references. This makes for difficult reading -- even in a Web browser. Worse, these standards were written by different working groups over a period of several years, making it harder to read the earlier standards without understanding the context in which they were written. For example, Namespaces in XML makes pervasive changes to the syntax and meaning of names, which had to be retrofitted into earlier standards.
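To see how namespaces change the meaning of names, here is a small sketch with Python's xml.etree.ElementTree, which reports each element name together with its namespace URI in braces. The example.org URIs and the document itself are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

doc = """<doc xmlns:a="http://example.org/ns/one"
     xmlns:b="http://example.org/ns/two">
  <a:title>first</a:title>
  <b:title>second</b:title>
  <title>third</title>
</doc>"""

root = ET.fromstring(doc)

# Three elements that all look like "title" turn out to have three
# different names once the prefixes are resolved.
tags = [child.tag for child in root]
for tag in tags:
    print(tag)
# {http://example.org/ns/one}title
# {http://example.org/ns/two}title
# title

# The prefix itself is not significant; only the URI it is bound to matters.
other = ET.fromstring('<x:title xmlns:x="http://example.org/ns/one"/>')
print(other.tag == tags[0])  # True
```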

In order to make sense of it all, it helps to have a good general book that covers the XML core concepts in one place. Even then, it is best to read lightly through it the first time, without expecting to understand everything, in order to get the big picture. Then you can fill in the details on the second pass.

Standards and draft standards

These are all available from the W3C on their Technical Reports and Publications page, directly or indirectly.
Extensible Markup Language (XML), a brief description of how it is being developed.
Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, 6 October 2000.
Namespaces in XML, W3C Recommendation, 14-January-1999. Defines the oft-used QName and NCName productions.
This is pretty hairy, so it's a good thing that somebody thought to break it up:
XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition), A Reformulation of HTML 4 in XML 1.0, W3C Recommendation, 26 January 2000, revised 1 August 2002.
[more "meta" things. -- rgr, 3-Nov-02.]
XML Information Set, W3C Recommendation, 24 October 2001.
XQuery 1.0 and XPath 2.0 Data Model, W3C Working Draft, 16 August 2002. This is one of the competing XML API model candidates; the XML DOM is another.
Document Object Model (DOM).
XML Path Language (XPath) Version 1.0, W3C Recommendation, 16 November 1999.
XSL Transformations (XSLT) Version 1.0, W3C Recommendation, 16 November 1999. See also the Oasis overview of XSL. Miloslav Nic has written a very nice-looking XSLT Reference with a large collection of examples that is also searchable.
Resource Description Framework (RDF) Model and Syntax Specification, W3C Recommendation, 22 February 1999. RDF overview information can be found on the Resource Description Framework (RDF) / W3C Semantic Web Activity page.


Books

Elliotte Rusty Harold and W. Scott Means, XML in a Nutshell (2e), O'Reilly, 2002. $39.95, ISBN 0-596-00292-0. Contains a good general overview of all of the foregoing.


Implementations

This is limited to only those implementations with which I have some personal experience. See also the enormous Oasis Public SGML/XML Software list.
CL-XML: Common Lisp support for the 'Extensible Markup Language', written by James Anderson. Supports both SAX-like and XQDM document interfaces.
XML/HTML parsers in Common Lisp, by Franz, Inc. Includes A Lisp Based HTML Parser, and A Lisp Based XML Parser. Both produce output as nested Lisp lists, which can be easier to deal with than SAX and lighter-weight than DOM.
SOAP::Lite, module for Perl, by Paul Kulchenko. See also the SOAP, etc. section.
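For comparison, the three interface styles mentioned above (SAX-like events, DOM, and nested lists) can be sketched in Python using only the standard library. This is not the output format of either Lisp parser; the nested-list shape produced by to_nested is an assumption chosen for illustration, and it ignores text that follows a child element.

```python
import xml.dom.minidom
import xml.sax
import xml.etree.ElementTree as ET

DOC = "<order id='7'><item>tea</item><item>toast</item></order>"

# 1. SAX-like: the parser calls back on each event; the caller keeps all state.
class ItemCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self.items.append("")

    def characters(self, content):
        if self._in_item:
            self.items[-1] += content

    def endElement(self, name):
        if name == "item":
            self._in_item = False

collector = ItemCollector()
xml.sax.parseString(DOC.encode("utf-8"), collector)
sax_items = collector.items

# 2. DOM: the whole document becomes a tree of node objects.
dom = xml.dom.minidom.parseString(DOC)
dom_items = [node.firstChild.data for node in dom.getElementsByTagName("item")]

# 3. Nested lists: [tag, attributes, children...], built from plain data types.
def to_nested(element):
    node = [element.tag, dict(element.attrib)]
    if element.text and element.text.strip():
        node.append(element.text)
    node.extend(to_nested(child) for child in element)
    return node

nested = to_nested(ET.fromstring(DOC))

print(sax_items)  # ['tea', 'toast']
print(dom_items)  # ['tea', 'toast']
print(nested)     # ['order', {'id': '7'}, ['item', {}, 'tea'], ['item', {}, 'toast']]
```

The nested form can be taken apart with ordinary list operations, which is roughly the appeal of the Lisp list output.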

Documents frequently cited by XML Recommendations

Uniform Resource Identifiers (URI): Generic Syntax, RFC2396. (Cited as the definitive reference for URIs.)
HTML 4.01 Specification, W3C Recommendation, 24 December 1999. Not an XML application, but will continue to be the dominant Web markup language for some time. In fact, XSLT contains special support for pre-XML incarnations of HTML; 4.0 is the default version when HTML output is chosen.
Cascading Style Sheets, Level 2.

SOAP, etc.

Scalable Vector Graphics

Bioinformatics and XML


Glossary

application, XML
An "XML application" just means an "application of XML" to a given problem area. It is usually defined by a machine-readable DTD that describes the syntax to a validating parser, together with a human-readable document that defines the semantics.
NCName [Namespaces in XML]
A "no colon name," used as an identifier. Such names do not belong to any namespace.
QName [Namespaces in XML]
A "qualified name," with an optional namespace prefix. Such names are generally understood to belong to a namespace. A QName has at most one colon, which (if present) must be neither the first nor the last character. Syntactically, this is an NCName with an optional prefix (also an NCName).
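The colon rules for a QName can be sketched as a small Python function. The name split_qname is hypothetical, and this is a simplification: it checks only the placement of the colon, not whether each part uses the characters that the NCName production actually allows.

```python
def split_qname(qname):
    """Split a QName into (prefix, local_part); prefix is None when absent.

    Raises ValueError when the colon rules are broken. Character classes
    are NOT checked, so this accepts some strings that are not real QNames.
    """
    if not qname:
        raise ValueError("empty name")
    if qname.count(":") > 1:
        raise ValueError("more than one colon: %r" % qname)
    if ":" not in qname:
        # No prefix: the whole thing is a single NCName.
        return (None, qname)
    prefix, local = qname.split(":")
    if not prefix or not local:
        raise ValueError("colon may not be first or last: %r" % qname)
    return (prefix, local)

print(split_qname("xsl:template"))  # ('xsl', 'template')
print(split_qname("template"))      # (None, 'template')
```

Strings such as ":template", "xsl:", and "a:b:c" raise ValueError.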

Bob Rogers <rogers@rgrjr.dyndns.org>