XML, the eXtensible Markup Language
CS 284 (CSA), Spring 2005
XML is a markup language, with roots in text processing. Markup is annotation added to a body of text to indicate such features as font choice, line breaks, paragraph and section structure, and inclusion of figures and tables.
See the example of markup for an illustration.
We will use the term rendition for a body of text together with a description of a desired format. The markup example (second box above) is an example of a rendition. Software such as Microsoft Word and Adobe PageMaker also works with renditions internally.
A presentation is the result of formatting (third box above). Presentations are intended for human perception. They need not use visual media, such as paper or an electronic display: for example, an SGML document might also generate sound through speakers for the visually impaired.
Word and PageMaker offer a visual interface that shows the (visual) presentation of the system's document rendition as one creates the document. These are examples of WYSIWYG systems (What You See Is What You Get).
The term markup was coined in 1970 by Charles Goldfarb, the leader of an IBM team that explored the problem of document content and interchange beginning in the late 1960s. His team produced GML (Generalized Markup Language) at that time, and Goldfarb led an ongoing effort to develop the system and concepts toward SGML (Standard Generalized Markup Language), resulting in the first ISO standard for SGML in 1986 and the current version of that standard in 1991.
SGML supports three fundamental goals:
Common data representation. The markup should provide a universal representation that all systems can use.
Extensibility. It must be possible to define new markup for new situations in order to represent all forms of information.
Document type rules. Documents of a common type must adhere to formally verifiable rules for that type.
Few text processing systems adhere to all of these goals. For example, Microsoft Word is incompatible with other word processors, which must convert a Word document to their own internal formats. Word is nevertheless a de facto standard (unofficial, established through the fact that many users use it) in many organizations, including St. Olaf, although different versions of Word offer different features. A user cannot define his/her own extensions to Word, and there are no formal rules for document types.
The LaTeX rendition format is frequently used for CS publications. LaTeX is frequently entered as markup, although WYSIWYG editors exist. This language has a consistent standard, and is extensible. LaTeX offers different document types, with presentation choices governed by those types. However, there is no rule system for documents of a particular type.
SGML is commonly used in extremely large scale documentation applications, such as aircraft maintenance information, government regulation, and power plant documentation. For example, a single model of a single commercial aircraft might require 4 million unique pages of documentation that must be revised and republished quarterly! [Kimber, in Goldfarb] Of course, Boeing or Airbus produce many such models. As recently as 2000, applications such as aircraft documentation represented more information than the entire web.
SGML markup consists of elements indicated by tags. For example, one element may represent a paragraph, another a style choice (e.g., emphasized text, perhaps indicated in this font), and another a graphic image. Each element has a start tag and an end tag, which allows for nesting of elements (e.g., a section contains paragraphs; paragraphs contain regions of emphasized text). The tags themselves are delimited with angle brackets, now familiar in HTML. The language was developed over a 20 year period, and is rich with capabilities including support for hypertext and for style sheets, which separate rendition decisions from document content.
SGML has various idiosyncrasies and complexities that support large-scale use, but which increase the learning curve and get in the way of ordinary-sized applications.
HTML adopted some of SGML's strengths: for example, most tags are generalized, not tied to particular formatting choices; and HTML certainly provides a common representation of documents. However, HTML offered no extensibility and no formal rules for document types.
Consequently, as use of the web took off, browser vendors and others made their own extensions of HTML, leading to incompatibility and difficulties for alternate presentation media. Soon, Berners-Lee's WWW Consortium responded by creating a simple stylesheet mechanism (CSS) and an approach to extension (now superseded by XML).
Tags. Elements, attributes. _____
Document structure. _____
Entities. _____
_____
_____
_____
Tim Berners-Lee's WWW Consortium developed XML beginning in 1996, with a standard published in (ca.) 1998, and additional supporting standards for links (XLink), style sheets (XSL), etc., emerging thereafter. Jean Paoli of Microsoft and Jon Bosak of Sun Microsystems led the effort. XML is a subset of SGML, and the accompanying standards are generally based on corresponding SGML features.
As in SGML and HTML, markup tags are delimited by angle brackets, and entities are delimited by an ampersand (&) and a semicolon (;).
Unlike HTML, the XML language achieves all three of the SGML goals. In particular, XML is fully extensible (one can define any desired document type), and formal rules are defined for each document type, against which individual documents can be checked for validity. SGML's DTD format provides one way to define document types and describe their rules; XML Schema constitutes an alternative approach.
By 1998, it became clear that XML had important applications
beyond text processing. For example, EDI (Electronic Data
Interchange) technology seeks to automate the way large companies buy
from and sell to each other. This automation goes beyond the kind of
transaction a consumer has when they buy a book from
amazon.com: for businesses, purchase orders must be
generated and approved, entries must be made in private accounting
systems, etc.; when conducted through the web, one might refer to this
as integrated e-commerce (although it's only called
B2B, Business to Business, in the media).
XML makes it possible to send documents in one
format between companies, with each organization then transforming
those documents into its own local formats (often also in XML) as
many times as needed, finally making changes in particular (non-XML)
internal systems. The use of XML rather than custom-built
intermediary languages makes this capability available to even small
businesses.
Document structure. Example:
~cs378/xml/Dog.xml,
SD spec for
Dog class.
_____
Balanced tags. Empty elements. _____
Parsing, verification. _____
_____
_____
_____
_____
DOM is a low-level API (Application Programming Interface) which lets a programmer deal directly with the contents of an XML document. There is no pre-processing involved with this process. All that is necessary is the creation of a DOM tree, which is a complete in-memory representation of the XML document used to create it. After the DOM is created, there are many methods already written which allow programmers to efficiently and recursively navigate their way through a DOM tree, making changes if necessary.
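As a small illustration of recursive navigation, the sketch below counts the element nodes in a DOM tree. The class name DomWalk and its methods are hypothetical names chosen for this example; the tree is built with the document builder described in the next section, and the sample document is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
final class DomWalk {

    // Counts the element nodes in the subtree rooted at node, including node itself.
    static int countElements(Node node) {
        int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        // Visit each child in turn; the recursion handles any depth of nesting.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    // Builds a DOM tree from XML text held in a String.
    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<note><to><name>Agent X</name></to><message>Hi</message></note>");
        System.out.println(countElements(doc)); // note, to, name, message: prints 4
    }
}
```

Text nodes and the Document node itself are visited but not counted, which is why the test on getNodeType() is needed.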
Document object (DOM tree)
An XML parser or document builder is a code object for converting an XML document into a DOM tree (i.e., an object in the class Document). The Java XML API does not provide a single general-purpose parser object for performing such conversions, because some options for a parser object can't just be passed as arguments or settings, but must be incorporated in the construction of that object. Therefore, the API provides a factory for constructing custom parser objects.
A DocumentBuilderFactory (in the javax.xml.parsers package) is a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. A DocumentBuilder defines the API to obtain Document instances from an XML document. Using this class, a programmer can obtain a Document from XML. When we say Document, we mean a DOM tree.
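import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Obtain a factory, then set parse options on the factory itself.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(true);  // validate the document against its DTD
// ... (other parse options)

// The factory constructs a parser with those options built in.
DocumentBuilder db = dbf.newDocumentBuilder();

// inputSource stands for some input source, e.g. a File or an InputSource.
Document document = db.parse(inputSource);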
The argument to the parse() method of the DocumentBuilder class can be a File object or an InputSource object (look this type up in the org.xml.sax package). For example, if a String object str holds your XML document represented as a single string, then
document = db.parse(new InputSource(new StringReader(str)));
creates a DOM tree document from that string str.
There are many terms frequently used by programmers when speaking about XML and DOM trees. In case you aren't familiar with the terminology, what follows is a brief overview of the most commonly used terms.
XML Example:
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "DTD/note.DTD">
<note>
<to>
<name>Agent X</name>
<building>Langley, CIA Headquarters</building>
</to>
<from>
<name>Agent Y</name>
</from>
<message>Leave the building at 11:00 p.m.</message>
</note>
For purposes of simplicity, let's assume that rootElement has been defined to be the root element of this DOM tree. That is, rootElement represents the DOM tree specified by the diagram above. To obtain the value "Agent X", this sequence of calls needs to be made:
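One possible sequence is sketched below. It is an illustration rather than the calls from the original notes: the class name AgentLookup and its methods are hypothetical, and the sample string omits the DOCTYPE line so that the parser does not go looking for DTD/note.DTD. The sketch uses getElementsByTagName() rather than getFirstChild() on the elements, because the whitespace between tags in the example is itself stored as Text nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
final class AgentLookup {

    // Returns the text of the <name> element inside the first <to> element.
    static String recipientName(Element rootElement) {
        Element to = (Element) rootElement.getElementsByTagName("to").item(0);
        Element name = (Element) to.getElementsByTagName("name").item(0);
        // The string itself lives in a Text node that is the child of <name>.
        return name.getFirstChild().getNodeValue();
    }

    // Parses XML text and returns the root element of the resulting DOM tree.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String xml =
              "<note>\n"
            + "  <to>\n"
            + "    <name>Agent X</name>\n"
            + "    <building>Langley, CIA Headquarters</building>\n"
            + "  </to>\n"
            + "  <from><name>Agent Y</name></from>\n"
            + "  <message>Leave the building at 11:00 p.m.</message>\n"
            + "</note>";
        System.out.println(recipientName(parseRoot(xml)));
    }
}
```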
An excellent way to gain a strong understanding of the recursion involved in processing an XML document is to spend the time to write a DOM serializer. A DOM serializer is simply a procedure that generates a raw XML document from a DOM tree. The serializer must visit every child of every node, so it is naturally written as a recursive procedure that handles any number of children at any depth. Remember that a DOM tree is really nothing more than a data structure stored in main memory.
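A minimal sketch of such a serializer follows, assuming that only Document, Element, attribute, Text, and CDATA nodes need to be written; comments, processing instructions, the DOCTYPE, and the other node types are silently skipped here, and a complete serializer would handle them too. The class name MiniSerializer is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name; a partial serializer for illustration only.
final class MiniSerializer {

    static String serialize(Node node) {
        StringBuilder out = new StringBuilder();
        write(node, out);
        return out.toString();
    }

    private static void write(Node node, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: {
                writeChildren(node, out);
                break;
            }
            case Node.ELEMENT_NODE: {
                out.append('<').append(node.getNodeName());
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node attr = attrs.item(i);
                    out.append(' ').append(attr.getNodeName()).append("=\"")
                       .append(escape(attr.getNodeValue())).append('"');
                }
                if (node.hasChildNodes()) {
                    out.append('>');
                    writeChildren(node, out);           // recursive step
                    out.append("</").append(node.getNodeName()).append('>');
                } else {
                    out.append("/>");                   // empty element
                }
                break;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE: {
                out.append(escape(node.getNodeValue()));
                break;
            }
            default:
                break; // other node types are not handled in this sketch
        }
    }

    private static void writeChildren(Node parent, StringBuilder out) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            write(child, out);
        }
    }

    // Replaces the characters that may not appear literally in text or attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(parse("<a x=\"1\"><b>t &amp; u</b><c/></a>")));
    }
}
```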
Additionally, as mentioned before, there are 12 types of Nodes, and each type is written out differently. Writing a serializer won't be an assignment here, but will most likely become a task in the team project later in the semester. To change the contents of an XML document through DOM, one first parses the document into a DOM tree, then makes changes to the tree. However, we would like a new raw XML document to reflect the changes made to the DOM tree, so we need to write the DOM tree back to a file. Our project is going to involve a lot of XML defining the structure of a portfolio, and if the structure needs to be changed, then a DOM tree must be created, changes made to that DOM tree, and the tree written out as a file and stored again. There will be many methods to accomplish this task, and if the methods are written well the first time, they can be reused for any XML document.
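For comparison with a hand-written serializer, the standard javax.xml.transform package can also write a DOM tree to a file. The sketch below is one way to do that, not necessarily the approach the project will take; the class name DomSaver and the method save() are hypothetical.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
final class DomSaver {

    // Writes the DOM tree doc to file as raw XML.
    static void save(Document doc, File file) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        try (OutputStream out = new FileOutputStream(file)) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<note><to><name>Agent X</name></to></note>")));
        File file = File.createTempFile("note", ".xml"); // temporary file for the demonstration
        save(doc, file);
        System.out.println("Wrote " + file.getPath());
    }
}
```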
The following example shows recursive modification... (append to <name>s, add a <department>)
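The example itself is not reproduced in these notes. As a rough sketch of what such a recursive modification might look like, the code below assumes that "append to <name>s" means adding a suffix to the text of every <name> element, and that the new <department> element is inserted immediately after each <name>. Both are guesses, and the class name NameUpdater, the suffix, and the department text are hypothetical. The sketch also assumes that <name> is never the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
final class NameUpdater {

    // Appends suffix to every <name> and inserts a <department> right after it.
    static void update(Node node, String suffix, String departmentName) {
        if (node.getNodeType() == Node.ELEMENT_NODE && "name".equals(node.getNodeName())) {
            Document doc = node.getOwnerDocument();
            node.appendChild(doc.createTextNode(suffix));
            Element dept = doc.createElement("department");
            dept.appendChild(doc.createTextNode(departmentName));
            // With a null second argument, insertBefore() appends at the end.
            node.getParentNode().insertBefore(dept, node.getNextSibling());
            return;
        }
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling first: the recursive call may insert
            // a new <department> after child, which must not be visited.
            Node next = child.getNextSibling();
            update(child, suffix, departmentName);
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                    "<note><to><name>Agent X</name></to><from><name>Agent Y</name></from></note>")));
        update(doc, " (verified)", "Operations");
        System.out.println(doc.getElementsByTagName("name").item(0).getTextContent());
        System.out.println(doc.getElementsByTagName("department").getLength());
    }
}
```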
Examples:
~cs378/xml/Contract.dtd,
SpecML.dtd
A Document Type Definition (DTD) is used to constrain the content of an XML document. For example, if you are trying to model the contents of a music collection you would like a CD to have only one artist and one album name. DTDs allow a developer to specify exactly these types of constraints. A DTD is used to define the legal building blocks of an XML document. A DTD can be declared inline in your XML document or as an external document.
<!ELEMENT element-name (element-content)>
<!ELEMENT element-name
(child-element-name, child-element-name, ..., child-element-name)>
<!ELEMENT element-name (#PCDATA)>
<!ELEMENT element-name (child-element+)>
<!ELEMENT element-name (child-element*)>
<!ELEMENT element-name
(child-element1|child-element2)>
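To see declarations of this kind in use, the sketch below validates two small documents against an inline DTD for the CD example mentioned earlier. The DTD, the element names cd, artist, and album, and the class name DtdCheck are all made up for this illustration. Validation problems are reported to an ErrorHandler rather than thrown, so the sketch collects them in a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class name, used only for this illustration.
final class DtdCheck {

    // An inline DTD: a cd holds exactly one artist followed by exactly one album.
    static final String DTD =
          "<!DOCTYPE cd [\n"
        + "  <!ELEMENT cd (artist, album)>\n"
        + "  <!ELEMENT artist (#PCDATA)>\n"
        + "  <!ELEMENT album (#PCDATA)>\n"
        + "]>\n";

    // Parses xml with validation turned on and returns the validity errors found.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        final List<String> problems = new ArrayList<>();
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        String good = DTD + "<cd><artist>A</artist><album>B</album></cd>";
        String bad  = DTD + "<cd><artist>A</artist><artist>C</artist><album>B</album></cd>";
        System.out.println("good: " + validate(good).size() + " error(s)");
        System.out.println("bad:  " + validate(bad).size() + " error(s)");
    }
}
```

The second document is well formed but not valid, because it lists two artists where the DTD allows only one.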
Example: SpecMLToHTML.xsl
rab@stolaf.edu, February 15, 2005