Overview of XML
CS 284 (CSA), Spring 2005
On this page, we present an outline of markup languages in general and XML in particular, providing a framework for programming with XML. We assume no particular background beyond general prior exposure to HTML.
Terms: markup, markup languages, rendition ("source" text with markup), presentation (formatted view), style sheet (determines how to present rendition).
Markup languages such as HTML and XML use tags for the markup.
General form: <name attrib="value" ...> content... </name>.
Tags such as <name attrib="value" ...> are open tags; </name> is the corresponding close tag.
Elements of the document are indicated by tag pairs <name ...> content... </name>. The content of such an element is the (marked-up) text between the open and close tags.
Attributes are options of the form attrib="value" ... within an open tag.
Entities are additional objects within the rendition, referred to by name. For example, the entity &nbsp; represents a "non-breaking space" (in HTML), and &lt; represents the less-than character. Other ideas for entities: an entity used to insert a company's logo graphic; and an entity used to insert a standard body of text, such as a copyright notice.
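To make the terms above concrete, here is a minimal sketch in Java that parses a small rendition and reports its element name, attribute, and content. The sample document, the attribute name priority, and the entity name copyright are all made up for illustration; the parser is the standard JAXP one shipped with the JDK. Note that &lt; is predefined in XML, while the copyright entity has to be declared in the document's DOCTYPE.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class MarkupTerms {
    // Hypothetical sample rendition: one element with one attribute, whose
    // content uses the predefined entity &lt; and an entity (copyright)
    // declared in an internal DTD subset.
    static final String SAMPLE =
        "<!DOCTYPE note [ <!ENTITY copyright \"Copyright 2005\"> ]>"
        + "<note priority=\"high\">1 &lt; 2. &copyright;</note>";

    // Parse the rendition and return its root element.
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)))
                      .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parseRoot(SAMPLE);
        System.out.println("element name: " + root.getTagName());
        System.out.println("attribute:    " + root.getAttribute("priority"));
        // Entity references are expanded by the parser before we see the text.
        System.out.println("content:      " + root.getTextContent());
    }
}
```

The program sees the content with both entity references already replaced by their text, which is the sense in which entities are "additional objects" inserted into the rendition.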
SGML, Standard Generalized Markup Language, Goldfarb et al. since the 1960s, beginning at IBM.
Goldfarb coined term markup in 1970
Standards in '86, '91
Tags of form <name attrib="value" ...> content... </name>. Properly bracketed, i.e., every open tag in the rendition has a matching close tag, and elements nest without overlapping.
Common rendition representation
Extensibility, i.e., ability to define new tags, etc.
Document type rules
Document type rules represented in a separate DTD (Document Type Definition) language, using regular expressions. See below
LaTeX, Leslie Lamport, published 1985
Implemented as a macro package over TeX (Donald Knuth, 1978-81) typesetting language.
Some proper bracketing: e.g., "environments" have the form \begin{name}...\end{name}
Common rendition representation, extensible, but no explicit document type rules.
Commonly used in mathematics and CS research publications.
HTML, Tim Berners-Lee (creator of WWW) and Anders Berglund, '89
Tags <name attrib="value" ...> content... </name> as in predecessor language SGML
Common rendition, but no extensibility (at first), no (modifiable) document type rules.
Designed as a simplification of SGML for WWW authoring.
XML, Berners-Lee et al (WWW consortium, w3c.org) '96, standard '98.
Simplified subset of SGML, but with extensibility and document type rules
Document types may be expressed using DTDs or XML Schema (an XML form for specifying document types).
Examples: Note.dtd, SpecML.dtd.
Uses a form of regular expressions to represent patterns:
| symbol | meaning | example |
|---|---|---|
| , | sequence: items expected in order | (to,from,message) |
| \| | OR | (nosuperclass \| (superclass \| interface)+) |
| * | 0 or more | var* |
| + | 1 or more | var+ |
| ? | 0 or 1 | var? |
| ( ) | grouping | (nosuperclass \| (superclass \| interface)+) |
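As an illustration of how such patterns act as document type rules, the following sketch checks two small documents against the sequence pattern (to,from,message) using the validating parser in the JDK. The DTD here is a hypothetical one written for this example (it is not the Note.dtd file mentioned above), and the class and method names are ours.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DtdCheck {
    // Hypothetical internal DTD: a note is a to, then a from, then a message.
    static final String DTD =
        "<!DOCTYPE note ["
        + "<!ELEMENT note (to,from,message)>"
        + "<!ELEMENT to (#PCDATA)>"
        + "<!ELEMENT from (#PCDATA)>"
        + "<!ELEMENT message (#PCDATA)>"
        + "]>";

    // Parse DTD + body with validation on; return the validity errors found.
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // check the document against its DTD
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            // Validity violations are reported as (recoverable) errors.
            @Override public void error(SAXParseException e) {
                problems.add(e.getMessage());
            }
            // Well-formedness violations are fatal; let them propagate.
            @Override public void fatalError(SAXParseException e)
                    throws SAXParseException {
                throw e;
            }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        String good = "<note><to>Ann</to><from>Bob</from><message>Hi</message></note>";
        String bad  = "<note><to>Ann</to><message>Hi</message></note>"; // no <from>
        System.out.println("good: " + validate(good));
        System.out.println("bad:  " + validate(bad));
    }
}
```

The second document is well formed (properly bracketed) but violates the sequence rule, which is the distinction the document type rules are meant to capture.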
There are two main approaches to processing XML in a language such as Java:
SAX, the Simple API for XML, performs actions as an XML document is parsed (that is, as the rendition is read in and broken into elements, attributes, and entities).
DOM, the Document Object Model, creates an internal data structure (a DOM tree) during parsing, a structure that can be manipulated later by the program code.
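A minimal sketch of the DOM approach, assuming the standard JAXP and org.w3c.dom classes in the JDK: the whole document is parsed into a tree first, and only afterwards does the program walk and modify it. The sample note document and the helper names are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomSketch {
    // Parse an XML string into a DOM tree.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Walk the tree recursively, collecting element names in document order.
    static void collectNames(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectNames(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<note><to>Ann</to><from>Bob</from></note>");

        List<String> before = new ArrayList<>();
        collectNames(doc, before);
        System.out.println("before: " + before);

        // Parsing is finished, but the tree can still be changed:
        // add a new <message> element under the root.
        Element message = doc.createElement("message");
        message.setTextContent("Hello");
        doc.getDocumentElement().appendChild(message);

        List<String> after = new ArrayList<>();
        collectNames(doc, after);
        System.out.println("after:  " + after);
    }
}
```

The tradeoff relative to SAX is memory: the entire tree is held in memory, in exchange for being able to revisit and edit any part of it.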
SAX
The API for SAX provides a parse() method for performing the parsing of XML input, and an interface ContentHandler with methods for specifying actions to be performed when certain events (such as the start or end of an element) are encountered in the XML input stream.
Example ContentHandler methods:
ContentHandler_____
_____
_____
_____
_____
_____
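The following is one possible sketch of SAX processing, not a complete treatment. It uses DefaultHandler from org.xml.sax.helpers, a convenience class that implements ContentHandler with do-nothing methods, and overrides three of them (startElement, endElement, characters) to record what the parser reports. The class name SaxSketch, the events() helper, and the sample document are our own choices for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {
    // Parse the XML and return a log of the events the handler received.
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("start " + qName); // called at each open tag
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);   // called at each close tag
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in more than one chunk; skip blank chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String event : events("<note><to>Ann</to><from>Bob</from></note>")) {
            System.out.println(event);
        }
    }
}
```

Unlike the DOM approach, nothing is kept after an event has been handled unless the handler stores it itself, as the log list does here.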
rab@stolaf.edu, April 25, 2005