
SXML

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Qwertyus (talk | contribs) at 11:54, 10 January 2015 (SXML Specification: more copyrighted text copied from http://modis.ispras.ru/Lizorkin/Publications/sxml-eng.pdf). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
SXML
Filename extension: .sxml, .scm
Type code: TEXT
Type of format: markup language

SXML is an alternative syntax for writing XML data, using the form of S-expressions. It is also a set of implementations that provide typical XML-processing functionalities that operate on the SXML syntax.

Textual correspondence between SXML and XML for a sample XML snippet is shown below:

XML:

<tag attr1="value1"
     attr2="value2">
  <nested>Text node</nested>
  <empty/>
</tag>

SXML:

(tag (@ (attr1 "value1")
        (attr2 "value2"))
  (nested "Text node")
  (empty))

The following two observations can be drawn from the above example:

  1. Textual notations for XML and SXML are much alike; informally, SXML differs textually from XML in using parentheses instead of angle brackets.
  2. SXML is not only a straightforward textual notation for XML data; it also corresponds directly to the primary data structure of the LISP family of functional programming languages, the list, and so offers an illustrative approach to processing XML data with a general-purpose programming language.

The similarity between XML and S-expressions, made concrete in SXML, allows close integration between XML data and programming-language expressions, which makes XML data processing simpler and more transparent for an application programmer.

The structural similarity of S-expression-like and XML-like syntaxes has often been discussed in the XML community, going at least as far back as 1993.[1][2][3][4]

Example

Take the following simple XHTML page:

 <html xmlns="http://www.w3.org/1999/xhtml"
         xml:lang="en" lang="en">
    <head>
       <title>An example page</title>
    </head>
    <body>
       <h1 id="greeting">Hi, there!</h1>
       <p>This is just an &gt;&gt;example&lt;&lt; to show XHTML &amp; SXML.</p>
    </body>
 </html>

After translating it to SXML, the same page now looks like this:

 (*TOP* (@ (*NAMESPACES* (x "http://www.w3.org/1999/xhtml")))
  (x:html (@ (xml:lang "en") (lang "en"))
    (x:head
       (x:title "An example page"))
    (x:body
       (x:h1 (@ (id "greeting")) "Hi, there!")
       (x:p  "This is just an >>example<< to show XHTML & SXML."))))

Each element's tag pair is replaced by a pair of parentheses. The tag's name is not repeated at the end; it is simply the first symbol in the list. The element's contents follow, and are either elements themselves or strings. XML attributes need no special syntax: in SXML they are represented as just another node, which has the special name @. This cannot clash with an actual "@" tag, because @ is not allowed as a tag name in XML. This is a common pattern in SXML: whenever a name is needed to indicate a special status or something that is not possible in XML, SXML uses a name that is not a valid XML identifier.

The example also shows that there is no need to escape otherwise meaningful characters such as & and > as the entities &amp; and &gt;. String content is treated as pure character data with no tags or entities in it, so any escaping that XML requires can be applied automatically when the SXML is serialized. This makes it much easier to insert autogenerated content, and it reduces the risk of forgetting to escape user input before displaying it to other users (an omission that can lead to cross-site scripting attacks and other problems).
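The point about escaping can be sketched in code. The following is a minimal illustration, not part of any SXML implementation: it assumes SXML nodes are modeled as nested Python lists (a stand-in for S-expressions), and the function name to_xml is hypothetical. Strings are stored unescaped and are escaped only when XML text is produced.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xml(node):
    """Serialize an SXML-like nested list to XML text, escaping on output."""
    if isinstance(node, str):
        # Text nodes are plain character data; &, < and > become entities here.
        return escape(node)
    name, *rest = node
    attrs, children = [], []
    for item in rest:
        if isinstance(item, list) and item[:1] == ["@"]:
            attrs.extend(item[1:])      # (@ (name value) ...) attribute list
        else:
            children.append(item)
    attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs)
    if not children:
        return f"<{name}{attr_text}/>"
    body = "".join(to_xml(child) for child in children)
    return f"<{name}{attr_text}>{body}</{name}>"


paragraph = ["p", "This is just an >>example<< to show XHTML & SXML."]
print(to_xml(paragraph))
```

Because escaping happens in one place, at serialization time, content inserted into the tree never needs to be escaped by hand.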

SXML features

This section considers some important features of SXML that can be deduced from the SXML grammar and the properties of S-expressions.

SXML attributes

The cdr of an SXML attribute list forms an association list, so that, when SXML is read into a Lisp program, any SXML attribute can be extracted from an attribute list using Lisp's built-in assoc function.
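As a rough analogue of this lookup, the sketch below models an attribute list as a nested Python list and defines a small assoc helper. Both the list encoding and the helper are assumptions made for illustration; in Lisp or Scheme the built-in assoc would be applied directly to the cdr of the attribute list.

```python
def assoc(key, alist):
    """Return the first [name, value] pair whose name equals key, or None.

    This mirrors the behaviour of Lisp's assoc, which returns the whole
    pair rather than just the value.
    """
    for pair in alist:
        if pair[0] == key:
            return pair
    return None


# (@ (attr1 "value1") (attr2 "value2")) modeled as a nested list.
attribute_list = ["@", ["attr1", "value1"], ["attr2", "value2"]]

# attribute_list[1:] plays the role of the cdr: the association list itself.
print(assoc("attr2", attribute_list[1:]))
```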

SXML elements and attributes

The uniformity of the SXML representation for elements, attributes, and processing instructions simplifies queries and transformations. For the SXML data model, attributes and processing instructions look like regular elements with a distinguished name. Therefore, query and transformation functions dedicated to attributes become redundant, because ordinary functions with distinguished names can be used.

The uniform representation for SXML elements and attributes is especially convenient for practical tasks. Differences between elements and attributes in XML are blurred. Choosing either an element or an attribute to represent a particular piece of information is often a question of style, and such a choice can later be changed. Such a change in a data structure is expressed in SXML as simply the addition or removal of one hierarchy level, namely an attribute-list. This requires only minimal modification of an SXML application. In SXML notation, the only difference between an attribute and an element is that the former is contained within the attribute-list (which is a special SXML node) and cannot have nested elements.

For example, if data restructuring requires that the weight of a delivered load, initially represented as a nested element, is to be represented as an attribute instead, the SXML element

(delivery
   ...
   (weight "789"))

will be changed to

(delivery
  (@ (weight "789"))
  ...)

Such a notation for elements and attributes simplifies SXML data restructuring and allows uniform queries to be used for data processing.
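The restructuring described above can be sketched as a small function. This is an illustration under stated assumptions: SXML is modeled as nested Python lists, the function name child_to_attribute is hypothetical, and the carrier child is an invented placeholder for the content elided as "..." in the example.

```python
def child_to_attribute(element, child_name):
    """Return a copy of element with a text-only child moved into the
    attribute list, i.e. one hierarchy level (the @ node) is added."""
    name, *rest = element
    has_attrs = bool(rest) and isinstance(rest[0], list) and rest[0][:1] == ["@"]
    attrs = list(rest[0][1:]) if has_attrs else []
    children = rest[1:] if has_attrs else rest
    kept = []
    for child in children:
        is_text_only = (
            isinstance(child, list)
            and len(child) == 2
            and child[0] == child_name
            and isinstance(child[1], str)
        )
        if is_text_only:
            attrs.append([child_name, child[1]])   # becomes (name "value")
        else:
            kept.append(child)
    if attrs:
        return [name, ["@", *attrs], *kept]
    return [name, *kept]


# "carrier" is a hypothetical sibling standing in for the elided content.
before = ["delivery", ["carrier", "ACME"], ["weight", "789"]]
print(child_to_attribute(before, "weight"))
```

The function only moves children that consist of a single text string, since an attribute cannot contain nested elements.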

SXML document as a tree of uniform nodes

Since an SXML document is essentially a tree structure, it can be described in a more uniform way by introducing the notion of an SXML node for the nodes of this tree.

An SXML node can be defined on the basis of the SXML grammar as the single production [N] given below. Alternatively, it can be defined by two mutually recursive datatypes, Node [N1] and Nodelist [N2], together with the auxiliary production [N3] for node names. In the latter case, a Node is constructed by adding a name to a Nodelist as its leftmost member; a Nodelist is itself an ordered list of Nodes.

[N]      <Node> ::= <Element> | <attributes-list> | <attribute> | "character data: text string" | <TOP> | <PI>
[N1]     <Node> ::= ( <name1> . <Nodelist> ) | "text string"
[N2] <Nodelist> ::= ( <Node> <Node>* )
[N3]    <name1> ::= <name> | @ | *TOP* | *PI*

This view emphasizes the tree structure of SXML and the uniform representation of information items as S-expressions.
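One consequence of the uniform-node view is that a single recursive function can traverse elements, attribute lists, and attributes alike. The sketch below assumes SXML is modeled as nested Python lists; the function name walk and the "node"/"text" labels are choices made for this illustration only.

```python
def walk(node, depth=0):
    """Yield (depth, kind, value) for every node, following [N1] and [N2]:
    a node is either a text string or a name followed by a nodelist."""
    if isinstance(node, str):
        yield depth, "text", node
        return
    name, *nodelist = node
    yield depth, "node", name          # element, @, attribute, *TOP*, *PI*
    for child in nodelist:
        yield from walk(child, depth + 1)


document = ["tag",
            ["@", ["attr1", "value1"], ["attr2", "value2"]],
            ["nested", "Text node"],
            ["empty"]]

for depth, kind, value in walk(document):
    print("  " * depth + f"{kind}: {value}")
```

No case analysis on attributes is needed: the @ node and each attribute are visited by the same code path as ordinary elements.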

SXML as a Scheme program

The syntax of LISP-family programming languages, Scheme in particular, is based on S-expressions, which are used to represent both data and code. This makes it possible and convenient for Scheme programs to be treated as semi-structured data, and vice versa.

Since an SXML document and its nodes are S-expressions, they can be used to represent a Scheme program. For this to be possible, it is sufficient that the first member of every list contained in the SXML tree is a function; the use of macros offers more possibilities. The remaining members of the list are then the arguments passed to that function. In terms of the SXML grammar, attribute names, element names, and special names must all be bound to functions.

An SXML document or an SXML node that fulfills these requirements may be considered a Scheme program, which can be evaluated, for example, by means of the eval function.

For example, if para and bold are defined as functions as follows:

(define (para . x) (cons 'p x))
(define (bold . x) (cons 'b x))

then the following SXML element

(para "plain"
      (bold "highlighted")
      "plain")

can be treated as a program, and the result of its evaluation is the SXML element:

(p "plain"
   (b "highlighted")
   "plain")

Note that the result of evaluating such a program is not necessarily an SXML element. Namely, a program may return a textual representation for the source data in XML or HTML; or even have a side effect, such as saving the SXML data in a relational database.
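The evaluation rule described in this section can be imitated outside Scheme. The sketch below is an analogy rather than Scheme's eval: it assumes nested Python lists in place of S-expressions, a dictionary in place of the Scheme environment, and a hypothetical evaluate function. The para and bold functions correspond to the definitions given above.

```python
def para(*x):
    return ["p", *x]        # analogue of (define (para . x) (cons 'p x))


def bold(*x):
    return ["b", *x]        # analogue of (define (bold . x) (cons 'b x))


ENVIRONMENT = {"para": para, "bold": bold}


def evaluate(node, env):
    """Strings evaluate to themselves; a list applies the function bound
    to its first member to the evaluated remaining members."""
    if isinstance(node, str):
        return node
    head, *args = node
    return env[head](*(evaluate(arg, env) for arg in args))


program = ["para", "plain", ["bold", "highlighted"], "plain"]
print(evaluate(program, ENVIRONMENT))
```

Binding para and bold to functions that return text instead of lists would make the same program produce a serialized result, which matches the remark that the outcome need not be an SXML element.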

SXML shortcomings

As a data model

SXML, like XML, models documents as "ordered hierarchies of content-based objects." This has many strengths, perhaps most importantly the separation of formatting and other processing of documents from their representation per se.[5] However, this model may be a less natural fit for other purposes. For example, relational databases differ in being neither inherently hierarchical nor inherently ordered; either model can simulate the other, but at some cost in naturalness, performance, and/or other properties.

SXML's representation suggests a slight model difference from XML's with regard to processing instructions and comments. In XML, these are reserved node types, whose content is essentially text (no attributes, and no nested elements, comments, or PIs). In the SXML grammar above, processing instructions are indeed a special type, but comments do not appear at all; while in the "uniform nodes" model above, all node-types (again except comments) are treated as equivalent to elements with reserved names (for example, "*PI*").

SXML does not provide schema specifications or validators as does XML; however, insofar as SXML is an alternative representation of the same information structure as XML, that functionality can be obtained by converting to XML and then using existing specifications and tools.

As a syntactic representation, or file format

SXML may be very slightly more compact than XML, almost entirely because element names are specified only at element starts and not at ends. However, this has the drawback that errors such as a misplaced ")" are harder to detect (thus, the entropy of the file is slightly higher). This change can either increase or decrease the human-readability of the raw data, largely depending on how dense the markup is in a given case.

Other considerations

Of course SXML can be parsed by a program in any programming language, and then be represented using any desired data structure. Precisely as with XML, implementations vary: XML applications that can process data in a one-pass serial fashion typically use SAX style interfaces that stay very close to the raw input data stream, while applications that must access parts of the data in non-linear random-access fashion use DOM interfaces that mirror the hierarchical structure instead.

It has been claimed that, because the underlying structure is based on singly linked lists, nodes have no default access to their parent or sibling nodes, only to their child nodes. But this confuses the underlying structure with a linear representation of that structure. Any disk file is a linear sequence of bytes or characters, but that mundane fact places almost no limits on what structures can be represented.

As a simple example, saying that the following expression's "underlying structure" is either a 21-character string, or a singly-linked list of 11 nodes (4 numbers, 3 arithmetic operators, and 4 grouping delimiters), is at best a gross oversimplification:

   ( 1 + 2 ) * ( 3 + 4 )

Because SXML is syntactically so similar to S-expressions, it is trivial to load it into a LISP or Scheme program just as if it were a generic S-expression. Doing so, however, turns each parenthesized group into a singly-linked list, a data structure that is far from optimal for the kinds of processing commonly anticipated for XML-like structures. Similarly, in any programming language it is trivial to load an entire SXML document into one long string, but that would be a poor choice for most purposes.

In reality, XML, SXML, SGML, or almost any other data representation is loaded into data structures that facilitate the required operations. DOM and other interfaces provide methods to get from an element to its parent, preceding and following siblings, and numbered children directly, and to access attributes by name. Practical DOM implementations make likely operations very fast.[6]

If a program does not do this, then typical operations such as getting the Nth child of an element, or the preceding element in a long list, or the element with a given ID, remain possible but are far from optimal.
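A minimal sketch of such auxiliary structures is given below. It assumes SXML is held as nested Python lists and builds two lookup tables in one pass; the function name index_tree and the choice to key the parent table by object identity are assumptions of this illustration, not a description of any particular DOM or SXML implementation.

```python
def index_tree(root):
    """Build parent and ID lookup tables in a single traversal."""
    parent = {}   # id(node) -> parent node (None for the root)
    by_id = {}    # value of an "id" attribute -> owning element

    def visit(node, up):
        if isinstance(node, str):
            return                      # text nodes are not indexed here
        parent[id(node)] = up           # lists are unhashable, so key by identity
        for child in node[1:]:
            if isinstance(child, list) and child[:1] == ["@"]:
                for key, value in child[1:]:
                    if key == "id":
                        by_id[value] = node
            visit(child, node)

    visit(root, None)
    return parent, by_id


heading = ["h1", ["@", ["id", "greeting"]], "Hi, there!"]
body = ["body", heading, ["p", "Some text."]]
parent, by_id = index_tree(body)

print(by_id["greeting"][0])           # name of the element carrying that ID
print(parent[id(heading)][0])         # name of its parent element
```

After this one-time pass, finding an element by ID or moving from an element to its parent is a dictionary lookup rather than a search through the lists.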

Citations

  1. ^ Tim Berners-Lee. "Reform of SGML." March 1993. http://www.w3.org/MarkUp/SGML/TimComments
  2. ^ Joe English. "Delimited pseudoelements." Oct 1, 1996. http://lists.w3.org/Archives/Public/w3c-sgml-wg/1996Oct/0021.html
  3. ^ Ora Lassila. "PICS-NG Metadata Model and Label Syntax." W3C NOTE, 1997-05-14 (section 6.2). http://www.w3.org/TR/NOTE-pics-ng-metadata.html
  4. ^ "CL-XML Provides Common Lisp Support for XML, XPath, and XQuery." The Cover Pages, June 9, 2001. http://xml.coverpages.org/ni2001-06-09-a.html
  5. ^ Allen Renear, Elli Mylonas, David Durand. "Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies." Proceedings of the annual joint meeting of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, Christ Church, Oxford University, April 1992. http://www.stg.brown.edu/resources/stg/monographs/ohco.html
  6. ^ Steven DeRose. "Architecture and Speed of Common XML Operations." In Proceedings of Extreme Markup Languages. Montreal, 2005.

Detailed introduction, motivation and real-life case-studies of SSAX, SXML, SXPath and SXSLT. The paper and the complementary talk presented at the International Lisp Conference 2002.