OpenDocument technical specification
File types
The recommended file extensions and MIME types are included in the official standard (OASIS, May 1 2005).
Documents
The most common file extensions used for OpenDocument documents are .odt for text documents, .ods for spreadsheets, .odp for presentations, .odg for graphics and .odb for databases. These are easily remembered by reading ".od" as short for "OpenDocument" and noting that the last letter indicates the more specific type (such as t for text). Here is the complete list of document types, showing the type of file, the recommended file extension, and the MIME type:
File type | Extension | MIME Type |
---|---|---|
Text | .odt | application/vnd.oasis.opendocument.text |
Spreadsheet | .ods | application/vnd.oasis.opendocument.spreadsheet |
Presentation | .odp | application/vnd.oasis.opendocument.presentation |
Drawing | .odg | application/vnd.oasis.opendocument.graphics |
Chart | .odc | application/vnd.oasis.opendocument.chart |
Formula | .odf | application/vnd.oasis.opendocument.formula |
Database | .odb | application/vnd.oasis.opendocument.database |
Image | .odi | application/vnd.oasis.opendocument.image |
Master Document | .odm | application/vnd.oasis.opendocument.text-master |
Templates
OpenDocument also supports a set of template types. Templates represent formatting information (including styles) for documents, without the content itself. The recommended filename extension begins with ".ot" (which can be viewed as short for "OpenDocument template"), with the last letter indicating the kind of template (such as "t" for text). The supported template types are:
File type | Extension | MIME Type |
---|---|---|
Text | .ott | application/vnd.oasis.opendocument.text-template |
Spreadsheet | .ots | application/vnd.oasis.opendocument.spreadsheet-template |
Presentation | .otp | application/vnd.oasis.opendocument.presentation-template |
Drawing | .otg | application/vnd.oasis.opendocument.graphics-template |
Chart template | .otc | application/vnd.oasis.opendocument.chart-template |
Formula template | .otf | application/vnd.oasis.opendocument.formula-template |
Image template | .oti | application/vnd.oasis.opendocument.image-template |
Web page template | .oth | application/vnd.oasis.opendocument.text-web |
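The two tables above can be expressed as a simple lookup. The following is a minimal sketch; the names DOCUMENT_TYPES, TEMPLATE_TYPES and mime_for_extension are hypothetical and chosen only for illustration, while the extensions and MIME suffixes are copied from the tables.

```python
# Every recommended MIME type shares this prefix (see the tables above).
_PREFIX = "application/vnd.oasis.opendocument."

# Extension -> MIME suffix, copied from the document table.
DOCUMENT_TYPES = {
    ".odt": "text",
    ".ods": "spreadsheet",
    ".odp": "presentation",
    ".odg": "graphics",
    ".odc": "chart",
    ".odf": "formula",
    ".odb": "database",
    ".odi": "image",
    ".odm": "text-master",
}

# Extension -> MIME suffix, copied from the template table.
TEMPLATE_TYPES = {
    ".ott": "text-template",
    ".ots": "spreadsheet-template",
    ".otp": "presentation-template",
    ".otg": "graphics-template",
    ".otc": "chart-template",
    ".otf": "formula-template",
    ".oti": "image-template",
    ".oth": "text-web",
}


def mime_for_extension(ext):
    """Return the recommended MIME type for an extension, or None if unknown.

    Hypothetical helper: accepts "odt", ".odt" or ".ODT".
    """
    ext = ext.lower()
    if not ext.startswith("."):
        ext = "." + ext
    suffix = DOCUMENT_TYPES.get(ext) or TEMPLATE_TYPES.get(ext)
    return _PREFIX + suffix if suffix else None
```

For example, mime_for_extension("ods") looks up the spreadsheet entry, and an extension that appears in neither table yields None.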
Capabilities
As noted above, the OpenDocument format can describe text documents (e.g., those typically edited by a word processor), spreadsheets, presentations, drawings/graphics, images, charts, mathematical formulas, databases, and "master documents" (which can combine them). It can also represent templates for many of them.
The official OpenDocument standard (OASIS, May 1 2005) defines OpenDocument's capabilities. Haumacher (2005) provides a hyperlinked formal specification derived from the official standard. Eisenberg's book (2005) describes the format in more detail. The text below provides a brief summary of the format's capabilities.
Metadata
The OpenDocument format supports storing metadata (data about the data) by having a set of pre-defined metadata elements, as well as allowing user-defined and custom metadata. The predefined metadata are: Generator, Title, Description, Subject, Keywords, Initial Creator, Creator, Printed By, Creation Date and Time, Modification Date and Time, Print Date and Time, Document Template, Automatic Reload, Hyperlink Behavior, Language, Editing Cycles, Editing Duration, and Document Statistics.
Content
OpenDocument's text content format supports both typical and advanced capabilities. Headings of various levels, lists of various kinds (numbered and not), numbered paragraphs, and change tracking are all supported. Page sequences and section attributes can be used to control how the text is displayed. Hyperlinks, ruby text (which provides annotations and is especially critical for some languages), bookmarks, and references are also supported, as are text fields (for autogenerated content) and mechanisms for automatically generating tables such as tables of contents, indexes, and bibliographies.
In the OpenDocument format, spreadsheets are an example of a set of tables. Thus, there are extensive capabilities for formatting the display of tables and spreadsheets. Database ranges, filters, and data pilots (known to Excel users as "pivot tables") are also supported. Change tracking is available for spreadsheets as well.
The graphics format supports a vector graphic representation, in which a set of layers and the contents of each layer are defined. Available drawing shapes include Rectangle, Line, Polyline, Polygon, Regular Polygon, Path, Circle, Ellipse, and Connector. 3D shapes are also available; the format includes information about the Scene, Light, Cube, Sphere, Extrude, and Rotate (it is intended for office data exchange, however, and is not sufficient to represent movies or other extensive 3D scenes). Custom shapes can also be defined.
Presentations are supported. Animations can be included in presentations, with control over sound, showing a shape or text, hiding a shape or text, or dimming something, and these effects can be grouped. In OpenDocument, many of the presentation format's capabilities are reused from the text format, simplifying implementations.
Charts define how to create graphical displays from numerical data. They support titles, subtitles, a footer, and a legend to explain the chart. The format defines the series of data that is to be used for the graphical display, and a number of different kinds of graphical displays (such as line charts, pie charts, and so on).
Forms are specially supported, building on the existing XForms standard.
Formatting
The format provides numerous style and formatting controls over how information is displayed.
Page layout is controlled by a variety of attributes. These include page size, number format, paper tray, print orientation, margins, border (and its line width), padding, shadow, background, columns, print page order, first page number, scale, table centering, maximum footnote height and separator, and many layout grid properties.
Headers and footers can have defined fixed and minimum heights, margins, border, border line width, padding, background, shadow, and dynamic spacing.
There are many attributes for specific text, paragraphs, ruby text, sections, tables, columns, lists, and fills. Specific characters can have their fonts, sizes, and other properties set. Paragraphs can have their vertical space controlled through attributes on keep together, widow, and orphan, and have other attributes such as "drop caps" to provide special formatting. The list is extremely extensive; see the references (in particular the actual standard) for details.
Spreadsheet formulas issue
OpenDocument is fully capable of describing mathematical formulas that are displayed on the screen. It is also fully capable of exchanging spreadsheet data, formats, pivot tables, and other information typically included in a spreadsheet. OpenDocument can exchange spreadsheet formulas (formulas that are recalculated in the spreadsheet); formulas are exchanged as values of the attribute table:formula.
However, some believe that the allowed syntax of table:formula is not defined in sufficient detail. The OpenDocument version 1.0 specification defines spreadsheet formulas using a set of simple examples which show, for example, how to specify ranges and the SUM() function. Some critics argue that a more detailed, precise specification for spreadsheet functions, including syntax and semantics, should be created to augment these examples. The OpenDocument committee argued that this was outside its scope, since the syntax of such formulas is not in XML.
Others have argued that, while the specification is less specific than one might like, the intent is fairly clear (especially since formulas tend to follow decades-long traditions), and that the vast majority of spreadsheets use only a small set of functions (such as SUM) which all spreadsheet implementations support anyway. In practice, many developers look to OpenOffice.org as a "canonical implementation": its code is public for anyone to review and its XML output can be easily inspected, which can resolve many questions.
There is draft work proposing a more detailed specification for spreadsheet formulas (e.g. OpenFormula). Such work is expected simply to clarify what is acceptable in a spreadsheet formula; no one expects it to invalidate any of the current OpenDocument standard. For more information, see the OpenFormula article.
Note that this is not a disadvantage compared to Microsoft Open XML, which also does not specify formulas in detail. Nor is it a disadvantage compared to the Microsoft Excel binary format, whose format and semantics have never been completely defined in public.
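To make the role of the table:formula attribute concrete, here is a minimal sketch that reads formulas out of a table row. The sample row and the formula text are assumptions modelled on the range-and-SUM style of example described above; real files may write formulas differently (for instance with a prefix in front of the expression). The name formulas_in is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the OpenDocument "table" vocabulary.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Illustrative row (an assumption, not taken from a real file):
# one cell carrying a formula and one empty cell.
SAMPLE_ROW = (
    '<table:table-row xmlns:table="%s">'
    '<table:table-cell table:formula="=SUM([.A1:.A3])"/>'
    '<table:table-cell/>'
    '</table:table-row>' % TABLE_NS
)


def formulas_in(xml_text):
    """Return the table:formula value of every cell that has one."""
    root = ET.fromstring(xml_text)
    formula_key = "{%s}formula" % TABLE_NS
    cell_tag = "{%s}table-cell" % TABLE_NS
    return [
        cell.get(formula_key)
        for cell in root.iter(cell_tag)
        if cell.get(formula_key) is not None
    ]
```

The sketch only extracts the attribute text. Interpreting that text is exactly the part that the debate above is about, so no attempt is made to parse or evaluate it here.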
Format internals
An OpenDocument file can be either a single XML file that uses <office:document> as the root element, or a JAR (ZIP-compressed) archive containing a number of files and directories. Because the single-file XML format does not directly support embedding binary content or thumbnails, the JAR-based format is used almost exclusively. Applications that use OpenDocument might not support saving and loading the plain XML file, but all should support the JAR-based format.
Because of this compression, OpenDocument files are normally significantly smaller than equivalent Microsoft ".doc" or ".ppt" files. The smaller size matters to organizations that store vast numbers of documents for long periods, and to those that must exchange documents over low-bandwidth connections. Once uncompressed, most data is held in simple text-based XML files, so the contents are as easy to modify and process as any other XML. Directories can be included to store non-SVG images, non-SMIL animations, and other files that are used by the document but cannot be expressed directly in the XML.
The zipped set of files and directories includes the following:
- XML files
  - content.xml
  - meta.xml
  - settings.xml
  - styles.xml
- Other files
  - mimetype
- Directories
  - META-INF/
  - Thumbnails/
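The archive layout listed above can be explored with any ZIP library. The sketch below builds a tiny in-memory archive and lists its members. It is an illustration under stated assumptions, not a way to produce a valid document: a real package contains more than these two members (for example the contents of META-INF/), and the names build_minimal_package and list_members are hypothetical. Writing mimetype first and uncompressed is a convention this sketch assumes.

```python
import io
import zipfile


def build_minimal_package(mimetype, content_xml):
    """Build a tiny ZIP archive with a mimetype entry and content.xml.

    Illustration only: the result is not a complete OpenDocument file.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # A bare ZipInfo defaults to ZIP_STORED, so this entry is
        # written uncompressed (an assumed convention for mimetype).
        zf.writestr(zipfile.ZipInfo("mimetype"), mimetype)
        # XML members are compressed with the usual deflate method.
        zf.writestr("content.xml", content_xml,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_members(package_bytes):
    """Return the member names of an archive, in stored order."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return zf.namelist()
```

Pointing list_members at the bytes of a real .odt file should show the files and directories named in the list above, along with any others the producing application adds.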
The OpenDocument format provides a strong separation between content, layout and metadata. The most notable components of the format are described in the subsections below. The files in XML format are further defined using the RELAX NG language for defining XML schemas. RELAX NG is itself defined by an OASIS specification, as well as by part two of the international standard ISO/IEC 19757: Document Schema Definition Languages (DSDL).
content.xml
content.xml is the most important file. It carries the actual content of the document (except for binary data, like images). The base format is inspired by HTML, and though far more complex, it should be reasonably legible to humans:
    <text:h text:style-name="Heading_2">This is a title</text:h>
    <text:p text:style-name="Text_body"/>
    <text:p text:style-name="Text_body">
    This is a paragraph. The formatting information is in the
    Text_body style. The empty text:p tag above is a blank
    paragraph (an empty line).
    </text:p>
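Because the markup is ordinary namespaced XML, a fragment like the one above can be read with a standard XML parser. The following is a minimal sketch: the wrapper element is a stand-in added only so the fragment has a single root and a namespace declaration (a real content.xml has its own root element), and the function name outline is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the OpenDocument "text" vocabulary.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# The fragment from the article, shortened, inside a stand-in wrapper.
FRAGMENT = (
    '<wrapper xmlns:text="%s">'
    '<text:h text:style-name="Heading_2">This is a title</text:h>'
    '<text:p text:style-name="Text_body"/>'
    '<text:p text:style-name="Text_body">This is a paragraph.</text:p>'
    '</wrapper>' % TEXT_NS
)


def outline(xml_text):
    """Return (kind, style name, text) for each heading and paragraph."""
    root = ET.fromstring(xml_text)
    style_key = "{%s}style-name" % TEXT_NS
    rows = []
    for elem in root:
        if elem.tag == "{%s}h" % TEXT_NS:
            kind = "heading"
        elif elem.tag == "{%s}p" % TEXT_NS:
            kind = "paragraph"
        else:
            continue  # ignore anything that is not text:h or text:p
        text = "".join(elem.itertext()).strip()
        rows.append((kind, elem.get(style_key), text))
    return rows
```

The empty text:p element shows up as a paragraph with an empty string, which matches its role as a blank line.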
styles.xml
styles.xml contains style information. OpenDocument makes heavy use of styles for formatting and layout. Most of the style information is here (though some is in content.xml). Style types include:
- Paragraph styles
- Page styles
- Character styles
- Frame styles
- List styles
The OpenDocument format is somewhat unusual in that you cannot avoid using styles for formatting. Even "manual" formatting is implemented through styles (the application dynamically makes new styles as needed).
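The idea that even "manual" formatting goes through a style can be sketched as follows. The sample markup is an assumption for illustration: a generated character style named T1 that makes text bold, and a span that refers to it. The element and attribute names (style:style, style:text-properties, fo:font-weight, text:span) are given as I understand them from typical files and should be checked against the standard; the wrapper element and the name span_properties are hypothetical.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Illustrative markup (an assumption): a generated style "T1" and a
# span that uses it, inside a stand-in wrapper element.
SAMPLE = (
    '<wrapper xmlns:style="%s" xmlns:fo="%s" xmlns:text="%s">'
    '<style:style style:name="T1" style:family="text">'
    '<style:text-properties fo:font-weight="bold"/>'
    '</style:style>'
    '<text:p>Plain and <text:span text:style-name="T1">bold</text:span>'
    ' text.</text:p>'
    '</wrapper>' % (STYLE_NS, FO_NS, TEXT_NS)
)


def span_properties(xml_text):
    """Pair each text:span with the text properties of its style."""
    root = ET.fromstring(xml_text)
    styles = {}
    for style in root.iter("{%s}style" % STYLE_NS):
        props = style.find("{%s}text-properties" % STYLE_NS)
        name = style.get("{%s}name" % STYLE_NS)
        styles[name] = dict(props.attrib) if props is not None else {}
    result = []
    for span in root.iter("{%s}span" % TEXT_NS):
        name = span.get("{%s}style-name" % TEXT_NS)
        result.append((span.text, styles.get(name, {})))
    return result
```

The span itself carries no formatting attributes; the bold weight is found only by following the style name, which is the point made in the paragraph above.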
meta.xml
meta.xml contains the file metadata, for example the author, the person who last modified the document, and the date of last modification. The contents look somewhat like this:
    <meta:creation-date>2003-09-10T15:31:11</meta:creation-date>
    <dc:creator>Daniel Carrera</dc:creator>
    <dc:date>2005-06-29T22:02:06</dc:date>
    <dc:language>es-ES</dc:language>
    <meta:document-statistic meta:table-count="6" meta:object-count="0"
        meta:page-count="59" meta:paragraph-count="676" meta:image-count="2"
        meta:word-count="16701" meta:character-count="98757"/>
The names of the <dc:...> tags are taken from the Dublin Core XML standard.
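A fragment like the meta.xml excerpt above can be summarized with a few namespace-aware lookups. This is a minimal sketch: the wrapper element is a stand-in so the fragment parses on its own (a real meta.xml has its own root element), and the name summarize is hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI for the Dublin Core and OpenDocument meta vocabularies.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}

# The excerpt from the article inside a stand-in wrapper element.
FRAGMENT = (
    '<wrapper xmlns:dc="%(dc)s" xmlns:meta="%(meta)s">'
    '<meta:creation-date>2003-09-10T15:31:11</meta:creation-date>'
    '<dc:creator>Daniel Carrera</dc:creator>'
    '<dc:date>2005-06-29T22:02:06</dc:date>'
    '<dc:language>es-ES</dc:language>'
    '<meta:document-statistic meta:table-count="6" meta:object-count="0" '
    'meta:page-count="59" meta:paragraph-count="676" meta:image-count="2" '
    'meta:word-count="16701" meta:character-count="98757"/>'
    '</wrapper>' % NS
)


def summarize(xml_text):
    """Collect creator, modification date, language and statistics."""
    root = ET.fromstring(xml_text)
    stats = root.find("meta:document-statistic", NS)
    prefix = "{%s}" % NS["meta"]
    counts = {}
    if stats is not None:
        # Strip the namespace from each attribute name and convert to int.
        counts = {
            key[len(prefix):]: int(value)
            for key, value in stats.attrib.items()
            if key.startswith(prefix)
        }
    return {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "modified": root.findtext("dc:date", namespaces=NS),
        "language": root.findtext("dc:language", namespaces=NS),
        "counts": counts,
    }
```

With the excerpt above, the summary reports Daniel Carrera as creator and the statistics as integers keyed by names such as "word-count".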
settings.xml
settings.xml includes settings such as the zoom factor or the cursor position. These are properties that are not content or layout.
mimetype (file)
mimetype is just a one-line file containing the MIME type of the document. One implication of this is that the file extension is actually immaterial to the format. The file extension is only there for the benefit of the user.
Reuse of existing formats
OpenDocument is designed to reuse existing open XML standards whenever they are available, and it creates new tags only where no existing standard can provide the needed functionality. So, OpenDocument uses Dublin Core for metadata, MathML for displayed formulas, SVG for vector graphics, SMIL for multimedia, etc.
Technical comparison with Microsoft XML formats
Both OpenDocument and Microsoft's various XML formats for office documents (MS XML) use XML to represent office document data. However, there are technical differences between them.
Alex Hudson, J. David Eisenberg, Bruce D'Arcus and Daniel Carrera argue that OpenDocument has several technical advantages over Microsoft XML (Hudson, 2005):
- OpenDocument uses a mixed content model, whereas the MS XML format does not. "Non-mixed documents usually represent structured data; mixed documents are usually used to represent narrative. MS XML uses the non-mixed model to represent narrative (word processing). This sort of mismatch leads to awkward markup... The mixed-content model makes more sense, and is closer to what a developer will be familiar to."
- OpenDocument is similar to XHTML, while MS XML is not. It uses mixed content, marks styles in a similar way, and so on. This makes it easier to transform data accurately between OpenDocument and XHTML, and also simplifies the reuse of existing skills.
- OpenDocument gives better separation between content and presentation. "Both formats give you some separation, and neither format gives you perfect separation. But OpenDocument goes much further in that direction."
- OpenDocument hyperlinks are designed to be easier to process (they do not require processing a separate file).
- OpenDocument reuses existing standards whenever possible. It uses SVG for drawings, MathML for equations, XLink for linking, Dublin Core for metadata, etc. "This makes the format infinitely more transparent to someone familiar with XML technologies. It also allows you to reuse existing tools that understand these standards." In contrast, the Microsoft XML formats do not use appropriate standards, but reinvent everything, imposing significant additional costs to translate them to standard formats.
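The points about mixed content and inline hyperlinks can be illustrated with a short sketch. The sample paragraph is an assumption written for this example: running text with a text:a element in the middle whose target is given by an XLink href attribute. The URL and the name text_and_links are hypothetical.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Illustrative mixed-content paragraph (an assumption): text and a
# hyperlink element interleaved, much as in XHTML.
SAMPLE = (
    '<text:p xmlns:text="%s" xmlns:xlink="%s">See '
    '<text:a xlink:href="http://example.org/">the project page</text:a>'
    ' for details.</text:p>' % (TEXT_NS, XLINK_NS)
)


def text_and_links(xml_text):
    """Return the paragraph's plain text and the link targets inside it."""
    para = ET.fromstring(xml_text)
    # itertext() walks text and child elements in document order.
    text = "".join(para.itertext())
    links = [
        anchor.get("{%s}href" % XLINK_NS)
        for anchor in para.iter("{%s}a" % TEXT_NS)
    ]
    return text, links
```

Both the readable sentence and the link target come from the same element, with no second file to consult, which is the property the list above describes.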
The Valoris Report noted that Microsoft's XML format supported custom XML schema definitions (XSDs), while OpenDocument does not. XSDs made it "possible to attach one or more custom schemas to a given Word document. It allows the users to annotate the document with the elements found in the attached schemas. There are two options for saving a Word document with custom elements. The default option is to save the document as WordProcessingML with the custom elements nested throughout the tree. The other option is "Save data only", which removes WordProcessingML markup and only persists the custom element tree structure. When the option "Save data only" is chosen, Word removes the WordProcessingML markup and only saves the custom elements found in the document."
Whether these embedded XSDs are actually an advantage is, however, extremely controversial. If a custom schema is used with the "Save data only" option, the result is simply a normal XML file with nonstandard packaging. There are already standards for exchanging XML, so this does not add a fundamental capability, and OpenDocument files can easily embed XML files in a similar way (since both are essentially zip archives of a set of files). If XSD is used to embed custom elements "throughout the tree", the result is a completely nonstandard document that requires (as a matter of practice) specialized tools to process the additional data. In many eyes, this impedes rather than aids interoperability, a serious problem since the whole point of these standards is to promote interoperability. The OASIS OpenDocument committee had considered adding this capability before public review of OpenDocument even began, but decided not to, saying that it believed this was "not essential for the current version of the specification."
The Valoris Report notes that OpenDocument was designed to support cross-platform interoperability. In contrast, the report had reservations about the ability of Microsoft's XML formats to support true cross-platform interoperability (a fundamental requirement of any exchange format). Microsoft's XML format has not yet been reviewed by independent parties for its ability to support interoperability, and it was designed to support only one product, which runs on a single platform. The report noted that using XML did not guarantee that the result was portable across heterogeneous platforms with full preservation of semantics. It also noted that the Microsoft schemas can contain proprietary objects; they may be encoded in a standard-compliant fashion, but if some of them can only be executed in a Microsoft environment (e.g., OLE) the result is not interoperable. The report further stated that the "spreadsheet macros are spread within the content XML elements. It is therefore very difficult to isolate the code from the text by a third-party program. Furthermore, these macros cannot be executed outside the MS-Office environment." Thus, OpenDocument appears to be interoperable, while Microsoft's XML is known to have interoperability weaknesses, and the depth of those weaknesses is not yet fully known.