Remarks on the OLIF DTD

Abstract

The OLIF (version 2.0) lexicon and terminology exchange standard is currently under development within the OLIF Consortium, a collaborative group of industrial firms active in the field of language technology. This document describes the document type definition (DTD) of OLIF (version 2.0), the current formal representation of OLIF.

Status of this Document

This document is under review by the OLIF Consortium. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use this draft as reference material or to cite it as anything other than "work in progress". Comments on this draft are invited and should be sent to the editors.

This document is a product of the OLIF Consortium Technical Working Group.

For background on this work, please see the OLIF Web site.

A list of unresolved issues and known errors in this specification is maintained by the editors.

This document may be distributed freely, as long as all text and legal notices remain intact.

This document and the Proposal for the Structure and Content of the Body of an OLIF2 File (July 2001 edition), together with the associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 1766 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provide all the information necessary to understand OLIF (version 2.0) and to construct computer programs that process it.

Table of Contents

Introduction

Design Goals and Design Decisions

Overall Structure and Principles

DTD Modularization

Uniform Representation of Data Categories as Elements

Two-level Content Models

XML Representation for Lists of Values

User Extensions

Coding Conventions

The Header

The Body

Transfer Restrictions

Workflow Support

OLIF Consortium Technical Working Group

Introduction

The OLIF DTD was developed by the Technical Working Group of the OLIF Consortium. The group was chaired by Christian Lieske of SAP and had the active participation of all members of the OLIF Consortium. The membership of the Technical Working Group is given in an appendix.

From the very beginning, the vision was to use two representation formalisms for OLIF: that of DTDs and that of XML schemata. Currently, the DTD is the primary (development) representation for the following reasons:

  1. The expressive power of DTDs is smaller than that of schemata (for example with respect to ordering constraints within content models). Consequently, a formalization that uses all features of XML schemata (e.g. datatyping for element content) cannot easily be mapped onto a DTD, whereas going from a DTD to a schema is straightforward.
  2. Formalization as a DTD is generally considered to be quicker than formalization as a schema.

Design Goals and Design Decisions

The DTD should be close to the description of OLIF that is provided in the Proposal for the Structure and Content of the Body of an OLIF2 File.

It shall be easy to write programs that process OLIF data. Therefore, technologies for which wide tool support does not yet exist (for example XLink) are not used in the formalization.

The OLIF DTD as well as OLIF data shall be legible and reasonably clear. Therefore, terseness is not of high importance.

The design shall show quick progress and follow good practice (for example commenting). In case these two goals conflict with each other, preference is given to quick progress.

Maintenance and customization of the DTD shall be easy. Ease of maintenance is especially important while the formalization is still under review.

Lexical data should be represented in a natural way. Thus, concatenation by means of underscores and the like (as in inside_out) is not used.

The Proposal for the Structure and Content of the Body of an OLIF2 File says that elements within groups may appear in any order. Since there is no elegant way of modelling this with a DTD, the free order had to be replaced by a fixed order.

Alternative content for optional elements is formalized as (a|b)+. This content model overgenerates, since it also accepts sequences that OLIF does not intend, but it is a straightforward way of modelling. It has the further advantage that no special provisions are needed to allow the required multiple occurrences of, e.g., project.
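The trade-off behind the last two decisions can be sketched in DTD syntax. The fragment below is an illustration only: the element names groupFree, groupFixed, groupChoice, a, b and c are hypothetical and are not taken from the OLIF DTD.

```xml
<!-- Sketch with hypothetical element names; not taken from the OLIF DTD. -->

<!-- Free order of three required children: every permutation has to be
     spelled out, which grows factorially with the number of children. -->
<!ELEMENT groupFree  ( (a, ((b, c) | (c, b)))
                     | (b, ((a, c) | (c, a)))
                     | (c, ((a, b) | (b, a))) )>

<!-- Fixed order: compact, but instances must list a, b, c in this order. -->
<!ELEMENT groupFixed (a, b, c)>

<!-- Repeated choice: any non-empty mix of a and b, in any order and number.
     This also accepts sequences such as "a a a", hence the overgeneration. -->
<!ELEMENT groupChoice (a | b)+>

<!ELEMENT a (#PCDATA)>
<!ELEMENT b (#PCDATA)>
<!ELEMENT c (#PCDATA)>
```

The permutation model is written in factored form so that it stays deterministic, as XML asks of content models.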

Ideally, the metadata in the OLIF header would be represented by means of the Resource Description Framework (RDF). Due to a heavy workload, however, RDF is not used yet.

Overall Structure and Principles

DTD Modularization

OLIF data represents collections of terminological and lexical data. In harmony with the Terminological Markup Framework (TMF), this type of data collection generally consists of three building blocks: general information (e.g. title of the collection), a list of terminological entries, and complementary information (e.g. shared resources like bibliographical information). The OLIF DTD reflects this partition, since the top-level file (olif.dtd) directly references three DTD modules which correspond to these building blocks: oHeader.mod, oBody.mod, and oShareR.mod.
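A top-level file of this kind is typically little more than a list of external parameter entities. The following is a minimal sketch of what olif.dtd could look like under that assumption; the module file names come from the text above, while the entity names and the relative paths are assumptions.

```xml
<!-- olif.dtd (sketch). The module file names come from the text above;
     the entity names and the SYSTEM paths are assumptions. -->

<!-- General information about the collection -->
<!ENTITY % oHeader.mod SYSTEM "oHeader.mod">
%oHeader.mod;

<!-- The list of entries -->
<!ENTITY % oBody.mod SYSTEM "oBody.mod">
%oBody.mod;

<!-- Complementary information such as shared resources -->
<!ENTITY % oShareR.mod SYSTEM "oShareR.mod">
%oShareR.mod;
```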

Uniform Representation of Data Categories as Elements

For certain data categories (e.g. natural gender), OLIF foresees a fixed set of values. These data categories lend themselves to being represented as attributes, since a validating XML parser can then check the values automatically. Nevertheless, we have chosen to represent them as elements. The reasons for this decision are as follows:

  1. The values of some data categories (e.g. particles for verbs) are multiword expressions (like inside out). Enumerated attribute values in a DTD must be single name tokens, so values that contain spaces cannot be declared.
  2. Coding every data category as an element (rather than some as attributes and some as elements) provides for a structure that is easier to understand.
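The first reason can be illustrated with the attribute-based alternative that was rejected. This is a sketch only; the element name entry, the attribute names and the listed values are hypothetical.

```xml
<!-- Hypothetical attribute-based alternative; not part of the OLIF DTD. -->

<!-- Works: every enumerated value is a single name token. -->
<!ATTLIST entry natGender (masculine | feminine | neuter) #IMPLIED>

<!-- Does not work: "inside out" contains a space and is therefore not a
     name token, so the declaration below would be rejected.
     <!ATTLIST entry particle (inside out | up | off) #IMPLIED>
-->
```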

Two-level Content Models

In principle, it is possible to declare the content of an element like <ptOfSpeech> as follows:

<!ELEMENT ptOfSpeech (#PCDATA) >

This, however, does not reflect the fact that OLIF foresees a fixed list of values for the content of <ptOfSpeech>. A representation that captures this fact better makes use of parameter entities, as follows:

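<!-- Level 1: the parameter entity holds the content model. -->
<!ENTITY % ptOfSpeech.olif.fix.user.ext
               "#PCDATA">

<!-- Level 2: the element declaration references the entity. The trailing *
     keeps the declaration valid if the entity expands to mixed content. -->
<!ELEMENT ptOfSpeech (%ptOfSpeech.olif.fix.user.ext;)*>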

This two-level model is the representation style that has been chosen. The section on Coding Conventions details which types of parameter entities have been defined (the different types are reflected in the naming conventions).

The parameter entities that are referenced in each of the three main DTD modules have been placed in a separate module file for each module. For example, the parameter entities referenced in oBody.mod are stored in oBodyV.mod.

XML Representation for Lists of Values

The "everything is represented as an element" approach does not necessarily make validity checks difficult to implement. In principle, nothing more than easy-to-process lists of values for the data categories is needed. If these lists exist, it is fairly easy to write a program that compares the actual value of an element with the values in the corresponding list (for example by means of an XSLT stylesheet).

Therefore, all fixed or proposed values of OLIF data categories have been made available as XML files. A DTD for the XML data is available.
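As one illustration of such a check, the following XSLT 1.0 stylesheet is a minimal sketch and not part of the OLIF distribution. It assumes a hypothetical value list file named ptOfSpeech-values.xml with a root element valueList and one value element per permitted value. The actual format defined by the DTD for the value lists may differ, and the sample values noun and verb are placeholders.

```xml
<?xml version="1.0"?>
<!--
  Sketch only. Assumed (hypothetical) layout of ptOfSpeech-values.xml:
    <valueList>
      <value>noun</value>
      <value>verb</value>
    </valueList>
-->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Load the permitted values; the path is resolved relative to this
       stylesheet. -->
  <xsl:variable name="allowed"
                select="document('ptOfSpeech-values.xml')/valueList/value"/>

  <xsl:template match="/">
    <!-- Report every ptOfSpeech whose trimmed text matches no listed value. -->
    <xsl:for-each select="//ptOfSpeech[not(normalize-space(.) = $allowed)]">
      <xsl:text>Unlisted ptOfSpeech value: </xsl:text>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The comparison relies on the XPath rule that a string equals a node-set if it equals the string value of at least one node in the set, so no explicit loop over the list is needed.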

User Extensions

For certain data categories (e.g. ptOfSpeech), users should be able to supply their own content models (sometimes as an alternative to a list of recommended or required values). For this, the DTD adopts the following approach (which is comparable to that of, for example, DocBook):

  1. The data category is defined with the help of a parameter entity whose name reflects that the data category is user-extensible:
    <!ELEMENT ptOfSpeech (%ptOfSpeech.olif.fix.user.ext;)*>
  2. The parameter entity defines a content model that refers to another parameter entity, which is empty by default. It is this second entity that the user ultimately has to redefine:
    <!ENTITY % ptOfSpeech.user.ext
        "">

    <!ENTITY % ptOfSpeech.olif.fix.user.ext
        "#PCDATA %ptOfSpeech.user.ext;">
  3. A user definition may look like this. Because the content model then mixes #PCDATA with an element, XML requires the element declaration in step 1 to carry the trailing asterisk:
    <!ENTITY % ptOfSpeech.user.ext
        "|user">
    <!ELEMENT user (#PCDATA)>
  4. If this mechanism is used, a reference to the user's list of values has to be given in the corresponding data category specification in the OLIF header:
    <ptOfSpeechDCS>www.user.net/ptOfSpeechlist.htm</ptOfSpeechDCS>
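The position of the user definition matters: an XML processor keeps the first declaration of an entity it encounters, so the user's declaration has to be read before the empty default. One way to arrange this without editing the OLIF modules is a small driver file, sketched below. The file name user-olif.dtd, the entity name olif.dtd and the path are assumptions, and the sketch presumes that the ptOfSpeech declaration closes its content model with )*, as XML requires for mixed content.

```xml
<!-- user-olif.dtd: hypothetical driver file for a user extension. -->

<!-- 1. Declare the extension first. An XML processor keeps the first
        declaration of an entity, so this overrides the empty default. -->
<!ENTITY % ptOfSpeech.user.ext "|user">
<!ELEMENT user (#PCDATA)>

<!-- 2. Then include the unchanged OLIF DTD (the path is an assumption). -->
<!ENTITY % olif.dtd SYSTEM "olif.dtd">
%olif.dtd;
```

A document validated against this driver could then contain, for instance, <ptOfSpeech><user>gerund</user></ptOfSpeech>, where gerund stands for any user-defined value.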

Coding Conventions

In order to enhance the readability and maintainability of the DTD, the following coding conventions are applied:

  1. In some cases, issues (e.g. related to the data category for language) have been noted in the DTD. For this, the syntax * cc: (for coding comment) has been utilized.
  2. To some degree, the naming conventions for parameter entities outlined in XMLspec have been used. Specifically, the .pcd.mix suffix is used for entities which define PCDATA content models, and the .att suffix is used for entities which define attributes.
  3. For data categories whose content model is PCDATA but for which OLIF foresees recommended or fixed values, the following suffixes have been used:
    .olif.fix
    Values for this data category must be drawn from a list of fixed values published by the OLIF Consortium.
    .olif.rec
    Values for this data category should be drawn from a list of recommended values published by the OLIF Consortium.
    .olif.pending
    Values for this data category should be drawn from a list that is pending approval from the OLIF Consortium.
    .user.ext
    Values may be defined by the user (in this case, the user has to give a reference to the user-defined list of values in the corresponding data category specification in the OLIF header).
  4. Elements and attributes are described by means of comments that have been put into XML format. For each element or attribute, its type (element vs. attribute), its name, and its definition are given.

The Header

Everyone needs to avoid unnecessary work. The header helps people and programs decide whether it is worth spending time on a given OLIF file. By looking at the header, questions like the following can be answered:

  1. Is the file relevant at all (language(s), project, ...)?
  2. Am I allowed to use it (copyright, distribution, ...)?
  3. Where can I turn for more information (contact person, additional resources, ...)?
  4. Who created the data, when and how (creation tool, user, ...)?
  5. Can I handle it (encoding, size, special qualifiers, ...)?

The OLIF header aims to add value to lexical and terminological data by taking both practical and theoretical considerations into account. Many data and information categories that have proven useful in other exchange efforts have been included. Mechanisms are in place that allow the format itself, and the tools for processing it, to evolve. To some degree, the format allows a step-by-step approach to implementation. A closer look at the header reveals that it

  1. is patterned after the headers of the Translation Memory eXchange format (TMX) and the Corpus Encoding Standard (CES)
  2. accommodates version tracking for formats and tools
  3. supports 80/20 solutions
  4. facilitates compression
  5. allows references to supplementary, external information

The Body

The representation of the OLIF body closely follows the Proposal for the Structure and Content of the Body of an OLIF2 File. Among the few minor points of divergence, the grouping of data categories according to type may be the most visible one:

  1. The data categories that constitute the key of the entry have been grouped in the element keyDC.
  2. The data categories that are general in nature have been grouped in the element generalDC.
  3. The data categories for monolingual information have been grouped according to type: administrative information in monoAdmin, morphological in monoMorph, syntactic in monoSyn, and semantic in monoSem.
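The grouping can be pictured with the following instance skeleton. It is a sketch only: the group element names are the ones listed above, whereas the parent element entry, the nesting and the order of the groups are assumptions, and the data categories inside each group are left out.

```xml
<!-- Skeleton only. The group element names are those listed above; the
     parent element (entry), the nesting and the order are assumptions. -->
<entry>
  <keyDC><!-- data categories that constitute the key of the entry --></keyDC>
  <monoAdmin><!-- administrative information --></monoAdmin>
  <monoMorph><!-- morphological information --></monoMorph>
  <monoSyn><!-- syntactic information --></monoSyn>
  <monoSem><!-- semantic information --></monoSem>
  <generalDC><!-- data categories that are general in nature --></generalDC>
</entry>
```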

Transfer Restrictions

A transfer restriction specifies a condition in the source language under which a given translation is valid. Transfer restrictions thus contain important information for Machine Translation systems. Accordingly, OLIF provides extensive support for representation of transfer restrictions (in the oTrans.mod module). Validation of the representation for transfer restrictions is the next important step in the OLIF activity.

Workflow Support

Large-scale terminology activity requires workflow support. This holds especially true in environments where substantial amounts of data are exchanged between different business partners, each of which works on a particular aspect of the creation of a multilingual terminological data collection. The suggested standards for terminology exchange, however, do not address this requirement sufficiently. Therefore, OLIF proposes extensive workflow support (in the oWf.mod module). The support is described in detail in the paper Workflow Information for Terminology Exchange, which is currently under review. During the review period, the oWf.mod module will contain nothing more than a dummy representation. After the review (presumably August 2001), the real oWf.mod module will be made available.

OLIF Consortium Technical Working Group

This text was prepared and approved for publication by the OLIF Consortium Technical Working Group. Approval of this text does not necessarily imply that all working group members voted for its approval. The current and former (indicated by asterisk) members of the working group are:

Mike Dillinger, Logos

*Pierre-Yves Foucou, Systran

*Nils van der Laan, Trados (at the time of contribution)

Christian Lieske, SAP

*Paulo Martins, EC (at the time of contribution)

Carlo Mergen, EC

Peter Quartier, Lotus

Gregor Thurmair, Sail Labs

Michael Wetzel, Trados

Page last updated 30/07/01 11:24:23 +0200