The OLIF (version 2.0) lexicon and terminology exchange standard is currently under development within the OLIF Consortium, a collaborative group of industrial firms active in the field of language technology. This document describes the document type definition (DTD) of OLIF (version 2.0), the current formal representation of OLIF.
This document is under review by the OLIF Consortium. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use this draft as reference material or to cite it as anything other than "work in progress". Comments on this draft are invited and should be sent to the editors.
This document is a product of the OLIF Consortium Technical Working Group.
For background on this work, please see the OLIF Web site.
A list of unresolved issues and known errors in this specification is maintained by the editors.
This document may be distributed freely, as long as all text and legal notices remain intact.
This document, the Proposal for the Structure and Content of the Body of an OLIF2 File (July 2001 edition), and the associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 1766 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes) together provide all the information necessary to understand OLIF (version 2.0) and to construct computer programs that process it.
Design Goals and Design Decisions
Overall Structure and Principles
Uniform Representation of Data Categories as Elements
Two-level Content Models
XML Representation for Lists of Values
OLIF Consortium Technical Working Group
The OLIF DTD was developed by the Technical Working Group of the OLIF Consortium. It was chaired by Christian Lieske of SAP with the active participation of all members of the OLIF Consortium. The membership of the Technical Working Group is given in an appendix.
From the very beginning, the vision was to use two representation formalisms for OLIF: DTDs and XML schemata. Currently, the DTD is the primary (development) representation for the following reasons:
The DTD should be close to the description of OLIF that is provided in the Proposal for the Structure and Content of the Body of an OLIF2 File.
It shall be easy to write programs that process OLIF data. Therefore, technologies for which wide tool support does not yet exist (for example XLink) are not used for the formalization.
The OLIF DTD as well as OLIF data shall be legible and reasonably clear. Therefore, terseness is not of high importance.
The design shall show quick progress and follow good practice (for example commenting). In case these two goals conflict with each other, preference is given to quick progress.
Maintenance and customization of the DTD shall be easy. Ease of maintenance is especially important while the formalization is still under review.
Lexical data should be represented in a natural way. Thus, concatenation by means of underscores and the like (as in inside_out) is not permitted.
The Proposal for the Structure and Content of the Body of an OLIF2 File says that elements within groups may appear in any order. Since there is no elegant way of modelling this with a DTD, the free ordering had to be replaced by a fixed ordering.
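To illustrate why free ordering is awkward in a DTD, the following minimal sketch builds the naive content model that enumerates every ordering of a group. The element names a, b, and c and the helper any_order_model are hypothetical and do not come from the OLIF DTD; the point is only that the number of alternatives grows factorially with the size of the group.

```python
from itertools import permutations


def any_order_model(names):
    """Build a DTD-style content model that lists every ordering of `names`.

    Hypothetical helper for illustration only; it is not part of OLIF.
    """
    alternatives = ["(" + ",".join(p) + ")" for p in permutations(names)]
    return "(" + "|".join(alternatives) + ")"


if __name__ == "__main__":
    for group in (["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"]):
        model = any_order_model(group)
        # The number of alternatives equals n! for a group of n elements.
        print(len(group), "elements ->", model.count("|") + 1, "alternatives")
    print(any_order_model(["a", "b"]))
```

A real DTD would probably also need these alternatives to be factored so that the content model stays deterministic, which makes the enumeration harder still to read and maintain. A fixed order avoids both problems.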
The formalization of alternative content for optional elements is (a|b)+. This overgenerates but is a straightforward way of modelling. Furthermore, this style of modelling has the advantage that no special provisions are needed to realize the required multiple occurrences of, e.g., project.
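The overgeneration can be made concrete with a small sketch that treats a sequence of child elements as a string and the content model as a regular expression. The stricter model a+|b+ used for comparison is an assumption chosen for illustration; the source does not state which stricter model was intended.

```python
import re

# Each child element is encoded as a single letter, "a" or "b".
FORMALIZED = re.compile(r"(a|b)+")  # the style of content model used in the DTD
STRICTER = re.compile(r"a+|b+")     # assumed stricter reading, for comparison only


def accepts(pattern, children):
    """Return True if the whole sequence of children matches the model."""
    return pattern.fullmatch("".join(children)) is not None


if __name__ == "__main__":
    for children in (["a"], ["a", "a"], ["a", "b", "a"]):
        print(children,
              "formalized:", accepts(FORMALIZED, children),
              "stricter:", accepts(STRICTER, children))
```

The repeated sequence a, a is accepted by both models, which is the property that makes repeated elements such as project possible without special provisions. The mixed sequence a, b, a is accepted only by (a|b)+, which is the overgeneration referred to above.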
Clearly, the metadata information in the OLIF header should be represented in terms of the resource description format (RDF). Due to a heavy workload, however, RDF is not used yet.
OLIF data represents collections of terminological and lexical data. In harmony with the Terminological Markup Framework (TMF), this type of data collection generally consists of three building blocks: general information (e.g. title of the collection), a list of terminological entries, and complementary information (e.g. shared resources like bibliographical information). The OLIF DTD reflects this partition, since the top-level file (olif.dtd) directly references three DTD modules which correspond to these building blocks: oHeader.mod, oBody.mod, and oShareR.mod.
For certain data categories (e.g. natural gender), OLIF foresees a fixed set of values. Although these data categories lend themselves to being represented as attributes (with this representation, XML parsers can check the values automatically), we have chosen to represent these data categories as elements. The reasons for this decision are as follows:
In principle, it is possible to declare the value of an element like <ptOfSpeech> as follows:
<!ELEMENT ptOfSpeech (#PCDATA) >
This, however, does not accurately reflect that OLIF foresees a list of fixed values appearing as the content of <ptOfSpeech>. A representation that captures this fact better makes use of parameter entities as follows:
<!ENTITY % ptOfSpeech.olif.fix.user.ext "PtOfSpeech CDATA #IMPLIED">
<!ELEMENT ptOfSpeech (%ptOfSpeech.olif.fix.user.ext;)>
This two-level model is the representation style that has been chosen. The section on coding comments details which types of parameter entities have been defined (the different types are reflected in the naming conventions).
The parameter entities that are referenced in each of the three main DTD modules have been placed into their individual DTD module files. For example, the parameter entities referenced in oBody.mod are stored in oBodyV.mod.
The "everything is represented as an element" approach does not necessarily mean that implementing validity checks poses a difficult problem. In principle, nothing more than easy-to-process lists of values for the data categories is needed. If these lists exist, it is fairly easy to code a program that compares the actual value of an element with the values in the corresponding list (the coding may, for example, make use of an XSLT stylesheet).
Therefore, all fixed or proposed values of OLIF data categories have been made available as XML files. A DTD for the XML data is available.
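A check of this kind could look roughly like the following sketch, written in Python rather than XSLT. The layout of the value list (a values element with a category attribute and value children), the sample values, and the entry wrapper element are all assumptions made for illustration; they are not taken from the OLIF value files or their DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical value list; the real OLIF value files may be structured differently.
VALUES_XML = """<values category="ptOfSpeech">
  <value>noun</value>
  <value>verb</value>
  <value>adj</value>
</values>"""

# Hypothetical data fragment with one misspelled value.
ENTRY_XML = """<entry>
  <ptOfSpeech>noun</ptOfSpeech>
  <ptOfSpeech>nuon</ptOfSpeech>
</entry>"""


def load_allowed(values_xml):
    """Return the data category name and the set of values allowed for it."""
    root = ET.fromstring(values_xml)
    allowed = {(v.text or "").strip() for v in root.iter("value")}
    return root.get("category"), allowed


def invalid_values(data_xml, category, allowed):
    """Return the content of every `category` element that is not in `allowed`."""
    root = ET.fromstring(data_xml)
    found = [(el.text or "").strip() for el in root.iter(category)]
    return [value for value in found if value not in allowed]


if __name__ == "__main__":
    category, allowed = load_allowed(VALUES_XML)
    print(invalid_values(ENTRY_XML, category, allowed))
```

Because the allowed values live in a separate file, the list can be extended without touching the DTD or the checking program.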
For certain data categories (e.g. ptOfSpeech), users should be able to supply their own content models (sometimes as an alternative to a list of recommended or required values). For this, the DTD adopts the following approach (which is comparable to that of, for example, DocBook):
<!ELEMENT ptOfSpeech (%ptOfSpeech.olif.fix.user.ext;)>
<!ENTITY % ptOfSpeech.user.ext "" >
<!ENTITY % ptOfSpeech.olif.fix.user.ext "#PCDATA %ptOfSpeech.user.ext;" >
<!ENTITY % ptOfSpeech.user.ext "|user" >
<!ELEMENT user (#PCDATA) >
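The extension mechanism relies on the XML rule that, when a parameter entity is declared more than once, the first declaration is binding. A user declaration that is read before the default (for example in the internal DTD subset) therefore takes precedence. The sketch below simulates this with the declarations shown above. It is a deliberately simplified, regular-expression-based model and not a DTD parser; the helper names collect_entities and expand are hypothetical.

```python
import re

# Simplified patterns: internal parameter entities with double-quoted values only.
ENTITY_DECL = re.compile(r'<!ENTITY\s+%\s+(\S+)\s+"([^"]*)"\s*>')
PE_REF = re.compile(r"%([\w.\-]+);")


def collect_entities(*dtd_fragments):
    """Collect parameter entities in reading order; the first declaration wins."""
    entities = {}
    for fragment in dtd_fragments:
        for name, value in ENTITY_DECL.findall(fragment):
            entities.setdefault(name, value)  # later redeclarations are ignored
    return entities


def expand(text, entities):
    """Replace %name; references repeatedly until none are left."""
    while True:
        expanded = PE_REF.sub(lambda m: entities[m.group(1)], text)
        if expanded == text:
            return expanded
        text = expanded


DEFAULTS = """<!ENTITY % ptOfSpeech.user.ext "" >
<!ENTITY % ptOfSpeech.olif.fix.user.ext "#PCDATA %ptOfSpeech.user.ext;" >"""
USER = '<!ENTITY % ptOfSpeech.user.ext "|user" >'
MODEL = "(%ptOfSpeech.olif.fix.user.ext;)"

if __name__ == "__main__":
    # Defaults only: the extension point expands to nothing.
    print(expand(MODEL, collect_entities(DEFAULTS)))
    # User declaration read first: the content model gains the alternative.
    print(expand(MODEL, collect_entities(USER, DEFAULTS)))
```

If the user declaration were read after the defaults, it would have no effect, so the order in which the declarations are processed matters.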
In order to enhance the readability and maintainability of the DTD, the following coding conventions are applied:
Everyone needs to avoid unnecessary work. The header helps people and programs to decide whether it is sensible to spend time on a certain OLIF file. By looking at the header, questions like the following can be answered:
The OLIF header aims at giving value to lexical and terminological data by looking at both practical and theoretical considerations. Many data/information categories that have proven useful for other exchange efforts have been included. Mechanisms are in place for allowing an evolution of the format itself, and of tools for processing it. To some degree, the format allows a step-by-step approach to implementation. A closer look at the header reveals that it
The representation of the OLIF body closely follows the Proposal for the Structure and Content of the Body of an OLIF2 File. Amongst the few minor points of divergence, the grouping of data categories according to type may be the most visible one:
A transfer restriction specifies a condition in the source language under which a given translation is valid. Transfer restrictions thus contain important information for Machine Translation systems. Accordingly, OLIF provides extensive support for representation of transfer restrictions (in the oTrans.mod module). Validation of the representation for transfer restrictions is the next important step in the OLIF activity.
Large-scale terminology activity requires workflow support. This especially holds true in environments where substantial amounts of data are exchanged between different business partners, each of which works on a particular aspect of the creation of a multilingual terminological data collection. The suggested standards for terminology exchange, however, do not address this requirement sufficiently. Therefore, OLIF proposes extensive workflow support (in the oWf.mod module). The support is described in detail in the paper Workflow Information for Terminology Exchange, which is currently under review. During the review period, the oWf.mod module will contain nothing more than a dummy representation. After review (presumably August 2001) the real oWf.mod module will be made available.
This text was prepared and approved for publication by the OLIF Consortium Technical Working Group. Approval of this text does not necessarily imply that all working group members voted for its approval. The current and former (indicated by asterisk) members of the working group are:
Mike Dillinger, Logos
*Pierre-Yves Foucou, Systran
*Nils van der Laan, Trados (at the time of contribution)
Christian Lieske, SAP
*Paulo Martins, EC (at the time of contribution)
Carlo Mergen, EC
Peter Quartier, Lotus
Gregor Thurmair, Sail Labs
Michael Wetzel, Trados
Page last updated 30/07/01 11:24:23 +0200