TMX Format Specifications
Version 1.3 - August-29-2001
Summary of purpose
This document describes the TMX file format. TMX stands for Translation Memory eXchange. OSCAR (Open Standards for Container/Content Allowing Re-use) is the LISA Special Interest Group responsible for its definition.
1. Overview
The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process.
TMX is defined in two parts:
- A specification of the format of the container (the higher-level elements that provide information about the file as a whole and about entries). In TMX, an entry consisting of aligned segments of text in two or more languages is called a Translation Unit (the <tu> element).
- A specification of a low-level meta-markup format for the content of a segment of translation-memory text. In TMX, an individual segment of translation-memory text in a particular language is denoted by a <seg> element. See the section Content Markup for more details.
TMX can be implemented on three levels:
- Level 1 (Plain Text Only) - Support for the container only. The data inside each <seg> element is plain text, without Content Markup. This level is enough when the data do not have in-line codes, for example software messages. It is usually not sufficient for documentation-type formats.
- Level 2 (Meta-Markup) - Support for both container and content. The application uses the TMX Content Markup, but ignores the native codes inside the in-line tags. This level is usually enough for most purposes.
- Level 3 (Native-Markup) - Support for both container and content. The application is able to use the native codes inside the Content Markup elements.
2. Specifications
2.1. SGML/XML Compliance
TMX is XML-compliant. It also uses various ISO standards for date/time, language codes, and country codes. (See References section.)
TMX files are intended to be created automatically by export routines and processed automatically by import routines. TMX files are "well-formed" XML documents that can be processed without explicit reference to the TMX DTD. However, a "valid" TMX file must conform to the TMX DTD, and any suspicious TMX file should be verified against the TMX DTD using a validating XML parser.
Since XML syntax is case sensitive, any XML application must define casing conventions. All element and attribute names of TMX are defined in lower-case.
The TMX namespace is defined as http://www.lisa.org/tmx. For example, if you want to use TMX fragments in another XML document, your document will look something like:
<?xml version="1.0" ?>
<myformat>
 <data>
  <tmx xmlns="http://www.lisa.org/tmx">
   ... TMX data
  </tmx>
 </data>
</myformat>
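As a minimal sketch of how an importer might locate such embedded fragments, the following Python snippet searches a host document for <tmx> elements in the TMX namespace. The host element names are the placeholders from the example above, and the version attribute and empty <header/> and <body/> are filler added only to make the sample parseable; the function name find_tmx_fragments is hypothetical.

```python
import xml.etree.ElementTree as ET

TMX_NS = "http://www.lisa.org/tmx"  # namespace URI given in the specification

# Hypothetical host document; <myformat> and <data> are placeholder names.
doc = """<?xml version="1.0" ?>
<myformat>
 <data>
  <tmx xmlns="http://www.lisa.org/tmx" version="1.3">
   <header/>
   <body/>
  </tmx>
 </data>
</myformat>"""

def find_tmx_fragments(xml_text):
    """Return every <tmx> element in the TMX namespace found in xml_text."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a namespaced tag as "{namespace-uri}localname".
    return root.findall(f".//{{{TMX_NS}}}tmx")

fragments = find_tmx_fragments(doc)
print(len(fragments), fragments[0].get("version"))
```

Because the default namespace is inherited, the child elements of each fragment are also reported by ElementTree with the same "{http://www.lisa.org/tmx}" prefix.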
2.2. Code Sets
TMX files are always in Unicode. They can use any of three encoding methods: UCS-2 (16-bit files), UTF-8 (8-bit files) or ISO-646 [US-ASCII] (7-bit files). In all cases only the following five character entities are allowed: &amp; (&), &lt; (<), &gt; (>), &apos; ('), and &quot; ("). For 7-bit files, extended (non-ASCII) characters are always represented by numeric character references using the Unicode hexadecimal values (e.g. &#x0394; for GREEK CAPITAL LETTER DELTA).
Since all XML processors must accept the UTF-8 and UTF-16 encodings and since US-ASCII and UCS-2 encoding methods are, respectively, sub-sets of UTF-8 and UTF-16, a TMX document can omit the encoding declaration in the XML declaration.
Note that UCS-2 files always start with a Unicode byte-order-mark value, the ZERO WIDTH NO-BREAK SPACE 0xFEFF.
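The escaping rules for 7-bit files can be sketched in a few lines of Python. This is an illustration only, not part of the specification: the function name to_seven_bit and the four-digit zero padding of the hexadecimal value are assumptions made for the example.

```python
# The five character entities permitted in TMX files.
ENTITIES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", "'": "&apos;", '"': "&quot;"}

def to_seven_bit(text):
    """Escape text so that it can be stored in a 7-bit (US-ASCII) TMX file."""
    out = []
    for ch in text:
        if ch in ENTITIES:
            out.append(ENTITIES[ch])
        elif ord(ch) > 0x7F:
            # Non-ASCII characters become hexadecimal numeric character references.
            out.append(f"&#x{ord(ch):04X};")
        else:
            out.append(ch)
    return "".join(out)

print(to_seven_bit("\u0394 < 5 & donn\u00e9es"))
```

An exporter writing UTF-8 or UCS-2 files would only need the entity substitution step, since those encodings can carry the extended characters directly.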
2.3. Element Definitions
The following table lists the different elements of a TMX document (Container):
<tmx> | The <tmx> element contains one <header> element followed by one <body> element. Mandatory attribute: version. |
<header> | The <header> element contains zero, one or more <note> elements; zero, one or more <ude> elements; and zero, one or more <prop> elements. Mandatory attributes: creationtool, creationtoolversion, segtype, o-tmf, adminlang, srclang and datatype. Optional attributes: o-encoding, creationdate, creationid, changedate and changeid. |
<prop> | A <prop> (Property) element contains no other elements. The <prop> elements are used to define the various properties of the parent element (or of the file when <prop> is used in the <header> element). These properties are not defined by the standard. Each tool provider should publish the different property types it uses. If the tool exports unpublished property types, their values should begin with the prefix "x-". Mandatory attribute: type. Optional attributes: xml:lang and o-encoding. |
<ude> | A <ude> (User-Defined Encoding) element contains one or more <map/> elements. It is used to specify a set of user-defined characters and, optionally, their mapping from Unicode to the user-defined encoding. Mandatory attributes: base (if one or more of the <map/> elements contains a code attribute) and name. |
<map/> | A <map/> element is empty (i.e., it has no content and no end tag). The <map/> element is used to specify a user-defined character and some of its properties. Mandatory attribute: unicode. Optional attributes: code, ent and subst. Note that at least one of these attributes should be specified. If the code attribute is specified, the parent <ude> element must specify a base attribute. |
<body> | The <body> element encloses the main data: the set of <tu> elements contained in the file. Mandatory attributes: none. Optional attributes: none. |
<tu> | Each <tu> (Translation Unit) element contains zero, one or more <note> elements or <prop> elements, followed by one or more <tuv> elements. Logically, a complete translation-memory database will contain at least two <tuv> elements in each Translation Unit. Mandatory attributes: none. Optional attributes: tuid, o-encoding, datatype, usagecount, lastusagedate, creationtool, creationtoolversion, creationdate, creationid, changedate, segtype, changeid, o-tmf and srclang. |
<tuv> | Each <tuv> (Translation Unit Variant) element specifies text in a given language. It contains zero, one or more <note> elements or <prop> elements, followed by one <seg> element. Mandatory attribute: xml:lang. Optional attributes: o-encoding, datatype, usagecount, lastusagedate, creationtool, creationtoolversion, creationdate, creationid, changedate, o-tmf, and changeid. |
<seg> | Each <seg> (Segment) element contains the text of the <tuv> element. It contains zero, one or more <bpt> elements; the same number of corresponding <ept> elements; zero, one or more <it> elements; zero, one or more <ph> elements; and zero, one or more <ut> elements. All spacing characters and line-breaks are significant inside a <seg> element. It has no length limitation. Mandatory attributes: none. Optional attributes: none. |
<note> | A <note> element is used for comments. It contains no other element. Mandatory attributes: none. Optional attributes: o-encoding and xml:lang. |
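The container rules in the table above can be spot-checked programmatically. The following Python sketch is an assumption-laden illustration, not a substitute for validating against the TMX DTD: the function name check_container and the wording of the messages are hypothetical, and only a few of the rules are covered.

```python
import xml.etree.ElementTree as ET

# Mandatory <header> attributes as listed in the table above.
HEADER_REQUIRED = ("creationtool", "creationtoolversion", "segtype",
                   "o-tmf", "adminlang", "srclang", "datatype")
# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def check_container(root):
    """Return a list of problems found in a parsed <tmx> element."""
    problems = []
    if root.get("version") is None:
        problems.append("<tmx> lacks version")
    header = root.find("header")
    if header is None:
        problems.append("<header> is missing")
    else:
        for name in HEADER_REQUIRED:
            if header.get(name) is None:
                problems.append(f"<header> lacks {name}")
    body = root.find("body")
    if body is None:
        problems.append("<body> is missing")
        return problems
    for n, tu in enumerate(body.findall("tu"), start=1):
        tuvs = tu.findall("tuv")
        if not tuvs:
            problems.append(f"<tu> #{n} has no <tuv>")
        for tuv in tuvs:
            # The deprecated lang attribute is accepted on input.
            if tuv.get(XML_LANG) is None and tuv.get("lang") is None:
                problems.append(f"<tu> #{n}: <tuv> lacks xml:lang")
            if len(tuv.findall("seg")) != 1:
                problems.append(f"<tu> #{n}: <tuv> needs exactly one <seg>")
    return problems

sample = ET.fromstring(
    '<tmx version="1.3"><header creationtool="XYZTool" segtype="sentence"/>'
    '<body><tu><tuv xml:lang="EN"><seg>menu</seg></tuv>'
    '<tuv><seg>menu</seg></tuv></tu></body></tmx>'
)
problems = check_container(sample)
for problem in problems:
    print(problem)
```

A check of this kind is useful for quick diagnostics on import; a validating XML parser remains the authoritative test of a "valid" TMX file.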
The following table lists the different elements of a TMX document (Content):
<bpt> | The <bpt> (Begin paired tag) element contains zero, one or more <sub> elements. It is used to delimit the beginning of a paired sequence of native codes. Each <bpt> has a corresponding <ept> element within the segment. Mandatory attribute: i. Optional attributes: type and x. |
<ept> | The <ept> (End paired tag) element contains zero, one or more <sub> elements. It is used to delimit the end of a paired sequence of native codes. Each <ept> element has a corresponding <bpt> element within the segment. Mandatory attribute: i. Optional attributes: none. |
<sub> | The <sub> (Sub-flow) element contains zero, one or more <bpt> elements; the same number of <ept> elements; zero, one or more <it> elements; zero, one or more <ut> elements; zero, one or more <hi> elements; and zero, one or more <ph> elements. It is used to delimit sub-flow text inside a sequence of native code, for example: the definition of a footnote or the text of a title in an HTML anchor element. Mandatory attributes: none. Optional attributes: datatype and type. |
<it> | The <it> (Isolated tag) element contains zero, one or more <sub> elements. It is used to delimit a beginning/ending sequence of native codes that does not have its corresponding ending/beginning within the segment. Mandatory attribute: pos. Optional attributes: type and x. |
<ph> | The <ph> (Place holder) element contains zero, one or more <sub> elements. It is used to delimit a sequence of native stand-alone codes in the segment. Mandatory attributes: none. Optional attributes: type, x and assoc. |
<ut> | The <ut> (Unknown tag) element contains no other elements. It is used to delimit a sequence of native codes, about which the exporter has no information. Mandatory attributes: none. Optional attribute: x. |
<hi> | The <hi> (Highlight) element contains zero, one or more <bpt> elements; the same number of <ept> elements; zero, one or more <it> elements; zero, one or more <ut> elements; zero, one or more <hi> elements; and zero, one or more <ph> elements. It is used to delimit a portion of the segment for any user-defined purpose. Mandatory attributes: none. Optional attributes: type and x. Available in version 1.2 and after. |
2.4. Attribute Definitions
The following table lists the different attributes used in the elements of a TMX document. The same attribute may be used with multiple elements, but will be either mandatory or optional depending on the specific occurrence.
adminlang | The adminlang attribute is used in the <header> element to specify the default language for the administrative and informative elements <note> and <prop>. Its value must be one of the values used by an xml:lang attribute. |
assoc | The assoc attribute (Association) is used to define whether a <ph> element is associated with the previous or the following text. Its value must be "p" (previous), "f" (following), or "b" (both). |
base | The base attribute specifies the code set upon which the re-mapping of the <ude> element is based. Its value should follow the same rules as the value of an o-encoding attribute. |
changedate | The changedate attribute specifies the date of the modification of the element. Its value must be in ASCII, in the format YYYYMMDDThhmmssZ (e.g. 19970811T133402Z for August 11th 1997 at 1:34:02 pm). This is one of the options described in ISO 8601:1988. The value is always given in UTC (as indicated by the terminal Z). |
changeid | The changeid attribute specifies the user who modified the element. |
code | The code attribute specifies the code-point value in a user-defined encoding corresponding to the unicode character of a given <map/> element. Its value must be in hexadecimal format (e.g., code="#x9F"). |
creationdate | The creationdate attribute specifies the date of the creation of the element. Its value must be in ASCII, in the format YYYYMMDDThhmmssZ (e.g. 19970811T133402Z for August 11th 1997 at 1:34:02 pm). This is one of the options described in ISO 8601:1988. The value is always given in UTC (as indicated by the terminal Z). |
creationid | The creationid attribute specifies the user who created the element. |
creationtool | The creationtool attribute identifies the tool that created the TMX document. Its possible values are not specified by the standard but each tool provider will publish the string identifier it uses. |
creationtoolversion | The creationtoolversion attribute identifies the version of the tool that created the TMX document. Its possible values are not specified by the standard but each tool provider will publish the string identifier it uses. |
datatype | The datatype attribute specifies the type of data contained in an element. Its default value is "unknown". See the recommended values section for more information. |
ent | The ent attribute specifies the entity name of the character defined by a given <map/> element. Its value must be in ASCII (e.g., ent="copy"). |
i | The i attribute (Internal matching) is used in the content markup to pair the <bpt> elements with <ept> elements. This mechanism allows TMX to mark up possibly overlapping ranges of codes, such as: "<B>Bold <I>Bold+Italic</B> Italics</I>". |
lang | DEPRECATED attribute since version 1.3 : use xml:lang instead. The lang attribute specifies the language or the locale of the data of the element. In the <note> and <prop> elements, the default value for the lang attribute is the same as the adminlang attribute in the <header> element. The value of the lang attribute must be one of the ISO language identifiers (2 or 3-letter code) or one of the standard locale identifiers (2 or 3-letter language code, dash, 2-letter region code). |
xml:lang | The xml:lang attribute specifies the language or the locale of the data of the element. In the <note> and <prop> elements, the default value for the xml:lang attribute is the same as the adminlang attribute in the <header> element. The value of the xml:lang attribute must be one of the values defined by the XML specifications for this attribute. Note that the xml:lang value is case insensitive.
Starting from TMX version 1.3 this attribute replaces the deprecated attribute lang. TMX applications supporting version 1.3 should always use xml:lang on output, but should interpret the lang attribute as xml:lang on input. If, by accident, both attributes are present for a given element and have different values, xml:lang takes precedence. |
lastusagedate | The lastusagedate attribute specifies the last time the content of a <tu> or <tuv> element was used in the original translation memory environment. Its value must be in ASCII, in the format YYYYMMDDThhmmssZ (e.g. 19970811T133402Z for August 11th 1997 at 1:34:02 pm). This is one of the options described in ISO 8601:1988. The value is always given in UTC (as indicated by the terminal Z). |
name | The name attribute specifies the name of a <ude> element. Its value is not defined by the standard, but tool providers will publish the values they use. |
o-encoding | As stated in Section 2.2, all TMX files are in Unicode. However, it is sometimes useful to know what code set was used to encode text that was converted to Unicode for purposes of interchange. The o-encoding attribute specifies the original or preferred code set of the data of the element in case it is to be re-encoded in a non-Unicode code set. Its value, when possible, should be one of the IANA recommended code set identifiers. |
o-tmf | The o-tmf (Original Translation Memory Format) attribute specifies the format of the Translation Memory file from which the TMX document or segment thereof has been generated. |
pos | The pos attribute (Position) specifies that an <it> element is actually the beginning or the end part of a paired code that has no correspondence in the segment. Its value must be an empty string, "begin" or "end". |
segtype | The segtype attribute specifies the kind of segmentation used in the <tu> element. Its value must be either "block", "paragraph", "sentence" or "phrase". If a <tu> element does not have a segtype attribute specified, it is of the type defined in the <header> element. See the Implementation Notes for examples of how to use segtype. |
srclang | The srclang attribute specifies the language or locale of the source text. Its value must be one of the values used by an xml:lang attribute or the value "*all*" to indicate that any language combination can be used. |
subst | The subst attribute specifies an alternative string for the character defined in a given <map/> element. Its value must be in ASCII (e.g., "(c)" for the copyright sign). |
tuid | The tuid attribute specifies an identifier for the <tu> element. Its value is not defined by the standard (it could be unique or not, numeric or alphanumeric, etc.). |
type | The type attribute specifies the kind of data a <prop>, <bpt>, <ph>, <hi>, <sub> or <it> element represents. See the recommended values section for more information. |
unicode | The unicode attribute specifies the Unicode character value of a <map/> element. Its value must be a valid Unicode value (including the Private Use area) in hexadecimal format (e.g., unicode="#xF8FF"). |
usagecount | The usagecount attribute specifies the number of times the <tu> or the content of the <tuv> element has been accessed in the original TM environment. |
version | The version attribute indicates the version of the TMX format to which the document conforms. Its value is the major version number, a period, and the minor version number. For example: version="1.3". |
x | The x attribute (External matching) is used in the content markup to match <bpt>, <ph>, <it>, <ut> and <hi> elements between each <tuv> element of a given <tu> element. This mechanism facilitates the pairing of allied codes in source and target text, even if the order of code occurrence differs between the two. |
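The date format shared by the changedate, creationdate and lastusagedate attributes maps directly onto a strptime pattern. The following Python sketch shows one way to read and write such values; the helper names parse_tmx_date and format_tmx_date are hypothetical.

```python
from datetime import datetime, timezone

# YYYYMMDDThhmmssZ, with the literal "T" separator and terminal "Z" for UTC.
TMX_DATE_FORMAT = "%Y%m%dT%H%M%SZ"

def parse_tmx_date(value):
    """Parse a TMX date attribute into a timezone-aware UTC datetime."""
    return datetime.strptime(value, TMX_DATE_FORMAT).replace(tzinfo=timezone.utc)

def format_tmx_date(moment):
    """Format a timezone-aware datetime as a TMX date attribute value."""
    # Convert to UTC first, because TMX dates are always given in UTC.
    return moment.astimezone(timezone.utc).strftime(TMX_DATE_FORMAT)

stamp = parse_tmx_date("19970811T133402Z")
print(stamp.isoformat())       # 1997-08-11T13:34:02+00:00
print(format_tmx_date(stamp))  # 19970811T133402Z
```

Values that do not follow the pattern make strptime raise ValueError, which an import routine can report as a malformed attribute.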
2.5. Recommended Values for Attributes
By using standard values for attributes, the TMX format can minimize the amount of data lost during the exchange process. However, the dynamic nature of the diverse collection of data that needs to be captured doesn't lend itself to being part of the TMX format specification. By specifying the recommended values for attributes in the accompanying Implementation Notes, developers of translation memory tools can update this information on an on-going basis without initiating a revision of the TMX format specification itself.
The TMX specification strongly recommends that developers of TMX-aware tools use only recommended attribute values when writing TMX data in order to ensure full TMX compliance. The Implementation Notes document specifies recommended values for the datatype and type attributes.
2.6. Content Mark-Up
Each TM system uses a different method of marking up the formatting. Formats are constantly evolving, and new formats will be introduced on a regular basis. Attempting to collect, interpret, disseminate and maintain finite descriptions of each formatting tag used at any given time by any of the TM systems is not possible.
The best way to deal with these native codes is to delimit them by a specific set of elements that convey where they begin and end, and possibly additional information about what they are (bold, italic, footnote, etc.).
Native codes can be grouped into four categories:
- Codes that either begin or end an instruction, and whose beginning and ending functions both appear within a single segment. For example, an instruction to begin bold formatting for a range of words which is then followed in the same segment by an instruction to end bold formatting.
- Codes that either begin or end an instruction, but whose beginning and ending functions are not both contained within a single segment. For example, an instruction to embolden text may apply to the first three sentences in a paragraph, so that the instruction to turn off bolding appears only at the end of the third sentence. The beginning instruction is present in the first segment, while the closing instruction is present in the third segment.
- Codes that represent self-contained functions that don't require explicit ending instructions. An image token and a cross-reference token are examples of these standalone codes.
- Codes that have unknown behavior.
Respectively, the TMX vocabulary provides elements to mark up each category of native code sequences:
- The <bpt> and <ept> elements demark paired sequences of native code which begin and end in the same <seg> element.
- The <it> element demarks a paired native code that is isolated from its partner, possibly due to segmentation.
- The <ph> element demarks a stand-alone native code.
- The <ut> element demarks a native code that cannot be identified by the TMX processor.
An additional element (<sub>) is provided to delimit sub-flow text within a sequence of native codes. For example, if the text content of a footnote is defined within the footnote marker code, it may be demarked with the <sub> element.
For example:
Without Content mark-up tags:
 <seg>Text in {\i italics}.</seg>
With Content mark-up tags:
 <seg>Text in <bpt type="italic">{\i </bpt>italics<ept>}</ept>.</seg>
Such a mechanism allows tools to perform matching at several levels:
- Ignoring the codes. Since the code parts of the data stored in the segment are well delimited with the TMX elements, the tools can simply ignore them. It is equivalent to working at a plain text level. This solution may be the least efficient, but it does allow better matching results between units than when the codes get in the way.
- Matching the content mark-up tags. A second and more efficient way to proceed is to recognize the TMX content mark-up elements and use them as part of the matching criteria. This additional step gives you better accuracy.
- Matching the native codes inside the content mark-up tags. If the tool is sophisticated enough, it can go to a deeper and more accurate matching algorithm by also looking at the data the content mark-up elements delimit, and therefore get complete coverage for its matching algorithm.
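The three matching levels above can be sketched as three comparison keys derived from the same segment. This Python illustration makes several simplifying assumptions: the function name match_keys and the key notation are hypothetical, only in-line elements that are direct children of <seg> are handled (no <hi> or <sub> nesting), and the two sample segments carry the same bold text in RTF and in HTML.

```python
import xml.etree.ElementTree as ET

def match_keys(seg_xml):
    """Return (plain text, text + markup, text + markup + native codes)."""
    seg = ET.fromstring(seg_xml)
    start = seg.text or ""
    plain, markup, native = [start], [start], [start]
    for child in seg:
        kind = child.get("type", "")
        # Level 2: keep the TMX element name and its type, drop the native code.
        markup.append(f"<{child.tag}:{kind}>")
        # Level 3: also keep the native code the element delimits.
        native.append(f"<{child.tag}:{kind}:{''.join(child.itertext())}>")
        tail = child.tail or ""
        plain.append(tail)
        markup.append(tail)
        native.append(tail)
    return "".join(plain), "".join(markup), "".join(native)

rtf = r'<seg><bpt type="bold">{\b </bpt>Special<ept>}</ept> text</seg>'
html = ('<seg><bpt type="bold">&lt;B&gt;</bpt>Special'
        '<ept>&lt;/B&gt;</ept> text</seg>')

rtf_keys, html_keys = match_keys(rtf), match_keys(html)
print(rtf_keys[0] == html_keys[0])  # plain text level
print(rtf_keys[1] == html_keys[1])  # content mark-up level
print(rtf_keys[2] == html_keys[2])  # native code level
```

In this sketch the two segments agree at the first two levels and differ only at the third, where the native RTF and HTML codes are compared.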
For example, here are four segments differing only by the formatting codes:
Plain text: Special text
RTF v1:     {\b Special} text
RTF v2:     {\cf7 Special} text
HTML:       <B>Special</B> text
The same samples with the TMX content mark-up tags:
Plain text: <seg>Special text</seg>
RTF v1:     <seg><bpt type="bold">{\b </bpt>Special<ept>}</ept> text</seg>
RTF v2:     <seg><bpt>{\cf7 </bpt>Special<ept>}</ept> text</seg>
HTML:       <seg><bpt type="bold">&lt;B&gt;</bpt>Special<ept>&lt;/B&gt;</ept> text</seg>
- Native codes (RTF, HTML, etc.) need not be parsed before using the segments; the TMX elements allow you to make the distinction between code and text.
- Comparisons across file formats can be achieved, and if there is enough information (like for the previous bolding example) leverage of native code of source into target can be performed. This not only leverages the translated text but also the correct formatting, even if it was originally in a different source file format.
The datatype attribute is used to specify the kind of native code the data contains.
Matching codes between <tuv> elements (x attribute)
TMX implements a mechanism to help you match codes between source and target text. The x attribute in the <bpt>, <it>, <ph>, <hi> and <ut> elements allows you to pair codes between two <tuv> elements, even if the translation has changed their order. For example:
<seg>The <bpt x="1">{\b </bpt> black<ept>}</ept><bpt x="2">{\i </bpt> cat<ept>}</ept> sleeps.</seg>
<seg>Le<bpt x="2">{\i </bpt> chat<ept>}</ept> <bpt x="1">{\b </bpt> noir<ept>}</ept> dort.</seg>
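A tool could use the x values to line up the formatted spans of the two segments. The Python sketch below is a simplification written for this example only: the function name codes_by_x is hypothetical, and it treats the text immediately following each tagged element as the span that the code applies to, which holds for the segments above but not in general.

```python
import xml.etree.ElementTree as ET

def codes_by_x(seg_xml):
    """Map each x value to the text that follows the element carrying it."""
    seg = ET.fromstring(seg_xml)
    return {el.get("x"): (el.tail or "").strip()
            for el in seg if el.get("x") is not None}

source = (r'<seg>The <bpt x="1">{\b </bpt> black<ept>}</ept>'
          r'<bpt x="2">{\i </bpt> cat<ept>}</ept> sleeps.</seg>')
target = (r'<seg>Le<bpt x="2">{\i </bpt> chat<ept>}</ept> '
          r'<bpt x="1">{\b </bpt> noir<ept>}</ept> dort.</seg>')

src, tgt = codes_by_x(source), codes_by_x(target)
for x in sorted(src):
    print(x, src[x], "->", tgt.get(x))
```

The pairing is found through the x value alone, so the reversed order of the bold and italic codes in the French segment does not matter.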
Overlapping codes (i attribute)
TMX provides a way to deal with overlapping tags. Such constructions are not used often; however, several formats allow them. For example, the following HTML segment, even if not strictly legal, is accepted by some HTML editors and usually interpreted correctly by browsers.
HTML:
 <B>Bold, <I>Bold+Italic</B>, Italic</I>
TMX (without content mark-up):
 <seg>&lt;B&gt;Bold, &lt;I&gt;Bold+Italic&lt;/B&gt;, Italic&lt;/I&gt;</seg>
With the TMX content mark-up, since the <ept> element does not necessarily have a type, it can be difficult to know which sequence of codes it closes as illustrated by the following segment:
TMX (with basic content mark-up):
 <seg><bpt>&lt;B&gt;</bpt>Bold, <bpt>&lt;I&gt;</bpt>Bold+Italic<ept>&lt;/B&gt;</ept>, Italic<ept>&lt;/I&gt;</ept></seg>
The attribute i is used to specify which <ept> is closing which <bpt>.
TMX (with correct content mark-up):
 <seg><bpt i="1">&lt;B&gt;</bpt>Bold, <bpt i="2">&lt;I&gt;</bpt>Bold+Italic<ept i="1">&lt;/B&gt;</ept>, Italic<ept i="2">&lt;/I&gt;</ept></seg>
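The pairing that the i attribute makes possible can be sketched as follows. This Python illustration assumes that the <bpt> and <ept> elements are direct children of <seg> and that the native HTML codes are escaped with character entities; the function name pair_codes is hypothetical.

```python
import xml.etree.ElementTree as ET

def pair_codes(seg_xml):
    """Return {i: (opening native code, closing native code)} for one <seg>."""
    seg = ET.fromstring(seg_xml)
    opened, pairs = {}, {}
    for el in seg:
        i = el.get("i")
        if el.tag == "bpt":
            opened[i] = el.text or ""
        elif el.tag == "ept":
            if i not in opened:
                raise ValueError(f"<ept i={i!r}> has no matching <bpt>")
            # Pair by the value of i, not by nesting order, so overlaps are fine.
            pairs[i] = (opened.pop(i), el.text or "")
    if opened:
        raise ValueError(f"unclosed <bpt> elements: {sorted(opened)}")
    return pairs

seg = ('<seg><bpt i="1">&lt;B&gt;</bpt>Bold, <bpt i="2">&lt;I&gt;</bpt>'
       'Bold+Italic<ept i="1">&lt;/B&gt;</ept>, Italic'
       '<ept i="2">&lt;/I&gt;</ept></seg>')
pairs = pair_codes(seg)
print(pairs)
```

A stack-based matcher would reject this segment because the bold range closes before the italic range that opened inside it; matching on i avoids that problem.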
See the Implementation Notes for more details.
3. TMX Sample File
Notational conventions: The restrictions on the number of occurrences of each element and whether an attribute is mandatory within an element are indicated by:
- BOLD for the items that are mandatory.
- ITALIC for the items that can be specified zero or one times.
- NORMAL for the items that can be specified zero, one or more times.
This is an example of a TMX file. (The indentations are only there for ease of reading). Different types of notation are mixed to illustrate the various possibilities.
<?xml version="1.0" ?>
<!DOCTYPE tmx SYSTEM "tmx13.dtd">
<!-- Example of TMX document -->
<tmx version="1.3">
 <header
  creationtool="XYZTool"
  creationtoolversion="1.01-023"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="EN"
  o-tmf="ABCTransMem"
  creationdate="19970101T163812Z"
  creationid="ThomasJ"
  changedate="19970314T023401Z"
  changeid="Amity"
  o-encoding="iso-8859-1"
 >
  <note>This is a note at document level.</note>
  <prop type="RTFPreamble">{\rtf1\ansi\tag etc...{\fonttbl}</prop>
  <ude name="MacRoman" base="Macintosh">
   <map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>
  </ude>
 </header>
 <body>
  <tu
   tuid="0001"
   datatype="Text"
   usagecount="2"
   lastusagedate="19970314T023401Z"
  >
   <note>Text of a note at the TU level.</note>
   <prop type="x-Domain">Computing</prop>
   <prop type="x-Project">Pægasus</prop>
   <tuv
    xml:lang="EN"
    creationdate="19970212T153400Z"
    creationid="BobW"
   >
    <seg>data (with a non-standard character: &#xF8FF;).</seg>
   </tuv>
   <tuv
    xml:lang="FR-CA"
    creationdate="19970309T021145Z"
    creationid="BobW"
    changedate="19970314T023401Z"
    changeid="ManonD"
   >
    <prop type="Origin">MT</prop>
    <seg>données (avec un caractère non standard: &#xF8FF;).</seg>
   </tuv>
  </tu>
  <tu tuid="0002" srclang="*all*">
   <prop type="Domain">Cooking</prop>
   <tuv xml:lang="EN">
    <seg>menu</seg>
   </tuv>
   <tuv xml:lang="FR-CA">
    <seg>menu</seg>
   </tuv>
   <tuv xml:lang="FR-FR">
    <seg>menu</seg>
   </tuv>
  </tu>
 </body>
</tmx>
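As a rough sketch of what an import routine might do with such a file, the Python code below reads a trimmed copy of the sample and replaces the user-defined character with the subst string declared in its <map/> element. The trimmed document, the function name load_units and the decision to apply subst on import are assumptions made for illustration; the user-defined character is written as the numeric character reference &#xF8FF; to match the unicode attribute of the <map/> element.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed copy of the sample file (DOCTYPE and most attributes omitted).
tmx_text = """<tmx version="1.3">
 <header creationtool="XYZTool" creationtoolversion="1.01-023"
         datatype="PlainText" segtype="sentence" adminlang="en-us"
         srclang="EN" o-tmf="ABCTransMem">
  <ude name="MacRoman" base="Macintosh">
   <map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>
  </ude>
 </header>
 <body>
  <tu tuid="0001">
   <tuv xml:lang="EN">
    <seg>data (with a non-standard character: &#xF8FF;).</seg>
   </tuv>
   <tuv xml:lang="FR-CA">
    <seg>donn&#xE9;es (avec un caract&#xE8;re non standard: &#xF8FF;).</seg>
   </tuv>
  </tu>
 </body>
</tmx>"""

def load_units(text):
    """Return [(tuid, {language: segment text})] for every <tu> in text."""
    root = ET.fromstring(text)
    # Build a translation table: user-defined character -> ASCII substitute.
    subst = {}
    for entry in root.iter("map"):
        if entry.get("subst") is not None:
            codepoint = int(entry.get("unicode")[2:], 16)  # "#xF8FF" -> 0xF8FF
            subst[codepoint] = entry.get("subst")
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            seg_text = "".join(tuv.find("seg").itertext())
            variants[tuv.get(XML_LANG)] = seg_text.translate(subst)
        units.append((tu.get("tuid"), variants))
    return units

units = load_units(tmx_text)
for tuid, variants in units:
    for language, segment in variants.items():
        print(tuid, language, segment)
```

Note that itertext() also returns the native codes inside any content mark-up elements; the sample segments contain none, so the result here is plain text.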
4. Glossary
DTD stands for Document Type Definition. An SGML document has an associated DTD that specifies the rules for the structure of the document. Several industries have standardized on various DTDs for the different types of documents that they share.
SGML stands for Standard Generalized Markup Language, an ISO standard (ISO 8879) that allows the definition of structured formats. SGML is not a format by itself, but a set of rules to define formats. SGML mark-up systems are defined in Document Type Definition files (DTDs).
XML stands for Extensible Markup Language. XML is a simplified and restricted subset of SGML.
UCS-2 is a 16-bit fixed-length encoding scheme of the Unicode character set.
UTC stands for Coordinated Universal Time.
5. References
The following links are useful references for implementing TMX, SGML and XML-related applications.
- ISO 639:1988 -- Code for the representation of names of languages
- ISO 3166:1993 -- Code for the representation of names of countries
- RFC 3066 -- Tags for Identification of Languages
- ISO 646:1991 -- Information Technology -- ISO 7-bit coded character set for information interchange (ASCII)
- ISO 8601:1988 -- Data elements and interchange formats - Information interchange - Representation of dates and times
- ISO 8879:1986 -- Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML)
- ISO 10646-1:1993 -- Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane
- IANA Code set names -- Code sets naming conventions
The most up-to-date version of this document can be obtained on the LISA Web site at http://www.lisa.org/tmx/ [http://www.gala-global.org/oscarStandards/tmx/].
Copyright notice
This OSCAR document is copyright-protected by Localisation Industry Standards Association © 1997-2002. All rights reserved. Neither this document nor any extract from it may be reproduced, stored or transmitted for any purpose without prior written permission from the OSCAR group of LISA.
Last update of this document: Apr-15-2002