SRX 1.0 Specification

OSCAR Recommendation, 20 April 2004

[2011-07-01 – This document is reproduced with permission of the Localization Industry Standards Association. All emendations to the original text are indicated by striking through the original text and inserting new text in red bold face in square brackets.]

Editor:

David Pooley

 

Copyright © The Localisation Industry Standards Association [LISA] 2004. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to LISA.

The limited permissions granted above are perpetual and will not be revoked by LISA or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and LISA DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

 


Abstract

This document defines the Segmentation Rules eXchange format (SRX). The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors.

Status of this document

This document is the current draft for the SRX format. Comments may be sent to [email protected].

Table of contents

1. Introduction

1.1. XML compliance

1.2. Regular expressions

2. General structure

2.1. Language rules

2.2. Map rules

2.3. Example document

3. Detailed specifications

3.1. Elements

3.2. Attributes

Appendices

A. References

B. Sample Document

C. Examples of Segmentation

D. Document Type Definition for SRX

E. XML Schema for SRX


1. Introduction

SRX is intended to enhance the TMX standard so that translation memory (TM) data that is exchanged between applications can be used more effectively. Having the segmentation rules that were used when a TM was created will increase the leverage that can be achieved when deploying the TM data. SRX does not, however, address the following procedural issues, which would cause loss of leverage when TMX data is deployed in another environment:

  • The use of segmentation rules that are different from those that were originally used to create the TM data
  • TM data that was generated using a variety of segmentation rules

SRX is defined in two parts:

  • A specification of the segmentation rules that are applicable for each language. This is represented by the <languagerules> element.
  • A specification of how the segmentation rules are applied to each language. This is represented by the <maprules> element.

An SRX document is a companion to a TMX document. It allows the application receiving the TMX data to determine how the original text was segmented before being inserted into the translation memory. As such, it is assumed that any text that is being segmented with the rules defined in the SRX document is already in TMX format and only contains the standard TMX-defined formatting tags <bpt>, <ept>, <ph> and <it>.

In its current implementation, SRX is focused primarily on sentence segmentation. The reason behind this is that TM tools that currently support (or intend to support) TMX are also primarily focused on sentence segmentation. Future versions of SRX may address more complex segmentation such as phrases and terms.

1.1. XML Compliance

SRX is XML-compliant. SRX files are intended to be created automatically by export routines and processed automatically by import routines. SRX files are "well-formed" XML documents that can be processed without explicit reference to the SRX DTD. However, a "valid" SRX file must conform to the SRX DTD, and any suspicious SRX file should be verified against the SRX DTD using a validating XML parser.

Since XML syntax is case sensitive, any XML application must define casing conventions. All element and attribute names of SRX are defined in lowercase.

The SRX namespace is defined as "http://www.lisa.org/srx10". The following XML sample includes the SRX namespace definition.

<?xml version="1.0"?>
<myformat>
 <data>
  <srx xmlns="http://www.lisa.org/srx10"
       version="1.0">
   ... SRX data ...
  </srx>
 </data>
</myformat>
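As an illustration only, an importing tool could locate namespaced SRX data inside a host format along the following lines. This is a minimal sketch assuming Python and its standard xml.etree.ElementTree module; the helper name find_srx is hypothetical and is not part of SRX.

```python
import xml.etree.ElementTree as ET

SRX_NS = "http://www.lisa.org/srx10"  # namespace URI given in this section


def find_srx(xml_text):
    """Return the first namespaced <srx> element in xml_text, or None."""
    root = ET.fromstring(xml_text)
    tag = "{%s}srx" % SRX_NS  # ElementTree spells qualified names as {uri}local
    if root.tag == tag:
        return root
    return root.find(".//" + tag)


# Host document modelled on the sample above (hypothetical <myformat> wrapper).
host = """<myformat>
 <data>
  <srx xmlns="http://www.lisa.org/srx10" version="1.0"/>
 </data>
</myformat>"""

srx = find_srx(host)
```

An <srx> element that does not declare the namespace is not found by this helper; whether such data should be accepted is left to the application.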

1.2. Regular Expressions

The segmentation rules themselves are represented using regular expressions. This allows for maximum flexibility in the definition of the rules. The following definitions are a subset of the current definition for the ICU regular expressions. Applications using other engines will need to adapt this format for use with their own parser.

1.2.1. Metacharacters

Character

Description

\a 

Match a BELL, \u0007 

\A 

Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input. 

\b, outside of a [Set] 

Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored.

\b, within a [Set] 

Match a BACKSPACE, \u0008

\B 

Match if the current position is not a word boundary. 

\cX 

Match a control-X character. 

\d 

Match any character with the Unicode General Category of Nd (Number, Decimal Digit.) 

\D 

Match any character that is not a decimal digit. 

\e 

Match an ESCAPE, \u001B

\E 

Terminates a \Q ... \E quoted sequence. 

\f 

Match a FORM FEED, \u000C

\G 

Match if the current position is at the end of the previous match. 

\n 

Match a LINE FEED, \u000A

\N{UNICODE CHARACTER NAME} 

Match the named character. 

\p{UNICODE PROPERTY NAME} 

Match any character with the specified Unicode Property. 

\P{UNICODE PROPERTY NAME} 

Match any character not having the specified Unicode Property. 

\Q 

Quotes all following characters until \E

\r 

Match a CARRIAGE RETURN, \u000D

\s 

Match a white space character. White space is defined as [\t\n\f\r\p{Z}]

\S 

Match a non-white space character. 

\t 

Match a HORIZONTAL TABULATION, \u0009

\uhhhh 

Match the character with the hex value hhhh. 

\Uhhhhhhhh 

Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff

\w 

Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]

\W 

Match a non-word character. 

\x{hhhh} 

Match the character with hex value hhhh 

\xhh 

Match the character with two digit hex value hh 

\X 

Match a Grapheme Cluster

\Z 

Match if the current position is at the end of input, but before the final line terminator, if one exists. 

\z 

Match if the current position is at the end of input. 

\0nnn 

Match the character with octal value nnn 

\n (where n is a number) 

Back Reference. Match whatever the nth capturing group matched. n must be >= 1 and <= the total number of capture groups in the pattern 

[pattern] 

Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern.

. 

Match any character. 

^ 

Match at the beginning of a line.  

$ 

Match at the end of a line.  

\ 

Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . / 

1.2.2. Operators

The following operators can be used.

Operator 

Description 

| 

Alternation. A|B matches either A or B. 

* 

Match 0 or more times. Match as many times as possible. 

+ 

Match 1 or more times. Match as many times as possible. 

? 

Match zero or one times. Prefer one. 

{n} 

Match exactly n times 

{n,} 

Match at least n times. Match as many times as possible. 

{n,m} 

Match between n and m times. Match as many times as possible, but not more than m. 

*? 

Match 0 or more times. Match as few times as possible. 

+? 

Match 1 or more times. Match as few times as possible. 

?? 

Match zero or one times. Prefer zero. 

{n}? 

Match exactly n times 

{n,}? 

Match at least n times, but no more than required for an overall pattern match 

{n,m}? 

Match between n and m times. Match as few times as possible, but not less than n. 

*+ 

Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match) 

++ 

Match 1 or more times. Possessive match. 

?+ 

Match zero or one times. Possessive match. 

{n}+ 

Match exactly n times 

{n,}+ 

Match at least n times. Possessive Match. 

{n,m}+ 

Match between n and m times. Possessive Match. 
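The difference between the "as many times as possible" and "as few times as possible" forms can be seen in a short sketch. It assumes Python's re module rather than ICU, so it is an example of adapting the syntax to another engine; only operators that behave the same way in both engines are used.

```python
import re

text = "<b>one</b><b>two</b>"

# "*" takes as much as it can, so the match runs to the last "</b>".
greedy = re.search(r"<b>.*</b>", text).group(0)

# "*?" takes as little as it can, so the match stops at the first "</b>".
reluctant = re.search(r"<b>.*?</b>", text).group(0)

# "{n,m}" prefers the upper bound, "{n,m}?" prefers the lower bound.
bounded = re.search(r"a{2,3}", "aaaa").group(0)
bounded_reluctant = re.search(r"a{2,3}?", "aaaa").group(0)
```

Here greedy covers the whole input, reluctant covers only the first bold element, and the two bounded forms match three and two characters respectively.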

2. General structure

An SRX document is enclosed in an <srx> root element. The <srx> element contains two elements: <header> and <body>. The <header> element contains zero or more <formathandle> elements. The <body> element contains two elements: <languagerules> and <maprules>.

2.1. Language rules

The <languagerules> element contains information about the segmentation rules for each particular language. It is a collection of <languagerule> elements. Each one of these contains a collection of <rule> elements.

Each <rule> element contains zero or one <beforebreak> element and zero or one <afterbreak> element which provide details of the regular expressions for the rules themselves. The break attribute indicates whether this is a segment break or an exception rule. The rules are applied in the order that they are specified within the <languagerule> element.

Note that this approach is an adaptation of the method described in Unicode Technical Report 29 which covers text boundaries. Readers are encouraged to study this report with particular attention being given to the "sentence boundaries" section.
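One possible reading of this ordering can be sketched as follows. This is an illustration in Python, not a reference implementation. It assumes that at each position in the text the first rule whose <beforebreak> pattern matches the text ending at that position, and whose <afterbreak> pattern matches the text starting there, decides the outcome. The function name segment and the tuple layout are hypothetical.

```python
import re


def segment(text, rules):
    """Split text using a list of (is_break, before, after) rules.

    Assumption: at each position the first rule whose two patterns both
    match decides whether a break occurs there.
    """
    compiled = [
        # Anchor the "before" pattern so it must end at the candidate position.
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # search(text, 0, pos) treats text as if it ended at pos.
            if before.search(text, 0, pos) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins, break rule or exception
    segments.append(text[start:])
    return segments


# Rules modelled on Appendix C, with the dots escaped.
rules = [
    (False, r"U\.K\.", r"\s"),
    (False, r"Mr\.", r"\s"),
    (True, r"[\.\?!]+", r"\s"),
]
sample = "The U.K. Prime Minister, Mr. Blair, was seen out with his family today."
result = segment(sample, rules)  # a single segment: both exceptions apply
```

Because the exception rules are listed before the break rule, they take precedence at the positions after "U.K." and "Mr.". Listing the break rule first would reverse that outcome under this reading.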

2.2. Map rules

The <maprules> element contains information as to how each language should be segmented. It is a collection of <maprule> elements. Each one of these contains a collection of <languagemap> elements that describe which rules should be used for each language.
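A minimal sketch of how a tool might pick the language rule for a given language code is shown below, in Python. It rests on three assumptions that this specification does not state: the <languagemap> elements are tried in document order, the first pattern that matches the whole language code wins, and matching ignores case. The function name select_rule_name is hypothetical.

```python
import re


def select_rule_name(language_code, language_maps):
    """Return the languagerulename of the first map whose pattern matches.

    language_maps is a list of (languagepattern, languagerulename) pairs in
    document order. Returns None when no pattern matches.
    """
    for pattern, rule_name in language_maps:
        # Assumption: the pattern must match the entire code, ignoring case.
        if re.fullmatch(pattern, language_code, flags=re.IGNORECASE):
            return rule_name
    return None


# Pairs taken from the sample document in Appendix B.
maps = [("JA.*", "Japanese"), (".*", "Default")]
japanese = select_rule_name("ja-JP", maps)
fallback = select_rule_name("en-GB", maps)
```

With this ordering the catch-all pattern ".*" acts as a fallback, which is why it is listed last.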

2.3. Example document

See the sample document section for an example of an SRX document.

3. Detailed Specifications

3.1. Elements

This section lists the various elements used in the SRX document.

<afterbreak>, <beforebreak>, <body>, <formathandle>, <header>, <languagemap>, <languagerule>, <languagerules>, <maprule>, <maprules>, <rule>, <srx>.

After break - The <afterbreak> element encloses a regular expression.

Required attributes:

None.

Optional attributes:

None.

Contents:

A regular expression which represents the text that appears after a segment break.

Before break - The <beforebreak> element encloses a regular expression.

Required attributes:

None.

Optional attributes:

None.

Contents:

A regular expression which represents the text that appears before a segment break.

Body - The <body> element encloses the language rules and language maps that are contained within the file.

Required attributes:

None.

Optional attributes:

None.

Contents:

Zero or one <languagerules> element and zero or one <maprules> element.

Format handling - The <formathandle> element determines how formatting that falls on a segment boundary should be handled. The type attribute determines the type of formatting and the include attribute indicates how this formatting should be handled. As these elements are optional in the <header> element, the following defaults will apply:

<formathandle type="start" include="no"/>
<formathandle type="end" include="yes"/>
<formathandle type="isolated" include="no"/>

Required attributes:

type, include

Optional attributes:

None

Contents:

None
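The defaults listed for this element could be applied by an importing tool roughly as follows. This is a sketch in Python using the standard xml.etree.ElementTree module; it assumes a <header> element without a namespace, and the names DEFAULT_FORMAT_HANDLING and format_handling are hypothetical.

```python
import xml.etree.ElementTree as ET

# Defaults stated above for <formathandle> elements that are absent.
DEFAULT_FORMAT_HANDLING = {"start": "no", "end": "yes", "isolated": "no"}


def format_handling(header):
    """Return a type -> include mapping for a <header> element."""
    handling = dict(DEFAULT_FORMAT_HANDLING)
    for handle in header.findall("formathandle"):
        # An explicit <formathandle> overrides the default for its type.
        handling[handle.get("type")] = handle.get("include")
    return handling


header = ET.fromstring(
    '<header segmentsubflows="yes">'
    '<formathandle type="isolated" include="yes"/>'
    "</header>"
)
handling = format_handling(header)
```

In this example only the "isolated" type is overridden; "start" and "end" keep their default values.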

Header - The <header> element contains information that is relevant to the whole document.

Required attributes:

segmentsubflows

Optional attributes:

None.

Contents:

Zero, one, two or three <formathandle> elements.

Language map - The <languagemap> element maps one or more languages to a language rule.

Required attributes:

languagepattern, languagerulename

Optional attributes:

None.

Contents:

None.

Language rule - The <languagerule> element encloses one instance of language rule data, a set of <rule> elements.

Required attributes:

languagerulename

Optional attributes:

None.

Contents:

One or more <rule> elements.

Language rules - The <languagerules> element encloses the language rules data, the set of <languagerule> elements.

Required attributes:

None.

Optional attributes:

None.

Contents:

One or more <languagerule> elements.

Map rule - The <maprule> element encloses one instance of map rule data, a set of <languagemap> elements.

Required attributes:

maprulename

Optional attributes:

None.

Contents:

One or more <languagemap> elements.

Map rules - The <maprules> element encloses the map rules data, the set of <maprule> elements. The order of the <maprule> elements determines the logical order in which these rules should be applied.

Required attributes:

None.

Optional attributes:

None.

Contents:

One or more <maprule> elements.

Break or exception rule - The <rule> element defines a segmentation rule for a language using the <beforebreak> and <afterbreak> elements. The break attribute determines whether this is a rule that determines a break or an exception. If the break attribute is missing, it is assumed to be a break rule.

Required attributes:

None.

Optional attributes:

break

Contents:

Zero or one <beforebreak> element and zero or one <afterbreak> element.
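Reading a <rule> element with these defaults might look like the sketch below, written in Python with the standard xml.etree.ElementTree module. Treating a missing <beforebreak> or <afterbreak> as an empty pattern is an assumption made for this example, since the specification does not say how an absent element should be interpreted. The function name read_rule is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_rule(rule):
    """Return (is_break, before, after) for one <rule> element."""
    # A missing break attribute means the rule is a break rule.
    is_break = rule.get("break", "yes") == "yes"
    # Assumption: an absent or empty child element is an empty pattern.
    before = rule.findtext("beforebreak", default="") or ""
    after = rule.findtext("afterbreak", default="") or ""
    return is_break, before, after


break_rule = read_rule(ET.fromstring(
    r"<rule><beforebreak>[\.\?!]+</beforebreak><afterbreak>\s</afterbreak></rule>"
))
exception_rule = read_rule(ET.fromstring(
    r'<rule break="no"><beforebreak>Mr\.</beforebreak></rule>'
))
```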

Root element - The <srx> element is the root element of the document. It encloses the header and body information for the file.

Required attributes:

version

Optional attributes:

None.

Contents:

One <header> element and one <body> element.

3.2. Attributes

This section lists the various attributes used in the SRX elements.

break, include, languagepattern, languagerulename, maprulename, segmentsubflows, type, version.

Break indicator - Specifies whether a rule is a break or an exception.

Value description:

A value of "no" indicates that the rule is an exception rule. A value of "yes" indicates that the rule is a break rule.

Default value:

"yes"

Used in:

<rule>.

Formatting code behaviour - The include attribute indicates whether formatting is included in the segment being created.

Value description:

A value of "no" indicates that the format code does not belong to the segment being created. A value of "yes" indicates that the format code belongs to the segment being created.

Default value:

"no"

Used in:

<formathandle>.

Language pattern - Identifies a language pattern.

Value description:

Specifies a regular expression for the language codes that map to the given language rule. Language codes are defined as in [RFC 3066].

Default value:

Undefined.

Used in:

<languagemap>.

Language rule name - Specifies a unique name for a language rule.

Value description:

Used to link a language rule between the <languagerule> and <languagemap> elements.

Default value:

Undefined.

Used in:

<languagerule>, <languagemap>.
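Because the name is the only link between the two elements, a tool may want to check that every name used in a <languagemap> is defined by some <languagerule>. The following Python sketch shows one way to do so. It assumes elements without a namespace, and the function name undefined_rule_names is hypothetical.

```python
import xml.etree.ElementTree as ET


def undefined_rule_names(srx):
    """Return names used in <languagemap> that no <languagerule> defines."""
    defined = {rule.get("languagerulename") for rule in srx.iter("languagerule")}
    used = {lmap.get("languagerulename") for lmap in srx.iter("languagemap")}
    return sorted(used - defined)


doc = ET.fromstring(
    '<srx version="1.0"><header segmentsubflows="yes"/><body>'
    '<languagerules><languagerule languagerulename="Default">'
    "<rule><beforebreak>x</beforebreak></rule>"
    "</languagerule></languagerules>"
    '<maprules><maprule maprulename="Default">'
    '<languagemap languagepattern="JA.*" languagerulename="Japanese"/>'
    '<languagemap languagepattern=".*" languagerulename="Default"/>'
    "</maprule></maprules></body></srx>"
)
missing = undefined_rule_names(doc)
```

In this deliberately incomplete document, "Japanese" is referenced by a <languagemap> but never defined, so it is the only name reported.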

Map rule name - Specifies a unique name for a mapping rule.

Value description:

Used to uniquely identify a mapping rule.

Default value:

Undefined.

Used in:

<maprule>.

Subflow segmentation behaviour - The segmentsubflows attribute indicates how subflows should be segmented.

Value description:

A value of "no" indicates that subflows within a segment should not be segmented. A value of "yes" indicates that subflows should be segmented according to the rules. A subflow is defined as being a piece of text that appears within another segment but which should be handled separately. For example, in the following HTML snippet:

<p>Click <img src="..\button.gif" alt="Toolbar button. Click to preview."/> to preview the document.</p>

The text "Toolbar button. Click to preview." is a subflow. The segmentsubflows attribute determines whether this text should be segmented according to the rules.

Default value:

"yes"

Used in:

<header>.

Formatting code type - The type attribute indicates the type of formatting for which the <formathandle> is being applied.

Value description:

This attribute can have one of three values. These are:

  • "start" to indicate the start of a pair of formatting codes
  • "end" to indicate the end of a pair of formatting codes
  • "isolated" to indicate a format that has no partner

Default value:

Undefined

Used in:

<formathandle>.

SRX version - The version attribute indicates the version of the SRX format to which the document conforms.

Value description:

Fixed text: the major version number, a period, and the minor version number. For example: version="1.0".

Default value:

"1.0"

Used in:

<srx>.


A. References

[Unicode Character Database 4.0.0]

Unicode Character Database 4.0.0. Unicode Consortium, Apr 2003.

[ISO 639]

Codes for the Representation of Names of Languages. ISO (International Organization for Standardization), Nov 2001.

[ISO 3166]

Codes for the representation of names of countries and their subdivisions. ISO (International Organization for Standardization), Jun 2000.

[RFC 3066]

RFC 3066 Tags for the Identification of Languages. IETF (Internet Engineering Task Force), Jan 2001.

[XML 1.0]

Extensible Markup Language (XML) 1.0 Second Edition. W3C (World Wide Web Consortium), Oct 2000.

[ICU Regular Expressions]

ICU Regular Expressions User Guide. IBM, 2003.

[Unicode Technical Report 29]

UAX #29: Text Boundaries, Unicode Consortium, 2003.

B. Sample Document

In this example of an SRX document indentations are added for ease of reading, and the different types of notation are mixed to illustrate the various possibilities.

<?xml version="1.0"?>
<!DOCTYPE srx PUBLIC "-//SRX//DTD SRX//EN" "srx.dtd">
<srx version="1.0">
 <header segmentsubflows="yes">
  <formathandle type="start" include="no"/>
  <formathandle type="end" include="yes"/>
  <formathandle type="isolated" include="yes"/>
 </header>
 <body>
  <languagerules>
   <languagerule languagerulename="Default">
    <rule break="no">
     <beforebreak>^\s*[0-9]+\.</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
    <rule break="no">
     <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
     <afterbreak>\s[a-z]</afterbreak>
    </rule>
    <rule break="no">
     <beforebreak>\sMr\.</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
    <rule break="yes">
     <beforebreak>[\.\?!]+</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
    <rule break="yes">
     <beforebreak></beforebreak>
     <afterbreak>\n</afterbreak>
    </rule>
   </languagerule>
   <languagerule languagerulename="Japanese">
    <rule break="no">
     <beforebreak>^\s*[0-9]+\.</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
    <rule break="no">
     <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
     <afterbreak></afterbreak>
    </rule>
    <rule break="yes">
     <beforebreak>[\.\?!]+</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
    <rule break="yes">
     <beforebreak>[\uff61\u3002\uff0e\uff1f\uff01]+</beforebreak>
     <afterbreak></afterbreak>
    </rule>
    <rule break="yes">
     <beforebreak></beforebreak>
     <afterbreak>\n</afterbreak>
    </rule>
   </languagerule>
  </languagerules>
  <maprules>
   <maprule maprulename="Default">
    <languagemap languagepattern="JA.*" languagerulename="Japanese"/>
    <languagemap languagepattern=".*" languagerulename="Default"/>
   </maprule>
  </maprules>
 </body>
</srx>

C. Examples of Segmentation

This section provides some examples of how segmentation rules might be applied to fragments of text. These are simple examples and are by no means a complete reference to segmentation.

Rule set

Text to segment

Result

Notes

<rule break="yes">
 <beforebreak>[\.\?!]+</beforebreak>
 <afterbreak>\s</afterbreak>
</rule>

The U.K. Prime Minister, Mr. Blair, was seen out with his family today.

(1) The U.K.
(2)  Prime Minister, Mr.
(3)  Blair, was seen out with his family today.

The simple rule of a full stop followed by a space shows its limitations here.

<rule break="no">
 <beforebreak>U.K.</beforebreak>
 <afterbreak>\s</afterbreak>
</rule>
<rule break="yes">
 <beforebreak>[\.\?!]+</beforebreak>
 <afterbreak>\s</afterbreak>
</rule>

The U.K. Prime Minister, Mr. Blair, was seen out with his family today.

(1) The U.K. Prime Minister, Mr.
(2)  Blair, was seen out with his family today.

Partially corrected with an exception for "U.K."

<rule break="no">
 <beforebreak>U.K.</beforebreak>
 <afterbreak>\s</afterbreak>
</rule>
<rule break="no">
 <beforebreak>Mr.</beforebreak>
 <afterbreak>\s</afterbreak>
</rule>
<rule break="yes">
 <beforebreak>[\.\?!]+</beforebreak>
 <afterbreak>\s</afterbreak>
</rule>

The U.K. Prime Minister, Mr. Blair, was seen out with his family today.

(1) The U.K. Prime Minister, Mr. Blair, was seen out with his family today.

Sufficient exceptions to prevent segmentation on "U.K." and "Mr."

D. Document Type Definition for SRX

<!-- SRX
 
Public Identifier: "-//SRX//DTD SRX//EN"
 
History of modifications (latest first):
 
Apr-21-2004 by DRP: Convert to version 1.0.
Mar-22-2004 by DRP: Eighth draft version.
                    Ensure the <excludeexception> element is removed
                    Update version number
Mar-17-2004 by DRP: Seventh draft version.
                    Remove <exceptions>, <exception>, <endrules>, <endrule> and <excludeexception> elements
                    Add <rule> element
                    Update version number
Feb-02-2004 by DRP: Sixth draft version.
                    Update version number
Oct-27-2003 by DRP: Fifth draft version.
                    Removed includeformatting attribute from <header> element
                    Added <formathandle> element to the <header>
                    Removed priority attribute from <endrule> and <exception> elements
                    Added name attribute to <exception> element
                    Added <excludeexception> element to the <endrule> element
Oct-10-2003 by DRP: Fourth draft version.
                    Removed <classdefinitions> and <classdefinition> elements
                    Removed classdefinitionname attribute
                    Removed <digitcharacters>, <whitespacecharacters> and <wordcharacters>
                    Added priority attribute to <endrule> and <exception> elements
                    Added includeformatting attribute to <header> element
Jul-24-2003 by DRP: Third draft version.
                    Removed <charsets> and <charset> to be replaced with <classdefinitions> and <classdefinition>
                    Renamed <digits> to <digitcharacters>
                    Renamed <whitespace> to <whitespacecharacters>
                    Renamed <wordchars> to <wordcharacters>
                    <digitcharacters>, <whitespacecharacters> and <wordcharacters> are now optional
                    Renamed <langrules> to <languagerules>
                    Renamed <langrule> to <languagerule>
                    Renamed <langmap> to <languagemap>
                    Renamed langrulename to languagerulename
                    Renamed langpattern to languagepattern
Jun-19-2003 by DRP: Second draft version.
                    Removed the <codepage> element.
                    Added <header> and <body> elements.
Nov-22-2002 by DRP: First draft version
 
-->
 
<!ELEMENT srx                   (header, body) >
<!ATTLIST srx
    version                     CDATA       #FIXED "1.0"
>
 
<!ELEMENT header                (formathandle*) >
<!ATTLIST header
    segmentsubflows             CDATA       #REQUIRED
>
 
<!ELEMENT formathandle          EMPTY >
<!ATTLIST formathandle
    type                        CDATA       #REQUIRED
    include                     CDATA       #REQUIRED
>
 
<!ELEMENT body                  (languagerules?, maprules?) >
 
<!ELEMENT languagerules         (languagerule+) >
 
<!ELEMENT languagerule          (rule+) >
<!ATTLIST languagerule
    languagerulename            CDATA       #REQUIRED
>
 
<!ELEMENT rule                  (beforebreak?, afterbreak?) >
<!ATTLIST rule
    break                       CDATA       #IMPLIED
>
 
<!ELEMENT beforebreak           (#PCDATA) >
 
<!ELEMENT afterbreak            (#PCDATA) >
 
<!ELEMENT maprules              (maprule+) >
 
<!ELEMENT maprule               (languagemap+) >
<!ATTLIST maprule
    maprulename                 CDATA       #REQUIRED
>
 
<!ELEMENT languagemap           EMPTY >
<!ATTLIST languagemap
    languagepattern             CDATA       #REQUIRED
    languagerulename            CDATA       #REQUIRED
>

E. XML Schema for SRX

<?xml version="1.0"?>
 
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="srx">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="header" minOccurs="1" maxOccurs="1" />
        <xs:element ref="body" minOccurs="1" maxOccurs="1" />
      </xs:sequence>
      <xs:attribute name="version" type="xs:string" use="required" fixed="1.0" />
    </xs:complexType>
  </xs:element>

  <xs:element name="header">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="formathandle" minOccurs="0" maxOccurs="3" />
      </xs:sequence>
      <xs:attribute name="segmentsubflows" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="formathandle">
    <xs:complexType>
      <xs:attribute name="type" type="xs:string" use="required" />
      <xs:attribute name="include" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="body">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="languagerules" minOccurs="0" maxOccurs="1" />
        <xs:element ref="maprules" minOccurs="0" maxOccurs="1" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="languagerules">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="languagerule" minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="languagerule">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="rule" minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="languagerulename" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="rule">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="beforebreak" minOccurs="0" maxOccurs="1" />
        <xs:element ref="afterbreak" minOccurs="0" maxOccurs="1" />
      </xs:sequence>
      <xs:attribute name="break" type="xs:string" use="optional" />
    </xs:complexType>
  </xs:element>

  <xs:element name="beforebreak">
    <xs:complexType mixed="true" />
  </xs:element>

  <xs:element name="afterbreak">
    <xs:complexType mixed="true" />
  </xs:element>

  <xs:element name="maprules">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="maprule" minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="maprule">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="languagemap" minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="maprulename" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="languagemap">
    <xs:complexType>
      <xs:attribute name="languagepattern" type="xs:string" use="required" />
      <xs:attribute name="languagerulename" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>

</xs:schema>