SRX 1.0 Specification
OSCAR Recommendation, 20 April 2004
[2011-07-01 – This document is reproduced with permission of the Localization Industry Standards Association. All emendations to the original text are indicated by striking through the original text and inserting new text in red bold face in square brackets.]
Editor:
David Pooley
Copyright © The Localisation Industry Standards Association [LISA] 2004. All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to LISA.
The limited permissions granted above are perpetual and will not be revoked by LISA or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and LISA DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Abstract
This document defines the Segmentation Rules eXchange format (SRX). The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors.
Status of this document
This document is the current draft for the SRX format. Comments may be sent to [email protected].
Table of contents
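1. Introduction
1.1. XML compliance
1.2. Regular expressions
2. General structure
2.1. Language rules
2.2. Map rules
2.3. Example document
3. Detailed Specifications
3.1. Elements
3.2. Attributes
Appendices
A. References
B. Sample Document
C. Examples of Segmentation
D. Document Type Definition for SRX
E. XML Schema for SRX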
1. Introduction
SRX is intended to enhance the TMX standard so that translation memory (TM) data that is exchanged between applications can be used more effectively. Having the segmentation rules that were used when a TM was created will increase the leverage that can be achieved when deploying the TM data. SRX does not, however, address the following procedural issues, which would cause loss of leverage when TMX data is deployed in another environment:
· The use of segmentation rules that are different from those that were originally used to create the TM data
· TM data that was generated using a variety of segmentation rules
SRX is defined in two parts:
· A specification of the segmentation rules that are applicable for each language. This is represented by the <languagerules> element.
· A specification of how the segmentation rules are applied to each language. This is represented by the <maprules> element.
An SRX document is a companion to a TMX document. It allows the application receiving the TMX data to determine how the original text was segmented before being inserted into the translation memory. As such, it is assumed that any text that is being segmented with the rules defined in the SRX document is already in TMX format and contains only the standard TMX-defined formatting tags <bpt>, <ept>, <ph> and <it>.
In its current implementation, SRX is focused primarily on sentence segmentation. The reason behind this is that TM tools that currently support (or intend to support) TMX are also primarily focused on sentence segmentation. Future versions of SRX may address more complex segmentation such as phrases and terms.
1.1. XML Compliance
SRX is XML-compliant. SRX files are intended to be created automatically by export routines and processed automatically by import routines. SRX files are "well-formed" XML documents that can be processed without explicit reference to the SRX DTD. However, a "valid" SRX file must conform to the SRX DTD, and any suspicious SRX file should be verified against the SRX DTD using a validating XML parser.
Since XML syntax is case sensitive, any XML application must define casing conventions. All SRX element and attribute names are defined in lowercase.
The SRX namespace is defined as "http://www.lisa.org/srx10". The following XML sample includes the SRX namespace definition.
<?xml version="1.0"?>
<myformat>
  <data>
    <srx xmlns="http://www.lisa.org/srx10"
         version="1.0">
      ... SRX data ...
    </srx>
  </data>
</myformat>
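As a rough illustration of how a consumer might locate namespaced SRX data embedded in another format, the following Python sketch uses the standard library's xml.etree.ElementTree. The wrapper elements <myformat> and <data> come from the sample above; the variable names are illustrative and not part of SRX.

```python
import xml.etree.ElementTree as ET

SRX_NS = "http://www.lisa.org/srx10"  # namespace given in this section

# Host document modelled on the sample above; the SRX body is left empty here.
DOCUMENT = """<?xml version="1.0"?>
<myformat>
  <data>
    <srx xmlns="http://www.lisa.org/srx10" version="1.0">
    </srx>
  </data>
</myformat>"""

root = ET.fromstring(DOCUMENT)

# ElementTree addresses namespaced elements as {namespace-uri}localname.
srx = root.find(f".//{{{SRX_NS}}}srx")

print(srx.tag)
print(srx.get("version"))
```

Note that attributes such as version carry no namespace prefix, so they are read by their plain name.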
1.2. Regular Expressions
The segmentation rules themselves are represented using regular expressions. This allows for maximum flexibility in the definition of the rules. The following definitions are a subset of the current definition for the ICU regular expressions. Applications using other engines will need to adapt this format for use with their own parser.
1.2.1. Metacharacters
Character | Description
\a | Match a BELL, \u0007.
\A | Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input.
\b, outside of a [Set] | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored.
\b, within a [Set] | Match a BACKSPACE, \u0008.
\B | Match if the current position is not a word boundary.
\cX | Match a control-X character.
\d | Match any character with the Unicode General Category of Nd (Number, Decimal Digit).
\D | Match any character that is not a decimal digit.
\e | Match an ESCAPE, \u001B.
\E | Terminates a \Q ... \E quoted sequence.
\f | Match a FORM FEED, \u000C.
\G | Match if the current position is at the end of the previous match.
\n | Match a LINE FEED, \u000A.
\N{UNICODE CHARACTER NAME} | Match the named character.
\p{UNICODE PROPERTY NAME} | Match any character with the specified Unicode Property.
\P{UNICODE PROPERTY NAME} | Match any character not having the specified Unicode Property.
\Q | Quotes all following characters until \E.
\r | Match a CARRIAGE RETURN, \u000D.
\s | Match a white space character. White space is defined as [\t\n\f\r\p{Z}].
\S | Match a non-white space character.
\t | Match a HORIZONTAL TABULATION, \u0009.
\uhhhh | Match the character with the hex value hhhh.
\Uhhhhhhhh | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\w | Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
\W | Match a non-word character.
\x{hhhh} | Match the character with hex value hhhh.
\xhh | Match the character with two digit hex value hh.
\X | Match a Grapheme Cluster.
\Z | Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z | Match if the current position is at the end of input.
\0nnn | Match the character with octal value nnn.
\n (n is a number) | Back Reference. Match whatever the nth capturing group matched. n must be >= 1 and <= the total number of capture groups in the pattern.
[pattern] | Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern.
. | Match any character.
^ | Match at the beginning of a line.
$ | Match at the end of a line.
\ | Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /
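Because the metacharacters above follow ICU conventions, a tool built on another engine may need to translate some of them. As one hedged example, Python's built-in re module does not accept the \x{hhhh} form, so the sketch below rewrites it to \uhhhh or \Uhhhhhhhh. The function name icu_hex_to_python is hypothetical, and the sketch deliberately ignores other differences (for instance \p{...}, which re also rejects) and does not guard against an escaped backslash directly before x{.

```python
import re

def icu_hex_to_python(pattern):
    """Rewrite ICU-style \\x{h...} escapes as \\uhhhh or \\Uhhhhhhhh."""
    def repl(match):
        code_point = int(match.group(1), 16)
        if code_point <= 0xFFFF:
            return "\\u%04x" % code_point
        return "\\U%08x" % code_point
    # One to six hex digits inside the braces cover every Unicode code point.
    return re.sub(r"\\x\{([0-9A-Fa-f]{1,6})\}", repl, pattern)

converted = icu_hex_to_python(r"[\x{ff61}\x{3002}]+")
print(converted)
# The converted pattern is usable with Python's engine.
print(bool(re.fullmatch(converted, "\u3002\uff61")))
```

A production converter would need a proper tokenizer for the pattern rather than a single substitution.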
1.2.2. Operators
The following operators can be used.
Operator | Description
| | Alternation. A|B matches either A or B.
* | Match 0 or more times. Match as many times as possible.
+ | Match 1 or more times. Match as many times as possible.
? | Match zero or one times. Prefer one.
{n} | Match exactly n times
{n,} | Match at least n times. Match as many times as possible.
{n,m} | Match between n and m times. Match as many times as possible, but not more than m.
*? | Match 0 or more times. Match as few times as possible.
+? | Match 1 or more times. Match as few times as possible.
?? | Match zero or one times. Prefer zero.
{n}? | Match exactly n times
{n,}? | Match at least n times, but no more than required for an overall pattern match
{n,m}? | Match between n and m times. Match as few times as possible, but not less than n.
*+ | Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match)
++ | Match 1 or more times. Possessive match.
?+ | Match zero or one times. Possessive match.
{n}+ | Match exactly n times
{n,}+ | Match at least n times. Possessive Match.
{n,m}+ | Match between n and m times. Possessive Match.
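The difference between the "as many times as possible" and "as few times as possible" forms can be seen in a short Python sketch. It uses Python's re module rather than ICU, which behaves the same way for these two forms; the possessive forms are left out because Python accepts them only from version 3.11 onwards. The sample text is invented for illustration.

```python
import re

text = "He said (one) and (two)."

# Greedy: .* runs to the last closing parenthesis it can reach.
greedy = re.search(r"\(.*\)", text).group()

# Reluctant: .*? stops at the first closing parenthesis.
reluctant = re.search(r"\(.*?\)", text).group()

print(greedy)     # (one) and (two)
print(reluctant)  # (one)
```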
2. General structure
An SRX document is enclosed in an <srx> root element. The <srx> element contains two elements: <header> and <body>. The <header> element contains zero or more <formathandle> elements. The <body> element contains two elements: <languagerules> and <maprules>.
2.1. Language rules
The <languagerules> element contains information about the segmentation rules for each particular language. It is a collection of <languagerule> elements. Each one of these contains a collection of <rule> elements.
Each <rule> element contains zero or one <beforebreak> element and zero or one <afterbreak> element which provide details of the regular expressions for the rules themselves. The break attribute indicates whether this is a segment break or an exception rule. The rules are applied in the order that they are specified within the <languagerule> element.
Note that this approach is an adaptation of the method described in Unicode Technical Report 29 which covers text boundaries. Readers are encouraged to study this report with particular attention being given to the "sentence boundaries" section.
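The following Python sketch shows one possible reading of how ordered rules decide a single candidate position. It assumes that the first rule whose <beforebreak> pattern matches text ending at the position, and whose <afterbreak> pattern matches text starting at it, determines the outcome; this specification states only that rules are applied in the order specified, so the first-match interpretation is an assumption. The function name and the tuple layout are hypothetical, and Python's re module stands in for ICU.

```python
import re

def first_matching_rule(text, pos, rules):
    """Return the first rule that matches around position pos, else None.

    rules is an ordered list of (is_break, beforebreak, afterbreak) tuples.
    """
    for rule in rules:
        is_break, before, after = rule
        # beforebreak must match text that ends exactly at pos ...
        before_ok = re.search("(?:" + before + r")\Z", text[:pos]) is not None
        # ... and afterbreak must match text that starts exactly at pos.
        after_ok = re.match(after, text[pos:]) is not None
        if before_ok and after_ok:
            return rule
    return None

rules = [
    (False, r"\sMr\.", r"\s"),    # exception rule (break="no")
    (True, r"[\.\?!]+", r"\s"),   # break rule (break="yes")
]
text = "Call Mr. Smith. Then leave."
pos_mr = text.index("Mr.") + len("Mr.")
pos_smith = text.index("Smith.") + len("Smith.")

print(first_matching_rule(text, pos_mr, rules)[0])     # False: exception wins
print(first_matching_rule(text, pos_smith, rules)[0])  # True: segment break
```

Listing the exception before the general break rule is what keeps "Mr." from ending a segment under this reading.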
2.2. Map rules
The <maprules> element contains information as to how each language should be segmented. It is a collection of <maprule> elements. Each one of these contains a collection of <languagemap> elements that describe which rules should be used for each language.
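A minimal Python sketch of the lookup is given below. It assumes that the <languagemap> elements are tried in document order, that the first languagepattern matching the whole language code wins, and that matching is case-insensitive (RFC 3066 language tags are not case sensitive). None of these choices is spelled out here, so treat them as assumptions; the function name is hypothetical and the pattern pairs are taken from the sample document in Appendix B.

```python
import re

# (languagepattern, languagerulename) pairs in document order.
LANGUAGE_MAPS = [("JA.*", "Japanese"), (".*", "Default")]

def select_language_rule(language_code, language_maps):
    """Return the languagerulename of the first pattern matching the code."""
    for pattern, rule_name in language_maps:
        if re.fullmatch(pattern, language_code, re.IGNORECASE):
            return rule_name
    return None

print(select_language_rule("ja-JP", LANGUAGE_MAPS))  # Japanese
print(select_language_rule("en-GB", LANGUAGE_MAPS))  # Default
```

With this ordering the catch-all pattern ".*" has to come last, otherwise it would shadow the more specific entries.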
2.3. Example document
See Appendix B, Sample Document, for an example of an SRX document.
3. Detailed Specifications
3.1. Elements
This section lists the various elements used in the SRX document.
<afterbreak>, <beforebreak>, <body>, <formathandle>, <header>, <languagemap>, <languagerule>, <languagerules>, <maprule>, <maprules>, <rule>, <srx>.
After break - The <afterbreak> element encloses a regular expression.
Required attributes:
None.
Optional attributes:
None.
Contents:
A regular expression which represents the text that appears after a segment break.
Before break - The <beforebreak> element encloses a regular expression.
Required attributes:
None.
Optional attributes:
None.
Contents:
A regular expression which represents the text that appears before a segment break.
Body - The <body> element encloses the language rules and language maps that are contained within the file.
Required attributes:
None.
Optional attributes:
None.
Contents:
Zero or one <languagerules> element and zero or one <maprules> element.
Format handling - The <formathandle> element determines how formatting that falls on a segment boundary should be handled. The type attribute determines the type of formatting and the include attribute indicates how this formatting should be handled. As these elements are optional in the <header> element, the following defaults will apply:
<formathandle type="start" include="no"/>
<formathandle type="end" include="yes"/>
<formathandle type="isolated" include="no"/>
Required attributes:
type, include
Optional attributes:
None
Contents:
None
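One way a reader of SRX files might apply the defaults listed above is sketched here in Python. The sketch assumes a non-namespaced <header> element and simply overlays any declared <formathandle> elements on the three default values; the names DEFAULT_FORMAT_HANDLING and format_handling are hypothetical.

```python
import xml.etree.ElementTree as ET

# Defaults stated in this section, used when a <formathandle> is absent.
DEFAULT_FORMAT_HANDLING = {"start": "no", "end": "yes", "isolated": "no"}

def format_handling(header):
    """Overlay the <formathandle> children of a <header> on the defaults."""
    handling = dict(DEFAULT_FORMAT_HANDLING)
    for handle in header.findall("formathandle"):
        handling[handle.get("type")] = handle.get("include")
    return handling

header = ET.fromstring(
    '<header segmentsubflows="yes">'
    '<formathandle type="isolated" include="yes"/>'
    "</header>"
)
print(format_handling(header))
```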
Header - The <header> element contains information that is relevant to the whole document.
Required attributes:
segmentsubflows
Optional attributes:
None.
Contents:
Zero, one, two or three <formathandle> elements.
Language map - The <languagemap> element maps one or more languages to a language rule.
Required attributes:
languagepattern, languagerulename
Optional attributes:
None.
Contents:
None.
Language rule - The <languagerule> element encloses one instance of language rule data, a set of <rule> elements.
Required attributes:
languagerulename
Optional attributes:
None.
Contents:
One or more <rule> elements.
Language rules - The <languagerules> element encloses the language rules data, the set of <languagerule> elements.
Required attributes:
None.
Optional attributes:
None.
Contents:
One or more <languagerule> elements.
Map rule - The <maprule> element encloses one instance of map rule data, a set of <languagemap> elements.
Required attributes:
maprulename
Optional attributes:
None.
Contents:
One or more <languagemap> elements.
Map rules - The <maprules> element encloses the map rules data, the set of <maprule> elements. The order of the <maprule> elements determines the logical order in which these rules should be applied.
Required attributes:
None.
Optional attributes:
None.
Contents:
One or more <maprule> elements.
Break or exception rule - The <rule> element defines a segmentation rule for a language using the <beforebreak> and <afterbreak> elements. The break attribute determines whether this is a rule that determines a break or an exception. If the break attribute is missing, it is assumed to be a break rule.
Required attributes:
None.
Optional attributes:
break
Contents:
Zero or one <beforebreak> element and zero or one <afterbreak> element.
Root element - The <srx> element is the root element of the document. It encloses the header and body information for the file.
Required attributes:
version
Optional attributes:
None.
Contents:
One <header> element and one <body> element.
3.2. Attributes
This section lists the various attributes used in the SRX elements.
break, include, languagepattern, languagerulename, maprulename, segmentsubflows, type, version.
Break indicator - Specifies whether a rule is a break or an exception.
Value description:
A value of "no" indicates that the rule is an exception rule. A value of "yes" indicates that the rule is a break rule.
Default value:
"yes"
Used in:
<rule>.
Formatting code behaviour - The include attribute indicates whether formatting is included in the segment being created.
Value description:
A value of "no" indicates that the format code does not belong to the segment being created. A value of "yes" indicates that the format code belongs to the segment being created.
Default value:
"no"
Used in:
<formathandle>.
Language pattern - Identifies a language pattern.
Value description:
Specifies a regular expression for the language codes that map to the given language rule. Language codes are defined as in [RFC 3066].
Default value:
Undefined.
Used in:
<languagemap>.
Language rule name - Specifies a unique name for a language rule.
Value description:
Used to link a language rule between the <languagerule> and <languagemap> elements.
Default value:
Undefined.
Used in:
<languagerule>, <languagemap>.
Map rule name - Specifies a unique name for a mapping rule.
Value description:
Used to uniquely identify a mapping rule.
Default value:
Undefined.
Used in:
<maprule>.
Subflow segmentation behaviour - The segmentsubflows attribute indicates how subflows should be segmented.
Value description:
A value of "no" indicates that subflows within a segment should not be segmented. A value of "yes" indicates that subflows should be segmented according to the rules. A subflow is defined as being a piece of text that appears within another segment but which should be handled separately. For example, in the following HTML snippet:
<p>Click <img src="..\button.gif" alt="Toolbar button. Click to preview."/> to preview the document.</p>
The text "Toolbar button. Click to preview." is a subflow. The segmentsubflows attribute determines whether this text should be segmented according to the rules.
Default value:
"yes"
Used in:
<header>.
Formatting code type - The type attribute indicates the type of formatting for which the <formathandle> is being applied.
Value description:
This attribute can have one of three values. These are:
· "start" to indicate the start of a pair of formatting codes
· "end" to indicate the end of a pair of formatting codes
· "isolated" to indicate a format that has no partner
Default value:
Undefined
Used in:
<formathandle>.
SRX version - The version attribute indicates the version of the SRX format to which the document conforms.
Value description:
Fixed text: the major version number, a period, and the minor version number. For example: version="1.0".
Default value:
"1.0"
Used in:
<srx>.
A. References
[Unicode Character Database 4.0.0]
Unicode Character Database 4.0.0. Unicode Organisation, Apr 2003.
[ISO 639]
Codes for the Representation of Names of Languages. ISO (International Organization for Standardization), Nov 2001.
[ISO 3166]
Codes for the representation of names of countries and their subdivisions. ISO (International Organization for Standardization), Jun 2000.
[RFC 3066]
RFC 3066 Tags for the Identification of Languages. IETF (Internet Engineering Task Force), Jan 2001.
[XML 1.0]
Extensible Markup Language (XML) 1.0 Second Edition. W3C (World Wide Web Consortium), Oct 2000.
[ICU Regular Expressions]
ICU Regular Expressions User Guide. IBM, 2003.
[UAX #29]
UAX #29: Text Boundaries. Unicode Consortium, 2003.
B. Sample Document
In this example of an SRX document indentations are added for ease of reading, and the different types of notation are mixed to illustrate the various possibilities.
<?xml version="1.0"?>
<!DOCTYPE srx PUBLIC "-//SRX//DTD SRX//EN" "srx.dtd">
<srx version="1.0">
  <header segmentsubflows="yes">
    <formathandle type="start" include="no"/>
    <formathandle type="end" include="yes"/>
    <formathandle type="isolated" include="yes"/>
  </header>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <rule break="no">
          <beforebreak>^\s*[0-9]+\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="no">
          <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
          <afterbreak>\s[a-z]</afterbreak>
        </rule>
        <rule break="no">
          <beforebreak>\sMr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[\.\?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak></beforebreak>
          <afterbreak>\n</afterbreak>
        </rule>
      </languagerule>
      <languagerule languagerulename="Japanese">
        <rule break="no">
          <beforebreak>^\s*[0-9]+\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="no">
          <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
          <afterbreak></afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[\.\?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[\x{ff61}\x{3002}\x{ff0e}\x{ff1f}\x{ff01}]+</beforebreak>
          <afterbreak></afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak></beforebreak>
          <afterbreak>\n</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <maprule maprulename="Default">
        <languagemap languagepattern="JA.*" languagerulename="Japanese"/>
        <languagemap languagepattern=".*" languagerulename="Default"/>
      </maprule>
    </maprules>
  </body>
</srx>
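To show how an import routine might read such a document, here is a minimal Python sketch built on xml.etree.ElementTree. It loads each <languagerule> into an ordered list of tuples. The excerpt is adapted from the sample above (the second rule omits the break attribute on purpose, to exercise the default of "yes"); the function name and tuple layout are hypothetical, and namespaces and DTD validation are not handled.

```python
import xml.etree.ElementTree as ET

def load_language_rules(srx_root):
    """Map each languagerulename to an ordered list of (is_break, before, after)."""
    rules_by_name = {}
    for language_rule in srx_root.iter("languagerule"):
        rules = []
        for rule in language_rule.findall("rule"):
            # A missing break attribute means a break rule (default "yes").
            is_break = rule.get("break", "yes") == "yes"
            # findtext gives None for an absent element and "" for an empty
            # one; both are treated as an empty pattern here.
            before = rule.findtext("beforebreak") or ""
            after = rule.findtext("afterbreak") or ""
            rules.append((is_break, before, after))
        rules_by_name[language_rule.get("languagerulename")] = rules
    return rules_by_name

SAMPLE = r"""<srx version="1.0">
  <header segmentsubflows="yes"/>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <rule break="no">
          <beforebreak>\sMr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule>
          <beforebreak>[\.\?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
  </body>
</srx>"""

loaded = load_language_rules(ET.fromstring(SAMPLE))
print(loaded["Default"])
```

Keeping the rules in a list preserves document order, which matters because the rules are applied in the order in which they are specified.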
C. Examples of Segmentation
This section provides some examples of how segmentation rules might be applied to fragments of text. These are simple examples and are by no means a complete reference to segmentation.
Rule set | Text to segment | Result | Notes
<rule break="yes"> | The U.K. Prime Minister, Mr. Blair, was seen out with his family today. | (1) The U.K. | The simple rule of a full stop followed by a space shows its limitations here.
<rule break="no"> | The U.K. Prime Minister, Mr. Blair, was seen out with his family today. | (1) The U.K. Prime Minister, Mr. | Partially corrected with an exception for "U.K."
<rule break="no"> | The U.K. Prime Minister, Mr. Blair, was seen out with his family today. | (1) The U.K. Prime Minister, Mr. Blair, was seen out with his family today | Sufficient exceptions to prevent segmentation on "U.K." and "Mr."
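The three cases can be replayed with a small Python sketch. The rule patterns below are assumptions reconstructed from the Notes column (a full stop followed by white space, plus exceptions for "U.K." and "Mr."), not a quotation of the rule sets themselves. The sketch also assumes that the first matching rule at a position decides the outcome, and it uses Python's re module in place of ICU. It rescans the text prefix at every position, which is fine for a demonstration but not for large inputs.

```python
import re

def segment(text, rules):
    """Split text wherever the first matching rule is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            before_ok = re.search("(?:" + before + r")\Z", text[:pos])
            after_ok = re.match(after, text[pos:])
            if before_ok and after_ok:
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides this position
    segments.append(text[start:])
    return segments

TEXT = "The U.K. Prime Minister, Mr. Blair, was seen out with his family today."

full_stop = (True, r"\.", r"\s")      # break rule
uk = (False, r"U\.K\.", r"\s")        # exception rule
mr = (False, r"Mr\.", r"\s")          # exception rule

for rules in ([full_stop], [uk, full_stop], [uk, mr, full_stop]):
    # White space after a break starts the next segment; strip it for display.
    print([s.strip() for s in segment(TEXT, rules)])
```

The first rule set yields three segments, the second two, and the third leaves the sentence whole.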
D. Document Type Definition for SRX
<!-- SRX
Public Identifier: "-//SRX//DTD SRX//EN"
History of modifications (latest first):
Apr-21-2004 by DRP: Convert to version 1.0.
Mar-22-2004 by DRP: Eighth draft version.
Ensure the <excludeexception> element is removed
Update version number
Mar-17-2004 by DRP: Seventh draft version.
Remove <exceptions>, <exception>, <endrules>, <endrule> and <excludeexception> elements
Add <rule> element
Update version number
Feb-02-2004 by DRP: Sixth draft version.
Update version number
Oct-27-2003 by DRP: Fifth draft version.
Removed includeformatting attribute from <header> element
Added <formathandle> element to the <header>
Removed priority attribute from <endrule> and <exception> elements
Added name attribute to <exception> element
Added <excludeexception> element to the <endrule> element
Oct-10-2003 by DRP: Fourth draft version.
Removed <classdefinitions> and <classdefinition> elements
Removed classdefinitionname attribute
Removed <digitcharacters>, <whitespacecharacters> and <wordcharacters>
Added priority attribute to <endrule> and <exception> elements
Added includeformatting attribute to <header> element
Jul-24-2003 by DRP: Third draft version.
Removed <charsets> and <charset> to be replaced with <classdefinitions> and <classdefinition>
Renamed <digits> to <digitcharacters>
Renamed <whitespace> to <whitespacecharacters>
Renamed <wordchars> to <wordcharacters>
<digitcharacters>, <whitespacecharacters> and <wordcharacters> are now optional
Renamed <langrules> to <languagerules>
Renamed <langrule> to <languagerule>
Renamed <langmap> to <languagemap>
Renamed langrulename to languagerulename
Renamed langpattern to languagepattern
Jun-19-2003 by DRP: Second draft version.
Removed the <codepage> element.
Added <header> and <body> elements.
Nov-22-2002 by DRP: First draft version
-->
<!ELEMENT srx (header, body) >
<!ATTLIST srx
version CDATA #FIXED "1.0"
>
<!ELEMENT header (formathandle*) >
<!ATTLIST header
segmentsubflows CDATA #REQUIRED
>
<!ELEMENT formathandle EMPTY >
<!ATTLIST formathandle
type CDATA #REQUIRED
include CDATA #REQUIRED
>
<!ELEMENT body (languagerules?, maprules?) >
<!ELEMENT languagerules (languagerule+) >
<!ELEMENT languagerule (rule+) >
<!ATTLIST languagerule
languagerulename CDATA #REQUIRED
>
<!ELEMENT rule (beforebreak?, afterbreak?) >
<!ATTLIST rule
break CDATA #IMPLIED
>
<!ELEMENT beforebreak (#PCDATA) >
<!ELEMENT afterbreak (#PCDATA) >
<!ELEMENT maprules (maprule+) >
<!ELEMENT maprule (languagemap+) >
<!ATTLIST maprule
maprulename CDATA #REQUIRED
>
<!ELEMENT languagemap EMPTY >
<!ATTLIST languagemap
languagepattern CDATA #REQUIRED
languagerulename CDATA #REQUIRED
>
E. XML Schema for SRX
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="srx">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="header" minOccurs="1" maxOccurs="1" />
        <xs:element ref="body" minOccurs="1" maxOccurs="1" />
      </xs:sequence>
      <xs:attribute name="version" type="xs:string" use="required" fixed="1.0" />
    </xs:complexType>
  </xs:element>
  <xs:element name="header">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="formathandle" minOccurs="0" maxOccurs="3" />
      </xs:sequence>
      <xs:attribute name="segmentsubflows" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
  <xs:element name="formathandle">
    <xs:complexType>
      <xs:attribute name="type" type="xs:string" use="required" />
      <xs:attribute name="include" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
  <xs:element name="body">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="languagerules" minOccurs="0" maxOccurs="1" />
        <xs:element ref="maprules" minOccurs="0" maxOccurs="1" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagerules">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="languagerule" minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagerule">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="rule" minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="languagerulename" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
  <xs:element name="rule">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="beforebreak" minOccurs="0" maxOccurs="1" />
        <xs:element ref="afterbreak" minOccurs="0" maxOccurs="1" />
      </xs:sequence>
      <xs:attribute name="break" type="xs:string" use="optional" />
    </xs:complexType>
  </xs:element>
  <xs:element name="beforebreak">
    <xs:complexType mixed="true" />
  </xs:element>
  <xs:element name="afterbreak">
    <xs:complexType mixed="true" />
  </xs:element>
  <xs:element name="maprules">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="maprule" minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="maprule">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="languagemap" minOccurs="1" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="maprulename" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
  <xs:element name="languagemap">
    <xs:complexType>
      <xs:attribute name="languagepattern" type="xs:string" use="required" />
      <xs:attribute name="languagerulename" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>