Segmentation is the technique of breaking down a paragraph of text into smaller chunks, usually sentences. The segment is the unit used in translation memory (TM) systems.

Tools segment text in different ways, and they often let the user change the segmentation rules. This causes problems when porting a TM from one system to another, or even between two versions of the same application.
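To illustrate how two rule sets can disagree on the same text, here is a minimal sketch in Python. The two "tools" and their break patterns are hypothetical and do not reflect any real product; the only assumption is that one of them also treats a colon as the end of a segment.

```python
import re

# Hypothetical break rules for two imaginary tools (assumptions for illustration).
TOOL_A_BREAK = re.compile(r"(?<=[.!?])\s+")   # break after '.', '!' or '?'
TOOL_B_BREAK = re.compile(r"(?<=[.!?:])\s+")  # same, but ':' also ends a segment

def segment(text, break_pattern):
    """Split text wherever break_pattern matches, dropping empty pieces."""
    return [piece for piece in break_pattern.split(text) if piece]

text = "Note: save your work. Then close the file."
segments_a = segment(text, TOOL_A_BREAK)
segments_b = segment(text, TOOL_B_BREAK)

print(segments_a)
print(segments_b)
```

Tool A produces two segments and tool B produces three, so a TM built with one rule set would not yield exact matches for the first sentence when reused under the other.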

In addition to differences in end-of-sentence detection, a common segmentation issue is how sub-segments are treated. A sub-segment is a segment inside a segment. For example:

<p>Click <img src="button.gif" alt="Reply Button"/> to reply to this email.</p>

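In this XHTML paragraph, the text of the alt attribute constitutes a segment on its own. Depending on the tool you are using, you will get either two separate segments (first result below) or a single segment with the sub-segment embedded in it (second result below):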

Segment 1: Click [code] to reply to this email.
Segment 2: Reply Button


Segment 1: Click [code]Reply Button[code] to reply to this email.

Such differences in how sub-segments are handled may generate many variations in a TM and, when the TM is ported to another system, cause the ratio of recyclable segments to be much lower than expected.
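The two strategies can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the [code] placeholder simply follows the notation used above, and real tools represent inline codes in their own ways.

```python
import xml.etree.ElementTree as ET

PARAGRAPH = '<p>Click <img src="button.gif" alt="Reply Button"/> to reply to this email.</p>'

def extract_separately(paragraph):
    """Strategy 1: the alt text becomes its own segment; the tag is a placeholder."""
    root = ET.fromstring(paragraph)
    main = root.text or ""
    sub_segments = []
    for child in root:
        main += "[code]"
        if child.get("alt"):
            sub_segments.append(child.get("alt"))
        main += child.tail or ""
    return [main] + sub_segments

def extract_inline(paragraph):
    """Strategy 2: the alt text stays inside the parent segment, between placeholders."""
    root = ET.fromstring(paragraph)
    main = root.text or ""
    for child in root:
        alt = child.get("alt")
        main += f"[code]{alt}[code]" if alt else "[code]"
        main += child.tail or ""
    return [main]

# A TM filled by a tool using strategy 1, then queried by a tool using strategy 2.
tm = set(extract_separately(PARAGRAPH))
reused = [seg for seg in extract_inline(PARAGRAPH) if seg in tm]
print(extract_separately(PARAGRAPH))
print(extract_inline(PARAGRAPH))
print(reused)
```

The exact-match lookup finds nothing to reuse, even though both tools processed the same paragraph. In practice a fuzzy match might still be offered, but the segment is no longer a full match.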

OSCAR, the LISA standards committee, has developed a standard format, SRX (Segmentation Rules eXchange), to exchange segmentation rules between tools.
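SRX itself is an XML format, but the idea behind its rules can be sketched in Python: each rule says whether to break or not at a position, based on one pattern for the text before that position and one for the text after it, and the first rule that matches decides. The sketch below is a simplified model under those assumptions, not a conforming SRX implementation; the rule list and the function name apply_rules are hypothetical.

```python
import re

# Each rule: (is_break, before_pattern, after_pattern), checked in order.
# Loosely modelled on the break / beforebreak / afterbreak parts of an SRX rule.
RULES = [
    (False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # exception: no break after these abbreviations
    (True,  r"[.!?]",             r"\s"),  # break after sentence-ending punctuation
]

def apply_rules(text, rules):
    """Segment text; at each position the first matching rule decides."""
    compiled = [
        (is_break, re.compile(r"(?:%s)\Z" % before), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins, later rules are ignored
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]

text = "Dr. Smith replied. Click Send."
with_exception = apply_rules(text, RULES)
without_exception = apply_rules(text, RULES[1:])
print(with_exception)
print(without_exception)
```

Because the rules are plain data, two tools that load the same rule list segment the text identically, which is the kind of interoperability an exchange format for segmentation rules aims at.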