Segmentation is the technique of breaking down a paragraph of text into smaller chunks, usually sentences. The segment is the unit used in translation memory systems.
Tools segment text in different ways, and they often let the user change the segmentation rules. This causes problems when porting a translation memory (TM) from one system to another, or even between two versions of the same application.
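To see how a single rule change alters the segments, here is a minimal sketch of a rule-based splitter. It is not the algorithm of any particular tool: the `segment` function and its `abbreviations` parameter are hypothetical names, standing in for a user-editable exception list.

```python
import re

def segment(text, abbreviations=()):
    """Split text after '.', '!' or '?' when followed by whitespace.

    A break is suppressed when the word ending at the punctuation is
    listed in `abbreviations` (a hypothetical user-editable rule).
    """
    segments = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1]
        if last_word in abbreviations:
            continue  # not treated as a sentence end; keep scanning
        segments.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

text = "Open the file, e.g. report.txt. Then click Save."
default_rules = segment(text)
custom_rules = segment(text, abbreviations=("e.g.",))
print(default_rules)  # three segments: the break after "e.g." is kept
print(custom_rules)   # two segments: the break after "e.g." is suppressed
```

With the default rules the first sentence is cut after "e.g.", so the same paragraph produces different segments depending on whether the exception is configured.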
Beyond differences in end-of-sentence detection, a common segmentation issue is how sub-segments are treated. A sub-segment is a segment nested inside another segment. For example:
<p>Click <img src="button.gif" alt="Reply Button"/> to reply to this email.</p>
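In this XHTML paragraph, the text of the alt attribute is a sub-segment. Some tools extract it as a separate segment:
Segment 1: Click [code] to reply to this email.
Segment 2: Reply Button
Other tools keep it inline within the main segment:
Segment 1: Click [code]Reply Button[code] to reply to this email.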
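The two treatments of the alt text can be sketched with Python's standard `xml.etree.ElementTree` parser. This is only an illustration under simple assumptions (one level of inline elements, `[code]` as the placeholder); the function names `extract_separately` and `extract_inline` are hypothetical and do not correspond to any specific tool.

```python
import xml.etree.ElementTree as ET

PARAGRAPH = (
    '<p>Click <img src="button.gif" alt="Reply Button"/>'
    " to reply to this email.</p>"
)

def extract_separately(xml_text):
    """Replace each inline element with [code]; alt text becomes its own segment."""
    root = ET.fromstring(xml_text)
    main = root.text or ""
    sub_segments = []
    for child in root:
        main += "[code]" + (child.tail or "")
        alt = child.get("alt")
        if alt:
            sub_segments.append(alt)
    return [main] + sub_segments

def extract_inline(xml_text):
    """Keep the alt text inside the main segment, wrapped in [code] placeholders."""
    root = ET.fromstring(xml_text)
    main = root.text or ""
    for child in root:
        alt = child.get("alt")
        main += "[code]" + alt + "[code]" if alt else "[code]"
        main += child.tail or ""
    return [main]

for seg in extract_separately(PARAGRAPH):
    print(seg)
for seg in extract_inline(PARAGRAPH):
    print(seg)
```

The first function yields two segments and the second yields one, so the same paragraph is stored under different source strings depending on the strategy.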
Such differences in how sub-segments are handled can generate many variations in a TM and, when the TM is ported to another system, cause the ratio of recyclable segments to be much lower than expected.
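The effect on reuse can be illustrated with a small exact-match count. This sketch assumes recycling means an exact string match on the source segment, which is a simplification (real systems also use fuzzy matching); `recyclable_ratio` is a hypothetical helper, and the third sample sentence is made up for the illustration.

```python
def recyclable_ratio(tm_sources, new_segments):
    """Return the share of new segments that have an exact match in the TM."""
    if not new_segments:
        return 0.0
    known = set(tm_sources)
    hits = sum(1 for seg in new_segments if seg in known)
    return hits / len(new_segments)

# TM built by a tool that extracts the alt text as a separate segment.
tm_sources = [
    "Click [code] to reply to this email.",
    "Reply Button",
    "Your message has been sent.",
]

# The same content, segmented by a tool that keeps the alt text inline.
other_tool_segments = [
    "Click [code]Reply Button[code] to reply to this email.",
    "Your message has been sent.",
]

same_tool_ratio = recyclable_ratio(tm_sources, tm_sources)
other_tool_ratio = recyclable_ratio(tm_sources, other_tool_segments)
print(same_tool_ratio)   # 1.0
print(other_tool_ratio)  # 0.5
```

Although the underlying text is identical, the second tool finds an exact match for only half of its segments.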