Okapi Framework - Developer's Guide

Segmentation

- Overview
- Performing Segmentation
- Working with Segmented Content

Overview

Segmentation, in the context of the Okapi Framework, is the action of breaking a given piece of content into parts. For example, taking the content of an extracted HTML <p> element and breaking it down into sentences.

Segmentation is of great importance in localization tasks. It allows you to define the granularity of the parts of text that are being translated, matched against translation memories, processed by machine translation, etc. Having different segmentation methods is often one of the causes of losing reusable data when going from one tool to another.

In the framework, the basic unit of extraction is the text unit, which corresponds to different things depending on the original file format. Roughly, it is an un-segmented chunk of text that may be composed of several sentences. Many of the tasks performed on text units require manipulating the unit at a finer level: the segment. This page discusses segmentation in that context.

Performing Segmentation

The framework provides one interface to apply segmentation to text content: the ISegmenter interface.

Each implementation of ISegmenter may work differently. In this example we will use the default implementation of ISegmenter that is provided with the framework: SRXSegmenter. As its name indicates, it is based on the SRX standard.

To instantiate this segmenter you must first create an SRXDocument object and load or set the SRX rules to use.

SRXDocument doc = new SRXDocument();
doc.load("myRules.srx");

Then you can obtain a segmenter for a given language.

ISegmenter segmenter = doc.compileLanguageRules(LocaleId.fromString("en"), null);

The second parameter of compileLanguageRules() is an optional segmenter object, in case you already have one and want to avoid the cost of re-creating one. You can just pass null to create a brand new one.

Once you have a segmenter with its rules set, you can use it to create segments on a given content. You can calculate the segments for a given plain text string or for a TextContainer.

Question: I've looked at the SRX specification and it seems quite complicated to write rules. Is there an easy way to create and edit SRX documents?

Answer: Sure. You can use Okapi's own SRX editor: Ratel (named after the tough honey-badger that roams the plains of Africa). You can download it from here. Just start Ratel and drag and drop your SRX document on it. The rules are applied on-the-fly to any sample text you enter.

With Plain Text

Here is an example of getting the segmentation for a plain text string:

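// Compute the segments; the return value is the number of segments found
int count = segmenter.computeSegments("Part 1. Part 2.");
System.out.println("count=" + count);
// Each Range holds the start (inclusive) and end (exclusive) of one segment
for ( Range range : segmenter.getRanges() ) {
   System.out.println(String.format("start=%d, end=%d",
      range.start, range.end));
}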

The ISegmenter.computeSegments() method returns the number of segments it finds. It also creates an internal list of the ranges of these segments. You can get that list with the ISegmenter.getRanges() method. Each entry of the list is a Range object that contains the start and end values corresponding to the boundaries of the segment in the given text. The text of the segment goes from the character at the start position to the character just before the end position (just like the arguments of String.substring() in Java).

For example the code above will display this:

count=2
start=0, end=7
start=7, end=15

The first segment starts at 0 and ends at 7, so it corresponds to "Part 1.". The second segment starts at 7 and ends at 15, so it corresponds to " Part 2.".

Part 1. Part 2.
0000000000111111
0123456789012345
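
The start/end convention can be checked without the framework. The following is a minimal sketch, not Okapi code: the nested Range class is a hypothetical stand-in that only mimics the start and end fields used above, and RangeDemo and slice() are names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeDemo {

    // Hypothetical stand-in for the framework's Range class:
    // start is inclusive, end is exclusive.
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Cuts the text into pieces with String.substring(start, end).
    static List<String> slice(String text, List<Range> ranges) {
        List<String> segments = new ArrayList<>();
        for (Range range : ranges) {
            segments.add(text.substring(range.start, range.end));
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "Part 1. Part 2.";
        // The two ranges reported by the example above.
        List<Range> ranges = Arrays.asList(new Range(0, 7), new Range(7, 15));
        for (String segment : slice(text, ranges)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

Running it prints [Part 1.] and [ Part 2.], which shows that the space after the first period belongs to the second segment.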

While the segmenter is designed to work with coded text as we will see below, you can also use it on any kind of normal text as long as you have rules that correspond to your text format.
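
To give an idea of what a breaking rule does, here is a rough sketch that uses only java.util.regex. It is not how SRXSegmenter is implemented: BreakRuleSketch and computeRanges() are hypothetical names, the sketch handles a single breaking rule, and it ignores the non-breaking (exception) rules that SRX also supports. A break is placed wherever the "before break" pattern matches just before a position and the "after break" pattern matches at that position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BreakRuleSketch {

    // Returns {start, end} pairs (end exclusive) for one breaking rule.
    // Assumption: beforeBreak has a bounded length, because it is used
    // inside a Java lookbehind.
    static List<int[]> computeRanges(String text, String beforeBreak, String afterBreak) {
        // Zero-width match: lookbehind for the text before the break,
        // lookahead for the text after the break.
        Pattern rule = Pattern.compile("(?<=" + beforeBreak + ")(?=" + afterBreak + ")");
        Matcher matcher = rule.matcher(text);
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        while (matcher.find()) {
            int breakPos = matcher.start();
            if (breakPos > start) {
                ranges.add(new int[] {start, breakPos});
                start = breakPos;
            }
        }
        // Whatever remains after the last break is the final segment.
        if (start < text.length()) {
            ranges.add(new int[] {start, text.length()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Rule: a period before the break, a whitespace character after it.
        for (int[] range : computeRanges("Part 1. Part 2.", "\\.", "\\s")) {
            System.out.println(String.format("start=%d, end=%d", range[0], range[1]));
        }
    }
}
```

With this rule the sketch reports the same two ranges as the example above: 0 to 7 and 7 to 15. The final period produces no break because nothing follows it.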

With a TextContainer

The method ISegmenter.computeSegments() can also take a TextContainer as its parameter. It works as it does for plain text, but in addition it takes into account any inline codes in the content.

For example, given a breaking rule with the text before the break set as a period and the text after the break set as a space, which segments should the inline codes </span> and <alone/> go into?

<span>Part 1.</span> Part 2.<alone/> Part 3.

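The SRX standard has options for these cases. By default, a closing code at the break point stays in the segment before the break, while opening and isolated codes go into the segment after the break.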

To try out the segmenter with inline codes, we first have to build a TextContainer object with the proper content:

TextContainer tc = new TextContainer();
tc.append(TagType.OPENING, "span", "<span>");
tc.append("Part 1.");
tc.append(TagType.CLOSING, "span", "</span>");
tc.append(" Part 2.");
tc.append(TagType.PLACEHOLDER, "alone", "<alone/>");
tc.append(" Part 3.");

The calculation of the segmentation itself is the same as before.

segmenter.computeSegments(tc);

To make things easier, the TextContainer class offers a method that applies the ranges provided by the segmenter to the text content in one call: TextContainer.createSegments(), which takes a list of ranges as its parameter.

tc.createSegments(segmenter.getRanges());

You can then retrieve each segment of the now segmented container with TextContainer.getSegments(). The Segment class provides a simple structure that holds together the TextFragment object corresponding to the segment and the identifier of the segment.

for ( Segment seg : tc.getSegments() ) {
   System.out.println("segment=[" + seg.toString() + "]");
}

The code above results in the following output:

segment=[<span>Part 1.</span>]
segment=[ Part 2.]
segment=[<alone/> Part 3.]

Note that the SRX specification is unclear about the proper behavior of the segmenter when several consecutive inline codes come just after the break point: the specification mentions only the case of a single code. In such cases the SRX implementation in Okapi behaves as if the consecutive inline codes were a single code, as long as they are of the same type.

Working with Segmented Content

When a TextContainer is segmented, each segment is taken out of the coded text and replaced by a special marker (MARKER_SEGMENT). The text and the inline codes outside the segments remain in the coded text. Any inline codes inside a segment are part of the TextFragment of that segment.

Note: see the definition of inline code markers for details on how inline codes are represented in coded text.

  Normal: "<b>Segment 1.</b> Outside part<br/>"
   Coded: "\uE104\uE110 Outside part\uE103\uE111"
   Codes: 0={"0"=TagType.SEGMENTHOLDER}
          1={"<br/>"=TagType.PLACEHOLDER}
Segments: 0={
          Normal: "<b>Segment 1.</b>"
           Coded: "\uE101\uE110Segment 1.\uE102\uE111"
           Codes: 0="<b>",TagType.OPENING
                  1="</b>",TagType.CLOSING
          }
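
To read a coded string such as the ones in this listing, it helps to see each marker and the index character that follows it as a pair. The sketch below decodes such pairs without using the framework. The constants are assumptions copied from the values visible in the listing (U+E101 to U+E104 for the markers, U+E110 as the base of the index characters); CodedTextSketch, describe() and the constant names are local to this illustration and are not part of the Okapi API.

```java
class CodedTextSketch {

    // Assumed values, taken from the listing above.
    static final char MARKER_OPENING = '\uE101';
    static final char MARKER_CLOSING = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';
    static final char MARKER_SEGMENT = '\uE104';
    static final int INDEX_BASE = 0xE110;

    static boolean isMarker(char c) {
        return c >= MARKER_OPENING && c <= MARKER_SEGMENT;
    }

    // Replaces each marker/index pair with a readable token and
    // copies every other character unchanged.
    static String describe(String coded) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < coded.length(); i++) {
            char c = coded.charAt(i);
            if (isMarker(c) && i + 1 < coded.length()) {
                // The character after a marker encodes the code index.
                int index = coded.charAt(++i) - INDEX_BASE;
                out.append(c == MARKER_SEGMENT ? "{seg " : "{code ");
                out.append(index).append('}');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Container-level coded text from the listing above.
        System.out.println(describe("\uE104\uE110 Outside part\uE103\uE111"));
        // Coded text of segment 0 from the same listing.
        System.out.println(describe("\uE101\uE110Segment 1.\uE102\uE111"));
    }
}
```

The first line printed is "{seg 0} Outside part{code 1}" and the second is "{code 0}Segment 1.{code 1}". Note that code indices restart at 0 inside the segment, because the segment has its own TextFragment with its own list of codes.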

Most of the time segments are accessed through their index. That value is saved in the data part of the inline code corresponding to the segment marker.

For example, if you need to go through the coded text of a segmented TextContainer and access the segment corresponding to the segment marker found in the coded text, you would do something like this:

String text = tc.getCodedText();
for ( int i=0; i<text.length(); i++ ) {
   if ( TextFragment.isMarker(text.charAt(i)) ) {
      i++; // Skip the marker in all cases; i now points at the index character
      if ( text.charAt(i-1) == TextFragment.MARKER_SEGMENT ) {
         // The data of the code attached to a segment marker is the segment index
         Code code = tc.getCode(text.charAt(i));
         int segIndex = Integer.parseInt(code.getData());
         Segment seg = tc.getSegments().get(segIndex);
         System.out.println("segment->[" + seg.toString() + "]");
      }
   }
}

The methods provided by the TextContainer class are often enough to manipulate segments:

The method isSegmented() indicates if the content is segmented or not.

The method getSegmentCount() returns the number of segments in the content.

The method getSegments() gives you the list of the segments.

The method createSegments() takes a non-segmented content and creates segments based on a list of ranges, as we have seen already.

The method createSegment() allows you to create a new segment in a segmented or non-segmented content. The range provided must not include existing segments.

The method appendSegment() allows you to add new text as a new segment placed at the end of the current content.

The method setSegments() allows you to set the list of segments for a content that is already segmented, essentially replacing the segments without touching the areas between them. The new list of segments must have the same number of segments as the previous one.

The method joinSegmentWithNext() allows you to join a given segment with the one just after it in the coded text of the container. Any non-segmented text in between is also included in the given segment.

The method mergeAllSegments() simply resets the content as non-segmented, merging all existing segments together at their position.

The method mergeSegment() allows you to merge a given segment back into the coded text of the container. The other segments are re-indexed but otherwise left untouched.
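
The effect of joining and merging can be pictured with a toy model. The sketch below is not the TextContainer implementation: it represents a segmented content as a plain list of parts, each flagged as a segment or as text outside the segments, and it ignores inline codes entirely. SegmentJoinSketch, Part, joinWithNext() and mergeAll() are hypothetical names used only here to mimic what joinSegmentWithNext() and mergeAllSegments() are described as doing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SegmentJoinSketch {

    // One piece of content: either a segment or text between segments.
    static final class Part {
        final String text;
        final boolean isSegment;

        Part(String text, boolean isSegment) {
            this.text = text;
            this.isSegment = isSegment;
        }
    }

    // Joins segment number segIndex with the next segment, absorbing any
    // text found in between. The list is returned unchanged in content
    // if there is no next segment.
    static List<Part> joinWithNext(List<Part> parts, int segIndex) {
        List<Part> result = new ArrayList<>();
        int seen = -1; // index of the last segment met so far
        int i = 0;
        while (i < parts.size()) {
            Part part = parts.get(i);
            if (part.isSegment && ++seen == segIndex) {
                // Look for the next segment after this one.
                int j = i + 1;
                while (j < parts.size() && !parts.get(j).isSegment) {
                    j++;
                }
                if (j < parts.size()) {
                    StringBuilder joined = new StringBuilder();
                    for (int k = i; k <= j; k++) {
                        joined.append(parts.get(k).text);
                    }
                    result.add(new Part(joined.toString(), true));
                    i = j + 1;
                    continue;
                }
            }
            result.add(part);
            i++;
        }
        return result;
    }

    // Puts every part back in place, giving the non-segmented text.
    static String mergeAll(List<Part> parts) {
        StringBuilder merged = new StringBuilder();
        for (Part part : parts) {
            merged.append(part.text);
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        List<Part> parts = Arrays.asList(
            new Part("Part 1.", true), new Part(" ", false),
            new Part("Part 2.", true), new Part(" ", false),
            new Part("Part 3.", true));
        for (Part part : joinWithNext(parts, 0)) {
            System.out.println((part.isSegment ? "segment=[" : "outside=[") + part.text + "]");
        }
        System.out.println("merged=[" + mergeAll(parts) + "]");
    }
}
```

In this model, joining segment 0 with the next one gives a single segment "Part 1. Part 2." followed by the untouched space and the segment "Part 3.", while merging everything gives back "Part 1. Part 2. Part 3.".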