Okapi Framework - Developer's Guide
Segmentation
Overview
Segmentation, in the context of the Okapi Framework, is the action of breaking
down a given content into parts. For example, taking the content of an extracted
HTML <p> element, and breaking it down into sentences.
Segmentation is of great importance in localization tasks. It allows you to define the granularity of the parts of text that are being translated, matched against translation memories, processed by machine translation, etc. Differences in segmentation methods are often one of the causes of losing re-usable data when going from one tool to another.
In the framework, the basic unit of extraction is the text unit, which corresponds to different things depending on the original file format. Roughly, it is an un-segmented chunk of text that may be composed of several sentences. Many of the tasks performed on text units require manipulating the unit at a finer level: the segment. This page discusses segmentation in that context.
The framework provides one interface to apply segmentation to a text content,
the
ISegmenter interface.
How each implementation of
ISegmenter works may be different. In this example we will use one
default implementation of
ISegmenter
that is provided with the framework:
SRXSegmenter. As its name indicates it is based on the
SRX standard.
To instantiate this segmenter you must first create an
SRXDocument
object and load or set the SRX rules to use.
SRXDocument doc = new SRXDocument();
doc.load("myRules.srx");
Then you can obtain a segmenter for a given language.
ISegmenter segmenter = doc.compileLanguageRules(LocaleId.fromString("en"), null);
The second parameter of
compileLanguageRules() is an optional segmenter object, in case you
already have one and want to avoid the cost of re-creating one. You can just
pass null to create a brand new one.
Once you have a segmenter with its rules set, you can use it to create
segments on a given content. You can calculate the segments for a given plain
text string or for a
TextContainer.
Question: I've looked at the SRX specification and it seems quite complicated to write rules. Is there an easy way to create and edit SRX documents?
Answer: Sure. You can use Okapi's own SRX editor: Ratel (named after the tough honey-badger that roams the plains of Africa). You can download it from here. Just start Ratel and drag and drop your SRX document on it. The rules are applied on-the-fly to any sample text you enter.
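For a rough idea of what a simple SRX rule expresses, the sketch below imitates a single breaking rule with java.util.regex: one pattern for the text before the break and one for the text after it. The class BreakRuleSketch and the method findBreaks() are hypothetical names used only for illustration, and this is not how SRXSegmenter is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BreakRuleSketch {

    // Hypothetical helper: returns the break positions in 'text' for one rule.
    // 'before' must match just before the position and 'after' just after it.
    // Java lookbehind needs a bounded-length 'before' pattern.
    static List<Integer> findBreaks(String text, String before, String after) {
        // Lookbehind + lookahead give a zero-width match located at the break itself.
        Pattern rule = Pattern.compile("(?<=" + before + ")(?=" + after + ")");
        List<Integer> breaks = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            breaks.add(m.start());
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Break after a period when it is followed by a space.
        System.out.println(findBreaks("Part 1. Part 2.", "\\.", " "));
    }
}
```

With this sample text the only break found is at position 7, just after the first period. Real SRX documents also carry non-breaking (exception) rules and per-language rule sets, which this sketch leaves out.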
Here is an example of getting the segmentation for a plain text string:
int count = segmenter.computeSegments("Part 1. Part 2.");
System.out.println("count="+String.valueOf(count));
for ( Range range : segmenter.getRanges() ) {
    System.out.println(String.format("start=%d, end=%d",
        range.start, range.end));
}
The ISegmenter.computeSegments() method returns the number of segments it
finds. It also internally creates a list of the ranges of these segments. You
can get that list with the ISegmenter.getRanges() method. Each entry of the
list is a Range object that contains start and end values corresponding to
the boundaries of the segment in the given text. The text of the segment goes
from the character at the start position to the character just
before the end position (just like the arguments of String.substring()
in Java).
For example the code above will display this:
count=2
start=0, end=7
start=7, end=15
The first segment starts at 0 and ends at 7, so it corresponds to "Part
1." And the second segment starts at 7 and ends at 15, so it corresponds
to " Part 2.".
Part 1. Part 2.
0000000000111111
0123456789012345
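The mapping from ranges to segment text can be checked with plain Java. The sketch below uses a hypothetical SimpleRange class as a stand-in for the framework's Range (start inclusive, end exclusive) and cuts the sample string with String.substring().

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSketch {

    // Hypothetical stand-in for the framework's Range: start inclusive, end exclusive.
    static final class SimpleRange {
        final int start;
        final int end;
        SimpleRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Cuts the text into one string per range, using String.substring(start, end).
    static List<String> cut(String text, List<SimpleRange> ranges) {
        List<String> parts = new ArrayList<>();
        for (SimpleRange r : ranges) {
            parts.add(text.substring(r.start, r.end));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<SimpleRange> ranges = new ArrayList<>();
        ranges.add(new SimpleRange(0, 7));
        ranges.add(new SimpleRange(7, 15));
        for (String part : cut("Part 1. Part 2.", ranges)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

This prints [Part 1.] and then [ Part 2.], the second one starting with the space that follows the first period.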
While the segmenter is designed to work with coded text as we will see below, you can also use it on any kind of normal text as long as you have rules that correspond to your text format.
The method ISegmenter.computeSegments() can also take a
TextContainer as parameter. It works as it does for plain text, but in
addition it takes into account any inline codes in the content.
For example, given a breaking rule with the text before the break set as a
period and the text after the break set as a space, which segment should
the inline codes </span> and <alone/> go in?
<span>Part 1.</span> Part 2.<alone/> Part 3.
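The SRX standard has options for these cases. By default, a closing code (such as </span>) stays with the segment before the break, while opening and standalone codes (such as <alone/>) go with the segment after the break.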
To try out the segmenter with inline codes we first have to build a
TextContainer object with the proper content:
TextContainer tc = new TextContainer();
tc.append(TagType.OPENING, "span", "<span>");
tc.append("Part 1.");
tc.append(TagType.CLOSING, "span", "</span>");
tc.append(" Part 2.");
tc.append(TagType.PLACEHOLDER, "alone", "<alone/>");
tc.append(" Part 3.");
The calculation of the segmentation itself is the same as before.
segmenter.computeSegments(tc);
To make things easier, the
TextContainer class offers a method to apply the ranges provided by
the segmenter to the text content in one call:
TextContainer.createSegments() that takes a list of ranges as
parameters.
tc.createSegments(segmenter.getRanges());
You can then retrieve each segment of the now segmented container with the
TextContainer.getSegments() method. The Segment
class provides a simple structure to hold together the TextFragment
object corresponding to the segment, and the identifier of the segment.
for ( Segment seg : tc.getSegments() ) {
    System.out.println("segment=[" + seg.toString() + "]");
}
The code above results in the following output:
segment=[<span>Part 1.</span>]
segment=[ Part 2.]
segment=[<alone/> Part 3.]
Note that the SRX specification is unclear about the proper behavior of the segmenter when several consecutive inline codes come just after the break point: it mentions only the case of a single code. In such cases the SRX implementation in Okapi treats the consecutive inline codes as a single code, as long as they are of the same type.
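The placement of inline codes around a break can be modelled in a few lines of plain Java. The sketch below is a toy model and not the Okapi implementation. It assumes that only closing codes stay with the segment before the break, which is consistent with the output shown above, and it treats every text token as exactly one sentence. The class, enum and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CodeAtBreakSketch {

    enum Kind { TEXT, OPENING, CLOSING, PLACEHOLDER }

    // Hypothetical token: either one sentence of text or one inline code.
    static final class Token {
        final Kind kind;
        final String data;
        Token(Kind kind, String data) {
            this.kind = kind;
            this.data = data;
        }
    }

    // Assumed options: only closing codes stay before the break.
    static boolean staysBeforeBreak(Kind kind) {
        return kind == Kind.CLOSING;
    }

    // Simplification: each TEXT token is one sentence, so a break is pending after it.
    static List<String> segment(List<Token> tokens) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean breakPending = false;
        for (Token t : tokens) {
            boolean keep = t.kind != Kind.TEXT && staysBeforeBreak(t.kind);
            if (breakPending && !keep) {
                // Text, or a code that belongs after the break: close the segment first.
                segments.add(current.toString());
                current.setLength(0);
                breakPending = false;
            }
            current.append(t.data);
            if (t.kind == Kind.TEXT) {
                breakPending = true;
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    // Same content as the TextContainer built in the example above.
    static List<Token> sample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token(Kind.OPENING, "<span>"));
        tokens.add(new Token(Kind.TEXT, "Part 1."));
        tokens.add(new Token(Kind.CLOSING, "</span>"));
        tokens.add(new Token(Kind.TEXT, " Part 2."));
        tokens.add(new Token(Kind.PLACEHOLDER, "<alone/>"));
        tokens.add(new Token(Kind.TEXT, " Part 3."));
        return tokens;
    }

    public static void main(String[] args) {
        for (String seg : segment(sample())) {
            System.out.println("segment=[" + seg + "]");
        }
    }
}
```

Run on the sample content, the model yields the same three segments as the framework output above. Changing staysBeforeBreak() is a quick way to see how a different option setting would move the codes.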
When a
TextContainer is segmented each segment is taken out of the
coded text and replaced by a special marker (MARKER_SEGMENT). The
text and the inline codes outside segments remain in the coded text. Any inline
codes inside the segment are part of the
TextFragment
of that segment.
Note: see the definition of the inline code marker for details on how inline codes are represented in coded text.
Normal: "<b>Segment 1.</b> Outside part<br/>"
Coded: "\uE104\uE110 Outside part\uE103\uE111"
Codes:
    0={"0"=TagType.SEGMENTHOLDER}
    1={"<br/>"=TagType.PLACEHOLDER}
Segments:
    0={
        Normal: "<b>Segment 1.</b>"
        Coded: "\uE101\uE110Segment 1.\uE102\uE111"
        Codes:
            0="<b>",TagType.OPENING
            1="</b>",TagType.CLOSING
    }
Most of the time segments are accessed through their index. That value is saved in the data part of the inline code corresponding to the segment marker.
For example, if you need to go through the coded text of a segmented
TextContainer and access the segment corresponding to the segment
marker found in the coded text, you would do something like this:
String text = tc.getCodedText();
for ( int i=0; i<text.length(); i++ ) {
    if ( TextFragment.isMarker(text.charAt(i)) ) {
        i++; // Skip marker in all cases
        if ( text.charAt(i-1) == TextFragment.MARKER_SEGMENT ) {
            Code code = tc.getCode(text.charAt(i));
            int segIndex = Integer.valueOf(code.getData());
            Segment seg = tc.getSegments().get(segIndex);
            System.out.println("segment->[" + seg.toString() + "]");
        }
    }
}
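The same walk can be tried without the framework. The sketch below is a simplified, self-contained model of coded text: the marker and index values are copied from the coded-text example earlier on this page and should be treated as assumptions of the sketch, codes are reduced to their data strings, and segments are held as already expanded text. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class CodedTextSketch {

    // Values mirrored from the coded-text example; assumptions of this sketch.
    static final char MARKER_FIRST = '\uE101';   // opening code marker
    static final char MARKER_SEGMENT = '\uE104'; // segment marker, last of the markers
    static final int INDEX_BASE = 0xE110;        // code index 0 is stored as this value

    static boolean isMarker(char c) {
        return c >= MARKER_FIRST && c <= MARKER_SEGMENT;
    }

    // Rebuilds the normal text. codes.get(i) is the data of code i; for a segment
    // marker, that data is the index of the segment in 'segments'.
    static String expand(String coded, List<String> codes, List<String> segments) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < coded.length(); i++) {
            char c = coded.charAt(i);
            if (!isMarker(c)) {
                out.append(c);
                continue;
            }
            i++; // Move to the index character that follows every marker
            String data = codes.get(coded.charAt(i) - INDEX_BASE);
            if (c == MARKER_SEGMENT) {
                out.append(segments.get(Integer.parseInt(data)));
            } else {
                out.append(data);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String coded = "\uE104\uE110 Outside part\uE103\uE111";
        List<String> codes = Arrays.asList("0", "<br/>");
        List<String> segments = Arrays.asList("<b>Segment 1.</b>");
        System.out.println(expand(coded, codes, segments));
    }
}
```

For the sample data this prints the normal form of the content, <b>Segment 1.</b> Outside part<br/>, which shows how the segment index stored in the code data leads back to the segment itself.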
The methods provided by the
TextContainer class are often enough to manipulate segments:
The method
isSegmented() indicates if the content is segmented or not.
The method
getSegmentCount() returns the number of segments in the content.
The method
getSegments() gives you the list of the segments.
The method
createSegments() takes a non-segmented content and creates segments
based on a list of ranges, as we have seen already.
The method
createSegment() allows you to create a new segment in a segmented or
non-segmented content. The range provided must not include existing segments.
The method
appendSegment() allows you to add new text as a new segment placed at
the end of the current content.
The method
setSegments() allows you to set the list of segments for a content
that is already segmented, essentially replacing the segments without touching
the areas between them. The new list must have the same
number of segments as the previous one.
The method
joinSegmentWithNext() allows you to join a given segment with the one
just after in the coded text of the container. Any non-segmented text in-between
is also included in the given segment.
The method
mergeAllSegments() simply resets the content as non-segmented,
merging all existing segments together at their position.
The method
mergeSegment() allows you to merge back into the container coded text
a given segment. Other segments are re-indexed but not touched otherwise.
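To make the effect of joining segments concrete, here is a minimal model that works on ranges over a plain string rather than on a real TextContainer. The class JoinSketch and its joinWithNext() method are hypothetical and only illustrate the idea that the text between two joined segments ends up inside the resulting segment.

```java
import java.util.ArrayList;
import java.util.List;

public class JoinSketch {

    // Hypothetical model: each segment is an int[]{start, end} range over the text
    // (end exclusive). Anything not covered by a range is text between segments.
    static List<int[]> joinWithNext(List<int[]> ranges, int index) {
        List<int[]> result = new ArrayList<>(ranges);
        int[] next = result.remove(index + 1);
        int[] first = result.get(index);
        // The joined range also covers the text that sat between the two segments.
        result.set(index, new int[] { first[0], next[1] });
        return result;
    }

    public static void main(String[] args) {
        String text = "One. Two. Three.";
        List<int[]> ranges = new ArrayList<>();
        ranges.add(new int[] { 0, 4 });   // "One."
        ranges.add(new int[] { 5, 9 });   // "Two."
        ranges.add(new int[] { 10, 16 }); // "Three."
        for (int[] r : joinWithNext(ranges, 0)) {
            System.out.println("[" + text.substring(r[0], r[1]) + "]");
        }
    }
}
```

Joining the first segment with the next one gives [One. Two.] followed by [Three.]: the space that was outside both segments is now part of the joined segment, and the remaining segment simply moves up one position in the list.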