net.sf.okapi.lib.segmentation
Class SRXSegmenter

java.lang.Object
  extended by net.sf.okapi.lib.segmentation.SRXSegmenter
All Implemented Interfaces:
ISegmenter

public class SRXSegmenter
extends java.lang.Object
implements ISegmenter

Implements the ISegmenter interface for SRX rules.


Constructor Summary
SRXSegmenter()
          Creates a new SRXSegmenter object.
 
Method Summary
protected  void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule)
          Adds a compiled rule to this segmenter.
 boolean cascade()
          Indicates if cascading must be applied when selecting the rules for a given language pattern.
 int computeSegments(java.lang.String text)
          Calculates the segmentation of a given plain text string.
 int computeSegments(TextContainer container)
          Calculates the segmentation of a given TextContainer object.
 LocaleId getLanguage()
          Gets the language used to apply the rules.
 Range getNextSegmentRange(TextContainer container)
          Computes the range of the next segment for a given TextContainer object.
 java.util.List<Range> getRanges()
          Gets the list of all segment ranges calculated when calling ISegmenter.computeSegments(String) or ISegmenter.computeSegments(TextContainer).
 java.util.List<java.lang.Integer> getSplitPositions()
          Gets the list of all the split positions in the text that was last segmented.
 boolean includeEndCodes()
          Indicates if end codes should be included (See SRX implementation notes).
 boolean includeIsolatedCodes()
          Indicates if isolated codes should be included (See SRX implementation notes).
 boolean includeStartCodes()
          Indicates if start codes should be included (See SRX implementation notes).
 boolean oneSegmentIncludesAll()
          Indicates if, when a text has a single segment, that segment should include the whole text (no trimming of spaces or codes on the left or right).
 void reset()
          Resets the options to their defaults, and the compiled rules to nothing.
 boolean segmentSubFlows()
          Indicates if sub-flows must be segmented.
protected  void setCascade(boolean value)
          Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
protected  void setLanguage(LocaleId languageCode)
          Sets the language used to apply the rules.
protected  void setMaskRule(java.lang.String pattern)
          Sets the pattern for the mask rule.
 void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS)
          Sets the options for this segmenter.
 boolean trimLeadingWhitespaces()
          Indicates if leading white-spaces should be left outside the segments.
 boolean trimTrailingWhitespaces()
          Indicates if trailing white-spaces should be left outside the segments.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SRXSegmenter

public SRXSegmenter()
Creates a new SRXSegmenter object.

Method Detail

reset

public void reset()
Resets the options to their defaults, and the compiled rules to nothing.


setOptions

public void setOptions(boolean segmentSubFlows,
                       boolean includeStartCodes,
                       boolean includeEndCodes,
                       boolean includeIsolatedCodes,
                       boolean oneSegmentIncludesAll,
                       boolean trimLeadingWS,
                       boolean trimTrailingWS)
Sets the options for this segmenter.

Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to make the segment include the whole text when the text has a single segment.
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
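
As an illustration of the two trimming options, the following self-contained sketch narrows a raw segment range so that white-space at its edges is left outside the segment. The class TrimSketch, its trim method and the int[] range representation are hypothetical stand-ins chosen for this example; they are not part of SRXSegmenter, which works on coded text and Range objects.

```java
class TrimSketch {
    /**
     * Narrows the half-open range [start, end) of the given text so that
     * white-space at the edges is left outside, depending on the two flags.
     * Returns a two-element array {newStart, newEnd}.
     * Hypothetical helper, not the Okapi implementation.
     */
    static int[] trim(String text, int start, int end,
                      boolean trimLeadingWS, boolean trimTrailingWS) {
        if (trimLeadingWS) {
            // Move the start forward past leading white-space.
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailingWS) {
            // Move the end backward past trailing white-space.
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String text = "  Hello world.  ";
        int[] r = trim(text, 0, text.length(), true, true);
        // Prints the segment content between brackets to make the edges visible.
        System.out.println("[" + text.substring(r[0], r[1]) + "]");
    }
}
```

With both flags set to false the range is returned unchanged, which corresponds to keeping the white-spaces inside the segment.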

oneSegmentIncludesAll

public boolean oneSegmentIncludesAll()
Indicates if, when a text has a single segment, that segment should include the whole text (no trimming of spaces or codes on the left or right).

Returns:
true if a text with a single segment should include the whole text.

segmentSubFlows

public boolean segmentSubFlows()
Indicates if sub-flows must be segmented.

Returns:
true if sub-flows must be segmented, false otherwise.

cascade

public boolean cascade()
Indicates if cascading must be applied when selecting the rules for a given language pattern.

Returns:
true if cascading must be applied, false otherwise.
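
The effect of cascading on rule selection can be sketched as follows. This example assumes the behavior described in the SRX specification: with cascading, the rules of every language map whose pattern matches the language code are used, in order; without it, only the first matching language map is used. The class CascadeSketch, the nested LanguageMap type and the rule-set names are hypothetical and do not reflect how SRXSegmenter stores its rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CascadeSketch {
    /** A language map: a regular expression over language codes and a rule-set name. */
    static final class LanguageMap {
        final String pattern;
        final String ruleSetName;

        LanguageMap(String pattern, String ruleSetName) {
            this.pattern = pattern;
            this.ruleSetName = ruleSetName;
        }
    }

    /**
     * Returns the names of the rule sets selected for a language code.
     * Hypothetical helper illustrating the cascade flag only.
     */
    static List<String> select(List<LanguageMap> maps, String languageCode, boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (LanguageMap map : maps) {
            // The pattern must match the whole language code.
            if (Pattern.matches(map.pattern, languageCode)) {
                selected.add(map.ruleSetName);
                if (!cascade) {
                    break; // Without cascading, stop at the first match.
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<LanguageMap> maps = new ArrayList<>();
        maps.add(new LanguageMap("[Ee][Nn].*", "English"));
        maps.add(new LanguageMap(".*", "Default"));
        System.out.println(select(maps, "en-US", true));
        System.out.println(select(maps, "en-US", false));
    }
}
```

In this sketch the order of the language maps matters in both modes: it decides which rule set wins when cascading is off, and the order in which rule sets are collected when it is on.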

trimLeadingWhitespaces

public boolean trimLeadingWhitespaces()
Indicates if leading white-spaces should be left outside the segments.

Returns:
true if the leading white-spaces should be trimmed.

trimTrailingWhitespaces

public boolean trimTrailingWhitespaces()
Indicates if trailing white-spaces should be left outside the segments.

Returns:
true if the trailing white-spaces should be trimmed.

includeStartCodes

public boolean includeStartCodes()
Indicates if start codes should be included (See SRX implementation notes).

Returns:
true if they should be included, false otherwise.

includeEndCodes

public boolean includeEndCodes()
Indicates if end codes should be included (See SRX implementation notes).

Returns:
true if they should be included, false otherwise.

includeIsolatedCodes

public boolean includeIsolatedCodes()
Indicates if isolated codes should be included (See SRX implementation notes).

Returns:
true if they should be included, false otherwise.

computeSegments

public int computeSegments(java.lang.String text)
Description copied from interface: ISegmenter
Calculates the segmentation of a given plain text string.

Specified by:
computeSegments in interface ISegmenter
Parameters:
text - plain text to segment.
Returns:
the number of segments calculated.

computeSegments

public int computeSegments(TextContainer container)
Description copied from interface: ISegmenter
Calculates the segmentation of a given TextContainer object. If the content is already segmented, it is un-segmented automatically before being processed.

Specified by:
computeSegments in interface ISegmenter
Parameters:
container - the object to segment.
Returns:
the number of segments calculated.

getNextSegmentRange

public Range getNextSegmentRange(TextContainer container)
Description copied from interface: ISegmenter
Computes the range of the next segment for a given TextContainer object. The next segment is searched from the first character after the last segment marker found in the container.

Specified by:
getNextSegmentRange in interface ISegmenter
Parameters:
container - the text container in which to look for the next segment.
Returns:
a range corresponding to the start and end position of the found segment, or null if no more segments are found.
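
The "call until null" contract described above lends itself to a simple loop. The sketch below imitates that contract on a plain string: nextRange is a hypothetical stand-in that resumes from an explicit position (the real method resumes after the last segment marker in the TextContainer), and the break rule it uses (a '.', '!' or '?' followed by white-space or the end of the text) is a deliberately naive assumption, not an SRX rule set.

```java
import java.util.ArrayList;
import java.util.List;

class NextRangeSketch {
    /**
     * Returns {start, end} for the next segment starting at 'from',
     * or null if no text is left. Hypothetical stand-in for illustration.
     */
    static int[] nextRange(String text, int from) {
        if (from >= text.length()) {
            return null; // No more segments.
        }
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));
            if (terminator && atBoundary) {
                return new int[] { from, i + 1 };
            }
        }
        // No terminator found: the rest of the text is the last segment.
        return new int[] { from, text.length() };
    }

    /** Collects all segments by asking for the next range until null is returned. */
    static List<String> segments(String text) {
        List<String> result = new ArrayList<>();
        int from = 0;
        int[] range;
        while ((range = nextRange(text, from)) != null) {
            result.add(text.substring(range[0], range[1]));
            from = range[1]; // Resume right after the segment just found.
        }
        return result;
    }

    public static void main(String[] args) {
        for (String segment : segments("One. Two! Three")) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

Each returned range ends strictly after the position it started from, so the loop always terminates.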

getSplitPositions

public java.util.List<java.lang.Integer> getSplitPositions()
Description copied from interface: ISegmenter
Gets the list of all the split positions in the text that was last segmented. You must call ISegmenter.computeSegments(TextContainer) or ISegmenter.computeSegments(String) before calling this method. A split position is the first character position of a new segment.

IMPORTANT: The positions returned here do NOT take into account any options for trimming leading and trailing white-spaces.

Specified by:
getSplitPositions in interface ISegmenter
Returns:
A list of integers where each value is a split position in the coded text that was segmented.
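
Because a split position is the first character position of a new segment, a list of split positions can be turned into untrimmed ranges by pairing consecutive positions. The sketch below shows one way to do that; the class SplitPositionsSketch and the int[] range representation are hypothetical, and the code assumes the positions are in ascending order and lie strictly between 0 and the text length. As noted above, no white-space trimming is reflected in these positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitPositionsSketch {
    /**
     * Converts split positions into half-open ranges {start, end} that
     * cover the whole text. Hypothetical helper: assumes ascending
     * positions strictly inside the text.
     */
    static List<int[]> toRanges(int textLength, List<Integer> splitPositions) {
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        for (int split : splitPositions) {
            // Each split position closes the current range and opens the next.
            ranges.add(new int[] { start, split });
            start = split;
        }
        if (start < textLength) {
            // The last range runs to the end of the text.
            ranges.add(new int[] { start, textLength });
        }
        return ranges;
    }

    public static void main(String[] args) {
        String text = "One. Two! Three";
        for (int[] r : toRanges(text.length(), Arrays.asList(4, 9))) {
            System.out.println(r[0] + "-" + r[1] + " [" + text.substring(r[0], r[1]) + "]");
        }
    }
}
```

The ranges produced this way still contain any leading or trailing white-space; applying the trimming options is a separate step.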

getRanges

public java.util.List<Range> getRanges()
Description copied from interface: ISegmenter
Gets the list of all segment ranges calculated when calling ISegmenter.computeSegments(String) or ISegmenter.computeSegments(TextContainer).

Specified by:
getRanges in interface ISegmenter
Returns:
the list of all segment ranges. Each range is stored in a Range object, where start is the start and end is the end of the range. Returns null if no ranges have been defined yet.

getLanguage

public LocaleId getLanguage()
Description copied from interface: ISegmenter
Gets the language used to apply the rules.

Specified by:
getLanguage in interface ISegmenter
Returns:
the language code used to apply the rules, or null if none has been specified.

setLanguage

protected void setLanguage(LocaleId languageCode)
Sets the language used to apply the rules.

Parameters:
languageCode - Code of the language to use to apply the rules.

setCascade

protected void setCascade(boolean value)
Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.

Parameters:
value - true if cascading must be applied, false otherwise.

addRule

protected void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule)
Adds a compiled rule to this segmenter.

Parameters:
compiledRule - the compiled rule to add.

setMaskRule

protected void setMaskRule(java.lang.String pattern)
Sets the pattern for the mask rule.

Parameters:
pattern - the new pattern to use for the mask rule.