net.sf.okapi.lib.segmentation
Class SRXDocument

java.lang.Object
  extended by net.sf.okapi.lib.segmentation.SRXDocument

public class SRXDocument
extends java.lang.Object

Provides facilities to load, save, and manage segmentation rules in SRX format. This class also implements several extensions to the standard SRX behavior.


Field Summary
static java.lang.String ANYCODE
          Marker for INLINECODE_PATTERN in the given pattern.
static java.lang.String INLINECODE_PATTERN
          Represents the pattern for an inline code (both special characters).
static java.lang.String NOAUTO
          Placed at the end of the 'after' expression, this marker indicates the given pattern should not have auto-insertion of AUTO_INLINECODES.
 
Constructor Summary
SRXDocument()
          Creates an empty SRX document.
 
Method Summary
 void addLanguageMap(LanguageMap langMap)
          Adds a language map to this document.
 void addLanguageRule(java.lang.String name, java.util.ArrayList<Rule> langRule)
          Adds a language rule to this SRX document.
 boolean cascade()
          Indicates if cascading must be applied when selecting the rules for a given language pattern.
 ISegmenter compileLanguageRules(LocaleId languageCode, ISegmenter existingSegmenter)
          Compiles all the language rules applicable to a given language code and assigns them to a segmenter.
 ISegmenter compileSingleLanguageRule(java.lang.String ruleName, ISegmenter existingSegmenter)
          Compiles a single language rule group and assigns it to a segmenter.
 java.util.LinkedHashMap<java.lang.String,java.util.ArrayList<Rule>> getAllLanguageRules()
          Gets a map of all the language rules in this document.
 java.util.ArrayList<LanguageMap> getAllLanguagesMaps()
          Gets the list of all the language maps in this document.
 java.lang.String getComments()
          Gets the comments associated with this document.
 java.lang.String getHeaderComments()
          Gets the comments associated with the header of this document.
 java.util.ArrayList<Rule> getLanguageRules(java.lang.String ruleName)
          Gets the list of rules for a given <languagerule> element.
 java.lang.String getMaskRule()
          Gets the current pattern of the mask rule.
 java.lang.String getSampleLanguage()
          Gets the current sample language code.
 java.lang.String getSampleText()
          Gets the current sample text.
 java.lang.String getVersion()
          Gets the version of this SRX document.
 java.lang.String getWarning()
          Gets the last warning that was issued while loading a document.
 boolean hasWarning()
          Indicates if a warning was issued last time a document was read.
 boolean includeEndCodes()
          Indicates if end codes should be included (See SRX implementation notes).
 boolean includeIsolatedCodes()
          Indicates if isolated codes should be included (See SRX implementation notes).
 boolean includeStartCodes()
          Indicates if start codes should be included (See SRX implementation notes).
 boolean isModified()
          Indicates if the document has been modified since the last load or save.
 void loadRules(java.lang.CharSequence data)
          Loads an SRX document from a CharSequence object.
 void loadRules(java.io.InputStream inputStream)
          Loads an SRX document from an input stream.
 void loadRules(java.lang.String pathOrURL)
          Loads an SRX document from a file.
 boolean oneSegmentIncludesAll()
          Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
 void resetAll()
          Resets the document to its default empty initial state.
 void saveRules(java.lang.String rulesPath, boolean saveExtensions, boolean saveNonValidInfo)
          Saves the current rules to an SRX rules document.
 java.lang.String saveRulesToString(boolean saveExtensions, boolean saveNonValidInfo)
          Saves the current rules to an SRX string.
 boolean segmentSubFlows()
          Indicates if sub-flows must be segmented.
 void setCascade(boolean value)
          Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
 void setComments(java.lang.String text)
          Sets the comments for this document.
 void setHeaderComments(java.lang.String text)
          Sets the comments for the header of this document.
 void setIncludeEndCodes(boolean value)
          Sets the indicator that tells if end codes should be included or not.
 void setIncludeIsolatedCodes(boolean value)
          Sets the indicator that tells if isolated codes should be included or not.
 void setIncludeStartCodes(boolean value)
          Sets the indicator that tells if start codes should be included or not.
 void setMaskRule(java.lang.String pattern)
          Sets the pattern for the mask rule.
 void setModified(boolean value)
          Sets the flag indicating if the document has been modified since the last load or save.
 void setOneSegmentIncludesAll(boolean value)
          Sets the indicator that tells if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
 void setSampleLanguage(java.lang.String value)
          Sets the sample language code.
 void setSampleText(java.lang.String value)
          Sets the sample text.
 void setSegmentSubFlows(boolean value)
          Sets the flag indicating if sub-flows must be segmented.
 void setTestOnSelectedGroup(boolean value)
          Sets the indicator on how to apply rules for samples.
 void setTrimLeadingWhitespaces(boolean value)
          Sets the indicator that tells if leading white-spaces should be left outside the segments.
 void setTrimTrailingWhitespaces(boolean value)
          Sets the indicator that tells if trailing white-spaces should be left outside the segments.
 boolean testOnSelectedGroup()
          Indicates that, when sampling the rules, the sample should be computed using only a selected group of rules.
 boolean trimLeadingWhitespaces()
          Indicates if leading white-spaces should be left outside the segments.
 boolean trimTrailingWhitespaces()
          Indicates if trailing white-spaces should be left outside the segments.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INLINECODE_PATTERN

public static final java.lang.String INLINECODE_PATTERN
Represents the pattern for an inline code (both special characters).


ANYCODE

public static final java.lang.String ANYCODE
Marker for INLINECODE_PATTERN in the given pattern. \Y+ = one or more codes, \Y* = zero or more codes, etc.

See Also:
Constant Field Values
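
The marker substitution described for ANYCODE can be sketched as follows. This is a minimal illustration, not the library's code: the marker text "\Y" is taken from the description above, while the value given here for the inline-code pattern (one private-use marker character followed by one index character) and the helper name expand are assumptions.

```java
import java.util.regex.Pattern;

final class AnyCodeSketch {
    // Marker text as described above (\Y); assumed to be the literal value of ANYCODE.
    static final String ANYCODE = "\\Y";
    // Hypothetical inline-code pattern: one marker character followed by one index character.
    // The real value of INLINECODE_PATTERN is not shown on this page.
    static final String INLINECODE_PATTERN = "[\\uE101\\uE102\\uE103].";

    // Replaces each marker with the inline-code pattern, grouped so that
    // a quantifier such as + or * applies to the whole two-character code.
    static String expand(String rulePattern) {
        return rulePattern.replace(ANYCODE, "(?:" + INLINECODE_PATTERN + ")");
    }

    public static void main(String[] args) {
        // A period followed by one or more inline codes.
        String expanded = expand("\\.\\Y+");
        String text = "End.\uE101\uE110\uE102\uE111 Next";
        System.out.println(Pattern.compile(expanded).matcher(text).find());
    }
}
```

Wrapping the substituted pattern in a non-capturing group is a choice made for this sketch, so that \Y+ repeats whole codes rather than only the last character.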

NOAUTO

public static final java.lang.String NOAUTO
Placed at the end of the 'after' expression, this marker indicates the given pattern should not have auto-insertion of AUTO_INLINECODES.

See Also:
Constant Field Values
Constructor Detail

SRXDocument

public SRXDocument()
Creates an empty SRX document.

Method Detail

getVersion

public java.lang.String getVersion()
Gets the version of this SRX document.

Returns:
the version of this SRX document.

hasWarning

public boolean hasWarning()
Indicates if a warning was issued last time a document was read.

Returns:
true if a warning was issued, false otherwise.

getWarning

public java.lang.String getWarning()
Gets the last warning that was issued while loading a document.

Returns:
the text of the last warning issued, or an empty string.

getHeaderComments

public java.lang.String getHeaderComments()
Gets the comments associated with the header of this document.

Returns:
the comments for the header of this document, or null if there are none.

setHeaderComments

public void setHeaderComments(java.lang.String text)
Sets the comments for the header of this document.

Parameters:
text - the new comments. Use null or an empty string to remove the comments.

getComments

public java.lang.String getComments()
Gets the comments associated with this document.

Returns:
the comments for this document, or null if there are none.

setComments

public void setComments(java.lang.String text)
Sets the comments for this document.

Parameters:
text - the new comments. Use null or an empty string to remove the comments.

resetAll

public void resetAll()
Resets the document to its default empty initial state.


getAllLanguageRules

public java.util.LinkedHashMap<java.lang.String,java.util.ArrayList<Rule>> getAllLanguageRules()
Gets a map of all the language rules in this document.

Returns:
a map of all the language rules.

getLanguageRules

public java.util.ArrayList<Rule> getLanguageRules(java.lang.String ruleName)
Gets the list of rules for a given <languagerule> element.

Parameters:
ruleName - the name of the <languagerule> element to query.
Returns:
the list of rules for a given <languagerule> element.

getAllLanguagesMaps

public java.util.ArrayList<LanguageMap> getAllLanguagesMaps()
Gets the list of all the language maps in this document.

Returns:
the list of all the language maps.

segmentSubFlows

public boolean segmentSubFlows()
Indicates if sub-flows must be segmented.

Returns:
true if sub-flows must be segmented, false otherwise.

setSegmentSubFlows

public void setSegmentSubFlows(boolean value)
Sets the flag indicating if sub-flows must be segmented.

Parameters:
value - true if sub-flows must be segmented, false otherwise.

cascade

public boolean cascade()
Indicates if cascading must be applied when selecting the rules for a given language pattern.

Returns:
true if cascading must be applied, false otherwise.

setCascade

public void setCascade(boolean value)
Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.

Parameters:
value - true if cascading must be applied, false otherwise.

oneSegmentIncludesAll

public boolean oneSegmentIncludesAll()
Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).

Returns:
true if a text with a single segment should include the whole text.

setOneSegmentIncludesAll

public void setOneSegmentIncludesAll(boolean value)
Sets the indicator that tells if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).

Parameters:
value - true if a text with a single segment should include the whole text.

trimLeadingWhitespaces

public boolean trimLeadingWhitespaces()
Indicates if leading white-spaces should be left outside the segments.

Returns:
true if the leading white-spaces should be trimmed.

setTrimLeadingWhitespaces

public void setTrimLeadingWhitespaces(boolean value)
Sets the indicator that tells if leading white-spaces should be left outside the segments.

Parameters:
value - true if the leading white-spaces should be trimmed.

trimTrailingWhitespaces

public boolean trimTrailingWhitespaces()
Indicates if trailing white-spaces should be left outside the segments.

Returns:
true if the trailing white-spaces should be trimmed.

setTrimTrailingWhitespaces

public void setTrimTrailingWhitespaces(boolean value)
Sets the indicator that tells if trailing white-spaces should be left outside the segments.

Parameters:
value - true if the trailing white-spaces should be trimmed.
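
The effect of the two trimming options on a segment boundary can be sketched as below. This is only an illustration under stated assumptions: the class and method names are hypothetical, a segment is modeled as a start/end offset pair, and inline codes are ignored.

```java
final class TrimSketch {
    // Returns the adjusted {start, end} offsets of a segment within text.
    // Whitespace skipped here stays in the text, just outside the segment.
    static int[] trim(String text, int start, int end,
                      boolean trimLeading, boolean trimTrailing) {
        if (trimLeading) {
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailing) {
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String text = "One.  Two. ";
        // Second segment as produced by a break after the first period.
        int[] range = trim(text, 4, text.length(), true, true);
        System.out.println("[" + text.substring(range[0], range[1]) + "]");
    }
}
```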

includeStartCodes

public boolean includeStartCodes()
Indicates if start codes should be included (See SRX implementation notes).

Returns:
true if start codes should be included, false otherwise.

setIncludeStartCodes

public void setIncludeStartCodes(boolean value)
Sets the indicator that tells if start codes should be included or not (see SRX implementation notes).

Parameters:
value - true if start codes should be included, false otherwise.

includeEndCodes

public boolean includeEndCodes()
Indicates if end codes should be included (See SRX implementation notes).

Returns:
true if end codes should be included, false otherwise.

setIncludeEndCodes

public void setIncludeEndCodes(boolean value)
Sets the indicator that tells if end codes should be included or not (see SRX implementation notes).

Parameters:
value - true if end codes should be included, false otherwise.

includeIsolatedCodes

public boolean includeIsolatedCodes()
Indicates if isolated codes should be included (See SRX implementation notes).

Returns:
true if isolated codes should be included, false otherwise.

setIncludeIsolatedCodes

public void setIncludeIsolatedCodes(boolean value)
Sets the indicator that tells if isolated codes should be included or not (see SRX implementation notes).

Parameters:
value - true if isolated codes should be included, false otherwise.

getMaskRule

public java.lang.String getMaskRule()
Gets the current pattern of the mask rule.

Returns:
the current pattern of the mask rule.

setMaskRule

public void setMaskRule(java.lang.String pattern)
Sets the pattern for the mask rule.

Parameters:
pattern - the new pattern to use for the mask rule.

getSampleText

public java.lang.String getSampleText()
Gets the current sample text. This text is an example string that can be used to test the various rules. It can be handy to be able to save it along with the SRX document.

Returns:
the sample text, or an empty string.

setSampleText

public void setSampleText(java.lang.String value)
Sets the sample text.

Parameters:
value - the new sample text.

getSampleLanguage

public java.lang.String getSampleLanguage()
Gets the current sample language code.

Returns:
the current sample language code.

setSampleLanguage

public void setSampleLanguage(java.lang.String value)
Sets the sample language code. Null or empty strings are changed to the default language.

Parameters:
value - the new sample language code.

testOnSelectedGroup

public boolean testOnSelectedGroup()
Indicates that, when sampling the rules, the sample should be computed using only a selected group of rules.

Returns:
true to test using only a selected group of rules, false to test using all the rules matching a given language.

setTestOnSelectedGroup

public void setTestOnSelectedGroup(boolean value)
Sets the indicator on how to apply rules for samples.

Parameters:
value - true to test using only a selected group of rules, false to test using all the rules matching a given language.

isModified

public boolean isModified()
Indicates if the document has been modified since the last load or save.

Returns:
true if the document has been modified, false otherwise.

setModified

public void setModified(boolean value)
Sets the flag indicating if the document has been modified since the last load or save. If you change the rules or language maps directly through the lists, make sure to set this flag to true.

Parameters:
value - true if the document has been changed, false otherwise.

addLanguageRule

public void addLanguageRule(java.lang.String name,
                            java.util.ArrayList<Rule> langRule)
Adds a language rule to this SRX document. If another language rule with the same name already exists, it is replaced by the new one without warning.

Parameters:
name - name of the language rule to add.
langRule - language rule object to add.
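
The replace-by-name behavior can be pictured with a LinkedHashMap, the type returned by getAllLanguageRules(). The sketch below is an assumption about how such a map behaves, not the class's actual implementation; String stands in for the Rule class and the helper class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;

final class LanguageRuleMapSketch {
    // Stores the rule list under its name; an existing entry with the
    // same name is silently overwritten and keeps its original position.
    static void addLanguageRule(LinkedHashMap<String, ArrayList<String>> rules,
                                String name, ArrayList<String> langRule) {
        rules.put(name, langRule);
    }

    public static void main(String[] args) {
        LinkedHashMap<String, ArrayList<String>> rules = new LinkedHashMap<>();
        addLanguageRule(rules, "Default", new ArrayList<>(Arrays.asList("rule-a")));
        addLanguageRule(rules, "French", new ArrayList<>(Arrays.asList("rule-b")));
        // Same name again: the first "Default" list is replaced.
        addLanguageRule(rules, "Default", new ArrayList<>(Arrays.asList("rule-c")));
        System.out.println(rules.keySet());
        System.out.println(rules.get("Default"));
    }
}
```

Because the replacement gives no warning, a caller that wants to keep an existing group could first check the map returned by getAllLanguageRules() for the name.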

addLanguageMap

public void addLanguageMap(LanguageMap langMap)
Adds a language map to this document. The new map is added at the end of the list of maps already there.

Parameters:
langMap - the language map object to add.

compileLanguageRules

public ISegmenter compileLanguageRules(LocaleId languageCode,
                                       ISegmenter existingSegmenter)
Compiles all the language rules applicable to a given language code and assigns them to a segmenter. This method applies the language code you specify to the language mappings currently available in the document and compiles the rules when one or more language maps are found. The matching is done in the order of the list of language maps, and more than one map can be selected if cascade() is true.

Parameters:
languageCode - the language code. The value should be a BCP-47 value (e.g. "de", "fr-ca", etc.)
existingSegmenter - optional existing SRXSegmenter object to re-use. Use null if there is nothing to re-use.
Returns:
the instance of the segmenter with the new compiled rules.
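
The selection step described above can be sketched as follows. This is a simplified model under stated assumptions: the MapEntry class is a hypothetical stand-in for LanguageMap, each map is taken to hold a regular expression tested against the whole language code, and the sketch only returns the names of the selected rule groups instead of compiling them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class CascadeSketch {
    // Hypothetical stand-in for a language map: a pattern plus a rule group name.
    static final class MapEntry {
        final Pattern pattern;
        final String ruleName;

        MapEntry(String pattern, String ruleName) {
            this.pattern = Pattern.compile(pattern);
            this.ruleName = ruleName;
        }
    }

    // Walks the maps in list order; with cascading every match is kept,
    // without it only the first match is used.
    static List<String> selectRuleNames(List<MapEntry> maps, String languageCode,
                                        boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (MapEntry entry : maps) {
            if (entry.pattern.matcher(languageCode).matches()) {
                selected.add(entry.ruleName);
                if (!cascade) {
                    break;
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<MapEntry> maps = new ArrayList<>();
        maps.add(new MapEntry("[Ff][Rr].*", "French"));
        maps.add(new MapEntry(".*", "Default"));
        System.out.println(selectRuleNames(maps, "fr-ca", true));
        System.out.println(selectRuleNames(maps, "fr-ca", false));
        System.out.println(selectRuleNames(maps, "de", true));
    }
}
```

Since matching follows list order, the position at which addLanguageMap() appends a map affects which rule groups are picked, especially when cascading is off.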

compileSingleLanguageRule

public ISegmenter compileSingleLanguageRule(java.lang.String ruleName,
                                            ISegmenter existingSegmenter)
Compiles a single language rule group and assigns it to a segmenter.

Parameters:
ruleName - the name of the rule group to apply.
existingSegmenter - optional existing SRXSegmenter object to re-use. Use null if there is nothing to re-use.
Returns:
the instance of the segmenter with the new compiled rules.

loadRules

public void loadRules(java.lang.CharSequence data)
Loads an SRX document from a CharSequence object. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.

Parameters:
data - the string containing the SRX document to load.

loadRules

public void loadRules(java.lang.String pathOrURL)
Loads an SRX document from a file. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.

Parameters:
pathOrURL - the full path or URL of the document to load.

loadRules

public void loadRules(java.io.InputStream inputStream)
Loads an SRX document from an input stream. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.

Parameters:
inputStream - the input stream to read from.

saveRulesToString

public java.lang.String saveRulesToString(boolean saveExtensions,
                                          boolean saveNonValidInfo)
Saves the current rules to an SRX string.

Parameters:
saveExtensions - true to save Okapi SRX extensions, false otherwise.
saveNonValidInfo - true to save non-SRX-valid attributes, false otherwise.
Returns:
the string containing the saved SRX rules.

saveRules

public void saveRules(java.lang.String rulesPath,
                      boolean saveExtensions,
                      boolean saveNonValidInfo)
Saves the current rules to an SRX rules document.

Parameters:
rulesPath - the full path of the file in which to save the rules.
saveExtensions - true to save Okapi SRX extensions, false otherwise.
saveNonValidInfo - true to save non-SRX-valid attributes, false otherwise.