morfologik.tools
Class MorphEncoder

java.lang.Object
  extended by morfologik.tools.MorphEncoder

public final class MorphEncoder
extends java.lang.Object

A class that converts tabular data to the FSA morphological format. Three encodings are supported: standard (see standardEncode), prefix (see prefixEncode) and infix (see infixEncode).


Constructor Summary
MorphEncoder()
           
MorphEncoder(byte annotationSeparator)
           
 
Method Summary
protected static java.lang.String asString(byte[] str, java.lang.String encoding)
          Converts a byte array to a String using the given encoding.
static int commonPrefix(byte[] s1, byte[] s2)
           
 byte[] infixEncode(byte[] wordForm, byte[] wordLemma, byte[] wordTag)
          This method converts wordForm, wordLemma and the tag to the form: inflected_form + MLKending + tags where '+' is a separator, M is a character that specifies the position, counted from the beginning of the inflected form, at which deletion starts ("A" means the first character, "B" the second, "C" the third, and so on), L is a character that specifies how many characters should be deleted starting at the position given by M ("A" means none, "B" means 1, "C" means 2, etc.), and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.
 java.lang.String infixEncodeUTF8(java.lang.String wordForm, java.lang.String wordLemma, java.lang.String wordTag)
          A UTF-8 variant of infixEncode(byte[], byte[], byte[]).
 byte[] prefixEncode(byte[] wordForm, byte[] wordLemma, byte[] wordTag)
          This method converts wordForm, wordLemma and the tag to the form: inflected_form + LKending + tags where '+' is a separator, L is a character that specifies how many characters should be deleted from the beginning of the inflected form ("A" means none, "B" means 1, "C" means 2, etc.), and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.
 java.lang.String prefixEncodeUTF8(java.lang.String wordForm, java.lang.String wordLemma, java.lang.String wordTag)
          A UTF-8 variant of prefixEncode(byte[], byte[], byte[]). This method converts wordForm, wordLemma and the tag to the form: inflected_form + LKending + tags where '+' is a separator, L is a character that specifies how many characters should be deleted from the beginning of the inflected form ("A" means none, "B" means 1, "C" means 2, etc.), and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.
 byte[] standardEncode(byte[] wordForm, byte[] wordLemma, byte[] wordTag)
          This method converts the wordForm, wordLemma and tag to the form: wordForm + Kending + tags where '+' is a separator and K is a character that specifies how many characters should be deleted from the end of the inflected form. The lexeme is produced by concatenating the stripped string with the ending.
 java.lang.String standardEncodeUTF8(java.lang.String wordForm, java.lang.String wordLemma, java.lang.String wordTag)
          A UTF-8 variant of standardEncode(byte[], byte[], byte[]). This method converts the wordForm, wordLemma and tag to the form: wordForm + Kending + tags where '+' is a separator and K is a character that specifies how many characters should be deleted from the end of the inflected form. The lexeme is produced by concatenating the stripped string with the ending.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

MorphEncoder

public MorphEncoder()

MorphEncoder

public MorphEncoder(byte annotationSeparator)
Method Detail

commonPrefix

public static int commonPrefix(byte[] s1,
                               byte[] s2)
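
The generated documentation gives no description for this method. Judging only by its name and int return type, it presumably returns the length of the longest common prefix of the two byte arrays. The following is a minimal sketch under that assumption, not the library's code; the class name CommonPrefixSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class CommonPrefixSketch {
    /**
     * Returns the number of leading bytes shared by s1 and s2.
     * Assumed behavior, inferred from the method name only.
     */
    public static int commonPrefix(byte[] s1, byte[] s2) {
        int max = Math.min(s1.length, s2.length);
        int i = 0;
        while (i < max && s1[i] == s2[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] form = "walked".getBytes(StandardCharsets.UTF_8);
        byte[] lemma = "walk".getBytes(StandardCharsets.UTF_8);
        System.out.println(commonPrefix(form, lemma)); // prints 4
    }
}
```

A prefix length of this kind is what the encodings below would need in order to decide how many trailing characters of the inflected form to strip.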

standardEncode

public byte[] standardEncode(byte[] wordForm,
                             byte[] wordLemma,
                             byte[] wordTag)
This method converts the wordForm, wordLemma and tag to the form:
 wordForm + Kending + tags
 
where '+' is a separator and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.
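
The standard encoding can be illustrated with a small self-contained sketch. It works on Java Strings rather than byte arrays, uses '+' as the separator, and assumes K is coded as in the prefix and infix encodings ("A" for 0, "B" for 1, and so on). The class StandardEncodeSketch and its encode method are hypothetical and are not part of MorphEncoder.

```java
public class StandardEncodeSketch {
    // Separator taken from the description above; assumed to play the
    // role of the annotationSeparator constructor argument.
    static final char SEPARATOR = '+';

    /** Builds "form + K ending + tag" for one dictionary entry. */
    public static String encode(String form, String lemma, String tag) {
        // Length of the common prefix of the inflected form and the lemma.
        int prefix = 0;
        int max = Math.min(form.length(), lemma.length());
        while (prefix < max && form.charAt(prefix) == lemma.charAt(prefix)) {
            prefix++;
        }
        // K: how many trailing characters of the form must be removed.
        char k = (char) ('A' + (form.length() - prefix));
        // Ending: what must be appended to the stripped form to get the lemma.
        String ending = lemma.substring(prefix);
        return form + SEPARATOR + k + ending + SEPARATOR + tag;
    }

    public static void main(String[] args) {
        System.out.println(encode("walked", "walk", "VBD")); // walked+C+VBD
        System.out.println(encode("mice", "mouse", "NNS"));  // mice+Douse+NNS
    }
}
```

In the second example K is "D": three characters ("ice") are removed from "mice", and the ending "ouse" is appended to the remaining "m" to give "mouse".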


prefixEncode

public byte[] prefixEncode(byte[] wordForm,
                           byte[] wordLemma,
                           byte[] wordTag)
This method converts wordForm, wordLemma and the tag to the form:

 inflected_form + LKending + tags
 

where '+' is a separator, L is a character that specifies how many characters should be deleted from the beginning of the inflected form ("A" means none, "B" means 1, "C" means 2, etc.), and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.

Parameters:
wordForm - inflected word form
wordLemma - canonical form
wordTag - tag
Returns:
the encoded string
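
As an illustration of the prefix encoding, the sketch below builds and decodes one entry. It is a simplified model on Java Strings with '+' as the separator. The number of leading characters to strip is passed in explicitly (the hypothetical stripFromStart parameter), because the description does not say how the encoder chooses L. PrefixEncodeSketch and its methods are not part of MorphEncoder.

```java
public class PrefixEncodeSketch {
    static final char SEPARATOR = '+';

    /** Builds "form + L K ending + tag" for a caller-chosen prefix length. */
    public static String encode(String form, String lemma, String tag, int stripFromStart) {
        // The form with its first stripFromStart characters removed.
        String rest = form.substring(stripFromStart);
        int prefix = 0;
        int max = Math.min(rest.length(), lemma.length());
        while (prefix < max && rest.charAt(prefix) == lemma.charAt(prefix)) {
            prefix++;
        }
        char l = (char) ('A' + stripFromStart);            // deleted at the start
        char k = (char) ('A' + (rest.length() - prefix));  // deleted at the end
        return form + SEPARATOR + l + k + lemma.substring(prefix) + SEPARATOR + tag;
    }

    /** Recovers the lemma from an encoded entry. */
    public static String decodeLemma(String encoded) {
        int first = encoded.indexOf(SEPARATOR);
        int second = encoded.indexOf(SEPARATOR, first + 1);
        String form = encoded.substring(0, first);
        String code = encoded.substring(first + 1, second);
        int l = code.charAt(0) - 'A';
        int k = code.charAt(1) - 'A';
        String ending = code.substring(2);
        return form.substring(l, form.length() - k) + ending;
    }

    public static void main(String[] args) {
        String entry = encode("gemacht", "machen", "VVPP", 2);
        System.out.println(entry);              // gemacht+CBen+VVPP
        System.out.println(decodeLemma(entry)); // machen
    }
}
```

Here L is "C" (the two characters "ge" are removed from the beginning), K is "B" (the final "t" is removed), and the ending "en" completes the lemma.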

infixEncode

public byte[] infixEncode(byte[] wordForm,
                          byte[] wordLemma,
                          byte[] wordTag)
This method converts wordForm, wordLemma and the tag to the form:
 inflected_form + MLKending + tags
 

where '+' is a separator, M is a character that specifies the position, counted from the beginning of the inflected form, at which deletion starts ("A" means the first character, "B" the second, "C" the third, and so on), L is a character that specifies how many characters should be deleted starting at the position given by M ("A" means none, "B" means 1, "C" means 2, etc.), and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.

Parameters:
wordForm - inflected word form
wordLemma - canonical form
wordTag - tag
Returns:
the encoded string
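
The meaning of M, L and K is easiest to see from the decoding side. The following sketch reconstructs the lemma from an infix-encoded entry exactly as the description reads: delete L characters at position M, delete K characters from the end, then append the ending. It operates on Java Strings with '+' as the separator; InfixDecodeSketch is a hypothetical class and is not part of MorphEncoder.

```java
public class InfixDecodeSketch {
    static final char SEPARATOR = '+';

    /** Recovers the lemma from "form + M L K ending + tag". */
    public static String decodeLemma(String encoded) {
        int first = encoded.indexOf(SEPARATOR);
        int second = encoded.indexOf(SEPARATOR, first + 1);
        String form = encoded.substring(0, first);
        String code = encoded.substring(first + 1, second);

        int m = code.charAt(0) - 'A'; // zero-based position where deletion starts
        int l = code.charAt(1) - 'A'; // number of characters deleted at that position
        int k = code.charAt(2) - 'A'; // number of characters deleted from the end
        String ending = code.substring(3);

        // Remove the infix, then the suffix, then append the ending.
        String withoutInfix = form.substring(0, m) + form.substring(m + l);
        return withoutInfix.substring(0, withoutInfix.length() - k) + ending;
    }

    public static void main(String[] args) {
        // "aufgemacht": drop "ge" at position 3, drop final "t", append "en".
        System.out.println(decodeLemma("aufgemacht+DCBen+VVPP")); // aufmachen
    }
}
```

With M = "A" and L = "A" nothing is removed from inside the word, so the entry behaves like the standard encoding with two extra code characters.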

asString

protected static java.lang.String asString(byte[] str,
                                           java.lang.String encoding)
Converts a byte array to a String using the given encoding.

Parameters:
str - Byte array to be converted.
encoding - Name of the character encoding used to decode the bytes.
Returns:
The decoded Java String, or an empty string if decoding is unsuccessful.
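
The documented contract (return the decoded string, or an empty string on failure) can be sketched as follows. This is an illustration of that contract using the standard String(byte[], String) constructor, assuming that "unsuccessful" refers to an unsupported encoding name. It is not the library's code, and AsStringSketch is a hypothetical class name.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class AsStringSketch {
    /** Decodes str with the named encoding; returns "" if the encoding is unsupported. */
    public static String asString(byte[] str, String encoding) {
        try {
            return new String(str, encoding);
        } catch (UnsupportedEncodingException e) {
            // Mirrors the documented behavior: failure yields an empty string.
            return "";
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "kot".getBytes(StandardCharsets.UTF_8);
        System.out.println(asString(bytes, "UTF-8"));                      // kot
        System.out.println(asString(bytes, "no-such-encoding").isEmpty()); // true
    }
}
```

Because a failure is reported as an empty string rather than an exception, callers cannot tell an empty input from an unsupported encoding without checking the input length themselves.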

standardEncodeUTF8

public java.lang.String standardEncodeUTF8(java.lang.String wordForm,
                                           java.lang.String wordLemma,
                                           java.lang.String wordTag)
                                    throws java.io.UnsupportedEncodingException
A UTF-8 variant of standardEncode(byte[], byte[], byte[]). This method converts the wordForm, wordLemma and tag to the form:
 wordForm + Kending + tags
 
where '+' is a separator and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.

Throws:
java.io.UnsupportedEncodingException

prefixEncodeUTF8

public java.lang.String prefixEncodeUTF8(java.lang.String wordForm,
                                         java.lang.String wordLemma,
                                         java.lang.String wordTag)
                                  throws java.io.UnsupportedEncodingException
A UTF-8 variant of prefixEncode(byte[], byte[], byte[]). This method converts wordForm, wordLemma and the tag to the form:
 inflected_form + LKending + tags
 

where '+' is a separator, L is a character that specifies how many characters should be deleted from the beginning of the inflected form ("A" means none, "B" means 1, "C" means 2, etc.), and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.

Parameters:
wordForm - inflected word form
wordLemma - canonical form
wordTag - tag
Returns:
the encoded string
Throws:
java.io.UnsupportedEncodingException

infixEncodeUTF8

public java.lang.String infixEncodeUTF8(java.lang.String wordForm,
                                        java.lang.String wordLemma,
                                        java.lang.String wordTag)
                                 throws java.io.UnsupportedEncodingException
A UTF-8 variant of infixEncode(byte[], byte[], byte[]). This method converts wordForm, wordLemma and the tag to the form:
 inflected_form + MLKending + tags
 

where '+' is a separator, M is a character that specifies the position, counted from the beginning of the inflected form, at which deletion starts ("A" means the first character, "B" the second, "C" the third, and so on), L is a character that specifies how many characters should be deleted starting at the position given by M ("A" means none, "B" means 1, "C" means 2, etc.), and K is a character that specifies how many characters should be deleted from the end of the inflected form ("A" means none, "B" means 1, "C" means 2, and so on). The lexeme is produced by concatenating the stripped string with the ending.

Parameters:
wordForm - inflected word form
wordLemma - canonical form
wordTag - tag
Returns:
the encoded string
Throws:
java.io.UnsupportedEncodingException