morfologik.stemming
Class DictionaryLookup

java.lang.Object
  extended by morfologik.stemming.DictionaryLookup
All Implemented Interfaces:
java.lang.Iterable<WordData>, IStemmer

public final class DictionaryLookup
extends java.lang.Object
implements IStemmer, java.lang.Iterable<WordData>

This class implements a dictionary lookup over an FSA dictionary. The dictionary for this class should be prepared from a text file using Jan Daciuk's FSA package (see link below).

Important: finite state automata in Jan Daciuk's implementation use bytes, not Unicode characters. Therefore, objects of this class must always be constructed with an encoding used to convert Java strings to byte arrays and back. UTF-8 is a reasonable choice, as it should not conflict with any control sequences or separator characters.
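
To illustrate why the encoding matters, the following minimal sketch converts a word to bytes and back using only the standard library. The class and method names (EncodingRoundTrip, toBytes, fromBytes, roundTrips) are hypothetical and not part of this package; the sketch only shows that a lossy encoding yields different bytes, and therefore a different symbol sequence, than the one stored in the automaton.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of morfologik: shows the string <-> byte
// conversion that an encoding-aware lookup has to perform.
class EncodingRoundTrip {
    // Encode a word into the byte sequence an automaton would be searched with.
    static byte[] toBytes(CharSequence word, Charset charset) {
        return word.toString().getBytes(charset);
    }

    // Decode bytes read from an automaton back into a Java string.
    static String fromBytes(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // True if the encoding can represent the word without loss.
    static boolean roundTrips(CharSequence word, Charset charset) {
        return fromBytes(toBytes(word, charset), charset).equals(word.toString());
    }

    public static void main(String[] args) {
        // A four-character Polish word, written with escapes to keep the source ASCII-only.
        String word = "\u017c\u00f3\u0142w";
        System.out.println("UTF-8 bytes: " + toBytes(word, StandardCharsets.UTF_8).length);
        System.out.println("UTF-8 round trip: " + roundTrips(word, StandardCharsets.UTF_8));
        System.out.println("US-ASCII round trip: " + roundTrips(word, StandardCharsets.US_ASCII));
    }
}
```

The number of bytes can differ from the number of characters, so byte offsets and character offsets are not interchangeable when working with dictionary data.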

See Also:
FSA package Web site

Constructor Summary
DictionaryLookup(Dictionary dictionary)
           Creates a new object of this class using the given dictionary, which provides the FSA for word lookups and the encoding for converting characters to bytes.
 
Method Summary
static java.nio.ByteBuffer decodeStem(java.nio.ByteBuffer bb, byte[] bytes, int len, java.nio.ByteBuffer inflectedBuffer, DictionaryMetadata metadata)
          Decode the base form of an inflected word and save its decoded form into a byte buffer.
 Dictionary getDictionary()
          Returns the Dictionary used by this object.
 java.util.Iterator<WordData> iterator()
          Return an iterator over all WordData entries available in the embedded Dictionary.
 java.util.List<WordData> lookup(java.lang.CharSequence word)
          Searches the automaton for a symbol sequence equal to word, followed by a separator.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DictionaryLookup

public DictionaryLookup(Dictionary dictionary)
                 throws java.lang.IllegalArgumentException

Creates a new object of this class using the given dictionary, which provides the FSA for word lookups and the encoding for converting characters to bytes.

Throws:
java.lang.IllegalArgumentException - if the FSA's root node cannot be acquired (the dictionary is empty).

Method Detail

lookup

public java.util.List<WordData> lookup(java.lang.CharSequence word)
Searches the automaton for a symbol sequence equal to word, followed by a separator. The result is a stem (decompressed according to the dictionary's specification) and optional tag data.

Specified by:
lookup in interface IStemmer
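
The "word followed by a separator" search can be sketched without an automaton. The example below is an illustration under stated assumptions, not this class's implementation: it stores entries as plain strings in a sorted set, assumes '+' as the separator, assumes an entry layout of word, stem and optional tag, and keeps stems uncompressed. All names in it (SeparatorLookupSketch, SEPARATOR, lookup) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not the morfologik implementation: a sorted set of
// "word+stem+tag" strings stands in for the FSA.
class SeparatorLookupSketch {
    // Assumed separator; real dictionaries define their own.
    static final char SEPARATOR = '+';

    // Returns {stem, tag} pairs for every entry that starts with word + separator.
    // The tag is null when the entry carries no tag data.
    static List<String[]> lookup(TreeSet<String> entries, CharSequence word) {
        String prefix = word.toString() + SEPARATOR;
        List<String[]> results = new ArrayList<>();
        // Entries sharing a prefix are contiguous in sorted order.
        for (String entry : entries.tailSet(prefix, true)) {
            if (!entry.startsWith(prefix)) {
                break;
            }
            String rest = entry.substring(prefix.length());
            int sep = rest.indexOf(SEPARATOR);
            String stem = sep < 0 ? rest : rest.substring(0, sep);
            String tag = sep < 0 ? null : rest.substring(sep + 1);
            results.add(new String[] { stem, tag });
        }
        return results;
    }

    public static void main(String[] args) {
        TreeSet<String> entries = new TreeSet<>(Arrays.asList(
                "cats+cat+noun", "catsup+catsup+noun", "running+run+verb"));
        for (String[] r : lookup(entries, "cats")) {
            System.out.println("stem=" + r[0] + ", tag=" + r[1]);
        }
    }
}
```

Appending the separator to the query is what keeps a lookup of "cats" from also matching the longer word "catsup" in this sketch.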

decodeStem

public static java.nio.ByteBuffer decodeStem(java.nio.ByteBuffer bb,
                                             byte[] bytes,
                                             int len,
                                             java.nio.ByteBuffer inflectedBuffer,
                                             DictionaryMetadata metadata)
Decode the base form of an inflected word and save its decoded form into a byte buffer.

Parameters:
bb - The byte buffer to save the result to. A new buffer may be allocated if the capacity of bb is not large enough to store the result. The buffer is not flipped upon return.
bytes - Bytes of the encoded base form, starting at index 0.
len - Length of the encoded base form.
inflectedBuffer - Inflected form's bytes (decoded properly).
Returns:
Returns either bb or a new buffer whose capacity is large enough to store the output of the decoded data.
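
The buffer contract described above (the result may be a different buffer, and it is not flipped) affects how callers read the output. The sketch below imitates only that contract with a hypothetical store helper that copies bytes verbatim; it does not perform any stem decoding, and the assumption that the target buffer is cleared before writing is specific to this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the buffer contract only; no stem decoding happens here.
class BufferContractSketch {
    // Writes len bytes into bb, or into a newly allocated buffer if bb is too small.
    // The returned buffer is NOT flipped: its position is at the end of the data.
    static ByteBuffer store(ByteBuffer bb, byte[] bytes, int len) {
        if (bb.capacity() < len) {
            bb = ByteBuffer.allocate(len);
        }
        bb.clear(); // assumption of this sketch: previous content is discarded
        bb.put(bytes, 0, len);
        return bb;
    }

    // The caller flips the buffer before reading what was written.
    static String read(ByteBuffer bb) {
        bb.flip();
        return StandardCharsets.UTF_8.decode(bb).toString();
    }

    public static void main(String[] args) {
        byte[] data = "stem".getBytes(StandardCharsets.UTF_8);
        ByteBuffer reusable = ByteBuffer.allocate(2);
        // Always continue with the returned buffer, since it may be a new instance.
        reusable = store(reusable, data, data.length);
        System.out.println(read(reusable));
    }
}
```

Callers that reuse a buffer across calls should keep the returned reference rather than the one they passed in, because the two may differ.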

iterator

public java.util.Iterator<WordData> iterator()
Return an iterator over all WordData entries available in the embedded Dictionary.

Specified by:
iterator in interface java.lang.Iterable<WordData>

getDictionary

public Dictionary getDictionary()
Returns:
The Dictionary used by this object.