java.lang.Object
  morfologik.stemming.DictionaryLookup
public final class DictionaryLookup
This class implements a dictionary lookup over an FSA dictionary. The dictionary for this class should be prepared from a text file using Jan Daciuk's FSA package (see link below).
Important: finite state automata in Jan Daciuk's implementation operate on bytes, not Unicode characters. Objects of this class must therefore always be constructed with an encoding, which is used to convert Java strings to byte arrays and back. UTF-8 is a suitable choice, as it should not conflict with any control sequences or separator characters.
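The string-to-bytes conversion described above can be illustrated with the standard library alone. This is a minimal sketch, not code from this class: the class name `EncodingRoundTrip` and its helper methods are hypothetical, and UTF-8 is assumed as the dictionary encoding.

```java
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    /** Encodes a Java string to UTF-8 bytes, the form a byte-based automaton would consume. */
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes the first len bytes back into a Java string. */
    static String fromBytes(byte[] bytes, int len) {
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four Polish letters written as Unicode escapes so the source file stays ASCII.
        String word = "\u017c\u00f3\u0142\u0107";
        byte[] encoded = toBytes(word);

        // Each of these letters takes two bytes in UTF-8.
        System.out.println(word.length() + " chars -> " + encoded.length + " bytes"); // 4 chars -> 8 bytes

        // Decoding with the same charset restores the original string.
        System.out.println(fromBytes(encoded, encoded.length).equals(word)); // true
    }
}
```

The point of the sketch is that character counts and byte counts differ for non-ASCII input, so the same encoding has to be used when the dictionary is built and when it is queried.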
Constructor Summary | |
---|---|
DictionaryLookup(Dictionary dictionary) | Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes. |
Method Summary | |
---|---|
static java.nio.ByteBuffer | decodeStem(java.nio.ByteBuffer bb, byte[] bytes, int len, java.nio.ByteBuffer inflectedBuffer, DictionaryMetadata metadata): Decode the base form of an inflected word and save its decoded form into a byte buffer. |
Dictionary | getDictionary() |
java.util.Iterator<WordData> | iterator(): Return an iterator over all WordData entries available in the embedded Dictionary. |
java.util.List<WordData> | lookup(java.lang.CharSequence word): Searches the automaton for a symbol sequence equal to word, followed by a separator. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DictionaryLookup(Dictionary dictionary) throws java.lang.IllegalArgumentException
Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes.
Throws:
java.lang.IllegalArgumentException - if the FSA's root node cannot be acquired (dictionary is empty).
Method Detail |
---|
public java.util.List<WordData> lookup(java.lang.CharSequence word)
Searches the automaton for a symbol sequence equal to word, followed by a separator. The result is a stem (decompressed according to the dictionary's specification) and optional tag data.
Specified by:
lookup in interface IStemmer
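The "word followed by a separator" idea can be sketched without an automaton. The example below is an illustration only and is not how this class is implemented: it replaces the FSA with a sorted set of plain strings, assumes `+` as the separator character, and skips stem compression entirely. The class name `SeparatorLookupSketch` and the entry layout `word+stem+tag` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class SeparatorLookupSketch {
    // Assumed separator; a real dictionary defines its own.
    static final char SEPARATOR = '+';

    /**
     * Returns {stem, tag} pairs for every entry that starts with the word
     * immediately followed by the separator. The tag is "" when absent.
     */
    static List<String[]> lookup(TreeSet<String> entries, String word) {
        String prefix = word + SEPARATOR;
        List<String[]> results = new ArrayList<>();
        // Entries sharing a prefix are contiguous in sorted order, starting at the first entry >= prefix.
        for (String entry : entries.tailSet(prefix)) {
            if (!entry.startsWith(prefix)) {
                break;
            }
            String rest = entry.substring(prefix.length());
            int sep = rest.indexOf(SEPARATOR);
            String stem = sep < 0 ? rest : rest.substring(0, sep);
            String tag = sep < 0 ? "" : rest.substring(sep + 1);
            results.add(new String[] {stem, tag});
        }
        return results;
    }

    public static void main(String[] args) {
        TreeSet<String> entries = new TreeSet<>(
                Arrays.asList("cat+cat+NN", "cats+cat+NNS", "catsup+catsup"));
        for (String[] r : lookup(entries, "cats")) {
            System.out.println("stem=" + r[0] + " tag=" + r[1]); // stem=cat tag=NNS
        }
    }
}
```

Appending the separator before searching is what keeps `cats` from matching the longer entry `catsup`: only entries where the word ends exactly at a separator are returned.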
public static java.nio.ByteBuffer decodeStem(java.nio.ByteBuffer bb, byte[] bytes, int len, java.nio.ByteBuffer inflectedBuffer, DictionaryMetadata metadata)
Decode the base form of an inflected word and save its decoded form into a byte buffer.
Parameters:
bb - The byte buffer to save the result to. A new buffer may be allocated if the capacity of bb is not large enough to store the result. The buffer is not flipped upon return.
inflectedBuffer - Inflected form's bytes (decoded properly).
bytes - Bytes of the encoded base form, starting at index 0.
len - Length of the encoded base form.
Returns:
Either bb or a new buffer whose capacity is large enough to store the output of the decoded data.
public java.util.Iterator<WordData> iterator()
Return an iterator over all WordData entries available in the embedded Dictionary.
Specified by:
iterator in interface java.lang.Iterable<WordData>
public Dictionary getDictionary()
Returns:
The Dictionary used by this object.
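The decodeStem documentation notes that the output buffer is not flipped upon return and that a new buffer may be returned instead of bb. The sketch below shows what that implies for a caller, using only java.nio. The method `writeInto` is a hypothetical stand-in with the same contract (write into the buffer, grow if needed, return unflipped); it is not the library's decodeStem and does no stem decoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class UnflippedBufferDemo {
    /**
     * Hypothetical stand-in: writes data into bb, allocating a larger buffer
     * if bb is too small, and returns the buffer without flipping it.
     */
    static ByteBuffer writeInto(ByteBuffer bb, byte[] data) {
        bb.clear();
        if (bb.capacity() < data.length) {
            bb = ByteBuffer.allocate(data.length);
        }
        bb.put(data);
        return bb; // position == data.length; the caller must flip before reading
    }

    /** Flips the buffer so that reading starts at 0 and stops where writing ended. */
    static String readBack(ByteBuffer bb) {
        bb.flip();
        return StandardCharsets.UTF_8.decode(bb).toString();
    }

    public static void main(String[] args) {
        ByteBuffer small = ByteBuffer.allocate(2);
        ByteBuffer result = writeInto(small, "stem".getBytes(StandardCharsets.UTF_8));

        // The input buffer was too small, so a different buffer came back.
        System.out.println(result == small); // false

        // Always read from the returned buffer, and flip it first.
        System.out.println(readBack(result)); // stem
    }
}
```

Two habits follow from this contract: keep using the returned reference rather than the buffer that was passed in, and call `flip()` before decoding or copying the bytes out.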