morfologik.fsa
Class CFSA

java.lang.Object
  extended by morfologik.fsa.FSA
      extended by morfologik.fsa.CFSA
All Implemented Interfaces:
java.lang.Iterable<java.nio.ByteBuffer>

public final class CFSA
extends FSA

CFSA (Compact Finite State Automaton) binary format implementation. This is a slightly reorganized version of FSA5 offering smaller automata size at some (minor) performance penalty.

Note: Serialize to CFSA2 for new code.

The encoding of automaton body is as follows.

 ---- FSA header (standard)
 Byte                            Description 
       +-+-+-+-+-+-+-+-+\
     0 | | | | | | | | | +------ '\'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ 'f'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     2 | | | | | | | | | +------ 's'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     3 | | | | | | | | | +------ 'a'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     4 | | | | | | | | | +------ version (fixed 0xc5)
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     5 | | | | | | | | | +------ filler character
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     6 | | | | | | | | | +------ annot character
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     7 |C|C|C|C|G|G|G|G| +------ C - node data size (ctl), G - address size (gotoLength)
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
  8-32 | | | | | | | | | +------ labels mapped for type (1) of arc encoding. 
       : : : : : : : : : |
       +-+-+-+-+-+-+-+-+/
 
 ---- Start of a node; only if automaton was compiled with NUMBERS option.
 
 Byte
        +-+-+-+-+-+-+-+-+\
      0 | | | | | | | | | \  LSB
        +-+-+-+-+-+-+-+-+  +
      1 | | | | | | | | |  |      number of strings recognized
        +-+-+-+-+-+-+-+-+  +----- by the automaton starting
        : : : : : : : : :  |      from this node.
        +-+-+-+-+-+-+-+-+  +
  ctl-1 | | | | | | | | | /  MSB
        +-+-+-+-+-+-+-+-+/
        
 ---- A vector of node's arcs. Conditional format, depending on flags.
 
 1) NEXT bit set, mapped arc label. 
 
                +--------------- arc's label mapped in M bits if M's field value > 0
                | +------------- node pointed to is next
                | | +----------- the last arc of the node
         _______| | | +--------- the arc is final
        /       | | | |
       +-+-+-+-+-+-+-+-+\
     0 |M|M|M|M|M|1|L|F| +------ flags + (M) index of the mapped label.
       +-+-+-+-+-+-+-+-+/
 
 2) NEXT bit set, label separate.
 
                +--------------- arc's label stored separately (M's field is zero).
                | +------------- node pointed to is next
                | | +----------- the last arc of the node
                | | | +--------- the arc is final
                | | | |
       +-+-+-+-+-+-+-+-+\
     0 |0|0|0|0|0|1|L|F| +------ flags
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ label
       +-+-+-+-+-+-+-+-+/
 
 3) NEXT bit not set. Full arc.
 
                  +------------- node pointed to is next
                  | +----------- the last arc of the node
                  | | +--------- the arc is final
                  | | |
       +-+-+-+-+-+-+-+-+\
     0 |A|A|A|A|A|0|L|F| +------ flags + (A) address field, lower bits
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ label
       +-+-+-+-+-+-+-+-+/
       : : : : : : : : :       
       +-+-+-+-+-+-+-+-+\
 gtl-1 |A|A|A|A|A|A|A|A| +------ address, continuation (MSB)
       +-+-+-+-+-+-+-+-+/
 


Field Summary
 byte[] arcs
          An array of bytes with the internal representation of the automaton.
static int BIT_FINAL_ARC
          Bitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.
static int BIT_LAST_ARC
          Bitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.
static int BIT_TARGET_NEXT
          Bitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).
 int gtl
          Number of bytes each address takes in full, expanded form (goto length).
 byte[] labelMapping
          Label mapping for arcs of type (1) (see class documentation).
 int nodeDataLength
          The length of the node header structure (if the automaton was compiled with NUMBERS option).
static byte VERSION
          Automaton header version value.
 
Constructor Summary
CFSA(java.io.InputStream fsaStream)
          Creates a new automaton, reading it from a file in FSA format, version 5.
 
Method Summary
 int getArc(int node, byte label)
          
 byte getArcLabel(int arc)
          Return the label associated with a given arc.
 int getEndNode(int arc)
          Return the end node pointed to by a given arc.
 int getFirstArc(int node)
          
 java.util.Set<FSAFlags> getFlags()
          Returns a set of flags for this FSA instance.
 int getNextArc(int arc)
          
 int getRightLanguageCount(int node)
          
 int getRootNode()
          Returns the start node of this automaton.
 boolean isArcFinal(int arc)
          Returns true if the destination node at the end of this arc corresponds to an input sequence created when building this automaton.
 boolean isArcLast(int arc)
          Returns true if this arc has NEXT bit set.
 boolean isArcTerminal(int arc)
          Returns true if this arc does not have a terminating node (@link FSA.getEndNode(int) will throw an exception).
 boolean isLabelCompressed(int arc)
          Returns true if the label is compressed inside flags byte.
 boolean isNextSet(int arc)
           
 
Methods inherited from class morfologik.fsa.FSA
getArcCount, getSequences, getSequences, iterator, read, visitAllStates, visitInPostOrder, visitInPostOrder, visitInPreOrder, visitInPreOrder
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERSION

public static final byte VERSION
Automaton header version value.

See Also:
Constant Field Values

BIT_FINAL_ARC

public static final int BIT_FINAL_ARC
Bitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.

See Also:
Constant Field Values

BIT_LAST_ARC

public static final int BIT_LAST_ARC
Bitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.

See Also:
Constant Field Values

BIT_TARGET_NEXT

public static final int BIT_TARGET_NEXT
Bitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).

See Also:
Constant Field Values

arcs

public byte[] arcs
An array of bytes with the internal representation of the automaton. Please see the documentation of this class for more information on how this structure is organized.


nodeDataLength

public final int nodeDataLength
The length of the node header structure (if the automaton was compiled with NUMBERS option). Otherwise zero.


gtl

public final int gtl
Number of bytes each address takes in full, expanded form (goto length).


labelMapping

public final byte[] labelMapping
Label mapping for arcs of type (1) (see class documentation). The array is indexed by mapped label's value and contains the original label.

Constructor Detail

CFSA

public CFSA(java.io.InputStream fsaStream)
     throws java.io.IOException
Creates a new automaton, reading it from a file in FSA format, version 5.

Throws:
java.io.IOException
Method Detail

getRootNode

public int getRootNode()
Returns the start node of this automaton. May return 0 if the start node is also an end node.

Specified by:
getRootNode in class FSA
Returns:
Returns the identifier of the root node of this automaton. Returns 0 if the start node is also the end node (the automaton is empty).

getFirstArc

public final int getFirstArc(int node)

Specified by:
getFirstArc in class FSA
Returns:
Returns the identifier of the first arc leaving node or 0 if the node has no outgoing arcs.

getNextArc

public final int getNextArc(int arc)

Specified by:
getNextArc in class FSA
Returns:
Returns the identifier of the next arc after arc and leaving node. Zero is returned if no more arcs are available for the node.

getArc

public int getArc(int node,
                  byte label)

Specified by:
getArc in class FSA
Returns:
Returns the identifier of an arc leaving node and labeled with label. An identifier equal to 0 means the node has no outgoing arc labeled label.

getEndNode

public int getEndNode(int arc)
Return the end node pointed to by a given arc. Terminal arcs (those that point to a terminal state) have no end node representation and throw a runtime exception.

Specified by:
getEndNode in class FSA

getArcLabel

public byte getArcLabel(int arc)
Return the label associated with a given arc.

Specified by:
getArcLabel in class FSA

getRightLanguageCount

public int getRightLanguageCount(int node)

Overrides:
getRightLanguageCount in class FSA
Returns:
Returns the number of sequences reachable from the given state if the automaton was compiled with FSAFlags.NUMBERS. The size of the right language of the state, in other words.

isArcFinal

public boolean isArcFinal(int arc)
Returns true if the destination node at the end of this arc corresponds to an input sequence created when building this automaton.

Specified by:
isArcFinal in class FSA

isArcTerminal

public boolean isArcTerminal(int arc)
Returns true if this arc does not have a terminating node (@link FSA.getEndNode(int) will throw an exception). Implies FSA.isArcFinal(int).

Specified by:
isArcTerminal in class FSA

isArcLast

public boolean isArcLast(int arc)
Returns true if this arc has NEXT bit set.

See Also:
BIT_LAST_ARC

isNextSet

public boolean isNextSet(int arc)
See Also:
BIT_TARGET_NEXT

isLabelCompressed

public boolean isLabelCompressed(int arc)
Returns true if the label is compressed inside flags byte.


getFlags

public java.util.Set<FSAFlags> getFlags()
Returns a set of flags for this FSA instance.

For this automaton version, an additional FSAFlags.NUMBERS flag may be set to indicate the automaton contains extra fields for each node.

Specified by:
getFlags in class FSA