Solving a regular expression ambiguity for Hangul syllables

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Nov 20 2003 - 11:11:15 EST

  • Next message: Ostermueller, Erik: "FW: creating a test font w/ CJKV Extension B characters."

    I note that the current definition of Hangul syllable sequences uses an ambiguous regular expression, where two alternates can be used to match the expression. Also its analysis in a top-down parser will use backward fallbacks to try the alternates.

    $Hangul_Sequence = ((($L+ $LV?) | ($L* $LV)) $V* $T*) | ($L* $LVT $T*);

    This is more visible if written like this:

    $Hangul_Sequence = ( ( ($L+ $LV?)
                         | ($L* $LV )
                         ) $V* $T*)
                     | ( $L* $LVT $T*);

    Some remarks about this expression:
    - See the first two lines for the ambiguous alternates when trying to match a string with form ($L+ $LV).
    - See how the trailing $T* can be factorized for the first alternate (first three lines) and the second alternate (last line).
    - See how it's also impossible to discriminate these first two alternatives after reading just a string starting with $L+.
    - It is only deterministic after reading a string starting by $LV or $LVT.

    A better regular expression without this problem (for faster performance in non-deterministic automatas) is:

    $Hangul_Sequence = ($L+ ( ($LVT $T*)
                            | ($LV $V* $T*)
                            | ( $V* $T*)
                            ))
                     | ($LVT $T*)
                     | ($LV $V* $T*);

    This regular expression, which matches exactly the same "language" becomes fully determinist.

    If one wants to include the factorisation of trailing $T* (which may just reduce the number of states in non optimizing regular expression engines, but does not benefit really to performance), this one works as well:

    $Hangul_Sequence = ( ($L+ ( $LVT
                              | ($LV $V*)
                              | $V*
                              ))
                       | $LVT
                       | ($LV $V*)
                       ) $T*;



    This archive was generated by hypermail 2.1.5 : Thu Nov 20 2003 - 12:13:27 EST