Re: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Mike (mike-list@pobox.com)
Date: Sun Sep 23 2007 - 13:19:55 CDT


    >> Another experimental part of my implementation is that
    >> a pattern can only match if it starts and ends on a
    >> grapheme cluster boundary. This prevents, for example,
    >> the Hangul syllable \uAC00 from matching the first part
    >> of \uAC01 which is composed of the same leading and vowel
    >> jamos, but which also has a trailing jamo.
    >
    > Maybe, for Korean users, your requirement to match on grapheme clusters will
    > not make much sense, given that Hangul is an alphabet, where each
    > component jamo is a separate jamo that should be matchable, whatever its
    > position in the syllable square.

    I took my cue from the default grapheme cluster boundary definition.
    If this is not what users would expect, then I will change it. The
    only Korean I know is (and forgive the poor spelling) "on yong ha say
    yo", so I'd appreciate hearing from a native speaker.
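A rough Python sketch of the restriction described above, not the actual implementation: both strings are normalized to NFD, and a hit is accepted only if it starts and ends on a cluster boundary. The helper names (`hst`, `is_boundary`, `find_on_boundaries`) are hypothetical, the syllable types are approximated from code point ranges, and only the Hangul rules of the default grapheme cluster definition are modeled (combining marks and other rules are ignored).

```python
import unicodedata

def hst(ch):
    """Approximate Hangul_Syllable_Type from code point ranges (assumption)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None

def is_boundary(text, i):
    """True if index i is a cluster boundary, Hangul rules only."""
    if i <= 0 or i >= len(text):
        return True
    a, b = hst(text[i - 1]), hst(text[i])
    if a == "L" and b in ("L", "V", "LV", "LVT"):
        return False
    if a in ("LV", "V") and b in ("V", "T"):
        return False
    if a in ("LVT", "T") and b == "T":
        return False
    return True

def find_on_boundaries(needle, haystack):
    """Index (in NFD text) of the first boundary-aligned match, or -1."""
    n = unicodedata.normalize("NFD", needle)
    h = unicodedata.normalize("NFD", haystack)
    start = h.find(n)
    while start != -1:
        if is_boundary(h, start) and is_boundary(h, start + len(n)):
            return start
        start = h.find(n, start + 1)
    return -1

# U+AC00 is a prefix of NFD(U+AC01), but the match ends inside the cluster.
print(find_on_boundaries("\uAC00", "\uAC01"))        # -1
print(find_on_boundaries("\uAC00", "\uAC01\uAC00"))  # 3
```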

    I agree that looking for a specific vowel jamo would be useful, but
    there is an implementation problem associated with that. Suppose
    you want to find the vowel jamo U+1161 (HANGUL JUNGSEONG A), and the
    input string is U+AC01. The jamo is the middle of the three characters
    in the NFD decomposition of U+AC01, so, yes, it matches. But how would you specify
    where the match occurred?
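To make the reporting problem concrete, here is a minimal Python sketch that searches the NFD text but reports positions in the original string. `nfd_with_offsets` and `locate` are hypothetical helpers, and normalizing one character at a time is a simplification (it skips canonical reordering across characters, which does not affect the Hangul case). The `partial` flag marks a hit that covers only part of an original character, which is exactly the awkward case.

```python
import unicodedata

def nfd_with_offsets(text):
    """Return the decomposed text and, for each decomposed code point,
    the index of the original character it came from."""
    pieces, origin = [], []
    for i, ch in enumerate(text):
        d = unicodedata.normalize("NFD", ch)
        pieces.append(d)
        origin.extend([i] * len(d))
    return "".join(pieces), origin

def locate(needle, text):
    """Return (start, end, partial) in original-string indices, or None.
    `needle` is assumed to be non-empty and already in NFD."""
    if not needle:
        return None
    nfd, origin = nfd_with_offsets(text)
    pos = nfd.find(needle)
    if pos == -1:
        return None
    last = pos + len(needle) - 1
    # The hit is partial if it starts or ends inside a decomposed character.
    cuts_start = pos > 0 and origin[pos - 1] == origin[pos]
    cuts_end = last + 1 < len(nfd) and origin[last + 1] == origin[last]
    return origin[pos], origin[last] + 1, cuts_start or cuts_end

# U+1161 is found, but only "inside" the single character U+AC01.
print(locate("\u1161", "\uAC01"))  # (0, 1, True)
```

The best such a scheme can do is report the whole enclosing character and flag that the match covers only part of it.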

    > Making this restriction would be the equivalent of forcing Latin users to
    > not find any matches for "s" in "stress" (that has a single syllable),
    > because the "s" are used in either leading or trailing position of the
    > syllable.

    This makes sense on the surface.

    > For me, the leading/middle/trailing position of a jamo in a Hangul cluster
    > should be treated as a contextual condition (similar to your existing
    > implementation for word break boundaries).

    I'm simply using the Hangul_Syllable_Type for this; it seems that
    the placement is important, or this property wouldn't be needed....

    > Note also that your \L\V+\T+ syntax will not find all Hangul clusters (it
    > will forget compatibility jamos that internally don't encode their own
    > status as leading/middle/trailing position, much like our Latin letters
    > that don't themselves encode their position in a syllable). And it will
    > forget "defective" clusters that start with a "middle" vowel without a
    > leading consonant.

    I believe you can use \p{Script=Hangul} to find the compatibility
    jamos also. Is the future of Korean to just use the \L\V\T? syllables?
    If so, then I would argue you want to support them.

    To include defective clusters, you could look for \g[\L\V\T]+?\g, where
    \g is a grapheme cluster boundary.
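Python's standard `re` module has no \L, \V, \T or \g, so the following is only an approximation using explicit code point ranges (the ranges are an assumption standing in for the Hangul_Syllable_Type classes). It contrasts the strict \L\V+\T+ shape with a loose run of any conjoining jamos, which also picks up defective sequences. Without a real \g, a loose run may span more than one cluster.

```python
import re

# Assumed ranges for conjoining jamos in the Hangul Jamo block.
L = "\u1100-\u115F"   # leading consonants
V = "\u1160-\u11A7"   # vowels
T = "\u11A8-\u11FF"   # trailing consonants

strict = re.compile(f"[{L}][{V}]+[{T}]+")   # roughly \L\V+\T+
loose = re.compile(f"[{L}{V}{T}]+")         # roughly [\L\V\T]+

full = "\u1100\u1161\u11A8"   # leading + vowel + trailing
defective = "\u1161\u11A8"    # vowel + trailing, no leading consonant

print(bool(strict.search(full)))       # True
print(bool(strict.search(defective)))  # False
print(bool(loose.search(defective)))   # True
```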

    > Your \L and \T are just character classes but they represent the same hangul
    > letters, just in particular positions in a cluster (a jamo is a normal
    > letter plus its leading/middle/trailing status in a cluster, compatibility
    > jamos just encode an unknown status that must be computed in a complex way
    > using dictionary lookups, just like we need dictionary lookups and complex
    > rules to find syllable breaks in Latin).

    I think that a regex implementation tailored for Korean could be much
    more complex. But for a general purpose implementation, dictionary
    lookup should not be expected.

    > Note that \V only contains letters that are composable in default combining
    > sequences, but some compatibility jamos are not ambiguous in Korean, even
    > though they are encoded to be allowed in isolation without needing any prior
    > hangul filler and without creating a defective syllable.
    >
    > So I would suggest allowing matches for jamos independently of their
    > position (and independently of their precomposed state), and computing other
    > properties that are not character-based but based on cluster boundaries:
    >
    > * Add something like \J to match any jamo (or precomposed jamo, or
    > compatible jamo): in fact it would be a shortcut for a character class that
    > includes all Hangul characters (including compatibility jamos).

    I like the idea of adding \J for arbitrary jamos. Should it be
    equivalent to \p{Script=Hangul}?

    > * Add something like \b to match cluster boundary conditions: it will match
    > just before any leading jamo at the beginning of a cluster, or before a
    > compatibility jamo, or before any other cluster, but never in the middle. It
    > will also match before a base character in a combining sequence. It will
    > also match after a middle or trailing jamo at end of the cluster, or after a
    > compatibility jamo, or after the last combining character of a combining
    > sequence.
    > * Add something like \B its complement (it will match only in the middle of
    > a cluster, including between a base letter and a combining character, but
    > not at start and not at end of such sequences).

    I have already assigned \g to match a grapheme cluster boundary, and
    \G to be the complement. \b and \B still mean word boundary.

    > * Allow Hangul search patterns where position distinctions are not
    > significant: a search string like \j{hangul text} would treat each letter of
    > the hangul text as if it was made with compatibility jamos, ignoring L/V/T
    > differences, these characters being replaced in fact by character classes
    > (where matching L and V letters are part of the same class as the matching
    > compatibility jamo that encode the same Hangul letter). The expression
    > \j{hangul text} will not encode any cluster boundary condition, not even at
    > start or end.

    Is there an equivalence table indicating which L's match which T's?
    There are 28 T's, but only 19 L's. How do you map the compatibility
    jamos?
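One way to explore the question, sketched in Python: group the modern leading consonants, trailing consonants, and compatibility consonants by the letter name that follows the "HANGUL CHOSEONG", "HANGUL JONGSEONG", and "HANGUL LETTER" prefixes in their character names. This is only a name-based heuristic, not an official equivalence table, and `jamo_table` is a hypothetical helper. The modern trailing range holds 27 jamos; the 28th T is presumably the "no trailing consonant" case.

```python
import unicodedata

# Modern jamo ranges and the character-name prefix used in each.
GROUPS = {
    "L": (range(0x1100, 0x1113), "HANGUL CHOSEONG "),   # 19 leading
    "T": (range(0x11A8, 0x11C3), "HANGUL JONGSEONG "),  # 27 trailing
    "C": (range(0x3131, 0x314F), "HANGUL LETTER "),     # compatibility consonants
}

def jamo_table():
    """Map each letter name to the code points that carry it."""
    table = {}
    for kind, (cps, prefix) in GROUPS.items():
        for cp in cps:
            name = unicodedata.name(chr(cp))
            if name.startswith(prefix):
                table.setdefault(name[len(prefix):], {})[kind] = cp
    return table

table = jamo_table()
shared = [n for n, e in table.items() if "L" in e and "T" in e]
only_l = [n for n, e in table.items() if "L" in e and "T" not in e]

print(len(shared))     # letters usable in both positions
print(sorted(only_l))  # leading consonants with no trailing counterpart
print({k: f"U+{v:04X}" for k, v in table["KIYEOK"].items()})
```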

    > So:
    >
    > * "\b\j{hangul consonant}" to find a particular leading cluster consonant
    > in Hangul (it will match only the first consonant, not a second one before
    > the vowel).

    Is this any different from "\g\u1100"?

    > * "\j{hangul vowel}" to find a particular vowel in Hangul
    > (it will match any \V letter or compatibility vowels).

    Can't you just use \u for this?

    > * "\j{hangul consonant}\b" to find a particular trailing cluster consonant
    > in Hangul (it will match only the last consonant, not a prior one after the
    > vowel).

    \u11A8\g ?

    > * all Hangul syllable clusters (even the defective ones, or those made with
    > compatibility jamos) would be matched using: \b\J(\B\J)+\b.

    Or simply \g\J+?\g

    > (not supported in this regular expression: do we need to accept some other
    > combining characters other than Hangul letters in such clusters?)

    My code keeps any trailing marks with the Hangul syllable (it's all
    one big grapheme cluster). You could extend it to \g\J+?\m*\g
    Are combining characters used in Korean? Or is this just academic?

    > Also, for users of RTL scripts, it would be useful to be allowed to detect
    > direction boundaries, to allow expressions that will work with
    > transformations of BiDi overrides or embedding or mirroring conditions (for
    > characters that are not mirrored in Unicode but need to be replaced,
    > depending on the current direction, like quotation marks).

    I have no experience with RTL scripts; could you propose some syntax
    for this?

    > Also do you support Arabic joining types (that have some similarities with
    > Hangul jamos composition states)? Given that you use NFD, these distinctions
    > are lost on the per-character basis, but are remaining as contextual
    > conditions, creating new boundary conditions similar to syllabic breaks
    > (i.e. breaks between sequences of base letters).

    I support the joining type property, \p{jt=C}. I don't know what effect
    NFD decomposition has....
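A quick check with Python's `unicodedata` suggests an answer for the Arabic case: the positional presentation forms carry compatibility decompositions, so NFD leaves them alone and only NFKD folds them to the base letter. NFD does split canonically composed letters such as ALEF WITH MADDA ABOVE. The `codepoints` helper is just for display, and the joining type itself is not exposed by `unicodedata`, so this shows only the normalization side.

```python
import unicodedata

final_alef = "\uFE8E"   # ARABIC LETTER ALEF FINAL FORM
alef_madda = "\u0622"   # ARABIC LETTER ALEF WITH MADDA ABOVE

def codepoints(s):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

# Compatibility decomposition only: NFD keeps the presentation form.
print(codepoints(unicodedata.normalize("NFD", final_alef)))   # ['U+FE8E']
print(codepoints(unicodedata.normalize("NFKD", final_alef)))  # ['U+0627']

# Canonical decomposition: NFD splits base letter and combining mark.
print(codepoints(unicodedata.normalize("NFD", alef_madda)))   # ['U+0627', 'U+0653']
```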

    Mike



    This archive was generated by hypermail 2.1.5 : Sun Sep 23 2007 - 13:24:16 CDT