RE: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Sep 23 2007 - 00:30:43 CDT

    Mike wrote:
    > Another experimental part of my implementation is that
    > a pattern can only match if it starts and ends on a
    > grapheme cluster boundary. This prevents, for example,
    > the Hangul syllable \uAC00 from matching the first part
    > of \uAC01 which is composed of the same leading and vowel
    > jamos, but which also has a trailing jamo.
    >
    > If anybody thinks that any of this is bad design, I'd
    > be happy to hear suggestions for improvement!

    Maybe, for Korean users, your requirement to match on grapheme clusters will
    not make much sense, given that Hangul is an alphabet in which each
    component jamo is a separate letter that should be matchable, whatever its
    position in the syllable square.

    Making this restriction would be the equivalent of forcing Latin users to
    not find any matches for "s" in "stress" (which has a single syllable),
    because each "s" is used in either the leading or the trailing position of
    the syllable.
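
    To make the quoted case concrete, here is a minimal sketch in Python. It
    uses the standard unicodedata module, which is my choice for illustration
    and not Mike's implementation, and the helper name codepoints is
    hypothetical. It shows that, once text is in NFD, U+AC00 is a strict prefix
    of U+AC01, which is the situation the cluster-boundary rule is meant to
    block.

```python
import unicodedata

# Canonical decomposition splits each precomposed syllable into conjoining jamos.
ga = unicodedata.normalize("NFD", "\uAC00")   # GA:  leading + vowel
gag = unicodedata.normalize("NFD", "\uAC01")  # GAG: leading + vowel + trailing


def codepoints(s):
    """Format a string as a list of U+XXXX code points (illustrative helper)."""
    return [f"U+{ord(c):04X}" for c in s]


print(codepoints(ga))       # ['U+1100', 'U+1161']
print(codepoints(gag))      # ['U+1100', 'U+1161', 'U+11A8']
print(gag.startswith(ga))   # True
```

    The same decomposition is also what makes the individual letters reachable
    by a search, which is the point argued in this reply.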

    For me, the leading/middle/trailing position of a jamo in a Hangul cluster
    should be treated as a contextual condition (similar to your existing
    implementation for word break boundaries).

    Note also that your \L\V+\T+ syntax will not find all Hangul clusters: it
    will miss compatibility jamos, which do not encode their own
    leading/middle/trailing status (much as our Latin letters do not encode
    their position in a syllable). And it will miss "defective" clusters that
    start with a "middle" vowel without a leading consonant.
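
    A rough sketch of both misses, in Python. The three ranges below are
    assumed stand-ins for \L, \V and \T (I do not know how Mike's classes are
    actually defined); they simply cover the conjoining jamos of the Hangul
    Jamo block.

```python
import re

# Assumed stand-ins for \L, \V and \T: conjoining jamos of the Hangul Jamo block.
L = "[\u1100-\u115F]"   # leading consonants
V = "[\u1160-\u11A7]"   # vowels
T = "[\u11A8-\u11FF]"   # trailing consonants
LVT = re.compile(f"{L}{V}+{T}+")

samples = {
    "conjoining L V T": "\u1100\u1161\u11A8",
    "compatibility jamos": "\u3131\u314F\u3131",   # same letters, no positional status
    "defective V T": "\u1161\u11A8",               # no leading consonant
}

for label, text in samples.items():
    print(label, bool(LVT.search(text)))
# conjoining L V T True
# compatibility jamos False
# defective V T False
```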

    Your \L and \T are just character classes, but they represent the same
    Hangul letters in particular positions in a cluster. A jamo is a normal
    letter plus its leading/middle/trailing status in a cluster; compatibility
    jamos just encode an unknown status that must be computed in a complex way
    using dictionary lookups, just as we need dictionary lookups and complex
    rules to find syllable breaks in Latin.

    Note that \V only contains letters that are composable in default combining
    sequences, but some compatibility jamos are not ambiguous in Korean, even
    though they are encoded to be allowed in isolation without needing any prior
    Hangul filler and without creating a defective syllable.

    So I would suggest allowing matches for jamos independently of their
    position (and independently of their precomposed state), and computing other
    properties that are not character-based but based on cluster boundaries:

    * Add something like \J to match any jamo (or precomposed jamo, or
    compatibility jamo): in fact it would be a shortcut for a character class
    that includes all Hangul characters (including compatibility jamos).

    * Add something like \b to match cluster boundary conditions: it will match
    just before any leading jamo at the beginning of a cluster, or before a
    compatibility jamo, or before any other cluster, but never in the middle. It
    will also match before a base character in a combining sequence. It will
    also match after a middle or trailing jamo at end of the cluster, or after a
    compatibility jamo, or after the last combining character of a combining
    sequence.

    * Add something like \B as its complement (it will match only in the middle
    of a cluster, including between a base letter and a combining character,
    but not at the start and not at the end of such sequences).

    * Allow Hangul search patterns where position distinctions are not
    significant: a search string like \j{hangul text} would treat each letter of
    the Hangul text as if it were made with compatibility jamos, ignoring L/V/T
    differences, these characters being replaced in fact by character classes
    (where the matching L, V, or T letters are part of the same class as the
    compatibility jamo that encodes the same Hangul letter). The expression
    \j{hangul text} will not encode any cluster boundary condition, not even at
    the start or end.
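
    The \j{...} idea in the last item could be sketched roughly as follows, in
    Python. This is only one possible reading of the proposal: the function
    names j_class and j_pattern are hypothetical, and grouping the forms of a
    letter through their Unicode character names (HANGUL LETTER / CHOSEONG /
    JUNGSEONG / JONGSEONG plus a shared letter name) is my assumption about
    how the classes might be built. Both the pattern text and the searched
    text are put in NFD first, so precomposed syllables are covered too.

```python
import re
import unicodedata

# Name prefixes for compatibility, leading, vowel and trailing forms.
POSITIONS = ("LETTER", "CHOSEONG", "JUNGSEONG", "JONGSEONG")


def j_class(ch):
    """Return a regex class matching every positional form of one Hangul letter."""
    parts = unicodedata.name(ch, "").split(" ", 2)
    if len(parts) != 3 or parts[0] != "HANGUL" or parts[1] not in POSITIONS:
        return re.escape(ch)  # not a jamo: match the character literally
    forms = {ch}
    for position in POSITIONS:
        try:
            forms.add(unicodedata.lookup(f"HANGUL {position} {parts[2]}"))
        except KeyError:
            pass  # this letter has no form in that position
    return "[" + "".join(sorted(forms)) + "]"


def j_pattern(hangul_text):
    """Build a position-insensitive pattern for hangul_text (no boundary conditions)."""
    decomposed = unicodedata.normalize("NFD", hangul_text)
    return "".join(j_class(ch) for ch in decomposed)


text = unicodedata.normalize("NFD", "\uAC01")   # GAG: L kiyeok, V a, T kiyeok
kiyeok = re.compile(j_pattern("\u3131"))        # compatibility KIYEOK as the query
print([f"U+{ord(m.group()):04X}" for m in kiyeok.finditer(text)])
# ['U+1100', 'U+11A8']
```

    Here a single query letter finds both the leading and the trailing
    occurrence, the Hangul counterpart of finding every "s" in "stress".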

    So:

    * "\b\j{hangul consonant}" to find a particular leading cluster consonant
    in Hangul (it will match only the first consonant, not a second one before
    the vowel).
    * "\j{hangul vowel}" to find a particular vowel in Hangul (it will match
    any \V letter or compatibility vowel).
    * "\j{hangul consonant}\b" to find a particular trailing cluster consonant
    in Hangul (it will match only the last consonant, not a prior one after the
    vowel).
    * all Hangul syllable clusters (even the defective ones, or those made with
    compatibility jamos) would be matched using: \b\J(\B\J)+\b.
    (Not covered by this regular expression: do we need to accept combining
    characters other than Hangul letters in such clusters?)

    The only apparent complexity of Hangul is the fact that its letters have
    been encoded at many code points depending on their position/status in the
    syllable, instead of having been unified like in Arabic using normative
    joining types and compatibility mappings for positional distinctions.
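
    The contrast with Arabic can be checked with a few lines of Python. This
    is only an illustration using compatibility normalization (NFKC) from the
    standard unicodedata module; the helper name nfkc_codepoints is
    hypothetical. The positional presentation forms of Arabic BEH all fold to
    one letter, while the leading and trailing Hangul KIYEOK stay distinct.

```python
import unicodedata


def nfkc_codepoints(ch):
    """Return the NFKC form of ch as a list of U+XXXX code points."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", ch)]


# Arabic BEH: isolated, final, initial and medial presentation forms.
for form in "\uFE8F\uFE90\uFE91\uFE92":
    print(nfkc_codepoints(form))        # ['U+0628'] each time

# Hangul KIYEOK: leading, trailing and compatibility forms.
print(nfkc_codepoints("\u1100"))        # ['U+1100']
print(nfkc_codepoints("\u11A8"))        # ['U+11A8'] (not unified with the leading form)
print(nfkc_codepoints("\u3131"))        # ['U+1100']
```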

    But let's keep in mind that this is still an alphabet, much smaller and much
    simpler in fact than Latin. Unicode and ISO 10646 accepted an apparent
    complexity when encoding the script because they wanted to preserve
    round-trip compatibility with several legacy encodings that used different
    ways to represent the cluster boundaries (and the Korean standards body has
    itself changed its view several times, in its own standards, about how to
    encode those boundaries).

    Hangul jamos also encode some letters as if they were different, even though
    they are represented exactly the same way graphically: SSANG letters, for
    example, are encoded as if the trailing pair "ss" at the end of the English
    word "stress" were a separate letter, distinct from the trailing "s" in the
    English word "is" and distinct from the leading "s" in the English word
    "stress", instead of being interpreted as a simple digraph (which it is in
    fact in Hangul, as demonstrated by syllables where a trailing SSANG letter
    is used in the same cluster as another trailing letter).

    Also, the distinction between leading and trailing consonants is not always
    clear in Korean, except graphically when a writer chooses one interpretation
    by composing his syllables in graphical squares (this could have been
    encoded by encoding syllable breaks explicitly, without disunifying the same
    Hangul letters according to their context of use).

    But because of this legacy disunification of the Hangul alphabet, these
    ambiguities persist in Korean texts encoded today, and they will become
    more apparent when performing full-text searches in large corpora from
    different authors and written at different periods of time (because they
    group letters into syllables differently).

    Korean spell checkers may help authors today to group letters according to
    the modern usage and generally accepted modern dictionary conventions for
    common words (this is at the base of the distinction between modern Korean
    syllables, but also explains the existence of other "historic" syllables and
    historic compound jamos, the concept of "jamos" being a modern creation on
    top of Hangul letters, by grouping them into syllable sub-units, in a way
    similar to the concept of unbreakable digraphs interpreted as single letters
    in the "alphabet" of some Latin-based languages).

    But there still remain cases where these checkers and dictionaries won't
    help (notably in proper names, in toponyms and in Hangul transliterations of
    foreign scripts and languages, where there may even be optional null
    consonants inserted between vowels, in either leading or trailing position,
    creating different graphical syllable breaks and multiple possible
    encodings, even though all these encodings contain the same effective Hangul
    letters). This means that even the existing syllable breaks (default
    grapheme clusters) in Hangul are not significant for searches, and there is
    some need to ignore the distinctions created by the encoding or even by the
    graphical composition of syllabic squares (which the Hangul encoding is
    trying to represent more or less successfully).

    ----
    Also, for users of RTL scripts, it would be useful to be allowed to detect
    direction boundaries, to allow expressions that will work with
    transformations of BiDi overrides or embedding or mirroring conditions (for
    characters that are not mirrored in Unicode but need to be replaced,
    depending on the current direction, like quotation marks).
    Also, do you support Arabic joining types (which have some similarities with
    Hangul jamo composition states)? Given that you use NFD, these distinctions
    are lost on a per-character basis but remain as contextual conditions,
    creating new boundary conditions similar to syllabic breaks (i.e. breaks
    between sequences of base letters).
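
    For the direction boundaries mentioned above, a very reduced sketch in
    Python might look like this. It is an assumption-heavy simplification: it
    only looks at strong bidirectional classes (L versus R/AL) and ignores
    embeddings, overrides and the rest of the bidirectional algorithm. The
    function names are hypothetical.

```python
import unicodedata


def strong_direction(ch):
    """Return 'LTR' or 'RTL' for strong characters, None for everything else."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "LTR"
    if bidi in ("R", "AL"):
        return "RTL"
    return None


def direction_boundaries(text):
    """Return indexes of the strong characters where the direction changes."""
    boundaries, current = [], None
    for i, ch in enumerate(text):
        direction = strong_direction(ch)
        if direction is None:
            continue  # neutrals and weak characters do not start a new run here
        if current is not None and direction != current:
            boundaries.append(i)
        current = direction
    return boundaries


print(direction_boundaries("abc \u05D0\u05D1 def"))   # [4, 7]
```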
    

