RE: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Sep 23 2007 - 00:30:43 CDT

Next message: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

Previous message: Theo Veenker: "Re: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"
In reply to: Mike: "Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Mike: "Re: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Mike: "Re: Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"

Mike wrote:
> Another experimental part of my implementation is that
> a pattern can only match if it starts and ends on a
> grapheme cluster boundary. This prevents, for example,
> the Hangul syllable \uAC00 from matching the first part
> of \uAC01 which is composed of the same leading and vowel
> jamos, but which also has a trailing jamo.
>
> If anybody thinks that any of this is bad design, I'd
> be happy to hear suggestions for improvement!

Maybe, for Korean users, your requirement to match on grapheme clusters will
not make much sense, given that Hangul is an alphabet in which each
component jamo is a separate letter that should be matchable, whatever its
position in the syllable square.

Making this restriction would be the equivalent of forcing Latin users to
not find any matches for "s" in "stress" (which has a single syllable),
because each "s" is used in either the leading or the trailing position of
the syllable.
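
To make the case under discussion concrete, here is a small Python sketch
(standard unicodedata module only; it is not Mike's implementation, and the
helper name codepoints is just for illustration). It shows that, in NFD, the
jamos of U+AC00 are a strict prefix of the jamos of U+AC01, which is why a
search at the code point level finds the first inside the second:

```python
import unicodedata

# Canonical decomposition of the two precomposed syllables from the example.
ga = unicodedata.normalize("NFD", "\uAC00")   # leading + vowel
gag = unicodedata.normalize("NFD", "\uAC01")  # leading + vowel + trailing

def codepoints(s):
    """Format a string as a list of U+XXXX code point labels."""
    return [f"U+{ord(c):04X}" for c in s]

print(codepoints(ga))          # ['U+1100', 'U+1161']
print(codepoints(gag))         # ['U+1100', 'U+1161', 'U+11A8']
print(gag.startswith(ga))      # True
```

Whether that prefix hit should be reported as a match is exactly the design
question here.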

For me, the leading/middle/trailing position of a jamo in a Hangul cluster
should be treated as a contextual condition (similar to your existing
implementation for word break boundaries).

Note also that your \L\V+\T+ syntax will not find all Hangul clusters. It
will miss compatibility jamos, which do not encode their own
leading/middle/trailing status (much like our Latin letters, which do not
encode their position in a syllable). And it will miss "defective" clusters
that start with a "middle" vowel without a leading consonant.
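
A rough Python sketch of the two misses. The character ranges below are an
assumption: they approximate \L, \V and \T with the conjoining jamo ranges
of the Hangul Jamo block in current Unicode, and may not be what your
implementation uses.

```python
import re
import unicodedata

# Assumed stand-ins for \L, \V and \T (conjoining jamos, U+1100..U+11FF).
L = "[\u1100-\u115F]"
V = "[\u1160-\u11A7]"
T = "[\u11A8-\u11FF]"
pattern = re.compile(f"{L}{V}+{T}+")

conjoining = unicodedata.normalize("NFD", "\uAC01")  # L + V + T
compat = "\u3131\u314F\u3131"   # same letters as compatibility jamos
defective = "\u1161\u11A8"      # vowel + trailing, no leading consonant

for label, text in [("conjoining", conjoining),
                    ("compat", compat),
                    ("defective", defective)]:
    print(label, pattern.fullmatch(text) is not None)
```

Only the first sample matches; the compatibility jamos and the defective
cluster are both skipped.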

Your \L and \T are just character classes, but they represent the same
Hangul letters, only in particular positions in a cluster. A jamo is a
normal letter plus its leading/middle/trailing status in a cluster;
compatibility jamos just encode an unknown status that must be computed in a
complex way using dictionary lookups, just as we need dictionary lookups and
complex rules to find syllable breaks in Latin.

Note that \V only contains letters that are composable in default combining
sequences, but some compatibility jamos are not ambiguous in Korean, even
though they are encoded to be allowed in isolation without needing any prior
Hangul filler and without creating a defective syllable.

So I would suggest allowing matches for jamos independently of their
position (and independently of their precomposed state), and computing other
properties that are not character-based but based on cluster boundaries:

* Add something like \J to match any jamo (or precomposed jamo, or
compatible jamo): in fact it would be a shortcut for a character class that
includes all Hangul characters (including compatibility jamos).

* Add something like \b to match cluster boundary conditions: it will match
just before any leading jamo at the beginning of a cluster, or before a
compatibility jamo, or before any other cluster, but never in the middle. It
will also match before a base character in a combining sequence. It will
also match after a middle or trailing jamo at end of the cluster, or after a
compatibility jamo, or after the last combining character of a combining
sequence.

* Add something like \B as its complement (it will match only in the middle
of a cluster, including between a base letter and a combining character, but
not at the start and not at the end of such sequences).

* Allow Hangul search patterns where position distinctions are not
significant: a search string like \j{hangul text} would treat each letter of
the Hangul text as if it were made with compatibility jamos, ignoring L/V/T
differences, these characters being replaced in fact by character classes
(where matching L, V and T letters are part of the same class as the
matching compatibility jamo that encodes the same Hangul letter). The
expression \j{hangul text} will not encode any cluster boundary condition,
not even at start or end.
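
One possible way to sketch the \j{...} folding in Python. This is an
assumption about how it could be done, not a specification: it keys each
jamo on the tail of its Unicode character name, so that the CHOSEONG,
JUNGSEONG, JONGSEONG and compatibility LETTER variants of one letter compare
equal. The names letter_key, fold and find_letters are hypothetical.

```python
import unicodedata

# Name prefixes that only carry positional (or compatibility) status.
PREFIXES = ("HANGUL CHOSEONG ", "HANGUL JUNGSEONG ",
            "HANGUL JONGSEONG ", "HANGUL LETTER ")

def letter_key(ch):
    """Return the position-independent letter name, or None if not a jamo."""
    name = unicodedata.name(ch, "")
    for prefix in PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):]
    return None

def fold(text):
    """Decompose syllables (NFD) and replace each jamo by its letter key."""
    folded = []
    for ch in unicodedata.normalize("NFD", text):
        key = letter_key(ch)
        folded.append(key if key is not None else ch)
    return folded

def find_letters(needle, haystack):
    """Offsets (in folded units) where needle occurs, ignoring L/V/T status."""
    n, h = fold(needle), fold(haystack)
    return [i for i in range(len(h) - len(n) + 1) if h[i:i + len(n)] == n]

print(fold("\uAC01"))                    # ['KIYEOK', 'A', 'KIYEOK']
print(find_letters("\u3131", "\uAC01"))  # [0, 2]
```

Note that this keeps compound jamos (the SSANG letters, for example) as
single keys; splitting them into their component letters would need an
extra table.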

So:

* "\b\j{hangul consonant}" to find a particular leading cluster consonant
in Hangul (it will match only the first consonant, not a second one before
the vowel).
* "\j{hangul vowel}" to find a particular vowel in Hangul
(it will match any \V letter or compatibility vowels).
* "\j{hangul consonant}\b" to find a particular trailing cluster consonant
in Hangul (it will match only the last consonant, not a prior one after the
vowel).
* all Hangul syllable clusters (even the defective ones, or those made with
compatibility jamos) would be matched using: \b\J(\B\J)+\b.
(not supported in this regular expression: do we need to accept some other
combining characters other than Hangul letters in such clusters?)
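
The boundary conditions above could be prototyped roughly as follows. This
is a simplified Python sketch under stated assumptions: it applies only the
Hangul rules of the default grapheme cluster definition (no break inside
L L, L V, V V, V T or T T), works on NFD text, uses the conjoining jamo
ranges of current Unicode, and ignores combining marks altogether. The names
jamo_type, is_boundary and clusters are hypothetical.

```python
import unicodedata

def jamo_type(ch):
    """Classify a conjoining jamo as 'L', 'V' or 'T'; None for anything else."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    return None

# (previous type, next type) pairs that stay inside one cluster.
NO_BREAK = {("L", "L"), ("L", "V"), ("V", "V"), ("V", "T"), ("T", "T")}

def is_boundary(text, i):
    """True where a \\b-like condition holds at offset i (text already in NFD)."""
    if i == 0 or i == len(text):
        return True
    return (jamo_type(text[i - 1]), jamo_type(text[i])) not in NO_BREAK

def clusters(text):
    """Split text (normalized to NFD) at every boundary."""
    text = unicodedata.normalize("NFD", text)
    out, start = [], 0
    for i in range(1, len(text) + 1):
        if is_boundary(text, i):
            out.append(text[start:i])
            start = i
    return out

print([len(c) for c in clusters("\uAC01\uAC00")])  # [3, 2]
print([len(c) for c in clusters("\u1161\u11A8")])  # [2]  (defective cluster)
print([len(c) for c in clusters("\u3131\u314F")])  # [1, 1]  (compatibility)
```

With these default rules each compatibility jamo comes out as its own
cluster, so a pattern that requires at least one \B position would only
pick them up if the boundary definition were extended to group them.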

The only apparent complexity of Hangul is the fact that its letters have
been encoded at many code points depending on their position/status in the
syllable, instead of having been unified like in Arabic using normative
joining types and compatibility mappings for positional distinctions.

But let's keep in mind that this is still an alphabet, much smaller and much
simpler in fact than Latin. Unicode and ISO 10646 accepted an apparent
complexity when encoding the script because they wanted to preserve
round-trip compatibility with several legacy encodings that used different
ways to represent the cluster boundaries (and the Korean standards body has
also changed its own view several times about how to encode those
boundaries, in its own standards).

Hangul jamos also encode some letters as if they were different, even though
they are represented exactly the same way graphically: SSANG letters, for
example, are encoded as if the trailing pair "ss" at the end of the English
word "stress" were a separate letter, distinct from the trailing "s" in the
English word "is" and distinct from the leading "s" in the English word
"stress", instead of being interpreted as a simple digraph (which it is in
fact in Hangul, as demonstrated in syllables where a trailing SSANG letter
is used in the same cluster as another trailing letter).

Also the distinction between leading and trailing consonants is not always
clear in Korean, except graphically when a writer chooses one interpretation
by composing his syllables in graphical squares (this could have been
encoded by encoding syllable breaks explicitly without disunifying the same
Hangul letters according to their context of use).

But because of this legacy disunification of the Hangul alphabet, these
ambiguities persist in Korean texts encoded today, and they will become more
apparent when performing full-text searches in large corpora from different
authors and written at different periods of time (because they group letters
into syllables differently).

Korean spell checkers may help authors today to group letters according to
the modern usage and generally accepted modern dictionary conventions for
common words (this is at the base of the distinction between modern Korean
syllables, but also explains the existence of other "historic" syllables and
historic compound jamos, the concept of "jamos" being a modern creation on
top of Hangul letters, by grouping them into syllable sub-units, in a way
similar to the concept of unbreakable digraphs interpreted as single letters
in the "alphabet" of some Latin-based languages).

But there still remain cases where these checkers and dictionaries won't
help (notably in proper names, in toponyms and in Hangul transliterations of
foreign scripts and languages, where there may even be optional null
consonants inserted between vowels, either in leading or trailing position,
creating different graphical syllable breaks and multiple possible
encodings, even though all these encodings contain the same effective Hangul
letters). This means that even the existing syllable breaks (default
grapheme clusters) in Hangul are not significant for searches, and there is
some need to ignore the distinctions created by the encoding or even by the
graphical composition of syllabic squares (which the Hangul encoding is
trying to represent more or less successfully).

----
Also, for users of RTL scripts, it would be useful to be able to detect
direction boundaries, to allow expressions that will work with
transformations of BiDi overrides or embedding or mirroring conditions (for
characters that are not mirrored in Unicode but need to be replaced,
depending on the current direction, like quotation marks).
Also, do you support Arabic joining types (which have some similarities with
Hangul jamo composition states)? Given that you use NFD, these distinctions
are lost on a per-character basis, but they remain as contextual conditions,
creating new boundary conditions similar to syllabic breaks (i.e. breaks
between sequences of base letters).
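
As a point of comparison for the Arabic case, the positional status is
visible in the compatibility decomposition tags of the Arabic presentation
forms. The Python sketch below reads those tags with the standard
unicodedata module; it is only an illustration (Python's standard library
does not expose the Joining_Type property itself), and positional_form is a
hypothetical helper name.

```python
import unicodedata

POSITION_TAGS = ("<isolated>", "<initial>", "<medial>", "<final>")

def positional_form(ch):
    """Return (position, base letters) for a presentation form, else (None, ch)."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] in POSITION_TAGS:
        base = "".join(chr(int(p, 16)) for p in parts[1:])
        return parts[0].strip("<>"), base
    return None, ch

# The four presentation forms of ARABIC LETTER BEH (U+0628).
for ch in "\uFE8F\uFE91\uFE92\uFE90":
    position, base = positional_form(ch)
    print(f"U+{ord(ch):04X}", position, f"U+{ord(base):04X}")
```

All four forms report the same base letter U+0628 with a different
position, which is the kind of letter/position split discussed above for
Hangul.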


This archive was generated by hypermail 2.1.5 : Sun Sep 23 2007 - 00:34:28 CDT