RE: (SC22WG20.4660) RE: UTS #10 : comment on Hangul Jamo(Letter) collation

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Sat Aug 30 2003 - 07:20:14 EDT


...
> > You may wish to look at
> > http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051-hangulsort.pdf
> > which contains a much updated version of my paper on the subject.
> > The table entries are also found in plain text form at
> > http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051t-table-hangulctt6.txt
>
> Wow, you've created all these entries. Thanks.

You're welcome!

> > > After a thread of emails exchanged, Mark Davis and I found that both
> > > of us are more or less on the same page as to how Hangul letters
> > > should be collated.
> > > In summary,
> > >
> > > 1. Weights for T, V, and L should be assigned in such a way that
> > > T < V < L for all T, V, and L's
> >
> > That would be L < T < V; but that is complicated by the actual need for
> > (the superficially contradictory) V < L < T < V, with the latter T and V
> > after all scripts.
>
> I'm not following you here. 'T < V < L' works well in Mark's
> and my scheme for the most generic form of Korean syllables, 'L+V+T*'
> as far as South Korean collation rules are concerned.

But then you have to insert extra weights (TERMs, as you call them).

In my scheme, referred to above, those aren't needed at all. Indeed,
Mark has said that inserting such weights would be too much overhead,
could not be done in present implementations without changing the
architecture of the collation implementations, and would lengthen
the computed keys too much.

> > The Vs at two radically different positions in the table
> > are for different positions of the V in a syllable; V < L is for the
> > first V in a syllable, T < V is for non-first Vs in a syllable.
>
> Aha, you're talking about your scheme.
>
> > > 2. Expand precomposed (cluster) Jamos into sequences of component
> > > basic Jamos
> >
> > Needed for covering all combinations of Jamos. If limited to (a superset
> > of) modern Jamo, this expansion can be avoided.
>
> Absolutely.
>
> > referenced above, which lists the weightings and contractions needed for
> > avoiding this expansion in many (but not all) cases.
> >
> > > 3. Terminate every syllable with 'TERM' that has a lower weight than
> > > all T's (there's an alternative to this, but both of us favor this
> > > over the alternative)
> >
> > This can be avoided if the weighting is done in a particular way.
> > See my paper for details.
>
> Indeed. However, I'm wondering if avoiding TERM is a better
> trade-off than avoiding the seemingly more complex (than Mark's and mine)
> scheme of yours that also requires pre-handling. Could you

For a superset of modern Hangul, NO new prehandling is needed
for my scheme.

Prehandling is needed for my scheme ONLY if a *multiletter* vowel Jamo
can directly follow another vowel Jamo (multiletter or single letter).

In Microsoft's list in their "Appendix Hangul OpenType specification"
("Appendix B: Standard composition for Old Hangul Jamos"),
there is NO such sequence that requires prehandling beyond the
current prehandling (NFD of Hangul Syllable characters, which can
be avoided too, with a precomputed table of their weightings).
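
As an illustration of the "current prehandling" mentioned above, the following sketch shows NFD splitting a precomposed Hangul syllable into its conjoining jamo, and a precomputed lookup table that makes the run-time normalization step unnecessary. The table here stores jamo sequences as a stand-in; a real implementation would store the collation weightings themselves. The name `SYLLABLE_TABLE` is hypothetical.

```python
import unicodedata

S_BASE = 0xAC00    # first precomposed Hangul syllable (U+AC00)
S_COUNT = 11172    # number of precomposed Hangul syllables

# NFD turns one syllable character into its L, V and optional T jamo.
decomposed = unicodedata.normalize("NFD", "한")
print([f"U+{ord(j):04X}" for j in decomposed])   # ['U+1112', 'U+1161', 'U+11AB']

# Precompute the decomposition once for every syllable, so that key
# generation can do a single lookup instead of normalizing each time.
SYLLABLE_TABLE = {
    chr(cp): tuple(unicodedata.normalize("NFD", chr(cp)))
    for cp in range(S_BASE, S_BASE + S_COUNT)
}

print(len(SYLLABLE_TABLE))                       # 11172
print(SYLLABLE_TABLE["가"] == ("\u1100", "\u1161"))   # True
```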

> give me some
> rationale behind your preferring yours to the other? Is it

No need for new prehandling for a large class of Hangul strings
(modern + known historic according to MS's list). Thus no change
in architecture for the collation algorithm (unless you want to
handle <vowel Jamo, multiletter vowel Jamo> too).

The resulting keys are as long as they would be for other alphabetic
scripts (mostly one weight per level per letter). No extra weights.

> because it's better suited to tailoring for North Korean?

No. As far as I can see that requires new prehandling also for
modern Hangul in any sufficiently general scheme.

/kent k

> I haven't given much thought to North Korean collation rules recently
> (at the moment, I have to look them up again to refresh my memory.)
>
> Jungshik
>

