RE: (SC22WG20.4660) RE: UTS #10 : comment on Hangul Jamo(Letter) collation

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Sat Aug 30 2003 - 07:20:14 EDT


    ...
    > > You may wish to look at
    > > http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051-hangulsort.pdf
    > > which contains a much updated version of my paper on the subject.
    > > The table entries are also found in plain text form at
    > > http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051t-table-hangulctt6.txt
    >
    > Wow, you've created all these entries. Thanks.

    You're welcome!

    > > > After exchanging a thread of emails, Mark Davis and I found that
    > > > both of us are more or less on the same page as to how Hangul
    > > > letters should be collated.
    > > > In summary,
    > > >
    > > > 1. Weights for T, V, and L should be assigned in such a way that
    > > > T < V < L for all T, V, and L's
    > >
    > > That would be L < T < V; but that is complicated by the actual need for
    > > (the superficially contradictory) V < L < T < V, with the latter T and V
    > > after all scripts.
    >
    > I'm not following you here. 'T < V < L' works well in Mark's
    > and my scheme for the most generic form of Korean syllables, 'L+V+T*'
    > as far as South Korean collation rules are concerned.

    But then you have to insert extra weights (TERMs, as you call them).

    In my scheme, referred to above, those aren't needed at all. Indeed,
    Mark has said that inserting such weights would be too much overhead,
    could not be done in present implementations without changing the
    architecture of the collation implementations, and would lengthen
    the computed keys too much.
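
    To make the TERM discussion concrete, here is a minimal sketch, not taken
    from UTS #10 or from either proposal. It assumes toy weight bands with
    TERM < T < V < L, assumes only simple L+V+T* syllables, and uses the
    hypothetical helper names jamo_weight and sort_key. It shows that without
    TERM a syllable with a trailing consonant sorts before a longer string
    that shares its L+V prefix, and that adding TERM fixes the order at the
    cost of one extra weight per syllable.

```python
import unicodedata

TERM = 0  # assumed: lower than every T weight


def jamo_weight(ch):
    """Toy primary weight with bands T < V < L (illustrative values only)."""
    cp = ord(ch)
    if 0x11A8 <= cp <= 0x11FF:      # trailing consonants (T)
        return 1000 + (cp - 0x11A8)
    if 0x1160 <= cp <= 0x11A7:      # vowels (V)
        return 2000 + (cp - 0x1160)
    if 0x1100 <= cp <= 0x115F:      # leading consonants (L)
        return 3000 + (cp - 0x1100)
    raise ValueError(f"not a conjoining jamo: U+{cp:04X}")


def sort_key(text, use_term):
    """Primary key for Hangul text; optionally close each syllable with TERM."""
    key = []
    prev_is_l = False
    for ch in unicodedata.normalize("NFD", text):
        w = jamo_weight(ch)
        is_l = w >= 3000
        # A new syllable starts at an L that follows a V or T.
        if use_term and is_l and key and not prev_is_l:
            key.append(TERM)
        key.append(w)
        prev_is_l = is_l
    if use_term and key:
        key.append(TERM)
    return key


words = ["각", "가나"]
print(sorted(words, key=lambda w: sort_key(w, use_term=False)))
print(sorted(words, key=lambda w: sort_key(w, use_term=True)))
```

    Without TERM the third weights compared are T (in 각) against L (in 가나),
    and T < L puts 각 first; with TERM the comparison is TERM against T, which
    puts 가나 first, matching syllable-by-syllable order.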

    > > The Vs at two radically different positions in the table
    > > are for different positions of the V in a syllable; V < L is for the
    > > first V in a syllable, T < V is for non-first Vs in a syllable.
    >
    > Aha, you're talking about your scheme.
    >
    > > > 2. Expand precomposed (cluster) Jamos into sequences of component
    > > > basic Jamos
    > >
    > > Needed for covering all combinations of Jamos. If limited to (a superset)
    > > of modern Jamo, this expansion can be avoided.
    >
    > Absolutely.
    >
    > > referenced above, which lists the weightings and contractions needed for
    > > avoiding this expansion in many (but not all) cases.
    > >
    > > > 3. Terminate every syllable with 'TERM' that has a lower weight than
    > > > all T's (there's an alternative to this, but both of us favor this
    > > > more than the alternative)
    > >
    > > This can be avoided if the weighting is done in a particular way.
    > > See my paper for details.
    >
    > Indeed. However, I'm wondering if avoiding TERM is a better
    > trade-off than avoiding your seemingly more complex (than Mark's and
    > mine) scheme, which also requires pre-handling. Could you

    For a superset of modern Hangul, NO new prehandling is needed
    for my scheme.

    Prehandling is needed for my scheme ONLY if a *multiletter* vowel Jamo
    can directly follow another vowel Jamo (multiletter or single letter).

    In Microsoft's list in their "Appendix Hangul OpenType specification"
    ("Appendix B: Standard composition for Old Hangul Jamos"),
    there is NO such sequence that requires prehandling beyond the
    current prehandling (NFD of Hangul Syllable characters, which can
    be avoided too, with a precomputed table of their weightings).
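
    As an illustration of the existing prehandling step mentioned above, the
    following sketch decomposes precomposed Hangul syllables into conjoining
    Jamo with the standard arithmetic from the Unicode Standard and checks it
    against NFD. The function name decompose_syllable and the dictionary
    JAMO_TABLE are hypothetical; a real implementation would store collation
    weights in such a table rather than the Jamo strings shown here.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT          # 11172 precomposed syllables


def decompose_syllable(ch):
    """Return the conjoining-Jamo sequence for one precomposed syllable."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch                    # not a precomposed Hangul syllable
    l = chr(L_BASE + s_index // N_COUNT)
    v = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    t = chr(T_BASE + t_index) if t_index else ""
    return l + v + t


# Precomputed table: one lookup per syllable instead of decomposing at
# comparison time (here it maps to Jamo; weights would work the same way).
JAMO_TABLE = {chr(S_BASE + i): decompose_syllable(chr(S_BASE + i))
              for i in range(S_COUNT)}

for syllable in "한글":
    jamo = JAMO_TABLE[syllable]
    assert jamo == unicodedata.normalize("NFD", syllable)
    print(syllable, [f"U+{ord(c):04X}" for c in jamo])
```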

    > give me some
    > rationale behind your preferring yours to the other? Is it

    No need for new prehandling for a large class of Hangul strings
    (modern + known historic according to MS's list). Thus no change
    in architecture for the collation algorithm (unless you want to
    handle <vowel Jamo, multiletter vowel Jamo> too).

    The resulting keys are as long as they would be for other alphabetic
    scripts (mostly one weight per level per letter). No extra weights.
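
    A simplified sketch of the "one weight per letter, no extra weights"
    point follows. It covers only the plain L < T < V ordering for modern
    L+V+T* syllables, not the full V < L < T < V table in n1051; the weight
    values and the names weight and key_ltv are assumptions made for the
    illustration.

```python
import unicodedata


def weight(ch):
    """Toy primary weight with bands L < T < V (illustrative values only)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:      # leading consonants (L)
        return 1000 + (cp - 0x1100)
    if 0x11A8 <= cp <= 0x11FF:      # trailing consonants (T)
        return 2000 + (cp - 0x11A8)
    if 0x1160 <= cp <= 0x11A7:      # vowels (V)
        return 3000 + (cp - 0x1160)
    raise ValueError(f"not a conjoining jamo: U+{cp:04X}")


def key_ltv(text):
    """One weight per Jamo; no terminator weights are inserted."""
    return [weight(c) for c in unicodedata.normalize("NFD", text)]


words = ["각", "가나", "개", "가"]
print(sorted(words, key=key_ltv))
```

    Because every L weighs less than every T, a following syllable's leading
    consonant sorts before a trailing consonant in the same position, so 가나
    comes before 각 without any TERM, and each key has exactly one weight per
    Jamo.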

    > because it's better suited to tailoring for North Korean?

    No. As far as I can see that requires new prehandling also for
    modern Hangul in any sufficiently general scheme.

                    /kent k

    > I haven't given much thought to North Korean collation rules
    > recently (at the moment, I have to look them up again to refresh
    > my memory.)
    >
    > Jungshik
    >


