RE: Compression through normalization

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Thu Dec 04 2003 - 08:10:17 EST


    Philippe Verdy wrote:

    ...
    > letters each. Fortunately, the definition of Hangul syllable blocks need
    > not be changed, as it works well with Hangul syllables as L+, V+, T*
    > (where L, V, and T stand for single-letter jamos).

            In fact the Unicode encoding of modern Hangul syllables is more
            accurately:

                    (Ls|Lm)+ (Vs|Vm)+ (Ts|Tm)*

            where Ls,Vs,Ts are single-letter L,V,T modern jamos
            and Lm,Vm,Tm are multiple-letter L,V,T modern jamos

    Yes, but that goes beyond what I wanted to say.
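
    A minimal sketch of the simpler L+ V+ T* pattern as a regular
    expression, in Python. It assumes only the modern conjoining jamo
    ranges (U+1100..U+1112, U+1161..U+1175, U+11A8..U+11C2) and makes no
    distinction between single-letter and multiple-letter jamos. The names
    SYLLABLE and is_jamo_syllable are illustrative only.

```python
import re

# Modern conjoining jamo ranges in the Hangul Jamo block.
LEAD = "\u1100-\u1112"    # leading consonants (L)
VOWEL = "\u1161-\u1175"   # vowels (V)
TRAIL = "\u11A8-\u11C2"   # trailing consonants (T)

# L+ V+ T*: one or more leads, one or more vowels, optional trails.
SYLLABLE = re.compile(f"[{LEAD}]+[{VOWEL}]+[{TRAIL}]*")


def is_jamo_syllable(text: str) -> bool:
    """Return True if text is exactly one jamo syllable of the form L+ V+ T*."""
    return SYLLABLE.fullmatch(text) is not None
```

    For example, is_jamo_syllable("\u1112\u1161\u11AB") holds, while a
    vowel with no lead consonant does not match.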

            If we count also the encoded modern LV and LVT johab syllables:
                  
                    ( ( (Ls|Lm)+ (Vs|Vm)+ )
                    | ( (Ls|Lm)* (LsVs|LsVm|LmVs|LmVm) (Vs|Vm)* )
                    | ( (Ls|Lm)* (LsVsTs|LsVmTs|LmVsTs|LmVmTs|
                                  LsVsTm|LsVmTm|LmVsTm|LmVmTm) )
                    ) (Ts|Tm)*

    I'm not even going to try to parse that...

            The idea is to allow decomposing Lm, Vm, or Tm into sequences of
            Ls, Vs, or Ts using supplementary decompositions, including for
            the compatibility Hangul syllables.
            So this will effectively produce syllables encoded only with
                    Ls+ Vs+ Ts*

    That's what I said.
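
    A rough sketch of such a supplementary decomposition, in Python. The
    table SUPPLEMENTARY is a small hypothetical subset chosen for
    illustration; it is not defined by Unicode and is not part of any
    normalisation form. Precomposed syllables are first split with NFD.

```python
import unicodedata

# Hypothetical supplementary decompositions (illustrative subset only;
# these are NOT Unicode-defined mappings).
SUPPLEMENTARY = {
    "\u1101": "\u1100\u1100",  # lead SSANGKIYEOK   -> KIYEOK KIYEOK
    "\u110A": "\u1109\u1109",  # lead SSANGSIOS     -> SIOS SIOS
    "\u116A": "\u1169\u1161",  # vowel WA           -> O A
    "\u11A9": "\u11A8\u11A8",  # trail SSANGKIYEOK  -> KIYEOK KIYEOK
    "\u11AA": "\u11A8\u11BA",  # trail KIYEOK-SIOS  -> KIYEOK SIOS
}


def to_single_letter_jamos(text: str) -> str:
    """Split syllables via NFD, then expand multiple-letter jamos once."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(SUPPLEMENTARY.get(ch, ch) for ch in decomposed)
```

    A fuller table would need repeated application for jamos whose parts
    are themselves multiple-letter jamos; the single pass here suffices
    only for this subset.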

            Then to recompose them as much as possible to build Lm,Vm,Tm jamos,

    One can do that, yes (but not as part of Unicode normalisation).

            and then reassemble them in either johab syllables (LV or LVT),

    Yes. Like for NFC, using the arithmetically specified decompositions for
    LV (into <L, V>) and LVT (into <LV, T>, as they are more properly done),
    inverted, recursively.
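
    The arithmetic can be sketched as follows, in Python, using the
    constants from the Unicode Standard's Hangul syllable algorithm
    (SBase, LBase, VBase, TBase and the counts). The function names are
    illustrative; this is a sketch, not a full NFC implementation.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per lead consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables


def decompose_once(ch):
    """One step: LVT -> (LV, T), LV -> (L, V), anything else -> (ch,)."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return (ch,)
    t_index = s_index % T_COUNT
    if t_index:
        return (chr(S_BASE + s_index - t_index), chr(T_BASE + t_index))
    return (chr(L_BASE + s_index // N_COUNT),
            chr(V_BASE + (s_index % N_COUNT) // T_COUNT))


def decompose(ch):
    """Apply decompose_once recursively: LVT -> LV T -> L V T."""
    first, *rest = decompose_once(ch)
    if not rest:
        return ch
    return decompose(first) + "".join(rest)


def compose_pair(a, b):
    """Inverse step: <L, V> -> LV, <LV, T> -> LVT, otherwise None."""
    l_index, v_index = ord(a) - L_BASE, ord(b) - V_BASE
    if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
    s_index, t_index = ord(a) - S_BASE, ord(b) - T_BASE
    if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
            and 0 < t_index < T_COUNT):
        return chr(ord(a) + t_index)
    return None
```

    Composing <L, V> and then <LV, T> with compose_pair inverts the two
    decomposition steps in reverse order.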

    Note that to ensure uniqueness, the non-arithmetically specified jamo
    compositions (NOT a part of any Unicode normalisation) for a syllable
    must be done fully before any of the arithmetically specified compositions
    on that syllable.
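
    A sketch of that ordering, in Python. JAMO_PAIRS is a hypothetical,
    deliberately tiny table of non-arithmetic jamo compositions (not
    Unicode-defined); the arithmetic step is delegated to NFC from the
    standard unicodedata module, which also normalises any other
    characters in the text.

```python
import unicodedata

# Hypothetical jamo-to-jamo compositions (illustrative subset only;
# NOT part of any Unicode normalisation form).
JAMO_PAIRS = {
    ("\u1100", "\u1100"): "\u1101",  # lead KIYEOK KIYEOK  -> SSANGKIYEOK
    ("\u1109", "\u1109"): "\u110A",  # lead SIOS SIOS      -> SSANGSIOS
    ("\u1169", "\u1161"): "\u116A",  # vowel O A           -> WA
    ("\u11A8", "\u11A8"): "\u11A9",  # trail KIYEOK KIYEOK -> SSANGKIYEOK
    ("\u11A8", "\u11BA"): "\u11AA",  # trail KIYEOK SIOS   -> KIYEOK-SIOS
}


def compose_jamos(text):
    """Step 1: merge adjacent jamos left to right using JAMO_PAIRS."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in JAMO_PAIRS:
            out[-1] = JAMO_PAIRS[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)


def compose_syllables(text):
    """Step 2, only after step 1: arithmetic LV / LVT composition via NFC."""
    return unicodedata.normalize("NFC", compose_jamos(text))
```

    Applying NFC directly to a sequence such as <G, G, O, A, G, S> would
    compose the second G with O into an LV syllable first, leaving the
    remaining jamos stranded; doing the jamo compositions first yields a
    single LVT syllable.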

            or in some compatibility syllables (historic syllables starting
            with vowels).

    "Compatibility syllables"?

    None of the historic syllables start with a vowel letter. YESIEUNG, and
    later IEUNG, have always been used as a "silent" lead consonant for
    words that in pronunciation start with a vowel. (IEUNG used to mean
    "ng" also as a lead consonant, but since(?) no Korean words start with "ng",
    and IEUNG looks a lot like YESIEUNG, a leading IEUNG became silent,
    and YESIEUNG became obsolete (a silent trail consonant was always
    omitted).) The FILLERs are entirely modern inventions, used for computer
    representation (in jamos; and some compatibility encodings) mostly for
    isolated letters, and partial syllables.

            This process seems to match the Korean readers' interpretation of
            Hangul syllables, and matches the description in the N954.PDF
            working document of JTC1/SC22/WG20.

            At least it has the merit of allowing unification of uncomposed
            SSANG consonants, or uncomposed Y or E vowels that may appear even
            within a text using only modern jamos or johab syllables. It also
            simplifies the preparation of Hangul texts for UCA.

    Yes, it does. But such preparation is not needed for collation, as it can
    all be done inside of the collation table.
    See http://std.dkuug.dk/jtc1/sc22/wg20/docs/n1051-hangulsort.pdf
    (I'm working on an update of that document).

                    /kent k


