RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 04 2003 - 07:36:46 EST

    Kent Karlsson writes:
    > Philippe Verdy wrote:
    >
    > > I just have another question for Korean: many jamos are in fact
    > > composed from other jamos: this is clearly visible both in their name
    > > and in their composed glyph. What would be the linguistic impact of
    > > decomposing them (not canonically!)? Do Koreans really learn these
    > > jamos without breaking them into their components? I think here
    > > about SSANG (double) consonants, or the initial Y or final E
    > > of some vowels...
    > > Of course I won't be able to use such decomposition in Unicode,
    >
    > Of course you, and anyone else, can. Just as well as one can use spell
    > checkers/correctors, transform digits between scripts, do transcriptions,
    > or any other kind of processing on Unicode texts. It cannot be part of
    > normalisation, though. And I agree that in this case that is unfortunate,
    > since the letter cluster jamos really consist of sequences of two or more
    > letters each. Fortunately, the definition of Hangul syllable blocks need
    > not be changed, as it works well with Hangul syllables as L+, V+, T*
    > (where L, V, and T stand for single-letter jamos).

    In fact the Unicode encoding of modern Hangul syllables is more
    accurately:

            (Ls|Lm)+ (Vs|Vm)+ (Ts|Tm)*

    where Ls,Vs,Ts are single-letter L,V,T modern jamos
    and Lm,Vm,Tm are multiple-letter L,V,T modern jamos
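    As a rough illustration of the pattern above, here is a minimal Python
    sketch. It assumes that "modern jamos" means the conjoining ranges
    U+1100..U+1112 (L), U+1161..U+1175 (V) and U+11A8..U+11C2 (T), so each
    range already covers both the single-letter and the multiple-letter
    jamos. The name is_jamo_syllable is only illustrative.

```python
import re

# Modern conjoining jamo ranges; each range mixes single-letter (s)
# and multiple-letter (m) jamos, so it stands for (Xs|Xm).
L = "[\u1100-\u1112]"  # leading consonants: Ls|Lm
V = "[\u1161-\u1175]"  # vowels: Vs|Vm
T = "[\u11A8-\u11C2]"  # trailing consonants: Ts|Tm

# (Ls|Lm)+ (Vs|Vm)+ (Ts|Tm)*
SYLLABLE = re.compile(f"{L}+{V}+{T}*")

def is_jamo_syllable(s):
    """True if s is exactly one syllable written only with modern jamos."""
    return SYLLABLE.fullmatch(s) is not None
```

    Precomposed LV and LVT syllables (U+AC00..U+D7A3) are deliberately not
    matched by this sketch.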

    If we also count the encoded modern LV and LVT johab syllables:

            ( ( (Ls|Lm)+ (Vs|Vm)+ )
            | ( (Ls|Lm)* (LsVs|LsVm|LmVs|LmVm) (Vs|Vm)* )
            | ( (Ls|Lm)* (LsVsTs|LsVmTs|LmVsTs|LmVmTs|
                          LsVsTm|LsVmTm|LmVsTm|LmVmTm) )
            ) (Ts|Tm)*

    The idea is to allow decomposing Lm, Vm, or Tm into sequences of
    Ls, Vs, or Ts by means of supplementary (non-canonical)
    decompositions, including for the compatibility Hangul syllables.
    This would effectively produce syllables encoded only with
            Ls+ Vs+ Ts*
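
    A minimal sketch of such a supplementary decomposition, for consonants
    only. It relies on an assumption that is not part of Unicode
    normalization: that the character names (SSANGKIYEOK, RIEUL-KIYEOK, ...)
    spell out the component letters. The helper names split_cluster and
    decompose are hypothetical. Vowels are left alone, because their names
    do not list their components and a separate table would be needed.

```python
import unicodedata

PREFIXES = ("HANGUL CHOSEONG ", "HANGUL JONGSEONG ")

def split_cluster(ch):
    """Split a consonant cluster jamo into the letters its name is built from."""
    name = unicodedata.name(ch, "")
    for prefix in PREFIXES:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if "-" in rest:
            parts = rest.split("-")            # e.g. RIEUL-KIYEOK
        elif rest.startswith("SSANG"):
            parts = [rest[len("SSANG"):]] * 2  # e.g. SSANGKIYEOK
        else:
            return ch                          # already a single letter
        try:
            letters = [unicodedata.lookup(prefix + p) for p in parts]
        except KeyError:
            return ch                          # name does not follow the pattern
        # A component may itself be a cluster (old jamos), so recurse.
        return "".join(split_cluster(c) for c in letters)
    return ch

def decompose(text):
    """Canonical decomposition (NFD) followed by the extra cluster splitting."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(split_cluster(c) for c in nfd)
```

    NFD first breaks the precomposed syllables into L, V, T jamos, and the
    name-based step then reduces the consonants to the Ls+ ... Ts* form.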

    One can then recompose them as far as possible into Lm, Vm, and Tm
    jamos, and reassemble those either into johab syllables (LV or LVT)
    or into some compatibility syllables (historic syllables starting
    with vowels).
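
    The recomposition step could be sketched as follows, again for modern
    consonants only. The PAIRS table is my own assumption (it is not a
    Unicode data file), and recompose is a hypothetical name. Pairs of
    single-letter jamos are merged greedily from left to right, then NFC
    builds the LV and LVT syllables.

```python
import unicodedata

# Assumed table: modern cluster jamo -> sequence of single-letter jamos.
PAIRS = {
    "\u1101": "\u1100\u1100",  # choseong SSANGKIYEOK
    "\u1104": "\u1103\u1103",  # choseong SSANGTIKEUT
    "\u1108": "\u1107\u1107",  # choseong SSANGPIEUP
    "\u110A": "\u1109\u1109",  # choseong SSANGSIOS
    "\u110D": "\u110C\u110C",  # choseong SSANGCIEUC
    "\u11A9": "\u11A8\u11A8",  # jongseong SSANGKIYEOK
    "\u11AA": "\u11A8\u11BA",  # jongseong KIYEOK-SIOS
    "\u11AC": "\u11AB\u11BD",  # jongseong NIEUN-CIEUC
    "\u11AD": "\u11AB\u11C2",  # jongseong NIEUN-HIEUH
    "\u11B0": "\u11AF\u11A8",  # jongseong RIEUL-KIYEOK
    "\u11B1": "\u11AF\u11B7",  # jongseong RIEUL-MIEUM
    "\u11B2": "\u11AF\u11B8",  # jongseong RIEUL-PIEUP
    "\u11B3": "\u11AF\u11BA",  # jongseong RIEUL-SIOS
    "\u11B4": "\u11AF\u11C0",  # jongseong RIEUL-THIEUTH
    "\u11B5": "\u11AF\u11C1",  # jongseong RIEUL-PHIEUPH
    "\u11B6": "\u11AF\u11C2",  # jongseong RIEUL-HIEUH
    "\u11B9": "\u11B8\u11BA",  # jongseong PIEUP-SIOS
    "\u11BB": "\u11BA\u11BA",  # jongseong SSANGSIOS
}
RECOMPOSE = {seq: cluster for cluster, seq in PAIRS.items()}

def recompose(text):
    """Merge single-letter consonant pairs into cluster jamos, then apply NFC."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in RECOMPOSE:
            out.append(RECOMPOSE[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return unicodedata.normalize("NFC", "".join(out))
```

    Because leading and trailing consonants have distinct code points, a
    trailing jamo is never merged with the leading jamo of the next
    syllable in this sketch.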

    This process seems to match Korean readers' interpretation of
    Hangul syllables, and it matches the description in the N954.PDF
    working document of JTC1/SC22/WG20.

    At least it has the merit of allowing unification of uncomposed SSANG
    consonants, or uncomposed Y or E vowels, which may appear even within
    a text using only modern jamos or johab syllables. It also simplifies
    the preparation of Hangul texts for UCA.
