RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 04 2003 - 10:27:27 EST


    Kent Karlsson wrote:
    > Philippe Verdy wrote:
    > > If we count also the encoded modern LV and LVT johab syllables:
    > >
    > > ( ((Ls|Lm)+ (Vs|Vm)+) |
    > > ((Ls|Lm)* (LsVs|LsVm|LmVs|LmVm) (Vs|Vm)*) |
    > > ((Ls|Lm)* (LsVsTs|LsVmTs|LmVsTs|LmVmTs|
    > > LsVsTm|LsVmTm|LmVsTm|LmVmTm) ) ) (Ts|Tm)*
    >
    > I'm not even going to try to parse that...

    What is complicated to read here? I used blanks to align
    terms that can match at the same level.

    If it is not clear enough to you, the rule expands as
    one of the three cases below:

    - Hangul syllables coded only with jamos:
    (Ls|Lm)+ (Vs|Vm)+ (Ts|Tm)*

    - Hangul syllables containing 1 "LV" precomposed johab:
    (Ls|Lm)* (LsVs|LsVm|LmVs|LmVm) (Vs|Vm)* (Ts|Tm)*

    - Hangul syllables containing 1 "LVT" precomposed johab:
    (Ls|Lm)* (LsVsTs|LsVmTs|LmVsTs|LmVmTs|LsVsTm|LsVmTm|LmVsTm|LmVmTm) (Ts|Tm)*
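    The three cases above can be sketched as one regular expression. This
    is only an illustration under stated assumptions: the "s" and "m"
    variants are merged into single L, V and T classes, those classes are
    taken to be the code point ranges of the Hangul Jamo block, and the
    names SYLLABLE and is_syllable are made up for the example.

```python
import re

# Assumption: simple (s) and compound (m) jamo are merged into one class
# per position, using the Hangul Jamo block ranges U+1100..U+11FF.
L = "[\u1100-\u115F]"   # leading consonants
V = "[\u1160-\u11A7]"   # vowels
T = "[\u11A8-\u11FF]"   # trailing consonants

# Precomposed syllables U+AC00..U+D7A3: every 28th one is an LV syllable,
# all the others are LVT syllables.
S_BASE, S_END, T_COUNT = 0xAC00, 0xD7A3, 28
LV = "[" + "".join(chr(cp) for cp in range(S_BASE, S_END + 1, T_COUNT)) + "]"
LVT = "[" + "".join(
    chr(cp) for cp in range(S_BASE, S_END + 1) if (cp - S_BASE) % T_COUNT != 0
) + "]"

# (L+ V+ | L* LV V* | L* LVT) T*
SYLLABLE = re.compile(f"(?:{L}+{V}+|{L}*{LV}{V}*|{L}*{LVT}){T}*")


def is_syllable(text: str) -> bool:
    """Return True if the whole string matches one of the three cases."""
    return SYLLABLE.fullmatch(text) is not None


print(is_syllable("\u1100\u1161\u11A8"))  # jamos only
print(is_syllable("\uAC00\u11A8"))        # LV syllable plus trailing jamo
print(is_syllable("\uAC01"))              # LVT syllable
print(is_syllable("\u1161\u1100"))        # vowel before leading consonant
```

    Merging the classes loses the Ls/Lm distinction, so the sketch only
    shows the shape of the rule, not the equivalence problem itself.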

    In Hangul, any text coded with one of the last two sets of
    syllables is canonically equivalent to a text in the first set.
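    As a quick check of that equivalence, here is a minimal sketch using
    Python's standard unicodedata module. The helper name
    canonically_equivalent is invented for the example; it simply compares
    NFD forms.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


lv = "\uAC00"                    # precomposed LV syllable
lvt = "\uAC01"                   # precomposed LVT syllable
jamo_lv = "\u1100\u1161"         # L + V jamos
jamo_lvt = "\u1100\u1161\u11A8"  # L + V + T jamos

# Each precomposed form decomposes to the jamo-only sequence.
assert unicodedata.normalize("NFD", lv) == jamo_lv
assert unicodedata.normalize("NFD", lvt) == jamo_lvt

# An LV syllable followed by a trailing jamo is equivalent to the LVT syllable.
assert canonically_equivalent(lv + "\u11A8", lvt)
assert unicodedata.normalize("NFC", jamo_lvt) == lvt
```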

    The problem is that the first set also contains texts that should be
    considered canonically equivalent but are not (and never will be,
    given the stability policy for normalized decompositions):
    there is no way to associate an "Lm" jamo with its "Ls" components
    so that they compare as canonically equivalent (except of course
    in UCA, where they may compare equal, provided that UCA is updated
    to give them equal collation weights).
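    To illustrate, here is a small sketch with Python's unicodedata
    module. It assumes that "Lm" means a compound leading consonant such
    as U+1101 (HANGUL CHOSEONG SSANGKIYEOK) and that its "Ls" components
    are two U+1100 (HANGUL CHOSEONG KIYEOK); that pairing is my reading
    of the notation, not something the standard defines.

```python
import unicodedata

compound = "\u1101"          # assumed "Lm": SSANGKIYEOK
components = "\u1100\u1100"  # assumed "Ls Ls": KIYEOK + KIYEOK

# The compound jamo has no decomposition mapping at all.
assert unicodedata.decomposition(compound) == ""

# No normalization form turns one spelling into the other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, compound) == compound
    assert unicodedata.normalize(form, components) == components

# Hence the two spellings never compare equal after normalization.
same = unicodedata.normalize("NFD", compound) == unicodedata.normalize("NFD", components)
print(same)  # False
```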

    This archive was generated by hypermail 2.1.5 : Thu Dec 04 2003 - 11:33:31 EST