RE: Compression through normalization

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Thu Dec 04 2003 - 08:10:17 EST

Next message: Peter Kirk: "Re: OT (was RE: MS Windows and Unicode 4.0 ?)"

Previous message: Peter Kirk: "Re: MS Windows and Unicode 4.0 ?"
In reply to: Philippe Verdy: "RE: Compression through normalization"
Next in thread: Philippe Verdy: "RE: Compression through normalization"
Reply: Philippe Verdy: "RE: Compression through normalization"

Philippe Verdy wrote:

...
> letters each. Fortunately, the definition of Hangul syllable blocks need
> not be changed, as it works well with Hangul syllables as L+, V+, T*
> (where L, V, and T stand for single-letter jamos).

In fact the Unicode encoding of modern Hangul syllables is more
accurately:

(Ls|Lm)+ (Vs|Vm)+ (Ts|Tm)*

where Ls,Vs,Ts are single-letter L,V,T modern jamos
and Lm,Vm,Tm are multiple-letter L,V,T modern jamos

Yes, but that goes beyond what I wanted to say.

I'm not even going to try to parse that...

        The idea is to allow decomposing Lm,Vm, or Tm into sequences of
        Ls, Vs, or Ts using supplementary decompositions including for the
        compatibility Hangul syllables.
        So this will effectively produce syllables encoded only with
                Ls+ Vs+ Ts*

That's what I said.
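
To make the proposed extra decomposition step concrete, here is a minimal Python sketch. It assumes a small, hand-written table (the name JAMO_SPLIT and its four entries are illustrative only, and nowhere near complete); no such decompositions are defined by Unicode. The standard NFD step first breaks precomposed syllables into conjoining jamos, and the table then splits multiple-letter jamos into single-letter ones.

```python
import unicodedata

# Hypothetical, deliberately partial table of "supplementary" decompositions.
# These mappings are NOT part of any Unicode normalisation form.
JAMO_SPLIT = {
    "\u1101": "\u1100\u1100",  # lead SSANGKIYEOK  -> KIYEOK KIYEOK
    "\u110A": "\u1109\u1109",  # lead SSANGSIOS    -> SIOS SIOS
    "\u116A": "\u1169\u1161",  # vowel WA          -> O A
    "\u11AA": "\u11A8\u11BA",  # trail KIYEOK-SIOS -> KIYEOK SIOS
}

def to_single_letters(text):
    """NFD first (syllables -> jamos), then split multiple-letter jamos."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(JAMO_SPLIT.get(ch, ch) for ch in nfd)
```

With this table, to_single_letters("\uAE4C") yields the three jamos U+1100 U+1100 U+1161, i.e. a sequence of the form Ls+ Vs+ Ts*.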

Then to recompose them as much as possible to build Lm,Vm,Tm jamos,

One can do that, yes (but not as part of Unicode normalisation).

and then reassemble them into either johab syllables (LV or LVT),

Yes. Like for NFC, using the arithmetically specified decompositions for
LV (into <L, V>) and LVT (into <LV, T>, as they are more properly done),
inverted, recursively.
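
As an illustration of the arithmetically specified mappings, the following Python sketch decomposes LVT into <LV, T> and LV into <L, V>, and inverts those steps for composition. The constants are the ones given in the Unicode Standard for Hangul syllables; the function names are my own and this is only a sketch, not a full normaliser.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per lead consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables

def decompose_step(ch):
    """One step only: LVT -> <LV, T>, LV -> <L, V>, anything else unchanged."""
    s = ord(ch) - S_BASE
    if not 0 <= s < S_COUNT:
        return ch
    t = s % T_COUNT
    if t:  # LVT syllable: peel off the trail consonant
        return chr(S_BASE + s - t) + chr(T_BASE + t)
    return chr(L_BASE + s // N_COUNT) + chr(V_BASE + (s % N_COUNT) // T_COUNT)

def decompose(ch):
    """Apply decompose_step recursively until only jamos remain."""
    step = decompose_step(ch)
    return ch if step == ch else decompose(step[0]) + step[1:]

def compose(text):
    """Invert the steps: <L, V> -> LV, then <LV, T> -> LVT."""
    out = []
    for ch in text:
        if out:
            last = ord(out[-1])
            l, v = last - L_BASE, ord(ch) - V_BASE
            if 0 <= l < L_COUNT and 0 <= v < V_COUNT:
                out[-1] = chr(S_BASE + (l * V_COUNT + v) * T_COUNT)
                continue
            s, t = last - S_BASE, ord(ch) - T_BASE
            if 0 <= s < S_COUNT and s % T_COUNT == 0 and 0 < t < T_COUNT:
                out[-1] = chr(last + t)
                continue
        out.append(ch)
    return "".join(out)
```

For example, decompose("\uD55C") gives U+1112 U+1161 U+11AB, and compose() of those three jamos gives back U+D55C.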

Note that to ensure uniqueness, the non-arithmetically specified jamo
compositions (NOT a part of any Unicode normalisation) for a syllable
must be done fully before any of the arithmetically specified compositions
on that syllable.
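
A small Python sketch of why the order matters. It assumes a hypothetical, partial table of jamo pair compositions (JAMO_PAIRS, not defined by Unicode) and uses the standard library's NFC for the arithmetically specified part.

```python
import unicodedata

# Hypothetical, partial table of jamo-to-jamo compositions.
# NOT part of any Unicode normalisation form.
JAMO_PAIRS = {
    ("\u1100", "\u1100"): "\u1101",  # lead KIYEOK KIYEOK -> SSANGKIYEOK
    ("\u1109", "\u1109"): "\u110A",  # lead SIOS SIOS     -> SSANGSIOS
    ("\u1169", "\u1161"): "\u116A",  # vowel O A          -> WA
    ("\u11A8", "\u11BA"): "\u11AA",  # trail KIYEOK SIOS  -> KIYEOK-SIOS
}

def compose_jamos(text):
    """Greedy left-to-right merging of adjacent jamos found in JAMO_PAIRS."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in JAMO_PAIRS:
            out[-1] = JAMO_PAIRS[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)

def recompose(text):
    """Jamo compositions first, arithmetic (NFC) compositions second."""
    return unicodedata.normalize("NFC", compose_jamos(text))
```

Given the sequence U+1100 U+1100 U+1161, recompose() produces the single syllable U+AE4C. Running NFC first instead gives U+1100 U+AC00, after which the two KIYEOKs can no longer be merged, so the same text would end up with two different representations.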

or into some compatibility syllables (historic syllables starting
with vowels).

"Compatibility syllables"?

None of the historic syllables start with a vowel letter. YESIEUNG, and
later IEUNG, have always been used as a "silent" lead consonant for
words that in pronunciation start with a vowel. (IEUNG used to mean
"ng" also as a lead consonant, but since(?) no Korean words start with "ng",
and IEUNG looks a lot like YESIEUNG, a leading IEUNG became silent,
and YESIEUNG became obsolete (a silent trail consonant was always
omitted).) The FILLERs are entirely modern inventions, used for computer
representation (in jamos; and some compatibility encodings) mostly for
isolated letters, and partial syllables.

        This process seems to match Korean readers' interpretation of
        Hangul syllables, and matches the description in the N954.PDF
        working document of JTC1/SC22/WG20.

        At least it has the merit of allowing unification of uncomposed SSANG
        consonants, or uncomposed Y or E vowels that may appear even within
        a text using only modern jamos or johab syllables. It also simplifies
        the preparation of Hangul texts for UCA.

Yes, it does. But such preparation is not needed for collation, as it can
all be done inside of the collation table.
See http://std.dkuug.dk/jtc1/sc22/wg20/docs/n1051-hangulsort.pdf
(I'm working on an update of that document).

/kent k


This archive was generated by hypermail 2.1.5 : Thu Dec 04 2003 - 09:19:12 EST