RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Dec 03 2003 - 06:52:22 EST

Next message: Arcane Jill: "RE: MS Windows and Unicode 4.0 ?"

Previous message: John Hudson: "RE: MS Windows and Unicode 4.0 ?"
Maybe in reply to: Philippe Verdy: "RE: Compression through normalization"
Next in thread: Doug Ewell: "Re: Compression through normalization"
Reply: Doug Ewell: "Re: Compression through normalization"
Reply: Jungshik Shin: "RE: Compression through normalization"
Reply: Kent Karlsson: "RE: Compression through normalization"

Doug Ewell writes:
> I just read C10 again and noticed that it says that character sequences
> can be replaced by canonical-equivalent sequences -- NOT that they have
> to end up in a particular normalization form. So your strategy of
> converting to a form halfway between NFC and NFD seems acceptable.
> However, the sequences still have to be correct. You can't invent your
> own equivalences, which is what I think you are doing by calling U+110B
> a filler and then using it to create "VT syllables."

That's not a problem with my interpretation of C10, which I understand (I'm
not thinking about inserting any additional character; I just made a wrong
interpretation of IEUNG as a normative "empty" element in Unicode syllable
names).

I still think that we could try to use only LV syllables, and no LVT
syllables, to reduce the set of Hangul characters used, if this helps the
final compressor. It's true that the LV syllables are not contiguous in the
large Hangul johab syllable block. But this could reduce the number of codes
needed in the compression lookup dictionaries, it would limit the number of
table resets because the lookup table would be exhausted less often, and it
would also allow finding compressible similarities in the text stream at much
shorter distances than in a text using a lot of LVT syllables. So the impact
of the LV syllables being spread across the johab set should still be low.

I will try again to compress Korean using a modified NFC form that excludes
LVT johab syllables, keeping only LV johab syllables and separate L, V or T
jamos...
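
A minimal sketch of what such a modified form could look like, assuming the
standard Hangul syllable arithmetic (syllable block starting at U+AC00, 28
trailing-consonant slots per LV syllable, trailing jamos counted from U+11A7).
The function name split_lvt is made up for this illustration. The output is
not NFC any more, but it stays canonically equivalent to the input:

```python
import unicodedata

S_BASE = 0xAC00   # first precomposed Hangul syllable
S_COUNT = 11172   # number of precomposed Hangul syllables
T_BASE = 0x11A7   # one below the first trailing consonant jamo
T_COUNT = 28      # trailing slots per LV syllable (slot 0 = no trailing jamo)


def split_lvt(text):
    """Return NFC text in which every LVT syllable becomes LV syllable + T jamo."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        s_index = ord(ch) - S_BASE
        t_index = s_index % T_COUNT if 0 <= s_index < S_COUNT else 0
        if t_index:
            out.append(chr(ord(ch) - t_index))   # the LV syllable
            out.append(chr(T_BASE + t_index))    # the separate trailing jamo
        else:
            out.append(ch)                       # LV syllable or non-Hangul
    return "".join(out)
```

Renormalizing the result with unicodedata.normalize("NFC", ...) gives back the
original NFC text, so a decompressor can restore it without side information.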

I just have another question about Korean: many jamos are in fact composed
of other jamos; this is clearly visible both in their names and in their
composed glyphs. What would be the linguistic impact of decomposing them (not
canonically!)? Do Koreans really learn these jamos without breaking them into
their components? I am thinking here of the SSANG (double) consonants, or the
initial Y or final E of some vowels...
Of course I won't be able to use such a decomposition in Unicode, but would
it be possible to use it in some private encoding created with an m:n charset
mapping from/to Unicode?
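
To make the question concrete, here is a rough sketch of the forward half of
such a private mapping, limited to the five SSANG leading consonants
(U+1101, U+1104, U+1108, U+110A, U+110D). The table and the name to_private
are hypothetical, and the result is deliberately NOT canonically equivalent
to the input, so it could only live in a private encoding:

```python
import unicodedata

# Hypothetical private mapping: each SSANG (double) leading consonant is
# written as two copies of the corresponding simple leading consonant.
SSANG_SPLIT = {
    0x1101: "\u1100\u1100",  # SSANGKIYEOK -> KIYEOK KIYEOK
    0x1104: "\u1103\u1103",  # SSANGTIKEUT -> TIKEUT TIKEUT
    0x1108: "\u1107\u1107",  # SSANGPIEUP  -> PIEUP PIEUP
    0x110A: "\u1109\u1109",  # SSANGSIOS   -> SIOS SIOS
    0x110D: "\u110C\u110C",  # SSANGCIEUC  -> CIEUC CIEUC
}


def to_private(text):
    """Fully decompose to jamos (NFD), then split the SSANG leading consonants."""
    return unicodedata.normalize("NFD", text).translate(SSANG_SPLIT)
```

The reverse mapping would only be unambiguous if the source text never
contains two identical simple leading consonants in a row by itself, which is
an assumption this sketch does not check.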




This archive was generated by hypermail 2.1.5 : Wed Dec 03 2003 - 07:42:33 EST