Re: Compression through normalization

From: Jungshik Shin (jshin@mailaps.org)
Date: Wed Dec 03 2003 - 13:24:17 EST

Next message: Michael Everson: "Re: MS Windows and Unicode 4.0 ?"

Previous message: jameskass@att.net: "Re: MS Windows and Unicode 4.0 ?"
In reply to: Doug Ewell: "Re: Compression through normalization"
Next in thread: John Cowan: "Re: Compression through normalization"
Reply: John Cowan: "Re: Compression through normalization"

On Wed, 3 Dec 2003, Doug Ewell wrote:

> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> Speaking of which, I just noticed that the function in SC UniPad to
> compose syllables from jamos does not handle this case (LV + T = LVT).
> I'll have to report that to the UniPad team.

Yudit, Mozilla, and soon a whole bunch of applications written with Pango
treat them as equivalent :-) BTW, Uniscribe doesn't treat them as
equivalent, either. I 'failed to' persuade 'MS' that they should be
treated as equivalent.
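
For reference, here is a minimal sketch of the arithmetic behind the
LV + T = LVT case, following the Hangul syllable composition constants
published in the Unicode Standard. The function name compose_pair is
hypothetical and only for illustration; it is not how UniPad, Pango or
Uniscribe implement it.

```python
import unicodedata

# Constants of the Hangul syllable composition arithmetic (Unicode Standard).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables


def compose_pair(first, second):
    """Compose (L, V) into LV or (LV, T) into LVT; return None otherwise.

    Hypothetical helper for illustration only.
    """
    a, b = ord(first), ord(second)

    # Case 1: leading consonant + vowel -> LV syllable.
    l_index, v_index = a - L_BASE, b - V_BASE
    if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)

    # Case 2: LV syllable (no final yet) + trailing consonant -> LVT.
    s_index, t_index = a - S_BASE, b - T_BASE
    if 0 <= s_index < S_COUNT and s_index % T_COUNT == 0 and 0 < t_index < T_COUNT:
        return chr(a + t_index)

    return None


# U+AC00 (LV) followed by U+11A8 (T) composes to U+AC01 (LVT),
# which is also what NFC normalization produces.
assert compose_pair("\uac00", "\u11a8") == "\uac01"
assert unicodedata.normalize("NFC", "\uac00\u11a8") == "\uac01"
```

An editor or renderer that handles only the L + V + T sequence and
skips the second case would show exactly the behaviour described above.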

> > I just have another question for Korean: many jamos are in fact
> > composed from other jamos: this is clearly visible both in their name
> > and in their composed glyph. What would be the linguistic impact of
> > decomposing them (not canonically!)? Do Koreans really learn these
> > jamos without breaking them into their components? I think here about
> > SSANG (double) consonants, or the initial Y or final E of some
> > vowels...
>
> This would be a good question for Jungshik or another native Korean. I
> have read that Korean children learn the syllables as whole units,

There's no single view of what constitutes an 'atom' in the Korean
writing system. You're right that on many occasions, to many people,
syllables are units, but they are not unbreakable atomic units. Everybody
knows that syllables are made out of consonants and vowels. The Korean
ABC-song enumerates only 14 consonants and 10 vowels, and that's what
school children learn in the first grade. Most Korean input methods
have a configurable option for what the 'backspace/delete' key should do
(i.e. delete the whole syllable or only the preceding letter) during
syllable 'formation' (i.e. before the syllable is committed). On Korean
mobile phones, we go even smaller. Only three elements (a vertical stroke,
a horizontal stroke and a dot, assigned to '1', '2' and '3') are used to
compose vowels. The 14 consonants are assigned to '4' through '0' (two or
three of them together on each key).

Before KS C 5601-1987, we used the n-byte Hangul code, in which a
single-byte code was assigned to each of 19 leading consonants (14 basic
+ 5 double) + filler, 21 vowels (10 basic + 11 'diphthongs') + filler, and
27 final consonants (14 basic + 13 complex). A syllable was represented
with either 2 or 3 bytes. SI and SO were used to toggle between Korean
mode and ASCII mode (that was before the invention of the EUC scheme in
Unix, and we could use only octets with MSB=0).

Also note that the 'JOHAB' encoding (which many Koreans regarded as better
than 'Wansung', i.e. KS X 1001-based EUC-KR, although it's not ISO 2022
compliant) uses three 5-bit fields (for leading consonants, vowels and
final consonants) to encode syllables. The 'MSB' of each 2-byte unit is
set to 1 to indicate that it's for Korean (not US-ASCII).
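
The bit layout just described can be sketched as below. This is only an
illustration of the packing (1 flag bit + 3 x 5 bits = 16 bits); the
function names are hypothetical, and the tables that map each jamo to
its 5-bit field value are deliberately not reproduced here.

```python
def pack_johab(initial, medial, final):
    """Pack three 5-bit field values into a 16-bit unit with the MSB set.

    Hypothetical helper; the field values must come from the JOHAB
    jamo tables, which are not included in this sketch.
    """
    for field in (initial, medial, final):
        if not 0 <= field < 32:
            raise ValueError("each field must fit in 5 bits")
    return 0x8000 | (initial << 10) | (medial << 5) | final


def unpack_johab(code):
    """Split a 16-bit unit into (initial, medial, final) field values."""
    if not code & 0x8000:
        raise ValueError("MSB is 0: not a Hangul unit in this scheme")
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


# 0x8861 is 1 00010 00011 00001 in binary, i.e. fields (2, 3, 1).
assert unpack_johab(0x8861) == (2, 3, 1)
assert pack_johab(2, 3, 1) == 0x8861
```

As far as I know, 0x8861 is the JOHAB code for the syllable U+AC00, with
field value 1 acting as the 'no final consonant' filler, but treat that
as an assumption and check it against the standard's tables.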

> rather than as an arrangement of jamos as I would see them,

In the early 20th century, a couple of 'script reform' attempts were
made (by prominent Korean linguists) to work around the difficulty of
printing (having to make a lot of metal type). It was proposed that
letters be written in a linear fashion (just as the Latin/Cyrillic/Greek
alphabets are written), but none of these proposals caught on. In the
Korean community in the Russian Far East, quite a lot of material was
published in one of these 'reformed' scripts.

> leading some
> to think of Hangul as a featural syllabary instead of an alphabet.

The Korean script is alphabetic, syllabic, and featural all at the same
time :-) And it's also logographic, just as any other script can be
(e.g. 'enough' in English is logographic in a sense, isn't it?)

Jungshik

This archive was generated by hypermail 2.1.5 : Wed Dec 03 2003 - 17:37:34 EST