Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 07 2003 - 22:09:21 EST

Next message: Jungshik Shin: "Re: Transcoding Tamil in the presence of markup"

Previous message: Christopher John Fynn: "Re: Coloured diacritics"
Maybe in reply to: Doug Ewell: "Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)"
Next in thread: Andrew C. West: "Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)"

>Doug Ewell [mailto:dewell@adelphia.net] writes:
>> Peter Kirk <peterkirk at qaya dot org> wrote:
>> > Unicode is of course very familiar with this kind of situation e.g.
>> > with character name errors, combining class errors, 11000+ redundant
>> > Korean characters without decompositions, etc etc.
>>
>> "Without decompositions"? What about the canonical equivalence between
>> jamos and syllables described in Section 3.12? What about the algorithm
>> to derive the canonical decomposition shown on page 88? What am I
>> missing here?
>
>I think that's related to the non-basic jamos in the johab set that are in
fact made of two other jamos, such as "SSANG" (double) consonants or
YA/YU/YE vowels.
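
For reference, the arithmetic decomposition Doug mentions can be sketched in a few lines of Python. This is only an illustration: the constants are those of the Hangul syllable decomposition algorithm in the Unicode Standard, while the function name decompose_syllable is my own. It shows that every precomposed syllable does decompose canonically into conjoining jamos, but that a "SSANG" jamo such as U+1101 has no decomposition of its own.

```python
import unicodedata

# Constants of the Hangul syllable decomposition algorithm (Unicode Standard).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables


def decompose_syllable(ch):
    """Return the conjoining jamo sequence for one precomposed syllable.

    Any other character is returned unchanged. (Hypothetical helper name.)
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


sample = "\uae4c"  # a syllable whose leading consonant is SSANGKIYEOK
print([f"U+{ord(c):04X}" for c in decompose_syllable(sample)])

# The arithmetic result agrees with the library's canonical decomposition.
print(decompose_syllable(sample) == unicodedata.normalize("NFD", sample))

# The double consonant itself carries no decomposition mapping.
print(repr(unicodedata.decomposition("\u1101")))
```

So the syllables are not "without decompositions", but the decomposition stops at the compound jamo and never splits it into two basic ones.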
>
>He must be counting them in all the encoded compound syllables where they
are also used, or in the compatibility jamos. He may also be counting the
ambiguous compatibility consonants, which do not distinguish leading from
trailing forms: they must be parsed with a phonetic and/or dictionary lookup
to decide whether they close the current Hangul syllable or start a new one.
Because of them, the compatibility vowels become ambiguous too: does a vowel
combine with the previous consonant (then considered leading), or not (when
the previous consonant is considered trailing)?
>
>However, the problem is less difficult with compatibility vowels: the texts
I have seen, as well as the input methods for Hangul keyboards, seem to
always associate the vowel with the preceding compatibility consonant
(which is then considered leading), unless a consonant filler is used to
avoid this composition (the filler then explicitly marks a syllable break,
and its presence also forces the previous consonant to be treated as
trailing).
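
To make the rule concrete, here is a rough Python sketch of such a greedy parser. It is not taken from any standard or real IME: the function names are hypothetical, and it relies on the assumption that a compatibility jamo named "HANGUL LETTER X" has conjoining counterparts named "HANGUL CHOSEONG X", "HANGUL JUNGSEONG X" or "HANGUL JONGSEONG X" in the Unicode character names. A consonant followed by a vowel is taken as leading; a consonant after a leading+vowel pair is taken as trailing unless a vowel follows it. It does not merge two keystrokes into a compound trailing consonant, and it needs no dictionary.

```python
import unicodedata


def _conjoining(ch, role):
    """Map a compatibility jamo to the conjoining jamo of the given role.

    role is "CHOSEONG", "JUNGSEONG" or "JONGSEONG". Returns None when the
    character is not a compatibility letter or has no counterpart in that
    role. (Hypothetical helper based on character names.)
    """
    name = unicodedata.name(ch, "")
    if not name.startswith("HANGUL LETTER "):
        return None
    try:
        return unicodedata.lookup(name.replace("LETTER", role, 1))
    except KeyError:
        return None


def compat_to_syllables(text):
    """Greedily group compatibility jamos into precomposed syllables."""
    out = []
    i, n = 0, len(text)
    while i < n:
        lead = _conjoining(text[i], "CHOSEONG")
        vowel = _conjoining(text[i + 1], "JUNGSEONG") if i + 1 < n else None
        if lead and vowel:
            syllable = lead + vowel
            i += 2
            if i < n:
                trail = _conjoining(text[i], "JONGSEONG")
                vowel_follows = (
                    i + 1 < n and _conjoining(text[i + 1], "JUNGSEONG") is not None
                )
                # A consonant stays with this syllable only if no vowel claims it.
                if trail and not vowel_follows:
                    syllable += trail
                    i += 1
            out.append(unicodedata.normalize("NFC", syllable))
        else:
            # Anything else (lone jamo, filler, non-Hangul) is copied as is.
            out.append(text[i])
            i += 1
    return "".join(out)


# Six compatibility jamos: HIEUH A NIEUN KIYEOK EU RIEUL
print(ascii(compat_to_syllables("\u314e\u314f\u3134\u3131\u3161\u3139")))
```

Because the filler is not a "HANGUL LETTER", it falls through the else branch and so breaks the syllable, which matches what I observed in the texts above.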
>
>I think that a document may exist somewhere describing the transition
rules from compatibility consonants to leading/trailing consonants, in an
annex of the new johab standard. But I can't read Korean, so if there's a
resource in English or French, I would be interested to read it.
>
>For now all I have is the Hangul FAQ at:
>http://www.hansoft.com/hangul/faq.html
>which just speaks briefly about two-set and three-set keyboards, used to
enter text in the Wansung set (KSC-5715_1985) or Johab set (KSC-5601-1992).
From what I read, most Hangul keyboards are two-set (Wansung), and so do
not allow the differentiation of leading and trailing consonants.
>
>This FAQ also suggests that two-set keyboards are considered better
because input with them is much faster and works accurately for modern
Korean. Users want a three-set keyboard to work with Old Hangul. So the
reality is that modern Korean will very often be encoded with the Wansung
set, and the actual distinction between leading and trailing consonants is
made more or less accurately by software. Such software, in fact an input
method, is not needed with the three-set keyboard, which allows direct input
of a syllable in a single stroke, with the keys that make up the syllable
pressed simultaneously. So the three-set keyboard is used only by
experienced users.
>
>This then exposes the problem of the canonical difference between texts
entered on the two types of keyboard, and the fact that a software IME is
needed and may fail to recognize syllable breaks correctly.
>
>To limit this complexity, the two-set keyboard still contains keys for
precomposed compound jamos ("SSANG" double consonants, and double vowels
YA/YE/YU/YO), as these pairs are very frequent in syllables and the keys
ease their input. But as these compound jamos are in the upper set (with
shift), many users will often think it is faster to input them separately
(notably for double vowels), as it gives the same visual result; if there's
an active IME, the IME will recompose them on the fly. But if no IME is used,
individual keystrokes will create sequences containing only basic jamos,
and if this text is then converted naively to Unicode, jamo per jamo, the
Unicode string will not be canonically equivalent to the string entered
with an active IME.
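
A small Python check illustrates the non-equivalence. It is only a sketch: the two strings are my own example, one with the compound leading jamo U+1101 as an IME would produce it, the other with the basic jamo U+1100 typed twice and converted jamo per jamo.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent when their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


with_ime = "\u1101\u1161"      # SSANGKIYEOK + A, as recomposed by an IME
naive = "\u1100\u1100\u1161"   # KIYEOK + KIYEOK + A, converted jamo per jamo

print(canonically_equivalent(with_ime, naive))          # False
print(ascii(unicodedata.normalize("NFC", with_ime)))    # '\uae4c'
print(ascii(unicodedata.normalize("NFC", naive)))       # '\u1100\uac00'
```

Normalization composes the first string into a single syllable, but in the second it leaves a stray leading KIYEOK in front of a different syllable, so no normalization form will reconcile the two.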
>
>The absence of a Hangul IME is quite common in software not prepared to
accept IMEs (notably a lot of non-Korean software for Windows, Linux or
MacOS, which will blindly process and store strings entered in the system
charset)...


This archive was generated by hypermail 2.1.5 : Sun Dec 07 2003 - 23:00:49 EST