Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 07 2003 - 22:09:21 EST

    >Doug Ewell [mailto:dewell@adelphia.net] writes:
    >> Peter Kirk <peterkirk at qaya dot org> wrote:
    >> > Unicode is of course very familiar with this kind of situation e.g.
    >> > with character name errors, combining class errors, 11000+ redundant
    >> > Korean characters without decompositions, etc etc.
    >>
    >> "Without decompositions"? What about the canonical equivalence between
    >> jamos and syllables described in Section 3.12? What about the algorithm
    >> to derive the canonical decomposition shown on page 88? What am I
    >> missing here?
    >
    >I think that's related to the non-basic jamos in the johab set that are in
    fact made of two other jamos, such as "SSANG" (double) consonants or
    YA/YU/YE vowels.
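
    To make the point concrete, here is a minimal sketch of the arithmetic
    decomposition from Section 3.12, checked against Python's unicodedata
    module. The constants are those given in the standard; the function name
    decompose_syllable is only an illustrative choice. It shows that a
    syllable decomposes into conjoining jamos, but that a compound jamo such
    as SSANGKIYEOK (U+1101) does not decompose any further.

```python
import unicodedata

# Constants from the Hangul syllable decomposition in the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables


def decompose_syllable(ch):
    """Return the canonical jamo sequence for one precomposed syllable.

    Any character outside the Hangul syllable block is returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


kka = "\uAE4C"  # syllable KKA: SSANGKIYEOK + A
assert decompose_syllable(kka) == unicodedata.normalize("NFD", kka)

# The compound leading jamo itself has no canonical decomposition:
# it is not equivalent to two KIYEOK (U+1100) in a row.
assert unicodedata.decomposition("\u1101") == ""
```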
    >
    >He must be counting them in all the encoded compound syllables where they
    are also used, or in compatibility jamos, also counting the ambiguous
    leading/trailing compatibility consonants, which must be parsed with phonetic
    and/or dictionary lookup to see whether they close the current
    Hangul syllable or start a new one. Because of them, the compatibility
    vowels become ambiguous as well (do they combine with
    the previous consonant, considered leading, or not, when the previous
    consonant is considered trailing?).
    >
    >However, the problem is less difficult with compatibility vowels: the texts
    I have seen, as well as the input methods for Hangul keyboards, seem to
    always associate the vowel with a previous compatibility consonant
    (then considered leading), unless a consonant filler is used to avoid
    this composition (the filler then explicitly marks a syllable break, and its
    presence also forces the previous consonant to be treated as trailing).
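
    The behaviour described above can be sketched as a small greedy
    segmenter over compatibility jamos. This is only my reading of what the
    texts and input methods seem to do, not a documented rule: the function
    name split_syllables is hypothetical, the treatment of HANGUL FILLER
    (U+3164) as a break marker is an assumption, and a real IME also has to
    deal with compound finals and dictionary lookup, which this ignores.

```python
# Compatibility vowels run from U+314F (A) to U+3163 (I).
VOWELS = {chr(c) for c in range(0x314F, 0x3164)}
FILLER = "\u3164"  # HANGUL FILLER, assumed here to mark a syllable break


def split_syllables(jamos):
    """Group a string of compatibility jamos into syllable-like chunks.

    A consonant directly followed by a vowel is treated as leading and
    opens a new chunk; any other consonant is treated as trailing and
    stays with the current chunk. A filler closes the current chunk.
    """
    chunks, current = [], ""
    for i, ch in enumerate(jamos):
        nxt = jamos[i + 1] if i + 1 < len(jamos) else ""
        if ch == FILLER:
            if current:
                chunks.append(current)
            current = ""
        elif ch in VOWELS:
            current += ch
        elif nxt in VOWELS:
            # Consonant before a vowel: it leads a new syllable.
            if current:
                chunks.append(current)
            current = ch
        else:
            # Consonant not followed by a vowel: it trails this syllable.
            current += ch
    if current:
        chunks.append(current)
    return chunks


# HIEUH A NIEUN KIYEOK EU RIEUL splits as HAN + GEUL.
hangul = "\u314E\u314F\u3134\u3131\u3161\u3139"
print(split_syllables(hangul))
```

    Without a filler, HIEUH A NIEUN A splits as HA + NA; with a filler after
    NIEUN, the NIEUN stays attached to the first chunk as a trailing
    consonant.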
    >
    >I think that there may be a document somewhere describing the
    transition rules from compatibility consonants to leading/trailing
    consonants, in an annex of the new johab standard. But I can't read Korean,
    so if there's a resource in English or French it would be interesting to
    read it.
    >
    >For now all I have is the Hangul FAQ at:
    >http://www.hansoft.com/hangul/faq.html
    >which just speaks briefly about two-set and three-set keyboards, used to
    enter text in the Wansung set (KSC-5715_1985) or Johab set (KSC-5601-1992).
    From what I read, most Hangul keyboards are two-set (Wansung), and so do
    not allow the differentiation of leading and trailing consonants.
    >
    >This FAQ also suggests that two-set keyboards are considered better
    because input with such a keyboard is much faster and works accurately for
    modern Korean. Users want a three-set keyboard to work with Old Hangul... So
    the reality is that modern Korean will very often be encoded with the
    Wansung set, and the actual distinction between leading and trailing
    consonants is made more or less accurately by software (such software,
    in fact an input method, is not needed with the three-set keyboard, which
    allows direct input of a syllable in a single keystroke, with the keys that
    make up the syllable pressed simultaneously). So the three-set keyboard is
    used only by experienced users.
    >
    >This then exposes the problem of the canonical difference between text
    entered on either type of keyboard, and the fact that a software IME is
    needed and may fail to recognize syllable breaks correctly.
    >
    >To limit this complexity, the two-set keyboard still contains keys for
    precomposed compound jamos ("SSANG" double consonants, and double vowels
    YA/YE/YU/YO), as these pairs are very frequent in syllables and the keys ease
    their input. But as these compound jamos are in the upper set (with shift),
    many users will often think it is faster to input them separately (notably
    for double vowels), as it gives the same visual result; if there's an active
    IME, the IME will recompose them on the fly. But if no IME is used,
    individual keystrokes will create sequences containing only basic jamos,
    and if this text is then converted naively to Unicode, jamo by jamo, the
    Unicode string will not be canonically equivalent to the string entered
    with an active IME.
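
    A small check with Python's unicodedata module illustrates this loss of
    canonical equivalence. The two strings below are my own example (the
    names with_ime and naive, and the helper canonically_equivalent, are
    hypothetical): one uses the compound jamo SSANGKIYEOK, the other two
    separate KIYEOK, each followed by the vowel A.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


with_ime = "\u1101\u1161"      # SSANGKIYEOK + A, as an IME would recompose it
naive = "\u1100\u1100\u1161"   # KIYEOK + KIYEOK + A, converted jamo by jamo

# The precomposed syllable KKA is equivalent to the IME sequence...
assert canonically_equivalent("\uAE4C", with_ime)
# ...but not to the naive one, whatever normalization form is applied.
assert not canonically_equivalent(with_ime, naive)

# NFC composes the first into one syllable, and leaves a stray leading
# KIYEOK before the syllable GA in the second.
assert unicodedata.normalize("NFC", with_ime) == "\uAE4C"
assert unicodedata.normalize("NFC", naive) == "\u1100\uAC00"
```

    So the two inputs may look the same on screen, yet no normalization form
    will make them compare equal.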
    >
    >The absence of a Hangul IME is quite common in software not prepared to
    accept IMEs (notably a lot of non-Korean software for Windows, Linux or
    MacOS, which will blindly process and store strings entered in the system
    charset)...

    This archive was generated by hypermail 2.1.5 : Sun Dec 07 2003 - 23:00:49 EST