RE: Indic Devanagari Query

From: Kent Karlsson (kentk@md.chalmers.se)
Date: Wed Jan 29 2003 - 09:08:29 EST

    > > I wouldn't go so far. The fact that clusters belong together is something
    > > that can be handled by the software. Collation and other data processing
    > > needs to deal with such issues already for many other languages. See
    > > http://www.unicode.org/reports/tr10 on the collation algorithm.
    >
    > I beg to differ with you on this point. Merely having some provision for
    > composing a character doesn't mean that the character is not a candidate
    > for inclusion as separate code point.

    At this point, having "some provision for composing" a particular letter
    is very much preventing it from being encoded at a separate code position.
    This is due mostly to the fixation of normal forms (except for very rare
    error corrections).
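
    As a rough illustration of what fixed normal forms mean in practice,
    here is a minimal sketch using Python's standard unicodedata module.
    The choice of U+0958 (DEVANAGARI LETTER QA) as the legacy example and
    of KA + VIRAMA + SSA as the spelling of "Kssha" are assumptions made
    for this example, not something taken from the thread.

```python
import unicodedata

# Legacy precomposed letter: U+0958 DEVANAGARI LETTER QA.
# It canonically decomposes to KA (U+0915) + NUKTA (U+093C) and is
# excluded from recomposition, so every normal form uses the sequence.
QA_PRECOMPOSED = "\u0958"
QA_SEQUENCE = "\u0915\u093c"

# "Kssha" written as a conjunct: KA + VIRAMA + SSA (assumed spelling).
KSSHA_SEQUENCE = "\u0915\u094d\u0937"


def normal_forms(text):
    """Return the NFC and NFD forms of text as a pair."""
    return (unicodedata.normalize("NFC", text),
            unicodedata.normalize("NFD", text))


qa_nfc, qa_nfd = normal_forms(QA_PRECOMPOSED)
kssha_nfc, kssha_nfd = normal_forms(KSSHA_SEQUENCE)

print([hex(ord(c)) for c in qa_nfc])     # code points of NFC(QA)
print([hex(ord(c)) for c in kssha_nfc])  # code points of NFC(Kssha)
```

    Under these assumptions the precomposed QA never survives
    normalization, and the Kssha sequence is already in normal form. A
    newly encoded single code point for Kssha could not change how that
    existing sequence normalizes.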

    > In Unicode many characters have been given codepoints regardless of the
    > fact that the same character could have been rendered through some compose
    > mechanism. This includes Indic scripts as well as other scripts. For

    For legacy reasons, yes. These reasons no longer apply for
    not-yet-encoded compositions.

    > Also, many times processing of text depends on the smallest addressable
    > unit of that language. Again as discussed in earlier e-mails this may vary
    > from one language to another in the same script. Consider a case when a
    > language processor/application wants to count the number of characters in
    > some text in order to find number of keystrokes required to input the text.

    You cannot find the number of keystrokes that way. Not even
    if you know which keyboard (and disregarding backspace). E.g.
    ä can be produced by one or two (or more, if you count hex input)
    keystrokes on (most) Swedish keyboards.
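
    The same mismatch shows up on the storage side. A small sketch,
    assuming Python 3 strings (sequences of code points): the same ä can
    be stored as one code point or as two, so neither the code point
    count nor the UTF-8 byte count says anything about keystrokes.

```python
import unicodedata

# Two canonically equivalent spellings of the same letter.
PRECOMPOSED = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
DECOMPOSED = "a\u0308"   # "a" followed by COMBINING DIAERESIS


def code_point_count(text):
    """Number of Unicode code points in text."""
    return len(text)


def utf8_byte_count(text):
    """Number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for label, text in (("precomposed", PRECOMPOSED), ("decomposed", DECOMPOSED)):
    print(label, code_point_count(text), utf8_byte_count(text))

# Both spellings normalize to the same string under NFC.
same_after_nfc = (unicodedata.normalize("NFC", PRECOMPOSED)
                  == unicodedata.normalize("NFC", DECOMPOSED))
print(same_after_nfc)
```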

    > Further assume that API functions used for this purpose are based on either
    > WChar (wide characters) or UTF-8. In this case it is very much necessary
    > that you assign the character, say Kssha, to the class "consonant". Since
    > assignment to this class "consonant" applies to single code point (the
    > smallest addressable unit) and not to the sequence of codes, it is very
    > much necessary to have single code point for the character "Kssha".

    No, that is not the case. E.g. Hungarian (Magyar) has "gy", "ny", "ly"
    (and more) as letters (look in a Hungarian dictionary, and its headings).
    Similarly, Albanian has "dh", "rr", "th" (and more) as letters. None of
    these combinations are candidates for single code point allocation. For
    compatibility reasons the Dutch "ij" got a single code point, but it
    is better to just use "i" followed by "j" (though that has some
    difficulties; e.g. the titlecase of ijs is IJs, not Ijs).
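
    To make this concrete, here is a hedged sketch of how software can
    treat a sequence of code points as one letter without a dedicated
    code point. The names split_letters, HUNGARIAN_LETTERS and
    dutch_titlecase are hypothetical helpers invented for this example,
    and the greedy matching is deliberately naive: real Hungarian text
    has cases (e.g. at compound boundaries) where two adjacent letters
    are not a digraph, which needs dictionary knowledge to resolve.

```python
def split_letters(word, multi_letters):
    """Split word into letters, treating each string in multi_letters
    as a single letter (greedy, longest match first)."""
    ordered = sorted(multi_letters, key=len, reverse=True)
    letters = []
    i = 0
    while i < len(word):
        for candidate in ordered:
            if word.startswith(candidate, i):
                letters.append(candidate)
                i += len(candidate)
                break
        else:
            # No multi-code-point letter starts here: take one code point.
            letters.append(word[i])
            i += 1
    return letters


# A few Hungarian digraph letters named in the message (not the full set).
HUNGARIAN_LETTERS = {"gy", "ny", "ly"}

# "Kssha" as KA + VIRAMA + SSA (assumed spelling for this example).
KSSHA = "\u0915\u094d\u0937"


def dutch_titlecase(word):
    """Titlecase a Dutch word, capitalizing an initial "ij" as "IJ"."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:].lower()
    return word[:1].upper() + word[1:].lower()


print(split_letters("gyerek", HUNGARIAN_LETTERS))
print(len(split_letters(KSSHA + "\u093e", {KSSHA})))
print(dutch_titlecase("ijs"), "ijs".title())
```

    A class such as "consonant" can then be assigned to the strings that
    split_letters returns rather than to individual code points.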

                    /Kent K


