Re: Vietnamese (Re: Unicode, SMS, PDA/cellphones)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Jun 04 2006 - 10:05:41 CDT

    From: "Doug Ewell" <dewell@adelphia.net>
    >> Why then would it be more complicated to compose text like this,
    >> instead of using VIQR that would require composing mostly the same
    >> number of symbols (and sometimes more...)?
    >
    > Vietnamese composition becomes tricky when working with fully decomposed
    > vowels, so that ậ decomposes to "U+0061 plus U+0323 plus U+0302." Not
    > all rendering systems (even today) can handle placing two or more
    > diacritical marks on a single base letter.
    >
    > Additionally, this decomposition ("a" plus dot-below plus circumflex)
    > doesn't match the way Vietnamese view this letter (("a" plus circumflex)
    > plus dot-below). This is not a Unicode problem, but entering the
    > diacritics in the language-appropriate order might be a problem if a
    > rendering engine insists on canonical order.

    That did not directly address my question. The only thing that does not seem natural for Vietnamese is the encoding order of diacritics in NFD-decomposed letters, because NFD places a tone mark before the decomposed vowel modifier. But an encoding that does not decompose the 6 extra base vowels (ă, â, ê, ô, ơ, ư), which Vietnamese treats as unbreakable units, and that uses the 12 vowels plus combining diacritics only for the tone marks, will work fine and will seem quite natural to users.
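    The "12 vowels plus combining tone marks" form described above can be sketched in Python. This is only an illustration, not an established normalization form: the function name and the two mark sets are assumptions made here, and the sketch is meant for Vietnamese text only (it leaves other scripts in their NFD form).

```python
import unicodedata

# Marks that turn a plain vowel into one of the 6 extra Vietnamese vowels
# (assumed set: circumflex, breve, horn).
VOWEL_MODIFIERS = {"\u0302", "\u0306", "\u031B"}


def to_vowel_plus_tone(text):
    """Keep base vowels precomposed; leave every other mark combining."""
    nfd = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(nfd):
        base = nfd[i]
        i += 1
        # Collect the combining marks that follow this base character.
        marks = []
        while i < len(nfd) and unicodedata.combining(nfd[i]) != 0:
            marks.append(nfd[i])
            i += 1
        modifiers = [m for m in marks if m in VOWEL_MODIFIERS]
        others = [m for m in marks if m not in VOWEL_MODIFIERS]
        # Recompose only base + vowel modifier (for example a + circumflex).
        vowel = unicodedata.normalize("NFC", base + "".join(modifiers))
        out.append(vowel + "".join(others))
    return "".join(out)


# U+1EAD (a with circumflex and dot below) becomes U+00E2 + U+0323.
result = to_vowel_plus_tone("\u1ead")
print([f"U+{ord(c):04X}" for c in result])
```

    The result stays canonically equivalent to the input, so normalizing it to NFC gives back the usual precomposed letter.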

    So a PDA or cellphone where text is input this way is not a bad option, and it seems easy to place the 6 extra base vowels on the 9 keys of a cellphone keypad without requiring many extra keystrokes to select the appropriate character. Users could then select additional tone marks if they wish.

    My cell phone, for example, uses the keys [2] to [9] for all letters (with at most 4 keystrokes for any letter or digit), the key [0] only for the digit 0, the key [*] for composing a space or accessing a grid of extra characters, and the key [1] for the digit 1 and all punctuation. The key [*] is used to switch between lowercase/uppercase/digits input modes. So there is plenty of room to map the selection of tone marks onto the key [0].
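    To make the idea concrete, here is a rough multi-tap sketch in Python. The key layout is entirely hypothetical (it is not the layout of any actual phone, and it ignores digits and mode switching): the 6 extra vowels are simply appended to the usual letter groups, and the key [0] cycles through the five tone marks.

```python
import unicodedata

# Hypothetical layout: each key cycles through its characters on repeated taps.
KEYS = {
    "2": "a\u0103\u00e2bc",    # a, a-breve, a-circumflex, b, c
    "3": "d\u0111e\u00eaf",    # d, d-stroke, e, e-circumflex, f
    "4": "ghi",
    "5": "jkl",
    "6": "mno\u00f4\u01a1",    # m, n, o, o-circumflex, o-horn
    "7": "pqrs",
    "8": "tu\u01b0v",          # t, u, u-horn, v
    "9": "wxyz",
    # Tone marks: grave, acute, hook above, tilde, dot below.
    "0": "\u0300\u0301\u0309\u0303\u0323",
}


def decode(taps):
    """Decode space-separated tap groups such as '222 00000' into NFC text."""
    out = []
    for group in taps.split():
        key = group[0]
        if key not in KEYS or set(group) != {key}:
            raise ValueError(f"bad tap group: {group!r}")
        choices = KEYS[key]
        # Tapping past the last choice wraps around to the first one.
        out.append(choices[(len(group) - 1) % len(choices)])
    # The vowel is entered first, then its tone mark; NFC composes the pair.
    return unicodedata.normalize("NFC", "".join(out))


word = decode("8888 444 3333 00000 8")
print([f"U+{ord(c):04X}" for c in word])
```

    Because the tone mark is typed after the complete vowel, the input order matches the way Vietnamese users think of the letter, whatever order normalization later imposes.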

    In addition, of course, a dictionary lookup assistant will help in composing the most common words with their correct accents and tone marks.

    The "apparent" issue in Unicode exists only with the canonical ordering of diacritics in NFD. Even so, the NFD order is canonically equivalent to the natural Vietnamese ordering of these combining marks. The issue arises only with the dot-below tone mark (combining class 220), because all the other tone marks (grave, acute, hook above, tilde) have the same combining class (230) as the accents (circumflex and breve) used above base vowels, so their relative encoding order is preserved by normalization (which will not, for example, reorder the tilde so that it ends up below the circumflex accent, as that would be incorrect for Vietnamese).
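    The combining classes and the reordering behaviour can be checked with Python's standard unicodedata module. This is just a small verification sketch; the variable names are chosen here for illustration.

```python
import unicodedata

MARKS = {
    "circumflex": "\u0302",
    "breve": "\u0306",
    "grave": "\u0300",
    "acute": "\u0301",
    "hook above": "\u0309",
    "tilde": "\u0303",
    "dot below": "\u0323",
}

# Canonical combining class of each mark, as recorded in the Unicode database.
classes = {name: unicodedata.combining(mark) for name, mark in MARKS.items()}
for name, ccc in classes.items():
    print(f"{name}: {ccc}")

# Natural Vietnamese order: (a + circumflex) + dot below.
natural = "a" + MARKS["circumflex"] + MARKS["dot below"]
# NFD moves the dot below (lower class) in front of the circumflex.
nfd_dot = unicodedata.normalize("NFD", natural)

# Circumflex and tilde share a class, so NFD keeps their order.
with_tilde = "a" + MARKS["circumflex"] + MARKS["tilde"]
nfd_tilde = unicodedata.normalize("NFD", with_tilde)

print([f"U+{ord(c):04X}" for c in nfd_dot])
print([f"U+{ord(c):04X}" for c in nfd_tilde])
```

    Both orders of circumflex and dot below normalize to the same NFC character, which is what canonical equivalence means in practice.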

    But it is notable that the grave and acute tone marks often do not stack above the base vowel accent. They are generally written to the side of the circumflex rather than above it, for better legibility, and the accent (circumflex or breve) is kept centered above the letter in all cases.

    The tone marks do stack above the breve (because it has a gap in the middle in which to place them, without having to increase the line height or reduce the letter height). But the legibility of the tilde or hook tone marks above the breve is often very questionable for uppercase letters, so I wonder whether this combination is really used in uppercase...


