Re: Vietnamese (Re: Unicode, SMS, PDA/cellphones)

From: Doug Ewell
Date: Sat Jun 03 2006 - 15:41:57 CDT

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    >> All precomposed letters necessary for Vietnamese are already encoded,
    >> and have been since Unicode 1.1.
    > Is it true for all Vietnamese letters? I mean here all the
    > combinations of one of the 12 Vietnamese vowels (including 5 base
    > vowels from the ASCII set, plus a few vowels with a single diacritic
    > like the circumflex or a right hook) and one of the 5 tone diacritics
    > (marked by combining accents like acute, grave, tilde, macron, dot
    > below)?

    It is true for all, and also for the barred-D (Đ, đ) which represents
    what we think of as a "true" D sound. (The plain letter D is sounded as
    "y" or "z" depending on dialect.)

    > Are there additional combinations in Vietnamese? Or is really
    > Vietnamese using this small subset of combinations that is easy to
    > support in most fonts? I have always assumed that Vietnamese was not
    > as complicated as many people think, despite the apparent
    > complexity of Windows-1258 or VISCII.

    All letters necessary for Vietnamese are covered; the orthography is not
    compromised. Twelve base vowels, five tone marks, 72 vowels total
    (including vowels with no tone mark), plus the barred-D and the dong
    sign, plus the rest of the Latin alphabet (not all of which is used in
    Vietnamese), times 2 for uppercase/lowercase.
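    The count above is easy to check mechanically. The following is only a
    rough sketch, assuming the five tone marks are grave, acute, tilde, hook
    above, and dot below, and that "already encoded" means NFC composition
    yields a single code point. The names BASE_VOWELS, TONE_MARKS, and
    vietnamese_vowels are made up for this illustration.

```python
import unicodedata

# Twelve base vowels: a, a-breve, a-circumflex, e, e-circumflex, i,
# o, o-circumflex, o-horn, u, u-horn, y (escapes keep the literal in NFC).
BASE_VOWELS = "a\u0103\u00e2e\u00eaio\u00f4\u01a1u\u01b0y"

# Assumed tone marks: none, grave, acute, tilde, hook above, dot below.
TONE_MARKS = ["", "\u0300", "\u0301", "\u0303", "\u0309", "\u0323"]


def vietnamese_vowels():
    """Return the NFC form of every base vowel + tone mark combination."""
    return [
        unicodedata.normalize("NFC", vowel + tone)
        for vowel in BASE_VOWELS
        for tone in TONE_MARKS
    ]


vowels = vietnamese_vowels()
both_cases = vowels + [v.upper() for v in vowels]

# Anything longer than one code point has no precomposed form.
not_precomposed = [v for v in both_cases if len(v) != 1]

print(len(vowels), "lowercase vowels;", len(not_precomposed), "not precomposed")
```

    If the assumptions hold, this should report 72 lowercase vowels and none
    lacking a precomposed form, in either case.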

    I don't think VISCII is complex at all. Everything needed by the
    Vietnamese language is encoded in one byte. The only tricky part about
    VISCII is that because there are so many letters, it has to encroach
    upon the C0 and C1 control-character space, which can cause problems for
    rendering engines that assume everything in that space is not visible.

    Windows-1258 is a bit more complex, using a combination of precomposed
    vowels and combining marks to stay out of the control space.
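    To illustrate the Windows-1258 approach, here is a small sketch using
    Python's built-in "cp1258" codec. It assumes that codec reflects the code
    page faithfully; the helper fits_cp1258 is hypothetical. The letter ậ has
    no single byte of its own, so it has to be written as â followed by a
    combining dot below.

```python
precomposed = "\u1ead"        # a with circumflex and dot below, one code point
decomposed = "\u00e2\u0323"   # a-circumflex followed by combining dot below


def fits_cp1258(text):
    """Return True if every code point in text has a Windows-1258 byte."""
    try:
        text.encode("cp1258")
        return True
    except UnicodeEncodeError:
        return False


# The two-code-point spelling becomes two bytes: letter, then combining mark.
encoded = decomposed.encode("cp1258")

print(fits_cp1258(precomposed), fits_cp1258(decomposed), len(encoded))
```

    The practical consequence is that a converter to Windows-1258 cannot map
    code point to byte one for one; it has to split some precomposed letters
    first.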

    > What I don't know is whether Vietnamese considers the tone marks
    > (encoded as diacritical accents) important at the primary level for
    > the language. If they are not so important, then people can accept
    > not always encoding the tone marks, and the number of characters to
    > support in applications like SMS on cell phones is dramatically
    > reduced (and text input becomes easy for the 12 phonetic Vietnamese
    > base vowels, and tone marks can be optionally entered after those base
    > vowels).

    Vietnamese is a tonal language, and just as in Chinese or any other
    tonal language, two words can have totally different meanings based on
    tone. It is up to the writer to decide when it is acceptable to drop
    tone marks without causing miscommunication or even offense. Some
    writers are more picky about this than others; some make greater demands
    on the reader than others. In principle, though perhaps not in
    frequency, the issue is not much different from dropping accents in

    > Why then would it be more complicated to compose text like this,
    > instead of using VIQR that would require composing mostly the same
    > number of symbols (and sometimes more...)?

    Vietnamese composition becomes tricky when working with fully decomposed
    vowels, so that ậ decomposes to "U+0061 plus U+0323 plus U+0302." Not
    all rendering systems (even today) can handle placing two or more
    diacritical marks on a single base letter.

    Additionally, this decomposition ("a" plus dot-below plus circumflex)
    doesn't match the way Vietnamese view this letter (("a" plus circumflex)
    plus dot-below). This is not a Unicode problem, but entering the
    diacritics in the language-appropriate order might be a problem if a
    rendering engine insists on canonical order.
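    The ordering point can be seen with Python's unicodedata module. This is
    just a sketch; the variable names are mine. It shows the canonical
    decomposition of ậ, the combining classes that force the dot below ahead
    of the circumflex, and that text typed in the language-appropriate order
    is canonically equivalent once normalized.

```python
import unicodedata

A_CIRCUMFLEX_DOT = "\u1ead"  # the precomposed letter

# Full canonical decomposition (NFD).
nfd = unicodedata.normalize("NFD", A_CIRCUMFLEX_DOT)
code_points = [f"U+{ord(ch):04X}" for ch in nfd]

# Canonical combining classes: marks with a lower class sort first.
classes = [unicodedata.combining(ch) for ch in nfd]

# "Language order": base letter, then circumflex, then dot below.
typed = "a\u0302\u0323"

# Normalization reorders the marks, so both spellings are equivalent.
same_nfc = unicodedata.normalize("NFC", typed) == A_CIRCUMFLEX_DOT
same_nfd = unicodedata.normalize("NFD", typed) == nfd

print(code_points, classes, same_nfc, same_nfd)
```

    So a system that normalizes its input before rendering should not care
    which order the marks were typed in; the trouble is limited to engines
    that expect canonical order but do not normalize.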

    Doug Ewell
    Fullerton, California, USA
