Re: Languages supported by UTF8 and UTF16

From: Jukka K. Korpela (
Date: Sat Sep 10 2005 - 09:20:01 CDT

    On Fri, 9 Sep 2005, Doug Ewell wrote:

    > I'm afraid the list is at risk of falling into a hole debating this "how
    > many languages on the head of a pin" question, when the real underlying
    > question may be completely different.

    Indeed, especially since the question was probably based on at least one
    misconception: it asked about encoding forms (UTF-8 and UTF-16) and not
    about Unicode itself.

    While waiting for a clarification to the question, we can still discuss
    _another_ question, namely that of language support by Unicode. There
    seems to be confusion around it, too, and the question itself is somewhat
    obscure. For example, does "Unicode" mean the Unicode repertoire of
    characters, or the Unicode Standard, or the Unicode Consortium?

    I'd say that the short answer to the question "what languages are
    supported by the Unicode Standard?" would be as follows (without trying to
    clarify the question much - can't do that in a _short_ answer):

    All living languages, and many dead languages, can be written in
    their normal writing system(s) using Unicode characters. However, some
    of their characters cannot be represented as single Unicode characters,
    only as sequences of characters. Some orthographic and typographic
    constructs, which could in principle be expressed in plain text, cannot
    be expressed in Unicode. Some of the properties of characters as defined
    by the Unicode Standard do not correspond to their behavior in
    particular languages. Moreover, Unicode is meant to describe plain text
    only, so it generally does not provide the support that might be needed
    for displaying and processing text by language-specific rules.
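
    As a rough illustration of two of the points above (a sketch only, not
    part of the short answer): the snippet below uses Python's standard
    unicodedata module. The sample letters are my own choices, made on the
    assumption that "q with tilde" has no precomposed code point and that
    Turkish dotless i is a fair example of a property that does not match
    language-specific behavior.

```python
import unicodedata

# Point 1: some letters exist only as sequences of characters.
# "e" + COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
# so NFC normalization yields a single code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# "q" + COMBINING TILDE is assumed to have no precomposed form,
# so it stays a two-code-point sequence even after NFC.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

print(len(e_acute), len(q_tilde))

# Point 2: default character properties are not language-specific.
# Python's str.lower() applies the default, locale-independent case
# mapping: "I" maps to "i", although Turkish orthography pairs "I"
# with dotless "ı" (U+0131).
lower_i = "I".lower()
print(lower_i == "i", lower_i == "\u0131")
```

    Getting the Turkish result requires language-specific tailoring on top
    of the default mapping, which is the kind of support that lies outside
    plain character data.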

    Well, that's not very short, really. Neither is it very understandable,
    since it lacks examples. The point, anyway, is that "support for a
    language" can mean much more than just the presence of all characters
    used in the language. Even that presence is debatable, since people may
    disagree on what really belongs to a language, even at the character
    level. Moreover, it's debatable what can be regarded as "support". For
    example, if the rules of a language require a thin no-break space before
    or after some punctuation marks, can we claim that Unicode "supports"
    it, since you can use a thin space character with a zero width no-break
    space on both sides of it?
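
    To make the construct concrete, here is a small sketch in Python that
    builds the sequence described above and prints the character names. The
    two extra code points at the end (U+2060 and U+202F) are my additions
    for comparison, not part of the original question; whether any renderer
    or line breaker actually honors such sequences is outside what this
    code can check.

```python
import unicodedata

THIN_SPACE = "\u2009"
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE (also used as the byte order mark)

# The construct from the text: a thin space with a zero width
# no-break space on both sides of it.
sequence = ZWNBSP + THIN_SPACE + ZWNBSP
names = [unicodedata.name(ch) for ch in sequence]
for ch, name in zip(sequence, names):
    print(f"U+{ord(ch):04X} {name}")

# Related characters, listed only for comparison (an assumption about
# what a reader might want to try instead).
alternatives = {cp: unicodedata.name(cp) for cp in ("\u2060", "\u202f")}
for cp, name in alternatives.items():
    print(f"U+{ord(cp):04X} {name}")
```

    The fact that the sequence can be written down at all is one sense of
    "support"; whether software treats it as a single unbreakable thin
    space is another.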

    Jukka "Yucca" Korpela,
