Re: Languages supported by UTF8 and UTF16

From: Mark Davis (mark.davis@icu-project.org)
Date: Sat Sep 10 2005 - 14:23:32 CDT

    The core value in your proposed formulation is that it does make sense
    to talk about 'Unicode supporting a language X' as meaning something
    like 'plain text in the customary writing system for X can be
    represented as a sequence of Unicode characters'.

    Beyond that, there are problems in your formulation.

    > All living languages, and many dead languages, can be written in their
    > normal writing system(s) using Unicode characters.

    1. It is not true of "all living languages"; there are some minority
    languages that need additional characters. (Part of the problem here is
    that we didn't apply the generative model consistently enough; had we
    done that, many of these characters could be represented right now by
    sequences.)

    > However, some
    > of their characters cannot be represented as single Unicode characters
    > but as combinations.

    3. The 'however' is misleading. It is not a deficiency that some of what
    users may perceive as separate characters are encoded by sequences.
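
    As a small illustration of that point (a sketch using Python's
    standard unicodedata module; the choice of "q with tilde" as the
    example is mine, not from the thread): a letter with no precomposed
    code point is still representable as a base letter plus a combining
    mark, and normalization leaves the sequence alone.

```python
import unicodedata

# "q" + U+0303 COMBINING TILDE: one user-perceived character, two code
# points. Unicode has no precomposed "q with tilde".
q_tilde = "q\u0303"

# "n" + U+0303 does have a precomposed equivalent, U+00F1.
n_tilde = "n\u0303"

nfc_q = unicodedata.normalize("NFC", q_tilde)
nfc_n = unicodedata.normalize("NFC", n_tilde)

print(len(nfc_q))  # 2: the sequence stays a sequence
print(len(nfc_n))  # 1: composed to U+00F1
```

    Both strings are equally valid plain text; only the second happens
    to have a single-code-point form as well.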

    > Some orthographic and typographic constructs, which
    > could in principle be expressed in plain text, cannot be expressed
    > in Unicode.

    4. Also not a deficiency. If Unicode attempted to encode all typographic
    constructs, it would be a horrible mess. It provides a foundation for
    other mechanisms (CSS, etc.) to build upon; they can provide
    typographical constructs. As for 'orthographic constructs', you'd have
    to provide examples of what you mean.

    > Some of the properties of characters as defined by the
    > Unicode Standard do not correspond to their behavior in different
    > languages.

    5. Again, you'd have to provide examples to clarify what you mean.

    > Moreover, Unicode is meant to describe plain text only, so it generally
    > lacks any support that might be needed for display and processing of
    > text by language-specific rules.

    6. Again, by design, to avoid the above-mentioned horrible mess. If you
    want language tagging so as to customize appearance for different
    languages, use higher-level markup or structure, such as xml:lang or
    equivalent.

    What the Unicode Consortium *does* provide is a mechanism for
    language-specific tailoring of specified behavior. Look at collation,
    for example, where the Unicode Consortium supplies a default basis for
    ordering in the UCA (Unicode Collation Algorithm), but then also
    provides a repository of language-based tailorings of the UCA in the
    CLDR (Common Locale Data Repository).
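
    The default-plus-tailoring idea can be sketched roughly as follows.
    This is a toy model only: it does not implement the UCA and does not
    read CLDR data. The names default_key, swedish_key and
    SWEDISH_ALPHABET are hypothetical, and the "default" here is just
    "ignore accents", standing in for a default ordering in which
    accented letters sort with their base letters. The tailoring mimics
    the Swedish convention of sorting å, ä, ö after z.

```python
import unicodedata

words = ["zebra", "öl", "apple", "orm", "ärlig"]

def default_key(word):
    # Stand-in for a default ordering (assumption): drop combining marks
    # so that accented letters sort with their base letters.
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical tailoring: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    composed = unicodedata.normalize("NFC", word.lower())
    # Letters outside the tailored alphabet sort after everything else.
    return [RANK.get(ch, len(RANK)) for ch in composed]

default_order = sorted(words, key=default_key)
swedish_order = sorted(words, key=swedish_key)

print(default_order)  # ['apple', 'ärlig', 'öl', 'orm', 'zebra']
print(swedish_order)  # ['apple', 'orm', 'zebra', 'ärlig', 'öl']
```

    The same strings come out in a different order depending on which
    language's rules are applied, which is why this belongs in tailorable
    data rather than in the character encoding itself.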

    Mark

    Jukka K. Korpela wrote:
    > On Fri, 9 Sep 2005, Doug Ewell wrote:
    >
    >> I'm afraid the list is at risk of falling into a hole debating this "how
    >> many languages on the head of a pin" question, when the real underlying
    >> question may be completely different.
    >
    >
    > Indeed, especially since the question was probably based on a
    > misconception on one thing at least, since it asked about encoding forms
    > and not Unicode.
    >
    > While waiting for a clarification to the question, we can still discuss
    > _another_ question, namely that of language support by Unicode. There
    > seems to be confusion around it, too, and the question itself is
    > somewhat obscure. For example, does "Unicode" mean the Unicode
    > repertoire of characters, or the Unicode Standard, or the Unicode
    > Consortium?
    >
    > I'd say that the short answer to the question "what languages are
    > supported by the Unicode Standard?" would be as follows (without trying
    > to clarify the question much - can't do that in a _short_ answer):
    >
    > All living languages, and many dead languages, can be written in their
    > normal writing system(s) using Unicode characters. However, some
    > of their characters cannot be represented as single Unicode characters
    > but as combinations. Some orthographic and typographic constructs, which
    > could in principle be expressed in plain text, cannot be expressed
    > in Unicode. Some of the properties of characters as defined by the
    > Unicode Standard do not correspond to their behavior in different
    > languages.
    > Moreover, Unicode is meant to describe plain text only, so it generally
    > lacks any support that might be needed for display and processing of
    > text by language-specific rules.
    >
    > Well, that's not very short, really. Neither is it very understandable,
    > since it lacks examples. The point, anyway, is that "support to a
    > language" can mean much more than just presence of all characters used
    > in a language. It's also debatable, since people may disagree on what
    > really belongs to a language, even at the character level. Moreover,
    > it's debatable what can be regarded as "support". For example, if the
    > rules of a language require a thin nonbreakable space before or after
    > some punctuation marks, can we claim that Unicode "supports" it, since
    > you can use a thin space character with a zero width no-break space on
    > both sides of it?
    >


