Re: Languages supported by UTF8 and UTF16

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sat Sep 10 2005 - 17:40:50 CDT

    On Sat, 10 Sep 2005, Mark Davis wrote:

    > Beyond that, there are problems in your formulation.

    Certainly. It was just an initial attempt to list some issues, in
    order to find a useful answer to the question "what languages does Unicode
    support". I'm assuming that the question is relatively frequent. It's
    important at least indirectly: there are many short descriptions of
    Unicode that make the unqualified, unrestricted claim that Unicode
    supports all languages, or something similar.

    >> All living languages, and many dead languages, can be written in their
    >> normal writing system(s) using Unicode characters.
    >
    > 1. It is not true of "all living languages"; there are some minority
    > languages that need additional characters.

    My tentative answer was much simplified, and it plays with the word
    "normal". What worries me is how this thing could be expressed so the
    common man, or maybe even a pointy-haired boss, would get roughly the
    right idea. "All living languages" is too much, and "Almost all living
    languages" is rather vague. A "minority language" here probably means
    a much smaller language than the average man would think.

    >> However, some
    >> of their characters cannot be represented as single Unicode characters
    >> but as combinations.
    >
    > 3. The 'however' is misleading. It is not a deficiency that some of what
    > users may perceive of as separate characters are encoded by sequences.

    I know it can be misleading, but if we just say that Unicode provides a
    unique number for every character, it's misleading, too. And even
    incorrect. ( http://www.unicode.org/standard/WhatIsUnicode.html )

    After one gets acquainted with Unicode, it probably comes as a surprise to
    most people that some characters (using a person's intuitive
    understanding of "character") cannot, after all, be represented
    using a single Unicode code point.

    As a thought experiment, let us suppose that the letter "w" had not been
    included in ASCII or other character codes but written as "vv", and that
    Unicode had not changed this. If people then asked for "w", the
    answer would be that it is just a typographic variant of "vv", a ligature
    (as it historically is, in fact). Maybe after quite some debate we would
    then be told to use the combination of three Unicode characters,
    "v", word joiner, and "v". Could we then _really_ say that Unicode
    supports the English alphabet, for example, and a separate code point is
    not needed for "w"?

    Unicode contains _most_ accented letters used in human languages
    as precomposed characters, but not all. There's a clear distinction here.
    I'm not questioning the policy decision that effectively freezes the set
    of such precomposed characters. What I'm saying is that we should admit
    that it implies differences in how languages are supported.
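
    As a rough illustration of that distinction, here is a minimal sketch
    that assumes Python's standard unicodedata module. The helper name
    nfc_code_points is made up for this example. It composes two sequences
    with NFC normalization: "e" with an acute accent has a precomposed
    code point, while "m" with a tilde does not and remains a sequence.

```python
import unicodedata

def nfc_code_points(text):
    """Return the code points of text after NFC normalization, as U+XXXX strings."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in composed]

# "e" + COMBINING ACUTE ACCENT has a precomposed form.
e_acute = nfc_code_points("e\u0301")
# "m" + COMBINING TILDE has no precomposed form, so it stays a sequence.
m_tilde = nfc_code_points("m\u0303")

print(e_acute)  # ['U+00E9']
print(m_tilde)  # ['U+006D', 'U+0303']
```

    Both results are valid Unicode text; the difference lies only in
    whether software has to handle one code point or two.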

    >> Some orthographic and typographic constructs, which
    >> could in principle be expressed in plain text, cannot be expressed
    >> in Unicode.
    >
    > 4. Also not a deficiency. If Unicode attempted to encode all typographic
    > constructs, it would be a horrible mess. It provides a foundation for other
    > mechanisms (CSS, etc) to build upon; they can provide typographical
    > constructs. And by 'orthographic constructs', you'd have to provide examples
    > of what you mean.

    Here, too, my text was supposed to address people's intuitive expectations
    rather than take a position on what should or should not be encoded in
    Unicode. For example, since the acute accent used in French, the accent
    used in Polish, and the tonos in Greek normally look different from each
    other, it is natural to expect that they are treated as different marks.
    Unicode may have made the right choice in unifying all of them as the
    acute accent, but it still means that a difference that could have been
    made in plain text (and, in some people's opinion, should have been)
    cannot be made in Unicode. If someone says that Unicode does not support
    Polish
    because Unicode does not recognize the Polish accent mark as distinct from
    the French mark, he might well be wrong by some criteria; but it's still
    an opinion that people have and that is ultimately not a completely
    objective matter.
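
    The unification can be observed directly. The following is a minimal
    sketch, again assuming Python's unicodedata module; the helper
    combining_marks and the choice of sample letters are assumptions made
    for this illustration. It decomposes a French, a Polish, and a Greek
    letter and lists the combining marks that result.

```python
import unicodedata

def combining_marks(ch):
    """Return the combining marks in the canonical decomposition (NFD) of ch."""
    return [c for c in unicodedata.normalize("NFD", ch) if unicodedata.combining(c)]

samples = {
    "French": "\u00e9",  # e with acute
    "Polish": "\u015b",  # s with acute
    "Greek": "\u03ac",   # alpha with tonos
}

marks = {lang: combining_marks(ch) for lang, ch in samples.items()}
for lang, found in marks.items():
    # Each language's letter yields the same mark: COMBINING ACUTE ACCENT.
    print(lang, [unicodedata.name(m) for m in found])
```

    All three letters decompose to a base letter plus U+0301, so any
    difference in the shape of the mark has to come from the font or from
    markup, not from the character data.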

    >> Some of the properties of characters as defined by the
    >> Unicode Standard do not correspond to their behavior in different
    >> languages.
    >
    > 5. Again, you'd have to provide examples to clarify what you mean.

    For example, line breaking behavior. Unicode line breaking rules allow,
    for example, a line break after ":" in a string like "YK:ssa". Yet, that
    string is the way to write an inflected form of an abbreviation in
    Finnish, and such an expression should not be divided, and if it really
    needs to be divided, the break must not appear after the colon but at
    a syllable boundary ("YK:s-sa"). I take this example because I have had
    to fight with such problems when fixing the typesetting of books: the
    typesetting program supported this part of the Unicode line breaking
    rules but offered no way to override them except by awkward tricks.

    In this example, the point is that the default line breaking rules break
    the processing of texts, by introducing allowed break points that might be
    OK in many contexts, but not in others. I know that such rules are not
    normative and they are meant to be defaults that can be overridden as
    needed, but that's largely just theory. The point is that by introducing
    such properties, Unicode has created, for work with text at the plain text
    level, problems that didn't exist in earlier character codes. (Of course,
    the default line breaking rules _solve_ many problems, too.)
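
    One plain-text way to override a default break opportunity is
    U+2060 WORD JOINER, which prohibits a line break on either side of it.
    The sketch below is only an illustration: the helper
    protect_colon_forms and its regular expression are assumptions, and
    whether a given typesetting program honours the character is a
    separate question. It also does nothing about hyphenation at the
    syllable boundary.

```python
import re

WORD_JOINER = "\u2060"

# A colon with a letter on both sides, as in the Finnish form "YK:ssa".
# [^\W\d_] matches a letter: a word character that is not a digit or underscore.
_COLON_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_]):(?=[^\W\d_])")

def protect_colon_forms(text):
    """Insert WORD JOINER after a colon that sits between two letters."""
    return _COLON_BETWEEN_LETTERS.sub(":" + WORD_JOINER, text)

protected = protect_colon_forms("YK:ssa")
print([f"U+{ord(ch):04X}" for ch in protected])
```

    A colon followed by a space, or a colon between digits as in a time
    of day, is left untouched by this pattern.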

    >> Moreover, Unicode is meant to describe plain text only, so it generally
    >> lacks any support that might be needed for display and processing of
    >> text by language-specific rules.
    >
    > 6. Again, by design, to avoid above-mentioned horrible mess. If you want
    > language tagging so as to customize appearance for different languages, use
    > higher level markup or structure, such as xml:lang or equivalent.

    My statement was meant to explain, and perhaps apologize a bit, rather
    than to present the issue as a drawback of Unicode. Here, too, we have a
    problem of expectations. Unicode has, after all, quite a few features
    that actually operate on a level higher than plain text, such as
    typographic variants encoded as characters, even variation selectors, and
    language tags that are not recommended but still exist in Unicode.
    The question then arises why this or that feature is absent.
    And in reality, this means that the level of support for languages varies
    by language. If someone expresses this by saying that Unicode does not
    support this or that language, because Unicode does not recognize some
    difference as a character difference but just as a glyph difference that
    does not affect the coding of characters, I would say that the opinion
    is not quite right - but it is understandable, and it demonstrates a bit
    how relative the concept of "support for a language" is.
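
    For readers who want to see such higher-level features in the
    character data, here is a small sketch assuming Python's unicodedata
    module. The three code points are examples picked for this
    illustration: a ligature encoded as a character, a variation selector,
    and the language tag character.

```python
import unicodedata

examples = {
    "ligature": "\ufb01",            # a typographic variant encoded as a character
    "variation selector": "\ufe00",
    "language tag": "\U000e0001",
}

names = {kind: unicodedata.name(ch) for kind, ch in examples.items()}
for kind, name in names.items():
    print(f"{kind}: {name}")

# Compatibility normalization replaces the ligature with plain letters.
folded = unicodedata.normalize("NFKC", examples["ligature"])
print(folded)  # fi
```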

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

