Re: Level of Unicode support required for various languages

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 24 2007 - 13:41:13 CDT

    Tim Armes asked:

    > I'm looking for accurate answers to the following questions. I've
    > spent a lot of time trying to find this information but it doesn't
    > appear to be readily available.

    In part that is because the questions are not completely well-formed,
    and even if fixed, the answers are problematical. And the request for
    *absolute* answers is probably doomed.

    > 1) How many and which languages absolutely require the use of
    > combining marks due to the fact that the pre-composed glyphs
    > aren't sufficient?

    Three problems.

    A. This question isn't really about *languages*, but about writing systems
    (or orthographies) used to write languages. As an example, take
    standard Mandarin Chinese. If written with the Han writing system
    (the Chinese ideographic characters), it basically requires no
    use of combining marks. If written with the Pinyin Latin orthography,
    there are precomposed characters for all the letters. If written
    in IPA (also Latin), then you would need lots of combining marks.

    B. The issue isn't precomposed *glyphs*, but precomposed *characters*.
    Sequences of base letter plus combining mark(s) may end up
    being displayed with precomposed *glyphs* from a font, in any
    case. The glyphs themselves are a matter of the font design
    and mapping, whereas the characters are a matter of the
    character encoding and are what you store in text strings.
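
    As a rough illustration of the character side of that distinction,
    here is a minimal Python sketch using the standard unicodedata
    module. The helper name code_points is made up for this example.
    It shows that a precomposed character and a base letter plus
    combining mark are different stored sequences that are canonically
    equivalent; which glyph a font draws for either one is a separate
    matter that the string does not record.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
sequence = "e\u0301"     # base letter + COMBINING ACUTE ACCENT

# Two different character sequences in storage...
print(code_points(precomposed))   # ['U+00E9']
print(code_points(sequence))      # ['U+0065', 'U+0301']

# ...that normalization treats as canonically equivalent.
print(unicodedata.normalize("NFC", sequence) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == sequence)   # True
```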

    C. Asking which ones "absolutely require" combining marks will get
    you unclear answers, because you can generally find edge
    cases of usage in which someone would use a combining
    mark. What you are actually after is the answer to a "typically
    require" question instead. And for that, you can give
    a general answer: All of the non-Latin writing systems of
    South and Southeast Asia typically require the use of
    combining marks. Arabic (script -- which is used to write
    many distinct languages) also typically requires the use of
    combining marks.

    Hebrew typically doesn't require combining marks -- but
    it *absolutely* does, because pointed Hebrew isn't that
    uncommon for some types of materials, is part of the
    writing system, and requires combining marks.
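
    To make the Hebrew case concrete, a small Python sketch follows.
    The function name combining_marks and the sample word are choices
    made for this example only. It lists the marks in an unpointed and
    a pointed spelling of the same word, and checks that NFC
    normalization does not make the points go away.

```python
import unicodedata


def combining_marks(text: str) -> list[str]:
    """Names of the characters in text whose general category is a mark (M*)."""
    return [unicodedata.name(ch) for ch in text
            if unicodedata.category(ch).startswith("M")]


# "shalom" without points: shin, lamed, vav, final mem
unpointed = "\u05E9\u05DC\u05D5\u05DD"
# The same word pointed: shin + qamats + shin dot, lamed, vav + holam, final mem
pointed = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"

print(combining_marks(unpointed))   # []
print(combining_marks(pointed))
# ['HEBREW POINT QAMATS', 'HEBREW POINT SHIN DOT', 'HEBREW POINT HOLAM']

# Normalizing to NFC leaves the points in place as separate characters.
print(len(unicodedata.normalize("NFC", pointed)) == len(pointed))   # True
```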

    For the Greek script you can generally get by without
    combining marks, but the preferred representation of
    polytonic Greek is with combining marks.

    For the Latin script, the answers are very difficult to
    come by. Most major European languages can be written
    without combining marks, but there are thousands of
    Latin-based orthographies in use around the world, and
    many of those -- even some for very large, official
    languages in Africa, for instance -- require some use
    of combining marks.
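
    One way to test a given piece of Latin-script text is to normalize
    it to NFC and see whether any marks survive. The sketch below does
    that; the function name needs_combining_marks is hypothetical, and
    the Yoruba letter (e with dot below and acute) is an example picked
    for this illustration, not one named above.

```python
import unicodedata


def needs_combining_marks(text: str) -> bool:
    """True if text still contains mark characters after NFC normalization."""
    composed = unicodedata.normalize("NFC", text)
    return any(unicodedata.category(ch).startswith("M") for ch in composed)


french = "e\u0301te\u0301"        # "ete" with combining acute accents on both e's
yoruba_letter = "e\u0323\u0301"   # e + COMBINING DOT BELOW + COMBINING ACUTE ACCENT

# Every accented letter here has a precomposed character.
print(needs_combining_marks(french))          # False

# NFC composes e + dot below into U+1EB9, but the acute has nothing to
# compose with, so a combining mark remains in the stored text.
print(needs_combining_marks(yoruba_letter))   # True
```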

    > 2) How many and which languages absolutely require the
    > use of variant selectors?

    At the moment, variation selection sequences are only
    defined for Mongolian and for Phags-pa scripts. (Both of
    those scripts are used to write several languages -- see
    the Unicode Standard or other references for details.)

    A large set of variation selection sequences are in
    the process of standardization for CJK ideographic
    characters -- but the intent there is not to *require*
    their use, unless you explicitly want a very exact
    choice of glyph at some point in the text.

    Andrew West clarified the situation for Mongolian and
    Phags-pa already.
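
    For anyone who wants to see where variation selectors occur in
    their own data, here is a minimal Python sketch. The names
    VS_RANGES, is_variation_selector and variation_sequences are
    invented for this example, and it only checks the code point ranges
    listed in it. It does not check whether a base-plus-selector pair is
    actually a standardized sequence; that is defined by the Unicode
    data files.

```python
# Code point ranges treated as variation selectors in this sketch.
VS_RANGES = (
    (0x180B, 0x180D),     # MONGOLIAN FREE VARIATION SELECTOR ONE..THREE
    (0xFE00, 0xFE0F),     # VARIATION SELECTOR-1..VARIATION SELECTOR-16
    (0xE0100, 0xE01EF),   # VARIATION SELECTOR-17..VARIATION SELECTOR-256
)


def is_variation_selector(ch: str) -> bool:
    """True if ch falls in one of the ranges in VS_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


def variation_sequences(text: str) -> list[tuple[str, str]]:
    """Return (base, selector) pairs found in text, as U+XXXX strings."""
    pairs = []
    for prev, cur in zip(text, text[1:]):
        if is_variation_selector(cur) and not is_variation_selector(prev):
            pairs.append((f"U+{ord(prev):04X}", f"U+{ord(cur):04X}"))
    return pairs


mongolian = "\u1820\u180B"   # MONGOLIAN LETTER A + free variation selector one
phags_pa = "\uA856\uFE00"    # PHAGS-PA LETTER SMALL A + VARIATION SELECTOR-1

print(variation_sequences(mongolian))   # [('U+1820', 'U+180B')]
print(variation_sequences(phags_pa))    # [('U+A856', 'U+FE00')]
print(variation_sequences("plain"))     # []
```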

    > 3) How many and which languages absolutely require the
    > use of variant glyphs?

    Not answerable without further clarification of what kind
    of requirement you have in mind.

    Note, for example, that Latvian in some sense "requires"
    an alternative glyph for U+0123 LATIN SMALL LETTER G WITH
    CEDILLA for good typography, but you can get by without
    it and still have legible text.
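
    A short Python sketch of why this is a font matter rather than an
    encoding matter: the stored character is the same whichever glyph
    gets drawn. The helper name describe is made up for this example.

```python
import unicodedata


def describe(ch: str) -> tuple[str, list[str]]:
    """Return the character name and its NFD code points as U+XXXX strings."""
    decomposed = unicodedata.normalize("NFD", ch)
    return unicodedata.name(ch), [f"U+{ord(c):04X}" for c in decomposed]


name, parts = describe("\u0123")
print(name)    # LATIN SMALL LETTER G WITH CEDILLA
print(parts)   # ['U+0067', 'U+0327']

# Nothing in the string says how the mark should be drawn; a font
# intended for Latvian supplies its preferred glyph for this character.
```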

    --Ken


