Re: Level of Unicode support required for various languages

From: Eric Muller (
Date: Thu Oct 25 2007 - 13:19:10 CDT

    Timothy Armes wrote:
    > Here is a current list. Can all of these be written without combining marks and variant glyphs?

    For Thai, the answer is definitely no. Thai uses combining marks for
    (some) vowel signs and for tone marks, which do occur in everyday texts.
    There are no code points for any of the combinations.
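
    As a rough illustration of that last point (a sketch using Python's
    standard unicodedata module; the sample word is my own choice, not
    taken from your data), normalizing a Thai word to NFC leaves the vowel
    sign and the tone mark as separate combining characters:

```python
import unicodedata

# "ที่" = THO THAHAN + SARA II (vowel sign) + MAI EK (tone mark)
word = "\u0e17\u0e35\u0e48"

# NFC composes wherever a precomposed character exists.
composed = unicodedata.normalize("NFC", word)

# Nonspacing marks (general category Mn) that survive NFC have no
# precomposed equivalent to fall back on.
remaining_marks = [
    unicodedata.name(ch) for ch in composed if unicodedata.category(ch) == "Mn"
]

print(len(composed), remaining_marks)
```

    The string is unchanged by normalization and still holds two marks,
    so the layout engine has to position them itself.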

    For Vietnamese, the answer is yes, but there is a trap. All the
    combinations of letters and marks exist as precomposed characters, hence
    the "yes". However, it is common that the representation of text does
    not use those precomposed characters (e.g. the Windows keyboard for
    Vietnamese generates combining characters). The same trap can exist with
    many of the European languages; a lot depends on how the text was
    generated.
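
    One way around the trap is to normalize incoming text to NFC before it
    reaches the layout engine. A minimal sketch, assuming Python's standard
    unicodedata module; the input sequence is only one plausible example of
    what a keyboard might emit, and to_precomposed is a name made up for
    the illustration:

```python
import unicodedata

# "Việt" typed as precomposed ê (U+00EA) followed by COMBINING DOT BELOW
# (U+0323). This particular sequence is an assumption for the example.
typed = "Vi\u00ea\u0323t"


def to_precomposed(text: str) -> str:
    """Compose base letters and marks wherever Unicode has a precomposed form."""
    return unicodedata.normalize("NFC", text)


composed = to_precomposed(typed)

# Five code points in, four out: the ê + dot below pair becomes U+1EC7.
print(len(typed), len(composed), [f"U+{ord(ch):04X}" for ch in composed])
```

    This only helps for languages where every combination has a
    precomposed character; it does nothing for Thai.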

    For the languages using the Arabic script (Arabic, Farsi/Persian), there
    is a somewhat similar situation: while the generally preferred path is
    to use the characters in the U+06xx block, and to have the layout engine
    select the appropriate positional form ("variant glyphs" in your
    terminology), a fair amount of text can instead be represented using
    the presentation forms in the U+Fxxx blocks, where there is a simple
    mapping from the characters to the glyphs.
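
    To make the relationship concrete (a sketch with Python's standard
    unicodedata module; positional_tag is a helper invented for this
    example), here are the four presentation forms of the letter BEH from
    the Arabic Presentation Forms-B block. Each carries a compatibility
    decomposition that names its position and points back to U+0628:

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the character a layout engine would shape

# Isolated, final, initial and medial presentation forms of BEH.
PRESENTATION_FORMS = "\ufe8f\ufe90\ufe91\ufe92"


def positional_tag(ch: str) -> str:
    """Return the compatibility tag (e.g. '<initial>'), or '' if there is none."""
    decomposition = unicodedata.decomposition(ch)
    return decomposition.split()[0] if decomposition.startswith("<") else ""


tags = [positional_tag(ch) for ch in PRESENTATION_FORMS]

# NFKC folds every presentation form back to the plain letter.
folded = [unicodedata.normalize("NFKC", ch) for ch in PRESENTATION_FORMS]

print(tags, all(ch == BEH for ch in folded))
```

    Going from presentation forms to U+06xx is a table lookup, as shown.
    Going the other way requires looking at the neighbouring letters,
    which is precisely the job of the layout engine.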

    There is an additional complication for Hebrew and Arabic, because those
    scripts are inherently bidirectional, and I doubt that even with your
    restricted domain, you can escape that (do you have texts that include
    letters and numbers?). If you were to require that your input be in
    visual order, then you would have a hard time claiming Unicode support.
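
    A quick way to see whether a given string brings bidi into play is to
    look at the bidirectional class of each character. This is only a
    detection sketch using Python's standard unicodedata module, not an
    implementation of the bidirectional algorithm; needs_bidi and the
    sample string are made up for the illustration:

```python
import unicodedata

# Classes of right-to-left letters (R, AL) and Arabic numbers (AN).
RTL_CLASSES = {"R", "AL", "AN"}


def bidi_classes(text: str) -> list[str]:
    """Bidirectional class of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]


def needs_bidi(text: str) -> bool:
    """True if the text contains any character from a right-to-left class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


# Hebrew ALEF, BET, a space, then two European digits.
sample = "\u05d0\u05d1 12"

print(bidi_classes(sample), needs_bidi(sample), needs_bidi("ab 12"))
```

    The digits have class EN, not R, so within the Hebrew run they still
    read left to right. That mix of directions inside a single line is what
    a character-after-character renderer cannot reproduce.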

    Stepping back a bit, I suppose that you currently have a simple layout
    engine that gets some non-Unicode text and assumes that the rendering of
    a string is simply the rendering of each "character" one after the
    other, and that your immediate task is to take Unicode instead. Your
    question is probably "how much work do I need to do on my layout
    engine?" It's also probably the case that you deal with relatively
    simple texts, and that you deal (today) with a fixed set of languages.

    There are many factors that would influence the answer, which only you
    can really evaluate: the sources of your text data, how many constraints
    you can put on those sources, the font technology you use, how much
    hardware budget you have, how much you can control which fonts are used,
    and so on.

    One temptation is to stick to your layout engine and put some
    constraints on the incoming data. For example, you could get by for
    Vietnamese by imposing that the input data uses only the precomposed
    characters. That would be a compliant Unicode implementation (there is
    no requirement to support the full repertoire). Similarly, you could
    impose that Arabic texts use the presentation forms, and that would also
    be compliant. However, all these constraints fundamentally amount to
    using Unicode not as a text encoding system but as a glyph repertoire
    (that's what a *simple* layout engine really implies). While knowingly
    maintaining the confusion between text encoding and glyph repertoire
    can work to some extent, it is very tricky to do so. First, you would
    really need to understand the gap between the two (which, not
    surprisingly, would amount to being able to specify a non-simple layout
    engine), and second, you would need to bridge that gap by other means
    than a piece of software. It is also my guess that your current
    situation (i.e. set of languages and texts you want to support) is right
    at the edge of what can be done using this approach (in fact, proper
    bidi support is probably on the "wrong" side of the edge). In short,
    that path seems attractive because it minimizes the immediate
    engineering work (either your own implementation or the acquisition and
    deployment of somebody else's implementation), but it is much harder to
    make it work in practice, and it has large hidden costs.
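
    If you do go down that path, you would at least want a check at the
    point where data enters the system. The sketch below (Python, standard
    unicodedata module) is a crude approximation under my own assumptions:
    simple_layout_violations is a hypothetical helper, the "needs shaping"
    test simply flags every letter in U+0600..U+06FF, and a real check
    would need per-script rules.

```python
import unicodedata


def simple_layout_violations(text: str) -> list[tuple[int, str, list[str]]]:
    """Report characters that a one-glyph-per-character, left-to-right
    renderer cannot handle, as (index, code point, reasons) tuples."""
    problems = []
    for index, ch in enumerate(text):
        reasons = []
        # Any mark (Mn, Mc, Me) has to be positioned relative to a base.
        if unicodedata.category(ch).startswith("M"):
            reasons.append("combining mark")
        # Rough test: letters of the main Arabic block change shape by context.
        if 0x0600 <= ord(ch) <= 0x06FF and unicodedata.category(ch) == "Lo":
            reasons.append("needs contextual shaping")
        # Right-to-left letters and Arabic numbers require reordering.
        if unicodedata.bidirectional(ch) in ("R", "AL", "AN"):
            reasons.append("needs bidi reordering")
        if reasons:
            problems.append((index, f"U+{ord(ch):04X}", reasons))
    return problems


print(simple_layout_violations("Vi\u1ec7t"))  # precomposed Vietnamese
print(simple_layout_violations("e\u0323"))    # base letter + combining mark
print(simple_layout_violations("\ufe91"))     # Arabic presentation form
```

    Note that the presentation form passes the shaping test but is still
    flagged for bidi, which is the "wrong side of the edge" mentioned
    above.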

