Re: questions on implementing an embedded system that supports Unicode

From: Addison Phillips (addison@yahoo-inc.com)
Date: Wed Aug 01 2007 - 12:42:28 CDT

    Hi Cyrille,

    > I am working on an embedded system that supports the UTF-16
    > subset of Unicode.

    UTF-16 is a character encoding of the entire Unicode character set. It's
    not a subset.
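
    To make that concrete: code points above U+FFFF are stored in UTF-16
    as a pair of 16-bit code units (a "surrogate pair"), so a renderer
    has to decode pairs before looking up a glyph. What follows is only
    a minimal sketch in C. The name utf16_next and the choice to return
    U+FFFD for unpaired surrogates are my own assumptions, not part of
    any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu  /* U+FFFD, returned for unpaired surrogates */

/* Decode one code point from a UTF-16 buffer of `len` code units,
 * starting at *pos, and advance *pos past the units consumed.
 * utf16_next is a hypothetical helper name, not a library function. */
uint32_t utf16_next(const uint16_t *units, size_t len, size_t *pos)
{
    assert(units != NULL && pos != NULL && *pos < len);

    uint16_t hi = units[(*pos)++];

    if (hi >= 0xD800 && hi <= 0xDBFF) {            /* high (lead) surrogate */
        if (*pos < len) {
            uint16_t lo = units[*pos];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {    /* low (trail) surrogate */
                (*pos)++;
                return 0x10000u + (((uint32_t)(hi - 0xD800) << 10)
                                   | (uint32_t)(lo - 0xDC00));
            }
        }
        return REPLACEMENT_CHAR;                   /* lead with no trail */
    }
    if (hi >= 0xDC00 && hi <= 0xDFFF)
        return REPLACEMENT_CHAR;                   /* stray trail surrogate */

    return hi;                                     /* ordinary BMP code unit */
}
```

    For example, the code units 0xD808 0xDC00 decode to U+12000, which
    is in the Cuneiform block in Plane 1.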

    > - assuming that I use bitmap fonts (6*8 or so for Latin letters
    > and 12*12~14*14 for asian, and I do not know how bit Arabic and
    > similar letters needs to be), how much memory will I need to
    > dedicate to the fonts?

    "A lot"?

    The real question is how much of Unicode your device is going to render.
    Total coverage presents several problems, not the least of which are
    right-to-left languages.
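
    For a back-of-envelope number, you can multiply the glyph count by
    the bytes per bitmap. This is a rough sketch that assumes 1 bit per
    pixel, bit-packed glyphs with no per-row padding, and no index
    tables; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one bit-packed width x height monochrome glyph
 * (1 bit per pixel, no per-row padding). Hypothetical helper. */
size_t glyph_bytes(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    return (width * height + 7) / 8;    /* round up to whole bytes */
}

/* Bytes for `count` glyphs of a single size, ignoring lookup tables. */
size_t font_bytes(size_t count, size_t width, size_t height)
{
    return count * glyph_bytes(width, height);
}
```

    Under those assumptions, the 95 printable ASCII characters at 6*8
    come to 570 bytes, while the 20,992 code points of the CJK Unified
    Ideographs block (U+4E00..U+9FFF) come to 377,856 bytes at 12*12
    and 524,800 bytes at 14*14. Padding each row to a byte boundary
    would make the Asian glyphs larger still.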

    Some languages use what we call "complex" scripts, where the sequence of
    characters influences how they are displayed. For example, Arabic
    characters take a different shape depending on whether they are the
    first, last, or middle characters in a string (or if they stand alone).
    This means having a different "glyph" (bitmap) for each of the four ways
    of presenting each Arabic Unicode character. In addition, the vowels
    (when used) are rendered as "combining marks" (above or below the base
    character). This is quite apart from the fact that Arabic is also a
    right-to-left language. Rendering of complex scripts (and combining
    marks, etc.) is not simply an issue of having a monospaced "cell" of
    bits to display for each individual character in sequence (as it is for
    most English or Far East Asian texts). And note that complex scripts are
    not limited to right-to-left languages. The Devanagari script (used to
    write Hindi, among other languages), for example, is both a
    left-to-right script and a complex script.
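
    To give a feel for the four-forms problem, here is a deliberately
    simplified sketch in C of choosing the form of one Arabic letter
    from its neighbours. The type and function names are my own
    invention. It assumes you have already classified each letter as
    non-joining, right-joining (joins only to the preceding letter,
    like ALEF), or dual-joining (like BEH), and it ignores combining
    marks between letters and ligatures such as lam-alef, which a real
    implementation must handle.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified joining classes. Real per-character data is published in
 * the Unicode Character Database (ArabicShaping.txt). */
typedef enum { JOIN_NONE, JOIN_RIGHT, JOIN_DUAL } joining_t;

typedef enum { FORM_ISOLATED, FORM_FINAL, FORM_INITIAL, FORM_MEDIAL } form_t;

/* Pick the presentation form for the letter at index i, given the
 * joining class of every letter in logical (typing) order.
 * arabic_form is a hypothetical name used for illustration only. */
form_t arabic_form(const joining_t *types, size_t len, size_t i)
{
    assert(types != NULL && i < len);

    joining_t cur  = types[i];
    joining_t prev = (i > 0)       ? types[i - 1] : JOIN_NONE;
    joining_t next = (i + 1 < len) ? types[i + 1] : JOIN_NONE;

    /* A letter connects backward only if the previous letter can
     * connect forward, and only dual-joining letters can do that. */
    int joins_prev = (cur != JOIN_NONE) && (prev == JOIN_DUAL);
    /* A letter connects forward only if it is dual-joining itself and
     * the next letter accepts a connection. */
    int joins_next = (cur == JOIN_DUAL) && (next != JOIN_NONE);

    if (joins_prev && joins_next) return FORM_MEDIAL;
    if (joins_prev)               return FORM_FINAL;
    if (joins_next)               return FORM_INITIAL;
    return FORM_ISOLATED;
}
```

    For the sequence BEH, ALEF, BEH this picks initial, final, and
    isolated forms, because ALEF never connects to the letter that
    follows it. The result can then select one of up to four bitmaps
    per letter.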

    In terms of font support, note that you probably don't need the full
    range of Unicode. For example, many of the scripts in Plane 1 are of
    historical interest only. I doubt whether your embedded application
    really needs support for Cuneiform or Gothic. Similarly there are
    scripts in the Basic Multilingual Plane that might not be of commercial
    interest to you, as much as supporting them would be a Good Thing. For
    example, Tifinagh or Ethiopic scripts might not be something you need
    right away. If you figure out which languages and scripts to support
    (perhaps based on what rendering capabilities you can provide), you can
    probably limit your font support to those characters.
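
    One common way to store only the characters you support is a sorted
    table of code point ranges that maps each character to an index in
    your glyph data. The sketch below is an illustration under my own
    assumptions: the three ranges are example coverage only, and the
    names glyph_range_t, k_ranges, and glyph_index are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t first;       /* first code point in the range */
    uint32_t last;        /* last code point in the range (inclusive) */
    uint16_t glyph_base;  /* glyph index assigned to `first` */
} glyph_range_t;

/* Example coverage only; must stay sorted by code point. */
static const glyph_range_t k_ranges[] = {
    { 0x0020, 0x007E,   0 },  /* printable Basic Latin: 95 glyphs  */
    { 0x00A0, 0x00FF,  95 },  /* Latin-1 Supplement:    96 glyphs  */
    { 0x4E00, 0x9FFF, 191 },  /* CJK Unified Ideographs block      */
};

#define GLYPH_MISSING (-1)

/* Binary-search the range table. Returns the glyph index for cp, or
 * GLYPH_MISSING if the font does not cover that code point. */
int32_t glyph_index(uint32_t cp)
{
    assert(cp <= 0x10FFFFu);  /* highest Unicode code point */

    size_t lo = 0;
    size_t hi = sizeof k_ranges / sizeof k_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < k_ranges[mid].first)
            hi = mid;
        else if (cp > k_ranges[mid].last)
            lo = mid + 1;
        else
            return (int32_t)(k_ranges[mid].glyph_base
                             + (cp - k_ranges[mid].first));
    }
    return GLYPH_MISSING;
}
```

    A caller would typically draw a fallback box or question mark when
    it gets GLYPH_MISSING back. With this table, U+0041 maps to glyph
    33 and U+0627 (ARABIC LETTER ALEF) is reported as missing.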

    > - how critical is the implementation of the RTL languages? It
    > seems to add quite a lot of complexity to the system and, once
    > again, in a low power embedded system might not be worth it?

    The answer to that question depends on your needs. If you're going to
    sell your products in these markets, it's kind of important. If not, you
    can probably avoid it. Complex script markets do represent a growing
    area of importance, so you probably should take a hard look at the
    tradeoffs before deciding not to support one or another capability.

    > - is there any existing software package that I could start
    > with/use as a basis in my system so that I do not have to
    > rewrite everything?

    That's difficult to answer. You don't say what your programming language
    or environment are like, although I assume that you have at least a C
    compiler...

    Many people start with the ICU open source library, as it implements
    many of the features of Unicode. See:

       http://www.icu-project.org

    Hope that helps,

    Addison

    -- 
    Addison Phillips
    Globalization Architect -- Yahoo! Inc.
    Chair -- W3C Internationalization Core WG
    Internationalization is an architecture.
    It is not a feature.

    de Brebisson, Cyrille (Calculator Division) wrote:
    > Hello,
    > 
    > I have a couple of questions, which will probably make it obvious that I am a newbie :)
    > 
    > I am working on an embedded system that supports the UTF-16 subset of Unicode. I have of course read the FAQ and lots of "high level articles" However, I till have a couple of questions, mainly related to displaying Unicode strings:
    > - assuming that I use bitmap fonts (6*8 or so for Latin letters and 12*12~14*14 for asian, and I do not know how bit Arabic and similar letters needs to be), how much memory will I need to dedicate to the fonts?
    > - how critical is the implementation of the RTL languages? It seems to add quite a lot of complexity to the system and, once again, in a low power embedded system might not be worth it?
    > - is there any existing software package that I could start with/use as a basis in my system so that I do not have to rewrite everything?
    > 
    > Thanks, Cyrille
    > 
    

