RE: questions on implementing an embeded system that supports unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 01 2007 - 12:48:14 CDT


    There is no “UTF-16 subset” in Unicode. UTF-16 is not a subset, but a
    transformation format that allows encoding every part of the UCS.
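
    As a small illustration of that point, here is a sketch of how UTF-16
    reaches code points outside the first plane with a pair of 16-bit code
    units. The function name utf16_code_units is only an illustrative
    helper, and the sketch assumes its input is a single valid code point
    (not a lone surrogate).

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units (as integers) for one code point."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                      # BMP: a single 16-bit code unit
    cp -= 0x10000                        # 20 bits remain for the other planes
    high = 0xD800 + (cp >> 10)           # high (lead) surrogate
    low = 0xDC00 + (cp & 0x3FF)          # low (trail) surrogate
    return [high, low]


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
print([f"{u:04X}" for u in utf16_code_units("\U0001D11E")])  # ['D834', 'DD1E']
```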

    I suppose you are then speaking about the Basic Multilingual Plane (BMP),
    which contains 65536 code points, of which only a small part (the
    surrogates and the noncharacters) is not valid for characters.
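
    For reference, the arithmetic can be sketched as follows. The ranges
    used are the surrogate block U+D800..U+DFFF and the BMP noncharacters
    U+FDD0..U+FDEF, U+FFFE and U+FFFF; private-use and still unassigned
    code points are deliberately not subtracted in this rough count.

```python
BMP_SIZE = 0x10000                               # 65536 code points in plane 0

surrogates = 0xDFFF - 0xD800 + 1                 # reserved for UTF-16 pairs
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2        # U+FDD0..U+FDEF, U+FFFE, U+FFFF

usable = BMP_SIZE - surrogates - noncharacters
print(BMP_SIZE, surrogates, noncharacters, usable)
```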

    But even with that count, you will not be able to cover correctly all the
    text that can be encoded using only BMP code points: many characters
    form combinations involving combining diacritics, required ligatures and
    required contextual forms, and some characters have glyphs that are
    reordered before, after, or both before and after the glyph(s) of another
    character (or even of several characters).

     

    This means that even a font with 65535 glyphs (the maximum for a TrueType
    font) will not be enough to cover every script correctly.

    Don’t even consider implementing such a large font as bitmaps: it would
    be excessively large and very inefficient.

     

    Note also that at small sizes, like the ones suggested, the rendering will
    be extremely poor unless you use color and transparency to smooth the
    small details: a 6x8 black-and-white matrix is not enough to show all
    Latin characters properly.

     

    As for the memory needed: even if you limit yourself to some subset of
    the strings that can be encoded with BMP characters only, the requirement
    will be far too high for any practical application, especially if your
    bitmaps have more than 1 bit of color depth.

     

    You don’t even need to ask us how much memory it will take. Unicode does
    not specify it; you can do the computation yourself by counting the number
    of bitmap glyphs you’ll need and summing their sizes according to your own
    resolution and color constraints.
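
    A rough sketch of that computation follows. It assumes fixed-size glyph
    cells stored uncompressed, with each pixel row padded to whole bytes; the
    glyph counts in the example are placeholders chosen for illustration, not
    figures from Unicode or from any real font.

```python
def bitmap_font_bytes(glyph_count, width, height, bits_per_pixel=1):
    """Bytes needed for glyph_count fixed-size glyphs, rows padded to bytes."""
    row_bytes = (width * bits_per_pixel + 7) // 8
    return glyph_count * row_bytes * height


# Placeholder glyph counts: replace them with the counts you actually need.
latin_1bpp = bitmap_font_bytes(1000, 6, 8)         # 6x8 cells, 1 bit per pixel
cjk_1bpp = bitmap_font_bytes(20000, 12, 12)        # 12x12 cells, 1 bit per pixel
cjk_4bpp = bitmap_font_bytes(20000, 12, 12, 4)     # same cells, 16 gray levels

print(latin_1bpp, cjk_1bpp, cjk_4bpp)
```

    This counts the glyph bitmaps only; the tables mapping characters (or
    character sequences) to glyphs come on top of it.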

     

    So what you need is a renderer suited to your embedded system: there are
    free or open-source implementations of text renderers and layout engines.
    I suggest you go with one of them, at least for handling the reordering
    algorithms, which are complex to support.

     

    If you support neither the BiDi algorithm nor any other reordering
    algorithm, you won’t be able to properly handle texts written in RTL
    scripts, whichever standard Unicode transformation format (UTF-8, UTF-16,
    UTF-32…) encodes them; nor will you be able to handle most Indic scripts
    (and others that are listed, for example in Windows, as “complex scripts”
    because they need specific support in the text layout engine).
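
    As a minimal sketch, assuming a Python environment just to illustrate
    the idea, a limited system could at least detect text that needs BiDi
    processing instead of displaying it in the wrong order. The helper name
    needs_bidi is hypothetical, and the check only looks at the strong
    right-to-left classes; it is in no way an implementation of the BiDi
    algorithm itself.

```python
import unicodedata

# Bidi classes of strong right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def needs_bidi(text):
    """Return True if the text contains any strong right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


print(needs_bidi("hello"))                        # Latin only
print(needs_bidi("\u05E9\u05DC\u05D5\u05DD"))     # Hebrew word
print(needs_bidi("\u0645\u0631\u062D\u0628\u0627"))  # Arabic word
```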

    If you then restrict yourself to LTR alphabets, you will have surprises:
    there are languages whose letters are encoded as sequences of a base
    letter and one or several combining diacritics (Vietnamese, written in
    the Latin script, requires supporting at least 2 diacritics on vowels).
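
    To make the Vietnamese case concrete, here is a short sketch that
    decomposes one Vietnamese vowel into its base letter and combining marks
    using canonical decomposition (NFD). The helper name decompose is only
    illustrative.

```python
import unicodedata


def decompose(ch):
    """List the code points of the canonical decomposition (NFD) of ch."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}"
            for c in unicodedata.normalize("NFD", ch)]


# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
for line in decompose("\u1EC7"):
    print(line)
# U+0065 LATIN SMALL LETTER E
# U+0323 COMBINING DOT BELOW
# U+0302 COMBINING CIRCUMFLEX ACCENT
```

    A renderer that only maps one code point to one glyph cell would draw
    this letter as three separate cells when the text arrives decomposed.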

     

    Supporting Unicode is generally not done this way. In fact, the problem
    is first split by sorting the characters into classes that are specific
    to their script, then by some other general properties. An engine then
    parses the text to render and splits it into units to identify “grapheme
    clusters”; for each run of grapheme clusters within the same class, an
    appropriate font supporting that class is selected. The renderer then
    looks into the font and works with its internal layout rules to find
    which glyphs will best represent the text. According to what it finds, it
    may completely reorder the text, and it will also look for the combining
    characters that require multiple separate glyphs that are not ordered the
    same way (this occurs in Indic scripts for some vowels), transforming the
    characters into glyph ids according to the font instructions. Finally, it
    can use that font to draw the entire layout using the glyph ids
    determined by this algorithm.
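
    The cluster-splitting step can be sketched very roughly as below. This
    is a deliberate simplification, not the segmentation rules of the
    Unicode standard: it only attaches characters of general category M
    (marks) to the preceding character and ignores Hangul jamo sequences,
    joiners and the other cases a real engine handles. The function name
    simple_clusters is hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the marks that follow it (simplified)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch           # mark: extend the current cluster
        else:
            clusters.append(ch)          # anything else starts a new cluster
    return clusters


# "e" + dot below + circumflex, then "a": two clusters from four code points.
print(len(simple_clusters("e\u0323\u0302a")))

# Devanagari KA + VOWEL SIGN I: one cluster, and the vowel sign is drawn
# to the left of the consonant although it is encoded after it.
print(len(simple_clusters("\u0915\u093F")))
```

    Even this simplified grouping shows why a table indexed by single code
    points is not enough: the unit to draw is the cluster, not the code
    point.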

     

    I have deliberately simplified the steps, but the above should convince
    you that writing a layout engine yourself will not be an easy task. In
    addition, to support the font directives, you need to be able to parse
    font files and their instructions (even a bitmap font contains at least
    basic instructions for determining the size of each glyph and its
    position in the bitmap).

     

    If your embedded system has little processing power and little memory,
    the best you’ll be able to do is to support basic fonts that won’t cover
    the whole set of abstract characters encoded in the UCS by ISO and
    Unicode.

     

    Really, every system starts by supporting one script, then adds other
    scripts one at a time. Supporting many scripts at the same time is a large
    project, and it would probably be too costly for your project to redevelop
    it (given that it has already taken decades to support them in the
    existing desktop/server OSes).

     

    So consider using an existing engine (and contributing to its development
    if some features still do not work for the languages you need to support).
    Remember that you can’t develop such a thing alone (there is a lot to
    learn from the tricky cases found in every script). Actually, almost all
    scripts of the world have their complexities for supporting some
    languages. (Possibly the only exception is the modern Korean Hangul
    script, which is extremely simple compared with the complexities of the
    Latin script.)

     

    So consider learning more about the concept of grapheme clusters. If you
    don’t understand it, you can’t understand why supporting “only” the
    characters of the first plane will fail with your approach.

     

     

      _____

    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    behalf of de Brebisson, Cyrille (Calculator Division)
    Sent: Wednesday, August 1, 2007 17:48
    To: unicode@unicode.org
    Subject: questions on implementing an embeded system that supports unicode

     

    Hello,

     

    I have a couple of questions, which will probably make it obvious that I am
    a newbie :-)

     

    I am working on an embedded system that supports the UTF-16 subset of
    Unicode. I have of course read the FAQ and lots of “high level articles”.
    However, I still have a couple of questions, mainly related to displaying
    Unicode strings:

    - assuming that I use bitmap fonts (6*8 or so for Latin letters and
    12*12~14*14 for Asian, and I do not know how big Arabic and similar letters
    need to be), how much memory will I need to dedicate to the fonts?

    - how critical is the implementation of the RTL languages? It seems to add
    quite a lot of complexity to the system and, once again, in a low power
    embedded system might not be worth it?

    - is there any existing software package that I could start with/use as a
    basis in my system so that I do not have to rewrite everything?

     

    Thanks, Cyrille


