Re: internationalization assumption

From: Philippe Verdy (
Date: Thu Oct 07 2004 - 04:50:13 CST

    Well, the main issue for
    internationalization of software is not the character sets with which it was
    tested. It is in fact trivial today to make an application compliant with
    Unicode text encoding.

    What is more complicated is to make sure that the text will be properly
    displayed. Most of the problems arise in the following areas:

    - dialogs and GUI interfaces need to be resized according to text lengths

    - a GUI may have been built with a limited set of fonts, all of them with
    the same line height for the same point size; if you have to display Thai
    characters, you'll need a larger line height for the same point size.

    - some scripts are not readable at small point sizes, notably Han sinograms
    or Arabic

    - the GUI layout should preferably be reversed for RTL languages.

    - you need to be aware of the BiDi algorithm, and you'll have to manage the
    case of mixed directions each time you include portions of text from a
    general LTR script within an RTL interface (notably for Hebrew or Arabic):
    if you ignore this, your application will not insert the BiDi controls
    that are needed to properly order the rendered text, notably for mirrored
    characters such as parentheses. For some variable inclusions in an RTL
    resource string, you may need to insert a surrounding embedding pair
    (LRE/PDF for left-to-right items, RLE/PDF for right-to-left ones) so that
    the embedded Latin items will display correctly.

    - The GUI controls such as input boxes should be properly aligned so
    that input will be performed from the correct side.

    - Tabular data may have to be presented with distinct alignments, notably if
    items are truncated in narrow but extensible columns (traditionally, tabular
    text items are aligned on the left and truncated on the right, but for
    Hebrew or Arabic, they should be aligned and truncated in the opposite
    direction).

    - You have to be aware of the variety of scripts that may be used even in
    a pure RTL interface: a user may need to enter sections of text in another
    script, most often Latin. You have to consider how these foreign text items
    will be handled.

    - In editable parts of the GUI, mouse selection will be more complex than
    you think, notably with mixed RTL/LTR scripts.

    - You can't assume that all text will be readable with a fixed-width font.
    Some scripts require using variable-width letters.

    - You have to worry about grapheme clusters, notably in Hebrew, Arabic, and
    nearly all Indian scripts. This is more complex than what you may expect
    from Latin, Greek, Cyrillic, Han, Hiragana or Katakana texts. Even with the
    Latin script, you can't assume that all grapheme clusters will be made of
    only 1 character. For various reasons, common texts will be entered using
    combining characters, without the possibility of using precomposed clusters
    (this is especially true for modern Vietnamese, which uses multiple
    diacritics on the same letter).

    - Text handling routines that change the presentation of text (such as
    capitalisation) will not work properly or will not be reversible: even in
    the Latin script, there are some characters which are available in only 1
    case. Titlecasing is another issue. Such automated presentation effects
    should be avoided, unless you are aware of the problem.

    - Plain-text searches often need to be case-insensitive. This issue is
    closely related to collation order, which is sensitive to local linguistic
    conventions, and not only to the script used. For example, plain-text search
    in Hebrew will often need to support searches with or without vowel marks,
    which are combining characters, simply because they are optional in the
    language. When this is used to search and match identifiers such as
    usernames or filenames, you will have several options to choose from. In
    addition, there is a lot of legacy text that is not coded with the most
    accurate Unicode characters, simply because it was entered with more
    restricted input methods or keyboards, or was coded with more restricted
    legacy charsets (the 'oe' ligature in French is typical: it is absent from
    ISO-8859-1 and from standard French keyboards, although it is a mandatory
    character for the language; however it is present in Windows codepage 1252,
    and may be present in texts coded with it, because it will be entered through
    "assisted" editors or word processors that can perform autocorrection of
    ligatures on the fly).

    - GUI keyboard accelerators may not be workable with some scripts: you can't
    assume that the displayed menu items will contain a matching ASCII letter,
    so you'll need some way to allow keyboard navigation of the interface. This
    issue is related to accessibility guidelines: you need to offer a way for
    users to see which keyboard accelerators they can use to navigate easily in
    your interface. Don't assume that accelerators for one language will be used
    as easily for another language.

    - toolbar buttons should avoid graphic icons with text elements, unless
    these items are also internationalizable.

    - color coding to add special semantics to text, or even to icons should be
    avoided, such as the too common European meanings of Red/Orange/Green.

    - Sometimes, it will be hard to summarize in a short button label the
    actions it performs. Using help tooltip texts (also internationalizable)
    will provide better experience for users, when these buttons need to display

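    As an illustration of the BiDi item in the list above, here is a minimal
    Python sketch of wrapping variable inclusions in embedding controls before
    they are substituted into a resource string. The helper names (embed,
    format_resource) and the "{name}" placeholder syntax are assumptions made
    for this example, not part of any particular toolkit. The direction of each
    value is guessed from its first strong character, which is a simplification.

```python
import unicodedata

LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def first_strong_direction(text):
    """Return "ltr", "rtl", or None based on the first strong bidi character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def embed(value):
    """Wrap value in an embedding pair matching its own first strong direction."""
    direction = first_strong_direction(value)
    if direction == "ltr":
        return LRE + value + PDF
    if direction == "rtl":
        return RLE + value + PDF
    return value  # no strong characters (digits, punctuation): leave untouched


def format_resource(template, **values):
    """Fill a hypothetical "{name}"-style resource string with wrapped values."""
    wrapped = {key: embed(str(val)) for key, val in values.items()}
    return template.format(**wrapped)


# A Hebrew resource string with a Latin file name that contains parentheses.
message = format_resource("\u05e9\u05dc\u05d5\u05dd {name}", name="report (v2).txt")
```

    The controls are invisible in the output but keep the parentheses attached
    to the Latin run. Later Unicode versions also define isolate controls for
    the same purpose; they are not covered by this sketch.
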
    The other internationalization issues are much simpler: date and number
    formats, common words like Yes/No/OK/Cancel/Retry/Abort, are easily solved
    with text resources and common i18n libraries, such as the basic common set
    of CLDR resources.
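
    Returning to the plain-text search item in the list above, the following
    Python sketch shows one possible way to match text with or without
    combining marks, regardless of case. It uses only the standard unicodedata
    module. The function names and the small EXTRA_FOLDS table are assumptions
    made for this example. Dropping every nonspacing mark is too aggressive
    for languages where an accented letter is a distinct letter, so this is
    not a substitute for locale-aware collation.

```python
import unicodedata

# Illustrative extra equivalences that Unicode normalization does not provide:
# the 'oe' ligature has no decomposition, so it is mapped by hand here.
EXTRA_FOLDS = {"\u0153": "oe"}


def search_key(text):
    """Build a loose matching key: decompose, drop nonspacing marks, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    key = stripped.casefold()
    for source, target in EXTRA_FOLDS.items():
        key = key.replace(source, target)
    return key


def loose_contains(haystack, needle):
    """True if needle occurs in haystack once marks and case are ignored."""
    return search_key(needle) in search_key(haystack)


# Hebrew "shalom" with vowel points, and the same word without them.
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
unpointed = "\u05e9\u05dc\u05d5\u05dd"
found = loose_contains(pointed, unpointed)
```

    The same key also matches Vietnamese text whether it was entered with
    precomposed letters or with combining diacritics, since both forms
    decompose to the same base letters.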

    ----- Original Message -----
    From: Mike Ayers
            For Unicode applications, Latin 1 testing is insufficient, even for
    internationalization testing. Internationalization tests should verify, at
    minimum, that characters >u1000 <=uffff (basically, all of the BMP) can be
    used. It is also good to verify >=u10000 support, or at least determine
    whether or not it exists for your application. I usually test English and
    Japanese for BMP conformance. For >BMP, while all the applications I've
    tested so far have specifically excluded this range, I still have a simple
    strategy based upon snipping the Deseret text from James Kass' script links
    page ( and using that
    (thanks, James!).
            Note that none of the above at all refers to localization testing,
    which still must be done for every supported language-charset combination
    (this is where Unicode can really pay off by reducing things to 1 charset
    per language). Internationalization testing should only determine the
    ability of your application to handle other languages, it is localization
    testing that determines whether it actually handles a given language, and
    would include such things as text entry and display, text conversion,
    coexistence, etc., as applicable.
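
    The check for characters beyond the BMP described in the quoted message
    could be sketched in Python as follows. The sample is just a handful of
    letters from the Deseret block, not the text referred to above, and the
    two "store" functions are hypothetical stand-ins for whatever the
    application under test does with its text.

```python
# A few letters from the Deseret block (U+10400..U+1044F), all outside the BMP.
DESERET_SAMPLE = "\U00010414\U00010435\U00010445\U00010428\U0001042F\U0001043B"


def has_supplementary(text):
    """True if text contains at least one character above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


def utf16_code_units(text):
    """Number of 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def round_trips(process, sample):
    """True if the function under test returns the sample unchanged."""
    return process(sample) == sample


def utf8_store(text):
    """Stand-in for a pipeline that stores text as UTF-8 and reads it back."""
    return text.encode("utf-8").decode("utf-8")


def bmp_only_store(text):
    """Hypothetical broken pipeline that silently drops non-BMP characters."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


results = {
    "utf8_store": round_trips(utf8_store, DESERET_SAMPLE),
    "bmp_only_store": round_trips(bmp_only_store, DESERET_SAMPLE),
}
```

    Each Deseret letter takes two UTF-16 code units, which is why code that
    counts or truncates by 16-bit units tends to fail on this kind of sample.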
