Re: Unicode & space in programming & l10n

From: Philippe Verdy (
Date: Wed Sep 20 2006 - 03:15:58 CDT

    From: "Doug Ewell" <>
    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    >> Given that most new technology terms are created and documented first
    >> in English, the bias still exists as English is a technology-related
    >> language; most programming languages are developed with very uniform
    >> English terminology in their reserved keywords...
    > 8<
    > None of this has anything to do with the connection that Paolillo
    > attempted to draw between "lack of support for Unicode" and "English
    > bias."

    Indeed, my point was more about the bias in favor of legacy 8-bit encodings and ASCII in programming languages, and its effect on the way programmers think about text-handling applications. When the language itself gives international text a complex representation, and its standard library treats text as a native string datatype whose "characters" are assumed to be 8 bits wide (an assumption reinforced by the terms chosen for the language's reserved keywords), many programmers, even international ones, come to think that supporting anything other than a legacy 8-bit single-byte encoding is troublesome.
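
    To make the 8-bit assumption concrete, here is a minimal sketch in Python 3 (chosen only for brevity). The sample string and the variable names are illustrative assumptions, not taken from any real application. It shows that counting 8-bit units and counting characters give different answers, and that cutting text at a byte offset can split a character in two.

```python
# Illustrative sample text (an assumption for this sketch): "naive cafe"
# written with two accented letters, U+00EF and U+00E9.
text = "na\u00efve caf\u00e9"

# The UTF-8 byte sequence, i.e. what an array of 8-bit "chars" would hold.
encoded = text.encode("utf-8")

code_points = len(text)     # number of Unicode code points
byte_len = len(encoded)     # number of 8-bit units

# Cutting at an arbitrary byte offset can split a multi-byte character:
# the first three bytes are "n", "a" and only the lead byte of U+00EF.
truncated = encoded[:3]
try:
    truncated.decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False

print(code_points, byte_len, truncation_ok)
```

    A programmer who thinks of the byte count as "the length of the string" will get buffer sizes, truncation and column widths wrong as soon as the text leaves ASCII.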

    Language features such as string constants, as well as APIs, tend to steer programmers away from international text. So unless a language natively offers strong support for Unicode, including in its core libraries and in its source files, and offers a native representation of text and characters that is coherent with the Unicode character model, there will remain a bias in favor of English and ASCII.
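
    As a rough illustration of why a legacy single-byte representation works against international text, the following Python 3 sketch checks which encodings can hold a string that mixes two scripts. The helper name can_encode and the sample words are assumptions made up for this example; the codec names are the ones shipped with Python's standard library.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Sample words (assumptions for this sketch): "Greek" written in Greek
# and "Russian" written in Russian, given as escapes to avoid ambiguity.
greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"
russian = "\u0440\u0443\u0441\u0441\u043a\u0438\u0439"
mixed = greek + " " + russian

# Each legacy 8-bit code page covers one script; only a Unicode
# encoding can represent both in the same string.
results = {
    codec: can_encode(mixed, codec)
    for codec in ("ascii", "iso8859_7", "iso8859_5", "utf-8")
}
print(results)
```

    Each 8-bit code page handles its own script, but none of them handles the mixed string; a native Unicode string type makes the question disappear.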

    I really don't think that the size of international text is an issue for today's applications, which handle tons of other non-text information. But handling text is the first thing any program has to do, and it comes up so early in the software design that programmers don't immediately perceive that they will need to support i18n or l10n, and give it a minor role, to be handled "later". When that time comes, it becomes a complex issue, because i18n and l10n were forgotten in the specification and the skeleton has already been written.

    Even today, most applications are written in languages that don't have native support for the Unicode character model (C/C++ included, but also many popular scripting languages like PHP). If we want this to change, we would need to promote other languages, but the main issue there is training programmers in these languages. C/C++ is still far too popular, even though Java has gained a strong influence in lots of domains (notably in enterprise applications), with C#/.NET following the same path.

    Isn't it time to start deprecating C/C++ (keeping it, along with assembly languages, only for some critical things like kernel-level device drivers, high-performance math libraries and multimedia codecs) in favor of higher-level programming languages that have native support for the Unicode character model?
