Unicode & space in programming & l10n

From: Don Osborn (dzo@bisharat.net)
Date: Sun Sep 17 2006 - 12:40:12 CDT


    A study published last year* mentioned the impact of Unicode's space
    requirements in aspects of programming and localization. How big an issue is
    the "size" requirement of Unicode for programmers these days, in terms of
    its wider potential use? (Some short excerpts are appended after the
    citation.) DZO



    * Paolillo, John. 2005. "Language Diversity on the Internet." In Paolillo,
    John, Daniel Pimienta, Daniel Prado, et al., eds. Measuring linguistic
    diversity on the Internet. A collection of papers. Montreal: UNESCO.
    (CI.2005/WS/06) http://unesdoc.unesco.org/images/0014/001421/142186e.pdf



    p. 47 (in the context of bias against localizing in diverse scripts):


    Technical bias arises in encoding schemes for text such as Unicode UTF-8,
    which causes text in a non-roman script to require two to three times more
    space than comparable text in a roman script. Here, the motivation stems
    from issues of compatibility between older roman-based systems and more
    recent Unicode systems.
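
    As a rough check on the "two to three times" figure, here is a minimal
    Python sketch (not from the study) that counts UTF-8 bytes per code point.
    The sample words and the helper name utf8_bytes_per_char are illustrative
    assumptions, and bytes per code point is only a proxy for "comparable
    text", since languages differ in how many characters a given passage needs.

```python
# Sample words are arbitrary illustrations, not data from the study.
samples = {
    "Latin (English)": "language",
    "Cyrillic (Russian)": "язык",
    "Devanagari (Hindi)": "भाषा",
    "Han (Chinese)": "语言",
}


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in text."""
    return len(text.encode("utf-8")) / len(text)


for script, word in samples.items():
    print(f"{script:20} {utf8_bytes_per_char(word):.1f} bytes per code point")
```

    With these samples the Latin word takes 1 byte per code point, the
    Cyrillic word 2, and the Devanagari and Han words 3 each.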


    p. 73 (in discussion of encoding & multilingual ICT):


    In its most basic form, UTF-32, Unicode text occupies four times as much
    space as the same text in ASCII. Many software developers have assumed that
    users would not want this penalty for multilingual text, particularly if
    computer use occurs mainly in monolingual contexts.[24] Unicode offers other
    variable-length encodings that are more efficient, but the space costs are
    passed on to non-roman scripts which are forced to consume more space.
    Although data storage costs have dropped considerably in the last decade,
    enough to make Unicode less of a problem, handling Unicode still
    substantially complicates the software developer's task, since most
    applications require inter-operability with ASCII. In addition, the larger
    sizes of Unicode documents carry costs for transmission, compression and
    decompression, and these costs are enough of a penalty to discourage use of
    Unicode in some contexts.
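
    The size claims in this excerpt can be checked with a short Python sketch
    (again not from the study). It encodes the same ASCII-only text in UTF-8,
    UTF-16 and UTF-32 and reports raw and zlib-compressed byte counts. The
    sample text, the repetition factor and the helper name size_report are
    assumptions made for illustration; compressed sizes depend on the input,
    so the sketch prints them rather than stating a figure.

```python
import zlib

# The "-le" variants are used so no byte-order mark inflates the counts.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def size_report(text):
    """Map each encoding to (raw byte count, zlib-compressed byte count)."""
    report = {}
    for enc in ENCODINGS:
        raw = text.encode(enc)
        report[enc] = (len(raw), len(zlib.compress(raw)))
    return report


# ASCII-only sample text (an arbitrary choice), repeated to give zlib some input.
ascii_text = "Measuring linguistic diversity on the Internet. " * 20

for enc, (raw, packed) in size_report(ascii_text).items():
    print(f"{enc:10} raw={raw:6d} bytes  zlib={packed:6d} bytes")
```

    For ASCII-only input the raw UTF-8 size equals the ASCII size, while
    UTF-16 doubles it and UTF-32 quadruples it, which matches the "four
    times" figure quoted for UTF-32.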


    p. 74 (English bias in markup & programming languages):


    Unfortunately, many commonly-used programming languages such as C do not yet
    offer standard support for Unicode.[25] A growing number of languages designed
    for Web-based applications do (examples include Java, JavaScript, Perl, PHP,
    Python, and Ruby, all of which are widely adopted), but other systems, such
    as database software, vary more in their support for Unicode.


    [Footnote 25 The International Components for Unicode website offers an
    open-source C library that assists in Unicode support

