Re: Unicode & space in programming & l10n

From: Doug Ewell
Date: Sun Sep 17 2006 - 17:52:45 CDT

    This sounds remarkably like the study by Steven Atkin and Ryan
    Stansifer, quoted in UTN #14, which attempted to prove 8-bit legacy
    encodings -- optimized for a single language or family of languages --
    are superior to Unicode because they encode those languages in fewer
    bytes than Unicode, and because a particular compression scheme
    (Burrows-Wheeler) compresses all encodings roughly equally.
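
    The size comparison described above can be tried out with a few lines
    of Python. This is only a sketch under stated assumptions: the sample
    text, the choice of KOI8-R as the legacy encoding, and the helper name
    `sizes` are all illustrative and do not come from the study, and the
    standard `bz2` module (which is built on the Burrows-Wheeler transform)
    stands in for whatever compressor the study actually used.

```python
import bz2

# Hypothetical sample: one Russian sentence repeated many times so the
# compressor has something to work with. A real corpus will behave differently.
SAMPLE = "Все люди рождаются свободными и равными в своем достоинстве и правах. " * 200


def sizes(text, encodings=("koi8_r", "utf_8", "utf_16_le")):
    """Return {encoding: (raw_bytes, bz2_bytes)} for the given text."""
    result = {}
    for name in encodings:
        raw = text.encode(name)
        result[name] = (len(raw), len(bz2.compress(raw)))
    return result


if __name__ == "__main__":
    for name, (raw, packed) in sizes(SAMPLE).items():
        print(f"{name:10} raw={raw:7d}  bz2={packed:6d}")
```

    Before compression the legacy encoding uses one byte per character and
    the Unicode encoding forms use more. How close the compressed sizes
    come to one another depends heavily on the text, so the numbers from a
    repeated sentence like this one should not be read as a general result.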

    Better support for SCSU over the past 8 years or so, from Unicode and
    from industry, might have been able to put these complaints to rest.
    SCSU compresses most non-CJK text to 1 byte per character, and most CJK
    text to 2 bytes per character, the same as legacy charsets. Because
    SCSU was relegated to the realm of "a higher-level protocol" and Unicode
    continued to be represented until 2001 as primarily a 16-bit encoding,
    industry support for this very useful encoding scheme never got off the
    ground.
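
    To illustrate where the "1 byte per character" figure comes from, here
    is a deliberately minimal sketch of SCSU's single-byte mode. It is not
    a full SCSU encoder: the function name `scsu_single_window` is
    hypothetical, and it handles only text made of ASCII plus characters
    that all fall inside one of the eight default dynamic windows. A real
    encoder also redefines windows, quotes characters, and switches to
    Unicode mode, none of which is attempted here.

```python
# Default dynamic-window start offsets from the SCSU specification (UTS #6).
DEFAULT_WINDOWS = (0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00)
SC0 = 0x10  # tags SC0..SC7 (0x10..0x17) select dynamic window 0..7


def scsu_single_window(text):
    """Encode text as ASCII plus ONE default 128-character window.

    Hypothetical helper for illustration only. Raises ValueError for any
    input that a real SCSU encoder would handle with window redefinition,
    quoting, or Unicode mode.
    """
    non_ascii = [ord(c) for c in text if ord(c) > 0x7F]
    out = bytearray()
    window = 0  # window 0 is the active window at the start of a stream
    if non_ascii:
        for i, start in enumerate(DEFAULT_WINDOWS):
            if all(start <= cp < start + 0x80 for cp in non_ascii):
                window = i
                break
        else:
            raise ValueError("text does not fit a single default window")
        if window != 0:
            out.append(SC0 + window)  # one tag byte selects the window
    base = DEFAULT_WINDOWS[window]
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F:
            out.append(cp)  # ASCII passes through unchanged
        elif cp > 0x7F:
            out.append(0x80 + cp - base)  # offset into the active window
        else:
            raise ValueError("control characters are not handled in this sketch")
    return bytes(out)


if __name__ == "__main__":
    word = "Москва"
    print(scsu_single_window(word).hex(" "))  # 12 9c be c1 ba b2 b0
    print(len(word.encode("utf_8")))          # 12
```

    For the six-letter word above, the sketch emits one tag byte to select
    the Cyrillic window and then one byte per letter, 7 bytes in all,
    against 12 bytes in either UTF-8 or UTF-16.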

    I would add that the heading "English bias" perpetuates a common and
    destructive myth. 8-bit legacy encodings exist that support dozens of
    languages besides English. To the extent that C and database
    development tools exhibit a "bias" (which the passage does not prove),
    it is a bias in favor of 8-bit legacy encodings and not the English
    language.

    Doug Ewell
    Fullerton, California, USA
    RFC 4645  *  UTN #14

    ----- Original Message -----
    From: Don Osborn
    Sent: Sunday, September 17, 2006 10:40
    Subject: Unicode & space in programming & l10n

    A study published last year* mentioned the impact of Unicode’s space
    requirements in aspects of programming and localization. How big an 
    issue is the “size” requirement of Unicode for programmers these days, 
    in terms of its wider potential use? (Some short excerpts are appended 
    after the citation).  DZO

    Paolillo, John. 2005. “Language Diversity on the Internet.” In
    Paolillo, John, Daniel Pimienta, Daniel Prado, et al, eds. Measuring 
    linguistic diversity on the Internet. A collection of papers. Montreal: 
    UNESCO. (CI.2005/WS/06)

    p. 47 (in the context of bias against localizing in diverse scripts):
    Technical bias arises in encoding schemes for text such as Unicode 
    UTF-8, which causes text in a non-roman script to require two to three 
    times more space than comparable text in a roman script. Here, the 
    motivation stems from issues of compatibility between older roman-based 
    systems and more recent Unicode systems.

    p. 73 (in discussion of encoding & multilingual ICT)
    In its most basic form, UTF-32, Unicode text occupies four times as much 
    space as the same text in ASCII. Many software developers have assumed 
    that users would not want this penalty for multilingual text, 
    particularly if computer use occurs mainly in monolingual contexts.[24]
    Unicode offers other variable-length encodings that are more efficient,
    but the space costs are passed on to non-roman scripts which are forced 
    to consume more space. Although data storage costs have dropped 
    considerably in the last decade, enough to make Unicode less of a 
    problem, handling Unicode still substantially complicates the software 
    developer’s task, since most applications require inter-operability with 
    ASCII. In addition, the larger sizes of Unicode documents carry costs 
    for transmission, compression and decompression, and these costs are 
    enough of a penalty to discourage use of Unicode in some contexts.
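
    The per-character figures in the two excerpts above can be checked
    directly. The following is a small sketch, not part of the quoted
    study: the sample words and the helper name `bytes_per_char` are
    illustrative assumptions, "character" here means a Unicode code point,
    and the little-endian codec variants are used only so that no byte
    order mark is counted.

```python
# Hypothetical one-word samples; each is the word for "language".
SAMPLES = {
    "English": "language",
    "Russian": "язык",
    "Hindi": "भाषा",
    "Japanese": "言語",
}


def bytes_per_char(text, encoding):
    """Average number of bytes per code point of text in the given encoding."""
    return len(text.encode(encoding)) / len(text)


if __name__ == "__main__":
    for label, word in SAMPLES.items():
        row = [bytes_per_char(word, enc) for enc in ("utf_8", "utf_16_le", "utf_32_le")]
        print(f"{label:9} UTF-8={row[0]:.1f}  UTF-16={row[1]:.1f}  UTF-32={row[2]:.1f}")
```

    For these samples UTF-8 uses 1 byte per character for the English
    word, 2 for the Russian one, and 3 for the Hindi and Japanese ones,
    while UTF-32 uses 4 bytes per character throughout.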

    p. 74 (English bias in markup & programming languages)
    Unfortunately, many commonly-used programming languages such as C do not 
    yet offer standard support for Unicode.[25] A growing number of languages
    designed for Web-based applications do (examples include Java, 
    JavaScript, Perl, PHP, Python, and Ruby, all of which are widely 
    adopted), but other systems, such as database software, vary more in 
    their support for Unicode.
    [Footnote 25 The International Components for Unicode website offers an 
    open-source C library that assists in Unicode support 
