Re: Unicode & space in programming & l10n

From: Mark Davis
Date: Sun Sep 17 2006 - 20:42:23 CDT

    Frankly, I think the reason why SCSU and BOCU never got a lot of traction is
    related to #1 on my list. That is, in the vast majority of cases UTF-16 or
    UTF-8 have storage characteristics that are good enough -- it's just not
    really worth taking extra steps to squeeze out more. The only small-string
    compression scheme to gain fairly wide acceptance, for different reasons, is
    Punycode. (All three of them are roughly comparable in compression ratio
    over the samples I gave, although their other characteristics differ.)
    Of course, ZIP and related compression schemes do a pretty good job on
    text in any of these languages encoded in Unicode, so they can be applied
    to reduce sizes for any and all of them, in appropriate circumstances.
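
    As a rough illustration of the comparison above, here is a minimal
    Python sketch that measures a string in UTF-8, UTF-16, Punycode, and
    zlib-compressed UTF-8. The sample strings and the helper name
    encoded_sizes are assumptions made for this example; they are not the
    samples referred to in this message. SCSU and BOCU-1 are left out
    because the Python standard library has no codec for them.

```python
import zlib

# Hypothetical sample strings, chosen only for illustration.
SAMPLES = {
    "English": "The quick brown fox jumps over the lazy dog",
    "Russian": "Съешь же ещё этих мягких французских булок",
    "Japanese": "いろはにほへと ちりぬるを",
}


def encoded_sizes(text):
    """Return the size in bytes of `text` under several encodings."""
    utf8 = text.encode("utf-8")
    return {
        "chars": len(text),
        "utf-8": len(utf8),
        # The "-le" variant avoids counting a byte order mark.
        "utf-16": len(text.encode("utf-16-le")),
        # Python's built-in codec for the RFC 3492 algorithm.
        "punycode": len(text.encode("punycode")),
        # General-purpose compression applied on top of UTF-8.
        "zlib(utf-8)": len(zlib.compress(utf8, 9)),
    }


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, encoded_sizes(text))
```

    On strings this short, zlib's fixed overhead can outweigh its savings,
    which is one reason dedicated small-string schemes exist at all;
    general-purpose compression tends to pay off on longer documents.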


    On 9/17/06, Doug Ewell wrote:
    > This sounds remarkably like the study by Steven Atkin and Ryan
    > Stansifer, quoted in UTN #14, which attempted to prove 8-bit legacy
    > encodings -- optimized for a single language or family of languages --
    > are superior to Unicode because they encode those languages in fewer
    > bytes than Unicode, and because a particular compression scheme
    > (Burrows-Wheeler) compresses all encodings roughly equally.
    > Better support for SCSU over the past 8 years or so, from Unicode and
    > from industry, might have been able to put these complaints to rest.
    > SCSU compresses most non-CJK text to 1 byte per character, and most CJK
    > text to 2 bytes per character, the same as legacy charsets. Because
    > SCSU was relegated to the realm of "a higher-level protocol" and Unicode
    > continued to be represented until 2001 as primarily a 16-bit encoding,
    > industry support for this very useful encoding scheme never got off
    > the ground.
    > I would add that the heading "English bias" perpetuates a common and
    > destructive myth. 8-bit legacy encodings exist that support dozens of
    > languages besides English. To the extent that C and database
    > development tools exhibit a "bias" (which the passage does not prove),
    > it is a bias in favor of 8-bit legacy encodings and not the English
    > language.
    > --
    > Doug Ewell
    > Fullerton, California, USA
    > RFC 4645 * UTN #14
    > ----- Original Message -----
    > From: Don Osborn
    > To:
    > Sent: Sunday, September 17, 2006 10:40
    > Subject: Unicode & space in programming & l10n
    > A study published last year* mentioned the impact of Unicode's space
    > requirements in aspects of programming and localization. How big an
    > issue is the "size" requirement of Unicode for programmers these days,
    > in terms of its wider potential use? (Some short excerpts are appended
    > after the citation). DZO
    > Paolillo, John. 2005. "Language Diversity on the Internet." In
    > Paolillo, John, Daniel Pimienta, Daniel Prado, et al, eds. Measuring
    > linguistic diversity on the Internet. A collection of papers. Montreal:
    > UNESCO. (CI.2005/WS/06)
    > p. 47 (in the context of bias against localizing in diverse scripts):
    > Technical bias arises in encoding schemes for text such as Unicode
    > UTF-8, which causes text in a non-roman script to require two to three
    > times more space than comparable text in a roman script. Here, the
    > motivation stems from issues of compatibility between older roman-based
    > systems and more recent Unicode systems.
    > p. 73 (in discussion of encoding & multilingual ICT)
    > In its most basic form, UTF-32, Unicode text occupies four times as much
    > space as the same text in ASCII. Many software developers have assumed
    > that users would not want this penalty for multilingual text,
    > particularly if computer use occurs mainly in monolingual contexts.24
    > Unicode offers other variable-length encodings that are more efficient,
    > but the space costs are passed on to non-roman scripts which are forced
    > to consume more space. Although data storage costs have dropped
    > considerably in the last decade, enough to make Unicode less of a
    > problem, handling Unicode still substantially complicates the software
    > developer's task, since most applications require inter-operability with
    > ASCII. In addition, the larger sizes of Unicode documents carry costs
    > for transmission, compression and decompression, and these costs are
    > enough of a penalty to discourage use of Unicode in some contexts.
    > p. 74 (English bias in markup & programming languages)
    > Unfortunately, many commonly-used programming languages such as C do not
    > yet offer standard support for Unicode.25 A growing number of languages
    > designed for Web-based applications do (examples include Java,
    > JavaScript, Perl, PHP, Python, and Ruby, all of which are widely
    > adopted), but other systems, such as database software, vary more in
    > their support for Unicode.
    > [Footnote 25: The International Components for Unicode website offers an
    > open-source C library that assists in Unicode support.]
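
    The size figures in the quoted excerpts (two to three times for UTF-8,
    four times for UTF-32) can be checked with a short Python sketch that
    computes average bytes per code point. The sample words and the helper
    name bytes_per_char are assumptions made for this example, not material
    from the study.

```python
def bytes_per_char(text):
    """Average encoded bytes per code point for each Unicode encoding form.

    The "-le" codec names are used so that no byte order mark is counted.
    """
    n = len(text)
    return {
        enc: len(text.encode(enc)) / n
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


# Hypothetical one-word samples, chosen only for illustration.
SAMPLES = {
    "Latin (ASCII)": "language",
    "Cyrillic": "язык",
    "Devanagari": "भाषा",
    "Han": "语言",
}

if __name__ == "__main__":
    for script, word in SAMPLES.items():
        print(script, bytes_per_char(word))
```

    ASCII text takes 1 byte per character in UTF-8 and 4 in UTF-32, while
    the Cyrillic sample takes 2 bytes per character in UTF-8 and the
    Devanagari and Han samples take 3. In UTF-16 all four samples take 2
    bytes per character.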
