Re: Unicode & space in programming & l10n

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Sep 17 2006 - 17:52:45 CDT

Next message: Mark Davis: "Re: Unicode & space in programming & l10n"

Previous message: Mark Davis: "Re: Unicode & space in programming & l10n"
In reply to: Don Osborn: "Unicode & space in programming & l10n"
Next in thread: Mark Davis: "Re: Unicode & space in programming & l10n"
Reply: Mark Davis: "Re: Unicode & space in programming & l10n"
Reply: Philippe Verdy: "Re: Unicode & space in programming & l10n"

This sounds remarkably like the study by Steven Atkin and Ryan
Stansifer, quoted in UTN #14, which attempted to prove that 8-bit legacy
encodings -- optimized for a single language or family of languages --
are superior to Unicode because they encode those languages in fewer
bytes than Unicode, and because a particular compression scheme
(Burrows-Wheeler) compresses all encodings roughly equally.
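
For anyone who wants to test the compression claim on their own data, here
is a minimal Python sketch. It is not the method used in the study: it
assumes the standard bz2 module (bzip2 is built on the Burrows-Wheeler
transform) as a stand-in compressor, and the sample sentence, the codec
list and the name compare() are hypothetical choices made for illustration.

```python
import bz2

# Hypothetical sample: one Russian pangram repeated many times.
# A meaningful comparison would use a large, natural corpus instead.
SAMPLE = "Съешь ещё этих мягких французских булок, да выпей чаю. " * 200


def compare(text, codecs=("koi8-r", "utf-8", "utf-16-le", "utf-32-le")):
    """Map each codec to (raw byte count, bzip2-compressed byte count)."""
    rows = {}
    for codec in codecs:
        raw = text.encode(codec)      # bytes as stored in this encoding
        packed = bz2.compress(raw)    # Burrows-Wheeler based compression
        rows[codec] = (len(raw), len(packed))
    return rows


if __name__ == "__main__":
    for codec, (raw, packed) in compare(SAMPLE).items():
        print(f"{codec:10} raw={raw:7d} bz2={packed:6d}")
```

The raw column shows the size gap between the legacy charset and the
Unicode encoding forms; the bz2 column shows how much of that gap is left
after compression for the text you feed it. Highly repetitive input like
the sample above flatters every encoding, so treat the numbers as a
starting point only.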

Better support for SCSU over the past 8 years or so, from Unicode and
from industry, might have been able to put these complaints to rest.
SCSU compresses most non-CJK text to 1 byte per character, and most CJK
text to 2 bytes per character, the same as legacy charsets. Because
SCSU was relegated to the realm of "a higher-level protocol" and Unicode
continued to be represented until 2001 as primarily a 16-bit encoding,
industry support for this very useful encoding scheme never got off the
ground.
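
To make the bytes-per-character comparison concrete, here is a small
Python sketch that counts the bytes a short string occupies in a legacy
charset and in UTF-8, UTF-16 and UTF-32. Python's standard library has no
SCSU codec, so SCSU itself is not measured; the legacy column is only the
baseline that SCSU is said above to match. The sample strings, the charset
chosen for each, and the name encoded_sizes() are assumptions made for
illustration.

```python
# Sample strings and codec pairings are illustrative assumptions.
SAMPLES = [
    ("English", "Hello, world", "latin-1"),
    ("Russian", "Привет, мир", "koi8-r"),
    ("Japanese", "日本語のテキスト", "shift_jis"),
]


def encoded_sizes(text, legacy_codec):
    """Return byte counts of text in a legacy charset and three UTFs."""
    # The -le variants are used so that no byte order mark is counted.
    codecs = {
        "legacy": legacy_codec,
        "utf-8": "utf-8",
        "utf-16": "utf-16-le",
        "utf-32": "utf-32-le",
    }
    sizes = {name: len(text.encode(codec)) for name, codec in codecs.items()}
    sizes["chars"] = len(text)  # number of code points in the string
    return sizes


if __name__ == "__main__":
    for language, text, legacy in SAMPLES:
        print(language, encoded_sizes(text, legacy))
```

For these samples the 11-character Russian string takes 11 bytes in
KOI8-R, 20 in UTF-8, 22 in UTF-16 and 44 in UTF-32, while the 8-character
Japanese string takes 16 bytes in Shift_JIS, 24 in UTF-8, 16 in UTF-16 and
32 in UTF-32.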

I would add that the heading "English bias" perpetuates a common and
destructive myth. 8-bit legacy encodings exist that support dozens of
languages besides English. To the extent that C and database
development tools exhibit a "bias" (which the passage does not prove),
it is a bias in favor of 8-bit legacy encodings and not the English
language.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645  *  UTN #14

----- Original Message -----
From: Don Osborn
To: unicode@unicode.org
Sent: Sunday, September 17, 2006 10:40
Subject: Unicode & space in programming & l10n

A study published last year* mentioned the impact of Unicode’s space
requirements in aspects of programming and localization. How big an
issue is the “size” requirement of Unicode for programmers these days,
in terms of its wider potential use? (Some short excerpts are appended
after the citation).  DZO

Paolillo, John. 2005. “Language Diversity on the Internet.” In
Paolillo, John, Daniel Pimienta, Daniel Prado, et al., eds. Measuring
linguistic diversity on the Internet. A collection of papers. Montreal:
UNESCO. (CI.2005/WS/06)
http://unesdoc.unesco.org/images/0014/001421/142186e.pdf

p. 47 (in the context of bias against localizing in diverse scripts):
Technical bias arises in encoding schemes for text such as Unicode 
UTF-8, which causes text in a non-roman script to require two to three 
times more space than comparable text in a roman script. Here, the 
motivation stems from issues of compatibility between older roman-based 
systems and more recent Unicode systems.

p. 73 (in discussion of encoding & multilingual ICT):
In its most basic form, UTF-32, Unicode text occupies four times as much 
space as the same text in ASCII. Many software developers have assumed 
that users would not want this penalty for multilingual text, 
particularly if computer use occurs mainly in monolingual contexts.[24]
Unicode offers other variable-length encodings that are more efficient,
but the space costs are passed on to non-roman scripts which are forced 
to consume more space. Although data storage costs have dropped 
considerably in the last decade, enough to make Unicode less of a 
problem, handling Unicode still substantially complicates the software 
developer’s task, since most applications require inter-operability with 
ASCII. In addition, the larger sizes of Unicode documents carry costs 
for transmission, compression and decompression, and these costs are 
enough of a penalty to discourage use of Unicode in some contexts.

p. 74 (English bias in markup & programming languages):
Unfortunately, many commonly-used programming languages such as C do not
yet offer standard support for Unicode.[25] A growing number of languages
designed for Web-based applications do (examples include Java, 
JavaScript, Perl, PHP, Python, and Ruby, all of which are widely 
adopted), but other systems, such as database software, vary more in 
their support for Unicode.

[Footnote 25: The International Components for Unicode website offers an
open-source C library that assists in Unicode support 
(http://oss.software.ibm.com/icu/).]

This archive was generated by hypermail 2.1.5 : Sun Sep 17 2006 - 18:09:18 CDT