Re: Unicode & space in programming & l10n

From: Mark Davis (mark.davis@icu-project.org)
Date: Sun Sep 17 2006 - 16:41:16 CDT

Next message: Doug Ewell: "Re: Unicode & space in programming & l10n"

Previous message: Don Osborn: "[OT] Pricing corpuses"
In reply to: Don Osborn: "Unicode & space in programming & l10n"
Next in thread: Hans Aberg: "Re: Unicode & space in programming & l10n"
Reply: Hans Aberg: "Re: Unicode & space in programming & l10n"

> Technical bias arises in encoding schemes for text such as Unicode UTF-8,
> which causes text in a non-roman script to require two to three times more
> space than comparable text in a roman script

I'd definitely disagree. There are a number of factors involved.

   1. Proportion of data. A huge amount of space is often taken up by
   structure or other information. For example, it's stunning to see how little
   of the web is textual content. Toss a few images on a web page, for example,
   and any differences in text size are completely swamped. Even without
   images, HTML markup takes up a very large proportion of the web. Similarly,
   take a Word document or PowerPoint presentation, extract just the text and
   compare the sizes...
   2. Character frequency. One can't just compare the amount that a
   particular character will grow or shrink; you have to look at the frequency
   of usage of characters in the language. The frequency of space (a single
   byte in UTF-8) can play a significant role in average storage requirements,
   for example.
   3. Equivalent text. Another relevant factor is how many characters are
   needed to express the same meaning.
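As a rough illustration of the character-frequency point, here is a minimal Python sketch. The helper name `utf8_bytes_per_char` and the two sample strings are made up for illustration (the strings are synthetic, not real words); the only facts relied on are that a Devanagari letter such as U+0915 takes 3 bytes in UTF-8 and a space takes 1.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per code point in text."""
    return len(text.encode("utf-8")) / len(text)

# Synthetic samples: U+0915 (DEVANAGARI LETTER KA) is 3 bytes in UTF-8,
# while U+0020 (space) is 1 byte.
no_spaces = "\u0915" * 8                      # 8 letters, 24 bytes
with_spaces = "\u0915" * 4 + " " + "\u0915" * 4  # 8 letters + 1 space, 25 bytes

print(utf8_bytes_per_char(no_spaces))    # 3.0
print(utf8_bytes_per_char(with_spaces))  # 25 / 9, a bit under 2.8
```

Even a single space in nine characters pulls the average below 3 bytes per character, so the per-character expansion alone overstates the growth of running text.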

Here is one comparison of comparable text, generated by taking the
Universal Declaration of Human Rights from the UN site, stripping markup,
and processing. It shows the number of characters in each document, the
number of bytes in UTF-16, UTF-8, and ZIP, and a percentage column for each
of those, giving the byte count as a percentage of the average for the
respective source column. The ZIP size is a rough gauge of information
content -- note that its variance is much smaller than that of either the
count of characters or the count of bytes: the standard deviations for the
UTF-16, UTF-8, and ZIP figures are 34%, 39%, and 14%, respectively.

Language    Characters  UTF-16 Bytes  UTF16%  UTF-8 Bytes  UTF8%    ZIP  ZIP%
Chinese          3,135         6,270     31%        8,779    62%  4,162   90%
Japanese         4,570         9,140     46%       12,810    91%  4,611   99%
Korean           4,937         9,874     49%       11,626    83%  4,075   88%
Arabic           7,842        15,684     78%       14,009   100%  4,461   96%
English         10,785        21,570    108%       10,785    77%  3,981   86%
Portuguese      11,444        22,888    114%       11,840    84%  4,513   97%
Hindi           11,685        23,370    117%       30,213   215%  6,453  139%
Russian         11,987        23,974    120%       21,910   156%  5,595  120%
French          12,047        24,094    120%       12,410    88%  4,663  100%
Italian         12,086        24,172    121%       12,167    87%  4,437   96%
German          12,088        24,176    121%       12,264    87%  4,669  101%
Spanish         12,116        24,232    121%       12,324    88%  4,511   97%
Indonesian      12,656        25,312    126%       12,656    90%  4,273   92%
Dutch           12,921        25,842    129%       12,923    92%  4,633  100%

Although Chinese takes 3 bytes per character in UTF-8, its UTF-8 byte count
is considerably smaller than the average across these languages -- the
expansion from UTF-16 to UTF-8 is swamped by the use of fewer characters in
the first place. Russian and Hindi, on the other hand, take more bytes than
the average in UTF-8.
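To make that arithmetic concrete, here is a small Python sketch that divides UTF-8 bytes by characters for four rows copied from the table. The names `UTF8_ROWS` and `bytes_per_char` are just illustrative; the numbers are the table's own.

```python
# (characters, UTF-8 bytes) copied from the table for four languages.
UTF8_ROWS = {
    "Chinese": (3135, 8779),
    "English": (10785, 10785),
    "Hindi": (11685, 30213),
    "Russian": (11987, 21910),
}

def bytes_per_char(language: str) -> float:
    """UTF-8 bytes divided by character count for one table row."""
    chars, utf8_bytes = UTF8_ROWS[language]
    return utf8_bytes / chars

for language in UTF8_ROWS:
    chars, utf8_bytes = UTF8_ROWS[language]
    print(f"{language:8} {bytes_per_char(language):.2f} bytes/char, "
          f"{utf8_bytes} bytes total")
# Chinese is about 2.80 bytes/char, English 1.00, Hindi about 2.59,
# Russian about 1.83, yet the Chinese total (8,779 bytes) is smaller
# than the English total (10,785 bytes).
```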

Now, this is just one sample -- the language style for the text is more
formal than is typical, and thus these figures may be different than for
more customary text. (So don't draw too many conclusions from this!)
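For anyone who wants to try this on other samples, here is a hedged Python sketch of the kind of measurement described above. It is not the script used for the table: the function name `size_report` is invented, and it assumes the compressed size is taken over the UTF-8 bytes with DEFLATE via the standard `zlib` module, which is the compression method ZIP normally uses but not the ZIP container itself.

```python
import zlib

def size_report(text: str) -> dict:
    """Character count and encoded sizes for one text sample.

    Assumptions: "characters" means Unicode code points, UTF-16 is
    measured without a byte order mark, and the compressed size is
    DEFLATE (zlib, level 9) over the UTF-8 bytes.
    """
    utf8 = text.encode("utf-8")
    return {
        "characters": len(text),
        "utf16_bytes": len(text.encode("utf-16-le")),  # no BOM
        "utf8_bytes": len(utf8),
        "deflate_bytes": len(zlib.compress(utf8, 9)),
    }

# Tiny synthetic sample: three ASCII letters, a space, one Devanagari letter.
print(size_report("abc \u0915"))
```

On very short inputs the compressed size is dominated by fixed overhead, so the comparison only becomes meaningful for documents of at least a few kilobytes.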

> handling Unicode still substantially complicates the software developer's
> task, since most applications require inter-operability with ASCII

What data is behind the premise "most applications require inter-operability
with ASCII"? Most applications on Windows or Mac? -- definitely false.
Most browsers? -- definitely false. Most programmer tools on Linux? -- maybe
true. Most applications running on watches? -- probably true. How would you
even compare "the number of applications" unless it is in a particular area?

Unicode does complicate the developer's task, compared to just developing in
ASCII or Latin 1. But for modern user-level programs dealing with multiple
languages, Unicode is by far the simplest choice.

Mark

On 9/17/06, Don Osborn <dzo@bisharat.net> wrote:
>
> A study published last year* mentioned the impact of Unicode's space
> requirements in aspects of programming and localization. How big an issue is
> the "size" requirement of Unicode for programmers these days, in terms of
> its wider potential use? (Some short excerpts are appended after the
> citation). DZO
>
>
>
>
>
> Paolillo, John. 2005. "Language Diversity on the Internet." In Paolillo,
> John, Daniel Pimienta, Daniel Prado, et al, eds. *Measuring linguistic
> diversity on the Internet. A collection of papers*. Montreal: UNESCO. (
> CI.2005/WS/06) http://unesdoc.unesco.org/images/0014/001421/142186e.pdf
>
>
>
>
>
> p. 47 (in the context of bias against localizing in diverse scripts):
>
>
>
> Technical bias arises in encoding schemes for text such as Unicode UTF-8,
> which causes text in a non-roman script to require two to three times more
> space than comparable text in a roman script. Here, the motivation stems
> from issues of compatibility between older roman-based systems and more
> recent Unicode systems.
>
>
>
> p. 73 (in discussion of encoding & multilingual ICT)
>
>
>
> In its most basic form, UTF-32, Unicode text occupies four times as much
> space as the same text in ASCII. Many software developers have assumed that
> users would not want this penalty for multilingual text, particularly if
> computer use occurs mainly in monolingual contexts.[24] Unicode offers other
> variable-length encodings that are more efficient, but the space costs are
> passed on to non-roman scripts which are forced to consume more space.
> Although data storage costs have dropped considerably in the last decade,
> enough to make Unicode less of a problem, handling Unicode still
> substantially complicates the software developer's task, since most
> applications require inter-operability with ASCII. In addition, the larger
> sizes of Unicode documents carry costs for transmission, compression and
> decompression, and these costs are enough of a penalty to discourage use of
> Unicode in some contexts.
>
>
>
> p. 74 (English bias in markup & programming languages)
>
>
>
> Unfortunately, many commonly-used programming languages such as C do not
> yet offer standard support for Unicode.[25] A growing number of languages
> designed for Web-based applications do (examples include Java, JavaScript,
> Perl, PHP, Python, and Ruby, all of which are widely adopted), but other
> systems, such as database software, vary more in their support for Unicode.
>
>
>
> [Footnote 25 The International Components for Unicode website offers an
> open-source C library that assists in Unicode support (
> http://oss.software.ibm.com/icu/).]
>
>
>


This archive was generated by hypermail 2.1.5 : Sun Sep 17 2006 - 16:50:45 CDT