Re: That UTF-8 Rant (was Unicode in source)

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Thu Jul 22 1999 - 15:02:13 EDT


"Addison Phillips" wrote on 1999-07-22 16:29 UTC:
> o The characters are all 16-bits in the BMP, in terms of processing (yes,
> yes, combining marks take more than one character to encode... but for
> processing purposes everything is the same width. Yes, there are higher
> planes of existence and these cannot be ignored...).
> o There is less text expansion for non-Latin languages.
> o There are programmatic ways of handling Unicode text via TCHAR that
> reduces the impact on code. If you don't unthread UTF-8 to UTF-16, text
> processing becomes somewhat uglier.
> o For languages other than Western European, the text expansion is much
> smaller than for UTF-8, so storage is conserved.

The only two somewhat valid (and closely related) points are

> o There is less text expansion for non-Latin languages.
> o For languages other than Western European, the text expansion is much
> smaller than for UTF-8, so storage is conserved.

I wonder whether

> o There are programmatic ways of handling Unicode text via TCHAR that
> reduces the impact on code. If you don't unthread UTF-8 to UTF-16, text
> processing becomes somewhat uglier.

is really based on in-depth practical programming experience with
UTF-8, or whether this is just an assumption. I have found that
processing UTF-8 text is extremely close to processing ASCII text.
The only difference is that I have to skip the 0x80..0xbf
continuation bytes in a few places (e.g., when deleting or counting
characters), and that is not very problematic. In fact, it is usually
exactly as complex as with UTF-16. I have not yet seen code that is
in any way nicer or more elegant on UTF-16 than on UTF-8. For
table-lookup operations, both UTF-8 and UTF-16 have to be converted
into a 31-bit integer value via a function such as mbtowc() anyway.
And the simplicity advantage that UCS-2 has over UTF-8 (ignoring not
only Klingon but also the advanced mathematical publishing characters
in ISO 10646-2) vanishes very quickly with combining characters,
which are essential not only for languages such as Thai but also for
mathematical publishing.
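
Here is roughly what I mean, as a quick C sketch (the function name
is my own, and there is deliberately no input validation): counting
the characters in a UTF-8 string needs only one extra test compared
to the ASCII version.

    #include <stddef.h>

    /* Count the characters in a UTF-8 string: the ASCII loop plus
       one test that skips the 0x80..0xbf continuation bytes. */
    size_t utf8_strlen(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char) *s & 0xc0) != 0x80)
                n++;
        return n;
    }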

I do not believe there is a big field of application for UTF-16.
UTF-16 is in my eyes primarily a political correctness exercise
towards the users of scripts who give up six months of Moore's law to
the 3-byte UTF-8 encoding of their characters. If you have to convert
into a simpler representation, then it will usually be a 32-bit
integer variable, with the MSB potentially signalling that there are
associated combining characters stored in some auxiliary data
structure. And this 32-bit representation will often be used only
very temporarily, for instance for only the current character in a
loop that processes an entire UTF-8 or UTF-16 string.
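
To make this pattern concrete, here is a rough sketch (my own code;
overlong and truncated sequences are not checked, so it assumes
well-formed, NUL-terminated input) of a loop that decodes each UTF-8
sequence into a temporary 32-bit variable, uses it, and moves on:

    #include <stdint.h>

    /* Decode one UTF-8 sequence at a time into a temporary 32-bit
       value c.  Only the 1..3-byte forms are spelled out; the
       longer forms (up to 6 bytes for the full 31-bit UCS-4 range)
       work the same way. */
    void process_utf8(const unsigned char *s)
    {
        while (*s) {
            uint32_t c;
            if (*s < 0x80) {                      /* 1 byte  */
                c = *s++;
            } else if ((*s & 0xe0) == 0xc0) {     /* 2 bytes */
                c  = (uint32_t)(*s++ & 0x1f) << 6;
                c |= *s++ & 0x3f;
            } else if ((*s & 0xf0) == 0xe0) {     /* 3 bytes */
                c  = (uint32_t)(*s++ & 0x0f) << 12;
                c |= (uint32_t)(*s++ & 0x3f) << 6;
                c |= *s++ & 0x3f;
            } else {
                s++;                   /* longer forms elided */
                continue;
            }
            /* ... use c here, e.g. as a table-lookup index ... */
            (void) c;
        }
    }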

Efficiency?

Sure, UTF-16 might save you a few CPU cycles over UTF-8 in this
conversion to UCS-4 here and there, but again, that is just a week or
less of Moore's law. People today are quite happy to lose at least an
order of magnitude more CPU cycles by using interpreted Perl/Python/
Java/TCL instead of good old compiled C. Low-cost PCs have been much
faster than necessary for word processing for several years now. Even
Microsoft is running out of ideas for how to further bloat word
processors these days. CPU cycles today are burned on real-time
rendering of anti-aliased fonts; next to that, the decoding cost of
UTF-8 is negligible.

> If you are going to write a text processing application, why would you make
> UTF-8 the default internally, when UTF-16 is so much easier to code?

Look for instance at Emacs. Emacs (and I am not talking about Mule)
does not distinguish between an internal and an external encoding.
This has the great advantage that Emacs allows me to do safe editing
of arbitrary binary files, something that I use frequently to fix
erroneously transmitted files, to convert between simple binary file
formats, etc. I very much hope that this binary editing capability
will be preserved when Emacs goes UTF-8, such that I can make local
changes to UTF-8 strings in a binary file without having to worry
about potential information loss elsewhere in the file caused by some
conversion between an internal and an external encoding. For
instance, how do you represent illegal UTF-8 sequences when you
convert everything internally into UTF-16, and how do you guarantee
their survival? This might sound like a somewhat strange requirement,
but the fact is that I have learned to love the ability of Emacs to
handle illegal ASCII files (containing zero bytes, LF and CR in the
wrong order, etc.), and therefore I very much hope that Emacs will
use UTF-8 as its internal encoding and preserve its binary file
transparency (unedited illegal UTF-8 sequences remain unmodified).
Otherwise, I would lose Emacs in the UTF-8 age as a very powerful
hacking tool for quickly solving file format problems.
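
One way to get this binary transparency with UTF-8 as the internal
representation is sketched below (the token type and names are purely
my own illustration): the buffer iterator hands back either a decoded
character or, where the bytes do not form a valid UTF-8 sequence, a
single opaque raw byte, so writing the buffer out again reproduces
the original bytes verbatim.

    #include <stddef.h>
    #include <stdint.h>

    /* A byte-transparent view of an editor buffer: each token is
       either a decoded character or one opaque raw byte. */
    typedef struct {
        int is_char;     /* 1: valid character, 0: opaque raw byte */
        uint32_t value;  /* decoded value, or the raw byte itself  */
        size_t len;      /* bytes consumed from the buffer         */
    } token;

    /* Only the 2-byte form is spelled out; the longer forms are
       checked the same way.  Anything failing the checks is passed
       through as a raw byte and so survives a load/save round trip.
       Assumes a NUL-terminated buffer, so reading s[1] is safe. */
    token next_token(const unsigned char *s)
    {
        token t = { 0, s[0], 1 };
        if (s[0] < 0x80) {
            t.is_char = 1;
        } else if ((s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
            t.is_char = 1;
            t.value = ((uint32_t)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
            t.len = 2;
        }
        return t;
    }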

> Programmer's editors are, of course, text processing applications, and while
> they need to handle UTF-8 (read and write), internally UTF-16 is going to be
> much cleaner.

Why is that, exactly? I think that writing an editor that operates
directly on UTF-8 as its internal representation is extremely
straightforward.
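
For instance, the two cursor-motion primitives an editor needs look
like this on a raw UTF-8 buffer (a sketch of mine, not code from any
actual editor): moving by one character just means stepping over the
continuation bytes.

    #include <stddef.h>

    /* Move the cursor one character forward or backward in a UTF-8
       buffer by stepping over continuation bytes (0x80..0xbf). */
    size_t next_char(const unsigned char *buf, size_t len, size_t pos)
    {
        if (pos < len)
            do pos++; while (pos < len && (buf[pos] & 0xc0) == 0x80);
        return pos;
    }

    size_t prev_char(const unsigned char *buf, size_t pos)
    {
        if (pos > 0)
            do pos--; while (pos > 0 && (buf[pos] & 0xc0) == 0x80);
        return pos;
    }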

> In short, my grinding axe says: write code for UTF-16. Where possible, store
> UTF-16.

Which UTF-16 do you mean? There are at least two mutually incompatible
ones around.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


