RE: The future of UTF-8

From: Gianni Mariani (gianni@corp.webtv.net)
Date: Thu Jul 22 1999 - 22:40:32 EDT


The issue I have with BOMs is that if I have two "plain text"
files and I do this kind of operation:

type appendfile >> oldfile

It's not guaranteed to work unless the consuming application
processes multiple BOMs, in which case utf-16 and ucs4 become
fully stateful from a consumer application's p.o.v., albeit with
only two states, since it needs to filter all incoming characters.
The above operation works with all other "plain text" files,
including utf-8, without any "stateful" transitions.
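
To make that concrete, here is a rough Python sketch of the append.
The file contents and variable names are made up for illustration,
and it assumes each utf-16 file was written with its own leading BOM:

```python
import codecs

# Two "plain text" files, each written with its own leading BOM.
old_utf16 = codecs.BOM_UTF16_LE + "old\n".encode("utf-16-le")
new_utf16 = codecs.BOM_UTF16_LE + "new\n".encode("utf-16-le")

# Byte-level append, which is all `type appendfile >> oldfile` does.
joined_utf16 = old_utf16 + new_utf16

# The "utf-16" codec consumes only the leading BOM; the second one
# survives as U+FEFF in the middle of the decoded text, so the
# consumer has to watch for it and filter it out.
decoded_utf16 = joined_utf16.decode("utf-16")

# The same append with BOM-less utf-8 needs no filtering at all.
joined_utf8 = "old\n".encode("utf-8") + "new\n".encode("utf-8")
decoded_utf8 = joined_utf8.decode("utf-8")
```

Here decoded_utf16 carries a stray U+FEFF between the two lines,
while decoded_utf8 is just the two lines back to back.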

This kind of operation is really not uncommon. Take log files.
If I have two co-operating applications on machines of different
endianness writing to the same log, one big endian and one little
endian, then each application needs to care about endianness when
it's writing utf-16 but not so with utf-8.
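
A small sketch of the log file case, again in Python. It assumes each
writer naively emits utf-16 in its own native byte order with no BOM;
the names are hypothetical:

```python
line = "ok\n"

# Each machine writes utf-16 in its own native byte order.
from_big_endian_box = line.encode("utf-16-be")
from_little_endian_box = line.encode("utf-16-le")

# Both lines end up in the same log file.
log_utf16 = from_big_endian_box + from_little_endian_box

# A reader has to pick one byte order for the whole file, so the
# other machine's line comes out byte-swapped.
read_as_big_endian = log_utf16.decode("utf-16-be")

# With utf-8 there is no byte order to agree on.
log_utf8 = line.encode("utf-8") + line.encode("utf-8")
read_as_utf8 = log_utf8.decode("utf-8")
```

The first line reads back as "ok", the second as three unrelated
characters (U+6F00, U+6B00, U+0A00), while the utf-8 log reads back
intact.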

I can probably come up with some more examples.

When utf-16 was born, there was no real reason to go with it,
because at that point you have all the problems of multibyte
encodings, and most of the programming community still likes using
8 bit chars; we still fight this inside MS with libs ported to CE.
As you can tell, these are my opinions and not necessarily those of my
employer.

Anyhow, the other issue is that many applications that process
wide chars are not utf-16 aware, while any internationalized 8 bit
application that is multibyte aware is a whole lot easier to port
to Unicode using utf-8. Where time is money, it's virtually
impossible to justify spending the sort of time that's required to
go to utf-16 when utf-8 can be just as effective.

It's also relatively easy to write a string class that has both
a utf-16 and utf-8 "view" of a string, making it virtually unnecessary
to make an either-or decision, so you get to pick the best of both
worlds.
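
Something along these lines, as a minimal sketch. DualViewString and
its methods are hypothetical names, not an existing library, and a
real class would want more than this (indexing, mutation and so on):

```python
import struct


class DualViewString:
    """Hypothetical wrapper: one string, two encoded "views"."""

    def __init__(self, text: str):
        self._text = text
        self._utf8 = None   # built on first use, then cached
        self._utf16 = None  # built on first use, then cached

    @classmethod
    def from_utf8(cls, data: bytes) -> "DualViewString":
        return cls(data.decode("utf-8"))

    def utf8(self) -> bytes:
        """utf-8 view: a sequence of 8 bit code units."""
        if self._utf8 is None:
            self._utf8 = self._text.encode("utf-8")
        return self._utf8

    def utf16(self) -> tuple:
        """utf-16 view: 16 bit code units as plain integers."""
        if self._utf16 is None:
            raw = self._text.encode("utf-16-le")
            # Unpack into integers so byte order stays out of the API.
            self._utf16 = struct.unpack("<%dH" % (len(raw) // 2), raw)
        return self._utf16


# "a", e-acute, and one character outside the BMP.
s = DualViewString.from_utf8(b"a\xc3\xa9\xf0\x9f\x98\x80")
```

Callers that want bytes ask for s.utf8(), callers that want wide
chars ask for s.utf16(), and the non-BMP character shows up as a
surrogate pair in the latter.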

So, apologies for my earlier snappy comments. They weren't intended
that way, although the MS stock price may have had something to do
with it :))

As always, highest Regards
G

-----Original Message-----
From: kenw@sybase.com [mailto:kenw@sybase.com]
Sent: Thursday, July 22, 1999 1:56 PM
To: Unicode List
Cc: unicode@unicode.org; kenw@sybase.com
Subject: RE: The future of UTF-8

Gianni,

> If you need to process BOM's (10646 signatures) it is then stateful.

How so?

The Unicode character encoding itself is not stateful.

The UTF-16 encoding form is not stateful.

The UTF-16BE and UTF-16LE UTF's (serializations) are not stateful.

UTF-16 as a UTF (serialization) is ambiguous as to the byte order
of the serialization. That ambiguity is resolved in one of several
ways:
   1. A higher order protocol. At which point, the data processing
          is not stateful.
   2. By detection of a BOM. When the BOM is detected and interpreted,
          the data processing of the textual content is not stateful.
   3. By heuristics. And while the heuristic processing itself might
          be stateful, once the outcome of the heuristic provides
          an answer for the byte order, subsequent processing is
          not stateful. And this is in effect no different than any
          heuristic applied to detect character set, whether that
          character set itself is a stateful encoding or not.
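
Options 1 and 2 above can be sketched in a few lines of Python. The
function name and the "default" parameter are hypothetical; "default"
stands in for whatever a higher order protocol would supply, and
heuristics (option 3) are not attempted here:

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Resolve the byte order once, then decode the whole buffer."""
    # Option 2: a leading BOM settles the byte order.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # Option 1: no BOM, so rely on what the caller was told
    # out of band (assumed big endian if nothing is given).
    return data.decode(default)


with_be_bom = decode_utf16(b"\xfe\xff\x00A")
with_le_bom = decode_utf16(b"\xff\xfeA\x00")
by_protocol = decode_utf16(b"A\x00", default="utf-16-le")
```

Once the byte order is known, the decode itself carries nothing
forward from one code unit to the next beyond pairing surrogates.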

The term "stateful", as applied to character encodings, usually
is referring to architectures like ISO 2022, where the state
induced by an escape sequence must be retained to interpret all
subsequent bytes, until another escape sequence changes
the state, and thus the interpretation of the next run of bytes.
That is quite different from determination of the byte polarity "state"
on a data type before processing it. If that counted as stateful, then you
could equally well claim that processing of any integral datatype
larger than a byte is "stateful" in a cross-platform environment.
But that is diluting the term "stateful" in the character encoding
context down to the point where it has nothing in common with
its intended applicability.
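
As a rough illustration of the distinction, here is a Python sketch
using ISO-2022-JP as one member of the ISO 2022 family. The variable
names are made up for the example:

```python
# The same two bytes, with and without a preceding escape sequence.
pair = b'$"'

# Initial (ASCII) state: the bytes are just a dollar sign and a quote.
in_ascii_state = pair.decode("iso2022_jp")

# After ESC $ B (switch to JIS X 0208) the same two bytes are a single
# kana. ESC ( B switches back to ASCII at the end.
in_jis_state = (b"\x1b$B" + pair + b"\x1b(B").decode("iso2022_jp")

# UTF-16BE for contrast: this code unit decodes the same way whatever
# text happens to precede it.
unit = "\u3042".encode("utf-16-be")
alone = unit.decode("utf-16-be")
after_other_text = ("xyz".encode("utf-16-be") + unit).decode("utf-16-be")[-1]
```

The interpretation of the byte pair depends on the escape sequence
seen earlier in the stream, which is the sense of "stateful" meant
above; the UTF-16BE code unit has no such dependency.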

--Ken



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT