RE: The future of UTF-8

From: Gianni Mariani (gianni@corp.webtv.net)
Date: Thu Jul 22 1999 - 16:29:35 EDT


If you need to process BOMs (10646 signatures), it is then stateful.
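
A rough sketch of the kind of state meant here, assuming Python 3 and a UTF-16 byte stream only: the byte order read from a leading signature has to be remembered for every code unit that follows. The helper name sniff_utf16_order is hypothetical.

```python
import codecs

def sniff_utf16_order(data):
    """Return (byte_order, payload) based on a leading UTF-16 BOM, if any.

    Hypothetical helper: it only considers UTF-16 signatures and does not
    try to tell them apart from UTF-32 ones.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little", data[2:]
    # No signature: the caller must learn the byte order some other way.
    return None, data

order, payload = sniff_utf16_order(b"\xff\xfeA\x00")
print(order, payload)
```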

-----Original Message-----
From: kenw@sybase.com [mailto:kenw@sybase.com]
Sent: Thursday, July 22, 1999 12:33 PM
To: Unicode List
Cc: unicode@unicode.org; kenw@sybase.com
Subject: RE: The future of UTF-8

Hmm, here it is only Thursday, but already the week's postings
are starting to feel like late on Friday.

Gianni stated:

>
> utf-8 is beautiful.

No argument there.

>
> While utf-16 is an abortion and ucs4 and utf-16 are stateful,

??? You must be using some concept of "stateful" that UTC and
the Unicode Standard editors don't.

> utf-8 is simple, easy to work with and is easy to upgrade to
> from where most systems are today.

But, as Addison pointed out, it is not the best choice for
many internal implementations that care about character boundaries
and semantics. UTF-16 is much easier to handle for this. As some
will argue, of course, UTF-32 is even simpler, since it avoids
surrogates.

>
> utf-8 is here to stay forever. get used to it !
>
> -----Original Message-----
> From: Markus Kuhn [mailto:Markus.Kuhn@cl.cam.ac.uk]
> Sent: Thursday, July 22, 1999 3:29 AM
> To: Unicode List
> Subject: Re: The future of UTF-8
>
>
> "Addison Phillips" wrote on 1999-07-21 22:03 UTC:
> > UTF-8 is a kludge.
>
> Why is UTF-16 better? It just moves the kludge threshold up by 0xff80
> but otherwise makes nothing fundamentally simpler.
>
> > UTF-8 is merely a detour (albeit a very useful one).
>
> I am not sure. What you refer to as a "Unicode plain text file" is in
> essence UTF-16.

A "Unicode plain text file" is Unicode plain text (a sequence of
Unicode-encoded characters) expressed in a particular encoding form
(UTF-8, UTF-16), serialized in a particular UTF (UTF-8, UTF-16,
UTF-16BE, UTF-16LE), and instantiated in a particular file format.

There is nothing about Unicode plain text that requires
it to be expressed in a particular encoding form or be serialized
in a particular UTF, with or without BOM.
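
As a small illustration, assuming Python 3 and its built-in codecs, the same two-character text can be serialized several ways. The sample string is arbitrary; note that Python's "utf-16" codec writes a BOM followed by the platform's native byte order, so only its first two bytes are inspected here.

```python
import codecs

text = "A\u00e9"  # two encoded characters: U+0041, U+00E9

def hexbytes(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

utf8 = text.encode("utf-8")
utf16be = text.encode("utf-16-be")   # no BOM, big-endian
utf16le = text.encode("utf-16-le")   # no BOM, little-endian
utf16_bom = text.encode("utf-16")    # BOM, then native byte order

for label, data in [("UTF-8", utf8), ("UTF-16BE", utf16be),
                    ("UTF-16LE", utf16le), ("UTF-16 + BOM", utf16_bom)]:
    print(f"{label:13} {hexbytes(data)}")

# Every serialization decodes back to the same plain text.
assert utf16_bom.decode("utf-16") == text
```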

> UTF-16 also does not have fixed-width characters. Even
> UCS-4 doesn't, considering that there are things such as combining
> characters.

The editors are going to great pains to try to provide further
clarification of these matters in the Unicode Standard, Version 3.0
(forthcoming).

The Unicode character encoding, per se, is not variable-width. It
associates abstract characters with Unicode scalar values. One character
is associated with each scalar value.

UTF-8 is a variable-width encoding *form* of Unicode. The scalar
value is expressed as 1 to 4 bytes.

UTF-16 is a variable-width encoding *form* of Unicode. The scalar
value is expressed as 1 or 2 wydes. (2 wydes for surrogates; 1 for the
rest).

UTF-32 (not fully approved yet, but in a Draft Unicode Technical
Report) is a fixed-width encoding *form* of Unicode. The scalar
value is expressed as 1 32-bit integer.
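
The three encoding forms can be compared with a short sketch, assuming Python 3's built-in codecs. The helper name code_units and the sample scalar values are chosen only for illustration.

```python
def code_units(ch):
    """Return (UTF-8 bytes, UTF-16 16-bit units, UTF-32 32-bit units)
    needed to express one Unicode scalar value."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")) // 2,
        len(ch.encode("utf-32-be")) // 4,
    )

for scalar in (0x0041, 0x00E9, 0x20AC, 0xF0000):
    print(f"U+{scalar:04X}", code_units(chr(scalar)))
```

Only the last value, which lies beyond U+FFFF, needs two 16-bit units in UTF-16; all four need exactly one unit in UTF-32.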

The relationship between those encoding forms and whether particular
combining character sequences are significant as "characters" is
orthogonal. The fact that in some sense the combining character
sequence U+0074 U+0313 (ejective-t) is considered a "grapheme" in some
orthographies has the same ontological status, from the point of
view of the encoding, as the fact that U+0074 U+0068 (th as a digraph)
might be considered a "grapheme" of an orthography. U+0074 U+0313
is a "grapheme (= character in a loose sense)" represented as a sequence
of two encoded characters (in the strict sense used by the standard).
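
A minimal check of this point, assuming Python 3 and its standard unicodedata module: both sequences are two encoded characters, whatever an orthography treats as one "grapheme".

```python
import unicodedata

ejective_t = "\u0074\u0313"  # t + COMBINING COMMA ABOVE
digraph_th = "\u0074\u0068"  # t + h

for label, seq in [("ejective-t", ejective_t), ("th digraph", digraph_th)]:
    names = [unicodedata.name(ch) for ch in seq]
    print(label, len(seq), names)

# The combining mark has a non-zero combining class; the letter h does not.
print(unicodedata.combining("\u0313"), unicodedata.combining("\u0068"))
```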

That situation is quite distinct from a private use character for
ejective-t, encoded at U-000F0000. Once a private use code is
used as an *encoding* of the character, we would have: U+DB80 U+DC00
to represent the character. Unlike U+0074 U+0313, U+DB80 U+DC00
is then a "grapheme" represented as a *single* encoded character,
expressed by a sequence of two surrogate *codes* in the UTF-16
encoding form.
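
The surrogate arithmetic can be sketched as follows, assuming Python 3. The function name to_surrogates is hypothetical; the result is cross-checked against Python's own UTF-16 encoder.

```python
def to_surrogates(scalar):
    """Split a scalar value above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value does not need surrogates")
    offset = scalar - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

for scalar in (0x10000, 0xF0000):
    high, low = to_surrogates(scalar)
    print(f"U-{scalar:08X} -> U+{high:04X} U+{low:04X}")

# Cross-check against the built-in encoder.
encoded = chr(0xF0000).encode("utf-16-be")
assert encoded == bytes([0xDB, 0x80, 0xDC, 0x00])
```

U+D800 U+DC00 is the pair for U-00010000, the first scalar value beyond U+FFFF; U-000F0000 comes out as U+DB80 U+DC00.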

Yes, I know it is complicated. But it is important not to just
bandy the terminology around too loosely.

>
> Unicode is inherently a variable length encoding of characters, and
> assuming that UTF-8 is the only variable-length aspect of it might be a
> bit naive.

The Unicode Standard is not an "inherently ... variable length encoding
of characters." However, UTF-8 and UTF-16 as encoding *forms* are
both inherently variable length. This may seem like a quibble, but
the fact that people want to implement UTF-32 as well makes this
distinction important.

>
> I see UTF-8 not only as a temporary encoding for ASCII legacy systems,
> but something that is here to stay for a very long time.
>

Certainement.

--Ken
 
> Markus
>


