Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 07 2004 - 17:19:12 CST

Next message: Peter Kirk: "Word dividers, was: proposals I wrote (and also, didn't write)"

Previous message: Philippe Verdy: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Maybe in reply to: Philippe Verdy: "Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ..."
Next in thread: Lars Kristan: "RE: Nicest UTF"

Philippe continued:

> As if Unicode had to be bound on
> architectural constraints such as the requirement of representing code units
> (which are architectural for a system) only as 16-bit or 32-bit units,

Yes, it does. By definition. In the standard.

> ignoring the fact that technologies do evolve and will not necessarily keep
> this constraint. 64-bit systems already exist today, and even if they have,
> for now, the architectural capability of handling efficiently 16-bit and
> 32-bit code units so that they can be addressed individually, this will
> possibly not be the case in the future.

This is just as irrelevant as worrying about the fact that 8-bit
character encodings may not be handled efficiently by some 32-bit
processors.

> When I look at the encoding forms such as UTF-16 and UTF-32, they just
> define the value ranges in which code units will be valid, but not
> necessarily their size.

Philippe, you are wrong. Go reread the standard. Each of the encoding
forms is *explicitly* defined in terms of code unit size in bits.

  "The Unicode Standard uses 8-bit code units in the UTF-8 encoding
   form, 16-bit code units in the UTF-16 encoding form, and 32-bit
   code units in the UTF-32 encoding form."

If there is something ambiguous or unclear in wording such as that,
I think the UTC would like to know about it.
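
For illustration only (this is a sketch, not text from the standard), the
three code unit sizes in that sentence map directly onto C's fixed-width
integer types. The typedef names and the UNIT_BITS macro below are
hypothetical, invented for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical typedef names; the widths follow the quoted sentence. */
typedef uint8_t  utf8_unit;   /* UTF-8:  8-bit code units  */
typedef uint16_t utf16_unit;  /* UTF-16: 16-bit code units */
typedef uint32_t utf32_unit;  /* UTF-32: 32-bit code units */

/* Width of a code unit type in bits (hypothetical helper macro). */
#define UNIT_BITS(T) (sizeof(T) * CHAR_BIT)

/* Compile-time checks: the build fails if a type has the wrong width. */
static_assert(UNIT_BITS(utf8_unit)  == 8,  "UTF-8 code units are 8 bits");
static_assert(UNIT_BITS(utf16_unit) == 16, "UTF-16 code units are 16 bits");
static_assert(UNIT_BITS(utf32_unit) == 32, "UTF-32 code units are 32 bits");
```

On any platform that provides these exact-width types the checks hold
trivially; the point is only that the size is part of the definition of
each encoding form, not an implementation detail.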

> You are mixing this with encoding schemes, which is
> what is needed for interoperability, and where other factors such as bit or
> byte ordering are also important in addition to the value range.

I am not mixing it up -- you are, unfortunately. And it is most
unhelpful on this list to have people waxing on, with
apparently authoritative statements about the architecture
of the Unicode Standard, which on examination turn out to be
flat wrong.

> I won't see anything wrong if a system is set so that UTF-32 code units will
> be stored in 24-bit or even 64-bit memory cells, as long as they respect and
> fully represent the value range defined in encoding forms,

Correct. And I said as much. There is nothing wrong with implementing
UTF-32 on a 64-bit processor. Putting a UTF-32 code point into
a 64-bit register is fine. What you have to watch out for is
handing me an array of 64-bit ints and claiming that it is a
UTF-32 sequence of code points -- it isn't.

> and if the system
> also provides an interface to convert them with encoding schemes to
> interoperable streams of 8-bit bytes.

No, you have to have an interface which hands me the correct
data type when I declare it uint32_t, and which gives me correct
offsets in memory if I walk an index pointer down an array.
That applies to the encoding *form*, and is completely separate
from provision of any streaming interface that wants to feed
data back and forth in terms of byte streams.
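
A minimal sketch in C of the offsets point, for illustration only. The
function names (offset32, offset64, narrow_to_utf32) are hypothetical.
It shows that walking a pointer down an array of 64-bit ints lands on
different byte offsets than walking a UTF-32 array does, so the data has
to be narrowed to 32-bit code units before it can be handed over as
UTF-32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of element i in an array of 32-bit code units. */
size_t offset32(const uint32_t *s, size_t i)
{
    return (size_t)((const unsigned char *)(s + i) - (const unsigned char *)s);
}

/* Byte offset of element i in an array of 64-bit ints. */
size_t offset64(const uint64_t *s, size_t i)
{
    return (size_t)((const unsigned char *)(s + i) - (const unsigned char *)s);
}

/* Copy n values held in 64-bit cells into a real array of 32-bit code
   units. Returns 0 on success, or -1 if any value is not a Unicode
   scalar value (above 0x10FFFF, or in the surrogate range). */
int narrow_to_utf32(const uint64_t *src, size_t n, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] > 0x10FFFF || (src[i] >= 0xD800 && src[i] <= 0xDFFF))
            return -1;
        dst[i] = (uint32_t)src[i];
    }
    return 0;
}
```

With these definitions, element 2 sits at byte offset 8 in the uint32_t
array and at byte offset 16 in the uint64_t array, which is exactly the
mismatch a caller expecting UTF-32 would trip over.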

> Are you saying that UTF-32 code units need to be able to represent any
> 32-bit value, even if the valid range is limited, for now, to the first 17
> planes?

Yes.

> An API on a 64-bit system that would say that it requires strings being
> stored with UTF-32 would also define how UTF-32 code units are represented.
> As long as the valid range 0 to 0x10FFFF can be represented, this interface
> will be fine.

No, it will not. Read the standard.

An API on a 64-bit system that uses an unsigned 32-bit datatype for UTF-32
is fine. It isn't fine if it uses an unsigned 64-bit datatype for
UTF-32.

> If this system is designed so that two or three code units
> will be stored in a single 64-bit memory cell, no violation will occur in
> the valid range.

You can do whatever the heck crazy thing you want to do internal
to your data manipulation, but you cannot surface a datatype
packed that way and conformantly claim that it is UTF-32.
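
As a hedged sketch of that distinction in C (the names pack3, unpack3,
CP_BITS, and CP_MASK are hypothetical, and the 21-bit layout is just one
assumed packing, not anything the standard defines): three code points
can be squeezed into one 64-bit cell for internal use, but what gets
surfaced as UTF-32 has to be unpacked into 32-bit code units first.

```c
#include <assert.h>
#include <stdint.h>

#define CP_BITS 21u        /* 0x10FFFF fits in 21 bits */
#define CP_MASK 0x1FFFFFu  /* low 21 bits set */

/* Internal-only representation: three code points in one 64-bit cell. */
uint64_t pack3(uint32_t a, uint32_t b, uint32_t c)
{
    assert(a <= 0x10FFFF && b <= 0x10FFFF && c <= 0x10FFFF);
    return  (uint64_t)(a & CP_MASK)
         | ((uint64_t)(b & CP_MASK) << CP_BITS)
         | ((uint64_t)(c & CP_MASK) << (2 * CP_BITS));
}

/* What may be surfaced as UTF-32: one 32-bit code unit per code point. */
void unpack3(uint64_t cell, uint32_t out[3])
{
    out[0] = (uint32_t)( cell                   & CP_MASK);
    out[1] = (uint32_t)((cell >> CP_BITS)       & CP_MASK);
    out[2] = (uint32_t)((cell >> (2 * CP_BITS)) & CP_MASK);
}
```

The packed uint64_t is a private storage format; only the uint32_t array
filled in by unpack3 has the code unit size that the UTF-32 encoding form
requires.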

> More interestingly, there already exist systems where memory is addressable
> by units of 1 bit, and on these systems, ...

[excised some vamping on the future of computers]

> Nothing there is impossible for the future (when it will become more and
> more difficult to increase the density of transistors, or to reduce further
> the voltage, or to increase the working frequency, or to avoid the
> inevitable and random presence of natural defects in substrates; escaping
> from the historic binary-only systems may offer interesting opportunities
> for further performance increase).

Look, I don't care if the processors are dealing in qubits on
molecular arrays under the covers. It is the job of the hardware
folks to surface appropriate machine instructions that compiler
makers can use to surface appropriate formal language constructs
to programmers to enable hooking the defined datatypes of
the character encoding standards into programming language
datatypes.

It is the job of the Unicode Consortium to define the encoding
forms for representing Unicode code points, so that people
manipulating Unicode digital text representation can do so
reliably using general purpose programming languages with
well-defined textual data constructs. I believe it has done so.

No amount of blueskying about the future of optical or
quantum computing actually changes that situation one bit. ;-)

--Ken


This archive was generated by hypermail 2.1.5 : Tue Dec 07 2004 - 17:19:59 CST