Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

From: Kenneth Whistler
Date: Tue Dec 07 2004 - 17:19:12 CST

    Philippe continued:

    > As if Unicode had to be bound on
    > architectural constraints such as the requirement of representing code units
    > (which are architectural for a system) only as 16-bit or 32-bit units,

    Yes, it does. By definition. In the standard.

    > ignoring the fact that technologies do evolve and will not necessarily keep
    > this constraint. 64-bit systems already exist today, and even if they have,
    > for now, the architectural capability of handling efficiently 16-bit and
    > 32-bit code units so that they can be addressed individually, this will
    > possibly not be the case in the future.

    This is just as irrelevant as worrying about the fact that 8-bit
    character encodings may not be handled efficiently by some 32-bit
    processors.

    > When I look at the encoding forms such as UTF-16 and UTF-32, they just
    > define the value ranges in which code units will be valid, but not
    > necessarily their size.

    Philippe, you are wrong. Go reread the standard. Each of the encoding
    forms is *explicitly* defined in terms of code unit size in bits.

      "The Unicode Standard uses 8-bit code units in the UTF-8 encoding
       form, 16-bit code units in the UTF-16 encoding form, and 32-bit
       code units in the UTF-32 encoding form."

    If there is something ambiguous or unclear in wording such as that,
    I think the UTC would like to know about it.
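
As a rough illustration of the quoted sentence, the C sketch below ties each encoding form to a fixed-width code unit type and counts how many code units a code point needs in each form. The typedef and function names are invented for this example, and the functions assume `cp` is already a valid Unicode scalar value (0 to 0x10FFFF, not a surrogate).

```c
#include <stdint.h>

/* One code unit type per encoding form (typedef names are illustrative). */
typedef uint8_t  utf8_unit;   /* UTF-8 uses 8-bit code units   */
typedef uint16_t utf16_unit;  /* UTF-16 uses 16-bit code units */
typedef uint32_t utf32_unit;  /* UTF-32 uses 32-bit code units */

/* Number of 8-bit code units UTF-8 needs for code point cp.
   Assumes cp is a valid Unicode scalar value. */
int utf8_length(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 16-bit code units UTF-16 needs for code point cp. */
int utf16_length(uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2;   /* supplementary planes need a pair */
}

/* UTF-32 always uses exactly one 32-bit code unit per code point. */
int utf32_length(uint32_t cp)
{
    (void)cp;
    return 1;
}
```

The value range of a code point is the same in all three forms; what differs, and what defines each form, is the width of the unit it is stored in.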

    > You are mixing this with encoding schemes, which is
    > what is needed for interoperability, and where other factors such as bit or
    > byte ordering is also important in addition to the value range.

    I am not mixing it up -- you are, unfortunately. And it is most
    unhelpful on this list to have people waxing on, with
    apparently authoritative statements about the architecture
    of the Unicode Standard, which on examination turn out to be
    flat wrong.

    > I won't see anything wrong if a system is set so that UTF-32 code units will
    > be stored in 24-bit or even 64-bit memory cells, as long as they respect and
    > fully represent the value range defined in encoding forms,

    Correct. And I said as much. There is nothing wrong with implementing
    UTF-32 on a 64-bit processor. Putting a UTF-32 code point into
    a 64-bit register is fine. What you have to watch out for is
    handing me an array of 64-bit ints and claiming that it is a
    UTF-32 sequence of code points -- it isn't.
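
The register-versus-array distinction can be sketched in C as follows. This is only an illustration: the function names are hypothetical, and the layout check simply compares element width against 32 bits.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Fine: one UTF-32 code unit widened into a 64-bit value, e.g. a register. */
uint64_t widen_code_unit(uint32_t unit)
{
    return (uint64_t)unit;
}

/* Width in bits of one array element, given its size in bytes. */
size_t element_bits(size_t element_size)
{
    return element_size * CHAR_BIT;
}

/* A buffer can be handed over as UTF-32 only if its elements
   are exactly 32 bits wide. Returns 1 if so, 0 otherwise. */
int is_utf32_layout(size_t element_size)
{
    return element_bits(element_size) == 32;
}
```

Here `is_utf32_layout(sizeof(uint32_t))` holds while `is_utf32_layout(sizeof(uint64_t))` does not, even though every value stored in the wider array may be a perfectly valid code point.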

    > and if the system
    > also provides an interface to convert them with encoding schemes to
    > interoperable streams of 8-bit bytes.

    No, you have to have an interface which hands me the correct
    data type when I declare it uint_32, and which gives me correct
    offsets in memory if I walk an index pointer down an array.
    That applies to the encoding *form*, and is completely separate
    from provision of any streaming interface that wants to feed
    data back and forth in terms of byte streams.
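
A minimal C sketch of the separation between encoding form and encoding scheme is given below. The function names are made up for illustration, and the 0 terminator in `utf32_strlen` is a convention assumed here, not something the standard requires.

```c
#include <stddef.h>
#include <stdint.h>

/* Encoding form: walk an array of 32-bit code units by index.
   Assumes (for this sketch only) that the sequence ends with a 0 unit. */
size_t utf32_strlen(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Encoding scheme: serialize n code units as UTF-32BE bytes,
   most significant byte first. dst must have room for 4 * n bytes. */
void utf32_to_utf32be(const uint32_t *src, size_t n, unsigned char *dst)
{
    for (size_t i = 0; i < n; i++) {
        dst[4 * i + 0] = (unsigned char)((src[i] >> 24) & 0xFF);
        dst[4 * i + 1] = (unsigned char)((src[i] >> 16) & 0xFF);
        dst[4 * i + 2] = (unsigned char)((src[i] >> 8) & 0xFF);
        dst[4 * i + 3] = (unsigned char)(src[i] & 0xFF);
    }
}
```

The first function depends only on the in-memory datatype and its offsets; byte order enters the picture only in the second, when the data leaves memory as a byte stream.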

    > Are you saying that UTF-32 code units need to be able to represent any
    > 32-bit value, even if the valid range is limited, for now, to the first 17
    > planes?


    > An API on a 64-bit system that would say that it requires strings being
    > stored with UTF-32 would also define how UTF-32 code units are represented.
    > As long as the valid range 0 to 0x10FFFF can be represented, this interface
    > will be fine.

    No, it will not. Read the standard.

    An API on a 64-bit system that uses an unsigned 32-bit datatype for UTF-32
    is fine. It isn't fine if it uses an unsigned 64-bit datatype for
    UTF-32.

    > If this system is designed so that two or three code units
    > will be stored in a single 64-bit memory cell, no violation will occur in
    > the valid range.

    You can do whatever the heck crazy thing you want to do internal
    to your data manipulation, but you cannot surface a datatype
    packed that way and conformantly claim that it is UTF-32.
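
One way to picture this, as a sketch only: an implementation might pack two code units per 64-bit cell internally, then unpack them into a genuine array of 32-bit units before handing anything to a caller as UTF-32. The packing layout (first unit in the low half) and both function names are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Internal only: pack two code units into one 64-bit cell,
   first unit in the low 32 bits (an arbitrary choice for this sketch). */
uint64_t pack_pair(uint32_t first, uint32_t second)
{
    return (uint64_t)first | ((uint64_t)second << 32);
}

/* Unpack ncells cells into a real array of 32-bit code units,
   which is what may be surfaced as UTF-32.
   out must have room for 2 * ncells elements. */
void unpack_to_utf32(const uint64_t *cells, size_t ncells, uint32_t *out)
{
    for (size_t i = 0; i < ncells; i++) {
        out[2 * i]     = (uint32_t)(cells[i] & 0xFFFFFFFFu);
        out[2 * i + 1] = (uint32_t)(cells[i] >> 32);
    }
}
```

The packed cells never cross the interface; only the `uint32_t` array produced by `unpack_to_utf32` does.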

    > More interestingly, there already exist systems where memory is addressable
    > by units of 1 bit, and on these systems, ...

    [excised some vamping on the future of computers]

    > Nothing there is impossible for the future (when it will become more and
    > more difficult to increase the density of transistors, or to reduce further
    > the voltage, or to increase the working frequency, or to avoid the
    > inevitable and random presence of natural defects in substrates; escaping
    > from the historic binary-only systems may offer interesting opportunities
    > for further performance increase).

    Look, I don't care if the processors are dealing in qubits on
    molecular arrays under the covers. It is the job of the hardware
    folks to surface appropriate machine instructions that compiler
    makers can use to surface appropriate formal language constructs
    to programmers to enable hooking the defined datatypes of
    the character encoding standards into programming language
    datatypes.

    It is the job of the Unicode Consortium to define the encoding
    forms for representing Unicode code points, so that people
    manipulating Unicode digital text representation can do so
    reliably using general purpose programming languages with
    well-defined textual data constructs. I believe it has done so.

    No amount of blueskying about the future of optical or
    quantum computing actually changes that situation one bit. ;-)

