Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 07 2004 - 16:21:33 CST

    From: "Kenneth Whistler" <kenw@sybase.com>
    > Yes, and pigs could fly, if they had big enough wings.

    Once again, this is a creative comment. It reads as if Unicode had to be
    bound by architectural constraints, such as the requirement that code units
    (whose size is an architectural matter for a system) be represented only as
    16-bit or 32-bit units, ignoring the fact that technologies evolve and will
    not necessarily keep this constraint. 64-bit systems already exist today,
    and even if they can, for now, efficiently handle 16-bit and 32-bit code
    units so that they can be addressed individually, this will possibly not be
    the case in the future.

    When I look at encoding forms such as UTF-16 and UTF-32, they just define
    the value ranges in which code units will be valid, but not necessarily
    their size. You are mixing this up with encoding schemes, which are what is
    needed for interoperability, and where other factors such as bit or byte
    ordering are also important in addition to the value range.
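
    To make the distinction concrete, here is a minimal sketch in Python. The
    helper names utf16_code_units and serialize are only illustrative, not
    part of any standard API. The first function yields code unit values (the
    encoding form); the second fixes a byte order (the encoding scheme).

```python
def utf16_code_units(text):
    """Encoding form: a list of 16-bit code unit values, no byte order yet."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
    return units

def serialize(units, byteorder):
    """Encoding scheme: the same code units as bytes in a chosen order."""
    return b"".join(u.to_bytes(2, byteorder) for u in units)

units = utf16_code_units("A\U00010400")
big_endian = serialize(units, "big")        # UTF-16BE
little_endian = serialize(units, "little")  # UTF-16LE
```

    The list of code units is identical in both cases; only the serialized
    bytes differ.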

    I would not see anything wrong if a system were set up so that UTF-32 code
    units are stored in 24-bit or even 64-bit memory cells, as long as they
    respect and fully represent the value range defined by the encoding form,
    and if the system also provides an interface to convert them, through
    encoding schemes, to interoperable streams of 8-bit bytes.
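
    As a rough illustration, assuming a hypothetical internal layout of 3-byte
    cells (the names pack_24bit and to_utf32be are made up for this sketch),
    such a system could keep the full value range internally and still emit a
    standard UTF-32BE byte stream at its boundary:

```python
def pack_24bit(text):
    """Store each code point in a 3-byte cell (hypothetical internal layout)."""
    out = bytearray()
    for ch in text:
        out += ord(ch).to_bytes(3, "big")  # 0..0x10FFFF always fits in 24 bits
    return bytes(out)

def to_utf32be(cells):
    """Convert 24-bit cells to the interoperable UTF-32BE encoding scheme."""
    out = bytearray()
    for i in range(0, len(cells), 3):
        cp = int.from_bytes(cells[i:i + 3], "big")
        out += cp.to_bytes(4, "big")
    return bytes(out)

sample = "a\u20ac\U0010FFFF"
cells = pack_24bit(sample)    # 3 bytes per code unit internally
stream = to_utf32be(cells)    # 4 bytes per code unit on the wire
```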

    Are you saying that UTF-32 code units need to be able to represent any
    32-bit value, even if the valid range is limited, for now, to the first 17
    planes?
    An API on a 64-bit system that requires strings to be stored as UTF-32
    would also define how UTF-32 code units are represented. As long as the
    valid range 0 to 0x10FFFF can be represented, this interface will be fine.
    If this system is designed so that two or three code units are stored in a
    single 64-bit memory cell, no violation of the valid range will occur.
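
    A small sketch of the three-per-cell case, with 21 bits per code unit.
    The names pack3 and unpack3 are invented for illustration; no real API is
    implied.

```python
MASK21 = (1 << 21) - 1  # 0x1FFFFF, wide enough for 0..0x10FFFF

def pack3(a, b, c):
    """Pack three code units into one 64-bit cell, 21 bits each."""
    for cp in (a, b, c):
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("code unit outside the valid range")
    return a | (b << 21) | (c << 42)  # uses 63 of the 64 bits

def unpack3(cell):
    """Recover the three code units from a packed cell."""
    return (cell & MASK21, (cell >> 21) & MASK21, (cell >> 42) & MASK21)

cell = pack3(0x41, 0x20AC, 0x10FFFF)
units = unpack3(cell)
```

    Every value in the valid range survives the round trip, and the packed
    cell never exceeds 64 bits.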

    More interestingly, there already exist systems where memory is addressable
    in units of 1 bit, and on these systems, a UTF-32 code unit will work
    perfectly if code units are stored in steps of 21 bits of memory. On 64-bit
    systems, the possibility of addressing any group of individual bits will
    become an interesting option, notably when handling complex data structures
    such as bitfields, data compressors, bitmaps, ... No more need for costly
    shifts and masking. Nothing would prevent such a system from offering
    interoperability with 8-bit byte based systems (note also that recent memory
    technologies use fast serial interfaces instead of parallel buses, so that
    the memory granularity is less important).
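
    Bit-addressable memory can only be emulated on today's hardware, so the
    sketch below uses one large Python integer as a stand-in for it. The
    explicit shift and mask in load are exactly the work a bit-addressed
    machine would make unnecessary. All names here are assumptions made for
    the example.

```python
BITS_PER_UNIT = 21                    # enough for 0..0x10FFFF
UNIT_MASK = (1 << BITS_PER_UNIT) - 1

def store(code_points):
    """Lay code points end to end, 21 bits each, in one big integer."""
    memory = 0
    for index, cp in enumerate(code_points):
        memory |= cp << (index * BITS_PER_UNIT)
    return memory

def load(memory, index):
    """Fetch the code unit whose bit address is index * 21."""
    bit_address = index * BITS_PER_UNIT
    return (memory >> bit_address) & UNIT_MASK

text = "h\u00e9llo\U0001F600"
memory = store(ord(ch) for ch in text)
decoded = "".join(chr(load(memory, i)) for i in range(len(text)))
```

    The six code units here occupy at most 126 bits, against 192 bits when
    each is stored in a 32-bit cell.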

    The only cost of bit addressing is that it requires 3 bits of address, but
    in a 64-bit address this cost seems very low because the global addressable
    space will still be 2^61 bytes, i.e. more than 2.3*10^18 bytes, much more than
    any computer will manage in a single process for the next century (assuming,
    as a loose reading of Moore's law, that computing capabilities double every
    3 years). Even such a scheme would not limit performance, given that memory
    caches are paged and keep growing, which eliminates most of the costs and
    problems related to data alignment experienced today on bus-based systems.
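
    The address-space figure can be checked with a few lines of arithmetic.
    The variable names are only for this sketch.

```python
ADDRESS_BITS = 64
BIT_INDEX_BITS = 3  # selects one of the 8 bits inside a byte

# Bytes that remain addressable once 3 address bits pick the bit in a byte.
addressable_bytes = 2 ** (ADDRESS_BITS - BIT_INDEX_BITS)

print(addressable_bytes)  # 2**61
```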

    Other territories are also still unexplored in microprocessors, notably the
    possibility of using non-binary numeric systems (think about optical or
    magnetic systems which could outperform the current electric systems by
    avoiding the power and heat caused by currents of electrons through
    molecular substrates, replacing them with shifts of atomic states caused by
    light rays, and the computing possibilities offered by light diffraction
    through crystals). The lowest granularity of information in some future
    system may be larger than a dual-state bit, meaning that today's 8-bit
    systems would need to be emulated using other numerical systems...
    (Note for example that to store the range 0..0x10FFFF, you would need 13
    digits on a ternary system, and to store the range of 32-bit integers, you
    would need 21 ternary digits; memory technologies for such systems may use
    byte units made of 6 ternary digits, so programmers would have the choice
    between 3 "ternary bytes", i.e. 18 ternary digits, to store our 21-bit code
    units, or 4 "ternary bytes", i.e. 24 ternary digits or roughly 38 binary
    bits, to be able to store the whole 32-bit range.)
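
    These digit counts are easy to verify. The sketch below uses a
    hypothetical helper, digits_needed, that finds the smallest number of
    base-3 digits able to hold a given number of distinct values.

```python
import math

def digits_needed(count, base):
    """Smallest n such that base**n >= count (integer arithmetic only)."""
    n, capacity = 0, 1
    while capacity < count:
        capacity *= base
        n += 1
    return n

trits_for_unicode = digits_needed(0x110000, 3)  # values 0..0x10FFFF
trits_for_32bit = digits_needed(2**32, 3)       # all 32-bit integers

# Information capacity of 4 "ternary bytes" (24 trits), expressed in bits.
bits_in_24_trits = 24 * math.log2(3)
```

    Both ranges therefore fit in whole "ternary bytes" of 6 digits: 18 digits
    cover the Unicode range and 24 digits cover the 32-bit range.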

    Nothing there is impossible for the future, when it will become more and
    more difficult to increase the density of transistors, to reduce the voltage
    further, to increase the working frequency, or to avoid the inevitable and
    random presence of natural defects in substrates; escaping from the historic
    binary-only systems may offer interesting opportunities for further
    performance increases.


