Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 07 2004 - 16:21:33 CST

Next message: Theodore H. Smith: "If only MS Word was coded this well (was Re: Nicest UTF)"

Previous message: Asmus Freytag: "RE: No Invisible Character - NBSP at the start of a word"
In reply to: Kenneth Whistler: "Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ..."
Next in thread: Rick McGowan: "Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ..."

From: "Kenneth Whistler" <kenw@sybase.com>
> Yes, and pigs could fly, if they had big enough wings.

Once again, this is a creative comment. As if Unicode had to be bound by
architectural constraints such as the requirement of representing code units
(which are architectural for a system) only as 16-bit or 32-bit units,
ignoring the fact that technologies do evolve and will not necessarily keep
this constraint. 64-bit systems already exist today, and even if they are,
for now, architecturally capable of efficiently handling 16-bit and 32-bit
code units so that they can be addressed individually, this will possibly
not be the case in the future.

When I look at the encoding forms such as UTF-16 and UTF-32, they just
define the value ranges in which code units will be valid, but not
necessarily their size. You are mixing this with encoding schemes, which are
what is needed for interoperability, and where other factors such as bit or
byte ordering are also important in addition to the value range.

I see nothing wrong if a system is set up so that UTF-32 code units are
stored in 24-bit or even 64-bit memory cells, as long as they respect and
fully represent the value range defined in the encoding forms, and if the
system also provides an interface to convert them with encoding schemes to
interoperable streams of 8-bit bytes.

Are you saying that UTF-32 code units need to be able to represent any
32-bit value, even if the valid range is limited, for now, to the first 17
planes?

An API on a 64-bit system that says it requires strings to be stored as
UTF-32 would also define how UTF-32 code units are represented. As long as
the valid range 0 to 0x10FFFF can be represented, this interface will be
fine. If this system is designed so that two or three code units are stored
in a single 64-bit memory cell, no violation of the valid range will occur.
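
To make the arithmetic concrete: three 21-bit code units need 63 bits, so
they fit in one 64-bit cell. Here is a minimal Python sketch of such a
packing. The names pack3 and unpack3 and the low-bits-first layout are my
own assumptions for illustration, not anything defined by Unicode or by an
existing API.

```python
MAX_CODE_UNIT = 0x10FFFF           # top of the valid range discussed above
BITS_PER_UNIT = 21                 # 0x10FFFF needs 21 binary digits
UNIT_MASK = (1 << BITS_PER_UNIT) - 1


def pack3(units):
    """Pack up to three code units into one 64-bit cell (hypothetical layout).

    Unit i occupies bits 21*i .. 21*i+20, so three units use 63 bits.
    """
    if len(units) > 3:
        raise ValueError("at most three code units fit in a 64-bit cell")
    cell = 0
    for i, unit in enumerate(units):
        if not 0 <= unit <= MAX_CODE_UNIT:
            raise ValueError(f"code unit out of range: {unit:#x}")
        cell |= unit << (i * BITS_PER_UNIT)
    return cell


def unpack3(cell, count=3):
    """Recover `count` code units from a cell produced by pack3."""
    return [(cell >> (i * BITS_PER_UNIT)) & UNIT_MASK for i in range(count)]


units = [0x41, 0x20AC, 0x10FFFF]
cell = pack3(units)
print(hex(cell), unpack3(cell) == units)
```

The value range is preserved exactly; only the storage size of a code unit
differs from the usual 32 bits.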

More interestingly, there already exist systems where memory is addressable
in units of 1 bit, and on these systems, a UTF-32 code unit will work
perfectly if code units are stored in steps of 21 bits of memory. On 64-bit
systems, the possibility of individually addressing any group of bits will
become an interesting option, notably when handling complex data structures
such as bitfields, data compressors, bitmaps, ... No more need to use costly
shifts and masking. Nothing would prevent such a system from offering
interoperability with 8-bit byte based systems (note also that recent memory
technologies use fast serial interfaces instead of parallel buses, so that
the memory granularity is less important).
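
As an illustration of the 21-bit step, the following Python sketch emulates
it on an ordinary byte-based machine, where the shifts and masks are still
needed. The function names store and load and the little-endian bit order
are assumptions of this sketch, not a defined encoding scheme.

```python
BITS = 21                      # storage step per code unit, in bits
MASK = (1 << BITS) - 1


def store(code_units):
    """Lay code units out back to back, 21 bits each, and return bytes."""
    memory = 0
    for i, unit in enumerate(code_units):
        if not 0 <= unit <= 0x10FFFF:
            raise ValueError(f"code unit out of range: {unit:#x}")
        memory |= unit << (i * BITS)
    nbytes = (len(code_units) * BITS + 7) // 8   # round up to whole bytes
    return memory.to_bytes(nbytes, "little")


def load(buf, index):
    """Read the code unit that starts at bit offset 21 * index."""
    memory = int.from_bytes(buf, "little")
    return (memory >> (index * BITS)) & MASK


text = "A\u20ac\U0001F600"
buf = store([ord(c) for c in text])
print(len(buf), [hex(load(buf, i)) for i in range(len(text))])
```

Three code units take 8 bytes here instead of the 12 bytes of a 32-bit
layout; on a bit-addressable machine the offset 21 * index would simply be
the address.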

The only cost of bit-addressing is that it requires 3 more bits of address,
but in a 64-bit address, this cost seems very low because the global
addressable space will still be 2^61 bytes, i.e. more than 2.3*10^18 bytes,
much more than any computer will manage in a single process for the next
century (according to Moore's law, taken here as doubling computing
capabilities every 3 years). Even such a scheme would not limit performance,
given that memory caches are paged and are always growing, which eliminates
most of the costs and problems related to data alignment experienced today
on bus-based systems.
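
The figure above can be checked with a few lines of Python. This is only a
sketch of the arithmetic; the function name split_bit_address is
hypothetical, and it assumes the low 3 bits of an address select the bit
within an 8-bit byte.

```python
ADDRESS_BITS = 64      # width of a bit address
BIT_SELECT_BITS = 3    # 2**3 = 8 bit positions inside one byte


def split_bit_address(bit_address):
    """Split a bit address into (byte address, bit index within the byte)."""
    return bit_address >> BIT_SELECT_BITS, bit_address & ((1 << BIT_SELECT_BITS) - 1)


# Bytes that remain addressable once 3 address bits are spent on bit selection.
addressable_bytes = 1 << (ADDRESS_BITS - BIT_SELECT_BITS)

print(addressable_bytes)            # 2**61
print(split_bit_address(21))        # where the second 21-bit code unit starts
```

So the second code unit of a 21-bit packed string starts at byte 2, bit 5,
and the remaining byte address space is 2^61, about 2.3*10^18.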

Other territories are also still unexplored in microprocessors, notably the
possibility of using non-binary numeric systems (think about optical or
magnetic systems which could outperform the current electric systems due to
reduced power and heat caused by currents of electrons through molecular
substrates, replacing them by shifts of atomic states caused by light rays,
and the computing possibilities offered by light diffraction through
crystals). The lowest granularity of information may, at some point in the
future, be larger than a dual-state bit, meaning that today's 8-bit systems
would need to be emulated using other numerical systems...
(Note for example that to store the range 0..0x10FFFF, you would need 13
digits on a ternary system, and to store the range of 32-bit integers, you
would need 21 ternary digits; memory technologies for such systems may use
byte units made of 6 ternary digits, so programmers would have the choice
between 3 "ternary bytes", i.e. 18 ternary digits, to store our 21-bit code
units, or 4 "ternary bytes", i.e. 24 ternary digits or slightly more than 38
binary bits, to be able to store the whole 32-bit range.)
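
These digit counts are easy to verify. The Python sketch below recomputes
them with integer arithmetic only; digits_needed and units_needed are names
I made up for this note, and the 6-digit "ternary byte" is the hypothetical
unit from the paragraph above.

```python
import math


def digits_needed(max_value, base):
    """Smallest number of base-`base` digits that can hold 0..max_value."""
    digits, capacity = 1, base
    while capacity <= max_value:
        capacity *= base
        digits += 1
    return digits


def units_needed(digits, digits_per_unit=6):
    """Whole storage units (e.g. 6-digit ternary bytes) covering `digits`."""
    return -(-digits // digits_per_unit)   # ceiling division


code_unit_trits = digits_needed(0x10FFFF, 3)      # Unicode code unit range
int32_trits = digits_needed(2**32 - 1, 3)         # full 32-bit range

print(code_unit_trits, units_needed(code_unit_trits))
print(int32_trits, units_needed(int32_trits))
print(math.log2(3**24))    # binary capacity of 24 ternary digits, in bits
```

The last line gives the equivalent capacity in binary bits of 4 ternary
bytes, which lies between 38 and 39.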

None of this is impossible for the future (when it will become more and
more difficult to increase the density of transistors, or to reduce the
voltage further, or to increase the working frequency, or to avoid the
inevitable and random presence of natural defects in substrates; escaping
from the historic binary-only systems may offer interesting opportunities
for further performance increases).

This archive was generated by hypermail 2.1.5 : Tue Dec 07 2004 - 16:22:45 CST