RE: ASCII as a subset of Unicode (was: Re: Oxford proposes a leaner alphabet)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Apr 12 2009 - 16:05:34 CDT

Next message: Philippe Verdy: "RE: BBC.co.uk languages - mostly not UTF-8"

Previous message: Don Osborn: "FW: BBC.co.uk languages - mostly not UTF-8"
In reply to: Mark Davis: "Re: ASCII as a subset of Unicode (was: Re: Oxford proposes a leaner alphabet)"
Next in thread: Hans Aberg: "Bytes and octets"
Reply: Hans Aberg: "Bytes and octets"

Mark Davis wrote:
> One needs to distinguish the ASCII characters from the ASCII encoding
> scheme.
> The ASCII characters are represented in Unicode at codepoints
> U+0000..U+007F. The ASCII encoding scheme represents these as bytes
> %00..%7F, as does the UTF-8 encoding scheme.

Actually there's a difference between the two encoding schemes:
- UTF-8 assumes that "bytes" can contain at least 8 significant bits, and it
assigns a specific meaning to the 8th bit, but it assumes nothing about any
extra bits that may be left unused beyond the 8 lowest bits in the same
addressable unit of memory. (A byte is not necessarily 8 bits wide; think of
it as if we had used the term "code unit" instead of "byte". In fact, two
bytes may also not be separated by 1 increment of addressable memory,
because memory addressable in 1-bit units also exists, even if most systems
today have adopted alignment constraints for the blocks of successive bits
that make up a single byte.)
- ASCII just assumes that "bytes" can contain at least 7 significant bits,
and it specifies absolutely nothing for code units outside the range 0 to
127.
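
To make the distinction concrete, here is a minimal Python sketch. The
function names are mine, purely for illustration; the bit patterns are the
usual UTF-8 ones (0xxxxxxx for a single-unit character, 10xxxxxx for a
continuation unit, 11xxxxxx for a lead unit). It only classifies individual
code units and does not validate whole sequences.

```python
def is_ascii_code_unit(unit: int) -> bool:
    """True if the value fits in the 7 bits that ASCII assigns (0..127)."""
    return 0 <= unit <= 0x7F


def utf8_role(unit: int) -> str:
    """Classify one 8-bit UTF-8 code unit by its high-order bits."""
    if not 0 <= unit <= 0xFF:
        raise ValueError("a UTF-8 code unit has exactly 8 significant bits")
    if unit & 0x80 == 0:
        return "single"        # 0xxxxxxx: same value as the ASCII character
    if unit & 0xC0 == 0x80:
        return "continuation"  # 10xxxxxx: trailing unit of a longer sequence
    return "lead"              # 11xxxxxx: lead position (not every value is valid)


ascii_units = list("A".encode("ascii"))
utf8_units = list("A\u00e9".encode("utf-8"))   # "A" followed by e-acute
roles = [utf8_role(u) for u in utf8_units]
print(ascii_units, utf8_units, roles)
```

The first UTF-8 code unit is identical to the ASCII one; the other two use
the 8th bit, about which ASCII says nothing.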

In both cases, nothing forbids you from using more than the minimum bit
length, or from using the remaining bits for something else (parity bits,
CRC, whatever...). Also, nothing forbids you from using more bytes to store
a single 7-bit or 8-bit code unit (this is what you do when you use a Base64
or hexadecimal representation of code units).
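
Both ideas can be sketched in a few lines of Python. The helper names
(with_even_parity, strip_parity) are hypothetical, and even parity in the
8th bit is only one possible convention; hex() and base64.b64encode are the
standard library calls.

```python
import base64


def with_even_parity(unit: int) -> int:
    """Store a 7-bit code unit in 8 bits, using bit 7 as an even-parity bit."""
    if not 0 <= unit <= 0x7F:
        raise ValueError("expected a 7-bit code unit")
    parity = bin(unit).count("1") & 1   # 1 when the count of set bits is odd
    return unit | (parity << 7)


def strip_parity(byte: int) -> int:
    """Recover the 7-bit code unit by dropping the parity bit."""
    return byte & 0x7F


data = "AC".encode("ascii")                          # 0x41, 0x43
stored = bytes(with_even_parity(b) for b in data)    # 8th bit now carries parity
restored = bytes(strip_parity(b) for b in stored)

hex_form = data.hex()               # two characters per code unit
b64_form = base64.b64encode(data)   # four characters per group of three bytes
print(stored.hex(), hex_form, b64_form)
```

The code units themselves are unchanged in every case; only the number of
bits or bytes spent on each one differs.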

Of course, if you have extra bits that you use like this, it is a loss of
storage, and it is inefficient for most uses if there's a fixed cost to
retrieve or store bytes in memory or on a media device, or to transmit them
over a network. As most storage devices and transmission media have been
tuned to align bytes on regular multiples of bits, the minimum size
constraint of 7 bits is not very convenient (also because, when computing
data addresses, a division or multiplication by 7 is less efficient than
using powers of 2 in our most common computing environment, where numeric
computation is performed with binary operations).
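
A rough Python sketch of what tightly packed 7-bit code units cost in
address computation. The names pack7 and read7 and the
most-significant-bit-first layout are my own assumptions for illustration,
not any standard format.

```python
def pack7(units):
    """Pack 7-bit code units back to back, most significant bit first."""
    acc = 0       # bits not yet written out
    nbits = 0     # how many bits acc currently holds
    out = bytearray()
    for u in units:
        if not 0 <= u <= 0x7F:
            raise ValueError("expected a 7-bit code unit")
        acc = (acc << 7) | u
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)   # pad the last byte with zeros
    return bytes(out)


def read7(packed, index):
    """Read code unit number `index` from a buffer produced by pack7."""
    bit_offset = index * 7                      # multiplication by 7
    byte_index, shift = divmod(bit_offset, 8)   # unit may straddle two bytes
    window = packed[byte_index] << 8
    if byte_index + 1 < len(packed):
        window |= packed[byte_index + 1]
    return (window >> (9 - shift)) & 0x7F


units = list(b"Unicode!")      # 8 code units, all below 128
packed = pack7(units)          # 56 bits, i.e. 7 bytes instead of 8
unpacked = [read7(packed, i) for i in range(len(units))]
print(len(packed), unpacked == units)
```

With one code unit per 8-bit byte, the equivalent of read7 is a plain index
into the buffer; the saving of one byte in eight is paid for with the
multiply, the divmod and the two-byte window.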

If tomorrow it is demonstrated that some new-generation processor can work
more efficiently using ternary logic instead of binary flip-flops, you'll
see bytes coming back with 9 bits each.

But it's more probable that, instead of using 3-state logic, we'll just use
the next power of 2 for the numeric system, or that it will use more complex
numeric bases such as sets of functions or probabilistic distributions.
(Power-of-2 code units are not the rule everywhere, notably in fast
networking technologies like DSL, where code units can have variable
lengths, and even non-integer bit lengths, to maximize the throughput or the
capacity of the storage, according to Shannon's theorem about the quantity
of information encodable at a given signal-to-noise ratio within a
well-delimited bandwidth, with auto-adaptation to the storage or
transmission capabilities of the medium and to the acceptable rate of errors
that can be auto-corrected.)
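
For the Shannon limit, a small numeric sketch in Python. I am assuming that
the relevant formula is the Shannon-Hartley capacity C = B * log2(1 + S/N);
the function names and the sample figures (1 MHz, 30 dB) are arbitrary
illustrations, not measurements of any real DSL line.

```python
import math


def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley limit: C = B * log2(1 + S/N), with S/N given in dB."""
    snr = 10 ** (snr_db / 10)        # convert decibels to a linear power ratio
    return bandwidth_hz * math.log2(1 + snr)


def bits_per_hz(snr_db: float) -> float:
    """Capacity per hertz of bandwidth; generally not an integer."""
    return shannon_capacity_bps(1.0, snr_db)


capacity = shannon_capacity_bps(1_000_000, 30.0)   # illustrative values only
density = bits_per_hz(30.0)
print(round(capacity), round(density, 3))
```

At 30 dB the limit is a little under 10 bits per hertz, which is not a whole
number of bits, so a modulation that approaches it cannot carry a fixed
power-of-2 number of bits in each code unit.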

But who will care about this? Memory and disk costs are now so low that it
has become common not to use storage units to their full capacity. So we are
now heading toward a situation where we underuse these capacities:
independently of the physical numeric system effectively implemented, all
that will continue to matter is that it remains usable for storing code
units large enough to hold bytes whose bit lengths are powers of 2. The rest
of the capacity does not necessarily need to be used, even if this wastes
some usable space (for now, the greatest costs are in telecommunications and
networking media, notably for their construction and interconnection).

Philippe.

This archive was generated by hypermail 2.1.5 : Mon Apr 13 2009 - 10:06:29 CDT