RE: ASCII as a subset of Unicode (was: Re: Oxford proposes a leaner alphabet)

From: Philippe Verdy (
Date: Sun Apr 12 2009 - 16:05:34 CDT

    Mark Davis wrote:
    > One needs to distinguish the ASCII characters from the ASCII encoding
    > The ASCII characters are represented in Unicode at codepoints
    > U+0000..U+007F. The ASCII encoding scheme represents these as bytes
    > %00..%7F, as does the UTF-8 encoding scheme.
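
    The quoted claim can be checked mechanically. The following is a minimal
    sketch, assuming Python 3 and its built-in "ascii" and "utf-8" codecs; the
    function name is only illustrative.

```python
def same_bytes_for_ascii_range() -> bool:
    """Return True if U+0000..U+007F encode identically in ASCII and UTF-8."""
    for cp in range(0x80):
        ch = chr(cp)
        ascii_bytes = ch.encode("ascii")
        utf8_bytes = ch.encode("utf-8")
        # Both schemes should yield the single byte whose value is the code point.
        if ascii_bytes != utf8_bytes or utf8_bytes != bytes([cp]):
            return False
    return True


if __name__ == "__main__":
    print(same_bytes_for_ascii_range())
```

    Outside that range the two schemes diverge: UTF-8 uses several bytes per
    code point, and the ASCII codec has no representation at all.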

    Actually there's a difference between the two encoding schemes:
    - UTF-8 assumes that "bytes" can contain at least 8 significant bits, and it
    assigns a specific meaning to the 8th bit, but it does not assume anything
    about possible extra bits that may be left unused after the 8 lowest bits in
    the same addressable unit of memory (a byte is not necessarily 8 bits wide;
    think about it as if we had used the term "code unit" for "byte"; in fact
    two bytes may also not be separated by 1 increment of addressable memory,
    because bit-addressable (1-bit) memory also exists, even if, today, most
    systems have adopted alignment constraints for the blocks of successive bits
    that make up a single byte).
    - ASCII just assumes that "bytes" can contain at least 7 significant bits
    (but it specifies absolutely nothing for code units that are not in the
    range 0 to 127).
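
    To illustrate the meaning UTF-8 gives to the 8th bit, here is a small
    sketch, assuming Python 3. The helper classify() is hypothetical and
    deliberately simplified: it does not reject byte values that can never
    appear in well-formed UTF-8.

```python
def classify(byte: int) -> str:
    """Classify one 8-bit UTF-8 code unit by its high bits (simplified)."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx: 8th bit clear, same value as ASCII
    if (byte & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx: trailing unit of a longer sequence
    return "lead"               # 11xxxxxx: first unit of a longer sequence


if __name__ == "__main__":
    # U+00E9 needs two 8-bit code units in UTF-8.
    print([classify(b) for b in "\u00e9".encode("utf-8")])
    # ASCII defines nothing above 127, so the ASCII codec refuses the character.
    try:
        "\u00e9".encode("ascii")
    except UnicodeEncodeError as exc:
        print("not representable in ASCII:", exc.reason)
```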

    In both cases, nothing forbids you from using more than the minimum bit
    length, or from using the remaining bits for something else (parity bits,
    CRC, whatever...). Also, nothing forbids you from using more bytes to store
    a single 7-bit or 8-bit code unit (this is what you do when you use a Base64
    or hexadecimal representation of code units).
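
    Both ideas can be sketched in a few lines, assuming Python 3 and an 8-bit
    storage unit. The parity helpers are hypothetical names and use even parity
    purely as an example of "something else" carried in the spare bit.

```python
import base64


def with_even_parity(code: int) -> int:
    """Store a 7-bit code unit in 8 bits, using bit 7 as an even-parity bit."""
    if not 0 <= code < 0x80:
        raise ValueError("not a 7-bit code unit")
    parity = bin(code).count("1") & 1   # 1 if the count of set bits is odd
    return code | (parity << 7)


def strip_parity(byte: int) -> int:
    """Recover the 7-bit code unit by discarding the parity bit."""
    return byte & 0x7F


if __name__ == "__main__":
    print(hex(with_even_parity(ord("C"))))
    # Hexadecimal spends 2 bytes per 8-bit code unit, Base64 4 bytes per 3.
    print(b"AC".hex(), base64.b64encode(b"ABC"))
```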

    Of course, if you have extra bits that you are using like this, this is a
    loss of storage, and inefficient for most uses, if there's a fixed cost to
    retrieve or store bytes in memory or on a media device, or to transmit them
    over a network. As most storage devices and transmission media have been
    tuned to align bytes on power-of-2 multiples of bits, the minimum size
    constraint of 7 bits is not very convenient (also because, when computing
    data addresses, a division or multiplication by 7 is less efficient than
    one by a power of 2, in our most common computing environment where numeric
    computation is performed with binary operations).
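
    The address computation above can be sketched as follows, assuming Python 3,
    8-bit storage bytes, and MSB-first packing. All function names are
    hypothetical. With 8-bit code units the byte index is simply the unit
    index; with tightly packed 7-bit units every access needs a multiplication
    by 7 plus a shift and mask, and a unit may straddle two bytes.

```python
def locate_7bit(index: int) -> tuple[int, int]:
    """Return (byte index, bit offset within that byte) of packed unit `index`."""
    bit_offset = index * 7                # multiplication by 7, not a power of 2
    return bit_offset >> 3, bit_offset & 7


def pack7(codes) -> bytes:
    """Pack 7-bit code units back to back, MSB first, zero-padding the tail."""
    acc, nbits, out = 0, 0, bytearray()
    for code in codes:
        acc = (acc << 7) | (code & 0x7F)
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1           # keep only the bits not yet emitted
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def read7(packed: bytes, index: int) -> int:
    """Read one 7-bit unit; it may span two adjacent storage bytes."""
    byte_index, shift = locate_7bit(index)
    window = packed[byte_index] << 8
    if byte_index + 1 < len(packed):
        window |= packed[byte_index + 1]
    return (window >> (9 - shift)) & 0x7F


if __name__ == "__main__":
    codes = list(b"Hello, world")
    packed = pack7(codes)
    print(len(codes), "units in", len(packed), "bytes")
    print(bytes(read7(packed, i) for i in range(len(codes))))
```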

    If tomorrow it is demonstrated that some new-generation processor can work
    more efficiently using ternary logic instead of binary flip-flops, you'll
    see bytes coming back with 9 bits each.

    But it's more probable that instead of using 3-state logic we'll just use
    the next power of 2 for the numeric system, or that it will use more complex
    numeric bases such as sets of functions or probabilistic distributions
    (this is already not true everywhere, notably for fast networking
    technologies like DSL, where code units can have variable lengths, and even
    non-integer bit lengths, to maximize the throughput or the capacity of the
    storage, according to Shannon's theorem about the quantity of information
    encodable at a given signal-to-noise ratio within a well-delimited
    bandwidth, with auto-adaptation to the media storage or transmission
    capabilities and to the acceptable rate of errors that can be
    auto-corrected).
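
    As a rough illustration of why such code units need not have an integer
    bit length, here is a sketch assuming Python 3 and the Shannon-Hartley
    form of the capacity bound, C = B * log2(1 + S/N). The bandwidth and
    signal-to-noise figures are illustrative values only, not measurements
    from any real DSL line.

```python
import math


def snr_db_to_linear(snr_db: float) -> float:
    """Convert a signal-to-noise ratio from decibels to a plain power ratio."""
    return 10 ** (snr_db / 10)


def capacity_bits_per_second(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley upper bound on error-free throughput of a noisy channel."""
    return bandwidth_hz * math.log2(1 + snr_linear)


if __name__ == "__main__":
    # Illustrative channel: 4000 Hz wide at 30 dB signal-to-noise ratio.
    c = capacity_bits_per_second(4000.0, snr_db_to_linear(30.0))
    # Bits per second per hertz is generally not an integer.
    print(c, c / 4000.0)
```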

    But who will care about this? Memory and disk costs are now so low that it
    has become common not to use storage units to their full capacity. So
    instead, we are now heading toward a situation where we underuse these
    capacities: independently of the physical numeric system effectively
    implemented, all that will continue to matter is that it remains usable for
    storing code units large enough to hold bytes whose bit widths are powers
    of 2; the rest of the capacity does not necessarily need to be used, even if
    this wastes some usable space (but for now the greatest cost is in the
    telecommunications and networking media, notably for its construction and


    This archive was generated by hypermail 2.1.5 : Mon Apr 13 2009 - 10:06:29 CDT