Re: 'code unit' and 'code point' meaning check

From: Doug Ewell (dewell@adelphia.net)
Date: Fri May 16 2003 - 03:11:22 EDT

    My day to pick on Philippe Verdy <verdy_p at wanadoo dot fr>:

    >> In a nutshell: Unicode is not UTF-16.
    >
    > Or in other words, Unicode defines *code points* only, not code units
    > (this is left to specific encodings used to serialize it, including
    > UTF-*, and "compressed" BOCU and CESU encodings, which can all be
    > computed algorithmically from Unicode code points).

    Unicode defines the encoding forms, and thus the code units used by
    those encoding forms. If Philippe simply means that the code units used
    to represent a given code point vary depending on the chosen encoding
    form, he is of course right.
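
    To make that concrete, here is a minimal sketch in Python. The helper
    name code_units and the choice of U+10400 as the sample code point are
    illustrative assumptions, not anything taken from the standard's text.
    It shows one code point yielding four 8-bit code units in UTF-8, two
    16-bit code units in UTF-16, and one 32-bit code unit in UTF-32.

```python
import struct


def code_units(ch: str, form: str) -> list[int]:
    """Return the code units for one code point in the given encoding form.

    Hypothetical helper for illustration only. The big-endian codecs are
    used purely as a convenient way to get at the code unit values; byte
    order is not part of an encoding form.
    """
    if form == "utf-8":
        # UTF-8 code units are 8 bits wide, so each byte is one code unit.
        return list(ch.encode("utf-8"))
    if form == "utf-16":
        # UTF-16 code units are 16 bits wide.
        data = ch.encode("utf-16-be")
        return [unit for (unit,) in struct.iter_unpack(">H", data)]
    if form == "utf-32":
        # UTF-32 code units are 32 bits wide.
        data = ch.encode("utf-32-be")
        return [unit for (unit,) in struct.iter_unpack(">I", data)]
    raise ValueError(f"unknown encoding form: {form}")


sample = "\U00010400"  # a single code point outside the BMP
for form in ("utf-8", "utf-16", "utf-32"):
    units = code_units(sample, form)
    print(form, [f"{u:X}" for u in units])
```

    The code point is the same in all three cases; only the code unit
    sequence changes with the encoding form.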

    Note that there is a bit of confusion here between encoding forms, which
    are about code units, and encoding schemes, which are about bytes. (I
    had a lot of trouble separating these two, at first.) Also, replace
    "CESU" with "SCSU" in this passage.
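
    A rough Python sketch of the form/scheme distinction follows. The
    function name serialize_utf16 is hypothetical, and U+10400 is again
    just a sample. The UTF-16 encoding form produces the same two code
    units either way; the encoding scheme (UTF-16BE or UTF-16LE) decides
    how each code unit is laid out as bytes.

```python
def serialize_utf16(units: list[int], byteorder: str) -> bytes:
    """Serialize 16-bit code units to bytes in the given byte order.

    Hypothetical helper for illustration: byteorder is "big" for the
    UTF-16BE scheme and "little" for the UTF-16LE scheme.
    """
    return b"".join(unit.to_bytes(2, byteorder) for unit in units)


# UTF-16 code units for U+10400 (encoding form: no byte order involved).
units = [0xD801, 0xDC00]

# Same code units, two different byte sequences (encoding schemes).
be_bytes = serialize_utf16(units, "big")
le_bytes = serialize_utf16(units, "little")

print(be_bytes.hex(" "))
print(le_bytes.hex(" "))
```

    Both byte sequences decode back to the same code units, and hence to
    the same code point, once the byte order is known.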

    > Note that some UTF-* encodings are now described by Unicode.org as
    > standards, but is technically an annex to the standard, and not
    > necessary to its definition.

    As Michka pointed out, Unicode Standard Annexes *are* part of the
    Unicode Standard. But this is moot, since all three UTFs are defined
    directly in the standard itself, not in UAXs (although UTF-32 used to
    be defined in one).

    This has nothing to do with whether Unicode conformance requires
    implementation of any particular UTF. (It does not.)

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


