Re: UCS-2/4 & BOM

From: Markus Scherer (markus.icu@gmail.com)
Date: Thu Jun 02 2005 - 16:24:23 CDT


    The IANA character sets list
    (http://www.iana.org/assignments/character-sets) says:

    <quote>
    Name: ISO-10646-UCS-2
    MIBenum: 1000
    Source: the 2-octet Basic Multilingual Plane, aka Unicode
            this needs to specify network byte order: the standard
            does not specify (it is a 16-bit integer space)
    Alias: csUnicode

    Name: ISO-10646-UCS-4
    MIBenum: 1001
    Source: the full code space. (same comment about byte order,
            these are 31-bit numbers.
    Alias: csUCS4
    </quote>

    Since neither entry fixes a byte order, I interpret this to mean that
    these are CEFs (character encoding forms), not CESs (character
    encoding schemes) or charsets. They would not be the only items in
    the charsets list that are not charsets.

    In practice, if you do see them specified, you might want to check
    whether the data begins with what looks like a BOM. In other words,
    it may be best to reinterpret them as the "UTF-16" and "UTF-32"
    charsets.

    Or reject the text with an error. It's the sender's fault for using
    these names :-)
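
    A rough sketch of both options in Python, for illustration only. The
    function name decode_ucs, the strict flag, and the fallback to
    big-endian when no BOM is present are assumptions of this sketch, not
    something the IANA registry or the Unicode Standard prescribes for
    these labels.

```python
import codecs

# Labels and aliases from the quoted registry entries, mapped to the UTF
# charset they are reinterpreted as. The mapping itself is an assumption.
REINTERPRET = {
    "iso-10646-ucs-2": "utf-16",
    "csunicode": "utf-16",
    "iso-10646-ucs-4": "utf-32",
    "csucs4": "utf-32",
}


def decode_ucs(data: bytes, charset: str, strict: bool = False) -> str:
    """Decode text labelled with one of the UCS-2/UCS-4 registry names.

    strict=True  -> reject the label outright with an error.
    strict=False -> treat the label as UTF-16 / UTF-32 and honour a BOM.
    """
    name = charset.strip().lower()
    if name not in REINTERPRET:
        raise LookupError(f"not a UCS-2/UCS-4 label: {charset!r}")
    if strict:
        raise ValueError(f"ambiguous charset label: {charset!r}")

    target = REINTERPRET[name]
    if target == "utf-32":
        boms = (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)
    else:
        boms = (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

    if data.startswith(boms):
        # The generic "utf-16" / "utf-32" codecs read the BOM, pick the
        # byte order from it, and drop it from the result.
        return data.decode(target)

    # No BOM: assume big-endian (network byte order). This is a guess made
    # by this sketch; without a BOM, Python's generic codecs would use the
    # platform's native order instead.
    return data.decode(target + "-be")
```

    With strict=False, a little-endian file that starts with FF FE and is
    marked charset=ISO-10646-UCS-2 decodes the same way it would under
    charset=UTF-16; with strict=True the label is refused before any
    bytes are examined.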

    On 6/2/05, Theo Veenker <Theo.Veenker@let.uu.nl> wrote:
    > If someone sends me a text file marked charset=ISO-10646-UCS-2
    > or charset=ISO-10646-UCS-4, should an initial BOM in this file have
    > the same meaning as a BOM in UTF-16/32?

    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless
    otherwise noted.