Re: CodePage Information

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 22 2003 - 13:42:31 EDT

    From: "Doug Ewell" <dewell@adelphia.net>
    > Philippe Verdy <verdy_p at wanadoo dot fr> wrote:
    >
    > > Use the very simple UTF-8 encoding scheme (or the even more compact
    > > SCSU or BOCU-8 encodings)
    >
    > It's called BOCU-1...

    That's a typo: I meant to refer to BOCU generally, not even BOCU-1, which is a particular implementation. I noticed the obvious error after sending, but I had also asked the user to refer to the reference document. Thanks for the correction.

    > > to serialize Unicode strings to your device,
    > > and you won't suffer from the NULL byte problem you have with byte
    > > encoding schemes derived from UTF-16 (the encoding you probably use in
    > > your GUI)...
    >
    > SCSU doesn't guarantee that the resulting byte stream won't contain
    > 0x00. It isn't used as a tag in single-byte mode, so runs of scripts
    > using small alphabets won't see a NULL. But Abdij said his application
    > had to support Chinese too, and Chinese text encoded in SCSU would be in
    > "Unicode mode" and thus could contain 0x00 bytes.

    Here also I just gave the general framework of SCSU, not the particular implementation, whose default parameters can potentially produce null bytes. We have no details about what his hardware device is. Don't assume that he won't need common characters like Form Feed, SO/SI, or escape sequences for interoperability with other standards (notably with ISO 2022, if he later wishes to support it as well, to ease the interface with legacy CJK systems that lack the large conversion tables needed to support any Unicode encoding scheme).

    > Later:
    >
    > > All other Unicode codepoints will use other leading codes in
    > > (0xC4..0xBF) and one to 3 trailing bytes in (0x80..0xBF).
    >
    > The range of leading byte values goes up to 0xF4.

    Here also a typo (a previous value copied by mistake), as it contradicts another part of my message. It does not invalidate my message as a whole, though, since that passage was only an illustration, not a specification of UTF-8 usable for any implementation.
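
    For reference, the byte layout is easy to check in code. Here is a minimal UTF-8 encoder sketch (a static helper, illustration only: it does no validation of surrogates or out-of-range values) showing the correct lead byte ranges, with 0xF4 as the highest possible lead byte:

        static byte[] utf8(int cp) {
            if (cp < 0x80)        // U+0000..U+007F: single byte 0x00..0x7F
                return new byte[] { (byte) cp };
            if (cp < 0x800)       // U+0080..U+07FF: lead byte in 0xC2..0xDF
                return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                    (byte) (0x80 | (cp & 0x3F)) };
            if (cp < 0x10000)     // U+0800..U+FFFF: lead byte in 0xE0..0xEF
                return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                    (byte) (0x80 | (cp & 0x3F)) };
            // U+10000..U+10FFFF: lead byte in 0xF0..0xF4; trailing bytes
            // are always in 0x80..0xBF.
            return new byte[] { (byte) (0xF0 | (cp >> 18)),
                                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        }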

    > > Note that you don't need any leading BOM for UTF-8, as byte ordering
    > > is not relevant for this byte encoding scheme, whose ordering is well
    > > defined and fixed.
    >
    > This is only half the story. Some processes find it useful to prefix
    > UTF-8 text with a U+FEFF signature, not because of byte ordering, but so
    > the text can be easily identified as UTF-8. Recent versions of Windows
    > Notepad use the UTF-8 and UTF-16 signatures to distinguish those
    > encodings from 8-bit "ANSI" code pages.
    >
    > It should be noted that many processes do NOT want to see a leading
    > U+FEFF on UTF-8 files. In particular, Unix and Linux systems may expect
    > specific bytes at the beginning of files, and the signature would mess
    > that up. HTML and XML files also should not contain a signature. So it
    > should not be used indiscriminately.
    >
    > My point is that the use or non-use of U+FEFF in UTF-8 files has nothing
    > to do with byte order. This is a frequently misunderstood point.

    Note that I wrote "you don't need", which relates to the fact that Unicode defines BOM only as a byte order mark, not as a signature, and DOES NOT recommend using U+FEFF as a BOM within UTF-8 strings. Signature capability is another problem, and was not requested. In fact, he said that he wanted to keep ASCII compatibility, and not using any BOM preserves that wanted feature.
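
    For completeness, the UTF-8 "signature" Doug mentions is nothing more than U+FEFF encoded in UTF-8, i.e. the fixed bytes 0xEF 0xBB 0xBF. A decoder that wants to tolerate it without depending on it can simply skip those bytes on input (a minimal sketch):

        // Returns the offset at which decoding should start: 3 if the
        // optional UTF-8 signature (U+FEFF encoded as EF BB BF) is
        // present, 0 otherwise.
        static int skipUtf8Signature(byte[] data) {
            if (data.length >= 3
                    && data[0] == (byte) 0xEF
                    && data[1] == (byte) 0xBB
                    && data[2] == (byte) 0xBF)
                return 3;
            return 0;
        }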

    > > If you want to store a NUL ASCII in your serialization (so that null
    > > Unicode codepoints will be preserved by the encoding), you may use an
    > > exception, by escaping it (like Java does internally in the JNI
    > > interface).
    > >
    > > This is NOT allowed in UTF-8 but is a trivial extension, used also in
    > > the alternate CESU-8 encoding (which is an encoding "scheme" "similar"
    > > to UTF-8, except that it is derived from the UTF-16 encoding "form",
    > > instead of the UTF-32 encoding "form"): encode a NULL codepoint with
    > > the pair of bytes (0xC0; 0x80).
    >
    > This is not allowed anywhere except in internal processing, where
    > anything goes. Do not recommend this. (Fortunately, the issue seldom
    > comes up in the real world because most people don't need to store
    > U+0000 in plain text files.)

    No, this feature is needed because U+0000 is a legal Unicode character/codepoint that may be needed. (Note the "if" that I used at the beginning of the paragraph.) There are many uses for the (0xC0, 0x80) byte sequence if one wants to store NUL characters within strings that are NUL-terminated (rather than delimited by a separate length field).

    I clearly said that this sort of encoding is NOT standard, implying that it is used only as an "upper level protocol", to use the Unicode terminology. This is fully allowed in that context, because he did not clearly specify the compatibility features needed by his hardware device (as part of its specification, one can fully describe its interface as using this exception on top of UTF-8, or on top of CESU-8).
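
    As a sketch of what such an upper-level protocol could look like (the helper name is mine; the byte pattern is the one Java calls "modified UTF-8", which DataOutputStream.writeUTF also emits for U+0000):

        import java.io.ByteArrayOutputStream;
        import java.nio.charset.StandardCharsets;

        // Encodes a string as standard UTF-8, except that U+0000 is
        // written as the overlong pair 0xC0 0x80, so the output never
        // contains a 0x00 byte and can safely be NUL-terminated.
        // NOT standard UTF-8: this is the documented extension above.
        static byte[] encodeWithNulEscape(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                if (cp == 0) {              // escape U+0000 as 0xC0 0x80
                    out.write(0xC0);
                    out.write(0x80);
                } else {                    // everything else: plain UTF-8
                    byte[] b = new StringBuilder().appendCodePoint(cp)
                            .toString().getBytes(StandardCharsets.UTF_8);
                    out.write(b, 0, b.length);
                }
                i += Character.charCount(cp);
            }
            out.write(0x00);                // unambiguous terminator
            return out.toByteArray();
        }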

    > > If you don't want to use a "strict" decoder and allow fallbacks, you
    > > may decode UTF-8 and CESU-8 to codepoints with exactly the same
    > > decoder function (the difference between UTF-8 and CESU-8 only appears
    > > when encoding Unicode codepoints out of the BMP (i.e. in
    > > 0x10000..0x10FFFF): UTF-8 uses the numeric codepoints directly before
    > > encoding it, but CESU-8 uses the intermediate encoding in UTF-16 of
    > > these characters as surrogate pairs (with a leading "code unit" in
    > > 0xD800..0xDBFF called "high surrogate", and a trailing code unit in
    > > 0xDC00..0xDFFF called "low surrogate"), and then CESU-8 encodes these
    > > surrogates individually using the same algorithm as UTF-8.
    >
    > Handling of supplementary characters in UTF-8 and CESU-8 is mutually
    > exclusive. What is legal in UTF-8 is expressly illegal in CESU-8, and
    > vice versa. You cannot use the same code to serve both purposes, unless
    > you include a flag to indicate whether the encoder and decoder are in
    > "UTF-8 mode" or "CESU-8 mode."

    Why do you want to be pedantic here? He specifically said that he needed a way to encode Unicode strings for storage on his hardware device, using its own proprietary software interface. I suggested a safe derived application that could be used, and I correctly said this was a common extension and not part of the standard.

    Look into the Java JNI interface, which uses such an extension to allow encoding strings containing NUL codepoints/characters. I do agree that Sun should have avoided using the term "utf8" in its JNI string interface, but this interface predates the newer restrictions defined only in the recent Unicode 4 standard (or in a separate Technical Report published after Unicode 3.1).

    Clearly, such extensions are now widely used and conform to the older definition in Unicode, even if they do not fully comply with the newer one. This distinction was made "pedantically" in the standard only for security reasons (and those reasons are explained there), so that non-conforming past implementations can be updated to the new standard or use a renamed encoding label (if possible...).

    For compatibility reasons, Sun will certainly not rename its JNI interface only to comply with Unicode, as the past behavior was clearly and fully documented in the JNI specification (which, at the time, was compliant with Unicode).
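
    To make the UTF-8/CESU-8 difference concrete, here is a sketch of how CESU-8 handles a supplementary codepoint: it is first converted to its UTF-16 surrogate pair, and each surrogate is then encoded as an independent three-byte sequence using the ordinary BMP rule:

        // CESU-8 encoding of a supplementary codepoint (illustration
        // only): split into the UTF-16 surrogate pair, then encode each
        // surrogate with the three-byte BMP rule of UTF-8.
        static byte[] cesu8Supplementary(int cp) {  // cp in 0x10000..0x10FFFF
            int high = 0xD800 + ((cp - 0x10000) >> 10);    // high surrogate
            int low  = 0xDC00 + ((cp - 0x10000) & 0x3FF);  // low surrogate
            return new byte[] {
                (byte) (0xE0 | (high >> 12)),
                (byte) (0x80 | ((high >> 6) & 0x3F)),
                (byte) (0x80 | (high & 0x3F)),
                (byte) (0xE0 | (low >> 12)),
                (byte) (0x80 | ((low >> 6) & 0x3F)),
                (byte) (0x80 | (low & 0x3F)) };
        }

    For example, U+10000 gives ED A0 80 ED B0 80 in CESU-8, but F0 90 80 80 in standard UTF-8; that is exactly why a decoder accepting both forms must recombine surrogate pairs, as discussed above.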

    > And no, you can't get around this by "ignoring surrogates," as some
    > people still believe. Supplementary characters are full members of the
    > Unicode code space.

    I did not say that codepoints should be ignored. Reread!

    > Still later:
    >
    > > For higher compression of Unicode strings, but that does not preserve
    > > the ASCII encoding, look into SCSU and BOCU specifications in UTS.
    > > They are more complex and for now implemented in few applications.
    >
    > SCSU absolutely does "preserve the ASCII encoding," except for some
    > infrequently used control characters. Please check the facts before
    > making statements like this.

    I like your "infrequently used control characters". I can't assume that the user who asked for help doesn't need them (or the NUL character). Don't say that SCSU preserves ASCII. This is wrong, because both ASCII and Unicode define all the control codes, and SCSU does not "preserve" them.

    I did not reply about the Unicode standard itself, but about a particular application of Unicode. No application is REQUIRED to use a strict implementation, and any adaptation is possible (notably in the encoding of strings), as long as it preserves the semantics of Unicode strings at the codepoint level.


