Re: CodePage Information

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 23 2003 - 21:46:19 EDT

    Philippe Verdy continued:

    > From: "Kenneth Whistler" <kenw@sybase.com>
    > > And I think this is a *terrible* idea, which will be roundly
    > > rejected. Let me state it one last time: it is bad advice to
    > > recommend that people use <0xC0 0x80> to represent U+0000
    > > (as any kind of extension to UTF-8).
    >
    > Say that to Sun;

    People have said it to Sun. ;-)

    > I think that it will not break its backward compatibility for JNI
    > or break its support for NULL characters within Strings just
    > because UTF-8 forbids it. All that Sun will do is update its
    > documentation to say that its interface is not UTF-8 but an
    > extension of it.

    What Philippe is referring to can be found at:

    http://java.sun.com/j2se/1.3/docs/guide/jni/spec/types.doc.html#16542

    "UTF-8 Strings"

    That documentation *already* recognizes that the "UTF-8 Strings" for
    JNI are not conformant UTF-8, to wit:

    "There are two differences between this format and the 'standard'
    UTF-8 format. First, the null byte (byte)0 is encoded using
    the two-byte format rather than the one-byte format. This means
    that Java VM UTF-8 strings never have embedded nulls. Second,
    only the one-byte, two-byte, and three-byte formats are used. The
    Java VM does not recognize the longer UTF-8 formats."

    This has been a long-known fact about the internal Java VM "UTF-8
    String" format. And Sun and the Java developers know that this
    internal format, also used for the JNI interface, should not be
    exchanged openly labelled as "UTF-8".

    The java.io InputStreamReader and OutputStreamWriter are for
    public interchange, and they *do* use conformant UTF-8:

    http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc
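
    For instance (again my own sketch), writing a string with an
    embedded NULL through an OutputStreamWriter configured for UTF-8
    yields the conformant one-byte form of U+0000:

        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.io.OutputStreamWriter;
        import java.io.Writer;

        public class StandardUtf8Demo {
            public static void main(String[] args) throws IOException {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                Writer w = new OutputStreamWriter(buf, "UTF-8");
                w.write("A\u0000B");
                w.close();
                byte[] bytes = buf.toByteArray();
                for (int i = 0; i < bytes.length; i++) {
                    System.out.printf("0x%02X ", bytes[i] & 0xFF);
                }
                // Prints: 0x41 0x00 0x42
                // Conformant UTF-8 encodes U+0000 as the byte 0x00.
            }
        }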

    > The "string terminator" semantic of byte 0x00 is a standardized
    > convention widely used, independantly of Unicode which does
    > not specify this semantic,

    Correct.

    > and admittedly considers U+0000 as a plain abstract character,
    > with a clear Control Character semantic, which does not prohibit
    > its use in the significant part of the string or in the middle of it.

    Also correct. But other than use of "U+0000" to refer to NULL, that
    is also correct for US ASCII and ISO 8859-1 (and all the other ISO
    8859 parts). For them, too, 0x00 is a plain abstract character,
    a control code (whose semantics are defined by ISO 6429 or other
    standards for control functions); they also do not prohibit its
    use in the middle of a string. They, quite wisely, have nothing to
    say in the matter.

    You seem to keep missing the point that the behavior of NULL,
    represented as 0x00 in UTF-8, represented as 0x00 in US ASCII,
    represented as 0x00 in ISO 8859-1, and represented as 0x00 in
    GB 2312-1980, MacRoman, Code Page 1252, ... is exactly the same
    in strings for any of those encodings. The issues for embedding
    NULLs in strings when those strings are used with C runtime
    libraries or other C API's that use NULL-terminated string
    conventions are exactly the same. This has nothing to do with
    some Unicode-specific differences in how NULL is interpreted
    or handled in C environments.
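
    A small sketch of that point (the charset names are the standard
    Java ones; the strlen here imitates what any C runtime does with
    a NUL-terminated buffer):

        public class EmbeddedNullDemo {
            // A strlen-style scan, as a C runtime performs on a
            // NUL-terminated buffer.
            static int strlen(byte[] b) {
                int n = 0;
                while (n < b.length && b[n] != 0) n++;
                return n;
            }

            public static void main(String[] args) throws Exception {
                String s = "abc\u0000def";
                String[] charsets = { "US-ASCII", "ISO-8859-1", "UTF-8" };
                for (int i = 0; i < charsets.length; i++) {
                    // In each encoding U+0000 is the single byte 0x00,
                    // so the C-style scan stops at the same place.
                    byte[] b = s.getBytes(charsets[i]);
                    System.out.println(charsets[i] + ": strlen = "
                                       + strlen(b)); // 3 in every case
                }
            }
        }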

    > In fact I have found several applications that now use the
    > forbidden sequences of UTF-8 as a way to insert "escaped"
    > markup within a UTF-8 string.

    Then you have found non-conforming applications, if they claim
    that they are using UTF-8 with such conventions.

    > There are other escaping conventions in use, notably with XML,
    > which makes special use of the "<>" characters, quotes for
    > attributes, and ampersands.

    What has that to do with the price of onions? I've never said
    that escape mechanisms for the quoting of characters are bad.
    It is obvious that many formal language syntaxes and markup
    systems make use of them, and for good reasons.

    > Unicode cannot forbid or even recommend not using escaping
    > mechanisms on top of any of its UTF encoding schemes, simply
    > because there's no other way to build actual applications
    > without such additional mechanisms.

    The Unicode Technical Committee cannot forbid people from doing
    silly things or prevent people from making mistakes in their
    string handling.

    It can (and does) declare what conformant UTF-8 means. And people
    who notice implementations that do things with UTF-8 which do
    not follow the specification are within their rights to declare
    such implementations to be nonconformant to the Unicode Standard
    (and to ISO/IEC 10646).

    And I dispute your claim that "there's no other way to build actual
    applications without such additional mechanisms," if what you
    are talking about is specifically UTF-8. Lots of people have
    done so, including me, and many of us use C runtime libraries
    and NULL-terminated strings in doing so.

    >
    > You worry about the term "trivial extension". Consider
    > it clearly: any trivial extension is an extension and thus
    > not the standard. The term "trivial" just designates the ease
    > with which it can be encoded as an exception to the standard,
    > without breaking the encoding of all other characters.

    Trivial extensions are often the most damaging, because the
    differences tend not to be obvious to most implementers, who
    get hit after the fact with subtle problems and interoperability
    concerns that they were unaware of up front. Non-shortest UTF-8
    and CESU-8 both fall in that category, since people can go along
    assuming they are UTF-8 for a long time, and then suddenly get
    whacked with a problem they didn't anticipate.
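
    A minimal sketch (my example) of the kind of check this calls
    for: the bytes 0xC0 and 0xC1 can never occur anywhere in
    well-formed UTF-8, because any two-byte sequence they begin is
    a non-shortest form, including the <0xC0 0x80> encoding of
    U+0000:

        public class OverlongCheck {
            // 0xC0 and 0xC1 never occur in well-formed UTF-8: any
            // two-byte sequence they begin is a non-shortest
            // ("overlong") form.
            static boolean hasOverlongTwoByteForm(byte[] bytes) {
                for (int i = 0; i < bytes.length; i++) {
                    int v = bytes[i] & 0xFF;
                    if (v == 0xC0 || v == 0xC1) return true;
                }
                return false;
            }

            public static void main(String[] args) {
                byte[] javaStyle = { 0x41, (byte) 0xC0, (byte) 0x80 };
                System.out.println(hasOverlongTwoByteForm(javaStyle));
                // Prints: true; a conformant consumer rejects this.
            }
        }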

    >
    > I took enough precautions to explain it (notably by using "if",
    > "you can", "you could", and "extension") so that this cannot
    > be confused with the UTF-8 standard... I also did not want to
    > explain fully the details of the UTF-8 algorithm, pointing the
    > user to the standard document for all details needed for its
    > implementation. That's enough for me and should be enough for
    > everybody. The question was not really about Unicode but about
    > a concrete application of it, due to constraints. This makes a
    > clear difference.

    The problem that occasioned this thread was the result of
    someone trying to push byte-serialized UTF-16 at a device
    API that choked on embedded null bytes. The generic answer for
    such a problem is to use UTF-8 instead.
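
    A minimal sketch of the contrast (my example; any text free of
    U+0000 behaves the same way):

        public class NullByteContrast {
            public static void main(String[] args) throws Exception {
                String s = "Hi";
                String[] charsets = { "UTF-16BE", "UTF-8" };
                for (int i = 0; i < charsets.length; i++) {
                    StringBuilder line = new StringBuilder(charsets[i] + ":");
                    byte[] b = s.getBytes(charsets[i]);
                    for (int j = 0; j < b.length; j++) {
                        line.append(String.format(" 0x%02X", b[j] & 0xFF));
                    }
                    System.out.println(line);
                }
                // UTF-16BE: 0x00 0x48 0x00 0x69  (embedded null bytes)
                // UTF-8:    0x48 0x69            (no 0x00 byte unless
                //                                 the text has U+0000)
            }
        }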

    All the subsequent analysis, the suggestions to use <0xC0, 0x80>
    for NULL in UTF-8, and the wandering on about higher-level
    protocols and whether the UTC can or cannot prevent people from
    using them were basically irrelevant to the problem.

    --Ken


