RE: My Querry

From: Addison Phillips [wM]
Date: Tue Nov 23 2004 - 13:15:07 CST


    Hi Mike,

    You misread my sentence, I think. I did NOT say that C language strings are
    compatible with UTF-8, but rather that UTF-8 was designed with
    compatibility with C language "strings" (char*) in mind. The point of UTF-8
    was actually to be compatible with Unix file systems, of course. But one
    stimulus for the encoding was so that the Plan 9 operating system wouldn't
    have to rewrite the C libraries to deal with UTF-16 (then UCS-2). In other
    words, my statement is quite correct about the design goals of FSS-UTF,
    UTF-8's progenitor. See for example:

    If you read carefully, you'll see the desire to protect the null and \

    A NUL character (a zero byte) is considered to terminate a char* by many C
    functions. I don't see how it helps anything to confuse a new user by
    bringing up the fact that you can't put a NUL character into the middle of
    a char*. This, as you point out, applies equally to ASCII data.

    Java's TES (transfer encoding syntax) was designed to transport
    java.lang.String objects in a C char*. Java strings can contain the
    character U+0000, and Java's developers wished to allow this character in
    the middle of a java.lang.String. Hence this bit of fudge.
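
    A minimal sketch of that fudge, assuming only the U+0000 substitution is of
    interest: Java's modified UTF-8 writes U+0000 as the two bytes 0xC0 0x80,
    so no zero byte appears inside the char*. The function name escape_nul and
    its buffer convention are hypothetical, chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Copy len bytes of UTF-8 from src into dst, writing every zero byte as the
   two-byte sequence 0xC0 0x80, then NUL-terminate dst.
   Assumption: dst has room for at least 2 * len + 1 bytes.
   Returns the number of bytes written, not counting the terminator. */
size_t escape_nul(const char *src, size_t len, char *dst)
{
    size_t i, out = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < len; ++i) {
        if (src[i] == '\0') {
            dst[out++] = (char)0xC0;  /* overlong form of U+0000 */
            dst[out++] = (char)0x80;
        } else {
            dst[out++] = src[i];      /* every other byte is copied as-is */
        }
    }
    dst[out] = '\0';
    return out;
}
```

    The result is no longer valid UTF-8, since 0xC0 0x80 is an overlong
    sequence, but it survives any C function that stops at the first zero byte.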

    When talking to a newbie I purposely omitted all of these glorious but
    pointless details. The point is that UTF-8 can go into your char* just like
    any other multibyte encoding, contrary to the myth that char* and Unicode
    cannot mix.
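
    To make that concrete, here is a small sketch, assuming well-formed input:
    a UTF-8 string sits in an ordinary char*, strlen() reports its length in
    bytes, and code points can be counted by skipping continuation bytes
    (those of the form 10xxxxxx). The helper name utf8_count is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string.
   Assumption: the input is well-formed UTF-8; nothing is validated. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; every other byte starts
           a new code point (ASCII bytes included). */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

    For example, with const char *s = "na\xC3\xAFve" (the word "naive" with
    U+00EF in the middle), strlen(s) is 6 and utf8_count(s) is 5. The usual
    char* functions keep working; they simply count bytes.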


    Addison P. Phillips
    Director, Globalization Architecture

    Chair, W3C Internationalization Working Group

    Internationalization is an architecture.
    It is not a feature.

      -----Original Message-----
      From: Mike Ayers
      Sent: November 23, 2004 10:32
      Subject: RE: My Querry

    > From: Addison Phillips [wM]
    > Sent: Tuesday, November 23, 2004 9:14 AM

    > One of the nice things about UTF-8 is that the ASCII bytes
    > from 0 to 7F hex (including the C0 control characters from
    > \x00 through \x1f---including NUL) represent the ASCII
    > characters from 0 to 7F hex.


    > That is, among other things
    > UTF-8 was designed specifically to be compatible with C
    > language strings.

              Wrong! Weren't you paying attention last week? C language
    strings are not even fully compatible with ASCII. UTF-8 is fully compatible
    with ASCII, therefore C language strings are not fully compatible with
    UTF-8. The Java folks devised a TES, which was UTF-8 with one change (and
    therefore no longer UTF-8), which was "designed specifically to be
    compatible with C language strings". This method apparently upsets some

              Since the problem between C strings and ASCII/UTF-8/(your
    character set here) is solely the inability to handle zero valued character
    elements, it may be, and very often is, practical to use C strings anyway,
    as zero valued characters are uncommon at best in practice, and explicitly
    disallowed in many applications.
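
    As a rough illustration of that limitation, assuming the data arrives as a
    buffer with a separately known length: a zero byte inside the buffer is
    where C string functions stop, so a check like the one below shows whether
    a plain char* would lose data. The name has_embedded_nul is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the first len bytes of buf contain a zero byte, meaning that
   treating buf as a C string would cut the data short; otherwise return 0. */
int has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, '\0', len) != NULL;
}
```

    For the five bytes of "ab\0cd", has_embedded_nul() returns 1 and strlen()
    sees only 2 bytes; for "abcd" it returns 0 and a C string is safe to use.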


    This archive was generated by hypermail 2.1.5 : Tue Nov 23 2004 - 13:23:18 CST