Fw: 'code unit' and 'code point' meaning check

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 15 2003 - 16:10:34 EDT


    From: "Michael (michka) Kaplan" <michka@trigeminal.com>
    > From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > > Note that some UTF-* encodings are now described by Unicode.org as
    > > standards, but they are technically annexes to the standard, and not
    > > necessary to its definition.
    >
    > The definition of a UAX:
    >
    > A Unicode Standard Annex (UAX) forms an integral part of the Unicode
    > Standard, but is published as a separate document. Note that conformance to
    > a version of the Unicode Standard includes conformance to its Unicode
    > Standard Annexes. The version number of a UAX document corresponds to the
    > version number of the Unicode Standard at the last point that the UAX
    > document was updated.

    > Clearly they are a part of the standard.

    Did I say something that shocks you?

    I don't think so. I said that although it is an annex, it is a standard approved by Unicode, but technically it is separate, and an application can conform to Unicode without implementing any UTF encoding at all. It could just implement functions handling any encoding that produces Unicode code points. UTF support is clearly optional, not mandatory for an implementation of Unicode, so it seems normal that UTF-8 and UTF-16 are published as an annex.

    Some Unicode-compliant applications may also choose not to support UTF-8, or not to support UTF-16, or to support neither of the two when they prefer BOCU or CESU. A notable case is a database engine that wants to save storage space for Unicode strings: it could simply not support UTF-8 at all in its engine, leaving this task to third-party client applications or libraries. An application could even use only UCS4, without enforcing the related UTF-32 restriction to the first 17 planes of UCS4, while still fully applying the Unicode restrictions on non-characters and on surrogate code points, which are automatically and implicitly combined in pairs or rejected as invalid.
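
    To illustrate that last case, here is a minimal sketch in Python of how a UCS4-only application might apply the surrogate rule: pairs are combined implicitly, lone surrogates are rejected. The function name `normalize_ucs4` and the optional `max_code_point` parameter are hypothetical and only serve this illustration; passing `None` models an application that does not enforce the 17-plane limit.

```python
# Illustrative sketch only: names and parameters are assumptions,
# not taken from any standard API.

LAST_UTF32_CODE_POINT = 0x10FFFF  # last code point of plane 16


def normalize_ucs4(units, max_code_point=LAST_UTF32_CODE_POINT):
    """Return a list of code points built from a sequence of 32-bit values.

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point. A lone surrogate raises ValueError.
    If max_code_point is None, no upper range check is applied.
    """
    result = []
    i = 0
    while i < len(units):
        cp = units[i]
        if cp < 0:
            raise ValueError(f"negative value at index {i}")
        if max_code_point is not None and cp > max_code_point:
            raise ValueError(f"value out of range at index {i}: {cp:#x}")
        if 0xD800 <= cp <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                result.append(0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            raise ValueError(f"lone high surrogate at index {i}: {cp:#x}")
        if 0xDC00 <= cp <= 0xDFFF:
            raise ValueError(f"lone low surrogate at index {i}: {cp:#x}")
        result.append(cp)
        i += 1
    return result
```

    With this sketch, `normalize_ucs4([0x41, 0xD83D, 0xDE00])` returns `[0x41, 0x1F600]`, while `normalize_ucs4([0xD83D])` raises `ValueError`.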

    I had read the introductory paragraph you quote. Saying that an annex is part of the standard does not mandate its use; it only mandates conformance requirements for those applications that claim to use and implement that annex. If an application does not need an annex, its author has the right to choose not to implement it at all.

    Another good example is the GB2312 standard, whose support has been mandated in all new applications in P.R. China since 2000. Such an application can conform to Unicode yet choose not to support any UTF encoding at all, and instead provide a single interface using the GB2312 encoding, which conforms to Unicode.


