RE: Non-ascii string processing?

From: jon@spin.ie
Date: Wed Oct 08 2003 - 06:50:16 CST


> > A W3C XML Schema Language validator needs a character based API to
> > correctly implement the minLength and maxLength facets on xsd:string
>
> As far as I understand, xsd:string is a list of "Character"-s, and
> a
> "Character" is an integer which can hold any valid Unicode code
> point.

No. First "list" in the context of XML Schema means a series of zero or more values from another datatype represented as whitespace-separated strings, where whitespace is defined according to production S from the XML spec:

S ::= (#x20 | #x9 | #xD | #xA)+

As such it's a good idea to avoid using "list" in a more general sense when dealing with XML Schema.
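To make the schema sense of "list" concrete, here is a minimal Python sketch (the function name `schema_list_items` is my own) that splits a list-typed value into its items using exactly the four whitespace characters of production S:

```python
import re

# Whitespace per XML production S: space, tab, CR, LF.
S_CHARS = "\x20\x09\x0D\x0A"

def schema_list_items(value: str) -> list[str]:
    # Split a whitespace-separated XML Schema list value into its
    # item strings, ignoring leading/trailing S whitespace.
    return [item for item in re.split(f"[{S_CHARS}]+", value) if item]

print(schema_list_items("  a\tb\nc  "))  # ['a', 'b', 'c']
```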

Secondly, while string is defined as a sequence of characters, these characters are abstract UCS characters - the things defined by Unicode and ISO 10646 - and they must also match the Char production from the XML spec:


Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

(The XML 1.1 spec removes a few of those characters; I would have removed more, but that's another issue.)

So "Character" is not an integer; it's a character - the thing that *has* a code point, rather than the code point itself. Also, some valid Unicode code points are excluded, and some are only kind of allowed (the xxFFFE and xxFFFF codes from the astral planes are allowed by the Char production; does ISO 10646 allow those characters even though Unicode leaves them undefined?).
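The Char production translates directly into a range check. A sketch (the function name is mine) that tests whether a code point is a legal XML 1.0 character:

```python
def is_xml_char(cp: int) -> bool:
    # XML 1.0 Char production:
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

print(is_xml_char(0xFFFE))   # False: excluded by Char
print(is_xml_char(0x1FFFE))  # True: an astral-plane noncharacter, yet allowed
print(is_xml_char(0xD800))   # False: surrogate block
```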

> In other terms, xsd:string is necessarily in UTF-32 (or something close to
> it): it cannot be in UTF-8 or UTF-16.

It's characters that are included; xsd:string is necessarily not in any encoding form - it's an abstract concept that can be represented by whatever means a programmer sees fit (though some will serve better than others). XML Schemata can be used with DOM, and DOM mandates the use of UTF-16 at the interface.

> The fact that, in UTF-32, the *size* of the sting in encoding units
> corresponds to the number of "characters" is coincidental.

Yes, but the coincidence is the other way around. :)
The coincidence is no coincidence at all, of course: UTF-32 is designed to have a one-to-one mapping between Unicode characters and its encoding units.
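That one-to-one property is easy to demonstrate, e.g. in Python, where each UTF-32 unit is four octets while UTF-16 needs a surrogate pair for anything outside the BMP:

```python
# A BMP character, an accented character, and one astral-plane character.
s = "a\u00E9\U00010348"

print(len(s))                           # 3 code points
print(len(s.encode("utf-32-le")) // 4)  # 3 UTF-32 units: one per code point
print(len(s.encode("utf-16-le")) // 2)  # 4 UTF-16 units: U+10348 is a surrogate pair
```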

> In any case, the useful information is always the *size* of the string in
> encoding units (octets for UTF-8, 16-bit units for UTF-16, etc.), not the
> number of "characters" it contains.

Bah! "always" is a very strong word. It's already been shown that the useful information is often the number of grapheme clusters, so this is clearly wrong whether character counts are useful or not.
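A quick illustration of why no encoding-unit count answers the grapheme-cluster question - the three unit counts below all disagree with what a reader perceives:

```python
s = "cafe\u0301"  # "café" spelled with a combining acute accent

print(len(s))                           # 5 code points
print(len(s.encode("utf-8")))           # 6 UTF-8 octets
print(len(s.encode("utf-16-le")) // 2)  # 5 UTF-16 units
# A reader sees 4 grapheme clusters: "c", "a", "f", "é".
# Counting those takes the UAX #29 segmentation rules (typically a
# third-party library); none of the unit counts above gives it.
```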

In the case of XML Schema, not only is the number of encoding units not useful, it's practically a Zen koan - there is no such thing as an encoding unit at the level of abstraction it operates at.



This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST