From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Dec 10 2003 - 07:25:10 EST
On 10/12/2003 02:41, jon@hackcraft.net wrote:
>Quoting Peter Kirk <peterkirk@qaya.org>:
>
>
>
>>OK, as a C function handling wchar_t arrays it is not expected to
>>conform to Unicode. But if it is presented as a function available to
>>users for handling Unicode text, for determining how many characters (as
>>defined by Unicode) are in a string, it should conform to Unicode,
>>including C9.
>>
>>
>
>If a function is presented as a function available to users for handling
>Unicode text then it should do whatever it claims to do.
>
>
That's not what the standard says. According to C7:
> C7 A process shall interpret a coded character representation
> according to the character semantics established by this standard, if
> that process does interpret that coded character representation.
> • This restriction does not preclude internal transformations that are
> never visible external to the process.
So, "If a function is presented as a function available to users for
handling Unicode text", it has to do so in accordance with the standard,
and is not free to do something else even if it openly claims to do that
something else. (I understand "users" here as separate processes;
Unicode conformance does not restrict internal functions.) And there is
a clear intention that processes ought to treat all canonically
equivalent strings identically, although there is a get-out clause
allowing non-ideal implementations not to do so.
A process is permitted to offer a function which distinguishes between
canonically equivalent forms, but, by C9, no other process is permitted
to rely on this distinction. This seems paradoxical but is actually
rather sensible. Such a distinction should only be made as an accidental
feature of a non-ideal version of a function, perhaps one which makes no
claim to support the whole of Unicode, and ideally such a function
should be replaced over time by an upgraded version which supports the
whole of Unicode and makes no distinction between canonically equivalent
forms.
>There are concepts of "code units", "code points", "characters", and "default
>grapheme clusters" in Unicode. Functions which count any of these are
>perfectly conformant with Unicode, as long as they perform their task correctly.
>
>
>
I fully agree with you on "default grapheme clusters", a concept which
is invariant under canonically equivalent transformations (that is
right, isn't it?). These need to be counted by renderers and perhaps in
other circumstances, e.g. this is probably the right thing to count when
a character count is wanted as an estimate of the length of a text.
As for counting "code units", "code points" and "characters", we need to
distinguish different levels here. Of course it is necessary to count
such things internally within an implementation of certain Unicode
functions e.g. normalisation, and when allocating memory space. At this
level we are talking about a data type consisting of bytes or words for
one of the UTFs; we are not really talking about Unicode strings.
Obviously the wcslen function as originally discussed is supposed to
work at this level, and there is no problem with that. The problem comes
when the function is reapplied as a count of the length of a Unicode
string. For one thing, it is going to give the wrong answer unless it
uses 32-bit (well, 21-bit or more) words, as it certainly shouldn't be
hacked to recognise surrogates. But the other problem is that to use
this function with Unicode strings is to confuse different data types.
I was implicitly thinking in terms of a higher level and more abstract
data type of a Unicode string. That is the level of abstraction which
should be offered to users i.e. other processes or application
programmers, by, for example, a general purpose Unicode-compatible
string handling and I/O library. Such a Unicode string data type should
be independent of encoding form; the choice between UTF-8/16/32 etc.
should be left to the compiler. C9 implies that it should also "ideally"
be independent of canonically equivalent form of the text, and this
ideal can easily (though maybe not efficiently) be attained by
automatically normalising all strings passed to and from the library.
(Indeed one might even build into the data type definition an automatic
normalisation process, used whenever a string is stored, but I will
assume that this is not done.) Within such a context, a library function
to determine whether a string is normalised is meaningless, and will
always return TRUE; and this is completely conformant to C9.
Within the functions associated with the data type, rather than as an
external process or library function, there might be a place for a
normalisation test function. On the other hand, at this level it is
redundant, as the preferred thing to do with a non-normalised string is
always to normalise it (or are there security-related cases where this
does not apply?); and so if a string is required to be normalised, even
if there is a good chance that it already is normalised, the correct
thing to do is to normalise it again (and the normalisation function,
operating at a lower level, may for efficiency first check normalisation
before applying the full procedure).
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/
This archive was generated by hypermail 2.1.5 : Wed Dec 10 2003 - 08:28:34 EST