From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 14 2004 - 15:02:40 CST
Lars asked:
> BTW, what are the properties of U+FFFD? In English please, do not point me
> to the standard.
?!
It has the general category of "Symbol Other" [gc=So].
> Like, can it be a part of an identifier,
It does not have the ID_Start or the ID_Continue property, which
you could determine for yourself by referring to the standard:
http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
That doesn't prevent a formal syntax definition for a language
from including it within the BNF for defining an identifier,
but in general, no, it would not appear in identifiers, just as
most other symbols would not.
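
For anyone who wants to check this without reading the data file, here
is a minimal sketch in Python. It assumes the standard unicodedata
module and uses str.isidentifier() as a stand-in: that method applies
Python's own identifier rule (built on XID_Start and XID_Continue), not
the ID_Start and ID_Continue properties directly. The helper name
describe is made up for illustration.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def describe(ch: str) -> dict:
    """Collect a few properties of a single character (illustrative helper)."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        # Python's identifier rule, used here as an approximation of
        # ID_Start / ID_Continue (it is really XID_Start / XID_Continue).
        "can_start_identifier": ch.isidentifier(),
        "can_continue_identifier": ("a" + ch).isidentifier(),
    }


info = describe(REPLACEMENT)
print(info)
```

On a current Python build this should report the category So and show
the character rejected both at the start of an identifier and after the
first character.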
> is it an 'alphanumeric'?
No.
> Let me speculate. It should be a letter
No.
> (it probably more
> often originally was than wasn't).
You are referring here to speculation regarding what uninterpretable
sequence in some other character encoding was *converted* to U+FFFD
on conversion to Unicode. But that is irrelevant to the properties
of U+FFFD itself.
That is tantamount, for example, to claiming that the C0 control
code 0x1A SUBSTITUTE should be defined as a "letter", simply because
it is often used in signalling a conversion substitution in
8-bit tables.
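
The point can be illustrated with a short Python sketch. It assumes a
Latin-1 encoded string that is wrongly decoded as UTF-8 with
errors="replace"; the sample word and the variable names are made up
for illustration.

```python
import unicodedata

original = "caf\u00e9"                 # ends in U+00E9, a lowercase letter
raw = original.encode("latin-1")       # b'caf\xe9', not valid UTF-8
decoded = raw.decode("utf-8", errors="replace")

lost = decoded[-1]                     # what the decoder substituted

# The source character was a letter; the substitute is a symbol.
source_category = unicodedata.category(original[-1])
substitute_category = unicodedata.category(lost)

# The C0 control 0x1A SUBSTITUTE is a control code, not a letter either.
sub_control_category = unicodedata.category("\x1a")

print(repr(decoded), source_category, substitute_category, sub_control_category)
```

The category of what was lost (Ll here) does not carry over to the
substitute: U+FFFD stays So, and 0x1A stays Cc.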
> I would accept it for identifiers (variables, filenames).
If you are defining your own language, that would be your
prerogative, of course. But if you are using standard languages
like C, C++, Java, C#, SQL, etc., it is unlikely that you would
be correct in that approach.
> It has no case properties. And it is obviously not a
> space.
True.
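
A quick way to confirm these answers, sketched in Python under the
assumption that the built-in str predicates are an acceptable
approximation of the POSIX-style classes being asked about (they are
derived from Unicode data, not from any locale):

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

checks = {
    "alpha": REPLACEMENT.isalpha(),
    "alnum": REPLACEMENT.isalnum(),
    "space": REPLACEMENT.isspace(),
    "upper": REPLACEMENT.isupper(),
    "lower": REPLACEMENT.islower(),
    # No case mapping: upper() and lower() return the character unchanged.
    "changes_under_case_mapping": (
        REPLACEMENT.upper() != REPLACEMENT or REPLACEMENT.lower() != REPLACEMENT
    ),
}

for name, value in checks.items():
    print(f"{name}: {value}")
```

Every entry should come out False.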
There is much, much more to know about Unicode character properties
than just what can be inferred from an attempt to apply the
POSIX model to UTF-8. A good place to start would be Unicode
Technical Report #23, The Unicode Character Property Model:
http://www.unicode.org/reports/tr23/
And after that, yes, I would point you to the standard.
--Ken