Re: Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 14 2004 - 15:02:40 CST

Maybe in reply to: Lars Kristan: "Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)"

Lars asked:

> BTW, what are the properties of U+FFFD? In English please, do not point me
> to the standard.

It has the general category of "Symbol Other" [gc=So].
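
For anyone who wants to confirm this without opening the data files, here is a minimal sketch using Python's standard unicodedata module. The variable names are only illustrative, and the result reflects whatever version of the Unicode Character Database ships with the Python build in use.

```python
import unicodedata

# U+FFFD, the character under discussion.
REPLACEMENT = "\ufffd"

# Look up the formal character name and the general category.
name = unicodedata.name(REPLACEMENT)
category = unicodedata.category(REPLACEMENT)

print(name)      # REPLACEMENT CHARACTER
print(category)  # So  (Symbol, Other)
```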

> Like, can it be a part of an identifier,

It does not have the ID_Start or the ID_Continue property, which
you could determine for yourself by referring to the standard:

http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

That doesn't prevent a formal syntax definition for a language
from including it within the BNF for defining an identifier,
but in general, no, it would not appear in identifiers, just as
most other symbols would not.
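
As one concrete illustration, Python's own identifier syntax can be probed with str.isidentifier(). Note the assumption: Python bases its rule on the closely related XID_Start and XID_Continue properties rather than on ID_Start and ID_Continue directly, so this is a sketch of how one language behaves, not a test of the properties themselves. The sample strings are made up.

```python
REPLACEMENT = "\ufffd"

# An ordinary name, and the same name with U+FFFD spliced into it.
plain = "name"
spoiled = "na" + REPLACEMENT + "me"

plain_ok = plain.isidentifier()          # True
spoiled_ok = spoiled.isidentifier()      # False: U+FFFD may not continue an identifier
alone_ok = REPLACEMENT.isidentifier()    # False: nor may it start one

print(plain_ok, spoiled_ok, alone_ok)
```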

> is it an 'alphanumeric'?

No.

> Let me speculate. It should be a letter

No.

> (it probably more
> often originally was than wasn't).

You are referring here to speculation regarding what uninterpretable
sequence in some other character encoding was *converted* to U+FFFD
on conversion to Unicode. But that is irrelevant to the properties
of U+FFFD itself.

That is tantamount, for example, to claiming that the C0 control
code 0x1A SUBSTITUTE should be defined as a "letter", simply because
it is often used in signalling a conversion substitution in
8-bit tables.
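
A small sketch of the same point, using Python's built-in UTF-8 decoder with errors="replace". The input bytes are an invented example: the Latin-1 encoding of a word ending in a lowercase letter, which is not valid UTF-8. The decoder substitutes U+FFFD, and the properties of the result are those of U+FFFD, not of whatever the byte once meant.

```python
import unicodedata

# Latin-1 bytes for "café"; the final byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9"

# The decoder signals the uninterpretable byte by substituting U+FFFD.
decoded = raw.decode("utf-8", errors="replace")
substituted = decoded[-1]

# What the byte meant in its original encoding, and its category there.
original = raw.decode("latin-1")[-1]
original_category = unicodedata.category(original)        # 'Ll', a lowercase letter

# The substitute keeps its own category regardless of what it replaced.
substituted_category = unicodedata.category(substituted)  # 'So'

# For comparison, the C0 control 0x1A SUBSTITUTE is a control code, not a letter.
sub_category = unicodedata.category("\x1a")               # 'Cc'

print(original_category, substituted_category, sub_category)
```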

> I would accept it for identifiers (variables, filenames).

If you are defining your own language, that would be your
prerogative, of course. But if you are using standard languages
like C, C++, Java, C#, SQL, etc., it is unlikely that you would
be correct in that approach.

> It has no case properties. And it is obviously not a
> space.

True.
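
The yes/no answers above can be spot-checked with Python's built-in string predicates. This is only a sketch: these predicates are Python's approximations built on the Unicode data it ships with, not the formal property definitions, and the dictionary keys are just labels.

```python
REPLACEMENT = "\ufffd"

checks = {
    "alphabetic": REPLACEMENT.isalpha(),
    "alphanumeric": REPLACEMENT.isalnum(),
    "whitespace": REPLACEMENT.isspace(),
    "uppercase": REPLACEMENT.isupper(),
    "lowercase": REPLACEMENT.islower(),
    # No case mappings: converting case leaves the character unchanged.
    "changed by upper()": REPLACEMENT.upper() != REPLACEMENT,
    "changed by lower()": REPLACEMENT.lower() != REPLACEMENT,
}

for label, answer in checks.items():
    print(f"{label}: {answer}")  # every answer is False
```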

There is much, much more to know about Unicode character properties
than just what can be inferred from an attempt to apply the
POSIX model to UTF-8. A good place to start would be Unicode
Technical Report #23, The Unicode Character Property Model:

http://www.unicode.org/reports/tr23/

And after that, yes, I would point you to the standard.

--Ken
