From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 12 2005 - 14:03:28 CDT
Avraham Shapiro asked:
> We have an XML based application that specifies UTF-8 files as input. Occasionally users will
> include numeric character entities, for example &#233; for e acute instead of the UTF-8
> equivalent of C3 A9. My question is: Is this legal UTF-8? And are numeric or symbolic character
> entities valid for Ascii-7 characters such as "<"? My guess is the first one is not legal,
> and the second one is application defined, i.e. Unicode says nothing about it. Am I
> right?
No, not quite. The Unicode Standard says nothing about the *first* one, either.
Any use of numeric or symbolic entities in a UTF-8 stream is a matter of
what the Unicode Standard calls a higher-level protocol. (Just as any use
of any such entities in an ISO 8859-1 Latin-1 text stream would be outside
the scope of the ISO 8859-1 standard.)
So, if I have a UTF-8 text stream that consists of the bytes:
<61 61 C3 A9 61 61 26 23 32 33 33 61 61>
That is the valid and legal representation (in UTF-8), of what I would
(in my Latin-1 email agent) represent as:
"aaéaa&#233aa"
and NOT the UTF-8 for:
"aaéaaéaa"
Or just to make it pedantically clear, <26 23 32 33 33> in UTF-8 is
not interpreted as one character e-acute, but as a sequence of
five characters: <ampersand, number sign, two, three, three>.
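A minimal sketch in Python 3 makes this concrete. It decodes the exact byte sequence above as UTF-8 and lists the resulting characters; the variable names are illustrative only.

```python
import unicodedata

# The 13 bytes from the example above.
data = bytes.fromhex("61 61 C3 A9 61 61 26 23 32 33 33 61 61")

# Plain UTF-8 decoding: no entity processing happens at this level.
text = data.decode("utf-8")

print(text)       # aaéaa&#233aa
print(len(text))  # 12 characters from 13 bytes (only C3 A9 is a two-byte sequence)

# List each code point with its Unicode character name.
names = [unicodedata.name(ch) for ch in text]
for ch, name in zip(text, names):
    print(f"U+{ord(ch):04X} {name}")
```

The bytes <26 23 32 33 33> come out as AMPERSAND, NUMBER SIGN, DIGIT TWO, DIGIT THREE, DIGIT THREE, and only <C3 A9> decodes to LATIN SMALL LETTER E WITH ACUTE.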
If some higher-level protocol grabs the <26 23 32 33 33> (i.e., the "&#233")
out of that UTF-8 and says, Aha! this is a numeric entity and should be
interpreted as "é", that is an issue for the users of that higher-level
protocol, and is not a matter of the definition of UTF-8 itself at all.
So your issue speaks to the use and interpretation of numeric entities in
XML -- not to the definition or interpretation of UTF-8.
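The two layers can be sketched in Python 3 using the standard library's xml.etree.ElementTree as the higher-level protocol. This is an illustration under stated assumptions, not a description of any particular application: the element name "t" is made up, and the reference is written as "&#233;" with the terminating semicolon, which XML requires for a character reference (the sample byte sequence above has none).

```python
import xml.etree.ElementTree as ET

# A tiny UTF-8 encoded XML document (hypothetical element name "t").
# It contains a literal e-acute (C3 A9) and the character reference "&#233;".
utf8_bytes = "<t>aa\u00e9aa&#233;aa</t>".encode("utf-8")

# Layer 1: UTF-8 decoding only. The reference stays as six ordinary characters.
decoded = utf8_bytes.decode("utf-8")
print(decoded)  # <t>aaéaa&#233;aa</t>

# Layer 2: the XML parser resolves the reference to U+00E9.
parsed = ET.fromstring(utf8_bytes).text
print(parsed)   # aaéaaéaa
```

Only the second step turns the reference into e-acute, and that step belongs to XML rather than to UTF-8.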
--Ken