Re: character entities in UTF-8 files

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 12 2005 - 14:03:28 CDT

Next message: Chris Jacobs: "Re: character entities in UTF-8 files"

Previous message: John Hudson: "Re: Missing capital H from Unicode range (see 1E96)"
Maybe in reply to: Avraham Shapiro: "character entities in UTF-8 files"
Next in thread: Chris Jacobs: "Re: character entities in UTF-8 files"

Avraham Shapiro asked:

> We have an XML based application that specifies UTF-8 files as input. Occasionally users will
> include numeric character entities, for example &#233; for e acute instead of the UTF-8
> equivalent of C3 A9. My question is: Is this legal UTF-8? And are numeric or symbolic character
> entities valid for Ascii-7 characters such as "<"? My guess is the first one is not legal,
> and the second one is application defined, i.e. Unicode says nothing about it. Am I
> right?

No, not quite. The Unicode Standard says nothing about the *first* one, either.

Any use of numeric or symbolic entities in a UTF-8 stream is a matter of
what the Unicode Standard calls a higher-level protocol. (Just as any use
of any such entities in an ISO 8859-1 Latin-1 text stream would be outside
the scope of the ISO 8859-1 standard.)

So, if I have a UTF-8 text stream that consists of the bytes:

<61 61 C3 A9 61 61 26 23 32 33 33 61 61>

That is the valid and legal representation (in UTF-8) of what I would
(in my Latin-1 email agent) represent as:

"aaéaa&#233aa"

and NOT the UTF-8 for:

"aaéaaéaa"

Or just to make it pedantically clear, <26 23 32 33 33> in UTF-8 is
not interpreted as one character e-acute, but as a sequence of
five characters: <ampersand, number sign, two, three, three>.
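
A small Python sketch can confirm this at the byte level. The variable
names here are illustrative and not part of the original message; the
sketch only assumes Python's built-in UTF-8 decoder. It decodes the 13
bytes shown above and checks that the result is 12 characters, with the
entity text left as five literal characters:

```python
# The byte sequence from the message, written as hex.
data = bytes.fromhex("61 61 C3 A9 61 61 26 23 32 33 33 61 61")

# Plain UTF-8 decoding: C3 A9 becomes one character (e-acute),
# while 26 23 32 33 33 stays as five separate characters.
text = data.decode("utf-8")

# The five bytes on their own, decoded the same way.
entity_chars = list(bytes.fromhex("26 23 32 33 33").decode("utf-8"))

print(len(data), len(text))   # byte count vs. character count
print(entity_chars)           # the five literal characters
```

The decoder never produces a second e-acute here; nothing in UTF-8
itself gives "&#233" any special meaning.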

If some higher-level protocol grabs the <26 23 32 33 33> (i.e., the "&#233")
out of that UTF-8 and says, "Aha! This is a numeric entity and should be
interpreted as 'é'", that is an issue for the users of that higher-level
protocol, and is not a matter of the definition of UTF-8 itself at all.

So your issue speaks to the use and interpretation of numeric entities in
XML -- not to the definition or interpretation of UTF-8.
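
To illustrate the two layers, here is a minimal sketch using Python's
standard xml.etree.ElementTree as a stand-in for "the higher-level
protocol". The element name "r" and the variable names are made up for
the example, and the reference is written with its terminating
semicolon ("&#233;"), which XML requires:

```python
import xml.etree.ElementTree as ET

# A tiny UTF-8 encoded XML document containing both a raw e-acute
# (bytes C3 A9) and a numeric character reference for the same character.
doc_bytes = "<r>aa\u00e9aa&#233;aa</r>".encode("utf-8")

# Layer 1: UTF-8 decoding alone leaves the reference as literal text.
decoded_only = doc_bytes.decode("utf-8")

# Layer 2: the XML parser resolves the reference to U+00E9.
parsed_text = ET.fromstring(doc_bytes).text

print(decoded_only)
print(parsed_text)
```

The substitution happens only in the second step, which is the XML
parser's job and not the UTF-8 decoder's.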

--Ken


This archive was generated by hypermail 2.1.5 : Tue Jul 12 2005 - 14:05:26 CDT