Re: character entities in UTF-8 files

From: Kenneth Whistler
Date: Tue Jul 12 2005 - 14:03:28 CDT

    Avraham Shapiro asked:

    > We have an XML based application that specifies UTF-8 files as input. Occasionally users will
    > include numeric character entities, for example &#233; for e acute instead of the UTF-8
    > equivalent of C3 A9. My question is: Is this legal UTF-8? And are numeric or symbolic character
    > entities valid for Ascii-7 characters such as "<"? My guess is the first one is not legal,
    > and the second one is application defined, i.e. Unicode says nothing about it. Am I
    > right?

    No, not quite. The Unicode Standard says nothing about the *first* one, either.

    Any use of numeric or symbolic entities in a UTF-8 stream is a matter of
    what the Unicode Standard calls a higher-level protocol. (Just as any use
    of any such entities in an ISO 8859-1 Latin-1 text stream would be outside
    the scope of the ISO 8859-1 standard.)

    So, if I have a UTF-8 text stream that consists of the bytes:

     <61 61 C3 A9 61 61 26 23 32 33 33 61 61>

    That is the valid and legal representation (in UTF-8) of what I would
    (in my Latin-1 email agent) represent as:

     aaéaa&#233aa

    and NOT the UTF-8 for:

     aaéaaéaa

    Or just to make it pedantically clear, <26 23 32 33 33> in UTF-8 is
    not interpreted as one character e-acute, but as a sequence of
    five characters: <ampersand, number sign, two, three, three>.
    If some higher-level protocol grabs the <26 23 32 33 33> (i.e., the "&#233")
    out of that UTF-8 and says, "Aha! This is a numeric entity and should be
    interpreted as é", that is an issue for the users of that higher-level
    protocol, and is not a matter of the definition of UTF-8 itself at all.
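
The distinction can be checked mechanically. The following is only a sketch: it decodes the byte sequence above as UTF-8 and then, as a separate step, applies Python's `html.unescape` as a stand-in for "some higher-level protocol". That choice of function is an assumption made for illustration; it follows HTML rules (which tolerate the missing semicolon), not the rules of any particular XML application.

```python
import html

# The 13 bytes from the message, written as hex.
raw = bytes.fromhex("61 61 C3 A9 61 61 26 23 32 33 33 61 61")

# Step 1: UTF-8 decoding only. C3 A9 becomes one character (U+00E9);
# 26 23 32 33 33 stays as the five characters '&', '#', '2', '3', '3'.
text = raw.decode("utf-8")
tail = raw[6:11].decode("utf-8")

# Step 2: a higher-level protocol (here HTML-style unescaping, chosen
# purely as an illustration) may reinterpret those five characters.
expanded = html.unescape(text)

print(len(raw), len(text), len(expanded))
print([f"U+{ord(c):04X}" for c in tail])
```

Decoding alone never produces a second e-acute; only the extra unescaping step does, and that step is defined outside UTF-8.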

    So your issue speaks to the use and interpretation of numeric entities in
    XML -- not to the definition or interpretation of UTF-8.
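
As a rough illustration of where the interpretation does happen, the sketch below feeds UTF-8 bytes to an XML parser. Python's `xml.etree.ElementTree` is used only as an example of an XML processor, and the element name `t` is hypothetical; the original application's parser is not known. Note that XML, unlike the bare byte sequence discussed above, requires the terminating semicolon on a character reference.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes containing e-acute twice: once as the bytes C3 A9,
# once as the numeric character reference &#233; (plain ASCII bytes).
doc = b"<t>\xc3\xa9 &#233;</t>"
root = ET.fromstring(doc)

# The XML layer, not the UTF-8 layer, expands the reference.
same = root.text == "\u00e9 \u00e9"

# References to ASCII characters work the same way: numeric (&#60;)
# and predefined symbolic (&lt;) forms both yield '<'.
ascii_text = ET.fromstring(b"<t>&#60; &lt;</t>").text

print(same, ascii_text)
```

Both spellings reach the application as the same character because XML defines that expansion; the input bytes are valid UTF-8 in either case.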

