Re: character entities in UTF-8 files

From: Doug Ewell (
Date: Tue Jul 12 2005 - 23:18:18 CDT

    Avraham Shapiro <asha at loc dot gov> wrote:

    > We have an XML based application that specifies UTF-8 files as input.
    > Occasionally users will include numeric character entities, for
    > example &#233; for e acute instead of the UTF-8 equivalent of C3 A9.
    > My question is: Is this legal UTF-8? And are numeric or symbolic
    > character entities valid for Ascii-7 characters such as "<"? My guess
    > is the first one is not legal, and the second one is application
    > defined, i.e. Unicode says nothing about it. Am I right?

    As many have said, you are looking at two different layers of encoding.
    You can certainly write &#233; in a UTF-8 text stream -- I just did
    so -- but that does not get you an e-acute, it gets you six ASCII
    characters. It is up to the next higher level to convert that into é.
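
    The two layers can be sketched in a few lines of Python. This is only
    an illustration: html.unescape stands in for "the next higher level",
    and the helper name hex_bytes is made up for the example.

```python
import html


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in data)


# The six characters of the numeric reference, exactly as typed.
reference = "&#233;"

# Layer 1: UTF-8 encoding. All six characters are ASCII, so each one
# becomes a single byte and no e-acute appears anywhere.
raw = reference.encode("utf-8")
print(len(raw), hex_bytes(raw))

# Layer 2: a markup-aware consumer resolves the reference to U+00E9.
resolved = html.unescape(reference)

# Only now does encoding to UTF-8 give the two-byte sequence C3 A9.
print(hex_bytes(resolved.encode("utf-8")))
```

    The UTF-8 layer never looks inside the reference; it only sees six
    ASCII characters.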

    HTML and XML require you to use the entities &amp; and &lt; even in a
    UTF-8-encoded file. I also use &nbsp; in my otherwise UTF-8 Web pages
    (XHTML 1.0) to make sure I don't confuse them with ordinary spaces.
    Numeric entities work the same way.
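
    As a rough check of the point about &amp; and &lt;, here is a small
    Python sketch using the standard library's XML parser. The sample
    text and the <p> element are assumptions chosen for the example, not
    anything taken from the application under discussion.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text containing the two characters that must be escaped in XML content.
text = "a < b & c"
escaped = escape(text)  # replaces & and < with &amp; and &lt;

# Build a UTF-8 document that also uses a numeric reference for e-acute.
doc = f"<p>{escaped} caf&#233;</p>".encode("utf-8")

# The parser decodes the UTF-8 bytes and then resolves the references.
parsed = ET.fromstring(doc).text
print(parsed)

# Without escaping, the same text is not well-formed XML and is rejected.
try:
    ET.fromstring("<p>a < b & c</p>")
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```

    Note that &nbsp; is not one of the entities predefined by XML itself;
    it resolves in XHTML 1.0 because the XHTML DTD declares it.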

    Doug Ewell
    Fullerton, California
