From: Doug Ewell (dewell@adelphia.net)
Date: Tue Jul 12 2005 - 23:18:18 CDT
Avraham Shapiro <asha at loc dot gov> wrote:
> We have an XML based application that specifies UTF-8 files as input.
> Occasionally users will include numeric character entities, for
> example &#233; for e acute instead of the UTF-8 equivalent of C3 A9.
> My question is: Is this legal UTF-8? And are numeric or symbolic
> character entities valid for Ascii-7 characters such as "&lt;"? My guess
> is the first one is not legal, and the second one is application
> defined, i.e. Unicode says nothing about it. Am I right?
As many have said, you are looking at two different layers of encoding.
You can certainly write &#233; in a UTF-8 text stream -- I just did
so -- but that does not get you an e-acute, it gets you six ASCII
characters. It is up to the next higher level to convert that into é.
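To make the two layers concrete, here is a minimal sketch in Python (not
part of the original exchange). It assumes the reference is written in
decimal form as &#233; and uses the standard library's html.unescape as a
stand-in for "the next higher level"; a real XML application would rely
on its own parser for that step.

```python
import html

text = "&#233;"                 # what the user typed: a character reference
raw = text.encode("utf-8")      # the UTF-8 layer sees only these bytes

# Perfectly legal UTF-8: six bytes, every one of them in the ASCII range.
print(len(raw), raw.hex(" "))   # 6 26 23 32 33 33 3b

# The higher layer (markup processing) turns the reference into U+00E9.
decoded = html.unescape(raw.decode("utf-8"))
print(decoded.encode("utf-8").hex(" "))   # c3 a9
```

The UTF-8 decoder never complains about the first form; only the markup
layer knows that those six characters stand for a single e-acute.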
HTML and XML require you to use the entities &amp; and &lt; even in a
UTF-8-encoded file. I also use &nbsp; in my otherwise UTF-8 Web pages
(XHTML 1.0) to make sure I don't confuse them with ordinary spaces.
Numeric entities are just the same.
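As a rough illustration (again in Python, not something from the original
thread), the sketch below feeds a small UTF-8 document to the standard
xml.etree.ElementTree parser. The sample document is made up. Note one
assumption worth stating: &nbsp; is declared by the XHTML DTD rather than
being predefined in XML itself, so a bare XML parser with no DTD is
expected to reject it, while the numeric form &#160; needs no declaration.

```python
import xml.etree.ElementTree as ET

# A UTF-8 document that mixes predefined entities and numeric references.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<p>a &lt; b &amp; c&#160;d &#233;</p>"
).encode("utf-8")

root = ET.fromstring(doc)
print(repr(root.text))          # 'a < b & c\xa0d é'

# Without a DTD that declares it, the named entity is undefined in XML.
try:
    ET.fromstring(b"<p>a&nbsp;b</p>")
    nbsp_ok = True
except ET.ParseError:
    nbsp_ok = False
print(nbsp_ok)                  # False
```

In both cases the file on disk is plain, valid UTF-8; whether a reference
is accepted is decided entirely by the XML layer.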
-- Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/