Re: character entities in UTF-8 files

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Jul 12 2005 - 17:46:54 CDT

    At 10:44 AM 7/12/2005, Avraham Shapiro wrote:
    >** Low Priority **
    >
    >We have an XML-based application that specifies UTF-8 files as input.
    >Occasionally users will include numeric character entities, for example
    >&#233; for e acute instead of the UTF-8 equivalent of C3 A9. My question
    >is: Is this legal UTF-8? And are numeric or symbolic character entities
    >valid for Ascii-7 characters such as "<"? My guess is the first one is
    >not legal, and the second one is application defined, i.e. Unicode says
    >nothing about it. Am I right?

    Your message seems to imply that you are talking about XML files that are
    encoded in UTF-8, but you don't state that explicitly. Under the assumption
    that that is what you meant, it is XML that defines whether &#233; is legal
    and how it is interpreted. All the UTF-8 format can tell you is that each
    of the characters in the sequence & # 2 3 3 ; will be represented by a
    single ASCII byte in the UTF-8 file.
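
A minimal Python sketch of the point above, assuming the standard-library xml.etree.ElementTree parser stands in for the application's XML layer (the sample documents and variable names are illustrative, not from the original question). It shows that the reference is just six ASCII bytes at the UTF-8 level, and that the XML parser is what turns it into U+00E9:

```python
import xml.etree.ElementTree as ET

# At the UTF-8 level, "&#233;" is six ASCII characters, one byte each.
reference_bytes = "&#233;".encode("utf-8")

# The literal character U+00E9 (e acute) is the two bytes C3 A9.
literal_bytes = "\u00e9".encode("utf-8")

# Two hypothetical sample documents: one uses the numeric reference,
# the other uses the UTF-8 encoded character directly.
doc_with_reference = b'<?xml version="1.0" encoding="UTF-8"?><p>caf&#233;</p>'
doc_with_literal = b'<?xml version="1.0" encoding="UTF-8"?><p>caf\xc3\xa9</p>'

# The XML parser, not UTF-8 decoding, resolves the reference.
text_from_reference = ET.fromstring(doc_with_reference).text
text_from_literal = ET.fromstring(doc_with_literal).text

# References to ASCII characters such as "<" are resolved by XML as well.
less_than_text = ET.fromstring(b"<p>a &lt; b &#60; c</p>").text
```

Both documents are valid UTF-8 byte sequences and yield the same text once parsed; whether the reference is acceptable is a question for the XML layer only.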

    If your application can read plain text files (e.g. extension .txt and not
    .xml and no XML header) as well, then inside those, neither Unicode nor XML
    defines any special interpretation. XML does not, since we assume the file
    is not in XML, and Unicode does not, because an & is an & and not an escape
    character in Unicode.
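
For the plain-text case, a small Python sketch (the sample bytes are made up for illustration): decoding UTF-8 leaves the ampersand sequence untouched, and expanding it is a separate, application-defined step. Here html.unescape from the standard library is used as one possible convention, not something Unicode or UTF-8 requires:

```python
import html

# Hypothetical plain-text input: an ASCII "&#233;" sequence, a bare "&",
# and a real e grave encoded in UTF-8 as C3 A8.
raw = b"caf&#233; & cr\xc3\xa8me"

# UTF-8 decoding maps bytes to code points and nothing more;
# "&" stays U+0026 and the six characters "&#233;" survive as-is.
text = raw.decode("utf-8")

# An application may choose to expand references itself. This is an
# application-level convention layered on top of the decoded text.
expanded = html.unescape(text)
```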

    A./


