Both XML and DOM are UTF-16-centric, for some pretty good implementation
reasons, which is why I pointed out the use of UTF-16 internally to the
parsers. I suggest UTF-8 as a storage medium for a variety of reasons,
mostly to do with differences in client and file-server processor
architecture (UTF-8 has no byte-order issue), database support for Unicode,
and other file-centric reasons.
The parser will decode UTF-8 (or any encoding that it supports and that you
declare) into its internal format, usually UTF-16 or UCS-4.
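
A rough sketch of that decode step, in Python purely for illustration. The
sample document and the variable names are made up, and Python's own string
type stands in for whatever internal format a given parser uses:

```python
import xml.etree.ElementTree as ET

# A small document as it might sit on disk: UTF-8 bytes plus a declaration.
# (The content is a made-up sample.)
stored = '<?xml version="1.0" encoding="UTF-8"?><doc>caf\u00e9</doc>'.encode("utf-8")

# The parser reads the declaration and decodes the bytes itself.
root = ET.fromstring(stored)
decoded_text = root.text  # a str, independent of the on-disk encoding

# The same text as a UTF-16-based parser might hold it internally:
# one 16-bit code unit per character here, so twice as many bytes as characters.
utf16_units = decoded_text.encode("utf-16-be")

utf8_size = len(decoded_text.encode("utf-8"))  # bytes needed on disk as UTF-8
utf16_size = len(utf16_units)                  # bytes needed as UTF-16
```

For mostly-ASCII markup the UTF-8 form is the smaller one, and it reads the
same on big-endian and little-endian machines.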
As for the BOM: you are right, it is U+FEFF. It was my morning for typos.
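
For the record, a small sketch of how the BOM shows up in UTF-16 files (Python
again, for illustration only; the helper name utf16_byte_order is
hypothetical). The bytes FF FE at the start of a little-endian file are
presumably where my "0xFFFE" came from; U+FFFE itself is a noncharacter,
which is exactly what makes the byte order detectable:

```python
import codecs

BOM = "\ufeff"  # the byte order mark is the code point U+FEFF

be_bytes = BOM.encode("utf-16-be")  # serialized as FE FF (big-endian)
le_bytes = BOM.encode("utf-16-le")  # serialized as FF FE (little-endian)


def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown (no BOM)"


# A little-endian file: BOM first, then the text.
sample = codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")
order = utf16_byte_order(sample)
```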
----- Original Message -----
From: Markus Scherer <email@example.com>
To: Unicode List <firstname.lastname@example.org>
Sent: Tuesday, April 11, 2000 2:03 PM
Subject: Re: Encoding designation in Java Script sites
> "Addison Phillips [GSC]" wrote:
> > what "XML is in Unicode" *means* in terms of actual disk file encoding
> > internal parsing... it turns out that most parsers use UCS-4 or UTF-16
> > their rendering engine and smart implementers use UTF-8 when storing the
> > actual XML files on disk. Yes, you have to declare the encoding for
> > Byte Order Marks--0xFFFE--are the order of the day for UTF-16 files].
> the byte order mark is U+feff.
> i believe that the xml (or dom?) specification also makes xml
> utf-16-centric: utf-8 is one of the two default encodings (utf-8 & utf-16),
> but text offsets are defined in terms of utf-16 code units, as far as i
> know. i would expect most parsers to use utf-16 internally.