Re: Using Unicode in XML

Date: Thu Jul 13 2000 - 12:46:00 EDT

Actually, you do NOT need to declare UCS-2/UTF-16 with an encoding
declaration: UTF-16 is, along with UTF-8, one of the two encodings that
every XML parser is supposed to recognize without being told. It is, of
course, not illegal to declare it, but it is superfluous to do so (for the
reason that you suggest).

You do need to include a Byte Order Mark character as the first pair of
bytes in the file (that would be character U+FEFF), if you encode the file
as UTF-16. The parser uses it to recognize the encoding and the byte order
before it reads anything else. Many Unicode-aware text editors will do this
for you (for example, Notepad on Windows NT does this), so this will be
essentially invisible to you.
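
To make the chicken-and-egg answer concrete, here is a rough Python sketch of the kind of sniffing a parser can do on the first few bytes before it has read the declaration. The function name sniff_xml_encoding is my own invention for illustration, and the sketch deliberately ignores UTF-32 and EBCDIC, so treat it as a simplification rather than what any particular parser actually does.

```python
import codecs

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family of an XML byte stream from its first bytes.

    Hypothetical helper: simplified, ignores UTF-32 and EBCDIC.
    """
    # A Byte Order Mark (U+FEFF) settles both encoding and byte order.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    # No BOM: the byte layout of the leading "<?" still gives it away.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # Otherwise assume UTF-8 (or an ASCII-compatible encoding that the
    # declaration itself will name).
    return "utf-8"

doc = '<?xml version="1.0"?><greeting>\u4f60\u597d</greeting>'

# Write the document as little-endian UTF-16 with a leading BOM.
encoded = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")

detected = sniff_xml_encoding(encoded)

# The generic "utf-16" codec reads the BOM and strips it while decoding.
decoded = encoded.decode("utf-16")
```

Once the encoding family is known, the parser can read the declaration in that encoding, which is why an encoding declaration in a UTF-16 file adds nothing new.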

Some XML parsers are not (alas) Unicode enabled: that is, they
can't handle a file encoded as UTF-16. Their documentation usually carries
a disclaimer somewhere about being able to handle only Latin-1. They
can still handle Unicode characters (it's a requirement), but only as
numeric character references (&#x4E2D;, for example); the text stream
itself has to be Latin-1. If you have such a beast, consider replacing it
(please).
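
If you are stuck with a Latin-1-only parser in the meantime, one workaround is to write the file as Latin-1 and let everything outside that range fall back to numeric character references. A small Python sketch, assuming the standard "xmlcharrefreplace" error handler and the bundled ElementTree parser; the sample element and text are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Sample content: U+0159 and the two CJK characters are outside Latin-1,
# while U+00E1 (a with acute) is inside it.
text = "<name>Dvo\u0159\u00e1k \u65e5\u672c</name>"

# Characters that Latin-1 cannot represent become decimal numeric
# character references; everything else is written as a single byte.
latin1_bytes = text.encode("latin-1", errors="xmlcharrefreplace")

# Unlike UTF-16, Latin-1 is not detected automatically, so here the
# encoding declaration is required.
declaration = b'<?xml version="1.0" encoding="ISO-8859-1"?>'
root = ET.fromstring(declaration + latin1_bytes)

# The parser expands the references back into the original characters.
round_tripped = root.text
```

Note that references only work in character data and attribute values, not in element or attribute names, so this is a stopgap rather than a substitute for a parser that reads UTF-16.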

I should stress that most parsers have been written responsibly and will
handle your UTF-16 files just fine.



Addison P. Phillips Principal Consultant
Inter-Locale LLC
Globalization Engineering & Consulting Services

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)

On Thu, 13 Jul 2000, Paul Deuter wrote:

> I know that XML can contain Unicode by using the declaration
> <?xml version="1.0" encoding="ISO-10646-UCS-2"?>
> But there seems to be a chicken and egg dilemma here. If
> I encode my whole XML stream as Unicode, then the parser
> will need to know that the stream is Unicode in order to be able
> to parse the declaration which tells it that it is Unicode.
> If the parser cannot figure out that the stream is Unicode, then
> it won't be able to read the declaration. But if it can recognize
> the Unicode, then the declaration would seem to be superfluous.
> How do systems handle this?
> Thanks,
> Paul
