Re: How to distinguish UTF-8 from Latin-* ?

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Jun 19 2000 - 17:36:59 EDT


"Michael Kaplan (Trigeminal Inc.)" wrote:
> Actually, the XML spec is quite clear that neither UTF-16 nor UTF-8 require
> the encoding tag.... XML is defined by one of the following:
>
> 1) Starts with byte Mark for Big-Endian/Little-Endian Unicode -- go with the
> byte mark
>
> 2) No encoding information... UTF-8 can be assumed (often it is just ASCII
> so this works)
>
> 3) Any other encoding, use the encoding tag as Marcus mentions

You can do without the encoding declaration for UTF-8 and UTF-16, but you should include it anyway.
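
The three cases quoted above can be sketched roughly as follows. This is a minimal illustration, not the detection procedure from the XML spec itself: the function name `guess_xml_encoding` is hypothetical, the UTF-8 signature check is an addition beyond the quoted list, and UTF-32 and EBCDIC input are ignored on purpose.

```python
import re

# Case 1: byte order marks. The UTF-8 signature is an extra check that the
# quoted list does not mention. Python's "utf-16" codec reads the BOM itself
# to pick the byte order, so both UTF-16 marks map to the same codec name.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xfe\xff", "utf-16"),  # big-endian mark
    (b"\xff\xfe", "utf-16"),  # little-endian mark
)

# Case 3: the encoding pseudo-attribute of the XML declaration. Matching on
# raw bytes only works for ASCII-compatible encodings, which is an assumption
# of this sketch.
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data: bytes) -> str:
    """Return a codec name for the start of an XML document (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # Case 2: no encoding information at all, so assume UTF-8.
    return "utf-8"
```

The returned name can be passed straight to `data.decode(...)`, provided Python knows a codec by that name.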

> Clearly, we are being told that this is not a requirement of an XML
> processor. Unfortunately, most of the ones out there do not understand the
> encoding tag, cannot read UTF-16 files, and destroy UTF-8 outside of the
> ASCII range.

The IBM XML parser, which is open source and also part of Apache, does read encodings as specified and handles a number of others, too. With ICU underneath, you get more than 60 codepages.
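
To show what "reads encodings as specified" looks like in practice, here is a small sketch. It uses Python's standard-library `xml.etree.ElementTree` as a stand-in, not the IBM/Apache parser mentioned above, and the helper name `parse_text` and the sample documents are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def parse_text(data: bytes) -> str:
    """Parse raw XML bytes and return the root element's text (hypothetical helper)."""
    return ET.fromstring(data).text


# Latin-1 bytes WITH an encoding declaration: the parser honors the declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><g>caf\u00e9</g>'.encode("latin-1")
)

# UTF-8 bytes with no declaration: UTF-8 is the default.
utf8_doc = "<g>caf\u00e9</g>".encode("utf-8")

# Latin-1 bytes with NO declaration: the parser assumes UTF-8, and the lone
# 0xE9 byte is not valid UTF-8, so parsing raises ET.ParseError.
undeclared_latin1_doc = "<g>caf\u00e9</g>".encode("latin-1")
```

Both of the first two documents parse to the same text; the third is the case where leaving out the declaration actually hurts.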

markus


