RE: How to distinguish UTF-8 from Latin-* ?

From: Vinod Balakrishnan
Date: Mon Jun 19 2000 - 18:11:27 EDT

>> if it is xml, then have a look at the xml spec (with the errata list!!).
>> it is very clearly specified how to figure that all out there.
>> <?xml version="1.0" encoding="utf-8"?>...
>Actually, the XML spec is quite clear that neither UTF-16 nor UTF-8 require
>the encoding tag.... XML is defined by one of the following:
>1) Starts with byte Mark for Big-Endian/Little-Endian Unicode -- go with the
>byte mark
>2) No encoding information... UTF-8 can be assumed (often it is just ASCII
>so this works)

This was my concern: there is no way to distinguish UTF-8 from Latin-1
here when the data contains upper-ASCII characters (bytes 0x80-0xFF).

In the case of XML, though, we can assume UTF-8, since Unicode is the
preferred encoding for XML.
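
As a rough illustration of rules 1) and 2) quoted above, a minimal
sketch in Python might look like the following. The function name
sniff_xml_encoding is made up for this example, and the sketch assumes
only UTF-8 and UTF-16 need to be recognized (it ignores UTF-32 and any
encoding declaration).

```python
import codecs

def sniff_xml_encoding(data: bytes) -> str:
    """Guess a Python codec name for an XML entity from its first bytes.

    Hypothetical helper: it only looks for a byte order mark and
    otherwise falls back to UTF-8, as rule 2) suggests.
    """
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the BOM while decoding.
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        # Python's "utf-16" codec reads the BOM to pick the byte order.
        return "utf-16"
    # No byte order mark: assume UTF-8 (plain ASCII is a subset of it).
    return "utf-8"
```

For example, data.decode(sniff_xml_encoding(data)) would then decode
the entity under these assumptions.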

In the case of HTML (which supports both Latin and Unicode encodings),
however, there are cases where an HTTP server can receive encoded URLs
in UTF-16, UTF-8, or Latin-* without any header carrying the encoding
information. In that case, if the server doesn't track the encoding
through some extra mechanism, there is no way to differentiate UTF-8
from Latin-1 for upper-ASCII characters. Again, this can happen only
with HTML, because it supports both Latin and Unicode encodings.
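
To make the ambiguity concrete, here is a small Python sketch of a
common fallback heuristic: try a strict UTF-8 decode and, if that
fails, treat the bytes as Latin-1. The names guess_decode and
decode_url_component are made up for this example. This is only a
guess, not a real solution: every byte sequence is valid Latin-1, so
bytes that happen to form valid UTF-8 are simply assumed to be UTF-8.

```python
from typing import Tuple
from urllib.parse import unquote_to_bytes

def guess_decode(raw: bytes) -> Tuple[str, str]:
    """Return (text, guessed_encoding) for raw bytes of unknown encoding.

    Heuristic only: valid UTF-8 is assumed to be UTF-8; anything else
    is decoded as Latin-1, which accepts every possible byte value.
    """
    try:
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"

def decode_url_component(component: str) -> Tuple[str, str]:
    """Undo %XX escaping, then guess the encoding of the resulting bytes."""
    return guess_decode(unquote_to_bytes(component))
```

With this sketch, "caf%E9" is taken as Latin-1 and "caf%C3%A9" as
UTF-8, and both come out as the same word. But the bytes C3 A9 are
also legal Latin-1 (two characters, A-tilde followed by the copyright
sign), and nothing in the URL itself says which reading the sender
meant.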

>3) Any other encoding, use the encoding tag as Marcus mentions
>For more detail, see 4.3.3 of the spec, at
> and note the language in regard to
>the encoding tag:
>Although an XML processor is required to read only entities in the UTF-8 and
>UTF-16 encodings, it is recognized that other encodings are used around the
>world, and it may be desired for XML processors to read entities that use
>them. Parsed entities which are stored in an encoding other than UTF-8 or
>UTF-16 must begin with a text declaration containing an encoding declaration
>Clearly, we are being told that this is not a requirement of an XML
>processor. Unfortunately, most of the ones out there do not understand the
>encoding tag, cannot read UTF-16 files, and destroy UTF-8 outside of the
>ASCII range.
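
For case 3), a processor that does want to honor the encoding
declaration has to read it before it knows the encoding. A minimal
Python sketch, assuming the entity is in an ASCII-compatible encoding
so that the declaration itself can be matched as ASCII bytes (the name
declared_encoding and the regular expression are my own, not taken
from the spec):

```python
import re

# Matches the encoding pseudo-attribute of a leading XML/text declaration.
# Assumption: the declaration is readable as ASCII-compatible bytes.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the declared encoding name in lowercase, or the default."""
    match = _DECL.match(data)
    if match is None:
        # No declaration found: fall back to UTF-8, as in rule 2).
        return default
    return match.group(1).decode("ascii").lower()
```

Whether the returned name is something the processor can actually
decode is a separate question; as the quoted passage says, only UTF-8
and UTF-16 are required.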
