Re: How to distinguish UTF-8 from Latin-* ?

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Sun Jun 18 2000 - 10:40:45 EDT


Vinod Balakrishnan wrote:
> How can we distinguish a UTF-8 character sequence from
> Latin-1/Latin-* characters? In most internet applications,
> UTF-16 characters are prefixed by "0xu", but for UTF-8 characters there
> is no prefix to identify them. Do we HAVE/NEED a standard to represent
> UTF-8?

there are at least two ways:
1. some applications (like notepad on windows 2000) write and expect a special character (u+feff, the byte order mark) at the beginning of utf-8-encoded text; it ends up as the three bytes ef bb bf.

2. you can decode the first 100 or so bytes as utf-8 and check for encoding errors (wrong number of trail bytes, ...). if there are none, then it is almost certainly utf-8 (or it is pure ascii, which should not matter for you). a sketch of both checks follows below.
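
here is a minimal sketch of both checks in python; the function name and the 100-byte sample size are just illustrative choices, not part of any standard, and a real detector should also tolerate a multi-byte sequence cut off at the end of the sample:

def looks_like_utf8(data: bytes, sample_size: int = 100) -> bool:
    # check 1: a leading u+feff signature (ef bb bf) is a strong hint
    if data.startswith(b"\xef\xbb\xbf"):
        return True
    # check 2: strictly decode the first ~100 bytes as utf-8; malformed
    # sequences (wrong number of trail bytes etc.) raise an error
    try:
        data[:sample_size].decode("utf-8")
        return True   # valid utf-8 (or pure ascii, which is also valid utf-8)
    except UnicodeDecodeError:
        return False  # decoding error: probably latin-1 or another 8-bit charset

# looks_like_utf8("süß".encode("utf-8"))   -> True
# looks_like_utf8("süß".encode("latin-1")) -> False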

> For example, if the browser sends out an HTTP GET request containing
> non-Roman characters without the header information, the server application
> will not be able to tell whether the characters are UTF-8 or Latin-1.

the server typically does tell the browser what the charset is. if it is utf-8, then the server will send an http response header like
Content-Type: text/html; charset=utf-8

how does the server know? _you_ should tell it with a config file.
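
for illustration, here is a tiny python http.server sketch that sends exactly that header; it is just a stand-in for whatever your real server's config mechanism is (in apache, for example, a directive like AddDefaultCharset serves the same purpose):

from http.server import BaseHTTPRequestHandler, HTTPServer

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = "<html><body>héllo</body></html>".encode("utf-8")
        self.send_response(200)
        # this header line tells the browser which charset the body uses
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("localhost", 8000), Utf8Handler).serve_forever()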

also, if the response to the get request is an html file, then you should declare its charset inside the file:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
...
</html>

if it is xml, then have a look at the xml spec (together with its errata list!). it specifies very clearly how to figure all of this out, starting from the declaration:
<?xml version="1.0" encoding="utf-8"?>...
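
a rough python sketch of that autodetection idea (simplified: only the bom cases and the ascii-compatible "<?xml" case, nowhere near the full detection table in the spec):

def sniff_xml_encoding(data: bytes) -> str:
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"       # utf-8 byte order mark
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"    # utf-16, big-endian bom
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"    # utf-16, little-endian bom
    if data.startswith(b"<?xml"):
        # the declaration itself is plain ascii here, so look for encoding="..."
        decl = data.split(b"?>", 1)[0].decode("ascii", errors="replace")
        if 'encoding="' in decl:
            return decl.split('encoding="', 1)[1].split('"', 1)[0]
    return "utf-8"           # the default when nothing is declared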

markus


