RE: How to distinguish UTF-8 from Latin-* ?

From: Timothy Partridge (timpart@perdix.demon.co.uk)
Date: Tue Jun 20 2000 - 15:11:54 EDT


Vinod recently said:

  [Michael]
> >2) No encoding information... UTF-8 can be assumed (often it is just ASCII
> >so this works)
>
> This was my concern, there is no way to distinguish UTF-8 from Latin-1 in
> case of upper ASCII characters here.

UTF-8 has to follow strict rules about allowable sequences of bytes.
You could do full checking, but the following rules of thumb, which look
at two bytes at most, should do the trick. Characters with the high bit
set are relatively uncommon in Latin-* text and often occur singly
(surrounded by characters with the high bit clear), resulting in
sequences that are illegal in UTF-8.

In the following, each x indicates a "don't care" bit. These bits don't
have to match from one byte to the next.

The text is not (proper) UTF-8 if you encounter any of the following:

A byte greater than 11110100 (not possible in the Unicode range up to plane 16)
A byte of the form 1100000x (irregular: it would start a two-byte encoding of an ASCII character)

The sequence 0xxxxxxx 10xxxxxx (illegal)
The sequence 11xxxxxx 0xxxxxxx (illegal)
The sequence 11xxxxxx 11xxxxxx (illegal)
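
The rules above could be sketched roughly as follows. This is only an
illustration in Python, not a full UTF-8 validator; the function name
looks_like_utf8 is made up for the example, and a True result only means
that none of the rules of thumb fired.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Apply the one-byte and two-byte rules of thumb listed above.

    Hypothetical helper for illustration: returns False as soon as a
    rule says the data cannot be proper UTF-8, True otherwise.
    """
    prev = None
    for b in data:
        # A byte greater than 11110100 is beyond plane 16.
        if b > 0xF4:
            return False
        # 1100000x (0xC0 or 0xC1) is the irregular lead byte.
        if (b & 0xFE) == 0xC0:
            return False
        if prev is not None:
            prev_is_ascii = prev < 0x80              # 0xxxxxxx
            prev_is_lead = prev >= 0xC0              # 11xxxxxx
            cur_is_ascii = b < 0x80                  # 0xxxxxxx
            cur_is_continuation = (b & 0xC0) == 0x80 # 10xxxxxx
            cur_is_lead = b >= 0xC0                  # 11xxxxxx
            # 0xxxxxxx 10xxxxxx
            if prev_is_ascii and cur_is_continuation:
                return False
            # 11xxxxxx 0xxxxxxx and 11xxxxxx 11xxxxxx
            if prev_is_lead and (cur_is_ascii or cur_is_lead):
                return False
        prev = b
    return True
```

For example, looks_like_utf8("café au lait".encode("latin-1")) comes out
False, because the byte for é (11101001) is followed by a space, which is
the 11xxxxxx 0xxxxxxx case. Note that a lone high bit byte at the very end
of the data is not caught, since these checks only compare a byte with the
one before it.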

A byte with the following bit pattern is not a printable Latin-* character
(but could be a control code):
100xxxxx
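
As a rough sketch of that last check (again just an illustration in
Python, with a made-up function name), one could scan for any byte in the
100xxxxx range:

```python
def has_100xxxxx_byte(data: bytes) -> bool:
    """Return True if any byte matches the bit pattern 100xxxxx.

    Hypothetical helper: such a byte (0x80 to 0x9F) is not a printable
    Latin-* character, so its presence hints that the data is either
    not Latin-* or contains control codes.
    """
    # Mask off the five "don't care" bits and compare the top three.
    return any((b & 0xE0) == 0x80 for b in data)
```

A hit here is only a hint, since the byte could be a control code, so it
is probably best combined with the UTF-8 rules above rather than used on
its own.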
 
Hope this is useful,

   Tim

-- 
Tim Partridge. Any opinions expressed are mine only and not those of my employer


