RE: How to distinguish UTF-8 from Latin-* ?

From: Robert A. Rosenberg (bob.rosenberg@digitscorp.com)
Date: Tue Jun 20 2000 - 15:35:54 EDT


At 02:01 PM 06/19/2000 -0800, Vinod Balakrishnan wrote:

>[snip]
> >2) No encoding information... UTF-8 can be assumed (often it is just ASCII
> >so this works)
>
>This was my concern, there is no way to distinguish UTF-8 from Latin-1 in
>case of upper ASCII characters here.

Yes there is - it's called a "Sanity Check". You scan the file looking for
High-ASCII bytes (0x80 and above). If you find none, the file is US-ASCII
(which reads the same as ISO-8859-1 or as UTF-8). Once you find one, you
treat it as a UTF-8 lead byte and use its high-order bits to see how long
the sequence should be IF it is UTF-8. Look at the next x bytes to see if
they have the correct continuation pattern (10xxxxxx). If not, count one
Bad-UTF-8. If so, count one Good-UTF-8. Once you roll off the end of the
sequence, resume scanning for another High-ASCII byte and do the check
again. After finding 12 sequences that start with High-ASCII (or running
off the end of the file), check your GOOD/BAD counts. All BAD means
ISO-8859-1. All GOOD means UTF-8. Mixed (with most being BAD) is ISO-8859-1
(the Goods are "noise"). Mostly Good with a few Bad is either malformed
UTF-8 or ISO-8859-1 (with the bad luck of hitting byte pairs that LOOK
LIKE UTF-8).
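
For anyone who wants to try the check described above, here is a rough
sketch in Python. It is only an illustration, not a tested detector: the
names expected_length and sanity_check and the returned labels are made up
for this example, it assumes the modern 4-byte limit on UTF-8 sequences,
and it does not reject overlong forms or surrogates. The mostly-Bad versus
mostly-Good cutoff (bad >= good) is also an assumption, since no exact
ratio is given.

```python
def expected_length(lead: int) -> int:
    """Total sequence length implied by a UTF-8 lead byte, or 0 if the
    byte cannot start a sequence (hypothetical helper for this sketch)."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # stray continuation byte (0x80-0xBF) or 0xF8 and above


def sanity_check(data: bytes, max_samples: int = 12) -> str:
    """Count Good/Bad UTF-8 sequences among the first max_samples
    High-ASCII runs and return a guess (labels are illustrative)."""
    good = bad = 0
    i = 0
    while i < len(data) and good + bad < max_samples:
        if data[i] < 0x80:
            i += 1  # plain ASCII, keep scanning
            continue
        n = expected_length(data[i])
        tail = data[i + 1:i + n]  # the bytes that should be continuations
        if n and len(tail) == n - 1 and all(0x80 <= b <= 0xBF for b in tail):
            good += 1
            i += n  # roll off the end of the sequence
        else:
            bad += 1
            i += 1  # resume scanning right after the offending byte
    if good == 0 and bad == 0:
        return "ascii"
    if bad == 0:
        return "utf-8"
    if good == 0:
        return "iso-8859-1"
    # Mixed counts: the cutoff here is an assumption of this sketch.
    return "probably iso-8859-1" if bad >= good else "ambiguous"
```

Calling sanity_check(open(path, "rb").read()) on a file gives one of the
labels above; the "ambiguous" case is the Mostly-Good-with-a-few-Bad
situation, where the counts alone cannot settle it.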


