Re: Correct definition for an "isLatin1()" function

From: Frank da Cruz (fdc@columbia.edu)
Date: Thu Oct 05 2000 - 14:28:13 EDT


> "Rogers, Paul" wrote:
>
> > We're whipping up a little function named isLatin1() that returns true if
> > the (UCS-2) string in question is "all Latin1".
>
> [snip]
>
> > In other words, should we exclude the C0, C1, and Latin Extended code
> > values?
>
> Including or excluding C0 and C1 is a matter of taste. If you mean
> "strictly containing characters in ISO 8859-1", then they're out.
> If you mean "representable in typical Latin-1 text files", then at least
> C0 is in, and C1 will do no great harm. (Provided your Unicode
> characters don't originate from incorrect transcoding from CP 1252.)
>
Amen. More chaos and confusion from our friend CP1252. If a C1 byte was
intended as a control character (such as NEL, which is actually used in
some places), then, by some definitions, the file that contains it might
be considered Latin-1. If, on the other hand, it was intended to be a
"smart quote" or somesuch, it can NOT be Latin-1. Unfortunately, computers
have not yet reached the level of sophistication needed for mind reading.
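
To make the ambiguity concrete, here is a small sketch in Python (chosen
only for illustration; the original question names no language). It decodes
the same byte, 0x93, under the two competing assumptions. The result is
either a C1 control or a "smart quote", depending on which character set
the converter believed it was reading:

```python
# The same single byte, decoded under two different assumptions.
raw = b"\x93"

as_latin1 = raw.decode("latin-1")  # ISO 8859-1: byte value maps straight to U+0093
as_cp1252 = raw.decode("cp1252")   # CP1252: 0x93 is LEFT DOUBLE QUOTATION MARK

print(f"as ISO 8859-1: U+{ord(as_latin1):04X}")
print(f"as CP1252:     U+{ord(as_cp1252):04X}")
```

Nothing in the byte itself says which reading was intended; only the
history of the data can settle it.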

Perhaps if you know the history of the data, you have some idea of what
C1 byte values are supposed to represent. If the file was converted to
UCS-2 from single-byte character sets, the history is important, as is the
precise conversion algorithm. If the data is UCS-2 ab initio, then
U+0080 through U+009F are well defined: they are C1 controls. Strictly speaking,
since the data is UCS-2 now, they are C1 controls anyway.
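
Since including C0 and C1 is a matter of policy, one option is to make the
policy an explicit argument. The following is a minimal sketch, not a
definitive implementation. The name is_latin1 and the two flags are
hypothetical, Python strings hold code points rather than UCS-2 code
units (equivalent here, since anything above U+00FF is rejected anyway),
and grouping DEL (U+007F) with the C0 controls is my assumption:

```python
def is_latin1(text, allow_c0=True, allow_c1=True):
    """Return True if every code point in text is at most U+00FF,
    subject to the caller's policy on C0 and C1 controls."""
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            return False                      # outside the Latin-1 range
        if cp <= 0x1F or cp == 0x7F:          # C0 controls (DEL grouped here by assumption)
            if not allow_c0:
                return False
        elif 0x80 <= cp <= 0x9F:              # C1 controls
            if not allow_c1:
                return False
    return True
```

With both flags set to False the test is "strictly ISO 8859-1 graphic
characters"; with both True it is "representable in a typical Latin-1
text file".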

Of course there's also the issue of combining sequences. Unless your
data is guaranteed to already be in Normalization Form C, your isLatin1()
function will have to include the entire normalization process, which
involves lookahead, database lookups, sorts, and more database lookups,
as described in Unicode Technical Report #15 (Unicode Normalization Forms).
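
Where a library already implements normalization, the check can lean on
it rather than reimplementing the process. A hedged sketch in Python,
using the standard unicodedata module; the function name
is_latin1_after_nfc is hypothetical, and this version makes no
distinction between controls and graphic characters:

```python
import unicodedata

def is_latin1_after_nfc(text):
    """Normalize to Form C, then test that every code point is at most U+00FF."""
    composed = unicodedata.normalize("NFC", text)
    return all(ord(ch) <= 0xFF for ch in composed)

# "e" + COMBINING ACUTE ACCENT composes to U+00E9, which is in Latin-1.
print(is_latin1_after_nfc("e\u0301"))

# "x" + COMBINING ACUTE ACCENT has no precomposed form, so the combining
# mark survives normalization and the string is not Latin-1.
print(is_latin1_after_nfc("x\u0301"))
```

Without the normalization step, the first string would be rejected
because U+0301 is above U+00FF, even though the text it represents is
perfectly expressible in Latin-1.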

- Frank


