>> This means that a UTF-8 sequence can be identified as UTF-8
>> encoding in normal text. This has been suggested by several people
>> as a way to identify whether text is UTF-8 or not. But it can also
>> be used to make UTF-8 nearly ISO 8859-1 compatible, by making the
>> reading/writing routines for UTF-8 adaptive. When reading: if a
>> sequence is a correct UTF-8 encoding sequence, decode it as UTF-8;
>> if not, use the byte as itself (just as is done for all byte values
>> below 128). When writing: if the code value is below 256 and the
>> resulting byte sequence does not look like a UTF-8 encoding, write
>> the byte itself; otherwise encode using UTF-8.
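The adaptive scheme quoted above can be sketched roughly as follows. The function names and the greedy longest-match decoding strategy are my own illustration under the stated assumptions, not code from the original mail:

```python
def _is_utf8(chunk: bytes) -> bool:
    """True if chunk is a complete, valid UTF-8 byte string."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def adaptive_decode(data: bytes) -> str:
    """Reading: decode valid UTF-8 sequences as UTF-8; treat any
    other byte as its own code value (i.e. as Latin-1)."""
    out = []
    i = 0
    while i < len(data):
        # Try the longest valid UTF-8 chunk starting at i (1-4 bytes).
        for n in (4, 3, 2, 1):
            chunk = data[i:i + n]
            if _is_utf8(chunk):
                out.append(chunk.decode("utf-8"))
                i += n
                break
        else:
            out.append(chr(data[i]))  # fall back: byte is its own code
            i += 1
    return "".join(out)

def adaptive_encode(text: str) -> bytes:
    """Writing: emit code values below 256 as raw bytes, unless the
    resulting run of raw bytes would itself look like UTF-8; in that
    case (and for all values >= 256) encode as UTF-8."""
    out = bytearray()
    i = 0
    while i < len(text):
        cp = ord(text[i])
        if cp >= 0x100:
            out.extend(text[i].encode("utf-8"))
        elif cp < 0x80:
            out.append(cp)
        else:
            # Look ahead: would this raw byte plus the raw bytes of the
            # following sub-256 characters parse as a UTF-8 sequence?
            window = bytearray()
            j = i
            while j < len(text) and ord(text[j]) < 0x100 and len(window) < 4:
                window.append(ord(text[j]))
                j += 1
            if any(_is_utf8(bytes(window[:n])) for n in range(2, len(window) + 1)):
                out.extend(text[i].encode("utf-8"))
            else:
                out.append(cp)
        i += 1
    return bytes(out)
```

With these, `adaptive_decode(b"caf\xe9")` and `adaptive_decode("café".encode("utf-8"))` both yield "café", and ambiguous strings like "Ã©" still round-trip, because the writer UTF-8-encodes the raw byte that would otherwise masquerade as a UTF-8 lead byte.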
>This requires some lookahead but is surely doable. If you really use
>this for writing then you may make a few Latin-1 readers happier but
>you will also upset UTF-8 readers whose software chokes or uses a
>different default (like Latin-0).
I have quite a lot of programs that can read latin-1 but not UTF-8.
And the UTF-8 readers I have tried stop reading when they detect
latin-1 in my files, making them worthless.
My adaptive scheme would allow UTF-8 readers to read latin-1 with very
I only talk about latin-1 because latin-1 is a true subset of UCS, just
like ascii, and can be handled without any code translation. latin-9
(ISO 8859-15; I think that was the real name for latin-0) and others are
not true subsets of UCS (when you look at the code values) and would
need code translation to work.
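The code-value point can be checked directly. This is my own quick illustration, not from the mail: every Latin-1 byte value equals its UCS code point, while Latin-9 places the euro sign at byte 0xA4 even though it is U+20AC in UCS:

```python
# latin-1 code values coincide with UCS code points, so no
# translation table is needed:
latin1_bytes = bytes(range(0xA0, 0x100))
assert all(b == ord(ch)
           for b, ch in zip(latin1_bytes, latin1_bytes.decode("latin-1")))

# latin-9 (ISO 8859-15, informally "latin-0") is not a code-value
# subset: byte 0xA4 is the euro sign, which is U+20AC in UCS.
assert b"\xa4".decode("iso-8859-15") == "\u20ac"
assert 0xA4 != 0x20AC
```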
>When coexisting with standard UTF-8
>files you will have two representations to grep for to search an
If the UTF-8 grep uses the adaptive UTF-8 reader code, it works for
both representations.
> And if you just want to liberally accept pre-UTF-8 texts then
>why not also accept that vast number of Windows bullets and quotes
>from code page 1252?
As I said above, latin-1 is nice because it is a true subset. Other
character codings are not, and would need code translations, making
the software a lot more complex.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:40 EDT