>Try to filter non-ASCII characters out of French and the messages are
>unreadable at best, even if about 97% of the characters are indeed ASCII
>(statistics from some corpora I have)... but the remaining 3% is highly
>relevant, not to mention essential.
>Let's keep a sense of humour.
Ah... but UTF-8 is a good "compression" scheme for most Latin-1 languages, and it has some advantages with regard to legacy systems (how much of your code was generated by lex?). Of course, it has the opposite effect if you use it on data streams drawn from the higher ranges of Unicode (e.g. Asian languages), where you are now consuming three or more octets per character... but it all depends on the application.
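To put rough numbers on that trade-off, here is a minimal Python sketch that counts octets for the same text in UTF-8 and in a fixed 16-bit encoding. The helper name encoded_sizes and both sample strings are made up for illustration, and UTF-16 (little-endian, no byte-order mark) stands in for a "wide-char" representation; that is an assumption, not something the discussion specifies.

```python
import unicodedata


def encoded_sizes(text: str) -> dict:
    """Return the character count and the encoded size in octets.

    The text is normalized to NFC first so that accented letters are
    counted as single precomposed characters.
    """
    text = unicodedata.normalize("NFC", text)
    return {
        "chars": len(text),
        "utf8": len(text.encode("utf-8")),
        # "utf-16-le" omits the byte-order mark, so this is payload only.
        "utf16": len(text.encode("utf-16-le")),
    }


# Hypothetical sample strings, chosen only to illustrate the two cases.
french = "Élève très doué"      # mostly ASCII, four accented letters
japanese = "日本語のテキスト"    # every character is outside Latin-1

print(encoded_sizes(french))    # {'chars': 15, 'utf8': 19, 'utf16': 30}
print(encoded_sizes(japanese))  # {'chars': 8, 'utf8': 24, 'utf16': 16}
```

For the French sample UTF-8 needs 19 octets against 30 for UTF-16; for the Japanese sample it is the other way round, 24 against 16.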
It *IS* a solution. Implementers should be aware that the data stream is probably not legible (as you note), but it may be more parser- and/or storage-friendly. I can see why so many people are tempted by this encoding.
Personally, I resist it. True wide-char programs are so much easier to maintain over the long haul.
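One way to read the maintenance point, sketched in Python: code that cuts a UTF-8 octet stream at an arbitrary offset can land in the middle of a multi-octet sequence, while code that works on decoded characters cannot. The helper is_valid_utf8 and the sample word are hypothetical, and Python's str is used here as a stand-in for a wide-char string.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the octets decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "\u00e9t\u00e9"      # "été": three characters
raw = word.encode("utf-8")  # five octets: c3 a9 74 c3 a9

# Truncating the octet stream after one octet splits the first character.
print(is_valid_utf8(raw[:1]))   # False

# Truncating the decoded string after one character is always safe.
print(word[:1] == "\u00e9")     # True
```

The octet-level version is still workable, but every truncation, index, and length check has to know about sequence boundaries.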
Director, Globalization Services
+1 650-526-4652 (direct telephone)
+1 831-659-0514 (alternate office)
AddisonP@simultrans.com (Internet email)
"22 languages. One release date."