From: Markus Scherer (firstname.lastname@example.org)
Date: Wed Oct 30 2002 - 18:13:53 EST
Dominikus Scherkl wrote:
> Converting from and to utf-8 is an all-day topic, very important
> for all applications handling with unicode. So it is a special
Converting text to/from UTF-8 is indeed common and important.
Converting text that claims to be UTF-8 - but isn't - is different: It may be a spoofing attempt, or
bytes may have been lost, or the text may not be UTF-8 at all, etc. How to handle non-UTF-8 text in
a from-UTF-8 converter seems to be a judgement call, and application-specific.
(How does the converter know _why_ there is an illegal sequence?)
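
To make the "judgement call" concrete, here is a minimal sketch using the error handlers of Python's built-in UTF-8 codec. The sample bytes and the helper name decode_with_policy are made up for illustration. The same malformed input gives three different results depending on the policy the application picks:

```python
# Bytes that claim to be UTF-8 but contain a stray 0xFF,
# a byte value that never occurs in well-formed UTF-8.
data = b"abc\xffdef"  # illustrative sample, not from any real message


def decode_with_policy(raw: bytes, policy: str) -> str:
    """Hypothetical helper: decode raw as UTF-8 with a standard error handler."""
    return raw.decode("utf-8", errors=policy)


# Policy 1: "strict" refuses the whole input.
try:
    decode_with_policy(data, "strict")
    strict_result = "accepted"
except UnicodeDecodeError:
    strict_result = "rejected"

# Policy 2: "replace" substitutes U+FFFD for the offending byte.
replaced = decode_with_policy(data, "replace")

# Policy 3: "ignore" silently drops the offending byte.
ignored = decode_with_policy(data, "ignore")
```

None of these policies tells the converter why the byte was bad. A security-sensitive application would probably want "strict", while a viewer for damaged text might prefer "replace".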
> Additional I think we should have a standardized way to display
> old utf-8 text without losing information (overlong utf-8 was
> allowed for years) ...
ISO 10646 and the RFC never allowed generating overlong UTF-8. Unicode at least used to say "should
not" for generation (but allowed decoding). Chances are nearly 100% that overlong UTF-8 was a
spoofing attempt, or the result of something other than a UTF-8 encoder.
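
As an illustration of what "overlong" means, the following sketch decodes a single UTF-8-style sequence leniently and reports whether a shorter encoding of the same code point exists. The function name decode_lenient is hypothetical, and the sketch handles only 1- to 4-byte forms and does not check for surrogates or the upper code point limit. The classic example is C0 AF, a two-byte form of "/" (U+002F), which a strict decoder must reject:

```python
from typing import Tuple


def decode_lenient(seq: bytes) -> Tuple[int, bool]:
    """Hypothetical helper: decode one sequence, return (code point, is_overlong)."""
    lead = seq[0]
    # Pick the sequence length, the payload bits of the lead byte, and the
    # smallest code point that legitimately needs that many bytes.
    if lead < 0x80:
        length, cp, minimum = 1, lead, 0x0
    elif lead & 0xE0 == 0xC0:
        length, cp, minimum = 2, lead & 0x1F, 0x80
    elif lead & 0xF0 == 0xE0:
        length, cp, minimum = 3, lead & 0x0F, 0x800
    elif lead & 0xF8 == 0xF0:
        length, cp, minimum = 4, lead & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    # Each trail byte must look like 10xxxxxx and contributes 6 bits.
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid trail byte")
        cp = (cp << 6) | (b & 0x3F)
    # Overlong: the code point would have fit in a shorter sequence.
    return cp, cp < minimum


slash_overlong = decode_lenient(b"\xc0\xaf")  # two-byte form of "/"
slash_normal = decode_lenient(b"/")           # shortest form of "/"

# Python's own strict UTF-8 decoder refuses the overlong form.
try:
    b"\xc0\xaf".decode("utf-8")
    strict_accepts_overlong = True
except UnicodeDecodeError:
    strict_accepts_overlong = False
```

A filter that searches the raw bytes for "/" before decoding would miss the C0 AF form, which is why such sequences are usually treated as a spoofing attempt.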
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.