Re: UTF-8 Error Handling

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Feb 28 2003 - 16:02:48 EST


Yung-Fong Tang wrote:
> Same thing for JIS X 0208 (a two-byte, and only two-byte, character set,
> not a variable-length character set). If I am processing an ISO-2022-JP
> message in JIS X 0208 mode and I get 0x24 0xa8, I know the boundary of
> that problem is 16 bits, not 8 bits or 32 bits.

Not true. You cannot tell which of these happened:
- a byte was dropped before or after 0x24
   -> the first sequence is only 1 byte long
- a byte was corrupted to become 0xa8
   -> the sequence consists of two bytes
- some wild combination of multiple errors

With a single-unit encoding, you can always assume that an illegal unit is a one-unit error. With
any multi-unit encoding, you can only guess how many units the error spans.
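
A minimal Python sketch of the point above; it is not part of the original message. It assumes a simplified JIS X 0208 reader in which every byte of a two-byte unit must lie in 0x21..0x7E, and it does not check whether a pair is actually assigned. The names decode_pairs, ERR and skip are hypothetical. The only thing that varies is the guess about how long an error is.

```python
ERR = None  # marker for one reported error (hypothetical convention)


def is_valid(byte):
    """True if the byte is in the 0x21..0x7E range used by 94x94 sets."""
    return 0x21 <= byte <= 0x7E


def decode_pairs(data, skip):
    """Read two-byte units from `data`.

    On a bad or truncated unit, report one error and advance by `skip`
    bytes. `skip` encodes the decoder's guess about the error length.
    """
    out = []
    i = 0
    while i < len(data):
        if i + 1 < len(data) and is_valid(data[i]) and is_valid(data[i + 1]):
            out.append((data[i] << 8) | data[i + 1])
            i += 2
        else:
            out.append(ERR)
            i += skip
    return out


# The bytes from the thread, followed by two well-formed pairs.
received = bytes([0x24, 0xA8, 0x24, 0x22, 0x24, 0x24])

as_two_byte_error = decode_pairs(received, skip=2)
as_one_byte_error = decode_pairs(received, skip=1)

# The same stream with the leading 0x24 lost in transit.
shifted = received[1:]
shifted_two = decode_pairs(shifted, skip=2)
shifted_one = decode_pairs(shifted, skip=1)
```

On `received`, the two guesses agree on the decoded pairs but disagree on how many errors to report (one versus two). On `shifted`, they disagree on the decoded pairs themselves: skip=2 yields 0x2224 and a trailing error, while skip=1 yields 0x2422 and 0x2424. Nothing in the byte stream tells the decoder which guess matches what actually happened.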

markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.

