From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 14:46:47 CST
On 2005/01/20 15:24, Mark E. Shoulson at mark@kli.org wrote:
> I've been slowly catching up on this thread. Isn't this just a case of
> GIGO? The issue at hand is how to handle ill-formed "code-points" (i.e.
> 32-bit values) where a program was expecting to be dealing only with
> Unicode values. Well, you've given it garbage in, it should be expected
> to produce garbage out. If we choose to define the output garbage as
> some twisted generalization of UTF-8 (so that it doesn't require any
> special processing to generate), what's the problem?
The problem is that we do not have a specific lexer at hand, but the lexer
generator Flex, and we want to figure out how to make it support Unicode
encodings. There is then no universal way to define exactly how it should
act in the case of an error, because different lexers may choose different
actions. So, at least in the case of UTF-32, it is convenient to admit
regular expressions over all 2^32 numbers. The lexer writer will then have
to attune that to the Unicode standard.
Hans Aberg
This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 14:50:17 CST