Re: 32'nd bit & UTF-8

From: Hans Aberg
Date: Thu Jan 20 2005 - 14:46:47 CST

    On 2005/01/20 15:24, Mark E. Shoulson wrote:

    > I've been slowly catching up on this thread. Isn't this just a case of
    > GIGO? The issue at hand is how to handle ill-formed "code-points" (i.e.
    > 32-bit values) where a program was expecting to be dealing only with
    > Unicode values. Well, you've given it garbage in, it should be expected
    > to produce garbage out. If we choose to define the output garbage as
    > some twisted generalization of UTF-8 (so that it doesn't require any
    > special processing to generate), what's the problem?

    The problem is that we do not have a specific lexer at hand, but a lexer
    generator, Flex, and we want to figure out how to make it support Unicode
    encodings. There is then no universal way to define exactly how it should
    act in the case of an error, because different lexers may choose different
    actions. So, at least in the case of UTF-32, it is convenient to enable
    regular expressions for all 2^32 numbers. The lexer writer will then have
    to adapt that to the Unicode standard.
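
    To illustrate the split between what the generator accepts and what an
    individual lexer does about errors, here is a rough sketch in Python. It
    is not Flex code or any Flex API; the names is_unicode_scalar, scan_utf32
    and the on_error callback are made up for the example. The scanner takes
    any 32-bit value, and the lexer writer supplies the policy for values
    that are not Unicode scalar values (surrogates, or anything above
    0x10FFFF).

```python
import struct

MAX_SCALAR = 0x10FFFF          # highest Unicode code point
SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF


def is_unicode_scalar(value: int) -> bool:
    """True if value is a Unicode scalar value (hypothetical helper)."""
    if value < 0 or value > MAX_SCALAR:
        return False
    return not (SURROGATE_LO <= value <= SURROGATE_HI)


def scan_utf32(data: bytes, on_error):
    """Yield one token per 32-bit little-endian unit in data.

    Every one of the 2^32 possible values is accepted as input. Values
    that are not Unicode scalar values are handed to on_error, so each
    lexer chooses its own action.
    """
    if len(data) % 4 != 0:
        raise ValueError("input length is not a multiple of 4 bytes")
    for (value,) in struct.iter_unpack("<I", data):
        if is_unicode_scalar(value):
            yield ("char", value)
        else:
            yield on_error(value)


# Two possible policies a lexer writer might choose (assumptions):
def replace_policy(value: int):
    """Substitute U+FFFD REPLACEMENT CHARACTER for the bad value."""
    return ("char", 0xFFFD)


def passthrough_policy(value: int):
    """Keep the raw 32-bit value, but mark it as not Unicode."""
    return ("raw", value)


if __name__ == "__main__":
    sample = struct.pack("<4I", 0x41, 0xD800, 0x110000, 0x10FFFF)
    print(list(scan_utf32(sample, replace_policy)))
    print(list(scan_utf32(sample, passthrough_policy)))
```

    The same scanner serves both policies unchanged, which is the reason for
    letting the generator handle the full 32-bit range and leaving the
    Unicode restriction to the lexer writer.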

      Hans Aberg

    This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 14:50:17 CST