Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 14:46:47 CST

Next message: Hans Aberg: "Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
In reply to: Mark E. Shoulson: "Re: 32'nd bit & UTF-8"
Next in thread: Martin Duerst: "<<NONCHAR>> for flex (was: Re: 32'nd bit & UTF-8)"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Reply: Martin Duerst: "<<NONCHAR>> for flex (was: Re: 32'nd bit & UTF-8)"

On 2005/01/20 15:24, Mark E. Shoulson at mark@kli.org wrote:

> I've been slowly catching up on this thread. Isn't this just a case of
> GIGO? The issue at hand is how to handle ill-formed "code-points" (i.e.
> 32-bit values) where a program was expecting to be dealing only with
> Unicode values. Well, you've given it garbage in, it should be expected
> to produce garbage out. If we choose to define the output garbage as
> some twisted generalization of UTF-8 (so that it doesn't require any
> special processing to generate), what's the problem?

The problem is that we do not have a specific lexer at hand, but a lexer
generator, Flex, and want to figure out how to make it support Unicode
encodings. There is then no universal way to define exactly how it should
act in the case of an error, because different lexers may choose different
actions. So, at least in the case of UTF-32, it is convenient to enable
regular expressions for all 2^32 numbers. The lexer writer will then have to
attune that to the Unicode standard.
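
To illustrate the split described above, here is a minimal sketch in Python (not Flex code, and not anything Flex actually provides). The generic part accepts any 32-bit value, and the error policy is supplied by the lexer writer. The names `classify`, `lex`, `replace_policy` and `token_policy` are hypothetical and chosen only for this example; the only facts assumed from the Unicode standard are that scalar values run from 0 to 0x10FFFF and exclude the surrogate range 0xD800 to 0xDFFF.

```python
from enum import Enum

MAX_SCALAR = 0x10FFFF                 # highest Unicode code point
SURROGATES = range(0xD800, 0xE000)    # not valid as UTF-32 code units


class Kind(Enum):
    SCALAR = "scalar"
    SURROGATE = "surrogate"
    OUT_OF_RANGE = "out_of_range"


def classify(unit: int) -> Kind:
    """Classify any 32-bit value; the generic layer rejects nothing."""
    if not 0 <= unit <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit value")
    if unit > MAX_SCALAR:
        return Kind.OUT_OF_RANGE
    if unit in SURROGATES:
        return Kind.SURROGATE
    return Kind.SCALAR


def lex(units, on_error):
    """Yield one token per 32-bit unit.

    on_error is the lexer writer's policy (hypothetical hook): it receives
    the offending unit and its Kind and returns the token to emit.
    """
    for unit in units:
        kind = classify(unit)
        if kind is Kind.SCALAR:
            yield ("CHAR", unit)
        else:
            yield on_error(unit, kind)


def replace_policy(unit, kind):
    """One possible policy: substitute U+FFFD REPLACEMENT CHARACTER."""
    return ("CHAR", 0xFFFD)


def token_policy(unit, kind):
    """Another possible policy: surface the bad value as its own token."""
    return ("ERROR", unit, kind)
```

Two lexers built on the same generic layer can then disagree about errors: `list(lex(data, replace_policy))` silently repairs the input, while `list(lex(data, token_policy))` hands every ill-formed value to the parser.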

Hans Aberg


This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 14:50:17 CST