Re: <<NONCHAR>> for flex

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 25 2005 - 13:36:50 CST

Next message: Markus Scherer: "Re: Surrogate points"

Previous message: Hans Aberg: "Re: Actually, this wasn't rhetorical"
Maybe in reply to: Gregg Reynolds: "Re: <<NONCHAR>> for flex"
Next in thread: Hans Aberg: "Re: 32'nd bit & UTF-8"

At 18:24 +0900 2005/01/24, Martin Duerst wrote:
>What I would expect such a Unicode-enabled version of flex to do
>is to have something similar to <<EOF>>, let's call it <<NONCHAR>>
>for the moment. <<NONCHAR>> would match shortest non-UTF-8 byte
>sequences. The typical use would be for a grammar to have a single
>rule matching <<NONCHAR>>, e.g. like so:
>
><<NONCHAR>> fprintf(stderr, "Illegal UTF-8 input.\n"); exit(1);

You are just suggesting a suitable interface, one that makes implementing a
Unicode lexer easy. One should most likely have such an interface, but it is
something that will follow once one starts implementing Unicode support and
starts using it.
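
To make the quoted proposal concrete, here is a minimal sketch in C of the
check that a <<NONCHAR>> rule would rest on. It is not flex code. The name
utf8_scan and the reading of "shortest non-UTF-8 byte sequence" (the lead
byte plus any trailing bytes that were still acceptable, stopping before the
first offending byte) are assumptions made for illustration only.

```c
#include <stddef.h>

/* Hypothetical helper, not part of flex.
   Returns 1 if s[0..n) starts with a well-formed UTF-8 sequence and stores
   its length in *len. Returns 0 otherwise and stores in *len the length of
   the ill-formed prefix (at least 1 whenever n > 0). */
int utf8_scan(const unsigned char *s, size_t n, size_t *len)
{
    unsigned char b, lo = 0x80, hi = 0xBF;  /* allowed range of next byte */
    size_t need, i;

    *len = 0;
    if (n == 0)
        return 0;
    b = s[0];
    if (b <= 0x7F) {                        /* ASCII */
        *len = 1;
        return 1;
    } else if (b >= 0xC2 && b <= 0xDF) {    /* two-byte lead */
        need = 1;
    } else if (b >= 0xE0 && b <= 0xEF) {    /* three-byte lead */
        need = 2;
        if (b == 0xE0) lo = 0xA0;           /* no overlong forms */
        if (b == 0xED) hi = 0x9F;           /* no surrogates */
    } else if (b >= 0xF0 && b <= 0xF4) {    /* four-byte lead */
        need = 3;
        if (b == 0xF0) lo = 0x90;           /* no overlong forms */
        if (b == 0xF4) hi = 0x8F;           /* nothing above U+10FFFF */
    } else {                                /* 80..C1 and F5..FF */
        *len = 1;
        return 0;
    }
    for (i = 1; i <= need; i++) {
        if (i >= n || s[i] < lo || s[i] > hi) {
            *len = i;                       /* stop before the bad byte */
            return 0;
        }
        lo = 0x80;                          /* later bytes: plain 80..BF */
        hi = 0xBF;
    }
    *len = need + 1;
    return 1;
}
```

A generated scanner would presumably fold this test into its tables rather
than call a function, but the function shows which byte sequences the rule
would have to catch.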

Implementation-wise, the problem seems to be how to represent character
classes. I assumed that they are made up of intervals in the Unicode code
point range, each of which is translated into a regular expression over
UTF-8 bytes. If one has many such intervals, the translated regular
expression gets big. A similar problem seems to arise with other
implementation methods.
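
As an illustration of the translation step, the following C sketch splits one
code point interval into alternatives over UTF-8 bytes, each alternative being
a sequence of byte ranges. This is one known way to do the translation, not
something flex provides. The names utf8_seq and utf8_ranges are hypothetical,
and the sketch assumes that surrogates are excluded and that input above
U+10FFFF is clamped.

```c
#include <stdint.h>

/* One alternative: len byte positions, position i matches lo[i]..hi[i]. */
typedef struct { int len; unsigned char lo[4], hi[4]; } utf8_seq;

/* Encode one code point (assumed valid) as UTF-8; returns the byte count. */
static int utf8_encode(uint32_t c, unsigned char *out)
{
    if (c <= 0x7F) { out[0] = (unsigned char)c; return 1; }
    if (c <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}

static void split(uint32_t lo, uint32_t hi, utf8_seq *out, int cap, int *n)
{
    static const uint32_t max_for_len[3] = { 0x7F, 0x7FF, 0xFFFF };
    int i;

    if (lo > hi)
        return;
    if (lo <= 0xDFFF && hi >= 0xD800) {     /* step around the surrogates */
        if (lo < 0xD800) split(lo, 0xD7FF, out, cap, n);
        if (hi > 0xDFFF) split(0xE000, hi, out, cap, n);
        return;
    }
    for (i = 0; i < 3; i++) {               /* one encoded length per piece */
        if (lo <= max_for_len[i] && max_for_len[i] < hi) {
            split(lo, max_for_len[i], out, cap, n);
            split(max_for_len[i] + 1, hi, out, cap, n);
            return;
        }
    }
    /* Cut until every trailing byte is either fixed or spans all of 80..BF. */
    for (i = 1; i < 4 && hi > 0x7F; i++) {
        uint32_t m = ((uint32_t)1 << (6 * i)) - 1;
        if ((lo & ~m) == (hi & ~m))
            continue;
        if ((lo & m) != 0) {
            split(lo, lo | m, out, cap, n);
            split((lo | m) + 1, hi, out, cap, n);
            return;
        }
        if ((hi & m) != m) {
            split(lo, (hi & ~m) - 1, out, cap, n);
            split(hi & ~m, hi, out, cap, n);
            return;
        }
    }
    if (*n < cap) {                         /* store only while room remains */
        out[*n].len = utf8_encode(lo, out[*n].lo);
        utf8_encode(hi, out[*n].hi);
    }
    (*n)++;
}

/* Returns the number of alternatives needed for lo..hi; stores up to cap. */
int utf8_ranges(uint32_t lo, uint32_t hi, utf8_seq *out, int cap)
{
    int n = 0;
    if (hi > 0x10FFFF)
        hi = 0x10FFFF;
    split(lo, hi, out, cap, &n);
    return n;
}
```

With this sketch the single interval U+0000..U+10FFFF already needs nine
alternatives, and U+0370..U+03FF needs two, so a class built from many
intervals yields a correspondingly larger expression.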

Hans Aberg


This archive was generated by hypermail 2.1.5 : Tue Jan 25 2005 - 14:53:11 CST