From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 25 2005 - 13:36:50 CST
At 18:24 +0900 2005/01/24, Martin Duerst wrote:
>What I would expect such a Unicode-enabled version of flex to do
>is to have something similar to <<EOF>>, let's call it <<NONCHAR>>
>for the moment. <<NONCHAR>> would match shortest non-UTF-8 byte
>sequences. The typical use would be for a grammar to have a single
>rule matching <<NONCHAR>>, e.g. like so:
>
><<NONCHAR>> fprintf(stderr, "Illegal UTF-8 input.\n"); exit(1);
You are suggesting a suitable interface, one that would make implementing a
Unicode lexer easy. One should most likely have such an interface, but it is
something that will follow once one starts implementing Unicode support and
using it.
Implementation-wise, the problem seems to be how to represent character
classes. I assumed that they are made up of intervals in the Unicode code
point range. If one has many such intervals, the translated regular
expression gets big. A similar problem seems to arise with other
implementation methods.
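
To make the size concern concrete, here is a rough sketch (in Python, not
flex code) of one way a single code point interval could be translated into
byte-level alternatives, assuming the lexer works on UTF-8 bytes. The names
utf8_byte_ranges and to_regex are made up for illustration, and the splitting
strategy is a commonly used one, not necessarily what a flex implementation
would choose.

```python
def utf8_byte_ranges(lo, hi):
    """Split the code point interval [lo, hi] into UTF-8 byte-range sequences.

    Returns a list of sequences; each sequence is a tuple of (min, max) byte
    pairs, one pair per encoded byte. Hypothetical helper, for illustration.
    """
    if lo > hi:
        return []
    # Surrogates (U+D800..U+DFFF) have no UTF-8 encoding: cut them out.
    if lo <= 0xDFFF and hi >= 0xD800:
        return utf8_byte_ranges(lo, 0xD7FF) + utf8_byte_ranges(0xE000, hi)
    # A pure ASCII interval is a single one-byte range.
    if hi <= 0x7F:
        return [((lo, hi),)]
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            return utf8_byte_ranges(lo, limit) + utf8_byte_ranges(limit + 1, hi)
    # Split until every continuation byte spans its full 6-bit range
    # whenever the bits above it differ between lo and hi.
    for i in (1, 2, 3):
        m = (1 << (6 * i)) - 1
        if (lo & ~m) != (hi & ~m):
            if lo & m != 0:
                return utf8_byte_ranges(lo, lo | m) + utf8_byte_ranges((lo | m) + 1, hi)
            if hi & m != m:
                return utf8_byte_ranges(lo, (hi & ~m) - 1) + utf8_byte_ranges(hi & ~m, hi)
    # Now lo and hi encode to the same length and the bytes pair up directly.
    return [tuple(zip(chr(lo).encode("utf-8"), chr(hi).encode("utf-8")))]


def to_regex(seqs):
    """Render the sequences as a byte-level pattern using \\xHH escapes."""
    def cls(a, b):
        return f"\\x{a:02X}" if a == b else f"[\\x{a:02X}-\\x{b:02X}]"
    return "|".join("".join(cls(a, b) for a, b in seq) for seq in seqs)


if __name__ == "__main__":
    # The whole Unicode range as one interval.
    print(to_regex(utf8_byte_ranges(0x0000, 0x10FFFF)))
```

With this scheme the single interval U+0000..U+10FFFF already expands into
nine alternatives, and a character class made up of many scattered intervals
multiplies that accordingly, which is the growth I am worried about.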
Hans Aberg
This archive was generated by hypermail 2.1.5 : Tue Jan 25 2005 - 14:53:11 CST