Re: <<NONCHAR>> for flex

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 25 2005 - 13:36:50 CST

    At 18:24 +0900 2005/01/24, Martin Duerst wrote:
    >What I would expect such a Unicode-enabled version of flex to do
    >is to have something similar to <<EOF>>, let's call it <<NONCHAR>>
    >for the moment. <<NONCHAR>> would match shortest non-UTF-8 byte
    >sequences. The typical use would be for a grammar to have a single
    >rule matching <<NONCHAR>>, e.g. like so:
    >
    ><<NONCHAR>> fprintf(stderr, "Illegal UTF-8 input.\n"); exit(1);

    What you are suggesting is a suitable interface, one that would make it easy
    to implement a Unicode lexer. One should most likely have such an interface,
    but it will follow naturally once one starts implementing Unicode support and
    using it.
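
To make the proposed rule concrete, here is a minimal sketch in C of the check that a <<NONCHAR>> rule would stand for. It is not flex code and not anything flex provides; the function names utf8_seq_len and utf8_first_invalid are hypothetical, and the byte ranges follow the well-formed UTF-8 table (no overlong forms, no surrogates, nothing above U+10FFFF). How many bytes a real <<NONCHAR>> would consume at the reported offset is left open here.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence starting at s[0],
 * or 0 if the bytes at s do not begin one (hypothetical helper). */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    if (n == 0)
        return 0;

    unsigned char b = s[0];
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
    size_t len;

    if (b < 0x80) {
        return 1;                          /* ASCII */
    } else if (b >= 0xC2 && b <= 0xDF) {
        len = 2;                           /* C0 and C1 would be overlong */
    } else if (b >= 0xE0 && b <= 0xEF) {
        len = 3;
        if (b == 0xE0) lo = 0xA0;          /* excludes overlong forms */
        if (b == 0xED) hi = 0x9F;          /* excludes surrogates */
    } else if (b >= 0xF0 && b <= 0xF4) {
        len = 4;
        if (b == 0xF0) lo = 0x90;          /* excludes overlong forms */
        if (b == 0xF4) hi = 0x8F;          /* excludes values above U+10FFFF */
    } else {
        return 0;                          /* stray continuation or bad lead */
    }

    if (n < len)
        return 0;                          /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (size_t i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}

/* Offset of the first byte that does not start a well-formed sequence,
 * or n if the whole buffer is well-formed UTF-8. */
size_t utf8_first_invalid(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t k = utf8_seq_len(s + i, n - i);
        if (k == 0)
            return i;
        i += k;
    }
    return n;
}
```

A lexer action like the one quoted above would fire at the offset returned by utf8_first_invalid whenever that offset is less than the buffer length.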

    Implementation-wise, the problem seems to be how to represent character
    classes. I assumed that they are made up of intervals in the Unicode code
    point range. If one has many such intervals, the translated regular
    expression gets big. A similar problem seems to arise with other
    implementation methods.
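
The growth can be illustrated with a sketch in C, assuming the translation meant here is from code point intervals to byte-level UTF-8 patterns. One interval [lo, hi] is split into alternatives, each a sequence of one to four byte ranges. The names ByteSeq, utf8_ranges and print_seq are hypothetical, and the splitting strategy is one well-known approach, not necessarily what a Unicode-enabled flex would do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One alternative of the translated expression: n byte ranges in sequence. */
typedef struct { int n; uint8_t lo[4], hi[4]; } ByteSeq;

static int utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) { out[0] = (uint8_t)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

static void split(uint32_t lo, uint32_t hi, ByteSeq *out, int cap, int *n)
{
    static const uint32_t len_max[3] = { 0x7F, 0x7FF, 0xFFFF };
    if (lo > hi) return;
    /* Surrogates have no UTF-8 encoding: cut them out of the interval. */
    if (lo <= 0xDFFF && hi >= 0xD800) {
        if (lo < 0xD800) split(lo, 0xD7FF, out, cap, n);
        if (hi > 0xDFFF) split(0xE000, hi, out, cap, n);
        return;
    }
    /* Split where the encoded length changes. */
    for (int i = 0; i < 3; i++)
        if (lo <= len_max[i] && len_max[i] < hi) {
            split(lo, len_max[i], out, cap, n);
            split(len_max[i] + 1, hi, out, cap, n);
            return;
        }
    /* Multi-byte case: split until the trailing bytes cover full ranges. */
    for (int i = 1; i < 4 && hi >= 0x80; i++) {
        uint32_t m = (1u << (6 * i)) - 1;
        if ((lo & ~m) == (hi & ~m)) continue;
        if ((lo & m) != 0) {
            split(lo, lo | m, out, cap, n);
            split((lo | m) + 1, hi, out, cap, n);
            return;
        }
        if ((hi & m) != m) {
            split(lo, (hi & ~m) - 1, out, cap, n);
            split(hi & ~m, hi, out, cap, n);
            return;
        }
    }
    /* lo and hi now encode to the same length; emit per-position ranges. */
    assert(*n < cap);
    ByteSeq *s = &out[(*n)++];
    uint8_t a[4], b[4];
    s->n = utf8_encode(lo, a);
    utf8_encode(hi, b);
    for (int k = 0; k < s->n; k++) { s->lo[k] = a[k]; s->hi[k] = b[k]; }
}

/* Translate [lo, hi] into byte-range sequences; returns how many were written. */
int utf8_ranges(uint32_t lo, uint32_t hi, ByteSeq *out, int cap)
{
    int n = 0;
    assert(hi <= 0x10FFFF);
    split(lo, hi, out, cap, &n);
    return n;
}

/* Print one alternative in a bracketed form such as [C2-DF][80-BF]. */
void print_seq(const ByteSeq *s)
{
    for (int k = 0; k < s->n; k++)
        printf("[%02X-%02X]", (unsigned)s->lo[k], (unsigned)s->hi[k]);
    printf("\n");
}
```

With this sketch the single interval covering all code points, 0 to 0x10FFFF, already becomes nine alternatives, and a small interval that crosses an encoding-length boundary, such as 0x7F to 0x801, becomes three. A character class built from many intervals multiplies that accordingly.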

      Hans Aberg


