<<NONCHAR>> for flex (was: Re: 32'nd bit & UTF-8)

From: Martin Duerst (duerst@w3.org)
Date: Mon Jan 24 2005 - 03:24:36 CST


    At 05:46 05/01/21, Hans Aberg wrote:

    >The problem is that we do not have a specific lexer at hand, but a lexer
    >generator Flex, and wants to figure out how to make it support Unicode
    >encodings. Then there is no universal way to define exactly how it should
    >act in the case of an error, because different lexers may choose different
    >actions. So, at least in the case of UTF-32, it is convenient tenable
    >regular expressions for all 2^32 numbers. The lexer writer will have to
    >attune that to the Unicode standard.

    I think it's a bad idea to provide a Unicode-enabled version
    of flex (by itself a very good idea) but leave error handling
    in the original encoding to the programmer using flex.

    What I would expect such a Unicode-enabled version of flex to do
    is to have something similar to <<EOF>>, let's call it <<NONCHAR>>
    for the moment. <<NONCHAR>> would match the shortest non-UTF-8 byte
    sequences. The typical use would be for a grammar to have a single
    rule matching <<NONCHAR>>, e.g. like so:

    <<NONCHAR>> fprintf(stderr, "Illegal UTF-8 input.\n"); exit(1);
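
    As a rough illustration of what matching "shortest non-UTF-8 byte
    sequences" could mean, here is a minimal Python sketch. It is not
    flex code, and <<NONCHAR>> does not exist in flex; the names
    tokens() and _second_byte_range() are hypothetical. It assumes one
    possible reading of "shortest": an offending lead byte together
    with any continuation bytes accepted before the sequence went
    wrong.

```python
def _second_byte_range(lead):
    """Return (length, low, high) for a multi-byte lead byte, else None.

    low/high bound the *second* byte; later bytes must be 0x80-0xBF.
    """
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF          # excludes overlong 3-byte forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xED:
        return 3, 0x80, 0x9F          # excludes surrogates
    if lead == 0xF0:
        return 4, 0x90, 0xBF          # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F          # stops at U+10FFFF
    return None                        # 0x80-0xC1 and 0xF5-0xFF


def tokens(data):
    """Yield ("CHAR", bytes) or ("NONCHAR", bytes) for each unit of input."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                # plain ASCII
            yield ("CHAR", data[i:i + 1])
            i += 1
            continue
        spec = _second_byte_range(lead)
        if spec is None:               # byte can never start a character
            yield ("NONCHAR", data[i:i + 1])
            i += 1
            continue
        length, low, high = spec
        j, ok = i + 1, True
        while j < i + length:
            lo_b, hi_b = (low, high) if j == i + 1 else (0x80, 0xBF)
            if j >= n or not (lo_b <= data[j] <= hi_b):
                ok = False             # stop before the byte that failed
                break
            j += 1
        yield ("CHAR" if ok else "NONCHAR", data[i:j])
        i = j
```

    Under this reading, a lexer action attached to the "NONCHAR" case
    sees exactly the bytes that failed, and scanning resumes at the
    next byte.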

    Of course, the average programmer may have a somewhat more user-
    friendly way of telling the user about errors, e.g. including
    the line number and byte position, or continuing after the first
    such error to find others, but stopping at around 10 occurrences
    to not produce too long error logs. Also, if the programmer wants
    to be more specific, e.g. in terms of things such as 'overlong
    sequence', 'high surrogate', or whatever, this can always
    be done by the programmer having a look at yytext, where it
    should find the bytes matched. As a convenience to programmers,
    you could even provide a function that does such analysis
    and that can be called by programmers.
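
    A convenience function of that kind might look roughly like the
    following Python sketch. The name classify() and the category
    strings are made up for illustration. It assumes the caller passes
    the offending lead byte followed by whatever bytes came after it,
    and that the input is already known to be ill-formed.

```python
def classify(seq):
    """Give a rough reason why the ill-formed byte sequence seq is not UTF-8."""
    if not seq or seq[0] < 0x80:
        raise ValueError("expected a non-empty, non-ASCII byte sequence")
    lead = seq[0]
    nxt = seq[1] if len(seq) > 1 else None

    if lead <= 0xBF:
        return "stray continuation byte"
    if lead in (0xC0, 0xC1):
        return "overlong sequence"       # would encode U+0000-U+007F
    if lead >= 0xF5:
        return "invalid lead byte"       # nothing up to U+10FFFF starts here
    if nxt is None or not (0x80 <= nxt <= 0xBF):
        return "truncated sequence"
    if (lead == 0xE0 and nxt < 0xA0) or (lead == 0xF0 and nxt < 0x90):
        return "overlong sequence"
    if lead == 0xED and nxt >= 0xA0:
        # ED A0..AF is U+D800-U+DBFF, ED B0..BF is U+DC00-U+DFFF
        return "high surrogate" if nxt <= 0xAF else "low surrogate"
    if lead == 0xF4 and nxt >= 0x90:
        return "beyond U+10FFFF"
    # The first two bytes are fine, so the problem lies further along.
    return "truncated sequence"
```

    For example, classify(b"\xc0\xaf") gives "overlong sequence" and
    classify(b"\xed\xa0\x80") gives "high surrogate".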

    <<EOF>> is just an acknowledgement that the input is not a sequence
    of bytes in the range 0x00-0xFF, but also includes end-of-file.
    Likewise, <<NONCHAR>> is an acknowledgement that the input is
    not a sequence of Unicode characters, but may include some illegal,
    non-Unicode stuff.

    Requiring the flex programmer to do anything more than something
    like the above is doing completely the wrong thing; rather than
    abstracting Unicode knowledge inside flex, you are exposing it
    to a programmer. The chance that programmers will do the right
    thing with this is very low. In particular because what you
    seem to want to do is to abstract the normal case, but expose
    error conditions.

    Indeed, I would go even a step further, and make sure that flex
    has a default action for <<NONCHAR>>, which would be to stop
    further processing and exit with an error.
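
    The idea of a default action can be sketched in Python as follows.
    The function scan() and its on_nonchar parameter are hypothetical.
    The sketch leans on Python's built-in UTF-8 decoder to find the
    offending bytes, so the exact span it reports is that decoder's
    choice and may differ from what a real <<NONCHAR>> would match.

```python
def scan(data, on_nonchar=None):
    """Decode data as UTF-8, calling on_nonchar(offset, bad_bytes) on errors.

    With no handler installed, the default action is to stop with an error.
    """
    pos = 0
    pieces = []
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode("utf-8"))
            break                                  # rest of input was clean
        except UnicodeDecodeError as err:
            # err.start / err.end are offsets into the slice data[pos:]
            pieces.append(data[pos:pos + err.start].decode("utf-8"))
            bad = data[pos + err.start:pos + err.end]
            if on_nonchar is None:
                raise SystemExit("Illegal UTF-8 input.")
            on_nonchar(pos + err.start, bad)
            pos += err.end                         # resume after the bad bytes
    return "".join(pieces)
```

    A program that wants friendlier reporting passes its own handler,
    for instance one that records the byte offset and gives up after
    some number of errors; everyone else gets the safe default.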

    Those programmers that really want to mess around with UTF-8
    can always hack stuff into the <<NONCHAR>> rule, or can write
    their lexical rules in a byte-oriented tool (e.g. like the current

    In addition to these 'usability' issues, your idea to extend
    the mapping from a subset of integers (those corresponding
    to Unicode characters) to a larger subset (those representable
    by a 32-bit integer) has other problems. One specific one is
    that UTF-8 doesn't allow overlong sequences, i.e. things like
    0xC0 0xAF. But your mapping might just map that to a '/',
    which would be a serious security issue.
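
    To make the risk concrete, here is a small Python sketch. The
    function naive_decode() is a deliberately careless, hypothetical
    decoder that only does the bit arithmetic; strict_decode() wraps
    Python's built-in decoder, which rejects overlong forms.

```python
def naive_decode(data):
    """Decode 1- and 2-byte sequences by bit arithmetic, with no validity checks."""
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80 or i + 1 >= len(data):
            out.append(chr(b0))
            i += 1
        else:
            # Treat every high byte as a 2-byte lead: 110xxxxx 10yyyyyy
            out.append(chr(((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
    return "".join(out)


def strict_decode(data):
    """Return the decoded text, or None if data is not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


payload = b"..\xc0\xaf..\xc0\xafetc"
```

    A byte-level check for '/' finds nothing in payload, yet
    naive_decode(payload) contains two slashes. A filter that inspects
    the bytes and a consumer that decodes them loosely would disagree
    about what the input says, which is how such bugs become exploits.
    strict_decode(payload) returns None instead.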

    You may want to look at
    (search for "sub check_utf8") and
    (search for sub CheckUTF8) for legal and illegal UTF-8 expressed
    as byte-oriented regular expressions.
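
    For readers without those scripts at hand, the following Python
    sketch shows one common way to write well-formed UTF-8 as a
    byte-oriented regular expression. It is not claimed to be identical
    to the expressions in the scripts mentioned above, and the names
    WELL_FORMED_PREFIX and first_error_offset() are made up here.

```python
import re

# Zero or more well-formed UTF-8 characters, written purely in terms of bytes.
WELL_FORMED_PREFIX = re.compile(
    rb"""
    (?:
        [\x00-\x7F]                         # ASCII
      | [\xC2-\xDF][\x80-\xBF]              # 2 bytes; C0 and C1 are excluded
      | \xE0[\xA0-\xBF][\x80-\xBF]          # 3 bytes, overlongs excluded
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3 bytes, general case
      | \xED[\x80-\x9F][\x80-\xBF]          # 3 bytes, surrogates excluded
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4 bytes, overlongs excluded
      | [\xF1-\xF3][\x80-\xBF]{3}           # 4 bytes, general case
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4 bytes, nothing above U+10FFFF
    )*
    """,
    re.VERBOSE,
)


def first_error_offset(data):
    """Return the byte offset of the first ill-formed byte, or None if clean."""
    end = WELL_FORMED_PREFIX.match(data).end()   # length of the clean prefix
    return None if end == len(data) else end
```

    Everything the pattern refuses to match at a given position is, by
    definition, illegal at that position, so the same table serves both
    the "legal" and the "illegal" side of a lexer.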

    And if you are afraid that some time in the future, Unicode
    will have to go beyond U+10FFFF, I think there is no problem
    if you wait for that to happen to update flex. In my view, even
    if you are very young now, the chance that you will still be alive
    when that happens is rather small. Also, please note that such
    an update, if ever necessary, will be much easier if users don't
    have to change their own clunky rules and code that deals with
    illegal stuff.

    Regards, Martin.
