Re: RE: 32'nd bit & UTF-8

From: Philippe VERDY (
Date: Wed Jan 19 2005 - 12:35:28 CST

    > From: "Arcane Jill"
    > As a programmer myself, I actually followed that explanation. But I wonder if
    > it's the right approach. Would it not be a more ... interesting ... approach,
    > to forget Flex, and instead write a brand new Unicode lexer generator which
    > generates a lexer that processes characters (not bytes)?

    Why not JFlex, a free GPL-licensed lexer generator on SourceForge?
    See <> for the documentation, download, and access to its development.

    Yes, it is not a direct replacement for Flex, because it is written in Java and generates Java, but it is still a base from which to generate lexers that will compile as C++. It also has full Unicode support. Its main drawback is the current limit of 64K DFA states (but this could be patched by changing the internal representation of these tables).
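
    To make "processes characters (not bytes)" concrete, here is a minimal hand-written sketch in Java. It is not JFlex output and says nothing about how JFlex works internally; the class and method names (CodePointScanner, words) are made up for this illustration. It walks the input one Unicode code point at a time, so a character outside the BMP is classified as a single unit rather than as two UTF-16 halves or several UTF-8 bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: splits text into runs of letters/digits,
// scanning by Unicode code point instead of by byte or UTF-16 unit.
final class CodePointScanner {

    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);      // full code point, even beyond the BMP
            i += Character.charCount(cp);       // advance by 1 or 2 UTF-16 units
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);    // still inside a word
            } else if (current.length() > 0) {
                out.add(current.toString());    // a separator ends the word
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());        // flush the last word
        }
        return out;
    }

    public static void main(String[] args) {
        // The second word starts with U+1D400 (a supplementary-plane letter).
        for (String w : words("ab \uD835\uDC00c")) {
            System.out.println(w + " has " + w.codePointCount(0, w.length()) + " code points");
        }
    }
}
```

    A generated lexer would replace the isLetterOrDigit test with table-driven DFA transitions, but the input loop would have the same shape.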

    Future JFlex development will probably add support for Unicode character classes and categories.

    The alternative is ANTLR (initially written in C, it is now maintained in Java, but it generates Java classes or C/C++ classes or functions), whose generated interface and skeleton are certainly cleaner (no more need to define kludgy macros).

    Note also that Java now has excellent support for regular expressions, including POSIX character classes, extended with Unicode character classes and categories.
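
    As a small illustration of that regex support, the sketch below uses java.util.regex with Unicode general categories (\p{L}, \p{Mn}, \p{Nd}), a Unicode block (\p{InGreek}) and a POSIX-style class (\p{Alpha}). The class name, the method names and the identifier rule are assumptions made for this example, not anything taken from JFlex or ANTLR. One caveat: in java.util.regex the POSIX-style classes such as \p{Alpha} match only US-ASCII by default, so the Unicode-aware matching here comes from the category and block properties.

```java
import java.util.regex.Pattern;

// Hypothetical example class showing Unicode-aware patterns in java.util.regex.
final class UnicodeRegexDemo {

    // Illustrative identifier rule: a letter, then letters, nonspacing marks,
    // decimal digits or underscores (all by Unicode general category).
    static final Pattern IDENT = Pattern.compile("\\p{L}[\\p{L}\\p{Mn}\\p{Nd}_]*");

    // Characters from the Greek block (U+0370 to U+03FF).
    static final Pattern GREEK = Pattern.compile("\\p{InGreek}+");

    // POSIX-style class; ASCII-only unless extra flags are used.
    static final Pattern ASCII_ALPHA = Pattern.compile("\\p{Alpha}+");

    static boolean isIdentifier(String s) {
        return IDENT.matcher(s).matches();
    }

    static boolean isGreek(String s) {
        return GREEK.matcher(s).matches();
    }

    static boolean isAsciiAlpha(String s) {
        return ASCII_ALPHA.matcher(s).matches();
    }

    public static void main(String[] args) {
        String accented = "\u00e9t\u00e92";     // e-acute, t, e-acute, digit 2
        String greek = "\u03b1\u03b2\u03b3";    // alpha, beta, gamma
        System.out.println("identifier: " + isIdentifier(accented));
        System.out.println("greek block: " + isGreek(greek));
        System.out.println("ASCII alpha: " + isAsciiAlpha(accented));
    }
}
```

    The same property names could serve as the character-class vocabulary of a lexer specification, which is the kind of support mentioned above for JFlex.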
