Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Doug Ewell
Date: Sun Apr 26 2009 - 10:40:25 CDT

    From: "Bjoern Hoehrmann"

    > Now, if we replace each character by its UTF-8 encoding, we would
    > obtain a regular expression and corresponding automata that match the
    > same language, but would operate directly on bytes:
    > /(A|B|...|a|b|...|\xC3\x80|...)(...)/

    I know this isn't the answer you're looking for, but it almost always
    makes more sense to decode UTF-8 code units into Unicode code points
    FIRST and then apply other algorithms to operate on Unicode text,
    instead of trying to build UTF-8 decoding into every algorithm.
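
To make the contrast concrete, here is a rough sketch in Python of both approaches. Everything named here (LETTERS, byte_level_pattern, code_point_pattern, match_bytes, match_decode_first) is made up for illustration, and the tiny character set only stands in for the much larger Unicode set in the quoted expression. It is not meant as anyone's actual implementation.

```python
import re

# Hypothetical stand-in for the "(A|B|...|a|b|...|\xC3\x80|...)" set.
LETTERS = "ABab\u00C0\u00E9"


def byte_level_pattern(chars):
    """Byte-oriented alternation: each character is replaced by its UTF-8 bytes."""
    alternatives = [re.escape(ch.encode("utf-8")) for ch in chars]
    return re.compile(b"(?:" + b"|".join(alternatives) + b")+")


def code_point_pattern(chars):
    """Ordinary character class that operates on decoded code points."""
    return re.compile("[" + re.escape(chars) + "]+")


def match_bytes(data, chars):
    """Match the raw bytes directly; UTF-8 knowledge lives inside the pattern."""
    return byte_level_pattern(chars).fullmatch(data) is not None


def match_decode_first(data, chars):
    """Decode once, then match code points.

    Malformed input raises UnicodeDecodeError at the decoding step,
    before the matcher ever sees it.
    """
    text = data.decode("utf-8")
    return code_point_pattern(chars).fullmatch(text) is not None
```

Both functions accept the same well-formed input. The difference is where malformed input is handled: the decode-first version reports it once, at the boundary, while the byte-level version simply fails to match and every such pattern has to carry the encoding details itself.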

    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
