Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Doug Ewell (doug@ewellic.org)
Date: Sun Apr 26 2009 - 10:40:25 CDT


    From: "Bjoern Hoehrmann" <derhoermi@gmx.net>

    > Now, if we replace each character by its UTF-8 encoding, we would
    > obtain a regular expression and corresponding automata that match the
    > same language, but would operate directly on bytes:
    >
    > /(A|B|...|a|b|...|\xC3\x80|...)(...)/
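For illustration only, the substitution described in the quoted text might be sketched in Python roughly as follows. The helper name `byte_alternation` and the five-letter sample set are assumptions made for this sketch and are not taken from the original message; a real Unicode set would be far larger and would normally be compiled into byte ranges rather than a flat alternation.

```python
import re

def byte_alternation(chars):
    """Build a bytes-regex alternation from the UTF-8 encoding of each character.

    Hypothetical helper: one alternative per character, which is only
    practical for very small sets.
    """
    parts = [re.escape(ch.encode("utf-8")) for ch in chars]
    return b"(?:" + b"|".join(parts) + b")"

# Sample set (an assumption): four ASCII letters plus U+00C0,
# whose UTF-8 encoding is the two bytes C3 80.
letters = ["A", "B", "a", "b", "\u00C0"]

# One or more letters from the set, matched directly against bytes.
pattern = re.compile(byte_alternation(letters) + b"+")

print(pattern.fullmatch("a\u00C0b".encode("utf-8")) is not None)
print(pattern.fullmatch("\u00E9".encode("utf-8")) is not None)
```

The second input, U+00E9, encodes as C3 A9. It shares its lead byte with U+00C0 but is not in the set, so the byte-level pattern rejects it.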

    I know this isn't the answer you're looking for, but it almost always
    makes more sense to decode UTF-8 code units into Unicode code points
    FIRST and then apply other algorithms to operate on Unicode text,
    instead of trying to build UTF-8 decoding into every algorithm.
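
A minimal sketch of the decode-first approach, assuming Python and a small hypothetical character set (ASCII letters plus U+00C0 through U+00D6). The names `WORD` and `find_words` are illustrative and do not come from this thread.

```python
import re

# Hypothetical Unicode set, written in terms of code points rather than bytes.
WORD = re.compile(r"[A-Za-z\u00C0-\u00D6]+")

def find_words(raw: bytes):
    """Decode UTF-8 once, then match on code points."""
    # Strict decoding rejects malformed input with UnicodeDecodeError,
    # so the matching step never sees ill-formed byte sequences.
    text = raw.decode("utf-8")
    return WORD.findall(text)

print(find_words("abc \u00C0b 12".encode("utf-8")))
```

Validation happens in one place here, and the pattern stays readable as a set of code points, at the cost of a decoding pass over the input.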

    --
    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

