Re: UTF-8 based DFAs and Regexps from Unicode sets

From: John (Eljay) Love-Jensen (eljay@adobe.com)
Date: Mon Apr 27 2009 - 07:09:00 CDT

    Hi Asmus,

    > I respectfully disagree.
    >
    > For small amounts of data, and for applications that need to handle
    > multiple data formats/encodings, it makes sense indeed to first convert
    > into a common format and then implement the algorithm only once.
    >
    > However, when you need to scan (in real time) large amounts of data
    > known to be in UTF-8, the conversion costs will kill you. In my
    > consulting practice I've come across cases where that matters.

    Wouldn't it be prudent to express the regular expression in terms of
    Unicode code points, and then have it translated into a matcher for the
    data stream's own encoding (UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or
    UTF-32BE, as appropriate), so that matching runs directly on the stream?

    That seems better than hand-writing the optimized regular expression in
    native UTF-8 in the first place, and perhaps another in UTF-16BE, and
    so on for every other encoding.

    That would avoid the brittleness issue raised by others.

    It would also give you the high performance you are looking for: the
    matcher works on the native data stream, without first decoding it into
    UTF-32 (platform-native 32-bit) characters.

    This puts the burden on the regular expression compiler, which would have
    to be Unicode-savvy and able to optimize the regex for a particular
    Unicode transformation format.
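
    To make the translation step concrete, here is a rough sketch in Python
    of how one code point range from a Unicode-level character class might be
    lowered to UTF-8 byte ranges. It is only an illustration: the function
    names utf8_byte_ranges and as_byte_regex are made up for this example, it
    covers UTF-8 only, and a real regex compiler would presumably feed these
    byte ranges into its DFA construction rather than print them.

```python
def utf8_byte_ranges(lo, hi):
    """Lower the code point range lo..hi (inclusive) to UTF-8 byte ranges.

    Returns a list of alternatives; each alternative is a list of
    (low_byte, high_byte) pairs, one pair per encoded byte.
    Hypothetical helper, not taken from any particular regex library.
    """
    if lo > hi:
        return []
    if hi <= 0x7F:
        # Pure ASCII: one byte, one range.
        return [[(lo, hi)]]
    # Surrogates have no UTF-8 encoding, so cut them out of the range.
    if lo <= 0xDFFF and hi >= 0xD800:
        return utf8_byte_ranges(lo, 0xD7FF) + utf8_byte_ranges(0xE000, hi)
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for last in (0x7F, 0x7FF, 0xFFFF):
        if lo <= last < hi:
            return utf8_byte_ranges(lo, last) + utf8_byte_ranges(last + 1, hi)
    # Split until every continuation byte after the first differing byte
    # covers the full 0x80..0xBF span.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if (lo & mask) != 0:
                return (utf8_byte_ranges(lo, lo | mask)
                        + utf8_byte_ranges((lo | mask) + 1, hi))
            if (hi & mask) != mask:
                return (utf8_byte_ranges(lo, (hi & ~mask) - 1)
                        + utf8_byte_ranges(hi & ~mask, hi))
    # Now the range is "rectangular": pair up the encoded bytes of both ends.
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    return [list(zip(first, last))]


def as_byte_regex(alternatives):
    """Render the byte ranges as a byte-oriented regex string."""
    def part(lo, hi):
        if lo == hi:
            return f"\\x{lo:02X}"
        return f"[\\x{lo:02X}-\\x{hi:02X}]"
    return "|".join("".join(part(lo, hi) for lo, hi in alt)
                    for alt in alternatives)


if __name__ == "__main__":
    # Greek and Coptic block, U+0370..U+03FF
    print(as_byte_regex(utf8_byte_ranges(0x0370, 0x03FF)))
    # prints: \xCD[\xB0-\xBF]|[\xCE-\xCF][\x80-\xBF]
```

    The author of the pattern only ever writes U+0370..U+03FF; the byte-level
    form is derived mechanically, and the same front end could in principle
    emit UTF-16 or UTF-32 code unit ranges instead.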

    Sincerely,
    --Eljay


