Re: UTF-8 based DFAs and Regexps from Unicode sets

From: John (Eljay) Love-Jensen (eljay@adobe.com)
Date: Mon Apr 27 2009 - 07:09:00 CDT

Next message: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Philippe Verdy: "RE: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Hi Asmus,

> I respectfully disagree.
>
> For small amounts of data, and for applications that need to handle
> multiple data formats/encodings, it makes sense indeed to first convert
> into a common format and then implement the algorithm only once.
>
> However, when you need to scan (in real time) large amounts of data
> known to be in UTF-8, the conversion costs will kill you. In my
> consulting practice I've come across cases where that matters.

Wouldn't it be prudent to have the regular expression expressed in Unicode
(that is, in terms of code points), and then translate it into the data
stream's own encoding form, so that matching runs at full speed directly on
the stream: UTF-8, or UTF-16LE, or UTF-16BE, or UTF-32LE, or UTF-32BE as
appropriate?

That seems preferable to specifying the optimized regular expression in
native UTF-8 in the first place, and perhaps another in UTF-16BE, and
perhaps another in yada yada...

That would avoid the brittleness issue raised by others.

It would also give the high performance you are looking for, since the
matcher works on the native data stream without first decoding it into
UTF-32 (platform native 32-bit) characters.

That puts the burden on the regular expression compiler, which would have to
be Unicode savvy and able to optimize the regex into a particular Unicode
transformation format.
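
To make that concrete, here is a rough sketch of the one piece such a
compiler would need for UTF-8: turning a code point range into alternatives
of byte ranges. It is only an illustration under my own assumptions, not any
existing regex engine's implementation; the names utf8_sequences and
to_bytes_regex are made up for this example, and Python's re module (with
bytes patterns) merely stands in for the byte-level matcher.

```python
import re

def utf8_sequences(lo, hi):
    """Yield lists of (byte_lo, byte_hi) pairs that together match exactly
    the UTF-8 encodings of code points lo..hi (surrogates are skipped)."""
    # Surrogates U+D800..U+DFFF have no UTF-8 encoding: cut them out.
    if lo <= 0xDFFF and hi >= 0xD800:
        if lo < 0xD800:
            yield from utf8_sequences(lo, 0xD7FF)
        if hi > 0xDFFF:
            yield from utf8_sequences(0xE000, hi)
        return
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for max_cp in (0x7F, 0x7FF, 0xFFFF):
        if lo <= max_cp < hi:
            yield from utf8_sequences(lo, max_cp)
            yield from utf8_sequences(max_cp + 1, hi)
            return
    if hi <= 0x7F:
        yield [(lo, hi)]
        return
    # Split until every trailing byte position covers one contiguous
    # range, so per-position byte ranges describe the set exactly.
    for i in range(1, 4):
        m = (1 << (6 * i)) - 1
        if (lo & ~m) != (hi & ~m):
            if lo & m != 0:
                yield from utf8_sequences(lo, lo | m)
                yield from utf8_sequences((lo | m) + 1, hi)
                return
            if hi & m != m:
                yield from utf8_sequences(lo, (hi & ~m) - 1)
                yield from utf8_sequences(hi & ~m, hi)
                return
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    yield list(zip(first, last))

def to_bytes_regex(lo, hi):
    """Build a bytes regex for code points lo..hi (hypothetical helper).
    Assumes the range contains at least one non-surrogate code point."""
    alts = []
    for seq in utf8_sequences(lo, hi):
        alts.append("".join(
            f"\\x{a:02x}" if a == b else f"[\\x{a:02x}-\\x{b:02x}]"
            for a, b in seq))
    return ("(?:" + "|".join(alts) + ")").encode("ascii")

# Greek capitals U+0391..U+03A9 become the byte pattern \xce[\x91-\xa9]
greek = re.compile(to_bytes_regex(0x0391, 0x03A9))
print(greek.pattern)
print(bool(greek.fullmatch("\u03a9".encode("utf-8"))))
```

The regex author only ever writes the code point range; the byte-level form
is derived from it, so a UTF-16 or UTF-32 back end could be generated from
the same source expression by swapping out this one step.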

Sincerely,
--Eljay

