From: John (Eljay) Love-Jensen (firstname.lastname@example.org)
Date: Mon Apr 27 2009 - 07:09:00 CDT
> I respectfully disagree.
> For small amounts of data, and for applications that need to handle
> multiple data formats/encodings, it makes sense indeed to first convert
> into a common format and then implement the algorithm only once.
> However, when you need to scan (in real time) large amounts of data
> known to be in UTF-8, the conversion costs will kill you. In my
> consulting practice I've come across cases where that matters.
Wouldn't it be prudent to express the regular expression in Unicode, and then
translate it (for performance on the data stream, in the data stream's own
format) into a UTF-8 one, or UTF-16LE, or UTF-16BE, or UTF-32LE, or UTF-32BE,
as appropriate?
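As a minimal sketch of that idea (Python; the pattern and stream contents here are invented for illustration): a pattern written over Unicode code points can be re-encoded into a byte pattern and run directly against an undecoded UTF-8 byte stream.

```python
import re

# Sketch: translate a literal Unicode pattern into a UTF-8 byte pattern,
# so matching runs on the raw byte stream with no decoding step.
unicode_pattern = "na\u00efve"                    # 'naïve', over code points
utf8_pattern = re.compile(re.escape(unicode_pattern.encode("utf-8")))

stream = "A na\u00efve approach".encode("utf-8")  # incoming UTF-8 bytes
match = utf8_pattern.search(stream)
print(match is not None)                          # True: hit, no decode
```

Literal patterns translate trivially, as above; character classes and case-insensitive matching are where a real translator has to do more work, since a code-point range does not map to a single byte range in UTF-8.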
Rather than specifying the optimized regular expression in native UTF-8 in
the first place, and perhaps another in UTF-16BE, and perhaps yet another in
each of the other encoding forms.
That would avoid the brittleness issue raised by others.
And it would give the high performance you are looking for -- working on the
native data stream without first decoding it into UTF-32 (platform-native
32-bit code points).
This puts the burden on the regular expression compiler, which would have to
be Unicode savvy, and able to optimize the regex into a particular Unicode
encoding form.
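What such a Unicode-savvy compiler might do can be sketched as follows (Python; `utf8_two_byte_class` is a hypothetical name, and the sketch is deliberately limited to code points that encode as two UTF-8 bytes): a code-point range is compiled into byte-range classes that match its UTF-8 encodings directly.

```python
import re

def utf8_two_byte_class(lo, hi):
    """Compile the code-point range [lo, hi] (both within U+0080..U+07FF,
    i.e. two bytes in UTF-8) into a byte-level regex matching the UTF-8
    encodings of exactly those code points."""
    assert 0x80 <= lo <= hi <= 0x7FF
    lo1, lo2 = chr(lo).encode("utf-8")   # lead byte, continuation byte
    hi1, hi2 = chr(hi).encode("utf-8")

    def cls(a, b):                       # byte class like [\xa0-\xbf]
        return b"[\\x%02x-\\x%02x]" % (a, b)

    if lo1 == hi1:                       # same lead byte: one product
        parts = [cls(lo1, lo1) + cls(lo2, hi2)]
    else:                                # split at the lead-byte boundaries
        parts = [cls(lo1, lo1) + cls(lo2, 0xBF)]
        if hi1 - lo1 > 1:
            parts.append(cls(lo1 + 1, hi1 - 1) + cls(0x80, 0xBF))
        parts.append(cls(hi1, hi1) + cls(0x80, hi2))
    return b"(?:" + b"|".join(parts) + b")"

# U+00E0..U+00FF ('à'..'ÿ') becomes a pure byte pattern:
accented = re.compile(utf8_two_byte_class(0xE0, 0xFF))
print(accented.fullmatch("\u00e9".encode("utf-8")) is not None)  # True: 'é'
```

A production compiler would extend the same splitting to the three- and four-byte encoding lengths, but the principle is the one above: each code-point range becomes an alternation of byte-class products, and the scanner never decodes.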
This archive was generated by hypermail 2.1.5 : Mon Apr 27 2009 - 07:14:24 CDT