Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Apr 28 2009 - 13:50:10 CDT

Next message: N. Ganesan: "Hangul > Linear Jamo converters"

Previous message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

On 4/28/2009 9:54 AM, Sam Mason wrote:
> On Tue, Apr 28, 2009 at 06:20:16PM +0200, Philippe Verdy wrote:
>
>> Sam Mason wrote:
>>
>>> High-volume and real-time are not normally compatible; either
>>> you want to optimise for the common case and ensure that it's
>>> fast (and hence you're good for high-volume) or you optimise
>>> to make sure that worst case response time is above some
>>> cut-off (and you can give real-time guarantees). These are
>>> very different constraints and call for different implementations.
>>>
>> That's not true: applications that include BOTH high-volume and real-time
>> needs include text editors, notably when editing large text files (including
>> programming source files or XML datafiles) or performing search-replace
>> operations.
>>
>
> A text editor is barely high-volume and definitely not real-time---at
> least not according to any definitions that I would know. An editor
> obviously wants to process the data reasonably quickly but each file is
> only going to be a few million bytes (at the large end) and you don't
> have to worry about someone's heart stopping or a nuclear meltdown if
> you miss your deadline by a millisecond because you had to swap in a
> *single* page of data. Real-time has a precise technical definition and
> it's this I'm assuming people are talking about, I'm not aware of one
> for high-volume though I'd be thinking of sustained bandwidths on the
> order of 100MB/s.
>
I was the one who used the term "real-time" not in the strict technical
sense you were expecting, but more in the sense that Philippe was
interpreting it - a time constraint based on performance criteria
vis-a-vis an interactive human user.

If you want to be able to perform an operation in an acceptable time
window for interactive use, and making sure that you can process a
certain minimum amount of data (e.g. parse files of certain sizes) in
that time window, you have a particular constraint.

This constraint is equivalent to burst-mode high-volume, i.e. high data
rates during the targeted time window, so these are perhaps not two
different scenarios.

> As far as I understand the issue was to do with efficient processing of
> UTF-8 (or maybe UTF-16) source data and writing algorithms that directly
> work in said formats rather than converting (I'm suggesting that this
> should be done piecemeal) to a standard internal representation first.
>
>
Right, this applies whenever you can't choose the encoding of your data
stream.

For UTF-8 there are many tasks where conversion can be entirely avoided,
or can be avoided for a large percentage of the data. In reading the
Unihan Database, for example, each of the > 1 million lines contains two
ASCII-only fields. The character code, e.g. "U+4E00" and the tag name
e.g. "kRSUnicode". Only the third field will contain unrestricted UTF-8
(depending on the tag).

About 1/2 of the 28MB file therefore can be read as ASCII. Any
conversion is wasted effort, and performance gains became visible the
minute my tokenizer was retargeted to collect tokens in UTF-8.

With such optimizations that avoid data conversion, as well as some
other ones that exploit redundancy in the file format, my program will
be able to read the Unihan DB in its original form, within acceptable
time constraints for interactive use, making conservative estimates for
available hardware performance. Other programs have different design
goals, and would perhaps go about things differently, for example by
preprocessing the DB.

The point is, there are occasional scenarios where close attention to
the cost of data conversion pays off. Piecemeal conversion (one line at
a time) definitely is too coarse, and if you wrap it into a "getline"
type API, that adds even more overhead. So, that's recommended only
where text throughput is not critical.

As to whether "just-in-time" conversion (a few characters ahead of the
parser) would be close enough to no conversion to allow the parser to be
in UTF-16 - and under what conditions - I can't speak to that, because I
haven't tried it.

A./

Next message: N. Ganesan: "Hangul > Linear Jamo converters"
Previous message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 13:53:56 CDT