Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Asmus Freytag
Date: Tue Apr 28 2009 - 13:50:10 CDT


    On 4/28/2009 9:54 AM, Sam Mason wrote:
    > On Tue, Apr 28, 2009 at 06:20:16PM +0200, Philippe Verdy wrote:
    >> Sam Mason wrote:
    >>> High-volume and real-time are not normally compatible; either
    >>> you want to optimise for the common case and ensure that it's
    >>> fast (and hence you're good for high-volume) or you optimise
    >>> to make sure that the worst case response time is below some
    >>> cut-off (and you can give real-time guarantees). These are
    >>> very different constraints and call for different implementations.
    >> That's not true: applications that include BOTH high-volume and real-time
    >> needs include text editors, notably when editing large text files (including
    >> programming source files or XML datafiles) or performing search-replace
    >> operations.
    > A text editor is barely high-volume and definitely not real-time---at
    > least not according to any definitions that I would know. An editor
    > obviously wants to process the data reasonably quickly but each file is
    > only going to be a few million bytes (at the large end) and you don't
    > have to worry about someone's heart stopping or a nuclear meltdown if
    > you miss your deadline by a millisecond because you had to swap in a
    > *single* page of data. Real-time has a precise technical definition and
    > it's this I'm assuming people are talking about, I'm not aware of one
    > for high-volume though I'd be thinking of sustained bandwidths on the
    > order of 100MB/s.
    I was the one who used the term "real-time" not in the strict technical
    sense you were expecting, but more in the sense that Philippe was
    interpreting it - a time constraint based on performance criteria
    vis-a-vis an interactive human user.

    If you want to be able to perform an operation in an acceptable time
    window for interactive use, while ensuring that you can process a
    certain minimum amount of data (e.g. parse files of certain sizes) in
    that window, you have a particular constraint.

    This constraint is equivalent to burst-mode high-volume, i.e. high data
    rates during the targeted time window, so these are perhaps not two
    different scenarios.

    > As far as I understand the issue was to do with efficient processing of
    > UTF-8 (or maybe UTF-16) source data and writing algorithms that directly
    > work in said formats rather than converting (I'm suggesting that this
    > should be done piecemeal) to a standard internal representation first.
    Right, this applies whenever you can't choose the encoding of your data.

    For UTF-8 there are many tasks where conversion can be entirely avoided,
    or can be avoided for a large percentage of the data. In reading the
    Unihan Database, for example, each of the > 1 million lines contains two
    ASCII-only fields: the character code, e.g. "U+4E00", and the tag name,
    e.g. "kRSUnicode". Only the third field may contain unrestricted UTF-8
    (depending on the tag).
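
    To make the field layout concrete, here is a minimal sketch of that
    split, assuming tab-separated fields as in the Unihan file; the helper
    name and the sample line are my own illustration, not actual Unihan
    tooling:

```python
# Sketch: split one Unihan-style line without decoding it up front.
# Assumption: tab-separated fields "code point <TAB> tag <TAB> value".
def parse_unihan_line(raw: bytes):
    code, tag, value = raw.rstrip(b"\n").split(b"\t", 2)
    # The first two fields are ASCII-only, so decoding them is trivial;
    # only the value field may need real UTF-8 handling, and it can stay
    # as raw bytes until something actually needs it decoded.
    return code.decode("ascii"), tag.decode("ascii"), value

line = "U+4E00\tkDefinition\tone; a, an; alone".encode("utf-8")
print(parse_unihan_line(line))
```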

    About half of the 28 MB file can therefore be read as ASCII. Any
    conversion of that portion is wasted effort, and performance gains
    became visible the minute my tokenizer was retargeted to collect
    tokens directly in UTF-8.
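
    The reason a tokenizer can safely stay in UTF-8 is that bytes
    0x00-0x7F occur in UTF-8 only as ASCII characters, never inside a
    multi-byte sequence, so splitting on ASCII delimiters is correct on
    the raw bytes. A minimal sketch of that idea (my own illustration,
    not the tokenizer described above):

```python
# Sketch: a byte-level tokenizer that yields tokens as raw UTF-8,
# with no conversion to an internal representation. It relies on the
# UTF-8 property that delimiter bytes like tab, space, and newline
# can never appear as trail bytes of a multi-byte sequence.
def tokens(data: bytes, delims: bytes = b"\t\n "):
    start = None
    for i, b in enumerate(data):
        if b in delims:
            if start is not None:
                yield data[start:i]   # token stays in UTF-8, no decode
                start = None
        elif start is None:
            start = i
    if start is not None:
        yield data[start:]

print(list(tokens("U+4E00 kRSUnicode 1.0".encode("utf-8"))))
```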

    With such optimizations that avoid data conversion, as well as some
    others that exploit redundancy in the file format, my program will be
    able to read the Unihan DB in its original form within acceptable
    time constraints for interactive use, making conservative estimates
    of available hardware performance. Other programs have different
    design goals and would perhaps go about things differently, for
    example by preprocessing the DB.

    The point is, there are occasional scenarios where close attention to
    the cost of data conversion pays off. Piecemeal conversion (one line at
    a time) is definitely too coarse, and if you wrap it into a "getline"
    type API, that adds even more overhead. So, that approach is
    recommended only where text throughput is not critical.
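
    To illustrate the contrast, here is a rough sketch of the two access
    patterns, using made-up sample data; it only shows where the
    conversion cost lands, it is not a benchmark:

```python
# Sketch: "getline"-style conversion vs. direct byte access.
import io

data = "U+4E00\tkDefinition\tone; a, an; alone\n".encode("utf-8") * 3

# "getline"-style: every byte is decoded to an internal string before
# the parser ever sees it, whether the parser needs that or not.
decoded_lines = list(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"))

# Direct byte access: the parser works on raw UTF-8 lines and would
# convert only the fields that actually require it.
raw_lines = data.split(b"\n")[:-1]

assert len(decoded_lines) == len(raw_lines) == 3
```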

    As to whether "just-in-time" conversion (a few characters ahead of the
    parser) would be close enough to no conversion to allow the parser to be
    in UTF-16 - and under what conditions - I can't speak to that, because I
    haven't tried it.


    This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 13:53:56 CDT