Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason
Date: Wed Apr 29 2009 - 05:37:03 CDT


    On Tue, Apr 28, 2009 at 11:50:10AM -0700, Asmus Freytag wrote:
    > On 4/28/2009 9:54 AM, Sam Mason wrote:
    > >A text editor is barely high-volume and definitely not real-time---at
    > >least not according to any definitions that I would know.
    > I was the one who used the term "real-time" not in the strict technical
    > sense you were expecting, but more in the sense that Philippe was
    > interpreting it - a time constraint based on performance criteria
    > vis-a-vis an interactive human user.

    OK, sorry; I'm used to more academic mailing lists where technical terms
    are used with their formal meaning. I'll do my best to adjust here!

    > If you want to be able to perform an operation in an acceptable time
    > window for interactive use, and making sure that you can process a
    > certain minimum amount of data (e.g. parse files of certain sizes) in
    > that time window, you have a particular constraint.
    > This constraint is equivalent to burst-mode high-volume, i.e. high data
    > rates during the targeted time window, so these are perhaps not two
    > different scenarios.

    I think we're discussing two different things here; getting work done
    with tools that exist now, and designing new tools. We all pick tools
    that are best suited to our jobs; I write software so I have a general
    purpose text editor that can edit my code and the odd text or other
    files as I need. Sometimes I get a 20GB text file that I need to fiddle
    with before it's "valid" and I tend to write a bit of code to chop out
    the bad bits so I can just work on them.

    It would be nice if I could use my normal text editor on such files,
    but I don't think that's going to happen for quite a while---it takes
    several minutes just to read/write the file to disk.
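    The "chop out the bad bits" step mentioned above could be sketched
    roughly like this---a hypothetical helper that streams a file line by
    line, keeps the lines that decode cleanly as UTF-8, and diverts the
    rest to a separate file for hand-fixing (the function name and demo
    data are my own, not anything from this thread):

    ```python
    import io

    def split_valid_utf8(src, good, bad):
        """Copy lines that decode as UTF-8 to good; send the rest to bad."""
        for line in src:
            try:
                line.decode("utf-8")
                good.write(line)
            except UnicodeDecodeError:
                bad.write(line)

    # Small in-memory demo; on a real 20GB file you would pass open file
    # objects in binary mode instead, e.g. open("huge.txt", "rb").
    src = io.BytesIO(b"ok line\n\xff\xfe bad\nanother ok\n")
    good, bad = io.BytesIO(), io.BytesIO()
    split_valid_utf8(src, good, bad)
    print(good.getvalue())   # lines that decoded cleanly
    print(bad.getvalue())    # lines containing invalid UTF-8
    ```

    Because it streams rather than loading the whole file, this sort of
    one-off script stays within memory limits even when the editor can't.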

    > For UTF-8 there are many tasks where conversion can be entirely avoided,
    > or can be avoided for a large percentage of the data. In reading the
    > Unihan Database, for example, each of the > 1 million lines contains two
    > ASCII-only fields. The character code, e.g. "U+4E00" and the tag name
    > e.g. "kRSUnicode". Only the third field will contain unrestricted UTF-8
    > (depending on the tag).
    > About 1/2 of the 28MB file therefore can be read as ASCII. Any
    > conversion is wasted effort, and performance gains became visible the
    > minute my tokenizer was retargeted to collect tokens in UTF-8.
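    To make the quoted point concrete: Unihan data files are tab-separated,
    with a code point field, a tag field, and a value field, and only the
    value can contain unrestricted UTF-8. A minimal sketch of parsing that
    avoids decoding the ASCII-only fields (my own illustration, not Asmus's
    actual tokenizer) might look like:

    ```python
    # Unihan lines look like: U+4E00<TAB>kRSUnicode<TAB>1.0
    # Only the third field needs UTF-8 handling; the first two are ASCII,
    # so keeping them as raw bytes skips the conversion work entirely.

    def parse_unihan_line(raw: bytes):
        """Split one Unihan line, decoding only the value field as UTF-8."""
        if raw.startswith(b"#") or not raw.strip():
            return None                      # comment or blank line
        code, tag, value = raw.rstrip(b"\n").split(b"\t", 2)
        return code, tag, value.decode("utf-8")

    print(parse_unihan_line(b"U+4E00\tkRSUnicode\t1.0\n"))
    # -> (b'U+4E00', b'kRSUnicode', '1.0')
    ```

    The code point and tag stay as bytes, so roughly half the file is never
    run through a UTF-8 decoder at all, which is where the claimed
    performance gain comes from.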

    I think it would depend on how complicated the tool is and how bad it
    would be if it went "wrong". Code with simpler semantics generally
    contains fewer bugs, and if it performs "well enough" to do the job and
    doesn't give the "wrong" answer then I'm not sure what the problem is.

    There's also a big difference here between little bits of code written
    to solve one-off problems and code that's used by many people. When
    solving one-off problems you want to use a tool that allows you to write
    correct code quickly even if the resulting code is going to run slower,
    whereas if you're writing code that's used by lots of people then it's
    worth spending longer getting fast code.

    What actually counts as "well enough" or "wrong" is very context
    sensitive; if it's throwaway code just for me, having it abort is OK,
    but if it's a fancy GUI program used by lots of people it is not.
    > The point is, there are occasional scenarios where close attention to
    > the cost of data conversion pays off.

    Indeed there are! I'm just saying that there isn't one "correct" answer
    for every situation.


    This archive was generated by hypermail 2.1.5 : Wed Apr 29 2009 - 05:40:49 CDT