Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason
Date: Wed Apr 29 2009 - 05:37:03 CDT


    On Tue, Apr 28, 2009 at 11:50:10AM -0700, Asmus Freytag wrote:
    > On 4/28/2009 9:54 AM, Sam Mason wrote:
    > >A text editor is barely high-volume and definitely not real-time---at
    > >least not according to any definitions that I would know.
    > I was the one who used the term "real-time" not in the strict technical
    > sense you were expecting, but more in the sense that Philippe was
    > interpreting it - a time constraint based on performance criteria
    > vis-a-vis an interactive human user.

    OK, sorry; I'm used to more academic mailing lists where technical terms
    are used with their formal meaning. I'll do my best to adjust here!

    > If you want to be able to perform an operation in an acceptable time
    > window for interactive use, and making sure that you can process a
    > certain minimum amount of data (e.g. parse files of certain sizes) in
    > that time window, you have a particular constraint.
    > This constraint is equivalent to burst-mode high-volume, i.e. high data
    > rates during the targeted time window, so these are perhaps not two
    > different scenarios.

    I think we're discussing two different things here; getting work done
    with tools that exist now, and designing new tools. We all pick tools
    that are best suited to our jobs; I write software so I have a general
    purpose text editor that can edit my code and the odd text or other
    files as I need. Sometimes I get a 20GB text file that I need to fiddle
    with before it's "valid" and I tend to write a bit of code to chop out
    the bad bits so I can just work on them.

    It would be nice if I could use my normal text editor on such files,
    but I don't think that's going to happen for quite a while---it takes
    several minutes just to read/write the file to disk.
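    The "chop out the bad bits" step mentioned above could be sketched
    roughly like this---a hypothetical helper that streams a file line by
    line, keeps the lines that decode cleanly as UTF-8, and diverts the
    rest to a separate file for hand-fixing (the function name and demo
    data are my own, not anything from this thread):

    ```python
    import io

    def split_valid_utf8(src, good, bad):
        """Copy lines that decode as UTF-8 to good; send the rest to bad."""
        for line in src:
            try:
                line.decode("utf-8")
                good.write(line)
            except UnicodeDecodeError:
                bad.write(line)

    # Small in-memory demo; on a real 20GB file you would pass open file
    # objects in binary mode instead, e.g. open("huge.txt", "rb").
    src = io.BytesIO(b"ok line\n\xff\xfe bad\nanother ok\n")
    good, bad = io.BytesIO(), io.BytesIO()
    split_valid_utf8(src, good, bad)
    print(good.getvalue())   # lines that decoded cleanly
    print(bad.getvalue())    # lines containing invalid UTF-8
    ```

    Because it streams rather than loading the whole file, this sort of
    one-off script stays within memory limits even when the editor can't.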

    > For UTF-8 there are many tasks where conversion can be entirely avoided,
    > or can be avoided for a large percentage of the data. In reading the
    > Unihan Database, for example, each of the > 1 million lines contains two
    > ASCII-only fields. The character code, e.g. "U+4E00" and the tag name
    > e.g. "kRSUnicode". Only the third field will contain unrestricted UTF-8
    > (depending on the tag).
    > About 1/2 of the 28MB file therefore can be read as ASCII. Any
    > conversion is wasted effort, and performance gains became visible the
    > minute my tokenizer was retargeted to collect tokens in UTF-8.
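    To make the quoted point concrete: Unihan data files are tab-separated,
    with a code point field, a tag field, and a value field, and only the
    value can contain unrestricted UTF-8. A minimal sketch of parsing that
    avoids decoding the ASCII-only fields (my own illustration, not Asmus's
    actual tokenizer) might look like:

    ```python
    # Unihan lines look like: U+4E00<TAB>kRSUnicode<TAB>1.0
    # Only the third field needs UTF-8 handling; the first two are ASCII,
    # so keeping them as raw bytes skips the conversion work entirely.

    def parse_unihan_line(raw: bytes):
        """Split one Unihan line, decoding only the value field as UTF-8."""
        if raw.startswith(b"#") or not raw.strip():
            return None                      # comment or blank line
        code, tag, value = raw.rstrip(b"\n").split(b"\t", 2)
        return code, tag, value.decode("utf-8")

    print(parse_unihan_line(b"U+4E00\tkRSUnicode\t1.0\n"))
    # -> (b'U+4E00', b'kRSUnicode', '1.0')
    ```

    The code point and tag stay as bytes, so roughly half the file is never
    run through a UTF-8 decoder at all, which is where the claimed
    performance gain comes from.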

    I think it would depend on how complicated the tool is and how bad it
    would be if it went "wrong". Code with simpler semantics generally
    contains fewer bugs, and if it performs "well enough" to do the job and
    doesn't give the "wrong" answer then I'm not sure what the problem is.

    There's also a big difference here between little bits of code written
    to solve one-off problems and code that's used by many people. When
    solving one-off problems you want to use a tool that allows you to write
    correct code quickly even if the resulting code is going to run slower,
    whereas if you're writing code that's used by lots of people then it's
    worth spending longer getting fast code.

    What actually counts as "well enough" or "wrong" is very context
    sensitive; if it's throwaway code just for me, having it abort is OK,
    but if it's a fancy GUI program used by lots of people it is not.
    > The point is, there are occasional scenarios where close attention to
    > the cost of data conversion pays off.

    Indeed there are! I'm just saying that there isn't one "correct" answer
    for every situation.


    This archive was generated by hypermail 2.1.5 : Wed Apr 29 2009 - 05:40:49 CDT