Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason (
Date: Tue Apr 28 2009 - 11:54:46 CDT

    On Tue, Apr 28, 2009 at 06:20:16PM +0200, Philippe Verdy wrote:
    > Sam Mason wrote:
    > > High-volume and real-time are not normally compatible; either
    > > you want to optimise for the common case and ensure that it's
    > > fast (and hence you're good for high-volume) or you optimise
    > > to make sure that worst case response time is below some
    > > cut-off (and you can give real-time guarantees). These are
    > > very different constraints and call for different implementations.
    > That's not true: applications that include BOTH high-volume and real-time
    > needs include text editors, notably when editing large text files (including
    > programming source files or XML datafiles) or performing search-replace
    > operations.

    A text editor is barely high-volume and definitely not real-time---at
    least not according to any definitions that I would know. An editor
    obviously wants to process the data reasonably quickly but each file is
    only going to be a few million bytes (at the large end) and you don't
    have to worry about someone's heart stopping or a nuclear meltdown if
    you miss your deadline by a millisecond because you had to swap in a
    *single* page of data. Real-time has a precise technical definition,
    and I'm assuming that is the one people mean. I'm not aware of one
    for high-volume, though I'd be thinking of sustained bandwidths on the
    order of 100MB/s.

    > This problem can occur when the text buffer is managed as a single
    > unstructured memory block (instead of a series of blocks with reasonable
    > variable sizes between two minimum and maximum threshold values for block
    > splitting or merging), due to the number of copies that are needed for
    > making insertions/deletions, or for computing presentation features like
    > syntax coloring, or nested sub-block compression/expansion, automatic line
    > numbering, or handling position markers, or comparing two files side by side
    > and coloring diffs.

    I fail to see how those have much to do with the issues of dealing
    directly with encoded characters.

    > Performance problems can then occur after each edit operation, with the
    > editor GUI freezing for seconds or sometimes minutes. Affected software
    > includes, for example, Notepad (old versions up to Windows XP) and Notepad++
    > (all versions including the current one).
    > Similar problems can occur with programming environments that are also
    > parsing and/or compiling the sources directly while they are being edited (to
    > show syntactic helpers or to compute a logical structure or navigable tree,
    > e.g. in Eclipse, NetBeans, or Visual Studio).

    Then those are bugs/features in those specific programs and they need to
    be worked around or fixed however is best for those programs.

    > Similar problems will occur in multi-users environments like web sites that
    > are using interactive features like dynamic searches or data selections.
    > This includes for example the CLDR Survey (whose performance is critical
    > because it needs to handle both large volumes and reasonable response times
    > for users, otherwise sessions or scripts may timeout and can cause the site
    > to become almost unusable above a quite small number of users).

    These seem like more reasonable points; but these systems are all free
    to choose whatever internal encoding they prefer (and to quite a large
    extent the external encoding as well) and their code can be optimised

    As far as I understand, the issue was efficient processing of
    UTF-8 (or maybe UTF-16) source data: writing algorithms that work
    directly on those formats rather than first converting to a standard
    internal representation (I'm suggesting that this conversion should
    be done piecemeal).
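
    To make "piecemeal" concrete, here is a minimal sketch of what I have
    in mind, assuming Python and its standard incremental UTF-8 decoder.
    The function names and the sample data are purely illustrative and
    not taken from any code discussed in this thread. The input is
    converted to code points one chunk at a time, so the whole file never
    has to be held in a second, decoded buffer.

```python
import codecs


def decode_piecemeal(chunks):
    """Yield code points from an iterable of UTF-8 byte chunks.

    Only one chunk is decoded at a time. A multi-byte sequence that is
    split across a chunk boundary is buffered by the incremental decoder
    until its remaining bytes arrive.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        for ch in decoder.decode(chunk):
            yield ord(ch)
    # final=True makes the decoder raise UnicodeDecodeError if the
    # input stopped in the middle of a multi-byte sequence.
    for ch in decoder.decode(b"", final=True):
        yield ord(ch)


def count_matches(chunks, in_set):
    """Count code points for which the predicate in_set(cp) is true."""
    return sum(1 for cp in decode_piecemeal(chunks) if in_set(cp))


# Hypothetical sample: split the bytes every 3 bytes so that some
# characters straddle a chunk boundary.
text = "na\u00efve caf\u00e9 \u20ac5"
data = text.encode("utf-8")
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]

non_ascii = count_matches(chunks, lambda cp: cp > 0x7F)
```

    The set test here is just a predicate on code points; an algorithm
    working directly on the encoded bytes would replace both functions
    with a byte-level automaton, which is the trade-off under discussion.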

