RE: UTF-8 based DFAs and Regexps from Unicode sets

From: Philippe Verdy (
Date: Tue Apr 28 2009 - 11:20:16 CDT

    > -----Original Message-----
    > From:
    > [] On Behalf Of Sam Mason
    > Sent: Tuesday, April 28, 2009 13:22
    > To: Unicode Mailing List
    > Subject: Re: UTF-8 based DFAs and Regexps from Unicode sets
    > On Mon, Apr 27, 2009 at 10:24:03AM -0700, Asmus Freytag wrote:
    > > On 4/27/2009 5:09 AM, John (Eljay) Love-Jensen wrote:
    > > >Rather than specifying the optimized regular expression in native
    > > >UTF-8 in the first place, and perhaps another in UTF-16BE, and
    > > >perhaps another in yada yada...
    > > >
    > > >That would avoid the brittleness issue raised by others.
    > >
    > > That's a good point and a bit orthogonal to what I was trying to
    > > highlight. My focus was on calling attention to the fact that
    > > multi-step implementations with separate and independent phases for
    > > conversion and algorithmic text processing can be cost-prohibitive
    > > in high-volume (real-time) applications. Such application domains
    > > exist and are real scenarios, even though they are not the standard
    > > case.
    > High-volume and real-time are not normally compatible; either
    > you want to optimise for the common case and ensure that it's
    > fast (and hence you're good for high-volume) or you optimise
    > to make sure that worst case response time is above some
    > cut-off (and you can give real-time guarantees). These are
    > very different constraints and call for different implementations.

    That's not true: applications with BOTH high-volume and real-time needs
    include text editors, notably when editing large text files (including
    program source files or XML data files) or performing search-and-replace
    operations.

    The problem arises when the text buffer is managed as a single
    unstructured memory block (instead of a series of blocks of reasonable
    variable size, kept between minimum and maximum thresholds by splitting
    or merging blocks), because of the number of copies needed to make
    insertions and deletions, or to compute presentation features such as
    syntax coloring, collapsing and expanding nested sub-blocks, automatic
    line numbering, handling position markers, or comparing two files side by
    side and coloring the differences.
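
    To illustrate the block-based layout described above, here is a minimal
    sketch in Python. It is not taken from any particular editor: the class
    name BlockBuffer, the parameters min_size and max_size, and the tiny
    default thresholds are all hypothetical, and Python strings stand in for
    raw memory blocks. An insertion rewrites only the block it lands in and
    splits that block if it grows past the maximum; a deletion merges blocks
    that fall below the minimum.

```python
class BlockBuffer:
    """Hypothetical sketch: text stored as a list of bounded-size blocks."""

    def __init__(self, text="", min_size=4, max_size=16):
        # Assumed invariant: splitting an oversized block must never
        # produce a piece smaller than min_size.
        assert 0 < 2 * min_size <= max_size
        self.min_size = min_size
        self.max_size = max_size
        self.blocks = self._split(text)  # always at least one block

    def _split(self, text):
        """Cut text into near-equal pieces no longer than max_size."""
        if len(text) <= self.max_size:
            return [text]
        n = -(-len(text) // self.max_size)  # ceiling division
        base, extra = divmod(len(text), n)
        pieces, start = [], 0
        for k in range(n):
            end = start + base + (1 if k < extra else 0)
            pieces.append(text[start:end])
            start = end
        return pieces

    def _locate(self, pos):
        """Map a character position to (block index, offset in block)."""
        for i, block in enumerate(self.blocks):
            if pos <= len(block):
                return i, pos
            pos -= len(block)
        raise IndexError("position past end of buffer")

    def insert(self, pos, s):
        i, off = self._locate(pos)
        grown = self.blocks[i][:off] + s + self.blocks[i][off:]
        # Only this one block is copied; the others are left untouched.
        self.blocks[i:i + 1] = self._split(grown)

    def delete(self, pos, count):
        i, off = self._locate(pos)
        while count > 0 and i < len(self.blocks):
            block = self.blocks[i]
            take = min(count, len(block) - off)
            self.blocks[i] = block[:off] + block[off + take:]
            count -= take
            i, off = i + 1, 0
        self._merge_small()

    def _merge_small(self):
        """Merge undersized blocks with a neighbour, re-splitting if needed.

        For brevity this scans every block; a real editor would only look
        at the blocks touched by the edit.
        """
        i = 0
        while i < len(self.blocks) and len(self.blocks) > 1:
            if len(self.blocks[i]) < self.min_size:
                lo = i - 1 if i > 0 else i
                merged = self.blocks[lo] + self.blocks[lo + 1]
                self.blocks[lo:lo + 2] = self._split(merged)
                i = lo  # re-check the merged result
            else:
                i += 1

    def text(self):
        return "".join(self.blocks)
```

    With this layout the cost of an edit depends on the size of the affected
    block rather than on the size of the whole file, which is the point of
    keeping block sizes between the two thresholds. Locating the block is
    still a linear scan over the block list in this sketch; an index over
    the blocks would be the obvious next step.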

    Performance problems can then occur after each edit operation, with the
    editor GUI freezing for seconds or sometimes minutes. Affected software
    includes, for example, Notepad (old versions up to Windows XP) and
    Notepad++ (all versions, including the current one).

    Similar problems can occur in programming environments that parse and/or
    compile the sources while they are being edited (to show syntax helpers
    or to compute a logical structure or navigable tree, e.g. in Eclipse,
    NetBeans, or Visual Studio).

    Similar problems will occur in multi-user environments such as web sites
    that offer interactive features like dynamic searches or data selections.
    This includes, for example, the CLDR Survey, whose performance is critical
    because it needs to handle large volumes while keeping reasonable response
    times for users; otherwise sessions or scripts may time out, and the site
    can become almost unusable above a fairly small number of users.
