RE: UTF-8 based DFAs and Regexps from Unicode sets

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Apr 28 2009 - 11:20:16 CDT

Next message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Andreas Prilop: "Re: Bidi demo"
In reply to: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org] On behalf of Sam Mason
> Sent: Tuesday, April 28, 2009 13:22
> To: Unicode Mailing List
> Subject: Re: UTF-8 based DFAs and Regexps from Unicode sets
>
>
> On Mon, Apr 27, 2009 at 10:24:03AM -0700, Asmus Freytag wrote:
> > On 4/27/2009 5:09 AM, John (Eljay) Love-Jensen wrote:
> > >Rather than specifying the optimized regular expression in native
> > >UTF-8 in the first place, and perhaps another in UTF-16BE, and
> > >perhaps another in yada yada...
> > >
> > >That would avoid the brittleness issue raised by others.
> >
> > That's a good point and a bit orthogonal to what I was trying to
> > highlight. My focus was on calling attention to the fact that
> > multi-step implementations with separate and independent phases for
> > conversion and algorithmic text processing can be cost-prohibitive
> > in high-volume (real-time) applications. Such application domains
> > exist and are real scenarios, even though they are not the standard
> > case.
>
> High-volume and real-time are not normally compatible; either
> you want to optimise for the common case and ensure that it's
> fast (and hence you're good for high-volume) or you optimise
> to make sure that worst case response time is above some
> cut-off (and you can give real-time guarantees). These are
> very different constraints and call for different implementations.

That's not true: applications with BOTH high-volume and real-time needs
include text editors, notably when editing large text files (including
program source files or XML data files) or performing search-and-replace
operations.

Performance problems arise when the text buffer is managed as a single
unstructured memory block (instead of a series of variable-size blocks kept
between minimum and maximum size thresholds that trigger block merging or
splitting), because of the number of copies needed for insertions and
deletions, and for computing presentation features such as syntax coloring,
collapsing or expanding nested sub-blocks, automatic line numbering,
handling position markers, or comparing two files side by side and coloring
the diffs.
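
To illustrate the block-based alternative described above, here is a minimal
Python sketch. The class name ChunkedBuffer and the parameters min_size and
max_size are hypothetical names chosen for this example; they do not come
from any of the editors discussed. It only shows the splitting and merging
idea: an edit rewrites one small block rather than the whole text. Locating
a position is still a linear scan over the block list here, which a real
editor would presumably replace with a tree or an index.

```python
import math


class ChunkedBuffer:
    """Text stored as a list of small blocks instead of one big string."""

    def __init__(self, text="", min_size=4, max_size=16):
        # Assumption for this sketch: max_size is at least 2 * min_size.
        self.min_size = min_size
        self.max_size = max_size
        self.blocks = [text[i:i + max_size]
                       for i in range(0, len(text), max_size)] or [""]

    def _locate(self, pos):
        """Map an absolute position to (block index, offset in that block)."""
        for i, block in enumerate(self.blocks):
            if pos <= len(block):
                return i, pos
            pos -= len(block)
        raise IndexError("position past end of buffer")

    def insert(self, pos, s):
        """Insert s at pos; only the affected block is rebuilt."""
        i, off = self._locate(pos)
        new = self.blocks[i][:off] + s + self.blocks[i][off:]
        if len(new) > self.max_size:
            # Split an oversized block into pieces of roughly equal size.
            n = math.ceil(len(new) / self.max_size)
            size = math.ceil(len(new) / n)
            self.blocks[i:i + 1] = [new[j:j + size]
                                    for j in range(0, len(new), size)]
        else:
            self.blocks[i] = new

    def delete(self, pos, length):
        """Delete length characters starting at pos."""
        i, off = self._locate(pos)
        while length > 0 and i < len(self.blocks):
            block = self.blocks[i]
            cut = min(length, len(block) - off)
            self.blocks[i] = block[:off] + block[off + cut:]
            length -= cut
            i, off = i + 1, 0
        self._merge_small()

    def _merge_small(self):
        """Merge undersized blocks with a neighbour when the result fits."""
        merged = []
        for block in self.blocks:
            small = merged and (len(merged[-1]) < self.min_size
                                or len(block) < self.min_size)
            if small and len(merged[-1]) + len(block) <= self.max_size:
                merged[-1] += block
            else:
                merged.append(block)
        self.blocks = [b for b in merged if b] or [""]

    def text(self):
        return "".join(self.blocks)


buf = ChunkedBuffer("abcdefghijklmnopqrstuvwxyz", min_size=4, max_size=8)
buf.insert(3, "XYZ")   # rebuilds and splits only the first block
buf.delete(0, 6)       # empties one block, which is then merged away
print(buf.blocks)
```

With a single contiguous block, the same insertion would have to shift every
character after the insertion point, which is the copying cost described
above.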

Performance problems can then occur after each edit operation, with the
editor GUI freezing for seconds or sometimes minutes. Affected software
includes, for example, Notepad (old versions up to Windows XP) and Notepad++
(all versions, including the current one).

Similar problems can occur with programming environments that also parse
and/or compile the sources directly while they are being edited (to show
syntactic helpers or to compute a logical structure or navigable tree,
e.g. in Eclipse, NetBeans, or Visual Studio).

Similar problems will occur in multi-user environments such as web sites
that offer interactive features like dynamic searches or data selections.
This includes, for example, the CLDR Survey, whose performance is critical
because it needs to handle both large volumes and reasonable response times
for users; otherwise sessions or scripts may time out, and the site can
become almost unusable above a quite small number of users.


This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 11:24:53 CDT