Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason (sam@samason.me.uk)
Date: Tue Apr 28 2009 - 11:54:46 CDT

Next message: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Philippe Verdy: "RE: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Philippe Verdy: "RE: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Philippe Verdy: "RE: UTF-8 based DFAs and Regexps from Unicode sets"

On Tue, Apr 28, 2009 at 06:20:16PM +0200, Philippe Verdy wrote:
> Sam Mason wrote:
> > High-volume and real-time are not normally compatible; either
> > you want to optimise for the common case and ensure that it's
> > fast (and hence you're good for high-volume) or you optimise
> > to make sure that worst case response time is below some
> > cut-off (and you can give real-time guarantees). These are
> > very different constraints and call for different implementations.
>
> That's not true: applications that include BOTH high-volume and real-time
> needs include text editors, notably when editing large text files (including
> programming source files or XML datafiles) or performing search-replace
> operations.

A text editor is barely high-volume and definitely not real-time---at
least not according to any definitions that I would know. An editor
obviously wants to process the data reasonably quickly but each file is
only going to be a few million bytes (at the large end) and you don't
have to worry about someone's heart stopping or a nuclear meltdown if
you miss your deadline by a millisecond because you had to swap in a
*single* page of data. Real-time has a precise technical definition,
and I'm assuming that's what people are talking about. I'm not aware of
one for high-volume, though I'd be thinking of sustained bandwidths on
the order of 100MB/s.

> This problem can occur when the text buffer is managed as a single
> unstructured memory block (instead of a series of blocks with reasonable
> variable sizes between two minimum and maximum threshold values for block
> splitting or merging), due to the number of copies that are needed for
> making insertions/deletions, or for computing presentation features like
> syntax coloring, or nested sub-block compression/expansion, automatic line
> numbering, or handling position markers, or comparing two files side by side
> and coloring diffs.

I fail to see how those have much to do with the issues of dealing
directly with encoded characters.

> Performance problems can then occur after each edit operation, with the
> editor GUI freezing for seconds or sometimes minutes. Affected software
> includes, for example, Notepad (old versions up to Windows XP) or Notepad++
> (all versions including the current one).
>
> Similar problems can occur with programming environments that are also
> parsing and/or compiling the sources directly while they are being edited (to
> show syntactic helpers or to compute a logical structure or navigable tree,
> e.g. in Eclipse, NetBeans, or Visual Studio).

Then those are bugs/features in those specific programs and they need to
be worked around or fixed however is best for those programs.

> Similar problems will occur in multi-user environments like web sites that
> are using interactive features like dynamic searches or data selections.
> This includes for example the CLDR Survey (whose performance is critical
> because it needs to handle both large volumes and reasonable response times
> for users, otherwise sessions or scripts may time out and can cause the site
> to become almost unusable above a quite small number of users).

These seem like more reasonable points; but these systems are all free
to choose whatever internal encoding they prefer (and to quite a large
extent the external encoding as well) and their code can be optimised
appropriately.

As far as I understand, the issue was about efficient processing of
UTF-8 (or maybe UTF-16) source data: writing algorithms that work
directly in those formats rather than converting to a standard internal
representation first (I'm suggesting that the conversion should be done
piecemeal).
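
To make "piecemeal" a bit more concrete, here is a rough sketch in
Python of the sort of thing I mean: the encoded data arrives in chunks
and each chunk is converted to the internal representation as it comes,
rather than decoding the whole input up front. The helper names
(decode_piecemeal, split_bytes) and the chunk size are made up for
illustration; only codecs.getincrementaldecoder is the standard library
call, and I'm assuming strict error handling is what's wanted.

```python
import codecs


def decode_piecemeal(chunks, encoding="utf-8"):
    """Yield decoded text for each chunk of encoded bytes.

    Hypothetical helper: the incremental decoder buffers any trailing
    bytes of a multi-byte sequence that is split across two chunks, so
    chunk boundaries need not fall on character boundaries.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; with errors="strict" this raises
    # UnicodeDecodeError if the input stopped in the middle of a sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def split_bytes(data, size):
    """Hypothetical helper: cut a byte string into fixed-size pieces."""
    return [data[i:i + size] for i in range(0, len(data), size)]


if __name__ == "__main__":
    sample = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e".encode("utf-8")
    # A chunk size of 4 deliberately splits some multi-byte sequences.
    for piece in decode_piecemeal(split_bytes(sample, 4)):
        print(repr(piece))
```

Only a small amount of decoder state is carried between chunks, so the
cost of the conversion is spread over the input instead of being paid
in one go.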

-- 
  Sam  http://samason.me.uk/


This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 11:58:07 CDT