RE: UTF-8 based DFAs and Regexps from Unicode sets

From: Philippe Verdy (
Date: Tue Apr 28 2009 - 18:27:27 CDT

  • Next message: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

    Sam Mason wrote:
    > On Tue, Apr 28, 2009 at 06:20:16PM +0200, Philippe Verdy wrote:
    > > Sam Mason wrote:
    > > > High-volume and real-time are not normally compatible; either you
    > > > want to optimise for the common case and ensure that it's
    > fast (and
    > > > hence you're good for high-volume) or you optimise to
    > make sure that
    > > > worst case response time is above some cut-off (and you can give
    > > > real-time guarantees). These are very different constraints and
    > > > call for different implementations.
    > >
    > > That's not true: applications that include BOTH high-volume and
    > > real-time needs include text editors, notably when editing
    > large text
    > > files (including programming source files or XML datafiles) or
    > > performing search-replace operations.
    > A text editor is barely high-volume and definitely not
    > real-time---at least not according to any definitions that I
    > would know. An editor obviously wants to process the data
    > reasonably quickly but each file is only going to be a few
    > million bytes (at the large end) and you don't have to worry
    > about someone's heart stopping or a nuclear meltdown if you
    > miss your deadline by a millisecond because you had to swap in a
    > *single* page of data. Real-time has a precise technical
    > definition and it's this I'm assuming people are talking
    > about, I'm not aware of one for high-volume though I'd be
    > thinking of sustained bandwidths on the order of 100MB/s.
    > > This problem can occur when the text buffer is managed as a single
    > > unstructured memory block (instead of a series of blocks with
    > > reasonable variable sizes between two minimum and maximum
    > threshold
    > > values for block splitting or merging), due to the number of copies
    > > that are needed for making insertions/deletions, or for computing
    > > presentation features like syntax coloring, or nested sub-block
    > > compression/expansion, automatic line numbering, or
    > handling position
    > > markers, or comparing two files side by side and coloring diffs.
    > I fail to see how those have much to do with the issues of
    > dealing directly with encoded characters.

    Did I relate this to the encoding of characters? No, I was speaking about
    the volume of data that needs to be constantly moved in memory in most plain
    text editors.
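    The block-splitting idea quoted above can be sketched roughly as follows.
    This is an illustrative toy (the class name and block sizes are invented,
    not taken from any real editor): the text is kept as a list of
    bounded-size chunks, so an insertion or deletion rewrites only one small
    chunk instead of moving the whole buffer, which is what keeps edits in
    huge files affordable.

```python
MAX_BLOCK = 2048  # split any block that grows beyond this size (arbitrary)

class BlockBuffer:
    """Toy text buffer stored as a list of variable-size blocks."""

    def __init__(self, text=""):
        self.blocks = [text[i:i + MAX_BLOCK]
                       for i in range(0, len(text), MAX_BLOCK)] or [""]

    def _locate(self, pos):
        """Map an absolute position to (block index, offset in block)."""
        for i, b in enumerate(self.blocks):
            if pos <= len(b):
                return i, pos
            pos -= len(b)
        raise IndexError("position out of range")

    def insert(self, pos, text):
        i, off = self._locate(pos)
        b = self.blocks[i]
        merged = b[:off] + text + b[off:]
        # Re-split only the affected block if it grew too large;
        # the rest of the buffer is untouched.
        self.blocks[i:i + 1] = [merged[j:j + MAX_BLOCK]
                                for j in range(0, len(merged), MAX_BLOCK)] or [""]

    def delete(self, pos, length):
        i, off = self._locate(pos)
        while length > 0 and i < len(self.blocks):
            b = self.blocks[i]
            take = min(length, len(b) - off)
            self.blocks[i] = b[:off] + b[off + take:]
            length -= take
            if not self.blocks[i] and len(self.blocks) > 1:
                del self.blocks[i]   # drop emptied blocks (a real editor
            else:                    # would also merge small neighbours)
                i += 1
            off = 0

    def text(self):
        return "".join(self.blocks)

buf = BlockBuffer("a" * 3000)
buf.insert(1000, "XYZ")   # rewrites one 2048-byte block, not 3000 bytes
buf.delete(1000, 3)
```

    A single flat string would instead copy everything after the edit point
    on every operation, which is exactly the constant memory movement
    complained about above.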

    And I don't think I am alone in frequently needing to process quite large
    plain text files in text editors. They come up all the time when working
    on data exports from various applications, where you need to find and
    correct specific data that cannot easily be fixed in the origin
    application without risking missing many occurrences, or when checking
    that the application behaves as expected and handles all the data it is
    supposed to process.

    Text editors are quite handy for such tasks, and with them you do not
    need to write a specific script to perform a one-time operation or check
    that also requires human supervision. Text editors have tools that are
    often missing in applications, including support for regexps and
    side-by-side comparison of two versions of a data file.

    It does not mean that everything must be edited by hand; sometimes it is
    enough to be able to spot where the problems are located in large volumes
    of data, but without having to wait for seconds or minutes (I was not
    speaking about milliseconds). Any operation that takes more than a
    handful of seconds in a text editor is a good candidate for optimisation,
    because waiting too long between operations can severely impact your
    work, or your ability to complete it in a reasonable time (and may
    finally force you to abandon the task or postpone it indefinitely, hoping
    that the quirks that were found will find another solution later).

    Such a situation is quite frequent when you have to deal with the
    necessary evolution of a data model in an existing application. At the
    beginning you may have thought that the volume involved would allow easy
    correction directly within the application, but after some time you end
    up with data whose volume has exploded and whose quality or reusability
    is low. Editors can be very helpful for tasks like data normalisation,
    extraction, and reclassification, to fit the data to a newer model that
    will avoid these caveats in the future.

    The fact that this text data is encoded in UTF-8 or UTF-16 or any other
    encoding is secondary. The only important thing is that, independently of
    the internal encoding effectively used in the editor, the most common
    source of inefficiency or slowdown is the difficulty of matching the text
    with regular expressions (their complexity and limitations can require
    performing multiple successive transformations, in a way that is not
    always predictable when you start the job). While doing this, if the
    text editor constantly freezes for seconds, you will easily forget what
    you were trying to do, or things you had hoped not to forget before
    applying the successive transformation steps.

    This case occurs when handling data export files (such as database
    dumps), or when fixing machine-generated code (that is partially complete
    and requires additional human control to solve ambiguities), e.g. when
    handling conflicting diff files between two versions of a source code.
    My usage of text editors frequently falls into the problem of handling
    files of more than 60 000 lines, with complex formatting, or full of
    programming language macros. These cases also occur when porting existing
    applications to another platform/API/framework, when performing reverse
    engineering work, or when debugging an existing app whose sources alone
    cannot explain the reason for some bugs: you are dealing with profile
    data, usage logs, command histories, statistics collection files, ...

    All this requires editors that respond very fast when handling large
    files. This is not the same case as editing an email or forum message
    that fits on a single screen. And most often, the only tool you have for
    the job is a text editor and its ability to perform many successive fast
    searches, copy-paste operations, or automated substitutions. Ideally this
    should be independent of the encoding used in the external storage files,
    but it has frequently been observed that text editors do not work with
    the external encoding but with a single internal encoding; when the text
    editor is inherently written to handle streams of bytes rather than full
    characters, you run into the problem of representation, which in turn
    impacts performance: an internal UTF-8 representation is usually faster
    when it offers better volume compression than UTF-16.
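    That volume claim is easy to check: whether UTF-8 actually beats UTF-16
    in size depends entirely on the script of the text. A quick sketch (the
    sample strings below are of course invented):

```python
# ASCII-heavy text is half the size in UTF-8 compared to UTF-16 (1 byte
# per character vs 2), while BMP ideographic text is 1.5x larger in
# UTF-8 (3 bytes per character vs 2).
samples = {
    "ascii source code": "int main() { return 0; }" * 100,
    "french prose": "Le cœur a ses raisons que la raison ignore. " * 100,
    "japanese prose": "これは日本語のテキストです。" * 100,
}
for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{name}: UTF-8 {u8} B, UTF-16 {u16} B, ratio {u8 / u16:.2f}")
```

    So for source code, logs, and XML (mostly ASCII markup), UTF-8 halves the
    volume moved in memory; for plain CJK prose the advantage reverses.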

    But sometimes it pays to have the text editor internally store the text
    in a more compressed format, by splitting the text into blocks with a
    maximum size, and then using a background layer to convert these
    compressed blocks (e.g. zipped/gzipped) into a cache with an encoding
    adapted to the task at hand. If you have to write regexps based on
    UTF-8-encoded byte streams, it becomes nearly impossible to write them
    without lots of errors or omitted cases. The task to perform will use the
    best encoding for its purpose, but you can't say up front that the
    arbitrary choice of UTF-8 will solve everything, and you can't expect
    that users of a text editor will learn to use multiple regexps for
    different encodings.
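    That difficulty is precisely what the thread subject is about: even a
    simple codepoint interval explodes into several byte-range sequences
    when expressed over raw UTF-8 bytes. The sketch below shows one way to
    do the required range splitting (the function name is mine, the strategy
    follows the general approach used by UTF-8 DFA generators; surrogate
    gaps are ignored for brevity):

```python
def utf8_ranges(lo, hi):
    """Expand the codepoint range lo..hi into sequences of UTF-8 byte
    ranges; each result is [(first_lo, first_hi), (next_lo, next_hi), ...]
    and together they match exactly the codepoints in lo..hi."""
    out = []
    stack = [(lo, hi)]
    while stack:
        s, e = stack.pop()
        again = True
        while again:
            again = False
            # First split across UTF-8 encoding-length boundaries.
            for limit in (0x7F, 0x7FF, 0xFFFF):
                if s <= limit < e:
                    stack.append((limit + 1, e))
                    e = limit
                    again = True
                    break
            if again:
                continue
            # Then align s/e so every continuation byte spans a full
            # 0x80-0xBF range and the sequence matches exactly s..e.
            for i in (1, 2, 3):            # trailing continuation bytes
                m = (1 << (6 * i)) - 1
                if (s & ~m) != (e & ~m):
                    if (s & m) != 0:
                        stack.append(((s | m) + 1, e))
                        e = s | m
                        again = True
                        break
                    if (e & m) != m:
                        stack.append((e & ~m, e))
                        e = (e & ~m) - 1
                        again = True
                        break
        sb, eb = chr(s).encode("utf-8"), chr(e).encode("utf-8")
        out.append(list(zip(sb, eb)))
    return out

# Even the small range U+00A0..U+00FF needs two byte-range sequences:
for seq in utf8_ranges(0xA0, 0xFF):
    print(" ".join(f"[{a:02X}-{b:02X}]" if a != b else f"[{a:02X}]"
                   for a, b in seq))
# prints:
#   [C2] [A0-BF]
#   [C3] [80-BF]
```

    Writing such alternations by hand for a realistic Unicode character
    class, and keeping them correct across edits, is exactly the kind of
    error-prone work that users of a byte-oriented editor would be forced
    into.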

    This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 18:31:13 CDT