RE: UTF-8 based DFAs and Regexps from Unicode sets

From: Philippe Verdy (
Date: Tue Apr 28 2009 - 18:27:27 CDT

  • Next message: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

    Sam Mason wrote:
    > On Tue, Apr 28, 2009 at 06:20:16PM +0200, Philippe Verdy wrote:
    > > Sam Mason wrote:
    > > > High-volume and real-time are not normally compatible; either you
    > > > want to optimise for the common case and ensure that it's
    > fast (and
    > > > hence you're good for high-volume) or you optimise to
    > make sure that
    > > > worst case response time is above some cut-off (and you can give
    > > > real-time guarantees). These are very different constraints and
    > > > call for different implementations.
    > >
    > > That's not true: applications that include BOTH high-volume and
    > > real-time needs include text editors, notably when editing
    > large text
    > > files (including programming source files or XML datafiles) or
    > > performing search-replace operations.
    > A text editor is barely high-volume and definitely not
    > real-time---at least not according to any definitions that I
    > would know. An editor obviously wants to process the data
    > reasonably quickly but each file is only going to be a few
    > million bytes (at the large end) and you don't have to worry
    > about someone's heart stopping or a nuclear meltdown if you
    > miss your deadline by a millisecond because you had to swap in a
    > *single* page of data. Real-time has a precise technical
    > definition and it's this I'm assuming people are talking
    > about, I'm not aware of one for high-volume though I'd be
    > thinking of sustained bandwidths on the order of 100MB/s.
    > > This problem can occur when the text buffer is managed as a single
    > > unstructured memory block (instead of a series of blocks with
    > > reasonable variable sizes between two minimum and maximum
    > threshold
    > > values for block splitting or merging), due to the number of copies
    > > that are needed for making insertions/deletions, or for computing
    > > presentation features like syntax coloring, or nested sub-block
    > > compression/expansion, automatic line numbering, or
    > handling position
    > > markers, or comparing two files side by side and coloring diffs.
    > I fail to see how those have much to do with the issues of
    > dealing directly with encoded characters.

    Did I relate this to the encoding of characters? No, I was speaking about
    the volume of data that needs to be constantly moved in memory in most plain
    text editors.
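    The block-splitting idea quoted above can be sketched roughly as follows.
    This is an illustrative toy (the class name and block sizes are invented,
    not taken from any real editor): the text is kept as a list of
    bounded-size chunks, so an insertion or deletion rewrites only one small
    chunk instead of moving the whole buffer, which is what keeps edits in
    huge files affordable.

```python
MAX_BLOCK = 2048  # split any block that grows beyond this size (arbitrary)

class BlockBuffer:
    """Toy text buffer stored as a list of variable-size blocks."""

    def __init__(self, text=""):
        self.blocks = [text[i:i + MAX_BLOCK]
                       for i in range(0, len(text), MAX_BLOCK)] or [""]

    def _locate(self, pos):
        """Map an absolute position to (block index, offset in block)."""
        for i, b in enumerate(self.blocks):
            if pos <= len(b):
                return i, pos
            pos -= len(b)
        raise IndexError("position out of range")

    def insert(self, pos, text):
        i, off = self._locate(pos)
        b = self.blocks[i]
        merged = b[:off] + text + b[off:]
        # Re-split only the affected block if it grew too large;
        # the rest of the buffer is untouched.
        self.blocks[i:i + 1] = [merged[j:j + MAX_BLOCK]
                                for j in range(0, len(merged), MAX_BLOCK)] or [""]

    def delete(self, pos, length):
        i, off = self._locate(pos)
        while length > 0 and i < len(self.blocks):
            b = self.blocks[i]
            take = min(length, len(b) - off)
            self.blocks[i] = b[:off] + b[off + take:]
            length -= take
            if not self.blocks[i] and len(self.blocks) > 1:
                del self.blocks[i]   # drop emptied blocks (a real editor
            else:                    # would also merge small neighbours)
                i += 1
            off = 0

    def text(self):
        return "".join(self.blocks)

buf = BlockBuffer("a" * 3000)
buf.insert(1000, "XYZ")   # rewrites one 2048-byte block, not 3000 bytes
buf.delete(1000, 3)
```

    A single flat string would instead copy everything after the edit point
    on every operation, which is exactly the constant memory movement
    complained about above.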

    And I don't think I am alone in frequently needing to process quite large
    plain text files in text editors. They come up all the time when working
    on data exports from various applications, where you need to find and
    correct specific data that cannot easily be fixed in the origin
    application without risking missing many occurrences, or when checking
    that the application behaves as expected and handles all the data it is
    supposed to process.

    Text editors are quite handy for such tasks, and with them you do not
    need to write a specific script to perform a one-time operation or check
    that also requires human supervision. Text editors have tools that are
    often missing in applications, including support for regexps and
    side-by-side comparison of two versions of a data file.

    It does not mean that everything must be edited by hand; sometimes it is
    enough to be able to spot where the problems are located in large volumes
    of data, but without having to wait for seconds or minutes (I was not
    speaking about milliseconds). Any operation that takes more than a
    handful of seconds in a text editor is a good candidate for optimisation,
    because waiting too long between operations can severely impact your
    work, or your ability to complete it in a reasonable time (and may
    finally force you to abandon the task or postpone it indefinitely, hoping
    that the quirks that were found will find another solution later).

    Such a situation is quite frequent when you have to deal with the
    necessary evolution of a data model in an existing application. At the
    beginning you may have thought that the volume involved would allow easy
    correction directly within the application, but after some time you end
    up with data whose volume has exploded and whose quality or reusability
    is low. Editors can be very helpful for tasks like data normalisation,
    extraction, and reclassification, to fit the data to a newer model that
    will avoid these caveats in the future.

    The fact that this text data is encoded in UTF-8 or UTF-16 or any other
    encoding is secondary. The only important thing is that, independently of
    the internal encoding effectively used in the editor, the most common
    source of inefficiency or slowdown is the difficulty of matching the text
    with regular expressions (their complexity and limitations can require
    performing multiple successive transformations, in a way that is not
    always predictable when you start the job). While doing this, if the
    text editor constantly freezes for seconds, you will easily forget what
    you were trying to do, or things you had hoped not to forget before
    applying the successive transformation steps.

    This case occurs when handling data export files (such as database
    dumps), or when fixing machine-generated code (that is partially complete
    and requires additional human control to solve ambiguities), e.g. when
    handling conflicting diff files between two versions of a source code.
    My usage of text editors frequently falls into the problem of handling
    files of more than 60 000 lines, with complex formatting, or full of
    programming language macros. These cases also occur when porting existing
    applications to another platform/API/framework, when performing reverse
    engineering work, or when debugging an existing app whose sources alone
    cannot explain the reason for some bugs: you are dealing with profile
    data, usage logs, command histories, statistics collection files, ...

    All this requires editors that respond very fast when handling large
    files. This is not the same case as editing an email or forum message
    that fits on a single screen. And most often, the only tool you have for
    the job is a text editor and its ability to perform many successive fast
    searches, copy-paste operations, or automated substitutions. Ideally this
    should be independent of the encoding used in the external storage files,
    but it has frequently been observed that text editors do not work with
    the external encoding but with a single internal encoding; when the text
    editor is inherently written to handle streams of bytes rather than full
    characters, you run into the problem of representation, which in turn
    impacts performance: an internal UTF-8 representation is usually faster
    when it offers better volume compression than UTF-16.
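    That volume claim is easy to check: whether UTF-8 actually beats UTF-16
    in size depends entirely on the script of the text. A quick sketch (the
    sample strings below are of course invented):

```python
# ASCII-heavy text is half the size in UTF-8 compared to UTF-16 (1 byte
# per character vs 2), while BMP ideographic text is 1.5x larger in
# UTF-8 (3 bytes per character vs 2).
samples = {
    "ascii source code": "int main() { return 0; }" * 100,
    "french prose": "Le cœur a ses raisons que la raison ignore. " * 100,
    "japanese prose": "これは日本語のテキストです。" * 100,
}
for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{name}: UTF-8 {u8} B, UTF-16 {u16} B, ratio {u8 / u16:.2f}")
```

    So for source code, logs, and XML (mostly ASCII markup), UTF-8 halves the
    volume moved in memory; for plain CJK prose the advantage reverses.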

    But sometimes it pays to have the text editor internally store the text
    in a more compressed format, by splitting the text into blocks with a
    maximum size, and then using a background layer to convert these
    compressed blocks (e.g. zipped/gzipped) into a cache with an encoding
    adapted to the task at hand. If you have to write regexps based on
    UTF-8-encoded byte streams, it becomes nearly impossible to write them
    without lots of errors or omitted cases. The task to perform will use the
    best encoding for its purpose, but you can't say up front that the
    arbitrary choice of UTF-8 will solve everything, and you can't expect
    that users of a text editor will learn to use multiple regexps for
    different encodings.
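    That difficulty is precisely what the thread subject is about: even a
    simple codepoint interval explodes into several byte-range sequences
    when expressed over raw UTF-8 bytes. The sketch below shows one way to
    do the required range splitting (the function name is mine, the strategy
    follows the general approach used by UTF-8 DFA generators; surrogate
    gaps are ignored for brevity):

```python
def utf8_ranges(lo, hi):
    """Expand the codepoint range lo..hi into sequences of UTF-8 byte
    ranges; each result is [(first_lo, first_hi), (next_lo, next_hi), ...]
    and together they match exactly the codepoints in lo..hi."""
    out = []
    stack = [(lo, hi)]
    while stack:
        s, e = stack.pop()
        again = True
        while again:
            again = False
            # First split across UTF-8 encoding-length boundaries.
            for limit in (0x7F, 0x7FF, 0xFFFF):
                if s <= limit < e:
                    stack.append((limit + 1, e))
                    e = limit
                    again = True
                    break
            if again:
                continue
            # Then align s/e so every continuation byte spans a full
            # 0x80-0xBF range and the sequence matches exactly s..e.
            for i in (1, 2, 3):            # trailing continuation bytes
                m = (1 << (6 * i)) - 1
                if (s & ~m) != (e & ~m):
                    if (s & m) != 0:
                        stack.append(((s | m) + 1, e))
                        e = s | m
                        again = True
                        break
                    if (e & m) != m:
                        stack.append((e & ~m, e))
                        e = (e & ~m) - 1
                        again = True
                        break
        sb, eb = chr(s).encode("utf-8"), chr(e).encode("utf-8")
        out.append(list(zip(sb, eb)))
    return out

# Even the small range U+00A0..U+00FF needs two byte-range sequences:
for seq in utf8_ranges(0xA0, 0xFF):
    print(" ".join(f"[{a:02X}-{b:02X}]" if a != b else f"[{a:02X}]"
                   for a, b in seq))
# prints:
#   [C2] [A0-BF]
#   [C3] [80-BF]
```

    Writing such alternations by hand for a realistic Unicode character
    class, and keeping them correct across edits, is exactly the kind of
    error-prone work that users of a byte-oriented editor would be forced
    into.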

    This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 18:31:13 CDT