RE: UTF-8 based DFAs and Regexps from Unicode sets

From: Philippe Verdy (
Date: Sun Apr 26 2009 - 21:46:51 CDT

  • Next message: John (Eljay) Love-Jensen: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

    Sam Mason wrote:
    > On Sun, Apr 26, 2009 at 10:57:49AM -0700, Mark Davis wrote:
    > > I'd disagree about that. It is certainly simpler to always
    > process as
    > > code points, but where performance is required, you need to design
    > > your algorithms with the encoding of the core string
    > representation in
    > > mind, typically UTF-8 or UTF-16. You can get huge speedups
    > in that way.
    > Are there any pointers to literature about that? I'd be
    > interested to see how this sort of scheme would hang
    > together; there would seem to be quite a trade-off between
    > instruction cache pressure, branch prediction and most
    > probably other effects I can't think of at the moment.

    I also agree that processing UTF-8 directly, instead of converting it
    first to code points, can benefit overall performance, but this must not
    become a credo. You can also get exactly the opposite, because of the
    increased number of conditional branches that will appear in the
    compiled code.

    You may try to reduce the number of conditional branches, but if this
    comes at the price of more lookups (or larger lookup tables), the cache
    misses will also have a severe impact on performance. Then you can try
    to reduce the overall processing by inlining the reader into the lexer.
    My opinion is that mixing UTF-8 decoding with the general processing of
    Unicode algorithms based on code points will complicate things so much
    that code correctness will be difficult to prove, and test coverage of
    the code (and of the data in lookup tables) will decrease, making the
    software even more fragile.

    In this domain, it really helps to optimize as much as possible at the
    local level of the code, notably in the innermost processing loop, which
    must remain small, with small lookup tables and as few lookups per
    character as possible.

    The code should also be carefully profiled before deciding which
    compilation will produce the best branch predictions (they are important
    for increasing the internal parallelism of the processor pipelines),
    including statically, when branch prediction statistics are not yet
    available or have been purged from internal processor caches: if your
    innermost processing loop uses too many branches with too many cases to
    handle, the implementation will become quite slow (notably if it has
    been inlined too much, in a way where the conditional branches have also
    been inlined, multiplying the number of possible cache misses and false
    predictions).
    Finally, if you decide too early on an optimization tuned for a specific
    processor (with a known cache size for lookup tables or branch
    prediction tables, or depending on the number of fast registers
    available), it will not work as intended on the next generation of
    processors (or on previous ones, when the implementation is deployed on
    a larger scale). That's a place where static compilers are now losing
    more often, and a clear advantage of virtual environments, which can
    profile the code at runtime and recompile or retune it dynamically
    according to effective usage, notably in highly multithreaded
    environments (where the threads are constantly competing for the
    internal processor caches).

    Then comes the time of parallel computing (and soon, massively parallel
    machines whose processing kernels run on something other than
    traditional CPUs), and optimizing software in such heterogeneous
    universes will become tremendously difficult. In that perspective,
    keeping things simple, with code duplication reduced as much as
    possible, fewer branches, and small lookup tables, will always win. The
    best you can do is optimize things very locally, and only if this does
    not complicate the code (for its future evolution) or the optimization
    remains independent of the optimizations made at the other levels of
    processing. At the highest level of analysis, optimization comes from
    better analysis of the problem and from the correctness and completeness
    of the generalizations made in the data and processing model.

    But if you build your software as a monolithic block, forcibly removing
    the levels of abstraction (and of testability for checking correctness),
    your software will become much less adaptable (with fewer possibilities
    for local optimizations depending on the effective context of use). In
    the medium term, all your optimizations will be lost, and your software
    will have to be rewritten from scratch, reoptimized, and fully retested.
    Given that software today always lags behind the evolution of hardware
    architectures, I'm not sure we should spend too much time breaking the
    fundamental modeling by performing optimizations that mix the levels of
    analysis and also paralyze the needed evolutions of the software.

    For this reason, I really think that using UTF-8 based DFAs and Regexps
    for Unicode sets will perform slower, in the short term, than a separate
    optimization of the UTF-8 reader, made independently of the optimization
    of the DFAs and regexps themselves.

    The UTF-8 reader can effectively be reduced a lot within its own DFA
    (and even this DFA can be optimized to determine the next state without
    any conditional branch, directly from a single table lookup): if you do
    this, you do not need to manually inline the processing for the regexp
    and Unicode set DFAs (code without conditional branches can be
    automatically tuned by the processor itself, or by the JIT compiler of
    the virtual machine, which can automatically eliminate the parts of the
    expressions that produce unused results).
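    Such a branch-free, table-driven UTF-8 DFA can be sketched as follows.
    This is a minimal hand-built illustration, not any particular library's
    tables: each byte is mapped to a class by one lookup, and the next state
    by a second lookup, with no conditional branch in the loop. It checks
    structure only; overlong 3- and 4-byte forms and surrogates would need
    extra states (as in a production-quality table such as Bjoern
    Hoehrmann's well-known decoder).

```c
#include <stdint.h>
#include <stddef.h>

enum { ACCEPT = 0, NEED1 = 1, NEED2 = 2, NEED3 = 3, REJECT = 4 };

/* 256-entry byte-class table, filled once at startup.
   Classes: 0 ASCII, 1 continuation, 2/3/4 lead bytes of
   2/3/4-byte sequences, 5 invalid. */
static uint8_t cls[256];

/* State transition table: next = trans[state * 6 + class]. */
static const uint8_t trans[5 * 6] = {
    /* ACCEPT */ ACCEPT, REJECT, NEED1,  NEED2,  NEED3,  REJECT,
    /* NEED1  */ REJECT, ACCEPT, REJECT, REJECT, REJECT, REJECT,
    /* NEED2  */ REJECT, NEED1,  REJECT, REJECT, REJECT, REJECT,
    /* NEED3  */ REJECT, NEED2,  REJECT, REJECT, REJECT, REJECT,
    /* REJECT */ REJECT, REJECT, REJECT, REJECT, REJECT, REJECT,
};

void init_cls(void) {
    for (int b = 0; b < 256; b++) {
        if (b < 0x80)      cls[b] = 0;  /* ASCII                  */
        else if (b < 0xC0) cls[b] = 1;  /* continuation 10xxxxxx  */
        else if (b < 0xC2) cls[b] = 5;  /* overlong leads C0/C1   */
        else if (b < 0xE0) cls[b] = 2;  /* 2-byte lead            */
        else if (b < 0xF0) cls[b] = 3;  /* 3-byte lead            */
        else if (b < 0xF5) cls[b] = 4;  /* 4-byte lead            */
        else               cls[b] = 5;  /* F5-FF invalid          */
    }
}

/* Branch-free inner loop: one load for the byte class, one for the
   next state. Returns 1 if the buffer is structurally well-formed
   UTF-8 and ends on a sequence boundary. */
int utf8_valid(const uint8_t *s, size_t n) {
    uint8_t state = ACCEPT;
    for (size_t i = 0; i < n; i++)
        state = trans[state * 6 + cls[s[i]]];
    return state == ACCEPT;
}
```

    Because REJECT is absorbing, validation never needs an early exit; the
    loop body stays a fixed, easily pipelined sequence of loads.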

    There are also other ways to reduce the number of unconditional branches
    (not just the conditional ones subject to branch prediction), and the
    most efficient one is to write the code according to the most frequent
    use (i.e. according to profile data). If this is done correctly, other
    local optimizations become possible, notably for parallel computing (you
    can schedule the instructions of independent datapaths, so that even if
    cache misses occur in multithreaded environments, their impact on the
    rest of the code is minimized). I don't think it is the programmer's
    task to perform such optimizations. This should be the role of front-end
    compilers and runtime JIT compilers, which can prove the effective
    correctness of the optimizations performed.
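    A tiny, hypothetical illustration of laying out code for the most
    frequent case (my own sketch, with an invented function name): for
    mostly-ASCII text, a tight loop that consumes ASCII runs first leaves
    the hot path with a single, well-predicted branch, and the rare
    multi-byte handling out of the way.

```c
#include <stdint.h>
#include <stddef.h>

/* Count bytes >= 0x80 in a buffer. The fast path consumes runs of
   ASCII; the slow path handles one non-ASCII byte and returns to
   the fast path. On mostly-ASCII input, the inner while condition
   is almost always taken the same way, so it predicts well. */
size_t count_non_ascii_bytes(const uint8_t *s, size_t n) {
    size_t i = 0, non_ascii = 0;
    while (i < n) {
        while (i < n && s[i] < 0x80)  /* fast path: ASCII run */
            i++;
        if (i < n) {                  /* slow path: non-ASCII byte */
            non_ascii++;
            i++;
        }
    }
    return non_ascii;
}
```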

    For me, code written in such a way that its functional level of analysis
    cannot be fully visualized on a single screen (or two, depending on
    source code presentation preferences) is fundamentally badly designed
    and cannot be optimized well. It will fail due to excessive use of code
    caches, or excessive use of data caches (due to too-large lookup
    tables), or due to an excessive multiplication of cases (conditional
    branches) that are impossible to predict reliably, even after reasonable
    profiling at runtime.

    As a rule of thumb in programming, I prefer the approach where things
    are kept simpler, keeping abstraction levels separated as much as
    possible in the implementation, with smarter code that can be compacted
    using, e.g., smart (but proven) arithmetic rules that can be proven
    locally and asserted within the domain of use, and that the language
    itself can maintain and prove for correctness (so that assertions are no
    longer needed at runtime or at JIT compilation, after a front-end
    compilation, but are replaced by preconditions, postconditions, and
    already-proven declarative assertions that the compiler can use for its
    own local optimizations).

    And the most frequent sources of optimization, i.e. things like dead
    code elimination, are the job of the compiler. It is dangerous to
    perform them manually; all the dangers come later, with the evolution of
    the software, when the pre- and post-conditions (not just the basic
    "assert()" debug feature of C/C++ or Java) guaranteeing the correctness
    of those optimizations have not been specified (and are unverifiable by
    the compiler itself and by software design tools like IDEs, which can
    really help programmers structure their ideas).

    This archive was generated by hypermail 2.1.5 : Sun Apr 26 2009 - 21:52:03 CDT