RE: UTF-8 based DFAs and Regexps from Unicode sets

From: Philippe Verdy (
Date: Sun Apr 26 2009 - 21:46:51 CDT

  • Next message: John (Eljay) Love-Jensen: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

    Sam Mason wrote:
    > On Sun, Apr 26, 2009 at 10:57:49AM -0700, Mark Davis wrote:
    > > I'd disagree about that. It is certainly simpler to always
    > process as
    > > code points, but where performance is required, you need to design
    > > your algorithms with the encoding of the core string
    > representation in
    > > mind, typically UTF-8 or UTF-16. You can get huge speedups
    > in that way.
    > Are there any pointers to literature about that? I'd be
    > interested to see how this sort of scheme would hang
    > together; there would seem to be quite a trade-off between
    > instruction cache pressure, branch prediction and most
    > probably other effects I can't think of at the moment.

    I also agree that processing UTF-8 directly, instead of converting it
    first to code points, can benefit overall performance, but this must not
    become a credo. You can also get exactly the opposite, because of the
    increased number of conditional branches that will appear in the
    compiled code.

    You may try to reduce the number of conditional branches, but if this
    comes at the price of more lookups (or larger lookup tables), the cache
    misses will also have a severe impact on performance. Then you can try
    to reduce the overall processing by inlining the reader into the lexer.
    My opinion is that mixing UTF-8 decoding with the general processing of
    Unicode algorithms based on code points will complicate things so much
    that code correctness will be difficult to prove, and test coverage of
    the code (and of the data in lookup tables) will decrease, making the
    software even more fragile.

    In this domain, it really helps to optimize as much as possible at the
    local level of the code, notably in the innermost processing loop, which
    must remain small, with small lookup tables and as few lookups per
    character as possible.

    The code should also be carefully profiled before deciding which
    compilation will produce the best branch predictions (they are important
    for increasing the internal parallelism of the processor pipelines),
    including statically, when branch prediction statistics are not yet
    available or have been purged from internal processor caches: if your
    innermost processing loop uses too many branches with too many cases to
    handle, the implementation will become quite slow (notably if it has
    been inlined too much, in a way where the conditional branches have also
    been inlined, multiplying the number of possible cache misses and false
    predictions).
    Finally, if you decide too early on an optimization tuned for a specific
    processor (with a known cache size for lookup tables or branch
    prediction tables, or depending on the number of fast registers
    available), it will not work as intended on the next generation of
    processors (or on previous ones, when the implementation is deployed on
    a larger scale). That's a place where static compilers are now losing
    more often, and a clear advantage of virtual environments, which can
    profile the code at runtime and recompile or retune it dynamically
    according to effective usage, notably in highly multithreaded
    environments (where the threads are constantly competing for the
    internal processor caches).

    Then comes the time of parallel computing (and soon, massively parallel
    machines whose processing kernels run on something other than
    traditional CPUs), and optimizing software in such heterogeneous
    universes will become tremendously difficult. In that perspective,
    keeping things simple, with code duplication reduced as much as
    possible, fewer branches, and small lookup tables, will always win. The
    best you can do is optimize things very locally, and only if this does
    not complicate the code (for its future evolution) or the optimization
    remains independent of the optimizations made at the other levels of
    processing. At the highest level of analysis, optimization comes from
    better analysis of the problem and from the correctness and completeness
    of the generalizations made in the data and processing model.

    But if you build your software as a monolithic block, forcibly removing
    the levels of abstraction (and of testability for checking correctness),
    your software will become much less adaptable (with fewer possibilities
    for local optimizations depending on the effective context of use). In
    the medium term, all your optimizations will be lost, and your software
    will have to be rewritten from scratch, reoptimized, and fully retested.
    Given that software today always lags behind the evolution of hardware
    architectures, I'm not sure we should spend too much time breaking the
    fundamental modeling by performing optimizations that mix the levels of
    analysis and also paralyze the needed evolutions of the software.

    For this reason, I really think that using UTF-8 based DFAs and Regexps
    for Unicode sets will perform slower, in the short term, than a separate
    optimization of the UTF-8 reader, made independently of the optimization
    of the DFAs and regexps themselves.

    The UTF-8 reader can effectively be reduced a lot within its own DFA
    (and even this DFA can be optimized to determine the next state without
    any conditional branch, directly from a single table lookup): if you do
    this, you do not need to manually inline the processing for the regexp
    and Unicode set DFAs (code without conditional branches can be
    automatically tuned by the processor itself, or by the JIT compiler of
    the virtual machine, which can automatically eliminate the parts of the
    expressions that produce unused results).
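    Such a branch-free, table-driven UTF-8 DFA can be sketched as follows.
    This is a minimal hand-built illustration, not any particular library's
    tables: each byte is mapped to a class by one lookup, and the next state
    by a second lookup, with no conditional branch in the loop. It checks
    structure only; overlong 3- and 4-byte forms and surrogates would need
    extra states (as in a production-quality table such as Bjoern
    Hoehrmann's well-known decoder).

```c
#include <stdint.h>
#include <stddef.h>

enum { ACCEPT = 0, NEED1 = 1, NEED2 = 2, NEED3 = 3, REJECT = 4 };

/* 256-entry byte-class table, filled once at startup.
   Classes: 0 ASCII, 1 continuation, 2/3/4 lead bytes of
   2/3/4-byte sequences, 5 invalid. */
static uint8_t cls[256];

/* State transition table: next = trans[state * 6 + class]. */
static const uint8_t trans[5 * 6] = {
    /* ACCEPT */ ACCEPT, REJECT, NEED1,  NEED2,  NEED3,  REJECT,
    /* NEED1  */ REJECT, ACCEPT, REJECT, REJECT, REJECT, REJECT,
    /* NEED2  */ REJECT, NEED1,  REJECT, REJECT, REJECT, REJECT,
    /* NEED3  */ REJECT, NEED2,  REJECT, REJECT, REJECT, REJECT,
    /* REJECT */ REJECT, REJECT, REJECT, REJECT, REJECT, REJECT,
};

void init_cls(void) {
    for (int b = 0; b < 256; b++) {
        if (b < 0x80)      cls[b] = 0;  /* ASCII                  */
        else if (b < 0xC0) cls[b] = 1;  /* continuation 10xxxxxx  */
        else if (b < 0xC2) cls[b] = 5;  /* overlong leads C0/C1   */
        else if (b < 0xE0) cls[b] = 2;  /* 2-byte lead            */
        else if (b < 0xF0) cls[b] = 3;  /* 3-byte lead            */
        else if (b < 0xF5) cls[b] = 4;  /* 4-byte lead            */
        else               cls[b] = 5;  /* F5-FF invalid          */
    }
}

/* Branch-free inner loop: one load for the byte class, one for the
   next state. Returns 1 if the buffer is structurally well-formed
   UTF-8 and ends on a sequence boundary. */
int utf8_valid(const uint8_t *s, size_t n) {
    uint8_t state = ACCEPT;
    for (size_t i = 0; i < n; i++)
        state = trans[state * 6 + cls[s[i]]];
    return state == ACCEPT;
}
```

    Because REJECT is absorbing, validation never needs an early exit; the
    loop body stays a fixed, easily pipelined sequence of loads.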

    There are also other ways to reduce the number of unconditional branches
    (not just the conditional ones subject to branch prediction), and the
    most efficient one is to write the code according to the most frequent
    use (i.e. according to profile data). If this is done correctly, other
    local optimizations become possible, notably for parallel computing (you
    can schedule the instructions of independent datapaths, so that even if
    cache misses occur in multithreaded environments, their impact on the
    rest of the code is minimized). I don't think it is the programmer's
    task to perform such optimizations. This should be the role of front-end
    compilers and runtime JIT compilers, which can prove the effective
    correctness of the optimizations performed.
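    A tiny, hypothetical illustration of laying out code for the most
    frequent case (my own sketch, with an invented function name): for
    mostly-ASCII text, a tight loop that consumes ASCII runs first leaves
    the hot path with a single, well-predicted branch, and the rare
    multi-byte handling out of the way.

```c
#include <stdint.h>
#include <stddef.h>

/* Count bytes >= 0x80 in a buffer. The fast path consumes runs of
   ASCII; the slow path handles one non-ASCII byte and returns to
   the fast path. On mostly-ASCII input, the inner while condition
   is almost always taken the same way, so it predicts well. */
size_t count_non_ascii_bytes(const uint8_t *s, size_t n) {
    size_t i = 0, non_ascii = 0;
    while (i < n) {
        while (i < n && s[i] < 0x80)  /* fast path: ASCII run */
            i++;
        if (i < n) {                  /* slow path: non-ASCII byte */
            non_ascii++;
            i++;
        }
    }
    return non_ascii;
}
```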

    For me, code written in such a way that its functional level of analysis
    cannot be fully visualized on a single screen (or two, depending on
    source code presentation preferences) is fundamentally badly designed
    and cannot be optimized well. It will fail due to excessive use of code
    caches, or excessive use of data caches (due to too-large lookup
    tables), or due to an excessive multiplication of cases (conditional
    branches) that are impossible to predict reliably, even after reasonable
    profiling at runtime.

    As a rule of thumb in programming, I prefer the approach where things
    are kept simpler, keeping abstraction levels separated as much as
    possible in the implementation, with smarter code that can be compacted
    using, e.g., smart (but proven) arithmetic rules that can be proven
    locally and asserted within the domain of use, and that the language
    itself can maintain and prove for correctness (so that assertions are no
    longer needed at runtime or at JIT compilation, after a front-end
    compilation, but are replaced by preconditions, postconditions, and
    already-proven declarative assertions that the compiler can use for its
    own local optimizations).

    And the most frequent sources of optimization, i.e. things like dead
    code elimination, are the job of the compiler. It is dangerous to
    perform them manually; all the dangers come later, with the evolution of
    the software, when the pre- and post-conditions (not just the basic
    "assert()" debug feature of C/C++ or Java) guaranteeing the correctness
    of those optimizations have not been specified (and are unverifiable by
    the compiler itself and by software design tools like IDEs, which can
    really help programmers structure their ideas).

    This archive was generated by hypermail 2.1.5 : Sun Apr 26 2009 - 21:52:03 CDT