ReadMe.txt for Unicode Bidirectional Algorithm Reference Implementation

Version 1.8 (for UBA Version 6.3.0)

Code Release Date: July 8, 2013 

Documentation Update: October 1, 2013

Author: Ken Whistler

The UBA Reference Implementation is furnished in three forms:

1. A generic Windows executable (bidiref.exe).

2. Platform-independent source code, which can be compiled for an
   executable to run on almost any modern platform.

3. A compiled Windows 32-bit library (libbidir.lib) and a short public 
   header file (bidiref.h) defining a simple API.

The executable form can be used as a means of testing UBA implementations
for conformance, in a context where the developers of such implementations
are prohibited from viewing source code which is under Open Source or other
open licenses.

The source code is distributed, under the general Unicode Character
Database terms of use (see http://www.unicode.org/copyright.html), both
for the purpose of enabling compilation on any required platform and
as a general model for other implementations, to assist in correct
interpretation of the rather complicated steps of the specification
of the Unicode Bidirectional Algorithm in UAX #9.

The API can be used to execute the UBA reference algorithm on a
single input string and check the results. This may be useful in
building test frameworks which systematically compare another implementation
against this reference implementation, or for applications which simply
want to check or illustrate behavior in particular cases.

The source code for this reference implementation is written in C, and is
intended to complement the independent Java implementation
of the Unicode Bidirectional Algorithm:

http://www.unicode.org/Public/PROGRAMS/BidiReferenceJava/ 

These two reference implementations sponsored by the Unicode Consortium have been
developed completely independently, by engineers with no access to
each others' source code. The fact that these two, independent 
implementations in different programming languages are known
to produce identical output for identical input and to pass all current
conformance tests for the Unicode Bidirectional Algorithm is a confidence
measure for the correctness of the algorithm and of the reference
implementations themselves.

This bidiref implementation is deliberately didactic and non-optimized.
No attempt has been made to use compact data structures or to piggyback
or shortcircuit rules in clever ways to save space or increase performance.
The emphasis is on literal, step-by-step implementation of the rules of
the UBA, so that the relationship between each rule in the specification
and its implementation in code is as clear as possible. Although
this reference implementation does not attempt to be particularly
compact or efficient, it is architected in such a way as to minimize
superfluous reallocations and copying of data, in part to point the way
to implementation strategies which *would* be more efficient.

The bidiref implementation supports both UBA 6.2.0 and UBA 6.3.0 rules, 
and an effort has been made to keep the expression of both versions
as parallel as possible, so as to highlight the explicit differences
introduced in UBA 6.3.0. Where the logic of the rule application does
not differ in principle, the actual implementation in code is
abstracted, so that the shared rule logic is also a shared function
in the code for both versions. Where rules are distinct in their
logic, the rule tracing makes it obvious where UBA 6.3.0-specific
rules are distinct from their UBA 6.2.0 analogues. 

Tracing and debug output is provided, particularly for the more difficult
and complex rules that involve stack handling in the specification.
This makes it easier to visualize how the rules are applied, step-by-step,
in complicated cases.

=======================================================================

Using bidiref.exe

The bidiref reference algorithm is written as a command line utility program.
This enables use for scripting in test harnesses of various sorts.

The reference algorithm supports both a fully conformant UBA 6.2.0
implementation and a fully conformant UBA 6.3.0 implementation.

Command line flags currently supported are:

-v  The -v flag prints the version string and quits.

-h  The -h flag prints out a concise summary of command line usage and quits.

-x  bidiref defaults to UBA 6.2.0 rules. The -x flag sets the program to
    follow UBA 6.3.0 rules, instead.

-dn The -d flag allows setting of various levels of debug output.
    This value defaults to -d0, for no debug output. Additional
    levels supported are -d1, -d2, -d3, -d4, with each higher level
    providing more debug output.

-z  The -z flag instructs the program to take an internally defined
    static string as input. This option is primarily intended as support for
    program development and testing, but can also be used as a simple way of
    producing sample output showing application of the UBA rules
    at various levels of debug output, to help in gaining an
    understanding of how that output works.

If the -z flag is not specified, bidiref defaults to an input file name
"BidiRefTest.txt", and attempts to read bidi test cases from that
input file. Such test cases will be interpreted as using the
test case format defined in BidiCharacterTest.txt in the UCD
(see below). If the test cases do not follow that format, bidiref 
will report a parsing error and quit.

When the -z flag is not specified, a fully qualified file name can
be provided on the command line for an input file. bidiref will
then attempt to read bidi test cases from that input file.

Two distinct input file formats are supported. If an input file
starts with the substring "BidiTest", then the file format is
interpreted as the test case format of BidiTest.txt in the UCD.
That test format does not use code points, but instead specifies
test cases as vectors of Bidi_Class values, and uses an abbreviation
scheme with @-variables to specify expected levels and reordering
results for grouped sets of test cases. (See BidiTest.txt in
the UCD for full details.)

If the input file has any other name not starting with the substring
"BidiTest", then the file format is interpreted as the
test case format of BidiCharacterTest.txt in the UCD. That format
expresses test cases by sequences of Unicode code points, and puts
each test case on a single line, using semicolon-delimited fields
to separate the input string from the expected results fields.

Note that bidiref does not read version information internal
to the test case files, nor does it attempt to guess specific
versions of UBA heuristically, based on test case data. This
behavior is deliberate, to allow the program to be run using
either UBA 6.2.0 rules or UBA 6.3.0 rules against exactly the
same input test cases, without changing the input. This can
facilitate comparison of results for interesting edge cases
for which different UBA versions behave distinctly.

Examples of usage:

% bidiref

Parse BidiRefTest.txt and run the UBA 6.2.0 rules on each
test case, printing out results only, with no debug output.

% bidiref -x -d1

Parse BidiRefTest.txt and run the UBA 6.3.0 rules on each
test case, printing out results plus minimal debug output.

% bidiref -x -z -d2

Run the UBA 6.3.0 rules on the static string input test
case and print out the results with medium debug output,
showing the intermediate results of the application of
each successive rule.

% bidiref BidiTest.txt

Parse BidiTest.txt and run the UBA 6.2.0 rules on each
test case, printing out a summary of results only.

% bidiref -x BidiTest.txt

Parse BidiTest.txt and run the UBA 6.3.0 rules on each
test case, printing out a summary of results only.

Note that the BidiTest.txt conformance test files are
very large, containing hundreds of thousands of individual
test cases. Running bidiref on that data with any debugging level
greater than zero will produce very voluminous output, indeed! Use
sparingly. If you want to examine particular test cases
in more detail, it is generally advisable to excerpt
them out of BidiTest.txt into a separate smaller input
file and then use that small file as input for bidiref.

% bidiref -x -d2 BidiTestSmallSet2.txt

Parse BidiTestSmallSet2.txt, using the format of
BidiTest.txt, run the UBA 6.3.0 rules on each test
case, printing out the results with medium debug
output, showing the intermediate results of the application
of each successive rule.

% bidiref -x -d3 BidiCharacterTest.txt

Parse BidiCharacterTest.txt, using the format for test
cases defined in that file, run the UBA 6.3.0 rules on each
test case, printing out the results with a high level of
debug output.

% bidiref -d4 MyOwnTestCases.txt

Parse MyOwnTestCases.txt, using the format specified for
BidiCharacterTest.txt, run the UBA 6.2.0 rules on each
test case, printing out the results with the most verbose
level of debug output.

Trace Flags and Debug Levels

bidiref has a number of trace flags built in. These are used
to specify particular kinds of output. Currently, there
is no support for turning on or off particular
trace flags via command line arguments, but there are 
internal program functions which
can do so. The trace flags supported (as of Version 1.8) are:

Trace0   /* On by default: print final test results */
Trace1   /* Trace main algorithm function entry. */
Trace2   /* Trace initialization code function entry. */
Trace3   /* Trace stack handling in X1-X8 */
Trace4   /* Additional debug output for X1-X8 */
Trace5   /* Trace run and sequence handling in X10 */
Trace6   /* Trace stack handling in N0 */
Trace7   /* Additional debug output for N0 */
Trace8   /* Trace reordering in L2 */
Trace9   /* Additional debug output for P2/P3 */
Trace10  /* Trace property file parsing. */
Trace11  /* Display all intermediate UBA results */
Trace12  /* Additional debug output for test parsing */
Trace13  /* Explicitly list each PASS result for tests */
Trace14  /* Omit vacuous rule application from display */
Trace15  /* On by default: print error messages */

Currently those trace flags are automatically associated
with the following debug levels:

Debug Level   Trace Flags

0             Trace0, Trace15
1             Trace1, Trace2
2             Trace11
3             Trace3, Trace5, Trace6, Trace8
4             Trace4, Trace7, Trace9, Trace12

Higher debug levels retain the trace flag settings of all lower
debug levels.

Test Case Limitations

The longest test case input string currently supported is 76
characters long. (This corresponds to the longest current
test case defined in BidiTest.txt for UBA 6.3.0.)

UCD Property File Requirements

The bidiref implementation does not make use of any library
code (other than the standard C runtime library). In order to
support UBA 6.3.0 rules and the BidiCharacterTest.txt format
for test cases, the program needs access to UCD character properties.

Because of this, bidiref requires the following data files
(with these exact names) to be present in the same directory
that bidiref.exe is executed from:

UnicodeData-6.2.0.txt   (renamed from Version 6.2.0 of UnicodeData.txt)
UnicodeData-6.3.0.txt   (renamed from Version 6.3.0 of UnicodeData.txt)
BidiBrackets-6.3.0.txt  (renamed from Version 6.3.0 of BidiBrackets.txt)

[This requirement for exact file names in the current directory may
be relaxed for a more flexible scheme to access the required files
in a future revision of bidiref.]

Note that because bidiref does not make use of a library to
obtain character property information, each invocation of bidiref
from the command line parses the relevant property files
(UnicodeData-6.2.0.txt when invoked for UBA 6.2.0 rules, and
UnicodeData-6.3.0.txt and BidiBrackets-6.3.0.txt when invoked
for UBA 6.3.0 rules), to build up the internal tables
of character properties required for running the UBA rules.
As a result, if bidiref is to be used for validating
many test cases, it is more efficient to stack all those test
cases together into a single test case input file, rather than
to invoke bidiref repeatedly at the command line for each individual
test case.

[Note: This architecture may be adjusted in a future revision of
bidiref, to enable bidiref to cache all require property information
in a precompiled set of tables, and to bypass the need to parse input
data on each invocation.]

===================================================================

Using the API

The bidiref reference implementation is documented in the
public header file bidiref.h.

To build an application using this API, include bidiref.h in
your source code, and then add a link to the static libbidir
library. 

A compiled 32-bit version of the static library is provided for the
Windows platform: libbidir.lib. [TBD: add compiled version
for linuxamd64, Win64 and/or other platforms.]

The API is very simple. An application must first call br_Init()
once (and only once) to set the UBA version and initialize
the character property tables. Then it can call
br_ProcessOneTestCase() and/or br_QueryOneTestCase() as many
times as it needs. There is no cleanup call necessary or
provided. The initialization works with all static data
structures built into the library, while the test case APIs
automatically allocate and then deallocate any internal data
structures they need on a per test case basis.

Note: As of Version 1.8, this API is not multithread safe,
so a testing application should not attempt to multithread
it for performance.

Control over trace flags is also provided in the API via
the calls TraceOn and TraceOff, using trace flags defined
in bidiref.h. The API br_ProcessOneTestCase honors
individual trace flags, so its execution can be tailored
to show more or less output illustrating the steps and
details of the processing. The API br_QueryOneTestCase
turns off all trace flags and runs without producing
any console output.

===============================================================

Using the API Demo Program: bidirefdemo.exe

A small demo program, bidirefdemo, is provided both in
source code and compiled for Windows. This demo program
illustrates how to use the API, using a single test case
example. The source code can be used to compile the demo for
any platform.

To use bidirefdemo, simply execute it at a command
line prompt with no arguments. It will print out output
demonstrating br_ProcessOneTestCase with some program
tracing on, and then print out a line indicating the
successful execution of br_QueryOneTestCase, with no
other output.

Note that currently bidirefdemo has the same requirements
as the bidiref program for the presence (and names) of
the UnicodeData and BidiBracket property files in the
current directory where it is executed. (see above)

===============================================================

Build Notes

To build the bidiref executable from the source code, one
can simply compile each of the source modules and link them
all together. The basic dependency rule is just:

bidiref : bidiref.c brinput.c brutils.c brtable.c brtest.c brrule.c

To build the library to support use of the API, one can
compile and link the last four modules into a static library.
The basic dependency rule is just:

libbidir : brutils.c brtable.c brtest.c brrule.c

Then to build the bidiref executable, compile the first
two modules and link them to the library:

bidiref : bidiref.c brinput.c libbidir

To build the bidirefdemo executable, compile the bidirefdemo.c
source and link it to the library:

bidirefdemo : bidirefdemo.c libbidir

Similar approaches can then be taken to building your own
test framework which makes use of the library API.

Of course, it is also possible to build libbidir as a dynamic
link library on most platforms, but that requires definition
of the exports from the library. Note that bidiref.exe makes
use of a number of exports which are not part of the publicly
disclosed API defined in bidiref.h, so definition of all the correct
exports is not as simple as it might seem. For the purposes
to which this reference implementation would usually be put,
it is simpler to just build and use static libraries.