RE: Proposing UTF-21/24

From: Ruszlan Gaszanov (
Date: Wed Jan 24 2007 - 10:05:22 CST


    The comparison was based on my own implementation, in MS JScript, of the encoder and decoder functions that convert between each encoding scheme and Unicode code points' scalar values. UTF-8 and UTF-16 data are handled as arrays of scalar values, while UTF-21 and UTF-24 code units are handled as single scalars (whose 24 least significant bits hold the actual code unit). I then counted the total number of lines in the encoder and decoder functions, excluding comments.

    For UTF-24, data is first passed to the validator function for error detection and then decoded by the UTF-21 decoder. The count includes a very basic validator function and a parity calculator, which both the encoder and the validator need. (A more sophisticated validator that could attempt to recover data from badly corrupted streams and detect byte-order changes in the middle of a stream would, of course, take quite a bit more code.)

    I didn't count file reading/writing routines, because my script uses common functions for that purpose.

    UTF-8 takes the most lines because it needs a lot of conditional statements. The UTF-21 encoder and decoder total 2 lines, because each function is basically a single bitwise expression. The most complex procedure for UTF-24 is the validation routine, because encoding and decoding are accomplished by pure bitwise logic. The parity calculator is, of course, a 24-cycle loop of bitwise XORs and bit shifts.
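
    As a rough illustration of what "a single bitwise expression" and "a 24-cycle loop" could look like, here is a minimal sketch. The bit layout is an assumption made for the example (three 7-bit groups, one per byte, with the top bit of each byte left clear); the message does not spell out the actual layout, and the function names are hypothetical. How the parity result is folded into a UTF-24 code unit is likewise not shown, because the message does not describe it.

```javascript
// ASSUMED layout, for illustration only: a 21-bit scalar value is split into
// three 7-bit groups, one per byte, and the top bit of each byte stays 0.
// Input is assumed to be a valid scalar value (0..0x10FFFF).
function utf21Encode(cp) {
  return ((cp & 0x1FC000) << 2) | ((cp & 0x3F80) << 1) | (cp & 0x7F);
}

// Inverse of utf21Encode under the same assumed layout.
function utf21Decode(unit) {
  return ((unit & 0x7F0000) >> 2) | ((unit & 0x7F00) >> 1) | (unit & 0x7F);
}

// Parity of the 24 least significant bits: one XOR and one shift per bit.
// Returns 1 if an odd number of those bits are set, 0 otherwise.
function parity24(unit) {
  var p = 0;
  for (var i = 0; i < 24; i++) {
    p ^= unit & 1;
    unit >>= 1;
  }
  return p;
}
```

    Under this assumed layout, utf21Decode(utf21Encode(cp)) returns cp for any scalar value, and neither function needs a conditional.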


    -----Original Message-----
    From: Philippe Verdy []
    Sent: Wednesday, January 24, 2007 12:00 PM
    To: Ruszlan Gaszanov; 'Unicode Mailing List'
    Subject: Re: Proposing UTF-21/24

    From: "Ruszlan Gaszanov" <>
    >> I implemented a BOCU-1 encoder/decoder in about 400 lines of C++,
    >> so I wouldn't call it too complex.
    > Complexity is a relative concept.
    > For comparison:
    > - UTF-8 encoder+decoder - under 100 lines (100 / 400 = 25%)
    > - UTF-16 encoder+decoder - under 40 lines (40 / 400 = 10%)
    > - UTF-24 encoder+decoder - under 20 lines (20 / 400 = 5%)
    > - UTF-21 encoder+decoder - exactly 2 lines (2 / 400 = 0.5%)

    Your stats are clearly flawed. They imply that you wrote the UTF-21 encoder as a single uncommented source line, with the whole function completely inlined along with its tests.

    This is unfair. My UTF-16 encoder or decoder needs just one line, *simpler* than your UTF-21 one, using a single test (which is really easy to inline if needed) and no intermediate assignment.
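
    For readers who want to see what a one-line, single-test UTF-16 conversion can look like, here is a minimal sketch. It is not the code referred to above, which the message does not show; the function names are hypothetical, and the input is assumed to be already valid (a scalar value for the encoder, well-formed code units for the decoder).

```javascript
// Encode one scalar value as an array of one or two UTF-16 code units.
// 0xD7C0 is 0xD800 - (0x10000 >> 10), so the 0x10000 offset is folded
// into the constant. Assumes cp is a valid scalar value.
function utf16Encode(cp) {
  return cp < 0x10000 ? [cp] : [0xD7C0 + (cp >> 10), 0xDC00 + (cp & 0x3FF)];
}

// Decode one code unit, or a surrogate pair, back to a scalar value.
// 'lo' is only used when 'hi' is a high surrogate. Assumes well-formed input.
function utf16Decode(hi, lo) {
  return (hi & 0xFC00) !== 0xD800 ? hi : ((hi - 0xD7C0) << 10) + (lo - 0xDC00);
}
```

    Each function body is one expression with one test and no temporary variable, which shows how much a line count depends on how the code happens to be laid out.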

    To be fair, complexity must be compared using comparable programming styles. If one wants trustworthy metrics, counting lines is not accurate. You must count the number of tests to perform, the number of basic arithmetic operations, the number of temporary variables for intermediate lookups, and the number of variable assignments. Ignore comment lines in such metrics, because they can be expanded or completely removed at will!

    Also ignore the length of the function declaration, because it is not language-neutral (just count one declaration per needed function). And ignore the coding style (position of line breaks, empty lines, blanks for indentation, length of variable names...). A reasonably fair metric would be based on counting nodes in a syntax parse tree.
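
    Building a real parse tree requires a parser, but a crude stand-in is to count lexical tokens after stripping comments. The sketch below does only that; it is a substitute technique, not a syntax-tree node count. The function name is hypothetical, and the tokenizer deliberately ignores string and regular-expression literals, so it is only suitable for small arithmetic routines like the ones discussed here.

```javascript
// Count lexical tokens in a piece of source code, ignoring comments,
// whitespace, line breaks and identifier length. This only approximates
// the number of leaves in a parse tree.
// Limitation: string and regex literals are NOT recognized.
function countTokens(src) {
  // Remove block comments and line comments.
  var code = src.replace(/\/\*[\s\S]*?\*\/|\/\/[^\n]*/g, " ");
  // Identifiers, hex and decimal numbers, multi-character operators,
  // then any other single punctuation character.
  var tokens = code.match(
    /[A-Za-z_$][\w$]*|0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?|>>>=?|<<=|>>=|===|!==|[<>=!+\-*\/%&|^]=|&&|\|\||<<|>>|\+\+|--|[^\s\w]/g
  );
  return tokens ? tokens.length : 0;
}
```

    With this measure, "a=b+1;" and the same statement spread over several commented lines both count as 6 tokens, so layout and comments no longer affect the result.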
