Re: Factor implements 24-bit string type for Unicode support

From: Asmus Freytag (
Date: Tue Feb 05 2008 - 12:56:49 CST


    Philippe gave some very interesting arguments (complete with specific
    figures), but without citing his evidence or stating his assumptions. A
    thorough comparison of how the various encoding forms perform on large
    data volumes would be interesting.

    Assuming for the moment that the general arguments Philippe presented
    are not that far off the mark, it would seem that UTF-16 is not such a
    bad choice either. Because all but very specialized data collections
    can expect to have 99+% of their character codes in the BMP, the cost
    of decompressing the data to UTF-32 is dominated by the case for BMP
    characters. Even if handling surrogates were to take 100 times as long,
    that would only roughly double the average cost per character.
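
    As a rough check on that arithmetic, here is a minimal sketch in
    Python. The function name and the cost model (a BMP character costs
    one unit, a surrogate pair costs a fixed multiple of that) are
    assumptions for illustration, not measurements.

```python
def average_cost(bmp_fraction, surrogate_penalty):
    """Average per-character cost, relative to a BMP character (cost 1.0).

    bmp_fraction: share of characters in the BMP (hypothetical input).
    surrogate_penalty: how many times slower a surrogate pair is assumed to be.
    """
    non_bmp_fraction = 1.0 - bmp_fraction
    return bmp_fraction * 1.0 + non_bmp_fraction * surrogate_penalty


# 99% BMP characters, surrogates assumed 100 times as slow:
# 0.99 * 1 + 0.01 * 100 is about 1.99, i.e. roughly double.
print(average_cost(0.99, 100.0))
```

    With a more realistic penalty the average moves even less; the 100x
    figure is a deliberately pessimistic upper bound.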

    In the meantime, the benefits of more localized memory access are those
    of a 50% reduction relative to UTF-32, not the 25% reduction that a
    3-byte form gives. Plus, in many cases, you get the benefit of direct
    library support without the need to convert the strings, if you want.
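
    To make the 50% versus 25% comparison concrete, the following sketch
    counts storage bytes for the same text in UTF-16, a fixed 3-byte form,
    and UTF-32. The 3-byte size is simply taken as three bytes per code
    point; the helper names are hypothetical, and the text is assumed to
    be non-empty.

```python
def storage_bytes(text):
    """Bytes needed to store `text` in each form (no BOM)."""
    code_points = len(text)  # a Python 3 str is a sequence of code points
    return {
        "utf16": len(text.encode("utf-16-le")),   # 2 bytes in the BMP, 4 beyond it
        "fixed24": 3 * code_points,               # assumed 3 bytes per code point
        "utf32": len(text.encode("utf-32-le")),   # always 4 bytes per code point
    }


def reduction_vs_utf32(text):
    """Fractional saving of each form relative to UTF-32 (text must be non-empty)."""
    sizes = storage_bytes(text)
    return {
        name: 1 - size / sizes["utf32"]
        for name, size in sizes.items()
        if name != "utf32"
    }


# For all-BMP text, UTF-16 saves half and the 3-byte form saves a quarter.
print(reduction_vs_utf32("hello"))
```

    Supplementary characters narrow the gap, since each one takes four
    bytes in UTF-16, but at 1% or less of the data the effect is small.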

    That's the real argument I see against a 3-byte form.

    But, knowing programmers, they won't rest until every single permutation
    of possible encoding forms has been used and foisted on some
    unsuspecting user.

