Re: Unicode forms for internal storage - BOCU-1 speed

From: Markus Scherer (
Date: Fri Jan 23 2004 - 12:56:04 EST


    Doug Ewell wrote:
    > Markus Scherer <markus dot scherer at jtcsv dot com> wrote:
    >>"claim"? That hurts...
    >>I did measure these things, and the numbers in the table are all from
    >>my measurements. I also included the type of machine I used, etc.
    > Certainly I would never accuse Markus of falsifying these statistics.
    > The word "claim" was not meant in the sense of "unsubstantiated claim."

    I might have overreacted a little here. I am not in _excruciating_ pain ;-)
    Sorry for misunderstanding "claim". My only excuse is that I am not a native speaker.

    > I'll have to see how my encoder and decoder perform when I finish them.
    > They're currently written for simplicity, not speed.

    My initial implementations were slower, too. I worked quite a bit on the performance of the
    converters that are in ICU4C.

    >>UTF-8 is useful because it's simple, and supported just about
    >>everywhere - but it's otherwise hardly optimal for anything.
    > As John said, it's all about ASCII transparency, together with no false
    > positives for "ASCII bytes" in non-Basic Latin characters.

    I agree with this, of course - in my mind, it's part of what I meant by "supported just about everywhere".

    A good part of what makes ASCII transparency useful for HTML and XML and other formats with internal
    encoding declarations is that one can parse those encoding declarations by initially assuming an
    ASCII-compatible encoding.
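
    As a rough illustration of that point (not code from ICU or from any real parser), the sketch
    below reads the first bytes of a document as if they were ASCII and pulls out the encoding named
    in an XML declaration. The helper name sniff_declared_encoding, the 200-byte prefix limit, and
    the UTF-8 default are assumptions made for this example only.

```python
import re

# Hypothetical pattern for this sketch: an XML declaration with an encoding
# pseudo-attribute, matched directly against raw bytes.
_XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in an XML declaration, or `default`.

    Only works if the real encoding is ASCII-compatible, because the
    declaration itself is matched byte-for-byte as ASCII.
    """
    head = data[:200]  # assumed limit: the declaration sits at the very start
    match = _XML_DECL.match(head)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# An ISO-8859-1 document: the declaration is plain ASCII bytes, so it can be
# read before the encoding is known.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode(
    "iso-8859-1"
)
latin1_name = sniff_declared_encoding(latin1_doc)
latin1_text = latin1_doc.decode(latin1_name)

# The same kind of document in UTF-16 is not ASCII-compatible: every ASCII
# character is followed by a zero byte, so the byte-level match fails and
# the function falls back to the default.
utf16_doc = '<?xml version="1.0" encoding="UTF-16"?><p/>'.encode("utf-16-le")
utf16_name = sniff_declared_encoding(utf16_doc)
```

    The second case is the interesting one: without ASCII transparency the declaration cannot be
    found this way, and the parser needs some other hint about the encoding.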

    ASCII transparency would be less important if Unicode signatures (BOMs) were used and recognized
    more often.
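
    For comparison, recognizing a signature needs no assumption about ASCII compatibility at all.
    The following is a minimal sketch using the BOM constants from Python's standard codecs module;
    the function name detect_bom and its return shape are made up for this example.

```python
import codecs

# UTF-32 signatures are listed before UTF-16 ones because the UTF-32 LE
# signature (FF FE 00 00) begins with the UTF-16 LE signature (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding_name, signature_length), or None if no BOM is present."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None


# Usage: strip the signature, then decode with the encoding it identified.
sample = codecs.BOM_UTF16_LE + "h\u00e9llo".encode("utf-16-le")
found = detect_bom(sample)
if found is not None:
    encoding, skip = found
    decoded = sample[skip:].decode(encoding)
```

    Note that text whose first character is U+0000 in UTF-16LE would be misread as UTF-32LE by this
    ordering; that ambiguity is inherent in the signatures and is ignored in this sketch.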

