Re: Unicode forms for internal storage - BOCU-1 speed

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Jan 23 2004 - 12:56:04 EST

Next message: Rick McGowan: "Three new Technical Notes posted"

Previous message: Jon Hanna: "Re: [OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed"
In reply to: Doug Ewell: "Re: Unicode forms for internal storage - BOCU-1 speed"
Next in thread: Doug Ewell: "Re: Unicode forms for internal storage"

Doug Ewell wrote:
> Markus Scherer <markus dot scherer at jtcsv dot com> wrote:
>>"claim"? That hurts...
>>
>>I did measure these things, and the numbers in the table are all from
>>my measurements. I also included the type of machine I used, etc.
>>(http://www.unicode.org/notes/tn6/#Performance)
>
> Certainly I would never accuse Markus of falsifying these statistics.
> The word "claim" was not meant in the sense of "unsubstantiated claim."

I might have overreacted a little here. I am not in _excruciating_ pain ;-)
Sorry for misunderstanding "claim". My only excuse is that I am not a native speaker.

> I'll have to see how my encoder and decoder perform when I finish them.
> They're currently written for simplicity, not speed.

My initial implementations were slower, too. I worked quite a bit on the performance of the
converters that are in ICU4C.

>>UTF-8 is useful because it's simple, and supported just about
>>everywhere - but it's otherwise hardly optimal for anything.
>
> As John said, it's all about ASCII transparency, together with no false
> positives for "ASCII bytes" in non-Basic Latin characters.

I agree with this, of course - in my mind, it's part of the "supported just about everywhere".

A good part of what makes ASCII transparency useful for HTML and XML and other formats with internal
encoding declarations is that one can parse those encoding declarations by initially assuming an
ASCII-compatible encoding.
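
As a rough illustration of that bootstrapping step, here is a minimal Python sketch. It is not taken from ICU or any real parser; the function name sniff_declared_encoding, the regular expression, and the 1024-byte limit are all assumptions made for this example. It scans the first bytes of a document for an XML-style encoding declaration without decoding them, which only works if the real encoding represents ASCII characters as ASCII bytes.

```python
import re

# Hypothetical pattern for illustration: an XML declaration with an encoding
# pseudo-attribute, matched directly on bytes (ASCII-compatible input assumed).
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_declared_encoding(data: bytes, limit: int = 1024):
    """Return the declared encoding name, or None if no declaration is found.

    Only the first `limit` bytes are examined (an arbitrary choice here).
    """
    match = _DECL.search(data[:limit])
    return match.group(1).decode("ascii") if match else None
```

The same declaration encoded in UTF-16 has a zero byte next to every ASCII byte, so the pattern does not match and the function returns None. That is the case where some other signal, such as a signature, is needed.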

ASCII transparency would be less important if Unicode signatures (BOMs) were used and recognized more often.
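
For comparison, recognizing a signature takes only a prefix check. The following Python sketch is an illustration under stated assumptions, not a description of any particular converter: the function name detect_signature and the returned codec labels are made up for this example, while the BOM constants come from Python's standard codecs module. The UTF-32 signatures are tested before the UTF-16 ones because the UTF-32 LE signature begins with the same two bytes as the UTF-16 LE signature.

```python
import codecs

# Longest signatures first: BOM_UTF32_LE (FF FE 00 00) starts with
# BOM_UTF16_LE (FF FE), so the order of this list matters.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_signature(data: bytes):
    """Return (codec_name, signature_length), or (None, 0) if there is no BOM."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller could skip the reported number of bytes and decode the rest with the reported codec, falling back to declaration sniffing or a default when the result is (None, 0).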

markus


This archive was generated by hypermail 2.1.5 : Fri Jan 23 2004 - 14:04:38 EST