RE: Least used parts of BMP.

From: Doug Ewell
Date: Fri Jun 04 2010 - 11:00:47 CDT

Next message: Philippe Verdy: "RE: A question about "user areas""

Previous message: Mark Davis ☕: "Re: Least used parts of BMP."
Maybe in reply to: Kannan Goundan: "Least used parts of BMP."
Next in thread: John Dlugosz: "RE: Least used parts of BMP."
Reply: John Dlugosz: "RE: Least used parts of BMP."

Mark Davis ☕ <mark at macchiato dot com> replied to Otto Stolz <Otto
dot Stolz at uni dash konstanz dot de>:

>> The problem with this encoding is that the trailing bytes
>> are not clearly marked: they may start with any of
>> '0', '10', or '110'; only '111' would mark a byte
>> unambiguously as a trailing one.
>>
>> In contrast, in UTF-8 every single byte carries a marker
>> that unambiguously marks it as either a single ASCII byte,
>> a starting, or a continuation byte; hence you do not have to
>> go back to the beginning of the whole data stream to recognize,
>> and decode, a group of bytes.
>
> In a compression format, that doesn't matter; you can't expect random
> access, nor many of the other features of UTF-8.

That said, if Kannan were to go with the alternative format suggested on
this list:

0xxxxxxx
1xxxxxxx 0yyyyyyy
1xxxxxxx 1yyyyyyy 0zzzzzzz

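then he would at least have this one feature of UTF-8 (character
boundaries can be found locally, because only the final byte of each
character has its high bit clear), at no additional cost in bits
compared to the format he is using today.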
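
To make that concrete, here is a minimal Python sketch of the three-form
layout above. The function names are mine, and I am assuming the x bits are
the most significant group (the layout itself does not say which end comes
first), so treat this as an illustration and not as Kannan's actual format.

```python
def encode(cp: int) -> bytes:
    """Encode one code point; every byte except the last has the high bit set."""
    if not 0 <= cp < 1 << 21:
        raise ValueError("code point does not fit in three 7-bit groups")
    if cp < 1 << 7:
        return bytes([cp])
    if cp < 1 << 14:
        return bytes([0x80 | (cp >> 7), cp & 0x7F])
    return bytes([0x80 | (cp >> 14), 0x80 | ((cp >> 7) & 0x7F), cp & 0x7F])


def decode(data: bytes) -> list:
    """Decode a whole buffer into a list of code points."""
    out, value = [], 0
    for b in data:
        value = (value << 7) | (b & 0x7F)  # assumed order: most significant group first
        if b < 0x80:                       # high bit clear: last byte of this character
            out.append(value)
            value = 0
    return out


def char_start(data: bytes, i: int) -> int:
    """Return the index where the character containing byte i begins."""
    # Step back while the preceding byte is a non-final byte (high bit set).
    while i > 0 and data[i - 1] >= 0x80:
        i -= 1
    return i
```

For well-formed data, char_start() never looks back more than two bytes, so
a decoder dropped into the middle of a buffer can resynchronize without
returning to the start of the stream.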

Of course, he will not have other UTF-8-like features, such as avoidance
of ASCII values in the final trail byte, and "fast forward parsing" by
looking at the first byte. He may not care. One thing I've noted about
descriptions of UTF-8, in the context of alternative formats for private
protocols, is that they always assume these features are important to
everyone, when they may not be.
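
As a rough illustration of the "fast forward parsing" difference, the
Python sketch below compares the two. The function names are hypothetical;
utf8_length() follows the standard UTF-8 lead-byte ranges, while
alt_length() assumes the high-bit-continuation layout quoted earlier in
this message.

```python
def utf8_length(lead: int) -> int:
    """Sequence length in UTF-8, known from the lead byte alone."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def alt_length(data: bytes, i: int) -> int:
    """Sequence length in the alternative format, starting at index i."""
    # No length marker in the first byte: scan until a byte with the
    # high bit clear, which is the final byte of the character.
    n = 1
    while data[i + n - 1] >= 0x80:
        n += 1
    return n
```

With UTF-8 the parser can skip ahead after reading one byte; with the
alternative format it has to touch every byte of the character. Note that
alt_length() as written raises IndexError on truncated input.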

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org 
RFC 5645, 4645, UTN #14 | ietf-languages: is dot gd slash 2kf0s

This archive was generated by hypermail 2.1.5 : Fri Jun 04 2010 - 11:02:19 CDT