Re: Least used parts of BMP.

From: Asmus Freytag ([email protected])
Date: Fri Jun 04 2010 - 12:22:42 CDT

Next message: Kenneth Whistler: "Re: Hexadecimal digits"

Previous message: Michael Everson: "Re: Emoji (was: Re: Preparing a proposal for encoding a portable interpretable object code into Unicode)"
In reply to: Mark Davis ☕: "Re: Least used parts of BMP."
Next in thread: Doug Ewell: "RE: Least used parts of BMP."

On 6/4/2010 8:34 AM, Mark Davis ☕ wrote:
> In a compression format, that doesn't matter; you can't expect random
> access, nor many of the other features of UTF-8.
>
> The minimal expectation for these kinds of simple compression is that
> when you write a string with a particular /write/ method, and then
> read it back with the corresponding /read/ method, you get exactly the
> original string contents back, and you consume exactly as many bytes
> as you had written. There are really no other guarantees.
Actually, SCSU makes an additional guarantee, which is that you can edit
the compressed string. In other words, you can insert a substring such
that the new string remains a valid compressed string and the parts
preceding and following the insertion, when read, match the
corresponding portion of the original after decoding. I remember that
this was an important design criterion for the precursor RCSU. Their
implementation required the ability to deliver a "patch" to a compressed
string, something that isn't possible with many other compression formats.

So there is a sliding scale in features, each compression method being
designed to address the specific requirements of a given application.

A./
>
> Mark
>
> — Il meglio è l’inimico del bene —
>
>
> On Fri, Jun 4, 2010 at 06:35, Otto Stolz <[email protected]
> <mailto:[email protected]>> wrote:
>
> Hello,
>
> On 2010-06-03 07:07, Kannan Goundan wrote:
>
> This is currently what I do (I was referring to this as the "compact
> UTF-8-like encoding"). The one difference is that I put all the marker
> bits in the first byte (instead of in the high bit of every byte):
> 0xxxxxxx
> 10xxxxxx xyyyyyyy
> 110xxxxx xxyyyyyy yzzzzzzz
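
For readers who want to experiment with the layout quoted above, here is a minimal sketch in Python. It assumes the payload bits are simply the code point's binary value, most significant bits first, with no offset applied to the two- and three-byte ranges; the original message does not say either way. The function names compact_encode and compact_decode are made up for this illustration.

```python
def compact_encode(text: str) -> bytes:
    """Encode text with all marker bits in the first byte (assumed layout)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:                      # 0xxxxxxx: 7 payload bits
            out.append(cp)
        elif cp < 0x4000:                  # 10xxxxxx xyyyyyyy: 14 payload bits
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])
        else:                              # 110xxxxx + 2 bytes: 21 payload bits
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    return bytes(out)


def compact_decode(data: bytes) -> str:
    """Decode from the start of the stream; the first byte gives the length."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, cp = 1, lead
        elif lead < 0xC0:
            length, cp = 2, lead & 0x3F
        elif lead < 0xE0:
            length, cp = 3, lead & 0x1F
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        if i + length > len(data):
            raise ValueError("truncated sequence")
        for trail in data[i + 1:i + length]:
            cp = (cp << 8) | trail         # trailing bytes carry 8 payload bits each
        chars.append(chr(cp))
        i += length
    return "".join(chars)


sample = "A\u00e9\u20ac\U0001F600"         # 1 + 2 + 2 + 3 bytes in this layout
encoded = compact_encode(sample)
assert compact_decode(encoded) == sample
print(len(encoded), len(sample.encode("utf-8")))  # 8 bytes here, 10 in UTF-8
```

Under these assumptions the three-byte form holds 21 bits, which is enough for every code point up to U+10FFFF, and the sample string takes 8 bytes where UTF-8 needs 10.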
>
>
> The problem with this encoding is that the trailing bytes
> are not clearly marked: they may start with any of
> '0', '10', or '110'; only '111' would mark a byte
> unambiguously as a trailing one.
>
> In contrast, in UTF-8 every single byte carries a marker
> that unambiguously marks it as either a single ASCII byte,
> a starting byte, or a continuation byte; hence you do not have
> to go back to the beginning of the whole data stream to
> recognize and decode a group of bytes.
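
The difference described above can be made concrete with a small Python sketch. The helper names utf8_char_start and compact_encode_cp are hypothetical, and the compact layout is assumed to store the code point's bits without any offset, which the thread does not confirm. The example picks U+0141, whose trailing byte in the compact layout happens to equal ASCII "A".

```python
def utf8_char_start(data: bytes, i: int) -> int:
    """Back up from offset i to the first byte of the UTF-8 character containing it."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # 10xxxxxx marks a continuation byte
        i -= 1
    return i


def compact_encode_cp(cp: int) -> bytes:
    """Assumed compact layout: all marker bits live in the first byte."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x4000:
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


text = "\u0141!"                                # LATIN CAPITAL LETTER L WITH STROKE, then "!"

# UTF-8: landing on offset 1 (a continuation byte) is detectable and recoverable.
utf8 = text.encode("utf-8")                     # b'\xc5\x81!'
start = utf8_char_start(utf8, 1)
assert start == 0
assert utf8[start:].decode("utf-8") == text

# Compact layout: the same jump lands on a trailing byte that looks like ASCII.
compact = b"".join(compact_encode_cp(ord(c)) for c in text)   # b'\x81A!'
assert compact[1] == ord("A")                   # indistinguishable from a real "A"
print(compact[1:].decode("ascii"))              # reads as "A!", silently wrong
```

A reader that starts in the middle of the compact stream gets plausible but wrong text with no error, whereas the UTF-8 reader can tell from the byte itself that it is inside a character.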
>
> Best wishes,
> Otto Stolz
>
>
>
>

