Re: HTML5 encodings

From: Asmus Freytag ([email protected])
Date: Sun Dec 27 2009 - 20:07:55 CST

Next message: - -: "Filtering and displaying untrusted UTF-8"

Previous message: David Starner: "Re: HTML5 encodings"
In reply to: David Starner: "Re: HTML5 encodings"
Next in thread: verdy_p: "Re: HTML5 encodings"
Reply: verdy_p: "Re: HTML5 encodings"

On 12/27/2009 5:28 PM, David Starner wrote:
> On Sun, Dec 27, 2009 at 7:10 PM, Asmus Freytag <[email protected]> wrote:
>
>> The first speaks directly to the topic of resynchronization. In the legacy
>> DBCS encodings, certain byte values satisfied the conditions to be either a
>> leading byte or a trailing byte. The encoding scheme imposes no limit on the
>> length of runs of such bytes, making resynchronization, in the worst case,
>> the same as re-reading the data stream from the start. Compare that to the
>> UTFs where the worst case requires examining 4 bytes to resynchronize.
>>
>
> How do you resynchronize UTF-16? An arbitrary byte-wise seek into ...
> 43 42 43 42 43 42 43 ... could give 䍂 repeatedly or 䉃 repeatedly.
>
In order to resynchronize any encoding you need to know which one it is.
That means you know, for UTF-16, that all code units are 16 bits wide.
You may need to know the offset to the beginning of the data (or
buffer); after that, you know that code units start on either even or
odd bytes (depending on how your data are aligned). That's less
onerous than having to scan back to that location or to read from that
location in order to synchronize.
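
To make the alignment point concrete, here is a minimal Python sketch.
It is not from the original exchange: the function name align_utf16 and
the choice of big-endian byte order are assumptions for illustration. It
shows that David's byte sequence decodes cleanly at either parity, and
that knowing the buffer start is what settles the question.

```python
def align_utf16(offset: int, buffer_start: int = 0) -> int:
    """Snap an arbitrary byte offset down to a UTF-16 code unit boundary.

    Assumes buffer_start is known to be the first byte of a code unit.
    """
    return offset - ((offset - buffer_start) % 2)


# The byte pattern from the quoted message: 43 42 43 42 ...
data = bytes.fromhex("4342" * 4)

# Without the alignment, both readings are well-formed UTF-16BE.
even_reading = data[0:6].decode("utf-16-be")  # three times U+4342
odd_reading = data[1:7].decode("utf-16-be")   # three times U+4243

# With a known buffer start, the parity of the offset decides.
aligned = align_utf16(5, buffer_start=0)      # byte 5 belongs to the unit at 4
```

The work is a single subtraction and a parity test, independent of how
far the seek position is from the start of the buffer.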

Anyway, thanks for bringing out explicitly the implicit assumption I had
made when making my statement. If you know any character boundary
(for example, the start of the first character), then a single byte
access at an even offset from that location tells you for UTF-16
whether you are at the start of a single code unit, or at the lead or
trail end of a surrogate pair.
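
A sketch of that single-byte test, again in Python and again under
assumptions of my choosing: the data are UTF-16BE (so the high-order
byte comes first), the offset is already on a code unit boundary, and
the names classify_utf16be and char_start_utf16be are hypothetical.

```python
def classify_utf16be(data: bytes, unit_offset: int) -> str:
    """Classify the code unit at an aligned offset from its high byte alone."""
    high = data[unit_offset]  # high-order byte of a big-endian code unit
    if 0xD8 <= high <= 0xDB:
        return "lead"    # first half of a surrogate pair
    if 0xDC <= high <= 0xDF:
        return "trail"   # second half of a surrogate pair
    return "single"      # a character encoded in one code unit


def char_start_utf16be(data: bytes, unit_offset: int) -> int:
    """Return the byte offset where the character containing this unit starts."""
    if classify_utf16be(data, unit_offset) == "trail" and unit_offset >= 2:
        return unit_offset - 2  # step back over the lead surrogate
    return unit_offset


# 'a', U+1F600 (a surrogate pair), 'b'
sample = "a\U0001F600b".encode("utf-16-be")
kinds = [classify_utf16be(sample, i) for i in range(0, len(sample), 2)]
```

At most one step back of one code unit is ever needed, which is the
bounded worst case referred to earlier in the thread.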

This suggests a third metric:
Some encodings (UTF-16, DBCS) can't be synchronized at all (in the worst
case) if you don't know at least one character boundary, while some
encodings can be synchronized from purely local context (UTF-8 and UTF-32).
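
For UTF-8, "purely local context" can be sketched as follows. This is
an illustration only, assuming well-formed input; the function name
utf8_char_start is hypothetical. Continuation bytes always have the bit
pattern 10xxxxxx, so backing up over at most three of them reaches the
first byte of the sequence.

```python
def utf8_char_start(data: bytes, offset: int) -> int:
    """Back up from offset to the first byte of the UTF-8 sequence containing it.

    Assumes well-formed UTF-8, where a sequence has at most 3 continuation bytes.
    """
    i = offset
    steps = 0
    # 0b10xxxxxx marks a continuation byte; anything else starts a sequence.
    while i > 0 and steps < 3 and (data[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    return i


# 'a' (1 byte), U+20AC (3 bytes), U+1F600 (4 bytes)
sample = "a\u20ac\U0001F600".encode("utf-8")
starts = [utf8_char_start(sample, i) for i in range(len(sample))]
```

No knowledge of any earlier character boundary is used; only the bytes
immediately before the seek position are examined.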

(UTF-16 differs from DBCS in that the known character boundary can be
anywhere in the data w/o change in the work required to actually
resynchronize because the code units are fixed length. DBCS differs from
UTF-16 in that in the best case, you can often discover an unambiguous
boundary by scanning less than the maximal distance.)
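
The DBCS case can be sketched the same way. The byte ranges below are
hypothetical (single bytes 0x00-0x7F, lead bytes 0x81-0xFE, trail bytes
0x40-0xFE) and only loosely modeled on legacy double-byte code pages;
real encodings differ in detail. The function name dbcs_char_start is
likewise an assumption, and the input is assumed to be well formed.

```python
def is_lead_capable(b: int) -> bool:
    # Hypothetical lead-byte range; these values can also appear as trail bytes.
    return 0x81 <= b <= 0xFE


def dbcs_char_start(data: bytes, pos: int, known_boundary: int = 0) -> int:
    """Return the start of the character containing data[pos].

    known_boundary must be a real character boundary at or before pos.
    """
    i = pos
    # Scan back over the run of bytes that could each be a lead byte.
    while i > known_boundary and is_lead_capable(data[i - 1]):
        i -= 1
    # i is now a boundary: either the known one, or the byte before it
    # cannot be a lead byte and therefore ends a character.
    run = pos - i
    # The run pairs up as lead+trail from i, so its parity decides.
    return pos if run % 2 == 0 else pos - 1


# 'A', two double-byte characters (81 81)(81 81), 'B'
sample = bytes([0x41, 0x81, 0x81, 0x81, 0x81, 0x42])
starts = [dbcs_char_start(sample, p) for p in range(len(sample))]
```

In the best case the loop stops after one byte; in the worst case it
walks the whole run back to known_boundary, which may be the start of
the data.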

For most practical purposes, random access without also knowing the
boundaries of your data buffer isn't very useful. How do you know when
you've reached the end of your data? Or, if you do end up scanning back,
how do you know when you have reached the beginning of your data? Or
even, how do you know that your random access is even inside your buffer?

Incidentally, null-terminated anything (even ASCII or UTF-8) is one of
those cases where you need to read from the start to make sure you
haven't overrun the terminator.
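
As a small illustration of the null-terminated case (a sketch only; the
function name offset_is_before_terminator is hypothetical): whether an
offset still lies inside the string cannot be decided from the bytes
near that offset, only by checking every byte from the start up to it.

```python
def offset_is_before_terminator(buf: bytes, start: int, offset: int) -> bool:
    """True if buf[offset] is part of the NUL-terminated string beginning at start."""
    if not (start <= offset < len(buf)):
        return False
    # Every byte from the start up to and including offset must be non-NUL.
    return b"\x00" not in buf[start:offset + 1]


# Two strings back to back in one buffer: "abc" and "junk".
buf = b"abc\x00junk\x00"
inside = offset_is_before_terminator(buf, 0, 2)    # 'c' is part of "abc"
overrun = offset_is_before_terminator(buf, 0, 5)   # 'u' lies past the terminator
```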

However, if your task is to tune into something like a stock-ticker in
mid-stream and resynchronize, the third metric would tell you whether
you'll have any success.

A./

PS: all of this discussion sidesteps the question of "heuristics" -
we've only looked at fully deterministic cases.


This archive was generated by hypermail 2.1.5 : Sun Dec 27 2009 - 20:10:14 CST