Re: HTML5 encodings

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Dec 27 2009 - 20:07:55 CST


    On 12/27/2009 5:28 PM, David Starner wrote:
    > On Sun, Dec 27, 2009 at 7:10 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
    >
    >> The first speaks directly to the topic of resynchronization. In the legacy
    >> DBCS encodings, certain byte values satisfied the conditions to be either a
    >> leading byte or a trailing byte. The encoding scheme imposes no limit on the
    >> length of runs of such bytes, making resynchronization, in the worst case,
    >> the same as re-reading the data stream from the start. Compare that to the
    >> UTFs where the worst case requires examining 4 bytes to resynchronize.
    >>
    >
    > How do you resynchronize UTF-16? A byte-wise arbitrary seek into ...
    > 43 42 43 42 43 42 43 ... could give 䍂 repeatedly or 䉃 repeatedly.
    >
    In order to resynchronize any encoding you need to know which one it is.
    That means, you know for UTF-16 that all code units are 16 bits wide.
    You may need to know the offset to the beginning of the data (or
    buffer), after that, you know that code units start on either even or
    odd bytes (depending on whether your data are aligned). That's less
    onerous than having to scan back to that location or to read from that
    location in order to synchronize.

    Anyway, thanks for bringing out explicitly the implicit assumption I had
    made when making my statement. If you know any character boundary
    (start of first character) then a single byte access at an even offset
    from that location tells you for UTF-16 whether you are at the start of a
    single code unit, or the lead/tail end of a surrogate pair.
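
    As a rough illustration of that check, here is a minimal Python sketch.
    The function name, its parameters, and the returned labels are
    hypothetical, and it assumes the caller knows the byte order and one
    code unit boundary.

```python
def classify_utf16_offset(buf: bytes, known_boundary: int, pos: int,
                          big_endian: bool = True) -> str:
    """Classify byte offset `pos`, given one known code unit boundary.

    Hypothetical helper: assumes `buf` holds UTF-16 in the stated byte
    order and that `pos` leaves room for a full code unit.
    """
    # An odd distance from a known boundary lands inside a 16-bit code unit.
    if (pos - known_boundary) % 2 != 0:
        return "mid-unit"
    # The high-order byte alone identifies surrogates (0xD8..0xDF),
    # so one byte access is enough.
    high = buf[pos] if big_endian else buf[pos + 1]
    if 0xD8 <= high <= 0xDB:
        return "lead-surrogate"
    if 0xDC <= high <= 0xDF:
        return "trail-surrogate"
    return "single-unit"


# "A", U+10400 (a surrogate pair), "B" in big-endian UTF-16.
data = "A\U00010400B".encode("utf-16-be")
labels = [classify_utf16_offset(data, 0, p) for p in (0, 2, 3, 4, 6)]
```

    Without the known boundary the parity test has nothing to anchor to,
    which is exactly the 43 42 43 42 ambiguity raised above.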

    This suggests a third metric:
    Some encodings (UTF-16, DBCS) can't be synchronized at all (in the worst
    case) if you don't know at least one character boundary, while some
    encodings can be synchronized from purely local context (UTF-8 and UTF-32).
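
    For UTF-8, the "purely local context" amounts to skipping continuation
    bytes. A minimal Python sketch, assuming well-formed UTF-8 (the
    function name is hypothetical):

```python
def utf8_resync(buf: bytes, pos: int) -> int:
    """Return the offset of the first character start at or after `pos`.

    Hypothetical helper: assumes `buf` is well-formed UTF-8, so at most
    three continuation bytes are skipped.
    """
    # Continuation bytes match the bit pattern 10xxxxxx; any other byte
    # begins a character (ASCII or a multi-byte lead byte).
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# Layout: 'a' at 0, U+20AC at 1..3, U+1F600 at 4..7, 'b' at 8.
data = "a\u20ac\U0001F600b".encode("utf-8")
starts = [utf8_resync(data, p) for p in (0, 1, 2, 5, 8)]
```

    No knowledge of the buffer start or of any earlier boundary is needed,
    only the bytes at and after the seek position.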

    (UTF-16 differs from DBCS in that the known character boundary can be
    anywhere in the data w/o change in the work required to actually
    resynchronize because the code units are fixed length. DBCS differs from
    UTF-16 in that in the best case, you can often discover an unambiguous
    boundary by scanning less than the maximal distance.)
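
    A rough Python sketch of the DBCS scan-back follows. It assumes
    Shift_JIS-style lead byte ranges (0x81-0x9F and 0xE0-0xFC) and
    well-formed data; the function names are hypothetical. A byte that
    cannot be a lead byte always ends a character, so the byte after it is
    an unambiguous boundary.

```python
def is_lead(b: int) -> bool:
    # Assumed Shift_JIS-style lead byte ranges; other DBCS encodings differ.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def dbcs_char_start(buf: bytes, pos: int) -> tuple:
    """Return (start of the character containing `pos`, bytes scanned back).

    Hypothetical helper: assumes well-formed data in a DBCS encoding
    whose lead bytes are described by is_lead().
    """
    # Scan back over the run of bytes that could be lead bytes. The scan
    # stops at the buffer start or just after a byte that must end a
    # character; either way `anchor` is a known boundary.
    anchor = pos
    while anchor > 0 and is_lead(buf[anchor - 1]):
        anchor -= 1
    run = pos - anchor
    # Every byte in the run is lead-capable, so starting from the anchor
    # the characters pair up two bytes at a time; parity decides.
    start = pos if run % 2 == 0 else pos - 1
    return start, run


best = dbcs_char_start(b"A\x81\x81B", 3)   # short run, cheap answer
worst = dbcs_char_start(b"\x81" * 8, 5)    # run reaches the buffer start
```

    In the second call the scan only terminates at the start of the
    buffer, which is the worst case described above.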

    For most practical purposes, random access without also knowing the
    boundaries of your data buffer isn't very useful. How do you know when
    you've reached the end of your data? Or, if you do end up scanning back,
    how do you know when you have reached the beginning of your data? Or
    even, how do you know that your random access is even inside your buffer?

    Incidentally, null-terminated anything (even ASCII or UTF-8) is one of
    those cases where you need to read from the start to make sure you
    haven't overrun the terminator.

    However, if your task is to tune into something like a stock-ticker in
    mid-stream and resynchronize, the third metric would tell you whether
    you'll have any success.

    A./

    PS: all of this discussion sidesteps the question of "heuristics" -
    we've only looked at fully deterministic cases.


