Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Philippe Verdy via Unicode <>
Date: Tue, 16 May 2017 20:13:15 +0200

2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode <>:

> C) The data was corrupted by some other means. Perhaps bad
> concatenations, lost blocks during read/transmission, etc. If we lost 2
> 512 byte blocks, then maybe we should have a thousand FFFDs (but how would
> we know?)

Thousands of U+FFFD's are not a problem (independently of the internal UTF
encoding used): yes, the two lost 512-byte blocks could then become 3 times
larger (if using UTF-8 as the internal encoding) or 2 times larger (if
using UTF-16 as the internal encoding), but every application should be
prepared to support that size expansion, which has a completely known
maximum factor and could occur just as well with any valid CJK-only text.

So the size to allocate for the internal storage is predictable from the
size of the input; this is an important feature of all standard UTF's.
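As a sketch of that predictability (using Python only for illustration, and assuming a decoder that emits one U+FFFD per ill-formed byte, as Python's 'replace' error handler does for isolated invalid bytes), the worst-case expansion bounds can be checked directly:

```python
# Worst case: every input byte is ill-formed and becomes one U+FFFD.
data = b'\xff' * 512                    # 512 bytes, all invalid in UTF-8
decoded = data.decode('utf-8', errors='replace')
assert decoded == '\ufffd' * 512        # one replacement per bad byte

# U+FFFD is 2 bytes in UTF-16 and 3 bytes in UTF-8, so the output is
# bounded by 2x (UTF-16) or 3x (UTF-8) the input size -- the same bounds
# that apply to valid CJK-only text.
assert len(decoded.encode('utf-16-le')) == 2 * len(data)
assert len(decoded.encode('utf-8')) == 3 * len(data)
```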
Being able to handle the worst case of allowed expansion argues strongly
for adopting UTF-16 as the internal encoding instead of UTF-8 (where you
would need to allocate more space before decoding the input if you want to
avoid successive memory reallocations, which would hurt the performance of
your decoder). It is simple to accept input from 512-byte (or 1 KB)
buffers and allocate a 1 KB (or 2 KB) buffer for storing the intermediate
results in the generic decoder, and simpler at the outer level to
preallocate buffers of reasonable sizes that are reallocated once, if
needed, to the maximum size, and then reduced to the effective size (if
needed) at the end of a successful decode. (Some implementations can use
pools of preallocated buffers with small static sizes, allocating new
buffers outside the pool only in the rare cases where more space is
needed.)
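The one-allocation strategy described above can be sketched as follows (again in Python purely for illustration; the function name and the use of a bytearray as the "buffer" are my own, not from any particular decoder). Since an invalid byte becomes U+FFFD (2 UTF-16 bytes) and valid UTF-8 never expands beyond 2x when transcoded to UTF-16, a buffer of twice the input size is always enough:

```python
def decode_chunk_to_utf16(chunk: bytes) -> bytes:
    """Decode one input chunk to UTF-16-LE with a single preallocation."""
    # Preallocate the worst case up front: each input byte yields at most
    # 2 bytes of UTF-16 (ASCII -> 2, multibyte sequences -> 2 or 4 for
    # fewer input bytes than that, ill-formed byte -> U+FFFD -> 2).
    out = bytearray(2 * len(chunk))

    text = chunk.decode('utf-8', errors='replace')
    encoded = text.encode('utf-16-le')
    out[:len(encoded)] = encoded

    # Reduce to the effective size at the end of a successful decode.
    return bytes(out[:len(encoded)])
```

For example, `decode_chunk_to_utf16(b'A\xff')` keeps the valid `A` and replaces the stray `0xFF` byte with U+FFFD, and a 512-byte chunk never produces more than 1 KB of output.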
Received on Tue May 16 2017 - 13:13:54 CDT
