RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 from Shawn Steele via Unicode on 2017-05-15 (Unicode Mail List Archive)

From: Shawn Steele via Unicode <unicode_at_unicode.org>
Date: Mon, 15 May 2017 22:16:32 +0000

I’m not sure how the discussion of “which is better” relates to the discussion of ill-formed UTF-8 at all.

And to the last, saying “you cannot process UTF-16 without handling surrogates” seems to me to be the equivalent of saying “you cannot process UTF-8 without handling lead & trail bytes”. That’s how the respective encodings work.
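To make the UTF-8 half of that comparison concrete, here is a minimal sketch in Python of what handling lead and trail bytes means. The function names are hypothetical, the lead-byte ranges are the ones for well-formed UTF-8, and the sketch deliberately does not validate trail bytes or the restricted second-byte ranges after E0, ED, F0 and F4.

```python
def utf8_sequence_length(lead):
    """Return the sequence length announced by a lead byte, or 0 if the
    byte cannot start a well-formed UTF-8 sequence."""
    if lead <= 0x7F:
        return 1            # ASCII, a complete sequence on its own
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0                # trail byte (80..BF) or never-valid byte (C0, C1, F5..FF)


def is_trail_byte(b):
    """Trail (continuation) bytes have the bit pattern 10xxxxxx."""
    return 0x80 <= b <= 0xBF


def lead_byte_lengths(data):
    """Walk well-formed UTF-8 and collect the length of each sequence.
    Assumes the input is well-formed; stops at the first byte that is not a lead."""
    lengths = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0:
            break
        lengths.append(n)
        i += n
    return lengths
```

A decoder for UTF-16 needs the same kind of "is this unit the start of a longer sequence" test, only with surrogate ranges instead of lead-byte ranges.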

One could look at it and think “there are 128 unicode characters that have the same value in UTF-8 as UTF-32,” and “there are xx thousand unicode characters that have the same value in UTF-16 and UTF-32.”

-Shawn

From: Unicode [mailto:unicode-bounces_at_unicode.org] On Behalf Of David Starner via Unicode
Sent: Monday, May 15, 2017 2:38 PM
To: unicode_at_unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode <unicode_at_unicode.org> wrote:
Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations

UTF-8 is clearly more space-efficient for any text that includes more ASCII characters than characters between U+0800 and U+FFFF. Given the prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic will pretty much always be smaller in UTF-8.
Even for scripts that go from 2 bytes to 3, webpages can get much smaller in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a factor of 1.8). The maximum change in the other direction is a factor of 1.5, as two bytes go to three.
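The size argument can be checked on small samples. The following is a rough sketch in Python; the helper name and the sample strings are made up for illustration and are not taken from the page cited above. It encodes each string both ways, without a BOM, and compares the byte counts.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text.
    "utf-16-le" is used so that no byte order mark is added."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples only: plain ASCII, Cyrillic with ASCII punctuation,
# bare CJK text, and CJK text wrapped in ASCII markup.
samples = {
    "ASCII": "hello, world",
    "Cyrillic": "привет, мир",
    "CJK": "你好，世界",
    "CJK in markup": "<p>你好</p>",
}

for name, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

Only the bare CJK sample should come out larger in UTF-8, and by no more than the factor of 1.5 mentioned above; once ASCII markup surrounds the same characters, the balance tips the other way.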

and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway.

Not necessarily; you can legally process Unicode text without worrying about combining characters, whereas you cannot process UTF-16 without handling surrogates.
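For reference, the surrogate handling under discussion amounts to roughly the following. This is a sketch in Python with a hypothetical function name; replacing an unpaired surrogate with U+FFFD is an assumption of the sketch, not something the thread prescribes.

```python
def decode_utf16_units(units):
    """Convert a sequence of 16-bit code units into a list of code points.
    Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch)."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine a high/low surrogate pair into one supplementary code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            # Lone high or low surrogate: ill-formed, so substitute U+FFFD.
            code_points.append(0xFFFD)
            i += 1
        else:
            # Any other unit is a BMP code point with the same value.
            code_points.append(unit)
            i += 1
    return code_points
```

Code that skips this step and treats each 16-bit unit as a character will split every supplementary character in two, which is the sense in which UTF-16 cannot be processed without handling surrogates.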
Received on Mon May 15 2017 - 17:17:36 CDT
