Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: David Starner via Unicode <unicode_at_unicode.org>
Date: Mon, 15 May 2017 21:38:26 +0000

On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode <
unicode_at_unicode.org> wrote:

> Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the
> case for other situations

UTF-8 is clearly more efficient space-wise for text that includes more
ASCII characters than characters between U+0800 and U+FFFF. Given the
prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew
and Arabic text will almost always be smaller in UTF-8.

Even for scripts whose characters go from 2 bytes to 3, webpages can get
much smaller in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k
in UTF-16, a factor of 1.8 in favor of UTF-8). The maximum change in the
other direction is a factor of 1.5, as two bytes become three.
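The size comparison is easy to verify directly; the following sketch (the
sample strings are illustrative, not from the original message) encodes a
few texts both ways and prints the byte counts:

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16.
samples = {
    "ASCII":   "Hello, world!",
    "Greek":   "αβγδε",
    "Chinese": "简体中文",
}

for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))  # -le: raw code units, no BOM
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

ASCII text doubles in size in UTF-16 (13 vs. 26 bytes here), Greek comes
out even, and the pure-CJK string grows by the maximal factor of 1.5 in
UTF-8 (12 vs. 8 bytes) - which real pages like the gov.cn example dilute
with ASCII markup.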

> and the fact is that handling surrogates (which is what proponents of
> UTF-8 or UCS-4 usually focus on) is no more complicated than handling
> combining characters, which you have to do anyway.
>

Not necessarily: you can legally process Unicode text without worrying
about combining characters, whereas you cannot correctly process UTF-16
without handling surrogates.
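For concreteness, the arithmetic every UTF-16 consumer must perform to
recombine a surrogate pair can be sketched as follows (a minimal Python
illustration of the standard formula, not any particular library's API):

```python
def decode_surrogate_pair(hi: int, lo: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    hi must be a high surrogate (U+D800..U+DBFF) and lo a low
    surrogate (U+DC00..U+DFFF); each contributes 10 bits of the
    20-bit offset above U+10000.
    """
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# U+1F600 is stored in UTF-16 as the code-unit pair D83D DE00.
print(hex(decode_surrogate_pair(0xD83D, 0xDE00)))  # 0x1f600
```

Unlike combining-character handling, this step cannot be skipped: a
UTF-16 processor that treats each code unit as a character will
misinterpret every code point above U+FFFF.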
Received on Mon May 15 2017 - 16:39:12 CDT
