Re: Unicode education in Schools from Richard Wordingham via Unicode on 2017-08-26 (Unicode Mail List Archive)

From: Richard Wordingham via Unicode <unicode_at_unicode.org>
Date: Sat, 26 Aug 2017 18:52:03 +0100

On Sat, 26 Aug 2017 18:55:25 +0300
Eli Zaretskii via Unicode <unicode_at_unicode.org> wrote:

> > Date: Sat, 26 Aug 2017 16:09:33 +0100
> > From: Richard Wordingham via Unicode <unicode_at_unicode.org>

> > It shouldn't. UTF-16 works just like UTF-8, except that the code
> > units are bigger.

> Not really, since UTF-8 doesn't have surrogates.

It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
of the few systems that comes close to allowing them the dignity of
integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
0x80 to 0xFF.
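
For anyone who wants to check the arithmetic, here is a minimal Python sketch that sorts all 256 byte values into the classes counted above. The names classify_utf8_byte and emacs_raw_byte_char are made up for illustration; the second merely restates the Emacs range quoted above as an offset of 3FFF00₁₆ and is not an Emacs API.

```python
from collections import Counter

def classify_utf8_byte(byte):
    """Classify one byte value by the role it can play in well-formed UTF-8."""
    if byte <= 0x7F:
        return "ascii"      # single-unit sequences
    if byte <= 0xBF:
        return "trailing"   # 0x80..0xBF, continuation units
    if 0xC2 <= byte <= 0xF4:
        return "leading"    # first unit of a 2-, 3- or 4-unit sequence
    return "never"          # 0xC0, 0xC1, 0xF5..0xFF never appear

def emacs_raw_byte_char(byte):
    """Hypothetical helper: the integer Emacs uses for a raw byte 0x80..0xFF."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF have a raw-byte character")
    return 0x3FFF00 + byte

counts = Counter(classify_utf8_byte(b) for b in range(256))
print(counts["trailing"], counts["leading"], counts["never"])
print(hex(emacs_raw_byte_char(0x80)), hex(emacs_raw_byte_char(0xFF)))
```

The trailing and leading counts add up to the 115 mentioned above, and the remaining non-ASCII bytes are the 13 that are never allowed.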

I well remember when Unicode regular expressions were required to
allow one to search for lone surrogates, but there was no such concept
of looking for isolated ill-associated bytes in Unicode 8-bit strings.

The point is that if one understands how UTF-8 works, UTF-16 is a
system that works using a subset of the same principles, and one should
therefore understand how UTF-16 works, until one comes to the weird and
dubious concept of surrogate points having properties. I believe the
latter concept is of value only in code that lacks the concept of
gibberish. In UTF-8, the distinction between code unit value and
Unicode scalar value is very clear; in UTF-16, it is muddied by the
concept of 'codepoint'.
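
To make the parallel concrete, here is a rough Python sketch that encodes one supplementary character both ways and lists the code units. The helper names utf16_units and is_scalar_value are invented for this example, and U+1F600 is just an arbitrary character outside the BMP.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def is_scalar_value(value):
    """True for Unicode scalar values: code points minus D800..DFFF."""
    return 0 <= value <= 0x10FFFF and not 0xD800 <= value <= 0xDFFF

ch = "\U0001F600"

# UTF-8: one leading unit followed by three trailing units.
utf8_units = list(ch.encode("utf-8"))
print([hex(u) for u in utf8_units])

# UTF-16: one leading unit followed by one trailing unit.
u16_units = utf16_units(ch)
print([hex(u) for u in u16_units])

# No UTF-8 code unit could be mistaken for the character itself, but
# each UTF-16 code unit here is also a code point, though not a scalar value.
print([is_scalar_value(u) for u in u16_units])
```

In both encodings the sequence is one leading unit plus trailing units; the difference is only that the UTF-16 units also sit inside the code point range, which is where the muddle comes from.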

Richard.
Received on Sat Aug 26 2017 - 12:52:33 CDT
