Re: Concise term for non-ASCII Unicode characters

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Mon, 21 Sep 2015 19:18:29 +0100

On Mon, 21 Sep 2015 12:46:48 +0100
"Tony Jollans" <Tony_at_jollans.com> wrote:

> These days, it is pretty sloppy coding that cares how many bytes an
> encoding of something requires, although there may be many
> circumstances where legacy support is required.

Wow! Are you saying that code chopping up arbitrary character sequences
for legibility (and editability!) and to avoid buffering issues should
generally assume it will be read as UTF-8, and avoid splitting
well-formed UTF-8 characters? (If the text is actually Windows-1252,
there may be a lot of apparently ill-formed UTF-8 characters/gibberish.)
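
For concreteness, here is a minimal sketch of the kind of chopping code in question, written on the assumption that the data is meant to be read as UTF-8. The names split_utf8 and is_continuation are hypothetical, chosen only for this illustration. It cuts a byte string into pieces of at most max_len bytes and backs the cut up by at most three bytes so that it never lands inside a well-formed multi-byte sequence. Where the bytes do not look like UTF-8 (Windows-1252 text, say), it falls back to the hard limit.

```python
from typing import List


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, i.e. those of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def split_utf8(data: bytes, max_len: int) -> List[bytes]:
    """Split data into chunks of at most max_len bytes.

    Assumption: data is intended to be UTF-8. A cut is moved back by up
    to 3 bytes so it does not fall inside a multi-byte sequence. If no
    boundary is found within 3 bytes, the data is not well-formed UTF-8
    at that point and the chunk is cut at the hard limit instead.
    """
    if max_len < 4:
        raise ValueError("max_len must be at least 4 (longest UTF-8 sequence)")
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + max_len, n)
        if end < n:
            cut = end
            steps = 0
            # data[cut] would be the first byte of the next chunk; if it
            # is a continuation byte, the cut splits a character.
            while steps < 3 and is_continuation(data[cut]):
                cut -= 1
                steps += 1
            if not is_continuation(data[cut]):
                end = cut
        chunks.append(data[start:end])
        start = end
    return chunks
```

With well-formed input every chunk decodes on its own, so split_utf8("a€€".encode("utf-8"), 5) yields a 4-byte chunk followed by a 3-byte chunk rather than cutting the second euro sign in half.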

> You say that, in some
> contexts, one needs to be really clear that the octets 0x80 - 0xFF
> are Unicode. Either something "is" Unicode, or it isn't. Either
> something uses a recognised encoding, or it doesn't. Using these
> octets to represent Unicode code points is not ASCII, is not UTF-8,
> and is not UCS-2/UTF-16; it could, perhaps, be EBCDIC.

But most of these octets *are* used to represent non-ASCII scalar
values. It's just that they have to operate in combinations for UTF-8.
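
To illustrate "most", here is a small sketch (the variable names are mine, purely for illustration) that encodes every non-ASCII scalar value as UTF-8 and collects the octets that turn up. It loops over about a million code points, so it takes a moment to run.

```python
# Collect every octet that appears in the UTF-8 encoding of some
# non-ASCII Unicode scalar value (surrogates are not scalar values).
used = set()
for cp in range(0x80, 0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue
    used.update(chr(cp).encode("utf-8"))

# Octets in 0x80-0xFF that never occur in well-formed UTF-8.
unused = sorted(set(range(0x80, 0x100)) - used)

print("lowest octet used:", hex(min(used)))
print("unused high octets:", [hex(b) for b in unused])
```

Every octet it finds is in the range 0x80 - 0xFF, and the only ones in that range that never appear are 0xC0, 0xC1 and 0xF5 - 0xFF.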

Richard.
Received on Mon Sep 21 2015 - 13:19:46 CDT
