Re: Concise term for non-ASCII Unicode characters

From: Sean Leonard <lists+unicode_at_seantek.com>
Date: Tue, 29 Sep 2015 09:20:50 -0700

On 9/21/2015 5:17 PM, Peter Constable wrote:
> If you think it's a serious problem that there isn't one conventional
> term for "characters outside the ASCII repertoire" or "UTF-8
> multi-code-unit encoded representations" (since different authors
> could devise different terminology solutions), then I suggest you
> submit a document to UTC explaining why it's a problem, documenting
> inconsistent or unclear terminology that's been used in some standards
> / public specifications, and requesting that Unicode formally define
> terminology for these concepts. I can't guarantee that UTC will do it,
> but I can predict with confidence that it _won't_ do anything of that
> nature if nobody submits such a document. Peter

I am of a mind to do just that, then. I have seen different documents,
standards, and standards bodies invent their own terminology for this
concept, and the terms are not always the same. Since these standards
depend on Unicode, it would make a lot of sense for Unicode to formally
define terminology for these concepts. With the proliferation of UTF-8
(among other things), the boundary between 0x7F and 0x80 is more
significant than the boundary between 0xFFFF and 0x10000.
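
To make the two boundaries concrete, here is a minimal Python sketch. The
helper name code_unit_counts is my own, purely for illustration. It counts
how many code units one code point needs in UTF-8 and in UTF-16: the UTF-8
count changes at 0x7F/0x80, while the UTF-16 count changes only at
0xFFFF/0x10000.

```python
def code_unit_counts(cp: int) -> tuple:
    """Return (UTF-8 code units, UTF-16 code units) for one code point.

    Hypothetical helper for illustration. Surrogate code points
    (U+D800..U+DFFF) cannot be encoded and would raise UnicodeEncodeError.
    """
    ch = chr(cp)
    utf8_units = len(ch.encode("utf-8"))            # one byte per UTF-8 code unit
    utf16_units = len(ch.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit
    return utf8_units, utf16_units


if __name__ == "__main__":
    for cp in (0x7F, 0x80, 0xFFFF, 0x10000):
        u8, u16 = code_unit_counts(cp)
        print(f"U+{cp:04X}: UTF-8 units = {u8}, UTF-16 units = {u16}")
```

Under this sketch, "beyond ASCII" in UTF-8 terms is exactly the set of code
points for which the first count is greater than one.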

Since this will be my first submission I would appreciate a co-author on
this topic. Is anyone willing to help? Thanks in advance. Also, it is
not clear if such a document is destined to become a Unicode Technical
Report (UTR / PDUTR etc.), or if it should just be an informal write-up.
I am guessing this is supposed to be somewhat informal but at the same
time it (or the results of it) ought to appear in the UTC Document Search.

The current terminology that I am considering pursuing is "beyond
ASCII", in various permutations, such as "beyond the ASCII range",
"characters beyond ASCII", "code points beyond ASCII", etc. The term
"beyond" implies a certain directionality, and to that extent, implies
the Unicode repertoire as well as a Unicode encoding. We have seen on
this list the backflips required to clarify "non-ASCII", since things
that are not ASCII could literally be a wide range of things.

I think there is some confusion about whether the term "Basic Latin"
excludes the C0 control character range. Formally, the Standard seems
clear enough to me that the block is coterminous with ASCII, but there is
still confusion if you don't pore through the Standard. My thought is that
maybe the Blocks.txt data should be modified to say "ASCII (Basic
Latin)" instead of just "Basic Latin". (If we "go there", I would
appreciate the wisdom of an experienced Unicode co-author. I am not
confident touching that just by myself.)
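
For anyone who wants to check the block boundaries without reading the
Standard, a rough Python sketch follows. It parses a two-line excerpt in the
Blocks.txt format (start..end; name) and looks up the block for a code
point. The names BLOCKS_EXCERPT, parse_blocks, and block_of are hypothetical,
and the excerpt is only a fragment of the real data file.

```python
# Two entries in the Blocks.txt format; the real file lists every block.
BLOCKS_EXCERPT = """\
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
"""


def parse_blocks(text):
    """Parse 'start..end; name' lines into (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        range_part, name = line.split(";", 1)
        start, end = range_part.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks


def block_of(cp, blocks):
    """Return the name of the block containing code point cp, or None."""
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return None


if __name__ == "__main__":
    blocks = parse_blocks(BLOCKS_EXCERPT)
    for cp in (0x00, 0x1F, 0x41, 0x7F, 0x80):
        print(f"U+{cp:04X}: {block_of(cp, blocks)}")
```

With this excerpt, the C0 controls (U+0000..U+001F) and DEL (U+007F) all
fall in "Basic Latin", which is the point people tend to miss.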

Sean
Received on Tue Sep 29 2015 - 11:23:01 CDT
