Beyond ASCII

From: Sean Leonard <lists+unicode_at_seantek.com>
Date: Tue, 29 Sep 2015 23:12:25 -0700

On 9/29/2015 11:50 AM, Ken Whistler wrote:
> At any rate, any formal contribution that suggests coming up with
> terminology for
> the #1 and #2 sets should take these considerations under advisement.

The original premise of this thread was (and is) to find the *most
concise* term for that range U+0080 - U+10FFFF, regardless of whether
that range is for characters, code points, scalar values, or coffee cup
icons ☕️. Preferably, such a concise term would have support in the
Unicode Standard, or in some other standard. I was not looking for a
totally new, invented term, but rather a term that has empirical,
standards-based support.

A full survey of the Unicode Standard 8.0 finds that the term "beyond
ASCII" has textual support:
p. 1 Introduction: While taking the ASCII character set as its starting
point, the Unicode Standard goes far beyond ASCII’s limited ability [...]

p. 37 ASCII Transparency: [UTF-8] maintains transparency for all of the
ASCII code points (0x00..0x7F). That means Unicode code points
U+0000..U+007F are [thus] indistinguishable from ASCII itself. [...]
Beyond the ASCII range of Unicode, many [...] scripts are represented by
two bytes [in UTF-8...]

p. 200 Programming Languages: A limitation of the ISO/ANSI C model is
its assumption that characters can always be processed in isolation.
Implementations that choose to go beyond the ISO/ANSI C model may
find it useful to mix widths within their APIs.
{This formulation is not "beyond ASCII", but uses the preposition
"beyond" in the exact same sense, since ASCII is fixed-width and forms
an underlying assumption of the ISO/ANSI C model.}

p. 237 Case Mappings: A number of complications to case mappings occur
once the repertoire of characters is expanded beyond ASCII.

p. 677 Han / CJK Unified Ideographs Extension B: The ideographs in the
CJK Unified Ideographs Extension B block represent an additional set of
42,711 unified ideographs beyond the 27,496 included in The Unicode
Standard, Version 3.0.
{This formulation uses the preposition "beyond" in the exact same sense,
namely, a subsequent range that is beyond the original range.}
The same holds for Extensions C, D, and E.

Finally, (case) "beyond ASCII" is in the Index at p. 237.
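
To make the p. 37 "ASCII Transparency" passage concrete, here is a minimal
Python sketch. The helper names is_beyond_ascii and utf8_length are my own
inventions for illustration; they do not come from the Unicode Standard. The
sample characters are arbitrary, and surrogate code points are left out
because they cannot be encoded in UTF-8.

```python
def is_beyond_ascii(code_point: int) -> bool:
    """True for code points in the range U+0080..U+10FFFF."""
    return 0x80 <= code_point <= 0x10FFFF


def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


# One ASCII character, then three characters beyond ASCII.
samples = ["?", "\u00e9", "\u2615", "\U0001F609"]

for ch in samples:
    cp = ord(ch)
    print(f"U+{cp:04X} beyond_ascii={is_beyond_ascii(cp)} "
          f"utf8_bytes={utf8_length(ch)}")
```

The ASCII character stays a single byte, identical to its ASCII encoding,
while everything beyond ASCII takes two or more bytes.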

Perhaps this thread would have gone differently if the original subject
had been "Beyond ASCII" instead of...that other one. 😉

Now, I am not saying that the term *must* be "beyond ASCII". However, the
term "non-ASCII" (with or without "Unicode") has no support in the
Unicode Standard 8.0. The only occurrence is the reference to RFC 2047,
and in that document, "non-ASCII" clearly means any and every character
encoding ever invented, not specifically Unicode.

Another thing is the oxymoron "ASCII Unicode" (the opposite of
"non-ASCII Unicode"). Actually, ASCII is a formal subset of Unicode...at
the beginning. ASCII itself (ANSI X3.4-1986) is a 7-bit character set;
it does not limit itself to any particular word length so long as the 7
bits are in those combinations. Therefore U+0000 - U+007F characters
encoded in UTF-32 or UTF-16 are in ASCII codes; they are truly ASCII
characters. When a bit combination '?' (0x3F) is loaded into a 64-bit
register on a CPU, is it still an ASCII character? My view is yes.

They are not in ASCII *encoding*, as *encoding* is limited to a sequence
of 7-bit or 8-bit combinations (X3.4-1986 Section 2.1.1(1)). My point
here is that to be correct, one ought to use some sort of preposition,
namely "ASCII in Unicode" or "ASCII [characters/code points/scalar
values] in Unicode"--but if you slice off "in Unicode", you are left
with "ASCII" and that is just fine. This is another basis for the
proposition that "beyond ASCII" (e.g., "characters beyond ASCII [in
Unicode]", "beyond the ASCII range [of Unicode]") makes sense.
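
The distinction between ASCII *codes* and ASCII *encoding* drawn above can
be sketched in Python. This is only an illustration under my own
assumptions: the code_unit helper and the CODE_UNIT_WIDTH table are
hypothetical names, and I use the big-endian forms of UTF-16 and UTF-32 so
that no byte order mark is emitted.

```python
# Width in bytes of one code unit for each encoding form used here.
CODE_UNIT_WIDTH = {"ascii": 1, "utf-16-be": 2, "utf-32-be": 4}


def code_unit(ch: str, encoding: str) -> int:
    """Numeric value of the first code unit of ch in the given encoding."""
    width = CODE_UNIT_WIDTH[encoding]
    data = ch.encode(encoding)
    return int.from_bytes(data[:width], "big")


ch = "?"  # the bit combination 0x3F
for encoding in CODE_UNIT_WIDTH:
    print(f"{encoding:10} bytes={ch.encode(encoding)!r} "
          f"code_unit=0x{code_unit(ch, encoding):02X}")
```

The byte sequences differ in length, so the UTF-16 and UTF-32 forms are not
in ASCII encoding, yet the value carried by each code unit is the same 7-bit
combination in all three.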

Regards,

Sean
Received on Wed Sep 30 2015 - 01:14:04 CDT
