Re: Encoding of old compatibility characters

From: Ken Whistler <kenwhistler_at_att.net>
Date: Mon, 27 Mar 2017 10:18:03 -0700

On 3/27/2017 7:44 AM, Charlotte Buff wrote:
> Now, one of Unicode’s declared goals is to enable round-trip
> compatibility with legacy encodings. We’ve accumulated a lot of weird
> stuff over the years in the pursuit of this goal. So it would be
> natural to assume that the unencoded characters from the mentioned
> sets [ATASCII, PETSCII, the ZX80 set, the Atari ST set, and the TI
> calculator sets] would also be eligible for inclusion in the UCS.

Actually, it wouldn't be.

The original goal was to ensure round-trip compatibility with
*important* legacy character encodings, *for which there was a need to
convert legacy data, and/or an ongoing need to represent text for
interchange*.
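To illustrate what round-trip compatibility amounts to in practice,
here is a minimal sketch (in Python, chosen purely for illustration and
not part of the original discussion): bytes in a legacy encoding map
into Unicode and back out to the identical bytes, with nothing lost.

    legacy = "カナ".encode("shift_jis")        # bytes in a legacy encoding
    text = legacy.decode("shift_jis")          # converted into Unicode
    assert text.encode("shift_jis") == legacy  # and back again, losslessly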

From Unicode 1.0: "The Unicode standard includes the character content
of all major International Standards approved and published before
December 31, 1990... [long list ensues] ... and from various industry
standards in common use (such as code pages and character sets from
Adobe, Apple, IBM, Lotus, Microsoft, WordPerfect, Xerox and others)."

Even as long ago as 1990, artifacts such as the Atari ST set were
considered obsolete antiquities, and did not rise to the level of the
kind of character listings that we considered when pulling together the
original repertoire.

And there are several observations to be made about the "weird stuff" we
have accumulated over the years in the pursuit of compatibility. A lot
of stuff that was made up out of whole cloth, rather than being
justified by existing, implemented character sets used in information
interchange at the time, came from the 1991/1992 merger process between
the Unicode Standard and the ISO/IEC 10646 drafts. That's how Unicode
acquired blocks full of Arabic ligatures, for example.

Other, subsequent additions of small (or even largish) sets of oddball
"characters" that don't fit the prototypical sets of characters for
scripts and/or well-behaved punctuation and symbols have typically come
in with argued cases for a continued need, in current text interchange,
for complete coverage. For example, that is how we ended up filling out
Zapf Dingbats with some glyph pieces that had been omitted from the
initial repertoire for that block. More recently, of course, the
continued importance of the Wingdings and Webdings font encodings on
the Windows platform led the UTC to fill out the set of graphical
dingbats to cover those sets. And of course, we first started down the
emoji track because of the need to interchange text originating from
widely deployed Japanese carrier sets implemented as extensions to
Shift-JIS.

I don't think the early calculator character sets, or the sets for the
Atari ST and similar early consumer computers, fit the bill, precisely
because there isn't a real text data interchange case to be made for
encoding their characters. Many of the elements you have mentioned,
such as the inverse/negative squared versions of letters and symbols,
are simply idiosyncratic aspects of the UI for those devices, in an era
when font generators were hard-coded and very primitive indeed.

Documenting these early uses, and pointing out the parts of the UI and
character usage that aren't part of the character repertoire in the
Unicode Standard, seems an interesting pursuit to me. But absent a true
textual data interchange issue for these long-gone, obsolete devices, I
don't really see a case for spending time in the UTC defining a bunch
of compatibility characters to encode for them.

--Ken