Re: APL Under-bar Characters

From: Ken Whistler <kenwhistler_at_att.net>
Date: Sun, 16 Aug 2015 11:37:50 -0700

It seems to me that APL has some very deeply embedded (and ancient)
assumptions about fixed-width 8-bit characters, dating from ASCII days.
It only got as far as it did with the current assumptions because people
hacked up 8-bit fonts for all the special characters for the APL syntax,
and because IBM implemented those as dedicated special character sets with
matching specialized APL keyboards.

A built-in function like ⍴, which returns the *size* of data, goes
hand-in-hand with the definition of vectors and arrays. There seem to be
very deep assumptions in the APL data model that strings are simply arrays
of *fixed-size* data elements, aka "characters".
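
To make the fixed-width model concrete, here is a rough C sketch (the names
CharVector and rho are purely illustrative, not taken from any actual APL
implementation): the "length" of a character vector is just a stored element
count, with no decoding involved.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch: a "character vector" as a flat array of
   fixed-size cells, so its length is simply a stored count. */
typedef struct {
    size_t length;         /* number of elements == number of "characters" */
    unsigned char *data;   /* one fixed-size cell per character */
} CharVector;

/* the analogue of APL's rho under the fixed-width assumption */
static size_t rho(const CharVector *v) {
    return v->length;      /* O(1): no scanning, no decoding */
}

int main(void) {
    unsigned char cells[3] = { 'A', 'P', 'L' };
    CharVector v = { 3, cells };
    printf("rho = %zu\n", rho(&v));   /* prints: rho = 3 */
    return 0;
}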

So requiring ⍴,'ä' and ⍴,'_A_' to "just work" is the moral equivalent of
asking the C library call strlen("ä") or strlen("_A_") to "just work",
regardless of the representation of the data in the string. It is a
nonsensical requirement if applied to general Unicode strings outside the
context of a very carefully restricted subset designed to ensure a
one-to-one relationship between "character" and "array element".

A Unicode-based APL implementation can (presumably) just up the size of its
"character" to 16 bits internally (actually a UTF-16 code *unit*) and
carefully restrict itself to the subset of ASCII & Latin-1, the APL symbols,
and a few other operators needed to fill out the set.
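
A rough sketch of what that buys you, using C11 char16_t strings (the subset
policing itself is left out; this just shows the code-unit counts): 'ä' is a
single UTF-16 code unit, but an underlined A written as A + U+0332 COMBINING
LOW LINE is two, which is exactly why it does not fit the
one-element-per-character model.

#include <stddef.h>
#include <stdio.h>
#include <uchar.h>   /* char16_t (C11) */

/* count char16_t code units up to the terminating 0 */
static size_t units(const char16_t *s) {
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

int main(void) {
    const char16_t a_diaeresis[]  = u"\u00E4";   /* precomposed: one code unit */
    const char16_t underlined_a[] = u"A\u0332";  /* base + combining mark: two code units */
    printf("%zu %zu\n", units(a_diaeresis), units(underlined_a));  /* prints: 1 2 */
    return 0;
}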

Looking at the fonts people seem to actually be using in various
implementations, e.g.:

http://aplwiki.com/AplCharacters

the general choice seems to be to use both uppercase and lowercase Latin
letters, and to forgo the old convention of underlined uppercase Latin
letters. That seems a small adjustment to make in order not to stay stuck
in the '70s, frankly.

I can understand Alex's request that Unicode then effectively "solve the
problem" by providing a fixed-width 16-bit entity for "_A_" that could then
just be added to the restricted subset in the APL implementations. But that
isn't going to happen, because of the normalization stability guarantees for
the Unicode Standard.

And in any case, if users of APL need something more sophisticated for
actual string handling than strictly limited subsets based on the assumption
that character=element_of_fixed_data_size_array, then rho and a limited
subset aren't going to handle it anyway. At that point, another layer of
abstraction would have to be built on top of the basic array and vector
processing. And then Khaled's points about character=grapheme_cluster
become relevant.
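
For what it's worth, here is a deliberately simplified sketch of what
"character = grapheme cluster" counting looks like, assuming BMP-only text
and treating only U+0300..U+036F as combining marks; a real implementation
would follow UAX #29 (e.g. via ICU's break iterators):

#include <stddef.h>
#include <stdio.h>
#include <uchar.h>   /* char16_t (C11) */

/* Simplification: only U+0300..U+036F (Combining Diacritical Marks)
   are treated as marks that extend the preceding cluster. */
static int is_combining(char16_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

/* count clusters: every non-mark code unit starts a new one, and
   trailing marks are folded into it (BMP-only, no surrogate handling) */
static size_t clusters(const char16_t *s) {
    size_t n = 0;
    for (size_t i = 0; s[i] != 0; i++) {
        if (!is_combining(s[i]))
            n++;
    }
    return n;
}

int main(void) {
    /* three underlined letters, each written as base letter + U+0332 */
    const char16_t underlined[] = u"A\u0332B\u0332C\u0332";
    printf("%zu\n", clusters(underlined));   /* prints 3, not 6 */
    return 0;
}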

--Ken

On 8/16/2015 9:53 AM, Khaled Hosny wrote:
> On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner_at_alexweiner.com wrote:
>> So I'm not sure why the allowance was made for ä as well as certain other
>> characters, but not for other things (under-bar characters) that face
>> similar representation issues.
>
> It was encoded for compatibility with pre-existing character sets AFAIK.
>
> Regards,
> Khaled