Re: Unicode in source code. WHY?

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 21 1999 - 17:29:45 EDT


Adam suggested:

>
> On Tue, Jul 20, 1999 at 10:57:31PM -0700, Jonathan Rosenne wrote:
> > We also know that those environments that do allow the use of Unicode are
> > not all compatible. I see two main problems:
> >
> > 1. Should the full Unicode repertoire be allowed, or just a subset?
>
> Why is that a problem? 8-bit bytes should be allowed without question.
> That allows all of non-ASCII Unicode, including punctuation, quotation
> marks, and everything else, UTF-8 encoded. We would finally be able to
> use a non-breaking space instead of the underscore kludge.

This is asking for big-time trouble. There are already formal languages,
e.g. SQL, that recognize Asian full-width punctuation as equivalent
to the ASCII forms, especially for quotation marks. This is so that
quoted literals don't get misinterpreted and blow up queries depending
on whether the user was in single-byte or double-byte modes when they
typed the quotes, for example. There is plenty of special-case code written
into SQL parsers to ensure correct behavior (as defined by users) in these
kinds of cases.
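
To make that concrete, here is a minimal sketch (in Java; the method
name and the handful of mappings are only illustrative, not taken from
any actual SQL parser) of the kind of punctuation folding such a lexer
has to apply before tokenizing:

    // Illustrative only: fold full-width punctuation to the ASCII
    // equivalents the grammar is defined in terms of, so a quoted
    // literal is recognized no matter which form the user typed.
    static String foldFullwidthPunctuation(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\uFF02': out.append('"');  break; // FULLWIDTH QUOTATION MARK
                case '\uFF07': out.append('\''); break; // FULLWIDTH APOSTROPHE
                case '\uFF08': out.append('(');  break; // FULLWIDTH LEFT PARENTHESIS
                case '\uFF09': out.append(')');  break; // FULLWIDTH RIGHT PARENTHESIS
                default:       out.append(c);    break;
            }
        }
        return out.toString();
    }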

Treating all Unicode characters above U+007F as just unanalyzed byte values
acceptable for being part of identifiers, regardless of their properties
otherwise, is an easy kludge for extending an 8-bit parser to handle
UTF-8 in a definable way, but it ends up with all kinds of anomalies,
where quotes aren't quotes, brackets aren't brackets, spaces aren't
spaces, and equivalences aren't equivalent.

There are reasons why Java went the way it did. Once you step up to
the need for a Unicode parser to be parsing Unicode *characters*, and
not just some byte stream expressed in UTF-8, then it is rather
trivial to express the parser in terms of a specification of whether
particular characters should be valid as lead or continuation
characters in identifiers. The complexity, such as it is, is all
contained in the tableware that allows you to specify the classes.
And, if you need that tableware anyway to support other character
properties (for expressing the Unicode equivalents of the C runtime's
isalpha(), isdigit(), isspace(), etc.), then there is no good reason
not to do it correctly.
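
To give one concrete shape to that: once the parser is looking at
characters rather than bytes, the identifier check falls out of the
property tables almost for free. A rough sketch, leaning on the
Unicode identifier properties bundled with the JDK (char-at-a-time,
so BMP-only):

    // Sketch: classify identifier characters by their Unicode
    // properties instead of by raw byte values above 0x7F.
    static boolean isValidIdentifier(String s) {
        if (s.length() == 0 || !Character.isUnicodeIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isUnicodeIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }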

>
> UTF-8 encoding makes the various environments compatible.
>
> > 2. When are two identifiers to be considered equivalent?
>
> When they consist of the same sequence of bytes. That would certainly work
> for case-sensitive languages. Case-insensitive languages would need to
> rely on the operating system to make the comparison.

You will, I presume, get to my next response on this thread. Comparing
just the same sequence of bytes is not acceptable for Unicode. You
need to specify comparison on a normalized form for this to make
sense.
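
A minimal sketch of what that means in practice, taking the JDK's
java.text.Normalizer as one example of the necessary tableware (NFC
is used here; the essential point is only that both sides go through
the same normalization before comparison):

    import java.text.Normalizer;

    // Sketch: identifiers compare equal only after normalization,
    // so a precomposed character and its decomposed (base plus
    // combining mark) spelling are treated as the same identifier.
    static boolean sameIdentifier(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }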

And language tools that need to do case-folding should *not* be relying
on the operating system to make the comparison. If they do that, they
are depending on varying levels of implementation of case-mappings,
which may or may not be locale-specific or limited to particular
character encoding repertoires. Any tool worth its salt that claims
to be supporting Unicode will bind in a full set of case-mapping tables
based on the data tables provided by the Unicode Consortium -- with or
without extensions for special language behavior. That is the way the
tool can guarantee predictable cross-platform behavior for case-folding.
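
As a rough illustration, that amounts to binding the folding to the
Unicode case-mapping tables shipped with the tool's own runtime (here
the JDK's Character tables; this is the simple one-to-one folding,
leaving aside the multi-character special cases):

    // Sketch: case-insensitive identifier comparison using the
    // runtime's own Unicode case mappings, not whatever mappings
    // the host operating system happens to implement.
    static boolean equalIgnoringCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            char ca = Character.toLowerCase(Character.toUpperCase(a.charAt(i)));
            char cb = Character.toLowerCase(Character.toUpperCase(b.charAt(i)));
            if (ca != cb) {
                return false;
            }
        }
        return true;
    }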

--Ken

>
> Adam
>


