Re: U+007E is informatively Sm?

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 04 2000 - 21:20:01 EDT


Tom Emerson (tree) asked:

> In UnicodeData.txt, U+007E TILDE is given the general category of Sm,
> Symbol Math.
>
> From Section 6.1, p. 149 of TUS3, the implication is that U+007E is
> actually punctuation, and should have the general category of Po,
> Punctuation Other. Indeed, the text compares this with U+223C TILDE
> OPERATOR, which is in the Mathematical Operators block and also has a
> general category of Sm.
>
> Anyway, this came up when a coworker came into my office and said,
> "Did you know tilde isn't punctuation?" In many of our applications,
> where we are looking at URLs and such, it certainly is punctuation, as
> much as '/' or ':'.
>
> So I guess what I'm asking is this: what is the rationale for U+007E
> being given the property Sm instead of Po?

The rationale was one of conservative change. The earliest General
Category assignment for U+007E was in UnicodeData-1.1.5.txt, where it
was assigned "So". When the math property was reviewed for Unicode 2.0, we
ran a consistency check against the General Category, and any "So"
character that had the math property was changed to "Sm".

The original rationale was that U+007E was *not* the math character,
whereas U+223C TILDE OPERATOR *was*. However, as for most ASCII symbols,
U+007E is massively overloaded -- and of course it had preexisting use
as a math symbol in ASCII-based formal languages, for example. For the
math property to make any sense at all, it had to include U+007E, as
well as many other overloaded ASCII symbols. (See p. 101 of TUS 3.0 for
the listing.)

So the question devolves to a consideration of why U+007E was initially
given an S[x] (symbol) General Category, whereas U+002F SOLIDUS was
given a P[x] (punctuation) General Category. And for that, I think the
answer comes down to a consideration of general body text usage as
punctuation. The SOLIDUS has widespread usage as true punctuation,
as in "and/or" and similar constructions in text, whereas the TILDE proper
(as opposed to the swung dash) does not. (Michael Everson pointed out
the need for a distinct encoding for the swung dash character in Unicode.)
Of course, since the swung dash is not yet separately encoded, and because
of ASCII practice, U+007E is actually still overloaded with the swung
dash semantics in practice. So the line is a fairly fine one, and not
very defensible against case-by-case nitpicking.

The problem, of course, is that a character like U+007E has
multiple functions and multiple properties, and the General Category
partition in UnicodeData, which assigns each character exactly one
value, cannot capture such a situation.
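
To make the single-value point concrete, here is a minimal sketch in Python.
It assumes the standard unicodedata module, whose data reflects whatever
Unicode version the Python build ships with (not necessarily 3.0), and the
helper name general_categories is purely illustrative:

```python
import unicodedata

def general_categories(chars):
    """Illustrative helper: map each character to its single General Category."""
    return {ch: unicodedata.category(ch) for ch in chars}

# However many roles a character plays in practice, the lookup
# yields exactly one two-letter category per character.
for ch, cat in general_categories("~\u223c/:$").items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

With current data this reports Sm for both U+007E and U+223C, Po for the
solidus and colon, and Sc for the dollar sign; none of the other uses of
these characters is visible through this property.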

You also cannot draw absolute conclusions from the placement of
the discussion of tilde in The Unicode Standard in a section labelled
"Punctuation". That section also discusses glyph alternations of the
dollar sign, for example -- which is a currency sign.

What it comes down to is that, especially when dealing with ASCII-based
legacy characters, you cannot blindly rely either on an informative
General Category property from UnicodeData or on implementations of
isPunct() or its kindred APIs to provide the "right" answer in all
circumstances. If you are parsing URLs, for example, then most ASCII
symbols (i.e., non-letters and non-digits) will have particular functions,
and relying on general property values to determine them is
insufficient.
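
For the URL case, one way to act on this is to classify characters with an
explicit, application-specific table instead of a general property lookup.
The sketch below is only an illustration: it assumes the character classes
of RFC 3986 (which postdates this message), and url_role is a hypothetical
helper, not part of any library:

```python
import string
import unicodedata

# Character classes taken from RFC 3986 (an assumption; other URL
# specifications draw these lines differently).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")

def url_role(ch):
    """Hypothetical helper: classify one character by its role in a URL."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    return "other"

# Compare the URL-specific role with the General Category.
for ch in "~/:$":
    print(repr(ch), url_role(ch), unicodedata.category(ch))
```

Under these assumptions the tilde is an ordinary unreserved character while
'/' and ':' are delimiters, a distinction that neither Sm nor Po expresses.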

--Ken

>
> TIA,
>
> -tree
>
> --
> Tom Emerson Basis Technology Corp.
> Zenkaku Language Hacker http://www.basistech.com


