From: Asmus Freytag (email@example.com)
Date: Fri Aug 07 2009 - 16:00:26 CDT
On 8/7/2009 12:46 AM, Joachim Durchholz wrote:
> I'm trying to make a piece of software determine if an opening and a
> closing character match.
> E.g. ( matches ), [ matches ], « matches ».
> I'm looking for a workable approach to base that on Unicode character
> properties, but I see lots of problems and would appreciate any advice
> how to proceed.
The task of determining a matching pair is in principle easier than
defining which is the opening and which is the closing one of the set.
But even that is not as straightforward as you think. You may well be
familiar with notations such as [...) which denote a half-open set. Here
the ")" closes the expression started with the "[". Similar unusual use
of various brackets can be found in technical / mathematical texts,
including the use of reverse bracketing: "]...[". Even in otherwise
ordinary texts you can find unpaired use of parentheses, as in the use of
"1a)" etc. in your own message.
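To make the point concrete, here is a minimal sketch of a delimiter check in which the set of acceptable closers is a parameter rather than a fixed fact about the characters. The tables STRICT_PAIRS and INTERVAL_PAIRS and the function is_balanced are names made up for this illustration; they come from no Unicode data file.

```python
# Hypothetical pair tables: each opener maps to the set of closers accepted for it.
STRICT_PAIRS = {"(": {")"}, "[": {"]"}, "{": {"}"}, "\u00ab": {"\u00bb"}}
# Interval notation additionally lets "[" be closed by ")" and "(" by "]".
INTERVAL_PAIRS = {"(": {")", "]"}, "[": {"]", ")"}}


def is_balanced(text, pairs):
    """Return True if every opener in text is closed by an accepted closer."""
    closers = set().union(*pairs.values())
    stack = []
    for ch in text:
        if ch in pairs:
            stack.append(ch)
        elif ch in closers:
            # A closer with no pending opener, or the wrong opener, fails.
            if not stack or ch not in pairs[stack.pop()]:
                return False
    return not stack
```

With these tables, "[0, 1)" is rejected under STRICT_PAIRS and accepted under INTERVAL_PAIRS, while "1a)" is rejected under both. Which answer is "right" depends on the notation, not on the characters.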
> Problem group 1: Determining which characters are in the "quotation
> marks" set and in the "parentheses" sets, respectively.
> 1a) For quotation marks, General Category: Initial Quote Punctuation and
> Final Quote Punctuation is a good first approximation, but it's missing
> some characters (particularly the ASCII single and double quotes " and
> ', but also e.g. U+FF02 Fullwidth Quotation Mark).
The General Category is limiting in that it can only denote a single
"principal" characteristic of a character.
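A quick way to see this is to print the General Category of a few quote-like characters. The sketch below uses Python's standard unicodedata module; the helper name is_quote_by_gc and the sample list are chosen here for illustration only.

```python
import unicodedata


def is_quote_by_gc(ch):
    """True if the General Category alone marks ch as a quotation mark."""
    return unicodedata.category(ch) in ("Pi", "Pf")


# ASCII quotes, curly double quotes, chevrons, fullwidth quote, parenthesis.
samples = ['"', "'", "\u201c", "\u201d", "\u00ab", "\u00bb", "\uff02", "("]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

The ASCII quotes and U+FF02 are reported as Po (Other Punctuation), so a test based on Pi/Pf alone misses them.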
> 1b) The Quotation Mark property is more complete (in particular it does
> contain " and ' ), but it is just informative and hence not subject to a
> stability policy. That's a no-go for a programming language - imagine
> all strings turning into syntax errors because the Unicode consortium
> decides to drop the Quotation Mark property from " !
Syntax rules have no business being defined by Unicode property to begin
with. The use that programming language syntax makes of Unicode
characters is exclusively defined by the specification of that
programming language. From the perspective of Unicode, then, the only
thing of interest in this context would be some other tasks of matching
delimiters where it's not possible to define a rigid syntax.
In such a context, a normative property would not be required, and in
fact, is likely to be misconstrued by general users as placing a limit
on what writers can do with various Unicode characters. A formal
guarantee of stability (which, more often than not, also leads to the
preservation of unintentional errors(!)) should be reserved for only the
most compelling cases. In practice, other than the correction of errors,
people don't usually go around and suddenly identify different matching
> 1c) Given the problems with quote punctuation, I'm worrying that General
> Category: Start Punctuation and End Punctuation may be incomplete as
> well. I can't check that, partly because the character set is so huge
> and partly because I'm no expert in foreign character sets.
> 1d) There seem to be errors in the categorization of some characters.
> U+201a ‚ SINGLE LOW-9 QUOTATION MARK strikes me as a quote (Gc: Initial
> Quote Punctuation), but its General Category is Start Punctuation just
> like Left Parenthesis.
> The same goes for U+201e „ DOUBLE LOW-9 QUOTATION MARK.
The use of quotation marks is extremely variable. About the *only*
thing that's known about the curly quotation marks is that the "low"
variant is only used at the start of a quotation. (Please read the
section on quotation marks in the standard). The other curly quotation
marks can be used (despite their gc values) as either opening or closing
and sometimes as **both** in the same quote. The gc values describe
their use in English, but are woefully wrong for many languages. The
same goes for the chevron style of quotation marks.
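As a rough illustration, the sketch below pairs a few common quotation conventions with the gc values of the characters involved. The table QUOTE_PAIRS_BY_STYLE is a small hand-picked sample written for this message (its keys are labels invented here), not data taken from any Unicode file, and it is far from complete.

```python
import unicodedata

# Illustrative, incomplete sample of (opening, closing) quotation marks.
QUOTE_PAIRS_BY_STYLE = {
    "english": ("\u201c", "\u201d"),           # U+201C ... U+201D
    "german": ("\u201e", "\u201c"),            # U+201E ... U+201C
    "german_chevrons": ("\u00bb", "\u00ab"),   # U+00BB ... U+00AB
    "swedish": ("\u201d", "\u201d"),           # U+201D ... U+201D
    "french": ("\u00ab", "\u00bb"),            # U+00AB ... U+00BB
}


def gc_of_pair(style):
    """Return the General Category of the opening and closing mark."""
    opener, closer = QUOTE_PAIRS_BY_STYLE[style]
    return unicodedata.category(opener), unicodedata.category(closer)


for style in QUOTE_PAIRS_BY_STYLE:
    print(style, gc_of_pair(style))
```

Only the English and French rows come out as (Pi, Pf). In the German row the closing mark is a character whose gc says "initial", and in the Swedish row the same "final" character serves at both ends.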
> Problem group 2: How to determine that two characters match?
> Assuming I have an opening and a closing character and know they're
> either parentheses or quotes: On what criteria could I base that they
> match or don't?
> E.g. ( would match ), but ( would not match ].
For some quotation marks, the rules are language dependent. For some
parens and brackets, the rules depend on the notation or orthography
(e.g. when to use unpaired, as your "1c)" above).
> 2a. I found only one property that even lists another character as its
> property value, namely Bidi Mirroring Glyph. However, it is informative
There's nothing wrong with informative properties. Especially in
situations where any property, as is the case here, can, at best, define
the "default" behavior in the context of a few languages (usually
English and whatever languages share the same conventions).
> 2b. It does not cover vertical scripts: characters intended for use in
> vertical context such as U+23b4 Top Square Bracket don't have a Bidi
> Mirroring Glyph.
These characters, being intended for terminal emulation, are better off
not being considered in this context.
> 2c. It may be erroneous, too: U+fd3e Ornate Left Parenthesis is not
> linked up with U+fd3f Ornate Right Parenthesis.
Correct, but one that can't ever be fixed. (Because that would break
software). The error was in not giving the characters the mirrored property.
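The missing mirrored property can be observed directly. The sketch below relies on unicodedata.mirrored from Python's standard library, which reports the Bidi_Mirrored property; as far as I know that module does not expose the Bidi Mirroring Glyph mapping itself. The wrapper name is_bidi_mirrored is made up for this example.

```python
import unicodedata


def is_bidi_mirrored(ch):
    """True if ch carries the Bidi_Mirrored property."""
    return bool(unicodedata.mirrored(ch))


for ch in "()\ufd3e\ufd3f":
    print(f"U+{ord(ch):04X} mirrored={is_bidi_mirrored(ch)} {unicodedata.name(ch)}")
```

The ASCII parentheses are reported as mirrored; the two ornate parentheses are not, which is why no mirroring glyph links them.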
> I'll want to "normalize away" compatibility characters and confusables.
> This may take care of the problems with concrete character groups that I
> listed as potentially erroneous above (they may get rejected or
> normalized away anyway).
> I haven't opened that big can of worms that "confusables" represents
> though. Yet - it's the next big thing on my reading list.
> What would be the best course of action?
> Things I have considered:
> A) Work with the Unicode consortium to make sure that there are
> normative properties for the purpose.
I think it would be ill advised for the UTC to create a normative
property just for this purpose.
However, a general informative property, that identifies which paired
characters were encoded as a pair might be of interest.
Such a property could and should be extended to other paired characters,
like the arrows, because those are absent from BidiMirroringGlyph (as
arrows are not mirrored) but knowing which arrow is the directional pair
of which other one is useful nevertheless.
It is a legitimate task for Unicode to identify such paired characters,
especially as they are not always coded together. Such a property speaks
primarily to the *identity* of the character, which is the primary
concern of the UTC when encoding characters. The property would not
primarily speak to how such characters are *used*, because, as I've
tried to indicate, such usage rules are too varied, too context and
language dependent, to be captured by a Unicode character property.
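To sketch what such an informative "encoded as a pair" property might look like to a program: the table ENCODED_PAIRS and the function partner_of below are entirely hypothetical. No such property exists in the Unicode Character Database as of this message, and the entries are a hand-picked sample.

```python
# Hypothetical data: a few characters that were encoded as pairs.
ENCODED_PAIRS = [
    ("(", ")"),
    ("[", "]"),
    ("\ufd3e", "\ufd3f"),  # ornate parentheses, absent from BidiMirroringGlyph
    ("\u2190", "\u2192"),  # leftwards / rightwards arrow
    ("\u2191", "\u2193"),  # upwards / downwards arrow
]

# Build a symmetric lookup so either member finds its partner.
PARTNER = {}
for first, second in ENCODED_PAIRS:
    PARTNER[first] = second
    PARTNER[second] = first


def partner_of(ch):
    """Return the character encoded as the pair of ch, or None."""
    return PARTNER.get(ch)
```

Note that the lookup says nothing about which member opens and which closes, or whether a given text uses them as a pair at all; it only records identity.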
> (I'm a private person...
Not relevant in this context.
> B) Require that a programming language is nailed down to a single
> version of Unicode. (I think Java essentially does this.)
Syntax definitions should be nailed down like that. It's not as much of
a burden, because the kinds of characters you describe here are not
being added in large numbers.
> C) Require that programming languages with this kind of Unicode support
> start with a marker that nails down the Unicode version. (This is highly
> undesirable as it makes copying and pasting code an inherently
> unreliable operation: pasted code may have its semantics changed because
> the new context assumes a different version of Unicode.)
You are talking about programming *source* text. For that, if you
present a newer program (with newer Unicode characters) the existence of
the newer character code should give you a syntax error. No need to mark
anything, as long as your compilers or interpreters reject anything
that's not defined when they were written.
Later versions can support more characters; undoubtedly later versions
will have some other new features that make new program texts that use
these features not 100% backwards compatible anyway.
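A minimal sketch of that rejection rule, assuming the compiler's notion of "defined" is simply the Unicode version its character database was built with. Here that is whatever version Python's unicodedata module ships (available as unicodedata.unidata_version); the function name check_source is made up for this example.

```python
import unicodedata


def check_source(text):
    """Raise SyntaxError on any code point unassigned in this Unicode version."""
    for offset, ch in enumerate(text):
        # "Cn" covers code points not assigned in the bundled database.
        if unicodedata.category(ch) == "Cn":
            raise SyntaxError(
                f"code point U+{ord(ch):04X} at offset {offset} is not "
                f"assigned in Unicode {unicodedata.unidata_version}"
            )


check_source("x = (1 + 2)")  # passes silently
```

A source file using a character added in a later Unicode version fails this check on an older compiler, with no version marker needed in the file.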