From: Joachim Durchholz (email@example.com)
Date: Fri Aug 07 2009 - 02:46:22 CDT
I'm trying to make a piece of software determine if an opening and a
closing character match.
E.g. ( matches ), [ matches ], « matches ».
I'm looking for a workable approach to base that on Unicode character
properties, but I see lots of problems and would appreciate any advice
on how to proceed.
Problem group 1: Determining which characters belong to the "quotation
marks" set and which belong to the "parentheses" set.
1a) For quotation marks, General Category: Initial Quote Punctuation and
Final Quote Punctuation is a good first approximation, but it's missing
some characters (particularly the ASCII single and double quotes " and
', but also e.g. U+FF02 Fullwidth Quotation Mark).
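The gap described in 1a can be checked directly. The following is a
minimal sketch, assuming Python's standard unicodedata module (whose
data reflects whatever Unicode version the interpreter ships with); the
helper name is_quote_by_category is made up for illustration.

```python
import unicodedata

# Candidate quote characters mentioned in 1a.
candidates = ['"', "'", "\u00ab", "\u00bb", "\uff02"]

def is_quote_by_category(ch):
    """True if the General Category is Pi (initial quote) or Pf (final quote)."""
    return unicodedata.category(ch) in ("Pi", "Pf")

for ch in candidates:
    # Print code point, character name, General Category and the verdict.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{unicodedata.category(ch)} quote={is_quote_by_category(ch)}")
```

The guillemets come out as Pi and Pf, while the ASCII quotes and U+FF02
are reported as Po (Other Punctuation), so a test based only on Pi/Pf
rejects them.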
1b) The Quotation Mark property is more complete (in particular it does
contain " and ' ), but it is just informative and hence not subject to a
stability policy. That's a no-go for a programming language - imagine
all strings turning into syntax errors because the Unicode Consortium
decides to drop the Quotation Mark property from " !
1c) Given the problems with quote punctuation, I'm worrying that General
Category: Start Punctuation and End Punctuation may be incomplete as
well. I can't check that, partly because the character set is so huge
and partly because I'm no expert in foreign character sets.
1d) There seem to be errors in the categorization of some characters.
U+201a ‚ SINGLE LOW-9 QUOTATION MARK strikes me as a quote (Gc: Initial
Quote Punctuation), but its General Category is Start Punctuation just
like Left Parenthesis.
The same goes for U+201e „ DOUBLE LOW-9 QUOTATION MARK.
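One way to approach 1c and 1d without being an expert in every script
is to enumerate the bracket categories mechanically and flag entries
whose names look like quotes. This is only a rough sketch, again
assuming Python's unicodedata module; the name-based filter is a
heuristic chosen for illustration, not a Unicode property.

```python
import sys
import unicodedata

def chars_in_categories(categories):
    """Return all characters whose General Category is in `categories`."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) in categories]

# Everything categorized as Start Punctuation or End Punctuation.
brackets = chars_in_categories({"Ps", "Pe"})

# Bracket-category characters whose names suggest they are quotes.
suspicious = [ch for ch in brackets
              if "QUOTATION" in unicodedata.name(ch, "")]

for ch in suspicious:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

The full Ps/Pe list is short enough to review by eye, and the
"suspicious" list includes U+201A and U+201E from 1d.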
Problem group 2: How to determine that two characters match?
Assuming I have an opening and a closing character and know they're
either parentheses or quotes: By what criteria could I decide whether
they match?
E.g. ( would match ), but ( would not match ].
2a) I found only one property that even lists another character as its
property value, namely Bidi Mirroring Glyph. However, it is informative.
2b) It does not cover vertical scripts: characters intended for use in
vertical context such as U+23b4 Top Square Bracket don't have a Bidi
Mirroring Glyph.
2c) It may be erroneous, too: U+fd3e Ornate Left Parenthesis is not
linked up with U+fd3f Ornate Right Parenthesis.
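To make the matching question concrete, here is a minimal sketch in
Python. Python's unicodedata module exposes only the boolean Bidi
Mirrored flag, not the Bidi Mirroring Glyph mapping itself, so the PAIRS
table below is a hand-maintained assumption for illustration and is not
derived from any Unicode data file.

```python
import unicodedata

# Hand-maintained opener -> closer pairs (illustrative assumption,
# not taken from a Unicode property).
PAIRS = {
    "(": ")",
    "[": "]",
    "{": "}",
    "\u00ab": "\u00bb",   # guillemets
    "\ufd3e": "\ufd3f",   # ornate parentheses, paired by hand
}

def matches(opener, closer):
    """True if `closer` is the registered partner of `opener`."""
    return PAIRS.get(opener) == closer

# Show the Bidi Mirrored flag for the characters discussed in 2a-2c.
for ch in ["(", "\u23b4", "\ufd3e"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"mirrored={unicodedata.mirrored(ch)}")
```

With this table, matches("(", ")") holds and matches("(", "]") does not.
An explicit table sidesteps the stability question, at the cost of
having to maintain it by hand.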
I'll want to "normalize away" compatibility characters and confusables.
This may take care of the problems with concrete character groups that I
listed as potentially erroneous above (they may get rejected or
normalized away anyway).
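As a rough illustration of "normalizing away" compatibility characters,
the sketch below assumes NFKC is the normalization form meant here (the
choice of form is an assumption) and uses Python's
unicodedata.normalize; fold_compat is a made-up helper name.

```python
import unicodedata

def fold_compat(ch):
    """Apply NFKC, which replaces compatibility characters by their base forms."""
    return unicodedata.normalize("NFKC", ch)

for ch in ["\uff02", "\uff08", "\u201a"]:
    folded = fold_compat(ch)
    # Print the original and the normalized code point(s).
    print(f"U+{ord(ch):04X} -> " + " ".join(f"U+{ord(c):04X}" for c in folded))
```

The fullwidth quotation mark and fullwidth left parenthesis fold to
their ASCII counterparts, but U+201A has no compatibility decomposition
and is left unchanged, so normalization alone would not resolve the
categorization question for that character.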
I haven't yet opened the big can of worms that "confusables" represents,
though; it's the next big thing on my reading list.
What would be the best course of action?
Things I have considered:
A) Work with the Unicode Consortium to make sure that there are
normative properties for the purpose. I'm not sure that that's possible;
it may turn out to be too big a burden for me and/or unwelcome on the
part of the Unicode Consortium. (I'm a private person, so a membership
is probably too expensive and/or too taxing on my time.)
B) Require that a programming language is nailed down to a single
version of Unicode. (I think Java essentially does this.)
C) Require that programming languages with this kind of Unicode support
start with a marker that nails down the Unicode version. (This is highly
undesirable as it makes copying and pasting code an inherently
unreliable operation: pasted code may have its semantics changed because
the new context assumes a different version of Unicode.)
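For options B and C, an implementation has to know which Unicode version
its character data corresponds to. As a small sketch of that check,
Python exposes the version of its bundled data as
unicodedata.unidata_version; the pinned value and the function name
below are hypothetical.

```python
import unicodedata

# Hypothetical version a language specification might pin itself to.
PINNED_UNICODE_VERSION = "5.1.0"

def check_unicode_version(expected):
    """True if the runtime's Unicode data matches the expected version."""
    return unicodedata.unidata_version == expected

print("runtime Unicode data:", unicodedata.unidata_version)
print("matches pinned version:", check_unicode_version(PINNED_UNICODE_VERSION))
```

A mismatch here means that category-based decisions such as the ones in
problem group 1 may differ from what the pinned version would give.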
Any insights and advice appreciated.