Matching opening and closing characters: How?

From: Joachim Durchholz (
Date: Fri Aug 07 2009 - 02:46:22 CDT



    I'm trying to make a piece of software determine if an opening and a
    closing character match.

    E.g. ( matches ), [ matches ], « matches ».

    I'm looking for a workable approach to base that on Unicode character
    properties, but I see lots of problems and would appreciate any advice
    on how to proceed.

    Problem group 1: Determining which characters are in the "quotation
    marks" set and in the "parentheses" sets, respectively.

    1a) For quotation marks, General Category: Initial Quote Punctuation and
    Final Quote Punctuation is a good first approximation, but it's missing
    some characters (particularly the ASCII single and double quotes " and
    ', but also e.g. U+FF02 Fullwidth Quotation Mark).

    1b) The Quotation Mark property is more complete (in particular it does
    contain " and ' ), but it is just informative and hence not subject to a
    stability policy. That's a no-go for a programming language - imagine
    all strings turning into syntax errors because the Unicode Consortium
    decides to drop the Quotation Mark property from " !

    1c) Given the problems with quote punctuation, I worry that General
    Category: Start Punctuation and End Punctuation may be incomplete as
    well. I can't check that, partly because the character set is so huge
    and partly because I'm no expert in foreign character sets.

    1d) There seem to be errors in the categorization of some characters.
    U+201a ‚ SINGLE LOW-9 QUOTATION MARK strikes me as a quote (Gc: Initial
    Quote Punctuation), but its General Category is Start Punctuation just
    like Left Parenthesis.
    The same goes for U+201e „ DOUBLE LOW-9 QUOTATION MARK.
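
    The observations in 1a and 1d can be reproduced with a small sketch.
    It assumes Python and its standard unicodedata module (the post names
    no language), and classify() is a hypothetical helper that looks at
    General Category only. The results reflect whatever Unicode version
    the Python build ships with.

```python
import unicodedata

QUOTE_CATEGORIES = {"Pi", "Pf"}    # Initial / Final Quote Punctuation
BRACKET_CATEGORIES = {"Ps", "Pe"}  # Start / End Punctuation


def classify(ch: str) -> str:
    """Classify a character by General Category alone (hypothetical helper)."""
    gc = unicodedata.category(ch)
    if gc in QUOTE_CATEGORIES:
        return "quote"
    if gc in BRACKET_CATEGORIES:
        return "bracket"
    return "other"


# Characters mentioned in 1a and 1d, plus ( ) and the guillemets for comparison.
samples = ["(", ")", "\u00ab", "\u00bb", '"', "'", "\uff02", "\u201a", "\u201e"]

for ch in samples:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  "
          f"{classify(ch):8}  {unicodedata.name(ch)}")
```

    With this classification the ASCII quotes and U+FF02 fall into
    "other", and the two low-9 quotation marks come out as brackets,
    which is the mismatch described above.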

    Problem group 2: How to determine that two characters match?

    Assuming I have an opening and a closing character and know they're
    either parentheses or quotes: By what criteria could I decide whether
    they match?
    E.g. ( would match ), but ( would not match ].

    2a. I found only one property that even lists another character as its
    property value, namely Bidi Mirroring Glyph. However, it is informative
    only (the same concern as in 1b).

    2b. It does not cover vertical scripts: characters intended for use in
    vertical context such as U+23B4 Top Square Bracket don't have a Bidi
    Mirroring Glyph.

    2c. It may be erroneous, too: U+fd3e Ornate Left Parenthesis is not
    linked up with U+fd3f Ornate Right Parenthesis.
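
    One fallback that sidesteps the property question is an explicit,
    hand-maintained pair table. The sketch below assumes Python; PAIRS
    and matches() are hypothetical names, and the table holds only the
    three pairs used as examples in this post. Python's unicodedata
    exposes only the yes/no Bidi Mirrored flag, not the Bidi Mirroring
    Glyph value itself, so the loop at the end merely prints that flag
    for a few of the characters discussed.

```python
import unicodedata

# Hypothetical hand-maintained table: opening character -> closing character.
PAIRS = {
    "(": ")",
    "[": "]",
    "\u00ab": "\u00bb",  # « and »
}


def matches(opener: str, closer: str) -> bool:
    """Return True if closer is the registered partner of opener."""
    return PAIRS.get(opener) == closer


print(matches("(", ")"))  # registered pair
print(matches("(", "]"))  # both are brackets, but not partners

# Bidi Mirrored flag only (1 or 0); the mirroring glyph is not available here.
for ch in ["(", "\u23b4", "\ufd3e"]:
    print(f"U+{ord(ch):04X}  mirrored={unicodedata.mirrored(ch)}")
```

    A fixed table is stable by construction, at the cost of having to
    decide by hand which pairs a language accepts.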

    I'll want to "normalize away" compatibility characters and confusables.
    This may take care of the problems with concrete character groups that I
    listed as potentially erroneous above (they may get rejected or
    normalized away anyway).
    I haven't yet opened the big can of worms that "confusables" represents,
    though; it's the next big thing on my reading list.
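
    For the compatibility part, a minimal sketch of what "normalizing
    away" could look like, assuming Python and NFKC as the normalization
    form (the post does not say which form is intended). fold_compat()
    is a hypothetical name. Confusables are a separate data set and are
    not touched by this.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Apply NFKC so compatibility variants map to their plain counterparts."""
    return unicodedata.normalize("NFKC", text)


# Fullwidth quotation mark, fullwidth left parenthesis, single low-9 quote.
for ch in ["\uff02", "\uff08", "\u201a"]:
    folded = fold_compat(ch)
    print(f"U+{ord(ch):04X} -> " + " ".join(f"U+{ord(c):04X}" for c in folded))
```

    The fullwidth forms fold to the ASCII characters, while U+201A has
    no compatibility decomposition and stays as it is, so normalization
    alone does not resolve 1d.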

    What would be the best course of action?

    Things I have considered:
    A) Work with the Unicode Consortium to make sure that there are
    normative properties for the purpose. I'm not sure that that's possible;
    it may turn out to be too big a burden for me and/or unwelcome from the
    side of the Unicode Consortium. (I'm a private person, so a membership
    is probably too expensive and/or too taxing on my time.)
    B) Require that a programming language is nailed down to a single
    version of Unicode. (I think Java essentially does this.)
    C) Require that programming languages with this kind of Unicode support
    start with a marker that nails down the Unicode version. (This is highly
    undesirable as it makes copying and pasting code an inherently
    unreliable operation: pasted code may have its semantics changed because
    the new context assumes a different version of Unicode.)
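
    To make option C concrete, here is a rough sketch of a version
    marker check, assuming Python. The "#unicode X.Y.Z" first-line
    syntax and the name check_unicode_marker() are invented for
    illustration; only unicodedata.unidata_version is an existing
    attribute, giving the Unicode version of the running interpreter's
    character database.

```python
import unicodedata

MARKER_PREFIX = "#unicode "  # hypothetical marker syntax


def check_unicode_marker(source: str) -> bool:
    """Return True if the first line pins the Unicode version of this runtime."""
    first_line = source.split("\n", 1)[0].strip()
    if not first_line.startswith(MARKER_PREFIX):
        return False  # no marker present
    wanted = first_line[len(MARKER_PREFIX):].strip()
    return wanted == unicodedata.unidata_version


program = f"#unicode {unicodedata.unidata_version}\nprint('hello')\n"
print(check_unicode_marker(program))
print(check_unicode_marker("print('hello')\n"))
```

    The copy-and-paste concern is visible here: a fragment pasted
    without its first line no longer carries the version it was written
    against.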

    Any insights and advice appreciated.

