Re: New Public Review Issue: Proposed Update UTS #18

From: Jonathan Coxhead (
Date: Mon Sep 24 2007 - 05:33:28 CDT


       Mike wrote:

    > I played around with the ability to add digraphs to "." and came up
    > with two methods. The first would be to specifically list them using
    > syntax such as:

       I'd just like to point out that a "[ ]" regular expression is defined to
    always match exactly one character (if it matches at all).

       You can write "[abcdef]" as "(a|b|c|d|e|f)" if you like. You can also write
    "(a|bb|ccc|dddd|eeeee|ffffff)", but there is no form using "[ ]" to match the
    same thing.
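
       A minimal sketch of the equivalence described above, assuming Python's
    built-in "re" module (the message itself names no regex engine). The names
    match_length and samples are illustrative only.

```python
import re

# A bracket expression and the equivalent single-character alternation.
char_class = re.compile(r"[abcdef]")
alternation = re.compile(r"(a|b|c|d|e|f)")

# An alternation whose branches have different lengths; no "[ ]" form exists for it.
multi = re.compile(r"(a|bb|ccc|dddd|eeeee|ffffff)")


def match_length(pattern, text):
    """Return the length of a match covering all of text, or None if there is none."""
    m = pattern.fullmatch(text)
    return len(m.group(0)) if m else None


samples = ["a", "f", "g", "bb", "ccc"]
class_results = [match_length(char_class, s) for s in samples]
alt_results = [match_length(alternation, s) for s in samples]
multi_results = [match_length(multi, s) for s in samples]
```

    Here class_results and alt_results agree on every sample and never report a
    length other than 1, while multi_results reports lengths 2 and 3 for "bb" and
    "ccc".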

       "[ ]" exists primarily as an optimisation, because matching one character
    against a set is a fast operation, whereas checking against an unknown number of
    alternatives of potentially varying lengths ("( | )") is expensive.
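
       The cost difference can be sketched as follows. This is a simplified
    illustration in Python, not how any particular regex engine is implemented;
    match_class and match_alternatives are hypothetical helpers.

```python
def match_class(chars, text, pos):
    """Match one character at pos against a set: a single membership test."""
    if pos < len(text) and text[pos] in chars:
        return pos + 1  # a set always consumes exactly one character
    return None


def match_alternatives(alternatives, text, pos):
    """Try each alternative in turn; each may consume a different number of characters."""
    for alt in alternatives:
        if text.startswith(alt, pos):
            return pos + len(alt)  # first matching alternative wins
    return None


letters = frozenset("abcdef")
branches = ["a", "bb", "ccc", "dddd", "eeeee", "ffffff"]

class_end = match_class(letters, "cab", 0)
alt_end = match_alternatives(branches, "xccc", 1)
```

    match_class does one lookup regardless of the size of the set, whereas
    match_alternatives may compare against every branch, and the end position
    depends on which branch matched.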

       So an expression such as "[^ ]" could never match a whole message, or the
    string "New York": it could only match a single character.
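
       For illustration, again assuming Python's "re" module (which treats a
    character as one code point; other engines may differ):

```python
import re

negated = re.compile(r"[^ ]")  # any single character other than a space
text = "New York"

first = negated.match(text)      # anchored at the start of the string
whole = negated.fullmatch(text)  # must cover the entire string
pieces = negated.findall(text)   # every separate one-character match
```

    first matches only "N", whole is None because one character cannot cover the
    eight-character string, and pieces contains seven one-character matches.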

       What exactly this means in the context of Unicode is a different matter, but
    I imagine some sort of historical consistency is desirable.

    ... Jonathan
        Belmont CA 94002
