From: Mark Davis (mark.davis@icu-project.org)
Date: Wed Aug 15 2007 - 11:02:18 CDT
The reason middle dot wasn't mentioned is that the UTC has decided to add
it to ID in Unicode 5.1 -- see the proposed update at
http://www.unicode.org/reports/tr31/tr31-8.html. (Middle dot was handled
specially: instead of removing the affected character in step #1, the
character in its decomposition that caused the problem was added.)
The differences can be seen by looking at
http://unicode.org/cldr/utility/unicodeset.jsp?a=[:id_continue:]&b=[:xid_continue:]
or
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[:id_continue:]-[:xid_continue:]][[:xid_continue:]-[:id_continue:]]]
I think it would be useful to add a more detailed description of the
derivation; I'll propose that to the editorial committee.
Mark
On 8/15/07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
>
> > I glean this as the algorithm:
> >
> > Add middle dot to ID_CONTINUE
> >
> > If an ID_START or ID_CONTINUE character has a decomposition containing a
> > character other than middle dot that's not in ID_CONTINUE, then remove
> > that character from ID_START or ID_CONTINUE.
> >
> > If an ID_START has a decomposition that begins with a character that's
> > not an ID_START, remove it from ID_START.
>
> Thanks, this is exactly what I was looking for - at least for Unicode
> 4.1, this algorithm produces an outcome equal to the published tables.
>
> Could that be added to UAX#31?
>
> Regards,
> Martin
>
>
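For illustration, the three quoted steps can be sketched in Python roughly as
follows. This is only a sketch under stated assumptions, not the official
derivation: the function name derive_xid and its decompose parameter are
hypothetical, "decomposition" is assumed to mean the full compatibility
decomposition (NFKD), and the membership tests in steps 2 and 3 are assumed to
run against the sets as they stand after step 1. Python's unicodedata module
does not expose ID_Start or ID_Continue, so the caller passes both sets in.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"


def derive_xid(id_start, id_continue, decompose=None):
    """Apply the three quoted steps and return (xid_start, xid_continue).

    id_start and id_continue are iterables of single characters.
    decompose maps a character to its decomposition string.
    """
    if decompose is None:
        # Assumption: "decomposition" means NFKD of the single character.
        def decompose(ch):
            return unicodedata.normalize("NFKD", ch)

    # Step 1: add middle dot to ID_CONTINUE.
    base_start = set(id_start)
    base_continue = set(id_continue) | {MIDDLE_DOT}

    # Step 2: drop any character whose decomposition contains a character
    # (other than middle dot) that is not in ID_CONTINUE.
    def decomposes_cleanly(ch):
        return all(c == MIDDLE_DOT or c in base_continue
                   for c in decompose(ch))

    xid_continue = {ch for ch in base_continue if decomposes_cleanly(ch)}
    xid_start = {ch for ch in base_start if decomposes_cleanly(ch)}

    # Step 3: drop any start character whose decomposition does not begin
    # with an ID_START character.
    def starts_with_id_start(ch):
        d = decompose(ch)
        return bool(d) and d[0] in base_start

    xid_start = {ch for ch in xid_start if starts_with_id_start(ch)}
    return xid_start, xid_continue
```

With a toy input containing U+0140 (which decomposes to "l" plus middle dot),
U+0E33 (which decomposes to U+0E4D U+0E32) and U+037A (which decomposes to a
space plus U+0345), the sketch keeps U+0140 in both sets, drops U+0E33 from the
start set only, and drops U+037A from both.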