UnicodeData-2.1.8 bug report

From: Kevin Bracey (kbracey@e-14.com)
Date: Wed Mar 17 1999 - 09:25:12 EST


The ReadMe file for version 2.1.8 boldly states:

  Note that as of the 2.1.8 update of the Unicode Character Database,
  the decompositions in the UnicodeData.txt file can be used to recursively
  derive the full decomposition in canonical order, without the need
  to separately apply canonical reordering.

I've just found a bunch of Vietnamese characters for which this doesn't
seem to be the case, e.g.:

      1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
   
   == 00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX
      0323 COMBINING DOT BELOW
      
   == 0041 LATIN CAPITAL LETTER A
      0302 COMBINING CIRCUMFLEX ACCENT
      0323 COMBINING DOT BELOW

But the canonical order is, of course (0323 has combining class 220,
which sorts before the 230 of 0302):
      
      0041 LATIN CAPITAL LETTER A
      0323 COMBINING DOT BELOW
      0302 COMBINING CIRCUMFLEX ACCENT
      
This affects characters 1EAC,1EAD,1EB6,1EB7,1EC6,1EC7,1ED8,1ED9.

Would it be worthwhile me knocking up an algorithmic check that this
assertion doesn't fail elsewhere, or is someone else already looking at it?
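
A rough sketch of what such a check might look like follows. It is only
an illustration under stated assumptions: the function names
(full_decomposition, canonical_reorder, failures) are made up here, and
the two small tables contain just the entries quoted in this message
rather than anything parsed from the real UnicodeData.txt.

```python
# Hypothetical sample data: only the mappings and combining classes
# quoted in this message, not the full UnicodeData.txt.
DECOMP = {
    0x00C2: [0x0041, 0x0302],  # A WITH CIRCUMFLEX
    0x1EAC: [0x00C2, 0x0323],  # A WITH CIRCUMFLEX AND DOT BELOW (as in 2.1.8)
}
CCC = {
    0x0302: 230,  # COMBINING CIRCUMFLEX ACCENT
    0x0323: 220,  # COMBINING DOT BELOW
}


def full_decomposition(cp, decomp):
    """Recursively expand cp using the canonical mappings in decomp."""
    if cp not in decomp:
        return [cp]
    out = []
    for part in decomp[cp]:
        out.extend(full_decomposition(part, decomp))
    return out


def canonical_reorder(cps, ccc):
    """Stable-sort each run of non-starters (class != 0) by combining class."""
    out = list(cps)
    i = 0
    while i < len(out):
        if ccc.get(out[i], 0) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and ccc.get(out[j], 0) != 0:
            j += 1
        # sorted() is stable, so marks of equal class keep their order.
        out[i:j] = sorted(out[i:j], key=lambda c: ccc.get(c, 0))
        i = j
    return out


def failures(decomp, ccc):
    """Code points whose recursive decomposition is not in canonical order."""
    bad = []
    for cp in sorted(decomp):
        expanded = full_decomposition(cp, decomp)
        if expanded != canonical_reorder(expanded, ccc):
            bad.append(cp)
    return bad


if __name__ == "__main__":
    for cp in failures(DECOMP, CCC):
        print("%04X" % cp)
```

To cover the whole database, the two tables would have to be filled from
the decomposition and combining-class fields of UnicodeData.txt, skipping
compatibility mappings (those carrying a tag in angle brackets).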

-- 
Kevin Bracey, Senior Software Engineer
Acorn Computers Ltd                           Tel: +44 (0) 1223 725228
Acorn House, 645 Newmarket Road               Fax: +44 (0) 1223 725328
Cambridge, CB5 8PB, United Kingdom            WWW: http://www.acorn.co.uk/


