Re: Default bidi ranges from Ken Whistler on 2011-11-09 (Unicode Mail List Archive)

From: Ken Whistler <kenw_at_sybase.com>
Date: Wed, 09 Nov 2011 11:32:27 -0800

On 11/9/2011 9:30 AM, Asmus Freytag wrote:
> On 11/9/2011 1:18 AM, "Martin J. Dürst" wrote:
>> I tried to find something like a normative description of the default
>> bidi class of unassigned code points.
>>
>> In UTR #9, it says
>> (http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types):
>>
>> Unassigned characters are given strong types in the algorithm. This
>> is an explicit exception to the general Unicode conformance
>> requirements with respect to unassigned characters. As characters
>> become assigned in the future, these bidirectional types may change.
>> For assignments to character types, see DerivedBidiClass.txt
>> [DerivedBIDI] in the [UCD].

That *is* the normative description of the default Bidi_Class for
unassigned code points.

>>
>> The DerivedBidiClass.txt file, as far as I understand, is mainly a
>> condensation of bidi classes into character ranges (rather than
>> giving them for each codepoint independently as in UnicodeData.txt).
>> I.e. it can at any moment be derived automatically from
>> UnicodeData.txt, and is as such not normative.

Because the default values for Bidi_Class are complicated, and cannot be
derived simply by parsing the values for *assigned* characters in
UnicodeData.txt, the listing of the default values for Bidi_Class in
DerivedBidiClass.txt has to be taken as normative for those values.

>>
>> Why is it then that the default class assignments are only given in
>> this file (unless I have overlooked something)? And why is it that
>> they are only given in comments?
>
> Because the UnicodeData.txt file has no header (for historical
> compatibility).
>
> Because, like the practice of putting <style> in HTML inside comments,
> these things (@missing) are in comments to protect older parsers.

And to go beyond what Asmus said there, the "@missing" hack was created as
a syntax for specifying *the* default values for properties where it makes
sense that they have a *single* default value. It doesn't work for
specifying multiple default values differing by code point range. Hence
the addition of the @missing comment in DerivedBidiClass.txt (or its
potential addition to PropertyValueAliases.txt) doesn't suffice for the
entire definition.
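
To make the distinction concrete, here is a minimal Python sketch of
defaults that differ by code point range, layered over a single catch-all
default. The table name, the function name, and the sample ranges are
assumptions for illustration only; the actual ranges have to be taken from
DerivedBidiClass.txt for the Unicode version in use.

```python
from typing import List, Tuple

# Hypothetical default table: (first, last, bidi_class).
# Later entries override earlier ones. The ranges are illustrative
# assumptions, not a transcription of DerivedBidiClass.txt.
DEFAULTS: List[Tuple[int, int, str]] = [
    (0x0000, 0x10FFFF, "L"),  # the one value a single @missing line can express
    (0x0590, 0x05FF, "R"),    # example of a range-specific default
    (0x0600, 0x07BF, "AL"),   # another range-specific default
]


def default_bidi_class(cp: int) -> str:
    """Return the default Bidi_Class for a code point with no explicit entry."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a code point: {cp:#x}")
    result = "L"
    for first, last, bidi_class in DEFAULTS:
        if first <= cp <= last:
            result = bidi_class  # the most specific (latest) range wins
    return result
```

A single default would collapse this table to its first row, which is
exactly the part that one @missing line can carry.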

>> I'm trying to create a program that takes all the bidi assignments
>> (including default ones) and creates the data part of a bidi
>> algorithm implementation, but I don't feel confident to code against
>> stuff that's in comments. Any advice?

Use the values in the comments.

Remember that this is not *code* with comments that get stripped out
before compiling. These are text data files for parsing. The fact that
people are already parsing the @missing statements indicates that those
are being treated normatively now. You could say the same thing for the
titles, dates, and copyright notices on these data files: they aren't
"optional" content to be ignored.

>> Is it possible that this could be fixed (making it more normative,
>> and putting it in a form that's easier to process automatically)?

This is part of a very large problem for creating a more complete and
machine-parseable means of accessing *all* of the Unicode character
property data, including data about the *status* of properties and their
default values. It won't, IMO, be fixed by individual file fixes one at a
time, although incremental improvement can be helpful.

Note that the UCD in XML was created to address this problem in part, but
it still cannot answer many questions about the status of properties,
their full derivations, their interactions, and their functions.

--Ken
Received on Wed Nov 09 2011 - 13:38:15 CST
