L2/03-310

Proposal for making Property(Value)Aliases.txt actually machine readable

Asmus Freytag
2003-09-09

Overview
--------
This document presents a review of PropertyAliases.txt and
PropertyValueAliases.txt from the perspective of their usefulness
during updates to a new version of the UCD, followed by a five-step
proposal to set things to rights.

The problem
-----------
I've come across what I consider a pretty serious deficiency in the
way we've designed PropertyAliases.txt and PropertyValueAliases.txt.

The primary purpose of these files is to provide a list of
standardized labels for citing properties in various contexts, be it
regular expressions, user interfaces, or documentation.

The PropertyAliases.txt file gives entries like these:

    ...
    jt ; Joining_Type
    lb ; Line_Break
    NFC_QC ; NFC_Quick_Check
    ...

with a short name on the left and a long name on the right.

PropertyValueAliases.txt gives long and short names for each of the
possible values of a property, indexed by the short name of the
property in the first column:

    ...
    jt ; C ; Join_Causing
    jt ; D ; Dual_Joining
    jt ; L ; Left_Joining
    jt ; R ; Right_Joining
    jt ; T ; Transparent
    jt ; U ; Non_Joining
    lb ; AI ; Ambiguous
    lb ; AL ; Alphabetic
    lb ; B2 ; Break_Both
    lb ; BA ; Break_After
    ...
    qc ; M ; Maybe
    qc ; N ; No
    qc ; Y ; Yes
    ...

The data is presented in a format that on the surface appears machine
readable, but it is organized in a way that makes it pretty useless
for automatic updates when the UCD gets rev'd.

What's needed is a way to link the information on the location of the
data (file and field) with the set of legal values and the name of
the property.
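A minimal sketch of what reading the two excerpts above looks like in practice. The function names are hypothetical, the '#' comment handling is an assumption about the file syntax, and the three-field unpacking only holds for rows shaped like the ones quoted here. Joining the value rows to the property rows on the short name shows where the surface-level machine readability stops.

```python
# Excerpts copied from the text above; helper names are illustrative only.
PROPERTY_ALIASES = """\
jt ; Joining_Type
lb ; Line_Break
NFC_QC ; NFC_Quick_Check
"""

PROPERTY_VALUE_ALIASES = """\
jt ; C ; Join_Causing
jt ; D ; Dual_Joining
lb ; AI ; Ambiguous
lb ; BA ; Break_After
qc ; M ; Maybe
qc ; N ; No
qc ; Y ; Yes
"""


def parse_fields(text):
    """Yield the semicolon-separated fields of each non-empty line.

    Assumption: anything after '#' is a comment and is dropped.
    """
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            yield [field.strip() for field in line.split(";")]


def load_properties(text):
    """Map short property name -> long property name."""
    return {short: long_name for short, long_name in parse_fields(text)}


def load_values(text):
    """Map short property name -> list of (short value, long value)."""
    values = {}
    for prop, short, long_name in parse_fields(text):
        values.setdefault(prop, []).append((short, long_name))
    return values


properties = load_properties(PROPERTY_ALIASES)
values = load_values(PROPERTY_VALUE_ALIASES)

# Value rows whose first column is not a known short property name.
unknown = sorted(set(values) - set(properties))
# Properties for which no value rows were found.
without_values = sorted(set(properties) - set(values))

print("value rows with no matching property:", unknown)
print("properties with no value rows:", without_values)
```

On these excerpts the join succeeds for jt and lb, but the qc rows match no listed property and NFC_QC ends up with no values, so even this small sample needs a hand-written special case.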
For example, if I want to let a user select all characters that have
linebreak property BA (Break_After), I need the following pieces:

1) list of property names
2) link to a short name for each
3) link to a location for each (data file and field)
4) list of possible values for each property
5) link to the name used in the file for each value

With a list of all property names, I can populate a listbox in a
dialog to let the user select the property. With the short name, I can
access the list of values in the value aliases (or alternatively
expose that to the user in regular expression syntax, etc.). With a
pointer to a location I can parse the data file, once the user has
selected the name for the actual value from a list of possible values
[or use it as an index into a pre-parsed table]. While the list of
values could expose nice names, the program needs to have a link to
the actual expression of that value in the data file.

This example is based on the kind of property lookup as done in
unibook (http://www.unicode.org/unibook/) but it works just the same
for maintaining a regular expression parser or any other form of user
interaction or API call that's based on the long or short names of
properties and property values.

In all cases like this, it's possible to assemble the required
information from the data in the database in a semi-manual way.
However, with each addition to the UCD, this gets out of date and must
be manually maintained. Why can the organization of the UCD not be
machine readable?

Why is it not machine readable? Here are the specific problems that I
encounter:

1) The description of which data values are where is not machine
   readable at all - it's a table in UCD.html. The table is a valuable
   resource for people trying to read the UCD manually, but it could
   be exposed in machine readable form.

2) The PropertyAliases.txt file is not machine readable.
   It contains comments (which a parser would ignore) which organize
   the properties into different types (which then are treated
   differently, although not consistently so, in
   PropertyValueAliases.txt).

3) PropertyValueAliases.txt contains many exceptions that make it hard
   to work with for automated purposes:

   a) It uses a 'pseudo-property' called 'qc' for all quick check
      properties. This pseudo-property is not defined anywhere and the
      only explanation is in the comment. (My parser promptly barfed
      on it as an 'illegal property short-name'.)

   b) Boolean properties are not listed, nor are properties such as
      'character name' or 'lower case mappings' that have open-ended
      values.

   c) There is an inconsistent use of 'n/a' in the file. Most often,
      n/a means that there is no 'short-name' defined for a value.
      Less often it means that the value itself is an n/a value.
      However, for some properties that is indicated with a value
      alias of "none".

   d) The canonical combining class property has an extra field,
      making the number of fields variable (it could have been made
      empty in the rest of the file).

4) Some data files contain properties for which no alias is listed.
   Some data files contain properties for which an alias is listed in
   the data file which does not match any of the aliases given in the
   aliases file.

5) The comments suggest that in the future, additional aliases can be
   introduced for a property or a property value. For property values
   that have no short name, it's impossible to add an alias:

   Example:

   old:
       blk ; n/a ; Arabic
       blk ; n/a ; Greek
       blk ; n/a ; Latin

   new:
       blk ; n/a ; Alias
       blk ; n/a ; Arabic
       blk ; n/a ; Greek

   In this example it's not possible to know whether 'Alias' is a new
   value or an alias of another one. However, 'inventing' up to 100
   new, fake short block names to add an occasional alias is not
   helpful either.
   Where short names do exist, an alias could be defined, but then
   only for *either* the short or the long names, not for both:

       lb ; AL ; Alphabetic
       lb ; B2 ; Break_Both
       lb ; B2 ; Break_Before_And_After
       lb ; BA ; Break_After

   but:

       lb ; AL ; Alphabetic
       lb ; BB ; Break_Both
       lb ; B2 ; Break_Both
       lb ; B2 ; Break_Before_And_After
       lb ; BA ; Break_After

   would be much harder to deal with, since one would need to thread
   all aliases together. Furthermore, if one of the aliases is
   preferred, there's no clean way, other than order, to indicate
   that, but the files are alphabetized.

The Proposal
------------
I propose the following steps to turn the collection of aliases into a
parsable index of property names and value names:

1) Make a machine readable copy of the table in UCD.html, linking the
   short property name to the name of the data file and a field number
   (or tag value in case of Unihan). Once the rest of the information
   is in place, it should be possible to use a script to check this
   new locator table by scanning the data files for expected property
   values.

2) Use a machine readable non-comment convention to organize
   PropertyAliases.txt to indicate which properties have value
   aliases, which do not (by their nature) and which have predictable
   (true/false) values that need no alias. The easiest would be a
   third column that indicates the type of property. This information
   is useful in other contexts.

3) In the PropertyValueAliases.txt file, remove the pseudo-property,
   put the combining class into a field that's consistently present
   across the file, and add an index field for aliases. (The ccc and
   index field could reside in a single column of numerical values.)
   Use the value n/a consistently only for cases where there is no
   defined alias (short or long) for a property value, and use the
   word 'none' if needed both for long and short alias for the 'does
   not have this property'-value.
4) Make sure that all value and property labels from all the data
   files are actually represented in PropertyAliases.txt and
   PropertyValueAliases.txt.

5) Where multiple aliases exist, add a way to indicate the alias that
   is actually used in the current version of the data file. Other
   than such an indication, which is needed to link to the information
   in item 1, all aliases can be defined as equivalent. In other
   words, I recommend against identifying a 'preferred' alias since I
   suspect that changing that status could create problems with
   implementations.

Given the enormous difficulties in actually using these files as
machine readable data as they are exposed today, I have little
hesitation in suggesting a reorganization of the data files.

The Conclusion
--------------
With these five steps, the relevant information in the two files and
UCD.html would become usable in many contexts, allowing implementors
to automate key steps in their migration to a new revision of the
database, or to create utilities that are self-updating.

The fact that some property data files contain information that is
complex and needs special support, such as the data in
SpecialCasing.txt, does not invalidate this proposal. The vast
majority of individual properties are either boolean or simple
enumerations. These are the types of properties that are most likely
to be added between versions or that see their range of values change.

The Credits
-----------
Ken Whistler found the fatal flaw in the current non-scheme to do
multiple aliases.
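
As a closing illustration, here is a minimal sketch of how the five pieces listed under "The problem" could be assembled once proposal steps 1 to 3 are in place. Everything in it is hypothetical: the locator table layout, the third "type" column, the zero-based field numbers, and the assumption that the short value name is the one used in the data file are placeholders for whatever format is actually adopted.

```python
# Step 1 (hypothetical layout): short property name ; data file ; field number
LOCATOR = """\
lb ; LineBreak.txt ; 1
jt ; ArabicShaping.txt ; 2
"""

# Step 2 (hypothetical layout): a third column gives the property type
PROPERTY_ALIASES = """\
jt ; Joining_Type ; enumerated
lb ; Line_Break ; enumerated
"""

# Step 3: value aliases with a uniform number of fields
PROPERTY_VALUE_ALIASES = """\
jt ; D ; Dual_Joining
lb ; AL ; Alphabetic
lb ; BA ; Break_After
"""


def rows(text):
    """Yield stripped, semicolon-separated fields; '#' starts a comment."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            yield [field.strip() for field in line.split(";")]


def build_index():
    """Combine the three tables into one index keyed by long property name."""
    locator = {short: (name, int(field)) for short, name, field in rows(LOCATOR)}
    index = {}
    for short, long_name, kind in rows(PROPERTY_ALIASES):
        index[long_name] = {
            "short": short,
            "kind": kind,
            "location": locator[short],
            "values": {},
        }
    by_short = {entry["short"]: entry for entry in index.values()}
    for prop, short_value, long_value in rows(PROPERTY_VALUE_ALIASES):
        # Assumption: the short value name is what appears in the data file.
        by_short[prop]["values"][long_value] = short_value
    return index


def resolve(index, prop_long, value_long):
    """Return (data file, field number, value as written in the file)."""
    entry = index[prop_long]
    data_file, field = entry["location"]
    return data_file, field, entry["values"][value_long]


index = build_index()
print(sorted(index))  # names for the property listbox
print(resolve(index, "Line_Break", "Break_After"))
```

With tables like these, the selection of Line_Break = Break_After resolves to a file, a field, and a literal value without any hand-maintained mapping, which is the kind of self-updating utility the conclusion has in mind.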