L2/14-151
Source: Mark Davis
Date: July 11, 2014
Subject: smaller data file for confusables.txt

Laurentiu asked whether the confusables.txt file could be simplified, since many of the lines are repeated, with only the table type being different. Examples:

2028 ; 0020 ; SL #* (  →   ) LINE SEPARATOR → SPACE # 

2028 ; 0020 ; SA #* (  →   ) LINE SEPARATOR → SPACE # 

2028 ; 0020 ; ML #* (  →   ) LINE SEPARATOR → SPACE # 

2028 ; 0020 ; MA #* (  →   ) LINE SEPARATOR → SPACE # 

See also http://www.unicode.org/reports/tr39/#ConfusableDataTableTypes

He had asked the quite reasonable question: "Is it the case that the SL confusables form a proper subset of the SA confusables, and so on compared to ML and then to MA confusables? If yes, the duplication in confusables.txt would be reduced quite a bit if each set only listed what that set contains in addition to the previous set, and inherited everything else from the previous set."

I did some analysis. As it turns out, the four sets are not simply nested supersets of one another. With the version I had, here are the counts for each combination of table types:
4523 [MA, ML, SA, SL]
51 [ML, SA, SL]
511 [ML, SL]
122 [SA, SL]
724 [MA, SA]
351 [MA, ML]
330 [MA]
97 [ML]
45 [SA]
1 [SL]
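
A tally like the one above could be reproduced with a short script. The following is only a sketch, not the tool that produced these numbers: it assumes the three-field `source ; target ; type` layout shown in the examples, treats everything after the first `#` as a comment, and groups by the (source, target) pair. The function name `tally_type_combinations` is hypothetical.

```python
from collections import Counter, defaultdict


def tally_type_combinations(lines):
    """Count (source, target) mappings per combination of table types.

    Assumes each data line looks like "2028 ; 0020 ; SL # comment".
    Lines without three ';'-separated fields (comments, blanks) are skipped.
    """
    types_by_mapping = defaultdict(set)
    for line in lines:
        # Drop the trailing comment, then split the data fields.
        data = line.split("#", 1)[0]
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 3:
            continue
        source, target, table_type = fields[:3]
        types_by_mapping[(source, target)].add(table_type)
    # Key each mapping by its sorted tuple of types, e.g. ("MA", "ML").
    return Counter(tuple(sorted(types)) for types in types_by_mapping.values())
```

Fed the four LINE SEPARATOR lines quoted earlier, this would report a single mapping under the combination MA, ML, SA, SL.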

However, we could make the file dramatically smaller if we changed the format so that the type field is a space-delimited list. The four example lines above would then collapse into one line:

2028 ; 0020 ; SL SA ML MA #* (  →   ) LINE SEPARATOR → SPACE # 
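
To illustrate the proposed format, here is a minimal sketch of how the current file could be converted. It is an assumption-laden illustration, not a proposed generator: the names `merge_table_types`, `projected_line_counts`, and `TYPE_ORDER` are hypothetical, the SL SA ML MA ordering is simply copied from the example line, and the comment of the first line seen for a mapping is kept. The second helper estimates the size change from a table of counts per type combination, assuming each count is a number of distinct mappings.

```python
# Order in which types are written out; copied from the example line.
TYPE_ORDER = ["SL", "SA", "ML", "MA"]


def merge_table_types(lines):
    """Merge lines that differ only in table type into one line each."""
    merged = {}  # (source, target) -> {"types": set, "comment": str}
    for line in lines:
        data, sep, comment = line.partition("#")
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 3:
            continue  # comment-only or blank line
        source, target, table_type = fields[:3]
        entry = merged.setdefault(
            (source, target),
            {"types": set(), "comment": (sep + comment).rstrip()},
        )
        entry["types"].add(table_type)

    output = []
    for (source, target), entry in merged.items():
        # Types outside TYPE_ORDER are not expected and are dropped here.
        types = " ".join(t for t in TYPE_ORDER if t in entry["types"])
        output.append(f"{source} ; {target} ; {types} {entry['comment']}".rstrip())
    return output


def projected_line_counts(counts_by_combination):
    """Return (lines in current format, lines in merged format)."""
    current = sum(len(combo) * n for combo, n in counts_by_combination.items())
    merged = sum(counts_by_combination.values())
    return current, merged
```

Applied to the counts listed earlier, and under the assumption stated above, `projected_line_counts` gives 22134 data lines in the current format versus 6755 in the merged format.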

 

The question to the committee is whether this is worth doing.