smaller data file for confusables.txt

L2/14-151
Source: Mark Davis
Date: July 11, 2014
Subject: smaller data file for confusables.txt

Laurentiu asked whether the confusables.txt file could be simplified, since many of the lines are repeated and differ only in the table type. Examples:

2028 ; 0020 ; SL #* ( → ) LINE SEPARATOR → SPACE #

2028 ; 0020 ; SA #* ( → ) LINE SEPARATOR → SPACE #

2028 ; 0020 ; ML #* ( → ) LINE SEPARATOR → SPACE #

2028 ; 0020 ; MA #* ( → ) LINE SEPARATOR → SPACE #

He had asked the quite reasonable question: "Is it the case that the SL confusables form a proper subset of the SA confusables, and so on compared to ML and then to MA confusables? If yes, the duplication in confusables.txt would be reduced quite a bit if each set only listed what that set contains in addition to the previous set, and inherited everything else from the previous set."

I did some analysis, and here's what I found:

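As it turns out, the sets are not simply nested. For example, 511 mappings appear only in SL and ML. With the version I had, these are the counts of mappings for each combination of table types: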

4523 [MA, ML, SA, SL]

51 [ML, SA, SL]

511 [ML, SL]

122 [SA, SL]

724 [MA, SA]

351 [MA, ML]

330 [MA]

97 [ML]

45 [SA]

1 [SL]
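
A tally like the one above could be produced roughly as follows. This is a minimal sketch, not the script used for the analysis. It assumes each data line has the form "source ; target ; type # comment" and that two lines count as the same mapping when both source and target match. The function names parse_line and type_combination_counts are hypothetical.

```python
from collections import Counter, defaultdict

def parse_line(line):
    """Return (source, target, table_type), or None for blank/comment-only lines."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 3:
        return None
    return fields[0], fields[1], fields[2]

def type_combination_counts(lines):
    """Count how many mappings occur in each combination of table types."""
    types_by_mapping = defaultdict(set)
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            source, target, table_type = parsed
            # Assumption: a mapping is identified by its (source, target) pair.
            types_by_mapping[(source, target)].add(table_type)
    return Counter(tuple(sorted(types)) for types in types_by_mapping.values())
```

Calling type_combination_counts on the lines of the file returns a Counter keyed by sorted tuples such as ('MA', 'ML', 'SA', 'SL').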

However, we could make the file dramatically smaller if we changed the format so that the type field is a space-delimited list. The four example lines for 2028 would then collapse into one line:

2028 ; 0020 ; SL SA ML MA #* ( → ) LINE SEPARATOR → SPACE #
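
The conversion to the proposed format could be sketched as below. This is only an illustration under stated assumptions: lines are merged when source and target both match, the comment of the first occurrence is kept, and types are written in the order SL SA ML MA as in the example line. The name merge_type_fields is hypothetical.

```python
TYPE_ORDER = ["SL", "SA", "ML", "MA"]  # order taken from the example line

def merge_type_fields(lines):
    """Collapse lines that differ only in table type into one line each."""
    merged = {}  # (source, target) -> (set of types, first comment seen)
    for line in lines:
        data, _, comment = line.partition("#")
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank and comment-only lines
        key = (fields[0], fields[1])
        entry = merged.setdefault(key, (set(), comment.rstrip()))
        entry[0].add(fields[2])
    out = []
    for (source, target), (types, comment) in merged.items():
        # Types outside TYPE_ORDER are dropped in this sketch.
        type_list = " ".join(t for t in TYPE_ORDER if t in types)
        merged_line = f"{source} ; {target} ; {type_list}"
        if comment:
            merged_line += " #" + comment
        out.append(merged_line)
    return out
```

Output lines follow the order in which each mapping first appears in the input, since Python dicts preserve insertion order.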

The question to the committee is whether this is worth doing.