Standardising search for similar symbols

From: sergey (sergey-feo@yandex.ru)
Date: Fri Nov 13 2009 - 12:06:51 CST


    Hi all and please excuse me for my bad English :-)

    I want to discuss the following problem.

    Imagine that we have a big text file. At the beginning of this file
    someone wrote:

    3-2*4

    The "-" here is U+002D.

    In the middle of the file someone else wrote:

    3−2*5

    The "−" here is U+2212.

    Now imagine that you see "3-2*4" and want to find everything in the file
    that means "3 minus 2". You ask your text editor to search for "3-2". It
    will find only "3-2*4", not both expressions, because "-" and "−" have
    different code points in Unicode.
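
    A small Python sketch of the mismatch described above (the sample string
    is made up for illustration): a plain substring search compares code
    points, so a query typed with U+002D finds only the first expression.

```python
import unicodedata

# Illustrative sample text: the first expression uses U+002D,
# the second one uses U+2212.
text = "3-2*4 ... 3\u22122*5"
query = "3-2"  # typed with U+002D

# Collect every position where the query occurs literally.
hits = [i for i in range(len(text)) if text.startswith(query, i)]

# The two characters look alike but have different names and code points.
names = [unicodedata.name(ch) for ch in ("\u002D", "\u2212")]

print(hits)   # [0]
print(names)  # ['HYPHEN-MINUS', 'MINUS SIGN']
```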

    I suggest defining and publishing "similar symbol sets" in future Unicode standards.
    These sets might look like this:

    (U+002D "-", U+2012 "‒", U+2013 "–", U+2014 "—", U+2015 "―", U+2212 "−", U+23AF "⎯", U+2500 "─", ...);
    (U+0041 "A", U+0391 "Α", U+0410 "А", ...);
    (U+0043 "C", U+03F9 "Ϲ", U+0421 "С", U+216D "Ⅽ", ...);
    (U+0033 "3", U+0417 "З").
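
    A minimal sketch of how such sets could drive a search, assuming Python and
    its standard re module. The table SIMILAR_SETS and the functions
    similar_pattern and find_similar are hypothetical names; the sets are
    copied from the examples above and are not part of any Unicode data file.
    Each query character that belongs to a set is expanded into a character
    class containing the whole set.

```python
import re

# Hypothetical similar-symbol sets, copied from the examples in this message.
SIMILAR_SETS = [
    "\u002D\u2012\u2013\u2014\u2015\u2212\u23AF\u2500",  # dashes and horizontal lines
    "\u0041\u0391\u0410",                                # Latin A, Greek Alpha, Cyrillic A
    "\u0043\u03F9\u0421\u216D",                          # Latin C and its look-alikes
    "\u0033\u0417",                                      # digit 3, Cyrillic Ze
]

# Map every listed character to the full set it belongs to.
SET_OF = {ch: group for group in SIMILAR_SETS for ch in group}


def similar_pattern(query: str) -> re.Pattern:
    """Build a regex that accepts any similar symbol at each query position."""
    parts = []
    for ch in query:
        group = SET_OF.get(ch)
        if group is None:
            parts.append(re.escape(ch))  # no known look-alikes: match literally
        else:
            parts.append("[" + "".join(re.escape(c) for c in group) + "]")
    return re.compile("".join(parts))


def find_similar(query: str, text: str) -> list[str]:
    """Return every substring of text that matches query up to similar symbols."""
    return [m.group(0) for m in similar_pattern(query).finditer(text)]


sample = "3-2*4 and 3\u22122*5 and 2003\u20142005"
print(find_similar("3-2", sample))
```

    With this sample, the query "3-2" matches the hyphen-minus form, the
    U+2212 form, and the "3—2" inside "2003—2005".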

    With these sets, a useful checkbox could be implemented in the search dialogs of text editors:
    "[V] Accept similar symbols"

    In our example, searching for "3-2" using similar symbol sets will match "3-2*4",
    "3−2*5" and even "2003—2005".

    You can see that "similarity" within each similar symbol set is subjective.
    One user with one font will see many differences between the symbols in a set,
    while another user will decide that they are the same symbol.

    So we can create two or more "supersets". Superset 0 will contain sets of very similar symbols that can be confused in most fonts. Superset 1 will be bigger and will contain symbols that are less similar but can still easily confuse the user. The set (U+0031 "1", U+0049 "I", U+006C "l", U+007C "|", ...) and the set (U+0056 "V", U+2713 "✓") are good candidates for superset 1.
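
    A rough Python sketch of the superset idea. The table SUPERSETS and the
    function similar are hypothetical names. Two things are assumptions, not
    stated above: which sets belong to superset 0, and that a search at level 1
    also accepts everything from level 0.

```python
# Hypothetical superset table. Level 0 holds very similar symbols,
# level 1 holds looser look-alikes. The choice of level-0 sets is an assumption.
SUPERSETS = {
    0: [
        "\u0041\u0391\u0410",        # Latin A, Greek Alpha, Cyrillic A
        "\u0043\u03F9\u0421\u216D",  # Latin C and its look-alikes
    ],
    1: [
        "\u0031\u0049\u006C\u007C",  # digit 1, capital I, small l, vertical bar
        "\u0056\u2713",              # capital V, check mark
    ],
}


def similar(a: str, b: str, level: int) -> bool:
    """True if a and b are equal or share a set at any level up to `level`."""
    if a == b:
        return True
    for lvl, sets in SUPERSETS.items():
        # Assumption: levels are cumulative, so level 1 includes level 0.
        if lvl <= level and any(a in s and b in s for s in sets):
            return True
    return False


print(similar("A", "\u0410", 0))  # True: very similar in most fonts
print(similar("1", "l", 0))       # False: only a level-1 match
print(similar("1", "l", 1))       # True
```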

    There is also a second problem: searching for symbols like U+0451 "ё", which can be encoded as one symbol, U+0451 "ё", or as two symbols: "ё" = U+0435 "е" + U+0308 (combining diaeresis). But this is a topic for another discussion :-)
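
    For reference, a short Python sketch of this second problem, using the
    standard unicodedata module. The helper name nfc_equal is hypothetical.
    The two encodings compare unequal as raw strings, but equal after both are
    normalized to NFC.

```python
import unicodedata

precomposed = "\u0451"       # "ё" as a single code point
decomposed = "\u0435\u0308"  # "е" followed by COMBINING DIAERESIS


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # False: different code point sequences
print(nfc_equal(precomposed, decomposed))  # True: canonically equivalent
```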

    ----------------
    Regards, Sergey



    This archive was generated by hypermail 2.1.5 : Fri Nov 13 2009 - 12:50:09 CST