Re: VS characters, default ignorable property and text search and collation

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 26 2010 - 14:23:22 CDT


    Sharma asked:

    > I have a question about VS characters and the default ignorable property.
    >
    > TUS 5.2 ch 16.4 clearly states that VS characters are default ignorable.
    > Ch 5.21 states that default ignorable characters are to be ignored in
    > rendering (except in specialized modes which show hidden characters).
    >
    > The paragraph in p 171 on default ignorable characters under ch 5.3
    > states that "these characters are also ignored except with respect to
    > specific, defined processes; for example, zero width non-joiner is
    > ignored by default in collation."

    It is an unfortunate result of terminological history, but
    in the Unicode Standard, "ignored by default in XXX" is not
    the same as Default_Ignorable_Code_Point=True.

    Also, the meaning of Default_Ignorable_Code_Point wavered around a
    bit until it was finally nailed down, precisely because people were
    trying to use it in somewhat different implementation contexts to
    mean somewhat different things.

    At this point, the standard has nailed down the meaning of
    the character property Default_Ignorable_Code_Point to mean
    essentially that *if* an implementation does not support rendering
    of the code point in question, then it should be rendered invisibly
    (i.e., no missing glyph boxes drawn). If a Default_Ignorable_Code_Point
    *is* supported in rendering, it may have various effects, but typically
    not as a regular character would display. The variation selectors
    are a good example, because *if* you support their rendering, you
    don't draw glyphs for them directly, but rather modify the display
    of the preceding character whose variant glyph they are selecting.

    If a character has the property Default_Ignorable_Code_Point=False,
    then if an implementation does not support rendering of the
    code point in question, it *should* display a missing glyph box,
    to show that a character is there but cannot be drawn.
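    The fallback rule in the two paragraphs above can be sketched
    roughly as follows. This is only an illustration, not how any real
    rendering engine is structured: the names fallback_display and
    DEFAULT_IGNORABLE_SAMPLE are made up for the example, the range
    table is a small hand-picked subset (the authoritative list is in
    DerivedCoreProperties.txt), and "supported" simply stands for
    whatever set of characters the engine and font can handle.

```python
# Illustrative subset only; the authoritative list of
# Default_Ignorable_Code_Point=True code points is in DerivedCoreProperties.txt.
DEFAULT_IGNORABLE_SAMPLE = [
    (0x200C, 0x200D),    # ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1..16
    (0xE0100, 0xE01EF),  # VARIATION SELECTOR-17..256
]

MISSING_GLYPH = "\u25A1"  # WHITE SQUARE, standing in for a missing glyph box


def is_default_ignorable(cp):
    """True if cp falls in one of the sample default-ignorable ranges."""
    return any(lo <= cp <= hi for lo, hi in DEFAULT_IGNORABLE_SAMPLE)


def fallback_display(text, supported):
    """Model only the fallback decision for unsupported code points.

    Supported characters are passed through untouched; what a real
    renderer does with a supported variation selector (modifying the
    preceding glyph) is not modeled here.
    """
    out = []
    for ch in text:
        if ch in supported:
            out.append(ch)
        elif is_default_ignorable(ord(ch)):
            continue  # unsupported but default ignorable: render invisibly
        else:
            out.append(MISSING_GLYPH)  # unsupported, not ignorable: show a box
    return "".join(out)
```

    With only "a" supported, fallback_display("a\uFE00b", {"a"}) drops
    the unsupported variation selector silently and shows a box for
    the unsupported "b".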

    All of that is completely orthogonal to whether a particular
    code point should be "ignored by default in" some other context,
    such as searching.

    > This seems to suggest to me that despite ch 5.21 speaking only about
    > rendering, the default ignorable property also has or at least can have
    > a part in other processes such as collation. I would however like to
    > have a confirmation on this:
    >
    > Are all default ignorable characters ignored not only in rendering but
    > in other processes also?

    They aren't actually ignored in rendering. See above. The issue is
    whether they should be displayed visibly when not supported by
    a rendering engine (and font), or not.

    > Or is it that they are ignored by default in rendering and whether they
    > are ignored in other processes or not is variable?

    Yes, the latter.

    >
    > Specifically, are VS characters ignored in rendering only (i.e.
    > rendering them, not the characters they apply to of course) or are they
    > ignored even in other processes such as text search and collation?

    Whether they would be ignored for text search and collation
    depends on how they are weighted in the Unicode Collation Algorithm.

    For that, you look in allkeys.txt for the UCA, which shows
    entries like:

    FE00 ; [.0000.0000.0000.0000] # [FE00] VARIATION SELECTOR-1

    Since this variation selector (and in fact all of them) is
    weighted with zeroes in all positions, yes, the answer is that
    variation selectors are ignored by default in text search and
    collation.
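    If you want to check this mechanically, something along these
    lines would do. It is a minimal sketch that assumes entries shaped
    like the one quoted above (code points, a semicolon, then one or
    more bracketed collation elements); parse_entry and
    is_completely_ignorable are names invented for the example, and the
    weights on the second sample line are made up for contrast, not
    copied from any real allkeys.txt.

```python
import re

# One entry: hex code point(s), ';', then one or more [.w1.w2...] elements.
# The leading '.' or '*' inside each bracket is the variable-weighting marker.
ENTRY = re.compile(r"^\s*([0-9A-Fa-f ]+?)\s*;\s*((?:\[[.*][0-9A-Fa-f.]+\])+)")
ELEMENT = re.compile(r"\[[.*]([0-9A-Fa-f.]+)\]")


def parse_entry(line):
    """Return (code_points, collation_elements), or None for non-entry lines."""
    m = ENTRY.match(line)
    if m is None:
        return None
    code_points = [int(cp, 16) for cp in m.group(1).split()]
    elements = [
        [int(w, 16) for w in el.split(".")]
        for el in ELEMENT.findall(m.group(2))
    ]
    return code_points, elements


def is_completely_ignorable(line):
    """True if every weight at every level of the entry is zero."""
    parsed = parse_entry(line)
    if parsed is None:
        return False
    _, elements = parsed
    return all(w == 0 for el in elements for w in el)


vs1 = "FE00 ; [.0000.0000.0000.0000] # [FE00] VARIATION SELECTOR-1"
# Hypothetical weights, for contrast only.
other = "0061 ; [.15D1.0020.0002.0061] # LATIN SMALL LETTER A"
```

    Here is_completely_ignorable(vs1) is true and
    is_completely_ignorable(other) is false; comment lines and
    directives in the file simply fail to parse and are reported as
    not ignorable.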

    Of course, as for any other character, it is possible to set
    up a tailoring that gives a variation selector (or all of
    them) a non-ignorable collation weight, in which case they
    *would* make a difference in searching and collation.
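    To make the contrast concrete, here is a toy model of the two
    behaviors. It is emphatically not the UCA: it just strips variation
    selectors before comparing, as a crude stand-in for "weighted with
    zeroes", and a flag stands in for a tailoring that makes them
    non-ignorable. The function names and the vs_ignorable parameter
    are assumptions of the sketch, and only the VS-1..VS-256 ranges are
    covered.

```python
def is_variation_selector(ch):
    """True for VARIATION SELECTOR-1..16 and VARIATION SELECTOR-17..256."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def search_key(text, vs_ignorable=True):
    """Toy comparison key.

    vs_ignorable=True mimics the default behavior (variation selectors
    contribute nothing); vs_ignorable=False mimics a tailoring in which
    they carry weight and therefore distinguish strings.
    """
    if vs_ignorable:
        return "".join(ch for ch in text if not is_variation_selector(ch))
    return text


def matches(a, b, vs_ignorable=True):
    """Compare two strings under the chosen treatment of variation selectors."""
    return search_key(a, vs_ignorable) == search_key(b, vs_ignorable)
```

    So matches("\u2260", "\u2260\uFE00") is true by default, while
    matches("\u2260", "\u2260\uFE00", vs_ignorable=False) is false,
    which is the difference a tailoring would make.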

    --Ken


