Re: VS characters, default ignorable property and text search and collation

From: Mark Davis ☕ (mark@macchiato.com)
Date: Mon Jul 26 2010 - 14:41:36 CDT

    Mark

    *— Il meglio è l’inimico del bene —*

    On Mon, Jul 26, 2010 at 09:40, Shriramana Sharma <samjnaa@gmail.com> wrote:

    > Hello list.
    >
    > I have a question about VS characters and the default ignorable property.
    >
    > TUS 5.2 ch 16.4 clearly states that VS characters are default ignorable. Ch
    > 5.21 states that default ignorable characters are to be ignored in rendering
    > (except in specialized modes which show hidden characters).
    >

    That is incorrect. What it actually says is (my bold):

    "Default ignorable code points are those that should be ignored by default
    in rendering *unless explicitly supported.* "

    Or to put it in other terms:

    If your rendering system doesn't explicitly support character X, it should
    be ignored by default (as if it hadn't been in the string to be rendered).

    So if you *do* support a given variation sequence, then this clause doesn't
    apply. In fact, supporting it means that it is not ignored: it has a
    visible impact on the rendering.
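
    A minimal sketch of that rule, assuming a renderer that keeps a table of
    the variation sequences it has glyphs for. The names SUPPORTED_SEQUENCES
    and prepare_for_rendering are hypothetical, the table holds one entry
    purely as an example, and only the VS1..VS256 ranges are treated as
    variation selectors:

```python
# Hypothetical table of (base, selector) pairs this renderer can draw.
# U+2268 followed by VS1 (U+FE00) is used here purely as an example entry.
SUPPORTED_SEQUENCES = frozenset({("\u2268", "\uFE00")})


def is_variation_selector(ch):
    """True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def prepare_for_rendering(text, supported=SUPPORTED_SEQUENCES):
    """Drop every variation selector that is not part of a supported sequence.

    A selector in a supported sequence is kept, because it changes the glyph.
    Any other selector is removed, as if it had not been in the string.
    """
    out = []
    for i, ch in enumerate(text):
        if is_variation_selector(ch):
            prev = text[i - 1] if i > 0 else None
            if prev is None or (prev, ch) not in supported:
                continue  # unsupported: ignore by default
        out.append(ch)
    return "".join(out)
```

    With this table, "a" followed by U+FE00 comes out as plain "a", while
    U+2268 followed by U+FE00 passes through unchanged so the renderer can
    select the variant glyph.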

    >
    > The paragraph in p 171 on default ignorable characters under ch 5.3 states
    > that "these characters are also ignored except with respect to specific,
    > defined processes; for example, zero width non-joiner is ignored by default
    > in collation."
    >
    > This seems to suggest to me that despite ch 5.21 speaking only about
    > rendering, the default ignorable property also has or at least can have a
    > part in other processes such as collation. I would however like to have a
    > confirmation on this:
    >
    > Are all default ignorable characters ignored not only in rendering

    That is an incorrect assumption; see above.

    > but in other processes also?
    >

    Yes, in the sense that a process should ignore them unless they are
    relevant to the kind of processing involved. Note that other characters may
    also be ignored, depending on the processing. So there is no hard-and-fast
    rule.

       - For example, in collation, any of the characters in
       http://unicode.org/Public/UCA/6.0.0/allkeys-6.0.0d1.txt with weights
       starting "[.0000.0000.0000." are ignorable by default, and these include
       characters that are not default-ignorable.
       - For word segmentation, Extend and Format characters are ignored (except
       for edge cases); see
       http://unicode.org/reports/tr29/#Default_Word_Boundaries
       Those include many more characters than just the default-ignorables, and
       exclude 5 characters (Hangul fillers and ZWSP). See also
       http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Word_Break:Format:][:Word_Break:Extend:]&g=di

    In other words, default-ignorables should usually be ignored by
    non-rendering processes, but there will be exceptions. And other characters
    may also be ignored, depending on the process.
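
    As an illustration for text search, here is a rough sketch of a matcher
    that ignores variation selectors plus a small hand-picked set of other
    characters. The names search_key and contains_ignoring are hypothetical,
    and IGNORED_IN_SEARCH is an illustrative subset chosen for this example,
    not the full default-ignorable set and not a recommendation from the
    standard:

```python
# Illustrative subset only: ZWNJ (U+200C) and ZWJ (U+200D).
# A real process would choose its own set, which may be larger or smaller.
IGNORED_IN_SEARCH = frozenset({0x200C, 0x200D})


def is_variation_selector(ch):
    """True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def search_key(text):
    """Return text with the characters this search process ignores removed."""
    return "".join(
        ch
        for ch in text
        if not is_variation_selector(ch) and ord(ch) not in IGNORED_IN_SEARCH
    )


def contains_ignoring(haystack, needle):
    """Substring test that treats the ignored characters as absent on both sides."""
    return search_key(needle) in search_key(haystack)
```

    A process for which the selector is relevant, such as a search meant to
    find one specific variant, would compare the raw strings instead.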

    > Or is it that they are ignored by default in rendering and whether they are
    > ignored in other processes or not is variable?
    >
    > Specifically, are VS characters ignored in rendering only (i.e. rendering
    > them, not the characters they apply to of course) or are they ignored even
    > in other processes such as text search and collation?
    >
    > --
    > Shriramana Sharma
    >
    >


