Ticket #7 (new defect)

Opened 4 years ago

Last modified 4 years ago

wordbreaker (\b) not working with regular expressions in List-UnicodeSet tool

Reported by: verdy_p@… Owned by: mark
Component: webutils

Description (last modified by markus)

was CldrBug:3962

On http://cldr.unicode.org/unicode-utilities/list-unicodeset
the document says that \b (word break) is supported for the character name and block name properties (and it gives examples of these).

However, \b only matches at the start and end of the property value string (behaving like a combination of the caret and dollar metacharacters). Either there is still no actual support for matching word breaks, or the word breaker implemented in the tool is incorrectly instantiated, or it is not implemented at all (just a stub). It should be simple to implement here, since the values of these properties are restricted to ASCII and Basic Latin letters.
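For comparison, here is a minimal sketch of the behavior expected from \b, using Python's re module as a stand-in for a conventional regex engine. The block names are a hand-picked sample written with spaces and hyphens, and the helper matching_blocks is hypothetical; none of this is taken from the tool's implementation.

```python
import re

# Hand-picked sample of block names containing "Latin" (not the full Unicode list).
BLOCKS = [
    "Basic Latin",
    "Latin-1 Supplement",
    "Latin Extended-A",
    "Latin Extended-B",
    "Latin Extended Additional",
]

def matching_blocks(pattern, names=BLOCKS):
    """Return the names in which `pattern` is found anywhere (re.search)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

# With a working word-boundary assertion, each pattern selects every sample block,
# because "Latin" is always delimited by a space, a hyphen, or the string ends.
for pattern in (r"Latin", r"Latin\b", r"\bLatin", r"\bLatin\b"):
    print(pattern, len(matching_blocks(pattern)))
```

Under these assumptions all four patterns select the same five names, which is the result the tests below would be expected to give for the full block list.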

See the tests with the word "Latin", which occurs in several block names (always exactly at word boundaries):

(1) basic test (ignoring word boundaries):

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/Latin/}

1104 code points returned. No problem here.

(2) tests with boundary at end of the query:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/Latin\b/}

Only 128 code points returned, i.e. those in the "Basic Latin" (ASCII) block, but none from the other Latin extension blocks. This behaves exactly like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/Latin$/}

i.e. \b matches only at the end of the property value.

(3) tests with boundary at beginning of the query:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/\bLatin/}

Only 976 code points returned, i.e. those in all the Latin extension blocks, but not the "Basic Latin" (ASCII) block. This behaves exactly like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/^Latin/}

i.e. here \b matches only at the beginning of the property value.

(4) tests with boundary at beginning and end of the query:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/\bLatin\b/}

No code points found! This behaves exactly like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:/^Latin$/}

or

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=/p{Block:Latin}

and it returns nothing because there's no block named only "Latin".
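The three equivalences observed in tests (2) to (4) can be reproduced in a conventional regex engine under one assumption that this ticket does not confirm: that the tool matches against property value names written with underscores (for example Basic_Latin), where the underscore counts as a word character. The sketch below uses Python's re module, a hand-picked sample of names, and a hypothetical helper select. It illustrates a possible cause and is not a diagnosis of the tool.

```python
import re

# Assumption: block names in underscore form; hand-picked sample, not the full list.
UNDERSCORED = [
    "Basic_Latin",
    "Latin_1_Supplement",
    "Latin_Extended_A",
    "Latin_Extended_B",
    "Latin_Extended_Additional",
]

def select(pattern, names):
    """Return the set of names in which `pattern` is found anywhere."""
    regex = re.compile(pattern)
    return {name for name in names if regex.search(name)}

# Each \b pattern from the tests, paired with the anchored pattern it was seen to mimic.
PAIRS = [
    (r"Latin\b", r"Latin$"),
    (r"\bLatin", r"^Latin"),
    (r"\bLatin\b", r"^Latin$"),
]

for boundary, anchored in PAIRS:
    # "_" is a word character, so there is no boundary between "_" and a letter.
    same = select(boundary, UNDERSCORED) == select(anchored, UNDERSCORED)
    print(boundary, anchored, same)
```

On this sample every pair selects the same names, so each line prints True. If the tool does use underscored names internally, \b may be doing what a standard engine does, and the examples in the documentation would be what needs adjusting.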

Change History

comment:1 Changed 4 years ago by markus

  • Component changed from unknown to webutils

comment:2 Changed 4 years ago by markus

  • Description modified
  • Reporter changed from markus to verdy_p@…