The Unicode Consortium Discussion Forum (CLOSED)


The forum has been closed, but prior postings are accessible for reading.

All times are UTC - 6 hours [ DST ]





 Post subject: Unicode property utility
Posted: Tue Feb 01, 2011 3:02 pm

Joined: Tue Jan 11, 2011 9:29 pm
Posts: 5
Searching in the Unicode property utility for [:General_Category=Unassigned:][:General_Category=Control:] returns what looks like a random set of characters at the top of the results:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DUnassigned%3A][%3AGeneral_Category%3DControl%3A]&g=

Below that the results appear to be what I expected: all of the unassigned and control characters. Is there a bug in the way the initial list of code points is generated?


 
 Post subject: Re: Unicode property utility
Posted: Tue Feb 01, 2011 3:13 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 182
The list of characters at the top should all be square boxes, because fonts should not have glyphs for unassigned characters. But, as you will see, many fonts do place glyphs at those locations. That's why the visual list of characters is seemingly full of random entries.

In an attempt to "show" you some characters, the browser may go out of its way to locate fonts that have some (non-default) glyph at these locations, so in a way you get the worst-behaved fonts.


 
 Post subject: Re: Unicode property utility
Posted: Tue Feb 01, 2011 4:37 pm

Joined: Tue Jan 11, 2011 9:29 pm
Posts: 5
I can use wget to download the URL I listed above, and I have a text editor that will identify the code point of the character under the cursor. The list of characters at the top does contain characters that are not in the set I searched for. Copying and pasting characters from the list in my browser into the character properties utility also shows that they are characters that I didn't search for. One random example is U+0B30 ର which is in general category Other_Letter.

I should mention that I don't have this problem when searching only for [:General_Category=Unassigned:] or only for [:General_Category=Control:]. I get the unexpected characters only when I search for both categories together. Searching for the unassigned general category alone prints a long string of replacement characters �. When I check their code points, they are all actually U+FFFD, not unassigned code points that my browser merely displays with a � symbol.
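
For readers who want to repeat this kind of check without a special text editor, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` and the sample code points are assumptions chosen for illustration; only U+0B30 and U+FFFD come from the post above. The module reports the two-letter General_Category abbreviations (Lo = Other_Letter, So = Other_Symbol, Cc = Control, Cn = Unassigned), and the results depend on the Unicode version bundled with the Python build.

```python
import unicodedata

def describe(ch):
    """Return (code point label, General_Category abbreviation, name) for one character."""
    cp = ord(ch)
    return f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch, "<no name>")

# U+0B30 and U+FFFD are taken from the post; U+0000 and U+0378 are
# illustrative picks for a control character and an unassigned code point.
samples = ["\u0B30", "\uFFFD", "\u0000", "\u0378"]

for ch in samples:
    print(*describe(ch))
```

A character pasted from the result list can be passed to `describe` in the same way to see whether it really belongs to Unassigned or Control.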


 
 Post subject: Re: Unicode property utility
Posted: Tue Feb 01, 2011 7:33 pm
Forum Admin

Joined: Fri Dec 04, 2009 9:13 pm
Posts: 32
The list of characters at the top is a UnicodeSet pattern, written in the same notation as a character class in regular expressions. Its first character is a ^, which means the set consists of all the characters that are _not_ listed:

[^\ -~ -ͷͺ-;΄-ΊΌΎ-ΡΣ-ԧ Ա-Ֆՙ-՟ա...
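
The negated form can be reproduced in outline with the sketch below. It is not the utility's own code: the function names, the `limit` parameter, and the use of Python's `unicodedata` module are assumptions made for illustration. It collects the ranges of code points that are neither Unassigned (Cn) nor Control (Cc); these are the ranges that would be listed after the ^, so the characters actually in the set are the ones that do not appear.

```python
import unicodedata

EXCLUDED = {"Cn", "Cc"}  # Unassigned and Control

def complement_ranges(limit=0x500):
    """Inclusive (start, end) ranges of code points below `limit`
    whose General_Category is NOT Unassigned or Control."""
    ranges = []
    start = None
    for cp in range(limit):
        listed = unicodedata.category(chr(cp)) not in EXCLUDED
        if listed and start is None:
            start = cp                      # a listed range opens here
        elif not listed and start is not None:
            ranges.append((start, cp - 1))  # the range closed at the previous code point
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges

def in_negated_set(ch, listed_ranges):
    """Membership in [^...]: true only if ch falls in none of the listed ranges."""
    return not any(lo <= ord(ch) <= hi for lo, hi in listed_ranges)

ranges = complement_ranges()
print([f"U+{lo:04X}-U+{hi:04X}" for lo, hi in ranges[:2]])
```

The first two ranges start at the space character and at the no-break space, matching the start of the pattern quoted above. Later ranges can differ from the 2011 output, because code points that were unassigned then may have been assigned in newer Unicode versions.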


 
 Post subject: Re: Unicode property utility
Posted: Tue Feb 01, 2011 11:57 pm

Joined: Tue Jan 11, 2011 9:29 pm
Posts: 5
OK, that makes sense. Thanks.

