The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]




 Post subject: Getting Unicode charts in "&#x0abcd;" format?
Posted: Sun Jan 22, 2012 11:21 pm 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
Hi, I'm wondering if there's any way to get a list of just the Unicode values (ideally as HTML hex character references: &#x0530;&#x0535; etc.) for the characters in the various PDF charts, including CJK (and its extensions). Ideally the list would contain only characters that are actually named and in use, not the reserved, as-yet-unused code points that are grayed out in the charts. For example, the Armenian block starts at 0530, but 0530 itself is not assigned, so I would need to start at 0531. Likewise, I don't need formatting characters or code points that are not mapped to actual visible characters (control codes, etc.).

Thank you for any thoughts or suggestions!


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Mon Jan 23, 2012 6:51 pm 
Engineer

Joined: Mon Nov 30, 2009 7:14 pm
Posts: 40
Location: Earth
We don't supply a file like that ready-made, but you could make your own. All of the characters are listed in the UCD. Try looking at this file in the latest UCD:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
and then take the first field. That's the codepoint...
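
For what it's worth, here is a minimal Python sketch of that idea: split each line on semicolons, take the first field, and wrap it as an HTML hex reference. The function name `to_html_entities` and the two sample lines are just for illustration; in practice you would feed it the lines of the downloaded UnicodeData.txt. Range lines and unwanted categories are not handled here.

```python
def to_html_entities(lines):
    """Yield an HTML hex reference for the code point on each UnicodeData.txt line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Fields are separated by semicolons; field 0 is the code point in hex.
        code_point = line.split(";")[0]
        yield "&#x{};".format(code_point)


# Two sample lines in the UnicodeData.txt layout (Armenian capital letters).
sample = [
    "0531;ARMENIAN CAPITAL LETTER AYB;Lu;0;L;;;;;N;;;;0561;",
    "0532;ARMENIAN CAPITAL LETTER BEN;Lu;0;L;;;;;N;;;;0562;",
]
entities = list(to_html_entities(sample))
print("".join(entities))  # &#x0531;&#x0532;
```

To run it over the real file, something like `to_html_entities(open("UnicodeData.txt", encoding="ascii"))` should work, assuming the file has been saved locally under that name.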


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Mon Jan 23, 2012 7:34 pm 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
This document is very helpful. Is there an easy way to find the characters that are not actual printing characters (controls, private use areas, unused blanks, and so on)?

And this does include ALL the chars, including CJK and its Extended ranges? Or are those in the Unihan.txt doc? (I need a document that does not have duplicates and in Unihan.txt it seems there are lots of duplicate descriptions for characters. Something like this list is perfect, assuming that this doesn't duplicate!)

Thank you again for the helpful suggestions!


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Tue Jan 24, 2012 2:23 am 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
thesun wrote:
This document is very helpful. Is there an easy way to find the characters that are not actual printing characters (controls, private use areas, unused blanks, and so on)?


You need to learn about the "General Category" and other Unicode properties; http://unicode.org/reports/tr44 explains them all. The General Category identifies spaces, control characters, and format characters and, most conveniently, it is the third field in UnicodeData.txt. Other classifications exist, however, so be sure you understand which one matches your problem.
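
If you just want to get a feel for General Category values before parsing the file, Python's standard `unicodedata` module exposes them. This is only a quick sketch; the helper name `general_category` is made up for illustration, and the module reflects whichever Unicode version your Python build ships with, which may differ from the latest UCD.

```python
import unicodedata


def general_category(code_point):
    """Return the two-letter General Category for an integer code point."""
    return unicodedata.category(chr(code_point))


# A control, a space, an unassigned code point, and an Armenian letter.
for cp in (0x0009, 0x0020, 0x0530, 0x0531):
    print("U+{:04X} {}".format(cp, general_category(cp)))
# U+0009 Cc
# U+0020 Zs
# U+0530 Cn
# U+0531 Lu
```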

Quote:
And this does include ALL the chars, including CJK and its Extended ranges? Or are those in the Unihan.txt doc? (I need a document that does not have duplicates and in Unihan.txt it seems there are lots of duplicate descriptions for characters. Something like this list is perfect, assuming that this doesn't duplicate!)


That file covers all characters, with no duplicates, but some ranges are in there collectively, not one code per line.
Look at the line for 4E00 to get an idea how ranges are indicated using two successive lines. Should be rather obvious how the first and last code for such ranges are given, once you look at it in the actual data file.
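
As a rough sketch of how those two-line ranges could be expanded in a script: the first line of a pair has a name ending in ", First>" and the second a name ending in ", Last>", so a parser can remember the start and emit every code point up to the end. The function name `iter_code_points` is an assumption for illustration, and the "Last" line in the sample is deliberately shortened to 4E03 so the output stays small; the real file gives the true end of the range.

```python
def iter_code_points(lines):
    """Yield (code_point, general_category), expanding First/Last range pairs."""
    range_start = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = code_point
        elif name.endswith(", Last>"):
            if range_start is None:
                raise ValueError("range end without a start: " + line)
            # All other fields are shared by every code point in the range.
            for cp in range(range_start, code_point + 1):
                yield cp, category
            range_start = None
        else:
            yield code_point, category


sample = [
    "0531;ARMENIAN CAPITAL LETTER AYB;Lu;0;L;;;;;N;;;;0561;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "4E03;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",  # shortened for illustration
]
expanded = list(iter_code_points(sample))
print(["{:04X}".format(cp) for cp, _ in expanded])
# ['0531', '4E00', '4E01', '4E02', '4E03']
```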


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Tue Jan 24, 2012 8:03 am 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
asmus wrote:
That file covers all characters, with no duplicates, but some ranges are in there collectively, not one code per line.
Look at the line for 4E00 to get an idea how ranges are indicated using two successive lines. Should be rather obvious how the first and last code for such ranges are given, once you look at it in the actual data file.


Okay, this is finally starting to make sense. :-) My one last question (I hope): for the ranges that are given by two lines (such as the start of the CJK range, and some of the others), is it safe to assume that within each first/last pair there are no control chars, duplicates, or unused blocks?

If not, what would be the best way (ideally without painstakingly going through by hand!) to get a duplicate-free list of those sections? Any suggestions?

It seems that for the main UnicodeData.txt file I could search for <control> and other "< >" combinations until only the chars I need are left, then strip off the data after the first semicolon and use a script to convert all of those to &#x0abcd; format. If there's a way to do this for the Unihan ranges as well, then I'm 90 percent there. Thank you again for the help!!


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Tue Jan 24, 2012 3:50 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
thesun wrote:
My one last question (I hope): for the ranges that are given by two lines (such as the start of the CJK range, and some of the others), is it safe to assume that within each first/last pair there are no control chars, duplicates, or unused blocks?


For the characters listed as ranges, the values in all other fields are constant across the entire range, except for the character name, which is derived algorithmically (it appears that you don't need the name).

Control characters and reserved code points have General Category values of Cc and Cn respectively, whereas the ideographs are Lo. Because the values have to be constant across a range, you can be sure there aren't any controls etc. mixed in.

thesun wrote:
It seems that for the main UnicodeData.txt file I could search for <control> and other "< >" combinations until only the chars I need are left, then strip off the data after the first semicolon and use a script to convert all of those to &#x0abcd; format. If there's a way to do this for the Unihan ranges as well, then I'm 90 percent there.


I would not rely on the names having a < in them. Look at the General Category field instead. It has a two-letter value. Values that start with "C" all mark special code points, such as Cc for control codes, and the same goes for some of the characters whose values start with "Z" (the separators).

Be sure you understand why you are filtering these characters, but if you do, then General Category values can be helpful. Make sure you don't filter characters that you may need for your purpose.
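
A minimal sketch of that kind of filter, under the assumption that you want to drop every category starting with "C" or "Z": the names `keep_line` and `EXCLUDED_PREFIXES` are invented for illustration, and the prefixes are a parameter precisely because you may decide you do need some of those characters (the ordinary space is Zs, for instance). Unassigned (Cn) code points simply have no line in UnicodeData.txt, so they never reach the filter.

```python
EXCLUDED_PREFIXES = ("C", "Z")  # assumption: drop "other" and separator categories


def keep_line(line, excluded=EXCLUDED_PREFIXES):
    """Return True if the line's General Category (field 2) is not excluded."""
    fields = line.strip().split(";")
    return len(fields) > 2 and not fields[2].startswith(excluded)


sample = [
    "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
    "0020;SPACE;Zs;0;WS;;;;;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "E000;<Private Use, First>;Co;0;L;;;;;N;;;;;",
]
kept = [line.split(";")[0] for line in sample if keep_line(line)]
print(kept)  # ['0041']
```

Passing `excluded=("C",)` instead would keep the space and the other separators.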


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Thu Feb 02, 2012 11:20 am 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
Okay, I'm getting somewhere but still have a few hiccups. One of them is that if I want to support, for example, Vietnamese, it's hard to know which character ranges I need to include. I can tell it's not working, but so far I'm just guessing "maybe Viet-lai" or "maybe Combining Diacriticals" instead of actually knowing that I need Latin, Latin-ExtB, Latin Full Half, General Punctuation, Combining Diacriticals, etc.

Is there a way to choose the right ranges by language or writing system one wants, not by names of the ranges?

In the case of Vietnamese, it doesn't work to just "cat UnicodeData.txt | grep 'VIETNAMESE'" because there are a bunch of other important ranges that affect it.

I'd ideally like to have a list of languages I want to support and then be able to quickly choose all the charts (or code ranges) I need to support the language.

Any suggestions? It's been very hard doing this by trial and error. Some languages are not as difficult as others.

Again, thanks very much for the help and previous suggestions!


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Thu Feb 02, 2012 12:48 pm 
Forum Admin

Joined: Fri Dec 04, 2009 9:13 pm
Posts: 32
The Unicode CLDR data does have information about the base characters used for given languages:

http://unicode.org/repos/cldr-tmp/trunk ... cters.html

It's not a full answer to the question "what characters do I need to support for Vietnamese", however.
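
One reason the answer is only partial: a single accented Vietnamese letter can be stored either precomposed or as a base letter plus combining marks, so both forms may need support. The sketch below is not the CLDR data; it takes a short sample string as a stand-in for whatever character set you collect, and lists every code point that appears in either its composed (NFC) or decomposed (NFD) form. The function name `code_points_needed` is hypothetical.

```python
import unicodedata


def code_points_needed(text):
    """Return sorted code points used by text in both NFC and NFD forms."""
    needed = set()
    for form in ("NFC", "NFD"):
        for ch in unicodedata.normalize(form, text):
            if not ch.isspace():
                needed.add(ord(ch))
    return sorted(needed)


# "Tiếng Việt", written with escapes so the source stays ASCII.
sample = "Ti\u1ebfng Vi\u1ec7t"
needed = code_points_needed(sample)
for cp in needed:
    print("U+{:04X} {}".format(cp, unicodedata.name(chr(cp), "<unnamed>")))
```

The output includes the precomposed letters from Latin Extended Additional (U+1EBF, U+1EC7) as well as marks from Combining Diacritical Marks (U+0301, U+0302, U+0323), which is roughly why grepping for a single block or name misses things.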


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Thu Feb 02, 2012 2:11 pm 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
Thank you, Mark. That's a cool chart but as you say, it doesn't really address what I'm needing. I need to somehow know what char ranges I need for each language I want to support. It doesn't work, for example, to put in all the Bengali lines because Bengali also uses punctuation and some shared Devanagari chars. In the case of Vietnamese it's even more complex -- about a year ago I had it working and now can't remember which combination I need. And rather than just guessing in the dark I thought maybe there was an easier way. There really SHOULD be, since many developers may not need all the languages and their app will run faster if it supports just the ones they need.

I'll keep my fingers crossed others out there have some other good ideas! Thanks! :-)


 Post subject: Re: Getting Unicode charts in "&#x0abcd;" format?
Posted: Tue Feb 07, 2012 2:40 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
The question can't be answered with precision. Take German. Because Germany is surrounded by other countries, German publications and records will, to an extent, try to render foreign terms and names in their original spelling. If you took all texts tagged "German" (and assumed the tagging was done correctly), you would come up with a wide "halo" of characters that are not needed for the core German orthography.

What precisely are you trying to do with the information you are trying to collect here? Knowing that might help us help you.

