Re: Unicode characters List instead of hexadecimal equivalent

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 30 2006 - 06:13:42 CDT

Next message: Andreas Prilop: "Re: kurdish sorani"

Previous message: Andries Brouwer: "Re: kurdish sorani"
In reply to: Adisesha Neelaiahgari: "Unicode characters List instead of hexadecimal equivalent"

The Unicode database contains everything you need to create your list. You could very easily create such a list by reading each line of UnicodeData.txt, extracting the hex code point, and generating the output line with this pseudo-code:

void display(String hexCodepoint) {
    print(" " + hexCodepoint + " " + toUTF8(hexToInt(hexCodepoint)));
}

This is only pseudo-code, because your program must actually write the characters to a file in a well-defined character encoding. Remember that code points can be larger than FFFF hex, but that most languages (including those for .NET) do not store a code point in a single native "char" entity: most of them can only store 16-bit code units, using UTF-16 for example, which means that every character outside the Basic Multilingual Plane must be stored as two code units (a surrogate pair).
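
As a rough illustration of the two points above, here is a minimal Python sketch (Python is used only for brevity; it is not the .NET code the question asks about). The helper names display_line, utf16_code_units and write_list are made up for this example. It turns a hex code point into a character, writes the output explicitly as UTF-8, and counts how many 16-bit code units the same character would need in UTF-16:

```python
def display_line(hex_codepoint: str) -> str:
    """Return ' <hex> <character>' for one hex code point string."""
    return " " + hex_codepoint + " " + chr(int(hex_codepoint, 16))


def utf16_code_units(hex_codepoint: str) -> int:
    """Count the 16-bit code units UTF-16 needs for this code point."""
    encoded = chr(int(hex_codepoint, 16)).encode("utf-16-be")
    return len(encoded) // 2


def write_list(hex_codepoints, path):
    """Write one line per code point, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        for hex_codepoint in hex_codepoints:
            out.write(display_line(hex_codepoint) + "\n")
```

With these definitions, utf16_code_units("013B") gives 1 and utf16_code_units("10000") gives 2, which is the surrogate-pair case a 16-bit "char" type cannot hold in one unit. Surrogate code points themselves (D800 to DFFF) cannot be encoded at all and would have to be skipped before calling write_list.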

Remember also that UnicodeData.txt does NOT contain a line for each ideograph or each precomposed Hangul syllable, because these ranges are very large and their characters share the same properties; each such range is given only by its first and last code points (note that some extra properties for Han ideographs are in the very large separate Unihan database, which is still informative only and far from being complete).
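
A program that reads UnicodeData.txt therefore has to expand those ranges itself. The following Python sketch assumes the file's convention of marking a range with a "<..., First>" line followed by a "<..., Last>" line; the function name iter_codepoints and the three sample lines (written in the style of the file) are only for illustration:

```python
def iter_codepoints(lines):
    """Yield every code point described by UnicodeData.txt-style lines.

    A range appears as a '<..., First>' line followed by a '<..., Last>' line.
    """
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        codepoint = int(fields[0], 16)  # field 0: hex code point
        name = fields[1]                # field 1: name or range marker
        if name.endswith(", First>"):
            range_start = codepoint
        elif name.endswith(", Last>") and range_start is not None:
            yield from range(range_start, codepoint + 1)
            range_start = None
        else:
            yield codepoint


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]
codepoints = list(iter_codepoints(sample))
print(len(codepoints))  # 1 letter + 11172 Hangul syllables = 11173
```

The same First/Last convention also covers the surrogate and private-use ranges, which a real listing would presumably want to filter out rather than print.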

But would you really want such a large list, with tens of thousands of lines?

If you want to see sample glyphs, look at the code charts, because such a list will not show you the characters if you don't have font support for them. To see all characters defined in Unicode, you need quite a large collection of fonts. Even this will not be enough, because some characters have contextual shaping or participate in complex ligatures although they are encoded the same way. The charts only show characters in isolated form, without the ligatures and various contextual shapes they may adopt; for details, look at the TUS chapters regarding Middle-Eastern scripts and Indic scripts.

So I think that you really need to start by reading the Unicode Standard (especially the Conformance section, which describes the UTF encodings) to understand the issues.
  ----- Original Message -----
  From: Adisesha Neelaiahgari
  To: unicode@unicode.org
  Sent: Tuesday, August 29, 2006 12:23 PM
  Subject: Unicode characters List instead of hexadecimal equivalent

Hi,

I am new to Unicode world. I am working on .Net project where I need all Unicode characters available. I could see this list from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt . I can see hexadecimal number & explanation of the Unicode character but where can I see the equivalent character for the specified hex number.

Ex:

315 Ļ

316 ļ

317 Ľ

318 ľ

319 Ŀ

320 ŀ

321 Ł

322 ł

323 Ń

324 ń

325 Ņ

I tried some functions to convert this hexadecimal number to Unicode i.e Microsoft.VisualBasic.ChrW, Convert.ToChar….in .NET which could not solve the issue.

Thanks in advance,

