Re: Unix Codes for Diacritics

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sat Sep 18 2010 - 04:42:56 CDT

In reply to: Krishna Birth: "Unix Codes for Diacritics"

On Sat, 18 Sep 2010 00:06:07 +0100
Krishna Birth <krishnabirth@gmail.com> wrote:

> Could someone please correctly tell the codes to use on Unix operating
> systems to produce the below diacritics:
>
> A
> Ā = http://www.fileformat.info/info/unicode/char/0100/index.htm
...

> I need to find this for a project/coder's question?

If you are asking how to type these precomposed letters at a keyboard,
we need to know which Unix operating system you have in mind, and the
X-terminal model may be relevant. For example, if the X-terminal is
a Windows PC running Exceed, this may reduce to a Windows question.

My answer is directed to what one would write in a program. It is
possible that more detail is required as to the coder's problem.

The codepoint (i.e. the number encoding the character) for each of
these letters appears in the URL of the link you gave, e.g. the code
for Ā is 0100 in hex.

If you are simply trying to produce the single, precomposed character
in a program, the information is given in the table headed 'Encodings'
in the pages you referenced. It may be worth also giving the
information for the plain letter 'A' at
http://www.fileformat.info/info/unicode/char/0041/index.htm so that the
coder may understand the information better. UTF-8 is the encoding
that, for most purposes, works on Unix in exactly the same fashion as
the 8-bit codes (ASCII, ISO-8859, ISCII, TSCII), though the multibyte
EUC encodings are a better analogy. (If the coder doesn't understand
EUC, it's not worth explaining.)
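
For the coder's benefit, here is a rough sketch of how the UTF-8 bytes
in the 'Encodings' table follow from the codepoint. It assumes a POSIX
shell that accepts hexadecimal constants in arithmetic expansion, and
it covers only the two-byte range U+0080 to U+07FF, which is where Ā
falls. The variable names are mine.

```shell
cp=$((0x0100))                 # codepoint of Ā, taken from the page URL

# Two-byte UTF-8 layout, valid only for U+0080..U+07FF:
#   110xxxxx 10xxxxxx
b1=$(( 0xC0 | (cp >> 6) ))     # lead byte carries the top five bits
b2=$(( 0x80 | (cp & 0x3F) ))   # continuation byte carries the low six bits

utf8_hex=$(printf '%02x %02x' "$b1" "$b2")
printf 'U+%04X -> %s\n' "$cp" "$utf8_hex"
```

Codepoints outside that range need the one-, three- or four-byte
layouts instead, so this is not a general converter.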

For example, when I run a terminal window using the locale en_GB.utf8,
I can have the letter printed to the terminal by a bash script using
the command
% printf "\xc4\x80" # Use UTF-8 form explicitly
The printf of bash version 4.1.5(1) does not understand escape codes
using '\u'.

On the other hand, /usr/bin/printf on the Linux system I'm using does,
and I could achieve the same effect using
% /usr/bin/printf "\u0100" # What happens in non-UTF-8 locales?
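
If the script must not depend on which printf understands '\x' or
'\u', a more portable sketch is to use octal escapes, which POSIX
printf specifies, and to check the result with od. The locale test
below is only my assumption about a reasonable guard; 'locale charmap'
may report the encoding under a different spelling on some systems.

```shell
# Octal escapes are part of POSIX printf, so this form does not rely on
# the shell's printf understanding '\x' or '\u'.
a_macron=$(printf '\304\200')    # bytes C4 80, i.e. U+0100 in UTF-8

# Dump the bytes in hex to confirm exactly what was produced.
a_macron_hex=$(printf '%s' "$a_macron" | od -An -tx1 | tr -d ' \n')
printf 'bytes: %s\n' "$a_macron_hex"

# Only send the character to the terminal if the locale claims UTF-8.
charmap=$(locale charmap 2>/dev/null || echo unknown)
case "$charmap" in
  UTF-8|utf8|UTF8) printf '%s\n' "$a_macron" ;;
  *) printf 'locale charmap is %s; not printing raw UTF-8\n' "$charmap" ;;
esac
```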

If you want the codes for the diacritics themselves, so that the
letters you listed may be entered as a plain Roman letter plus a
diacritic mark, the information you need is in
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt , with an
explanation in http://www.unicode.org/reports/tr44/#UnicodeData.txt .
As an example, consider the line for U+0100:

0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;

The data items are separated by semicolons. The first field is the
codepoint, the number for the character, expressed in hexadecimal
notation. The second field gives the character name. The interesting
field for you may be the sixth field, which, unless it starts with
'<', gives another way of expressing the same character - in this case
as the sequence <U+0041 LATIN CAPITAL LETTER A, U+0304 COMBINING
MACRON>.
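
As an illustration of reading that record in a script, here is a
minimal sketch using awk to split on semicolons. It works on the
single U+0100 line quoted above rather than on a downloaded copy of
UnicodeData.txt, and the variable names are my own.

```shell
# The U+0100 record from UnicodeData.txt, as one physical line.
line='0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;'

# Split on semicolons: field 1 is the codepoint, 2 the name,
# 6 the decomposition.
name=$(printf '%s\n' "$line" | awk -F';' '{ print $2 }')
decomp=$(printf '%s\n' "$line" | awk -F';' '{ print $6 }')

case "$decomp" in
  '')   printf 'no decomposition listed\n' ;;
  '<'*) printf 'tagged decomposition, not an exact equivalent: %s\n' "$decomp" ;;
  *)    printf '%s = %s\n' "$name" "$decomp" ;;
esac

# The decomposed sequence itself in UTF-8: 'A' (41) then U+0304 (CC 84).
decomposed=$(printf 'A\314\204')
decomposed_hex=$(printf '%s' "$decomposed" | od -An -tx1 | tr -d ' \n')
```

Whether the decomposed form is displayed identically to the
precomposed letter depends on the font and the terminal.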

If you want to write the diacritics themselves without attaching them
to a letter, there are three methods. Firstly, you can write them on a
hard space, e.g. <U+00A0 NO-BREAK SPACE, U+0304 COMBINING MACRON>. This
will not always work; using the spacing modifier letter is the safe way
of writing it. For this you need to look at the code chart for the
Spacing Modifier Letters block. For the macron, you will use <U+02C9
MODIFIER LETTER MACRON>. The third method is to use the ISO-8859
character, in this case <U+00AF MACRON>. The drawback with the third
method is that this is a symbol, not a letter, and you may encounter
bad line-breaking or the macron may be combined with a preceding letter.
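
To compare the three forms of the standalone macron side by side, a
short sketch along the following lines may help. It writes the UTF-8
bytes as octal escapes and assumes a UTF-8 terminal for the visible
output; the helper name 'hex' is hypothetical.

```shell
# UTF-8 byte sequences, written as POSIX octal escapes.
nbsp_macron=$(printf '\302\240\314\204')   # U+00A0 U+0304 (C2 A0 CC 84)
modifier_macron=$(printf '\313\211')       # U+02C9        (CB 89)
spacing_macron=$(printf '\302\257')        # U+00AF        (C2 AF)

# Helper: show the bytes of its argument as a hex string.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

for s in "$nbsp_macron" "$modifier_macron" "$spacing_macron"; do
  printf '%s  [%s]\n' "$s" "$(hex "$s")"
done
```

How each one is rendered, and whether the first shows a macron over a
blank space at all, will vary with the font and terminal.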

Richard.

This archive was generated by hypermail 2.1.5 : Sat Sep 18 2010 - 04:43:31 CDT