Re: Unix Codes for Diacritics

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sat Sep 18 2010 - 04:42:56 CDT

    On Sat, 18 Sep 2010 00:06:07 +0100
    Krishna Birth <krishnabirth@gmail.com> wrote:

    > Could someone please correctly tell the codes to use on Unix operating
    > systems to produce the below diacritics:
    >
    > A
    > Ā = http://www.fileformat.info/info/unicode/char/0100/index.htm
    ...

    > I need to find this for a project/coder's question?

    If you are asking how to type these precomposed letters at a keyboard,
    we need to know which Unix operating system you have in mind, and the
    X-terminal model may be relevant. For example, if the X-terminal is
    a Windows PC running Exceed, this may reduce to a Windows question.

    My answer is directed to what one would write in a program. It is
    possible that more detail is required as to the coder's problem.

    The codepoint (i.e. number encoding the character) for these letters
    is part of the name of the links you gave, e.g. the code for Ā is 0100
    in hex.

    If you are simply trying to produce the single, precomposed character
    in a program, the information is given in the table headed 'Encodings'
    in the pages you referenced. It may be worth also giving the
    information for the plain letter 'A' at
    http://www.fileformat.info/info/unicode/char/0041/index.htm so that the
    coder may understand the information better. UTF-8 is the encoding
    which for most purposes can work on Unix in exactly the same fashion as
    8-bit codes (ASCII, ISO-8859, ISCII, TSCII), though multibyte EUC
    encodings are a better analogy. (If the coder doesn't understand EUC,
    it's not worth explaining.)
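
To make the comparison with 8-bit codes concrete, here is a minimal sketch, assuming a POSIX shell with `od` and `tr` available. The helper name `show_bytes` is made up for illustration. It dumps the bytes written for plain 'A' and for the UTF-8 form of U+0100:

```shell
#!/bin/sh
# show_bytes is a hypothetical helper: it prints the bytes of its stdin in hex.
show_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# Plain 'A' (U+0041) is a single byte, exactly as in ASCII.
printf 'A' | show_bytes
# U+0100 in UTF-8 is two bytes, 0xC4 0x80, written here as octal escapes.
printf '\304\200' | show_bytes
```

Octal escapes are used rather than '\x' because not every shell's printf accepts hexadecimal escapes.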

    For example, when I run a terminal window using the locale en_GB.utf8,
    I can have the letter printed to the terminal by a bash script using
    the command
    % printf "\xc4\x80" # Use UTF-8 form explicitly
    The printf of bash version 4.1.5(1) does not understand escape codes
    using '\u'.

    On the other hand, /usr/bin/printf on the Linux system I'm using does,
    and I could achieve the same effect using
    % /usr/bin/printf "\u0100" # What happens in non-UTF-8 locales?
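
The question in that comment is left open here. One cautious approach, sketched below under the assumption that `locale charmap` is available (it is specified by POSIX), is to check the character map before writing UTF-8 bytes. The helper name `is_utf8_charmap` is hypothetical, and the list of spellings it accepts is an assumption, not an exhaustive one:

```shell
#!/bin/sh
# is_utf8_charmap is a hypothetical helper: it succeeds only if the given
# character map name is one of a few common spellings of UTF-8.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *)                     return 1 ;;
    esac
}

# 'locale charmap' reports the character encoding of the current locale.
if is_utf8_charmap "$(locale charmap 2>/dev/null)"; then
    printf '\304\200\n'   # explicit UTF-8 bytes for U+0100
else
    echo 'U+0100 not printed: the locale is not UTF-8' >&2
fi
```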

    If you want the codes for the diacritics themselves, so that the
    letters you listed may be entered as plain Roman letter plus diacritic
    mark, the information you need is in
    http://www.unicode.org/Public/UNIDATA/UnicodeData.txt , with an
    explanation in http://www.unicode.org/reports/tr44/#UnicodeData.txt .
    As an example, consider the line for U+0100:

    0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;

    The data items are separated by semicolons. The first field is the
    codepoint, the number for the character, expressed in hexadecimal
    notation. The second field gives the character name. The interesting
    field for you may be the sixth field, which, unless it starts with
    '<', gives another way of expressing the same character - in this case
    as the sequence <U+0041 LATIN CAPITAL LETTER A, U+0304 COMBINING
    MACRON>.
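
The field layout can be checked from a script. The following is a minimal sketch, assuming a POSIX shell with `cut`. The helper name `field` is made up for illustration, and the entry is the one quoted above, joined onto a single line:

```shell
#!/bin/sh
# The UnicodeData.txt entry for U+0100, on a single line.
line='0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;'

# field N LINE: hypothetical helper printing the Nth semicolon-separated field.
field() {
    printf '%s\n' "$2" | cut -d ';' -f "$1"
}

field 1 "$line"   # codepoint in hexadecimal
field 2 "$line"   # character name
field 6 "$line"   # decomposition: base letter, then combining mark

# A sixth field starting with '<' carries a compatibility tag and is not
# simply another spelling of the same character.
case "$(field 6 "$line")" in
    '')   echo 'no decomposition' ;;
    '<'*) echo 'compatibility decomposition' ;;
    *)    echo 'canonical decomposition' ;;
esac
```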

    If you want to write the diacritics themselves without attaching them
    to a letter, there are three methods. Firstly, you can write them on
    a hardspace, e.g. <U+00A0 NO-BREAK SPACE, U+0304 COMBINING MACRON>.
    This will not always work. Secondly, you can use the spacing modifier
    letter, which is the safe way of writing it; for this you need to look
    at the code chart for the spacing modifier letters. For the macron,
    you will use <U+02C9 MODIFIER LETTER MACRON>. The third method is to
    use the ISO-8859 characters, in this case <U+00AF MACRON>. The
    drawback with the third method is that this is a symbol, not a letter,
    and you may encounter bad line-breaking or the macron may be combined
    with a preceding letter.
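
For comparison, the three ways of writing a free-standing macron can be emitted as explicit UTF-8 bytes. This is only a sketch, assuming a POSIX shell and a UTF-8 terminal. The function names are made up for illustration, and how each form is rendered depends on the font and the terminal:

```shell
#!/bin/sh
# Each hypothetical helper writes one form of a free-standing macron as
# explicit UTF-8 bytes (octal escapes).
nbsp_macron()     { printf '\302\240\314\204'; }   # U+00A0 NO-BREAK SPACE + U+0304 COMBINING MACRON
modifier_macron() { printf '\313\211'; }           # U+02C9 MODIFIER LETTER MACRON
spacing_macron()  { printf '\302\257'; }           # U+00AF MACRON

# Print each form between brackets so that the spacing is visible.
for f in nbsp_macron modifier_macron spacing_macron; do
    printf '%s: [' "$f"
    "$f"
    printf ']\n'
done
```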

    Richard.


