Re: unicode entities, "beginner" questions...

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sun Mar 13 2005 - 05:51:06 CST

    On Sun, 13 Mar 2005 suzume@mx82.tiki.ne.jp wrote:

    > I apologize for the level of the questions. If the place is not right
    > I'd appreciate to get pointers to lists where I can get information.

    As far as I have understood, it is OK to discuss issues of practical use
    of Unicode as well, not just the development of the Unicode standard.
    The charter at http://www.unicode.org/consortium/distlist.html
    looks rather liberal to me.

    > I have had issues with both since I realized that, contrary to unicode
    > supporting OSX apps (TextEdit to give a simple example, but also most
    > text editors on OSX) the above apps translate all the Japanese (and
    > French non ascii characters) to non human readable entities that make
    > direct editing of output files almost impossible.

    As you explained in your reply to my question, "entities" refers to HTML
    entities such as &eacute; and character references like &#1234;.
    These are actually quite distinct concepts, though commonly confused with
    each other in HTML tutorials - and even specifications! But what's
    common is that they relate to markup languages - basically, SGML, XML, or
    anything based on them, such as HTML. They are not at the character level
    but "higher". They make sense in markup only, though you might sometimes
    see them as erroneously generated in other contexts as well.
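
    The distinction can be illustrated with a small sketch. Python and its
    standard html module are my choice for illustration here, not something
    the programs discussed in this thread use. An entity reference is a
    name that the markup language defines; a character reference is just
    the character's number, in decimal or hexadecimal.

```python
import html
from html.entities import name2codepoint

# An entity reference names a character; the name is defined by the
# markup language (here HTML), not by Unicode.
entity_ref = "&eacute;"

# A numeric character reference gives the code point directly,
# in decimal (&#233;) or hexadecimal (&#xE9;).
decimal_ref = "&#233;"
hex_ref = "&#xE9;"

# All three resolve to the same character, U+00E9.
resolved = [html.unescape(s) for s in (entity_ref, decimal_ref, hex_ref)]
print(resolved)

# The entity name has to be looked up in a table; the number does not.
print(name2codepoint["eacute"])
print(ord(resolved[0]))
```

    Outside markup none of these strings has any special meaning, which is
    why they look like garbage in a plain text editor.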

    If you use a "Save as HTML" or "Save as Web page" or something similar in
    a text processing program, it is quite possible that your Unicode
    characters get stored as entity references or character references.
    There is nothing wrong with this per se (in most cases; some programs
    generate incorrect character references, though). But if you wish to edit
    the document later, e.g. using Notepad, you will see the entity or
    character references, and things really get awkward. (Occasional
    occurrences of such references are not problematic, but normal French
    text, not to mention normal Japanese text, is difficult to handle that way.)

    > I seem to not understand the reality of what unicode is and I thus am
    > stuck with files and no way to convert them to human readable output.

    Probably the problem is not your understanding of Unicode but the issue of
    entity and character references, which are completely external to Unicode,
    though often used in markup to present Unicode characters. And perhaps the
    most difficult part is how the various programs work. Some programs may
    have options that control whether and how Unicode characters are replaced
    by entity or character references. Moreover, they may have options for
    setting the encoding of the HTML document, and this may affect the
    situation.

    I just tried OpenOffice on Windows, and created a document with e acute
    and a kanji character, then used File/Save as with default settings. The
    e acute was saved as &eacute; and the kanji character as a character
    reference. The latter part is understandable, since the default encoding
    in HTML documents created with OpenOffice is windows-1252, which contains
    no kanji characters, so a character reference is really the only way.
    There's no good explanation for &eacute; as far as I can see. If I change
    the encoding to utf-8 (via Tools/Options/Load and save/HTML compatibility,
    or something like that), then the kanji character is saved as such, utf-8
    encoded. The &eacute; entity reference still appears!
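
    The dependence on the document's encoding can be reproduced outside
    OpenOffice. The following is only a sketch in Python of what one would
    expect from an encoder that falls back to character references; it is
    an assumption that this resembles what OpenOffice does internally, and
    the sample kanji is arbitrary. Python calls windows-1252 "cp1252".

```python
text = "é 漢"

# windows-1252 (Python codec name "cp1252") contains é but no kanji, so
# the kanji has to become a character reference; é can be stored as such.
as_cp1252 = text.encode("cp1252", errors="xmlcharrefreplace")
print(as_cp1252)

# UTF-8 can represent every Unicode character, so the fallback to
# character references is never triggered.
as_utf8 = text.encode("utf-8", errors="xmlcharrefreplace")
print(as_utf8)
```

    In neither case does the encoding force é to be written as an entity
    reference.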

    > Why do those tools favor a non-human readable output form ? Is there a
    > valid technical reason to do so ?

    There might be. One of the reasons is that the document's encoding might
    not allow all characters to be represented as such, and character
    references offer a universal way to overcome such limitations.
    But programs might also use such an output form for no good reason.
    On the other hand, if you represent all non-ASCII characters using
    entity references or character references, you can use ASCII for any HTML
    document, and in any case, your data will be "7-bit safe", i.e. it can
    even be sent over a connection that does something nasty to octets with
    the most significant bit set.
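
    As a minimal sketch of the "7-bit safe" idea, again in Python and with
    a hypothetical helper name of my own, the following writes every
    non-ASCII character as a decimal character reference and then checks
    that no octet has its most significant bit set. It does not touch
    markup-significant characters such as & and <.

```python
def to_ascii_html(text: str) -> bytes:
    """Encode text as ASCII, writing each non-ASCII character as a
    decimal character reference (hypothetical helper name)."""
    return text.encode("ascii", errors="xmlcharrefreplace")


data = to_ascii_html("café 日本")
print(data)

# Every octet has its most significant bit clear.
print(all(b < 0x80 for b in data))
```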

    > Are there easy ways to convert from one to the other ?

    There's Free Recode, http://recode.progiciels-bpi.ca/
    which can perform an impressive amount of code conversions.
    It can deal with HTML as well, see
    http://recode.progiciels-bpi.ca/manual/HTML.html#HTML
    but beware that it uses rather odd terminology: it refers to "HTML
    charsets" when it actually means HTML format. (Normally "charset" means
    character encoding, at the character level, without any notion of
    entity references or character references.)
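
    If installing Recode is not an option, the conversion from references
    back to characters can be sketched in a few lines of Python. This is an
    illustration under assumptions, not a replacement for a real tool: the
    function name, the regular expression and the list of protected names
    are mine. References to ASCII characters and to &lt;, &gt;, &amp; and
    &quot; are left alone, since expanding them would change the markup.

```python
import re
from html.entities import name2codepoint

# References that must stay as they are, or the markup would change meaning.
KEEP = {"lt", "gt", "amp", "quot", "apos"}
REF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def expand_references(markup: str) -> str:
    """Replace entity and character references for non-ASCII characters
    with the characters themselves (hypothetical helper)."""

    def repl(match):
        body = match.group(1)
        if body.startswith(("#x", "#X")):
            code = int(body[2:], 16)
        elif body.startswith("#"):
            code = int(body[1:])
        elif body in name2codepoint and body not in KEEP:
            code = name2codepoint[body]
        else:
            return match.group(0)  # unknown or protected name: leave as is
        # Leave ASCII references alone (e.g. &#60; is "<") and skip
        # numbers that are not valid code points.
        if code < 0x80 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)

    return REF.sub(repl, markup)


sample = "<p>caf&eacute; &#26085;&#x672C; &amp; &lt;b&gt;</p>"
print(expand_references(sample))
```

    When applying something like this to a file, the result has to be
    saved in an encoding that can represent all the characters, such as
    utf-8, and the encoding declared for the document has to match.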

    > Are there other forms a unicode character can take ?

    Sure, in the same sense: in addition to encodings such as utf-8 and utf-7,
    there are coding systems at another level, but they all depend on the data
    format. I have presented some samples of such notations at
    http://www.cs.tut.fi/~jkorpela/chars.html#esc
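
    To give an idea of how such notations depend on the data format, here
    is a small Python sketch that writes the same character in a few of
    them. The selection is my own for illustration and is not necessarily
    the same as the samples on the page mentioned above.

```python
import json
from urllib.parse import quote

ch = "é"  # U+00E9

notations = {
    # Markup (SGML, XML, HTML): numeric character reference.
    "HTML character reference": ch.encode("ascii", "xmlcharrefreplace").decode("ascii"),
    # JSON text: backslash escape inside a string literal.
    "JSON string escape": json.dumps(ch),
    # URLs: percent-encoding of the UTF-8 octets.
    "URL percent-encoding": quote(ch),
    # Plain prose about characters: the U+ notation.
    "U+ notation": f"U+{ord(ch):04X}",
}
for name, value in notations.items():
    print(f"{name}: {value}")
```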

    > When I started as a html writer, about 10 years ago, I used to convert
    > my French accented letters to html entities to "make sure" that they'd
    > be displayed properly.

    That's a widespread idea, but it never had a very good basis, and nowadays
    even less so. It used to be relevant (and might still occasionally be
    relevant) if you author, say, in a Mac environment and upload your
    documents onto a server that runs Unix. It might then happen that the
    software you use for uploading performs a wrong character encoding
    conversion (or doesn't do a conversion when it should). But if you have
    used only ASCII characters (and e.g. written accented letters using entity
    references), then no such conversion is needed, and no conceivable
    conversion will harm you either, since conversions would leave ASCII
    characters intact.
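
    The point about ASCII surviving any conversion can be checked with a
    short Python sketch. The encodings listed are merely examples I picked
    (Python's "mac_roman" stands for the classic Mac encoding); the upload
    software in question might of course use others.

```python
ascii_markup = "caf&eacute;"
raw_text = "café"

codecs_to_try = ["mac_roman", "latin-1", "cp1252", "utf-8"]

# ASCII-only markup: the same octets in every one of these encodings.
print({ascii_markup.encode(c) for c in codecs_to_try})

# é written as such: the octets depend on the encoding.
print({c: raw_text.encode(c) for c in codecs_to_try})

# A missed conversion: Mac octets interpreted as ISO-8859-1 on the server.
misread = raw_text.encode("mac_roman").decode("latin-1")
print(repr(misread))
```

    The ASCII-only version yields one and the same octet sequence in all
    four encodings, so neither a wrong conversion nor a missing one can
    damage it.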

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

