Re: Viewing Source...

From: Jeroen Ruigrok van der Werven (
Date: Wed Jan 09 2008 - 11:38:08 CST

    -On [20080109 18:16], Damon Anderson ( wrote:
    >I may be an old dog trying to learn new tricks, but I simply can't
    >understand how Unicode is implemented in GUI editors. From Word to
    >OpenOffice to DreamWeaver when I type Unicode characters and then go to
    >look at the source I see nothing but a gobbledygook hodgepodge of odd ASCII
    >characters or character pairs/groups.

    ASCII uses only 128 positions of a byte (7 bits, to be precise), whereas
    Unicode generally needs multiple bytes to represent a character (how many
    depends on the encoding format chosen).
    So yes, when you view Unicode data in an editor that does not understand it,
    the editor will try to interpret the bytes as ASCII or ISO-8859-1 (or -15),
    and the result looks nonsensical.
    The same happens when trying to view, say, KOI8-R as ISO-8859-1: you get back
    lots of accented characters.
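
    To make the effect concrete, here is a minimal Python sketch. The two
    sample strings are my own assumptions for illustration (they are not from
    the thread); the codec names are the standard ones that ship with Python.

```python
# Sample text is an illustrative assumption, not taken from the thread.
vietnamese = "ỡ"        # U+1EE1
russian = "Привет"      # Cyrillic sample for the KOI8-R case

utf8_bytes = vietnamese.encode("utf-8")   # three bytes: e1 bb a1
koi8_bytes = russian.encode("koi8-r")     # one byte per letter

# A Unicode-aware viewer decodes with the right encoding and gets the text back.
right = utf8_bytes.decode("utf-8")

# A viewer that assumes ISO-8859-1 maps every byte to one Latin-1 character,
# so one Vietnamese letter turns into three unrelated characters.
wrong_utf8 = utf8_bytes.decode("iso-8859-1")
wrong_koi8 = koi8_bytes.decode("iso-8859-1")

# ASCII-only summary: the raw bytes and how many characters each view shows.
print(utf8_bytes.hex(), len(right), len(wrong_utf8), len(wrong_koi8))
```

    Nothing in the file changes between the two views; only the assumption
    about the encoding does.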

    >I can, of course type into source directly a properly escaped HTML decimal
    >unicode character and it will display in the UI correctly, but when I type in
    >the UI and view the source I have no way to verify that the correct Unicode
    >is being used, as no Unicode is apparent, no escape characters, no hex or
    >dec. I am completely baffled.

    Which is logical, actually. The editor understands Unicode and shows the
    character associated with each codepoint. As such there are no escape
    characters, since Unicode itself does not use them.

    >What is happening and why/how is the Unicode being recoded or displayed in
    >non-unicode format in the source? Is there a proper source editor that will
    >display the actual Unicode encodings? Is the problem in my Unikey
    >Vietnamese keyboard driver? Unikey seems to send HTML unicode, but that's
    >not what Dreamweaver displays in the source.

    You are really looking at it from the narrow view of ASCII, I am afraid. Just
    as ASCII is a character encoding that maps A to hex 0x41 and Z to
    hex 0x5a, Unicode uses encodings that map each character to one or more
    byte values. Using UTF-16, for example, an A would be
    encoded as 0x0041. If you viewed such a file in a hex editor you would
    see the byte sequence 00 41 (or 41 00 if the file is little-endian). But of
    course, if the editor understands Unicode it will just interpret such byte
    sequences as the proper Unicode codepoints and show the relevant
    characters. Nothing magical, and no different from how ASCII is shown, to
    be honest.
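
    If you want to see the bytes for yourself without a hex editor, a small
    Python sketch like the following will do. The helper name hex_dump is
    hypothetical, and picking U+1EE1 as the third sample is my assumption,
    based on the Vietnamese letter mentioned elsewhere in this thread.

```python
def hex_dump(data: bytes) -> str:
    """Format bytes roughly the way a hex editor would show them."""
    return " ".join(f"{b:02x}" for b in data)


# "A" and "Z" are the ASCII examples; U+1EE1 is a Vietnamese letter.
for ch in ("A", "Z", "\u1ee1"):
    print(f"U+{ord(ch):04X}")
    for encoding in ("utf-8", "utf-16-be", "utf-16-le"):
        print(f"  {encoding:10} {hex_dump(ch.encode(encoding))}")
```

    For A this prints 41 under UTF-8, 00 41 under big-endian UTF-16 and
    41 00 under little-endian UTF-16, which is why the byte order matters
    when you compare against a hex view.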

    >Then there's OpenOffice... I have had to actually submit a bug to OOo
    >because when I use it to read directly from my database which is storing
    >correctly escaped HTML unicode it converts all of my ampersand escape
    >characters to &amp; so &#7905; becomes &amp;#7905;. That one just baffles me,
    >as they are supposed to be supporting Unicode, but convert my Unicode and
    >then don't even convert it to Unicode but use &amp; instead.

    This is a bit different. HTML supports encoding Unicode codepoints as numeric
    character references using the scheme &#NNNN; (decimal) or &#xHHHH;
    (hexadecimal). The &<..>; combination is the standard for encoding
    HTML entities, so OpenOffice should not have messed with the & to make it &amp;.
    It almost sounds as if they did not support numeric references in the first
    place.
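
    The double-escaping effect is easy to reproduce. The sketch below uses
    Python's standard html module purely as a stand-in; I am not claiming
    this is what OpenOffice does internally, only that escaping already
    escaped text produces the same symptom.

```python
import html

reference = "&#7905;"   # decimal reference for the letter in the quoted message

decoded = html.unescape(reference)   # the single character U+1EE1
code_point = ord(decoded)            # 7905

# Escaping text that is already escaped rewrites the leading ampersand.
double_escaped = html.escape(reference)            # "&amp;#7905;"

# A browser undoes only one level, so the reader sees the reference itself.
shown_in_browser = html.unescape(double_escaped)   # "&#7905;"

print(code_point, double_escaped, shown_in_browser)
```

    So the data in the database is fine; the damage happens when something
    treats the stored reference as plain text and escapes it a second time.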

    For HTML you can use either entities or just type in the characters. But some
    editors translate such codepoints to entities behind the scenes. Personally I dislike

    Personally I am happy enough using (g)vim on Unix and Windows for my Unicode
    needs, but you could also try out BabelPad by our very own Andrew West for a
    good Unicode-aware editor. Alternatively there are a lot of other editors
    that should be OK. Notepad2 also supports Unicode editing and has syntax
    highlighting for various file formats (if you're on Windows).

    Jeroen Ruigrok van der Werven <asmodai(-at-)> / asmodai
    イェルーン ラウフロック ヴァン デル ウェルヴェン |
    When you have eliminated the impossible, whatever remains, however
    improbable, must be the truth...
