Re: FW: Subj: Amount of Space Unicode Takes

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Mon Jul 16 2007 - 13:37:11 CDT

    On Mon, 16 Jul 2007, Magda Danish (Unicode) wrote
    (quoting Daniel Johnson):

    > I have a question about how much space Unicode takes up. I am working
    > on a HTML project in multiple languages. Each of these web pages have to
    > be stored on a chip with limited space. Is there any way to "compact"
    > the HTML scripts in order to save space on the chip? Or is there a
    > different call number for a character which will take up less space in
    > hex?

    If you use UTF-8, which is almost always the right encoding for a
    Unicode-encoded HTML document, then all ASCII characters occupy one byte
    (octet) each, just as in ASCII encoding and in ISO 8859 encodings. This
    means in particular that HTML markup, as well as any embedded CSS or
    JavaScript code, takes the same number of bytes as it would in ASCII.
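
    As a quick check of this, the following Python sketch encodes a piece
    of markup both ways and compares the results. The markup string is a
    made-up sample, not taken from the project in question; any pure-ASCII
    HTML, CSS, or JavaScript should behave the same way.

```python
# Hypothetical markup sample (an assumption for illustration only).
markup = '<p class="note" style="color: #333">Hello</p>'

ascii_bytes = markup.encode("ascii")
utf8_bytes = markup.encode("utf-8")

# For ASCII-only text, both encodings yield one byte per character,
# and in fact the very same byte sequence.
print(len(markup), len(ascii_bytes), len(utf8_bytes))
print(ascii_bytes == utf8_bytes)
```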

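    For textual content, the situation is different and depends on the
    character repertoire used, which in turn depends on the language. In
    UTF-8, one Unicode character takes from one to four bytes: one byte for
    ASCII, two for characters up to U+07FF (which covers most accented Latin
    letters, Greek, Cyrillic, Hebrew, and Arabic), three for the rest of the
    Basic Multilingual Plane (including most CJK characters), and four for
    characters outside it. Thus, there is a potential loss of space
    efficiency as compared with other encodings, such as an 8-bit encoding
    designed for a particular language. Using UTF-8 for all pages is,
    however, a simple approach and saves some headaches.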
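
    To get a feeling for the numbers, one could measure a few characters
    and a short text, roughly as in the Python sketch below. The sample
    characters and the Russian phrase are my own choices for illustration,
    and ISO 8859-5 stands in for "some other encoding" only because it is
    an 8-bit encoding that covers Cyrillic.

```python
# Sample characters (chosen for illustration), written as escapes so that
# the code points are unambiguous.
samples = {
    "A": "Latin letter (ASCII)",
    "\u00e4": "Latin letter a with diaeresis",
    "\u0434": "Cyrillic letter de",
    "\u4e2d": "CJK ideograph",
    "\U0001D11E": "musical symbol G clef (outside the BMP)",
}

# Number of UTF-8 bytes needed for each sample character.
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, desc in samples.items():
    print(f"U+{ord(ch):04X} {desc}: {sizes[ch]} byte(s)")

# The same Russian text in UTF-8 and in an 8-bit Cyrillic encoding.
text = "Привет, мир"
utf8_size = len(text.encode("utf-8"))
legacy_size = len(text.encode("iso8859_5"))
print(len(text), "characters:", utf8_size, "bytes in UTF-8,",
      legacy_size, "bytes in ISO 8859-5")
```

    Measuring real page content in this way shows whether the overhead
    matters for a given language before any decision on encodings is made.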

    When space is critical, you might consider using some general
    compression method such as gzip. It is widely used for web documents,
    and it can be used for HTML documents as well as for other data. Web
    browsers can decode it automatically, provided that the compression is
    properly indicated in HTTP headers (Content-Encoding: gzip). Things
    might be more difficult if you plan to make the files usable directly
    and not via an HTTP server.
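
    A rough way to estimate the gain is to compress a page and compare the
    sizes, for example with Python's standard gzip module as sketched
    below. The page content here is invented and deliberately repetitive,
    so the ratio it gives is only indicative; real pages should be measured
    separately.

```python
import gzip

# Hypothetical page content (an assumption for illustration): a table
# with many similar rows containing non-ASCII text.
row = "<tr><td>Ääkköset</td><td>Привет</td></tr>\n"
html = "<html><body><table>\n" + row * 200 + "</table></body></html>\n"

raw = html.encode("utf-8")      # bytes as they would be stored uncompressed
packed = gzip.compress(raw)     # gzip-compressed bytes

print(len(raw), "bytes uncompressed,", len(packed), "bytes compressed")

# Decompression restores the original bytes exactly.
assert gzip.decompress(packed) == raw
```

    Whatever serves the compressed bytes still has to announce them with
    Content-Encoding: gzip, or the reading device has to decompress them
    itself.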

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

