Re: Characters

From: Charlie Ruland (ruland@luckymail.com)
Date: Sat Feb 12 2011 - 04:13:25 CST


    U+0020 SPACE is by no means ‘the most used character’ universally. For
    Chinese it is completely unnecessary, and not only when writing from top to
    bottom. The same is probably true for Japanese and for ‘early forms’ of
    influential West Eurasian languages such as Phoenician, Hebrew, Greek and
    Latin. Further examples from other parts of the world won’t be hard
    to find.
    Charlie

    * William_J_G Overington [2011-02-12 10:11]:
    > On Friday 11 February 2011, Doug Ewell <doug@ewellic.org> wrote:
    >
    >> That might be a sensible place to start, if the input text is constrained to the few dozen characters in the type case, and assumed to be in the same language for which the type case was designed.
    >
    > Thinking further about this, I remembered that in the days of handsetting metal type, one bought spacing material as well, separately from fonts, because, for any particular point size, one used the same spacing for most, almost all, fonts (one exception being for Palace Script, where one used special spaces that were part of the font).
    >
    > With Unicode, possibly the most used character is U+0020 SPACE.
    >
    > I have thought of the following compression format.
    >
    > ----
    >
    > Start with a UTF-16 file of characters and write to a compressed file, which starts off empty.
    >
    > Read in a character from the UTF-16 file.
    >
    > Is it a U+0020 SPACE character?
    >
    > If yes, then write one bit, a 0, to the compressed file.
    >
    > If no, then write 17 bits, namely a 1 followed by the sixteen bits of the character to the compressed file.
    >
    > When the end of the input file is reached, write zero, one or more bits, each 1, so that the output file has a whole number of bytes.
    >
    > ----
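The steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the poster's own implementation: the function names `compress_spaces` and `decompress_spaces` are hypothetical, the input is taken as big-endian UTF-16 code units, and the decoder assumes that a trailing 1 followed by fewer than sixteen bits is padding.

```python
SPACE = 0x0020  # U+0020 SPACE


def compress_spaces(text: str) -> bytes:
    """SPACE becomes the single bit 0; any other UTF-16 code unit becomes 1 + 16 bits."""
    raw = text.encode("utf-16-be")
    pieces = []
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], "big")
        pieces.append("0" if unit == SPACE else "1" + format(unit, "016b"))
    bits = "".join(pieces)
    # Pad with 1 bits so that the output is a whole number of bytes.
    bits += "1" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def decompress_spaces(data: bytes) -> str:
    """Invert compress_spaces (hypothetical decoder, not described in the post)."""
    bits = "".join(format(b, "08b") for b in data)
    units = bytearray()
    pos = 0
    while pos < len(bits):
        if bits[pos] == "0":
            units += SPACE.to_bytes(2, "big")
            pos += 1
        elif pos + 17 <= len(bits):
            units += int(bits[pos + 1:pos + 17], 2).to_bytes(2, "big")
            pos += 17
        else:
            break  # fewer than 17 bits left: this is the 1-bit padding
    return units.decode("utf-16-be")
```

Padding with 1 bits is unambiguous here because at most seven padding bits are ever written, which is too few to form a complete 17-bit code.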
    >
    > For example, consider compressing the following from a UTF-16 file.
    >
    > The sky is blue.
    >
    > There are sixteen characters in the sentence, a total of thirty-two bytes.
    >
    > In the compression, three of the characters, namely the spaces, each become one bit instead of sixteen, thus a saving of 45 bits. The other thirteen characters each become seventeen bits instead of sixteen, a cost of 13 bits. The net saving is 32 bits. The compressed data is 224 bits, exactly 28 bytes, so, as it happens, zero extra bits are needed at the end of the compressed file.
    >
    > The compression has saved four bytes out of thirty-two, a saving of 12.5%.
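The arithmetic in this example can be checked with a few lines of Python. This is just a restatement of the figures in the post; the variable names are illustrative.

```python
sentence = "The sky is blue."

n_chars = len(sentence)                  # characters in the sentence
n_spaces = sentence.count(" ")           # characters that shrink to one bit
original_bits = 16 * n_chars             # UTF-16: sixteen bits per character
compressed_bits = 1 * n_spaces + 17 * (n_chars - n_spaces)
padding_bits = -compressed_bits % 8      # 1 bits needed to reach a byte boundary
saved_bytes = (original_bits - compressed_bits - padding_bits) // 8
saving_percent = 100 * saved_bytes / (original_bits // 8)

print(n_chars, n_spaces, original_bits, compressed_bits, padding_bits)
print(saved_bytes, saving_percent)
```

Since each space saves 15 bits and every other character costs one extra bit, the format only breaks even (ignoring padding) when more than one character in sixteen is a space.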
    >
    > The interesting thing about this compression format is that it would work with at least some scripts other than Latin.
    >
    > I thought of the above compression format as a sort of mathematical toy, to assist my learning about compression.
    >
    > I then began to think of extending the system and here is where I think that I may be getting into what the original poster is trying to achieve.
    >
    > Suppose that I try to devise a second compression format, compressing two characters.
    >
    > ----
    >
    > Start with a UTF-16 file of characters and write to a compressed file, which starts off empty.
    >
    > Read in a character from the UTF-16 file.
    >
    > Is it a U+0020 SPACE character?
    >
    > If yes, then write two bits, 00, to the compressed file.
    >
    > If no, then is it a U+0065 LATIN SMALL LETTER E character?
    >
    > If yes, then write two bits, 01, to the compressed file.
    >
    > If no, then write 17 bits, namely a 1 followed by the sixteen bits of the character to the compressed file.
    >
    > When the end of the input file is reached, write zero, one or more bits, each 1, so that the output file has a whole number of bytes.
    >
    > ----
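A sketch of this second format, again in Python and again only as an illustration. The names `CODES`, `compress_two` and `decompress_two` are hypothetical, and big-endian UTF-16 code units are assumed.

```python
# Two-bit codes for the two chosen characters; everything else is 1 + 16 bits.
CODES = {0x0020: "00", 0x0065: "01"}  # SPACE, LATIN SMALL LETTER E


def utf16_units(text: str) -> list:
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def compress_two(text: str) -> bytes:
    bits = "".join(CODES.get(u) or "1" + format(u, "016b")
                   for u in utf16_units(text))
    bits += "1" * (-len(bits) % 8)  # pad with 1 bits to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def decompress_two(data: bytes) -> str:
    bits = "".join(format(b, "08b") for b in data)
    units, pos = [], 0
    while pos < len(bits):
        if bits[pos] == "0":
            # A leading 0 always starts a complete two-bit code.
            units.append(0x0020 if bits[pos + 1] == "0" else 0x0065)
            pos += 2
        elif pos + 17 <= len(bits):
            units.append(int(bits[pos + 1:pos + 17], 2))
            pos += 17
        else:
            break  # trailing 1-bit padding
    return b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")
```

For the sample sentence "The sky is blue." this gives 3 spaces and 2 letters e at two bits each plus 11 other characters at seventeen bits, 197 bits in all, which pads to 25 bytes compared with 28 bytes for the first format.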
    >
    > Now this second format might well produce better compression of text using Latin script than would the first format. Yet it would not produce better compression of text using, say, Cyrillic script: there would be a bigger overhead for encoding U+0020 SPACE characters (two bits rather than one), yet no gain from the possibility of encoding U+0065 LATIN SMALL LETTER E characters using two bits, as Cyrillic text does not contain that character.
    >
    > This brings me back to the question about the frequency of use of Unicode characters on the internet. That information might be of interest to know, yet is it needed for the compression project?
    >
    > Perhaps the answer is to prescan the text that one wishes to compress and have a compression format that first lists its encoding format.
    >
    > For example, if the UTF-16 file being used for input were first scanned and it were found that U+0020 SPACE and U+0065 LATIN SMALL LETTER E were the most common characters, then the compression format would start by stating the two sixteen-bit characters U+0020 and U+0065 and then encode U+0020 as 00 and U+0065 as 01 and every other character as 1 followed by sixteen bits.
    >
    > However, if the UTF-16 file being used for input were first scanned and it were found that U+0020 SPACE and one of the Cyrillic characters were the most common characters, then the compression format would start by stating the two sixteen-bit characters U+0020 and the particular Cyrillic character and then encode U+0020 as 00 and the particular Cyrillic character as 01 and every other character as 1 followed by sixteen bits.
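The prescan idea might look roughly like this in Python. It is a sketch under assumptions that the post does not spell out: the header is simply the two chosen code units as four bytes, the most frequent unit gets 00 and the runner-up gets 01, and U+0020 and U+0065 are used as fillers when the text has fewer than two distinct code units. All function names are hypothetical.

```python
from collections import Counter


def to_units(text: str) -> list:
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def compress_adaptive(text: str) -> bytes:
    units = to_units(text)
    top = [u for u, _ in Counter(units).most_common(2)]
    # Assumption: fill unused slots so the header always names two distinct units.
    for filler in (0x0020, 0x0065):
        if len(top) < 2 and filler not in top:
            top.append(filler)
    codes = {top[0]: "00", top[1]: "01"}
    bits = "".join(codes.get(u) or "1" + format(u, "016b") for u in units)
    bits += "1" * (-len(bits) % 8)  # pad with 1 bits to a byte boundary
    header = b"".join(u.to_bytes(2, "big") for u in top)
    return header + bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def decompress_adaptive(data: bytes) -> str:
    first = int.from_bytes(data[0:2], "big")   # unit encoded as 00
    second = int.from_bytes(data[2:4], "big")  # unit encoded as 01
    bits = "".join(format(b, "08b") for b in data[4:])
    units, pos = [], 0
    while pos < len(bits):
        if bits[pos] == "0":
            units.append(first if bits[pos + 1] == "0" else second)
            pos += 2
        elif pos + 17 <= len(bits):
            units.append(int(bits[pos + 1:pos + 17], 2))
            pos += 17
        else:
            break  # trailing 1-bit padding
    return b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")
```

Scanning the input for symbol frequencies and shipping the resulting code table ahead of the data is also how static Huffman coding works, so that is probably the established technique closest to this idea.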
    >
    > I am wondering if the idea that is behind some recent threads is to have a compression system that is like an extended version of the second format above, using codes of 00, 0100, 0101, 0110, 011100, 011101, 011110, 01111100, 01111101 and so on. Or maybe some other list such as 00, 010000, 010001, 010010, 010011, 010100 and so on? Or maybe some other list?
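Whatever list is chosen, no code may be the beginning of another code, otherwise the decoder cannot tell where one character ends. The Python sketch below checks that property for one regular reading of the first list (an assumption: after 00, the pair 01 opens a longer code in which 00, 01 and 10 end the code and 11 escapes one level deeper) and for the second list. The function names are hypothetical.

```python
def is_prefix_free(codes) -> bool:
    """True if no code word is a proper prefix of another (codes assumed distinct)."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


def escape_codes(count: int) -> list:
    """00, then 01 + pairs: 00/01/10 end a code, 11 escapes one level deeper."""
    codes, prefix = ["00"], "01"
    while len(codes) < count:
        for tail in ("00", "01", "10"):
            if len(codes) < count:
                codes.append(prefix + tail)
        prefix += "11"
    return codes


first_list = escape_codes(9)
second_list = ["00"] + ["01" + format(i, "04b") for i in range(5)]
literal = "1" + "0" * 16  # stands for any "1 followed by sixteen bits" code

print(first_list)
print(is_prefix_free(first_list + [literal]))
print(is_prefix_free(second_list + [literal]))
```

Both lists pass the check together with the 17-bit literal code, because every short code starts with 0 and every literal starts with 1.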
    >
    > If so, would it be better, rather than trying to choose which characters to use for the compression, to have the compression software first scan the file that is to be compressed and compute which are the most frequently used characters in the file and then state those characters in a header to the compressed file so that the software that undoes the compression first reads in those characters and uses them to form a code table to undo the compression?
    >
    > I emphasise that I am only just starting to learn about file compression, so to a large extent this post is intended as a catalyst for a discussion in this thread. If what I am suggesting is, in fact, a well-known existing technique, could someone who knows about these things possibly post something about it please?
    >
    > William Overington
    >
    > 12 February 2011

    -- 
    ERROR COMMVNIS FACIT IVS
    

