Re: Problems encoding the spanish o

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Wed Nov 19 2003 - 17:50:53 EST

    One thing that may help you think about this kind of issue is my "under
    construction" paper, "Frank Tang's List of Common Bugs that Break Text
    Integrity": http://people.netscape.com/ftang/paper/textintegrity.html
    I am going to present a newer revision at the coming IUC25 if they accept
    my proposal.

    It looks like the 4 bytes "ón M" got changed into the two code units
    U+DB7A and U+DC0D, which form a surrogate pair in UTF-16.

    Here is what I think happened.
    1. The text "...ización Map.." is output by process A, encoded in
    ISO-8859-1, and passed to process B. So the 4 bytes "ón M" are encoded
    as 0xf3, 0x6e, 0x20, 0x4d.
    2. Somehow process B thinks the incoming data is in UTF-8 instead of
    ISO-8859-1. You can find some possible causes in my paper (URL above).
    3. Process B tries to convert the data stream to UTF-16 using a "UTF-8 to
    UTF-16" conversion rule. However, the UTF-8 scanner in the converter is
    not well written. It implements the conversion in the following way:
    3.a. It hits the byte 0xf3, consults a lookup table, and notes that in a
    legal UTF-8 sequence 0xf3 is the first byte of a 4-byte sequence.
    3.b. It decodes that 4-byte UTF-8 sequence without checking the values of
    the next 3 bytes 0x6e, 0x20, 0x4d. It blindly assumes these bytes are the
    2nd, 3rd and 4th bytes of the sequence. Of course, it first needs to get
    the UCS4 value; what it does is

    m1 = byte1 & 0x07
    m2 = byte2 & 0x3F
    m3 = byte3 & 0x3F
    m4 = byte4 & 0x3F

    In your case, what it gets is
    m1 = 0xf3 & 0x07 = 0x03
    m2 = 0x6e & 0x3F = 0x2e
    m3 = 0x20 & 0x3f = 0x20
    m4 = 0x4d & 0x3f = 0x0d

    [Notice that the problem is that such an algorithm does not make sure
    byte2, byte3 and byte4 are in the range 0x80 - 0xBF at all. One
    possibility is that the code does not check. The other possibility is
    that the code does check the values but is messed up by comparing a
    (char) value against an (unsigned char) range using < and >.

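    What I mean is the following:
    #include <stdio.h>

    int main(void)
    {
             /* Where plain char is signed, (char)0x80 is -128,
                so the test below is true even for 0x23. */
             char a = 0x23;
             printf("a is %x ", a);
             if (a > (char)0x80)
                     printf("and a is greater than 0x80\n");
             else
                     printf("and a is less than or equal to 0x80\n");
             return 0;
    }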
    sh% ./b
    a is 23 and a is greater than 0x80
    ]
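
    As a rough illustration of the range check that seems to be missing,
    here is a minimal sketch. The function names is_continuation and
    looks_like_4byte_sequence are made up for this example and are not from
    any real converter. The point is only that the bytes are examined as
    unsigned char, so the result does not depend on whether plain char is
    signed.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (0x80..0xBF).
 * Taking the byte as unsigned char keeps the test independent
 * of the signedness of plain char. */
bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: true if s[0..3] has the shape of a 4-byte
 * UTF-8 sequence (lead byte 11110xxx plus three continuation bytes).
 * It does not reject overlong forms or values above U+10FFFF. */
bool looks_like_4byte_sequence(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;

    if (len < 4 || (p[0] & 0xF8) != 0xF0)
        return false;
    return is_continuation(p[1]) &&
           is_continuation(p[2]) &&
           is_continuation(p[3]);
}
```

    With this check, the bytes 0xf3, 0x6e, 0x20, 0x4d are rejected because
    0x6e is not a continuation byte.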

    Then it calculates the ucs4 value using

    ucs4 = (m1 << 18) | (m2 << 12) | (m3 << 6) | (m4 << 0);

    In your case, what it gets is
    ucs4 = (0x03 << 18) | (0x2e << 12) | (0x20 << 6) | (0x0d << 0) =
             0xc0000 | 0x2e000 | 0x800 | 0x0d = U+EE80D
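
    To make the arithmetic easy to replay, here is a sketch of what such a
    careless decoder might look like. This is my guess at its logic, not the
    actual code of process B, and the name naive_decode4 is invented for the
    example.

```c
#include <stdint.h>

/* Hypothetical reconstruction of the buggy step: mask and shift four
 * bytes as if they were a valid 4-byte UTF-8 sequence, without checking
 * that bytes 2 to 4 are in the range 0x80..0xBF. */
uint32_t naive_decode4(const unsigned char b[4])
{
    uint32_t m1 = b[0] & 0x07;
    uint32_t m2 = b[1] & 0x3F;
    uint32_t m3 = b[2] & 0x3F;
    uint32_t m4 = b[3] & 0x3F;

    return (m1 << 18) | (m2 << 12) | (m3 << 6) | m4;
}
```

    Feeding it the bytes 0xf3, 0x6e, 0x20, 0x4d gives 0xEE80D, the same
    value as in the hand calculation.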

    3.c. Now it turns that ucs4 value into UTF-16:
    surrogate high = ((ucs4 - 0x10000) >> 10) | 0xd800
      = ((0xee80d - 0x10000) >> 10) | 0xd800
      = (0xde80d >> 10) | 0xd800
      = 0x037a | 0xd800
      = 0xdb7a
    surrogate low = ((ucs4 - 0x10000) & 0x03FF) | 0xdc00
      = ((0xee80d - 0x10000) & 0x03FF) | 0xdc00
      = (0xde80d & 0x3FF) | 0xdc00
      = 0x0d | 0xdc00
      = 0xdc0d
    So you now have the UTF-16 sequence DB7A DC0D.
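
    The same surrogate calculation can be written as a small function. This
    is only a sketch; the name to_surrogates is made up, and it assumes the
    caller passes a value in the range 0x10000..0x10FFFF.

```c
#include <stdint.h>

/* Split a supplementary code point into a UTF-16 surrogate pair.
 * Assumption: 0x10000 <= ucs4 <= 0x10FFFF (not checked here). */
void to_surrogates(uint32_t ucs4, uint16_t *high, uint16_t *low)
{
    uint32_t v = ucs4 - 0x10000;

    *high = (uint16_t)(0xD800 | (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 | (v & 0x3FF));  /* low 10 bits */
}
```

    For 0xEE80D it yields 0xDB7A and 0xDC0D, matching the values above.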

    4. Now process B (or some other code) tries to convert the UTF-16 into
    HTML NCRs (numeric character references). Unfortunately, that process
    does not handle the UTF-16 to NCR conversion correctly. Instead of doing
    it the right way, as below:
    4.a. take DB7A DC0D and convert it to the UCS4 value 0xEE80D
    4.b. convert EE80D to decimal 976909 and generate "&#976909;"

    it converts DB7A to decimal 56186 and generates "&#56186;", and then
    converts DC0D to decimal 56333 and generates "&#56333;".
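
    A UTF-16 to NCR step that combines the surrogate pair first could look
    roughly like this. It is a sketch under my own assumptions: the function
    name utf16_to_ncr and its return convention (number of code units
    consumed, 0 on error) are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write one decimal NCR for the code point starting at u[0] into out.
 * Returns the number of UTF-16 code units consumed (1 or 2), or 0 if
 * the input is empty or contains an unpaired surrogate. */
size_t utf16_to_ncr(const uint16_t *u, size_t len, char *out, size_t outsize)
{
    uint32_t cp;
    size_t used = 1;

    if (len == 0)
        return 0;
    cp = u[0];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
        /* High surrogate: it must be followed by a low surrogate. */
        if (len < 2 || u[1] < 0xDC00 || u[1] > 0xDFFF)
            return 0;
        cp = 0x10000 + ((cp - 0xD800) << 10) + (u[1] - 0xDC00);
        used = 2;
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
        return 0;  /* lone low surrogate */
    }
    snprintf(out, outsize, "&#%lu;", (unsigned long)cp);
    return used;
}
```

    Given DB7A DC0D it consumes both code units and writes "&#976909;"
    rather than two separate references.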

    So, in summary, there are 3 problems in your system, not just 1.
    Problem 1: Process A generates ISO-8859-1 data while process B is
    expecting UTF-8. You should either fix process A so that it generates
    UTF-8 or fix process B to treat the input as ISO-8859-1. The preferred
    approach is the former.
    Problem 2: The UTF-8 converter in process B does not strictly implement
    the requirement in RFC 3629, which says a decoder MUST protect against
    decoding invalid sequences. With a converter of this quality, a
    non-ASCII byte at the end of a line will probably cause your software to
    fold lines together, and one at the end of a record may even crash it,
    since the scanner consumes the following bytes without checking them.
    You need to fix the scanning part of the converter.
    Problem 3: The UTF-16 to NCR conversion is incorrect according to the
    HTML specification.
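
    For Problem 1, converting ISO-8859-1 to UTF-8 is simple because every
    ISO-8859-1 byte value equals its Unicode code point. Here is a minimal
    sketch; the name latin1_to_utf8 is made up, and it assumes the caller
    supplies an output buffer of at least 2 * len bytes.

```c
#include <stddef.h>

/* Convert len ISO-8859-1 bytes to UTF-8.
 * Assumption: out has room for at least 2 * len bytes.
 * Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];  /* ASCII passes through unchanged */
        } else {
            /* U+0080..U+00FF become a 2-byte sequence 110xxxxx 10xxxxxx */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}
```

    The bytes 0xf3, 0x6e, 0x20, 0x4d come out as 0xc3, 0xb3, 0x6e, 0x20,
    0x4d, which a strict UTF-8 decoder reads back as "ón M".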

    Hope the above analysis helps.

    pepe pepe wrote:

    > Hello:
    >
    > We have the following sequence of characters "...ización Map.." that is
    > the same than "...izaci&#243;n Map..." that after suffering some
    > transformations becomes to "...izaci&#56186;&56333;ap...."
    > AS you can see the two characters 56186 and 56333 seem to represent this
    > sequences "ón M". Any idea?.
    >
    > Regards,
    > Mario.
    >

    -- 
    --
    Frank Yung-Fong Tang
    Šýštém Årçhîtéçt, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
    AIM:yungfongta   mailto:ytang0648@aol.com Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan
    John 3:16 "For God so loved the world that he gave his one and only Son,
    that whoever believes in him shall not perish but have eternal life.
    Does your software display Thai language text correctly for Thailand users?
    -> Basic Conceptof Thai Language linked from Frank Tang's
    Iñtërnâtiônàlizætiøn Secrets
    Want to translate your English text to something Thailand users can
    understand ?
    -> Try English-to-Thai machine translation at
    http://c3po.links.nectec.or.th/parsit/
    

