Re: UTF-8 can be used for more than it is given credit

From: Theodore H. Smith (delete@elfdata.com)
Date: Wed Jun 14 2006 - 16:26:36 CDT

    On 14 Jun 2006, at 18:39, Richard Wordingham wrote:

    >
    >> The re-ordering pass, is not using my "multiple-replaceall"
    >> algorithm. It does use the canonical combining classes. A multi-
    >> pass approach, while possible... I wouldn't do, it would take too
    >> long.
    >
    > And this was the basis of the claim that you couldn't just treat
    > characters as 'bags of bytes'!

    But I do treat them as a "bag of bytes" :)

    Even my combining character reorder does that.

    I'll copy/paste part of my code below. (There are two other necessary
    functions, which I won't bother you with.)

    Take this line here:

    StartComb = dict.SearchObj( Data, nil, PrevCombEnd, ElfData.kEnd, FoundLen, FoundObj )

    It searches some UTF-8 data for the first occurrence of any "key"
    in the dictionary. The byte length of the found item (a character in
    this case) is stored in FoundLen.

    The "value" for that key in the dictionary is returned in FoundObj.
    This "object" is actually just a 1-byte string, and that byte is the
    character's canonical combining class!! Hence the line
    "CurrByte = ElfData( FoundObj ).ByteVal".
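
    As a rough illustration in Python (not the ElfData plugin itself), a
    dictionary search of this kind might look like the sketch below. The
    names build_ccc_table and search_first_key are hypothetical, offsets
    are 0-based rather than 1-based, and the naive scan is an assumption
    made for brevity, not a description of how SearchObj works
    internally.

```python
import unicodedata

def build_ccc_table(chars):
    """Map each character's UTF-8 bytes to its canonical combining class."""
    return {c.encode("utf-8"): unicodedata.combining(c) for c in chars}

def search_first_key(table, data, start=0):
    """Return (pos, length, value) for the earliest key found in `data`
    at or after byte offset `start`, or None if no key occurs."""
    max_len = max(len(k) for k in table)
    for pos in range(start, len(data)):
        for length in range(1, max_len + 1):
            key = data[pos:pos + length]
            if key in table:
                # The key's own length tells us where the character ends.
                return pos, len(key), table[key]
    return None

# U+0301 COMBINING ACUTE ACCENT (class 230), U+0323 COMBINING DOT BELOW (class 220)
table = build_ccc_table("\u0301\u0323")
data = "a\u0323\u0301".encode("utf-8")
print(search_first_key(table, data))  # (1, 2, 220)
```

    The search never decodes a code point: it only compares byte
    strings, and the length of the matched key plays the role of
    FoundLen.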

    Anyhow, all this code is byte-oriented, yet it cannot corrupt any
    characters or miss them or cause false hits.

    Well, in theory, it *could*, if the bytes of one UTF-8 character
    could occur inside the bytes of another. But UTF-8 rules that out:
    ASCII values never occur within multi-byte characters, and lead
    bytes and continuation bytes use disjoint ranges, so no character's
    encoding occurs within any other's. For well-formed UTF-8, this
    false matching never occurs.
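
    The property being relied on here can be checked with a small
    Python sketch. The helper names is_continuation and
    all_hits_aligned are hypothetical, and the check assumes both
    strings encode to well-formed UTF-8: every byte-level hit for a
    whole character should start on a non-continuation byte and end
    just before one.

```python
def is_continuation(b):
    """True for UTF-8 continuation bytes (binary 10xxxxxx, 0x80..0xBF)."""
    return (b & 0xC0) == 0x80

def all_hits_aligned(needle, haystack):
    """Search for `needle` in `haystack` purely at the byte level and
    report whether every hit falls on character boundaries."""
    n = needle.encode("utf-8")
    h = haystack.encode("utf-8")
    if not n:
        raise ValueError("needle must not be empty")
    pos = h.find(n)
    while pos != -1:
        end = pos + len(n)
        starts_mid_char = is_continuation(h[pos])
        ends_mid_char = end < len(h) and is_continuation(h[end])
        if starts_mid_char or ends_mid_char:
            return False
        pos = h.find(n, pos + 1)
    return True

print(all_hits_aligned("\u00e9", "caf\u00e9 r\u00e9sum\u00e9 \u20ac"))
# No byte of a multi-byte character falls in the ASCII range.
print(all(b >= 0x80 for b in "\u20ac".encode("utf-8")))
```

    Both checks should print True for well-formed input; malformed or
    truncated UTF-8 is outside what this sketch covers.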

    The whole thing is safe, despite treating everything as UTF-8 and not
    doing any code point detection or character boundary checking!

    The character boundaries are implied by the keys in the dictionary
    themselves. If a key, which is a UTF-8 character, is found, and that
    key is 3 bytes long, then the character boundary is after the 3rd
    byte... It's all just treated as strings of bytes.

    Private Function ReorderSub(Dict as ElfDataDictionary, Data as ElfData, Start as integer, fs as FastString) As integer
        dim CurrByte, LastByte, StartComb, PrevCombEnd, FoundLen as Integer
        dim Chars() as ElfData
        dim Scores() as Integer
        dim UnOrdered as Boolean
        dim FoundObj as object

        PrevCombEnd = Start

        do
           StartComb = dict.SearchObj( Data, nil, PrevCombEnd, ElfData.kEnd, FoundLen, FoundObj )
           if StartComb = 0 or StartComb <> PrevCombEnd then
              exit
           end if

           CurrByte = ElfData( FoundObj ).ByteVal
           if CurrByte < LastByte then
              UnOrdered = true
           end if
           LastByte = CurrByte

           chars.Append data.mid( StartComb, FoundLen )
           Scores.Append CurrByte

           PrevCombEnd = StartComb + FoundLen
        loop

        if UnOrdered = false then
           Return PrevCombEnd
        end if

        ReorderArrays Chars, Scores

        fs.AppendSectElfData Data, fs.Length + 1, Start - ( fs.length + 1 )
        for CurrByte = 0 to UBound( Chars )
           fs.AppendElfData Chars( CurrByte )
        next

        Return fs.Length + 1

    End Function
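
    For readers without REALbasic, the same idea can be sketched in
    Python. This is a loose analogue, not a port: build_table, match_at
    and reorder_run are hypothetical names, offsets are 0-based, the
    function returns a new byte string instead of appending to a
    FastString, and it assumes a stable sort is wanted (the behaviour
    of ReorderArrays is not shown above).

```python
import unicodedata

def build_table(chars):
    """Map UTF-8 bytes of each combining mark to its combining class."""
    return {c.encode("utf-8"): unicodedata.combining(c)
            for c in chars if unicodedata.combining(c) > 0}

def match_at(table, data, pos, max_len):
    """Return (key, ccc) if some key starts exactly at `pos`, else None."""
    for length in range(1, max_len + 1):
        key = data[pos:pos + length]
        if key in table:
            return key, table[key]
    return None

def reorder_run(table, data, start):
    """Sort the run of combining marks beginning at byte offset `start`
    by combining class. Returns (new_data, end_of_run)."""
    max_len = max(len(k) for k in table)
    marks, pos, last, unordered = [], start, 0, False
    while True:
        hit = match_at(table, data, pos, max_len)
        if hit is None:
            break                      # run of combining marks has ended
        key, ccc = hit
        if ccc < last:
            unordered = True           # at least one pair is out of order
        last = ccc
        marks.append((ccc, key))
        pos += len(key)                # key length gives the boundary
    if not unordered:
        return data, pos               # nothing to do, data untouched
    marks.sort(key=lambda m: m[0])     # list.sort is stable
    reordered = b"".join(key for _, key in marks)
    return data[:start] + reordered + data[pos:], pos

table = build_table("\u0301\u0323")
new_data, end = reorder_run(table, "a\u0301\u0323b".encode("utf-8"), 1)
```

    Here too, everything is slicing and comparing byte strings; the
    only place code points appear is when the table is built.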

    --
    http://elfdata.com/plugin/
    


    This archive was generated by hypermail 2.1.5 : Wed Jun 14 2006 - 16:33:32 CDT