Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

From: Peter Edberg <pedberg_at_apple.com>
Date: Fri, 19 Jul 2013 13:48:19 -0700

On Jul 19, 2013, at 12:42 PM, Mark Davis ☕ <mark_at_macchiato.com> wrote:

> Popping up a level.
>
> ICU (and some other libraries) have heuristic encoding detection that takes a sequence of bytes and comes up with a likely encoding ID.

However, ICU's encoding detection typically requires more than 4 bytes (usually at least 10 characters' worth of bytes) in order to make a reasonable guess.
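
To illustrate why four bytes are too few, here is a rough Python sketch. It is not ICU's detector and does no statistical analysis; the helper name `plausible_encodings` and the candidate list are assumptions made for this example. It only reports which candidate encodings decode the bytes without error, and for the four bytes discussed in this thread every candidate passes, so validity alone cannot pick a winner.

```python
# Hypothetical helper, not part of ICU or any library: list the candidate
# encodings under which the bytes decode without raising an error.
CANDIDATES = ("utf-8", "utf-16-be", "latin-1")


def plausible_encodings(data: bytes, candidates=CANDIDATES) -> list:
    """Return the candidates that decode `data` without an error."""
    plausible = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # structurally invalid in this encoding, rule it out
        plausible.append(name)
    return plausible


# The four bytes from the thread are valid in all three candidates.
ambiguous = plausible_encodings(bytes.fromhex("C3 83 C2 B1"))

# A lone F1 byte is only valid as Latin-1: it is a truncated lead byte in
# UTF-8 and half a code unit in UTF-16.
unambiguous = plausible_encodings(b"\xf1")
```

A real detector has to go beyond this kind of validity check and weigh how likely the decoded text is, which is why it needs a longer sample.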

- Peter E

>
>
> Mark
>
> — Il meglio è l’inimico del bene —
>
>
> On Fri, Jul 19, 2013 at 8:40 PM, Whistler, Ken <ken.whistler_at_sap.com> wrote:
>
>
> > Suppose that these hex bytes:
> >
> > C3 83 C2 B1
> >
> > show up in a message and the message contains no hint what its encoding is.
> >
> > Perhaps it is 8859-1, in which case the message consists of four 1-byte
> > characters:
> >
> > C3 = Ã
> > 83 = the “no break here” character
> > C2 = Â
> > B1 = ±
> >
> > Perhaps it is UTF-8, in which case the message consists of two 2-byte
> > characters:
> >
> > C383 = 쎃
> > C2B1 = 슱
>
> Actually, that would be interpreting it as UTF-16, not as UTF-8. That
> can probably be quickly ruled out if the rest of the text is not obviously
> in UTF-16.
>
> Interpreted as UTF-8, it would be:
>
> C3 83 --> U+00C3 = Ã
> C2 B1 --> U+00B1 = ±
>
> More likely than the other two alternatives you cite.
>
> Of course, you also have to consider serial corruptions as a possibility.
>
> It could have started out as UTF-8 C3 B1 --> U+00F1 = ñ.
>
> Then the <C3 B1> got misinterpreted as Latin-1 (giving Ã ±), and that
> misread text was then re-encoded as UTF-8, producing <C3 83 C2 B1>.
>
> --Ken
>
>
>
>
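
The interpretations discussed in the quoted thread can be checked with a short Python sketch. It uses only the standard codecs; the function name `undo_double_encoding` is made up for this example, and the repair step assumes the corruption really was "UTF-8 read as Latin-1, then re-encoded as UTF-8", which the bytes alone cannot prove.

```python
data = bytes.fromhex("C3 83 C2 B1")

# The three readings from the thread.
as_latin1 = data.decode("latin-1")   # four characters: Ã, U+0083, Â, ±
as_utf16 = data.decode("utf-16-be")  # two Hangul syllables: U+C383, U+C2B1
as_utf8 = data.decode("utf-8")       # two characters: Ã (U+00C3), ± (U+00B1)


def undo_double_encoding(raw: bytes) -> str:
    """Reverse one round of 'UTF-8 misread as Latin-1, re-encoded as UTF-8'.

    Assumption: the input really went through that corruption. Raises
    UnicodeError if the bytes do not fit that pattern.
    """
    misread_text = raw.decode("utf-8")              # the text as it looks now
    original_bytes = misread_text.encode("latin-1")  # back to the first bytes
    return original_bytes.decode("utf-8")            # decode them correctly


# Reproduce the suspected corruption, starting from ñ (U+00F1).
corrupted = "\u00f1".encode("utf-8").decode("latin-1").encode("utf-8")
repaired = undo_double_encoding(data)
```

Here `corrupted` comes out equal to the original four bytes and `repaired` is ñ, which is consistent with the serial-corruption explanation but does not rule out the other readings.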
Received on Fri Jul 19 2013 - 15:51:02 CDT
