Re: A few questions about encoding discovery, copying text, and pasting text in one encoding into text in another encoding

From: Asmus Freytag
Date: Wed, 19 Dec 2012 10:28:53 -0800

First, what Markus said. That's the high-level picture.

Some more details:

On 12/19/2012 7:59 AM, Costello, Roger L. wrote:
> 2. I have a text editor open and it contains text that is encoded using encoding A. I select some of the text and copy it to the clipboard. What is copied: (a) the characters (glyphs) that I visually see displayed on the screen, or (b) the hex values of each character displayed on the screen, or (c) the codepoints, or (d) something else (what else)?

Clipboards can support multiple formats for your data.

Text editors usually copy both a raw and a formatted stream of data. The
receiving application can pick which format to use (when you look at the
"Paste Special..." command in many applications, you can see what
formats are on the clipboard).

Some clipboards support limited conversion among raw text formats. For
example, on Windows, in an effort to help migration to Unicode, the
clipboard will accept text in Unicode and make it available as text in
encoding A (or vice versa), as long as A is the predefined legacy
encoding for that system.
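
To make the idea concrete, here is a toy model in Python of a clipboard
that holds several formats and synthesizes one raw text format from the
other. It is only a sketch: the Clipboard class, the format names
"unicode_text", "legacy_text" and "html", and the choice of cp1252 as
the legacy encoding are all assumptions made up for illustration, not
any real operating system API.

```python
# Toy model of a multi-format clipboard; hypothetical, not a real OS API.
LEGACY_ENCODING = "cp1252"  # assumption: the system's predefined legacy encoding


class Clipboard:
    def __init__(self):
        self._formats = {}  # format name -> payload placed by the source app

    def set(self, fmt, data):
        self._formats[fmt] = data

    def available_formats(self):
        names = set(self._formats)
        # If either raw text format is present, the other is offered too.
        if names & {"unicode_text", "legacy_text"}:
            names |= {"unicode_text", "legacy_text"}
        return sorted(names)

    def get(self, fmt):
        if fmt in self._formats:
            return self._formats[fmt]
        if fmt == "legacy_text" and "unicode_text" in self._formats:
            # Characters missing from the legacy encoding become "?".
            text = self._formats["unicode_text"]
            return text.encode(LEGACY_ENCODING, errors="replace")
        if fmt == "unicode_text" and "legacy_text" in self._formats:
            data = self._formats["legacy_text"]
            return data.decode(LEGACY_ENCODING, errors="replace")
        raise KeyError(fmt)


clip = Clipboard()
clip.set("unicode_text", "caf\u00e9 \u03b1")     # "café α"
clip.set("html", "<b>caf\u00e9 \u03b1</b>")      # formatted stream
legacy = clip.get("legacy_text")                 # synthesized on request
```

In this sketch the Greek alpha has no cp1252 equivalent, so the
synthesized legacy text carries a "?" in its place, which is the kind
of loss to expect when going from Unicode to a legacy encoding.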

Finally, to completely answer your question: raw text is present as a
stream of code points, in Unicode or in whatever other encoding is used.
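
As a small illustration (a sketch in Python; the sample string and the
three encodings are my own picks), the same two characters are one
sequence of code points but a different sequence of bytes in each
encoding. Note that bytes.hex() with a separator needs Python 3.8 or
later.

```python
text = "\u20ac5"  # "€5": EURO SIGN followed by DIGIT FIVE

# The abstract characters, independent of any encoding.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# What would actually sit in memory or on the clipboard per encoding.
as_bytes = {
    enc: text.encode(enc).hex(" ")
    for enc in ("cp1252", "utf-8", "utf-16-le")
}

print(code_points)  # ['U+20AC', 'U+0035']
for enc, hexdump in as_bytes.items():
    print(f"{enc:10} {hexdump}")
# cp1252     80 35
# utf-8      e2 82 ac 35
# utf-16-le  ac 20 35 00
```
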
> 3. Continuing question 2, when text is copied, is its encoding also copied? Is the encoding stored in the clipboard?

I'd say usually not, though I'm not familiar with all clipboards on all
systems. There is usually some indicator of format (such as HTML vs.
Plain Text), and in the example I gave, there's the Unicode vs. legacy
text format.
> 4. I have two text editors open. Text editor #1 contains text that is encoded using encoding A while text editor #2 contains text that is encoded using encoding B. Encoding A is different from encoding B. I copy text from #1 and paste it into #2. Does text editor #2 realize, "Oh my, the text being inserted uses a different encoding so I better convert each of its hex value into the equivalent hex value in my encoding." Is that the way it works?

Most encodings can't be converted into each other without loss. The
exceptions are few. Almost all encodings can be converted TO Unicode
(the reverse works only if the text contains no characters that are
undefined in the target encoding). Some other encodings may have one or
more "partner" encodings, which contain the same characters, but with
different layouts.
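
A short Python sketch of that asymmetry, using the standard codecs. The
convert() helper is a hypothetical name, and the encodings are my own
examples: ISO 8859-7 (Greek) to Latin-1 as a pair that cannot be
converted, and Latin-1 with the EBCDIC code page cp037 as what I
believe is a "partner" pair.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert between two encodings by going through Unicode."""
    text = data.decode(source)    # legacy -> Unicode: fine for valid input
    return text.encode(target)    # Unicode -> legacy: raises UnicodeEncodeError
                                  # if the target lacks any character


greek = "\u03b1\u03b2\u03b3".encode("iso8859_7")   # "αβγ" as ISO 8859-7 bytes

try:
    convert(greek, "iso8859_7", "latin-1")
    greek_to_latin1 = "ok"
except UnicodeEncodeError:
    greek_to_latin1 = "failed"    # Latin-1 has no Greek letters

# Same characters, different byte layout.
latin1 = "H\u00e9llo".encode("latin-1")            # "Héllo"
ebcdic = convert(latin1, "latin-1", "cp037")
round_trip = convert(ebcdic, "cp037", "latin-1")

print(greek_to_latin1)        # failed
print(round_trip == latin1)   # True
```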

The easiest way to avoid problems is for the editors to work in Unicode
(as Markus wrote) and then worry about encodings only when reading or
writing files for particular purposes (if for some reason Unicode files
are not acceptable or available).
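
In Python that pattern might look like the sketch below: decode once
when reading, keep everything as Unicode strings internally, and pick
an encoding again only when writing. load_text() and save_text() are
hypothetical helper names, and the file names and cp1252 are just
sample choices.

```python
import os
import tempfile


def load_text(path, encoding):
    """Read a file in the given encoding and return Unicode text."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


def save_text(path, text, encoding):
    """Write Unicode text to a file in the given encoding."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


workdir = tempfile.mkdtemp()
legacy_path = os.path.join(workdir, "legacy.txt")
unicode_path = os.path.join(workdir, "unicode.txt")

# Simulate an existing legacy file: "naïve" stored as cp1252 bytes.
with open(legacy_path, "wb") as f:
    f.write(b"na\xefve")

document = load_text(legacy_path, "cp1252")   # encoding matters only here...
document += " \u2713"                         # editing happens in Unicode
save_text(unicode_path, document, "utf-8")    # ...and here
```

The check mark added during editing has no cp1252 equivalent, so saving
back to the legacy encoding would fail, while saving as UTF-8 does not.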

Most editors can distinguish between the clipboard formats for Unicode
and legacy text, and if the user says "paste the legacy text" but the
document is in Unicode, they will convert, because that is usually well
supported and possible. Beyond that, the choices aren't attractive,
because you are not guaranteed to succeed with a conversion, so most
people don't bother trying to write code for such scenarios.
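
Seen from the receiving editor, the well-supported case could be
sketched like this in Python. The paste_as_unicode() function, the
format names, and the cp1252 default are assumptions for illustration
only; a real editor would use its platform's clipboard API.

```python
def paste_as_unicode(clipboard_formats, legacy_encoding="cp1252"):
    """Pick the best raw text format and return it as Unicode text.

    clipboard_formats is a plain dict of format name -> payload,
    standing in for whatever the platform clipboard offers.
    """
    if "unicode_text" in clipboard_formats:
        return clipboard_formats["unicode_text"]        # no conversion needed
    if "legacy_text" in clipboard_formats:
        data = clipboard_formats["legacy_text"]
        # Legacy -> Unicode; undefined bytes become U+FFFD.
        return data.decode(legacy_encoding, errors="replace")
    return None                                         # nothing pasteable


print(paste_as_unicode({"legacy_text": b"caf\xe9"}))    # café
```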

To come back to what Markus wrote and state it in a different way: if
you have any choice (that is, are not forced to open legacy documents),
you should walk away from "encodings" other than Unicode as rapidly as
possible. They are definitely not something that your application
should work with natively. It's too messy.

(And why, do you think, was Unicode invented in the first place :) )

Received on Wed Dec 19 2012 - 12:31:26 CST
