Re: definition of plain text

From: Ken Whistler
Date: Fri, 14 Oct 2011 12:17:52 -0700

On 10/14/2011 11:47 AM, Joó Ádám wrote:
> Peter asked for what the Unicode Consortium considers plain text, i.e.
> what principles it applies when deciding whether to encode a certain
> element or aspect of writing as a character. In turn, you thoroughly
> explained that plain text is what the Unicode Consortium considered to
> be plain text and encoded as characters.

Correct. And basically, that is what it comes down to.

One cannot look at *rendered* text and somehow know, a priori,
exactly how that text should be represented in characters. (In the case
of most of what is still being considered for encoding, "rendered text"
means historic printed materials that were not digitally produced, because
there isn't any character encoding for them in the first place, and hence
no compatibility encoding issues.)

Sure, there are some general principles which apply:

1. We don't represent font size differences by means of encoded characters.

2. We don't represent text coloration differences by means of encoded
characters.

3. We don't represent pictures by means of encoded characters.

and so on. Add your favorites.

But character encoding as a process engaged in by character encoding
committees (in this case, the UTC and SC2/WG2) is an art form which
needs to balance: existing practice, if any; graphological analysis of
writing systems; complexity of implementation for proposed solutions
to encoding; architectural consistency across the entire standard;
linguistic politics in user communities; and even national body politics
involved in voting on amendments to the standard.

It is impossible to codify that process in a set of a priori, axiomatic
principles about what is and is not plain text, and then sit in committee
and run down some check list and determine, logically, what exactly
is and is not a character to be encoded. People can wish all they
want that it were that way, but it ain't.

So yeah, what the Unicode Consortium considers to be plain text is
what can be represented by a sequence of Unicode characters, once
those characters ended up standardized and published in the standard.
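
To make that operational definition concrete, here is a minimal sketch in
Python. It is an illustration only, not anything the committees use. The
function names are hypothetical, and the answer depends on which version of
the Unicode Character Database ships with the Python in use: a code point
counts as "encoded" here only if that version assigns it a character.

```python
import unicodedata


def unencoded_code_points(text):
    """List the code points in `text` that the bundled Unicode version
    leaves unassigned (General_Category Cn)."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cn"
    ]


def is_representable_plain_text(text):
    """True if every code point in `text` is an assigned character in the
    bundled Unicode version. This is an assumption-laden stand-in for
    'a sequence of standardized Unicode characters'."""
    return not unencoded_code_points(text)


if __name__ == "__main__":
    # The result is relative to this Unicode version, not to any a priori rule.
    print("Unicode version:", unicodedata.unidata_version)
    print(is_representable_plain_text("Joó Ádám"))
    # U+0378 is a reserved, unassigned code point in the Greek block.
    print(unencoded_code_points("a\u0378b"))
```

Note that the check says nothing about font size, color, or pictures. Those
never show up as code points at all, which is the point of the general
principles listed earlier.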

You can't start at the other end, define exactly what plain text is, and
pick and choose amongst the already standardized characters based
on that definition. Given the universal (including historic) scope of the
Unicode Standard, that way lies madness.

Received on Fri Oct 14 2011 - 14:21:18 CDT