On Thu, 5 Jun 1997, Chris Newman wrote:
> On Thu, 5 Jun 1997, unicore wrote:
[Well, it was me who wrote that. I hope we get this list fixed asap.]
> > Thanks to Kent Karlsson, some technical problems in the proposed
> > MLSF have been discovered.
> As far as I know, the only problems identified so far are:
> (1) The specification incorrectly targets the proposal. It needs to be
> retargeted as an encoding one layer above UTF-8.
Some of this retargeting can be done by changing the wording.
But the bulk of it can't. It doesn't *feel* like an encoding layer
above UTF-8. It's an addition to UTF-8, in the same way TAB, LF,
CR,... are an addition to the ASCII graphical character set.
Nobody identifies them as "one layer above".
And it's not intended to be used as one. Things like "we want to
make it a 'charset'" (rather a strange idea if you look at
RFC 2130, which even suggests separating the two layers that
a charset consists of) and "whether it's UTF-8 or MLSF, it
should all go into the application without distinction"
very much point in this direction.
> > Limiting myself on just ASCII for the moment, we could just
> > select two characters (one for language tags, and one for
> > alternatives) as being special. Let's take @ and %.
> > An example could then look as follows:
> > @fr-fr@soixante-dix%@fr-ch@septante%@en@seventy%@de@siebzig
> > Details will have to be discussed.
> > The above is plain ASCII only, and easily readable and editable.
> > Now the problem with this is that the special characters in
> > the ASCII range are well used for all kinds of purposes, which
> > creates problems for searchability and so on. Still, with "@@"
> > as an escape code for real "@", and "%%" for "%", it could work.
> > But there is another solution. Unicode contains tons of
> > special characters. The best candidate for our purpose
> > are the two rows in the Latin-1 area. They are available
> > for display on most systems, and they are below 0x800
> > (thus need two bytes in UTF-8).
> Unfortunately, both of these solutions break the desired searching
IETF is about transport. Of course transport is always affected
by processing aspects. But I don't remember ever having seen a
protocol design so heavily influenced by internal application concerns.
Why don't you use MLSF internally, and something more readable
and less controversial externally?
> The former causes both false positives and
> false negatives. The latter causes false positives.
False positives are not really a problem. They can always
be culled. And I don't currently see why the first proposal
should have false negatives, but the second wouldn't.
If you mean the escaping with "@@", that would be necessary
even for the second proposal.
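To make the escaping point concrete, here is a minimal sketch of the
"@"/"%" tagging with doubled-character escapes. The helper names and the
exact tag syntax are mine, purely for illustration; details would of
course still have to be discussed.

```python
def escape(text):
    """Double the reserved characters so literal '@' and '%' survive."""
    return text.replace("@", "@@").replace("%", "%%")

def tag_alternatives(alternatives):
    """Build a tagged string from (language, text) pairs, e.g.
    [("fr-fr", "soixante-dix"), ("en", "seventy")]."""
    return "%".join("@{}@{}".format(lang, escape(text))
                    for lang, text in alternatives)

print(tag_alternatives([("fr-fr", "soixante-dix"),
                        ("fr-ch", "septante"),
                        ("en", "seventy"),
                        ("de", "siebzig")]))
# @fr-fr@soixante-dix%@fr-ch@septante%@en@seventy%@de@siebzig
```

Note that the escaping would be applied to the text portions only; the
language tags themselves contain neither reserved character.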
> > - It's plain UTF-8 text, not a new encoding or something
> > that looks like a new encoding. It clearly shows
> > that an application-level problem is solved with
> > a well-known IETF application-level solution.
> It's also well-known that quoting and encoding have to be removed before
> most operations can be applied, including searching.
Yes, that's the well-known distinction between processing and
transfer. And searching could indeed work directly on
the representation I propose, if an application desires to do so.
> > - It's very easily parsable. I haven't written code yet,
> > but I guess it's shorter than what we have now.
> I'd bet it's longer -- it requires a state-based parser and a full decode
> of latin-1 UTF-8 characters.
Not at all. The only things you have to watch out for are those
two two-byte combinations that you use as reserved characters.
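A sketch of such a parser, to back up the claim that no full decode is
needed. This is written against the ASCII "@"/"%" variant for
readability; the two-byte variant would scan for the reserved byte
pairs in the same single pass. Error handling is deliberately minimal.

```python
def parse_tagged(s):
    """Split an '@lang@text%@lang@text...' string into (lang, text)
    pairs, honouring '@@' and '%%' escapes. A simple one-pass loop;
    real code would need better handling of malformed input."""
    alternatives = []
    i, n = 0, len(s)
    while i < n:
        if s[i] != "@":
            raise ValueError("expected '@' at position %d" % i)
        j = s.index("@", i + 1)          # closing '@' of the language tag
        lang = s[i + 1:j]
        i = j + 1
        text = []
        while i < n:
            c = s[i]
            if c in "@%":
                if i + 1 < n and s[i + 1] == c:   # doubled = escaped literal
                    text.append(c)
                    i += 2
                    continue
                if c == "%":                      # single '%' ends this alternative
                    i += 1
                break
            text.append(c)
            i += 1
        alternatives.append((lang, "".join(text)))
    return alternatives

print(parse_tagged("@fr-fr@soixante-dix%@en@seventy"))
# [('fr-fr', 'soixante-dix'), ('en', 'seventy')]
```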
> > - It's more understandable even on systems that can't
> > display anything else than ASCII (the language
> > tags are still plain text), or if it gets
> > interpreted as something else than UTF-8.
> Actually, since the language tags are there, they would show up in
> addition to the actual ASCII text in this context. The language tags need
> to be out-of-band markers in the text.
If they are in the text, they are not out of band. And we either
have software that knows how to deal with them (so that they
get removed) or that doesn't. In all similar cases I know
(RFC 1522,..., not otherwise my favorite :-), it has proved
very helpful to see such things displayed when things didn't
work as expected.
> > - It can be used in UTF-16, UTF-7, and (depending on the
> > specials chosen) even in other encodings.
> It also has the disadvantage in that it introduces non-character
> codepoints into the character stream.
No, it doesn't introduce any new codepoints. All of them are
characters. It just defines a higher level that is clearly
visible as such.
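The point that these are ordinary characters is easy to demonstrate:
every code point from U+0080 to U+07FF, including the Latin-1 rows,
is a real character that UTF-8 encodes in exactly two bytes. U+00A4
below is purely an illustration; which characters would actually be
reserved was never decided.

```python
# U+00A4 (CURRENCY SIGN) is a normal Latin-1 character.
ch = "\u00a4"
encoded = ch.encode("utf-8")
print(encoded)                        # b'\xc2\xa4' -- exactly two bytes
print(encoded.decode("utf-8") == ch)  # True: round-trips as plain UTF-8
```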
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT