Re: A basic question on encoding Latin characters

From: Gary Roberts (gar@sparc.sandiegoca.ncr.com)
Date: Wed Sep 29 1999 - 14:21:24 EDT

Next message: Michael Everson: "Missing North American characters?"
Previous message: Eric Brunner: "list mgt (was: Re: A basic question on encoding Latin characters)"
Maybe in reply to: Marion Gunn: "A basic question on encoding Latin characters"
Next in thread: Paul Keinanen: "Re: A basic question on encoding Latin characters"

Listening to the discussion, I have the following observations:

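Scripting languages tend to do prefix matching: they act as soon as the
expected text has arrived, without waiting to see what follows. Because of
this, they are limited. For example, if a message could be 'login: ' or
'login: No logins permitted at this time.', a script designed to handle
just the first message is difficult to write, since the second message
begins with the first.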
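
To make the 'login: ' case concrete, here is a minimal sketch in Python of
how such a matcher behaves. The function name first_match and the way the
input is split into chunks are my own assumptions for illustration, not
taken from any particular scripting language.

```python
def first_match(chunks, pattern):
    """Return (chunk_count, buffer) as soon as pattern appears, else None.

    Hypothetical helper: it mimics a script that acts the moment the
    expected text has arrived, without waiting for further input.
    """
    buffer = ""
    for count, chunk in enumerate(chunks, start=1):
        buffer += chunk
        if pattern in buffer:
            # The script fires here; later chunks are never examined.
            return count, buffer
    return None


# The longer message arrives in two pieces; the match fires on the first.
result = first_match(["login: ", "No logins permitted at this time."], "login: ")
```

The matcher returns after the first chunk, so it cannot tell the plain
prompt from the refusal message that merely starts the same way.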

[Please note that I use 'script' here as in scripting languages, not in
the standard sense used on this list]

Likewise, if a script is looking for 'cote', and another possible message
at this time is 'coté', then there could be a false positive match given a
decomposed format, because the decomposed 'coté' is 'cote' followed by a
combining acute accent. (By the way, I recommend converting to decomposed
format before comparison, which guarantees consistent performance.)
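
The 'cote' case can be sketched with Python's standard unicodedata module.
The variable names are mine and purely illustrative; the normalization
forms (NFD for decomposed, NFC for precomposed) are the standard ones.

```python
import unicodedata

expected = unicodedata.normalize("NFD", "cote")

# 'coté' written with the precomposed U+00E9, then decomposed.
received_decomposed = unicodedata.normalize("NFD", "cot\u00e9")
received_precomposed = unicodedata.normalize("NFC", "cot\u00e9")

# Decomposed: 'c', 'o', 't', 'e', U+0301, so 'cote' is a true prefix.
prefix_match_decomposed = received_decomposed.startswith(expected)

# Precomposed: 'c', 'o', 't', U+00E9, so 'cote' is not a prefix.
prefix_match_precomposed = received_precomposed.startswith(expected)
```

In the decomposed form the prefix test succeeds even though the word is
different, which is the false positive described above.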

The only reason the 'login: ' ambiguity is acceptable is that it
rarely occurs. I suspect the accented case is also unlikely. Most
prompts I've seen, for readability purposes, end in a SPACE. SPACE is
rarely followed by combining marks, and I would venture to say never in
any extant script. Even in cases where SPACE is not the final character
sought by a script, situations where an accent distinguishes two
different possible prompts are unlikely, particularly in 'legacy' scripts.

Now, if we are talking about sed, or some other stream processing
language, these have the opportunity to wait for all the input before they
decide if the strings match or not.

So, it is only real-time protocol scripts that have this problem, and
real-time protocol scripts have a limited vocabulary. If we are inventing
a new real-time protocol, we can make sure the ambiguity will never occur
by insisting on space or some other fine character as the termination
marker for the string.
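
As a rough sketch of the termination-marker idea, again in Python: the
helper prompt_seen and its parameters are hypothetical, and I assume the
terminator is a plain SPACE, which is not a combining mark.

```python
import unicodedata


def prompt_seen(buffer, word, terminator=" "):
    """True only once the buffer ends with word plus its terminator.

    Hypothetical helper. Both sides are converted to decomposed form
    (NFD) before comparison, so precomposed and decomposed input are
    treated alike.
    """
    target = unicodedata.normalize("NFD", word + terminator)
    return unicodedata.normalize("NFD", buffer).endswith(target)


plain_terminated = prompt_seen("cote ", "cote")          # word, then SPACE
accented_terminated = prompt_seen("cot\u00e9 ", "cote")  # accent precedes SPACE
not_yet_terminated = prompt_seen("cote", "cote")         # still waiting
```

Because the combining accent would have to arrive before the SPACE, the
accented word can no longer be mistaken for the plain one, and the script
simply keeps waiting until the terminator shows up.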

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:53 EDT