From: Jon Hanna (jon@hackcraft.net)
Date: Mon Jan 18 2010 - 10:05:48 CST
William_J_G Overington wrote:
> Hopefully one day some localizable sentences, not necessarily those used by me in the experiments, will be encoded into regular Unicode.
Hopefully indeed! But let us not rest there.
Obviously such an approach under-reaches and lacks ambition. Why settle 
for some sentences when we can have all sentences?
Now, the number of sentences in practical use is bounded; eventually 
even the worst culprit of over-long sentences (mea culpa, mea maxima 
culpa) at some point runs out of breath and feels the need to add a 
full-stop (or sentence break indicator of whatever script you're using 
today) and then begin on another sentence, or perhaps a new paragraph, 
or maybe they've finished the whole thing. Phew, knew that full-stop was 
going to happen eventually.
Theoretically though, the length of a sentence is boundless. We know of 
course that we can create an English sentence from any number (n > 0) of 
occurrences of the word "buffalo" and end up with a grammatically correct 
sentence, albeit an ambiguous one for higher values of n.
This alone gives us an infinite number of sentences where no word is 
used other than "buffalo". In languages where there are no homophones 
meaning "bison", "intimidate" and "the second most populous city in the 
state of New York", this doesn't hold, but we can still use this 
feature of English to deal with recording this particular infinite 
subset of the infinite set of grammatically valid sentences (itself a 
subset of the infinite set of sentences) and then use the resulting 
encoding as a key to localised resources for other languages.
Because the set is infinite, its elements cannot be represented by a 
fixed-size unit, only by variable-sized strings, which requires us to do 
one of the following:
1. Prefix the entire sentence with an indicator of length.
2. Prefix the entire sentence with an indicator of length which also 
does the job of containing some of the following data.
3. End the sentence with an indicator that the end has been reached.
The third of these is self-correcting (if we miss it we do not 
mis-interpret data as length and vice-versa, and at the end of the next 
sentence we have two sentences corrupted into one over-long sentence, 
followed by correct data rather than having corrupted the entire 
data-stream).
I propose we use U+002E.
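
To make the self-correcting property concrete, here is a minimal Python
sketch, assuming sentences are plain strings and that U+002E never occurs
inside one. The names encode and decode are illustrative only, not part
of any proposal. Losing one terminator merges two sentences into one
over-long sentence, and everything after it still decodes correctly.

```python
TERMINATOR = "\u002e"  # FULL STOP, the end-of-sentence indicator proposed above


def encode(sentences):
    """Join sentences into one stream, ending each with the terminator."""
    return "".join(s + TERMINATOR for s in sentences)


def decode(stream):
    """Split a stream back into sentences at each terminator."""
    parts = stream.split(TERMINATOR)
    # Whatever follows the last terminator is an unfinished sentence; drop it.
    return [p.strip() for p in parts[:-1]]


sentences = ["Buffalo buffalo buffalo", "Buffalo buffalo", "Buffalo"]
stream = encode(sentences)

# Simulate losing the first terminator in transit (assumed failure mode).
damaged = stream.replace(TERMINATOR, " ", 1)
recovered = decode(damaged)
# The first two sentences are fused; the third survives intact.
```

With a length prefix instead, the same single-character loss could cause
data to be read as a length, and the rest of the stream with it.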
At this point we have a system adequately capable of recording and 
reproducing any of the infinite set of sentences that consist entirely 
of the word "buffalo".
Now we want to extend this to other sentences; let us start with those 
which are similarly simple. /people( people)+/ and /police( police)+/ 
both describe infinite sets of grammatically-valid sentences, and can be 
easily added in like manner.
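
As a rough illustration, the two patterns above can be checked with
Python's re module. This is only a sketch: the PATTERNS table and the
classify function are hypothetical names, and matching is made
case-insensitive on the assumption that a sentence may start with a
capital letter. Note that both patterns, as written, need at least two
occurrences of the word.

```python
import re

# One pattern per repeated word, copied from the expressions in the text.
PATTERNS = {
    "people": re.compile(r"people( people)+", re.IGNORECASE),
    "police": re.compile(r"police( police)+", re.IGNORECASE),
}


def classify(sentence):
    """Return the repeated word if the sentence is in one of the sets, else None."""
    for word, pattern in PATTERNS.items():
        # fullmatch requires the whole sentence to fit the pattern.
        if pattern.fullmatch(sentence):
            return word
    return None
```

Adding "fish" or "smelt" means adding another entry to the table, which
is exactly the growth problem described next.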
At this point, it becomes clear that the number of words for which we 
are doing this is itself large. Fish and smelt both work, and who knows 
how many others we will find? We need some way to reduce this set in a 
manageable way.
Notably, there is a certain similarity between the first sound of both 
"police" and "people". This sound is also repeated later in "people". If 
we pick a token to encode this, we can reduce the set of tokens we need. 
I propose U+0070.
Taken in like manner we can then work on other sounds to produce a way 
of representing them in non-audible formats.
I'm sure this can't be done perfectly, but we can probably agree to some 
conventions and live with any persisting disagreements or legacies that 
will result from changes in language.
Now for the final task: internationalising all of these strings. Our 
encoding is based on just one language, but that's okay; we can take 
these language-dependent data and use each such datum as a key, along 
with an indicator of the language we are interested in, to retrieve a 
similar variable-length piece of data.
Extended, every feasible sentence can be encoded in a language-dependent 
way, and then used as a key to versions in other languages.
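
The lookup described above might be sketched as follows in Python. The
table LOCALISED, the function localise, and the assumption that the key
language is "en" are all hypothetical, and the stored values are
placeholders rather than real translations.

```python
# Hypothetical table: (language-dependent key, target language) -> text.
# The values are placeholders, not real translations.
LOCALISED = {
    ("Police police police.", "fr"): "<French text for this key>",
    ("Police police police.", "de"): "<German text for this key>",
}


def localise(key, language, source_language="en"):
    """Return the stored version of `key` for `language`, or None if unknown."""
    if language == source_language:
        # The key is already the sentence in its own language.
        return key
    return LOCALISED.get((key, language))
```

Both the key and the value are variable-length strings, so the table
needs nothing beyond the encoding already described.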
See, by just extending your idea sensibly, we can move forward to 
something that's of a level of technology we should expect in the 27th 
Century (BCE).