Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Sat, 20 Apr 2013 19:38:25 -0700

> I am wondering whether it would be a good idea for there to be a list of numbered preset sentences that are an international standard and then if Google chose to front end Google Translate with precise translations of that list of sentences made by professional linguists who are native speakers, then there could be a system that can produce a translation that is precise for the sentences that are on the list and machine translated for everything else.
Phrase-based machine translation goes much further: it already pairs up
far more sentences than would fit into a standard with a limited code
inventory such as Unicode, and it pairs up phrases as well. That its
translations are not precise is a problem that has to do with context
and with natural language per se.

> Maybe there could then just be two special Unicode characters, one to indicate that the number of a preset sentence is to follow and one to indicate that the number has finished.
That would belong in a higher-level protocol, not in Unicode.
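To make the higher-level-protocol point concrete, here is a minimal sketch of how delimited sentence numbers could be layered on top of plain text. Everything in it is invented for illustration: the two delimiter characters (borrowed from the Private Use Area, the sort of code points a private protocol might appropriate), the IDs, and the two-entry sentence table stand in for the imagined standardized inventory.

```python
# Hypothetical sketch of such a protocol: a preset-sentence number is
# embedded in running text between two delimiter characters. The
# delimiters (U+E000 / U+E001, Private Use Area) and the sentence table
# are illustrative assumptions, not anything Unicode defines.
ID_START = "\uE000"
ID_END = "\uE001"

SENTENCES = {  # stand-in for the imagined standardized inventory
    17: "Where is the railway station?",
    42: "I need a doctor.",
}

def encode(sentence_id: int) -> str:
    """Wrap a preset-sentence number in the delimiter characters."""
    return f"{ID_START}{sentence_id}{ID_END}"

def decode(text: str) -> str:
    """Replace each delimited number with its preset sentence."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == ID_START:
            j = text.index(ID_END, i)       # find the matching close delimiter
            out.append(SENTENCES[int(text[i + 1:j])])
            i = j + 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

message = "Excuse me: " + encode(17)
print(decode(message))  # Excuse me: Where is the railway station?
```

Note that nothing here requires new Unicode characters: any agreed-upon delimiters (or plain markup) would do, which is exactly what makes it a higher-level protocol.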

> If that were the case then there might well not be symbols for the sentences, yet the precise conveying of messages as envisaged in the simulations would still be achievable.
The sentences will be as precise as the scope of the sentence inventory
allows. Enumerating sentences or phrasal fragments (I'm hesitant to
speak of "phrases", which to me implies constituent structure, but maybe
that's just me) is unrealistic unless you are trying to cover only a /very/
limited domain. If all you encode is (say) requests for meals with the
100 most frequently wanted combinations of nutritional restrictions,
your sentence inventory will encode those requests precisely, but as
soon as you're trying to make adjustments to your formulaic requests
(you're willing to eat /any/ vegetarian, gluten-free meal each time of
the day and day of the year? of /any/ size?), the sentences won't be of
use anymore. This is really why an approach that enumerates large text
chunks is unworkable. (I won't say "useless", but of limited use;
"point-at-me" picture books and imprecise translations are likely to do
a tolerable job already.) The number of sentences you'll need will be
exponential in the number of ingredient options you are intending to
vary over. In any case, we are all left guessing about the intended
coverage of any set of sentences you have in mind. From your previous
writings I'm guessing (as implied earlier) that you mean something like
"travel and emergency communication", but that is already a large
domain. If you try to delimit the coverage and come up with a finite
list of sentences, you will see that you'll end up with far too many.
You'd also need to think about how to make these sentences accessible
(via number/ID? that would be difficult or require training for the user
if the number of sentences isn't very small). What if you only want the
inventory of a travel phrasebook? For that, you have the travel
phrasebook (hierarchically organized, not by number), and I have heard
of limited-domain computers/apps for crisis situations (the details
elude me at the moment).
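The exponential growth claimed above is easy to make concrete. A minimal sketch, with invented option names: n independent binary options already force 2^n distinct preset sentences, before any wording variation is considered.

```python
from itertools import product

# Invented binary options a meal request might vary over; each extra
# option doubles the number of distinct sentences the inventory needs.
options = ["vegetarian", "gluten-free", "small portion", "breakfast-time"]

# One preset sentence per combination of option values.
combinations = list(product([False, True], repeat=len(options)))
print(len(options), "options ->", len(combinations), "preset sentences")
# 4 options -> 16 preset sentences; in general, n options -> 2**n
```

With ten such options the inventory is already past a thousand sentences, which is why enumerating large text chunks does not scale.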

> Perhaps that is the way forward for some aspects of communication through the language barrier.
You would need to specify which problems precisely you are attempting to
solve, what is wrong with the approaches presently available, and
why/how your approach does a better job.

Stephan
Received on Sat Apr 20 2013 - 21:44:15 CDT
