Tech Transcends Tongues

Updated on 01-Apr-2006

“Data-oriented translation,” “example-based and context-based processing,” “syntactic parsing,” “Japanese morphological analysis,” and the like are the stuff of machine translation (MT) research. But that’s exactly what we’re not going to talk about here, or even translate into English for you! Here, we take a gentle look at the potential of MT, and at what we expect from it in the near future. Will you be able to shoot off an e-mail to a Spanish friend and have it translated in transit?

Your first brush with translation has probably been in the context of reverse translation using online translators such as Google’s and AltaVista’s. We’re talking about the forwards that go round saying something like “Translate the following to Spanish on Google and then translate it back into English! Have fun!” Reverse translation may be fun, but it’s unfair. Unfair, because it’s stressing the system too much, as it were.

But what about simple, one-way translation? We have the all-time classic of hilarious MT in the apocryphal story of a 1950s system that translated “The spirit is willing, but the flesh is weak” into Russian as “The vodka is strong, but the meat is rotten.” That may well be just a joke, but let’s take a look at the state of the art: at www.systransoft.com, when you translate the German “Du Strahlenreiche” (You radiant one) into English, out comes “You jet realms.” And “Gleichnis” (likeness, or reflection) becomes “equalsneeze”!

Rays To Riches
That’s a sad state of affairs, but a good starting point for an analysis. “Strahlen” means rays or beams, hence “jet.” And “reich” means “rich,” but “Reich” can also mean “rule” or “domain,” hence “realm.” Now, “Du Strahlenreiche,” for us, is a combination of “rays” and “rich,” making for a literal translation of “You who are rich in rays,” which we understand as “You radiant one.” But how is the poor software to understand that we meant “rich” and not “realm”? More importantly, we gather from the context, and from our world knowledge, that “rich in rays” can mean “radiant.” How on earth would the software know that there is even any such thing as radiance in the world, given only the words “realm,” “rich,” and “rays”?

An analysis of some ambiguous text shows just how much a program needs to know in order to be able to faithfully render a piece. Consider four sentences:

“I like the cat, I can drink milk.”
“I, like the cat, can drink milk.”
“I, cat-like, can drink milk.”
“I can drink milk like the cat.”

Even supplied with all three meanings of “like,” a machine would find it really hard to translate all four sentences properly! Then take the example of “The shirt is in the trunk.” The storage trunk? An elephant’s trunk? A tree-trunk? We know that it’s the first one; why? Because an elephant’s trunk is too small to shove a shirt into, and because a tree-trunk is a highly unlikely place to stash a shirt away! And who is to take on the job of feeding all world-knowledge into a translating machine? Then there are idioms: “Woe is me” will, in all likelihood, get translated as “I am sadness” into other languages.
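
To get a feel for what “feeding world-knowledge into a machine” means in the trunk example, here is a minimal Python sketch. The knowledge table and the function name are our own hypothetical inventions, not part of any real MT system; the point is only that every fact has to be written down by somebody.

```python
# Hypothetical hand-written world knowledge about the senses of "trunk".
# Every entry here is an assumption made up for illustration.
TRUNK_SENSES = {
    "storage trunk": {"can_hold": {"shirt", "books", "blanket"}},
    "elephant's trunk": {"can_hold": {"water", "peanut"}},
    "tree trunk": {"can_hold": set()},
}

def plausible_senses(item):
    """Return the senses of "trunk" that could plausibly contain the item."""
    return [sense for sense, facts in TRUNK_SENSES.items()
            if item in facts["can_hold"]]

print(plausible_senses("shirt"))     # only the storage trunk qualifies
print(plausible_senses("squirrel"))  # nothing: nobody told the table about tree hollows
```

The second call comes back empty simply because the table knows nothing about squirrels, which hints at how large such a table would have to grow before it could be trusted.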

We’re Tough Customers
As things around us develop at the proverbial mind-numbing pace, we’ve become impatient. We want IMs that translate, so you can type in English and your friend somewhere reads your text in Hebrew. We want a button on Web pages where we choose the language to view the page in. We want automatic speech translators, so we can speak in Hindi with our speech coming out in English through a mouthpiece.

Turns out something like that has already been developed, but more on that later. For now, have you realised how fortunate you are to be able to read this copy of Digit? You have access to more Web pages than most other people! Figures vary from source to source, but although Chinese is catching up fast, English is still by far the most prevalent language on the WWW. Why, and for how long, need this imbalance exist? And remember, with all the attention to multimedia, one is apt to forget that text is still the mainstay of communication! The world needs automated translation, and soon.

Artificial Imbeciles
The MT problem is, of course, within the domain of AI (Artificial Intelligence), amongst other disciplines. MT is very closely related to Natural Language Understanding (NLU), where a computer attempts to understand what a spoken or written sentence means. Now “understanding” is something that’s been subjected to much philosophical, scientific and pseudo-scientific debate, so we won’t debate it here. But then, stated one way, the goal of AI, as far as the input side is concerned, is indeed to understand. In that light, we can say that MT will advance hand in hand with AI.

But when? MT hasn’t really seen any major breakthroughs since its beginnings in the 50s. In AI, promises have all too often been made and not delivered upon; MT is no exception. Some people say, “It’s all there: the hardware (fast machines), the software (neural networks and fuzzy logic), and it’s just a matter of time before we get good automated translations…” Problem is, it’s turning out to be a really long time.

Some people have noted, though, that other fields that do AI research (robotics, for example) could come up with insights and methods that would further MT. Then, there’s the possibility that since one of the biggest problems in AI is infusing common sense into systems, MT would benefit as bodies of common sense become bigger and better and can be easily “plugged in” to existing systems.

Common sense apart, there’s the problem of knowledge, as noted before. Unlike systems such as chat-bots that require a little knowledge and a lot of common sense, and unlike systems such as chess-playing programs that work on lots of knowledge and heuristics and not too much common sense, MT requires both lots of knowledge and lots of common sense!

Why knowledge? We mentioned earlier that a system needs to know that an elephant’s trunk is an unlikely place to shove a shirt into. Such things straddle the border between common sense and knowledge. But there’s pure knowledge to be considered, too: certain words do not occur in certain contexts, such as “glycogen” in the context of reportage (more on this later). “You” can be “tum,” “aap,” “aap log,” or “tum log” in Hindi, depending on context. Cultural and regional differences, both in world reality and in word usage, exist between regions and between languages.

The problem we’re pointing at is deciding how much knowledge needs to be put into the system, and how to encode it all so the machine translator can put it to good use. And that goes beyond even AI: it involves collaboration between cognitive scientists, linguists, language and cultural experts, computer scientists, and more.

Methods To The Madness
But talking about understanding, we’re led to one obvious way to try and translate documents, comprising the following steps:

Understand the source document
Put it in an intermediate format
Generate the output in the target language

This is a step ahead of the “direct” method of translation, which just takes a language pair, takes in dictionaries of both, and tries to generate translations without paying attention to context and other nuances. The intermediate-representation approach also has the advantage that once the intermediate “understanding” is generated, you could translate that into any language, not just one.
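
As a rough Python sketch of the three steps above: the frame format, the tiny lexicons and the function names below are all hypothetical, made up by us for illustration, and a real analyser would parse the sentence rather than look it up. What the sketch does show is the advantage just described, that one intermediate frame can feed a generator for each target language.

```python
# Toy intermediate-representation pipeline. Everything here (frame format,
# lexicons, function names) is a hypothetical illustration.

# Steps 1 and 2: "understanding" is faked with a lookup from an English
# sentence to a language-neutral frame.
ANALYSES = {
    "i drink milk": {"agent": "SPEAKER", "action": "DRINK", "object": "MILK"},
}

# Step 3 resources: one word list and one word order per target language.
GENERATORS = {
    "german": {"order": ("agent", "action", "object"),
               "words": {"SPEAKER": "ich", "DRINK": "trinke", "MILK": "Milch"}},
    "hindi": {"order": ("agent", "object", "action"),
              "words": {"SPEAKER": "main", "DRINK": "peeta hoon", "MILK": "doodh"}},
}

def analyse(sentence):
    """Map source text to the intermediate frame."""
    return ANALYSES[sentence.lower().rstrip(".")]

def generate(frame, language):
    """Turn the intermediate frame into text in the chosen target language."""
    spec = GENERATORS[language]
    return " ".join(spec["words"][frame[role]] for role in spec["order"])

frame = analyse("I drink milk.")  # analysed once...
for language in GENERATORS:       # ...generated for every target language
    print(language, "->", generate(frame, language))
```

Adding a new target language in this sketch means adding one more entry to the generator table; the analysis side stays untouched.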

A third approach is called the “transfer approach,” consisting again of three steps. Here, the first stage converts the source language to an abstract representation; the second maps these representations to representations in the target language; and the third stage generates the translation. Note that the system doesn’t exactly try and “understand” what’s going on in the source language; there are abstract representations involved for both languages, which incorporate such things as context and real-world knowledge.
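
The three stages of the transfer approach can be sketched in Python in much the same toy fashion. Again, the structures, the rule table and the function names are our own assumptions for illustration. Note the difference from an intermediate “understanding”: the middle stage here is a set of rules written for one specific language pair (English to Hindi).

```python
# Toy transfer pipeline for a single language pair (English -> Hindi).
# All structures, rules and names are hypothetical illustrations.

def analyse_english(sentence):
    """Stage 1: source text -> abstract English structure (naive three-word split)."""
    subject, verb, obj = sentence.lower().rstrip(".").split()
    return {"subject": subject, "verb": verb, "object": obj}

# Stage 2 resources: lexical transfer rules for this language pair only.
LEXICAL_TRANSFER = {"i": "main", "drink": "peeta hoon", "milk": "doodh"}

def transfer(english_structure):
    """Stage 2: map the English structure to a Hindi structure."""
    hindi = {role: LEXICAL_TRANSFER[word]
             for role, word in english_structure.items()}
    hindi["order"] = ("subject", "object", "verb")  # verb-final order for Hindi
    return hindi

def generate_hindi(hindi_structure):
    """Stage 3: linearise the Hindi structure into text."""
    return " ".join(hindi_structure[role] for role in hindi_structure["order"])

print(generate_hindi(transfer(analyse_english("I drink milk."))))
```

A second target language would need its own transfer rules and its own generator, which is the price paid for not building a shared intermediate format.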

But what seems really promising now is the statistical approach, which we’ll soon talk about.

Google Everywhere!
In August 2005, in a US government-run test, Google’s translation application beat technology from IBM and from various universities. In the test, Google scored the highest amongst all competing software in Arabic-to-English and Chinese-to-English translation tests; these were conducted by the NIST (National Institute of Standards and Technology). Each test comprised the task of translating a hundred articles from Agence France Presse and the Chinese Xinhua News Agency from December 2004 to January 2005.

How?

The answer lies in statistical analysis. It now seems increasingly likely that rather than have a system try to understand a piece of text, or formulate abstract representations taking context and other things into account, the most promising method of translation involves looking at ready translations. How this works is, the system would look at existing translations, and be trained on those. For example, if the German “reich” is often seen as being translated as “rich” in the context of money, the system would pick it up. And “Reich” gets translated as “realm” or “rule” in the appropriate context, as in “Das Dritte Reich” (remember “The Rise and Fall of the Third Reich”?), and the distinction would be made.

This means that the more documents there are available in the source and target languages, the better the system would get at translating. And, of course, Google has a very large store of documents in various languages.
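
To make the “reich” example concrete, here is a deliberately simple Python sketch of learning from ready translations. The training examples are invented by us, and the method is a plain vote count over neighbouring words; we are not suggesting this is how Google’s system (or any real statistical translator) works internally, only that the choice of translation can come from counting rather than from understanding.

```python
from collections import Counter, defaultdict

# Hypothetical training data: for each existing translation, the words seen
# near "reich" in the German text and the English word the translator chose.
EXAMPLES = [
    ({"geld", "mann"}, "rich"),
    ({"geld", "familie"}, "rich"),
    ({"kaiser", "krieg"}, "realm"),
    ({"kaiser", "grenze"}, "realm"),
    ({"mann", "krieg"}, "rich"),
]

def train(examples):
    """Count, for every context word, how often each translation was used."""
    counts = defaultdict(Counter)
    for context, translation in examples:
        for word in context:
            counts[word][translation] += 1
    return counts

def translate_reich(context, counts):
    """Pick the translation with the most votes from the context words."""
    votes = Counter()
    for word in context:
        votes.update(counts.get(word, {}))
    if not votes:
        return None  # no evidence at all for this context
    return votes.most_common(1)[0][0]

counts = train(EXAMPLES)
print(translate_reich({"geld"}, counts))              # money context
print(translate_reich({"kaiser", "grenze"}, counts))  # empire context
```

With five examples the counts are flimsy; the more existing translations there are to count over, the steadier the votes become.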

Humans To The Rescue
Chess is an example of an AI endeavour that has succeeded remarkably well. Most others haven’t. Translation, as we said, is a hard AI endeavour, and it’s unlikely we’ll see a good automated translator soon. But there are three things to consider here. First off, to aid a machine in its job, the source text may have to be pre-edited, or “normalised,” to make it suitable for translation, as in the following:

I know I’m wrong.
I know, I’m wrong.

The two are similar in English, but the translations into German by free online translators are two very different things! Someone who (a) knows the idiosyncrasies of the translation system, and (b) has a good knowledge of both languages, could “format” text for the best results. Pre-editing is increasingly being seen as an important step. One can come up with several examples where pre-editing would be required: slang would need to be toned down, casual utterances would need to be formalised, missing words such as “that” (as in “I know that I’m wrong”) would be inserted, and so on.
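
A human pre-editor works by judgement, but a few of the fixes just listed can be sketched as mechanical rules. The Python below is only an illustration under our own assumptions: the rule list is hypothetical, it uses plain ASCII apostrophes, and it assumes that both “I know I'm wrong” and “I know, I'm wrong” are meant as “I know that I'm wrong,” which a human editor would check before changing anything.

```python
import re

# Hypothetical pre-editing rules, applied in order: (pattern, replacement).
RULES = [
    (re.compile(r"\bgonna\b"), "going to"),  # tone down slang
    (re.compile(r"\bwanna\b"), "want to"),
    # Insert the omitted "that" after "know" when a pronoun starts a clause.
    (re.compile(r"\b(know),? (I|you|he|she|we|they)\b"), r"\1 that \2"),
]

def pre_edit(text):
    """Normalise casual English before handing it to a translator."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(pre_edit("I know I'm wrong."))
print(pre_edit("I know, I'm wrong."))
print(pre_edit("We know they are gonna win."))
```

Both versions of the sentence come out as “I know that I'm wrong.”, so the translator downstream only ever sees one form.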

Experts In Their Field
The second thing to consider, given that we may never get good general-purpose machine translators, is domain-specific translation. “Domain,” here, refers to fields such as weather forecasting, physics, and so on. Why we mention weather in particular is that in Canada, the Météo system has been used to translate weather reports from English to French. It has worked well for decades now, and is still being used without anyone even noticing or complaining! How come? Simply because the system has been tailor-made for the field, and because the terminology rarely changes. There’s no slang, and the words always have a fixed meaning: for example, a “cold front” is always meant in the weather sense of the phrase.

Thus, instead of aiming at general-purpose translation systems, we could set our sights lower and think of developing several domain-specific systems. Thankfully, a lot of the material on the Internet that needs to be translated is domain-specific.

Helping Hands
As a third point, since the problem is so hard for AI, the future might well see the machine translator working as an aid in the translation process. In that case, we can envisage someone with less-than-perfect knowledge of the target language being able to translate reasonably well with the help of a machine: the machine would do a lot of the dictionary searching and basic grammar construction, while the human would do the dirty work of supplying context, and re-consulting the machine in the case of an ambiguity. However, it all boils down to money: critics of this approach say it would be just as expensive as human translation, but proponents say it would be less expensive on account of the man-hours saved.

One other approach being seen as the future of translation is that human translators would proofread machine translations. This, as in the former case, has the advantage of reducing the man-hours spent in the translation process. It has the added advantage of the final output having been read and verified by a human.

Some Online Translation Sites

Site	Comments
www.systransoft.com	Reasonably good; one of the oldest existing tools
www.google.com/language_tools	Probably the best one out there
http://babelfish.altavista.com/tr	Quirky. Sometimes OK, sometimes pathetic
http://trans.voila.fr/voila	Site is in French; works better for French translations
www.linguatec.net/online/ptwebtext/index_en.shtml	Has an excellent feature: subject area selection for more accurate translation. Works better for German translations
http://translation2.paralink.com/	Poor, even at German translation, which is the site’s speciality
www.reverso.net/text_translation.asp	Very bad, even though the site seems to have won several awards

Is This The Future?
We’ve painted a somewhat anaemic picture so far: that of MT not being up to speed, and fully automated translation not being possible in the near future. But here’s an excerpt from post-gazette.com, dated 28 October 2005:

“Stan Jou’s lips were moving, but no sound was coming out. Mr Jou, a graduate student in language technologies at Carnegie Mellon University, was simply mouthing words in his native Mandarin Chinese. But 11 electrodes attached to his face and neck detected his muscle movements, enabling a computer program to figure out what he was trying to say and then translate his Mandarin into English. The result boomed out of a loudspeaker a few seconds later:

“Let me introduce our new prototype,” a synthesised voice announced. “You can speak in Mandarin and it translates into English or Spanish.”

This might sound like we’ve reached what’s needed in MT, but electrodes, and speech being translated directly into speech on the fly, do not make for a better translator! The article in the Post-Gazette goes on to say that when (he) announced he would take questions from reporters in Germany and America, the computer heard it as “so we glycogen it alternating questions between Germany and America.” And, of course, things such as humour would be entirely lost in translation, to use the old phrase.

What’s interesting about this system (of which several versions are being developed, by the way) is that it uses the statistical approach we mentioned earlier: a system learning from a body of texts in both languages, and from how the translation was done between those texts. We can’t stress enough that the statistical approach seems to us the most likely way forward. And not just for academic reasons, but also because the mass of information (in all languages) on the Internet will increase with time, so machine translators will have that much more wisdom to draw from.

MT In India
It’s probably in the EU and in India that the need for MT most exists. We came across two Indian projects of note: the Tamil-Hindi Machine Aided Translation System, at www.au-kbc.org/research_areas/nlp/demo/mat/ , and “An English to Hindi Machine Aided Translation System: an ongoing project at IIT Kanpur,” at http://anglahindi.iitk.ac.in/ .

According to a document on the Anglabharti project (which aims at translating English into all major Indian languages, and of which Anglahindi is a subset), the system uses something like an interlingua approach (the one that converts sentences from the source language into an intermediate representation). It analyses English sentences to create an intermediate structure, with most of the disambiguation performed in this step. The intermediate language structure has the word and word-group order according to the structure of the group of target (Indian) languages. Note that word order is an important consideration, since English usually strictly follows the subject-verb-object structure, whereas Indian languages follow a relatively free structure. As an example, consider

I am going to the shop.

This sentence, like most simple English sentences, has the subject (“I”) first, then the verb (“going”), then the object (“the shop”). Now consider that it can be translated into Hindi as

Main dukaan ja raha hoon.
Dukaan ja raha hoon main.
Ja raha hoon main dukaan.
Main ja raha hoon dukaan.

Although one or two of these variants are more likely to be used, the fact is they’re all grammatically correct. In contrast, there’s no way of reordering the English sentence.

Therefore, converting the intermediate structure to each Indian language through text generation is relatively simple. In fact, the authors of the paper say that the effort involved in analysing the English is about 70 per cent of all the work, while the text generation accounts for only 30 per cent. Part of the reason for this is the word order situation we’ve described.

If you speak Tamil and/or Hindi, try these systems out, and do remember to write in: we’re especially interested in the bloopers!

A Universal Grammar?

A possible future direction for MT would be to take advantage of a linguistic theory called “Universal Grammar” (UG). According to Wikipedia, UG postulates principles of grammar shared by all languages, thought to be innate to humans. It attempts to explain language acquisition in general, not describe specific languages. UG proposes a set of rules that would explain how children acquire their language.
According to UG, there is only one method of “diagramming” sentences; this method applies to all the languages of the world, and is universal because it is genetically encoded into the brain of every human child. This is a bold thesis, and a large number of linguists are working within this approach.
MT would benefit from a well-developed UG theory because the intermediate-language approach could easily be used to generate a representation of the meaning of the sentence, since there is only one way of diagramming it. Problems such as those of ambiguity resolution would remain, but the problems of syntactic and semantic parsing would be solved.

We Hate To Tell You, But…

MT is still in its infancy. Worse, it’s been in its infancy since the 50s. Each researcher proposes a different scheme for “the MT system of the future,” based on one or the other of the basic approaches. Some say we will never have completely automated MT, and sadly, that just could be true. In the foreseeable future, MT seems best focused on aiding human translators.

The most promising thing we can tell you is that the statistical approach based on large bodies of existing translation can only get better as the information on the WWW increases: we’re putting our bets on Google, by the way! Which, reverse translated to and from German (even if that’s unfair), becomes “We set our bets on Google, by the way!” And that’s amazing… there might still be hope!

Team Digit

Team Digit is made up of some of the most experienced and geekiest technology editors in India!
