The Worldwide Lexicon project has a long history, about ten years. I have been studying language translation technology for many years. It is a difficult problem because computers excel at some things, but are quite dumb at others. I tried many approaches to solving this problem, and learned much in the process. It has been a long journey, but the lessons from our previous experiments taught us how to build a system that millions of people may use someday.
The Worldwide Lexicon was born in 1998, although it was called Picto at that time. The original idea was to create numeric addresses for ideas, kind of like IP addresses for computers. Every unique idea would have a unique numeric address. The number itself would not mean anything. It was just a placeholder.
A machine called a Concept Registry would assign these addresses, and would prevent the same number from being assigned to different ideas. Because the same word could represent different ideas, it could have several Picto addresses. For example, the English word “like” would have one Picto address for “to like”, and another for “like, similar to”. Picto addresses could be linked to visual icons (that’s where the name originated), and could be used in machine readable texts in web documents.
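The registry idea can be illustrated with a minimal sketch. This is not the original Picto software; the class name `ConceptRegistry`, its methods, and the use of a plain text description to identify an idea are all assumptions made for illustration.

```python
class ConceptRegistry:
    """Hypothetical in-memory registry: one arbitrary number per idea."""

    def __init__(self):
        self._next_address = 1
        self._address_by_idea = {}    # idea description -> numeric address
        self._addresses_by_word = {}  # word -> addresses of ideas it can express

    def register(self, idea):
        """Return the address for an idea, assigning a new one only if needed."""
        if idea not in self._address_by_idea:
            self._address_by_idea[idea] = self._next_address
            self._next_address += 1
        return self._address_by_idea[idea]

    def link_word(self, word, idea):
        """Record that a word can express an idea; a word may map to several."""
        address = self.register(idea)
        known = self._addresses_by_word.setdefault(word, [])
        if address not in known:
            known.append(address)
        return address

    def addresses_for(self, word):
        return list(self._addresses_by_word.get(word, []))


registry = ConceptRegistry()
enjoy = registry.link_word("like", "to like")
similar = registry.link_word("like", "like, similar to")
```

Here the word "like" ends up with two addresses, one per meaning, and registering the same idea twice returns the same number rather than a new one.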
This solved a problem for automatic translation services. Computers are not able to understand subtle differences like this. Try asking Babelfish to translate “This is like totally like that sushi I like”.
The problem with Picto was simple. It was too much work for authors to create documents. It was too much work to build the index of Picto codes. I still think it is a neat idea, and plan to revive it someday. Meanwhile, it led to the creation of DML (Disambiguation Markup Language). This is a simple and lightweight way to hide machine-readable code in ordinary web documents. I’ll talk about this some more later.
Search For Extraterrestrial Intelligence Research (2000-2003)
I have written a book and peer-reviewed articles about SETI. My research focuses on the use of mathematical languages as a way for different civilizations to communicate. Much of this work was inspired by my work on Picto, and by the work of Hans Freudenthal, Carl L Devito, Paul Fitzpatrick, Stephane Dumas and Yvan Dutil.
During this time, I developed a general theory for ACETI, Algorithmic Communication with Extraterrestrial Intelligence. The idea behind ACETI was to build a message starting with very simple mathematical concepts such as addition and multiplication. Each concept was associated with a unique numeric code, very similar to Picto. With only a few symbols, it is possible to describe the foundation of a programming language. From there, it is possible to build a rich vocabulary of symbols that describe ever more complex calculations and processes.
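As a rough illustration of how a vocabulary can grow from a few numbered primitives, here is a small sketch. The symbol numbers and the `evaluate` helper are hypothetical and are not taken from any actual ACETI message design.

```python
# Hypothetical symbol numbers; the numbers themselves carry no meaning.
ADD, MULTIPLY, SQUARE = 1, 2, 3

# Start with a couple of primitive concepts.
symbols = {
    ADD: lambda a, b: a + b,
    MULTIPLY: lambda a, b: a * b,
}

# A new symbol defined only in terms of symbols already in the vocabulary.
symbols[SQUARE] = lambda a: symbols[MULTIPLY](a, a)


def evaluate(symbol, *args):
    """Run the process associated with a numeric symbol."""
    return symbols[symbol](*args)
```

A recipient who can run `evaluate(SQUARE, 4)` learns what symbol 3 means by observing what it does, without needing a picture or a word for it.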
ACETI is a powerful tool because it allows the sender of a message to describe processes that are very difficult to explain with still images or symbols. Turbulence is a good example. Turbulence, the chaotic flow of a fluid around an object (such as a rock within a stream), is difficult to explain in symbols alone. A computer program can simulate turbulence, much like a video game. The recipient would run this program, and see that it depicts turbulence, and that this program is associated with a unique symbol. The recipient would then learn that symbol number N means “turbulence”. From there, you could build a database of symbols describing many types of processes.
My work convinced me that communication among civilizations is not only possible, but will be much richer than many people assume. We will not know unless SETI succeeds in discovering a signal. We may have to wait several centuries to find out.
GNUTrans / Distributed Translation System (2002-2002)
During this time, I concluded that machine translation systems would not be able to fully understand human language, and that a better solution was to use computers where they excel, and use people where they excel. Computers are very good at math and memory. They get faster and cheaper every year. People understand language and metaphor.
I designed a system that was modeled after SETI@Home, except that it asked bilingual people to translate short texts from web documents. It was a simple idea, but the system became very complicated, and was too expensive to build.
The problem, which we did not understand at the time, is that people were interested in different types of material. One French speaker might be interested in soccer, while another is interested in politics. We had no way to automatically give each person only the jobs that would interest them. This made the project expensive and complicated. While we did not complete this project, we learned a lot from it.
MIMS : Multilingual Instant Message System (2004-2005)
MIMS was a simple idea that enabled IM users to talk to each other with the help of a bilingual volunteer. The system kept an index of who was online and what languages they spoke. When a person wanted to chat with another user, the system would arrange the equivalent of a three-way call. The people would talk to each other via the volunteer translator.
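The matching step can be sketched roughly as follows. This is not the Java prototype; the `User` type, the `find_translator` function, and the sample names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    languages: frozenset  # language codes the user speaks


def find_translator(online_users, caller, callee):
    """Return the first online user who shares a language with both parties.

    Returns None when no suitable volunteer is online.
    """
    for user in online_users:
        if user in (caller, callee):
            continue
        if user.languages & caller.languages and user.languages & callee.languages:
            return user
    return None


alice = User("alice", frozenset({"en"}))
kenji = User("kenji", frozenset({"ja"}))
marie = User("marie", frozenset({"fr", "en"}))
yuki = User("yuki", frozenset({"ja", "en"}))

online = [alice, kenji, marie, yuki]
translator = find_translator(online, alice, kenji)
```

In this example `yuki` is chosen, because she is the only online user who speaks both English and Japanese. A return value of None corresponds to the supply problem described below: nobody suitable happens to be online.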
This was a simple system. We developed a Java prototype that worked well. We had three problems. First, instant messaging systems do not work together. This made it expensive to build an IM program that could talk to all of the popular IM networks (AOL, MSN, ICQ, etc.). Second, there was no way to predict how many translators would be online at one time. This created an imbalance between supply and demand. Third, volunteers were not interested in translating many conversations. If a conversation was not about a topic they liked, they were not willing to contribute.
It was a fun project, and like the others before it, taught us some important lessons.
TRIKI : Translation Wiki (2000-2007)
TRIKI (translation wiki) is a very simple version of the original GNUtrans system. The system watches a website or blog’s RSS feed (a syndication format) for new texts. When it sees new texts, it creates wiki pages for translations into over 50 languages. The website encourages its own readers to help translate it into whatever languages they speak. Other volunteer translators can also contribute to the system.
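The feed-watching step might look something like the sketch below. It is not TRIKI's actual code: the sample feed, the `new_translation_pages` function, and the `guid/language` page naming are all made up for illustration, and a real system would fetch the feed over the network rather than use an inline string.

```python
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 feed standing in for a real blog's feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example Blog</title>
<item><title>First post</title><guid>post-1</guid></item>
<item><title>Second post</title><guid>post-2</guid></item>
</channel></rss>"""


def new_translation_pages(feed_xml, seen_guids, languages):
    """Return one page name per (new item, language) and remember the item."""
    root = ET.fromstring(feed_xml)
    pages = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid is None or guid in seen_guids:
            continue  # skip items without an id and items already handled
        seen_guids.add(guid)
        for lang in languages:
            pages.append(f"{guid}/{lang}")
    return pages


seen = {"post-1"}  # pretend the first post was processed earlier
pages = new_translation_pages(SAMPLE_FEED, seen, ["fr", "ja"])
```

Only the second post is new here, so the sketch yields one empty translation page for it per language. Running it again on the same feed yields nothing.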
TRIKI is very simple. Any website with an RSS feed can join. Any person who wants to contribute can join in. The key insight of this project is that any website with an audience probably has bilingual readers without knowing it. Loyal readers who speak other languages will enjoy contributing translations, so that other people can read something that they enjoy. These translations will be visible to other people, and more importantly, to search engines. Material that was once invisible will suddenly become visible to the entire world.
The project may seem like an obvious idea. I often wonder why we did not do this five or so years ago, but invention is a trial-and-error process. Also, many of the tools we use in TRIKI were not widely used five years ago. RSS, wikis and social networks did not exist or were not common. The pace of change in the technology industry is incredible. The lesson for inventors is to be patient and not quit. Eventually you will solve your problem in the right way, or the market will catch up with you. (One of the biggest mistakes you can make is to invent the right thing a few years too early.)
DML – Disambiguation Markup Language
DML is another product of early experiments. It is a simple way to hide clues inside ordinary web documents. These clues will help automatic translation systems to understand difficult words or phrases. DML is simple to use, and does not require any centralized systems or services. Here is how it works. I will use the word like, because it can, like, be like many other words.
This house <!--house~home--> is like <!--like~similar--> my aunt’s home. I like <!--like~enjoy--> it.
In your web browser, you will not see the <!-- … --> code. Only a computer will see this. These hidden clues are very useful to a machine translator.
DML is useful because it is a simple way to hide clues in a document. With DML, the author does not need to tag every word, only difficult words that have many meanings. Words like aunt mean only one thing, so there is no need to tag them.
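To show how a program might read these clues, here is a minimal sketch. It assumes the hints are written as standard HTML comments of the form `<!--word~hint-->`, as in the example sentence. The function names and the regular expression are illustrative and are not part of any DML specification.

```python
import re

# Matches an HTML comment holding "word~hint", e.g. <!--like~similar-->
DML_HINT = re.compile(r"<!--\s*([^~<>]+?)\s*~\s*([^<>]+?)\s*-->")


def extract_hints(text):
    """Return (word, hint) pairs in the order they appear in the text."""
    return [(m.group(1), m.group(2)) for m in DML_HINT.finditer(text)]


def strip_hints(text):
    """Return the text as a human reader sees it, without the hidden hints."""
    return re.sub(r"\s*" + DML_HINT.pattern, "", text)


sentence = (
    "This house <!--house~home--> is like <!--like~similar--> "
    "my aunt's home. I like <!--like~enjoy--> it."
)
hints = extract_hints(sentence)
visible = strip_hints(sentence)
```

A machine translator could use `hints` to pick the right sense of each tagged word, while `visible` matches what a browser would display.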
Now is an exciting time. I believe that the Worldwide Lexicon, and other projects like it, will make the language barrier history, not because of a breakthrough in a laboratory somewhere, but because people on the Internet will organize themselves to make this happen. This is the magic quality of the Internet. A simple idea, or a simple change to the way a process works, can spread around the world almost overnight. I think we are at the beginning of a transition to a truly multilingual world, and that this transition could happen very quickly.