Prof. Fraser, how does machine translation work?
Prof. Fraser: Machine translation is a two-step process. First, the model analyzes the source sentence, then it generates the translation. The part of the neural network responsible for the analysis step is called the “encoder”. It creates a numerical representation of the sentence – ideally with identical results for sentences like “I saw the dog” and “Ich sah den Hund”. From these representations, translations into any language can then be generated. If the system does not work correctly, an Upper Sorbian sentence and a German sentence with the same meaning will, for example, result in very different representations. In this case, the output cannot be correct. This can be checked by asking the model to translate a German sentence into Sorbian. We then use automatic systems to measure the similarity between a correct reference translation and the system’s candidate translation.
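The two checks described above – comparing encoder representations across languages, and comparing a system output against a reference translation – can be illustrated with a toy Python sketch. The vectors below are made-up stand-ins for real encoder outputs, and the n-gram overlap function is a drastically simplified version of metrics such as BLEU; neither reflects any specific system used in the project.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine similarity between two sentence representations (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up encoder outputs: a well-trained multilingual encoder should map
# both sentences to nearly the same point in vector space.
vec_en = [0.90, 0.10, 0.40]   # "I saw the dog"
vec_de = [0.88, 0.12, 0.41]   # "Ich sah den Hund"
print(cosine_similarity(vec_en, vec_de))  # very close to 1.0

def ngram_precision(hypothesis, reference, n=2):
    """Fraction of hypothesis n-grams that also occur in the reference –
    a toy simplification of automatic MT metrics such as BLEU."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp, ref = ngrams(hypothesis.split()), ngrams(reference.split())
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)
```

Production systems use learned multilingual embeddings and full metrics (BLEU, chrF, COMET), but the principle is the same: similar meaning should yield similar numbers.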
Upper Sorbian, a minority language spoken in the Lusatia region of eastern Germany, is the focus of your research project EPICAL. The aim is to keep low-resource languages alive. What fascinates you about the analysis of languages and texts?
Prof. Fraser: In the predecessor projects “Domain Adaptation for Statistical Machine Translation” and “Health in my Language”, we worked with medical texts for consumers – with a system that was trained exclusively on such texts and therefore only worked well in this domain. At that time, I had not yet focused on rare languages. But I soon realized that whether a model works does not depend on the subject area; it depends on the language you are translating into or out of. Especially for languages with few resources, there are often not enough parallel texts to train powerful machine translation systems.
How many languages worldwide are currently under threat?
Prof. Fraser: There are around 7,000 languages in the world – but one dies out every two weeks. Around 40 percent are considered endangered. According to SIL International (formerly the Summer Institute of Linguistics), a Christian organization that has translated the Bible into many languages, around 1,500 languages could disappear in the near future. Of course, it would be naive to think that machine translation for all 7,000 languages could prevent this. The decisive factors are the prestige of the language and whether it is actively used – especially by children.
What can EPICAL do here?
Prof. Fraser: We want to support language activists in using chatbots to write texts in their language. The more texts that are created, the better we can train the bots, and the better they are, the easier it is to write new texts. In other words, we are trying to improve the language models for these languages so that they are used more. For example, activists could create Wikipedia articles with AI support and then correct them manually. This trains the language models step by step, helps to achieve better encoding and ultimately enables faster work. I'm not saying that our technologies will save languages. But they can help to ensure that languages are perceived as modern and that their speakers can work with them.
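The bootstrapping cycle described above – the model drafts, activists correct, and the corrected texts become new training data – can be sketched as a simple loop. The functions `generate_draft`, `human_correct`, and `fine_tune` are hypothetical placeholders (toy stand-ins for the chatbot, the activist, and the training step), not part of any actual EPICAL tooling.

```python
# Toy placeholders: in reality these would be a language model, a human
# post-editor, and a fine-tuning run on the growing corpus.
def generate_draft(model, topic):
    return f"draft (model quality {model['quality']}) about {topic}"

def human_correct(draft):
    return draft.replace("draft", "corrected article")

def fine_tune(model, corpus):
    # Hypothetical: quality improves as the corpus of corrected texts grows.
    return {"quality": model["quality"] + len(corpus)}

def bootstrap(model, topics, rounds=3):
    """Human-in-the-loop cycle: each round produces more text in the
    language, which in turn trains a better model for the next round."""
    corpus = []
    for _ in range(rounds):
        drafts = [generate_draft(model, t) for t in topics]   # AI-assisted writing
        corrected = [human_correct(d) for d in drafts]        # manual correction
        corpus.extend(corrected)                              # new texts in the language
        model = fine_tune(model, corpus)                      # better model next round
    return model, corpus
```

The point of the loop is the feedback: every corrected article is both a usable text (e.g. a Wikipedia entry) and a training example.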
What languages other than Upper Sorbian are you working with?
Prof. Fraser: We are specifically looking for languages for which there is an active community of language activists. In addition, some texts should already be available, since we only work with text data. We are particularly interested in a broad linguistic diversity. Therefore, we also want to work with activists in Africa, South America and Asia.
Can the findings from EPICAL also be applied to other areas of machine learning?
Prof. Fraser: Yes, our research could, for example, improve the English version of ChatGPT, especially for specialized technological topics that are not yet well covered. In addition, architectures like the Transformer, which originated in natural language processing, influence many other areas of machine learning. If we manage to train better Transformers with less data, it could have far-reaching effects on all areas of machine learning. One example is medical image processing: systems for the automatic detection of tumors currently require a very large number of training images. If we succeed in training powerful models with much less data, this could significantly improve early detection.