Making less common EU languages more accessible
An EU-funded project has developed a cost-efficient, high-quality machine translation tool for less widely spoken European languages such as Croatian, facilitating communication and helping smaller companies enter new markets.
With 24 official languages and a range of regional ones, communicating in the European Union can sometimes be a challenge. This is particularly true for less common languages such as Croatian.
For these languages, the resources needed to develop a modern machine translation (MT) system are scarce. One efficient solution is to build such software automatically using free online material.
To help fill this gap, the EU-funded ABU-MATRAN project set out to develop a cost-efficient, high-quality and web-based MT tool.
“Although current MT approaches work in any language, they first need access to such resources as vast amounts of sentences in both the source and target languages,” says Antonio Toral, formerly of Dublin City University in Ireland, the ABU-MATRAN project coordinator, and now at the University of Groningen in the Netherlands. “For Europe’s under-resourced languages, these necessary resources may not exist, and acquiring them by manual translation would be too costly.”
New language, new system
The idea was born when Croatia officially joined the EU in 2013, bringing with it a new official language. At that time, ABU-MATRAN researchers developed an online MT system for English-Croatian based on publicly available resources.
The system uses a set of acquisition tools that allow it to automatically gather data from different types of resources, such as dictionaries. To do this, it primarily deploys web crawlers that pull the information from the internet.
“The ABU-MATRAN system was the first translator for these languages based on free, open-source technologies and immediately helped reduce the time and costs associated with translation between the two languages,” says Toral. “Using only datasets acquired in the project and publicly available MT machinery, we successfully built a system that rivals those of large IT corporations for this language pair.”
The project consortium, consisting of partners from industry and academia, started by identifying existing research tools not yet ready to be put on the market.
“We then worked together to identify industry needs, improve existing tools and prepare them for commercialisation,” says Toral. “In doing so, we also identified new needs that led to new research and solutions for addressing these needs.”
More languages added
This initial English-Croatian MT system was gradually improved by implementing new, more efficient translation techniques such as Neural Machine Translation (NMT). The project also developed a unique Croatian MT system for tourism.
From there, the project began to prepare for commercialisation, continuing to add other South Slavic languages, including Bosnian, Serbian and Slovenian. The techniques developed for these were then applied to other language pairs: English-Finnish, Spanish-Catalan and Spanish-Basque.
“By expanding to these other European languages, we demonstrated that the ABU-MATRAN system is applicable to very different language types,” says Toral. “Most importantly, the tools and techniques developed in this project have drastically reduced the cost of developing required language resources, thus lowering the barriers for SMEs to enter new markets.”
All results are publicly available under free/open-source licences.
ABU-MATRAN received funding through the EU’s Marie Skłodowska-Curie actions programme.