EU Science Hub

New multilingual resource for machine translation

Apr 09 2015

The European Parliament (EP) and the JRC have jointly released a very large collection of documents and their translations. This rich resource can be used to improve Machine Translation software and other Language Technology applications. The ‘Digital Corpus of the European Parliament’ (DCEP) consists of EP texts produced between 2001 and 2012. It contains over 1.3 billion words in 24 languages and covers text types as varied as press releases, EP reports, questions to the EP and the resulting answers, agendas, and more. DCEP was produced by the EP, with the support of the JRC and the Budapest University of Technology and Economics.

Texts produced by the European Institutions have the advantage that large parts have been translated into all 24 official EU languages, resulting in parallel text collections (documents and their manually produced translations) in up to 276 language pairs (e.g. Lithuanian-Portuguese). Such parallel texts, split into individual sentence pairs, are used by researchers and developers to build Machine Translation software, because computers can automatically learn from human translations how words and phrases are translated in different contexts. The more human translations are available, the better the automatically produced translations will be.
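
The figure of 276 language pairs follows from simple counting: with 24 languages, every unordered pair of two distinct languages is one language pair. The short Python sketch below illustrates that arithmetic. The function names and the four-language sample list are hypothetical and chosen only for illustration; they are not part of the DCEP distribution.

```python
from itertools import combinations


def unordered_pairs(n_languages: int) -> int:
    """Number of language pairs when direction does not matter (n choose 2)."""
    return n_languages * (n_languages - 1) // 2


def directed_pairs(n_languages: int) -> int:
    """Number of pairs when each translation direction is counted separately."""
    return n_languages * (n_languages - 1)


# 24 official EU languages give the number of pairs quoted in the text.
print(unordered_pairs(24))

# Illustrative subset of language codes (hypothetical sample, not the full list).
sample_languages = ["lt", "pt", "el", "et"]
sample_pairs = list(combinations(sample_languages, 2))
print(sample_pairs)
```

A translation system normally works in one direction only (e.g. Lithuanian to Portuguese), so counting each direction separately, as in `directed_pairs`, gives twice as many combinations.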

Parallel texts can also be used to automatically produce multilingual dictionaries and thesauri and to develop or improve automatic text analysis tools such as software for cross-lingual information retrieval, named entity recognition, co-reference resolution, discourse analysis, grammar and spelling checkers, and more.

Parallel texts are available in abundance for some language pairs (e.g. English-French), but for others (e.g. Greek-Estonian) almost none would exist if it were not for the texts produced by the EU institutions.

The JRC, an important developer of text analysis software with early access to EU document collections, has taken a prominent role worldwide in preparing and distributing such highly multilingual parallel corpora to the wider R&D community. It started with the JRC-Acquis in 2006 (resulting in first versions of 462 Machine Translation systems!) and has since supported the research and development community by making available the DGT-Acquis (2012) and various ‘Translation Memories’ (TM), i.e. databases of sentences and their translations to be used either by human translators or by software: the DGT-TM (since 2007), ECDC-TM (2012) and EAC-TM (2013).

The insight that simple collections of documents and their translations can be such a useful resource was acknowledged in Commission Decision 2011/833/EU of 12 December 2011 on the reuse of Commission Documents. Software tools that help people find information in large document collections, including in foreign languages, are thought to help European citizens be better informed (transparency, democracy) and to help businesses grow beyond their national or linguistic borders, while maintaining linguistic and cultural diversity as positive European assets.

Following the recent release of the DCEP, more European parallel text collections will be made available.