Blog

European Commission Digital



The MARCELL CEF Telecom Action aims to bring down linguistic barriers within the Digital Single Market.

One of the project's primary goals is to make digital service platforms, such as Online Dispute Resolution, eJustice, and Europeana, accessible in multiple languages. The eTranslation Building Block plays a key role in this endeavour.

eTranslation is provided by the Connecting Europe Facility and faces the daunting task of delivering quality machine translation (MT) services across all of the EU's Digital Service Infrastructures (DSIs) in all of the EU's official languages.

One challenge for MT is the scarcity of high-quality language data, which can lead to inaccurate translations. Ideally, the language data used to train an MT system should cover the specific domains that matter in citizens' lives, such as consumer rights and justice. National legislative texts are not automatically available to eTranslation, and current MT systems could improve if they had access to them.

The MARCELL Project

MARCELL's overall goal is to improve machine translation of national legislation (laws, decrees, regulations) in seven countries: Bulgaria, Croatia, Hungary, Poland, Romania, Slovakia, and Slovenia. The project provides large-scale monolingual legal data that can then be applied to eTranslation systems. It covers the entire body of national legislative documents in force in these seven EU Member States.

Because the Member States' national legislation is not automatically available to the European Commission (EC), MARCELL relied mainly on EU legislation for its training sessions. Legal texts vary widely in content. The documents in the seven monolingual data sets are classified into the 21 top-level domains of EUROVOC, the official EU multilingual ontology-based thesaurus. These domains include politics, economics, trade, education, communication, and science. The classification will thus yield 21 thematic sub-corpora in each language.
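The grouping step described above can be illustrated with a minimal Python sketch. It is not MARCELL's actual code: the record layout (`id`, `lang`, `domains`), the sample documents, and the assumption that one document may carry more than one top-level domain label are all hypothetical. The domain names are taken from the examples listed in the text.

```python
from collections import defaultdict

def build_subcorpora(documents):
    """Group document ids by (language, top-level domain).

    Each document is a dict with hypothetical keys:
    'id', 'lang', and 'domains' (a list of top-level domain labels).
    """
    subcorpora = defaultdict(list)
    for doc in documents:
        for domain in doc["domains"]:
            # A document with several labels lands in several sub-corpora.
            subcorpora[(doc["lang"], domain)].append(doc["id"])
    return dict(subcorpora)

# Made-up sample records, for illustration only.
SAMPLE_DOCS = [
    {"id": "hr-001", "lang": "hr", "domains": ["trade"]},
    {"id": "hr-002", "lang": "hr", "domains": ["education", "science"]},
    {"id": "pl-001", "lang": "pl", "domains": ["trade"]},
]

subcorpora = build_subcorpora(SAMPLE_DOCS)
```

With all 21 domains populated, each language would end up with 21 keys of this kind, one per thematic sub-corpus.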

What is eTranslation?

eTranslation is an automated translation tool for translating text snippets or full documents.

The tool covers more than 30 languages, including Russian and simplified Chinese, across different domains. Users can also integrate eTranslation into their own systems to make digital content and services multilingual and accessible to anyone in the EU.


 

Data Collection and Curation

The total number of sentences collected across the seven languages has reached 30 000, ranging from 1 000 to 10 000 per language, and these numbers will continue to grow.

New national legislative texts appear every day in each of the seven countries. To keep up, the consortium has built processing chains (pipelines) that periodically collect new legislative texts from the official national providers, using push or pull techniques.

Each pipeline then converts the collected texts into a suitable format and gathers all relevant metadata. MARCELL processes the texts and delivers them to the existing ELRC-SHARE repository, which feeds the eTranslation systems with training material.
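As a rough illustration of the collect, convert, annotate, and deliver sequence described above, here is a minimal Python sketch. It is an assumption-laden outline, not the consortium's implementation: the `LegalText` record, the whitespace-only conversion, the metadata fields, and the `fetch_new` and `deliver` callables are all hypothetical stand-ins, and no real provider or ELRC-SHARE interface is modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

@dataclass
class LegalText:
    """Hypothetical record for one collected legislative text."""
    doc_id: str
    raw: str
    language: str
    text: str = ""
    metadata: dict = field(default_factory=dict)

def convert(doc: LegalText) -> LegalText:
    # Stand-in for format conversion: collapse runs of whitespace.
    doc.text = " ".join(doc.raw.split())
    return doc

def add_metadata(doc: LegalText) -> LegalText:
    # Stand-in for metadata gathering; the fields are illustrative only.
    doc.metadata = {
        "id": doc.doc_id,
        "language": doc.language,
        "length": len(doc.text),
    }
    return doc

def run_pipeline(
    fetch_new: Callable[[], Iterable[LegalText]],
    deliver: Callable[[LegalText], None],
) -> int:
    """Pull new texts, process each one, and hand it to `deliver`.

    Returns the number of documents delivered in this run.
    """
    count = 0
    for doc in fetch_new():
        deliver(add_metadata(convert(doc)))
        count += 1
    return count
```

In this sketch, a pull-style run would call `run_pipeline` on a schedule with a `fetch_new` function that queries a provider, while a push-style setup would instead invoke the same processing steps whenever a provider sends a new text.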

Expected Results

As a result, MARCELL will produce:

  1. Seven large-scale pre-processed corpora of national legislation, classified into EUROVOC top-level domains and annotated with EUROVOC and IATE terms.
  2. Translated comparable legal texts in seven languages, aligned with the top-level domains identified by EUROVOC descriptors.
  3. A Croatian-English parallel corpus comprising 1,800 legislative documents.
  4. A set of seven pipelines for processing and feeding new legislative documents in the seven languages concerned.

As the most recent EU official language, Croatian is six to nine years behind in the systematic accumulation of translation memories (TMs). 

To close this gap, the Croatian-English Parallel Corpus of Croatian National Legislation was set up, with legal texts dating from 1990 to 2019. So far, MARCELL has translated 1,800 documents into English.

Future Steps

As MARCELL resources become available to train eTranslation engines, one can expect noticeable improvements in output quality when translating legal texts into any of the seven languages.

Besides the expected general improvement of the MT system in the seven languages concerned, MARCELL will bring significant benefits to both the eJustice and the Online Dispute Resolution platforms, because its resources focus on national legislation, which is directly relevant to both of these DSIs.

How can CEF help you?

At the Connecting Europe Facility, we give you access to free tools, support, and funding to build your digital services.