CEF eTranslation: Polish NAP Interview "The system never lacks words"


The following is an interview with Ms Anna Kotarska, Public Services National Anchor Point (NAP) in the European Language Resource Coordination (ELRC) programme.

You are welcome to read the interview in its original Polish or in the English translation below. The views expressed are those of Ms Kotarska and not of the European Commission.



First published in "IT w Administracji" 2019 vol. 7-8: www.itwadministracji.pl

The system never lacks words

The overall amount of funding available for reducing the language barriers to the Digital Single Market, the promotion of European multilingualism and the protection of languages at risk of digital extinction has increased considerably.

Interview with Ms Anna Kotarska, Public Services National Anchor Point (NAP) in the European Language Resource Coordination (ELRC) programme

eTranslation is the main platform currently used by public administrations in the European Union (EU) to translate texts. What is its significance in the context of the other core services?

The CEF AT platform (“Connecting Europe Facility Automated Translation”) can be used by public administrations in the EU, as well as in Iceland and Norway, to translate documents and text via a website. One of the main users of the tool is the European Commission (EC), where more than 3 million pages are translated per year. eTranslation can also be used as part of digital services, enabling public administrations, citizens and businesses in the EU to use those services in the language of their choice. It is now integrated with more than 50 digital services, for example the European Union Open Data Portal, the ODR platform and eJustice.

Among the EU core services, the CEF Building Blocks, eTranslation is less widely known than the other services in the group, such as eID, eSignature, eDelivery or eInvoicing. These services can be combined in cross-border digital services. For example, eTranslation was used in the iADAATPA project to build a dynamic router that switches between various domain-specific translation engines to obtain the highest machine translation quality. In this case, eTranslation was used together with the eDelivery Building Block. In general, the Building Blocks aim to reduce barriers to the development of the European Digital Single Market and to improve the efficiency of communication and services in order to increase the competitiveness of the Union’s economy.

The eTranslation platform was launched in November 2017 and replaced the earlier MT@EC system. What are the advantages of neural machine translation compared with the translation technology used in MT@EC?

I will try to describe it in simple terms. Neural machine translation (NMT) has two main advantages: the text produced is more natural and idiomatic, and it is much more correct than the output of previous generations of machine translation systems, e.g. the statistical machine translation (SMT) offered under MT@EC. NMT predicts what the next word in the sentence will be, which makes its sentences more similar to those produced by humans. Above all, however, the data, that is, the words used to train the translation engines, are represented by numbers (vectors). Words that have a similar meaning are represented by similar numbers, which allows the system to detect words with similar meanings (based on the context in which they occur). This, in turn, improves the quality of translation: the system never lacks words, because it can always find words whose meaning is close to that of the word needed. The system can also replace a given word with a synonym, which makes the translation sound more natural.
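The idea that words with similar meanings are represented by similar vectors can be illustrated with a minimal Python sketch. The three-dimensional vectors and the names `EMBEDDINGS`, `cosine_similarity` and `nearest_word` are invented for illustration; real engines learn vectors with hundreds of dimensions from their training data, and nothing here reflects how eTranslation itself is implemented.

```python
import math

# Toy word vectors; the values are made up purely for illustration.
EMBEDDINGS = {
    "car": [0.90, 0.10, 0.05],
    "automobile": [0.85, 0.15, 0.10],
    "vehicle": [0.70, 0.30, 0.20],
    "banana": [0.05, 0.90, 0.40],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(word, embeddings=EMBEDDINGS):
    """Return the other word whose vector is most similar to that of `word`."""
    target = embeddings[word]
    candidates = (w for w in embeddings if w != word)
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


if __name__ == "__main__":
    print(nearest_word("car"))
```

With these toy values, the closest neighbour of "car" is "automobile", while "banana" scores far lower. This is the sense in which a system of this kind can always find a word close in meaning to the one it needs.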

How does the eTranslation platform differ from other tools that use NMT?

Most machine translation platforms available on the market (Google Translate, DeepL, Bing Translator, Facebook) use very similar software, which is often also published as source code, allowing translation techniques to be compared. The main difference lies in the data each institution holds and in the way those data are filtered. Google, for example, relies mainly on parallel texts found on websites, whereas eTranslation uses the translation memories of the European institutions, dating back to the 1960s. The linguistic data used by eTranslation are arguably one of the largest collections in the world, if not the largest, of texts translated almost exclusively at the request of the European institutions, with manually aligned sentences. This made it possible to train engines that are particularly useful for translating administrative or legal texts and, where the institution concerned provided larger training corpora, to create specialised (domain-specific) engines.

I would also like to draw your attention to the most important feature of the CEF AT platform, namely data security. With the commercial tools publicly available on the market, we hand over our textual data, and the rights to the texts, to practically unknown users in exchange for free access to the tool. By contrast, the EU tool, which is based on the dedicated TESTA-ng Data Communication Network Service, ensures the security of translated texts, while the intellectual property rights to the translation remain with the owner of the text submitted for translation.

A new version of the system — eTranslation v.2.7 — was made available in June 2019. What are the new features of this version?

A considerable number of changes have recently been made, including upgrades to pre- and post-processing for all engines, cleaning of spaces, normalisation of quotes and apostrophes, correction of regular expressions, and avoiding capitalisation within a compound. New Court of Justice Case Law domain-specific engines have been released for a number of language pairs with French. The efficiency of the domain-specific engines and of the engine dedicated to the Bundesbank has been improved. The responsiveness of the website has been increased, the statistical analysis of the use of the CAT (Computer-Assisted Translation) tool has been modified, and improvements to some cloud services have been introduced. For Polish users, earlier changes were more visible, namely the shift in March 2018 from statistical machine translation (SMT) to neural machine translation (NMT) and significant increases in engine efficiency that reduced translation time.
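As a rough illustration of what pre-processing steps such as cleaning of spaces and normalisation of quotes and apostrophes might involve, here is a minimal Python sketch. The function name `normalise` and the particular character mapping are assumptions made for this example; the actual rules used by eTranslation are not described in the interview.

```python
import re

# Hypothetical mapping of typographic quotes and apostrophes to plain ASCII.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201A": "'",   # single low-9 quotation mark
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
    "\u201E": '"',   # double low-9 quotation mark
    "\u00AB": '"',   # left-pointing guillemet
    "\u00BB": '"',   # right-pointing guillemet
})


def normalise(text):
    """Normalise quotes and apostrophes, then clean up spacing."""
    text = text.translate(QUOTE_MAP)
    text = text.replace("\u00A0", " ")   # non-breaking space -> ordinary space
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return text.strip()


if __name__ == "__main__":
    print(normalise("  \u201CHello\u201D   world\u2019s\u00A0end "))
```

Steps of this kind reduce superficial variation in the input, so that the engine sees one consistent form of each character rather than several typographic variants.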

 [NB: a detailed description of the eTranslation version 2.7 was published here:
https://ec.europa.eu/digital-building-blocks/sites/display/DIGITAL/2019/06/19/CEF+eTranslation+Upgrade%3A+eTranslation+2.7 ]

How is the risk of errors in machine translation minimised?

NMT generally delivers better results than SMT in terms of the fluency and grammatical correctness of the translation, but it also introduces serious translation errors, does not cope well with terminology, and does not generate a coherent translation at the level of the whole text. Nor does it cope sufficiently well with words that are not present in the training corpora; in such cases NMT is “creative”, i.e. it creates non-existent words. There are also typically problems with abbreviations, personal names, context-dependent or out-of-context expressions, and idiomatic and metaphorical language. As these are inherent features of the system, the only way to eliminate such errors completely is to have the translation verified by a qualified and experienced translator.

Only those issues that can be captured in rules can be solved by IT tools for quality control, such as checking that the names of countries are translated correctly. One example of such a solution is the tool for translation quality assessment and automatic post-editing (APE) created under the APE-QUEST project, which has been incorporated into the Electronic Exchange of Social Security Information (EESSI) workflow and the Online Dispute Resolution (ODR) platform.

The term ‘Post-Edited Machine Translation’, which describes a corrected or modified machine translation, is used in the industry. From a commercial point of view, it is a different product from human translation, also delivered with the use of CAT tools. The requirements for the process are set out in the ISO 18587 standard. Owing to the increased use of machine translation tools and the need to correct or edit their output, a new profession has also emerged, that of the so-called “post-editor”, and the product itself is known as ‘post-editese’.
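A rule-based quality check of the kind mentioned in the answer above, verifying that country names are translated correctly, might look roughly like the following Python sketch. The glossary, the function name `find_country_issues` and the short list of accepted Polish forms are hypothetical and far from exhaustive; this is not the APE-QUEST tool, only an illustration of the principle.

```python
# Hypothetical English-to-Polish glossary. Polish nouns are inflected, so each
# entry lists a few accepted forms; a real glossary would need many more.
GLOSSARY = {
    "Germany": ("Niemcy", "Niemiec", "Niemczech"),
    "France": ("Francja", "Francji", "Francję"),
}


def find_country_issues(source, translation, glossary=GLOSSARY):
    """Return country names found in `source` with no accepted form in `translation`."""
    issues = []
    source_lower = source.lower()
    translation_lower = translation.lower()
    for english, accepted_forms in glossary.items():
        if english.lower() not in source_lower:
            continue  # this country is not mentioned in the source text
        if not any(form.lower() in translation_lower for form in accepted_forms):
            issues.append(english)
    return issues


if __name__ == "__main__":
    print(find_country_issues(
        "Germany and France signed the agreement.",
        "Francja podpisała umowę.",
    ))
```

A check like this can only flag what has been written into its rules, which is why such tools complement, rather than replace, review by a qualified translator.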

What are the plans for the development of language technologies in EU programmes?

The European Commission promotes the development of language technologies and their applications, as well as research in this area, in a number of ways. As far as the Connecting Europe Facility (CEF) is concerned, information on a new tender was published on 12 July: Action on CEF Automated Translation (SMART 2019/1083). Its general objectives include the operation, maintenance and continuous modernisation of the CEF AT platform, its deployment on a larger scale in all EU Member States, and the coordination of actions and activities related to language technologies. The EC intends to allocate EUR 2.5 million to these objectives over the next two years (2020-21), including the continuation of the ELRC programme and the work of the Language Resources Board (LRB). It is worth noting the new objectives that have appeared among the proposed tasks, such as the creation of tools for automatic speech recognition and automatic text summarisation.

In addition to the aforementioned EUR 2.5 million in this year’s CEF Telecom call, EUR 4 million has been earmarked for projects involving the collection of language resources, language tools and integration projects. As a result, the total funding for reducing language barriers to the Digital Single Market, promoting European multilingualism and protecting languages at risk of digital extinction has increased considerably.

The Digital Europe programme envisages the financing of modern technologies, including capacity building for AI-based solutions; here I have in mind the tools and resources for natural language processing (NLP). Under the next financial framework, 2021-2027, scientific projects on multilingualism will be funded under the Horizon Europe programme, where priority will be given to protecting languages understood as cultural heritage and to research aimed at achieving transparency in the digital language.


Anna Kotarska is the Public Services NAP in the European Language Resource Coordination (ELRC) programme, a function carried out on the basis of a nomination by the Directorate-General for Communication Networks, Content and Technology (DG CONNECT) of the European Commission. She graduated in English Philology from the University of Gdańsk and completed a postgraduate programme in translation at the University of Warsaw. She also studied corporate finance (Gdańsk University of Technology/ESC de Rouen) and logistics (WSL in Poznań), and took an eMBA course in healthcare.

She is a member of the Polish Society of Sworn and Specialized Translators TEPIS and works part-time as a specialised translator from English into Polish. She was also a translation project coordinator at the Gdańsk University of Technology and at the Polish National Health Fund.