Innovation and Networks Executive Agency


INEA ceased operations on 31 March 2021. The European Health and Digital Executive Agency (HaDEA) was established on 1 April 2021 to take over the CEF Telecom legacy portfolio as well as additional EU funding programmes.
Broader Web-Scale Provision of Parallel Corpora for European Languages
CEF Telecom
Call year:
Location of the Action:
Implementation schedule: September 2018 to September 2020
Maximum EU contribution: 

University of Edinburgh (United Kingdom)

Additional information: 

Digital Single Market (DSM) strategy

DSM - Connecting Europe Facility

CEF Digital portal

Innovation and Networks Executive Agency (INEA)

Automated Translation

Last modified: April 2022


This Action aims to collect translated sentences from the web for all 24 official EU languages plus Icelandic, Norwegian, Basque, Catalan/Valencian, and Galician. These translations will be mined from a collection of web pages approximately 1 petabyte in size. The system will extract web pages in HyperText Markup Language (HTML) as well as files in Portable Document Format (PDF), using embedded text where available and optical character recognition otherwise.
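
As an illustration of the HTML side of that extraction step, the following is a minimal sketch using only the Python standard library. It is not the Action's actual pipeline, which this page does not describe; the names `VisibleTextExtractor` and `html_to_text` are hypothetical, and PDF handling and optical character recognition are left out because they would require third-party tools.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping script and style content (hypothetical helper)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # non-empty text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `html_to_text(page_source)` on a downloaded page yields plain text that a later stage could split into sentences and align across languages.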

The Action builds on the CEF-funded Action 2016-EU_IA-0114.

The resulting parallel corpora will be made freely available, including to eTranslation. To aid in customising machine translation systems, a free software tool will perform domain filtering and weighting of the corpus. The tool will accept examples of in-domain data provided by the user and filter the large corpus by relevance to that domain.
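
The idea of filtering a corpus by relevance to user-supplied in-domain examples can be sketched as follows. This is only an illustration under stated assumptions: the page does not say how the tool scores relevance, so the sketch swaps in a simple bag-of-words cosine similarity, and the function names and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_domain(pairs, in_domain_examples, threshold=0.2):
    """Keep sentence pairs whose source side resembles the in-domain examples.

    Returns (score, source, target) tuples sorted by descending score.
    Assumption: relevance is measured on the source side only.
    """
    # Build one term-count profile from all in-domain examples.
    profile = Counter()
    for example in in_domain_examples:
        profile.update(tokenize(example))

    scored = []
    for source, target in pairs:
        score = cosine(Counter(tokenize(source)), profile)
        if score >= threshold:
            scored.append((score, source, target))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

The returned score could also serve as a per-pair weight during machine translation training rather than as a hard cut-off, which matches the "filtering and weighting" described above. Very common words inflate the similarity in this sketch, so a real system would likely need stop-word handling or a stronger relevance model.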