Innovation and Networks Executive Agency

2017-EU-IA-0178

INEA ceased operations on 31 March 2021. The European Health and Digital Executive Agency (HaDEA) was established on 1 April 2021 to take over the CEF Telecom legacy portfolio as well as additional EU funding programmes.
Broader Web-Scale Provision of Parallel Corpora for European Languages
Programme: 
CEF Telecom
Call year:
Location of the Action:
Implementation schedule: 
September 2018 to September 2020
Maximum EU contribution: 
€907,976
Total eligible costs: 
€1,210,634
Percentage of EU support: 
75%
Coordinator: 

University of Edinburgh (United Kingdom)
https://www.ed.ac.uk/

Status:
DSI:
Additional information: 

Digital Single Market (DSM) strategy
http://ec.europa.eu/priorities/digital-single-market

DSM - Connecting Europe Facility
http://ec.europa.eu/digital-single-market/connecting-europe-facility

CEF Digital portal
https://ec.europa.eu/cefdigital

Innovation and Networks Executive Agency (INEA)
http://inea.ec.europa.eu

Automated Translation
https://ec.europa.eu/digital-single-market/en/automated-translation

Last modified: 
August 2021

This Action aims to collect translated sentences from the web for all 24 official EU languages plus Icelandic, Norwegian, Basque, Catalan/Valencian, and Galician. These translations will be mined from a large collection of web pages, approximately 1 petabyte in size. The system will extract web pages in Hypertext Markup Language (HTML) as well as files in Portable Document Format (PDF), using embedded text where available and optical character recognition (OCR) otherwise.
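The "text where available, OCR otherwise" rule can be illustrated with a minimal sketch. The Action's actual extraction code is not described here, so the function name `extract_text` and the two injected callables `read_text_layer` and `run_ocr` are hypothetical placeholders for whatever PDF parser and OCR engine a real system would use.

```python
from typing import Callable, Optional, Tuple


def extract_text(
    document: bytes,
    read_text_layer: Callable[[bytes], Optional[str]],  # hypothetical: returns embedded text or None
    run_ocr: Callable[[bytes], str],                     # hypothetical: wraps an OCR engine
    min_chars: int = 1,
) -> Tuple[str, str]:
    """Return (text, method), preferring embedded text and falling back to OCR."""
    text = read_text_layer(document)
    # Use the embedded text only if it contains enough non-whitespace content.
    if text is not None and len(text.strip()) >= min_chars:
        return text, "text"
    # Otherwise (for example, a scanned PDF with no text layer) fall back to OCR.
    return run_ocr(document), "ocr"
```

Passing the two extractors in as arguments keeps the fallback decision separate from any particular library, which also makes it easy to exercise with stub functions.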

The Action builds on the CEF-funded Action 2016-EU-IA-0114.

The resulting parallel corpora will be made freely available, including to eTranslation. To aid in customising machine translation systems, a free software tool will perform domain filtering and weighting of the corpus. The tool will accept examples of in-domain data provided by the user and filter the large corpus by relevance to that domain.
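As a rough illustration of domain filtering and weighting, the sketch below scores each sentence pair by how similar its source side is to user-supplied in-domain examples. The scoring method (cosine similarity over bag-of-words counts) is an assumption chosen for brevity, not the method used by the Action's tool, and the names `score_pairs` and `filter_by_domain` and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase word tokens; \w matches Unicode letters in Python 3.
    return re.findall(r"\w+", text.lower())


def word_counts(texts):
    # Bag-of-words counts over a list of texts.
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def cosine(a, b):
    # Cosine similarity between two count vectors; 0.0 if either is empty.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_pairs(in_domain_examples, corpus_pairs):
    # Weight each (source, target) pair by relevance to the in-domain profile.
    profile = word_counts(in_domain_examples)
    return [
        (cosine(word_counts([src]), profile), src, tgt)
        for src, tgt in corpus_pairs
    ]


def filter_by_domain(in_domain_examples, corpus_pairs, threshold=0.2):
    # Keep only pairs whose relevance weight reaches the (assumed) threshold.
    scored = score_pairs(in_domain_examples, corpus_pairs)
    return [item for item in scored if item[0] >= threshold]


if __name__ == "__main__":
    examples = [
        "The patient received a dose of the vaccine.",
        "Side effects of the medicine include headache.",
    ]
    pairs = [
        ("The vaccine dose was given to the patient.",
         "Der Patient erhielt die Impfdosis."),
        ("Football match tonight.", "Fußballspiel heute Abend."),
    ]
    for weight, src, tgt in filter_by_domain(examples, pairs):
        print(f"{weight:.2f}\t{src}\t{tgt}")
```

The returned scores can serve as corpus weights during training, while the threshold gives a hard filter. A practical tool would likely rely on a stronger relevance model than raw word overlap.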