Innovation and Networks Executive Agency

2017-EU-IA-0178

Broader Web-Scale Provision of Parallel Corpora for European Languages
Programme: 
CEF Telecom
Call year:
Location of the Action:
Implementation schedule: 
September 2018 to September 2020
Maximum EU contribution: 
€907,976
Total eligible costs: 
€1,210,634
Percentage of EU support: 
75%
Coordinator: 

University of Edinburgh (United Kingdom)
https://www.ed.ac.uk/

Status:
DSI:
Additional information: 
Last modified: 
December 2019

This Action aims to collect translated sentences from the web for all 24 official EU languages plus Icelandic, Norwegian, Basque, Catalan/Valencian, and Galician. These translations will be mined from a large collection of web pages, approximately 1 petabyte in size. The system will process web pages in Hypertext Markup Language (HTML) as well as files in Portable Document Format (PDF), using embedded text where available and optical character recognition otherwise.

The Action builds on the CEF-funded Action 2016-EU-IA-0114.

The resulting parallel corpora will be made freely available, including to eTranslation. To aid in customising machine translation systems, a free software tool will perform domain filtering and weighting of the corpus. The tool will accept examples of in-domain data provided by the user and filter the large corpus by relevance to that domain.
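
As an illustration only, domain filtering of this kind can be sketched in a few lines of Python. The fact sheet does not say how the tool measures relevance, so the sketch below assumes one common approach: scoring each sentence pair by the difference in cross-entropy between a small in-domain language model and a general one (in the style of Moore and Lewis), here with simple add-one-smoothed unigram models. The function names `score_pairs` and `filter_by_domain`, the `keep` parameter, and the sample sentences are all hypothetical and are not taken from the Action's software.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercased whitespace tokenisation (a real system would use a proper tokeniser)."""
    return text.lower().split()


def unigram_logprob(sentences, vocab_size):
    """Build an add-one-smoothed unigram model and return its log-probability function."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(counts.values())

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab_size))

    return logprob


def score_pairs(in_domain, pairs):
    """Score each (source, target) pair by per-token cross-entropy difference.

    Only the source side is scored. A higher score means the sentence looks
    more like the in-domain examples than like the corpus as a whole.
    """
    sources = [src for src, _ in pairs]
    vocab = {tok for s in list(in_domain) + sources for tok in tokenize(s)}
    lp_in = unigram_logprob(in_domain, len(vocab))
    lp_general = unigram_logprob(sources, len(vocab))

    scores = []
    for src in sources:
        tokens = tokenize(src)
        if not tokens:
            scores.append(float("-inf"))  # empty sentences rank last
            continue
        diff = sum(lp_in(t) - lp_general(t) for t in tokens) / len(tokens)
        scores.append(diff)
    return scores


def filter_by_domain(in_domain, pairs, keep=0.5):
    """Return the top `keep` fraction of pairs, each with its relevance score."""
    scored = sorted(zip(pairs, score_pairs(in_domain, pairs)),
                    key=lambda item: item[1], reverse=True)
    n_keep = max(1, int(len(scored) * keep))
    return scored[:n_keep]


if __name__ == "__main__":
    # Assumed toy data: two in-domain (medical) examples and a four-pair corpus.
    in_domain = ["the patient received a dose of the drug",
                 "the drug dose was reduced"]
    corpus = [("the patient was given the drug", "le patient a reçu le médicament"),
              ("the football match ended in a draw", "le match de football s'est terminé par un nul"),
              ("a dose of the drug was administered", "une dose du médicament a été administrée"),
              ("the team won the football match", "l'équipe a gagné le match de football")]
    for (src, tgt), score in filter_by_domain(in_domain, corpus, keep=0.5):
        print(f"{score:+.3f}\t{src}\t{tgt}")
```

The scores returned alongside each pair could also serve as weights rather than a hard cut-off, which corresponds to the "weighting" use mentioned above; how the actual tool combines the two is not stated here.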