Innovation and Networks Executive Agency

2018-EU-IA-0063

Continued Web-Scale Provision of Parallel Corpora for European Languages
Programme: CEF Telecom
Implementation schedule: October 2019 to September 2021
Maximum EU contribution: €889,649
Total eligible costs: €1,186,199
Percentage of EU support: 75%
Coordinator: University of Edinburgh (United Kingdom), https://www.ed.ac.uk/
Last modified: December 2019

This language resource Action aims to improve and expand the parallel corpora developed in two CEF-funded Actions: ParaCrawl-1 (Action No 2016-EU-IA-0114) and ParaCrawl-2 (Action No 2017-EU-IA-0178). These previous Actions have already resulted in the release of the largest publicly available parallel corpora to date for all EU/EEA official languages paired with English, as well as a complete end-to-end open-source software toolkit for crawling and extraction.

This Action will offer improved extraction software capable of efficiently processing an even larger portion of the Web (more than 1 petabyte of compressed data). At the same time, it will apply state-of-the-art neural methods to the detection of parallel sentences and to the processing of the extracted corpora. Special emphasis will be placed on collecting larger corpora for language pairs that are currently under-resourced.

The corpora will be made more useful for training machine translation (MT) systems by post-processing the data to split long sentences, repair broken sentences and synthesise new sentences. The new corpus releases will be made available via a data portal that will allow users building MT systems to select the types of text that best fit their purpose.