University of Edinburgh (United Kingdom)
Digital Single Market (DSM) strategy
DSM - Connecting Europe Facility
CEF Digital portal
Innovation and Networks Executive Agency (INEA)
This Action aims to collect translated sentences from the web for all 24 official EU languages plus Icelandic, Norwegian, Basque, Catalan/Valencian, and Galician. These translations will be mined from a large collection of web pages, approximately 1 petabyte in size. The system will extract text from web pages in Hypertext Markup Language (HTML) and from files in Portable Document Format (PDF), using the embedded text where available and optical character recognition otherwise.
The Action builds on the CEF-funded Action 2016-EU_IA-0114.
The resulting parallel corpora will be made freely available, including to eTranslation. To aid in customising machine translation systems, a free software tool will perform domain filtering and weighting of the corpus. The tool will accept examples of in-domain data provided by the user and filter the large corpus by relevance to that domain.
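The domain-filtering idea described above can be illustrated with a minimal sketch. It is not the Action's tool, whose method is not specified here. It assumes one common approach: each corpus sentence is scored by how much more likely its words are under a model built from the user's in-domain examples than under a model built from the whole corpus, and the highest-scoring sentences are kept. The function names, the whitespace tokenizer, the unigram models, and the `keep_fraction` parameter are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


def unigram_logprob_fn(sentences, vocab_size):
    """Return a function giving add-one-smoothed unigram log-probabilities."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(counts.values())

    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab_size))

    return logprob


def rank_by_domain_relevance(in_domain, general_corpus):
    """Score each corpus sentence by its mean per-token log-likelihood ratio
    (in-domain model vs. general model) and sort from most to least relevant."""
    vocab = {tok for s in in_domain + general_corpus for tok in tokenize(s)}
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens
    lp_in = unigram_logprob_fn(in_domain, vocab_size)
    lp_gen = unigram_logprob_fn(general_corpus, vocab_size)

    scored = []
    for sentence in general_corpus:
        toks = tokenize(sentence)
        if not toks:
            continue  # skip empty lines
        score = sum(lp_in(t) - lp_gen(t) for t in toks) / len(toks)
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def filter_corpus(in_domain, general_corpus, keep_fraction=0.5):
    """Keep the top `keep_fraction` of sentences by relevance score."""
    ranked = rank_by_domain_relevance(in_domain, general_corpus)
    keep = max(1, int(len(ranked) * keep_fraction))
    return [sentence for _, sentence in ranked[:keep]]


if __name__ == "__main__":
    examples = [
        "the patient received a dose of the vaccine",
        "clinical trial of the vaccine",
    ]
    corpus = [
        "the vaccine dose was approved after the clinical trial",
        "the football match ended in a draw",
        "the patient was given a second dose",
        "stock markets fell sharply on monday",
    ]
    for score, sentence in rank_by_domain_relevance(examples, corpus):
        print(f"{score:+.3f}  {sentence}")
```

The scores could equally be used as weights rather than as a hard cut-off, which corresponds to the "weighting" use mentioned above. For a parallel corpus, the same scoring would presumably be applied to one or both sides of each sentence pair.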