University of Edinburgh (United Kingdom)
EuroPat (Unleashing European Patent Translations) will mine parallel corpora from patents by aggregating, aligning, and converting patent data. The targeted language pairs combine English with each of the following languages: Croatian, Norwegian (Bokmål), German, Polish, Spanish, and French. Icelandic may be added, contingent on an agreement with the Icelandic Patent Office and on the size of the data it can provide.
The aim of the Action is to prepare clean, processed parallel corpora in the patent domain. The choice of domain is justified by the high quality of patent translations, the large volume of available data, and permissive copyright terms. Moreover, patents are a rich source of technical vocabulary, product names, and person names that complement other data sources. In addition to ingesting European Patent Office (EPO) data in many languages, the Action also targets data from the Croatian and Norwegian national patent offices.
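As a rough illustration of what "clean" can mean for a parallel corpus, the sketch below applies a few common heuristic filters to already aligned sentence pairs: whitespace normalisation, removal of empty or overlong segments, a length-ratio check, removal of untranslated pairs, and de-duplication. It is not the Action's actual pipeline; the function name `clean_pairs` and the thresholds `max_ratio` and `max_tokens` are assumptions chosen for the example.

```python
from typing import Iterable, Iterator, Tuple


def clean_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 3.0,   # assumed threshold, not taken from the project
    max_tokens: int = 200,    # assumed threshold, not taken from the project
) -> Iterator[Tuple[str, str]]:
    """Yield sentence pairs that pass simple heuristic quality filters."""
    seen = set()
    for src, tgt in pairs:
        # Collapse runs of whitespace and trim both sides.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        if not src or not tgt:
            continue  # one side is empty
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src > max_tokens or n_tgt > max_tokens:
            continue  # overlong segment, often a failed sentence split
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths too different, likely misaligned
        if src == tgt:
            continue  # identical sides, likely left untranslated
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        yield src, tgt
```

For example, passing a list of English-German pairs through `list(clean_pairs(pairs))` keeps only the first copy of each well-formed pair. Production pipelines usually add language identification and model-based scoring on top of such rules.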
The Action will contribute to CEF eTranslation by providing good-quality data. Because neural machine translation (NMT) engines are particularly sensitive to data quality, they perform better when trained on clean, high-quality data.