In October 2012, the European Union (EU) agency 'European Centre for Disease Prevention and Control' (ECDC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, in twenty-five languages. The data is distributed via the web pages of the EC's Joint Research Centre (JRC). Here we describe this resource, which is called the ECDC Translation Memory, or ECDC-TM for short.
Translation Memories are parallel texts, i.e. texts and their manually produced translations. They are also referred to as bi-texts. A translation memory is a collection of small text segments and their translations (referred to as translation units, TU). These TUs can be sentences or parts of sentences. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.
Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:
- training automatic systems for statistical machine translation (SMT);
- producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
- training and testing multilingual information extraction software;
- checking translation consistency automatically;
- testing and benchmarking alignment software (for sentences, words, etc.).
The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages are abundant, there are few or no parallel corpora for most language pairs. The most outstanding advantage of the various parallel corpora available via our web pages - apart from them being freely available - is the number of rare language pairs (e.g. Maltese-Estonian, Slovenian-Finnish, etc.).
The ECDC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of public health. Also, it includes translation units for the languages Irish (Gaeilge, GA), Norwegian (Norsk, NO) and Icelandic (IS).
ECDC-TM covers 25 languages: the 23 official languages of the EU plus Norwegian (Norsk) and Icelandic. ECDC-TM was created by translating from English into the following 24 languages: Bulgarian, Czech, Danish, Dutch, Estonian, Gaeilge (Irish), German, Greek, Finnish, French, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian (Norsk), Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish. The JRC then combined these 24 translation memory files into one large translation memory, which also makes it possible to extract translation units for other language pairs.
All documents and sentences were thus originally written in English. They were then translated into the other languages by professional translators from the Translation Centre CdT in Luxembourg.
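Because every translation shares the same English source, bilingual memories can be joined on the English segment to obtain pairs that were never translated directly. The following is a minimal Python sketch of that idea, not the procedure the JRC actually used, which this page does not describe. The function names `combine_on_english` and `extract_pair`, the in-memory data layout and the language codes are illustrative assumptions; the sample segments are those shown in this page's file-structure example.

```python
from collections import defaultdict

# Hypothetical input: one bilingual memory per target language, each a list of
# (english_segment, translated_segment) pairs. Language codes are assumed.
ENGLISH = "Vaccination against hepatitis C is not yet available."
bilingual_memories = {
    "BG": [(ENGLISH, "Засега няма ваксина срещу хепатит С.")],
    "CS": [(ENGLISH, "Očkování proti hepatitidě C zatím není k dispozici.")],
    "SV": [(ENGLISH, "Det finns ännu inget vaccin mot hepatit C.")],
}


def combine_on_english(memories):
    """Merge bilingual memories into multilingual units keyed by the English segment."""
    units = defaultdict(dict)
    for lang, pairs in memories.items():
        for english, translation in pairs:
            units[english]["EN"] = english
            units[english][lang] = translation
    return dict(units)


def extract_pair(units, lang_a, lang_b):
    """Return (lang_a, lang_b) segment pairs for units that contain both languages."""
    return [
        (unit[lang_a], unit[lang_b])
        for unit in units.values()
        if lang_a in unit and lang_b in unit
    ]


units = combine_on_english(bilingual_memories)
bulgarian_swedish = extract_pair(units, "BG", "SV")
print(bulgarian_swedish)
```

Keying on the exact English string is the simplest possible join; a real pipeline would probably need to normalise whitespace and handle English segments that occur more than once.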
The documents are distributed in the widely used Translation Memory eXchange (TMX) format. They are encoded in the UTF-8 character set. The files have the following structure:
<seg>Vaccination against hepatitis C is not yet available.</seg>
<seg>Засега няма ваксина срещу хепатит С.</seg>
<seg>Očkování proti hepatitidě C zatím není k dispozici.</seg>
<seg>Det finns ännu inget vaccin mot hepatit C.</seg>
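A TMX file can be read with any standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and assumes the usual TMX layout, in which each translation unit (`<tu>`) holds one `<tuv>` element per language, tagged with an `xml:lang` attribute, around the `<seg>` text. The sample document (with a simplified, empty header), the language codes and the function name `read_translation_units` are illustrative assumptions and are not taken from the ECDC-TM files themselves.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, trimmed-down TMX document built around the segments above.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>Vaccination against hepatitis C is not yet available.</seg></tuv>
      <tuv xml:lang="CS"><seg>Očkování proti hepatitidě C zatím není k dispozici.</seg></tuv>
      <tuv xml:lang="SV"><seg>Det finns ännu inget vaccin mot hepatit C.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Parse TMX text into a list of {language_code: segment_text} dictionaries."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            # Fall back to a plain "lang" attribute, which some older files use.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang.upper()] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


units = read_translation_units(SAMPLE_TMX)
# Keep only the units that have both sides of the requested language pair.
english_swedish = [(u["EN"], u["SV"]) for u in units if "EN" in u and "SV" in u]
print(english_swedish)
```

For the full ECDC-TM file, `ET.parse` on the unzipped file (or `ET.iterparse` to limit memory use) would replace `ET.fromstring`.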
ECDC-TM was built on the basis of the website of the European Centre for Disease Prevention and Control (ECDC). Most of the documents deal with health-related topics (anthrax, botulism, cholera, dengue fever, hepatitis, etc.), but some of the web pages also describe ECDC itself (e.g. its organisation, job opportunities) and its activities (e.g. epidemic intelligence, surveillance). The file ECDC-domains.xlsx gives further details.
The following table shows the size of ECDC Translation Memory per language: the number of translation units, the number of words and characters of the whole corpus and the average number of words and characters per translation unit.
For details, there is also a file containing the statistics on the size of the ECDC-TM per language pair.
Table columns (per language): No. of TUs; No. of words; No. of chars; No. of words per TU; No. of chars per TU.
Size of ECDC's Translation Memory (expressed as the number of translation units, number of words and number of characters) per language for each of the 25 European languages (all 23 official EU languages plus Icelandic and Norwegian).
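The per-language figures are simple counts and ratios. The following Python sketch shows one way such figures could be computed for the segments of a single language. It assumes that words are separated by whitespace and that characters are counted including spaces and punctuation; the page does not say how the published figures were counted, and the function name `size_statistics` is hypothetical.

```python
def size_statistics(segments):
    """Compute size figures for the segments of one language.

    Assumption: words are whitespace-separated tokens and characters include
    spaces and punctuation. The published counts may follow other rules.
    """
    tus = len(segments)
    words = sum(len(segment.split()) for segment in segments)
    chars = sum(len(segment) for segment in segments)
    return {
        "tus": tus,
        "words": words,
        "chars": chars,
        # Guard against an empty language to avoid division by zero.
        "words_per_tu": words / tus if tus else 0.0,
        "chars_per_tu": chars / tus if tus else 0.0,
    }


english_segments = ["Vaccination against hepatitis C is not yet available."]
stats = size_statistics(english_segments)
print(stats)
```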
By downloading or using the ECDC-Translation Memory, you are bound by the ECDC-TM usage conditions (PDF).
The public release of the ECDC-Translation Memory follows the release of various other multilingual resources via the JRC's website. These include the JRC-Acquis parallel corpus since 2006 (22 languages); the DGT-Translation Memory (DGT-TM) since 2007 (22 languages); the JRC-Names multilingual and multi-script name variant list and related software (since 2011); and the JRC Eurovoc Indexer (JEX) multilingual document categorisation software (22 languages) since 2012. For details and other, smaller linguistic resources, see the JRC-Resources page.
Further multilingual linguistic resources will be made available in the future.
The distribution of the ECDC Translation Memory consists of a single zip file (ECDC-TM.zip), which can be downloaded by clicking on the link below.
Should you be interested in the full-text version of the English files that were used to produce the translation memory, you can download these also. If needed, you can furthermore download a Java utility that allows you to extract a TMX file containing only one single language pair and to produce statistics on the number of translation units.
ECDC-TM (October 2012)
There is currently no scientific-technical description of the ECDC Translation Memory (ECDC-TM), so please simply refer to this web page.
For more information on ECDC-TM, you can contact the following persons:
Web Editor for Multilingual Content
Email address: firstname.lastname@example.org
European Centre for Disease Prevention and Control (ECDC)
171 83 Stockholm, Sweden
Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)
IPSC - GlobeSec - OPTIMA
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)
The ECDC Translation Memory was offered by the European Centre for Disease Prevention and Control (ECDC). The original files - one for each of the 24 language pairs - were cleaned and combined by Mohamed Ebrahim from the European Commission's Joint Research Centre JRC.
The European Centre for Disease Prevention and Control (ECDC) is an EU agency whose aim is to strengthen Europe's defences against infectious diseases. It was established in 2005 and is based in Stockholm, Sweden.
The ECDC's mission: According to Article 3 of the founding Regulation, ECDC's mission is to identify, assess and communicate current and emerging threats to human health posed by infectious diseases. In order to achieve this mission, ECDC works in partnership with national health protection bodies across Europe to strengthen and develop continent-wide disease surveillance and early warning systems. By working with experts throughout Europe, ECDC pools Europe's health knowledge, so as to develop authoritative scientific opinions about the risks posed by current and emerging infectious diseases.
The Joint Research Centre (JRC) is a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has contributed to the dissemination of the DGT Translation Memory and it has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.
The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM collects and aggregates about 150,000 online news articles per day in 50 languages from about 3,500 news portals world-wide (status 2012). EMM's news analysis tools always show the latest news from around the world, as their pages are updated every ten minutes. EMM not only displays the news articles but also groups related articles, classifies them into hundreds of news categories and displays automatically extracted meta-information together with the news items. As a result, it has many users from around the world and receives up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's publicly accessible media monitoring applications are:
- NewsBrief: Breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.
- MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
- NewsExplorer: Summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; detection of quotations by and about people; individual, daily-updated pages for over one million names; automatic generation of social networks.