EU Science Hub

Breaking down language barriers with EMM's new translation system

Aug 31 2012

In today's globalised world, staying informed about issues ranging from health, the environment and energy to political affairs, security and the financial crisis has become a mammoth task. Breaking down language barriers gives first-hand access to relevant information, helping citizens, policy makers, government authorities, the private sector and industry to stay better informed, especially on fast-changing, high-impact issues.

The JRC's Europe Media Monitor (EMM) now makes it possible to understand the content of articles at a glance, even when they are written in a foreign language. The "Optima News Translation System" (ONTS), developed in-house by researchers of the JRC's Institute for the Protection and Security of the Citizen (IPSC), automatically translates news from ten languages into English.

EMM is a live media monitoring system which gathers around 150,000 news articles every day from over 3,750 internet news websites around the world, in over 62 different languages. Updated every 10 minutes, it groups all news articles that focus on the same subject and brings the most important unfolding stories to the fore. EMM applies leading-edge information mining techniques to each article, automatically determining what is happening to whom and where by: classifying every received article according to a hierarchy of some 1,000 classes (what); identifying people and organisations in the news (who); geo-locating each article (where); and detecting positive or negative tonality.
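The grouping step can be pictured with a small sketch. The article does not describe how EMM actually groups stories, so the approach below (greedy clustering of titles by word overlap) is only an assumed stand-in, and the names `Cluster`, `group_articles` and the 0.3 threshold are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A group of article titles judged to cover the same story."""
    titles: list = field(default_factory=list)
    vocabulary: set = field(default_factory=set)


def tokens(title):
    """Lower-case the title, strip punctuation, keep words longer than 3 characters."""
    words = (w.strip(".,:;!?\"'").lower() for w in title.split())
    return {w for w in words if len(w) > 3}


def jaccard(a, b):
    """Share of words two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_articles(titles, threshold=0.3):
    """Greedily add each title to the first cluster it overlaps with enough."""
    clusters = []
    for title in titles:
        words = tokens(title)
        for cluster in clusters:
            if jaccard(words, cluster.vocabulary) >= threshold:
                cluster.titles.append(title)
                cluster.vocabulary |= words
                break
        else:
            # No existing cluster is similar enough: start a new story.
            clusters.append(Cluster([title], set(words)))
    # Larger clusters first, as a crude proxy for "most important stories".
    return sorted(clusters, key=lambda c: len(c.titles), reverse=True)
```

Given two titles about a flood warning in northern Italy and one about a central bank decision, this sketch would return two clusters, with the two-article flood story ranked first.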

Until now, EMM NewsBrief readers interested in articles written in foreign languages had to use commercially available translation services, such as Google Translate or Bing Translator, which have their limitations. First, users need to translate articles one at a time, which is time-consuming when processing many articles, for example for press reviews. Second, the number of documents that can be submitted is restricted. Finally, users may be reluctant to disclose their interests to a third party.

The JRC's new in-house translation system overcomes these limitations and puts quality machine translation into the hands of EMM users, combined with a satisfactory level of privacy and security.

ONTS automatically translates into English the title and the description of each article in Arabic, Czech, Danish, Farsi, French, German, Italian, Polish, Portuguese and Spanish. Based on the open source software Moses, the ONTS translation engine uses statistical methods and large amounts of data to identify the most likely translation of each phrase and to link these phrases together into a comprehensible English sentence.
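The phrase-by-phrase idea can be illustrated with a toy example. This is not Moses or ONTS code: the phrase table, its probabilities and the function `translate` are invented for illustration, and the greedy longest-match lookup stands in for the much richer search a real statistical decoder performs over learned models.

```python
# Hypothetical French-to-English phrase table:
# source phrase -> {candidate translation: probability}.
PHRASE_TABLE = {
    ("le", "premier", "ministre"): {"the prime minister": 0.9,
                                    "the first minister": 0.1},
    ("le",): {"the": 0.8, "it": 0.2},
    ("premier",): {"first": 0.7, "prime": 0.3},
    ("ministre",): {"minister": 1.0},
    ("a", "démissionné"): {"has resigned": 0.6, "resigned": 0.4},
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASE_TABLE)


def translate(sentence):
    """Translate by picking the most probable entry for the longest known phrase."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in PHRASE_TABLE:
                candidates = PHRASE_TABLE[phrase]
                output.append(max(candidates, key=candidates.get))
                i += n
                break
        else:
            # Unknown word: pass it through untranslated.
            output.append(words[i])
            i += 1
    return " ".join(output)


print(translate("Le premier ministre a démissionné"))
# prints "the prime minister has resigned"
```

Matching the three-word phrase as a unit is what yields "prime minister" rather than the word-by-word "first minister", which is the main advantage of working with phrases instead of single words.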

ONTS is optimised to translate news: it takes into account that news is usually written in a specific style which differs from other types of text and contains many names of people, organisations and places. For example, news titles contain more gerunds, and few or no linking verbs, prepositions or adverbs, compared with ordinary sentences, while sentences in the body of an article include more prepositions, adverbs and a wider range of verb tenses. To improve the quality of the translated articles, titles are therefore translated with a separate translation system, while names are translated using the JRC-Names software, which automatically recognises names even if they are spelled in different ways.
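A rough sketch of these two ideas follows. It does not use the JRC-Names software or its data: the variant table, `canonical_name` and `translate_article` are hypothetical, and the two engines are passed in as plain functions simply to show titles and descriptions being routed separately.

```python
import unicodedata

# Hypothetical variant table: normalised spelling -> canonical English name.
NAME_VARIANTS = {
    "muammar gaddafi": "Muammar Gaddafi",
    "moammar kadhafi": "Muammar Gaddafi",
    "mouammar kadhafi": "Muammar Gaddafi",
    "jose manuel barroso": "José Manuel Barroso",
}


def normalise(text):
    """Strip accents and case so that spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def canonical_name(mention):
    """Return the canonical name for a known variant, or None if unknown."""
    return NAME_VARIANTS.get(normalise(mention))


def translate_article(title, description, title_engine, body_engine):
    """Send the title and the description to separate translation engines."""
    return title_engine(title), body_engine(description)
```

Here `canonical_name("Mouammar Kadhafi")` and `canonical_name("Muammar Gaddafi")` both resolve to the same entry, so a name keeps a single English form whichever source spelling it arrived in.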

ONTS can be used by simply clicking on the "EN" symbol that appears below the title of each news article available in EMM.