EU Science Hub

JRC language tool outperforms competition in summarising news articles

Nov 29 2011

NewsGist, a language software tool developed by the JRC's Institute for the Protection and Security of the Citizen (IPSC), achieved the highest score among 10 participating software packages in one of the three competition categories at the international Text Analysis Conference (TAC), held on 14 and 15 November in Gaithersburg, Maryland, USA.

The conference is organised annually by the US National Institute of Standards and Technology (NIST), with the support of the US Department of Defense. It offers a series of workshops and evaluations aimed at encouraging research in language technologies. JRC system developers took part in the “Summarization” category, which evaluated systems that produce coherent summaries of text.

Participants were asked to produce a short, coherent summary of fewer than 250 words from a set of 10 related news articles, in each of seven languages (Arabic, Czech, English, French, Hebrew, Hindi and Greek). According to the jury, the JRC's system ranked best in all languages except Arabic and Hindi, in each of which it placed fourth. In a similar task for monolingual English systems, the JRC system placed twelfth out of 50 evaluated systems, and second on the measure of redundancy avoidance.

NewsGist is the summarisation system behind the JRC's Europe Media Monitor (EMM), which gathers around 100,000 news articles every day from over 3,000 news sources in 50 languages and groups them into topic-homogeneous news clusters for each language. The core of the summariser, which was used for the multilingual task, applies latent semantic analysis to extract the most important sentences about a given topic. Because the approach is highly language-independent, the system performs similarly across languages.
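
For readers curious how latent semantic analysis can pick out important sentences, the following is a minimal sketch in Python. It is not the NewsGist implementation, whose details are not given here. It assumes a plain term-by-sentence count matrix, a naive word tokeniser, and a common scoring rule that weights each sentence by its strength in the top latent topics. The names `lsa_summary`, `max_sentences` and `rank` are hypothetical and chosen for illustration only.

```python
import re
import numpy as np


def lsa_summary(sentences, max_sentences=2, rank=2):
    """Pick the sentences that carry the most weight in the top latent topics.

    Illustrative sketch only: raw term counts, no stop-word removal,
    no redundancy filtering.
    """
    if not sentences:
        return []

    # Naive, largely language-neutral tokenisation: lowercase word characters.
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    vocab = sorted({t for toks in tokens for t in toks})
    if not vocab:
        return sentences[:max_sentences]
    index = {term: i for i, term in enumerate(vocab)}

    # Term-by-sentence matrix: rows are terms, columns are sentences.
    matrix = np.zeros((len(vocab), len(sentences)))
    for col, toks in enumerate(tokens):
        for term in toks:
            matrix[index[term], col] += 1.0

    # Singular value decomposition: rows of vt describe how strongly
    # each sentence expresses each latent topic.
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(rank, len(singular_values))

    # Score = length of the sentence vector in the reduced topic space,
    # with each topic weighted by its singular value.
    weighted = singular_values[:k, None] * vt[:k, :]
    scores = np.sqrt((weighted ** 2).sum(axis=0))

    # Keep the best-scoring sentences, restored to their original order.
    best = sorted(np.argsort(-scores)[:max_sentences])
    return [sentences[i] for i in best]


if __name__ == "__main__":
    cluster = [
        "The conference evaluated summarisation systems in seven languages.",
        "Summarisation systems were evaluated on news articles in several languages.",
        "The weather in Maryland was mild in November.",
        "Each system produced a short summary of related news articles.",
    ]
    for sentence in lsa_summary(cluster, max_sentences=2):
        print(sentence)
```

Because the sketch relies only on word counts and a matrix decomposition, nothing in it is tied to a particular language, which is the property the article highlights. A production system would likely add steps this sketch omits, such as term weighting and redundancy checks.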