Multilingual Media Monitoring and Text Analysis – Challenges for highly inflected languages

The European Commission’s Europe Media Monitor (EMM) family of applications helps users monitor multilingual written online media for information on a wide variety of subject domains. In addition to gathering and classifying an average of 175,000 news articles per day in up to 73 languages, the EMM applications apply a number of text mining and processing tools for about twenty languages. These tools include news clustering, information extraction and disambiguation (persons, organisations, locations, quotations, events), matching of name variant spellings, topic detection and tracking, cross-lingual news cluster linking, opinion mining, multi-document summarisation, and more. Developing such tools is particularly challenging for highly inflected languages, such as those of the Slavic and Finno-Ugric language families. The speaker will therefore devote part of his talk to insights into the treatment of highly inflected languages, especially in information extraction and multi-label document classification. EMM is freely accessible to the public via