Language Technology at the JRC © EU
How do you spell it? JRC-Names has the answer
Čajkovskij and Tchaikovsky: are they the same person? NATO and OTAN: are they the same organisation? JRC-Names, the new software for automatic name recognition developed by the JRC, can clear up many such doubts.
The software contains the names of about 205,000 distinct entities - mostly persons, but also organisations, event names, and more - that are most frequently mentioned in online news, plus about the same number of variant spellings for these entities.
When a user types a name into its command box, the system shows the variants it knows in different languages. It also shows the correct spelling for the language context, for example Ustinov in English, Oustinov in French, and Ustinow in German. It can resolve ambiguous acronyms as well: for example, it would highlight that FN, which stands for Front National in France, means United Nations in some Scandinavian languages.
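The lookup behaviour described above can be illustrated with a small Python sketch. It is not the JRC-Names software or its data format: the `Entity` and `NameIndex` classes, the entity numbers and the language codes are hypothetical, and the sample entries contain only the names mentioned in this article plus a Swedish spelling added for illustration. The idea is simply that every known variant points to one entity, and that an acronym can point to different entities depending on the language.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One real-world entity and its usual spelling per language (hypothetical model)."""
    entity_id: int
    kind: str  # e.g. "person" or "organisation"
    spellings: dict = field(default_factory=dict)  # language code -> spelling


class NameIndex:
    """Maps name variants to entities; "*" marks a variant valid in any language."""

    def __init__(self):
        self._entities = {}
        self._variants = {}  # case-folded variant -> {language code or "*": entity id}

    def add(self, entity):
        """Register an entity and all of its per-language spellings."""
        self._entities[entity.entity_id] = entity
        for name in entity.spellings.values():
            self.add_variant(name, entity.entity_id)

    def add_variant(self, name, entity_id, languages=("*",)):
        """Register an extra variant, optionally restricted to some languages."""
        by_lang = self._variants.setdefault(name.casefold(), {})
        for lang in languages:
            by_lang[lang] = entity_id

    def resolve(self, name, lang=None):
        """Return the entity a name refers to in the given language, or None."""
        by_lang = self._variants.get(name.casefold(), {})
        entity_id = by_lang.get(lang, by_lang.get("*"))
        return self._entities.get(entity_id)

    def preferred_spelling(self, name, target_lang, source_lang=None):
        """Return the spelling used in target_lang, or None if it is unknown."""
        entity = self.resolve(name, source_lang)
        return entity.spellings.get(target_lang) if entity else None


# Illustrative sample data only.
index = NameIndex()
index.add(Entity(1, "person", {"en": "Ustinov", "fr": "Oustinov", "de": "Ustinow"}))
index.add(Entity(2, "organisation", {"fr": "Front National"}))
index.add(Entity(3, "organisation", {"en": "United Nations", "sv": "Förenta nationerna"}))
index.add(Entity(4, "person", {"en": "Tchaikovsky"}))
index.add_variant("Čajkovskij", 4)
index.add_variant("FN", 2, languages=("fr",))
index.add_variant("FN", 3, languages=("sv", "da"))

print(index.preferred_spelling("Ustinov", "fr"))
print(index.resolve("FN", "fr").spellings["fr"])
print(index.resolve("FN", "sv").spellings["en"])
```

In this sketch, looking up FN without a language returns nothing, because the acronym is only registered for specific languages. A real system would need a policy for such cases, for instance listing every candidate.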
JRC-Names is a living system: it grows by about 230 new entities and 430 new name variants per week. The database includes names spelt in 27 different scripts, of which the most frequently used are Latin, Cyrillic, Arabic, Japanese and Chinese Han.
JRC-Names was developed to facilitate the analysis of about 100,000 news reports per day by the Europe Media Monitor (EMM) application. It was compiled mostly automatically, from an analysis, running since 2004, of hundreds of millions of news articles in up to twenty languages.
The software is designed to improve machine translation and professional archiving, but it is publicly available and has many other applications, from social networks to education. JRC-Names helps preserve European multilingualism, which is actively supported by several European Commission initiatives such as the European Day of Languages, held on 26 September all over Europe.