Abstract:
Since 2004 the European Commission’s Joint Research Centre (JRC) has been analysing the online version of
printed media in over twenty languages and has automatically recognised and compiled large numbers of named
entities (persons and organisations) and their many name variants. The collected variants not only include standard
spellings in various countries, languages and scripts, but also frequently found spelling mistakes or lesser-used
name forms, all occurring in real-life text (e.g. Benjamin/Binyamin/Bibi/Benyamín/Biniamin/Беньямин/ بنیامین Netanyahu/
Netanjahu/Nétanyahou/Netahnyahu/Нетаньяху/ نتنیاهو ). This entity name variant data, known as JRC-Names,
has been available for public download since 2011. In this article, we report on our efforts to render
JRC-Names as Linked Data (LD), using lemon, the lexicon model for ontologies. Besides adhering to Semantic
Web standards, this new release goes beyond the initial one in that it includes titles found next
to the names, as well as the date ranges during which the titles and the name variants were found. It also establishes
links to existing datasets, such as DBpedia and Talk-Of-Europe. As a multilingual linguistic linked
dataset, JRC-Names can help bridge the gap between structured data and natural languages, thus supporting
large-scale data integration, e.g. cross-lingual mapping, and web-based content processing, e.g. entity linking.
JRC-Names is publicly available through the dataset catalogue of the European Union’s Open Data Portal.