Hailed as the biggest milestone since Gutenberg’s invention of the printing press, the internet has truly democratised how information is consumed and produced. While publishing and buying books was once limited to the small class of society able to read, write and afford them, the web has empowered everyone.
The internet paves the way for ever-greater participation in what information is generated and how it is disseminated. Today, we not only read and view content (passively) but also contribute our own. And this new class of ‘prosumers’ – combining the roles of producers and consumers – creates amazingly rich content at dizzying speed, from community-edited encyclopaedias (or wikis) to blogs and all kinds of social media.
We are documenting our news, thoughts and hopes, our family histories, and our every movement in ever-more creative formats and platforms. And the internet has become a de facto ‘live’ archive as it systematically captures the digital footprints of our society, and allows us to travel back in time to explore the zeitgeist of previous times – and bad hairdos.
Web archives are a veritable gold mine for analysts across different application areas, from political analysts and marketing agencies to academic researchers and product developers. The self-proclaimed keepers of our online legacy, the Internet Archive initiative and its European partner Internet Memory Foundation, have ‘captured’ (retained for posterity) more than 350 billion web-pages since 1996.
The EU-supported LAWA project set out to create a Virtual Web Observatory (VWO) on top of the Internet Memory Foundation’s rich repository, in order to explore and analyse content from web-pages and social media longitudinally (over time).
But the project struck a problem: internet content is often far from self-explanatory, especially when it dates back years or decades. Background knowledge is typically needed to interpret it. For instance, when reading about the debate on renewable energy across Europe after the Fukushima nuclear disaster, you can understand the arguments put forward only with statistical evidence to hand, such as renewable energy data from Eurostat, the EU’s statistics office. However, identifying the relevant statistics and other background information for any given piece of web content is a more difficult task.
A solution developed in LAWA lifts the content to a semantic level by identifying ‘named entities’ such as people, places, organisations, events, etc. But why are named entities crucial here? Because they can be mapped onto semantic knowledge bases such as Yago or Freebase – the heart of Google Knowledge Graph.
As a result, content is no longer merely ‘plain text’, but annotated with semantic information about entities, and other valuable background knowledge. This way, the computer can infer that a document mentioning energy companies, such as Vattenfall or E.ON, is highly related to the topic of renewable energy and associated statistics. This helps to bridge the gap between text-based web content and statistical evidence.
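The enrichment idea described above can be sketched as a toy pipeline: surface strings in a document are mapped to canonical entities, and entities to topics, so that plain text becomes queryable by subject. The dictionaries and topic labels below are invented for illustration only; a real system such as LAWA’s would link against a full knowledge base like Yago or Freebase rather than a hand-written lookup table.

```python
# Toy sketch of entity-based semantic enrichment (illustrative data, not LAWA's).

# Hypothetical mini knowledge base: surface form -> canonical entity.
ENTITY_KB = {
    "vattenfall": "Vattenfall",
    "e.on": "E.ON",
    "fukushima": "Fukushima_Daiichi_nuclear_disaster",
    "eurostat": "Eurostat",
}

# Hypothetical mapping from canonical entities to topic labels.
ENTITY_TOPICS = {
    "Vattenfall": {"energy"},
    "E.ON": {"energy"},
    "Fukushima_Daiichi_nuclear_disaster": {"energy", "disasters"},
    "Eurostat": {"statistics"},
}

def annotate(text):
    """Return the canonical entities whose surface forms appear in the text."""
    lowered = text.lower()
    return {canonical for surface, canonical in ENTITY_KB.items()
            if surface in lowered}

def infer_topics(text):
    """Union the topics of every entity found, turning plain text into labels."""
    topics = set()
    for entity in annotate(text):
        topics |= ENTITY_TOPICS.get(entity, set())
    return topics

doc = ("After Fukushima, energy companies such as Vattenfall and E.ON "
       "faced a renewed debate on renewables.")
print(sorted(infer_topics(doc)))  # ['disasters', 'energy']
```

A document mentioning Vattenfall or E.ON thus ends up tagged with the ‘energy’ topic even though the word never appears, which is the bridge between text and background statistics the article describes; production systems replace the substring matching here with proper named-entity disambiguation.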
LAWA’s public demonstrators include analytics interfaces (ways of using the analytical tools) and browser plug-ins that help to semantically enrich, classify, query, analyse and visually explore digital content at web scale.
“LAWA facilitates large-scale studies of different types of web content and has added a new dimension to the Future Internet research roadmap,” notes the project team. The work of this unique European project has provided valuable impetus to the emerging field of temporal web analytics. Moreover, LAWA has been a driving force behind the launch of the Temporal Web Analytics Workshop series at the internationally recognised WWW conference, building a thriving research community focused on tackling the challenges and opportunities presented by Big (web) Data.