
Read four new blog posts on the Web Intelligence Network (WIN) blog

We are pleased to announce that four new blog posts have been published on the WIN blog since the last WIH newsletter. Three of them cover specific use cases: (1) gathering data on online housing advertisements, (2) improving business registers using online data and (3) improving the accuracy of job advertisement data. The fourth offers a general reflection on using web scraping for official statistics.

date:  30/09/2024

As you may already know, the WIN supports the Web Intelligence Hub (WIH) in its work on web-based data for European Statistics. The Network is led by a consortium of 17 organisations from 14 European countries, which explores new web data sources for potential integration into the WIH. The WIN's blog posts aim to keep you up to date on news, improvements and developments in the Network’s activities. The latest posts offer a general reflection on the opportunities and challenges that sourcing data from web content creates for official statistics, present country-specific perspectives and cover different use cases (web data for real estate statistics, business registers and online job advertisements).

The first of the new blog posts looks at how Statistics Finland gathers comprehensive data on online housing advertisements. The blog post describes the data collection process, key players in the Finnish rental market and the main conditions of the agreement. It also highlights potential future applications of this data.

Another blog post presents the work of statistical offices from Austria, Finland, Hesse (Germany), the Netherlands and Sweden on improving business registers using online data. This project focuses on URL finding and on using the URLs found to verify and predict business activities.

A third blog post is related to our article on the European Statistics Awards. It covers the first WIH challenge on deduplication, launched in December 2022. The challenge aimed to improve the accuracy of online job advertisement data by identifying and removing duplicates through techniques such as text standardisation, language detection and advanced similarity measures such as MinHash. It highlighted the potential of these methods to enhance the quality of WIH micro data for use in compiling European statistics. The post discusses the lessons learnt from the solutions devised by the participating teams.
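
For readers curious about how MinHash-based deduplication can work in principle, here is a minimal sketch. It is not the method used by any of the competing teams or by the WIH. The function names, the signature length of 128, the three-word shingles and the 0.8 threshold are all assumptions chosen for illustration, and language detection is left out because it would need a third-party library.

```python
import hashlib
import re

NUM_HASHES = 128  # assumed signature length; longer means a better estimate


def standardise(text: str) -> str:
    """Lower-case the text, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-word sequences of the standardised text."""
    words = standardise(text).split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle: str, seed: int) -> int:
    """One member of a family of 64-bit hash functions, selected by `seed`."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> list[int]:
    """For each hash function, keep the smallest hash value over all shingles."""
    sh = shingles(text)
    if not sh:
        # Assumption: empty texts get an all-zero signature (and match each other).
        return [0] * num_hashes
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def find_duplicates(ads: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs of ads whose estimated similarity reaches the threshold."""
    sigs = [minhash_signature(ad) for ad in ads]
    return [
        (i, j)
        for i in range(len(ads))
        for j in range(i + 1, len(ads))
        if estimated_similarity(sigs[i], sigs[j]) >= threshold
    ]


# Hypothetical job advertisements, invented for this example.
ads = [
    "Data Scientist, Berlin - full time!",
    "data scientist berlin full time",
    "Nurse wanted for night shifts in Vienna hospital",
]
duplicate_pairs = find_duplicates(ads)
```

In this sketch the first two advertisements become identical once standardised, so they are reported as a duplicate pair, while the third shares no three-word sequence with them. Comparing every pair of signatures is only practical for small collections; larger ones would typically need an indexing step such as locality-sensitive hashing.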

The final blog post in this series explores the potential and challenges of using web scraping for official statistics. It highlights the benefits of collecting large amounts of data at low cost and discusses the technical, legal and methodological hurdles involved. The post emphasises the importance of ensuring data quality and representativeness, navigating legal considerations and employing sophisticated methodologies.