WPC Overview

Description of the workpackage

The aim of WPC is to use web scraping, text mining and inference techniques to collect and process enterprise information in order to improve or update existing information in the national business registers, such as Internet presence, kind of activity, address information and ownership structure. The implementation involves massive scraping of company websites; collecting, processing and analysing unstructured data; and disseminating national-level experimental statistics. The enterprise data obtained by this WP is combined with existing data from multiple other sources, such as administrative data and surveys on ICT usage in enterprises, in order to maximize the quality and quantity of the statistical output.

The results from the previous ESSnet Big Data project can already be deployed and implemented in any ESS country, but they may require adaptation to local circumstances to be effective. Therefore, within the new WP the methodology of the previous ESSnet project will be generalized and extended for use in any ESS country, taking into account the variety needed to support different use cases. The result should be generalized methods and software that countries can tune to their own specific situation where needed.

Prototypes of the software and architecture for applying these methods will be tested among the partners in this WP. On the basis of this experience, guidelines and documentation will be produced on adapting and deploying these tools.

The high level of interest in WPC presents an opportunity to build an enhanced network of partners interested in web scraping enterprise characteristics. This network of NSIs will provide a solid base for elaborating the methodology and a functional prototype for big data production on enterprise characteristics. In designing data collection and survey processes and in defining statistical outputs, subject matter experts in statistical production areas within the NSIs (e.g. business register experts) will be consulted.


Task 1 - ESS web scraping policies

This task will design and develop transparent web scraping policies in order to allay public concerns about the data collected and how it is used. Extracting knowledge from online data attracts attention and raises public concerns about how NSIs use online data, even in relatively uncontroversial cases such as NSIs using textual data from company websites. The GDPR also adds further requirements to web scraping activities. In response, the WPC team, in collaboration with the WPB team, NSIs and Eurostat, needs to develop such a policy. The ‘netiquette’ developed as part of WP2 of ESSnet Big Data I is an important first step and could serve as a good starting point. Sharing experiences concerning legal constraints on web scraping among participating NSIs is a key precondition for the successful execution of this task.
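
As an illustration of what a transparent scraping policy can mean in practice, the sketch below checks a site's robots.txt before fetching a page and honours a declared crawl delay. It is a minimal example of common polite-scraping behaviour, not a statement of what the ESSnet netiquette prescribes. The user agent string, the helper names and the sample robots.txt content are assumptions made for illustration; only Python's standard `urllib.robotparser` module is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; a real policy would likely name the NSI and a contact point.
USER_AGENT = "example-nsi-statistics-bot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True if robots.txt allows USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(parser: RobotFileParser, default_seconds: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


# Made-up robots.txt content, used here so the example needs no network access.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = build_parser(EXAMPLE_ROBOTS)
print(may_fetch(parser, "https://www.example.com/about"))          # True
print(may_fetch(parser, "https://www.example.com/private/staff"))  # False
print(polite_delay(parser))                                        # 5.0
```

In a real scraper the robots.txt text would be downloaded per site, and the returned delay would be applied between successive requests to that site.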

Task 2 - Methodological framework/guidelines

This task will produce generalised and extended methods, procedures and implementation requirements for web scraping of enterprise characteristics. The methodology will be based on results from the implementation of the WP2 use cases: Enterprise URLs inventory, E-Commerce in enterprises, Social media presence on enterprises’ webpages, NACE identification and Job vacancy ads on enterprises’ websites.
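
To make one of these use cases concrete, the following sketch shows one simple way that social media presence could be detected from an enterprise webpage: collect all hyperlinks and match their host names against a list of known platforms. This is an illustrative approach under stated assumptions, not the method used in the ESSnet project. The domain list, class name and function name are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative, non-exhaustive mapping of domains to platform names (an assumption).
SOCIAL_DOMAINS = {
    "facebook.com": "Facebook",
    "twitter.com": "Twitter",
    "linkedin.com": "LinkedIn",
    "youtube.com": "YouTube",
    "instagram.com": "Instagram",
}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def social_media_presence(html: str) -> set:
    """Return the set of platforms that the page links to."""
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for link in collector.links:
        # Relative links have no host name and are skipped by the match below.
        host = (urlparse(link).hostname or "").lower()
        for domain, platform in SOCIAL_DOMAINS.items():
            if host == domain or host.endswith("." + domain):
                found.add(platform)
    return found


# Made-up page fragment for demonstration.
page = (
    '<a href="https://www.facebook.com/acme">Facebook</a> '
    '<a href="/contact">Contact</a> '
    '<a href="https://linkedin.com/company/acme">LinkedIn</a>'
)
print(sorted(social_media_presence(page)))  # ['Facebook', 'LinkedIn']
```

A link to a platform does not prove that the enterprise itself runs the account (it may be a share button, for example), so a production method would need further checks.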

To achieve significant economies of scale, resources can be shared at ESS level in the following ways:

  • at design level – data representation (data models and related metadata), process definition and methods;
  • at deployment phase – infrastructure (tools and platforms), running of data processing steps (collection, processing,…).

Task 3 - Experimental statistics, including reference metadata

This task will produce experimental statistical products using web-scraped enterprise characteristics data as their main source. The experimental statistics will demonstrate what the final product should look like. Statistical products at national level will include: Enterprise URLs Inventory, E-Commerce in Enterprises, Social Media Presence on Enterprises’ Webpages, sustainable enterprises and detection of enterprise activity. Experimental statistics will include quality and uncertainty indicators.
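
As a hedged illustration of what an uncertainty indicator could look like, the sketch below computes the share of enterprises with a given characteristic (for example, a detected web shop) together with a 95% Wilson score interval. It assumes a simple random sample and ignores classification errors and coverage issues, which a real quality framework would need to address. The function name and the example counts are made up for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Return (share, lower, upper) using the Wilson score interval.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half_width, centre + half_width


# Made-up example: 40 of 100 sampled enterprise websites classified as having e-commerce.
share, lower, upper = wilson_interval(40, 100)
print(f"share={share:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

The Wilson interval is chosen here because it behaves better than the plain normal approximation for small samples or shares close to 0 or 1; other indicators may be equally appropriate.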

Task 4 - Starter kit for NSIs

This task will develop a Starter kit for web scraping of enterprise characteristics. The Starter kit should consist of procedures for testing and maintaining web scraping. In addition, it will provide guidelines to users and developers for implementing the functional production prototypes defined within Task 2.

The Starter kit will use the WPC solution architectures that are shared with WPF for the design of a data and application architecture for big data production.

Task 5 - Quality template for statistical outputs

This task will define a quality management template for web-scraped enterprise characteristics to ensure that the quality of the data is sufficient for disseminating experimental statistics. The starting point of the quality template will be the UNECE Framework for the Quality of Big Data and all the quality aspects already identified by SGA-2 WP8 of the previous ESSnet.

Milestones and deliverables

See here for an overview of available milestones and deliverables.

WPC milestones

  CM1   Report on the WP meeting mid-2019   Month 9
  CM2   Report on the WP meeting mid-2020   Month 20

WPC deliverables

  C1   ESS web-scraping policies (report)   Month 8
  C2   Methodological framework V.1   Month 12
  C3   Experimental statistics 2019, including reference metadata   Month 12
  C4   Starter kit for NSIs V.1   Month 18
  C5   Quality template for statistical outputs   Month 22
  C6   Methodological framework V.2   Month 24
  C7   Starter kit for NSIs V.2   Month 24
  C8   Experimental statistics 2020, including reference metadata   Month 24