Analysis of methodologies for using the Internet for the collection of information society and other statistics
The report summarizes the results of the project ‘Analysis of methodologies for using the Internet for the collection of information society and other statistics’, which are presented in detail in the specific technical deliverables.
Conceptual framework and recommendations for Internet-data-based ICT statistics
In this activity, two related investigations were carried out. The first was theoretical: it examined the place of the Internet in everyday social and economic life and the data generated by the interactions between individuals and enterprises on the Internet. The second examined the extent to which data collected by software monitoring users’ devices, and by crawlers extracting content from enterprise web sites, can substitute for or extend the current ICT surveys. In addition, and although beyond the scope of the project, it examined the new avenues opening up for the production of official statistics based on the proliferation of data on the Internet and on data from the “Internet of things”.
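To make the crawler-based approach concrete, the following is a minimal sketch of how content extracted from an enterprise web site might be mapped onto survey-like ICT indicators. The indicator names and the keyword patterns are illustrative assumptions, not the project's actual extraction rules.

```python
import re

# Hypothetical indicators an NSI crawler might derive from one enterprise
# web page, mirroring binary items in the enterprise ICT usage survey.
# The regex patterns below are illustrative assumptions only.
INDICATORS = {
    "has_online_ordering": re.compile(r"add to (cart|basket)|checkout", re.I),
    "links_social_media": re.compile(r"facebook\.com|twitter\.com|linkedin\.com", re.I),
    "offers_job_vacancies": re.compile(r"careers|vacanc|job opening", re.I),
}

def extract_indicators(html: str) -> dict:
    """Map a fetched page's HTML to a record of binary ICT indicators."""
    return {name: bool(rx.search(html)) for name, rx in INDICATORS.items()}

page = "<a href='https://facebook.com/acme'>Follow us</a> <button>Checkout</button>"
record = extract_indicators(page)
# record["links_social_media"] -> True; record["offers_job_vacancies"] -> False
```

In practice such records, produced per site, would be compared against the corresponding survey responses to assess how far the crawler can substitute for or extend the survey.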
The new sources and forms of data on the Web pose pressing questions for official statistics. The overarching question is which methods should be changed, or newly introduced, to let official statistics retain their character while exploiting the emerging potential of online contexts. The proposed conceptual framework for the Internet and the Web as data sources should facilitate reconciling their main characteristics with the approach of official statistics.
‘Cookbook’ for Internet-data-based ICT statistics
The ‘cookbook’ is a guide to the application of Internet-data-based methods for the production of official statistics. Its audience is the producers of official statistics. The guide borrows its structure and some of its content from Eurostat’s “Methodological manual for statistics on the Information Society”. More specifically, for aspects of the production methods that will be implemented in the same manner as in the current household and enterprise ICT surveys (e.g. sampling enterprises from the business register of the NSI), the guidelines were copied from the current manual. Even there, however, minor changes were made in order to discuss possible difficulties that the new methods will face. A considerable part of the cookbook, however, consists of original material drafted by the project team.
Feasibility of big data as a source for the production of official statistics
The potential of big data as a source of official statistics was examined. Of particular interest were the so-called ‘federated open data’: (big) data from business or the public sector, generally not accessible to the public, but shared in an agreed and defined way with the producers of official statistics. Five specific ‘use cases’ were examined:
- Vessel movement data from the Automatic Identification System (AIS)
- Real estate classified advertisements
- Social media message data
- Credit card transaction data (Visa Europe)
- Government financial transparency portal data
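As an illustration of how one of these sources could feed a statistical output, the sketch below derives a simple maritime indicator, the number of distinct vessels reporting inside a port area, from already-decoded AIS position records. The record layout and the bounding-box coordinates are illustrative assumptions, not the actual use-case implementation.

```python
from dataclasses import dataclass

@dataclass
class AisPosition:
    mmsi: int    # Maritime Mobile Service Identity broadcast by the transponder
    lat: float   # latitude in decimal degrees
    lon: float   # longitude in decimal degrees

def vessels_in_area(positions, lat_min, lat_max, lon_min, lon_max):
    """Count distinct vessels that reported at least one position inside the box."""
    return len({p.mmsi for p in positions
                if lat_min <= p.lat <= lat_max and lon_min <= p.lon <= lon_max})

reports = [
    AisPosition(244660000, 51.95, 4.05),   # inside the (illustrative) port box
    AisPosition(244660000, 51.96, 4.10),   # same vessel, counted once
    AisPosition(538001234, 52.40, 4.80),   # outside the box
]
# vessels_in_area(reports, 51.9, 52.0, 4.0, 4.2) -> 1
```

Deduplicating on the vessel identifier, as above, is what turns a stream of raw movement messages into a countable statistical unit.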
Outline of procedure for the accreditation, by producers of official statistics, of big data sources as input data for official statistics
In this activity a procedure is proposed that NSIs considering big data sources as input to the production of official statistics could employ to accredit such sources. The work was based on an analysis of the available recent literature on topics such as the quality of statistics in general and the quality of administrative data sources in particular.
The accreditation procedure itself evolves in a step-wise fashion. It consists of five stages with gradually deepening assessments, involving indicators measured through scales as well as hard data:
Stage 1: Initial examination of source, data and metadata. An early assessment of the data, the metadata and the source.
Stage 2: Acquisition of data and assessment. This stage entails negotiations with the source with a view to acquiring a set of files or file extractions adequate for rigorous testing. The primary objective is to clarify whether the source is willing and able to deliver files or extractions at the record level, and to keep a communication channel open during the testing process.
Stage 3: Forensic investigation. This stage requires a fair amount of work by the NSI. It is divided into four distinct phases: i) producing a clean microdata file (halfway through which a decision point is met); ii) using the file to produce and analyse aggregate statistics; iii) producing pilot new outputs or using the file in the production of existing outputs; and iv) assessing the capacity of the existing statistical tools to handle the new data.
Stage 4: NSI decision. This stage is dedicated to the assessments necessary for a corporate decision to be made on the basis of as much information and knowledge as possible. It can be sub-divided into four distinct phases: i) an itemisation of the exact uses of the new data and their impacts; ii) a top-level cost-benefit analysis focusing on the financial picture; iii) an assessment of the risks that need to be undertaken and managed by the NSI; and iv) an assessment of the feasibility of incorporating the new source into the gamut of the NSI’s statistical operations from a legislative and socio-political point of view.
Stage 5: Formal agreement with source. This final stage involves high-level negotiations with the source as an institution to secure cooperation and arrive at a formal and comprehensive agreement.
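The step-wise character of the procedure, where each stage acts as a gate before the next, can be sketched as follows. The stage names follow the outline above, but the 0-5 scoring scale and the pass threshold are illustrative assumptions, not the proposed assessment metric.

```python
from enum import Enum

class Verdict(Enum):
    PROCEED = "proceed"
    REJECT = "reject"

# Stages in order; a source must clear each gate before the next is assessed.
STAGES = [
    "initial examination",
    "acquisition and assessment",
    "forensic investigation",
    "NSI decision",
    "formal agreement",
]

def accredit(scores: dict, threshold: int = 3):
    """Walk the stages in order; stop at the first one scoring below threshold."""
    for stage in STAGES:
        if scores.get(stage, 0) < threshold:
            return Verdict.REJECT, stage
    return Verdict.PROCEED, None

ok = {s: 5 for s in STAGES}
# accredit(ok) -> (Verdict.PROCEED, None)
weak = dict(ok, **{"forensic investigation": 2})
# accredit(weak) -> (Verdict.REJECT, "forensic investigation")
```

Returning the failing stage alongside the verdict reflects the gradual nature of the procedure: an NSI learns not only that a source failed accreditation, but at which gate, and so how much effort was invested before rejection.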
Eurostat (2013). Methodological manual for statistics on the Information Society, version 3. Luxembourg: Eurostat.