Microdata are sets of records containing information on individual respondents. To protect the anonymity of respondents (persons, households, organisations), access to microdata is restricted, usually to researchers.
Types of microdata
- public use files (PUF): files containing records on individual respondents (persons, households, business entities), anonymised so that a respondent cannot be identified either directly (by name, address, social security number, etc.) or indirectly (by combining different, especially rare, characteristics such as age, occupation or education). PUF are not confidential and may in principle be used by the general public. Because of the extensive anonymisation (many variables are suppressed, modified or grouped together), PUF are of limited use for scientific purposes and are very often used for training; see the project "PUF for Eurostat microdata" for more.
- scientific use files: files containing records on individual respondents (persons, households, business entities), anonymised so that the risk of identifying a respondent is appropriately reduced but not eliminated completely. These files are therefore considered confidential and access to them is restricted. Scientific use files are usually sent to researchers on CDs (or can be downloaded from the website) and are used at the researchers' premises (off-site).
- secure use files: files that are not anonymised and are therefore highly confidential. They usually contain no direct identifiers (e.g. name, address, social security number), but a combination of variables may still lead to the identification of a respondent. Access to secure use files is restricted, and the results of any analysis are checked by the staff of the statistical office (see Guidelines on output checking); only safe (non-confidential) output is released to researchers. Researchers may access secure use files in different ways:
- On-site, at the premises of the statistical office;
- Remotely, by connecting to the servers holding the data from another, distant location (e.g. a university or research organisation).
Scientific use files and secure use files are confidential data. To obtain access to these data, researchers must fulfil certain conditions (established by the data owners) and sign the relevant contracts or licences. In addition, researchers must respect the guidelines for publication that are delivered with the data and any other conditions imposed by the data owner.
See more: conditions of access to EU microdata
Modes of access to microdata
- off-site access: the data are sent or transmitted to the user and can be analysed anywhere (PUF) or in the agreed places (scientific use files);
- on-site access: the data (secure use files) can be consulted only in predefined locations (e.g. the safe centre of a statistical office); the results of the analysis are checked by the staff of the office before the researcher can take them out; the final results cannot contain any confidential data (see more: Guidelines on output checking);
- remote access: the data (usually secure use files) are accessed by researchers who connect to them from another, distant location; as with on-site access, the results of the analysis cannot be taken out by the researcher until the output has been checked (output checking);
See more: remote access to EU microdata
- remote execution: the user (usually a researcher) does not see the data but sends scripts (in a statistical language such as SAS, SPSS or Stata) and receives back cleared output (no confidential results); the output is checked manually (as with on-site or remote access) or automatically; the major drawbacks for users are that a remote execution system does not allow them to explore the data interactively and that the results arrive with a delay.
Anonymisation is the process of reducing or eliminating the risk of identification of respondents in microdata. The statistical methods used to anonymise the data are called statistical disclosure control.
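As an illustration of indirect identification, the minimal sketch below counts how many records share each combination of characteristics and then groups exact ages into bands to reduce the number of unique combinations. The records, the band width of 20 years, the threshold k=2 and the function names are all hypothetical choices made for this example; they are not taken from any statistical office's actual procedure.

```python
from collections import Counter

# Hypothetical records: (age, occupation, education). Illustrative values only.
records = [
    (34, "teacher", "tertiary"),
    (36, "teacher", "tertiary"),
    (38, "teacher", "tertiary"),
    (52, "farmer", "secondary"),
    (57, "farmer", "secondary"),
    (91, "surgeon", "tertiary"),
]

def risky_records(rows, k=2):
    """Return rows whose combination of values is shared by fewer than k rows."""
    counts = Counter(rows)
    return [row for row in rows if counts[row] < k]

def group_age(row, width=20):
    """Replace the exact age with the lower bound of a band of the given width."""
    age, occupation, education = row
    return (age // width * width, occupation, education)

# With exact ages every record is unique on these three variables.
before = risky_records(records)

# After grouping ages into 20-year bands, only one combination is still unique.
grouped = [group_age(r) for r in records]
after = risky_records(grouped)

print(len(before), "records at risk before grouping")
print(len(after), "records at risk after grouping:", after)
```

In this toy data the one record that remains unique after grouping would need further treatment, for example suppressing or modifying one of its values.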
Statistical disclosure control
Statistical disclosure control is the statistical domain aiming at:
- microdata protection (anonymisation)
- tabular data protection (elimination of the risk of identification of the respondents in the tables published by statistical offices)
See more: Handbook on SDC
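For tabular data, one simple protection rule is to suppress cells that are based on very few respondents. The sketch below applies such a rule to a small frequency table. The table, the threshold of 3 and the function name are assumptions made for illustration only; actual thresholds and rules are set by the data owner.

```python
def suppress_small_cells(table, threshold=3):
    """Replace counts strictly between 0 and threshold with None (suppressed)."""
    return {
        cell: (None if 0 < count < threshold else count)
        for cell, count in table.items()
    }

# Hypothetical frequency table: (region, occupation) -> number of respondents.
table = {
    ("North", "farmer"): 120,
    ("North", "surgeon"): 1,
    ("South", "farmer"): 85,
    ("South", "surgeon"): 14,
}

protected = suppress_small_cells(table)
for cell, count in protected.items():
    print(cell, "suppressed" if count is None else count)
```

This covers only the first step. If row or column totals are published as well, a suppressed cell can sometimes be recovered by subtraction, so additional cells may also need to be suppressed.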
Access to microdata