A third WP1 “virtual sprint” was held on 1 February 2017, focusing on quality. We decided to explore two different quality frameworks: the Statistics New Zealand Quality Framework for Administrative Data and the UNECE Framework for the Quality of Big Data.
Sweden, Germany, Greece and the United Kingdom focused on the first, while Slovenia focused on the second.
Statistics New Zealand Quality Framework for Administrative Data
The framework serves as a tool to help identify imperfections and errors that can arise when stepping down from ideal concepts and populations to the data captured in practice. It is split into two phases: phase 1 evaluates single datasets in isolation against the purpose for which the data were collected, and phase 2 evaluates the process of combining variables and objects from several datasets to measure a target statistical concept. In what follows we explain the individual phases in more detail and give examples relevant to our scenario of web scraping for job vacancies (JV).
Phase 1 of the framework applies to a single dataset in isolation and considers errors in terms of measurement (variables) and representation (objects).
The measurement side describes the errors occurring when stepping down from an abstract target concept we would ideally measure, to a concrete variable we eventually work with. The representation side, on the other hand, describes the types of errors arising when narrowing down the ideal (target) population to the set of objects we eventually observe. The output is a single source micro dataset that could be used for different statistical purposes. The framework is designed to support the measurement of total “survey” error.
As an example, consider the following scenario of web-scraping for job vacancy counts from a given job portal.
The measurement side describes the steps from the abstract target concept through to an edited value for a concrete variable.
The representation side describes the definition and creation of the elements of the population being measured, or ‘objects’.
It is important to note that the examples of errors are driven by the choice of the individual measures and sets, which effectively serve as an input to the framework. For example, if we chose a different target concept and target measure, there would be different validity errors (see examples below).
Other examples of the errors in phase 1:
Phase 2 involves combining data from several different sources in order to measure a statistical target concept and population. In this phase, the measurement side focuses on variables and the kinds of errors that may arise when combining and aligning datasets. The representation side focuses on coverage error of linked datasets, identification error (the alignment between base units and composite linked units) and unit error, which may be relevant if the output involves the creation of new statistical units.
To continue the previous example, consider now the scenario in which we want to compare the counts from the scraped portal with those from a survey.
Other examples of the errors in phase 2:
We were able to apply the Quality Framework for Administrative Data successfully to the scenario of web scraping for job vacancies, and doing so identified several possible sources of error. Phase 1 proved easier to apply and seems well suited to evaluating a single web-scraped dataset, or the job vacancy survey (JVS). Phase 2 was slightly more complex, possibly because our scenario seems to involve two distinct integration steps. The first corresponds to the integration of data from different job portals, while the second involves integration of the resulting composite micro data source with the JVS to produce a final statistical output. To a certain extent, phase 2 can be applied in both cases. However, Reid et al. (2017) propose a “Phase 3” that focuses on potential errors resulting from the creation of final estimates from the composite micro-data, suggesting that this may be where integration with survey data should be considered.
UNECE Framework for the Quality of Big Data
Type of BD: www data
BD supplier: BD is pulled from “free” sources (subject to legislation), e.g. job portal websites. Because IT robots are used for web scraping, there is always a chance of being blocked by website owners.
There is always a possibility that the enterprises which run job portals will cease to exist. Under the assumption that no permission is needed to access the JV data, the risk to sustainability through time is low: in that case we would find substitutes (a new JP owner) providing the same services. If we needed permission from a new owner to access the JV data, the risk would be higher.
On the scale “1: high risk, 2: medium risk, 3: low risk, 0: don't know”, we estimate a medium risk (2) for sustainability through time.
The relevance of JV data is very high. JV data could be used as the sole source or as one of several sources for producing existing statistics. This source could also be used to create new types of statistics.
JV data are in general not sensitive to privacy and security issues. However, there could be some privacy issues in the case of specialized agencies that advertise job vacancies on behalf of other parties (enterprises). They may not be authorized to reveal the name of the enterprise that is seeking an employee.
Currently there is no written agreement with the owners of job portals setting out the terms and conditions for accessing their JV data.
Periodicity: periodical and repeated (weekly scraped data)
Punctuality of delivery: /
SURS uses IT robots (Web scraping studio) for web collection of JV data…..
Owner: usage for advertising purposes,…
NSI: for creating official statistics in the JV domain, and as an additional source for creating early economic indicators
JV data are collected by Web scraping studio and stored in Excel files.
Some of the JV data are well structured (e.g. position, name of the advertiser, date of the advertisement, …). Some of the data are unstructured (e.g. number of employees needed).
Some variables are explicitly named (name of enterprise, job title, date of advertisement, …), while others are not (e.g. deadline for application, number of employees needed). There is no description of the variables we are interested in.
Some JV advertisements are published in formats (PDF, image, …) that do not allow us to scrape them.
There is almost no metadata available.
New skills are needed in order to collect and process this kind of data.
Timeliness: / (related to data which is provided by data provider)
Periodicity: / (related to data which is provided by data provider)
Changes through time: websites may change, and the structure of certain websites may change.
Presence and quality of linking variables:
No variables are available that could serve directly as a linking key.
There are some variables which could be used in a record linkage (matching) procedure:
Name of enterprise + location of JV = id of enterprise from BR
Linking level: /
Linking: There are potential (indirect) linking variables such as:
- name of enterprise, which could be used for linkage with the BR
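The indirect linkage described above (name of enterprise plus location of the JV mapped to an enterprise id from the BR) can be sketched as follows. This is only a minimal illustration: the register extract, the field names `enterprise` and `location`, the ids and the normalisation rules are all hypothetical and do not describe the procedure actually used at SURS.

```python
import re
import unicodedata
from typing import Optional, Tuple


def normalise(text: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def linkage_key(name: str, location: str) -> Tuple[str, str]:
    """Composite key: normalised enterprise name + normalised location."""
    return (normalise(name), normalise(location))


# Hypothetical business register extract: composite key -> enterprise id.
business_register = {
    linkage_key("Primer d.o.o.", "Ljubljana"): "BR-0001",
    linkage_key("Vzorec d.d.", "Maribor"): "BR-0002",
}


def link_ad(ad: dict) -> Optional[str]:
    """Return the BR id for a scraped ad, or None if no exact key match."""
    return business_register.get(linkage_key(ad["enterprise"], ad["location"]))


scraped_ad = {"enterprise": "PRIMER  D.O.O", "location": "ljubljana"}
print(link_ad(scraped_ad))
```

Exact matching on a normalised key is the simplest possible choice; in practice approximate string matching would probably be needed for names that are abbreviated or misspelled in the advertisements.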
Consistency: There are some inconsistencies (anomalies) in the data (values out of range, …) and missing values (number of employees, deadline for applications, …).
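Consistency checks of this kind can be expressed as simple record-level rules. The sketch below is an assumption-laden illustration: the field names (`positions`, `deadline`, `published`) and the plausibility bounds are hypothetical and are not taken from the scraped files.

```python
from datetime import date
from typing import List

MAX_POSITIONS = 1000  # hypothetical upper bound for a plausible value


def check_record(ad: dict) -> List[str]:
    """Return a list of consistency problems found in one scraped ad."""
    problems = []

    positions = ad.get("positions")
    if positions is None:
        problems.append("positions: missing")
    elif not 1 <= positions <= MAX_POSITIONS:
        problems.append("positions: out of range")

    deadline = ad.get("deadline")
    if deadline is None:
        problems.append("deadline: missing")
    elif deadline < ad["published"]:
        problems.append("deadline: before publication date")

    return problems


ad = {"published": date(2017, 1, 16), "positions": 0, "deadline": None}
print(check_record(ad))
```

Counting how often each problem occurs across all records would give a first indicator for the consistency dimension.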
Transparency of methods and process:
All methods and processes for the data phases are documented (…).
The methods and processes by which the JV data are generated are not known.
Accuracy and Selectivity
Overcoverage: overcoverage occurs because the scraped data also include job ads for work outside Slovenia and job ads for student work. These job ads represent about 17 % of all scraped job ads.
Undercoverage: undercoverage occurs if we do not collect job ads for companies from certain SKD activities because those ads are not published on the portals we scrape (they may be advertised elsewhere). We have not yet assessed the amount of undercoverage.
Under/over-representation, exclusion of sub-populations: under-representation and exclusion are possible for some SKD activities. We have not assessed this yet. It is also possible that some occupational groups are advertised less, or not at all, on the biggest job portals from which we obtain our data.
Missing data: data are missing for two variables:
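The overcoverage share can be computed by flagging out-of-scope ads and dividing by the total number of scraped ads. The following is a minimal sketch; the fields `country` and `student_work` are hypothetical, and the four sample records are invented for illustration only (they are not meant to reproduce the 17 % figure).

```python
from typing import Iterable


def is_out_of_scope(ad: dict) -> bool:
    """An ad is out of scope if the job is abroad or is student work."""
    return ad["country"] != "SI" or ad["student_work"]


def overcoverage_rate(ads: Iterable[dict]) -> float:
    """Share of scraped ads that do not belong to the target population."""
    ads = list(ads)
    if not ads:
        return 0.0
    return sum(is_out_of_scope(ad) for ad in ads) / len(ads)


sample = [
    {"country": "SI", "student_work": False},
    {"country": "AT", "student_work": False},  # work outside Slovenia
    {"country": "SI", "student_work": True},   # student work
    {"country": "SI", "student_work": False},
]
in_scope = [ad for ad in sample if not is_out_of_scope(ad)]
print(overcoverage_rate(sample), len(in_scope))
```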
Reference datasets: we do not have any reference dataset.
Duplicates: duplicates arise when the same job vacancy is published on more than one job portal. We have not detected any duplicates within a single job portal.
Acceptable range of data: most of the data are within the acceptable range. However, place of work is not always within that range, as some ads advertise jobs abroad. We would also like to have place of work at the level of municipality, but in practice that is not the case.
Variables being linked: we link company name with the Slovenian Business Register, and we will link place of work with NUTS and the title of the job post with the Standard Classification of Occupations.
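One simple way to detect such cross-portal duplicates is to compare ads on a normalised key. The sketch below assumes, hypothetically, that each ad carries `enterprise`, `title`, `place` and `portal` fields, and that two ads agreeing on the first three describe the same vacancy; that rule is an illustration, not the method in use.

```python
from typing import Iterable, List, Tuple

KEY_FIELDS = ("enterprise", "title", "place")  # hypothetical field names


def dedup_key(ad: dict) -> Tuple[str, ...]:
    """Case- and whitespace-insensitive key identifying one vacancy."""
    return tuple(" ".join(str(ad[f]).lower().split()) for f in KEY_FIELDS)


def drop_cross_portal_duplicates(ads: Iterable[dict]) -> List[dict]:
    """Keep the first ad seen for each key, in original order."""
    first_seen = {}
    for ad in ads:
        first_seen.setdefault(dedup_key(ad), ad)
    return list(first_seen.values())


ads = [
    {"portal": "A", "enterprise": "Primer d.o.o.", "title": "Programmer", "place": "Ljubljana"},
    {"portal": "B", "enterprise": "primer d.o.o.", "title": "programmer ", "place": "Ljubljana"},
    {"portal": "B", "enterprise": "Vzorec d.d.", "title": "Accountant", "place": "Maribor"},
]
unique_ads = drop_cross_portal_duplicates(ads)
print(len(ads) - len(unique_ads), "duplicate(s) removed")
```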
Percentage of linkage: about 95 % of company names are linked unambiguously, about 3.5 % are linked ambiguously, and about 1.5 % of company names are not linked.
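Linkage percentages of this kind can be derived by counting, for each company name, how many register candidates it matches. The sketch below is illustrative only: the register index and the names are hypothetical, and the resulting shares are unrelated to the figures reported above.

```python
from collections import Counter
from typing import Dict, Iterable, List

STATUSES = ("unambiguous", "ambiguous", "not linked")


def link_status(name: str, register_index: Dict[str, List[str]]) -> str:
    """Classify a name by the number of candidate register ids it matches."""
    candidates = register_index.get(name, [])
    if len(candidates) == 1:
        return "unambiguous"
    if len(candidates) > 1:
        return "ambiguous"
    return "not linked"


def linkage_shares(names: Iterable[str],
                   register_index: Dict[str, List[str]]) -> Dict[str, float]:
    """Percentage of names falling into each linkage status."""
    counts = Counter(link_status(n, register_index) for n in names)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: 100 * counts[status] / total for status in STATUSES}


# Hypothetical index: normalised company name -> candidate register ids.
register_index = {"primer": ["BR-0001"], "vzorec": ["BR-0002", "BR-0003"]}
names = ["primer", "vzorec", "neznan", "primer"]
print(linkage_shares(names, register_index))
```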
Data integration with other files: we could use the title of the job post for easier removal of duplicates between job portals and between different data sources (data from company websites, data from administrative source), which would make integration of data sources more accurate.
The data currently measure the number of job ads rather than the number of job vacancies advertised via job portals.
Assessment of the quality framework was not completed due to time constraints. However, the work to date suggests this framework is suitable.
Overall, the UNECE framework seems more intuitive and so could be a better starting point for identifying quality issues with on-line advertisement data. However, the Statistics New Zealand Quality Framework is designed to support a total survey error approach which could further deepen the accuracy and selectivity dimension of the UNECE framework. These elements would become more important when considering how on-line job advertisements could be moved into statistical production.