WP1 2016 09 29-30 Virtual Sprint


Second virtual sprint: Briefs by country


Sweden:

Objective: To link job adverts to the Business Register by enterprise name, adding geographic data such as postcode and municipality if the names alone do not give good enough matches.

Data: Job portal data from the Swedish Employment Agency and the Business Register database

Description: If the job portal data do not use the organization number as a key in the database, we need to link the adverts to the corresponding enterprise using other linkage variables. The links can help to study the quality of the portal data by comparing them with job vacancy statistics.

The method developed can be evaluated with the portal data from the Swedish Employment Agency, as well as the name of the business and a large number of other variables.
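As a rough illustration of the linkage idea described above, the following Python sketch first tries an exact match on the normalised enterprise name and, when that is missing or ambiguous, falls back to a fuzzy name comparison restricted to enterprises in the same postcode. It is only a sketch under stated assumptions: the field names (`employer`, `name`, `postcode`, `id`), the threshold of 0.8 and the example records are all invented for illustration and do not come from the actual portal data or the Business Register.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def link_advert(advert, register, threshold=0.8):
    """Return the register id linked to an advert, or None if no link is found.

    Stage 1: exact match on the normalised name (accepted only if unique).
    Stage 2: fuzzy name comparison restricted to the advert's postcode.
    """
    name = normalise(advert["employer"])
    exact = [r for r in register if normalise(r["name"]) == name]
    if len(exact) == 1:
        return exact[0]["id"]

    # Fall back to geography: compare only with enterprises in the same postcode.
    candidates = [r for r in register if r["postcode"] == advert["postcode"]]
    best_id, best_score = None, 0.0
    for r in candidates:
        score = SequenceMatcher(None, name, normalise(r["name"])).ratio()
        if score > best_score:
            best_id, best_score = r["id"], score
    return best_id if best_score >= threshold else None


# Invented example records (hypothetical names, ids and postcodes).
register = [
    {"id": "A1", "name": "Nordic Bygg AB", "postcode": "11122"},
    {"id": "B2", "name": "Nordic Bygg AB", "postcode": "41101"},
    {"id": "C3", "name": "Svea Transport AB", "postcode": "21119"},
]

if __name__ == "__main__":
    for advert in [
        {"employer": "Svea Transport AB", "postcode": "99999"},
        {"employer": "Nordic Bygg", "postcode": "41101"},
        {"employer": "Helt Annat", "postcode": "21119"},
    ]:
        print(advert["employer"], "->", link_advert(advert, register))
```

Restricting the fuzzy comparison to one postcode keeps the number of comparisons small and lets a looser name threshold be used with less risk of false links; municipality could be used in the same way as a coarser block.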

Attendees: Ingegerd Jansson, Dan Wu

Sweden 2nd Virtual Sprint Report


Greece:

Description: To link the names of enterprises taken from scraped data with the enterprises in the sample of the Job Vacancies Survey, in order to better understand issues of coverage. If the results are not satisfactory, we will then try to link the names of enterprises in the scraped data with the Business Register.

Data: Scraped job vacancies data from the two Greek job portals, enterprises from the sample of the Job Vacancies Survey, and the Business Register

Attendees: Christina Pierrakou, Eleni Bisioti

Greece 2nd Virtual Sprint Report

United Kingdom:

Objective: To match job advertisements collected from job portals to the Job vacancy statistics sample from the Business Register.

Data: Job portal data (Reed, Careerjet, Jobs.ac.uk, the Guardian, the Times, Indeed, and Totaljobs) and a subset of the Business Register database which matches the Job vacancy survey sample.

Description: The names of the enterprises collected from job portals do not exactly match those in the Business Register (trade or legal name). Therefore, a number of data wrangling and matching approaches will be explored to match these enterprise names: for example, basic wrangling techniques such as removing unwanted words like "plc", and matching techniques such as fuzzy matching. If more than one enterprise has the same name, location information will be used to create a unique match. Linking the Job Vacancy Survey to the collected job portal data allows the quality of the portal data to be assessed.
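The wrangling and matching steps described above could look roughly like the Python sketch below. It is not the approach actually used in the sprint: the list of legal-form words, the similarity threshold of 0.85, the field names (`ref`, `name`, `town`) and the example enterprises are all assumptions made for illustration, and the fuzzy score comes from the standard library's `difflib.SequenceMatcher` rather than any specific matching tool.

```python
import re
from difflib import SequenceMatcher

# Illustrative list of legal-form words to strip; a real list would be longer.
LEGAL_WORDS = {"plc", "ltd", "limited", "llp"}


def clean_name(name):
    """Lower-case, drop punctuation and remove legal-form words such as 'plc'."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_WORDS)


def similarity(a, b):
    """Similarity in [0, 1] between two cleaned enterprise names."""
    return SequenceMatcher(None, clean_name(a), clean_name(b)).ratio()


def match_enterprise(advert_name, advert_town, register, threshold=0.85):
    """Return register rows whose name fuzzily matches the advert name.

    If several rows match, keep only those in the advert's town, when any exist.
    """
    hits = [r for r in register if similarity(advert_name, r["name"]) >= threshold]
    if len(hits) > 1:
        same_town = [r for r in hits if r["town"].lower() == advert_town.lower()]
        if same_town:
            hits = same_town
    return hits


# Invented example rows (hypothetical enterprises and reference numbers).
register = [
    {"ref": 1, "name": "Acme Widgets PLC", "town": "Leeds"},
    {"ref": 2, "name": "ACME WIDGETS LTD", "town": "Bristol"},
    {"ref": 3, "name": "Harbour Foods Limited", "town": "Hull"},
]

if __name__ == "__main__":
    print(match_enterprise("Acme Widgets", "Bristol", register))
    print(match_enterprise("Harbor Foods Ltd.", "Hull", register))
```

In this sketch the two "Acme Widgets" rows are indistinguishable by name once the legal-form words are removed, so the town decides which one is kept.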

Attendees: Nigel Swier, Liz Metcalfe, Adam Pohl, Vidhya Shekar

UK 2nd Virtual Sprint Report


Germany:

Objective: To compare the structure of job advertisements found on a major job board with aggregated results of the German Job Vacancy Survey.

Data: Web-scraped data from the job board stepstone.de (including information on region and economic activity sector) plus aggregated results from the German Job Vacancy Survey. If time allows, aggregated results from the statistics of vacancies registered at the German Federal Employment Agency might be added to the analysis.

Description: The job board stepstone.de (the biggest job board in Germany) is distinctive in that it includes information on the economic activity of the enterprise which, at least at first sight, is similar to the definition used in official statistics. For this reason it is a suitable basis for an exploratory analysis of the structural differences between job advertisements on a major job board and the JVS results. The comparison can be extended to regions, since region is also available as a variable in the web-scraped job portal data.
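One simple way to express such a structural comparison is to convert the counts per economic activity sector into shares and look at the differences, together with a summary measure such as the index of dissimilarity (half the sum of the absolute share differences). The Python sketch below assumes this measure purely for illustration; the source does not say which measure was used. The sector labels and counts are invented and are not results from stepstone.de or the JVS.

```python
def shares(counts):
    """Convert absolute counts per category into shares that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def compare_structure(portal_counts, survey_counts):
    """Return per-sector share differences (portal minus survey) and the
    index of dissimilarity (0 = identical structure, 1 = no overlap)."""
    p, s = shares(portal_counts), shares(survey_counts)
    sectors = sorted(set(p) | set(s))
    diff = {k: p.get(k, 0.0) - s.get(k, 0.0) for k in sectors}
    dissimilarity = 0.5 * sum(abs(d) for d in diff.values())
    return diff, dissimilarity


# Invented counts for illustration only.
portal = {"Manufacturing": 300, "ICT": 500, "Health": 200}
survey = {"Manufacturing": 400, "ICT": 200, "Health": 400}

if __name__ == "__main__":
    diff, index = compare_structure(portal, survey)
    for sector, d in diff.items():
        print(f"{sector}: {d:+.2f}")
    print(f"index of dissimilarity: {index:.2f}")
```

The same function can be applied to counts by region instead of by sector, since both are available as variables in the scraped data.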

Attendees: Martina Rengers, Thomas Körner (30 September only)

Germany 2nd Virtual Sprint Report


Possible demo of SURS approach?

1.  Basic assumption: There is a list of URLs of enterprises.
     Every country could prepare a list of a couple of hundred URLs, at least manually, for testing purposes.
2.  The application (a Python script) uses the URLs of enterprises as input to search for sublinks that may contain information about JV ads. The content (only text for now) of those sublinks of a given enterprise URL that potentially contain JV ads is scraped.
3.  Using machine learning techniques, the number of enterprises with JV ads is detected (and also the number of JV ads). This part is implemented in Orange (this tool will be presented at the ESSnet workshop in Ljubljana).
4.  We are also currently developing a tool with which we are able to identify old JV ads and remove duplicates.
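Steps 2 and 4 above can be sketched in Python as follows. This is not the SURS script, only a minimal illustration under stated assumptions: the keyword list, the function names `candidate_sublinks` and `deduplicate`, and the sample HTML are hypothetical, the page is parsed from a string rather than downloaded, and the machine learning step implemented in Orange is not covered.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keywords hinting that a sublink leads to job vacancy content.
KEYWORDS = ("job", "career", "vacanc")


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = urljoin(self.base_url, self._href)
            self.links.append((url, "".join(self._text).strip()))
            self._href = None


def candidate_sublinks(base_url, html):
    """Return sublinks whose path or anchor text contains a keyword."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    hits = []
    for url, text in parser.links:
        haystack = (urlparse(url).path + " " + text).lower()
        if any(k in haystack for k in KEYWORDS):
            hits.append(url)
    return hits


def deduplicate(ad_texts):
    """Drop ads whose text is identical after lower-casing and whitespace collapsing."""
    seen, unique = set(), []
    for text in ad_texts:
        key = hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Invented page content for illustration.
SAMPLE_HTML = """
<a href="/about">About us</a>
<a href="/careers">Work with us</a>
<a href="/news/2016">News</a>
<a href="/company/team">Open vacancies</a>
"""

if __name__ == "__main__":
    print(candidate_sublinks("http://www.example.com/", SAMPLE_HTML))
    print(deduplicate(["Sales assistant, Ljubljana", "sales  assistant,  ljubljana", "Driver"]))
```

Exact-hash deduplication only catches ads that are identical after normalisation; near-duplicates and old ads would need additional rules, such as a similarity threshold or a comparison of posting dates.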