Use of Natural Language Processing to innovate and improve OJA data

Natural Language Processing (NLP) is a form of artificial intelligence employed by the WIH to make better use of the wealth of web content extracted from online job portals and to improve the quality of OJA data. Read this item to learn more about the WIH’s use of NLP techniques and how to navigate its latest WIH-OJA-NLP table releases.

date:  23/04/2024

Natural Language Processing (NLP) is a technique used by the WIH to extract, analyse and derive meaningful insights from the text of Online Job Advertisements (OJAs). In this item, we explore the WIH’s use of NLP, and how to navigate the latest releases of the NLP tables.  

Background

NLP is a form of artificial intelligence concerned with the interaction between computers and human language. It allows computers to interpret language and to produce meaningful, relevant output in human language.

In the context of the WIH’s work on OJA data, NLP involves using computational techniques to extract, analyse and derive meaningful insights from the language used in the job descriptions contained in online job advertisements. Specifically, the WIH uses NLP samples to check and improve data quality, for example by collecting labelled data.

If we use only the structured web content scraped from specific fields of the advertisements (e.g. job title, location), the wealth of information in OJAs from online portals remains underutilised. NLP enables us to extract further, more detailed information from OJA job descriptions and to link it with data derived from structured OJA fields, for instance the job title or salary.

WIH-OJA-NLP is one of the dataflows set up to make use of this additional information (the full job description) in line with statisticians' and data users' needs. It serves a number of purposes: to collect labelled data, facilitate quality control, make statistical analysis more innovative, and improve the quality of OJA data.  

NLP can also help to explore the link between information on skills in the job descriptions of online job advertisements and information on skills derived from alternative sources (for example, course descriptions from higher education institutions), and to improve and update the European Skills, Competences, Qualifications and Occupations (ESCO) classification by identifying and analysing skills missing from its mapping.

Navigating the WIH’s latest NLP data releases  

Two new WIH-OJA-NLP tables are now available to all DataLab users in the OJA research area of the WIH, containing the following datasets: 

· wih_oja_nlp_occupation_class_v3_r20231201
· wih_oja_nlp_skills_class_v3_r20231201

Both datasets provide stratified samples obtained from the WIH-OJA-NLP dataflow. The samples could serve as training or testing data for algorithms used in the classification of OJA data. The granular stratification allows algorithms to learn and generalise patterns within specific occupation or skill categories, contributing to improved accuracy of OJA data.  
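As an illustration of what a stratified sample is, the following is a minimal Python sketch, not the WIH's actual sampling procedure. The record fields (`occupation`, `text`), the toy advertisements and the helper name `stratified_sample` are all hypothetical and chosen only for this example. The idea is to group advertisements by category (the stratum) and draw a fixed number from each group, so that rare categories are still represented in the training or testing data.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each stratum (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the records by the value of the stratification variable.
    strata = defaultdict(list)
    for record in records:
        strata[record[stratum_key]].append(record)

    # Sample within each stratum; small strata contribute all their records.
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Toy advertisements; the occupation codes and texts are invented for illustration.
ads = (
    [{"occupation": "2512", "text": f"software developer ad {i}"} for i in range(4)]
    + [{"occupation": "5223", "text": f"shop assistant ad {i}"} for i in range(2)]
    + [{"occupation": "7112", "text": "bricklayer ad 0"}]
)

sample = stratified_sample(ads, stratum_key="occupation", per_stratum=2)
counts = Counter(ad["occupation"] for ad in sample)
print(dict(sorted(counts.items())))
```

In this sketch the frequent category is capped at two advertisements, while the category with a single advertisement is kept in full, which is what lets a classifier see examples from every category.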

The WIH-OJA-NLP dataflow is continuously improved to address the needs of diverse users. In particular, the WIH-OJA-NLPv3 dataflow marks considerable progress in the WIH’s use of NLP techniques. Containing more than 50 million advertisements, it is a larger and more diverse dataflow for analysis than ever before. Its granular stratification enables a more targeted and nuanced analysis of occupations and skills.

While WIH-OJA-NLPv3 is geared towards more specialised study of occupations and skills, WIH-OJA-NLPv2 may better serve users interested in a broader analysis of a larger number of variables derived from an OJA.

The WIH-OJA-NLP dataflow now features two main sample designs (v1 and v2) and introduces more granular stratification, specific to skills and occupations (v3). 

All DataLab users interested in finding out more about the releases of these NLP tables, and about how the WIH is innovating through its use of NLP techniques, are encouraged to check out our WIH Blog, where we take a deep dive into this topic.
