Merging statistics and geospatial information - introduction
This article forms part of Eurostat’s statistical report on Merging statistics and geospatial information: 2019 edition.
Location is a key attribute to virtually all official statistics: it provides the structure for collecting, processing, storing, analysing and aggregating data. Moreover, location is a concept that most people are comfortable with, as statistics for a specific place, region or area help people to understand the relevance of particular indicators.
Merging statistics and geospatial information
Traditionally, geospatial information and statistics were attributed to different organisational entities in each country with little cooperation. However, the association of geography and statistics has the potential to generate information far beyond the simple representation of data on a map. Linking geo-referenced and numerical statistics in spatial analysis has the potential to reveal relationships and phenomena which are difficult to discover by analysing statistical databases alone. Furthermore, technological advances and new policy demands have shown that both fields can beneficially be combined. This development has created organisational challenges for the European statistical system (ESS), driving increased levels of cooperation between statistical authorities across the European Union (EU) and other organisations, such as national mapping agencies or providers of big data.
What is a geographical information system (GIS)?
A geographic information system (GIS) is a tool for the management, analysis, presentation and dissemination of geo-referenced data, in other words, data associated with their geographic location. This is evidently the case for topographic information about roads, rivers or administrative boundaries which have been traditionally represented on maps. However, a wide range of additional data sources can also be geo-referenced. Indeed, all statistics inherently have a geographical dimension, be it data covering the whole of an EU Member State, a region, a smaller administrative unit, or indeed an enterprise or a household.
A data revolution has resulted in the ever-increasing availability of statistical and geospatial data. This pattern of development is linked to the growing volume of data that is generated by the internet of things. At the same time as the volume of geospatial data has been increasing exponentially, the ESS has undergone a comprehensive process of reform covering most aspects of its statistical production. These reforms have been driven, in part, by new demands from policymakers to support evidence-based decision-making through better descriptions of societies, economies and the environment within the context of, for example, globalisation, demographic challenges or environmental threats.
Policymaking has increasingly moved beyond the confines of national borders: examples of current cross-border policies are the Europe 2020 strategy and the sustainable development goals (SDGs). At the same time, European funding for regional and cohesion policy has focused attention on specific territorial characteristics, for example targeting economic, environmental and social problems in cities and/or rural areas. Another change has been the increased level of demand for territorial disaggregation within official statistics: for example, citizens are often most affected by decisions which influence their immediate neighbourhood and this has resulted in governments/local authorities/political opponents increasingly seeking information at a very precise level of detail so they may analyse and illustrate the impact of various programmes and policies. As a result, policymakers and analysts are looking for detailed information across a broad range of spatial dimensions, such as cities and/or rural areas, local administrative units and/or 1 km² grid cells.
At a global level, the lead on geospatial information is taken by the United Nations Committee of Experts on Global Geospatial Information Management (UN-GGIM), which acknowledged the ‘critical importance of integrating geospatial information with statistics and socio-economic data’. Its work is based on the development of a global statistical geospatial framework (GSGF) designed to provide an interoperable method for geospatially coding and managing geospatial statistics and information, by connecting statistics that describe socioeconomic and environmental attributes to information that describes our physical, man-made and natural environment.
Within Europe, the implementation of a strategy for merging statistics and geospatial information follows global guidelines and has been organised, to a large degree, under the auspices of GEOSTAT projects and annual conferences of the European Forum for Geography and Statistics (EFGS), both of which provide methodological guidance and are funded by Eurostat. The first GEOSTAT project was launched at the beginning of 2010 by Eurostat in cooperation with the EFGS, to promote grid-based statistics and more generally to work towards the integration of statistical and geospatial information in a common information infrastructure. Thereafter, there have been two further GEOSTAT projects. GEOSTAT 2 provided a model for a point-based geocoding infrastructure (based on addresses, buildings and/or dwelling registers). GEOSTAT 3 is an ongoing initiative, designed to foster better integration of geospatial information and statistics by developing a European version of the global statistical geospatial framework (ESS-SGF) that focuses on providing more qualified descriptions and analyses of society and environment, with work concentrated on sustainable development and the census.
The ESS has responded to these challenges by acknowledging demand for a higher level of geographical detail in its statistics: identifying geospatial information as a valuable data source; recognising its potential for integrating data from multiple sources (administrative, statistical and/or big data); and acknowledging that it represents a considerable opportunity to create more relevant information and respond better to user needs. By geocoding various types of data, statisticians aim to link different data sets through the use of location as a neutral concept, thereby joining disparate data sources together. By doing so, the geocoding of data should help policymakers and analysts to answer the ‘where?’ in addition to the ‘what?’ and the ‘when?’ which have traditionally been the focus of official statistics.
Within national statistical authorities, competences for geospatial information often remain concentrated in specialised teams. They may have considerable experience in using geographical information systems (GIS), with the preparation of statistical maps an established practice. Some EU Member States also use maps to show statistics at a regional level or for smaller-scale territorial typologies. Hence, GIS is by no means a new technology for some members of the ESS. However, given the growing importance of geospatial information, the rapid expansion of data visualisation technologies and the emergence of entirely new groups of users, it would appear all the more important that the expert knowledge embedded in most national statistical authorities is disseminated as widely as possible to colleagues in the same Member State as well as to other Member States. Yet, at the time of writing, the work conducted by a relatively large number of statisticians has yet to be impacted by developments in GIS, with many having few or no specific competences in this area.
What is GISCO?
The geographical information system of the (European) Commission, GISCO, contains essential datasets such as topographic data and political boundaries. It provides corporate data and services for: administrative and statistical areas; hydrography; transport; land cover/land use; population distribution. These are areas of common interest for multiple stakeholders and are regularly used by a range of European Commission services that focus on regional policies, for example, the Directorates-General for Regional and Urban Policy, Mobility and Transport, Environment, Energy, Agriculture, Maritime Affairs and Fisheries, or the Joint Research Centre (JRC), as well as the European Environment Agency (EEA).
Data included in GISCO are defined by the GIS user community, organised through the European Commission’s inter-service group on geographical information (COGI); this group ensures the coherence, consistency and usability of data. The COGI also ensures the consistent and effective use of geographic information across European Commission services, as well as coordinating elements such as data acquisition, its software portfolio, the sharing of information and expertise, as well as the implementation of INSPIRE (see below) within the European Commission. Note that most of the data in GISCO are not available to the general public due to copyright/licence limitations.
The focus of geostatistics is principally on spatial things, in other words, real-world phenomena that have spatial extent or position. A spatial thing has various characteristics (such as its shape, name or boundary). Spatial things can be material (such as monuments, buildings or bridges) or non-material (such as administrative boundaries; boundaries of cadastral parcels of land; routes within a transport network). To code data at such a precise level, an address is often required: this provides a geographic data item that is used upstream of statistical processes, for example, during the collection of information either by post or in the field (by an interviewer). In both cases, an address provides direct access to the required respondent or object (by referring to a precise location with geographical coordinates that eliminates any ambiguity).
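As a rough illustration of how an address can be turned into coordinates, the following Python sketch matches a free-text address against a small reference table. The register contents, the coordinates and the function names (`normalise`, `geocode`) are hypothetical and invented for this example; an operational geocoder would rely on an authoritative address register and far more robust matching rules.

```python
import re

# Hypothetical address register: normalised address -> (latitude, longitude).
# The entries are invented for illustration only.
ADDRESS_REGISTER = {
    "1 main street, 1000 exampletown": (50.8467, 4.3525),
    "25 station road, 2000 sampleville": (51.2194, 4.4025),
}


def normalise(address: str) -> str:
    """Lower-case, trim and tidy spacing so that spelling variants match."""
    cleaned = re.sub(r"\s+", " ", address.strip().lower())
    # Use a single, consistent ", " separator between address parts.
    return re.sub(r"\s*,\s*", ", ", cleaned)


def geocode(address: str):
    """Return (lat, lon) for a known address, or None if it cannot be matched."""
    return ADDRESS_REGISTER.get(normalise(address))


if __name__ == "__main__":
    print(geocode("  1 Main Street ,1000   Exampletown "))
    print(geocode("99 Unknown Lane, 9999 Nowhere"))
```

Returning `None` for unmatched addresses, rather than guessing, keeps ambiguous records visible so that they can be reviewed separately.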
An effective geo-referenced statistical information infrastructure should be consistent and interoperable with spatial data infrastructures developed following the INSPIRE Directive (see box below). Indeed, more accurate and better exploitable geostatistics may generate added value in a range of areas:
- statistical authorities may reduce the cost of data collection, for example, by using GIS techniques and location-enabled devices to plan surveys better;
- statistical collection could be made more efficient by exploiting geo-referenced data and/or making use of it when reporting requirements change;
- dissemination could be improved, by aggregating the same data to produce statistics for, among others, grids, functional areas, regions or river basin districts.
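The last point in the list above, producing statistics for several geographies from the same data, can be sketched in a few lines of Python. The records, coordinates and region codes are invented, and the cell code format used in `grid_cell_code` is an illustrative assumption (cell size followed by the row and column index of the lower-left corner), not a statement of any official coding scheme.

```python
from collections import Counter

# Hypothetical geocoded microdata: projected coordinates in metres
# plus a region code. All values are invented for illustration.
records = [
    {"x": 4_695_200, "y": 2_599_800, "region": "R1"},
    {"x": 4_695_900, "y": 2_599_100, "region": "R1"},
    {"x": 4_696_300, "y": 2_599_500, "region": "R2"},
]


def grid_cell_code(x: float, y: float, size_m: int = 1000) -> str:
    """Build an illustrative cell code from the lower-left corner of the cell.

    The row and column are the corner coordinates expressed in units of the
    cell size, so every point inside a cell receives the same code.
    """
    col = int(x // size_m)
    row = int(y // size_m)
    return f"{size_m}mN{row}E{col}"


# The same records are aggregated twice: once to grid cells, once to regions.
by_grid = Counter(grid_cell_code(r["x"], r["y"]) for r in records)
by_region = Counter(r["region"] for r in records)

if __name__ == "__main__":
    print(dict(by_grid))
    print(dict(by_region))
```

Because the coordinates are kept at record level, adding a further output geography only requires another assignment function, not a new data collection.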
What is INSPIRE?
The INSPIRE Directive (2007/2/EC) entered into force in May 2007, establishing an infrastructure for spatial information in Europe (INSPIRE) to support Community environmental policies, and policies or activities which may have an impact on the environment. Its goal is to make geographic information held by public administrations more accessible through a geoportal (http://inspire-geoportal.ec.europa.eu/) that is accessible to everybody. To do so, data and metadata across 34 spatial data themes from regional, national and international sources are harmonised using an agreed set of standards that make it possible to share, combine and aggregate spatial information.
The INSPIRE infrastructure is characterised by the integration of spatial data from multiple sources across socioeconomic themes, for example, covering statistical and administrative units, population distributions, health statistics or information on energy and environmental resources. INSPIRE recommends the use of table joining services to integrate socioeconomic data with geographic data for administrative boundaries and/or statistical units. Simply put, this means linking information about people, businesses and physical objects to a particular place in order to improve the understanding of complex social, economic and environmental issues through data analysis, spatial analysis and thematic mapping.
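The idea of joining a statistical table to geographic units through a shared identifier can be illustrated with a minimal Python sketch. It does not use any INSPIRE service or API; the unit codes, names, areas and population figures are hypothetical, and `join_tables` is a name chosen for this example.

```python
# Hypothetical statistical table keyed by unit code (values are invented).
population = {"XX01": 120_000, "XX02": 45_000, "XX03": 8_500}

# Hypothetical geographic layer: unit code -> name and area in km².
units = {
    "XX01": {"name": "Unit A", "area_km2": 300.0},
    "XX02": {"name": "Unit B", "area_km2": 900.0},
    "XX04": {"name": "Unit D", "area_km2": 50.0},
}


def join_tables(stats, geo):
    """Attach statistics to geographic units that share the same code.

    Returns the joined records and the list of unit codes for which
    no statistical value was found.
    """
    joined, unmatched = {}, []
    for code, attrs in geo.items():
        if code in stats:
            density = stats[code] / attrs["area_km2"]
            joined[code] = {**attrs, "population": stats[code], "density": density}
        else:
            unmatched.append(code)
    return joined, unmatched


if __name__ == "__main__":
    joined, unmatched = join_tables(population, units)
    for code, row in joined.items():
        print(code, row["name"], round(row["density"], 1))
    print("No statistics for:", unmatched)
```

Reporting unmatched codes explicitly is a deliberate choice here: mismatches between statistical and geographic code lists are a common source of silent gaps on thematic maps.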
More information is available at: https://inspire.ec.europa.eu/
Investment in geostatistics also has the potential to improve greatly the synergies between national statistical authorities, mapping agencies and other authorities dealing in geographic and cadastral information: promoting the exchange and integration of statistics; avoiding duplication in data collection; and facilitating European, national and subnational reporting obligations. In order to be as effective and efficient as possible, the joining together of statistics and geospatial information needs to be focussed on harmonisation at the start of the process. In this way, geo-referenced statistics at microdata level can later be fully exploited, allowing them to be used in more flexible ways and for a broader range of analyses. For example, a single set of microdata may be used for providing statistics across a range of different territorial typologies, while the same data may also be used to analyse issues which are not necessarily known at the moment a survey was designed/carried out.
Such changes in statistical production have led to a number of established statistical conventions and rules being re-examined, the most prominent of which is the question of confidentiality. Official statistics have traditionally put considerable efforts into making the identification of statistical entities impossible or at least very difficult; this has mainly been done by eliminating confidential data from published results or aggregating data to a level that prevents disclosure of sensitive records. Geospatial information has a number of specific considerations and requires different types of safeguard to protect confidential information; the increased use of registries, disaggregation techniques and methods for modelling small areas are some of the most relevant areas of debate. In addition, the use of richer, more specific and detailed information also raises issues around business models and commercially relevant data, for example, those associated with common licensing schemes.
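One of the traditional safeguards mentioned above, withholding values that are based on too few statistical units, can be sketched as follows. The threshold of 3 and the function name `suppress_small_counts` are assumptions made for this example; actual disclosure control rules differ between statistical domains and authorities and are usually more elaborate.

```python
def suppress_small_counts(counts, threshold=3):
    """Replace counts below the threshold with None (meaning 'suppressed').

    `counts` maps an area or grid cell code to the number of units in it.
    The threshold is an illustrative assumption, not an official rule.
    """
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts per grid cell.
    cell_counts = {"cell_a": 10, "cell_b": 2, "cell_c": 3}
    print(suppress_small_counts(cell_counts))
```

With small grid cells, many values may fall below such a threshold, which is one reason why aggregation to larger areas is often considered alongside suppression.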
Purpose of this publication
Since 2009, Eurostat and the ESS have stepped up efforts to include detailed location or functional geographic classifications as an important parameter in various environmental, social and economic statistics. This goal was designed to enhance the information capacity of statistical data, mainly for planning, programming and spatial analysis, without increasing the cost of creating the data; the main driver for this initiative was the 2011 census round.
During the course of 2012, three different events were organised by the European Commission in relation to statistics and geospatial information:
- a paper was presented to the European Statistical System Committee (ESSC);
- a workshop was organised between the European Commission, national statistical authorities and mapping agencies;
- a meeting was held in Prague of the Directors-General of national statistical authorities (DGINS).
Eurostat subsequently sought to promote integrated information systems that combine statistical and geospatial information for policymakers, researchers, spatial planners, as well as a range of other users, sharing this knowledge with the wider community, providing an overview of how geographical information systems have been implemented and identifying issues for further guidance and future developments.
The three events referred to above led Eurostat to launch a call for proposals in 2012 under the heading of Merging statistics and geospatial information. This was designed to provide grants for facilitating work on the coordination of statistics and geospatial information. It was intended to cover a wide range of topics, including:
- improving the integration of geographic information and geo-referencing in the statistical production process;
- illustrating how linking geographical and statistical information provides additional value and creates new information;
- designing innovative web applications to show the spatial distribution of statistics.
The call was also intended to increase cooperation between national statistical authorities, in the sense that tools or processes designed or developed by one Member State might be offered to others for reuse and/or inspiration. Furthermore, national statistical authorities were explicitly encouraged to propose projects together with organisations responsible for geospatial information, in particular, national mapping and cadastral authorities (NMCA) to promote greater cooperation and a cross-fertilisation of information.
This publication, Merging statistics and geospatial information — experiences and observations from the national statistical authorities, 2012-2015, presents details for each of the projects provided a grant during the first four years, showcasing a broad range of applications that may be developed using geospatial information.
For the first exercise in 2012, Eurostat received 11 proposals and selected eight of these for grants, namely those from Greece, Hungary, Malta, the Netherlands, Poland, Slovenia, Slovakia and the United Kingdom. Some of the projects were cross-cutting and ranged from data collection to web dissemination, while others were focused on a specific aspect of the business process in an individual statistical authority. All projects were thought to have enhanced the GIS expertise of national statistical institutes (NSIs) and made substantial progress in giving increased visibility to GIS. However, there were concerns raised as to the potential transferability of projects between NSIs, while no projects were carried out jointly with national mapping and cadastral authorities. In response, Eurostat set up a collaborative platform to exchange information and promote reusing results from other countries. Furthermore, it was agreed that future calls should promote projects that sought to: bring location into the mainstream of statistical production (by developing geocoded data warehouses); help NSIs prepare for the 2021 census exercise (through the implementation of geocoded data, covering buildings, addresses, citizens, businesses, workplaces and farm holdings, thus creating a point-based framework for statistical microdata).
For the second exercise in 2013, there were seven projects selected by Eurostat for grants, namely those from Bulgaria, Germany, Croatia, Italy, Austria, Slovenia and Finland. One of these continued work that was started under the 2012 grant, whereas three others were centred on extending national capabilities for merging statistics and geospatial information. There were also three more specific projects. The first processed information collected from migrant arrivals (when they applied for a residence permit) so that their place of birth could be geocoded, allowing a set of 20 maps to be produced for non-EU countries, showing the precise origin of migrant arrivals for the years 2012-2015. Another project allowed a set of commuter statistics to be developed, providing information on the average distance commuters travelled to work or to their studies. The final project was more closely linked to information technology, namely, developing an open source web application for the spatial analysis of statistical data.
For the third exercise in 2014, there were seven projects selected for grants, namely those from Estonia, France, Croatia, Hungary, Poland, Portugal and Norway. These covered a broad range of issues including: linking statistical registers to address systems; producing multi-modal spatial transport data for urban centres; assessing how changes in population and land use may impact on the quality of life; or establishing a point based business register.
For the fourth exercise in 2015, there were eight projects selected for grants, namely those from France (this project was extended until 2018), Croatia, Latvia, the Netherlands, Austria, Poland, Slovenia and Finland. Note that the grant provided to the French national statistical authority concerned the preparation of a methodological handbook. As such, it did not specifically cover a practical application for merging statistics and geospatial information and for this reason has not been included in the main body of this publication. Nevertheless, the handbook produced provides a very valuable tool that may be used to promote and share results, encouraging a greater take-up and application of spatial statistics in statistical production processes. The core of the handbook focuses on describing geocoded data, measuring the importance of spatial effects, describing practical methods for taking into account spatial interactions and providing details on some more advanced issues and latest developments (spatial panel data models, network analysis, spatial econometrics, small area methods). The Handbook of Spatial Analysis is available in both French and English.
Address is the specific location of a property, usually based on address identifiers such as a road name, house number and/or postal code.
Continuous data describes data where values for the variable of interest may be observed at any point across the territory studied. Data are generated on a continuous basis, but they are measured only at a discrete number of points (for example, the chemical composition of the soil, water or air quality when analysing land use or land cover).
Geocoding is the process of transforming a description of location (such as an address or the name of a place) to a location on the Earth’s surface. In practice, it links unreferenced location information, often in the form of a text string (an address), to a geocode. The conditions for geocoding include a high quality physical address, property or building identifier, or other location descriptor, in order to assign accurate coordinates and/or a small geographic area to each statistical unit.
Geographical classifications are methods to group geographies according to objective criteria, for example classifications based on population density, functional aspects (labour market areas), or geography (mountain areas). Often geographical classifications are based on statistical or administrative geographies to be able to compare statistics between different areas with the same characteristics (for example, urban areas).
Geo-referenced statistical data are data that can be directly presented in space. Geo-referencing, or geospatial referencing, is the process of referencing data against a known geospatial coordinate system, by matching it to known points of reference in the coordinate system, so the data can be processed, queried and analysed with other geographic data.
Geospatial core information is a set of essential geospatial data and services for geocoding other types of information; examples include administrative boundaries, land cover information, addresses, orthophotos/satellite images, transport and hydrographic networks.
Geospatial data are information defined by geometrical boundaries of either administrative or other units that are in geographic information systems (GIS), commonly in the form of polygons.
Grid statistics are spatial statistics geocoded to rectangular grid cells. Each grid cell has the same size and carries a unique code. Ideally the code also carries geocoding information, for example, the lower left corner of the grid cell.
Linking defines a process of connecting structured data sources using a system of unique identifiers. While integration describes the process of combining data from different thematic communities, linking refers to technically connecting data in a machine-to-machine environment.
Location is a general term used to describe a place on the surface of the Earth; location data is often used when referring to geospatial information.
Point data are those whereby the geographic coordinates are associated with an observation. The value associated with the observation is not of interest, rather it is the location, for example, the point where a disease emerged during an epidemic, or how certain tree species are distributed. Spatial analysis of point data is aimed at quantifying the gap between observations, identifying clusters of data that are more aggregated than if they had been randomly distributed across the territory.
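As a simple illustration of quantifying the gap between observations, the Python sketch below computes the mean nearest-neighbour distance of a point set and compares it with the value expected for a completely random pattern of the same density (the ratio often referred to as the Clark-Evans index). The sample coordinates and study area are invented, and no edge correction is applied, so this is a teaching sketch rather than a full analytical method.

```python
import math


def mean_nearest_neighbour_distance(points):
    """Average distance from each point to its closest other point."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)


def clark_evans_ratio(points, area):
    """Observed mean nearest-neighbour distance divided by the value
    expected under complete spatial randomness, 1 / (2 * sqrt(density)).

    A ratio well below 1 suggests clustering; a ratio above 1 suggests
    a more regular spacing than a random pattern would give.
    """
    density = len(points) / area
    expected = 1.0 / (2.0 * math.sqrt(density))
    return mean_nearest_neighbour_distance(points) / expected


if __name__ == "__main__":
    # Two tight pairs of points in a hypothetical 20 x 20 study area.
    sample = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(clark_evans_ratio(sample, area=400))
```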
Regional statistics are statistics that are geocoded to administrative and functional geographies.
Spatial analysis or spatial statistics include any of the formal techniques which study entities using their topological, geometric or geographic properties; the phrase refers to a variety of techniques, many still in their early development, using different analytic approaches applied in a wide variety of fields.
Spatial data are statistics which should meet the perception of users in their area of interest (for example their neighbourhood); as such, these data are more detailed than regional statistics. Spatial statistics are geocoded to small administrative or non-administrative geographies.
Spatial unit refers to any set of spatial units that cover a whole territory and are divided into basic (house number, spatial district, statistical district, settlement, municipality, administrative unit) or additional spatial units (local, village or urban community, street, electoral unit).
Statistical unit describes one member of a set of entities being studied; this could include persons, households, businesses, buildings or parcels/units of land.
Thematic map is a type of map or chart that is designed to show a particular theme connected with a specific geographic area; these maps can portray physical, social, political, cultural, economic, sociological or agricultural patterns and developments.