enEnglish
CROS

## Contents

Introduction

The aim of the smart meter project is to assess the usability of the electricity smart meter's data in the production of the energy statistics, as a sole or supplementary source to the statistical outputs, e.g. energy statistics of businesses or households, vacant or seasonally vacant dwellings statistics, or impact on the environment.

In the earlier report on data access and data handling, an overview of the current situation with regard to the status of the instalment of the electricity smart meters in Europe was given. The first report gives also an overview of the process of getting access to the data and describes the structure of the datasets in the different countries and gives an assessment to the quality of the input data.

The goal of the current report is to demonstrate how linking electricity data with other administrative data sources can be conducted and how such enriched data can be used for producing statistics of businesses, households and identifying vacant living places. The main challenge is the mapping between an observed unit - a metering point - and a statistical unit - a business, household or dwelling and getting good quality outputs from the linked data. To achieve the goal a proper IT-infrastructure, methodology for linking data sets and estimation, as well as quality framework must be established.

The work presented in this report is carried out by three statistical offices. Statistics Estonia has access to the data in the Estonian electricity data hub for the years 2013-2015. By the end of 2015 not all electricity meters were smart ones, so for the completeness both manual and smart meters data is used  in the analysis. In this report, Estonian data is used to produce statistics about businesses and households, as well as to identify vacant dwellings. Statistics Denmark has access to the 2013-2016 data in the electricity data hub. The available data is used to analyse the electricity consumption of the households. Statistics Austria does not currently have access to Austrian data, instead synthetic data is generated and used. As mentioned in the report 1 of this work package, the objectives for generating synthetic data are to pilot demo output and to develop and test new algorithms to test and compare methods that could be used to identify vacant living spaces. The concept of using a subset of the data instead of all data is also examined by using synthetic data.

The report is divided into five parts. First two parts cover the data editing and linking issues, other three parts concentrate on the production of three different outputs i.e. electricity consumption by businesses, electricity consumption by households, an indicator of a vacant dwelling, and its quality. The report concludes with the assessment to the usability of the electricity meters data as a possible data source for the official statistics, description of the work still needed to be carried out and next steps with the smart meters pilot in the EssNet project.

## Used hard- and software

In Statistics Estonia following IT-infrastructure was used for electricity data. The data is delivered on an external hard disc as a dump of MySQL database to Statistics Estonia once a year. The data is handled in a Windows 7 workstation and the tables are exported as csv files from MySQL database and copied to Hadoop filesystem. The inappropriate hardware for running the database caused remarkable delays in data pre-processing.

For handling big data Apache Hadoop (an open-source software) was selected as it easily scalable and it can handle big data without limits. To make maintenance easier a distribution of Hadoop – Hortonworks – was selected as it enabled easy and automated installation of cluster by using Ambari. At the time of installation Hortonworks Data Platform (HDP) version 2.4.2.0-258 was available and it contained the following versions of packages: Hadoop 2.7.1, Pig 0.15.0, Hive-Hcatalog, 1.2.1000, Oozie 4.2.0, Ambari 2.2.1, HBase 1.1.2, Knox 0.6.0, Storm 0.10.0, Falcon 0.6.1, Spark 1.6.1.In addition for GUI Hue 2.6.1-169 and Zeppelin 0.6 were installed. Speed tests of queries in Hive and Spark are given in Annex A.

The Hadoop cluster consist of three serves: one Ambary server (virtual), one NameNode and one DataNode. Hardware specification for servers: 2 x 8 core 64-bit Intel Xeon processors, 64GB RAM, 6TB HDD. Better configuration would be if for each core 4GB of memory is available. As an operating system SUSE Linux Enterprise Server 11 (x86_64) is used. The version of the operating system was selected due to the requirements of HDP version 2.4.

For visualization and using clustering and machine learning tools a local version of Anaconda was installed, but it was conflicting with the existing version of python 2.6 in use of PySpark functions. Therefore the main tool for analysis were Hive SQL queries and also SAS Enterprise Guide for aggregated data.

In Statistics Denmark raw data comes in several CSV files. In the office, these datafiles are transferred into ORACLE database. Most of the data processing takes place in ORACLE and the analysis in carried out by using R software.

The production of the synthetic data set by Statistics Austria was done mainly with the R package simPop and some imputations were done with the R package VIM. Data handling was done with the R package data.table since it provides a lightning fast and very flexible interface for manipulating data sets. Since the data sets are quite large for in memory usage base R and dplyr would not lead to the same performance as data.table. For visualization the R package ggplot2, which provides very nice plotting functionality. It was combined with the R package shiny for creating an interactive tool for exploring the data. shiny nowadays provides the functionality to develop professional grade web application from R and therefore being able to use the vast possibilities R provides in the area of data science and statistics.

The vacancy estimation was also done in R and an important package was cellWise, which was used for outlier detection. For random forest modelling, the R package ranger  was used, it provides a very fast implementation of random forest.

## Methodological Framework

The dataset of smart meter data is structured and quite simplistic, measurements for inflow and outflow of energy accompanied with auxiliary information for the specific smart meter, e.g. the classification into household and business and specific characteristics for these two types.

The main challenge in using the smart meter data is the problem of the unit - the observed unit is a metering point and the desired output unit of analysis is a statistical unit, which is a business, household or dwelling.

There are two ways to approach the dataset:

• The first one is to use metering points as units and classify them by using electricity consumption data and apply labels from the sample data set. All the electricity end consumption data will be used.
• The second one is to link metering points with statistical units as well as possible and for analysis apply traditional statistical methods. Only the consumption data of the identified statistical units will be used.

In this report we will use the second approach and our goal is to link metering points with the administrative data sets for identifying electricity consumption of statistical units. The linking is also basis for establishing the sample data set which can be used for labelling the data in the first approach.

After receiving the access to the smart meter data there is need to identify:

• whether the data set contains only end consumption data or it includes also some sub-consumption or other that is not end consumption (e.g., import, export, losses, transfer etc.) and what are the rules to filter out such consumption?
• what are the keys which can be used to link the data, whether it is possible to use registry id-s directly or there is need to use some prepossessing for extracting the key-s (In Estonian case there was need to extract from the free text address filed the real address and link it with the address id)?
• what is the unit the metering point is corresponding to (a single entity or multiple entities in case the metering point measures consumption of the whole building)?
• what is the measurement time step unit (in some cases the measurement is done once a month or less frequently and the hourly consumption is modelled)?

Linking can be conducted by using some unique key-s. In Estonian case all the legal and private persons have unique registry code which is used all registers and thereby enables to link different registries, maintained by different entities, together through X-Road, the data exchange layer for information systems. The land board of Estonia has developed an address information system, which contains information of all addresses and address objects in Estonia. There are two approaches for linking:

• First, linking agreement owners by using registry code with different registers. In this case we get information about electricity consumption of agreement owners.
• Second, linking persons by using their address. In this case it is higher chance to get information about electricity consumption of end consumer.

For conducting analysis of businesses there is need to exclude apartment associations, which are registered as a legal person, but they mainly represent consumption of private persons. Text analysis was used in our case study for identifying apartment associations, but other methods are applicable as well.

Based on the answers the filtering and linking rules can be established to extract metering points which measure the end consumption of statistical units. After establishing linking between the observed unit and the statistical unit the classical steps can be followed for the production of the statistics:

1. Data editing or data cleaning with the goal to create a cleaner dataset with less noise, less errors and no missing values;
2. Linking to registers for households or enterprises, this will be ideally just merging datasets, but could also be a record linkage problem (or even a statistical matching problem, if no direct identifiers are available);
3. Computing outputs using either simple aggregation or some models. In this stage also new variables will be calculated by using additional information or classification algorithms. Some of the methods used for estimating daily vacancies are automatic time series model selection, random forest, and cell-wise outlier detection.

Automatic time series model selection

Autoregressive integrated moving average models are widely used for modelling time series (and forecasting them), it is a generalization of autoregressive moving average (ARMA) models which is only applicable for stationary time series [1]. Three parameters are necessary to define a ARIMA model and to estimate the respective coefficients: the order of the autoregressive part, the order of the moving average process and d the order of differencing. The automatic model selection procedure searches for the best fit within a given model space (0 to max order for each of the three parameters). The R function auto.arima from the R package forecast was used.

Random Forest

Random Forest[2] is a popular method for supervised learning in machine learning, it is a technique of combining a series of decision trees into a single model. The target variable or dependent variable can be a categorical or continuous variable and the same is true for the independent variables or features. In general RF provides quite good protection against overfitting, therefore the selection (removal of unnecessary variables) is not as important as in classical statistical modelling or for pure decision trees.

Cell-wise outlier detection

This (quite new) method to detect outliers was developed with the idea in mind to combine the "old" paradigms of columnwise or rowwise outlier detection into one method[3]. Rowwise in the context of outlier detection refers to identifying whole observations as outlying whereas columwise refers to finding extreme values in a single variable. The method generates predicted values for suspicious cells and compares them with the observed values, therefore it can also directly be used to impute by using these predicted values.

Details about the application of the described method can be found in the section below. In addition to some advanced methods, some very basic statistical measurements like standard deviation, variance and mean were used.

4. Evaluating quality of the outputs. For quality evaluation quality performance indicators, introduced further on, will be applied.

## Quality Framework

In this report, list of quality performance indicators from the ESSnet exploring quality issues in administrative data is used to assess the quality of the computed statistics. The chosen quality indicators are:

1. Undercoverage
• What proportion of households and companies do not yet have smart meters? The proportion of consumption that is not covered - some overlap with the discontinuity problem.
2. Overcoverage
• What proportion of smart meters does not record final electricity consumption?
3. Percent of units that fail checks
• What percent of data fail basic checks such as checks for extreme readings?
4. Percent of units that are adjusted
• Percentage of units that are adjusted through some error correction or follow up contact?
5. Percent imputed
• What percentage of households or companies are imputed?
6. Periodicity
• At what frequency are data delivered?
7. Delay
• Is there any delay in delivery of data?
8. Difference between administrative units and statistical units
• How many administrative units must be changed (merged, imputed or split) to make them fit into statistical units?
9. Discontinuity
• Quantification of level jump between survey data and estimates from data hub data.

The quality is assessed for each set of outputs: business statistics, household statistics and vacant living spaces indicator.

Data description

## Estonian electricity data hub

The Estonian Data Hub is a digital environment for information exchange in the electricity market for the purpose of changing open suppliers, forwarding consumption meter data between market participants, maintaining the data, performing the obligations imposed on market participants by the law and ensuring the rights granted to them [4]. Doing statistics is not listed in the purpose of the Data Hub and therefore the data source is not meant to be used directly for this purpose.

The four tables (metering data - contains electricity recordings, metering points - contains the address and other metadata of metering points, agreements - contains start and end data and other information about agreements, and customers - information about the owner of an agreement) form the core of the smart meter database (see Report 1 for more detailed description). The main characteristics of the tables are given in Table 1. For linking the tables keys metering id and customer id were used and for linking external tables a customer's registry code was used. The table metering data contains all the hourly recordings (unit Wh) of all active metering points from the years 2013 to 2015. The direction of meter data is presented from the perspective of the measuring network operator: in quantity – electricity entering the network (production), out quantity – electricity exiting the network (consumption) [4].

Table 1. The number of rows and number of distinct keys of the smart meter data tables.

Table No of rows Unique metering id Unique customer id Unique code
agreements 1 850 711 727 003 625 425 N/A
customers 721 854 N/A 721 854 721 763
metering_points 729 266 729 266 N/A N/A
metering_data 18 702 064 849 725 439 N/A N/A

The quality of linking of the tables (Table 2) was evaluated. All the metering id-s from the table of metering data were linked to the table of metering points (see Table1 the number of unique metering id-s (725K) equals Table 2 the number of uniquely related metering id-s (725K)). Around 2000 metering points in total were not linked to any agreement (725K - 723K). In 2015 there were 3013 metering points, which did not have agreement and they consumed 0.05% of electricity. The number of metering points without agreement has been reduced from 14 323 in 2013 to 3 013 in 2015 which is great improvement. There are more metering points related to agreements (727K) than there are metering points measuring electricity consumption (725K). There are customers who have more than one agreement (625K vs 727K).

Table 2. The number of unique joins between metering data tables.

 Table A Table B Key Relationship type metering data metering points agreements metering data metering points agreements customers customers metering id metering id customer id customer id, metering id many to one one to many many to one many to many 725 439 727 003 625 425 723 339

A temporary table was created to speed up aggregations and all the agreements and consumption related to a customer were aggregated to a daily base, as the agreement change time is 00:00. With a retail customer or a market participant two types of agreements are used – grid - the network agreement - and supply - the open supply agreement. There must be a grid agreement even the client uses universal services without any supply agreement and therefore in the further analysis only the grid agreements were used.

CREATE TEMPORARY TABLE IF NOT EXISTS workspace.agreement_daily
STORED AS ORC tblproperties ("orc.compress"="NONE") AS
SELECT metering_dates_orc.metering_date, aa.metering_id, aa.customer_id
FROM smart_meter.metering_dates_orc
JOIN (SELECT a.customer_id, a.metering_id, a.first_date, a.last_date
FROM smart_meter.agreements_orc a WHERE a.agreement_type = 'GRID') aa
WHERE to_date(aa.first_date) <= metering_dates_orc.metering_date
AND (aa.last_date IS NULL OR to_date(aa.last_date)= metering_dates_orc.metering_date);


Currently we do not have detailed overview of the network for identifying full production and consumption of electricity. Partly is related to the fact that we dont have full metadata, which describes the metering points and their actual role. We have identified three variables, which might not correspond to electricity end consumption and we call system overhead consumption. Variable ‘border point’, which marks a metering point between two network operators; identifier in metering point code, which indicates whether the metering point is an electricity substation (it includes also energy production of big energy producers and external links for import and export); a virtual metering point, which is used for recording network or line losses. On Table 3 full amount of electricity production and consumption are shown, then the sums production and consumption which are probably not related to end consumption and finally the production and consumption which roughly might correspond to actual consumption and small scale energy production. A separate temporary table of metering points which were part of system consumption was created and in further analysis those metering points were filtered out.

Table 3. Metering data – production and consumption (in and out flow) in GWh.

 Year Metering points Production Consumption All data 2013 713 924 15 503.754 32 275.818 2014 716 383 18 663.846 35 748.437 2015 719 479 22 742.688 33 372.304 Overcoverage 2013 1 620 3 887.250 24 289.792 2014 1 596 6 912.504 27 980.852 2015 1 526 12 346.621 25 173.030 End production/consumption 2013 712 304 11 616.503 7 986.026 2014 714 787 11 751.341 7 767.585 2015 717 953 10 396.066 8 199.274

There are still some differences in calculating the energy production and consumption from the data of the data hub. As according to the data provided on the Elering web page the consumption is 7466, 7473 and 7443 GWh in the years 2013-2015 and corresponding production numbers are 13826, 14402 and 14406 GWh which differs from the numbers in Table 3 around 10%. Consultations with the hub owners we have ascertained that Elering is using additional information not in the hub to produce given numbers. So, it is not feasible to get exact numbers by relaying on data hub data only.

In the smart meter dataset the customers are classified by type: Foreign, Private, Embassy, Business, House Unions and some metering points are not connected to any agreement and have value NULL (N/A). The distribution of unique metering points, distinct number of customers, production and consumption by customer type is given in Table 4. As we can see, the number of metering points, which are not linked to any agreement is decreasing and the quality is getting better. The private customers own the most of metering points, but consume 1/5 of electricity. The measurements are in GWh to see linkage of actual electricity consumption between the Table 3 and 4.

Table 4. Distribution of distinct metering points, customers by customer type. Production and consumption in GWh.

 Year Reg. type Customers % Unique points % Production % Consumption % 2013 N/A 0 0.0% 14 323 2.0% 0.0 0.0% 13.096 0.2% 2013 Foreign 4 247 0.8% 4 661 0.6% 0 0.0% 12.099 0.2% 2013 Private 529 603 93.6% 628 911 86.4% 0.403 0.0% 1 735.170 21.7% 2013 Embassy 18 0.0% 43 0.0% 0 0.0% 2.685 0.0% 2013 Business 32 019 5.7% 79 817 11.0% 11 606.549 99.9% 6 214.426 77.8% 2013 Unions 142 0.0% 536 0.1% 9.551 0.1% 8.548 0.1% 2013 Total 566 029 100% 728 291 100% 11 616.503 100% 7 986.026 100% 2014 N/A 0 0.0% 4 318 0.6% 0 0.0% 4.424 0.1% 2014 Foreign 4 407 0.8% 4 845 0.7% 0.002 0.0% 11.064 0.1% 2014 Private 527 632 93.4% 629 832 87.3% 1.042 0.0% 1 747.944 22.5% 2014 Embassy 18 0.0% 42 0.0% 0 0.0% 2.672 0.0% 2014 Business 32 447 5.8% 81 674 11.3% 11 739.849 99.9% 5 989.792 77.1% 2014 Unions 156 0.0% 537 0.1% 10.446 0.1% 11.685 0.2% 2014 Total 564 660 100% 721 248 100% 11 751.341 100% 7 767.585 100% 2015 N/A 0 0.0% 3 013 0.4% 0.123 0.0% 3.790 0.1% 2015 Foreign 4 402 0.8% 4 817 0.7% 0.006 0.0% 10.534 0.1% 2015 Private 528 738 93.3% 631 739 87.3% 2.068 0.0% 1 751.817 21.4% 2015 Embassy 18 0.0% 42 0.0% 0.0 0.0% 2.629 0.0% 2015 Business 33 272 5.9% 83 858 11.6% 10 384.050 99.9% 6 419.974 78.3% 2015 Unions 156 0.0% 471 0.1% 9.817 0.1% 10.527 0.1% 2015 Total 566 586 100% 723 940 100% 10 39.066 100% 8 199.274 100%

There is also a remarkable amount of metering points which measure zero consumption (Figure 1). Some of those are seasonal, but there are around 5% of unique metering points in Estonia that had zero total consumption in 2015. For reducing computational time it would be reasonable to filter metering points with zero consumption out, but for linking purpose, in case of identifying vacant dwellings, they are still needed. There is not strong seasonality of zero consumption in case of businesses.

Figure 1. Monthly number of metering points with zero consumption of private (blue) and business (red) customers.

## Danish electricity data hub

Statistics Denmark has received data from Energinet.dk for years 2013 – 2016. From 2017 data is received on a monthly basis. Data hub owner Energinet receives all data on consumption and own production. The various electricity suppliers are responsible for reporting all consumption data to ENERGINET as they are responsible for sending bills to the customers. ENERGINET has divided their data into three parts.
1: Data on readings: Data on consumption from meters that are billed on a periodic basis (monthly or quarterly).
2: Data on values: Data on consumption from meters that are billed on the basis of their hourly meter readings.
3: Background information. This is one large dataset with both personal information including addresses and information regarding subscription and payment method.

The last mentioned table with the background information on consumers is quite comprehensive and can contain 60 million (M) rows. It contains both address information and information connected to the subscription status of the household or company. There are only between 4 or 5 million meters in Denmark. The reason that meters can appear multiple times in the background dataset is that the background information dataset can be split up in two parts illustrated in Table 5.

Table 5. Illustration of data about one metering point in the background dataset.

METER_ID VALID_FROM VALID_FROM ADDRESS TYPE_OF_METER GRID_SUPPLIER BALANCE_SUPPLIER SETTLEMENT_METHOD HOURLY_TIME_SERIES
123456 2013-01-01 2014-07-01 xxx Consumption Xxx Vvv Periodically yes
123456 2014-07-01 2015-08-01 xxx Consumption Xxx Vvv Hourly yes

The table illustrates that a meter ID can appear multiple times because something related to the subscription has changed. The address is fixed i.e. the meter does not move. This means that background information can be split into two parts: one part containing the addresses and one the subscriptions.

After extracting address information from the background information, work with the addresses starts. With the addresses following work is carried out: removing duplicate address information, cleaning the address records and geo-coding addresses. Since the addresses came from the original background information dataset where there are 60M rows of addresses so they need to be collapsed (see Table 6).

Table 6. Before and after removing duplicates and correcting the subscription start and end information.

BEFORE

METER_ID VALID_FROM VALID_TO STREET_NAME HOUSE_NUMBER LETTER ROOM_ID FLOOR_ID POST_CODE
123456 2013-01-01 2014-07-01 Apple road 2 A Floor NA 2770
123456 2014-07-01 2015-08-01 Apple road 2 A NA NA 2770

AFTER

METER_ID VALID_FROM VALID_TO STREET_NAME HOUSE_NUMBER LETTER ROOM_ID FLOOR_ID POST_CODE
123456 2013-01-01 2015-08-01 Apple road 2 A FLOOR NA 2770

After removing duplicate addresses, the number of rows is reduced from 60M rows 11M rows. Next step involves assigning a unique code called Danish Statistics (DST) address id to the address. This task is appointed to the IT department, who is familiar with handling addresses and assigning unique address id’s. To perform the assignment the following variables are needed: street name, building number, building letter, floor id, room id, zip code.

Table 7. Illustration of the possible duplicate problem: two metering points in the almost same address.

METER_ID VALID_FROM VALID_TO STREET_NAME HOUSE_NUMBER LETTER ROOM_ID FLOOR_ID POST_CODE
123456 2013-01-01 2015-08-01 Apple road 2 NA Floor NA 2770
234567 2015-08-01 2016-08-01 Apple road 2 A Floor NA 2770

Next step involves deciphering which table to choose the consumption data from. As stated earlier Statistics Denmark receives two consumption datasets. One that is based on periodical readings and one that is based on hourly readings. The challenge is to figure which of the two datasets to choose the consumption data from as a meter can appear in both datasets. The choice depends on the settlement method. The settlement method is related to the billing and can either be based on periodic (e.g. monthly, quarterly) consumption or hourly consumption. All possible combinations of the meter's type and settlement method are given in Table 8.

Table 8. Combinations of meter's type and settlement methods.

Smart meter Manual meter Billing based on
hourly consumption
Billing based on
periodic consumption
1 0 1 0
1 1 0 1
1 1 1 0
0 1 0 1

In order to analyse consumption it is crucial to understand how the manual consumption readings and automatically recorded consumption values are linked to the background/subscription and payments dataset. Table 9 gives an overview of the tables with regard to the number of meters by the type pf meter and by the settlement method. It illustrates the challenge one is facing when working with the Danish market data. Readings in the table refer to the dataset from where the periodically read meters are taken from. From line 6 it is evident that smart meters (hourly based billed meters), that not very many meters are billed based on an hourly basis. It is therefore interesting to investigate the meters that are both read on an hourly basis and billed periodically to see whether the consumption from the values dataset and the consumption from readings overlap. If a meter appears both in the hourly read consumption dataset and in the quarterly read dataset one must know how the meter is settled in order to know whether the amount related to that meter is to be calculated from the readings or the values dataset.

Table 9: Number of meters in each category.

Number Combination 2013 2014 2015 2016
1 Periodical readings 3 183 901 3 234 797 3 251 648 3 396 599
2 Hourly readings 58 704 135 993 775 990 1 744 308
3 Periodical readings and hourly readings 1 896 71 915 700 346 1 449 455
4 Periodical readings and periodical billing 3 176 222 3 158 254 2 550 909 1 944 823
5 Periodical readings  and periodical billing and hourly readings 10 68 630 695 578 1 319 550
6 Hourly readings and hourly billing 41 694 41 830 41 859 29 801
7 Hourly readings and periodical billing 1 21 504 25

Table 10. Total consumption by the settlement type in Danish dataset.

Consumption in GWh Periodically read consumption with periodically based settlement Hourly read consumption with hourly based settlement
2013 14 194  16 347
2014 13 301  16 390
2015 11 082  14 913
2016 until April 6 343  6 821

Table 10 shows the total consumption from the meters. In table 8, it was shown that manual meters can only have periodical settlements. However, this is not true for smart meters. In table 10, the total consumption for the meters that are settled on the basis of hourly readings is given. Some consumption information is lost in case there has been a switch from one type of settlement to the other during the year.

The total consumption figures match very well the data that can be found on Energinet.dk own webpage and that are published based on the same dataset.

</div>

## Quality assessment

In this section we evaluate the quality of the data in the data hub, the summary of the quality information is given in Table 11. The data collecting unit in the data hub is metering point and the quality assessment in this section is carried out for metering points. Statistical units i.e. businesses, households and dwellings, and their quality is studied in the following sections.

Table 11. The quality of the input data (Estonia case).

 Undercoverage There is no undercoverage of metering points. Overcoverage 1526 metering points were excluded from the analysis as those metering points did not present actual end consumption i.e. the metering points recorded transfer or selling of the electricity. Although these metering points formed only 0.2% of all metering points, the consumption recorded by them was 75.4% of total consumption recorded in the data hub. Percent of units that fail checks The data is taken as it is and extreme recordings are not excluded from the analysis. Percent of units that are adjusted In Estonian data there were two cases when the end date of the grid agreement differed from the supply agreement's end date and there were two duplicate agreements in the initial dataset. All the address information of metering points was normalized and corresponding address id-s and address object id-s were identified and stored in a separate table. Percent imputed No imputations were applied. Periodicity Data is delivered once a year. Delay It takes up to 4 months after the reference period for all the corrections done by providers or sellers to appear in the data hub.Statistics Estonia does not have agreed date for receiving the data for the previous year, but it is usually received in the first half of the year.

The 75,4% of total consumption is a remarkable number but one should note that electricity moves through several metering points before it will be consumed and all transfers are recorded in the data hub. Also, one should note that overcoverage appears only if the study variable is end consumption. If one is interested in all electricity transfers then one shouldn't exclude any meters from the analysis.

## Synthetic data set

As input a smart meter data set from Australia was used. Of course, the electricity usage patterns of Australian households compared to Austrian households will not be as similar as the naming of the countries suggest, however useful testing and development work can still be done on this data set, since the daily consumption patterns and the variety of these might still be similar in real data from Austria (or other countries). The synthetic data set is a base for developing tools which can be shared easily and which can be applied to real data sets as well. It is not an objective of the synthetic data set to be able to produce realistic estimates for e.g. electricity consumption of the Austrian households.

The Smart Meter data set from Australia contains trial data for 78.720 households with several household characteristics and 345M readings in total. This data set was used as one input for the generation of the synthetic smart meter data set for Austria. The synthetic population of Austria was created using the R package simPop[5] (https://cran.r-project.org/package=simPop) and the Austrian EUSILC sample data.

Matching the households from the Australian smart meter data, $SM$, with the households from a synthetic population for Austria, $P$, was done using a simple algorithm, which proceeds as follows:

1. Select $d_1,...,d_n$ demographic variables, existing in both data sets, where each variable $d_i$ can take values from a set $D_i$,$i=1,...,n$.

2. This selection naturally splits $SM$ and $P$ into disjoint subgroups $SM_i$ and $P_i$,$i=1,...,\prod\limits_{i=1}^n|D_i|$ defined by the values of $d_1,...,d_n$.

3. For each subgroup $P_i$, sample with replacement $n=|P_i|$ households from $SM_i$.

4. If for an $i$ the set $SM_i$ is empty, discard the least significant criteria in $d_1,...,d_n$ and calculate new subgroups. Do this only for $i$s where $|SM_i|=0$.

5. Repeat steps 3 and 4 until every household in $P$ was matched with a household in $SM$.

For the matching the demographic variables household income group, number of occupants, number of children 0-10 year, number of children 11-17 years, number of occupants 70+ years and is home during daytime.

(All code is available on Github https://github.com/alexkowa/essnetBDwp3)

### Using a subset of data

When thinking about smart meter data, we tend to think about getting access to the full data set, e.g. all households of a specific country and every single day of a year. However, for a lot of output scenarios a subsample of this huge data set might be sufficient in combination with traditional survey methods to produce high quality statistics. The two main reasons for just using a subset of the data are probably limited computing resources and data access/privacy issues.

There are at least three possibilities for developing sampling schemes:

• Sampling households (metering points) and gross up to the whole population,
• Sampling days (or time intervals in more general) of all households and obviously
• the combination of sampling days and households, e.g. sample 4500 households and observe the smart meter for 20 days a year each.

In the following we make a few calculation exercises for sample size and precision for different scenarios or targets.

Estimating total consumption (yearly) per region with a (disproportionally allocated)stratified sample of 500 households in each region, we would be able to estimate the total consumption with a coefficient of variation of about 3% on a regional level and about 1% on a national level (see Table 12).

Table 12. The effect of sampling on the precision of estimates based on synthetic data.

Region Stand. deviation Total
million
Sample size CV of estimation of total
Burgenland 2992 514 500 3.0%
Carinthia 3093 1.101 500 3.0%
Lower Austria 3048 3.024 500 2.8%
Salzburg 3127 1.037 500 2.9%
Styria 3070 2.269 500 2.9%
Tyrol 3190 1.358 500 2.9%
Upper Austria 3149 2.731 500 2.9%
Vienna 2846 3.391 500 3.0%
Vorarlberg 3277 723 500 2.9%
AUSTRIA 3061 16.146 4500 1.1%

The kWh-usage per day, e.g. in Burgenland varies between 2 and 38 kWh (5 and 95 percent quantiles), therefore the standard deviation for days is about 13.

If we would sample 20 days of each household with simple random sampling, we would get already a very accurate estimate for yearly total consumption and the CV would be between 0.3 percent and 0.8 percent on a regional level and 0.1 percent on a national level. If we would sample 20 days of each household with simple random sampling from the stratified sample of 500 households (so we have 10.000 days observed for each region and 90.000 days in total), the CV would be between 12.7 percent and 13.7 percent for the regional estimates and about 5 percent on a national level. For most estimates it seems reasonable that a sample of the smart meters is sufficient for most needs, especially when the sample design and size can be adjusted according to given precision requirements and domain specifications.

Data editing

Due to the vast amount of data, fast and efficient algorithms have to be developed to identify and correct possible problems in the data set. For detecting errors one needs to determine set of editing rules that usually include rules to detect outliers. However, in practice it is very difficult to come up with editing checks because the data hub description is rather brief and variable' descriptions are lacking.

Some of the encountered problems with input data are described in the earlier report. In general, not many errors were discovered in Estonian data. In Estonian data there were two cases when the end date of the grid agreement differed from the supply agreement's end date. The dates were corrected when the table was transferred into Hadoop file system. Also, there were two duplicate agreements in the initial dataset, of which one was deleted during transfer. No other changes were made in the metering data.

There is one important difference between Estonian and Danish data hub. In Estonia, all metering points have hourly consumption information. In case the electricity consumption is reported monthly or less frequently the hourly consumption is modelled in the database for all. In Danish database smart meters reading values and reported values of the consumption in manual metering points are kept in the different datasets which creates additional complexity: how to avoid duplicates and how to find e.g. monthly consumption.

In Danish case the situation is better as the address in the electricity data hub is stored in a structured way (see Table 6), although not with geo-codes.

When data from new period is added then usually only some part of the addresses need to be geo-coded i.e. new metering points and metering points with updated address. For other part of metering points, already existing geo-codes could be assigned. Data hub owner in Estonia is updating their system and in the future geo-codes should be already in the hub.

For Statistics Denmark the largest work was caused by the manual meters and their consumption. As the dataset does not include information about the period for which the consumption is referring to and has only the dates when consumption was reported. Based on such information it is not simple to make any monthly or quarterly aggregations. The problem is more explained under linking section.

For linking the Estonian business register was used. There are three main tables in the register (see overview in Table 8):

• table of statistical business register – an extract of business entities from the business register made by Statistics Estonia, which are active. The set is used for statistical sample;

The business register contains information about the economic activity, the registration date and closing date and address information in the level of a local administration of a business entity. The table is continuously updated.

The table of statistical business register is an extract from the business register made by Statistics Estonia and it contains only information about active companies which have at least one registered employee. The set of companies is fixed once in a year (on November 11) and in addition to the information in the business register the table contains information about the number of employee, yearly turnover.

The address table of statistical business register is a table of street addresses of a company in the main location and other locations (column Distinct in Table 13 indicates the number of distinct business entities listed in the address table).

 Year Business register Statistical business register Addresses of statistical business register Distinct 2013 298 140 144 729 150 356 144 729 2014 304 575 146 748 152 394 146 748 2015 320 331 151 782 157 470 151 795

## Danish households register

Through the unique address id’s  it is possible to link to additional registers in Statistics Denmark.  Meters information was here linked to the household register and building registers to obtain the following variables:

• Number of people/households
• Area size
• Type of house (Apartment, attached house, detached house etc.)
• Indicator for the existence of electric heating in the dwelling.

To obtain these background variables three registers were used:

The population and the dwelling dataset were linked via the ADDRESS_ID and quarter. The two datasets only contain information about the households. The merged dataset was then linked to the building dataset to obtain building characteristics (the area size, the type of dwelling, heating information etc.) The dwelling dataset and the building dataset have these variables in common: DST_UNIT_ADDRESS_ID, DST_ENTRANCE_ADDRESS_ID and quarter. So the linking was then done via these variables. For all the dataset the latest record in a given quarter was used. This means that if by the end of a given quarter of the year there were two adults and two children living on an address, then they were registered as having lived there during the entire quarter. For instance the first quarter of 2014. This is regardless of whether they moved in February of 2014. Table 15 gives the overview of the results of the linking with the different registers.

Table 15. The results of linking with Danish registers

Population 2 819 584
Dwelling 2 819 584
Building 3 263 058
Link population + dwelling = pop_dwel 2 819 584

## Estonian dwellings register

The dwellings database was made by linking Building Register (only living spaces were selected), Address System and Population Register (only residents were selected) (see Table 16) by using address id as the linking key variable. This work was already done for the pilot project of Register Based Census (REGREL) and reflects the dwellings and living spaces situation at 01.01.2016.  Each living space is an address object and has an appropriate id.

In the customer table of the smart meter dataset there is 666 thousand unique private customers recorded (Table 16), the table does not contain any address information. The address information of metering points is recorded in the table metering points which contains 729 thousand metering points and for which 629 thousand address id-s were found. There are distinct 730 living spaces listed in the dwellings register. The register of the households was formed from the population register by grouping all persons living in the same address into a household (altogether 538 thousand households were formed). There was linking table for linking people in the population register to customers in the smart meter dataset and 593 thousand correspondence was found.

Note that private customers in Table 16 contain only such households where the electricity contract is owned by the private person but excludes households where the household's electricity contract is owned by the legal person (e.g. real estate firm or apartment association). In the latter case it is not possible to identify single household's consumption which is needed when computing averages per household. On the other hand, if the total consumption of all households is of interest, then this can be calculated.

Table 16. Overview of the tables used in the analysis of the living spaces.

 Table name Rows Distinct customers Distinct id's Distinct address id Missing address id Private customers 666 695 666 695 666 695 N/A N/A Metering points 729 266 N/A N/A 629 756 54 046 (8.9%) Living spaces 730 876 N/A N/A 730 876 0 Households 1 315 944 593 406 1 315 944 538 092 1 561 (0.3%)

For every dwelling in the database there are characteristics describing the dwelling, living spaces in the dwelling and the household registered in the dwelling at the reference time. For the analysis we use three variables from the database: id of the address-object of the dwelling, indicator whether a given address of the household  is also identified as the main residence in the register, and the size of household.

Other variables describing the dwelling that might be useful in the analysis are:

• information on the presence of central heating in a dwelling;
• type of dwelling (e.g. apartment, one-family house)
• type of region where dwelling is located;
• number of rooms in a dwelling.

### Estonian case study

For linking purpose it was possible to use two strategies – linking by registry code or linking by address information. Each individual (resident of Estonia) and business entity has a unique registry code which can be used to link different administrative databases. Unfortunately the standardized address information was not used by Data Hub and the address field was a free text field which contained quite often the description of the metering point as well and thereby reducing the quality of the address data.

The linking was conducted by the SQL command JOIN and whether the registry code or the address id was used as a key for linking.

1. Filter all the all businesses from the customer table by using registry type.

2. Join the businesses with a valid grid agreement. (Agreement beginning date is expressed to the precision of a day (Elering 2016))

3. Join the year and registry code with statistical business register table.

5. Filter out registry codes which belong to apartment associations.

Table 17. Customer agreement table linked by registry key with business register and profile companies.

 Year Agreements Distinctmetering points Distinctcustomer Distinctregistry codes Distinct statisticalbusiness register 2013 84 296 81 508 32 042 31 980 28 386 2014 87 450 83 273 32 471 32 417 28 699 2015 88 769 85 338 33 300 33 247 29 405

1. Create a temporary table of address id-s counting how many metering points and businesses are related to the same address.

2. Join metering points (all) and businesses if there is 1 metering point and 1 business linked to the address id (Filter 1).

3. Join metering points and businesses if there is many metering points in the same address, but a business registry code matches with the code in agreement (Filter 2).

4. Join metering points (including only businesses) and businesses if there is 1 to 1 linkage of the address id between metering point and 1 business (Filter 3).

5. Join the remaining businesses with a valid grid agreement with no attention to addresses (Filter 4).

Before making queries of production and consumption of businesses the metering points which do not represent real consumption were filtered out and also metering points related to apartment unions were filtered out, as those represent mostly private consumption.

Apartment unions are registered in the Estonian register of business under the NACE code (corresponding to the Estonian 5 digit emtak code) 68321 and 68322, but the same code is also used by some other businesses which are active in real estate management. For identifying the companies a script was used which searched for certain key words in the name of the company (e.g., 'osaühing', 'oü', 'aktsiaselts', 'as'). A temporary table of registry codes of apartment unions was made.

Table 18 shows how many metering points by each filtering stage added to the linking table. It can differ from actual linking rate as the filtering of non-valid metering points is applied afterwards in the aggregation phase.

Table 18. Linking by address. Metering points linked in different filtering stages. Production and consumption in GWh.

 Year Filter Dis. points Production Consumption 2013 1 43 620 0.112 444.567 2013 2 6 682 17.412 1 717.110 2013 3 1 118 0.049 11.378 2013 4 57 399 11 589.386 4 313.178 2013 Sum 108 819 2014 1 45 470 0.290 454.759 2014 2 6 521 19.037 1 641.063 2014 3 1 426 0.054 11.965 2014 4 58 971 11 807.177 4 209.994 2014 Sum 112 388 2015 1 46 897 0.617 458.451 2015 2 6 520 16.100 1 593.716 2015 3 1 262 0.082 9.112 2015 4 60 834 10 367.846 4 707.896 2015 Sum 115 513

Linking households by using persons registry code

1. Join customers table with the population register by using customer registry code as a key

2. Filter out households which members have linked to more than one metering point

3. Filter out customers which have linked to more than one metering point

4. Join population registry with the Register of Buildings by using persons living place by ads_oid as a key

Table 19 indicates what is potential to link households through two registry. In real linking some households are filtered out due to the members of household were linked two or more agreements in the Elering database. Customer m

Table 19. Linking households by using personal registry code. Production and consumption in GWh.

 Year Match Dis. customer % Dis. points % Production % Consumption % 2013 customers and addr. match 469 338 88.6% 571 690 89.3% 0.394 97.9% 1 611.361 92.9% 2013 customers match only 8 957 1.7% 10 822 1.7% 0.002 0.7% 27.102 1.6% 2013 not linked 51 308 9.7% 57 857 9.0% 0.005 1.4% 96.707 5.6% 2013 total 529 603 640 369 0.403 1 735.170 2014 customers and addr. match 476 396 90.3% 581 279 90.9% 0.997 95.6% 1 642.147 94.0% 2014 customers match only 8 816 1.7% 10 662 1.7% 0.011 1.1% 26.827 1.5% 2014 not linked 42 420 8.0% 47 857 7.5% 0.033 3.3% 78.968 4.5% 2014 total 527 632 639 798 1.042 1 747.944 2015 customers and addr. match 485 151 91.8% 591 755 92.2% 2.003 96.8% 1 661.391 94.8% 2015 customers match only 8 627 1.6% 10 536 1.6% 0.015 0.7% 26.991 1.5% 2015 not linked 34 960 6.6% 39 390 6.1% 0.050 2.4% 63.434 3.6% 2015 total 528 738 641 681 2.068 1 751.817

Table 19 shows that in 2015 6.6% of customers consumption information is left out from households total consumption. This is case when customer is not counted as a resident in the statistical population register. In dwellings owned by such persons 3.6% of households total consumption was recorded.

The order differs.

1. Join metering points with the register of building by using metering point address object id

Potential to link households directly by using an address object id is lower than by registry code.

Table 20. Overview of the linking process by using address ID (adr.match=1) and customer ID (customer match=1). Production and consumption in GWh.

The goal of the analysis later on is to compare main residence information at the end of 2015 with occupancy information calculated according to the consumption of the electricity. As the vacant - non-vacant indicator is computed based on daily and hourly patterns then only living spaces with hourly consumption information are of interest. Thus, only living spaces with the smart meters data will be linked to the register information. Linking the dwellings database with electricity data set by using address id we managed to obtain electricity consumption information for 61% of dwellings (see Table 21).

Table 21. Linking rate of smart meter's data with dwellings register by dwellings monthly consumption

Monthly consumption (kWh) No of smart metering points Matched records, %
0 36 856 27.9%
(0;150) 222 943 72.2%
150> 282 575 55.6%
Total 542 374 60.6%

### Danish case study

The result of the assignment of unique address ID’s to the electricity dataset is presented in Table 22.

Table 22. The result of the assignment of unique address ID’s.

Background dataset from Energinet Number
METER_ID 4 029 082
1 METER_ID - 1/x Addresses 3 834 253
1 METER_ID - 0 addresses at one point in time 194 828
1 METER_ID - 0 Addresses to begin with 264 136
1 METER_ID - 1 Address 2 825 801
1 METER_ID - x Addresses 59 058
x METER_ID's - 1 Address 930 458
x METER_ID's - x  Addresses 36 634
x METER_ID's - a Address 303 001
x METER_ID's - x Addresses 33 000
1 METER_ID - x Addresses  102 850
1 METER_ID - 1 Address 2 825 801

In the table one must notice that 2 825 801 meters have one unique address. Some meters have more than one address if:

• In one period a meter was registered at an invalid address due to too many NA’s. Then in the period after that some NA’s were filled out.
• Because of flaws in the registration of addresses

Some addresses have more than one meter if:

• There is a meter in the garage and one in the house
• If a dormitory has one meter per floor
• If a company has several meters

The useable number of addresses (over all the years) after having linked the registers to the background dataset from Energinet is: 2 503 881 which is 89 % of the addresses from the registers. The reason that some 300 000 addresses are lost, is that businesses  and their consumption information is not studied here.

After having linked the register information to the background information then it is time to link to the consumption datasets. This is no easy task, since as explained in section 2.2 meters can appear both in the periodical readings and in the hourly readings datasets. Linking was done for every quarter of every year from 2013 to 2016. Tables 25 and 26 give an overview of how many meters there are in the point of departure and how many is lost at every merge.

Table 23. Evaluation of the linking between register data and the hourly consumption data.

1. Year 2. Number of addresses (final after linking registers with the background dataset) 3. Number of meters in the hourly consumption dataset 4. Number of meters with some kind of settlement type (hourly based, on account based periodical estimates, mixed thorughout the year, NA) 5. number  of meters in the linked dataset
2013 2 179 865 58 704 49 982 4 823
2014 2 187 830 135 993 117 064 41 437
2015 2 197 117 775 990 747 666 480 708
2016 2 206 067 1 744 308 1 486 198 890 135

Second column in table 23 shows the annual number of addresses as a result of linking the registers with the meters background dataset. Column number 3 is the number of meters in the hourly consumption dataset. This number contains both the businesses and households. Only after the linking with addresses does it become clear which of those meters belong to the households. Column 4 is the number of meters that appear in the settlement dataset. This means that some meters have hourly readings but they do not have an agreement. This regardless of the type of agreement be it periodically estimated on account, hourly based, NA, mixed. Column 5 shows the number of meters after having performed the link between the addresses (background and registers) with the consumption dataset. This means that in 2013 not very many households had smart meters. This number rapidly increases until 2016 and in 2020 all meters in Denmark will be smart meters. In Table 24 the meters that are periodically read are investigated.

Table 24. Evaluation of the meters from the periodically read dataset.

1. Year 2. Number of addresses (final after linking registers with the background dataset)  % 3. Number of meters in the periodically read consumption dataset  % 4. Number of meters with periodically based settlement  % 5. Number  of meters in the linked dataset  %
2013 2 179 865 90% 3 183 901 90% 3 176 222 80% 733 899 27%
2014 2 187 830 90% 3 234 797 88% 3 158 254 80% 706 380 26%
2015 2 197 117 90% 3 251 648 71% 2 548 504 63% 722 955 26%
2016 2 206 067 90% 3 396 599 55% 1 944 823 50%

In column 2 again is the number of addresses every year after having linked the background dataset with the registers. In column 3 are the number of meters that are periodically read. Column 5 shows the final number of meters used to estimate average consumption. The reason the number falls so drastically is that meters that are periodically read can be difficult to interpret. Some meters are read once every year, some every quarter and some every month. This means that meters can be read at nice to interpret dates around the first or last day of a month and then with intervals of 30 or 90 or 365 days. This makes the interpretation of the totals and averages understandable. For readings at unusual dates e.g. 17th of May and then again the 5th of September it is not clear for which months or quarters the consumption should be assigned to. Unfortunately, quite a lot of meters fall out of the nicely to interpret dates and intervals. In the analysis part of the report only the meters that are read either the first or the last day of the month and with intervals of 30, 90 or 365 days plus minus 3 days are used.

## Linking rate and other quality information

When businesses were linked by registry code the main concern was that it was possible to link around 22 thousand distinct customers, despite they covered most of production and consumption. It is around 7% companies in the register of businesses. Apartment associations were excluded from the list of businesses entities in the business register.

Table 25. Number of businesses linked to metering data points. Production and consumption in TWh.

 Year  2013 2014 2015 Dis. customers Dis. points Production Consumption 22 090 68 939 11.606 6.004 22 335 70 433 11.739 5.776 22 856 72 988 10.384 6.207

When the address id was used for linking two strategies were used – by linking all customers or by selecting only those customers which were businesses (including apartment associations). It was possible to link more than 70 thousand businesses (Table 26). In case the customers in the smart meter database were filtered out only those which were businesses the linkage drops to 33 thousand, which is almost all the businesses customers listed in the smart meter database including apartment associations.

Table 26. Linking companies by using address id with all agreements. Production and consumption in TWh.

 Year  2013 2014 2015 Dis. codes Dis. customers Dis. points Production Consumption 59 760 70 754 110 116 11.606 6.486 61 685 72 932 113 986 11.826 6.317 63 493 75 726 117 486 10.384 6.769

Table 27. Linking companies by using address id with agreements of businesses. Production and consumption in TWh.

 Year  2013 2014 2015 Dis. codes Dis. customers Dis. points Production Consumption 23 896 31 189 71 869 11.606 6.282 24 337 31 832 74 169 11.826 6.108 24 995 33 295 76 365 10.384 6.555

Consumption of apartment associations was excluded from the analysis of businesses (Table 28).

Table 28. Production and consumption of electricity by apartment unions – NACE code 68.32. Production and consumption in GWh.

 Year nace Dis. customers Dis. points Production Consumption 2013 6832 9 929 12 661 0.014 209.690 2014 6832 10 112 13 024 0.003 213.050 2015 6832 10 416 13 381 0.018 212.122

When we linked businesses with all metering points then quite remarkable amount of metering points had an agreement with private customers .

Table 29. Businesses linked by all agreements, the owner of the agreement linked to metering point. Production and consumption in GWh. Zero values - 0.0 rounded to zero, 0 zero value.

 Year Registry Dis. points % points Production Consumption % prod. %cons. 2013 N/A 3545 3.1% 0.0 5.924 0.0% 0.1% 2013 Foreign 217 0.2% 0 0.674 0.0% 0.0% 2013 Private 43348 38.2% 0.046 213.563 0.0% 3.3% 2013 Embassy 4 0.00% 0 0.376 0.0% 0.0% 2013 Business 66428 58.5% 11 606.913 6 264.100 100% 96.6% 2013 Unions 68 0.1% 0 1.595 0.0% 0.0% 2013 Total 113 610 100% 11 606.960 6 486.235 100% 100% 2014 N/A 735 0.6% 0 0.662 0.0% 0.0% 2014 Foreign 222 0.2% 0 0.671 0.0% 0.0% 2014 Private 45311 39.6% 0.206 217.900 0.0% 3.5% 2014 Embassy 4 0.00% 0 0.433 0.0% 0.0% 2014 Business 67992 59.5% 11 826.354 6 096.190 100% 96.5% 2014 Unions 67 0.1% 0.0 1.922 0.0% 0.0% 2014 Total 114 331 100% 11 826.560 6 317.782 100% 100% 2015 N/A 312 0.3% 0 0.394 0.0% 0.0% 2015 Foreign 231 0.2% 0.0 0.752 0.0% 0.0% 2015 Private 47116 40.0% 0.426 222.603 0.0% 3.3% 2015 Embassy 2 0.0% 0 0.120 0.0% 0.00% 2015 Business 69948 59.5% 10 384.219 6 543.067 100% 96.7% 2015 Unions 49 0.0% 0 2.239 0.0% 0.0% 2015 Total 117 658 100% 10 384.646 6 769.177 100% 100%

For linking private customers the linking strategy to use registry code was used. It was possible to link almost 50% of unique private customers.

Table 30. Number of linked private customers and their consumption. Production and consumption in GWh.

 Year Dis. customers Dis. points Production Consumption 2013 245 437 244 792 0.151 666.405 2014 249 400 248 791 0.312 679.210 2015 253 102 252 566 0.606 685.448

The number of distinct customers it was possible to link is 48% of total number of private customers and their consumption covers 39% of total private consumption.

Table 31. The quality indicators related to linking (Estonia case).

 Undercoverage Overcoverage Difference between administrative units and statistical units The consumers are not always the contract holders and only 7% of business entities in the business register and 15% of companies in the statistical business register were linked directly with the data hub. Discontinuity 2015 was total consumption estimates based on survey data 4-14% higher (exact size is depending on the linking strategy)

## Summary

The main challenge is to identify end production and consumption of statistical units therefore it is crucial to identify linkage between metering points in smart meter dataset and units in administrative registers and also exclude all metering points not related to end consumption. In this section two case studies were given - Estonian and Danish. For both case study it was important to link statistical units by using address id and both had problems with finding one-to-one match between statistical unit and observed unit - a metering point. In Estonian case study the most problems were related to quality of address information, but in the Danish case study there were also difficulties identifying what was the periodicity of the reading and billing.

The main problems the case studies faced:

• The quality of address information. It was the main problem in the Estonian case to extract a valid address id. In Danish dataset a valid address id was used.
• Many metering points or many statistical units on the same address. It was a problem in both cases and reduced the rate of linking. In Estonian case study there is 5% of cases where metering points have the same address id.
• Identifying the actual consumer. It was more related to businesses and from 150 thousand businesses in the statistical business register only 22 thousand was possible to match.
• Own consumption of producers. There is trend that more end consumers have installed their own electricity production units and their own consumption is difficult to identify. The same applies also for big industries and their own consumption is impossible to identify. By description it was not the problem in Danish case study.
• Apartment associations – those are registered as business entities, but are apartment buildings. Mostly they are used for living but also many companies are active in the same address.
• The quality of registers. The quality of data in the register of buildings was questionable and many fields had not real values (the maximum number of rooms in one apartment was marked 781 and there was more than 1000 dwellings with rooms more than 15) or were missing.

## Results based on Estonian data

The main goal of the processing businesses electricity consumption data was to compute the annual and quarterly electricity consumption statistics. In addition to the total numbers, the interest was to produce statistics by economic activity sectors. The figures 2, 3 and 4 illustrate the monthly total consumption, annual and quarterly electricity consumption of Estonian businesses by economic activity sectors.

The seasonality of electricity consumption of businesses tends to become smaller (Figure 2).

Figure 2. The total monthly electricity consumption of Estonian businesses, 2013-2015.

It is possible to see electricity consumption by a sector. The areas of electricity production and real estate activities clearly dominate (Figure 3) in business consumption.

Figure 3. The annual electricity consumption of Estonian businesses by economic activity sectors, 2015.

The dataset enables to see dynamics by sector (Figure 4) and it can be linked to the data of economic activity. As one big factory of chemicals closed their activity there is remarkable drop in electricity consumption in this sector in 2013.

Figure 4. The quarterly electricity consumption of Estonian businesses in the manufacturing sector, 2013-2015.

## Quality assessment of the processed data

The difference between survey estimates and estimates based on integrated electricity metering point data by different linking strategies is visualized in Figure 5.

By looking at absolute numbers it is possible to see that difference in real estate activities roughly equals negative value of wholesale and retail activity energy consumption (Figure 6). This is quite reasonable as real estate firms build and maintain large business centres and shopping malls, and as owners of the premises also own the electricity contracts. Shops and service providers, on the other hand, rent the premises in the business centres and shopping malls, and thus do not appear in the electricity data hub as the consumers of the electricity. The other large difference occurs in the electricity sector where the survey estimates are much larger than estimates based on the data hub. This can be explained by the fact that in survey, also electricity produced for the production of electricity is included but in the electricity dataset, we have no information which meters are recording such usage and which are used for selling or transfers.

Figure 6. Visualizing difference between the survey estimates and integrated electricity data estimates in absolute numbers (2015 data). Blue colour shows smaller consumption and red larger consumption than in the probability survey.

Table 32 summarizes the quality information for the consumption estimates of the businesses. The main problem is the unit issue as it makes linking administrative unit (metering point of a building) with statistical unit (business) quite unsuccessful.

Table 32. The quality of the businesses consumption statistics (Estonia case).

## Summary

The electricity smart meter data has potential as a source for producing business statistics, but the methodology of linking has to be improved. Current estimates in certain areas differ significantly from the survey data, as it is difficult to estimate the real consumer. On the other hand the dynamics of electricity consumption could indicate the change in economic activity by a sector and could be used as an early warning indicator.

Electricity consumption statistics – households

The main goal was to link as many households with the smart meter data. After completing the linking it was possible to identify how the energy consumption is related to household size, number of rooms in a living place and other indicators available in the registries. For this purpose we created an extract of monthly consumption of metering points linked to households and other descriptive fields In hive and imported the data to python pandas for further analysis and visualization.

## Results

### Estonian case study

From the registry of the people we have identified 538 092 household (01.01.2016), there was 528 738 distinct private customers in smart meter dataset in 2015 and it was possible to link 253 102 households with the metering data.

A preliminary visualization of of linked data sets was conducted. As a result it can be seen that in Estonia the electricity consumption of private customers is seasonal and there is remarkable difference in consumption in winter and in summer (Figure 7). At the same time the dwellings belonging to a group small consumption showed opposite seasonality and probably those are seasonally occupied.

Figure 7. Monthly consumption pattern of households.

After linking with the registry of people and buildings it was possible to identify relationship between electricity consumption and household size (Figure 8). Households living in houses tends to consume more energy. As we can see the clear linear relation between the energy consumption and the household size, the data can be used to predict the household size.

Figure 8. Yearly mean consumption by household size and type of living place - house (H) or apartment (K).

The dwelling register contained very large values of size of living place in m2 therefore log values were used to find out relationship between the size of living place and energy consumption (Figure 9). There is relationship, but it needs further analysis and cleaning of the data.

Figure 9. Yearly mean consumption by the size of living place (log scale) and the type of living place - house (H) or apartment (K).

Linking year of construction and energy consumption it is possible to identify from which year and what size of living place are currently in use and also see the dynamics of construction activity (Figure 10). The average electricity consumption per dwelling was more random and needs further analysis.

Figure 10. Total consumption in 2015 by the house construction year and the number of rooms in dwelling.

In total the smaller apartments consume more electricity than bigger apartments, but average consumption is related to the number of rooms in the living place (Figure 11). Further analysis can be conducted what is the distribution of different size of dwellings in use regionally.

Figure 11. Yearly total (bars left scale) and mean (lines right scale) consumption by the number of rooms in dwelling and type of dwelling - house (H) or apartment (K).

The quality information about the households' electricity consumption is described in table 33.

Table 33. The quality of the households' statistics (Estonia case).

 Undercoverage In the analysis, only households holding their own electricity contract with the provider are used.  Own produced and consumed electricity is not recorded in the data hub and amount of such consumption is unknown. Currently the amount is estimated by the data hub owner to be rather small but having capability to produce own electricity is increasing year-by-year. Overcoverage Some of the households share the address of their main residence with a business, so some share of the electricity consumption should be assigned to the legal person. Percent of units that fail checks The data is taken as it is and extreme recordings are not excluded from the analysis. Percent of units that are adjusted No units were adjusted. Percent imputed No imputations were applied. Periodicity Data is delivered once a year. Delay It takes up to 4 months after the reference period for all the corrections done by providers or sellers to appear in the data hub. Difference between administrative units and statistical units It was possible to find consumption for 47% of households. Discontinuity There is no survey data to compare with.

The difference between the units is caused by the low address quality, but as the missingness of the address id's is not systematic and covers all areas of Estonia then in an analysis where the average consumptions and not total consumptions are used, we believe the data to be usable and useful, as such statistics has not been produced before.

### Danish case study

In the following three graphs the average consumption for Danish households are shown. Households are divided into:

• Type of dwelling: (Detached house, attached house, story building, dormitory/institution, farmhouse, holiday purpose, other
• Whether the family is comprised of more or less than two people
• The size of the dwelling
• Whether the households uses electricity as heating medium

The first graph (Figure 12) shows the average consumption for meters that are hourly read and are settled based on hourly readings. One will notice how high the values are.

Figure 12. Average consumption information based on hourly read meters with hourly settlement, 2015.

The second graph (Figure 13) shows the average consumption for meters that are hourly read but settled on account (periodically based settled). One should notice that the average consumption is lower than in the first graph but still not in compliance with other sources on Danish electricity consumption.

Figure 13. Average consumption information based on hourly read meters with periodical settlement, 2015.

The third graph (Figure 14) shows the consumption for meters that are periodically read and also settled on the basis of periodical readings.

Figure 14. Average consumption information based on periodically read meters with periodical settlement, 2015.

The consumption averages are very different according to which kind of reading and settlement you choose to look at. Since not all meters are smart meters, yet, and due to the fact that the meters that were smart to begin with were the ones with annual readings above 100 kWh, it is suspected that the meters that are hourly read but periodically settled are not correct. The numbers are simply far off other sources of statistics.

## Summary

Despite the fact that the linking quality was not very high as less than half of the households was possible to link with electricity meter data, the results indicated there is potential to use smart meter data for producing statistics of households. The main potential of using smart meter data is the possibility to link with different registers and thereby reveal new information otherwise unavailable to the society. There is still some work to be done to reveal the full potential of hourly readings of smart meter data.

Identifying vacant living spaces

Information about the vacant living spaces is relevant housing statistics that can be used in the population and housing census, tourism statistics or by different industries e.g. real estate sector. It is of interest to know how many dwellings are empty for a long time and how many are occupied seasonally. In addition to producing statistics on vacant dwellings, the information can be used on dwelling level to validate the information in the registers. Statistics Estonia is planning to carry out next population and housing census by using only register information. The pilot census showed that one of the largest quality problems is the accuracy of the address of the main residence. Namely, people are quite often not living on the address given in the register. To overcome the problem, alternative data sources are looked for. One possibility is to use mobile data but electricity smart meters data could also be beneficial. The goal of this section is to offer algorithms to identify the vacant dwellings, compute the indicator for each dwelling  referring to occupancy either on a certain day or during some other specified time period. This information would then enable to validate how many households are living in the vacant dwellings specified in the register as their main residence. In addition to validating register information, the electricity consumption information could be used to improve the address information in the census.

This section contains three parts. First part introduces some algorithms that can be applied to create vacant dwelling' indicator. All methods are tested on synthetic data. In the second part the real electricity consumption data is used to validate the results from the pilot data of register based census. Third part summarizes the results and concludes with a list of problems encountered and possibilities to improve the methodology.

## Possible methods for identifying vacant living spaces

Identifying vacancies was tested using a variety of different methods. Since there is no data present on which to evaluate the results of each methods, they can only be judged by their methodological soundness. Vacancy was estimated using the half hourly kWh usage per day. The following methods were tested:

1. Using the R-function auto.arima to predict kWh usage. A good fit could indicate predictable kWh usage and therefore a vacant living space.

2. Estimating the variance of the daily kWh usage. Low variance could indicate a vacant living space.

3. Estimating the ratio between the mean kWh usage during day- and night-time.

4. Using a volatility measure.

5. Using a cell wise outlier detection procedure to locate days with 'unusually' low kWh usage.

6. Using estimates from Methods 3., 4. and 5. as well as range, mean and median of daily kWh usage in a random forest model.

For the following descriptions let $x_1,...,x_48$ be the half hourly kWh usage for a person of a day.

Time series

For the time-series approach the auto.arima function from the R-package forecast[1] was used. The idea is to fit an ARIMA model to $x_1,...,x_48$ and estimate the correlation between the real data an the fitted model. A high correlation indicates a predictable pattern which furthermore suggests a vacant living space. This would follow from the assumption that the variance in the kWh usage is generated by the automatic (and deterministic) turn off and turn on behaviour of household appliances, e.q. fridge, heating system, etc. Whereas a low correlation would indicate unpredictable human behaviour.

Figure 15. Daily electricity consumption for selected households and correlation between the real data and the fitted model.

Figure 15 shows the kWh usage of specific days and households with the number in the grey bar at the right side showing the absolute correlation. A high correlation should indicate a vacant living space and vice versa. As it can be seen this method produces questionable results since the kWh usage takes on similar forms for cases with low and high correlation.

Variance

Calculating the variance of the $x_1,...,x_{48}$ can be used to estimate if a living space is vacant or not. Since household appliances on standby use only little kWh, a low variance could indicate a vacant living space whereas a high variance in kWh would be the result of actively using power, e.q. a non-vacant living space.

Figure 16. Daily electricity consumption for selected households and estimated variance.

Figure 16 shows the kWh usage of specific days and households with the right bars showing the estimated variance. A high variance should indicate a non-vacant living space and vice versa. This method performs a bit more convincingly than the time series approach, but cases with high variance and (seemingly) deterministic changes in kWh usage can still occur, e.q. the third graph from the bottom.

Day- and night-time

Another indicator for the classification of vacant living spaces could be the ratio between the mean of half hourly kWh usage during day- and night-time. Day-time was defined from 8 a.m. to 7 p.m. and night time from 8 p.m. to 6 a.m.. A high ratio would implicate a more excessive kWh usage during daytime which would implicate a non-vacant living space.

Figure 17. Daily electricity consumption for selected households and day-night-time usage ratio.

Figure 17 shows the kWh usage of specific days and households with the right bars showing the ratio between the day- and night-time kWh usage. It can be easily seen that this methods does not classify vacancy very well. Even cases with a low ratio can, judging by the path of the kWh usages, very well be from non-vacant living spaces.

Volatility measure

Scaling the hourly kWh usage $x_1,...,x_48$ and estimating the standard deviation of the first differences of the scaled values could also lead to an indicator for vacant living spaces. A high volatility would result from frequent and large variations in kWh usage which furthermore could point to a non-vacant living spaces.

Figure 18. Daily electricity consumption for selected households and estimated volatility.

Figure 18 shows the KWH usage of specific days and households with the right bars showing the estimated volatility. This methods seems to perform poor as well, since a high volatility does not necessarily coincide with a non-vacant living space.

Cell-wise outlier detection

Another approach in order to classify vacant living spaces was the use of a cell-wise outlier detection method. The Method is fully explained in Rousseeuw, 2016[3]. and usable in R through the package cellWise[6]. The main idea of this method is, given a data matrix $X \in \mathbb{R}^{n\times p}$, to use univariate outlier detection methods on each column as well es the correlation between every column to detect potential outlying cells. Some advantages of this method are that the method still works if $p>n$, in fact as stated in Rousseeuw, 2016[3] the method gains more performance by introducing more dimensions to the data and that it can deal with missing values in the data matrix $X$.

The method was applied on the total kWh usage per household and day, using a data matrix $X \in R^{n\times p}$, with $n$ as the number of households and $p$ as the number of days in 2013. Each cell in $X$ lists the total kWh usage of a household and day of the year.

To improve the performance of the method two additional steps were taken before applying the outlier detection method:

1. The distribution of total kWh usage per day across the households is skewed heavily to the right, as displayed by Figure 19, which shows the density estimates of kWh usage per day. Since this outlier detection method internally applies, onto each column, robust estimates for location and scale, which expect the data to be approximately normal, the Box-Cox transformation, was initially applied to each column. Because zeros can occur in the data we choose to use the two-parametric Box-Cox transformation, which is implemented in the R-package geoR.

2. After Box-Cox transformation the data matrix $X$ was split into subsets defined by the number of occupants in each household. The groups were labelled *low*,*mid* and *high*, consisting of households with 0-2 occupants, 3-4 occupants, and 5+ occupants.

Figure 19. The distribution of total kWh usage per day.

The cell-wise outlier detection method was finally applied on the subsets of the transformed data matrix $X$. Please note, that this method detects upper as well as lower outliers and that in this context only lower outliers are of interest.

Figure 20. The total kWh usage per day for some of the households.

Figure 20 shows the total kWh usage per day for some of the households with red dots showing bottom outliers e.g. vacant living spaces and yellow dots upper outliers, which are not of interest in this context but are displayed for the sake of completeness. It seems, that the method performs reasonably well, especially when flagging, what seems to be, vacancy due to long holidays. Nevertheless it can't be said what really is or is not a vacancy, which leaves questions about a goodness of classification, i.e. false-positive- or false-negative-rate, unanswered.

Random forest

The final approach consists of a random forest (see Breiman,2001[2]) model that was trained on a labelled subset. The regression variables of the model consists of previously listed classifiers as well as additional information. In detail the variables we used per household were estimates concerning kWh usage per day and quarter consisting of:

• Variance
• Range
• Mean
• Median
• Standard deviation of first differences of the scaled kWh usage
• Share between mean day- and night-time kWh usage
• Number of occupants in each households grouped as in the cell wise outlier detection method

The labelled subset was created using the cell-wise outlier detection method and than handpicking 75 living spaces for which this outlier detection method seemingly produced reliable results. For the calculation of the random forest model the function \texttt{ranger} from the R-package which carries the same name.

The goodness of fit for the random forest, on the training data, in given in Table 34.

Table 34. The goodness of fit for the random forest.

true value \ predicted value non-vacant vacant
non-vacant 24 210 135
vacant 358 2 010

About 85% of the assumedly vacant cases were successfully detected with a false discovery rate of about 6%.

Concluding remarks on the methods tested

The code to test the described methods is available at https://github.com/alexkowa/essnetBDwp3 . The outlier detection method as well es the random forest approach seem to produce, at least, more reliable results. Nevertheless the quality of these classifications can not truly be measured without a pre-labelled data set. If a pre-labelled data set is provided more fine-tuning steps can be taken, like:

• Using other measures like variance or range on the cell-wise outlier detection approach.
• Using other machine learning algorithms, like boosting or support vector machines.
• Without a labelled data set, the results would not gain more insight into the problem of classification as was displayed by the random forest model.
• Using cluster algorithms on time series to identify specific patterns for vacant living spaces.

## Results based on Estonian data

The goal to identify vacant dwellings in Estonia is to validate and improve the register based population statistics, namely the address of the main residence information. By looking at the daily or monthly electricity consumption of a living space one can estimate whether the living space is occupied during one day or a month or not. This is useful information in a situation where we have two or more potential living spaces for a household and need to choose one or when we know that the main residence information is not of good quality and want to improve the quality of the addresses given in the register and thus improve the quality of the population statistics.

Classification of the dwellings into occupied and non-occupied is here done based on electricity consumption data only.  Classification is carried out for the month of December 2015 so one can compare the results with the population statistics that refers to the 31.12.2015. According to the definition in the population statistics a dwelling is considered to be vacant at the end of the day 31.12.2015 if no one is registered in Population Register to be living in the dwelling at the given date. This definition is not directly applicable to the electricity dataset.  The following working definition is considered to be fairly near to the wished one and used in the analysis later on. A dwelling is considered vacant in December 2015 if the monthly electricity consumption is zero or  less than 150 kWh and the electricity consumption pattern is such that classification algorithm classifies the dwelling to be vacant for 16 days or more. Note that one can estimate vacancy on a certain day only for those dwellings where smart meters are installed.

The reasoning behind the working definition is following. It is reasonable to assume that occupied dwelling uses some amount of energy. Thus follows that dwellings having zero consumption are not occupied. If the consumed amount is large enough then we can say that a dwelling is occupied. In the following analysis the amount which is considered to be large enough to indicate that the dwelling is occupied is 150 kWh a month. Classification algorithm needs to be applied to the dwellings that have monthly consumption between zero and 150 kWh.

For classification more detailed data than monthly data is needed. Different approaches were considered, many of them were covered in the previous section. Firstly, the daily variance measure and daily total and average was calculated and studied but in the end, it was not used as it was very difficult to decide what threshold to use for determining vacancy. This method would work if one has control group of living spaces with known occupancy. Although it is intuitive that low total consumption and low variability would indicate vacant dwelling then it is very difficult to set the value to the "low". In the end, two measures based on daytime and night-time difference were combined to classify living spaces. Both methods are described in ONS paper [7]as method 5 and method 6.

Method 5 uses difference between daytime and night-time variance of the consumption. A living space is vacant according to this method on a given day if

night variance > 0 and 0.5 <= day variance/night variance <= 1.5 OR night  variance= 0 and day variance= 0;
otherwise living space is not empty.

Method 6 uses average daytime and night-time consumption (log-scale). A living space is vacant according to this method on a given day if

day average < night average + night std.err;
otherwise living space is not empty.

Using daytime versus night-time measures seems reasonable as one would assume that in a vacant living room the day and night differences in the consumptions are small while in the occupied living house many electric devices are used while household members are at home. For determining the vacancy a combined measure of two indicators was used. A living space is estimated to be vacant on a given day if both methods identify a living space as vacant. A living space is vacant in a current month if it is vacant in at least 16 days in the month (15 days in February).

Dwelling register has 526 465 living spaces that have been registered as main residence and 199 074 dwellings with no person reporting to have main residence there. The latter information could be used as an indication of non-occupation. However, the culture of registration the main residence in Estonia is not very good. Some surveys assess that up to 20% of residents live in the places they are not registered.

The main results of our analysis are given in table 35 which illustrates how well the computed indicator and register information coincide.

Table 35. Coherence of calculated occupancy and register occupancy by the consumption class.

Consumption class (kWh) No of living spaces Vacant by consumption data; occupied in register Occupied by consumption data; vacant in register No of matched vacant living spaces

No of non-vacant matched living spaces

No of living spaces with the same occupancy Matched occupancy, %
0 10 263 4 263 0 6 000 0 6 000 58.5%
(0,150) 134 483 10 635 18 058 7 903 97 887 105 790 78.7%
150> 156 948 0 20 352 0 136 596 136 596 87.0%
Total 301 694 14 898 38 410 13 903 234 483 248 386 82.3%

These results confirm that not all people live on the address registered in Population Register. It is unlikely that a person would live in an unoccupied dwelling based on energy consumption. One reason for registering into an unoccupied dwellings is the tax benefits for the homeowners. In more than 4000 dwelling, where people are registered in Population Register, no energy has been consumed (Table 35).

Based on energy consumption there are more occupied dwellings than in the pilot census. It makes sense because one person can have multiple living places but only one registered address.

Figure 21. Estimation of vacant dwellings by household size.

According to the register information there are more lone parent families than in regular censuses. This comes also out on Figure 21 where one can see that majority of estimated vacant living spaces are such where none is or only one person is registered.  Main reasons for registering apart from the rest of the family are wider opportunities for selecting kinder garden and school for the children. At the moment we are developing partnership index theory for register based census that will include decision stage regarding to which address to choose for a family. Energy consumption data could be helpful in this context.

Further analysis is necessary for people who are registered into dwellings that are unoccupied based on the low energy consumption. We are interested in finding an alternate data source to determine their actual place of living. In some cases the electricity dataset can be useful here.

The summary of the quality information of the output statistics is given in Table 36. Here the quality assessment is done for the statistical unit living space.

Table 36. The quality of the vacant living spaces indicator (Estonia case).

 Undercoverage Only living spaces with the smart electricity meters are included. Also living spaces that have their own electricity contract are included i.e. excluded are living spaces where the electricity contract is hold by the legal person.  Some living spaces are excluded also due the missing address id. In the end, 39,4% of living spaces is excluded. Overcoverage There is no overcoverage of living spaces. Living spaces owned by non-residents will be excluded during the linking procedure, as in the population and housing census only residents of the country are observed. Percent of units that fail checks The data is taken as it is and extreme recordings are not excluded from the analysis. Percent of units that are adjusted Type of meter was changed for some households from smart meter to the manual. The type of the meter is not always correct as the switch from manual to remote does not happen on the installation of the smart meter but after some period and even after the switch the households have to send in manual monthly readings for short time for control reasons. Percent imputed No imputations were applied. In the future, one can think about imputing hourly consumptions for the household's living in the dwellings where electricity contract is owned by the legal person. In such cases, only the total consumption by all living spaces in the dwelling is known. Periodicity Data is delivered once a year. Delay It takes up to 4 months after the reference period for all the corrections done by providers or sellers to appear in the data hub. Statistics Estonia does not have agreed date for receiving the data for the previous year, but it is usually received in the first half of the year. Difference between administrative units and statistical units Unit problem is not so big issue here as our statistical unit is a living space and administrative unit is metering point in a dwelling or living space. The problem occurs only in cases where there is one metering point for several living spaces as it can happen in the apartment buildings. Discontinuity Vacancy estimate coincides with the pilot census vacancy estimate for the 82.3% of the observed living spaces .

## Summary

This chapter explored the idea of using electricity consumption information to estimate whether the living space is vacant or not. This kind of information is of big interest in Statistics Estonia where we are having incorrect information about the household's main residence. Knowing whether the household registered to live on certain address is really living there or not will enable to look at other sources for the correct address if the living space is actually vacant.

Estimating occupancy or vacancy of a living space is not a simple task. Several methods are suggested and tested on synthetic data. The most promising are the outlier detection method and random forest method. However, on the real data simpler methods were applied due to the lack of data for the training set. The results are quite promising for those households that were included in the analysis. For those, we obtained results that show about 18% of households do not live on the address the population register has for them. This coincides quite nicely with the estimates obtained from survey data which show this number to be about 20%.

Statistics Estonia plans to continue working with the electricity data and find more ways to use vacancy indicator. In an ideal case data could be used to not only to validate the results but to improve the main residence address information in the register based census.

Conclusions

### Main findings

The aim of the work presented in this report was to show how the observed units - electricity consumption of metering points - can be mapped to statistical units - businesses, households and dwellings - and how based on the results existing statistics can be replaced or a new kind of statistics can be developed. There were three goals with regard to expected outputs. First, to assess whether current survey based business statistics can be replaced by statistics produced from the electricity meter data, second, to produce new household statistics and third, identify vacant or seasonally vacant dwellings.

As a result of current report, the production of the electricity consumption statistics of the businesses is promising, but the methodology has to be improved to get better match with survey data. Current methodology leads to the remarkable differences in the electricity consumption in the certain economic activities and thus, is not suitable to be used in the official statistics.

The household statistics about the electricity consumption gives information not available before and through the use of smart meter data it is possible to produce statistics now. Information about the average consumption by the household size, type of dwelling or any other characteristics of the dwelling could be made available to the public as experimental statistics. This gives opportunity to the public to give feedback of the usefulness of this information as well as to request some other information not yet available about the households' consumption.

The most valuable in the context of the registry based census is identifying vacant living places. It was possible to link electricity consumption information to the 61% of dwellings.  Different methods were tested for identifying vacant dwelling and the results are promising. Some of the tested methods seem promising, however conclusion in this field are very preliminary. Methods with better precision could be used when training set with known vacancy information would be available. So far only visual inspection of the corresponding data can be used as training data. For identifying vacant living places a hourly readings of electricity data are needed.

One important outcome of this project is quality assessment framework, which was used in this report to evaluate all the outputs and can also be used in the future projects.

The main advantage of using the smart meter data are:

• possibility to link it with other data sources and gain new knowledge,
• the data source can be used to validate or improve current survey based statistics and
• the data source could improve the speed of producing statistics and also increase the quality of regional statistics.

For many applications a sample of the full population, which might be drawn with conventional sampling methods, would be sufficient. This might be a good possibility to reduce the computational burden of using smart meter data.

### Problems encountered

The smart meter electricity dataset has potential to be as an additional data source in the production of the statistics. It is possible to get overview of the all electricity consumption which is not feasible in the survey. The main difficulty related to the consumption data is that it is complicated to find real consumer of the electricity as quite often the real consumer and the contract holder in the dataset differ. The problem becomes more difficult as the quality of the address information, which could be used as a linking key, is not very good. There is also an increasing trend that the end consumers are starting to produce electricity by themselves (active installation of the solar panels) and this kind of consumed electricity is not recorded.

The main problems related to the data source are:

• The description of data source is not detailed enough and metadata about the variables was often missing in the Estonian case study, but it was a local problem as Danish data was with adequate metadata,
• Without additional information it is difficult to identify which metering point records the actual end consumption and which one transfers and sellings of the electricity,
• The observed unit (metering point in a building) does not match the statistical unit (business,household, dwelling) e.g. consumption in a metering point is assigned to the owner of the contract with an energy company not to the the actual consumer,
• Address information is not standardized and geo-coding needs lots of resources,
• Manual metering readings can cause lot of work as is the case in Denmark. Dataset contains dates when the manual readings are reported but no information about period the reported consumption refers to. This problem will disapper with complete installment of the smart meters.
• One metering point corresponds to many consumers (apartment unions, real estate companies renting rooms) and one needs to develop methodology to extract the consumption of the single consumers
• There is no information about the own produced and consumed electricity in the datahub. The datahub contains information only about the electricity sold and bought.

### Future work

To reveal the full potential of the data there is still work to be done to develop methods to improve the quality of the linking, to use classification methods to identify customer type and combine different data sources to reveal new information.
Future steps to improve the quality and extend the use of the data are:

• Find ways to improve the quality of address information and use machine learning algorithms to clean the address data.
• Find a way to generate statistics by using sub-set of businesses that were linked correctly.
• After improving the address quality conduct regional analysis of the energy consumption.
• Use unsupervised classification algorithms to identify different types of electricity consumption patterns and analyze possibility to discriminate businesses from private consumers.
• Develop model to identify producers of electricity for own usage and model the amount of electricity used.
• Link the electricity data with economic activity data to see correlation between economic activity and energy consumption.
• Link electricity data with weather data and building register to identify the impact of weather to the electricity consumption.
• Identify whether electricity is used for heating using the consumption pattern.
• Develop partnership index theory for register based census that will use registers and electricity data for determining a non-vacant living space for the partners. as their main residence.
• All people do not have registered address with dwelling accuracy (3,6% of households in the first trial REGREL). They need to be placed into dwellings. Knowing if a dwelling is occupied (based on energy consumption) can be very helpful when placing people into dwellings.

References

1. Hyndman RJ and Khandakar Y (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3), pp. 1-22.
2. Breiman, L. (2001). Random Forests, Machine Learning 45(1), 5-32.
3. Rousseeuw, P. J., and W. V. den Bossche (2016). Detecting Deviating Data Cells..
4. Elering (2016). Estonian Data Hub. Guide for using and joining the Estonian Data Hub. Version 2.0 July 2016.
5. Meindl, B., Templ, M., Alfons, A., Kowarik, A. and Ribatet, M. (2017). simPop: Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. R package version 0.6.1.
6. Jakob Raymaekers, Peter Rousseeuw and Wannes Van den Bossche (2016). cellWise: Analyzing Data with Cellwise Outliers. R package version Version 1.0.0..
7. Williams, S. and Gask, K. (2015). Modelling sample data from smart-type electricity meters to assess potential within official statistics. GSS Methodology Series No 40. Office for National Statistics. UK.

Annexes

## Annex A

A1 Queries speed test in Hadoop ecosystem.

For testing the speed of queries we used a nested queries for two level aggregation. The tests were conducted in Hive and Spark. We tested what is difference in processing times as we use external or internal (ORC) table format and what is the speed difference if we use some function to extract the features or the features are already extracted in the pre-processing stage.

The first test case: original metering data from the years 2013 -2014 (12443177397 rows) on csv files defined as an external table. In the original data the metering time is in a format of unix timestamp and some conversation is needed to extract the year.

SELECT COUNT(*)
FROM (SELECT year(from_unixtime(recording_time)) AS year, metering_id, SUM(in_quantity), SUM(out_quantity)
FROM smart_meter.metering_data
GROUP BY year(from_unixtime(recording_time)), metering_id) md;


Time taken: 8685.764 seconds, Fetched: 1 row(s) 2 hours 24 minutes 45.764 sec

The second test case: metering data from the period 2013 – 2015 (18702064849 rows)

SELECT COUNT(*)
FROM (SELECT year(from_unixtime(recording_time)) AS year, metering_id, SUM(in_quantity), SUM(out_quantity)
FROM smart_meter.metering_data_orc
GROUP BY year(from_unixtime(recording_time)), metering_id) md;


Time taken: 2801.614 seconds, Fetched: 1 row(s) 46 min 41,614 sek

SELECT COUNT(*)
FROM (SELECT year, metering_id, SUM(in_quantity), SUM(out_quantity)
FROM smart_meter.metering_data_orc
GROUP BY year, metering_id) md;


Time taken: 308.344 seconds, Fetched: 1 row(s) 5 min 8,344 sek

%%time
metering_data.groupBy('year', 'metering_id').agg(sum('in_quantity').alias('prod'), sum('out_quantity').alias('cons')).count()


CPU times: user 96 ms, sys 16 ms, total 112 ms Wall time: 9min 27s   %%time

metering_data.select(year(from_unixtime(metering_data.recording_time, format='yyyy-MM-dd')).alias('year'), metering_data.metering_id,
metering_data.in_quantity, metering_data.out_quantity).groupBy('year', 'metering_id').agg(sum('in_quantity').alias('prod'),
sum('out_quantity').alias('cons')).count()


CPU times: user 532 ms, sys: 108 ms, total: 640 ms Wall time: 1h 6min 51s</div></div>