
WP3 Report 4


Introduction

Six National Statistical Institutes (NSIs) participated in Work package 3 within the ESSnet Big Data: Statistics Austria, Statistics Denmark, Statistics Estonia, Statistics Italy, Statistics Portugal, and Statistics Sweden. In this report, we provide recommendations for countries that plan to use smart meter data in the future. Additional information and details are provided in previous reports - Report 1[1], Report 2[2] and Report 3[3]. Report 1 includes a review of other work done on smart meter data.

The report summarizes the findings of country-specific studies and gives advice on how to work with smart meter data, including access to data, methodology for processing and analysing data, IT architecture, validating the quality of data, and possible outputs. Standard SQL or R code is provided where appropriate.

The main contribution of this report is to demonstrate the use of data from smart electricity meters for the production of official statistics. The pilot had three goals with regard to expected outputs: first, to assess whether current survey-based business statistics can be replaced by statistics produced from electricity smart meter data; second, to produce new household statistics; and third, to identify vacant or seasonally vacant dwellings.

”Smart” in this context implies that electricity consumption and production are measured at short time intervals and that the measurements are read remotely. These measurements, together with additional information on customers and agreements, can be stored in a common national database: a data hub. The existence of a data hub, together with the necessary legislation and agreements with data holders, greatly simplifies access to data for NSIs.

A main challenge is how the observed units - metering points - can be mapped to statistical units - businesses, households and dwellings - and how existing statistics can be produced based on the results. The methodology used depends on what kind of data are available to the NSI and on the aggregation level (on the time scale). As seen in Report 3 of the pilot, availability of raw (not aggregated) data and additional information about the metering points (e.g. address information) can widen the range of possible statistical outputs. In this report it is assumed that detailed measurements (hourly or less) are available.

Access

A general recommendation is that prior to obtaining the data, every NSI of a specific country should be in contact with their Privacy Control Authority to discuss legal issues and the legal aspects of exchanging data, referring to the laws in force in the country. This process will differ between countries.

In practice, the existence of a hub is a prerequisite for using smart meter data in production. Once access is granted, data are accessible by agreement with the authority responsible for the data hub. If there is no hub, agreements must be negotiated with every data holder separately (net operators and electricity providers).

In several of the countries in the pilot, legal obstacles to accessing smart meter data exist, mainly due to privacy issues. Only Estonia and Denmark have hubs in operation and full access to the data. Austria receives aggregated data only. Sweden will have a hub by 2020 and it will be possible to use the data for statistical purposes, thanks to a change of the Swedish Statistics Act; but as smart meter data for households are considered personal data, there will still be privacy concerns.

Statistics Estonia contacted Elering AS, the Estonian Transmission System Operator, which forwarded the request for access to the Estonian Data Protection Inspectorate. There were no legal barriers to accessing the data, since the data collection is legislated through the National Statistics Act.

Another example is the procedure in Italy, where Istat submits a data request to the Acquirente Unico (AU), the public authority controlling the energy market. The AU, after deciding on the feasibility of the data request, submits it to the Data Protection Authority and the Competition Authority for an opinion on its congruity with the existing rules. If the result is positive, the data are released and can be acquired by Istat.

It is not a given that an NSI can obtain data from external sources, so it is a good idea to have some leverage to offer. For example, data can be provided to external researchers, which contributes to increased knowledge about, in this case, electricity consumption and production. This also benefits the electricity market, since suppliers get to know their consumers better.

Statistics Denmark made an agreement with Energinet, the manager of all electricity data in Denmark. The setup is that suppliers of electricity and grid owners send their data to Energinet, which then manages the billing between consumer and supplier. Before Energinet was founded, consumers of electricity received a complicated bill, which included both the tariffs paid to the grid owners and the electricity suppliers. Energinet has the mandate to store data from all suppliers, and one of its goals is that researchers and others may use its data to investigate production and consumption patterns. Statistics Denmark has the experience and the capability to store the data and supply them to research institutions through a server at the research service department. The specific delivery is done via a secure FTP server, which protects the privacy of the consumers. Once the data are in house, they are anonymized. For a more detailed description of Energinet and its role, see Report 1. Statistics Denmark and Energinet have made a formal contract in which the obligations of both parties are described.

In Portugal, the process of installing smart meters in houses and businesses is still in progress, and for that reason Statistics Portugal made informal contacts with the main private electricity company to explore the possibility of obtaining some of these data. A first sample file with 138 anonymized installations for the year 2015 was provided. After analysing these electricity consumption readings, Statistics Portugal asked for a more complete file. A new file was provided with 5757 anonymized meters with readings from 2016 and 2017. The files were delivered on a memory stick.

Processing data

Data processing will be very dependent on the situation in the country and the agreement on data delivery. In this section, two examples are provided, representing the most advanced countries.

Experiences from Estonia

The original Estonian data contained four tables: Customer, Agreements, Metering points, and Metering data. Not all the metering points were related to end consumption, and metering points were labelled accordingly. In the data preparation phase, new columns were added to the original data tables to increase processing speed. Each record in the metering data table has a unique time stamp, but for easier linking with agreements, a date column was added. The changes in agreements are made at 00:00, therefore the time unit of agreements is a day. A temporary table was created which contained the columns Date, Metering point id, and Customer id. With this table it is easy to calculate which customer consumed electricity from a certain metering point and at what time. A similar temporary aggregated table was formed for the metering data, containing the columns Date and Consumption. From those two tables it is easy to aggregate further and generate a temporary table of monthly consumption (with columns Year, Month, Metering point id, Customer id, and Consumption) and a temporary table of yearly consumption (with columns Year, Metering point id, Customer id, and Consumption).
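
The aggregation into a temporary monthly table can be sketched with SQL run through Python's sqlite3 module. The table and column names below are illustrative, not the actual Estonian schema:

```python
import sqlite3

# In-memory database with a toy metering data table
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE metering_data
               (read_date TEXT, metering_point_id INT,
                customer_id INT, consumption REAL)""")
rows = [
    ("2016-01-05", 1, 10, 2.0),
    ("2016-01-20", 1, 10, 3.0),
    ("2016-02-01", 1, 10, 4.0),
]
con.executemany("INSERT INTO metering_data VALUES (?,?,?,?)", rows)

# Temporary monthly consumption table:
# (Year, Month, Metering point id, Customer id, Consumption)
con.execute("""
CREATE TEMP TABLE monthly_consumption AS
SELECT CAST(strftime('%Y', read_date) AS INT) AS year,
       CAST(strftime('%m', read_date) AS INT) AS month,
       metering_point_id, customer_id,
       SUM(consumption) AS consumption
FROM metering_data
GROUP BY 1, 2, 3, 4
""")
monthly = con.execute(
    "SELECT * FROM monthly_consumption ORDER BY year, month").fetchall()
# January readings 2.0 + 3.0 aggregate to 5.0; February to 4.0
```

A yearly table follows the same pattern by dropping the month column from the SELECT and GROUP BY.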


Figure. Linking tables.

The main challenge is to find the actual end consumer. The Estonian data contain information about the contract owner, but this is not always the end consumer. To find the end consumer, the addresses of households and businesses are identified and the address is used for linking. With the address identification, it is possible to link the end consumer with a metering point, but the success of the linking is highly dependent on the quality of the address information. Due to inferior quality, there may be many errors. In the case of businesses it is also difficult to identify the amount of electricity consumed by a particular company, as there may be many metering points and businesses at the same address. To solve the problem, the consumption is distributed equally between the companies linked to a metering point (e.g. if there are ten companies linked to a metering point, each gets 1/10 of the consumption measured at the metering point).
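
The equal-distribution rule can be sketched as follows (metering point ids, company names and amounts are illustrative):

```python
from collections import defaultdict

# metering point -> measured yearly consumption (kWh)
consumption = {"MP1": 1200.0, "MP2": 500.0}
# metering point -> companies linked to the same address
companies_at_point = {"MP1": ["A", "B", "C"], "MP2": ["D"]}

company_consumption = defaultdict(float)
for mp, total in consumption.items():
    linked = companies_at_point[mp]
    share = total / len(linked)      # each company gets 1/n of the total
    for company in linked:
        company_consumption[company] += share
# MP1's 1200 kWh is split into 400 kWh for each of A, B and C
```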

Experiences from Denmark

Three different data sets are considered below: two sets of consumption data and one full background data set.

Consumption data sets

Hourly consumption, automatic reading

METERING_POINT  READ_TIME            AMOUNT  TYPE_OF_READING (1: estimated, 2: calculated, 3: read off)
12345           01:03:2015 00:15:00  2       1
12345           01:03:2015 00:15:00  2       3
12345           01:03:2015 00:30:00  1       3
12345           01:03:2015 00:45:00  1       3

Statistics Denmark receives monthly files on hourly consumption. Each file can be up to 80 GB, and as more and more meters are converted from manual to automatic readings, the size of the files increases over time. Hourly amounts are estimated, calculated, or read off the meter. An observation can therefore be duplicated if the consumption is first estimated and later read off. To solve this, the maximum type of reading per METERING_POINT*READ_TIME is selected. Some smart meters are read every 15 minutes, which means that for a great number of meters there are four rows per hour. Not all smart meters are read every 15 minutes, so the summation over an hour needs to be comparable to meters that are actually read every hour. A mistake one could make is to group by trunc(READ_TIME, 'HH'), which is equivalent to grouping by hour when summing. This sum and group by will produce the following output:

METERING_POINT  AMOUNT  READ_TIME            TRUNC(READ_TIME, 'HH')
12345           1       01:03:2015 00:00:00  01:03:2015 00:00:00
12345           2       01:03:2015 00:15:00  01:03:2015 00:00:00
12345           1       01:03:2015 00:30:00  01:03:2015 00:00:00
12345           2       01:03:2015 00:45:00  01:03:2015 00:00:00
12345           1       01:03:2015 01:00:00  01:03:2015 01:00:00

The amount in row 1 is the amount consumed between 28:02:2015 23:45:00 and 01:03:2015 00:00:00; it therefore belongs to the sum for the hour between 23 and 00, but with a simple group by trunc(READ_TIME, 'HH') it will be included in the hour between 00 and 01. Likewise, the amount in row 5 is the amount between 01:03:2015 00:45:00 and 01:03:2015 01:00:00, but it will be included in the hourly sum for the hour between 1 and 2 and not between 0 and 1. To solve this we subtract 1 minute from READ_TIME, which ensures that we sum over the correct hour and produces the following table:

METERING_POINT  AMOUNT  READ_TIME            TRUNC(READ_TIME, 'HH')
12345           1       28:02:2015 23:59:00  28:02:2015 23:00:00
12345           2       01:03:2015 00:14:00  01:03:2015 00:00:00
12345           1       01:03:2015 00:29:00  01:03:2015 00:00:00
12345           2       01:03:2015 00:44:00  01:03:2015 00:00:00
12345           1       01:03:2015 00:59:00  01:03:2015 00:00:00

This ensures that summing over TRUNC(READ_TIME, 'HH') allocates the amount to the correct hour. The last step is selecting specific years from the consumption data set. For instance, selecting 2015 involves selecting the observations from the beginning to the end of the year, taking into account that the first observation of the year covers either the last 15 minutes of 2014 (if the meter is read every 15 minutes) or the last hour of 2014 (if the meter is read every hour). At the other end, one needs to include the first observation of the following year: for instance, when calculating the consumption of 2015, one needs to include the first reading in 2016.
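
The hour-allocation rule can be sketched in Python (a minimal illustration of the idea, not the production SQL):

```python
from datetime import datetime, timedelta

def hour_bucket(read_time: datetime) -> datetime:
    """Subtract one minute before truncating, so a reading stamped on the
    hour is attributed to the hour it actually covers (the preceding one)."""
    shifted = read_time - timedelta(minutes=1)
    return shifted.replace(minute=0, second=0, microsecond=0)

# A reading stamped 01-03-2015 00:00 covers 28-02-2015 23:45 to 24:00,
# so it belongs to the hour starting 28-02-2015 23:00
midnight = hour_bucket(datetime(2015, 3, 1, 0, 0))
quarter = hour_bucket(datetime(2015, 3, 1, 0, 15))  # hour starting 00:00
```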

Since the data sets are so large, it is advisable to produce smaller data sets, i.e. samples from the bigger data sets, that can be used for developing code. We created a random sample of METERING_POINT values from the population of hourly read smart meters. This sample is joined on the consumption data set, and in that way a much smaller data set is created on which code can be tested. SQL code for drawing a random sample from the population of smart meters (the distinct meters in the background data set, in our case) is suggested below:

create table D900002.Sample_3_pct_mpo_metering_point_id_meter_point_states_dec_2016 
compress basic 
as 
select D.Mpo_metering_point_id 
from (select distinct Mps.Mpo_metering_point_id 
      from D900002.ENERGINET_METERING_POINT_STATES_2016_12_14 Mps -- the background data set
     ) D 
order by dbms_random.value 
fetch first 3 percent rows only; 

Sample_3_pct_mpo_metering_point_id_meter_point_states_dec_2016 is then a random sample of distinct meters from the background data set (3% of the population).

Experiences are country specific; however, to the extent that hourly electricity consumption is reported in the same way, the considerations involving periods, grouping, summing and selecting the correct observations are likely to be similar. The treatment of the hourly consumption involves:

1: Selecting the max type of reading by METERING_POINT*READ_TIME.

2: Subtracting 1 minute from READ_TIME.

3: Summing consumption over the hour.

4: Selecting the observations that cover 2015, considering that the first observation of 2015 belongs to 2014 and the first observation of 2016 belongs to 2015.
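
Steps 1-3 can be sketched on a few toy rows (the column names mirror the report; the data and Python representation are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (metering_point, read_time, amount, type_of_reading)
rows = [
    ("12345", datetime(2015, 3, 1, 0, 15), 2, 1),  # first estimated ...
    ("12345", datetime(2015, 3, 1, 0, 15), 2, 3),  # ... later read off
    ("12345", datetime(2015, 3, 1, 0, 30), 1, 3),
    ("12345", datetime(2015, 3, 1, 0, 45), 1, 3),
    ("12345", datetime(2015, 3, 1, 1, 0), 1, 3),
]

# Step 1: keep the maximum type of reading per METERING_POINT*READ_TIME
best = {}
for mp, rt, amount, rtype in rows:
    key = (mp, rt)
    if key not in best or rtype > best[key][1]:
        best[key] = (amount, rtype)

# Steps 2-3: shift READ_TIME back one minute, truncate to the hour, and sum
hourly = defaultdict(float)
for (mp, rt), (amount, _) in best.items():
    bucket = (rt - timedelta(minutes=1)).replace(minute=0, second=0)
    hourly[(mp, bucket)] += amount
# the four deduplicated readings at 00:15-01:00 all land in the hour
# starting 00:00
```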

Periodical consumption, manual readings

The second part of the consumption data set contains manually read consumption data. It includes the meters that have not yet been converted to automatic readings every hour or every 15 minutes. These meters can be read once a year, once every quarter, once every month, or at less regular intervals. In addition, meters are read when people move in and out. In order to be able to produce periodical statistics of consumption by household composition or business sector, we developed a method that estimates the yearly consumption.

The first steps are similar to those taken for the hourly read data set. An observation can be duplicated if the type of reading varies between rows; the maximum value of the variable Type of reading is chosen by grouping over METERING_POINT and READ_TIME. The values of READ_TIME closest to the beginning and the end of the year are found, and the first sectioning of consumption is done accordingly. For instance, a meter can be read on 05/01/2015 and 20/12/2015, and a number of times in between, before and after. To find the consumption between the first day of the year and the first reading in 2015, we take the last observed READ_TIME in 2014 and calculate the average daily consumption between, for instance, 15/12/2014 and 05/01/2015. The same reasoning is applied to the end of 2015 to find the consumption between the last observed READ_TIME in 2015, for instance 20/12/2015, and the first observation in 2016.
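
The boundary estimation can be sketched as follows, assuming that the consumption between two readings is spread evenly over the days in between (dates and amounts are illustrative):

```python
from datetime import date

def consumption_in_year(read_from: date, read_to: date,
                        amount: float, year: int) -> float:
    """Share of a reading-period's consumption that falls inside `year`,
    assuming a constant daily consumption within the period."""
    days = (read_to - read_from).days
    per_day = amount / days
    # overlap of the reading period with the target year
    start = max(read_from, date(year, 1, 1))
    end = min(read_to, date(year + 1, 1, 1))
    overlap = max((end - start).days, 0)
    return per_day * overlap

# Reading period 15/12/2014 - 05/01/2015 with 42 kWh: 21 days at 2 kWh/day,
# of which 4 days (01/01 - 05/01) fall in 2015
inside_2015 = consumption_in_year(date(2014, 12, 15), date(2015, 1, 5),
                                  42.0, 2015)
```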

Calculating annual consumption

When a meter appears only in the hourly consumption data set or only in the manually read consumption data set, calculating annual amounts is easy. But a meter can also appear in both consumption data sets, and one needs to investigate why.

Possible scenarios 

1: Completely overlapping periods, i.e. a meter is found in the hourly read consumption data set and during the same period it also appears in the manually read data set. In this case the annual amounts should be more or less the same.

2: A meter is hourly read in one period and periodically read in the next, in which case the annual amount is simply the sum of the two.

3: Partly overlapping periods, in which case the decision is less easy. 

Possible error correction and error search 

In the Danish case, various quality indicators were computed that measure how complete a series of observations is. These are mostly computed for the hourly read consumption data set. 

1: Observations per month; a month with 31 days should have 31*24=744 observations

2: Observations per day

3: Days per year
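
Indicator 1 can be sketched as a completeness ratio (the observation count below is illustrative):

```python
import calendar

def month_completeness(year: int, month: int, n_observations: int) -> float:
    """Observed hourly readings divided by the expected number for the month."""
    days = calendar.monthrange(year, month)[1]
    expected = days * 24          # e.g. 31 * 24 = 744 for a 31-day month
    return n_observations / expected

ratio = month_completeness(2015, 3, 720)   # 720 of 744 readings present
```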

Identifying outliers

Outliers can be identified in ways similar to regular data editing and error search. One should be aware that computing time might be high. 

Background data set

Statistics Denmark receives one large history file with around 70 variables. The file can be split in two: File 1 contains consumer information such as street name, house number, post code, etc.; File 2 contains variables related to the subscription. The consumer information is first processed by the Geodata department, which constructs codes from the combination of STREETNAME, HOUSENUMBER, ROOMNUMBER, FLOORID and POSTCODE. Every METERINGPOINT thus gets an ADRESSE_ID, which can be a unit ADRESSE_ID or an entrance ADRESSE_ID. An entrance ADRESSE_ID can have multiple unit ADRESSE_IDs. Furthermore, the background data set contains a VALIDTODATE and a VALIDFROMDATE, so a metering point can be traced from when it was first registered. When anything changes, either in the consumer part or in the subscription part, a row is added to the background data set.

Validating the addresses

After the ADRESSE_IDs are added, it is necessary to check whether the combination of STREET_NAME, BUILDING_NUMBER, POST_CODE, CITY_NAME and ADRESSE_ID matches correctly. Below is an example:

METERING_POINT  STREET_NAME  BUILDING_NUMBER  POST_CODE  CITY_NAME   ADRESSE_ID  VALIDFROMDATE  VALIDTODATE
123             Apple road   2                2770       Townswille  123         01.01.2015     01.01.2016
123             Apple        2                2770       Townswille  NA          01.01.2016     01.01.2017

In this example, Apple road has been assigned an ID but Apple has not. With approximate string matching in R, one can create a logical variable indicating whether the street names within a group match. In our case we test, per METERING_POINT, whether each STREET_NAME approximately matches the first one. 

library(purrr)       # map_lgl
library(magrittr)    # %>%
library(stringdist)  # ain(): approximate matching within a maximum distance
library(data.table)

# TRUE if all street names in the group approximately match the first one
alleStreetEqual <- function(streets, dist){
    map_lgl(streets, ~ ain(streets[1], .x, maxDist = dist)) %>% all()
}

setDT(ADRESSEDATA)[, alleStreetEqual := alleStreetEqual(STREET_NAME, 3),
                   by = MPO_METERING_POINT_ID] 

Additional background variables

Since there are more than 70 variables in the background data set, many changes can occur, which means that a number of rows per METERINGPOINT are obsolete. To work around this, only the relevant variables are selected: all the variables in the consumer part of the background data set and a few variables from the subscription part. The rows are then collapsed if there are no changes in the remaining variables. Collapsing the background data set means cutting away some of the rows: for instance, rather than having multiple VALIDTODATE and VALIDFROMDATE values, one can find min(VALIDFROMDATE) and max(VALIDTODATE) and collapse the data into one row, provided that there are no changes in the remaining variables. Collapsing by min and max will, however, not take into account that there can be periods with no activity on the METERINGPOINT. We therefore use a collapsing code that takes into consideration that there might be holes between the last observed VALIDTODATE and the next observed VALIDFROMDATE, that is, the periods might not be sequential. An example of such code, collapsing rows in the consumer part of the background data set and distinguishing between the relevant variables and the date variables, is given in [Annex A].
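
The gap-aware collapse can be sketched as follows; this is a minimal illustration of the date logic only, assuming the rows for one metering point are sorted and the remaining variables are unchanged:

```python
from datetime import date, timedelta

def collapse(periods):
    """Merge consecutive (valid_from, valid_to) periods, but only when the
    next period follows the previous one directly; a hole starts a new row."""
    merged = [list(periods[0])]
    for start, end in periods[1:]:
        if start <= merged[-1][1] + timedelta(days=1):   # sequential: extend
            merged[-1][1] = max(merged[-1][1], end)
        else:                                            # hole: new row
            merged.append([start, end])
    return [tuple(p) for p in merged]

periods = [
    (date(2015, 1, 1), date(2015, 6, 30)),
    (date(2015, 7, 1), date(2015, 12, 31)),   # sequential with the row above
    (date(2016, 3, 1), date(2016, 12, 31)),   # hole: Jan-Feb 2016 missing
]
collapsed = collapse(periods)   # two rows survive instead of three
```

A plain min/max collapse would instead produce a single row covering Jan 2016 to Feb 2016 as well, hiding the inactive period.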

Outputs

Electricity consumption of businesses by economic activity

The Estonian case serves as an example. Due to the quality problems with address information, the contract owner is in some cases used as the end consumer when it is not possible to link any company with the address or the address information is missing. In order to classify business consumption by activity, the following steps are carried out.

For the metering points that have a contract with a business, the yearly consumption is calculated and a temporary table is created.


Figure. Creating consumption table

A temporary table of metering points which can be linked by addresses with business register is then created.


Figure. Creating address table

The two tables are joined, and depending on whether it was possible to link by address or not, the corresponding NACE code is used. If more than one company is connected to a metering point, the consumption is distributed equally among them. The following query calculates the electricity consumption of businesses by economic activity: it joins the two tables, then finds the electricity consumption of companies which are not linked by address, and finally finds the consumption of businesses which are linked by address.

-- JOIN consumption table with address table 
WITH all_data AS (SELECT md.year, 
                         md.metering_id, 
                         md.customer_id, 
                         md.code,  
                         md.nace,  
                         md.adr_id,  
                         md.cons,  
                         adr.adr_code,  
                         adr.adr_nace 
                  FROM business_consumption_yearly md 
                  LEFT JOIN business_address_yearly adr 
                  ON (md.year = adr.year AND md.metering_id = adr.metering_id)), 
-- select the nace of metering points which were not linked by address 
no_address AS (SELECT ad.*,  
                      1 AS cnt,  
                      ad.code AS fin_code,  
                      ad.nace AS fin_nace,  
                      ad.cons AS fin_cons 
               FROM all_data ad  
               WHERE ad.adr_nace IS NULL), 
-- metering points which were linked by address 
with_address AS  (SELECT ad.* 
                  FROM all_data ad 
                  LEFT JOIN no_address na  
                  ON (ad.year = na.year AND ad.metering_id = na.metering_id AND ad.customer_id = na.customer_id) 
                  WHERE na.metering_id IS NULL), 
-- share consumption between companies registered at the same address 
final AS (SELECT wa.*, wa.adr_code AS fin_code, wa.adr_nace AS fin_nace, wa.cons / wa.cnt AS fin_cons 
          FROM (SELECT l.*, COUNT(*) OVER (PARTITION BY l.year, l.metering_id, l.customer_id) AS cnt 
                FROM with_address l) wa 
          UNION ALL  
          SELECT * FROM no_address) 
SELECT * 
FROM final f  
ORDER BY f.year, f.metering_id, f.customer_id; 

The final step is to use the results of the query and sum the column fin_cons by year and NACE code to get consumption by economic activity.

Electricity consumption of households

To find the electricity consumption of households, the address field is used for linking. There are two tasks: to identify where a household lives and to determine how much electricity is consumed at that address. From the population register, people living at the same address can be identified as a household, and the data can be linked with electricity data by address or by the registry code of a person.

The steps to follow:

  • Form a household by grouping together people living at the same address
  • Link household data with electricity consumption data by address
  • Link electricity data with the register of dwellings
  • Aggregate data by using relevant features (e.g. household size)
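
The first two steps can be sketched as follows (names, addresses and amounts are illustrative):

```python
from collections import defaultdict

# population register extract: (person, address)
persons = [("Ann", "Main St 1"), ("Bob", "Main St 1"), ("Eve", "Oak Rd 2")]
# yearly electricity consumption measured at each address (kWh)
consumption_by_address = {"Main St 1": 3100.0, "Oak Rd 2": 1500.0}

# Step 1: group persons into households by address
households = defaultdict(list)
for name, address in persons:
    households[address].append(name)

# Step 2: link each household to the consumption at its address
linked = {addr: (len(members), consumption_by_address.get(addr))
          for addr, members in households.items()}
# "Main St 1" becomes a household of 2 persons consuming 3100.0 kWh
```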


Figure. Private consumption.

Identifying empty or seasonally empty dwellings

The Estonian case serves as an example. Information about vacant or seasonally vacant living spaces is relevant housing statistics, useful for the population and housing census. Identified empty or seasonally empty dwellings can serve as a new statistical product, as they provide information on where people actually live. By using zero or low consumption data, empty apartment houses can be identified, as shown in Report 3. For identifying empty or seasonally empty dwellings, two approaches can be used: one is to use electricity consumption data and identify zero or close-to-zero consumption in a certain period of time; the other is to apply classification methods to the electricity data.

Data

For the identification of empty dwellings, raw (hourly or less) or aggregated (monthly or yearly) electricity consumption data can be used. For linking, registers of population and buildings can be used. The linked data sets provide background information about the size of households and building characteristics.

Methods

For classification, machine learning techniques can be used. Machine learning is defined by Tom Mitchell as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." [4] (p. 14)

Machine learning techniques are classified as:

  • Supervised learning - when the training data contain labels (e.g., the corresponding household size is known for a metering point)
  • Unsupervised learning - when the training data do not contain labels (e.g., the corresponding household size is unknown for a metering point and must be identified from the data)
  • Semi-supervised learning - when the training data contain partially labeled data (a model is trained by using labelled data and later on used for classifying unlabeled data)

As there are high-quality dependent variables available which describe households, a supervised learning approach can be used [5].

In Report 3, a number of classification methods are listed and evaluated (code in R is available on GitHub):

  • Logistic regression
  • Boosted logistic regression
  • k nearest neighbor
  • Bayesian Generalized Linear Model
  • Support vector machine
  • Random forest

When k-means clustering is applied in PySpark, the following script can be used:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

n_clusters = 60
# assembling the consumption columns into a single feature vector (required by Spark ML)
vector_assembler = VectorAssembler(inputCols=data.columns[1:], outputCol="features")
cluster_data = vector_assembler.transform(data)
# applying k-means
kmeans = KMeans(k=n_clusters, seed=1)
cluster_model = kmeans.fit(cluster_data.select('features'))
# finding cluster centers
cluster_centers = cluster_model.clusterCenters()
# labeling the data
transformed = cluster_model.transform(cluster_data)

Identifying empty dwellings by using electricity consumption data

If yearly consumption data are available, empty dwellings can be identified by zero yearly consumption, or by consumption lower than a given threshold. Estonia's largest network operator has defined a yearly consumption of 250 kWh as such a threshold: lower consumption indicates that the dwelling is probably not occupied.

If monthly consumption data are available, seasonally empty dwellings can be identified if there is seasonality in the electricity consumption, for example higher consumption in the summer than in the winter.

If daily or more frequent consumption data are available, weekend living places can be identified by high consumption during weekends and low consumption during weekdays.
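
The three rules can be sketched as one classification function. The 250 kWh yearly threshold comes from the text; the factor of 2 used in the seasonal and weekend rules is an assumption for illustration only:

```python
def classify_dwelling(yearly_kwh, summer_kwh, winter_kwh,
                      weekend_kwh, weekday_kwh):
    """Rule-of-thumb occupancy label from aggregated consumption."""
    if yearly_kwh < 250:                  # threshold cited in the text
        return "empty"
    if summer_kwh > 2 * winter_kwh:       # assumed seasonality factor
        return "seasonally empty"
    if weekend_kwh > 2 * weekday_kwh:     # assumed weekend-home factor
        return "weekend home"
    return "occupied"

label = classify_dwelling(yearly_kwh=180, summer_kwh=150, winter_kwh=30,
                          weekend_kwh=20, weekday_kwh=10)  # below threshold
```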

Identifying empty dwellings by classification method

For classification, electricity consumption data can be used directly, or aggregates can be calculated.

In Report 3 a set of variables is defined:

  • mean of consumption,
  • standard deviation of the mean consumption,
  • mean of the ratio of total consumption and daytime consumption,
  • standard deviation of the ratio of total consumption and daytime consumption,
  • mean of the variance of the hourly readings,
  • standard deviation of the hourly readings,
  • mean of the standard deviation of the difference time series and
  • the ratio between the mean of consumption on week days and the mean consumption on weekend days.
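
A few of the features above can be sketched from a week of hourly readings; the data and the daytime definition (hours 08-20) are illustrative assumptions:

```python
import statistics

# one week of hourly readings: low at night, higher during the day
hourly = [0.2 if h < 8 or h >= 20 else 0.8
          for d in range(7) for h in range(24)]

daily_totals = [sum(hourly[d * 24:(d + 1) * 24]) for d in range(7)]
daytime = [sum(hourly[d * 24 + 8: d * 24 + 20]) for d in range(7)]

mean_consumption = statistics.mean(daily_totals)         # mean of consumption
sd_consumption = statistics.pstdev(daily_totals)         # its standard deviation
mean_total_daytime_ratio = statistics.mean(
    t / dt for t, dt in zip(daily_totals, daytime))      # total/daytime ratio
```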

Steps to conduct analysis:

  • extract electricity consumption data and labels if available
  • apply classification method
  • evaluate clustering results

Validating output results

Obviously, the available input is of great relevance. Some outputs are possible with only a subset of all smart meters (ideally a random sample of the whole population), while other outputs are only possible with full coverage of all (private and/or business) smart meters. This is especially true if you want to use smart meter data directly to improve (register-based) census estimates, e.g. the main residence. Most outputs can be calculated using hourly data, but such detailed data are not always necessary; for monthly statistics, for example, monthly aggregated data can be used.

For validation, survey data can be used. Where a regular survey of the energy consumption of businesses is conducted, the survey data can be used for validating electricity consumption statistics extracted from the smart meter data. Absolute and relative differences between survey data and smart meter data are useful validation measures. There might be significant differences due to methodological differences.
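
The two measures can be sketched as follows (numbers are illustrative):

```python
# survey-based and smart-meter-based estimates of the same total (GWh)
survey_gwh = 1250.0
smart_meter_gwh = 1190.0

absolute_diff = smart_meter_gwh - survey_gwh   # difference in GWh
relative_diff = absolute_diff / survey_gwh     # difference relative to survey
```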

For validating dwelling occupancy, survey data (e.g. the household budget survey) in which household size is clearly measured can be used.

Smart meters record the net flow of energy, and it is therefore difficult to identify the own consumption of households, or the consumption of big industrial companies which produce electricity. In the case of solar panels with known production capacity, the production can be estimated by comparing the production patterns of other solar panels. For other producers there is currently no valid method for validating production, other than a survey.

Metering points for which the actual consumer is not identifiable must be left out and the consumption at such metering points should be aggregated and reported as an error rate.

Other possible outputs

Consumption statistics

An obvious statistical output is the consumption of electricity, as described above. Identifying specific patterns of consumption might be of additional interest, e.g. to find socio-demographic factors that explain "energy-saving" and "energy-wasting" households. Business energy consumption could be related to business-cycle effects and could therefore be used as an auxiliary variable in estimating economic activity.

Price/Spending statistics

If price information is available in the smart meter data set, several additional statistical outputs are feasible. The consumer price index has a component for the price of electricity, which could be estimated from smart meter data. The household budget survey has a question about the household's spending on electricity, which could be dropped if the electricity costs could be estimated directly from smart meter data; this would reduce the response burden.

Provider statistics

Since the market liberalization, it has been possible throughout the EU to switch between different electricity providers. However, the actual number of switches could be limited to a very selective group in the population. It would be of interest to estimate the rate of switching in different subgroups of the household population.

If the information about the specific supplier of a smart metering point is part of the smart meter data set, and it is possible to link the metering points to either households or businesses, statistics about switching provider could be broken down by socio-demographic background variables or enterprise characteristics.

New buildings / construction sites

Since new construction sites need energy, and therefore might have their own metering points that temporarily use a lot of electricity, metering points newly appearing in the data set could be identified as new building sites. Such information could be used in (new) building statistics or in the process of updating a building register.

Household production

Smart meter data might come with information about production and the type of production (solar cells, windmills, water, etc.). If it is possible to link the data with population and housing registers, statistics on the type of production and on the characteristics of households producing their own energy could be produced.

IT architecture

Different information technology is required in the different steps of accessing and processing smart meter data. The infrastructure technology needed for each step in the process is briefly discussed.

Collect

Transferring data

To speed up the process and avoid dependency on specific staff, the files should be copied automatically to their final destination, and loading and the first processing of the data should run without human intervention. This considerably reduces the time spent on collection and processing. In an example from Estonia, exporting one year of data took 8 hours and copying the files took 58 hours.
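The unattended copy step could be sketched as follows; the folder layout, file pattern and function name are hypothetical, and a production version would add logging and checksum verification:

```python
import shutil
from pathlib import Path

def ingest_new_files(inbox, archive, pattern="*.csv"):
    """Copy any not-yet-archived delivery files to the processing destination,
    so the transfer runs without human intervention."""
    archive = Path(archive)
    archive.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(inbox).glob(pattern)):
        dst = archive / src.name
        if not dst.exists():            # skip files already transferred
            shutil.copy2(src, dst)      # preserves file timestamps
            copied.append(src.name)
    return copied
```

Run from a scheduler (e.g. cron) so each new delivery is picked up automatically.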

In Estonia and Denmark there are data hubs where the smart meter data are collected and managed by the main system operator. In some other countries, each service provider collects the data separately. However, statistical offices do not have direct access to the data hub.

Smart meter files can be very large, since electricity readings can be very frequent (one hour or less) and a file may comprise all smart meter readings during one year - about 160 GB in small countries. On the other hand, this information is sent to statistical offices at agreed intervals, for example once a year, and at agreed aggregation levels (hour, day, month, etc.). Subsequent separate files can be sent with updates or additional data (mainly background information).

Since the size of the data files used by Denmark and Estonia in the pilot exceeded the limits of the existing electronic file transfer channels, the files were transferred on physical devices such as an external hard disk or a memory stick. The transfer channels considered in the participating countries were secure FTP and X-Road (an XML-based secure network), but these were only used to receive updates and background information.

Denmark receives two sets of consumption data and one full background data set. Data are transferred via an FTP server that is set up with Energinet. On a quarterly basis, CSV files are provided. 

It is advisable to develop a channel that can regularly receive compressed data.

Data storage

Since smart meter data are well structured, central databases, distributed databases, NoSQL databases, and their hybrid solutions can be used. In the pilot, relational databases like ORACLE and NoSQL databases like Hadoop were used.

In-memory multi-core processing can accelerate the processing of data, and the choice of storage technology affects the amount of space needed. In our findings, changing from MySQL to Hadoop reduced the size of the stored data tenfold, from 1.8 TB to 180 GB. The space needed to store smart meter data can be a limitation: approximately 1,000 smart meters with hourly data for one year occupy about 335 MB (1,000*365*24 = 8,760,000 records). Extrapolating to 1 million smart meters and three years gives about 10 TB and 26,280,000,000 records.
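The record-count arithmetic behind these figures can be checked directly:

```python
# Reproducing the back-of-the-envelope record counts quoted above
HOURS_PER_YEAR = 365 * 24                        # 8,760 hourly readings per meter per year

records_small = 1_000 * HOURS_PER_YEAR           # 1,000 meters, one year
records_large = 1_000_000 * HOURS_PER_YEAR * 3   # 1 million meters, three years
```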

However, using relational databases like ORACLE can also be a good solution, since NSIs are accustomed to this technology. Partitioning and indexing are recommended.

All the experts involved in the different applications of smart meter data at an NSI should have access to data. Besides the smart meter data, it is necessary to ensure access to other sources for linking or other usage (business register, dwelling register, electricity consumption survey, etc.). 

Process

Once data are stored, they need processing: cleaning, coding, anonymization, imputation, aggregation, and linking. For these processes it is possible to use standard SQL, or R, Python or Hive functions. A variety of programs can be used to process the data, but some experience of SQL programming and database design is crucial.
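As an illustration of the imputation step, a minimal sketch that fills interior gaps in an hourly series by linear interpolation; the None-as-missing convention and the function are illustrative, and production systems would use more elaborate methods:

```python
def impute_linear(series):
    """Fill interior runs of None by linear interpolation between the
    surrounding observed values; leading/trailing gaps are left as None."""
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:   # find the end of the gap
                j += 1
            if i > 0 and j < len(out):               # interior gap: interpolate
                step = (out[j] - out[i - 1]) / (j - i + 1)
                for k in range(i, j):
                    out[k] = out[i - 1] + step * (k - i + 1)
            i = j
        else:
            i += 1
    return out
```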

Statistics Denmark uses two schemas in one database. One schema holds the raw and final data that have been through the steps described in the earlier section: linking the consumer part of the background data set to ADRESSE_IDs, processing the hourly consumption data set, and processing the periodically read data set. The final data sets can then be accessed at the research service department within the institute. The second schema holds all the helper data sets created along the way, for example data sets that count the number of observations per day, per month, or per year. To run the programs efficiently and quickly, parallel processing is used. An example of code is shown below; it sums the amount consumed per day.

-- Create an empty table with the target structure (the 1=2 predicate returns no rows)
create table DAGSFORBRUG_METER_VALUES_2016 
AS 
SELECT METERING_POINT, 
       sum(AMOUNT) TOTAL_DAG, 
       DAG, 
       MND 
FROM METER_VALUES_2016_HOURLY_VW where 1=2 
GROUP BY METERING_POINT, MND, DAG; 
    
-- Enable parallel DML for the session and set a parallel degree of 8 on the table
alter session enable parallel dml; 
alter TABLE D900002.DAGSFORBRUG_METER_VALUES_2016 parallel 8; 

-- Direct-path insert of the daily sums, reading the hourly view in parallel
insert /*+ append */ into D900002.DAGSFORBRUG_METER_VALUES_2016 
SELECT /*+ parallel(METER_VALUES_2016_HOURLY_VW,8) */ 
       METERING_POINT, 
       sum(AMOUNT) TOTAL_DAG, 
       DAG, 
       MND 
FROM D900002.METER_VALUES_2016_HOURLY_VW 
GROUP BY METERING_POINT, MND, DAG; 
commit; 
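For smaller extracts the same daily aggregation can be done outside the database; a minimal pure-Python sketch with illustrative records mirroring the METERING_POINT, AMOUNT, DAG and MND columns:

```python
from collections import defaultdict

# Hourly records: (metering_point, amount, day, month) - illustrative values
hourly = [
    ("MP1", 0.5, 1, 1), ("MP1", 0.7, 1, 1), ("MP1", 0.4, 2, 1),
    ("MP2", 1.1, 1, 1), ("MP2", 0.9, 1, 1),
]

# Equivalent of GROUP BY METERING_POINT, MND, DAG with SUM(AMOUNT)
daily = defaultdict(float)
for mp, amount, day, month in hourly:
    daily[(mp, month, day)] += amount
```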

SQL code executes more efficiently if SQL*Plus is used, because scripts can be scheduled to run at night and because some versions of SQL Developer have a row processing limit. TOAD is also a good option. If R is used for programming, Hadley Wickham's tidyverse is an excellent choice: the plyr, dplyr, tidyr and dbplyr packages were developed to manage data in a database-like fashion.

read16 <- read16 %>%
  arrange(METERING_POINT, READ_TIME) %>%
  group_by(METERING_POINT) %>%
  filter(READ_TIME == min(READ_TIME))

The above code sorts the data set read16 by METERING_POINT and READ_TIME, then groups by METERING_POINT and keeps the rows with the minimum value of READ_TIME. The data set read16 is read into the RAM of R, but if the data set is too large for the available RAM, you can exploit the disk space on the ORACLE server using the same dplyr language. To connect with the ORACLE server you need to download and install the ROracle package and use the following code:

# Map dbplyr's Oracle SQL translation onto ROracle connections
sql_translate_env.OraConnection <- dbplyr:::sql_translate_env.Oracle 
sql_select.OraConnection <- dbplyr:::sql_select.Oracle 
sql_subquery.OraConnection <- dbplyr:::sql_subquery.Oracle 

This produces so-called tbls, which can be thought of as virtual temporary tables: you can preview them in R, and they are materialized in R with the collect() statement at the end. For more information on using the dplyr language with computation on the ORACLE server, see https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
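The "earliest reading per metering point" selection from the dplyr snippet above translates directly to other environments; a pure-Python sketch with illustrative data (read times are ISO date strings, so lexical comparison matches chronological order):

```python
# Records: (metering_point, read_time, value) - illustrative data
read16 = [
    ("MP1", "2016-03-01", 120.0),
    ("MP1", "2016-01-15", 100.0),
    ("MP2", "2016-02-10", 80.0),
]

# Keep, for each metering point, the row with the earliest READ_TIME
first_read = {}
for mp, t, v in read16:
    if mp not in first_read or t < first_read[mp][0]:
        first_read[mp] = (t, v)
```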

Analyze

Statistical analytic tools, such as R, Minitab, SPSS, Matlab and Mathematica, can be used in smart meter data analytics. Another strategy is to use in-memory technologies such as Hadoop/Hive, MADlib or Python functions.

When R is used with ORACLE, the dbplyr package is a very good choice, since it reduces processing time drastically.

Disseminate

For visualization, clustering and machine learning, tools like Anaconda or PySpark functions can be used. The main tools used for analysis in the pilot were Hive SQL queries, with SAS Enterprise Guide for aggregated data.

R packages like ggplot2 (plotting), shiny (interactive data exploration) and cellWise (outlier detection) are also good options.

To produce maps and work with spatial data, ESRI mapping software and Geopandas were used.

References

  1. ESSnet Big Data, SGA-1 (2016). Work Package 3, Deliverable 3.1, Report on data access and data handling.
  2. ESSnet Big Data, SGA-1 (2017). Work Package 3, Deliverable 3.2, Report on production of statistics: methodology.
  3. ESSnet Big Data, SGA-2 (2018). Work Package 3, Deliverable 3.5, Report on production of statistics: future perspectives.
  4. Mitchell T. (1997). Machine Learning. Revised ed. McGraw-Hill.
  5. Carroll P. et al. (2018). Household classification using smart meter data. Journal of Official Statistics, 34(1), pp. 1–25. doi:10.1515/JOS-2018-0001

Annexes

Annex A

Code that collapses validity periods, taking into account that there might be holes between the last observed VALIDTODATE and the next observed VALIDFROMDATE.

CREATE TABLE ADRESSER_TIL_GEO_STAM_DEC_2017 compress basic AS 
SELECT * 
FROM (SELECT VALIDFROMDATE, 
             VALIDTODATE, 
             TRIM(MPO_METERING_POINT_ID) as MPO_METERING_POINT_ID, 
             TRIM(STREET_NAME) as STREET_NAME, 
             TRIM(REPLACE(BUILDING_NUMBER, ' ')) AS BUILDING_NUMBER, 
             TRIM(FLOOR_ID) as FLOOR_ID, 
             TRIM(ROOM_ID) as ROOM_ID, 
             TRIM(TO_NUMBER(POSTCODE)) AS POSTCODE, 
             TRIM(CITY_NAME) as CITY_NAME, 
             TRIM(REPLACE(CONSUMER_CVR, ' ')) as CONSUMER_CVR  
      FROM (SELECT coalesce(TO_DATE(VALID_TO_DATE, 'YYYY.MM.DD: HH24:MI:SS'),
                   SYSDATE) AS VALIDTODATE, 
                   TO_DATE(VALID_FROM_DATE,'YYYY.MM.DD: HH24:MI:SS') AS VALIDFROMDATE, 
                   MPO_METERING_POINT_ID, 
                   STREET_NAME, 
                   BUILDING_NUMBER, 
                   FLOOR_ID, 
                   ROOM_ID, 
                   POSTCODE, 
                   CITY_NAME, 
                   CONSUMER_CVR 
            FROM D900002.ENERGINET_METERING_POINT_STATES_2017_11_23 WHERE VALID_TO_DATE != '36958820'))  model 
      RETURN updated rows PARTITION BY (MPO_METERING_POINT_ID, 
                                        STREET_NAME, 
                                        BUILDING_NUMBER, 
                                        FLOOR_ID, 
                                        ROOM_ID, 
                                        POSTCODE, 
                                        CITY_NAME, 
                                        CONSUMER_CVR) 
      DIMENSION BY (row_number() over (PARTITION BY MPO_METERING_POINT_ID, 
                                                    STREET_NAME, 
                                                    BUILDING_NUMBER, 
                                                    FLOOR_ID, 
                                                    ROOM_ID, 
                                                    POSTCODE, 
                                                    CITY_NAME, 
                                                    CONSUMER_CVR 
      ORDER BY VALIDFROMDATE, VALIDTODATE) Rn) measures 
 (VALIDFROMDATE, VALIDTODATE, 1 AS I) rules iterate (999999999) until (VALIDFROMDATE[iteration_number + 2] IS NULL) 
 (VALIDTODATE [I[1]] = CASE
                          WHEN ( (VALIDFROMDATE[iteration_number + 2] <= VALIDTODATE[I[1]] + 1) 
                          OR (VALIDTODATE[I[1]] IS NULL)) 
                          THEN greatest (VALIDTODATE[I[1]], VALIDTODATE[iteration_number + 2]) 
                          ELSE VALIDTODATE[I[1]] 
                        END, I [1] = 
                        CASE 
                          WHEN ( (VALIDFROMDATE[iteration_number + 2] > VALIDTODATE[I[1]] + 1) 
                          AND (VALIDTODATE[I[1]] IS NOT NULL)) 
                          THEN iteration_number + 2 
                          ELSE I[1] 
                        END)
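As a cross-check of the MODEL-clause logic, the same collapsing rule (merge periods when the next VALIDFROMDATE is at most one day after the current VALIDTODATE, otherwise keep the hole) can be sketched in Python, with dates simplified to day numbers for illustration:

```python
def collapse(periods):
    """Merge (valid_from, valid_to) periods that overlap or are adjacent
    (next from-date <= previous to-date + 1); keep genuine holes separate."""
    merged = []
    for start, end in sorted(periods):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)   # extend the current run
        else:
            merged.append([start, end])               # a hole: start a new run
    return [tuple(p) for p in merged]
```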