Non-intrusive Load Disaggregation

Introduction

Motivation and goals

Climate change is one of the greatest challenges facing humanity, and machine learning can contribute to tackling it. In 2019, a group of machine learning experts published a paper called "Tackling Climate Change with Machine Learning" [1], focused on impactful uses of machine learning in reducing and responding to climate change.

One of the main domains among the paper's many proposals is "Buildings and cities" and, more specifically, how to "optimize buildings energy consumption". The paper states "while the energy consumed in buildings is responsible for a quarter of global energy-related emissions, a combination of easy-to-implement fixes and state-of-the-art strategies could reduce emissions for existing buildings by up to 90%". This statement motivated us to start this project: find an optimization model to control, and therefore optimize, energy consumption in buildings.

After extensive research, we decided to focus our study on Non-Intrusive Load Monitoring (NILM). NILM is the task of estimating the power demand of different appliances in a building given an aggregate power demand signal recorded by a single electric meter monitoring multiple appliances.

Neural NILM is a non-linear regression problem that consists of training a neural network for each appliance in order to predict a time window of the appliance load given the corresponding time window of aggregated data.

We adopted the paper "Non-Intrusive Load Monitoring with an Attention-based Deep Neural Network" [2], developed by University of Rome Tor Vergata researchers, as our Reference Paper. Other approaches to Neural NILM are presented in "Non-intrusive load disaggregation solutions for very low-rate smart meter data." [3] and "Sequence-to-point learning with neural networks for non-intrusive load monitoring" [4].

Dataset

As for the dataset, we selected the real-world "Reference Energy Disaggregation Data Set (REDD)" [5]. It is one of the datasets used in the Reference Paper and contains data for six different houses in the USA. The data is collected at a 1-second sampling period for the aggregate power consumption and a 3-second sampling period for the appliance power consumption. The appliances monitored are the following: oven, refrigerator, dishwasher, kitchen_outlets, microwave, bathroom_outlet, lighting, washer_dryer, electric_heater, stove, disposal, electronics, furance, smoke_alarms, air_conditioner. In our model, we consider three appliances: dishwasher (DW), microwave (MW), and refrigerator (FR). These are the same appliances used in the Reference Paper, so that our results can be compared with theirs.

Dataset split

The dataset is split using houses 2,3,4,5,6 to build the training set and house 1 as the test set.

The actual dataset of our model is a combination of two datasets.

We found a deep learning research team from Seoul National University that had already pre-processed and cleaned the REDD dataset (see the Preprocessing section below), matching the data used in the Reference Paper. This dataset is used in their paper "Subtask Gated Networks for Non-Intrusive Load Monitoring" [6].
On the other hand, the REDD dataset has a high imbalance between active and inactive windows, especially for the dishwasher and the microwave. As expected from how these appliances are used, they are idle most of the time, so inactive windows are heavily overrepresented. We implemented an oversampling process, described in the Preprocessing section below, to mitigate the problem.

System architecture

Preprocessing

The initial project implementation used the raw REDD dataset, so it was necessary to pre-process the data as described in "Subtask gated networks for non-intrusive load monitoring" [6]:

Data alignment. Align multiple time series with different acquisition frequencies.
Data imputation. Split the sequence so that the duration of missing values in each subsequence is less than 20 seconds. Then fill the missing values in each subsequence with a backward filling method.
Data filtering. Only use the subsequences that last more than one day.
Generate sliding windows. Use a sliding window over the aggregated signal with a hop size equal to 1 sample.
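The imputation, filtering and windowing steps above can be sketched as follows. This is a minimal illustration with NumPy, not the project's actual pre-processing code: the function names, the use of NaN to mark missing samples, and expressing the thresholds in samples (rather than seconds or days) are all assumptions. The data alignment step is not covered.

```python
import numpy as np

def split_on_long_gaps(values, max_gap):
    """Split a 1-D series (NaN = missing) wherever a run of NaNs reaches max_gap samples."""
    segments, start, i, n = [], 0, 0, len(values)
    while i < n:
        if np.isnan(values[i]):
            j = i
            while j < n and np.isnan(values[j]):
                j += 1
            if j - i >= max_gap:      # gap too long: close the current segment
                if i > start:
                    segments.append(values[start:i])
                start = j
            i = j
        else:
            i += 1
    if start < n:
        segments.append(values[start:])
    return segments

def backward_fill(segment):
    """Fill each NaN with the next valid value; trailing NaNs are dropped."""
    filled = np.array(segment, dtype=float)
    valid = np.flatnonzero(~np.isnan(filled))
    if valid.size == 0:
        return filled[:0]
    filled = filled[: valid[-1] + 1]
    for i in range(len(filled) - 2, -1, -1):
        if np.isnan(filled[i]):
            filled[i] = filled[i + 1]
    return filled

def sliding_windows(series, window_len):
    """All windows of window_len samples, with a hop size of 1 sample."""
    count = len(series) - window_len + 1
    if count <= 0:
        return np.empty((0, window_len))
    return np.stack([series[i:i + window_len] for i in range(count)])

def preprocess(values, max_gap, min_len, window_len):
    """Impute, drop short subsequences, and cut the rest into windows."""
    windows = []
    for seg in split_on_long_gaps(np.asarray(values, dtype=float), max_gap):
        seg = backward_fill(seg)
        if len(seg) > min_len:        # keep only long enough subsequences
            windows.append(sliding_windows(seg, window_len))
    return np.concatenate(windows) if windows else np.empty((0, window_len))

nan = float("nan")
demo = preprocess([1, nan, 3, nan, nan, nan, 5, 6, 7], max_gap=3, min_len=2, window_len=2)
```

With real data, max_gap would correspond to 20 seconds and min_len to one day, both converted to a number of samples at the chosen sampling period.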

Once the authors from Seoul National University provided us with the same dataset as the Reference Paper, we disabled our own data pre-processing. The main reason was to ensure the same input data as the original paper, in order to obtain the same, or similar, results.

Oversampling is used to address the overrepresentation of inactive windows and the resulting active/inactive window imbalance (described in the Dataset section). The process consists of replicating randomly picked active windows for each of the appliances to obtain a 50% - 50% class balance. The ratio between active and inactive windows is configurable in settings.
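A minimal sketch of this kind of oversampling, using only the standard library. The function name, the boolean is_active labels and the active_ratio parameter are hypothetical; how a window is labeled active is not specified here and is left to the caller.

```python
import random

def oversample_active(windows, is_active, active_ratio=0.5, seed=0):
    """Replicate randomly picked active windows until they make up active_ratio of the set."""
    if not 0 < active_ratio < 1:
        raise ValueError("active_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    active = [w for w, a in zip(windows, is_active) if a]
    inactive = [w for w, a in zip(windows, is_active) if not a]
    if not active:
        # Nothing to replicate: return the data unchanged.
        return list(windows), list(is_active)
    # Active count needed so that active / (active + inactive) == active_ratio.
    target_active = round(active_ratio * len(inactive) / (1 - active_ratio))
    extra = max(0, target_active - len(active))
    replicated = [rng.choice(active) for _ in range(extra)]
    out_windows = inactive + active + replicated
    out_labels = [False] * len(inactive) + [True] * (len(active) + extra)
    return out_windows, out_labels

# Toy example: 2 active windows out of 10 become 8 active out of 16.
toy_windows = list(range(10))
toy_labels = [True, True] + [False] * 8
balanced_windows, balanced_labels = oversample_active(toy_windows, toy_labels)
```

Because the extra windows are copies of existing ones, this balances the classes without adding new appliance patterns.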

After implementing oversampling the number of windows used for train, eval and test are listed below:

Appliance	Nº buildings train	Nº windows train	Nº windows eval	Nº buildings test
dishwasher	5	289163	123927	1
fridge	4	613167	262787	1
microwave	3	82922	35538	1

Model architectures

We've implemented three different model architectures:

Regression and classification enabled
Only regression enabled.
Regression and classification using the attention results.

Regression and classification enabled

The architecture adopted to solve the NILM problem is based on a classical end-to-end regression network with encoder-decoder components, with an attention mechanism added between the encoder and the decoder. Apart from the main end-to-end regression network, an auxiliary end-to-end classification subnetwork is attached.

Why an attention-based model? The attention-based model helps with the energy disaggregation task. It assigns importance, through weights, to every position in the aggregated signal; after successful training, high weights will correspond to state changes of the target appliance. The addition of an attention mechanism in the regression subnetwork allows the model to focus on selected time steps or windows rather than on non-target appliances. The attention scores weigh the importance of every position in the input sequence when inferring the disaggregated signal. To represent these weights correctly, we made the output of the attention layer a 1D vector with the length of a window sequence.
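To make the idea of one weight per window position concrete, here is a minimal NumPy sketch of a generic additive attention layer. It is an illustration only: the parameter names (W, b, v), the dimensions and the random values are assumptions, and it is not claimed to match the exact layer of the Reference Paper or of this project.

```python
import numpy as np

def additive_attention(hidden, W, b, v):
    """hidden: (L, d) encoder states. Returns weights of shape (L,) and a context of shape (d,)."""
    scores = np.tanh(hidden @ W + b) @ v      # one score per window position
    scores = scores - scores.max()            # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ hidden                # weighted sum of encoder states
    return weights, context

rng = np.random.default_rng(0)
window_len, d, d_att = 8, 4, 3                # hypothetical sizes
hidden = rng.normal(size=(window_len, d))
W = rng.normal(size=(d, d_att))
b = np.zeros(d_att)
v = rng.normal(size=d_att)
weights, context = additive_attention(hidden, W, b, v)
```

The weights vector has the length of the window and sums to 1, so it can be plotted next to the aggregated signal to see which positions the model considers important.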

Both subnetworks have a different objective:

Regression end-to-end network: allows the subnetwork to “implicitly detect and assign more importance to some events (e.g. turning on or off of the appliance) and to specific signal sections”.
Classification end-to-end network: helps the disaggregation process by enforcing explicitly the on/off states of the appliances.

The outputs of both subnetworks are combined at the end to produce the disaggregated consumption of the appliance.

Only regression enabled

This architecture removes the classification subnetwork, which does not have an attention layer, from the model. The regression branch is kept as in the original network.

Regression and classification using the attention results

In this final model modification, the output of the attention layer is used to compute the result of the regression subnetwork (in all the models). In this architecture, we concatenate the output of the regression subnetwork with the output of the stack of convolutional layers, in the classification subnetwork. This concatenated vector is fed to the 2 fully connected layers on top of the classification branch. The expectations of this architecture's behavior are described in the Experiment 7 hypothesis.

Train

Methodology. Model training is done using the whole pre-processed train dataset in batches of size 64 via a data loader. At first, we set the number of epochs to 10, which in most cases we found enough for an initial analysis of model response and performance. The common do_load -> do_predict -> calculate_loss -> update_optimizer train sequence is run for each train batch in each epoch. The common do_load -> do_predict -> calculate_loss validation sequence is run for each validation batch in each epoch.
Loss function. An aggregated loss function is used for the joint optimization of both the regression and classification subnetworks: L = Lout + Lclas, where Lout is the Mean Squared Error (MSE) between the overall output of the network and the ground truth of a single appliance, and Lclas is the Binary Cross-Entropy (BCE) that measures the classification error of the on/off state for the classification subnetwork.
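The joint loss can be written out in a few lines of plain Python. This is a sketch of the formula L = Lout + Lclas only, with hypothetical function and argument names; an actual training loop would use the equivalent loss functions of the deep learning framework so that gradients can be computed.

```python
import math

def mse(pred, target):
    """Mean Squared Error between predicted and true appliance power."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce(prob, label, eps=1e-7):
    """Binary Cross-Entropy between predicted on-probabilities and true on/off states."""
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(prob)

def joint_loss(output, target_power, on_prob, on_state):
    """L = Lout + Lclas, with both terms weighted equally."""
    return mse(output, target_power) + bce(on_prob, on_state)

# Perfect regression output and an undecided classifier: the loss is log(2).
example_loss = joint_loss([1.0, 2.0], [1.0, 2.0], [0.5, 0.5], [1, 0])
```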

Test

Methodology. Model testing is done over the whole preprocessed test dataset using batches of size 64 via a data loader. The common do_load -> do_predict -> calculate_error test sequence is done per each of the test batches.
Error metrics. MAE (Mean Absolute Error) is used to evaluate the performance of the neural network. MAE is calculated after applying the prediction postprocessing described in the Postprocessing section. This is the metric used in the Reference Paper, and it is used as the benchmarking criterion between the different experiments described below.

Postprocessing

The disaggregation phase is carried out with a sliding window over the aggregated signal with a hop size equal to 1 sample. That's the reason why the model generates overlapped windows of the disaggregated signal. We reconstruct the overlapped windows employing a median filter on the overlapped portion.
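A minimal NumPy sketch of this reconstruction, followed by the MAE computation applied to the reconstructed signal. The function names are hypothetical and the sketch assumes a hop size of exactly 1 sample, so that window i covers samples i to i + window_len - 1.

```python
import numpy as np

def reconstruct_median(windows):
    """windows: (n_windows, window_len) predictions with hop 1.

    Returns a series of length n_windows + window_len - 1, where each sample is
    the median of all window predictions that cover it.
    """
    windows = np.asarray(windows, dtype=float)
    n, w = windows.shape
    total = n + w - 1
    series = np.empty(total)
    for t in range(total):
        first = max(0, t - w + 1)     # first window that covers sample t
        last = min(n - 1, t)          # last window that covers sample t
        # Sample t sits at offset t - i inside window i.
        candidates = [windows[i, t - i] for i in range(first, last + 1)]
        series[t] = np.median(candidates)
    return series

def mae(pred, target):
    """Mean Absolute Error between two series of equal length."""
    return float(np.mean(np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))))

# Three overlapping windows; the outlier 9 in the last window is filtered out.
overlapped = [[1, 2, 3], [2, 3, 4], [9, 4, 5]]
reconstructed = reconstruct_median(overlapped)
```

The median makes the reconstruction robust to a single window that disagrees with its neighbors, as with the value 9 in the example.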

Experiments

The main goals of the experiments are:

Learn how to implement and deploy a DL system on a commercial cloud computing platform
Understand and interpret the current NILM neural network described in the paper
- Understand the task of the regression branch
- Understand the task of the classification branch
- Understand the task of the attention mechanism

We proposed the three main architecture modifications evaluated in the experiments while analyzing the Reference Paper. The experiments were designed up front, not sequentially based on the results of the previous experiment.

Main architecture modifications:

Paper architecture - Regression and classification enabled
Paper modification 1 - Only regression enabled
Paper modification 2 - Regression and classification using the attention results

We initially explored the data to get a first picture of the type and amount of data available. We found a high active/inactive window imbalance for the dishwasher and the microwave (as explained in the Dataset section). There are enough windows in total to train the model, but not enough active windows to prevent a biased model. Without oversampling, the model would mainly predict null demand: this would be correct in inactive windows, but it would fail to predict non-null demand in active windows. Although disaggregation is a regression problem, this is similar to high specificity and low sensitivity in an active/inactive appliance classification problem.
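The analogy with specificity and sensitivity can be illustrated with a toy example. The numbers below (5 active windows out of 100) are made up for illustration and are not taken from the dataset; the function name is hypothetical.

```python
def sensitivity_specificity(true_active, pred_active):
    """Return (sensitivity, specificity) for boolean active/inactive labels."""
    pairs = list(zip(true_active, pred_active))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# A biased model that always predicts "inactive" on an imbalanced set.
true_active = [True] * 5 + [False] * 95
always_inactive = [False] * 100
sens, spec = sensitivity_specificity(true_active, always_inactive)
accuracy = sum(t == p for t, p in zip(true_active, always_inactive)) / len(true_active)
```

Such a model is right 95% of the time and has perfect specificity, yet it never detects an active window, which is the behavior oversampling is meant to prevent.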

Neural network response charts

We generate charts with time series describing the response of the neural network in train, eval, and test. These charts are used to visualize and interpret the response of both whole and specific parts of the network. The main parts of interest are regression, classification, and attention. In most of the charts, the available time series are:

Building consumption. Aggregated consumption of the building. Used as input of the neural network
Predicted appliance consumption. Disaggregated appliance consumption predicted by the neural network
Real appliance consumption. Real appliance consumption obtained from the meter
Classification branch output. Prediction of the classification branch
Regression branch output. Prediction of the regression branch
Attention score. Describes the zone of interest for attention to improve regression

All the consumption time series are referenced to the left Y-axis. Classification and attention are referenced to the right Y-axis. In both cases, some prediction results are rescaled to make all of them fit in a single chart (e.g. the classification prediction is scaled to nearly the maximum consumption). In the report, there are two train and two test sample charts for each experiment and appliance to visualize the response and support the conclusions.

Interpretation of the charts focuses on:

Performance. Comparing real vs predicted series it's possible to identify the performance of the model
Characterization of the error. Comparing real vs predicted series it's possible to identify error specific patterns (peaks, plateaus, etc)
Correlation of the error with aggregated demand. Comparing error vs aggregated building consumption it's possible to identify the response of the model to crowded scenarios (multiple appliances) and single scenario (single appliance). It's also possible to identify the response of the model with different kinds of appliances, with different consumption patterns, running simultaneously.
Contribution of each of the branches. Analyzing the output of the branches, it's possible to identify the contribution of each branch to the prediction. It's also possible to identify the objective of each branch and its performance
Focus of attention. Analyzing the attention output, it's possible to identify which parts of the window are important to the regression output. The attention can be used to:
- Identify whether the important parts vary between scenarios. For example, there may be a scenario in which several appliances are ON, and another in which a single, high-consumption appliance is ON. The attention helps differentiate these two situations.
- Identify whether there are specific important parts or the importance is homogeneous along the window
- Identify whether the important parts are found in the appliance pattern itself or in its neighborhood.
- Identify characteristics of the important parts, such as peaks, plateaus, etc.

Paper architecture - Regression and classification enabled

Experiment 1. Paper

Hypothesis

The regression subnetwork infers the power consumption, whereas the classification subnetwork focuses on the binary classification of the appliance state (on/off). The attention mechanism improves the representational power of the network to identify the positions in the aggregated input sequence with useful information to identify appliance-specific patterns.

Additional group hypothesis:

Specific appliance patterns are described by state changes and state duration which are related to the operating regime of the internal electricity consumption components. The operating regime of the internal components depends on multiple factors:

Appliance operating mode.
- User-selected modes of operation. There are appliances with a small number of user modes (fridge, dishwasher) and appliances with a medium number of user modes (microwave). The higher the number of user modes, the higher the number of different patterns the neural network has to describe.
- Cycle duration. There are appliances with short cycles describing the pattern per operating mode, such as the fridge and the microwave, and appliances with long cycles, such as the dishwasher. The longer the cycle, the more difficult it is to describe the behavior of the pattern, as the input sequence windows are longer.
Environmental factors (temperature, etc). There are appliances that depend on external variables such as environmental factors. In this specific model, the fridge has a high dependency on temperature, while the microwave and dishwasher have a lower dependency. Weather dependency adds stochasticity to the system and, consequently, complexity to the model.
Internal components demand. The main electricity consuming components are:
- Heating/cooling. Load demand depends on the weather, which adds stochasticity to the system and hence complexity to the model.
- Motors. Load demand is mainly related to the user mode and to the component internal operating regime.

Experiment setup

See details of the experiments below. Each of the columns describes a specific option of the previously introduced network architectures and pre/post-processing methods:

Appliance	Regression	Classification	Standardization	Recalculate mean/std in test
dishwasher	TRUE	TRUE	FALSE	FALSE
fridge	TRUE	TRUE	FALSE	FALSE
microwave	TRUE	TRUE	FALSE	FALSE

Results

See attached train vs loss curve to diagnose performance:

See attached train and test samples for each of the appliances to interpret and evaluate disaggregation:

See obtained error (previously introduced in the Error metrics section) and extra training information:

Appliance	MAE	Nº Epochs	Nº Hours Train
dishwasher	28.25	4	15
fridge	26.75	4	25
microwave	31.47	4	1.23

Conclusions

As described in the hypothesis, the regression branch mainly predicts the maximum expected demand of the appliance. As expected, the classification branch modulates the regression results to match the appliance load pattern. Classification has high specificity and low sensitivity.

In both cases, results are good in train and eval but less accurate in test. Our hypothesis is that the model does not generalize well due to the small number and low variance of appliance patterns across the train buildings.

See samples of dishwasher consumption per building:

The classification network is in charge of modeling the patterns. As seen in the results, it is less accurate in the steady-state sections than expected. Hence, the instability and, in some cases, the highly sensitive response are also related to the overrepresentation issue.

In most cases, increasing the number of acquisition samples would not be a good solution to fix the instability issue as there would be more active windows but the same pattern. That's the case of appliances with components that do not depend on environmental factors (temperatures, etc) like microwave or dishwasher. In the case of appliances with environmental factors, it would help to have also samples from different seasons. We implemented oversampling but it's similar to increasing the number of samples from the same appliance rather than new ones.

No data other than the public dataset is available. Data augmentation cannot be easily implemented as a solution due to the lack of a database of appliance loads. In this case, it makes no sense to create synthetic aggregated scenarios mixing appliances from different buildings because they're already mixed in the training dataset and properly predicted in eval. In the classification branch, we hypothesize that in some cases adding noise would help to reduce highly sensitive responses.

For appliances with a high simultaneity factor(*), attention focuses mainly on state changes in the appliance, such as switch on/switch off or the activation of high-consumption components. It also focuses on state duration. That would be the case of the dishwasher or the microwave. For appliances with a low simultaneity factor, attention also focuses on sections of the window outside the active section. That would be the case of the fridge. Our hypothesis is that with a high simultaneity factor attention focuses on the appliance pattern, and with a low simultaneity factor it additionally focuses on the neighborhood. Attention would perform better at identifying highly specialized and specific features in a consumption window.

(*) The simultaneity factor describes the probability of an appliance being active while other appliances are active. A large simultaneity factor means that the appliance is usually active while others are also active.

Regarding the hypothesis on type of appliances:

The neural network can model the different operating modes in the appliances, even the ones with a high number of operating modes
The neural network can model both heating/cooling and motor components
There's no specific conclusion about the capacity to model weather dependency as both train and test datasets were acquired under similar environments (season, etc)

Experiment 2 and 3. Paper with standardization

Standardization can be used to rescale the testing samples to better describe relative patterns rather than absolute value consumptions. Standardization transforms features such that their mean (μ) equals 0 and standard deviation (σ) equals 1. The range of the new min and max values is determined by the standard deviation of the initial un-normalized feature.

Standardization is achieved by Z-score normalization. The Z-score is given by:

z = (x - μ) / σ

The standardization process is done over the specific dataset in each specific experiment.

Although the appliance models in train and test differ in terms of absolute consumption, relative step changes in standardized data can be similar. This is an approach to bypass overrepresentation in data. In this case, the mean and standard deviation used in training are calculated over the train dataset, and the mean and standard deviation used in test are calculated over the test dataset.
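The following sketch illustrates the idea with made-up numbers: two appliances with the same relative load pattern but different absolute consumption produce the same standardized series when each one is standardized with its own mean and standard deviation. The function names and values are hypothetical, and the population standard deviation is an assumption.

```python
import statistics

def zscore_params(values):
    """Mean and (population) standard deviation of a series."""
    return statistics.mean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every sample."""
    return [(v - mean) / std for v in values]

# Same relative steps, different absolute consumption (illustrative values).
train_power = [0.0, 100.0, 200.0]
test_power = [0.0, 50.0, 100.0]

train_z = standardize(train_power, *zscore_params(train_power))
test_z = standardize(test_power, *zscore_params(test_power))

# Reusing the train statistics on the test data instead keeps the scale mismatch.
test_z_train_stats = standardize(test_power, *zscore_params(train_power))
```

Here train_z and test_z coincide, while test_z_train_stats does not, which is the difference between recalculating the statistics in test and reusing the ones from training.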

Experiment 2. Paper with standardization - Using calculated standardization in test