
Accepted for publication in IEEE SmartGridComm 2018

Generative Adversarial Network for Synthetic Time Series Data Generation in Smart Grids

Chi Zhang∗, Sanmukh R. Kuppannagari†, Rajgopal Kannan† and Viktor K. Prasanna†
∗Computer Science Department, University of Southern California, Los Angeles, CA
†Electrical Engineering Department, University of Southern California, Los Angeles, CA
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—The availability of fine grained time series data is a pre-requisite for research in smart grids. While data for transmission systems is relatively easily obtainable, issues related to data collection, security and privacy hinder the widespread public availability/accessibility of such datasets at the distribution system level. This has prevented the larger research community from effectively applying sophisticated machine learning algorithms to significantly improve the distribution-level accuracy of predictions and increase the efficiency of grid operations.

Synthetic dataset generation has proven to be a promising solution for addressing data availability issues in various domains such as computer vision, natural language processing and medicine. However, its exploration in the smart grid context remains unsatisfactory. Previous works have tried to generate synthetic datasets by modeling the underlying system dynamics: an approach which is difficult, time consuming, error prone and oftentimes infeasible in many problems. In this work, we propose a novel data-driven approach to synthetic dataset generation by utilizing deep generative adversarial networks (GAN) to learn the conditional probability distribution of essential features in the real dataset and generate samples based on the learned distribution. To evaluate our synthetically generated dataset, we measure the maximum mean discrepancy (MMD) between the real and synthetic datasets as probability distributions, and show that their sampling distance converges. To further validate our synthetic dataset, we perform common smart grid tasks such as k-means clustering and short-term prediction on both datasets. Experimental results show the efficacy of our synthetic dataset approach: the real and synthetic datasets are indistinguishable by solely examining the output of these tasks.

I. MOTIVATION

The lack of fine grained distribution system data is a significant bottleneck preventing the community from developing novel data science and machine learning solutions for smart grid applications such as load forecasting [1], dynamic demand response [2], behind-the-meter disaggregation [3] and so on. Although efforts exist to make public datasets available [4][5], they are limited due to the following reasons:
• Availability of Data: ISOs and RTOs regularly publish data regarding transmission level grid operations online for public use [6][7]. However, no such framework exists for distribution systems. Hence, the only readily available distribution system level datasets are through the efforts of various researchers [8], which are limited in scope.
• Scale of Data: Several machine learning based models require vast amounts of data for training. The limited scope and size of the available public datasets prevents the application of sophisticated models to obtain highly accurate results.
• Privacy of Data: Distribution system data obtained from AMI meters contains Personally Identifiable Information (PII), and sophisticated algorithms are required to anonymize the data as per the regulations [9]. This further prevents the large scale availability of such datasets.

An intuitive way to approach these problems is to generate synthetic datasets that enable researchers to develop novel data-driven models, while maintaining real dataset privacy. Classic approaches involve modeling the underlying causes of the observed dataset and generating model-based synthetic data. In [10], the authors propose a multi-segment Markov chain model of solar states and generate synthetic states using this model. In [11], the author proposes to train autoregressive models and use theta-join for generating smart meter data. However, their approach requires hand-crafted features such as fluctuation flattening and time series deseasonalizing. Accurate modeling of the underlying causes is a daunting task. It requires us to make several assumptions (for example, the Markovian property) which are not necessarily true, thus affecting the reliability of the synthetically generated data.

Potential applications in smart grids that can benefit from large scale synthetic data include behind-the-meter solar disaggregation [3], real-time smart grid system simulation [12], etc. In the behind-the-meter solar disaggregation problem, large scale datasets are required to train and validate machine learning models, yet many of them are not available due to privacy issues. Real-time smart grid system simulation also requires large scale datasets that reflect certain system behavior, which is often limited due to the lack of fine-grained meters.

In this work, we develop a novel data-driven approach for generating synthetic smart grid data by directly 'learning' the probability distribution of the real time-series data using a deep Generative Adversarial Network (GAN) model. While GANs have been used to effectively synthesize cutting-edge "fake" images and audio [13], they have not hitherto been used for smart grid data due to various underlying challenges in distinguishing data patterns (seasonality, short/long term, customer behavior, prosumer etc.). Our work is based on the following insight: we observe that smart grid time series data can be separated into two distinct statistical components, Level and Pattern, where Level determines high-level statistical attributes such as mean, scale and variance, while Pattern determines the real trend. By normalizing the Level of different users, the Pattern of long-term periodic time series data can be modeled as a conditional probability distribution conditioned on the actual date. We show that this probability distribution can be easily "learned" using a GAN. The main contributions of this paper are as follows:
• We first develop a probabilistic model to abstract significant characteristics inherent in smart grid time series datasets.
• We then develop a conditional GAN to learn the probability distribution of the real dataset in order to generate synthetic datasets which are indistinguishable under statistical tests. To the best of our knowledge, this is the first effort that uses deep GANs in the smart grid domain.
• We evaluate the effectiveness of the generated synthetic datasets by performing both statistical tests as well as classic machine learning tasks, including time series clustering and load prediction, and showing that the results are indistinguishable from the real dataset.

II. BACKGROUND

A. Target of Smart Grid Dataset Definition

The immense heterogeneity in smart grid data makes it highly unlikely that a single model could be used for synthesizing the datasets. Hence, any discussion on synthetic dataset generation is incomplete without a precise definition of the targeted datasets. The datasets that we target in this work can be broadly defined as "timeseries data conditioned on smart grid". More precisely, we focus on the datasets which can be modeled as a timeseries. Moreover, the underlying processes generating the dataset should be defined or affected by the smart grid under consideration. This implies that data generated by natural processes such as temperature, solar irradiance etc. is not our focus. Likewise, event based data [14] such as on/off times of appliances, plug-in/plug-out times of EVs etc. is also not a focus.

The above definition allows us to model (trivially) a wide range of datasets such as uncontrolled load: affected by customer behavior patterns which are conditional on economic features etc.; PV generation: affected by solar irradiance assuming no forced curtailment [15] and conditional on PV module number, size, efficiency; aggregate generation in the grid: affected by smart grid demand and thus conditional on the smart grid in consideration; electricity prices: affected by market conditions and thus conditional on the smart grid in consideration; etc. We can also model controlled load/generation (e.g. due to Demand Response [16], load/solar curtailment etc. [15]) by adding additional conditional variables to denote the control. For example, if a building has two different consumption profiles, one under normal conditions and one under DR, then an additional binary conditional variable can be used to denote whether the building is in DR or not.

B. Generative Adversarial Network

GAN [13] is a deep generative model that can implicitly capture any differentiable probability distribution and provides a way to draw samples from it. Assuming some prior distribution z ∼ p_z(z), we would like to learn a generator function G such that G(z) ∼ p_data(x). To achieve this goal, we introduce a discriminator function D and let D and G play the following two-player minimax game with value function V(G, D):

    min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (1)

An intuitive explanation of the objective function is that the generator is trying to produce fake samples while the discriminator is trying to detect the counterfeits. The competition in this game drives both the generator and the discriminator to improve until the fake samples are indistinguishable from the real data.

In practice, the functions G and D are often approximated using deep neural networks. It is proven in [13] that, given a fixed discriminator D, minimizing the value function in Eq. 1 with respect to the generator parameters is equivalent to minimizing the Jensen-Shannon divergence between p_data(x) and the distribution of G(z). In other words, as training progresses, the implicit distribution that G(z) captures converges to p_data(x). In this work, we use a variant GAN architecture known as Conditional GAN [17] to learn the conditional probability distribution of time series data.

C. Evaluating Synthetic Datasets

As it is not possible to mathematically prove that the real samples and the synthetic samples come from the identical distribution, we perform statistical tests and use classic machine learning algorithms to empirically show:
1) Real time series and synthetic time series share key statistical properties.
2) Real time series and synthetic time series cannot be distinguished by the outcomes of these machine learning algorithms.
Ultimately, the purpose of the synthetic data is to serve as supplemental training data for machine learning solutions or as substitute data to preserve the privacy of the original dataset.

1) Statistical Tests: Maximum Mean Discrepancy (MMD) [18] measures the distance between two probability distributions by drawing samples. Given samples {x_i}_{i=1}^{N} ∼ p(x) and {y_j}_{j=1}^{M} ∼ q(y), an estimate of MMD is:

    MMD = [ (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} K(x_i, x_j) − (2/(MN)) Σ_{i=1}^{N} Σ_{j=1}^{M} K(x_i, y_j) + (1/M²) Σ_{i=1}^{M} Σ_{j=1}^{M} K(y_i, y_j) ]^{1/2}    (2)

where K(x, y) = exp(−‖x − y‖²/(2σ²)) is known as the radial basis function (RBF) kernel.
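The MMD estimate in Eq. (2) is straightforward to compute directly; a minimal numpy sketch (the sample shapes and the kernel bandwidth σ below are illustrative choices, not values from the paper):

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """RBF kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd(X, Y, sigma=1.0):
    """Biased sample estimate of Maximum Mean Discrepancy, as in Eq. (2)."""
    N, M = len(X), len(Y)
    val = (rbf_gram(X, X, sigma).sum() / N ** 2
           - 2.0 * rbf_gram(X, Y, sigma).sum() / (M * N)
           + rbf_gram(Y, Y, sigma).sum() / M ** 2)
    return np.sqrt(max(val, 0.0))  # clip tiny negative rounding error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
assert mmd(X, X) < 1e-6          # identical samples: distance is ~0
assert mmd(X, X + 3.0) > 0.1     # a shifted distribution is clearly separated
```

As the two assertions illustrate, the estimate approaches zero only when the two sample sets come from matching distributions, which is how it is used to track generator quality during training.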
Fig. 1: Data Processing Flow. (Real Time Series → Level Normalization [Preprocessing] → Learn Conditional Probability Distribution with a Generative Adversarial Network [Learning] → Sample and Level Recovery [Postprocessing] → Synthetic Time Series.)
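The preprocessing and postprocessing boxes in Fig. 1 amount to a per-day standardization and its inverse (Eqs. (5) and (6) in Section III-B); a minimal numpy sketch, assuming each row of the array is one day of samples with nonzero variance:

```python
import numpy as np

def level_normalize(X):
    """Eq. (5): strip each day's Level (mean and std) so only the
    Pattern remains. X has shape (days, samples_per_day)."""
    mu = X.mean(axis=1, keepdims=True)      # b_{u,t}: daily average
    sigma = X.std(axis=1, keepdims=True)    # a_{u,t}: daily std, assumed > 0
    return (X - mu) / sigma, mu, sigma

def level_recover(Xn, mu, sigma):
    """Eq. (6): re-apply the stored Level to (synthetic) Pattern data."""
    return Xn * sigma + mu

# 30 days at 96 samples/day (15-minute resolution); values are illustrative.
X = np.random.default_rng(1).normal(2.0, 0.5, size=(30, 96))
Xn, mu, sigma = level_normalize(X)
assert np.allclose(level_recover(Xn, mu, sigma), X)  # lossless round trip
```

The round-trip assertion shows why the GAN only needs to learn the normalized Pattern: the Level statistics are stored during preprocessing and re-applied verbatim during postprocessing.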


2) Machine Learning Algorithms: We perform classic machine learning tasks, e.g. clustering and forecasting, on both the original and the synthetic dataset. A useful synthetic dataset should show results similar to the original dataset in these tasks.

III. APPROACH

A. Probabilistic Modeling for Time Series Data in Smart Grid

In smart grids, human activity datasets often contain daily, monthly and seasonal patterns. These arise from the periodic nature of human behavior. We assume that each time series is the sum of two components: Level and Pattern.

Level is determined by attributes such as household consumption level, season etc., and affects the scale and bias of the time series data. For example, high income households probably consume more energy than low income households; all households usually consume more energy in July than in October due to high-temperature HVAC use. Pattern is determined by household activity. Office worker households may consume very little daytime energy as people are out for work; night-owl households consume more energy at night. In this work, we assume Level is captured by the daily mean and standard deviation of the time series, and we use a GAN to learn user Patterns.

We split the entire time series into daily vectors. Each daily data vector of a particular user u is denoted as x_{u,t} = (x_{u,t_1}, x_{u,t_2}, ..., x_{u,t_s})^T, where s is the number of samples per day and t is the index of the day. We assume that the distribution of the daily data vector is conditionally dependent on the day index t and the user index u. Furthermore, we assume the day index t can be represented by the day of the week {Sun, Mon, ..., Sat} and the month of the year {Jan, Feb, ..., Dec}. Mathematically, we can write

    x_{u,t} = a_{u,t} x'_{u,t} + b_{u,t}    (3)

where a_{u,t} and b_{u,t} capture the Level information (a_{u,t} represents the scale and b_{u,t} represents the bias) and x'_{u,t} captures the Pattern information. Then, the conditional probability distribution is denoted as

    x'_{u,t} ∼ p(x'_{u,t} | day of week, month, user)    (4)

Each user has different behavior. Thus the time series produced by different users are sampled from different distributions. Given training data from N different users {x_{u,t}}_{t=1}^{T}, where u = 1, 2, ..., N, the goal is to train a generator function G that can produce samples subject to the distribution p without explicitly modeling or calculating p.

B. Data Processing Flow

We show the data processing flow in Figure 1. In the preprocessing step, we perform level normalization to obtain the conditional probability distribution as per the following equation:

    x'_{u,t} = (x_{u,t} − b_{u,t}) / a_{u,t}    (5)

where a_{u,t} = σ_{x_{u,t}} and b_{u,t} = x̄_{u,t} are the standard deviation and the average of the energy consumption or the solar generation of user u on day t, respectively. In the learning phase, we use a GAN to implicitly learn the distribution p and generate synthetic samples x̂'_{u,t}. In the postprocessing step, we perform level recovery as:

    x̂_{u,t} = x̂'_{u,t} × σ_{x_{u,t}} + x̄_{u,t}    (6)

IV. EXPERIMENTAL SETUP

A. Dataset Description

We conduct experiments using the Pecan Street Dataset, which is free for university researchers [5]. We use a subset which records the energy consumption and solar generation of 25 users with PV panels installed. The dataset covers 2013-01-01 to 2016-12-31, with consumption and solar generation averaged within each 15-minute interval. We show the energy consumption and solar generation of user 93 from 2013-10-08 to 2013-10-14 in Figure 2. We observe that solar generation shows repetitive patterns with noise originating from cloud and rain. Consumption also shows similar trends within this week, but with irregular spikes.

B. GAN Architecture

We show our conditional GAN architecture in Fig. 3. We use embedding layers to transform one-hot day and month labels into vector representations. The generator concatenates the day vector, the month vector and noise sampled from a normal distribution, feeds them into a 3-layer 1D transpose convolutional network and produces the synthetic output. The discriminator is a 3-layer 1D convolutional network that takes real or synthetic data, the day vector and the month vector as input, and produces a label indicating whether the input is real or synthetic.

C. Machine Learning Algorithms for Evaluation

For each user u = 1, 2, ..., N in the real dataset {x_{u,t}}_{t=1}^{T}, we generate synthetic data {x̂_{u,t}}_{t=1}^{T} of the same time length (4 years at 15-minute intervals in the Pecan Street Dataset).
Fig. 2: User 93 Real and Synthetic Data from 2013-06-30 to 2013-07-06. ((a) real consumption and generation; (b) synthetic consumption and generation.)

Fig. 3: Conditional GAN architecture. (Embedding layers map the day-of-week and month labels to a day vector and a month vector; the generator concatenates them with noise z ∼ N(0, 1) and applies three 1D transpose convolution layers; the discriminator concatenates the real data {x_{u,t}}_{t=1}^{T} or synthetic data with the day and month vectors and applies three 1D convolution layers followed by a dense layer to output real or synthetic.)

1) K-means Clustering: We perform dynamic time warping (DTW) k-means clustering [19] with data from January 2013 to cluster users according to their consumption and generation patterns. We conduct 3 sets of experiments as follows:
• Fit a model using real data and predict the cluster labels on synthetic data.
• Fit a model using synthetic data and predict the cluster labels on real data.
• Fit a model using mixed real and synthetic data.
Ideally, real time series and synthetic time series generated from the same user should belong to the same cluster. We compare the real labels and the synthetic labels using the F1-score [20]. In this experiment, we set the number of clusters K to 3 to simplify the illustration of the results.

2) Short-term Load Forecasting: We perform short-term load forecasting to demonstrate the statistical similarities between real data and synthetic data using an Auto-Regressive Integrated Moving Average (ARIMA) [21] model. We predict the next 24-hour consumption based on the previous week's data of a single user. We conduct two sets of experiments, on consumption and solar generation respectively:
• Train on real data and test on real data
• Train on synthetic data and test on synthetic data
We run the prediction 100 times in each experiment on different days and compare the prediction performance using the mean absolute percentage error (MAPE).
The ARIMA model is defined in terms of three parameters [22]:
• p: the auto-regressive order, which denotes the number of past observations included in the model.
• d: the number of times a time series needs to be differenced to make it stationary.
• q: the moving average order, which denotes the number of past white noise error terms included in the model.
In training the ARIMA model, the parameters were (5, 1, 0).

Fig. 4: Statistics during Training. ((a) Maximum Mean Discrepancy; (b) Training Loss.)

V. EXPERIMENTAL RESULTS

A. Statistical Property Analysis

1) Maximum Mean Discrepancy: We use 3 years of data from 2013-01-01 to 2015-12-31 to train a conditional GAN and use data from 2016-01-01 to 2016-12-31 as validation data. As shown in Figure 4b, the generator loss and discriminator loss converge, indicating that the GAN approaches a Nash equilibrium after training for 2000 iterations. After each training epoch, we generate 1 year of synthetic data and compute the Maximum Mean Discrepancy (MMD) against the validation data. We show the MMD curve in Figure 4a. We observe that the MMD gradually decreases and converges as training goes on. Thus, the probability distribution of the synthetic data approaches the real data distribution and finally converges.

2) Real Data vs. Synthetic Data: We plot a week of the real data and the synthetic data of user 93 in Figure 2. We summarize key properties of the real data that the GAN captures in the synthetic data:
• Data Range: The range of the real and the synthetic data matches, as shown in Figure 2. The peak consumption is around 6 kW and the peak solar generation is around 4 kW. This is critical because it is a necessary condition for the indistinguishability of the real data and the synthetic data.
• Day Time Consumption vs Night Time Consumption: We observe that households with PV panels installed typically consume less energy in day time than in night time. We plot the 4-year average day time and night time consumption of a user in Figure 5. As shown in Figure 5, the synthetic data captures the general pattern seen in the real data. However, the synthetic data is noisier than the real data.
• Solar Generation Noise: The solar generation is proportional to the solar radiation when it is sunny. The noise (glitches) on the solar generation curve is caused by cloud or rain during the sampling period. We notice that the synthetic data automatically captures this feature, as shown in Figure 2.

Fig. 5: Four years average day time and night time consumption of a user. Day time is 6am to 6pm; night time is 12am to 6am plus 6pm to 12am. ((a) real consumption; (b) synthetic consumption.)

B. Time Series Clustering

We show the centroids of the 3 clusters trained on real data, synthetic data and mixed data in Figure 6.

Insight 1: Real time series and synthetic time series of the same user belong to the same cluster. We report the F1 score of the various settings in Table I. The high F1 scores indicate that the real data and synthetic data from the same user belong to the same cluster. This matches our hypothesis mentioned in Section III-A.

TABLE I: K-means clustering prediction results
Train       Test        F1 Score
Real        Synthetic   1.00
Synthetic   Real        1.00
Mixed       Mixed       0.96

Insight 2: The primary criterion of clustering is the consumption level. In all three centroid subfigures, there is a centroid with an apparently higher consumption level than the other two clusters. The other two centroids fit on synthetic data can also be separated by consumption level. However, this is not true for the centroids trained on real data and mixed data.

Fig. 6: K-means centroids of various settings ((a) fit on real data; (b) fit on synthetic data; (c) fit on mixed data). We only show one week of each curve for demonstration.

Fig. 7: Short-term load forecasting results ((a) prediction on real load; (b) prediction on synthetic load).
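For reference, the DTW distance underlying the clustering in Section V-B can be sketched with the classic dynamic program; the absolute-difference local cost used here is one common choice, not necessarily the one used in [19]:

```python
import numpy as np

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance.

    D[i, j] holds the cheapest cumulative cost of aligning a[:i]
    with b[:j]; each step may advance either series or both."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
assert dtw(a, a) == 0.0
# A time-shifted copy stays close under DTW, unlike pointwise distance,
# which is why DTW suits daily profiles whose peaks drift by a few samples.
shifted = np.array([0.0, 0.0, 1.0, 2.0, 1.0])
assert dtw(a, shifted) <= np.abs(a - shifted).sum()
```

This warping invariance is what lets the k-means step group users by the shape of their consumption profile rather than by exact peak timing.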
C. Short-term Load Forecasting

We show one day of load prediction results in Figure 7. The predicted curve generally matches the ground truth on both real data and synthetic data. We show the mean absolute percentage error (MAPE) density distribution of 100 predictions on real and synthetic data in Figure 8. We observe that the MAPE of both curves concentrates near 0.5 ∼ 0.6. This provides empirical grounds for the statistical identity of the real data and the synthetic data. We also notice that the MAPE predicted using real data has more variance due to higher noise.

Fig. 8: Mean absolute percentage error (MAPE) distribution of load prediction using real data and synthetic data.

VI. RELATED WORK

Data-driven approaches for generating synthetic time-series data have been widely studied. In [23], the authors propose to use a Hidden Markov Model (HMM) to generate synthetic time series data. The HMM-based approach makes a strong assumption of the Markovian property of the time series data, which may not hold; in our approach, the assumption on the conditional probability distribution is much more relaxed. In [24], the authors successfully applied a Recurrent Neural Network (RNN) GAN to synthesize real-valued medical signal data. Medical signals have an auto-regressive property, while in daily user consumption the value at a certain timestamp does not depend on the previous timestamp. That is why we use Convolutional Nets to capture the daily vector patterns instead of using an RNN to capture auto-regressive patterns.

VII. CONCLUSION

The lack of fine-grained time series datasets at the distribution level of the smart grid prevents the advance of new data-driven methods. In this paper, we established a general probabilistic time-series data model. We proposed to use a generative adversarial network to produce synthetic datasets sampled from the same distribution as the real datasets. To evaluate the synthetic datasets, we performed statistical tests as well as classic machine learning tasks, including time series clustering and load prediction. Empirical results show that the synthetic datasets and real datasets are not distinguishable.

ACKNOWLEDGMENT

This work has been supported in part by the U.S. National Science Foundation under EAGER Award No. CNS-1637372 and the U.S. Department of Energy (DoE) under award number DE-EE0008003.

REFERENCES

[1] A. K. Srivastava et al., "Short-term load forecasting methods: A review," in 2016 International Conference on Emerging Trends in Electrical Electronics Sustainable Energy Systems (ICETEESES), March 2016, pp. 130–138.
[2] S. Aman et al., "Prediction models for dynamic demand response: Requirements, challenges, and insights," in 2015 IEEE International Conference on Smart Grid Communications (SmartGridComm), Nov 2015, pp. 338–343.
[3] D. Chen and D. Irwin, "Sundance: Black-box behind-the-meter solar disaggregation," in Proceedings of the Eighth International Conference on Future Energy Systems, ser. e-Energy '17. New York, NY, USA: ACM, 2017, pp. 45–55.
[4] S. Barker et al., "Smart*: An open data set and tools for enabling research in sustainable homes."
[5] "Pecan Street dataset," http://www.pecanstreet.org/category/dataport/, 2018.
[6] "PJM dataset," http://www.pjm.com/markets-and-operations/.
[7] "New York Independent System Operator," https://www.nyiso.com/public/markets operations/market data/.
[8] Y. Wang et al., "Review of smart meter data analytics: Applications, methodologies, and challenges," CoRR, vol. abs/1802.04117, 2018.
[9] M. R. Asghar et al., "Smart meter data privacy: A survey," IEEE Communications Surveys Tutorials, vol. 19, no. 4, pp. 2820–2835, Fourthquarter 2017.
[10] W. Tushar et al., "Synthetic generation of solar states for smart grid: A multiple segment Markov chain approach," in IEEE PES Innovative Smart Grid Technologies, Europe, Oct 2014, pp. 1–6.
[11] N. Iftikhar et al., "A scalable smart meter data generator using Spark," in On the Move to Meaningful Internet Systems. OTM 2017 Conferences, H. Panetto et al., Eds. Cham: Springer International Publishing, 2017, pp. 21–36.
[12] F. Guo et al., "Comprehensive real-time simulation of the smart grid," IEEE Transactions on Industry Applications, vol. 49, no. 2, pp. 899–908, March 2013.
[13] I. Goodfellow et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, Z. Ghahramani et al., Eds. Curran Associates, Inc., 2014, pp. 2672–2680.
[14] K. Anderson et al., "BLUED: A fully labeled public dataset for event-based non-intrusive load monitoring research," pp. 1–5, 01 2012.
[15] S. R. Kuppannagari et al., "Optimal net-load balancing in smart grids with high PV penetration," CoRR, vol. abs/1709.00644, 2017.
[16] R. Deng et al., "A survey on demand response in smart grids: Mathematical models and approaches," IEEE Transactions on Industrial Informatics, vol. 11, no. 3, pp. 570–582, June 2015.
[17] M. Mirza and S. Osindero, "Conditional generative adversarial nets," ArXiv e-prints, Nov. 2014.
[18] D. J. Sutherland et al., "Generative models and model criticism via optimized maximum mean discrepancy," ICLR, 2017.
[19] S. Salvador and P. Chan, "Toward accurate dynamic time warping in linear time and space," Intell. Data Anal., vol. 11, no. 5, pp. 561–580, Oct. 2007.
[20] M. Sokolova et al., "Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation," in AI 2006: Advances in Artificial Intelligence, A. Sattar and B.-h. Kang, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 1015–1021.
[21] M. T. Hagan and S. M. Behr, "The time series approach to short term load forecasting," IEEE Transactions on Power Systems, vol. 2, no. 3, pp. 785–791, Aug 1987.
[22] G. E. P. Box and G. Jenkins, Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990.
[23] M. Arlitt et al., "IoTAbench: An Internet of Things analytics benchmark," pp. 133–144, 01 2015.
[24] C. Esteban et al., "Real-valued (medical) time series generation with recurrent conditional GANs," 2017.
