Energy Procedia 107 (2017) 49 – 59
3rd International Conference on Energy and Environment Research, ICEER 2016, 7-11 September
2016, Barcelona, Spain
Smart meter data analytics for optimal customer selection in demand response programs
Madeline Martinez-Pabon1,*, Timothy Eveleigh1 and Bereket Tanju1
1 Department of Engineering Management and Systems Engineering, The George Washington University, Washington DC, United States
Abstract
This paper describes a methodology to predict customers’ eligibility to participate in Demand Response (DR) programs using
real electricity data collected from customers over time by smart meters. These types of programs have been proposed to improve
generation capacity as load demand increases and the two-way communications (between utilities and users) are enabled. Instead
of installing new power plants in smart grids, utilities encourage users to shift their electricity consumption from peak hours to
off-peak hours. The number of successfully recruited customers participating in demand response programs is usually low, and
resources are wasted on recruitment efforts. The results of our research reflect that it is possible to predict with more than 90%
accuracy which customers are good targets for DR program participation based on their consumption patterns and lifestyles.
These data could ultimately improve the recruitment process for DR programs.
© 2017 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the scientific committee of the 3rd International Conference on Energy and Environment Research.
Keywords: Big data analytics; demand response; smart grid; smart meter data
1. Introduction
Smart Grid provides a unique opportunity to advance the energy industry to the next level, where reliability,
availability, and efficiency can be a reality, contributing to our economic and environmental health. Smart grid uses
new technologies such as intelligent controllers, advanced software for data management, and two-way
communications between power utilities and consumers to improve the efficiency, reliability and safety of the
system [1].
Currently, generation capacity is used inefficiently because electricity usage during the day is concentrated
in a short time period known as peak hours. Additionally, more electrical generation is needed to respond to
high load demands. Demand Response (DR) programs are mechanisms that have emerged to adjust the demand for
power instead of adjusting the supply. A well-known DR program is “Time of Use (ToU),” where customers are
prompted to shift their consumption to off-peak hours. DR programs are widely recognized as one of the
essential tools that utility companies need to embrace. Key benefits of DR programs include peak load shifting,
improved use of generation capacity, and elimination of costly spot market energy purchases. Smart meters have made
possible the collection and proper storage of real-time electricity data, and these data
facilitate the detection of load shedding when demand response events are declared. The data are also used by DR programs to
encourage users to modify their consumption patterns so they can benefit from better energy costs. In particular,
when demand is high, the cost of electricity is higher than during off-peak hours. Utilities advise customers
to shift their consumption to off-peak hours and benefit from the incentives.
_________
* Corresponding author. Tel.: +1-508-320-4243
E-mail address: [email protected], [email protected], [email protected]
doi:10.1016/j.egypro.2016.12.128
Typically, overlapping energy consumption is caused by thousands of customers using electricity at the same
time during certain periods—for example, a hot summer day or a Super Bowl television transmission in the United
States. Such periods are referred to as peak demand periods [2]. Customers who join DR programs effectively
reduce their peak-hour consumption and increase their off-peak demand. Since energy load demand keeps
increasing over time, the installation of new power plants is inevitable if new alternatives do not arise. Therefore,
the optimization of generation capacity needs to be taken into consideration when looking for alternatives.
Utilities widely use DR programs to shift consumption from peak demand to off-peak hours for residential
customers. When utilities design DR programs for households, they have to take into account the near-random
consumption in electricity usage. Usually small, medium and large enterprises have more stable electrical
consumption patterns [3]. In addition to the customer’s response to the DR program, other factors should be taken
into account during the design of DR programs for residential areas. The use of Plug-in Hybrid Electric Vehicles
(PHEVs) is expected to grow and play an important role in the power grid [4]. Also, the proliferation of renewable
energy generation, such as solar and wind energy, needs to be considered at residential and enterprise levels [5].
1.1. Prior Work
DR program designers need to create a customer recruitment strategy to maximize the success of the programs.
Most of the previous research on customer selection for these types of programs relies on surveys where users report
their own views and behaviors [6]. The growing availability of smart meter data has shown that this approach,
where the primary source of information is survey responses, is extremely inaccurate [7]. Other research includes
predictive marketing models that use socioeconomic factors such as household size, family income, enrollment in
other programs, age, education, presence of children, and average energy bill costs [8]. The lack of real electricity
data from customers in previous research has prevented those recruitment methods from having real life
applicability. For example, in [9] the author uses US Census data to draw conclusions about the propensity of
customers to enroll in DR programs. In [10] the research uses clustering methods to identify suitable candidates, but
it concentrates on certain hours of the day only, excluding hours where high electricity usage can take place, which
may affect the accuracy of the method.
1.2. Contribution of the Paper
We propose a new methodology to predict eligibility to participate in DR programs using load consumption
characteristics of customers. Our approach is different from the methods currently used by utility companies, which
involve performing “psychographic” segmentation that attempts to draw conclusions about energy consumption and
future enrollment in DR and energy efficiency programs from questionnaire data administered to customers [11].
Other utilities use the monthly billing data of users to select eligible customers [12]. After utilities have pre-selected
those customers, they spend money and other resources calling potential participants to invite them to join the
DR programs.
Another important contribution of our paper is the use of the R programming language for smart meter data
analysis. In the big data analytics era, R is widely used for large-scale data analytics in both
academia and industry, and it is quickly becoming the single most
important tool for computational statistics, visualization, and data science [13]. To the authors’ knowledge, this is
the first time that R has been used to analyze smart meter data in the power system literature. Also, we assess
our testing dataset using four different machine learning algorithms, reaching a prediction accuracy of more
than 90% with the random forest algorithm. This level of accuracy was obtained using real smart meter data
collected over a one-and-a-half-year period from 6,429 customers.
1.3. Structure of the Paper
This paper is structured as follows. Section 2 describes the methodology, consisting of two types of clustering
(hierarchical clustering and K-means) and four different prediction models (KNN, ANN, random forest and decision
tree). Section 3 presents the experimental evaluation and validation of the methodology. Section 4 summarizes the
conclusions and future work.
Fig. 1. Overall DR customer selection flow
2. Methodology
The proposed methodology to predict customers’ eligibility to participate in DR programs is presented in Fig. 1.
First, data processing is performed on raw smart meter data. The objective here is to set all data points into the
same format to facilitate the data analysis [14]. Then, we examine an agglomerative hierarchical clustering of the
dataset to decide the optimal number of clusters. By applying hierarchical clustering, we address one of the most
challenging problems of clustering, which is the selection of the right number of clusters [15]. After that, we
proceed to find the best combination of clusters using K-means. Lastly, we predict eligibility using several machine
learning techniques.
2.1. Clustering Methods
The main function of clustering algorithms in electricity consumption data analysis is the identification of
structure in an unlabeled or pre-processed dataset. Clustering allows the organization of data in homogeneous
groups “where the within-group-object similarity is minimized and the between-group-object dissimilarity is
maximized” [16]. Two different clustering methods are used in this paper:
1) Hierarchical Clustering: A hierarchical clustering method works by grouping objects into a tree of clusters.
There are two types of hierarchical clustering methods, depending on whether the tree is built bottom-up or top-down:
agglomerative (bottom-up), which we use here, and divisive (top-down). The agglomerative hierarchical clustering method assigns each
object to its own cluster and then merges the individual clusters into larger and larger clusters until all the objects are
in a single cluster or certain termination conditions are satisfied [2]. The merging of households from individual
clusters into progressively larger clusters leads to a tree structure that we can analyze to determine the optimal
number of clusters. The objective here is to find a reasonable number of clusters in the curved area, or the “knee”, of
the graph. When we analyze the knee region, we find a balance of clusters that are homogeneous and also dissimilar
to each other. The best number of clusters is determined based on the corresponding knee region.
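As an illustration of the bottom-up merging described above, the following minimal Python sketch records the distance at which each merge happens. It assumes single linkage and Euclidean distance, neither of which is specified in the text, and the function name `agglomerative_merge_heights` is hypothetical. The analysis in this paper was carried out in R; Python is used here only to keep the sketch self-contained.

```python
import math


def agglomerative_merge_heights(points):
    """Merge clusters bottom-up and return the distance of each merge.

    Assumption: single linkage (distance between the two closest members).
    """
    clusters = [[p] for p in points]  # every object starts in its own cluster
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
        heights.append(d)
    return heights
```

A sharp jump in the returned merge distances is one informal way to locate the knee: for two well-separated pairs of points, the first two merges happen at a small distance and the last one at a large distance, which suggests two clusters.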
2) K-means Clustering on Normalized Data: The K-means algorithm is the most popular statistical clustering
approach and is used in this methodology because of its simplicity, efficiency and scalability. Also, it has been proven
to be adequate in this type of application [17]. The main idea behind its use in our research is the minimization of the
sum of squared errors given in (1). The solution consists of a scheme that starts with an initial cluster membership or
set of centers chosen arbitrarily. The K-means algorithm then alternates between updating the cluster centers and
redistributing the objects among clusters, comparing the objective value with that of the previous step, until the
function cannot be reduced any further. One constraint of using K-means is that it requires knowing the
number of clusters in advance.
$$\min J_1(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \, \lVert x_k - v_i \rVert^2 \qquad (1)$$
Given n electricity load shapes $\{x_k \mid k = 1, \dots, n\}$, K-means determines c cluster centers $\{v_i \mid i = 1, \dots, c\}$ by
minimizing the objective function given in (1). Every household is placed in a cluster that later will allow us to
determine eligibility for DR programs.
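The alternating scheme can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study (which was written in R): it reads u_ik in (1) as a 0/1 indicator that load shape k belongs to cluster i, which is an assumption, and the function names, the random initialization and the iteration cap are all hypothetical choices.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two load shapes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(shapes, c, iterations=100, seed=0):
    """Return (labels, centers, objective) for c clusters."""
    rng = random.Random(seed)
    centers = [list(s) for s in rng.sample(shapes, c)]  # arbitrary initial centers
    labels = None
    for _ in range(iterations):
        # Assignment step: each shape joins its nearest center.
        new_labels = [
            min(range(c), key=lambda i: squared_distance(x, centers[i]))
            for x in shapes
        ]
        if new_labels == labels:  # no change, objective cannot decrease further
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for i in range(c):
            members = [x for x, lab in zip(shapes, labels) if lab == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    objective = sum(squared_distance(x, centers[lab]) for x, lab in zip(shapes, labels))
    return labels, centers, objective
```

The returned objective corresponds to the value of (1) for the final assignment, so runs with different numbers of clusters can be compared on the same data.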
2.2. Prediction Models
We use four different machine learning algorithms to predict enrollment in DR programs based on households’
electricity load profile shape. These algorithms are: K-Nearest Neighbor, decision tree, artificial neural network and
random forest. Later, we will describe the performance metrics used to compare the prediction models.
1) K-Nearest Neighbor (k-NN) Classification: k-NN is considered one of the simplest and oldest methods for
pattern classification [18]. We decided to use it in our research because it has shown great prediction results in the
analysis of energy load data. Also, it has been well investigated in the literature and has been shown to be a powerful
non-parametric mechanism for classification of objects and density estimation [19, 20]. Objects are classified
by a majority vote of their neighbors. In other words, an object is assigned to the class most common among its k
nearest neighbors, as the name indicates. This
classifier makes use of the Euclidean distance between test samples and training samples. Let $x_i$ be an input sample
with $p$ features $(j = 1, 2, \dots, p)$. The Euclidean distance between samples $x_i$ and $x_l$ $(l = 1, 2, \dots, n)$
is defined in (2).
$$d(x_i, x_l) = \sqrt{(x_{i1} - x_{l1})^2 + (x_{i2} - x_{l2})^2 + \cdots + (x_{ip} - x_{lp})^2} \qquad (2)$$
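The distance in (2) and the majority vote can be illustrated with a short Python sketch. It is only a minimal example under stated assumptions: the function names and the default k = 3 are hypothetical, and the study itself used R.

```python
import math
from collections import Counter


def euclidean_distance(x_i, x_l):
    """Euclidean distance between two samples with the same p features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_l)))


def knn_predict(train_samples, train_labels, query, k=3):
    """Assign the query to the most common class among its k nearest neighbors."""
    ranked = sorted(
        zip(train_samples, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Choosing an odd k for a two-class problem avoids tied votes; with more classes, ties are still possible and need an explicit rule.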
2) Decision Tree: The decision tree is another widely used machine learning technique that has been effective for classification and
regression. When missing values are present in a dataset, the decision tree is a good technique to take into account
[21]. During the pre-processing phase of our research, we encountered the challenge of having empty values in the
dataset, and the flexibility that decision tree algorithms offer facilitated our data analysis. The approach breaks
a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed.
Let us consider the following parameters to mathematically define a decision tree: A vector X of n characteristics
as input, a corresponding label Y as the output and a training set S that contains m couples (X,Y).
$$X = (x_1, \dots, x_j, \dots, x_n)^T, \quad X \in \mathbb{R}^n$$
$$Y \in \mathbb{R}$$
$$S = \{(X_1, Y_1), \dots, (X_m, Y_m)\} \qquad (3)$$
A final tree with leaf nodes and decision nodes is the end result after running the algorithm. It is important to note
that a decision node contains several branches. Additionally, leaf nodes represent decisions, or in our case,
classifications. The topmost decision node in a tree is called the root node, and it represents the best predictor. Also, decision
trees can handle numerical data and categorical information, which makes them a very attractive model for various
applications [14]. Once training is complete, the label of a new feature vector is predicted using the built
predictor h, as shown in (4).
$$Y = h(X) \qquad (4)$$
3) Artificial Neural Network (ANN): An ANN is an algorithm that emulates biological neural systems and uses the same
approach to predict certain behaviors of other systems. It consists of an interconnected group of
artificial neurons that are in charge of processing information [22]. Different processing elements are connected in
the process, and every element represents a neuron in the brain. These neurons can be constructed in real life or
simulated by a computer. The algorithm works as follows: a neuron takes input signals ($x_n$), and then, based on a
weighting mechanism ($w_n$), it produces an output signal ($y$) that is sent as input to another neuron. Fig. 2 shows the
mathematical model of the ANN. This algorithm can be used to detect patterns and find trends that are too complex
to be discovered by humans or other computer techniques.
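A single artificial neuron of the kind described above can be sketched as a weighted sum of its inputs passed through an activation function. The sigmoid activation and the bias term below are assumptions made for illustration, since the text does not specify them, and `neuron_output` is a hypothetical name.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs x_n with weights w_n, squashed to (0, 1).

    Assumption: sigmoid activation and an optional bias term.
    """
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

The output y of one neuron can then be passed as an input to a neuron in the next layer, for example `neuron_output([neuron_output([1.0, 2.0], [0.5, -0.25])], [1.0])`.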
Table 1. Smart metering meta data

Time                              Resolution   Customer Type
                                               Households   Enterprise (SME)   Other
14th July 2009 - 31st Dec. 2010   30 min       4,243        450                1,735
4) Random Forest: Random forest is an ensemble learning model for regression and classification. Predictions of the ensemble
are made by aggregation, using either a voting or averaging approach. The most popular class is chosen by voting
when a sufficient number of trees have been generated [23]. We call these voting ensembles of trees random forests.
A more formal definition of random forest is as follows: “A random forest is a classifier consisting of a collection
of tree-structured classifiers $\{h(x, \Theta_k), k = 1, \dots\}$ where the $\{\Theta_k\}$ are independent identically distributed random vectors and
each tree casts a unit vote for the most popular class at input x” [23].
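The voting step of the definition can be illustrated with a small Python sketch. It covers only the aggregation of unit votes, not the training of the randomized trees; each tree is represented by a hypothetical callable that maps an input x to a class label, and `forest_predict` is an illustrative name.

```python
from collections import Counter


def forest_predict(trees, x):
    """Each tree casts one vote for a class; the most popular class wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]
```

For instance, three hand-written stumps that threshold the evening share of a load shape at different values would each vote "eligible" or "not eligible", and the forest returns whichever label receives more votes.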
There are two characteristics of random forest that make it an attractive prediction model: 1) It is able to achieve
high prediction accuracy, 2) It makes use of an auto-collection of desired features, such as daily raw data of
electricity consumed by individual appliances [24]. These two features make Random Forest a good model for
analyzing smart meter data. Additionally, the algorithm uses multiple models to improve its performance rather than
using a single tree model. A ‘variable importance’ can be obtained because many samples are selected in the
process, not just one, and this approach can be used for model selection.
3. Experiments on Data
3.1. Description of Smart Meter Data
The data used in the experiments for this research were collected as part of an electricity pilot project study
conducted by the Irish Commission for Energy Regulation (CER) between 2009 and 2010 using smart meters. The
main objective of the pilot project was to determine the types of changes in consumers’ behaviors in terms of
reductions in energy use during peak demand. In total, data from 6,429 residential and commercial customers were
recorded for a period of 535 days, 22 hours and 30 minutes. The study took place from July 15, 2009 to December
31, 2010 for a total of 1 year, 5 months and 17 days of data recorded. Data were collected on a half-hourly basis,
documenting the time of day when the read was taken, and electricity consumption measures are presented in kW.
Originally, the dataset was grouped into three different customer types: residential, Small and Medium Enterprises
(SME) and others, as shown in Table 1.
Fig. 2. Mathematical model of the ANN

The detailed data were made available in anonymized format in order to facilitate further research and protect
customers’ privacy.
3.2. Data Preparation
A data cleaning and pre-processing procedure is applied before the clustering algorithms are used to segment the
raw data. These are the steps followed to prepare data for further analysis:
• Filter out weekends: We concentrate our analysis on weekdays, which allows us to see stable electricity
usage patterns from customers.
• Remove unnecessary variables: The data feature some inconsistencies, such as reading errors and outliers. So,
we removed those customers whose data were not reliable.
• Transform data files to .RDS format: The smart meter dataset was provided in six Comma Separated
Values (CSV) files. To avoid parsing this large dataset over and over again in R, we adopted a simple
solution: reading the data into R once and then storing them as an R binary (RDS) file.
• Checking for same IDs across multiple files: This process involved ensuring that there are no repeated IDs
(representing each customer) across files. All data points are now identified based on their ID number using
the R data.frame function.
• Calculate mean for every 30-min period: After the files are separated by ID, we aggregate each ID separately.
Then we calculate the mean for every 30-minute period (48 points in 24 hours) over all the IDs.
• Data normalization: The result is a table with three variables: ID, hHour, kWh, with customer number, time
stamp and electricity consumed, respectively.
• Electricity percentage: We calculate the percentage of electricity consumed every 30 minutes over a period
of 24 hours. There are 48 data points in this 24-hour period. In addition, we calculated peak and off peak
electricity percentage consumption.
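The weekday filtering, half-hourly averaging and percentage steps above can be sketched for a single customer as follows. This is a minimal Python illustration under stated assumptions, not the R code used in the study: the input is assumed to be a list of (timestamp, kWh) pairs, and the function name `percentage_load_shape` is hypothetical.

```python
from datetime import datetime


def percentage_load_shape(readings):
    """Return 48 values: share (%) of daily electricity used per half hour.

    readings: iterable of (datetime, kWh) pairs for one customer ID.
    """
    totals = [0.0] * 48
    counts = [0] * 48
    for timestamp, kwh in readings:
        if timestamp.weekday() >= 5:  # filter out Saturday and Sunday
            continue
        slot = timestamp.hour * 2 + timestamp.minute // 30  # 0..47
        totals[slot] += kwh
        counts[slot] += 1
    # Mean consumption for every 30-minute period over all weekdays.
    means = [t / n if n else 0.0 for t, n in zip(totals, counts)]
    daily_total = sum(means)
    if daily_total == 0:
        raise ValueError("no weekday consumption recorded")
    return [100.0 * m / daily_total for m in means]
```

Because the 48 values always sum to 100, load shapes of customers with very different absolute consumption become directly comparable before clustering.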
Fig. 3. Percentage electricity load shape
Fig. 3 shows the percentage of electricity consumed in one day every 30 minutes. In the graph, there are 48 points,
each one representing a time in the day. Zero is the beginning of the day (12:00 am) and point 48 is the end of the
day (11:59 pm). A percentage load shape can be generated for the 6,429 users that are part of this dataset. In Fig. 3,
user 3,377 was chosen and its electricity load shape is shown in the graph. For this particular customer, the majority
of the electricity is consumed between points 14 and 18 (7:00 am and 9:00 am respectively) in the morning and
points 36-42 (6:00 pm and 9:00 pm respectively) in the evening. Here we see that smart meter data offer a unique
opportunity to learn about the customers’ lifestyles, and conclusions about the participation of this customer in DR
programs can be made immediately. The electricity load shape shows that the members of this family start using
electricity at about 7:30 am. The highest peak is at 8:00 am, after which consumption decreases significantly.
Between point 20 (10:00 am) and point 35 (5:30 pm) no major electricity consumption is reported, until 6:00 pm
when another peak is seen. A sample lifestyle could involve the household leaving home in the morning, returning
in the afternoon and consuming electricity until they go to sleep.
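The correspondence between the point numbers on the horizontal axis and the time of day follows from the half-hourly resolution: point k starts k × 30 minutes after midnight. A small Python helper (the name `point_to_time` is hypothetical) makes the conversion explicit.

```python
def point_to_time(point):
    """Convert a half-hour point index (0 = midnight) to a 24-hour clock string."""
    hours, minutes = divmod(point * 30, 60)
    return f"{hours:02d}:{minutes:02d}"
```

With this mapping, points 14 and 18 correspond to 07:00 and 09:00, points 36 and 42 to 18:00 and 21:00, and point 35 to 17:30, in agreement with the times quoted above.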
This load shape summarizes energy consumption for weekdays, excluding weekends, for a period of one and a
half years. We can conclude that this household may not be eligible to participate in DR programs that rely on
manual response: if the utility requests a switch of electricity consumption from peak hours to off-peak hours,
the members of the household will probably not be available during off-peak times. When utility companies
have the ability to extract meaningful information from the huge amount
of data they collect from customers, they can propose new and innovative methods to encourage optimal electricity
use by all types of customers (e.g., Households, SME), including DR and energy efficiency programs.
3.3. Clustering Analysis
After having the dataset pre-processed, the clustering algorithm can be executed. However, K-means requires
knowing the number of clusters that are used in advance. For that purpose, we employed a combination of
hierarchical clustering and slope statistic methods proposed in [21].
Fig. 4. Selection of number of clusters
Slope statistic is a data-driven, non-parametric method that determines the number of clusters a dataset should
contain in order to garner the best results from the clustering process. The slope statistic does not have reference
distributions. Also, it has an intuitive interpretation and does not require intensive computations.
In Fig. 4, we examine hierarchical clusters of the dataset to decide the optimal number of clusters using the slope
statistic technique [21], and we found K=12. A vertical line is traced to determine the “knee” of the curve. The
points after the vertical line are counted to determine the number of clusters. When running the
algorithm using fewer clusters, we find that the slope of the curve becomes
extremely large, as shown in the first two curves of Fig. 4. Therefore, it is not adequate to choose fewer than 12
clusters in our dataset.
As mentioned in Section 2, we used the K-means algorithm to identify representative load shapes of the entire
dataset. The twelve representative load
shapes are used as prototypes when describing the load shape of the entire population.
Fig. 5 shows the best combination of clusters using K-means. These clusters are extracted after applying our
K-means algorithm to the entire dataset containing 6,429 users. Most of the high electricity usage occurs in the early
morning, followed by the late afternoon and evening, which represents the lifestyle of a typical household. We also
see some cluster curves in which most of the electricity is used during the day.
They represent the small and medium enterprises that are part of the dataset, where electricity is mostly used
during business hours on weekdays. Just from looking at the electricity load shapes of the clusters, we can perceive
which clusters might be a good fit for participation in the DR programs. For example, clusters 2, 5, 7 and 11 clearly
show dual peak events (Morning peaks and Night peaks), while clusters 8, 9 and 10 show Daytime and Evening
peaks. Daytime and Evening peak segments have a high potential for being targets of the DR programs, while
Morning and Night peak segments have low potential [16].
Fig. 5. Resulting K-means clusters
Fig. 6. Hourly aggregated mean monthly load
After optimal clusters are identified, each user is assigned to its best-fitting cluster. We have a processing file that
contains a table with the corresponding cluster for each customer. Table 2 shows the first 7 customers
with their respective clusters, starting with ID 1000, which is the first customer. Clusters range from 1 to 12.
3.4. Feature Analysis
The mean hourly load for a household during the months of October and December is shown in Fig. 6.
This load shape follows a typical residential dynamic where a larger peak is seen in the afternoon and small peaks
are seen in the morning.
Fig. 7. Hourly mean load for October and December of 2009 & 2010
Table 2. Users associated with clusters

ID     Cluster Identification
1000   5
1001   12
1002   2
1003   4
1004   4
1005   7
1006   3
Fig. 7 presents four different load shape profiles in a 24-hour period for October and December across two years,
2009 and 2010. In October of 2009, the pilot project was just starting, and in December of 2010 the pilot project
was about to end. We see a similar consumption in kWh during December for both years (green and purple
lines); however, we can observe a reduction in electricity consumption between October 2009 and October 2010 (black
and red lines respectively). The implementation of the pilot project in Ireland reduced both overall and peak
electricity usage across participants in the trial during weekdays [25]. This statement is corroborated in Fig. 7, where
a reduction in electricity usage is visible between October 2009 and October 2010.
Comparing energy usage every day of the week is an important factor in our smart meter data analysis because it
allows us to visualize the difference in energy consumption patterns from day to day, as seen in Fig. 8. This figure
shows the mean hourly load for a 24-hour period across all days. The highest peak occurs at around 7:00 pm on
Wednesday and the lowest load is seen on Tuesday at 4:00 am. All lines show a consistent consumption pattern with two
peaks, between 12:00 pm and 3:00 pm and between 8:00 pm and 10:00 pm. Surprisingly, Monday and Tuesday show the lowest
electricity consumption during the highest peak hours in the afternoon, while Wednesday and Thursday report the
highest use at the same time. Weekends do not show a considerable change in consumption patterns, presenting
similar peaks as weekdays.
Fig. 8. Hourly aggregated mean weekly load
Fig. 9. Peak and off peak time definition
3.5. Comparison of Prediction Models
The prediction of the consumption load profile shapes was tested using four different machine learning
algorithms. The algorithms are: K-nearest neighbor, decision tree, artificial neural network and random forest. Here
we describe the performance metrics used to compare the prediction models. After cluster centers are aggregated,
we select the clusters that are eligible for DR programs. In order to select these clusters, we have to numerically
define peak and off-peak times.
Peak time is defined as the period when the average electricity usage is at least 10% higher than it would be in the case of
evenly distributed usage (blue line in Fig. 9). That is, the threshold is a horizontal line at $1.1 \times (100/48)$, shown as a red line in the
same graph, representing a 10% increase in electricity consumption. As seen in the figure, this period starts at 15:30 or 3:30
pm (the 32nd half-hour interval), and its last interval is from 23:00 to 23:29 or 11:00 pm to 11:30 pm (the 47th
interval). Therefore, we defined electrical peak hours as being between 3:30 pm and 11:30 pm.
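The peak definition can be checked numerically with a short Python sketch. The function names are hypothetical, and the load shape is assumed to be a list of 48 percentages (one per half hour) that sum to 100; intervals are numbered from 1 to match the text.

```python
def peak_threshold(n_intervals=48, margin=0.10):
    """Share (%) per interval under even usage, raised by the given margin."""
    return (1.0 + margin) * (100.0 / n_intervals)


def peak_intervals(shape, margin=0.10):
    """Return the 1-based half-hour intervals whose share meets the threshold."""
    threshold = peak_threshold(len(shape), margin)
    return [i + 1 for i, pct in enumerate(shape) if pct >= threshold]
```

With 48 intervals, even usage gives 100/48 ≈ 2.08% per interval, so the threshold is about 2.29%. Any interval of the aggregated load shape at or above this value counts as a peak interval.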
Then, we find those customers that use a large proportion of their total electricity outside peak times, from
midnight to 3:30 pm. Fig. 10 shows the proportion of the total electricity used by each cluster, colored by the
peak period. The red portion shows the off-peak period (2/3 of the day), while the blue portion shows the peak
periods (1/3 of the day). The horizontal line in Fig. 10 is drawn at the value 66% or 2/3 of the day. Clusters 8, 9 and
10 are selected as eligible for DR programs, as they have a higher proportion of electricity usage in the off-peak
hours than expected in the case of an evenly distributed usage, as seen in Fig. 10.
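The selection rule can be written as a simple comparison of each cluster's off-peak share against the two-thirds line. The Python sketch below is illustrative only: the function names are hypothetical, the cluster center is assumed to be a 48-point percentage load shape, and the peak period is taken as intervals 32 to 47 as defined above.

```python
def off_peak_share(shape, peak_first=32, peak_last=47):
    """Percentage of total electricity used outside the peak intervals.

    peak_first and peak_last are 1-based and inclusive.
    """
    total = sum(shape)
    peak = sum(shape[peak_first - 1:peak_last])
    return 100.0 * (total - peak) / total


def is_eligible(shape, cutoff=100.0 * 2.0 / 3.0):
    """A cluster is eligible when its off-peak share exceeds 2/3 of the total."""
    return off_peak_share(shape) > cutoff
```

A cluster whose consumption is concentrated during the day lies well above the line and is selected, while a cluster dominated by evening usage falls below it.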
Fig. 10. Cluster selection for eligibility for participation in DR programs
We partition the dataset, splitting the data into a training set and a test set. For this application, initial
results revealed that the random forest model produces better results than the KNN, decision tree and
ANN algorithms. It outperforms the other three models with an accuracy of 0.951, as seen in Fig. 11, classifying
95.1% of the test sample correctly. The decision tree has the lowest performance with 34.9% accuracy, followed
by the artificial neural network with an accuracy of 66.5%. KNN proves to be a good model for this type of analysis
with 75.6%, but it is still not as good as random forest. Random forests seem to be an effective method for the
analysis of large smart meter datasets.
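The partitioning and the accuracy metric can be sketched as follows. This is a minimal Python illustration with hypothetical function names; the 30% test fraction and the fixed random seed are assumptions, since the split ratio is not stated in the text.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and return (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of test samples whose predicted cluster matches the true one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

An accuracy of 0.951 therefore means that 95.1% of the customers in the test set were assigned to the correct cluster by the model.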
Random forests use multiple models for better results as compared to simply executing a single tree model. In
addition, because many samples are selected in the process, a measure of variable importance can be obtained.
4. Conclusions and future work
This paper proposes a new data-driven methodology for the prediction of customers’ eligibility to participate in
DR programs using clustering and four different machine learning algorithms. Among the prediction models,
random forests proved to be a valuable method for the analysis of smart meter data for large datasets, with an
accuracy of 95.1%, followed by K-Nearest Neighbor (k-NN) classification with 75.6% accuracy. Using the
proposed methodology, utility companies will have an optimal customer selection targeting those customers that are
a good fit for DR programs. As the results of our research suggest, recruiting within a subpopulation that is more
likely to enroll improves the effectiveness of targeted marketing. Unfortunately, this approach only considers
customers that are at home at the time of the peak event and requires manual response to the program. We plan to
extend our research by 1) taking into consideration customers with solar PV panels (prosumers) and 2) considering
autonomous DR systems where users’ intervention is not necessary to improve customer’s enrollment and DR
results.
Fig. 11. Accuracy of models for cluster classification

Acknowledgment
The authors acknowledge the Commission for Energy Regulation (CER) and the Irish Social Science Data
Archive (ISSDA) for making the smart meter datasets of residential and SME customers available for research.
References
[1] Vardakas JS, Zorba N, Verikoukis CV. A survey on demand response programs in smart grids: Pricing methods and optimization
algorithm. IEEE Communications Surveys & Tutorials 2014;17(1):152-178. doi:10.1109/COMST.2014.2341586
[2] Chelmis C, Kolte J, Prasanna VK. Big data analytics for demand response: Clustering over space and time, Proc. Big Data, 2015 IEEE
International Conference. IEEE, 2015; p. 2223-2232. doi:10.1109/BigData.2015.7364011
[3] Energy Conservation Committee Report and Recommnedations, Reducing Electricity Consumption in Houses, Ontorio Home Builders’
Assoc., May 2006.
[4] Yilmaz M and Krein PT. Review of the Impact of Vehicle-to-Grid Technologies on Distribution Systems and Utility Interfaces, in IEEE
Transactions on Power Electronics; Dec.2013; 28(12): 5673-5689. doi: 10.1109/TPEL.2012.2227500
[5] Ozturk, Y, Senthilkumar, D, Kumar, S, Lee, G. An intelligent home energy management system to improve demand response, Smart Grid,
IEEE Transactions on 2012; 4 ( 2): p. 694-701. doi:10.1109/TSG.2012.2235088
[6] Sanquist TF, Lifestyle factors in US residential electricity consumtion, Energy Policy; 2012. 42: 354-364.
[7] Todd A, Cappers P, and Goldman C. Residential customer enrollment in time-based rate and enabling technology programs, Lawrence
Berkeley Nat. Lab., Berkeley, CA, USA, Tech.Rep. LBNL-6247E, 2013
[8] Panigrahi BK and Pandi VR, Optimal feature selection for classification of power quality disturbances using wavelet packet-based fuzzy k-
nearest neighbour algorithm, in IET Generation, Transmission & Distribution; March 2009. 3 (3): 296-306. doi: 10.1049/iet-gtd:20080190
[9] Zeifman M. Smart meter data analytics: Prediction of enrollment in residential energy efficiency programs, Proc. IEEE International
Conference on Systems, Man, and Cybernetics (SMC); 2014. p. 413-416. doi:10.1109/SMC.2014.6973942
[10] Haben, S, Singleton, C., Grindrod, P. Analysis and clustering of residential customers energy behavioral demand using smart meter
data, Smart Grid, IEEE Transactions on; 2016. 7 (1), p. 136-144. doi:10.1109/TSG.2015.2409786
[11] Albert A and Rajagopal R, Smart Meter Driven Segmentation: What Your Consumption Says About You, in IEEE Transactions on Power
Systems; Nov. 2013. 28 (4):4019-4030. doi: 10.1109/TPWRS.2013.2266122
[12] Kwac J, Rajagopal R.. Data-driven targeting of customers for demand response, IEEE Transactions on Smart Grid,
2015. doi:10.1109/TSG.2015.2480841
[13] Vadovsk M, Michalik P, Zolotov I. and Parali J. Better IT services by means of data mining, 2016 IEEE 14th International Symposium on
Applied Machine Intelligence and Informatics (SAMI), Herlany; 2016. p. 187-192. doi: 10.1109/SAMI.2016.7423005
[14] Lahouar A, Ben J, Hadj Slama. Random forests model for one day ahead load forecasting. Proc.Renewable Energy Congress (IREC), 2015
6thInternational, 2015. p.1-6. doi:10.1109/IREC.2015.7110975
[15] Hammouda K, A compartive study of data clustering techniques, International Journal of Computer Science and Information Technology;
2008. 5(2) : 220-231.
[16] Jungsuk Kwac, Flora, J., Rajagopal, R. Household energy consumption segmentation using hourly data, Smart Grid, IEEE Transactions on;
2013. 5 (1) : pp.420-430. doi:10.1109 / TSG.2013.2278477
[17] Viegas JL, Vieira SM, Sousa JMC., Melicio R, Mendes VMF. Electricity demand profile prediction based on household characteristics,
Proc. European Energy Market (EEM) 2015 12th International Conference on the; 2015. p. 1-5. doi:10.1109 / EEM.2015.7216746
[18] Palomares-Salas JC, de la Rosa JJG, Agüera-Pérez A and Sierra-Fernández J M. Smart grids power quality analysis based in classification
techniques and higher-order statistics: Proposal for photovoltaic systems, Industrial Technology (ICIT), 2015 IEEE International
Conference on, Seville; 2015. p. 2955-2959. doi: 10.1109/ICIT.2015.7125534
[19] Keller JM, Gray MR and Givens JA. A fuzzy K-nearest neighbor algorithm, in IEEE Transactions on Systems, Man, and Cybernetics; July-
Aug. 1985. SMC-15(4) : 580-585.doi: 10.1109/TSMC.1985.6313426
[20] Kazakidis SA, Kokkosis AI, Moustris KP and Paliatsos AG, Electricity consumption prognosis with the combination of smart metering and
artificial neural networks, Power Generation, Transmission, Distribution and Energy Conversion (MEDPOWER 2012), 8th Mediterranean
Conference on, Cagliari; 2012. p.1-6.doi: 10.1049/cp.2012.2013
[21] Fujita A, Takahashi D, Patriota A. A non-parametric method to estimate the number of cluster, Conputational Statistics & Data Analysis
Journal; 2013. 73 (1): 27-39.
[22] Asare-Bediako B, Kling WL and Ribeiro PF, Day-ahead residential load forecasting with artificial neural networks using smart
meterdata, PowerTech (POWERTECH), 2013 IEEE Grenoble, Grenoble; 2013. p. 1-6.doi: 10.1109/PTC.2013.6652093
[23] Mei J, He D, Harley RG, Habetler TG. Random forest based adaptive non-intrusive load identification, Proc. International Joint Conference
on Neural Networks (IJCNN); 2014. p. 1978-1983. doi:10.1109/IJCNN.2014.6889897
[24] Lines J, Bagnall A, Caiger-Smith P, and Anderson S, Classification of Household Devices by Electricity Usage Profile, University of East
Anglia Norwich, Cambridge, UK. https://archive.uea.ac.uk/~ajb/Papers/LinesIDEAL2011.pdf
[25] Commiton for Energy Regulation (CER). Electricity Smart Metering Customer Behaviour Trials Findings Report, May 2011.
