Machine Learning for the New York City Power Grid
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, February 2012
Abstract—Power companies can benefit from the use of knowledge discovery methods and statistical machine learning for preventive
maintenance. We introduce a general process for transforming historical electrical grid data into models that aim to predict the risk of
failures for components and systems. These models can be used directly by power companies to assist with prioritization of
maintenance and repair work. Specialized versions of this process are used to produce 1) feeder failure rankings, 2) cable, joint,
terminator, and transformer rankings, 3) feeder Mean Time Between Failure (MTBF) estimates, and 4) manhole events vulnerability
rankings. The process in its most general form can handle diverse, noisy sources that are historical (static), semi-real-time, or real-
time, incorporates state-of-the-art machine learning algorithms for prioritization (supervised ranking or MTBF), and includes an
evaluation of results via cross-validation and blind test. Above and beyond the ranked lists and MTBF estimates are business
management interfaces that allow the prediction capability to be integrated directly into corporate planning and decision support; such
interfaces rely on several important properties of our general modeling approach: that machine learning features are meaningful to
domain experts, that the processing of data is transparent, and that prediction results are accurate enough to support sound decision
making. We discuss the challenges in working with historical electrical grid data that were not designed for predictive purposes. The
“rawness” of these data contrasts with the accuracy of the statistical models that can be obtained from the process; these models are
sufficiently accurate to assist in maintaining New York City’s electrical grid.
Index Terms—Applications of machine learning, electrical grid, smart grid, knowledge discovery, supervised ranking, computational
sustainability, reliability.
1 INTRODUCTION
underground cable1 and many other utilities manage similarly large underground electric systems. Maintaining a large grid that is a mix of new and old components is more difficult than managing a new grid (for instance, as is being laid in some parts of China). The US grid is generally older than many European grids that were replaced after WWII, and older than grids in places where infrastructure must be continually replenished due to natural disasters (for instance, Japan has earthquakes that force power systems to be replenished).

1. http://www.fpl.com/faqs/underground.shtml.

The smart grid will not be implemented overnight; to create the smart grid of the future, we must work with the electrical grid that is there now. For instance, according to the Brattle Group [4], the cost of updating the grid by 2030 could be as much as $1.5 trillion. The major components of the smart grid will (for an extended period) be the same as the major components of the current grid, and new intelligent meters must work with the existing equipment. Converting to a smart grid can be compared to replacing worn parts of an airplane while it is in the air. As grid parts are replaced gradually and as smart components are added, the old components, including cables, switches, sensors, etc., will still need to be maintained. Further, the state of the old components should inform the priorities for the addition of new smart switches and sensors.

The key to making smart grid components effective is to analyze where upgrades would be most useful, given the current system. Consider the analogy to human patients in the medical profession, a discipline for which many of the machine learning algorithms and techniques used here for the smart grid were originally developed and tested. While each patient (a feeder, transformer, manhole, or joint) is made up of the same kinds of components, they wear and age differently, with variable historic stresses and hereditary factors (analogous to different vintages, loads, and manufacturers), so each patient must be treated as a unique individual. Nonetheless, individuals group into families, neighborhoods, and populations (analogous to networks and boroughs) with relatively similar properties. The smart grid must be built upon a foundation of helping the equipment (patients) improve their health so that the networks (neighborhoods) improve their life expectancy and the population (boroughs) lives more sustainably.

In the late 1990s, NYC's power company, Con Edison, hypothesized that historical power grid data records could be used to predict, and thus prevent, grid failures and possible associated blackouts, fires, and explosions. A collaboration was formed with Columbia University, beginning in 2004, in order to extensively test this hypothesis. This paper discusses the tools being developed through this collaboration for predicting different types of electrical grid failures. The tools were created for the NYC electrical grid; however, the technology is general and is transferable to electrical grids across the world.

In this work, we present new methodologies for maintaining the smart grid, in the form of a general process for failure prediction that can be specialized for individual applications. Important steps in the process include data processing (cleaning, pattern matching, statistics, integration), formation of a database, machine learning (time aggregation, formation of features and labels, ranking methods), and evaluation (blind tests, visualization). Specialized versions of the process have been developed for:

1. feeder failure ranking for distribution feeders,
2. cable section, joint, terminator, and transformer ranking for distribution feeders,
3. feeder Mean Time Between Failure (MTBF) estimates for distribution feeders, and
4. manhole vulnerability ranking.

Each specialized process was designed to handle data with particular characteristics. In its most general form, the process can handle diverse, noisy sources that are historical (static), semi-real-time, or real time; the process incorporates state-of-the-art machine learning algorithms for prioritization (supervised ranking or MTBF), and includes an evaluation of results via cross-validation on past data and by blind evaluation. The blind evaluation is performed on data generated as events unfold, giving a true barrier to information in the future. The data used by the machine learning algorithms include past events (failures, replacements, repairs, tests, loading, power quality events, etc.) and asset features (type of equipment, environmental conditions, manufacturer, specifications, components connected to it, borough and network where it is installed, date of installation, etc.).

Beyond the ranked lists and MTBF estimates, we have designed graphical user interfaces that can be used by managers and engineers for planning and decision support. Successful NYC grid decision support applications based on our models are used to assist with prioritizing repairs, prioritizing inspections, correcting overtreatment, generating plans for equipment replacement, and prioritizing protective actions for the electrical distribution system. How useful these interfaces are depends on how accurate the underlying predictive models are, and also on the interpretation of model results. It is an important property of our general approach that machine learning features are meaningful to domain experts, in that the data processing and the way causal factors are designed are transparent. The transparent use of data serves several purposes: It allows domain experts to troubleshoot the model or suggest extensions, it allows users to find the factors underlying the root causes of failures, and it allows managers to understand, and thus trust, the (non-black-box) model in order to make decisions.

We implicitly assume that data for the modeling tasks will have similar characteristics when collected by any power company. This assumption is broadly sound but there can be exceptions; for instance, feeders will have similar patterns of failure across cities, and data are probably collected in a similar way across many cities. However, the levels of noise within the data and the particular conditions of the city (maintenance history, maintenance policies, network topologies, weather, etc.) are specific to the city and to the methods by which data are collected and stored by the power company.

Our goals for this paper are to demonstrate that data collected by electrical utilities can be used to create statistical models for proactive maintenance programs, to show how this can be accomplished through knowledge
Fig. 2. Number of feeder outages in NYC per day during 2006-2007, lower curve with axis at left, and system-wide peak system load, upper curve with axis at right.

removing lead cable from their systems, it is going to be a long time before the work is complete.2 For instance, in NYC, the Public Service Commission has mandated that all 30,000 remaining PILC sections be replaced by 2020. Note, however, that some PILC sections have been in operation for a very long time without problems, and it is important to make the best use of the limited maintenance budget by replacing the most unreliable sections first.

2. For more details, see the article about replacement of PILC in NYC: http://www.epa.gov/waste/partnerships/npep/success/coned.htm.

As can be seen in Fig. 2, a small number of feeder failures occur daily in NYC throughout the year. The rate of failures noticeably increases during warm weather; air conditioning causes electricity usage to increase by roughly 50 percent during the summer. It is during these times when the system is most at risk.

The feeder failure ranking application, described in Section 5.1, orders feeders from most at-risk to least at-risk. Data for this task include: physical characteristics of the feeder, including characteristics of the underlying components that compose the feeder (e.g., percent of PILC sections); date put into service; records of previous "open autos" (feeder failures), previous power quality events (disturbances), scheduled work, and testing; electrical characteristics, obtained from electric load flow simulations (e.g., how much current a feeder is expected to carry under various network conditions); and dynamic data, from real-time telemetry attached to the feeder. Approximately 300 summary features are computed from the raw data, for example, the total number of open autos per feeder over the period of data collection. For Con Edison, these features are reasonably complete and not too noisy. The feeder failure rank lists are used to provide guidance for Con Edison's contingency analysis and winter/spring replacement programs. In the early spring of each year, a number of feeders are improved by removing PILC sections, changing the topology of the feeders to better balance loading, or to support changing power requirements for new buildings. Loading is light in spring, so feeders can be taken out of service for upgrading with low risk. Prioritizing feeders is

2.2 Cable Sections, Joints, Terminators and Transformers Ranking

In Section 2.1, we discussed the task of predicting whether a failure would happen to any component of a (multicomponent) feeder. We now discuss the task of modeling failures on individual feeder components; modeling how individual components fail brings an extra level to the understanding of feeder failure. Features of the components can be more directly related to localized failures and kept in a nonaggregated form; for instance, a feature for the component modeling task might encode that a PILC section was made by Okonite in 1950, whereas a feature for the feeder modeling task might instead be a count of PILC sections greater than 40 years old for the feeder. The component rankings can also be used to support decisions about which components to prioritize after a potentially susceptible feeder is chosen (guided by the results of the feeder ranking task). In that way, if budget constraints prohibit replacement of all the bad components of a feeder, the components that are most likely to fail can be replaced.

For Con Edison, the data used for ranking sections, joints, and hammerheads were diverse and fairly noisy, though in much better shape than the data used for the manhole events prediction project we describe next.

2.3 Manhole Ranking

A small number of serious "manhole events" occur each year in many cities, including fires and explosions. These events are usually caused by insulation breakdown of the low-voltage cable in the secondary network. Since the insulation can break down over a long period of time, it is reasonable to try to predict future serious events from the characteristics of past serious and nonserious events. We consider events within two somewhat simplified categories: serious events (fires, explosions, serious smoking manholes) and potential precursor events (burnouts, flickering lights, etc.). Potential precursor events can be indicators of an area-wide network problem, or they can indicate that there is a local problem affecting only 1-2 manholes.

Many power companies keep records of all past events in the form of trouble tickets, which are the shorthand notes taken by dispatchers. An example ticket for an NYC smoking manhole event appears in Fig. 3. Any prediction algorithm must consider how to effectively process these tickets.
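The paper does not show its ticket-processing pipeline at this point (it rests on substantial NLP work; see [23], [24]), but a minimal sketch of the kind of processing a ticket's free text must undergo might look as follows. The keyword rules, the classify_ticket helper, and the sample tickets are all invented for illustration and are far simpler than the real system.

```python
import re

# Illustrative keyword rules only; the real system uses much richer text
# processing (information extraction, spelling normalization, geocoding, etc.).
SERIOUS_PATTERNS = [r"\bexplosion\b", r"\bfire\b", r"\bsmok(e|ing)\b", r"\bblown\b"]
PRECURSOR_PATTERNS = [r"\bburnout\b", r"\bflickering\b", r"\bdim(ming)? lights?\b"]

def classify_ticket(text: str) -> str:
    """Return a coarse event class for one trouble-ticket free-text field."""
    t = text.lower()
    if any(re.search(p, t) for p in SERIOUS_PATTERNS):
        return "serious"
    if any(re.search(p, t) for p in PRECURSOR_PATTERNS):
        return "precursor"
    return "other"

# Hypothetical shorthand tickets, loosely imitating dispatcher notes.
tickets = [
    "F/O REPORTS SMOKING MANHOLE AT MAIN & 1ST ... HEAVY SMOKE",
    "CUSTOMER REPORTS FLICKERING LIGHTS, POSSIBLE BURNOUT",
    "SCHEDULED WORK COMPLETE, NO TROUBLE FOUND",
]
for t in tickets:
    print(classify_ticket(t), "|", t[:45])
```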
structured or unstructured data), information extraction, text normalization, using overlapping data to find inconsistencies, and inferring related or duplicated records. Statistics can be used to assess whether data are missing, and for sanity checks on inferential joins.

An inferential join is the process by which multiple raw data tables are united into one database. Inferential joins are a key piece of data processing. An example to illustrate the logic behind using basic pattern matching and statistics for inferential joining is the uniting of the main cable records to the raw manhole location data for the manhole event process in NYC, to determine which cables enter into which manholes. Main cables connect two manholes (as opposed to service or streetlight cables that enter only one manhole). The cable data come from Con Edison's accounting department, which is different from the source of the manhole location data. A raw join of these two tables, based on a unique manhole identifier that is the union of three fields—manhole type, number, and local 3-block code—provided a match to only about half of the cable records. We then made a first round of corrections to the data, where we unified the spelling of the manhole identifiers within both tables, and found matches to neighboring 3-block codes (the neighboring 3-block code is often mistakenly entered for manholes on a border of the 3 blocks). The next round of corrections used the fact that main cables have limited length: If only one of the two ends of the cable was uniquely matched to a manhole, with several possible manholes for the other end, then the closest of these manholes was selected (the shortest possible cable length). This processing gave a match to about three-quarters of the cable records. A histogram of the cable length then indicated that about 5 percent of these joined records represented cables that were too long to be real. Those cables were used to troubleshoot the join again. Statistics can generally assist in finding pockets of data that are not joined properly to other relevant data.

Data can be either: static (representing the topology of the network, such as number of cables, connectivity), semidynamic (e.g., only changes when a section is removed or replaced, or when a feeder is split into two), or dynamic (real time, with timestamps). The dynamic data can be measured electronically (e.g., feeder loading measurements), or it can be measured as failures occur (e.g., trouble tickets). For the semidynamic and dynamic data, a timescale of aggregation needs to be chosen for the features and labels for machine learning.

For all four applications, machine learning models are formed, trained, and cross-validated on past data, and evaluated via "blind test" on more recent data, discussed further in Section 4.

For ranking algorithms, the evaluation measure is usually a statistic of a ranked list (a rank statistic), and ranked lists are visualized as Receiver Operator Characteristic (ROC) curves. Evaluation measures include:

- Percent of successes in the top k percent: The percent of components that failed within the top k percent of the ranked list (similar to "precision" in information retrieval (IR)).
- AUC or weighted AUC: Area under the ROC curve [11], or Wilcoxon-Mann-Whitney U statistic, as formulated in Section 4 below. The AUC is related to the number of times a failure is ranked below a nonfailure in the list. Weighted AUC metrics (for instance, as used in the P-Norm Push algorithm [12] derived in Section 4) are more useful when the top of the list is the most important.

For MTBF/MTTF estimation, the sum of squared differences between estimated MTBF/MTTF and true MTBF/MTTF is the evaluation measure.

The evaluation stage often produces changes to the initial processing. These corrections are especially important for ranking problems. In ranking problems where the top of the list is often the most relevant, there is a possibility that the top of the list will be populated completely by outliers that are caused by incorrect or incomplete data processing, and thus the list is essentially useless. This happens particularly when the inferential joins are noisy; if a feeder is incorrectly linked to a few extra failure events, it will seem as if this feeder is particularly vulnerable. It is possible to troubleshoot this kind of outlier by performing case studies of the components on the top of the ranked lists.

4 MACHINE LEARNING METHODS: RANKING FOR RARE EVENT PREDICTION

The subfield of ranking in machine learning has expanded rapidly over the past few years as the information retrieval community has started developing and using these methods extensively (see the LETOR website4 and references therein). Ranking algorithms can be used for applications beyond information retrieval; our interest is in developing and applying ranking algorithms to rank electrical grid components according to the probability of failure. In IR, the goal is to rank a set of documents in order of relevance to a given query. For both electrical component ranking and IR, the top of the list is considered to be the most important.

4. http://research.microsoft.com/en-us/um/beijing/projects/letor/paper.aspx.

The ranking problems considered here fall under the general category of supervised learning problems and, specifically, supervised bipartite ranking. In supervised bipartite ranking tasks, the goal is to rank a set of randomly drawn examples (the "test set") according to the probability of possessing a particular attribute. To do this, we are given a "training set" that consists of examples with labels:

$$\{(x_i, y_i)\}_{i=1}^{m}, \qquad x_i \in \mathcal{X},\; y_i \in \{-1, +1\}.$$

In this case, the examples are electrical components, and the label we want to predict is whether a failure will occur within a given time interval. It is assumed that the training and test examples are both drawn randomly from the same unknown distribution. The examples are characterized by features:

$$\{h_j\}_{j=1}^{n}, \qquad h_j : \mathcal{X} \to \mathbb{R}.$$
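To make these definitions concrete, the sketch below scores a handful of labeled components and computes two of the rank statistics listed in Section 3: the percent of successes in the top k percent, and the AUC in its pairwise (Wilcoxon-Mann-Whitney) form, with ties counted against the ranker as in the empirical risk defined below. The scores and labels are invented; this is our own illustration, not the authors' evaluation code.

```python
import numpy as np

def top_k_percent_success(scores, labels, k=10.0):
    """Fraction of components in the top k% of the ranked list that failed.
    labels: +1 for failure, -1 for non-failure."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores)                      # descending by score
    n_top = max(1, int(np.ceil(len(scores) * k / 100.0)))
    return float(np.mean(labels[order[:n_top]] == 1))

def auc_pairwise(scores, labels):
    """Wilcoxon-Mann-Whitney AUC: probability that a failure outranks a
    non-failure; ties are counted against the ranker (pessimistic)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels != 1]
    wins = sum(float(p > n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Tiny invented example: six components, two of which failed.
scores = [0.9, 0.1, 0.7, 0.4, 0.2, 0.8]
labels = [1, -1, -1, -1, -1, 1]
print(top_k_percent_success(scores, labels, k=20))   # -> 1.0
print(auc_pairwise(scores, labels))                  # -> 1.0
```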
The features should encode all information that is relevant for predicting the vulnerability of the components, for instance, characteristics of past performance, equipment manufacturer, and type of equipment. To demonstrate, we can have: $h_1(x)$ = the age of component $x$, $h_2(x)$ = the number of past failures involving component $x$, and $h_3(x) = 1$ if $x$ was made by a particular manufacturer. These features can be either correlated or uncorrelated with failure prediction; the machine learning algorithm will use the training set to choose which features to use and determine the importance of each feature for predicting future failures.

Failure prediction is performed in a rare event prediction framework, meaning the goal is to predict events within a given "prediction interval" using data prior to that interval. There are separate prediction intervals for training and testing. The choice of prediction intervals determines the labels $y$ for the machine learning problem and the features $h_j$. Specifically, for training, $y_i$ is $+1$ if component $i$ failed during the training prediction interval and $-1$ otherwise. The features are derived from the time period prior to the prediction interval. For instance, as shown in Fig. 6, if the goal is to rank components for vulnerability with respect to 2010, the model is trained on features derived from prior to 2009 and labels derived from 2009. The features for testing are derived from pre-2010 data. The choice of the prediction interval's length is application dependent; if the interval is too small, there may be no way to accurately characterize failures. If the length is too large, the predictions may be too coarse to be useful. For manhole event prediction in NYC, this time period was chosen to be one year, and time aggregation was performed using the method of Fig. 6 for manhole event prediction. A more elaborate time aggregation scheme is discussed in Section 5.1 for feeder failure

Fig. 6. Sample timeline for rare event prediction.

$$P_D\{\mathrm{misrank}(f_\lambda)\} := P_D\{f_\lambda(x_+) \le f_\lambda(x_-) \mid y_+ = 1,\ y_- = -1\}. \qquad (1)$$

The notation $P_D$ indicates the probability with respect to a random draw of $(x_+, y_+)$ and $(x_-, y_-)$ from distribution $D$ on $\mathcal{X} \times \{-1, +1\}$. The empirical risk corresponding to (1) is the number of misranked pairs in the training set:

$$R(f_\lambda) := \sum_{\{k: y_k = -1\}} \sum_{\{i: y_i = 1\}} \mathbf{1}_{[f_\lambda(x_i) \le f_\lambda(x_k)]} = \#\mathrm{misranks}(f_\lambda). \qquad (2)$$

The pairwise misranking error is directly related to the (negative of the) area under the ROC curve; the only difference is that ties are counted as misranks in (2). Thus, a natural ranking algorithm is to choose a minimizer of $R(f_\lambda)$ with respect to $\lambda$:

$$\lambda^* \in \operatorname*{argmin}_{\lambda \in \mathbb{R}^n} R(f_\lambda),$$

and to rank the components in the test set in descending order of $f_\lambda(x) := \sum_j \lambda_j h_j(x)$.

There are three shortcomings to this algorithm: First, it is computationally hard to minimize $R(f_\lambda)$ directly. Second, the misranking error $R(f_\lambda)$ considers all misranks equally, in the sense that misranks at the top of the list are counted equally with misranks toward the bottom, even though, in failure prediction problems, it is clear that misranks at the top of the list should be considered more important. A third shortcoming is the lack of regularization usually imposed to enable generalization (prediction ability) in high dimensions. A remedy for all of these problems is to use special cases of the following ranking objective that do not fall into any of the traps listed above:

$$R_\ell^g(f_\lambda) := \sum_{\{k: y_k = -1\}} g\!\left(\sum_{\{i: y_i = 1\}} \ell\big(f_\lambda(x_i) - f_\lambda(x_k)\big)\right) + C\|\lambda\|_2^2,$$
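As one concrete special case of this family of objectives (not the one used in the paper; the P-Norm Push described next takes $g(z) = z^p$ with the exponential loss, and the feeder tool uses an SVM), the sketch below takes $g(z) = z$ and the logistic loss $\ell(z) = \log(1 + e^{-z})$ with an $\ell_2$ penalty, and minimizes it by plain gradient descent over a linear scoring function. All data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def margins(lam, Xp, Xn):
    # Pairwise margins f(x_i) - f(x_k) for every (failure i, non-failure k) pair.
    return (Xp @ lam)[:, None] - (Xn @ lam)[None, :]

def objective(lam, Xp, Xn, C):
    # g(z) = z and l(z) = log(1 + exp(-z)): a smooth, convex surrogate for
    # counting misranked pairs, plus an l2 penalty on the weights.
    m = margins(lam, Xp, Xn)
    return np.logaddexp(0.0, -m).sum() + C * (lam @ lam)

def gradient(lam, Xp, Xn, C):
    m = margins(lam, Xp, Xn)
    w = -1.0 / (1.0 + np.exp(m))                  # dl/dm for each pair
    grad = w.sum(axis=1) @ Xp - w.sum(axis=0) @ Xn
    return grad + 2.0 * C * lam

# Synthetic stand-ins: 5 failed components, 40 non-failed, 3 features each.
Xp = rng.normal(1.0, 1.0, size=(5, 3))
Xn = rng.normal(0.0, 1.0, size=(40, 3))
lam, C, step = np.zeros(3), 0.01, 1e-3
for _ in range(500):                              # plain gradient descent
    lam = lam - step * gradient(lam, Xp, Xn, C)

print("learned weights:", np.round(lam, 3))
print("objective value:", round(float(objective(lam, Xp, Xn, C)), 3))
```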
negative examples when p is large; the power p acts as a soft max. Since most of the value of the objective is determined by the top portion of the list, the algorithm concentrates more on the top. The full P-Norm Push algorithm is

$$\lambda^* \in \operatorname*{arg\,inf}_{\lambda} R_p(\lambda), \quad \text{where} \quad R_p(\lambda) := \sum_{\{k: y_k = -1\}} \left(\sum_{\{i: y_i = 1\}} \exp\big(-[f_\lambda(x_i) - f_\lambda(x_k)]\big)\right)^{p}.$$

Vector $\lambda^*$ is not difficult to compute, for instance by gradient descent. The P-Norm Push is used currently in the manhole event prediction tool. An SVM algorithm with $\ell_2$ regularization is currently used in the feeder failure tool.

Algorithms designed via empirical risk minimization are not designed to be able to produce density estimates, that is, estimates of $P(y = 1 \mid x)$, though in some cases it is possible, particularly when the loss function is smooth. These algorithms are instead designed specifically to produce an accurate ranking of examples according to these probabilities.

It is important to note that the specific choice of machine learning algorithm is not the major component of success in this domain; rather, the key to success is the data cleaning and processing as discussed in Section 3. If the machine learning features and labels are well constructed, any reasonable algorithm will perform well; the inverse holds too, in that badly constructed features and labels will not yield a useful model regardless of the choice of algorithm.

For our MTBF application, MTBF is estimated indirectly through failure rates; the predicted failure rate is converted to MTBF by taking the reciprocal of the rate. The failure rate is estimated rather than MTBF for numerical reasons: Good feeders with no failures have an infinite MTBF. The failure rate is estimated by regression algorithms, for instance, support vector machine regression (SVM-R) [16], Classification and Regression Trees (CART) [17], ensemble-based techniques such as Random Forests [18], and statistical methods, e.g., Cox Proportional Hazards [19].

5 SPECIFIC PROCESSES AND CHALLENGES

In this section, we discuss how the general process needs to be adapted in order to handle data processing and machine learning challenges specific to each of our electrical reliability tasks in NYC. Con Edison currently operates the world's largest underground electric system, which delivers up to a current peak record of about 14,000 MW of electricity to over 3 million customers. A customer can be an entire office building or apartment complex in NYC so that up to 15 million people are served with electricity. Con Edison is unusual among utilities in that it started keeping data records on the manufacturer, age, and maintenance history of components over a century ago, with an increased level of Supervisory Control and Data Acquisition (SCADA) added over the last 15 years. While real-time data are collected from all transformers for loading and power quality information, that is much less than will be needed for a truly smart grid.

We first discuss the challenges of feeder ranking and specifics of the feeder failure ranking process developed for Con Edison (also called "Outage Derived Data Sets—ODDS") in Section 5.1. We discuss the data processing challenges for cables, joints, terminators, and transformers in Section 5.2. The manhole event prediction process is discussed in Section 5.3, and the MTBF estimation process is discussed in Section 5.4.

5.1 Feeder Ranking in NYC

Con Edison data regarding the physical composition of feeders are challenging to work with; variations in the database entry and rewiring of components from one feeder to another make it difficult to get a perfect snapshot of the current state of the system. It is even more difficult to get snapshots of past states of the system; the past state needs to be known at the time of each past failure because it is used in training the machine learning algorithm. A typical feeder is composed of over a hundred cable sections, connected by a similar number of joints, and terminating in a few tens of transformers. For a single feeder, these subcomponents are a hodgepodge of types and ages; for example, a brand-new cable section may be connected to one that is many decades old; this makes it challenging to "roll-up" the feeder into a set of features for learning. The features we currently use are statistics of the ages, numbers, and types of components within the feeder; for instance, we have considered maxima, averages, and 90th percentiles (robust versions of the maxima).

Dynamic data presents a similar problem to physical data, but here the challenge is aggregation in time instead of space. Telemetry data are collected at rates varying from hundreds of times per second (for power quality data) to only a few measurements per day (weather data). These can be aggregated over time, again using functions such as max or average, using different time windows (as we describe shortly). Some of the time windows are relatively simple (e.g., aggregating over 15 or 45 days), while others take advantage of the system's periodicity, and aggregate over the most recent data plus data from the same time of year in previous years.

One of the challenges of the feeder ranking application is that of imbalanced data, or scarcity of data characterizing the failure class, which causes problems with generalization. Specifically, primary distribution feeders are susceptible to different kinds of failures, and we have very few training examples for each kind, making it difficult to reliably extract statistical regularities or determine the features that affect reliability. For instance, failure can be due to: concurrent or prior outages that stress the feeder and other feeders in the network; aging; power quality events (e.g., voltage spikes); overloads (that have seasonal variation, like summer heat waves); known weak components (e.g., joints connecting PILC to other sections); at-risk topologies (where cascading failures could occur); the stress of "HiPot" (high potential) testing; and deenergizing/reenergizing of feeders that can result in multiple failures within a short time span due to "infant mortality." Other data scarcity problems are caused by the range in MTBF of the feeders; while some feeders are relatively new and last for a long time between failures (for example, more than five years), others can have failures within a few tens of days of each other. In addition, rare seasonal effects (such as
Fig. 13. ROC-like curve for blind test of Crown Heights feeders.

Fig. 15. Influence of different categories of features under different weather conditions. Red: hot weather of August 2010; Blue: snowy January 2011; Yellow: rainy February 2011; Turquoise: typical fall weather in October 2010.

To determine what the most important features are, we create "tornado" diagrams like Fig. 15. This figure illustrates the influence of different categories of features under different weather conditions. For each weather condition, the influence of each category (the sum of coefficients $\lambda_j$ for that category divided by the total sum of coefficients $\sum_j \lambda_j$) is displayed as a horizontal bar. Only the top few categories are shown. For both snowy and hot weather, features describing power quality events have been the most influential predictors of failure according to our model.

The categories of features in Fig. 15 are: power quality, which are features that count power quality events (disturbances) preceding the outage over various time windows; system load in megawatts; outage history, which includes features that count and characterize the prior outage history (failure outages, scheduled outages, test failures, immediate failures after reenergization, and urgent planned outages); load pocket weight, which measures the difficulty in delivering power to the end user; transformers, particularly features encoding the types and ages of transformers (e.g., percent of transformers made by a particular manufacturer); stop joints and paper joints, which include features that count joint types, configurations, and age, where these features are associated with joining PILC to other PILC and more modern cable; cable rank, which encodes the results of the cable section ranking model; the count of a specific type of cable (XP and EPR) in various age categories; HiPot index features, which are derived by Con Edison to estimate how vulnerable the feeders are to heat-sensitive component failures; number of shunts on the feeder, where these shunts equalize the capacitance and also condition the feeder to power quality events; an indicator for nonnetwork customers, where a nonnetwork customer is a customer that gets electricity from a radial overhead connection to the grid; count of PILC sections along the feeder; percent of joints that are solid joints, which takes into account the fact that joining modern cable is simpler and less failure-prone than joining PILC; and shifted load features that characterize how well a feeder transfers load to other feeders if it were to go out of service.

Fig. 16. Linear regression used to determine the mean time between failures for 1 January 2002 (purple) and 31 December 2008 (yellow) in each underground network in the Con Edison system. Networks are arranged along the horizontal axis from worst (left) to best (right), according to Con Edison's "Network Reliability Index."

6.2 MTBF Modeling Evaluation

We have tracked the improvement in MTBF for each network as preventive maintenance work has been done by Con Edison to improve performance since 2002. To test whether this improvement is significant, we use a nonparametric statistical test, called the logrank test, that compares the survival distributions of two samples. In this case, we wished to determine if the 2009 summer MTBF values are statistically larger than the 2002 summer MTBF values. The performance of the system showed significant improvement, in that there is a less than one in a billion chance that the treatment population in 2009 did not improve over the control population from 2002. In 2009, for example, there were 1,468 out of 4,590 network-days that were failure free, or one out of every three summer days, but in the 2002 control group, there were only 908 network-days that were failure free, or one out of five summer days. The larger the percentage of network-days that were failure free, the lower the likelihood of multiple outages happening at the same time.

Fig. 16 shows MTBF predicted by our model for each underground network in the Con Edison system on both 1 January 2002 (purple) and 31 December 2008 (yellow). The yellow bars are generally larger than the purple bars, indicating an increase in MTBF.

We have performed various studies to predict MTBF of feeders. Fig. 17 shows the accuracy of our outage rate predictions for all classes of unplanned outages over a

Fig. 17. Scatter plot of SVM predicted outage rate versus actual rate for all classes of unplanned outages. The diagonal line depicts a perfect model.
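Section 5 describes estimating MTBF indirectly by regressing outage rates and taking the reciprocal, with SVM regression [16] among the methods used; Fig. 17 then compares predicted and actual rates. The sketch below is a rough stand-in for that pipeline, using scikit-learn's SVR on synthetic feeder features with invented hyperparameters; it is not the authors' model or feature set.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Synthetic stand-ins for feeder summary features (age statistics, counts,
# load) and a synthetic "true" outage rate (failures per day) tied to them.
n = 300
X = rng.normal(size=(n, 4))
true_rate = 0.01 * np.exp(0.8 * X[:, 0] - 0.5 * X[:, 1])
y = true_rate * rng.lognormal(sigma=0.2, size=n)          # observed outage rate

model = SVR(kernel="rbf", C=10.0, epsilon=0.001).fit(X[:200], y[:200])
pred_rate = np.clip(model.predict(X[200:]), 1e-4, None)   # avoid division by zero

# MTBF is the reciprocal of the predicted failure rate, as in Section 5.
pred_mtbf = 1.0 / pred_rate
print("predicted MTBF (days), first five feeders:", np.round(pred_mtbf[:5], 1))
print("correlation of predicted vs actual rate:",
      round(float(np.corrcoef(pred_rate, y[200:])[0, 1]), 2))
```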
Fig. 18. ROC-like curve for 2009 Bronx blind test of the machine learning ranking for vulnerability of manholes to serious events (fires and explosions).

Fig. 19. Screen capture of the Contingency Analysis Program tool during a fourth contingency event in the summer of 2008, with the feeders at most risk of failing next highlighted in red. The feeder ranking at the time of failure is shown in a blow-up ROC-like plot in the center.

Fig. 20. A screen capture of the Con Edison CAPT evaluation, showing an improvement in MTBF from 140 to 192 days if 34 of the most at-risk PILC sections were to be replaced on a feeder in Brooklyn at an estimated cost of $650,000.
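Fig. 20 describes the CAPT tool weighing an estimated MTBF gain (140 to 192 days) against a $650,000 replacement cost. Purely as an invented illustration of that kind of trade-off (this is not Con Edison's scoring rule, and the two smaller plans below are made up), candidate work plans could be compared by estimated MTBF gain per dollar:

```python
# Hypothetical candidate work plans: (plan name, estimated MTBF gain in days,
# estimated cost in dollars). Only the first plan's numbers come from Fig. 20.
plans = [
    ("Replace 34 at-risk PILC sections", 52, 650_000),
    ("Replace 10 worst joints",          11,  90_000),
    ("Swap two overloaded transformers",  6, 120_000),
]

# Rank by estimated MTBF gain per dollar spent (one simple prioritization rule).
for name, gain, cost in sorted(plans, key=lambda p: p[1] / p[2], reverse=True):
    print(f"{name}: {gain} days gained, ${cost:,}, "
          f"{gain / cost * 1e6:.1f} days per $1M")
```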
applications, the predictive accuracy gained by using a different technique is often small compared to the accuracy gained through other steps in the discovery process or by formulating the problem differently. The data in power engineering problems are generally assumed to be amenable to learning in their raw form, in contrast with our treatment of the data. The second reason our work is distinct from the power engineering literature is that the machine learning techniques that have been developed by the power engineering community are often "black-box" methods such as neural networks and genetic algorithms (e.g., [32], [33]). Neural networks and genetic algorithms can be viewed as heuristic, nonconvex optimization procedures for objectives that have multiple local minima; the algorithms' output can be extremely sensitive to the initial conditions. Our work uses mainly convex optimization procedures to avoid this problem. Further, "black-box" algorithms do not generally produce interpretable/meaningful solutions (for instance, the input-output relationship of a multilayer neural network is not generally interpretable), whereas we use mainly simple linear combinations of features.

We are not aware of any other work that addresses the challenges in mining historical power grid data of the same level of complexity as those discussed here. Our work contrasts with a subset of work in power engineering where data come entirely from Monte Carlo (MC) simulations [34], [35], and the MC simulated failures are predicted using machine learning algorithms. In a sense, our work is closer to data mining challenges in other fields such as e-commerce [10], criminal investigation [36], or medical patient processing [9] that encompass the full discovery process. For instance, it is interesting to contrast our work on manhole events with the study of Cornélusse et al. [37], who used domain experts to label "frequency incidents" at generators, and constructed a machine learning model from the frequency signals and labels. The manhole event prediction task discussed here also used domain experts to label trouble tickets as to whether they represent serious events; however, the level of processing required to clean and represent the tickets, along with the geocoding and information extraction required to pinpoint event locations, coupled with the integration of the ticket labeling machine learning task with the machine learning ranking task makes the latter task a much more substantial undertaking.

9 LESSONS LEARNED

There are several "take-away" messages from the development of our knowledge discovery processes on the NYC grid.

9.1 Prediction Is Possible

We have shown successes in predicting failures of electrical components based on data collected by a major power utility company. It was not clear at the outset that knowledge discovery and data mining approaches would be able to predict electrical component failures, let alone assist domain engineers with proactive maintenance programs. We are now involved in a Smart Grid Demonstration Project to verify that these techniques can be scaled to robust system use. For example, prior to our successes on the manhole event project, many Con Edison engineers did not view manhole event prediction as a realistic goal. The Con Edison trouble ticket data could easily have become what Fayyad et al. [8] consider a "data tomb." In this case, the remedy created by Columbia and Con Edison involved a careful problem formulation, the use of sophisticated text processing tools, and state-of-the-art machine learning techniques.

9.2 Data Are the Key

Power companies already collect a great deal of data; however, if these data are going to be used for prediction of failures, they should ideally have certain properties: First, the data should be as clean as possible, meaning, for instance, that unique identifiers should be used for each component. Second, if a component is replaced, it is important to record the properties of the old component (and its surrounding context if it is used to derive features) before the replacement; otherwise, it cannot be determined what properties are common to those being replaced.

For trouble tickets, unstructured text fields should not be eliminated. It is true that structured data are easier to analyze; on the other hand, free-text can be much more reliable. This was also discussed by Dalal et al. [38] in dealing with trouble tickets from web transaction data; in their case, a 40 character free-text field contained more information than any other field in the database. In the case of Con Edison trouble tickets, our representation based on the free-text can much more reliably determine the seriousness of events than the (structured) trouble type code. Further, the type of information that is generally recorded in trouble tickets cannot easily fit into a limited number of categories and asking operators to choose the category under time pressure is not practical. We have demonstrated that analysis of unstructured text is possible and even practical.

9.3 Machine Learning Ranking Methods Are Useful for Prioritization

Machine learning methods for ranking are relatively new, and currently they are not used in many application domains besides information retrieval. So far, we have found that in the domain of electrical grid maintenance, the key to success is in the interpretation and processing of data, rather than in the exact machine learning method used; however, these new ranking methods are designed exactly for prioritization problems, and it is possible that these methods can offer an edge over older methods in many applications. Furthermore, as data collection becomes more automated, it is possible that the dependence on processing will lessen, and there will be a substantial advantage in using algorithms designed precisely for the task of prioritization.

9.4 Reactive Maintenance Can Lead to Overtreatment

We have demonstrated with a statistical method called "propensity" [39] that the High Potential (HiPot) testing program at Con Edison was overtreating the "patient," i.e., the feeders. HiPot is, by definition, preventive maintenance in that incipient faults are driven to failure by intentionally stressing the feeder. We found, however, that the direct
current (DC) HiPot testing, in particular, was not outperforming a "placebo" control group which was scored by Con Edison to be equally "sick" but on which no work was done (Fig. 23). When a new alternating current (AC) test was added by Con Edison to avoid some of the overtreatment, we were able to demonstrate that as the test was being perfected on the system, the performance level increased and has now surpassed that of the control group. Indeed, operations and distribution engineering at Con Edison has since added a modified AC test that improved on the performance of the control group also. This interaction among machine learning, statistics, preventive maintenance programs, and domain experts will likely identify overtreatment in most utilities that are predominantly reactive to failures now. That has been the experience in other industries, including those for which these techniques have been developed, such as the automotive and aerospace industries, the military, and healthcare.

Fig. 23. Overtreatment in the High Potential Preventive Maintenance program was identified by comparing to control group performance. Modified and A/C Hipot tests are now used by Con Edison instead of DC Hipot tests.

10 CONCLUSIONS

Over the next several decades we will depend more on an aging and overtaxed electrical infrastructure. The reliability of the future smart grid will depend heavily on the new preemptive maintenance policies that are currently being implemented around the world. Our work provides a fundamental means for constructing intelligent automated policies: machine learning and knowledge discovery for prediction of vulnerable components. Our main scientific contribution is a general process that can be used by power utilities for failure prediction and preemptive maintenance. We showed specialized versions of this process for feeder ranking, feeder component ranking (cables, joints, hammerheads, and transformers), MTBF/MTTF estimation, and manhole vulnerability ranking. We have demonstrated, through direct application to the New York City power grid, that data already collected by power companies can be harnessed to predict, and to thus assist in preventing, grid failures.

REFERENCES

[1] Office of Electric Transmission and Distribution, United States Department of Energy, "Grid 2030": A Nat'l Vision for Electricity's Second 100 Years, July 2003.
[2] North American Electric Reliability Corporation (NERC), Results of the 2007 Survey of Reliability Issues, Revision 1, Oct. 2007.
[3] S.M. Amin, "U.S. Electrical Grid Gets Less Reliable," IEEE Spectrum, Jan. 2011.
[4] M. Chupka, R. Earle, P. Fox-Penner, and R. Hledik, "Transforming America's Power Industry: The Investment Challenge 2010-2030," technical report, The Brattle Group, prepared for The Edison Foundation, Washington, D.C., 2008.
[5] W.J. Frawley, G. Piatetsky-Shapiro, and C.J. Matheus, "Knowledge Discovery in Databases: An Overview," AI Magazine, vol. 13, no. 3, pp. 57-70, 1992.
[6] J.A. Harding, M. Shahbaz, S. Srinivas, and A. Kusiak, "Data Mining in Manufacturing: A Review," J. Manufacturing Science and Eng., vol. 128, no. 4, pp. 969-976, 2006.
[7] A. Azevedo and M.F. Santos, "KDD, SEMMA and CRISP-DM: A Parallel Overview," Proc. Int'l Assoc. Development of the Information Soc. European Conf. Data Mining, pp. 182-185, 2008.
[8] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery in Databases," AI Magazine, vol. 17, pp. 37-54, 1996.
[9] W. Hsu, M.L. Lee, B. Liu, and T.W. Ling, "Exploration Mining in Diabetic Patients Databases: Findings and Conclusions," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 430-436, 2000.
[10] R. Kohavi, L. Mason, R. Parekh, and Z. Zheng, "Lessons and Challenges from Mining Retail E-Commerce Data," Machine Learning, special issue on data mining lessons learned, vol. 57, pp. 83-113, 2004.
[11] A.P. Bradley, "The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145-1159, July 1997.
[12] C. Rudin, "The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List," J. Machine Learning Research, vol. 10, pp. 2233-2271, Oct. 2009.
[13] C. Rudin and R.E. Schapire, "Margin-Based Ranking and an Equivalence between AdaBoost and RankBoost," J. Machine Learning Research, vol. 10, pp. 2193-2232, Oct. 2009.
[14] Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer, "An Efficient Boosting Algorithm for Combining Preferences," J. Machine Learning Research, vol. 4, pp. 933-969, 2003.
[15] T. Joachims, "A Support Vector Method for Multivariate Performance Measures," Proc. Int'l Conf. Machine Learning, 2005.
[16] H. Drucker, C.J.C. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support Vector Regression Machines," Proc. Advances in Neural Information Processing Systems, vol. 9, pp. 155-161, 1996.
[17] L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen, CART: Classification and Regression Trees. Wadsworth Press, 1983.
[18] L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, Oct. 2001.
[19] D.R. Cox, "Regression Models and Life-Tables," J. Royal Statistical Soc., Series B (Methodological), vol. 34, no. 2, pp. 187-220, 1972.
[20] P. Gross, A. Salleb-Aouissi, H. Dutta, and A. Boulanger, "Ranking Electrical Feeders of the New York Power Grid," Proc. Int'l Conf. Machine Learning and Applications, pp. 725-730, 2009.
[21] P. Gross, A. Boulanger, M. Arias, D.L. Waltz, P.M. Long, C. Lawson, R. Anderson, M. Koenig, M. Mastrocinque, W. Fairechio, J.A. Johnson, S. Lee, F. Doherty, and A. Kressner, "Predicting Electricity Distribution Feeder Failures Using Machine Learning Susceptibility Analysis," Proc. 18th Conf. Innovative Applications of Artificial Intelligence, 2006.
[22] H. Becker and M. Arias, "Real-Time Ranking with Concept Drift Using Expert Advice," Proc. 13th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 86-94, 2007.
[23] C. Rudin, R. Passonneau, A. Radeva, H. Dutta, S. Ierome, and D. Isaac, "A Process for Predicting Manhole Events in Manhattan," Machine Learning, vol. 80, pp. 1-31, 2010.
[24] R. Passonneau, C. Rudin, A. Radeva, and Z.A. Liu, "Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods," Proc. 10th Int'l Conf. Computational Linguistics and Intelligent Text Processing, 2009.
[25] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications," Proc. 40th Anniversary Meeting Assoc. for Computational Linguistics, July 2002.
[26] P. Shivaswamy, W. Chu, and M. Jansche, "A Support Vector Approach to Censored Targets," Proc. Int'l Conf. Data Mining, 2007.
[27] A. Radeva, C. Rudin, R. Passonneau, and D. Isaac, "Report Cards for Manholes: Eliciting Expert Feedback for a Machine Learning Task," Proc. Int'l Conf. Machine Learning and Applications, 2009.
[28] H. Dutta, C. Rudin, R. Passonneau, F. Seibel, N. Bhardwaj, A. Radeva, Z.A. Liu, S. Ierome, and D. Isaac, "Visualization of Manhole and Precursor-Type Events for the Manhattan Electrical Distribution System," Proc. Workshop Geo-Visualization of Dynamics, Movement and Change, 11th AGILE Int'l Conf. Geographic Information Science, May 2008.
[29] N.D. Hatziargyriou, "Machine Learning Applications to Power Systems," Machine Learning and Its Applications, pp. 308-317, Springer-Verlag, 2001.
[30] A. Ukil, Intelligent Systems and Signal Processing in Power Engineering. Springer, 2007.
[31] L.A. Wehenkel, Automatic Learning Techniques in Power Systems. Springer, 1998.
[32] A. Saramourtsis, J. Damousis, A. Bakirtzis, and P. Dokopoulos, "Genetic Algorithm Solution to the Economic Dispatch Problem—Application to the Electrical Power Grid of Crete Island," Proc. Workshop Machine Learning Applications to Power Systems (ACAI), pp. 308-317, 2001.
[33] Y.A. Katsigiannis, A.G. Tsikalakis, P.S. Georgilakis, and N.D. Hatziargyriou, "Improved Wind Power Forecasting Using a Combined Neuro-Fuzzy and Artificial Neural Network Model," Proc. Fourth Helenic Conf. Artificial Intelligence, pp. 105-115, 2006.
[34] P. Geurts and L. Wehenkel, "Early Prediction of Electric Power System Blackouts by Temporal Machine Learning," Proc. ICML '98/AAAI '98 Workshop Predicting the Future: AI Approaches to Time Series Analysis, pp. 21-28, 1998.
[35] L. Wehenkel, M. Glavic, P. Geurts, and D. Ernst, "Automatic Learning for Advanced Sensing Monitoring and Control of Electric Power Systems," Proc. Second Carnegie Mellon Conf. Electric Power Systems, 2006.
[36] H. Chen, W. Chung, J.J. Xu, G. Wang, Y. Qin, and M. Chau, "Crime Data Mining: A General Framework and Some Examples," Computer, vol. 37, no. 4, pp. 50-56, Apr. 2004.
[37] B. Cornélusse, C. Wera, and L. Wehenkel, "Automatic Learning for the Classification of Primary Frequency Control Behaviour," Proc. IEEE Lausanne Power Tech Conf., 2007.
[38] S.R. Dalal, D. Egan, M. Rosenstein, and Y. Ho, "The Promise and Challenge of Mining Web Transaction Data," Statistics in Industry (Handbook of Statistics), R. Khatree and C.R. Rao, eds., vol. 22, Elsevier, 2003.
[39] P.R. Rosenbaum and D.B. Rubin, "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, vol. 70, no. 1, pp. 45-55, 1983.

Cynthia Rudin received the BS and BA degrees from the University at Buffalo and the PhD degree in applied and computational mathematics from Princeton University. She is an assistant professor in the Operations Research and Statistics Group at the MIT Sloan School of Management, and an adjunct research scientist at the Center for Computational Learning Systems, Columbia University.

David Waltz received all the degrees from the Massachusetts Institute of Technology. He is the director of the Center for Computational Learning Systems at Columbia University, with prior positions as president of the NEC Research Institute, director of Advanced Information Systems at Thinking Machines Corp., and faculty positions at Brandeis University and the University of Illinois at Urbana-Champaign. He is a fellow and past president of the Association for the Advancement of AI (AAAI), and a fellow of the ACM. He is a senior member of the IEEE.

Roger N. Anderson received the PhD degree from the Scripps Institution of Oceanography, the University of California at San Diego. He is a senior scholar at the Center for Computational Learning Systems, Columbia University. He is a member of the IEEE.

Albert Boulanger received the BS degree in physics from the University of Florida, Gainesville, in 1979 and the MS degree in computer science from the University of Illinois, Urbana-Champaign, in 1984. He is a senior staff associate at Columbia University's Center for Computational Learning Systems. He is a member of the IEEE.

Ansaf Salleb-Aouissi received the MS and PhD degrees from the University of Orleans, France, and an engineer degree in computer science from the University of Science and Technology Houari Boumediene (USTHB), Algeria. She joined Columbia University's Center for Computational Learning Systems as an associate research scientist after a postdoctoral fellowship at INRIA Rennes, France.

Maggie Chow received the BE degree from City College of New York and the master's degree from NYU-Poly. She is a section manager at Consolidated Edison of New York. Her responsibilities focus on lean management and system reliability.

Haimonti Dutta received the PhD degree in computer science and electrical engineering (CSEE) from the University of Maryland. She is an associate research scientist at the Center for Computational Learning Systems, Columbia University.

Philip N. Gross received the BS degree from Columbia University in 1999 and the MS degree from Columbia University in 2001. He is a software engineer at Google.

Bert Huang received the BS and BA degrees from Brandeis University and the MS and MPhil degrees from Columbia University. He is working toward the PhD degree in the Department of Computer Science, Columbia University.

Steve Ierome received the BS degree in electrical engineering from the City College of New York in 1975. He has 40 years of experience in distribution engineering design and planning at Con Edison, and three years of experience in power quality and testing of overhead radial equipment.

Delfina F. Isaac received both the BS degree in applied mathematics and statistics in 1998 and the MS degree in statistics in 2000 from the State University of New York at Stony Brook. She is a quality assurance manager and was previously a senior statistical analyst in the Engineering and Planning organization at Con Edison.

Arthur Kressner is the president of Grid Connections, LLC. He recently retired from Consolidated Edison Company in New York City with more than 40 years of experience, most recently as the director of research and development.

Rebecca J. Passonneau received the doctorate degree from the University of Chicago Department of Linguistics. She is a senior research scientist at the Center for Computational Learning Systems, Columbia University, where she works on knowledge extraction from noisy textual data, spoken dialogue systems, and other applications of computational linguistics.

Axinia Radeva received the MS degree in electrical engineering from the Technical University in Sofia, Bulgaria, and a second MS degree in computer science from Eastern Michigan University. She is a staff associate at Columbia University's Center for Computational Learning Systems.

Leon Wu received the BSc degree in physics from Sun Yat-sen University and the MS and MPhil degrees in computer science from Columbia University. He is working toward the PhD degree in the Department of Computer Science and is a senior research associate at the Center for Computational Learning Systems, Columbia University. He is a member of the IEEE.