
Cui and Wang J Infrastruct Preserv Resil (2023) 4:14
Journal of Infrastructure Preservation and Resilience
https://doi.org/10.1186/s43065-023-00081-w

RESEARCH Open Access

Analysis and prediction of pipeline corrosion defects based on data analytics of in-line inspection
Bingyan Cui1 and Hao Wang1*

Abstract
In-line inspection (ILI) is important to pipeline integrity management since it can detect pipeline defects and identify
potential failure locations through periodical examinations. However, effectively evaluating defects based on ILI data
is challenging. Measurements of ILI are easily influenced by instrument performance and maintenance activities,
leading to unmatched and imbalanced data. Poor ILI data make it difficult to establish defect growth models based
on multiple inspections. This study conducted a comprehensive analysis of ILI data for evaluating corrosion defects of
a steel pipeline. First, statistical analysis was performed on raw data to visualize distributions of corrosion depths and
the number of corrosions. Second, a hierarchical clustering method was used to classify corrosion severity levels based
on features of corrosion depth and estimated repair factor. The interaction effect between adjacent corrosions was
considered. Machine learning methods, including k-nearest neighbor, support vector machine, random forest, and
light gradient boosting machine, were used to explore the relationship between the location parameters of adjacent
corrosions and severity levels. Then, maximum corrosion depths and corrosion density, which are critical for pipeline
failure prediction, were filtered from raw ILI data of multiple inspections. Finally, distribution parameters were fitted to
establish stochastic growth models of maximum corrosion depth and corrosion number density. This study presents
a data analytics-based approach to obtain valid information from ILI data in practice.
Keywords Steel pipeline, In-line inspection, Corrosion, Interacting effect, Machine learning, Stochastic growth model

*Correspondence: Hao Wang, [email protected]
1 Department of Civil and Environmental Engineering, School of Engineering, Rutgers, The State University of New Jersey, New Brunswick, USA

Introduction
Pipelines play a significant role in transporting substantial amounts of oil and gas commodities across long distances. Steel pipes may suffer from different types of defects, including corrosion, cracking, and mechanical damage. If these defects are not properly monitored and repaired, they may cause public safety issues and economic losses. Pipeline integrity management has been developed to keep pipelines in safe operating conditions. It is a program that coordinates procedures, instruments, and tasks for evaluating the condition of pipelines. It can help schedule inspection and maintenance work to lower failure risk [27]. Generally, it includes three main components: defect detection and identification, defect growth prediction, and risk-based management.

Non-destructive evaluation methods are widely used for in-line inspection (ILI) to locate and identify anomalies on pipelines. Magnetic flux leakage (MFL) and ultrasonic tools are common ILI techniques used for corrosion inspection of steel pipes. Different ILI tools show different capabilities to identify corrosion features. Some ILI tools can identify corrosion features with unique geometries, including corrosion pits, axial grooving, and general corrosion, better than others. Generally, ILI tools have average accuracy within ±10% of pipe wall thickness

© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/.

Content courtesy of Springer Nature, terms of use apply. Rights reserved.



[26]. To predict defect growth and time to failure, ILI needs to be performed periodically. Defects from at least two inspections should be matched to their positions in the pipeline. However, each ILI uses its own coordinate system to locate detected corrosions in the pipeline [22]. As a result, these inconsistent coordinate systems lead to unmatched data from multiple ILI runs. In addition, the accuracy of ILI tools is greatly influenced by instrument error and environmental conditions [4]. Changes in technologies and maintenance activities make it difficult to obtain consistent ILI data from multiple years.

Corrosion is one of the most important defects that directly affect pipeline integrity. There are large quantities of ILI data on corrosion features. Therefore, extracting useful information from corrosion ILI data is important. Corrosion defects on a pipeline can be divided into single defects and interacting multiple defects [19]. Compared to a single defect, the analysis of interaction between multiple corrosion defects is more complex. Chiodo and Ruggieri [7] found that interactions between adjacent defects would influence the failure pressure of a pipeline significantly. Similarly, it was reported that the failure pressure of a pipeline decreased significantly due to interaction effects between adjacent corrosion defects [6, 14, 24, 25]. Therefore, the assessment of interacting corrosion defects from ILI data is desired.

In addition to the interacting effects that need to be considered, establishing an appropriate growth model is also important for pipeline integrity management. Reliable prediction of defect growth can help schedule future inspection and maintenance activities to prevent potential pipeline failures. There are two categories of defect growth models: the model-based approach and the data-driven approach. Model-based approaches primarily rely on physical models, such as finite element models, to predict defect growth. Liu et al. [15] employed Bayesian networks to update the likelihood of subsea pipeline damage and estimated the ultimate probability of damage. Based on this probability, they were also able to predict the remaining useful life of the pipeline. The data-driven approach uses ILI data or sample data to investigate the propagation of defects. Caleyo et al. [5] used the Markov chain to estimate the time-dependent growth rate of pipelines. Arzaghi et al. [1] used a dynamic Bayesian network (DBN) to predict varying growth rates of pitting and corrosion degradation in subsea pipelines. Instead of calculating corrosion growth rates, Mohd et al. [18] used the Weibull distribution to develop a time-dependent corrosion depth model that can predict the peak depth of a pipeline at any given age. Similarly, the Gumbel distribution was adopted to predict the growth of block maximum corrosion depth [13]. Further to this study, the peaks over threshold (POT) method was also used to improve the evaluation performance of extreme values [28]. Therefore, it is applicable to use different distribution parameters to establish stochastic growth models of different corrosion features.

In summary, how to process and analyze existing ILI data from multiple years is of great significance. Complex corrosion features may be unmatched on both spatial and temporal scales. Therefore, this study aimed to propose a comprehensive procedure to analyze both raw and filtered ILI data. Firstly, distributions of corrosion number and corrosion depth were visualized to provide a preliminary evaluation. Then, interacting effects of adjacent corrosions were considered to find the relationship between defect locations and defect severities. Finally, stochastic growth models were established to predict the evolution of maximum corrosion depth and corrosion number density.

Data collection
The ILI dataset was obtained from Magnetic Flux Leakage (MFL) tools in 2005, 2012 and 2016. A 12-mile steel pipeline, originally built in 1974, was inspected. Based on the history of replacements and relocations, the pipeline was divided into several segments (a-g), as listed in Table 1.

Table 1 General information about the pipeline

Line segment no.   Length (feet)   Outer diameter (in.)   Wall thickness (in.)   Pipe grade   Year installed
a                  51,241          30                     0.562                  5L X42       1974
b                  613             24                     0.438                  5L X42       1974
c                  772             20                     0.375                  5L X42       1982
d                  5,910           30                     0.562                  5L X60       2005
e                  1,698           30                     0.562                  5L X60       2005
f                  41              6.625                  0.28                   5L X42       1974
g                  654             30                     0.562                  5L X42       2002


In this study, external corrosion defects were selected for analysis since they are the major defect type observed in this pipeline. The pipeline consisted of 1,955 girth welds in total. Corrosion did not occur at every girth weld number. Therefore, only girth weld numbers with corrosion defects were extracted from the ILI dataset. From 2005 to 2016, about 400 girth weld numbers showed external corrosion. Table 2 lists the number of external corrosion defects found in each inspection year; some girth weld locations had multiple defects. Details of the ILI dataset included girth weld number, absolute distance, peak depth, length and orientation.

Table 2 Number of external corrosion defects found in different inspection years

Inspection year   Defect count
2005              792
2012              1345
2016              2508

Analysis methodology
Clustering
The objective of clustering is to divide observations into several clusters so that data points within the same cluster are similar to each other. In this study, a hierarchical clustering method was used to separate corrosion defects with similar features. Corrosion severity levels have a hierarchical structure, as most features of defects at a high level are more severe than those at a low level. Therefore, hierarchical clustering is suitable for the classification of corrosion severity level.

Hierarchical clustering includes divisive and agglomerative algorithms. The divisive algorithm is a top-down approach. At the beginning, all the observations belong to one cluster. Then, the observations are divided into more clusters according to a certain criterion such as distance. In contrast, the agglomerative algorithm is a bottom-up approach. Each observation is a cluster at first. Then, similar observations are merged into fewer clusters.

In this study, the agglomerative algorithm was used. It determines the similarity between clusters by measuring the distance between them. A smaller distance indicates higher similarity. Therefore, the clustering algorithm merges the two clusters with the shortest distance between them to construct the clustering tree. The distance between clusters can be measured through different methods, such as single, complete, centroid, average and Ward linkages. Single linkage clustering calculates the distance between two clusters as the shortest distance between any two data points, one from each cluster. In contrast, complete linkage clustering uses the maximum distance between any two data points, one from each cluster. Average linkage clustering calculates the average distance between all pairs of data points, one from each cluster. Centroid linkage clustering calculates the distance between the centroids of the two clusters. These linkage methods may be sensitive to anomalous data points and can easily generate unreasonable clusters, and the data points of corrosion defects have many outliers. Therefore, Ward linkage was used in this study. Ward linkage minimizes the loss incurred each time two clusters are combined. It calculates the error sum of squares (ESS) of each cluster. A small ESS value means that the data points in a cluster are tightly grouped. Therefore, clusters are combined into fewer clusters by minimizing the increase of ESS.

Classification
Machine learning methods
To find the relationship between defect location parameters and severity levels, different machine learning methods were used, including k-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), and light gradient boosting machine (LightGBM).

KNN is a supervised learning method proposed by Fix and Hodges [11]. In classification, an unlabeled data point is assigned the label that is most common among its k nearest training data points. Therefore, the selection of the k value and the distance measure are important for KNN.

SVM is initially a binary classification approach which aims to construct an optimal separating hyperplane [17]. The hyperplane has the maximum distance from the nearest sample points (called support vectors) on both sides. Therefore, SVM can balance the learning ability and the complexity of the model. By means of kernel functions, SVM is capable of mapping data from a low-dimensional space to a higher-dimensional space. There are three commonly used kernel functions: the linear kernel, the polynomial kernel and the radial basis function (RBF) kernel [20].

RF was proposed to solve classification, clustering, and prediction problems. It is a decision tree based machine learning algorithm evolved from bagging ensemble learning. Firstly, a forest consisting of multiple independent decision trees is randomly generated. Then, features are selected by calculating the information gain. From the root node, each tree is split according to the feature partitioning condition and the principle of minimum node impurity until the stopping rule is satisfied. Usually, information entropy is used to measure the purity of data [3]. Unlike the single decision tree method, random forest


randomly selects m subsamples from the original dataset with replacement. It then trains a single decision tree with k randomly selected features. The optimal features are chosen from these k features to split the nodes. After that, t decision trees are constructed by repeating the above process t times. The final prediction combines the predictions of all t trees, by majority vote for classification or by averaging for regression.

LightGBM is a gradient boosting tree algorithm in the ensemble learning family [12]. It uses a leaf-wise approach to select the best split, identifying the leaf node with the highest split gain out of all the leaf nodes in the decision tree. LightGBM samples training data points based on the gradient of each data point. A data point with a larger gradient contributes more to the information gain. The algorithm employs a histogram-based method to convert continuous feature values into k integer bins, thereby creating a histogram with a width of k. Subsequently, the algorithm iterates through the training data to compute the cumulative statistics for each discrete value in the histogram. In this case, only the discrete values of the sorted histogram need to be traversed when choosing the splitting point of a feature. Therefore, LightGBM can decrease the computation cost significantly.

Evaluation metrics
For binary classification, accuracy, precision, recall and F1 score are usually used to evaluate model performance. Accuracy, as defined by Baldi et al. [2], is the proportion of correctly classified samples in the testing dataset out of all the samples. Precision, on the other hand, is the percentage of true positive samples among all the predicted positive samples. Recall is the percentage of correctly predicted positive samples out of all actual positive samples. F1 score is a balanced score that combines precision and recall. These metrics can be calculated as shown in Eq. (1) to (4) [21].

$$\text{Accuracy} = \frac{TN + TP}{TN + FP + TP + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

where TP represents the number of positive samples correctly predicted as positive; TN represents the number of negative samples correctly predicted as negative; FP is the number of negative samples incorrectly predicted as positive; and FN is the number of positive samples incorrectly predicted as negative.

Multi-class classification can be regarded as multiple binary classifications. Therefore, the average value of the binary metrics can be used to evaluate the model performance. In this study, the weighted F1 score was calculated, because it takes into account the importance of different categories [16].

Defects growth predictions
Data preprocessing
In the ILI dataset, not all inspection locations had external corrosions. Normal points and manufactured bends were also common. Therefore, data points of external corrosions were filtered first. After that, the segments with replacement records were eliminated because replacement influences the defect growth.

However, to establish growth models, the filtered data still needed to be organized according to certain rules. For the growth model of corrosion depth, the maximum peak depth in each segment was selected, as maximum corrosion depth is one of the most important factors leading to pipeline failure. Then, only data showing a continuous increase in maximum depth over inspection years were retained. This approach yields a more conservative data subset, which is used to analyze the growth of maximum corrosion depth in further analysis. For the growth model of corrosion density, data points that deviated from the mean value by more than 3 times the standard deviation were removed. Except for these outliers, all data points were used for growth prediction of corrosion density.

Distribution models and parameters were used to predict future corrosions because these distributions can capture the trend of corrosion based on previous ILI data. The corrosion growth process is complicated, so using stochastic growth models instead of a simplified growth rate may be better.

Gumbel distribution
The Gumbel distribution is particularly useful in fitting the distribution of extreme values. Since maximum corrosion depth is an extreme value, the Gumbel distribution was selected to fit corrosion depth data. The Gumbel distribution is derived from the extreme value theory developed by Fisher and Tippett [10]. The probability distribution function of the maximum value for each sample converges to the generalized extreme value (GEV) distribution. The Gumbel distribution is a special form of the GEV distribution, as expressed in Eq. (5) [13].

$$G_t(z) = e^{-e^{-\frac{z - \mu(t)}{\sigma(t)}}} \tag{5}$$

where G_t(z) is the cumulative probability at maximum corrosion depth z; and z is the maximum corrosion


depth in this study; μ is the location parameter; σ is the scale parameter; and t is the inspection year.

Weibull distribution
The Weibull distribution used here is a non-stationary distribution that follows Cole's method [9]. It is usually used to model reliability. Weibull distributions can model right-skewed data, left-skewed data, or symmetric data [23]. In this study, corrosion number density is an index that reflects the number of defects per unit distance. The number density differs greatly between segments. Most corrosion number densities were below 5, so the distribution of the ILI data was strongly asymmetric, with most of its mass at low densities. In this case, the Weibull distribution

Fig. 1 Number of defects along the pipeline from 2005 to 2016 (a) scatter plot, (b) boxplot


can be of great help. The expression of the Weibull distribution is shown in Eq. (6) [13].

$$W_t(x) = \frac{\xi(t)}{\sigma(t)} \left(\frac{x}{\sigma(t)}\right)^{\xi(t)-1} \times e^{-\left(\frac{x}{\sigma(t)}\right)^{\xi(t)}}, \quad x \ge 0 \tag{6}$$

where W_t(x) is the density when the corrosion number density is equal to x; x is the corrosion number density; ξ is the shape parameter; σ is the scale parameter; and t is the inspection year.

Analysis results and discussion
Statistical analysis of corrosion depths and locations
To compare the distribution of corrosion defects, the number of defects in each girth weld number along the pipeline was counted, as shown in Fig. 1. Each girth weld

Fig. 2 Corrosion depth along the pipeline from 2005 to 2016 (a) scatter plot, (b) boxplot


Fig. 3 2D contour plot of peak depth in (a) 2005, (b) 2012, (c) 2016


Fig. 4 Density plots of (a) longitudinal locations of corrosion defects; (b) circumferential locations of corrosion defects

represents 30–40 feet of pipe length. It can be seen that the average number of defects increased from 2005 to 2016, which is consistent with the change in the total number of defects. In addition, the increase of corrosion defects around several girth weld numbers was more significant. For example, the number of defects in segments around girth weld number 11,080 was 9 in 2005. However, it increased to 77 and 170 in 2012 and 2016, respectively, suggesting that the soil environment in these segments has high corrosion potential. However, the soil survey data were not available.

The comparison of corrosion depth was based on peak depth in each girth weld number. The peak depth is defined as the maximum depth of the corrosion divided

Fig. 5 Hierarchical clustering of corrosion defects based on peak depth and ERF
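The agglomerative clustering with Ward linkage that underlies a dendrogram such as Fig. 5 can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function name `ward_clustering`, the sample (peak depth, ERF) pairs, and the choice of three clusters are all assumptions made for illustration. At each step the sketch merges the pair of clusters whose union gives the smallest increase in the error sum of squares (ESS).

```python
from itertools import combinations


def ward_clustering(points, n_clusters):
    """Merge clusters bottom-up, always choosing the pair whose union
    gives the smallest increase in the error sum of squares (ESS)."""
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]

    def ess_increase(a, b):
        # Ward's criterion: n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2
        na, nb = len(a[0]), len(b[0])
        d2 = sum((x - y) ** 2 for x, y in zip(a[1], b[1]))
        return na * nb / (na + nb) * d2

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ess_increase(clusters[ij[0]], clusters[ij[1]]),
        )
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _ in clusters]


# Hypothetical (peak depth in %, ERF) pairs, not taken from the ILI dataset.
# In practice the two features would usually be standardized first, because
# depth in percent otherwise dominates the squared distance.
defects = [(4.0, 0.90), (5.0, 0.91), (14.0, 0.91),
           (15.0, 0.92), (27.0, 0.94), (28.0, 0.93)]
groups = ward_clustering(defects, n_clusters=3)
```

This naive version recomputes all pairwise merge costs at every step, which is fine for a handful of points but would be slow for thousands of defects; a library routine would be the practical choice there.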


by the wall thickness at the location of the corrosion. Therefore, a larger peak depth means a more severe corrosion condition. The plot of peak depth along the pipeline is shown in Fig. 2. Interestingly, the average corrosion depth was observed to decrease from 2005 to 2016. This is reasonable because many small corrosion defects were generated in 2012 and 2016, which reduced the average depth. Ideally, the corrosion depth would

Table 3 Classification results of corrosion severity levels

Severity level   Average depth (%)   Average ERF   Average length (in.)   Average width (in.)
Low              4.34                0.910         1.4                    1.7
Medium           14.05               0.913         1.4                    1.9
High             27.31               0.937         3.3                    4.8

Fig. 6 Geographical distribution of corrosion severity levels

Fig. 7 Illustration of location parameters on a 2D plane for one pipe segment
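One possible reading of the location parameters illustrated in Fig. 7 is sketched below for a single pipe joint. OD is taken as the distance from a defect to the nearer girth weld, SL as the smaller longitudinal gap to the upstream or downstream neighbouring defect, and Sc as the smaller circumferential gap to those neighbours. The function names, the use of degrees for Sc, and the sample values are assumptions for illustration only; the paper does not specify the exact computation.

```python
def angular_gap(a_deg, b_deg):
    """Smallest angle between two orientations, wrapping at 360 degrees."""
    d = abs(a_deg - b_deg) % 360
    return min(d, 360 - d)


def location_parameters(defects, joint_length):
    """defects: list of (axial_position_ft, orientation_deg) within one
    joint, measured from the upstream girth weld.
    Returns one dict per defect, in axial order."""
    ordered = sorted(defects)
    results = []
    for i, (x, theta) in enumerate(ordered):
        # OD: distance to the nearer of the two girth welds (assumption).
        od = min(x, joint_length - x)
        # Upstream and downstream neighbours, where they exist.
        neighbours = [ordered[k] for k in (i - 1, i + 1)
                      if 0 <= k < len(ordered)]
        # SL / Sc: minimum of the upstream and downstream spacings.
        sl = min((abs(x - nx) for nx, _ in neighbours), default=None)
        sc = min((angular_gap(theta, nt) for _, nt in neighbours),
                 default=None)
        results.append({"OD": od, "SL": sl, "Sc": sc})
    return results


# Hypothetical defects in a 40 ft joint (position in feet, orientation in degrees).
params = location_parameters(
    [(10.0, 170.0), (12.5, 190.0), (30.0, 350.0)], joint_length=40.0
)
```

A defect with no neighbour in the joint gets `None` for SL and Sc here; how isolated defects were treated in the study is not stated, so that choice is also an assumption.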


Fig. 8 Correlation plot between three input variables: (a) scatter pair plot, (b) heat map
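The pairwise correlation coefficients behind a heat map like Fig. 8 (b) can be computed as sketched below. The sketch assumes the Pearson coefficient, which the paper does not name explicitly, and the sample values for OD, Sc and SL are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """columns: dict mapping a variable name to its list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical location parameters for six defects (not from the ILI dataset).
sample = {
    "OD": [3.0, 12.0, 7.5, 18.0, 1.0, 9.0],
    "Sc": [40.0, 15.0, 90.0, 10.0, 60.0, 35.0],
    "SL": [2.0, 0.5, 6.0, 1.5, 3.0, 0.8],
}
matrix = correlation_matrix(sample)
```

Coefficients near zero for every off-diagonal pair would support the conclusion that the three inputs are not collinear.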


increase over the years if no repair is made. However, this trend was not observed at each inspection location. The variations can be caused by changes in the instrument performance of ILI tools and by maintenance or repair activities between different inspections. However, information on these changes was not available in this study. Therefore, establishing the corrosion depth growth model based on raw ILI data was not suitable.

For the localized segment, corrosion depth presented a certain increasing trend. As shown in Fig. 1, the corrosion depths were the most severe around the distance of 4500–5000 feet. Therefore, 2D contours of the peak corrosion depth were plotted in these segments, as shown in Fig. 3. In Fig. 3, the x-axis is the absolute distance to the original location, and the y-axis is the orientation degree in the circumferential direction. For example, 0° and 360° represent the top of the pipeline, while 180° denotes the bottom of the pipeline. The figure shows that the area of maximum peak depth increased considerably in 2016 compared to 2005. In addition, it was found that maximum peak corrosion depths were located at around 4600 and 4800 feet with circumferential degrees of 150°-200°.

To better understand the corrosion distribution, the density plots of axial and circumferential locations of corrosion defects are shown in Fig. 4. It was found that external corrosions were more likely to occur at 10 and 30 feet relative to the pipeline joint. The circumferential degree was mainly around 180°, indicating that external corrosion tended to happen at the bottom of the steel pipe.

Interaction of adjacent defects on corrosion severity level
Classification of corrosion severity level
In this section, ILI data from 2016 were used to investigate the relationship between corrosion severity level and defect location parameters. Estimated repair factor (ERF) is the ratio of the maximum allowable operating pressure (MAOP) of the pipeline to the safe working pressure. Both peak depth and ERF are significant indicators of corrosion severity level. Higher peak depth and ERF indicate defects that are more dangerous. Therefore, all defects were divided into several clusters through the hierarchical clustering method based on defect depth and ERF, as shown in Fig. 5.

To better characterize the corrosion severity level, these clusters needed to be combined into fewer categories. Clustering methods can capture characteristics of data distribution based on distance criteria. However, to obtain reasonable severity levels, empirical methods should also be considered. Therefore, cluster 1, cluster 3 and cluster 4 were combined to represent the highest defect level, because these clusters had the highest values in peak depth or ERF. Similarly, cluster 5 was used to represent the medium severity level. Cluster 2 was the low severity level. It should be noted that the low, medium, and high severity levels here are relative within this ILI dataset.

Table 3 shows the classification results of severity level. From the table, it is obvious that the average defect depth, ERF, length and width were the largest at the high severity level. This is reasonable, as higher values mean higher risk of failure. Therefore, defects at the high severity level should be prioritized in maintenance scheduling. Furthermore, the geographical distribution of the three severity levels can be seen in Fig. 6. Defects with high severity level were mainly found in low latitudes, indicating that the soil environment in low latitudes may have high corrosion potential.

Relationship between corrosion location parameters and severity level
As stated above, corrosion severity level was classified based on defect depth and ERF. These two indicators are geometric parameters related to the defects themselves and do not take into account the interactions between multiple defects. In this study, three location parameters were selected to represent the interacting effect of adjacent defects: OD, Sc and SL. OD denotes the relative distance between the centroid of the corrosion defect and the pipeline girth weld. Sc is the distance between two adjacent corrosion defects in the circumferential direction, while SL is the distance between two adjacent corrosion defects in the longitudinal direction. Detailed illustrations of these parameters are depicted in Fig. 7.

It should be noted that the final values of OD, Sc and SL were the minimum of the upstream and downstream values. This is because the interacting effect of adjacent defects is mainly caused by the nearest ones. After obtaining the location parameters, the correlation between these factors should be analyzed first to avoid collinearity. Figure 8 (a) shows the correlation between each pair of variables. It can be observed that the scatter points are distributed randomly. No obvious linear or nonlinear relationship was found. From Fig. 8 (b), the correlation coefficients between each pair were also small, indicating that collinearity did not exist among these variables. Therefore, there is no need to reduce the dimensionality of these variables.

Table 4 Performance of different machine learning methods

Metrics       KNN      SVM      RF       LightGBM
Accuracy      72.71%   74.24%   74.83%   74.15%
Weighted F1   0.67     0.69     0.71     0.69

Machine learning methods were used to analyze the relationship between location parameters and severity of


Fig. 9 (a) importance of input variables on three severity levels of corrosion defects; and impact of input variables on (b) low; (c) medium; (d) high
severity level
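The attribution idea behind plots like Fig. 9 comes from the Shapley value of coalitional game theory: each input variable receives its average marginal contribution over all orderings of the variables, and the contributions sum to the total payoff. The sketch below computes exact Shapley values for a toy three-feature payoff. It is not the SHAP library and not the random forest model of the study; the function `toy_value` and its weights are hypothetical.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: the average marginal contribution of each
    feature over all orderings. `value` maps a frozenset of features
    to a payoff."""
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = frozenset()
        for f in order:
            contrib[f] += value(present | {f}) - value(present)
            present = present | {f}
    return {f: c / len(orderings) for f, c in contrib.items()}


# Hypothetical payoff: additive main effects plus one OD-SL interaction.
WEIGHTS = {"OD": 0.5, "SL": 0.3, "Sc": 0.1}


def toy_value(coalition):
    total = sum(WEIGHTS[f] for f in coalition)
    if {"OD", "SL"} <= coalition:
        total += 0.2  # interaction term, shared equally by OD and SL
    return total


phi = shapley_values(["OD", "SL", "Sc"], toy_value)
```

Exact enumeration is only practical for a few features, since the number of orderings grows factorially; tree-based explainers rely on faster algorithms for the same quantity.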


Fig. 10 Plot of maximum corrosion depth (a) density plot, (b) boxplot
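The subset of maximum corrosion depths summarized in Fig. 10 depends on the preprocessing rules described in the text: take the maximum peak depth per segment and inspection year, keep only segments whose maximum increases across inspections, and, for corrosion density, drop points more than three standard deviations from the mean. A minimal sketch of those rules follows. The record layout, the function names, the sample records, and the use of a strict increase and of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev

YEARS = (2005, 2012, 2016)


def max_depth_by_segment(records):
    """records: iterable of (segment_id, year, peak_depth_percent).
    Returns {(segment_id, year): maximum peak depth}."""
    best = {}
    for seg, year, depth in records:
        key = (seg, year)
        best[key] = max(depth, best.get(key, depth))
    return best


def increasing_segments(best, years=YEARS):
    """Keep segments inspected in every year whose maximum depth
    strictly increases from one inspection to the next."""
    kept = {}
    for seg in {s for s, _ in best}:
        series = [best.get((seg, y)) for y in years]
        if None not in series and all(a < b for a, b in zip(series, series[1:])):
            kept[seg] = series
    return kept


def drop_outliers(values, k=3.0):
    """Remove values farther than k standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) <= k * s]


# Hypothetical records (segment, year, peak depth in % of wall thickness).
records = [
    ("A", 2005, 10.0), ("A", 2005, 12.0), ("A", 2012, 15.0), ("A", 2016, 20.0),
    ("B", 2005, 20.0), ("B", 2012, 18.0), ("B", 2016, 25.0),
    ("C", 2005, 5.0), ("C", 2016, 9.0),
]
kept = increasing_segments(max_depth_by_segment(records))
```

Segment B is dropped because its maximum depth decreases between 2005 and 2012, and segment C because it has no 2012 record.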

corrosion. Taking OD, Sc and SL as the input variables and the three severity levels as responses, the fitting results of four machine learning methods (KNN, SVM, RF, LightGBM) are listed in Table 4. As can be seen, random forest shows the best performance among all methods.

The importance of the three location parameters was further analyzed using the random forest model. Shapley Additive Explanation (SHAP) was used to interpret the classification results. It is a method derived from coalitional game theory [8]. The Shapley value was originally developed to evaluate the contribution of each player to a game. In model interpretation, the prediction made by a model can be explained as the sum of the contribution or attribution values of each input variable used in the model. Therefore, the impact value of each feature can be calculated as its SHAP value. A higher SHAP value indicates a more important feature.

In Fig. 9 (a), class 0 represents the low severity level, class 1 the medium severity level, and class 2 the high severity level. It can be seen that OD had the most significant impact on the classification, followed by SL and Sc. Furthermore, positive and negative correlations between location parameters and severity level can be interpreted. As shown in Fig. 9 (b), high feature values of OD, SL and Sc were mainly distributed in the region of SHAP values greater than 0. That means greater values of OD, SL and Sc can make more defects belong to the low severity level. Similarly, in Fig. 9 (d), high feature values of OD, SL and Sc were mainly distributed in the region of SHAP values smaller than 0, which means that smaller values of OD,


Fig. 11 Fitting of Gumbel Distribution in different inspection years (a) histogram plot, (b) Q-Q plot

Table 5 Fitting parameters of Gumbel distributions

Inspection year    Location parameter (μ)    Scale parameter (σ)
2005               16.53                     7.68
2012               18.83                     7.96
2016               22.47                     8.97

SL and Sc can make defects more likely to belong to the high severity level. For the location parameters OD, SL, and Sc, smaller values indicate a higher potential for more critical corrosion defects. A smaller value of OD implies that the corrosion defect is located closer to the pipeline joint, increasing the likelihood of high severity. Similarly, smaller values of SL and Sc indicate that the corrosion defects are located closer together in the longitudinal and circumferential directions, respectively, which can lead to higher potential for interaction and


Fig. 12 Fitted lines of Gumbel distribution parameters (a) location parameter; (b) scale parameter with respect to inspection year

combined effect. Therefore, smaller values of these location parameters can cause more severe corrosion defects.

Stochastic growth models
Maximum corrosion depth
To predict the growth of maximum corrosion depth, the raw dataset was processed to obtain reasonable subsets for further analysis. The relative distance to the girth weld number was used to locate each defect. For defects at close locations, only the data showing a continuous growth trend of maximum corrosion depth over the inspection years were selected. That is, if the maximum corrosion depth keeps growing, the location can be considered most susceptible to external corrosion. This approach resulted in a smaller and more conservative data subset, which was used to analyze the growth of maximum corrosion depth. The density and box plots of the extracted data subset are shown in Fig. 10. They show that the maximum corrosion depth increases over time and can be used for growth prediction.

Considering that the Gumbel distribution is particularly useful for representing the probability distribution of the maximum value in a sample, the corrosion depths in the subset were fitted to the Gumbel distribution. The theoretical and empirical quantiles were compared through the histogram and Q-Q plots shown in Fig. 11. The theoretical and empirical quantiles are close, indicating that the fitted Gumbel distribution has high accuracy. The fitting parameters are shown in Table 5.

Linear regression was used to fit the Gumbel distribution parameters over the inspection years; the fitted lines can be seen in Fig. 12. The location and scale parameters both show an increasing trend, which indicates the growth of maximum corrosion depths. The linear models are expressed in Eqs. (7) and (8). After obtaining the two parameters, the Gumbel distribution can be used to calculate the density of maximum corrosion depth at the year of interest.
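As a check on the regression step, the coefficients of Eqs. (7) and (8) can be recovered from the three rows of Table 5 with ordinary least squares. The sketch below assumes a plain unweighted fit of each parameter against the inspection year; the helper name `fit_line` is illustrative and does not come from the study.

```python
def fit_line(years, values):
    """Ordinary least-squares fit of values = slope * year + intercept."""
    n = len(years)
    mean_t = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept


# Gumbel parameters per inspection year, taken from Table 5
years = [2005, 2012, 2016]
location = [16.53, 18.83, 22.47]   # mu
scale = [7.68, 7.96, 8.97]         # sigma

mu_slope, mu_intercept = fit_line(years, location)
sigma_slope, sigma_intercept = fit_line(years, scale)

# Rounds to 0.5161 and -1018.7, the coefficients of Eq. (7)
print(round(mu_slope, 4), round(mu_intercept, 1))
# Rounds to 0.1085 and -210.1, the coefficients of Eq. (8)
print(round(sigma_slope, 4), round(sigma_intercept, 1))
```

With only three inspection years, the fitted lines carry considerable uncertainty, so extrapolation far beyond 2016 should be treated with caution.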

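Once μ(t) and σ(t) are available from Eqs. (7) and (8), the density and exceedance probability of maximum corrosion depth at a year of interest follow directly. The sketch below assumes the standard Gumbel (maximum) form with location μ and scale σ. The target years and the depth threshold of 40 (in the units of Table 5) are hypothetical choices for illustration.

```python
import math


def gumbel_params(year):
    """Location and scale from the linear models in Eqs. (7) and (8)."""
    mu = 0.5161 * year - 1018.7
    sigma = 0.1085 * year - 210.1
    return mu, sigma


def gumbel_pdf(x, mu, sigma):
    """Density of the Gumbel (maximum) distribution."""
    z = (x - mu) / sigma
    return math.exp(-(z + math.exp(-z))) / sigma


def gumbel_cdf(x, mu, sigma):
    """Cumulative probability P(X <= x) of the Gumbel (maximum) distribution."""
    z = (x - mu) / sigma
    return math.exp(-math.exp(-z))


def exceedance(depth, year):
    """Probability that the maximum corrosion depth exceeds `depth` in `year`."""
    mu, sigma = gumbel_params(year)
    return 1.0 - gumbel_cdf(depth, mu, sigma)


threshold = 40.0  # hypothetical depth threshold, same units as Table 5
for year in (2016, 2025):
    mu, sigma = gumbel_params(year)
    print(year, round(mu, 2), round(sigma, 2),
          "pdf:", gumbel_pdf(threshold, mu, sigma),
          "P(depth > threshold):", exceedance(threshold, year))
```

Because both parameters increase with time, the probability of exceeding a fixed depth threshold grows with the year of interest.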

Fig. 13 Plot of corrosion number density (a) density plot, (b) boxplot

μ(t) = 0.5161t − 1018.7    (7)

σ(t) = 0.1085t − 210.1    (8)

where μ is the location parameter, σ is the scale parameter, and t is the inspection year.

Corrosion number density
In this study, corrosion number density denotes the number of defects per unit distance. Therefore, the number of corrosion defects in each girth weld segment was used to construct the probabilistic model of number growth. As stated before, the number of defects tended to increase over time. However, there were many outliers in the raw data, which had a negative effect on the fitting of growth distribution parameters. Therefore, the raw data for the number of defects were processed to filter these outliers. Data points that deviated from the mean value by more than three times the standard deviation were deleted. The density and box plots of the processed data can be seen in Fig. 13.

Then, the Weibull distribution was used to fit the corrosion number density. As shown in Fig. 14, most of the observations were located in the tails, which was consistent with the non-stationary assumption of the Weibull distribution [9]. In addition, the percentage of corrosion number density with high values increased over time. Therefore, more corrosion defects could be found in the same segment in 2016 than in 2005 and 2012.
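The three-standard-deviation rule used to remove outliers from the defect counts can be sketched as follows. The sketch assumes a single pass using the population standard deviation; the study does not state which estimator was used or whether the filter was applied iteratively, and the sample counts are made up for illustration.

```python
import statistics


def filter_three_sigma(values):
    """Keep values within three standard deviations of the mean (single pass)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= 3 * std]


# Hypothetical defect counts per girth weld segment, with one extreme value
counts = [2, 3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 3, 2, 3, 2, 4, 3, 2, 60]
filtered = filter_three_sigma(counts)
print(len(counts), "->", len(filtered))
```

Note that the rule needs a reasonably large sample: with very few points, no single value can lie more than three standard deviations from the mean.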


Fig. 14 Fitting of Weibull Distribution in different inspection years (a) histogram plot, (b) Q-Q plot
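The shift of corrosion number density toward higher values can be illustrated with the fitted parameters reported in Table 6. The sketch below assumes the standard two-parameter Weibull form with shape ξ and scale σ; the threshold of 5 defects per unit distance is a hypothetical value chosen only to compare the upper tails across inspection years.

```python
import math

# (shape, scale) per inspection year, taken from Table 6
WEIBULL_PARAMS = {2005: (1.52, 2.40), 2012: (1.37, 2.94), 2016: (1.29, 3.92)}


def weibull_pdf(x, shape, scale):
    """Density of the two-parameter Weibull distribution for x >= 0."""
    if x < 0:
        return 0.0
    return (shape / scale) * (x / scale) ** (shape - 1) * math.exp(-((x / scale) ** shape))


def weibull_sf(x, shape, scale):
    """Survival function P(X > x) of the two-parameter Weibull distribution."""
    if x <= 0:
        return 1.0
    return math.exp(-((x / scale) ** shape))


threshold = 5.0  # hypothetical corrosion number density threshold
tail = {year: weibull_sf(threshold, *params) for year, params in WEIBULL_PARAMS.items()}
for year in sorted(tail):
    print(year, "P(density > threshold):", round(tail[year], 3))
```

Under these assumptions the upper-tail probability grows from 2005 to 2012 to 2016, in line with the observation that high corrosion number densities became more frequent over time.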

The fitted Weibull parameters can be seen in Table 6, and the linear regression of these parameters is shown in Fig. 15. The shape parameter decreased over time, whereas the scale parameter showed an increasing trend. The linear models are expressed in Eqs. (9) and (10). After obtaining the two parameters, the Weibull distribution can be used to calculate the density of corrosion number density at the year of interest.

ξ(t) = −0.021t + 43.56    (9)

σ(t) = 0.1313t − 260.94    (10)

Conclusions
This study used statistical analysis and data analytics to analyze ILI data of pipeline corrosion. First, the distributions of corrosion depths and the number of corrosion defects in the raw data were visualized. Then, the corrosion severity levels were classified based on the clustering of corrosion depth and ERF. The relationship between location


Fig. 15 Fitted lines of Weibull distribution parameters (a) shape parameter; (b) scale parameter with respect to inspection year

Table 6 Fitting parameters of Weibull distributions

Inspection year    Shape parameter (ξ)    Scale parameter (σ)
2005               1.52                   2.40
2012               1.37                   2.94
2016               1.29                   3.92

parameters and corrosion severity level, considering interactive effects, was explored. In addition, raw ILI data were processed to obtain useful data for establishing stochastic growth prediction models for maximum corrosion depth and corrosion number density.

The number of corrosion defects increased significantly over the years. However, average corrosion depths decreased due to the occurrence of small corrosion defects and maintenance activities. In the longitudinal direction, corrosion defects were more likely to occur at 10 and 30 feet relative to the pipeline joint, while in the circumferential direction, corrosion defects were prone to occur at the bottom of the pipeline. Within each girth weld segment, the locations with shorter spacing between adjacent defects and the locations close to the girth weld were more prone to severe corrosion. For the entire pipeline, corrosion with a higher severity level was mainly located at lower latitudes, indicating that the soil environment at low latitudes may cause high corrosion potential.

Growth trends were observed for two corrosion characteristics: maximum corrosion depth and corrosion number density. The Gumbel and Weibull distribution parameters of the stochastic growth models can be used to predict the evolution of maximum corrosion depth and corrosion number density, respectively. This study presents a detailed approach for obtaining valid information from ILI data in practice, which can be further used for failure prediction and maintenance planning in a pipeline integrity management system.


Authors' contributions
B.Y. Cui: Data Curation, Investigation, Formal analysis, Original draft preparation; H. Wang: Supervision, Methodology, Writing- Reviewing and Editing. The authors read and approved the final manuscript.

Funding
USDOT Pipeline and Hazardous Materials Safety Administration (PHMSA).

Availability of data and materials
The dataset is provided by a third-party pipeline operator and can only be available after the specific request is made and approved.

Declarations

Ethics approval and consent to participate
Not applicable.

Competing interests
The authors declare no competing interests.

Received: 23 March 2023 Revised: 28 March 2023 Accepted: 4 May 2023

References
1. Arzaghi E, Abbassi R, Garaniya V, Binns J, Chin C, Khakzad N, Reniers G (2018) Developing a dynamic model for pitting and corrosion-fatigue damage of subsea pipelines. Ocean Eng 150:391–396. https://doi.org/10.1016/j.oceaneng.2017.12.014
2. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424
3. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
4. Caleyo F, Alfonso L, Espina-Hernandez JH, Hallen J (2007) Criteria for performance assessment and calibration of in-line inspections of oil and gas pipelines. Measur Sci Technol 18(7):1787
5. Caleyo F, Velázquez JC, Valor A, Hallen JM (2009) Markov chain modelling of pitting corrosion in underground pipelines. Corros Sci 51(9):2197–2207. https://doi.org/10.1016/j.corsci.2009.06.014
6. Chen H, Shu D (2001) Simplified limit analysis of pipelines with multi-defects. Eng Struct 23(2):207–213
7. Chiodo MS, Ruggieri C (2009) Failure assessments of corroded pipelines with axial defects using stress-based criteria: numerical studies and verification analyses. Int J Press Vessel Pip 86(2–3):164–176
8. Cohen S, Dror G, Ruppin E (2007) Feature selection via coalitional game theory. Neural Comput 19(7):1939–1961
9. Coles S, Bawa J, Trenner L, Dorazio P (2001) An introduction to statistical modeling of extreme values (Vol. 208). Springer
10. Fisher RA, Tippett LHC (1928) Limiting forms of the frequency distribution of the largest or smallest member of a sample. Paper presented at the Mathematical proceedings of the Cambridge philosophical society
11. Fix E, Hodges JL (1989) Discriminatory analysis. Nonparametric discrimination: Consistency properties. Int Stat Rev/Revue Internationale de Statistique 57(3):238–247
12. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Liu T-Y (2017) LightGBM: a highly efficient gradient boosting decision tree. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach
13. Khan F, Yarveisy R, Abbassi R (2021) Cross-country pipeline inspection data analysis and testing of probabilistic degradation models. J Pipeline Sci Eng 1(3):308–320
14. Li X, Bai Y, Su C, Li M (2016) Effect of interaction between corrosion defects on failure pressure of thin wall steel pipeline. Int J Press Vessel Pip 138:8–18
15. Liu Y, Hu H, Zhang D (2013) Probability analysis of damage to offshore pipeline by ship factors. Transp Res Rec 2326(1):24–31. https://doi.org/10.3141/2326-04
16. Mandl T, Modha S, Majumder P, Patel D, Dave M, Mandlia C, Patel A (2019) Overview of the hasoc track at fire 2019: Hate speech and offensive content identification in indo-european languages. Paper presented at the Proceedings of the 11th forum for information retrieval evaluation
17. Mathur A, Foody GM (2008) Multiclass and binary SVM classification: Implications for training and classification users. IEEE Geosci Rem Sens Lett 5(2):241–245
18. Mohd MH, Kim DK, Kim DW, Paik JK (2014) A time-variant corrosion wastage model for subsea gas pipelines. Ships Offshore Struct 9(2):161–176
19. Norske VD (2004) DNV Recommended practice. Corroded Pipelines, RP-F10.
20. Patle A, Chouhan DS (2013) SVM kernel functions for classification. Paper presented at the 2013 International Conference on Advances in Technology and Engineering (ICATE)
21. Powers DM (2020) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:.16061.
22. Reber K, Beller M, Barbian A (2006) Run comparisons: using in-line inspection data for the assessment of pipelines. Paper presented at the Hannover: Pipeline Technology 2006 Conference
23. Sharif MN, Islam MN (1980) The Weibull distribution as a general model for forecasting technological change. Technol Forecast Soc Change 18(3):247–256
24. Silva R, Guerreiro J, Loula A (2007) A study of pipe interacting corrosion defects using the FEM and neural networks. Adv Eng Softw 38(11–12):868–875
25. Sun J, Cheng YF (2018) Assessment by finite element modeling of the interaction of multiple corrosion defects and the effect on failure pressure of corroded pipelines. Eng Struct 165:278–286
26. Vanaei H, Eslami A, Egbewande A (2017) A review on pipeline corrosion, in-line inspection (ILI), and corrosion growth rate models. Int J Press Vessel Pip 149:43–54
27. Xie M, Tian Z (2018) A review on pipeline integrity management utilizing in-line inspection data. Eng Fail Anal 92:222–239
28. Yarveisy R, Khan F, Abbassi R (2022) Data-driven predictive corrosion failure model for maintenance planning of process systems. Comput Chem Eng 157:107612

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
