
Alexandria Engineering Journal 119 (2025) 587–597

Contents lists available at ScienceDirect

Alexandria Engineering Journal


journal homepage: www.elsevier.com/locate/aej

Original article

On the implications of artificial intelligence methods for feature engineering in reliability sector with computer knowledge graph

Heling Jiang a,∗, Yongping Xia b, Changjie Yu b, Zhao Qu b, Huaiyong Li a

a School of Information, Guizhou University of Finance and Economics, Guiyang, Guizhou, 550025, China
b R & D Department, Guizhou Zhonghui Technology Development Co., Ltd., Guiyang, Guizhou 430022, China

ARTICLE INFO

Keywords:
Artificial intelligence methods
Reliability sector
Feature engineering
Pump fault
Computer knowledge graph

ABSTRACT

This work employs support vector machine (SVM), K-Nearest Neighbors (KNN), and logistic regression models to predict the health state of a pump and to establish fault diagnosis. From features such as vibration, motor temperature, pressure, and flow rate, the models categorize the state of the pump into two classes: No Fault and Fault Detected. This makes it possible to detect specific faults and assists in creating preventive maintenance. Post analysis, it was inferred that, with an accuracy of 0.92, the SVM with a linear kernel outperformed the competing models. While the KNN performed marginally worse with an accuracy of 0.85, the SVM with RBF and polynomial kernels as well as logistic regression each attained an accuracy of 0.91. These findings highlight the linear-kernel SVM's superior generalization ability, which makes it the best option for pump system defect identification. For defect detection, prioritizing the SVM with a linear kernel guarantees precise predictions, allowing for proactive maintenance and minimizing downtime. To improve operational efficiency and lower long-term maintenance costs, policy recommendations include standardizing data collection techniques, investing in real-time monitoring systems, and implementing machine learning-based predictive maintenance across industries.

1. Introduction

Technology has become an essential element of our daily lives, impacting both our personal and professional lives. Numerous breakthroughs occur instantaneously, influencing our experiences in various ways [1]. Likewise, the industrial landscape is currently shaped by data-centric methodologies, with the dependability sector positioned at the forefront of innovation and requirement. In this context, artificial intelligence (AI) has rapidly emerged as a transformational force that not only redefines corporate forecasting difficulties but also manages operations and maintains essential infrastructure.

Similarly, the oil and gas industry is regarded as one of the most complicated sectors due to its sophisticated operations and significant contributions to production, transportation, and industrialization [2]. The sector requires efficiency to meet the timely demands of the economy, and ensuring the reliability of industrial operations is essential for sustaining efficiency and safety. Recent technological advancements, particularly in AI, have transformed workflows and enhanced operational efficiency while improving overall safety. Consequently, feature engineering has become critically significant in converting raw data into valuable insights and improving operational efficiency, which is essential for advancement in AI [3].

Feature engineering (FE), a component of data engineering, is essential for enhancing machine learning (ML) algorithms, particularly in guaranteeing system stability in applications like predictive maintenance [3]. Utilizing advanced AI technology, predictive maintenance anticipates probable system failures, enhancing safety and minimizing costs through economical sensors. Prior to the implementation of any machine learning algorithm, it is essential to transform raw data sources into meaningful features that reveal significant patterns, as this enhances algorithm efficiency and facilitates data comprehension [4].

Due to the absence of a comprehensive methodology, it is essential to test multiple approaches to improve a model's performance, which typically requires significant time. Data cleaning rectifies inaccuracies, outliers, and redundancies, and it often intersects with the analogous feature selection and feature extraction procedures. Both are components of feature engineering: feature selection eliminates extraneous data, while feature extraction creates new attributes, for example through dimensionality reduction, which can occasionally lead to confusion between the two concepts. Moreover, advancements in maintenance technology facilitate effective data processing and improve forecasting precision.
∗ Corresponding author.
E-mail addresses: [email protected] (H. Jiang), [email protected] (Y. Xia), [email protected] (C. Yu), [email protected] (Z. Qu),
[email protected] (H. Li).

https://doi.org/10.1016/j.aej.2025.01.093
Received 15 December 2024; Received in revised form 20 January 2025; Accepted 24 January 2025
Available online 10 February 2025
1110-0168/© 2025 The Authors. Published by Elsevier B.V. on behalf of Faculty of Engineering, Alexandria University. This is an open access article under the CC
BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
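The cleaning, scaling, and feature-selection steps discussed in the introduction can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: scikit-learn is assumed to be available, and the column names and synthetic sensor values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel

# Hypothetical sensor table; column names and distributions are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "vibration": rng.normal(1.0, 0.2, 200),
    "motor_temp": rng.normal(60.0, 5.0, 200),
    "pressure": rng.normal(3.0, 0.5, 200),
    "flow_rate": rng.normal(120.0, 10.0, 200),
})
df.iloc[::25, 1] = np.nan                     # simulate sensor gaps

# Data cleaning: impute missing values with the column median.
df = df.fillna(df.median())

# Feature scaling: standardize so no feature dominates by magnitude.
X = StandardScaler().fit_transform(df)
y = 2.0 * df["vibration"].to_numpy() + rng.normal(0, 0.1, 200)  # toy target

# Feature selection: an elastic net (L1 + L2) shrinks weak coefficients,
# and SelectFromModel keeps only features with sufficiently large weights.
selector = SelectFromModel(ElasticNet(alpha=0.1, l1_ratio=0.5)).fit(X, y)
print(selector.get_support())                 # boolean mask of retained features
```

In this toy setup only the feature that actually drives the target survives the selection; real pipelines would tune `alpha` and `l1_ratio` by cross-validation.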

Additionally, AI is presently effectively utilized across several sectors, including automotive, manufacturing, energy, and aircraft transportation [3].

Numerous scholars have employed feature engineering techniques for predictive analysis across various domains. Dai et al. [5] predicted high entropy alloy phase formation with feature engineering algorithms, attaining superior accuracy compared to traditional techniques. Paulson et al. [6] utilized machine learning to predict the lifespans of lithium-ion batteries; the research employed a feature selection technique to improve longevity predictions. Mumuni and Mumuni [7] examined the essential function of feature engineering in automated data processing for deep learning pipelines. The work enhances machine learning models by transforming complicated and heterogeneous input into solid predictions.

Additionally, Fan et al. [8] employed feature engineering to enhance the development of forecasting models by addressing data dimensionality and mitigating prediction difficulties. Data-driven solutions reduce the necessity for human feature selection and enhance forecasting accuracy in various building modeling procedures. Considering the extensive scope of feature engineering and the preceding discussion, this study aims to achieve the following objectives:

• Investigate the utilization of feature engineering within the reliability industry and its impact on predictive models.
• Examine AI-driven methodologies that optimize the feature engineering process to enhance model efficiency and performance.
• Examine the impact of automated data processing on enhancing machine learning models within the dependability sector.

The rationale for this research stems from the observation that failure in pump operation poses a major concern in industries because of the high direct financial costs and safety risks that may follow. Thus, this research is positioned to make a noteworthy contribution to the literature by incorporating AI-derived feature engineering approaches into predictive maintenance models for pumps, thereby improving the precision of failure identification and the protocols for creating predictive models. This paper's strength lies in automating feature extraction and selection, which further simplifies the existing data, enhances the model, and allows for better maintenance predictions. Consequently, this work seeks to provide a best-practice model for predictive maintenance to deliver dependable and economically viable operations in various industries.

The remainder of the study is structured as follows. Section 2 addresses the literature review of studies grounded in contemporary research. Section 3 pertains to the data and methods of the current investigation. Section 4 provides estimated results along with their analysis. Finally, Section 5 delineates the conclusion and implications derived from the study's findings.

2. Literature review

AI and machine learning (ML) are valuable instruments for addressing the intricacies of systems engineering. Adeyeye and Akanbi [9] examined the impact of AI and ML on systems engineering, particularly in analyzing extensive datasets to predict system faults and improve performance. The research emphasizes multidisciplinary approaches that transform educational programs to utilize AI effectively, and it encapsulates the benefits and problems associated with leveraging AI. Wang et al. [1] identify AI applications in feature engineering that focus on improving the learning and academic performance of individuals. The study included correlation analysis, which introduces specific inefficiencies, hence necessitating the employment of Adaptive Lasso (ALasso), artificial neural networks (ANN), and support vector regression (SVR). The projected outcomes further underscored the exploration of data-driven techniques for AI methodologies in enhancing the efficacy of educational sectors.

In another study, Sircar et al. [10] examined the application of machine learning and artificial intelligence in addressing data processing challenges within the oil and gas sector. The study elucidated many methodologies that enhance data processing efficiency and encapsulated prior applications with their constraints, while underscoring technologies that mitigate hazards and maintenance expenses, thus optimizing the decision-making process. Similarly, in a separate study on the oil and gas business, Arinze et al. [2] examine the potential of AI by addressing efficiency and optimization within the industry's value chain through predictive analysis and deep/machine learning. The study indicates that AI has the potential to transform the oil and gas limited (OGL) sector by enhancing operational efficiency and safety. Numerous hurdles must be confronted, including legal limits, cybersecurity threats, and ethical issues with data confidentiality and algorithmic bias, to achieve sustainable AI transformation in the OGL industry.

Feature engineering is an essential phase in the machine learning workflow and is classified as a component of data processing. It enhances machine learning model performance by optimizing current data features rather than creating new algorithms. The process focuses on utilizing inherent biases of machine learning methods to enhance prediction and accuracy [4]. Pan and Zhang [11] examined 4473 papers from 1997 to 2020, elucidating the current landscape of construction engineering management (CEM) by pinpointing significant research trends. The research revealed that the prevalent topics of AI in CEM include knowledge representation and reasoning, information fusion, computer vision, natural language processing, intelligence optimization, and process mining, all of which pertain to improved modeling, forecasting, and optimization.

Afridi et al. [12] examine predictive and analytical maintenance frameworks, emphasizing the essential function of fault-resilient operation and maintenance (OM) systems. The study found significant obstacles, including data quality, security, and feature engineering, and stated that as renewable adoption progresses, the development of operations and maintenance systems is essential for enhancing efficiency in data processing, forecasting, and optimization. Additionally, it encouraged future academics to investigate more novel methods for optimizing maintenance procedures in renewable operations. Szczepaniuk and Szczepaniuk [13] assessed operational and system optimization in the energy sector through the analysis of AI algorithm applications. The study findings indicated that AI algorithms improve the processes of energy generation, delivery, storage, utilization, and transaction.

Mohamed Almazrouei et al. [14] conducted an evaluation of artificial intelligence models for the predictive maintenance of water pumps. The research highlights the efficacy of AI algorithms in accurately capturing data patterns for dependable maintenance. Furthermore, the study underscores the theoretical AI principles for pump maintenance, representing an innovative contribution to the OGL industry by improving efficiency and minimizing operational inconsistencies. Ambadekar et al. [15] identify the integration of artificial intelligence and mechanical engineering within the framework of Industry 4.0. The study examines how AI and ML facilitate fault detection, augment product value, and optimize systems, demonstrating that AI implementation effectively addressed numerous mechanical concerns at minimal expense. Furthermore, the study emphasizes the prospective application of machine learning in addressing mechanical engineering issues.

Similarly, Ucar et al. [3] assess the application of AI in predictive maintenance to prevent system failures. The research identifies advanced technologies, including AI and data analytics, to enhance system performance and predictability in difficult circumstances. Additionally, it outlines prospective future study domains, namely the metaverse, generative AI, and the Industrial Internet of Things, among others. Kumar et al. [16] examined the classification of partial discharges (PD) in medium voltage cables to ensure cable reliability. The study examines detection methods, feature extraction, artificial intelligence, and optimization strategies for partial discharge categorization, while also highlighting future prospects for improving partial discharge and maintenance techniques related to medium voltage cables.


In a recent study, Lu et al. [17] examine the integration of AI and the Internet of Things (IoT) to enhance processing and analytical capabilities across several industry sectors. The research investigates deep learning models for the IoT, highlighting their limits and evaluating neuro-symbolic techniques that enhance the Artificial Intelligence of Things (AIoT). Bendaouia et al. [18] utilize image processing techniques and convolutional neural networks (CNN) for feature extraction, along with machine and deep learning, to predict the mineral composition of components. The projected assessments demonstrated a performance accuracy of 94%, and the study identified itself as a potential AI-powered metals analyzer. Furthermore, Hussain et al. [19] emphasized the prospective advantages of AI in the OGL sector while delineating its obstacles and complexities through an exhaustive analysis of AI in transportation, refining, and operations. The study comprehensively examines pipeline integrity management and logistical optimization, while also addressing AI-driven risk assessment, monitoring, and environmental mitigation with prompt response.

In summary, previous studies reflect the usage of AI and ML in fields such as systems engineering, education, oil and gas, and energy to solve different problems and even transform the systems involved. They are predominantly applied to the improvement of production processes; asset reliability and maintenance; and plant and equipment efficiency. However, there are limitations to using AI: questions of data quality and security, and ethical dilemmas, must be solved to enable its complete application. Consequently, feature engineering is widely acknowledged as one of the major factors for boosting the performance of ML; in addition, legal concerns and cybersecurity are cited as the key challenges that have to be addressed in order to fully harness the potential of AI.

3. Methodology

This section outlines the methods for predicting pump health and identifying probable defects with machine learning techniques, namely K-Nearest Neighbors (KNN), SVM, and logistic regression. These models are utilized to examine several operational characteristics like vibration, temperature, pressure, and flow rate. The objective is to discern patterns that signify defects such as cavitation, mechanical misalignment, or overheating. The models are designed to categorize the pump's operational condition, facilitating early identification of problems for preemptive maintenance.

Data preprocessing is performed before modeling to address the quality of the data and its compatibility with the model algorithms. It comprises normalization, handling of missing values and data gaps, and feature scaling. The data is divided in an 80:20 ratio: 80% for training and 20% for testing. This ensures that the model is trained on most of the data, with a smaller portion set aside for testing.

3.1. The elastic net method

The elastic net is a unified framework that integrates both Lasso (𝐿1 regularization) and ridge (𝐿2 regularization). The goal is to minimize the cost function:

Loss function = RSS + 𝛾(𝜃‖ℵ‖₁ + (1/2)(1 − 𝜃)‖ℵ‖₂²),  (1)

where the residual sum of squares (RSS) is the difference between the predicted and actual values:

RSS = ∑_{j=1}^{n} (M_j − M̂_j)²,

and the vector of model coefficients (features) is denoted as ℵ. The regularization parameter 𝛾 regulates the degree of regularization: the coefficients are subjected to a greater penalty as the value of 𝛾 increases, causing them to approach zero. The blending parameter between ridge and lasso is denoted by 𝜃.

When 𝜃 = 1, the elastic net functions identically to Lasso, employing solely 𝐿1 regularization. On the other hand, when 𝜃 = 0, it functions like ridge regression, employing solely 𝐿2 regularization. When 0 < 𝜃 < 1, it integrates both 𝐿1 and 𝐿2 regularization.

The 𝐿1 norm (Lasso) is defined as:

‖ℵ‖₁ = ∑_{k=1}^{q} |ℵ_k|,

promoting sparsity by rendering certain coefficients ℵ_k equal to zero. This leads to feature selection, as features with zero coefficients are omitted from the model; see Khan et al. [20] and Khan and Albalawi [21].

The 𝐿2 norm (ridge) is defined as:

‖ℵ‖₂² = ∑_{k=1}^{q} ℵ_k²,

which reduces the coefficients towards zero without eliminating them entirely. This mitigates model complexity, particularly in the context of multicollinearity.

Lasso is a technique that can be used for both variable selection and shrinkage; it is most often applied to linear regression with the 𝐿1 penalty on the coefficients, which drives many of them to zero. The elastic net combines features of both the 𝐿1 (Lasso) and 𝐿2 (ridge) norms. Using this method, a balance can be struck between the two concerns, highlighting important features while at the same time shrinking the coefficients on the basis of their relevance. The elastic net is preferred when the predictors are correlated, since it selects groups of correlated features where Lasso might fail.

3.2. The SVM model

The SVM serves as an alternative approach for tackling classification problems, encompassing nonlinearity and data intricacy, through the utilization of a distinct loss function. The SVM functions on principles analogous to the SVR. It is a powerful tool that has shown remarkable predictive ability in several practical applications. The SVM utilizes diverse kernel functions to evaluate the similarity between two data points to tackle non-linearity. The principal benefit of the SVM is its ability to capture covariate nonlinearity and utilize it to improve predictive outcomes. It aids in determining the allowable margin of error inside the model; see Ribeiro et al. [22]. The mathematical formulation of the SVM employing a kernel function can be expressed as:

M = ∑_k (ℏ_k − ℏ_k*) N(f_k, f) + 𝜀,

where N(f_k, f) is the kernel function that denotes the inner product, and the constraint is:

∑_k (ℏ_k − ℏ_k*) = 0.

The term 𝜀 is the slack variable, or error tolerance, in the course of optimization. The radial basis function (RBF) is a frequently employed kernel function and can be defined as

N(f_k, f) = exp(−𝛾‖f_k − f‖²).  (2)

The squared Euclidean distance between the two covariate vectors is represented by ‖f_k − f‖², while the width of the RBF is given by 𝜎² in Eq. (3); see Ahmad et al. [23]. The RBF kernel is employed in the subsequent phase of our work. The term 𝛾 can be expressed as:

𝛾 = 1/(2𝜎²).  (3)

The RBF, especially the Gaussian RBF, is an effective mechanism for mapping data into a higher-dimensional space, hence rendering non-linear patterns more discernible and amenable to modeling. Its utilization in kernel-based methodologies such as Support Vector Machines renders it a crucial technique in machine learning, particularly for managing intricate, non-linearly separable data. Nonetheless, meticulous selection of parameters, such as 𝛾, is essential to prevent overfitting and attain optimal performance.
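A minimal sketch of the two SVM variants discussed above, a linear kernel and an RBF kernel whose width follows 𝛾 = 1/(2𝜎²), is given below. scikit-learn is assumed, and the synthetic two-class data stands in for the pump features; it is illustrative only, not the study's dataset.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data standing in for pump sensor features (illustrative only).
rng = np.random.default_rng(1)
X_normal = rng.normal(0.0, 1.0, size=(100, 4))
X_fault = rng.normal(2.5, 1.0, size=(100, 4))
X = np.vstack([X_normal, X_fault])
y = np.array([0] * 100 + [1] * 100)   # 0 = No Fault, 1 = Fault Detected

# Linear kernel: the variant reported best in this study.
linear_svm = SVC(kernel="linear").fit(X, y)

# RBF kernel: gamma = 1 / (2 * sigma**2) controls the kernel width.
sigma = 1.0
rbf_svm = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2)).fit(X, y)

print(linear_svm.score(X, y), rbf_svm.score(X, y))  # training accuracy
```

In practice both 𝛾 (for the RBF kernel) and the penalty parameter `C` would be tuned on held-out data to avoid the overfitting noted above.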


3.3. The KNN model

The KNN method is employed as both a classifier and a regressor in machine learning. It is frequently utilized as a classification algorithm, based on the principle that similar points are situated near one another, facilitating the assignment of a class label to a point. To determine which data points are closest to a designated query point, it is essential to calculate the distance between the query point and the other data points. Although alternative distance measurements exist, the predominant distance metric utilized in the KNN is the Euclidean distance.

3.3.1. The Euclidean distance

This is the most widely used distance metric, computed as the straight-line distance between two points:

D(z, u) = √( ∑_{i=1}^{n} (z_i − u_i)² ).

It measures the distance between two points by summing the squared differences of their coordinates and taking the square root.

3.3.2. The Manhattan distance

It quantifies the absolute difference between two points, computed as the sum of the absolute differences of their coordinates:

D(z, u) = ∑_{i=1}^{n} |z_i − u_i|.

For categorical data, the related Hamming distance instead counts the positions at which the corresponding items differ.

3.3.3. The Minkowski distance

Named after Minkowski, this metric generalizes the Euclidean and Manhattan distances. The parameter p in the Minkowski formula determines the type of distance measure generated:

D(z, u) = ( ∑_{i=1}^{n} |z_i − u_i|^p )^{1/p}.

When p = 1, the Minkowski formula is equal to the Manhattan distance. Likewise, if p = 2, it reduces to the Euclidean distance. Other values of p provide further distance measures, which makes this a very versatile metric that can be utilized in a range of situations.

The KNN is referred to as a ‘‘lazy learning algorithm’’ because no model is learned during the training stage. Instead, it stores the training data and predicts by comparing the test data with the stored data. This means that the KNN only performs computation when it is time to make a specific prediction; it does not train a generic model, focusing instead on the data at the time of prediction.

3.4. The logistic regression

Logistic regression is a statistical technique employed to solve binary classification problems, in which the outcome variable is categorical with two potential classes. Based on the input features, the algorithm forecasts the likelihood that an instance belongs to a specific class.

The relationship between the input features and the probability that an observation belongs to a specific class, typically represented as 1 (positive class) or 0 (negative class), is modeled by logistic regression. In contrast to linear regression, which forecasts continuous outcomes, logistic regression generates probabilities that are restricted to the range of 0 to 1. The output is converted to a binary decision by applying a threshold, typically 0.5: an instance is classified as class 1 if the predicted probability is greater than or equal to 0.5; otherwise, it is classified as class 0. In order to model the probability that the dependent variable M equals 1, logistic regression employs the logistic (sigmoid) function in conjunction with input features Z for binary classification.

Mathematically, the logistic regression is illustrated as:

P(M = 1|Z) = 1 / (1 + exp(−(𝜃₀ + 𝜃₁Z₁ + 𝜃₂Z₂ + ⋯ + 𝜃_n Z_n))),  (4)

where P(M = 1|Z) is the likelihood that the output is 1 (positive class), Z = (Z₁, Z₂, …, Z_n) are the input characteristics, 𝜃₀ is the intercept term, and 𝜃₁, 𝜃₂, …, 𝜃_n are the coefficients (weights) corresponding to each feature.

3.4.1. The sigmoid function

The sigmoid function is fundamental to logistic regression, as it converts the linear combination of inputs into a probability. It is defined as:

𝛽(w) = 1 / (1 + exp(−w)),  (5)

where

w = 𝜃₀ + 𝜃₁Z₁ + 𝜃₂Z₂ + ⋯ + 𝜃_n Z_n

is the linear combination of the weights and the input attributes.

The threshold probability used to divide examples into two classes is known as the decision boundary in logistic regression, and it is usually set at 0.5. The decision rule based on the model's predicted probability P(M = 1|Z) is:

M̂ = 1 if P(M = 1|Z) ≥ 0.5,
M̂ = 0 if P(M = 1|Z) < 0.5,

where M̂ is the predicted class label (1 or 0).

Using maximum likelihood estimation (MLE), the logistic regression coefficients 𝜃₀, 𝜃₁, 𝜃₂, …, 𝜃_n are learned by optimizing the likelihood of the observed data: MLE estimates the parameters by determining the coefficient values that maximize that likelihood.

3.4.2. The loss function (log-loss or binary cross-entropy)

A loss function is used in logistic regression to quantify the difference between the predicted and actual probabilities. The log-loss (or binary cross-entropy) function is employed for binary classification:

L(M, M̂) = −(M log(M̂) + (1 − M) log(1 − M̂)),

where M is the true label (either 0 or 1) and M̂ is the predicted probability of the positive class. During training, this log-loss is minimized so that the model's predictions are as near to the genuine labels as feasible.

4. Reliability modeling

This section aims to predict the pump's operational health and identify potential defects by utilizing machine learning models, namely the SVM, KNN, and logistic regression. These models are intended to learn signs of faults from continuous features such as vibration amplitude, motor temperature, pressure, and flow rate. The classification is based on two categories, reporting the result of the diagnostic test in two options: ‘‘No Fault’’ or ‘‘Fault Detected’’. The goal is to establish a sophisticated predictive model that allows for informed prevention of such issues for the pump.

Fig. 1 shows a computer knowledge graph for using ML for pump fault detection. It begins with the feature set, which contains the input data for analysis. This data is passed to three ML models: SVM, KNN, and logistic regression.
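The flow just described, one feature set passed independently to the three classifiers, with the 80:20 split and accuracy-based evaluation from Section 3, can be sketched as below. scikit-learn is assumed, and the synthetic data and parameter values are illustrative only, not the study's.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the pump feature set (illustrative values only).
rng = np.random.default_rng(7)
X_no_fault = rng.normal(0.0, 1.0, size=(150, 4))
X_fault = rng.normal(2.0, 1.0, size=(150, 4))
X = np.vstack([X_no_fault, X_fault])
y = np.array([0] * 150 + [1] * 150)   # 0 = No Fault, 1 = Fault Detected

# 80:20 train/test split, as described in the methodology.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Each model processes the same feature set independently.
models = {
    "SVM (linear)": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```

A fuller evaluation would also report precision, recall, and F1-score, as done later in this section.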


Fig. 1. The computer knowledge graph for the pump detection.

Each model is independent and processes the features on its own. The outcomes of these models reach the pump control node, where the state of the pump is decided. From here the system categorizes the state of the pump as either ‘‘Fault’’ or ‘‘No Fault’’. Finally, these classifications feed a prediction node that indicates the overall decision or outcome. The flow reveals how the models and steps are connected for fault detection.

Tables 1 and 2 delineate the variables utilized to predict the presence of a fault in a pump system. The target variable is categorical, labeled either ‘‘No fault’’ or ‘‘Fault detected’’. The features enumerated are all continuous variables that offer a variety of measurements pertinent to the pump's operation. These features encompass physical parameters, such as vibration amplitude and vibration frequency, which aid in the identification of mechanical issues; motor current and temperature, which may indicate electrical issues or overheating; and pressure readings (e.g., pump inlet, outlet, and suction pressure), which can indicate blockages or cavitation.

Insight into the overall health and efficacy of the pump is provided by additional features such as efficiency, flow rate, and pump speed. Bearing temperature, lubrication pressure, and fluid temperature are additional parameters that aid in the evaluation of lubrication and wear issues. The capacity to identify defects is further improved by the incorporation of specialized indicators, including vibration phase shift and a cavitation indicator. Collectively, these attributes constitute an exhaustive dataset for the purpose of diagnosing and monitoring potential pump system malfunctions.

The procedure commences with the loading of the dataset, confirming its preparedness for analysis, as depicted in Fig. 2. Data cleaning ensues, rectifying missing values using imputation and eliminating any outliers that could skew the results. After data cleansing, the data is transformed: attributes are normalized or standardized for comparability across models. Subsequently, feature engineering utilizing the elastic net is implemented, facilitating the selection of the most pertinent features while mitigating overfitting through the integration of 𝐿1 and 𝐿2 regularization. Hyperparameter tuning follows, using techniques such as Grid Search, which refines the models for enhanced performance.

Ultimately, forecasts are generated on the test dataset, and the models are assessed utilizing metrics such as accuracy, precision, recall, and F1-score. The optimal model is chosen based on these assessments, guaranteeing it provides the most dependable outcomes for subsequent predictions.

The bar chart in Fig. 3 illustrates the distribution of the ‘‘Fault Detected’’ variable in the dataset. The 𝑥-axis denotes the two potential states, 0 (No Fault) and 1 (Fault), whilst the 𝑦-axis reflects the frequency of each state. A visual inspection indicates that the bar representing ‘‘No Fault’’ exceeds the height of the bar for ‘‘Fault’’, implying a greater frequency of instances without detected faults. This observation indicates a disparity in class distribution, with the ‘‘No Fault’’ class being more dominant than the ‘‘Fault’’ class.

The correlation matrix in Fig. 4 illustrates the magnitude and orientation of linear associations among the variables in the dataset. Every cell in the matrix denotes the correlation coefficient between two variables, spanning from −1 (perfect negative correlation) to 1 (perfect positive correlation); a value of 0 signifies the absence of a linear relationship. Several significant observations can be derived from the matrix. First, certain features demonstrate robust positive correlations, notably Feature-4 and Feature-5, with a correlation coefficient of 0.79, indicating a strong positive linear association between them. Second, moderate correlations exist between certain feature pairs, such as Feature-1 and Feature-2, which exhibit a correlation of 0.65, signifying a moderate positive link. Finally, several features exhibit low or even mildly negative correlations, indicating negligible linear relationship, as evidenced by the correlation of 0.07 between Feature-3 and Feature-18. Multicollinearity issues in modeling can result from high correlations between features, which may impact the interpretability and stability of the model. Consequently, it is imperative to identify and resolve highly correlated features in order to construct a good model. Feature selection is a technique that can assist in the removal
Subsequent to data preparation, it is divided into training and of redundant information and the enhancement of model performance.
testing sets, generally in an 80/20 ratio, to guarantee that the model is Additionally, comprehension of the fundamental relationships between
evaluated on unfamiliar data. Multiple models are then trained on the features can offer a deeper understanding of the data-generation pro-
training set, including the KNN and other SVM models utilizing differ- cess and inform future analyses. Consequently, we implement Elastic
ent kernels (Linear, RBF, Polynomial), in addition to logistic regression. Net regularization to mitigate potential multicollinearity by selecting
Hyperparameter optimization may be conducted using techniques such the most pertinent features and eradicating redundant ones. Elastic

H. Jiang et al. Alexandria Engineering Journal 119 (2025) 587–597

Table 1
The description of variables.

Target variable (categorical): No fault / Fault detected

| Feature | Type | Representation | Description |
| Vibration amplitude | Continuous | Feature-1 | Intensity of pump vibrations |
| Vibration frequency | Continuous | Feature-2 | Frequency of vibrations; helps identify fault patterns |
| Motor current | Continuous | Feature-3 | Electric current consumed by the motor; can indicate electrical issues |
| Motor temperature | Continuous | Feature-4 | Temperature of the motor, indicating overheating or poor cooling |
| Pump inlet pressure | Continuous | Feature-5 | Pressure at the pump inlet; can indicate blockages or cavitation |
| Pump outlet pressure | Continuous | Feature-6 | Pressure at the pump outlet; indicates overall pump performance |
| Flow rate | Continuous | Feature-7 | Rate of fluid flow; decreased flow can signal internal blockages |
| Pump speed (RPM) | Continuous | Feature-8 | RPM of the pump motor; helps detect motor or operational faults |
| Differential pressure | Continuous | Feature-9 | Difference between inlet and outlet pressures; indicates pump efficiency |
| Vibration phase shift | Continuous | Feature-10 | Phase shift in vibration signals; can indicate mechanical misalignment |
| Cavitation indicator | Continuous | Feature-11 | Acoustic signal that indicates cavitation, a common pump fault |
| Bearing temperature | Continuous | Feature-12 | Temperature of bearings, indicating lubrication failure or excessive wear |
| Suction pressure | Continuous | Feature-13 | Pressure on the suction side; useful for detecting cavitation or blockages |
| Efficiency | Continuous | Feature-14 | Ratio of output to input power; lower efficiency can indicate pump wear |
| Power consumption | Continuous | Feature-15 | Amount of electrical power used; can indicate inefficiencies or fault issues |

Table 2
The description of variables.

Target variable (categorical): No fault / Fault detected

| Feature | Type | Representation | Description |
| Discharge pressure | Continuous | Feature-16 | Pressure at the discharge side; abnormal values can indicate malfunction |
| Power factor | Continuous | Feature-17 | Measure of electrical efficiency; a lower factor can indicate motor inefficiency |
| Lubrication pressure | Continuous | Feature-18 | Pressure of the lubrication system; indicates proper lubrication of components |
| Fluid temperature | Continuous | Feature-19 | Temperature of the fluid being pumped; can indicate thermal issues in the pump |
| Vibration symmetry | Continuous | Feature-20 | Symmetry of vibrations; helps identify misalignment or imbalance |
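The cleaning, scaling, and Elastic Net feature-engineering steps described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the data are synthetic stand-ins for the variables in Tables 1 and 2, and scikit-learn's `SelectFromModel` wrapped around an elastic-net-penalized logistic regression is one plausible way to realize the described selection step.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the pump dataset: Feature-1 .. Feature-20 plus a
# binary target (0 = no fault, 1 = fault detected).
X = pd.DataFrame(rng.normal(size=(n, 20)),
                 columns=[f"Feature-{i}" for i in range(1, 21)])
X["Feature-5"] = 0.8 * X["Feature-4"] + 0.2 * rng.normal(size=n)  # correlated pair
y = (X["Feature-1"] + X["Feature-4"] + rng.normal(size=n) > 1).astype(int)

# Impute missing values, standardize, then keep only the features whose
# elastic-net-penalized coefficients survive the combined L1/L2 shrinkage.
selector = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectFromModel(LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1,
        max_iter=5000))),
])
selector.fit(X, y)
kept = list(X.columns[selector.named_steps["select"].get_support()])
print(f"{len(kept)} of {X.shape[1]} features kept:", kept)
```

The `l1_ratio` parameter sets the balance between the 𝐿1 (sparsity) and 𝐿2 (shrinkage) penalties; the value 0.5 here is an assumption, since the paper does not report its regularization settings.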

Elastic Net is a balanced approach to feature selection that combines the strengths of both Lasso (𝐿1 regularization) and Ridge (𝐿2 regularization). This reduces overfitting by ensuring that the model maintains predictive power while enforcing sparsity. The generalizability and stability of the model can be enhanced by retaining the essential features, even when certain variables are highly correlated. The method also guarantees that the model is not excessively influenced by any single feature, resulting in more dependable and durable predictions. Ultimately, integrating Elastic Net into the model-building process can improve both performance and interpretability, rendering the results more actionable and meaningful.

The correlation matrix produced after Elastic Net selection, covering the retained subset of features, illustrates the connections between those features, as shown in Fig. 5. A more refined set of features most pertinent to the prediction task is obtained once Elastic Net regularization is applied. The model is simplified by the reduction of the feature set, which concentrates on the variables most important for predictive accuracy and eliminates irrelevant or redundant features. We enhanced the interpretability of the model and reduced its complexity by emphasizing important predictors. The degree of correlation within this subset of selected features varies. Certain feature pairings, including Feature-4 and Feature-5, continue to demonstrate a robust positive correlation (0.79), suggesting that these features are still closely related. This implies that the information provided by these two variables is highly similar, and they may be working in tandem to influence the model's predictions.

Other pairs, including Feature-1 and Feature-2, exhibit moderate positive correlations (approximately 0.65). This indicates that, despite their shared relationship, they also offer unique information that may be beneficial to the model. Conversely, certain features demonstrate negligible or low correlations, such as that between Feature-3 and Feature-18 (0.07). These weak correlations suggest that the linear relationships are minimal, implying that the features do not provide a significant amount of overlapping information and may be relatively independent in their contributions to the model. This is a positive sign, as it suggests that these features contribute distinctive value to the model without being redundant.

Elastic Net has thus largely overcome the problem of multicollinearity by removing the highly correlated features, as evidenced by Fig. 5. This guarantees that the model is not dominated by a small number of correlated features, thereby reducing the risk of overfitting and improving model stability. Additionally, the reduced number of features lowers the model's complexity, which can lead to faster computation and enhanced generalization when making predictions on new or unseen data.

Fig. 6 provides a clear visual comparison of the accuracy levels obtained by five popular machine learning models: the SVM with a linear kernel, the SVM with an RBF kernel, the SVM with a polynomial kernel, the KNN, and logistic regression.

The accuracy score of each model is represented by a bar in the chart, which offers a simple way to compare their performance on the dataset. Upon reviewing the chart, it is evident that all models perform well, with accuracy scores falling within the range of 0.85 to 0.92. The SVM with a linear kernel is the best performer, achieving an accuracy score of 0.92. This suggests that the model is well suited to this classification task and may provide reliable predictive performance in comparable contexts. An accuracy of 0.91 is achieved by the logistic model and by the SVM with the polynomial and RBF kernels. These findings underscore the adaptability of the SVM in attaining robust performance across a variety of kernel functions, suggesting that SVM-based models are resilient and can be tailored to the dataset's diverse complexities.

Conversely, the KNN model performs slightly below the other models, though it maintains a respectable accuracy of 0.85. This may be because the KNN can struggle with high-dimensional or noisy data, as it depends on distance metrics for classification, which become less effective when feature spaces are large or contain inconsequential features. The chart is a valuable resource for understanding the comparative performance of these models. It not only offers a concise summary of


the findings but also assists in the formulation of well-informed decisions regarding which models to prioritize for additional assessment. In particular, the SVM with a linear kernel appears to be a promising model for further experimentation, particularly when factors such as interpretability and computational efficiency are considered.

The bar chart in Fig. 7 compares the five ML algorithms in terms of precision and recall for fault detection. The SVM (linear kernel) performs efficiently, with both precision and recall at 0.91.
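For reference, both metrics follow directly from confusion-matrix counts: precision = TP/(TP + FP) and recall = TP/(TP + FN). The counts below are hypothetical, chosen only to reproduce the 0.91/0.91 pair quoted above, and are not taken from the study's test set.

```python
# Hypothetical confusion-matrix counts for a fault detector (illustrative only).
tp, fp, fn = 91, 9, 9   # true positives, false positives, false negatives

precision = tp / (tp + fp)  # share of predicted faults that are real faults
recall = tp / (tp + fn)     # share of real faults the model actually catches

print(precision, recall)    # 0.91 0.91
```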
The SVM with the RBF kernel attains a relatively low recall of 0.88 compared with its precision of 0.90, suggesting that it may miss some true positives. The polynomial-kernel SVM achieves the highest precision, 0.92, showing that it minimizes erroneous positive predictions, while its recall falls to 0.85, which may indicate that it overlooks a number of faults. The KNN yields a precision of 0.84 and a recall of 0.82, lower than the other models. Lastly, logistic regression classifies fairly well, with a precision of 0.89 and a recall of about 0.91, making it the closest competitor to the SVM models.

On average, the best models are the SVM (linear kernel) and logistic regression, since they deliver stable results in both recall and precision; the SVM (polynomial kernel) is close but oriented toward high precision, while the KNN has the lowest values on both metrics.
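The full five-model comparison can be reproduced in miniature as follows. This is a sketch on synthetic, imbalanced data with default hyperparameters (the paper does not list its settings), so the resulting scores will differ from those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the pump data: 20 features, imbalanced classes
# (roughly 70% "no fault" / 30% "fault").
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.7],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,  # 80/20 split
                                          stratify=y, random_state=0)

models = {
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (RBF)": SVC(kernel="rbf"),
    "SVM (poly)": SVC(kernel="poly"),
    "KNN": KNeighborsClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
}
results = {}
for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    pred = pipe.predict(X_te)
    results[name] = (accuracy_score(y_te, pred),
                     precision_score(y_te, pred),
                     recall_score(y_te, pred))
    print(f"{name:20s} acc={results[name][0]:.2f} "
          f"prec={results[name][1]:.2f} rec={results[name][2]:.2f}")
```

Scaling inside the pipeline matters here: both the SVMs and the KNN are distance-based, so unstandardized features with large ranges (e.g., RPM versus a pressure ratio) would otherwise dominate the kernel or neighbor computations.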
The SVM with a linear kernel performs well on predictive tasks, especially when the data are linearly separable or nearly so. It defines the decision boundary efficiently while remaining computationally simple and relatively resistant to overfitting. This makes it suitable for high-dimensional datasets and other applications that require accurate and, above all, interpretable results, although it may struggle with non-linear patterns or noisy data.
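One reason the linear kernel lends itself to interpretation is that its decision function is simply f(x) = w·x + b, so the learned weight vector w can be read as per-feature influence. A small sketch on synthetic data (scikit-learn's `SVC` exposes w as `coef_` only for the linear kernel):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
svm = SVC(kernel="linear").fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]   # f(x) = w @ x + b
x0 = X[0]
manual = float(w @ x0 + b)
# The hand-computed value matches the model's own decision function,
# confirming that w and b fully describe the fitted boundary.
assert np.isclose(manual, svm.decision_function([x0])[0])
print("per-feature weights:", np.round(w, 2))
```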
Fig. 2. Comparative workflow for the logistic regression, KNN, and the SVM models.

The performance metrics of the five machine learning models (SVM with a linear kernel, SVM with an RBF kernel, SVM with a polynomial kernel, the KNN, and logistic regression) are visually compared in Fig. 8. Accuracy, precision, and recall are assessed, with particular emphasis on the "Fault = 1" class. In general, the heatmap indicates that all models exhibit robust performance across the metrics. The SVM with a linear kernel and logistic regression consistently achieve high scores on all three metrics. The SVM models with polynomial and RBF kernels also exhibit satisfactory performance, with only minor variations in the metrics.

In comparison to the other models, the KNN demonstrates slightly inferior performance, particularly in terms of recall. This heatmap offers a concise and insightful visual comparison of model performance, thereby facilitating the selection of the most appropriate model based on the desired emphasis on accuracy, precision, and recall.
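The matrix underlying such a heatmap is just a models-by-metrics table. Assembling it from the scores quoted in the text (accuracy per Fig. 6; precision and recall per Fig. 7, with the logistic-regression recall stated only as "about 0.91") makes the comparison easy to re-plot:

```python
import pandas as pd

# Scores as quoted in the text; the logistic-regression recall is approximate.
scores = pd.DataFrame(
    {"Accuracy":  [0.92, 0.91, 0.91, 0.85, 0.91],
     "Precision": [0.91, 0.90, 0.92, 0.84, 0.89],
     "Recall":    [0.91, 0.88, 0.85, 0.82, 0.91]},
    index=["SVM (linear)", "SVM (RBF)", "SVM (poly)",
           "KNN", "Logistic regression"],
)
print(scores)
# To render a heatmap like Fig. 8 (optional, assumes seaborn is installed):
#   import seaborn; seaborn.heatmap(scores, annot=True)
```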

5. Conclusions, limitations, and future avenues

The objectives of this research are to forecast the operational health of a pump and to predict the characteristics of defective pumps using machine learning models, including the SVM (with linear and nonlinear kernels), the KNN, and logistic regression. These models are designed to detect patterns that suggest faults such as cavitation, mechanical misalignment, or overheating by analyzing continuous features such as vibration amplitude, motor temperature, pressure measurements, and flow rate. The target variable is categorical, with outcomes of either "No fault" or "Fault detected". Ultimately, the models are trained on these data to predict the pump's health status, thereby assisting in proactive maintenance and defect detection.

Fig. 3. The distribution of fault detected.

The SVM with a linear kernel outperforms the other models in terms of accuracy, as evidenced by the results, with a maximum score of 0.92. This renders it the most effective model for reliably detecting defects. The linear kernel's accuracy surpasses that of the


Fig. 4. The visual representation of overall feature correlations.

SVM with an RBF kernel and the polynomial kernel, which both achieve accuracy ratings of 0.91. Additionally, logistic regression exhibits a robust performance with an accuracy of 0.91, while the KNN trails slightly behind with an accuracy of 0.85. This performance discrepancy underscores the linear-kernel SVM's superior generalization capability, rendering it the optimal choice for predicting defects in pump systems.

We suggest that the SVM with a linear kernel be prioritized for fault detection, as it has the highest accuracy and reliability, as indicated by these results. Companies can secure the most accurate predictions for pump health by employing this model, which enables earlier intervention and reduces downtime. Furthermore, the integration of machine learning models into predictive maintenance strategies can substantially reduce operational disruptions, resulting in more efficient and cost-effective pump management. To optimize this methodology, the integration of real-time data from a variety of operational and environmental variables could enhance model performance, enabling more responsive and timely reactions to developing defects. Ultimately, the use of this SVM-based predictive system could transform the maintenance strategy from reactive to proactive, thereby improving the efficiency and longevity of pump systems.

A limitation of the current research is that only classical ML algorithms (the SVM, the KNN, and logistic regression) have been used, while potentially more powerful models such as deep learning have not been employed. In the future, the models could be further refined by incorporating a wider range of sensor data, including environmental conditions and real-time operational parameters, to enhance the accuracy of fault detection. Furthermore, the incorporation of deep learning techniques could provide more sophisticated capabilities for identifying intricate defect patterns. Additionally, more effective predictive maintenance strategies for pump systems may be achieved by expanding the dataset and experimenting with alternative ML algorithms.

CRediT authorship contribution statement

Heling Jiang: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Yongping Xia: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Changjie Yu: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Zhao Qu: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Huaiyong Li: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Fig. 5. The visual representation of selected feature correlations.

Fig. 6. The comparison of ML models by accuracy.


Fig. 7. The models' comparative analysis using various metrics.

Fig. 8. The heatmap of models’ performance metrics.

