1 Introduction

Fault detection and classification play an essential role in process control, monitoring, health management and maintenance because they build a bridge between system monitoring data and system health status [1]. In industry, many complex process systems or plants comprise a large number of components; many key components (e.g., rolling bearings commonly used in machines) are expensive, vulnerable to damage, and prone to fault. Therefore, carrying out condition monitoring and fault detection of complex processes and machines is paramount for both safety and economic purposes.

Traditional fault detection and diagnosis methods can be roughly categorized into two groups: model-based and data-driven methods. The former relies on explicit mathematical models of the plant, while the latter uses historical data of the plant to determine its health status [2]. Traditional methods work under the following assumption: the distribution of the training data is similar to that of the test data (e.g., the fault samples), implying that the training data should contain a good number of fault samples so that the models used for fault detection are well trained. However, such an assumption may be violated in many real applications for the following reasons. Firstly, it is usually assumed that the data are well collected and sufficiently represent both the healthy status and all potential fault statuses of the system (machine, plant, equipment, etc.) of interest. In practice, however, no or only a few samples of the target faults may be available during system operation [3]. Secondly, few plants or systems would be allowed to operate until a major fault or a number of minor faults occur; because faults are destructive and can result in enormous losses [4], it is normally very difficult (if not impossible) to collect a sufficiently large number of samples to train a fault detection model well. Thirdly, systems (e.g., machines) typically decline gradually from health to failure, implying that obtaining adequate fault samples for data-driven approaches is time-consuming and costly.

A way to effectively solve fault detection tasks with small and imbalanced data is to develop intelligent fault detection (IFD) algorithms using, e.g., data augmentation strategies and transfer learning [1]. Such an algorithm first augments the data by data generation or data resampling, then uses the augmented data to extract features with machine learning models (e.g., neural networks), together with a feature adaptation process where necessary, and finally builds a suitable fault classifier to identify the types of faults. In practice, however, a certain type of fault may not be observed or recorded at all for various reasons.

In certain extreme circumstances, signals for specific fault types or working conditions are unobtainable, implying that diagnosis models cannot be well trained due to the lack of training samples for unseen fault classes. Moreover, in data-driven fault detection, identifying unseen fault classes is a challenging task for traditional IFD methods [1]. Still, in practice there is a strong need to tackle the following realistic and highly challenging scenario: samples of one or more fault types are not available at all, or only a very limited number of samples are available. In recent years, a learning method called zero-shot learning (ZSL) has been widely used in image classification due to its power to recognize new objects (not seen in the model training stage) based on information inferred from seen classes [5]. ZSL provides a powerful tool for solving the unseen fault detection problem concerned in this work, and a reasonable solution is to combine fault detection with zero-shot learning to classify certain (unseen) types of faults without using samples of these fault types. Inspired by the idea of zero-shot learning [6], this paper proposes a new zero-shot fault detection method based on a semantic space embedded model for industrial systems and devices. The proposed method is implemented as follows. Step 1: build convolutional neural networks for feature extraction from raw data; Step 2: define and specify a semantic space for faults by creating a shared attribute form (matrix) for each type of fault; Step 3: adopt and define a bilinear compatibility function to learn the relationship between the extracted features and fault attributes, based on which the highest-ranking unseen fault class is determined.

The performance of the proposed method is tested and assessed on real datasets, collected from a bearing and a complex chemical system, respectively, and two case studies are presented accordingly. The first case concerns bearings, widely used as crucial rotary components in many applications; we detect ‘large-diameter faults’ by training classifiers only on samples of ‘small-diameter faults’. The second case considers an entire industrial process control system and aims to detect different unseen types of faults (i.e., types not used in the training stage).

The main contributions of this paper are summarized as follows:

1) We design a new zero-shot learning scheme for unseen fault detection that requires no samples of the unseen faults in the training stage. Specifically, for the bearing case, the proposed approach can detect and identify large-diameter faults in bearings by training a model on data containing small-diameter fault samples but no large-diameter fault samples.

2) To improve the adaptation of zero-shot learning to 1D time series (the format of most industrial fault data), we comprehensively analyze various feature extraction methods, including traditional methods and deep learning-based methods.

3) The proposed semantic space, which treats fault attributes as side information, builds a bridge from seen faults to unseen faults for zero-shot fault detection.

The remainder of the paper is organized as follows. Section 2 provides a literature review on traditional, intelligent, and zero-shot fault detection. Section 3 presents the details of the proposed method. Section 4 conducts two case studies to demonstrate and verify the effectiveness of the proposed method: Case 1 is concerned with detecting unseen faults of larger diameters in the Case Western Reserve University (CWRU) experimental dataset, and Case 2 conducts experiments on the Tennessee-Eastman process (TEP) dataset, aiming to detect unseen types of faults. Finally, Section 5 concludes the paper.

2 Relevant literature review

2.1 Traditional fault detection and intelligent fault detection

The procedure of traditional classification-based, data-driven fault detection contains three main steps: data acquisition, feature extraction, and fault detection and classification [7]. In practice, data are collected via different means, including numerous sensors. Feature extraction is usually implemented through linear or nonlinear transformation and data decomposition. Commonly used linear methods include principal component analysis (PCA) [8] and independent component analysis (ICA) [9]. Nonlinear data processing approaches, such as kernel-based methods, are usually more powerful for characterizing nonlinear relationships; for example, it has been shown that kernel principal component analysis (KPCA) works better than its linear counterpart PCA in many applications [10]. It is usually important and useful to reduce the dimensionality of features or variables in the training space for several reasons, e.g., to make the classification tasks easier to implement, improve classification accuracy, or make the models and results easier to explain. PCA performs poorly when extracting features from a set of signals that are nonlinearly associated with or dependent on each other; ICA works better for extracting non-Gaussian features from multivariate signals [11].

Recently, 1D convolutional neural networks (CNNs) were introduced to automatically extract damage-sensitive features for vibration-based fault detection. For example, in [12], raw signals were transformed into two-dimensional grayscale images via the wavelet transform, and a deep CNN was used to extract robust features. In [13], a fault detection and classification method was proposed using discrete wavelet transform (DWT) and continuous wavelet transform (CWT) filter banks.

After feature extraction, the resulting features are fed into a fault classification model to determine the system’s health status. Many machine learning models have been developed for fault detection and classification (see e.g., [14]). In [15], a bearing fault diagnosis method was proposed based on deep CNN and random forest (RF) ensemble learning.

As mentioned in Section 1, traditional fault detection methods may not work well with small or imbalanced data; therefore, intelligent fault detection methods are needed to guarantee detection performance. One way to build intelligent fault detectors is to use deep learning. Deep transfer learning (DTL) methods have been introduced to the field of fault detection to overcome the difficulty in data collection (e.g., when samples of certain faults are insufficient or unavailable). DTL methods treat insufficient samples as a cross-domain learning task and aim to find a solution by performing domain adaptation [16] to handle the different distributions of the source domain data and the target domain data. Zhang et al. [17] investigated an end-to-end method based on a deep convolutional neural network to achieve high accuracy when the working load changes. Wen et al. [18] used a sparse auto-encoder and the maximum mean discrepancy term to transfer training features to testing features; in this way, no target fault samples were needed for fault detection. In [19], an optimal transport-based deep domain adaptation method was presented for rotating machine fault diagnosis. In [20], a deep adversarial domain adaptation (DADA) method was proposed for rolling bearing fault diagnosis; the method builds a DADA network to better address a challenge commonly encountered in real-world applications: the distribution of the target domain data differs from that of the source domain data. Note that typical deep transfer learning [21] assumes that the same faults appear in both the training and test stages.

2.2 Zero-shot learning

Recently, Lampert et al. [22] proposed a zero-shot learning (ZSL) scheme, which has received significant attention in the field of image recognition. Instead of using trained objects, it uses a high-level description provided by field experts to detect target items. The description comprises semantic attributes, e.g., colour, shape and even habits, which can be pre-learned without samples of unseen classes. Roughly speaking, ZSL is a method for training models to recognize unseen types of images based on side information learned from seen classes with relevant descriptions [23]. ZSL has two learning schemes: inductive and transductive models [24]. In the inductive model, only data from seen classes are available during the training stage. In the transductive model, it is assumed that data of both the unseen classes (i.e., unlabeled classes) and the seen classes are available for model training; hence it is a type of semi-supervised learning. This study is mainly concerned with the inductive scheme.

Zero-shot classification approaches under an inductive setting could be broadly categorized into four groups, namely, direct-attribute prediction based, semantic space embedded based, non-linear multi-modal embedded based, and common space embedded based [25].

For the direct-attribute prediction based methods, the most representative model is the direct attribute prediction (DAP) method presented by Lampert et al. [26], which directly builds the relationship between visual features and attributes, and then uses the learned model to predict the attributes of unseen samples. Lampert et al. [26] also presented an indirect attribute prediction (IAP) method, in which unseen samples are first assigned to seen classes and then predicted using the semantic attribute relationship between seen and unseen types. Note that direct-attribute prediction methods suffer from several drawbacks. Firstly, the two-step prediction is an indirect approach that solves intermediate problems: the solution might be optimal for predicting attributes with attribute classifiers, but it is not necessarily optimal for predicting classes. Secondly, it is difficult to extend DAP to incremental learning scenarios. These drawbacks can be overcome by the semantic space embedding methods discussed below.

The method based on semantic space embedding learns a mapping from features to a semantic space [27, 28]. Frome et al. [29] constructed a deep visual-semantic model by learning a linear mapping from image features to the joint embedding space based on an online learning-to-rank algorithm. Akata et al. [6] presented a label-embedding method that learns a bilinear compatibility function between an image embedding and a label embedding so that matching embeddings are assigned higher scores than mismatching ones. Akata et al. [30] also proposed a label embedding model for fine-grained classification by combining supervised attributes and unsupervised output embeddings from hierarchies or text corpora. Romera-Paredes et al. [5] used a squared loss as the compatibility function, with a regularizer, to optimize classification accuracy. Kodirov et al. [31] presented a semantic autoencoder to handle the projection domain shift problem by reconstructing features after projecting them to the semantic space. Approaches based on semantic space embedding enable visual samples to be represented in the semantic space and recognized in that space.

A non-linear multi-modal embedded model is usually capable of learning non-linear compatibility relations to optimize projection accuracy or embeddings. Xian et al. [32] extended the bilinear compatibility function to multiple linear (piecewise linear) compatibility functions, making a collection of maps highly interpretable.

Embedding both features and semantic descriptions into a common space is referred to as the common space embedded based method. Changpinyo et al. [33] proposed a synthesized-classifier approach for zero-shot classification, which uses linear combinations of base classifiers to build classifiers for unseen classes. Hayashi et al. [34] proposed a cluster-based method for multivariate binary classification in the ZSL setting, where classifiers were first trained on seen classes and then used to separate future (test) data into two classes: the seen class and the unseen (unknown) class. In [35], a one-class classification (OCC) method was proposed and applied to image classification; the approach can effectively determine whether the input data of interest come from a seen class or an unseen class, which is potentially very useful for developing and adapting ZSL methods and algorithms.

2.3 Zero-shot learning for fault detection

Zero-shot learning might bring breakthroughs in intelligent fault detection, especially for classifying unseen fault types when samples of these faults are unavailable for some reason. Preliminary results on zero-shot fault detection have recently been reported in the literature. Lv et al. [36] used a hybrid attribute conditional adversarial denoising autoencoder to tackle the zero-shot fault diagnosis problem. Gao et al. [37] proposed a ZSL method based on contractive stacked autoencoders for bearing fault diagnosis under different working loads. Feng et al. [3] introduced a fault description model based on an attribute transfer strategy to classify zero-shot faults in complex mechanical systems. Xing et al. [38] proposed a label description space embedding model for detecting unseen compound faults of machines. Xu et al. [39] presented a zero-shot intelligent diagnosis method for unseen compound faults of devices using a visual space-based model.

It is worth highlighting that the visual attributes (e.g., colour and shape) used for zero-shot image recognition are unsuitable for sensor signal processing (e.g., vibration signals) [3]. When a new type of fault occurs in a system or machine, we usually notice its semantic attributes and description before individual samples. For example, from the description "an equipment that converts gas or vapour into liquid and transfers heat from the tube to the air near the tube," professional workers can identify the object "condenser" without seeing it at all. Similarly, if "high condensing temperature" is a pre-defined fault type, then it is straightforward to infer that such a fault occurs in the condenser when we are told the high-level attribute information that "high temperature gas from the compressor does not exchange heat well". Furthermore, it is redundant to design separate attributes for each type of fault, because human-defined fault attributes transcend class boundaries [22]; hence, attributes should be shared across different classes of seen and unseen faults. For example, both "reactor cooling water inlet temperature change" and "reactor cooling water valve change" [40] occur at the "reactor", so the attribute "related to reactor" can be shared across these two seen faults and then transferred to unseen faults in the testing stage. In conclusion, fault attributes can cover many aspects, such as the position of the fault, the related process variables, and the size of the fault. Fault attributes provide side information for unseen fault classes, which enables the model to detect unseen faults and directly solve the zero-shot fault detection problem.

Feature extraction from raw signals is another crucial step for zero-shot fault detection. Feng et al. [3] used supervised principal component analysis [41] to extract features, under the assumptions that the process control system is linear and the data follow a Gaussian distribution. Such assumptions are strong, since most data generated by complex industrial processes are nonlinear. In [39], 1D vibration signals were transformed into time-frequency images and then fed into a convolutional neural network (CNN) to extract features. It is worth mentioning that converting 1D vibration signals into 2D representations is an additional procedure of high computational complexity that needs application-specific adaptation.

3 Proposed method

3.1 Problem formulation

Following [6], we assume that there is a training (seen) dataset \(S={\left\{\left({x}_{i}^{s},{y}_{i}^{s}\right)\right\}}_{i=1}^{{N}_{s}}\) with \({x}_{i}^{s}\in {X}^{s}\), \({y}_{i}^{s}\in {Y}^{s}\), which consists of \({N}_{s}\) fault data samples and \(s\) classes of seen faults. Each sample \({x}_{i}^{s}\) corresponds to a label \({y}_{i}^{s}\). Likewise, given a testing (unseen) dataset \(U={\left\{\left({x}_{i}^{u},{y}_{i}^{u}\right)\right\}}_{i=1}^{{N}_{u}}\) with \({x}_{i}^{u}\in {X}^{u}\), \({y}_{i}^{u}\in {Y}^{u}\), the dataset consists of \({N}_{u}\) fault data samples and \(u\) classes of unseen faults. Each sample \({x}_{i}^{u}\) corresponds to a label \({y}_{i}^{u}\). The attributes of a fault are denoted as \(A=\left[{A}^{s},{A}^{u}\right]\in {R}^{L\times C}\), where \(L=s+u\), and \(C\) is the number of fault attributes. It is important to point out that both \({A}^{s}\) and \({A}^{u}\) are available in the training stage because the fault attributes are class-level common knowledge rather than expert knowledge; we can obtain the fault attributes in advance. The samples and classes need to meet the following conditions in zero-shot learning settings: \({Y}^{s}\cup {Y}^{u}=Y\), \({Y}^{s}\cap {Y}^{u}=\emptyset\).
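For concreteness, this setting can be sketched in a few lines of Python; the class indices and the attribute dimension below are illustrative stand-ins, not values taken from the datasets:

```python
import numpy as np

# Sketch of the zero-shot setting: the seen and unseen label sets are disjoint,
# while the attribute matrix A = [A^s; A^u] covers all L = s + u classes and is
# available at training time. All concrete values here are made up.
Y_seen, Y_unseen = {0, 1, 2, 3, 4, 5}, {6, 7, 8}
assert Y_seen.isdisjoint(Y_unseen)            # Y^s ∩ Y^u = ∅
L, C = len(Y_seen | Y_unseen), 7              # L classes, C attributes
A = np.random.rand(L, C)                      # placeholder for the real attributes
```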

3.2 Model structure

The proposed method, SSB-ZSL-1DCNN, motivated by and adapted from the idea of semantic space embedding, comprises three steps: feature extraction, human-defined label embeddings, and a feature embedding model. The overall structure of the method is presented in Fig. 1.

Fig. 1 The structure of the proposed method

1) Feature extraction: A 1D CNN is preferable for handling 1D industrial fault signals, since a 1D CNN is easier to train and has lower computational complexity than 2D convolutions [42]. We therefore use a 1D CNN as the feature extractor. The architecture of the designed 1D CNN is shown in Table 1. It contains two convolution layers, two max-pooling layers, one flatten layer and one fully-connected layer. The inputs of the 1D CNN are the 1D time-series signals, and the outputs of the fully-connected layer are the extracted features.

    Table 1 The architecture of the designed 1D CNN for feature extraction
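As a concrete illustration, below is a minimal PyTorch sketch of such an extractor. Since Table 1's exact hyperparameters are not reproduced here, the kernel sizes, channel counts and strides are assumptions; only the layer sequence and the 1 × 1024 input / 1 × 64 output sizes follow the text.

```python
import torch
import torch.nn as nn

class FeatureExtractor1D(nn.Module):
    """Two Conv1D layers, two max-pooling layers, a flatten layer and one FC
    layer; kernel sizes, channels and strides are illustrative assumptions."""
    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8, padding=28),  # assumed
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),             # assumed
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(feature_dim)   # infers the flattened size

    def forward(self, x):                      # x: (batch, 1, 1024)
        return self.fc(self.net(x))            # (batch, 64) extracted features

features = FeatureExtractor1D()(torch.randn(8, 1, 1024))
print(features.shape)                          # torch.Size([8, 64])
```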
2) Human-defined label embeddings: In practical fault diagnosis, tagging each fault sample is complex and time-consuming. Fault attributes (represented by a matrix A in Section 4) provide side information that can be used to establish the relationships between seen faults and unseen faults. Fault attributes allow characteristics of faults, such as fault position and fault effect, to be shared; they are easily annotated by experts and transformed into computer-readable vector forms [30]. The description of each attribute can be a binary value \({\varphi }^{0,1}\in \left\{0,1\right\}\) or a continuous value \({\varphi }^{C}\in \left[0,1\right]\) for each class. The attributes for each fault class can be written as:

    $${\Phi }\left(y\right)={\left[{\varphi }_{y,1},\dots ,{\varphi }_{y,E} \right]}^{T}$$
    (1)

where each \({\varphi }_{y,e}\) is either a binary value in \(\left\{0,1\right\}\) or a real number between 0 and 1; y denotes the fault class, and E denotes the dimension of the attributes for a fault class. Note that continuous attributes \({\varphi }^{C}\) carry more information than binary attributes \({\varphi }^{0,1}\). For illustration purposes, we describe the attribute matrix in binary form here, but in the subsequent experiments we use random continuous attributes rather than binary ones: if a fault does not have an attribute, the entry is set to 0; if it does, the entry is set to a random number in the range (0,1).
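The construction of such a continuous attribute matrix can be sketched as follows; the 9 × 7 shape mirrors the CWRU case in Section 4, while the binary mask itself is randomly generated here rather than taken from Fig. 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative binary mask (1 = the class has the attribute); the real mask
# comes from Table 3 / Fig. 5, not from random generation.
mask = rng.integers(0, 2, size=(9, 7))        # 9 fault classes x 7 attributes

# Entries stay 0 where an attribute is absent; each present attribute is
# replaced by a random continuous value in (0, 1), as described above.
A = np.where(mask == 1, rng.uniform(size=mask.shape), 0.0)
Phi_y = A[0]                                  # Phi(y) for fault class y = 0 (Eq. 1)
```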

3) Feature embedding model: We define a prediction function f by maximizing the bilinear compatibility function F as follows:

$$f\left(x;w\right)=\underset{y\in Y}{\arg\max}\,F\left(x,y;w\right)$$
    (2)

where w denotes the parameter vector of F and can be written as a \(D\times E\) matrix W, where D is the dimension of the extracted features and E is the dimension of the attributes. The bilinear compatibility function \(F:X\times Y \to R\) between a raw fault data space X and a fault label space Y is defined as follows:

    $$F\left(x,y;W\right)={\theta \left(x\right)}^{T}W{\Phi }\left(y\right)$$
    (3)

where \(\theta \left(x\right)\) denotes the extracted features and \({\Phi }\left(y\right)\) denotes the fault label embedding. \(F\left(x,y;W\right)\) is a ranking-based compatibility function: by learning W, the correct label is ranked higher than any other label. This idea is closely related to the web scale annotation by image embedding (WSABIE) algorithm [43], which learns a low-dimensional joint embedding space for both images and annotations to classify annotations from a ranked list. The significant difference between our method and WSABIE is that the latter learns both \({\Phi }\left(y\right)\) and W, whereas the former learns only W and uses fault attributes as the side information \({\Phi }\left(y\right)\).
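A minimal NumPy sketch of Eqs. (2) and (3), assuming the features and attribute matrix are already available:

```python
import numpy as np

def compatibility(theta_x: np.ndarray, W: np.ndarray, Phi: np.ndarray) -> np.ndarray:
    """F(x, y; W) = theta(x)^T W Phi(y) (Eq. 3), evaluated for all classes at once.
    theta_x: (D,) extracted features; W: (D, E); Phi: (L, E) class attributes."""
    return Phi @ (W.T @ theta_x)              # (L,) compatibility scores

def predict(theta_x: np.ndarray, W: np.ndarray, Phi: np.ndarray) -> int:
    """f(x; w) = argmax_y F(x, y; w) (Eq. 2)."""
    return int(np.argmax(compatibility(theta_x, W, Phi)))
```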

4) Parameter estimation: Similar to the formulation of the unregularized structured SVM [43], the weighted approximate ranking objective to be minimized is:

$$\sum\nolimits_{n=1}^{{N}_{s}}\frac{{\beta }_{{r}_{\Delta ({x}_{n},{y}_{n})}}}{{r}_{\Delta ({x}_{n},{y}_{n})}}\sum\nolimits_{y\in Y}\text{max}\left\{0,s\left({x}_{n},{y}_{n},y\right)\right\}$$
    (4)

    where

    $$s\left({x}_{n},{y}_{n},y\right)=\Delta \left({y}_{n},y\right)+F\left({x}_{n},y;W\right)-F\left({x}_{n},{y}_{n};W\right)$$
    (5)
    $${\beta }_{k}={\sum }_{i=1}^{k} {\alpha }_{i}$$
    (6)
    $${r}_{\Delta ({x}_{n},{y}_{n})}=\sum\nolimits_{y\in Y}1(\text{s}\left({x}_{n},{y}_{n},y\right)>0)$$
    (7)

Here, \(s\left({x}_{n},{y}_{n},y\right)\) is the misclassification loss function with margin \(\Delta \left({y}_{n},y\right)\), where \(\Delta \left({y}_{n},y\right)=1\) if \(y\ne {y}_{n}\) and 0 otherwise. As suggested in the WSABIE algorithm, we choose \({\alpha }_{i}=1/i\). \({r}_{\Delta ({x}_{n},{y}_{n})}\) is the upper bound on the rank of fault label \({y}_{n}\) for fault sample \({x}_{n}\). The ranking-based objective (4) seeks higher compatibility between the extracted features and the fault label embedding of the target label than between the extracted features and the fault label embeddings of the wrong labels.

In the training stage, we use the extracted features \(\theta \left(x\right)\) and fault attributes \({\Phi }\left(y\right)\), which come only from seen fault classes, to learn W. We apply stochastic gradient descent (SGD) to optimize W: at each step we find the highest-scoring class y and, if \(\underset{y\in Y}{\arg\max}\;s(x_n,y_n,y)\neq y_n\), update:

$${W}^{\left(t\right)}={W}^{(t-1)}+{\eta }_{t}{\beta }_{\lfloor\frac{N-1}{k}\rfloor}\theta \left({x}_{n}\right){\left[{\Phi }\left({y}_{n}\right)-{\Phi }\left(y\right)\right]}^{T}$$
(8)

where \({\eta }_{t}\) is the learning rate at iteration t; in this study, a constant step size \({\eta }_{t}=\eta\) is used. Following WSABIE, \({r}_{\Delta ({x}_{n},{y}_{n})}\) is approximated as \({r}_{\Delta ({x}_{n},{y}_{n})}\approx \lfloor\frac{N-1}{k}\rfloor\), where N is the number of fault labels and k is the number of wrong fault labels sampled until a violation is found. After the training stage completes, the best W is obtained.
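One SGD step of this update can be sketched as follows; the margin Δ = 1 and the WSABIE-style sampling of violating labels follow the text, while the learning rate is an arbitrary placeholder.

```python
import numpy as np

def wsabie_step(W, theta_x, y_true, Phi, eta=0.01):
    """One step of the weighted approximate ranking update (Eq. 8), a sketch.
    W: (D, E); theta_x: (D,); Phi: (L, E) seen-class attributes; y_true: label index."""
    L = Phi.shape[0]
    scores = Phi @ (W.T @ theta_x)                 # F(x, y; W) for all y
    wrong = [y for y in range(L) if y != y_true]
    np.random.shuffle(wrong)
    for k, y in enumerate(wrong, start=1):         # k = sampling trials so far
        if 1.0 + scores[y] - scores[y_true] > 0:   # s(x_n, y_n, y) > 0
            # beta_{floor((N-1)/k)} with alpha_i = 1/i
            beta = (1.0 / np.arange(1, (L - 1) // k + 1)).sum()
            # W <- W + eta * beta * theta(x_n) [Phi(y_n) - Phi(y)]^T  (Eq. 8)
            W += eta * beta * np.outer(theta_x, Phi[y_true] - Phi[y])
            break
    return W
```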

In the testing stage, we project the extracted feature using the learned W and use the cosine similarity measure to search for the nearest fault attribute vector, which belongs to one of the unseen fault classes.
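This inference step can be sketched as follows, with the cosine similarity taken between the projected feature and the unseen-class attribute vectors:

```python
import numpy as np

def detect_unseen(theta_x: np.ndarray, W: np.ndarray, Phi_unseen: np.ndarray) -> int:
    """Project the feature via the learned W into attribute space and return the
    index of the unseen fault class with the most similar attribute vector."""
    z = W.T @ theta_x                              # (E,) projected feature
    sims = Phi_unseen @ z / (
        np.linalg.norm(Phi_unseen, axis=1) * np.linalg.norm(z) + 1e-12)
    return int(np.argmax(sims))                    # nearest by cosine similarity
```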

4 Experiments

This section presents two case studies on two real datasets: the rolling bearing fault dataset created by Case Western Reserve University (CWRU) and the chemical process fault dataset known as the Tennessee-Eastman process (TEP). The two case studies were carried out from different perspectives to comprehensively evaluate the performance of the proposed method.

4.1 The Case Western Reserve University (CWRU) dataset

4.1.1 Introduction to the CWRU dataset

The CWRU dataset, consisting of vibration-based rolling bearing fault data, is from the Case Western Reserve University Bearing Data Center [44]. The test bench, shown in Fig. 2, contains a 2 hp Reliance Electric motor, a torque transducer/encoder and a dynamometer. In addition, acceleration sensors are installed above the bearing housings at the fan end and the drive end to collect the vibration acceleration when the fault data are collected.

Fig. 2 The CWRU test bench

The faults are located on the drive-end bearing and the fan-end bearing, respectively, and include inner race faults, rolling element faults and outer race faults under four working loads (0, 1, 2 and 3 hp). The fault diameter for each type of fault ranges from 0.007 to 0.028 inch, seeded on the bearings using electro-discharge machining (EDM). The variables in each fault class are the drive-end acceleration data (DE), fan-end acceleration data (FE), base plate acceleration data (BA) and motor speed (RPM). The sampling rate of the signals is 12 kHz.

We chose the 12 kHz drive-end bearing fault data as our experimental dataset. Overall, there are four groups of experiments, namely 0 hp, 1 hp, 2 hp and 3 hp. For each group, there are nine kinds of faults in total, and only DE is selected as the variable because the vibration signal collected at the drive end is more comprehensive and less affected by other components and environmental noise. For graphical illustration, the vibration signal of the rolling element fault with 3 hp load and 0.021 inch fault diameter is shown in Fig. 3, where the top panel shows the waveform of the signal in the time domain and the bottom panel shows the corresponding spectrum. For each type of fault, the first 102,400 DE data points are considered; the data were then pre-processed and rearranged with an overlap sampling approach with an overlap ratio of 50%, resulting in a total of 200 samples, each consisting of 1,024 data points. The details of the faults in the dataset are given in Table 2.
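The overlap sampling step can be sketched as follows; note that under this particular boundary convention, 102,400 points yield 199 windows, so the paper's count of 200 samples implies a slightly different convention (e.g., padding the final window).

```python
import numpy as np

def overlap_segments(signal: np.ndarray, length: int = 1024, overlap: float = 0.5):
    """Slice a 1D record into fixed-length samples with the given overlap ratio."""
    step = int(length * (1 - overlap))             # 512 for 50% overlap
    n = (len(signal) - length) // step + 1
    return np.stack([signal[i * step:i * step + length] for i in range(n)])

segments = overlap_segments(np.arange(102_400, dtype=float))
print(segments.shape)                              # (199, 1024)
```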

Fig. 3 The 3 hp 0.021’’ rolling element fault signal (time domain) and its spectrum (frequency domain)

Table 2 Fault labels (9 fault types in each working load)

Note that failures of different sizes (diameters) damage the equipment to different degrees, some operating conditions do not allow larger-size failures, and few factories would be allowed to run until large-size failures occur so that samples can be collected for training. Therefore, the largest-size failures might have zero samples for model training, making it extremely difficult (if not impossible) for traditional multi-classification methods to detect such unseen faults. The proposed zero-shot fault detection method is therefore meaningful and realistic.

4.1.2 Model implementation

The first stage of this method is feature extraction. The designed 1D CNN was applied to extract fault features from raw data. As shown in Table 1, the architecture of the 1D-CNN model for vibration signals is constructed with two Conv layers and one FC layer. The input signal of the 1D-CNN is of size 1 × 1024 and the output of the FC layer is of size 1 × 64.

To illustrate the feature extraction results intuitively, the t-SNE (t-distributed stochastic neighbor embedding) algorithm was employed to provide a 2D representation of the features output by the FC layer. Taking the 1 hp working load as an example, the distributions and clusters of the nine types of faults after t-SNE are presented in Fig. 4, where the horizontal and vertical coordinates represent the two t-SNE embedding dimensions.

Fig. 4 The t-SNE results of the output features of the FC layer for the 1 hp working load; the horizontal and vertical coordinates represent the two t-SNE embedding dimensions
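A minimal sketch of this visualization, with random stand-ins for the (N, 64) FC-layer features and the fault labels that in the experiment come from the trained 1D CNN:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(1800, 64))     # stand-in for the FC-layer features
labels = rng.integers(0, 9, size=1800)     # stand-in for the nine fault labels

emb = TSNE(n_components=2, random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.show()
```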

The second stage is to build human-defined fault label embeddings for the preprocessed experimental dataset, i.e., the fault attribute matrix A, shown in Fig. 5; the meaning of each attribute is given in Table 3. For each bearing fault, seven fine-grained fault attributes were specified using the statements given in Table 2. Note that the attribute matrix does not distinguish the same type of fault under different working loads; for example, the 1 hp 0.007’’ inner race fault and the 2 hp 0.007’’ inner race fault have the same attributes. Once the fault attributes are obtained in the preparation and training stage, the resulting classifier can detect unseen faults without using samples of the unseen faults in the training stage.

Fig. 5 Fault attribute matrix A

Table 3 Side information of the attributes for the CWRU dataset

Finally, we feed the extracted features into the feature embedding model to match the most similar fault attributes and thus find the corresponding unseen fault categories. The ordinary cosine distance was used to measure the similarity between attributes.

There are 36 types of bearing faults in total, divided into four groups of experiments; each group has nine types of faults. The dataset was split as follows: four types of 0.007’’ (inch) and 0.014’’ faults were used for training the models, two types of 0.007’’ and 0.014’’ faults were used for validation, and the remaining three types of 0.021’’ faults form the test set. The data split for the train/validation/test sets is displayed in Table 4. The numbers of training, validation and test samples are 4 × 200 = 800, 2 × 200 = 400, and 3 × 200 = 600, respectively.

Table 4 The four groups of data splits for the CWRU dataset

As for performance evaluation, we are interested in the accuracy for each type of unseen fault, so the average per-class top-1 accuracy, defined below, is used [28]:

$$acc=\frac{1}{\left|{Y}^{u}\right|}\sum\nolimits_{{y}^{u}\in {Y}^{u}}\frac{\#\,\text{correct detections in}\;{y}^{u}}{\#\,\text{samples in}\;{y}^{u}}$$
(9)

where \({y}^{u}\) denotes an unseen large-size fault class and \(\left|{Y}^{u}\right|\) is the number of unseen fault classes.
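Equation (9) amounts to the following computation, sketched here for label arrays of true and predicted unseen-fault classes:

```python
import numpy as np

def per_class_top1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average per-class top-1 accuracy (Eq. 9): the detection rate is computed
    separately for each class and then averaged over the unseen classes."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))
```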

4.1.3 Accuracy of zero-shot fault detection

The results of zero-shot fault detection are presented in Table 5. We used six types of 0.007’’ and 0.014’’ bearing faults to train the models, and then used the resulting classification models to detect three types of 0.021’’ bearing faults. The accuracies vary from 66.67 to 87.67% under the different working loads from 0 hp to 3 hp. Clearly, the performance of the proposed method is significantly better than the chance level of 33.33%, which indicates that it is possible to detect unseen large-size faults without using their samples in the training stage; this is achieved by sharing fault attributes between the seen small-size faults and the unseen large-size faults.

Table 5 The zero-shot fault detection results for the CWRU dataset (%)

Regarding the individual results of each group, the proposed method performs better for the 0 hp and 2 hp working loads than for the 1 hp and 3 hp working loads. We analyzed the confusion matrices for the latter two groups, which show that fault 9 (0.021’’ outer race fault) has a chance of being misclassified as fault 7 (0.021’’ inner race fault). To investigate the reason for the misclassification, we further analyzed the time-domain signals of fault 7 and fault 9 at 3 hp and their frequency-domain properties (spectra), shown in Fig. 6, from which it can be observed that the two signals have nearly the same resonance band at around 3,000 Hz. The inner race fault and the outer race fault are therefore similar in the frequency domain, implying that these two types of faults are more difficult to distinguish than the rolling element faults. Nevertheless, the following experiments demonstrate that our results are highly competitive compared with many state-of-the-art methods.

Fig. 6 Time-domain and frequency-domain signals of fault 7 and fault 9 at 3 hp

4.1.4 Comparison with other ZSL methods

We compare five state-of-the-art zero-shot learning methods reviewed in Section 2, namely SJE [30], DEVISE [29], SAE [31], ESZSL [5], and the zero-shot fault detection method of [3], under the same setting. Note that the first four methods were designed for image classification, where image attributes were defined and used for discrimination purposes. However, the visual attributes proposed in these methods are not applicable to time-series signal-based fault detection tasks. Hence, in this study, we incorporate the newly designed fault attribute matrix (shown in Fig. 5) into our proposed SSB-ZSL-1DCNN model, and use it to replace the image attributes and label information employed in the compared methods. For a fair comparison, we train the SJE, DEVISE, SAE and ESZSL models on the same raw data. The results produced by the trained models are presented in Table 6.

Table 6 Performance comparison of different zero-shot learning methods for fault detection on the CWRU dataset (%)

From Table 6, it is clear that our method significantly outperforms the five compared methods: the first four are semantic space embedded ZSL methods and the fifth is a fault description-based attribute transfer approach for ZSL. The very low classification accuracies of the compared methods may be explained as follows. These methods, initially designed for 2D data (images), work well for image recognition and classification tasks based on extracted image features. For the 1D data considered in this study, however, they degrade considerably, since they cannot find useful image-like features in the given 1D vibration signals, and the required visual attributes are not available either. All this suggests that good feature extraction methods, capable of effectively finding the most useful and representative features of 1D time-series fault data, are highly needed. Hence, in the following, we apply two groups of feature extraction methods to the experimental dataset and evaluate their accuracies: deep learning-based methods (VGG16, VGG19 and ResNet50) and traditional methods (PCA, ICA and KPCA).

4.1.5 Performance comparison of feature extraction methods

In this section, we evaluate the feature extraction performance of the proposed method and compare it with two other groups of feature extraction methods: three deep learning-based methods and three traditional feature extraction methods.

Most CNNs are designed for learning from 2D data (e.g., images). However, in the field of fault detection and diagnosis of industrial systems, signals are in most cases represented as 1D time series, so typical CNN models cannot be applied directly. One way of using 2D CNNs to handle 1D signals is to convert the 1D signals into 2D data. In this study, we use the continuous wavelet transform to obtain time-frequency images; each output image is a 3-channel RGB image of size 236 × 236 × 3. To remove the effect of the colour map on the networks, each RGB image is transformed into a 224 × 224 time-frequency grayscale image. Each type of fault image is then fed into one of the following three 2D deep neural networks: VGG16, VGG19 [45] and ResNet50 [46], all of which were pre-trained on the 1,000 ImageNet classes to initialize the parameters. Detailed information about these three models is shown in Table 7, and the implementation procedure of the three methods is depicted in Fig. 7.
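The 1D-to-2D conversion can be sketched as follows with PyWavelets; the Morlet wavelet and the 1-64 scale range are assumptions, as the text only fixes the 224 × 224 grayscale output fed to the pre-trained networks.

```python
import numpy as np
import pywt                                   # PyWavelets
from PIL import Image

def cwt_grayscale(signal: np.ndarray, size: int = 224) -> np.ndarray:
    """Continuous wavelet transform of a 1D signal, with the |coefficient|
    scalogram rescaled to a size x size grayscale image."""
    coeffs, _ = pywt.cwt(signal, np.arange(1, 65), "morl")   # wavelet assumed
    mag = np.abs(coeffs)
    mag = 255.0 * (mag - mag.min()) / (np.ptp(mag) + 1e-12)  # normalize to 0..255
    img = Image.fromarray(mag.astype(np.uint8)).resize((size, size))
    return np.asarray(img)

print(cwt_grayscale(np.sin(np.linspace(0, 100, 1024))).shape)  # (224, 224)
```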

Table 7 Experimental settings for deep learning-based feature extraction methods
Fig. 7 Procedure of 2D feature extraction

For the three traditional feature extraction methods, we first apply an overlap sampling procedure to the raw signals to generate the input data for PCA, ICA and KPCA. The output of each feature extraction method is a 64 × 1 vector, matching the feature length of the 1D CNN proposed in this study.
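A sketch of the three baselines with scikit-learn, using random stand-ins for the overlap-sampled (N, 1024) segment matrices:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, KernelPCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 1024))     # stand-in for training segments
X_test = rng.normal(size=(600, 1024))      # stand-in for test segments

# Each baseline reduces a 1024-point segment to a 64-dim feature vector,
# matching the output length of the proposed 1D CNN.
extractors = {
    "PCA": PCA(n_components=64),
    "ICA": FastICA(n_components=64, random_state=0, max_iter=500),
    "KPCA": KernelPCA(n_components=64, kernel="rbf"),
}
features = {name: m.fit(X_train).transform(X_test) for name, m in extractors.items()}
print({name: f.shape for name, f in features.items()})
```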

The performances of the six compared methods and the proposed method are shown in Table 8, from which it can be seen that the 1D CNN model designed in this study performs far better than the three deep CNN methods. These results strongly confirm that 1D CNNs are efficient and promising for fault detection in industrial systems, with the following advantages: (1) a 1D CNN has much lower computational complexity than a 2D CNN; (2) a 1D CNN has fewer hidden layers and a simpler architecture than a 2D CNN, so less time is needed for training and implementation; (3) unlike 2D CNNs, which need GPUs or high-performance computational resources, a 1D CNN can feasibly be implemented and run on an ordinary CPU, which significantly reduces cost in many real applications.

Table 8 Performance comparison of different fault feature extraction methods for the CWRU dataset (%)

Meanwhile, compared with the three traditional feature extraction methods, the proposed SSB-ZSL-1DCNN model also shows significantly better performance, especially against the two linear feature extraction methods, PCA and ICA. For the ‘3 hp’ case, our proposed SSB-ZSL-1DCNN achieves an accuracy of 66.78%, slightly better than that of KPCA. For the other cases, however, SSB-ZSL-1DCNN performs far better than the best results of the three baseline methods, with increases in accuracy of 18.67%, 11.00% and 27.17%, respectively.

In addition, compared with the attribute transfer method [3], our model performs better in all cases, especially for the two groups of 0 hp and 2 hp. Several reasons may explain the better performance of the proposed method. Firstly, SSB-ZSL-1DCNN is an end-to-end model, whereas the baseline method is not: it trains an attribute learner for each attribute and classifies faults based on the outputs of the attribute learners, so the classification results rely heavily on the accuracy of those learners. Secondly, the classifiers used in the compared method are LSVM (linear support vector machine), RF (random forest) and NB (naïve Bayes), of which LSVM is a linear approach whose results are much worse than those of the other two; in the comparisons, we used the average accuracy of the three classifiers. Thirdly, the feature extraction method used in the baseline is supervised PCA, a linear feature reduction method requiring the training samples to follow a Gaussian distribution; such a requirement, however, may not be met by the CWRU dataset.

Regarding computational cost, the two groups of feature extraction methods need completely different running times, as shown in Table 9. The average running time of the three traditional methods is 1.83 s, whereas the average running time of VGG16, VGG19 and ResNet50 is 2,485.06 s. The overall time used by the proposed SSB-ZSL-1DCNN method is 71.98 s for 20 epochs. Clearly, the deep learning-based methods need far more training time, and usually the deeper the CNN, the more time it needs. Compared with the other convolutional networks and the traditional methods, the proposed 1D CNN model needs only a relatively small amount of time while showing significantly better results. Considering running time and classification accuracy together, SSB-ZSL-1DCNN adapts very well to 1D time-series signals and is highly efficient.

Table 9 Computational time of different feature extraction methods for the CWRU dataset

4.2 The Tennessee-Eastman Process (TEP) dataset

4.2.1 Introduction to the TEP dataset

The Tennessee-Eastman dataset [39] was collected from a comprehensive industrial chemical process by the Eastman Chemical Company and is widely studied in the field of fault detection. The process contains five major operation units: a reactor, a product condenser, a vapor-liquid separator, a recycle compressor and a product stripper. Since this plant-wide industrial process has many kinds of faults, it could be challenging to collect enough samples to build a fault detection system, and faults might bring considerable losses to the factory; a zero-shot fault detection method therefore becomes necessary to detect specific faults of such a system without their samples in the training stage.

A total of 15 types of faults are used in this case study; the details are given in Table 10. Each type of fault contains 480 samples, and each sample involves 52 variables (features) in total.

Table 10 Fault description for the TEP dataset

4.2.2 Model implementation

The structure of the 1D CNN is the same as that for CWRU described in Section 4.1.2. For the fault attribute matrix in the second step, we use the fault attribute matrix A proposed in [3], shown in Table 11 and Fig. 8. There are a total of 20 human-defined fault attributes shared between the seen faults and the unseen faults.

Table 11 Fault attributes for the TEP dataset
Fig. 8 Fault attribute matrix A [3]

For the experimental setting, we split the 15 types of faults into three parts: eight for training (seen), four for validation and three for testing (unseen). To comprehensively evaluate the proposed method, we follow the practice in [3] and divide the TEP data into five groups; the details are shown in Table 12. The number of training samples is 5,760, and the number of test samples is 1,440. The evaluation criteria are the same as those for the CWRU dataset.

Table 12 Five groups of sub-datasets for the TEP dataset [3]

4.2.3 Fault detection accuracy with ZSL

The fault detection results of ZSL for the five groups of the TEP data are shown in Table 13. The detection accuracies vary from 59.72 to 96.67% for different types of unseen faults. The accuracy details of each unseen fault are presented in the confusion matrices in Fig. 9. For some specific classes, such as fault 8 in Group C and fault 2 in Group D, the accuracies are 39% and 33%, respectively, close to the chance level of 33.33%, but none falls below it. From the fault descriptions in Table 10, we can see that fault 2 and fault 8 are two complicated faults, involving three-quarters of the components in the chemical reactions. Hence, given the nature and mechanism of the ZSL method, it is more difficult to train an accurate model based only on the seen fault data, which have few fault attributes, to detect unseen faults that involve many more fault attributes.

Table 13 Fault detection results for the TEP dataset with the proposed ZSL method (%)
Fig. 9 Confusion matrices of the results of unseen faults

From Table 13, the proposed method shows much better performance in Groups A, B, C and E, although its accuracy is slightly below that of [3] in Group D. Note that for Group E, the accuracy of our method is far higher than that reported in [3]. To explain such a large difference in accuracy, we checked the confusion matrices presented in [3]: the accuracy was 26% for fault 5 and 21% for fault 9, both below the random chance level (33.33%). The accuracies of our method for the three types of faults, however, reach 98%, 93% and 99%, respectively.

4.2.4 Comparison with traditional feature extraction methods

In the previous section, we compared different deep learning-based feature extraction methods on the CWRU dataset. We now focus on three commonly used traditional feature extraction methods, namely PCA, ICA and KPCA, to further evaluate and compare the performance of the proposed method.

For the three feature extraction methods, PCA, ICA and KPCA, we implemented the experiments using the scikit-learn package [47]. As for the parameter settings, each method extracts 20 features from the 52 raw variables. The accuracies of the different feature extraction methods are presented in Table 14. For all groups except Group D, we achieve impressive results. For Group D, the proposed SSB-ZSL-1DCNN model achieved a slightly lower accuracy than ICA. However, we obtained significant improvements for Groups A-C: 15.84%, 20.06% and 16.25% higher than the best results of the other three methods, respectively. In Table 15, we also compare the computational time used by PCA, ICA, KPCA and SSB-ZSL-1DCNN. The time used by KPCA is more than twice that used by the 1D CNN. Although the running times of PCA and ICA are shorter than that of SSB-ZSL-1DCNN, their overall accuracies are obviously much lower than those of our method.

Table 14 Performance comparison of different feature extraction methods for the TEP dataset (%)
Table 15 Computational time of different feature extraction methods for the TEP dataset

From the above comparisons, it can be concluded that the proposed method performs better than the traditional feature extraction methods. The SSB-ZSL-1DCNN model works well for 1D time-series fault signals and achieves higher accuracy for fault detection.

4.2.5 Comparison with traditional multi-classification methods

In this section, we further compare our method with three multi-classification methods, namely naïve Bayes (NB), random forest (RF) and support vector machine (SVM). We conducted experiments using the five groups (A-E) of data described in Table 12. Taking Group A as an example, the experimental settings are as follows.

1) Faults 2–5, 7–13 and 15 are treated as known (seen) types, and faults 1, 6 and 14 are treated as unknown (unseen) types. We denote the former by G1 and the latter by G2.

2) For the three traditional methods, samples from both G1 and G2 were used to train the fault classification models. Specifically, a total of 480 × 12 = 5,760 samples from G1 (480 samples per fault type) were used for model training. In addition to the 5,760 samples, two sets of samples from G2, one consisting of 10 × 3 = 30 samples (10 per fault type) and another consisting of 50 × 3 = 150 samples (50 per fault type), were also added to the training data.

3) For the proposed method, only the 5,760 samples from G1 were used for SSB-ZSL-1DCNN model training; no sample from G2 was used.

4) A total of 480 × 3 = 1,440 samples from G2 (480 samples per fault type), the same as those used in the previous experiments, were used for the model performance test.

5) The experiments for the three methods, NB, RF and SVM, were implemented using the scikit-learn package [47].

The results of the traditional multi-classification methods under the few-shot learning setting are shown in Table 16.

Table 16 Performance comparison of different multi-class classification methods for the TEP dataset (%)

From the results, all three traditional multi-classification methods show poor performance, even though a number of unseen fault samples were included in the training dataset. Further calculations show that the overall accuracies lie between 20% and 40%, even as the number of unseen fault samples increases from 10 to 50. None of the three traditional methods produced a result comparable to that of our method. The highest accuracy among the three methods is 46.64%, achieved by RF for Group E with 10 unseen fault samples included in the training dataset; our method achieves 96.67% for Group E without using any unseen fault samples at all.

It should be stressed that a direct comparison of our method with these three traditional multi-classification methods is unfair and puts our method at a disadvantage, because the zero-shot learning setting is entirely different from the few-shot learning setting, let alone the settings of traditional multi-class classification methods. It should also be noted that the fault descriptions proposed in this study may not be applicable to the multi-classification methods. In conclusion, it is important to introduce fault attributes as side information for fault detection under the zero-shot learning setting.

4.2.6 Comparison with other ZSL methods

The experimental implementation in this section is similar to that in Section 4.1.4 and considers four classic semantic-based ZSL methods (i.e., SJE, DEVISE, ESZSL and SAE). We replace the visual attributes used in these four methods with the fault attributes proposed in this study. The experimental results are shown in Table 17, from which it can be seen that our method outperforms the other four ZSL methods in all groups except Group D. SAE is a deep learning-based method that uses auto-encoders with several hidden layers and has better nonlinear representation ability than the other three shallow models. For Group D, the accuracy of SSB-ZSL-1DCNN is slightly lower than that of SAE (by 5.84%), but for the other groups, our method performs much better than SAE: 38.61% higher for Group A, 2.64% higher for Group B, 28.54% higher for Group C and 10.16% higher for Group E. Overall, SSB-ZSL-1DCNN is more suitable for 1D time-series data thanks to the introduction of the fault attribute matrix. This, in turn, shows the importance and usability of feature extraction and fault descriptions for solving the zero-shot fault detection problem.

Table 17 Performance comparison of different zero-shot learning methods for fault detection on the TEP dataset (%)

5 Conclusion

In the field of fault detection, we often encounter the following situation: samples of certain types of faults are unavailable or extremely difficult to obtain for various reasons. Bearing this in mind, a new semantic space based zero-shot learning model with a 1D CNN (SSB-ZSL-1DCNN) is proposed in this work for fault detection. The proposed method has the following feature: an SSB-ZSL-1DCNN model can detect new (unseen) faults even though the dataset used for training the model does not include any samples of those faults. This is important and useful for solving fault detection tasks in which fault types that have never been seen before may occur.

The applicability and effectiveness of the proposed SSB-ZSL-1DCNN have been demonstrated on two well-known benchmarks: the CWRU bearing dataset for rotary machines and the TEP dataset for an entire chemical process control system. In the first case, the experiments focus on training the model using only samples of small-size faults and detecting large-size faults with the trained model. In the second case, the focus is on solving many-fault detection and many-class classification problems, which are more comprehensive and challenging tasks. In both case studies, the proposed method shows excellent performance, far better than the compared methods.

In the future, we will improve the overall performance of the proposed SSB-ZSL-1DCNN model in two respects. Firstly, the source of side information could be extended to unsupervised fault label embeddings rather than manually defined fault attributes; this can help increase the efficiency of the method. Secondly, the proposed method can be extended to generalized zero-shot learning (GZSL), which addresses a more practical scenario where both seen (known) faults and unseen (unknown) faults must be effectively detected by models trained using only samples of seen faults.