1. Introduction
Bearings play an important role in supporting rotating shafts, reducing friction and ensuring rotational accuracy, and are key components of high-end equipment such as aero engines [1], wind turbines [2] and high-speed trains [3]. At the same time, because they operate under high-temperature, high-pressure and high-speed conditions, bearings are also among the components most prone to failure. Common failure types include, but are not limited to, ball, inner ring and outer ring failures, and bearing failures account for approximately 30% of all rotating machinery failures [4]. Therefore, timely and accurate bearing fault diagnosis is essential to ensure the safe operation of equipment.
Initial research in bearing fault diagnosis mainly focused on signal processing methods, applying techniques such as autoregressive (AR) modelling, fast Fourier transform (FFT), and wavelet packet transform (WPT) to manually extract fault features from the time, frequency, and time-frequency domains [5,6]. However, these methods are often laborious and time-consuming, and they require considerable experience when dealing with large amounts of monitoring data. The successful applications of deep learning (DL) in speech recognition [7], image classification [8] and natural language processing [9] have shown the advantages of DL in automatic feature extraction and pattern recognition with less reliance on expert knowledge, which has inspired scholars to explore intelligent fault diagnosis methods based on DL algorithms.
Diverse DL models, each with its own features and advantages, have been widely applied in bearing fault diagnosis. The Deep Belief Network (DBN) is one of the earliest kinds of generative networks; it follows an unsupervised training procedure and has been used to reconstruct fault signals [10]. The Convolutional Neural Network (CNN) has the advantage of learning spatial features from monitoring data with a shared-weight architecture of convolutional kernels [11], and the Recurrent Neural Network (RNN) is good at capturing temporal and dynamic relationships from sequences and is often used to predict the health conditions of devices [12]. For example, Shao et al. designed a deep wavelet auto-encoder (DWAE) with extreme learning machine (ELM) for unsupervised feature learning from vibration signals [13]. Zhang et al. [14] designed a one-dimensional (1D) CNN model with wide first-layer kernels (WDCNN) for bearing fault diagnosis with ten classes of faults, achieving an average accuracy of 95.9% in domain adaptation experiments. Hoang et al. [15] applied a vibration image-based CNN model to classify four health statuses of bearings in the Case Western Reserve University (CWRU) dataset: normal condition, inner race failure, ball failure, and outer race failure. Chen et al. [16] proposed a multisensory data fusion technique that used a sparse autoencoder and a deep belief network (SAE-DBN) for feature extraction and classification, respectively. Jiang et al. [17] developed an improved deep recurrent neural network (DRNN) model for identifying both the fault category and the fault severity of rolling bearings. More details about these methods can be found in Table 1.
Although the above methods have achieved a high degree of accuracy in fault diagnosis, most of them are only good at classifying a small number of bearing faults (4–12 categories). However, faults in industrial practice are far more complex: modern equipment integrates multiple components, and each production line may contain many devices working under different conditions. These factors significantly increase the probability and variety of faults. As the number of fault types increases, the inter-class distance of some faults from the same system or component becomes smaller and their fault characteristics become increasingly similar [20], which challenges the classification ability of the traditional DL models described above. Therefore, there is an urgent need for advanced models that perform large-scale fine-grained fault diagnosis and provide specific and explicit decision-making information for equipment maintenance and repair [21].
Fortunately, several researchers have conducted exploratory studies on this topic. Sun et al. [22] combined multisynchrosqueezing transform (MSST) with sparse feature coding based on dictionary learning (SFC-DL) to achieve fine-grained fault diagnosis considering fault type and fault severity. Wang et al. [23] proposed a three-stage fault diagnosis method that first extracted knowledge from coarse-grained tasks, then transferred the knowledge and finally fine-tuned it on fine-grained tasks, and obtained good performance in large-scale fault diagnosis with 66 classes. From these works, we found that the extraction of discriminative features is the key to solving large-scale fine-grained fault diagnosis.
This paper further puts forward a multiscale hybrid model (MSHM), which consists of multiscale 1-D residual convolutional neural networks (ResCNN) and long short-term memory (LSTM) for fine-grained fault diagnosis, with five main implementation steps: noise filtering, spatial feature encoding, multiscale feature fusion, temporal feature learning, and fine-grained fault identification. The following highlights the main innovations in this paper:
(1) This paper addresses the challenges of fine-grained fault diagnosis under different working conditions and proposes MSHM as an intelligent solution.
(2) Bearing faults with different fault locations and different fault severities under different operating conditions are considered to be fine-grained faults whose vibration signals are collected for diagnostic purposes.
(3) The discriminative features of fine-grained faults are extracted by integrating the end-to-end spatial and temporal feature encoding capabilities of multiscale 1D ResCNN and LSTM.
(4) Extensive experiments are conducted based on two bearing datasets, and the results of comparison with other popular diagnostic models prove the superiority of the proposed method.
The rest of the paper is organized as follows: Section 2 presents the theoretical background and the details of the proposed method. Dataset introduction and experimental setup are described in Section 3. In Section 4, the experimental results from two case studies are analyzed. Section 5 concludes this paper and provides some future research directions.
2. Methodology
This part begins by introducing the basic theories of 1D CNN, residual learning and LSTM in processing vibrational signals, and then presents the structure and learning process of our method.
2.1. 1D CNN
CNN is a classical artificial neural network that uses convolutional operations to filter information and produce feature maps from the input data. 1D CNNs are a modified version of 2D CNNs with two advantages in dealing with sequence signals: they automatically learn the underlying information of different signals, and they process high-dimensional data with low computational complexity thanks to a shared-weight architecture. As displayed in Figure 1, the learning process of a conventional CNN with an input vibration signal usually includes the following operations: 1) convolution (Conv), which acts as a filter on the input; 2) batch normalization (BN), which rescales the input of each layer to speed up the training of the model; 3) nonlinear activation (ReLU, sigmoid, etc.), which improves the expressive ability by introducing a nonlinear transformation; and 4) pooling (P), which downsamples the feature maps to reduce the number of parameters. In the design of a CNN, the kernel size is an important parameter that determines the receptive field for feature extraction, and the receptive field $R_l$ of the $l$-th CNN layer can be described as:
$$R_l = R_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i, \qquad R_0 = 1,$$
where $k_i$ and $s_i$ represent the kernel size and the stride of the $i$-th layer, respectively.
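To make the role of kernel size and stride concrete, the following minimal sketch evaluates the standard receptive-field recurrence $R_l = R_{l-1} + (k_l - 1)\prod_{i<l} s_i$ with $R_0 = 1$. The function name and the example layer stack are illustrative assumptions, not the configuration used in this paper.

```python
def receptive_field(layers):
    """Return the receptive field after each layer.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Uses R_l = R_{l-1} + (k_l - 1) * prod_{i<l} s_i with R_0 = 1.
    """
    fields = []
    r = 1      # receptive field of a single raw input sample
    jump = 1   # product of the strides of all previous layers
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        fields.append(r)
        jump *= stride
    return fields


# Hypothetical stack: one wide first-layer kernel followed by two narrow kernels.
example_layers = [(64, 8), (3, 1), (3, 1)]
print(receptive_field(example_layers))  # prints [64, 80, 96]
```

The example shows why a wide first kernel matters: it alone already covers 64 input samples, whereas two stacked kernels of size 3 with stride 1 would cover only 5.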
2.2. Residual learning
Although traditional 1D CNNs are good at extracting fault features from time-series signals, their performance degrades significantly when using shallow networks to process nonlinear and high-dimensional data obtained from variable speed and non-stationary conditions. Therefore, it is often necessary to build deeper neural networks to extract representative features, but the increase in the number of network layers introduces problems such as gradient explosion. To address this problem, He et al. [24] proposed to reduce the error during training by skipping one or several layers through the design of shortcut connections.
For a deep network structure with an input $x$ and a learned feature $H(x)$, the network can learn the residual $F(x) = H(x) - x$ instead of learning the mapping of each stacked layer directly. The output $y$ of a residual learning block can be described with the following equation and the corresponding schematic diagram:
$$y = F(x, \{W_i\}) + x,$$
where $F(\cdot)$ and $W_i$ represent the residual mapping function and the model parameters, respectively. The residual learning block shown in Figure 2 consists of two 1D convolution layers and two ReLU activation layers.
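The shortcut connection can be illustrated with a small single-channel sketch in plain Python. It assumes zero padding, odd kernel sizes, and a ReLU applied after the addition (as in He et al.); batch normalization and multiple channels are omitted, and all function names are hypothetical.

```python
def conv1d_same(x, kernel):
    """Single-channel 1D cross-correlation with zero padding.

    The output has the same length as the input; assumes an odd kernel size.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, w2):
    """Compute ReLU(F(x, {w1, w2}) + x) with F = Conv(ReLU(Conv(x)))."""
    f = conv1d_same(relu(conv1d_same(x, w1)), w2)   # residual mapping F(x)
    return relu([a + b for a, b in zip(f, x)])      # shortcut adds the input back


signal = [1.0, -2.0, 3.0]
# With all-zero weights the block reduces to the shortcut path alone.
print(residual_block(signal, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

When the weights are zero, the residual mapping vanishes and the block simply passes its input through the shortcut, which is why adding such blocks does not make the identity mapping harder to learn.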
2.3. LSTM
RNN is another popular neural network that is adept at processing variable-length time-series signals through its memory operation. The LSTM is a variant of the RNN that is specially designed to avoid the vanishing gradient problem by controlling the memory state. Each typical LSTM layer has four major components: input gate $i_t$, forget gate $f_t$, state gate $s_t$, and output gate $p_t$. Their corresponding mathematical formulas are listed as follows.
where $g$ denotes the gating function, e represents the element-wise multiplication operation, and $h(t)$ is the hidden state. In addition, Figure 3 describes the feature extraction process of LSTM for a time-series signal $x$.
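As a rough illustration of the gating mechanism, the sketch below implements one step of a single-unit LSTM in the common textbook formulation. The gate names, the `params` layout and the scalar hidden state are assumptions made for brevity; they do not reproduce this paper's notation or equations.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single-unit LSTM (textbook formulation).

    params maps each gate name to a tuple (w_x, w_h, b) of scalar weights.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    i_t = sigmoid(pre("input"))          # how much new information to write
    f_t = sigmoid(pre("forget"))         # how much old cell state to keep
    o_t = sigmoid(pre("output"))         # how much of the cell state to expose
    c_hat = math.tanh(pre("candidate"))  # candidate cell content
    c_t = f_t * c_prev + i_t * c_hat     # updated cell state
    h_t = o_t * math.tanh(c_t)           # updated hidden state
    return h_t, c_t


def lstm_last_hidden(sequence, params):
    """Run the cell over a sequence and return the last hidden state."""
    h, c = 0.0, 0.0
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
    return h


# Hypothetical weights chosen only to exercise the code path.
demo_params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.2, 0.1, 1.0),
    "output": (0.3, 0.1, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}
print(lstm_last_hidden([0.1, -0.4, 0.7], demo_params))
```

Because the hidden state is the product of a sigmoid and a tanh, its magnitude always stays below 1, regardless of the input scale.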
2.4. End-to-end MSHM architecture
As shown in Figure 4, the proposed MSHM model is designed with a multiscale and hybrid structure and consists of five main components: a noise-filter module, a multiscale spatial feature encoder, a feature fusion module, a temporal feature encoder and a fault classifier. As described in Eq (2.9), the noise-filter module is a combination of a convolution layer (Conv), a batch normalization layer (BN), a ReLU activation function, and a max-pooling layer (MP). The key point is the application of the wide kernel principle [14], which expands the receptive field of the first convolutional layer and improves the sample quality by filtering the noise in the vibration signal $x$.
The information I obtained from the noise-filter module is then fed to the following multiscale spatial feature encoder, where three ResCNN blocks with different kernel scales are established in parallel to extract the spatial features of different failure modes. Specifically, each ResCNN block consists of two subblocks, as depicted in the right part of Figure 4, where residual learning is introduced between the second BN and the ReLU layer to extend the feature learning ability. The kernel sizes of the convolutional layers of the three ResCNN blocks are designed to be 1 × 3, 1 × 5, and 1 × 7. The specific architecture was selected based on two reasons: first, the odd kernel size can avoid alignment errors; second, the fault features extracted from different scales cover fault information from low to high frequencies. After that, a global average pooling (GAP) in each block is applied to avoid overfitting by reducing the spatial feature, and finally three feature maps containing different fault patterns are obtained as follows:
In order to aggregate the information extracted from the previous layers, the feature fusion module uses the concatenation operation $F_{con}$ to merge the feature maps $F_1$, $F_2$ and $F_3$ in the channel dimension, which can be described as:
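A minimal sketch of pooling followed by channel-wise concatenation is given below. It assumes that GAP reduces each channel to its mean over the time axis and that each feature map is stored as a list of channels; the function names and toy values are hypothetical, and the actual tensor shapes in MSHM may differ.

```python
def global_average_pool(feature_map):
    """Reduce each channel (a list of values along the time axis) to its mean."""
    return [sum(channel) / len(channel) for channel in feature_map]


def fuse(feature_maps):
    """Pool every branch and concatenate the results along the channel axis."""
    fused = []
    for feature_map in feature_maps:
        fused.extend(global_average_pool(feature_map))
    return fused


# Toy outputs of three parallel branches (e.g., kernel sizes 3, 5 and 7).
f1 = [[1.0, 3.0], [2.0, 2.0]]
f2 = [[0.0, 4.0]]
f3 = [[5.0, 5.0], [1.0, 1.0]]
print(fuse([f1, f2, f3]))  # prints [2.0, 2.0, 2.0, 5.0, 1.0]
```

The fused vector keeps the channels of each scale side by side, so the following layer can weigh information from all three kernel sizes jointly.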
Although multiscale spatial features have been obtained, they focus only on local features and ignore the sequential relationships hidden in the time-series signals. Therefore, an LSTM-based temporal feature encoder is added, which takes the fused features Fcon as input and adds important fault information to the cell states or removes redundancy by applying the gating mechanism, and then outputs the state of the last hidden layer, as described in Eq (2.8).
Finally, one dense layer with a softmax function acts as the classifier and converts the predictions into a probability distribution over the fine-grained faults. The softmax is defined as:
$$\mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}}, \qquad j = 1, \dots, n,$$
where $z_j$ represents the output of the $j$-th neuron and $n$ is the number of fine-grained faults. Note that we use only one LSTM layer and one dense layer for feature extraction and final classification, which is much more lightweight than existing methods that use multiple network layers with a large number of parameters [14,25].
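The softmax mapping can be sketched in a few lines of plain Python. Subtracting the maximum logit before exponentiation is a common numerical-stability choice added here as an assumption; it does not change the result.

```python
import math


def softmax(logits):
    """Convert raw classifier outputs into a probability distribution."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of the dense layer for a four-class toy problem.
probs = softmax([2.0, 1.0, 0.1, -1.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

The predicted fault class is the index with the highest probability, which is the same index that has the largest raw output.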
The MSHM-based fine-grained fault diagnosis framework is shown in Figure 5, and its implementation steps are: (1) acquiring vibration signals of large-scale faults with the data acquisition system; (2) constructing fine-grained fault samples for training, validation and testing from the raw signals; (3) training the MSHM model to adaptively extract fault features and identify faults, using the cross-entropy loss function and the Adam optimization algorithm; and (4) in the testing phase, using the trained model to predict the classes of fine-grained faults and analyzing the results.
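The cross-entropy loss minimized in step (3) can be illustrated as follows. This is a minimal sketch that works directly on raw classifier outputs through the log-sum-exp identity; the function names are hypothetical and the optimizer itself is not shown.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def batch_loss(batch_logits, targets):
    """Mean cross-entropy over a mini-batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# An untrained classifier with uniform outputs over 109 classes
# starts at a loss of ln(109), roughly 4.69.
print(cross_entropy([0.0] * 109, target=5))
```

A loss near $\ln n$ for $n$ classes therefore indicates predictions no better than uniform guessing, which gives a useful reference point when monitoring training.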
3. Datasets and experimental settings
This section describes the two bearing datasets as well as the experimental settings for the validation experiments of large-scale fine-grained bearing fault diagnosis under different working conditions.
3.1. Dataset introduction
This paper studies the problem of fine-grained fault diagnosis with multiple classes. Although there is no standard fine-grained fault dataset comparable to those in the field of computer vision, we reorganize the samples of two benchmark datasets to simulate the large-scale fine-grained bearing faults encountered in practice. Inspired by the sample organization method of [23], we consider various practical factors such as load, speed, health status, bearing type and damage size, and treat each failure under each working condition as a separate class of fine-grained fault.
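One way to realize this sample organization is to assign a class index to every distinct combination of factors. The sketch below assumes each recording is described by a small dictionary; the field names (`bearing`, `fault`, `size_in`, `load_hp`) and the example records are hypothetical and do not correspond to the actual file layout of either dataset.

```python
def build_label_map(records):
    """Assign one integer class to every distinct operating/fault combination."""
    keys = sorted(
        {(r["bearing"], r["fault"], r["size_in"], r["load_hp"]) for r in records}
    )
    return {key: index for index, key in enumerate(keys)}


records = [
    {"bearing": "drive_end", "fault": "inner_race", "size_in": 0.007, "load_hp": 0},
    {"bearing": "drive_end", "fault": "inner_race", "size_in": 0.007, "load_hp": 1},
    {"bearing": "drive_end", "fault": "inner_race", "size_in": 0.014, "load_hp": 0},
    {"bearing": "fan_end", "fault": "ball", "size_in": 0.007, "load_hp": 0},
    # A second recording of an already seen combination shares its class.
    {"bearing": "drive_end", "fault": "inner_race", "size_in": 0.007, "load_hp": 0},
]

label_map = build_label_map(records)
labels = [
    label_map[(r["bearing"], r["fault"], r["size_in"], r["load_hp"])]
    for r in records
]
print(len(label_map), labels)  # prints 4 [0, 1, 2, 3, 0]
```

Under this scheme the same fault type recorded at a different load or damage size becomes a different class, which is what makes the resulting task fine-grained.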
CWRU dataset. The Case Western Reserve University Bearing Fault Database (CWRU) is commonly used in evaluating bearing fault diagnosis methods [24]. The test bench used in the experiments is shown in Figure 6a and includes: one fan-end bearing, one drive-end bearing, a 2 hp motor, a torque encoder and a dynamometer. The bearings with different fault types (normal, ball fault, inner and outer raceway faults) with different severities (0.007 to 0.028 inches) were seeded using electro-discharge machining, and four loads (0 to 3 hp) and four speeds (1730 to 1797 RPM) were chosen to conduct the experiments. As listed in Table 2, 109 classes of fine-grained faults under different conditions are constructed based on the CWRU dataset in this paper.
PU dataset. The Paderborn University (PU) bearing dataset is another popular benchmark with a higher level of complexity, as it contains not only artificially damaged faults but also naturally damaged faults. Artificially damaged faults were caused by electric discharge machining, drilling and manual electric engraving, and the artificial fault signals were obtained with the test rig shown in the left part of Figure 6b. The real bearing faults were generated by accelerated lifetime tests, as depicted in the left part of Figure 6b. Failure data of 32 different bearings were obtained under each working condition: 6 healthy bearings, 12 artificially damaged bearings and 14 bearings from accelerated lifetime tests. As listed in Table 3, we organized 128 categories of health statuses as fine-grained faults based on the PU dataset.
3.2. Experimental settings
To avoid information loss and maintain sequential characteristics, our proposed method directly uses the raw vibration signal as input and employs the time-window-based sequence sampling strategy shown in Figure 7. First, each vibration signal is divided into two disjoint parts according to the time order of the signal generation; then, a time window is slid along the time axis of each part to generate samples for the training and test sets, respectively. The length and the shift step of the time window are set to 1024 and 100, and the validation set is generated by taking 20% of the samples from the training set. Specifically, 160, 40, and 20 samples of each type of fault were randomly selected from the corresponding sets for training, validation and testing, respectively.
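The sampling strategy can be sketched as follows. The window length, step and per-class sample counts follow the text; the position of the train/test cut (`split_ratio`), the random seed and the function names are assumptions introduced only for illustration.

```python
import random


def sliding_windows(signal, length=1024, step=100):
    """Cut a 1D signal into overlapping windows of fixed length."""
    return [
        signal[start:start + length]
        for start in range(0, len(signal) - length + 1, step)
    ]


def make_splits(signal, split_ratio=0.5, n_train=160, n_val=40, n_test=20, seed=0):
    """Split one signal in time order, then sample windows from each part."""
    cut = int(len(signal) * split_ratio)      # assumed cut point, not from the paper
    train_pool = sliding_windows(signal[:cut])
    test_pool = sliding_windows(signal[cut:])
    rng = random.Random(seed)
    train_val = rng.sample(train_pool, n_train + n_val)
    test = rng.sample(test_pool, n_test)
    # The last n_val of the 200 sampled windows (20%) form the validation set.
    return train_val[:n_train], train_val[n_train:], test


# Synthetic stand-in for one vibration recording of 60,000 points.
signal = list(range(60000))
train, val, test = make_splits(signal)
print(len(train), len(val), len(test))  # prints 160 40 20
```

Splitting the signal before windowing matters because consecutive windows overlap by 924 points; sampling training and test windows from the same segment would leak nearly identical samples into the test set.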
In the following experiments, the effectiveness of our proposed method is verified by comparing it with two conventional DL models, 1D CNN and bidirectional LSTM (BiLSTM) [26], and three state-of-the-art methods: WDCNN [14], a multiscale kernel based residual convolutional neural network (MK-ResCNN) [27] and a multi-scale CNN and LSTM model (MCNN-LSTM) [25]. All these models were trained on an NVIDIA GeForce RTX 2080 Ti GPU using the PyTorch 1.7 framework, and the learning rate and batch size were set to 0.005 and 1024, respectively, based on exploratory experiments.
In addition, accuracy and F1 score are used as the evaluation indicators of model performance. Accuracy is defined as the ratio of correctly predicted samples to the total number of samples, which describes how close the predictions are to the true fault classes. The F1 score measures the model performance on each class. The mathematical formulas for these two indicators are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
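For a multi-class task, these indicators can be computed as in the sketch below. Averaging the per-class F1 scores with equal weight (macro averaging) is an assumption, since the text does not state how the per-class scores are aggregated; the function names and toy labels are likewise illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 = 2TP / (2TP + FP + FN), treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed aggregation)."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

In this toy example the per-class F1 scores are 0.5, 0.8 and 2/3, so a single hard class lowers the macro F1 even when the overall accuracy looks acceptable.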
4. Validation experiments and analysis
Two case studies, based on the CWRU and the PU datasets respectively, are conducted to validate the model performance in large-scale fine-grained bearing fault diagnosis, and the experimental results are analyzed.
4.1. Case study 1: Validation based on the CWRU dataset
4.1.1. MSHM model construction
Our proposed MSHM is a multiscale and hybrid model whose construction involves many parameters. To search for the optimal model structure, we used the control variable method, adjusting one important parameter at a time and observing the change in model performance. According to Table 4, seven MSHM models with different parameters were built and compared on the CWRU dataset with 109 fine-grained faults; each model was run five times and the average accuracy was calculated.
As can be seen from Figure 8, different parameter choices have a huge impact on model performance. MSHM1 performs poorly because there are only 32 convolution kernels in the noise-filter module, which cannot effectively filter the noise in the signal. As the number of convolution kernels increases, the model performance improves the most when the number is 64. Comparing MSHM5 with the first 4 models, we found that multiscale spatial feature encoder improves the model performance by more than 7% when configured with 64 and 128 convolution kernels for the first and the second subblock in each ResCNN block. MSHM6 demonstrates that the single LSTM layer is suitable for temporal feature encoding and also reduces the model complexity. MSHM7 is the best model for fine-grained fault classification with an average accuracy of 90.41%, and therefore we finally use it as the architecture of our proposed model in this paper.
4.1.2. Comparison of model performance with state-of-the-art methods
The fine-grained bearing fault diagnosis task was built by considering various factors present in actual production, such as load, speed, bearing type and damage size, which resulted in a total of 109 classes of faults based on the CWRU dataset. In the comparison with other state-of-the-art diagnosis models, each experiment was performed five times, and the minimum accuracy, maximum accuracy, average accuracy and F1 score were used for a comprehensive evaluation. The results are shown in Figure 9.
The experimental results of fine-grained fault diagnosis clearly demonstrate the effectiveness of our proposed MSHM model, which outperforms the other five DL diagnostic models with its end-to-end discriminative feature encoding capability. MSHM achieved the highest accuracy of 91.28%, and even its minimum accuracy of 89.77% is higher than the maximum accuracy of the other models. WDCNN ranks second in model performance, with an F1 score and an average accuracy of 87.05% and 88.13%, respectively, which are 2.84% and 2.28% lower than those of MSHM. Meanwhile, MK-ResCNN shows the lowest performance in classifying 109 faults, where it loses 41.05%, 35%, 27.96% and 38.15% respectively when compared with MSHM. This considerable performance gap arises because MK-ResCNN was designed for diagnosing a limited number of fault classes and is unable to extract sufficient fault features from fine-grained faults under different conditions.
4.1.3. Model performance on bearing fault diagnosis with different granularities
In practice, the scale of faults that may occur is related to the complexity of the equipment system and the working conditions. Therefore, a good diagnostic model should have excellent generalization ability to deal with faults of different levels of complexity. Here, we conducted a series of experiments based on the CWRU dataset to validate the performance of the model in classifying bearing faults with different granularities (coarse, medium and fine), as determined by the number of fault classes under different working conditions. Specifically, as listed in Table 2, samples of ball, inner race and outer race faults with damage sizes of 0.007, 0.014 and 0.021 inches, together with the normal status, under the 1730 rpm working condition, for a total of 10 statuses, are used to establish the coarse-grained fault classification task. Samples of the drive-end bearing with all health statuses and all damage sizes under four working conditions are then used to construct the medium-grained fault diagnosis task with 64 categories. Finally, samples from all 109 categories of faults from the bearings at the drive end and the fan end are used to simulate the fine-grained fault classification task. Table 5 summarizes the results of the model performance comparison at different fault granularities, with the best model indicated in bold.
It can be seen that the difficulty of fault classification increases and the model performance decreases from coarse to medium to fine granularity, but our proposed MSHM model achieves the best performance among the models at every granularity, with the classification accuracy remaining above 90%. In the coarse-grained fault diagnosis with 10 classes, all models with the exception of MCNN-LSTM obtained an average accuracy of over 90% and an F1 score of over 99%, as the fault types are limited, are generated under the same working condition and are easily distinguished from each other. MSHM achieved 100%, 99%, 99.80% and 100% in terms of maximum accuracy, minimum accuracy, average accuracy and F1 score, which are 7.50%, 24%, 18.20% and 8.68% higher, respectively, than those of the MCNN-LSTM model. In the medium-grained fault diagnosis with 64 classes, the trend of the average accuracy and the F1 score indicates that the increase in operating conditions leads to greater diversity and similarity of faults, causing a certain degree of degradation in the feature extraction of the models. However, the multiscale and hybrid feature encoder design of the MSHM model improves the feature extraction capability by simultaneously learning spatial and temporal fault information, and its average accuracy and F1 score reach around 92%, about 10% higher than those of WDCNN. In the third experiment with fine-grained fault diagnosis, failure samples of another bearing at the fan end were added for a total of 109 categories of faults, and MSHM maintains its superiority with an average accuracy of 90.41% and an F1 score of 89.89%. In summary, the above experiments indicate that our proposed method has better classification and generalization performance in identifying small- and large-scale faults with different granularities.
To further explore the feature learning process of the MSHM model, we take the medium-grained fault diagnosis with 64 categories as an example and visualize the feature distributions of all test samples at the spatial feature encoder, the temporal feature encoder and the last fully-connected layer via the t-SNE method, as shown in Figure 10. The visualization shows that the raw signals are initially clustered together. After they are fed into the MSHM, each component contributes to the fault feature extraction and produces increasingly clear boundaries between different types of faults, allowing the classifier to eventually achieve accurate fine-grained fault classification.
4.2. Case study 2: Validation based on the PU dataset
4.2.1. Comparison of model performance with state-of-the-art methods
In this case study, to further verify the model performance, we carried out validation experiments on the PU dataset, which has a higher fault complexity than CWRU, as reflected in three aspects: first, the PU dataset includes not only artificially generated faults but also real faults obtained by accelerated life tests; second, the PU dataset has both single and multiple damage forms; third, the PU dataset has more categories of fine-grained faults than CWRU. The following experiments were conducted five times with the same setup as in case study 1, and the model performance is evaluated with the minimum accuracy, maximum accuracy, average accuracy and F1 score, as shown in Table 6.
The overall performance of all models in fine-grained fault diagnosis on the PU dataset is clearly much lower than on CWRU. Specifically, our proposed MSHM model achieves the highest accuracy of 80.12% on the PU dataset, which is 11.16% lower than that obtained on CWRU, while the performance of the other comparative methods, especially 1D CNN and WDCNN, drops by nearly 20%. Such an obvious performance difference reflects the difficulty of the PU dataset; even so, MSHM performs best with an average accuracy of 79.25% and an F1 score of 76.95%, thanks to its strength in feature extraction. BiLSTM performs second best, with a lower standard deviation of both average accuracy and F1 score than MSHM, indicating that temporal features are important and useful for fine-grained fault diagnosis. Notably, MK-ResCNN improves its performance on the PU dataset, obtaining better results than 1D CNN, WDCNN and MCNN-LSTM, which may be because MK-ResCNN is designed for fault diagnosis under complex working conditions and the PU dataset plays to this strength. In addition, the confusion matrix in Figure 11 shows the detailed accuracy of the proposed method for the first 64 classes of fine-grained faults. MSHM correctly classifies most of the faults but performs poorly on fault classes 17, 21, 24, 37, 41, 59, and 63, which are mostly bearing faults with actual damage and are more difficult to distinguish than other faults.
4.2.2. Model learning ability with limited training data
The previous experiments are based on the assumption that there are sufficient training samples, but it is challenging to collect massive data for fine-grained faults, and the labelling also requires much more expert knowledge. Thus, in this section, we verify the model learning ability under limited-data conditions by reducing the training samples from 50% to 10% of the original amount.
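The reduced training sets can be built by drawing the same fraction of samples from every class, as in the sketch below. The function name, the dictionary layout and the fixed seed are assumptions; the per-class count of 160 follows the experimental settings.

```python
import random


def subsample_per_class(samples_by_class, fraction, seed=0):
    """Keep the same fraction of training samples in every fault class."""
    rng = random.Random(seed)
    return {
        label: rng.sample(samples, max(1, round(len(samples) * fraction)))
        for label, samples in samples_by_class.items()
    }


# Toy training set: 3 classes with 160 placeholder samples each.
full_train = {label: list(range(160)) for label in range(3)}

for fraction in (0.5, 0.4, 0.3, 0.2, 0.1):
    reduced = subsample_per_class(full_train, fraction)
    print(fraction, len(reduced[0]))
```

Sampling within each class keeps the classes balanced at every training size, so differences in accuracy reflect the amount of data rather than a shift in class proportions.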
The experimental results in Figure 12 show the influence of the amount of training data on model performance in fine-grained fault diagnosis. When trained with 10% of the training set, so that only 16 samples per fault class are available for model learning, the 1D CNN, WDCNN and MCNN-LSTM models perform poorly with average accuracies below 30%, while MK-ResCNN and BiLSTM perform better with accuracies of 37.92% and 42.08%, which are respectively 13.08% and 8.92% lower than that of our proposed MSHM method. As the training size grows, the performance of all models improves because more fault information is available during training. Specifically, the performance of MSHM and BiLSTM improves the most when the training size is increased to 20%, while the remaining comparison models obtain their largest performance gain at the 30% training size. Further, the proposed MSHM model achieves an average accuracy of 70.94% when trained with 50% of the training samples, which is 5% to 20% higher than that of the other models. These results demonstrate the learning ability of MSHM with limited samples. We attribute this to the adaptive multiscale feature extraction and fusion of MSHM, which can compensate for the incomplete information caused by insufficient samples.
5. Conclusion
Most of the existing DL-based models are designed only for the diagnosis of a limited number of faults, and this paper aims to fill the research gap in fine-grained bearing fault diagnosis under various working conditions. Considering the difficulty of diagnosing fine-grained faults due to similar feature patterns, this paper proposes a novel deep multiscale hybrid model consisting of multiscale 1D ResCNNs and LSTM as an intelligent solution, where the noise in the input signal is first removed by a noise-filter module with a wide kernel, and then the extraction and fusion of the spatial and temporal features are realized by the encoders and the fusion layer of the proposed MSHM model, respectively. The performance of the MSHM is comprehensively evaluated on two benchmarks with more than 100 classes of fine-grained faults, and the results of the comparison experiments verify the importance of multiscale and hybrid fault features for fine-grained fault diagnosis and demonstrate the superiority of the MSHM over other mainstream DL models. Additionally, experiments with different fault granularities demonstrate the generalization ability and the application potential of the MSHM in practice. Two issues need to be studied further in future research: one is to improve the accuracy in diagnosing fine-grained faults with real damage, and the other is to explore the anti-noise capability of the proposed model.
Acknowledgments
This work was supported in part by the Guizhou Province Higher Education Project [No. QJHKY[2020]005, QJH KY[2020]009], and in part by the China Scholarship Council [No. 202106670003]. The authors are grateful for the computing support of the State Key Laboratory of Public Big Data, Guizhou University.
Conflict of interest
All authors declare no conflicts of interest in this paper.