Research article

Modeling of daily confirmed Saudi COVID-19 cases using inverted exponential regression

  • Received: 14 December 2020 Accepted: 24 February 2021 Published: 08 March 2021
  • The coronavirus disease 2019 (COVID-19) pandemic caused by the coronavirus strain has had a massive global impact and has interrupted economic and social activity. The daily confirmed COVID-19 cases in Saudi Arabia are shown to be affected by several explanatory variables that are recorded daily: recovered COVID-19 cases, critical cases, daily active cases, tests per million, curfew hours, maximal temperature, maximal relative humidity, maximal wind speed, and maximal pressure. Restrictions applied by the Saudi Arabian government due to the COVID-19 outbreak, from the suspension of Umrah and flights to the lockdown of some cities with a curfew, are based on information about COVID-19. The aim of the paper is to propose some predictive regression models, similar to generalized linear models (GLMs), for fitting COVID-19 data in Saudi Arabia in order to analyze, forecast, and extract meaningful information that helps decision makers. In this direction, we propose regression models based on the inverted exponential distribution (IE-Reg), as well as Bayesian (BReg) and empirical Bayesian regression (EBReg) models used in conjunction with the inverted exponential distribution (IE-BReg and IE-EBReg). In all approaches, we use the logarithm (log) link function, a gamma prior, and two loss functions in the Bayesian approach, namely the zero-one and LINEX loss functions. To deal with outliers in the proposed models, we apply Huber and Tukey's bisquare (biweight) functions. In addition, we use the iteratively reweighted least squares (IRLS) algorithm to estimate the Bayesian regression coefficients. Further, we compare IE-Reg, IE-BReg, and IE-EBReg using criteria such as Akaike's information criterion (AIC), the Bayesian information criterion (BIC), deviance (D), and mean squared error (MSE). Finally, we apply the collected data of daily confirmed cases from March 23 to June 21, 2020, with the corresponding explanatory variables, to the theoretical findings. IE-EBReg provides a good model for the COVID-19 cases in Saudi Arabia compared with the other models.

    Citation: Sarah R. Al-Dawsari, Khalaf S. Sultan. Modeling of daily confirmed Saudi COVID-19 cases using inverted exponential regression[J]. Mathematical Biosciences and Engineering, 2021, 18(3): 2303-2330. doi: 10.3934/mbe.2021117




With the implementation of China's three-child policy, obstetric clinical research faces unprecedented challenges, and the number of parturient women of advanced maternal age is increasing rapidly. The incidence of obstetric and gynecological diseases in women of advanced maternal age is significantly higher than in pregnant women of optimal childbearing age [1]. Electronic Medical Records (EMRs) are the most detailed and direct record of clinical medical activities. A doctor's clinical diagnosis process can be regarded as judging the probability of a patient suffering from a certain disease according to the patient's clinical manifestations and examination results. An obstetric EMR usually contains multiple diagnostic results; that is, a patient may be diagnosed with both "gestational diabetes mellitus" and "gestational hypertension", and the diagnostic results are strongly coupled. If an EMR is regarded as one sample, each sample can be assigned to multiple categories. Therefore, the intelligent diagnosis problem can be regarded as a multi-label classification problem in machine learning, where the multiple diagnostic results in an EMR correspond to different labels [2].

However, the data distribution of EMRs is often imbalanced, and the number of samples of rare diseases is far smaller than that of common diseases [3]. An imbalanced dataset degrades the performance of traditional classification algorithms [4], which tend to treat minority classes as noise or outliers and ignore them during classification [5]. For imbalanced EMRs, the cost of a false negative is much higher than that of a false positive. For example, suppose that among 100 EMRs, 99 results are normal and one indicates cancer. If a traditional classification algorithm is directly applied to such data, all the diagnostic results will be predicted as normal; although the accuracy is as high as 99%, the most critical cancer case is missed.

In neural networks, the features of the input samples have a great influence on the classifier. Rebalancing methods work because they update the classifier weights and significantly improve the learning ability of the deep network classifier, but they damage the feature learning ability of the deep network [6,7]. In addition, the coupling of high-frequency and low-frequency diagnostic results within an EMR must be considered in any multi-label rebalancing method: removing an EMR containing high-frequency diagnostic results also removes its low-frequency diagnostic results, and cloning an EMR containing low-frequency diagnostic results to add new instances also increases the frequency of the high-frequency diagnostic results it contains.

In recent years, intelligent diagnosis has become a research focus. Yin et al. [8] proposed a method to extract features from heart rate variability signals and classify patients' states using a long short-term memory network. Yan et al. [9] collected 1880 endoscopic images and developed a Gastric Intestinal Metaplasia (GIM) system from these images using a modified convolutional neural network algorithm. Wang et al. [10] proposed a Patch Shuffle stochastic pooling neural network, which improves recognition of Corona Virus Disease 2019 (COVID-19) infection from chest CT (CCT) images and can help radiologists diagnose COVID-19 cases more quickly and accurately.

In addition to image data, text-based intelligent diagnosis is also a research focus. Rajkomar et al. [11] showed that, among methods using EMRs, deep learning methods outperform state-of-the-art statistical prediction models. Maxwell et al. [12] used physical examination data to predict possible chronic diseases such as diabetes, hypertension and fatty liver. Yang et al. [13] proposed a Convolutional Neural Network (CNN) based auxiliary diagnosis method, which learns high-level semantic understanding from EMRs through self-learning and outputs prediction probabilities for common diseases including hypertension and diabetes. Liang et al. [14] proposed a system framework based on pediatric EMRs, which integrates the medical knowledge in pediatric EMRs for intelligent diagnosis.

In current research, the imbalanced distribution of datasets is an important factor limiting the performance of intelligent diagnosis: the precision of low-frequency disease diagnosis is too low, which reduces the practicability of the diagnosis model. Existing work on imbalanced data can be divided into data-level methods and algorithm-level methods.

Data-level methods transform the original dataset into a relatively balanced one during the data preparation stage. Liu et al. [15] proposed an algorithm based on information granulation, which assembles the data of the majority classes into granules to balance the class proportions, and used prostate cancer data to predict patient survival. Huang et al. [16] proposed a random balanced sampling algorithm based on association rule selection and verified its performance on a private diabetes EMR dataset.

Unlike data-level methods, algorithm-level methods do not change the distribution of the training data; instead, they increase the importance of minority classes in the learning and decision-making process [17]. Li et al. [18] proposed a dice loss to increase the weight of difficult samples and reduce the weight of easy negative samples. For high-dimensional imbalanced text data, researchers have found that selecting features that help identify minority classes can effectively handle class imbalance. Yang et al. [19] proposed a text feature selection method based on relation scores: by calculating the relation score of each feature and category, the scores of minority-class features are increased and the imbalance among categories in the dataset is reduced.

At present, most intelligent diagnosis research based on EMRs targets a single disease and does not consider multiple complications or other diagnostic results. In addition, the presence of rare diseases makes EMR datasets imbalanced, and the diagnostic performance of existing models needs further improvement.

This paper proposes a Double Decoupled Network (DDN) to improve the performance of intelligent diagnosis based on imbalanced datasets. Our main contributions are summarized as follows:

1) DDN is proposed to decouple representation learning from classifier learning and to decouple the highly coupled diagnostic results.

2) In the classifier learning stage, a Decoupled and Rebalancing highly Imbalanced Labels (DRIL) algorithm is proposed to decouple the highly coupled diagnostic results and rebalance the dataset.

3) Experiments on a real Chinese Obstetric EMR (COEMR) dataset and two public datasets show that the DDN method outperforms the comparison methods.

Inspired by [6], we first analyze the performance of rebalancing strategies in neural networks. We then discuss the behavior of rebalancing strategies under the highly coupled diagnostic results of the COEMR dataset. On this basis, this paper proposes DDN to decouple representation learning from classifier learning and to decouple highly coupled diagnostic results.

In order to solve the problem of highly coupled diagnostic results and to obtain better features of the input samples, an intelligent diagnosis model based on a double decoupled network is proposed in this section. The overall architecture of DDN is shown in Figure 1. First, DDN decouples representation learning from classifier learning. In the representation learning stage, DDN uses a CNN to learn the original features of the COEMR dataset D = [d_1, d_2, ..., d_n], where d_i is a sample in the dataset, and then fixes the parameters learned in this stage. The input text is mapped to a sequence of embedding vectors. The embedded words pass through a convolutional layer and a non-linear activation function to capture indicative information. The pooling layer selects the types of information useful for prediction by taking the maximum value of each feature map. Finally, fully connected layers integrate the disease-discriminative information from the convolutional and pooling layers. In the classifier learning stage, the DRIL algorithm is proposed to decouple the highly coupled diagnostic results and rebalance the dataset, and the rebalanced dataset D′ = [d′_1, d′_2, ..., d′_n] is used to train the classifier. The classifier consists of a fully connected layer and a Softmax function. Both stages use the same CNN network structure and share all weights except for the last fully connected layer.

    Figure 1.  The architecture of DDN model.
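To make the two-branch structure above concrete, the following is a minimal PyTorch sketch of a DDN-style backbone: an embedding layer, convolutions of several widths followed by max pooling as the representation branch, and a fully connected classifier head. The filter widths (2, 3, 4), 25 filters per width and 73 labels follow the experimental settings reported later; the vocabulary size and embedding width are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a DDN-style backbone (illustrative assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNEncoder(nn.Module):
    """Representation branch: embedding -> multi-width convolutions -> max pooling."""
    def __init__(self, vocab_size=30000, emb_dim=128, widths=(2, 3, 4), n_filters=25):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in widths]
        )
        self.out_dim = n_filters * len(widths)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)             # (batch, out_dim)

class DDN(nn.Module):
    """Encoder plus a fully connected classifier head, trained in two stages."""
    def __init__(self, n_labels=73, **enc_kwargs):
        super().__init__()
        self.encoder = TextCNNEncoder(**enc_kwargs)
        self.classifier = nn.Linear(self.encoder.out_dim, n_labels)

    def forward(self, token_ids):
        return self.classifier(self.encoder(token_ids))   # one logit per label
```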

Two methods are commonly adopted to deal with imbalanced data: resampling the samples and reweighting the sample losses within mini-batches. To explore the working mechanism of rebalancing methods, we divide the training process of a neural network into two stages, namely representation learning and classifier learning. Specifically, in the first stage, we use either the common method (Cross Entropy, CE) or a rebalancing method (Re-Sampling/Re-Weighting, RS/RW) to train the neural network and obtain the corresponding feature extractor. We then fix the parameters of the feature extractor. In the second stage, the classifier is retrained with either the common method or a rebalancing method. In this section, all combinations of representation learning and classifier learning methods are compared. Figure 2 shows the precision of the different combinations on the COEMR and Arxiv Academic Papers Dataset (AAPD) datasets.

    Figure 2.  Precision of different methods on COEMR and AAPD datasets.

    1) CE: the traditional cross entropy loss is used to train the network on the original imbalanced data.

2) RS: in this section, a class-balanced resampling method is used to ensure that every class has the same probability of appearing in each batch. The sampling probability p_j is calculated as in Eq (1).

$p_j = \frac{1}{C}$ (1)

    where C is the number of all labels in the training set.

3) RW: reweighting all classes according to the reciprocal of their sample size (a short sketch of both strategies follows this list).
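As a concrete illustration of the RS and RW strategies just listed, the sketch below computes the class-balanced sampling probabilities of Eq (1) and the reciprocal-frequency loss weights; the class counts are toy numbers, not COEMR statistics.

```python
# Toy illustration of the RS and RW rebalancing baselines.
import numpy as np

class_counts = np.array([18139, 6257, 1249, 259], dtype=float)  # made-up counts
C = len(class_counts)

# RS (Eq (1)): every class is drawn with probability 1/C, so each individual
# sample of class j is drawn with probability (1/C) / n_j.
p_class = np.full(C, 1.0 / C)
p_sample = p_class / class_counts

# RW: weight the loss of class j by 1/n_j (normalised so the weights average to 1).
loss_weights = 1.0 / class_counts
loss_weights *= C / loss_weights.sum()

print(p_sample)
print(loss_weights)
```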

For representation learning, when the same classifier learning method is used (comparing the precision of the three blocks in the horizontal direction), the precision of the CE block is always higher than that of the RW/RS blocks. This indicates that CE yields better classification results because it learns better features. The poorer results of the RW/RS blocks show that the RW/RS methods learn weaker deep features, which damages representation learning.

For classifier learning, when the same representation learning method is used (comparing the precision of the three blocks in the vertical direction), the RW/RS methods achieve higher precision than CE. These results show that the main reason rebalancing methods achieve balanced performance on imbalanced data is that they directly affect the update of the deep network classifier weights, that is, they promote classifier learning.

This section discusses the influence of rebalancing methods on representation learning and classifier learning in neural networks. We find that rebalancing methods significantly promote classifier learning, but they also damage feature learning to a certain extent. To address this, we propose to decouple representation learning from classifier learning. In the representation learning stage, the original features of the dataset are learned. In the classifier learning stage, the rebalanced dataset is used for training, so as to balance representation learning and classifier learning, improve generalization for low-frequency data, and improve classification performance on imbalanced data.
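A minimal sketch of this decoupled two-stage training is given below; it assumes a model exposing `.encoder` and a linear `.classifier` head (as in the earlier backbone sketch), and the loss function, optimizer settings and data loaders are illustrative assumptions rather than the authors' exact training code.

```python
# Sketch of decoupled two-stage training: stage 1 learns representations on the
# original (imbalanced) data; stage 2 freezes the encoder and retrains only the
# classifier on the rebalanced data.
import torch
import torch.nn as nn

def train_ddn(model, stage1_loader, stage2_loader, epochs=10, lr=1e-3, device="cpu"):
    model.to(device)
    criterion = nn.BCEWithLogitsLoss()               # multi-label objective (assumption)

    # Stage 1: representation learning on the original data.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for tokens, labels in stage1_loader:
            opt.zero_grad()
            loss = criterion(model(tokens.to(device)), labels.float().to(device))
            loss.backward()
            opt.step()

    # Stage 2: freeze the feature extractor, re-initialise the classifier head
    # (Xavier, as in the experimental settings reported later) and retrain it
    # on the rebalanced data only.
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    nn.init.xavier_uniform_(model.classifier.weight)
    nn.init.zeros_(model.classifier.bias)
    opt = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    for _ in range(epochs):
        for tokens, labels in stage2_loader:
            opt.zero_grad()
            loss = criterion(model(tokens.to(device)), labels.float().to(device))
            loss.backward()
            opt.step()
    return model
```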

The rebalancing strategy is independent of the classifier, so it is applicable to a wider range of scenarios than adaptive classifiers. However, traditional rebalancing methods struggle to achieve good performance on multi-label data. The main problems are the large differences in imbalance degree between labels in a multi-label dataset, and the high coupling of low-frequency and high-frequency labels within the same sample.

The imbalance ratio and average imbalance ratio between labels in a multi-label dataset can be determined according to the method proposed in reference [20]. The first measure is the Imbalance Ratio per label (IR), shown in Eq (2), which evaluates the imbalance ratio of a single label. Define a multi-label dataset D = {(X_i, Y_i) | 0 ≤ i ≤ n, Y_i ⊆ L}, where X_i is the i-th sample in the dataset, Y_i is the label set of X_i, and L is the full label set of the dataset.

$IR(l) = \dfrac{\max_{l'=L_1}^{L_{|L|}} \sum_{i=1}^{|D|} h(l', Y_i)}{\sum_{i=1}^{|D|} h(l, Y_i)}, \quad h(l, Y_i) = \begin{cases} 1, & l \in Y_i \\ 0, & l \notin Y_i \end{cases}$ (2)

The second measure is the Mean Imbalance Ratio (MeanIR), which is an overall estimate of the imbalance degree of a multi-label dataset, that is, the average IR value over all labels; see Eq (3), where |L| is the number of labels in the dataset.

$MeanIR = \dfrac{1}{|L|} \sum_{l=L_1}^{L_{|L|}} IR(l)$ (3)

According to the MeanIR and IR values, this section defines high-frequency and low-frequency labels: a label whose IR value is higher than MeanIR is a low-frequency label; otherwise, it is a high-frequency label. For a label y, if IR(y) > MeanIR, it belongs to minBags; otherwise it belongs to majBags.
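The short sketch below computes IR (Eq (2)), MeanIR (Eq (3)) and the minBags/majBags split from a binary label matrix; the toy matrix is for illustration only.

```python
# IR, MeanIR and the minBags/majBags split for a toy binary label matrix.
import numpy as np

Y = np.array([[1, 0, 0, 1],     # rows are samples, columns are labels
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0]])

label_counts = Y.sum(axis=0)            # sum_i h(l, Y_i) for each label l
IR = label_counts.max() / label_counts  # Eq (2): most frequent count / own count
MeanIR = IR.mean()                      # Eq (3)

min_bags = np.where(IR > MeanIR)[0]     # low-frequency labels
maj_bags = np.where(IR <= MeanIR)[0]    # high-frequency labels
print(IR, MeanIR, min_bags, maj_bags)
```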

To quantify the coupling degree of low-frequency and high-frequency labels within the same sample of a multi-label dataset, we use the SCUMBLE measure [21]. As can be seen from Eq (4), SCUMBLE relies on the aforementioned IR metric. The coupling degree SCUMBLEins(i) is first obtained for each sample, and then the average coupling degree SCUMBLE(D) is calculated for the entire multi-label dataset, as shown in Eq (5). SCUMBLE values are normalized to the [0, 1] range, and the larger the value, the higher the coupling between imbalanced labels.

$SCUMBLE_{ins}(i) = 1 - \dfrac{1}{\overline{IR}_i} \left( \prod_{l=1}^{|L|} IR_{il} \right)^{1/|L|}$ (4)
$SCUMBLE(D) = \dfrac{1}{|D|} \sum_{i=1}^{|D|} SCUMBLE_{ins}(i)$ (5)

We calculate the average coupling degree SCUMBLE(D) = 0.3 for the COEMR dataset, and visualize the label coupling of the COEMR dataset using a chord diagram, as shown in Figure 3. There is a high coupling degree in the COEMR dataset, and the low-frequency labels are completely associated with some high-frequency labels.
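For reference, the sketch below computes SCUMBLEins (Eq (4)) and SCUMBLE(D) (Eq (5)) under the common reading in which the mean and product run over the labels present in each instance; the label matrix is the same toy example as in the previous sketch, not the COEMR data.

```python
# Per-instance and dataset-level SCUMBLE for a toy label matrix.
import numpy as np

Y = np.array([[1, 0, 0, 1],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0]])
IR = Y.sum(axis=0).max() / Y.sum(axis=0)              # Eq (2)

def scumble(Y, IR):
    scores = np.zeros(len(Y))
    for i, row in enumerate(Y):
        active = IR[row == 1]                         # IR of the labels present in sample i
        if active.size == 0:
            continue                                  # label-free samples contribute 0
        geo_mean = active.prod() ** (1.0 / active.size)
        scores[i] = 1.0 - geo_mean / active.mean()    # Eq (4): 1 - geometric/arithmetic mean
    return scores, scores.mean()                      # Eq (5): dataset average

scumble_ins, scumble_D = scumble(Y, IR)
print(scumble_ins, scumble_D)
```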

Figure 3.  Visualization of label coupling in the COEMR dataset.

The high coupling of imbalanced labels can be alleviated with a label decoupling strategy. Charte et al. [22] proposed the REMEDIAL algorithm, which is independent of the resampling algorithm and uses SCUMBLEins(i) > SCUMBLE(D) as the condition for deciding whether to decouple the labels of a sample; the coupling between high-frequency and low-frequency labels is reduced by separating them. Based on this, this paper proposes the DRIL algorithm, which decouples and clones the samples with high SCUMBLE values to obtain two instances, one associated with the high-frequency labels and the other with the low-frequency labels, thereby reducing the coupling degree, and then combines oversampling and under-sampling to rebalance the dataset. This reduces both the loss of high-frequency disease information and overfitting on low-frequency diseases.

Table 1 shows the pseudocode of DRIL. Specifically, DRIL first calculates the IR and MeanIR values of each label to determine which category the label belongs to. The resampling rate P specifies the proportion of samples to be adjusted. Then, DRIL calculates the SCUMBLE of each sample and of the whole dataset, SCUMBLE(D). The number of low-frequency labels in a sample, NumOfMinBag, is calculated as:

$NumOfMinBag = \sum_{l \in Y_i} h(l, \mathrm{minBags})$ (6)
    Table 1.  DRIL algorithm.
Input: multi-label dataset D, resampling rate P
Output: decoupled and rebalanced dataset D′
1  Calculate samplesToResampling = |D| × P, IR, MeanIR and MeanSamples
2  Calculate SCUMBLEins(i) of each sample Di in D, and calculate SCUMBLE(D) of the dataset
3  For each instance Di in D do
4    Calculate the number of low-frequency labels of sample Di: NumOfMinBag
5    If SCUMBLEins(i) > SCUMBLE(D) and NumOfMinBag < |Yi| then
6      Clone Di → D′i; Li = Li[IR(y) ≤ MeanIR], L′i = L′i[IR(y) > MeanIR]
7      Dd = Dd + Di + D′i, D = D − Di (Dd is the decoupled set)
8  D = D + Dd, D′ = D
9  While SCUMBLE(D′) > 0.1 or samplesToResampling > 0
10   Randomly select a label y
11   If |y| < MeanSamples and y ∈ minBags then
13     x = Random(0, MeanSamples − |y|); select x samples with label y
14     Add the x samples to D′: D′ += x
16   If |y| > MeanSamples and y ∈ majBags then
18     x = Random(0, |y| − MeanSamples); select x samples with label y
19     Remove the x samples from D′: D′ −= x
20   samplesToResampling −= x; recalculate MeanIR and IR
21 Return D′


When SCUMBLEins(i) > SCUMBLE(D) and NumOfMinBag < |Yi|, the labels of the sample are decoupled. Samples Di with highly coupled labels are found according to step 5. From step 5 to step 7, one sample is decoupled into two: the clone of Di is D′i, Li is the label set of Di, and L′i is the label set of D′i, with Li = Li[IR(y) ≤ MeanIR] and L′i = L′i[IR(y) > MeanIR]. Through decoupling, the low-frequency and high-frequency labels of a sample are separated, and the decoupled set Dd is obtained. MeanSamples is the number of samples required for every label to reach the mean imbalance state given by MeanIR; it is calculated by dividing the sample count of the most frequent label by MeanIR:

$MeanSamples = \dfrac{\max_{l=L_1}^{L_{|L|}} \sum_{i=1}^{|D|} h(l, Y_i)}{MeanIR}$ (7)

From step 9 to step 21, the dataset is balanced by combining oversampling and under-sampling. First, a label y is selected at random and x samples with label y are randomly drawn. If y belongs to minBags, x = Random(0, MeanSamples − |y|), and the selected samples are added to the dataset D′. If y belongs to majBags, x = Random(0, |y| − MeanSamples), and the x samples are deleted from the dataset D′. DRIL uses MeanSamples to bound the number x of samples added or removed, so that the resampling never overshoots or undershoots the amount needed to balance the label distribution. At the end of each rebalancing step, MeanIR and IR are recalculated, while MeanSamples always keeps its initial value, so the original distribution of the dataset is not greatly affected. Finally, we obtain the dataset D′, which is decoupled and balanced by the DRIL algorithm.
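To make the decoupling step concrete, the sketch below implements steps 3-8 of Table 1 on a list of (text, label-set) pairs; it is a simplified rendering under the assumptions stated in the comments, not the authors' implementation. The over-/under-sampling loop of steps 9-21 would then repeatedly pick a random label and clone or remove its samples until SCUMBLE(D′) falls below 0.1 or the resampling budget |D|·P is exhausted.

```python
# Simplified DRIL decoupling step (Table 1, steps 3-8).
def dril_decouple(samples, IR, MeanIR, scumble_ins, scumble_D):
    """samples: list of (x, label_set) pairs; IR: per-label imbalance ratios.
    Returns a dataset in which highly coupled samples are split into a
    high-frequency copy and a low-frequency copy."""
    min_labels = {l for l in range(len(IR)) if IR[l] > MeanIR}   # low-frequency labels
    kept, decoupled = [], []
    for (x, labels), s in zip(samples, scumble_ins):
        n_min = len(labels & min_labels)
        if s > scumble_D and 0 < n_min < len(labels):
            decoupled.append((x, labels - min_labels))   # keeps labels with IR <= MeanIR
            decoupled.append((x, labels & min_labels))   # keeps labels with IR >  MeanIR
        else:
            kept.append((x, labels))
    return kept + decoupled

# Toy usage: label 2 is low-frequency and coupled with label 0 in "record A".
samples = [("record A", {0, 2}), ("record B", {0, 1})]
print(dril_decouple(samples, IR=[1.0, 2.5, 5.0], MeanIR=2.8,
                    scumble_ins=[0.25, 0.0], scumble_D=0.1))
```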

We evaluate the proposed intelligent diagnosis model on the COEMR dataset, and verify the effectiveness and generality of the model on two benchmark multi-label text classification datasets: AAPD and RCV1. Table 2 shows the descriptive statistics of the datasets used in the experiments. In the representation learning stage, the CNN filter widths are set to (2, 3, 4) and the number of filters of each width is 25. When training the classifier, the Xavier method [23] is used to randomly initialize the classifier parameters. The resampling rate P of DRIL is set to 0.1, which is the best resampling rate for multi-label data [24]. Adam [25] is employed as the optimizer, with the learning rate set to 0.001, the batch size to 32, and the dropout rate to 0.3.

    Table 2.  Statistical information of COEMR, AAPD and RCV1 datasets.
    Dataset Total Train Test Label MeanIR SCUMBLE
    COEMR 24,339 21,905 2434 73 246.5693 0.3028
    AAPD 55,840 54,840 1000 54 16.9971 0.1158
    RCV1 804,414 23,149 781,265 103 279.6319 0.3497


COEMR: For this dataset, 24,339 EMRs were randomly selected from the inpatient departments of several hospitals. EMRs are mainly composed of structured and unstructured text data. Structured data include basic patient information, such as age, ethnicity, and laboratory test data. Unstructured data mainly refer to the patient's statement, the course of hospitalization, objective examinations, etc. To protect patient privacy, names, ID numbers and other private information were removed. Table 3 shows a sample record from the obstetric COEMR dataset.

Table 3.  A sample record from the obstetric COEMR dataset.
    Title Content
    Sex Female
    Age Thirty-six years old
Chief complaint "Amenorrhea for more than 6 months, vaginal bleeding for 4 hours". The pregnant woman previously had regular menstrual cycles; a self-administered urine HCG test was positive 30 days after the last menstrual period. After more than one month of amenorrhea, ectopic pregnancy was excluded by B-ultrasound examination, and at 40 days of amenorrhea nausea, vomiting and other early pregnancy reactions appeared...
Admission physical examination T: 36.6 °C, P: 80/min, R: 20/min, BP: 120/80 mmHg.
Normal development, medium nutrition, clear-minded, in good spirits, walked into the ward, autonomous body position, cooperative during examination. The skin and mucous membranes of the whole body are ruddy, without jaundice, rash or bleeding points; no enlarged superficial lymph nodes were palpable...
Obstetric examination Extrapelvic measurements: IS: 24.0 cm, IC: 27.0 cm, EC: 19.0 cm, TO: 9.0 cm. Uterine height: 29.0 cm, abdominal circumference: 93.0 cm, fetal heart rate: 144 beats/min, estimated fetal weight: 2600 g, no contractions
Auxiliary examination Fetal color ultrasound: BPD: 74.0 mm, FL: 53.0 mm, AFI: 165.0 mm, fetal orientation: breech, S/D: 2.2, placenta grade I
    Admission diagnosis Threatened preterm birth
    Placenta previa (marginal)
    Intrauterine pregnancy 28+2 weeks
    G3P1
    Breech presentation
Umbilical cord around the neck (one loop)
    Diagnostic basis Pregnancy greater than or equal to 28 weeks and less than 37 weeks
Presence of irregular or regular contractions, with or without dilation of the cervical os
    Minimal vaginal bleeding


The distribution of diagnostic results in the COEMR dataset is shown in Table 4. In the COEMR dataset, more than 90% of the records contain the diagnostic result "head position", while "gestational hypertension" accounts for less than 10%. Based on the diagnostic results of 73 highly coupled diseases, all 24,339 samples were divided into a training set (21,905) and a test set (2434) in a 9:1 ratio.

Table 4.  Distribution of diagnostic results in the COEMR dataset.
    Label Number Label Number Label Number
    Head position 18,139 Fetal dysplasia 1249 Induced labor 265
    Threatened labor 6257 threatened abortion 1112 RH negative blood 259
    Pregnancy with uterine scar 5757 Placenta previa 1033 Fetal distress 257
    Premature rupture of membranes 3239 Preeclampsia 1029 Pregnancy induced hypertension 251
    Oligohydramnios 2897 Precious child 819 Cervical insufficiency 217
    Gestational diabetes mellitus 2661 Polyhydramnios 496 Pregnancy complicated with hysteromyoma 201
    Threatened preterm birth 2130 Intrauterine fetal growth restriction 405 Diabetes complicated with pregnancy 189
    Umbilical cord around neck 2054 Group B streptococcal infection 374 Pregnancy complicated with hyperthyroidism 182
    Breech 1806 Pregnancy with hypothyroidism 335 Pregnancy complicated with anemia 178
    Twin pregnancy 1329 Low placental 287 Inevitable abortion 177


Arxiv Academic Papers Dataset (AAPD): AAPD is a large multi-label text classification dataset constructed by Yang et al. [26]. It includes 55,840 computer science abstracts from Arxiv1.

    1 https://arxiv.org/

Reuters Corpus Volume I (RCV1): RCV1 is provided by Lewis et al. [27] and consists of manually annotated Reuters newswire stories from 1996 to 1997. Each story can be assigned multiple topics, from a total of 103 topics.

This paper compares the DDN model with common multi-label text classification methods: Binary Relevance (BR) [28], Label Powerset (LP) [29] and CNN. The BR algorithm builds a binary classifier for each label in the label set, which has the advantages of simplicity and efficiency. The LP algorithm treats each label combination as a new class and transforms the multi-label problem into a multi-class problem; its advantage is that it captures the semantic correlation between labels. Chen [30] first applied CNN to the text classification task; key information in a sentence is extracted by multiple filters of different sizes, so the local correlations of the text are better captured.
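For reference, the BR and LP baselines can be approximated with scikit-learn as in the sketch below; the TF-IDF features and the logistic regression base learner are assumptions made for illustration, since the paper does not specify the baselines' feature extraction.

```python
# Rough BR and LP baselines (illustrative configuration, see note above).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Binary Relevance: one independent binary classifier per label
# (fit with texts and a binary indicator matrix of labels).
br_model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)

# Label Powerset: map every distinct label combination to a single class id,
# then solve an ordinary multi-class problem.
def to_powerset(label_sets):
    combos = {c: i for i, c in enumerate(sorted({frozenset(s) for s in label_sets}, key=sorted))}
    return [combos[frozenset(s)] for s in label_sets]

lp_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# lp_model.fit(texts, to_powerset(label_sets))   # multi-class fit over label combinations
```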

In addition, to verify the effectiveness of the proposed decoupling module, DDN is compared with BR + DRIL, LP + DRIL and RS + CNN. BR + DRIL and LP + DRIL apply BR and LP to the dataset balanced by the DRIL algorithm proposed in this section, and RS + CNN feeds the data after class rebalancing directly into the CNN for classification.

From the MeanIR values in Table 2, it can be seen that all three datasets are imbalanced, so in theory all three can benefit from rebalancing. From the SCUMBLE values, the COEMR and RCV1 datasets have larger SCUMBLE values; they are difficult multi-label datasets with high coupling between labels of different imbalance levels. Tables 5-7 show the experimental results on the COEMR, AAPD and RCV1 datasets, respectively. In each column, the best result is shown in bold.
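The measures reported below (Hamming loss HL, precision P, recall R and F1) can be computed from binary label matrices as in this sketch; micro averaging is an assumption, since the averaging mode is not stated here.

```python
# Multi-label evaluation on toy prediction matrices.
import numpy as np
from sklearn.metrics import hamming_loss, precision_score, recall_score, f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("HL", hamming_loss(y_true, y_pred))
print("P ", precision_score(y_true, y_pred, average="micro"))
print("R ", recall_score(y_true, y_pred, average="micro"))
print("F1", f1_score(y_true, y_pred, average="micro"))
```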

Table 5.  Experimental results of DDN on the COEMR dataset.
    Model HL P R F1
    BR 0.0307 0.6114 0.5442 0.5758
    LP 0.0305 0.6073 0.5030 0.5503
    BR+DRIL 0.0281 0.6438 0.5581 0.5979
    LP+DRIL 0.0293 0.6352 0.5176 0.5704
    CNN 0.0266 0.8065 0.5427 0.6488
    RS+CNN 0.0251 0.8134 0.5496 0.6560
    DDN 0.0241 0.8417 0.5534 0.6678

Table 6.  Experimental results of DDN on the AAPD dataset.
    Model HL P R F1
    BR 0.0316 0.6642 0.6476 0.6558
    LP 0.0312 0.6624 0.6082 0.6344
    BR+DRIL 0.0433 0.6665 0.6491 0.6576
    LP+DRIL 0.0320 0.6630 0.6113 0.6361
    CNN 0.0256 0.8491 0.5456 0.6643
    RS+CNN 0.0253 0.8540 0.5587 0.6755
    DDN 0.0243 0.8635 0.5986 0.7071

Table 7.  Experimental results of DDN on the RCV1 dataset.
    Model HL P R F1
    BR 0.0086 0.9043 0.8160 0.8578
    LP 0.0087 0.8956 0.8244 0.8585
    BR+DRIL 0.0081 0.9192 0.8225 0.8682
    LP+DRIL 0.0084 0.9034 0.8251 0.8625
    CNN 0.0089 0.9223 0.7984 0.8559
    RS+CNN 0.0086 0.9244 0.7986 0.8569
    DDN 0.0076 0.9387 0.8231 0.8771


The results in Tables 5 and 7 show that after applying the proposed DRIL to the COEMR and RCV1 datasets, all measurements are improved. Compared with the traditional BR algorithm, the BR + DRIL algorithm improves P, R and F1 by 5.30, 2.55 and 3.84%, respectively. Compared with the traditional LP algorithm, the LP + DRIL algorithm improves P, R and F1 by 4.59, 2.90 and 3.65%, respectively. For the RCV1 dataset, due to the large number of labels and the complex hierarchical structure between labels, the improvement of each index is relatively small: compared with the traditional BR algorithm, the F1 value of BR + DRIL is increased by 1.21%, and that of LP + DRIL is increased by 0.27%. This is because the DRIL algorithm decouples the highly coupled labels and combines oversampling and under-sampling to balance the distribution of the dataset, which makes the data easier for multi-label classification algorithms to process. In addition, the smaller SCUMBLE value of the AAPD dataset indicates that there is almost no concurrence of imbalanced labels in that dataset, so applying the DRIL algorithm has relatively little effect on its results. The DDN model is based on CNN, which is suited to extracting local features; using BERT or other models could yield higher recall than the DDN model, but the F1 value may be reduced.

Neural network models can capture richer features and deeper semantic information, so CNN improves on most of the evaluation indexes compared with the traditional classification methods BR and LP. Because the CNN model is suited to extracting local features, on imbalanced data it tends to select features that favor high-frequency samples, so the recall gain of CNN is relatively small. For the RS + CNN model, the traditional class resampling method can increase the frequency of low-frequency class samples, but without considering the high coupling between labels, the performance is only slightly improved. Compared with the CNN model, the P and F1 values of DDN are improved by 4.36 and 2.93%, respectively, reaching 84.17 and 66.78%, and the Hamming loss is reduced by 9.40%.

The DDN model addresses the problem of highly coupled imbalanced labels by decoupling representation learning from classifier learning and decoupling the highly coupled labels, so that the model can learn high-quality text feature representations and its performance is further improved. DDN also performs well on AAPD and RCV1, which indicates that the DDN model can also be applied to other multi-label text classification tasks.

This paper further evaluates the COEMR dataset to explore the effect of the DDN model on different parts of the label distribution. As shown in Figure 4, the horizontal axis lists the diagnostic results sorted in descending order of the number of corresponding samples, and the vertical axis is the accuracy increment for each category.

Figure 4.  Accuracy increment of DDN for diagnostic results on the COEMR dataset.

It can be seen that DDN improves the diagnostic performance on low-frequency diseases. Taking the disease "Pregnancy with hypothyroidism" as an example, after label decoupling the number of electronic medical records containing the disease increases, and the model becomes more sensitive to the characteristics of this type of disease, such as "elevated serum TSH", "edema" and other disease-related signs and symptoms. The decoupled low-frequency labels are no longer suppressed by the high-frequency labels. In addition, DDN does not improve the performance on low-frequency diseases by sacrificing the diagnostic accuracy of high-frequency diseases: after label decoupling, the performance of the model improves on most diseases. As mentioned above, resampling methods often cause overfitting on low-frequency data, whereas DDN decouples representation learning from classifier learning to learn good feature representations and improve generalization for low-frequency data. In conclusion, the proposed DDN model, which includes both representation/classifier decoupling and label decoupling, performs well for the diagnosis of low-frequency diseases.

This paper proposes the DDN model for intelligent diagnosis based on imbalanced EMRs. A two-stage training method is proposed to decouple representation learning and classifier learning: in the representation learning stage, a CNN model is used to learn the original features of the data; in the classifier learning stage, considering the highly coupled diagnostic results of EMRs, the DRIL algorithm is proposed to decouple the highly coupled diagnostic results and balance the data distribution. The experimental results on the COEMR dataset show that DDN can effectively improve the performance of intelligent diagnosis based on imbalanced EMRs, especially the precision of low-frequency disease diagnosis. In the future, we will try to apply DDN to intelligent diagnosis of diseases with more complications, such as diabetes.

    We thank the anonymous reviewers for their constructive comments, and gratefully acknowledge the support of Major Science and Technology Project of Yunnan Province (202102AA100021), Zhengzhou City Collaborative Innovation Major Projects (20XTZX11020), National Key Research and Development Program (2017YFB1002101), National Natural Science Foundation of China (62006211), Henan Science and Technology Research Project (192102210260), Henan Medicine Science and Technology Research Plan: Provincial and Ministry Co-construction Project (SB201901021), Henan Provincial Key Scientific Research Project of Colleges and Universities (19A520003, 20A520038), The MOE Layout Foundation of Humanities and Social Sciences (Grant No. 20YJA740033), Henan Social Science Planning Project (Grant No. 2019BYY016).

    The authors declare there is no conflict of interest.



    [1] E. Tello-Leal, B. A. Macias-Hernandez, Association of environmental and meteorological factors on the spread of COVID-19 in Victoria, Mexico, and air quality during the lockdown, Environ. Res., (2020), 110442.
    [2] S. Kodera, E. A. Rashed, A. Hirata, Correlation between COVID-19 morbidity and mortality rates in Japan and local population density, temperature, and absolute humidity, Int. J. Env. Res. Pub. He., 17 (2020), 5477. doi: 10.3390/ijerph17155477
    [3] S. A. Meo, A. A. Abukhalaf, A. A. Alomar, N. M. Alsalame, T. Al-Khlaiwi, A. M. Usmani, Effect of temperature and humidity on the dynamics of daily new cases and deaths due to COVID-19 outbreak in Gulf countries in Middle East Region, Eur. Rev. Med. Pharmacol. Sci., 24 (2020), 7524-7533.
    [4] L. A. Casado-Aranda, J. Sanchez-Fernandez, M. I. Viedma-del-Jesus, Analysis of the scientific production of the effect of COVID-19 on the environment: A bibliometric study, Environ. Res., (2020), 110416.
    [5] B. Dogan, M. B. Jebli, K. Shahzad, T. H. Farooq, U. Shahzad, Investigating the effects of meteorological parameters on COVID-19: Case study of New Jersey, United States, Environ. Res., 191 (2020), 110148. doi: 10.1016/j.envres.2020.110148
    [6] S. A. Meo, A. A. Abukhalaf, A. A. Alomar, O. M. Alessa, W. Sami, D. C. Klonoff, Effect of environmental pollutants PM-2.5, carbon monoxide, and ozone on the incidence and mortality of SARS-COV-2 infection in ten wildfire affected counties in California, Sci. Total Environ., 757 (2021), 143948. doi: 10.1016/j.scitotenv.2020.143948
    [7] J. Yuan, Y. Wu, W. Jing, J. Liu, M. Du, Y. Wang, et al., Non-linear correlation between daily new cases of COVID-19 and meteorological factors in 127 countries, Environ. Res., 193 (2021), 110521. doi: 10.1016/j.envres.2020.110521
    [8] P. McCullagh, J. A. Nelder, Generalized Linear Models, 1st edition, Chapman and Hall, London, 1983.
    [9] J. A. Nelder, D. Pregibon, An extended quasi-likelihood function, Biometrika, 74 (1987), 221-232. doi: 10.1093/biomet/74.2.221
    [10] J. A. Nelder, R. W. M. Wedderburn, Generalized linear models, J. R. Stat. Soc. Ser. A, 135 (1972), 370-384.
    [11] K. H. Yuan, P. M. Bentler, Improving the convergence rate and speed of Fisher-scoring algorithm: ridge and anti-ridge methods in structural equation modeling, Ann. Inst. Stat. Math., 69 (2017), 571-597. doi: 10.1007/s10463-016-0552-2
    [12] P. De Jong, G. Z. Heller, Generalized linear models for insurance data, 1st edition, Cambridge Books, 2008.
    [13] T. F. Liao, Interpreting Probability Models: Logit, Probit, and Other Generalized Linear Models, No 07-101, SAGE Publications, Thousand Oaks, 1994.
    [14] R. Richardson, B. Hartman, Bayesian nonparametric regression models for modeling and predicting healthcare claims, Insur. Math. Econ., 83 (2018), 1-8. doi: 10.1016/j.insmatheco.2018.06.002
    [15] C. Song, Y. Wang, X. Yang, Y. Yang, Z. Tang, X. Wang, et al., Spatial and Temporal Impacts of Socioeconomic and Environmental Factors on Healthcare Resources: A County-Level Bayesian Local Spatiotemporal Regression Modeling Study of Hospital Beds in Southwest China, Int. J. Env. Res. Pub. He., 17 (2020), 5890. doi: 10.3390/ijerph17165890
    [16] Y. Mohamadou, A. Halidou, P. T. Kapen, A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of COVID-19, Appl. Intell., 50 (2020), 3913-3925. doi: 10.1007/s10489-020-01770-9
    [17] T. A. Trunfio, A. Scala, A. D. Vecchia, A. Marra, A. Borrelli, Multiple Regression Model to Predict Length of Hospital Stay for Patients Undergoing Femur Fracture Surgery at "San Giovanni di Dio e Ruggi d'Aragona" University Hospital, In European Medical and Biological Engineering Conference, Springer, Cham, (2020), 840-847.
    [18] A. Z. Keller, A. R. R. Kamath, U. D. Perera, Reliability analysis of CNC machine tools, Reliab. Eng., 3 (1982), 449-473. doi: 10.1016/0143-8174(82)90036-1
    [19] Y. Abdel-Aty, A. Shafay, M. M. M. El-Din, M. Nagy, Bayesian inference for the inverse exponential distribution based on pooled type-II censored samples, J. Stat. Appl. Pro., 4 (2015), 235.
    [20] S. Dey, Inverted exponential distribution as a life distribution model from a Bayesian viewpoint, Data Sci. J., 6 (2007), 107-113. doi: 10.2481/dsj.6.107
    [21] S. K. Singh, U. Singh, A. S. Yadav, P. K. Vishwkarma, On the estimation of stress strength reliability parameter of inverted exponential distribution, IJSW, 3 (2015), 98-112. doi: 10.14419/ijsw.v3i1.4329
    [22] L. Fahrmeir, G. Tutz, Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd edition, Springer Science and Business Media, Berlin/Heidelberg, 2013.
    [23] E. Cepeda, D. Gamerman, Bayesian methodology for modeling parameters in the two parameter exponential family, Rev. Estad., 57 (2015), 93-105.
    [24] D. K. Dey, S. K. Ghosh, B. K. Mallick, Generalized Linear Models: A Bayesian Perspective, 1st edition, CRC Press, New York, 2000.
    [25] U. Olsson, Generalized Linear Models, An Applied Approach, 1st edition, Student Litteratur Lund., Sweden, 2002.
    [26] N. Sano, H. Suzuki, M. Koda, A robust ensemble learning using zero-one loss function, J. Oper. Res. Soc. Japan, 51 (2008), 95-110.
    [27] H. Robbins, An empirical Bayes approach to statistics, In Breakthroughs in statistics, Springer, (1955), 388-394.
    [28] L. Wei, Empirical Bayes test of regression coefficient in a multiple linear regression model, Acta Math. Appl. Sin-E, 6 (1990), 251-262. doi: 10.1007/BF02019151
    [29] R. S. Singh, Empirical Bayes estimation in a multiple linear regression model, Ann. Inst. Stat. Math., 37 (1985), 71-86. doi: 10.1007/BF02481081
    [30] W. M. Houston, D. J. Woodruff, Empirical Bayes Estimates of Parameters from the Logistic Regression Model, ACT Res. Report Ser., (1997), 97-96.
    [31] S. L. Wind, An empirical Bayes approach to multiple linear regression, Ann. Stat., 1 (1973), 93-103. doi: 10.1214/aos/1193342385
    [32] S. Y. Huang, Empirical Bayes testing procedures in some nonexponential families using asymmetric Linex loss function, J. Stat. Plan. Infer., 46 (1995), 293-305. doi: 10.1016/0378-3758(94)00112-9
    [33] R. J. Karunamuni, Optimal rates of convergence of empirical Bayes tests for the continuous one-parameter exponential family, Ann. Stat., (1996), 212-231.
    [34] M. Yuan, Y. Lin, Efficient empirical Bayes variable selection and estimation in linear models, J. Am. Stat. Assoc., 100 (2005), 1215-1225. doi: 10.1198/016214505000000367
    [35] L. S. Chen, Empirical Bayes testing for a nonexponential family distribution, Commun. Stat., Theor. M., 36 (2007), 2061-2074. doi: 10.1080/03610920601143675
    [36] B. Efron, Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction, 1st, Cambridge University Press, 2012.
    [37] M. Shao, An empirical Bayes test of parameters for a nonexponential distribution family with Negative Quadrant Dependent random samples, In 2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), IEEE, (2013), 648-652.
    [38] J. E. Kim, D. A. Nembhard, Parametric empirical Bayes estimation of individual time-pressure reactivity, Int. J. Prod. Res., 56 (2018), 2452-2463. doi: 10.1080/00207543.2017.1380321
    [39] K. Jampachaisri, K. Tinochai, S. Sukparungsee, Y. Areepong, Empirical Bayes Based on Squared Error Loss and Precautionary Loss Functions in Sequential Sampling Plan, IEEE Access, 8 (2020), 51460-51465. doi: 10.1109/ACCESS.2020.2979872
    [40] Y. Li, L. Hou, Y. Yang, J. Tong, Huber's M-Estimation-Based Cubature Kalman Filter for an INS/DVL Integrated System, Math. Probl. Eng., (2020), 2020.
    [41] B. Sinova, S. Van Aelst, Advantages of M-estimators of location for fuzzy numbers based on Tukey's biweight loss function, Int. J. Approx. Reason., 93 (2018), 219-237. doi: 10.1016/j.ijar.2017.10.032
    [42] P. McCullagh, J. A. Nelder, Generalized Linear Models, 2nd edition, Chapman and Hall/CRC, 1985.
    [43] S. Das, D. K. Dey, On Bayesian analysis of generalized linear models using the Jacobian technique, Am. Stat., 60 (2006), 264-268. doi: 10.1198/000313006X128150
    [44] S. Ferrari, F. Cribari-Neto, Beta regression for modelling rates and proportions, J. Appl. Stat., 31 (2004), 799-815. doi: 10.1080/0266476042000214501
    [45] S. Das, D. K. Dey, On Bayesian analysis of generalized linear models: A new perspective, Technical Report, Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, (2007), 33.
    [46] P. J. Huber, Robust estimation of a location parameter, Ann. Math. Stat., 35 (1964), 73-101. doi: 10.1214/aoms/1177703732
    [47] P. J. Rousseeuw, A. M. Leroy, Robust Regression and Outlier Detection, 1st edition, John Wiley and Sons, NY, 1987.
    [48] L. Chang, B. Hu, G. Chang, A. Li, Robust derivative-free Kalman filter based on Huber's M-estimation methodology, J. Process Control, 23 (2013), 1555-1561. doi: 10.1016/j.jprocont.2013.05.004
    [49] P. J. Huber, Robust Statistics, 1st edition, John Wiley and Sons, NY, 1981.
    [50] R. A. Maronna, R. D. Martin, V. J. Yohai, Robust Statistics: Theory and Methods, 1st edition, John Wiley and Sons, West Sussex, 2006.
    [51] F. Wen, W. Liu, Iteratively reweighted optimum linear regression in the presence of generalized Gaussian noise, In 2016 IEEE International Conference on Digital Signal Processing (DSP), IEEE, (2016), 657-661.
    [52] H. Kikuchi, H. Yasunaga, H. Matsui, C. I. Fan, Efficient privacy-preserving logistic regression with iteratively Re-weighted least squares, In 2016 11th Asia Joint Conference on Information Security (AsiaJCIS), IEEE, (2016), 48-54.
    [53] J. Tellinghuisen, Least squares with non-normal data: Estimating experimental variance functions, Analyst, 133(2) (2008), 161-166.
    [54] R. M. Leuthold, On the use of Theil's inequality coefficients, Am. J. Agr. Econ., 57 (1975), 344-346. doi: 10.2307/1238512
    [55] T. Niu, L. Zhang, B. Zhang, B. Yang, S. Wei, An Improved Prediction Model Combining Inverse Exponential Smoothing and Markov Chain, Math. Probl. Eng., 2020 (2020), 11.
    [56] J. J. Faraway, Extending the linear model with R: Generalized linear, mixed effects and nonparametric regression models, 2nd edition, CRC press, 2016.
    [57] J. Fox, S. Weisberg, An R companion to applied regression, 3rd edition, Sage publications, Inc., 2018.
    [58] E. Dikici, F. Orderud, B. H. Lindqvist, Empirical Bayes estimator for endocardial edge detection in 3D+ T echocardiography, In 2012 9th IEEE International Symposium on Biomedical Imaging (ISBI), IEEE, (2012), 1331-1334.
    [59] A. Coluccia, F. Ricciato, Improved estimation of instantaneous arrival rates via empirical Bayes, In 2014 13th Annual Mediterranean Ad Hoc Networking Workshop, IEEE, (2014), 211-216.
  • © 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
