Loading [MathJax]/jax/output/SVG/jax.js
Research article

The generalization ability of logistic regression with Markov sampling

  • Received: 08 May 2023 Revised: 07 July 2023 Accepted: 17 July 2023 Published: 20 July 2023
  • In the case of non-independent and identically distributed samples, we propose a new ueMC algorithm based on uniformly ergodic Markov samples, and study the generalization ability, the learning rate and convergence of the algorithm. We develop the ueMC algorithm to generate samples from given datasets, and present the numerical results for benchmark datasets. The numerical simulation shows that the logistic regression model with Markov sampling has better generalization ability on large training samples, and its performance is also better than that of classical machine learning algorithms, such as random forest and Adaboost.

    Citation: Zhiyong Qian, Wangsen Xiao, Shulan Hu. The generalization ability of logistic regression with Markov sampling[J]. Electronic Research Archive, 2023, 31(9): 5250-5266. doi: 10.3934/era.2023267

    Related Papers:

    [1] Qin Guo, Binlei Cai . Learning capability of the rescaled pure greedy algorithm with non-iid sampling. Electronic Research Archive, 2023, 31(3): 1387-1404. doi: 10.3934/era.2023071
    [2] Weishang Gao, Qin Gao, Lijie Sun, Yue Chen . Design of a novel multimodal optimization algorithm and its application in logistics optimization. Electronic Research Archive, 2024, 32(3): 1946-1972. doi: 10.3934/era.2024089
    [3] Showkat Ahmad Lone, Hanieh Panahi, Sadia Anwar, Sana Shahab . Estimations and optimal censoring schemes for the unified progressive hybrid gamma-mixed Rayleigh distribution. Electronic Research Archive, 2023, 31(8): 4729-4752. doi: 10.3934/era.2023242
    [4] Meixin Xiong, Liuhong Chen, Ju Ming, Jaemin Shin . Accelerating the Bayesian inference of inverse problems by using data-driven compressive sensing method based on proper orthogonal decomposition. Electronic Research Archive, 2021, 29(5): 3383-3403. doi: 10.3934/era.2021044
    [5] Dandan Zuo, Wansheng Wang, Lulu Zhang, Jing Han, Ling Chen . Non-fragile sampled-data control for synchronizing Markov jump Lur'e systems with time-variant delay. Electronic Research Archive, 2024, 32(7): 4632-4658. doi: 10.3934/era.2024211
    [6] Zhuang Wang, Renting Liu, Jie Xu, Yusheng Fu . FedSC: A federated learning algorithm based on client-side clustering. Electronic Research Archive, 2023, 31(9): 5226-5249. doi: 10.3934/era.2023266
    [7] Kaixuan Wang, Shixiong Zhang, Yang Cao, Lu Yang . Weakly supervised anomaly detection based on sparsity prior. Electronic Research Archive, 2024, 32(6): 3728-3741. doi: 10.3934/era.2024169
    [8] Shengming Hu, Yongfei Lu, Xuanchi Liu, Cheng Huang, Zhou Wang, Lei Huang, Weihang Zhang, Xiaoyang Li . Stability prediction of circular sliding failure soil slopes based on a genetic algorithm optimization of random forest algorithm. Electronic Research Archive, 2024, 32(11): 6120-6139. doi: 10.3934/era.2024284
    [9] Haoqing Wang, Wen Yi, Yannick Liu . An innovative approach of determining the sample data size for machine learning models: a case study on health and safety management for infrastructure workers. Electronic Research Archive, 2022, 30(9): 3452-3462. doi: 10.3934/era.2022176
    [10] Amal S. Hassan, Najwan Alsadat, Mohammed Elgarhy, Laxmi Prasad Sapkota, Oluwafemi Samson Balogun, Ahmed M. Gemeay . A novel asymmetric form of the power half-logistic distribution with statistical inference and real data analysis. Electronic Research Archive, 2025, 33(2): 791-825. doi: 10.3934/era.2025036
  • In the case of non-independent and identically distributed samples, we propose a new ueMC algorithm based on uniformly ergodic Markov samples, and study the generalization ability, the learning rate and convergence of the algorithm. We develop the ueMC algorithm to generate samples from given datasets, and present the numerical results for benchmark datasets. The numerical simulation shows that the logistic regression model with Markov sampling has better generalization ability on large training samples, and its performance is also better than that of classical machine learning algorithms, such as random forest and Adaboost.



    The logistic regression model has become one of the most popular machine learning methods in classification [1,2,3,4]. Besides the advantages of good performance and strong interpretability in practical applications, the logistic regression model also has a complete theoretical basis in terms of consistency and generalization performance when the training samples are independent and identically distributed (i.i.d) [5,6,7,8]. However, the hypothesis of i.i.d is quite hard to be proved in practice, so it is natural to consider the logistic regression model with non-i.i.d samples.

    The relaxation of the i.i.d hypothesis has been discussed for a long time in machine learning and statistical literature. For example, Wang et al. [9] pointed out that the statistical learning theory in the case of small samples cannot be directly applied to large samples and proposed the generalization bounds of under-sample models based on strict Bayesian network processing. Sun and Wu [10], Sun and Guo [11] and Chu and Sun [12] used mixed samples to analyze the error of l2-norm least squares and l1-norm least squares regression, respectively. Guo and Shi [13] proved that the learning speed of regularized least squares regression was faster than the classical method. Machine learning algorithms in the non-i.i.d case are solved by concentration inequalities because the concentration inequalities can provide probability upper bounds for the deviation [14]. Modha and Masry [15] extended the classical inequalities in the condition from i.i.d to m-correlation and strong mixing, respectively. Merlevède et al. [16] obtained a Bernstein type inequality for a class of weakly dependent and bounded random variables. Fan et al. [17] studied the Hoeffding inequality for general Markov chains and time-dependent functions.

    In the paper, we enhance the performance of the logistic regression model from small samples to large samples. Machine learning usually performs well in Markov chain samples [18], so we develop the uniformly ergodic Markov chain algorithm (ueMC algorithm) for the logistic regression model. The ueMC algorithm can identify samples with classification errors and close to the decision boundary, and determine the final Markov samples used in the pre-training models. Compared with the algorithm proposed by Thongkam et al. [19], the ueMC algorithm does not directly eliminate the misclassified samples, as they are based on a random probability. Similar to the method proposed by Miranda et al. [20], the previous eliminated samples may be selected later. Inspired by [21], we study the generalization ability of the ueMC algorithm, and establish the optimal learning rate of the logistic regression classification for ueMC samples. Through numerical study and the simulation, it is verified that the performance of the ueMC algorithm based on ueMC samples is more effective than the algorithm based on other random samples, and the performance of the ueMC algorithm is also better than of the classical machine learning algorithm.

    The paper is organized as follows. In Section 2, some definitions and notations are given. In Section 3, we present and prove the main results on the learning rates of the logistic regression model with ueMC samples. In Section 4, we develop a new ueMC algorithm, and present the numerical studies on the generalization performance of the logistic regression model based on Markov sampling for benchmark datasets. Finally, the conclusions are given in Section 5.

    In this section we introduce the definitions and notations throughout this paper.

    Let (X,d) be a compact metric space and Y={0,1}. A binary classifier is a function ˆf:XY which labels every point xXR with some yY. Let φ be a probability distribution on Z=X×Y and (X,Y) be the corresponding random variable. Given Xi=(1,xi1,,xik)TRk+1, i=1,2,,N, the form of the separating hyperplane is as follows:

    ˆf=WTX=w0+w1xn1+w2xn2++wkxnk=0, (2.1)

    where W=(w0,w1,,wk)T is the coefficient of the variable.

    Let P(YiXi,W)=11+eWTXi=sigmod(ˆf) represent the probability of the sample being a positive sample (Yi=1), and 1P(YiXi,W) represent the probability of the sample being a negative sample (Yi=0). So we have:

    lnP(Yi=1Xi,W)P(Yi=0Xi,W)=WTXi, (2.2)

    where P(Yi=1Xi,W)=eWTXi1+eWTXi.

    The log-likelihood function is:

    g(W)=Ni=1ln(YiP(Yi=1Xi,W)+(1Yi)P(Yi=0Xi,W)). (2.3)

    The objective function of the logistic regression model is:

    L(W)=argminWNi=1(YiWTXi+ln(1+eWTXi)). (2.4)

    Linear models are easy to over-fitting in high dimensions because of the correlation between features. In the paper, the weight term of the model is constrained to alleviate the problem of over-fitting by adding a regular term c(W). We choose norm l2 regularization and use BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm [22] to solve the parameters.

    L(W)=argminWNi=1(YiWTXi+ln(1+eWTXi))+c(W). (2.5)

    Let H be Hilbert space, a set of real function on space XR. If there is a kernel function K:X×XR satisfying xX,K,xR, then HK is called a reproducing kernel Hilbert space satisfying xX,fH,f(x)=f,K,x.

    For a function f : XR, sign function sgn(f)=1 if f(x)0 and sgn(f)=0 if f(x)<0. Then the logistic regression model is defined as sgn(fz,λ), where fz,λ is a minimizer of the following optimization problem involving a set of random sample S={zi}mi=1Zm :

    fz,λ:=argminfHK{λf2HK+Ez(f)}. (2.6)

    In Formula (2.6), (f,z)=[yln(sigmoid(f(x)))+(1y)ln(1sigmoid(f(x)))] is the loss function, Ez(f)=1mmi=1(f,zi) is the empirical error, E(f)=E[(f,z)] denotes the generalization error of the corresponding function f. λ=1/(2C) is the regularization parameter, where C is a constant which depends on m:C=C(m) and often limmC(m)=.

    Suppose a Markov chain on (Z,S) is a sequence of random variables {Zi}i1 with a set of transition probability measures P(n)(Azi),AS,ziZ. It is assumed that

    P(n)(Azi)=P{Zn+iAZj,j<i,Zi=zi},n>0. (2.7)

    For any Markov chain, if the transition probability is independent of time, the Markov chain is stationary [23].

    Definition 1. Given two probabilities ν1,ν2 on the measure space (Z,S), we define ν1ν2TV=supAS|ν1(A)ν2(A)| as the total variation distance between the measures ν1 and ν2. Meyn and Tweedie [24] pointed out that for a Markov chain {Zi}i1, if there are 0<γ0< and 0<ρ0<1,

    Pk(z)π()TVγ0ρk0, k1, kN, (2.8)

    where π() is the stationary distribution of {Zi}i1, then {Zi}i1 is a ueMC.

    Another equivalent definition of ueMC is the following Doeblin condition [25].

    Proposition 1. (Doeblin condition): Suppose {Zi}i1 is a Markov chain with transition probability Pn(), and μ is a specific non-negative metric with nonzero mass μ0. If there is some integer k such that zZ and for all measurable sets A,Pk(Az)μ(A), then for any integer n, z1,z2Z, we have

    Pk(z1)Pk(z2)TV2βn/k1, (2.9)

    where β1=1μ0.

    In this section, we estimate the bounds on the generalization performance of the logistic regression model based on the ueMC sampling by following the enlightening ideas of [26].

    To measure the generalization ability of fz,λ, we identify how sgn(fz,λ) converges (with respect to the misclassification error) to the best classifier as C(m) tends to infinity. Recall the regression function of φ, fφ=yydφ(y|x), xX. Then the Bayes rule is given by the sign of the regression function fc=sgn(fφ). Referring to Vapnik [27], the speed at which fz,λ approaches fφ is measured by excess generalization error E(fz,λ)E(fφ). Since the minimization (2.6) is taken over the discrete quantity Ez(f), we have to regulate the capacity of the function set. Here the capacity is measured by the covering number.

    We should estimate the excess misclassification error R(sgn(fz,λ))R(fc) to bound the generalization ability of fz,λ. The relation between excess misclassification error and excess generalization error E(f)E(fφ) for convex loss is for f:XR,

    R(sgn(f))R(fc)E(f)E(fφ). (3.1)

    Definition 2. Let U>0, and BU={f:fHK,fHKU} be the sphere with radius U in HK, let N(ϵ)=N(BU,ϵ),ϵ>0 be the covering number of BU.

    Definition 3. The complexity index of HK is s, if there is some Cs>0 such that ϵ>0,

    lnN(ϵ)Cs(1/ϵ)s.

    Proposition 2. Let fz,λ be the function defined by (2.6), and define fλ:=argminfHK{λf2HK+E(f)} as a regularizing function, then for any λ>0,

    E(fz,λ)E(fφ)RS+Rλ, (3.2)

    where sample error:

    RS=(E(fz,λ)Ez(fz,λ)+Ez(fλ)E(fλ)),

    and regularization error:

    Rλ=(E(fλ)E(fφ)+λfλ2HK).

    Proof. According to the definition of fz,λ,

    λfz,λ2HK+Ez(fz,λ)λfλ2HK+Ez(fλ),

    we have

    Ez(fλ)Ez(fz,λ)+λfλ2HKλfz,λ2HK0.

    So

    E(fz,λ)E(fφ)E(fz,λ)E(fφ)+Ez(fλ)Ez(fz,λ)+λfλ2HK=(E(fz,λ)Ez(fz,λ)+Ez(fλ)E(fλ))+(E(fλ)E(fφ)+λfλ2HK).

    In Proposition 2, sample error

    RS=RS1+RS2={Eξ11mmi=1ξ1(zi)}+{1mmi=1ξ2(zi)Eξ2},

    where ξ1=(fz,λ,z)(fφ,z) and ξ2=(fλ,z)(fφ,z).

    Definition 4. We say the function fφ can be approximated by HK with exponent 0<β1, if there exists a constant Cβ such that for any λ>0,RλCβλβ. In the paper, we assume that there is a constant B such that |fφ|B.

    To prove the main results, there are four lemmas.

    Lemma 1. Let f be a continuous function defined on X and f=supxX|f(x)|. Let κ=supxXKx,x, then f=supxX|f(x)|κfHK,fHK.

    Proof. According to the Cauchy-Schwartz inequality of PSD kernel and the reproducing property of reproducing kernel Hilbert space, we have:

    f2(x)f,fKx,x,

    it follows that

    supxX|f(x)|κfHK,fHK.

    Lemma 2. ξ2=(fλ,z)(fφ,z), |fφ|B, we have

    |ξ2|d:=κRλ/λ+B+ln(1+eκRλ/λ)+ln(1+eB). (3.3)

    Proof. Due to

    Rλ=E(fλ)E(fφ)+λfλ2HKλfλ2HK,

    then fλHKRλ/λ. By applying Lemma 1, there holds fλκRλ/λ as ξ2=(fλ,z)(fφ,z) and |fφ|B, and when y=1,

    ξ2=ln(sigmod(fλ))ln(sigmod(fφ))=fφfλ+ln(1+efλ)ln(1+efφ);

    when y=0,

    ξ2=ln(1sigmod(fλ))ln(1sigmod(fφ))=ln(1+efλ)ln(1+efφ).

    Then, the proof ends with |ξ2|=|(fλ,z)(fφ,z)||fλ|+|fφ|+|ln(1+efφ)|+|ln(1+efλ)|.

    Lemma 3. (proved in [21]): Hoeffding's inequality: Let Z={Zi}mi=1 be a ueMC sample and F be a set of bounded and measurable functions, i.e. there is a constant C, such that zZ, gF, 0g(z)C. For any ε>0, we have

    Pr{|1mmi=1g(zi)E(g)|ε}2exp{mε256CΓ02E(g)}, (3.4)

    where Γ0=2/(1β1/2t) and β=1μ0. Here μ0 and t are from Doeblin condition [25].

    According to Lemma 3, for any ε>0,

    Pr{1mmi=1ξ2(zi)E(ξ2)E(ξ2)+εε}exp{εm56Γ02d}. (3.5)

    Let δ1=exp{εm56Γ02d}, so ε=56Γ02dln(δ1)m and εε(f)+ε12ε(f)+ε. Then for any 0<δ1<1, we obtain

    RS2=1mmi=1ξ2(zi)Eξ212Rλ56ln(δ1)dΓ02m, (3.6)

    with probability at least 1δ1.

    Let U>0 and FU={g=(f,z)(fφ,z),fBU}, we have

    E(g)=E(f)E(fφ)0,  1mmi=1g(zi)=Ez(f)Ez(fφ).

    From the definition of BU, it can be seen that for any fBU, there are fκfHKκU and then |fφ|B. It follows that

    |g(z)|b:=κU+B+ln(1+eκU)+ln(1+eB).

    Under the condition of Lemma 3, for any ϵ>0

    Pr{supfBU(E(f)Ez(f))(E(fφ)Ez(fφ))E(f)E(fφ)+εε}=Pr{supgFUE(g)1mmi=1g(zi)E(g)+εε}N(FU,ϵ)exp{εm56bΓ02}. (3.7)

    For any g1,g2FU, |ln(1+ex1)ln(1+ex2)||x1x2|, it follows that |g1(x)g2(x)|f1f2.

    According to inequality (3.7), for any ε>0

    Pr{supfBU(E(f)Ez(f))(E(fφ)Ez(fφ))E(f)E(fφ)+εε}N(εU)exp{εm56bΓ02}. (3.8)

    Therefore, according to N(εU)exp{Cs(Uε)s} by Definition 3, there holds, for fz,λ,

    Pr{supfBU(E(fz,λ)Ez(fz,λ))(E(fφ)Ez(fφ))E(fz,λ)E(fφ)+εε}exp{Cs(Uε)sεm56bΓ02}. (3.9)

    In the (3.9), let

    exp{Cs(Uε)sεm56bΓ02}=δ2,

    so

    εm56bΓ02Cs(Uε)s=ln(δ2),

    it is obtained that

    εs+1CsUs56bΓ02m+εsln(δ2)56bΓ02m=0. (3.10)

    Lemma 4. (Lemma 4 in [28]) Let c1, c2>0, and p1>p2>0, then the equation xp1c1xp2c2=0 has a unique positive zero x. In addition, xmax{(2c1)1/(p1p2), (2c2)1/p1}.

    Through the above inference process, we form the following main results.

    Theorem 1. Let Z={Zi}mi=1 be a ueMC sample, then for any 0<δ<1,

    E(fz,λ)E(fφ)+2λfz,λ2HK3Rλ+2ˆE+112dΓ02ln(1/δ)m (3.11)

    holds true with probability at least 1δ, where

    ˆεmax{ε1,ε2}, Γ02=2/(1β1/2t1),
    ε1=112bΓ02ln(1/δ)m,   ε2=[112CsUsbΓ02m]1/(1+s),

    here β1 and t are defined as that in Proposition 1.

    Proof. According to Lemma 4, Eq (3.10) has a solution ˆε=max{ε1,ε2} with

    ε1=112bΓ02ln(δ2)m,  ε2=[112CsUsbΓ02m]1/(1+s).

    Since εε(f)+ε12ε(f)+ε, in combination with inequality (3.9), inequality (3.12) holds at least with a probability of 1δ2,

    RS1=Eξ11mmi=1ξ1(zi)12(E(fz,λ)E(fφ))+ˆε. (3.12)

    Combining inequality (3.6) with inequality (3.12), inequality (3.13) holds at least 1δ probability for any 0<δ<1,

    E(fz,λ)E(fφ)3Rλ+2ˆε2λfz,λ2HK112dΓ02ln(δ)m. (3.13)

    Theorem 2. Let Z={Zi}mi=1 be a ueMC sample, Taking λ=(1/m)ϑ. For any ϵ>0 and 0<δ<1, there exists a constant ˆC independent of m such that

    R(sgn(fz,λ))R(fc)ˆC(1/m)θ, (3.14)

    holds true with probability at least 1δ, providing m112bΓ02ln(1/δ)(ln(1/δ)/Cs)1/s/U, where

    ϑ=min{2β+1,2(1+β)(1+s)},θ=min{2ββ+1,2β(1+β)(1+s)ϵ}.

    Proof. The proof is easily obtained from Theorem 1 with the proof of Theorem 2 in [21].

    In this section, we introduce a ueMC algorithm to generate the samples from a given dataset. We give numerical studies on the learning performance of the logistic regression model and present some useful discussions.

    Here are the notations:

    ST: the initial training set

    Siid: a i.i.d sample from ST

    Smkv: a ueMC sample from ST

    NT: the number of positive samples in Smkv

    NF: the number of negative samples in Smkv

    k,m,N,n,a, max_iteration: super-parameters

    The pseudocode of the ueMC algorithm is as follows:

    The ueMC algorithm: Markov sampling for Logistic Regression
    Input: ST,m,k,N,n, max_iteration, a.
    Output: fk.
    Step 1: Set i=0, draw randomly m samples from ST called Siid, and train Siid, get the premilinary mode fi.
    Step 2: Let Smkv=, NT=0, NF=0, t:=1.
    Step 3: Draw randomly a sample zt from ST, let Smkv=Smkvzt, NT=NT+1, if yt=1; NF=NF+1, if yt=0. set j:=0.
    Step 4: Draw randomly another sample zcandidate  from ST, calculate p1=e(fi, zcantididete)/e(fi,zt), j:=j+1.
    Step 5: Draw randomly a number prandom from U(0,1).
    Step 6: If 1>p1>prandom, accept zcandidate with p1; If p11 and ytycandidate=1, calculate p2=eycandidatefi/eytfi, if p2>prandom, accept zcandidate  with p2; If p11 and ytycandidate=1, accept zcandidate with p1; If n samples are rejected continuously, accept zcandidate with p3=ap1.
    Step 7: If ycandidate=1 and NT<N2, let NT=NT+1,Smkv=Smkvzcandidate; If ycandidate =1 and NF<N2, let NF=NF+1,Smkv=Smkvzcandidate.
    Step 8: If j> max_iteration or NT+NF>N, train Smkv to get fi+1, else go to Step 4.
    Step 9: If i<k, let i:=i+1 and go to Step 2, else output fi+1.

    Compared with data noise, the over-fitting problem caused by small sample size has a stronger impact on the generalization performance of the classifier ([29]). Different from the methods from [11,19] based on threshold to eliminate noise data, the ueMC is constructed to avoid the problem of over-fitting caused by small training samples.

    The ueMC algorithm uses the initial model f0 of Siid and the loss function (f,z) to construct the transition probability p1,p2,p3 of ueMC. Since p1,p2,p3 is greater than 0 and the sample size in ST is limited, Smkv obtained by the ueMC algorithm is a ueMC, according to the research conclusion of [30].

    Different from the MCMC algorithm proposed by [31,32], the ueMC algorithm does not need to know the probability distribution information of the training set samples. In addition, when k=0, the algorithm degenerates into the classical logistic regression model. In order to make the following experimental results without loss of generality, we take k=2 and a=1.2.

    In this subsection, we present the numerical study on the learning performance of the logistic regression model based on linear prediction models for 9 real-world datasets (from http://archive.ics.uci.edu/ml/datasets and https://www.fml.tuebingen.mpg.de/). The information of these data sets is summarized in Table 1, and all these datasets are 2-classes realworld datasets. The samples in these datasets obey non-independent identical distribution. We use the SMOTE algorithm to deal with unbalanced data. For each data set, it is randomly divided into two parts: the training set and the test set.

    Table 1.  General Information about 9 Real-world Datasets.
    Dataset Training Size Test Size Input Dimension
    UCI- Heart Disease 800 225 13
    UCI-Skin 160000 85057 3
    UCI-HTRU2 20000 12518 8
    UCI-Wine 1200 399 11
    UCI-Diabetes 600 168 8
    UCI-Waveform 4000 1000 21
    MNIST 60000 10000 784
    ELEC2 35000 10312 6
    Splice 20000 43500 60

     | Show Table
    DownLoad: CSV

    In order to simplify the experimental process, we take N=m in the ueMC algorithm, and carry out 50 repeated experiments for each dataset. The experimental results are shown in Table 2, where "MR (i.i.d.)" and "MR (Markov)" denote the misclassification rates of the logistic regression model based on random sampling and Markov sampling, respectively. In the following, we also discuss the experimental results based on N<m.

    Table 2.  Misclassification Rates (%) for 500 Training Samples.
    Dataset MR (i.i.d.) MR (Markov)
    UCI- Heart Disease 20.94±0.98 19.90±1.07
    UCI-Skin 8.92±0.48 8.86±0.54
    UCI-HTRU2 17.38±1.71 14.65±1.45
    UCI-Wine 24.91±1.40 24.34±1.38
    UCI-Diabetes 35.25±1.61 30.00±1.22
    UCI-Waveform 17.42±2.19 12.48±1.67
    MNIST 0.51±0.10 0.49±0.09
    ELEC2 26.83±0.68 25.89±0.36
    Splice 12.85±5.12 9.30±3.77

     | Show Table
    DownLoad: CSV

    From Table 2, we can find that for 500 training samples, the standard deviations and means of average misclassification rates of the logistic regression model based on Markov sampling are smaller than that of random sampling except UCI-Heart Disease and UCI-Skin, and the means of average misclassification rates based on Markov sampling in UCI-Heart Disease and UCI-Skin datasets are still smaller than that of random samples. To show the learning performance of the logistic regression model based on Markov sampling, we present the average misclassification rates for 50 experimental results of the logistic regression model based on Markov sampling (non-i.i.d) and random sampling (i.i.d.) for different training sizes and four datasets in Figures 14.

    Figure 1.  Average misclassification rates for UCI-HTRU2 and m = 1000, 1500, 2000, 2500, 3000.
    Figure 2.  Average misclassification rates for UCI-Waveform and m = 500,800, 1000, 1500, 2000.
    Figure 3.  Average misclassification rates for ELEC2 and m = 1000, 2000, 3000, 4000, 5000.
    Figure 4.  Average misclassification rates for UCI-Diabetes and m = 100,200,300,400,500.

    To have a better understanding of learning performance of the logistic regression model based on Markov sampling, the following figures are presented to show the 50 experimental misclassification rates of the logistic regression model based on Markov sampling.

    Figures 18 show that for UCI-HTRU2, UCI-Waveform, ELEC2 and UCI-Diabetes, the logistic regression model based on Markov sampling would have better learning performance than that of random sampling as the number of training samples is large. For other datasets, since the experimental results are similar, we do not present all of them here.

    Figure 5.  UCI-HTRU2, m=3000.
    Figure 6.  UCI-Waveform, m=2000.
    Figure 7.  ELEC2, m=5000.
    Figure 8.  UCI-Diabetes, m=500.

    For the case of N<m, Table 3 shows that for the datasets of UCI-HTRU2, ELEC2 and UCI-Waveform, the logistic regression model based on smaller Markov chain samples (600 samples for UCI-HTRU2, ELEC2 and UCI-Waveform) can present smaller misclassification rates compared to more i.i.d. samples (2000 samples).

    Table 3.  Misclassification Rates for Different Training Sizes.
    Dataset i.i.d. (2000) Markov (600) Markov (800) Markov (1000)
    UCI-HTRU2 16.18±0.94 9.54±0.55 10.23±0.47 11.10±0.65
    ELEC2 26.94±0.28 25.53±0.21 25.55±0.20 25.49±0.25
    UCI-Waveform 15.41±0.76 8.48±0.42 8.42±0.42 8.28±0.33

     | Show Table
    DownLoad: CSV

    We compare the performance of the logistic regression model with the ueMC samples with the classical logistic regression model, SVMC, Adaboost, RandomForest and other classical machine learning models on UCI-HTRU2, UCI-Skin and MNIST. Tables 4 and 5 show the average classification error rates and Wilkeson signed rank test results of 50 experiments.

    Table 4.  Classification Error Rates Comparison Results.
    Dataset Markov_Logistic Classical_Logistic SVM RandomForest Adaboost
    MNIST 0.4054 0.5201 0.6374 0.7292 0.4657
    SKIN 6.6451 9.2786 7.6901 3.1789 4.4347
    HTRU2 10.2353 14.5536 13.5935 11.6205 11.9470

     | Show Table
    DownLoad: CSV
    Table 5.  Wilkerson signed rank test results of classification error rate.
    Comparison Statistic P-value Optimal
    Markov_Logistic
    vs.
    Classical_Logistic
    MNIST 109.0000 6.1 ×1004 Markov_Logistic
    SKIN 0.0000 7.5 × 1010 Markov_Logistic
    HTRU2 0.0000 7.5 × 1010 Markov_Logistic
    Markov_Logistic
    vs.
    RandomForest
    MNIST 0.0000 7.5 × 1010 Markov_Logistic
    SKIN 0.0000 7.5 × 1010 RandomForest
    HTRU2 0.0000 7.5 × 1010 Markov_Logistic
    Markov_Logistic
    vs.
    Adaboost
    MNIST 302.5101 5.3 × 1004 Markov_Logistic
    SKIN 0.0000 7.5 × 1010 Adaboost
    HTRU2 0.0000 7.5 × 1010 Markov_Logistic

     | Show Table
    DownLoad: CSV

    On the three real data sets, the difference between the logistic regression model based on the ueMC sampling and classical logistic regression models in classification error rates is statistically significant. It can be concluded that: 1) The generalization ability of the logistic regression model based on the ueMC sampling is better than that of the classical logistic regression model classifier, and it is robust. 2) The generalization ability of the logistic regression model with the ueMC samples is comparable to that of complex classifiers, such as random forest and Adaboost.

    The ueMC algorithm divides the samples into three categories according to the model pre-trained in the previous step. The first one is the samples with correct classification and close to the decision boundary, the second one is the samples with correct classification but far away from the decision boundary, and the third one is the samples with wrong classification.

    According to the definition of the loss function, the distance between the sample and the decision boundary determines the size of the loss. Samples close to the decision boundary would train a better decision boundary for classification. The ueMC algorithm designs the acceptance probability p1 and p2 which ensure that the samples obtained according to the acceptance probability are close to the decision boundary. Therefore, when the initial logistic regression model can better fit the data set, the ueMC algorithm can ensure that the ueMC samples are excellent samples. In addition, we do not directly eliminate the samples with classification errors or far away from the decision boundary, but accept them with a small probability, which not only ensures the excellent properties of the training set, but also maintains the diversity of the training set samples to a certain extent, and can reduce the error caused by the misjudgment of the pre-training model. Therefore, the learning performance of the logistic regression model based on Markov sampling is better than that of random sampling, and the classifier based on Markov sampling is more sparse compared to random sampling.

    To study the generalization performance of the logistic regression model based on the ueMC sampling, inspired by the idea from [21], we estimate first the generalization error of logistic model algorithm based on the ueMC sampling. The generalization error is deconstructed into sample error and regularization error by error decomposition, and the convergence of the algorithm is proved. In addition, we also generate the samples from given dataset by the ueMC algorithm. The numerical studies show that as the number of training samples is large, the learning performance of the logistic regression model based on Markov sampling is better than that of random sampling, and its performance is also better than that of classical machine model algorithms, such as random forest and Adaboost. In other words, the ueMC algorithm significantly improves the learning performance of the logistic regression model. To our knowledge, this study is the first on this topic.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work is supported by the Innovation and Talent Base for Digital Technology and Finance (B21038) and "the Fundamental Research Funds for the Central Universities", Zhongnan University of Economics and Law (2722023EJ002).

    The authors declare there is no conflicts of interest.



    [1] A. Bayaga, Multinomial logistic regression: Usage and application in risk analysis, J. Appl. Quant. Methods, 5 (2010), 288–297.
    [2] A. Selmoune, Z. Liu, J. Lee, To pay or not to pay? Understanding public acceptance of congestion pricing: A case study of Nanjing, Electron. Res. Arch, 30 (2022), 4136–4156. https://doi.org/10.3934/era.2022209 doi: 10.3934/era.2022209
    [3] Z. Ahmad, Z. Almaspoor, F. Khan, S. E. Alhazmi, M. El-Morshedy, O. Y. Ababneh, et al., On fitting and forecasting the log-returns of cryptocurrency exchange rates using a new logistic model and machine learning algorithms, AIMS Math., 7 (2022), 18031–18049. https://doi.org/10.3934/math.2022993 doi: 10.3934/math.2022993
    [4] N. Dwarika, Asset pricing models in South Africa: A comparative of regression analysis and the Bayesian approach, Data Sci. Financ. Econ., 3 (2023), 55–75. https://doi.org/10.3934/DSFE.2023004 doi: 10.3934/DSFE.2023004
    [5] D. McAllester, Generalization bounds and consistency, Predicting Struct. Data, 2007. https://doi.org/10.7551/mitpress/7443.003.0015
    [6] N. Kordzakhia, G. D. Mishra, L. Reiersølmoen, Robust estimation in the logistic regression model, J. Stat. Plan. Infer., 98 (2001), 211–223. https://doi.org/10.1016/S0378-3758(00)00312-8 doi: 10.1016/S0378-3758(00)00312-8
    [7] M. Rashid, Inference on Logistic Regression Models, Ph.D thesis, Bowling Green State University, 2008.
    [8] D. Dai, D. Wang, A generalized Liu-type estimator for logistic partial linear regression model with multicollinearity, AIMS Math., 8 (2023), 11851–11874. https://doi.org/10.3934/math.2023600 doi: 10.3934/math.2023600
    [9] Z. Wang, Z. Wang, B. Fu, Learning restricted bayesian network classifiers with mixed non-i.i.d. sampling, in 2010 IEEE International Conference on Data Mining Workshops, (2010), 899–904. https://doi.org/10.1109/ICDMW.2010.199
    [10] H. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon A, 30 (2011), 96–109 https://doi.org/10.1016/j.acha.2010.04.001 doi: 10.1016/j.acha.2010.04.001
    [11] H. Sun, Q. Guo, Coefficient regularized regression with non-iid sampling, Int. J. Comput. Math., 88 (2011), 3113–3124. https://doi.org/10.1080/00207160.2011.587511 doi: 10.1080/00207160.2011.587511
    [12] X. Chu, H. Sun, Regularized least square regression with unbounded and dependent sampling, Abstr. Appl. Anal., 2013 (2013), 900–914. https://doi.org/10.1155/2013/139318. doi: 10.1155/2013/139318
    [13] Z. C. Guo, L. Shi, Learning with coefficient-based regularization and l1-penalty, Adv. Comput. Math., 39 (2013), 493–510. https://doi.org/10.1007/s10444-012-9288-6 doi: 10.1007/s10444-012-9288-6
    [14] B. Jiang, Q. Sun, J. Q. Fan, Bernstein's inequality for general Markov chains, preprint, arXiv: 1805.10721.
    [15] D. S. Modha, E. Masry, Minimum complexity regression estimation with weakly dependent observations, IEEE Trans. Inf. Theory, 42 (1996), 2133–2145. https://doi.org/10.1109/18.556602 doi: 10.1109/18.556602
    [16] F. Merlevède, M. Peligrad, E. Rio, Bernstein inequality and moderate deviations under strong mixing conditions, Inst. Math. Stat. (IMS) Collect., 2009 (2009), 273–292. https://doi.org/10.1214/09-IMSCOLL518 doi: 10.1214/09-IMSCOLL518
    [17] J. Q. Fan, B. Jiang, Q. Sun, Hoeffding's lemma for Markov Chains and its applications to statistical learning, preprint, arXiv: 1802.00211.
    [18] P. J. M. Laarhoven, E. H. L. Aarts, Simulated Annealing: Theory and Applications, Springer, Dordrecht, 1987.
    [19] J. Thongkam, G. Xu, Y. Zhang, et.al., Support vector machine for outlier detection in breast cancer survivability prediction, in Asia-Pacific Web Conference, Springer, (2008), 99–109. https://doi.org/10.1007/978-3-540-89376-9_10
    [20] A. L. B. Miranda, L. P. F. Garcia, A. C. P. L. F. Carvalho, A. C. Lorena, Use of classification algorithms in noise detection and elimination, in International Conference on Hybrid Artificial Intelligence Systems, Springer, (2009), 417–424. https://doi.org/10.1007/978-3-642-02319-4_50
    [21] J. Xu, Y. Y. Tang, B. Zou, Z. Xu, L. Li, Y. Lu, et al., The generalization ability of SVM classification based on Markov sampling, IEEE Trans. Cybern., 45 (2014), 1169–1179. https://doi.org/10.1109/TCYB.2014.2346536 doi: 10.1109/TCYB.2014.2346536
    [22] J. D. Head, M. C. Zerner, A Broyden—Fletcher—Goldfarb—Shanno optimization procedure for molecular geometries, Chem. Phys. Lett., 122 (1985), 264–270. https://doi.org/10.1016/0009-2614(85)80574-1 doi: 10.1016/0009-2614(85)80574-1
    [23] M. Vidyasagar, Learning and Generalization: With Applications to Neural Networks, Springer, London, 2003.
    [24] S. P. Meyn, R. L. Tweedie, Markov Chains and Stochastic Stability, Springer, Berlin, 2012.
    [25] P. Doukhan, Mixing: Properties and Examples, Springer, Berlin, 2012.
    [26] P. Zhang, N. Riedel, Discriminant analysis: A unified approach, in Fifth IEEE International Conference on Data Mining (ICDM'05), 2005. https://doi.org/10.1109/ICDM.2005.51
    [27] V. N. Vapnik, An overview of statistical learning theory, IEEE T. Neur. Net. Lear., 10 (1999), 988–999. https://doi.org/10.1109/72.788640 doi: 10.1109/72.788640
    [28] F. Cucker, S. Smale, Best choices for regularization parameters in learning theory: On the bias-variance problem, Found. Comput. Math., 2 (2002), 413–428. https://doi.org/10.1007/s102080010030 doi: 10.1007/s102080010030
    [29] G. Stempfel, L. Ralaivola, Learning SVMs from sloppily labeled data, in Lecture Notes in Computer Science, Springer, 2009. http://dx.doi.org/10.1007/978-3-642-04274-4_91
    [30] M. P. Qian, G. L. Gong, Applied random processes, Peking University Press, Beijing, 1998.
    [31] W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57 (1970), 97–109. https://doi.org/10.1093/biomet/57.1.97 doi: 10.1093/biomet/57.1.97
    [32] S. Geman S, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 6 (1984), 721–741. https://doi.org/10.1109/TPAMI.1984.4767596 doi: 10.1109/TPAMI.1984.4767596
  • Reader Comments
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(1489) PDF downloads(80) Cited by(0)

Figures and Tables

Figures(8)  /  Tables(5)

Other Articles By Authors

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog