
This paper introduces a new extension of the traditional Lomax distribution. The current distribution, which has one scale and two shape parameters, is known as logarithmic transformed Lomax distribution. The hazard rate function of the new distribution based on parameter values, making monotonically decreasing, upside-down and bathtub upside-down shaped. Some properties of the proposed distribution including mixture expansion, moments, entropies, probability weighted moments, order statistics, reversed failure rate and mean past life time are discussed. The distribution parameters are estimated using four frequentist methods: namely, maximum likelihood, least squares, weighted least squares, and maximum product of spacing. The efficiency of the different estimators is compared using a simulation analysis using mean square error. Finally, two real data sets for service times of Aircraft Windshield and survival times of a group of patients given chemotherapy medication are analyzed and compared to some other competing distributions. Based on the results of data analysis, the estimates of minimum and maximum values for the service times of Aircraft Windshield and the survival times of patients given chemotherapy medication are obtained.
Citation: Refah Alotaibi, Hassan Okasha, Hoda Rezk, Abdullah M. Almarashi, Mazen Nassar. On a new flexible Lomax distribution: statistical properties and estimation procedures with applications to engineering and medical data[J]. AIMS Mathematics, 2021, 6(12): 13976-13999. doi: 10.3934/math.2021808
[1] | Azam Zaka, Ahmad Saeed Akhter, Riffat Jabeen . The new reflected power function distribution: Theory, simulation & application. AIMS Mathematics, 2020, 5(5): 5031-5054. doi: 10.3934/math.2020323 |
[2] | Nora Nader, Dina A. Ramadan, Hanan Haj Ahmad, M. A. El-Damcese, B. S. El-Desouky . Optimizing analgesic pain relief time analysis through Bayesian and non-Bayesian approaches to new right truncated Fréchet-inverted Weibull distribution. AIMS Mathematics, 2023, 8(12): 31217-31245. doi: 10.3934/math.20231598 |
[3] | Ayed. R. A. Alanzi, Muhammad Imran, M. H. Tahir, Christophe Chesneau, Farrukh Jamal, Saima Shakoor, Waqas Sami . Simulation analysis, properties and applications on a new Burr XII model based on the Bell-X functionalities. AIMS Mathematics, 2023, 8(3): 6970-7004. doi: 10.3934/math.2023352 |
[4] | Huimin Li, Jinru Wang . Pilot estimators for a kind of sparse covariance matrices with incomplete heavy-tailed data. AIMS Mathematics, 2023, 8(9): 21439-21462. doi: 10.3934/math.20231092 |
[5] | Abdulhakim A. Al-Babtain, Ibrahim Elbatal, Christophe Chesneau, Mohammed Elgarhy . On a new modeling strategy: The logarithmically-exponential class of distributions. AIMS Mathematics, 2021, 6(7): 7845-7871. doi: 10.3934/math.2021456 |
[6] | Boubaker Smii . Representation of the solution of a nonlinear molecular beam epitaxy equation. AIMS Mathematics, 2024, 9(12): 36012-36030. doi: 10.3934/math.20241708 |
[7] | Salim Bouzebda, Amel Nezzal, Issam Elhattab . Limit theorems for nonparametric conditional U-statistics smoothed by asymmetric kernels. AIMS Mathematics, 2024, 9(9): 26195-26282. doi: 10.3934/math.20241280 |
[8] | Refah Alotaibi, Mazen Nassar, Zareen A. Khan, Ahmed Elshahhat . Analysis of reliability index R=P(Y<X) for newly extended xgamma progressively first-failure censored samples with applications. AIMS Mathematics, 2024, 9(11): 32200-32231. doi: 10.3934/math.20241546 |
[9] | Xiao Zhang, Rongfang Yan . Stochastic comparisons of extreme order statistic from dependent and heterogeneous lower-truncated Weibull variables under Archimedean copula. AIMS Mathematics, 2022, 7(4): 6852-6875. doi: 10.3934/math.2022381 |
[10] | Mohamed S. Eliwa, Essam A. Ahmed . Reliability analysis of constant partially accelerated life tests under progressive first failure type-II censored data from Lomax model: EM and MCMC algorithms. AIMS Mathematics, 2023, 8(1): 29-60. doi: 10.3934/math.2023002 |
This paper introduces a new extension of the traditional Lomax distribution. The current distribution, which has one scale and two shape parameters, is known as logarithmic transformed Lomax distribution. The hazard rate function of the new distribution based on parameter values, making monotonically decreasing, upside-down and bathtub upside-down shaped. Some properties of the proposed distribution including mixture expansion, moments, entropies, probability weighted moments, order statistics, reversed failure rate and mean past life time are discussed. The distribution parameters are estimated using four frequentist methods: namely, maximum likelihood, least squares, weighted least squares, and maximum product of spacing. The efficiency of the different estimators is compared using a simulation analysis using mean square error. Finally, two real data sets for service times of Aircraft Windshield and survival times of a group of patients given chemotherapy medication are analyzed and compared to some other competing distributions. Based on the results of data analysis, the estimates of minimum and maximum values for the service times of Aircraft Windshield and the survival times of patients given chemotherapy medication are obtained.
Let φ(z) and Φ(z) be the density and cumulative functions of a standard normal random variable, respectively.
φ(z):=1√2πez22,Φ(z):=∫z−∞φ(t)dt. |
Let X be a truncated normal random variable with truncation point θ on the left, where the mean and variance of the underlying normal distribution are denoted by μ and σ2. The density function of X can be written as:
f(x|μ,σ2,θ)={φ[(x−μ)/σ]σ[1−Φ((θ−μ)/σ)],x>θ0,otherwise. | (1.1) |
Truncated normal distribution on the right can be similarly defined. Since the density function of the normal distribution is symmetrical about μ, truncation on the right at θ′ is equivalent to truncation on the left at 2μ−θ′. Therefore, only truncation on the left is discussed in this paper.
It can be easily shown that
μθ:=E(X|μ,σ2,θ)=σh(θ∗)+μ | (1.2) |
σ2θ:=V(X|μ,σ2,θ)=σ2[1−h′(θ∗)] | (1.3) |
where θ∗=(θ−μ)/σ, and h(x)=φ(x)/[1−Φ(x)] is the hazard function of the standard normal distribution. The reciprocal of h(x) is also known as Mills' ratio.
Other moments of singly truncated normal distribution such as skewness and kurtosis can be found in Horrace's work [1]. In addition, Pender [2] has presented the corresponding results for doubly truncated normal distribution.
A singly truncated normal sample from X with size n is denoted by x1,x2,⋯,xn. The likelihood function for the sample is:
L=1[1−Φ(θ∗)]n1(2πσ2)n/2exp[−12σ2n∑i=1(xi−μ)2]. | (1.4) |
As for maximum likelihood estimation, it will be easier to work with the log-likelihood function:
lnL=−n2ln(2π)−n2lnσ2−12σ2n∑i=1(xi−μ)2−nln[1−Φ(θ∗)]. | (1.5) |
For the inference of truncated normal distributions, most of the available literature tends to focus on the estimation of mean and standard deviation (or variance) of the underlying normal distribution when the truncation point θ is known. The initial results were given by Pearson and Lee [3], who employed the method of moments to obtain formulas for estimating the mean and standard deviation, and the computation of these estimates relied on certain special functions that can be evaluated by a pre-prepared table. Later, Fisher [4] gave a solution to this problem by the maximum likelihood method, and he proved that the maximum likelihood estimation is equivalent to the moment estimation. Stevens [5] discussed the estimation of doubly truncated normal samples, as well as censored samples in which the frequency of observations outside the limits is recorded but the individual values of these observations are not measured.
Based on the maximum likelihood principle proposed by Fisher, some other authors have obtained more results on the estimation problem of truncated or censored samples. Hald [6] and Gupta [7] studied the maximum likelihood estimation of the parameters of censored normal samples. Halperin [8] showed that maximum likelihood estimates for single-parameter truncated samples under mild conditions are consistent, asymptotically normally distributed, and of minimum variance for large samples. Halperin's results can be readily extended and applied to truncated normal samples. Cohen [9,10,11] developed simplified estimators for singly truncated normal samples as follows:
ˆμ=ˉx−c(ˉx−θ)ˆσ2=s2+c(ˉx−θ)2 |
where ˉx and s2 are the sample mean and the sample variance, respectively, c=h(θ∗)/[h(θ∗)−θ∗] is an auxiliary function. The value of c can be determined directly from s2/(ˉx−θ)2 by a table provided by Cohen.
On the other hand, there is very little literature on truncation point estimation. For general truncated samples, Robson and Whitlock [12] presented point estimators of a truncation point with bias of order n−k. For example, estimator x(1)−(x(2)−x(1)) is unbiased to order n−2, where x(1) and x(2) are the order statistics of the truncated sample.
In this article, the truncated samples are different from those of Robson and Whitlock. Specifically, xi(i=1,⋯,n) in truncated samples cannot be measured directly, and yi=xi−θ can be observed instead. In such a case, all observations in the sample are subtracted by an unknown truncation point θ, and only observations greater than 0 are recorded. By (1.1), the density function of Y=X−θ is:
f(y)={φ[(y+θ−μ)/σ]σ[1−Φ((θ−μ)/σ)],y>00,otherwise | (1.6) |
and the log-likelihood function for the truncated sample y1,y2,⋯,yn is:
lnL=−n2ln(2π)−n2lnσ2−12σ2n∑i=1(yi+θ−μ)2−nln[1−Φ(θ∗)]. | (1.7) |
Next, we will discuss the estimation of the truncation point θ for sample y1,y2,⋯,yn. As mentioned earlier, the calculation of estimates of mean and variance for truncated normal samples requires some special functions, which can be evaluated by a pre-prepared table. The estimation of the truncation point also requires some auxiliary functions, and the hazard function h(x) of the standard normal distribution plays an important role in these functions. The following lemma gives some important properties of h(x), which we will use later.
Lemma 1.1. Let h(x) be the hazard function of the standard normal distribution, then:
(√8+x2+3x)/4<h(x)<(√4+x2+x)/2,x>0 | (1.8) |
0<h′(x)=h(h−x)<1,−∞<x<∞ | (1.9) |
h″(x)=h[(h−x)(2h−x)−1]>0,−∞<x<∞. | (1.10) |
The detailed proof of this lemma can be found in Sampford's work [14]. It should be noted that the inequality on the right in (1.8) is attributed to Birnbaum's work [13]. Moreover, Yang and Chu [15] provided tighter bounds for h(x) than those in (1.8).
The estimation for the truncation point θ makes no sense when the mean μ is unknown. In Section 3, we will present an application where there is a normal sample from which good estimates of the mean and variance can be made. In addition, there are many truncated samples derived from the same normal population, and the truncation points of interest may be different.
In this section, we will discuss truncation point estimation for two cases: The variance σ2 is known or unknown. For each of the two cases, the point estimates of the truncation points are given using the method of moment and the maximum likelihood method, respectively, and the estimated variance of the point estimates and the confidence intervals of the truncation points are also given.
Theorem 2.1. Let y1,y2,⋯,yn be a sample from a truncated normal population with density function (1.6), where the mean μ and the variance σ2 are known, and the truncation point θ is unknown, the moment estimator for θ is given as follows:
ˆθ=μ+σg−11(ˉyσ) | (2.1) |
where auxiliary function g1(x):=h(x)−x is monotonically decreasing.
Proof. Since E(X)=E(Y)+θ, replace the left side of Eq (1.2) with ˉy+θ, and we have:
ˉy+ˆθ=σh(ˆθ∗)+μ |
where ˆθ∗=(ˆθ−μ)/σ. It is algebraically equivalent to:
ˉyσ=h(ˆθ∗)−ˆθ−μσ=h(ˆθ∗)−ˆθ∗=g1(ˆθ∗). |
Note that (1.9) implies g′1(x)=h′(x)−1<0, that is, g1(x) is monotonically decreasing. Applying the inverse function of g1 to the both sides yields:
g−11(ˉyσ)=ˆθ∗=ˆθ−μσ. |
The estimator for truncation point in (2.1) is immediately obtained from above.
Theorem 2.2. For large samples, the variance of the truncation point estimator in (2.1) is approximated by:
V(ˆθ)≈σ2/n1−h′(ˆθ∗). | (2.2) |
Proof. With first-order Taylor series expansion of (2.1) at μθ−θ, we obtain:
ˆθ≈θ+(g−11)′|y=(μθ−θ)/σ⋅(ˉy−μθ+θ). |
For large samples, ˉy≈μθ−θ, hence:
V(ˆθ)≈[(g−11)′]2y=ˉy/σV(ˉy). | (2.3) |
Recall that the derivative of the inverse function is the reciprocal of the derivative of the original function, namely, (g−11)′=(g′1)−1, it follows that:
[(g−11)′]2y=ˉy/σ=1[g′1(ˆθ∗)]2=1[h′(ˆθ∗)−1]2. | (2.4) |
Since ˉy=ˉx−θ, we have:
V(ˉy)=V(ˉx)=σ2θn≈1−h′(ˆθ∗)nσ2. | (2.5) |
The approximate variance (2.2) follows from (2.3)–(2.5).
Theorem 2.3. Let y1,y2,⋯,yn be a sample from a truncated normal population with density function (1.6), where the mean μ and the variance σ2 are known, and the truncation point θ is unknown, the maximum likelihood estimator for θ is given as follows:
ˆθ=μ+σg−11(ˉyσ). |
That is, the maximum likelihood estimator of θ is identical to the moment estimator of θ.
Proof. Differentiating the log-likelihood function lnL in (1.7) with respect to θ:
ddθlnL=−1σ2n∑i=1(yi+θ−μ)+nφ(θ∗)1−Φ(θ∗)dθ∗dθ=−1σ2n∑i=1yi−n(θ−μ)+nσh(θ∗)=nσ[−ˉyσ−θ∗+h(θ∗)]. | (2.6) |
Setting this derivative equal to zero, we obtain the same estimator as the moment estimator in (2.1).
It can be shown that the log-likelihood function in (1.7) satisfies the mild regularity conditions given by Halperin [8]. Thus, the maximum likelihood estimator of the truncation point θ is consistent, asymptotically normal, and approximately minimum variance under large samples.
We will employ another approach based on large-sample theory to find the approximate variance of the maximum likelihood estimator of the truncation point θ. By large-sample theory, we have:
√n(ˆθ−θ)⟶dN(0,1I(θ)) |
where I(θ) is the Fisher information.
For large samples,
I(θ)=E[d2dθ2lnf(Y;θ)]≈−1nd2dθ2(lnL). |
Therefore, the variance of ˆθ can be approximated as follows:
V(ˆθ)≈[−d2dθ2(lnL)]−1|θ=ˆθ. | (2.7) |
By (2.6), we have:
d2dθ2(lnL)=ddθ[ddθ(lnL)]=nσ2[h2(θ∗)−θ∗h(θ∗)−1]=nσ2[h′(θ∗)−1]. |
Hence, substituting into (2.7), we obtain the same approximation as (2.2):
V(ˆθ)≈σ2/n1−h′(ˆθ∗). |
According to the asymptotic normality for large samples, the approximate 95% confidence interval for the truncation point θ is:
μ+σg−11(ˉyσ)±1.96×σ/√n√1−h′(ˆθ∗). | (2.8) |
Theorem 2.4. Let y1,y2,⋯,yn be a sample from a truncated normal population with density function (1.6), where the mean μ is known, the variance σ2 and the truncation point θ are unknown, the moment estimator for θ is given as follows:
ˆθ∗=g−12(s2y/ˉy2) | (2.9) |
ˆσ=ˉy/g1(ˆθ∗) | (2.10) |
ˆθ=μ+ˆσˆθ∗ | (2.11) |
where s2y is the sample variance of the truncated sample, and auxiliary function
g2(x):=1−h′(x)[h(x)−x]2,x>0 | (2.12) |
is monotonically increasing.
First, we prove the following lemma:
Lemma 2.5. Let g2(x) be the auxiliary function defined in (2.12), then for any x>0, we have 0<g2(x)<1 and g′2(x)>0, i.e., g2(x) is non-negative and monotonically increasing.
Proof. From (1.9), g2(x)>0, and we have
g2(x)=1−h′(x)[h(x)−x]2=1−h(x)[h(x)−x]−[h(x)−x]2[h(x)−x]2+1=1−[h(x)−x][2h(x)−x][h(x)−x]2+1. |
By (1.10), we know that 1−[h(x)−x][2h(x)−x]<0, hence g2(x)<1. Decompose g2(x) as follows
g2(x)=1[h(x)−x]2−h(x)h(x)−x. |
Differentiating g2(x), we obtain:
g′2(x)=2[1−h′(x)]−[h(x)−xh′(x)][h(x)−x][h(x)−x]3. |
Let ξ denote the numerator of the above fraction, namely
ξ(x)=2[1−h′(x)]−[h(x)−xh′(x)][h(x)−x]. |
Therefore, for x>0, in order to prove g′2(x)>0, it is sufficient to prove ξ(x)>0.
Notice that the value of ξ(x) at x=0 is greater than 0:
ξ(0)=2[1−h2(0)]−h2(0)=2−6π>0. |
As x→∞, it follows from (1.8) that
1>h′(x)=h(x)[h(x)−x]>√8+x2+3x4⋅√8+x2−x4=√8+x2+3x2(√8+x2+x)→1. |
Thus, we have 1−h′(x)→0. Similarly, inequalities
0<2√8+x2+x<h(x)−x<2√4+x2+x→0 |
imply that h(x)−x→0, and x[h(x)−x]→1. Hence, we have:
[h(x)−xh′(x)][h(x)−x]=[h(x)−x]2−[1−h′(x)]⋅x[h(x)−x]→0. |
Therefore, we obtain that ξ→0.
We employ proof by contradiction to show ξ(x)>0. If there exists some x1>0 such that ξ(x1)≤0, then there exists x2>0 such that
{ξ(x2)≤0ξ′(x2)=0. |
In addition, we have
ξ′(x)=−2h″+xh″(h−x)−(h−xh′)(h′−1)=−2h[(h−x)2+h(h−x)−1]+xh″(h−x)−h(h−x)(h−xh′)+h−xh′=h[2−2h(h−x)−(h−xh′)(h−x)]−2h(h−x)2+xh″(h−x)+h−xh′=hξ+xh″(h−x)−2h(h−x)2+h−xh(h−x)=hξ+xh″(h−x)−h″=hξ+h″[x(h−x)−1]. |
Since h″(x)>0 and x[h(x)−x]<1, it follows that
ξ′(x)<h(x)ξ(x). |
Hence, we obtain
ξ′(x2)<h(x2)ξ(x2)≤0. |
This contradicts ξ′(x2)=0. We complete the proof the lemma.
Proof of Theorem 2.4. Substituting ˉy for μθ into (1.2) and rearranging the equation, we have
ˉy=σ[h(ˆθ∗)−ˆθ∗]=σg1(ˆθ∗). | (2.13) |
Similarly, substituting s2y for σ2θ into (1.3), we obtain
s2y=σ2[1−h′(ˆθ∗)]. | (2.14) |
We eliminate σ between (2.13) and (2.14), hence we obtain:
s2yˉy2=1−h′(ˆθ∗)[h(ˆθ∗)−ˆθ∗]2=g2(ˆθ∗). |
From Lemma 2.5, g2(x) is invertible when x>0, so we have:
ˆθ∗=g−12(s2yˉy2)=g−12(CV2y) |
where CVy:=sy/ˉy is the coefficient of variation of the truncated sample.
The estimator of the standard deviation σ can be obtained directly from (2.13):
ˆσ=ˉyg1(ˆθ∗). |
Since θ=μ+σθ∗, the estimator for truncation point is:
ˆθ=μ+ˆσˆθ∗. |
Theorem 2.6. Let y1,y2,⋯,yn be a sample from a truncated normal population with density function (1.6), where the mean μ is known, the variance σ2 and the truncation point θ are unknown, the maximum likelihood estimators for θ and ˆσ are given as follows:
ˆθ=μ+ˆσˆθ∗ | (2.15) |
ˆσ=ˉyg1(ˆθ∗) | (2.16) |
where
ˆθ∗=g−12(n∑i=1(yi−ˉy)2nˉy2). | (2.17) |
Proof. Taking partial derivative of lnL in (1.7) with respect to θ, we obtain:
∂lnL∂θ=nσ[−ˉyσ−θ∗+h(θ∗)]. |
Equating the partial derivative to zero, hence:
ˉyˆσ=h(ˆθ∗)−ˆθ∗=g1(ˆθ∗). | (2.18) |
The partial derivative of lnL with respect to σ is:
∂lnL∂σ=−nσ+1σ3n∑i=1(yi+θ−μ)2+nφ(θ∗)1−Φ(θ∗)∂θ∗∂σ=−nσ+1σ3n∑i=1[(yi−ˉy)+σ(θ−μσ+ˉyσ)]2−nφ(θ∗)1−Φ(θ∗)θ−μσ1σ=−nσ+1σ3n∑i=1[(yi−ˉy)+σ(θ∗+ˉyσ)]2−nσθ∗h(θ∗). |
From (2.18), we know that ˆθ∗+ˉy/ˆσ=h(ˆθ∗), it therefore follows:
∂lnL∂σ|(ˆθ,ˆσ)=−nˆσ+1ˆσ3n∑i=1[(yi−ˉy)+ˆσh(ˆθ∗)]2−nˆσˆθ∗h(ˆθ∗)=−nˆσ[1−1nˆσ2n∑i=1(yi−ˉy)2−h(ˆθ∗)[h(ˆθ∗)−ˆθ∗]]=−nˆσ[1−1nˆσ2n∑i=1(yi−ˉy)2−h′(ˆθ∗)]. |
Setting the partial derivative with respect to σ2 equal to zero, we obtain:
1nn∑i=1(yi−ˉy)2=ˆσ2[1−h′(ˆθ∗)]. | (2.19) |
By eliminating ˆσ between (2.18) and (2.19), we have:
∑ni=1(yi−ˉy)2n¯y2=1−h′(ˆθ∗)h(ˆθ∗)−ˆθ∗=g2(ˆθ∗). |
Since g2 is monotonically increasing at x>0 from Lemma 2.5, hence:
ˆθ∗=g−12(∑ni=1(yi−ˉy)2nˉy2). |
Now, the estimator of σ can be obtained by (2.18):
ˆσ=ˉyh(ˆθ∗)−ˆθ∗=ˉyg1(ˆθ∗). |
Since θ∗=(θ−μ)/σ, it follows that ˆθ=μ+ˆσˆθ∗.
According to large-sample theory, the maximum likelihood estimators (ˆθ,ˆσ2) is asymptotically normal and the estimated approximate variance-covariance matrix is V=(−H)−1|(ˆθ,ˆσ), where H is the Hessian matrix of the log-likelihood function lnL in (1.7). Then
V(ˆθ)≈v11,V(ˆσ)≈v22,Cov(ˆθ,ˆσ)≈v12 |
where v11, v12, and v22 are the elements of V.
Similar to the results given by Cohen [9,11], the Hessian matrix H is obtained by:
h11:=nˆσ2[h′(ˆθ∗)−1] | (2.20) |
h12:=nˆσ2h(ˆθ∗)[1−ˆθ∗(h(ˆθ∗)−ˆθ∗)] | (2.21) |
h22:=−2nˆσ2−ˆθ∗h12 | (2.22) |
and the elements of the approximate variance-covariance matrix V are functions of h11, h12, and h22 as follows:
v11=h22h212−h11h22,v22=h11h212−h11h22,v12=−h12h212−h11h22. | (2.23) |
By asymptotic normality, An approximate 95 percent confidence interval of the trunction point θ is:
ˆθ±1.96×√v11. | (2.24) |
The fundamental principle of DNA sequencing is that the target DNA is randomly sheared into small fragments and sequenced separately; then, utilizing overlapping information between adjacent fragments identified (called reads), assembly packages pieced them together to restore the original sequence. Due to low coverage, sequencing errors, or repetitive sequences, some regions of the target DNA were not completed, and we call them gaps.
To address the difficulties caused by the repetitive sequences, a technique known as paired-end sequencing was developed. By paired-end sequencing, reads in pairs are produced from both ends of each DNA fragment and they are called mates each other. Mates are in opposite directions, at approximately known distances. Randomness in mates' distance results from random fragmentation of target DNA. However, the size of these fragments can be controlled within a certain range by biological technology. That is, the length of the DNA fragments in sequencing follows a certain probability distribution, which is usually assumed to be a normal distribution known.
The distance between mate pair provides information for the estimation of gap length. If a DNA fragment spans the entire gap and its indentation lengths L and R at the left and right ends (denoted by L and R, respectively) are known (as shown in Figure 1). Let the fragment's length be X∼N(μ,σ2), where μ and σ2 are known, thus the gap length θ=X−(L+R). Let Y=L+R, then Y=X−θ follows a truncated normal distribution, where θ is the truncation point.
Unlike the truncated normal distribution discussed earlier, the truncation variable Y is not only related to the truncation point but also affected by the start or end position of the fragment on the target DNA. Specifically, whether a fragment spans the gap depends on its length, as well as its starting position on the target DNA. In order to obtain more practical truncated distributions, we employ the Lander-Waterman model, a mathematical model for shotgun sequencing, and all estimates of gap length are based on this model.
Lander and Waterman [16] developed a mathematical model for shotgun sequencing. Some key statistics of sequencing projects, such as coverage, the average number of gaps, etc., can be estimated by this model. Therefore, the Lander-Waterman model was often used to guide the plan of sequencing projects.
Let G be the length of the target DNA sequence and P={f1,f2,⋯,fN} be a sequencing project for the target DNA, where fi=(Si,Xi) are i.i.d. random vectors. Si∼U(0,L) is the start position of the ith fragment located on the target DNA, Xi∼N(μ,σ2) is the length of fi, Si and Xi are mutually independent.
Roach et al. [17] introduced paired-end sequencing information into the Lander-Waterman model. A pair of reads named mate-pair is obtained by sequencing from both ends of each fragment in paired-end sequencing projects. If one read in mate-pair overlaps with a finished region (called a contig), and the other one overlaps with another contig, the mate-pair provides the information about the order of the contigs, as well as the information about the gap between the two adjacent contigs.
For a given gap in a sequencing project, it is assumed to be located in the region of the target sequence starting from a and of length θ. Therefore, the event that a fragment f=(S,X) spans the entire gap can be expressed as {S<a;S+X>a+θ}. All we can observe is that the mate-pair has an indent L=a−S on the left neighboring contig and an indent R=(S+X)−(a+θ) on the right neighboring contig, respectively. Indeed, Y=L+R=X−θ is a truncated variable. However, the conditional event for Y is {S<a;S+X>a+θ} instead of {X>θ}.
Let G={f=(S,X)∈P|S<a;S+X>a+θ} be the set of all the fragments that can span the entire gap. Let Li and Ri (i=1,⋯,n) be the left indent and the right indent of the mate-pair of the ith fragment in G. We define truncated sample as follows:
Y:={yi=Li+Ri,i=1,2,⋯,n} | (3.1) |
Our inference about gap length θ is based on the truncated sample Y. The conditional event is important for inference. Let A denote the conditional event {S<a;S+X>a+θ}, The following lemma gives an approximation to the probability P(A):
Lemma 3.1.
P(A)≈σG[h(θ∗)−θ∗][1−Φ(θ∗)] | (3.2) |
where θ∗=(θ−μ)/σ is defined as before.
Proof. From the definition for event A, we know that:
P(A)=P(0<S<a,a+θ<S+X<G)=P(0<S<a;a+θ−S<X<G−S)=P(θ<X<G;a+θ−X<S<(G−X)∧a). |
Let fX(x) be the density function of X, hence we have:
GP(A)=G∫GθfX(x)dx∫(G−x)∧aa+θ−x1Gds=∫G−aθfX(x)dx∫aa+θ−xds+∫GG−afX(x)dx∫G−xa+θ−xds=∫G−aθ(x−θ)fX(x)dx+(G−a−θ)∫GG−afX(x)dx. |
The integral in the second term is:
(G−a−θ)∫GG−afX(x)dx=P(G−a<X<G)=(G−a−θ)[Φ(G−μσ)−Φ(G−a−μσ)]≤(G−a−θ)[1−Φ(G−a−μσ)]≤σG−a−θG−a−μφ(G−a−μσ). |
The last equality above is from 1−Φ(x)<x−1φ(x) for x>0. Except for the gaps at the edge of the genome, G−a≫max(μ,σ,θ). For example, the genome size of E. coli is 4.4M, and the mean of insert size is generally less than 1K. Since φ(x)→0 as x→+∞, The second term is approximated as 0.
The first term is:
∫G−aθ(x−θ)fX(x)dx=σ∫G−aθ(x−μσ−θ−μσ)fX(x)dx=σ∫(G−a−μ)/σ(θ−μ)/σ(t−θ−μσ)φ(t)dt=σ[−φ(t)−θ−μσΦ(t)](G−a−μ)/σ(θ−μ)/σ≈σ[−φ(t)−θ−μσΦ(t)]+∞(θ−μ)/σ=σ[φ(θ−μσ)+θ−μσΦ(θ−μσ)−θ−μσ]=σ[φ(θ∗)+θ∗Φ(θ∗)−θ∗]. |
Combining the two terms, we finally obtain:
P(A)≈σG[φ(θ∗)−θ∗(1−Φ(θ∗))]=σ[h(θ∗)−θ∗][1−Φ(θ∗)]G. |
Conditioning on that event A occurs, Y=X−θ, we then have:
E(Y)=E(X−θ|A)=E(X|A)−θ. | (3.3) |
In a similar way to the proof of Lemma 3.1, we obtain the conditional expectation E(X|A), denoted by μθ. We substitute ˉy for E(Y) into (3.3), and we obtain a equation for the unknown parameter θ:
ˉy=μθ−θ. | (3.4) |
The solution of Eq (3.4) is the moment estimate of the truncation point θ.
Theorem 3.2. Let y1,y2,⋯,yn be a truncated sample as (3.1), the moment estimator for the truncation point θ is given as follows:
ˆθ=μ+σg−13(ˉyσ) | (3.5) |
where auxiliary function
g3(x):=1h(x)−x−x | (3.6) |
is monotonically decreasing.
First, we prove by a lemma that the auxiliary function g3 is monotonically increasing.
Lemma 3.3. The auxiliary function g3(x) defined in (3.6) is monotonically increasing.
Proof. The derivative of g3(x) is:
g′3(x)=−h′(x)−1[h(x)−x]2−1=1−h(x)[h(x)−x]−[h(x)−x]2[h(x)−x]2=1−[h(x)−x][2h(x)−x][h(x)−x]2. |
By (1.10), we know that [h(x)−x][2h(x)−x]−1>0, it follows that g′3(x)<0. Therefore, g3(x) is monotonically increasing.
Proof of Theorem 3.2. First, we calculate E[(X−μ)1A] in a similar way in Lemma 3.1:
G⋅E[(X−μ)1A]=∫G−aθ(x−μ)(x−θ)fX(x)dx+(G−a−θ)∫GG−a(x−μ)fX(x)dx≈σ2∫G−aθ(x−μσ)(x−μσ−θ−μσ)fX(x)dx≈σ2∫+∞θ∗(t2−θ∗t)φ(t)dt=σ2[Φ(t)−tφ(t)+θ∗φ(t)]+∞θ∗=σ2[1−Φ(θ∗)]. |
Next, we calculate E(X|A) as follows:
μθ=E(X|A)=1P(A)E(X1A)=G⋅E[(X−μ)1A]G⋅P(A)+μ≈σ[1−Φ(θ∗)]φ(θ∗)−θ∗[1−Φ(θ∗)]+μ=σφ(θ∗)1−Φ(θ∗)−θ∗+μ. |
Hence, we obtain:
μθ=σh(θ∗)−θ∗+μ. | (3.7) |
Substituting σ/(h(θ∗)−θ∗)+μ for μθ in (3.4) and rearrange it, we obtain:
g3(θ∗)=1h(θ∗)−θ∗−θ∗=ˉyσ. | (3.8) |
We know that g3 is monotonically decreasing from Lemma 3.3, hence:
ˆθ∗=g−13(ˉyσ). | (3.9) |
Since θ∗=(θ−μ)/σ, we thus have:
ˆθ=μ+σg−13(ˉyσ). |
Lemma 3.4.
V(Y)=V(X|A)≈σ2h″(θ∗)h(θ∗)[h(θ∗)−θ∗]2. | (3.10) |
Proof. In the same way as for Theorem 3.2, we obtain:
G⋅E[(X−μ)21A]=∫∞θ(x−μ)2(x−θ)fX(x)dx≈σ3∫+∞θ∗(t3−θ∗t2)φ(t)dt=σ3[−(t2+2)φ(t)+θ∗tφ(t)+θ∗(1−Φ(t))]+∞θ∗=σ3[2φ(θ∗)−θ∗(1−Φ(θ∗))]. |
Therefore,
E(X2|A)=1P(A)E(X21A)=1P(A){E[(X−μ)21A]+2μE[(X−μ)1A]+μ2P(A)}=G⋅E[(X−μ)21A]G⋅P(A)+G⋅E[(X−μ)1A]G⋅P(A)2μ+μ2≈σ22h(θ∗)−θ∗h(θ∗)−θ∗+2σμh(θ∗)−θ∗+μ2. |
Consequently, we have:
V(X|A)=E(X2|A)−[E(X|A)]2≈σ22h(θ∗)−θ∗h(θ∗)−θ∗−σ2[h(θ∗)−θ∗]2=σ2[2h(θ∗)−θ∗][h(θ∗)−θ∗]−1[h(θ∗)−θ∗]2=σ2h″(θ∗)h(θ∗)[h(θ∗)−θ∗]2. |
The last equation is from (1.10) in Lemma 1.1.
Theorem 3.5. For large samples, the variance of ˆθ in (3.5) is approximated by:
V(ˆθ)≈h(θ∗)[h(θ∗)−θ∗]2σ2nh″(θ∗). | (3.11) |
Proof. In a similar way as for Theorem 2.2, we have:
V(ˆθ)≈[(g−13)′]2y=ˉy/σ⋅V(ˉy)=1[g′3(ˆθ∗)]2⋅1nV(Y)=h(θ∗)[h(θ∗)−θ∗]2h″(θ∗)⋅σ2n. |
Here we use the results for g′3 in Lemma 3.3.
The conditional density of Y=X−θ under the condition that the event A occurs:
fY(y|θ)=1P(A)∫aa−yfS,X(s,y+θ)ds=1P(A)∫aa−y1GfX(y+θ)ds=1P(A)yGfX(y+θ)=y/σφ(θ∗)−θ∗[1−Φ(θ∗)]1√2πσe−12σ2(y+θ−μ)2 |
where fS,X the joint probability density of S and X.
Therefore, the log-likelihood function for the truncated sample in (3.1) is:
lnL(θ)=n∑i=1lnYi−n2ln(2π)−nlnσ2−nln(φ(θ∗)−θ∗[1−Φ(θ∗)])−12σ2n∑i=1(Yi+θ−μ)2. |
Differentiating lnL(θ) with respect to θ, we have:
dlnLdθ=nσ⋅1−Φ(θ∗)φ(θ∗)−θ∗[1−Φ(θ∗)]−1σ2n∑i=1(Yi+θ−μ)=nσ⋅1h(θ∗)−θ∗−nσ2(ˉy+θ−μ)=nσ (1h(θ∗)−θ∗−ˉyσ−θ∗). |
Setting this derivative equal to zero, we obtain the same equation as (3.8):
g3(θ∗)=1h(θ∗)−θ∗−θ∗=ˉyσ. |
Hence, we obtain the same estimator as (3.5).
The second-order derivative of lnL(θ) is:
d2lnLdθ2=nσ2⋅−h′(θ∗)+1[h(θ∗)−θ∗]2−nσ2=nσ2⋅1−[h(θ∗)−θ∗][2h(θ∗)−θ∗][h(θ∗)−θ∗]2=−nσ2⋅h″(θ∗)h(θ∗)[h(θ∗)−θ∗]2. |
By large sample properties of maximum likelihood estimates, we have:
√n(ˆθ−θ)→dN[0,(−d2lnLndθ2)−1]. |
The variance of the maximum likelihood estimate of the gap length ˆθ can be approximated by:
Var(ˆθ)≈(−d2lnLdθ2)−1|θ=ˆθ=h(ˆθ∗)[h(ˆθ∗)−ˆθ∗]2h″(ˆθ∗)⋅σ2n. |
Based on the asymptotic normality for large samples, the approximate 95% confidence interval for the truncation point θ is:
ˆθ±1.96×√h(ˆθ∗)[h(ˆθ∗)−ˆθ∗]√h″(ˆθ∗)×σ√n. | (3.12) |
Since the analytic form of the auxiliary functions g1(x), g2(x), and g3(x) cannot be given, we usually use numerical search algorithms, such as the Newton-Raphson method, to obtain approximate solutions to the maximum likelihood estimate of the truncation point or gap length θ. In practice, another simple approach is to construct the inverse function table based on the monotonicity of the auxiliary function and combine it with interpolation techniques, which can quickly obtain the inverse function values. In the following, we perform Monte Carlo simulations for the truncated normal model and the Lander-Waterman model, respectively, to evaluate the estimates and the corresponding confidence intervals given earlier. Finally, our estimation methods are tested on actual DNA sequencing dataset.
Using Monte Carlo simulations, we evaluated the large-sample nature of the point estimates of the truncation point θ in (2.1) and the interval estimates in (2.8). In our experiments, the parameters of the population X are set as μ=180 and σ2=372. The combinations of truncation point θ=30,90,150 and sample size n=50,500 were simulated respectively, and each combination was sampled 1000 times.
It can be seen from Figure 2 that the kernel densities of the truncated point estimates in (2.1) for different combinations are very close to the corresponding curves of theoretical normal densities. The numbers in the figure indicate the proportion that the true parameters are covered by the approximate 95% confidence intervals, and these numbers show that the true confidence coefficients of the confidence intervals given by (2.8) are quite close to 95%. Therefore, the simulation results of the truncation point estimators given in this paper are completely consistent with the theoretical analysis results.
In this experiment, we randomly generated a DNA sequence of 316K bp in length with 500 gaps, and randomly generated 1.58M DNA fragments from it. The length of DNA fragment is normally distributed: X∼N(180,372). In the same sequencing project, the larger the gap length θ, the less the number of fragments n spanning the gap, and there is roughly linear association between θ and n (see Figure 3).
In order to evaluate the estimates in (3.5) at different gap lengths, studentized errors were computed as follows:
t=θ−ˆθ^σθ |
where the standard deviation ^σθ was estimated by the square root of V(ˆθ) in (3.11).
From this experiment, we found that 6.2% of the studentized error falls outside the approximate 95% confidence limit (see Figure 4). This shows that the true confidence coefficient of the approximate confidence interval given by (3.12) is very close to the nominal confidence coefficient of 0.95.
The Illumina whole-genome paired-end sequencing dataset SRR001665 of E. coli was used in this experiment. The total length of the target sequence was 4,639,675 bp, and a total of 20,816,448 pairs of read sequences of length 36 were found in this sequencing project. The dataset can be accessed online from https://www.ncbi.nlm.nih.gov/sra.
By mapping read pairs to reference sequences of the E. coli genome, we obtain the distribution of insert size for the SRR001665 dataset, which follows a normal distribution with mean μ=215.43 and standard deviation σ=10.51 almost perfectly (see Figure 5).
First, we used the velvet [18] assembly software to assemble the reads of the sequencing dataset into contigs, and the paired-end information was not used in the assembly phase. Then, we utilized the paired-end information to find adjacent contigs by mapping mate pairs onto contigs. That is, one of a read pair is on one contig and the other one is on some other contig, and so this read pair provides evidence that the two contig are adjacent. Furthermore, under the assumption of normality for the distribution of insert size, the length of the gap between two adjacent contigs can be estimated by (3.5) in Theorem 3.2.
For the large sample nature, we impose a limit on the number of mate pairs spanning the gap, i.e., we require the size of the truncated sample Y in (3.1) to be no less than 50. In the SRR001665 dataset, there are 542 gaps satisfying the condition n≥50, and the estimation results are shown in Figure 6.
The experimental results show that the estimated values are very close to the true values (see Figure 6a). Most of the estimates have errors within 4, with one outlier that has an error close to 8 (see Figure 6b). In fact, by the estimate variance in (3.11), outliers can be used to diagnose whether the connection of two contigs considered to be adjacent is reliable.
It should be noted that the performance of the estimator (3.5) is sensitive to the normality. Small departures from normality do not create any serious problems in the estimates of gap length. However, if the distribution of insert size depart far from the normality, the estimator may have notable bias.
In fact, the length distribution of inserts is mainly affected by different platforms and their operational parameters, and in general, the insert size in a specific project approximately follow a normal distribution, which is generally slightly right-skewed [19].
In this paper, we study the truncation point estimation for truncated normal distribution, where the mean μ is known and the actual sample observations are the original observations after subtracting the truncation point. The cases in which the variance σ2 is known and unknown are discussed, and the moment method and the maximum likelihood method are employed to give the estimators of the unknown parameters. When σ2 is known, the two methods obtain the same estimator of the truncation point. When σ2 is unknown, the two methods yield similar results, and the key difference lies in the estimation of the sample variance. Approximate confidence intervals for truncation point θ are given for large samples. Based on the Lander-Waterman model, the method of estimating truncation points is extended to the estimation of gap length in DNA sequencing, and the point estimation and interval estimation of gap length under the assumption of normality are given. The Monte Carlo experimental results show that the estimators given in this paper are consistent with the results of the simulation experiments, whether it is the truncated normal model or the Lander-Waterman model. Experimental results from actual DNA sequencing dataset show that accurate estimates of gap length can be obtained by the estimation method proposed in this paper when the insert size is approximately normally distributed.
This work was supported by the Foundation of Jiangxi Educational Committee under Grant GJJ150926; Jingdezhen Ceramic University National Level Student Innovation and Entrepreneurship Training Program Project under Grant 202110408025.
The authors declare no conflict of interest.
[1] | K. S. Lomax, Business failures: Another example of the analysis of failure data, J. Am. Stat. Assoc., 49 (1954), 847–852. |
[2] | L. Bain, M. Engelhardt, Introduction to Probability and Mathematical Statistics, Duxbury Press, 1992. |
[3] | M. E. Ghitany, F. A. Al-Awadhi, L. A. Alkhalfan, Marshall-Olkin extended Lomax distribution and its application to censored data, Commun. Stat. Theory Methods, 36 (2007), 1855–1866. |
[4] | I. B. Abdul-Moniem, H. F. Abdel-Hameed, On exponentiated Lomax distribution, Int. J. Math. Arch., 3 (2012), 2144–2150. |
[5] | A. J. Lemonte, G. M. Cordeiro, An extended Lomax distribution, Statistics, 47 (2013), 800–816. |
[6] | G. M. Cordeiro, E. M. M. Ortega, D. C. C. da Cunha, The Exponentiated Generalized Class of Distributions, J. Data Sci., 11 (2013), 1–27. |
[7] | M. H. Tahir, G. M. Cordeiro, M. Mansoor, M. Zubair, The Weibull-Lomax distribution: properties and applications, Hacettepe J. Math. Stat., 44 (2015), 461–480. |
[8] | B. AL-Zahrani, An extended Poisson-Lomax distribution, Adv. Math. Sci. J., 4 (2015), 79–89. |
[9] | M. Nassar, S. Dey, D. Kumar, Logarithm transformed Lomax distribution with applications, Calcutta Stat. Assoc. Bull., 70 (2018), 122–135. |
[10] | V. Pappas, K. Adamidis, S. Loukas, A family of lifetime distributions, Int. J. Qual., Stat. Reliab., (2012), 1–6. |
[11] | B. Al-Zahrani, P. R. D. Marinho, A. A. Fattah, A. H. N. Ahmed, G. M. Cordeiro, The (P-A-L) extended Weibull distribution: a new generalization of the Weibull distribution, Hacettepe J. Math. Stat., 45 (2016), 851–868. |
[12] | S. Dey, M. Nassar, D. Kumar, Alpha logarithmic transformed family of distributions with applications, Ann. Data Sci., 4 (2017), 457–482. |
[13] | M. Eltehiwy, Logarithmic inverse Lindley distribution: Model, properties and applications, J. King Saud. Univ. Sci., 32 (2018), 136–144. |
[14] | I. S. Gradshteyn, I. M. Ryzhik, Tables of Integrals, Series and Products, Seventh Edition, Academic Press, Inc., 2007. |
[15] | A. Rényi, On measures of entropy and information, In: Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1961,547–561. |
[16] | J. Navarro, M. Franco, J. M. Ruiz, Characterization through moments of the residual life and conditional spacing, Indian J. Stat., Ser. A, 60 (1998), 36–48. |
[17] | D. N. P. Murthy, M. Xie, R. Jiang, Weibull Models, John Wiley, New York, 2004. |
[18] | A. Bekker, J. Roux, P. Mostert, A generalization of the compound Rayleigh distribution: using a Bayesian methods on cancer survival times, Commun. Stat. Theory Methods, 29 (2000), 1419–1433. |