1.
Introduction
Many real-world phenomena involve interactions between multiple factors. A multivariate data analysis (MVDA) allows researchers to dissect these complex systems and to understand how different variables affect each other and contribute to the overall outcomes. For example, in medical research, datasets often include multiple variables such as patient demographics, genetic information, medical history, biomarkers, and treatment outcomes. MVDA helps researchers identify factors that influence the disease risk, treatment response, and patient outcomes. This is just an example for illustration; however, MVDA is relevant to many other fields, including engineering, economics, agriculture, public health, psychology, urban planning, energy, and more. Essentially, any discipline that deals with complex systems or phenomena that involve multiple interacting factors relies on multivariate analyses to extract meaningful insights from data. Accordingly, many statisticians were interested in extending univariate and bivariate statistical tests to multivariate cases. Among these tests are the sign tests. Sign-based approaches are non-parametric methods that are very appealing because of their inherent simplicity and resistance to standard Gaussian assumptions. Sign tests originated in the univariate case when they were primarily used to assess issues with location and symmetry. Over several decades, multivariate extensions of univariate sign-based approaches have garnered significant interest. Bivariate sign test location testing may be traced back to Hodges [1] and Blumen [2]. Numerous sign test and signed-rank approaches for the multivariate location issue have recently been developed in scientific literature. Randles [3] suggested an inter-directional, distribution-free multivariate sign test. Hettmansperger et al. [4] presented a new approach to conduct hypothesis tests on the central location of multivariate data (MVD), and emphasized the importance of asymptotic invariance for robust statistical inference. Möttönen and Oja [5] introduced the multivariate spatial sign method to robustly estimate location parameters in an MVDA. Hettmansperger et al. [6] presented a novel approach to construct affine-invariant multivariate one-sample sign tests. For more information and knowledge about non-parametric tests for MVD, the reader can review the following references: Möttönen and Oja [7], Larocque and Labarre [8], Mahfoud and Randles [9], and Bernard and Verdebout [10]. Moreover, we refer to Oja's article, Oja [11], which reviewed the literature on multivariate sign-rank tests, and focused on their properties, applications, and limitations.
Approximation techniques in statistics approximate intricate statistical measures, distributions, or functions when precise computations become challenging or unfeasible. Such methods prove invaluable when analytical solutions are absent or when computations entail high-dimensional or computationally demanding tasks. This article suggests the saddlepoint approximation (SPA) to approximate the exact distribution function for the multivariate sign and signed-rank test class. The SPA is a method employed in statistics and the probability theory to approximate probability distributions, especially in situations where traditional methods such as numerical integration or exact calculations are challenging. SPA provides highly accurate approximations to the distribution of a statistic, and often outperforms traditional methods such as the normal approximation, especially in the distribution's tails. It includes terms from the Edgeworth expansion, which provides a more refined approximation compared to the central limit theorem. It is computationally feasible with modern computing resources, which allows practitioners to implement it practically. It improves the accuracy of statistical inference, particularly for hypothesis testing and confidence interval estimation, thus leading to more reliable conclusions. Unlike many other asymptotic methods, the SPA is often effective even with relatively small sample sizes, thus making it useful in practical situations where data may be limited. The method has been implemented in various statistical software packages, making it accessible to practitioners without requiring deep theoretical knowledge of the underlying mathematics. Some essential references on SPA include Daniels [12], Lugannani and Rice [13], Skovgaard [14], Barndorff-Nielsen and Cox [15], and Butler [16]. The SPA has applications in various statistics and related fields, such as biostatistics and epidemiology, reliability engineering, finance and risk management, genetics and genomics, machine learning and data science, statistical physics, and thermodynamics. The following are some recent references on SPA: Meng et al. [17], Abd El-Raheem and Abd-Elfattah [18,19], Zhao et al. [20], Shanan et al. [21], Meng et al. [22], Abd El-Raheem and Hosny [23], and Abd El-Raheem et al. [24,25]. These references cover various aspects and applications of SPA, including tail probabilities, high-dimensional data, genetics, and nonlinear functionals. They provide insights into recent developments and advancements in the field, thus making them valuable resources for researchers interested in SPA and related topics.
The existing studies of asymptotic approximations for the p-value of the one-sample multivariate sign and signed rank tests provide valuable and accessible tools for non-parametric inferences in multivariate settings. While these approaches are powerful, they have some limitations, the most important of which is their reliance on large sample sizes. The asymptotic normal method assumes that the sample size is sufficiently large for the central limit theorem to hold. For small or moderate sample sizes, the normal approximation may not be accurate, thus leading to biased or incorrect inferences. Furthermore, for small sample sizes, the distribution of the sample mean or other statistics may significantly deviate from normality, making the approximation unreliable. Thus, the accuracy of asymptotic approximations in small samples is a significant concern. Accordingly, in this article, we propose the SPA method to approximate the p-value of the one-sample multivariate sign and signed rank tests as a more accurate alternative to the normal approximation, especially with medium and small sample sizes, and as a less computationally demanding and time-consuming alternative to the permutation method.
The remaining sections of this article are organized as follows: Section 2 presents multivariate sign and signed-rank tests with minimal adjustments to account for the linear case; the proposed approximation is described in Section 3; Sections 4 and 5 compare the performance of the saddlepoint technique versus the asymptotic method using numerical examples and simulation studies; and finally, the conclusion is provided in Section 6.
2.
Multivariate sign and signed-rank tests
This section presents the most common sign and signed-rank test statistics for multivariate samples. The first statistic of the multivariate sign test was developed by Hettmansperger et al. [4] as a general multivariate analog of the bivariate sign test. Suppose that X1,X2,...,Xn is a random sample from the multivariate symmetric distribution F with an unknown symmetry center θ. Here, symmetry means that Xi−θ and θ−Xi are identically distributed. To examine the hypothesis, H0: θ=0, let H be a fixed half-space such that if X is a member of H, then −X is not a member and let Xi=aiYi,i=1,…,n, where Yi∈H, ai indicates whether Xi belongs to H or not.
First, we start by presenting the statistic in the trivariate case; then, we present its generalization to the multivariate case. Thus, the trivariate sign test statistic is given by the following:
where
such that
indicates whether 0 is above or below the plane defined by the three points Yi, Yj and Yl.
From the above, we can present the generalization of the statistic Ms to the case of the k-variate as follows:
where
and Wj(i1,…,ik−1),j=1,2,...,k is the cofactor of yij for the matrix
and
which indicates whether 0 is above or below the hyperplane defined by the points Yi1,…,Yik.
Under the null hypothesis, which is dependent upon the observed values Y1,Y2,...,Yn, the ai are independent and P(ai=1)=P(ai=−1)=1/2, which means that E(Ms|H0)=0 and σ2=E(MsMsT|H0)=∑ni=1qiqiT.
The second statistic is a multivariate signed-rank test statistic, which was introduced by Hettmansperger et al. [7]. Let X1,X2,...,Xn be a random sample from a k-variate continuous distribution and
be the set of NP=(nk) different k-tuples of the index set {1,2,…,n}. The index p belongs to the set P, which refers to a k-subset of the initial observations. Furthermore, using the k observations provided in p as vertices, p determines a (k−1)-dimensional hyperplane (passing through the k observations included in p) in the k-dimensional space and a (k−1)-dimensional sub-simplex.
Based on the symbols and definitions contained in the previous statistic (1), let E be the set of 2k possible vectors (±1,±1,…,±1) and define the following:
where Spe(X)=sgn(d0pe+XTdpe), such that d0pe=(−1)kdet(e1Xi1,e2Xi2,…,ekXik) and dpe is the k-dimensional vector of cofactors of X in the following:
If Spe(X)>or(<)0,then this means the hyperplane p is above or below X, respectively.
The formula for the multivariate signed-rank test statistic is given by the following:
In the statistic (2), Xi=aiYi, where Yi∈H. Hence, ai=±1 as Xi∈H or Xi∈Hc, and R is defined as the vector signed-rank function as follows:
Under the null hypothesis, which is dependent upon the observed values Y1,Y2,...,Yn, the ai are independent and P(ai=1)=P(ai=−1)=1/2. Hence, E(MR|H0)=0 and covariance matrix is as follows:
Because both the multivariate sign test statistic Ms in Eq (1) and the multivariate signed-rank test statistic MR in Eq (2) are basically multivariate normal distributions, we can modify them and obtain an equivalent statistic of a linear nature form as follows:
where Ti=∑kj=1qij forthe sign test statistic Ms in Eq (1) and Ti=∑kj=1rij for the signed-rank test statistic MR in Eq (2).
The following section derives the SPA for the permutation distribution of the class D in Eq (3).
3.
Saddlepoint p-values of the multivariate sign and signed-rank tests
Let bi=ai+12; then, the statistic D in Eq (3) can be written as follows:
where bi=0or1fori=1,2,…,n are independent identically Bernoulli (1/2) random variates. Thus, the moment generating function of the statistic D in Eq (4) is given by
and the cumulant generating function of the statistic D in Eq (4) is given by
The SPA for the distribution function [13], FD(.), and the probability mass function [12], fD(.), at D=D0 are given, respectively, by the following:
and
where
are functions of D0 where C″D and C‴D are the second and third derivatives of CD, respectively. The two symbols, Φ and φ, denote the standard normal distribution and density functions, respectively, and the symbol sgn(˜s) denotes the sign of ˜s. The saddlepoint ˜s is the unique solution of the equation C′D(˜s)=D0, that is,
The Newton-Raphson method is used to solve the saddlepoint equation in Eq (6). The calculation of the saddlepoint p-value is summed up as follows: start by solving the saddlepoint Eq (6) to find ˜s at given D0; then use ˜s to find ˜wand ˜u; and finally, substitute with ˜w and ˜u in Eq (5) to find ˆFD(D0). Thus, the saddlepoint p-value at D0 is given by ˜P(D≥D0)=1−ˆFD(D0).
It is necessary to point out that the test statistic in (4) includes discrete random variables, even though the continuous SPA was used to approximate its distribution function. The reason for using the continuous formula is that it offers the most precise approximation for the mid-p-value. This accuracy was discussed by Pierce and Peters [26], Davison and Wang [27], and discussed in Section 6.1.4 in Butler [16]. The simplest explanation from the last reference suggests that a continuous SPA serves as an approximation to the true inverse Fourier transform, which determines P(D≤D0). Given that P(D≤D0) has a step discontinuity at D0, the exact Fourier inversion at D=D0 is the midpoint of the step or the mid-p-value, which is what the continuous SPA actually approximates; see Theorem 10.7b in the reference Henrici [28].
4.
Numerical examples
Analyzing some numerical examples can deepen our grasp of how various methods accurately approximate the exact p-value of the multivariate sign and signed-rank tests. Therefore, this section includes the analysis of four distinct types of multivariate real data sets (i.e., four numerical examples). Rao [29] conducted a study that involved cork bores from the north (N), east (E), south (S), and west (W) directions of tree trunks in a plantation block comprised of 28 trees. The aim was to assess whether there were variations in the bark deposit thickness and weight across the four cardinal directions. The tested hypothesis was whether cork bores on the trees exhibited a uniform thickness in all directions. In the first example, 14 observations were selected from the original dataset, which were utilized by Hettmansperger et al. [4] in their article. In the second example, all 28 observations were included. For both examples, the data were transformed into trivariate observations using x1=E−N,x2=S−Eandx3=W−S; see Hettmansperger et al. [4] for more information about these transformations. The third example analyzes multivariate lung function data from Merchant et al. [30] for 12 workers that were exposed to cotton dust for six hours. The data includes numerous variables discussed in the attempt to determine lung function changes, including forced vital capacity, forced expiratory volume, and closing capacity. These data were analyzed to test the hypothesis that the mean vector of lung function changes is equal to 0. The fourth example is 4D data that represents the monthly minimum grass temperature (℃) recorded in 2022 for the Cavan, Donegal, Carlow, and Galway counties in Ireland (Met Éireann [31]). This data is presented in Table 1. The tested hypothesis was whether the monthly minimum grass temperatures were uniform in the four counties. The approximated p-values using the simulation, SPA, and normal approximation (NA) methods are displayed in Table 2.
Based on the results shown in Table 2, we observe that the suggested approximation method, namely the saddlepoint method, is more accurate than the normal approximation method. This is evident from the proximity of the saddlepoint method results to those of the simulation method. It is worth noting that we cannot calculate the exact p-value due to the unknown distribution of the test statistic. However, this can be compensated for by calculating the p-value by assuming all permutations of the statistic. However, this requires a significant amount of time. This method is called the simulation or reference method. We use this method to compare the accuracy of the approximation techniques, which are the normal and saddlepoint methods. Throughout this article, we will refer to the p-value approximated using the simulation method as either a simulated p-value or a reference p-value.
5.
Simulation study
In this section, the accuracy of the SPA for the two tests, the multivariate sign test (MST) and multivariate signed-rank test (MSRT) is verified by conducting a simulation study. Four multivariate distributions are used to simulate the data: the standard multivariate normal distribution with a correlation coefficient equal to 0.5, the standard multivariate logistic distribution, the standard multivariate extreme value distribution, and the standard multivariate exponential distribution. The motivation behind selecting these distributions for data generation in the simulation study is to ensure a comprehensive evaluation of the statistical methods under a wide range of conditions (e.g., symmetry, heavy tails, outliers, and skewness). This variety of distributions helps to test the methods' robustness, versatility, and applicability to real-world data, and ultimately provides a thorough understanding of their performance. 1,000 datasets are generated from the four distributions with different sample sizes, n=10,20,30and50. Tables 3–6 show the results for the four distributions. The following data are included in each table: "Sad.P." refers to the percentage of the 1,000 different datasets in which the saddlepoint p-value was closer to the simulated mid p-value than it was to the asymptotic normal p-value; the term 'E.Nor.' refers to the average relative absolute error of the normal p-value from the simulated mid p-value; and the term 'E.Sad.' refers to the average relative absolute error of the saddlepoint p-value from the simulated mid p-value. The simulated mid p-value is calculated based on 106 permutations of the indicators {bi}.
To illustrate the results obtained from the simulation study, we take Table 5 with n=30 and the MST as an illustration, and note that the saddlepoint p-values were closer to the simulated mid p-value values 97.1% of the time with a relative absolute error of 3.988% versus 65.826% for the asymptotic normal method. Across all tests and cases, the SPA proved to be highly accurate and superior to the normal approximations. This is evidenced by the high proportions listed in the Sad.P. rows in Tables 3–6.
Previously, it was verified that the saddlepoint method is more accurate than the normal approximation method. Now, we must explain why the saddlepoint method is a possible alternative to the simulation method. To clarify this, we calculated the computing time for both methods, and the results are presented in Table 7. Table 7 shows a significant difference between the computing times of the saddlepoint and simulation approaches. Using the SPA approach, we can compute the average value of 1000 p-values for each of the considered cases in less than a minute. In contrast, the simulation method takes between fifty to one hundred and thirty hours to compute the corresponding values.
6.
Conclusions
In conclusion, the MVDA offers a robust statistical approach to examine datasets with multiple variables, and captures the complex interactions and relationships that univariate or bivariate analyses cannot. This article highlights the challenges of computing exact distributions for nonparametric tests of one-sample multivariate location problems, which are typically addressed through an asymptotic approximation. This study introduced the saddlepoint method as a more accurate alternative to the traditional asymptotic approximation and a faster alternative to the time-intensive simulation method. The effectiveness of the saddlepoint method was demonstrated through illustrative examples and a simulation study, thus underscoring its potential as a superior approach for approximating distribution functions in multivariate analyses.
Author contributions
A. M. Abd El-Raheem: Conceptualization, Methodology, Investigation, Software, Writing – review & editing; I. A. A. Shanan: Visualization, Resources, Software, Writing – original draft; M. Hosny: Funding acquisition, Project administration.
Use of AI tools declaration
The authors declare that they have not used Artificial Intelligence tools in the creation of this article.
Acknowledgments
The third author extends her appreciation to the Deanship of Scientific Research and Graduate Studies at King Khalid University for funding this work through Large Research Project under grant number RGP2/398/45.
Conflict of interest
The authors declare no conflict of interest.