For any pair of probability measures defined on a common space, their relative information spectra (specifically, the distribution functions of the loglikelihood ratio under either probability measure) fully encapsulate all that is relevant for distinguishing them. This paper explores the properties of the relative information spectra and their connections to various measures of discrepancy including total variation distance, relative entropy, Rényi divergence, and general $ f $-divergences. A simple definition of sufficient statistics, termed $ I $-sufficiency, is introduced and shown to coincide with longstanding notions under the assumptions that the data model is dominated and the observation space is standard. Additionally, a new measure of discrepancy between probability measures, the NP-divergence, is proposed and shown to determine the area of the error probability pairs achieved by the Neyman-Pearson binary hypothesis tests. For independent identically distributed data models, that area is shown to approach 1 at a rate governed by the Bhattacharyya distance.
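As a rough illustration of the objects named in the abstract, the following sketch evaluates the two distribution functions of the loglikelihood ratio for a pair of finite distributions, and recovers the total variation distance from their values at zero. The function names, the example distributions `P` and `Q`, and the restriction to finite alphabets with strictly positive probabilities are assumptions made for this example only; they are not taken from the paper, which treats general probability measures.

```python
import math

def relative_information(p, q):
    # Loglikelihood ratio log(p(x)/q(x)) at each point of a finite alphabet.
    # Assumes every probability is strictly positive.
    return [math.log(pi / qi) for pi, qi in zip(p, q)]

def spectrum(p, q, under, alpha):
    # Probability, under the measure `under`, that the loglikelihood
    # ratio log(p/q) does not exceed alpha.
    iota = relative_information(p, q)
    return sum(w for w, i in zip(under, iota) if i <= alpha)

def total_variation(p, q):
    # Total variation distance, normalized to lie in [0, 1].
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def bhattacharyya_distance(p, q):
    # Negative logarithm of the Bhattacharyya coefficient sum(sqrt(p*q)).
    return -math.log(sum(math.sqrt(pi * qi) for pi, qi in zip(p, q)))

# Hypothetical example distributions on a three-letter alphabet.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

f_p = spectrum(P, Q, P, 0.0)  # distribution function under P, at 0
f_q = spectrum(P, Q, Q, 0.0)  # distribution function under Q, at 0

# The event {log(p/q) > 0} is the set where P exceeds Q, so the gap
# between the two distribution functions at 0 is the total variation.
tv_from_spectra = f_q - f_p
```

In this finite setting, `tv_from_spectra` agrees with `total_variation(P, Q)`, which is one small instance of the two spectra carrying the information needed to distinguish the measures.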
Citation: Sergio Verdú. Relative information spectra with applications to statistical inference. AIMS Mathematics, 2024, 9(12): 35038-35090. doi: 10.3934/math.20241668