
Relative information spectra with applications to statistical inference

  • For any pair of probability measures defined on a common space, their relative information spectra, namely the distribution functions of the log-likelihood ratio under either probability measure, fully encapsulate all that is relevant for distinguishing them. This paper explores the properties of the relative information spectra and their connections to various measures of discrepancy including total variation distance, relative entropy, Rényi divergence, and general f-divergences. A simple definition of sufficient statistics, termed I-sufficiency, is introduced and shown to coincide with longstanding notions under the assumptions that the data model is dominated and the observation space is standard. Additionally, a new measure of discrepancy between probability measures, the NP-divergence, is proposed and shown to determine the area of the error probability pairs achieved by the Neyman-Pearson binary hypothesis tests. For independent identically distributed data models, that area is shown to approach 1 at a rate governed by the Bhattacharyya distance.

    Citation: Sergio Verdú. Relative information spectra with applications to statistical inference[J]. AIMS Mathematics, 2024, 9(12): 35038-35090. doi: 10.3934/math.20241668




    Shortly after the advent of information theory [1], Kullback and Leibler [2] introduced relative entropy (or Kullback-Leibler divergence) as a means of generalizing to arbitrary alphabets Shannon's foundational information measures: entropy, differential entropy, and mutual information. They recognized that relative entropy could play a pivotal role, not in Shannon's data compression and transmission problems, but in statistical inference, in particular, in the theory of sufficient statistics, which had recently been put on a sound mathematical footing by Halmos and Savage [3]. The application of information theory to statistical inference, initiated in [2], continued with Fano's inequality [4], a lower bound on error probability in Bayesian M-ary hypothesis testing based on conditional entropy. Lindley [5] suggested using mutual information to explore sufficiency. Chernoff [6] found an asymptotic operational role for relative entropy in another fundamental pillar of statistical inference, the theory of hypothesis testing pioneered by Neyman and Pearson in [7]. Soon after, Sanov [8] showed that relative entropy plays a pivotal role in the theory of large deviations pioneered two decades earlier by Cramér in [9]. For the purpose of statistical modeling, Jaynes [10,11] and Kullback [12] advocated the maximization of entropy and the minimization of relative entropy with a fixed nominal reference measure, respectively.

    Other information theoretic measures would prove useful in statistical inference. Rényi divergence [13] and Chernoff information [14] emerged as key tools in the asymptotic analysis of non-Bayesian and Bayesian hypothesis testing, respectively. Csiszár [15] showed that the role of relative entropy in sufficient statistics, found by Kullback and Leibler, could be extended to f-divergences, a much wider collection of discrepancy measures that obey the data processing principle (no processing can increase them). Among the many f-divergences that have found widespread applications in statistical inference are total variation distance, \chi^2 -divergence [16], Hellinger distance [17], Hellinger divergence [14], Vincze-Le Cam divergence [18,19], and de Groot statistical information [20].

    Moving forward to the last decade of the XXth century, [21] started a new direction in information theory: The information spectrum approach, whose original goal was to generalize the flagship asymptotic results in information theory without assumptions of discreteness, memorylessness, ergodicity, or even stationarity. Working with very little structure has the benefit of bringing out the essential aspects that allow Shannon's results [1] to transcend their original habitat. A price to be paid for the generality of those results is that entropy, relative entropy, and mutual information are no longer sufficient to express the asymptotic fundamental limits. Those information measures are expectations of random variables whose distributions, dubbed information spectra in [21], emerge as the crucial ingredients in the solution. Han's monograph [22] provides a comprehensive overview of the application of the information spectrum method to the asymptotic fundamental limits in various domains, including lossless and lossy data compression, data transmission, hypothesis testing, channel resolvability, and random number generation. Started in [23], another trend in information theory seeks to determine non-asymptotic fundamental limits, e.g., what is the transmission rate compatible with a blocklength of 1000 and a probability of decoding error of 10^{-2} ? Approximate answers to this type of question can be obtained through upper and lower bounds that depend on the information spectra.

    Entropy is a special case of mutual information, which in turn is a special case of relative entropy. The relative entropy, D(P_X \,\|\, P_Y) , of probability measures P_X and P_Y defined on the same space is the expectation of the random variable \imath_{X\|Y} (X) , where \imath_{X\|Y} (x) stands for the relative information defined as the logarithm of the Radon-Nikodym derivative \frac{\mathrm{d}P_X}{\mathrm{d}P_Y} (x) , or more generally, \log \frac{\mathrm{d}P_X}{\mathrm{d}\mu} (x) - \log \frac{\mathrm{d}P_Y}{\mathrm{d}\mu} (x) , where \mu dominates both probability measures. The key objects of interest in this paper are the relative information spectra, namely, the cumulative distribution functions of \imath_{X\|Y} (X) and \imath_{X\|Y} (Y) . In addition to a number of properties satisfied by the relative information spectra, we show new results in both sufficient statistics and binary hypothesis testing through the application of those properties.
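    As a concrete illustration of these definitions (not part of the development that follows), the short Python sketch below computes the relative information and the relative entropy for a pair of hypothetical discrete distributions; the arrays P and Q are arbitrary choices.

```python
import numpy as np

# Hypothetical probability mass functions on a common 3-letter alphabet.
P = np.array([0.5, 0.3, 0.2])    # plays the role of P_X
Q = np.array([0.25, 0.25, 0.5])  # plays the role of P_Y

# Relative information (in nats) at each atom: log dP/dQ.
rel_info = np.log(P / Q)

# Relative entropy D(P||Q): expectation of the relative information under P.
D = np.sum(P * rel_info)

print("relative information per atom:", rel_info)
print("D(P||Q) =", D, "nats")
```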

    To enhance readability and ease of reference, the rest of this work is organized in one hundred items grouped into eight sections, plus an appendix.

    Section 2 contains most of the terminology and notation used throughout the paper, as well as several auxiliary results used in the sequel. As no restrictions (including absolute continuity) are placed on the pairs of probability measures under purview, the notions of relative support and coefficient of absolute discontinuity prove to be of central importance in the subsequent development.

    Section 3 deals with the fundamental properties of the relative information, including the change of measure formulas without requiring absolute continuity. It also explores properties of Rényi divergence and information density, a special case of relative information whose expectation is the mutual information, and which proves useful in Section 7.

    Section 4 focuses on the interplay of the distributions of the random variables \imath_{X\|Y} (X) and \imath_{X\|Y} (Y) . The key notion of equivalent pairs of probability measures, proposed recently in [24] in the special case of absolutely continuous probability measures, is given here in full generality, along with several necessary and sufficient conditions involving Rényi divergence and f-divergences.

    Section 5 shows various ways to express and bound total variation distance as a function of the relative information spectra, a problem initially undertaken by Le Cam [19].

    A new measure of discrepancy between probability measures, dubbed the NP-divergence, is introduced in Section 6. Although it satisfies the data processing principle, NP-divergence is not an f-divergence. Its main operational role, which justifies its name, is revealed in Section 8.

    Section 7 presents a new notion of sufficient statistics, I-sufficiency, based on equivalent pairs. To put this notion in perspective, Section 7 also includes a discussion of the leading existing definitions of sufficient statistics, such as classical (Fisher) sufficiency, Blackwell sufficiency, and Bayes sufficiency. I-sufficiency is a natural bridge between those notions and criteria based on equality in the data processing inequality for f-divergences. All those notions turn out to be equivalent under the assumption that the data model is dominated and defined on a standard space.

    Section 8 gives a self-contained solution to the non-asymptotic fundamental tradeoff region consisting of the set of achievable conditional error probabilities in non-Bayesian binary hypothesis testing, at a level of detail apparently not available elsewhere. A scalar proxy is often sought to quantify how well probability measures can be distinguished. In non-Bayesian hypothesis testing, the area of the tradeoff region is arguably the most natural scalar measure. Section 8 demonstrates that this area equals one-half of the NP-divergence. This establishes an interesting relationship between the hypothesis testing problems

    \begin{align} H_0\colon \; y \sim {P_{{{\mathtt{0}}}}}, &\qquad H_L\colon \; (y_1, y_2) \sim {P_{{{\mathtt{0}}}}} \times {P_{{{\mathtt{1}}}}}, \\ H_1\colon \; y \sim {P_{{{\mathtt{1}}}}}, &\qquad H_R\colon \; (y_1, y_2) \sim {P_{{{\mathtt{1}}}}} \times {P_{{{\mathtt{0}}}}}. \end{align}

    The area of the fundamental tradeoff region for \{H_0, H_1\} is shown to equal 1 - 2\, \epsilon_{\min}(H_L, H_R) , where \epsilon_{\min}(H_L, H_R) is the minimum (Bayesian) error probability when \{H_L, H_R\} are equally likely. A new asymptotic operational role is found for the Bhattacharyya distance in the setting of independent identically distributed data.

    Section 9 gives a recap of the main new results found in the paper.

    This section introduces basic terminology and notation, along with supporting results used in the remainder of the paper.

    1. {\mathscr{P}_{\!\!\mathcal{{A}}}} denotes the set of probability measures defined on the measurable space (\mathcal{{A}}, \mathscr{F}) .

    2. For P \in {\mathscr{P}_{\!\!\mathcal{{A}}}} , X \sim P means that \mathbb{P}[X \in E] = P(E) for all E \in \mathscr{F} .

    3. A random transformation

    \begin{align} P_{Y|X} \colon (\mathcal{{A}}, \mathscr{F}) \to (\mathcal{{B}}, \mathscr{G}) \end{align} (2.1)

    is a collection \{ P_{Y|X=a} \in {\mathscr{P}_{\!\!\mathcal{{B}}}}, \; a \in \mathcal{{A}} \} of probability measures defined on the measurable space (\mathcal{{B}}, \mathscr{G}) , such that for every B \in \mathscr{G} , f_B \colon \mathcal{{A}} \to [0, 1] given by f_B(a) = P_{Y|X=a}(B) is a Borel \mathscr{F} -measurable function. In the literature, random transformations are also referred to as Markov kernels. The sets \mathcal{{A}} and \mathcal{{B}} are known as the input and output alphabets of the random transformation, respectively. Note that a joint probability measure P_{XY} need not be defined notwithstanding the notation in (2.1). If in addition to the random transformation (2.1), a probability measure P_X \in {\mathscr{P}_{\!\!\mathcal{{A}}}} is defined, then the input/output joint probability measure P_{XY} on (\mathcal{{A}} \times \mathcal{{B}}, \mathscr{F} \otimes \mathscr{G}) is given by

    \begin{align} P_{XY}(\mathsf{A} \times \mathsf{B}) = \int_{\mathsf{A}} P_{Y|X=a}(\mathsf{B}) \, \mathrm{d}P_X(a), \quad \mathsf{A} \times \mathsf{B} \in \mathscr{F} \otimes \mathscr{G}. \end{align} (2.2)

    The marginal output probability measure, P_Y , also known as the response of P_{Y|X} to P_X , is denoted by

    \begin{align} P_X \to P_{Y|X} \to P_Y. \end{align} (2.3)

    4. If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , P \ll Q means that Q dominates P , or alternatively, P is absolutely continuous with respect to Q , i.e., P(A) = 0 for any A \in \mathscr{F} such that Q(A) = 0 . More generally, a collection \mathcal{P} \subset {\mathscr{P}_{\!\!\mathcal{{A}}}} is said to be dominated by Q if P \ll Q for all P \in \mathcal{P} . If Q dominates \mathcal{P} and Q(E) = 0 whenever E \in \mathscr{F} is such that P(E) = 0 for all P \in \mathcal{P} , then Q is said to be equivalent to \mathcal{P} . The same terminology applies to general measures on (\mathcal{{A}}, \mathscr{F}) .

    5. If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and P \ll Q \ll P , then we write P \ll\gg Q and P and Q are said to be mutually absolutely continuous or equivalent.

    6. Lemma 1. [3, Lemma 7] Assume that there exists a \sigma -finite measure on (\mathcal{{A}}, \mathscr{F}) that dominates the collection \mathcal{P} \subset {\mathscr{P}_{\!\!\mathcal{{A}}}} . Then, there exists a probability measure in {\mathscr{P}_{\!\!\mathcal{{A}}}} that is equivalent to the collection \mathcal{P} . In fact, there exists a finite or countably infinite collection \{P_i \in \mathcal{P}, \, i \in \mathcal{I}\} , such that \sum_{i \in \mathcal{I}} \pi_i \, P_i is equivalent to \mathcal{P} for every probability mass function \pi on \mathcal{I} with \pi_i > 0 for all i \in \mathcal{I} .

    In light of Lemma 1, we frequently refer to a collection of probability measures as being dominated, without specifying the dominating measure, which is understood to be either a probability measure or, not more generally, a \sigma -finite measure. Non- \sigma -finite measures are of no interest in this paper. Any finite or countably infinite collection of probability measures is dominated. Examples of undominated collections of probability measures on (\mathbb R, \mathscr{B}) are:

    \{\delta_\theta, \, \theta \in [0, 1]\} , with \delta_\theta the point mass on (\mathbb R, \mathscr{B}) that assigns probability one to \{\theta\} .

    \{P \colon P(B) = P(\{\omega \colon -\omega \in B\}), \; \mbox{for all} \; B \in \mathscr{B}\} .

    Despite a contrary claim in [25], undominated collections are the exception rather than the rule in the applications commonly encountered in statistical inference and information theory.

    7. If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , P and Q are said to be mutually singular or orthogonal, denoted P \perp Q , if there exists an event F \in \mathscr{F} with P(F) = 0 and Q(F) = 1 .

    8. If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , the coefficient of absolute discontinuity of P relative to Q is defined as

    \begin{align} \Pi(P \,\|\, Q) = \min\limits_{A \in \mathscr{F} \colon Q(A) = 1} P(A). \end{align} (2.4)

    Note that

    \begin{align} P \ll Q &\Longleftrightarrow \Pi(P\,\|\,Q) = 1, \end{align} (2.5)
    \begin{align} P \perp Q &\Longleftrightarrow \Pi(P\,\|\,Q) = 0 \Longleftrightarrow \Pi(Q\,\|\,P) = 0. \end{align} (2.6)

    9. If P \in {\mathscr{P}_{\!\!\mathcal{{A}}}} and Q \in {\mathscr{P}_{\!\!\mathcal{{B}}}} (the set of probability measures defined on the measurable space (\mathcal{{B}}, \mathscr{G}) ), then P \times Q denotes the product measure on the measurable space (\mathcal{{A}} \times \mathcal{{B}}, \mathscr{F} \otimes \mathscr{G}) . The coefficient of absolute discontinuity for product probability measures is

    \begin{align} \Pi({P_{{{\mathtt{1}}}}} \times {Q_{{{\mathtt{1}}}}} \,\|\, {P_{{{\mathtt{0}}}}} \times {Q_{{{\mathtt{0}}}}}) & = \min\limits_{A \in \mathscr{F} \otimes \mathscr{G} \colon [{P_{{{\mathtt{0}}}}} \times {Q_{{{\mathtt{0}}}}}](A) = 1} [{P_{{{\mathtt{1}}}}} \times {Q_{{{\mathtt{1}}}}}](A) \end{align} (2.7)
    \begin{align} & \leq \min\limits_{F \in \mathscr{F} \colon {P_{{{\mathtt{0}}}}}(F) = 1} {P_{{{\mathtt{1}}}}}(F) \; \min\limits_{G \in \mathscr{G} \colon {Q_{{{\mathtt{0}}}}}(G) = 1} {Q_{{{\mathtt{1}}}}}(G) \end{align} (2.8)
    \begin{align} & = \Pi({P_{{{\mathtt{1}}}}} \,\|\, {P_{{{\mathtt{0}}}}}) \, \Pi({Q_{{{\mathtt{1}}}}} \,\|\, {Q_{{{\mathtt{0}}}}}). \end{align} (2.9)

    In fact, equality holds in (2.8) since we can lower bound the left side by replacing [{P_{{{\mathtt{1}}}}} \times {Q_{{{\mathtt{1}}}}}](A) with [{P_{{{\mathtt{1}}}}} \times {Q_{{{\mathtt{1}}}}}](F \times G) for any F \times G \subseteq A .

    10. If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and P \ll Q , then the Radon-Nikodym derivative (or density) of P with respect to Q is the Borel \mathscr{F} -measurable nonnegative function

    \begin{align} \frac{\mathrm{d}P}{\mathrm{d}Q} \colon \mathcal{{A}} \to [0, \infty) \end{align} (2.10)

    such that any nonnegative Borel \mathscr{F} -measurable function f \colon \mathcal{{A}} \to [0, \infty) satisfies the change of measure formula

    \begin{align} \mathbb{E}[f(X)] = \mathbb{E}\left[ \frac{\mathrm{d}P}{\mathrm{d}Q}(Y) \, f(Y) \right], \quad X \sim P, \; Y \sim Q. \end{align} (2.11)

    11. If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , define, up to an event of zero (P+Q) -measure, the support of P relative to Q as

    \begin{align} \mathcal{{S}}_{P\|Q} = \left\{ a \in \mathcal{{A}} \colon \frac{\mathrm{d}P}{\mathrm{d}\mu}(a) > 0 \right\} \in \mathscr{F}, \end{align} (2.12)

    where \mu is any measure that dominates \{P, Q\} , such as P + Q .

    Lemma 2. For any (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 ,

    \begin{align} P(\mathcal{{S}}_{P\|Q}) & = 1, \end{align} (2.13)
    \begin{align} Q(\mathcal{{S}}_{P\|Q}) & = Q(\mathcal{{S}}_{P\|Q} \cap \mathcal{{S}}_{Q\|P}) = \min\limits_{A \in \mathscr{F} \colon P(A) = 1} Q(A) = \Pi(Q\,\|\,P). \end{align} (2.14)

    Proof. To verify (2.13), simply note that P(\mathcal{{S}}_{P\|Q}^c) = \int 1\{a \notin \mathcal{{S}}_{P\|Q}\} \frac{\mathrm{d}P}{\mathrm{d}\mu}(a) \, \mathrm{d}\mu(a) = 0 for any \mu \gg \{P, Q\} . To justify (2.14), we need to show that if A \in \mathscr{F} with Q(A) < Q(\mathcal{{S}}_{P\|Q}) , then P(A) < 1 . If G \in \mathscr{F} is such that G \subseteq \mathcal{{S}}_{P\|Q} and Q(G) > 0 , then

    \begin{align} P(G) = \int_G \frac{\mathrm{d}P}{\mathrm{d}\mu}(a) \, \mathrm{d}\mu(a) > 0, \end{align} (2.15)

    because \mu(G) > 0 and \frac{\mathrm{d}P}{\mathrm{d}\mu}(a) > 0 if a \in G . Since Q(\mathcal{{S}}_{P\|Q} \cap A^c) \geq Q(\mathcal{{S}}_{P\|Q}) - Q(A) > 0 , we obtain 0 < P(\mathcal{{S}}_{P\|Q} \cap A^c) = P(A^c) .

    12. Apart from their essential contribution in our framework, the concepts in Items 8 and 11 merit broader popularity in probability theory. For example, they lead to an elementary constructive proof (cf. the standard proof in [26, p. 135]) of the Lebesgue decomposition theorem: If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , there exist probability measures ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 such that {P_{{{\mathtt{1}}}}} \ll Q , {P_{{{\mathtt{0}}}}} \perp Q , and P is the mixture

    \begin{align} P = \lambda \, {P_{{{\mathtt{1}}}}} + (1 - \lambda) {P_{{{\mathtt{0}}}}}, \end{align} (2.16)

    for some \lambda \in [0, 1] . First, observe that the constructions for the cases P \ll Q and P \perp Q are trivial: If P \ll Q , then ({P_{{{\mathtt{1}}}}}, \lambda) = (P, 1) ; if P \perp Q , then ({P_{{{\mathtt{0}}}}}, \lambda) = (P, 0) . In the nontrivial case P \not\ll Q and P \not\perp Q , we have

    \begin{align} 0 < \Pi(P\,\|\,Q) = P(\mathcal{{S}}_{Q\|P}) < 1, \end{align} (2.17)

    and the law of total probability yields, for any A \in \mathscr{F} ,

    \begin{align} P(A) = P(A \,|\, \mathcal{{S}}_{Q\|P}) \, P(\mathcal{{S}}_{Q\|P}) + P(A \,|\, \mathcal{{S}}_{Q\|P}^c) \, P(\mathcal{{S}}_{Q\|P}^c). \end{align} (2.18)

    So in (2.16), we have \lambda = \Pi(P\,\|\,Q) , {P_{{{\mathtt{1}}}}} = P(\,\cdot \,|\, \mathcal{{S}}_{Q\|P}) \ll Q , and {P_{{{\mathtt{0}}}}} = P(\,\cdot \,|\, \mathcal{{S}}_{Q\|P}^c) \perp Q .
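    For a finite alphabet, the construction in Item 12 amounts to splitting P according to whether Q charges each atom. The following Python sketch (an illustration under that discrete assumption; the particular arrays are hypothetical) carries out the decomposition and checks (2.16).

```python
import numpy as np

# Hypothetical PMFs on a 4-letter alphabet; Q puts no mass on the last two letters.
P = np.array([0.4, 0.3, 0.2, 0.1])
Q = np.array([0.5, 0.5, 0.0, 0.0])

support_Q = Q > 0                    # S_{Q||P} (up to (P+Q)-null sets)
lam = P[support_Q].sum()             # lambda = Pi(P||Q) = P(S_{Q||P})

# Conditional measures: P1 = P( . | S_{Q||P}) << Q,  P0 = P( . | S_{Q||P}^c) _|_ Q.
P1 = np.where(support_Q, P, 0.0) / lam
P0 = np.where(~support_Q, P, 0.0) / (1.0 - lam)

# Mixture representation (2.16): P = lambda*P1 + (1-lambda)*P0.
assert np.allclose(P, lam * P1 + (1.0 - lam) * P0)
print("lambda =", lam)               # 0.7 for these arrays
```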

    13. The moment generating function and cumulant generating function of a [-\infty, +\infty) -valued random variable U are defined, respectively, by

    \begin{align} \rm{M}_U(t) & = \mathbb{E}\left[ \mathrm{e}^{t U} \right], \end{align} (2.19)
    \begin{align} \Lambda_U(t) & = \log_{\mathrm{e}} \rm{M}_U(t). \end{align} (2.20)

    Note that \lim_{t \downarrow 0} \Lambda_U(t) is either infinite or equal to \log \mathbb{P}[U > -\infty] . If there exists t_0 > 0 such that \Lambda_U(t) = \Lambda_V(t) < \infty for t \in (0, t_0) , then U and V have identical distributions (e.g., [27, p. 337]). Since \Lambda_{-U}(t) = \Lambda_U(-t) , U and V have identical distributions if \Lambda_U(t) = \Lambda_V(t) < \infty for t \in (-t_0, 0) .

    14. If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and P \ll Q , then the relative information of P with respect to Q is the Borel \mathscr{F} -measurable function

    \begin{align} \imath_{P\|Q}(a) = \log \frac{\mathrm{d}P}{\mathrm{d}Q}(a) \in [-\infty, \infty). \end{align} (3.1)

    More generally, without requiring P \ll Q , if \rho is a probability (or \sigma -finite) measure which dominates \{P, Q\} , and the respective densities are denoted by p = \frac{\mathrm{d}P}{\mathrm{d}\rho} and q = \frac{\mathrm{d}Q}{\mathrm{d}\rho} , then the (generalized) relative information is defined as

    \imath_{P\|Q}(a) = \log \frac{p(a)}{q(a)} = \begin{cases} +\infty, & a \in \mathcal{{S}}_{P\|Q} \cap \mathcal{{S}}_{Q\|P}^c; \\ \imath_{P\|\rho}(a) - \imath_{Q\|\rho}(a) \in \mathbb R, & a \in \mathcal{{S}}_{P\|Q} \cap \mathcal{{S}}_{Q\|P}; \\ -\infty, & a \in \mathcal{{S}}_{P\|Q}^c \cap \mathcal{{S}}_{Q\|P}. \end{cases} (3.2)

    If a \notin \mathcal{{S}}_{P\|Q} \cup \mathcal{{S}}_{Q\|P} , it is immaterial how to define \imath_{P\|Q}(a) . Therefore, any identity involving relative informations (or densities) is to be understood almost surely with respect to any measure dominating both probability measures. It follows from (3.2) that relative information satisfies the skew-symmetry property

    \begin{align} \imath_{P\|Q}(a) = - \imath_{Q\|P}(a), \quad a \in \mathcal{{A}}. \end{align} (3.3)

    In the discrete case, i.e., \mathcal{{A}} is finite or countably infinite, if P(a) + Q(a) > 0 , then

    \begin{align} \imath_{P\|Q}(a) = \imath_Q(a) - \imath_P(a), \end{align} (3.4)

    where the (absolute) information is \imath_P(a) = \log \frac1{P(a)} . The base of the logarithms in (3.1) and (3.2) determines the units of the relative information. Unless specifically indicated, it can be chosen by the reader. If the chosen base is b > 1 , then \exp(t) = b^t . If b = \mathrm{e} [resp., 2 ], the unit is called nat [resp., bit]. By convention, \exp(-\infty) = 0 and \log 0 = -\infty . The generalized relative information in bits is equal to \imath_{P\|Q}(a) = \upsilon\left( \imath_{P\|R}(a) \right) where R = \tfrac12 P + \tfrac12 Q , \imath_{P\|R}(a) is also in bits, and \upsilon \colon [-\infty, 1] \to [-\infty, +\infty] is \upsilon(t) = t - \log_2\left(2 - 2^t\right) .

    15. If a \in \mathcal{{A}} , (P, Q, R) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^3 and R dominates \{P, Q\} , then (3.2) implies the chain rule

    \begin{align} \imath_{P\|R}(a) - \imath_{Q\|R}(a) = \imath_{P\|Q}(a). \end{align} (3.5)

    16. Often (recall Item 2) we denote X \sim P_X and Y \sim P_Y , in which case we abbreviate \imath_{P_X\|P_Y} as \imath_{X\|Y} . The same convention applies to the coefficient of absolute discontinuity in Item 8 and the relative support in Item 11, as well as to relative entropy and other information measures (except total variation distance) considered in the remainder of the paper. This notational convention was popularized by [28] in the context of the entropy function.

    17. It follows from (2.13), (2.14), and (3.2) that

    \begin{align} \mathbb{P}[\imath_{X\|Y} (X) = +\infty] & = P_X(\mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X}^c) = 1 - \Pi (X\,\|\,Y), \end{align} (3.6)
    \begin{align} \mathbb{P}[\imath_{X\|Y} (X) = -\infty] & = P_X(\mathcal{{S}}_{Y\|X} \cap \mathcal{{S}}_{X\|Y}^c) = 0, \end{align} (3.7)
    \begin{align} \mathbb{P}[\imath_{X\|Y} (Y) = +\infty] & = P_Y(\mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X}^c) = 0, \end{align} (3.8)
    \begin{align} \mathbb{P}[\imath_{X\|Y} (Y) = -\infty] & = P_Y(\mathcal{{S}}_{Y\|X} \cap \mathcal{{S}}_{X\|Y}^c) = 1 - \Pi (Y\,\|\,X). \end{align} (3.9)

    18. Change of measure. Without the assumption of absolute continuity, the basic change of measure formula (2.11) needs to be modified as follows.

    Lemma 3. For any nonnegative Borel measurable f \colon \mathcal{{A}} \to [0, \infty) ,

    \begin{align} \mathbb{E}\left[ f(Y) \exp\left( \imath_{X\|Y} (Y) \right) \right] & = \mathbb{E}\left[ f(X) \, 1\{ \imath_{X\|Y} (X) \in \mathbb R \} \right] \end{align} (3.10)
    \begin{align} & = \mathbb{E}\left[ f(X) \, 1\{ \imath_{X\|Y} (X) < \infty \} \right] \end{align} (3.11)
    \begin{align} & = \mathbb{E}[f(X)], \quad \mbox{if} \;\; P_X \ll P_Y; \end{align} (3.12)
    \begin{align} \mathbb{E}\left[ f(X) \exp\left( - \imath_{X\|Y} (X) \right) \right] & = \mathbb{E}\left[ f(Y) \, 1\{ \imath_{X\|Y} (Y) \in \mathbb R \} \right] \end{align} (3.13)
    \begin{align} & = \mathbb{E}\left[ f(Y) \, 1\{ \imath_{X\|Y} (Y) > -\infty \} \right] \end{align} (3.14)
    \begin{align} & = \mathbb{E}[f(Y)], \quad \mbox{if} \;\; P_Y \ll P_X, \end{align} (3.15)

    regardless of whether the expectations are finite or +\infty . More generally, if f \colon \mathcal{{A}} \to \mathbb R , the random variable on the left side of (3.10) [resp., (3.13)] is integrable if and only if so is the random variable on the right side, in which case (3.10) [resp., (3.13)] holds.

    Proof. Identities (3.11) and (3.14) follow from (3.7) and (3.8), respectively. Suppose that \rho dominates \{P_X, P_Y\} and the respective densities are denoted by p_X and p_Y . The random variable in the expectation on the left side of (3.10) is equal to zero if Y \notin \mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X} . Therefore, the left side of (3.10) equals

    \begin{align} \mathbb{E}\left[ f(Y) \exp\left( \imath_{X\|Y} (Y) \right) 1\{ Y \in \mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X} \} \right] = \int_{\mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X}} f(\omega) \, \frac{p_X(\omega)}{p_Y(\omega)} \, p_Y(\omega) \, \mathrm{d}\rho(\omega) = \mathbb{E}\left[ f(X) \, 1\{ \imath_{X\|Y} (X) \in \mathbb R \} \right], \end{align} (3.16)

    where we have used (3.2). If P_X \ll P_Y , then \mathbb{P}[\imath_{X\|Y} (X) \in \mathbb R] = 1 , yielding (3.12). Alternatively, we can prove it directly from the change of measure formula (2.11). To show the claimed result for f \colon \mathcal{{A}} \to \mathbb R , we decompose f = [f]^+ - [-f]^+ , and the desired result follows from the definition of expectation \mathbb{E}[f(V)] = \mathbb{E}[f^+(V)] - \mathbb{E}[f^-(V)] for any V once we apply (3.10) to both [f]^+ and [-f]^+ . Swapping X \leftrightarrow Y and recalling (3.3) results in (3.13)–(3.15).

    Corollary 1. For any \beta > 0 , and nonnegative measurable function g \colon \mathcal{{A}} \to [0, \infty) ,

    \begin{align} \beta \, \mathbb{E}\left[ g(Y) \, 1\{ \imath_{X\|Y} (Y) \geq \log \beta \} \right] & \leq \mathbb{E}\left[ g(X) \, 1\{ \log \beta \leq \imath_{X\|Y} (X) < \infty \} \right], \end{align} (3.17)
    \begin{align} \mathbb{E}\left[ g(X) \, 1\{ \imath_{X\|Y} (X) \leq \log \beta \} \right] & \leq \beta \, \mathbb{E}\left[ g(Y) \, 1\{ -\infty < \imath_{X\|Y} (Y) \leq \log \beta \} \right]. \end{align} (3.18)

    Proof.

    ● (3.17) \Longleftarrow (3.10) with f(a) = g(a) \, 1\{ \log \beta \leq \imath_{X\|Y} (a) \} .

    ● (3.18) \Longleftarrow (3.10) with f(a) = g(a) \, 1\{ -\infty < \imath_{X\|Y} (a) \leq \log \beta \} .
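    The identities of Lemma 3 are easy to sanity-check numerically. The Python sketch below (illustrative only; the PMFs and the test function f are hypothetical) verifies (3.10) for a pair of discrete distributions in which P_X is not absolutely continuous with respect to P_Y .

```python
import numpy as np

# Hypothetical PMFs; X puts mass on an atom (index 3) where Y does not.
PX = np.array([0.2, 0.3, 0.4, 0.1])
PY = np.array([0.5, 0.3, 0.2, 0.0])
f  = np.array([1.0, 2.0, 0.5, 3.0])     # arbitrary nonnegative test function

with np.errstate(divide="ignore"):
    rel_info = np.log(PX) - np.log(PY)  # i_{X||Y}; equals +inf on the 4th atom

finite = np.isfinite(rel_info)

# Left side of (3.10): E[ f(Y) exp(i_{X||Y}(Y)) ]  (the +inf atom has PY-probability 0).
lhs = np.sum(PY[finite] * f[finite] * np.exp(rel_info[finite]))

# Right side of (3.10): E[ f(X) 1{ i_{X||Y}(X) finite } ].
rhs = np.sum(PX[finite] * f[finite])

print(lhs, rhs)            # both equal 1.0 here
assert np.isclose(lhs, rhs)
```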

    19. Invariance to labeling.

    Lemma 4. Fix measurable spaces (A,F) and (B,G), and let the (F,G)-measurable function f:AB be injective. Then,

    ıf(X)f(Y)(f(a))=ıXY(a),aA. (3.19)

    Conversely, if the (F,G)-measurable function g:AB is such that ıXY(a) depends on aA only through g(a), then

    ıXY(a)=ıg(X)g(Y)(g(a)). (3.20)

    Proof. Suppose that PZ dominates {PX,PY}. Then,

    PX(A)=P[f(X)f(A)] (3.21)
    =E[1{f(Z)f(A)}ϱ(Z)] (3.22)
    =E[1{ZA}ϱ(Z)], (3.23)

    where

    ● (3.21) and (3.23) 1{aA}=1{f(a)f(A)} for any AF f is injective.

    ● (3.22) Lemma 3 with ϱ(a)=exp(ıf(X)f(Z)(f(a))).

    Therefore, ϱ(a)=dPXdPZ(a). In particular, SXZ={aA:f(a)Sf(X)f(Z)}. If PXPY, we are done since we can just let PZ=PY. Otherwise, we can follow the same reasoning with XY, and (3.19) follows from (3.2). To show the converse part, assume for now that PXPY. Suppose that dPXdPY(a)=ψ(g(a)). Then, again invoking Lemma 3, we get

    Pg(X)(B)=E[1{g(X)B}]=E[1{g(Y)B}ψ(g(Y))],BG, (3.24)

    and, consequently, dPg(X)dPg(Y)(t)=ψ(t), which implies (3.20). Without assuming PXPY, note that ıXY(a)= implies aScYXg1(Scg(Y)g(X)) and (3.24) continues to hold if BSg(Y)g(X).

    20. Relative information of relative informations.

    Lemma 5. Let (PX,PY)PA2. Define the extended-valued random variables W=ıXY(X) and Z=ıXY(Y). Then, using the same units for all three relative informations,

    ıWZ(x)=x,x[,]. (3.25)

    Proof. Since the relative information need not be an injective function, we cannot invoke Lemma 4. The probability of a Borel set AB, AR can be expressed as

    PW(A)=P[ıXY(X)A] (3.26)
    =E[1{ıXY(Y)A}exp(ıXY(Y))] (3.27)
    =E[1{ZA}exp(Z)], (3.28)

    where (3.27) follows from (3.10). Hence, we are free to choose dPWdPZ(a)=exp(a) for all aR. If Π(XY)=1=Π(YX), i.e., PX≪≫PY, then SWZSZW=R and it is immaterial how to define ıWZ() and ıWZ(). If Π(XY)<1=Π(YX), then Item 17 implies

    P[W=]>0=P[Y=], (3.29)
    P[W=]=0=P[Y=], (3.30)

    so ıWZ()= and it is immaterial how to define ıWZ(). The same reasoning shows that if Π(XY)=1>Π(YX), then ıWZ()= and it is immaterial how to define ıWZ(). If Π(XY)<1 and Π(YX)<1, then Item 17 gives

    P[W=]>0=P[Y=], (3.31)
    P[W=]=0<P[Y=], (3.32)

    which implies that ıWZ()= and ıWZ()=.

    21. If (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , D(P\,\|\,Q) stands for the relative entropy (or Kullback-Leibler divergence [2]), which, with the convention in Item 16, satisfies

    \begin{align} \mathbb{E}[\imath_{X\|Y} (X)] & = D(X\,\|\,Y), \end{align} (3.33)
    \begin{align} \mathbb{E}[\imath_{X\|Y} (Y)] & = -D(Y\,\|\,X). \end{align} (3.34)

    The binary relative entropy function is the continuous extension to the domain [0, 1]^2 of the function d(p\,\|\,q) = p \log \frac{p}{q} + (1 - p) \log \frac{1-p}{1-q} . The data processing lemma for relative entropy implies

    \begin{align} D(P\,\|\,Q) \geq \max\limits_{A \in \mathscr{F}} d(P(A)\,\|\,Q(A)) \geq \max\limits_{A \in \mathscr{F} \colon P(A) = 1} d(1\,\|\,Q(A)) = \log \frac1{\Pi(Q\,\|\,P)}. \end{align} (3.35)

    22. If P_X \ll P_Y , the change of measure formula (2.11) implies

    \begin{align} \mathbb{E}\left[ \exp\left( \imath_{X\|Y} (Y) \right) \right] = 1. \end{align} (3.36)

    Without assuming P_X \ll P_Y , we have

    \begin{align} \mathbb{E}\left[ \exp\left( \imath_{X\|Y} (Y) \right) \right] & = \Pi (X\,\|\,Y), \end{align} (3.37)
    \begin{align} \mathbb{E}\left[ \exp\left( - \imath_{X\|Y} (X) \right) \right] & = \Pi (Y\,\|\,X). \end{align} (3.38)

    To verify (3.37), we let f(a) = 1 in (3.11) and recall (3.6). Swapping X \leftrightarrow Y in (3.37) yields (3.38) in light of (3.3). The \chi^2 -divergence introduced by Pearson in [16] is

    \begin{align} \chi^2 (X\,\|\,Y) & = \mathrm{Var}\left[ \exp\left( \imath_{X\|Y} (Y) \right) \right] \end{align} (3.39)
    \begin{align} & = \mathbb{E}\left[ \exp\left( \imath_{X\|Y} (X) \right) \right] - 1, \end{align} (3.40)

    where (3.40) holds if P_X \ll P_Y ; otherwise, \chi^2 (X\,\|\,Y) = \infty .
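    As a quick illustration of (3.37), (3.38), and the caveat in (3.40), the Python sketch below (hypothetical PMFs, chosen so that neither measure dominates the other) evaluates the coefficients of absolute discontinuity directly from these expectations.

```python
import numpy as np

# Hypothetical PMFs on a 4-letter alphabet; neither dominates the other.
PX = np.array([0.4, 0.4, 0.2, 0.0])
PY = np.array([0.3, 0.3, 0.0, 0.4])

both = (PX > 0) & (PY > 0)          # S_{X||Y} intersected with S_{Y||X}

# (3.37): Pi(X||Y) = E[ exp(i_{X||Y}(Y)) ]  (the integrand vanishes off the common support).
Pi_XY = np.sum(PY[both] * PX[both] / PY[both])
# (3.38): Pi(Y||X) = E[ exp(-i_{X||Y}(X)) ].
Pi_YX = np.sum(PX[both] * PY[both] / PX[both])

print("Pi(X||Y) =", Pi_XY)          # = P_X mass on {P_Y > 0} = 0.8
print("Pi(Y||X) =", Pi_YX)          # = P_Y mass on {P_X > 0} = 0.6

# (3.40) requires P_X << P_Y; here that fails, so chi^2(X||Y) = infinity.
chi2 = np.inf if np.any((PX > 0) & (PY == 0)) else np.sum(PX ** 2 / PY) - 1.0
print("chi^2(X||Y) =", chi2)
```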

    23. For (P_X, P_Y) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and \alpha \in (0, 1) \cup (1, \infty) , the \alpha -order Rényi divergence [13] is

    \begin{align} D_\alpha (X\,\|\,Y) & = \frac1{\alpha - 1} \log \mathbb{E}\left[ \exp\left( (\alpha - 1) \, \imath_{X\|Y} (X) \right) \right] \end{align} (3.41)
    \begin{align} & = \frac1{\alpha - 1} \log \mathbb{E}\left[ \exp\left( \alpha \, \imath_{X\|Y} (Y) \right) \right], \end{align} (3.42)

    where (3.42) holds for \alpha \in (0, 1) . Equivalently, if Z \sim R , with R a probability measure that dominates \{P, Q\} , then

    \begin{align} D_\alpha (P\,\|\,Q) = \frac1{\alpha - 1} \log \mathbb{E}\left[ \exp\left( \alpha \, \imath_{P\|R} (Z) + (1 - \alpha) \, \imath_{Q\|R} (Z) \right) \right], \end{align} (3.43)

    which, using the generalized relative information (3.2), also holds without requiring that R \gg \{P, Q\} . In addition, define

    \begin{align} D_1 (X\,\|\,Y) & = D(X\,\|\,Y), \end{align} (3.44)
    \begin{align} D_\infty (X\,\|\,Y) & = \inf\{ v \in \mathbb R \colon \mathbb{P}[\imath_{X\|Y} (X) \leq v] = 1 \} = \lim\limits_{\alpha \to \infty} D_\alpha (X\,\|\,Y). \end{align} (3.45)

    Along with (3.10) specialized to f(a) = \exp\left( (\alpha - 1) \, \imath_{X\|Y} (a) \right) , the skew-symmetry of relative information (3.3) results in the skew-symmetry of Rényi divergence

    \begin{align} (1 - \alpha) \, D_\alpha (X\,\|\,Y) = \alpha \, D_{1-\alpha} (Y\,\|\,X), \quad \alpha \in (0, 1). \end{align} (3.46)

    The coefficients of absolute discontinuity can be obtained from the Rényi divergence by means of

    \begin{align} \lim\limits_{\alpha \downarrow 0} D_\alpha (X\,\|\,Y) & = \log \frac1{\Pi (Y\,\|\,X)}, \end{align} (3.47)
    \begin{align} \lim\limits_{\alpha \downarrow 0} \alpha \, D_{1-\alpha} (X\,\|\,Y) & = \log \frac1{\Pi (X\,\|\,Y)}, \end{align} (3.48)

    where (3.48) follows from (3.46) and (3.47). To show (3.47), note that

    \begin{align} \exp(\alpha z - z) \leq (1 - \alpha) \exp(-z) + \alpha, \quad (\alpha, z) \in [0, 1] \times (-\infty, +\infty], \end{align} (3.49)

    so the dominated convergence theorem implies that

    \begin{align} \lim\limits_{\alpha \downarrow 0} \mathbb{E}\left[ \exp\left( (\alpha - 1) \, \imath_{X\|Y} (X) \right) \right] = \mathbb{E}\left[ \exp\left( - \imath_{X\|Y} (X) \right) \right], \end{align} (3.50)

    which is equivalent to (3.47) in view of (3.38) and (3.41).

    24. Section 8 shows a new operational role for the Bhattacharyya distance [29],

    \begin{align} B(P\,\|\,Q) = \tfrac12 D_{\frac12} (P\,\|\,Q) = \log \frac1{\int \sqrt{p \, q} \; \mathrm{d}\mu}, \end{align} (3.51)

    where p = \frac{\mathrm{d}P}{\mathrm{d}\mu} , q = \frac{\mathrm{d}Q}{\mathrm{d}\mu} , and \{P, Q\} \ll \mu .
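    For Gaussian distributions the integral in (3.51) can be evaluated both numerically and in closed form, which gives a convenient cross-check. The Python sketch below (illustrative; the parameter values are arbitrary, and the equal-variance closed form (\mu_1 - \mu_0)^2 / (8\sigma^2) is a standard fact rather than a formula taken from this paper) computes B(P\,\|\,Q) for \mathcal{N}(\mu_1, \sigma^2) versus \mathcal{N}(\mu_0, \sigma^2) .

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, mu0, sigma = 1.0, 0.0, 2.0      # arbitrary illustrative parameters

# Bhattacharyya distance via (3.51): B = -log ( integral of sqrt(p q) ).
integrand = lambda x: np.sqrt(norm.pdf(x, mu1, sigma) * norm.pdf(x, mu0, sigma))
bc, _ = quad(integrand, -np.inf, np.inf)
B_numeric = -np.log(bc)

# Closed form for equal variances: B = (mu1 - mu0)^2 / (8 sigma^2).
B_closed = (mu1 - mu0) ** 2 / (8 * sigma ** 2)

print(B_numeric, B_closed)           # both approximately 0.03125 nats
assert np.isclose(B_numeric, B_closed, rtol=1e-6)
```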

    25. A couple of properties of Rényi divergence used in the sequel are (e.g., [30]):

    (a) If \alpha \geq 1 , then

    \begin{align} P \not\ll Q \Longrightarrow D_\alpha (P\,\|\,Q) = \infty. \end{align} (3.52)

    (b) The following circular implications hold:

    \begin{align} \exists \alpha_{{{\mathtt{0}}}} \in (0,1) \;\; \mathit{s.t.} \;\; D_{\alpha_{{{\mathtt{0}}}}} (P\,\|\,Q) = \infty \;\; \overset{(3.53)}{\Longrightarrow} \;\; P \perp Q \;\; \overset{(3.54)}{\Longrightarrow} \;\; D_\alpha (P\,\|\,Q) = \infty, \quad \alpha \in (0, \infty]. \end{align}

    26. Given P_X \in {\mathscr{P}_{\!\!\mathcal{{A}}}} and a random transformation P_{Y|X} \colon (\mathcal{{A}}, \mathscr{F}) \to (\mathcal{{B}}, \mathscr{G}) , the following special case of relative information is known as the information density,

    \begin{align} \imath_{X;Y}(x; y) = \imath_{P_{XY}\|P_X \times P_Y}(x, y) = \imath_{P_{Y|X=x}\|P_Y}(y), \end{align} (3.55)

    where P_X \to P_{Y|X} \to P_Y . We use the same notation in non-Bayesian settings in which P_X need not be defined and P_Y \in {\mathscr{P}_{\!\!\mathcal{{B}}}} on the rightmost term in (3.55) is an arbitrary unconditional probability measure. For future use, we observe that the information density satisfies the chain rule

    \begin{align} \imath_{XZ;Y}(a, c; b) = \imath_{X;Y}(a; b) + \imath_{Y;Z|X}(b; c\,|\,a). \end{align} (3.56)

    Note that mutual information is I(X;Y) = \mathbb{E}[\imath_{X;Y}(X;Y)] , with (X, Y) \sim P_X P_{Y|X} .

    27. Following [31], whenever PXYPXPY, the dependence between X and Y is said to be regular. The following result gives sufficient conditions for regularity.

    Lemma 6. Fix PXPA and a random transformation PY|X:(A,F)(B,G).

    With PXPY|XPY, the following hold.

    PXYPXPY(3.57)A0F:PX(A0)=1and{PYX=x,xA0}isdominatedbyPY(3.58){PYX=x,xA}isdominated.

    Proof. To show (3.57) by contraposition, let DFG be such that

    (PXPY)(D)=0<PXY(D), (3.59)

    and denote

    f(x,y)=1{(x,y)D}, (3.60)
    Dx={yB:(x,y)D}G. (3.61)

    The function PY|X=x(Dx)=E[f(X,Y)|X=x] is Borel F-measurable with mean

    PY|X=x(Dx)dPX(x)=PXY(D)>0. (3.62)

    Therefore, there exists A1F with PX(A1)>0 and PY|X=x(Dx)>0 for all xA1. Likewise, the expectation of the Borel F-measurable PY(Dx)=E[f(x,Y)], with YPY is simply

    PY(Dx)dPX(x)=(PXPY)(D)=0. (3.63)

    Consequently, there exists A2F with PX(A2)=1 and PY(Dx)=0 for all xA2. We conclude that PY(Dx)=0<PY|X=x(Dx) and, thus, PY|X=x≪̸PY, for all xA1A2 and PX(A1A2)>0. To show (3.58), we assume without loss of generality that {PY|X=xxA}{PY} is dominated by QYPA. Denote by pY|X=x and pY the corresponding densities of PY|X=x and PY with respect to QY, and define the Borel (FG)-measurable function f(x,y)=1{pY(y)=0}pY|X=x(y). If (X,Y,ˆY)PXYQY, then

    0=P[pY(Y)=0]=E[f(X,ˆY)] (3.64)
    =E[f1(X)], (3.65)

    where f1(x)=E[f(x,ˆY)]=1{pY(y)=0}dPY|X=x(y) and (3.65) follows from Fubini's theorem. The proof of (3.58) is complete since f1(x)=0 if and only if PY|X=xPY.

    In Items 21 and 22, we saw that the expectations of the random variables \imath_{X\|Y} (X) and \imath_{X\|Y} (Y) , as well as their exponentials, are well-known quantities in probability theory. We now consider the cumulative distribution functions of those [-\infty, +\infty] -valued random variables, which, unlike relative entropy, \chi^2 -divergence, or total variation distance, capture everything that serves to distinguish the probability measures P_X and P_Y .

    28. Definition 1. The relative information spectra of probability measures (P_X, P_Y) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 are the cumulative distribution functions of the relative information evaluated at X and Y , respectively,

    \begin{align} \mathbb{F}_{X\|Y} (\alpha) & = \mathbb{P}[\imath_{X\|Y} (X) \leq \alpha], \quad \alpha \in \mathbb R, \end{align} (4.1)
    \begin{align} \bar{\mathbb{F}}_{X\|Y} (\alpha) & = \mathbb{P}[\imath_{X\|Y} (Y) \leq \alpha], \quad \alpha \in \mathbb R. \end{align} (4.2)

    The arguments of \mathbb{F}_{X\|Y} and \bar{\mathbb{F}}_{X\|Y} have units inherited from the units of the relative information. In [22], the relative information spectra are referred to as divergence spectra.

    29. If X and Y are discrete random variables, so are \imath_{X\|Y} (X) and \imath_{X\|Y} (Y) . If X and Y are absolutely continuous random variables with probability density functions f_X and f_Y , then

    \begin{align} \mathbb{F}_{X\|Y} (\alpha) & = \mathbb{P}[f_X(X) \leq \exp(\alpha) \, f_Y(X)], \end{align} (4.3)
    \begin{align} \bar{\mathbb{F}}_{X\|Y} (\alpha) & = \mathbb{P}[f_X(Y) \leq \exp(\alpha) \, f_Y(Y)], \end{align} (4.4)

    which need not be continuous.
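    In the discrete case the two spectra of Definition 1 are step functions that can be tabulated directly. The Python sketch below (hypothetical PMFs with a common support, so that \imath_{X\|Y} is real-valued) evaluates \mathbb{F}_{X\|Y} and \bar{\mathbb{F}}_{X\|Y} on a small grid.

```python
import numpy as np

# Hypothetical PMFs with a common support.
PX = np.array([0.5, 0.3, 0.2])
PY = np.array([0.2, 0.3, 0.5])

rel_info = np.log(PX / PY)                     # i_{X||Y} at each atom (nats)

def F(alpha):      # F_{X||Y}(alpha) = P[ i_{X||Y}(X) <= alpha ]
    return PX[rel_info <= alpha].sum()

def F_bar(alpha):  # Fbar_{X||Y}(alpha) = P[ i_{X||Y}(Y) <= alpha ]
    return PY[rel_info <= alpha].sum()

for a in (-1.0, 0.0, 1.0):
    print(f"alpha={a:+.1f}  F={F(a):.2f}  F_bar={F_bar(a):.2f}")
```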

    30. Due to (3.6)–(3.9),

    \begin{align} \lim\limits_{\alpha \to \infty} \mathbb{F}_{X\|Y} (\alpha) & = \Pi (X\,\|\,Y), \end{align} (4.5)
    \begin{align} \lim\limits_{\alpha \to -\infty} \mathbb{F}_{X\|Y} (\alpha) & = 0, \end{align} (4.6)
    \begin{align} \lim\limits_{\alpha \to \infty} \bar{\mathbb{F}}_{X\|Y} (\alpha) & = 1, \end{align} (4.7)
    \begin{align} \lim\limits_{\alpha \to -\infty} \bar{\mathbb{F}}_{X\|Y} (\alpha) & = 1 - \Pi (Y\,\|\,X). \end{align} (4.8)

    Although (4.5) is less than 1 if P_X \not\ll P_Y and (4.8) is positive if P_Y \not\ll P_X , the relative information spectra are monotonically increasing and right-continuous.

    31. As a result of the skew-symmetry (3.3) of the relative information,

    \begin{align} \mathbb{F}_{Y\|X} (\alpha) & = 1 - \lim\limits_{x \uparrow -\alpha} \bar{\mathbb{F}}_{X\|Y} (x) = \mathbb{P}[\imath_{X\|Y} (Y) \geq -\alpha], \quad \alpha \in \mathbb R, \end{align} (4.9)
    \begin{align} \bar{\mathbb{F}}_{Y\|X} (\alpha) & = 1 - \lim\limits_{x \uparrow -\alpha} \mathbb{F}_{X\|Y} (x) = \mathbb{P}[\imath_{X\|Y} (X) \geq -\alpha], \quad \alpha \in \mathbb R. \end{align} (4.10)

    32. If P_X = P_Y , then \mathbb{F}_{X\|Y} (\alpha) = \bar{\mathbb{F}}_{X\|Y} (\alpha) = 1\{\alpha \geq 0\} .

    If P_X \perp P_Y , then \mathbb{F}_{X\|Y} (\alpha) = 0 and \bar{\mathbb{F}}_{X\|Y} (\alpha) = 1 for all \alpha \in \mathbb R .

    Lemma 7. Let (P_X, P_Y) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 . Then,

    (a) P_X \neq P_Y \Longleftrightarrow \mathbb{F}_{X\|Y} (0) < 1 .

    (b) If, in addition, P_X \ll P_Y , then P_X \neq P_Y \Longleftrightarrow \bar{\mathbb{F}}_{X\|Y} (0) < 1 .

    Proof.

    (a) \mathbb{F}_{X\|Y} (0) = 1 \Longleftrightarrow 0 = \mathbb{E}\left[ [\imath_{X\|Y} (X)]^+ \right] \geq D(X\,\|\,Y) \Longrightarrow P_X = P_Y .

    (b) For any \alpha > 0 ,

    \begin{align} 1 - \bar{\mathbb{F}}_{X\|Y} (0) & = \mathbb{P}[\imath_{Y\|X} (Y) < 0] \end{align} (4.11)
    \begin{align} & = \mathbb{E}\left[ 1\{\imath_{Y\|X} (X) < 0\} \exp\left( \imath_{Y\|X} (X) \right) \right] \end{align} (4.12)
    \begin{align} & \geq \mathbb{E}\left[ 1\{-\alpha < \imath_{Y\|X} (X) < 0\} \exp\left( \imath_{Y\|X} (X) \right) \right] \end{align} (4.13)
    \begin{align} & \geq \exp(-\alpha) \, \mathbb{P}[-\alpha < \imath_{Y\|X} (X) < 0] \end{align} (4.14)
    \begin{align} & = \exp(-\alpha) \, \mathbb{P}[0 < \imath_{X\|Y} (X) < \alpha], \end{align} (4.15)

    where

    ● (4.11) \Longleftarrow (4.10) with X \leftrightarrow Y .

    ● (4.12) \Longleftarrow (3.12) with f(a) = 1\{\imath_{Y\|X} (a) < 0\} \exp\left( - \imath_{X\|Y} (a) \right) .

    ● (4.15) \Longleftarrow (3.3).

    On account of (a), (2.5), and (4.5), \lim_{\gamma \to \infty} \mathbb{F}_{X\|Y} (\gamma) = 1 > \mathbb{F}_{X\|Y} (0) . Therefore, there must exist \alpha > 0 such that \mathbb{P}[0 < \imath_{X\|Y} (X) < \alpha] > 0 .

    An example of P_X \neq P_Y with \bar{\mathbb{F}}_{X\|Y} (0) = 1 is P_X = [\frac12 \; \; \frac12] , P_Y = [1 \; \; 0] .

    33. Example: If X \sim \mathcal{N}(\mu, \sigma^2) , Y \sim \mathcal{N}(0, \sigma^2) , and Q(t) = \int_t^\infty \frac1{\sqrt{2\pi}} \mathrm{e}^{-x^2/2} \, \mathrm{d}x , then, with \alpha in nats,

    \begin{align} \mathbb{F}_{X\|Y} (\alpha) = Q\left( \frac{\mu}{2\sigma} - \frac{\alpha \sigma}{\mu} \right) \quad \mbox{and} \quad \bar{\mathbb{F}}_{X\|Y} (\alpha) = Q\left( -\frac{\mu}{2\sigma} - \frac{\alpha \sigma}{\mu} \right). \end{align} (4.16)
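    The closed forms in (4.16) are easy to confirm by simulation. The Python sketch below (arbitrary illustrative parameters) compares empirical estimates of the two spectra against (4.16) at a single point \alpha .

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, alpha = 1.5, 1.0, 0.3     # arbitrary parameters; alpha in nats

def rel_info(t):                     # i_{X||Y}(t) for N(mu, sigma^2) vs N(0, sigma^2)
    return mu * t / sigma ** 2 - mu ** 2 / (2 * sigma ** 2)

X = rng.normal(mu, sigma, 200_000)
Y = rng.normal(0.0, sigma, 200_000)

Q = norm.sf                          # Gaussian tail (Q-function)
print("F    :", np.mean(rel_info(X) <= alpha), "vs", Q(mu / (2 * sigma) - alpha * sigma / mu))
print("Fbar :", np.mean(rel_info(Y) <= alpha), "vs", Q(-mu / (2 * sigma) - alpha * sigma / mu))
```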

    34. Example: Let \xi = \frac12 \log_{\mathrm{e}} \frac{\sigma_X^2}{\sigma_Y^2} , with \sigma_X^2 > \sigma_Y^2 > 0 , X \sim \mathcal{N}(0, \sigma_X^2) , and Y \sim \mathcal{N}(0, \sigma_Y^2) . If \alpha > -\xi , then

    \begin{align} \mathbb{F}_{X\|Y} (\alpha) & = 1 - 2\, Q\left( \sqrt{\frac{2\sigma_Y^2 (\alpha + \xi)}{\sigma_X^2 - \sigma_Y^2}} \right), \end{align} (4.17)
    \begin{align} \bar{\mathbb{F}}_{X\|Y} (\alpha) & = 1 - 2\, Q\left( \sqrt{\frac{2\sigma_X^2 (\alpha + \xi)}{\sigma_X^2 - \sigma_Y^2}} \right), \end{align} (4.18)

    while \mathbb{F}_{X\|Y} (\alpha) = \bar{\mathbb{F}}_{X\|Y} (\alpha) = 0 if \alpha \leq -\xi .

    35. Example: [24] Suppose that V is standard Cauchy, \lambda_{{{\mathtt{0}}}} \lambda_{{{\mathtt{1}}}} \neq 0 , X_{{{\mathtt{1}}}} = \lambda_{{{\mathtt{1}}}} V + \mu_{{{\mathtt{1}}}} , and X_{{{\mathtt{0}}}} = \lambda_{{{\mathtt{0}}}} V + \mu_{{{\mathtt{0}}}} . Then, \bar{\mathbb{F}}_{X_{{{\mathtt{1}}}}\|X_{{{\mathtt{0}}}}} (\alpha) = 1 - \mathbb{F}_{X_{{{\mathtt{1}}}}\|X_{{{\mathtt{0}}}}} (-\alpha) and

    \mathbb{F}_{X_{{{\mathtt{1}}}}\|X_{{{\mathtt{0}}}}} (\log \beta) = \begin{cases} 1, & \zeta + \sqrt{\zeta^2 - 1} \leq \beta; \\ \frac12 + \frac1{\pi} \arcsin \frac{\beta - \zeta}{\sqrt{\zeta^2 - 1}}, & \zeta - \sqrt{\zeta^2 - 1} < \beta < \zeta + \sqrt{\zeta^2 - 1}; \\ 0, & 0 < \beta \leq \zeta - \sqrt{\zeta^2 - 1}, \end{cases} (4.19)

    \begin{align} \mbox{with} \quad \zeta = \frac{\lambda_{{{\mathtt{1}}}}^2 + \lambda_{{{\mathtt{0}}}}^2 + (\mu_{{{\mathtt{1}}}} - \mu_{{{\mathtt{0}}}})^2}{2\, |\lambda_{{{\mathtt{0}}}} \lambda_{{{\mathtt{1}}}}|} \geq 1. \end{align} (4.20)

    36. Example: Suppose that U is uniform on [0, 1] . Define the probability measures on (\mathbb R, \mathscr{B}) ,

    \begin{align} {P_{{{\mathtt{1}}}}} = \tfrac12 \delta_2 + \tfrac12 P_{2U}, \quad \mbox{and} \quad {P_{{{\mathtt{0}}}}} = \tfrac13 \delta_1 + \tfrac13 \delta_2 + \tfrac13 P_{3U+1}. \end{align} (4.21)

    Then, \Pi({P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) = \frac34 , \Pi({P_{{{\mathtt{0}}}}}\,\|\,{P_{{{\mathtt{1}}}}}) = \frac49 , and

    \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\alpha) = \begin{cases} \frac34, & \alpha \geq 2\log\frac32; \\ \frac12, & \log\frac32 \leq \alpha < 2\log\frac32; \\ 0, & \alpha < \log\frac32; \end{cases} \qquad \bar{\mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\alpha) = \begin{cases} 1, & \alpha \geq 2\log\frac32; \\ \frac89, & \log\frac32 \leq \alpha < 2\log\frac32; \\ \frac59, & \alpha < \log\frac32. \end{cases} (4.22)

    37. A key aspect of the relative information spectra is that \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} and \bar{\mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} determine each other through the following result.

    Theorem 1. Fix arbitrary (P_X, P_Y) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 .

    (a) For all \beta > 0 ,

    \begin{align} \bar{\mathbb{F}}_{X\|Y} (\log \beta) = 1 - \int_0^{\frac1\beta} \left( \mathbb{F}_{X\|Y} \left( \log \frac1t \right) - \mathbb{F}_{X\|Y} (\log \beta) \right) \mathrm{d}t. \end{align} (4.23)

    (b) For all \beta > 0 ,

    \begin{align} \mathbb{F}_{X\|Y} (\log \beta) = \int_0^{\beta} \left( \bar{\mathbb{F}}_{X\|Y} (\log \beta) - \bar{\mathbb{F}}_{X\|Y} (\log t) \right) \mathrm{d}t. \end{align} (4.24)

    (c) For all \beta > 0 ,

    \begin{align} \mathbb{P}\left[ \imath_{X\|Y} (X) = \log \beta \right] = \beta \, \mathbb{P}\left[ \imath_{X\|Y} (Y) = \log \beta \right]. \end{align} (4.25)

    (d) \imath_{X\|Y} (X) \in (-\infty, +\infty] is discrete if and only if \imath_{X\|Y} (Y) \in [-\infty, +\infty) is discrete. \imath_{X\|Y} (X) is absolutely continuous, except for a possible mass at +\infty , if and only if \imath_{X\|Y} (Y) is absolutely continuous, except for a possible mass at -\infty . Then, the density functions f_{X\|Y}(x) = \frac{\mathrm{d}}{\mathrm{d}x} \mathbb{F}_{X\|Y} (x) and \bar{f}_{X\|Y}(x) = \frac{\mathrm{d}}{\mathrm{d}x} \bar{\mathbb{F}}_{X\|Y} (x) satisfy

    \begin{align} f_{X\|Y} (\log t) = t \, \bar{f}_{X\|Y} (\log t), \quad t > 0. \end{align} (4.26)

    (e)

    \begin{align} \Pi (Y\,\|\,X) & = \mathbb{P}\left[ \imath_{X\|Y} (Y) > -\infty \right] \end{align} (4.27)
    \begin{align} & = \mathbb{E}\left[ \exp\left( - \imath_{X\|Y} (X) \right) \right] \end{align} (4.28)
    \begin{align} & = \int_0^\infty \mathbb{F}_{X\|Y} \left( \log \frac1\beta \right) \mathrm{d}\beta \end{align} (4.29)
    \begin{align} & = \int_{-\infty}^\infty \exp(-t) \, \mathrm{d} \mathbb{F}_{X\|Y} (t). \end{align} (4.30)
    \begin{align} \Pi (X\,\|\,Y) & = \lim\limits_{\alpha \to \infty} \mathbb{F}_{X\|Y} (\alpha) \end{align} (4.31)
    \begin{align} & = \mathbb{E}\left[ \exp\left( \imath_{X\|Y} (Y) \right) \right] \end{align} (4.32)
    \begin{align} & = \int_0^\infty \left( 1 - \bar{\mathbb{F}}_{X\|Y} (\log \beta) \right) \mathrm{d}\beta \end{align} (4.33)
    \begin{align} & = \int_{-\infty}^\infty \exp(t) \, \mathrm{d} \bar{\mathbb{F}}_{X\|Y} (t). \end{align} (4.34)

    (f) If g \colon \mathbb R \to [0, \infty) , then

    \begin{align} \int_{-\infty}^\infty g(t) \exp(t) \, \mathrm{d} \bar{\mathbb{F}}_{X\|Y} (t) & = \int_{-\infty}^\infty g(t) \, \mathrm{d} \mathbb{F}_{X\|Y} (t), \end{align} (4.35)
    \begin{align} \int_{-\infty}^\infty g(t) \exp(-t) \, \mathrm{d} \mathbb{F}_{X\|Y} (t) & = \int_{-\infty}^\infty g(t) \, \mathrm{d} \bar{\mathbb{F}}_{X\|Y} (t). \end{align} (4.36)

    (g) The cumulant generating functions of \imath_{X\|Y} (X) and \imath_{X\|Y} (Y) (nats) satisfy

    \Lambda_{\imath_{X\|Y} (X)} (t) = \Lambda_{\imath_{X\|Y} (Y)} (t+1), \quad \begin{cases} t \in \mathbb R, & P_X \ll P_Y \ll P_X; \\ t \in (-1, \infty), & P_X \ll P_Y \not\ll P_X; \\ t \in (-\infty, 0), & P_X \not\ll P_Y \ll P_X; \\ t \in (-1, 0), & P_X \not\ll P_Y \not\ll P_X, \end{cases} (4.37)

    and \Lambda_{\imath_{X\|Y} (X)} (-1) = \log_{\mathrm{e}} \Pi (Y\,\|\,X) and \Lambda_{\imath_{X\|Y} (Y)} (1) = \log_{\mathrm{e}} \Pi (X\,\|\,Y) .

    Proof.

    (a) Fix \alpha \in \mathbb R and let P_Z dominate \{P_X, P_Y\} . Then,

    \begin{align} \mathbb{E}\left[ \exp\left( - \imath_{X\|Y} (X) \right) 1\{\alpha < \imath_{X\|Y} (X)\} \right] & = \mathbb{E}\left[ \exp\left( - \imath_{X\|Y} (X) \right) 1\{\alpha < \imath_{X\|Y} (X) < \infty\} \right] \end{align} (4.38)
    \begin{align} & = \mathbb{E}\left[ \exp\left( \imath_{Y\|Z} (X) - \imath_{X\|Z} (X) \right) 1\{\alpha < \imath_{X\|Y} (X) < \infty\} \right] \end{align} (4.39)
    \begin{align} & = \mathbb{E}\left[ \exp\left( \imath_{Y\|Z} (Z) \right) 1\{\alpha < \imath_{X\|Y} (Z) < \infty\} \right] \end{align} (4.40)
    \begin{align} & = \mathbb{P}\left[ \alpha < \imath_{X\|Y} (Y) < \infty \right] \end{align} (4.41)
    \begin{align} & = 1 - \bar{\mathbb{F}}_{X\|Y} (\alpha), \end{align} (4.42)

    where

    ● (4.38) \Longleftarrow \exp(-\infty) = 0 .

    ● (4.39) \Longleftarrow (3.2) and the random variable in the expectation on the left side can be positive only if X \in \mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X} .

    ● (4.40) and (4.41) \Longleftarrow change of measure (2.11).

    ● (4.42) \Longleftarrow (3.8).

    Then, we have

    \begin{align} \mathbb{F}_{X\|Y} (\alpha) + \exp(\alpha) \left( 1 - \bar{\mathbb{F}}_{X\|Y} (\alpha) \right) & = \mathbb{E}\left[ \exp\left( - [\imath_{X\|Y} (X) - \alpha]^+ \right) \right] \end{align} (4.43)
    \begin{align} & = \int_0^1 \mathbb{P} \left[ \exp \left( - [ \imath_{X\|Y} (X) - \alpha ]^+ \right) \geq \tau \right] \, \mathrm{d}\tau \end{align} (4.44)
    \begin{align} & = \int_0^1 \mathbb{F}_{X\|Y} \left(\alpha + \log \frac1\tau \right) \, \mathrm{d}\tau , \end{align} (4.45)

    where

    ● (4.43) \Longleftarrow (4.38)–(4.42).

    ● (4.44) \Longleftarrow \mathbb{E} [ T] = \int_0^1 \mathbb{P} [T \geq \tau ] \, \mathrm{d}\tau if T\in [0, 1] .

    Rearranging the outer terms in (4.43)–(4.45) with \alpha = \log \beta and changing the integration variable to t = \frac{\tau}{\beta} yields (4.23).

    (b) Denote V = \exp (\imath_{X\|Y} (Y)) . By change of measure on \mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X} , we obtain

    \begin{align} \mathbb{F}_{X\|Y} (\log \beta) & = \mathbb{E} \left[ 1\left\{ \exp (\imath_{X\|Y} (X) \leq \beta \right\} \, 1\left\{ X \in \mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X} \right\} \right] \end{align} (4.46)
    \begin{align} & = \mathbb{E} \left[ V \, 1\left\{0 < V \leq \beta \right\} \right] \end{align} (4.47)
    \begin{align} & = \int_0^\infty \mathbb{P} \left[ 1\left\{0 < V \leq \beta \right\} V > t\right]\, \mathrm{d}t \end{align} (4.48)
    \begin{align} & = \int_0^{\beta} \mathbb{P} \left[ t < V \leq \beta \right]\, \mathrm{d}t \end{align} (4.49)
    \begin{align} & = \int_0^{\beta} \left( \bar{ \mathbb{F}}_{X\|Y} (\log \beta) - \bar{ \mathbb{F}}_{X\|Y} (\log t) \right) \, \mathrm{d}t. \end{align} (4.50)

    (c) \Longleftarrow (3.10) with f(a) = 1\{\imath_{X\|Y} (a) = \log \beta\} .

    (d) Lemma 5 implies that, when restricted to \mathbb R , the probability measures of W and Z are mutually absolutely continuous; therefore, one is discrete [resp., absolutely continuous] if and only if the other one is discrete [resp., absolutely continuous]. Differentiating (4.23) with respect to t yields (4.26).

    ● (4.27) and (4.31) are (3.9) and (3.6), respectively.

    ● (4.28) and (4.32) are (3.38) and (3.37), respectively.

    ● (4.29) \Longleftarrow (4.8) and (4.23).

    ● (4.33) \Longleftarrow (4.5) and (4.24).

    ● (4.30) and (4.34) are (4.28) and (4.32), respectively, since those expectations are unchanged when restricted to \imath_{X\|Y} (X) \in \mathbb R and \imath_{X\|Y} (Y) \in \mathbb R .

    (f) \Longleftarrow Lemma 3 with f(a) = g\left(\imath_{X\|Y} (a) \right) .

    Note that the left sides of (3.10) and (3.13) are unchanged if the random variables inside the expectations are multiplied by 1\{\imath_{X\|Y} (Y) \in \mathbb R\} and 1\{\imath_{X\|Y} (X) \in \mathbb R\} , respectively.

    (g) The formulas for \Lambda_{\imath_{X\|Y} (X)} (-1) and \Lambda_{\imath_{X\|Y} (Y)} (1) follow from (4.28) and (4.32), respectively. In addition to \Lambda_{\imath_{X\|Y} (X)} (0) = 0 = \Lambda_{\imath_{X\|Y} (Y)} (0) , the following expressions ( \infty \cdot 0 = 0 ) for the moment generating functions at t \notin \{0, -1\} yield (4.37):

    \begin{align} \rm{M}_{\imath_{X\|Y} (X)} (t) = \int_{\mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X}}\!\! \left(\frac{p_X (\omega)}{p_Y (\omega)}\right)^t \!\! \mathrm{d}P_X (\omega) + \infty \cdot 1\{ P_X \not \ll P_Y\} \cdot 1\{ t > 0 \}, \end{align} (4.51)
    \begin{align} \rm{M}_{\imath_{X\|Y} (Y)} (t\!+\!1) = \int_{\mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X}}\!\! \left(\frac{p_X (\omega)}{p_Y (\omega)}\right)^t \!\! \mathrm{d}P_X (\omega) + \infty \cdot 1\{ P_Y \not \ll P_X\} \cdot 1\{ t < -1 \}. \end{align} (4.52)
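    Relations such as (4.24) and (4.25) can also be checked numerically on a small discrete example. The Python sketch below (hypothetical PMFs with common support) verifies the atom identity (4.25) and approximates the integral representation (4.24) by a Riemann sum.

```python
import numpy as np

# Hypothetical PMFs with common support, so the spectra place all their mass on R.
PX = np.array([0.6, 0.3, 0.1])
PY = np.array([0.2, 0.3, 0.5])
rel_info = np.log(PX / PY)

F     = lambda a: PX[rel_info <= a].sum()   # F_{X||Y}(a)
F_bar = lambda a: PY[rel_info <= a].sum()   # Fbar_{X||Y}(a)

# (4.25): P[i(X) = log beta] = beta * P[i(Y) = log beta] at every atom.
for r in rel_info:
    atom = np.isclose(rel_info, r)
    assert np.isclose(PX[atom].sum(), np.exp(r) * PY[atom].sum())

# (4.24): F(log beta) = integral over (0, beta) of ( Fbar(log beta) - Fbar(log t) ) dt.
beta = 1.5
t = np.linspace(1e-6, beta, 20_000)
integrand = F_bar(np.log(beta)) - np.array([F_bar(np.log(ti)) for ti in t])
print(F(np.log(beta)), integrand.sum() * (t[1] - t[0]))   # both approximately 0.4
```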

    38. In this item and the next, we upper bound { \mathbb{F}}_{X\|Y} in terms of \bar{ \mathbb{F}}_{X\|Y} .

    Lemma 8. For \beta > 0 ,

    \begin{align} { \mathbb{F}}_{X\|Y} (\log \beta) &\leq \min\{ 1, \beta \} , \end{align} (4.53)
    \begin{align} { \mathbb{F}}_{X\|Y} (\log \beta) &\leq \inf\limits_{B\in \mathscr{F}} \left\{ P_X (B^c) + \beta \, P_Y (B)\right\}, \end{align} (4.54)
    \begin{align} 2\, { \mathbb{F}}_{X\|Y} (\log \beta) &\leq \beta \, \bar{ \mathbb{F}}_{X\|Y} ( \log \beta ) + 1 , \end{align} (4.55)
    \begin{align} { \mathbb{F}}_{X\|Y} (\log \beta) &\leq \beta \, \bar{ \mathbb{F}}_{X\|Y} ( \log \beta ) + \beta \, \Pi (Y\| X) -\beta, \end{align} (4.56)
    \begin{align} { \mathbb{F}}_{X\|Y} (\log \beta) &\leq \beta \, \bar{ \mathbb{F}}_{X\|Y} ( \log \beta ) + \Pi (X\| Y) - \beta , \end{align} (4.57)
    \begin{align} { \mathbb{F}}_{X\|Y} (\log \beta) & < \bar{ \mathbb{F}}_{X\|Y} ( \log \beta ) + \mathrm{e}^{-\frac1\beta} -1 + \Pi (Y\,\|\,X). \end{align} (4.58)

    Proof. Let L_\beta = \{a \in \mathcal{{A}}\colon \imath_{X\|Y} (a) \leq \log \beta\} .

    ● (4.53) \Longleftarrow (3.11) with f(a) = 1\{a \in L_\beta\} .

    ● (4.54) is Lemma 4.1.2 in [22]. P_X (L_\beta) \leq P_X (B^c) + P_X (L_\beta \cap B) \leq P_X (B^c) + \beta\, P_Y (B), where the second inequality follows from (3.18) with g(a) = 1\{a\in B\} .

    ● (4.55) \Longleftarrow (4.54) with the suboptimal choice B = L_\beta .

    ● (4.56) \Longleftarrow (4.24) and for t > 0 , \bar{ \mathbb{F}}_{X\|Y} (\log t) \geq \lim_{\tau \to 0} \bar{ \mathbb{F}}_{X\|Y} (\log \tau) = 1 - \Pi (Y\|X) .

    ● (4.53) \Longleftarrow \max\{ \bar{ \mathbb{F}}_{X\|Y} (\log \beta), \Pi (Y\| X)\} \leq 1 .

    ● (4.57) \Longleftarrow (3.6), (4.25), and (3.17) with g(a) \leftarrow 1 .

    ● (4.58) \Longleftarrow \left(1 - \frac1{\beta}\right) { \mathbb{F}}_{X\|Y} (\log \beta) \leq 1 - \frac1{\beta} < \mathrm{e}^{-\frac1\beta} and upper bound \frac1{\beta} { \mathbb{F}}_{X\|Y} (\log \beta) by means of (4.56).

    39. The following bound is instrumental in hypothesis testing (Section 8).

    Lemma 9. Let (P_X, P_Y)\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 . For any \beta > 0 , and measurable function g\colon\mathcal{{A}} \to [0, 1] ,

    \begin{align} { \mathbb{F}}_{X\|Y} (\log \beta) - \mathbb{E} \left[ g ( X)\right] \leq \beta \, \bar{ \mathbb{F}}_{X\|Y} ( \log \beta ) - \beta \,\mathbb{E} \left[ g ( Y) \right]. \end{align} (4.59)

    Proof. For all a\in \mathcal{{A}} , \beta > 0 , and measurable function g\colon\mathcal{{A}} \to [0, 1] ,

    \begin{align} \left(\beta - \exp( \imath_{X\|Y} (a)) \right) \left(1\{\imath_{X\|Y} (a) \leq \log \beta\} - g(a) \right) \geq 0, \end{align} (4.60)

    because when the first factor is positive [resp., negative], then the second factor is 1-g(a) [resp., -g(a) ]. Averaging (4.60) with respect to a\leftarrow Y , we obtain (4.59) invoking (3.10) twice with f(a) \leftarrow 1\{\imath_{X\|Y} (a) \leq \log \beta\} and f(a) \leftarrow g(a) , respectively, where in the second case the nonnegativity of g yields \mathbb{E} [ g(X) 1\{\imath_{X\|Y} (X) \in \mathbb R \}] \leq \mathbb{E} [ g(X)] .

    40. Definition 2. ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{B}}}}^2 are said to be equivalent pairs, denoted as ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) , if

    \begin{align} \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\alpha) = \mathbb{F}_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} (\alpha), \quad \alpha \in \mathbb R, \end{align} (4.61)

    i.e., \frac{\mathrm{d}{P_{{{\mathtt{1}}}}}}{\mathrm{d}{P_{{{\mathtt{0}}}}}} ({X_{{{\mathtt{1}}}}}) and \frac{\mathrm{d}{Q_{{{\mathtt{1}}}}}}{\mathrm{d}{Q_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) are identically distributed when {X_{{{\mathtt{1}}}}} \sim {P_{{{\mathtt{1}}}}} and {Y_{{{\mathtt{1}}}}} \sim {Q_{{{\mathtt{1}}}}} .

    A word of caution is that a different notion of equivalence for pairs of real-valued random variables (not pairs of probability measures) was proposed by Halmos and Savage in [3]: Suppose (X_1, Y_1, X_2, Y_2) are real-valued random variables such that \mathbb{P} [ X_i = Y_i = 0] = 0 , i = 1, 2 ; then (X_1, Y_1) and (X_2, Y_2) are equivalent in the sense of [3], if there is a fifth random variable such that \mathbb{P} [F = 0] = 0 and \mathbb{P}[ (X_1, Y_1) = (F \cdot X_2, F\cdot Y_2)] = 1 .

    41. Definition 2 and Theorem 1 result in

    \begin{array}{*{20}{c}} {\left\{ {{\mathbb{F}_{{P_{\mathtt{1}}}||{P_{\mathtt{0}}}}}(\alpha ) = } \right.{\rm{ }}\left. {{\mathbb{F}_{{Q_{\mathtt{1}}}||{Q_{\mathtt{0}}}}}(\alpha ),\alpha \in \mathbb{R}} \right\}}&{}\\ \Updownarrow &\qquad\qquad\qquad{\left( {4.62} \right)}\\ {\left( {{P_{\mathtt{1}}},{P_{\mathtt{0}}}} \right){\rm{ }} \equiv \left( {{Q_{\mathtt{1}}},{Q_{\mathtt{0}}}} \right)}&{}\\ \Updownarrow &\qquad\qquad\qquad{\left( {4.63} \right)}\\ {\left\{ {{{\overline{\mathbb{F}} }_{{P_{\mathtt{1}}}||{P_{\mathtt{0}}}}}(\alpha ) = } \right.{\rm{ }}\left. {{{\overline {\mathbb{F}}}_{{Q_{\mathtt{1}}}||{Q_{\mathtt{0}}}}}(\alpha ),\alpha \in {\mathbb{R}}} \right\}.}&{} \end{array}

    The remainder of the section is devoted to finding necessary and sufficient conditions for the equivalence of pairs. The relevance of such conditions will be apparent in Section 7.

    42. In view of (4.9) and the fact that the relative information spectra are right-continuous, (4.62)–(4.63) imply

    \begin{align} ({P_{{{\mathtt{1}}}}},{P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}},{Q_{{{\mathtt{0}}}}}) \quad \Longleftrightarrow \quad ({P_{{{\mathtt{0}}}}},{P_{{{\mathtt{1}}}}}) \equiv ({Q_{{{\mathtt{0}}}}},{Q_{{{\mathtt{1}}}}}). \end{align} (4.64)

    However, ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}) is more the exception than the rule. In addition to \mathbb R^n -valued random variables that differ by a constant, one of the most notable cases satisfying this property is the Cauchy case described in Item 35.

    43. Theorem 2. For ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{B}}}}^2 , the following circular implications hold:

    \begin{array}{*{20}{c}} {\exists {\alpha _{\mathtt{0}}} > 0\;s.t.\;\left\{ {{D_\alpha }\left( {{P_{\mathtt{1}}}||{P_{\mathtt{0}}}} \right)} \right.{\rm{ }}\left. { = {D_\alpha }\left( {{Q_{\mathtt{1}}}||{Q_{\mathtt{0}}}} \right),\alpha \in \left( {0,{\alpha _{\mathtt{0}}}} \right)} \right\}}&{}\\ \Downarrow &\qquad\qquad\qquad{\left( {4.65} \right)}\\ {\left( {{P_{\mathtt{1}}},{P_{\mathtt{0}}}} \right){\rm{ }} \equiv \left( {{Q_{\mathtt{1}}},{Q_{\mathtt{0}}}} \right)}&{}\\ \Downarrow &\qquad\qquad\qquad{\left( {4.66} \right)}\\ {{D_\alpha }\left( {{P_{\mathtt{1}}}||{P_{\mathtt{0}}}} \right){\rm{ }} = {D_\alpha }\left( {{Q_{\mathtt{1}}}||{Q_{\mathtt{0}}}} \right),\alpha \in (0,\infty ].}&{} \end{array}

    Proof. If D_\alpha ({P_{{{\mathtt{1}}}}}\, \|\, {P_{{{\mathtt{0}}}}}) = D_\alpha ({Q_{{{\mathtt{1}}}}}\, \|\, {Q_{{{\mathtt{0}}}}}) = \infty for some \alpha < 1 , then (4.65) follows from Item 25-(b) since {P_{{{\mathtt{1}}}}} \perp {P_{{{\mathtt{0}}}}} and {Q_{{{\mathtt{1}}}}} \perp {Q_{{{\mathtt{0}}}}} implies ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) . If D_\alpha ({P_{{{\mathtt{1}}}}}\, \|\, {P_{{{\mathtt{0}}}}}) = D_\alpha ({Q_{{{\mathtt{1}}}}}\, \|\, {Q_{{{\mathtt{0}}}}}) < \infty for 0 < \alpha < \alpha_1 < 1 , recall from (3.42) that (\alpha-1) D_\alpha (X\| Y) = \Lambda_{\imath_{X\|Y} (Y)} (\alpha) (assuming nats for convenience). Then, Item 13 implies that the values of D_\alpha (X\| Y) in a neighborhood of the origin determine the function \bar{ \mathbb{F}}_{X\|Y} ; therefore, (4.63) yields (4.65). On account of (3.41), we have

    \exp \left((\alpha-1) D_\alpha(P \| Q)\right) = \begin{cases}\int_0^{\infty} \mathbb{F}_{P \| Q}\left(\frac{\log \beta}{\alpha-1}\right) \mathrm{d} \beta, & \alpha \in(0,1) ; \\ \int_0^{\infty}\left(1-\mathbb{F}_{P \| Q}\left(\frac{\log \beta}{\alpha-1}\right)\right) \mathrm{d} \beta, & \alpha > 1,\end{cases} (4.67)

    which shows (4.66) for \alpha \in (0, 1)\cup (1, \infty) . For \alpha = 1 , we recall the definition of relative entropy, or, equivalently,

    \begin{align} D ( P \, \| \, Q ) & = \int_{-\infty}^\infty \left(1\{x > 0\} - \mathbb{F}_{P\|Q} (x) \right) \, \mathrm{d} x . \end{align} (4.68)

    For \alpha = \infty , note that according to (3.45), D_\infty (P\, \|\, Q) = \inf\{ v \in \mathbb R \colon \mathbb{F}_{P\|Q} (v) = 1 \} .

    44. The following concentration bound for the relative information spectrum holds as a function of the Rényi divergence of order \alpha > 1 : If \delta > 0 , then

    \begin{align} \mathbb{F}_{P\|Q} \left( D_\alpha ( P \,\|\, Q ) + \delta \right) \geq \Pi (P\,\|\,Q) - \exp \left( (1 - \alpha) \delta \right). \end{align} (4.69)

    To verify (4.69), let L_\beta = \{a \in \mathcal{{A}}\colon \imath_{P\|Q} (a) \leq \log \beta\} , and X\sim P . Then,

    \begin{align} &\Pi (P\,\|\, Q) - \mathbb{F}_{P\|Q} \left( \log \beta \right) \\ & = \mathbb{P} [ \log \beta < \imath_{P\|Q} (X) < \infty] \end{align} (4.70)
    \begin{align} & = \int 1\{ \log \beta < \imath_{P\|Q} (a) < \infty \} \, \exp \left( (1- \alpha) \imath_{P\|Q} (a) + (\alpha-1) \imath_{P\|Q} (a) \right) \, \mathrm{d}P (a) \end{align} (4.71)
    \begin{align} &\leq \exp \left( (1-\alpha) \left( \log \beta - D_\alpha (P\,\|\, Q) \right) \right), \end{align} (4.72)

    on account of (3.43). Letting \log \beta = \delta + D_\alpha (P\, \|\, Q) yields (4.69).

    45. Let \mathtt{F} be the collection of convex functions f\colon(0, \infty) \to \mathbb R . For f\in \mathtt{F} , the f -divergence, introduced and shown to satisfy the data processing principle in [15,32,33], can be expressed in terms of the relative information spectrum as

    \begin{align} D_f(P\,\|\, Q) = \int_{-\infty}^\infty \!\!f\left(\exp(t) \right)\, \mathrm{d} \bar{ \mathbb{F}}_{P\|Q}(t) + \left( 1 - \Pi (Q\,\|\,P) \right) f(0) + \left( 1 - \Pi (P\,\|\,Q)\right) f^\dagger(0), \end{align} (4.73)

    where f(0) = \lim_{t\downarrow 0} f(t) and f^\dagger (0) = \lim_{t\downarrow 0} t \, f \left(\frac1t \right) . Other integral representations of f -divergence as a function of the relative information spectrum, the deGroot statistical information (Item 48), and the E_\gamma -divergence (Item 49), can be found in [34], [35], and [36], respectively.

    46. Lemmas 10 and 11 are used in Section 6 to show that the NP-divergence is not an f -divergence.

    Lemma 10. [37, (9.4)] Suppose that the convex functions f\colon (0, \infty) \to \mathbb R and g\colon (0, \infty) \to \mathbb R are such that D_f (P\, \|\, Q) = D_g (P\, \|\, Q) for all (P, Q)\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , where |\mathcal{{A}}| = 2 . Then, f(t)-g(t) = \alpha \, t - \alpha for some \alpha \in \mathbb R .

    Csiszár showed in [38, Theorem 1] that a discrepancy measure that satisfies the data processing inequality and the property in Lemma 11 must be an f -divergence.

    Lemma 11. Whenever ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}, {Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^4 are such that there exists an event \mathcal{{A}}_{\mathtt{0}} \in \mathscr{F} such that {Q_{{{\mathtt{0}}}}}(\mathcal{{A}}_{\mathtt{0}}) = 1 = {P_{{{\mathtt{0}}}}}(\mathcal{{A}}_{\mathtt{0}}) and {Q_{{{\mathtt{1}}}}}(\mathcal{{A}}_{\mathtt{0}}) = 0 = {P_{{{\mathtt{1}}}}}(\mathcal{{A}}_{\mathtt{0}}) ,

    \begin{align} D_f (\lambda \, {P_{{{\mathtt{1}}}}} + (1 - \lambda) {P_{{{\mathtt{0}}}}} \,\|\, \lambda \, {Q_{{{\mathtt{1}}}}} + (1 - \lambda) {Q_{{{\mathtt{0}}}}} ) = \lambda\, D_f ( {P_{{{\mathtt{1}}}}} \,\|\, {Q_{{{\mathtt{1}}}}} ) + (1 -\lambda ) D_f ( {P_{{{\mathtt{0}}}}} \,\|\, {Q_{{{\mathtt{0}}}}} ) , \end{align} (4.74)

    for all \lambda\in [0, 1] and f\in \mathtt{F} .

    Proof. Denoting the corresponding densities with respect to a common dominating \sigma -finite measure \mathsf{μ} by p_{\mathtt{0}} , p_{\mathtt{1}} , q_{\mathtt{0}} , and q_{\mathtt{1}} , we have p_{\mathtt{0}} (x) = q_{\mathtt{0}} (x) = 0 if x \not\in \mathcal{{A}}_{\mathtt{0}} and p_{\mathtt{1}} (x) = q_{\mathtt{1}} (x) = 0 if x \in \mathcal{{A}}_{\mathtt{0}} . Furthermore, we can express the densities of the mixtures by p_\lambda = \lambda \, p_{\mathtt{1}} + (1 - \lambda) p_{\mathtt{0}} , and q_\lambda = \lambda \, q_{\mathtt{1}} + (1 - \lambda) q_{\mathtt{0}} , respectively. Then, with the usual conventions 0 \cdot f \left(\frac{p}{0} \right) = p \, f^\dagger (0) if p \geq 0 , and f(0) \cdot 0 = f^\dagger(0) \cdot 0 = 0 ,

    \begin{align} &D_f (\lambda \, {P_{{{\mathtt{1}}}}} + (1 - \lambda) {P_{{{\mathtt{0}}}}} \,\|\, \lambda \, {Q_{{{\mathtt{1}}}}} + (1 - \lambda) {Q_{{{\mathtt{0}}}}} ) \\ & = \int_{\mathcal{{A}}_{\mathtt{0}}} q_\lambda \, f \left( \frac{p_\lambda}{q_\lambda} \right) \mathrm{d} \mathsf{μ} + \int_{\mathcal{{A}}_{\mathtt{0}}^c} q_\lambda \, f \left( \frac{p_\lambda}{q_\lambda} \right) \mathrm{d} \mathsf{μ} \end{align} (4.75)
    \begin{align} & = (1 - \lambda) \int_{\mathcal{{A}}_{\mathtt{0}}} q_{\mathtt{0}} \, f \left( \frac{p_{\mathtt{0}}}{q_{\mathtt{0}}} \right) \mathrm{d} \mathsf{μ} + \lambda\, \int_{\mathcal{{A}}_{\mathtt{0}}^c} q_{\mathtt{1}} \, f \left( \frac{p_{\mathtt{1}}}{q_{\mathtt{1}}} \right) \mathrm{d} \mathsf{μ} \end{align} (4.76)
    \begin{align} & = (1 -\lambda ) D_f ( {P_{{{\mathtt{0}}}}} \,\|\, {Q_{{{\mathtt{0}}}}} ) + \lambda\, D_f ( {P_{{{\mathtt{1}}}}} \,\|\, {Q_{{{\mathtt{1}}}}} ). \end{align} (4.77)

    47. The convex functions

    \begin{align} f_\alpha(t) = \frac{t^\alpha -1 }{\alpha -1} \end{align} (4.78)

    result in an important special case of f -divergence known as the Hellinger divergence of order \alpha\in (0, 1)\cup(1, \infty) ,

    \begin{align} \mathscr{H}_\alpha ( P \,\|\, Q ) = D_{f_\alpha} ( P \,\|\, Q ) & = \frac{1}{\alpha - 1} \left( \mathbb{E} \left[ \exp \left( \alpha \, \imath_{P\|R} (Z) + (1-\alpha) \imath_{Q\|R} (Z) \right) \right] - 1\right) \end{align} (4.79)
    \begin{align} & = \frac{1}{1-\alpha} \left( 1 - \int p^{\alpha} q^{1-\alpha} \mathrm{d}\mathsf{ρ} \right), \end{align} (4.80)

    which use the same notation as in (3.2) and (3.43). Furthermore, we let \mathscr{H}_1 (P \, \|\, Q) = D(P \, \|\, Q) . The squared Hellinger distance is

    \begin{align} \mathscr{H}^2 ( P\, \| \, Q ) = \tfrac12 \mathscr{H}_{\frac12} ( P \, \| \,Q ) = 1 - \exp \left( - B ( P \, \| \, Q ) \right) = D_f (P\,\|\, Q) \leq 1, \end{align} (4.81)

    with B (P \, \| \, Q) defined in Item 24, and f(t) = 1 -\sqrt{t} or f(t) = \frac12 (1 - \sqrt{t})^2 .
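    The link (4.81) between the squared Hellinger distance and the Bhattacharyya distance of Item 24 is immediate to confirm on a discrete example; the Python sketch below uses hypothetical PMFs.

```python
import numpy as np

# Hypothetical PMFs on a common 3-letter alphabet.
P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.2, 0.5, 0.3])

hellinger_sq  = 0.5 * np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)   # H^2(P||Q)
bhattacharyya = -np.log(np.sum(np.sqrt(P * Q)))                # B(P||Q), cf. (3.51)

# (4.81): H^2(P||Q) = 1 - exp( -B(P||Q) ).
print(hellinger_sq, 1.0 - np.exp(-bhattacharyya))
assert np.isclose(hellinger_sq, 1.0 - np.exp(-bhattacharyya))
```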

    Theorem 3. For ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{B}}}}^2 , the following circular implications hold:

    \begin{array}{*{20}{c}} {\exists {\alpha _{\mathtt{0}}} > 0\;{{ s}}{{.t}}{{. }}\;\left\{ {{{\mathscr H}_\alpha }\left( {{P_{\mathtt{1}}}||{P_{\mathtt{0}}}} \right)} \right.\;\left. { = {{\mathscr H}_\alpha }\left( {{Q_{\mathtt{1}}}||{Q_{\mathtt{0}}}} \right),\alpha \in \left( {0,{\alpha _{\mathtt{0}}}} \right)} \right\}}&{}\\ \Downarrow &\qquad\qquad\qquad{\left( {4.82} \right)}\\ {\left( {{P_{\mathtt{1}}},{P_ {\mathtt{0}}}} \right)\; \equiv \left( {{Q_{\mathtt{1}}},{Q_{\mathtt{0}}}} \right)}&{}\\ \Downarrow &\qquad\qquad\qquad{\left( {4.83} \right)}\\ {{D_f}\left( {{P_{\mathtt{1}}}||{P_{\mathtt{0}}}} \right)\; = {D_f}\left( {{Q_{\mathtt{1}}}||{Q_{\mathtt{0}}}} \right),\quad \;for\;all\;f \in {{F}}.}&{} \end{array}

    Proof.

    ● (4.82) \Longleftarrow (4.65) because although Rényi divergence is not an f -divergence, it can be put in a one-to-one correspondence with \mathscr{H}_\alpha (P \, \|\, Q) by means of

    \begin{align} D_\alpha (P \,\|\, Q ) = \frac1{\alpha -1} \log \left( 1 + (\alpha - 1) \mathscr{H}_\alpha ( P \,\|\, Q ) \right), \end{align} (4.84)

    in light of (3.43) and (4.79).

    ● (4.83) \Longleftarrow (4.73).

    48. For p\in (0, 1) , the deGroot statistical information [20] is defined as the \phi_p -divergence

    \begin{align} \mathcal{I}_p(P\,\|\,Q) = D_{\phi_p}(P\,\|\,Q), \end{align} (4.85)

    with the convex function \phi_p\colon (0, \infty) \to (-1, \frac12) ,

    \phi_p(t) = \min \{p, 1-p\}-\min \{p t, 1-p\} = \begin{cases}\min \{p, 1-p\}-p t, & 0 < t \leq \frac{1}{p}-1; \\ -[1-2 p]^{+}, & t > \frac{1}{p}-1 .\end{cases} (4.86)

    Theorem 4. For ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{B}}}}^2 ,

    \begin{align} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) \quad \Longleftrightarrow \quad\{ \mathcal{I}_p({P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) = \mathcal{I}_p({Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}), \, p\in (0,1) \}. \end{align} (4.87)

    Proof. \Longrightarrow follows from (4.83). To show \Longleftarrow , we use the fact that as long as f is convex and twice differentiable, the f -divergence can be expressed as [34,39,40,41]

    \begin{align} D_f(P\,\|\,Q) = \int_0^1 \mathcal{I}_p(P\,\|\,Q) \cdot \frac1{p^3} \cdot \ddot{f}\left(\frac{1-p}{p}\right) \, \text{d}p . \end{align} (4.88)

    Therefore, \{ \mathcal{I}_p({P_{{{\mathtt{1}}}}}\, \|\, {P_{{{\mathtt{0}}}}}) = \mathcal{I}_p({Q_{{{\mathtt{1}}}}}\, \|\, {Q_{{{\mathtt{0}}}}}), \, p\in (0, 1) \} \Longrightarrow D_f ({P_{{{\mathtt{1}}}}}\, \|\, {P_{{{\mathtt{0}}}}}) = D_f({Q_{{{\mathtt{1}}}}}\, \|\, {Q_{{{\mathtt{0}}}}}) . Since (4.78) is convex and twice differentiable, \Longleftarrow in (4.87) follows from (4.82). Alternatively, we can invoke the representation of the relative information spectrum in [34, Theorem 4]:

    \mathbb{F}_{P \| Q}\left(\log \frac{1-p}{p}\right) = \begin{cases}-\mathcal{I}_p(P \| Q)-(1-p) \dot{\mathcal{I}}_p(P \| Q)+1, & p \in\left(0, \frac{1}{2}\right) ; \\ -\mathcal{I}_p(P \| Q)-(1-p) \dot{\mathcal{I}}_p(P \| Q), & p \in\left(\frac{1}{2}, 1\right),\end{cases} (4.89)

    and \mathbb{F}_{P\|Q} (0) = \lim_{\alpha \downarrow 0} \mathbb{F}_{P\|Q} (\alpha) .

    49. For \gamma \geq 1 , denote g_\gamma (t) = [ t - \gamma ]^+ , and define the E_\gamma divergence as the g_\gamma -divergence

    \begin{align} E_\gamma ( P \,\|\, Q ) = D_{g_\gamma} ( P \,\|\, Q ). \end{align} (4.90)

    Theorem 5. For ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{B}}}}^2 ,

    \begin{align} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) &\equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) \\ & \Updownarrow \\ \{ E_\gamma({P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) = E_\gamma({Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}) \; \; &\mathit{\mbox{and}}\; \; E_\gamma({P_{{{\mathtt{0}}}}}\,\|\,{P_{{{\mathtt{1}}}}}) = E_\gamma({Q_{{{\mathtt{0}}}}}\,\|\,{Q_{{{\mathtt{1}}}}}), \quad \gamma \geq 1 \}. \end{align} (4.91)

    Proof.

    \Downarrow We can invoke (4.64) and either (4.83) or the representation in [34, (112)],

    \begin{align} E_\gamma ( P \,\|\, Q ) & = \gamma \int_\gamma^\infty \frac{1 - \mathbb{F}_{P\|Q} (\log \beta )}{\beta^2} \mathrm{d} \beta. \end{align} (4.92)

    \Uparrow We can rely on Theorem 4 and

    \mathcal{I}_p(P \| Q) = \begin{cases}p E_{\frac{1-p}{p}}(P \| Q), & p \in\left(0, \frac{1}{2}\right] ; \\ (1-p) E_{\frac{p}{1-p}}(Q \| P), & p \in\left[\frac{1}{2}, 1\right) .\end{cases} (4.93)

    Alternatively, we can capitalize on Theorem 3 of [34], namely,

    \mathbb{F}_{P \| Q}(\log \gamma) = \begin{cases}1-E_\gamma(P \| Q)+\gamma \dot{E}_\gamma(P \| Q), & \gamma > 1 ; \\ -\lim\limits _{\beta \downarrow 1} \dot{E}_\beta(Q \| P), & \gamma = 1 ; \\ \left.\dot{E}_\beta(Q \| P)\right|_{\beta \leftarrow \frac{1}{\gamma}}, & 0 < \gamma < 1 .\end{cases} (4.94)
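    The relation (4.93) between the deGroot statistical information and the E_\gamma divergence is easy to confirm numerically, since on a finite alphabet both (4.85) and (4.90) reduce to finite sums. A minimal sketch (the pmfs are illustrative):

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.3, 0.4])

def E_gamma(P, Q, gamma):
    # E_gamma(P||Q) = D_{g_gamma}(P||Q) with g_gamma(t) = [t - gamma]^+, cf. (4.90)
    return np.sum(np.maximum(P - gamma * Q, 0.0))

def deGroot_information(p, P, Q):
    # I_p(P||Q) = min{p,1-p} - sum_a min{p P(a), (1-p) Q(a)}, cf. (4.85)-(4.86)
    return min(p, 1 - p) - np.sum(np.minimum(p * P, (1 - p) * Q))

for p in np.linspace(0.05, 0.95, 19):          # grid of priors
    lhs = deGroot_information(p, P, Q)
    rhs = p * E_gamma(P, Q, (1 - p) / p) if p <= 0.5 else (1 - p) * E_gamma(Q, P, p / (1 - p))
    assert abs(lhs - rhs) < 1e-12              # (4.93)
print("(4.93) verified on the grid of priors")
```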

    50. The fact (stated in Item 45) that no random transformation can increase the f -divergence between a pair of input probability measures suggests the possibility that the input relative information may stochastically dominate the output relative information. In other words, is it true that

    \begin{align} \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x) \leq \mathbb{F}_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} (x), \quad \,\, x\in \mathbb R , \end{align} (4.95)

    for all P_{Y|X}\colon \mathcal{{A}} \to \mathcal{{B}} and ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , with {P_{{{\mathtt{0}}}}} \to P_{Y|X} \to {Q_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} \to P_{Y|X} \to {Q_{{{\mathtt{1}}}}} ? There are indeed cases in which (4.95) not only holds but holds with strict inequality on an interval of the real line. For example, if Y is independent of the input, then \mathbb{F}_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} (x) = 1\{x \geq 0\} while \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x) = 1\{x \geq 1\, \mbox{bit}\} if {P_{{{\mathtt{1}}}}} = [0 \; \; 1], {P_{{{\mathtt{0}}}}} = [\frac12 \; \; \frac12] . However, as long as {P_{{{\mathtt{0}}}}} \ll {P_{{{\mathtt{1}}}}} , it is impossible for (4.95) to hold and be strict in any interval because that would mean

    \begin{align} 1 & = \Pi ({P_{{{\mathtt{0}}}}}\,\| {P_{{{\mathtt{1}}}}}) \end{align} (4.96)
    \begin{align} & = \int_0^\infty \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} \left( \log \frac1t \right) \, \mathrm{d}t \end{align} (4.97)
    \begin{align} & < \int_0^\infty \mathbb{F}_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} \left( \log \frac1t \right) \, \mathrm{d}t \end{align} (4.98)
    \begin{align} & = \Pi ({Q_{{{\mathtt{0}}}}}\,\| {Q_{{{\mathtt{1}}}}}), \end{align} (4.99)

    where (4.97) and (4.99) follow from (4.33). Therefore, we reach the contradiction that a coefficient of absolute continuity is strictly greater than 1.

    In this section we turn our attention to the interplay between the relative information spectra and total variation distance

    \begin{align} | P - Q | = 2 \max\limits_{A \in \mathscr{F}} | P (A) - Q(A) | . \end{align} (5.1)

    51. Theorem 6. The total variation distance between P_X and P_Y , with (P_X, P_Y)\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 , can be expressed in terms of the relative information spectra through

    \begin{align} \\[-8mm] \tfrac12 \, | P_X - P_Y | & = \bar{ \mathbb{F}}_{X\|Y} (0) - \mathbb{F}_{X\|Y} (0) \end{align} (5.2)
    \begin{align} & = \int_{0}^1 \bar{ \mathbb{F}}_{X\|Y} ( \log \beta ) \, \mathrm{d} \beta \end{align} (5.3)
    \begin{align} & = 1- \int_0^1 \mathbb{F}_{X\|Y} \left( \log \frac1\beta \right) \, \mathrm{d}\beta \end{align} (5.4)
    \begin{align} & = 1 - \Pi (X\,\|\, Y) + \int_{1}^\infty \left( 1 - \bar{ \mathbb{F}}_{X\|Y} ( \log \beta ) \right) \, \mathrm{d} \beta \end{align} (5.5)
    \begin{align} & = 1 - \Pi (Y\,\|\,X) + \int_{-\infty}^0 \left( 1 - \exp ( t ) \right) \, \mathrm{d} \bar{ \mathbb{F}}_{X\|Y}(t) \end{align} (5.6)
    \begin{align} & = 1 - \Pi (X\,\|\,Y) - \int_0^\infty \left( 1 - \exp ( t ) \right) \, \mathrm{d} \bar{ \mathbb{F}}_{X\|Y}(t) \end{align} (5.7)
    \begin{align} & = 1 -\tfrac12 \Pi (X\,\|\,Y) -\tfrac12 \Pi (Y\,\|\,X) + \tfrac12 \int_{-\infty}^\infty \left| 1 - \exp ( t ) \right| \, \mathrm{d} \bar{ \mathbb{F}}_{X\|Y}(t) \end{align} (5.8)
    \begin{align} & = 1 -\tfrac12 \Pi (X\,\|\,Y) -\tfrac12 \Pi (Y\,\|\,X) + \tfrac12 \int_{-\infty}^\infty \left| 1 - \exp ( -t ) \right| \, \mathrm{d} { \mathbb{F}}_{X\|Y}(t) \end{align} (5.9)
    \begin{align} & = 1 - \Pi (X\,\|\,Y) + \int_{0}^\infty \left( 1 - \exp ( -t ) \right) \, \mathrm{d} { \mathbb{F}}_{X\|Y}(t) \end{align} (5.10)
    \begin{align} & = 1 - \Pi (Y\,\|\,X) - \int_{-\infty}^0 \left( 1-\exp (- t ) \right) \, \mathrm{d} { \mathbb{F}}_{X\|Y}(t) \end{align} (5.11)
    \begin{align} & = 1- \Pi (Y\,\|\,X) + \int_1^\infty \mathbb{F}_{X\|Y} \left( \log \frac1\beta \right) \, \mathrm{d}\beta \end{align} (5.12)
    \begin{align} & = \mathbb{E} \left[ \left| \tanh \left( \tfrac12 \imath_{X\|Y} (W) \right) \right| \right], \quad W \sim \tfrac12 P_X + \tfrac12 P_Y, \end{align} (5.13)

    where the relative information in (5.13) is in nats and \tanh (\pm \infty) = \pm 1 .

    Proof.

    ● (5.2) Let \mathcal{{A}}_+ = \{ a\in \mathcal{{A}}\colon \imath_{X\|Y} (a) > 0 \} . Then,

    a) P_X(\mathcal{{A}}_+) - P_Y(\mathcal{{A}}_+) = \bar{ \mathbb{F}}_{X\|Y} (0) - \mathbb{F}_{X\|Y} (0) ;

    b) the absolute value in (5.1) is superfluous \Longleftarrow P_X(A) - P_Y(A) = P_Y(A^c) - P_X(A^c) ;

    c) \mathcal{{A}}_+ achieves the maximum in (5.1) because for any E\in \mathscr{F} ,

    \begin{align} P_X(\mathcal{{A}}_+) - P_X(E) & = P_X(\mathcal{{A}}_+ - E) - P_X ( E -\mathcal{{A}}_+) \end{align} (5.14)
    \begin{align} &\geq P_Y(\mathcal{{A}}_+ - E) - P_Y ( E -\mathcal{{A}}_+) \end{align} (5.15)
    \begin{align} & = P_Y(\mathcal{{A}}_+) - P_Y(E). \end{align} (5.16)

    ● (5.3) \Longleftarrow (4.24) with \beta = 1 .

    ● (5.4) \Longleftarrow (4.23) with \beta = 1 .

    ● (5.5) \Longleftarrow its right side is the right side of (5.3) \Longleftarrow (4.33).

    ● (5.6) Let \mathsf{μ} dominate \{P_X, P_Y\} and let p_X = \frac{\mathrm{d}P_X}{\mathrm{d}\mathsf{μ}} and p_Y = \frac{\mathrm{d}P_Y}{\mathrm{d}\mathsf{μ}} . Then,

    \begin{align} \tfrac12 |P_X - P_Y | & = \int \left[ p_Y - p_X \right]^+ \, \mathrm{d}\mathsf{μ} \end{align} (5.17)
    \begin{align} & = \int_{\mathcal{{S}}_{Y\|X} \cap \mathcal{{S}}_{X\|Y}^c} \left[ p_Y - p_X \right]^+ \, \mathrm{d}\mathsf{μ} + \int_{\mathcal{{S}}_{Y\|X} \cap \mathcal{{S}}_{X\|Y}} \left[ p_Y - p_X \right]^+ \, \mathrm{d}\mathsf{μ} \end{align} (5.18)
    \begin{align} & = P_Y ( \mathcal{{S}}_{Y\|X} \cap \mathcal{{S}}_{X\|Y}^c ) + \mathbb{E} \left[1\{\imath_{X\|Y} (Y) \in \mathbb R\} \left[ 1 - \exp (\imath_{X\|Y} (Y) ) \right]^+ \right] \end{align} (5.19)
    \begin{align} & = 1 - \Pi (Y\|X) + \int_{-\infty}^0 \left( 1 - \exp ( t ) \right) \, \mathrm{d} \bar{ \mathbb{F}}_{X\|Y}(t), \end{align} (5.20)

    where we have used (3.9).

    ● (5.7) Swapping X \leftrightarrow Y in (5.17),

    \begin{align} \tfrac12 |P_X - P_Y | & = \int \left[ p_X - p_Y \right]^+ \, \mathrm{d}\mathsf{μ} \end{align} (5.21)
    \begin{align} & = \int_{\mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X}^c} \left[ p_X - p_Y \right]^+ \, \mathrm{d}\mathsf{μ} + \int_{\mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X}} \left[ p_X - p_Y \right]^+ \, \mathrm{d}\mathsf{μ} \end{align} (5.22)
    \begin{align} & = P_X ( \mathcal{{S}}_{X\|Y} \cap \mathcal{{S}}_{Y\|X}^c ) + \mathbb{E} \left[1\{\imath_{X\|Y} (Y) \in \mathbb R\} \left[ \exp (\imath_{X\|Y} (Y) ) -1 \right]^+ \right] \end{align} (5.23)
    \begin{align} & = 1 - \Pi (X\|Y) + \int_0^\infty \left( \exp ( t ) - 1 \right) \, \mathrm{d} \bar{ \mathbb{F}}_{X\|Y}(t), \end{align} (5.24)

    where we have used (3.6).

    ● (5.8) \Longleftarrow its right side is the arithmetic mean of the right sides of (5.6) and (5.7).

    ● (5.9) \Longleftarrow X \leftrightarrow Y in (5.8) and (3.3).

    ● (5.10) \Longleftarrow X \leftrightarrow Y in (5.6) and (3.3).

    ● (5.11) \Longleftarrow X \leftrightarrow Y in (5.7) and (3.3).

    ● (5.12) \Longleftarrow its right side is the right side of (5.4) \Longleftarrow (4.29).

    ● (5.13) \Longleftarrow choose \mathsf{μ} = \frac12 P_X + \frac12 P_Y in |P_X - P_Y | = \int \left| p_X - p_Y \right| \, \mathrm{d}\mathsf{μ} and note that

    \begin{align} \left| \tanh \left(\tfrac12 \log_\mathrm{e} \frac{p_X}{p_Y} \right) \right| = \frac{|p_X-p_Y|}{p_X+p_Y}\; \; \mbox{if}\; \; (p_X,p_Y)\in [0,\infty)^2 -\{(0,0)\}. \end{align} (5.25)
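    For finite alphabets with full support, several of the identities in Theorem 6 become finite sums. The following minimal Python sketch (illustrative pmfs; relative information in nats; the spectrum \mathbb{F}_{X\|Y} is evaluated under X\sim P_X and \bar{ \mathbb{F}}_{X\|Y} under Y\sim P_Y , consistently with (5.2) and its proof) checks (5.2) and (5.13) against the direct evaluation of (5.1).

```python
import numpy as np

PX = np.array([0.5, 0.3, 0.2])
PY = np.array([0.2, 0.3, 0.5])
i_XY = np.log(PX / PY)                       # relative information (in nats) at each atom

tv_half = 0.5 * np.sum(np.abs(PX - PY))      # (1/2)|P_X - P_Y|, cf. (5.1)

# (5.2): the spectrum under X ~ P_X and the spectrum under Y ~ P_Y, both evaluated at 0
F_bar_at_0 = np.sum(PY[i_XY <= 0])
F_at_0 = np.sum(PX[i_XY <= 0])
assert abs(tv_half - (F_bar_at_0 - F_at_0)) < 1e-12

# (5.13): E[ |tanh( i_{X||Y}(W)/2 )| ] with W ~ (P_X + P_Y)/2
W = 0.5 * (PX + PY)
assert abs(tv_half - np.sum(W * np.abs(np.tanh(0.5 * i_XY)))) < 1e-12

print("Theorem 6: (5.2) and (5.13) verified; (1/2)|P_X - P_Y| =", tv_half)
```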

    52. Under the assumption P_X \ll P_Y , several of the representations in Theorem 6 can be found in [36, Theorem 12] and earlier in [42]. In addition, [36, Theorem 15] gives upper bounds on total variation distance as a function of the relative information spectrum if P_X \ll P_Y . Since those results are based on (5.3)–(5.4), which continue to hold in general, they too hold without restrictions on absolute continuity. In particular, the monotonicity of the relative information spectra and (5.3)–(5.4) result in

    \begin{align} \tfrac12 \, | P_X - P_Y | &\leq \inf\limits_{\delta > 0} \left\{ \left( 1 - \exp(-\delta) \right) \,\bar{ \mathbb{F}}_{X\|Y} ( 0) +\exp(-\delta) \, \bar{ \mathbb{F}}_{X\|Y} ( -\delta ) \right\}, \end{align} (5.26)
    \begin{align} \tfrac12 \, | P_X - P_Y | &\leq 1- \sup\limits_{\delta > 0} \left\{ \left( 1 - \exp(-\delta) \right) \,{ \mathbb{F}}_{X\|Y} ( 0) +\exp(-\delta) \, { \mathbb{F}}_{X\|Y} ( \delta ) \right\}, \end{align} (5.27)

    which coincide with Le Cam's upper bounds in [19, p. 51], except that he weakens (5.27) by forbidding \delta > \log 2 . As noted in [36], further strengthening of (5.26) [resp., (5.27)] is possible if \bar{ \mathbb{F}}_{X\|Y} (-\Delta) = 0 [resp., { \mathbb{F}}_{X\|Y} (\Delta) = 0 ] for some \Delta > 0 .

    53. We can also lower bound total variation distance using Theorem 6 and the monotonicity of the relative information spectra. The following result supersedes Le Cam's lower bound in [19, p. 50], as well as the lower bounds in [36, Lemmas 17 and 18] claimed under P_X \ll P_Y .

    Theorem 7. For arbitrary (P_X, P_Y) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 and \delta > 0 ,

    \begin{align} \tfrac12 \, | P_X - P_Y | &\geq \exp(-\delta) \left( 1 - \Pi (X\,\|\, Y) \right) + \left( 1 - \exp(-\delta) \right) \mathbb{P} [ \imath_{X\|Y} (X) \geq \delta], \end{align} (5.28)
    \begin{align} \tfrac12 \, | P_X - P_Y | &\geq 1 - \Pi (Y\,\|\, X) + \left( \exp(\delta) -1\right)\, { \mathbb{F}}_{X\|Y} ( -\delta ), \end{align} (5.29)
    \begin{align} \tfrac12 \, | P_X - P_Y | &\geq 1 - \Pi (X\,\|\, Y) + \left( \exp(\delta) - 1 \right) \, \mathbb{P} [ \imath_{X\|Y} (Y) \geq \delta], \end{align} (5.30)
    \begin{align} \tfrac12 \, | P_X - P_Y | &\geq \exp(-\delta)\left( 1 - \Pi (Y\,\|\, X) \right) + \left( 1 - \exp(-\delta) \right)\, \bar{ \mathbb{F}}_{X\|Y} ( -\delta ). \end{align} (5.31)

    Proof.

    ● (5.28) \Longleftarrow (5.4) and { \mathbb{F}}_{X\|Y}(t) \leq \Pi (X\|Y) \, 1\{ t\geq \delta\} + (1 - \mathbb{P} [ \imath_{X\|Y} (X) \geq \delta])\, 1\{ t < \delta\} .

    ● (5.29) \Longleftarrow (5.12) and { \mathbb{F}}_{X\|Y}(t) \geq { \mathbb{F}}_{X\|Y}(- \delta) \, 1\{t \geq -\delta\} .

    ● (5.30) \Longleftarrow X \leftrightarrow Y in (5.29) and (4.9).

    ● (5.31) \Longleftarrow X \leftrightarrow Y in (5.28) and (4.10).
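    The bounds in Theorem 7 can be evaluated exactly on a finite alphabet. A minimal sketch (illustrative full-support pmfs, so that \Pi (X\,\|\,Y) = \Pi (Y\,\|\,X) = 1 ; natural logarithms) checks (5.28)–(5.31) against the exact value of \tfrac12 | P_X - P_Y | over a grid of \delta :

```python
import numpy as np

PX = np.array([0.8, 0.15, 0.05])
PY = np.array([0.2, 0.35, 0.45])
i_XY = np.log(PX / PY)                       # relative information in nats
tv_half = 0.5 * np.sum(np.abs(PX - PY))

for d in np.linspace(0.05, 3.0, 60):
    lb_528 = (1 - np.exp(-d)) * np.sum(PX[i_XY >= d])    # (5.28) with Pi(X||Y) = 1
    lb_529 = (np.exp(d) - 1) * np.sum(PX[i_XY <= -d])    # (5.29) with Pi(Y||X) = 1
    lb_530 = (np.exp(d) - 1) * np.sum(PY[i_XY >= d])     # (5.30)
    lb_531 = (1 - np.exp(-d)) * np.sum(PY[i_XY <= -d])   # (5.31)
    assert max(lb_528, lb_529, lb_530, lb_531) <= tv_half + 1e-12

print("Theorem 7 lower bounds hold on the grid; exact value:", tv_half)
```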

    54. Even if P_{X_1}\in {\mathscr{P}_{\!\!\mathcal{{A}}}} and P_{X_2}\in {\mathscr{P}_{\!\!\mathcal{{A}}}} are close in total variation distance, their relative informations with respect to a third probability measure may behave quite differently.

    Example. [19, p. 50]. Let \mathcal{{A}} = [0, \infty) , and suppose that P_{X_1} , P_{X_2} , and P_Y are uniform on [0, n^2] , [1, n^2] , and [0, 1] , respectively. Then, for all \alpha \in \mathbb R ,

    \begin{align} \bar{ \mathbb{F}}_{X_1\|Y} (\alpha) & = 1\{ \alpha \geq -\log n^2 \} , \end{align} (5.32)
    \begin{align} \bar{ \mathbb{F}}_{X_2\|Y} (\alpha) & = 1, \end{align} (5.33)
    \begin{align} | P_{X_1} - P_{X_2} | & = \frac{2}{n^2}, \end{align} (5.34)
    \begin{align} \mathbb{P} [ \imath_{X_1\|Y} ( Y ) - \imath_{X_2\|Y} ( Y ) = \infty] & = 1. \end{align} (5.35)

    55. The NP-divergence between (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 is defined as

    \begin{align} S ( P \, \| \, Q ) = | P\otimes Q - Q\otimes P|. \end{align} (6.1)

    The terminology is motivated by an important operational role for S (P \, \| \, Q) shown in Section 8 in the context of non-Bayesian hypothesis testing. For now, we point out that the NP-divergence satisfies a simple Bayesian hypothesis testing operational role. Recall that the minimum average probability of error is equal to \frac12 - \frac14 | P - Q| for equally likely hypotheses P and Q . Now, suppose that we obtain a pair of observations, one drawn from P and the other from Q , but we do not know the order of the pair and have no reason to favor one ordering over the other. Therefore, we have the equally likely hypotheses

    \begin{aligned} & \mathsf{H}_L:\left(y_1, y_2\right) \sim P \otimes Q \\ & \mathsf{H}_R:\left(y_1, y_2\right) \sim Q \otimes P \end{aligned}

    and the minimum probability of erroneous ordering is \frac12 - \frac14 S (P \, \| \, Q) .

    56. Blind wine tasting. Offered a glass of 1982 Château Pétrus and a glass of 1990 Château Margaux, we are asked to identify which one is which. Suppose that for a given set of environmental conditions (temperature, lighting, etc.), P and Q stand for the probability measures of the respective wines on the space of visual, olfactory, and gustatory sensations. The minimum probability of error is equal to \frac12 - \frac14 S (P \, \| \, Q) since, a priori, the contents of the glasses are equally likely. Wanting to show off, a confident wine connoisseur makes a decision on the basis of tasting only one of the glasses. Then, the minimum probability of error is \frac12 - \frac14 | P - Q | . Indeed, as shown below, | P - Q | \leq S (P \, \| \, Q). If we do not condition on a given set of environmental conditions, the tasting sensations of both wines are dependent, mainly because of their dependence on temperature. In that case, S (P \, \| \, Q) is generalized to |P_{XY} - P_{YX}| , which can be applied whenever X and Y are defined on the same space. The potential utility of such a measure of asymmetry of joint probability measures is yet to be explored.

    57. Example. If P = [\, p\; \; 1-p\, ] and Q = [\, q\; \; 1-q\, ] , then S \left(P \, \|\, Q\right) = \left| P -Q \right| = 2 \, |p-q| .

    58. Example. If P = \left[\, \frac12\; \frac12\; 0\, \right] and Q = \left[\, 0\; \frac12\; \frac12\, \right] , then |P - Q| = 1 while S(P\, \|\, Q) = \frac{3}{2} .

    59. Example. S \left(\mathcal{N}\!\left({{\mu_1}}, {{\sigma^2}}\right) \, \|\, \mathcal{N}\!\left({{\mu_0}}, {{\sigma^2}}\right) \right) = 2 - 4 \, \mathrm{Q} \left(\frac{|\mu_1 - \mu_0 |}{\sqrt{2} \sigma} \right) = \left| \mathcal{N}\!\left({{\mu_1}}, {{\frac{\sigma^2}{2}}}\right) - \mathcal{N}\!\left({{\mu_0}}, {{\frac{\sigma^2}{2}}}\right) \right|.
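    The two finite-alphabet examples (Items 57 and 58) can be verified mechanically, since on finite alphabets (6.1) is a finite sum over the product space. A minimal Python sketch (the particular values of p and q are illustrative):

```python
import numpy as np

def np_divergence(P, Q):
    # S(P||Q) = |P⊗Q - Q⊗P|, the L1 distance between the two product pmfs, cf. (6.1)
    return np.sum(np.abs(np.outer(P, Q) - np.outer(Q, P)))

# Item 57: binary alphabet, S(P||Q) = |P - Q| = 2|p - q|
p, q = 0.7, 0.25
P, Q = np.array([p, 1 - p]), np.array([q, 1 - q])
assert abs(np_divergence(P, Q) - 2 * abs(p - q)) < 1e-12

# Item 58: |P - Q| = 1 while S(P||Q) = 3/2
P = np.array([0.5, 0.5, 0.0])
Q = np.array([0.0, 0.5, 0.5])
assert abs(np.sum(np.abs(P - Q)) - 1.0) < 1e-12
assert abs(np_divergence(P, Q) - 1.5) < 1e-12

print("Examples in Items 57 and 58 verified")
```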

    60. The NP-divergence satisfies the following properties.

    Theorem 8. Let (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 . Then,

    (a) S(P\, \|\, Q) = S(Q\, \|\, P) .

    (b)

    \begin{align} \\[-10mm] 0 \leq S ( P \, \| \, Q ) \leq 2, \end{align} (6.2)

    with equality on the left if and only if P = Q , and equality on the right if and only if P \perp Q .

    (c) S (P\, \|\, Q) does not satisfy the triangle inequality.

    (d)

    \begin{align} \\[-10mm] S (P_{X_1} \otimes \cdots \otimes P_{X_m} \, \| \, Q_{X_1} \otimes \cdots \otimes Q_{X_m}) \leq \sum\limits_{i = 1}^m S ( P_{X_i} \, \| \, Q_{X_i} ). \end{align} (6.3)

    (e) If P \neq Q , then

    \begin{align} \tfrac12 S (P^{\otimes n} \, \| \, Q^{\otimes n} ) = 1 - \exp \left(-2 n \, B ( P\,\| \, Q) + o(n) \right), \end{align} (6.4)

    where B(P\| Q) is the Bhattacharyya distance in Item 24.

    (f) If |\mathcal{{A}}| = 2 , then S (P\, \| \, Q) = | P - Q | . In general,

    \begin{align} | P - Q | \leq S ( P\, \| \, Q ) \leq 2\, | P - Q | - \tfrac12 | P - Q | ^2. \end{align} (6.5)

    (g)

    \begin{align} \\[-10mm] \tfrac12 S (P\,\|\, Q ) \geq 1 - \Pi (P\,\|\,Q) \cdot \Pi (Q\,\|\,P). \end{align} (6.6)

    (h) Data processing inequality. If P_{{X_{{{\mathtt{1}}}}}} \to P_{Y|X} \to P_{{Y_{{{\mathtt{1}}}}}} and P_{{X_{{{\mathtt{0}}}}}} \to P_{Y|X} \to P_{{Y_{{{\mathtt{0}}}}}} , for some random transformation P_{Y|X}\colon \mathcal{{A}} \to \mathcal{{B}} , then

    \begin{align} S ({Y_{{{\mathtt{1}}}}}\,\|\, {Y_{{{\mathtt{0}}}}} ) \leq S ({X_{{{\mathtt{1}}}}}\,\|\, {X_{{{\mathtt{0}}}}} ). \end{align} (6.7)

    (i) No convex f\colon (0, \infty) \to \mathbb R exists so that D_f (P\, \|\, Q) = S (P\, \|\, Q) for all (P, Q) \in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 .

    (j) With (X, Y) \sim P_X \otimes P_Y ,

    \begin{align} \tfrac12 | P_X\otimes P_Y - P_Y\otimes P_X| = \mathbb{P} [ \imath_{X\|Y} (X) > \imath_{X\|Y} (Y) ] - \mathbb{P} [ \imath_{X\|Y} (X) < \imath_{X\|Y} (Y) ]. \end{align} (6.8)

    (k)

    \begin{align} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) \; \Longrightarrow\; S({P_{{{\mathtt{1}}}}}\,\|\, {P_{{{\mathtt{0}}}}}) = S({Q_{{{\mathtt{1}}}}}\,\|\, {Q_{{{\mathtt{0}}}}}). \end{align} (6.9)

    Proof.

    (a) \Longleftarrow |P - Q | = |Q -P| .

    (b) The inequalities follow because S(P\|Q) is a total variation distance. Moreover,

    \begin{align} S(P\,\|\,Q) & = 0 \; \Longleftrightarrow\; P\otimes Q = Q \otimes P \; \Longleftrightarrow\; P = Q, \end{align} (6.10)
    \begin{align} S(P\,\|\,Q) & = 2 \; \Longleftrightarrow\; P\otimes Q \perp Q \otimes P \; \Longleftrightarrow\; P \perp Q. \end{align} (6.11)

    (c) P = \left[\, \frac12\; \frac12\; 0 \, \right] , Q = \left[\, \frac13\; \frac13\; \frac13\, \right] , R = \left[\, 0\; \frac12\; \frac12\, \right] . Then, S(P\|Q) + S(Q\|R) < S(P\|R) , since

    \begin{align} S(P\|Q) = S(Q\|R) = \tfrac{2}{3}, \quad\mbox{and}\quad S(P\|R) = \tfrac{3}{2}. \end{align} (6.12)

    (d) Total variation distance satisfies the tensorization bound

    \begin{align} | P_{X_1} \otimes \cdots \otimes P_{X_m} - Q_{X_1} \otimes \cdots \otimes Q_{X_m} | \leq \sum\limits_{i = 1}^m | P_{X_i} - Q_{X_i} |. \end{align} (6.13)

    Letting (P_{X_i}, Q_{X_i}) \leftarrow (P_{X_i} \otimes Q_{X_i}, Q_{X_i} \otimes P_{X_i}) in (6.13) and relabeling the coordinates of the resulting product measures yields (6.3).

    (e) The proof consists of three building blocks:

    ⅰ. As shown by Chernoff [14], if P \neq Q ,

    \begin{align} \tfrac{1}{2} | P^{\otimes n} - Q^{\otimes n} | = 1 - \exp \left(-n \, C ( P\,\| \, Q) + o(n) \right), \end{align} (6.14)

    where the Chernoff information is defined as

    \begin{align} C ( P \,\|\, Q ) = \sup\limits_{\alpha \in (0,1)} (1 - \alpha ) D_\alpha ( P \,\|\, Q ). \end{align} (6.15)

    ⅱ. By relabeling of indices,

    \begin{align} S( P^{\otimes n} \,\|\, Q^{\otimes n} ) = | P^{\otimes n} \otimes Q^{\otimes n} - Q^{\otimes n} \otimes P^{\otimes n} | = | (P\otimes Q)^{\otimes n} - (Q \otimes P)^{\otimes n} |. \end{align} (6.16)

    ⅲ.

    \begin{align} C ( P\otimes Q \,\|\, Q\otimes P ) & = \sup\limits_{\alpha \in (0,1)} (1 - \alpha ) D_\alpha ( P\otimes Q \,\|\, Q \otimes P) \end{align} (6.17)
    \begin{align} & = \sup\limits_{\alpha \in (0,1)} \left\{ (1 - \alpha ) D_\alpha ( P \,\|\, Q ) + (1 - \alpha ) D_\alpha ( Q \,\|\, P ) \right\} \end{align} (6.18)
    \begin{align} &\geq \tfrac12 D_\frac12 ( P \,\|\, Q ) + \tfrac12 D_\frac12 ( Q \,\|\, P ) \end{align} (6.19)
    \begin{align} & = 2 \, B( P \,\|\, Q ), \end{align} (6.20)

    according to (3.51). To show that equality holds in (6.19), note that the function of \alpha within \{ \} is concave (e.g., [30]) with derivative

    \begin{align} \frac{\mathrm{d}}{\mathrm{d} \alpha} (1 - \alpha ) \left( D_\alpha ( P \,\|\, Q ) + D_\alpha ( Q \,\|\, P )\right) = \frac{\int \left( \frac{p}{q}\right)^\alpha q \,\log \frac{q}{p}\, \mathrm{d}\mathsf{μ}}{\int \left( \frac{p}{q}\right)^\alpha q\, \mathrm{d}\mathsf{μ} } + \frac{\int\left( \frac{q}{p}\right)^\alpha p \,\log \frac{p}{q} \, \mathrm{d}\mathsf{μ} }{\int \left( \frac{q}{p}\right)^\alpha p\, \mathrm{d}\mathsf{μ}}, \end{align} (6.21)

    which equals 0 at \alpha = \frac12 .

    (f) For any joint probability measures P_{XY} and Q_{XY} on the same product space,

    \begin{align} | P_{XY} - Q_{XY} | \geq \max \{ | P_X - Q_X | , | P_Y - Q_Y | \}. \end{align} (6.22)

    Letting P_{XY} = P \otimes Q and Q_{XY} = Q \otimes P yields S (P\, \| \, Q) \geq | P - Q | . As we saw in Item 57, equality holds for binary \mathcal{{A}} . The right inequality in (6.5) is a special case of

    \begin{align} | {P_{{{\mathtt{0}}}}} \otimes {P_{{{\mathtt{1}}}}} - {Q_{{{\mathtt{0}}}}} \otimes {Q_{{{\mathtt{1}}}}} | = | {P_{{{\mathtt{0}}}}} - {Q_{{{\mathtt{0}}}}} | + | {P_{{{\mathtt{1}}}}} - {Q_{{{\mathtt{1}}}}} | - \tfrac12 | {P_{{{\mathtt{0}}}}} - {Q_{{{\mathtt{0}}}}} | \cdot | {P_{{{\mathtt{1}}}}} - {Q_{{{\mathtt{1}}}}} |, \end{align} (6.23)

    proved in [43] in the discrete case by means of the Strassen-Dobrushin coupling representation of total variation [44,45], which requires that ({P_{{{\mathtt{0}}}}}, {Q_{{{\mathtt{0}}}}}) be probability measures on a measurable space (\mathcal{{A}}_{\mathtt{0}}, \mathscr{F}_{\mathtt{0}}) such that \{ (a, a)\colon a \in \mathcal{{A}}_{\mathtt{0}}\} \in \mathscr{F}_{\mathtt{0}} \otimes \mathscr{F}_{\mathtt{0}} , and analogously for ({P_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{1}}}}}) . This is satisfied by any Polish \mathcal{{A}}_{\mathtt{0}} endowed with its Borel field.

    (g) Using the fact that S(P\, \|\, Q) = D_f (P \otimes Q \, \| \, Q \otimes P) with f(t) = |1-t| and dropping the first term on the right side of (4.73), we obtain

    \begin{align} S(P\, \| \,Q) &\geq 2 - \Pi (P \otimes Q \, \| \, Q \otimes P) - \Pi (Q \otimes P \, \| \, P \otimes Q) \end{align} (6.24)
    \begin{align} & = 2 - 2\, \Pi (P \, \| \, Q )\cdot \Pi (Q \, \| \,P ), \end{align} (6.25)

    where (6.25) follows from Item 9.

    (h) Given P_{Y|X}\colon(\mathcal{{A}}, \mathscr{F}) \to (\mathcal{{B}}, \mathscr{G}) , construct the random transformation

    P_{Y_1 Y_2 | X_1 X_2} \colon (\mathcal{{A}}^2, \mathscr{F}^2) \to (\mathcal{{B}}^2, \mathscr{G}^2) defined by

    P_{Y_1 Y_2 | X_1 X_2} (G_1 \times G_2 \,|\, a_1, a_2) = P_{Y | X} (G_1 \,|\, a_1)P_{Y | X} (G_2 \,|\, a_2),\quad (G_1, G_2, a_1, a_2)\in \mathscr{G}^2 \times \mathcal{{A}}^2.

    Note that

    P_{{X_{{{\mathtt{0}}}}}}\otimes P_{{X_{{{\mathtt{1}}}}}} \to P_{Y_1 Y_2 | X_1 X_2} \to P_{{Y_{{{\mathtt{0}}}}}}\otimes P_{{Y_{{{\mathtt{1}}}}}}

    and

    P_{{X_{{{\mathtt{1}}}}}}\otimes P_{{X_{{{\mathtt{0}}}}}} \to P_{Y_1 Y_2 | X_1 X_2} \to P_{{Y_{{{\mathtt{1}}}}}}\otimes P_{{Y_{{{\mathtt{0}}}}}}.

    Applying the data processing inequality for total variation distance to P_{Y_1 Y_2 | X_1 X_2} with inputs P_{{X_{{{\mathtt{0}}}}}}\otimes P_{{X_{{{\mathtt{1}}}}}} and P_{{X_{{{\mathtt{1}}}}}}\otimes P_{{X_{{{\mathtt{0}}}}}} , we obtain

    \begin{align} S ({X_{{{\mathtt{1}}}}}\,\|\, {X_{{{\mathtt{0}}}}} ) = | P_{{X_{{{\mathtt{0}}}}}}\otimes P_{{X_{{{\mathtt{1}}}}}} - P_{{X_{{{\mathtt{1}}}}}}\otimes P_{{X_{{{\mathtt{0}}}}}} | \geq | P_{{Y_{{{\mathtt{0}}}}}}\otimes P_{{Y_{{{\mathtt{1}}}}}} - P_{{Y_{{{\mathtt{1}}}}}}\otimes P_{{Y_{{{\mathtt{0}}}}}} | = S ({Y_{{{\mathtt{1}}}}}\,\|\, {Y_{{{\mathtt{0}}}}} ). \end{align} (6.26)

    (i) We proceed by contradiction and assume that there exists a convex f\colon (0, \infty) \to \mathbb R such that S(P\, \| \, Q) = D_f (P\, \| \, Q) for all (P, Q)\in {\mathscr{P}_{\!\!\mathcal{{A}}}}^2 . Since S(P\, \|\, Q) = | P - Q| in the special case |\mathcal{{A}}| = 2 , Lemma 10 implies that there exists \alpha \in \mathbb R such that f(t) = |1-t| + \alpha \, t - \alpha , i.e., S(P\, \| \, Q) = | P-Q| , which contradicts the examples in Items 58 and 59. An alternative route is to verify that NP-divergence fails to satisfy Lemma 11 by considering the special case \lambda = \frac12 , {P_{{{\mathtt{1}}}}} = {Q_{{{\mathtt{1}}}}} = [ \, \frac14 \; \frac34\; 0\; 0\, ] , {P_{{{\mathtt{0}}}}} = [ \, 0\; 0\; \frac14 \; \frac34\, ] , {Q_{{{\mathtt{0}}}}} = [ \, 0\; 0\; \frac34 \; \frac14\, ] .

    (j) With the notation used in the proof of Theorem 6,

    \begin{align} \mathbb{P} [ \imath_{X\|Y} (X) > \imath_{X\|Y} (Y) ] & = \iint p_X(a)\, p_Y(b) \, 1\{p_X(a)\, p_Y(b) > p_X(b)\, p_Y(a) \} \, \mathrm{d}\mu \, \mathrm{d}\mu, \end{align} (6.27)
    \begin{align} \mathbb{P} [ \imath_{X\|Y} (X) < \imath_{X\|Y} (Y) ] & = \iint p_X(a)\, p_Y(b) \, 1\{p_X(a)\, p_Y(b) < p_X(b)\, p_Y(a) \} \, \mathrm{d}\mu\, \mathrm{d}\mu \\ & = \iint p_X(b)\, p_Y(a) \, 1\{p_X(b) \, p_Y(a) < p_X(a)\, p_Y(b) \} \, \mathrm{d}\mu\, \mathrm{d}\mu . \end{align} (6.28)

    Then, (6.8) follows from (5.17) and (6.1).

    (k) The terms in the right side of (6.8) are determined by the relative information spectra:

    \begin{align} \mathbb{P} [ \imath_{X\|Y} (X) > \imath_{X\|Y} (Y) ] & = 1 - \mathbb{E} \left[ \mathbb{F}_{X\|Y} (\imath_{X\|Y} (Y)) \right], \end{align} (6.29)
    \begin{align} \mathbb{P} [ \imath_{X\|Y} (X) < \imath_{X\|Y} (Y) ] & = 1 - \mathbb{E} \left[\bar{ \mathbb{F}}_{X\|Y} (\imath_{X\|Y} (X)) \right]. \end{align} (6.30)

    Any pair such that (P, Q) \not \equiv (Q, P) provides a counterexample to \Longleftarrow in (6.9).
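    Parts of the proof of Theorem 8 lend themselves to quick numerical confirmation on finite alphabets. The following minimal Python sketch (illustrative full-support pmfs; natural logarithms) checks the identity (6.8) of part (j), as well as the fact, used in part (e), that the concave objective in (6.18) is maximized at \alpha = \tfrac12 with value 2 B(P\,\|\,Q) .

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.1, 0.4, 0.5])

# --- part (j): identity (6.8) ---
i = np.log(P / Q)                                   # relative information i_{P||Q} at each atom
half_S = 0.5 * np.sum(np.abs(np.outer(P, Q) - np.outer(Q, P)))
joint = np.outer(P, Q)                              # (X, Y) ~ P ⊗ Q
gt = np.sum(joint[i[:, None] > i[None, :]])         # P[ i(X) > i(Y) ]
lt = np.sum(joint[i[:, None] < i[None, :]])         # P[ i(X) < i(Y) ]
assert abs(half_S - (gt - lt)) < 1e-12

# --- part (e): sup over alpha of (1-a)D_a(P||Q) + (1-a)D_a(Q||P), cf. (6.17)-(6.20) ---
def neg_log_moment(R, S, a):
    # (1 - a) D_a(R||S) = -log sum_x R(x)^a S(x)^(1-a)
    return -np.log(np.sum(R ** a * S ** (1 - a)))

alphas = np.linspace(0.01, 0.99, 981)
objective = np.array([neg_log_moment(P, Q, a) + neg_log_moment(Q, P, a) for a in alphas])
bhattacharyya = -np.log(np.sum(np.sqrt(P * Q)))      # B(P||Q)
assert abs(alphas[np.argmax(objective)] - 0.5) < 1e-2
assert abs(objective.max() - 2 * bhattacharyya) < 1e-4

print("(6.8) and the identity C(P⊗Q||Q⊗P) = 2B(P||Q) confirmed numerically")
```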

    61. Although the NP-divergence is not an f -divergence, Theorem 8–(f) implies that it is a g -divergence [46]. Several properties for the measure of dependence \inf_{Q_Y\in {\mathscr{P}_{\!\!\mathcal{{B}}}}} S (P_{XY} \, \|\, P_X \otimes Q_Y) can be obtained by specializing [46, Theorem 8].

    62. It may be useful to generalize the NP-divergence by replacing the total variation distance by any other f -divergence, i.e., define

    \begin{align} S_f ( P\,\|\, Q) = D_f ( P\otimes Q \, \| \, Q \otimes P), \end{align} (6.31)

    which satisfies S_f (P\, \|\, Q) = S_f (Q\, \|\, P) even if D_f is not symmetric.
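    As an illustration of (6.31) (an observation not in the original text, but immediate from the chain rule of relative entropy): with f(t) = t\log t , i.e., D_f = D , the definition gives S_f(P\,\|\,Q) = D(P\otimes Q\,\|\,Q\otimes P) = D(P\,\|\,Q) + D(Q\,\|\,P) , the symmetrized (Jeffreys) divergence, which is symmetric even though D itself is not. A quick numerical confirmation with illustrative pmfs:

```python
import numpy as np

def relative_entropy(P, Q):
    # D(P||Q) in nats; full-support pmfs assumed
    return np.sum(P * np.log(P / Q))

P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.2, 0.3, 0.5])

# S_f with f(t) = t log t, evaluated on the product space, cf. (6.31)
S_kl = relative_entropy(np.outer(P, Q).ravel(), np.outer(Q, P).ravel())
assert abs(S_kl - (relative_entropy(P, Q) + relative_entropy(Q, P))) < 1e-12
print("S_f(P||Q) with f(t) = t log t equals D(P||Q) + D(Q||P):", S_kl)
```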

    Since its inception by Ronald Fisher in [47], the concept of sufficient statistics has played a fundamental role in mathematical statistics. This section offers a brief review of the various notions of sufficient statistics proposed in the literature, as well as their interrelationships, emphasizing the connections with information theory. Moreover, we propose a new notion of sufficient statistics that builds upon the notion of equivalent pairs.

    63. The basic setup in this section has the following ingredients:

    ● measurable spaces (\mathcal{{Y}}, \mathscr{F}) and (\mathcal{{Z}}, \mathscr{G}) ;

    ● a parameter set \Theta ;

    ● a data model (collection of distributions on (\mathcal{{Y}}, \mathscr{F}) ): \mathscr{P} = \{ P_{Y|V = \theta} \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}, \theta \in \Theta \} ;

    ● a random transformation: P_{Z|Y}\colon (\mathcal{{Y}}, \mathscr{F}) \to (\mathcal{{Z}}, \mathscr{G}) .

    An inference on the unknown parameter \theta \in \Theta is made on the basis of the output of the random transformation P_{Z|Y} when its input is distributed according to P_{Y|V = \theta} . Recall from Item 3 that no joint distribution P_{VY} is assumed to exist. In fact, the setting is non-Bayesian: No distribution is assumed on the set of parameters \Theta , i.e., P_V need not be defined. The question to be formalized is: Under what conditions does the random transformation P_{Z|Y} preserve all the information in Y that is relevant for inferring the parameter? Before proceeding, note that most of the statistical literature restricts attention to deterministic transformations, i.e., Z = f(Y) for a (\mathscr{F}, \mathscr{G}) -measurable f\colon \mathcal{{Y}} \to \mathcal{{Z}} . Allowing random transformations (as in [15,48,49]) is practically useful, since sometimes the data is observed through an inherently random mechanism which, nevertheless, does not spoil the relevant information. For example, if \mathcal{{Y}} = \mathcal{{B}}^n , and under each \theta \in \Theta , P_{Y|V = \theta} = P_{\theta} \otimes \cdots \otimes P_{\theta} with P_\theta \in {\mathscr{P}_{\!\!\mathcal{{B}}}} , then a random interleaver P_{Z|Y} \colon \mathcal{{B}}^n \to \mathcal{{B}}^n preserves all the information in the observed n -tuple relevant to the inference of \theta \in \Theta because \{ P_{Y|V = \theta} \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}, \, \theta \in \Theta \} = \{ P_{Z|V = \theta} \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}, \, \theta \in \Theta \} . At any rate, allowing random transformations is really a matter of mathematical convenience/elegance; in fact it does not widen the scope since the randomness can be incorporated into the data model: Letting P_{Y|V = \theta} \leftarrow P_{Z|Y}P_{Y|V = \theta} and f(z, y) = z subsumes the notion of random transformations as sufficient statistics into the classical deterministic transformations.

    64. Fisher's notion [47] states that Z is a sufficient statistic of Y for the collection \mathscr{P} = \{ P_{Y|V = \theta}, \theta \in \Theta \} if

    P_{Y|Z,V = \theta}\;\mbox{does not depend on}\;\theta,

    where, given that the unknown parameter is V = \theta , the joint probability measure of Y and Z is P_{YZ|V = \theta} = P_{Y|V = \theta} P_{Z|Y} . Following [50], when distinguishing from other notions of sufficient statistics, the sufficiency in the sense of this item is referred to as classical sufficiency.

    65. Although P_{YZ|V = \theta} is always well defined, we have to face the unfortunate fact that the conditional probability measure P_{Y|Z, V = \theta} need not exist when \mathcal{{Y}} is uncountable. The existence of such a conditional probability measure requires that for every B \in \mathscr{F} , there exist a \mathscr{G} -measurable \phi_B \colon \mathcal{{Z}} \to [0, 1] such that for all (\theta, B_0) \in \Theta\times\mathscr{G} ,

    \begin{align} P_{YZ|V = \theta} ( B \times B_0) = \mathbb{E} [ \phi_B (Z) 1\{Z \in B_0\} | V = \theta], \end{align} (7.1)

    and \phi_\cdot (z) \in {\mathscr{P}_{\!\!\mathcal{{Y}}}} for z\in \mathcal{{Z}}_\theta , with P_{Z|V = \theta} (\mathcal{{Z}}_\theta) = 1 for all \theta\in \Theta . To guarantee that this is the case, it is customary to abide by the restriction that (\mathcal{{Y}}, \mathscr{F}) is a standard measurable space (i.e., it is isomorphic to (E, \mathscr{B}_E) for some Borel subset of the real line E \in \mathscr{B} ). Without such a restriction, Dieudonné [51] showed a counterexample where the required conditional probability does not exist, in which case the notion of sufficient statistics in the classical sense is vacuous. Whenever classical sufficiency is considered, it is typically assumed that the observation space is standard, even if this is not explicitly stated. As we see in Item 66, we need to place another restriction on the data model to make the notion of classical sufficiency well-behaved.

    66. Bahadur [49] introduced the slightly more succinct notion of a sufficient \sigma -field \bar{\mathscr{F}} \subset \mathscr{F} , meaning that for every B \in \mathscr{F} there exists a \bar{\mathscr{F}} -measurable \varphi_B\colon \mathcal{{Y}} \to [0, 1] , such that for all (\theta, B_0) \in \Theta \times\bar{\mathscr{F}} ,

    \begin{align} P_{Y|V = \theta} (B \cap B_0 ) = \mathbb{E} [ \varphi_B (Y) 1\{Y \in B_0\} | V = \theta]. \end{align} (7.2)

    Then, a measurable function f\colon\mathcal{{Y}}\to \mathcal{{Z}} is sufficient if and only if the \sigma -field it induces is sufficient. Curiously, the sufficiency of \bar{\mathscr{F}} does not guarantee the sufficiency of every \sigma -field \hat{\mathscr{F}} such that \bar{\mathscr{F}} \subset \hat{\mathscr{F}} \subset \mathscr{F} [25]. Indeed, there may exist f (y) = h (g(y)) that is sufficient even though g is not sufficient. Fortunately, if the data model \mathscr{P} is dominated, not only is that anomalous behavior impossible [49], but the notions of sufficient random transformation and sufficient \sigma -field are equivalent [52] (see also [37, (6.38)]).

    67. Due to Halmos and Savage [3, Corollary 1], formalizing earlier ideas of Fisher [47, p. 331] and Neyman [53, Theorem II] in restricted settings, the following result is known as the factorization theorem.

    Theorem 9. Suppose that (\mathcal{{Y}}, \mathscr{F}) is standard, \mathscr{P} is dominated, and P_{Z|Y} is a deterministic transformation, i.e., P_{Z|Y = y} = \delta_{f(y)} , for a (\mathscr{F}, \mathscr{G}) -measurable f\colon \mathcal{{Y}} \to \mathcal{{Z}} . Then, Z is a classically sufficient statistic of Y for \mathscr{P} if and only if there exist Borel-measurable w\colon \mathcal{{Y}}\to [-\infty, \infty) and v\colon \Theta \times \mathcal{{Z}}\to [-\infty, \infty) such that, for all \theta\in\Theta , v(\theta, \cdot) is Borel-measurable and

    \begin{align} \imath_{V;Y} (\theta; Y) = v \left(\theta, f(Y) \right) + w(Y),\; \mathit{\mbox{a.s.}}\; Y \sim P_{Y|V = \theta}, \end{align} (7.3)

    where the information density is non-Bayesian (Item 26) with an arbitrary reference measure dominating \mathscr{P} .

    Since the second term on the right side of (7.3) reflects the choice of the dominating measure, we can alternatively express the condition for classical sufficiency in Theorem 9 as the existence of a dominating measure \mathsf{μ} such that \frac{\mathrm{d}P_{Y|V = \theta}}{\mathrm{d}\mathsf{μ}} is measurable with respect to the \sigma -field generated by f , for all \theta\in\Theta .

    68. An important application of the factorization theorem is the following corollary to Corollary 3 in [3].

    Theorem 10. If | \Theta | = 2 and (\mathcal{{Y}}, \mathscr{F}) is standard, then \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} (Y) is a sufficient statistic of Y for \mathscr{P} = \{P_{Y|V = \theta}, \, \theta \in \Theta\} = \{{P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} .

    Proof. Define the function g\colon(0, 1)\times \Theta \times [-\infty, +\infty] \to (0, \infty] ,

    \begin{align} g(p,\theta, z) = \begin{cases} 1 - p + p \exp (z) , &\theta = \mathtt{0},\, z \in \mathbb R ;\\ p + (1- p )\exp (-z), &\theta = \mathtt{1},\, z \in \mathbb R ; \\ 1 - p , & (\theta,z) = (\mathtt{0}, -\infty);\\ p ,& (\theta,z) = (\mathtt{1}, \infty); \\ \infty, &(\theta,z) = (\mathtt{1}, -\infty) \; \mbox{or}\; (\theta,z) = (\mathtt{0}, \infty). \end{cases} \end{align} (7.4)

    We can easily verify that with \bar{P} = \frac12 {P_{{{\mathtt{0}}}}} + \frac12 {P_{{{\mathtt{1}}}}} ,

    \begin{align} \imath_{{P_{{{\mathtt{0}}}}}\|\bar{P}} (Y) & = -\log g \left(\tfrac12 , \mathtt{0}, \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (Y) \right), \; \mbox{a.s.}\, Y\sim {P_{{{\mathtt{0}}}}}, \end{align} (7.5)
    \begin{align} \imath_{{P_{{{\mathtt{1}}}}}\|\bar{P}} (Y) & = -\log g \left(\tfrac12, \mathtt{1}, \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (Y) \right), \; \mbox{a.s.}\, Y\sim {P_{{{\mathtt{1}}}}}. \end{align} (7.6)

    Consequently, letting v (\theta, z) = - \log g(\frac12, \theta, z) and w (y) = \imath_{\bar{P}\|R} (y) , (7.3) holds if the non-Bayesian information density on the left side is defined with reference measure R that dominates both {P_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} .

    69. Theorem 11. Let \mathscr{P} = \{P_{Y|V = \theta}, \, \theta \in \Theta\} = \{{P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} be such that D ({P_{{{\mathtt{1}}}}} \, \| \, {P_{{{\mathtt{0}}}}}) < \infty . Fix P_{Z|Y} and denote {P_{{{\mathtt{0}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{1}}}}} . A necessary and sufficient condition for Z to be a classically sufficient statistic of Y for \mathscr{P} is

    \begin{align} D ( {P_{{{\mathtt{1}}}}} \, \| \, {P_{{{\mathtt{0}}}}}) = D ( {Q_{{{\mathtt{1}}}}} \, \| \, {Q_{{{\mathtt{0}}}}}). \end{align} (7.7)

    Introducing relative entropy in [2], Kullback and Leibler identified Theorem 11 as its most important property. However, they gave the result without the condition D ({P_{{{\mathtt{1}}}}} \, \| \, {P_{{{\mathtt{0}}}}}) < \infty , in which case it need not hold.

    70. The following generalization of Theorem 11 is due to Csiszár [15,32].

    Theorem 12. Let \mathscr{P} = \{P_{Y|V = \theta}, \, \theta \in \Theta\} = \{{P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} be such that D_f ({P_{{{\mathtt{1}}}}} \, \| \, {P_{{{\mathtt{0}}}}}) < \infty , where f\colon (0, \infty) \to \mathbb R is strictly convex. Fix P_{Z|Y} and denote {P_{{{\mathtt{0}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{1}}}}} . A necessary and sufficient condition for Z to be a classically sufficient statistic of Y for \mathscr{P} is

    \begin{align} D_f ( {P_{{{\mathtt{1}}}}} \, \| \, {P_{{{\mathtt{0}}}}}) = D_f ( {Q_{{{\mathtt{1}}}}} \, \| \, {Q_{{{\mathtt{0}}}}}). \end{align} (7.8)

    71. Since Theorems 11 and 12 exclude pairs such that D_f ({P_{{{\mathtt{1}}}}} \, \| \, {P_{{{\mathtt{0}}}}}) = \infty , it is interesting to see if there are any f -divergences such that f\colon (0, \infty) \to \mathbb R is strictly convex and D_f ({P_{{{\mathtt{1}}}}} \, \| \, {P_{{{\mathtt{0}}}}}) is bounded for any pair ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \in \mathscr{P}_{\mathcal{{Y}}}^2 . The answer is affirmative: In view of (4.80), any order- \alpha Hellinger divergence with \alpha \in (0, 1) (including the squared Hellinger distance (4.81)) is finite regardless of the pair of probability measures. Even though unbounded, the Bhattacharyya distance also qualifies since it is in one-to-one correspondence with the squared Hellinger distance. Also fitting the bill is the f -divergence with f(t) = \frac{(t-1)^2}{t+1} known as the Vincze-LeCam divergence [18,19],

    \begin{align} \Delta(P\,\|\,Q) = D_f (P\,\|\,Q) \leq | P - Q| \leq 2. \end{align} (7.9)

    Although outside the scope of Theorem 12 because f(t) = |1-t| is not strictly convex, could the simpler total variation distance serve the same purpose? The answer is negative, as we verify with a simple counterexample in Item 84.

    72. The binary case in Items 68–70 is particularly important: Z is said to be a pairwise sufficient statistic of Y for \mathscr{P} = \{P_{Y|V = \theta}, \, \theta \in \Theta\} if it is a sufficient statistic for \{P_{Y|V = \theta}, P_{Y|V = \vartheta}\} , for all \theta \neq \vartheta \in \Theta . Every sufficient statistic is pairwise sufficient. The converse holds if \mathscr{P} is dominated [3,37,52,54]. Therefore, as long as the data model is dominated, we need not wander beyond binary models to deal with classically sufficient statistics.

    73. Introduced by Kolmogorov [55], Z is said to be a Bayes sufficient statistic of Y for \mathscr{P} if for all P_V \in \mathscr{P}_\Theta , V and Y are conditionally independent given Z . While in Items 64–66 we did not impose the condition that the collection \mathscr{P} = \{ P_{Y|V = \theta} \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}, \theta \in \Theta \} be a random transformation (Item 3), in this case we are indeed imposing the corresponding measurability requirement for which a \sigma -field \mathscr{H} of the subsets of \Theta is also specified. Therefore, in this setting we have the Markov chain

    \begin{align} P_V \to P_{Y|V} \to P_{Z|Y} \to P_Z. \end{align} (7.10)

    Classical sufficiency (Item 64) implies Bayes sufficiency, because once a probability measure is defined on V , P_{VY|Z} = P_{V|Z} P_{Y|Z, V} = P_{V|Z} P_{Y|Z} if the classical criterion (7.1) is satisfied; therefore, V and Y are conditionally independent given Z . Conversely, if the collection \mathscr{P} is dominated, then [3,50] shows that pairwise Bayes sufficiency implies pairwise sufficiency, which in turn implies classical sufficiency as we saw in Item 72.

    74. Theorem 13. Suppose that \mathscr{P} = \{ P_{Y|V = \theta} \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}, \theta \in \Theta \} is dominated. Then, Z is a Bayes sufficient statistic of Y for \mathscr{P} if and only if for all P_V \in \mathscr{P}_\Theta ,

    \begin{align} \imath_{V;Y} (V;Y) = \imath_{V;Z} (V;Z), \; \mathit{\mbox{a.s.}}\; (V,Y,Z) \sim P_V P_{Y|V} P_{Z|Y}. \end{align} (7.11)

    Proof. Fix P_V \in \mathscr{P}_\Theta . We need to show that (7.11) is equivalent to the conditional independence of V and Y given Z . Particularizing the chain rule in (3.56),

    \begin{align} \imath_{V;Y} (a;b) = \imath_{V;YZ} (a;b,c) = \imath_{V;Z} (a;c) + \imath_{V;Y|Z} (a;b|c). \end{align} (7.12)

    According to Lemma 6, P_{VY} \ll P_V \otimes P_Y ; therefore, \mathbb{P}[ \imath_{V; Y} (V; Y) \in \mathbb R ] = 1 . We conclude that (7.11) is equivalent to \imath_{V; Y|Z} (V; Y|Z) = 0 a.s.

    75. In the binary case, Theorem 13 simplifies as follows.

    Theorem 14. Let \mathscr{P} = \{{P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} . Denote {P_{{{\mathtt{0}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{1}}}}} for a fixed P_{Z|Y} . Then, Z is a Bayes sufficient statistic of Y for \mathscr{P} if and only if

    \begin{align} \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} (Y) = \imath_{{Q_{{{\mathtt{1}}}}}\| {Q_{{{\mathtt{0}}}}}} (Z), \quad \mathit{\mbox{a.s. for both}} \; (Y,Z) \sim {P_{{{\mathtt{1}}}}} P_{Z|Y}\; \mathit{\mbox{and}}\; (Y,Z) \sim {P_{{{\mathtt{0}}}}} P_{Z|Y}. \end{align} (7.13)

    Proof. For P_V = [0\; \; 1] or [1\; \; 0] , both sides of (7.11) are 0 . Fix P_V (\mathtt{1}) = p \in (0, 1) , and denote by p_{ \mathtt{1} } and p_{ \mathtt{0} } the densities of {P_{{{\mathtt{1}}}}} and {P_{{{\mathtt{0}}}}} , respectively, with respect to the dominating measure p\, {P_{{{\mathtt{1}}}}} + (1-p) {P_{{{\mathtt{0}}}}} . Analogously, denote by q_{ \mathtt{1} } and q_{ \mathtt{0} } the densities of {Q_{{{\mathtt{1}}}}} and {Q_{{{\mathtt{0}}}}} , respectively, with respect to the dominating measure p\, {Q_{{{\mathtt{1}}}}} + (1-p) {Q_{{{\mathtt{0}}}}} . It readily follows that, with the notation in (7.4),

    \begin{align} p_{ \mathtt{1} } ( y ) & = \frac1{g\left(p, \mathtt{1} , \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (y) \right)} \,\quad\mbox{and}\quad p_{ \mathtt{0} } ( y ) = \frac1{g\left(p, \mathtt{0} , \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (y) \right)}, \quad y\in \mathcal{{Y}}, \end{align} (7.14)
    \begin{align} q_{ \mathtt{1} } ( z ) & = \frac1{g\left(p, \mathtt{1} , \imath_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} (z) \right)} \quad\mbox{and}\quad q_{ \mathtt{0} } ( z ) = \frac1{g\left(p, \mathtt{0} , \imath_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} (z) \right)}, \quad z\in \mathcal{{Z}}. \end{align} (7.15)

    The condition in (7.11) is equivalent to

    \begin{align} p_{ \mathtt{1} } ( Y_{\mathtt{1}} ) & = q_{ \mathtt{1} } ( Z_{\mathtt{1}}) , \quad \; \mbox{a.s.}\,(Y_{\mathtt{1}}, Z_{\mathtt{1}}) \sim {P_{{{\mathtt{1}}}}} P_{Z|Y}, \end{align} (7.16)
    \begin{align} p_{ \mathtt{0} } ( Y_{\mathtt{0}} ) & = q_{ \mathtt{0} } ( Z_{\mathtt{0}}) , \quad \; \mbox{a.s.}\,(Y_{\mathtt{0}}, Z_{\mathtt{0}}) \sim {P_{{{\mathtt{0}}}}} P_{Z|Y}, \end{align} (7.17)

    which in turn is equivalent to (7.13) in view of (7.14)–(7.15) and the strict monotonicity of the function g\left(p, \theta, \cdot \right) for all (p, \theta)\in (0, 1) \times \{\mathtt{0}, \mathtt{1}\} .

    76. Theorem 15. Suppose that \mathscr{P} = \{ P_{Y|V = \theta} \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}, \theta \in \Theta \} is dominated. Then, Z is a Bayes sufficient statistic of Y for \mathscr{P} if and only if I (V; Y) = I (V; Z) for all those P_V supported on two elements of \Theta .

    Proof. Because of the domination assumption, Bayes sufficiency is equivalent to pairwise Bayes sufficiency (Item 73). Therefore, we can restrict attention to those P_V supported on two elements of \Theta . Note that I(V; Y) \leq 1 bit with those input distributions. Since I(V; Y) is finite and I (V; Z| Y) = 0 , the chain rule of mutual information

    \begin{align} I( V; Y, Z) = I ( V ; Z ) + I (V; Y | Z) = I (V ; Y) + I (V; Z| Y) \end{align} (7.18)

    implies that I (V; Y) = I (V; Z) is equivalent to I (V; Y | Z) = 0 , which, in turn, is equivalent to conditional independence of V and Y given Z .

    Without imposing the domination assumption, a related claim can be found in [56, p. 36]. However, note that if I (V; Y) = I (V; Z) = \infty , (7.18) does not guarantee I (V; Y | Z) = 0 . Apparently unaware of the notion of Bayes sufficiency, Lindley [5] had proposed I (V; Y) = I (V; Z) for all P_V as a criterion for sufficiency, which he noticed to be implied by classical sufficiency.

    77. Following [37,52,57,58], P_{Z|Y} is called Blackwell sufficient for \mathscr{P} = \{ P_{Y|V = \theta}, \, \theta \in \Theta \} (sometimes also called exhaustive [59]) if there exists P_{Y|Z}\colon (\mathcal{{Z}}, \mathscr{G}) \to (\mathcal{{Y}}, \mathscr{F}) (dependent on P_{Z|Y} and \mathscr{P} ) such that for all \theta \in \Theta ,

    \begin{align} P_{Y|V = \theta} \to P_{Z|Y} \to P_{Y|Z} \to P_{Y|V = \theta}. \end{align} (7.19)

    Therefore, P_{Y|Z} acts as an "inverse random transformation" as long as the input to P_{Z|Y} is drawn from \mathscr{P} . As shown in [60,61] (see also [52] and [37, (6.51)]), for dominated collections defined on standard spaces, classical sufficiency is the same as Blackwell sufficiency.

    78. Let \mathscr{P}_{\mathcal{{Z}}} stand for the collection of probability measures defined on (\mathcal{{Z}}, \mathscr{G}) . In the terminology introduced by Blackwell [48,57], \{ P_{Y|V = \theta} \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}, \theta \in \Theta \} is at least as informative as \{ P_{Z|V = \theta} \in \mathscr{P}_{\mathcal{{Z}}}, \theta \in \Theta \} if there exists a random transformation P_{Z|Y}\colon (\mathcal{{Y}}, \mathscr{F}) \to (\mathcal{{Z}}, \mathscr{G}) such that

    \begin{align} P_{Y|V = \theta} \to P_{Z|Y} \to P_{Z|V = \theta}, \quad \theta\in \Theta. \end{align} (7.20)

    So, P_{Z|Y} is Blackwell sufficient if and only if \{ P_{Z|V = \theta}, \theta \in \Theta \} and \{ P_{Y|V = \theta}, \theta \in \Theta \} are equally informative.

    79. Taking stock of the various notions of sufficient statistics reviewed so far in this section, the notion of Bayes sufficiency is, in principle, easier to apply than the classical notion in Item 64 and does not require the topological assumption of a standard space. On the other hand, the factorization theorem (Theorem 9) typically provides a convenient method for verifying the sufficiency of deterministic transformations. Although the Blackwell criterion (Item 77) is intuitively appealing, identifying the required inverse random transformation (or showing that none exists) is not always straightforward. Building on Definition 2, next we introduce a new notion of sufficient statistic that is both easy to verify and equivalent to the foregoing notions for dominated models in standard spaces.

    Definition 3. Fix \{{P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} and P_{Z|Y} , and denote {P_{{{\mathtt{0}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{1}}}}} . Then, Z is an I -sufficient statistic of Y for \{{P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} if

    \begin{align} ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}) \equiv ({Q_{{{\mathtt{0}}}}}, {Q_{{{\mathtt{1}}}}}). \end{align} (7.21)

    More generally, Z is an I -sufficient statistic of Y for \{P_{Y|V = \theta}, \, \theta \in \Theta\} if it is I -sufficient for every pair (\theta, \vartheta) , \theta \neq \vartheta \in \Theta .

    80. Example. For any ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in {\mathscr{P}_{\!\!\mathcal{{Y}}}}^2 , \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (Y) is an I -sufficient statistic of Y for \{ {P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} because in view of Lemma 5,

    \begin{align} \imath_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} ( \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}})) = \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}})\; \; \mbox{a.s.}\; {Y_{{{\mathtt{1}}}}}\sim {P_{{{\mathtt{1}}}}}. \end{align} (7.22)
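    For finite alphabets, this example can be checked mechanically: grouping the atoms according to their value of \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} yields the distributions {Q_{{{\mathtt{1}}}}} and {Q_{{{\mathtt{0}}}}} of the statistic, and the relative information of the induced pair evaluated at the statistic returns the same value, as in (7.22). A minimal Python sketch with an illustrative pair (chosen so that two atoms share the same likelihood ratio):

```python
import numpy as np
from collections import defaultdict

P1 = np.array([0.10, 0.20, 0.30, 0.40])
P0 = np.array([0.05, 0.10, 0.45, 0.40])     # atoms 1 and 2 share the ratio 2; atom 4 has ratio 1

i_10 = np.log(P1 / P0)                      # T = i_{P1||P0}(Y), the proposed statistic

# Q1, Q0: distributions of T under P1 and P0 (atoms with equal relative information are merged)
Q1, Q0 = defaultdict(float), defaultdict(float)
for k, t in enumerate(i_10):
    Q1[round(float(t), 12)] += P1[k]
    Q0[round(float(t), 12)] += P0[k]

# (7.22): i_{Q1||Q0}(t) = t for every value t taken by the statistic
for t in Q1:
    assert abs(np.log(Q1[t] / Q0[t]) - t) < 1e-9
print("(7.22) verified: i_{P1||P0}(Y) is an I-sufficient statistic for {P0, P1}")
```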

    81. Example. Suppose that \Theta = \mathbb R , \mathcal{{Y}} = \mathbb R^n and Y = (Y_1, \ldots, Y_n) = (\theta + X_1, \ldots, \theta + X_n) , with (X_1, \ldots, X_n) independent geometrically distributed with known parameter q \in (0, 1) . To verify that

    \begin{align} Z = \min\limits_{i = 1, \ldots, n} Y_i \end{align} (7.23)

    is an I -sufficient statistic of (Y_1, \ldots, Y_n) for this undominated data model, first note that P_{Y|V = \theta + \ell} \perp P_{Y|V = \theta} and P_{Z|V = \theta + \ell} \perp P_{Z|V = \theta} , unless \ell is an integer. With \ell \in \{1, 2, \ldots\} , we obtain

    \begin{align} \imath_{P_{Z|V = \theta + \ell}\|P_{Z|V = \theta}} (t) = \begin{cases} \ell \, n \, \log \frac1{1-q}, & t \in \{ \theta + \ell, \theta + \ell+ 1 , \ldots \}; \\ -\infty, & t \in \{ \theta, \ldots , \theta + \ell- 1 \}; \\ \mbox{arbitrary},&\mbox{otherwise}. \end{cases} \end{align} (7.24)

    Moreover, we can easily check that \imath_{P_{Y|V = \theta + \ell}\|P_{Y|V = \theta}} (y_1, \ldots, y_n) = \imath_{P_{Z|V = \theta + \ell}\|P_{Z|V = \theta}} (\min_{i = 1, \ldots, n} y_i) .
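    This example, too, can be checked numerically. The sketch below assumes the geometric distribution supported on \{0, 1, 2, \ldots\} with \mathbb{P}[X = k] = q(1-q)^k (the convention consistent with (7.24)), evaluates the probability mass function of Z = \min_i Y_i in closed form, and confirms both (7.24) and the equality of the Y - and Z -level relative informations on randomly drawn observations. The parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q, n, theta, ell = 0.3, 4, 5, 2                 # illustrative parameter values

def log_pmf_Y(y, shift):
    # log P_{Y|V=shift}(y) for y = (y_1,...,y_n), with X_i geometric on {0,1,2,...}
    x = np.asarray(y) - shift
    if np.any(x < 0):
        return -np.inf
    return float(np.sum(np.log(q) + x * np.log(1 - q)))

def log_pmf_Z(t, shift):
    # P_{Z|V=shift}(t) = (1-q)^{(t-shift)n} (1 - (1-q)^n) for t in {shift, shift+1, ...}
    m = t - shift
    if m < 0:
        return -np.inf
    return m * n * np.log(1 - q) + np.log(1 - (1 - q) ** n)

# (7.24): the Z-level relative information
for t in range(theta, theta + 10):
    val = log_pmf_Z(t, theta + ell) - log_pmf_Z(t, theta)
    if t >= theta + ell:
        assert abs(val - ell * n * np.log(1 / (1 - q))) < 1e-12
    else:
        assert val == -np.inf

# the Y-level relative information equals the Z-level one evaluated at min(y)
for _ in range(1000):
    y = theta + rng.geometric(q, size=n) - 1     # rng.geometric is supported on {1,2,...}
    lhs = log_pmf_Y(y, theta + ell) - log_pmf_Y(y, theta)
    rhs = log_pmf_Z(int(np.min(y)), theta + ell) - log_pmf_Z(int(np.min(y)), theta)
    assert (lhs == rhs == -np.inf) or abs(lhs - rhs) < 1e-9
print("Item 81: (7.24) and the I-sufficiency identity verified")
```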

    82. Theorem 16. Assume that the data model \mathscr{P} = \{P_{Y|V = \theta}, \, \theta \in \Theta\} is dominated and fix P_{Z|Y} .

    (a) If Z is a Bayes sufficient statistic of Y for \mathscr{P} , then Z is an I -sufficient statistic of Y for \mathscr{P} .

    (b) Assume that the observation space (\mathcal{{Y}}, \mathscr{F}) is standard. If Z is an I -sufficient statistic of Y for \mathscr{P} , then Z is a classically sufficient statistic of Y for \mathscr{P} .

    Proof. In view of Items 72 and 73, the domination assumption allows us to restrict attention to the |\Theta| = 2 case.

    (a) If Z is a Bayes sufficient statistic of Y for \mathscr{P} , then Theorem 14 implies that the random variables \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} (Y_{\mathtt{1}}) and \imath_{{Q_{{{\mathtt{1}}}}}\| {Q_{{{\mathtt{0}}}}}} (Z_{\mathtt{1}}) with Y_{\mathtt{1}} \sim {P_{{{\mathtt{1}}}}} and Z_{\mathtt{1}}\sim {Q_{{{\mathtt{1}}}}} must have identical cumulative distribution functions. Therefore, ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) and ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) are equivalent pairs.

    (b) If Z is an I -sufficient statistic of Y for \mathscr{P} , then Theorem 3 implies that D_f ({P_{{{\mathtt{1}}}}}\, \|\, {P_{{{\mathtt{0}}}}}) = D_f ({Q_{{{\mathtt{1}}}}}\, \|\, {Q_{{{\mathtt{0}}}}}) for all convex f\colon(0, \infty) \to \mathbb R . In particular, this encompasses the functions allowed in Theorem 12, and, consequently, Z is a classically sufficient statistic of Y for \mathscr{P} . Recall from Item 71 that the set of functions allowed in Theorem 12 is nonempty regardless of ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \in \mathscr{P}_{\mathcal{{Y}}}^2 .

    83. The notions in Items 40 and 78 are related as follows.

    Theorem 17. Suppose that (\mathcal{{Y}}, \mathscr{F}) and (\mathcal{{Z}}, \mathscr{G}) are standard spaces. For any ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}})\in {\mathscr{P}_{\!\!\mathcal{{Y}}}}^2 and ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}})\in \mathscr{P}^2_{\mathcal{{Z}}} ,

    \begin{align} ({P_{{{\mathtt{1}}}}},{P_{{{\mathtt{0}}}}}) &\equiv ({Q_{{{\mathtt{1}}}}},{Q_{{{\mathtt{0}}}}}) \\ &\Updownarrow \\ \{ {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}\} \; \mathit{\mbox{and}}\; \{{Q_{{{\mathtt{1}}}}},{Q_{{{\mathtt{0}}}}}\}\; & \mathit{\mbox{are equally informative models}}. \end{align} (7.25)

    Proof.

    \Uparrow \{ {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}\} and \{{Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}\} are equally informative \Rightarrow there exists P_{Z|Y}\colon (\mathcal{{Y}}, \mathscr{F}) \to (\mathcal{{Z}}, \mathscr{G}) which is Blackwell sufficient for \{ {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}\} , and {P_{{{\mathtt{1}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{1}}}}} , {P_{{{\mathtt{0}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{0}}}}} . Since the model is dominated and lives in a standard space, Item 77 and Theorem 16 imply that P_{Z|Y} is I -sufficient; therefore, ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) .

    \Downarrow As we saw in Theorem 10, the deterministic transformation P_{X|Y} that outputs X = \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (Y) is a classically sufficient statistic for \{{P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}\} , and, analogously, the deterministic transformation P_{\bar{X}|Z} that outputs \bar{X} = \imath_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} (Z) is a classically sufficient statistic for \{{Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}\} . Because the spaces are standard and the models are dominated, those statistics are Blackwell sufficient; therefore, there exist P_{Y|X} and P_{Z|\bar{X}} such that

    \begin{align} {P_{{{\mathtt{1}}}}} \to P_{X|Y} \to P_{Y|X} \to {P_{{{\mathtt{1}}}}} \end{align} (7.26)
    \begin{align} {P_{{{\mathtt{0}}}}} \to P_{X|Y} \to P_{Y|X} \to {P_{{{\mathtt{0}}}}} \end{align} (7.27)
    \begin{align} {Q_{{{\mathtt{1}}}}} \to P_{\bar{X}|Z} \to P_{Z|\bar{X}} \to {Q_{{{\mathtt{1}}}}} \end{align} (7.28)
    \begin{align} {Q_{{{\mathtt{0}}}}} \to P_{\bar{X}|Z} \to P_{Z|\bar{X}} \to {Q_{{{\mathtt{0}}}}}. \end{align} (7.29)

    Now by definition of ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) , the response of P_{X|Y} to {P_{{{\mathtt{1}}}}} is the same as the response of P_{\bar{X}|Z} to {Q_{{{\mathtt{1}}}}} , and the response of P_{X|Y} to {P_{{{\mathtt{0}}}}} is the same as the response of P_{\bar{X}|Z} to {Q_{{{\mathtt{0}}}}} . Therefore,

    \begin{align} {Q_{{{\mathtt{1}}}}} \to P_{\bar{X}|Z} \to P_{Y|X} \to {P_{{{\mathtt{1}}}}} \end{align} (7.30)
    \begin{align} {Q_{{{\mathtt{0}}}}} \to P_{\bar{X}|Z} \to P_{Y|X} \to {P_{{{\mathtt{0}}}}} \end{align} (7.31)

    which implies that \{{Q_{{{\mathtt{0}}}}}, {Q_{{{\mathtt{1}}}}}\} is at least as informative as \{{P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} . Reversing the roles ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) and ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) , we conclude that \{{P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} is at least as informative as \{{Q_{{{\mathtt{0}}}}}, {Q_{{{\mathtt{1}}}}}\} .

    84. Example. Let \mathcal{{B}} = \{ -, 0, +\} , \mathcal{{C}} = \{-, +\} , and the random transformation P_{Z|Y = 0} (+) = \frac12 , P_{Z|Y = +} (+) = P_{Z|Y = -} (-) = 1 . Furthermore, consider \{ P_{Y|V = \theta}, \theta \in \Theta \} = \{ {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} \} with

    \begin{align} {P_{{{\mathtt{1}}}}} = \left[\,\tfrac13\; \; \tfrac23\; \; 0\,\right] \to P_{Z|Y} &\to {Q_{{{\mathtt{1}}}}} = \left[\,\tfrac23\; \; \tfrac13\,\right], \end{align} (7.32)
    \begin{align} {P_{{{\mathtt{0}}}}} = \left[\,0\; \; \tfrac23\; \; \tfrac13\,\right] \to P_{Z|Y} &\to {Q_{{{\mathtt{0}}}}} = \left[\,\tfrac13\; \; \tfrac23\,\right]. \end{align} (7.33)

    Then, | {P_{{{\mathtt{1}}}}} - {P_{{{\mathtt{0}}}}} | = | {Q_{{{\mathtt{1}}}}} - {Q_{{{\mathtt{0}}}}} | = \frac23 . Although P_{Z|Y} preserves total variation distance, Z is not a sufficient statistic of Y because ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \not \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) .
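    A compact numerical rendering of this counterexample: the channel preserves the total variation distance but not the squared Hellinger distance, so by (4.83) the pairs cannot be equivalent, and by Theorem 12 and Item 71, Z cannot be a classically sufficient statistic either. A minimal sketch:

```python
import numpy as np

# alphabet order (-, 0, +) at the input and (-, +) at the output
P1 = np.array([1/3, 2/3, 0.0])
P0 = np.array([0.0, 2/3, 1/3])
channel = np.array([[1.0, 0.0],       # P_{Z|Y=-}
                    [0.5, 0.5],       # P_{Z|Y=0}
                    [0.0, 1.0]])      # P_{Z|Y=+}
Q1, Q0 = P1 @ channel, P0 @ channel

tv = lambda A, B: np.sum(np.abs(A - B))
hellinger_sq = lambda A, B: np.sum((np.sqrt(A) - np.sqrt(B)) ** 2)

print(tv(P1, P0), tv(Q1, Q0))                       # both 2/3: total variation is preserved
print(hellinger_sq(P1, P0), hellinger_sq(Q1, Q0))   # 2/3 versus ≈ 0.114: the pairs differ
```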

    85. Summarizing the various results in this section as well as the necessary and sufficient conditions for equivalent pairs in Section 4, we have the following result.

    Theorem 18. Suppose that the data model \mathscr{P} = \{ P_{Y|V = \theta} \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}, \theta \in \Theta \} is dominated and (\mathcal{{Y}}, \mathscr{F}) is a standard space. Fix any random transformation P_{Z|Y}\colon (\mathcal{{Y}}, \mathscr{F}) \to (\mathcal{{Z}}, \mathscr{G}) , and denote {P_{{{\mathtt{0}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} \to P_{Z|Y} \to {Q_{{{\mathtt{1}}}}} for ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 . The following are equivalent.

    (a) Z is a classically sufficient statistic of Y for \mathscr{P} .

    (b) Z is a Bayes sufficient statistic of Y for \mathscr{P} .

    (c) Z is a Blackwell sufficient statistic of Y for \mathscr{P} .

    (d) Z is an I -sufficient statistic of Y for \mathscr{P} .

    (e) For all P_V \in \mathscr{P}_\Theta ,

    \begin{align} \imath_{V;Y} (V;Y) = \imath_{V;Z} (V;Z), \; \mathit{\mbox{a.s.}}\; (V,Y,Z) \sim P_V P_{Y|V} P_{Z|Y}. \end{align} (7.34)

    (f) I (V; Y) = I (V; Z) for all those P_V\in \mathscr{P}_\Theta supported on two elements of \Theta .

    (g) For all ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 and \alpha \in (0, \infty] ,

    \begin{align} D_\alpha ( {P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) & = D_\alpha ( {Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}). \end{align} (7.35)

    (h) For all ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 and convex functions f\colon(0, \infty)\to \mathbb R ,

    \begin{align} D_f ( {P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) & = D_f ( {Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}). \end{align} (7.36)

    (i) For all ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 ,

    \begin{align} \mathscr{H}^2 ( {P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) & = \mathscr{H}^2 ( {Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}). \end{align} (7.37)

    (j) For all ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 ,

    \begin{align} B ( {P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) & = B ( {Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}). \end{align} (7.38)

    (k) For all ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 ,

    \begin{align} \Delta ( {P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) & = \Delta ( {Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}). \end{align} (7.39)

    (l) For all ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 and p\in (0, 1) ,

    \begin{align} \mathcal{I}_p({P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) = \mathcal{I}_p({Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}). \end{align} (7.40)

    (m) For all ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 and \gamma \geq 1 ,

    \begin{align} E_\gamma({P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) = E_\gamma({Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}) \; \; &\mathit{\mbox{and}}\; \; E_\gamma({P_{{{\mathtt{0}}}}}\,\|\,{P_{{{\mathtt{1}}}}}) = E_\gamma({Q_{{{\mathtt{0}}}}}\,\|\,{Q_{{{\mathtt{1}}}}}). \end{align} (7.41)

    (n) For all ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}})\in \mathscr{P}^2 ,

    \begin{align} S ( {P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) & = S ( {Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}). \end{align} (7.42)

    See Item 100 for the justification of Theorem 18-(n).

    86. When compelled to choose among various non-sufficient statistics, it is helpful to assign a figure of merit to every random transformation P_{Z|Y} providing an indication of how close it is to being sufficient. Motivated by Theorems 18-(i) and 18-(k), a couple of possibilities are

    \begin{align} \inf\limits_{{P_{{{\mathtt{0}}}}}\neq {P_{{{\mathtt{1}}}}}\in \mathscr{P}} \frac{\mathscr{H}^2 ({Q_{{{\mathtt{0}}}}}\,\|\,{Q_{{{\mathtt{1}}}}})}{\mathscr{H}^2 ({P_{{{\mathtt{0}}}}}\,\|\,{P_{{{\mathtt{1}}}}})} \leq 1, \quad \mbox{and} \quad \inf\limits_{{P_{{{\mathtt{0}}}}}\neq {P_{{{\mathtt{1}}}}}\in \mathscr{P}} \frac{\Delta ({Q_{{{\mathtt{0}}}}}\,\|\,{Q_{{{\mathtt{1}}}}})}{\Delta ({P_{{{\mathtt{0}}}}}\,\|\,{P_{{{\mathtt{1}}}}})} \leq 1, \end{align} (7.43)

    where the inequalities follow from the fact that both the squared Hellinger distance and the Vincze-Le Cam divergence are f -divergences and equality occurs if and only if P_{Z|Y} is a sufficient statistic. Alternatively, Theorem 18-(d) suggests using

    \begin{align} \sup\limits_{{P_{{{\mathtt{0}}}}}\neq {P_{{{\mathtt{1}}}}}\in \mathscr{P}} K ( { \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} , { \mathbb{F}}_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} ), \end{align} (7.44)

    where K(\mathbb{F}, \mathbb{G}) = \sup_{x\in \mathbb R} | \mathbb{F}(x) - \mathbb{G}(x) | is the Kolmogorov-Smirnov distance between the cumulative distribution functions \mathbb{F} and \mathbb{G} . Then, P_{Z|Y} is a sufficient statistic if and only if (7.44) is zero. Naturally, in (7.44) we can substitute K ({ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}}, { \mathbb{F}}_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}}) by | P_{\imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}}} - P_{\imath_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}}} | or any other measure of distance between probability measures.
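
    As a quick numerical illustration (not needed for the development), the following Python sketch evaluates the figures of merit in (7.43) and (7.44) for a single pair ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}) and a single random transformation, under the simplifying assumptions of finite alphabets, strictly positive masses, and natural logarithms; the distributions, the channel matrix W , and the helper function names are hypothetical choices made only for this sketch.

```python
# A minimal numerical sketch of the figures of merit in (7.43) and (7.44),
# assuming finite alphabets, strictly positive masses, and natural logarithms.
# The numerical values of P1, P0 and the channel W are hypothetical examples.
import numpy as np

def relative_information(P1, P0):
    """Values of i_{P1||P0} on the alphabet (natural logarithms)."""
    return np.log(P1 / P0)

def spectrum_cdf(values, weights, x):
    """F(x): probability mass of the atoms with relative information <= x."""
    return weights[values <= x].sum()

def ks_between_spectra(P1, P0, Q1, Q0):
    """Kolmogorov-Smirnov distance between F_{P1||P0} and F_{Q1||Q0}."""
    iP, iQ = relative_information(P1, P0), relative_information(Q1, Q0)
    grid = np.union1d(iP, iQ)          # the supremum is attained at an atom
    return max(abs(spectrum_cdf(iP, P1, x) - spectrum_cdf(iQ, Q1, x))
               for x in grid)

def hellinger_sq(P, Q):
    """Squared Hellinger distance between two probability vectors."""
    return np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)

P1 = np.array([0.5, 0.3, 0.2])
P0 = np.array([0.2, 0.3, 0.5])
W  = np.array([[0.9, 0.1],             # P_{Z|Y=a}
               [0.5, 0.5],             # P_{Z|Y=b}
               [0.1, 0.9]])            # P_{Z|Y=c}
Q1, Q0 = P1 @ W, P0 @ W                # output distributions of the channel

print(hellinger_sq(Q0, Q1) / hellinger_sq(P0, P1))   # <= 1, cf. (7.43)
print(ks_between_spectra(P1, P0, Q1, Q0))            # = 0 iff Z is sufficient for {P0, P1}
```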

    The information spectra of the absolute information \imath_X (X) and of the information density \imath_{X; Y} (X; Y) prove to be instrumental in determining the fundamental limits of lossless and lossy compression, respectively; the information density also governs the fundamental limits of data transmission. Unfortunately, explicit solutions are not feasible in those problems and we must be content with bounds, which become tight under stationary/ergodic assumptions in the limit of long data blocks. In contrast, the relative information spectra determine exactly the non-asymptotic fundamental tradeoff in hypothesis testing. This section gives a fully detailed solution of that tradeoff in non-Bayesian hypothesis testing, including an operational role for the NP-divergence. No restrictions are placed on the pair of probability measures that govern the observation under the respective hypotheses:

    \begin{aligned} & \mathsf{H}_0: y \sim P_{\mathtt{0}}, \\ & \mathsf{H}_1: y \sim P_{\mathtt{1}}. \end{aligned}

    Since we place no restrictions on {P_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} , this "single-shot" setting encompasses the popular special case in which the observations are n independent drawings from a given distribution, {P_{{{\mathtt{0}}}}} = \mathsf{P}^{\otimes n}_0 and {P_{{{\mathtt{1}}}}} = \mathsf{P}^{\otimes n}_1 .

    87. Let ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}^2 . A (randomized) hypothesis test is a deterministic measurable function \phi\colon \mathcal{{Y}} \to [0, 1] , such that \phi(y) is the probability of guessing {P_{{{\mathtt{1}}}}} if y\in \mathcal{{Y}} is observed. A test \phi is said to be deterministic if its range is \{0, 1\} , i.e., \phi (y) = 1\{ y \in A\} for some measurable subset A \subset \mathcal{{Y}} . The performance of test \phi is determined by the conditional probabilities of error,

    \begin{align} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} & = \mathbb{P} [ \mbox{test decides} \; {\mathsf{H}_{{\mathtt{0}}}} \,|\, {\mathsf{H}_{{\mathtt{1}}}} ] = 1 - \mathbb{E} [ \phi ( {Y_{{{\mathtt{1}}}}} ) ], \quad {Y_{{{\mathtt{1}}}}} \sim {P_{{{\mathtt{1}}}}}, \end{align} (8.1)
    \begin{align} {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} & = \mathbb{P} [ \mbox{test decides} \; {\mathsf{H}_{{\mathtt{1}}}} \,|\, {\mathsf{H}_{{\mathtt{0}}}}] = \mathbb{E} [ \phi ( {Y_{{{\mathtt{0}}}}} ) ], \quad{Y_{{{\mathtt{0}}}}} \sim {P_{{{\mathtt{0}}}}}. \end{align} (8.2)
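
    The following minimal sketch evaluates (8.1)–(8.2) on a finite alphabet; the distributions and the randomized test \phi are hypothetical examples chosen only for illustration.

```python
# A minimal sketch of (8.1)-(8.2) on a finite alphabet: phi[y] is the
# probability of deciding H1 when y is observed.  P1, P0 and phi below are
# hypothetical examples.
import numpy as np

def error_probabilities(phi, P1, P0):
    pi_0_given_1 = 1.0 - np.dot(P1, phi)   # decide H0 under H1, cf. (8.1)
    pi_1_given_0 = np.dot(P0, phi)         # decide H1 under H0, cf. (8.2)
    return pi_1_given_0, pi_0_given_1

P1  = np.array([0.5, 0.3, 0.2])
P0  = np.array([0.2, 0.3, 0.5])
phi = np.array([1.0, 0.5, 0.0])            # a randomized test
print(error_probabilities(phi, P1, P0))    # (pi_{1|0}, pi_{0|1})
```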

    88. The hypothesis testing fundamental tradeoff region consists of the set of achievable error probability pairs,

    \begin{align} \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) = \bigcup\limits_{\phi\colon \mathcal{{Y}} \to [0,1]} \left\{ ( \mathbb{E} [ \phi ( {Y_{{{\mathtt{0}}}}} ) ], 1 - \mathbb{E} [ \phi ( {Y_{{{\mathtt{1}}}}} ) ] ) \right\}, \quad\; {Y_{{{\mathtt{0}}}}}\sim {P_{{{\mathtt{0}}}}},\; {Y_{{{\mathtt{1}}}}}\sim {P_{{{\mathtt{1}}}}}. \end{align} (8.3)

    In other words, ({\pi_{{\mathtt{1}} \mid {\mathtt{0}}}}, {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}}) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) if there is a hypothesis test for ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) achieving conditional error probabilities ({\pi_{{\mathtt{1}} \mid {\mathtt{0}}}}, {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}}) . Elementary properties of the fundamental tradeoff region include:

    Theorem 19.

    (a) \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) is a convex set.

    (b) \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) is a closed set.

    (c) (a, b) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \Longleftrightarrow (1-a, 1-b) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}).

    (d) (a, b) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \Longleftrightarrow (b, a) \in \mathcal{C} ({P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}) .

    Proof.

    (a) If \phi_0 and \phi_1 attain (a_0, b_0) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) and (a_1, b_1) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) , respectively, and \alpha\in (0, 1) , then the test (1-\alpha) \phi_0 + \alpha\, \phi_1 attains (1-\alpha) (a_0, b_0) + \alpha\, (a_1, b_1) .

    (b) The mapping \phi \mapsto (\mathbb{E} [ \phi ({Y_{{{\mathtt{0}}}}}) ], 1 - \mathbb{E} [ \phi ({Y_{{{\mathtt{1}}}}}) ]) is linear.

    (c) The test 1-\phi achieves (1 - {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}}, 1-{\pi_{{\mathtt{0}} \mid {\mathtt{1}}}}) if \phi achieves ({\pi_{{\mathtt{1}} \mid {\mathtt{0}}}}, {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}}) .

    (d) Interchanging \phi \leftrightarrow 1 - \phi and {P_{{{\mathtt{0}}}}} \leftrightarrow {P_{{{\mathtt{1}}}}} in (8.3).

    89. In view of Theorem 19-(c), the set of points in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) above the (0, 1)—(1, 0) diagonal is redundant. The set of Pareto optimal error probability pairs is the lower boundary of \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) below the diagonal, which we refer to as the fundamental tradeoff function \{\alpha_\nu \in [0, 1], \nu \in [0, 1]\} defined by

    \begin{align} \alpha_\nu ( {P_{{{\mathtt{1}}}}} , {P_{{{\mathtt{0}}}}} ) & = \min\left\{y \in [0,1]\colon (\nu, y) \in \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} )\right\} \end{align} (8.4)
    \begin{align} & = \min\limits_{\phi\colon {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} \leq \nu} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} = 1 - \max\limits_{\phi\colon \mathbb{E} [ \phi ( {Y_{{{\mathtt{0}}}}} ) ] \leq \nu} \mathbb{E} [ \phi ( {Y_{{{\mathtt{1}}}}} ) ]. \end{align} (8.5)

    As a consequence of Theorem 19-(a), \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) is convex on [0, 1] . Although the fundamental tradeoff region \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) and the fundamental tradeoff function \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) determine each other, it is advantageous to work with both simultaneously, as we see below.

    90. The diagonal connecting (0, 1)—(1, 0) belongs to the fundamental tradeoff region

    \begin{align} \{ (p,1-p) \in [0,1]^2 \colon p\in [0,1] \} \subset \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) , \end{align} (8.6)

    since (p, 1-p) is attained by the blind test \phi (y) = p , y \in \mathcal{{Y}} . If {P_{{{\mathtt{1}}}}} = {P_{{{\mathtt{0}}}}} , then blind tests are optimal and equality holds in (8.6). Note that in this case the area of the fundamental tradeoff region satisfies \left| \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{1}}}}}) \right| = 0 , and the fundamental tradeoff function is \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{1}}}}}) = 1 -\nu , \nu \in [0, 1] .

    91. At the other extreme, if {P_{{{\mathtt{1}}}}} \perp {P_{{{\mathtt{0}}}}} , then \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) = [0, 1]^2 , \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) = 0 , \nu \in [0, 1] , and the area of the fundamental tradeoff region satisfies \left| \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \right| = 1 . To see this, recall (Item 7) that there exists an event F\in \mathscr{F} such that {P_{{{\mathtt{1}}}}} (F) = 1 and {P_{{{\mathtt{0}}}}} (F) = 0 . The deterministic test \phi (y) = 1\{y\in F\} achieves the point (0, 0) , while the test \phi (y) = 1\{y\not\in F\} achieves the point (1, 1) . All other points in the square are achievable because of (8.6) and Theorem 19-(a).

    92. Inspired by radar, the function \nu \mapsto 1 - \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) is frequently (e.g., [62]) referred to as the receiver operating characteristic, or ROC. In the radar application, {P_{{{\mathtt{0}}}}} is the distribution of the observations under the absence of target return. In fact, this terminology is applied not just to the best possible curve but to the tradeoff between {{\pi_{{\mathtt{1}} \mid {\mathtt{1}}}}} and {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} achieved by any particular family of tests. The so-called area under the (ROC) curve, commonly abbreviated as AUC,

    \begin{align} \int_0^1 \left( 1 - \alpha_\nu ( {P_{{{\mathtt{1}}}}} , {P_{{{\mathtt{0}}}}} ) \right) \, \mathrm{d}\nu = \tfrac12 + \tfrac12\, | \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) | \in [\tfrac12, 1], \end{align} (8.7)

    is frequently used as a scalar proxy to evaluate the degree to which {P_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} can be distinguished. It ranges from \frac12 if {P_{{{\mathtt{0}}}}} = {P_{{{\mathtt{1}}}}} to 1 if {P_{{{\mathtt{0}}}}}\perp{P_{{{\mathtt{1}}}}} .

    93. Data processing theorem for the fundamental tradeoff region. If P_Y \to P_{Z|Y} \to P_Z and Q_Y \to P_{Z|Y} \to Q_Z , then

    \begin{align} \alpha_\nu ( P_Y , Q_Y ) &\leq \alpha_\nu ( P_Z , Q_Z ), \quad \nu \in [0,1], \end{align} (8.8)
    \begin{align} \mathcal{C} (P_Z , Q_Z ) &\subset \mathcal{C} (P_Y , Q_Y ), \end{align} (8.9)

    since we always have the option of incorporating P_{Z|Y} as a front end of the hypothesis test. Equality holds in (8.8)–(8.9) if Z is a Blackwell sufficient statistic of Y for (P_Y, Q_Y) , because from Z (along with, possibly, additional randomness) we can synthesize data whose conditional distributions are P_Y and Q_Y . Feeding that data to \phi results in the same ({\pi_{{\mathtt{1}} \mid {\mathtt{0}}}}, {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}}) as feeding the original Y to \phi .

    94. The minimal error probabilities compatible with zero error probability of the other kind are denoted by {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}} and {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}} , i.e., they are defined by

    ({{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}, 0) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) but ({\pi_{{\mathtt{1}} \mid {\mathtt{0}}}}, 0) \not\in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) if {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} < {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}} ;

    (0, {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}}) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) but (0, {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}}) \not\in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) if {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} < {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}} .

    By definition,

    \begin{align} \alpha_0 ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) & = {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}}, \end{align} (8.10)
    \begin{align} \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) & = 0 \; \Longleftrightarrow\; \nu \in [{{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}, 1]. \end{align} (8.11)

    Theorem 20. For any ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}^2 ,

    (a) {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}} = \Pi ({P_{{{\mathtt{0}}}}}\, \|\, {P_{{{\mathtt{1}}}}}) , achieved by the test \phi(y) = 1\{ y \in \mathcal{{S}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} \} ;

    (b) {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}} = \Pi ({P_{{{\mathtt{1}}}}}\, \|\, {P_{{{\mathtt{0}}}}}) , achieved by the test \phi(y) = 1\{ y \not\in \mathcal{{S}}_{{P_{{{\mathtt{0}}}}}\|{P_{{{\mathtt{1}}}}}} \} .

    Proof. For any test \phi achieving error probabilities ({\pi_{{\mathtt{1}} \mid {\mathtt{0}}}}, {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}}) ,

    \begin{align} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} = 0 \quad \Longleftrightarrow\quad {P_{{{\mathtt{1}}}}} ( \phi^{-1} (1) ) = 1, \end{align} (8.12)

    and

    \begin{align} {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} & \geq {P_{{{\mathtt{0}}}}} ( \phi^{-1} (1) ) \end{align} (8.13)
    \begin{align} & \geq {P_{{{\mathtt{0}}}}} ( \mathcal{{S}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ), \end{align} (8.14)

    where (8.14) holds if {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} = 0 because of (2.4) and (8.12). Furthermore, as we saw in (2.14), if \phi(y) = 1\{ y \in \mathcal{{S}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} \}, then the right side of (8.12) is satisfied and (8.13)–(8.14) become identities. Recalling (2.14) completes the proof of (a). The proof of (b) is identical.

    95. Introduced in [7], for (\gamma, \lambda) \in \mathbb R \times [0, 1] , a Neyman-Pearson test between {P_{{{\mathtt{1}}}}} and {P_{{{\mathtt{0}}}}} is

    \begin{align} \phi_{\gamma,\lambda} (y) = \begin{cases} 1, & \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} (y) > \gamma; \\ \lambda, & \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} (y) = \gamma; \\ 0, & \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} (y) < \gamma. \end{cases} \end{align} (8.15)

    The tests \phi_{\gamma, 0} and \phi_{\gamma, 1} are known as deterministic Neyman-Pearson tests. The limiting Neyman-Pearson tests are the deterministic tests

    \begin{align} \lim\limits_{\gamma \to \infty} \phi_{\gamma,\lambda} (y) & = 1\{ y \not\in \mathcal{{S}}_{{P_{{{\mathtt{0}}}}}\|{P_{{{\mathtt{1}}}}}}\}, \end{align} (8.16)
    \begin{align} \lim\limits_{\gamma \to -\infty} \phi_{\gamma,\lambda} (y) & = 1\{ y \in \mathcal{{S}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}}\} . \end{align} (8.17)

    96. With {Y_{{{\mathtt{0}}}}} \sim {P_{{{\mathtt{0}}}}} and {Y_{{{\mathtt{1}}}}} \sim {P_{{{\mathtt{1}}}}} , the Neyman-Pearson test (8.15) achieves the conditional error probabilities

    \begin{align} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} (\gamma, \lambda) & = 1 - \mathbb{E} [ \phi_{\gamma,\lambda} ({Y_{{{\mathtt{1}}}}}) ] = { \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) - \lambda\, \mathbb{P} \left[ \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \gamma \right] , \end{align} (8.18)
    \begin{align} {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\gamma, \lambda) & = \mathbb{E} [ \phi_{\gamma,\lambda} ({Y_{{{\mathtt{0}}}}}) ] = 1 - \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ( \gamma ) + \lambda\, \mathbb{P} \left[ \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) = \gamma \right]. \end{align} (8.19)

    The randomization serves to obtain convex combinations of the performances obtained by deterministic Neyman-Pearson tests,

    \begin{align} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} (\gamma, \lambda) & = \lambda \, {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} (\gamma, 1) + (1- \lambda) \, {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} (\gamma, 0), \end{align} (8.20)
    \begin{align} {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\gamma, \lambda) & = \lambda \, {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\gamma, 1) + (1- \lambda) \, {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\gamma, 0), \end{align} (8.21)

    where

    \begin{align} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} (\gamma, 0) & = \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \end{align} (8.22)
    \begin{align} {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\gamma, 0) & = 1- \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \end{align} (8.23)
    \begin{align} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} (\gamma, 1) & = \lim\limits_{\alpha \uparrow \gamma} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} (\alpha, 0) = \lim\limits_{\alpha \uparrow \gamma} \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\alpha), \end{align} (8.24)
    \begin{align} {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\gamma, 1) & = \lim\limits_{\alpha \uparrow \gamma} {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\alpha, 0) = 1- \lim\limits_{\alpha \uparrow \gamma} \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\alpha). \end{align} (8.25)
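
    The next sketch checks (8.15) and (8.18)–(8.19) numerically on a finite alphabet with strictly positive masses and natural logarithms: the conditional error probabilities of \phi_{\gamma, \lambda} computed directly as expectations are compared with the expressions in terms of the relative information spectra. The distributions and the parameters (\gamma, \lambda) are hypothetical.

```python
# A self-contained sketch of (8.15) and (8.18)-(8.19) on a finite alphabet
# with strictly positive masses (hypothetical distributions): the error
# probabilities of phi_{gamma,lambda} computed directly as expectations are
# compared with the spectrum-based expressions.
import numpy as np

P1 = np.array([0.5, 0.3, 0.2])
P0 = np.array([0.2, 0.3, 0.5])
i_vals = np.log(P1 / P0)                        # i_{P1||P0}(y), natural logs

def np_test(gamma, lam):
    """Neyman-Pearson test (8.15): decide P1 w.p. 1, lam, or 0."""
    return np.where(i_vals > gamma, 1.0, np.where(i_vals < gamma, 0.0, lam))

gamma, lam = 0.0, 0.3                            # hypothetical parameters
phi = np_test(gamma, lam)

# Direct evaluation of the conditional error probabilities
pi_01_direct = 1.0 - np.dot(P1, phi)             # decide H0 under H1
pi_10_direct = np.dot(P0, phi)                   # decide H1 under H0

# Spectrum-based expressions (8.18)-(8.19); exact float equality with gamma
# holds here because the middle symbol has identical masses under P1 and P0.
F    = P1[i_vals <= gamma].sum()                 # F_{P1||P0}(gamma)
Fbar = P0[i_vals <= gamma].sum()                 # Fbar_{P1||P0}(gamma)
pi_01_spec = F - lam * P1[i_vals == gamma].sum()
pi_10_spec = 1.0 - Fbar + lam * P0[i_vals == gamma].sum()

print(pi_01_direct, pi_01_spec)                  # should coincide
print(pi_10_direct, pi_10_spec)                  # should coincide
```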

    97. The following venerable result states that the non-limiting Neyman-Pearson tests are Pareto-optimal.

    Lemma 12. Neyman-Pearson [7]. Let {Y_{{{\mathtt{0}}}}} \sim {P_{{{\mathtt{0}}}}} and {Y_{{{\mathtt{1}}}}} \sim {P_{{{\mathtt{1}}}}} . For any \lambda \in [0, 1] , \gamma \in \mathbb R , and measurable function \phi\colon \mathcal{{Y}} \to [0, 1] ,

    \begin{align} \mathbb{E} [ \phi ({Y_{{{\mathtt{1}}}}} ) ] > \mathbb{E} [ \phi_{\gamma,\lambda} ({Y_{{{\mathtt{1}}}}} ) ] \quad \Longrightarrow \quad \mathbb{E} [ \phi ({Y_{{{\mathtt{0}}}}} ) ] > \mathbb{E} [ \phi_{\gamma,\lambda} ({Y_{{{\mathtt{0}}}}} ) ]. \end{align} (8.26)

    Proof. Invoking (8.18)–(8.19), (4.25), and Lemma 9 with g(a) = 1 -\phi (a) , we obtain

    \begin{align} \mathbb{E} [ \phi ({Y_{{{\mathtt{0}}}}} ) ] - \mathbb{E} [ \phi_{\gamma,\lambda} ({Y_{{{\mathtt{0}}}}} ) ] \geq \exp( -\gamma) \left( \mathbb{E} [ \phi ({Y_{{{\mathtt{1}}}}} ) ] - \mathbb{E} [ \phi_{\gamma,\lambda} ({Y_{{{\mathtt{1}}}}} ) ] \right). \end{align} (8.27)

    Since \exp(-\gamma) > 0 , (8.27) implies (8.26).

    98. In addition to giving the fundamental tradeoff in terms of the relative information spectra, the following result finds an operational role for the NP-divergence.

    Theorem 21. Let ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \in {\mathscr{P}_{\!\!\mathcal{{Y}}}}^2 be such that {P_{{{\mathtt{0}}}}} \not \perp {P_{{{\mathtt{1}}}}} .

    (a) The limiting Neyman-Pearson tests \phi (y) = 1\{ y \in \mathcal{{S}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}}\} and \phi (y) = 1\{ y \not\in \mathcal{{S}}_{{P_{{{\mathtt{0}}}}}\|{P_{{{\mathtt{1}}}}}}\} achieve the Pareto-optimal points ({{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}, 0) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) and (0, {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}}) \in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) , respectively.

    (b) For any other Pareto-optimal point in \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) , there exist \gamma \in \mathbb R and \lambda\in [0, 1] , such that the point is achieved by the Neyman-Pearson test \phi_{\gamma, \lambda} .

    (c) The intersection of \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) and the triangle (0, 0)— (0, 1)—(1, 0)— (0, 0) is the convex closure of the points

    \begin{align} \cup_{\gamma \in \mathbb R} \left( 1 - \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) \right) \cup \left( 0, 1\right) \cup \left( 1, 0\right). \end{align} (8.28)

    (d) For \nu \in (0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}) , the fundamental tradeoff function is given by

    \begin{align} \alpha_\nu ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) & = \max\limits_{\gamma\in \mathbb R} \left\{ \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) - \exp (\gamma) \left( \nu - 1 + \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) \right) \right\}, \end{align} (8.29)

    where the maximum is achieved by \gamma^\star :

    If \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}}^{-1} (1- \nu) \neq \varnothing , \gamma^\star is any solution to

    \begin{align} 1-\nu = \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma^\star), \end{align} (8.30)

    in which case

    \begin{align} \alpha_\nu ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) & = { \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} \left( \gamma^\star \right). \end{align} (8.31)

    If \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}}^{-1} (1- \nu) = \varnothing , then \gamma^\star is the unique scalar such that

    \begin{align} \lim\limits_{x\uparrow \gamma^\star} \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x) < 1- \nu < \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma^\star). \end{align} (8.32)

    Moreover, in this case,

    \begin{align} \alpha_\nu ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) & = \lambda^\star \, \lim\limits_{x\uparrow \gamma^\star} { \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x) + (1-\lambda^\star) \, \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma^\star) , \end{align} (8.33)

    where

    \begin{align} \lambda^\star = \frac{\bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma^\star) - 1 +\nu}{\bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma^\star) - \lim\limits_{x\uparrow \gamma^\star} \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x)}\in (0,1). \end{align} (8.34)

    (e)

    \begin{align} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) \, \; \Longleftrightarrow\; \, \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) = \mathcal{C}({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}). \end{align} (8.35)

    (f)

    \begin{align} \\[-10mm] | \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) | & = \tfrac12 \, S ( {P_{{{\mathtt{1}}}}} \, \| \, {P_{{{\mathtt{0}}}}} ). \end{align} (8.36)

    (g) If P \neq Q , then

    \begin{align} \\[-10mm] | \mathcal{C} ( P^{\otimes n} , Q^{\otimes n} ) | & = 1 - \exp \left(-2 n \, B ( P\,\| \, Q) + o(n) \right). \end{align} (8.37)

    Proof.

    (a) \Longleftarrow Theorem 20, (8.16)–(8.17), and the fact that ({{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}, 0) and (0, {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}}) are Pareto optimal by definition.

    (b) As a result of Lemma 12, the Neyman-Pearson test \phi_{\gamma, \lambda} is Pareto-optimal for all (\gamma, \lambda)\in \mathbb R \times [0, 1] . Moreover, for every value of \nu \in (0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}) , there is a (possibly non-unique) pair (\gamma, \lambda)\in \mathbb R \times [0, 1] which yields

    \begin{align} \nu = {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}}(\gamma,\lambda). \end{align} (8.38)

    To verify this, bearing (8.19) in mind, we have

    ● If \bar{ \mathbb{F}}^{-1}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (1-\nu) \neq \varnothing , then any \lambda\in [0, 1] and \gamma^\star that satisfies (8.30) give a solution to (8.38) since (8.30) implies \mathbb{P} \left[ \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) = \gamma^\star \right] = 0 .

    ● If \bar{ \mathbb{F}}^{-1}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (1-\nu) = \varnothing , then \gamma^\star and \lambda^\star defined by (8.32) and (8.34), respectively, provide a (unique) solution to (8.38). Actually, a solution to (8.38) would still ensue if \gamma^\star were replaced in (8.34) by any other \gamma , but \gamma^\star is the only choice that guarantees that \lambda^\star \in (0, 1) .

    (c) The closure of \mathcal{{C}}_0 = \cup_{\gamma \in \mathbb R} \left(1 - \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) \right) is \mathcal{{C}}_0 \cup \{({{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}, 0)\} \cup \{(0, {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}})\} , since (4.5), (4.8), and Theorem 20 yield

    \begin{align} ({{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}, 0) & = \lim\limits_{\gamma \to -\infty} \left( 1- \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) \right), \end{align} (8.39)
    \begin{align} (0,{{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}}) & = \lim\limits_{\gamma \to \infty} \left( 1- \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) \right). \end{align} (8.40)

    The presence of the points (0, 1) and (1, 0) in (8.28) serves to include the non-Pareto-optimal segments (0, 1)—(0, {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}}) and (1, 0)—({{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}, 0) , as well as their convex combinations. Note from (8.22)–(8.23) that the elements in \mathcal{{C}}_0 are the error probability pairs achieved by the deterministic Neyman-Pearson tests \phi_{\gamma, 0} . Moreover, the error probability pairs achieved by deterministic Neyman-Pearson tests \phi_{\gamma, 1} belong to \mathrm{cl} (\mathcal{{C}}_0) due to (8.24)–(8.25). Finally, as we saw in (8.20)–(8.21), the convex combinations of the pairs achieved by deterministic tests are the error probability pairs achieved by the randomized tests \phi_{\gamma, \lambda} , with \lambda \in (0, 1) .

    (d) Thanks to (4.25), for all \gamma\in \mathbb R ,

    \begin{align} \exp (\gamma )\, \mathbb{P} \left[ \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) = \gamma \right] = \mathbb{P} \left[ \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \gamma \right], \end{align} (8.41)

    and we can eliminate \lambda from (8.18)–(8.19) to show that the conditional probabilities achieved by the Neyman-Pearson test \phi_{\gamma, \lambda} satisfy the linear relationship

    \begin{align} {\pi_{{\mathtt{0}} \mid {\mathtt{1}}}} ( \gamma, \lambda) = \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) - \exp (\gamma) \left( {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} ( \gamma, \lambda) - 1 + \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) \right). \end{align} (8.42)

    As a result of (b), whenever \nu = {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\gamma, \lambda) \in (0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}) the right side of (8.42) equals \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) , while for any other \gamma the right side of (8.42), with {\pi_{{\mathtt{1}} \mid {\mathtt{0}}}} (\gamma, \lambda) replaced by \nu , is a lower bound on \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) . Therefore, maximizing over \gamma yields \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) as stated in (8.29). The optimizing values of \gamma were identified in the proof of (b). In particular,

    ● If \bar{ \mathbb{F}}^{-1}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (1-\nu) \neq \varnothing , then, when evaluated at any solution to (8.30), both probabilities in (8.41) are equal to zero and (8.18)–(8.19) yield (8.31). Note that if \bar{ \mathbb{F}}^{-1}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (1-\nu) has more than one element, then not only is \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} constant on an interval but so is { \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} on the same interval according to Theorem 4.1-(b). Therefore, (8.31) is well-defined.

    ● If \bar{ \mathbb{F}}^{-1}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (1-\nu) = \varnothing , then evaluating (8.18) at (\gamma^\star, \lambda^\star) yields (8.33), since \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma^\star) - \lim_{x\uparrow \gamma^\star} { \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x) = \mathbb{P} \left[ \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \gamma^\star \right] .

    (e) Recalling Item 41,

    \begin{align} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) \equiv ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) &\Longleftrightarrow \; \left\{ \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} = \mathbb{F}_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}}\; \mbox{and}\; \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} = \bar{ \mathbb{F}}_{{Q_{{{\mathtt{1}}}}}\|{Q_{{{\mathtt{0}}}}}} \right\} \end{align} (8.43)
    \begin{align} &\,\Longrightarrow\; \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) = \mathcal{C}({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) , \end{align} (8.44)

    where (8.44) follows from (c). To prove the reverse implication, we must show that the function \alpha_\nu ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) determines \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} . The explicit dependence is given in Theorem 22 in the Appendix.

    (f) Recalling the symmetry property in Theorem 19-(c),

    \begin{align} | \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} )| = 1 - 2 \int_0^{{{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}} \alpha_\nu ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) \, \mathrm{d}\nu. \end{align} (8.45)

    Because of the convexity of \alpha_\nu (Item 89), its derivative is a non-decreasing function which may have at most a countable number of discontinuities on the interval [0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}] . We partition the integral in (8.45) as the finite or countably infinite sum of subintegrals of differentiable sections, distinguishing between the sections in which \alpha_\nu is a straight line (corresponding to jumps in the relative information spectra) and those in which it is not. Recall that the non-straight-line sections are due to portions of the relative information spectra that are strictly monotonically increasing. Flat portions in the spectra only affect the kinks—points of discontinuous derivative—in \alpha_\nu , which do not contribute to its integral. Therefore, we have

    \begin{align} \int_0^{{{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}} \alpha_\nu \, \mathrm{d}\nu = \sum\limits_{\gamma \in \Gamma} \int_{\nu^-(\gamma)}^{\nu^+(\gamma)} \alpha_\nu \, \mathrm{d}\nu + \sum\limits_{i \in \mathcal{{I}}} \int_{\nu_i}^{\nu_{i+1}} \alpha_\nu \, \mathrm{d}\nu, \end{align} (8.46)

    where \Gamma is the finite, or countably infinite, set of abscissas at which the jumps in the relative information spectra occur, \alpha_\nu is a straight line on the intervals [ \nu^-(\gamma), \nu^+(\gamma) ] , and \alpha_\nu is differentiable but not a straight line on the intervals [\nu_i, \nu_{i+1}] .

    ⅰ. We saw in (d) that each \gamma \in \Gamma contributes a straight line in the fundamental tradeoff function of slope -\exp (\gamma) between the abscissas \nu^-(\gamma) = 1- \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) and \nu^+ (\gamma) = 1- \lim_{x\uparrow \gamma} \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x) . In view of (8.33), observe that

    \begin{align} \alpha_{\nu^-(\gamma)} & = \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \end{align} (8.47)
    \begin{align} \alpha_{\nu^-(\gamma)} - \alpha_{\nu^+(\gamma)} & = \mathbb{P} [ \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \gamma]. \end{align} (8.48)

    Therefore, the trapezoidal area is

    \begin{align} \int_{\nu^-(\gamma)}^{\nu^+(\gamma)} \alpha_\nu \, \mathrm{d}\nu & = \tfrac12 \left( \alpha_{\nu^-(\gamma)} + \alpha_{\nu^+(\gamma)} \right) \left( \nu^+(\gamma) - \nu^-(\gamma) \right) \end{align} (8.49)
    \begin{align} & = \left( \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) - \tfrac12 \mathbb{P} [ \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \gamma] \right) \, \mathbb{P} [ \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) = \gamma]. \end{align} (8.50)

    Then, the sum of the subintegrals (8.50) due to the straight-line segments equals

    \begin{align} \sum\limits_{\gamma \in \Gamma} \int_{\nu^-(\gamma)}^{\nu^+(\gamma)} \alpha_\nu \, \mathrm{d}\nu & = \sum\limits_{\gamma \in \Gamma} \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma)\, \mathbb{P} [ \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) = \gamma] \\ &\; \; - \tfrac12\, \mathbb{P} [\imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) ] , \end{align} (8.51)

    where {Y_{{{\mathtt{0}}}}} and {Y_{{{\mathtt{1}}}}} are independent.

    ⅱ. For a section between \nu_0 and \nu_1 on which \alpha_\nu is differentiable and not a straight line, the parametric solution in (8.18)–(8.19) reduces to

    \begin{align} \alpha_\nu & = \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \end{align} (8.52)
    \begin{align} 1 -\nu & = \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \end{align} (8.53)

    whose definite integral can be written as the Lebesgue-Stieltjes integral

    \begin{align} \int_{\nu_0}^{\nu_1} \alpha_\nu \, \mathrm{d}\nu = \int_{\bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}}^{-1} (1- \nu_1)}^{\bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}}^{-1} (1- \nu_0)} \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (t) \, \mathrm{d} \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (t). \end{align} (8.54)

    Summing (8.51) and all subintegrals of the non-straight-line portions in (8.54) yields

    \begin{align} \int_0^{{{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}} \alpha_\nu ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) \, \mathrm{d}\nu & = \int_{-\infty}^{\infty} \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (t) \, \mathrm{d} \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (t) - \tfrac12 \mathbb{P} [\imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) ] \end{align} (8.55)
    \begin{align} & = \mathbb{P} [\imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) \leq \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) ] - \tfrac12 \mathbb{P} [\imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) ] . \end{align} (8.56)

    Plugging (8.56) into (8.45) yields

    \begin{align} | \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} )| & = \mathbb{P} [\imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) \leq \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) ] - \mathbb{P} [\imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) \leq \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) ] \end{align} (8.57)
    \begin{align} & = \tfrac{1}{2} | {P_{{{\mathtt{1}}}}} \otimes {P_{{{\mathtt{0}}}}} - {P_{{{\mathtt{0}}}}} \otimes {P_{{{\mathtt{1}}}}}|, \end{align} (8.58)

    in light of Theorem 8-(i).

    (g) \Longleftarrow (6.4) and (8.36).
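
    As a numerical sanity check of Theorem 21-(f), assuming a finite alphabet with strictly positive masses (so that {P_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} are mutually absolutely continuous), the following sketch computes the area of the tradeoff region from the piecewise-linear Pareto-optimal boundary traced by the Neyman-Pearson tests and compares it with one-half of the NP-divergence; the distributions are hypothetical. For the numerical values below, both quantities evaluate to 0.39.

```python
# A numerical check of Theorem 21-(f), assuming a finite alphabet with
# strictly positive masses (hypothetical distributions): the area of
# C(P1, P0) is computed from the piecewise-linear Pareto-optimal boundary
# traced by the Neyman-Pearson tests and compared with half the NP-divergence.
import numpy as np

P1 = np.array([0.5, 0.3, 0.2])
P0 = np.array([0.2, 0.3, 0.5])
i_vals = np.log(P1 / P0)                            # i_{P1||P0}(y)

# Vertices (pi_{1|0}, pi_{0|1}) of the lower boundary: deterministic tests
# phi_{gamma,0} with gamma swept over the atoms of i_{P1||P0} in decreasing
# order (the largest atom gives (0,1)), plus the endpoint (1,0).
atoms = np.unique(i_vals)[::-1]
verts = [(P0[i_vals > g].sum(), P1[i_vals <= g].sum()) for g in atoms]
verts.append((1.0, 0.0))

# Integral of alpha_nu under the piecewise-linear boundary (trapezoid rule),
# then |C| = 1 - 2 * integral, cf. (8.45).
integral = sum(0.5 * (y0 + y1) * (x1 - x0)
               for (x0, y0), (x1, y1) in zip(verts, verts[1:]))
area_C = 1.0 - 2.0 * integral

np_divergence = np.abs(np.outer(P1, P0) - np.outer(P0, P1)).sum()
print(area_C, 0.5 * np_divergence)                  # the two coincide (up to rounding)
```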

    99. A folk theorem (e.g., [63,64]) is that the area under the curve (Item 92) is the "probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance". The ambiguity in whether "higher" means \geq or > is inconsequential if the relative information spectra are continuous. Otherwise, we must split the difference as (8.7) together with Theorem 21-(f) yields, with ({Y_{{{\mathtt{0}}}}}, {Y_{{{\mathtt{1}}}}}) \sim {P_{{{\mathtt{0}}}}} \otimes {P_{{{\mathtt{1}}}}} ,

    \begin{align} \int_0^1 \left( 1 - \alpha_\nu ( {P_{{{\mathtt{1}}}}} , {P_{{{\mathtt{0}}}}} ) \right) \, \mathrm{d}\nu & = \mathbb{P} [ \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) > \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) ] + \tfrac12 \,\mathbb{P} [ \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \imath_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) ]. \end{align} (8.59)
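
    The following sketch verifies (8.59) numerically on a finite alphabet with strictly positive masses, comparing the probabilistic expression for the AUC with \tfrac12 + \tfrac12 | \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) | obtained from the NP-divergence via Theorem 21-(f); the distributions are hypothetical, and for the numerical values below both quantities evaluate to 0.695.

```python
# A numerical check of (8.59) on a finite alphabet with strictly positive
# masses (hypothetical distributions): the probabilistic expression for the
# AUC is compared with 1/2 + (1/2)|C(P1,P0)|, where |C(P1,P0)| is obtained
# from the NP-divergence via Theorem 21-(f).
import numpy as np

P1 = np.array([0.5, 0.3, 0.2])
P0 = np.array([0.2, 0.3, 0.5])
i_vals = np.log(P1 / P0)

joint = np.outer(P0, P1)                       # (Y0, Y1) ~ P0 x P1, independent
diff = i_vals[None, :] - i_vals[:, None]       # i(Y1) - i(Y0) on the product space
auc = joint[diff > 0].sum() + 0.5 * joint[np.isclose(diff, 0.0)].sum()

np_div = np.abs(np.outer(P1, P0) - np.outer(P0, P1)).sum()
print(auc, 0.5 + 0.25 * np_div)                # the two coincide (up to rounding)
```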

    100. A corollary to a result by Pfanzagl [61] is that \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) = \mathcal{C} ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) (in the notation of Theorems 11 and 12) is a sufficient condition for Z to be a sufficient statistic of Y for \{ {P_{{{\mathtt{0}}}}}, {P_{{{\mathtt{1}}}}}\} . In fact, Theorems 16 and 21-(e) imply that the preservation of the fundamental tradeoff region in hypothesis testing is an equivalent criterion for pairwise sufficiency. Therefore, Theorem 18-(n) will follow from

    \begin{align} \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) = \mathcal{C}({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) \; &\Longleftrightarrow\; \, | \mathcal{C} ( {P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}} ) | = | \mathcal{C}({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) | \end{align} (8.60)
    \begin{align} &\Longleftrightarrow\; \, S ( {P_{{{\mathtt{1}}}}}\,\|\,{P_{{{\mathtt{0}}}}}) = S ( {Q_{{{\mathtt{1}}}}}\,\|\,{Q_{{{\mathtt{0}}}}}). \end{align} (8.61)

    To justify (8.60), recall from Item 93 that \mathcal{C} ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) \subset \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) . Therefore, |\mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}})| > |\mathcal{C} ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}})| unless \mathcal{C} ({P_{{{\mathtt{1}}}}}, {P_{{{\mathtt{0}}}}}) = \mathcal{C} ({Q_{{{\mathtt{1}}}}}, {Q_{{{\mathtt{0}}}}}) . Theorem 21-(f) implies (8.61).

    One of the defining features of information theory is the study of random variables such as \imath_X (X) = \log \frac1{P_X (X)} , \imath_{X\|Y} (X) and \imath_{X\|Y} (Y) , where the probability mass function of X is evaluated at X and the log density function of P_X with respect to P_Y is evaluated at X or Y , respectively. The averages of those random variables, entropy and relative entropy, are the pillars that sustain the asymptotics of the fundamental limits in data compression, hypothesis testing, and data transmission in stationary ergodic models. Beyond averages, the study of the distributions of those random variables, also known as information spectra and relative information spectra, is the key to non-asymptotic fundamental limits.

    This paper has studied the relative information spectra for arbitrary pairs of probability measures defined on the same measurable space. To that end, the formalization of the concepts of relative support and coefficient of absolute discontinuity has proven valuable. Particular emphasis has been placed on the interplay of the distributions of \imath_{X\|Y} (X) and \imath_{X\|Y} (Y) , which determine each other, as well as their relationships with measures of discrepancy such as total variation distance, relative entropy, Rényi divergence and f -divergences. Equivalent pairs of probability measures (possibly belonging to different measurable spaces) are those with identically-distributed relative informations.

    The exposition of the applications to statistical inference has emphasized their connections to the literature. Based on equivalent pairs, we have introduced the conceptually simple notion of I -sufficiency, which can be checked easily even without the usual assumptions of deterministic statistics and dominated collections on standard spaces. When those assumptions are satisfied, the necessary and sufficient condition given by the Halmos-Savage factorization theorem (Theorem 9) remains the gold standard for verifying the sufficiency of deterministic transformations.

    The non-asymptotic (Neyman-Pearson) fundamental tradeoff region of conditional error probabilities in binary hypothesis testing is a major application of the relative information spectra. We have given a detailed description of the region without any assumptions of absolute continuity. The area of the Neyman-Pearson tradeoff region is a normalized measure of the discrepancy between the probability measures, equal to zero [resp., one] for identical [resp., orthogonal] probability measures, which is popular in applications in a slightly modified form referred to as the area under the curve (AUC). We have shown that the area of the Neyman-Pearson tradeoff region is equal to one-half of the NP-divergence, | {P_{{{\mathtt{0}}}}}\otimes {P_{{{\mathtt{1}}}}} - {P_{{{\mathtt{1}}}}} \otimes {P_{{{\mathtt{0}}}}} | , a new discrepancy measure between probability measures {P_{{{\mathtt{0}}}}} and {P_{{{\mathtt{1}}}}} . Along with Chernoff information, it appears to be one of the most interesting divergences among those that satisfy the data processing inequality but are not f -divergences. We have shown that the preservation of the NP-divergence is a necessary and sufficient condition for the statistic to be sufficient. An immediate operational role is inherited from total variation distance, as the NP-divergence governs the error probability of the Bayesian test that identifies the order of a pair of observations, one drawn from {P_{{{\mathtt{0}}}}} and the other from {P_{{{\mathtt{1}}}}} . A new asymptotic operational role for the Bhattacharyya distance has been shown for independent identically distributed observations: the area of the fundamental non-Bayesian tradeoff region approaches 1 exponentially fast, with an exponent equal to twice the number of observations times the Bhattacharyya distance. In contrast, as shown in [14] in the Bayesian setting, the exponential decay of the minimum error probability is governed by the Chernoff information regardless of the values of the nonzero a priori probabilities.

    On account of the convexity of \alpha_\nu , its derivative on (0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}) is negative and non-decreasing, with a finite, or countably infinite, number of discontinuities. Those discontinuities determine the locations of the jumps of \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} , which are the same as those of \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} . For \nu \in (0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}) , denote the left/right derivatives by

    \begin{align} \dot{\alpha}^-_{\nu} = \lim\limits_{\epsilon \downarrow 0} \frac{\alpha_\nu - \alpha_{\nu-\epsilon}}{\epsilon} \leq \lim\limits_{\epsilon \downarrow 0} \frac{ \alpha_{\nu+\epsilon}- \alpha_\nu }{\epsilon} = \dot{\alpha}^+_{\nu} < 0. \end{align} (A.1)

    Naturally, we drop the superscript whenever \dot{\alpha}^-_{\nu} = \dot{\alpha}^+_{\nu} . The following result gives \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} as a function of \{ \alpha_\nu, \nu \in [0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}})\} , with {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}} = \min\{ \nu\in [0, 1] \colon \alpha_\nu = 0\} , as per (8.11). The fact that the relative information spectrum and the fundamental tradeoff region determine each other validates the opening sentence in the abstract.

    Theorem 22.

    1. \lim_{t \to \infty} \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (t) = {{\underline{\pi}_{{\mathtt{0}} \mid {\mathtt{1}}}}} = \alpha_0 .

    2. \lim_{t \to -\infty} \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (t) = 1- {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}} .

    3. Fix \gamma \in \mathbb R . To determine \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) , there are two possibilities.

    (a) There is a unique \bar{\nu}_\gamma \in (0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}) such that

    \begin{align} \dot{\alpha}^-_{\bar{\nu}_\gamma} \leq - \exp(\gamma) \leq \dot{\alpha}^+_{\bar{\nu}_\gamma}. \end{align} (A.2)

    Then, \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) = \alpha_{\bar{\nu}_\gamma} and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) = 1 - {\bar{\nu}_\gamma} .

    (b) Let (\nu_\gamma^-, \nu_\gamma^+) \subset [0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}] be the largest open interval such that

    \begin{align} \dot{\alpha}_{\nu} = - \exp ( \gamma ), \; \mathit{\mbox{for}} \; \nu \in (\nu_\gamma^-, \nu_\gamma^+ ). \end{align} (A.3)

    Then, \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) = \alpha_{\nu_\gamma^-} and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) = 1 - {\nu_\gamma^-} . Furthermore, \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} experiences a jump at \gamma of height \alpha_{\nu_\gamma^-} - \alpha_{\nu_\gamma^+} , while the jump at \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) has height \nu_\gamma^+ - \nu_\gamma^- .

    Proof.

    1) \Longleftarrow (8.10) and (8.39).

    2) \Longleftarrow (8.40).

    3) As we saw in Theorem 21-(d), \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} experience a jump at \gamma if and only if \alpha_\nu contains a straight-line segment of slope -\exp(\gamma) , i.e., if and only if case 3b) applies.

    3a) Since \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} are continuous at \gamma , Theorem 21-(d) gives

    \begin{align} \alpha_\nu & = \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \end{align} (A.4)
    \begin{align} 1- \nu & = \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma). \end{align} (A.5)

    At those \nu \in (0, {{\underline{\pi}_{{\mathtt{1}} \mid {\mathtt{0}}}}}) such that \dot{\alpha}^-_{\nu} = \dot{\alpha}^+_{\nu} , we can differentiate (A.4) and (A.5) with respect to \nu and \gamma , respectively, to conclude, with the aid of (4.26), that

    \begin{align} \dot{\alpha}_\nu = - \exp (\gamma ). \end{align} (A.6)

    If \dot{\alpha}^-_{\nu} < \dot{\alpha}^+_{\nu} , the discontinuity in the derivative is caused by the fact that there is an interval of values of \gamma on which both \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma) are constant; therefore, according to (A.4)–(A.5), those values of \gamma result in a single Pareto-optimal point ({\bar{\nu}_\gamma}, \alpha_{\bar{\nu}_\gamma}) . The interval of values of \gamma is indeed (A.2) since any slope strictly lower than \dot{\alpha}^-_{\bar{\nu}_\gamma} , or strictly higher than \dot{\alpha}^+_{\bar{\nu}_\gamma} , corresponds to Pareto-optimal points other than ({\bar{\nu}_\gamma}, \alpha_{\bar{\nu}_\gamma}) .

    3b) Since \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} experience a jump at \gamma , \alpha_\nu has a straight line with slope

    \begin{align} - \frac{\mathbb{P} \left[ \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{1}}}}}) = \gamma \right]}{\mathbb{P} \left[ \imath_{{P_{{{\mathtt{1}}}}}\| {P_{{{\mathtt{0}}}}}} ({Y_{{{\mathtt{0}}}}}) = \gamma \right]} = -\exp (\gamma), \end{align} (A.7)

    according to (8.33)–(8.34) and (8.41). The range of abscissas \nu of that straight line is given by (8.32), thereby indicating that

    \begin{align} \nu_\gamma^- & = 1- \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \end{align} (A.8)
    \begin{align} \nu_\gamma^+ & = 1- \lim\limits_{x\uparrow \gamma} \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x). \end{align} (A.9)

    Again, according to (8.33)–(8.34), the corresponding ordinates of those points are

    \begin{align} \alpha_{\nu_\gamma^-} & = \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (\gamma), \end{align} (A.10)
    \begin{align} \alpha_{\nu_\gamma^+} & = \lim\limits_{x\uparrow \gamma} \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} (x). \end{align} (A.11)
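
    To illustrate Theorem 22, the following sketch (assuming a finite alphabet with strictly positive masses and hypothetical distributions) represents the fundamental tradeoff function by its Pareto-optimal vertices, recovers \mathbb{F}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} and \bar{ \mathbb{F}}_{{P_{{{\mathtt{1}}}}}\|{P_{{{\mathtt{0}}}}}} from the boundary as in item 3, and compares them with their definitions.

```python
# A sketch of Theorem 22, item 3, assuming a finite alphabet with strictly
# positive masses (hypothetical distributions): the relative information
# spectra are recovered from the Pareto-optimal boundary and compared with
# their definitions.
import numpy as np

P1 = np.array([0.5, 0.3, 0.2])
P0 = np.array([0.2, 0.3, 0.5])
i_vals = np.log(P1 / P0)

# Pareto-optimal vertices (nu, alpha_nu) from the deterministic
# Neyman-Pearson tests, gamma swept over the atoms in decreasing order.
atoms = np.unique(i_vals)[::-1]
verts = [(P0[i_vals > g].sum(), P1[i_vals <= g].sum()) for g in atoms]
verts.append((1.0, 0.0))
slopes = [(y1 - y0) / (x1 - x0)
          for (x0, y0), (x1, y1) in zip(verts, verts[1:])]

def spectra_from_alpha(gamma):
    """Recover (F_{P1||P0}(gamma), Fbar_{P1||P0}(gamma)) as in Theorem 22-3."""
    s = -np.exp(gamma)
    for k, (x, y) in enumerate(verts):
        left  = slopes[k - 1] if k > 0 else -np.inf
        right = slopes[k] if k < len(slopes) else 0.0
        if np.isclose(s, right):      # case (b): slope of a boundary segment;
            return y, 1.0 - x         # report its left endpoint (nu_gamma^-)
        if left < s < right:          # case (a): subgradient at a vertex
            return y, 1.0 - x
    raise ValueError("unreachable for finite gamma")

for g in [-1.5, -0.5, 0.0, 0.5, 1.5]:
    recovered = spectra_from_alpha(g)
    direct = (P1[i_vals <= g].sum(), P0[i_vals <= g].sum())
    print(recovered, direct)          # the two pairs coincide
```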

    The author declares he has not used Artificial Intelligence (AI) tools in the creation of this article.

    I am indebted to guest-editor Prof. Igal Sason for graciously extending the invitation for this submission and his meticulous reading of the manuscript. Reference [43] was kindly brought to my attention by an anonymous reviewer.

    The author declares no conflict of interest.



    [1] C. E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J., 27 (1948), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x doi: 10.1002/j.1538-7305.1948.tb01338.x
    [2] S. Kullback, R. A. Leibler, On information and sufficiency, Ann. Math. Stat., 22 (1951), 79–86. https://doi.org/10.1214/aoms/1177729694 doi: 10.1214/aoms/1177729694
    [3] P. R. Halmos, L. J. Savage, Application of the Radon-Nikodym theorem to the theory of sufficient statistics, Ann. Math. Stat., 20 (1949), 225–241. https://doi.org/10.1214/aoms/1177730032 doi: 10.1214/aoms/1177730032
    [4] R. M. Fano, Class notes for course 6.574: Statistical theory of information, Massachusetts Institute of Technology, Cambridge, Mass., 1953.
    [5] D. V. Lindley, On a measure of the information provided by an experiment, Ann. Math. Stat., 27 (1956), 986–1005. https://doi.org/10.1214/aoms/1177728069 doi: 10.1214/aoms/1177728069
    [6] H. Chernoff, Large-sample theory: Parametric case, Ann. Math. Stat., 27 (1956), 1–22. Available from: https://www.jstor.org/stable/2236974.
    [7] J. Neyman, E. S. Pearson, On the problem of the most efficient tests of statistical hypotheses, Philos. T. Roy. Soc. London Ser. A, 231 (1933), 289–337. https://doi.org/10.1098/rsta.1933.0009 doi: 10.1098/rsta.1933.0009
    [8] I. N. Sanov, On the probability of large deviations of random variables, Mat. Sb., 42 (1957), 11–44. https://doi.org/10.2307/3197345 doi: 10.2307/3197345
    [9] H. Cramér, Sur un nouveau théorème-limite de la théorie des probabilités, Actual. Sci. Ind., 736 (1938), 5–23.
    [10] E. T. Jaynes, Information theory and statistical mechanics, Phys. Rev. Ser. II, 106 (1957), 620–630. https://doi.org/10.1103/PhysRev.106.620 doi: 10.1103/PhysRev.106.620
    [11] E. T. Jaynes, Information theory and statistical mechanics Ⅱ, Phys. Rev. Ser. II, 108 (1957), 171–190. https://doi.org/10.1103/PhysRev.108.171 doi: 10.1103/PhysRev.108.171
    [12] S. Kullback, Information theory and statistics, Dover: New York, 1968.
    [13] A. Rényi, On measures of information and entropy, In: Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press: Berkeley, California, 1961,547–561.
    [14] H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, Ann. Math. Stat., 23 (1952), 493–507. https://doi.org/10.1214/aoms/1177729330 doi: 10.1214/aoms/1177729330
    [15] I. Csiszár, Information-type measures of difference of probability distributions and indirect observations, Stud. Sci. Math. Hung., 2 (1967), 299–318. https://doi.org/10.1016/S0010-8545(00)80126-5 doi: 10.1016/S0010-8545(00)80126-5
    [16] K. Pearson, On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, London Edinb. Dublin Philos. Mag. J. Sci., 50 (1900), 157–175. https://doi.org/10.1080/14786440009463897 doi: 10.1080/14786440009463897
    [17] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proc. Roy. Soc. London Ser. A Math. Phys. Sci., 186 (1946), 453–461. https://doi.org/10.1098/rspa.1946.0056 doi: 10.1098/rspa.1946.0056
    [18] I. Vincze, On the concept and measure of information contained in an observation, In: Contributions to Probability: A Collection of Papers Dedicated to Eugene Lukacs, Academic Press: New York, 1981,207–214. https://doi.org/10.1016/0091-3057(81)90179-9
    [19] L. Le Cam, Asymptotic methods in statistical decision theory, Springer: New York, 1986.
    [20] M. H. DeGroot, Uncertainty, information, and sequential experiments, Ann. Math. Stat., 33 (1962), 404–419. https://doi.org/10.1214/aoms/1177704567
    [21] T. S. Han, S. Verdú, Approximation theory of output statistics, IEEE T. Inform. Theory, 39 (1993), 752–772. https://doi.org/10.1109/18.256486 doi: 10.1109/18.256486
    [22] T. S. Han, Information spectrum methods in information theory, Springer: Heidelberg, Germany, 2003.
    [23] Y. Polyanskiy, H. V. Poor, S. Verdú, Channel coding rate in the finite blocklength regime, IEEE T. Inform. Theory, 56 (2010), 2307–2359. https://doi.org/10.1109/TIT.2010.2043769 doi: 10.1109/TIT.2010.2043769
    [24] S. Verdú, The Cauchy distribution in information theory, Entropy, 25 (2023), 1–48. https://doi.org/10.3390/e25010048 doi: 10.3390/e25010048
    [25] D. Burkholder, Sufficiency in the undominated case, Ann. Math. Stat., 32 (1961), 1191–1200. https://doi.org/10.1214/aoms/1177704859 doi: 10.1214/aoms/1177704859
    [26] P. R. Halmos, Measure theory, Springer: New York, 1974.
    [27] P. Billingsley, Probability and measure, 4 Eds., Wiley-Interscience: New York, 2012.
    [28] I. Csiszár, J. Körner, Information theory: Coding theorems for discrete memoryless systems, Academic: New York, 1981.
    [29] A. Bhattacharyya, On some analogues of the amount of information and their use in statistical estimation, Sankhyā Indian J. Stat., 8 (1946), 1–14.
    [30] T. van Erven, P. Harremoës, Rényi divergence and Kullback-Leibler divergence, IEEE T. Inform. Theory, 60 (2014), 3797–3820. https://doi.org/10.1109/TIT.2014.2320500
    [31] A. Rényi, New version of the probabilistic generalization of the large sieve, Acta Math. Hung., 10 (1959), 217–226. https://doi.org/10.1007/BF02063300
    [32] I. Csiszár, Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten, Publ. Math. Inst. Hung. Acad. Sci., 8 (1963), 85–108. Available from: https://real.mtak.hu/201426/.
    [33] S. M. Ali, S. D. Silvey, A general class of coefficients of divergence of one distribution from another, J. Roy. Stat. Soc. Ser. B, 28 (1966), 131–142. https://doi.org/10.2307/4441277
    [34] I. Sason, On f-divergences: Integral representations, local behavior, and inequalities, Entropy, 20 (2018), 1–32. https://doi.org/10.3390/e20010032
    [35] F. Liese, I. Vajda, f-divergences: Sufficiency, deficiency and testing of hypotheses, In: Advances in Inequalities from Probability Theory and Statistics, Nova Science: New York, 2008, 113–158.
    [36] I. Sason, S. Verdú, f-divergence inequalities, IEEE T. Inform. Theory, 62 (2016), 5973–6006. https://doi.org/10.1109/TIT.2016.2603151
    [37] I. Vajda, Theory of statistical inference and information, Kluwer: Dordrecht, The Netherlands, 1989.
    [38] I. Csiszár, Information measures: A critical survey, In: Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, Publishing House of the Czechoslovak Academy of Sciences: Prague, 1974, 73–86.
    [39] F. Oesterreicher, I. Vajda, Statistical information and discrimination, IEEE T. Inform. Theory, 39 (1993), 1036–1039. https://doi.org/10.1109/18.256536
    [40] F. Liese, I. Vajda, On divergences and informations in statistics and information theory, IEEE T. Inform. Theory, 52 (2006), 4394–4412. https://doi.org/10.1109/TIT.2006.881731
    [41] F. Liese, φ-divergences, sufficiency, Bayes sufficiency, and deficiency, Kybernetika, 48 (2012), 690–713. Available from: https://www.kybernetika.cz/content/2012/4/690.
    [42] S. Verdú, Total variation distance and the distribution of relative information, In: Proceedings of the 2014 Workshop on Information Theory and Applications, University of California: La Jolla, California, 2014.
    [43] A. Kontorovich, Obtaining measure concentration from Markov contraction, Markov Process. Relat. Fields, 18 (2012), 613–638.
    [44] V. Strassen, The existence of probability measures with given marginals, Ann. Math. Stat., 36 (1965), 423–439. https://doi.org/10.1214/aoms/1177700153
    [45] R. L. Dobrushin, Prescribing a system of random variables by conditional distributions, Theor. Probab. Appl., 15 (1970), 458–486. https://doi.org/10.1137/1115049
    [46] Y. Polyanskiy, S. Verdú, Arimoto channel coding converse and Rényi divergence, In: Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, University of Illinois: Monticello, Illinois, 2010, 1327–1333.
    [47] R. A. Fisher, On the mathematical foundations of theoretical statistics, Philos. Trans. Roy. Soc. London Ser. A, 222 (1922), 309–368. https://doi.org/10.1098/rsta.1922.0009
    [48] D. Blackwell, Equivalent comparisons of experiments, Ann. Math. Stat., 24 (1953), 265–272. https://doi.org/10.1214/aoms/1177729032
    [49] R. R. Bahadur, Sufficiency and statistical decision functions, Ann. Math. Stat., 25 (1954), 423–462. https://doi.org/10.1214/aoms/1177728715
    [50] D. Blackwell, R. V. Ramamoorthi, A Bayes but not classically sufficient statistic, Ann. Stat., 10 (1982), 1025–1026.
    [51] J. Dieudonné, Sur le théorème de Lebesgue-Nikodym, Ann. Math., 42 (1941), 547–555.
    [52] H. Heyer, Theory of statistical experiments, Springer: New York, 1982.
    [53] J. Neyman, Su un teorema concernente le cosiddette statistiche sufficienti, Giorn. Ist. Ital. Attuari, 6 (1935), 320–334.
    [54] T. P. Speed, A note on pairwise sufficiency and completions, Sankhyā Indian J. Stat. Ser. A, 38 (1976), 194–196.
    [55] A. N. Kolmogorov, Definition of center of dispersion and measure of accuracy from a finite number of observations, Izv. Akad. Nauk SSSR Ser. Mat., 6 (1942), 4–32.
    [56] T. M. Cover, J. A. Thomas, Elements of information theory, 2 Eds., Wiley: New York, 2006.
    [57] D. Blackwell, Comparison of experiments, In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, University of California Press: Berkeley, California, 1951, 93–102.
    [58] R. D. Reiss, Approximate distributions of order statistics: With applications to nonparametric statistics, Springer: New York, 2012.
    [59] H. Strasser, Mathematical theory of statistics: Statistical experiments and asymptotic decision theory, Walter de Gruyter: Berlin, 1985.
    [60] R. R. Bahadur, A characterization of sufficiency, Ann. Math. Stat., 26 (1955), 286–293. https://doi.org/10.1214/aoms/1177728545
    [61] J. Pfanzagl, A characterization of sufficiency by power functions, Metrika, 21 (1974), 197–199.
    [62] H. L. van Trees, Detection, estimation, and modulation theory, Part I: Detection, estimation, and linear modulation theory, John Wiley: New York, 1968.
    [63] D. J. Hand, R. J. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems, Mach. Learn., 45 (2001), 171–186. https://doi.org/10.1023/A:1010920819831
    [64] T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett., 27 (2006), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010
© 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0).