Proving prediction prudence

Dirk Tasche

doi:10.3934/DSFE.2022017

Data Science in Finance and Economics

2022, Volume 2, Issue 4: 335-355. doi: 10.3934/DSFE.2022017


Research article


Dirk Tasche

Independent researcher, Switzerland

Received: 15 August 2022 Revised: 02 September 2022 Accepted: 11 September 2022 Published: 23 September 2022
JEL Codes: G21, C12

We study how to perform tests on samples of pairs of observations and predictions in order to assess whether or not the predictions are prudent. Prudence requires that the mean of the difference of the observation-prediction pairs can be shown to be significantly negative. For safe conclusions, we suggest testing both unweighted (or equally weighted) and weighted means and explicitly taking into account the randomness of individual pairs. The test methods presented are mainly specified as bootstrap and normal approximation algorithms. The tests are general but can be applied in particular in the area of credit risk, both for regulatory and accounting purposes.


Citation: Dirk Tasche. Proving prediction prudence. Data Science in Finance and Economics, 2022, 2(4): 335-355. doi: 10.3934/DSFE.2022017



1. Introduction

Testing if the means of two samples significantly differ or the mean of one sample significantly exceeds the mean of the other sample is a problem that is widely covered in the statistical literature [see for instance Casella and Berger, 2002, Davison and Hinkley, 1997, Venables and Ripley, 2002]. In this paper, we study how to perform such tests on samples of pairs of observations and predictions in order to assess whether or not the predictions are prudent. Prudence is here understood as the requirement that the mean of the differences of the observations and predictions can be shown to be significantly negative.

At the latest since the validation requirements for credit risk parameter estimates in the regulatory Basel II framework [BCBS, 2006, paragraph 501] came into force, such tests have also been an important issue in the banking industry^*:

^* PD means 'probability of default', IRB means 'internal ratings based', LGD means 'loss given default' and EAD is 'exposure at default'.

● "Banks must regularly compare realised default rates with estimated PDs for each grade and be able to demonstrate that the realised default rates are within the expected range for that grade", and "banks using the advanced IRB approach must complete such analysis for their estimates of LGDs and EADs".

More recently, as a consequence of the introduction of new rules for loss provisioning in financial reporting standards, the validation of risk parameter estimates also attracted interest in the accounting community [see, e.g., Bellini, 2019]. Over the course of the past fifteen years or so, a variety of statistical tests for the comparison of realised and predicted values have been proposed for use in the banks' validation exercises. For overviews on estimation and validation as well as references, see Blümke [2019, PD], Loterman et al. [2014, LGD], and Gürtler et al. [2018, EAD]. Scandizzo [2016] presents validation methods for all these kinds of parameters in the general context of model risk management.

In order to make validation results by different banks to some extent comparable, in February 2019, the European Central Bank [ECB, 2019]^† asked the banks it supervises under the Single Supervisory Mechanism (SSM) to deliver standardised annual reports on their internal model validation exercises. In particular, the requested reports are assumed to include data and tests regarding the "predictive ability (or calibration)" of PD, LGD and CCF (credit conversion factor)^‡ parameters in the most recent observation period. Predictive ability for LGD estimation is explained through the statement "the analysis of predictive ability (or calibration) is aimed at ensuring that the LGD parameter adequately predicts the loss rate in the event of a default i.e. that LGD estimates constitute reliable forecasts of realised loss rates" [ECB, 2019, Section 2.6.2]. The meanings of predictive ability for PD and EAD / CCF respectively are illustrated in similar ways.

^† In May 2020, this document could be downloaded at https://www.bankingsupervision.europa.eu/banking/tasks/internal_models/shared/pdf/instructions_validation_reporting_credit_risk.en.pdf.

^‡ EAD and CCF of a credit facility are linked by the relation $\text{EAD} = \text{DA} + \text{CCF} \cdot (\text{limit} - \text{DA})$ , where DA is the already drawn amount.

ECB [2019] proposed "one-sample t-test[s] for paired observations" to test the "null hypothesis that estimated LGD [or CCF or EAD] is greater than true LGD" (or CCF or EAD). ECB [2019] also suggested a Jeffreys binomial test for the "null hypothesis that the PD applied in the portfolio/rating grade at the beginning of the relevant observation period is greater than the true one (one sided hypothesis test)".

Recall that the possible outcomes of testing a null hypothesis against an alternative are 'the null hypothesis is not rejected' or 'the null is rejected and the alternative is accepted'. Not rejecting the null hypothesis does not mean accepting it because in hypothesis testing the type II error (not rejecting the null hypothesis although the alternative is true) cannot be controlled and, therefore, can be rather large. In contrast, the type I error (rejecting the null hypothesis although it is true) can be controlled and usually is kept small by choosing a significance level like 5% or 1%. Hence, if the null hypothesis is rejected the alternative can be accepted at properly controlled risk. In the following, we understand the acceptance of an alternative hypothesis by rejection of the null hypothesis as statistical 'proof' with an error probability tag (i.e. the significance level or p-value).

In this paper,

● we make a case for also testing the null hypothesis that the estimated parameter is less than or equal to the true parameter in order to be able to 'prove' that the estimate is prudent (or conservative),

● we suggest additionally using exposure- (or limit-)weighted^§ sample averages in order to better inform assessments of estimation (or prediction) prudence, and

^§ ECB [2019] presumably only looks at "number-weighted" (i.e. equally weighted) averages because the Basel framework [BCBS, 2006] requires such averages for the risk parameter estimates. In banking practice, however, exposure-weighted averages are also considered [see, e.g., Li et al., 2009].

● we propose more elaborate statements of the hypotheses for the tests (by including 'variance expansion') in order to account for portfolio inhomogeneity in terms of composition (exposure sizes) and riskiness.

The proposal to look for a 'proof' of prediction prudence is inspired by the regulatory requirement [BCBS, 2006, paragraph 451]: "In order to avoid over-optimism, a bank must add to its estimates a margin of conservatism that is related to the likely range of errors".

As a matter of fact, the statistical tests discussed in this paper can be deployed both for proving prudence and for proving aggressiveness of estimates. However, an asymmetric approach is recommended for making use of the evidence from the tests:

● For proving prudence, request that both the equal-weights test and the exposure-weighted test reject the null hypothesis of the parameter being aggressive.

● For an alert of potential aggressiveness, request only that the equal-weights test or the exposure-weighted test reject the null hypothesis of the parameter being prudent.
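The asymmetric rule above can be sketched in a few lines of Python. This is only an illustration: the function name `assess_prudence`, its argument names and the default significance level of 5% are assumptions made here, not part of the paper.

```python
def assess_prudence(p_equal, p_weighted, p_star_equal, p_star_weighted, alpha=0.05):
    """Combine four p-values into the asymmetric verdict described above.

    p_equal, p_weighted:           p-values for the null 'estimate is aggressive'
    p_star_equal, p_star_weighted: p-values for the null 'estimate is prudent'
    alpha:                         significance level (hypothetical default of 5%)
    """
    # Prudence counts as 'proven' only if BOTH tests reject the aggressive null.
    prudence_proven = p_equal <= alpha and p_weighted <= alpha
    # An alert is raised as soon as EITHER test rejects the prudent null.
    aggressiveness_alert = p_star_equal <= alpha or p_star_weighted <= alpha
    return {"prudence_proven": prudence_proven,
            "aggressiveness_alert": aggressiveness_alert}
```

The 'and' versus 'or' asymmetry makes it harder to claim prudence than to trigger an alert, which matches the cautious reading of the regulatory requirement.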

The paper is organised as follows:

● In Section 2, we introduce a general non-parametric paired difference test approach to testing for the sign of a weighted mean value (Section 2.1). We compare this approach to the t-test for LGD, CCF and EAD proposed in ECB [2019] and note possible improvements of both approaches (Section 2.2). We then present in Section 2.3 a test approach to put into practice these improvements in the case of variables with values in the unit interval like LGD and CCF. Appendices A and B supplement Section 2.3 with regard to weight-adjustments as an alternative to sampling with inhomogeneous weights and to testing non-negative but not necessarily bounded variables like EAD.

● In Section 3, we discuss paired difference tests in the special case of differences between observed event indicators and the predicted probabilities of the events. We start in Section 3.1 with the presentation of a test approach that takes account of potential weighting of the observation pairs and variance expansion to deal with the individual randomness of the observations. In Section 3.2, we compare this test approach to the Jeffreys test proposed in ECB [2019] for assessing the 'predictive ability' of PD estimates.

● In Section 4, the test methods presented in the preceding sections are illustrated with two examples of test results.

● Section 5 concludes the paper with summarising remarks.

2. Paired difference tests

The statistical tests considered in this paper are 'paired difference tests'. This test design accounts for the strong dependence that is to be expected between the observation and the prediction in the matched observation-prediction pairs which the analysed samples consist of. See Mendenhall et al. [2008, Chapter 10] for a discussion of the advantages of such test designs.

2.1. Basic approach

Starting point.

● One sample of real-valued observations $\Delta_1, \ldots, \Delta_n$ .

● Weights $0 < w_i < 1$ , $i = 1, \ldots, n$ , with $\sum_{i = 1}^n w_i = 1$ .

● Define the weighted-average observation $\Delta_w$ as

$\begin{equation} \Delta_w = \sum\limits_{i = 1}^n w_i\, \Delta_i. \end{equation}$

(2.1)

Interpretation in the context of credit risk back-testing.

● $\Delta_1, \ldots, \Delta_n$ may be a sample of differences (residuals) between observed and predicted LGD (or CCF or EAD) for defaulted credit facilities (matched pairs of observations and predictions).

● The weight $w_i$ reflects the relative importance of observation $i$ . For instance, in the case of CCF or EAD estimates of credit facilities, one might choose

$\begin{equation} w_i \ = \ \frac{\text{limit}_i}{\sum_{j = 1}^n \text{limit}_j}, \end{equation}$

(2.2a)

where $\text{limit}_j$ is the limit of credit facility $j$ at the time when the estimates were made.

● In case of LGD estimates, the weights $w_i$ could be chosen as [Li et al., 2009, Section 5]

$\begin{equation} w_i \ = \ \frac{EAD_i}{\sum_{j = 1}^n EAD_j}, \end{equation}$

(2.2b)

where $EAD_j$ is the exposure at default estimate for credit facility $j$ at the time when the estimates were made.
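A minimal sketch of (2.1) with weights as in (2.2a) follows. The limits and residuals are made-up numbers chosen only to show that the equally weighted and the limit-weighted averages can have different signs; the helper names are assumptions.

```python
def normalised_weights(sizes):
    """Turn positive sizes (limits or EADs) into weights that sum to one."""
    total = sum(sizes)
    return [s / total for s in sizes]

def weighted_mean(weights, values):
    """Weighted average as in (2.1)."""
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical data: three facilities with limits and observed-minus-predicted residuals.
limits = [100.0, 300.0, 600.0]
deltas = [0.10, -0.05, -0.02]

w = normalised_weights(limits)                       # weights as in (2.2a)
delta_w = weighted_mean(w, deltas)                   # limit-weighted average residual
delta_equal = weighted_mean([1.0 / 3.0] * 3, deltas) # equally weighted average residual
```

In this toy sample the equally weighted mean is positive while the limit-weighted mean is negative, which is one reason for looking at both.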

Goal. We consider $\Delta_w$ as defined by (2.1) the realisation of a test statistic to be defined below and want to answer the following two questions:

● If $\Delta_w < 0$ , how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?

● If $\Delta_w > 0$ , how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?

The safety of conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right.

In order to be able to examine the properties of the sample and $\Delta_w$ with statistical methods, we have to make the assumption that the sample was generated with some random mechanism. The key idea for the mechanism is to interpret the weights $w_i$ as the probabilities of the corresponding observations $\Delta_i$ . Consequently, we look at an inhomogeneous version of the empirical distribution of the sample $\Delta_1, \ldots, \Delta_n$ , with the weight $w_i$ replacing $1/n$ as the probability of observation $\Delta_i$ . The details of the mechanism are described in the following assumption.

Assumption 2.1. The sample $\Delta_1, \ldots, \Delta_n$ consists of independent realisations of a random variable $X_\vartheta$ with distribution given by

$\begin{equation} \mathrm{P}\big[X_\vartheta = \Delta_i - \vartheta\big] \ = \ w_i, \qquad i = 1, \ldots, n, \end{equation}$

(2.3)

where the value of the parameter $\vartheta \in \mathbb{R}$ is unknown.

Note that (2.3) includes the case of equally weighted observations^¶, by choosing $w_i = 1/n$ for all $i$ .

^¶ See Appendix A for a more detailed discussion of special cases with equal weights.

Proposition 2.2. For $X_\vartheta$ as described in Assumption 2.1, the expected value and the variance are given by

$\begin{align} \mathrm{E}[X_\vartheta] & = \Delta_w - \vartheta, \mathit{\text{ and}} \end{align}$

(2.4a)

$\begin{align} \mathrm{var}[X_\vartheta] & = \sum\limits_{i = 1}^n w_i\, \Delta_i^2 - \Delta_w^2. \end{align}$

(2.4b)

Proof. Obvious.
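For completeness, the calculation behind Proposition 2.2 can be spelled out as follows, using only (2.3) and $\sum_{i = 1}^n w_i = 1$ :

$\begin{align*} \mathrm{E}[X_\vartheta] & = \sum\limits_{i = 1}^n w_i\, (\Delta_i - \vartheta) \ = \ \Delta_w - \vartheta, \\ \mathrm{var}[X_\vartheta] & = \sum\limits_{i = 1}^n w_i\, (\Delta_i - \vartheta)^2 - (\Delta_w - \vartheta)^2 \\ & = \sum\limits_{i = 1}^n w_i\, \Delta_i^2 - 2\, \vartheta\, \Delta_w + \vartheta^2 - \Delta_w^2 + 2\, \vartheta\, \Delta_w - \vartheta^2 \ = \ \sum\limits_{i = 1}^n w_i\, \Delta_i^2 - \Delta_w^2. \end{align*}$

In particular, the variance does not depend on the location parameter $\vartheta$ .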

By Assumption 2.1 and Proposition 2.2, the questions on the safety of conclusions from the sign of $\Delta_w$ can be translated into hypotheses on the value of the parameter $\vartheta$ :

● If $\Delta_w < 0$ , can we conclude that $H_0: \vartheta \le \Delta_w$ is false and $H_1: \vartheta > \Delta_w \Leftrightarrow \mathrm{E}[X_\vartheta] < 0$ is true?

● If $\Delta_w > 0$ , can we conclude that $H^\ast_0: \vartheta \ge \Delta_w$ is false and $H^\ast_1: \vartheta < \Delta_w \Leftrightarrow \mathrm{E}[X_\vartheta] > 0$ is true?

If we assume that the sample $\Delta_1, \ldots, \Delta_n$ was generated by independent realisations of $X_\vartheta$ then the distribution of the sample mean is different from the distribution of $X_\vartheta$ , as shown in the following corollary to Proposition 2.2.

Corollary 2.3. Let $X_{1, \vartheta}, \ldots, X_{n, \vartheta}$ be independent and identically distributed copies of $X_\vartheta$ as in Assumption 2.1 and define $\bar{X}_\vartheta = \frac{1}{n} \sum_{i = 1}^n X_{i, \vartheta}$ . Then for the mean and the variance of $\bar{X}_\vartheta$ , it holds that

$\begin{align} \mathrm{E}[\bar{X}_\vartheta] & = \Delta_w - \vartheta, \end{align}$

(2.5a)

$\begin{align} \mathrm{var}[\bar{X}_\vartheta] & = \frac{1}{n} \Big(\sum\limits_{i = 1}^n w_i\, \Delta_i^2 - \Delta_w^2\Big). \end{align}$

(2.5b)

In the following, we use $\bar{X}_\vartheta$ as the test statistic and interpret $\Delta_w$ as its observed value^||. Next we describe a bootstrap test to answer the above questions under Assumption 2.1 and then provide the rationale behind its design.

^|| For arithmetic reasons, actually most of the time $\Delta_w$ cannot be a realisation of $\bar{X}_\vartheta$ . As long as the sample size $n$ is not too small, however, by (2.5a) and the law of large numbers considering $\Delta_w$ as realisation of $\bar{X}_\vartheta$ is not unreasonable.

Bootstrap test. Generate a Monte Carlo sample^** $\bar{x}_1, \ldots, \bar{x}_R$ from $\Delta_1, \ldots, \Delta_n$ as follows:

^** According to Davison and Hinkley [1997, Section 5.2.3], a sample size of $R = 999$ should suffice for the purposes of this paper.

● For $j = 1, \ldots, R$ : $\bar{x}_j$ is the equally weighted mean of $n$ independent draws from the distribution of $X_{\widehat{\vartheta}}$ as given by (2.3), with $\widehat{\vartheta} = 0$ . Equivalently, $\bar{x}_j$ is the mean of $n$ draws with replacement from the sample $\Delta_1, \ldots, \Delta_n$ , where $\Delta_i$ is drawn with probability $w_i$ .

● $\bar{x}_1, \ldots, \bar{x}_R$ are realisations of independent, identically distributed random variables.

Then a bootstrap p-value for the test of $H_0: \vartheta \le \Delta_w$ against $H_1: \vartheta > \Delta_w$ can be calculated as^††

^†† $\#S$ denotes the number of elements of the set $S$ .

$\begin{equation} \text{p-value} \ = \ \frac{1 + \#\bigl\{i: i \in\{1, \ldots, R\}, \bar{x}_i \le 2\, \Delta_w\bigr\}}{R+1}. \end{equation}$

(2.6a)

A bootstrap p-value for the test of $H^\ast_0: \vartheta \ge \Delta_w$ against $H^\ast_1: \vartheta < \Delta_w$ is given by

$\begin{equation} \text{p-value}^\ast \ = \ \frac{1 + \#\bigl\{i: i \in\{1, \ldots, R\}, \bar{x}_i \ge 2\, \Delta_w\bigr\}}{R+1}. \end{equation}$

(2.6b)
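The bootstrap test just described might be sketched as follows, using only the Python standard library. The function name, the fixed seed and the example data are assumptions made for illustration; the count runs over the $R$ bootstrap means and the threshold is $2\, \Delta_w$ .

```python
import random

def bootstrap_p_values(deltas, weights, R=999, seed=0):
    """Bootstrap p-values in the spirit of (2.6a) and (2.6b).

    deltas:  residuals Delta_1, ..., Delta_n
    weights: probabilities w_1, ..., w_n (positive, summing to one)
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n = len(deltas)
    delta_w = sum(w * d for w, d in zip(weights, deltas))
    # Each bootstrap mean averages n draws with replacement,
    # where Delta_i is drawn with probability w_i.
    means = [sum(rng.choices(deltas, weights=weights, k=n)) / n for _ in range(R)]
    threshold = 2.0 * delta_w
    p = (1 + sum(m <= threshold for m in means)) / (R + 1)       # H_0 vs H_1
    p_star = (1 + sum(m >= threshold for m in means)) / (R + 1)  # H_0^* vs H_1^*
    return delta_w, p, p_star

# Hypothetical sample of 40 clearly negative residuals with equal weights.
example_deltas = [-0.2, -0.1, -0.15, -0.05, -0.12] * 8
example_weights = [1.0 / 40.0] * 40
example_result = bootstrap_p_values(example_deltas, example_weights)
```

With $R = 999$ the smallest attainable p-value is $1/1000$ , which is what this example produces for the test of $H_0$ because no bootstrap mean can fall below the smallest residual.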

Rationale. By (2.3), for each $\vartheta$ the distributions of $X_0 - \vartheta$ and $X_\vartheta$ are identical. As a consequence, if under $H_0$ the true parameter is $\vartheta \le \Delta_w$ and $(-\infty, \, x]$ is the critical (rejection) range for the test of $H_0$ against $H_1$ based on the test statistic $\bar{X}_\vartheta$ , then it holds that

$\begin{align} \mathrm{P}\bigl[\bar{X}_\vartheta \in (-\infty, \, x]\bigr] &\ = \ \mathrm{P}[\bar{X}_0 \le x + \vartheta]\\ &\ \le \ \mathrm{P}[\bar{X}_0 \le x + \Delta_w]. \end{align}$

(2.7)

Hence, by Theorem 8.3.27 of Casella and Berger [2002], in order to obtain a p-value for $H_0: \vartheta \le \Delta_w$ against $H_1: \vartheta > \Delta_w$ , according to (2.7) it suffices to specify:

● The upper limit $x$ of the critical range for rejection of $H_0: \vartheta \le \Delta_w$ as 'observed' value $\Delta_w$ of $\bar{X}_\vartheta$ , and

● an approximation of the distribution of $\bar{X}_0$ , as it is done by generating the bootstrap sample $\bar{x}_1, \ldots, \bar{x}_R$ .

This implies Equation (2.6a) for the bootstrap p-value^‡‡ of the test of $H_0$ against $H_1$ . The rationale for (2.6b) is analogous.

^‡‡ We adopt here the definition provided by Davison and Hinkley [1997, Eq. (4.11)].

Normal approximate test. By Corollary 2.3 for $\vartheta = \Delta_w$ , we find that the distribution of $\bar{X}_{\Delta_w}$ can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (2.5b). With $x = \Delta_w$ , therefore, we obtain the following expression for the normal approximate p-value of $H_0: \vartheta \le \Delta_w$ against $H_1: \vartheta > \Delta_w$ :

$\begin{align} \text{p-value} &\ = \ \mathrm{P}[\bar{X}_{\Delta_w} \le x] \\& \ \approx\ \Phi\left(\frac{\sqrt{n}\, \Delta_w} {\sqrt{\sum_{i = 1}^n w_i\, \Delta_i^2 - \Delta_w^2}}\right). \end{align}$

(2.8a)

Here $\Phi$ denotes the standard normal distribution function. The same reasoning gives for the normal approximate p-value of $H^\ast_0: \vartheta \ge \Delta_w$ against $H^\ast_1: \vartheta < \Delta_w$ :

$\begin{equation} \text{p-value}^\ast \ \approx\ 1 - \Phi\left(\frac{\sqrt{n}\, \Delta_w} {\sqrt{\sum_{i = 1}^n w_i\, \Delta_i^2 - \Delta_w^2}}\right). \end{equation}$

(2.8b)
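A possible implementation of the normal approximate p-values (2.8a) and (2.8b) is sketched below. The function name is an assumption, and the sketch presumes that the weighted variance is strictly positive (i.e. the residuals are not all identical).

```python
from math import sqrt
from statistics import NormalDist

def normal_p_values(deltas, weights):
    """Normal approximate p-values as in (2.8a) and (2.8b)."""
    n = len(deltas)
    delta_w = sum(w * d for w, d in zip(weights, deltas))
    # Weighted variance of the residuals, cf. (2.4b); assumed to be > 0.
    var_x = sum(w * d * d for w, d in zip(weights, deltas)) - delta_w ** 2
    z = sqrt(n) * delta_w / sqrt(var_x)
    p = NormalDist().cdf(z)   # (2.8a): test of H_0 against H_1
    p_star = 1.0 - p          # (2.8b): test of H_0^* against H_1^*
    return p, p_star
```

For instance, two equally weighted residuals $-0.2$ and $0$ give $\Delta_w = -0.1$ , a weighted variance of $0.01$ and hence the argument $-\sqrt{2}$ of $\Phi$ .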

2.2. The t-test approach

In Sections 2.6.2 (for LGD back-testing), 2.9.3.1 (for CCF back-testing) and 2.9.3.2 (for EAD back-testing) of ECB [2019], the ECB proposes a t-test for (in the terms of Section 2.1 of this paper) $H^\ast_0: \vartheta \ge \Delta_w$ against $H^\ast_1: \vartheta < \Delta_w$ . Transcribed into the notation of Section 2.1, the test can be described as follows:

● $n$ is the number of matched pairs of observations and predictions in the sample.

● $\Delta_i$ is the difference of

  - the realised LGD for facility $i$ and the estimated LGD for facility $i$ in ECB Section 2.6.2,

  - the realised CCF for facility $i$ and the estimated CCF for facility $i$ in ECB Section 2.9.3.1, and

  - the drawings (balance sheet exposure) at the time of default of facility $i$ and the estimated EAD of facility $i$ in ECB Section 2.9.3.2.

● All $w_i$ equal $1/n$ .

● The right-hand side of (2.5b) is replaced by the sample variance

$\begin{equation*} s_n^2 \ = \ \frac{1}{n-1} \left(\frac{1}{n}\sum\limits_{i = 1}^n \Delta_i^2 - \Delta_{1/n}^2\right). \end{equation*}$

● The p-value is computed as

$\begin{equation} \text{p-value}^\ast \ = \ 1 - \Psi_{n-1}\left(\frac{\Delta_{1/n}}{s_n}\right), \end{equation}$

(2.9)

where $\Psi_{n-1}$ denotes the distribution function of Student's t-distribution with $n-1$ degrees of freedom.

By the Central Limit Theorem, the p-values according to (2.6b), (2.8b) and (2.9) will come out almost identical for large sample sizes $n$ and equal weights $w_i = 1/n$ for all $i = 1, \ldots, n$ . For smaller $n$ , the value of (2.9) would be exact if the variables $X_{i, \vartheta}$ in Corollary 2.3 were normally distributed.
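To see how close the t-test and the normal approximation are for equal weights, write $V = \frac{1}{n}\sum_{i = 1}^n \Delta_i^2 - \Delta_{1/n}^2$ (a shorthand introduced here only for illustration). Then $s_n^2 = V/(n-1)$ , and the arguments of $\Psi_{n-1}$ in (2.9) and of $\Phi$ in (2.8b) compare as

$\begin{equation*} \frac{\Delta_{1/n}}{s_n} \ = \ \frac{\sqrt{n-1}\, \Delta_{1/n}}{\sqrt{V}} \ = \ \sqrt{\frac{n-1}{n}}\ \frac{\sqrt{n}\, \Delta_{1/n}}{\sqrt{V}}. \end{equation*}$

The factor $\sqrt{(n-1)/n}$ tends to one as $n$ grows, and the t-distribution with $n-1$ degrees of freedom approaches the standard normal distribution, so the two p-values differ noticeably only for small samples.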

Criticisms of the basic approach. The basic approach as described in Sections 2.1 and 2.2 fails to take account of the following issues:

● The random mechanism reflected by (2.3) can be interpreted as an expression of uncertainty about the cohort / portfolio composition. The randomness of the loss rate / exposure of the individual facilities – the degree of which potentially can differ between facilities – is not captured by (2.3).

● The parametrisation of the distribution by a location parameter in (2.3) could result in distributions with features that are not realistic, for instance negative exposures or loss rates greater than one.

In the following section and in Appendix B, we are going to modify the basic approach for LGD / CCF on the one hand and EAD on the other hand in such a way as to take into account these two issues.

2.3. Tests for variables with values in the unit interval

By definition, both LGD and CCF take values only in the unit interval $[0, 1]$ . This fact allows for more specific tests than the ones considered in the previous sections. In this section, we talk only about LGD most of the time. But the concepts discussed also apply with little or no modification to CCF or any other variables with values in the unit interval.

Starting point.

● A sample of paired observations $(\lambda_1, \ell_1), \ldots, (\lambda_n, \ell_n)$ , with predicted LGDs $0 < \lambda_i < 1$ and realised loss rates $0 \le \ell_i \le 1$ .

● Weights $0 < w_i < 1$ , $i = 1, \ldots, n$ , with $\sum_{i = 1}^n w_i = 1$ ,

● Weighted average loss rate $\ell_w = \sum_{i = 1}^n w_i\, \ell_i$ and weighted average loss prediction $\lambda_w = \sum_{i = 1}^n w_i\, \lambda_i$ .

Interpretation in the context of LGD back-testing.

● A sample of $n$ defaulted credit facilities / loans is analysed.

● The LGD $\lambda_i$ is an estimate of loan $i$ 's loss rate as a consequence of the default, measured as percentage of the exposure at the time of default (EAD).

● The realized loss rate $\ell_i$ shows the percentage of loan $i$ 's exposure at the time of default that cannot be recovered.

● The weight $w_i$ reflects the relative importance of observation $i$ . In the case of LGD predictions, one might choose (2.2b) for the definition of the weights, for CCF one might choose (2.2a) instead.

● Define $\Delta_i = \ell_i - \lambda_i$ , $i = 1, \ldots, n$ . If $|\Delta_i| \approx 0$ then $\lambda_i$ is a good LGD prediction. If $|\Delta_i| \approx 1$ then $\lambda_i$ is a poor LGD prediction.

Goal. We want to use the observed weighted average difference / residual $\Delta_w = \sum_{i = 1}^n w_i\, \Delta_i = \ell_w - \lambda_w$ to assess the quality of the calibration of the model / approach for the $\lambda_i$ to predict the realised loss rates $\ell_i$ . Again we want to answer the following two questions:

● If $\Delta_w < 0$ , how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?

● If $\Delta_w > 0$ , how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?

The safety of such conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right.

In order to be able to examine the specific properties of the sample and $\Delta_w$ with statistical methods, we have to make the assumption that the sample was generated with some random mechanism. This mechanism is described in the following modification of Assumption 2.1.

Assumption 2.4. The sample $\Delta_1, \ldots, \Delta_n$ consists of independent realisations of a random variable $X_\vartheta$ with distribution given by

$\begin{equation} X_\vartheta \ = \ \ell_I - Y_\vartheta, \end{equation}$

(2.10a)

where $I$ is a random variable with values in $\{1, \ldots, n\}$ and $\mathrm{P}[I = i] = w_i$ , $i = 1, \ldots, n$ . $Y_\vartheta$ is a beta $(\alpha_i, \beta_i)$ -distributed random variable^§§ conditional on $I = i$ for $i = 1, \ldots, n$ . The parameters $\alpha_i$ and $\beta_i$ of the beta-distribution depend on the unknown parameter $0 < \vartheta < 1$ by

^§§ See Casella and Berger [2002, Section 3.3] for a definition of the beta-distribution.

$\begin{equation} \begin{split} \alpha_i & \ = \ \vartheta_i\, \frac{1-v}{v}, \qquad\mathit{\text{and}}\\ \beta_i & \ = \ (1-\vartheta_i)\, \frac{1-v}{v}. \end{split} \end{equation}$

(2.10b)

In (2.10b), the constant $0 < v < 1$ is the same for all $i$ . The $\vartheta_i$ are determined by

$\begin{equation} \vartheta_i \ = \ (\lambda_i)^{h(\vartheta)}, \end{equation}$

(2.10c)

where $0 < h(\vartheta) < \infty$ is the unique solution $h$ of the equation

$\begin{equation} \vartheta \ = \ \sum\limits_{i = 1}^n w_i\, (\lambda_i)^h. \end{equation}$

(2.10d)
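Equation (2.10d) has no closed-form solution in general, but since each $(\lambda_i)^h$ is strictly decreasing in $h$ for $0 < \lambda_i < 1$ , the root can be found numerically. The following sketch uses plain bisection; the function name `solve_h`, the tolerance and the bracketing strategy are assumptions, not the paper's prescription.

```python
def solve_h(theta, lambdas, weights, tol=1e-12):
    """Solve theta = sum_i w_i * lambda_i**h for h > 0, cf. (2.10d).

    Assumes 0 < theta < 1, 0 < lambda_i < 1 and weights summing to one.
    """
    def g(h):
        return sum(w * lam ** h for w, lam in zip(weights, lambdas)) - theta

    lo, hi = 1e-12, 1.0
    # g is strictly decreasing with g(0+) = 1 - theta > 0 and g(h) -> -theta < 0,
    # so doubling hi eventually brackets the root.
    while g(hi) > 0:
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

If the target $\vartheta$ equals the weighted average prediction $\lambda_w$ , the solution is $h = 1$ and the predictions are left unchanged; targets below $\lambda_w$ give $h > 1$ .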

Assumption 2.4 introduces randomness of the difference between loss rate and LGD prediction for individual facilities. Comparison between (2.13b) below and (2.4b) shows that this entails variance expansion of the sample $\Delta_1, \ldots, \Delta_n$ .

Note that Assumption 2.4 also describes a method for recalibrating the LGD estimates $\lambda_1, \ldots, \lambda_n$ such that the weighted average of the recalibrated estimates $\vartheta_i$ matches a target $\vartheta$ . In contrast to the location shift in (2.3), the transformation (2.10c) ensures that the transformed LGD parameters remain in the unit interval. By definition of $Y_\vartheta$ , it holds that $\mathrm{E}[Y_\vartheta\, |\, I = i] = \vartheta_i$ .

The constant $v$ specifies the variance of $Y_\vartheta$ conditional on $I = i$ as percentage of the supremum $\vartheta_i\, (1-\vartheta_i)$ of its possible conditional variance, i.e. it holds that

$\begin{equation} \mathrm{var}[Y_\vartheta\, |\, I = i] \ = \ v\, \vartheta_i\, (1-\vartheta_i), \qquad i = 1, \ldots, n. \end{equation}$

(2.11)
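Both the conditional mean and (2.11) follow from the standard moment formulas of the beta-distribution together with (2.10b). Since $\alpha_i + \beta_i = \frac{1-v}{v}$ , one obtains

$\begin{align*} \mathrm{E}[Y_\vartheta\, |\, I = i] & = \frac{\alpha_i}{\alpha_i + \beta_i} \ = \ \vartheta_i, \\ \mathrm{var}[Y_\vartheta\, |\, I = i] & = \frac{\alpha_i\, \beta_i}{(\alpha_i + \beta_i)^2\, (\alpha_i + \beta_i + 1)} \ = \ \frac{\vartheta_i\, (1-\vartheta_i)}{\frac{1-v}{v} + 1} \ = \ v\, \vartheta_i\, (1-\vartheta_i). \end{align*}$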

The constant $v$ must be pre-defined or separately estimated. We suggest estimating it from the sample $\ell_1, \ldots, \ell_n$ as

$\begin{equation} \hat{v}\ = \ \frac{\sum_{i = 1}^n w_i\, \ell_i^2 - \ell_w^2}{\ell_w\, (1-\ell_w)}. \end{equation}$

(2.12)

This approach yields $0 \le \hat{v} \le 1$ because the fact that $0 \le \ell_i \le 1$ , $i = 1, \ldots, n$ , implies

$\begin{equation*} \sum\limits_{i = 1}^n w_i\, \ell_i^2 - \ell_w^2\ \le\ \ell_w\, (1-\ell_w). \end{equation*}$
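The estimator (2.12) is straightforward to compute. The sketch below uses an assumed function name and presumes $0 < \ell_w < 1$ so that the denominator is non-zero.

```python
def estimate_v(losses, weights):
    """Estimate the variance constant v from realised loss rates, cf. (2.12).

    Assumes loss rates in [0, 1], weights summing to one and 0 < l_w < 1.
    """
    l_w = sum(w * l for w, l in zip(weights, losses))
    second_moment = sum(w * l * l for w, l in zip(weights, losses))
    return (second_moment - l_w ** 2) / (l_w * (1.0 - l_w))
```

The boundary values $\hat{v} = 0$ (all loss rates identical) and $\hat{v} = 1$ (all loss rates equal to 0 or 1) can occur; they fall outside the range $0 < v < 1$ required in (2.10b), so in such cases a pre-defined value would have to be used instead.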

A simpler alternative to the definition (2.10c) of $\vartheta_i$ would be linear scaling: $\vartheta_i = \lambda_i\, \frac{\vartheta}{\lambda_w}$ . However, with this definition $\vartheta_i > 1$ may occur. This is not desirable because then the beta-distribution for $Y_\vartheta\, |\, I = i$ would be ill-defined.

Proposition 2.5. For $X_\vartheta$ as described in Assumption 2.4, the expected value and the variance are given by

$\begin{align} \mathrm{E}[X_\vartheta] & = \ell_w - \vartheta,~~ \mathit{\text{ and}} \end{align}$

(2.13a)

$\begin{align} \mathrm{var}[X_\vartheta] & = \sum\limits_{i = 1}^n w_i\, (\ell_i-\vartheta_i)^2 - (\ell_w - \vartheta)^2 + v \sum\limits_{i = 1}^n w_i\, \vartheta_i\, (1-\vartheta_i). \end{align}$

(2.13b)

Proof. For deriving the formula for $\mathrm{var}[X_\vartheta]$ , make use of the well-known variance decomposition

$\mathrm{var}[X_\vartheta] = \mathrm{E}\bigl[ \mathrm{var}[X_\vartheta\, |\, I]\bigr] + \mathrm{var}\bigl[ \mathrm{E}[X_\vartheta\, |\, I]\bigr] .$
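In more detail, conditional on $I = i$ it holds that $X_\vartheta = \ell_i - Y_\vartheta$ with $\ell_i$ fixed, so that $\mathrm{E}[X_\vartheta\, |\, I = i] = \ell_i - \vartheta_i$ and, by (2.11), $\mathrm{var}[X_\vartheta\, |\, I = i] = v\, \vartheta_i\, (1-\vartheta_i)$ . Using $\sum_{i = 1}^n w_i\, \vartheta_i = \vartheta$ , which follows from (2.10c) and (2.10d), the two terms of the decomposition become

$\begin{align*} \mathrm{E}\bigl[ \mathrm{var}[X_\vartheta\, |\, I]\bigr] & = v \sum\limits_{i = 1}^n w_i\, \vartheta_i\, (1-\vartheta_i), \\ \mathrm{var}\bigl[ \mathrm{E}[X_\vartheta\, |\, I]\bigr] & = \sum\limits_{i = 1}^n w_i\, (\ell_i-\vartheta_i)^2 - \Bigl(\sum\limits_{i = 1}^n w_i\, (\ell_i-\vartheta_i)\Bigr)^2 \ = \ \sum\limits_{i = 1}^n w_i\, (\ell_i-\vartheta_i)^2 - (\ell_w - \vartheta)^2. \end{align*}$

Adding the two lines gives (2.13b).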

In contrast to (2.4b), the variance of $X_\vartheta$ as shown in (2.13b) depends on the parameter $\vartheta$ and has an additional component $v \sum_{i = 1}^n w_i\, \vartheta_i\, (1-\vartheta_i)$ which reflects the potentially different variances of the loss rates in an inhomogeneous portfolio.

By Assumption 2.4 and Proposition 2.5, the questions on the safety of conclusions from the sign of $\Delta_w = \ell_w - \lambda_w$ again can be translated into hypotheses on the value of the parameter $\vartheta$ :

● If $\Delta_w < 0$ , can we conclude that $H_0: \vartheta \le \ell_w$ is false and $H_1: \vartheta > \ell_w \Leftrightarrow \mathrm{E}[X_\vartheta] < 0$ is true?

● If $\Delta_w > 0$ , can we conclude that $H^\ast_0: \vartheta \ge \ell_w$ is false and $H^\ast_1: \vartheta < \ell_w \Leftrightarrow \mathrm{E}[X_\vartheta] > 0$ is true?

Corollary 2.6. Let $X_{1, \vartheta}, \ldots, X_{n, \vartheta}$ be independent and identically distributed copies of $X_\vartheta$ as in Assumption 2.4 and define $\bar{X}_\vartheta = \frac{1}{n} \sum_{i = 1}^n X_{i, \vartheta}$ . Then for the mean and variance of $\bar{X}_\vartheta$ , it holds that

$\begin{align} \mathrm{E}[\bar{X}_\vartheta] & = \ell_w - \vartheta. \end{align}$

(2.14a)

$\begin{align} \mathrm{var}[\bar{X}_\vartheta] & = \frac{1}{n} \left(\sum\limits_{i = 1}^n w_i\, (\ell_i-\vartheta_i)^2 - (\ell_w - \vartheta)^2 + v \sum\limits_{i = 1}^n w_i\, \vartheta_i\, (1-\vartheta_i)\right). \end{align}$

(2.14b)

In the following, we use $\bar{X}_\vartheta$ as the test statistic and interpret $\Delta_w = \ell_w - \lambda_w$ as its observed value.

Proposition 2.7. In the setting of Assumption 2.4 and Corollary 2.6, $\vartheta \le \widehat{\vartheta}$ implies that

$\begin{equation*} \mathrm{P}[\bar{X}_\vartheta \le x]\ \le \ \mathrm{P}[\bar{X}_{\widehat{\vartheta}} \le x], \qquad \mathit{\text{for all}}~~x\in\mathbb{R}. \end{equation*}$

Proof. Observe that $\vartheta \le \widehat{\vartheta}$ implies $\vartheta_i \le \widehat{\vartheta}_i$ for all $i = 1, \ldots, n$ . For fixed $i$ , the family of beta $(\alpha_i, \beta_i)$ -distributions, parametrised by $\vartheta \in (0, 1)$ , has a monotone likelihood ratio in the sense of Definition 8.3.16 of Casella and Berger [2002]. This implies that for $\vartheta \le \widehat{\vartheta}$ , conditional on $I = i$ , the distribution of $Y_{\widehat{\vartheta}}$ is stochastically not less than the distribution of $Y_{\vartheta}$ , i.e. it holds that

$\begin{equation*} \mathrm{P}[Y_\vartheta \le x\, |\, I = i]\ \ge \ \mathrm{P}[Y_{\widehat{\vartheta}} \le x\, |\, I = i], \qquad \text{for all }x\in\mathbb{R}. \end{equation*}$

From this, it follows that for all $i = 1, \ldots, n$

$\begin{equation*} \mathrm{P}[X_\vartheta \le x\, |\, I = i]\ \le \ \mathrm{P}[X_{\widehat{\vartheta}} \le x\, |\, I = i], \qquad \text{for all }x\in\mathbb{R}. \end{equation*}$

But this inequality implies for all $x\in\mathbb{R}$ that

$\begin{equation} \mathrm{P}[X_\vartheta \le x]\ = \ \sum\limits_{i = 1}^n w_i\, \mathrm{P}[X_\vartheta \le x\, |\, I = i] \ \le \ \mathrm{P}[X_{\widehat{\vartheta}} \le x]. \end{equation}$

(2.15)

Property (2.15) is passed on to convolutions of independent copies of $X_\vartheta$ and $X_{\widehat{\vartheta}}$ . This proves the assertion.

Bootstrap test. Generate a Monte Carlo sample $\bar{x}_1, \ldots, \bar{x}_R$ from $X_\vartheta$ with $\vartheta = \ell_w$ as follows:

● For $j = 1, \ldots, R$ : $\bar{x}_j$ is the equally weighted mean of $n$ independent draws from the distribution of $X_{\vartheta}$ as given by Assumption 2.4, with $\vartheta = \ell_w$ .

● $\bar{x}_1, \ldots, \bar{x}_R$ are realisations of independent, identically distributed random variables.

Then a bootstrap p-value for the test of $H_0: \vartheta \le \ell_w$ against $H_1: \vartheta > \ell_w$ can be calculated as

$\begin{equation} \text{p-value} \ = \ \frac{1 + \#\bigl\{j: j \in\{1, \ldots, R\}, \bar{x}_j \le \ell_w- \lambda_w\bigr\}}{R+1}. \end{equation}$

(2.16a)

A bootstrap p-value for the test of $H^\ast_0: \vartheta \ge \ell_w$ against $H^\ast_1: \vartheta < \ell_w$ is given by

$\begin{equation} \text{p-value}^\ast \ = \ \frac{1 + \#\bigl\{j: j \in\{1, \ldots, R\}, \bar{x}_j \ge \ell_w- \lambda_w\bigr\}}{R+1}. \end{equation}$

(2.16b)

Rationale. By Proposition 2.7, if under $H_0$ the true parameter is $\vartheta \le \ell_w$ and $(-\infty, \, x]$ is the critical (rejection) range for the test of $H_0: \vartheta \le \ell_w$ against $H_1: \vartheta > \ell_w$ based on the test statistic $\bar{X}_\vartheta$ , then it holds that

$\begin{equation} \mathrm{P}\bigl[\bar{X}_\vartheta \in (-\infty, \, x]\bigr] \ \le\ \mathrm{P}[\bar{X}_{\ell_w} \le x]. \end{equation}$

(2.17)

Hence, by Theorem 8.3.27 of Casella and Berger [2002], in order to obtain a p-value for $H_0: \vartheta \le \ell_w$ against $H_1: \vartheta > \ell_w$ , according to (2.17) it suffices to specify:

● The upper limit $x$ of the critical range for rejection of $H_0: \vartheta \le \ell_w$ as our realisation $\Delta_w = \ell_w- \lambda_w$ of $\bar{X}_\vartheta$ , and

● an approximation of the distribution of $\bar{X}_{\ell_w}$ , as it has been done by generating the bootstrap sample $\bar{x}_1, \ldots, \bar{x}_R$ .

This implies Equation (2.16a) for the bootstrap p-value. The rationale for (2.16b) is analogous.
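The bootstrap procedure can be sketched in Python as follows. This is only an illustration under stated assumptions, not the author's R implementation. Assumption 2.4 is not restated in this section, so the sketch assumes that $X_\vartheta = \ell_I - Y_\vartheta$ , that $\vartheta_i = \lambda_i^{h(\vartheta)}$ with $h(\vartheta)$ solving $\sum_i w_i\, \lambda_i^h = \vartheta$ (consistent with (2.14a) and the definition of $\widehat{\vartheta}_i$ below (2.18a)), and that $Y_\vartheta$ given $I = i$ is beta-distributed with mean $\vartheta_i$ and variance $v\, \vartheta_i\, (1-\vartheta_i)$ (consistent with (2.14b)). The function names `solve_h` and `bootstrap_p_values` are hypothetical.

```python
import random


def solve_h(lam, w, target):
    """Bisection for h > 0 such that sum_i w_i * lam_i**h equals target."""
    def f(h):
        return sum(wi * li ** h for wi, li in zip(w, lam)) - target

    lo, hi = 0.0, 1.0
    while f(hi) > 0.0:      # f decreases from 1 - target (h = 0) towards -target
        hi *= 2.0
    for _ in range(200):    # plain bisection, far more steps than needed
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def bootstrap_p_values(obs, pred, w, v, R=999, seed=1):
    """Bootstrap p-values in the spirit of (2.16a) and (2.16b)."""
    rng = random.Random(seed)
    n = len(obs)
    l_w = sum(wi * li for wi, li in zip(w, obs))
    lam_w = sum(wi * pi for wi, pi in zip(w, pred))
    h = solve_h(pred, w, l_w)
    theta = [pi ** h for pi in pred]   # assumed form: theta_i = lambda_i ** h(l_w)
    c = (1.0 - v) / v                  # assumed: alpha_i + beta_i = (1 - v) / v
    means = []
    for _ in range(R):
        total = 0.0
        # draw the index I with P[I = i] = w_i, n times
        for i in rng.choices(range(n), weights=w, k=n):
            y = rng.betavariate(theta[i] * c, (1.0 - theta[i]) * c)
            total += obs[i] - y        # assumed form: X = l_I - Y
        means.append(total / n)
    delta_w = l_w - lam_w
    p = (1 + sum(x <= delta_w for x in means)) / (R + 1)
    p_star = (1 + sum(x >= delta_w for x in means)) / (R + 1)
    return p, p_star
```

The inputs must satisfy $0 < \lambda_i < 1$ , $0 < \ell_w < 1$ and $0 < v < 1$ , otherwise the beta parameters are not positive. Because both counts include ties and each p-value carries the "+1" correction, the two p-values always sum to more than 1.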

Normal approximate test. By Corollary 2.6, we find that the distribution of $\bar{X}_{\ell_w}$ can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (2.14b) with $\vartheta = \ell_w$ . With $x = \ell_w-\lambda_w$ , one obtains for the approximate p-value of $H_0: \vartheta \le \ell_w$ against $H_1: \vartheta > \ell_w$ :

$\begin{align} \text{p-value} &\ = \ \mathrm{P}[\bar{X}_{\ell_w} \le x] \\ & \ \approx\ \Phi\left(\frac{\sqrt{n}\, (\ell_w - \lambda_w)} {\sqrt{\sum\limits_{i = 1}^n w_i\, (\ell_i-\widehat{\vartheta}_i)^2 + v\, \sum\limits_{i = 1}^n w_i\, \widehat{\vartheta}_i\, (1-\widehat{\vartheta}_i)}}\right), \end{align}$

(2.18a)

with $\widehat{\vartheta}_i = (\lambda_i)^{h(\ell_w)}$ as in Assumption 2.4. The same reasoning gives for the normal approximate p-value of $H^\ast_0: \vartheta \ge \ell_w$ against $H^\ast_1: \vartheta < \ell_w$ :

$\begin{equation} \text{p-value}^\ast \ \approx\ 1 - \Phi\left(\frac{\sqrt{n}\, (\ell_w - \lambda_w)} {\sqrt{\sum_{i = 1}^n w_i\, (\ell_i-\widehat{\vartheta}_i)^2 + v\, \sum_{i = 1}^n w_i\, \widehat{\vartheta}_i\, (1-\widehat{\vartheta}_i)}}\right). \end{equation}$

(2.18b)
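As a reading aid, (2.18a) and (2.18b) can be evaluated with a few lines of Python. The sketch takes the recalibrated values $\widehat{\vartheta}_i$ as an input rather than computing them; the function names `norm_cdf` and `normal_p_values_unit_interval` are hypothetical, and $\Phi$ is evaluated through the error function.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function Phi."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_p_values_unit_interval(obs, theta_hat, w, v, lam_w):
    """Normal approximate p-values following (2.18a) and (2.18b).

    obs       : observations l_i
    theta_hat : recalibrated predictions hat(theta)_i (supplied by the caller)
    w         : weights w_i summing to 1
    v         : variance parameter v from Assumption 2.4
    lam_w     : weighted mean of the original predictions
    """
    n = len(obs)
    l_w = sum(wi * li for wi, li in zip(w, obs))
    var = sum(wi * (li - ti) ** 2 for wi, li, ti in zip(w, obs, theta_hat))
    var += v * sum(wi * ti * (1.0 - ti) for wi, ti in zip(w, theta_hat))
    z = math.sqrt(n) * (l_w - lam_w) / math.sqrt(var)
    p = norm_cdf(z)
    return p, 1.0 - p
```

In the special case where all predictions $\lambda_i$ are identical, $\widehat{\vartheta}_i = \ell_w$ for every $i$ , so `theta_hat` can be passed as a constant list.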

3. Tests of probabilities

Starting point.

● A sample of paired observations $(p_1, b_1), \ldots, (p_n, b_n)$ , with probabilities $0 < p_i < 1$ and status indicators $b_i \in \{0, 1\}$ (1 for defaulted, 0 for performing).

● Weights $0 < w_i < 1$ , $i = 1, \ldots, n$ , with $\sum_{i = 1}^n w_i = 1$ .

● Weighted default rate $b_w = \sum_{i = 1}^n w_i\, b_i$ and weighted average PD $p_w = \sum_{i = 1}^n w_i\, p_i$ .

Interpretation in the context of PD back-testing.

● A sample of $n$ borrowers is observed for a certain period of time, most commonly one year.

● The PD $p_i$ is an estimate of borrower $i$ 's probability to default during the observation period, estimated before the beginning of the period.

● The status indicator $b_i$ shows borrower $i$ 's performance status at the end of the observation period. $b_i = 1$ means "borrower has defaulted", $b_i = 0$ means "borrower is performing".

● $w_i$ could be the relative importance of observation $i$ . In the case of default predictions, one might choose weights as in (2.2b).

● Define $\Delta_i = b_i - p_i$ , $i = 1, \ldots, n$ . If $|\Delta_i| \approx 0$ then $p_i$ is a good default prediction. If $|\Delta_i| \approx 1$ then $p_i$ is a poor default prediction.

Goal. We want to use the observed weighted average difference / residual $\Delta_w = \sum_{i = 1}^n w_i\, \Delta_i = b_w - p_w$ to assess the quality of the calibration of the model / approach for the $p_i$ to predict the realised status indicators $b_i$ . Again we want to answer the following two questions:

● If $\Delta_w < 0$ , how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?

● If $\Delta_w > 0$ , how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?

The safety of such conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right. In determining the p-values, we take into account the criticisms of the basic approach as mentioned at the end of Section 2.2.

3.1. Testing probabilities on inhomogeneous samples

In order to be able to examine the PD-specific properties of the sample and $\Delta_w = b_w -p_w$ with statistical methods, we have to make the assumption that the sample was generated with some random mechanism. This mechanism is described in the following modification of Assumptions 2.1 and 2.4.

Assumption 3.1. The sample $\Delta_1, \ldots, \Delta_n$ consists of independent realisations of a random variable $X_\vartheta$ with distribution given by

$\begin{equation} X_\vartheta \ = \ b_I - Y_\vartheta, \end{equation}$

(3.1a)

where $I$ is a random variable with values in $\{1, \ldots, n\}$ and $\mathrm{P}[I = i] = w_i$ , $i = 1, \ldots, n$ . $Y_\vartheta$ is a Bernoulli variable with

$\begin{equation} \mathrm{P}[Y_\vartheta = 1\, |\, I = i] \ = \ \vartheta_i, \qquad i = 1, \ldots, n. \end{equation}$

(3.1b)

Define $\varrho_i = \frac{1-p_i}{p_i}\, \frac{p_w}{1-p_w}$ . Then the $\vartheta_i$ depend on the unknown parameter $0 < \vartheta < 1$ by

$\begin{equation} \vartheta_i = \frac{\vartheta}{\vartheta +(1-\vartheta)\, \varrho_i\, h(\vartheta)}, \end{equation}$

(3.1c)

where $0 < h(\vartheta) < \infty$ is the unique solution (see Tasche [2013a, Section 4.2.4]) of the equation

$\begin{equation} 1\ = \ \sum\limits_{i = 1}^n \frac{w_i}{\vartheta +(1-\vartheta)\, \varrho_i\, h}, \end{equation}$

(3.1d)

when solved for $h$ .
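The recalibration in (3.1c) and (3.1d) can be sketched numerically as follows. The function name `recalibrate` is hypothetical, and bisection is just one convenient way to solve (3.1d): the right-hand side of (3.1d) is decreasing in $h$ , equals $1/\vartheta > 1$ at $h = 0$ and tends to 0 as $h \to \infty$ .

```python
def recalibrate(p, w, theta):
    """Return (theta_i, h) with theta_i from (3.1c) and h solving (3.1d)."""
    p_w = sum(wi * pi for wi, pi in zip(w, p))
    rho = [(1.0 - pi) / pi * p_w / (1.0 - p_w) for pi in p]

    def g(h):
        # right-hand side of (3.1d) minus 1; decreasing in h
        return sum(wi / (theta + (1.0 - theta) * ri * h)
                   for wi, ri in zip(w, rho)) - 1.0

    lo, hi = 0.0, 1.0
    while g(hi) > 0.0:      # enlarge the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):    # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    h = 0.5 * (lo + hi)
    theta_i = [theta / (theta + (1.0 - theta) * ri * h) for ri in rho]
    return theta_i, h


# Illustrative inputs (made up for this sketch, not taken from the paper's data)
p = [0.01, 0.02, 0.05, 0.10, 0.30]
w = [0.3, 0.3, 0.2, 0.1, 0.1]
theta_i, h = recalibrate(p, w, theta=0.10)
```

Two properties follow directly from the formulas and serve as sanity checks: the weighted average of the $\vartheta_i$ equals the target $\vartheta$ , and for $\vartheta = p_w$ the solution is $h = 1$ with $\vartheta_i = p_i$ .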

Assumption 3.1 introduces randomness of the difference between status indicator and PD prediction for individual facilities. Comparison between (3.2b) below and (2.4b) shows that this entails variance expansion of the sample $\Delta_1, \ldots, \Delta_n$ .

Note that Assumption 3.1 also describes a method for recalibrating the PD estimates $p_1, \ldots, p_n$ such that the weighted average of the recalibrated values $\vartheta_i$ matches a target $\vartheta$ . In contrast to (2.3), the transformation (3.1c) ensures that the transformed PD parameters are still values in the unit interval. In principle, the transformation (2.10c) could have been used instead of (3.1c). (3.1c) was preferred because it has a probabilistic foundation through Bayes' theorem. By definition of $Y_\vartheta$ , it holds that $\mathrm{E}[Y_\vartheta\, |\, I = i] = \vartheta_i$ .

Another simple alternative to the definition (3.1c) of $\vartheta_i$ would be linear scaling: $\vartheta_i = p_i\, \frac{\vartheta}{p_w}$ . However, with this definition $\vartheta_i > 1$ may be incurred. This is not desirable because then the Bernoulli distribution for $Y_\vartheta\, |\, I = i$ would be ill-defined.

Proposition 3.2. For $X_\vartheta$ as described in Assumption 3.1, the expected value and the variance are given by

$\begin{align} \mathrm{E}[X_\vartheta] & = b_w - \vartheta,\ \mathit{\text{and}} \end{align}$

(3.2a)

$\begin{align} \mathrm{var}[X_\vartheta] & = \sum\limits_{i = 1}^n w_i\, (b_i-\vartheta_i)^2 - (b_w - \vartheta)^2 + \sum\limits_{i = 1}^n w_i\, \vartheta_i\, (1-\vartheta_i). \end{align}$

(3.2b)

Proof. Similar to the proof of Proposition 2.5.
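For orientation, the argument can be sketched with the law of total variance. This is a reconstruction under Assumption 3.1, not a quotation of the proof of Proposition 2.5. Conditional on $I = i$ , $X_\vartheta = b_i - Y_\vartheta$ with $Y_\vartheta$ Bernoulli with success probability $\vartheta_i$ , hence

$\begin{align*} \mathrm{E}[X_\vartheta\, |\, I = i] & = b_i - \vartheta_i, \qquad \mathrm{var}[X_\vartheta\, |\, I = i] = \vartheta_i\, (1-\vartheta_i), \\ \mathrm{E}[X_\vartheta] & = \sum\limits_{i = 1}^n w_i\, (b_i - \vartheta_i) = b_w - \vartheta, \\ \mathrm{var}[X_\vartheta] & = \mathrm{E}\bigl[\mathrm{var}[X_\vartheta\, |\, I]\bigr] + \mathrm{var}\bigl[\mathrm{E}[X_\vartheta\, |\, I]\bigr] \\ & = \sum\limits_{i = 1}^n w_i\, \vartheta_i\, (1-\vartheta_i) + \sum\limits_{i = 1}^n w_i\, (b_i-\vartheta_i)^2 - (b_w - \vartheta)^2. \end{align*}$

The second line uses $\sum_{i = 1}^n w_i\, \vartheta_i = \vartheta$ , which follows from multiplying (3.1d) by $\vartheta$ and inserting (3.1c).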

Note that $\sum_{i = 1}^n w_i\, (b_i-\vartheta_i)^2$ is a weighted version of the Brier score [see, e.g., Hand, 1997] for the observation-prediction sample $(b_1, \vartheta_1), \ldots, (b_n, \vartheta_n)$ . This observation suggests that the power of the calibration tests considered in this section increases with the discriminatory power of the PD predictions (as reflected by lower Brier scores).

By Assumption 3.1 and Proposition 3.2, the questions on the safety of conclusions from the sign of $\Delta_w = b_w - p_w$ again can be translated into hypotheses on the value of the parameter $\vartheta$ :

● If $\Delta_w < 0$ , can we conclude that $H_0: \vartheta \le b_w$ is false and $H_1: \vartheta > b_w \Leftrightarrow \mathrm{E}[X_\vartheta] < 0$ is true?

● If $\Delta_w > 0$ , can we conclude that $H^\ast_0: \vartheta \ge b_w$ is false and $H^\ast_1: \vartheta < b_w \Leftrightarrow \mathrm{E}[X_\vartheta] > 0$ is true?

If we assume as before in Section 2 that the sample $\Delta_1, \ldots, \Delta_n$ was generated by independent realisations of $X_\vartheta$ then the distribution of the sample mean is different from the distribution of $X_\vartheta$ , as shown in the following corollary to Proposition 3.2.

Corollary 3.3. Let $X_{1, \vartheta}, \ldots, X_{n, \vartheta}$ be independent and identically distributed copies of $X_\vartheta$ as in Assumption 3.1 and define $\bar{X}_\vartheta = \frac{1}{n} \sum_{i = 1}^n X_{i, \vartheta}$ . Then for the mean and variance of $\bar{X}_\vartheta$ , it holds that

$\begin{align} \mathrm{E}[\bar{X}_\vartheta] & = b_w - \vartheta. \end{align}$

(3.3a)

$\begin{align} \mathrm{var}[\bar{X}_\vartheta] & = \frac{1}{n} \left(\sum\limits_{i = 1}^n w_i\, (b_i-\vartheta_i)^2 - (b_w - \vartheta)^2 + \sum\limits_{i = 1}^n w_i\, \vartheta_i\, (1-\vartheta_i)\right). \end{align}$

(3.3b)

In the following, we use $\bar{X}_\vartheta$ as the test statistic and interpret $\Delta_w = b_w - p_w$ as its observed value.

Lemma 3.4. In the setting of Assumption 3.1, $\vartheta < \widehat{\vartheta}$ implies that $\vartheta_i < \widehat{\vartheta}_i$ for all $i = 1, \ldots, n$ .

Proof. Assume $\vartheta < \widehat{\vartheta}$ and let $h = h(\vartheta)$ and $\widehat{h} = h\bigl(\widehat{\vartheta}\bigr)$ . Along the same lines of algebra as in Section 3 of Tasche [2013b], it can be shown that (with $w_i$ and $\varrho_i$ as in Assumption 3.1) for $0 < t < 1$ and $\eta > 0$ the following two equations are equivalent:

$\begin{equation} \begin{split} 1\ = \ \sum\limits_{i = 1}^n \frac{w_i}{t +(1-t)\, \varrho_i\, \eta} \\ \iff \quad 0 \ = \ \sum\limits_{i = 1}^n \frac{w_i\, (1-\varrho_i\, \eta)}{t +(1-t)\, \varrho_i\, \eta}. \end{split} \end{equation}$

(3.4)

Define $f(t, \eta) = \sum_{i = 1}^n \frac{w_i\, (1-\varrho_i\, \eta)}{t +(1-t)\, \varrho_i\, \eta}$ . Then we obtain

$\begin{align} \frac{\partial f}{\partial t}(t, \eta) &\ = \ - \sum\limits_{i = 1}^n \frac{w_i\, (1-\varrho_i\, \eta)^2} {(t +(1-t)\, \varrho_i\, \eta)^2}\ < \ 0, \end{align}$

(3.5a)

$\begin{align} \frac{\partial f}{\partial \eta}(t, \eta) &\ = \ - \sum\limits_{i = 1}^n \frac{w_i\, \varrho_i} {(t +(1-t)\, \varrho_i\, \eta)^2}\ < \ 0. \end{align}$

(3.5b)

By definition, (3.1d) holds for $\vartheta$ and $h$ . From (3.4) and (3.5a) then it follows that

$\begin{equation*} 0\ > \ \sum\limits_{i = 1}^n \frac{w_i\, (1-\varrho_i\, h)}{\widehat{\vartheta} +(1-\widehat{\vartheta})\, \varrho_i\, h}. \end{equation*}$

However, by (3.4) we also have

$\begin{equation*} 0\ = \ \sum\limits_{i = 1}^n \frac{w_i\, (1-\varrho_i\, \widehat{h})}{\widehat{\vartheta} + (1-\widehat{\vartheta})\, \varrho_i\, \widehat{h}}. \end{equation*}$

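By (3.5b), this is only possible if $h > \widehat{h}$ . Since $\vartheta < \widehat{\vartheta}$ also gives $\frac{1-\vartheta}{\vartheta} > \frac{1-\widehat{\vartheta}}{\widehat{\vartheta}}$ , it follows that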

$\begin{equation*} \frac{(1-\vartheta)\, h}{\vartheta} \ > \ \frac{(1-\widehat{\vartheta})\, \widehat{h}}{\widehat{\vartheta}}. \end{equation*}$

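By (3.1c) (i.e. the definition of $\vartheta_i$ and $\widehat{\vartheta}_i$ ), we have $\vartheta_i = \bigl(1 + \varrho_i\, \frac{(1-\vartheta)\, h}{\vartheta}\bigr)^{-1}$ and the analogous representation for $\widehat{\vartheta}_i$ . Hence this inequality implies $\vartheta_i < \widehat{\vartheta}_i$ .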
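The equivalence (3.4) used in the proof can also be checked directly. The following short derivation is a reading aid; the shorthand $d_i = t +(1-t)\, \varrho_i\, \eta$ is introduced only for this remark. Because $\sum_{i = 1}^n w_i = 1$ and $d_i - 1 = -(1-t)\, (1 - \varrho_i\, \eta)$ ,

$\begin{align*} 1 - \sum\limits_{i = 1}^n \frac{w_i}{d_i} \ = \ \sum\limits_{i = 1}^n w_i\, \frac{d_i - 1}{d_i} \ = \ -(1-t) \sum\limits_{i = 1}^n \frac{w_i\, (1-\varrho_i\, \eta)}{d_i}. \end{align*}$

As $1 - t > 0$ , the left-hand side vanishes if and only if the sum on the right-hand side vanishes, which is (3.4).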

Theorem 3.5. In the setting of Assumption 3.1 and Corollary 3.3, $\vartheta \le \widehat{\vartheta}$ implies that

$\begin{equation*} \mathrm{P}[\bar{X}_\vartheta \le x]\ \le \ \mathrm{P}[\bar{X}_{\widehat{\vartheta}} \le x], \qquad \mathit{\text{for all}}~~x\in\mathbb{R}. \end{equation*}$

Proof. By Lemma 3.4, $\vartheta \le \widehat{\vartheta}$ implies for all $i = 1, \ldots, n$ that $\vartheta_i \le \widehat{\vartheta}_i$ and therefore also

$\begin{equation*} \mathrm{P}[Y_\vartheta \le x\, |\, I = i]\ \ge \ \mathrm{P}[Y_{\widehat{\vartheta}} \le x\, |\, I = i], \qquad \text{for all }x\in\mathbb{R}. \end{equation*}$

The remainder of the proof is identical to the last part of the proof of Proposition 2.7.

Exact p-values. Since, up to the constant factor $1/n$ , the test statistic $\bar{X}_\vartheta$ as defined in Assumption 3.1 and Corollary 3.3 takes only integer values in the range $\{-n, \ldots, -1, 0, 1, \ldots, n\}$ , its distribution can readily be determined exactly by means of an inverse Fourier transform [Rolski et al., 1999, Section 4.7]. By Theorem 3.5 and Theorem 8.3.27 of Casella and Berger [2002], a p-value for the test of $H_0: \vartheta \le b_w$ against $H_1: \vartheta > b_w$ can then be computed exactly as

$\begin{equation} \text{p-value}\ = \ \mathrm{P}[\bar{X}_{b_w} \le b_w - p_w]. \end{equation}$

(3.6a)

A p-value for the test of $H^\ast_0: \vartheta \ge b_w$ against $H^\ast_1: \vartheta < b_w$ is given by

$\begin{equation} \text{p-value}^\ast \ = \ \mathrm{P}[\bar{X}_{b_w} \ge b_w - p_w]. \end{equation}$

(3.6b)
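The exact p-values (3.6a) and (3.6b) can be illustrated with a small Python sketch. Instead of the inverse Fourier transform mentioned above, the sketch uses direct repeated convolution, which yields the same distribution and is adequate for moderate $n$ . Under Assumption 3.1, a single draw $X_{b_w}$ takes the value $+1$ if $b_I = 1$ and $Y = 0$ , the value $-1$ if $b_I = 0$ and $Y = 1$ , and 0 otherwise. The function names are hypothetical, and the sketch requires $0 < b_w < 1$ .

```python
def theta_hat_for(p, w, theta):
    """theta_i from (3.1c), with h from (3.1d) found by bisection."""
    p_w = sum(wi * pi for wi, pi in zip(w, p))
    rho = [(1.0 - pi) / pi * p_w / (1.0 - p_w) for pi in p]

    def g(h):
        return sum(wi / (theta + (1.0 - theta) * ri * h)
                   for wi, ri in zip(w, rho)) - 1.0

    lo, hi = 0.0, 1.0
    while g(hi) > 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    h = 0.5 * (lo + hi)
    return [theta / (theta + (1.0 - theta) * ri * h) for ri in rho]


def exact_p_values(b, p, w):
    """Exact p-values (3.6a) and (3.6b) via n-fold convolution."""
    n = len(b)
    b_w = sum(wi * bi for wi, bi in zip(w, b))
    p_w = sum(wi * pi for wi, pi in zip(w, p))
    th = theta_hat_for(p, w, b_w)      # boundary case theta = b_w
    p_up = sum(wi * bi * (1.0 - ti) for wi, bi, ti in zip(w, b, th))    # X = +1
    p_down = sum(wi * (1 - bi) * ti for wi, bi, ti in zip(w, b, th))    # X = -1
    p_zero = 1.0 - p_up - p_down                                        # X = 0
    pmf = [1.0]                        # distribution of the empty sum
    for _ in range(n):
        nxt = [0.0] * (len(pmf) + 2)
        for k, q in enumerate(pmf):
            nxt[k] += q * p_down
            nxt[k + 1] += q * p_zero
            nxt[k + 2] += q * p_up
        pmf = nxt                      # index k corresponds to the sum k - n
    x = n * (b_w - p_w)                # observed value of n * mean
    eps = 1e-9                         # tolerance against rounding in x
    p_val = sum(q for k, q in enumerate(pmf) if k - n <= x + eps)
    p_star = sum(q for k, q in enumerate(pmf) if k - n >= x - eps)
    return p_val, p_star
```

Since $\mathrm{E}[X_{b_w}] = 0$ , the probabilities of the values $+1$ and $-1$ coincide, so the distribution of the sum is symmetric around 0 in this setting.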

Normal approximate test. By Corollary 3.3, we find that the distribution of $\bar{X}_{b_w}$ can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (3.3b). With $x = b_w-p_w$ , one obtains for the approximate p-value of $H_0: \vartheta \le b_w$ against $H_1: \vartheta > b_w$ :

$\begin{align} \text{p-value} &\ = \ \mathrm{P}[\bar{X}_{b_w} \le x]\\& \ \approx\ \Phi\left(\frac{\sqrt{n}\, (b_w - p_w)} {\sqrt{\sum_{i = 1}^n w_i\, (b_i-\widehat{\vartheta}_i)^2 + \sum_{i = 1}^n w_i\, \widehat{\vartheta}_i\, (1-\widehat{\vartheta}_i)}}\right), \end{align}$

(3.7a)

with $\widehat{\vartheta}_i = \frac{b_w}{b_w +(1-b_w)\, \varrho_i\, h(b_w)}$ as in Assumption 3.1. The same reasoning gives for the normal approximate p-value of $H^\ast_0: \vartheta \ge b_w$ against $H^\ast_1: \vartheta < b_w$ :

$\begin{equation} \text{p-value}^\ast \ \approx\ 1 - \Phi\left(\frac{\sqrt{n}\, (b_w - p_w)} {\sqrt{\sum_{i = 1}^n w_i\, (b_i-\widehat{\vartheta}_i)^2 + \sum_{i = 1}^n w_i\, \widehat{\vartheta}_i\, (1-\widehat{\vartheta}_i)}}\right). \end{equation}$

(3.7b)
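A minimal Python sketch of (3.7a) and (3.7b) follows. It takes the recalibrated PDs $\widehat{\vartheta}_i$ as an input; the function names `norm_cdf` and `normal_p_values_pd` are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function Phi."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_p_values_pd(b, theta_hat, w, p_w):
    """Normal approximate p-values following (3.7a) and (3.7b).

    b         : status indicators b_i in {0, 1}
    theta_hat : recalibrated PDs hat(theta)_i (supplied by the caller)
    w         : weights w_i summing to 1
    p_w       : weighted average of the original PDs
    """
    n = len(b)
    b_w = sum(wi * bi for wi, bi in zip(w, b))
    var = sum(wi * (bi - ti) ** 2 for wi, bi, ti in zip(w, b, theta_hat))
    var += sum(wi * ti * (1.0 - ti) for wi, ti in zip(w, theta_hat))
    z = math.sqrt(n) * (b_w - p_w) / math.sqrt(var)
    p = norm_cdf(z)
    return p, 1.0 - p


# Homogeneous illustration: 100 borrowers with identical PD 0.03, 5 defaults.
# With identical p_i, all rho_i equal 1, h(b_w) = 1 and hat(theta)_i = b_w.
b = [1] * 5 + [0] * 95
w = [0.01] * 100
p_val, p_star = normal_p_values_pd(b, [0.05] * 100, w, p_w=0.03)
```

In this homogeneous, equally weighted illustration the variance term equals $2\, b_w\, (1-b_w)$ , i.e. twice the variance without expansion.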

3.2. The Jeffreys test approach

In Section 2.5.3.1 of ECB [2019], the ECB proposes "PD back testing using a Jeffreys test". Transcribed into the notation of Section 3.1 of this paper, the starting point for the test can be described as follows:

● $n = N$ , where "N is the number of customers in the portfolio/rating grade".

● $\sum_{i = 1}^n b_i = D$ , where "D is the number of those customers that have defaulted within that observation period".

● $\frac{1}{n}\sum_{i = 1}^n p_i = PD$ , where PD means the "PD [probability of default] of the portfolio/rating grade".

● All $w_i$ equal $1/n$ .

The Jeffreys test for the success parameter of a binomial distribution.

● In a Bayesian setting, an "objective Bayesian" prior distribution beta $(1/2, 1/2)$ for the PD is chosen such that – assuming a binomial distribution for the number of defaults – the posterior distribution (i.e. conditional on the observed number of defaults) of the PD is beta $(D+1/2, \, N-D+1/2)$ . See Kazianka [2016] for the rationale for choosing this method of test. If estimated as the mean of the posterior distribution, the Bayesian PD estimate is $\frac{D+1/2}{N+1}$ .

● The null hypothesis is "the PD applied in the portfolio/rating grade $\ldots$ is greater than the true one (one sided hypothesis test)", i.e. $H_0: \theta \le \widehat{\theta}$ with $\widehat{\theta} =$ "applied PD" and $\theta =$ "true PD". In the notation of Section 3.1, this can be phrased as testing $H^\ast_0: \vartheta \ge b_{1/n}$ against $H^\ast_1: \vartheta < b_{1/n}$ .

● ECB [2019]: "The test statistic is the PD of the portfolio/rating grade." The construction principle for the Jeffreys test is to determine a credibility interval for the PD and then to check if the applied PD is inside or outside of the interval.

● The p-value for this kind of Jeffreys test is

$\begin{equation} \text{p-value}_{\text{Jeffreys}} \ = \ F_{D+1/2, \, N-D+1/2}(PD), \end{equation}$

(3.8)

where $F_{\alpha, \, \beta}$ denotes the distribution function of the beta $(\alpha, \beta)$ -distribution.

Comments.

● The standard (frequentist) one-sided binomial test would be: 'Reject $H_0$ if $D \ge c$ ' where $c$ is a 'critical' value such that the probability under $H_0$ to observe $c$ or more defaults is small. For this test, the p-value is

$\begin{equation} \text{p-value}_{\text{freq}}\ = \ \sum\limits_{i = D}^N \bigl(\begin{smallmatrix} N \\ i\end{smallmatrix}\bigr)\, PD^i\, (1-PD)^{N-i} \ = \ F_{D, \, N-D+1}(PD). \end{equation}$

(3.9)

Hence, unless the observed number of defaults $D$ is very small or even zero, it follows from (3.8) and (3.9) that in practice the Jeffreys test and the standard binomial test give similar results most of the time.

● For a 'fair' comparison of the Jeffreys test and the test proposed in Section 3.1, we have to modify Assumption 3.1 such that there is no variance expansion and all weights are equal, i.e. the random variable $X_\vartheta$ is simply defined by

$\begin{equation} \mathrm{P}[X_\vartheta = b_i - \vartheta_i] \ = \ \frac{1}{n}, \qquad i = 1, \ldots, n, \end{equation}$

(3.10)

where the $\vartheta_i$ depend on the unknown parameter $0 < \vartheta < 1$ in the way described by (3.1c) and (3.1d). The normal approximate p-value of $H_0$ against $H_1$ is then (using the ECB notation)

$\begin{equation} \text{p-value} \ \approx\ 1 - \Phi\left(\frac{\sqrt{N}\, (D/N - PD)} {\sqrt{D/N\, (1-D/N)}}\right). \end{equation}$

(3.11)

● The normal approximation of the frequentist (and by (3.8) and (3.9) also Jeffreys) binomial test p-value is

$\begin{equation} \text{p-value}_{\text{freq}}\ \approx \ 1 - \Phi\left(\frac{\sqrt{N}\, (D/N - PD)} {\sqrt{PD\, (1-PD)}}\right). \end{equation}$

(3.12)

● The test for $H_0$ as required by the ECB would typically be performed when $D/N > PD$ , i.e. when there are doubts with regard to the conservatism of the PD estimate. Rejection of $H_0$ would then be regarded as 'proof' of the estimate being aggressive while non-rejection would entail 'acquittal' for lack of evidence. In the case of $1/2 \ge D/N > PD$ , it holds that $PD\, (1-PD) < D/N\, (1-D/N)$ , so that the p-value according to the ECB test is lower than the p-value according to (3.10) and (3.11), i.e. the ECB test would reject $H_0$ earlier than the simplified version of the test according to Section 3.1.
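The p-values (3.8), (3.9), (3.11) and (3.12) discussed in the comments above can be compared numerically with the following sketch. To keep it free of third-party dependencies, the beta distribution function is approximated by Simpson's rule after the substitution $u = t^2$ ; this numerical scheme is a choice made for this sketch (a library routine for the regularised incomplete beta function would normally be used), and all function names are hypothetical.

```python
import math


def beta_cdf(x, a, b, steps=4000):
    """F_{a,b}(x) for a >= 1/2 and 0 < x < 1, by Simpson's rule in t = sqrt(u)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def integrand(t):
        if t == 0.0:
            return 2.0 * math.exp(log_norm) if a == 0.5 else 0.0
        return 2.0 * math.exp(log_norm + (2.0 * a - 1.0) * math.log(t)
                              + (b - 1.0) * math.log1p(-t * t))

    upper = math.sqrt(x)
    step = upper / steps               # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(upper)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * integrand(k * step)
    return total * step / 3.0


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def jeffreys_p_value(N, D, PD):
    """(3.8): posterior distribution function evaluated at the applied PD."""
    return beta_cdf(PD, D + 0.5, N - D + 0.5)


def binomial_p_value(N, D, PD):
    """(3.9): probability of D or more defaults under the applied PD."""
    return sum(math.comb(N, i) * PD ** i * (1.0 - PD) ** (N - i)
               for i in range(D, N + 1))


def normal_p_value_simplified(N, D, PD):
    """(3.11): simplified Section 3.1 test, variance from the default rate."""
    r = D / N
    return 1.0 - norm_cdf(math.sqrt(N) * (r - PD) / math.sqrt(r * (1.0 - r)))


def normal_p_value_binomial(N, D, PD):
    """(3.12): normal approximation of the binomial test, variance from PD."""
    r = D / N
    return 1.0 - norm_cdf(math.sqrt(N) * (r - PD) / math.sqrt(PD * (1.0 - PD)))
```

For example, with the illustrative values $N = 100$ , $D = 5$ and $PD = 0.03$ (so that $1/2 \ge D/N > PD$ ), `normal_p_value_binomial` returns a smaller value than `normal_p_value_simplified`, in line with the last comment above.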

4. Numerical examples

The test methods of Section 2 and the appendices are illustrated in Section 4.1 below with numerical results from tests on a data set from Fischer and Pfeuffer [2014, Table 1]. The test methods of Section 3 are illustrated in Section 4.2 below with numerical results from tests on a data set consisting of simulated data. However, the exposures in the data set are again from Fischer and Pfeuffer [2014, Table 1]. A zip-archive with the R-scripts and csv-files that were used for computing the results can be downloaded from https://www.researchgate.net/profile/Dirk_Tasche.

4.1. Example: Tests for variables with values in the unit interval

Explanations.

● Sample means: According to (2.1). Weights according to (2.2b) with EAD from the column 'raw.w' of the data set, and $w_i = 1/100$ in the equally weighted case.

● Sample standard deviations: First two values according to the square root of the right-hand side of (2.4b). Third value also according to (2.4b), but with $\widetilde{\Delta}_i$ from (A3.a) and equal weights.

● Weights according to (2.2b) with EAD from the column 'raw.w' of the data set.

● Sample quantiles: Based on sample $\Delta_1, \ldots, \Delta_{100}$ computed as difference of columns 'obs' and 'pred' of the data set.

● Weight-adjusted sample quantiles: Based on sample $\widetilde{\Delta}_1, \ldots, \widetilde{\Delta}_{100}$ according to (A3.a).

● t-test results: 'Eq-weighted' according to (2.9) and $1-\text{p-value}^\ast$ for the first row of the t-test results. 'Weighted' analogously adapted for the weighted case (but without strong theoretical foundation). 'W-adjusted' like 'Eq-weighted' but for the sample $\widetilde{\Delta}_1, \ldots, \widetilde{\Delta}_{100}$ .

● 'Basic' results: Bootstrapped according to (2.6a) and (2.6b) respectively, with weights and samples like for the t-test rows.

● 'Basic normal' results: Normal approximations according to (2.8b) and (2.8c) respectively, with weights and samples like for the t-test rows.

● 'Expanded variance' results: With weights and samples like for the t-test rows, bootstrapped according to (2.16a) and (2.16b) respectively for the first two values, and according to (B6.a) and (B6.b) respectively for the third value.

● 'Exp var normal' results: With weights and samples like for the t-test rows, normal approximations according to (2.18a) and (2.18b) respectively for the first two values, and according to (B7.a) and (B7.b) respectively for the third value.

This example demonstrates that

● test results based on equally weighted means and means with inhomogeneous weights can lead to contradictory conclusions,

● variance expansion to capture the individual randomness of single observation-prediction pairs can have some impact on the degree of certainty of the test results, by entailing greater p-values, and

● the two different approaches to account for the weights of the observation-prediction pairs discussed in this paper can deliver similar but still clearly different results.

4.2. Example: Testing probabilities on inhomogeneous samples

Explanations.

● See Section 4.1 for an explanation of the summary of the sample distribution.

● 'Jeffreys' results: The 'Eq-weighted' value for 'H0: mean(obs-pred) $\le$ 0 vs. H1: mean(obs-pred) $>$ 0' is computed according to (3.8). The 'Eq-weighted' value for 'H0: mean(obs-pred) $\ge$ 0 vs. H1: mean(obs-pred) $<$ 0' is $1-\text{p-value}_{\text{Jeffreys}}$ . No 'Weighted' results are computed because there is no obvious 'weighted mean'-version of the binomial Jeffreys test.

● 'Basic' results: Bootstrapped according to (2.6a) and (2.6b) respectively.

● 'Basic normal' results: Normal approximations according to (2.8b) and (2.8c) respectively.

● 'Expanded variance' results: Exact p-values by inverse Fourier transform according to (3.6a) and (3.6b) respectively.

● 'Exp var normal' results: Normal approximations according to (3.7a) and (3.7b) respectively.

This example demonstrates that

● as mentioned in Section 3.2, the Jeffreys test tends to reject 'H0: mean(obs-pred) $\le$ 0' earlier than the other tests discussed in Section 3,

● test results based on equally weighted means and means with inhomogeneous weights can lead to different outcomes (no conclusion vs. rejection of the null hypothesis), and

● variance expansion to capture the individual randomness of single observation-prediction pairs can have some impact on the degree of certainty of the test results, by entailing greater p-values.

5. Conclusions

In this paper, we have made suggestions of how to improve on the t-test and the Jeffreys test presented in ECB [2019] for assessing the 'predictive ability (or calibration)' of credit risk parameters. The improvements refer to

● also testing the null hypothesis that the estimated parameter is less than or equal to the true parameter in order to be able to 'prove' that the estimate is prudent (or conservative),

● additionally using exposure- or limit-weighted sample averages in order to better inform assessments of estimation (or prediction) prudence, and

● 'variance expansion' in order to account for sample inhomogeneity in terms of composition (exposure sizes) and riskiness.

The suggested test methods have been illustrated with exemplary test results. R-scripts with code for the tests are available.

Acknowledgments

The author is grateful to two anonymous reviewers whose comments redounded to significant improvements of the paper.

Conflicts of interest

The author declares no conflicts of interest in this paper.

References

[1]	BCBS (2006) International Convergence of Capital Measurement and Capital Standards. A Revised Framework, Comprehensive Version.
[2]	Bellini T (2019) IFRS 9 and CECL Credit Risk Modelling and Validation: A Practical Guide with Examples Worked in R and SAS. Academic Press.
[3]	Blümke O (2019) Out-of-Time Validation of Default Probabilities within the Basel Accord: A comparative study. SSRN Electron J. http://dx.doi.org/10.2139/ssrn.2945931
[4]	Casella G, Berger RL (2002) Statistical Inference. Duxbury Press, second edition.
[5]	Davison AC, Hinkley DV (1997) Bootstrap Methods and their Application. Cambridge University Press.
[6]	ECB (2019) Instructions for reporting the validation results of internal models: IRB Pillar I models for credit risk. European Central Bank – Banking Supervision.
[7]	Fischer M, Pfeuffer M (2014) A statistical repertoire for quantitative loss given default validation: overview, illustration, pitfalls and extensions. J Risk Model Validation 8: 3–29. http://dx.doi.org/10.21314/JRMV.2014.115
[8]	Gürtler M, Hibbeln MT, Usselmann P (2018) Exposure at default modeling – A theoretical and empirical assessment of estimation approaches and parameter choice. J Bank Financ 91: 176–188. http://dx.doi.org/10.1016/j.jbankfin.2017.03.004
[9]	Hand DJ (1997) Construction and Assessment of Classification Rules. John Wiley & Sons, Chichester.
[10]	Kazianka H (2016) Objective Bayesian estimation of the probability of default. J R Stat Soc Ser C 65: 1–27. http://dx.doi.org/10.1111/rssc.12107
[11]	Li D, Bhariok R, Keenan S, et al. (2009) Validation techniques and performance metrics for loss given default models. J Risk Model Validation 3: 3–26. http://dx.doi.org/10.21314/JRMV.2009.045
[12]	Loterman G, Debruyne M, Vanden BK, et al. (2014) A proposed framework for backtesting loss given default models. J Risk Model Validation 8: 69–90. http://dx.doi.org/10.21314/JRMV.2014.117
[13]	Mendenhall W, Beaver RJ, Beaver BM (2008) Introduction to probability and statistics. Cengage Learning, 13th edition.
[14]	Rolski T, Schmidli H, Schmidt V, et al. (1999) Stochastic Processes for Insurance and Finance. Wiley Series in Probability and Statistics. John Wiley & Sons.
[15]	Scandizzo S (2016) The validation of risk models: A handbook for practitioners. Springer. https://doi.org/10.1057/9781137436962
[16]	Tasche D (2013a) The art of probability-of-default curve calibration. J Credit Risk 9: 63–103. https://doi.org/10.21314/JCR.2013.169
[17]	Tasche D (2013b) The Law of Total Odds. Working paper. https://doi.org/10.48550/arXiv.1312.0365
[18]	Venables WN, Ripley BD (2002) Modern Applied Statistics with S. Springer, fourth edition.

© 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
