One of the beauties of maths is that it is the same in every language. So you don’t need to know Italian to read the table on the second page of the article in this week’s Milano Finanza.

The Made in Italy Fund started in May last year and is up 43% since then.

Here are the main points of the article:

The Italian stock market is dominated by the largest 40 stocks included in the FTSE MIB index. The FTSE MIB and the FTSE Italia All Shares indices are virtually overlapping (first graph on page 1).

2/3 of the Italian market is concentrated in 4 sectors.

Small Caps – companies with a market cap of less than 1 billion euro – are 3/4 of the 320 quoted names, but represent only 6% of the value of the market.

Small Caps as a whole have underperformed Large Caps (second graph).

But quality Small Caps – those included in the Star segment of the market – have outclassed the MIB index (third graph).

However, the Star index is itself concentrated (table on page 3): the top 11 stocks in the index with a market cap above 1 billion (not 12: Yoox is no longer there) represent more than 60% of the index value (a company needs to be below 1 billion to get into the Star segment, but it is not necessarily taken out when it goes above).

Therefore, to invest in Italian Small Caps you need to know what you’re doing: you can’t just buy a Mid/Small Cap ETF – which is what a lot of people did in the first quarter of this year, after the launch of PIR accounts (similar to UK ISAs), taking the Lyxor FTSE Italia Mid Cap ETF from 42 to 469 million.

To this I would add: you can’t just buy a fund tracking the Star index either (there are a couple): to own a stock just because it is part of an index makes no sense – more on this in the next post.

Fisher’s Bias – focusing on a low FPR without regard to TPR – is the mirror image of the Confirmation Bias – focusing on a high TPR without regard to FPR. They both neglect the fact that what matters is the ratio of the two – the Likelihood Ratio. As a result, they both give rise to major inferential pitfalls.
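In odds form, both biases are easy to see numerically. Here is a minimal sketch with made-up numbers (the function name is mine, for illustration): the posterior depends on TPR and FPR only through their ratio, the Likelihood Ratio.

```python
# Illustrative sketch: the posterior depends on TPR and FPR only
# through their ratio, the Likelihood Ratio.
def posterior_prob(base_rate, tpr, fpr):
    """Odds form of Bayes' rule: PO = BO * LR, PP = PO / (1 + PO)."""
    bo = base_rate / (1 - base_rate)   # base odds
    lr = tpr / fpr                     # likelihood ratio
    po = bo * lr                       # posterior odds
    return po / (1 + po)

# Confirmation Bias: a high TPR means nothing if FPR is just as high.
print(posterior_prob(0.5, tpr=0.99, fpr=0.99))   # LR=1: PP stays at 0.5
# Fisher's Bias: a low FPR means nothing if TPR is just as low.
print(posterior_prob(0.5, tpr=0.01, fpr=0.01))   # LR=1: PP stays at 0.5
```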

The Confirmation Bias explains weird beliefs – the ancient Greeks’ reliance on divination and the Aztecs’ gruesome propitiation rites, as well as present-day lunacies, like psychics and other fake experts, superstitions, conspiracy theories and suicide bombers, alternative medicine and why people drink liquor made by soaking a dried tiger penis, with testicles attached, in a bottle of French cognac.

Fisher’s Bias has no less deleterious consequences. FPR<5% hence PP>95%: ‘We have tested our theory and found it significant at the 5% level. Therefore, there is only a 5% probability that we are wrong.’ This is the source of a deep and far-reaching misunderstanding of the role, scope and goals of what we call science.

‘Science says that…’, ‘Scientific evidence shows that…’, ‘It has been scientifically proven that…’: the view behind these common expressions is of science as a repository of established certainties. Science is seen as the means for the discovery of conclusive evidence or, equivalently, the accumulation of overwhelmingly confirmative evidence that leaves ‘no room for doubt or opposition’. This is a treacherous misconception. While truth is its ultimate goal, science is not the preserve of certainty but quite the opposite: it is the realm of uncertainty, and its ethos is to be entirely comfortable with it.

Fisher’s Bias sparks and propagates the misconception. Evidence can lead to certainty, but it often doesn’t: the tug of war between confirmative and disconfirmative evidence does not always have a winner. By equating ‘significance’ with ‘certainty beyond reasonable doubt’, Fisher’s Bias encourages a naïve trust in the power of science and a credulous attitude towards any claim that manages to be portrayed as ‘scientific’. In addition, once deflated by the reality of scientific controversy, such trust can turn into its opposite: a sceptical view of science as a confusing and unreliable enterprise, propounding similarly ‘significant’ but contrasting claims, all portrayed as highly probable, but in fact – as John Ioannidis crudely puts it – mostly false.

Was Ronald Fisher subject to Fisher’s Bias? Apparently not: he stressed that ‘the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis’, immediately adding that ‘if an experiment can disprove the hypothesis’ it does not mean that it is ‘able to prove the opposite hypothesis.’ (The Design of Experiments, p. 16). However, the reasoning behind such a conclusion is typically awkward. The opposite hypothesis (in our words, the hypothesis of interest) cannot be tested because it is ‘inexact’ – remember that in the tea-tasting experiment the hypothesis is that the lady has some unspecified level of discerning ability. But – says Fisher – even if we were to make it exact, e.g. by testing perfect ability, ‘it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation’ (ibid.). Notice the confusion: in saying that FPR<5% disproves the null hypothesis but FPR>5% does not prove it, Fisher is using the word ‘prove’ in two different ways. By ‘disproving’ the null he means considering it unlikely enough, but not certainly false. By ‘proving’ it, however, he does not mean considering it likely enough – which would be the correct symmetrical meaning – but he means considering it certainly true. That’s why he says that the null hypothesis as well as the opposite hypothesis are never proved. But this is plainly wrong and misleading. Prove/disprove is the same as accept/reject: it is a binary decision – doing one means not doing the other. So disproving the null hypothesis does mean proving the opposite hypothesis – not in the sense that it is certainly true, but in the correct sense that it is likely enough.

Here then is Fisher’s mistake. If H is the hypothesis of interest and not H the null hypothesis, FPR=P(E|not H) – the probability of the evidence (e.g. a perfect choice in the tea-tasting experiment) given that the hypothesis of interest is false (i.e. the lady has no ability and her perfect choice is a chance event). Then saying that a low FPR disproves the null hypothesis is the same as saying that a low P(E|not H) means a low P(not H|E). But since P(not H|E)=1–P(H|E)=1–PP, then a low FPR means a high PP, as in: FPR<5% hence PP>95%.
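To see the mistake in numbers, here is a small sketch (the illustrative values are my own): FPR below 5% coexisting with a negligible PP, simply because the prior is low.

```python
def pp(base_rate, tpr, fpr):
    # P(H|E) from the odds form of Bayes' rule
    bo = base_rate / (1 - base_rate)
    po = bo * tpr / fpr
    return po / (1 + po)

# Unlikely hypothesis (BR=0.1%), decent power, 5% significance:
# a low P(E|not H) does not mean a low P(not H|E).
print(round(pp(0.001, tpr=0.80, fpr=0.05), 3))   # 0.016: PP under 2%
```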

Hence yes: Ronald Fisher was subject to Fisher’s Bias. Despite his guarded and ambiguous wording, he did implicitly believe that 5% significance means accepting the hypothesis of interest. We have seen why: prior indifference. Fisher would not contemplate any value of BR other than 50%, i.e. BO=1, hence PO=LR=TPR/FPR. Starting with prior indifference, all that is needed for PP=1-FPR is error symmetry.

Fisher’s Bias gives rise to invalid inferences, misplaced expectations and wrong attitudes. By setting FPR in its proper context, our Power surface brings much-needed clarity to the subject, including, as we have seen, Ioannidis’s brash claim. Let’s now take a closer look at it.

Remember Ioannidis’s main point: published research findings are skewed towards acceptance of the hypothesis of interest based on the 5% significance criterion. Fisher’s Bias favours the publication of ‘significant’ yet unlikely research findings, while ‘insignificant’ results remain unpublished. As we have seen, however, this happens for a good reason: it is unrealistic to expect a balance, as neither researchers nor editors are interested in publishing rejections of unlikely hypotheses. What makes a research finding interesting is not whether it is true or false, but whether it confirms an unlikely hypothesis or disconfirms a likely one.

Take for instance Table 4 in Ioannidis’s paper (p. 0700), which shows nine examples of research claims as combinations of TPR, BO and PP, given FPR=5%. Remember the match between our and Ioannidis’s notation: FPR=α, TPR=1-β (FNR=β), BO=R and PP=PPV. For the moment, let’s just take the first two columns and leave the rest aside:

So for example the first claim has TPR=80%, hence LR=16 and, under prior indifference (BO=1, BR=50%), PO=16 and therefore PP=94.1%. In the second, we have TPR=95%, hence LR=19, BO=2 and BR=2/3, hence PO=38 and therefore PP=97.4%. And so on. As we can see, four claims have PP>50%: there is at least a preponderance of evidence that they are true. Indeed the first three claims are true even under a higher standard, with claim 2 in particular reaching beyond reasonable doubt, as it starts from an already high prior, which gets further increased by powerful confirmative evidence. In 3, powerful evidence manages to update a sceptical 25% prior to an 84% posterior, and in 6 to update an even more strongly sceptical prior to a posterior above 50%. The other five claims, on the other hand, have PP<50%: they are false even under the lowest standard of proof, with 8 and 9 in particular standing out as extremely unlikely. Notice however that in all nine cases we have LR>1: evidence is, in various degrees, confirmative, i.e. it increases prior odds to a higher level. Even in the last two cases, where evidence is not very powerful and BR is a tiny 1/1000 – just like in our child footballer story – LR=4 quadruples it to 1/250. The posterior is still very small – the claims remain very unlikely – but this is the crucial point: they are a bit less unlikely than before. That’s what makes a research finding interesting: not a high PP but a LR significantly different from 1. All nine claims in the table – true and false – are interesting and, as such, worth publication. This includes claim 2, where further confirmative evidence brings virtual certainty to an already strong consensus. But notice that in this case disconfirmative evidence, reducing prior odds and casting doubt on such consensus, would have attracted even more interest. 
Just as we should expect to see a preponderance of studies confirming unlikely hypotheses, we should expect to see the same imbalance in favour of studies disconfirming likely hypotheses. It is the scientific enterprise at work.
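The arithmetic behind these examples is quick to reproduce. A minimal sketch (the helper function is mine; TPR, FPR and BO are the values quoted above from Ioannidis’s Table 4):

```python
def pp_from(tpr, fpr, bo):
    """PP given power, significance and base odds: PO = BO * TPR/FPR."""
    po = bo * tpr / fpr
    return po / (1 + po)

# Claim 1: TPR=80%, FPR=5%, prior indifference (BO=1)
print(round(pp_from(0.80, 0.05, 1.0), 3))    # 0.941
# Claim 2: TPR=95%, BO=2 (BR=2/3)
print(round(pp_from(0.95, 0.05, 2.0), 3))    # 0.974
# Claim 3: TPR=80%, BO=1/3 (sceptical 25% prior)
print(round(pp_from(0.80, 0.05, 1/3), 3))    # 0.842
```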

Let’s now look at Ioannidis’s auxiliary point: the preponderance of ‘significant’ findings is reinforced by a portion of studies where significance is obtained through data manipulation. He defines bias u as ‘the proportion of probed analyses that would not have been “research findings”, but nevertheless end up presented and reported as such, because of bias’ (p. 0700).

How does Ioannidis’s bias modify his main point? This is shown in the following table, where PP* coincides with PPV in his Table 4:

Priors are the same, but now bias u causes a substantial reduction in LR and therefore in PP. For instance, in the first case u=0.10 means that 10% of research findings supporting the claim have been doctored into 5% significance through some form of data tampering. As a result, LR is lowered from 16 to 5.7 and PP from 94.1% to 85%. So in this case the effect of bias is noticeable but not determinant. The same is true in the second case, where a stronger bias causes a big reduction in LR from 19 to 2.9, but again not enough to meaningfully alter the resulting PP. In the third case, however, an even stronger bias does the trick: it reduces LR from 16 to 2 and PP from 84.2% all the way down to 40.6%. While the real PP is below 50%, a 40% bias makes it appear well above: the claim looks true but is in fact false. Same for 6, while the other five claims, which would be false even without bias, are even more so with bias – their LR reduced to near 1 and their PP consequently remaining close to their low BR.

This sounds a bit confusing, so let’s restate it, taking case 3 as an example. The claim starts with a 25% prior – it is not a well established claim and would therefore do well with some confirmative evidence. The evidence that appears is quite strong: FPR=5% and TPR=80%, giving LR=16, which elevates PP to 84.2%. But in reality the evidence is not as strong: 40% of the findings accepting the claim have been squeezed into 5% significance through data fiddling. Therefore the real LR – the one that would have emerged without data alterations – is much lower, and so is the real PP resulting from it: the claim appears true but is false. So is claim 6, thus bringing the total of false claims from five to seven – indeed most of them.

How does bias u alter LR? In Ioannidis’s model, it does so mainly by turning FPR into FPR*=FPR+(1-FPR)u – see Table 2 in the paper (p. 0697). FPR* is a positive linear function of u, with intercept FPR and slope 1-FPR, which, since FPR=5%, is a very steep 0.95. In case 3, for example, u produces a large increase of FPR from 5% to 43%. In addition, u turns TPR into TPR*=TPR+(1-TPR)u, which is also a positive linear function of u, with intercept TPR and slope 1-TPR which, since the TPR of confirmative evidence is higher than FPR, is flatter. In case 3 the slope is 0.2, so u increases TPR from 80% to 88%. The combined effect, as we have seen, is a much lower LR*=TPR*/FPR*, going down from 16 to 2.
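Ioannidis’s adjustment can be sketched in a few lines (the function is my own paraphrase of the formulas above; the numbers are case 3’s):

```python
def biased_rates(tpr, fpr, u):
    """Ioannidis's bias u inflates both rates: X* = X + (1 - X) * u."""
    fpr_b = fpr + (1 - fpr) * u
    tpr_b = tpr + (1 - tpr) * u
    return tpr_b, fpr_b, tpr_b / fpr_b   # LR* = TPR*/FPR*

# Case 3: TPR=80%, FPR=5%, u=0.4
tpr_b, fpr_b, lr_b = biased_rates(0.80, 0.05, 0.4)
print(round(fpr_b, 2), round(tpr_b, 2), round(lr_b, 1))   # 0.43 0.88 2.0
# PP with a sceptical 25% prior (BO=1/3): down from 84.2% to about 40.6%
po = (1 / 3) * lr_b
print(round(po / (1 + po), 3))   # 0.406
```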

I will post a separate note about this model, but the point here is that, while Ioannidis’s bias increases the proportion of false claims, it is not the main reason why most of them are false. Five of the nine claims in his Table 4 would be false even without bias.

In summary, by confusing significance with virtual certainty, Fisher’s Bias encourages Ioannidis’s bias (I write it with a small b because it has no cognitive value: it is just more or less intentional cheating). But Ioannidis’s bias does not explain ‘Why Most Research Findings Are False’. The main reason is that many of them test unlikely hypotheses, and therefore, unless they manage to present extraordinary or conclusive evidence, their PP turns out to be lower and often much lower than 50%. But this doesn’t make them worthless or unreliable – as the paper’s title obliquely suggests. As long as they are not cheating, researchers are doing their job: trying to confirm unlikely hypotheses. At the same time, however, they have another important responsibility: to warn the reader against Fisher’s Bias, by explicitly clarifying that, no matter how ‘significant’ and impressive their results may appear, they are not ‘scientific revelations’ but tentative discoveries in need of further evidence.

Armed with our Power surface, let’s revisit John Ioannidis’s claim according to which ‘Most Published Research Findings Are False’.

Ioannidis’s target is the immeasurable confusion generated by the widespread mistake of interpreting Fisher’s 5% statistical significance as implying a high probability that the hypothesis of interest is true. FPR<5%, hence PP>95%. As we have seen, this is far from being the general case: TPR=PP if and only if BO=FPR/FNR, which under prior indifference requires error symmetry.
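The identity is easy to verify numerically (arbitrary illustrative values of my own):

```python
# PP equals TPR exactly when BO = FPR/FNR; under prior indifference (BO=1)
# this reduces to FPR = FNR, i.e. error symmetry.
tpr, fpr = 0.8, 0.1
fnr = 1 - tpr
bo = fpr / fnr                 # the special prior that makes PP = TPR
po = bo * tpr / fpr            # PO = BO * LR
pp = po / (1 + po)
print(round(pp, 6), tpr)       # 0.8 0.8
```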

Fisher’s 5% significance is neither a sufficient nor a necessary condition for accepting the hypothesis of interest. Besides FPR, acceptance and rejection depend on BR, PP and TPR. Given FPR=5%, all combinations of the three variables lying on or above the curved surface indicate acceptance. But, contrary to Fisher’s criterion, combinations below the surface indicate rejection. The same is true for values of FPR below 5%, which are even more ‘significant’ according to Fisher’s criterion. These widen the curved surface and shrink the roof, thus enlarging the scope for acceptance, but may still indicate rejection for low priors and high standards of proof, if power (TPR) is not, or cannot be, high enough. On the other hand, FPR values above 5%, which according to Fisher’s criterion are ‘not significant’ and therefore imply unqualified rejection, reduce the curved surface and expand the roof, thus enlarging the scope for rejection, but may still indicate acceptance for higher priors and lower standards of proof, provided power (TPR) is high enough. Here are pictures for FPR=2.5% and 10%:

So let’s go back to where we started. We want to know whether a certain claim is true or false. But now, rather than seeing it from the perspective of a statistician who wants to test the claim, let’s see it from the perspective of a layman who wants to know if the claim has been tested and whether the evidence has converged towards a consensus, one way or the other.

For example: ‘Is it possible to tell whether milk has been poured before or after hot water by just tasting a cup of tea?’ (bear with me please). We google the question and imagine we get ten papers delving into this vital issue. The first and earliest is by none other than the illustrious Ronald Fisher, who performed the experiment on the algologist Muriel Bristol and, on finding that she made no mistakes with 8 cups – an event that has only a 1/70 probability of being the product of chance, i.e. a p-value of 1.4%, much lower than required by his significance criterion – concluded, against his initial scepticism, that ‘Yes, it is possible’. That’s it? Well, no. The second paper describes an identical test performed on the very same Ms Bristol three months later, where she made 1 mistake – an event that has a 17/70 probability of being a chance event, i.e. a p-value of 24.3%, much larger than the 5% limit allowed by Fisher’s criterion. Hence the author rejected Professor Fisher’s earlier claim about the lady’s tea-tasting ability. On to the third paper, where Ms Bristol was given 12 cups and made 1 mistake, an event with a 37/924=4% probability of being random, once again below Fisher’s significance criterion. And so on with the other papers, each one with its own setup, its own numbers and its own conclusions.
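The p-values quoted in these papers can be reproduced from the hypergeometric distribution behind Fisher’s exact test. A sketch (the helper is mine, assuming the cups are split evenly between the two preparations and the lady must sort them into two equal groups):

```python
from math import comb

def p_value(n_cups, k_mistakes):
    """Chance probability of k or fewer mistakes when n_cups are split
    evenly and sorted into two equal groups (hypergeometric)."""
    half = n_cups // 2
    total = comb(n_cups, half)                    # all possible sortings
    favourable = sum(comb(half, half - m) * comb(half, m)
                     for m in range(k_mistakes + 1))
    return favourable / total

print(p_value(8, 0))    # 1/70  ~ 0.014: Fisher's perfect choice
print(p_value(8, 1))    # 17/70 ~ 0.243: one mistake, second paper
print(p_value(12, 1))   # 37/924 ~ 0.040: one mistake with 12 cups
```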

It is tempting at this point for the layman to throw up his hands in despair and execrate the so-called experts for being unable to give a uniform answer to such a simple question. But he would be entirely wrong. The evidential tug of war between confirmative and disconfirmative evidence is the very essence of science. It is up to us to update our prior beliefs through multiplicative accumulation of evidence and to accept or reject a claim according to our standard of proof.

If anything, the problem is the opposite: not too much disagreement but – and this is Ioannidis’s main point – too little. Evidence accumulation presupposes that we are able to collect the full spectrum, or at least a rich, unbiased sample of all the available evidence. But we hardly ever do. The evidence we see is what reaches publication, and publication is naturally skewed towards ‘significant’ findings. A researcher who is trying to prove a point will only seek publication if he thinks he has gathered enough evidence to support it. Who wants to publish a paper about a theory only to announce that he has got inadequate evidence for it? And even if he tried, what academic journal would publish it?

As a result, available evidence tends to be biased towards acceptance. And since acceptance is still widely based on Fisher’s criterion, most published papers present FPR<5%, while those with FPR>5% remain unpublished and unavailable. To add insult to injury, in order to reach publication some studies get squeezed into significance through more or less malevolent data manipulation. It is what Ioannidis calls bias: ‘the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced’ (p. 0697).

This dramatically alters the evidential tug of war. It is as if, when looking into the milk-tea question, we would only find Fisher’s paper and others accepting the lady’s ability – including some conveniently glossing over a mistake or two – and none on the side of rejection. We would then be inclined to conclude that the experts agree and would be tempted to go along with them – perhaps disseminating and reinforcing the bias through our own devices.

How big a problem is this? Ioannidis clearly thinks it is huge – hence the dramatic title of his paper, enough to despair not just about some experts but about the entire academic community and its scientific enterprise. Is it this bad? Are we really swimming in a sea of false claims?

Let’s take a better look. First, we need to specify what we mean by true and false. As we know, it depends on the standard of proof, which in turn depends on utility preferences. What is the required standard of proof for accepting a research finding as true? Going by the wrongful interpretation of Fisher’s 5% significance criterion, it is PP>95%. But this is not only a mistake: it is the premise behind an insidious misrepresentation of the very ethos of scientific research. Obviously, truth and certainty are the ultimate goals of any claim. But the value of a research finding is not in how close it is to the truth, but in how much closer it gets us to the truth. In our framework, it is not in PP but in LR.

The goal of scientific research is finding and presenting evidence that confirms or disconfirms a specific hypothesis. How much more (or less) likely is the evidence if the hypothesis is true than if it is false? The value of evidence is in its distance from the unconfirmative middle point LR=1. A study is informative, hence worth publication, if the evidence it presents has a Likelihood Ratio significantly different from 1, and is therefore a valuable factor in the multiplicative accumulation of knowledge. But a high (or low) LR is not the same as a high (or low) PP. They only coincide under prior indifference, where, more precisely, PO=LR, i.e. PP=LR/(1+LR). So, for example, if LR=19 – the evidence is 19 times more likely if the hypothesis is true than if it is false – then PP=19/20=95%. But, as we know very well, prior indifference is not a given: it is a starting assumption which, depending on the circumstances, may or may not be valid. BO – Ioannidis calls it R, ‘the ratio of “true relationships” to “no relationships”’ (p. 0696) – gives the pre-study odds of the investigated hypothesis being true. It can be high, if the hypothesis has been tested before and is already regarded as likely true, or low, if it is a novel hypothesis that has never been tested and, if true, would be an unexpected discovery. In the first case, LR>1 is a further confirmation that the hypothesis should be accepted as true – a useful but hardly noteworthy exercise that just reinforces what is already known. On the other hand, LR<1 is much more interesting, as it runs against the established consensus. LR could be so low as to convert a high BO into a low PO, thus rejecting a previously accepted hypothesis. But not necessarily: while lower than BO, PO could remain high, thus keeping the hypothesis true, while casting some doubts on it and prodding further investigation. In the second case, LR<1 is a further confirmation that the hypothesis should be rejected as false. On the other hand, LR>1 increases the probability of an unlikely hypothesis. It could be so high as to convert a low BO into a high PO, thus accepting what was previously an unfounded conjecture. But not necessarily: while higher than BO, PO could remain low, thus keeping the hypothesis false, but at the same time stimulating more research.

Such distinctions get lost in Ioannidis’s sweeping claim. True, SUTCing priors and neglecting a low BO can lead to mistakenly accepting hypotheses on the basis of evidence that, while confirmative, leaves PO well below any acceptance level. The mistake is exacerbated by Fisher’s Bias – confusing significance (a low FPR) with confirmation (a high PP) – and by Ioannidis’s bias – squeezing FPR below 5% through data alteration. FPR<5% does not mean PP>95% or even PP>50%. As shown in our Power surface, for any standard of proof, the lower is BO the higher is the required TPR for any level of FPR. Starting from a low BO, accepting the hypothesis requires very powerful evidence. Without it, acceptance is a false claim. Moreover, published acceptances – rightful or wrongful – are not adequately counterbalanced by rejections, which remain largely unpublished. This however occurs for an entirely legitimate reason: there is little interest in rejecting an already unlikely hypothesis. Interesting research is what runs counter to prior consensus. Starting from a low BO, any confirmative evidence is interesting, even when it is not powerful enough to turn a low BO into a high PO. Making an unlikely hypothesis a bit less unlikely is interesting enough, and is worth publication. But – here Ioannidis is right – it should not be confused with acceptance. Likewise, there is little interest in confirmative evidence when BO is already high. What is interesting in this case is disconfirmative evidence, again even when it is not powerful enough to reject the hypothesis by turning a high BO into a low PO. Making a likely hypothesis a bit less likely is interesting enough. But it should not be confused with rejection.

Whether we accept or reject a hypothesis, i.e. decide whether a claim is true or false, depends on all three elements.

Posterior Odds. The minimum standard of proof required to accept a hypothesis is PO>1 (i.e. PP>50%). We call it Preponderance of evidence. But, depending on the circumstances, this may not be enough. We have seen two other cases: Clear and convincing evidence: PO>3 (i.e. PP>75%), and Evidence beyond reasonable doubt: PO>19 (PP>95%), to which we can add Evidence beyond the shadow of a doubt: PO>99 (PP>99%) or even PO>999 (PP>99.9%). The spectrum is continuous, from 1 to infinity, where Certainty (PP=100%) is unattainable and is therefore a decision. The same is symmetrically true for rejecting the hypothesis, from PO<1 to the other side of Certainty: PO=PP=0.
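The PO and PP figures above are two expressions of the same thresholds, since PP=PO/(1+PO); a quick check:

```python
def pp_of(po):
    # posterior probability implied by posterior odds
    return po / (1 + po)

for po in (1, 3, 19, 99, 999):
    print(po, round(pp_of(po), 3))   # 0.5, 0.75, 0.95, 0.99, 0.999
```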

Base Odds. To reach the required odds we have to start somewhere. A common starting point is Prior indifference, or Perfect ignorance: BO=1 (BR=50%). But, depending on the circumstances, this may not be a good starting place. With BO=1 it looks like Base Odds have disappeared, but they haven’t: they are just being ignored – which is never a good start. Like PO, Base Odds are on a continuous spectrum between the two boundaries of Faith: BR=100% and BR=BO=0. Depending on BO, we need more or less evidence in order to achieve our required PO.

Likelihood Ratio. Evidence is confirmative if LR>1, i.e. TPR>FPR, and disconfirmative if LR<1, i.e. TPR<FPR. The sizes of TPR and FPR are not relevant per se – what matters is their ratio. A high TPR means nothing without a correspondingly low FPR. Ignoring this leads to the Confirmation Bias. Likewise, a low FPR means nothing without a correspondingly high TPR. Ignoring this leads to Fisher’s Bias.

To test a hypothesis, we start with a BO level that best reflects our priors and set our required standard of proof PO. The ratio of PO to BO determines the required LR: the strength or weight of the evidence we demand to accept the hypothesis. In our tea-tasting story, for example, we have BO=1 (BR=50%) and PO>19 (PP>95%), giving LR>19: in order to accept the hypothesis that the lady has some tea-tasting ability, we require evidence that is at least 19 times more likely if the hypothesis is true than if the hypothesis is false. A test is designed to calculate FPR: the probability that the evidence is a product of chance. This requires defining a random variable and assigning to it a probability distribution. Our example is an instance of what is known as Fisher’s exact test, where the random variable is the number of successes over the number of trials without replacement, as described by the hypergeometric distribution. Remember that with 8 trials the probability of a perfect choice under the null hypothesis of no ability is 1/70, the probability of 3 successes and 1 failure is 16/70, and so on. Hence, in case of a perfect choice we accept the hypothesis that the lady has some ability if TPR>19∙(1/70)=27% – a very reasonable requirement. But with 3 successes and 1 failure we would require an impossible TPR>19∙(17/70). On the other hand, if we lower our required PO to 3 (PP>75%), then all we need is TPR>3∙(17/70)=73% – a high but feasible requirement. But if we lower our BO to a more sceptical level, e.g. BO=1/3 (BR=25%), then TPR>3∙3∙(17/70) is again too high, whereas a perfect choice may still be acceptable evidence of some ability, even with the higher PO: TPR>3∙19∙(1/70)=81%.
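The required-power arithmetic in this paragraph can be sketched in a few lines (the function name is mine; the fractions are the tea-test probabilities quoted above):

```python
def required_tpr(po, bo, fpr):
    """Minimum power to accept: PO = BO * TPR/FPR implies
    TPR > (PO / BO) * FPR. Values above 1 mean acceptance is impossible."""
    return (po / bo) * fpr

# Prior indifference (BO=1), beyond reasonable doubt (PO>19):
print(round(required_tpr(19, 1, 1/70), 2))     # 0.27: perfect choice, feasible
print(round(required_tpr(19, 1, 17/70), 2))    # 4.61: one mistake, impossible
print(round(required_tpr(3, 1, 17/70), 2))     # 0.73: feasible with PO>3
# Sceptical prior (BO=1/3):
print(round(required_tpr(3, 1/3, 17/70), 2))   # 2.19: impossible again
print(round(required_tpr(19, 1/3, 1/70), 2))   # 0.81: perfect choice, feasible
```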

So there are four variables: PP, BR, FPR and TPR. Of these, PP is set by our required standard of proof, BR by our prior beliefs and FPR by the probability distribution of the relevant random variable. These three combined give us the minimum level of TPR required to accept the hypothesis of interest. TPR – the probability of the evidence in case the hypothesis is true – is also known as statistical power, or sensitivity. Our question is: given our starting priors and required standard of proof, and given the probability that the evidence is a chance event, how powerful should the evidence be for us to decide that it is not a chance event but a sign that the hypothesis of interest is true?

Clearly, the lower is FPR, the more inclined we are to accept. As we know, Fisher would do it if FPR<5% – awkwardly preferring to declare himself able to disprove the null hypothesis at such a significance level. That was enough for him: he apparently took no notice of the other three variables. But, as we have seen, what he might have been doing was implicitly assuming error symmetry, prior indifference and setting PP beyond reasonable doubt, thus requiring TPR>95%, i.e. LR>19. Or, more likely, at least in the tea test, he was starting from a more sceptical prior (e.g. 25%), while at the same time lowering his standard of proof to e.g. 75%, which at FPR=5% requires TPR>45%, i.e. LR>9, or perhaps to 85%, which requires TPR>85%, i.e. LR>17.

There are many combinations of the four variables that are consistent with the acceptance of the hypothesis of interest. To see it graphically, let’s fix FPR: imagine we just ran a test and calculated that the probability that the resulting evidence is the product of chance is 5%. Do we accept the hypothesis? Yes, says Fisher. But we say: it depends on our priors and standard of proof. Here is the picture:

For each BR, TPR is an increasing, convex function of PP. For example, with prior indifference (BR=50%) and a minimum standard of proof (PP>50%) all we need to accept the hypothesis is TPR>5% (i.e. LR>1): the hypothesis is more likely to be true than false. But with a higher standard, e.g. PP>75%, we require TPR>15% (LR>3), and with PP>95% we need TPR>95% (LR>19). The requirement gets steeper with a sceptical prior. For instance, halving BR to 25% we need TPR>15% for a minimum standard and TPR>45% for PP>75%. But PP>95% would require TPR>1: evidence is not powerful enough for us to accept the hypothesis beyond reasonable doubt. For each BR, the maximum standard of proof that keeps TPR below 1 is BO/(BO+FPR). Under prior indifference, that is 95% (95.24% to be precise: PO=20), but with BR=25% it is 87%. The flat roof area in the figure indicates the combinations of priors and standards of proof which are incompatible with accepting the hypothesis at the 5% FPR level.
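The roof’s edge – the maximum standard of proof that keeps the required TPR below 1 – can be checked directly (helper name mine):

```python
def max_pp(bo, fpr):
    """Highest attainable PP: even with TPR = 1 the posterior odds are
    capped at BO / FPR, so PP cannot exceed BO / (BO + FPR)."""
    return bo / (bo + fpr)

print(round(max_pp(1.0, 0.05), 4))   # 0.9524: prior indifference (PO=20)
print(round(max_pp(1/3, 0.05), 2))   # 0.87: sceptical 25% prior
```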

If TPR is on or above the curved surface, we accept the hypothesis. But, unlike Fisher, if it is below we reject it: despite a 5% FPR, the evidence is not powerful enough for our priors and standard of proof. Remember we don’t need to calculate TPR precisely. If, as in the tea-tasting story, the hypothesis of interest is vague – the lady has some unspecified ability – it might not be possible. But what we can do is to assess whether TPR is above or below the required level. If we are prior indifferent and want to be certain beyond reasonable doubt that the hypothesis is true, we need TPR>95%. But if we are happy with a lower 75% standard then all we need is TPR>15%. If on the other hand we have a sceptical 25% prior, there is no way we can be certain beyond reasonable doubt, while with a 75% standard we require TPR>45%.

It makes no sense to talk about significance, acceptance and rejection without first specifying priors and standards of proof. In particular, very low priors and very high standards land us on the flat roof corner of the surface, making it impossible for us to accept the hypothesis. This may be just fine – there are many hypotheses that I am too sceptical and too demanding to be able to accept. At the same time, however, I want to keep an open mind. But doing so does not mean reneging on my scepticism or compromising my standards of proof. It means looking for more compelling evidence with a lower FPR. That’s what we did when the lady made one mistake with 8 cups. We extended the trial to 12 cups and, under prior indifference and with no additional mistakes, we accepted her ability beyond reasonable doubt. Whereas, starting with sceptical priors, acceptance required lowering the standard of proof to 75% or extending the trial to 14 cups.

To reduce the size of the roof we need evidence that is less likely to be a chance event. For instance, FPR=1% shrinks the roof to a small corner, where even a sceptical 25% prior allows acceptance up to PP=97%. In the limit, as FPR tends to zero – there is no chance that the evidence is a random event – we have to accept. Think of the lady gulping 100 cups in a row and spotlessly sorting them into the two camps: even the most sceptical statistician would, without a shadow of a doubt, regard this as conclusive positive evidence (a Smoking Gun). On the other hand, coarser evidence, more probably compatible with a chance event, enlarges the roof, thus making acceptance harder. With FPR=10%, for instance, the maximum standard of proof under prior indifference is 91%, meaning that even the most powerful evidence (TPR=1) is not enough for us to accept the hypothesis beyond reasonable doubt. And with a sceptical 25% prior the limit is 77%, barely above the ‘clear and convincing’ level. While harder, however, acceptance is not ruled out, as it would be with Fisher’s 5% criterion. According to Fisher, a 1 in 10 probability that the evidence is the product of chance is just too high for comfort, making him unable to reject the null hypothesis. But what if TPR is very high, say 90%? In that case, LR=9: the evidence is 9 times more likely if the hypothesis is true than if it is false and, under prior indifference, PP=90%. Sure, we can’t be certain beyond reasonable doubt that the hypothesis is true, but in many circumstances it would be eminently reasonable to accept it.
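
Again, the figures in this paragraph are easy to verify with a few lines of Python (a sketch, with my own function names):

```python
# Posterior probability from TPR, FPR and base rate, via PO = LR * BO;
# max_standard gives the ceiling on the standard of proof for a given FPR.

def posterior(tpr, fpr, br):
    """Posterior probability that the hypothesis is true."""
    po = (tpr / fpr) * (br / (1 - br))
    return po / (1 + po)

def max_standard(fpr, br):
    """Highest attainable standard of proof (posterior when TPR = 1)."""
    bo = br / (1 - br)
    return bo / (bo + fpr)

print(max_standard(0.01, 0.25))     # ≈0.97: small roof even for a sceptic
print(max_standard(0.10, 0.50))     # ≈0.91: prior indifference, FPR=10%
print(max_standard(0.10, 0.25))     # ≈0.77: sceptical prior, FPR=10%
print(posterior(0.90, 0.10, 0.50))  # ≈0.90: LR=9 under prior indifference
```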

Sorry we’re very late, but anyway, HAPPY NEW YEAR.

One reason is that once again the video was blocked on a copyright issue. Sony Music, owner of Uptown Funk, had no problems: their policy is to monetise the song. But Universal Music Group, owner of Here Comes the Sun, just blocks all Beatles songs – crazy. I am disputing their claim – it is just the first 56 seconds and we’re singing along, for funk sake! They will give me a response by 17 February.

Talking about Funk, Bruno Mars says the word 29 times. Lorenzo assures me there is an n in each one of them, but I have my doubts. (By the way, check out Musixmatch, a great Italian company).

I am putting the video online against my brother’s advice – “It’s bellissimo for the family, but what do others care?” I say fuggettaboutit, it’s funk.

Update. I got the response: “Your dispute wasn’t approved. The claimant has reviewed their claim and has confirmed it was valid. You may be able to appeal this decision, but if the claimant disagrees with your appeal, you could end up with a strike on your account.” Silly.

So I put the video on Vimeo. No problem there. Crazy.

This may be obvious to some of you, so I apologise. But it wasn’t to me, and since it has been such a great discovery I’d like to offer it as my Boxing Day present.

If you are my age, or even a bit younger, chances are that you have a decent/good/awesome hifi stereo system, with big, booming speakers and all the rest of it. Mine is almost 20 years old, and it works great. But until recently I’d been using it just to listen to my CDs. All the iTunes and iPhone stuff – including dozens of CDs that I imported into it – was, in my mind, separate: I would only listen to it through the PC speakers. Good, but not great. I wished there was a way to play it on the hifi. Of course, I could have bought better PC speakers, or a new whiz-bang fully-connected stereo system. But I couldn’t be bothered – I just wanted to use my old one.

The PC and the hifi are in different rooms, so connecting them doesn’t work, even wirelessly. But after some thinking and research I found a perfect solution. So here it is.

All I needed was this:

It’s a tiny 5×5 cm Bluetooth Audio Adapter, made by Logitech. You can buy it on the Logitech website or on Amazon, currently at £26. All you need to do is connect it to the stereo and pair your iPhone or iPad (or equivalent) to it.

To complete the setup, I also bought this:

It is a 12×10 cm 1-Output 2-Port Audio Switch, also available on Amazon, currently at £7. Here is the back:

So you connect:

The Audio Adapter to the first Input of the switch (the Adapter comes with its own RCA cable).

The Output of your system’s CD player to the second Input of the switch (you need another RCA cable).

The CD Input of your system’s amplifier to the Output of the switch (you need a third RCA cable).

And you’re done! Now if you want to listen to your iPhone stuff you press button 1 on the front of the switch. If you want to listen to your CDs you press button 2.

And of course – but this you should already know – you can also listen to the tons of music available on any of the streaming apps: Amazon Music, Google Play Music, Groove, Spotify, Deezer etc.

There you go. You’ve just saved hundreds or thousands of your favourite currency that an upgrade would have cost you. And you’re using your good old system to the full.

That morning I had an early meeting and left in a rush. So, on the way back home, I stopped by M&S Foodhall – I had to get some milk, the children were coming over – and got myself a freshly baked pain au chocolat. I got two, just in case. At home, I ate one in the kitchen, with a nice espresso. I didn’t need the second one, so I left it in its paper bag, up on the shelf.

The children arrived in the afternoon. They had their homework to do, I had mine. A couple of hours later, I went to the kitchen to make a cup of tea. There, on the table, I saw a blue plate, with some flaky crumbs on it.

There you go again. “Mau!”

Maurits, my younger son, has a bit of a sweet tooth. Nothing major – none of us cares much about sugar. But while Lorenzo, my elder, is just like me, Mau likes the occasional candy – Mentos are his favourite. At around teatime, they usually take a break and fix themselves a slice of bread with Nutella. But this time Mau had also, evidently, discovered and gobbled up the second pain au chocolat.

“Mau, can you come here for a sec?” He was doing his maths. “Who did this?” I asked him with a stern smile.

“Not me! I swear!”

He had done it before. Of course, I didn’t mind him eating the pastry. And I didn’t even mind that he hadn’t asked me first, sweet boy. But I wanted to check on his disposition to own up to his actions.

“Mau, please tell me the truth. It’s fine you ate it. But just tell me. Did you?”

“No dad, I didn’t!” His initial smirk had gone. He looked down and started sobbing, quietly at first, but with a menacing crescendo. It was his tactic – he had even told me once, in a wonderful moment of intimate connivance.

“I know you, Mau. Don’t do this. It’s ok, you understand? I just want you to tell me the truth.”

Other times he would have given up, especially when, like now, it was just the two of us – Lorenzo was in the living room. But not this time. His sob turned into a louder cry:

“I didn’t! I swear I didn’t!”

Big tears started streaming down his cheeky cheeks – that’s what we call him sometimes, Lori and I. Lorenzo came over to check what was going on. As soon as I told him, he started defending his brother:

“Dad, if he says he didn’t do it he didn’t, ok?”

Good, I thought. That didn’t happen very often. Usually, in similar circumstances, they would snitch and bicker. But this time, evidently moved by his brother’s distress, Lorenzo was with him.

“You stay out of it. Go back to work”, I said.

I took Mau to the bedroom and closed the door. I hugged him and wiped his tears with kisses. It was a great opportunity to teach him a fundamental lesson about honesty, truth and trust. I told him how important it was for me, and how important for my children to understand it. He had heard it before.

“Dad, I did not do it.” He was still upset, but had stopped crying. Great, I thought, he’s about to surrender. One more go and we’re done – I couldn’t wait. So I gave him another big hug and whispered a few more words about sincerity and love.

“Ok dad, I admit it. I did it” he whispered back, with a sombre sigh.

“Great, Mau! I’m so proud of you, well done!” I felt a big smile exploding on my face. “You see how good it is when you tell the truth? You should never, never lie to me again, ok?”

He nodded, and we hugged one more time. Then I lifted him, threw him on the bed and we started fighting, bouncing, tickling and laughing, as we always do. Lorenzo joined in.

A couple of hours later, it was time for dinner. I went back to the kitchen and started preparing. I cleared the table – the blue plate and everything else – and asked the children what they wanted to eat. “Mau’s turn to decide!”

But then, as I looked up to get the plates, my blood froze. There, on the shelf where I had left it in the morning, was the paper bag and, inside it, the second pain au chocolat.

“Mau!”

He came over, with a not-again look on his face. I grabbed him and kissed him all over. “I’m sorry sorry sorry sorry!” I explained to him what had happened. I had used the blue plate in the morning and left it there. He had nothing to do with it.

“I told you, dad.”

“Yes, but why did you then admit you did it?” I replied, choking on my words as soon as I realised that the last thing I could do was to blame him again.

“I know dad, but you were going on and on. I had to do it.”

Tempus fugit. It has been two years since I wrote about Ferrexpo, and more than three since I presented it at the valueconferences.com’s European Investment Summit. (By the way, Meyer Burger’s pearl status turned out to be short-lived. Barratt Developments’ stock price, on the other hand, proceeded to grow 40% in 2014 and another 40% in 2015, to a peak of 6.6 pounds. This year it fell like a stone after Brexit, to as low as 3.3, but it is now back at 4.8 – market efficiency for you).

My point two years ago was to show how wrong I had been on Ferrexpo and to try to explain – first to myself – why. My main argument was that I had underestimated the negative consequences of the big three iron ore producers’ decision to continue to expand their production capacity. This had caused a steep drop in the iron ore price from 130 dollars per ton to 70, which the market was regarding as permanent, thus taking all iron ore producers – the big three as well as smaller companies like Ferrexpo – to very low valuations. My decision to stick with the investment was predicated on the assumption that the iron ore price drop would turn out to be cyclical rather than permanent, and that valuations based on a permanently low price would eventually be proven wrong.

Here is what happened since then:

Ferrexpo ended 2014 at 0.5. But then, in the first half of 2015, it started to look as if I was right: the stock price climbed back 60% to 0.8. Alas, not for long. In the second half, the iron ore price resumed its fall, and by the end of the year it was as low as 41 dollars, taking Ferrexpo’s price all the way down to 0.2.

In the meantime, the company had been chugging along, reporting lower but overall solid results. Its profile was as low as ever, and the message invariably the same: we are doing the best we can in the current adverse circumstances, but there is nothing we can do other than sit tight and weather the storm. Brokerage analysts took notice and kept pummelling the stock.

To my increasing frustration – there were things to do: share buybacks, dividend suspension, divestments, insider buying, strategic alliances – a stone-faced IR would, each time I saw her, give me the same ‘don’t look at me, I just work here’ message, the CFO kept sighing, and the CEO would not even meet me.

The attitude could not have been more different at Cliffs Natural Resources, the largest US iron ore producer. Here, after a fierce shareholder and management clash, a new CEO had been appointed in August 2014. Lourenço Gonçalves, a burly, ebullient Brazilian with a long experience in the mining industry, was as far as you could get from the amorphous Ferrexpo people. I had taken a position in Cliffs in 2013 and added to it in 2014.

Gonçalves’ first public foray, the Q3 2014 earnings call in October, was a stunner. The stock price was at 9 dollars, down from 17 in August. But Gonçalves was adamant: he was going to turn the company around, restore profitability and reverse the decline. The old management – he said – had been driving the company into the ground by pursuing the same aggressive, China-focused expansion and acquisition strategy that the big three producers had adopted. He warned that the strategy was reckless and self-destructive, and that the big three’s shareholders would eventually see the light and throw out the management responsible for it. His plan was to concentrate on the domestic market and sell everything else. He said brokerage analysts – who were almost unanimously sceptical about his plan and were ‘predicting’ more iron price weakness – were all wrong, and he was determined to prove them wrong. The company was buying back its stock and he and other company insiders were buying too on their personal account.

What a refreshing contrast to Ferrexpo’s lethargy! Here is a most memorable exchange with one of the analysts in the Q&A. And here is Gonçalves’ presentation at a Steel Industry conference in June 2015, where he effectively and presciently articulated his analysis and outlook:

The stock had gone up 22% on the day of the Q3 earnings call. But it had soon resumed its downward trend, ending 2014 at 7 dollars. On the day of the June ’15 conference it had reached 5, and a month later it was down to the beaten-up analyst’s 4 dollar valuation target. Lourenço would have none of it – but being in the midst of all this drama was no fun.

A value investor is used to enduring the pain when things go the wrong way – but there is always a limit, beyond which endurance becomes stubbornness, valiance turns into delusion and composure into negligence. I had bought Ferrexpo in 2012 and added to it in 2013, when I also bought Cliffs, to which I added in 2014. I just couldn’t do more. All along I had kept doing the Blackstonian right thing – trying to prove myself wrong. But there was no way: I agreed with Gonçalves one hundred percent. Giving up was surely the wrong thing to do.

But there is one thing I did in 2015: I moved out of Ferrexpo and concentrated entirely on Cliffs. Realising a loss is never a pleasant experience, but resisting it when there is a better alternative is a mistake. The loss is there, whether one realises it or not – and selling creates a tax asset. I was not giving up the fight, but I reckoned that, if I had to keep fighting, I might as well seek comfort in Cliffs’ enthusiasm rather than remain mired in Ferrexpo’s apathy. If I was ever going to be proven right, it was more likely – and more fun, though admittedly the company’s name didn’t help – to be on Cliffs’ side:

We are squeezing the freaking shorts and we are going to take them out one by one: I’m going to make them sell everything to get out of my way and they are going to lose their pants. How am I going to get there? The toolbox is full. (Gonçalves, Cliffs Q1 Earnings call, 28 April 2015).

Fun, but tough. The shorts kept winning, and winning big, for the rest of the year, which Cliffs ended all the way down to 1.6 dollars. But, in the meantime, things had started to change. Exactly as Gonçalves had predicted, the big three’s shareholders lost their patience and forced a radical change in management strategy. The first to go was the Head of Iron Ore at BHP Billiton, who had been the most vocal advocate of the supposed virtues of lower iron ore prices – BHP’s stock price halved in 2014-15 – soon followed by Rio Tinto’s CEO and its own Head of Iron Ore – Rio’s price went down 40% in the same period. The new management put a halt to the misconceived policy of big actual and planned production expansion, which by the end of the year had driven the iron ore price down to 40 dollars. As a result, lo and behold, at the beginning of this year – two full years later than my Ferrexpo post – the cyclical upturn in the iron ore price began in earnest. And, unlike the year before, it didn’t stop: the price has now doubled to 80 dollars.

Right, at last. And freaking shorts on the run, as Cliffs’ stock price blasted to a peak of 10.5 dollars earlier this month – five and a half times where it ended last year. Thank you, Lourenço, great job. So here is the graph I showed in July at this year’s Italian Value Investing Seminar, where, alongside their stock picks, speakers are asked to talk about a past mistake.

My big mistake was – what else? – Ferrexpo, one of the three stocks I had presented at the 2014 Seminar (the others were two Italian stocks, Cembre and B&C Speakers, which went up 39% and 34% in 2015). Ferrexpo had by then gone up a mere 50% since year-end, from 0.2 to 0.3, while Cliffs was already at 6. It was the silver lining of a bad story, and a vindication of my 2015 move – enhanced, ironically, by the post-Brexit dollar appreciation versus the pound.

Little did I expect that, right after the July presentation, Ferrexpo’s own price would start a mighty catch-up, bringing it to a peak, earlier this month, of 1.5 pounds – basically in line with Cliffs, which is still a bit ahead from my pound perspective only because of the currency move:

There might be a perverse correlation between Ferrexpo’s stock price and my talking about it in Trani – but I doubt I will collect a third data point to test the hypothesis. The move had nothing to do with the company, which this year continued to be its dreary self. And very little to do with a change in brokerage analysts’ views – the Eeek analyst, bless him, still has a 0.26 price target. As is clear from the first graph, it has everything to do with the iron ore price correlation.

Be that as it may, I am very happy for the few friends who followed me on this roller coaster. This has not been a successful investment (so far!) – and no, I did not increase my Cliffs position at the end of last year. So I hesitate to categorize this post under ‘Intelligent Investing’. But then ‘intelligent’ doesn’t always mean ‘successful’. Investing implies risk and must be judged on the intelligence of ex ante reasoning rather than the success of ex post outcomes. At the same time, however, it is wrong to indulge in self-forgiveness. We should learn from our mistakes: there is no right way to do the wrong thing. So, what did I get wrong?

When I first bought Ferrexpo at 1.7, back in September 2012, I saw it rapidly appreciate to 2.7 in February 2013. It was at that time that the first big downward revisions in iron ore price estimates started to surface, first from BREE (the Australian Bureau of Resources and Energy Economics), then from Rio Tinto’s chief economist, then from just about everybody else, until they became consensus. As a result, in less than a month Ferrexpo’s stock price was back to where I had bought it. There I made my big mistake. The lower price forecasts were based on slowing Chinese demand and increased supply from the big three producers. But I dismissed the second point, on the assumption that the big three would not be so irrational. Therein lies the big lesson: it is wrong to assume that the turn of events will necessarily follow its most rational course – we’ve had a flood of evidence on this from this year’s major political developments!

Once it became clear, by 2013, that the big three were set on their expansion strategy and were not going to change their mind any time soon, I should have changed my mind and got out of Ferrexpo at close to par. I didn’t: neglecting my own precepts, I failed to acknowledge the multiplicative nature of evidence accumulation, whereby even a single piece of strong disconfirmative evidence can be enough to overwhelm any amount of confirmative evidence on the other side of the hypothesis. Confirmation Bias, right there in my own face! Going public with my Ferrexpo pick – first at valueconferences.com in October ’13 and then at the Trani Seminar the following summer – did not help. So I have promised myself not to fall into that trap again – not by avoiding public picks, which are a useful and worthy exercise, but by refusing to feel bound by them (so far so good with my subsequent picks).

I am still invested in Cliffs. I was right in expecting the iron ore price upturn. But iron ore futures are still predicting a drop to 50-60 dollars in the next couple of years. China is likely to slow down steel production and the US is likely to limit steel imports – notice the big jump in Cliffs’ stock price after Trump’s election: one more irony in this mind-blowing saga. The pellet premium – both Cliffs and Ferrexpo produce energy-efficient iron ore pellets, rather than lump and fines – should stay up, given China’s pressing need to combat air pollution.

Ronald Fisher was sceptical about the lady’s tea tasting prowess. But he was prepared to change his mind. If the lady could correctly identify 8 cups, he was willing to acknowledge her ability, or – in his unnecessarily convoluted but equivalent wording – to admit that her ability could not be disproved. He would have done the same with one mistake in 12 cups. In his mind, such a procedure had nothing to do with scepticism. He would not let his subjective beliefs taint his conclusions: only data should decide.

This is a misconception. Data can do nothing by themselves. We interpret data into explanations and decide to accept or reject hypotheses, based on our standard of proof (PO), evaluation of evidence (LR) and prior beliefs (BO).

There is hardly a more appropriate illustration of this point than Fisher’s astonishing smoking blunder.

By the time of his retirement in 1957, Sir Ronald Aylmer Fisher (he had been knighted by the newly crowned Elizabeth II in 1952) was a world-renowned luminary in statistics and biology. In the same year, he wrote a letter to the British Medical Journal, in response to an article that had appeared there a month earlier on the ‘Dangers of Cigarette-Smoking’. Fisher’s letter was headed ‘Alleged Dangers of Cigarette-Smoking’:

Your annotation on “Dangers of Cigarette-smoking” leads up to the demand that these hazards “must be brought home to the public by all the modern devices of publicity”. That is just what some of us with research interests are afraid of. In recent wars, for example, we have seen how unscrupulously the “modern devices of publicity” are liable to be used under the impulsion of fear; and surely the “yellow peril” of modern times is not the mild and soothing weed but the original creation of states of frantic alarm.

A common “device” is to point to a real cause for alarm, such as the increased incidence of lung cancer, and to ascribe it in urgent terms to what is possibly an entirely imaginary cause. Another, also illustrated in your annotation, is to ignore the extent to which the claims in question have aroused rational scepticism.

Amazing. Fisher’s scepticism at work again. But this time he would not change his mind. To our eyes, this is utterly dumbfounding. Today we know: the evidence accumulated by epidemiological research in establishing tobacco smoking as a major cause of multiple diseases is overwhelming, conclusive and undisputed. But it took a while.

This is from 1931. This, featuring a well-known actor, is from 1949:

Public attitudes to smoking have changed a great deal over the last hundred years – especially in the last few decades. I remember – to my horror – having to bear with my colleagues’ “right” to smoke next to me in the open-plan office I was working in in the ’90s. And what about smoking seats in the back of airplanes? (By the way, why do some airlines still remind passengers, before departure, that ‘This is a non-smoking flight’?).

The adverse health effects of smoking were not as clear in the ’50s as they are today. But there was already plenty of evidence. Granted, the earlier studies had an unsavoury source: Adolf Hitler – a heavy smoker in his youth – had later turned into an anti-tobacco fanatic and initiated the first public anti-smoking campaign in Nazi Germany – that’s where the term ‘passive smoking’ (Passivrauchen) comes from. The link between tobacco and lung cancer was first identified by German doctors. But by 1956 it had been well established by the British Doctors’ Study, building on the pioneering research of Doll and Hill.

How could Fisher – a world authority in the evaluation of statistical evidence! – be so misguided? Let’s see. As a start, he was sceptical about the harmfulness of smoking. Did being a lifelong cigarette and pipe smoker have anything to do with it? Of course. But not – of course – if you asked him: he was prior indifferent – ready to let the data decide. His aggressive, confrontational character – he was apparently intolerant of contradiction and hated being wrong on any subject – did not help either. Besides, he was being paid some – probably small – fee by the tobacco industry. None of this, however, is very important. It is fair to assume that, as in the tea tasting experiment, Fisher was open to letting evidence overcome his scepticism. The crucial point is that his scepticism (and stubbornness), combined with a suitably high standard of proof, meant that the amount of evidence he required in order to change his mind was likely to be particularly large.

Here is where Fisher failed: answering Popper’s question. The more sceptical we are about the truth of a hypothesis, the more we should try to verify it. But he didn’t: rather than taking stock of all the evidence that had already been accumulated on the harmful effects of smoking, Fisher largely ignored it, focusing instead on implausible alternatives – What if lung cancer causes smoking? What if there is a genetic predisposition to smoking and lung cancer? – with no indication of how to explore them. And he nitpicked on minor issues with the data, blowing them out of proportion in order to question the whole construct.

The parallel with today’s global warming debate is evident, indicative and disturbing. It is what happens with all the weird beliefs we have been examining in this blog. The multiplicative nature of evidence accumulation is such that there are always two main ways to undermine a hypothesis: ignore or neglect confirmative evidence and look for conclusive counterevidence – dreaming it up or fabricating it when necessary.

What evidence do you require that would be enough to change your mind? Popper’s question invokes searching for disconfirmative evidence when we are quite convinced that a hypothesis is true. But if, on the contrary, we believe that the hypothesis is likely to be false, what is called for is a search for confirmative evidence – as much as we need to potentially overturn our initial conviction. Popper himself failed to draw this distinction, with his unqualified emphasis on falsifiability. And Fisher followed suit, as reflected in his null hypothesis flip and his verbal contortions to avoid saying ‘prove’, ‘accept’ or ‘verify’. The result was his embarrassing counter-anti-smoking tirade – definitely not the best end to a shining academic career.

PO=LR∙BO. Get any of these wrong and you are in for weird beliefs, embarrassing mistakes and wrong decisions. It can happen to anyone.

Fisher’s 5% significance criterion can be derived from Bayes’ Theorem under error symmetry and prior indifference, with a 95% standard of proof.

Let’s now ask: What happens if we change any of these conditions?

We have already seen the effect of relaxing the standard of proof. While, with a 95% standard, TPR needs to be at least 19 times FPR, with a 75% standard 3 times will suffice. Obviously, the lower our standard of proof the more tolerant we are towards accepting the hypothesis of interest.

Error symmetry is an incidental, non-necessary condition. What matters for significance is that TPR is at least rPO times FPR. So, with rPO=19 it just happens that rTPR=19∙5%=95%=PP. But, for example, with FPR=4% any TPR above 76% – i.e. any FNR below 24% – would do (not, however, with FPR=6%, which would require TPR>1).
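
The rule just described can be sketched in a couple of lines of Python (my own naming, not standard notation):

```python
# Under prior indifference PO = LR = TPR/FPR, so accepting at standard pp
# requires TPR >= rPO * FPR, with rPO = pp/(1-pp).

def min_tpr(fpr, pp):
    """Minimum TPR for significance under prior indifference."""
    return (pp / (1 - pp)) * fpr

print(min_tpr(0.05, 0.95))  # ≈0.95: error symmetry, a coincidence of FPR=5%
print(min_tpr(0.04, 0.95))  # ≈0.76: any TPR above 76% will do
print(min_tpr(0.06, 0.95))  # ≈1.14: above 1, acceptance impossible
print(min_tpr(0.05, 0.75))  # ≈0.15: a 75% standard only needs LR > 3
```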

Prior indifference, on the other hand, is crucially important. Remember we assumed it at the start of our tea-tasting story, when we gave the lady a 50/50 initial chance that she might be right: BR=50%. Hence BO=1 and Bayes’ Theorem in odds form simplifies to PO=LR. Now it is time to ask: does prior indifference make sense?

Fisher didn’t think so. According to his daughter’s biography, his initial reaction to the lady’s claim that she could spot the difference between the two kinds of tea was: ‘Nonsense. Surely it makes no difference’. Such scepticism would call for a lower BO – not as low as zero, in deference to Cromwell’s rule, but clearly much lower than 1. But Fisher would have adamantly opposed it. For him there was no other value for BO but 1. This was not out of polite indulgence towards the lady, but because of his lifelong credo: ‘I shall not assume the truth of Bayes’ axiom’ (The Design of Experiments, p. 6).

Such a stalwart stance was based on ‘three considerations’ (pp. 6-7):

1. Probability is ‘an objective quantity measured by observable frequencies’. As such, it cannot be used for ‘measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes.’

This is the seemingly interminable and incredibly vacuous dispute between the objective and the subjective interpretations of probability. It is the same reason why Fisher ignored TPR: if it cannot be quantified, one might as well omit it. He was plainly wrong. As Albert Einstein did not say: ‘Not everything that can be counted counts, and not everything that counts can be counted.’ Science is the interpretation of evidence arranged into explanations. It relies on hard as well as soft evidence. Hard evidence is the result of a controlled, replicable experiment, generating objective, measurable probabilities grounded on empirical frequencies. Soft evidence is everything else: any sign that can help the observer’s effort to evaluate whether a hypothesis is true or false. Such effort arises from a primal need that long predates any theory of probability. The interpretation of soft evidence is inherently subjective. But, contrary to Fisher’s view, there is nothing unscientific about it: subjective probability can be laid out as a complete and coherent theory.

2. ‘My second reason is that it is the nature of an axiom that its truth should be apparent to any rational mind which fully apprehends its meaning. The axiom of Bayes has certainly been fully apprehended by a good many rational minds, including that of its author, without carrying this conviction of necessary truth. This, alone, shows that it cannot be accepted as the axiomatic basis of a rigorous argument.’

This is downright bizarre. First, Bayes’ is a theorem, not an axiom. Second, it is a straightforward consequence of two straightforward definitions. Hence, it is obviously and necessarily true. But somehow Fisher didn’t see it this way. He even believed that Reverend Bayes himself was not completely convinced about it, and that was the reason why he left his Essay unpublished. He had no evidence to support this claim – it was just a prior belief!

3. ‘My third reason is that inverse probability has been only very rarely used in the justification of conclusions from experimental facts’.

This is up there with the Decca executive's rejection of the Beatles: ‘We don’t like their sounds. Groups of guitars are on the way out’.

Fisher’s credo was embarrassingly wrong. But it doesn’t matter: whether one believes it or not, we are all Bayesian. We all have priors. Ignoring them only means that we are inadvertently assuming prior indifference: BR=50% and BO=1, with all its potentially misleading consequences. Rather than pretending they do not exist, we should try to get our priors right.

So let’s go back to Fisher’s sceptical reception of the lady’s claim. We might at first interpret his prior indifference as neutral open mindedness, expressing perfect ignorance. Maybe the lady is skilled, maybe not – we just don’t know: let’s give her a 50/50 chance and let the data decide. But hang on. Would we say the same and use the same amount of data if the claim had been much more ordinary – e.g. spotting sugar in the tea, or distinguishing between Darjeeling and Earl Grey? And what would we do, on the other hand, with a truly outlandish claim – e.g. spotting whether the tea contains one or two grains of sugar, or whether it has been prepared by a right-handed or a left-handed person? Surely, our priors would differ and we would require much more evidence to test the extraordinary claim than the ordinary one: extraordinary claims require extraordinary evidence.

Prior indifference may be appropriate for a fairly ordinary claim. But the more extraordinary the claim, the lower should be our prior belief and, therefore, the larger should be the amount of confirmative evidence required to satisfy a given standard of proof. For example, let’s share Fisher’s scepticism and halve BR to 25%, hence BO=1/3. Now, with a 95% standard of proof, we have rTPR=3∙19∙(1/70) in case of a perfect choice: it is three times as much as under prior indifference, but we can still accept the hypothesis. More so, obviously, if we relax the standard of proof to 75%. But notice that in this case one mistake gives rTPR=3∙3∙(17/70), which is higher than 1. So, while with prior indifference one mistake would still be clear and convincing evidence of some ability, starting with a sceptical prior would lead us to a rejection – coinciding again with Fisher’s conclusion. Did Fisher have an indifferent prior and a 95% standard of proof, or a sceptical prior and a 75% standard? Neither, if we asked him: he shunned priors. But in reality it must have been one or the other (and – given his scepticism – more likely the latter): we are all Bayesian. In fact, Fisher’s 5% threshold is compatible with various combinations of priors and standards of proof. For instance, BR=25% and an 85% standard give rTPR=17∙5%=85%, where again, incidentally, PP=TPR.
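These numbers are easy to check. Here is a minimal sketch in Python (the function name and the parameterisation are mine, not part of the original posts):

```python
def required_tpr(standard, fpr, base_odds=1.0):
    # PO = LR * BO must reach rPO = standard / (1 - standard),
    # so LR = TPR / FPR must reach rPO / BO, i.e. rTPR = (rPO / BO) * FPR.
    r_po = standard / (1 - standard)
    return (r_po / base_odds) * fpr

# Sceptical prior: BR = 25%, hence BO = 1/3.
print(required_tpr(0.95, 1 / 70, 1 / 3))   # 3*19*(1/70) ~ 0.81: still acceptable
print(required_tpr(0.75, 17 / 70, 1 / 3))  # 3*3*(17/70) ~ 2.19 > 1: reject
print(required_tpr(0.85, 0.05, 1 / 3))     # Fisher's 5% threshold: rTPR = 17*5% = 85%
```

The last line shows the point made above: at FPR=5%, a sceptical prior and an 85% standard of proof together require TPR=85%.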

(Note: Prior indifference and error symmetry are sufficient but not necessary conditions for PP=TPR. The necessary condition is BO=FPR/FNR).
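The note's condition follows in one line from the definitions, using PO=PP/(1-PP) and PO=(TPR/FPR)∙BO:

```latex
PP = TPR
\iff PO = \frac{TPR}{1-TPR} = \frac{TPR}{FNR}
\iff \frac{TPR}{FPR}\, BO = \frac{TPR}{FNR}
\iff BO = \frac{FPR}{FNR}.
```

Prior indifference (BO=1) together with error symmetry (FPR=FNR) is simply the special case in which the condition holds automatically.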

Finally, let’s see what happens as we gather more evidence. Remember Fisher ran the experiment with 8 cups. If the lady made no mistake, he accepted the hypothesis that she had some ability; with one or more mistakes, he rejected it. Lowering the standard of proof would tolerate one mistake. But a lower prior would again mean rejection. We can however hear the lady’s protest: Come on, that was one silly mistake – I got distracted for a moment. Give me 4 more cups and I will show you: no more mistakes. As Bayesians, we consent: we allow new evidence to change our mind.

So let’s re-run the experiment with 12 tea cups (The Design of Experiments, p. 21). Now the probability of no mistake goes down to 1/924 (as 12!/[6!(12-6)!]=924), the probability of one mistake to 36/924 (as there are 6×6 ways to choose 5 right and 1 wrong cups), and the probability of one or no mistake to 37/924. Obviously, with no mistakes on 12 cups, the lady’s ability is even more apparent. But now, under prior indifference, one mistake satisfies Fisher’s 5% criterion, as 37/924=4%, and is compatible with a 95% standard of proof, as rTPR=19∙(37/924) is lower than 1. Hence we accept the hypothesis even if the lady makes one mistake. Not so, however, if we start with a sceptical prior: for that we need a 75% standard. Or we need to extend the experiment to 14 cups (by now you know what to do).
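The combinatorics generalise to any even number of cups; a short Python sketch (the function name is mine):

```python
from math import comb

def null_probs(n_cups):
    # Probability of exactly m mistakes under the null (pure guessing),
    # with n_cups split evenly between the two preparations.
    half = n_cups // 2
    total = comb(n_cups, half)
    return [comb(half, half - m) * comb(half, m) / total for m in range(half + 1)]

p12 = null_probs(12)
print(p12[0] + p12[1])         # 37/924 ~ 4%: one or no mistake
print(19 * (p12[0] + p12[1]))  # ~0.76 < 1: acceptable at 95% under prior indifference

p14 = null_probs(14)           # the 14-cup extension: (1 + 49)/3432
print(57 * (p14[0] + p14[1]))  # ~0.83 < 1: one mistake now passes even the sceptic
```

The last line is the "by now you know what to do" exercise: with 14 cups, one mistake satisfies a 95% standard even under the sceptical BR=25% prior.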

To summarise (A=Accept, R=Reject):

                          Indifferent prior (BR=50%)    Sceptical prior (BR=25%)
                          95% standard   75% standard   95% standard   75% standard
8 cups, no mistakes       A              A              A              A
8 cups, one mistake       R              A              R              R
12 cups, one mistake      A              A              R              A

There is more to testing a hypothesis than 5% significance. The decision to accept or reject it is the result of a fine balance between standard of proof, evidence and priors: PO=LR∙BO.

‘The lady tasting tea’ is one of the most famous experiments in the history of statistics. Ronald Fisher told the story in the second chapter of The Design of Experiments, published in 1935 and considered since then the Bible of experimental design. Apparently, the lady was right: she could easily distinguish the two kinds of tea. We don’t know the details of the impromptu experiment, but on subsequent reflection Fisher agreed that ‘an event which would occur by chance only once in 70 trials is decidedly “significant”’ (p. 13). At the same time, however, he found it ‘obvious’ that ‘3 successes to 1 failure, although showing a bias, or deviation, in the right direction, could not be judged as statistically significant evidence of a real sensory discrimination’ (p. 14-15). His reason:

It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the great part of the fluctuations which chance causes have introduced into their experimental results. (p. 13).

Statistically significant at the 5% level: that’s where it all started – the most used, misused and abused criterion in the history of statistics.

Where did it come from? Let’s follow Fisher’s train of thought. Remember what we said about the Confirmation Bias: we cannot look at TPR without looking at its associated FPR. Whatever the result of an experiment, there is always a probability, however small, that it is just a product of chance. How small – asked Fisher – should that probability be for us to be comfortable that the result is not the product of chance? How small – in our framework – should FPR be? 5% – said Fisher. If FPR is lower than 5% – as it is with a perfect cup choice – we can safely conclude that the result is significant, and not a chance event. If FPR is above 5% – as it is with 3 successes and 1 failure – we cannot. That’s it – no further discussion. What about TPR? Shouldn’t we look at FPR in relation to TPR? Not according to Fisher: FPR – the probability of observing the evidence if the hypothesis is false – is all that matters. So much so that, with a bewildering flip, he pretended that the hypothesis under investigation was not that the lady could taste the difference, but its opposite: that she couldn’t. He called it the null hypothesis. After the flip, his criterion is: if the probability of the evidence, given that the null hypothesis is true – he calls it the p-value – is less than 5%, the null hypothesis is ‘disproved’ (p. 16). If it is above 5%, it is not disproved.

Why such an awkward twist? Because – said Fisher – only the probability of the evidence under the hypothesis of no ability can be calculated exactly, according to the laws of chance. Under the null hypothesis, the probability of a perfect choice is 1/70, the probability of 3 successes and 1 failure is 16/70, and so on. Whereas the probability of the evidence under the hypothesis of some ability cannot be calculated exactly, unless the ability level is specified. For instance, under perfect ability, the probability of a perfect choice is 100% and the probability of any number of mistakes is 0%. But how can we quantify the probability distribution under the hypothesis of some unspecified degree of ability – which, despite Fisher’s contortions, is the hypothesis of interest and the actual subject of the enquiry? We can’t. And if we can’t quantify it – seems to be the conclusion – we might as well SUTC it: Sweep it Under The Carpet.
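Fisher's calculation under the null of no ability can be sketched in a few lines (the helper name is mine, not Fisher's):

```python
from math import comb

ways = comb(8, 4)                              # 70 ways to pick 4 cups out of 8
p_perfect = 1 / ways                           # 1/70 ~ 1.4%
p_three_one = comb(4, 3) * comb(4, 1) / ways   # 16/70 ~ 23%

def fisher_disproves_null(p_value, alpha=0.05):
    # Fisher's criterion: the null is 'disproved' iff the p-value is below 5%
    return p_value < alpha

print(fisher_disproves_null(p_perfect))                # True: 'significant'
print(fisher_disproves_null(p_perfect + p_three_one))  # False: 17/70 is not
```

Note that the second test uses the cumulative probability of a result at least as good as 3 successes and 1 failure, which is how a p-value is conventionally computed.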

How remarkable. This is the Confirmation Bias’s mirror image. The Confirmation Bias is transfixed on TPR and forgets about FPR. Fisher’s Bias does something similar: it focuses on FPR – because it can be calculated – and disregards TPR – because it can’t. The resulting mistake is the same: they both ignore that what matters is not how high TPR is, or how low FPR is, but how they relate to each other.

To see this, let’s assume for a moment that the hypothesis under investigation is ‘perfect ability’ versus ‘no ability’ – no middle ground. In this case, as we just said, TPR=1 for a perfect choice and 0 otherwise. Hence, under prior indifference, we have PO=LR=1/FPR or PO=0. Fisher agrees:

If it were asserted that the subject would never be wrong in her judgements we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation. (p. 16).

As we have seen, with a perfect choice over 8 cups we have FPR=1/70 and therefore PO=70, i.e. PP=98.6% (remember PO=PP/(1-PP)). True, it is not conclusive evidence that the lady is infallible – as the number of cups increases, FPR tends to zero but never reaches it – but to all intents and purposes we are virtually certain that she is. Fisher may abuse her patience and feed her a few more cups, and twist his tongue saying that what he did was to disprove that the lady was just lucky. In fact, all he required to do so was FPR<5%, i.e. PO>20 and PP>95.2% – a verdict beyond reasonable doubt. On the other hand, even a single mistake provides conclusive evidence to disprove the hypothesis that the lady is infallible – in the same way that a black swan disproves the hypothesis that all swans are white: TPR=0, hence PO=0 and PP=0%, irrespective of FPR.

Let’s now ask: What happens if we replace ‘perfect ability’ with ‘some ability’? The alternative hypothesis is still ‘no ability’, so FPR stays the same. The difference is that we cannot exactly quantify TPR. But we don’t need to. All we need to do is define the level of PP required to accept the hypothesis. This gives us a required PO – let’s call it rPO – which, given FPR, implies a required level of TPR: rTPR=rPO∙FPR. Let’s say for example that the required PP is 95%. Then rPO=19 and rTPR=19∙FPR. Hence, in case of a perfect choice, rTPR=19∙(1/70). At this point all we need to ask is: are we comfortable that the probability of a perfect choice, given that the lady has some ability, is at least 19/70? Remember that the same probability under no ability is 1/70 and under perfect ability is 70/70. If the answer is yes – as it should reasonably be – we accept the hypothesis. There is no need to know the exact value of TPR, as long as we are comfortable that it exceeds rTPR. On the other hand, if the lady makes one mistake we have rTPR=19∙(17/70): the required probability of one or no mistake, given some ability, exceeds 100%. Hence we reject the hypothesis.

This coincides with Fisher’s conclusion, as 1/70 is below 5% and 17/70 is above. But what happens if we lower rPO? After all, 95% is a very high standard of proof: do we really need to be sure beyond reasonable doubt that the lady has some tea tasting ability? What if we are happy with 75%, i.e. rPO=3? In this case, rTPR=3∙(1/70) for a perfect choice – a comfortable requirement, close to no ability. But now with one mistake we have rTPR=3∙(17/70). This is about two thirds of the way between no ability (17/70) and perfect ability (70/70 – remember we need to consider the cumulative probability of one or no mistake: under perfect ability, this is 0%+100%). We may or may not feel comfortable with such a high level, but if we do then we must conclude that there is clear and convincing evidence that the lady has some ability, despite her mistake.

For illustration purposes, let’s push this argument to the limit and ask: what if we lower the standard of proof all the way down to 50%, i.e. rPO=1? In this case, all we would need in order to grant the lady some ability is a preponderance of evidence. This comfortably covers one mistake, and may even allow for two mistakes, as long as we accept rTPR=53/70 (notice there are 6×6 ways to choose 2 right and 2 wrong cups).
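The whole spectrum of standards can be laid out explicitly; a quick sketch under prior indifference (BO=1; the names are mine, and ‘A’ here means only that acceptance is feasible, since rTPR does not exceed 1):

```python
from math import comb

ways = comb(8, 4)  # 70 possible choices of 4 cups out of 8

def fpr_upto(mistakes):
    # Cumulative probability, under pure luck, of this many mistakes or fewer
    return sum(comb(4, 4 - m) * comb(4, m) for m in range(mistakes + 1)) / ways

for standard in (0.95, 0.75, 0.50):
    r_po = standard / (1 - standard)       # required posterior odds
    for mistakes in (0, 1, 2):
        r_tpr = r_po * fpr_upto(mistakes)  # rTPR = rPO * FPR under BO = 1
        print(standard, mistakes, "A" if r_tpr <= 1 else "R")
```

Running it reproduces the discussion above: 95% tolerates no mistakes, 75% tolerates one, and a bare preponderance of evidence (50%) stretches to two, at rTPR=53/70.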

This may well be too lenient. But the point is that, as soon as we explicitly relate FPR to TPR, we are able to place Fisher’s significance criterion in a proper context, where his 5% threshold is not a categorical standard but one choice within a spectrum of options. In fact, once viewed in this light, we can see where Fisher’s criterion comes from.

Fisher focused on the probability of the evidence, given the null hypothesis, stating that a probability of less than 5% was small enough to be comfortable that the evidence was ‘significant’ and not a chance event. But why did he then proceed to infer that such significant evidence disproved the null hypothesis? That is: why did he conclude that the probability of the null hypothesis, given significant evidence, was small enough to disprove it? As we know (very well by now!), the probability of E given H is not the same as the probability of H given E. Why did Fisher seem to commit the Inverse Fallacy?

To answer this question, remember that the two probabilities are the same under two conditions: symmetric evidence and prior indifference. Under error symmetry, FPR=FNR. Hence, in our framework, where the hypothesis of no ability is the alternative to the tested hypothesis of some ability, FPR=5% implies TPR=95% and therefore PO=19 and PP=TPR=95%. The result is the same in Fisher’s framework, where the two hypotheses are – unnecessarily and confusingly – flipped around, FPR becomes FNR and the null hypothesis is rejected if FNR is less than 5%.

Since Fisher could not quantify TPR, he avoided any explicit consideration of FNR=1-TPR and its relation with FPR – symmetric or otherwise. But his rejection of the null hypothesis required it: in the same way as we should avoid the Confirmation Bias – accepting a hypothesis based on a high TPR, without relating it to its associated FPR – we need to avoid Fisher’s Bias: accepting a hypothesis – or, as Fisher would have it, disproving a null hypothesis – based on a low FPR, without relating it to its associated TPR.

What level of TPR did Fisher associate with his 5% FPR threshold? We don’t know – and probably neither did he. All he said was that a p-value of less than 5% was low enough to disprove the null hypothesis. Since then, Fisher’s Bias has been a source of immeasurable confusion. Assuming symmetry, FPR<5% has been commonly taken to imply PP>95%: ‘We have tested our theory and found it significant at the 5% level: therefore, there is only a 5% probability that we are wrong.’

Fisher would have cringed at such a statement. But his emphasis on significance inadvertently encouraged it. Evidence is not significant or insignificant according to whether FPR is below or above 5%. It is confirmative or disconfirmative according to whether LR is above or below 1, i.e. TPR is above or below FPR. What matters is not the level of FPR per se, but its relation to TPR. Confirmative evidence increases the probability that the hypothesis of interest is true, and disconfirmative evidence decreases it. We accept the hypothesis of interest if we have enough evidence to move LR beyond the threshold required by our standard of proof. Only then can we call such evidence ‘significant’. So, if our threshold is 95%, then, under error symmetry and prior indifference, FPR<5% implies TPR=PP>95%. There is no fallacy: TPR – the probability of E given H – is equal to PP – the probability of H given E. And 5% significance does mean that we have enough confirmative evidence to decide that the hypothesis of interest has indeed been proven beyond reasonable doubt.

This is where Fisher’s 5% criterion comes from: Bayes’ Theorem under error symmetry and prior indifference, with a 95% standard of proof. Fisher ignored TPR, because he could not quantify it. But TPR cannot be ignored – or rather: it is there, whether one ignores it or not. Fisher’s criterion implicitly assumes that TPR is at least 95%. Without this assumption, a 5% ‘significance’ level cannot be used to accept the hypothesis of interest. Just like the Confirmation Bias consists in ignoring that a high TPR means nothing without a correspondingly low FPR, Fisher’s Bias consists in ignoring that a low FPR means nothing without a correspondingly high TPR.

The Made in Italy Fund started in May. It is up 7%, with the Italian market down 1%. It is a good time to go back to Hypothesis Testing.

We ask ourselves questions and give ourselves answers in response to thaumazein: wonder at what there is. Our questions spring from our curiosity. Our answers are grounded on evidence.

As David Hume and then Immanuel Kant made clear, all our answers are based on evidence. Everything we know consists of phenomena that we experience as evidence. Even Kant’s synthetic a priori propositions – like those of geometry and mathematics – are ultimately based on axioms that we regard as self-evidently true.

The interpretation of evidence arranged into explanations is what we call Science – knowledge that results from separating true from false. Science is based on observation – evidence that we preserve and comply with. We know that the earth rotates around the sun because we observe it, in the same way that Aristotle and Ptolemy knew that the sun rotated around the earth. We are right and they were wrong but, like us, they were observing and interpreting evidence. So were the ancient Greeks, Egyptians, Indians and Chinese when they concluded that matter consisted of four or five elements. And when the Aztecs killed children to provide the rain god Tlaloc with their tears, their horrid lunacy – widespread in ancient times – was not a fickle mania, but the result of an age-old accumulation of evidence indicating that the practice ‘worked’, and it was therefore worth preserving. So was divination – the interpretation of multifarious signs sent by gods to humans.

While we now cringe at human sacrifice and laugh at divination, it is wrong to simply dismiss them as superseded primitiveness. Since our first Why?, humankind’s only way to answer questions has been to make sense of evidence. Everything we say is some interpretation of evidence. Everything we say is science.

Contrary to Popper’s view, there is no such thing as non-science. The only possible opposition is between good science and bad science. Bad science derives from a wrongful interpretation of evidence, leading to a wrongful separation of true and false. This in turn comes from neglecting or underestimating the extent to which evidence can be deceitful. Phenomena do not come to us in full light. What there is – what we call reality – is not always as it appears. Good science derives from paying due attention to the numerous perils of misperception. Hence the need to look at evidence from all sides and to collect plenty of it, analyse it, reproduce it, probe it, disseminate it and – crucially – challenge it, i.e. look for new evidence that may conflict with and possibly refute the prevailing interpretation. This is the essence of what we call the Scientific Revolution.

Viewed in this light, the obvious misperceptions behind the belief in the effectiveness of sacrifice and divination bear an awkward resemblance to the weird beliefs examined in many of my posts. Why did Steve Jobs try to cure his cancer with herbs and psychics? Why do people buy homeopathic medicines (and Boiron is worth 1.6 billion euro)? Why do people believe useless experts? Why did Othello kill Desdemona? Why did Arthur Conan Doyle believe in ghosts? Why did 9/11 truthers believe it was a conspiracy? Why do Islamists promote suicide bombing? It is tempting to call it lunacy. But it isn’t. It is misinterpretation of evidence.

The most pervasive pitfall in examining available evidence is the Confirmation Bias: focusing on evidence that supports the hypothesis under investigation, while neglecting, dismissing or obfuscating evidence that runs contrary to it. A proper experiment, correctly gathering and analysing the relevant evidence, can easily show the ineffectiveness of homeopathic medicine – in the same way as it would show the ineffectiveness of divination and sacrifice (however tricky it would be to test the Tlaloc hypothesis).

In our framework, PO=LR∙BO, where LR=TPR/FPR is the Likelihood Ratio, the ratio between the True Positive Rate – the probability of observing the evidence if the hypothesis is true – and the False Positive Rate – the probability of observing the same evidence if the hypothesis is false. The Confirmation Bias consists in paying attention to TPR, especially when it is high, while disregarding FPR. As we know, it is a big mistake: what matters is not just how high TPR is, but how high it is relative to FPR. We say evidence is confirmative if LR>1, i.e. TPR>FPR, and disconfirmative if LR<1. LR>1 increases the probability that the hypothesis is true; LR<1 decreases it. We cannot look at TPR without at the same time looking at FPR.
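In code, the whole framework fits in two small functions; a minimal sketch (the function names are mine, not the post's):

```python
def posterior_odds(tpr, fpr, base_odds=1.0):
    # PO = LR * BO, with LR = TPR / FPR
    return (tpr / fpr) * base_odds

def posterior_prob(odds):
    # PP = PO / (1 + PO)
    return odds / (1 + odds)

# Confirmative evidence (LR > 1) raises the probability that H is true:
print(posterior_prob(posterior_odds(0.9, 0.3)))  # LR = 3: PP rises from 50% to ~75%
```

Disconfirmative evidence works symmetrically: with TPR=0.3 and FPR=0.9, LR=1/3 and PP falls from 50% to 25%.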

How high does the probability of a hypothesis have to be for us to accept that it is true? Equivalently, how low does it have to be for us to reject the hypothesis, or to accept that it is false?

As we have seen, there is no single answer: it depends on the specific standard of proof attached to the hypothesis and on the utility function of the decision maker. For instance, if the hypothesis is that a defendant is guilty of a serious crime, a jury needs a very high probability of guilt – say 95% – before convicting him. On the other hand, if the hypothesis is that an airplane passenger is carrying a gun, a small probability – say 5% – is all a security guard needs in order to give the passenger a good check. Notice that in neither case is the decision maker saying that the hypothesis is true. What he is saying is that the probability is high enough for him to act as if the hypothesis were true. Such a threshold is known as the significance level, and the accumulated evidence that allows the decision maker to surpass it is itself called significant. We say that there is significant evidence to convict the defendant if, in the light of such evidence, the probability of guilt exceeds 95%. In the same way, we say that there is significant evidence to frisk the passenger if, in the light of the available evidence, the probability that he carries a gun exceeds 5%. In practice, we call the defendant ‘guilty’ but, strictly speaking, that is not what we are saying – in the same way that we are not saying that he is ‘innocent’ or ‘not guilty’ if the probability of guilt is below 95%. Even less are we saying that the passenger is a terrorist. What matters is the decision – convict or acquit, frisk or let go.

With such proviso, let’s examine the standard case in which we want to decide whether a certain claim is true or false. For instance, a lady we are having tea with tells us that tea tastes different depending on whether milk is poured in the cup before or after the tea. She says she can easily spot the difference. How can we decide if she is telling the truth? Simple: we prepare a few cups of tea, half one way and half the other, and ask her to say which is which. Let’s say we make 8 cups, and tell her that 4 are made one way and 4 the other way. She tastes them one after the other and, wow, she gets them all right. Surely she’s got a point?

Not so fast. Let’s define:

H: The lady can taste the difference between the two kinds of tea.

E: The lady gets all 8 cups right.

Clearly, TPR – the probability of E given H – is high. If she’s got the skill, she probably gets all her cups right. Let’s even say TPR=100%. But we are not Confirmation-biased: we know we also need to look at FPR. So we must ask: what is the probability of E given not-H, i.e. if the lady was just lucky? This is easy to calculate: there are 8!/[4!(8-4)!]=70 ways to choose 4 cups out of 8, and there is only one way to get them all right. Therefore, FPR=1/70. This gives us LR=70. Hence PO – the odds of H in the light of E – is 70 times the Base Odds. What is BO? Let’s say for the moment we are prior indifferent: the lady may be skilled, she may be deluded – we don’t know. Let’s give her a 50/50 chance: BR=50%, hence BO=1 and PO=70. Result: PP – the probability that the lady is skilled, in the light of her fully successful choices – is 70/71, i.e. about 99%. That’s high enough, surely.

But what if she made one mistake? Notice that, while there is only one way to be fully right, there are 4 ways to make 3 right choices out of 4, and 4 ways to make 1 wrong choice out of 4. Hence, there are 4×4 ways to choose 3 right and 1 wrong cups. Therefore, FPR=16/70 and LR=4.4. Again assuming BO=1, this means PP=81%. Adding the 1/70 chance of a perfect choice, the probability of one or no mistake out of mere chance is 17/70 and LR=4.1, hence PP=80%. Is that high enough?
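All the numbers above can be checked in a few lines (the variable names are mine):

```python
from math import comb

ways = comb(8, 4)                               # 8!/[4!(8-4)!] = 70
fpr_perfect = 1 / ways                          # 1/70: one way to get all cups right
fpr_one_miss = comb(4, 3) * comb(4, 1) / ways   # 16/70: 4x4 ways for one mistake

tpr = 1.0                                       # assumed, as in the text
lr = tpr / (fpr_perfect + fpr_one_miss)         # ~4.1 for one or no mistake
pp = lr / (1 + lr)                              # under prior indifference (BO = 1)
print(round(pp, 2))                             # 0.8
```

The same pattern gives PP=81% if we count only the exact one-mistake outcome (FPR=16/70, LR=4.4).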

I would say yes. But Ronald A. Fisher – the dean of experimental design and one of the most prominent statisticians of the 20th century – would have none of it.

It is not what I believe, it is what I know – said Sir Arthur Conan Doyle about the innumerable tricks and hoaxes he was haplessly subjected to in the course of his long life. Despite Harry Houdini’s steadfast efforts to convince him to the contrary, Sherlock Holmes’ dad remained certain about the powers of Spiritualism: he saw plenty of conclusive evidence.

Just think of what Helder Guimarães could have made him believe – err, know:

Unbelievable. We know there is a trick. The whole setting, and the magician himself, give us conclusive evidence: there is no way that this is not a trick. Yet, it appears as just the opposite. This is the deep beauty of sleight of hand magic: it blinds us with evidence.

The framework we used to describe a judge’s decision to convict or acquit a defendant based on the probability of Guilt can be generalised to any decision about whether to accept or reject a hypothesis. The utility function is defined over two states – True or False – and two decisions – Accept or Reject:

The decision maker draws positive utility U(TP) from accepting a true hypothesis (True Positive) and negative utility U(FP) from accepting a false hypothesis (False Positive). And he draws positive utility U(TN) from rejecting a false hypothesis (True Negative) and negative utility U(FN) from rejecting a true hypothesis (False Negative). Based on these preferences, the threshold probability that leaves the decision maker indifferent between accepting and rejecting the hypothesis is:

(1)   P∙U(TP) + (1-P)∙U(FP) = (1-P)∙U(TN) + P∙U(FN)

Hence:

(2)   P = [U(TN) - U(FP)] / [U(TP) - U(FN) + U(TN) - U(FP)]

The decision maker accepts the hypothesis if he thinks the probability that the hypothesis is true is higher than P, and rejects it if he thinks it is lower. As in the judge’s case, we define BB=U(FP)/U(FN), CB=U(TN)/U(TP) and DB=-U(FN)/U(TP). BB is the ratio between the pain of a wrongful acceptance and the pain of a wrongful rejection. CB is the ratio between the pleasure of a rightful rejection and the pleasure of a rightful acceptance. And DB is the ratio between the pain of a wrongful rejection and the pleasure of a rightful acceptance. Using these definitions, (2) can be written as:

(3)   P = (CB + BB∙DB) / (1 + DB + CB + BB∙DB)

which renders P independent of the utility function’s metric.

Again, with BB=CB=DB=1 we have P=50%: the hypothesis is accepted if it is more likely to be true than false. In most cases, however, the decision maker has some bias. We have seen a Blackstonian judge has BB>1: the pain of a wrongful conviction is higher than the pain of a wrongful acquittal. This increases the threshold probability above 50%. For example, with BB=10 we have P=85%: the judge wants to be at least 85% sure that the defendant is guilty before convicting him. On the other hand, the threshold probability of a ‘perverse’ Bismarckian judge, who dislikes wrongful acquittals more than wrongful convictions, is lower than 50%. For instance, with BB=0.1 we have P=35%: the judge convicts even if he is only 35% sure that the defendant is guilty.

In other cases, however, there is nothing perverse about BB<1. For instance, if the hypothesis is ‘There is a fire‘, a False Negative – missing a fire when there is one – is clearly worse than a False Positive – giving a False Alarm. This is generally the case in security screening, such as airport checks, malware detection and medical tests, where the mild nuisance of a False Alarm is definitely preferable to the serious pain of missing a weapon, a computer virus or a disease. Hence BB<1 and P<50%. As we have seen, with BB=0.1 we have P=35%. The same happens if CB<1: the pleasure of a True Positive – catching a terrorist, blocking a virus, diagnosing an illness – is higher than the pleasure of a True Negative – letting through an innocuous passenger, a regular email, a healthy patient. When both BB and CB are 0.1, P is reduced to 9% (the complement to 91% for BB=CB=10). Obviously, a 9% probability that a passenger may be carrying a weapon is high enough to check him out. In fact, in such cases the threshold probability is likely to be substantially lower, implying lower values for BB and CB. With BB=CB=0.01, for example, P is reduced to 1%. Again, if BB=CB then (3) reduces to P=BB/(1+BB), which tends to 0% as BB tends to zero, independently of DB. If, on the other hand, BB differs from CB, then DB does affect P. Assuming for instance BB=0.01 and CB=0.1, increasing DB from 1 to 10 – the pain of letting an armed man on board is higher than the pleasure of catching him beforehand – decreases P from 5% to 2%, while decreasing DB from 1 to 0.1 increases P to 8%. It is the other way around if BB=0.1 and CB=0.01. A higher DB increases the sensitivity to misses and decreases the sensitivity to hits, while a lower one has the opposite effect.
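A small sketch of the threshold probability implied by the bias definitions reproduces every number in this paragraph (the function name is mine):

```python
def threshold_prob(bb, cb, db=1.0):
    # P = (CB + BB*DB) / (1 + DB + CB + BB*DB):
    # the probability above which accepting the hypothesis beats rejecting it
    return (cb + bb * db) / (1 + db + cb + bb * db)

print(round(threshold_prob(10, 1), 2))          # 0.85: Blackstonian judge
print(round(threshold_prob(0.1, 1), 2))         # 0.35: fire-alarm / screening case
print(round(threshold_prob(0.1, 0.1), 2))       # 0.09: both biases below 1
print(round(threshold_prob(0.01, 0.1, 10), 2))  # 0.02: a higher DB lowers P further
```

Setting bb equal to cb makes the db argument irrelevant, confirming that in that case P reduces to BB/(1+BB).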

Hence the size of the three biases depends on the tested hypothesis. If in some cases accepting a false hypothesis is ‘worse’ than rejecting a true one (BB>1), in some other cases the opposite is true (BB<1). Likewise, sometimes rejecting a false hypothesis is ‘better’ than accepting a true one (CB>1), and some other times it is the other way around (CB<1). Finally, the pain of rejecting a true hypothesis can be higher (DB>1) or lower (DB<1) than the pleasure of accepting it.

This is all consistent with the Neyman-Pearson framework. They called a False Negative a Type I error and a False Positive a Type II error. In their analysis, H is the hypothesis of interest: a statistician wants to know whether H is true, as he surmises, or false. From his point of view, therefore, the first error – rejecting H when it is true – is ‘worse’ than the second error – accepting H when it is false. Hence BB<1. As a result, it makes sense to fix the probability of a Type I error at a predetermined low value, known as the significance level and denoted by α, while designing the test so as to minimise the probability of a Type II error, denoted by β, i.e. maximise 1-β, known as the test’s power – the probability of rejecting the hypothesis when it is false.

An inordinate amount of confusion is generated by the circuitous convention of formulating the test not in terms of the hypothesis of interest – the defendant is guilty, there is a fire, the passenger is armed, the email is spam, the patient is ill – but in terms of its negation: the so-called null hypothesis. This goes back to Ronald Fisher who, in fierce Popperian spirit, insisted that one can never accept a hypothesis – only fail to reject it. In this topsy-turvy world, rejecting the null hypothesis when it is true is a Type I error – a False Positive: calling a fire when there is none – while failing to reject the null when it is false is a Type II error – a False Negative: missing a fire when there is one. This is a pointless convolution (one wonders what Fisher told his girlfriend when he asked her to marry him: ‘Will you not reject me?’). For all intents and purposes, a non-rejection is tantamount to an acceptance: a test’s objective is to reach a practical decision, not to consecrate an absolute truth. For reference, here is a depiction of the straightforward Neyman-Pearson framework vs. the roundabout Fisher framework:

Neyman-Pearson: the test is framed in terms of H, the hypothesis of interest. Type I error: rejecting H when it is true (a False Negative). Type II error: accepting H when it is false (a False Positive).

Fisher: the test is framed in terms of the null hypothesis, the negation of H. Type I error: rejecting the null when it is true (a False Positive). Type II error: failing to reject the null when it is false (a False Negative).

Framing a test in terms of the hypothesis of interest reflects what a statistician is actually trying to accomplish: decide whether to accept or reject the hypothesis. As we have just seen, this depends not only on the tug of war between confirming and disconfirming evidence, indicating whether the hypothesis is true or false, but also on the decision maker’s utility preferences, measuring the relative costs and benefits of wrongful and rightful decisions.
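To make the framing concrete, here is a minimal sketch of a Neyman-Pearson test in Python, using only the standard library. Everything in it is hypothetical: H, the hypothesis of interest, is ‘there is a fire’, and a smoke sensor’s averaged readings are assumed normally distributed around 1 with a fire and around 0 without one.

```python
from statistics import NormalDist

# Hypothetical smoke sensor: the average of n readings is N(1, 1/sqrt(n))
# when there is a fire, N(0, 1/sqrt(n)) when there is none. H, the
# hypothesis of interest, is "there is a fire"; we accept H when the
# average exceeds a critical value c.
n = 9
sd = 1 / n**0.5
with_fire = NormalDist(1, sd)
without_fire = NormalDist(0, sd)

# Neyman-Pearson: fix the probability of a Type I error -- rejecting H
# when it is true -- at the significance level alpha, and derive c.
alpha = 0.05
c = with_fire.inv_cdf(alpha)

# beta is the probability of a Type II error -- accepting H when it is
# false; 1 - beta is the power: rejecting H when it is false.
beta = 1 - without_fire.cdf(c)
power = 1 - beta

print(f"c = {c:.3f}, power = {power:.3f}")
```

With these illustrative numbers the critical value is roughly 0.45 and the power is above 0.9: a low α need not come at the cost of a low power when the two distributions are well separated.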

Wittgenstein thought Leibniz’s question was unanswerable and, therefore, senseless. Asking the question was a misuse of language, sternly proscribed in the last sentence of the Tractatus:

7. Whereof one cannot speak, thereof one must be silent.

(Ironically, the sentence is often misused as meaning ‘Shut up if you don’t know what you’re talking about’, in blatant contravention of its own supposed prescription).

The riddle does not exist. This was a direct reference to Arthur Schopenhauer, who had traced the origin of philosophy to “a wonder about the world and our own existence, since these obtrude themselves on the intellect as a riddle, whose solution then occupies mankind without intermission” (The World as Will and Representation, Volume II, Chapter XVII, p. 170). Schopenhauer was himself recalling Aristotle (p. 160): “For on account of wonder (thaumazein) men now begin and at first began to philosophise” (Metaphysics, Alpha 2), and Plato (p. 170): “For this feeling of wonder (thaumazein) shows that you are a philosopher, since wonder is the only beginning of philosophy, and he who said that Iris was the child of Thaumas made a good genealogy” (Theaetetus, 155d).

There would be no riddle – said Schopenhauer – if, in Spinoza’s sense, the world were an “absolute substance”:

Therefore its non-being would be impossibility itself, and so it would be something whose non-being or other-being would inevitably be wholly inconceivable, and could in consequence be just as little thought away as can, for instance, time or space. Further, as we ourselves would be parts, modes, attributes, or accidents of such an absolute substance, which would be the only thing capable in any sense of existing at any time and in any place, our existence and its, together with its properties, would necessarily be very far from presenting themselves to us as surprising, remarkable, problematical, in fact as the unfathomable and ever-disquieting riddle. On the contrary, they would of necessity be even more self-evident and a matter of course than the fact that two and two make four. For we should necessarily be quite incapable of thinking anything else than that the world is, and is as it is (pp. 170-171).

Like Parmenides, Spinoza saw non-being as inconceivable. What-is-not cannot be spoken or thought. There is only being, the absolute substance, as self-evident as 2+2=4. Schopenhauer vehemently disagreed:

Now all this is by no means the case. Only to the animal lacking thoughts or ideas do the world and existence appear to be a matter of course. To man, on the contrary, they are a problem, of which even the most uncultured and narrow-minded person is at certain more lucid moments vividly aware, but which enters the more distinctly and permanently into everyone’s consciousness, the brighter and more reflective that consciousness is, and the more material for thinking he has acquired through culture (p. 171).

Thaumazein is the origin of philosophy and the inexhaustible fount of its core, metaphysics:

The balance wheel which maintains in motion the watch of metaphysics, that never runs down, is the clear knowledge that this world’s non-existence is just as possible as is its existence. Therefore, Spinoza’s view of the world as an absolutely necessary mode of existence, in other words, as something that positively and in every sense ought to and must be, is a false one (p. 171).

Spinoza’s solution to thaumazein was straighter than Leibniz’s own. The world is not God’s contingent creation ex nihilo. It is God itself: Deus sive natura. Hence, Leibniz’s question does not even have an answer. It is, as Wittgenstein put it, an unanswerable nonsense: like asking why 2+2 is 4 rather than 5. The riddle does not exist.

Nonsense – said Schopenhauer. The riddle does exist, and no solution can ever be found:

Therefore, the actual, positive solution to the riddle of the world must be something that the human intellect is wholly incapable of grasping and conceiving; so that if a being of a higher order came and took all the trouble to impart it to us, we should be quite unable to understand any part of his disclosures. Accordingly, those who profess to know the ultimate, i.e. the first grounds of things, thus a primordial being, an Absolute, or whatever else they choose to call it, together with the process, the reasons, grounds, motives, or anything else, in consequence of which the world results from them, or emanates, or falls, or is produced, set in existence, “discharged” and ushered out, are playing the fool, are vain boasters, if indeed they are not charlatans (p. 185).

Wow. So much for my faith. Notice the difference between Schopenhauer and Baloo. Baloo says: We don’t need an ultimate answer. Schopenhauer says: Of course we do. We’re not bears. Men wonder and care to know. But we can’t. As Immanuel Kant definitively demonstrated, there is no way for us to know ‘the first ground of things’ or, as he called them, things-in-themselves. All we can possibly know are phenomena – things as they appear to us, come to light and are experienced by us as evidence. Kant contrasted phenomena with noumena – things as abstract knowledge, thoughts and concepts produced by the mind (nous) independently of sensory experience. He used the term as a synonym for things-in-themselves, although – as noted by Schopenhauer (Volume I, Appendix, p. 477) – it was not quite the way the ancient Greeks used it. Be that as it may, Kant’s meaning has since prevailed. Noumena are things as they are per se – unknowable to the human intellect. Phenomena are things as they appear to us as evidence through our senses. Evidence is what there is, i.e. what ex-ists, is out there in what the ancient Greeks called physis and we call, in its Latin translation, nature. Physics is mankind’s endeavour to explain the phenomena of the natural world.

As we know, however, physics’ explanations unfold into endless why-chains, which we find inconceivable. Explanations cannot go on ad infinitum: why-chains must have a last ring – an ultimate answer that ends all questions. But where can we find it, if physis is all there is?

Hence thaumazein‘s first solution: physis is not all there is. Beyond physis there is a supernatural, self-sustaining entity that created it. We know such an entity not through experience but through pure reason – the logos of the ancient Greeks which, like Thaumas’ daughter, Iris, links mankind to the divine. Pure reason does not require evidence. God is logically self-evident, like 2+2=4: there cannot but be one.

It was against such pure reason that Kant unleashed his arresting Critique. After Kant, whoever professes to know the Absolute earns Schopenhauer’s unceremonious epithets. Mankind cannot know the Absolute, either by experience or pure reason. It can only believe in it through revelation, leaning upon the soft evidence emanating from a trusted source, such as: ‘We believe in one God, the Father Almighty, Maker of all things visible and invisible’. Alas, the obvious trouble with such divine disclosures is their wild variety in time, place and circumstance, leaving believers with a hodgepodge of conflicting but equally conclusive revelations, imparted by self-appointed messengers employing a full bag of tricks in order to establish and support their trustworthiness.

Where does this leave us, then, with our search for the last ring? If we cannot find it in physis, where phenomena, explained by endless why-chains, are all there is, and we cannot find it beyond physis, where noumena are inaccessible to our intellect, where can we look for it? Do we need to surrender and, following Schopenhauer, declare the riddle insoluble? Do we need to heed Wittgenstein’s proscription, declare thaumazein an unanswerable nonsense and remain silent?

Children have little trouble understanding zero. When we teach them to count to ten on their fingers, two closed fists mean zero fingers. Easy, and not that interesting. In their attempt to size up the world, children are much more interested in upper limits. As every parent knows, they bombard us with measurement questions: What is the strongest animal? The fastest car? The best footballer? And as soon as they get into numbers and figure out that they go far beyond ten, to millions, billions and gazillions, comes the fateful question: what is the biggest number? To which the standard answer – there is no biggest number: take any number, if you add one to it you get a bigger number – is at first puzzling and rather upsetting. Until they get a name for it: infinity.

But this goes only some way to appease them. Infinity sounds like the biggest number, but it is a weird one. What is infinity plus one? Infinity. And infinity plus infinity? Still infinity. You know – I told them to assuage their perplexity – infinity is not really a number: it is a concept. ‘What’s a consett?’ Well, it’s an idea that we create in our mind to talk about things. ‘Hmm’ – I could hear their brain whirring.
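Python’s floating-point infinity behaves exactly like the children’s ‘consett’: it absorbs any addition and exceeds any actual number. A toy illustration:

```python
inf = float("inf")  # IEEE 754 infinity: a symbol we compute with, not a number we can count to

assert inf + 1 == inf      # infinity plus one is still infinity
assert inf + inf == inf    # infinity plus infinity too
assert 10**100 < inf       # it is bigger than any actual number
print(inf - inf)           # nan: some questions about it have no answer
```

The last line is telling: subtracting infinity from itself yields ‘not a number’, a small reminder that infinity is a concept, not a quantity.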

No problem with zero, puzzled by infinity. Interestingly, it was the other way around with the ancient Greeks, who had trouble with ‘nothing’, but were quite comfortable with the unlimited – a-peiron. Anaximander saw it as the principle of all things. Euclid used it to define parallel lines and demonstrated the infinity of prime numbers: ‘Prime numbers are more than any assigned multitude of prime numbers’ (Elements, Book IX, Proposition 20). ‘More than any assigned multitude’ is the concept of infinity that we teach our children. Aristotle called it potential infinity:

The infinite, then, exists in no other way, but in this way it does exist, potentially and by reduction. (Physics, Book III, Part 6).

According to Aristotle, there is no such thing as actual infinity. To demonstrate it, he contrasted arithmetic infinity with physical infinity, and infinity by addition with infinity by division:

It is reasonable that there should not be held to be an infinite in respect of addition such as to surpass every magnitude, but that there should be thought to be such an infinite in the direction of division. For the matter and the infinite are contained inside what contains them, while it is the form which contains. It is natural too to suppose that in number there is a limit in the direction of the minimum, and that in the other direction every assigned number is surpassed. In magnitude, on the contrary, every assigned magnitude is surpassed in the direction of smallness, while in the other direction there is no infinite magnitude. The reason is that what is one is indivisible whatever it may be, e.g. a man is one man, not many. Number on the other hand is a plurality of ‘ones’ and a certain quantity of them. Hence number must stop at the indivisible: for ‘two’ and ‘three’ are merely derivative terms, and so with each of the other numbers. But in the direction of largeness it is always possible to think of a larger number: for the number of times a magnitude can be bisected is infinite. Hence this infinite is potential, never actual: the number of parts that can be taken always surpasses any assigned number. But this number is not separable from the process of bisection, and its infinity is not a permanent actuality but consists in a process of coming to be, like time and the number of time.
With magnitudes the contrary holds. What is continuous is divided ad infinitum, but there is no infinite in the direction of increase. For the size which it can potentially be, it can also actually be. Hence since no sensible magnitude is infinite, it is impossible to exceed every assigned magnitude; for if it were possible there would be something bigger than the heavens. (Physics, Book III, Part 7).

According to Aristotle, in physics there are no infinitely large things, but there are infinitely small things. In arithmetic, however, it is the other way around: numbers are infinitely large, but not infinitely small. To make the point, he used Zeno’s Dichotomy argument: any magnitude can be bisected a potentially infinite number of times. Hence numbers are potentially infinite: ‘in the direction of largeness it is always possible to think of a larger number: for the number of times a magnitude can be bisected is infinite’. On the other hand, there is no number smaller than one: ‘one is indivisible whatever it may be, e.g. a man is one man, not many’…’Hence number must stop at the indivisible’.

For the ancient Greeks there were only natural numbers: ‘A unit is that by virtue of which each of the things that exist is called one’. ‘A number is a multitude composed of units’. (Elements, Book VII, Definitions 1 and 2). So, strictly speaking, one was not even a number (let alone zero) and there was definitely no number smaller than one. A fraction was not seen as a number per se, but as the ratio of two numbers: ‘A ratio is a sort of relation in respect of size between two magnitudes of the same kind’. (Elements, Book V, Definition 3). On the other hand, there was no largest number: the number of times a magnitude can be bisected is infinite. Hence, magnitudes can be infinitely small.

Aristotle was wrong on both counts. In mathematics, there are lots of real numbers smaller than one – in fact, an infinity of them – whereas in the physical world nothing is smaller than a Planck length. On the side of the large, however, he was right on both counts: numbers are infinite, but there is no such thing as an infinite physical magnitude.

Whether mathematical infinity is merely potential or fully actual has been and still is the subject of a heated debate. The prince of mathematicians, Carl Gauss, agreed with Aristotle:

So first of all I protest against the use of an infinite magnitude as something completed, which is never allowed in mathematics. The infinite is only a way of speaking, in which one is really talking in terms of limits, which certain ratios may approach as close as one wishes, while others may be allowed to increase without restriction. (Letter to H. C. Schumacher, no. 396, 12 July 1831).

On the other side, Georg Cantor was adamant about actual infinity, and regarded its staunch defence as a mission from God. If there is an actual infinity of natural numbers, infinity can be treated as a number. But then, since Cantor’s set theory implies that there is an infinity of infinities, our childish quest to get our arms around the biggest number is thrown into even deeper despair. Numbers are infinite – or even infinitely infinite.

At the same time, however, Aristotle made clear that there are no infinite physical magnitudes. When we call something ‘infinite’, what we usually mean is that it is really, really big. Otherwise, if we purposely intend to say that it is actually infinite, we don’t understand what we are talking about. Clearly, no thing can be infinite: the difference between the most ginormously big thing and infinity is, well, infinite. Take the 10^{80} atoms in the observable universe. That’s a lot of atoms, but it is still a finite quantity. Call it U1. How big is U1 compared to U2=U1^{80}? Vanishingly small – far smaller even than an atom relative to the observable universe. So is U2 compared to U3=U2^{80}. And that’s just a miserable 3: how about U_{1Million}, U_{1Billion}, U_{1Gazillion}? They are all next to nothing compared to actual infinity.
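Python’s arbitrary-precision integers make it easy to check how these towers compare – U1, U2, U3 as defined above, with 10^{80} as the usual rough estimate for the atom count:

```python
U1 = 10**80   # rough count of atoms in the observable universe
U2 = U1**80   # = 10**6400
U3 = U2**80   # = 10**512000

# An atom relative to the universe is 1 part in 10^80; U1 relative to U2
# is 1 part in 10^6320 -- vanishingly smaller.
assert U2 // U1 == 10**(6400 - 80)
assert U3 // U2 == 10**(512_000 - 6_400)

# Yet each of these is finite, with a definite number of digits:
print(len(str(U3)) - 1)   # number of zeros after the leading 1
```

However large the exponent tower grows, every step is a finite number – and hence, compared to actual infinity, next to nothing.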

Actual physical infinity is not an awe-inspiring immensity that we are too small to comprehend. It is an ill-considered, meaningless and unusable concept. There is no such thing as actual physical infinity. Nor is there potential infinity: ‘For the size which it can potentially be, it can also actually be’. In the physical world, potential infinity – we can call it indefiniteness – coincides with actual infinity: nothing can be bigger than the universe, otherwise it would itself be the universe.

Once properly rearranged, Aristotle’s crucial distinction between the mathematical and the physical world should not be forgotten. In mathematics, there is zero and there is infinity, and we can speak and think of both. There are infinitely small numbers and infinitely large numbers. Zero is itself a number and infinity can be treated as a number. But neither zero nor infinity is something: there is no such thing as nothing and no such thing as infinity. There are no infinitely small things and no infinitely large things. In the physical world, zero and infinity are just useful signs: zero indicates that something is absent and infinity indicates that something is indefinitely big. There are, however, three fundamental differences between them:

1. We can observe zero: it is what we call negative evidence. There is nothing – zero things – on the table. But we cannot observe infinity: there is no infinity of things, on the table or anywhere else.

2. Something becomes nothing after a finite number of bisections. Zero is the magnitude of nothing. But something cannot become infinite: no finite number of operations can turn something into infinity. No thing has infinite magnitude.

3. We cannot observe the absence of everything – Nall – but we can imagine it. While Nall may be impossible, it is not senseless. But we can neither observe nor imagine the infinity of everything: it is an impossible and senseless concept.

Infinity is a sublime, breathtaking, but much abused word. We should never forget it is a consett – perfect to express a parent’s love for his children, but inapplicable to the size of any thing.

Parmenides’ trouble with ‘nothing’ was nothing new. The ancient Greeks thought the world started with Chaos, a variably imagined primordial mess, where the principle of all things (Arche) eventually gave rise to an ordered Cosmos. Whatever that was – Anaximander called it Apeiron, the limitless – it was something.

This was common to all ancient creation myths, including the Bible. They all started with something. It was only in the second and third century CE that Christian theologians, eager to affirm God’s absolute omnipotence, reinterpreted Genesis as creation ex nihilo. Why is there something rather than nothing? Because God created it. Leibniz was so keen to demonstrate it that he got into his own muddle.

Take the infinite series S=1+x+x^{2}+x^{3}+…= 1/(1-x). For x=-1 we have S=1-1+1-1+…=1/2. That is S=(1-1)+(1-1)+…=0+0+…=1/2. Amazing but true: an infinite sum of zeros equals 1/2. Nothing=Something. Luigi Guido Grandi, a Camaldolese monk and mathematician, saw this as a marvellous representation of creation ex nihilo. Here is another way to see it: S can be written as (1-1)+(1-1)+… or as 1-(1-1)-(1-1)-… . The first sum, with an even number of terms, equals zero, while the second sum, with an odd number of terms, equals 1. So – just like a bit – S is either 0 or 1, depending on when we stop counting. Since we have an equal probability of stopping at an even or at an odd number, the expected value of S is 1/2.

Leibniz – one of the smartest men on earth and co-inventor of infinitesimal calculus – bought the argument. The wonders of binary arithmetic – he thought: God, the One, created all Being from Nothing. He was so impressed with the idea that, according to Laplace, he wrote a letter to the Jesuit missionary Claudio Filippo Grimaldi, president of the tribunal of Mathematics in China, asking him to show it to the Chinese emperor and convince him to convert to Christianity! “I report this incident only to show to what extent the prejudices of infancy can mislead the greatest men” (A Philosophical Essay on Probabilities, p. 169).

As should have been obvious to Leibniz (and likely it was to Grimaldi and to the emperor, who remained an infidel), S is only convergent for -1<x<1. For x=-1 it does not converge to any number, but bounces aimlessly between 1 and 0 – between something and nothing, if you want to insist on the metaphor. But in that case you don’t need the whole series. The first two terms are enough: 1-1=0. Nothing is the sum of two opposite somethings – hot and cold, wet and dry (as in Anaximander’s apeiron), positive and negative, good and evil, light and darkness, matter and antimatter, Yin and Yang, Laurel and Hardy or whatever opposite pair you may fancy. Why not 42-42=0?
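A few lines of Python make the bouncing visible – a sketch of the grouping argument above. The partial sums of Grandi’s series never settle on a value, while their running average (the Cesàro mean, the ‘expected value’ reading in the text) does tend to 1/2:

```python
# Partial sums of Grandi's series 1 - 1 + 1 - 1 + ...
def partial_sums(n):
    s, out = 0, []
    for k in range(n):
        s += (-1) ** k
        out.append(s)
    return out

sums = partial_sums(1000)

# The series does not converge: the partial sums just bounce between 1 and 0.
assert set(sums) == {0, 1}

# The Cesaro average of the partial sums, however, does converge -- to 1/2,
# the value the formula 1/(1-x) assigns at x = -1.
cesaro = sum(sums) / len(sums)
print(cesaro)
```

So the ‘infinite sum of zeros equal to 1/2’ dissolves: 1/2 is not the sum of the series, which has none, but the average around which its partial sums oscillate.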

Grandi’s series is not a good answer to Leibniz’s question. No wonder – Parmenides would have said. ‘Why is there something rather than nothing’ is a meaningless question: there is no such thing as nothing – we cannot even speak or think about it. In his Tractatus Logico-Philosophicus Wittgenstein agreed:

6.44 Not how the world is, is the mystical, but that it is.

6.45 The contemplation of the world sub specie aeterni is its contemplation as a limited whole.
The feeling of the world as a limited whole is the mystical feeling.

6.5 For an answer which cannot be expressed the question too cannot be expressed.
The riddle does not exist.
If a question can be put at all, then it can also be answered.

Wittgenstein’s world sub specie aeterni is the Einstein-Weyl block universe, and his world as a limited whole resembles Parmenides’ well-rounded sphere. In his 1929 Lecture on Ethics, Wittgenstein described thaumazein as ‘my experience par excellence‘: ‘when I have it I wonder at the existence of the world‘. Like Parmenides, however, he thought that any ‘verbal expression’ about thaumazein was ‘nonsense! If I say “I wonder at the existence of the world” I am misusing language’. ‘To say “I wonder at such and such being the case” has only sense if I can imagine it not to be the case’. ‘But it is nonsense to say that I wonder at the existence of the world, because I cannot imagine it not existing’ (p. 41). It is clear from this that the question Wittgenstein had in mind in 6.5 was precisely Leibniz’s question, which he regarded as senseless – a question that cannot be put at all and therefore cannot be answered. The riddle does not exist.

I find this very strange. What’s so difficult about imagining that nothing exists? Just imagine the absence of everything – any thing, all the 10^{80} atoms in the observable universe, or however many there are in the whole universe. And if after that you are left with something – a vacuum space – imagine that away as well, until there is nothing – nothing at all. What’s the big deal? Let’s call this ‘Nall’ – short for Nothing at all – which of course is not a thing but just a name for the absence of any thing. Nall may be impossible, but it is certainly not unimaginable. In fact, ‘Why is there All rather than Nall?’ is a shorter version of Leibniz’s question. Nothing senseless about it.

Nothing is smaller than a Planck length. When I say this, I mean: There is no such thing that is smaller than ℓ_{P}. Notice, however, that the same sentence could be misunderstood as having the exact opposite meaning: There is such a thing, called nothing, that is smaller than ℓ_{P}. Similarly, if I say: ‘Nothing is better than spaghetti alle vongole’, I mean that I like it a lot. But the opposite reading would mean that I hate it, and that I would rather eat nothing, i.e. not eat.

As obvious as this is, there is an astonishing amount of confusion about the use of the word ‘nothing’. The muddle goes back to Zeno’s teacher (and, according to Plato, his lover – which sheds a different light on his leaving the scene at the Dichotomy paradox): Parmenides. To be fair, the only thing Parmenides wrote is an allegorical poem, On Nature, of which we only have 160 lines, as reported in a few fragments by later writers. But, as these were purposeful selections, there is little chance that the rest was any clearer. Parmenides drew a stark distinction between the way of Truth (Aletheia), which he defined, in no uncertain but awkwardly convoluted terms, as ‘the way that it is and cannot not be‘, and the way of Appearance (Doxa), defined as ‘the way that it is not and that it must not be‘ (Fragment 3). According to Parmenides, the second is a no-go area: ‘There is no such thing as nothing’ (Fragment 5). The only way is the first:

Now only the one tale remains
Of the way that it is. On this way there are very many signs
Indicating that what-is is unborn and imperishable,
Entire, alone of its kind, unshaken, and complete.
It was not once nor it will be, since it is now, all together,
Single and continuous. (Fragment 8).

This is Einstein’s and Weyl’s block universe, which ‘simply is’, and where ‘the distinction between past, present and future is only a stubbornly persistent illusion’ (in conversation with Einstein, Karl Popper called him “Parmenides”. Unended Quest, p. 129). It is a timeless world, governed by the Principle of Sufficient Reason, which Parmenides expressed in his own way, well before Spinoza and Leibniz:

For what birth could you seek for it?
How and from what did it grow? Neither will I allow you to say
Or to think that it grew from what-is-not, for that it is not
Cannot be spoken or thought. (Fragment 8).

Everything has a cause. Hence it is impossible to think that anything can come from what is not. As Lucretius would later put in De Rerum Natura: Nihil igitur fieri de nihilo posse (Book I, 205), which is commonly abbreviated as Ex nihilo nihil fit: Nothing comes from nothing. How could it be otherwise? There is no such thing as nothing.

Here is where the muddle starts. Let’s assume for a moment that the Principle of Sufficient Reason is right: what-is cannot come from what-is-not. Does that mean that what-is-not cannot even be spoken or thought about? Clearly not, as indeed Parmenides himself shows by repeatedly referring to it. We can speak and think of what-is-not, i.e. nothing, as a name for the absence of something. ‘There is nothing on the table’ does not mean that there is a thing, called nothing, that is lying on the table. It means the opposite: the table is bare, there is no thing lying on it.

The ancient Greeks were aware of the confusion. When Ulysses told Polyphemus that his name was Nobody (Outis) and proceeded to blind him, the giant called for help. But when his fellow Cyclopes asked him what was happening, they were mystified by his answer: Nobody is trying to kill me (Odyssey, Book 9). Somehow, however, Homer’s descendants never managed to dissolve the puzzle: how can nothing be something? Amazingly, therefore, they had no sign and no use for the concept of zero, until they later imported it from the East. Zero is The Nothing That Is. But it is not something: it is just a sign that we use to indicate that something is not. As such, not only can we speak and think about it, but we can also observe it. It is what we call negative evidence, or absence of evidence. Look: there is nothing on the table, no thing, zero things. And there is no green rhino – zero rhinos – under the carpet. This is not positive evidence about the presence of something. It is negative evidence about its absence. So it does not presuppose the existence of the absent thing: green rhinos do not exist, nor is their existence implied in the sentence.

Speaking and thinking of what-is-not does not imply that it is. Somehow Parmenides was confused about this and, as per Boileau’s rule, expressed it obscurely:

For the same thing both can be thought and can be (Fragment 4).
It must be that what can be spoken and thought is, for it is there for being (Fragment 5).
By thinking gaze unshaken on things which, though absent, are present,
For thinking will not sever what-is from clinging to what-is (Fragment 6).

What? Surely he did not mean that whatever we can speak and think of must exist – wouldn’t that be wonderful? What he meant was that it is impossible to think that things are not. What-is has always been and will always be. Ex nihilo nihil fit. Nothing cannot be something, or, more precisely, nothing cannot become something. It is, as we have seen, another way to state the Principle of Sufficient Reason.

For the same reason, Parmenides thought that something could not become nothing:

Nor ever will the power of trust allow that from what-is
It becomes something other than itself (Fragment 8).

As Lucretius put it: Haud igitur redit ad nihilum res ulla (Book I, 248). It is the same reason on which Zeno built his paradoxes. Which, as we have seen, are paradoxes precisely because, in reality, something can become nothing: nothing is smaller than a Planck length.

Parmenides was right: there is no such thing as nothing. More precisely, there is no such thing as nothingness – a place or a state in which phenomena are before they appear into existence. But then he went astray, insisting that what-is-not cannot even be thought about, and that there are no phenomena, no creation, no extinction and no change. What-is is an eternal, immutable one, ‘like the body of a well-rounded sphere’ (Fragment 8). In such a world, everything has a reason and could not be otherwise: there is no such thing as chance.

That is as crazy a world as that of Zeno’s paradoxes. In the real world, we can and do think of possibilities: what-is-not but could be. What-is was not bound to be. It is an event that happened, with some probability: one of several possibilities. Events are nothing before they happen and turn into nothing after they cease to exist. They do not appear from nothingness: they be-come from nothing – nothing at all.

Take something, say a cake, and break it in half. Take one piece and break it again. Keep breaking, until a small piece becomes a tiny crumb, then a barely visible speck, then an invisible molecule and an atom and a subatomic particle. Then what? However imperceptibly minuscule in size, an elementary particle is still something. As such, it can itself be broken in half – conceptually at least, if not in practice. So can each resulting piece, and so on. But we know that this cannot go on forever. At some point – at a size of about 10^{-35} metres, called the Planck length – an amazing phenomenon occurs: something becomes nothing. Where nothing is not an approximation to a really tiny something: it is nothing – nothing at all.

This is very odd, but indisputably true. We know it from Zeno’s Dichotomy paradox, a funny version of which is as follows: Zeno and Epicurus see a beautiful woman from a distance, who tells them: “Walk half the way between us, every ten seconds. When you get here, you can have me”. “Impossible!” – says Zeno and storms away. “Here I come” – says Epicurus (who, never mind, was born 150 years after Zeno) – “in a couple of minutes I’ll be close enough for all practical purposes”.

The distance between Epicurus and the lady – or, similarly, between Achilles and the tortoise – goes from something to nothing. Let’s say it starts at 100 metres. After 10 seconds, it is 50 metres. After 20 seconds, 25 metres. After N∙10 seconds, it is 100∙(1/2)^{N} metres: 2.4 centimetres in two minutes. Epicurus is right. However, Zeno, the ascetic stickler, is not having it: 2.4 centimetres is still some distance. In ten seconds, it will be 1.2 centimetres, then 6 millimetres, 3, 1.5 and so on: it will never go to zero. But he is wrong. The N such that 100∙(1/2)^{N} is less than the Planck length ℓ_{P}=1.616199∙10^{-35} is Log(ℓ_{P}/100)/Log(0.5), which is about 122.2: it takes 1222 seconds – 20 minutes and 22 seconds – to go from 100 metres to nothing – nothing at all.
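The arithmetic is easy to check – a sketch using the Planck length value quoted above:

```python
import math

PLANCK = 1.616199e-35   # Planck length in metres (value used in the text)
start = 100.0           # initial distance in metres

# Number of bisections before the remaining distance drops below the
# Planck length: solve 100 * (1/2)**N < PLANCK for N.
N = math.log(PLANCK / start) / math.log(0.5)
print(round(N, 1))      # about 122.2 bisections

# One bisection every ten seconds:
seconds = N * 10
print(int(seconds))     # 1222 seconds: 20 minutes and 22 seconds
```

Any base works for the logarithms, since only their ratio matters; natural logs are used here.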

Notice that, in mathematical terms, Zeno is right. The infinite series S=x+x^{2}+x^{3}+… equals x/(1-x) for |x|<1. So, for x=1/2, S=1. S converges to 1, but it does so only ad infinitum. S has many more than 122 terms: it has an infinite number of them, each one a non-zero addendum to the series. In mathematics, numbers can be smaller – infinitely smaller – than the Planck length. But in reality nothing – no thing – can be. Some thing with a Planck length breaks into no thing. That’s why Epicurus is happy and Achilles beats the tortoise.

A Planck length is really, really small: 10^{-20} the diameter of a proton – far too small for direct measurement with currently available instruments. But it is also the shortest measurable length: the lower limit to how small things can be. Nothing can be smaller than ℓ_{P} metres. Infinitesimal calculus, which contemplates numbers smaller than ℓ_{P}, is valid and useful, but it does not reflect the reality of small things, where addenda in S beyond N>122 are exactly zero. Nothing is infinitesimally small.

Something turning into nothing is weird. But that’s just one of the many oddities we encounter in the quantum world. Brian Greene gives an excellent introduction, which ends with these words:

As strange as quantum mechanics may be, what’s now clear is that there is no boundary between the worlds of the tiny and the big. Instead, these laws apply everywhere, and it’s just that the weird features are most apparent when things are small. And so the discovery of quantum mechanics has revealed a reality – our reality – that’s both shocking and thrilling, bringing us that much closer to fully understanding the fabric of the cosmos.

Indeed. Quantum mechanics suggests that spacetime is inherently granular, as implied by Loop Quantum Gravity, one of the most promising attempts – put forward by Carlo Rovelli, Lee Smolin and others – at unifying quantum mechanics and general relativity in a so-called Theory of Everything. At a more basic level, however, granularity is a simple consequence of Zeno’s paradox. That’s why it is a paradox.