On Philosophy

August 13, 2007

A Calculus For Knowledge (2)

Filed under: Epistemology — Peter @ 12:00 am

WARNING: this post contains integrals and the beta function; consult your mathematician before reading.

Suppose we are considering the hypothesis that all crows are black, and we have observed 10 black crows. How confident should we be in the hypothesis given that observation? If we observe 10 more black crows how much more confident should we be? And things can be more complicated than this. Suppose that our hypothesis is instead that 75% of crows are black or that 50% of crows are black. How does the observation of 10 black crows affect our confidence in these hypotheses (given that if they are true it is certainly possible for us to see 10 black crows in a row, just not too likely)? Surprisingly all these questions can be answered within a single framework (which I derived today from first principles, I might add).

Similar questions about how a hypothesis is affected by evidence can sometimes be answered by using Bayes’ theorem, so let’s start there. Bayes’ theorem states that the probability of a hypothesis given some evidence, P(H|E)
= \frac{P(E|H)*P(H)}{P(E)}.
The first part of the numerator isn’t too hard to evaluate. P(E|H) simply is the probability of the evidence occurring given that the hypothesis is true. Let’s say that our hypothesis says that the probability of a positive is x. (If our hypothesis is 70% of crows are black then in this case a positive is a single black crow, a negative is a non-black crow, and the probability of a positive is .7) And let’s suppose that we have had a positives and b negatives. The probability of that outcome given our hypothesis is (\#)*x^a*(1-x)^b. (#) stands for the number of ways to get that outcome, which we could figure out by doing some more math, but as you will see in a bit the exact value doesn’t matter.
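For concreteness, and only as an aside since the value cancels later, the count (#) is the standard binomial coefficient (a textbook fact, not something derived in this post):
(\#) = \binom{a+b}{a} = \frac{(a+b)!}{a!\,b!}
With 10 black crows and nothing else (a = 10, b = 0) there is only one ordering, so the 75% hypothesis gives
P(E|H) = 0.75^{10} \approx 0.056
while the 50% hypothesis gives 0.5^{10} \approx 0.001.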

The first problem with this is that the P(E) in the denominator stands for the probability of that particular evidence occurring, by itself. Obviously there is no way to determine that; the only way to determine the likelihood of seeing a particular number of black crows is to take a position on what percent of crows are black. Thus we replace P(E) with P(E|H)*P(H) + P(E|~H)*P(~H), which can be trivially shown to be identical. But now we run into problems with the second term. ~H means the hypothesis that is formed by negating the original hypothesis, but negating the hypothesis that 70% of crows are black is not the hypothesis that 30% of crows are black; in fact it doesn’t make a claim at all about what percent of crows are black, just that it isn’t 70%. And that doesn’t help either. So, let’s retreat to another alternative. If there are a number of mutually exclusive hypotheses, A, B, and C, that together cover every possibility, then P(E) is also equal to P(E|A)*P(A) + P(E|B)*P(B) + P(E|C)*P(C). If life were simple we could just consider the 10% hypotheses (that there would be 0%, 10%, 20%, …, and 100% positives), each taken to be equally likely, which would make P(E)
= P(E|0\%)*\frac{1}{11} + P(E|10\%)*\frac{1}{11} + ... + P(E|100\%)*\frac{1}{11}.
(Counting both 0% and 100% there are eleven of these hypotheses, so each gets a weight of 1/11.) That is something we can calculate, but it would be wrong, because it misses out on a whole bunch of hypotheses. We could make it slightly better by calculating each of the 101 1% hypotheses like so:
P(E) = P(E|0\%)*\frac{1}{101} + P(E|1\%)*\frac{1}{101} + ... + P(E|100\%)*\frac{1}{101}.
Some of you can probably see where this is going by now. We can sidestep this ever-improving process of approximation simply by integrating, and so P(E) is exactly equal to
(\#) * \int_0^1 x^a*(1-x)^b \,dx
That is not exactly the easiest integral to compute, but it is known to be the beta function, specifically:
\int_0^1 x^a*(1-x)^b \,dx = B(a+1, b+1)
That may or may not shed some light on what is going on for you, but in this context all that we need to know is that we can calculate the beta function, and thus can actually get some numbers. The following infinite series can be used to approximate Β(a,b)
= \frac{1}{a} + \frac{1-b}{a+1} + ... + \frac{(1-b)*(2-b)*...*(n-b)}{n!*(a+n)}+ ...
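To see that these pieces agree numerically, here is a minimal Python sketch. The function names are my own, and the comparison against the gamma-function identity B(a,b) = Γ(a)Γ(b)/Γ(a+b) is a standard fact brought in for checking, not part of the derivation above. It sums the series, and it also averages x^a*(1-x)^b over an evenly spaced grid of hypotheses to show the sum creeping toward the integral.

```python
import math


def beta_gamma(a, b):
    """B(a, b) from the standard identity gamma(a) * gamma(b) / gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def beta_series(a, b, terms=200):
    """Partial sum of 1/a + (1-b)/(a+1) + ... (assumed term count: 200).

    For whole-number b the series stops by itself after b terms; for other
    b it converges slowly, so this is only a rough approximation there.
    """
    total = 0.0
    coeff = 1.0  # (1-b)(2-b)...(n-b) / n!, which is 1 for n = 0
    for n in range(terms):
        total += coeff / (a + n)
        coeff *= (n + 1 - b) / (n + 1)
    return total


def grid_average(a, b, steps):
    """Average of x^a (1-x)^b over the hypotheses 0, 1/steps, ..., 1."""
    values = ((k / steps) ** a * (1 - k / steps) ** b for k in range(steps + 1))
    return sum(values) / (steps + 1)


if __name__ == "__main__":
    a, b = 7, 3  # 7 positives, 3 negatives
    print("series:", beta_series(a + 1, b + 1))
    print("gamma :", beta_gamma(a + 1, b + 1))
    for steps in (10, 100, 10000):
        print(steps, "steps:", grid_average(a, b, steps))
```

With whole-number counts the series is a finite sum, which is why it is practical here. For large b its terms alternate in sign and grow large before cancelling, so the gamma form is the safer choice for real calculations.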
Now, back to the original Bayes’ theorem equation we have
P(H|E) = \frac{(\#)*x^a*(1-x)^b *P(H)}{(\#)*B(a+1,b+1)} = \frac{x^a*(1-x)^b *P(H)}{B(a+1,b+1)}
The only remaining problem is that floating P(H). If we were to be honest to the integration we performed, P(H) should approach zero, and hence the entire equation would approach zero. Now this is an accurate statement about probability. After all, a particular string of observations supports 50% and 50.1% to about the same degree. And since there are an infinite number of hypotheses between just those two, the probability of any particular hypothesis shrinks to zero. What this reveals is that what we are really after, if we want to know the probability, is the probability of a range of hypotheses, say those between 45% and 55%, being true given the evidence. I’ll return to this later. Instead of treating P(H) like the probability of the hypothesis let us instead look at it as a measure of the degree of confidence we have in it. This has a number of implications. The first is that we need to revise our Bayes’ formula, again, to be:
P(H|E) = \frac{x^a*(1-x)^b *P(H)}{ x^a*(1-x)^b *P(H) + B(a+1,b+1)*(1-P(H))}
Why we must modify it like this is a bit complicated to explain. In brief we are going back to one of our earlier formulations and letting Β(a+1,b+1) stand in for P(E|~H), which it does nicely. Of course to actually do any calculations with this formula we have to pick a value for P(H). I use .4. But this is basically an arbitrary choice; another theorem about Bayes’ theorem states that as more evidence is collected the choice of P(H) becomes irrelevant, as the value resulting from using the formula converges, so long as P(H) isn’t 0 or 1.
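The revised formula is short enough to write out directly. What follows is a sketch under a few assumptions of my own: the name `confidence`, the default of .4 for P(H), and the use of `math.lgamma` to evaluate the beta function (chosen only so that larger counts do not overflow).

```python
import math


def beta(a, b):
    """B(a, b) via log-gamma, to avoid overflow for larger counts."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def confidence(x, a, b, prior=0.4):
    """Confidence in the hypothesis 'a fraction x of cases are positive'
    after observing a positives and b negatives.

    prior plays the role of P(H); .4 is the arbitrary value used in the post.
    """
    likelihood = x ** a * (1 - x) ** b   # x^a (1-x)^b, the (#) has cancelled
    alternative = beta(a + 1, b + 1)     # stands in for P(E|~H)
    return likelihood * prior / (likelihood * prior + alternative * (1 - prior))


if __name__ == "__main__":
    # Ten black crows, no others: the question the post opened with.
    for x in (1.0, 0.75, 0.5):
        print(f"{x:.0%} black:", round(confidence(x, 10, 0), 4))
```

For ten black crows this moves "all crows are black" from .4 up to 0.88, pulls the 75% hypothesis down to about 0.29, and leaves the 50% hypothesis below 0.01. A single non-black crow sends the 100% hypothesis to exactly zero, because the factor (1-x)^b vanishes.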

That’s enough deriving for now. As they say, the proof is in the pudding, so let me now present a few applications of this formula to show that it performs as expected.

This graphic illustrates how each possible hypothesis is confirmed or disconfirmed by 10 observations. The scale on the left represents our degree of confidence. The scale on the bottom represents the possible hypotheses, for example 20% is the hypothesis claiming that there will be 20% positives. And the lines track how each hypothesis is affected by a particular observed ratio of positives to negatives (in a total of 10 observations). Obviously I haven’t included all the possibilities, but 0% to 40% are the mirror image of 60% to 100% (as we would expect), so I have included just 40%. And, as we expect, each set of observations supports best the hypothesis that predicts that exact ratio. And, as expected, the 0% and 100% hypotheses are completely disconfirmed (reduced to 0) unless there are all negatives or all positives.
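The claim that each set of observations best supports the matching hypothesis can be checked with a little calculus (a sketch of the reasoning, not a new result). For fixed a, b and P(H), the confidence formula rises and falls with x^a*(1-x)^b, so it peaks wherever that expression peaks:
\frac{d}{dx}\left[x^a(1-x)^b\right] = x^{a-1}(1-x)^{b-1}\left[a(1-x) - b*x\right] = 0 \quad\Rightarrow\quad x = \frac{a}{a+b}
So 7 positives out of 10 favor the 70% hypothesis most. The mirror symmetry falls out of the same expression: swapping a with b and x with 1-x leaves x^a*(1-x)^b, and B(a+1,b+1), unchanged.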

This graphic represents how the confirmation value of particular hypotheses changes as more observations are made. The scale on the left represents the degree of confidence, the scale on the bottom represents the number of observations made, and each line represents a single hypothesis, that there is some percent of positives. The first graphic represents a situation where we find only positives, the second where we find 75% positives and the third where we find 50% positives. Obviously for the last two I consider the 99% hypothesis instead of the 100% hypothesis since the 100% hypothesis goes immediately to zero. As expected the “true” hypothesis is best confirmed, and as the number of observations increases it approaches, but never reaches, 1, while the other hypotheses approach zero.
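The all-positives case can be worked by hand, which makes the "approaches, but never reaches, 1" behavior easy to see. Take the 100% hypothesis (x = 1) with a positives and b = 0 negatives, reading 0^0 as 1. Then x^a*(1-x)^b = 1 and B(a+1,1) = \frac{1}{a+1}, so the confidence formula reduces to
P(H|E) = \frac{P(H)}{P(H) + \frac{1-P(H)}{a+1}} = \frac{(a+1)*P(H)}{(a+1)*P(H) + 1 - P(H)}
With P(H) = .4 this gives 4.4/5 = 0.88 after 10 positives and 40.4/41 \approx 0.985 after 100. The denominator always exceeds the numerator by 1 - P(H), so the value stays below 1 however many positives are seen.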

Another feature of this equation, which the images haven’t quite illustrated, is that hypotheses closer to 100% and 0% are the fastest to be confirmed or disconfirmed, while the 50% hypothesis is much less sensitive. But this is a feature, not a drawback. Logically if the 50% hypothesis were true we would expect a lot of variation, which makes deviation from the expected 50% not too uncommon. On the other hand, if our hypothesis is 99%, deviation from it should be much less common, and so it is more sensitive to observations diverging from it.
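One way to put numbers on this, using the standard binomial result rather than anything derived here: if the true rate of positives is x, the observed fraction of positives in n observations has standard deviation
\sigma = \sqrt{\frac{x*(1-x)}{n}}
For n = 100 that is 0.05 when x = 0.5 but only about 0.01 when x = 0.99. Seeing a fraction 5 points away from the prediction is therefore a one-sigma event under the 50% hypothesis and roughly a five-sigma event under the 99% hypothesis.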

To finish let me return to the earlier point I made, which is that this calculation doesn’t reflect the probability that the hypothesis is true given the observations, but rather gives a measure of confidence, which permits, for example, two nearby hypotheses, such as 50% and 55%, to both have a high degree of confidence. To find how likely it is that the truth lies within a certain range of hypotheses given the observations, we drop the P(H) factor (every hypothesis between 0 and 1 is weighted equally, as in the integration above) and solve the following integral:
\int_y^z \frac{x^a*(1-x)^b}{B(a+1,b+1)} \,dx
This comes out to be:
\frac{B(z;a+1,b+1) - B(y;a+1,b+1)}{B(a+1,b+1)}
Β(z;a,b) is the incomplete beta function, and it can be approximated by the infinite series:
z^a(\frac{1}{a} + \frac{1-b}{a+1} * z + ... + \frac{(1-b)*(2-b)*...*(n-b)}{n!*(a+n)} * z^n + ...)
Since Β(0;a+1,b+1) = 0 and Β(1;a+1,b+1) = Β(a+1,b+1) the integral from 0 to 1 is equal to 1, which is (as usual) expected, since the probability of some hypothesis being true is 100%. And that is about all the math we need for one day, so I’ll leave experimenting with this function as an exercise for the reader.
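As a starting point for that experimenting, here is a hedged Python sketch of the range calculation built on the series above. The names `incomplete_beta` and `range_probability` and the cutoff of 200 terms are my own choices. The series is only trustworthy for modest counts, because for large b its terms alternate in sign and cancel badly in floating point.

```python
def incomplete_beta(z, a, b, terms=200):
    """B(z; a, b) from the series z^a * (1/a + (1-b)/(a+1) * z + ...).

    For whole-number b the series ends after b terms, so the sum is exact
    up to rounding. Intended for small counts only.
    """
    total = 0.0
    coeff = 1.0  # (1-b)(2-b)...(n-b) / n!
    power = 1.0  # z^n
    for n in range(terms):
        total += coeff * power / (a + n)
        coeff *= (n + 1 - b) / (n + 1)
        power *= z
    return z ** a * total


def range_probability(y, z, a, b):
    """Probability that the true rate lies between y and z, given a
    positives and b negatives, with all rates weighted equally beforehand."""
    full = incomplete_beta(1.0, a + 1, b + 1)  # equals B(a+1, b+1)
    upper = incomplete_beta(z, a + 1, b + 1)
    lower = incomplete_beta(y, a + 1, b + 1)
    return (upper - lower) / full


if __name__ == "__main__":
    # After 10 black crows in a row: how likely is it that 90% or more are black?
    print(range_probability(0.9, 1.0, 10, 0))
    # After 7 positives and 3 negatives: how likely is a rate between 60% and 80%?
    print(range_probability(0.6, 0.8, 7, 3))
```

For the all-black case the answer has a closed form, 1 - 0.9^11, or about 0.69, which is a handy check on the code.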


  1. Wow, elegant and beautifully done! Not a mathematician, so can’t comment on validity, but I’m sure you know your stuff.
    And what I took from it, translated to another form of language than mathematics (and wow, are you fluent!), is that:
    The more one observes the same or similar thing occurring over time, the stronger the probability that you can count on your observations telling you something about the phenomenon that you can rely on as being true, or consistent. There are always exceptions, but what you end up with is a “working knowledge” about something, which is not the same as absolute knowledge (yes, I know: there is none of that available in real life except as the Greeks would have us understand: you can only know for certain that you can know absolutely nothing for 100% certain) (and I think your graphs proved that out, too). Probabilities, however, are our friends!
    Becomes important if one wants to be able to have a reasonable certainty of predicting behaviors or occurrences in the future. Helps one make plans. :)
    But how do you feel about Heisenberg’s Uncertainty Principle — do you think the act of observation affects phenomena? Thanks,

    Comment by Monica Englander msw — August 13, 2007 @ 9:17 am
