On Philosophy

July 11, 2006

Information Theory for Philosophers (Part 1)

Filed under: Epistemology,Information — Peter @ 1:12 am

What is information? That is a harder question to answer than you might think. So let me propose the following definition: information is the ability to deduce which state a set of possible states of affairs is in. For example if your room is well lit, and it is dark outside, you have some information, namely that your light is on. As you can tell from this example we usually talk about information in the context of a message. The message contains the information that allows us to deduce a given state of affairs. (Note here I will be using the term “set of states of affairs” to mean any situation that can possibly be in two or more exclusive states. For example the light can be on or the light can be off, each of these is a single state of affairs, and taken together they form a set of complete, and mutually exclusive, possible states of affairs.)

Not all messages are created equal, however. Some contain more information and some contain less. The exact amount is determined by the following factors: how many different sets of states of affairs the message tells us about, how the sets of states of affairs are related to each other, and how likely a given state of affairs is in the absence of any information about it. I think it is obvious why it is reasonable to consider a message telling us about more sets of states of affairs as containing more information than one that tells us about fewer. The other two factors may seem less obvious. The best way to understand how they influence the amount of information contained in a message is to consider what can be known (or guessed) in the absence of the message. For example, if two sets of states of affairs are connected (in any way), then the probability of a given state of affairs, say A, will be increased or decreased by knowledge about a state of affairs, say B, that belongs to the connected set (Pr(A) != Pr(A|B) and Pr(A) != Pr(A|~B)). Thus if the message tells us about B, telling us about A as well is somewhat redundant, since there was at least some information we could have deduced about A given B alone. A message that contains information about two related states of affairs, A and B, therefore contains less information than one that contains information about two independent states of affairs, say A and C. Finally, if a state of affairs, say Z, has a probability other than 1/2, then we already have some knowledge about it, and thus a message containing information about Z will contain less information than one that contains information about a state of affairs that has a probability of exactly 1/2.

Finally, before we begin drawing up equations, we must decide on what kind of scale we want to use to quantify information. I think it is reasonable to quantify it by the number of independent sets of states of affairs with two equally likely possibilities that it can inform us about (sets in the form of A or ~A). A message that told us about one such set (two possibilities) would have 1 unit of information. A message that told us about 2 independent sets (4 equally likely possible states) would have 2 units of information. Since doubling the number of possible states adds only one unit, we can already see that we will need some sort of logarithmic scale.

Let us define the information provided by a message as follows:

$$I = -\sum_{x} \Pr(x) \log_2 \Pr(x)$$

Pr(x) is the probability of a given message being produced. In this equation x represents a single message, and ranges over all the possible messages in the summation. It is probably clear to you how this equation might capture how the probability of an event can influence the information provided by a message, and how this scale is logarithmic, but not necessarily how events that depend on one another are taken into account. Dependence is indeed included in this formula, which you can see by considering the following: if an event B is made more probable by an event A, then a message where both B and A occur will be more probable than a message where A is indicated and B is not. This will then alter the total distribution of message probabilities, and hence the result of the calculation.
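As a rough illustration, the definition can be written as a few lines of Python. This is only a sketch: the function name `information` is my own, and it assumes the measure is the base-2 sum described above, written in the equivalent form Pr(x) log2(1/Pr(x)) so that a certain message yields a clean zero.

```python
import math

def information(probabilities):
    """Information, in two-way (A or ~A) units, of a message whose possible
    values occur with the given probabilities.

    Assumes the probabilities describe mutually exclusive messages and sum to 1.
    """
    probabilities = list(probabilities)
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # p * log2(1/p) equals -p * log2(p); messages that never occur add nothing.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# One set of states of affairs with two equally likely possibilities.
print(information([0.5, 0.5]))  # 1.0
```

Skipping zero-probability messages is a convention rather than a trick: a message that can never be produced cannot tell us anything.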

For example, let us consider a message that carries information concerning two independent sets of states of affairs. Thus we have four possible messages: A & B, ~A & B, A & ~B, and ~A & ~B, each of which occurs with probability 1/4. Thus we compute the information contained in a message as follows:

$$I = -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2$$

Now let us consider what would happen if events A and B were completely dependent on each other. Even though there are still four possibilities, A & B, ~A & B, A & ~B, and ~A & ~B, only A & B and ~A & ~B have a non-zero probability of occurring (1/2 each). Now we calculate the information as follows:

$$I = -2 \cdot \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1$$

This should be an intuitive result since A & B can be seen as corresponding to a single state of affairs, C, and ~A & ~B as corresponding to ~C (since the other possibilities are excluded).

Next let us consider a more complicated situation of dependence, where B always occurs when A occurs, but only occurs half the time when A does not occur. There are only three possibilities that can occur in this situation: A & B with probability 1/2, ~A & B with probability 1/4, and ~A & ~B with probability 1/4. (The probabilities need not follow this distribution; I have simply picked them for the sake of the example.) The calculation proceeds as follows:

$$I = -\left( \tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{4} \log_2 \tfrac{1}{4} + \tfrac{1}{4} \log_2 \tfrac{1}{4} \right) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5$$

This should also make intuitive sense, because half the time the events are completely dependent on each other (information amount 1), and half the time they are independent (information amount 2). Thus, as we expect, the information value for the message in this type of situation is between the two possibilities.
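The three dependence cases above can be checked numerically. The sketch below assumes the information measure is the base-2 sum over message probabilities, Pr(x) log2(1/Pr(x)); the dictionary layout, with keys of the form (A occurs, B occurs), is simply my choice for the illustration.

```python
import math

def information(probabilities):
    # Sum of p * log2(1/p) over all messages that can actually occur.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# Each key is (A occurs?, B occurs?); each value is that message's probability.
independent = {
    (True, True): 0.25, (False, True): 0.25,
    (True, False): 0.25, (False, False): 0.25,
}
fully_dependent = {(True, True): 0.5, (False, False): 0.5}
partly_dependent = {(True, True): 0.5, (False, True): 0.25, (False, False): 0.25}

for name, messages in [
    ("independent", independent),
    ("fully dependent", fully_dependent),
    ("partly dependent", partly_dependent),
]:
    print(name, information(messages.values()))
# independent 2.0
# fully dependent 1.0
# partly dependent 1.5
```

Messages with zero probability (such as A & ~B in the last two cases) are simply left out of the dictionaries, which is why the dependence shows up as a smaller total.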

With the dependence between various states of affairs out of the way, it is now time to discuss how the probability of a single state of affairs occurring can influence the amount of information a message contains. First let us consider a state of affairs that is certain (perhaps something of the form “the man is 6’ tall or the man is not 6’ tall”). The probability of such a state of affairs is 1. Thus we calculate the information a message about it can contain as:

$$I = -1 \cdot \log_2 1 = 0$$

And as we expect we find out that a message cannot provide us with any more information than we already had.

Finally, let us consider a set of states of affairs which can be in one of two possible alternatives, and in which one alternative is more likely than the other. In this calculation let us say that the probability of A is 3/4 and the probability of ~A is 1/4.

$$I = -\left( \tfrac{3}{4} \log_2 \tfrac{3}{4} + \tfrac{1}{4} \log_2 \tfrac{1}{4} \right) \approx 0.811$$

As I mentioned earlier, this shows that a message containing information about A or ~A contains less information when one state of affairs is more likely than the other than when they are both equally likely (the result would have been 1 in that case). This is because even in the absence of the message we have some information about A and ~A, namely that A is more likely, and so the message about them reduces our ignorance less than in the case of equally likely alternatives.
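To see how the information in a single A-or-~A message varies with the probability of A, here is a small sketch. The helper name `binary_information` is hypothetical, and the code assumes the base-2 measure Pr(x) log2(1/Pr(x)) summed over the two alternatives.

```python
import math

def binary_information(p_a):
    """Information in a message about A versus ~A when Pr(A) = p_a."""
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    alternatives = [p_a, 1.0 - p_a]
    # An alternative with probability 0 contributes nothing to the sum.
    return sum(p * math.log2(1 / p) for p in alternatives if p > 0)

for p_a in (1.0, 0.75, 0.5):
    print(p_a, round(binary_information(p_a), 3))
# 1.0 0.0
# 0.75 0.811
# 0.5 1.0
```

The value peaks at Pr(A) = 1/2 and falls to zero as A becomes certain, which matches the two examples worked through above.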

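Well, enough with the examples. Now let me prove some interesting facts about this formula. First off, let me define the function I(m) as the amount of information that is given by a message; its value is calculated as given above. Now consider two messages, m1 and m2. It is possible to form a third message by combining the two (in fact one could think of all messages as being built this way, from primitive components that give information concerning a single set of states of affairs). Let us call the information from this combined message I(m1, m2). How is this related to I(m1) and I(m2)? Examine then the following manipulation:

$$
\begin{aligned}
I(m_1, m_2) &= -\sum_{m_1, m_2} \Pr(m_1, m_2) \log_2 \Pr(m_1, m_2) \\
&= -\sum_{m_1, m_2} \Pr(m_1, m_2) \log_2 \left[ \Pr(m_1) \Pr(m_2 \mid m_1) \right] \\
&= -\sum_{m_1} \Pr(m_1) \log_2 \Pr(m_1) - \sum_{m_1, m_2} \Pr(m_1, m_2) \log_2 \Pr(m_2 \mid m_1) \\
&= I(m_1) + I(m_2 \mid m_1)
\end{aligned}
$$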

(within the summation m1 and m2 range over all possible messages that they can bear)
What is I(m2 | m1)? It is a measure of how much new information m2 gives us if we already have m1. It is trivial to show that when the states of affairs that m2 provides information about are independent of the states of affairs determined by m1, then I(m2 | m1) = I(m2), and hence I(m1, m2) = I(m1) + I(m2) (but only when they are independent).
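The decomposition I(m1, m2) = I(m1) + I(m2 | m1) can be checked on the earlier example in which B always occurs with A but only half the time without it. This is a sketch under my own assumptions: the helper names `marginal` and `conditional_information` are hypothetical, and I(m2 | m1) is taken to be the average, over the values of m1, of the information left in m2 once m1 is known.

```python
import math
from collections import defaultdict

def information(probabilities):
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

def marginal(joint, index):
    """Probabilities of one component (0 for m1, 1 for m2) of a combined message."""
    totals = defaultdict(float)
    for outcome, p in joint.items():
        totals[outcome[index]] += p
    return dict(totals)

def conditional_information(joint, given):
    """Average information left in the other component once `given` is known."""
    result = 0.0
    for value, p_value in marginal(joint, given).items():
        # Distribution of the other component, restricted to this value.
        rest = [p / p_value for outcome, p in joint.items() if outcome[given] == value]
        result += p_value * information(rest)
    return result

# Keys are (A occurs?, B occurs?): m1 reports on A, m2 reports on B.
joint = {(True, True): 0.5, (False, True): 0.25, (False, False): 0.25}

i_joint = information(joint.values())
i_m1 = information(marginal(joint, 0).values())
i_m2_given_m1 = conditional_information(joint, given=0)
print(i_joint, i_m1, i_m2_given_m1)  # 1.5 1.0 0.5
```

Here knowing A leaves nothing to learn about B when A occurs, and one full unit when it does not, so on average m2 adds half a unit on top of m1.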

We can also define the mutual information between two messages as I(m1 : m2) = I(m1) + I(m2) – I(m1, m2) = I(m1) – I(m1 | m2) = I(m2) – I(m2 | m1). This is a measure of how much information two messages share in common. Note that if the messages contain information about independent states of affairs then I(m1 : m2) = 0. For completeness I should also mention that the information distance between two messages is defined as D(m1, m2) = I(m1, m2) – I(m1 : m2) = I(m1 | m2) + I(m2 | m1).
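A short sketch of these two quantities, computed directly from the definitions I(m1 : m2) = I(m1) + I(m2) – I(m1, m2) and D(m1, m2) = I(m1, m2) – I(m1 : m2). The function names and the dictionary representation of a combined message are assumptions made for the illustration, and the measure is taken to be the base-2 sum Pr(x) log2(1/Pr(x)).

```python
import math
from collections import defaultdict

def information(probabilities):
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

def marginal(joint, index):
    """Probabilities of one component (0 for m1, 1 for m2) of a combined message."""
    totals = defaultdict(float)
    for outcome, p in joint.items():
        totals[outcome[index]] += p
    return dict(totals)

def mutual_information(joint):
    i_m1 = information(marginal(joint, 0).values())
    i_m2 = information(marginal(joint, 1).values())
    return i_m1 + i_m2 - information(joint.values())

def information_distance(joint):
    return information(joint.values()) - mutual_information(joint)

# Keys are (A occurs?, B occurs?).
independent = {
    (True, True): 0.25, (False, True): 0.25,
    (True, False): 0.25, (False, False): 0.25,
}
fully_dependent = {(True, True): 0.5, (False, False): 0.5}

print(mutual_information(independent), information_distance(independent))          # 0.0 2.0
print(mutual_information(fully_dependent), information_distance(fully_dependent))  # 1.0 0.0
```

The two extremes behave as one would hope: independent messages share nothing and are as far apart as possible, while completely dependent messages share everything and are at distance zero.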

Finally, something to prepare you for next time: so far I have assumed that two different states of affairs result in two different messages, and that the messages are always 100% accurate (a given state of affairs always results in the same message). When these assumptions are violated we have to deal with a phenomenon called noise, which I will cover next time.

If you are familiar with standard information theory you may be wondering how the account I have given here differs from that standard account. Well for one thing I like to think that mine is much easier to read. Ok, seriously: It is true that I use many of the same equations and definitions from standard information theory. The difference however can be seen as follows: I consider information from a standpoint of what a message can tell us about states of affairs, while standard information theory is concerned with how much information is contained within a continuous sequence of messages. This leads to the following differences: in my account the probabilities in question are how likely a combination of states of affairs is to occur. Secondly, I don’t have to worry about the length of a message or its particular encoding since we aren’t concerned with channel capacity here. In standard information theory the probabilities in question are how often a certain sequence occurs in the stream of information. Because it is not assumed in the standard theory that we know the message size or even how to pull apart the stream into messages more complicated equations must be introduced (for example dealing with Markov sources). This is needed for the computer science uses of information theory, but not so much for us as philosophers. Why do I say that we don’t need to look at it from this perspective? Primarily because we will be using it to prove interesting facts about perception and knowledge about the world, and in these cases the messages are pretty easy to distinguish from each other (i.e. separate experiences). Likewise, channel capacity will often be irrelevant.

That doesn’t mean you shouldn’t study standard information theory if it is of interest to you. I found the following resources helpful: Wikipedia on information theory, Wikipedia on information entropy, David MacKay’s lecture notes, and C. E. Shannon’s original paper (pdf).

Part 2
