Last time I defined information and showed how we could quantify it in ideal situations (the ideal situation being that whenever we receive a message we can deduce with complete confidence which member of a set of states of affairs caused it). Now let us turn to information and messages as they are actually found “in the wild”, where a received message leaves us less than perfectly certain that a given state of affairs was its cause. For example, in some situations there may be a 10% chance that the message is turned into one of the other possible messages by some random or unaccounted-for process. A given message can then only tell us that there is at best a 90% chance of a particular state of affairs (Pr(s | m) = 0.9), with the other states of affairs sharing the remaining probability. Clearly such a message provides less new information than one that lets us be absolutely certain which state of affairs caused it. But how much less?
In situations where we are less than 100% certain that a given state of affairs is the cause of any message, we compute the information provided by a message as I(s : m). As you will remember from last time, this is the mutual information between two sets of events, in this case the messages and the states of affairs. Earlier we defined: I(s : m) = I(m) – I(m | s) = I(s) – I(s | m) = I(s) + I(m) – I(s, m). Using this definition we can derive a simpler equation for I(s : m), as follows:
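A sketch of that derivation, written out in LaTeX. It assumes that I(x) from last time is the average information –Σ Pr(x) log2 Pr(x), measured in bits, and that I(s, m) is the same quantity taken over the joint probabilities Pr(s, m). Both readings are assumptions of this sketch rather than something restated here.

```latex
% Assumes I(x) = -\sum_x \Pr(x) \log_2 \Pr(x) (average information, in bits).
% The third line uses \Pr(s) = \sum_m \Pr(s, m) and \Pr(m) = \sum_s \Pr(s, m).
\begin{align*}
I(s : m) &= I(s) + I(m) - I(s, m) \\
 &= -\sum_{s} \Pr(s) \log_2 \Pr(s) - \sum_{m} \Pr(m) \log_2 \Pr(m)
    + \sum_{s, m} \Pr(s, m) \log_2 \Pr(s, m) \\
 &= \sum_{s, m} \Pr(s, m) \bigl[ \log_2 \Pr(s, m) - \log_2 \Pr(s) - \log_2 \Pr(m) \bigr] \\
 &= \sum_{s, m} \Pr(s, m) \log_2 \frac{\Pr(s, m)}{\Pr(s) \Pr(m)}
\end{align*}
```

Since Pr(s, m) = Pr(m) Pr(s | m), the fraction inside the logarithm can equally be written Pr(s | m) / Pr(s), which is the form that speaks most directly to how much a message changes our estimate of a state of affairs.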
This formalization captures the phenomenon I earlier described as “noise”: either problems in the transmission of information (the colloquial sense of noise) or simply situations where two states of affairs may have the same results create uncertainty as to which state of affairs was the cause of the events (messages) that we have observed (Pr(s | m) < 1).
This formalization also satisfies our mathematical and practical intuitions concerning the phenomenon of “noise”. For example, if the messages are completely independent of the states of affairs, then we would expect the information content of those messages to be zero, and indeed, as shown below, this is the case.
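A sketch of the independence case in LaTeX, taking mutual information in its standard form as a sum over the joint probabilities, with logarithms to base 2 (an assumption about the units used here). Independence means exactly that Pr(s, m) = Pr(s) Pr(m) for every pair.

```latex
% Independence: \Pr(s, m) = \Pr(s) \Pr(m) for every s and m,
% so every logarithm in the sum is \log_2 1 = 0.
\begin{align*}
I(s : m) &= \sum_{s, m} \Pr(s, m) \log_2 \frac{\Pr(s, m)}{\Pr(s) \Pr(m)} \\
 &= \sum_{s, m} \Pr(s) \Pr(m) \log_2 \frac{\Pr(s) \Pr(m)}{\Pr(s) \Pr(m)} \\
 &= \sum_{s, m} \Pr(s) \Pr(m) \cdot 0 = 0
\end{align*}
```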
Likewise, when each message corresponds exclusively to a single state of affairs (Pr(s_n | m_n) = 1), the result is the standard definition of information without noise (as presented in part 1), just as we expect. Again, the steps below demonstrate this fact.
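A sketch of the noiseless case in LaTeX, starting from the identity I(s : m) = I(s) – I(s | m). It assumes that I(s | m) is the average of –log2 Pr(s | m) over all pairs, and it uses the usual convention that 0 log 0 counts as 0.

```latex
% Noiseless case: for each message m, \Pr(s | m) is 1 for exactly one state
% and 0 for all the others. Convention: 0 \log_2 0 = 0.
\begin{align*}
I(s \mid m) &= -\sum_{m} \Pr(m) \sum_{s} \Pr(s \mid m) \log_2 \Pr(s \mid m) \\
 &= -\sum_{m} \Pr(m) \bigl[ 1 \cdot \log_2 1 + 0 + \dots + 0 \bigr] = 0 \\
I(s : m) &= I(s) - I(s \mid m) = I(s) = -\sum_{s} \Pr(s) \log_2 \Pr(s)
\end{align*}
```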
The following facts can also be proven, which I will leave as an exercise for the reader (since I am tired of typing long equations in LaTeX).
Now, let’s apply this formula to get a sense of how it actually works. Consider the following situation: there are two equally likely states of affairs, A and ~A. Likewise there are two messages, W and ~W. 3/4 of the time state of affairs A results in message W, and 3/4 of the time state of affairs ~A results in message ~W. Because the two states are equally likely, the same figures hold in the other direction (Pr(A|W) = 3/4, Pr(~A|W) = 1/4, Pr(A|~W) = 1/4, Pr(~A|~W) = 3/4). We calculate the information provided by the messages in this situation as:
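The arithmetic can be sketched in LaTeX as follows, using the standard sum over joint probabilities with base-2 logarithms (so the answer is in bits). The joint probabilities come from the figures above: each state has probability 1/2, so Pr(A, W) = Pr(~A, ~W) = 3/8 and Pr(A, ~W) = Pr(~A, W) = 1/8, while each message has probability 1/2.

```latex
% \lnot A stands for the state written ~A in the text.
% Every product \Pr(s) \Pr(m) equals 1/2 * 1/2 = 1/4.
\begin{align*}
I(s : m) &= \sum_{s, m} \Pr(s, m) \log_2 \frac{\Pr(s, m)}{\Pr(s) \Pr(m)} \\
 &= 2 \cdot \tfrac{3}{8} \log_2 \frac{3/8}{1/4} + 2 \cdot \tfrac{1}{8} \log_2 \frac{1/8}{1/4} \\
 &= \tfrac{3}{4} \log_2 \tfrac{3}{2} + \tfrac{1}{4} \log_2 \tfrac{1}{2} \\
 &\approx 0.75 \times 0.585 - 0.25 \approx 0.189 \text{ bits}
\end{align*}
```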
This may strike you as low, but consider the following: after receiving a message we gauge the probability of A (or of ~A) to be only .25 higher than we did before, 0.75 instead of 0.5. Seen that way, the information value of 0.189 bits shouldn’t seem that unexpected, although I do admit that it does seem a bit low, even to me. What does this tell us about information? Simply that when further layers of complexity are added to the situation our intuitions become much less reliable. (The situation here is similar to a first encounter with Bayes’ theorem, the results of which can seem similarly contrary to common sense.)
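For readers who would rather check the number than trust their intuitions, here is a minimal Python sketch. The function names `mutual_information` and `symmetric_channel` are my own, invented for illustration, and the code assumes base-2 logarithms and the two-state, two-message setup of the example, generalized so that the 3/4 figure can be varied.

```python
from math import log2


def mutual_information(joint):
    """Return I(s : m) in bits, where joint[i][j] = Pr(s_i, m_j)."""
    pr_s = [sum(row) for row in joint]        # marginal Pr(s_i)
    pr_m = [sum(col) for col in zip(*joint)]  # marginal Pr(m_j)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # convention: a zero-probability pair contributes 0
                total += p * log2(p / (pr_s[i] * pr_m[j]))
    return total


def symmetric_channel(reliability):
    """Joint table for two equally likely states, where each state
    produces its own message with probability `reliability`."""
    r = reliability
    return [[r / 2, (1 - r) / 2],
            [(1 - r) / 2, r / 2]]


for r in (0.5, 0.75, 0.9, 1.0):
    bits = mutual_information(symmetric_channel(r))
    print(f"reliability {r:.2f}: {bits:.3f} bits")
# reliability 0.50: 0.000 bits
# reliability 0.75: 0.189 bits
# reliability 0.90: 0.531 bits
# reliability 1.00: 1.000 bits
```

The two end points match the limiting cases discussed earlier: messages that are independent of the states carry no information, and perfectly reliable messages carry the full one bit. In between, the value climbs slowly at first, which is why 3/4 reliability yields so little.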
With respect to information theory as used in computer science, this is only the tip of the mathematical iceberg as far as the formalization of noise is concerned. However, as philosophers we can leave off here, since we are not particularly interested in compression over a noisy channel or error-correcting codes (not that such topics aren’t interesting). As for applications of this formalization, recall that I earlier mentioned that information could be used to formalize ideas about knowledge with respect to observations (specifically, about how much an observation can tell us). In real life, however, a given event may have many possible causes (noise), and thus the formalization presented here is required for a thorough analysis. Of course, a serious application of these concepts will have to wait for another post, since it is a rather large undertaking, to say the least.