On Philosophy

February 3, 2007

Information And Causation

Filed under: Information,Intentionality — Peter @ 12:00 am

Information, as defined by information theory, seems ultimately to reduce to the causal relations between a system and the objects it has information about. And if this were how information was necessarily defined it would pose a problem for informational functionalism, since it would make it a kind of causal functionalism. And causal functionalism isn’t a workable theory about the mind since it is an externalist theory.

The best response to this problem is not to reject information theory, but rather to realize that what it describes by the use of the word information and what informational functionalism describes by the use of the word information are not exactly the same thing. Let me illustrate the difference with an example. Consider then a piece of paper on the ground that says “some elephants are red”, which, unbeknownst to you, was generated by a random process. Does that paper convey any information to you? Information theory says no, that since there is no reliable causal channel between elephant color and the writing on the paper it cannot convey information to you. In contrast informational functionalism would say yes, that the paper conveys the information that elephants are red (assuming you can read), but that the information happens to be wrong.

So where is the difference? Well, what is meant by information in information theory is really what we would call reliable information in everyday discourse, or perhaps knowledge. For information to be reliable there must be a connection between the actual state of affairs and the information about that state of affairs. And so when there is no connection there is no information in this sense. But when we talk about information in the context of informational functionalism we certainly don’t mean reliable information. Much of the information that is part of a system may be inaccurate, distorted, or otherwise unreliable.

Obviously the kind of information that informational functionalist theories refer to will have to be relativized to a system (which is why the paper only conveyed information about elephants if it could be read). So, as a starting point, let me assume then that we have a way of determining the intentional directedness of a system given all the facts about its operation (with the further assumption that we define intentionality in an internalist fashion). (I presented one theory about how this might be done in my thesis draft. Essentially the proposal is that what corresponds to intentionality in a system is an internal structure that encodes input-output correlations. For example, the internal structure that corresponds to being intentionally directed at a tree, or more specifically, the structure that is activated when a system is intending a tree, contains the inputs that the system would receive from perceiving a tree and how those inputs might change if the system took various actions. Of course intentionality may be directed at non-perceptual things as well. For example, the intentional structure that is directed at numbers has as its inputs various mathematical objects and describes how those mathematical objects act as a result of being operated on mathematically. Naturally the mathematical inputs and outputs of this structure are themselves abstracted from the many possible perceptual inputs and behavioral outputs that are the vehicles with which we deal with numbers in the world, in contrast to the structure that corresponded to being intentionally directed at a tree, which did not deal with input abstractions, but rather direct perceptual inputs.) Information then corresponds, roughly, to intentionality. To say that a system receives new information is to say that it becomes intentionally directed at something new. And the intentional habits (or abilities) that already exist in the system are the information it already contains.

For example, my mind contains information about unicorns because I have the capacity to be intentionally directed at them (well, at possible unicorns, since real unicorns don’t exist). Which means that I have a conception of unicorns in which they have certain properties (horse-like appearance, one horn); under almost all theories about intentionality this is what it means to be intentionally directed at something (conceiving of it as having certain properties). My mind also contains the information that unicorns don’t really exist. This isn’t so much a property of my being intentionally directed at unicorns, since when we think about anything we think of it as existing, but a property of my being intentionally directed at the real world, which contains the expectation that I won’t find any unicorns in it, or evidence that unicorns really exist.

So, back to the original example, the paper that states “some elephants are red” contains information, to a system that can understand the words, because when such a system reads the paper an intentional structure that is directed at elephants that have the potential to be red (or possibly at red elephants) is either created or brought to mind, and so we would say that the paper conveys that information to that system.

And so information can be defined in a way that is independent of causation, assuming that you accept a definition of intentionality that is independent of causation.

July 13, 2006

Information Theory for Philosophers (Part 2)

Filed under: Epistemology,Information — Peter @ 1:12 am

Part 1

Last time I defined information and showed how we could quantify it in ideal situations (the ideal situation being that whenever we receive a message we can deduce with complete confidence which member of a set of states of affairs caused it). Now let us turn to information and messages as they are actually found “in the wild”, where when we receive a message we have less than perfect certainty that a given state of affairs was the cause. For example in some situations there may be a 10% chance that the message is turned into one of the other possible messages, due to some random or unaccounted for process. A given message then can only allow us to deduce that there was at best 90% chance of a state of affairs, with the other states of affairs also having small probability. (Pr(s | m) = 0.9). Clearly then we are provided with less new information from such a message than when we could be absolutely certain that a given state of affairs was the cause of a message. But how much less?

In situations where we are less than 100% certain that a given state of affairs is the cause of any message we compute the information provided by a message as I(m : s). As you remember from last time this is the mutual information contained between two sets of events, in this case the messages and the states of affairs. Earlier we defined: I(s : m) = I(m) – I(m | s) = I(s) – I(s | m) = I(s) + I(m) – I(s, m) . Using this definition we can derive a simpler equation for I(s : m), as follows:
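A minimal sketch of that computation, assuming the simpler equation is the standard closed form I(s : m) = Σ Pr(s, m) log2( Pr(s, m) / (Pr(s) Pr(m)) ), which follows from I(s) + I(m) – I(s, m). The function name, the dictionary layout, and the 10%-noise channel are illustrative assumptions rather than anything fixed by the definitions above.

```python
from math import log2

def mutual_information(joint):
    """I(s : m) in bits, from a dict {(state, message): probability}."""
    pr_s, pr_m = {}, {}
    for (s, m), p in joint.items():
        pr_s[s] = pr_s.get(s, 0.0) + p   # marginal over states
        pr_m[m] = pr_m.get(m, 0.0) + p   # marginal over messages
    return sum(p * log2(p / (pr_s[s] * pr_m[m]))
               for (s, m), p in joint.items() if p > 0)

# Hypothetical channel: two equally likely states, and a 10% chance
# that the message is turned into the other possible message.
joint = {("s1", "m1"): 0.45, ("s1", "m2"): 0.05,
         ("s2", "m1"): 0.05, ("s2", "m2"): 0.45}
print(round(mutual_information(joint), 3))  # 0.531
```

So in this assumed case a message that leaves us 90% confident delivers only about half of the one unit a perfectly reliable message would.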

This formalization captures the phenomenon I have earlier described as “noise”: either problems in the transmission of information (the colloquial use of noise), or simply situations where two states of affairs may have the same results, create uncertainty as to which state of affairs was the cause of the events (messages) that we have observed (Pr(s | m) < 1).

This formalization also satisfies our mathematical and practical intuitions concerning the phenomenon of “noise”. For example if the messages are completely independent of the states of affairs then we would expect the information content of those messages to be zero, and indeed, as shown below, this is the case.
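A numerical check of the independent case (a check on one example, not a proof), using I(s : m) = I(s) + I(m) – I(s, m) and assuming I is the usual –Σ Pr(x) log2 Pr(x). The particular probabilities are made up for illustration.

```python
from math import log2

def entropy(probs):
    """Information in bits for a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

pr_s = [0.5, 0.5]   # states of affairs
pr_m = [0.7, 0.3]   # messages, assumed unrelated to the states

# Independence means every joint probability is a simple product.
joint = [ps * pm for ps in pr_s for pm in pr_m]

i_sm = entropy(pr_s) + entropy(pr_m) - entropy(joint)
print(abs(i_sm) < 1e-9)  # True: the messages tell us nothing about the states
```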

Likewise when each message corresponds exclusively with a single state of affairs (Pr(s-n | m-n) = 1) then the result is the standard definition of information without noise (as presented in part 1), again just as we expect. Again, the steps below demonstrate this fact.
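The noiseless case can be checked the same way on a small example (again only an illustration, with an assumed three-state distribution and the usual –Σ Pr(x) log2 Pr(x) form for I). When each message pairs with exactly one state, the joint distribution has the same probabilities as the states themselves, so I(s : m) collapses to I(s).

```python
from math import log2, isclose

def entropy(probs):
    """Information in bits for a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

pr_s = [0.5, 0.25, 0.25]   # assumed state probabilities
pr_m = list(pr_s)          # message n occurs exactly when state n does
joint = list(pr_s)         # only the matched (s-n, m-n) pairs are possible

i_sm = entropy(pr_s) + entropy(pr_m) - entropy(joint)
print(i_sm, entropy(pr_s))  # 1.5 1.5
```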

The following facts can also be proven, which I will leave as an exercise for the reader (since I am tired of typing long equations in LaTeX).

Now, let’s apply this formula to get a sense of how it actually works. Consider the following situation: there are two equally likely states of affairs, A and ~A. Likewise there are two messages, W and ~W. 3/4 of the time state of affairs A results in message W and 3/4 of the time state of affairs ~A results in message ~W (Pr(A|W) = 3/4, Pr(~A|W) = 1/4, Pr(A|~W) = 1/4, Pr(~A|~W) = 3/4). We calculate the information provided by the messages in this situation as:
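The arithmetic can be sketched in Python using the form I(s : m) = I(s) – I(s | m), where I(s | m) is taken to be the information still missing about the state once a message is in hand, averaged over the messages. The variable names are illustrative.

```python
from math import log2

def entropy(probs):
    """Information in bits for a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

pr_message = {"W": 0.5, "~W": 0.5}
# [Pr(A | message), Pr(~A | message)]
pr_state_given = {"W": [0.75, 0.25], "~W": [0.25, 0.75]}

i_s = entropy([0.5, 0.5])  # 1 unit before any message arrives
i_s_given_m = sum(pr_message[m] * entropy(pr_state_given[m])
                  for m in pr_message)

info = i_s - i_s_given_m
print(round(info, 3))  # 0.189
```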

This may strike you as low, but consider the following: we gauge the probability of A (or of ~A) to be only .25 higher than we did before we received a given message. Now the information value of 0.189 shouldn’t seem that unexpected, although I do admit that it does seem a bit low, even to me. What does this tell us about information? Simply that when further layers of complexity are added to the situation our intuitions become much less reliable. (The situation here is similar to a first encounter with Bayes’ theorem, the results of which can seem similarly contrary to common sense.)

With respect to information theory as used in computer science this is only the tip of the mathematical iceberg as far as the formalization of noise is concerned. However as philosophers we can leave off here, since we are not particularly interested in compression over a noisy channel or error correcting codes (not that such topics aren’t interesting). As for applications of this formalization, consider how I earlier mentioned that information could be used to formalize ideas about knowledge in respect to observations (specifically about how much an observation can tell us). In real life however a given event may have many possible causes (noise), and thus the formalization presented here is required for a thorough analysis. Of course a serious application of these concepts will have to wait for another post, since it is a rather large undertaking, to say the least.

July 11, 2006

Information Theory for Philosophers (Part 1)

Filed under: Epistemology,Information — Peter @ 1:12 am

What is information? That is a harder question to answer than you might think. So let me propose the following definition: information is the ability to deduce which state a set of possible states of affairs is in. For example if your room is well lit, and it is dark outside, you have some information, namely that your light is on. As you can tell from this example we usually talk about information in the context of a message. The message contains the information that allows us to deduce a given state of affairs. (Note here I will be using the term “set of states of affairs” to mean any situation that can possibly be in two or more exclusive states. For example the light can be on or the light can be off, each of these is a single state of affairs, and taken together they form a set of complete, and mutually exclusive, possible states of affairs.)

Not all messages are created equal however. Some contain more information and some contain less. The exact amount is determined by the following factors: how many different sets of states of affairs the message tells us about, how the sets of states of affairs are related to each other, and how likely a given state of affairs is in absence of any information about it. I think it is obvious why it is reasonable to consider a message telling us about more sets of states of affairs as containing more information than one that tells us about fewer. The other two factors may seem less obvious. The best way to understand how they have an influence on the amount of information contained in a message is to consider what can be known (or guessed) in the absence of the message. For example if two sets of states of affairs are connected (in any way) then the probability of a given state of affairs, say A, will be increased or decreased by knowledge about a state of affairs, say B, that belongs to a connected set (Pr(A) != Pr(A|B) and Pr(A) != Pr(A|~B)). Thus if the message tells us about B, telling us about A is somewhat redundant, since there was at least some information we could have deduced about A given B alone. A message that contains information about two related states of affairs, A and B, definitely contains less information than one that contains information about two independent states of affairs, say A and C. Finally if a state of affairs, say Z, is more or less probable than 1/2 then we already have some knowledge about it, and thus a message containing information about Z will contain less information than one that contains information about a state of affairs that has only a probability of 1/2.

Finally, before we begin drawing up equations, we must decide on what kind of scale we want to quantify information. I think it is reasonable to quantify it by the number of independent sets of states of affairs with two possibilities that it can inform us about (sets in the form of A or ~A). A message that told us about one set of states of affairs (two possibilities) would have 1 unit of information. A message that told us about 2 independent sets of states of affairs (4 equally likely possible states) would have 2 units of information. Already then we can see we will need some sort of a logarithmic scale.

Let us define the information provided by a message as follows:

Pr(x) is the probability of a given message being produced. In this equation x represents a single message, and ranges over all the possible messages in the summation. It is probably clear to you how this equation might capture how the probability of an event can influence the information provided by a message, and how this scale is logarithmic, but not necessarily how events that depend on one another are taken into account. It is indeed included in this formula, which you can see by considering the following: if an event B is made more probable by an event A then a message where both B and A occur will be more probable than a message where A is indicated and B is not. This will then alter the total distribution of message probabilities, and hence the result of the calculation.
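As a sketch, and assuming the definition is the standard form I = –Σ Pr(x) log2 Pr(x) with the sum taken over all possible messages x, the scale can be checked against the unit chosen earlier: n independent two-way sets of states of affairs should come out as n units. The function name is an illustrative choice.

```python
from math import log2

def information(message_probs):
    """Average information of a message, in units of one fair A-or-~A set."""
    return -sum(p * log2(p) for p in message_probs if p > 0)

for n_sets in (1, 2, 3):
    n_messages = 2 ** n_sets                 # equally likely combined messages
    probs = [1 / n_messages] * n_messages
    print(n_sets, information(probs))
# 1 1.0
# 2 2.0
# 3 3.0
```

The number of possible messages doubles with each added set while the information grows by one unit, which is the logarithmic behaviour we wanted.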

For example let us consider a message that carries information concerning two independent sets of states of affairs. Thus we have four possible messages: A & B, ~A & B, A & ~B, and ~A & ~B, each of which has the probability 1/4 to occur. Thus we compute the information contained in a message as follows:
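The calculation can be sketched in Python, assuming the definition is the standard form I = –Σ Pr(x) log2 Pr(x). Building the four message probabilities as products makes the independence of the two sets explicit; the variable names are illustrative.

```python
from math import log2
from itertools import product

pr_a = {"A": 0.5, "~A": 0.5}
pr_b = {"B": 0.5, "~B": 0.5}

# Independent sets: each combined message has probability Pr(a) * Pr(b) = 1/4.
messages = {(a, b): pr_a[a] * pr_b[b] for a, b in product(pr_a, pr_b)}

info = -sum(p * log2(p) for p in messages.values())
print(info)  # 2.0
```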

Now let us consider what would happen if events A and B were completely dependent on each other. Even though there are still four possibilities, A & B, ~A & B, A & ~B, and ~A & ~B, only A & B and ~A & ~B have a non-zero probability of occurring (1/2 each). Now we calculate the information as follows:
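A sketch of the same calculation in Python, again assuming the form I = –Σ Pr(x) log2 Pr(x). Messages with probability zero are skipped, following the usual convention that an impossible message contributes nothing to the sum.

```python
from math import log2

messages = {"A & B": 0.5, "~A & B": 0.0, "A & ~B": 0.0, "~A & ~B": 0.5}

# Impossible messages (probability 0) are left out of the sum.
info = -sum(p * log2(p) for p in messages.values() if p > 0)
print(info)  # 1.0
```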

This should be an intuitive result since A & B can be seen as corresponding to a single state of affairs, C, and ~A & ~B as corresponding to ~C (since the other possibilities are excluded).

Next let us consider a more complicated situation of dependence, where B always occurs when A occurs, but only occurs half the time when A does not occur. There are only three possibilities that can possibly occur in this situation: A & B with probability 1/2, ~A & B with probability 1/4, and ~A & ~B with probability 1/4. (The probabilities need not follow this distribution; I have simply picked them for the sake of the example.) The calculation proceeds as follows:
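In Python the calculation might look like the sketch below, assuming the form I = –Σ Pr(x) log2 Pr(x) and taking Pr(A) = 1/2, which is what the stated message probabilities imply. The variable names are illustrative.

```python
from math import log2

pr_a = 0.5
pr_b_given_a = 1.0       # B always occurs when A occurs
pr_b_given_not_a = 0.5   # B occurs half the time otherwise

messages = {
    "A & B": pr_a * pr_b_given_a,                      # 1/2
    "A & ~B": pr_a * (1 - pr_b_given_a),               # 0, cannot occur
    "~A & B": (1 - pr_a) * pr_b_given_not_a,           # 1/4
    "~A & ~B": (1 - pr_a) * (1 - pr_b_given_not_a),    # 1/4
}

info = -sum(p * log2(p) for p in messages.values() if p > 0)
print(info)  # 1.5
```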

This should also make intuitive sense, because half the time the events are completely dependent on each other (information amount 1), and half the time they are independent (information amount 2). Thus, as we expect, the information value for the message in this type of situation is between the two possibilities.

With the dependence between various states of affairs out of the way it is now time to discuss how the probability of a single state of affairs occurring can influence the amount of information a message contains. First let us consider a state of affairs that is certain (perhaps something of the form “the man is 6’ tall or the man is not 6’ tall”). The probability of such a state of affairs is 1. Thus we calculate the information a message about it can contain as:
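As a sketch, assuming the form I = Σ Pr(x) log2(1 / Pr(x)), which is the same quantity as –Σ Pr(x) log2 Pr(x), the certain case has a single possible message with probability 1:

```python
from math import log2

def information(message_probs):
    # Written with log2(1 / p) so that a certain message yields 0.0 rather than -0.0.
    return sum(p * log2(1 / p) for p in message_probs if p > 0)

certain = [1.0]   # "the man is 6' tall or the man is not 6' tall"
print(information(certain))  # 0.0
```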

And as we expect we find out that a message cannot provide us with any more information than we already had.

Finally, let us consider a set of states of affairs which can be in one of two possible alternatives, and in which one alternative is more likely than the other. In this calculation let us say that the probability of A is 3/4 and the probability of ~A is 1/4.
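The calculation can be sketched as follows, assuming the form I = –Σ Pr(x) log2 Pr(x). The equally likely case is included only for comparison.

```python
from math import log2

def information(message_probs):
    return -sum(p * log2(p) for p in message_probs if p > 0)

biased = information([0.75, 0.25])   # A is three times as likely as ~A
fair = information([0.5, 0.5])       # both alternatives equally likely

print(round(biased, 3), fair)  # 0.811 1.0
```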

As I mentioned earlier this shows that a message containing information about A or ~A contains less information in the case when one state of affairs is more likely than the other than when they are both equally likely (the result would have been 1 in that case). This is because even in the absence of the message we have some information about A and ~A, namely that A is more likely, and so the message about them reduces our ignorance less than in the case of equally likely alternatives.

Well, enough with the examples. Now let me prove some interesting facts about this formula. First off let me define the function I(m) as the amount of information that is given by a message; its value is calculated as given above. Now consider two messages, m1 and m2. It is possible to form a third message by combining the two (in fact one could think of all messages as being built this way, from primitive components that give information concerning a single set of states of affairs). Let us call the information from this combined message I(m1, m2). How is this related to I(m1) and I(m2)? Examine then the following manipulation:

(within the summation m1 and m2 range over all possible messages that they can bear)
What is I(m2 | m1)? It is a measure of how much new information m2 gives us if we already have m1. It is trivial to show that when the states of affairs that m2 provides information about are independent of the states of affairs determined by m1 then I(m2 | m1) = I(m2), and hence I(m1, m2) = I(m1) + I(m2) (but only when they are independent).
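A numerical illustration, assuming the manipulation arrives at the usual decomposition I(m1, m2) = I(m1) + I(m2 | m1), with I(m2 | m1) computed as the information in m2 within each value of m1, weighted by Pr(m1). The example reuses the earlier case where B always occurs with A and half the time otherwise; the helper names are illustrative.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# m1 reports A or ~A, m2 reports B or ~B.
joint = {("A", "B"): 0.5, ("~A", "B"): 0.25, ("~A", "~B"): 0.25}

pr_m1 = {}
for (m1, _), p in joint.items():
    pr_m1[m1] = pr_m1.get(m1, 0.0) + p

# I(m2 | m1): information left in m2 once m1 is known, averaged over m1.
i_m2_given_m1 = sum(
    pr_m1[m1] * entropy([p / pr_m1[m1] for (a, _), p in joint.items() if a == m1])
    for m1 in pr_m1
)

i_m1 = entropy(pr_m1.values())
i_joint = entropy(joint.values())
print(i_m1, i_m2_given_m1, i_joint)  # 1.0 0.5 1.5
```

Here m2 adds only half a unit on top of m1, because half the time (when A holds) B is already settled.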

We can also define the mutual information between two messages as I(m1 : m2) = I(m1) + I(m2) – I(m1, m2) = I(m1) – I(m1 | m2) = I(m2) – I(m2 | m1). This is a measure of how much information two messages share in common. Note that if the messages contain information about independent states of affairs then I(m1 : m2) = 0. For completeness I should also mention that the information distance between two messages is defined as D(m1, m2) = I(m1, m2) – I(m1 : m2) = I(m1 | m2) + I(m2 | m1).
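These definitions can be checked on a small example. The sketch below assumes I is the usual –Σ Pr(x) log2 Pr(x) and uses the illustrative distribution in which B always occurs with A and half the time otherwise; the `marginal` helper is a hypothetical name introduced here.

```python
from math import log2, isclose

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def marginal(joint, index):
    """Probabilities of one component (0 for m1, 1 for m2) of the combined message."""
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out

joint = {("A", "B"): 0.5, ("~A", "B"): 0.25, ("~A", "~B"): 0.25}
i_m1 = entropy(marginal(joint, 0).values())
i_m2 = entropy(marginal(joint, 1).values())
i_joint = entropy(joint.values())

mutual = i_m1 + i_m2 - i_joint     # I(m1 : m2)
distance = i_joint - mutual        # D(m1, m2)

# Conditional informations via I(x | y) = I(x, y) - I(y).
assert isclose(distance, (i_joint - i_m2) + (i_joint - i_m1))
print(round(mutual, 3), round(distance, 3))  # 0.311 1.189
```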

Finally, something to prepare you for next time: so far I have assumed that two different states of affairs result in two different messages, and that the messages are always 100% accurate (a given state of affairs always results in the same message). When these assumptions are violated we have to deal with a phenomenon called noise, which is something I will cover next time.

If you are familiar with standard information theory you may be wondering how the account I have given here differs from that standard account. Well for one thing I like to think that mine is much easier to read. Ok, seriously: It is true that I use many of the same equations and definitions from standard information theory. The difference however can be seen as follows: I consider information from a standpoint of what a message can tell us about states of affairs, while standard information theory is concerned with how much information is contained within a continuous sequence of messages. This leads to the following differences: in my account the probabilities in question are how likely a combination of states of affairs is to occur. Secondly, I don’t have to worry about the length of a message or its particular encoding since we aren’t concerned with channel capacity here. In standard information theory the probabilities in question are how often a certain sequence occurs in the stream of information. Because it is not assumed in the standard theory that we know the message size or even how to pull apart the stream into messages more complicated equations must be introduced (for example dealing with Markov sources). This is needed for the computer science uses of information theory, but not so much for us as philosophers. Why do I say that we don’t need to look at it from this perspective? Primarily because we will be using it to prove interesting facts about perception and knowledge about the world, and in these cases the messages are pretty easy to distinguish from each other (i.e. separate experiences). Likewise, channel capacity will often be irrelevant.

That doesn’t mean you shouldn’t study standard information theory if it is of interest to you. I found the following resources helpful: Wikipedia on information theory, Wikipedia on information entropy, David MacKay’s lecture notes, and C. E. Shannon’s original paper (pdf).

Part 2

June 30, 2006

More On The Transmission of Information

Filed under: Epistemology,Information — Peter @ 2:53 am

Earlier I argued that for us to have information about an event that event must lie in our causal past, such that the event in question can be said to be the cause of at least some of our mental states. (see here for a longer explanation)

Consider then the following objection: Say a certain type of particle, P, has a chance to decay into particles Q and R, which are found in nature only when a particle of type P decays. Now let us assume that we find a Q particle. We know then that an R particle must also exist. However the R particle doesn’t lie in our causal past. Is this a counterexample to the theory I earlier proposed?

No (if it were, the word “retraction” would be in the title). We have information concerning the decay of P particles into Q and R because such events do lie in our causal past (i.e. P, Q, and R are in many cases all in our causal past). From this we have generalized, and assume that every P decay results in Q and R. Thus we might say that we have information concerning the class of P decay events since we have observed some instances of P decay. What makes the counterexample invalid is the assumption that information about a class of events yields information about specific instances. I would argue that in the “counterexample” we only really have information about P and Q. The reason can be seen simply as follows: it is possible that our generalization was wrong, and that sometimes P decays into Q and S. In the case we observed it is possible this did indeed happen, and thus there was no R particle, only an S particle. Since this is possible clearly we don’t have information about R, because to have information about R would be to know for sure, or at least have good reason to believe, that R exists. Thus the theory stands, for the moment.

June 24, 2006

The Transmission of Information

Filed under: Epistemology,Information,Language — Peter @ 2:25 pm

There are basically two kinds of information (that sometimes overlap). One kind is evidence that an event occurred some time in the past. The second is an organized thought (for example the words on this page), although partly the transmission of this type of information consists of evidence that a certain thought occurred in the past. Here I will give an account of the transmission of both these kinds of information.

First let me address the simpler type of information, that of evidence for some event in the past. For an event B to be evidence for event A it must be the case that A is a cause of event B. This is not to say that event B gives us certain knowledge that A occurred. In fact often it is the case that while A may have caused B there are many other events that could also have resulted in B. However we still say that B transmits information about A, because taken together with other events we can establish that A definitely had to occur, or at least that it is very likely that A occurred.

For example let us consider a tree outside the window. How does the information that there is a tree outside get to me? Well most immediately I have the visual impression of a tree. That visual impression is most likely caused by light striking my eye (although in theory it could be a hallucination). That light in turn is most likely caused by the existence of a physical tree (and not a hologram). Thus the tree itself is a cause of my visual impressions (indirectly), and more importantly it is the most likely cause for them, giving me both the information that a tree is there and good reason to believe this information to be accurate.

Although everything that can be known to exist interacts causally with the world, it is still possible for information to be lost. A trivial case is when the only result of an event is to create a particle moving away from us at the speed of light. Although this particle does technically carry information about the event we will never be able to observe it, and thus the information is lost to us. A more typical case though is when the information is lost due to “noise”. For example let us say that event A causes event D. It is also possible, however, that event D could have been caused by event C. Now let us assume that nearby to A event B has caused event E. Once again however, C could have been the cause of E as well. Thus from looking at events D and E we will conclude that event C took place, when really it is a combination of A and B, and thus the information that A and B occurred has been lost. In situations such as this the other possible causes of an event (the noise) have outweighed the real cause, making the real cause seem no more likely, or possibly even less likely, than some other combination of events. Although technically the information about the real events still is there it is no longer discoverable.

Now we can build on this foundation to discuss the transmission of organized thoughts, which is what is more commonly thought of as information. One condition for an organized thought to be transmitted is that the thought must have a causal effect on the world. For example my thoughts now are being written down here, and thus have a causal effect on the computer, and later they will have a causal effect upon my readers. However there is also a second condition, which is that when the recipient receives the information, mental models similar to those entertained by the author must be invoked in the recipient’s mind.

To see why this extra condition is important consider the following example: a man in a foreign country writes a book, which he ships to you. Unfortunately you cannot read his language, but even so when you receive his book some information has been conveyed to you, namely that a man in a foreign country is sending you books (this is the first kind of information). Clearly this is not the same as the information that is the content of the book. However when someone who speaks the language of the author reads the book they will have thoughts corresponding to those the author had when committing the words to print. (For example if I write the words “A man in a red house” I must think of a man in a red house, and if you understand me you also will think of a man in a red house when reading it.) It is not just books that convey this kind of information however; art may be a vehicle for it as well, and in some ways may be more successful since it can cross language barriers.

This type of information is even easier to lose than the first kind. For example small changes in the transmission medium (the events that result from the original) can completely change what a recipient will think (for example painting all the pages of a book red will destroy the information contained within it). It is also possible that people will forget how to interpret the writing or the art. In this case the information may be lost without anything happening to the transmission medium itself. It is specifically for this reason that I have separated the first type of information from the second. The first type of information is almost completely independent of the observer, in the sense that it bears the same information no matter who observes it. The second type however is not so independent of people; if no one can understand it the information is effectively lost.
