A converter for making PNG files from LaTeX equations for easy import into HTML docs can be found here. Here is an example of the output:
You can also go here for a more powerful version of the same.
Monday, March 24, 2008
Correlations vs Coincidences
***THIS IS A WORK IN PROGRESS POSTED FOR REVIEW ONLY****
I am reviewing a paper today which makes a very common error, one that I will be discussing in an upcoming talk I have been putting together, so I thought I would take this opportunity to write a little on the issue of correlations and coincidences and their consequences for neural computation and decoding. A typical discussion of correlations rightly notes that correlations can affect the information content of a neural code. This is trivially demonstrated by considering the Fisher information of a Gaussian distributed population r:

$$I(s) = \mu'(s)^T \Sigma^{-1}(s) \mu'(s) + \frac{1}{2} \mathrm{Tr}\left[\Sigma'(s) \Sigma^{-1}(s) \Sigma'(s) \Sigma^{-1}(s)\right]$$
Here $\mu(s)$ is the stimulus conditioned mean and $\Sigma(s)$ is the stimulus conditioned covariance. This expression has two terms. The first we label "linear" Fisher information and the second we label the "quadratic" contribution to Fisher information. This is because the first term represents the inverse variance of the unbiased, locally optimal, linear estimator of the stimulus. Here linear means the estimator is parameterized by $s_{est}=w \cdot r+b$, optimal means minimum variance, and unbiased means $w \cdot \mu'(s)=1$. In a similar fashion, the total Fisher information (linear + quadratic) gives the inverse of the minimum variance associated with an unbiased estimator which operates on quadratic functions of r. Clearly, the form of this expression indicates that, at least for Gaussian distributed populations, correlations affect both the information content of the population and the form of the optimal decoder. Indeed, with a bit more work one can show that the weights of the optimal estimators take the form:
Here repeated indices imply summation, and $w^{lin}_i$ operates on $r_i$ while $w^{quad}_{ij}$ operates on $r_i r_j$.
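To make the two terms concrete, here is a minimal NumPy sketch of the Gaussian case. It assumes the standard Gaussian expression (linear term $\mu'^T\Sigma^{-1}\mu'$, quadratic term $\frac{1}{2}\mathrm{Tr}[(\Sigma^{-1}\Sigma')^2]$); the function names and the toy two-neuron numbers are mine, chosen only for illustration.

```python
import numpy as np

def gaussian_fisher(dmu, Sigma, dSigma):
    """Return (linear, quadratic) Fisher information for a Gaussian population.

    dmu    : derivative of the mean with respect to s, shape (n,)
    Sigma  : covariance at s, shape (n, n)
    dSigma : derivative of the covariance with respect to s, shape (n, n)
    """
    I_lin = dmu @ np.linalg.solve(Sigma, dmu)   # mu'^T Sigma^{-1} mu'
    A = np.linalg.solve(Sigma, dSigma)          # Sigma^{-1} Sigma'
    I_quad = 0.5 * np.trace(A @ A)              # (1/2) Tr[(Sigma^{-1} Sigma')^2]
    return I_lin, I_quad

def optimal_linear_weights(dmu, Sigma):
    """Locally optimal linear readout, normalized so that w . mu' = 1."""
    w = np.linalg.solve(Sigma, dmu)
    return w / (w @ dmu)

# Toy example (made-up numbers): two positively correlated units.
dmu = np.array([1.0, 1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
I_lin, I_quad = gaussian_fisher(dmu, Sigma, np.zeros((2, 2)))
w = optimal_linear_weights(dmu, Sigma)
estimator_variance = w @ Sigma @ w              # equals 1 / I_lin

# Second toy example: independent units whose variances change with s.
_, I_quad_var = gaussian_fisher(np.zeros(2), np.eye(2), np.eye(2))
```

In the first example the variance of the normalized linear readout equals the inverse of the linear Fisher information, which is the sense in which the first term is "linear" information.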
Support Vector aficionados will also note that this computation hints at their favorite trick. Specifically, if we define z to be a vector which contains each element of the vector r and every cross product $r_i r_j$, then all the information about s contained in z is linear information, i.e. $I_r(s) = I_z^{lin}(s)$, and so linear discriminants in this augmented z space can optimally estimate/discriminate stimuli.
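A rough sketch of the augmentation trick, assuming we have samples of r at two nearby stimulus values. The helper names (`augment`, `linear_fisher`) and the synthetic data are hypothetical, and the plug-in estimate used here is known to be biased upward when trials are scarce, so treat it as an illustration rather than a recommended estimator.

```python
import numpy as np

def augment(r):
    """Map responses r (trials x n) to z = [r_i, r_i * r_j for i <= j]."""
    n = r.shape[1]
    i, j = np.triu_indices(n)
    return np.hstack([r, r[:, i] * r[:, j]])

def linear_fisher(x_lo, x_hi, ds):
    """Plug-in linear Fisher information from samples at s and s + ds."""
    dmu = (x_hi.mean(axis=0) - x_lo.mean(axis=0)) / ds
    Sigma = 0.5 * (np.cov(x_lo, rowvar=False) + np.cov(x_hi, rowvar=False))
    return dmu @ np.linalg.solve(Sigma, dmu)

# Synthetic data (made up): the stimulus changes both the mean and the variance.
rng = np.random.default_rng(0)
trials, n, ds = 5000, 3, 0.1
r_lo = rng.normal(0.0, 1.0, size=(trials, n))
r_hi = rng.normal(0.2 * ds, np.sqrt(1.0 + ds), size=(trials, n))

I_r = linear_fisher(r_lo, r_hi, ds)                    # linear information in r
I_z = linear_fisher(augment(r_lo), augment(r_hi), ds)  # linear information in z
```

Because r is a sub-vector of z, the plug-in value `I_z` can never fall below `I_r`; the extra amount is what a linear readout of z gains from the stimulus dependence of the second moments.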
This transformation also indicates how to estimate Fisher information in a non-parametric way, which is important when estimating information from data. Indeed, this issue came up the other day when JD was comparing different methods for estimating Fisher information. Among the methods considered (see below someday) one was the direct method. This involves taking data from two nearby values of the stimulus, $s$ and $s+\delta s$, computing the empirical mean and covariance and their derivatives (differences), and plugging them into the expression above. The problem is that this only works when you know for a fact that your data is Gaussian distributed. That knowledge allows you to put some extra information into the computation: specifically, you are allowed to assume that the third central moments are actually zero and that the fourth central moments can be computed from the second moments (regardless of their empirical values). Of course, a priori you have no good reason to assume Gaussianity, and you must then utilize your estimates of these higher moments, which will be quite crappy when you are data limited.
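For concreteness, the direct method under the Gaussian assumption might look like the sketch below. The function name and the simulated data are assumptions for illustration only; in particular, averaging the two covariance estimates to get $\Sigma$ at the midpoint is my choice, not something specified above.

```python
import numpy as np

def direct_gaussian_fisher(r_lo, r_hi, ds):
    """Direct (plug-in) estimate assuming Gaussian responses.

    r_lo, r_hi : samples (trials x n, with n >= 2) at stimulus s and s + ds
    Returns (linear, quadratic) contributions to Fisher information.
    """
    S_lo = np.cov(r_lo, rowvar=False)
    S_hi = np.cov(r_hi, rowvar=False)
    dmu = (r_hi.mean(axis=0) - r_lo.mean(axis=0)) / ds   # finite-difference mu'
    dSigma = (S_hi - S_lo) / ds                          # finite-difference Sigma'
    Sigma = 0.5 * (S_lo + S_hi)                          # covariance near the midpoint
    I_lin = dmu @ np.linalg.solve(Sigma, dmu)
    A = np.linalg.solve(Sigma, dSigma)
    I_quad = 0.5 * np.trace(A @ A)
    return I_lin, I_quad

# Simulated Gaussian data (made up): mu' = (1, 1), Sigma = identity, Sigma' = 0,
# so the true values are I_lin = 2 and I_quad = 0.
rng = np.random.default_rng(1)
trials, ds = 20000, 0.5
r_lo = rng.normal(0.0, 1.0, size=(trials, 2))
r_hi = rng.normal(ds * 1.0, 1.0, size=(trials, 2))
I_lin, I_quad = direct_gaussian_fisher(r_lo, r_hi, ds)
```

Note that the quadratic term comes out slightly positive even though the true value is zero: noise in the covariance estimates can only add to it, which is one face of the data-limitation problem described above.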
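Anyway, the Fisher information contained in the first N moments can be estimated non-parametrically by defining T(r) as a vector function which spans the set of polynomials of order N, i.e. for N=2, T(r) contains each $r_i$ and every possible $r_i r_j$ cross product. Fisher information is then obtained from the generative model

$$p(r|s) = \frac{\phi(r)}{Z(\theta(s))} \exp\left(\theta(s)^T T(r)\right)$$

for which

$$\langle T \rangle_s' \equiv \frac{d}{ds}\langle T \rangle_s = \Sigma_T(s)\, \theta'(s)$$

where

$$\Sigma_T(s) = \langle T T^T \rangle_s - \langle T \rangle_s \langle T \rangle_s^T$$

and $\langle \cdot \rangle_s$ indicates stimulus conditioned average. In this case information about the parameters $\theta$ takes the form

$$I(\theta) = \Sigma_T(s)$$

and information about the stimulus is given by

$$I(s) = \theta'(s)^T\, \Sigma_T(s)\, \theta'(s) = \left(\langle T \rangle_s'\right)^T \Sigma_T^{-1}(s)\, \langle T \rangle_s'$$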
This equation clearly indicates that a reasonable direct estimate of the information content of the first N moments requires a good estimate of the stimulus dependence of the first 2N moments, which is then plugged into the above equation (or, in the case of Gaussian distributed data, into the first equation in this post). An important question arises when you are not sure whether or not your data is Gaussian or whether or not higher moments are informative. Indeed, even when all the information is linear Fisher information, it is still possible that all the moments of the data depend upon the stimulus. Direct calculation can be used to show that
where
is the Nth central moment.
Now most methods for computing information effectively place a prior on the stimulus dependence of higher moments that takes this form. If that prior is correct then direct estimation will work well. For example, if the data is actually Gaussian then using the first equation in this post will give an accurate estimate of the information when you have enough data to get a reasonable estimate of the first and second order statistics. However, when the data is not Gaussian, using this equation will not necessarily lead to a correct estimate, especially when the empirically observed third and fourth order statistics are inconsistent with the Gaussian assumption.
This particular issue is only relevant to the estimation of quadratic information. There is also a case for which direct estimation of linear Fisher information is problematic. This is when correlations are small (or have small variability) but nonetheless have a strong influence on the information content. For example, this is the case when I_diag is significantly less than I and the correlations are on the order of the inverse of the number of units. In this case estimation of the correlations is unreliable, and so is any direct estimate of information based upon them.
Indirect estimation of information, on the other hand, is based upon computing $\theta'(s)$ subject to some reasonable prior. Typically, priors which favor small values of $\theta'(s)$ (i.e. small weights) are used. As is clear from the above, $\theta'$ can be estimated by inverting a covariance matrix, but this inversion can be complicated and estimation of the covariance matrix can be very noisy. Early stopping, regularization, and Bayesian logistic regression avoid these issues.
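As one illustration of what a prior favoring small weights does, here is a sketch that uses a ridge penalty as a stand-in (it is not necessarily the regularizer anyone above actually used). The names `ridge_theta_prime` and `decoder_information`, the penalty `lam`, and the numbers are all hypothetical.

```python
import numpy as np

def ridge_theta_prime(dT, Sigma_T, lam):
    """Regularized estimate of theta'(s): (Sigma_T + lam * I)^{-1} <T>'."""
    n = Sigma_T.shape[0]
    return np.linalg.solve(Sigma_T + lam * np.eye(n), dT)

def decoder_information(w, dT, Sigma_T):
    """Inverse variance of the locally unbiased linear readout of T with weights w."""
    return (w @ dT) ** 2 / (w @ Sigma_T @ w)

# Made-up statistics for a three-dimensional T(r).
dT = np.array([1.0, 1.0, 0.5])
Sigma_T = np.array([[1.0, 0.3, 0.1],
                    [0.3, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])

I_full = dT @ np.linalg.solve(Sigma_T, dT)
info_by_lam = {
    lam: decoder_information(ridge_theta_prime(dT, Sigma_T, lam), dT, Sigma_T)
    for lam in (0.0, 0.1, 1.0)
}
```

With `lam = 0` the readout recovers the full information; any `lam > 0` gives a readout whose information is at most `I_full` when evaluated against the true statistics. The payoff of the penalty only shows up when `Sigma_T` and `dT` are themselves noisy estimates, in which case the readout should be evaluated on held-out trials.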
Regardless, a popular but misguided method for analyzing the information consequences of correlations is to compute the information content of a population which has the same first order statistics and autocorrelations, but no cross-correlations between units. Practically speaking, this is accomplished by shuffling your data across trials. More generally, one can compute the information content contained in the product of the marginal distributions of each unit in the vector r. The difference between the information content of the true distribution and the information content of the shuffled distribution is labeled delta I_shuffled and is often used as a proxy for measuring the information consequences of correlations. Problems with this metric are discussed here. These criticisms can be summarized by noting that I_shuffled represents the information content associated with a code that simply does not exist in cortex. Wu and Amari understood this when they wrote their humorously (unintentionally?) titled Unfaithful Model... paper. In that work, they correctly pointed out that the presence or absence of correlations is simply a fact given by the data. What matters is whether or not the decoder of that activity which is implemented by cortex is capable of properly taking those correlations into account. To address this issue they constructed the so-called I_diag metric, which computes the information (Fisher in this case) associated with a particular suboptimal decoder, specifically, the one which assumes statistical independence between neurons. For linear Fisher information, I_diag takes the simple form:

$$I_{diag} = \frac{\left(\mu'^T \Sigma_{diag}^{-1} \mu'\right)^2}{\mu'^T \Sigma_{diag}^{-1} \Sigma\, \Sigma_{diag}^{-1} \mu'}$$
Here $\Sigma_{diag}$ is the covariance matrix with all the off-diagonal elements removed. For contrast, I_shuffled takes the form:

$$I_{shuffled} = \mu'^T \Sigma_{diag}^{-1} \mu'$$
Delta I_shuffled = I-I_shuffled can be either positive or negative. On the other hand, delta I_diag = I-I_diag is a non-negative quantity and truly represents information loss due to suboptimal decoding and an upper bound on information loss due to suboptimal computation. Moreover, as pointed out in Averbeck and friends, delta I_shuffled can be either positive or negative even when delta I_diag is zero, and delta I_diag can be quite large even when delta I_shuffled is negative or zero. So not only is I_diag the behaviorally relevant measure of the consequences of correlations, it's the only one which has a consistent interpretation (i.e. a bound on information loss). For those who prefer Shannon information, similar metrics were discussed by Latham and Nirenberg. In that work, the inverse variance of a stimulus decoder applied to neural activity (Fisher information) was replaced with the KL divergence between the true and the parameterized posterior distribution of the stimulus given neural activity. In particular, they defined delta I_cor-dep to be the KL divergence between the true posterior and a posterior which is obtained from a generative model which models the joint distribution of responses given the stimulus as the product of the marginal distributions, i.e.

$$\Delta I_{cor-dep} = \sum_r p(r) \sum_s p(s|r) \log \frac{p(s|r)}{p_{ind}(s|r)}, \qquad p_{ind}(s|r) = \frac{p(s) \prod_i p(r_i|s)}{\sum_{s'} p(s') \prod_i p(r_i|s')}$$
In contrast, the Shannon equivalent of delta I_shuffled is the synergy/redundancy metric which takes the form:

$$\Delta I_{syn} = I(s; r) - \sum_i I(s; r_i)$$

This metric can also be positive or negative. Moreover, as with the Fisher information equivalent, this metric cannot be related to KL divergence or information loss and, as such, is also not behaviorally relevant.
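The linear Fisher versions of these quantities are easy to compare numerically. The sketch below assumes the usual linear forms, $I = \mu'^T\Sigma^{-1}\mu'$, $I_{shuffled} = \mu'^T\Sigma_{diag}^{-1}\mu'$, and $I_{diag} = (\mu'^T\Sigma_{diag}^{-1}\mu')^2 / (\mu'^T\Sigma_{diag}^{-1}\Sigma\,\Sigma_{diag}^{-1}\mu')$. The function names and the two-neuron examples are made up for illustration.

```python
import numpy as np

def linear_infos(dmu, Sigma):
    """Return (I, I_diag, I_shuffled) for linear Fisher information."""
    w = dmu / np.diag(Sigma)                    # Sigma_diag^{-1} mu'
    I = dmu @ np.linalg.solve(Sigma, dmu)       # optimal linear decoder
    I_shuffled = dmu @ w                        # code with correlations removed
    I_diag = (dmu @ w) ** 2 / (w @ Sigma @ w)   # independent decoder on the true code
    return I, I_diag, I_shuffled

def shuffle_trials(r, rng):
    """Permute trials independently for each unit (columns of r)."""
    return np.column_stack([rng.permutation(r[:, i]) for i in range(r.shape[1])])

Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
same = linear_infos(np.array([1.0, 1.0]), Sigma)    # similarly tuned units
opp = linear_infos(np.array([1.0, -1.0]), Sigma)    # oppositely tuned units
one = linear_infos(np.array([1.0, 0.0]),
                   np.array([[1.0, 0.9], [0.9, 1.0]]))  # one untuned unit

rng = np.random.default_rng(2)
r = rng.normal(size=(100, 2))
r_shuffled = shuffle_trials(r, rng)
```

In the first two examples delta I_diag is zero (the independent decoder happens to be optimal) while delta I_shuffled is -2/3 and +2 respectively, so the shuffled measure changes sign without any information being lost. In the third, the independent decoder ignores the untuned unit and loses most of the available information.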
Sunday, March 23, 2008
Behavioral Evidence for Bayesian Computation
Knill, D., and Kersten, D. (1991). Nature
Weiss, Simoncelli, Adelson (2002). Nat Neurosci
Kersten, D., Mamassian, P., Yuille, A. (2004). Ann Rev Psych
Knill (1998). Vision Research
Jacobs (1999). Vision Research
Ernst, Banks (2002). Nature
Wolpert, Ghahramani, Jordan (1995). Nature
Todorov, Jordan (2002). Nature Neuroscience
Kording, Wolpert (2004). Nature
Bayes, Laplace, Bayesian and neo-Bayesians
More evidence that Laplace was the father (or perhaps the dedicated single mother) of Bayesian inference can be found here.
Contains this little gem: ...the term Bayesian "was first used in print by R.A. Fisher in the 1950 introduction to his 1930 paper on fiducial inference entitled Inverse Probability [wherein he seeks to] 'distinguish [his result] from the Bayesian probability a posteriori.'"
I also enjoyed the many references to Stigler's Law which states that: "no scientific discovery is named after its original discoverer." The author also notes that "...it is worth noting that Stigler proposed [this Law] in the spirit of a self-proving theorem," but then neglects to mention from whom it was stolen...
Also contains an interesting aside on objective vs subjective probability. I particularly like the comment indicating that many philosophers disliked the notion of subjective probability, but ultimately the sheer utility of the concept won the day. Anyway, the definition of objective probability likens it to a frequentist situation where pr(H) = 1/2 every time I toss a fair coin. This theory can be tested by tossing the coin many times and showing that the fraction of heads converges to 1/2. So it's objective in the sense that the number 1/2 refers to a quantity which can actually be observed.
On the other hand, the typical exemplar of a subjective probability is a statement of belief about what the weather will be like tomorrow. The distinction is that, since tomorrow only comes once, there is no way to verify that on this particular day there is a 50% chance of rain in a manner similar to the method used for the coin toss.
However, it seems to me that either this example or this distinction is rather poorly thought out. The repeatability and inter-toss independence of a coin is itself an assumption which should be subject to verification. This is no different than questioning the reasonableness of comparing the clouds (or temp/pressure/etc) of today with those of yesterday. True, we've got a lot more evidence about coins, but let's face it, even after a billion coin tosses, we can neither conclude for certain that we have a fair one, nor even that we have been tossing the "same" coin all this time.
This led me to think that this issue pertains to the debate, brought to my attention by JJ, concerning whether or not a rational doxastic state (fancy word for belief) could have probability one. To my mind the resolution of that issue was that only conditional probabilities could take on probability 1. Such conditionals are syllogisms. In the context of this discussion, I would suggest that conditional probabilities which represent statements of model assumptions are objective, while probabilistic statements concerning empirical quantities are subjective. This is because additional evidence, or even the consideration of additional models, can change degrees of belief regarding empirical quantities, while it is true by assumption that if we have a fair coin it will come up 50% heads if we toss it an infinite number of times. This is true regardless of how many times we actually toss it.
Inaugural Post
It seemed appropriate to describe my likely-to-be-unrealized intentions for this blog. Firstly, it became apparent recently that my life required more order and focus than it has had in recent years. So, I began an experiment with gCalendar and a PDA (actually it's a hacked iPod touch which is not a toy :). This was sufficiently successful that I am now seeking ways to better organize my thoughts. My current strategy has been to utilize a hierarchical file structure which groups projects according to the tree structure:
topic
- collaborators
- background papers
- code
- data
- data analysis and figures
- papers
  - drafts
  - supplements
- mathematical proofs
  - drafts
  - background papers
This works well enough for the projects which have gotten sufficiently off the ground, but has been a miserable failure for storing, sorting and otherwise tracking less well developed ideas or little bits of potentially useful and interesting information. Of course, privately blogging might be sufficient to deal with this issue, so why make this public? That pertains to my second issue: I truly despise the writing phase of scientific research. It is my hope that the practice and feedback that a blog can provide will sharpen both my ideas and my communication and presentation skills.
Finally, I should say that I quite enjoy a good debate/discussion concerning the fine details of mathematical proofs and empirical inference and hope that this blog will provide a forum for such discourse. I intend to share links with friends and collaborators and hope to write both very technical entries regarding research (including useful analytic and statistical methods and reviews of recent publications) and more general interest entries regarding philosophy of science.
That said, we live in an unpredictable world, and people are often the worst predictors of their own behavior...so, we shall see what we shall see.