Monday, March 24, 2008

Correlations vs Coincidences


I am reviewing a paper today that makes a very common error, one that I will be discussing in an upcoming talk I have been putting together, so I thought I would take this opportunity to write a little on the issue of correlations and coincidences and their consequences for neural computation and decoding. A typical discussion of correlations rightly notes that correlations can affect the information content of a neural code. This is trivially demonstrated by considering the Fisher information of a Gaussian distributed population r:

$$I(s) = \mu'(s)^T \Sigma^{-1}(s) \mu'(s) + \frac{1}{2} \mathrm{Tr}\left[ \Sigma'(s) \Sigma^{-1}(s) \Sigma'(s) \Sigma^{-1}(s) \right]$$

Here $\mu(s)$ is the stimulus conditioned mean and $\Sigma(s)$ is the stimulus conditioned covariance. This expression has two terms. The first we label "linear" Fisher information and the second we label the "quadratic" contribution to Fisher information. This is because the first term represents the inverse variance of the unbiased, locally optimal, linear estimator of the stimulus. Here linear means the estimator is parameterized by $s_{est} = w \cdot r + b$, optimal means minimum variance, and unbiased means $w \cdot \mu'(s) = 1$. In a similar fashion, the total Fisher information (linear + quadratic) gives the inverse of the minimum variance associated with an unbiased estimator which operates on quadratic functions of r. Clearly, the form of this expression indicates that, at least for Gaussian distributed populations, correlations affect both the information content of the population and the form of the optimal decoder. Indeed, a bit more work can be used to show that the weights of the optimal estimators take the form:

Here repeated indices imply summation, and $w^{lin}_i$ operates on $r_i$ while $w^{quad}_{ij}$ operates on $r_i r_j$.
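As a minimal numerical sketch of the two terms, here is the Gaussian Fisher information in NumPy, together with the unbiased, minimum variance linear weights $w = \Sigma^{-1}\mu' / (\mu'^T \Sigma^{-1} \mu')$. The function names and the two-unit numbers are illustrative assumptions of mine, and only the linear weights are shown.

```python
import numpy as np

def gaussian_fisher(dmu, sigma, dsigma):
    """Return (linear, quadratic) Fisher information for a Gaussian population.

    dmu    : derivative of the mean with respect to the stimulus, shape (n,)
    sigma  : covariance matrix, shape (n, n)
    dsigma : derivative of the covariance with respect to the stimulus, shape (n, n)
    """
    i_lin = float(dmu @ np.linalg.solve(sigma, dmu))
    a = np.linalg.solve(sigma, dsigma)        # Sigma^{-1} Sigma'
    i_quad = 0.5 * float(np.trace(a @ a))     # (1/2) Tr[(Sigma^{-1} Sigma')^2]
    return i_lin, i_quad

def optimal_linear_weights(dmu, sigma):
    """Minimum variance linear weights, normalized so that w . mu' = 1."""
    w = np.linalg.solve(sigma, dmu)
    return w / (w @ dmu)

# Hypothetical two-unit population; every number here is made up for illustration.
dmu = np.array([1.0, 0.5])
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
dsigma = np.array([[0.2, 0.0], [0.0, 0.1]])

i_lin, i_quad = gaussian_fisher(dmu, sigma, dsigma)
w = optimal_linear_weights(dmu, sigma)
print(i_lin, i_quad, w @ sigma @ w)           # estimator variance equals 1 / i_lin
```

The last printed value is the variance of the linear estimator, which should match the inverse of the linear term.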

Support Vector aficionados will also note that this computation hints at their favorite trick. Specifically, if we define z to be a vector which contains each element of the vector r and every cross product $r_i r_j$, then all the information about s contained in z is linear information, i.e. $I_r(s) = I_z^{lin}(s)$, and so linear discriminants in this augmented z space can optimally estimate/discriminate stimuli.
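This identity is easy to check in closed form for a single Gaussian unit, where $z = (r, r^2)$ and the required third and fourth moments are known analytically. The sketch below uses made-up values for the mean, standard deviation, and their derivatives.

```python
import numpy as np

# Hypothetical scalar unit r ~ N(mu, sd^2); the values and derivatives are assumptions.
mu, sd = 1.5, 0.8
dmu, dsd = 1.0, 0.3

var = sd ** 2
dvar = 2 * sd * dsd

# Total Fisher information in r (linear + quadratic terms of the Gaussian formula).
fisher_total = dmu ** 2 / var + 0.5 * (dvar / var) ** 2

# Augmented vector z = (r, r^2): its mean derivative and covariance for Gaussian r.
dmean_z = np.array([dmu, 2 * mu * dmu + dvar])
cov_z = np.array([
    [var,          2 * mu * var],
    [2 * mu * var, 4 * mu ** 2 * var + 2 * var ** 2],
])

# Linear Fisher information in z.
fisher_lin_z = float(dmean_z @ np.linalg.solve(cov_z, dmean_z))
print(fisher_total, fisher_lin_z)
```

The two printed numbers agree because the score function of a Gaussian is itself a quadratic function of r, so it lies in the span of z.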

This transformation also indicates how to estimate Fisher information in a non-parametric way, which is important when estimating information from data. Indeed, this issue came up the other day when JD was comparing different methods for estimating Fisher information. Among the methods considered (see below someday) was the direct method. This involves taking data from two nearby values of the stimulus, $s$ and $s+\delta s$, computing the empirical mean and covariance and their derivatives (differences), and plugging them into the expression above. The problem is that this only works when you know for a fact that your data is Gaussian distributed. That knowledge lets you put some extra information into the computation: specifically, you are allowed to assume that the third central moments are actually zero and that the fourth central moments can be computed from the second moments (regardless of their empirical values). Of course, a priori you have no good reason to assume Gaussianity, so you must use your estimates of these higher moments, which will be quite crappy when you are data limited.
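For concreteness, here is one way the direct method could be written down. This is a sketch under the Gaussian assumption, not the procedure JD actually used. The function name, the choice to average the two covariances, and the synthetic data are all mine.

```python
import numpy as np

def direct_gaussian_fisher(r0, r1, ds):
    """Plug-in Gaussian Fisher information from trials at s (r0) and s + ds (r1).

    r0, r1 : arrays of shape (n_trials, n_units)
    """
    dmu = (r1.mean(axis=0) - r0.mean(axis=0)) / ds
    cov0 = np.cov(r0, rowvar=False)
    cov1 = np.cov(r1, rowvar=False)
    sigma = 0.5 * (cov0 + cov1)               # covariance near the midpoint (assumption)
    dsigma = (cov1 - cov0) / ds               # finite-difference derivative
    i_lin = float(dmu @ np.linalg.solve(sigma, dmu))
    a = np.linalg.solve(sigma, dsigma)
    i_quad = 0.5 * float(np.trace(a @ a))
    return i_lin + i_quad

# Synthetic Gaussian data: mean = slope * s, identity covariance, so the true
# Fisher information is 1^2 + 2^2 = 5 and there is no quadratic contribution.
rng = np.random.default_rng(0)
n_trials, ds = 20000, 0.5
slope = np.array([1.0, 2.0])
r0 = rng.normal(size=(n_trials, 2))
r1 = rng.normal(size=(n_trials, 2)) + slope * ds
estimate = direct_gaussian_fisher(r0, r1, ds)
print(estimate)
```

Note that the quadratic term is a sum of squares of noisy differences, so sampling noise can only push it upward. With few trials this plug-in estimate is biased high.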

Anyway, the Fisher information contained in the first N moments can be estimated non-parametrically by defining the vector function T(r) as a vector which spans the set of polynomials of order N, i.e. for N=2, T(r) contains each $r_i$ and every possible $r_i r_j$ cross product. Fisher information is then obtained from the generative model

for which

and $\langle \cdot \rangle_s$ indicates a stimulus conditioned average. In this case information about the parameters $\theta$ takes the form

and information about the stimulus is given by

This equation clearly indicates that a reasonable direct estimate of the information content in the first N moments requires good estimates of the stimulus dependence of the first 2N moments, which are then plugged into the above equation (or, in the case of Gaussian distributed data, into the first equation in this post). An important question arises when you are not sure whether or not your data is Gaussian, or whether or not higher moments are informative. Indeed, it is possible to show that, even when all the information is linear Fisher information, all the moments of the data may still depend upon the stimulus. Direct calculation can be used to show that
is the Nth central moment.
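As an aside, the moment-based estimate described above can be sketched in a few lines. I am assuming the generative model is an exponential family in T(r), in which case $\theta'(s) = \mathrm{Cov}_s[T]^{-1} \, d\langle T \rangle_s / ds$ and the information about the stimulus is $(d\langle T \rangle_s/ds)^T \, \mathrm{Cov}_s[T]^{-1} \, (d\langle T \rangle_s/ds)$. That reading, the function names, and the synthetic data are my assumptions.

```python
import numpy as np

def quadratic_features(r):
    """T(r) for N = 2: every r_i followed by every product r_i r_j with i <= j."""
    i, j = np.triu_indices(r.shape[1])
    return np.hstack([r, r[:, i] * r[:, j]])

def moment_fisher(r0, r1, ds, features=quadratic_features):
    """Fisher information carried by the moments spanned by features(r).

    r0, r1 : trials at s and s + ds, each of shape (n_trials, n_units)
    """
    t0, t1 = features(r0), features(r1)
    dmean = (t1.mean(axis=0) - t0.mean(axis=0)) / ds
    # Cov[T] involves moments of r up to order 2N, hence the data hunger.
    cov = 0.5 * (np.cov(t0, rowvar=False) + np.cov(t1, rowvar=False))
    return float(dmean @ np.linalg.solve(cov, dmean))

# Synthetic check: Gaussian data with mean = slope * s and identity covariance,
# for which all of the information is linear and the true value is 5.
rng = np.random.default_rng(2)
n_trials, ds = 20000, 0.5
slope = np.array([1.0, 2.0])
r0 = rng.normal(size=(n_trials, 2))
r1 = rng.normal(size=(n_trials, 2)) + slope * ds
estimate = moment_fisher(r0, r1, ds)
print(estimate)
```

Unlike the Gaussian plug-in, this version uses the empirical third and fourth moments directly, which is exactly why it needs far more data as the number of units grows.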

Now most methods for computing information effectively place a prior on the stimulus dependence of higher moments that takes this form. If that prior is correct then direct estimation will work well. For example, if the data is actually Gaussian, then using the first equation in this post will give an accurate estimate of the information once you have enough data to get a reasonable estimate of the first and second order statistics. However, when the data is not Gaussian, using this equation will not necessarily lead to a correct estimate, especially when the empirically observed third and fourth order statistics are inconsistent with the Gaussian assumption.

This particular issue is only relevant to the estimation of quadratic information. There is also a case for which direct estimation of linear Fisher information is problematic. This is when correlations are small (or have small variability) but nonetheless have a strong influence on the information content. For example, this is the case when $I_{diag}$ is significantly less than $I$ and the correlations are on the order of the inverse of the number of units. In this case estimation of the correlations is unreliable, and so is any direct estimate of information based upon them.
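A toy calculation shows how much correlations of order 1/N can matter. It only illustrates the "small correlations, large effect" part of the argument, using a homogeneous population that I made up. It does not reproduce the regime where $I_{diag}$ falls well below $I$.

```python
import numpy as np

# Hypothetical homogeneous population: unit variances, identical tuning slopes,
# and a uniform pairwise correlation c = 1/n.
n = 100
dmu = np.ones(n)
c = 1.0 / n
sigma = (1 - c) * np.eye(n) + c * np.ones((n, n))

i_true = float(dmu @ np.linalg.solve(sigma, dmu))   # linear Fisher information
i_indep = float(dmu @ (dmu / np.diag(sigma)))       # same units, correlations ignored
print(i_true, i_indep)                              # roughly 50 versus 100
```

A correlation of 0.01 per pair halves the information here. Since the sampling error of a correlation coefficient is roughly one over the square root of the number of trials, resolving correlations this small takes far more than $n^2$ trials.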

Indirect estimation of information, on the other hand, is based upon computing $\theta'(s)$ subject to some reasonable prior. Typically, priors which favor small values of $\theta'(s)$ are used. As is clear from the above, $\theta'(s)$ can be estimated by inverting a covariance matrix, but this inversion can be complicated and estimation of the covariance matrix can be very noisy. Early stopping, regularization, and Bayesian logistic regression avoid these issues.
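As one illustration of the regularized route, here is a sketch that uses ridge regularization as a stand-in for the priors mentioned above. The weights are fit on one half of the trials and the information they recover is measured on the other half. The penalty `lam`, the function names, and the synthetic data are assumptions of mine.

```python
import numpy as np

def ridge_weights(r0, r1, lam):
    """Decoder weights from a ridge-regularized covariance inversion."""
    dmu = r1.mean(axis=0) - r0.mean(axis=0)
    sigma = 0.5 * (np.cov(r0, rowvar=False) + np.cov(r1, rowvar=False))
    return np.linalg.solve(sigma + lam * np.eye(sigma.shape[0]), dmu)

def decoder_info(w, r0, r1, ds):
    """Linear Fisher information recovered by fixed weights w on the given trials."""
    signal = (w @ (r1.mean(axis=0) - r0.mean(axis=0))) / ds
    noise_var = 0.5 * (np.var(r0 @ w, ddof=1) + np.var(r1 @ w, ddof=1))
    return float(signal ** 2 / noise_var)

# Synthetic data: 5 independent unit-variance units with slope 1, true information 5.
rng = np.random.default_rng(1)
n_trials, n_units, ds = 4000, 5, 0.5
r0 = rng.normal(size=(n_trials, n_units))
r1 = rng.normal(size=(n_trials, n_units)) + ds

half = n_trials // 2
w = ridge_weights(r0[:half], r1[:half], lam=1.0)         # fit on the first half
estimate = decoder_info(w, r0[half:], r1[half:], ds)     # evaluate on the second half
print(estimate)
```

Evaluating on held-out trials matters: the same weights scored on their own training data would overstate the information.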

Regardless, a popular but misguided method for analyzing the information consequences of correlations is to compute the information content of a population which has the same first order statistics and autocorrelations, but no cross-correlations between units. Practically speaking, this is accomplished by shuffling your data across trials. More generally, one can compute the information content contained in the product of the marginal distributions of each unit in the vector r. The difference between the information content of the true distribution and the information content of the shuffled distribution is labeled $\Delta I_{shuffled}$ and is often used as a proxy for measuring the information consequences of correlations. Problems with this metric are discussed here. These criticisms can be summarized by noting that $I_{shuffled}$ represents the information content associated with a code that simply does not exist in cortex. Wu and Amari understood this when they wrote their humorously (unintentionally?) titled Unfaithful Model... paper. In that work, they correctly pointed out that the presence or absence of correlations is simply a fact given by the data. What matters is whether or not the decoder of that activity which is implemented by cortex is capable of properly taking those correlations into account. To address this issue they constructed the so-called $I_{diag}$ metric, which computes the information (Fisher in this case) associated with a particular suboptimal decoder, specifically the one which assumes statistical independence between neurons. For linear Fisher information, $I_{diag}$ takes the simple form:

$$I_{diag} = \frac{\left( \mu'^T \Sigma_{diag}^{-1} \mu' \right)^2}{\mu'^T \Sigma_{diag}^{-1} \Sigma \, \Sigma_{diag}^{-1} \mu'}$$

Here $\Sigma_{diag}$ is the covariance matrix with all the off-diagonal elements set to zero. For contrast, $I_{shuffled}$ takes the form:

$$I_{shuffled} = \mu'^T \Sigma_{diag}^{-1} \mu'$$
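A small numerical sketch makes the two quantities concrete. It assumes the standard linear Fisher forms $I_{shuffled} = \mu'^T \Sigma_{diag}^{-1} \mu'$ and $I_{diag} = (\mu'^T \Sigma_{diag}^{-1} \mu')^2 / (\mu'^T \Sigma_{diag}^{-1} \Sigma \Sigma_{diag}^{-1} \mu')$. The two-unit populations are invented for illustration.

```python
import numpy as np

def fisher_measures(dmu, sigma):
    """Return (I, I_shuffled, I_diag) for linear Fisher information."""
    w_diag = dmu / np.diag(sigma)                    # Sigma_diag^{-1} mu'
    i_full = float(dmu @ np.linalg.solve(sigma, dmu))
    i_shuffled = float(dmu @ w_diag)
    i_diag = float((dmu @ w_diag) ** 2 / (w_diag @ sigma @ w_diag))
    return i_full, i_shuffled, i_diag

def two_units(a, c):
    """Hypothetical pair: slopes (1, a), unit variances, correlation c."""
    return np.array([1.0, a]), np.array([[1.0, c], [c, 1.0]])

pos = fisher_measures(*two_units(1.0, 0.5))    # approx (1.333, 2.0, 1.333)
neg = fisher_measures(*two_units(1.0, -0.5))   # approx (4.0, 2.0, 4.0)
het = fisher_measures(*two_units(0.5, 0.8))    # approx (1.25, 1.25, 0.762)

for name, (i_full, i_shuf, i_diag) in [("pos", pos), ("neg", neg), ("het", het)]:
    print(name, "dI_shuffled =", i_full - i_shuf, "dI_diag =", i_full - i_diag)
```

In the first two cases the independent decoder loses nothing while $I - I_{shuffled}$ flips sign with the correlation. In the third, $I - I_{shuffled}$ is zero even though the independent decoder throws away roughly 40% of the information.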

$\Delta I_{shuffled} = I - I_{shuffled}$ can be either positive or negative. On the other hand, $\Delta I_{diag} = I - I_{diag}$ is a non-negative quantity that truly represents information loss due to suboptimal decoding, and an upper bound on information loss due to suboptimal computation. Moreover, as pointed out in Averbeck and friends, $\Delta I_{shuffled}$ can be either positive or negative even when $\Delta I_{diag}$ is zero, and $\Delta I_{diag}$ can be quite large even when $\Delta I_{shuffled}$ is negative or zero. So not only is $I_{diag}$ the behaviorally relevant measure of the consequences of correlations, it's the only one which has a consistent interpretation (i.e. a bound on information loss).

For those who prefer Shannon information, similar metrics were discussed by Latham and Nirenberg. In that work, the inverse variance of a stimulus decoder applied to neural activity (Fisher information) was replaced with the KL divergence between the true and the parameterized posterior distribution of the stimulus given neural activity. In particular, they defined $\Delta I_{cor-dep}$ to be the KL divergence between the true posterior and a posterior which is obtained from a generative model which models the joint distribution of responses given the stimulus as the product of the marginal distributions, i.e.

$$p_{ind}(\mathbf{r}|s) = \prod_i p(r_i|s)$$

In contrast, the Shannon equivalent of $\Delta I_{shuffled}$ is the synergy/redundancy metric, which takes the form:

$$\Delta I_{synergy} = I(s; \mathbf{r}) - \sum_i I(s; r_i)$$

This metric can also be positive or negative. Moreover, as with the Fisher information equivalent, this metric cannot be related to KL divergence or information loss and, as such, is also not behaviorally relevant.
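The contrast between the two Shannon metrics shows up already with one binary stimulus and two binary units. The sketch below assumes the usual definitions: synergy as $I(s; r_1, r_2) - I(s; r_1) - I(s; r_2)$, and $\Delta I_{cor-dep}$ as the response-averaged KL divergence between the true posterior and the posterior built from the product of marginals. The two toy codes are my own examples.

```python
import numpy as np

def mutual_info(p_xy):
    """Mutual information in bits from a 2-D joint probability table."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (px * py)[nz])))

def synergy(p):
    """I(s; r1, r2) - I(s; r1) - I(s; r2) for a joint table p[s, r1, r2]."""
    i_joint = mutual_info(p.reshape(p.shape[0], -1))
    return i_joint - mutual_info(p.sum(axis=2)) - mutual_info(p.sum(axis=1))

def delta_i_cor_dep(p):
    """Average KL divergence (bits) between the true and independent-model posteriors.

    Assumes every stimulus has nonzero probability.
    """
    p_s = p.sum(axis=(1, 2))
    p_r1_s = p.sum(axis=2) / p_s[:, None]             # p(r1 | s)
    p_r2_s = p.sum(axis=1) / p_s[:, None]             # p(r2 | s)
    joint_ind = p_s[:, None, None] * p_r1_s[:, :, None] * p_r2_s[:, None, :]
    p_r = np.broadcast_to(p.sum(axis=0), p.shape)     # true p(r1, r2)
    norm = np.broadcast_to(joint_ind.sum(axis=0), p.shape)
    nz = p > 0
    ratio = (p[nz] / p_r[nz]) / (joint_ind[nz] / norm[nz])
    return float(np.sum(p[nz] * np.log2(ratio)))

# Toy code 1: s = r1 XOR r2 with independent fair bits (purely synergistic).
xor = np.zeros((2, 2, 2))
for r1 in range(2):
    for r2 in range(2):
        xor[r1 ^ r2, r1, r2] = 0.25

# Toy code 2: both units copy the stimulus (purely redundant).
copy = np.zeros((2, 2, 2))
copy[0, 0, 0] = copy[1, 1, 1] = 0.5

print(synergy(xor), delta_i_cor_dep(xor))     # 1.0 and 1.0
print(synergy(copy), delta_i_cor_dep(copy))   # -1.0 and 0.0
```

For the redundant code the synergy metric reports minus one bit even though a decoder that ignores correlations loses nothing, which is the Shannon version of the point made above for $\Delta I_{shuffled}$.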
