Setting the statistical properties of a finite sample

Bottom line: How to create a finite sample of data with a fixed set of summary statistics

Often in psychology experiments we show participants stimuli that are intended to be samples from some distribution. One way of doing this is to simply draw the samples from the distribution (this is easy to do in most computer software packages). For example, we might sample numbers from a Gaussian distribution:

However, in any random finite sample, the mean and other distributional properties will slightly mis-match those of the generating population. A natural response is that, well, it can’t be that far off. For example, we know that mean will vary according to the standard error of the distribution. But with multivariate samples (such as those commonly needed for categorization experiments), the properties of the higher-order statistics, like covariance, are also important. The sampling distribution of these properties can be more distorted by small samples (Klaus Feidler has done some interesting psychological work that exploits exactly this fact).

Ideally, when we say that “we presented subjects in an experiment with items generated from some particular population distribution”, what we really mean is that we showed all participants a random sample of items with properties close or exactly matching to what we are describing about the population distribution. Otherwise, random differences in the sample any subject sees may be contributing to our “unexplained variance” in the experiment design.

How do you guarantee that the statistics of a finite sample match the population exactly? Since the points are already normally distributed, we just need to find a linear transform of the points of the form

$\mathbf{X}_{new}=\mathbf{A}\mathbf{X}_{old} + \mathbf{B}$

to change the mean and variance to the values we want. For univariate Gaussian distributions, the solution is the z-transform:

$\mathbf{Z}=(\mathbf{X}_{old}-\mathbf{\overline{X}}_{old} ) \cdot \mathbf{X}_{old}$

$\mathbf{X}_{new}=\mathbf{Z} \sigma_{new} + \mu_{new}$

where the sample mean and variance of $X_{target}$ are $\mu$ and $\sigma^2$, respectively. This is written a little different than a classic linear transformation, but basically, we just subtract and divide out the mean and variance that the sample had, giving the sample a mean of 0 and a standard deviation of 1. This is actually the same as taking the z-score for all the numbers in the sample. Once our mean is 0 and standard deviation is 1, we can multiply by any constant to make that scalar our standard deviation, and add by any constant to make that constant our new mean.

This kind of correction can be generalized to multivariate samples using linear algebra.

As before, the plan is to remove the mean and covariance of our original set of points, and then add and multiply in our desired parameters. This process is illustrated in the figure: we start with sample Xold, give it mean of zero and the identity matrix as its covariance (sample Z), then multiply in a new shape and add in a new mean to get our new sample, Xnew. Each of these conversions, represented by the arrows, will be a linear transformation from a given sample, $\mathbf{X}_{old}$ to a new sample $\mathbf{X}_{new}$ with different parameters. Because it is linear, it will take the form

$\mathbf{X}_{new}=\mathbf{A}\mathbf{X}_{old}+\mathbf{B}$

Ashby (1992) has shown that when performing the above transformation, the new covariance matrix can be determined from the old one as follows:

$\mathbf{\Sigma}_{new}=\mathbf{A}\mathbf{\Sigma}_{old}\mathbf{A}^{\dagger},$

where $\mathbf{A}^{\dagger}$ is the conjugate transpose of $\mathbf{A}$. To target a particular value for $\mathbf{\Sigma_{new}}$, it will help to use the Cholesky decomposition. $\mathrm{Chol}(\mathcal{A})$ is defined as a triangular matrix $\mathbf{L}$ such that for any Hermitian positive-definite matrix $\mathcal{A}$,

$\mathcal{A}=\mathbf{L} \mathbf{L}^{\dagger}.$

This is applicable here because a covariance matrix is Hermitian positive-definite. There are other solutions for $\mathbf{L}$ in this case, but the Cholesky decomposition is particularly efficient and does not require the use of singular value decomposition.

Our goal is to transform our sample $\mathbf{X}_{old}$ into a new sample $\mathbf{Z}$ with $\mathrm{mean}(\mathbf{Z})=\mathbf{0}$ and $\mathrm{cov}(\mathbf{Z})=\mathbf{\Sigma}_{Z}=\mathbf{I}$. Once we have that, it will be easy to transform it to have the statistics we want. So we solve the above:

$\mathbf{\Sigma_Z}= \mathbf{A}\mathbf{\Sigma}_{old}\mathbf{A}^{\dagger}.$

$\mathrm{Chol}(\mathbf{\Sigma_Z})= \mathrm{Chol}(\mathbf{A}\mathbf{\Sigma}_{old}\mathbf{A}^{\dagger})$

$\mathbf{I}= \mathrm{Chol}(\mathbf{A}\mathbf{\Sigma}_{old}\mathbf{A}^{\dagger}).$

Defining $\mathbf{S} = \mathrm{Chol}(\mathbf{\Sigma}_{old})$ (and therefore $\mathbf{\Sigma}_{old} = \mathbf{S}\mathbf{S}^{\dagger}$), we get

$\mathbf{I}= \mathrm{Chol}(\mathbf{A}\mathbf{S}\mathbf{S}^{\dagger}\mathbf{A}^{\dagger}).$

By the definition of the Cholesky decomposition, $\mathrm{Chol}(\mathbf{A}\mathbf{S}\mathbf{S}^{\dagger}\mathbf{A}^{\dagger})=\mathbf{A}\mathbf{S}$, so,

$\mathbf{I} = \mathbf{A}\mathbf{S}.$

$\mathbf{A} = \mathbf{S}^{-1}$

Noting, trivially, that $\mathbf{B} = -\mathrm{mean}(\textbf{A}\textbf{X})$, the transformation to a “clean” sample is now:

$\mathbf{Z}= \mathbf{S}^{-1} \mathbf{X}_{old} - \mathrm{mean}(\mathbf{S}^{-1} \textbf{X}_{old}),$

which is equivalent to

$\mathbf{Z}=\mathrm{Chol}^{-1}( \mathrm{cov}(\mathbf{X}_{old}))\mathbf{X}_{old}-\mathrm{mean}(\mathrm{Chol}^{-1}( \mathrm{cov}(\mathbf{X}_{old}))\mathbf{X}_{old}).$

To transform this into the desired distribution, we again turn to Ashby’s equation relating the transformed and untransformed covariances, but this time the term on the right contains the identity matrix, so it’s even easier:

$\mathbf{\Sigma}_{new}=\mathbf{A}\mathbf{\Sigma}_{Z}\mathbf{A}^{\dagger},$

$\mathbf{\Sigma}_{new}=\mathbf{A}\mathbf{A}^{\dagger},$

$\mathbf{A}= \mathrm{Chol}(\mathbf{\Sigma}_{new}).$

And so our transformation becomes

$\mathbf{X}_{new}=\mathrm{Chol}(\mathbf{\Sigma}_{new})\mathbf{Z}+\mathbf{\mu}_{new}.$

#### Summary:

To fix the sample mean and covariance of a sample from a multivariate Gaussian to arbitrary values, first transform the sample to a sample $\mathbf{Z}$ with mean $\mathbf{0}$ and covariance $\mathbf{I}$.  This can be done using this equation:

$\mathbf{Z}=\mathrm{Chol}^{-1}( \mathrm{cov}(\mathbf{X}_{old}))\mathbf{X}_{old}-\mathrm{mean}(\mathrm{Chol}^{-1}( \mathrm{cov}(\mathbf{X}_{old}))\mathbf{X}_{old}).$

Then transform that sample to the sample you want, using

$\mathbf{X}_{new}=\mathrm{Chol}(\mathbf{\Sigma}_{new})\mathbf{Z}+\mathbf{\mu}_{new}.$

#### Code:

Some Python code to accomplish this is provided here:

#### References:

Ashby, F.G. (1992). “Multivariate probability distributions.” In F. G. Ashby, Ed., Multidimensional Models of Perception and Cognition. Laurence Erlbaum Associates: Hillsdale, NJ.