# Constructing correlated data

*Figure: a scatter plot of correlated bivariate data, such as one might need in a Monte Carlo simulation experiment.*

On numerous occasions over the years I have wanted to construct sets of data with a specified correlational structure. Between occasions, I tend to forget the method I used and each time I perform a new simulation experiment, I have to hunt down my old references and go back to basic principles to figure out what to do.

Thinking that there might be other people in the same situation, I decided that it might be useful to give a summary here of the basic recipe for constructing multivariate correlated data. I’m not suggesting that what follows is a substitute for a good technical paper but at least it will set you off on the right path.

### Basic recipe for constructing correlated data

First, specify the correlation matrix for the data set that you want. Your software will undoubtedly require the full matrix for the calculations in the next step, so be sure to include the upper and lower triangles of the matrix as well as the unit values along the main diagonal. If you are planning to construct a set of data with n variables and m cases, then your correlation matrix will be a square matrix of dimensions n×n. Call the correlation matrix r.

Second, compute the Cholesky decomposition of the correlation matrix r. The Cholesky decomposition of a real-valued, positive definite symmetric matrix (i.e., just like your correlation matrix!) is, by definition, the factorization of the matrix into the product of two triangular matrices, one being the transpose of the other. Some software prints only the lower triangular matrix; other software produces only the upper triangular matrix. If you end up with a lower triangular matrix, convert it to an upper triangular matrix by taking the transpose. Call the resultant matrix c.
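In practice you would call a library routine for this step (R's chol() or NumPy's numpy.linalg.cholesky), but the computation itself is short. The following pure-Python sketch, using a made-up 2×2 correlation matrix for illustration, implements the standard Cholesky–Banachiewicz algorithm and returns the upper-triangular factor:

```python
import math

def cholesky_upper(a):
    """Return the upper-triangular Cholesky factor U such that U^T @ U == a.

    `a` must be a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    # Compute the lower-triangular factor L (Cholesky-Banachiewicz),
    # then transpose it to obtain the upper-triangular factor.
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return [[L[j][i] for j in range(n)] for i in range(n)]  # transpose

r = [[1.0, 0.6],
     [0.6, 1.0]]
u = cholesky_upper(r)
# For this 2x2 correlation matrix, u = [[1, 0.6], [0, sqrt(1 - 0.6^2)]]
```

For a 2×2 correlation matrix the factor can be checked by hand, which makes this a convenient test case before moving on to larger matrices.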

Third, create a matrix, w, of “precursor data” of the same dimensions as the data set that you ultimately need, consisting of independent, identically distributed (i.i.d.) random numbers. Since there should be as many rows (m) as there are cases and as many columns (n) as there are variables in your required data set, the matrix w will be of dimensions m×n.

Fourth, post-multiply the precursor data matrix w by the Cholesky decomposition matrix c. The multiplication is possible because w is of dimensions m×n and c is of dimensions n×n. The resultant product (which we’ll call d) will have dimensions m×n and the columns will be correlated in the desired manner. Of course, the empirical correlations between the columns of d might be different from the values in the correlation matrix r simply as a result of sampling error. Any new matrix w will produce slightly different empirical correlations in d.
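Steps one to four can be sketched end to end. The plain-Python example below uses an illustrative 2×2 correlation matrix whose Cholesky factor can be written down by hand, generates a precursor matrix of standard normal deviates, post-multiplies it by the factor, and then checks the empirical correlation:

```python
import math
import random

random.seed(1)

# Target 2x2 correlation matrix r = [[1, rho], [rho, 1]] (values made up).
rho = 0.7
# Its upper-triangular Cholesky factor can be written down directly:
# c = [[1, rho], [0, sqrt(1 - rho^2)]], since c^T @ c == r.
c = [[1.0, rho],
     [0.0, math.sqrt(1.0 - rho * rho)]]

m, n = 10000, 2
# Precursor matrix w: m cases by n variables of i.i.d. standard normals.
w = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

# d = w @ c  (post-multiply the precursor data by the Cholesky factor).
d = [[sum(row[k] * c[k][j] for k in range(n)) for j in range(n)] for row in w]

def corr(x, y):
    """Empirical Pearson correlation between two columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [row[0] for row in d]
y = [row[1] for row in d]
# corr(x, y) should be close to rho, up to sampling error.
```

As the post notes, the empirical correlation will only approximate rho, and the approximation improves with the number of cases m.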

Lastly, you might need to rescale your data so that it spreads over the correct range, or has a particular mean and standard deviation.
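A single column can be rescaled linearly to any target mean and standard deviation; a hypothetical helper in Python (the column values and targets below are made up):

```python
def rescale(values, new_mean, new_sd):
    """Linearly rescale a column to a target mean and standard deviation."""
    m = sum(values) / len(values)
    sd = (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [new_mean + new_sd * (v - m) / sd for v in values]

col = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = rescale(col, new_mean=100.0, new_sd=15.0)
```

Because the rescaling is linear and applied per column, it leaves the correlations between columns intact, though it does change the covariances.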

### Reasons to be careful

This is a recipe and a guide but you are the cook. If you really have no idea what each step in the recipe is doing, then it is more than likely that you will produce a matzah pudding. Seek advice, if necessary, from an experienced chef.

Be careful about generating random numbers. There are many hundreds of very bad ways of doing it, and only a few good ones. Either use a software library that you know is good, or read Press et al. [1] or Knuth [2].

Unfortunately I do not know anything about the quality of random number generation in the Open Office suite but the following paragraphs about Microsoft Excel highlight potential problems. I might post an Open Office spreadsheet on the blog at some stage to illustrate the recipe described above.

The uniform random number generator called RAND that is used in both Microsoft Excel 2003 and Microsoft Excel 2007 passes the Diehard tests of George Marsaglia and will do just fine for generating uniformly distributed data. However, if you want normally distributed data, do not try to use the Excel function called NORMINV by using the intuitively obvious formula

`=NORMINV(RAND(), mean, standardDeviation)`

The reason is that NORMINV becomes progressively more inaccurate as one moves away from the mean and into the tails of the distribution. It appears to be even harder to produce a good calculator of the inverse of the Gaussian error function than it is to produce a good random number generator! If you need random normal deviates, use the Box-Muller transformation described by Press et al. [1]. If you don’t want to read Press et al., then you might just take my word for what follows. A good Excel implementation of the Box-Muller transformation would provide, as output, two independent random normal deviates for every two uniform random deviates that are used as input. However, an inefficient, wasteful, but nonetheless accurate implementation can be effected using the formula
`=SQRT(-2*LN(RAND()))*SIN(2*PI()*RAND())`

which will produce values from the standard normal distribution of mean zero and unit variance.
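As a point of comparison, the full Box–Muller transformation, which yields two independent normal deviates per pair of uniforms as described above, is a few lines in Python (the `1 - random()` guard keeps the logarithm away from zero, since `random()` can return exactly 0):

```python
import math
import random

random.seed(2)

def box_muller():
    """Return two independent standard normal deviates from two uniforms."""
    u1 = 1.0 - random.random()  # in (0, 1], so log(u1) is always defined
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

# Draw 10,000 deviates and check the first two moments.
sample = [z for _ in range(5000) for z in box_muller()]
mean = sum(sample) / len(sample)
var = sum((z - mean) ** 2 for z in sample) / (len(sample) - 1)
# mean should be near 0 and var near 1, up to sampling error.
```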

You might on occasion want to use a simulated, or random correlation matrix. Beware. I have found that it is generally possible to calculate the Cholesky decomposition for most genuine correlation matrices. On the other hand, fake correlation matrices that have random (though symmetric) entries in them often produce trouble, giving rise to error messages from the software saying that the Cholesky decomposition cannot be accurately calculated because the matrix is insufficiently positive definite.
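The failure is easy to demonstrate. The hand-rolled Cholesky routine below (plain Python; a library routine would report the same problem) raises an error on a symmetric matrix whose entries are each individually legal correlations but are jointly impossible:

```python
import math

def cholesky_upper(a):
    """Upper-triangular Cholesky factor; raises ValueError if `a` is not
    positive definite, mimicking the error a stats package would report."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return [[L[j][i] for j in range(n)] for i in range(n)]

# If V1 correlates .9 with V2 and -.9 with V3, then V2 and V3 cannot
# also correlate .9 with each other, so this "correlation matrix" is fake.
fake_r = [[ 1.0,  0.9, -0.9],
          [ 0.9,  1.0,  0.9],
          [-0.9,  0.9,  1.0]]

try:
    cholesky_upper(fake_r)
    outcome = "decomposed"
except ValueError:
    outcome = "not positive definite"
```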

### References

[1] Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1995). Numerical Recipes in C: The Art of Scientific Computing. Cambridge, UK: Cambridge University Press.

[2] Knuth, D. E. (1981). The Art of Computer Programming Volume 2: Seminumerical Algorithms. Reading, MA: Addison-Wesley.

Contributors: Mark R. Diamond

1. Dan-Erik says:

The rescaling that needs to be done as the last step is really bothersome in my opinion. I would like to get away from manually entering numbers until I find a good fit. If anyone hears about a way to do this automatically in some way, I would appreciate a mail about it.

2. DDR says:

I too have had to rediscover this a number of times, and then I remembered you had blogged about it.

I have added some trivial R code below, which is easily modified. I was interested in data in which variables V1 and V2 are highly correlated (r = .9) and V3 and V4 are highly correlated (r = .9), and the remaining correlations are low (r = 0).

############### START
r <- matrix( # correlation matrix
  c(1.0, 0.9, 0.0, 0.0,
    0.9, 1.0, 0.0, 0.0,
    0.0, 0.0, 1.0, 0.9,
    0.0, 0.0, 0.9, 1.0),
  nrow = 4)

ch <- chol(r) # Cholesky decomposition of r (upper triangular); named ch so as not to mask base R's c()
w <- matrix(rnorm(800), ncol = 4) # precursor matrix: 200 cases of i.i.d. standard normal deviates
correlated.data <- w %*% ch

############### END

3. papak says:

cor(correlated.data) is not the same as r, and neither is cov(correlated.data)

4. mrd says:

Papak: I am not sure what you have done but as far as I can see, the R code is correct. Deviations between cor(correlated.data) and r will occur as a result of random sampling. Naturally the approximation improves as the number of rows of w increases.

There is a far more important error in my actual post; one that I will correct shortly. The error relates to the rescaling of the variables after they have been generated. The rescaling will mess up the covariances! Instead of using the Cholesky decomposition of the correlation matrix, one should use the decomposition of the covariance matrix.
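To make the correction concrete, here is a hedged sketch in Python with made-up target values: decompose the covariance matrix (variances 4 and 9, covariance 2.4) rather than the correlation matrix, post-multiply the standard-normal precursors by its factor, and verify the empirical covariance. Means can then be shifted afterwards without disturbing the covariances.

```python
import math
import random

random.seed(3)

# Target covariance matrix for two variables with sds 2 and 3 and
# correlation 0.4, so cov(V1, V2) = 0.4 * 2 * 3 = 2.4 (values made up).
sigma = [[4.0, 2.4],
         [2.4, 9.0]]

# Upper-triangular Cholesky factor of a 2x2 covariance matrix, written out
# directly: u = [[sqrt(s11), s12/sqrt(s11)], [0, sqrt(s22 - s12^2/s11)]].
u00 = math.sqrt(sigma[0][0])
u01 = sigma[0][1] / u00
u11 = math.sqrt(sigma[1][1] - u01 * u01)

m = 20000
# Precursor matrix: m cases of two i.i.d. standard normal deviates.
w = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(m)]
# d = w @ u, written out column by column for the 2x2 case.
d = [[row[0] * u00,
      row[0] * u01 + row[1] * u11] for row in w]

def cov(x, y):
    """Empirical covariance between two columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

x = [row[0] for row in d]
y = [row[1] for row in d]
# cov(x, y) should be close to 2.4 and cov(x, x) close to 4.0,
# up to sampling error.
```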