Creating mulitple X variables using the package mvtnorm
The package mvtnorm provides a function to generate multivariate data (multiple columns) with a specified vector of means (the means of each column) and covariance matrix among the means.
rcov1 <- function(p){
# p is the number of columns or number of variables
pp <- p*(p-1)/2 # number of elements in lower tri
max_r <- 0.7
r <- rexp(pp)
r <- r*max_r/max(r)
# create correlation matrix
R <- matrix(1, nrow=p, ncol=p)
R[lower.tri(R, diag=FALSE)] <- r
R <- t(R)
R[lower.tri(R, diag=FALSE)] <- r
# convert to covariance matrix
L <- diag(sqrt(rexp(p))) # standard deviations
S <- L%*%R%*%L
# check -- these should be the same
# R
# cov2cor(S)
return(S)
}
Now let’s use mvtnorm to generate fake correlated X
p <- 5 # number of X variables
S <- rcov1(p)
# make the fake X
n <- 10^5
mu <- runif(p, min=10, max=100) # vecctor of p means
X <- rmvnorm(n, mean=mu, sigma=S)
# how close? (check the cor as this is easier to scan)
cov2cor(S)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0000000 0.15511020 0.11863165 0.48788949 0.14545246
## [2,] 0.1551102 1.00000000 0.70000000 0.03514487 0.04345391
## [3,] 0.1186317 0.70000000 1.00000000 0.05346376 0.54083807
## [4,] 0.4878895 0.03514487 0.05346376 1.00000000 0.44504979
## [5,] 0.1454525 0.04345391 0.54083807 0.44504979 1.00000000
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0000000 0.15618693 0.12146387 0.48987841 0.14569745
## [2,] 0.1561869 1.00000000 0.69709529 0.03879653 0.03961509
## [3,] 0.1214639 0.69709529 1.00000000 0.05598629 0.54138395
## [4,] 0.4898784 0.03879653 0.05598629 1.00000000 0.44214983
## [5,] 0.1456974 0.03961509 0.54138395 0.44214983 1.00000000
The rcov1 algorithm is naive
A problem with generating a fake covariance matrix as above is that it is likely to be singular as \(p\) get’s bigger. A singular covariance matrix is one where there are fewer orthogonal axes of variation then there are variables. Imagine a multidimensional scatterplot of a data set with the fake covariance matrix. If we zoom around this multidimensional space, we will come across a “view” in which all the points are compressed along a single line – that is there is no variation on the axis orthogonal to this line of points. This is bad, as it means we cannot fit a linear model using least squares (because the inverse of the covariance matrix doesn’t exist).
Let’s explore this. If the a covariance matrix is singular, then at least one eigenvalue of the matrix is negative (eigenvalues is a multivariate term beyond the scope of this text but, effectively, these are the variances of the orthogonal axes referred to above). Here I compute the frequency of covariance matrices with at least one negative eigenvalue as \(p\) increases
niter <- 1000
p_vec <- 3:10
counts <- numeric(length(p_vec))
i <- 0
for(p in 3:10){
i <- i+1
for(iter in 1:niter){
counts[i] <- ifelse(eigen(rcov1(p))$values[p] < 0, counts[i]+1, counts[i])
}
}
data.table(p=p_vec, freq=counts/niter)
## p freq
## 1: 3 0.000
## 2: 4 0.025
## 3: 5 0.066
## 4: 6 0.135
## 5: 7 0.190
## 6: 8 0.225
## 7: 9 0.275
## 8: 10 0.368
Generating multiple columns of X variables with a non-singular covariance matrix
This section uses some ideas from matrix algebra. The goal is to create a \(n \times p\) matrix of \(X\) variables that have some random covariance structure that is full-rank (not singular, or no negative eigenvalues). The algorithm starts with a \(p \times p\) random eigenvector matrix \(\mathbf{E}\) and a \(p \times p\) random eigenvalue matrix \(\mathbf{L}\) and then computes the random covariance matrix using \(\mathbf{E}\mathbf{L}\mathbf{E}^\top\)
- Generate a random \(p \times p\) random eigenvector matrix from a covariance matrix of \(p \times p\) matrix of random normal variables.
fake.eigenvectors <- function(p){
a <- matrix(rnorm(p*p), p, p) # only orthogonal if p is infinity so need to orthogonalize it
a <- t(a)%*%a # this is the sum-of-squares-and-cross-product-matrix
E <- eigen(a)$vectors # decompose to truly orthogonal columns
return(E)
}
- Generate \(p\) random eigenvalues in descending order and that sum to 1. There are several ways to create this sequence. Here are two:
# force the eigenvalues to descend at a constant rate
fake.eigenvalues <- function(p, m=p, start=1, rate=1){
# m is the number of positive eigenvalues
# start and rate control the decline in the eigenvalue
s <- start/seq(1:m)^rate
s <- c(s, rep(0, p-m)) # add zero eigenvalues
L <- diag(s/sum(s)*m) # rescale so that sum(s)=m and put into matrix,
# which would occur if all the traits are variance standardized
return(L)
}
# random descent
fake.eigenvalues.exp <- function(p, m=p, rate=1){
# exponential distribution of eigenvalues
# m is the number of positive eigenvalues
# start and rate control the decline in the eigenvalue
s <- rexp(m, rate)
s <- s[order(s, decreasing=TRUE)] # re-order into descending order
s <- c(s, rep(0, p-m)) # add zero eigenvalues
L <- diag(s/sum(s)*m) # rescale so that sum(s)=m and put into matrix,
# which would occur if all the traits are variance standardized
return(L)
}
- Generate the random covariance matrix
fake.cov.matrix <- function(p){
# p is the size of the matrix (number of cols and rows)
E <- fake.eigenvectors(p)
L <- diag(fake.eigenvalues(p))
S <- E%*%L%*%t(E)
return(S)
}
- Generate the random \(X\) variables using \(\mathbf{X} = \mathbf{X}' (\mathbf{E}\sqrt{\mathbf{L}})^\top\)
# two functions to compute the random data
fake.X <- function(n,p,E,L){
# n is number of observations
# p is number of variables
X <- matrix(rnorm(n*p),nrow=n,ncol=p) %*% t(E%*%sqrt(L))
return(X)
}
An example
n <- 10^5
p <- 5
E <- fake.eigenvectors(p)
L <- fake.eigenvalues(p, start=1, rate=1)
X <- fake.X(n, p, E, L)
colnames(X) <- paste0("X", 1:p)
E
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.4762049 0.76679336 0.02189702 -0.37886958 -0.20306443
## [2,] -0.3226816 0.26514622 -0.21147219 0.50680583 -0.72387940
## [3,] 0.6343127 -0.07414265 0.40381446 0.65434666 0.03024185
## [4,] 0.2450305 -0.53523178 0.22900376 -0.40906173 -0.65856872
## [5,] -0.4546568 0.22305882 0.85982044 -0.06406746 -0.01166667
E <- eigen(cov(X))$vectors
scores <- X%*%E
colnames(scores) <- paste0("pc", 1:p)
cor(cbind(X, scores))[1:p, (p+1):(p*2)]
## pc1 pc2 pc3 pc4 pc5
## X1 0.6381822 0.71536319 0.009665104 0.2581645 -0.119317350
## X2 -0.5701649 0.32632936 -0.213505507 -0.4289443 -0.582102493
## X3 0.8417769 -0.06005797 0.312645510 -0.4358132 0.011149679
## X4 0.4130194 -0.64271492 0.232581875 0.3517727 -0.488358607
## X5 -0.6579685 0.23893819 0.712352851 0.0501983 -0.004418134