线性混合模型系列五:REML实战
- 2019 年 12 月 5 日
- 筆記
1. 混合线性模型
方程:

假定

2. 混合线性模型的似然函数
2.1 混合模型中y的分布

2.2 对V进行公式代换

2.3 写出似然函数

宁超在他的公众号“Pythn与数量遗传学”的“方差组分估计之约束最大似然”文章中,给出了下面两种计算公式,公式一是直接似然函数(direct REML),公式二是间接的似然函数(MME based REML)。
- 公式1: 常用于GWAS软件中,比如TASSEL、GCTA、GEMMA、FaST-LMM、EMMAX
- 公式2: 常用于传统遗传评估软件,比如ASREML,DMU,BLUPF90

3. 一个示例
下面这个示例是张勤老师在2019年统计遗传学暑假学校教材的数据
# zhangqin laoshi PPT codeherd <- factor(c(1,2,2,1,1,2,1,2,2)) sire <- factor(c(1,1,1,2,2,3,4,4,4)) y <- c(240,190,170,180,200,140,170,100,130) REML_data <- data.frame(herd,sire,y) X <- model.matrix(y~herd-1) Z <- model.matrix(y~sire-1) A <- matrix(c(1,0.25,0,0,0.25,1,0,0,0,0,1,0,0,0,0,1),nr=4)
3.1 公式1
书写似然函数

mixmodel.loglik<-function(theta,y){ sigmag2<-theta[1] sigmae2<-theta[2] n = length(y) G = A*sigmag2 R = diag(n)*sigmae2 V = Z %*% G %*% t(Z) + R Vi = solve(V) P = Vi - Vi %*% X %*% solve(t(X) %*% Vi %*% X) %*% t(X) %*% Vi logl = -0.5*(log(det(V)) + log(det(t(X) %*% Vi %*% X)) + t(y) %*% P %*% y) return(-logl) }
optim(c(1,1),mixmodel.loglik,y=y)

3.2 公式2

mixmodel.loglik2 <-function(theta,y){ sigmag2<-theta[1] sigmae2<-theta[2] n = length(y) G = A*sigmag2 R = diag(n)*sigmae2 V = Z %*% G %*% t(Z) + R Vi = solve(V) P = Vi - Vi %*% X %*% solve(t(X) %*% Vi %*% X) %*% t(X) %*% Vi C11 = t(X) %*% solve(R) %*% X C12 = t(X) %*% solve(R) %*% Z C21 = t(Z) %*% solve(R) %*% X C22 = t(Z) %*% solve(R) %*% Z + solve(G) C = rbind(cbind(C11,C12),cbind(C21,C22)) logl2 = -0.5*(log(det(C)) + log(det(R)) + log(det(G)) + t(y) %*% P %*% y) return(-logl2) } optim(c(1,1),mixmodel.loglik2,y=y)

两者结果一样。
3.3 张勤老师PPT上的EM算法
herd <- factor(c(1,2,2,1,1,2,1,2,2)) sire <- factor(c(1,1,1,2,2,3,4,4,4)) y <- c(240,190,170,180,200,140,170,100,130) REML_data <- data.frame(herd,sire,y) X <- model.matrix(y~herd-1) Z <- model.matrix(y~sire-1) XX <- crossprod(X) XZ <- crossprod(X,Z) ZX <- t(XZ) ZZ <- crossprod(Z) LHS <- rbind(cbind(XX,XZ),cbind(ZX,ZZ)) Xy <- crossprod(X,y) Zy <- crossprod(Z,y) RHS <- rbind(Xy,Zy) A <- matrix(c(1,0.25,0,0,0.25,1,0,0,0,0,1,0,0,0,0,1),nr=4) Ainv <- solve(A) yy <- crossprod(y) N <- length(y) rankX <-qr(X)$rank q <- length(unique(sire)) nh <- length(unique(herd)) k0 <- 12threshold <- 0.00000001write.table(t(c("sigmaE","sigmaS","k")),file="results.txt",row.names = FALSE, col.names = FALSE)repeat{ LHS1 <- LHS k <- k0 LHS1[(nh+1):(nh+q), (nh+1):(nh+q)] <- LHS1[(nh+1):(nh+q), (nh+1):(nh+q)]+Ainv*k sol <- solve(LHS1,RHS) C <- solve(LHS1) Czz <- C[(nh+1):(nh+q), (nh+1):(nh+q)] sigmaE <- (yy - crossprod(sol,RHS))/(N-rankX) sAs <- t(sol[(nh+1):(nh+q)]) %*% Ainv %*% sol[(nh+1):(nh+q)] trCA <- sum(diag(Czz %*% Ainv)) sigmaS <- (sAs + trCA*sigmaE)/q k0 <- as.numeric(sigmaE/sigmaS) out <- c(sigmaE,sigmaS,k0) write.table(t(out), file="results.txt", append = TRUE, row.names = FALSE, col.names = FALSE) if(abs(k-k0) < threshold) break} results <- read.table("results.txt",header = TRUE) results
sigmaE |
sigmaS |
k |
---|---|---|
925.6148 |
91.78376 |
10.0847344 |
897.9372 |
108.19047 |
8.2995954 |
863.6173 |
129.55950 |
6.6657968 |
820.9699 |
157.69430 |
5.2060850 |
768.2845 |
194.87483 |
3.9424508 |
704.4117 |
243.56957 |
2.8920351 |
629.8507 |
305.53381 |
2.0614764 |
548.0643 |
380.16717 |
1.4416402 |
465.8970 |
462.97884 |
1.0063030 |
391.7085 |
546.12170 |
0.7172550 |
331.7683 |
621.64909 |
0.5336907 |
287.8838 |
684.73480 |
0.4204312 |
258.0564 |
734.15339 |
0.3515020 |
238.7261 |
770.93255 |
0.3096589 |
226.5165 |
797.12077 |
0.2841684 |
218.8922 |
815.07790 |
0.2685538 |
214.1497 |
827.02910 |
0.2589385 |
211.2014 |
834.81106 |
0.2529931 |
209.3676 |
839.80212 |
0.2493058 |
208.2261 |
842.97109 |
0.2470145 |
207.5152 |
844.97003 |
0.2455888 |
207.0721 |
846.22564 |
0.2447009 |
206.7960 |
847.01226 |
0.2441476 |
206.6238 |
847.50423 |
0.2438027 |
206.5165 |
847.81160 |
0.2435877 |
206.4496 |
848.00351 |
0.2434537 |
206.4078 |
848.12328 |
0.2433701 |
206.3818 |
848.19801 |
0.2433179 |
206.3655 |
848.24463 |
0.2432854 |
206.3554 |
848.27371 |
0.2432651 |
206.3491 |
848.29185 |
0.2432525 |
206.3452 |
848.30317 |
0.2432446 |
206.3427 |
848.31022 |
0.2432397 |
206.3412 |
848.31462 |
0.2432366 |
206.3402 |
848.31737 |
0.2432347 |
206.3396 |
848.31908 |
0.2432335 |
206.3392 |
848.32015 |
0.2432328 |
206.3390 |
848.32081 |
0.2432323 |
206.3389 |
848.32123 |
0.2432320 |
206.3388 |
848.32149 |
0.2432318 |
206.3387 |
848.32165 |
0.2432317 |
206.3387 |
848.32175 |
0.2432316 |
206.3387 |
848.32181 |
0.2432316 |
206.3387 |
848.32185 |
0.2432316 |
206.3386 |
848.32188 |
0.2432315 |
206.3386 |
848.32189 |
0.2432315 |
206.3386 |
848.32190 |
0.2432315 |
最终的结果是,加性方差组分为206.3386,残差的方差组分为848.3219,结果一致
4. 注意
公式中的log,也可以写为ln,是自然对数,在R中log默认的就是自然对数
# 自然对数的3次方exp(3)
20.0855369231877
# 对上面结果求自然对数log(exp(3))
3
在R中,如果想计算10的对数,用函数log10(x)
log(10)
2.30258509299405
log10(10)
1
5. 参考文献
http://people.math.aau.dk/~rw/Undervisning/Mat6F12/Handouts/2.hand.pdf http://www2.stat.duke.edu/~sayan/Sta613/2018/lec/LMM.pdf 张勤老师 《统计遗传学暑期学校教材》 中国农业大学 2019.07 宁超公众号:Python与数量遗传学 方差组分估计之约束最大似然