線性混合模型系列五:REML實戰

  • 2019 年 12 月 5 日
  • 筆記

1. 混合線性模型

方程:

假定

2. 混合線性模型的似然函數

2.1 混合模型中y的分佈

2.2 對V進行公式代換

2.3 寫出似然函數

寧超在他的公眾號「Pythn與數量遺傳學」的「方差組分估計之約束最大似然」文章中,給出了下面兩種計算公式,公式一是直接似然函數(direct REML),公式二是間接的似然函數(MME based REML)。

  • 公式1: 常用於GWAS軟件中,比如TASSEL、GCTA、GEMMA、FaST-LMM、EMMAX
  • 公式2: 常用於傳統遺傳評估軟件,比如ASREML,DMU,BLUPF90

3. 一個示例

下面這個示例是張勤老師在2019年統計遺傳學暑假學校教材的數據

# zhangqin laoshi PPT codeherd <- factor(c(1,2,2,1,1,2,1,2,2))  sire <- factor(c(1,1,1,2,2,3,4,4,4))  y <- c(240,190,170,180,200,140,170,100,130)  REML_data <- data.frame(herd,sire,y)  X <- model.matrix(y~herd-1)  Z <- model.matrix(y~sire-1)  A <- matrix(c(1,0.25,0,0,0.25,1,0,0,0,0,1,0,0,0,0,1),nr=4)

3.1 公式1

書寫似然函數

mixmodel.loglik<-function(theta,y){    sigmag2<-theta[1]    sigmae2<-theta[2]      n = length(y)    G = A*sigmag2    R = diag(n)*sigmae2    V = Z %*% G %*% t(Z) + R    Vi = solve(V)      P = Vi - Vi %*% X %*% solve(t(X) %*% Vi %*% X) %*% t(X) %*% Vi    logl = -0.5*(log(det(V)) + log(det(t(X) %*% Vi %*% X)) + t(y) %*% P %*% y)  return(-logl)  }
optim(c(1,1),mixmodel.loglik,y=y)

3.2 公式2

mixmodel.loglik2 <-function(theta,y){    sigmag2<-theta[1]    sigmae2<-theta[2]      n = length(y)    G = A*sigmag2    R = diag(n)*sigmae2    V = Z %*% G %*% t(Z) + R    Vi = solve(V)    P = Vi - Vi %*% X %*% solve(t(X) %*% Vi %*% X) %*% t(X) %*% Vi    C11 = t(X) %*%  solve(R) %*% X    C12 = t(X) %*%  solve(R) %*% Z    C21 = t(Z) %*%  solve(R) %*% X    C22 = t(Z) %*%  solve(R) %*% Z + solve(G)    C = rbind(cbind(C11,C12),cbind(C21,C22))    logl2 = -0.5*(log(det(C)) + log(det(R)) + log(det(G)) + t(y) %*% P %*% y)  return(-logl2)  }  optim(c(1,1),mixmodel.loglik2,y=y)

兩者結果一樣。

3.3 張勤老師PPT上的EM算法

herd <- factor(c(1,2,2,1,1,2,1,2,2))  sire <- factor(c(1,1,1,2,2,3,4,4,4))  y <- c(240,190,170,180,200,140,170,100,130)  REML_data <- data.frame(herd,sire,y)  X <- model.matrix(y~herd-1)  Z <- model.matrix(y~sire-1)  XX <- crossprod(X)  XZ <- crossprod(X,Z)  ZX <- t(XZ)  ZZ <- crossprod(Z)  LHS <- rbind(cbind(XX,XZ),cbind(ZX,ZZ))  Xy <- crossprod(X,y)  Zy <- crossprod(Z,y)  RHS <- rbind(Xy,Zy)  A <- matrix(c(1,0.25,0,0,0.25,1,0,0,0,0,1,0,0,0,0,1),nr=4)  Ainv <- solve(A)  yy <- crossprod(y)  N <- length(y)  rankX <-qr(X)$rank  q <- length(unique(sire))  nh <- length(unique(herd))  k0 <- 12threshold <- 0.00000001write.table(t(c("sigmaE","sigmaS","k")),file="results.txt",row.names = FALSE, col.names = FALSE)repeat{      LHS1 <- LHS      k <- k0      LHS1[(nh+1):(nh+q), (nh+1):(nh+q)] <- LHS1[(nh+1):(nh+q), (nh+1):(nh+q)]+Ainv*k      sol <- solve(LHS1,RHS)    C <- solve(LHS1)    Czz <- C[(nh+1):(nh+q), (nh+1):(nh+q)]      sigmaE <- (yy - crossprod(sol,RHS))/(N-rankX)      sAs <- t(sol[(nh+1):(nh+q)]) %*% Ainv %*% sol[(nh+1):(nh+q)]      trCA <- sum(diag(Czz %*% Ainv))      sigmaS <- (sAs + trCA*sigmaE)/q      k0 <- as.numeric(sigmaE/sigmaS)      out <- c(sigmaE,sigmaS,k0)      write.table(t(out), file="results.txt", append = TRUE, row.names = FALSE, col.names = FALSE)      if(abs(k-k0) < threshold) break}  results <- read.table("results.txt",header = TRUE)  results

sigmaE

sigmaS

k

925.6148

91.78376

10.0847344

897.9372

108.19047

8.2995954

863.6173

129.55950

6.6657968

820.9699

157.69430

5.2060850

768.2845

194.87483

3.9424508

704.4117

243.56957

2.8920351

629.8507

305.53381

2.0614764

548.0643

380.16717

1.4416402

465.8970

462.97884

1.0063030

391.7085

546.12170

0.7172550

331.7683

621.64909

0.5336907

287.8838

684.73480

0.4204312

258.0564

734.15339

0.3515020

238.7261

770.93255

0.3096589

226.5165

797.12077

0.2841684

218.8922

815.07790

0.2685538

214.1497

827.02910

0.2589385

211.2014

834.81106

0.2529931

209.3676

839.80212

0.2493058

208.2261

842.97109

0.2470145

207.5152

844.97003

0.2455888

207.0721

846.22564

0.2447009

206.7960

847.01226

0.2441476

206.6238

847.50423

0.2438027

206.5165

847.81160

0.2435877

206.4496

848.00351

0.2434537

206.4078

848.12328

0.2433701

206.3818

848.19801

0.2433179

206.3655

848.24463

0.2432854

206.3554

848.27371

0.2432651

206.3491

848.29185

0.2432525

206.3452

848.30317

0.2432446

206.3427

848.31022

0.2432397

206.3412

848.31462

0.2432366

206.3402

848.31737

0.2432347

206.3396

848.31908

0.2432335

206.3392

848.32015

0.2432328

206.3390

848.32081

0.2432323

206.3389

848.32123

0.2432320

206.3388

848.32149

0.2432318

206.3387

848.32165

0.2432317

206.3387

848.32175

0.2432316

206.3387

848.32181

0.2432316

206.3387

848.32185

0.2432316

206.3386

848.32188

0.2432315

206.3386

848.32189

0.2432315

206.3386

848.32190

0.2432315

最終的結果是,加性方差組分為206.3386,殘差的方差組分為848.3219,結果一致

4. 注意

公式中的log,也可以寫為ln,是自然對數,在R中log默認的就是自然對數

# 自然對數的3次方exp(3)

20.0855369231877

# 對上面結果求自然對數log(exp(3))

3

在R中,如果想計算10的對數,用函數log10(x)

log(10)

2.30258509299405

log10(10)

1

5. 參考文獻

http://people.math.aau.dk/~rw/Undervisning/Mat6F12/Handouts/2.hand.pdf http://www2.stat.duke.edu/~sayan/Sta613/2018/lec/LMM.pdf 張勤老師 《統計遺傳學暑期學校教材》 中國農業大學 2019.07 寧超公眾號:Python與數量遺傳學 方差組分估計之約束最大似然