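Learning WGCNA with the authors' test data

October 11, 2019
Notes

The test data can be downloaded from: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/SimulatedData.zip

With this test data it is easy to follow the authors' tutorial documents and learn WGCNA step by step. The tutorial is organized as follows: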

Simulation of expression and trait data: PDF document, R script
Loading of expression data, an alternative to data simulation, provided to illustrate data loading of real data: PDF document, R script
Basic data preprocessing illustrates rudimentary techniques for handling missing data and removing outliers: PDF document, R script
Standard gene screening illustrates gene selection based on Pearson correlation and shows that the results are not satisfactory: PDF document, R script
Construction of a weighted gene co-expression network and network modules illustrated step-by-step; includes a discussion of alternate clustering techniques: PDF document, R script
Relating modules and module eigengenes to external data illustrates methods for relating modules to external microarray sample traits: PDF document, R script
Module membership, intramodular connectivity, and screening for intramodular hub genes illustrates using the intramodular connectivity to define measures of module membership and to screen for genes based on network information: PDF document, R script
Visualization of gene networks: PDF document, R script

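Step 1: Understand the test data

Here the authors simulate an expression matrix of 3000 genes in 50 samples. The WGCNA algorithm separates these 3000 genes fairly cleanly into 5 modules, labeled by color (turquoise, blue, brown, green, and yellow). A large number of genes also fall into the grey module, which collects the unassigned genes and should be ignored.

Also worth noting: the authors simulate a phenotype, **a simulated clinical trait y**, which is used in the later analyses.

The code that simulates this data is well worth studying. It embodies the principles behind WGCNA, so reading it is like working through the method in reverse.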
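To make the "reverse" view concrete, here is a minimal base-R sketch of how module structure can be simulated: each module gene is a noisy copy of a seed profile, with correlation to the seed decreasing across the module. This is only loosely in the spirit of the tutorial's simulation script, not a copy of it. The helper `simulate_module`, the module sizes, and the correlation range are all assumptions made for illustration.

```r
set.seed(1)
n_samples <- 50

# Hypothetical helper: simulate genes whose correlation with a seed profile
# decreases from r_max to r_min across the module.
simulate_module <- function(seed, n_genes, r_max = 0.9, r_min = 0.3) {
  r <- seq(r_max, r_min, length.out = n_genes)
  noise <- matrix(rnorm(length(seed) * n_genes), nrow = length(seed))
  # Column j is r[j] * seed + sqrt(1 - r[j]^2) * noise
  outer(as.numeric(scale(seed)), r) + sweep(noise, 2, sqrt(1 - r^2), `*`)
}

seed_turquoise <- rnorm(n_samples)
seed_blue <- rnorm(n_samples)

# A simulated trait y that depends on one of the seed profiles
y <- 0.6 * seed_turquoise + 0.8 * rnorm(n_samples)

datExpr <- cbind(
  simulate_module(seed_turquoise, 100),
  simulate_module(seed_blue, 100),
  matrix(rnorm(n_samples * 100), nrow = n_samples)  # "grey": pure noise genes
)
true_module <- rep(c("turquoise", "blue", "grey"), each = 100)

# Genes inside a module are far more correlated than the grey background
mean_abs_cor <- function(m) {
  cm <- cor(m)
  mean(abs(cm[upper.tri(cm)]))
}
within_turquoise <- mean_abs_cor(datExpr[, true_module == "turquoise"])
within_grey <- mean_abs_cor(datExpr[, true_module == "grey"])
```

Comparing `within_turquoise` with `within_grey` shows the kind of signal that module detection later has to recover.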

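Step 2: Load the test data in R

The only thing to watch here is R project management: create a new project folder in RStudio.

Step 3: Data preprocessing

This mainly means removing outliers, both samples and genes, and mostly relies on base R code.

You can also run a simple hierarchical clustering to inspect the data distribution and the distances between samples. The breast cancer dataset I present at https://github.com/jmzeng1314/my_WGCNA shows what the result looks like.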
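The sample-level part of this step can be sketched in base R as follows. The data here are random numbers with one artificially shifted sample, and the cut height of 30 is an assumption read off this toy dendrogram. With real data you would choose the height by looking at the plot.

```r
set.seed(2)
# Toy data: 20 samples x 200 genes, with sample S20 shifted to act as an outlier
datExpr <- matrix(rnorm(20 * 200), nrow = 20,
                  dimnames = list(paste0("S", 1:20), paste0("g", 1:200)))
datExpr["S20", ] <- datExpr["S20", ] + 5

# Cluster samples on Euclidean distance with average linkage
sample_tree <- hclust(dist(datExpr), method = "average")
plot(sample_tree, main = "Sample clustering to detect outliers")

# Cut the tree and keep only the largest cluster
cut_height <- 30  # assumption: chosen by eye for this toy example
clusters <- cutree(sample_tree, h = cut_height)
main_cluster <- as.integer(names(which.max(table(clusters))))
keep_samples <- clusters == main_cluster
dat_clean <- datExpr[keep_samples, ]
```

Samples that only join the tree at a much greater height than the rest are the outlier candidates.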

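Step 4: Gene selection

This step exists because the number of genes is usually large and the downstream computation is substantial, and many genes do not need to enter the WGCNA analysis. At this point many people like to run a differential expression analysis first and keep only the statistically significant genes, but the WGCNA authors do not consider that strategy advisable.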
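For reference, the "standard gene screening" that the tutorial uses as its point of comparison ranks genes by their Pearson correlation with the trait. A minimal base-R sketch on toy data follows. The simulated matrix, the effect size, and the 0.05 cutoff are assumptions for illustration, and this is the conventional approach the tutorial finds unsatisfactory, not a recommended filter.

```r
set.seed(3)
n_samples <- 50
y <- rnorm(n_samples)

# Toy data: 500 genes, of which the first 20 truly track the trait
datExpr <- matrix(rnorm(n_samples * 500), nrow = n_samples)
datExpr[, 1:20] <- datExpr[, 1:20] + 0.8 * y

# Pearson correlation of every gene with the trait
gene_cor <- as.numeric(cor(datExpr, y))

# Student t-test p-value for each correlation
t_stat <- gene_cor * sqrt((n_samples - 2) / (1 - gene_cor^2))
p_value <- 2 * pt(-abs(t_stat), df = n_samples - 2)

# Benjamini-Hochberg adjustment, then keep genes below the cutoff
q_value <- p.adjust(p_value, method = "BH")
selected <- which(q_value < 0.05)
```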

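Step 5: Module construction (the main step)

First, use the function pickSoftThreshold to choose the soft-thresholding power.

Then use the function blockwiseModules to build the weighted co-expression network in one step.

You can also use the function plotDendroAndColors to visualize the gene dendrogram together with the module colors.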
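To show what choosing the power is about, here is a rough base-R sketch of the idea behind soft thresholding: raise the absolute correlation to a power, compute each gene's connectivity, and check how well the connectivity distribution fits a scale-free (power-law) shape. This is a simplified illustration under my own assumptions (toy data, 10 equal-width bins, a plain linear fit). The details of pickSoftThreshold itself may differ.

```r
set.seed(4)
n_samples <- 50
seed <- rnorm(n_samples)

# Toy data: one 60-gene module plus 240 background genes
module <- outer(seed, seq(0.9, 0.4, length.out = 60)) +
  matrix(rnorm(n_samples * 60, sd = 0.5), nrow = n_samples)
background <- matrix(rnorm(n_samples * 240), nrow = n_samples)
datExpr <- cbind(module, background)

# Hypothetical helper: scale-free fit summary for one soft-thresholding power
scale_free_fit <- function(datExpr, power, n_bins = 10) {
  adjacency <- abs(cor(datExpr))^power   # unsigned soft-thresholded adjacency
  diag(adjacency) <- 0
  k <- rowSums(adjacency)                # connectivity of each gene
  bins <- cut(k, breaks = n_bins)
  mean_k <- tapply(k, bins, mean)
  p_k <- as.numeric(table(bins)) / length(k)
  ok <- !is.na(mean_k) & p_k > 0
  # Linear fit of log10 p(k) against log10 k over the non-empty bins
  fit <- lm(log10(p_k[ok]) ~ log10(mean_k[ok]))
  c(power = power,
    slope = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared,
    mean_k = mean(k))
}

fit_table <- as.data.frame(t(sapply(1:10, function(p) scale_free_fit(datExpr, p))))
```

A common reading of such a table is to take the lowest power at which the fit has a negative slope and a high R-squared, while the mean connectivity has not collapsed towards zero. In the package, pickSoftThreshold reports a comparable table together with a suggested power.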

Step 6: Module diagnostics

From each module's expression matrix, compute the module eigengene, then compute the correlations between modules based on these eigengenes.

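    library(WGCNA)
    # Module eigengenes (moduleColors: the per-gene module color assignment)
    datME = moduleEigengenes(datExpr, moduleColors)$eigengenes
    # Pairwise correlations between eigengenes, rounded to 2 significant digits
    signif(cor(datME, use = "p"), 2)
    # Eigengene dissimilarity, followed by average-linkage clustering
    dissimME = (1 - t(cor(datME, method = "p"))) / 2
    hclustdatME = hclust(as.dist(dissimME), method = "average")
    # Plot the eigengene dendrogram
    par(mfrow = c(1, 1))
    plot(hclustdatME, main = "Clustering tree based on the module eigengenes")
    # Pairwise scatter plots of the eigengenes
    sizeGrWindow(8, 9)
    plotMEpairs(datME)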
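What a module eigengene is can be sketched without the package: it is the first principal component of the module's standardized expression matrix, oriented to agree with the average expression. The base-R version below runs on toy data. The helper name `module_eigengene` and the simulated module are assumptions for illustration, not the package's code.

```r
set.seed(5)
n_samples <- 50
seed <- rnorm(n_samples)

# Toy module: 40 genes that all follow the same seed profile plus noise
module_expr <- outer(seed, rep(0.8, 40)) +
  matrix(rnorm(n_samples * 40, sd = 0.6), nrow = n_samples)

# Hypothetical helper: first principal component across samples
module_eigengene <- function(expr) {
  scaled <- scale(expr)          # standardize each gene (column)
  me <- svd(scaled)$u[, 1]       # first left singular vector, one value per sample
  # Orient the eigengene so it correlates positively with mean expression
  if (cor(me, rowMeans(scaled)) < 0) me <- -me
  me
}

me <- module_eigengene(module_expr)
cor_with_seed <- cor(me, seed)
```

Because every gene in the toy module follows `seed`, the eigengene recovers that profile almost exactly, which is why one vector per module is enough to compare modules with each other.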

You can also view an expression heatmap of the genes in a specific module:

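    sizeGrWindow(8, 9)
    par(mfrow = c(3, 1), mar = c(1, 2, 4, 1))
    which.module = "turquoise"
    # Heatmap of standardized expression: rows are genes, columns are samples
    # (colorh1: the per-gene module color vector)
    plotMat(t(scale(datExpr[, colorh1 == which.module])), nrgcols = 30,
            rlabels = TRUE, clabels = TRUE, rcols = which.module,
            title = which.module)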

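If clinical traits are available, you can correlate each module with them. For example, in the breast cancer dataset I walk through on GitHub (https://github.com/jmzeng1314/my_WGCNA), it is clear that different breast cancer subtypes are correlated with different gene modules.

Step 7: Pick the important genes within a module

For example, in the breast cancer dataset I walk through on GitHub (https://github.com/jmzeng1314/my_WGCNA), I picked the Luminal subtype as the trait, together with the brown module, which is the module most significantly correlated with it, for downstream analysis.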
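Picking important genes inside a module usually combines two per-gene scores: module membership (correlation with the module eigengene) and gene significance (correlation with the trait). Below is a minimal base-R sketch on toy data. The simulated module, the trait, and the 0.8 and 0.2 cutoffs are assumptions for illustration and should be tuned for real data.

```r
set.seed(6)
n_samples <- 50
seed <- rnorm(n_samples)
trait <- 0.7 * seed + rnorm(n_samples, sd = 0.7)

# Toy module: 30 genes whose correlation with the seed falls from 0.95 to 0.3
loadings <- seq(0.95, 0.3, length.out = 30)
noise <- matrix(rnorm(n_samples * 30), nrow = n_samples)
module_expr <- outer(seed, loadings) + sweep(noise, 2, sqrt(1 - loadings^2), `*`)
colnames(module_expr) <- paste0("gene", 1:30)

# Module eigengene: first principal component, oriented along mean expression
scaled <- scale(module_expr)
eigengene <- svd(scaled)$u[, 1]
if (cor(eigengene, rowMeans(scaled)) < 0) eigengene <- -eigengene

# Module membership: correlation of each gene with the eigengene
module_membership <- cor(module_expr, eigengene)[, 1]
# Gene significance: absolute correlation of each gene with the trait
gene_significance <- abs(cor(module_expr, trait)[, 1])

# Hub candidates: high on both scores (illustrative cutoffs)
hub_candidates <- names(which(module_membership > 0.8 & gene_significance > 0.2))
```

Genes near the top of both rankings sit at the core of the module and also track the trait, which makes them the natural candidates for follow-up.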

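Step 8: Other module visualizations

These are mainly a plot of the TOM matrix and a display of the correlations between modules. Both are largely filler.

Closing remarks

The authors of the WGCNA package designed this test dataset carefully. What matters most about it is not the WGCNA workflow itself, but the principles it reveals.

I hope you can take the time to read through it carefully.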